[Bioperl-l] Bioperl-db, bioperl schema, and load_seqdatabase.pl along with del-assocs-sql.pl

Law, Annie Annie.Law at nrc-cnrc.gc.ca
Thu Mar 11 15:32:38 EST 2004


Hi Hilmar,

I would appreciate help with the following.

1. My problem is that I am now unable to find the data in the database that I
set up for it. I appear to have successfully loaded the NCBI taxonomy, GO
information, LocusLink information (using the LL_tmpl file), and then UniGene
information.
In each case the load ran with almost no problems, i.e., all of the inserts
seemed to work properly.

I would like to take clone ids (from a microarray of ESTs), which I have
matched to GenBank accession numbers from the IMAGE consortium, and trace
them to GO ids. To check whether the SQL statements I was writing were giving
me the desired result, I followed my statements through by hand. Done by
hand, I would take the GenBank accession number and find the UniGene
identifier. I would then take the UniGene identifier and find the
corresponding LocusLink id. Finally, I would take the LocusLink id and find
the GO ids (from each category of GO).

My understanding of the BioSQL schema is that, as stated, the bioentry table
is the main table. From it you can run select statements to find the other
database references a bioentry has. For example, one entry of the UniGene
database has the fields listed below, including all of the fields that
reference databases in the dbxref table. (I haven't figured out where the
other data from this entry goes, or whether it gets inserted into the
database at all.)

BX095770 --accession number
Hs.2 --clusterid
N-acetyltransferase 2 (arylamine N-acetyltransferase) --title
NAT2 --gene_symbol
8p22 --cytoband
liver --express
S --gnm_terminus
10 --locuslink
8 --chromosome
ACC=G59899 UNISTS=137181 --sts
ORG=Escherichia coli; PROTGI=16129422; PROTID=ref:NP_415980.1; PCT=24; ALN=255 --protsim
sapiens --species
Hs.2 --display_id
N-acetyltransferase 2 (arylamine N-acetyltransferase) --description
26 --size
Hs.2 --object_id
NCBI --authority
UniGene --namespace
Hs.2 --display_name
26 --scount

The select statement I wrote as the first step in going from GenBank
accession number to LocusLink id was:

SELECT dbxref.dbxref_id
FROM dbxref
WHERE dbxref.accession = 'H08278';

With the dbxref_id that I got, I was going to find the corresponding
bioentry_id (with the biodatabase_id corresponding to UniGene) and then go
back to the dbxref table to find the LocusLink dbxref entry for that UniGene
bioentry. The problem is that for each GenBank accession number I look up,
the select statement above returns 0 rows. (I don't think there is anything
wrong with the SQL statement, since I checked it with an unrelated GenBank
accession number.) This is true for all of the clones that I have followed
through.
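
To make the lookup I have in mind concrete, here is a minimal sketch in
Python using an in-memory SQLite database. The table and column names follow
my reading of the BioSQL schema (dbxref, bioentry_dbxref, bioentry,
biodatabase). Everything else is an assumption for illustration: the toy
rows, the names 'GenBank', 'UniGene' and 'LocusLink' as stored in the
database, the LocusLink id 9999, and the helper locuslink_for. It is not
what the loader actually produced.

```python
import sqlite3

# Toy stand-in for a BioSQL database; only the columns used below are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE biodatabase (biodatabase_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY,
                       biodatabase_id INTEGER, accession TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY,
                     dbname TEXT, accession TEXT, version INTEGER);
CREATE TABLE bioentry_dbxref (bioentry_id INTEGER, dbxref_id INTEGER);

-- Hypothetical rows: one UniGene cluster with a GenBank and a LocusLink dbxref.
INSERT INTO biodatabase VALUES (1, 'UniGene');
INSERT INTO bioentry VALUES (1, 1, 'Hs.478728');
INSERT INTO dbxref VALUES (1, 'GenBank', 'H08278', 1);
INSERT INTO dbxref VALUES (2, 'LocusLink', '9999', 0);
INSERT INTO bioentry_dbxref VALUES (1, 1);
INSERT INTO bioentry_dbxref VALUES (1, 2);
""")

# GenBank dbxref -> UniGene bioentry -> LocusLink dbxref, in one statement.
LOOKUP_SQL = """
SELECT be.accession, ll.accession
FROM dbxref AS gb
JOIN bioentry_dbxref AS gb_link ON gb_link.dbxref_id = gb.dbxref_id
JOIN bioentry AS be ON be.bioentry_id = gb_link.bioentry_id
JOIN biodatabase AS db ON db.biodatabase_id = be.biodatabase_id
JOIN bioentry_dbxref AS ll_link ON ll_link.bioentry_id = be.bioentry_id
JOIN dbxref AS ll ON ll.dbxref_id = ll_link.dbxref_id
WHERE gb.accession = ?
  AND db.name = 'UniGene'
  AND ll.dbname = 'LocusLink'
"""

def locuslink_for(conn, genbank_accession):
    """Return (UniGene accession, LocusLink id) pairs for a GenBank accession."""
    return conn.execute(LOOKUP_SQL, (genbank_accession,)).fetchall()

print(locuslink_for(conn, "H08278"))
```

If the first join already returns nothing for a real accession, the problem
is in how the GenBank accession is stored, not in the later joins.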

I found that this GenBank accession number does exist in the Hs.data file I
used as input. That entry is 
//
ID          Hs.478728
TITLE       Homo sapiens transcribed sequences
EXPRESS     whole brain ; total brain 
CHROMOSOME  3
STS         ACC=SHGC-77720 UNISTS=45248
SCOUNT      3
SEQUENCE    ACC=R60728.1; NID=g831423; CLONE=IMAGE:42222; END=5'; LID=263; SEQTYPE=EST
SEQUENCE    ACC=Z42053.1; NID=g564288; CLONE=c-06h12; LID=186; SEQTYPE=EST
SEQUENCE    ACC=H08278.1; NID=g873100; CLONE=IMAGE:45161; END=5'; LID=263; SEQTYPE=EST
//

Somehow this information is not getting entered in the database, or I am
trying to access the information in the wrong manner. How can I resolve this
problem? Is there some script that needs to be altered, or some option that
needs to be used?
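
One thing I noticed is that the Hs.data entry lists the accession with a
version suffix (H08278.1), whereas my query looks for H08278. I do not know
how the loader stores this, so the following is only a diagnostic sketch in
Python with an in-memory SQLite table. It assumes, purely for illustration,
that the accession was stored with the suffix attached; the toy rows and the
helper find_dbxrefs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY,
                     dbname TEXT, accession TEXT, version INTEGER);
-- Assumed case: the version suffix ended up inside the accession column.
INSERT INTO dbxref VALUES (1, 'GenBank', 'H08278.1', 0);
-- A different accession sharing the same prefix; it must not match.
INSERT INTO dbxref VALUES (2, 'GenBank', 'H082789', 0);
""")

def find_dbxrefs(conn, accession):
    """Match the bare accession, or the accession followed by '.<version>'.

    Note: '_' and '%' inside `accession` would act as LIKE wildcards;
    accessions such as H08278 contain neither.
    """
    return conn.execute(
        "SELECT dbxref_id, dbname, accession, version FROM dbxref "
        "WHERE accession = ? OR accession LIKE ? ORDER BY dbxref_id",
        (accession, accession + ".%"),
    ).fetchall()

# The exact-match query from my statement above, for comparison.
exact_count = conn.execute(
    "SELECT COUNT(*) FROM dbxref WHERE accession = ?", ("H08278",)
).fetchone()[0]

print(exact_count)
print(find_dbxrefs(conn, "H08278"))
```

If the relaxed query finds rows on the real database where the exact one
finds none, the version suffix would explain the 0 rows.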

Alternatively, I searched the bioentry table for an entry with the GenBank
accession number I am looking for; however, I am unable to find the same
bioentry_id in the bioentry_dbxref table. Sometimes I am not even able to
find the GenBank accession number in the bioentry table.
 
2. My second question, once I have hopefully sorted out the first, is this.
I am not sure how to go about accessing the GO information corresponding to
a LocusLink id. Once I have found the bioentry_id of the relevant LocusLink
bioentry, I would join the bioentry table with the bioentry relationship
table and subsequently the term table. I would like to get the GO id and the
category or ontology that the GO id is associated with, be it molecular
function, biological process, or cellular component.

It seems that most of the entries in the term table have ontology id = 1
(Gene Ontology), and only around 200 entries belong to molecular function,
biological process, and cellular component put together, out of about 16000
entries in the term table. Shouldn't the total for these three types equal
the number of entries with ontology id = 1?

How do I find out which GO category, or rather ontology (molecular function,
biological process, or cellular component), a GO identifier belongs to?

I am also not sure I understand what all 7 distinct ontology ids in the
ontology table represent (I understand the meaning of 3, 4 and 5):
1 = Gene Ontology
2 = Annotation
3 = cellular
4 = molecular
5 = biological
6 = Object Slots
7 = Relationship
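
For reference, this is roughly how I am counting terms per ontology and
looking up the ontology of a single term. It is a minimal sketch in Python
with an in-memory SQLite database. The term and ontology tables and their
ontology_id / identifier columns follow my reading of the BioSQL schema; the
rows, the placeholder identifiers such as 'GO:TOY0001', and the helper names
are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology (ontology_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE term (term_id INTEGER PRIMARY KEY,
                   identifier TEXT, ontology_id INTEGER);

-- Ontology ids as listed above; only four of the seven are needed here.
INSERT INTO ontology VALUES (1, 'Gene Ontology');
INSERT INTO ontology VALUES (3, 'cellular component');
INSERT INTO ontology VALUES (4, 'molecular function');
INSERT INTO ontology VALUES (5, 'biological process');

-- Placeholder terms, mimicking most terms sitting under ontology id 1.
INSERT INTO term VALUES (1, 'GO:TOY0001', 1);
INSERT INTO term VALUES (2, 'GO:TOY0002', 1);
INSERT INTO term VALUES (3, 'GO:TOY0003', 4);
""")

def terms_per_ontology(conn):
    """Return (ontology_id, name, number of terms), including empty ontologies."""
    return conn.execute(
        "SELECT o.ontology_id, o.name, COUNT(t.term_id) "
        "FROM ontology AS o "
        "LEFT JOIN term AS t ON t.ontology_id = o.ontology_id "
        "GROUP BY o.ontology_id, o.name "
        "ORDER BY o.ontology_id"
    ).fetchall()

def ontology_of(conn, identifier):
    """Return the ontology name a term identifier is filed under, or None."""
    row = conn.execute(
        "SELECT o.name FROM term AS t "
        "JOIN ontology AS o ON o.ontology_id = t.ontology_id "
        "WHERE t.identifier = ?",
        (identifier,),
    ).fetchone()
    return row[0] if row else None

print(terms_per_ontology(conn))
print(ontology_of(conn, "GO:TOY0003"))
```

On my database the second lookup returns 'Gene Ontology' for most GO ids,
which is why I cannot tell the three categories apart.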

3. In a previous e-mail you were kind enough to mention that I could contact
you once I had gotten to the point where I would need to update the
database. I have gotten to this point and would like to know how I can
modify del-assocs-sql.pl to be used with MySQL for the --mergeobjs option.

It seems to me that the --lookup option is unlike the --mergeobjs option,
i.e., you would not enter something like
--lookup=***.pl

As an aside, since LocusLink and UniGene take around 2 days to load, I would
like to know what happens when a user tries to access the database in the
meantime. Are there any deleterious effects? My guess is that the user will
just get the data that is currently in the database and that no harm will
happen.

Thanks very much,
Annie.


