[BioSQL-l] Special cases of protein data

"Andreas Dräger" duze at gmx.de
Wed Aug 24 04:11:06 EDT 2005


Dear BioSQL-developers,

I am currently working with BioSQL on MySQL. I tried to insert a large
number of protein records that were downloaded from the NCBI web site in
GenPept format. During the insertion (performed by BioJava) I got a number
of error messages. A closer look at the records showed that more than 1000
of the protein sequences have at least two "source" entries in their
"FEATURES" table. One such example is the NCBI record with accession number
P76519, which has as many as four "source" entries. In my opinion this
means that each of the four listed species contains exactly this protein.
That would mean that at least these one thousand proteins that I found at
NCBI belong to more than one species. The current BioSQL schema cannot
represent this case, because each bioentry row references at most one taxon
(a one-to-many relationship between the tables taxon and bioentry). To
record that the same protein belongs to n taxa, we would need to create
another table that models a many-to-many relationship between the tables
taxon and bioentry. The foreign key constraint from bioentry to taxon would
have to be removed. The result would be something like:

bioentry <--> taxon_bioentry <--> taxon

where taxon_bioentry is the extra table. This is only a first idea.
However, at the moment I cannot insert records like P76519 into the BioSQL
database. Or am I wrong, and more than one "source" entry means something
different?
I look forward to any suggestions.
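
To make the proposal concrete, here is a minimal sketch of the link table,
written in Python against an in-memory SQLite database so that it can be run
without a MySQL server. The tables are heavily simplified and are not the
real BioSQL definitions: the column sets, the helper names build_demo,
link_entry and taxa_for, and the NCBI taxon ids 1001 to 1004 are
placeholders that I made up for illustration.

```python
import sqlite3


def build_demo():
    """Create simplified taxon/bioentry tables plus the proposed link table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- Simplified stand-ins, not the real BioSQL table definitions.
        CREATE TABLE taxon (
            taxon_id      INTEGER PRIMARY KEY,
            ncbi_taxon_id INTEGER NOT NULL UNIQUE
        );
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            accession   TEXT NOT NULL UNIQUE
        );
        -- Proposed extra table: one row per (bioentry, taxon) pair.
        CREATE TABLE taxon_bioentry (
            bioentry_id INTEGER NOT NULL REFERENCES bioentry (bioentry_id),
            taxon_id    INTEGER NOT NULL REFERENCES taxon (taxon_id),
            PRIMARY KEY (bioentry_id, taxon_id)
        );
        """
    )
    return conn


def link_entry(conn, accession, ncbi_taxon_ids):
    """Insert one bioentry and link it to every taxon in ncbi_taxon_ids."""
    cur = conn.execute(
        "INSERT INTO bioentry (accession) VALUES (?)", (accession,)
    )
    bioentry_id = cur.lastrowid
    for ncbi_id in ncbi_taxon_ids:
        # Reuse the taxon row if it already exists.
        conn.execute(
            "INSERT OR IGNORE INTO taxon (ncbi_taxon_id) VALUES (?)",
            (ncbi_id,),
        )
        taxon_id = conn.execute(
            "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_id,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO taxon_bioentry (bioentry_id, taxon_id) VALUES (?, ?)",
            (bioentry_id, taxon_id),
        )
    conn.commit()
    return bioentry_id


def taxa_for(conn, accession):
    """Return the sorted NCBI taxon ids linked to the given accession."""
    rows = conn.execute(
        """
        SELECT t.ncbi_taxon_id
        FROM bioentry b
        JOIN taxon_bioentry tb ON tb.bioentry_id = b.bioentry_id
        JOIN taxon t ON t.taxon_id = tb.taxon_id
        WHERE b.accession = ?
        ORDER BY t.ncbi_taxon_id
        """,
        (accession,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo()
    # Placeholder taxon ids; the real ids would come from the "source" entries.
    link_entry(demo, "P76519", [1001, 1002, 1003, 1004])
    print(taxa_for(demo, "P76519"))
```

The composite primary key on taxon_bioentry keeps the same pair from being
stored twice, and a record with a single "source" entry simply gets one row
in the link table.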

Yours,
Andreas Dräger


