[BioSQL-l] Special cases of protein data

Marc Logghe Marc.Logghe at DEVGEN.com
Wed Aug 24 04:51:48 EDT 2005


> I am currently working with BioSQL using MySQL. I tried to 
> insert a lot of protein data which were downloaded from the 
> NCBI web page in GenPept format.
> During the insertion process (performed by BioJava) I got 
> some error messages. Looking at the sequences in detail 
> showed that I got more than 1000 protein sequences which had 
> at least two "source" entries in their "FEATURE" table. One 
> of these bad examples is given at NCBI by the accession 
> number P76519. This one has even four "source" tags. In my 
> opinion this means that every single species of the four 
> given species contains exactly this protein. This would mean 
> that there are at least these one thousand proteins that I 
> found at NCBI belonging to more than one species. This case 
> cannot be handled by the current BioSQL schema because 
> there is a one-to-many relationship between the tables

Gosh, I was not aware of that.
Indeed, the feature table definition at http://www.ncbi.nlm.nih.gov/collab/FT/ says of the source key:
"identifies the biological source of the specified span of
the sequence; this key is mandatory; more than one source
key per sequence is allowed; every entry/record will have,
as a minimum, either a single source key spanning the 
entire sequence or multiple source keys which together 
span the entire sequence."
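
To find the affected records before loading them, something along these lines might help. It is only a rough sketch in plain Python, not BioJava or BioSQL code: the function names (source_spans, covers_sequence) and the toy record are made up for illustration, and it assumes every source location is a simple "start..end" range, which real records do not guarantee.

```python
import re

# A feature key line: five spaces, the key, then a simple "start..end" location.
# Assumption: no joins, complements or fuzzy (<, >) locations.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+(\d+)\.\.(\d+)\s*$")

# Toy record, invented for illustration only (not real NCBI data).
EXAMPLE = """\
LOCUS       TOY_0001                  10 aa
FEATURES             Location/Qualifiers
     source          1..6
                     /organism="Organism A"
     source          7..10
                     /organism="Organism B"
     Protein         1..10
ORIGIN
        1 mkvlaagtws
//
"""


def source_spans(record_text):
    """Return (start, end) for every 'source' feature in one flat-file record."""
    spans = []
    in_features = False
    for line in record_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if in_features and line and not line.startswith(" "):
            break  # next top-level section (e.g. ORIGIN) ends the feature table
        if in_features:
            match = FEATURE_KEY.match(line)
            if match and match.group(1) == "source":
                spans.append((int(match.group(2)), int(match.group(3))))
    return spans


def covers_sequence(spans, length):
    """True if the spans, taken together, cover positions 1..length."""
    covered_to = 0
    for start, end in sorted(spans):
        if start > covered_to + 1:
            return False  # gap before this span
        covered_to = max(covered_to, end)
    return covered_to >= length


if __name__ == "__main__":
    spans = source_spans(EXAMPLE)
    print(len(spans), "source features:", spans)
    print("covers whole sequence:", covers_sequence(spans, 10))
```

Any record for which source_spans returns more than one entry is a candidate for the problem described above and could be set aside before the insert.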

Marc


