[BioSQL-l] Special cases of protein data
Marc Logghe
Marc.Logghe at DEVGEN.com
Wed Aug 24 04:51:48 EDT 2005
> I am currently working with BioSQL using MySQL. I tried to
> insert a lot of protein data which were downloaded from the
> NCBI web page in GenPept format.
> During the insertion process (performed by BioJava) I got
> some error messages. Looking at the sequences in detail
> showed that I got more than 1000 protein sequences which had
> at least two "source" entries in their "FEATURE" table. One
> of these bad examples is given at NCBI by the accession
> number P76519. This one has even four "source" tags. In my
> opinion this means that every single species of the four
> given species contains exactly this protein. This would mean
> that there are at least these one thousand proteins that I
> found at NCBI belonging to more than one species. This case
> cannot be considered with the current BioSQL scheme because
> there is a one to many relationship between the tables
Gosh, I was not aware of that.
Indeed, the feature table definition at http://www.ncbi.nlm.nih.gov/collab/FT/ says this about the source key:
"identifies the biological source of the specified span of
the sequence; this key is mandatory; more than one source
key per sequence is allowed; every entry/record will have,
as a minimum, either a single source key spanning the
entire sequence or multiple source keys which together
span the entire sequence."
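
For anyone who wants to check their own GenPept files for such records before loading them, something along these lines might help. It is only a rough sketch in plain Python, not what BioJava or BioSQL does. The sample record is hand-written and hypothetical (not real NCBI data), the function name source_spans is made up, and the parser assumes the usual flat-file layout where a feature key is indented by five spaces.

```python
import re

# Hypothetical, hand-written GenPept-style fragment (NOT real NCBI data).
SAMPLE = """\
LOCUS       EXAMPLE1                 120 aa            linear   BCT 01-JAN-2005
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Organism A"
     source          61..120
                     /organism="Organism B"
     Protein         1..120
                     /product="example protein"
ORIGIN
//
"""

# Assumption: a feature key line has exactly five leading spaces,
# then the key, then whitespace, then the location.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+(\S+)")


def source_spans(record_text):
    """Return the locations of all 'source' features in one flat-file record."""
    spans = []
    in_features = False
    for line in record_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if in_features and line and not line.startswith(" "):
            break  # next top-level section (e.g. ORIGIN) ends the feature table
        if in_features:
            match = FEATURE_KEY.match(line)
            if match and match.group(1) == "source":
                spans.append(match.group(2))
    return spans


if __name__ == "__main__":
    spans = source_spans(SAMPLE)
    if len(spans) > 1:
        print("multiple source features:", spans)
```

A record for which source_spans returns more than one location is one of the multi-source cases described above, where each source covers only part of the sequence.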
Marc