[Biopython-dev] [Bug 2681] BioSQL: record annotations enhancements

Mon Nov 24 16:40:24 EST 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2681

------- Comment #5 from cymon.cox at gmail.com  2008-11-24 16:40 EST -------
(In reply to comment #2)
> (In reply to comment #0)
> > Swissprot, fasta, and EMBL SeqRecords dont have a gi annotation, retrieved
> > DBSeqRecords do. Loader uses the 'record_id' (line 522) as the identifier in
> > bioentry, if the gi annotation is missing, which is pulled as the gi
> > annotation.
> 
> There probably is something not quite right here.  Are you talking about the
> bioentry.identifier entry in the database?  Perhaps an explicit example might
> help.  As an aside, I think "gi" (GeneIndex used by NCBI) might be better
> stored in the record.dbxrefs, but that could be a parser change...

The "gi" annotation of a parsed GenBank record refers to this GenInfo
Identifier:

From NCBI: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#GInB
"""
"GenInfo Identifier" sequence identification number, in this case, for the
nucleotide sequence. If a sequence changes in any way, a new GI number will be
assigned. GI sequence identifiers run parallel to the new accession.version
system of sequence identifiers. """

This is stored in bioentry.identifier. However, "gi" annotations are not
present in Swiss-Prot, FASTA, and EMBL records; instead, the following branch
loads record.id into the identifier slot:

Loader.py:
 519         if "gi" in record.annotations :
 520             identifier = record.annotations["gi"]
 521         else :
 522             identifier = record.id

But of course, record.id is not the "gi", so perhaps bioentry.identifier
should be left NULL when the "gi" number is missing. Alternatively, we might
consider calling the DBSeqRecord attribute "identifier" rather than "gi".

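A minimal sketch of the first option (leave the identifier empty when there is
no "gi"), assuming a stand-in record class rather than a real SeqRecord. The
names FakeRecord and choose_identifier, and the GI value "12345", are
hypothetical and only illustrate the idea; this is not the Loader.py code:

```python
class FakeRecord:
    """Hypothetical stand-in for a SeqRecord: only id and annotations."""

    def __init__(self, id, annotations=None):
        self.id = id
        self.annotations = annotations or {}


def choose_identifier(record):
    """Return the "gi" annotation if present, otherwise None.

    None would be stored as NULL in bioentry.identifier, so record.id
    is never passed off as a GI number.
    """
    return record.annotations.get("gi")


# "12345" is a made-up placeholder, not a real GI number.
genbank_like = FakeRecord("X56734.1", {"gi": "12345"})
embl_like = FakeRecord("X56734.1")  # no "gi" annotation, as with EMBL input

print(choose_identifier(genbank_like))
print(choose_identifier(embl_like))
```

With this behaviour a record loaded from EMBL would come back with no "gi"
annotation at all, rather than with its record.id in that slot.
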
Here's an example of an EMBL file where the record.id becomes the "gi":

Testing loading from embl format file EMBL/TRBG361.embl
 - AAACAAACCAAATATGGAT...AAA [jfp/7BKv3jTJAU/4jVMrSftEq20] len 1859, X56734.1
 - Retrieving by name/display_id 'X56734', 
old annos diff: set([])
new annos diff: set(['dates', 'ncbi_taxid', 'gi'])

OLD:
taxonomy = ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'core
eudicotyledons', 'rosids', 'eurosids I', 'Fabales', 'Fabaceae',
'Papilionoideae', 'Trifolieae', 'Trifolium']
references = [<Bio.SeqFeature.Reference instance at 0x8e9302c>,
<Bio.SeqFeature.Reference instance at 0x8e931ac>]
accessions = ['X56734', 'S46826']
data_file_division = PLN
organism = Trifolium repens (white clover)
sequence_version = 1
NEW:
dates = ['24-NOV-2008']
ncbi_taxid = 3899
references = [<Bio.SeqFeature.Reference instance at 0x8eced6c>,
<Bio.SeqFeature.Reference instance at 0x8ecedcc>]
accessions = ['X56734', 'S46826']
data_file_division = PLN
taxonomy = ['Trifolium repens (white clover)']
gi = X56734.1
organism = Trifolium repens (white clover)
sequence_version = ['1']
ncbi_taxid: 3899
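
The two "annos diff" lines above read like set differences over the annotation
keys. A minimal sketch of that comparison, using the keys from the OLD and NEW
listings; the variable names are illustrative, and this is an assumption about
how such a diff could be computed, not the actual test script:

```python
# Annotation keys of the record as parsed from the EMBL file (OLD).
old_keys = {
    "taxonomy", "references", "accessions",
    "data_file_division", "organism", "sequence_version",
}

# Annotation keys of the record as retrieved from the database (NEW).
new_keys = {
    "dates", "ncbi_taxid", "references", "accessions",
    "data_file_division", "taxonomy", "gi", "organism",
    "sequence_version",
}

only_in_old = old_keys - new_keys  # annotations lost on the round trip
only_in_new = new_keys - old_keys  # annotations added on the round trip

print(sorted(only_in_old))
print(sorted(only_in_new))
```

A key-only comparison like this does not show value changes, such as
sequence_version going from 1 to ['1'] or taxonomy being replaced by the
organism name.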

C.
