[BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
Eric Gibert
ericgibert at yahoo.fr
Fri Nov 9 13:35:12 UTC 2007
Dear Hilmar,
Thank you for this reply. Now I would like to know where BioPython has
stored "SOURCE" or "ORGANISM" in BioSQL; I cannot find them.
And, supposing they are stored somewhere, how can I get them back?
Thank you
Eric
-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net]
Sent: Friday, November 09, 2007 4:28 AM
To: Eric Gibert
Cc: biopython at lists.open-bio.org; BioJava
Subject: Re: [BioPython] error on insert new sequences from GenBank: no
annotations saved in BioSQL database
Maybe we need to hold some mini-hackathon to make the different
toolkits compatible in how they map annotation to the schema.
Obviously I don't know whether you have the latest Biojava setup
here, but I'll just comment on how BioPerl/Bioperl-db would map these:
'ORIGIN' - if I'm not mistaken this is only a token that introduces
the actual sequence. I'm not sure what Biojava is storing as value here.
'DIVISION' - this maps to column division in table bioentry (though I
agree that, if perfectly following the weak typing principle, this
should be a tag/value association, but at present it's still an actual
column)
'genbank_accessions' - secondary accession numbers indeed go into the
qualifier value table. The primary accession maps to column accession
in table bioentry
'TITLE' - this is part of a publication reference, and should map to
column title in table reference (which it does in bioperl-db)
'cross_references' - not sure where these would be coming from in
GenBank format; for EMBL this will map to the dbxref table
'data_file_division' - not sure what this is (same as DIVISION?)
'VERSION' - in BioPerl we parse this apart into a version for the
accession (which is column version in table bioentry) and the GI
number, which maps to column identifier in table bioentry
'references' - these map to table reference (and bioentry_reference
for association with the bioentry)
'KEYWORDS' - indeed these map to bioentry_qualifier_value
'GI' - maps to column identifier in table bioentry
'SIZE' - not sure what size that is. If it is the length of the
sequence, it should (and in BioPerl/bioperl-db does) map to column
length in table biosequence
'DEFINITION' - maps to column description in table bioentry
'REFERENCE' - should be the same as for 'references'
'MDAT' - not sure what this is
'ORGANISM' - this is the organism and maps to the table taxon (and
taxon_name), with a foreign key in bioentry pointing to the taxon
'JOURNAL' - this is part of a reference, see 'references'
'ACCESSION' - the primary accession, maps to column accession in
table bioentry
'LOCUS' - in the file itself this is an entire line consisting of
multiple fields; BioPerl/bioperl-db maps the locus name (the first
token after the literal token LOCUS) to column name in table bioentry
'SOURCE' - this is the organism, see 'ORGANISM'
'PUBMED' - this is part of a literature reference, and maps to a
foreign key in the reference table (reference.dbxref) to a dbxref
entry with PUBMED or PMID as the database and the pubmed ID as the
accession
'AUTHORS' - part of a literature reference, maps to column authors in
table reference
'TYPE' - not sure what this is. If it's the alphabet, it maps to
table biosequence, column alphabet
'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,
though there have been plans to make it a column in table biosequence.
Note that Biojava could in fact be storing things this way too, and
only upon retrieval representing them in the way you are seeing them.
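As a rough illustration of the ORGANISM mapping described above (taxon and
taxon_name, reached through a foreign key in bioentry), here is a minimal
sketch in Python using an in-memory SQLite database. It is not what any of
the toolkits actually runs. The tables are a simplified subset of the BioSQL
schema, the helper organism_for is hypothetical, the sample row (accession
DEMO0001) is invented, and filtering on name_class = 'scientific name' is an
assumption about how the taxon names were loaded.

```python
import sqlite3

# Simplified subset of the BioSQL tables named above; the real schema has
# more tables, columns and constraints. All sample values are invented.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER
);
CREATE TABLE taxon_name (
    taxon_id INTEGER REFERENCES taxon(taxon_id),
    name TEXT,
    name_class TEXT
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    taxon_id INTEGER REFERENCES taxon(taxon_id),
    name TEXT,
    accession TEXT,
    identifier TEXT,
    division TEXT,
    description TEXT,
    version INTEGER
);
"""


def organism_for(conn, accession, version):
    """Hypothetical helper: follow bioentry.taxon_id to taxon_name.

    Returns the scientific name for the entry, or None if nothing matches.
    """
    row = conn.execute(
        """
        SELECT tn.name
        FROM bioentry AS be
        JOIN taxon_name AS tn ON tn.taxon_id = be.taxon_id
        WHERE be.accession = ?
          AND be.version = ?
          AND tn.name_class = 'scientific name'  -- assumed name_class value
        """,
        (accession, version),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO taxon VALUES (1, 9606)")
conn.execute(
    "INSERT INTO taxon_name VALUES (1, 'Homo sapiens', 'scientific name')"
)
conn.execute(
    "INSERT INTO bioentry VALUES "
    "(1, 1, 'DEMO', 'DEMO0001', '12345', 'PRI', 'made-up entry', 1)"
)

print(organism_for(conn, "DEMO0001", 1))
```

Against a real BioSQL database the same join should apply, but the
connection and the parameter placeholder style depend on the database
driver in use.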
Hth,
-hilmar
On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:
> Dear all,
>
> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted
> previously by my BioJava application, I have:
>
> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>
> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',
> 'genbank_accessions', 'TITLE', 'cross_references',
> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',
> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',
> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',
> 'CIRCULAR']
>
> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
> Debug on Seq: EF631597.1 = ['cross_references', 'dates',
> 'references', 'gi', 'data_file_division']
>
>
> When I look in the table bioentry_qualifier_value, I find:
>
> * 20 records for a Sequence imported by BioJava
> * only 1 for a Sequence inserted by BioPython: the date, which
> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>
> Quite a few annotations missing, no?
>
> Any idea?
>
> Eric
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================