[BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database

Fri Nov 9 13:35:12 UTC 2007

Dear Hilmar,

Thank you for this reply. Now I would like to know where BioPython has
stored "SOURCE" or "ORGANISM" in BioSQL. I cannot find them.

Then, supposing they are somewhere, how can I get them back?

Thank you

Eric

-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net] 
Sent: Friday, November 09, 2007 4:28 AM
To: Eric Gibert
Cc: biopython at lists.open-bio.org; BioJava
Subject: Re: [BioPython] error on insert new sequences from GenBank: no
annotations saved in BioSQL database

Maybe we need to hold a mini-hackathon to make the different  
toolkits compatible in how they map annotation to the schema.  
Obviously I don't know whether you have the latest Biojava setup  
here, but I'll just comment on how BioPerl/Bioperl-db would map this:

'ORIGIN' - if I'm not mistaken this is only a token that introduces  
the actual sequence. I'm not sure what Biojava is storing as value here.

'DIVISION' - this maps to column division in table bioentry (though I  
agree that, following the weak typing principle strictly, this  
should be a tag/value association, but at present it's still an actual  
column)

'genbank_accessions' - secondary accession numbers indeed go into the  
qualifier value table. The primary accession maps to column accession  
in table bioentry

'TITLE' - this is part of a publication reference, and should map to  
column title in table reference (which it does in bioperl-db)

'cross_references' - not sure where these would be coming from in  
GenBank format; for EMBL this will map to the dbxref table

'data_file_division' - not sure what this is (same as DIVISION?)

'VERSION' - in BioPerl we parse this apart into a version for the  
accession (which is column version in table bioentry) and the GI  
number, which maps to column identifier in table bioentry

'references' - these map to table reference (and bioentry_reference  
for association with the bioentry)

'KEYWORDS' - indeed these map to bioentry_qualifier_value

'GI' - maps to column identifier in table bioentry

'SIZE' - not sure what size that is. If it is the length of the  
sequence, it should (and in BioPerl/bioperl-db does) map to column  
length in table biosequence

'DEFINITION' - maps to column description in table bioentry

'REFERENCE' - should be the same as for 'references'

'MDAT' - not sure what this is

'ORGANISM' - this is the organism and maps to the table taxon (and  
taxon_name), with a foreign key in bioentry pointing to the taxon

'JOURNAL' - this is part of a reference, see 'references'

'ACCESSION' - the primary accession, maps to column accession in  
table bioentry

'LOCUS' - in the file itself this is an entire line consisting of  
multiple fields; BioPerl/bioperl-db maps the locus name (the first  
token after the literal token LOCUS) to column name in table bioentry

'SOURCE' - this is the organism, see 'ORGANISM'

'PUBMED' - this is part of a literature reference, and maps to a  
foreign key in the reference table (reference.dbxref) to a dbxref  
entry with PUBMED or PMID as the database and the pubmed ID as the  
accession

'AUTHORS' - part of a literature reference, maps to column authors in  
table reference

'TYPE' - not sure what this is. If it's the alphabet, it maps to  
table biosequence, column alphabet

'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,  
though there have been plans to make it a column in table biosequence.

Note that Biojava may in fact store it this way too, but on  
retrieval represent it in the form you are seeing.
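
As a rough sketch of the 'ORGANISM' mapping above (bioentry holds a foreign key to taxon, and the name lives in taxon_name), the organism could be read back with a join like the one below. This is only an illustration under stated assumptions: the three tables are a cut-down subset of the BioSQL schema, the name_class value 'scientific name' is assumed, the helper names make_demo_db and organism_for_accession are hypothetical, and the rows are made-up placeholders run against an in-memory SQLite database rather than a real BioSQL instance.

```python
import sqlite3

# Cut-down subset of three BioSQL tables. The real schema has more
# columns and constraints; only the ones needed for the join are kept.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER REFERENCES taxon(taxon_id),
    name        TEXT NOT NULL,
    accession   TEXT NOT NULL,
    version     INTEGER NOT NULL
);
"""


def make_demo_db():
    """Build an in-memory database with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (1, NULL)")
    conn.execute(
        "INSERT INTO taxon_name VALUES (1, 'Example organism', 'scientific name')"
    )
    # One entry linked to a taxon, one entry with no taxon at all.
    conn.execute("INSERT INTO bioentry VALUES (1, 1, 'DEMO1', 'X00001', 1)")
    conn.execute("INSERT INTO bioentry VALUES (2, NULL, 'DEMO2', 'X00002', 1)")
    return conn


def organism_for_accession(conn, accession):
    """Return the organism name for an accession, or None if not found."""
    row = conn.execute(
        "SELECT tn.name FROM bioentry b "
        "JOIN taxon_name tn ON tn.taxon_id = b.taxon_id "
        "WHERE b.accession = ? AND tn.name_class = 'scientific name'",
        (accession,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = make_demo_db()
    print(organism_for_accession(conn, "X00001"))
```

If a toolkit does not fill bioentry.taxon_id when loading a record, a lookup of this kind returns nothing for that entry, which would be consistent with the organism not showing up on retrieval.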

Hth,

	-hilmar

On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:

> Dear all,
>
> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted  
> previously by my BioJava application, I have:
>
> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>
> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',  
> 'genbank_accessions', 'TITLE', 'cross_references',  
> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',  
> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',  
> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',  
> 'CIRCULAR']
>
> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
> Debug on Seq: EF631597.1 =  ['cross_references', 'dates',  
> 'references', 'gi', 'data_file_division']
>
>
> When I look in the table bioentry_qualifier_value, I find:
>
> * 20 records for a sequence imported by BioJava
> * only 1 for a sequence inserted by BioPython: the date, which  
> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>
> Quite a few annotations are missing, no?
>
> Any idea?
>
> Eric
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================