[BioSQL-l] Fwd: error on insert new sequences from GenBank: no annotations saved in BioSQL database

Mon Mar 3 03:38:59 UTC 2008

FYI, I used this to start a page on the recommended mapping of  
sequence annotation to BioSQL:

http://www.biosql.org/wiki/Annotation_Mapping

Obviously, this is very rudimentary, but everyone is welcome to add  
to it or comment with further questions. Also, one of the most  
important questions, namely a consistent vocabulary for annotation  
(qualifier) tags, isn't mentioned there (yet).

	-hilmar

Begin forwarded message:

> From: Hilmar Lapp <hlapp at gmx.net>
> Date: November 8, 2007 3:28:19 PM EST
> To: Eric Gibert <ericgibert at yahoo.fr>
> Cc: biopython at lists.open-bio.org, BioJava <biojava-l at biojava.org>
> Subject: Re: [Biojava-l] [BioPython] error on insert new sequences  
> from GenBank: no annotations saved in BioSQL database
>
> Maybe we need to hold a mini-hackathon to make the different
> toolkits compatible in how they map annotation to the schema.
> Obviously I don't know whether you have the latest Biojava setup
> here, but I'll just comment on how BioPerl/bioperl-db would map this:
>
> 'ORIGIN' - if I'm not mistaken, this is only a token that introduces
> the actual sequence. I'm not sure what Biojava is storing as the
> value here.
>
> 'DIVISION' - this maps to column division in table bioentry (though I
> agree that, strictly following the weak typing principle, this
> should be a tag/value association; at present it's still an actual
> column)
>
> 'genbank_accessions' - secondary accession numbers indeed go into the
> qualifier value table. The primary accession maps to column accession
> in table bioentry
>
> 'TITLE' - this is part of a publication reference, and should map to
> column title in table reference (which it does in bioperl-db)
>
> 'cross_references' - not sure where these would be coming from in
> GenBank format; for EMBL this will map to the dbxref table
>
> 'data_file_division' - not sure what this is (same as DIVISION?)
>
> 'VERSION' - in BioPerl we parse this apart into a version for the
> accession (which is column version in table bioentry) and the GI
> number, which maps to column identifier in table bioentry
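
The VERSION split described above could be sketched roughly as follows. This is only an illustration, not the BioPerl code: it assumes the usual "VERSION  accession.version  GI:number" layout, and the function name and the GI value in the example call are made up.

```python
def split_version_line(line):
    """Split a GenBank VERSION line into (accession, version, gi).

    Hypothetical helper: accession and version would go to columns
    accession and version in table bioentry, the GI number to column
    identifier in table bioentry.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line: %r" % (line,))
    # "AJ459190.1" -> accession "AJ459190", version 1
    accession, _, version = tokens[1].partition(".")
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = token[len("GI:"):]
    return accession, (int(version) if version else None), gi


# The GI number here is a placeholder, not the real one for this entry.
print(split_version_line("VERSION     AJ459190.1  GI:12345"))
```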
>
> 'references' - these map to table reference (and bioentry_reference
> for association with the bioentry)
>
> 'KEYWORDS' - indeed these map to bioentry_qualifier_value
>
> 'GI' - maps to column identifier in table bioentry
>
> 'SIZE' - not sure what size that is. If it is the length of the
> sequence, it should (and in BioPerl/bioperl-db does) map to column
> length in table biosequence
>
> 'DEFINITION' - maps to column description in table bioentry
>
> 'REFERENCE' - should be the same as for 'references'
>
> 'MDAT' - not sure what this is
>
> 'ORGANISM' - this is the organism and maps to the table taxon (and
> taxon_name), with a foreign key in bioentry pointing to the taxon
>
> 'JOURNAL' - this is part of a reference, see 'references'
>
> 'ACCESSION' - the primary accession, maps to column accession in
> table bioentry
>
> 'LOCUS' - in the file itself this is an entire line consisting of
> multiple fields; BioPerl/bioperl-db maps the locus name (the first
> token after the literal token LOCUS) to column name in table bioentry
>
> 'SOURCE' - this is the organism, see 'ORGANISM'
>
> 'PUBMED' - this is part of a literature reference, and maps to a
> foreign key in the reference table (reference.dbxref) to a dbxref
> entry with PUBMED or PMID as the database and the pubmed ID as the
> accession
>
> 'AUTHORS' - part of a literature reference, maps to column authors in
> table reference
>
> 'TYPE' - not sure what this is. If it's the alphabet, it maps to
> table biosequence, column alphabet
>
> 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,
> though there have been plans to make it a column in table biosequence.
>
> Note that this could in fact be the way Biojava stores it too, and
> that it merely represents it upon retrieval in the way you are seeing it.
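
For quick reference, the mapping listed above can be collected into a small lookup table. This is a sketch that only restates the message: the names ANNOTATION_MAPPING and where_stored are invented for illustration, a column of None means only the table was named, and the keys that could not be identified (ORIGIN, data_file_division, MDAT) are deliberately left out.

```python
# (table, column) per annotation key, as described in the message above.
ANNOTATION_MAPPING = {
    "DIVISION": ("bioentry", "division"),
    "ACCESSION": ("bioentry", "accession"),
    "VERSION": ("bioentry", "version"),       # the GI part goes to identifier
    "GI": ("bioentry", "identifier"),
    "DEFINITION": ("bioentry", "description"),
    "LOCUS": ("bioentry", "name"),            # locus name only
    "SIZE": ("biosequence", "length"),        # if it is the sequence length
    "TYPE": ("biosequence", "alphabet"),      # if it is the alphabet
    "TITLE": ("reference", "title"),
    "AUTHORS": ("reference", "authors"),
    "JOURNAL": ("reference", None),
    "references": ("reference", None),        # plus bioentry_reference
    "REFERENCE": ("reference", None),
    "PUBMED": ("dbxref", "accession"),        # linked from table reference
    "cross_references": ("dbxref", None),     # for EMBL
    "ORGANISM": ("taxon", None),              # plus taxon_name
    "SOURCE": ("taxon", None),
    "genbank_accessions": ("bioentry_qualifier_value", None),
    "KEYWORDS": ("bioentry_qualifier_value", None),
    "CIRCULAR": ("bioentry_qualifier_value", None),
}


def where_stored(key):
    """Return 'table.column', 'table', or 'unknown' for an annotation key."""
    if key not in ANNOTATION_MAPPING:
        return "unknown"
    table, column = ANNOTATION_MAPPING[key]
    return table if column is None else "%s.%s" % (table, column)


print(where_stored("DEFINITION"))
```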
>
> Hth,
>
> 	-hilmar
>
> On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:
>
>> Dear all,
>>
>> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted
>> previously by my BioJava application, I have:
>>
>> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>>
>> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',
>> 'genbank_accessions', 'TITLE', 'cross_references',
>> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',
>> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',
>> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',
>> 'CIRCULAR']
>>
>> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
>> Debug on Seq: EF631597.1 =  ['cross_references', 'dates',
>> 'references', 'gi', 'data_file_division']
>>
>>
>> When I look in the table bioentry_qualifier_value, I find:
>>
>> * 20 records for a Sequence imported by BioJava
>> * only 1 for a Sequence inserted by BioPython: the date, which
>> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>>
>> Quite a few annotations missing, no?
>>
>> Any idea?
>>
>> Eric
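
The per-sequence check on bioentry_qualifier_value described above could be scripted along these lines. This is a sketch under stated assumptions: it uses an in-memory SQLite database with a heavily cut-down version of the two tables (the real BioSQL tables have more columns), and the rows are toy data, not the actual records from either toolkit.

```python
import sqlite3


def count_qualifier_values(conn):
    """Return [(accession, number of bioentry_qualifier_value rows), ...]."""
    cur = conn.execute(
        "SELECT b.accession, COUNT(q.bioentry_id) "
        "FROM bioentry b "
        "LEFT JOIN bioentry_qualifier_value q "
        "  ON q.bioentry_id = b.bioentry_id "
        "GROUP BY b.bioentry_id, b.accession "
        "ORDER BY b.accession"
    )
    return cur.fetchall()


def make_toy_db():
    """Build a simplified, in-memory stand-in for a BioSQL database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY, accession TEXT);
        CREATE TABLE bioentry_qualifier_value (
            bioentry_id INTEGER, term_id INTEGER, value TEXT);
    """)
    conn.executemany("INSERT INTO bioentry VALUES (?, ?)",
                     [(1, "AJ459190"), (2, "EF631597")])
    # Toy rows only: three qualifier values for one entry, one for the other.
    conn.executemany("INSERT INTO bioentry_qualifier_value VALUES (?, ?, ?)",
                     [(1, 1, "toy value a"), (1, 2, "toy value b"),
                      (1, 3, "toy value c"), (2, 4, "toy date")])
    return conn


for accession, n in count_qualifier_values(make_toy_db()):
    print(accession, n)
```

The LEFT JOIN keeps entries that have no qualifier rows at all, so a sequence loaded without any annotation shows up with a count of 0 instead of disappearing from the result.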
>>
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
