[BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database

Hilmar Lapp hlapp at gmx.net
Sat Nov 10 20:38:17 UTC 2007


Just a few comments below, specifically where no rows would in fact  
be what I expect:

On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote:

> [...]
> --------  For you information, I went thru the tables of my BioSQL  
> database:
> [...]
> 1) table bioentry: all column populated except for 'taxon_id' which  
> is NULL
> (maybe I need an extra call for populating the 'taxon' table before?)

Bioperl-db will try to look up (or create if necessary) the taxon
from the taxon information attached to the sequence, but for BioPerl
we actually recommend pre-loading the database with the NCBI
taxonomy, which can be done comfortably with the script
load_ncbi_taxonomy.pl that comes with BioSQL.

>
> 2) table bioentry_dbxref: no data inserted (always empty, even with  
> BioJava)

This would mean that the sequence(s) have no dbxrefs. Note that for  
GenBank sequences that would be expected, since unfortunately, and  
unlike EMBL format, GenBank puts the dbxrefs into the feature table.

> 3) table bioentry_qualifier_value:
>
> One entry only, for the 'term_id' = 149, rank = 1, and value =
> '07-JUL-2005' or other 'DD-MMM-YYYY' dates (see my remarks below)

Below you say that your term table is empty, so I don't know how you
can have a value here at all.

> [...]
> 5) table bioentry_relationships: no entry found (always empty, even
> with BioJava)

If you load sequences, they won't have direct relationships to other  
sequences (except dbxrefs, but those are rather 'pointers' and are  
stored in their own table).

In Bioperl-db, this table is used only if you load sequence clusters  
through Bio::Cluster objects (such as UniGene).

> [...]
> 7) table comment: no entry found (always empty, even with BioJava)

Again, this is expected with GenBank. AFAIK the GenBank format doesn't
allow for comments at the level of the sequence. You would (i.e.,
should) find entries here if you load UniProt entries.

> 8) table dbxref: some records are generated, for dbname 'PUBMED'  
> and 'Taxon'
> with the correct value

Taxon obviously isn't really a dbxref, but rather a taxon (and hence  
should go into that table).

> [...]
> 9) table dbxref_qualifier_value: (always empty, even with BioJava)

That's almost expected. There are rather few cases where dbxrefs have
additional attributes that the language can parse out from a source
(and then map to the schema).

> [...]
> 10) table location: all locations loaded correctly, note that  
> 'term_id' and
> 'dbxref_id' remain NULL for these seq but I have value for other seq.

Theoretically, the term_id should point to the term giving the type  
of the location. If you (or Biopython) are only dealing with simple  
('normal') locations, then it's not needed.

The dbxref_id gives the reference to the remote sequence if the
location for a feature refers to a different sequence than the
feature itself does (so-called 'remote locations'). If the sequences
you loaded don't have such locations, or if Biopython doesn't handle
such locations, then this would be expected to be empty.

> 11) table location_qualifier_value: always empty, even with BioJava

This is expected if Biopython doesn't support fuzzy locations, or if  
none of the feature locations that you loaded are fuzzy.
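
To make the two cases above concrete (remote locations for 10, fuzzy
locations for 11), here is a minimal sketch in plain Python that flags
them in GenBank-style location strings. The helper name
classify_location and the two regular expressions are my own
assumptions for illustration, not part of Biopython or BioSQL, and they
only cover the common forms (an accession prefix such as
"J00194.1:", the "<" and ">" markers, and the single-dot "102.110"
form).

```python
import re

# Hypothetical patterns, for illustration only.
# A remote location is prefixed with "<accession>[.<version>]:".
REMOTE = re.compile(r"[A-Za-z][A-Za-z0-9_]*(?:\.\d+)?:")
# Fuzzy positions: "<" / ">" boundaries, or a single dot between digits.
FUZZY = re.compile(r"[<>]|\d\.\d")


def classify_location(loc):
    """Return flags telling whether a location string is remote and/or fuzzy."""
    is_remote = bool(REMOTE.search(loc))
    # Drop accession prefixes first so a version suffix such as ".1"
    # is not mistaken for a fuzzy single-dot position.
    local_part = REMOTE.sub("", loc)
    return {"remote": is_remote, "fuzzy": bool(FUZZY.search(local_part))}
```

If every location you loaded comes back with both flags false, NULL
dbxref_id values and an empty location_qualifier_value table are what
I would expect.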

> [...]
> 13) Table reference: entries correct, note 'dbxref_id' remains NULL
> for these seq but I have value for other seq.

It should point to the PubMed ID for the reference, but only if there
was one.

> 14) table seqfeature: entries are there (same as in table 'location').
> FYI:'display_name is always NULL.

GenBank doesn't give names to features (and I think neither does
EMBL), so this is expected.

> 15) table seqfeature_dbxref: always empty, even with BioJava

That likely has more to do with your language object model than with
anything else. dbxref annotation for features is in tag/value pairs,
just like any other annotation, so your language (Biopython in this
case) will have to do a lot of interpretation to tease out the
semantics behind each tag name and based on that decide what to do
with the value. Indeed, by default we don't even do this in BioPerl.
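
As an illustration of the kind of interpretation meant here, the
following is a minimal sketch in plain Python. It assumes the feature
qualifiers are available as (tag, value) pairs and that dbxrefs use the
GenBank "db_xref" tag with a "dbname:accession" value. The function
name interpret_qualifiers and the returned keys are hypothetical; this
is not how Biopython, BioJava or BioPerl actually do it.

```python
def interpret_qualifiers(qualifiers):
    """Sort feature (tag, value) pairs by what they mean for the schema.

    Returns a dict with:
      "dbxrefs"   - (dbname, accession) pairs, candidates for seqfeature_dbxref
      "taxon_ids" - accessions from "taxon:" values, which identify the
                    organism rather than an external database record
      "other"     - every remaining pair, kept as plain annotation
    """
    result = {"dbxrefs": [], "taxon_ids": [], "other": []}
    for tag, value in qualifiers:
        if tag == "db_xref" and ":" in value:
            # Split only on the first colon; accessions may contain more.
            dbname, accession = value.split(":", 1)
            if dbname.lower() == "taxon":
                result["taxon_ids"].append(accession)
            else:
                result["dbxrefs"].append((dbname, accession))
        else:
            result["other"].append((tag, value))
    return result
```

Without a step like this, the db_xref values simply end up as ordinary
tag/value annotation, and seqfeature_dbxref stays empty.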

> [...]
> 17) table seqfeature_relationship: always empty, even with BioJava

GenBank (and EMBL) feature tables are flat, not hierarchical, so this  
is expected.

> 18) table taxon: always empty, even with BioJava)

This is where the organism should go.

> 19) table taxon_name: I have one but not from this test (I tried to  
> tinker a
> little bit with taxon but stopped)

It's odd that you can have an entry in taxon_name w/o a
corresponding one in taxon. Do you have foreign key checks disabled?

> 20) table term: always empty, even with BioJava

That's strange, since you say you do have rows in  
bioentry_qualifier_value, which has an enforced foreign key to term.  
Did you disable the foreign key checks?
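
To show why I am asking about foreign key checks, here is a minimal
sketch using Python's sqlite3 module. SQLite and the two cut-down
tables are stand-ins I chose for illustration (I don't know which
RDBMS you are running, and the real BioSQL tables have more columns);
the function name try_orphan_insert is hypothetical.

```python
import sqlite3


def try_orphan_insert(enforce):
    """Insert a qualifier row whose term_id has no row in term.

    Returns True if the database accepted the orphan row, False if the
    foreign key constraint rejected it.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = %s" % ("ON" if enforce else "OFF"))
    # Simplified stand-ins for the BioSQL tables (assumption).
    conn.execute("CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE bioentry_qualifier_value ("
        " bioentry_id INTEGER,"
        " term_id INTEGER NOT NULL REFERENCES term(term_id),"
        " value TEXT,"
        " rank INTEGER)"
    )
    try:
        conn.execute(
            "INSERT INTO bioentry_qualifier_value VALUES (1, 149, '07-JUL-2005', 1)"
        )
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()
```

With enforcement off, the row referring to term_id 149 goes in even
though term is empty, which matches what you describe. With
enforcement on, the same insert is rejected.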

> 21) table term_dbxref: always empty, even with BioJava

That's expected unless you loaded an ontology whose terms have  
dbxrefs, and your language object model supports that.

> [...]
> 23) table term_synonym: always empty, even with BioJava

Same as for 21). Your terms would have to have synonyms, and your  
language object model would have to support those, before you could  
expect to get anything in here.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================