[BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
Hilmar Lapp
hlapp at gmx.net
Sat Nov 10 20:38:17 UTC 2007
Just a few comments below, specifically where no rows would in fact
be what I expect:
On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote:
> [...]
> -------- For you information, I went thru the tables of my BioSQL
> database:
> [...]
> 1) table bioentry: all column populated except for 'taxon_id' which
> is NULL
> (maybe I need an extra call for populating the 'taxon' table before?)
Bioperl-db will try to look up (or create if necessary) the taxon
from the taxon information attached to the sequence, but for BioPerl
we actually recommend pre-loading the database with the NCBI
taxonomy, which can be done conveniently with the script
load_ncbi_taxonomy.pl that comes with BioSQL.
>
> 2) table bioentry_dbxref: no data inserted (always empty, even with
> BioJava)
This would mean that the sequence(s) have no dbxrefs. Note that for
GenBank sequences that would be expected, since unfortunately, and
unlike EMBL format, GenBank puts the dbxrefs into the feature table.
> 3) table bioentry_qualifier_value:
>
> One entry only, for the 'term_id' = 149, rank = 1, and value = '07-
> JUL-2005'
> or other 'DD-MMM-YYYY' dates (see my remarks below)
Below you say that your term table is empty, so I don't know how you
can have a value here at all.
> [...]
> 5) table bioentry_relationships: no entry found (always empty, even
> with
> BioJava)
If you load sequences, they won't have direct relationships to other
sequences (except dbxrefs, but those are rather 'pointers' and are
stored in their own table).
In Bioperl-db, this table is used only if you load sequence clusters
through Bio::Cluster objects (such as UniGene).
> [...]
> 7) table comment: no entry found (always empty, even with BioJava)
Again, this is expected with GenBank. AFAIK GenBank format doesn't
allow for comments at the level of the sequence. You would (i.e.,
should) find entries here if you load UniProt entries.
> 8) table dbxref: some records are generated, for dbname 'PUBMED'
> and 'Taxon'
> with the correct value
Taxon obviously isn't really a dbxref, but rather a taxon (and hence
should go into that table).
> [...]
> 9) table dbxref_qualifier_value: (always empty, even with BioJava)
That's almost expected. There are rather few cases where dbxrefs have
additional attributes that the language can parse out from a source
(and then map to the schema).
> [...]
> 10) table location: all locations loaded correctly, note that
> 'term_id' and
> 'dbxref_id' remain NULL for these seq but I have value for other seq.
Theoretically, the term_id should point to the term giving the type
of the location. If you (or Biopython) are only dealing with simple
('normal') locations, then it's not needed.
The dbxref_id gives the reference to the remote sequence if the
location for a feature refers to a different sequence than the
feature itself does (so-called 'remote locations'). If the sequences
you loaded don't have such locations, or if Biopython doesn't handle
such locations, then this would be expected to be empty.
> 11) table location_qualifier_value: always empty, even with BioJava
This is expected if Biopython doesn't support fuzzy locations, or if
none of the feature locations that you loaded are fuzzy.
> [...]
> 13) Table reference: entries correct, note 'dbxref_id' remains NULL
> for
> these seq but I have value for other seq.
It should point to the PubMed ID for the reference, but only if there
was one.
> 14) table seqfeature: entries are there (same as in table 'location').
> FYI:'display_name is always NULL.
GenBank doesn't give names to features (and I think EMBL doesn't
either), so this is expected.
> 15) table seqfeature_dbxref: always empty, even with BioJava
That likely has more to do with your language object model than with
anything else. dbxref annotation for features is in tag/value pairs,
just like any other annotation, so your language (Biopython in this
case) will have to do a lot of interpretation to tease out the
semantics behind each tag name and, based on that, decide what to do
with the value. Indeed, by default we don't even do this in BioPerl.
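
To make the kind of interpretation meant here concrete, below is a
minimal sketch in plain Python. It assumes the feature qualifiers are
available as a dict mapping tag names to lists of string values, and
that dbxrefs sit under a 'db_xref' tag in 'dbname:accession' form. The
function names and the sample data are hypothetical and are not taken
from Biopython or BioPerl.

```python
def split_db_xref(value):
    """Split a value such as 'taxon:9606' into (dbname, accession)."""
    # Split on the first colon only, so accessions that themselves
    # contain a colon stay intact.
    dbname, sep, accession = value.partition(":")
    if not sep or not dbname.strip() or not accession.strip():
        raise ValueError(f"not a dbname:accession pair: {value!r}")
    return dbname.strip(), accession.strip()


def extract_feature_dbxrefs(qualifiers):
    """Return (dbname, accession) pairs found under the 'db_xref' tag.

    All other tags are ignored; deciding what each of them means is
    exactly the per-tag interpretation a toolkit has to implement.
    """
    return [split_db_xref(v) for v in qualifiers.get("db_xref", [])]


# Hypothetical qualifiers of a single feature (illustration only).
qualifiers = {
    "organism": ["Homo sapiens"],
    "db_xref": ["taxon:9606", "GeneID:672"],
    "note": ["example only"],
}
pairs = extract_feature_dbxrefs(qualifiers)
```

Pairs obtained this way are what a loader would have to write to dbxref
and seqfeature_dbxref; without such a step the values remain ordinary
qualifier values.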
> [...]
> 17) table seqfeature_relationship: always empty, even with BioJava
GenBank (and EMBL) feature tables are flat, not hierarchical, so this
is expected.
> 18) table taxon: always empty, even with BioJava)
This is where the organism should go.
> 19) table taxon_name: I have one but not from this test (I tried to
> tinker a
> little bit with taxon but stopped)
It's odd that you can have an entry in taxon_name without a
corresponding one in taxon. Do you have foreign key checks disabled?
> 20) table term: always empty, even with BioJava
That's strange, since you say you do have rows in
bioentry_qualifier_value, which has an enforced foreign key to term.
Did you disable the foreign key checks?
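
One way to check for such rows is to count child rows whose key has no
match in the parent table. The sketch below uses an in-memory SQLite
database with heavily reduced stand-ins for the four BioSQL tables
involved (only a few columns, and deliberately no foreign key
constraints, to mimic a database that doesn't enforce them). The helper
name count_orphans and the sample rows are made up for illustration;
against a real instance you would run the same LEFT JOIN query through
your own database connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical versions of the BioSQL tables: no constraints.
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER, term_id INTEGER, value TEXT, "rank" INTEGER
);
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER);
CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
INSERT INTO bioentry_qualifier_value VALUES (1, 149, '07-JUL-2005', 1);
INSERT INTO taxon_name VALUES (1, 'Homo sapiens', 'scientific name');
""")


def count_orphans(conn, child, parent, key):
    """Count rows in `child` whose `key` has no matching row in `parent`.

    Table and column names are interpolated into the SQL text, so only
    pass trusted, hard-coded identifiers.
    """
    sql = (
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{key} = p.{key} "
        f"WHERE p.{key} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]


orphan_terms = count_orphans(conn, "bioentry_qualifier_value", "term", "term_id")
orphan_taxa = count_orphans(conn, "taxon_name", "taxon", "taxon_id")
```

A non-zero count for either pair means the foreign key is not being
enforced. If the database happens to be MySQL (an assumption, since the
backend isn't stated here), note that MyISAM tables do not enforce
foreign keys at all, and that InnoDB skips the checks while the session
variable foreign_key_checks is set to 0.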
> 21) table term_dbxref: always empty, even with BioJava
That's expected unless you loaded an ontology whose terms have
dbxrefs, and your language object model supports that.
> [...]
> 23) table term_synonym: always empty, even with BioJava
Same as for 21). Your terms would have to have synonyms, and your
language object model would have to support those, before you could
expect to get anything in here.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================