[BioSQL-l] Loading sequences with novel NCBI taxon id

Thu Mar 13 15:06:18 UTC 2008

Dear list,

One of the unresolved issues with Biopython's BioSQL interface is
dealing with the NCBI taxon ID when loading sequences into the
database.

As I understand it, ideally before loading any sequences, the user
will have loaded in the entire NCBI taxonomy using the
load_ncbi_taxonomy.pl script, as I described here:
http://biopython.org/wiki/BioSQL#NCBI_Taxonomy

When a new sequence is added to the database with a known taxon id,
there is no problem.  But what happens if it's a recently sequenced
organism which isn't yet defined in the BioSQL taxonomy tables?
Could/should the user re-run load_ncbi_taxonomy.pl, and then load in
their new sequence?

Right now in Biopython, due to what appears to have been intended as a
short term hack, we simply don't record the taxon id at all (!), and I
would like to fix this (bug 2422).
http://bugzilla.open-bio.org/show_bug.cgi?id=2422

How do BioPerl et al deal with this issue?  Do they try to update the
taxonomy tables using the available information in the new record's
annotation (i.e. the new taxon id and the species name)?  Do they
look up the NCBI taxonomy definition via the internet?  Do they throw
an error and halt?
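
To make two of these options concrete, here is a minimal sketch in
Python of "throw an error and halt" versus "add a stub entry from the
record's annotation".  It is illustrative only, not what Biopython or
BioPerl actually do.  It uses an in-memory SQLite database with
cut-down stand-ins for the BioSQL taxon and taxon_name tables (the
real schema has more columns), and the function name get_taxon_id and
its on_missing argument are hypothetical.  The internet lookup option
is not shown.

```python
import sqlite3

# Cut-down stand-ins for the BioSQL taxon and taxon_name tables.
# Assumption: the real schema has further columns and constraints.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER UNIQUE,
    parent_taxon_id INTEGER,
    node_rank       TEXT
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL,
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
"""


def get_taxon_id(conn, ncbi_taxon_id, scientific_name=None, on_missing="error"):
    """Return the internal taxon_id for an NCBI taxon id (hypothetical helper).

    on_missing="error": raise LookupError if the taxon is not in the table.
    on_missing="stub":  insert a minimal entry (no parent, no rank) built
                        from the record's own annotation.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0]

    if on_missing == "error":
        raise LookupError(
            "NCBI taxon id %s is not in the taxon table" % ncbi_taxon_id
        )
    if on_missing != "stub":
        raise ValueError("unknown on_missing policy: %r" % (on_missing,))
    if not scientific_name:
        raise ValueError("a species name is needed to create a stub entry")

    # The lineage is unknown at this point, so parent and rank stay NULL.
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
        (taxon_id, scientific_name, "scientific name"),
    )
    return taxon_id
```

With the "stub" policy the new taxon has no parent, so it sits outside
the tree until the taxonomy is next reloaded; whether a later reload
would fill in such an entry cleanly is exactly the sort of thing I am
unsure about.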

Thanks,

Peter
(Biopython)