[BioSQL-l] Loading sequences with novel NCBI taxon id

Thu Mar 13 22:51:13 UTC 2008

(this is more of a bioperl question than a biosql one)

The load_ncbi_taxonomy.pl script is designed to update the taxon
tables in a non-disruptive way, and if there weren't many changes it
shouldn't actually take that long (except that recalculating the
nested set values may take a couple of minutes).
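
To illustrate why the nested set step is the slow part: adding even one
node shifts the left/right values of many other nodes, so the whole
tree is usually renumbered in one pass. The following is only a rough
sketch of such a renumbering, not the code in load_ncbi_taxonomy.pl.
The function name and the child -> parent dictionary input are
assumptions made for the example, as is the convention that a root has
parent None.

```python
from collections import defaultdict


def assign_nested_set(parent_of):
    """Return {node: (left, right)} for a tree given as {child: parent}.

    Hypothetical helper for illustration; roots are the nodes whose
    parent is None.
    """
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    values = {}
    counter = 0
    # Iterative depth-first walk, so deep lineages cannot hit Python's
    # recursion limit.
    for root in sorted(roots):
        stack = [(root, False)]
        while stack:
            node, seen = stack.pop()
            counter += 1
            if seen:
                # Second visit: all descendants are numbered, close the interval.
                values[node] = (values[node], counter)
            else:
                # First visit: record the left value, then queue the children.
                values[node] = counter
                stack.append((node, True))
                for child in sorted(children[node], reverse=True):
                    stack.append((child, False))
    return values


if __name__ == "__main__":
    tree = {1: None, 2: 1, 3: 1, 4: 2}
    print(assign_nested_set(tree))
```

With values assigned this way, node A is a descendant of node B exactly
when B's left value is smaller than A's and B's right value is larger
than A's, which is what makes lineage queries cheap once the numbering
is in place.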

Bioperl-db will store the taxon information it finds in the  
Bio::Species object if it can't locate the taxon by lookup, and will  
not raise an error. The problem with this is that it relies on the  
Bio::SeqIO parser to have gotten the species and lineage information  
correct, which is sometimes a wrong assumption for exotic species.  
Most often the error will not manifest itself at the time of storing  
the erroneously parsed information, but when it is re-retrieved and  
used to populate a Bio::Species object.
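
The lookup-then-fallback behaviour can be sketched roughly as below.
This is not the Bioperl-db implementation, just a minimal Python
illustration. It assumes a cut-down SQLite version of the BioSQL taxon
and taxon_name tables (several columns and constraints omitted), and
the function names and the taxon id used in the demo are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with simplified taxon tables (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taxon (
            taxon_id        INTEGER PRIMARY KEY,
            ncbi_taxon_id   INTEGER UNIQUE,
            parent_taxon_id INTEGER,
            node_rank       TEXT,
            left_value      INTEGER,
            right_value     INTEGER
        );
        CREATE TABLE taxon_name (
            taxon_id   INTEGER NOT NULL,
            name       TEXT NOT NULL,
            name_class TEXT NOT NULL,
            UNIQUE (taxon_id, name, name_class)
        );
        """
    )
    return conn


def lookup_or_store_taxon(conn, ncbi_taxon_id, parsed_name):
    """Return (taxon_id, created) for the given NCBI taxon id.

    If the taxon is already present it is reused. Otherwise a stub row
    is stored from whatever the sequence parser reported, without
    parent or nested set values, and no error is raised.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0], False

    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, node_rank) VALUES (?, 'species')",
        (ncbi_taxon_id,),
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (taxon_id, parsed_name),
    )
    return taxon_id, True


if __name__ == "__main__":
    conn = make_demo_db()
    # 999999999 is a made-up taxon id standing in for a novel organism.
    print(lookup_or_store_taxon(conn, 999999999, "Examplus novus"))
    print(lookup_or_store_taxon(conn, 999999999, "Examplus novus"))
```

Note that the stub row is only as good as the parsed name: if the
parser got the species wrong, the wrong value is stored silently and
only shows up when the record is read back.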

For the SymAtlas project we had this situation (new species in  
sequence updates that the last NCBI taxonomy update hadn't yet  
brought in) quite regularly. I wrote a SQL script that would fix those
'haphazard' additions such that load_ncbi_taxonomy would update them  
to their correct values come the next NCBI taxonomy update. I can  
send you the script (it would be for the Oracle version), but I'm not  
sure this is a widely viable strategy.

	-hilmar

On Mar 13, 2008, at 11:06 AM, Peter wrote:

> Dear list,
>
> One of the unresolved issues with Biopython's BioSQL interface is
> dealing with the NCBI taxon ID when loading sequences into the
> database.
>
> As I understand it, ideally before loading any sequences, the user
> will have loaded in the entire NCBI taxonomy using the
> load_ncbi_taxonomy.pl script, as I described here:
> http://biopython.org/wiki/BioSQL#NCBI_Taxonomy
>
> When a new sequence is added to the database with a known taxon id,
> there is no problem.  But what happens if it's a recently sequenced organism
> which isn't defined yet in the BioSQL taxonomy tables?  Could/should
> the user re-run load_ncbi_taxonomy.pl, and then load in their new
> sequence?
>
> Right now in Biopython, due to what appears to have been intended as a
> short term hack, we simply don't record the taxon id at all (!), and I
> would like to fix this (bug 2422).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2422
>
> How do BioPerl et al deal with this issue?  Do they try and update the
> taxonomy tables using the available information in the new record's
> annotation (i.e. the new taxon id and the species name)?  Do they
> look up the NCBI taxonomy definition via the internet?  Do they throw
> an error and halt?
>
> Thanks,
>
> Peter
> (Biopython)

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================