[BioSQL-l] Loading sequences with novel NCBI taxon id
Hilmar Lapp
hlapp at gmx.net
Thu Mar 13 22:51:13 UTC 2008
(this is more of a bioperl question than a biosql one)
The load_ncbi_taxonomy.pl script is designed to update the taxon
tables in a non-disruptive way, and if there haven't been many changes
it shouldn't actually take that long (except that recalculating the
nested set values may take a couple of minutes).
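
For readers unfamiliar with nested set values, here is a minimal Python sketch of how left/right numbers can be derived from parent pointers by a depth-first walk. It is an illustration only, not the code load_ncbi_taxonomy.pl runs; the function name nested_set_values and the plain-dict input are assumptions made for the example. It also shows why the step is costly: every node in the tree gets renumbered.

```python
from collections import defaultdict


def nested_set_values(parent_of):
    """Compute nested-set (left, right) numbers from parent pointers.

    parent_of maps node id -> parent id (hypothetical input format).
    A node whose parent is None or itself is treated as a root.
    """
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None or parent == node:
            roots.append(node)
        else:
            children[parent].append(node)

    values = {}
    counter = 0
    for root in sorted(roots):
        # Iterative depth-first walk, so deep trees do not hit
        # Python's recursion limit.
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            counter += 1
            if leaving:
                # Second visit: all descendants are numbered, close the interval.
                values[node] = (values[node][0], counter)
            else:
                # First visit: assign the left value, revisit after the children.
                values[node] = (counter, None)
                stack.append((node, True))
                for child in sorted(children[node], reverse=True):
                    stack.append((child, False))
    return values


if __name__ == "__main__":
    tree = {1: None, 2: 1, 3: 1, 4: 2}
    for node, (left, right) in sorted(nested_set_values(tree).items()):
        print(node, left, right)
```

With these numbers, a node's descendants are exactly the nodes whose left value falls strictly between its own left and right values, which is what makes subtree queries cheap once the recalculation is done.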
Bioperl-db will store the taxon information it finds in the
Bio::Species object if it can't locate the taxon by lookup, and will
not raise an error. The problem with this is that it relies on the
Bio::SeqIO parser to have gotten the species and lineage information
correct, which is sometimes a wrong assumption for exotic species.
Most often the error will not manifest itself at the time of storing
the erroneously parsed information, but when it is re-retrieved and
used to populate a Bio::Species object.
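
The lookup-then-store behaviour described above can be sketched roughly as follows. This is a hedged illustration in Python against an in-memory SQLite database, not bioperl-db's actual code; the two tables are a simplified subset of the BioSQL taxon and taxon_name tables, and find_or_create_taxon is a hypothetical helper name.

```python
import sqlite3

# Simplified subset of the BioSQL taxon tables (assumption: the real
# schema has more columns and constraints than shown here).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER UNIQUE,
    parent_taxon_id INTEGER,
    node_rank       TEXT
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL,
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
"""


def find_or_create_taxon(conn, ncbi_taxon_id, parsed_name):
    """Return (taxon_id, created) for the given NCBI taxon id.

    If the taxon is already present (e.g. loaded from the NCBI taxonomy),
    it is reused. Otherwise a stub row is stored using whatever name the
    sequence parser produced, and no error is raised.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0], False

    # Fallback: trust the parsed record. A wrong parsed_name is stored
    # silently and only shows up when the row is read back later.
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, node_rank) VALUES (?, ?)",
        (ncbi_taxon_id, "species"),
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
        (taxon_id, parsed_name, "scientific name"),
    )
    return taxon_id, True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    print(find_or_create_taxon(conn, 999999, "Examplea hypothetica"))
    print(find_or_create_taxon(conn, 999999, "Examplea hypothetica"))
```

The taxon id 999999 and the species name are made up for the example. The point of the sketch is that the fallback branch has no way to validate parsed_name or the lineage, so its correctness depends entirely on the parser.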
For the SymAtlas project we had this situation (new species in
sequence updates that the last NCBI taxonomy update hadn't yet
brought in) quite regularly. I wrote a SQL script that would fix those
'haphazard' additions such that load_ncbi_taxonomy would update them
to their correct values come the next NCBI taxonomy update. I can
send you the script (it would be for the Oracle version), but I'm not
sure this is a widely viable strategy.
-hilmar
On Mar 13, 2008, at 11:06 AM, Peter wrote:
> Dear list,
>
> One of the unresolved issues with Biopython's BioSQL interface is
> dealing with the NCBI taxon ID when loading sequences into the
> database.
>
> As I understand it, ideally before loading any sequences, the user
> will have loaded in the entire NCBI taxonomy using the
> load_ncbi_taxonomy.pl script, as I described here:
> http://biopython.org/wiki/BioSQL#NCBI_Taxonomy
>
> When a new sequence is added to the database with a known taxon id,
> there is no problem. But what happens if it's a recently sequenced organism
> which isn't defined yet in the BioSQL taxonomy tables? Could/should
> the user re-run load_ncbi_taxonomy.pl, and then load in their new
> sequence?
>
> Right now in Biopython, due to what appears to have been intended as a
> short term hack, we simply don't record the taxon id at all (!), and I
> would like to fix this (bug 2422).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2422
>
> How do BioPerl et al deal with this issue? Do they try and update the
> taxonomy tables using the available information in the new record's
> annotation (i.e. the new taxon id and the species name)? Do they
> look up the NCBI taxonomy definition via the internet? Do they throw
> an error and halt?
>
> Thanks,
>
> Peter
> (Biopython)
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================