[BioSQL-l] Loading sequences with novel NCBI taxon id

Peter biopython at maubp.freeserve.co.uk
Thu Mar 13 23:13:32 UTC 2008


On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> (this is more of a bioperl question than a biosql one)

Well, yes and no.  And I'm not subscribed to the Bioperl list, nor the
BioJava one, nor the BioRuby one.

>  The load_ncbi_taxonomy.pl script is designed to update the taxon
>  tables in a non-disruptive way, and if there weren't many changes
>  shouldn't actually take that long (except that recalculating the
>  nested set values may take a couple of minutes).

Do you think when faced with a novel taxon id, Biopython/BioPerl/...
could write some minimal taxonomy entry (without any guesswork based
on the species name), in order to record the sequence's taxon - and
then running an improved load_ncbi_taxonomy.pl at a later date would
sort out the proper taxonomy?
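
To make the idea concrete, here is a rough sketch of what I mean by a
minimal entry. It is only an illustration under assumptions: it uses an
in-memory SQLite database and a cut-down stand-in for the taxon table
(the column names are my assumption of the relevant BioSQL fields, not
the full schema), and get_or_create_taxon is a hypothetical helper, not
an existing Biopython function. The stub row records only the NCBI taxon
id; parent, rank and nested set values stay NULL for a later taxonomy
load to fill in.

```python
import sqlite3

# Cut-down stand-in for the taxon table (assumed columns, not the full schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE taxon (
           taxon_id        INTEGER PRIMARY KEY,
           ncbi_taxon_id   INTEGER UNIQUE,
           parent_taxon_id INTEGER,
           node_rank       TEXT,
           left_value      INTEGER,
           right_value     INTEGER
       )"""
)


def get_or_create_taxon(conn, ncbi_taxon_id):
    """Hypothetical helper: return the internal taxon_id for an NCBI taxon id.

    If the NCBI taxon id is unknown, insert a stub row holding only that id.
    No names or lineage are guessed from the sequence file.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)",
        (ncbi_taxon_id,),
    )
    conn.commit()
    return cur.lastrowid


# Loading two sequences from the same novel taxon reuses the one stub row.
first = get_or_create_taxon(conn, 12345)
second = get_or_create_taxon(conn, 12345)
```

Whether a later taxonomy update would pick up such stub rows cleanly is
exactly the open question here.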

>  Bioperl-db will store the taxon information it finds in the
>  Bio::Species object if it can't locate the taxon by lookup, and will
>  not raise an error. The problem with this is that it relies on the
>  Bio::SeqIO parser to have gotten the species and lineage information
>  correct, which is sometimes a wrong assumption for exotic species.
>  Most often the error will not manifest itself at the time of storing
>  the erroneously parsed information, but when it is re-retrieved and
>  used to populate a Bio::Species object.

This is what I would like to avoid with Biopython.

>  For the SymAtlas project we had this situation (new species in
>  sequence updates that the last NCBI taxonomy update hadn't yet
>  brought in) quite regularly. I wrote a SQL script that would fix those
>  'haphazard' additions such that load_ncbi_taxonomy would update them
>  to their correct values come the next NCBI taxonomy update. I can
>  send you the script (it would be for the Oracle version), but I'm not
>  sure this is a widely viable strategy.

So this wasn't integrated with load_ncbi_taxonomy.pl at all?

Peter


