[BioPython] Taxonomy in BioSQL

Peter biopython at maubp.freeserve.co.uk
Wed Apr 23 08:42:21 UTC 2008


Eric and I have been discussing how best to deal with missing or
partial taxonomy information when importing sequences into a BioSQL
database.  I'm forwarding his email which he meant to send to the
list...

See also Bug 2475,
http://bugzilla.open-bio.org/show_bug.cgi?id=2475

Just for background, a typical GenBank file includes an NCBI taxon ID
plus the lineage as a list of names (strings).  Ideally, the user will
have run the BioSQL script load_ncbi_taxonomy.pl beforehand, so that
their taxonomy tables are fully populated and already include the new
sequence's NCBI taxon ID.  See:
http://biopython.org/wiki/BioSQL#NCBI_Taxonomy
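
To make the two pieces of information concrete, here is a rough
standard-library sketch that pulls the taxon ID and the lineage names
out of GenBank-style text.  The record fragment and the helper name
extract_taxonomy are made up for illustration; this is not how the
Biopython parser or BioSQL/Loader.py does it, and it assumes the
organism name fits on a single ORGANISM line.

```python
import re

# A trimmed, illustrative GenBank fragment (not a complete record).
GENBANK_TEXT = '''\
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Mammalia; Primates; Hominidae; Homo.
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
'''


def extract_taxonomy(text):
    """Return (ncbi_taxon_id or None, organism or None, lineage list)."""
    # The NCBI taxon ID, when present, sits in a db_xref qualifier.
    taxon_match = re.search(r'/db_xref="taxon:(\d+)"', text)
    ncbi_taxon_id = int(taxon_match.group(1)) if taxon_match else None

    organism = None
    lineage_parts = []
    in_organism = False
    for line in text.splitlines():
        if line.startswith("  ORGANISM"):
            organism = line[len("  ORGANISM"):].strip()
            in_organism = True
        elif in_organism and line.startswith(" " * 12):
            # Continuation lines under ORGANISM carry the lineage.
            lineage_parts.append(line.strip())
        else:
            in_organism = False

    lineage_text = " ".join(lineage_parts).rstrip(".")
    lineage = [name.strip() for name in lineage_text.split(";") if name.strip()]
    return ncbi_taxon_id, organism, lineage


taxon_id, organism, lineage = extract_taxonomy(GENBANK_TEXT)
print(taxon_id, organism, lineage)
```

A file from a non-NCBI source would typically give None for the taxon
ID here, leaving only the names to work with.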

The big question is what to do when the new sequence being added to
BioSQL doesn't have an NCBI taxon ID (e.g. it's from a non-NCBI
sequence file), and we are reduced to string matching of species
names...

Peter

---------- Forwarded message ----------
From: Eric Gibert
Date: Wed, Apr 23, 2008 at 2:33 AM
Subject: Taxonomy in BioPython

Dear all,

With the help of Peter and Michiel, I am looking into improving the
taxonomy management in BioPython [with BioSQL].

The first step is completed: an XML parser for the information from
the NCBI taxonomy database is now in CVS for Bio.Entrez.

The second step is to improve the current coding of BioSQL/Loader.py
for the part in charge of gathering/saving taxonomic data.

On this point, we have two questions.

Suppose the user does not wish to access the NCBI server to fetch
taxonomic data (or the user's server cannot access the Internet), but
some lineage information is present in the records to load.  Would you
prefer to:
1) load ONLY the species into the BioSQL.taxon table (if an INSERT is
needed) - safe: the main data are known -
or
2) load whatever lineage information is available, which most of the
time means name and rank only (i.e. no NCBI ID) - risky: not all data
are known, different lineages might have different levels, and looking
up by scientific name might create duplicates -
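
The two options can be sketched against an in-memory SQLite database.
Everything here is an assumption made for illustration: the two tables
are a cut-down stand-in for the BioSQL taxon and taxon_name tables (the
real schema has more columns), the helper names are hypothetical, and
matching on name plus parent is only one possible rule, not what
Loader.py does.  The example lineages are shortened by hand.

```python
import sqlite3

# Simplified stand-in for the BioSQL taxon tables (assumption: the real
# schema has more columns and constraints than shown here).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER UNIQUE,
    parent_taxon_id INTEGER,
    node_rank       TEXT
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL,
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
"""


def get_or_add(conn, name, parent_id=None, rank=None):
    """Find a taxon by scientific name under parent_id, else insert it."""
    row = conn.execute(
        "SELECT t.taxon_id FROM taxon t JOIN taxon_name n USING (taxon_id) "
        "WHERE n.name_class = 'scientific name' AND n.name = ? "
        "AND t.parent_taxon_id IS ?",
        (name, parent_id),
    ).fetchone()
    if row:
        return row[0]
    # No NCBI taxon ID is known, so ncbi_taxon_id stays NULL.
    cur = conn.execute(
        "INSERT INTO taxon (parent_taxon_id, node_rank) VALUES (?, ?)",
        (parent_id, rank),
    )
    conn.execute(
        "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
        (cur.lastrowid, name),
    )
    return cur.lastrowid


def load_species_only(conn, species):
    """Option 1: store just the species row, with no parent chain."""
    return get_or_add(conn, species, rank="species")


def load_lineage(conn, lineage, species):
    """Option 2: store every lineage name, linked parent to child."""
    parent_id = None
    for name in lineage:
        parent_id = get_or_add(conn, name, parent_id)
    return get_or_add(conn, species, parent_id, rank="species")


def count_named(conn, name):
    """Count how many taxon_name rows carry this name."""
    query = "SELECT COUNT(*) FROM taxon_name WHERE name = ?"
    return conn.execute(query, (name,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
full = ["Eukaryota", "Metazoa", "Chordata", "Homo"]
short = ["Eukaryota", "Chordata", "Homo"]  # same organism, fewer levels
first = load_lineage(conn, full, "Homo sapiens")
second = load_lineage(conn, short, "Homo sapiens")
print(first != second, count_named(conn, "Homo sapiens"))
```

In this sketch, the second record skips one level, so "Chordata" no
longer matches under the same parent and everything below it is
inserted again.  That is one way the duplicates mentioned in option 2
can arise; matching on name alone would avoid it but could merge
unrelated taxa that share a name.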

Another related question: is anyone using a taxonomic database other
than the NCBI one?

Anyone using Loader.py, or interested in doing so, is welcome to
comment and choose.

Best regards,

Eric