[BioPython] taxonomic tree

Peter biopython at maubp.freeserve.co.uk
Thu Oct 9 09:31:16 UTC 2008


>> Personally rather than designing my own database just for this (and
>> writing a parser for the taxonomy files), I would have suggested
>> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
>> to download and import the data for you.  This is a simple perl script
>> - you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
>> for details.
>
> I also used the load_ncbi_taxonomy.pl script. It worked great!

Good.  I would encourage you to use the version from BioSQL v1.0.1 if
you are not already, as the version with BioSQL v1.0.0 makes an
additional unnecessary assumption about the database keys matching the
NCBI taxon ID.

>> If you are interested, the BioSQL tables record the taxonomy tree
>> using two methods, each node has a parent node allowing you to walk up
>> the lineage.  There are also left/right values allowing selection of
>> all child nodes efficiently via an SQL select statement.
>
> This is what I was trying to do: start from the name of the organism (the
> leaf of the tree) and get every node using the parent_node field of the
> taxon table, until reaching the root node. Once I have all the steps to the
> root node, I have to create and fill the tree with my data in order to
> count the number of organisms belonging to a certain
> class/order/family/genus... etc.
> Any ideas will be very much appreciated.

To do this in Biopython you'll have to write some SQL commands - but
first you need to understand how the left/right values work if you
want to take advantage of them.  I refer you to this thread on the
BioSQL mailing list earlier in the year:
http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html

In particular, Hilmar referred to Joe Celko's SQL for Smarties books,
and the introduction to this nested-set representation given here:
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
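
To make the nested-set idea concrete, here is a minimal sketch (not
Biopython code).  It uses an in-memory SQLite database as a stand-in for a
real BioSQL database.  The table and column names (taxon, taxon_name,
left_value, right_value, node_rank) are assumed to follow the BioSQL
schema; the toy data and the helper names make_demo_db and descendants are
invented for illustration.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the BioSQL taxon tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (
            taxon_id INTEGER PRIMARY KEY,
            parent_taxon_id INTEGER,
            node_rank TEXT,
            left_value INTEGER,
            right_value INTEGER
        );
        CREATE TABLE taxon_name (
            taxon_id INTEGER,
            name TEXT,
            name_class TEXT
        );
    """)
    # (taxon_id, parent, rank, left, right, name): invented toy data.
    # A node's left/right interval encloses those of all its descendants.
    rows = [
        (1, None, "family", 1, 10, "Felidae"),
        (2, 1, "genus", 2, 5, "Felis"),
        (3, 2, "species", 3, 4, "Felis catus"),
        (4, 1, "genus", 6, 9, "Panthera"),
        (5, 4, "species", 7, 8, "Panthera leo"),
    ]
    for taxon_id, parent, rank, left, right, name in rows:
        conn.execute("INSERT INTO taxon VALUES (?, ?, ?, ?, ?)",
                     (taxon_id, parent, rank, left, right))
        conn.execute("INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
                     (taxon_id, name))
    return conn


def descendants(conn, name, rank=None):
    """Return names of all nodes below `name`, optionally only one rank."""
    sql = """
        SELECT child_name.name
        FROM taxon_name AS parent_name
        JOIN taxon AS parent ON parent.taxon_id = parent_name.taxon_id
        JOIN taxon AS child
          ON child.left_value > parent.left_value
         AND child.right_value < parent.right_value
        JOIN taxon_name AS child_name ON child_name.taxon_id = child.taxon_id
        WHERE parent_name.name = ?
          AND parent_name.name_class = 'scientific name'
          AND child_name.name_class = 'scientific name'
    """
    params = [name]
    if rank is not None:
        sql += " AND child.node_rank = ?"
        params.append(rank)
    sql += " ORDER BY child.left_value"
    return [row[0] for row in conn.execute(sql, params)]
```

With the toy data, descendants(conn, "Felidae", rank="species") selects
both species in a single statement, with no recursion.  This only works if
the left/right values have actually been populated.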

Alternatively, if you wanted to avoid the left/right values, you could
use recursion or loops on the parent ID links to build up the tree.
For a single lineage this is fine - but for a full tree I would expect
the left/right values to be faster.
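
A loop over the parent links might look roughly like the sketch below.
Again this is a toy using in-memory SQLite, not Biopython code.  It assumes
the parent link lives in a column called parent_taxon_id, as in the BioSQL
schema; the data and the helper names lineage and count_by_ancestor are
hypothetical.  The loop stops on a NULL parent or on a node that is its own
parent, since the root could be stored either way.

```python
import sqlite3
from collections import Counter


def make_demo_db():
    """Toy taxon tables; left/right values are deliberately omitted."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,
                            parent_taxon_id INTEGER, node_rank TEXT);
        CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT,
                                 name_class TEXT);
    """)
    rows = [  # (taxon_id, parent, rank, name): invented toy data
        (1, None, "family", "Felidae"),
        (2, 1, "genus", "Felis"),
        (3, 2, "species", "Felis catus"),
        (4, 1, "genus", "Panthera"),
        (5, 4, "species", "Panthera leo"),
    ]
    for taxon_id, parent, rank, name in rows:
        conn.execute("INSERT INTO taxon VALUES (?, ?, ?)",
                     (taxon_id, parent, rank))
        conn.execute("INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
                     (taxon_id, name))
    return conn


def lineage(conn, name):
    """Return [(rank, name), ...] from the root down to `name`."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon_name "
        "WHERE name = ? AND name_class = 'scientific name'",
        (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    taxon_id = row[0]
    steps = []
    seen = set()  # guards against a root that points at itself
    while taxon_id is not None and taxon_id not in seen:
        seen.add(taxon_id)
        row = conn.execute(
            "SELECT t.node_rank, n.name, t.parent_taxon_id "
            "FROM taxon AS t JOIN taxon_name AS n ON n.taxon_id = t.taxon_id "
            "WHERE t.taxon_id = ? AND n.name_class = 'scientific name'",
            (taxon_id,)).fetchone()
        if row is None:
            break  # dangling parent link
        rank, node_name, taxon_id = row
        steps.append((rank, node_name))
    return steps[::-1]


def count_by_ancestor(conn, organism_names):
    """Tally how many of the given organisms fall under each ancestor."""
    counts = Counter()
    for organism in organism_names:
        # Skip the last entry, which is the organism itself.
        for rank, node_name in lineage(conn, organism)[:-1]:
            counts[(rank, node_name)] += 1
    return counts
```

Each step up the tree costs one query, which is cheap for a single lineage
but adds up if you do it for every organism in a large data set.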

Note that Biopython (in CVS now) ignores the left/right values.  This
is for two reasons.  First, for pulling out a single lineage, Eric found
that following the parent links was faster.  Second, when adding new
entries to the database, re-calculating the left/right values is too
slow, so we leave them as NULL and let the user (re)run
load_ncbi_taxonomy.pl later if they care.  This means we don't want to
depend on the left/right values being present.
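
If your own code does want to use the left/right values, it may be worth
checking that they are populated first.  A minimal sketch, assuming the
columns are called left_value and right_value as in the BioSQL schema
(nested_set_ready is a made-up helper name, and conn is any DB-API
connection that supports execute directly, such as sqlite3):

```python
import sqlite3


def nested_set_ready(conn):
    """True if the taxon table is non-empty and has no NULL left/right."""
    total, missing = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(left_value IS NULL OR right_value IS NULL), 0) "
        "FROM taxon").fetchone()
    return total > 0 and missing == 0
```

If this returns False, you could fall back on the parent links, or re-run
load_ncbi_taxonomy.pl to fill the values in.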

Peter


