[Bioperl-l] Bio::Taxonomy confusion

Thu May 11 11:51:44 UTC 2006

Jason Stajich wrote:
> I would use the implementation that talks to the flatfile db as the 
> standard here.  nodes are defined by the data from the taxonomy dump
> dbs from ncbi. the eutils is pretty worthless except for taxid->name
> or reverse, you can't get the full taxonomy (or couldn't when that
> implementation was written).

I'm not sure what you mean. In 1.5.1 you have access to the full
taxonomy because you're using efetch.fcgi. Indeed, you parse the full
taxonomy already to get the classification.

> The "name" method refers to the name of the node - each level in the
>  taxonomy can have a "name".

Yes, and to me the 'name of the node' is its scientific name (something
like 'sapiens'), not a 'common' name. So why is it stored as a
'common' name in the object? Why don't the DB::Taxonomy modules store
the actual common names (something like 'human')?

> The bits of hackiness relate to wrapping the node object as a 
> Bio::Species and/or being able to read  a genbank file and the
> organism taxonomy data as a list and instantiating.  If we could rely
> on everything being in a DB of course this would be simpler.

I think that Taxonomy stuff could be done in a 'pure' way, with a new
Bio::Species made as a wrapper around an appropriate Taxonomy module(s)
that cheated and made fake nodes from a genbank list and then made a
proper Bio::Taxonomy.

> With the flatfile implementation you have to walk all the way up the
> db hierarchy to get the kingdom for a node so you do have to build up
> the classification hierarchy as each node only stores data about
> itself.

I'm actually still using bioperl 1.4, but I'm looking at 1.5.1 (assuming
it is the latest available) and I see that the flatfile implementation
works the same way as the entrez one. The requested node is fetched, but
then internally it walks the hierarchy purely so it can build a
classification list which is then stored on the object. If you're
already retrieving every node above the requested node, why not just
return every node? Why not just return a whole Bio::Taxonomy?
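
To make the question concrete, here is a minimal sketch of a lookup that
keeps every node it visits on the way up, rather than collapsing the
walk into a list of names. It assumes a hypothetical in-memory hash
(%db) in place of the real flatfile or Entrez lookup; lineage_nodes and
the taxon ids are invented for illustration and are not bioperl API.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for a taxonomy database, keyed by taxon id.
# The ids and layout are illustrative only, not real NCBI data.
my %db = (
    1 => { name => 'root',    rank => 'no rank', parent => undef },
    2 => { name => 'Metazoa', rank => 'kingdom', parent => 1 },
    3 => { name => 'Homo',    rank => 'genus',   parent => 2 },
    4 => { name => 'sapiens', rank => 'species', parent => 3 },
);

# Walk from the requested node up to the root, keeping every node visited.
sub lineage_nodes {
    my ($id) = @_;
    my @nodes;
    while (defined $id) {
        my $record = $db{$id} or die "unknown taxon id $id";
        push @nodes, { id => $id, %$record };
        $id = $record->{parent};
    }
    return @nodes;    # requested node first, root last
}

my @lineage = lineage_nodes(4);
print join(' -> ', map { "$_->{name} ($_->{rank})" } @lineage), "\n";
```

The same walk that currently produces only a classification list would
then hand back the nodes themselves, so no second lookup is needed to
inspect a parent.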

> I'm not exactly sure what you are proposing to do, but would
> definitely enjoy another pair of hands, I don't really have time to
> mess with it any time soon.

I shouldn't really be spending any time on it either, but I knocked up a
quick implementation for myself yesterday/today. I'm working on a bunch 
of modules that inherit from bioperl and then add/alter to suit my 
needs. In this regard they're a bit limited and kind of hard-coded to my 
way of thinking, but hopefully you can see my intent and perhaps use 
some of my implementation.

In my implementation:
# DB::Taxonomy::* return a Bio::Taxonomy equivalent with a single 
database lookup.
# The Taxonomy is implicitly a tree.
# The Taxonomy can have branches of different length from root to the
same rank level.
# The Taxonomy isn't told what ranks it has (isn't limited by some
supplied rank list); it has the ranks that its Nodes have and knows
(without being told) what order those ranks should be in.
# The Taxonomy is made of Nodes that truly only contain information
about themselves and have no classification array or anything like that.
# A Node can still be classified.
# We can have Nodes of rank 'no rank' that will be correctly ordered in
the classification.
# Nodes have a scientific name and common names.
# You get parent and all children nodes without database lookups.
# There is a Bio::Species-like thing that wraps around this and gives
easy access to what I really want to do:

my $human = TFBS::Species->new(-common_name => 'human');
# returns the array you'd expect from a normally created, fully
# classified Bio::Species
my @classification = $human->classification;
my $kingdom = $human->kingdom; # returns 'Metazoa'

# For genbank, we can still supply TFBS::Species a classification array
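
As a rough illustration of nodes that hold only their own data, and of
building "fake" nodes from a genbank-style list, here is a minimal
sketch. Sketch::Node and nodes_from_list are hypothetical names made up
for this example (they are not the classes in the tarball below, nor
bioperl API), and the list is assumed to be ordered most specific first.

```perl
use strict;
use warnings;

# Hypothetical minimal node: stores only its own name and a link to its parent.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, parent => $args{parent} }, $class;
}
sub name   { $_[0]{name} }
sub parent { $_[0]{parent} }

# Derive the classification by walking parent links; nothing is cached.
sub classification {
    my ($self) = @_;
    my @names;
    for (my $node = $self; defined $node; $node = $node->parent) {
        push @names, $node->name;
    }
    return @names;
}

package main;

# Build a chain of fake nodes from a list ordered most specific first.
sub nodes_from_list {
    my @names = @_;
    my $parent;
    for my $name (reverse @names) {
        $parent = Sketch::Node->new(name => $name, parent => $parent);
    }
    return $parent;    # the most specific node
}

my $leaf = nodes_from_list(qw(sapiens Homo Hominidae Primates Mammalia));
print join(', ', $leaf->classification), "\n";
```

Because each node only knows its parent, the classification is always
computed from the tree, so a node created from a list and a node fetched
from a database can be treated the same way.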

http://bix.sendu.me.uk/files/taxonomy_the_tfbs_way.tar.gz
(only tested inheriting from bioperl 1.4, but ideally that shouldn't 
make any difference!)

Is there any scope for bioperl Taxonomy becoming more like this? Or are
there problems with my design (quite likely!)? Or are there good reasons
for maintaining the current way of working? Please feel free to shoot me
down or discuss.

Cheers,
Sendu.