[Bioperl-l] Bio::*Taxonomy* changes

Sendu Bala bix at sendu.me.uk
Tue Jul 25 17:49:04 UTC 2006

Chris Fields wrote:
> If I were to get an object back that was labeled Bio::Species, as a
> biologist I would expect it to be part of a taxonomy, not the actual
> Taxonomy itself.

I think this is the most important sentence in the discussion. Ok, so 
it's clear to me that a better solution is needed than my 
Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I 
also needed to start trying to code my Taxonomy proposal to see some 
issues with it.

[... in another email...]
> I'm trying to view this as an outsider would,
> a biologist not familiar with the Bioperl class structure.

Ok, let's come up with a proposal that makes sense to the biologist and 
better matches Jason's original idea.

---- long post follows; there's a summary at the end

As a biologist when I consider a species I have the following primary 
questions. Let's see how we would answer them using a) Bio::Species and 
genbank.pm as they are now, b) Bio::Species if it was a 'pure' 
Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species 
and used Node directly), and Chris' updated genbank.pm. Let's say we got 
our species information from a genbank file where the scientific name 
and tax id are available to be parsed out.

# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()

# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit 
rubbish though, I can't tell what any of the array elements are supposed 
to be.
b) A pure Node wouldn't store the lineage on itself. There are two 
obvious solutions: 1) add cruft to Node by giving it a classification() 
method - works as well/badly as a). 2) call get_Lineage_Nodes(), which has 
the benefit of telling me what rank each ancestor was, if that 
information had been in the file (more likely, if the Node was generated 
from a database). Problem: get_Lineage_Nodes() only works if it can 
$self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage didn't come 
from a database, but from the parsing of a genbank flat file. As we 
parse the genbank file we can certainly make nodes for each word in the 
classification:
# inside genbank.pm...
@class = reverse @class;
my @nodes; my $fake_id = 1;
foreach my $sci_name (@class) {
   my $id = $fake_id++;   # this node's fake id; its parent gets the next one
   push(@nodes, Bio::Taxonomy::Node->new(-name      => $sci_name,
                                         -object_id => $id,
                                         -parent_id => $fake_id));
}
# note: the last node ends up with a parent_id that matches no node
But how do we keep these nodes and make them returnable later by 
get_Lineage_Nodes? Perhaps:
my $taxonomy = Bio::Taxonomy->new();
foreach my $node (@nodes) {
   # the taxonomy would also need to hold these nodes; that part is left out
   $node->db_handle($taxonomy);
}
Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node 
which only accepts a rank). Of course this is ugly, storing a Taxonomy 
in our database handle. We could have a new Bio::DB::Taxonomy:: class 
instead, that treated a classification array like a database? It could 
have the added bonus of building up an entire database internally as 
more input arrays are given to it, able to therefore give each node a 
unique but consistent id. It would break if one time you gave it qw(Homo 
Primates) and another time qw(Homo Hominidae Primates), however. Ideas?
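
To make the "classification array as a database" idea concrete, here is a rough sketch in plain Perl. It is only an illustration under assumptions: plain hashes stand in for Bio::Taxonomy::Node objects, and new_db(), add_lineage() and get_node() are hypothetical names, not BioPerl methods. Ids are keyed by name so that repeated input arrays get consistent ids, and a disagreement such as qw(Homo Primates) after qw(Homo Hominidae Primates) is recorded rather than silently accepted.

```perl
use strict;
use warnings;

# Hypothetical in-memory "database"; none of these names exist in BioPerl.
sub new_db {
    my %db = (next_id => 1, id_of => {}, node => {}, conflicts => []);
    return \%db;
}

# Add one classification array, ordered most specific first,
# e.g. qw(Homo Hominidae Primates). Returns the id of the first name.
sub add_lineage {
    my ($db, @names) = @_;
    my $parent_id;                        # stays undef for the root-most name
    foreach my $name (reverse @names) {   # walk from the root towards the tip
        my $id = $db->{id_of}{$name};
        if (!defined $id) {
            # First sighting: hand out the next id and remember it by name.
            $id = $db->{id_of}{$name} = $db->{next_id}++;
            $db->{node}{$id} = { id => $id, name => $name,
                                 parent_id => $parent_id };
        }
        else {
            # Seen before: the recorded parent must match, otherwise the
            # input lineages disagree. Ids start at 1, so 0 means "no parent".
            my $known = $db->{node}{$id}{parent_id};
            push @{ $db->{conflicts} }, $name
                if ($known // 0) != ($parent_id // 0);
        }
        $parent_id = $id;
    }
    return $parent_id;
}

# Look a node up by id, in the spirit of get_Taxonomy_Node(-taxonid => ...).
sub get_node {
    my ($db, $id) = @_;
    return defined $id ? $db->{node}{$id} : undef;
}

my $db   = new_db();
my $homo = add_lineage($db, qw(Homo Hominidae Primates));
my $pan  = add_lineage($db, qw(Pan Hominidae Primates));
add_lineage($db, qw(Homo Primates));      # disagrees with the first call
print "conflicts: @{ $db->{conflicts} }\n";
```

Keying ids by name is the weak point of this sketch: two different taxa that share a name would collapse into one node, so it is no substitute for real taxon ids.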

# What if I don't want the whole lineage, just to know what a specific 
rank like genus is for my species?
a) use genus(), but not guaranteed to be correct.
b) two solutions: 1) add cruft to Node by adding a genus() method: as 
good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until 
you find a node with your rank() of interest. Same problems as for 
lineage question, but also it would be nicer to have a 
get_node('rank_name') style method. But such a method belongs in 
something like Bio::Taxonomy, not Node. At the very least a method like 
genus() would be implemented using pure Node methods like 
get_Parent_Node(), returning undefined if no parent had a rank() of 
'genus', never guessing it.
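
A minimal sketch of that "walk the parents, never guess" behaviour, again in plain Perl with hashes standing in for Nodes. The node table and name_at_rank() are hypothetical and only illustrate the control flow; a real version would go through get_Parent_Node() and rank().

```perl
use strict;
use warnings;

# Hypothetical node table keyed by id; stands in for nodes reached via db_handle.
my %node = (
    1 => { name => 'Primates',     rank => 'order',   parent_id => undef },
    2 => { name => 'Hominidae',    rank => 'family',  parent_id => 1 },
    3 => { name => 'Homo',         rank => 'genus',   parent_id => 2 },
    4 => { name => 'Homo sapiens', rank => 'species', parent_id => 3 },
);

# Return the name of the nearest node, starting at $id itself, whose rank
# equals $rank. Returns nothing (undef in scalar context) if no node in the
# lineage carries that rank; it never falls back to a guess.
sub name_at_rank {
    my ($nodes, $id, $rank) = @_;
    while (defined $id && exists $nodes->{$id}) {
        my $n = $nodes->{$id};
        return $n->{name} if defined $n->{rank} && $n->{rank} eq $rank;
        $id = $n->{parent_id};
    }
    return;
}

my $genus   = name_at_rank(\%node, 4, 'genus');
my $kingdom = name_at_rank(\%node, 4, 'kingdom');
print 'genus: ',   (defined $genus   ? $genus   : 'unknown'), "\n";
print 'kingdom: ', (defined $kingdom ? $kingdom : 'unknown'), "\n";
```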

# Is this species the same as another species?
a) Not guaranteed to be correct. (no unique id so forced to compare names)
b) Correct answer by using object_id() method, along with Chris' change 
to genbank.pm.

# What is the most recent common ancestor of this species and another?
a) Can't be answered.
b) Use get_LCA_Node(), but same issues as the lineage question, since 
get_LCA_Node requires a working get_Lineage_Nodes(). It also requires 
correct (unique) ids for all nodes in all lineages to give the 
guaranteed correct answer. But at least you /might/ get the correct 
answer even using only the data in genbank files and no db lookup.
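
To illustrate why the common-ancestor question depends on working lineages and unique ids, here is a small plain-Perl sketch. It is not how get_LCA_Node() is implemented; lineage_ids() and lca_id() are hypothetical helpers over a hand-made table with made-up ids.

```perl
use strict;
use warnings;

# Hypothetical node table with made-up ids; parent_id undef marks the root.
my %tree = (
    1 => { name => 'Primates',        parent_id => undef },
    2 => { name => 'Hominidae',       parent_id => 1 },
    3 => { name => 'Homo',            parent_id => 2 },
    4 => { name => 'Pan',             parent_id => 2 },
    5 => { name => 'Homo sapiens',    parent_id => 3 },
    6 => { name => 'Pan troglodytes', parent_id => 4 },
);

# Ids from the node itself up to the root. The %seen guard stops the walk
# if bad parent ids ever form a loop.
sub lineage_ids {
    my ($nodes, $id) = @_;
    my (@ids, %seen);
    while (defined $id && exists $nodes->{$id} && !$seen{$id}++) {
        push @ids, $id;
        $id = $nodes->{$id}{parent_id};
    }
    return @ids;
}

# The first id on B's path to the root that also lies on A's path is the
# most recent common ancestor. Returns nothing if the paths never meet.
sub lca_id {
    my ($nodes, $id_a, $id_b) = @_;
    my %on_a = map { $_ => 1 } lineage_ids($nodes, $id_a);
    foreach my $id (lineage_ids($nodes, $id_b)) {
        return $id if $on_a{$id};
    }
    return;
}

my $lca = lca_id(\%tree, 5, 6);
print 'LCA: ', (defined $lca ? $tree{$lca}{name} : 'none'), "\n";
```

The comparison is by id only, so if two lineages used different fake ids for the same taxon the answer would come out wrong or empty.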

---- summary:

It seems like the main problem with Node right now is that it has 
classification() and things like genus(). I propose pure Node method 
solutions to answer the questions classification() and genus() were 
implemented to answer, but in a better, cruft-free way.

Bio::DB::Taxonomy::genbank anyone?

Then if you started with a Species/Node generated by a genbank parse, 
and wanted certain questions answered correctly, you only have to set a 
different db_handle(). The Node only stores the static and hopefully 
correct information about itself, whilst all other questions go via 
db_handle, so you can dynamically swap back and forth between databases 
depending on if you need speed or accuracy.
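
A toy version of the db_handle swap, to show the shape of the idea rather than any real API. The two hashes are hypothetical stand-ins for a source built from a genbank parse (fast, fake parent id, no rank) and a full taxonomy database; parent_of() is a made-up helper. The taxon ids 9606 and 9605 are the NCBI ids for Homo sapiens and Homo.

```perl
use strict;
use warnings;

# Hypothetical source built while parsing a genbank file: the parent only
# has a fake id and its rank is unknown.
my %from_genbank_file = (
    9606 => { parent_id => 1 },
    1    => { name => 'Homo' },
);

# Hypothetical source backed by a full taxonomy database.
my %from_full_database = (
    9606 => { parent_id => 9605 },
    9605 => { name => 'Homo', rank => 'genus' },
);

# The node stores only static facts about itself plus a handle.
my %species = (
    id        => 9606,
    name      => 'Homo sapiens',
    db_handle => \%from_genbank_file,
);

# Every question about relatives goes through whatever handle is set.
sub parent_of {
    my ($node) = @_;
    my $db  = $node->{db_handle};
    my $rec = $db->{ $node->{id} } or return;
    my $pid = $rec->{parent_id};
    return unless defined $pid;
    return $db->{$pid};
}

my $quick = parent_of(\%species);              # fast, but no rank available
$species{db_handle} = \%from_full_database;    # swap to the accurate source
my $exact = parent_of(\%species);
print 'rank from file: ',
    (defined $quick->{rank} ? $quick->{rank} : 'unknown'), "\n";
print "rank from database: $exact->{rank}\n";
```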
