[Bioperl-l] Bio::*Taxonomy* changes
Sendu Bala
bix at sendu.me.uk
Tue Jul 25 13:49:04 EDT 2006
Chris Fields wrote:
> If I were to get an object back that was labeled Bio::Species, as a
> biologist I would expect it to be part of a taxonomy, not the actual
> Taxonomy itself.
I think this is the most important sentence in the discussion. Ok, so
it's clear to me that a better solution is needed than my
Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
also needed to start trying to code my Taxonomy proposal to see some
issues with it.
[... in another email...]
> I'm trying to view this as an outsider would,
> a biologist not familiar with the Bioperl class structure.
Ok, let's come up with a proposal that makes sense to the biologist and
better matches Jason's original idea.
---- long post follows; there's a summary at the end
As a biologist when I consider a species I have the following primary
questions. Let's see how we would answer them using a) Bio::Species and
genbank.pm as they are now, b) Bio::Species if it was a 'pure'
Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species
and used Node directly), and Chris' updated genbank.pm. Let's say we got
our species information from a genbank file where the scientific name
and tax id are available to be parsed out.
# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()
# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit
rubbish though, I can't tell what any of the array elements are supposed
to be.
b) A pure Node wouldn't store the lineage on itself. There are two
obvious solutions: 1) add cruft to Node by giving it a classification()
method - works as well/badly as a). 2) call get_Lineage_Nodes(), which has
the benefit of telling me what rank each ancestor was, if that
information had been in the file (more likely if the Node was generated
from a database). Problem: get_Lineage_Nodes() only works if it can call
$self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage came not
from a database but from parsing a genbank flat file. As we
parse the genbank file we can certainly make nodes for each word in the
list:
inside genbank.pm...
@class = reverse @class;
my @nodes;
my $fake_id = 1;
foreach my $sci_name (@class) {
    # each node's parent is the next node in the list
    push(@nodes, Bio::Taxonomy::Node->new(-name      => $sci_name,
                                          -object_id => $fake_id,
                                          -parent_id => $fake_id + 1));
    $fake_id++;
}
But how do we keep these nodes and make them returnable later by
get_Lineage_Nodes? Perhaps:
my $taxonomy = new Bio::Taxonomy;
foreach my $node (@nodes) {
    $taxonomy->add_node($node);
}
...
my $make = Bio::Taxonomy::Node->new();
...
$make->db_handle($taxonomy);
Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node
which only accepts a rank). Of course this is ugly, storing a Taxonomy
in our database handle. We could have a new Bio::DB::Taxonomy:: class
instead, one that treats a classification array like a database? It could
have the added bonus of building up an entire database internally as
more input arrays are given to it, so it could give each node a
unique but consistent id. It would break, however, if one time you gave it
qw(Homo Primates) and another time qw(Homo Hominidae Primates). Ideas?
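To make the "classification array as a database" idea concrete, here is a
rough sketch in plain Perl. It is not BioPerl code: the package name
FlatTaxonomyDB and its methods add_lineage() and get_node() are made up for
illustration, and it assumes each lineage is passed leaf-first. Ids are
handed out per name, so the same name gets the same id on every input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Bio::DB::Taxonomy::-style class; not part of BioPerl.
package FlatTaxonomyDB;

sub new { return bless { id_of => {}, node => {}, next_id => 1 }, shift }

# Register one lineage, leaf-first, e.g. qw(Homo Hominidae Primates).
# Returns the id assigned to the first element.
sub add_lineage {
    my ($self, @names) = @_;
    # Reuse the id already given to a name, otherwise hand out a new one.
    my @ids = map { $self->{id_of}{$_} ||= $self->{next_id}++ } @names;
    for my $i (0 .. $#ids) {
        my $node = $self->{node}{ $ids[$i] }
            ||= { id => $ids[$i], name => $names[$i] };
        next if $i == $#ids;    # last element: parent unknown in this input
        my $parent_id = $ids[$i + 1];
        if (defined $node->{parent_id} && $node->{parent_id} != $parent_id) {
            warn "conflicting parent for '$names[$i]'; keeping the first one seen\n";
            next;
        }
        $node->{parent_id} = $parent_id;
    }
    return $ids[0];
}

# Look a node up by id; returns a hash ref or undef.
sub get_node { my ($self, $id) = @_; return $self->{node}{$id} }

package main;

my $db   = FlatTaxonomyDB->new;
my $homo = $db->add_lineage(qw(Homo Hominidae Primates));
my $pan  = $db->add_lineage(qw(Pan Hominidae Primates));

# Both genera share one parent node because 'Hominidae' kept its id.
my $parent = $db->get_node( $db->get_node($pan)->{parent_id} );
print $parent->{name}, "\n";    # Hominidae
```

Feeding it qw(Homo Primates) afterwards triggers the warning, which is
exactly the inconsistency mentioned above; how best to resolve it is left
open in this sketch.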
# What if I don't want the whole lineage, just to know what a specific
rank like genus is for my species?
a) use genus(), but not guaranteed to be correct.
b) two solutions: 1) add cruft to Node by adding a genus() method: as
good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until
you find a node with your rank() of interest. Same problems as for
lineage question, but also it would be nicer to have a
get_node('rank_name') style method. But such a method belongs in
something like Bio::Taxonomy, not Node. At the very least a method like
genus() would be implemented using pure Node methods like
get_Parent_Node(), returning undefined if no parent had a rank() of
'genus', never guessing it.
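Something like the following shows the "walk up until the rank matches,
otherwise undef" behaviour. It is only a sketch in plain Perl: the %node
table, its field names and the function name_at_rank() are invented for
illustration and stand in for whatever get_Parent_Node() and rank() would
provide.

```perl
use strict;
use warnings;

# Hypothetical in-memory nodes keyed by fake id; field names are illustrative.
my %node = (
    1 => { name => 'Homo sapiens', rank => 'species', parent_id => 2 },
    2 => { name => 'Homo',         rank => 'genus',   parent_id => 3 },
    3 => { name => 'Hominidae',    rank => 'family',  parent_id => undef },
);

# Starting at $id itself, walk up the parent links until a node with the
# requested rank is found. Returns its name, or undef if nothing in the
# lineage carries that rank (no guessing).
sub name_at_rank {
    my ($id, $rank) = @_;
    while (defined $id) {
        my $n = $node{$id} or return;
        return $n->{name} if defined $n->{rank} && $n->{rank} eq $rank;
        $id = $n->{parent_id};
    }
    return;
}

print name_at_rank(1, 'genus'), "\n";                                # Homo
print defined name_at_rank(1, 'order') ? "found\n" : "undef\n";      # undef
```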
# Is this species the same as another species?
a) Not guaranteed to be correct. (no unique id so forced to compare names)
b) Correct answer by using object_id() method, along with Chris' change
to genbank.pm.
# What is the most recent common ancestor of this species and another?
a) Can't be answered.
b) Use get_LCA_Node(), but same issues as the lineage question, since
get_LCA_Node requires a working get_Lineage_Nodes(). It also requires
correct (unique) ids for all nodes in all lineages to give the
guaranteed correct answer. But at least you /might/ get the correct
answer even using only the data in genbank files and no db lookup.
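To show why the common-ancestor question depends on working lineages and
unique ids, here is a small plain-Perl sketch. The %parent_of table uses
made-up ids, and lineage_ids() and lca_id() are hypothetical helpers, not
the actual get_Lineage_Nodes() or get_LCA_Node() implementations.

```perl
use strict;
use warnings;

# Hypothetical parent links, child id => parent id (undef marks the root).
# Tree: 1 is the root; 2 is under 1; 3 and 6 are under 2; 4 and 5 are under 3.
my %parent_of = (1 => undef, 2 => 1, 3 => 2, 4 => 3, 5 => 3, 6 => 2);

# Ids from the node itself up to the root.
sub lineage_ids {
    my ($id) = @_;
    my @ids;
    while (defined $id) {
        push @ids, $id;
        $id = $parent_of{$id};
    }
    return @ids;
}

# First id in $y's lineage that also occurs in $x's lineage.
# This only gives the right answer if ids are unique across all lineages.
sub lca_id {
    my ($x, $y) = @_;
    my %seen = map { $_ => 1 } lineage_ids($x);
    for my $id (lineage_ids($y)) {
        return $id if $seen{$id};
    }
    return;
}

print lca_id(4, 5), "\n";    # 3
print lca_id(4, 6), "\n";    # 2
```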
---- summary:
It seems like the main problem with Node right now is that it has
classification() and things like genus(). I propose pure Node method
solutions to answer the questions classification() and genus() were
implemented to answer, but in a better, cruft-free way.
Bio::DB::Taxonomy::genbank anyone?
Then if you start with a Species/Node generated by a genbank parse
and want certain questions answered correctly, you only have to set a
different db_handle(). The Node only stores the static and hopefully
correct information about itself, whilst all other questions go via
db_handle, so you can dynamically swap back and forth between databases
depending on whether you need speed or accuracy.
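A toy illustration of that swap, in plain Perl rather than BioPerl: the
TinyNode package and its parent_name() method are invented for this sketch,
and the two "databases" are just hashes built from the qw(Homo Primates)
and qw(Homo Hominidae Primates) lineages used earlier.

```perl
use strict;
use warnings;

# Hypothetical minimal node: stores only its own name plus a swappable handle.
package TinyNode;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, db => $args{db} }, $class;
}

# Get or set the handle, in the spirit of db_handle().
sub db_handle {
    my $self = shift;
    $self->{db} = shift if @_;
    return $self->{db};
}

# Ask whichever handle is currently set for the parent's name.
sub parent_name {
    my $self = shift;
    my $db = $self->db_handle or return;
    return $db->{ $self->{name} };
}

package main;

# Two stand-in "databases" mapping a name to its parent's name.
my %from_flat_file = (Homo => 'Primates');
my %from_full_db   = (Homo => 'Hominidae', Hominidae => 'Primates');

my $node = TinyNode->new(name => 'Homo', db => \%from_flat_file);
print $node->parent_name, "\n";      # Primates

$node->db_handle(\%from_full_db);    # swap handles; the node itself is unchanged
print $node->parent_name, "\n";      # Hominidae
```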