[Bioperl-l] Bio::*Taxonomy* changes

Sendu Bala bix at sendu.me.uk
Tue Jul 18 18:50:37 EDT 2006


Chris Fields wrote:
> ...
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
> 
> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
> have Bio::Species delegate methods to Bio::Taxonomy::Node.  So any changes
> to Node will affect Bio::Species to some degree.

I see from the original postings that Node was intended to be like 
Species, but I don't think it makes the slightest bit of sense. A 
/single/ Node need only (must only!) represent the information for a 
single node in the taxonomy. Or else what do these objects mean? What is 
the object model? It's bad bad bad for it to be sensible one way (when 
you're making your own taxonomy by making your own nodes) and 
nonsensical another (when we stuff in methods so that Bio::Species is 
happy). The way Node is written right now, and what you're suggesting, 
is that we stuff the entire Taxonomy into the Node. Well, except that 
you don't even have methods for every taxonomic level - there is genus() 
but no subphylum(). I can't emphasise strongly enough how insane all 
this is.

The correct thing for Bio::Species to interact with is Bio::Taxonomy. 
Bio::Taxonomy is a collection of Nodes and has the sort of methods that 
Bio::Species would need to delegate its current functionality.

I'm quite willing to do a proper overhaul here so everything makes 
sense. You either make your own nodes and add these to a Taxonomy or use 
a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy 
lets you discover the classification of any node it contains. 
Bio::Species could implement a method like genus() by:
    my $node = $taxonomy->get_node('genus') or return;
    return $node->scientific_name;
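
To make the delegation idea concrete, here is a minimal self-contained sketch. Sketch::Node, Sketch::Taxonomy and Sketch::Species are stand-ins made up for illustration, not the real Bio::* classes. Only the names get_node() and scientific_name() come from the discussion above, and their behaviour here is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a node: holds only its own rank and name.
package Sketch::Node;
sub new {
    my ($class, %args) = @_;
    return bless { rank => $args{-rank}, name => $args{-name} }, $class;
}
sub rank            { $_[0]->{rank} }
sub scientific_name { $_[0]->{name} }

# Hypothetical stand-in for a taxonomy: a collection of nodes indexed by rank.
package Sketch::Taxonomy;
sub new {
    my $class = shift;
    return bless { by_rank => {} }, $class;
}
sub add_node {
    my ($self, $node) = @_;
    $self->{by_rank}{ $node->rank } = $node;
    return $self;
}
sub get_node {
    my ($self, $rank) = @_;
    return $self->{by_rank}{$rank};    # undef when no node has that rank
}

# Hypothetical species object: every rank query is delegated to the taxonomy.
package Sketch::Species;
sub new {
    my ($class, $taxonomy) = @_;
    return bless { taxonomy => $taxonomy }, $class;
}
sub genus {
    my $self = shift;
    my $node = $self->{taxonomy}->get_node('genus') or return;
    return $node->scientific_name;
}

package main;

my $taxonomy = Sketch::Taxonomy->new;
$taxonomy->add_node(Sketch::Node->new(-rank => 'genus',   -name => 'Homo'));
$taxonomy->add_node(Sketch::Node->new(-rank => 'species', -name => 'Homo sapiens'));

my $species = Sketch::Species->new($taxonomy);
print $species->genus, "\n";    # prints "Homo"
```

The point of the sketch is that the species object stores no genus of its own. If the taxonomy has no genus node, genus() simply returns nothing.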

Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. 
I'd probably make it rank-name and order independent for starters.

Bio::Taxonomy::Node needs to be reduced right down to just hold data 
about the node it represents, and possibly its parent node id (or other 
way of getting to its parent). So now I'm proposing dropping the 
classification() method from Node as well. It's simply not necessary; 
Bio::Taxonomy should give you that information.
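
As a rough sketch of that split, assuming a node that knows only its own id, rank, name and parent id, the collection can rebuild the lineage by walking parent ids. Demo::Node, Demo::Taxonomy and the method names id(), parent_id() and add_node() are invented for this illustration, the classification() shown lives on the collection rather than the node, and the ids are arbitrary rather than real TaxIDs.

```perl
use strict;
use warnings;

# Hypothetical minimal node: its own data plus the id of its parent, nothing else.
package Demo::Node;
sub new {
    my ($class, %args) = @_;
    return bless {
        id        => $args{-id},
        parent_id => $args{-parent_id},
        rank      => $args{-rank},
        name      => $args{-name},
    }, $class;
}
sub id              { $_[0]->{id} }
sub parent_id       { $_[0]->{parent_id} }
sub rank            { $_[0]->{rank} }
sub scientific_name { $_[0]->{name} }

# Hypothetical taxonomy: owns the nodes and answers lineage questions.
package Demo::Taxonomy;
sub new {
    my $class = shift;
    return bless { nodes => {} }, $class;
}
sub add_node {
    my ($self, $node) = @_;
    $self->{nodes}{ $node->id } = $node;
    return $self;
}
# Walk parent ids from the given node up to the root; return names root-first.
sub classification {
    my ($self, $id) = @_;
    my (@names, %seen);
    while (defined $id) {
        my $node = $self->{nodes}{$id} or last;    # unknown id: stop walking
        last if $seen{$id}++;                      # guard against parent cycles
        unshift @names, $node->scientific_name;
        $id = $node->parent_id;
    }
    return @names;
}

package main;

my $tax = Demo::Taxonomy->new;
$tax->add_node(Demo::Node->new(-id => 1, -rank => 'family', -name => 'Hominidae'));
$tax->add_node(Demo::Node->new(-id => 2, -parent_id => 1, -rank => 'genus',
                               -name => 'Homo'));
$tax->add_node(Demo::Node->new(-id => 3, -parent_id => 2, -rank => 'species',
                               -name => 'Homo sapiens'));

print join('; ', $tax->classification(3)), "\n";    # prints "Hominidae; Homo; Homo sapiens"
```

Nothing in the node needs to change when a new rank turns up, because the walk does not care which ranks exist or in what order.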

Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from 
its docs, and I'm not sure what some of its methods are really supposed 
to do. Its intent does seem to be building a Taxonomy, though, so if it 
were used that way Node might not even need any methods for getting its 
parent or child nodes. The Factory or Taxonomy might be able to deal 
with that.

In short, I'm proposing a major change to Bio::Taxonomy::Node (make it 
just a node), and minor changes to (& implementation of) Bio::Taxonomy 
and Bio::Taxonomy::FactoryI such that they actually get used to do their 
jobs.


> That's also why I thought binomial() could stick around; if you have both
> the genus() and species() you could grab both using binomial(), building in
> special cases or error handling in case genus() or species() or both return
> undef.

binomial() would belong in (and is present in) Bio::Taxonomy. But in any 
case, it's not needed there either; if you want the binomial you just 
ask for the scientific_name of the species node in your Taxonomy, since 
the species node now holds the full scientific name, which is the binomial.

binomial() in Bio::Taxonomy could be reimplemented as:
    my $node = $self->get_node('species') or return;
    return $node->scientific_name;


>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
>> when they build the classification array. I had no intention of changing
>> this behaviour.
> 
> If you ignore nodes with 'no rank' there will be major problems when
> retrieving certain TaxID's from protein/nucleotide sequences. 

This is only for the classification array, which is meaningless anyway 
(there only for file-format compatibility). If you want the real 
information you ask your Bio::Taxonomy (which asks each of its nodes). 
This is the whole point of having Bio::Taxonomy in the first place.

It gives you great flexibility to do whatever you want to do.
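
A small sketch of the distinction, using plain data rather than any Bio::* class. The lineage below is illustrative and not copied from a real record; the only behaviour assumed is the one described above, that 'no rank' nodes are left out of the classification array.

```perl
use strict;
use warnings;

# Illustrative lineage, root first; each entry is [rank, name].
my @lineage = (
    [ 'no rank',      'cellular organisms' ],
    [ 'superkingdom', 'Bacteria' ],
    [ 'phylum',       'Actinobacteria' ],
    [ 'class',        'Actinobacteria' ],
);

# Flat classification array: unranked nodes are skipped.
my @classification = map { $_->[1] } grep { $_->[0] ne 'no rank' } @lineage;

# The full lineage still has every node, ranked or not.
my @all_names = map { $_->[1] } @lineage;

print scalar(@classification), " of ", scalar(@all_names), " nodes kept\n";
# prints "3 of 4 nodes kept"
```

The array is lossy by design. Anything that needs the unranked nodes would ask the taxonomy for them rather than the array.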


>>>       <TaxId>1760</TaxId>
>>>       <ScientificName>Actinobacteria (class)</ScientificName>
>>>       <Rank>class</Rank>
>> Ugh. I guess my proposal to remove <> bits via flatfile extends to
>> removing () bits via entrez. We don't need unique names; we can use
>> object_id() when uniqueness matters.
> 
> The XML parsing in Taxonomy::entrez will take care of the <tags> and retains
> the character data in between.

You misunderstood. I meant the <> bits I discussed at the very start of 
this thread, that flatfile gives you. Here I'm referring to getting rid 
of ' (class)' as well.
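
Something along these lines is what I have in mind. clean_name() is a hypothetical helper, not an existing BioPerl function, and the '<qualifier>' input is a made-up example. The sketch only strips a trailing parenthetical when it matches the node's own rank, on the assumption that other parentheses may be a legitimate part of a name.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the uniqueness qualifiers from a scientific name.
sub clean_name {
    my ($name, $rank) = @_;

    # Trailing "<...>" qualifier, as seen in flatfile names.
    $name =~ s/\s*<[^<>]*>\s*$//;

    # Trailing "(rank)" qualifier, removed only if it repeats the node's rank.
    $name =~ s/\s*\(\Q$rank\E\)\s*$// if defined $rank;

    return $name;
}

print clean_name('Actinobacteria (class)', 'class'), "\n";   # prints "Actinobacteria"
print clean_name('Example <qualifier>',    'genus'), "\n";   # prints "Example"
```

With the qualifiers gone two nodes can share a name, which is where object_id() comes in.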


> Any way we go about it here (keeping certain methods and tossing others,
> changing the data returned, etc), it looks like there will be API issues
> down the road which will directly affect anyone using tax data.  That
> affects bioperl-db directly as well as any other bioperl-based DB's which
> rely on tax data.  So we need to tread a bit carefully when making major
> changes to make sure that they work for bioperl-db and anywhere else that
> may require it.

Does anything make serious use of the current Bio::Taxonomy code? Or is 
existing code using Bio::Species instead?


