[Bioperl-l] Bio::Taxonomy changes

Mon Jul 24 17:34:29 EDT 2006

> I don't even think we would need SpeciesI - why would a species-
> ranked taxonomy node be so different from any other node such that it
> would need its own interface.
> 
> Chris - just one suggestion: take a step back and imagine a Bioperl
> in which Bio::Species had never existed. Instead, only taxonomy nodes
> existed, and code that can effectively deal with them, including
> filtering by rank. In this picture, what would make you want to
> introduce SpeciesI and Bio::Species?

Argh!!!  Just when I thought I could pull away...

Okay.  I thought it would be nice to have a class that could accomplish two
things:

1)  Act as a container for GenBank taxonomy information;
Bio::Taxonomy::Node, as written by Jason, was meant to be a replacement for
Bio::Species.
2)  Also act as a bridge, so you had the option to retrieve the Species
object from a sequence object and have it act like a Node (be db-aware
out-of-the-box, so to speak).

Also, I'm trying to follow the original idea as proposed by Jason (this is
from perldoc Bio::Taxonomy::Node):

DESCRIPTION
    This is the next generation (for Bioperl) of representing Taxonomy
    information. Previously all information was managed by a single object
    called Bio::Species. This new implementation allows representation of
    the intermediate nodes not just the species nodes and can relate their
    connections.

Which, to me, indicated that this would eventually replace Bio::Species (so,
in effect, must at least contain the relevant data for sequence objects w/o
being completely reliant on DB, yet still be DB-aware).  Everything about
Bio::Species on the wiki also leads me to believe that this was the original
intent for Bio::Taxonomy::Node.  

http://www.bioperl.org/wiki/Module:Bio::Species
http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data

And all the original methods (genus(), species(), etc.) also seem to
indicate this.

That's really it.  I couldn't give a toss about getting taxonomy information
directly from Bio::Species.  And you're right: in hindsight Bio::Species is
flawed.  However, it seemed from the beginning of this discussion with Sendu
and the proposed changes, that Bio::Species should stick around in some
capacity but should also be involved with Bio::Taxonomy (contrary to Jason's
idea above).  Now I'm hearing something completely different (Sendu still
argues that it should be involved).

I had originally wanted to start delegating everything over to
Taxonomy::Node about a month ago, when I found that it was remarkably easy
to do so.  However, when Sendu proposed making changes to remove methods in
Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would
prevent an easy transition over to Node, I felt that it would be harder to
effectively have it take over for Bio::Species when parsing SeqIO objects
(all the calls to genus/species/subspecies etc. methods would have to be
removed from all the classes which use Bio::Species).  Hence
Bio::Taxonomy::Species as a compromise.  Now it turns out no one wants to
have either Bio::Species (your 'contagion' references clue me in there) or
Bio::Taxonomy::Species.  

If we think it would be better to completely toss all this out the window
and use only a bare-bones Node, then I'm fine with that.   But if we go that
route we should just get rid of the Bio::Species 'disease' completely and
have things be much simpler.  Simple is good!

I think Node can still act as a viable container class for the tax data from
a GenBank file (its original purpose) as long as it has the very basic
methods for doing so.  That would require:

scientific_name() - ORGANISM line data
common_names() - which could hold common names (in parentheses on the SOURCE
line) and the abbreviated name (from the SOURCE line)
ncbi_taxid() - from the 'source' seqfeature (already there).

The lineage information and organelle information could be stored in Node or
in SimpleValue objects.  My vote is for the latter as there's no need for a
classification() container for Node, which you have repeatedly pointed out.
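
To make the bare-bones idea concrete, here is a minimal sketch of such a
container.  It is not the real Bio::Taxonomy::Node: the package name
Sketch::TaxNode, the constructor arguments and the %annotation hash (standing
in for SimpleValue objects) are all assumptions made up for illustration.
Only the three method names come from the list above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bare-bones node; NOT the real
# Bio::Taxonomy::Node class or its API.
package Sketch::TaxNode;

sub new {
    my ($class, %args) = @_;
    return bless {
        scientific_name => $args{scientific_name},
        common_names    => [ @{ $args{common_names} || [] } ],
        ncbi_taxid      => $args{ncbi_taxid},
    }, $class;
}

# ORGANISM line data
sub scientific_name {
    my $self = shift;
    $self->{scientific_name} = shift if @_;
    return $self->{scientific_name};
}

# Common and abbreviated names from the SOURCE line; arguments are appended
sub common_names {
    my $self = shift;
    push @{ $self->{common_names} }, @_ if @_;
    return @{ $self->{common_names} };
}

# TaxID taken from the 'source' seqfeature
sub ncbi_taxid {
    my $self = shift;
    $self->{ncbi_taxid} = shift if @_;
    return $self->{ncbi_taxid};
}

package main;

my $node = Sketch::TaxNode->new(
    scientific_name => 'Homo sapiens',
    common_names    => ['human'],
    ncbi_taxid      => 9606,
);

# Lineage and organelle live outside the node (assumed layout; in Bioperl
# these would be annotation objects rather than a plain hash).  The lineage
# here is abbreviated for the example.
my %annotation = (
    lineage   => [qw(Eukaryota Metazoa Chordata Mammalia Primates)],
    organelle => 'mitochondrion',
);

printf "%s (taxid %d), also known as: %s\n",
    $node->scientific_name, $node->ncbi_taxid, join(', ', $node->common_names);
```

Note that the node itself knows nothing about genus, species or
classification; anything rank-related is left to the taxonomy DB.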

> Frankly, I don't see anything. I.e., the only reason is backward
> compatibility (which is a valid reason), but let's not glorify
> Bio::Species by adding ill-conceived interfaces.

I think we should just get rid of Bio::Species completely.  We would need to
go in and rework species parsing in the SeqIO modules that use Bio::Species,
but that would only make things simpler, not more complex.  Get rid of
trying to figure out what is a genus or species based on the GenBank
information only, and have the bridge between the sequences be stored in a
Taxonomy::Node object (which should contain the NCBI TaxID, so then it can
use the associated DB object to traverse up and down other nodes).  The
interface idea was a proposed compromise, i.e. my 'bridge' between GenBank
taxonomy hell and Bio::Taxonomy bliss, and intended to follow what I thought
was Jason's original intent for Bio::Taxonomy::Node.  Nothing more.  
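
As a rough illustration of that TaxID-based bridge, the sketch below lets a
node ask a DB object for its parent and walk up until it hits a given rank,
instead of guessing genus and species from the GenBank text.  Everything here
is hypothetical: Sketch::TaxDB, Sketch::DBNode, get_node(), parent() and
ancestor_of_rank() are names invented for the example, not the
Bio::DB::Taxonomy API, and the four-entry table is a toy stand-in for a real
taxonomy database.

```perl
use strict;
use warnings;

# Hypothetical in-memory taxonomy DB keyed by NCBI TaxID (toy data only).
package Sketch::TaxDB;

sub new {
    my ($class, %nodes) = @_;
    return bless { nodes => {%nodes} }, $class;
}

# Return a DB-aware node for a TaxID, or undef if the ID is unknown
sub get_node {
    my ($self, $taxid) = @_;
    my $rec = $self->{nodes}{$taxid} or return;
    return Sketch::DBNode->new(%$rec, ncbi_taxid => $taxid, db => $self);
}

# Hypothetical node that keeps a handle on the DB it came from.
package Sketch::DBNode;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub ncbi_taxid      { return $_[0]{ncbi_taxid} }
sub scientific_name { return $_[0]{scientific_name} }
sub rank            { return $_[0]{rank} }

# Ask the DB for the parent node; undef at the top of the toy table
sub parent {
    my $self = shift;
    my $pid  = $self->{parent_id};
    return unless defined($pid) && $self->{db};
    return $self->{db}->get_node($pid);
}

# Walk up from this node until a node of the requested rank is found
sub ancestor_of_rank {
    my ($self, $rank) = @_;
    for (my $n = $self; $n; $n = $n->parent) {
        return $n if $n->rank eq $rank;
    }
    return;
}

package main;

my $db = Sketch::TaxDB->new(
    9606   => { scientific_name => 'Homo sapiens', rank => 'species',   parent_id => 9605 },
    9605   => { scientific_name => 'Homo',         rank => 'genus',     parent_id => 207598 },
    207598 => { scientific_name => 'Homininae',    rank => 'subfamily', parent_id => 9604 },
    9604   => { scientific_name => 'Hominidae',    rank => 'family' },   # table truncated here
);

# A sequence object would only need to carry the TaxID (or this node)
my $species = $db->get_node(9606);
my $genus   = $species->ancestor_of_rank('genus');

print $genus->scientific_name, "\n";
```

With this layout the SeqIO parsers would only have to record the TaxID and
scientific name; the genus is whatever the DB says it is.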

> > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> > specifically, a SpeciesNode for species ranks or below, and a Node for
> > anything else.
> 
> Like I said before, SpeciesNode or whatever it's called would draw
> its right of existence solely from backward compatibility - don't use
> it for anything else. And if you can achieve backward compatibility
> by other means, don't even create a SpeciesNode.

Agreed.  But, if there is such venom towards Bio::Species, why not put it
out of its misery as well?  Seems like it has outlived its usefulness.

Chris
