[Bioperl-l] Bio::Taxonomy changes

Wed Jul 26 11:15:28 EDT 2006

I advocate anything other than Bio::Species, as long as it gives you the
option to use lookups for correct taxonomic information rather than
guesswork (which is what the current Bio::Species does).  So, you could
pretty much replace Species immediately with a DB-aware container object
with simple get/sets.  As of now, that would be Node or Taxonomy.  I have
done this already, just haven't committed it
yet.  And, when I mentioned having freedom to do what you want with
Bio::Taxonomy, that includes all of it (including Node, Tree, etc).  We just
want it to be reasonable and not 'duct tape' for the various Bio::Species
mistakes of the past.

I don't think the problem here is really that complicated (still, the only
thing is the lineage stuff in a sequence file, right?).  

> > As Bio::Species will be deprecated, you can use that method in a dual,
> > sneaky way: 1) directly store the lineage information,
> 
> No. Lineage information must be in the form of Nodes or you can't answer
> lineage-related taxonomic questions.

You must have a way to store the 'horrible lineage information' data, as is,
for those users who do not care about taxonomy and just want to convert seq
streams.  You shouldn't burden the everyday user with something as
specialized as finding correct taxonomic information via DB lookups for one
particular purpose (screening sequences, as Hilmar pointed out, was one
possibility).

I don't care how, but store lineage information as it appears in the file
(scalar string) or in a simple data structure (array, maybe?) capable of
retaining the information in some way.  There are many many ways of doing
this which I have previously pointed out; take your pick.
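
To make 'take your pick' a bit more concrete, here is a minimal sketch of
the two simplest options (raw scalar string, or a plain array).  It is in
Python only for brevity and is not BioPerl code; the function names and the
sample lineage line are made up for illustration, and it assumes the usual
GenBank style of semicolon-separated names ending in a period.

```python
# Hypothetical sketch: keep the lineage as it appears in the file, either as
# the raw string or as a plain list of names.  No ranks, no DB lookups.

RAW = "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia; Homo."


def split_lineage(raw):
    """Split a semicolon-separated lineage string into a list of names."""
    return [name.strip() for name in raw.rstrip(".").split(";") if name.strip()]


def join_lineage(names):
    """Rebuild the lineage line from the list, in the original order."""
    return "; ".join(names) + "."


names = split_lineage(RAW)
rebuilt = join_lineage(names)
```

Either form is enough to write the line back out unchanged, which is all a
plain SeqIO conversion needs.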

Hilmar, in a previous post, told me to take a step back and contemplate a
world w/o Bio::Species, where you would design a system capable of dealing
with sequence file taxonomic data in a way that allows you to get correct
tax information when needed via NCBI Taxonomy data, yet not sacrifice speed
if you're just interested in converting sequences via SeqIO.  Would you
design a Bio::Species class, then?  Would you attempt to spend time parsing
out species and genus information, when the correct data is sitting on the
NCBI server or in a local flatfile?  No.  You would retain the minimal data
necessary in an object for reading and writing data, but have the >option<
available to run a lookup.  Therefore, Bio::Taxonomy::Node was born.  A
little prematurely, yes.  Probably needed to bake a bit more...
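
As a rough sketch of that 'minimal data plus optional lookup' idea (again
Python for brevity, not the actual Bio::Taxonomy::Node code; the class name,
method names and the lookup callable are all hypothetical):

```python
# Hypothetical sketch: hold the lineage exactly as read from the file, and
# only consult a taxonomy source when the caller explicitly asks for it.

class LineageHolder:
    def __init__(self, raw_lineage, lookup=None):
        self.raw_lineage = raw_lineage  # kept verbatim for reading/writing
        self._lookup = lookup           # optional callable: raw string -> list
        self._resolved = None           # cache for the looked-up lineage

    def for_output(self):
        """Return the lineage as it appeared in the file (no lookup)."""
        return self.raw_lineage

    def resolved(self):
        """Return the looked-up lineage, running the lookup at most once."""
        if self._lookup is None:
            raise RuntimeError("no taxonomy lookup configured")
        if self._resolved is None:
            self._resolved = self._lookup(self.raw_lineage)
        return self._resolved


# Stand-in for a real taxonomy DB; it just counts how often it is called.
calls = []


def fake_lookup(raw):
    calls.append(raw)
    return ["Eukaryota", "Metazoa", "Homo"]


holder = LineageHolder("Eukaryota; Metazoa; Homo.", lookup=fake_lookup)
plain = holder.for_output()   # conversion path: no lookup happens
first = holder.resolved()     # opt-in path: lookup runs once
second = holder.resolved()    # served from the cache
```

The point is only that the conversion path never pays for a lookup; how the
lookup itself is implemented is a separate question.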

Anyway, we must eventually sever our reliance on Bio::Species in order to
deprecate it, so the lineage information must be contained, as it appears in
the file, somewhere else.  

And my point about the Bio::Taxonomy classify() method was not that you
should use it as is; you could sneak in your own data if needed.  It was an
example of one possible way of containing the lineage data, not meant to be
the only way.  It's up to you how you want to implement it.

I think the classes that are currently in place are more than capable of
handling the job.  Hence my statement before that you are trying to get too
many things going right out the starting gate.  Start simply by replacing
Bio::Species, then worry about other issues.  If you think that a
specialized class would work, fine, but IMHO I don't think it's absolutely
necessary.  I had proposed such a class before (more like a
Bio::Species-like Tax object) but was shut down, and rightly so; it's
unnecessarily complicated and 'contaminates' Bio::Taxonomy with extra
unnecessary methods (classification(), genus(), and so on).

My last proposal was to eventually strip out the unreliable taxonomic
parsing in the various SeqIO modules and replace it with something simple,
which seemed to be a consensus among us all.  This has to do with Hilmar's
post-apocalyptic vision of a Bio::Species-free world.  That will eventually
happen, and Bioperl will eventually switch over completely to
Bio::Taxonomy::Whatever.  And Bio::Species can join BPLite and other
deprecated modules in the BioPerl Boot Hill.  

But, for now that can't happen.  We all strive for the best information
possible.  However, you can't sacrifice the needs of other users, the
majority of whom probably don't care squat about taxonomy, to your (our) own
needs.  As I have repeatedly stated, simple is good.  We can't just usurp
the API for our own wishes w/o warning, so the change has to be gradual and
Bio::Species must stick around for the time being.  And we must make DB
lookups optional or the villagers will be storming the castle.

Listen, Sendu.  If you can wait a couple of weeks for further discussion
then we can slog on with this.  But right now I just don't have any more
time for this, sorry.  You can have the last word and I'll respond when I
get back.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Wednesday, July 26, 2006 7:49 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
> 
> Chris Fields wrote:
> > We're giving you the freedom to do what you want to Bio::Taxonomy.
> 
> I don't want to do anything with Bio::Taxonomy any more. I've already
> shown that it isn't suitable for the job. Regardless of how it is
> implemented, the entire idea of a class that contains Nodes isn't
> appropriate, for reasons already stated.
> 
> 
> > Realize that the only contentious issue here is
> > that horrible lineage line in the GenBank file.  We should have a way to
> > rebuild it as it was from the original file (i.e. not rebuild it from
> > scratch with DB lookups by default).  However, you should also have the
> > option to rebuild it from lookups (i.e. correctly), which you could do
> > with a Taxonomy.
> 
> And I've already shown how rebuilding with a Taxonomy is very far from
> ideal, while switching db_handle on a Node would be perfect. Why are you
> now advocating Taxonomy when there is no reason to?
> 
> 
> > Note this Bio::Taxonomy method:
> >
> >        classify
> >
> >         Title   : classify
> >         Usage   : @obj[][0-1] = taxonomy->classify($species);
> >         Function: return a ranked classification
> >         Returns : @obj of taxa and ranks as word pairs separated by "@"
> >         Args    : Bio::Species object
> 
> Note that all this method does is let you combine a list of rank names
> with the classification array in a Bio::Species, spitting out some weird
> data structure. It is only of interest to Bio::Taxonomy::Tree.
> We're in the situation where we don't know the rank names corresponding
> to the classification array in a Bio::Species generated by genbank et
> al. So classify() is of zero value.
> 
> 
> > As Bio::Species will be deprecated, you can use that method in a dual,
> > sneaky way: 1) directly store the lineage information,
> 
> No. Lineage information must be in the form of Nodes or you can't answer
> lineage-related taxonomic questions.
> 
> 
> > 2) return the real one (DB lookups) if needed
> 
> Messy. Doing it with Node would be far superior.
> 
> 
> Again, Node works all the time, while Taxonomy would work badly or not
> at all some of the time. Rather than suggest ways of using Taxonomy,
> tell me what is wrong with my current Node plan.