[Bioperl-l] Bio::*Taxonomy* changes

Hilmar Lapp hlapp at gmx.net
Wed Jul 26 11:38:50 EDT 2006


On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote:

> Chris Fields wrote:
>>
>>> It seems like the main problem with Node right now is that it has
>>> classification() and things like genus(). I propose pure Node method
>>> solutions to answer the questions classification() and genus() were
>>> implemented to answer, but in a better, cruft-free way.
>>>
>>> Bio::DB::Taxonomy::genbank anyone?
>>
>> Ach...  You're compromising here;
>
> No, I don't think so. Let me explain...
> (another very long email, but with the same conclusion as above)

Sorry, can you summarize this in a few sentences? If you do want  
feedback from me you really need to be more concise.

	-hilmar


>
>
>> 1) Switch out Bio::Species with Node or Taxonomy; relocate other
>> information temporarily (Bio::Species, get/sets in Seq object,
>> SimpleValue).  Leave Bio::Species in for the time being, but don't
>> bother making any additional changes to it.
> [...]
>> Hence Hilmar's suggestion to use a $seq->taxon() method to return a
>> Node/Taxonomy, and a $seq->species() would still return a
>> Bio::Species object.  It's redundant,
>
> As I see it, the problem to be solved is this:
>
> a) A node should just be a node, holding only information about itself
> (but this can include information on who its parent is, and methods
> relating to getting its parents/children as new objects - but the data
> of its parents/children must never be stored on itself).
>
> b) Bio::Species isn't very good at its job; you can't ask reasonable
> taxonomic questions of it and get correct answers.
>
> c) We need to transition Bio::Species to something better - something
> that lets us do the same job as Bio::Species, but do it better. An
> important aspect of 'better' is that we can switch from the taxonomic
> information in a genbank file or similar to the information in a
> taxonomic database if we want certain taxonomic questions answered
> correctly. But also, we should be able to answer all questions with a
> good chance of a correct answer even without database
> access/installation.
>
> There are a variety of possible solutions. How can we decide which is
> best? What would a good solution be?
>
> The 'something better' we transition Bio::Species to will become the
> preferred (or at least de facto standard) way of dealing with taxonomic
> information in bioperl. This taxonomic module (or set of modules) must
> be able to model taxonomic information anywhere it is found - databases
> or genbank files or anything else. If it can't, it would be
> fundamentally flawed.
>
> d) We can immediately discount any solution that involves storing some
> taxonomic information outside of the tax module. If we find ourselves
> putting lineage data in a genbank file in SimpleValue objects or
> similar, we can be pretty sure we've used a poor solution to the
> problem. That would be a compromise.
>
> e) If the thing we transition Bio::Species to can't do everything
> Bio::Species did (doing it in a different and better way is fine of
> course), it's not suitable for transitioning to (this is why Node needed
> all the cruft added to it before it was a suitable candidate). If it
> /can/ do everything Bio::Species did, there would be no harm immediately
> making Bio::Species inherit from the new tax module, reimplementing
> Bio::Species as necessary but making no API change. So any solution that
> would /require/ $seq->taxon() and $seq->species() wouldn't be a good
> one, and would be a compromise. But we do want to get rid of
> Bio::Species eventually, so I'm not saying we shouldn't have a
> $seq->taxon() or similar, only that either method would give you the
> same type of object with the same methods available
> ($seq->taxon->isa('tax module') && $seq->species->isa('Bio::Species')
> && $seq->species->isa('tax module')).
>
>
> I see 2 possible solutions to the problem. What should 'tax module' be?:
>
> 1) Bio::Taxonomy or other similar class that is a container of multiple
> nodes. Naively this makes logical sense since one of the jobs
> Bio::Species has is to store a lineage, and a lineage is best
> represented as a set of Nodes. So let's have a single object with all
> our Nodes in it. Problems:
>
> Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
> requires that you know the ranks and order of ranks of all your input
> nodes before you input them. It requires that all ranks have unique
> names. It doesn't handle ranks of 'no rank'. You can't have more than
> one lineage in an instance because you can't have two nodes with the
> same rank. If you don't know the ranks of your nodes (ie. genbank) there
> is no way to maintain the order of your lineage because there is no
> modelling of parent/child.
> I had planned to re-write it such that the rank-centric implementation
> was removed and we had parent/child implementation instead. But then
> there is nothing to stop you adding nodes that are disconnected from the
> others, creating a broken mess.
>
> Bio::Taxonomy::Tree might have been a little more suitable because it
> implements Bio::Tree::TreeI, but sadly it is also rank-centric and
> actually requires input of both Bio::Species and Bio::Taxonomy objects
> to its most useful methods.
>
> More important than issues with current implementations of
> node-container classes, such classes are unable to let us solve problem
> c) in a good way, and also leave us potentially storing in memory Node
> objects representing the same taxonomic node multiple times in different
> instances of the node-container. For problem c) if we were to switch
> from genbank nodes to database the solution is to delete all the nodes
> in the container and then get them all again from the database. What if
> you didn't even have a lineage-related question? You've just retrieved
> 10s of nodes from the database for no reason (and then store them), when
> all you wanted was accurate information on the node you were
> interested in.
>
> All in all, it's pretty horrible. Unsuitable implementations plus excess
> database retrieval plus massive waste of memory with duplicated nodes
> does not equal a good solution.
>
>
> 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
> methods binomial(), species(), genus(), sub_species(),
> variant(), organelle(), classification() and show_all(). Except for
> organelle() which doesn't belong in taxonomy, all of these Bio::Species
> 'questions' can still be answered by Node - just not in a single method
> call. I outlined how to answer them in the previous post. For backward
> compatibility make Bio::Species a Node and implement the suggested way
> of answering the questions the proper 'Node' way under those methods.
> Problems:
>
> Well, those questions can't actually be answered by Node if the starting
> point was genbank data or manually created Nodes. The solution is clean
> and simple: Bio::DB::Taxonomy::genbank or perhaps better named
> Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
> ordered list of names - I don't see anything inherently wrong or ugly
> with that). Then everything magically just works. We get all the power
> to ask all our questions that Node has already when working with the
> ncbi database, but we get it when working with genbank data. We suffer
> none of the problems of a node-container class. We can easily switch
> databases on the fly.
>
> What's not to like?
>
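
For reference, the list-backed database idea in option 2 of the quoted message can be sketched in a few lines of plain Perl. This is only an illustration under stated assumptions: My::ListTaxDB and My::Node are hypothetical names, not existing BioPerl classes, and the lineage used is an abbreviated example rather than a real genbank record. Each node holds only its own data plus its parent's id, and the parent object is fetched from the database on demand, as point a) asks.

```perl
use strict;
use warnings;

# Hypothetical list-backed taxonomy database (not the BioPerl API).
package My::ListTaxDB;

sub new {
    my ($class, @names) = @_;    # names ordered from root to leaf
    my $self = bless { nodes => {} }, $class;
    my ($id, $parent_id) = (0, undef);
    for my $name (@names) {
        $id++;
        # Each node stores its own data and the id of its parent only.
        $self->{nodes}{$id} = My::Node->new(
            id => $id, name => $name, parent_id => $parent_id, db => $self,
        );
        $parent_id = $id;
    }
    $self->{leaf_id} = $id;
    return $self;
}

sub get_node { my ($self, $id) = @_; return $self->{nodes}{$id} }
sub leaf     { my ($self) = @_; return $self->get_node($self->{leaf_id}) }

# Hypothetical node class: knows its database, never stores its parents.
package My::Node;
use Scalar::Util qw(weaken);

sub new {
    my ($class, %args) = @_;
    my $self = bless {%args}, $class;
    weaken($self->{db});         # avoid a reference cycle with the database
    return $self;
}

sub name { $_[0]{name} }

# The parent object is looked up in the database each time it is needed.
sub parent {
    my ($self) = @_;
    return unless defined $self->{parent_id};
    return $self->{db}->get_node($self->{parent_id});
}

# Answer a classification()-style question by walking up through parents.
sub lineage_names {
    my ($self) = @_;
    my @names;
    for (my $node = $self; $node; $node = $node->parent) {
        unshift @names, $node->name;
    }
    return @names;
}

package main;

my $db = My::ListTaxDB->new(
    qw(Eukaryota Metazoa Chordata Mammalia Primates Hominidae Homo),
    'Homo sapiens',
);
my $leaf = $db->leaf;
print join('; ', $leaf->lineage_names), "\n";
print $leaf->parent->name, "\n";
```

Switching to a different data source would then mean asking another database object for the same node, rather than emptying and refilling a container of nodes.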

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================






