[Bioperl-l] Bio::*Taxonomy* changes
Hilmar Lapp
hlapp at gmx.net
Wed Jul 26 15:38:50 UTC 2006
On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote:
> Chris Fields wrote:
>>
>>> It seems like the main problem with Node right now is that it has
>>> classification() and things like genus(). I propose pure Node method
>>> solutions to answer the questions classification() and genus() were
>>> implemented to answer, but in a better, cruft-free way.
>>>
>>> Bio::DB::Taxonomy::genbank anyone?
>>
>> Ach... You're compromising here;
>
> No, I don't think so. Let me explain...
> (another very long email, but with the same conclusion as above)
Sorry, can you summarize this in a few sentences? If you do want
feedback from me you really need to be more concise.
-hilmar
>
>
>> 1) Switch out Bio::Species with Node or Taxonomy; relocate other
>> information temporarily (Bio::Species, get/sets in Seq object,
>> SimpleValue). Leave Bio::Species in for the time being, but don't
>> bother making any additional changes to it.
> [...]
>> Hence Hilmar's suggestion to use a $seq->taxon() method to return a
>> Node/Taxonomy, and a $seq->species() would still return a
>> Bio::Species object. It's redundant,
>
> As I see it, the problem to be solved is this:
>
> a) A node should just be a node, holding only information about itself
> (but this can include information on who its parent is, and methods
> relating to getting its parents/children as new objects - but the data
> of its parents/children must never be stored on itself).
>
> b) Bio::Species isn't very good at its job; you can't ask reasonable
> taxonomic questions of it and get correct answers.
>
> c) We need to transition Bio::Species to something better - something
> that lets us do the same job as Bio::Species, but do it better. An
> important aspect of 'better' is that we can switch from the taxonomic
> information in a genbank file or similar to the information in a
> taxonomic database if we want certain taxonomic questions answered
> correctly. But also, we should be able to answer all questions with a
> good chance of a correct answer even without database
> access/installation.
>
> There are a variety of possible solutions. How can we decide which is
> best? What would a good solution be?
>
> The 'something better' we transition Bio::Species to will become the
> preferred (or at least de facto standard) way of dealing with taxonomic
> information in bioperl. This taxonomic module (or set of modules) must
> be able to model taxonomic information anywhere it is found - databases
> or genbank files or anything else. If it can't, it would be
> fundamentally flawed.
>
> d) We can immediately discount any solution that involves storing some
> taxonomic information outside of the tax module. If we find ourselves
> putting lineage data in a genbank file in SimpleValue objects or
> similar, we can be pretty sure we've used a poor solution to the
> problem. That would be a compromise.
>
> e) If the thing we transition Bio::Species to can't do everything
> Bio::Species did (doing it in a different and better way is fine of
> course), it's not suitable for transitioning to (this is why Node needed
> all the cruft added to it before it was a suitable candidate). If it
> /can/ do everything Bio::Species did, there would be no harm in
> immediately making Bio::Species inherit from the new tax module,
> reimplementing Bio::Species as necessary but making no API change. So
> any solution that would /require/ both $seq->taxon() and
> $seq->species() wouldn't be a good one, and would be a compromise. But
> we do want to get rid of Bio::Species eventually, so I'm not saying we
> shouldn't have a $seq->taxon() or similar, only that either method
> would give you the same type of object with the same methods available
> ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species')
> && $seq->species->isa('tax module'))).
>
>
> I see 2 possible solutions to the problem. What should 'tax module'
> be?
>
> 1) Bio::Taxonomy or other similar class that is a container of multiple
> nodes. Naively this makes logical sense since one of the jobs
> Bio::Species has is to store a lineage, and a lineage is best
> represented as a set of Nodes. So let's have a single object with all
> our Nodes in it. Problems:
>
> Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
> requires that you know the ranks and order of ranks of all your input
> nodes before you input them. It requires that all ranks have unique
> names. It doesn't handle ranks of 'no rank'. You can't have more than
> one lineage in an instance because you can't have two nodes with the
> same rank. If you don't know the ranks of your nodes (i.e. genbank)
> there is no way to maintain the order of your lineage because there is
> no modelling of parent/child.
> I had planned to re-write it such that the rank-centric implementation
> was removed and we had a parent/child implementation instead. But then
> there is nothing to stop you adding nodes that are disconnected from the
> others, creating a broken mess.
>
> Bio::Taxonomy::Tree might have been a little more suitable because it
> implements Bio::Tree::TreeI, but sadly it is also rank-centric and
> actually requires input of both Bio::Species and Bio::Taxonomy objects
> to its most useful methods.
>
> More important than issues with current implementations of
> node-container classes, such classes are unable to let us solve problem
> c) in a good way, and also leave us potentially storing in memory Node
> objects representing the same taxonomic node multiple times in different
> instances of the node-container. For problem c) if we were to switch
> from genbank nodes to database the solution is to delete all the nodes
> in the container and then get them all again from the database. What if
> you didn't even have a lineage-related question? You've just retrieved
> tens of nodes from the database for no reason (and then stored them),
> when all you wanted was accurate information on the node you were
> interested in.
>
> All in all, it's pretty horrible. Unsuitable implementations plus excess
> database retrieval plus massive waste of memory with duplicated nodes
> does not equal a good solution.
>
>
> 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
> methods binomial(), species(), genus(), sub_species(),
> variant(), organelle(), classification() and show_all(). Except for
> organelle() which doesn't belong in taxonomy, all of these Bio::Species
> 'questions' can still be answered by Node - just not in a single method
> call. I outlined how to answer them in the previous post. For backward
> compatibility make Bio::Species a Node and implement the suggested way
> of answering the questions the proper 'Node' way under those methods.
> Problems:
>
> Well, those questions can't actually be answered by Node if the starting
> point was genbank data or manually created Nodes. The solution is clean
> and simple: Bio::DB::Taxonomy::genbank or perhaps better named
> Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
> ordered list of names - I don't see anything inherently wrong or ugly
> with that). Then everything magically just works. We get all the power
> to ask all our questions that Node has already when working with the
> ncbi database, but we get it when working with genbank data. We suffer
> none of the problems of a node-container class. We can easily switch
> databases on the fly.
>
> What's not to like?
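
For readers skimming the quoted proposal, here is a rough, self-contained sketch of the "database built from an ordered list of names" idea in option 2, combined with point a) (a node holds only its own data and asks its database for its parent). It uses core Perl only. The package names Sketch::ListDB and Sketch::Node, and every method on them, are hypothetical and made up for illustration; they are not BioPerl's API and not necessarily what is being proposed. The sketch also assumes each name occurs only once in the list.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed list-backed taxonomy database.
# It is built from an ordered list of names, root first, and assumes
# every name in the list is unique.
package Sketch::ListDB;

sub new {
    my ($class, @names) = @_;
    my $self = bless { parent_of => {}, seen => {} }, $class;
    my $parent;
    for my $name (@names) {
        $self->{seen}{$name}      = 1;
        $self->{parent_of}{$name} = $parent;    # undef for the root
        $parent = $name;
    }
    return $self;
}

# Return a fresh node object for a name, or nothing if it is unknown.
sub get_node {
    my ($self, $name) = @_;
    return unless defined $name && $self->{seen}{$name};
    return Sketch::Node->new($name, $self);
}

# Name of the parent of the given name (undef for the root).
sub parent_name {
    my ($self, $name) = @_;
    return $self->{parent_of}{$name};
}

# A node stores only its own name and a handle to its database.
package Sketch::Node;

sub new {
    my ($class, $name, $db) = @_;
    return bless { name => $name, db => $db }, $class;
}

sub name { $_[0]{name} }

# Ask the database for the parent; no parent data is kept on the node.
sub parent {
    my ($self) = @_;
    my $db = $self->{db};
    return $db->get_node( $db->parent_name( $self->{name} ) );
}

# Walk up to the root, returning names ordered from root to this node.
sub lineage {
    my ($self) = @_;
    my @names;
    for (my $node = $self; $node; $node = $node->parent) {
        unshift @names, $node->name;
    }
    return @names;
}

package main;

# An ordered lineage such as one might read from a sequence record.
my $db = Sketch::ListDB->new(
    qw(Eukaryota Metazoa Chordata Mammalia Primates Hominidae Homo),
    'Homo sapiens',
);
my $node = $db->get_node('Homo sapiens');

print join('; ', $node->lineage), "\n";
print 'parent: ', $node->parent->name, "\n";
```

In this sketch a genus()-style question becomes a walk up the parent chain instead of a stored field, and pointing a node at a different database only means swapping the handle it holds.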
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================