[Bioperl-l] Bio::Taxonomy changes

Wed Jul 19 00:38:05 EDT 2006

I think we should wait a bit for any dramatic changes but implement  
the ones there seems to be a consensus on.  I understand your  
reasoning for taking this on but I'm not sure completely revamping  
Bio::Taxonomy w/o input from the core developers is wise, especially  
since we do NOT know who uses it, why they use it, and how
changing/removing methods will affect their code.  We are doing nothing
productive here by constantly butting heads on this and having  
different opinions on what we think Bio::Taxonomy/Bio::Species is  
best suited for, when neither one of us is actually sure about who  
uses it and why.  A reasonable solution is there but we must rely on  
outside opinions in order to reach it, so I propose a short  
moratorium on changes to Bio::Taxonomy/Bio::Species that radically  
redefine the API on either class.  BTW, for anybody following, I'm
perfectly comfortable if Sendu takes the lead on this and implements  
his changes; I'm just not sure about stripping the class down to the  
bare minimum.

So far, the only thing that has been proposed (and accepted by all)  
is that scientific_name() hold the data for that tag in a node.  I  
think most here would agree that's fine; I've already added a get/set  
to Bio::Species but haven't committed it yet.  However, what you  
propose doing below is refactoring the code and changing the API.  I  
agree there needs to be an overhaul but we can't do this w/o guidance  
or input from the GBE (Great Bioperl Elders).  I would like some of
the 'senior' core developers to chime in a bit more with their thoughts
on this.  Jason also mentioned somewhere that any changes for
Taxonomy/Species should be tracked on the wiki somewhere as well to make sure
everything is kosher and keep users up-to-date.  I would like his  
input here but I think he's still incommunicado at the moment.
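
For anybody following, the get/set mentioned above could look roughly like the sketch below. This is only an illustration of the usual combined getter/setter idiom; My::SpeciesSketch and its -scientific_name option are made-up names for the example, not the uncommitted Bio::Species code.

```perl
use strict;
use warnings;

# Hypothetical stand-in class for illustration; NOT the real Bio::Species.
package My::SpeciesSketch;

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    # Route the constructor option through the accessor so there is
    # only one place that writes the value.
    $self->scientific_name($args{-scientific_name})
        if defined $args{-scientific_name};
    return $self;
}

# Combined getter/setter: stores a value when one is passed and
# always returns the current value (undef if never set).
sub scientific_name {
    my $self = shift;
    $self->{scientific_name} = shift if @_;
    return $self->{scientific_name};
}

package main;

my $sp = My::SpeciesSketch->new(-scientific_name => 'Homo sapiens');
print $sp->scientific_name, "\n";
```

Existing methods stay untouched in this sketch; the accessor only adds a slot for the scientific name of the node.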

Chris

On Jul 18, 2006, at 5:50 PM, Sendu Bala wrote:

> Chris Fields wrote:
>> ...
>>> [regarding changes to Bio::Taxonomy::Node]
>>>
>>> Actually, I'm really strongly leaning toward getting rid of the
>>> following methods and new() options (and giving up entirely on being
>>> able to keep 'sapiens' somewhere):
>>>
>>> -organelle, organelle()
>>> -division, division()
>>> -sub_species, sub_species()
>>> -variant, variant()
>>> species(), validate_species_name()
>>> genus()
>>> binomial()
>>
>> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
>> have Bio::Species delegate methods to Bio::Taxonomy::Node.  So any changes
>> to Node will affect Bio::Species to some degree.
>
> I see from the original postings that Node was intended to be like
> Species, but I don't think it makes the slightest bit of sense. A
> /single/ Node need only (must only!) represent the information for a
> single node in the taxonomy. Or else what do these objects mean? What is
> the object model? It's bad bad bad for it to be sensible one way (when
> you're making your own taxonomy by making your own nodes) and
> nonsensical another (when we stuff in methods so that Bio::Species is
> happy). The way Node is written right now, and what you're suggesting,
> is that we stuff the entire Taxonomy into the Node. Well, except that
> you don't even have methods for every taxonomic level - there is genus()
> but no subphylum(). I can't emphasise strongly enough how insane all
> this is.
>
> The correct thing for Bio::Species to interact with is Bio::Taxonomy.
> Bio::Taxonomy is a collection of Nodes and has the sort of methods that
> Bio::Species would need to delegate its current functionality.
>
> I'm quite willing to do a proper overhaul here so everything makes
> sense. You either make your own nodes and add these to a Taxonomy or use
> a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy
> lets you discover the classification of any node it contains.
> Bio::Species could implement a method like genus() by:
> my $node = $taxonomy->get_node('genus') || return;
> return $node->scientific_name;
>
> Bio::Taxonomy isn't perfect, but I can certainly get it to do its job.
> I'd probably make it rank-name and order independent for starters.
>
> Bio::Taxonomy::Node needs to be reduced right down to just hold data
> about the node it represents, and possibly its parent node id (or other
> way of getting to its parent). So now I'm proposing dropping the
> classification() method from Node as well. It's simply not necessary;
> Bio::Taxonomy should give you that information.
>
> Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from
> its docs, but it could be used to build a Taxonomy (that seems to be its
> intent, I'm just not sure what some of the methods are really supposed
> to do) such that Node might not even need any methods for getting its
> parent or child nodes. The Factory or Taxonomy might be able to deal
> with that.
>
> In short, I'm proposing a major change to Bio::Taxonomy::Node (make it
> just a node), and minor changes to (& implementation of) Bio::Taxonomy
> and Bio::Taxonomy::FactoryI such that they actually get used to do their
> jobs.
>
>
>> That's also why I thought binomial() could stick around; if you have both
>> the genus() and species() you could grab both using binomial(), building in
>> special cases or error handling in case genus() or species() or both return
>> undef.
>
> binomial() would belong in (and is present in) Bio::Taxonomy. But in any
> case, it's not needed there either; if you want the binomial you just
> ask for the scientific_name of the species node in your Taxonomy, since
> this now contains the actual scientific name == binomial.
>
> binomial() in Bio::Taxonomy could be reimplemented as:
> my $node = $self->get_node('species') || return;
> return $node->scientific_name;
>
>
>>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
>>> when they build the classification array. I had no intention of changing
>>> this behaviour.
>>
>> If you ignore nodes with 'no rank' there will be major problems when
>> retrieving certain TaxID's from protein/nucleotide sequences.
>
> This is only for the classification array, which is meaningless anyway
> (there only for file-format compatibility). If you want the real
> information you ask your Bio::Taxonomy (which asks each of its nodes).
> This is the whole point of having Bio::Taxonomy in the first place.
>
> It gives you great flexibility to do whatever you want to do.
>
>
>>>>       <TaxId>1760</TaxId>
>>>>       <ScientificName>Actinobacteria (class)</ScientificName>
>>>>       <Rank>class</Rank>
>>> Ugh. I guess my proposal to remove <> bits via flatfile extends to
>>> removing () bits via entrez. We don't need unique names; we can use
>>> object_id() when uniqueness matters.
>>
>> The XML parsing in Taxonomy::entrez will take care of the <tags> and retains
>> the character data in between.
>
> You misunderstood. I meant the <> bits I discussed at the very start of
> this thread, that flatfile gives you. Here I'm referring to getting rid
> of ' (class)' as well.
>
>
>> Any way we go about it here (keeping certain methods and tossing others,
>> changing the data returned, etc), it looks like there will be API issues
>> down the road which will directly affect anyone using tax data.  That
>> affects bioperl-db directly as well as any other bioperl-based DB's which
>> rely on tax data.  So we need to tread a bit carefully when making major
>> changes to make sure that they work for bioperl-db and anywhere else that
>> may require it.
>
> Does anything make serious use of the current Bio::Taxonomy code? Or are
> they using Bio::Species?
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
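
To make the object model proposed in the quoted message concrete, here is a rough, self-contained sketch: a node holds only its own rank and name, a taxonomy is a collection of nodes, and a species object delegates genus() and binomial() to the taxonomy. The Sketch::* packages and their constructor options are hypothetical stand-ins written for this example; they are not the existing Bio::Taxonomy, Bio::Taxonomy::Node or Bio::Species code, and the real method signatures may differ.

```perl
use strict;
use warnings;

# All Sketch::* packages are hypothetical stand-ins for illustration.

package Sketch::Node;
# A node knows only about itself: its rank and its scientific name.
sub new {
    my ($class, %args) = @_;
    my $self = {
        rank            => $args{-rank},
        scientific_name => $args{-scientific_name},
    };
    return bless $self, $class;
}
sub rank            { return $_[0]->{rank} }
sub scientific_name { return $_[0]->{scientific_name} }

package Sketch::Taxonomy;
# A taxonomy is just a collection of nodes.
sub new {
    my ($class) = @_;
    my $self = { nodes => [] };
    return bless $self, $class;
}
sub add_node {
    my ($self, $node) = @_;
    push @{ $self->{nodes} }, $node;
    return $self;
}
# Return the first node with the given rank, or nothing if there is none.
sub get_node {
    my ($self, $rank) = @_;
    for my $node (@{ $self->{nodes} }) {
        return $node if defined $node->rank && $node->rank eq $rank;
    }
    return;
}
# The species node's scientific name is already the binomial.
sub binomial {
    my $self = shift;
    my $node = $self->get_node('species') || return;
    return $node->scientific_name;
}

package Sketch::Species;
# The species object keeps a taxonomy and delegates to it.
sub new {
    my ($class, $taxonomy) = @_;
    my $self = { taxonomy => $taxonomy };
    return bless $self, $class;
}
sub genus {
    my $self = shift;
    my $node = $self->{taxonomy}->get_node('genus') || return;
    return $node->scientific_name;
}
sub binomial { return $_[0]->{taxonomy}->binomial }

package main;

my $tax = Sketch::Taxonomy->new;
$tax->add_node(Sketch::Node->new(-rank => 'genus',   -scientific_name => 'Homo'));
$tax->add_node(Sketch::Node->new(-rank => 'species', -scientific_name => 'Homo sapiens'));

my $species = Sketch::Species->new($tax);
print $species->genus, "\n";
print $species->binomial, "\n";
```

In this sketch a rank that is missing from the taxonomy simply yields no value, so a caller asking for, say, a subphylum gets undef in scalar context rather than an error.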

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
