[Bioperl-l] Bio::*Taxonomy* changes
Chris Fields
cjfields at uiuc.edu
Wed Jul 19 00:38:05 EDT 2006
I think we should wait a bit on any dramatic changes but implement
the ones on which there seems to be a consensus. I understand your
reasoning for taking this on but I'm not sure completely revamping
Bio::Taxonomy w/o input from the core developers is wise, especially
since we do NOT know who uses it, why they use it, and how changing/
removing methods will affect their code. We are doing nothing
productive here by constantly butting heads on this and having
different opinions on what we think Bio::Taxonomy/Bio::Species is
best suited for, when neither one of us is actually sure about who
uses it and why. A reasonable solution is there but we must rely on
outside opinions in order to reach it, so I propose a short
moratorium on changes to Bio::Taxonomy/Bio::Species that radically
redefine the API on either class. BTW, for anybody following, I'm
perfectly comfortable if Sendu takes the lead on this and implements
his changes; I'm just not sure about stripping the class down to the
bare minimum.
So far, the only thing that has been proposed (and accepted by all)
is that scientific_name() hold the data for that tag in a node. I
think most here would agree that's fine; I've already added a get/set
to Bio::Species but haven't committed it yet. However, what you
propose doing below is refactoring the code and changing the API. I
agree there needs to be an overhaul but we can't do this w/o guidance
or input from the GBE (Great Bioperl Elders). I would like some of
the 'senior' core developers to chime in a bit more with their thoughts on
this. Jason also mentioned somewhere that any changes for Taxonomy/
Species should be tracked on the wiki somewhere as well to make sure
everything is kosher and keep users up-to-date. I would like his
input here but I think he's still incommunicado at the moment.
Chris
On Jul 18, 2006, at 5:50 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> ...
>>> [regarding changes to Bio::Taxonomy::Node]
>>>
>>> Actually, I'm really strongly leaning toward getting rid of the
>>> following methods and new() options (and giving up entirely on being
>>> able to keep 'sapiens' somewhere):
>>>
>>> -organelle, organelle()
>>> -division, division()
>>> -sub_species, sub_species()
>>> -variant, variant()
>>> species(), validate_species_name()
>>> genus()
>>> binomial()
>>
>> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
>> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes
>> to Node will affect Bio::Species to some degree.
>
> I see from the original postings that Node was intended to be like
> Species, but I don't think it makes the slightest bit of sense. A
> /single/ Node need only (must only!) represent the information for a
> single node in the taxonomy. Or else what do these objects mean? What is
> the object model? It's bad bad bad for it to be sensible one way (when
> you're making your own taxonomy by making your own nodes) and
> nonsensical another (when we stuff in methods so that Bio::Species is
> happy). The way Node is written right now, and what you're suggesting,
> is that we stuff the entire Taxonomy into the Node. Well, except that
> you don't even have methods for every taxonomic level - there is genus()
> but no subphylum(). I can't emphasise strongly enough how insane all
> this is.
>
> The correct thing for Bio::Species to interact with is Bio::Taxonomy.
> Bio::Taxonomy is a collection of Nodes and has the sort of methods that
> Bio::Species would need to delegate its current functionality.
>
> I'm quite willing to do a proper overhaul here so everything makes
> sense. You either make your own nodes and add these to a Taxonomy or use
> a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy
> lets you discover the classification of any node it contains.
> Bio::Species could implement a method like genus() by:
>     my $node = $taxonomy->get_node('genus') || return;
>     return $node->scientific_name;
>
> Bio::Taxonomy isn't perfect, but I can certainly get it to do its job.
> I'd probably make it rank-name and order independent for starters.
>
> Bio::Taxonomy::Node needs to be reduced right down to just hold data
> about the node it represents, and possibly its parent node id (or other
> way of getting to its parent). So now I'm proposing dropping the
> classification() method from Node as well. It's simply not necessary;
> Bio::Taxonomy should give you that information.
>
> Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from
> its docs, but it could be used to build a Taxonomy (that seems to be its
> intent, I'm just not sure what some of the methods are really supposed
> to do) such that Node might not even need any methods for getting its
> parent or child nodes. The Factory or Taxonomy might be able to deal
> with that.
>
> In short, I'm proposing a major change to Bio::Taxonomy::Node (make it
> just a node), and minor changes to (& implementation of) Bio::Taxonomy
> and Bio::Taxonomy::FactoryI such that they actually get used to do their
> jobs.
>
>
>> That's also why I thought binomial() could stick around; if you have both
>> the genus() and species() you could grab both using binomial(), building in
>> special cases or error handling in case genus() or species() or both return
>> undef.
>
> binomial() would belong in (and is present in) Bio::Taxonomy. But in any
> case, it's not needed there either; if you want the binomial you just
> ask for the scientific_name of the species node in your Taxonomy, since
> this now contains the actual scientific name == binomial.
>
> binomial() in Bio::Taxonomy could be reimplemented as:
>     my $node = $self->get_node('species') || return;
>     return $node->scientific_name;
>
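A minimal sketch of the delegation idea discussed above, assuming a taxonomy that simply indexes its nodes by rank string. Sketch::Node and Sketch::Taxonomy are hypothetical stand-ins written for illustration, not the real Bio::Taxonomy classes; only the names get_node(), scientific_name(), genus() and binomial() come from the thread.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a stripped-down node: it holds only its own data.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    my $self = { rank => $args{rank}, scientific_name => $args{scientific_name} };
    return bless $self, $class;
}
sub rank            { return $_[0]->{rank} }
sub scientific_name { return $_[0]->{scientific_name} }

# Hypothetical stand-in for the collection of nodes, keyed by rank string.
package Sketch::Taxonomy;

sub new {
    my ($class) = @_;
    return bless { by_rank => {} }, $class;
}
sub add_node {
    my ($self, $node) = @_;
    $self->{by_rank}{ $node->rank } = $node;
    return $self;
}
sub get_node {
    my ($self, $rank) = @_;
    return $self->{by_rank}{$rank};    # undef if no node has this rank
}
sub binomial {
    my ($self) = @_;
    my $node = $self->get_node('species') || return;
    return $node->scientific_name;
}

package main;

# How a Species-like class might implement genus() by delegation.
sub genus {
    my ($taxonomy) = @_;
    my $node = $taxonomy->get_node('genus') || return;
    return $node->scientific_name;
}

my $taxonomy = Sketch::Taxonomy->new;
$taxonomy->add_node(Sketch::Node->new(rank => 'genus',   scientific_name => 'Homo'));
$taxonomy->add_node(Sketch::Node->new(rank => 'species', scientific_name => 'Homo sapiens'));

print genus($taxonomy), "\n";       # Homo
print $taxonomy->binomial, "\n";    # Homo sapiens
```

Because the lookup is keyed on an arbitrary rank string, a rank such as 'subphylum' needs no dedicated method; get_node('subphylum') just returns undef when no such node was added.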
>
>>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
>>> when they build the classification array. I had no intention of changing
>>> this behaviour.
>>
>> If you ignore nodes with 'no rank' there will be major problems when
>> retrieving certain TaxID's from protein/nucleotide sequences.
>
> This is only for the classification array, which is meaningless anyway
> (there only for file-format compatibility). If you want the real
> information you ask your Bio::Taxonomy (which asks each of its nodes).
> This is the whole point of having Bio::Taxonomy in the first place.
>
> It gives you great flexibility to do whatever you want to do.
>
>
>>>> <TaxId>1760</TaxId>
>>>> <ScientificName>Actinobacteria (class)</ScientificName>
>>>> <Rank>class</Rank>
>>> Ugh. I guess my proposal to remove <> bits via flatfile extends to
>>> removing () bits via entrez. We don't need unique names; we can use
>>> object_id() when uniqueness matters.
>>
>> The XML parsing in Taxonomy::entrez will take care of the <tags> and retain
>> the character data in between.
>
> You misunderstood. I meant the <> bits I discussed at the very start of
> this thread, that flatfile gives you. Here I'm referring to getting rid
> of ' (class)' as well.
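A rough sketch of the name cleanup being described, assuming the '<>' bits are a trailing angle-bracketed qualifier and the '()' bits are a trailing copy of the node's own rank. clean_name() is a hypothetical helper, not an existing BioPerl function, and 'Example <qualifier>' is a made-up input used only to exercise the first pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: strip uniqueness qualifiers from a scientific name.
sub clean_name {
    my ($name, $rank) = @_;
    return $name unless defined $name;

    # Assumed flatfile form: drop a trailing " <...>" qualifier.
    $name =~ s/\s*<[^<>]*>\s*$//;

    # Entrez form seen in the thread: drop a trailing " (rank)", but only
    # when it matches this node's rank, so other parentheses are left alone.
    $name =~ s/\s*\(\Q$rank\E\)\s*$// if defined $rank;

    return $name;
}

print clean_name('Actinobacteria (class)', 'class'), "\n";   # Actinobacteria
print clean_name('Example <qualifier>',    'genus'), "\n";   # Example
```

Matching the parenthetical against the rank is a deliberately conservative choice for this sketch; a blanket "remove any trailing parentheses" rule could also eat parts of names that legitimately end in parentheses.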
>
>
>> Any way we go about it here (keeping certain methods and tossing others,
>> changing the data returned, etc), it looks like there will be API issues
>> down the road which will directly affect anyone using tax data. That
>> affects bioperl-db directly as well as any other bioperl-based DB's which
>> rely on tax data. So we need to tread a bit carefully when making major
>> changes to make sure that they work for bioperl-db and anywhere else that
>> may require it.
>
> Does anything make serious use of the current Bio::Taxonomy code? Or are
> they using Bio::Species?
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign