[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Tue Jul 25 22:16:36 EDT 2006


One last thing before I shut off bioperl for a week and concentrate
on Connecticut:

On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote:

> Chris Fields wrote:
>> If I were to get an object back that was labeled Bio::Species, as a
>> biologist I would expect it to be part of a taxonomy, not the actual
>> Taxonomy itself.
>
> I think this is the most important sentence in the discussion. Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake.
> I also needed to start trying to code my Taxonomy proposal to see some
> issues with it.

... Again, thanks for noticing that.

> ---- summary:
>
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
>
> Bio::DB::Taxonomy::genbank anyone?

Ach...  You're compromising here; that's not like you.  I think
you're making this too complicated by trying too many things at
once.  Don't think in terms of sudden, dramatic changes to the API.
Sneak changes in so that they don't scare users away, then let them
get used to the new way of grabbing Tax data.  Make your point that
it's more accurate to do it this way (you'll have defenders in Hilmar
and me, BTW).

Do this (start with genbank.pm):

1) Switch out Bio::Species with Node or Taxonomy; relocate other  
information temporarily (Bio::Species, get/sets in Seq object,  
SimpleValue).  Leave Bio::Species in for the time being, but don't  
bother making any additional changes to it.
2) Make sure next_seq() and write_seq() work and pass tests.  Add  
additional tests for the Tax/Node object (you could even use the tax  
dump data you recently added for more complicated tests).
3) Add in additional stuff bit by bit until it is where you would  
like it.
4) Make sure parsing is kosher with the latest release notes.
Probably should make sure write_seq follows what the release notes
state to some degree.

And, really, you won't break anything with genbank.pm organelle()
parsing.  If you look at the module, the organelle isn't even touched
in next_seq() or _read_GenBank_Species(), so it was broken to begin
with!

My proposal, though extreme, was to remove genus() etc (which you  
wanted as well with Node).  You could leave this cruft for the time  
being in Bio::Species, which could still act as a sequence tax info  
holder object.  It just won't be the >default<  Seq tax information  
object, which would be Bio::Taxonomy or Node.

Hence Hilmar's suggestion to use a $seq->taxon() method to return a  
Node/Taxonomy, and a $seq->species() would still return a  
Bio::Species object.  It's redundant, but only for the time being,  
and the redundant information wouldn't have a major memory footprint  
anyway (not like the feature table or the full sequence might).  Any
information that isn't stored in whatever Tax object you use (e.g.
lineage or organelle) could be stored temporarily in another fashion,
such as a get/set in Seq or a SimpleValue object, to make
next_seq/write_seq work (so $seq->organelle() or
$seq->classification() instead of $seq->species->organelle and so on).
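
To make the interim arrangement concrete, here is a toy sketch. It is
written in Python rather than Perl so that it runs without a BioPerl
install, and it is not BioPerl code: the classes Seq, TaxonNode and
LegacySpecies are hypothetical stand-ins. Only the method names
taxon(), species(), organelle() and classification() come from the
discussion above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaxonNode:
    """Hypothetical stand-in for a Node/Taxonomy object: static facts only."""
    taxid: int
    scientific_name: str
    rank: str


@dataclass
class LegacySpecies:
    """Hypothetical stand-in for the old Bio::Species-style holder."""
    classification: List[str]
    organelle: Optional[str] = None


class Seq:
    """Toy sequence object exposing both the new and the old accessors."""

    def __init__(self, node: TaxonNode, lineage: List[str],
                 organelle: Optional[str] = None) -> None:
        self._node = node
        self._lineage = list(lineage)
        self._organelle = organelle

    def taxon(self) -> TaxonNode:
        # New default: the taxonomy node itself.
        return self._node

    def classification(self) -> List[str]:
        # Lineage kept on the sequence for now, not on the node.
        return list(self._lineage)

    def organelle(self) -> Optional[str]:
        # Likewise parked on the sequence until it has a better home.
        return self._organelle

    def species(self) -> LegacySpecies:
        # Redundant legacy view, built from the same stored data.
        return LegacySpecies(self.classification(), self.organelle())


seq = Seq(TaxonNode(9606, "Homo sapiens", "species"),
          ["Homo sapiens", "Homo", "Hominidae"],
          organelle="mitochondrion")
```

Old code that calls seq.species() keeps working, while new code can
move to seq.taxon() at its own pace.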

Hilmar then suggests that, around the 1.6-ish release, we note the
changes made to SeqIO towards Bio::Taxonomy-based objects and
indicate that Bio::Species via species() and its associated methods
will be deprecated around 1.7 (which gives everybody notice on API
issues).  Then add warnings to Bio::Species in 1.7 noting the
deprecation, and remove it from core completely somewhere in
1.8 - 2.0.
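
The staged schedule could be sketched roughly as follows. Again this
is a toy in Python, not BioPerl code: the function species() and the
constants WARN_FROM and REMOVED_FROM are hypothetical, and picking 1.8
as the removal point is only an assumption within the 1.8 - 2.0
window.

```python
import warnings

WARN_FROM = (1, 7)     # release where the deprecation warning starts
REMOVED_FROM = (1, 8)  # assumption: earliest point in the 1.8 - 2.0 window


def species(release, legacy_object):
    """Return the legacy object, warning or failing according to release."""
    if release >= REMOVED_FROM:
        # After removal the old accessor no longer works at all.
        raise NotImplementedError("species() has been removed; use taxon()")
    if release >= WARN_FROM:
        # During the deprecation window the call still works but complains.
        warnings.warn("species() is deprecated; use taxon() instead",
                      DeprecationWarning, stacklevel=2)
    return legacy_object
```

Under these assumptions a 1.6 caller sees no change, a 1.7 caller
gets a warning but still receives the object, and a 1.8 caller has to
switch to taxon().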

One last thing, which is minor really: I remember seeing something
about having Nodes with 'no rank' ignored unless a flag is used.
That may be bad news for some organisms in sequence files where the
TaxID itself points to a 'no rank' node, such as environmental
samples.  May want to think about that here.
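
A small illustration of the worry, as a toy in Python rather than
BioPerl code. The function ranked_lineage(), its keep_no_rank flag and
the sample lineage are all made up for the example; I have not checked
how the actual flag behaves.

```python
from typing import List, Tuple

Lineage = List[Tuple[str, str]]  # (name, rank) pairs, root first


def ranked_lineage(lineage: Lineage, keep_no_rank: bool = False) -> Lineage:
    """Drop 'no rank' nodes unless the caller asks to keep them."""
    if keep_no_rank:
        return list(lineage)
    return [(name, rank) for name, rank in lineage if rank != "no rank"]


# Made-up lineage whose terminal node (the one the TaxID names) is 'no rank'.
sample = [
    ("Bacteria", "superkingdom"),
    ("environmental samples", "no rank"),
    ("example environmental isolate", "no rank"),
]

default_view = ranked_lineage(sample)
full_view = ranked_lineage(sample, keep_no_rank=True)
```

With the default, the node the record actually refers to vanishes
from the lineage and only "Bacteria" is left, so a filter like this
would probably need to keep the terminal node regardless of its rank.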

I'm hoping the releases will start coming out a bit more regularly
than they have been.  There have been volunteers to release periodic
updates for bug fixes etc.

If I get a chance I'll try keeping up.  Don't count on it though.   
The conference is 7am-9pm most days, for five days straight!

Chris

>
> Then if you started with a Species/Node generated by a genbank parse,
> and wanted certain questions answered correctly, you only have to set a
> different db_handle(). The Node only stores the static and hopefully
> correct information about itself, whilst all other questions go via
> db_handle, so you can dynamically swap back and forth between databases
> depending on if you need speed or accuracy.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign




