[Bioperl-l] Bio::*Taxonomy* changes

Hilmar Lapp hlapp at gmx.net
Sun Jul 23 20:48:22 EDT 2006


On Jul 21, 2006, at 12:51 AM, Chris Fields wrote:

> my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
>
> # normally not needed as this is set by default internally, but as a
> # demo here...
> $species->db_handle($db);
>
> # reset the appropriate data (genus, species, etc) based on Entrez
> # tax data
> $species->reset_data();     # this method, BTW, doesn't exist yet but
>                             # should be easy to implement

Don't call this reset_data(), as the name may be misleading (reset()
usually means reverting to a native or original state). Instead, you
could call it fetch_from_db() or something similar.

However, it seems redundant to me to begin with. If we ignore for a  
second that the return value in the following isn't exactly  
compatible, why would you not just call

	$species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);

So I think more than anything else, this should be made to work, and  
you would have a more seamless interface.
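
As a minimal sketch of that pattern in plain Perl (no BioPerl needed to
run it): Sketch::TaxDB and Sketch::Node are hypothetical stand-ins, not
Bio::DB::Taxonomy or Bio::Taxonomy::Node, and the in-memory table is
only an assumption for illustration. Only the call shape
get_Taxonomy_Node(-taxonid => ...) is taken from the line above.

```perl
use strict;
use warnings;

# Hypothetical node class (stand-in only, not Bio::Taxonomy::Node).
package Sketch::Node {
    sub new { my ($class, %f) = @_; return bless {%f}, $class }
    sub ncbi_taxid      { $_[0]{ncbi_taxid} }
    sub scientific_name { $_[0]{scientific_name} }
    sub rank            { $_[0]{rank} }
}

# Hypothetical in-memory database (stand-in only, not Bio::DB::Taxonomy).
package Sketch::TaxDB {
    sub new {
        my ($class, %nodes) = @_;
        return bless { nodes => {%nodes} }, $class;
    }

    # Same call shape as in the message: get_Taxonomy_Node(-taxonid => $id)
    sub get_Taxonomy_Node {
        my ($self, %args) = @_;
        my $id  = $args{-taxonid};
        my $rec = $self->{nodes}{$id} or return;   # unknown id: nothing
        return Sketch::Node->new(ncbi_taxid => $id, %$rec);
    }
}

package main;

my $db = Sketch::TaxDB->new(
    9606 => { scientific_name => 'Homo sapiens', rank => 'species' },
);

# An object that knows only its taxon id, e.g. parsed from a sequence file.
my $species = Sketch::Node->new(ncbi_taxid => 9606);

# Replace it with the database's view instead of mutating it in place.
$species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);

print $species->scientific_name, "\n";
```

The point of the design is that the caller needs no extra "refresh"
method on the species object; the database handle stays the single
place that knows how to build a fully populated node.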

> Short and sweet summary:
>
> Sendu volunteered making changes to Bio::Taxonomy::Node and related  
> modules;
> we disagreed on exactly what changes should be made.  Sendu wanted a
> stripped-down version of Bio::Taxonomy::Node; I wanted one which would
> support similar methods as in Bio::Species.

Bio::Species should be considered legacy; I think it is flawed as an  
object model because it imposes a flat view on something which in  
reality is only a node in a tree and not flat at all.

The only real need for the flat view came from the desire to write  
sequence files - for all other purposes the classification() etc  
attributes are useless anyway.

I.e., binomial() and common_name() (corresponding to scientific_name()
and names('common')) are the only really useful attributes; the rest
is baggage for writing sequence files. The baggage should not be
passed on to a better model ...

Instead, there should be a separate module (in essence a Bio::Species  
factory) which can translate a Bio::Taxonomy::Node into a  
Bio::Species object - if the rank is 'species' or below.
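
A rough sketch of such a factory, in plain Perl so it runs without
BioPerl. All package and method names here (Sketch::Node,
Sketch::FlatSpecies, Sketch::SpeciesFactory, species_from_node) are
hypothetical, and the list of ranks treated as "species or below" is an
assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical tree node: knows its rank, its name and its parent.
package Sketch::Node {
    sub new { my ($class, %f) = @_; return bless {%f}, $class }
    sub rank            { $_[0]{rank} }
    sub scientific_name { $_[0]{scientific_name} }
    sub parent          { $_[0]{parent} }
}

# Hypothetical flat, legacy-style view (stand-in for a Bio::Species-like object).
package Sketch::FlatSpecies {
    sub new { my ($class, %f) = @_; return bless {%f}, $class }
    sub binomial       { $_[0]{binomial} }
    sub classification { @{ $_[0]{classification} } }
}

package Sketch::SpeciesFactory {
    # Assumed set of ranks that count as "species or below".
    my %AT_OR_BELOW_SPECIES = map { $_ => 1 } qw(species subspecies varietas forma);

    # Returns a flat species object, or nothing if the node ranks above species.
    sub species_from_node {
        my ($class, $node) = @_;
        return unless $AT_OR_BELOW_SPECIES{ $node->rank // '' };

        # Flatten the tree by walking up the parents, most specific name first.
        my @lineage;
        for (my $n = $node; $n; $n = $n->parent) {
            push @lineage, $n->scientific_name;
        }
        return Sketch::FlatSpecies->new(
            binomial       => $node->scientific_name,
            classification => \@lineage,
        );
    }
}

package main;

my $genus = Sketch::Node->new(rank => 'genus', scientific_name => 'Homo');
my $human = Sketch::Node->new(
    rank => 'species', scientific_name => 'Homo sapiens', parent => $genus,
);

my $flat = Sketch::SpeciesFactory->species_from_node($human);
my $none = Sketch::SpeciesFactory->species_from_node($genus);

print join(' < ', $flat->classification), "\n";
print defined($none) ? "genus converted\n" : "genus: no flat species view\n";
```

Keeping the flattening in a separate factory means the node class never
has to carry classification() or similar legacy attributes itself.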

Alternatively, you could have a Bio::Taxonomy::SpeciesNode object  
which implements both APIs and can be initialized with either a  
Bio::Taxonomy::Node instance, or the combination of a Bio::Species  
and a db handle.

At any rate, I think Bio::Taxonomy::Node should be stripped of legacy  
methods that are only there to achieve Bio::Species compatibility.

>
> I suggested having a common interface module, one for Node and another for
> Species; both implement the same interface methods (NodeI maybe), so you
> could use either a bare-bones Node or a full-fledged Species object.  I then
> suggested this new version of Species could replace Bio::Species.  We could
> worry about which one to use for Bio::DB::Taxonomy* later.

I'm not following here... What would this look like? What would the
API(s) be?

>
> We both agreed.  Everybody's happy.

Happiness is great, so maybe you shouldn't bother about me not  
following...

> I still plan on switching Bio::DB::Taxonomy::entrez to use
> Bio::DB::EUtilities at some point

Wouldn't that rather be Bio::DB::Taxonomy::eutil?

> I may
> add a method for retrieving tax data based on protein/nucleotide sequence
> primary ID and relevant sequence database, so you could directly retrieve
> the relevant TaxID w/o parsing sequences directly for them.  This would
> mainly be useful if you gather GIs from a BLAST search, for instance.
>
> Anyway, I could add this in the base class Bio::DB::Taxonomy directly so
> one could use the retrieved TaxIDs for flat-file or entrez searches; this
> requires, of course, access to the remote Entrez database (it would use
> ELink).  Would that be of interest?

If you add the API methods for this to the base class (which in this  
case is close in concept to an interface, much like Bio/SeqIO.pm),  
then make clear that not every database will allow you to implement  
this.
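
One way to make that explicit, sketched in plain Perl: the base class
declares the method but dies unless a subclass overrides it, and a
capability method lets callers ask first. Every name here
(Sketch::TaxonomyDB, get_taxonids_for_seqids, can_map_seqids) is
hypothetical, and the lookup table is a stand-in for a remote query,
with made-up IDs.

```perl
use strict;
use warnings;

# Hypothetical base class acting as an interface.
package Sketch::TaxonomyDB {
    sub new { my ($class, %args) = @_; return bless {%args}, $class }

    # Optional API: backends that cannot map sequence IDs keep this default.
    sub get_taxonids_for_seqids {
        my ($self) = @_;
        die ref($self) . " does not implement get_taxonids_for_seqids()\n";
    }

    # Capability check so callers can ask before calling.
    sub can_map_seqids { 0 }
}

# Backend that supports the optional method.
package Sketch::TaxonomyDB::Remote {
    use parent -norequire, 'Sketch::TaxonomyDB';

    sub can_map_seqids { 1 }

    sub get_taxonids_for_seqids {
        my ($self, @seqids) = @_;
        # A real backend would query a remote service; a table stands in here.
        return map { ($_ => $self->{lookup}{$_}) } @seqids;
    }
}

# Backend that inherits the "not implemented" default.
package Sketch::TaxonomyDB::Flatfile {
    use parent -norequire, 'Sketch::TaxonomyDB';
}

package main;

my $remote   = Sketch::TaxonomyDB::Remote->new(lookup => { seq1 => 9606 });
my $flatfile = Sketch::TaxonomyDB::Flatfile->new;

my %result;
for my $db ($remote, $flatfile) {
    my %taxid_for;
    if (eval { %taxid_for = $db->get_taxonids_for_seqids('seq1'); 1 }) {
        $result{ ref($db) } = $taxid_for{seq1};
    }
    else {
        $result{ ref($db) } = 'unsupported';
    }
}

print "$_: $result{$_}\n" for sort keys %result;
```

Whether the unsupported case should die or quietly return an empty list
is a design choice; dying makes the limitation hard to overlook, which
seems closer to "make clear that not every database will allow this".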

>
>          |------Node
> NodeI----|
>          |------Species
>
> Another option would be to have Bio::Taxonomy::Node itself stripped  
> down,
> then have another class (Bio::Taxonomy::Species) inherit methods  
> from it and
> also implement additional methods (genus(), species(), etc).

I think this would be the way to go. I.e.,


          |------Node
NodeI----|
          |-|
            |----SpeciesNode
Species----|

This way the NodeI interface and its direct implementors are kept  
free of legacy.
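
A minimal sketch of that hierarchy in plain Perl, assuming SpeciesNode
inherits from Node and additionally satisfies the legacy Species API.
The Sketch::* packages are hypothetical stand-ins, and deriving genus()
and species() by splitting the scientific name is a simplification for
illustration only.

```perl
use strict;
use warnings;

# Interface: what every node must provide.
package Sketch::NodeI {
    sub scientific_name { die "scientific_name() not implemented\n" }
    sub rank            { die "rank() not implemented\n" }
}

# Bare-bones implementation, free of legacy methods.
package Sketch::Node {
    use parent -norequire, 'Sketch::NodeI';
    sub new { my ($class, %f) = @_; return bless {%f}, $class }
    sub scientific_name { $_[0]{scientific_name} }
    sub rank            { $_[0]{rank} }
}

# Stand-in for the legacy flat API.
package Sketch::Species {
    sub binomial { die "binomial() not implemented\n" }
    sub genus    { die "genus() not implemented\n" }
    sub species  { die "species() not implemented\n" }
}

# Implements both APIs; the legacy methods are derived from node data.
package Sketch::SpeciesNode {
    use parent -norequire, 'Sketch::Node', 'Sketch::Species';
    sub binomial { $_[0]->scientific_name }
    sub genus    { (split ' ', $_[0]->scientific_name)[0] }
    sub species  { (split ' ', $_[0]->scientific_name)[1] }
}

package main;

my $plain = Sketch::Node->new(scientific_name => 'Homo', rank => 'genus');
my $human = Sketch::SpeciesNode->new(
    scientific_name => 'Homo sapiens', rank => 'species',
);

# Code written against NodeI accepts both objects ...
print $_->scientific_name, ' (', $_->rank, ")\n" for $plain, $human;

# ... while only the SpeciesNode answers the legacy calls.
print $human->genus, ' / ', $human->species, "\n";
print $plain->can('genus') ? "Node has genus()\n" : "Node stays legacy-free\n";
```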

	-hilmar


-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================






