[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Thu Jul 20 17:03:17 UTC 2006


These all seem fine to me.  Fantastic work!  I've added a few comments
below, but I don't see anything that needs changing.

I still plan on switching Bio::DB::Taxonomy::entrez to use
Bio::DB::EUtilities at some point, but I probably won't get around to it
until August; I still need to write tests for the EUtilities modules.  I may
add a method for retrieving taxonomy data from a protein/nucleotide sequence
primary ID and the relevant sequence database, so you could retrieve the
TaxID directly without parsing the sequence records for it.  This would
mainly be useful if you gather GIs from a BLAST search, for instance.

Anyway, I could add this to the base class Bio::DB::Taxonomy directly so
one could use the retrieved TaxIDs for flat-file or entrez searches; this
requires, of course, access to the remote Entrez database (it would use
ELink).  Would that be of interest?

If so, I'll work on that and add relevant tests to Taxonomy.t when I can.
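
To make the ELink idea concrete, here is a minimal sketch that only builds the
request URL for linking sequence primary IDs to the taxonomy database.  The
sub name elink_taxonomy_url() is hypothetical, the IDs are placeholders rather
than real GIs, and the endpoint is the public E-utilities address as I
understand it; nothing here is the planned Bio::DB::EUtilities interface.

```perl
use strict;
use warnings;

# Public E-utilities ELink endpoint (assumed; check NCBI's documentation).
my $ELINK = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi';

# Hypothetical helper: build an ELink URL that links sequence primary IDs
# in $dbfrom (e.g. 'protein' or 'nucleotide') to the taxonomy database.
# Only plain numeric IDs are accepted, so no URL escaping is needed.
sub elink_taxonomy_url {
    my ( $dbfrom, @ids ) = @_;
    die "dbfrom must be a simple database name\n"
        unless defined $dbfrom && $dbfrom =~ /^[a-z]+$/;
    die "no ids given\n" unless @ids;
    my @bad = grep { !/^\d+$/ } @ids;
    die "non-numeric ids: @bad\n" if @bad;
    return "$ELINK?dbfrom=$dbfrom&db=taxonomy&id=" . join( ',', @ids );
}

# Placeholder ids, not real GIs.
print elink_taxonomy_url( 'protein', 111, 222 ), "\n";
```

Fetching the URL and parsing the returned link sets is left out on purpose,
since that part depends on the remote service.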


> Bio::DB::Taxonomy::flatfile
> ---------------------------

...

> API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
> and it returns an array of ids in list context. For backward
> compatibility it returns one of the ids in scalar context, and
> *get_taxonid = \&get_taxonids.

Returning a scalar makes sense as long as it's noted in the POD.  I have seen
similar methods return an array ref based on wantarray instead of a scalar,
but that largely depends on the complexity of the array (an array of hashes,
for instance).  
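
For illustration, a context-sensitive return along the lines of the quoted
change might look like the sketch below.  The lookup table and its ids are
made-up placeholders standing in for the flat-file index, and this is not the
actual Bio::DB::Taxonomy::flatfile code.

```perl
use strict;
use warnings;

# Hypothetical in-memory name index standing in for the flat-file lookup;
# the ids are placeholders, not real TaxIDs.
my %IDS_FOR_NAME = (
    'example name' => [ 101, 202 ],
);

# Returns every matching id in list context, and the first match (or undef
# when there is none) in scalar context.
sub get_taxonids {
    my ($name) = @_;
    my @ids = @{ $IDS_FOR_NAME{ lc $name } || [] };
    return wantarray ? @ids : $ids[0];
}

# Backward-compatible alias, as in the quoted change note.
{
    no warnings 'once';
    *get_taxonid = \&get_taxonids;
}

my @all   = get_taxonids('Example name');    # list context: all ids
my $first = get_taxonid('Example name');     # scalar context: one id
```

Swapping `$ids[0]` for `\@ids` would give the array-ref variant mentioned
above; either way the POD needs to say which one callers get.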

...

> Bio::DB::Taxonomy::entrez
> -------------------------

...

> NOTE: entrez modules (and website) cannot cope with '<something>' in the
> query, failing searches like 'Craniata <chordata>'. For this reason, if
> get_taxonids() is given a query with '<something>' it will immediately
> return undefined, saving a pointless website access. If you want the id
> of 'Craniata <chordata>' you must search for 'Craniata', then get the
> node for each returned id to see which one has a parent node with a
> scientific_name() or common_names() case-insensitive matching to
> 'chordata'.

It may be something with the esearch interface, though the direct TaxBrowser
query also seems to have problems with this:

http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/

I'll try looking into it to see if there is a more direct way to get those
(there probably isn't).
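
In the meantime, the workaround described in the quoted note could be sketched
roughly as follows.  The candidate records are invented stand-ins for "the
node for each returned id" with its parent's names already fetched, and both
sub names are hypothetical; no Entrez access is involved.

```perl
use strict;
use warnings;

# Split 'Name <qualifier>' into its two parts; the qualifier is undef when
# the query has no '<...>' suffix.
sub split_qualified_name {
    my ($query) = @_;
    if ( $query =~ /^\s*(.+?)\s*<([^<>]+)>\s*$/ ) {
        return ( $1, $2 );
    }
    $query =~ s/^\s+|\s+$//g;
    return ( $query, undef );
}

# Keep only candidates whose parent has a name matching the qualifier,
# compared case-insensitively.
sub filter_by_parent {
    my ( $qualifier, @nodes ) = @_;
    return grep {
        my $node = $_;
        grep { lc($_) eq lc($qualifier) } @{ $node->{parent_names} };
    } @nodes;
}

# Hypothetical candidates, as if the bare name had matched two ids.
# The ids and the second parent name are invented for the example.
my @candidates = (
    { id => 1, parent_names => ['Chordata'] },
    { id => 2, parent_names => ['Some other group'] },
);

my ( $name, $qualifier ) = split_qualified_name('Craniata <chordata>');
my @hits = defined $qualifier
    ? filter_by_parent( $qualifier, @candidates )
    : @candidates;
print "$name: ", join( ',', map { $_->{id} } @hits ), "\n";
```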

> # Improvements
> BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.
> 
> BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for
> s/ \(class\)$//, being sent directly to
> Bio::Taxonomy::Node->new(-name => $untouched) or the
> $node->classification() array. Previously, a species node would have its
> name converted from 'Homo sapiens' to 'sapiens', but the conversion
> mangled very badly certain other species names.

This actually relates to the similar comment made for
Bio::DB::Taxonomy::flatfile.  The mangling probably depends on the current
node and on whether you are using flatfile or XML (entrez).  Most of the odd
XML examples I posted before, where the TaxID associated with a sequence had
extra data, had a rank of 'no rank'.  The species rank, if present, has a
normal binomial name for <ScientificName>:

  <ScientificName>Flavobacterium johnsoniae UW101</ScientificName>
      ...
      <ScientificName>Flavobacterium johnsoniae</ScientificName>
      <Rank>species</Rank> 

  <ScientificName>Pseudomonas putida F1</ScientificName>
      ...
      <ScientificName>Pseudomonas putida</ScientificName>
      <Rank>species</Rank>

  <ScientificName>Caldicellulosiruptor saccharolyticus DSM 8903</ScientificName>
       ...
      <ScientificName>Caldicellulosiruptor saccharolyticus</ScientificName>
      <Rank>species</Rank>

The genus rank has one name; the subspecies rank has the full species name
with 'subsp.' followed by the subspecies name.  So, if using XML, one could
use the taxon subelements stored in the <LineageEx> XML element to sort out
genus(), species(), subspecies(), and also higher order elements if someone
wanted to implement them.

This, of course, isn't necessary for the current changes, but down the road
if anybody wanted it...
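
A rough sketch of that idea, assuming the <Taxon> children of <LineageEx> have
already been parsed into name/rank pairs (the XML parsing itself is left out,
and the sub name names_by_rank() is hypothetical).  The data reuses the
Flavobacterium example above; the genus and 'no rank' entries are filled in by
assumption rather than copied from the XML.

```perl
use strict;
use warnings;

# Already-parsed stand-in for the lineage entries of one record.
my @lineage = (
    { name => 'cellular organisms',        rank => 'no rank' },
    { name => 'Flavobacterium',            rank => 'genus' },
    { name => 'Flavobacterium johnsoniae', rank => 'species' },
);

# Index lineage entries by rank.  'no rank' entries are skipped because
# they can occur several times in one lineage and carry no usable key.
sub names_by_rank {
    my (@taxa) = @_;
    my %by_rank;
    for my $taxon (@taxa) {
        next if $taxon->{rank} eq 'no rank';
        $by_rank{ $taxon->{rank} } = $taxon->{name};
    }
    return \%by_rank;
}

my $rank       = names_by_rank(@lineage);
my $genus      = $rank->{genus};         # 'Flavobacterium'
my $species    = $rank->{species};       # full binomial, left untouched
my $subspecies = $rank->{subspecies};    # undef: no such entry here
```

Higher ranks would fall out of the same hash for free, since the lookup is
keyed on whatever rank strings the lineage contains.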

...

> Bio::Taxonomy::Node
> -------------------

...

> species() and genus() issue a warning when you try to use them on a node
> that isn't of rank 'species' (since they interact with the
> classification array and not names('method') like the other similar
> methods).

I would just have genus() and species() issue warnings if they aren't set to
a particular value.  So, if the current node is at the genus rank, genus()
will be set but species() won't be.  And no need to do additional checking!
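
Something like the following is what I have in mind, as a minimal sketch.
My::Node is a hypothetical stand-in class, not Bio::Taxonomy::Node, and the
warning text is made up.

```perl
use strict;
use warnings;

package My::Node;    # hypothetical stand-in, not Bio::Taxonomy::Node

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

# Generate genus() and species() accessors that warn only when no value
# is stored, regardless of the node's rank.
for my $field (qw(genus species)) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$field" } = sub {
        my ($self) = @_;
        warn "$field is not set for this node\n"
            unless defined $self->{$field};
        return $self->{$field};
    };
}

package main;

# A node at the genus rank: genus is set, species is not.
my $genus_node = My::Node->new( genus => 'Flavobacterium' );
my $g = $genus_node->genus;      # set, so no warning
my $s = $genus_node->species;    # unset, so this call warns
```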
 
Fabulous work Sendu!  

Chris



