[Bioperl-l] Bio::*Taxonomy* changes

Hilmar Lapp hlapp at gmx.net
Sun Jul 23 19:40:45 EDT 2006


On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:

> I'll describe all the changes I've now made and if no-one complains
> I'll commit. (I've also made these notes into bug 2047 for easier reference
> in the future.)
>
> Bio::DB::Taxonomy::flatfile
> ---------------------------
> [...]
>
> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
> division as a three letter code, like 'PRI'. However, for consistency
> with entrez and the scientific_name() of the node the division is
> supposed to correspond to, it is now stored as the full name, like
> 'Primates'.

What about adding a method division_code() which would return the
3-letter abbreviation?

The abbreviation may be needed by flat-file writers, so it may be  
handy to have in some cases.
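
A minimal sketch of what such an accessor could look like, written in Python to stay independent of the actual BioPerl classes. The `Node` class and the lookup table are assumptions for illustration; only the 'PRI' / 'Primates' pair comes from this thread, and the remaining pairs are recalled from NCBI's division.dmp and should be checked against that file.

```python
# Hypothetical lookup from full division name to 3-letter code.
# Only 'Primates' -> 'PRI' is taken from the thread; verify the rest
# against NCBI's division.dmp before relying on them.
DIVISION_CODES = {
    "Bacteria": "BCT",
    "Invertebrates": "INV",
    "Mammals": "MAM",
    "Primates": "PRI",
    "Rodents": "ROD",
    "Viruses": "VRL",
    "Vertebrates": "VRT",
}


class Node:
    """Stand-in for a taxonomy node; not the Bio::Taxonomy::Node API."""

    def __init__(self, division):
        self.division = division  # stored as the full name, e.g. 'Primates'

    def division_code(self):
        """Return the 3-letter code, or None if the division is unknown."""
        return DIVISION_CODES.get(self.division)
```

With this shape the node keeps the full name as its primary value, and a flat-file writer can still ask for the abbreviation when it needs one.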

>
> The names->id solution also stores the artificially uniqued names like
> 'Craniata <chordata>', allowing you for the first time to retrieve the
> correct id. Previously the search would have simply failed completely.
>
> The names->id solution now handles nodes with scientific names of 'xyz
> (class)', allowing you to retrieve the id with both get_taxonids('xyz')
> and get_taxonids('xyz (class)'). Previously only the latter would work.

Should angle brackets be allowed too?
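
One way to treat both kinds of qualifier uniformly is to index every name under its full form and under its bare form. The sketch below is in Python and is not the flatfile module's implementation; `index_keys`, `build_index`, the free function `get_taxonids`, and the ids are all made up for illustration.

```python
import re
from collections import defaultdict

# Matches one trailing qualifier such as ' (class)' or ' <chordata>'.
_QUALIFIER = re.compile(r"\s*(\([^()]*\)|<[^<>]*>)\s*$")


def index_keys(name):
    """Return the lookup keys for one stored name: full form, then bare form."""
    keys = [name]
    bare = _QUALIFIER.sub("", name)
    if bare and bare != name:
        keys.append(bare)
    return keys


def build_index(names_to_ids):
    """Map every lookup key to the set of ids it can refer to."""
    index = defaultdict(set)
    for name, taxon_id in names_to_ids:
        for key in index_keys(name):
            index[key].add(taxon_id)
    return index


def get_taxonids(index, query):
    """Return all ids stored under the query, sorted; empty list if none."""
    return sorted(index.get(query, ()))
```

Under this scheme a search for the bare name returns every id that shares it, while the qualified form still picks out exactly one.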

>
> NOTE: the previous 2 changes (and the issues with entrez, see below)
> make flatfile better at searching the taxonomy database than entrez
> module or the website, both in terms of speed and completeness of  
> results.
>
> BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
> always being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched)

Maybe there should also be a -names parameter which accepts a hash  
reference with keys being the kind of name (scientific, common, etc)  
and the values being array references with the set of names of that  
kind?
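
As a rough sketch of the data shape this suggests (Python, not the Bio::Taxonomy::Node constructor; `TaxonNode` and its `names` argument are hypothetical), a single bare name can be folded into the same structure as a scientific name:

```python
class TaxonNode:
    """Stand-in for a taxonomy node holding several names per kind."""

    def __init__(self, name=None, names=None):
        # kind of name -> list of names of that kind (copied defensively)
        self._names = {kind: list(vals) for kind, vals in (names or {}).items()}
        if name is not None:
            # Assumption: a bare name is treated as a scientific name
            # and placed first in that list.
            scientific = self._names.setdefault("scientific", [])
            if name not in scientific:
                scientific.insert(0, name)

    def name(self, kind):
        """Return a copy of the names of the given kind (possibly empty)."""
        return list(self._names.get(kind, []))
```

For example, `TaxonNode(names={"scientific": ["Homo sapiens"], "common": ["human", "man"]})` would answer `name("common")` with both common names.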

> or the $node->classification() array.

Bio::Taxonomy::Node shouldn't have this attribute. It is legacy  
brought over from a flawed (because flat) object model in Bio::Species.

> [...]
>
> Bio::DB::Taxonomy::entrez
> -------------------------
>
> # Bug-fixes
> Special characters like ", ( and ) in the input query string to
> get_taxonid() result in the failure or inaccuracy of the search. These
> characters are now removed prior to submission, allowing for correct
> search results.
> API-CHANGE: entrez has always been able to return multiple ids that
> match a single input name, so I've renamed get_taxonid() to
> get_taxonids() and it returns an array of ids in list context. It
> returns one of the ids in scalar context. For backward compatibility,
> *get_taxonid = \&get_taxonids.

Sounds good to me.

> NOTE: entrez modules (and website) cannot cope with '<something>' in the
> query, failing searches like 'Craniata <chordata>'. For this reason, if
> get_taxonids() is given a query with '<something>' it will immediately
> return undefined, saving a pointless website access.

If there is a 'next-best-thing' that is still semantically compatible  
with the API documentation, I would do that.

In this case, if there is a <something> in the query the entrez  
module should strip it and automatically use the rest for searching.  
If indeed multiple IDs match there should be a warning to inform the  
user that entrez cannot use the <something> notation to limit the  
query results.

In fact, you might as well provide an option that automatically
checks each ID for the correct branch when multiple ones are
returned. I.e., if this option is enabled, the module would query
the parent nodes of each ID to see whether <something> is in its
lineage, and if not, remove that ID from the result set. The reason
you may want to make it optional is that it potentially costs time
(though in reality I'm not sure why a client would not want to
enable the option, so maybe this should even be the default).
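
The strip, search, and filter steps could be sketched as follows. This is Python against a toy in-memory taxonomy, not the entrez module: `TAXA`, `search_by_name`, `lineage_names`, the `check_lineage` flag, and every id are assumptions made up for illustration, and the real module would issue remote queries where this sketch does dictionary lookups.

```python
import re
import warnings

# Toy taxonomy: id -> (scientific name, parent id). All ids are made up.
TAXA = {
    1: ("root", None),
    2: ("Chordata", 1),
    3: ("Craniata", 2),
    4: ("Brachiopoda", 1),
    5: ("Craniata", 4),
}


def search_by_name(name):
    """Stand-in for the remote search: ids whose name matches exactly."""
    return sorted(i for i, (n, _) in TAXA.items() if n.lower() == name.lower())


def lineage_names(taxon_id):
    """Lower-cased names of all ancestors of taxon_id, nearest first."""
    names = []
    parent = TAXA[taxon_id][1]
    while parent is not None:
        names.append(TAXA[parent][0].lower())
        parent = TAXA[parent][1]
    return names


def get_taxonids(query, check_lineage=True):
    """Search by name; use a trailing <qualifier> to filter by lineage."""
    match = re.fullmatch(r"\s*(.*?)\s*<([^<>]*)>\s*", query)
    if not match:
        return search_by_name(query.strip())
    bare, qualifier = match.group(1), match.group(2).strip().lower()
    ids = search_by_name(bare)
    if check_lineage:
        # Keep only ids that have the qualifier somewhere among their ancestors.
        return [i for i in ids if qualifier in lineage_names(i)]
    if len(ids) > 1:
        warnings.warn(f"cannot use <{qualifier}> to narrow {len(ids)} matches")
    return ids
```

With the check enabled, 'Craniata <chordata>' resolves to the single id under Chordata; with it disabled, both ids come back along with a warning, which matches the fallback behaviour described above.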

> If you want the id
> of 'Craniata <chordata>' you must search for 'Craniata', then get the
> node for each returned id to see which one has a parent node with a
> scientific_name() or common_names() case-insensitive matching to  
> 'chordata'.

Yep, see above. The more burden you can shield from the user the better.

> [...]
> Bio::Taxonomy::Node
> -------------------
> [...]
> classification() has a proper solution to finding the classification
> when the array wasn't manually set.
>
> # Improvements
> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
> it is an alias to name('scientific').
> NOTE: node_name is what is set when ->new(-name => $name) is set, so
> flatfile and entrez and user-created nodes now implicitly associate the
> name of the node they create with its scientific name.

I'm not even sure node_name() shouldn't just be deprecated. The method
falsely suggests that there is only a single and definitive name for
the taxon node.

In NCBI reality, this is only true for the scientific name of the  
node. In real reality, many nodes have multiple scientific names -  
taxonomy isn't static and therefore the scientific naming of nodes  
isn't either.

> [...]
>

Thanks for the work, all other changes sound great. Thanks also to  
Chris for assisting! Looks like this is in much better shape now than  
before.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================






