[Bioperl-l] Bio::*Taxonomy* changes

Sendu Bala bix at sendu.me.uk
Mon Jul 24 08:45:38 UTC 2006


Hilmar Lapp wrote:
> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
> 
>> Bio::DB::Taxonomy::flatfile
>> ---------------------------
>> [...]
>>
>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
>> division as a three letter code, like 'PRI'. However, for consistency
>> with entrez and the scientific_name() of the node the division is
>> supposed to correspond to, it is now stored as the full name, like
>> 'Primates'.
> 
> What about adding a method division_code() which would return the
> 3-letter abbreviation?
> 
> The abbreviation may be needed by flat-file writers, so it may be  
> handy to have in some cases.

As far as I know, you can't get the 3-letter version via entrez, so no 
other module can rely on it being available without knowing which 
database (flatfile.pm or entrez.pm) the taxonomic information is coming from.

But of course it would be somewhat harmless to add division_code() 
anyway. It might be better done as a -code => 1 option to division()?
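
To make the -code => 1 idea concrete, here is a minimal sketch. It is not 
code from Bio::Taxonomy::Node: the division() function, the plain hash 
standing in for a node, and the three-entry lookup table are all 
assumptions made for illustration.

```perl
use strict;
use warnings;

# Small subset of division codes, for illustration only.
my %DIVISION_NAME = (
    PRI => 'Primates',
    ROD => 'Rodents',
    MAM => 'Mammals',
);
my %DIVISION_CODE = reverse %DIVISION_NAME;

# Hypothetical accessor: returns the full division name by default,
# or the 3-letter code when called with -code => 1.
sub division {
    my ($node, %opt) = @_;
    my $name = $node->{division};
    return $name unless $opt{-code};
    return $DIVISION_CODE{$name};    # undef if no code is known
}

my $node = { division => 'Primates' };
print division($node), "\n";               # full name
print division($node, -code => 1), "\n";   # 3-letter code
```

The reverse lookup only works for names that came from a code in the 
first place, which is the limitation with entrez-derived nodes.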


>> The names->id solution also stores the artificially uniqued names like
>> 'Craniata <chordata>', allowing you for the first time to retrieve the
>> correct id. Previously the search would have simply failed completely.
>>
>> The names->id solution now handles nodes with scientific names of 'xyz
>> (class)', allowing you to retrieve the id with both get_taxonids('xyz')
>> and get_taxonids('xyz (class)'). Previously only the latter would work.
> 
> Should angle brackets be allowed too?

Allowed in what sense? You can indeed search for both 
get_taxonids('Craniata <chordata>') [returns a single id] and 
get_taxonids('Craniata') [returns multiple ids, one of which is the 
previous answer].
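
A rough sketch of how such a names->id index could behave, assuming a 
plain hash of name => list of ids. This is not the flatfile.pm 
implementation; index_name(), lookup_ids(), the second 'Craniata' entry 
and the ids are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical index: name => array ref of ids.
my %ids_for_name;

# Store the id under the full name, and also under the bare name when
# the full name ends in a '<something>' or '(something)' qualifier.
sub index_name {
    my ($name, $id) = @_;
    push @{ $ids_for_name{$name} }, $id;
    if ($name =~ /^(.+?)\s+(?:<[^>]+>|\([^)]+\))$/) {
        push @{ $ids_for_name{$1} }, $id;
    }
}

# Hypothetical stand-in for get_taxonids(): returns a list of ids.
sub lookup_ids {
    my ($name) = @_;
    return @{ $ids_for_name{$name} || [] };
}

# Made-up ids, not real NCBI taxon ids.
index_name('Craniata <chordata>', 101);
index_name('Craniata <invented>', 202);
index_name('xyz (class)',         303);

my @one  = lookup_ids('Craniata <chordata>');   # single id
my @many = lookup_ids('Craniata');              # both Craniata ids
```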


> Maybe there should also be a -names parameter which accepts a hash  
> reference with keys being the kind of name (scientific, common, etc)  
> and the values being array references with the set of names of that  
> kind?

Not sure what you mean. name() has that data structure, though you're 
not supposed to set its hash ref directly.


>> or the $node->classification() array.
> 
> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy  
> brought over from a flawed (because flat) object model in Bio::Species.

Yes, I agree.


>> NOTE: entrez modules (and website) cannot cope with '<something>' in the
>> query, failing searches like 'Craniata <chordata>'. For this reason, if
>> get_taxonids() is given a query with '<something>' it will immediately
>> return undefined, saving a pointless website access.
> 
> If there is a 'next-best-thing' that is still semantically compatible  
> with the API documentation, I would do that.
> 
> In this case, if there is a <something> in the query the entrez  
> module should strip it and automatically use the rest for searching.  
> If indeed multiple IDs match there should be a warning to inform the  
> user that entrez cannot use the <something> notation to limit the  
> query results.

I wouldn't like this. I actually had it working this way initially, but 
decided that someone who enters 'xyz <something>' really doesn't want 
multiple ids, expects to get multiple ids only with just 'xyz', and 
doesn't want their query turned into something else and then be warned 
about it.
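
For reference, the short-circuit behaviour from the NOTE can be sketched 
as follows. This is a minimal sketch and not the actual entrez.pm code: 
get_taxonids_sketch() and the $search callback are hypothetical 
stand-ins for the real method and the website access.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_taxonids(); $search is a code ref that
# would perform the real (remote) lookup.
sub get_taxonids_sketch {
    my ($query, $search) = @_;
    # A '<something>' qualifier cannot be searched remotely, so give up
    # early without calling $search at all.
    return if $query =~ /<[^>]+>/;
    return $search->($query);
}

my $calls  = 0;
my $search = sub { $calls++; return (1, 2) };   # made-up ids

my @none = get_taxonids_sketch('xyz <something>', $search);
my @many = get_taxonids_sketch('xyz', $search);
```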


> In fact, you might as well provide an option to enable an automatic  
> check for the correct branch for each ID if multiple ones are  
> returned. I.e., if this option is enabled, the module would  
> automatically query the parent nodes to see if <something> is in the  
> lineage, and if not will remove the respective ID from the result  
> set. The reason you may want to make it optional is because it  
> potentially costs time. (but in reality I'm not sure why a client  
> will not want to enable the option - so maybe this should even be  
> default)

I can certainly add that, it seems like a good idea. I don't, however, 
see any scope for an option at all. What would the option be called? 
-don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, 
imho. If the user queries 'xyz <something>' with that option, they're 
just going to have to do for themselves manually what the method would 
have done for them without that option, in order to get the correct 
answer. It'll be slower that way, if anything. So the option would 
actually be called 
-don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower 
(!).
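
The lineage check itself could look something like the sketch below: 
walk up the parents of each candidate id and keep only those with 
<something> among their ancestors. The in-memory %node table, its ids 
and both function names are assumptions for the example; a real version 
would query the database for each parent node.

```perl
use strict;
use warnings;

# Toy taxonomy, invented for illustration: id => { name, parent }.
my %node = (
    1 => { name => 'root',        parent => undef },
    2 => { name => 'Chordata',    parent => 1 },
    3 => { name => 'Craniata',    parent => 2 },
    4 => { name => 'Brachiopoda', parent => 1 },
    5 => { name => 'Craniata',    parent => 4 },
);

# True if some ancestor of $id is named $qualifier (case-insensitive).
sub in_lineage {
    my ($id, $qualifier) = @_;
    my $parent = $node{$id}{parent};
    while (defined $parent) {
        return 1 if lc($node{$parent}{name}) eq lc($qualifier);
        $parent = $node{$parent}{parent};
    }
    return 0;
}

# Keep only the ids whose lineage contains the qualifier.
sub filter_by_qualifier {
    my ($qualifier, @ids) = @_;
    return grep { in_lineage($_, $qualifier) } @ids;
}

# 'Craniata <chordata>': ids 3 and 5 both match the bare name 'Craniata'.
my @hits = filter_by_qualifier('chordata', 3, 5);
```

Each step up the tree would be one more database access, which is where 
the time cost comes from.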


>> Bio::Taxonomy::Node
>> -------------------
>> [...]
>> classification() has a proper solution to finding the classification
>> when the array wasn't manually set.
>>
>> # Improvements
>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
>> it is an alias to name('scientific').
>> NOTE: node_name is what is set when ->new(-name => $name) is called, so
>> flatfile and entrez and user-created nodes now implicitly associate the
>> name of the node they create with its scientific name.
> 
> I'm not even sure node_name() should just be deprecated. The method  
> falsely suggests that there is only a single and definitive name for  
> the taxon node.
> 
> In NCBI reality, this is only true for the scientific name of the  
> node. In real reality, many nodes have multiple scientific names -  
> taxonomy isn't static and therefore the scientific naming of nodes  
> isn't either.

For the programmer not using any database but just making up his own 
nodes, I think he needs a node_name() because he may not be thinking 
about anything fancy or realistic. He just wants to give his node a 
single name that he invents. node_name() seems like the ideal method 
name to me.
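
As a toy illustration of the aliasing (not the real Bio::Taxonomy::Node 
code), node_name() can simply delegate to name('scientific'). The 
ToyNode package and its internals are invented for this sketch.

```perl
use strict;
use warnings;

package ToyNode;

sub new {
    my ($class, %args) = @_;
    my $self = bless { names => {} }, $class;
    # -name sets node_name, and therefore the scientific name.
    $self->node_name($args{-name}) if defined $args{-name};
    return $self;
}

# Get (and optionally set) the list of names of a given kind.
sub name {
    my ($self, $type, @names) = @_;
    $self->{names}{$type} = [@names] if @names;
    return @{ $self->{names}{$type} || [] };
}

# Alias to name('scientific'); returns the first such name.
sub node_name {
    my ($self, @args) = @_;
    my @names = $self->name('scientific', @args);
    return $names[0];
}

package main;

my $invented = ToyNode->new(-name => 'my invented taxon');
print $invented->node_name, "\n";
```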




