[Bioperl-l] Bio::*Taxonomy* changes
hlapp at gmx.net
Mon Jul 24 12:27:28 UTC 2006
:-) I think we're largely in agreement. As for node_name() I fully
understand the motivation, but it needs to be understood that the
attribute's value will be based on a largely arbitrary choice unless
it is set directly by the user.
On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
>>> BEHAVIOUR-CHANGE: flatfile used to store the division within the nodes
>>> it makes as a three-letter code, like 'PRI'. However, for consistency
>>> with entrez and the scientific_name() of the node the division is
>>> supposed to correspond to, it is now stored as the full name, like
>> What about adding a method division_code() which would return the 3-
>> letter abbreviation?
>> The abbreviation may be needed by flat-file writers, so it may be
>> handy to have in some cases.
> As far as I know you can't get the 3-letter version via entrez, so no
> other module can really expect to be able to get it, not knowing which
> database (flatfile.pm or entrez.pm) the taxonomic information is
> coming from.
> But of course it would be somewhat harmless to add division_code()
> anyway. It might be better done as a -code => 1 option to division()?
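
A minimal sketch of the two shapes discussed here, division_code() versus a -code => 1 option to division(). MockNode is a hypothetical stand-in, not the real Bio::Taxonomy::Node, and the division table is only an illustrative subset of the GenBank codes:

```perl
use strict;
use warnings;

{
    package MockNode;    # hypothetical class, for illustration only

    # Illustrative subset of division names and their 3-letter codes.
    my %DIVISION_CODE = (
        'Primates' => 'PRI',
        'Rodents'  => 'ROD',
        'Mammals'  => 'MAM',
        'Bacteria' => 'BCT',
    );

    sub new {
        my ($class, %args) = @_;
        return bless { division => $args{-division} }, $class;
    }

    # Returns the full division name, or the 3-letter code with -code => 1.
    # The code is undef when the full name is not in the table.
    sub division {
        my ($self, %opts) = @_;
        my $name = $self->{division};
        return $name unless $opts{-code};
        return $DIVISION_CODE{$name};
    }

    # The separate-method variant simply delegates to the option.
    sub division_code {
        my ($self) = @_;
        return $self->division(-code => 1);
    }
}

my $node = MockNode->new(-division => 'Primates');
print $node->division, ' / ', $node->division(-code => 1), "\n";
```

Either variant only works when the data source can supply the mapping, which is the caveat raised above about entrez.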
>>> The names->id solution also stores the artificially uniqued names
>>> 'Craniata <chordata>', allowing you for the first time to retrieve
>>> the correct id. Previously the search would have simply failed
>>> The names->id solution now handles nodes with scientific names of
>>> (class)', allowing you to retrieve the id with both get_taxonids
>>> and get_taxonids('xyz (class)'). Previously only the latter would
>> Should angle brackets be allowed too?
> Allowed in what sense? You can indeed search for both
> get_taxonids('Craniata <chordata>') [returns a single id] and
> get_taxonids('Craniata') [returns multiple ids, one of which is the
> previous answer].
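
One way to get that behaviour from a names->ids index, sketched in plain Perl. The functions add_name() and lookup_ids() are hypothetical stand-ins (not the flatfile module's code), and the ids and the second qualifier are invented:

```perl
use strict;
use warnings;

# Hypothetical index: name => array ref of ids.
my %ids_for_name;

# Index a name under its full, uniqued form and, if it carries a
# '<qualifier>' suffix, under the bare name as well.
sub add_name {
    my ($name, $id) = @_;
    push @{ $ids_for_name{$name} }, $id;
    if ($name =~ /^(.+?)\s*<[^>]+>$/) {
        push @{ $ids_for_name{$1} }, $id;
    }
    return;
}

# Return all ids stored for a name, or an empty list.
sub lookup_ids {
    my ($name) = @_;
    return @{ $ids_for_name{$name} || [] };
}

add_name('Craniata <chordata>', 1001);    # invented id
add_name('Craniata <other>',    1002);    # invented qualifier and id

my @exact = lookup_ids('Craniata <chordata>');
my @bare  = lookup_ids('Craniata');
print scalar(@exact), " id for the uniqued name, ",
      scalar(@bare), " for the bare name\n";
```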
>> Maybe there should also be a -names parameter which accepts a hash
>> reference with keys being the kind of name (scientific, common, etc)
>> and the values being array references with the set of names of that
> Not sure what you mean. name() has that data structure, though you're
> not supposed to set its hash ref directly.
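
For concreteness, a sketch of what a -names parameter of that shape could look like. MockNamesNode is hypothetical; this is not how Bio::Taxonomy::Node is implemented, and the example names are just sample data:

```perl
use strict;
use warnings;

{
    package MockNamesNode;    # hypothetical class, for illustration only

    sub new {
        my ($class, %args) = @_;
        my $self  = bless { names => {} }, $class;
        my $names = $args{-names} || {};
        # Copy each array so the caller's hash ref is never stored
        # and cannot be used to alter the node afterwards.
        for my $kind (keys %$names) {
            $self->{names}{$kind} = [ @{ $names->{$kind} } ];
        }
        return $self;
    }

    # Read-only accessor: all names of one kind, or an empty list.
    sub name {
        my ($self, $kind) = @_;
        return @{ $self->{names}{$kind} || [] };
    }
}

my $node = MockNamesNode->new(
    -names => {
        scientific => ['Homo sapiens'],
        common     => ['human', 'man'],
    },
);
print join(', ', $node->name('common')), "\n";
```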
>>> or the $node->classification() array.
>> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
>> brought over from a flawed (because flat) object model in
> Yes, I agree.
>>> NOTE: entrez modules (and website) cannot cope with '<something>' in
>>> the query, failing searches like 'Craniata <chordata>'. For this
>>> reason, if get_taxonids() is given a query with '<something>' it will
>>> return undefined, saving a pointless website access.
>> If there is a 'next-best-thing' that is still semantically compatible
>> with the API documentation, I would do that.
>> In this case, if there is a <something> in the query the entrez
>> module should strip it and automatically use the rest for searching.
>> If indeed multiple IDs match there should be a warning to inform the
>> user that entrez cannot use the <something> notation to limit the
>> query results.
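
A rough sketch of the strip-and-warn behaviour proposed here. The functions split_query() and search_stripped() are hypothetical, and the remote lookup is represented by a code ref supplied by the caller rather than any real entrez call:

```perl
use strict;
use warnings;

# Split "name <qualifier>" into (name, qualifier); the qualifier is
# undef when the query has no '<...>' part.
sub split_query {
    my ($query) = @_;
    if ($query =~ /^\s*(.*?)\s*<([^>]+)>\s*$/) {
        return ($1, $2);
    }
    return ($query, undef);
}

# Search with the qualifier stripped; warn if it was present but the
# result is still ambiguous. $search is a stand-in for the remote lookup.
sub search_stripped {
    my ($query, $search) = @_;
    my ($name, $qualifier) = split_query($query);
    my @ids = $search->($name);
    if (defined $qualifier && @ids > 1) {
        warn "cannot use <$qualifier> to limit results for '$name'\n";
    }
    return @ids;
}
```

The warning fires only when the qualifier was given and more than one id came back, so unqualified queries stay silent.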
> I wouldn't like this. I actually had it working this way initially,
> decided that if someone entered 'xyz <something>' they really didn't
> want multiple ids, expected to get multiple ids with just 'xyz' and
> don't want their query made something else and then be warned about
>> In fact, you might as well provide an option to enable an automatic
>> check for the correct branch for each ID if multiple ones are
>> returned. I.e., if this option is enabled, the module would
>> automatically query the parent nodes to see if <something> is in the
>> lineage, and if not will remove the respective ID from the result
>> set. The reason you may want to make it optional is because it
>> potentially costs time. (but in reality I'm not sure why a client
>> will not want to enable the option - so maybe this should even be
> I can certainly add that, it seems like a good idea. I don't, however,
> see any scope for an option at all. What would the option be called?
> -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless,
> imho. If the user queries 'xyz <something>' with that option, they're
> just going to have to do for themselves manually what the method would
> have done for them without that option, in order to get the correct
> answer. It'll be slower that way, if anything. So the option would
> actually be called
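
The lineage check itself could look roughly like this. The toy parent table stands in for whatever the database returns; the ids and the simplified tree are invented, and lineage_has() and filter_by_lineage() are hypothetical names:

```perl
use strict;
use warnings;

# Toy tree: id => [name, parent id]. Ids are invented.
my %node = (
    1 => ['root',        undef],
    2 => ['Chordata',    1],
    3 => ['Brachiopoda', 1],
    4 => ['Craniata',    2],
    5 => ['Craniata',    3],
);

# True if any ancestor of $id is named $qualifier (case-insensitive).
sub lineage_has {
    my ($id, $qualifier) = @_;
    my $parent = $node{$id}[1];
    while (defined $parent) {
        return 1 if lc($node{$parent}[0]) eq lc($qualifier);
        $parent = $node{$parent}[1];
    }
    return 0;
}

# Keep only the ids whose lineage contains the qualifier.
sub filter_by_lineage {
    my ($qualifier, @ids) = @_;
    return grep { lineage_has($_, $qualifier) } @ids;
}

my @kept = filter_by_lineage('chordata', 4, 5);
print "kept: @kept\n";
```

Against a remote source each step up the tree may cost a request, which is the time concern mentioned above.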
>>> classification() has a proper solution to finding the classification
>>> when the array wasn't manually set.
>>> # Improvements
>>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common').
>>> Now it is an alias to name('scientific').
>>> NOTE: node_name is what is set when ->new(-name => $name) is called, so
>>> flatfile and entrez and user-created nodes now implicitly associate the
>>> name of the node they create with its scientific name.
>> I'm not even sure node_name() shouldn't just be deprecated. The method
>> falsely suggests that there is only a single and definitive name for
>> the taxon node.
>> In NCBI reality, this is only true for the scientific name of the
>> node. In real reality, many nodes have multiple scientific names -
>> taxonomy isn't static and therefore the scientific naming of nodes
>> isn't either.
> For the programmer not using any database but just making up his own
> nodes, I think he needs a node_name() because he may not be thinking
> about anything fancy or realistic. He just wants to give his node a
> single name that he invents. node_name() seems like the ideal method
> name to me.
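
To make the aliasing concrete, a small sketch with a hypothetical MockAliasNode (not the real node class). It assumes node_name() reads the first scientific name, which is exactly the kind of arbitrary choice that shows up once a node carries more than one:

```perl
use strict;
use warnings;

{
    package MockAliasNode;    # hypothetical class, for illustration only

    sub new {
        my ($class, %args) = @_;
        my $self = bless { names => {} }, $class;
        $self->node_name($args{-name}) if defined $args{-name};
        return $self;
    }

    # name(kind) reads; name(kind, @names) replaces the names of that kind.
    sub name {
        my ($self, $kind, @names) = @_;
        $self->{names}{$kind} = [@names] if @names;
        return @{ $self->{names}{$kind} || [] };
    }

    # Alias to name('scientific'): setting replaces the scientific names,
    # getting returns only the first one.
    sub node_name {
        my ($self, @name) = @_;
        my ($first) = $self->name('scientific', @name);
        return $first;
    }
}

my $node = MockAliasNode->new(-name => 'Homo sapiens');
print $node->node_name, "\n";

# With several scientific names, node_name() silently picks the first.
$node->name('scientific', 'Name one', 'Name two');
print $node->node_name, "\n";
```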
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :