[Bioperl-l] Bio::*Taxonomy* changes
Hilmar Lapp
hlapp at gmx.net
Mon Jul 24 08:27:28 EDT 2006
:-) I think we're largely in agreement. As for node_name() I fully
understand the motivation, but it needs to be understood that the
attribute's value will be based on a largely arbitrary choice unless
it is set directly by the user.
-hilmar
On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
>>
>>> Bio::DB::Taxonomy::flatfile
>>> ---------------------------
>>> [...]
>>>
>>> BEHAVIOUR-CHANGE: flatfile used to store the division within the
>>> nodes it makes as a three-letter code, like 'PRI'. However, for
>>> consistency with entrez and with the scientific_name() of the node
>>> the division is supposed to correspond to, it is now stored as the
>>> full name, like 'Primates'.
>>
>> What about adding a method division_code() which would return the
>> 3-letter abbreviation?
>>
>> The abbreviation may be needed by flat-file writers, so it may be
>> handy to have in some cases.
>
> As far as I know you can't get the 3-letter version via entrez, so no
> other module can really expect to be able to get it, not knowing which
> database (flatfile.pm or entrez.pm) the taxonomic information is
> coming from.
>
> But of course it would be somewhat harmless to add division_code()
> anyway. It might be better done as a -code => 1 option to division()?
>
>
>>> The names->id solution also stores the artificially uniqued names
>>> like 'Craniata <chordata>', allowing you for the first time to
>>> retrieve the correct id. Previously the search would have simply
>>> failed completely.
>>>
>>> The names->id solution now handles nodes with scientific names of
>>> 'xyz (class)', allowing you to retrieve the id with both
>>> get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only
>>> the latter would work.
>>
>> Should angle brackets be allowed too?
>
> Allowed in what sense? You can indeed search for both
> get_taxonids('Craniata <chordata>') [returns a single id] and
> get_taxonids('Craniata') [returns multiple ids, one of which is the
> previous answer].
>
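A minimal sketch of the lookup behaviour described in this exchange, written in Python purely for illustration. It is not BioPerl code and makes no claim about how flatfile.pm is implemented. The sample names and ids, build_name_index() and the Python get_taxonids() helper are all hypothetical; the assumption is that each name is indexed both in full and with its trailing '<...>' or '(...)' qualifier stripped.

```python
import re
from collections import defaultdict

# Hypothetical (scientific name, taxon id) pairs; names and ids are invented.
NODES = [
    ("Craniata <chordata>", 1),
    ("Craniata <other>", 2),
    ("xyz (class)", 3),
]


def build_name_index(nodes):
    """Index every name in full and, if qualified, also by its bare form."""
    index = defaultdict(list)
    for name, taxon_id in nodes:
        index[name].append(taxon_id)
        # Strip one trailing ' <...>' or ' (...)' qualifier to get the bare name.
        bare = re.sub(r"\s*(<[^<>]*>|\([^()]*\))$", "", name)
        if bare != name:
            index[bare].append(taxon_id)
    return index


def get_taxonids(index, name):
    """Return all ids stored under this exact name (empty list if none)."""
    return list(index.get(name, []))
```

Under these assumptions a qualified query such as 'Craniata <chordata>' resolves to a single id, while the bare 'Craniata' returns every id that shares that name.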
>
>> Maybe there should also be a -names parameter which accepts a hash
>> reference with keys being the kind of name (scientific, common, etc)
>> and the values being array references with the set of names of that
>> kind?
>
> Not sure what you mean. name() has that data structure, though you're
> not supposed to set its hash ref directly.
>
>
>>> or the $node->classification() array.
>>
>> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
>> brought over from a flawed (because flat) object model in
>> Bio::Species.
>
> Yes, I agree.
>
>
>>> NOTE: entrez modules (and website) cannot cope with '<something>' in
>>> the query, failing searches like 'Craniata <chordata>'. For this
>>> reason, if get_taxonids() is given a query with '<something>' it
>>> will immediately return undefined, saving a pointless website
>>> access.
>>
>> If there is a 'next-best-thing' that is still semantically compatible
>> with the API documentation, I would do that.
>>
>> In this case, if there is a <something> in the query the entrez
>> module should strip it and automatically use the rest for searching.
>> If indeed multiple IDs match there should be a warning to inform the
>> user that entrez cannot use the <something> notation to limit the
>> query results.
>
> I wouldn't like this. I actually had it working this way initially,
> but decided that if someone entered 'xyz <something>' they really
> didn't want multiple ids, expected to get multiple ids only with plain
> 'xyz', and wouldn't want their query turned into something else and
> then be warned about it.
>
>
>> In fact, you might as well provide an option to enable an automatic
>> check for the correct branch for each ID if multiple ones are
>> returned. I.e., if this option is enabled, the module would
>> automatically query the parent nodes to see if <something> is in the
>> lineage, and if not will remove the respective ID from the result
>> set. The reason you may want to make it optional is because it
>> potentially costs time. (but in reality I'm not sure why a client
>> will not want to enable the option - so maybe this should even be
>> default)
>
> I can certainly add that, it seems like a good idea. I don't, however,
> see any scope for an option at all. What would the option be called?
> -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless,
> imho. If the user queries 'xyz <something>' with that option, they're
> just going to have to do for themselves manually what the method would
> have done for them without that option, in order to get the correct
> answer. It'll be slower that way, if anything. So the option would
> actually be called
> -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower
> (!).
>
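The lineage check discussed above could look roughly like the following Python sketch. It is an illustration only, not BioPerl or entrez code: the toy taxonomy, its ids, lineage_names() and filter_by_qualifier() are hypothetical. It assumes the stripped query has already returned several candidate ids and that each node's parent can be looked up.

```python
# Hypothetical toy taxonomy: id -> (scientific name, parent id or None).
TAXA = {
    10: ("root", None),
    20: ("Chordata", 10),
    21: ("Craniata", 20),
    30: ("Otherphylum", 10),
    31: ("Craniata", 30),
}


def lineage_names(taxa, taxon_id):
    """Return the lower-cased names of all ancestors of taxon_id."""
    names = []
    parent = taxa[taxon_id][1]
    while parent is not None:
        name, next_parent = taxa[parent]
        names.append(name.lower())
        parent = next_parent
    return names


def filter_by_qualifier(taxa, candidate_ids, qualifier):
    """Keep only the candidates that have the qualifier in their lineage."""
    wanted = qualifier.lower()
    return [i for i in candidate_ids if wanted in lineage_names(taxa, i)]
```

Each candidate costs one parent lookup per ancestor, which is the time overhead mentioned in the suggestion; against a remote service every such lookup could be a separate request.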
>
>>> Bio::Taxonomy::Node
>>> -------------------
>>> [...]
>>> classification() has a proper solution to finding the classification
>>> when the array wasn't manually set.
>>>
>>> # Improvements
>>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common').
>>> Now it is an alias to name('scientific').
>>> NOTE: node_name is what is set when ->new(-name => $name) is called,
>>> so flatfile and entrez and user-created nodes now implicitly
>>> associate the name of the node they create with its scientific name.
>>
>> I'm not even sure node_name() shouldn't just be deprecated. The method
>> falsely suggests that there is only a single and definitive name for
>> the taxon node.
>>
>> In NCBI reality, this is only true for the scientific name of the
>> node. In real reality, many nodes have multiple scientific names -
>> taxonomy isn't static and therefore the scientific naming of nodes
>> isn't either.
>
> For the programmer not using any database but just making up his own
> nodes, I think he needs a node_name() because he may not be thinking
> about anything fancy or realistic. He just wants to give his node a
> single name that he invents. node_name() seems like the ideal method
> name to me.
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================