[Biopython-dev] [Bug 3045] TreeMixin, please define enumerator and other convenience methods

Wed Apr 7 20:15:27 UTC 2010

http://bugzilla.open-bio.org/show_bug.cgi?id=3045

------- Comment #4 from eric.talevich at gmail.com  2010-04-07 16:15 EST -------
(In reply to comment #0)
>
> The usage patterns for, say, setting a phyloXML property from an array prop_arr
> should look something like:
> 
> [node.set_property(prop_arr[i], *prop_params,  **prop_keywords) 
>             for i, node in tree.enumerate_internals()]

How about:

[node.properties.append(PX.Property(prop_arr[i], *prop_args, **prop_kwargs[i]))
 for i, node in enumerate(tree.find_clades(terminal=False))]

Is there something different about this form versus your example above that
hurts performance?

> The three issues that frustrate such concision are
> (1) internal nodes, terminal nodes, and all nodes are not currently
> on an equal footing with respect to methods

For your usage it might be faster to use the generators:
find_clades(terminal-False), find_clades(terminal=True), find_clades()

I'm considering renaming 'find_clades' to 'find', and 'find' to 'find_any' --
would the shorter name make your code a little cleaner?

We could also have 'get_nonterminal' and 'get_all_clades' -- I'm not so sure
that the last one is useful enough to justify cluttering the API further; what
do you think? (I actually balked at add get_terminals() originally, since it's
so simple.)

> (2) there are no enumerator methods

Doesn't the enumerate() function work just as well, or even better, with the
functional/array-oriented programming style you're using? The find_* methods
return lazily-evaluated iterables to enable this kind of usage in a
memory-efficient way.

> Here I give some convenience methods that I wish were defined in TreeMixin.  I
> have tested them as standalone methods.  I hope you'll see fit to include them
> at some point.
> 
> def count_internals(self):
>     """Counts the number of  non-terminal (internal) nodes within this tree."""
>     return [i for i,e in enumerate_internals(self)][-1] + 1

I can add a convenience function that would help:

def iterlen(items):
    for i, x in enumerate(items):
        count = i
    return count + 1

Then count_internals(tree) is the same as:
iterlen(tree.find_clades(terminal=False))

Or, if we add get_nonterminals() it's easy:
len(tree.get_nonterminals())

> def enumerate_internals(self):
>     """Returns an enumerator of non-terminal clades"""
>     return  enumerate(self.find_clades(terminal=False))
> 
> def enumerate_terminals(self):
>     """Returns an enumerator of terminal clades"""
>     return  enumerate(self.find_clades(terminal=True))
> 
> def enumerate_all(self):
>     """Returns an enumerator on all clades"""
>     return  enumerate(self.find_clades())

I can see why these are handy in your own code, because you're using them a
lot, but I don't think they introduce enough new functionality to justify
adding more methods to TreeMixin.

> Less critical but still useful are the following two methods (and one private
> utility) that I find useful for operations on trees:
> 
> def is_semipreterminal(self):
>     """True if any direct descendent is terminal."""
>     if self.root.is_terminal():
>         return False
>     for clade in self.clades:
>         if clade.is_terminal():
>             return True
>         return False

Is semipreterminal a standard name for nodes like this?

In Python 2.5 and later, you could also do:
any(clade.is_terminal() for clade in self)

> def terminal_neighbor_dists(self):
>     """Return a list of distances between adjacent terminals"""
>     return [self.distance(*i) for i in
> _generate_pairs(self.find_clades(terminal=True))]
> 
> def _generate_pairs(self):
>     import itertools
>     pairs = itertools.tee(self)
>     pairs[1].next()
>     return itertools.izip(pairs[0], pairs[1])

Interesting. Getting philosophical -- I don't intend for TreeMixin to have a
built-in method for every possible use case, but one of my goals in Bio.Phylo
is to provide all the low-level functionality necessary so that when you do
have to write your own function to do something special, it doesn't take much
new code. So, I'm quite pleased that you were able to implement this
functionality for yourself in just 4 lines of (non-scaffolding) code.

The biggest weakness in Bio.Phylo from my viewpoint is that most of the
TreeMixin methods do some portion of a full-tree search every time they are
called -- there's no internal lookup table. So to make more efficient
algorithms possible, I added some methods that do as much as possible in one
shot. Example: rather than a distance_to(node) method, we have
TreeMixin.depths() which returns a dictionary of all nodes mapped to their
respective total branch lengths from the root.

What other whole-tree operations along this philosophy would you like to see
implemented? Some ideas:

- heights() -- like depths(), but mapping each node to the distance to the
nearest (or farthest?) terminal

- names() -- map each clade name to the clade instance. Clades with no name
won't be in the dictionary.

Each of these could take a 'target' specification like get_path does, so you
can restrict the result to a specific set of clades (e.g. terminals).

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.