[Biopython-dev] Code review request for phyloxml branch

Fri Sep 25 06:08:56 EDT 2009

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
>> supported for formats that can represent multiple phylogenetic trees in a
>> single file". Is that true, and if so why? For SeqIO and AlignIO you can
>> use parse on a file with one entry, the iterator just returns one entry.
>> This is important for allowing generic code (e.g. a loop) regardless of
>> how many entries there are (one, many, or even zero).
>
> I'll delete that sentence. I don't know why it's there -- you're right, it's
> easy to return an iterable regardless of what the format itself supports.

OK.

>> On a more general note, you seem to be recreating the file/handle logic
>> in each of the individual parsers. I think it would make much more sense
>> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
>> and Bio.TreeIO.write() functions *only* and have the underlying format
>> specific code just use handles. This avoids the code duplication.
>
> I did the handle management case-by-case because some of the underlying
> libraries already do filename-to-handle conversion -- ElementTree and
> Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
> ad-hoc handle management, but of course I can move it all to the top if you
> think it's best.

Having a single layer of handle/filename conversion in Bio.TreeIO does
seem cleanest to me (even if some of the back ends allow either) and
will ensure our code is consistent.

> One day, perhaps we'll have a context manager that we can
> reuse everywhere to make magic easy:
>
> with maybe_open(file) as handle:
>   tree = FooIO.parse(handle)
>
> Not today, though.

Not yet, no. For one thing we'll have to phase out Python 2.4 support.

>>> (1) 'phyloxml' uses a different object representation than the other two,
>>> so converting between those formats is not possible until Nexus.Trees
>>> is ported over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
>> would actually let you do phyloxml -> newick, and phyloxml -> nexus
>> (and assuming that phyloxml allows very minimal trees, the reverse
>> as well). It does look like the best plan is to use the same tree objects
>> for all three (updating Bio.Nexus if possible).
>
> I could comment out the 'nexus' and 'newick' lines from the
> supported_formats dict. That would disable the top-level functions
> but leave the direct NexusIO and NewickIO equivalents intact until
> the port is complete.

I guess shipping a "phyloxml" only Bio.TreeIO would work, but it
would be rather less useful. We could certainly start with just that
on the trunk (i.e. initially no Bio.TreeIO.NewickIO and also no
Bio.TreeIO.NexusIO modules - initially have just a single backend).

>> Note that Bio.Nexus.Trees still has some useful methods you don't
>> appear to support, like finding the last common ancestor and
>> distances between nodes.
>
> That's intentional, I was just going to port those methods directly from
> Bio.Nexus.Trees rather than invent a new API myself.

OK - sounds good.

> Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
> combined parsers and object representations. My goal is to chop out the
> pure-object parts and merge them into Bio.Tree, and let the remaining
> parsers return objects built from the new Bio.Tree classes. This looks like
> it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
> done.

Sounds good - as with Bio.SeqIO and Bio.AlignIO, one of the goals has
been to separate the data object from the (many possible) parsers.

> For backward compatibility, I'll leave some wrappers that trigger
> DeprecationWarnings in the original places. Nexus.Trees can
> probably be reduced ...

Something like that, sure.

>>> (1) The find() function, named after the Unix utility that does the
>>> same thing for directory trees, seems capable of all the iteration
>>> and filtering necessary for locating data and automatically adding
>>> annotations to a tree. There's a 'terminal' argument for selecting
>>> internal nodes, external nodes, or both, and I think this means
>>> get_leaf_nodes() is unnecessary. I'm going to remove it if no one
>>> protests.
>>
>> I'm in two minds - iterating over the leaves (taxa) seems like a very
>> common operation, and having an explicit method for this might be
>> clearer than calling find with special arguments.
>
> I think .find(terminal=True) will do the right thing and looks reasonably
> simple, but as Brad said, this is a ridiculously common operation so
> finding it in the API should be ridiculously easy. I'll rename this function
> to get_leaves() and rename find() to findall() (to match ElementTree
> and make it clear that it returns an iterable).

OK.

Peter