[Biopython-dev] Code review request for phyloxml branch

Thu Sep 24 12:08:00 UTC 2009

Eric and Peter;
Looking forward to seeing the PhyloXML work merged into the main
branch. Eric, thanks for posting the summary of where things are at.

> > (1) 'phyloxml' uses a different object representation than the other two, so
> > converting between those formats is not possible until Nexus.Trees is ported
> > over to Bio.Tree.
> 
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
> that phyloxml allows very minimal trees, the reverse as well). It does look
> like the best plan is to use the same tree objects for all three (updating
> Bio.Nexus if possible).

Agreed that this would be nice to have, but I'm not sure why it's
blocking getting the base TreeIO framework and all of PhyloXML into
the main branch. That's a major step forward from the format
specific phylogenetic code we had before and gets us a portion of
the way there.

Next up should be moving over Bio.Nexus to the new framework and
then conversions, but this is another project. I think we should
take this one step at a time.

> Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.

Agreed. As we move Nexus over, we should be sure to keep current
functionality.

> > (1) The find() function, named after the Unix utility that does the same
> > thing for directory trees, seems capable of all the iteration and filtering
> > necessary for locating data and automatically adding annotations to a tree.
> > There's a 'terminal' argument for selecting internal nodes, external nodes,
> > or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> > to remove it if no one protests.
> 
> I'm in two minds - iterating over the leaves (taxa) seems like a very
> common operation, and having an explicit method for this might be
> clearer than calling find with special arguments.

I'm for keeping it as well, and just having the underlying
implementation of get_leaf_nodes call find with the right arguments.
This seems like an operation that should be dead obvious to do.

> > (3) I left room in each Node for the left and right indexes used by BioSQL's
> > nested-set representation. Now I'm doubting the utility of that -- any
> > Biopython function that uses those indexes would need to ensure that the
> > index is up to date, which seems tricky. Shall I remove all mention of the
> > nested-set representation, or try to support it fully?

Again I agree with Peter here -- this would be best supported as a
subclass that is database aware with an identical API, similar to
how the Seq objects and BioSQL Seq objects work. This avoids any
overhead for the in-memory case, which will be more common, but
gives you a point to implement the useful database representation
code in the future. If you don't have time to work on all of this
right now, I'd leave the nested-set stuff out and keep it in mind as
a future addition.

Brad