[Biopython-dev] Code review request for phyloxml branch

Peter biopython at maubp.freeserve.co.uk
Thu Sep 24 05:57:12 EDT 2009


On Thu, Sep 24, 2009 at 4:48 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Discussion:
>
> *TreeIO*
> The read, parse, write and convert functions work essentially the same as in
> SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

Great.

One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).

On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.

[In fact, as I have said before, I prefer the simplicity of just allowing
handles - and we should make TreeIO and SeqIO/AlignIO consistent]

> (1) 'phyloxml' uses a different object representation than the other two, so
> converting between those formats is not possible until Nexus.Trees is ported
> over to Bio.Tree.

I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).

Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.

> (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> make the original Nexus module write out trees that it didn't parse itself.
> Help?

To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative is to look at is having
a hard coded nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.

> *Tree
> *The BaseTree module is meant to be the basis for Newick trees eventually,
> so I'd like to get the design right with the minimum number of public
> methods:
>
> (1) The find() function, named after the Unix utility that does the same
> thing for directory trees, seems capable of all the iteration and filtering
> necessary for locating data and automatically adding annotations to a tree.
> There's a 'terminal' argument for selecting internal nodes, external nodes,
> or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> to remove it if no one protests.

I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.

> (2) Should find() be based on depth_first_search or breadth_first_search
> (not checked in yet)? DFS would potentially find a leaf node faster, but BFS
> seems more common in phylogenetics. Note that iteration can easily be
> reversed with the standard reversed() function, so we don't need extra
> functions for those cases.

You could do both, either via an argument or having two methods, say
depth_fist_search and breadth_first_search instead of find.

> (3) I left room in each Node for the left and right indexes used by BioSQL's
> nested-set representation. Now I'm doubting the utility of that -- any
> Biopython function that uses those indexes would need to ensure that the
> index is up to date, which seems tricky. Shall I remove all mention of the
> nested-set representation, or try to support it fully?

A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it be
possible via a subclass?).

On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.

> (4) There's some mention in the literature of a relationship-matrix
> representation for phylogenies. Does anyone here know how to work with this
> representation, or know if it would let us perform complex calculations with
> blinding speed behind the scenes? If so, should there be a function in
> Bio.Tree.Utils to export a tree to a NumPy array represented this way?  If
> not, I'll forget about it.

I don't know.

> *Graphics*
> I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
> nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
> usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
> nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
> in the way biologists are used to seeing them. Does anyone know of existing
> code that could be borrowed for this? I looked at ETE (announced on the main
> biopython list last week) and liked the examples, but it uses PyQt4 and a
> standalone GUI for display, which is a substantial departure from the
> Biopython way of doing things.

I still haven't tracked down my old report lab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...

Peter



More information about the Biopython-dev mailing list