[Biopython-dev] Code review request for phyloxml branch

Fri Sep 25 04:34:17 UTC 2009

Hi Peter,

Thanks for the feedback.

On Thu, Sep 24, 2009 at 5:57 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

>
> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
> supported for formats that can represent multiple phylogenetic trees in a
> single file". Is that true, and if so why? For SeqIO and AlignIO you can
> use parse on a file with one entry, the iterator just returns one entry.
> Easy.
> This is important for allowing generic code (e.g. a loop) regardless of
> how many entries there are (one, many, or even zero).
>
>
I'll delete that sentence. I don't know why it's there -- you're right, it's
easy to return an iterable regardless of what the format itself supports.

On a more general note, you seem to be recreating the file/handle logic
> in each of the individual parsers. I think it would make much more sense
> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
> and
> Bio.TreeIO.write() functions *only* and have the underlying format specific
> code just use handles. This avoids the code duplication.
>
>
I did the handle management case-by-case because some of the underlying
libraries already do filename-to-handle conversion -- ElementTree and
Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
ad-hoc handle management, but of course I can move it all to the top if you
think it's best. One day, perhaps we'll have a context manager that we can
reuse everywhere to make magic easy:

with maybe_open(file) as handle:
   tree = FooIO.parse(handle)

Not today, though.

> (1) 'phyloxml' uses a different object representation than the other two,
> so
> > converting between those formats is not possible until Nexus.Trees is
> ported
> > over to Bio.Tree.
>
>
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
> would
> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
> that phyloxml allows very minimal trees, the reverse as well). It does look
> like the best plan is to use the same tree objects for all three (updating
> Bio.Nexus if possible).
>
>
I could comment out the 'nexus' and 'newick' lines from the
supported_formats dict. That would disable the top-level functions but leave
the direct NexusIO and NewickIO equivalents intact until the port is
complete.

Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.
>
> That's intentional, I was just going to port those methods directly from
Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
combined parsers and object representations. My goal is to chop out the
pure-object parts and merge them into Bio.Tree, and let the remaining
parsers return objects built from the new Bio.Tree classes. This looks like
it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
done.

For backward compatibility, I'll leave some wrappers that trigger
DeprecationWarnings in the original places. Nexus.Trees can probably be
reduced to:

import warnings
warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)

from Bio.Tree.Newick import *
from Bio.TreeIO.NewickIO import *

(more or less)

> (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> > make the original Nexus module write out trees that it didn't parse
> itself.
> > Help?
>
> To get the Newick tree, you can just call str(tree), which is basically
> what
> you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
> more complicated. You'll need to create a minimal Nexus file - have a
> look at the Bio.AlignIO.NexusIO code. An alternative is to look at is
> having
> a hard coded nexus template, and just insert the tree as a Newick string
> (and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
>
>
OK, thanks, I'll give it a shot. I see some default Nexus template stuff in
Bio.Nexus.Nexus already.

> > *Tree
> > *The BaseTree module is meant to be the basis for Newick trees
> eventually,
> > so I'd like to get the design right with the minimum number of public
> > methods:
> >
> > (1) The find() function, named after the Unix utility that does the same
> > thing for directory trees, seems capable of all the iteration and
> filtering
> > necessary for locating data and automatically adding annotations to a
> tree.
> > There's a 'terminal' argument for selecting internal nodes, external
> nodes,
> > or both, and I think this means get_leaf_nodes() is unnecessary. I'm
> going
> > to remove it if no one protests.
>
> I'm in two minds - iterating over the leaves (taxa) seems like a very
> common operation, and having an explicit method for this might be
> clearer than calling find with special arguments.
>

I think .find(terminal=True) will do the right thing and looks reasonably
simple, but as Brad said, this is a ridiculously common operation so finding
it in the API should be ridiculously easy. I'll rename this function to
get_leaves() and rename find() to findall() (to match ElementTree and make
it clear that it returns an iterable).

> > (3) I left room in each Node for the left and right indexes used by
> BioSQL's
> > nested-set representation. Now I'm doubting the utility of that -- any
> > Biopython function that uses those indexes would need to ensure that the
> > index is up to date, which seems tricky. Shall I remove all mention of
> the
> > nested-set representation, or try to support it fully?
>
> A partial implementation doesn't seem helpful, and wastes memory
> allocating unused properties. I would remove it from the base Node,
> but a full implementation might be useful for something (would it be
> possible via a subclass?).
>
> On a related point, do you think a BioSQL TaxonTree subclass is possible?
> i.e. Something mimicking the new Tree objects (as a subclass), but which
> loads data on demand from the taxon tables in a BioSQL database? This
> would provide a nice way to work with the NCBI taxonomy (once loaded
> into BioSQL), which is a very large tree. For an example use case, I might
> want to extract just the bacteria as a subtree, and save that to a file.
>
>
Doing BioSQL integration was on the original roadmap, but research hasn't
taken me back there lately. I would like to do it eventually... anyway, that
would solve the indexing issue nicely. I'll drop the extra attributes -- I
get the impression they're not meant to be accessed directly in BioSQL
either, so there's no use for them in Biopython.

Cheers,
Eric