[Biopython-dev] Newick support in Bio.TreeIO?

Thu Jul 30 04:10:35 UTC 2009

On Wed, Jul 29, 2009 at 3:37 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Wed, Jul 29, 2009 at 6:59 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> >>
> >> Bio.TreeIO.parse() should be an iterator, returning complete tree
> >> objects one by one. I was thinking of having Bio.TreeIO.NewickIO
> >> just take a plain text file, split it up at the ";\n" characters (or
> >> similar) to get each tree as a string, which is passed to
> >> Bio.Nexus.Trees.Tree to parse it.
> >
> > OK, I did this.
> >
> > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py
>
> OK, I haven't run the code but have a couple of points.
>
> On a general point, are you intending to re-write parse and read
> functions for each tree format? For Bio.SeqIO all I do is write an
> iterator (i.e. a parse) function, and Bio.SeqIO.parse() and also
> Bio.SeqIO.read() call this.
>

If the top-level TreeIO read function returns just the first parsed tree and
raises a ValueError if 0 or >1 trees are available, then I can make the
wrappers simpler and reduce some code duplication.
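
Roughly what I have in mind, as a sketch only: the name read_one is
hypothetical, and it assumes each format's parser is just an iterator over
tree objects.

```python
def read_one(trees):
    """Return the single tree from an iterable; raise ValueError otherwise."""
    tree_iter = iter(trees)
    try:
        tree = next(tree_iter)
    except StopIteration:
        raise ValueError("No trees found.")
    try:
        next(tree_iter)
    except StopIteration:
        # Exactly one tree was available
        return tree
    raise ValueError("More than one tree found; use parse() instead.")
```

With this, each format wrapper would only need to supply the iterator.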

> The parsing code looks weird to me - but that is probably a style
> thing. Certainly I had to stare at it to work out what it was doing.
> It also has a bug - consider a Newick file containing one tree but
> with no trailing semi colon.
>

It is weird; I'll fix these issues in parse() and write(). (I only tested
with a small 2-tree file.) Style: The "foo and bar or baz" is a
Py2.4-friendly idiom that we can one day replace everywhere with the real
ternary expression syntax introduced in Py2.5: "bar if foo else baz". I've
been using it throughout my GSoC code, though it's not really necessary in
this function.
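
For anyone who hasn't met the idiom, a small comparison (the function names
are made up for illustration). The two forms agree only when the middle
value is truthy, which is one more reason to switch eventually:

```python
def pick_old(foo, bar, baz):
    # Py2.4-friendly idiom; falls through to baz whenever bar is falsy
    return foo and bar or baz


def pick_new(foo, bar, baz):
    # Real ternary expression, available from Py2.5 onwards
    return bar if foo else baz


print(pick_old(True, "x", "y"))        # x
print(pick_new(True, "x", "y"))        # x
print(pick_old(True, "", "y"))         # y   (the pitfall: bar is falsy)
print(repr(pick_new(True, "", "y")))   # ''
```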

Hilmar says there's supposed to be a terminal semicolon; I didn't check what
Biopython's parser does but I suppose this should duplicate that.
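
In the meantime, a lenient splitter could look something like the sketch
below. This is not the code in my branch; split_newick is a hypothetical
name, and it assumes no semicolons appear inside quoted labels or comments.

```python
def split_newick(text):
    """Yield each Newick tree string from text, without its semicolon.

    A final tree with no terminal semicolon is still yielded.
    """
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if chunk:  # skip the empty remainder after the last semicolon
            yield chunk


for tree_str in split_newick("(A,B);\n(C,(D,E))"):
    print(tree_str)
```

Each string would then be handed to the existing Newick tree parser.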

> >> > Sounds like an incremental parse() function over these trees would be
> >> > very useful for distributed bootstrap analysis etc.
> >>
> >> Exactly. And Bio.TreeIO.read() would be for the special case where
> >> the file format contains exactly one tree.
> >
> > PhyloXML has a top-level object that contains multiple phylogenies, plus
> > arbitrary 'other' data; PhyloXML.read() returns one of those objects
> > regardless of how many phylogenies it contains. Newick doesn't have a
> > top-level container, so returning one tree and raising a RuntimeError if
> > there isn't exactly one tree makes sense. But Nexus has a top-level
> > container with (potentially) a bunch of other info -- should
> > NexusIO.read() return the complete Nexus object, or just pretend to be a
> > Newick wrapper and behave that way?
>
>
> Ah. The top level information about all the trees may cause trouble
> for the TreeIO model I had in mind (which was *just* for trees). The
> advantage of this is a consistent API, the downside is certain file
> format specific things cannot be supported nicely. I think this balance
> has worked nicely for SeqIO and AlignIO to date. So:
> * Bio.TreeIO.read(...) would return one tree.
> * Bio.TreeIO.parse(...) would iterate over trees one by one.
> * Bio.TreeIO.write(...) would write trees out (ideally sequentially
> if the file format allows this).
>
> Note I am assuming it is possible to write a PhyloXML tree with
> minimal (empty) top level annotation? You would need to do this
> in order to convert from a Nexus or Newick tree to a (minimal)
> PhyloXML tree.
>
> So, based on how SeqIO and AlignIO work, I would expect Bio.TreeIO
> would only give you the trees - you'd not get the top level information.
> For parsing Nexus files, Bio.TreeIO would only give access to a
> subset of the data in a Nexus file - just the trees. In the same way,
> parsing a Nexus file with AlignIO only gives you the alignment. If
> you want any of the other data in a Nexus file, you have to use the
> Bio.Nexus module.
>
> If you (as a user) needed the top level annotation in a PhyloXML file,
> then I would say use Bio.PhyloXML (or whatever we are calling it)
> directly instead of Bio.TreeIO.
>

Within the last couple of weeks, I moved all of the PhyloXML I/O code to
Bio.TreeIO.PhyloXMLIO, and the tree class definitions to Bio.Tree.PhyloXML
-- so there is no Bio.PhyloXML module now, as far as imports and setup.py
are concerned. Unlike Nexus, a phyloXML file really doesn't contain anything
other than phylogenetic trees and their annotations, so I didn't see the
need to clutter the Bio namespace further.

Plan:
TreeIO has read(), parse(), write(), and possibly convert(), which behave
exactly like the corresponding AlignIO and SeqIO functions, but with trees.
Under Bio.TreeIO we have wrappers for other formats, and these wrappers may
have public functions that go beyond the shared TreeIO ones.
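
A sketch of how the shared functions might dispatch on the format name.
Everything here is a stand-in for illustration: the PARSERS and WRITERS
tables and the stub Newick functions are hypothetical, and the stubs pass
plain strings around instead of real tree objects.

```python
import io


def _newick_parse(handle):
    # Stand-in parser: yields raw tree strings rather than tree objects
    for chunk in handle.read().split(";"):
        chunk = chunk.strip()
        if chunk:
            yield chunk


def _newick_write(trees, handle):
    # Stand-in writer: one tree per line; returns the number written
    count = 0
    for tree in trees:
        handle.write(tree + ";\n")
        count += 1
    return count


PARSERS = {"newick": _newick_parse}
WRITERS = {"newick": _newick_write}


def parse(handle, fmt):
    if fmt not in PARSERS:
        raise ValueError("Unknown tree format: %r" % fmt)
    return PARSERS[fmt](handle)


def write(trees, handle, fmt):
    if fmt not in WRITERS:
        raise ValueError("Unknown tree format: %r" % fmt)
    return WRITERS[fmt](trees, handle)


def convert(in_handle, in_fmt, out_handle, out_fmt):
    # Trees stream from the parser straight into the writer
    return write(parse(in_handle, in_fmt), out_handle, out_fmt)


out = io.StringIO()
print(convert(io.StringIO("(A,B);(C,D)"), "newick", out, "newick"))  # 2
```

Adding a format would then mean registering one parser and one writer.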

In some cases this can lead to a specific read-like function that returns a
single object containing one or more trees, plus other tree-related
metadata. This function can either be called read() also, as it currently is
in PhyloXMLIO, or we could choose another name like load().

For basic tree access:

from Bio import TreeIO
tree = TreeIO.read('example.xml', 'phyloxml')
TreeIO.write([tree], 'example.nex', 'nexus')

For the connoisseur:

from Bio.TreeIO import PhyloXMLIO
phx = PhyloXMLIO.read('example.xml')
if phx.other:
    pass  # do something clever...

> Of course, in practice Nexus files may not be that big. I don't
> know if anyone uses them to store (for example) 1000 bootstrap trees.
> As Brad and I have noted before, spending time on refactoring Bio.Nexus
> is not the best use of your GSoC project time (plus we'd need to get
> Cymon and Frank much more involved, worry more about backwards
> compatibility etc).
>

This refactoring quest actually started because I was trying to figure out
an object model for BaseTree that could support PhyloDB, reuse the Nexus
tree methods with some resemblance to the original form, and still provide
useful base classes for phyloXML. That was holding up everything else -- but
I think it's under control now.

> I agree that tree drawing would be a nice addition to Bio.Graphics.
>
> But that code of mine as written would not be good enough. In the
> end it was a bit of a hack - it got the job done but had lots of special
> cases (e.g. to get colouring by species to work, and in particular the
> double bootstrap values caused me pain as I had to have two otherwise
> identical trees loaded). Even ignoring this, the basic code didn't use
> an object orientated approach which makes it a poor match to the
> rest of Bio.Graphics. Basically I would want to rewrite it from scratch
> before I felt it was fit for public reuse, and have never found the time.
>

Maybe it will be worth another shot after the Tree module settles down. If
networkx export comes easily this week, that may also take care of
visualization for some uses.

Cheers,
Eric