[Biopython-dev] Newick support in Bio.TreeIO?

Wed Jul 29 17:59:27 UTC 2009

On Wed, Jul 29, 2009 at 12:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Wed, Jul 29, 2009 at 4:49 PM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> >
> > Does this bear any resemblance to your plan?
>
> No - but probably only because I didn't fancy restructuring Bio.Nexus ;)
> We can already call the Newick tree parser directly, so it doesn't
> have to be moved (although we could do). [In case you hadn't seen
> it, the current version of the Tutorial has a tiny example using this
> at the end of a ClustalW example in the Alignment chapter.]
>
> Bio.TreeIO.parse() should be an iterator, returning complete tree
> objects one by one. I was thinking of having Bio.TreeIO.NewickIO
> just take a plain text file, split it up at the ";\n" characters (or
> similar) to get each tree as a string, which is passed to
> Bio.Nexus.Trees.Tree to parse it.
>

OK, I did this.

http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py
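
For readers without the branch checked out, the splitting step described
above can be sketched roughly as follows. This is a minimal illustration, not
the code at that link: the function name iter_newick_strings is hypothetical,
and it assumes no ";" appears inside quoted labels or comments.

```python
import io


def iter_newick_strings(handle):
    """Yield each ';'-terminated tree found in a text handle as one string.

    Hypothetical helper: assumes ';' never occurs inside a quoted label
    or a [comment], which a real parser would have to handle.
    """
    buf = ""
    for line in handle:
        buf += line.strip()
        # A single line may hold several trees, so keep splitting.
        while ";" in buf:
            tree, buf = buf.split(";", 1)
            if tree:
                yield tree + ";"
    if buf:
        # Trailing tree with no terminating ';' is passed through as-is.
        yield buf


handle = io.StringIO("(A,B,(C,D));\n((A,B),\n(C,D));\n(A,(B,C));(D,E);\n")
trees = list(iter_newick_strings(handle))
```

Each yielded string could then be handed to Bio.Nexus.Trees.Tree for parsing,
as suggested in the quoted text. Because only one tree string is held at a
time, memory use stays flat for large bootstrap files.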

> > Sounds like an incremental parse() function over these trees would be
> > very useful for distributed bootstrap analysis etc.
>
> Exactly. And Bio.TreeIO.read() would be for the special case where
> the file format contains exactly one tree.
>

PhyloXML has a top-level object that contains multiple phylogenies, plus
arbitrary 'other' data; PhyloXML.read() returns one of those objects
regardless of how many phylogenies it contains. Newick doesn't have a
top-level container, so returning one tree and raising a RuntimeError if
there isn't exactly one tree makes sense. But Nexus has a top-level
container with (potentially) a bunch of other info -- should NexusIO.read()
return the complete Nexus object, or just pretend to be a Newick wrapper and
behave that way?
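
To make the Newick case concrete, here is a rough sketch of read() layered on
top of any parse()-style iterator. The signature is an assumption for
illustration (it takes the iterator directly rather than a file), and the
error messages are made up.

```python
def read(trees):
    """Return the only tree from an iterable of trees.

    Sketch only: raises RuntimeError when the iterable holds zero trees
    or more than one, as proposed in the message above.
    """
    iterator = iter(trees)
    try:
        tree = next(iterator)
    except StopIteration:
        raise RuntimeError("There are no trees in this file.")
    try:
        next(iterator)
    except StopIteration:
        # Exactly one tree was present.
        return tree
    raise RuntimeError("There are multiple trees in this file; use parse().")


single = read(iter(["(A,B,(C,D));"]))
```

Only the first two items are ever pulled from the iterator, so a file with
thousands of trees fails fast instead of being parsed in full.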

> As far as I know, Bio.Nexus just parses a whole file in one go. This
> means either Bio.TreeIO.NexusIO would call this and then loop over
> the list (very memory inefficient), or it would need a minimal Nexus
> parser just to spot the TREES block, and handle them only.
>

That's what I pictured for a Bio.Nexus refactoring -- I'm not sure how to do
it in a memory-efficient way, though, given that there are
multiple types of blocks and they may be needed at different times. Maybe
make an initial pass to index the file at the block level, then call
incremental line-level parsers on the selected blocks? Or, simpler, factor
out the efficient line-level parsers so that they can be accessed separately
if need be -- basically the way Nexus._tree() works now -- and let the
block-level parsing code call those specific parsers.
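
The first option (index the blocks, then parse only the ones you want) might
look something like the sketch below. Everything here is hypothetical:
index_blocks and iter_block_lines are invented names, nothing is taken from
Bio.Nexus, and it assumes each "begin NAME;" and "end;" sits on its own line.

```python
import io


def index_blocks(handle):
    """Map lower-cased block name -> list of (start, end) body offsets.

    Assumes 'begin NAME;' and 'end;' each occupy a whole line. Uses
    readline() rather than iteration so that tell() stays usable.
    """
    index = {}
    name = None
    start = None
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        words = line.strip().rstrip(";").split()
        if not words:
            continue
        keyword = words[0].lower()
        if name is None and keyword == "begin" and len(words) > 1:
            name = words[1].lower()
            start = handle.tell()  # body starts after the 'begin' line
        elif name is not None and keyword in ("end", "endblock"):
            index.setdefault(name, []).append((start, pos))
            name = None
    return index


def iter_block_lines(handle, start, end):
    """Yield the lines of one indexed block body, one at a time."""
    handle.seek(start)
    while handle.tell() < end:
        line = handle.readline()
        if not line:
            break
        yield line.rstrip("\n")


text = (
    "#NEXUS\n"
    "begin taxa;\n"
    "  dimensions ntax=2;\n"
    "end;\n"
    "begin trees;\n"
    "  tree t1 = (A,B);\n"
    "  tree t2 = (B,A);\n"
    "end;\n"
)
handle = io.StringIO(text)
blocks = index_blocks(handle)
(start, end), = blocks["trees"]
tree_lines = [line.strip() for line in iter_block_lines(handle, start, end)]
```

The index pass touches every line once but keeps only offsets, so the cost of
holding the file in memory is paid only for blocks that are actually read.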

> >> I wrote some code in Python to do this bootstrapping step using the
> >> splits defined by each edge (i.e. the two sets of nodes you get if the
> >> edge was severed), which I represented using bit arrays, for use as
> >> keys in a dictionary mapping the splits to the master tree's edges.
> >
> > I would be interested to see this.
>
> I'm not actually sure where I put it... it should be on my old desktop
> at home somewhere. However, I can elaborate: in addition to NJ
> using quicktree, I also did parsimony bootstrap values, and drew my
> own colourful trees using reportlab. See the three supplementary
> figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0
>

Hey, neat. I was about to start a project involving kinases and response
regulators.
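
For the archive, the split-based bootstrap counting described in the quoted
text could be sketched along these lines. This is a guess at the general
idea, not Peter's code: trees are plain nested tuples of leaf names, a Python
int stands in for the bit array, and the dictionary maps each split of the
master tree to its support fraction rather than to an edge object.

```python
def clade_masks(tree, bits, out):
    """Return the leaf bitmask under `tree`; record each internal clade."""
    if isinstance(tree, str):
        return bits[tree]
    mask = 0
    for child in tree:
        mask |= clade_masks(child, bits, out)
    out.append(mask)
    return mask


def splits(tree, bits):
    """Return the set of non-trivial splits of `tree` as bitmasks."""
    full = (1 << len(bits)) - 1
    found = []
    clade_masks(tree, bits, found)
    result = set()
    for mask in found:
        # A split and its complement are the same bipartition, so always
        # keep the side that does not contain the first taxon (bit 0).
        if mask & 1:
            mask ^= full
        # Drop trivial splits: fewer than two leaves on either side.
        if bin(mask).count("1") >= 2 and bin(full ^ mask).count("1") >= 2:
            result.add(mask)
    return result


def bootstrap_support(master, replicates, taxa):
    """Map each split of the master tree to the fraction of replicates
    that contain the same split."""
    bits = {name: 1 << i for i, name in enumerate(taxa)}
    counts = dict.fromkeys(splits(master, bits), 0)
    total = 0
    for rep in replicates:
        total += 1
        for split in splits(rep, bits):
            if split in counts:
                counts[split] += 1
    return {split: n / total for split, n in counts.items()}


taxa = ["A", "B", "C", "D", "E"]
master = ((("A", "B"), "C"), ("D", "E"))
replicates = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
support = bootstrap_support(master, replicates, taxa)
```

Since replicates are consumed one at a time, this pairs naturally with an
incremental parse() over a large file of bootstrap trees.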

How much trouble was it to draw trees in reportlab? Do you think it would be
worth adding a tree-drawing module to Bio.Graphics?

Eric