[Biopython-dev] Newick support in Bio.TreeIO?

Wed Jul 29 11:49:22 EDT 2009

Hi Peter,

On Tue, Jul 28, 2009 at 12:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> If you wanted a good multi-tree example file format for TreeIO, I would
> suggest plain Newick trees. I am familiar with plain text files which
> contain one Newick tree per line (with a terminating semi-colon), although
> in principle they could be wrapped over many lines. The neighbour joining
> (NJ) tree software QuickJoin from Thomas Mailund can certainly output
> this kind of file. I would expect to be able to read and write such
> multi-tree Newick files using Bio.TreeIO.
>

I was wondering about this in regard to Bio.Nexus. It looks like the class
Bio.Nexus.Nexus calls its _tree method when it encounters a tree block in a
Nexus file, which corresponds to a tree in Newick format plus a short
preamble. The _tree method churns through the preamble, then passes a CharBuffer
(the Newick string) and some defaults to the Bio.Nexus.Trees.Tree
constructor, which does the Newick parsing and creates a Tree object.

After a quick glance at the original Nexus article/spec, it looks like the
format is a bundle of simpler formats for various applications; most of
these formats are unique to Nexus, but Newick is dropped into Nexus
completely intact. So! I'm proposing that the Newick parser, currently
stashed inside Bio.Nexus.Trees, be moved to Bio.TreeIO.NewickIO, and the
Nexus parser be changed to simply call the Newick parser from its new
location.

(A further refactoring of the Nexus parser would put the individual parsers
for each block in separate classes or files, rather than mingled with the
block-level parsing code. I can't guarantee I'll get around to that,
though.)

Does this bear any resemblance to your plan?

> The obvious application of this (which I have used personally) was to
> generate bootstrap trees on multiple machines in a cluster (or cores on
> a single machine), e.g. 100 instances each of 10 bootstrap trees, giving
> in total 1000 trees (which are then used either to build a consensus, or
> allocate bootstrap support to the randomised master tree).
>

Sounds like an incremental parse() function over these trees would be very
useful for distributed bootstrap analysis etc. I don't see how Bio.Nexus
currently supports this, though, beyond iterating over the 'trees'
attribute, which is a list. How would a reasonable person go about this?
Generate trees in Newick format rather than Nexus, run on the cluster,
combine, distill, and only save the resulting master tree in Newick format
(or even phyloXML)? If the Newick parser is separated from Nexus, then this
wouldn't be too difficult to support.
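
To make the incremental idea concrete, here is a rough sketch of the kind
of generator I have in mind. It only splits a multi-tree file into
individual Newick strings; the real parsing would still be done by whatever
the relocated Newick parser turns out to be. The name iter_newick_strings
is made up for illustration, and the sketch assumes that no semi-colon
appears inside a quoted label or a [comment].

```python
import io


def iter_newick_strings(handle):
    """Yield one Newick string at a time from a file-like object.

    Assumptions: every tree ends with ';', and no ';' occurs inside a
    quoted label or a [comment]. A tree may wrap over several lines.
    """
    buf = []
    for line in handle:
        # A line may hold the end of one tree and the start of the next.
        parts = line.strip().split(";")
        for part in parts[:-1]:
            buf.append(part)
            tree = "".join(buf).strip()
            buf = []
            if tree:
                yield tree + ";"
        buf.append(parts[-1])
    # Leftover text without a terminating ';' is treated as an error.
    if "".join(buf).strip():
        raise ValueError("Unterminated Newick tree at end of input")


# Two trees, the second wrapped over two lines.
example = io.StringIO("(A,B,(C,D));\n(A,(B,\nC),D);\n")
trees = list(iter_newick_strings(example))
```

Because it is a generator, only one tree's text is held in memory at a
time, which is the property that matters when the cluster hands back
thousands of bootstrap trees.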

> I wrote some code in python to do this bootstrapping step using the
> splits defined by each edge (i.e. the two sets of nodes you get if the
> edge was severed), which I represented using bit arrays, for use as
> keys in a dictionary mapping the splits to the master tree's edges.
>
>
I would be interested to see this.
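
In the meantime, here is my guess at the general shape of it, not your
actual code. Trees are nested tuples of leaf names purely for illustration
(no Bio.Nexus classes involved), Python integers stand in for the bit
arrays, and the dictionary maps each split of the master tree to a count
rather than to an edge object. All function names are hypothetical.

```python
def leaf_names(tree):
    """Collect leaf names from a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaf_names(child))
    return names


def splits(tree, bit_of):
    """Return the non-trivial splits of a tree as integer bitmasks."""
    full = (1 << len(bit_of)) - 1
    found = set()

    def walk(node):
        if isinstance(node, str):
            return 1 << bit_of[node]
        mask = 0
        for child in node:
            mask |= walk(child)
        comp = full ^ mask
        # Keep the split only if both sides hold at least two leaves.
        if mask & (mask - 1) and comp & (comp - 1):
            # A split and its complement share one canonical key.
            found.add(min(mask, comp))
        return mask

    walk(tree)
    return found


def bootstrap_support(master, replicates):
    """Count how often each split of the master tree recurs in replicates."""
    bit_of = {name: i for i, name in enumerate(sorted(leaf_names(master)))}
    counts = {key: 0 for key in splits(master, bit_of)}
    for rep in replicates:
        for key in splits(rep, bit_of):
            if key in counts:
                counts[key] += 1
    return counts


master = (("A", "B"), ("C", "D"))
replicates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ("A", ("B", ("C", "D"))),
]
support = bootstrap_support(master, replicates)
```

Taking the smaller of a mask and its complement as the key means the same
edge is recognised regardless of where each replicate happens to be rooted.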

Thanks,
Eric

P.S. On inspection, there's a possible bug in Bio.Nexus.get_start_end: the
default argument for skiplist is a list with two characters in it. Since
Python evaluates default arguments only once, any in-place change to
skiplist would persist across subsequent calls, wouldn't it?
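
A toy illustration of what I mean, using stand-in functions rather than the
real get_start_end (the names, the list contents, and the append call are
all made up for the demonstration):

```python
def risky(seq, skiplist=["-", "?"]):
    # The default list is created once, when the function is defined,
    # so an in-place change here is still visible on the next call.
    skiplist.append("N")
    return len(skiplist)


def safer(seq, skiplist=None):
    # Using None as the sentinel gives each call a fresh list.
    if skiplist is None:
        skiplist = ["-", "?"]
    skiplist.append("N")
    return len(skiplist)


risky_lengths = [risky("AC-GT"), risky("AC-GT")]
safer_lengths = [safer("AC-GT"), safer("AC-GT")]
```

If the function never modifies skiplist in place, the current default is
harmless, but the None idiom removes the question entirely.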