[Biopython-dev] Newick support in Bio.TreeIO?

Wed Jul 29 12:16:57 EDT 2009

On Wed, Jul 29, 2009 at 4:49 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Hi Peter,
>
> On Tue, Jul 28, 2009 at 12:48 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
>>
>> Hi Eric,
>>
>> If you wanted a good multi-tree example file format for TreeIO, I would
>> suggest plain Newick trees. I am familiar with plain text files which
>> contain one Newick tree per line (with a terminating semi-colon),
>> although in principle they could be wrapped over many lines. The
>> neighbour joining (NJ) tree software QuickJoin from Thomas Mailund
>> can certainly output this kind of file. I would expect to be able to read
>> and write such multi-tree Newick files using Bio.TreeIO.
>
> I was wondering about this in regard to Bio.Nexus. It looks like the class
> Bio.Nexus.Nexus calls its _tree method when it encounters a tree block in a
> Nexus file, which corresponds to a tree in Newick format plus a short
> preamble. The _tree method churns the preamble, then passes a CharBuffer
> (the Newick string) and some defaults to the Bio.Nexus.Trees.Tree
> constructor, which does the Newick parsing and creates a Tree object.
>
> After a quick glance at the Nexus original article/spec, it looks like the
> format is a bundle of simpler formats for various applications; most of
> these formats are unique to Nexus, but Newick is dropped into Nexus
> completely intact. So! I'm proposing that the Newick parser, currently
> stashed inside Bio.Nexus.Trees, be moved to Bio.TreeIO.NewickIO, and the
> Nexus parser be changed to simply call the Newick parser from its new
> location.
>
> (A further refactoring of the Nexus parser would put the individual parsers
> for each block in separate classes or files, rather than mingled with the
> block-level parsing code. I can't guarantee I'll get around to that,
> though.)
>
> Does this bear any resemblance to your plan?

No - but probably only because I didn't fancy restructuring Bio.Nexus ;)
We can already call the Newick tree parser directly, so it doesn't
have to be moved (although we could do). [In case you hadn't seen
it, the current version of the Tutorial has a tiny example using this
at the end of a ClustalW example in the Alignment chapter.]

Bio.TreeIO.parse() should be an iterator, returning complete tree
objects one by one. I was thinking of having Bio.TreeIO.NewickIO
just take a plain text file, split it up at the ";\n" characters (or similar)
to get each tree as a string, which is passed to Bio.Nexus.Trees.Tree
to parse it.
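
A minimal sketch of the splitting step described above, assuming every tree
ends in a semi-colon and that no quoted label contains one. The function name
iter_newick_strings is hypothetical and not part of Biopython; each yielded
string is what would then be handed to the existing Newick parser.

```python
import io


def iter_newick_strings(handle):
    """Yield one Newick string (ending in ';') per tree from a text handle."""
    buffer = []
    for line in handle:
        # A tree may be wrapped over several lines, so accumulate until ';'.
        while ";" in line:
            head, line = line.split(";", 1)
            buffer.append(head.strip())
            tree = "".join(buffer)
            buffer = []
            if tree:
                yield tree + ";"
        buffer.append(line.strip())
    if "".join(buffer):
        raise ValueError("Trailing text without a terminating semi-colon")


# Two trees: the first on one line, the second wrapped over two lines.
example = io.StringIO("(A,B,(C,D));\n((A,C),\n(B,D));\n")
trees = list(iter_newick_strings(example))
```

Because this is a generator, only one tree's text is held in memory at a time.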

I'd never read the original Nexus publication which describes the file
format (my University didn't subscribe to that journal). However, it
appears to have been digitised and made freely available since then:
http://sysbio.oxfordjournals.org/cgi/reprint/46/4/590

It looks like the NEXUS format allows explicit handling of multiple
trees within the NEXUS block structure. Note that this is quite different
to the simple concatenated plain text Newick files I was talking about.
In other words, the "nexus" and "newick" formats in Bio.TreeIO would both
deal with Newick trees, but held in different container formats (a
NEXUS file, or plain text).

>> The obvious application of this (which I have used personally), was to
>> generate bootstrap trees on multiple machines in a cluster (or cores on
>> a single machine), e.g. 100 instances each of 10 bootstrap trees, giving
>> in total 1000 trees (which are then used either to build a consensus, or
>> allocate bootstrap support to the randomised master tree).
>
> Sounds like an incremental parse() function over these trees would be
> very useful for distributed bootstrap analysis etc.

Exactly. And Bio.TreeIO.read() would be for the special case where
the file format contains exactly one tree.
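
One way the read()/parse() relationship could look, as a sketch only: the
single-tree function consumes any iterator of trees and insists on exactly
one. The name read_single is hypothetical, and the trees here are plain
strings standing in for tree objects.

```python
def read_single(trees):
    """Return the only item from an iterable of trees, else raise ValueError."""
    iterator = iter(trees)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No trees found") from None
    try:
        next(iterator)
    except StopIteration:
        # Exactly one tree was available, which is the expected case.
        return first
    raise ValueError("More than one tree found")


only_tree = read_single(iter(["(A,B,(C,D));"]))
```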

> I don't see how Bio.Nexus currently supports this, though, beyond
> iterating over the 'trees' attribute, which is a list.

As far as I know, Bio.Nexus just parses a whole file in one go. This
means either Bio.TreeIO.NexusIO would call this and then loop over
the list (very memory inefficient), or it would need a minimal Nexus
parser just to spot the TREES block and handle only the trees in it.
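
The second option might be sketched as below. This is not Bio.Nexus code: the
name iter_nexus_newick is hypothetical, and it assumes one "tree name = ...;"
statement per line, no comments spanning lines, and no TRANSLATE handling.

```python
import io
import re

# Matches e.g. "tree t1 = [&R] ((A,B),C);" and captures the Newick part.
TREE_STATEMENT = re.compile(
    r"^\s*u?tree\s+\S+\s*=\s*(?:\[[^\]]*\]\s*)?(.*;)\s*$", re.IGNORECASE
)


def iter_nexus_newick(handle):
    """Yield Newick strings from the TREES block(s), skipping other blocks."""
    in_trees = False
    for line in handle:
        stripped = line.strip().lower()
        if not in_trees:
            in_trees = stripped.startswith("begin trees")
        elif stripped in ("end;", "endblock;"):
            in_trees = False
        else:
            match = TREE_STATEMENT.match(line)
            if match:
                yield match.group(1)


nexus_text = """#NEXUS
begin taxa;
  dimensions ntax=3;
end;
begin trees;
  tree t1 = [&R] ((A,B),C);
  tree t2 = ((A,C),B);
end;
"""
nexus_trees = list(iter_nexus_newick(io.StringIO(nexus_text)))
```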

> How would a reasonable person go about this?
> Generate trees in Newick format rather than Nexus, run on the cluster,
> combine, distill, and only save the resulting master tree in Newick format
> (or even phyloXML)? If the Newick parser is separated from Nexus, then
> this wouldn't be too difficult to support.

For the example workflow I gave, I did everything with simple Newick
files. At the very end, it might make sense to save the bootstrapped
tree as phyloXML, or even as a full NEXUS file bundled up with the
alignment.

>> I wrote some code in python to do this bootstrapping step using the
>> splits defined by each edge (i.e. the two sets of nodes you get if the
>> edge was severed), which I represented using bit arrays, for use as
>> keys in a dictionary mapping the splits to the master tree's edges.
>
> I would be interested to see this.

I'm not actually sure where I put it... it should be on my old desktop
at home somewhere. However, I can elaborate: in addition to NJ
using quicktree, I also did parsimony bootstrap values, and drew my
own colourful trees using reportlab. See the three supplementary
figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0
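
Since the original code is not to hand, here is a fresh sketch of the split
idea from the quoted paragraph, not a reconstruction of it. Assumptions: trees
are nested tuples of taxon names, Python integers serve as the bit arrays, and
the dictionary maps each master-tree split to a support count rather than to
an edge object. All names are hypothetical.

```python
def clade_masks(tree, bit_of, masks):
    """Return the bitmask of leaves below `tree`, recording internal clades."""
    if isinstance(tree, str):
        return 1 << bit_of[tree]
    mask = 0
    for child in tree:
        mask |= clade_masks(child, bit_of, masks)
    masks.append(mask)
    return mask


def splits(tree, bit_of):
    """Return the set of non-trivial splits of `tree` as canonical bitmasks."""
    n = len(bit_of)
    full = (1 << n) - 1
    masks = []
    clade_masks(tree, bit_of, masks)
    result = set()
    for mask in masks:
        # Canonical orientation: keep the side that excludes taxon number 0.
        if mask & 1:
            mask ^= full
        # Drop trivial splits (a side with fewer than two taxa).
        if 2 <= bin(mask).count("1") <= n - 2:
            result.add(mask)
    return result


taxa = ["A", "B", "C", "D", "E"]
bit_of = {name: i for i, name in enumerate(taxa)}
master = ((("A", "B"), "C"), ("D", "E"))
replicates = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    ((("A", "B"), ("D", "E")), "C"),
]

# Count how many bootstrap replicates contain each split of the master tree.
support = {split: 0 for split in splits(master, bit_of)}
for replicate in replicates:
    for split in splits(replicate, bit_of):
        if split in support:
            support[split] += 1
```

Canonicalising each split to the side without the first taxon means the same
edge gives the same key regardless of how a tree happens to be rooted.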

Peter

> P.S. On inspection, there's a possible bug in Bio.Nexus.get_start_end: the
> default argument for skiplist is a list with two characters in it. If
> skiplist is altered, this would persist across subsequent calls, wouldn't
> it?

I don't understand what you are trying to say. If get_start_end is
called with an argument (say skiplist=["a","b"]) then this will not affect
subsequent calls, where the default will still be ['-','?'].
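
For reference, a small illustration of how Python treats a list used as a
default argument. Both functions are hypothetical and are not the Bio.Nexus
code; whether get_start_end ever modifies its skiplist is not established
here. Passing an argument leaves the default alone, whereas modifying the
default list in place inside the function does persist across calls.

```python
def count_skipped(sequence, skiplist=["-", "?"]):
    """Only reads the default list, so the default never changes."""
    return sum(1 for char in sequence if char in skiplist)


def count_skipped_mutating(sequence, extra=None, skiplist=["-", "?"]):
    """Appends to the default list in place, so the change leaks."""
    if extra is not None:
        skiplist.append(extra)
    return sum(1 for char in sequence if char in skiplist)


# Passing an explicit list does not touch the default.
with_argument = count_skipped("A-C?", skiplist=["A"])
default_after = count_skipped("A-C?")

# Mutating the default inside the function affects the next call.
first_call = count_skipped_mutating("A-C?", extra="A")
second_call = count_skipped_mutating("A-C?")
```

The usual way to avoid the second behaviour is a default of None, with the
list created inside the function body.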