[Biopython-dev] BioGeography update/BioPython tree module discussion

Mon Jul 13 18:34:42 UTC 2009

Brad Chapman wrote:
> Hi all;
> 
>>> 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed
>>> by any phylogenetic tree representation, ever. (It's already pretty close.)
>>> Refactor Nexus and Newick to use these objects; merge the features of
>>> lagrange so the rest of the Biopython environment can benefit.
> 
> I am for this approach. It sounds like what people want is a tree
> that does everything, and re-implementations occur because
> representations are lacking in something.

Hi all -- thanks for this discussion about tree classes.  Sorry it took 
me a while to absorb all of this (and I may still be working on absorbing 
all of it...there is a lot to keep in my head!).

PS: This also serves as my Monday update; basically, I need to revise my 
schedule based on the decisions that come out of this thread.

Here is a summary of the situation as I understand it.  It may be a 
little long, apologies!  (I was kind of hoping an easy solution would 
just appear, since really everything after this point in my GSoC project 
requires tree processing, and thus I at least need a decision made 
about which tree class to use.)

I. Tree Class Options

It sounds like we have 3 options being discussed:

1. making Bio.PhyloXML.Tree the super-duper tree class
2. improving Bio.Nexus.Trees
3. including the Lagrange tree class or suitably licensed/inspired 
version thereof.

(Or there is #4, some combination)

II. My Original Problem, Which is Probably Quite Small Really

I think I kind of unintentionally kicked all of this off because I 
couldn't get Bio.Nexus.Trees to read what I considered pretty standard 
Newick files back when I was originally exploring this in the spring. 
Initially for my own scripts I used another newick parser & tree class I 
found online (Mailund's IIRC), then discovered a superior one in 
Lagrange and started using that.  Thus in GSoC it was simplest to begin 
by importing the Lagrange parser, but that led to legitimate concerns 
about duplication/licensing etc.

Reviewing my original issues from the spring, really the only problem I 
found with Bio.Nexus.Trees was with node labels, i.e. when an internal 
node is given e.g. a clade name, in addition to a branch length.  This is 
standard in a great many newick files in my experience, which 
seem to be correctly read by just about all the other programs I use 
(Mesquite, Dendroscope, etc.), so I impulsively abandoned Bio.Nexus.Trees 
at the time when I couldn't get it to work.

III. Bug Report

I did file a bug report back in March.  This is outstanding as far as I 
know.

Bio.Nexus.Trees newick parser does not support internal node labels
http://bugzilla.open-bio.org/show_bug.cgi?id=2788

IV. Problem Examples

Below I have accumulated some cases that work/don't work:

=================
from Bio.Nexus import Trees

# This works

ts0 = ("(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,"
       "(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,"
       "Mouse:1.21460):0.10;")

to0 = Trees.Tree(ts0)
print to0

# Gymnosperms tree with node labels; doesn't work
ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;'

to1a = Trees.Tree(ts1a)

# Just Taxaceae; doesn't work
ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;'
to1b = Trees.Tree(ts1b)

# Just Taxaceae; this works; node labels deleted
ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;'
to1c = Trees.Tree(ts1c)

# This doesn't work (from bug report)
ts2 = ("(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, "
       "t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, "
       "t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, "
       "t1:0.130208)F:0.0318288)D:0.0273876);")
to2 = Trees.Tree(ts2)
=================

But if I import the Lagrange tree class/parser, all of these work and my 
life is happy:

=================
import lagrange_newick
# This is lagrange's newick.py file, renamed to lagrange_newick.py

lt0 = lagrange_newick.parse(ts0)
lt1a = lagrange_newick.parse(ts1a)
lt1b = lagrange_newick.parse(ts1b)
lt2 = lagrange_newick.parse(ts2)
=================
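
For reference, the only thing that differs in the failing strings above is 
a label sitting between a closing parenthesis and the colon.  Below is a 
minimal sketch of a recursive parser that accepts such labels.  This is 
NOT the Lagrange or Bio.Nexus.Trees code; SimpleNode and parse_newick are 
hypothetical names made up for illustration, and I am assuming unquoted 
labels with no embedded whitespace or comments.

```python
class SimpleNode(object):
    """Hypothetical minimal node: label, branch length, children, parent."""

    def __init__(self, label=None, length=None):
        self.label = label
        self.length = length
        self.children = []
        self.parent = None

    def add_child(self, child):
        child.parent = self
        self.children.append(child)


def parse_newick(text):
    """Parse an unquoted Newick string, keeping internal node labels."""
    text = "".join(text.split())  # drop whitespace, including wrapped lines
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = [0]  # mutable index shared with the nested helpers

    def read_token():
        # Read characters up to the next structural symbol.
        start = pos[0]
        while text[pos[0]] not in "(),:;":
            pos[0] += 1
        return text[start:pos[0]]

    def read_node():
        node = SimpleNode()
        if text[pos[0]] == "(":
            pos[0] += 1  # consume '('
            node.add_child(read_node())
            while text[pos[0]] == ",":
                pos[0] += 1
                node.add_child(read_node())
            if text[pos[0]] != ")":
                raise ValueError("expected ')' at position %d" % pos[0])
            pos[0] += 1  # consume ')'
        # For a tip this is the taxon name; after ')' it is the node label.
        node.label = read_token() or None
        if text[pos[0]] == ":":
            pos[0] += 1
            node.length = float(read_token())
        return node

    root = read_node()
    if text[pos[0]] != ";":
        raise ValueError("unexpected text at position %d" % pos[0])
    return root


# The "Just Taxaceae" string from the examples above.
root = parse_newick(
    "(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
    "TT1:25.000000)Taxaceae:90.000000;")
```

With this sketch, root.label comes back as "Taxaceae" and the second child 
of the root carries the label "TT1", so the clade names survive parsing 
alongside the branch lengths.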

V. The Functions I Need From a Tree Class

Basically my method of late has been to use the Lagrange Tree class, and 
then write my own standalone functions to do various necessary basic 
processing of trees.  E.g.:

* subset tree based on list of taxa; update root and any now-redundant 
internal nodes left with 0 or 1 descendants

* extract a subtree to a new tree (cloned nodes so they don't refer to 
the old nodes, important in doing passes through tree)

* read/write to Newick

* print tree to screen in a readable format

* get distance (total branch length between 2 nodes)

* calculate many measures that can be done from the distances (total 
all-to-all distance matrix, tree length, mean phylogenetic distance, 
mean nearest-neighbor phylogenetic distance)

* several others I don't remember off the top of my head

In my list-o-functions approach, I would just write functions for the 
tree class I was using, but I think it has been made clear that really 
these functions should be methods of a certain Tree class.  Which 
requires a decision about what Tree class to use.
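
To make two of the items above concrete (distance between two nodes, and 
mean phylogenetic distance), here is a rough sketch of the standalone 
function style.  It is written against a made-up SimpleNode class with 
parent pointers, not against Bio.Nexus.Trees or the Lagrange Node class, 
so every name in it is an assumption for illustration only.  It also 
assumes both nodes belong to the same tree.

```python
from itertools import combinations


class SimpleNode(object):
    """Hypothetical node; `length` is the branch above this node."""

    def __init__(self, label, length=0.0, parent=None):
        self.label = label
        self.length = length
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def root_path(node):
    """Nodes from `node` up to and including the root."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def distance(a, b):
    """Total branch length on the path between nodes a and b."""
    ancestors_of_a = set(root_path(a))
    dist = 0.0
    node = b
    # Climb from b until reaching an ancestor of a: that is the MRCA.
    while node not in ancestors_of_a:
        dist += node.length
        node = node.parent
    mrca = node
    node = a
    while node is not mrca:
        dist += node.length
        node = node.parent
    return dist


def leaves(node):
    """All tips at or below `node`."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def mean_phylogenetic_distance(root):
    """Mean of the pairwise tip-to-tip distances."""
    pairs = list(combinations(leaves(root), 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# The Taxaceae subtree from the examples, built by hand.
taxaceae = SimpleNode("Taxaceae", 90.0)
ceph = SimpleNode("Cephalotaxus", 125.0, taxaceae)
tt1 = SimpleNode("TT1", 25.0, taxaceae)
taxus = SimpleNode("Taxus", 100.0, tt1)
torreya = SimpleNode("Torreya", 100.0, tt1)

d_sisters = distance(taxus, torreya)      # 100 + 100
d_across = distance(ceph, taxus)          # 125 + 25 + 100
mpd = mean_phylogenetic_distance(taxaceae)
```

If these became methods of a Tree class, root_path and leaves would be the 
natural building blocks, since the all-to-all distance matrix and the 
nearest-neighbor measures can be computed from the same pairwise 
distances.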

VI. What the current classes do.

I had never looked seriously at Bio.Nexus.Trees since I was just 
crashing it, but it actually looks like it does quite a lot:

Bio.Nexus.Trees
===========
type(to1c)
<type 'instance'>

to1c
<Bio.Nexus.Trees.Tree instance at 0x39348a0>

dir(to1c)

['_Tree__values_are_support',
  '__doc__',
  '__init__',
  '__module__',
  '__str__',
  '_add_subtree',
  '_get_id',
  '_get_values',
  '_parse',
  '_walk',
  'add',
  'all_ids',
  'branchlength2support',
  'chain',
  'collapse',
  'collapse_genera',
  'common_ancestor',
  'convert_absolute_support',
  'count_terminals',
  'dataclass',
  'display',
  'distance',
  'get_taxa',
  'get_terminals',
  'has_support',
  'id',
  'is_bifurcating',
  'is_compatible',
  'is_identical',
  'is_internal',
  'is_monophyletic',
  'is_parent_of',
  'is_preterminal',
  'is_terminal',
  'kill',
  'link',
  'max_support',
  'merge_with_support',
  'name',
  'node',
  'prune',
  'randomize',
  'root',
  'root_with_outgroup',
  'rooted',
  'search_taxon',
  'set_subtree',
  'split',
  'sum_branchlength',
  'to_string',
  'trace',
  'unlink',
  'unroot',
  'weight']

# Node methods:
nd = to1c.node(1)

nd
<Bio.Nexus.Nodes.Node instance at 0x39227b0>

type(nd)
<type 'instance'>

dir(nd)

['__doc__',
  '__init__',
  '__module__',
  'add_succ',
  'data',
  'get_data',
  'get_id',
  'get_prev',
  'get_succ',
  'id',
  'prev',
  'remove_succ',
  'set_data',
  'set_id',
  'set_prev',
  'set_succ',
  'succ']

# Node data:
ndd = nd.get_data()

dir(ndd)

['__doc__',
  '__init__',
  '__module__',
  'branchlength',
  'comment',
  'support',
  'taxon']
===========

Lagrange Tree Class:
(really class Node I guess, and the tree is referenced by the root Node)

=============
type(lt1b)
<type 'instance'>

lt1b
<lagrange_phylo.Node instance at 0x392b120>

dir(lt1b)

['__doc__',
  '__init__',
  '__module__',
  'add_child',
  'children',
  'data',
  'descendants',
  'excluded_dists',
  'find_descendant',
  'graft',
  'isroot',
  'istip',
  'iternodes',
  'label',
  'labelset_nodemap',
  'leaf_distances',
  'leaves',
  'length',
  'mrca',
  'nchildren',
  'order_subtrees_by_size',
  'parent',
  'prune',
  'remove_child',
  'rootpath',
  'subtree_mapping',
  'ultrametricize_dumbly']
=============

Bio.PhyloXML.Tree
=============
[not sure...perhaps someone could contribute the list of 
methods/intended methods]
=============

VII. I am Leaning Towards Bio.Nexus.Trees

Based on current functionality and integration with BioPython, and what 
can be done in the short term, it looks to me like the best option is to 
modify the Bio.Nexus.Trees module, inspired by the Lagrange Node class as 
necessary.  However, if e.g. PhyloXML is working well enough that I can 
use that, that is an option.

VIII. What I should do next

Given what I now know, I probably should have just written a little 
function to strip node labels out of my Newick trees, and done 
everything based on the Bio.Nexus.Trees class.  I could still do this 
and continue on my merry way without too much trouble.
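
A minimal sketch of what such a label-stripping function might look like 
is below.  The name strip_internal_labels is hypothetical, and I am 
assuming unquoted labels that start with something other than a digit or 
a dot, so that numeric support values after a closing parenthesis are 
left alone.  It keeps the colon and branch length, which gives the same 
"):length" form that already parses in the first working example; I have 
not checked the output against every case Bio.Nexus.Trees handles.

```python
import re

# A ')' followed by a label: the label's first character must not be a
# digit, dot, whitespace or structural symbol; the label then runs up to
# the next structural symbol.
_INTERNAL_LABEL = re.compile(r"\)\s*[^\d\s():,;.][^():,;]*")


def strip_internal_labels(newick):
    """Return the Newick string with internal node labels removed."""
    return _INTERNAL_LABEL.sub(")", newick)


labeled = ("(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
           "TT1:25.000000)Taxaceae:90.000000;")
stripped = strip_internal_labels(labeled)
```

Tip names are untouched because they never directly follow a closing 
parenthesis.  The obvious cost is that the clade names are thrown away, 
so this is a workaround rather than a fix for the parser.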

But given that my tree-based functions should probably be methods of 
some class...here are the questions I have:

* Should I muck with Bio.Nexus.Trees and try to fix the node labels 
issue?  My instinct was not to mess with other people's stuff, but that 
may be a poor instinct...

* Should I implement my tree-based functions as methods of the 
Bio.Nexus.Trees class?

* Should I delay on this whole issue while it is being discussed, and go 
back to issues more localized to my GSoC project, i.e. making my GBIF 
functions into methods of a GBIF records class?

Thanks for reading!  And sorry if this was more confusing than it had to 
be; I am definitely learning as I go here.

Cheers,
Nick

> 
> It would be nice to design this modularly -- with mixin classes for
> related add-on functionality -- as much as possible. This would
> allow lighter weight implementations in the future if that were
> desired.
> 
>> The benefit of letting the tree object structures diverge is procrastination
>> -- we could reconcile the two modules after GSoC is over, with stable
>> features and test suites in place. But I could justifiably focus on
>> integration for the remaining weeks if that's best for Biopython, since
>> otherwise I'd probably be reimplementing a number of features already
>> present in other modules.
> 
> My vote is for the integration work. Refactoring is hard work and
> best done early. It is easier to add functionality to a fully integrated
> PhyloXML parser in the future.
> 
>> I bet this could be done without different objects. Bio.PhyloXML.Tree could
>> be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could
>> be renamed to TreeElement; and the Nexus and Newick parsers could reuse
>> PhyloXML's Phylogeny and Clade elements, where Clade merges with the
>> existing Node class(es). Even Clade by itself might be enough. For
>> organizational purposes, format-specific tree elements could move to their
>> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some
>> multiple-inheritance tricks could be used to smooth things over.
> 
> Yes, this sounds exactly right. Great stuff.
> 
>> (I know nothing
>> about NeXML; should we keep an eye on that too? Glance at the homepage I
>> don't see much about complex annotation types, which is probably good if we
>> want to fit that format into this framework eventually.)
> 
> PhyloXML plus Nexus/Newick is probably enough to stay reasonably
> general and keep our sanity. NeXML support would be great but
> practically is an additional project. The refactoring you've described
> is a good chunk to run with.
> 
> Brad
> 

-- 
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

====================================================