[Biopython-dev] BioGeography update/BioPython tree module discussion
Nick Matzke
matzke at berkeley.edu
Mon Jul 13 18:34:42 UTC 2009
Brad Chapman wrote:
> Hi all;
>
>>> 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed
>>> by any phylogenetic tree representation, ever. (It's already pretty close.)
>>> Refactor Nexus and Newick to use these objects; merge the features of
>>> lagrange so the rest of the Biopython environment can benefit.
>
> I am for this approach. It sounds like what people want is a tree
> that does everything, and re-implementations occur because
> representations are lacking in something.
Hi all -- thanks for this discussion about tree classes. Sorry it took
me a while to absorb all of this (and I may still be working on absorbing
all of it...there is a lot to keep in my head!).
PS: This also serves as my Monday update; basically, I need to revise my
schedule based on the decisions made after discussion of this thread.
Here is a summary of the situation as I understand it. It may be a
little long, apologies! (I was kind of hoping an easy solution would
just appear, since really everything after this point in my GSoC project
requires tree processing, and thus I have to at least have the decision
made about which tree class to use.)
I. Tree Class Options
It sounds like we have 3 options being discussed:
1. making Bio.PhyloXML.Tree the super-duper tree class
2. improving Bio.Nexus.Trees
3. including the Lagrange tree class or suitably licensed/inspired
version thereof.
(Or there is #4, some combination)
II. My Original Problem, Which is Probably Quite Small Really
I think I kind of unintentionally kicked all of this off because I
couldn't get Bio.Nexus.Trees to read what I considered pretty standard
Newick files back when I was originally exploring this in the spring.
Initially for my own scripts I used another Newick parser & tree class I
found online (Mailund's IIRC), then discovered a superior one in
Lagrange and started using that. Thus in GSoC it was simplest to begin
by importing the Lagrange parser, but that led to legitimate concerns
about duplication/licensing etc.
Reviewing my original issues from the spring, really the only problem I
found with Bio.Nexus.Trees was with node labels, i.e. when an internal
node is given e.g. a clade name, in addition to a branch length. This is
standard output in a great many Newick files in my experience, which
seem to be correctly read by just about all the other programs I use
(Mesquite, Dendroscope, etc.), so I impulsively abandoned Bio.Nexus.Trees
at the time when I couldn't get it to work.
III. Bug Report
I did file a bug report back in March. This is outstanding as far as I
know.
Bio.Nexus.Trees newick parser does not support internal node labels
http://bugzilla.open-bio.org/show_bug.cgi?id=2788
IV. Problem Examples
Below I have accumulated some cases that work/don't work:
=================
from Bio.Nexus import Trees
# This works
ts0 = "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;"
to0 = Trees.Tree(ts0)
print(to0)
# Gymnosperms tree with node labels; doesn't work
ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;'
to1a = Trees.Tree(ts1a)
# Just Taxaceae; doesn't work
ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;'
to1b = Trees.Tree(ts1b)
# Just Taxaceae; this works; node labels deleted
ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;'
to1c = Trees.Tree(ts1c)
# This doesn't work (from bug report)
ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, t1:0.130208)F:0.0318288)D:0.0273876);"
to2 = Trees.Tree(ts2)
=================
But if I import the Lagrange tree class/parser, all of these work and my
life is happy:
=================
import lagrange_newick
# This is lagrange's newick.py file, renamed to lagrange_newick.py
lt1 = lagrange_newick.parse(ts1)
lt1a = lagrange_newick.parse(ts1a)
lt1b = lagrange_newick.parse(ts1b)
lt2 = lagrange_newick.parse(ts2)
=================
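For illustration only, the failing case above (an internal node carrying both a label and a branch length) can be handled by a small recursive-descent parser. This is a minimal sketch and not the Bio.Nexus.Trees or Lagrange code: the names Node and parse_newick are hypothetical, and it assumes unquoted labels, no comments, and a single tree per string.

```python
class Node(object):
    """Hypothetical minimal node: a label, a branch length, and children."""

    def __init__(self, label=None, length=None):
        self.label = label
        self.length = length
        self.children = []


def parse_newick(text):
    """Parse one Newick string into Node objects, keeping internal labels."""
    text = "".join(text.split())  # drop whitespace and line breaks
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = [0]  # mutable cursor shared with the nested helper

    def read_node():
        node = Node()
        if text[pos[0]] == "(":
            pos[0] += 1  # consume "("
            while True:
                node.children.append(read_node())
                if text[pos[0]] == ",":
                    pos[0] += 1
                elif text[pos[0]] == ")":
                    pos[0] += 1
                    break
                else:
                    raise ValueError("expected ',' or ')' at %d" % pos[0])
        # A label may follow a tip OR a closing parenthesis (internal node).
        start = pos[0]
        while text[pos[0]] not in ":,();":
            pos[0] += 1
        node.label = text[start:pos[0]] or None
        # An optional branch length follows a colon.
        if text[pos[0]] == ":":
            pos[0] += 1
            start = pos[0]
            while text[pos[0]] not in ",();":
                pos[0] += 1
            node.length = float(text[start:pos[0]])
        return node

    root = read_node()
    if text[pos[0]] != ";":
        raise ValueError("unexpected text at %d" % pos[0])
    return root


taxaceae = parse_newick(
    "(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
    "TT1:25.000000)Taxaceae:90.000000;"
)
print(taxaceae.label)              # Taxaceae
print(taxaceae.children[1].label)  # TT1
```

The only step that matters for the node-label problem is reading a label after the closing parenthesis with the same code path used for tip names.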
V. The Functions I Need From a Tree Class
Basically my method of late has been to use the Lagrange Tree class, and
then write my own standalone functions to do various necessary basic
processing of trees. E.g.:
* subset tree based on list of taxa; update root and any now-redundant
internal nodes left with 0 or 1 descendants
* extract a subtree to a new tree (cloned nodes so they don't refer to
the old nodes, important in doing passes through tree)
* read/write to Newick
* print tree to screen in a readable format
* get distance (total branch length between 2 nodes)
* calculate many measures that can be done from the distances (total
all-to-all distance matrix, tree length, mean phylogenetic distance,
mean nearest-neighbor phylogenetic distance)
* several others I don't remember off the top of my head
In my list-o-functions approach, I would just write functions for the
tree class I was using, but I think it has been made clear that really
these functions should be methods of a certain Tree class. Which
requires a decision about what Tree class to use.
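As a rough sketch of what "functions as methods" could look like, here are two of the distance-based operations from the list above written as methods of a small node class. Everything here is hypothetical: TreeNode is not an existing Biopython or Lagrange class, the method names merely echo names in the listings later in this message, and mean_pairwise_distance is one plausible reading of "mean phylogenetic distance" (the average tip-to-tip distance).

```python
class TreeNode(object):
    """Hypothetical node class; the tree is referenced by its root node."""

    def __init__(self, label=None, length=0.0):
        self.label = label
        self.length = length  # length of the branch leading to this node
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def iternodes(self):
        """Yield this node and all of its descendants (preorder)."""
        yield self
        for child in self.children:
            for node in child.iternodes():
                yield node

    def leaves(self):
        return [n for n in self.iternodes() if not n.children]

    def rootpath(self):
        """Return the nodes from this node up to and including the root."""
        path = [self]
        while path[-1].parent is not None:
            path.append(path[-1].parent)
        return path

    def tree_length(self):
        """Sum of all branch lengths below this node (own branch excluded)."""
        return sum(n.length for n in self.iternodes() if n is not self)

    def distance(self, other):
        """Total branch length on the path between this node and other."""
        mine = self.rootpath()
        theirs = other.rootpath()
        their_ids = set(id(n) for n in theirs)
        # First node on our path to the root that is also on theirs.
        mrca = next(n for n in mine if id(n) in their_ids)
        total = 0.0
        for path in (mine, theirs):
            for n in path:
                if n is mrca:
                    break
                total += n.length
        return total


def mean_pairwise_distance(root):
    """Average distance over all unordered pairs of tips."""
    tips = root.leaves()
    dists = [a.distance(b) for i, a in enumerate(tips) for b in tips[i + 1:]]
    return sum(dists) / len(dists)


# The Taxaceae example, built by hand.
root = TreeNode("Taxaceae", 90.0)
ceph = root.add_child(TreeNode("Cephalotaxus", 125.0))
tt1 = root.add_child(TreeNode("TT1", 25.0))
taxus = tt1.add_child(TreeNode("Taxus", 100.0))
torreya = tt1.add_child(TreeNode("Torreya", 100.0))

print(root.tree_length())       # 350.0
print(ceph.distance(taxus))     # 250.0
```

Keeping the summary statistic as a plain function on top of the distance method is only one option; it could equally be a method once a tree class is chosen.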
VI. What the current classes do.
I had never looked seriously at Bio.Nexus.Trees since I was just
crashing it, but it actually looks like it does a bunch:
Bio.Nexus.Trees
===========
type(to1c)
<type 'instance'>
to1c
<Bio.Nexus.Trees.Tree instance at 0x39348a0>
dir(to1c)
['_Tree__values_are_support',
'__doc__',
'__init__',
'__module__',
'__str__',
'_add_subtree',
'_get_id',
'_get_values',
'_parse',
'_walk',
'add',
'all_ids',
'branchlength2support',
'chain',
'collapse',
'collapse_genera',
'common_ancestor',
'convert_absolute_support',
'count_terminals',
'dataclass',
'display',
'distance',
'get_taxa',
'get_terminals',
'has_support',
'id',
'is_bifurcating',
'is_compatible',
'is_identical',
'is_internal',
'is_monophyletic',
'is_parent_of',
'is_preterminal',
'is_terminal',
'kill',
'link',
'max_support',
'merge_with_support',
'name',
'node',
'prune',
'randomize',
'root',
'root_with_outgroup',
'rooted',
'search_taxon',
'set_subtree',
'split',
'sum_branchlength',
'to_string',
'trace',
'unlink',
'unroot',
'weight']
# Node methods:
nd = to1c.node(1)
nd
<Bio.Nexus.Nodes.Node instance at 0x39227b0>
type(nd)
<type 'instance'>
dir(nd)
['__doc__',
'__init__',
'__module__',
'add_succ',
'data',
'get_data',
'get_id',
'get_prev',
'get_succ',
'id',
'prev',
'remove_succ',
'set_data',
'set_id',
'set_prev',
'set_succ',
'succ']
# Node data:
ndd = nd.get_data()
dir(ndd)
['__doc__',
'__init__',
'__module__',
'branchlength',
'comment',
'support',
'taxon']
===========
Lagrange Tree Class:
(really class Node I guess, and the tree is referenced by the root Node)
=============
type(lt1b)
<type 'instance'>
lt1b
<lagrange_phylo.Node instance at 0x392b120>
dir(lt1b)
['__doc__',
'__init__',
'__module__',
'add_child',
'children',
'data',
'descendants',
'excluded_dists',
'find_descendant',
'graft',
'isroot',
'istip',
'iternodes',
'label',
'labelset_nodemap',
'leaf_distances',
'leaves',
'length',
'mrca',
'nchildren',
'order_subtrees_by_size',
'parent',
'prune',
'remove_child',
'rootpath',
'subtree_mapping',
'ultrametricize_dumbly']
=============
Bio.PhyloXML.Tree
=============
[not sure...perhaps someone could contribute the list of
methods/intended methods]
=============
VII. I am Leaning Towards Bio.Nexus.Trees
Based on current functionality and integration with BioPython, and what
can be done in the short term, it looks to me like the best option is to
mod the Bio.Nexus.Trees module, inspired by the Lagrange Node class as
necessary. However if e.g. PhyloXML is working well enough that I can
use that, that is an option.
VIII. What I should do next
Given what I now know, I probably should have just written a little
function to strip node labels out of my Newick trees, and done
everything based on the Bio.Nexus.Trees class. I could still do this
and continue on my merry way without too much trouble.
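Such a label-stripping function might look like the following sketch. The name strip_internal_labels is hypothetical, and the regular expression assumes unquoted labels; note that it removes any token written directly after a closing parenthesis, so support values stored in that position would be dropped too. Branch lengths (the ":value" part) are kept.

```python
import re

# A label is any run of characters directly after ")" up to the next
# Newick delimiter or whitespace.
_INTERNAL_LABEL = re.compile(r"\)[^:;,()\s]+")


def strip_internal_labels(newick):
    """Return the Newick string with internal node labels removed."""
    return _INTERNAL_LABEL.sub(")", newick)


labelled = ("(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
            "TT1:25.000000)Taxaceae:90.000000;")
print(strip_internal_labels(labelled))
```

The result has the same shape as the first example string in this message (a closing parenthesis followed directly by a colon and a branch length), which is the form reported to work.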
But given that my tree-based functions should probably be methods of
some class...here are the questions I have:
* Should I muck with Bio.Nexus.Trees and try to fix the node labels
issue? My instinct was not to mess with other people's stuff, but that
may be a poor instinct...
* Should I implement my tree-based functions as methods of the
Bio.Nexus.Trees class?
* Should I delay on this whole issue while it is being discussed, and go
back to issues more localized to my GSoC project, i.e. making my GBIF
functions into methods of a GBIF records class?
Thanks for reading! And sorry if this was more confusing than it had to
be, I am definitely learning as I go here.
Cheers,
Nick
>
> It would be nice to design this modularly -- with mixin classes for
> related add-on functionality -- as much as possible. This would
> allow lighter weight implementations in the future if that were
> desired.
>
>> The benefit of letting the tree object structures diverge is procrastination
>> -- we could reconcile the two modules after GSoC is over, with stable
>> features and test suites in place. But I could justifiably focus on
>> integration for the remaining weeks if that's best for Biopython, since
>> otherwise I'd probably be reimplementing a number of features already
>> present in other modules.
>
> My vote is for the integration work. Refactoring is hard work and
> best done early. It is easier to add functionality to a fully integrated
> PhyloXML parser in the future.
>
>> I bet this could be done without different objects. Bio.PhyloXML.Tree could
>> be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could
>> be renamed to TreeElement; and the Nexus and Newick parsers could reuse
>> PhyloXML's Phylogeny and Clade elements, where Clade merges with the
>> existing Node class(es). Even Clade by itself might be enough. For
>> organizational purposes, format-specific tree elements could move to their
>> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some
>> multiple-inheritance tricks could be used to smooth things over.
>
> Yes, this sounds exactly right. Great stuff.
>
>> (I know nothing
>> about NeXML; should we keep an eye on that too? Glancing at the homepage, I
>> don't see much about complex annotation types, which is probably good if we
>> want to fit that format into this framework eventually.)
>
> PhyloXML plus Nexus/Newick is probably enough to stay reasonably
> general and keep our sanity. NeXML support would be great but
> practically is an additional project. The refactoring you've described
> is a good chunk to run with.
>
> Brad
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================