[Biopython-dev] BioGeography update/BioPython tree module discussion
Nick Matzke
matzke at berkeley.edu
Mon Jul 13 18:02:24 EDT 2009
Just updating one chunk of part I of the previous long message:
Nick Matzke wrote:
>
>
> I. Tree Class Options
>
> It sounds like we have 3 options being discussed:
>
> 1. making Bio.PhyloXML.Tree the super-duper tree class
> 2. improving Bio.Nexus.Trees
> 3. including the Lagrange tree class or suitably licensed/inspired
> version thereof.
>
> (Or there is #4, some combination)
> The last consensus we reached on Biopython-dev was to create two new
> modules, Bio.Tree and Bio.TreeIO, like so:
>
> 1. Extract a very basic Tree and Node class, looking at the intersection
> of the PhyloXML and Nexus class hierarchies, and put the result in
> Bio.Tree.BaseTree. I started on this today:
> http://github.com/etal/biopython/blob/phyloxml/Bio/Tree/BaseTree.py
>
> (It doesn't do anything yet besides set up a class hierarchy that we can
> use for generalizing existing code.)
>
> 2. Write wrappers for the existing PhyloXML and Nexus I/O functions. I'm
> putting that here:
> http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/__init__.py
>
> Again, it's only useful for PhyloXML parsing right now. Eventually we
> can connect Bio.Nexus to these two modules, but that's well outside the
> scope of my GSoC project.
It sounds like Bio.Nexus.Trees is the solution for my immediate
purposes, so I will reorganize my code around it. If/when
Bio.Nexus.Trees accepts node labels, I will remove the function that
strips them out. I also have not forgotten the previous comments from
Brad et al. about bringing the other code up to spec. So I will update
the BioGeography schedule and the overall organization I hope to have
at the end (with classes/methods etc., instead of just the
list-o-functions my original schedule explicitly laid out), and post
an update when done.
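
A minimal sketch of what such a label-stripping function could look
like. This is not the function from the BioGeography code; the name
strip_internal_labels and the regular expression are assumptions made
for illustration. It treats any text between a closing parenthesis and
the next ':', ',', ')' or ';' as an internal node label, so it would
also remove numeric support values written in that position, and it
does not handle quoted labels.

```python
import re

# Hypothetical helper, not part of Biopython or Lagrange.
# Matches ")" followed by label text, up to the next Newick delimiter.
_INTERNAL_LABEL = re.compile(r"\)[^():,;]+")


def strip_internal_labels(newick):
    """Return the Newick string with internal node labels removed."""
    return _INTERNAL_LABEL.sub(")", newick)


ts1b = ("(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
        "TT1:25.000000)Taxaceae:90.000000;")
print(strip_internal_labels(ts1b))
```

Branch lengths (the ':25.000000' parts) are left in place, so only the
labels are lost.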
Cheers!
Nick
>
>
>
>
>
> II. My Original Problem, Which is Probably Quite Small Really
>
> I think I kind of unintentionally kicked all of this off because I
> couldn't get Bio.Nexus.Trees to read what I considered pretty standard
> Newick files back when I was originally exploring this in the spring.
> Initially for my own scripts I used another newick parser & tree class I
> found online (Mailund's IIRC), then discovered a superior one in
> Lagrange and started using that. Thus in GSoC it was simplest to begin
> by importing the Lagrange parser, but that led to legitimate concerns
> about duplication/licensing etc.
>
> Reviewing my original issues from the spring, really the only problem I
> found with Bio.Nexus.Trees was with node labels, i.e. when an internal
> node is given e.g. a clade name, in addition to a branch length. This is a
> standard output on a great many newick files in my experience, which
> seem to be correctly read by just about all the other programs I use
> (Mesquite, Dendroscope, etc.) so I impulsively abandoned Bio.Nexus.Trees
> at the time when I couldn't get it to work.
>
>
>
>
>
> III. Bug Report
>
> I did file a bug report back in March. This is outstanding as far as I
> know.
>
> Bio.Nexus.Trees newick parser does not support internal node labels
> http://bugzilla.open-bio.org/show_bug.cgi?id=2788
>
>
>
>
>
>
>
> IV. Problem Examples
>
>
> Below I have accumulated some cases that work/don't work:
>
>
> =================
> from Bio.Nexus import Trees
>
> # This works
>
> ts0 = ("(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,"
>        "Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;")
>
> to0 = Trees.Tree(ts0)
> print to0
>
>
>
> # Gymnosperms tree with node labels; doesn't work
> ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;'
>
> to1a = Trees.Tree(ts1a)
>
>
>
>
> # Just Taxaceae; doesn't work
> ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;'
>
> to1b = Trees.Tree(ts1b)
>
> # Just Taxaceae; this works; node labels deleted
> ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;'
>
> to1c = Trees.Tree(ts1c)
>
>
>
>
> # This doesn't work (from bug report)
> ts2 = ("(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, "
>        "t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, "
>        "t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, "
>        "t1:0.130208)F:0.0318288)D:0.0273876);")
> to2 = Trees.Tree(ts2)
> =================
>
>
>
>
> But if I import the Lagrange tree class/parser, all of these work and my
> life is happy:
>
> =================
> import lagrange_newick
> # This is lagrange's newick.py file, renamed to lagrange_newick.py
>
> lt0 = lagrange_newick.parse(ts0)
> lt1a = lagrange_newick.parse(ts1a)
> lt1b = lagrange_newick.parse(ts1b)
> lt2 = lagrange_newick.parse(ts2)
> =================
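
For readers without the Lagrange module at hand, here is a minimal,
self-contained sketch of a Newick parser that keeps internal node
labels. It is neither Lagrange's newick.py nor the Bio.Nexus.Trees
parser; the names Node and parse_newick are hypothetical, and the
sketch assumes unquoted labels, no comments in square brackets, and a
single tree terminated by ';'. It is written for Python 3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node: optional label, optional branch length, children."""
    label: Optional[str] = None
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def parse_newick(text):
    """Parse one Newick tree and return its root Node."""
    s = "".join(text.split())  # drop whitespace, including line wraps
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def read_token():
        # Read characters up to the next Newick delimiter.
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:;":
            pos += 1
        return s[start:pos]

    def parse_node():
        nonlocal pos
        node = Node()
        if peek() == "(":
            pos += 1  # consume "("
            node.children.append(parse_node())
            while peek() == ",":
                pos += 1  # consume ","
                node.children.append(parse_node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
        # A leaf name or an internal node label; both are optional.
        node.label = read_token() or None
        if peek() == ":":
            pos += 1  # consume ":"
            node.length = float(read_token())
        return node

    root = parse_node()
    if peek() != ";":
        raise ValueError("expected ';' at position %d" % pos)
    return root


ts1b = ("(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
        "TT1:25.000000)Taxaceae:90.000000;")
root = parse_newick(ts1b)
print(root.label, root.length, [child.label for child in root.children])
```

The only step that matters for the node-label problem is reading an
optional label after the closing parenthesis, before the optional
':length'.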
>
>
>
>
>
>
> V. The Functions I Need From a Tree Class
>
> Basically my method of late has been to use the Lagrange Tree class, and
> then write my own standalone functions to do various necessary basic
> processing of trees. E.g.:
>
> * subset tree based on list of taxa; update root and any now-redundant
> internal nodes left with 0 or 1 descendants
>
> * extract a subtree to a new tree (cloned nodes so they don't refer to
> the old nodes, important in doing passes through tree)
>
> * read/write to Newick
>
> * print tree to screen in a readable format
>
> * get distance (total branch length between 2 nodes)
>
> * calculate many measures that can be done from the distances (total
> all-to-all distance matrix, tree length, mean phylogenetic distance,
> mean nearest-neighbor phylogenetic distance)
>
> * several others I don't remember off the top of my head
>
>
> In my list-o-functions approach, I would just write functions for the
> tree class I was using, but I think it has been made clear that really
> these functions should be methods of a certain Tree class. Which
> requires a decision about what Tree class to use.
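
To make the distance-based items in the list above concrete, here is a
minimal sketch of how they might look on a generic node structure. The
Node class and the function names are hypothetical and are not taken
from Bio.Nexus.Trees, Lagrange, or the BioGeography code. "Mean
phylogenetic distance" is assumed here to mean the average branch-length
distance over all pairs of tips.

```python
from itertools import combinations


class Node:
    """Hypothetical node: label, branch length to parent, children."""

    def __init__(self, label=None, length=0.0, children=()):
        self.label = label
        self.length = length
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def iternodes(self):
        yield self
        for child in self.children:
            yield from child.iternodes()

    def leaves(self):
        return [n for n in self.iternodes() if not n.children]


def tree_length(root):
    """Sum of branch lengths below the root (the root's own is excluded)."""
    return sum(n.length for n in root.iternodes() if n is not root)


def distance(a, b):
    """Total branch length on the path between nodes a and b."""
    up_from_a = {}
    dist, node = 0.0, a
    while node is not None:  # record distance from a to each ancestor
        up_from_a[node] = dist
        dist += node.length
        node = node.parent
    dist, node = 0.0, b
    while node is not None:  # climb from b to the first shared ancestor
        if node in up_from_a:
            return up_from_a[node] + dist
        dist += node.length
        node = node.parent
    raise ValueError("nodes are not in the same tree")


def mean_pairwise_distance(root):
    """Average distance over all pairs of tips."""
    pairs = list(combinations(root.leaves(), 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# The Taxaceae example from this thread, built by hand.
taxus = Node("Taxus", 100.0)
torreya = Node("Torreya", 100.0)
ceph = Node("Cephalotaxus", 125.0)
tt1 = Node("TT1", 25.0, [taxus, torreya])
taxaceae = Node("Taxaceae", 90.0, [ceph, tt1])

print(tree_length(taxaceae))
print(distance(ceph, taxus))
print(mean_pairwise_distance(taxaceae))
```

Written this way, each function takes a node and walks parent/child
links, so turning them into methods of whichever Tree class is chosen
is mostly a matter of moving them.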
>
>
>
>
>
> VI. What the current classes do.
>
> I had never looked seriously at Bio.Nexus.Trees since I was just
> crashing it, but it actually looks like it does a bunch:
>
> Bio.Nexus.Trees
> ===========
> type(to1c)
> <type 'instance'>
>
> to1c
> <Bio.Nexus.Trees.Tree instance at 0x39348a0>
>
> dir(to1c)
>
> ['_Tree__values_are_support',
> '__doc__',
> '__init__',
> '__module__',
> '__str__',
> '_add_subtree',
> '_get_id',
> '_get_values',
> '_parse',
> '_walk',
> 'add',
> 'all_ids',
> 'branchlength2support',
> 'chain',
> 'collapse',
> 'collapse_genera',
> 'common_ancestor',
> 'convert_absolute_support',
> 'count_terminals',
> 'dataclass',
> 'display',
> 'distance',
> 'get_taxa',
> 'get_terminals',
> 'has_support',
> 'id',
> 'is_bifurcating',
> 'is_compatible',
> 'is_identical',
> 'is_internal',
> 'is_monophyletic',
> 'is_parent_of',
> 'is_preterminal',
> 'is_terminal',
> 'kill',
> 'link',
> 'max_support',
> 'merge_with_support',
> 'name',
> 'node',
> 'prune',
> 'randomize',
> 'root',
> 'root_with_outgroup',
> 'rooted',
> 'search_taxon',
> 'set_subtree',
> 'split',
> 'sum_branchlength',
> 'to_string',
> 'trace',
> 'unlink',
> 'unroot',
> 'weight']
>
>
> # Node methods:
> nd = to1c.node(1)
>
> nd
> <Bio.Nexus.Nodes.Node instance at 0x39227b0>
>
>
> type(nd)
> <type 'instance'>
>
> dir(nd)
>
> ['__doc__',
> '__init__',
> '__module__',
> 'add_succ',
> 'data',
> 'get_data',
> 'get_id',
> 'get_prev',
> 'get_succ',
> 'id',
> 'prev',
> 'remove_succ',
> 'set_data',
> 'set_id',
> 'set_prev',
> 'set_succ',
> 'succ']
>
>
> # Node data:
> ndd = nd.get_data()
>
> dir(ndd)
>
> ['__doc__',
> '__init__',
> '__module__',
> 'branchlength',
> 'comment',
> 'support',
> 'taxon']
> ===========
>
>
>
>
>
>
>
> Lagrange Tree Class:
> (really class Node I guess, and the tree is referenced by the root Node)
>
> =============
> type(lt1b)
> <type 'instance'>
>
> lt1b
> <lagrange_phylo.Node instance at 0x392b120>
>
> dir(lt1b)
>
> ['__doc__',
> '__init__',
> '__module__',
> 'add_child',
> 'children',
> 'data',
> 'descendants',
> 'excluded_dists',
> 'find_descendant',
> 'graft',
> 'isroot',
> 'istip',
> 'iternodes',
> 'label',
> 'labelset_nodemap',
> 'leaf_distances',
> 'leaves',
> 'length',
> 'mrca',
> 'nchildren',
> 'order_subtrees_by_size',
> 'parent',
> 'prune',
> 'remove_child',
> 'rootpath',
> 'subtree_mapping',
> 'ultrametricize_dumbly']
> =============
>
>
>
>
> Bio.PhyloXML.Tree
> =============
> [not sure...perhaps someone could contribute the list of
> methods/intended methods]
> =============
>
>
>
>
> VII. I am Leaning Towards Bio.Nexus.Trees
>
> Based on current functionality and integration with BioPython, and what
> can be done in the short term, it looks to me like the best option is to
> mod the Bio.Nexus.Trees module, inspired by the Lagrange Node class as
> necessary. However if e.g. PhyloXML is working well enough that I can
> use that, that is an option.
>
>
>
>
>
> VIII. What I should do next
>
> Given what I now know, I probably should have just written a little
> function to strip node labels out of my Newick trees, and done
> everything based on the Bio.Nexus.Trees class. I could still do this
> and continue on my merry way without too much trouble.
>
> But given that my tree-based functions should probably be methods of
> some class...here are the questions I have:
>
> * Should I muck with Bio.Nexus.Trees and try to fix the node labels
> issue? My instinct was not to mess with other people's stuff, but that
> may be a poor instinct...
>
> * Should I implement my tree-based functions as methods of the
> Bio.Nexus.Trees class?
>
> * Should I delay on this whole issue while it is being discussed, and go
> back to issues more localized to my GSoC project, i.e. making my GBIF
> functions into methods of a GBIF records class?
>
>
> Thanks for reading! And sorry if this was more confusing than it had to
> be, I am definitely learning as I go here.
>
> Cheers,
> Nick
>
>
>
>
>
>
>
>
>>
>> It would be nice to design this modularly -- with mixin classes for
>> related add-on functionality -- as much as possible. This would
>> allow lighter weight implementations in the future if that were
>> desired.
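
One possible reading of the mixin suggestion, as a minimal sketch. All
class and method names below are hypothetical and do not correspond to
any existing Biopython code: a small core class holds the data, each
mixin adds one group of related methods, and a full-featured class
combines them. A lighter-weight implementation would simply leave a
mixin out.

```python
class BaseTree:
    """Hypothetical lightweight core: parent links and branch lengths."""

    def __init__(self):
        self.parent = {}  # child name -> parent name
        self.length = {}  # child name -> branch length to its parent

    def add(self, name, parent=None, length=0.0):
        # The root is added with parent=None and stores nothing.
        if parent is not None:
            self.parent[name] = parent
            self.length[name] = length


class BranchLengthMixin:
    """Add-on: measures derived from branch lengths."""

    def total_length(self):
        return sum(self.length.values())


class AncestryMixin:
    """Add-on: queries about ancestry."""

    def path_to_root(self, name):
        path = [name]
        while path[-1] in self.parent:
            path.append(self.parent[path[-1]])
        return path


class AnalysisTree(BranchLengthMixin, AncestryMixin, BaseTree):
    """Full-featured tree: the core plus both add-ons."""


tree = AnalysisTree()
tree.add("Taxaceae")
tree.add("Cephalotaxus", parent="Taxaceae", length=125.0)
tree.add("TT1", parent="Taxaceae", length=25.0)
tree.add("Taxus", parent="TT1", length=100.0)
tree.add("Torreya", parent="TT1", length=100.0)
print(tree.total_length())
print(tree.path_to_root("Taxus"))
```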
>>
>>> The benefit of letting the tree object structures diverge is
>>> procrastination
>>> -- we could reconcile the two modules after GSoC is over, with stable
>>> features and test suites in place. But I could justifiably focus on
>>> integration for the remaining weeks if that's best for Biopython, since
>>> otherwise I'd probably be reimplementing a number of features already
>>> present in other modules.
>>
>> My vote is for the integration work. Refactoring is hard work and
>> best done early. It is easier to add functionality to a fully integrated
>> PhyloXML parser in the future.
>>
>>> I bet this could be done without different objects. Bio.PhyloXML.Tree
>>> could
>>> be moved to Bio.Tree or Bio.Tree.Elements; the base class
>>> PhyloElement could
>>> be renamed to TreeElement; and the Nexus and Newick parsers could reuse
>>> PhyloXML's Phylogeny and Clade elements, where Clade merges with the
>>> existing Node class(es). Even Clade by itself might be enough. For
>>> organizational purposes, format-specific tree elements could move to
>>> their
>>> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some
>>> multiple-inheritance tricks could be used to smooth things over.
>>
>> Yes, this sounds exactly right. Great stuff.
>>
>>> (I know nothing
>>> about NeXML; should we keep an eye on that too? Glancing at the homepage, I
>>> don't see much about complex annotation types, which is probably good
>>> if we
>>> want to fit that format into this framework eventually.)
>>
>> PhyloXML plus Nexus/Newick is probably enough to stay reasonably
>> general and keep our sanity. NeXML support would be great but
>> practically is an additional project. The refactoring you've described
>> is a good chunk to run with.
>>
>> Brad
>>
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================