[Biopython-dev] Code review request for phyloxml branch
Eric Talevich
eric.talevich at gmail.com
Tue Dec 29 01:51:40 UTC 2009
Hi folks,
Here's an update on the status of Bio.Tree and TreeIO. I think I've taken
care of most of the blockers since the last review in September.
First, some links:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py
http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py
http://biopython.org/wiki/PhyloXML
Discussion:
*TreeIO*
Conversion between Nexus, Newick and phyloXML tree file formats works; the
read/parse/write functions for each IO format use the same object types.
Neat!
The tree annotations (e.g. id) aren't preserved perfectly during conversions
-- I'll keep working on this, but I don't think it's a blocker. The taxon
names of terminal nodes are kept as "clade" names in phyloXML for
round-tripping. Tree topology and branch lengths seem OK.
Under the hood:
-- PhyloXMLIO is from GSoC
-- NewickIO is ported from the Bio.Nexus.Trees parser. I think it works the
same way.
-- NexusIO relies on Bio.Nexus.Nexus for parsing, then converts the
resulting Nexus.Trees.Tree objects to Bio.Tree.Newick objects. One day, when
Nexus.Trees is replaced by NewickIO in the main Nexus parser, then this
conversion can be dropped and NexusIO will be very simple.
*Tree*
The BaseTree object structure looks like this:*
-- BaseTree.**Tree* contains global tree information, like whether the tree
is rooted, and a reference to the root clade. The phyloXML Phylogeny object
inherits from this.*
-- BaseTree.**Subtree* contains local (clade- or node-specific) information,
and references to each of its direct descendents, recursively. The phyloXML
Clade object inherits from this. Nodes are implicit. I could add references
to the ancestor of each sub-tree without too much difficulty, but I haven't
needed them yet.
The same methods (get_terminals et al.) generally apply to both classes, so
I created a separate TreeMixin class from which both BaseTree.Tree and
BaseTree.Subtree inherit.
Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
as I imagine it:
(1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
reasonable (since the node IDs and adjacency list lookup are no longer
needed)
(2) Implement methods in Bio.Tree.Newick with the original argument lists,
but triggering a deprecation warning indicating the newer replacement method
(3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
shims to duplicate the original API -- so test_Nexus.py should still pass,
ideally (with deprecation warnings)
(4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
NexusIO and Bio.Tree methods.
(5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
I'm currently doing (1) and (2), with more emphasis on getting (1) right.
Not all of the important methods have been ported, but I'm happy with the
tree traversal methods.
*
Tests
*I created test_Tree.py to test the methods in Bio.Tree.BaseTree;
test_PhyloXML.py tests Bio.Tree.PhyloXML objects and Bio.TreeIO.PhyloXMLIO
parsing/writing.
I noticed that in Tests/Nexus/, the example file for internal node labels is
actually in Newick/NH format, not Nexus. That was briefly confusing, so
maybe that file should be renamed.
What do you think?
All the best,
Eric
More information about the Biopython-dev
mailing list