[Biopython-dev] GSoC Weekly Update 10: PhyloXML for Biopython
Eric Talevich
eric.talevich at gmail.com
Mon Jul 27 17:56:40 UTC 2009
Hi folks,
Previously (July 20-24) I:
Finished implementing I/O methods, Tree classes and tests for all phyloXML
elements.
Changed Writer to preserve node order in the XML; output now validates
under the phyloXML 1.00 schema (but 1.10 complains)
Did some drastic code reorganization.
- Bio.Tree:
  - Moved Clade.find() and PhyloElement.__repr__ methods to BaseTree
    classes
  - Made Clade inherit from BaseTree.Tree in addition to BaseTree.Node,
    and added the corresponding attributes
  - Moved Bio.PhyloXML.Tree to Bio.Tree.PhyloXML
- Bio.TreeIO:
  - Merged PhyloXML's Parser and Writer into PhyloXMLIO under the new
    Bio.TreeIO module, and updated imports everywhere
  - Added wrappers for Nexus read/write; doesn't return Bio.Tree objects
    yet though
Added/updated unit tests for all of this.
Documented the code reorg on the Biopython wiki, adding Tree and TreeIO
pages and fixing the examples on the PhyloXML page.
Scrubbed docstrings and enabled epydoc processing.
This week (July 27-31) I will:
Finish implementing the phyloXML spec:
- Scan "simple types" for restricted tokens; check strings in constructors
- Take a stab at phyloXML 1.10 support (need a 'version' arg to Writer?)
- Clean up and reorganize any code that needs it
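
A minimal sketch of what "check strings in constructors" could look like
for one restricted token type. The class, the helper name and the allowed
values here are assumptions for illustration (the token set should be
verified against the phyloXML XSD); this is not the actual
Bio.Tree.PhyloXML code.

```python
def check_token(value, allowed, name):
    """Return value unchanged if it is an allowed token, else raise."""
    if value not in allowed:
        raise ValueError("%s must be one of %s, got %r"
                         % (name, sorted(allowed), value))
    return value


class Sequence(object):
    """Hypothetical stand-in for a phyloXML element class."""

    # Assumed token set for the 'type' attribute; verify against the XSD.
    ALLOWED_TYPES = frozenset(["rna", "dna", "protein"])

    def __init__(self, type=None):
        # None means the optional attribute was not supplied.
        if type is not None:
            check_token(type, self.ALLOWED_TYPES, "type")
        self.type = type
```

Checking in the constructor means an invalid token fails early, at object
creation, rather than later when the tree is written and validated.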
Enhancements (time permitting):
- Improve the SeqRecord conversion
- Work on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extension
- Port common methods to Bio.Tree.BaseTree -- see Bio.Nexus.Tree, Bioperl
  node objects, PyCogent, p4-phylogenetics
- Tree method: build_index (set left_idx, right_idx on all nodes):
  - calculate left/right indexes for nested-set representation
  - see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
- Export to networkx (http://networkx.lanl.gov/) -- also get graphviz
  export for free, via networkx.to_agraph()
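
To make the build_index item concrete, here is a rough sketch of
nested-set indexing by depth-first traversal. The Node class is a
hypothetical stand-in, not Bio.Tree.BaseTree, and is_descendant is an
assumed helper added only to show what the indexes are good for.

```python
class Node(object):
    """Hypothetical minimal tree node (not the Bio.Tree.BaseTree class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.left_idx = None
        self.right_idx = None


def build_index(node, counter=1):
    """Set left_idx/right_idx on node and all of its descendants.

    left_idx is assigned on the way down and right_idx on the way back
    up. Returns the next unused counter value.
    """
    node.left_idx = counter
    counter += 1
    for child in node.children:
        counter = build_index(child, counter)
    node.right_idx = counter
    return counter + 1


def is_descendant(node, ancestor):
    """True if node's interval lies strictly inside ancestor's interval."""
    return (ancestor.left_idx < node.left_idx
            and node.right_idx < ancestor.right_idx)


# Example tree: root -> (A, B -> (C, D))
c, d = Node("C"), Node("D")
a, b = Node("A"), Node("B", [c, d])
root = Node("root", [a, b])
build_index(root)
```

With the indexes in place, an ancestry test is just two integer
comparisons, which is what makes this representation convenient for
relational storage.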
Remarks:
- Bioperl's phyloXML driver was written for version 1.00 and might hurl if
  given a v1.10 file -- so that's a potential problem if Biopython defaults
  to writing v1.10 files. Should Writer take an option to specify the file
  format version number? Right now it only writes valid phyloXML v1.00.
- PhyloXMLIO also always writes branch_length as an XML node, not an
  attribute. This validates and will be handled safely by any sane parser,
  and fits better with the idea of an implicit root node in each clade
  object, I think. (The parser still handles an attribute properly.) Any
  objections?
- Above, I've listed more enhancements than I'll probably be able to finish
  this week. Which should have higher priority? I know merging Bio.Nexus
  and Bio.Tree would be the most useful, but since (1) Biopython
  development still happens on CVS, not Git, and (2) another Tree-based
  GSoC project is expected to land around the same time as mine, I think
  doing the integration right now would be kind of painful. So I can focus
  either on laying the groundwork in Bio.Tree.BaseTree, copying rather than
  moving the relevant Nexus code, or else work mainly on exporting to other
  useful object representations like networkx graphs, or any Biopython
  classes I've missed (e.g. alignments). Suggestions?
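
On the branch_length remark above, the read-both, write-one behaviour can
be illustrated with the standard library's ElementTree. This is only a
sketch under simplifying assumptions: the function names are made up for
the example, the phyloXML namespace is ignored, and it is not the
PhyloXMLIO code itself.

```python
import xml.etree.ElementTree as ET


def get_branch_length(clade):
    """Return a <clade> element's branch length as a float, or None.

    Accepts either form: a <branch_length> child element or a
    branch_length attribute. The child element takes precedence.
    """
    child = clade.find("branch_length")
    if child is not None and child.text:
        return float(child.text)
    attr = clade.get("branch_length")
    if attr is not None:
        return float(attr)
    return None


def set_branch_length(clade, value):
    """Always write the branch length as a child element."""
    child = ET.SubElement(clade, "branch_length")
    child.text = str(value)
    return child


as_attr = ET.fromstring('<clade branch_length="0.5"/>')
as_node = ET.fromstring('<clade><branch_length>0.25</branch_length></clade>')
written = ET.Element("clade")
set_branch_length(written, 1.5)
```

Both input forms read back as the same float, while anything written
always uses the child element.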
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML