[Biopython-dev] GSoC Weekly Update 9: PhyloXML for Biopython
Eric Talevich
eric.talevich at gmail.com
Mon Jul 20 14:57:15 UTC 2009
Hi all,
Previously (July 13-17) I:
- Implemented "Collapse Whitespace Policy" -- the spec mentions this in the
  glossary but doesn't appear to say where it should be used, so I applied
  it willy-nilly. (Mainly on 'name' and 'desc'/'description' node text.)
- Made Writer use the normal namespace prefixes -- for human-readability,
  though it technically doesn't matter for parsing.
- Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it
  failed, probably due to element ordering.
- Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are
  all under Bio.Tree now, while TreeIO contains just a thin wrapper for
  Parser and Writer (still under Bio.PhyloXML). Three mostly empty base
  classes live in Bio.Tree.BaseTree, and PhyloXML's tree classes now inherit
  from them. This made it possible to generalize the Utils.pretty_print
  function and move it to Bio.Tree.Utils. The other "utility", for dumping
  XML tag names, was added to PhyloXML's Parser near the other XML-related
  helpers.
- Checked that 'other' objects won't belong to the phyloXML namespace.
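For reference, the whitespace collapsing mentioned above can be sketched roughly as follows. This is an illustration of the XML Schema "collapse" rule, not the code in the branch; the function name `collapse_whitespace` is made up for this example.

```python
import re

# XML Schema's "collapse" rule treats only space, tab, CR and LF as whitespace.
_XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text):
    """Replace each whitespace run with one space, then trim both ends.

    Hypothetical helper for illustration; None is passed through so that
    elements with no text can be handled without a special case.
    """
    if text is None:
        return None
    return _XML_WHITESPACE.sub(" ", text).strip(" ")


print(collapse_whitespace("  Homo \n\t sapiens  "))  # prints: Homo sapiens
```

Applying something like this only to free-text fields such as 'name' and 'description' leaves numeric and token-like values untouched.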
This week (July 20-24) I will:
Extend the core to the rest of the spec:
- Add unit tests and classes to support the remaining (non-core)
  phyloXML elements
- Use the schema document to validate the input file -- or at least, make
  Writer use the correct sub-node ordering
- Take a stab at phyloXML 1.10 support
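One possible approach to the sub-node ordering item, sketched with the standard library's ElementTree: reorder each clade's children by a fixed tag sequence before writing. The `CLADE_CHILD_ORDER` list here is an illustrative subset I made up for the example, not the sequence from the XSD, and real tags would also carry the phyloXML namespace prefix.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; the real sequence has to be read from the XSD.
CLADE_CHILD_ORDER = ["name", "branch_length", "confidence",
                     "taxonomy", "sequence", "clade"]


def sort_children(element, order=CLADE_CHILD_ORDER):
    """Reorder an element's children in place to follow `order`.

    Tags missing from `order` sort last. sorted() is stable, so siblings
    sharing a tag (e.g. several sub-clades) keep their relative order.
    """
    rank = dict((tag, i) for i, tag in enumerate(order))
    element[:] = sorted(element, key=lambda child: rank.get(child.tag, len(order)))
    for child in element:
        if child.tag == "clade":
            sort_children(child, order)


root = ET.fromstring(
    "<clade><clade><taxonomy/><name/></clade><branch_length/><name/></clade>")
sort_children(root)
print([child.tag for child in root])  # ['name', 'branch_length', 'clade']
```

This only fixes ordering on output; it doesn't replace actual schema validation of the input.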
Work on documentation:
- Address remaining comments from code/doc review
- Revisit docstrings for all classes, functions, methods; consider enabling
  epydoc formatting
Also:
- Improve the SeqRecord conversion
- Warnings: show the offending line at the previous level in the stack
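For the warnings item, the standard `warnings.warn` call takes a `stacklevel` argument that does this. A minimal sketch, where `parse_confidence` is a hypothetical helper and not a function from the parser:

```python
import warnings


def parse_confidence(value):
    """Hypothetical helper: convert text to float, warning on bad input."""
    try:
        return float(value)
    except ValueError:
        # stacklevel=2 attributes the warning to whoever called this
        # function, rather than to this line inside the helper.
        warnings.warn("ignoring non-numeric confidence %r" % value,
                      UserWarning, stacklevel=2)
        return None


parse_confidence("0.9")   # fine, returns 0.9
parse_confidence("high")  # warns, reporting this line as the source
```

With the default `stacklevel=1` the reported location would be the `warnings.warn` line itself, which is rarely what the user wants to see.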
Remarks:
I haven't done anything specifically for Nexus integration, though I'm
looking at the Bio.Nexus Tree and Node classes while writing the
Bio.Tree.BaseTree classes.
I'm also looking at PhyloDB, the BioSQL extension. Plan: BaseTree classes
will mirror PhyloDB tables, and any methods from PhyloXML trees that only
rely on those attributes will be moved to the base classes.
Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called
'clade' in phyloXML, and most of the base-class methods will operate on that
attribute. Options:
1. Create two properties on PhyloXML's Clade and Phylogeny classes, called
   'clade' and 'clades', that simply access the object's 'node' attribute.
2. Break phyloXML's naming convention, and call a 'clade' a 'node'. The I/O
   functions currently treat tag_name<->attribute as the general case, with
   exceptions like pluralization scattered in, so making this change will
   be unpretty but not horrible.
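To make option 1 concrete, here is a rough sketch of the aliasing idea. The class `BaseNode`, the plural attribute name `nodes`, and the method `count_terminals` are assumptions for illustration, not the current Bio.Tree.BaseTree code.

```python
class BaseNode(object):
    """Hypothetical base class using PhyloDB/Nexus-style naming."""

    def __init__(self, nodes=None):
        self.nodes = list(nodes) if nodes is not None else []

    def count_terminals(self):
        """Generic method written purely in terms of 'nodes'."""
        if not self.nodes:
            return 1
        return sum(child.count_terminals() for child in self.nodes)


class Clade(BaseNode):
    """phyloXML-flavoured subclass: 'clades' is just an alias for 'nodes'."""

    @property
    def clades(self):
        return self.nodes

    @clades.setter
    def clades(self, value):
        self.nodes = list(value)


tree = Clade([Clade(), Clade([Clade(), Clade()])])
print(tree.count_terminals())      # 3
print(tree.clades is tree.nodes)   # True
```

The base-class methods never see the phyloXML names, so the I/O code can keep its tag_name<->attribute mapping unchanged.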
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML