[Biopython-dev] GSoC Weekly Update 9: PhyloXML for Biopython

Mon Jul 20 14:57:15 UTC 2009

Hi all,

Previously (July 13-17) I:

    - Implemented "Collapse Whitespace Policy" -- the spec mentions this in the
      glossary but doesn't appear to say where it should be used, so I applied
      it willy-nilly. (Mainly on 'name' and 'desc'/'description' node text.)

    - Made Writer use the normal namespace prefixes -- for human-readability,
      though it technically doesn't matter for parsing.

    - Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it
      failed, probably due to element ordering.

    - Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are
      all under Bio.Tree now, while TreeIO contains just a thin wrapper for
      Parser and Writer (still under Bio.PhyloXML). Three mostly empty base
      classes live in Bio.Tree.BaseTree and PhyloXML's tree classes now inherit
      from them. This made it possible to generalize the Utils.pretty_print
      function and move it to Bio.Tree.Utils. The other "utility", for dumping
      XML tag names, was added to PhyloXML's Parser near the other XML-related
      helpers.

    - Checked that 'other' objects won't belong to the phyloXML namespace.
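
For the whitespace item above, a minimal sketch of what "collapse" usually
means in XML Schema terms: runs of space, tab, CR and LF become a single space,
and leading/trailing spaces are dropped. The helper name collapse_whitespace is
hypothetical and is not taken from the PhyloXML code.

```python
import re

# XML Schema's "collapse" rule only concerns space, tab, CR and LF.
_XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text):
    """Hypothetical helper: collapse XML whitespace in a node's text."""
    if text is None:
        # Elements with no text content are passed through unchanged.
        return None
    # Squeeze each whitespace run to one space, then trim both ends.
    return _XML_WHITESPACE.sub(" ", text).strip(" ")
```

This would be applied to text such as a 'name' or 'desc' value before storing
it on the tree object.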

This week (July 20-24) I will:

    Extend the core to the rest of the spec:

    - Add unit tests and classes to support the remaining (non-core)
      phyloXML elements
    - Use the schema document to validate the input file -- or at least, make
      Writer use the correct sub-node ordering
    - Take a stab at phyloXML 1.10 support
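
One possible way to get the sub-node ordering right, sketched with the standard
library's ElementTree. The CLADE_ORDER list is an illustrative subset and an
assumption; the real sequence has to be read from the phyloXML XSD. The
function name reorder_children is hypothetical, and tags are assumed to be
un-namespaced to keep the example short.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only (assumption): check the actual order in the XSD.
CLADE_ORDER = ["name", "branch_length", "confidence",
               "taxonomy", "sequence", "clade"]


def reorder_children(elem, order=CLADE_ORDER):
    """Hypothetical helper: sort an element's children into schema order."""
    rank = dict((tag, i) for i, tag in enumerate(order))
    # sorted() is stable, so repeated tags (e.g. several sub-clades) keep
    # their relative order; unknown tags are pushed to the end.
    elem[:] = sorted(elem, key=lambda child: rank.get(child.tag, len(order)))
    return elem
```

Running this on each element just before serialization would leave the tree
classes free to store children in any order.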

    Work on documentation:

    - Address remaining comments from code/doc review
    - Revisit docstrings for all classes, functions, methods; consider enabling
      epydoc formatting

    Also:

    - Improve the SeqRecord conversion
    - Warnings: show the offending line at the previous level in the stack
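
For the warnings item, the standard library's warnings.warn() takes a
stacklevel argument; stacklevel=2 attributes the warning to the caller's line
rather than to the line inside the helper. A small sketch, where the
PhyloXMLWarning class and the check_confidence function are hypothetical names
used only for illustration:

```python
import warnings


class PhyloXMLWarning(UserWarning):
    """Hypothetical warning category for non-fatal phyloXML problems."""


def check_confidence(value):
    """Hypothetical validator: warn when a value falls outside [0, 1]."""
    if not 0.0 <= value <= 1.0:
        # stacklevel=2 makes the reported file and line those of the code
        # that called check_confidence(), one level up the stack.
        warnings.warn("confidence %r is outside [0, 1]" % (value,),
                      PhyloXMLWarning, stacklevel=2)
    return value
```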

Remarks:

I haven't done anything specifically for Nexus integration, though I'm looking
at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree classes.
I'm also looking at PhyloDB, the BioSQL extension. Plan: BaseTree classes will
mirror PhyloDB tables, and any methods from PhyloXML trees that only rely on
those attributes will be moved to the base classes.

Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called
'clade' in phyloXML, and most of the base-class methods will operate on that
attribute. Options:

    1. Create two properties on PhyloXML's Clade and Phylogeny classes, called
    'clade' and 'clades', that simply access the object's 'node' attribute.

    2. Break phyloXML's naming convention, and call a 'clade' a 'node'. The I/O
    functions currently treat tag_name<->attribute as the general case, with
    exceptions like pluralization scattered in, so making this change will be
    unpretty but not horrible.
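
A rough sketch of what option 1 could look like. The class layout, the 'nodes'
attribute name and the count_terminals method are assumptions made for the
example, not the actual Bio.Tree.BaseTree code: the base class works on
'nodes', and the PhyloXML subclass exposes the same list under the phyloXML
name.

```python
class Node(object):
    """Stand-in for a base-class node (hypothetical layout)."""

    def __init__(self, nodes=None):
        self.nodes = list(nodes) if nodes is not None else []

    def count_terminals(self):
        # Base-class method that relies only on the generic 'nodes' attribute.
        if not self.nodes:
            return 1
        return sum(child.count_terminals() for child in self.nodes)


class Clade(Node):
    """PhyloXML-flavoured node: 'clades' is an alias for 'nodes'."""

    @property
    def clades(self):
        return self.nodes

    @clades.setter
    def clades(self, value):
        self.nodes = list(value)
```

With this arrangement the I/O code can keep its tag_name<->attribute rule
('clade' elements land in 'clades'), while inherited methods never need to know
the phyloXML name.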

Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML