[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython

Tue Jul 21 08:40:10 EDT 2009

Hi Eric;
Great stuff this week. I'm happy to see the generalized Tree
interface coming together and appreciate you taking the time to look
through PhyloDB for future compatibility with that.

>     - Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it
>       failed, probably due to element ordering.

It would be nice to get validation working. I'm not a big stickler
for XSD validation myself, but I have worked with people who were and
know that it can be a point of contention. Being able to validate
cleanly will improve the perception of PhyloXML, and of the Biopython
implementation specifically. Hopefully that will lead to greater use
and adoption.
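
If element ordering really is the culprit, one possible fix is to sort
each element's children into schema order just before serializing. A
rough sketch using the standard library follows. The reorder_children
helper and the CLADE_CHILD_ORDER list are made up for illustration; the
real order has to be read from the phyloXML XSD, and real documents
would also carry the phyloXML namespace in each tag.

```python
import xml.etree.ElementTree as ET

# Illustrative (assumed) child order for a <clade> element. The
# authoritative order must come from the phyloXML XSD itself.
CLADE_CHILD_ORDER = ["name", "branch_length", "confidence",
                     "taxonomy", "sequence", "clade"]


def reorder_children(elem, order=CLADE_CHILD_ORDER):
    """Sort the children of elem (and nested clades) into the given order."""
    rank = {tag: i for i, tag in enumerate(order)}
    children = list(elem)
    # list.sort is stable, so repeated tags keep their relative order.
    # Tags missing from the order list are pushed to the end.
    children.sort(key=lambda child: rank.get(child.tag, len(rank)))
    elem[:] = children
    for child in children:
        if child.tag == "clade":
            reorder_children(child, order)
    return elem


if __name__ == "__main__":
    xml = ("<clade><clade><name>A</name></clade><taxonomy/>"
           "<name>root</name><branch_length>1.0</branch_length></clade>")
    root = reorder_children(ET.fromstring(xml))
    print(ET.tostring(root, encoding="unicode"))
```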

>     - Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are
>       all under Bio.Tree now, while TreeIO contains just a thin wrapper for
>       Parser and Writer (still under Bio.PhyloXML). Three mostly empty base
>       classes live in Bio.Tree.BaseTree and PhyloXML's tree classes now inherit
>       from them.

This looks really nice -- thanks again. Do you think any of the
functionality from the Nexus trees class would fit in here and be
useful for examining PhyloXML trees? There is a lot there, but a few
methods that caught my eye, beyond the total_branch_length function
you had a skeleton for, were get_terminals, is_identical,
common_ancestor, and distance.
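
To make that concrete, here is a rough sketch of how a few of those
methods might look on a generic node class. The Node class and its
path_to and _shared_paths helpers are made up for illustration and are
not the Bio.Nexus or Bio.Tree API; the method names are just borrowed
from the list above.

```python
class Node:
    """Hypothetical minimal tree node, for illustration only."""

    def __init__(self, name=None, branch_length=0.0, nodes=None):
        self.name = name
        self.branch_length = branch_length
        self.nodes = list(nodes or [])

    def get_terminals(self):
        """Return all leaf nodes at or below this node."""
        if not self.nodes:
            return [self]
        terminals = []
        for child in self.nodes:
            terminals.extend(child.get_terminals())
        return terminals

    def total_branch_length(self):
        """Sum the branch lengths of this node and all descendants."""
        return self.branch_length + sum(
            child.total_branch_length() for child in self.nodes)

    def path_to(self, target):
        """Return the list of nodes from self down to target, or None."""
        if self is target:
            return [self]
        for child in self.nodes:
            sub = child.path_to(target)
            if sub is not None:
                return [self] + sub
        return None

    def _shared_paths(self, a, b):
        """Return both paths and the length of their common prefix."""
        path_a, path_b = self.path_to(a), self.path_to(b)
        if path_a is None or path_b is None:
            raise ValueError("both nodes must be in this tree")
        shared = 0
        for x, y in zip(path_a, path_b):
            if x is not y:
                break
            shared += 1
        return path_a, path_b, shared

    def common_ancestor(self, a, b):
        """Return the deepest node that lies on the paths to both a and b."""
        path_a, _, shared = self._shared_paths(a, b)
        return path_a[shared - 1]

    def distance(self, a, b):
        """Sum the branch lengths along the path connecting a and b."""
        path_a, path_b, shared = self._shared_paths(a, b)
        tail = path_a[shared:] + path_b[shared:]
        return sum(node.branch_length for node in tail)
```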

> I haven't done anything specifically for Nexus integration, though I'm
> looking at the Bio.Nexus Tree and Node classes while writing
> Bio.Tree.BaseTree classes. I'm also looking at PhyloDB, the BioSQL
> extension. Plan: BaseTree classes will mirror PhyloDB tables, and any
> methods from PhyloXML trees that only rely on those attributes will be
> moved to the base classes.

This sounds fine. You are welcome to dig into Nexus if you want, but
it's certainly outside the scope of the proposal.

> Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called
> 'clade' in phyloXML, and most of the base-class methods will operate on that
> attribute. Options:
> 
>     1. Create two properties on PhyloXML's Clade and Phylogeny classes,
>     called 'clade' and 'clades', that simply access the object's 'node'
>     attribute.
> 
>     2. Break phyloXML's naming convention, and call a 'clade' a 'node'.
>     The I/O functions currently treat tag_name<->attribute as the general
>     case, with exceptions like pluralization scattered in, so making this
>     change will be unpretty but not horrible.

I like option 1 -- make clade and clades references to the node/nodes
attributes. I do prefer the node naming convention, but for the
PhyloXML-specific classes you should also be able to retrieve things
using their clade nomenclature.
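
Something along these lines is what I have in mind for option 1. This
is only a sketch: the BaseNode and BaseTree classes below are
stand-ins I made up, and the real Bio.Tree.BaseTree classes may be
organized differently.

```python
class BaseNode:
    """Stand-in for a generic base class that stores children as 'nodes'."""

    def __init__(self, nodes=None):
        self.nodes = list(nodes or [])


class BaseTree:
    """Stand-in for a generic base class that stores its root as 'node'."""

    def __init__(self, node=None):
        self.node = node


class Clade(BaseNode):
    """PhyloXML-style node: 'clades' is an alias for the base 'nodes'."""

    @property
    def clades(self):
        return self.nodes

    @clades.setter
    def clades(self, value):
        self.nodes = list(value)


class Phylogeny(BaseTree):
    """PhyloXML-style tree: 'clade' is an alias for the base 'node'."""

    @property
    def clade(self):
        return self.node

    @clade.setter
    def clade(self, value):
        self.node = value
```

Base-class methods can then work purely in terms of node/nodes, while
PhyloXML users and the I/O code keep seeing clade/clades.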

Brad