[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython

Thu Jul 23 15:59:32 UTC 2009

All,

Thanks for the ongoing discussion and helpful links. I'm going to propose an
object mapping here and see how it sits with everyone -- please correct any
questionable statements.

In raw XML, the clade designation looks reasonable. The attributes that blur
the clade-node distinction are branch_length, confidence and node_id. In the
first two, the attributes apply to an implicit root node, not the entire
clade. (Stated this way, it makes much more sense in the XML representation
to have branch_length as a child node, not an attribute.) The node_id
clearly applies to the clade's root node, once it's understood that the node
is implicit.

http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h-1124608460
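
To make the attribute-versus-child-element point concrete, here is a rough
sketch using only the standard library. As far as I can tell from the
schema, branch_length may appear either as an attribute or as a child
element of a clade; the sample document and the branch_length() helper are
made up for illustration and are not part of any existing parser.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.phyloxml.org}"

# Hypothetical sample: one clade uses the attribute form, one the element form.
SAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <clade branch_length="0.06">
        <name>A</name>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.23</branch_length>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def branch_length(clade_elem):
    """Return the branch length of a clade's implicit root node, or None."""
    value = clade_elem.get("branch_length")  # attribute form
    if value is None:
        value = clade_elem.findtext(NS + "branch_length")  # child-element form
    return float(value) if value is not None else None


root = ET.fromstring(SAMPLE)
top = root.find(NS + "phylogeny/" + NS + "clade")
lengths = [branch_length(c) for c in top.findall(NS + "clade")]
```

Either way the value ends up attached to the clade element itself, which is
why it has to be read as describing the implicit root node.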

On Thu, Jul 23, 2009 at 2:08 AM, Chris Fields <cjfields at illinois.edu> wrote:

> On Jul 23, 2009, at 12:12 AM, Mark A. Jensen wrote:
>
>> FWIW, BioPerl has Trees and Nodes. That's it; maybe Branches later (if I
>> get around to it, or convince Chase it would be a good project).
>
>
Many of the existing generalized Tree object representations seem to be
guided by the Nexus/Newick format, which is basically an s-expression. This
format can represent a tree as a parenthetical expression, and a node as a
token (comma-delimited, potentially combining a taxon label and branch
length separated by a colon) within that expression. Edges or branches are
implicit. So Trees, Nodes, branch lengths and labels are all we *really*
need to find common ground on, but other, more expressive representations
are certainly possible.
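
As a sketch of how little structure the Newick view requires, the following
toy parser turns a parenthetical expression into nested (label, length,
children) tuples. It is only an illustration under simplifying assumptions
(no whitespace, quoted labels or comments) and is not the parser used by any
of the Bio* projects; parse_newick is a hypothetical name.

```python
def parse_newick(text):
    """Parse a simple Newick string into (label, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
        # Optional label: everything up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos] or None
        # Optional branch length, introduced by a colon.
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    node = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text at position %d" % pos)
    return node


tree = parse_newick("(A:0.1,B:0.2)C:0.3;")
```

Note that the result contains only nodes, labels and branch lengths; edges
never appear as objects of their own.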

I'm basing my BaseTree classes on the tables in BioSQL's PhyloDB extension (
http://biosql.org/wiki/Extensions) -- which were probably in turn based on
BioPerl's Tree objects, but have at least been given some extra effort
towards generalization. The PhyloDB schema includes an Edge table
definition, among other things.

Question:
The Node objects in PhyloDB have left_idx and right_idx attributes. It looks
like nodes are being kept in a doubly linked list, which seems like
unusually low-level information to keep around since Perl, Python, Ruby and
Java all have flexible array or list types that can keep track of element
order efficiently. Is there a use for these indexes in general phylogenetics
work that couldn't be handled by other language-specific constructs?
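
Another possible reading, which I have not confirmed against the PhyloDB
documentation, is that these are nested-set indexes assigned by a
depth-first traversal, so that ancestry can be tested with two integer
comparisons (handy in SQL). A minimal sketch of that idea follows; the
function names and the (label, children) tuple layout are assumptions made
for illustration only.

```python
def nested_set_indexes(tree):
    """Map each node label to (left_idx, right_idx) by depth-first numbering."""
    indexes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        counter += 1
        left = counter  # assigned on the way down
        for child in children:
            visit(child)
        counter += 1
        indexes[label] = (left, counter)  # right index assigned on the way up

    visit(tree)
    return indexes


def is_descendant(indexes, node, ancestor):
    """True if node's interval lies strictly inside ancestor's interval."""
    left, right = indexes[node]
    a_left, a_right = indexes[ancestor]
    return a_left < left and right < a_right


tree = ("root", [("AB", [("A", []), ("B", [])]), ("C", [])])
idx = nested_set_indexes(tree)
```

If that is the intent, the indexes are less about element order and more
about answering "is X inside clade Y?" without walking the tree.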

>> In this scrap
>> http://www.bioperl.org/wiki/Finding_all_clades_represented_in_a_tree
>> I defined a clade as a "maximal set of leaf/tip taxa descended from a
>> given single node", because that's really what the question poser wanted.
>> You might expand that definition to include all branches and nodes between
>> the "given node" and the tips. That would be synonymous with "subtree".
>>
>
> Yes, but some define clade slightly differently:
>
> http://en.wikipedia.org/wiki/Cladistics#Three_definitions_of_clade
>

Helpful! It looks like phyloXML's interpretation is "branch-based". Note
that in the spec, the Phylogeny element that the various Bio* projects have
interpreted as the Tree type is defined to have exactly one Clade attribute
-- presumably the root node of the tree. I'm not sure how to interpret a
branch_length value for that clade; maybe it should be ignored or
disallowed.

>> I think I see the utility of a clade as an annotation entity: one wants to
>> grant properties to subtrees ("Mammalia", e.g.).
>>
>
The Clade node does have most of the important annotation types as its
children -- Taxonomy, Sequence, Events, etc. Given how Nexus trees often
label nodes with taxon names, the nearest phyloXML equivalent to a Node type
might be Taxonomy. But in phyloXML, all of the Clade attributes and
annotations apply to the root node, and potentially all sub-clades and
sub-nodes that don't override this information. I don't think I'd map the
basic Node type to anything but Clade for this reason.

>> A "Node" (in BioPerl, or standard phylogenetics) can be *mapped* to a clade,
>> or used to obtain a clade, *if* the tree is rooted (as Hilmar points out).
>> It seems that for a rooted tree (i.e., where anc->desc relationships are
>> defined), a "Clade" annotation that contained all the desired clade
>> properties could be associated with the Node, because of the one-to-one
>> mapping of nodes to clades in this case. In the case of an unrooted tree, a
>> Clade could also be associated with a node, if the Clade also possessed a
>> direction property. For example, in an unrooted tree, a Clade could be
>> specified by Node + Branches of Node contained in Clade (which would be two
>> of the three branches on an internal node). This would provide the direction
>> of "descent".
>>
>>
The 'rooted' and 'rerootable' attributes belong to Phylogeny, at the top of
the tree. A Clade object should probably have easy access to this
information for use in pruning or rerooting.

This raises some questions about the role of the Phylogeny element --
is-it-really-a Tree? Or simply a wrapper with metadata about all the clades
it contains, containing a single clade which is actually the top of the
phylogenetic tree? In that case it could make sense for each clade to
contain a direct or indirect reference to the phylogeny object, rather than
the other way around. The mind reels. I was more comfortable calling it a
Tree, as the other Bio* projects do, but then I haven't tried to integrate
the Nexus tree classes yet.

Conclusions:

1. A Clade is-a Tree, and also is-a Node for various operations.

2. For reusing base-class methods, a Clade should provide a 'node' attribute
that behaves properly -- in most or all cases, the nodes will be the same
as the list of sub-clades.

3. A Clade also needs to access some attributes of its original Phylogeny.
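
The three conclusions above could be sketched roughly as follows. This is
not the actual BaseTree or PhyloXML code; the class layout, the 'nodes' and
'rooted' properties, and the _set_phylogeny() helper are all hypothetical
names chosen to illustrate the mapping.

```python
class Phylogeny:
    """Wrapper holding tree-level metadata and a single top-level clade."""

    def __init__(self, clade, rooted=False, rerootable=True):
        self.rooted = rooted
        self.rerootable = rerootable
        self.clade = clade
        clade._set_phylogeny(self)  # give every clade a back-reference


class Clade:
    """Acts as a node (name, branch_length) and as a subtree (clades)."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = list(clades or [])
        self.phylogeny = None

    @property
    def nodes(self):
        # Conclusion 2: for base-class methods, child nodes are the sub-clades.
        return self.clades

    @property
    def rooted(self):
        # Conclusion 3: look up tree-level metadata on the owning Phylogeny.
        return self.phylogeny.rooted if self.phylogeny is not None else None

    def _set_phylogeny(self, phylogeny):
        self.phylogeny = phylogeny
        for sub in self.clades:
            sub._set_phylogeny(phylogeny)

    def terminals(self):
        # Conclusion 1: tree-style operations work on any clade.
        if not self.clades:
            return [self]
        found = []
        for sub in self.clades:
            found.extend(sub.terminals())
        return found


a, b = Clade("A", 0.1), Clade("B", 0.2)
top = Clade(clades=[Clade("AB", 0.05, [a, b]), Clade("C", 0.3)])
phy = Phylogeny(top, rooted=True)
tip_names = [tip.name for tip in top.terminals()]
```

One design question this leaves open is whether each clade should hold the
back-reference directly, as here, or reach the Phylogeny through its parent.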

Best regards,
Eric