[Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo

Fri Jan 18 20:20:11 EST 2013

On Fri, Dec 28, 2012 at 10:50 AM, Ben Morris <ben at bendmorris.com> wrote:

> On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> >
> > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris <ben at bendmorris.com> wrote:
> >>
> >> Hi all,
> >>
> >> I've implemented support for two new phylogenetic tree formats: NeXML
> >> and RDF (conforming to the Comparative Data Analysis Ontology).
> >>
> >> I noticed that NeXML support was planned, but I didn't see anyone
> >> working on it on GitHub and the feature request hadn't been updated in
> >> about a year, so I went ahead and implemented a simple version. At first
> >> I tried the generateDS.py approach, but the generated writer doesn't give
> >> very much control over the output, so I ended up writing my own
> >> parser/writer using ElementTree.
> >>
> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported
> >> by any other phylogenetic libraries, so I'm not sure how useful this is
> >> to everyone else. It provides a simple, standards-compliant format that
> >> can be imported to a triple store and supports annotation. We'll be using
> >> it at NESCent so I wanted to make it available to everyone else as well.
> >> The parser and writer require the Redlands Python bindings.
> >>
> >> The code is available in my fork of Biopython,
> >>
> >>     https://github.com/bendmorris/biopython
> >>
> >> under branches "cdao" and "nexml." I'd love to get everyone's thoughts
> >> and see if these contributions would be a good fit for the Biopython
> >> project.
> >
> >
> >
> > Thanks for letting us know! I'll try it out soonish. Looking at the code
> > on your nexml branch, I have a few comments:
> >
> > - The parser uses ElementTree.parse rather than iterparse, so in its
> >   current state it would not be able to parse massive files (those larger
> >   than available RAM). Worth fixing eventually?
>
> Great point. I rewrote it to use iterparse instead.
>
> > - The parser creates Newick.Tree and Newick.Clade objects, which is
> >   nearly correct in my opinion. I would suggest subclassing BaseTree.Tree
> >   and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even
> >   if you don't have any additional attributes to attach to those classes
> >   at the moment. (These would go in a new file NeXML.py, similar to
> >   PhyloXML.py and PhyloXMLIO.py.)
>
> Went ahead and did this as well.
>

Thanks! Sorry for the slow pace of this; I'm in the midst of a dissertation.
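
For anyone following along, the iterparse change amounts to roughly the
pattern below. This is only a minimal standard-library sketch, not the code
on the nexml branch; the function name count_nodes_per_tree and the sample
document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"  # NeXML namespace in ElementTree notation


def count_nodes_per_tree(source):
    """Stream a NeXML document and return {tree id: number of <node> elements}.

    Hypothetical helper for illustration; it assumes <node> elements only
    occur inside <tree> elements.
    """
    counts = {}
    n_nodes = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NEX + "node":
            n_nodes += 1
        elif elem.tag == NEX + "tree":
            counts[elem.get("id")] = n_nodes
            n_nodes = 0
            # Drop the finished tree's children so they can be freed;
            # the emptied <tree> element itself stays attached to its parent.
            elem.clear()
    return counts


# Made-up sample document, just enough structure to exercise the loop.
SAMPLE = b"""<nex:nexml xmlns:nex="http://www.nexml.org/2009" version="0.9">
  <nex:trees id="trees1">
    <nex:tree id="t1">
      <nex:node id="n1"/><nex:node id="n2"/><nex:node id="n3"/>
      <nex:edge id="e1" source="n1" target="n2"/>
      <nex:edge id="e2" source="n1" target="n3"/>
    </nex:tree>
    <nex:tree id="t2">
      <nex:node id="m1"/>
    </nex:tree>
  </nex:trees>
</nex:nexml>"""

print(count_nodes_per_tree(io.BytesIO(SAMPLE)))
```

The point is that only one tree's worth of elements needs to be held in
memory at a time, rather than the whole document as with ElementTree.parse.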

> > - The 'confidence' or 'confidences' attribute isn't used (e.g. for
> > bootstrap support values). Does NeXML define it?
>
> Not that I'm aware of, but I'm not sure. I searched
> http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything.
> I'm going to ask some people who know more about this than I do.
>

I would like for Bio.Phylo's I/O modules to be able to successfully
round-trip a file from Newick to phyloXML to NeXML and back to Newick
without losing support values. I found these two examples of how to add
this data to a NeXML document by referencing CDAO:
https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_using_the_.22meta.22_tag
https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_without_new_tags_or_elements

That's the standard way to store bootstrap supports in NeXML (Hilmar
confirms). How do your NeXML and CDAO modules interact, if at all? Would
the CDAO modules be useful to properly support NeXML metadata like
support/confidence values, or would it be simpler to just hard-code the few
tags we're specifically interested in?
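
To make the hard-coding option concrete, here is a minimal sketch using only
ElementTree. It is an illustration under assumptions, not a description of
either module: the property name "cdao:has_Support_Value", the placement of
<meta> directly under <node>, and the helper name read_support_values are
all my guesses and have not been checked against the test files above.

```python
import io
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"  # NeXML namespace in ElementTree notation

# Assumption: the one annotation we care about is identified by this
# property string on a <meta> element. Adjust to whatever the files use.
SUPPORT_PROPERTY = "cdao:has_Support_Value"


def read_support_values(source):
    """Return {node id: support value} for nodes carrying the assumed tag.

    Hypothetical helper; uses ElementTree.parse for brevity, so it loads
    the whole document into memory.
    """
    supports = {}
    root = ET.parse(source).getroot()
    for node in root.iter(NEX + "node"):
        for meta in node.findall(NEX + "meta"):
            if meta.get("property") == SUPPORT_PROPERTY:
                supports[node.get("id")] = float(meta.get("content"))
    return supports


# Made-up sample document; only n2 carries a support value.
SAMPLE = b"""<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <trees id="trees1">
    <tree id="t1">
      <node id="n1"/>
      <node id="n2">
        <meta id="meta1" property="cdao:has_Support_Value"
              datatype="xsd:float" content="0.95"/>
      </node>
      <edge id="e1" source="n1" target="n2"/>
    </tree>
  </trees>
</nexml>"""

print(read_support_values(io.BytesIO(SAMPLE)))
```

Values read this way could then be copied onto each clade's confidence
attribute, which is what the other Bio.Phylo formats would need for the
round trip.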

Relatedly, those look like good test files. I see you've started writing
NeXML unit tests already; if you would like help with any of this, just let
me know.

-Eric