[Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo

Mon Feb 4 15:17:36 UTC 2013

On Fri, Jan 18, 2013 at 8:20 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Fri, Dec 28, 2012 at 10:50 AM, Ben Morris <ben at bendmorris.com> wrote:
>>
>> On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> >
>> > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris <ben at bendmorris.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I've implemented support for two new phylogenetic tree formats: NeXML and
>> >> RDF (conforming to the Comparative Data Analysis Ontology).
>> >>
>> >> I noticed that NeXML support was planned, but I didn't see anyone working
>> >> on it on GitHub and the feature request hadn't been updated in about a
>> >> year, so I went ahead and implemented a simple version. At first I tried
>> >> the generateDS.py approach, but the generated writer doesn't give very much
>> >> control over the output, so I ended up writing my own parser/writer using
>> >> ElementTree.
>> >>
>> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported by
>> >> any other phylogenetic libraries, so I'm not sure how useful this is to
>> >> everyone else. It provides a simple, standards-compliant format that can be
>> >> imported to a triple store and supports annotation. We'll be using it at
>> >> NESCent so I wanted to make it available to everyone else as well. The
>> >> parser and writer require the Redland Python bindings.
>> >>
>> >> The code is available in my fork of Biopython,
>> >>
>> >>     https://github.com/bendmorris/biopython
>> >>
>> >> under branches "cdao" and "nexml." I'd love to get everyone's thoughts and
>> >> see if these contributions would be a good fit for the Biopython project.
>> >
>> >
>> >
>> > Thanks for letting us know! I'll try it out soonish. Looking at the code
>> > on your nexml branch, I have a few comments:
>> >
>> > - The parser uses ElementTree.parse rather than iterparse, so in its
>> > current state it would not be able to parse massive files (those larger than
>> > available RAM). Worth fixing eventually?
>>
>> Great point. I rewrote it to use iterparse instead.
>>
>> > - The parser creates Newick.Tree and Newick.Clade objects, which is
>> > nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and
>> > BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you
>> > don't have any additional attributes to attach to those classes at the
>> > moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and
>> > PhyloXMLIO.py.)
>>
>> Went ahead and did this as well.
>
>
> Thanks! Sorry for the pace of this, I'm in the midst of a dissertation.
>
>
>> > - The 'confidence' or 'confidences' attribute isn't used (e.g. for
>> > bootstrap support values). Does NeXML define it?
>>
>> Not that I'm aware of, but I'm not sure. I searched
>> http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything.
>> I'm going to ask some people who know more about this than I do.
>
>
> I would like for Bio.Phylo's I/O modules to be able to successfully
> round-trip a file from Newick to phyloXML to NeXML and back to Newick
> without losing support values. I found these two examples of how to add this
> data to a NeXML document by referencing CDAO:
> https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_using_the_.22meta.22_tag
> https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_without_new_tags_or_elements
>
> That's the standard way to store bootstrap supports in NeXML (Hilmar
> confirms). How do your NeXML and CDAO modules interact, if at all? Would the
> CDAO modules be useful to properly support NeXML metadata like
> support/confidence values, or would it be simpler to just hard-code the few
> tags we're specifically interested in?
>
> Relatedly, those look like good test files. I see you've started writing
> NeXML unit tests already; if you would like help with any of this, just let
> me know.
>
> -Eric

No worries! I just returned from a NESCent-sponsored hackathon where
we used Biopython as part of a Virtuoso-backed RDF treestore
(https://github.com/phylotastic/rdf-treestore). Now that I'm back,
I'll work on the bootstrap support values and annotations for NeXML as
I have time.

I think it's probably much easier to just hard-code specific tags for
now. The CDAO module can convert the more readable CDAO prefix names
to OBO numeric identifiers (e.g. cdao:has_Root -> obo:CDAO_0000148)
but other than that I don't see a good way for them to interact.
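
A minimal sketch of that prefix conversion, assuming nothing more than a
plain lookup table. The only mapping taken from this message is
cdao:has_Root -> obo:CDAO_0000148; the table name, the function names and
the pass-through behaviour for unknown terms are hypothetical and are not
the CDAO module's actual API.

```python
# Hypothetical lookup table. Only this single entry comes from the message
# above; a real table would be filled in from the CDAO ontology itself.
CDAO_TO_OBO = {
    "cdao:has_Root": "obo:CDAO_0000148",
}


def to_obo(term, table=CDAO_TO_OBO):
    """Return the OBO numeric identifier for a readable CDAO term.

    Terms that are not in the table are returned unchanged, so identifiers
    from other vocabularies pass through untouched.
    """
    return table.get(term, term)


def to_readable(identifier, table=CDAO_TO_OBO):
    """Reverse lookup: OBO numeric identifier back to the readable name."""
    reverse = {obo: name for name, obo in table.items()}
    return reverse.get(identifier, identifier)
```

Hard-coding a handful of tags would amount to keeping a table like this
short and explicit, rather than resolving terms through an RDF library.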

I gave a short demo of Bio.Phylo at the hackathon, and people were
very impressed. We had some issues with Newick and Nexus parsing as
well, so I'll open issues on the bug tracker.

~Ben
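
For reference, the ElementTree.parse versus iterparse point raised earlier
in the thread can be sketched roughly as follows. This is only an
illustration of the streaming technique using the standard library, not
the code on the nexml branch; the toy document, its element names and the
iter_trees helper are assumptions made up for the example and do not
follow the NeXML schema.

```python
import io
import xml.etree.ElementTree as ET

# Toy document; the element names are invented for illustration and are
# not taken from the NeXML schema.
SAMPLE = b"""<trees>
  <tree id="t1"><node id="n1"/><node id="n2"/></tree>
  <tree id="t2"><node id="n3"/></tree>
</trees>"""


def iter_trees(handle):
    """Yield (tree id, node count) pairs, one finished tree at a time."""
    # "end" events fire once an element and all of its children are parsed.
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "tree":
            yield elem.get("id"), len(elem.findall("node"))
            # Discard the finished element's contents so that memory use
            # does not grow with the number of trees in the file.
            elem.clear()


summary = list(iter_trees(io.BytesIO(SAMPLE)))
```

With ElementTree.parse the whole document is built in memory before any
tree can be inspected; here each tree is handled and cleared as soon as
its closing tag is read.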