[Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython
Eric Talevich
eric.talevich at gmail.com
Mon Jun 15 13:04:20 EDT 2009
Hi all,
Previously (June 8-12) I:
* Finished writing constructors and XML parsers for Tier 0,1,2 elements
(everything that appears in the example phyloXML files)
* Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence
  class -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence
  entirely will require some more thought
* Wrote a unit test for counting clades/branches (topology check)
* Changed the no-op unit tests to count the total number of tags (nodes) in
the given phyloXML file, keeping stdout clean
* Miscellaneous code cleanup
* Added a few magic methods to make usage easier: __len__, __iter__,
__str__
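To show what those magic methods buy us, here is a toy sketch of the idea (the class and attribute names below are illustrative only, not the actual Bio.PhyloXML code):

```python
class Clade:
    """Illustrative stand-in for a phyloXML clade node."""

    def __init__(self, name=None, clades=None):
        self.name = name
        self.clades = clades or []  # child clades (the recursive part)

    def __len__(self):
        # Number of immediate child clades
        return len(self.clades)

    def __iter__(self):
        # Iterate over the direct children
        return iter(self.clades)

    def __str__(self):
        return self.name or self.__class__.__name__


root = Clade("root", [Clade("A"), Clade("B", [Clade("C")])])
print(len(root))               # 2
print([str(c) for c in root])  # ['A', 'B']
```

With these in place, a tree reads like any other Python container: `len()` and `for` loops just work.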
This week (June 15-19) I will:
* Finish unit tests for parsing and instantiating core elements
* Test parser performance against BioPerl and Archaeopteryx loading
times
* Document results of parser testing and performance (on wiki or here)
* Document basic usage of the parser on the Biopython wiki
Thoughts:
* Test-driven development kind of went out the window this week.
  Implementing each new class is pretty short now that I'm using the
  from_element class methods consistently, so I just charged ahead rather
  than writing tests for each class first. I checked the more complicated
  classes in the REPL, but didn't copy that code into the test script...
  shameful. There are a couple of bugs I know of already but haven't
  fixed, so catching up there will be the bulk of the effort this week.
* The unit tests I do have in place give some sense of memory and CPU
  usage. For the full NCBI taxonomy, memory usage climbs above 2 GB with
  the read() function, which isn't a problem on this workstation but
  could be for others.
* For biopython-dev, a summary of the parsing strategy:
There are two top-level functions, read() and parse(), which behave
according to convention. Both use ElementTree's iterparse() function to
keep memory usage down (if used properly) and enable streaming data from
other sources.
The structure of the XML file looks like:

<phyloxml>
  <phylogeny>
    <clade> ... (recursive)
  <phylogeny> ... (can have several trees)
  <something_completely_different> (optional, arbitrary tags)
  ...
</phyloxml>
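A minimal sketch of the iterparse() approach on input shaped like the above -- this is not the actual Bio.PhyloXML code, just the bare pattern of streaming with bounded memory:

```python
from io import BytesIO
from xml.etree import ElementTree

XML = b"""<phyloxml>
  <phylogeny><clade><name>A</name></clade></phylogeny>
  <phylogeny><clade><name>B</name></clade></phylogeny>
</phyloxml>"""


def iter_phylogenies(handle):
    # Fire an event at each closing tag; a completed <phylogeny> subtree
    # is yielded to the caller and then cleared to free its memory.
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == "phylogeny":
            yield elem
            elem.clear()


names = [e.findtext("clade/name") for e in iter_phylogenies(BytesIO(XML))]
print(names)  # ['A', 'B']
```

The elem.clear() call after each yield is what keeps memory flat on big files like the NCBI taxonomy, since completed subtrees are discarded instead of accumulating under the root.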
The read() function returns all of this as a single Python object with
two attributes: phylogenies[] and other[]. parse() ignores the "other"
stuff and just iterates through the "phylogeny" trees, so it should be
handy if you're not concerned with the extra arbitrary data that may
appear after the trees.
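To make the read()/parse() split concrete, here is a toy version of the two conventions -- the container class and tag names are made up for illustration, not the real module layout:

```python
from io import BytesIO
from xml.etree import ElementTree

XML = b"""<phyloxml>
  <phylogeny><clade/></phylogeny>
  <phylogeny><clade/></phylogeny>
  <align>extra, arbitrary data</align>
</phyloxml>"""


class Phyloxml:
    """Toy container with the two attributes described above."""

    def __init__(self, phylogenies, other):
        self.phylogenies = phylogenies
        self.other = other


def read(handle):
    # Return the whole document as one object
    root = ElementTree.parse(handle).getroot()
    phylogenies = [e for e in root if e.tag == "phylogeny"]
    other = [e for e in root if e.tag != "phylogeny"]
    return Phyloxml(phylogenies, other)


def parse(handle):
    # Iterate over the trees only, skipping the arbitrary extra data
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == "phylogeny":
            yield elem


doc = read(BytesIO(XML))
print(len(doc.phylogenies), len(doc.other))  # 2 1
print(sum(1 for _ in parse(BytesIO(XML))))   # 2
```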
I have two more functions for parsing phylogeny and clade objects that
track the current context of the XML parser, and clear elements after
they're completed. Then all other tags are dispatched to the
corresponding classes, via from_element() methods attached to each
class, or else built-in constructors for primitive types like int,
float, str. The from_element() class methods take an
ElementTree.Element object, deal with it, and pass any child nodes for
complex types to the corresponding class's from_element() method. The
only recursive element is Clade, which is treated specially, so there's
nothing scary going on with the stack.
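An illustrative sketch of that from_element() dispatch pattern -- the class names and tags here are invented for the example, not the real Bio.PhyloXML classes:

```python
from xml.etree import ElementTree


class Confidence:
    def __init__(self, value, kind):
        self.value = value
        self.kind = kind

    @classmethod
    def from_element(cls, elem):
        # Primitive content is handled with built-in constructors
        return cls(float(elem.text), elem.get("type"))


class Clade:
    def __init__(self, name=None, confidence=None, clades=()):
        self.name = name
        self.confidence = confidence
        self.clades = list(clades)

    @classmethod
    def from_element(cls, elem):
        # Complex child nodes are delegated to their class's
        # from_element(); Clade itself is the only recursive case.
        conf = elem.find("confidence")
        return cls(
            name=elem.findtext("name"),
            confidence=Confidence.from_element(conf) if conf is not None else None,
            clades=[cls.from_element(e) for e in elem.findall("clade")],
        )


xml = ("<clade><name>root</name>"
       "<confidence type='bootstrap'>0.95</confidence>"
       "<clade><name>leaf</name></clade></clade>")
root = Clade.from_element(ElementTree.fromstring(xml))
print(root.name, root.confidence.value, len(root.clades))  # root 0.95 1
```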
I'm open to suggestions for reorganizing this to make Nexus/Newick
integration more feasible. Optimization strategies are also a good topic
this week. A few weeks later in my project plan I'm also scheduled to
implement the rest of the magic methods, so we should discuss the
appropriate amount and types of magic to add, too -- the showcase for
this right now is Tests/test_PhyloXML.
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML