[Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython
Eric Talevich
eric.talevich at gmail.com
Mon Jun 15 13:04:20 EDT 2009
Hi all,
Previously (June 8-12) I:
* Finished writing constructors and XML parsers for Tier 0,1,2 elements
(everything that appears in the example phyloXML files)
* Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence
  class -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence
  entirely will require some more thought
* Wrote a unit test for counting clades/branches (topology check)
* Changed the no-op unit tests to count the total number of tags (nodes) in
the given phyloXML file, keeping stdout clean
* Miscellaneous code cleanup
* Added a few magic methods to make usage easier: __len__, __iter__,
__str__
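To show what those magic methods buy us, here is a toy sketch of the idea (the class and attribute names below are illustrative only, not the actual Bio.PhyloXML code):

```python
class Clade:
    """Illustrative stand-in for a phyloXML clade node."""

    def __init__(self, name=None, clades=None):
        self.name = name
        self.clades = clades or []  # child clades (the recursive part)

    def __len__(self):
        # Number of immediate child clades
        return len(self.clades)

    def __iter__(self):
        # Iterate over the direct children
        return iter(self.clades)

    def __str__(self):
        return self.name or self.__class__.__name__


root = Clade("root", [Clade("A"), Clade("B", [Clade("C")])])
print(len(root))               # 2
print([str(c) for c in root])  # ['A', 'B']
```

With these in place, a tree reads like any other Python container: `len()` and `for` loops just work.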
This week (June 15-19) I will:
* Finish unit tests for parsing and instantiating core elements
* Test parser performance against BioPerl and Archaeopteryx loading
times
* Document results of parser testing and performance (on wiki or here)
* Document basic usage of the parser on the Biopython wiki
Thoughts:
* Test-driven development kind of went out the window this week.
  Implementing each new class is pretty short now that I'm using the
  from_element class methods consistently, so I just charged ahead rather
  than writing tests for each class first. I checked the more complicated
  classes in the REPL, but didn't copy that code into the test script...
  shameful. There are a couple of bugs I know of already but haven't
  fixed, so catching up there will be the bulk of the effort this week.
* The unit tests I do have in place give some sense of memory and CPU
  usage. For the full NCBI taxonomy, memory usage climbs above 2 GB with
  the read() function, which isn't a problem on this workstation but
  could be for others.
* For biopython-dev, a summary of the parsing strategy:
There are two top-level functions, read() and parse(), which behave
according to convention. Both use ElementTree's iterparse() function to
keep memory usage down (if used properly) and enable streaming data from
other sources.
The structure of the XML file looks like:

<phyloxml>
  <phylogeny>
    <clade> ... (recursive)
  <phylogeny> ... (can have several trees)
  <something_completely_different> (optional, arbitrary tags)
  ...
</phyloxml>
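A minimal sketch of the iterparse() approach on input shaped like the above -- this is not the actual Bio.PhyloXML code, just the bare pattern of streaming with bounded memory:

```python
from io import BytesIO
from xml.etree import ElementTree

XML = b"""<phyloxml>
  <phylogeny><clade><name>A</name></clade></phylogeny>
  <phylogeny><clade><name>B</name></clade></phylogeny>
</phyloxml>"""


def iter_phylogenies(handle):
    # Fire an event at each closing tag; a completed <phylogeny> subtree
    # is yielded to the caller and then cleared to free its memory.
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == "phylogeny":
            yield elem
            elem.clear()


names = [e.findtext("clade/name") for e in iter_phylogenies(BytesIO(XML))]
print(names)  # ['A', 'B']
```

The elem.clear() call after each yield is what keeps memory flat on big files like the NCBI taxonomy, since completed subtrees are discarded instead of accumulating under the root.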
The read() function returns all of this as a single Python object with
two attributes: phylogenies[] and other[]. parse() ignores the "other"
stuff and just iterates through the "phylogeny" trees, so it should be
handy if you're not concerned with the extra arbitrary data that may
appear after the trees.
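To make the read()/parse() split concrete, here is a toy version of the two conventions -- the container class and tag names are made up for illustration, not the real module layout:

```python
from io import BytesIO
from xml.etree import ElementTree

XML = b"""<phyloxml>
  <phylogeny><clade/></phylogeny>
  <phylogeny><clade/></phylogeny>
  <align>extra, arbitrary data</align>
</phyloxml>"""


class Phyloxml:
    """Toy container with the two attributes described above."""

    def __init__(self, phylogenies, other):
        self.phylogenies = phylogenies
        self.other = other


def read(handle):
    # Return the whole document as one object
    root = ElementTree.parse(handle).getroot()
    phylogenies = [e for e in root if e.tag == "phylogeny"]
    other = [e for e in root if e.tag != "phylogeny"]
    return Phyloxml(phylogenies, other)


def parse(handle):
    # Iterate over the trees only, skipping the arbitrary extra data
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == "phylogeny":
            yield elem


doc = read(BytesIO(XML))
print(len(doc.phylogenies), len(doc.other))  # 2 1
print(sum(1 for _ in parse(BytesIO(XML))))   # 2
```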
I have two more functions for parsing phylogeny and clade objects that
track the current context of the XML parser, and clear elements after
they're completed. Then all other tags are dispatched to the
corresponding classes, via from_element() methods attached to each
class, or else built-in constructors for primitive types like int,
float, str. The from_element() class methods take an
ElementTree.Element object, deal with it, and pass any child nodes for
complex types to the corresponding class's from_element() method. The
only recursive element is Clade, which is treated specially, so there's
nothing scary going on with the stack.
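An illustrative sketch of that from_element() dispatch pattern -- the class names and tags here are invented for the example, not the real Bio.PhyloXML classes:

```python
from xml.etree import ElementTree


class Confidence:
    def __init__(self, value, kind):
        self.value = value
        self.kind = kind

    @classmethod
    def from_element(cls, elem):
        # Primitive content is handled with built-in constructors
        return cls(float(elem.text), elem.get("type"))


class Clade:
    def __init__(self, name=None, confidence=None, clades=()):
        self.name = name
        self.confidence = confidence
        self.clades = list(clades)

    @classmethod
    def from_element(cls, elem):
        # Complex child nodes are delegated to their class's
        # from_element(); Clade itself is the only recursive case.
        conf = elem.find("confidence")
        return cls(
            name=elem.findtext("name"),
            confidence=Confidence.from_element(conf) if conf is not None else None,
            clades=[cls.from_element(e) for e in elem.findall("clade")],
        )


xml = ("<clade><name>root</name>"
       "<confidence type='bootstrap'>0.95</confidence>"
       "<clade><name>leaf</name></clade></clade>")
root = Clade.from_element(ElementTree.fromstring(xml))
print(root.name, root.confidence.value, len(root.clades))  # root 0.95 1
```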
I'm open to suggestions for reorganizing this to make Nexus/Newick
integration more feasible. Optimization strategies are also a good topic
this week. A few weeks later in my project plan I'm also scheduled to
implement the rest of the magic methods, so we should discuss the
appropriate amount and types of magic to add, too -- the showcase for
this right now is Tests/test_PhyloXML.
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML