[Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython

Mon Jun 22 16:14:19 UTC 2009

Hi folks,

Previously (June 15-19) I:

    * Wrote a pretty-printer for displaying a summary of the parsed tree
      structure
    * Made all existing unit tests pass
    * Started unit tests for instantiation of each phyloXML object
    * Profiled the parser and utilities using the cProfile module on the unit
      test suite. Summarized findings on the Biopython mailing list (nothing
      exciting was discovered)
    * Used a custom warning type to indicate noncompliance with the PhyloXML
      spec
    * Separated parsing code (Parser.py) from the phyloXML class definitions
      (Tree.py) -- this should make Nexus/Newick compatibility feasible
    * Improved the conversion from PhyloXML.Sequence to Bio.SeqRecord, making
      better use of annotations and using SeqFeature objects to represent
      protein domains
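
To illustrate the custom warning type mentioned above, here is a minimal
sketch. It is not the code in the branch: the names PhyloXMLWarning and
get_required are hypothetical, and the "required attribute" check only stands
in for whatever spec rules the parser actually enforces.

```python
import warnings


class PhyloXMLWarning(Warning):
    """Hypothetical category for input that deviates from the phyloXML spec."""


def get_required(attrib, name, default=None):
    """Return attrib[name]; warn (rather than fail) if the key is missing."""
    if name not in attrib:
        warnings.warn("missing required attribute %r" % name, PhyloXMLWarning)
        return default
    return attrib[name]
```

A dedicated category lets callers choose how strict to be, for example
warnings.simplefilter("error", PhyloXMLWarning) to turn noncompliance into
exceptions, or "ignore" to silence it.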

This week (June 22-26) I will:

    Work on the backlog:
    * Finish unit tests for parsing and instantiating core elements
    * Compare parser performance with BioPerl and Archaeopteryx
    * Document results of parser testing and performance (on wiki or here)
    * Document basic usage and performance characteristics of the parser on
      the Biopython wiki

    Then, serialize phyloXML trees and write back to file:
    * Write unit tests for serialization
    * Write serialization methods for each class
    * Write a top-level function for triggering serialization of the whole
      hierarchy
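
The serialization plan above (one method per class, plus a top-level function
that walks the whole hierarchy) could look roughly like the sketch below. This
is only an assumption about the eventual design: Clade, Phylogeny, to_element
and write are illustrative names, only a few phyloXML elements are covered, and
the XML namespace is left out for brevity.

```python
import io
import xml.etree.ElementTree as ET


class Clade:
    """Hypothetical tree node: optional name, branch length and sub-clades."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = clades if clades is not None else []

    def to_element(self):
        """Build this clade's XML element, recursing into its sub-clades."""
        elem = ET.Element("clade")
        if self.name is not None:
            ET.SubElement(elem, "name").text = self.name
        if self.branch_length is not None:
            ET.SubElement(elem, "branch_length").text = str(self.branch_length)
        for child in self.clades:
            elem.append(child.to_element())
        return elem


class Phylogeny:
    """Hypothetical container for one tree and its 'rooted' flag."""

    def __init__(self, root, rooted=True):
        self.root = root
        self.rooted = rooted

    def to_element(self):
        elem = ET.Element("phylogeny", rooted=str(self.rooted).lower())
        elem.append(self.root.to_element())
        return elem


def write(phylogenies, handle):
    """Top-level entry point: put every tree under one <phyloxml> root."""
    root = ET.Element("phyloxml")
    for phylogeny in phylogenies:
        root.append(phylogeny.to_element())
    # encoding="unicode" writes text, so handle must be a text-mode file
    ET.ElementTree(root).write(handle, encoding="unicode")
```

Keeping to_element on each class means the top-level function never needs to
know about individual element types; it only triggers the recursion.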

Question:

Biopython has a couple of core objects that I'm reusing in my project. There
was a quirk in these classes (related to default argument values being shared
between objects:
http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm)
that made the objects slightly more awkward to instantiate, but the issues
were recently fixed upstream. I'd like to merge these fixes soon.
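
For readers who have not run into it, the behaviour described in the linked FAQ
can be reproduced with a short sketch. The class names here are made up for
illustration and are not the Biopython classes in question.

```python
class SharedDefault:
    """Problematic: the default list is created once, at definition time."""

    def __init__(self, features=[]):
        self.features = features


class FreshDefault:
    """Usual fix: use None as a sentinel and build a new list per instance."""

    def __init__(self, features=None):
        self.features = features if features is not None else []


a, b = SharedDefault(), SharedDefault()
a.features.append("domain")
shared_leak = b.features  # b sees a's change: both hold the same list object

c, d = FreshDefault(), FreshDefault()
c.features.append("domain")
fresh_ok = d.features  # d is unaffected: each instance got its own list
```

Without the fix, the only safe way to instantiate such a class is to pass every
list or dict argument explicitly, which is the kind of awkwardness mentioned
above.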

So, GSoC requires a tarball of the code we write at the end of the summer.
Merging from upstream would bring code that I didn't write into my
development tree -- which I could probably filter out with the right
arguments to git-diff, but nonetheless, my project history would no longer
be entirely clean. Does Google care about this? Or is it safe to go ahead
and pull from the next stable release of Biopython (coming soon)?

Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML