[Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython

Mon Jun 29 12:50:05 EDT 2009

Hi folks,

Previously (June 22--26) I:

    - Wrote unit tests for:
        - Instantiation of all implemented elements (properly)
        - Serialization to an output stream -- reusing the parser tests
    - Made all the unit tests pass
    - Tweaked for performance: parsing takes about 1/3 less CPU time now
    - Started Writer.py, with some imports, a class called Writer, and a
      top-level function for triggering serialization of the whole hierarchy
      (just a wrapper for ElementTree.write())
    - Added __str__ and __repr__ methods to the base class (used in
      pretty-printing)
    - Added the method to_rgb() to class BranchColor. It builds a 24-bit hex
      color string that can be used directly in HTML/CSS. Just something
      completely different...
    - Pulled from the biopython trunk
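
As a rough illustration of the to_rgb() idea above (a sketch only: the class
below is a stand-in, and the attribute names red, green and blue are
assumptions, not necessarily what the branch uses):

```python
class BranchColor(object):
    """Stand-in for illustration; not the actual PhyloXML class."""

    def __init__(self, red, green, blue):
        # Assumed attribute names; each channel must fit in 8 bits.
        for name, value in (("red", red), ("green", green), ("blue", blue)):
            if not 0 <= value <= 255:
                raise ValueError("%s must be in 0..255, got %r" % (name, value))
        self.red = red
        self.green = green
        self.blue = blue

    def to_rgb(self):
        """Return a 6-digit (24-bit) hex string, e.g. for use after '#' in CSS."""
        return "%02x%02x%02x" % (self.red, self.green, self.blue)


color = BranchColor(255, 0, 128)
css_value = "#" + color.to_rgb()
```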

This week (June 29--July 3) I will:

    - Write serialization methods for each class, matching Parser
    - Catch up on documentation (on the Biopython wiki):
        - Explain use cases
        - Basic usage of the parser
        - Provide guidance on parser performance (parse() is ~4x faster;
          compare to BioPerl and Archaeopteryx)

Performance:
The normal test suite running on apaf.xml, bcl_2.xml, phyloxml_examples.xml
and ncbi_taxonomy_mollusca.xml.zip takes about 5 seconds; adding in
ncbi_taxonomy_metazoa.xml.zip and the full ncbi_taxonomy.xml.zip to the
utilities tests requires 256 seconds (parsing and pretty-printing), and just
parsing all six files without pretty-printing or counting tags takes a total
of 186 seconds.

The Python process creeps up to 1.6 GB while parsing all six files, but stays
under 40 MB during the unit tests on the four more reasonably-sized files.
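
For anyone who wants to try a comparable measurement, here is a hedged sketch
of timing a tag count over a zipped XML file with the standard library alone.
It is not the utilities test code from the branch; count_tags() and
time_zipped_xml() are hypothetical helper names, and the sketch assumes the
first member of the archive is the XML file.

```python
import time
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(stream):
    """Count element tags incrementally instead of building a whole tree."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's text, attributes and children once counted.
        elem.clear()
    return counts


def time_zipped_xml(zip_source):
    """Return (tag counts, CPU seconds) for the first file in a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name) as stream:
            start = time.process_time()
            counts = count_tags(stream)
            elapsed = time.process_time() - start
    return counts, elapsed
```

Usage would be along the lines of
time_zipped_xml("ncbi_taxonomy_mollusca.xml.zip"), which returns the per-tag
counts and the CPU time spent.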

Scheduling:
The code for serializing to XML was supposed to be written last week. It was
not, but I do have comprehensive tests written for it (abusing the unittest
framework to re-run the original parser tests) and see no obstacles to its
completion this week.
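
The round-trip idea behind those tests can be sketched with plain ElementTree
(an illustration only, not the actual test code; roundtrip() and summarize()
are hypothetical helpers and the sample document is made up):

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature document, not one of the real test files.
SAMPLE = (
    b'<phyloxml><phylogeny rooted="true"><clade>'
    b"<clade><name>A</name></clade>"
    b"<clade><name>B</name></clade>"
    b"</clade></phylogeny></phyloxml>"
)


def roundtrip(data):
    """Parse XML bytes, then serialize the tree back to bytes."""
    tree = ET.parse(io.BytesIO(data))
    buf = io.BytesIO()
    tree.write(buf, encoding="utf-8")
    return buf.getvalue()


def summarize(data):
    """Reduce a document to (tag, attributes, text) triples in document order."""
    root = ET.fromstring(data)
    return [(el.tag, dict(el.attrib), (el.text or "").strip())
            for el in root.iter()]


# If writing loses nothing, the re-parsed output has the same summary
# as the input, so the same parser checks can be run on both.
same = summarize(roundtrip(SAMPLE)) == summarize(SAMPLE)
```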

I didn't completely trust the unit tests earlier last week, so I spent some
time making the pretty-printer work properly, and in the process added some
syntactic sugar that was scheduled for later in the project plan. I think
this follows the current Biopython convention:

    dom = ProteinDomain(start=181, end=503, value='WD40')
    str(dom)     # ProteinDomain WD40
    repr(dom)    # ProteinDomain(start=181, end=503, value=WD40)

My plan is that when a phyloXML tree is exported to networkx for display and
other purposes, the str() result will be the label for each node.
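
One way a shared base class could produce that output is sketched below. This
is a guess at the shape, not the code in the branch: the base class name
PhyloElement is an assumption, and the sketch relies on attributes being
listed in the order they were assigned (true for instance dicts in Python
3.7 and later).

```python
class PhyloElement(object):
    """Hypothetical base class; the real one may be named differently."""

    def __repr__(self):
        # List every instance attribute as name=value, in assignment order.
        pairs = ", ".join("%s=%s" % (key, val)
                          for key, val in self.__dict__.items())
        return "%s(%s)" % (self.__class__.__name__, pairs)

    def __str__(self):
        # Short label: class name plus the 'value' attribute, if there is one.
        value = getattr(self, "value", None)
        if value is None:
            return self.__class__.__name__
        return "%s %s" % (self.__class__.__name__, value)


class ProteinDomain(PhyloElement):
    def __init__(self, start, end, value):
        self.start = start
        self.end = end
        self.value = value


dom = ProteinDomain(start=181, end=503, value="WD40")
label = str(dom)      # short form, suitable as a node label
detail = repr(dom)    # longer form listing every attribute
```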

Pulling from upstream:
I intended to pull the tagged 1.51 beta of biopython from github and merge
it into my own code to take advantage of some recent improvements. But I
don't see the 1.51b tag anywhere. Does anyone else know what happened to
that tag? I waited a few hours to see if it would be pushed from CVS
automatically, but no luck, so I pulled from a plausible point during the
lull after Peter's CVS-freeze announcement.

Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML