[Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython
Eric Talevich
eric.talevich at gmail.com
Mon Jun 29 16:50:05 UTC 2009
Hi folks,
Previously (June 22--26) I:
- Wrote unit tests for:
  - Instantiation of all implemented elements (properly)
  - Serialization to an output stream -- reusing the parser tests
- Made all the unit tests pass
- Tweaked for performance: parsing takes about 1/3 less CPU time now
- Started Writer.py, with some imports, a class called Writer, and a
  top-level function for triggering serialization of the whole hierarchy
  (just a wrapper for ElementTree.write())
- Added __str__ and __repr__ methods to the base class (used in
  pretty-printing)
- Added the method to_rgb() to class BranchColor. It builds a 24-bit hex
  string representing the color that can be used from HTML/CSS directly.
  Just something completely different...
- Pulled from the biopython trunk
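To illustrate the to_rgb() item above, here is a minimal sketch of how three
8-bit channels can be packed into a 24-bit hex string. The class below is a
stand-in, not the actual BranchColor code: the attribute names (red, green,
blue), the range check, and the choice to omit the leading '#' are all
assumptions.

```python
class BranchColor:
    """Hypothetical stand-in for BranchColor, holding three 8-bit channels."""

    def __init__(self, red, green, blue):
        # Assumed validation: each channel must fit in 8 bits.
        for name, value in (("red", red), ("green", green), ("blue", blue)):
            if not 0 <= value <= 255:
                raise ValueError("%s must be in 0..255, got %r" % (name, value))
        self.red = red
        self.green = green
        self.blue = blue

    def to_rgb(self):
        """Return a 6-digit hex string (24 bits), without a leading '#'."""
        return "%02x%02x%02x" % (self.red, self.green, self.blue)


orange = BranchColor(255, 128, 0)
css_color = "#" + orange.to_rgb()  # usable as an HTML/CSS color value
```

Zero-padding each channel to two hex digits keeps the string a fixed six
characters, which is what HTML/CSS expects after the '#'.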
This week (June 29--July 3) I will:
- Write serialization methods for each class, matching Parser
- Catch up on documentation (on the Biopython wiki):
  - Explain use cases
  - Basic usage of the parser
  - Provide guidance on parser performance (parse() is ~4x faster;
    compare to BioPerl and Archaeopteryx)
Performance:
The normal test suite running on apaf.xml, bcl_2.xml, phyloxml_examples.xml
and ncbi_taxonomy_mollusca.xml.zip takes about 5 seconds; adding in
ncbi_taxonomy_metazoa.xml.zip and the full ncbi_taxonomy.xml.zip to the
utilities tests requires 256 seconds (parsing and pretty-printing), and just
parsing all six files without pretty-printing or counting tags takes a total
of 186 seconds.
The Python process creeps up to 1.6GB while parsing all six files, but stays
under 40MB during the unit tests on the four more reasonably-sized files.
Scheduling:
The code for serializing to XML was supposed to be written last week. It was
not, but I do have comprehensive tests written for it (abusing the unittest
framework to re-run the original parser tests) and see no obstacles to its
completion this week.
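The trick of re-running the parser tests against serialized output can be
sketched roughly as follows. This is not the project's test code: it uses
plain xml.etree.ElementTree instead of the PhyloXML classes, and the sample
document, the write() wrapper, and the test names are all made up for
illustration. The idea is that a subclass overrides only how the tree is
loaded, so every inherited parser test also checks the round trip.

```python
import io
import unittest
import xml.etree.ElementTree as ET

# Made-up sample document, much smaller than the real test files.
SAMPLE = (
    b'<phyloxml><phylogeny rooted="true">'
    b"<clade><name>A</name><clade><name>B</name></clade></clade>"
    b"</phylogeny></phyloxml>"
)


def write(tree, handle):
    """Hypothetical top-level writer: a thin wrapper for ElementTree.write()."""
    tree.write(handle, encoding="utf-8", xml_declaration=True)


class ParseTests(unittest.TestCase):
    """Checks on a tree parsed directly from the source document."""

    def load(self):
        return ET.parse(io.BytesIO(SAMPLE))

    def test_names(self):
        names = [el.text for el in self.load().iter("name")]
        self.assertEqual(names, ["A", "B"])

    def test_rooted(self):
        phylogeny = self.load().find("phylogeny")
        self.assertEqual(phylogeny.get("rooted"), "true")


class RoundTripTests(ParseTests):
    """Re-runs every ParseTests check on a parse -> write -> parse copy."""

    def load(self):
        buf = io.BytesIO()
        write(ET.parse(io.BytesIO(SAMPLE)), buf)
        buf.seek(0)
        return ET.parse(buf)
```

Saved as a module, both classes can be run with `python -m unittest`; any
element the writer drops or mangles makes the inherited test fail.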
I didn't completely trust the unit tests earlier last week, so I spent some
time making the pretty-printer work properly, and in the process added some
syntactic sugar that was scheduled for later in the project plan. I think
this follows the current Biopython convention:
bc = ProteinDomain(start=181, end=503, value='WD40')
str(bc) # ProteinDomain WD40
repr(bc) # ProteinDomain(start=181, end=503, value=WD40)
My plan is that when a phyloXML tree is exported to networkx for display and
other purposes, the str() result will be the label for each node.
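For readers wondering how a shared base class could produce the output shown
above, here is one possible sketch. The base class name (PhyloElement), the
use of __dict__ to collect arguments, and the rule that str() appends the
`value` attribute are assumptions, not the actual implementation. It also
relies on attribute insertion order, which dicts preserve in Python 3.7+.

```python
class PhyloElement:
    """Hypothetical base class providing __repr__ and __str__ for all elements."""

    def __repr__(self):
        # List every attribute that is set, in the order it was assigned.
        args = ", ".join(
            "%s=%s" % (key, val)
            for key, val in self.__dict__.items()
            if val is not None
        )
        return "%s(%s)" % (self.__class__.__name__, args)

    def __str__(self):
        # Short label: class name plus the element's value, if it has one.
        value = getattr(self, "value", None)
        if value is None:
            return self.__class__.__name__
        return "%s %s" % (self.__class__.__name__, value)


class ProteinDomain(PhyloElement):
    def __init__(self, start=None, end=None, value=None):
        self.start = start
        self.end = end
        self.value = value


bc = ProteinDomain(start=181, end=503, value='WD40')
label = str(bc)   # could serve as a node label
detail = repr(bc)
```

Keeping str() short and repr() detailed means the same object can act as a
compact node label while remaining inspectable in an interactive session.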
Pulling from upstream:
I intended to pull the tagged 1.51 beta of biopython from github and merge
it into my own code to take advantage of some recent improvements. But I
don't see the 1.51b tag anywhere. Does anyone else know what happened to
that tag? I waited a few hours to see if it would be pushed from CVS
automatically, but no luck, so I pulled from a plausible point during the
lull after Peter's CVS-freeze announcement.
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML