[BioRuby] Bioruby PhyloXML update 6

Tue Jun 30 01:26:28 UTC 2009

Hi all,

Here is the update for the last week:

• Asked for a code review. I got very good suggestions on what to improve and
how. I implemented some of them this week; the rest will come later.
• Documented the libxml-ruby requirement.
• Documented more of the PhyloXML::Node element.
• Wrote code so that the phyloxml test suite exits if the libxml-ruby library
is not present. (This took me quite a long time to figure out. Eventually I
sent an email to the ruby-talk mailing list and got great help.)
• Created a branch, testbig. In it I created the file test_phyloxml_big.rb and
wrote the method parse_tree_dummy.
• Did code profiling. Discovered that ~99% of the time was spent in
Bio::Tree#parent. Changed the code so that the parser itself keeps track of
the current node in an array. The speed increase was tremendous. Parsing the
mollusca xml (1.5MB of data) went down from 443 seconds to 2 seconds. Parsing
the tree of life xml (45MB of data) took 34 seconds instead of more than 3
hours.
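
To illustrate the idea behind the speed-up, here is a minimal sketch. It is
not the actual parser code: Node, build_tree and the event list are
hypothetical stand-ins for what a streaming XML reader reports at the start
and end of each clade element. The point is that the parent of a new node is
always the last element of the array of open nodes, so there is no need to
ask the tree for it.

```ruby
# Hypothetical stand-in for a tree node (not a BioRuby class).
Node = Struct.new(:name, :children)

# Build a tree from start/end events, keeping the currently open nodes in an
# array that is used as a stack.
def build_tree(events)
  root = nil
  open_nodes = []                        # clades whose end tag is still pending
  events.each do |type, name|
    case type
    when :start
      node = Node.new(name, [])
      if open_nodes.empty?
        root = node                      # first clade seen becomes the root
      else
        open_nodes.last.children << node # parent is the top of the stack
      end
      open_nodes.push(node)
    when :end
      open_nodes.pop                     # clade closed, go back to its parent
    end
  end
  root
end

# Made-up event stream for: root -> (A, B -> (C))
events = [
  [:start, "root"],
  [:start, "A"], [:end, nil],
  [:start, "B"],
  [:start, "C"], [:end, nil],
  [:end, nil],
  [:end, nil]
]
tree = build_tree(events)
```

Each start or end event is handled with one push or pop, so the cost per node
no longer depends on how large the tree already is.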

Plan for next week:

• Continue working on documentation
• Write use cases like phyloxml.each do |tree| end; calculate total
branch lengths? (Any other uses?) Look at the Perl PhyloXML implementation and
port those use cases to BioRuby.
• Add tests for edge cases (and decide what to do with invalid xml files).
• Do some more code profiling (it's fun :) ). But it looks like we are in
pretty good shape.
• Change the organization of classes a bit. Split the code into several
files. Have a module PhyloXML, with a class PhyloXMLParser (in
phyloxml_parser.rb) in it. Have all the phyloxml element classes defined in
the phyloxml_elements.rb file (under the PhyloXML module). Later there will
also be a PhyloXMLWriter class.
• Other tweaks to prepare for the PhyloXML parser deliverable.
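
As a rough idea of what the "total branch lengths" use case could look like,
here is a sketch. The API is not decided yet, so Clade and
total_branch_length are hypothetical names, and a plain array stands in for
the parser object that would yield trees.

```ruby
# Hypothetical stand-in for a parsed clade; the real element classes may differ.
Clade = Struct.new(:name, :branch_length, :children)

# Sum the branch lengths of a clade and all of its descendants.
# A missing branch length (nil) is counted as 0.
def total_branch_length(clade)
  own = clade.branch_length || 0.0
  own + clade.children.sum { |child| total_branch_length(child) }
end

# Made-up data: one tree, root -> (A, B -> (C))
trees = [
  Clade.new("root", nil, [
    Clade.new("A", 0.5, []),
    Clade.new("B", 0.25, [Clade.new("C", 1.0, [])])
  ])
]

# Mirrors the intended phyloxml.each do |tree| ... end usage.
trees.each do |tree|
  puts total_branch_length(tree)
end
```

If the parser ends up exposing each, the loop at the bottom should carry over
almost unchanged.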

Diana