[BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children

Fri Aug 7 20:47:31 UTC 2009

Hi, Diana:

I think your array based solution is fine.

Although, at one point Bioruby's Tree might need to be enhanced, in order to avoid being slowed down by this bfs_shortest_path method. This issue is likely to prohibit (time-wise) many types of algorithms acting on large tree objects.

Christian

________________________________________
From: Diana Jaunzeikare [rozziite at gmail.com]
Sent: Friday, August 07, 2009 10:31 AM
To: Christian Zmasek; Phyloinformatics Group; Pjotr Prins; bioruby at lists.open-bio.org; Naohisa GOTO
Subject: GSOC PhyloXML profiling, bottleneck is Bio::Tree#children

Hi all,

Here is update on Google Summer of Code Bioruby PhyloXML project. I was profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed increase.

With profiling PhyloXML Writer the story is different. It takes 24minutes to write the 1.5MB mollusca taxonomy tree and forever other larger files.  Again the bottleneck is bfs_shortest_path, which is called from Tree#children method. It takes forever to just iterate over all the children nodes.

To solve this I propose to save an array of the children of the node within my PhyloXML::Node (which corresponds to a clade) class. This would also ensure that when a phyloxml file is parsed and then written back, clades would be the same order in the input and output files.

Have a good weekend,

Diana