[BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children

Mon Aug 10 15:00:54 UTC 2009

On Sun, Aug 9, 2009 at 10:58 PM, Naohisa GOTO
<ngoto at gen-info.osaka-u.ac.jp>wrote:

> Hi,
>
> On Fri, 7 Aug 2009 13:31:45 -0400
> Diana Jaunzeikare <rozziite at gmail.com> wrote:
>
> > Hi all,
> >
> > Here is update on Google Summer of Code Bioruby PhyloXML project. I was
> > profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed
> > increase.
> >
> > With profiling PhyloXML Writer the story is different. It takes 24minutes
> to
> > write the 1.5MB mollusca taxonomy tree and forever other larger files.
> > Again the bottleneck is bfs_shortest_path, which is called from
> > Tree#children method. It takes forever to just iterate over all the
> children
> > nodes.
>
> Bio::Tree uses graph algorithm  implemented in the Bio::Pathway class.
> As you say, their implementations are naive and should be rewritten
> in the future.
>
> > To solve this I propose to save an array of the children of the node
> within
> > my PhyloXML::Node (which corresponds to a clade) class. This would also
> > ensure that when a phyloxml file is parsed and then written back, clades
> > would be the same order in the input and output files.
>
> How to maintain the array when modifying the tree, e.g. to add,
> delete, or replace some nodes and edges?
>

I didn't think of this issue. I guess then instead of adding that support in
PhyloXML class I could rewrite Bio::Tree class, thus other projects based on
Bio::Tree class would benefit in the future.

Diana

>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>
> > Have a good weekend,
> >
> > Diana
> >
>