[BioRuby] GSOC: phyloXML for BioRuby: Profiling
Diana Jaunzeikare
rozziite at gmail.com
Sun Jun 28 19:55:44 UTC 2009
Hi all,
I did some profiling of the code.
My system is Ubuntu 9.04, ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600
@2.4GHz
I created test_phyloxml_big.rb test file. It has test_next_tree method which
calls next_tree on the phyloxml file until end of file is reached. Here
follow results on the ncbi_taxonomy_mollusca.xml file which is 1.5MB large
with 5632 external nodes.
It takes around 7.5min to finish test_phyloxml_big.rb test.
(Finished in 443.231507 seconds.
Finished in 457.255576 seconds. )
output of the top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
21222 diana 20 0 29020 25m 1928 R 95 0.9 5:32.92 ruby
So it looks like memory footprint is small ~ 25MBs. CPU usage is 95% (i have
two processors, so it is completely using one of them).
I did the same thing for tol_life_xml but it took forever to finish. (more
than 3 hours)
For curiosity I created a method next_tree_dummy. All it does is to
reader.read from file until it reaches </phylogeny> element.
tree of life xml (file size: 45.1MB) - Finished in 3.177743 seconds.
mollusca xml (1.5MB) - Finished in 0.252993 seconds.
metazoa xml (32.3MB) - Finished in 3.393467 seconds.
I think this shows that libxml is really fast.
I also did profiling with ruby-prof on ncbi mollusca taxonomy file.
Here is partial output:
diana at diana-ubuntu:~/bioruby$ ruby-prof -p graph
test/unit/bio/db/test_phyloxml_big.rb
Loaded suite /usr/bin/ruby-prof
Started
.
Finished in 1345.4039 seconds.
1 tests, 0 assertions, 0 failures, 0 errors
Thread ID: 3084157360
Total Time: 1257.6
[..]
-------------------------------------------------------------------------------
1257.56 0.60 0.00 1256.96
2/2 Bio::TestPhyloXMLBig#test_next_tree
100.00% 0.05% 1257.56 0.60 0.00 1256.96
2 Bio::PhyloXML#next_tree
0.08 0.03 0.00 0.05
8107/16210 Bio::PhyloXML#parse_attributes
0.00 0.00 0.00 0.00
1/243188 String#==
0.00 0.00 0.00 0.00
8104/24322 LibXML::XML::Reader#[]
6.18 0.97 0.00 5.21
48616/48616 Bio::PhyloXML#parse_clade_elements
0.14 0.14 0.00 0.00
48623/97244 LibXML::XML::Reader#read
0.27 0.17 0.00 0.10
48644/875134 Bio::PhyloXML#is_element?
0.11 0.04 0.00 0.07
16206/32442 Class#new
0.04 0.04 0.00 0.00
16206/116034 Kernel#==
0.00 0.00 0.00 0.00
4/72929 Bio::PhyloXML#parse_simple_elements
0.07 0.01 0.00 0.06
8102/8102 Bio::Tree#add_node
0.44 0.31 0.00 0.13
97243/137758 Bio::PhyloXML#is_end_element?
1249.05 0.02 0.00 1249.03
8102/8102 Bio::Tree#parent
0.58 0.07 0.00 0.51
8102/8102 Bio::Tree#add_edge
-----------------------------------------------------------------------------
1249.05 0.02 0.00 1249.03
8102/8102 Bio::PhyloXML#next_tree
99.32% 0.00% 1249.05 0.02 0.00 1249.03
8102 Bio::Tree#parent
1249.03 0.13 0.00 1248.90
8102/8102 Bio::Tree#path
0.00 0.00 0.00 0.00
8102/72975 Array#[]
--------------------------------------------------------------------------------
1249.03 0.13 0.00 1248.90
8102/8102 Bio::Tree#parent
99.32% 0.01% 1249.03 0.13 0.00 1248.90
8102 Bio::Tree#path
0.04 0.01 0.00 0.03
16204/164638052 Hash#[]
1248.82 0.27 0.00 1248.55
8102/8102 Bio::Pathway#bfs_shortest_path
0.03 0.03 0.00 0.00
24306/116034 Kernel#==
0.01 0.01 0.00 0.00
16204/72975 Array#[]
--------------------------------------------------------------------------------
1248.82 0.27 0.00 1248.55
8102/8102 Bio::Tree#path
99.30% 0.02% 1248.82 0.27 0.00 1248.55
8102 Bio::Pathway#bfs_shortest_path
0.26 0.19 0.00 0.07
142736/164638052 Hash#[]
0.07 0.07 0.00 0.00
75419/116034 Kernel#==
1248.18 115.50 0.00 1132.68
8102/8102 Bio::Pathway#breadth_first_search
0.04 0.04 0.00 0.00
67317/67330 Array#unshift
--------------------------------------------------------------------------------
1248.18 115.50 0.00 1132.68
8102/8102 Bio::Pathway#bfs_shortest_path
99.25% 9.18% 1248.18 115.50 0.00 1132.68
8102 Bio::Pathway#breadth_first_search
136.52 92.65 0.00
43.8765785140/164638052 Hash#[]
22.53 22.53 0.00
0.0032900672/32900681 Array#shift
973.59 324.56 0.00
649.0332892570/32892570 Hash#each_key
0.04 0.03 0.00 0.01
24306/98702064 Hash#[]=
[..]
99.32% of the total time is spent in Bio::Tree#parent method and the methods
it calls.
Bio::Tree#parent calls Bio::Tree#path which calls
Bio::Pathways#bfs_shortest_path which in turn calls
Bio::Pathway#breadth_first_search (99.25% of total time is spent in this
method and its sub calls).
This was a huge surprise for me. Why would breadth first search be needed if
I just want to know the parent node of the current node.
The reason I am using Bio::Tree#parent is because I have to keep track of
the current node I am parsing. When I have reached </clade> element i set
the current_node to the parent of the node I just parsed.
I see here two options.
1) Keep track of the current node myself (by putting references in an array
and pushing and poping accordingly). Thus I won't have to call the
Bio::Tree#parent method.
2) Update Bio::Tree/ Bio::Node class so that nodes contain references to
their parents. (thus not needing to call breadth first search).
What do you think?
Diana
More information about the BioRuby
mailing list