[Bioperl-l] SearchIO speed up

Chris Fields cjfields at uiuc.edu
Thu Aug 10 16:10:57 UTC 2006


Sendu,

You've been busy!

I think that anything that speeds up SearchIO would help  
tremendously.  We still need to be wary of API issues, but I have no  
problem whatsoever adding this in.  You might need to give the  
community time to digest these changes before committing to HEAD,  
though.

Here's a couple of  suggestions to get around that if you want to get  
the code out there for testing:

Could this be CVS-tagged to an experimental bioperl branch instead?   
It could be merged back to the main branch once everybody gets to try  
it out, and you could commit changes to the branch (tests, scripts,  
etc) along the way based on suggestions.  Think of this as a test- 
drive for a new Bioperl release.

Alternatively, add the modified modules or patches to Bugzilla for  
people to test out.  You might want to add a script and test data as  
well.  These could be updated along the way much like CVS, but  
probably a lot noisier.

I agree with the thought of retaining some degree of OO.  I still  
wonder how much object instantiation really affects speed vs. all  
those method calls.

Chris

On Aug 10, 2006, at 10:19 AM, Sendu Bala wrote:

> I am aiming to solve Project priority list item 1.2.1 "Improve
> Bio::SearchIO speed...".
>
> I have made changes that result in a certain speed improvement (up to
> 5x) in parsing BLAST results (and most/all? other SearchIO  
> parsers). The
> changes made were as minimal as possible. I did not touch the parsing
> code itself, did not make any API changes except for gain-of-function,
> and did not make any behaviour changes (except for speed!).
> For this reason I don't think there would be any contention about the
> changes so I will just commit them...
>
> ...Except I need to know if the community considers the speed problem
> solved or not. More radical changes will make SearchIO even faster,  
> eg.
> Chris Fields and Jason (if I interpret the Project priority list item
> correctly) have suggested an end to individual Hit and HSP objects,
> which become just data members of a Result-like object. Ideally I  
> don't
> want to go down that route because we lose quite a bit of OO power;  
> HSP
> objects in particular make important use of inheritance and we'd  
> end up
> stuffing tens of duplicated-code methods into the Result object
> replacement. Ugh. Or having the Result-like object implement 4  
> different
> interfaces. Ugh. That said, judging by the results below, having  
> hsps as
> pure hashes would result in a ~5x speedup all the time instead of  
> only a
> 1.5x speedup in worst case.
>
> So can people do some of their own speed tests on typical, realistic
> blast-parsing jobs before and after I commit the changes? Then  
> everyone
> can decide if more needs to be done. I don't think everyone should do
> the same test - more variety is good. Obviously each individual should
> do the same test before and after. Post your before tests here and  
> I'll
> commit sometime this weekend. Thanks for your help!
>
>
> My own speed tests
> ------------------
> #Files used:
>    'medium.blast' = 670k plain-text format blastn with 1 result
>    'large.blast'  = 8.2MB plain-text format blastn with 5 results
>
> # Best case scenario:
> time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "large.blast"); while ($result = $sio->next_result) { }'
>
> Before:
>          repeat 1    repeat 2   repeat 3   average
> real    1m28.065s   1m28.305s  1m28.271s  1m28.214s
> user    1m27.610s   1m27.850s  1m27.860s  1m27.773s
> sys     0m0.440s    0m0.450s   0m0.410s   0m0.433s
>
> After:
>          repeat 1    repeat 2   repeat 3   average    speed up:
> real    0m16.653s   0m16.711s  0m16.685s  0m16.683s  5.3x
> user    0m16.510s   0m16.570s  0m16.580s  0m16.553s  5.3x
> sys     0m0.130s    0m0.140s   0m0.110s   0m0.127s   3.4x
>
>
> perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "medium.blast"); while ($result = $sio->next_result) { }'
>
> dprofpp -I before:
> Total Elapsed Time = 5.309818 Seconds
>    User+System Time = 5.209818 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>   131.   0.473  6.834      2   0.2365 3.4169
> Bio::SearchIO::blast::next_result
>   110.   0.333  5.753  36512   0.0000 0.0002
> Bio::SearchIO::blast::end_element
>   102.   0.315  5.326   2388   0.0001 0.0022
> Bio::SearchIO::SearchResultEventBuilder::end_hsp
>   92.2   0.026  4.804   7266   0.0000 0.0007
> Bio::Factory::ObjectFactory::create_object
>   91.4   0.250  4.765   2388   0.0001 0.0020
> Bio::Search::HSP::GenericHSP::new
>   40.4   0.162  2.105   2388   0.0001 0.0009
> Bio::SeqFeature::SimilarityPair::new
>   39.0   0.173  2.034   9552   0.0000 0.0002
> Bio::SeqFeature::Similarity::new
>
> dprofpp -I after:
> Total Elapsed Time = 1.090717 Seconds
>    User+System Time = 1.490717 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>   98.2   0.333  1.465      2   0.1665 0.7324
> Bio::SearchIO::blast::next_result
>   40.5   0.213  0.604  36512   0.0000 0.0000
> Bio::SearchIO::blast::end_element
>   22.8       -  0.341  34022        - 0.0000   
> Bio::SearchIO::blast::element
>   19.8   0.296  0.296   2388   0.0001 0.0001
> Bio::SearchIO::SearchResultEventBuilder::end_hsp
>   15.7   0.156  0.235  41186   0.0000 0.0000
> Bio::SearchIO::blast::characters
>   12.6   0.188  0.188 141495   0.0000 0.0000
> Bio::SearchIO::blast::in_element
>   10.3   0.133  0.154  25795   0.0000 0.0000  Bio::Root::IO::_readline
>
> Improvement: most of our time now spent in the SAX-like bits of the
> parser code, instead of creating HSPI objects. (I might streamline the
> SAX-like bits as well.)
>
>
> # Medium case scenario:
> time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "large.blast"); while ($result = $sio->next_result) { while ($hit =
> $result->next_hit) { while ($hsp = $hit->next_hsp) { } } }'
>
> Before:
>          repeat 1    repeat 2   repeat 3   average
> real    1m27.771s   1m27.996s  1m28.953s  1m28.240s
> user    1m27.200s   1m27.540s  1m28.500s  1m27.747s
> sys     0m0.560s    0m0.450s   0m0.450s   0m0.487s
>
> After:
>          repeat 1    repeat 2   repeat 3   average    speed up:
> real    0m25.984s   0m25.996s  0m25.714s  0m25.898s  3.4x
> user    0m25.780s   0m25.750s  0m25.510s  0m25.680s  3.4x
> sys     0m0.210s    0m0.240s   0m0.200s   0m0.217s   2.2x
>
>
> # Worse case scenario:
> time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "large.blast"); while ($result = $sio->next_result) { while ($hit =
> $result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query;
> $hsp->hit; } } }'
>
> Before:
>          repeat 1    repeat 2   repeat 3   average
> real    1m28.104s   1m27.861s  1m27.970s  1m27.978s
> user    1m27.600s   1m27.420s  1m27.420s  1m27.480s
> sys     0m0.490s    0m0.430s   0m0.550s   0m0.490s
>
> After:
>          repeat 1    repeat 2   repeat 3   average    speed up:
> real    0m57.167s   0m57.080s  0m56.903s  0m58.050s  1.5x
> user    0m56.700s   0m56.770s  0m56.530s  0m56.667s  1.5x
> sys     0m0.470s    0m0.310s   0m0.370s   0m0.383s   1.3x
>
>
> perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "medium.blast"); while ($result = $sio->next_result) { while ($hit =
> $result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query;
> $hsp->hit; } } }'
>
> dprofpp -I before:
> Total Elapsed Time = 9.767702 Seconds
>    User+System Time = 9.577702 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>   115.   0.568 11.031      2   0.2838 5.5156
> Bio::SearchIO::blast::next_result
>
> dprofpp -I after:
> Total Elapsed Time = 6.521398 Seconds
>    User+System Time = 6.571398 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>   61.2   0.087  4.023  11940   0.0000 0.0003
> Bio::Search::HSP::GenericHSP::query
>
> Improvement: now we spend most of our time dealing with the thing we
> actually wanted from the parse, not dealing with the parsing  
> process itself.
>
>
>
> --------------------------------------------------------
>
> Here are the changes involved in achieving the speed up:
>
>
> Bio::Search::HSP::GenericHSP
> ----------------------------
>
> # Implementation changes
> Call to new() now calls no methods of its own; no work is done  
> simply to
> create a GenericHSP. The code that was previously in new() has been
> moved to private methods that are called just-in-time, as the user
> desires to know certain information by manually calling HSPI methods.
>
> Added new options to new() -hit_desc and -query_desc for setting the
> description text for the sequences.
>
>
> Bio::Search::Hit::GenericHit
> ----------------------------
>
> # API-CHANGES
> new() has extra option -hsp_factory.
>
> New method hsp_factory() which gets/sets a  
> Bio::Factory::ObjectFactoryI.
>
> add_hsp() can now accept a hash ref instead of just a HSPI.
>
> # Implementation changes
> next_hsp() and hsps() convert hash ref hsp data to HSPI objects using
> the hsp_factory() as necessary.
>
> # Notes
> num_hsps() claimed to throw if there were no HSPs, but returns '-'.
> Updated docs.
>
>
> Bio::Search::Result::GenericResult
> ----------------------------------
>
> # API-CHANGES
> new() has extra option -hit_factory.
>
> New method hit_factory() which gets/sets a  
> Bio::Factory::ObjectFactoryI.
>
> add_hit() can now accept a hash ref instead of just a HitI.
>
> # Implementation changes
> next_hit() and hits() convert hash ref hit data to HitI objects using
> the hit_factory() as necessary.
>
>
> Bio::Search::Iteration::GenericIteration
> ----------------------------------------
>
> # API-CHANGES
> new() has extra option -hit_factory.
>
> New method hit_factory() which gets/sets a  
> Bio::Factory::ObjectFactoryI.
>
> add_hit() can now accept a hash ref instead of just a HitI.
>
> # Implementation changes
> Various methods convert hash ref hit data to HitI objects using the
> hit_factory() as necessary.
>
>
> Bio::SearchIO::SearchResultEventBuilder
> ---------------------------------------
>
> # Implementation changes
> Methods end_hsp() and end_hit() return hash refs containing data
> suitable for creating HitI and HSPI objects respectively.
>
>
> Bio::SearchIO::IteratedSearchResultEventBuilder
> -----------------------------------------------
>
> # Implementation changes
> _add_hit() deals with the new way hit information is stored.
>
> end_iteration() supplies the hit factory to created and returned
> iteration factories.
>
>
> Bio::SeqFeature::SimilarityPair
> -------------------------------
>
> # Notes
> new() makes an object factory if not supplied -feature_factory as arg,
> but then does nothing with its freshly created factory, and ignores  
> any
> that are supplied with -feature_factory. Now it just doesn't make a
> factory at all, to save time.
>
> new() has an option -feature1, but when supplied this is never used or
> set. I've left this alone.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign






More information about the Bioperl-l mailing list