[Bioperl-l] SearchIO Performance

Fri Mar 21 21:40:00 UTC 2008

On Mar 21, 2008, at 1:13 PM, Albion Baucom wrote:

> Hi. I am pretty new to BioPerl, and have a question about  
> performance with regard to Blast (nucleotide) file parsing. My  
> Blast result files usually have close to 100 or more sequence hits.  
> Each sequence is about 1400 nucleotides long.
>
> After profiling code I wrote, I find that calling the next_result()  
> function after creating a search object takes substantially longer  
> than non-OO, quick and dirty code I am using to parse the same  
> Blast files.
>
> What is substantially longer? Well, the existing code takes about  
> 0.25 seconds, and the BioPerl call takes about 4.5 seconds. I find  
> that to be a dramatic difference, and that kind of time difference  
> becomes significant when I have to parse 30 Blast files in a row. I  
> understand that SearchIO is parsing the entire file and storing it  
> all for easy retrieval later, and maybe this time penalty is what I  
> have to pay for that convenience and organization.
>
> I am just wondering if there is anything other than writing custom  
> code based on BioPerl to speed this up. Something I might not be  
> aware of that I can do ahead of time, or during parsing, to limit  
> what is parsed, or facilitate the parsing process. For instance, is  
> there a way to "look ahead" and simply parse alignments that meet a  
> specific expectancy cutoff?
>
> I confess I have not read the documentation thoroughly (although  
> obviously enough to make it do what I want), but am certainly  
> willing to do so if someone can point me in the right direction.
>
We are well aware of the speed issues.  They are discussed briefly  
on the wiki:
http://bioperl.org/wiki/Why_BioPerl_is_slow

The cost is mostly in object creation rather than in the parsing  
itself (relatively speaking).  It takes a while because we create a  
lot of objects under the hood for each alignment.  Sendu has written  
a pull parser that doesn't create any of those objects until the user  
requests them.
As I've said in the past, if someone wrote a SearchIO event listener  
that created lightweight objects (or just hashes) instead, that would  
also provide a substantial speedup.
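
To make the "don't build it until someone asks" idea concrete, here is
a minimal sketch in Python (not BioPerl, and not Sendu's actual
parser). The class LazyHit, the function split_hits, and the
simplified report layout (hits start at a line beginning with '>' and
contain a "Length = N" line) are all assumptions made for
illustration.

```python
import re
from functools import cached_property


class LazyHit:
    """One hit from a text report; fields are parsed only on first access."""

    parse_count = 0  # class-wide counter, only here to make the laziness visible

    def __init__(self, raw_block):
        self.raw_block = raw_block  # cheap: keep the text, parse nothing yet

    @cached_property
    def parsed(self):
        # Runs once per hit, the first time .parsed is read; result is cached.
        LazyHit.parse_count += 1
        header = self.raw_block.splitlines()[0]
        match = re.search(r"Length\s*=\s*(\d+)", self.raw_block)
        return {
            "name": header.lstrip(">").split()[0],
            "length": int(match.group(1)) if match else None,
        }


def split_hits(report_text):
    """Split a report into per-hit blocks at lines that start with '>'."""
    blocks = re.split(r"(?m)^(?=>)", report_text)
    # Anything before the first '>' (the report header) is dropped.
    return [LazyHit(block) for block in blocks if block.startswith(">")]
```

Splitting the report into blocks is cheap; the regex work happens only
for the hits whose .parsed attribute is actually read, so a caller who
inspects two hits out of a hundred pays for two.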

In the fall I experimented with array-based instead of hash-based  
feature objects and got a pretty decent speedup as well, but I just  
haven't had time to roll out a more substantial prototype.  For the  
inner loops it may make sense to substitute a less flexible but very  
fast object.

I always advocate thinking about what your needs are.  If you just  
want the start/stop of alignments, you can get these from BLAST  
tabular output with -m9 (NCBI) or --mformat=3 (WUBLAST), and you  
can write a fast parser that uses 'split'.
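
As a rough illustration of the split-based approach, here is a minimal
sketch in Python; the same shape carries over to Perl's split. It
assumes the standard 12-column NCBI tabular layout, with '#' comment
lines as produced by -m9. WU-BLAST tabular output uses a different
column set, so the COLUMNS tuple would need adjusting there. The names
parse_tabular, COLUMNS and max_evalue are made up for this example.

```python
# Column order assumed for NCBI BLAST tabular output (-m8 / -m9).
COLUMNS = (
    "query", "subject", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
)


def parse_tabular(lines, max_evalue=None):
    """Yield one dict per hit line from any iterable of lines.

    Comment lines starting with '#' are skipped. If max_evalue is given,
    hits with a larger E-value are dropped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # not a standard 12-column line; skip it
        hit = dict(zip(COLUMNS, fields))
        for key in ("q_start", "q_end", "s_start", "s_end"):
            hit[key] = int(hit[key])
        evalue = hit["evalue"]
        if evalue.startswith("e"):
            evalue = "1" + evalue  # defensive: tolerate values written as "e-120"
        hit["evalue"] = float(evalue)
        if max_evalue is not None and hit["evalue"] > max_evalue:
            continue
        yield hit
```

Something like `for hit in parse_tabular(open("result.m9"), max_evalue=1e-10)`
(file name hypothetical) then gives the coordinates of only the hits
that pass the cutoff, which also covers the expectation-cutoff
question without building any alignment objects.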

> Thanks
>
> Albion