[Bioperl-l] SearchIO Performance
Jason Stajich
jason at bioperl.org
Fri Mar 21 21:40:00 UTC 2008
On Mar 21, 2008, at 1:13 PM, Albion Baucom wrote:
> Hi. I am pretty new to BioPerl, and have a question about
> performance with regard to Blast (nucleotide) file parsing. My
> Blast result files usually have close to 100 or more sequence hits.
> Each sequence is about 1400 nucleotides long.
>
> After profiling code I wrote, I find that calling the next_result()
> function after creating a search object takes substantially longer
> than non-OO, quick and dirty code I am using to parse the same
> Blast files.
>
> What is substantially longer? Well, the existing code takes about
> 0.25 seconds, and the BioPerl call takes about 4.5 seconds. I find
> that to be a dramatic difference, and that kind of time difference
> becomes significant when I have to parse 30 Blast files in a row. I
> understand that SearchIO is parsing the entire file and storing it
> all for easy retrieval later, and maybe this time penalty is what I
> have to pay for that convenience and organization.
>
> I am just wondering if there is anything other than writing custom
> code based on BioPerl to speed this up. Something I might not be
> aware of that I can do ahead of time, or during parsing, to limit
> what is parsed, or facilitate the parsing process. For instance, is
> there a way to "look ahead" and simply parse alignments that meet a
> specific expectancy cutoff?
>
> I confess I have not read the documentation thoroughly (although
> obviously enough to make it do what I want), but am certainly
> willing to do so if someone can point me in the right direction.
>
We are quite aware of the speed issues. They are discussed briefly on
the wiki:
http://bioperl.org/wiki/Why_BioPerl_is_slow
It boils down to object creation rather than the parsing itself
(relatively speaking). It takes a while because we're creating a lot
of objects under the hood for each alignment. Sendu has written a pull
parser that doesn't create all of those objects until the user
requests them.
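
To illustrate the deferred-creation idea in general terms, here is a
minimal sketch. It is written in Python purely for illustration and is
not Sendu's parser or any BioPerl API: the LazyHit class and iter_hits
function are hypothetical names, and the input is assumed to be
tab-separated hit lines with the e-value in the 11th column. The raw
line is stored cheaply, and fields are only split out the first time
one is asked for.

```python
class LazyHit:
    """Hypothetical hit record that defers parsing until a field is requested."""

    def __init__(self, raw_line):
        self._raw = raw_line   # cheap: keep only the unparsed text
        self._fields = None    # filled in on first access

    def _parse(self):
        # Split the line once, and only when somebody actually needs a field.
        if self._fields is None:
            self._fields = self._raw.rstrip("\n").split("\t")
        return self._fields

    @property
    def subject_id(self):
        return self._parse()[1]

    @property
    def evalue(self):
        return float(self._parse()[10])


def iter_hits(lines):
    """Yield one LazyHit per data line, skipping blank and '#' comment lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield LazyHit(line)
```

A caller that only counts hits never pays for the split; a caller that
reads hit.evalue pays for it once per hit.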
As I've said in the past, if someone wrote a SearchIO event listener
that created lightweight objects (or just hashes) instead, this would
also provide a substantial speedup.
In the fall I did some experimentation with array-based instead of
hash-based feature objects and got a pretty decent speedup as well,
but I just haven't had any time to roll out a more substantial
prototype. For the inner loops of things it may make sense to
substitute a less flexible but super-fast object.
I always advocate thinking about what your needs are - if you just
want the start/stop of alignments, you can grab this out of a BLAST
tabular report with -m9 (NCBI) or --mformat=3 (WUBLAST) and you
can write a fast parser that uses 'split'.
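
As a rough sketch of such a split-based parser, the following (in
Python for illustration; the same logic carries over to Perl's split)
reads NCBI tabular output and keeps only hits at or below an e-value
cutoff. It assumes the standard 12-column layout (query, subject,
% identity, alignment length, mismatches, gap openings, q. start,
q. end, s. start, s. end, e-value, bit score) with '#' comment lines.
The parse_tabular name and the max_evalue parameter are made up for
this example.

```python
def parse_tabular(lines, max_evalue=None):
    """Yield one plain dict per hit from BLAST tabular lines.

    Assumes 12 tab-separated columns; lines starting with '#' are
    comments. Hits with an e-value above max_evalue are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            continue  # not a hit line in the assumed layout
        evalue = float(fields[10])
        if max_evalue is not None and evalue > max_evalue:
            continue  # filter early, before building the record
        yield {
            "query": fields[0],
            "subject": fields[1],
            "q_start": int(fields[6]),
            "q_end": int(fields[7]),
            "s_start": int(fields[8]),
            "s_end": int(fields[9]),
            "evalue": evalue,
            "bit_score": float(fields[11]),
        }
```

Passing an open file handle works directly, e.g.
parse_tabular(open("result.tab"), max_evalue=1e-5), where result.tab
is a placeholder file name. Nothing is stored beyond the current line,
so only the hits that pass the cutoff are ever turned into records.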
> Thanks
>
> Albion
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l