[Bioperl-l] SearchIO speed up

aaron.j.mackey at gsk.com aaron.j.mackey at gsk.com
Mon Aug 14 15:56:01 UTC 2006


> User requests report-statistic Y, which is found on the last line of the 
> report. We want to avoid reading, storing and parsing the entire file 
> just to find Y, so we seek to the last line, parse Y out and return it. 
> Yay, super fast.

This was the bit I was missing, thanks; to be honest, I never knew we had 
a get_result(Y) method, I thought we only had next_result() iterators.  Oh 
wait, we don't, but you're proposing we should extend the API to offer 
one?

The only thing we do have is a "result_count" method that is defined as 
returning the number of results that "have been parsed" (which, to me, 
could differ from the number of results that "have already been, or could 
yet be, parsed").

> Now the user requests the next_result(). Let's say the first result 
> begins 5 lines into the file after the header. We quickly seek() there 
> and...

Yes, I understand that pipes aren't seekable.  I didn't understand the 
non-streaming context in which you wanted to seek back up the stream.

> # Allow seeking around. This adds an initial, possibly trivial, burden 
> for piped input only.

OK, if you insist on the need for "get_result(Y)" functionality, then (as 
you say) you must use a buffer/cache mechanism (switching from in-memory 
to tempfile above some threshold is another wrinkle to consider).  But, 
consider emulating XML::Twig's "purge_up_to" mechanism, whereby after I 
call "get_result(Y)", I can also call "purge_up_to(Y)" to release/minimize 
the buffer contents.
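
As a rough illustration of the buffer-plus-purge idea (a sketch only, 
written in Python for brevity; ResultCache and its methods are 
hypothetical names, not part of Bio::SearchIO or XML::Twig). It assumes 
each result arrives as a block of raw bytes, keeps the blocks in memory 
up to a threshold, and spills to a temporary file beyond that:

```python
import tempfile


class ResultCache:
    """Hypothetical cache of raw result blocks read from an unseekable stream."""

    def __init__(self, max_in_memory=1024 * 1024):
        self._max_in_memory = max_in_memory
        # Stays in memory until max_in_memory bytes, then rolls over to disk.
        self._store = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
        self._index = {}  # result number -> (offset, length)

    def add(self, number, raw_bytes):
        self._store.seek(0, 2)  # always append at the end of the store
        offset = self._store.tell()
        self._store.write(raw_bytes)
        self._index[number] = (offset, len(raw_bytes))

    def get_result(self, number):
        offset, length = self._index[number]  # KeyError if purged or never seen
        self._store.seek(offset)
        return self._store.read(length)

    def purge_up_to(self, number):
        """Release every cached result numbered <= number."""
        old_store, old_index = self._store, self._index
        self._store = tempfile.SpooledTemporaryFile(max_size=self._max_in_memory)
        self._index = {}
        # Copy the survivors one at a time into a fresh, smaller store.
        for n in sorted(old_index):
            if n > number:
                offset, length = old_index[n]
                old_store.seek(offset)
                self.add(n, old_store.read(length))
        old_store.close()


cache = ResultCache(max_in_memory=16)
cache.add(1, b"result one\n")
cache.add(2, b"result two\n")
cache.add(3, b"result three\n")
wanted = cache.get_result(2)
cache.purge_up_to(2)  # results 1 and 2 are no longer held
```

Whether purging should rebuild the store, as here, or merely drop index 
entries is a design choice; the sketch takes the simplest route that 
actually frees space.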

The reason I'm being so fussy about this is that a primary motivation for 
a shockingly-fast parser is shockingly large datasets that we keep only as 
compressed files, uncompressing them en route to the parser; thus your 
simple "I'll just copy the stream to tempfile and proceed as normal" 
solution is not so trivial.
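
For concreteness, the streaming case might look like the following 
sketch (Python, for illustration only; iter_results and the "Query=" 
record marker are assumptions, not SearchIO behaviour). The compressed 
input is read forward only, and at most one report is held in memory at 
a time:

```python
import gzip
import io


def iter_results(compressed_stream, start_marker=b"Query="):
    """Yield one report at a time from a gzip-compressed stream.

    A new report is assumed to begin at each line starting with
    start_marker; nothing already yielded is kept around.
    """
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as lines:
        current = []
        for line in lines:
            if line.startswith(start_marker) and current:
                yield b"".join(current)
                current = []
            current.append(line)
        if current:
            yield b"".join(current)


# Tiny in-memory stand-in for a large compressed report file.
raw = b"Query= a\nhit1\nQuery= b\nhit2\n"
reports = list(iter_results(io.BytesIO(gzip.compress(raw))))
```

Copying such a stream to a tempfile first would mean writing out the 
full uncompressed data, which is exactly the cost being avoided.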

Here's a compromise: assume that users won't need random access to their 
results, only sequential; also, provide a new parameter to the SearchIO 
constructor to specify the desired access mode as random; then, if the 
input stream is not seekable (which is testable), you can perform your 
memory/file caching.  If get_result(X) is called without the access mode 
being set to random on an unseekable stream, throw an (informative) error.
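
The compromise above could be sketched roughly as follows (Python, for 
illustration only; SearchReader, the "access" parameter and FakePipe are 
hypothetical names, and each line stands in for one result). Caching is 
done only when random access was requested and the input is not 
seekable:

```python
import io


class FakePipe(io.BytesIO):
    """Stand-in for a pipe: readable, but reports itself as unseekable."""

    def seekable(self):
        return False


class SearchReader:
    """Hypothetical reader; not the Bio::SearchIO API."""

    def __init__(self, stream, access="sequential"):
        if access not in ("sequential", "random"):
            raise ValueError("access must be 'sequential' or 'random'")
        if access == "random" and not stream.seekable():
            # Pay the caching cost only when the caller asked for it.
            # (In memory here; a real version might spill to a temp file.)
            stream = io.BytesIO(stream.read())
        self._stream = stream

    def next_result(self):
        return self._stream.readline() or None

    def get_result(self, index):
        if not self._stream.seekable():
            raise RuntimeError(
                "get_result() needs seekable input; "
                "construct the reader with access='random' to enable caching"
            )
        here = self._stream.tell()
        try:
            self._stream.seek(0)
            for _ in range(index):
                self._stream.readline()
            return self._stream.readline() or None
        finally:
            self._stream.seek(here)  # leave the sequential iterator undisturbed


data = b"r0\nr1\nr2\n"
reader = SearchReader(FakePipe(data), access="random")
first = reader.next_result()
third = reader.get_result(2)
second = reader.next_result()
```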

Yes, I realize this is a bit more work; but the result could actually be 
usable!

-Aaron



