[Bioperl-l] SearchIO speed up

aaron.j.mackey at gsk.com aaron.j.mackey at gsk.com
Mon Aug 14 17:01:47 UTC 2006


> And then of course the idea is that this is nested, so the parser for 
> the result data is a Bio::Search::Result::ResultI but also a pull-parser 
> in its own right (and so on for HitI and HSPI) with a need for 
> random-access to the various bits of data needed to answer all the 
> various methods of ResultI.

the second- (and third- and so on) level parsers can work on in-memory 
"blobs" (if seeking is unavailable), as these will be minute in 
comparison with the whole report; it's only the top-level SearchIO parser 
that needs to fuss about streaming pipes and seekability.
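
a rough sketch of that division of labour, not actual SearchIO code: the 
"Query=" record delimiter and the names next_blob and parse_hits are made 
up for illustration. the top-level reader only ever reads forward, so it 
would work on a pipe; the second-level parser sees nothing but a string.

```perl
use strict;
use warnings;

# Hypothetical top-level reader: pulls one result-sized blob at a time from
# any filehandle, seekable or not, without ever calling seek().
# Assumes (for illustration only) that each result starts with "Query=".
sub next_blob {
    my ($fh, $state) = @_;
    my $blob = delete $state->{pending};   # header line saved by the last call
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^Query=/ && defined($blob)) {
            $state->{pending} = $line;     # this line opens the next result
            return $blob;
        }
        # Skip anything before the first header, then accumulate.
        $blob .= $line if defined($blob) || $line =~ /^Query=/;
    }
    return $blob;                          # final blob, or undef at EOF
}

# Hypothetical second-level parser: plain regexes over the in-memory blob.
sub parse_hits {
    my ($blob) = @_;
    my ($query) = $blob =~ /^Query=\s*(\S+)/;
    my @hits    = $blob =~ /^>(\S+)/mg;
    return { query => $query, hits => \@hits };
}

# An in-memory handle stands in for a pipe here; it is only read linearly.
my $report = "Query= q1\n>hitA\n>hitB\nQuery= q2\n>hitC\n";
open(my $fh, '<', \$report) or die "cannot open in-memory handle: $!";

my (%state, @results);
while (defined(my $blob = next_blob($fh, \%state))) {
    push @results, parse_hits($blob);
}
print "$_->{query}: @{ $_->{hits} }\n" for @results;
```

only one result's worth of text is held at a time, which is the point: 
the nested parsers never need to know whether the original input could seek.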

> I currently have a -piped_behaviour argument that accepts 'memory' or 
> 'temp_file'.

does it default to memory?

> How about a third (non-default) option of 'linear' to avoid 
> any attempt at a seek and just use the data as it is piped?

fine; we can quibble about stylistic API issues later.
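
whatever the option ends up being called, the decision itself can probably 
live in one place. a minimal sketch, assuming a helper (input_strategy, a 
made-up name) that probes the handle with a zero-byte relative seek; the 
'memory' default below is my guess, not a statement about your code.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);

# Hypothetical helper, not part of SearchIO: decide how the top-level parser
# should treat its input handle. A zero-byte relative seek succeeds on
# regular files and fails on pipes.
sub input_strategy {
    my ($fh, $piped_behaviour) = @_;
    $piped_behaviour = 'memory' unless defined $piped_behaviour;  # assumed default
    die "unknown -piped_behaviour '$piped_behaviour'\n"
        unless $piped_behaviour =~ /^(?:memory|temp_file|linear)$/;
    return 'seek' if seek($fh, 0, SEEK_CUR);   # seekable input: option is moot
    return $piped_behaviour;                   # piped input: honour the option
}

# A real pipe and an anonymous temporary file to exercise both branches.
pipe(my $reader, my $writer) or die "pipe failed: $!";
open(my $file, '+>', undef)  or die "cannot create temp file: $!";

print "pipe: ", input_strategy($reader, 'linear'), "\n";
print "file: ", input_strategy($file,   'linear'), "\n";
```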

> The trouble 
> is that you'd need to virtually implement the methods of a parser module 
> twice, once where the methods can seek, second where they can't. Or 
> maybe not; I'll have to try and see if some sane compromise 
> implementation is possible.

fundamentally, parsing occurs when regular expressions operate on 
in-memory blobs; so while you can keep lots of file pointers around to 
define many largish blobs with minimal memory footprint, at some point 
they need to become memory-resident for the parser to take effect. 
Conversely, if you spend too much time finding out the fine-grained 
locations of every parsable bit and saving the pointers, then you're 
recapitulating Perl's own variable storage mechanisms.
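
to make the trade-off concrete, here is a sketch of the coarse-grained 
end of that spectrum: one offset and length per result, with the text 
pulled into memory only when a regex needs it. make_blob and blob_text are 
invented names and the "Query=" delimiter is again just an assumption; 
this only applies when the input is seekable.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical lazy blob: remembers only where a chunk lives in the file.
sub make_blob {
    my ($fh, $offset, $length) = @_;
    return { fh => $fh, offset => $offset, length => $length };
}

# Pull the chunk into memory the first time a parser asks for its text.
sub blob_text {
    my ($blob) = @_;
    unless (defined $blob->{text}) {
        seek($blob->{fh}, $blob->{offset}, SEEK_SET) or die "seek failed: $!";
        my $got = read($blob->{fh}, my $buf, $blob->{length});
        die "short read\n" unless defined $got && $got == $blob->{length};
        $blob->{text} = $buf;
    }
    return $blob->{text};
}

# A small seekable file standing in for a large report.
open(my $fh, '+>', undef) or die "cannot create temp file: $!";
print {$fh} "Query= q1\n>hitA\nQuery= q2\n>hitB\n>hitC\n";
seek($fh, 0, SEEK_SET) or die "seek failed: $!";

# One cheap pass records where each result starts; no text is retained.
my @starts;
my $pos = tell($fh);
while (defined(my $line = <$fh>)) {
    push @starts, $pos if $line =~ /^Query=/;
    $pos = tell($fh);
}
my $end = $pos;

my @blobs = map {
    my $stop = $_ < $#starts ? $starts[$_ + 1] : $end;
    make_blob($fh, $starts[$_], $stop - $starts[$_]);
} 0 .. $#starts;

# Only now does the second result become memory-resident for the regex.
my @hits = blob_text($blobs[1]) =~ /^>(\S+)/mg;
print "@hits\n";
```

going any finer than this (an offset per hit, per HSP, per field) is where 
the bookkeeping starts to cost more than just holding the blob in a scalar.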

-Aaron