[Bioperl-l] SearchIO speed up

Thu Aug 10 22:28:49 UTC 2006

aaron.j.mackey at gsk.com wrote:
>> As I understand your description, this is exactly what I do. My 'chunks' 
>> are the hashes that are normally used to create a new Hit/HSP object.
>>
>> The initial parse of the data file results in a small number of objects 
>> (Results) that contain all the data: HSP data nested in Hit data nested 
>> in the Result objects. When you actually want to do something with a 
>> certain hit or HSP it becomes an object, allowing you to call its 
>> methods like normal.
>>
>> Or are you suggesting something that would be even better than that? If 
>> so, please elucidate! :)
> 
> So the only laziness you invoke is the object instantiation (but you've 
> already done all the parsing).
> 
> My proposal involves the "chunks" being unparsed, raw text "blobs", that 
> are essentially blessed into a package that does the parsing only when 
> necessary (and even then, might choose different parsing strategies, based 
> on what's been asked for).  Thus a potentially large amount of parsing and 
> storage is skipped.  Additionally, you now have the option of not even 
> storing the blobs in memory, just file seek pointers (requiring temp. 
> storage for streaming pipe data sources), and thus can process very large 
> reports without consuming memory (currently a problem).

Thanks, I might try out something along those lines. The problem I see 
is with piped input: I wouldn't want to require temp. storage, because 
the user may deliberately be trying to gain speed by doing as little 
disk I/O as possible. So you'd have to special-case it: seek pointers if 
we have a file on disk, blobs stored in memory if the input is piped. 
Maybe that special case wouldn't be so bad.
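
A rough sketch of how that special case could look, written in Python 
rather than Perl purely for illustration. Everything here is made up for 
the example and is not part of SearchIO: the toy report format (key=value 
lines, hits terminated by a "//" line), the LazyHit class, and the 
index_hits function. Each hit is kept either as a (handle, offset, length) 
pointer or as a raw in-memory blob, and is parsed only when a field is 
first requested.

```python
import io


class LazyHit:
    """Hypothetical lazy record: holds a file pointer or a raw blob,
    and parses it only on first field access."""

    def __init__(self, raw=None, handle=None, offset=0, length=0):
        self._raw = raw          # bytes, used for piped (non-seekable) input
        self._handle = handle    # seekable handle, used for files on disk
        self._offset = offset
        self._length = length
        self._fields = None      # filled in lazily by field()

    def _text(self):
        if self._raw is not None:
            return self._raw
        # Seek back to the stored position and read just this hit.
        self._handle.seek(self._offset)
        return self._handle.read(self._length)

    def field(self, name):
        if self._fields is None:
            lines = self._text().decode().splitlines()
            self._fields = dict(l.split("=", 1) for l in lines if "=" in l)
        return self._fields[name]


def index_hits(handle):
    """Split a toy report (binary handle) into LazyHit objects.

    Seekable input: store only offsets. Piped input: store the raw bytes.
    """
    seekable = handle.seekable()
    hits = []
    start = handle.tell() if seekable else 0
    lines = []
    while True:
        line = handle.readline()
        at_end = not line
        if at_end or line.strip() == b"//":
            if seekable:
                end = handle.tell() - len(line)
                if end > start:
                    hits.append(LazyHit(handle=handle, offset=start,
                                        length=end - start))
                start = handle.tell()
            elif lines:
                hits.append(LazyHit(raw=b"".join(lines)))
                lines = []
            if at_end:
                break
        elif not seekable:
            lines.append(line)
    return hits
```

Note that in the seekable case the hits share the underlying handle, so 
this sketch assumes indexing has finished before any hit is read, and 
that the handle stays open for as long as the hits are in use.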