[Bioperl-l] SearchIO speed up

Thu Aug 10 22:06:18 UTC 2006

...
> > Or are you suggesting something that would be even better than that? If
> > so, please elucidate! :)
> 
> Oh, I guess the difference is the 'minimally parsed' bit, i.e. the HSP
> chunks could virtually be raw lines from the input file? I don't think
> parsing the lines into data stored in hashes is any kind of significant
> burden, but it is certainly worthy of investigation if we're really
> really hungry for speed. Remember that anyway, we have to do a
> significant amount of parsing to discover where the chunks start and end.

You would just look for start/end 'elements' for Result/Hit/HSPs.  Though
SearchIO does this, I probably wouldn't use exactly the same approach.  

This is my take on it.  I could be completely off here...

Carve out each result chunk, parse out the data for the ResultI, then carve
out the hits into 'chunks' based on start/end events only (minimal parsing).
These are passed on to the next parser, which processes each chunk for HitI,
then carves out HSP chunks based on start/end events.  Those are passed on to
a third parser for grabbing HSPI data.
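
To make the layering concrete, here is a rough sketch in Python (not Perl,
just to keep it short).  Everything in it is an assumption for illustration:
the toy BLAST-like markers ("Query=", ">", " Score ="), the function names
and the dict fields are made up and are not the SearchIO API.

```python
from typing import Dict, List, Tuple

# Toy input, assumed for illustration only; real reports are far messier.
SAMPLE = """Query= q1
>hitA
 Score = 50.1 bits
 Score = 20.0 bits
>hitB
 Score = 12.5 bits
Query= q2
>hitC
 Score = 99.0 bits
"""


def carve(lines: List[str], start_prefix: str) -> Tuple[List[str], List[List[str]]]:
    """Split lines into a header and chunks using start markers only.

    Lines before the first marker form the header; each chunk runs from one
    marker up to (not including) the next. Nothing else is parsed here.
    """
    header: List[str] = []
    chunks: List[List[str]] = []
    for line in lines:
        if line.startswith(start_prefix):
            chunks.append([line])       # a start event opens a new chunk
        elif chunks:
            chunks[-1].append(line)     # raw line, stored unparsed
        else:
            header.append(line)
    return header, chunks


def parse_hsp(chunk: List[str]) -> Dict:
    """Third parser: the only place where HSP fields are extracted."""
    score = chunk[0].split("=")[1].split()[0]
    return {"score": float(score)}


def parse_hit(chunk: List[str]) -> Dict:
    """Second parser: read the hit header, then carve out HSP chunks."""
    header, hsp_chunks = carve(chunk, " Score =")
    return {"name": header[0][1:].strip(),
            "hsps": [parse_hsp(c) for c in hsp_chunks]}


def parse_result(chunk: List[str]) -> Dict:
    """First parser: read the result header, then carve out hit chunks."""
    header, hit_chunks = carve(chunk, ">")
    return {"query": header[0].split("=", 1)[1].strip(),
            "hits": [parse_hit(c) for c in hit_chunks]}


def parse_report(text: str) -> List[Dict]:
    """Carve a whole report into result chunks and parse each one."""
    _, result_chunks = carve(text.splitlines(), "Query=")
    return [parse_result(c) for c in result_chunks]
```

Calling parse_report(SAMPLE) gives one dict per result.  This version parses
everything eagerly; the hit and HSP parsers could just as well be called
only when someone asks for the objects.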

Sound about right? 

> ... Though, with that approach we might also get a memory saving:
> assuming we can rely on the input file sticking around, store a pointer
> to the position and length of each 'chunk' of lines, instead of the line
> data itself.
> 
> (I don't think that's a serious suggestion, just throwing ideas out.)

That's what this forum is for. 
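
For what it's worth, the offset/length idea quoted above might look roughly
like the sketch below.  It is in Python for brevity, the helper names
(index_chunks, read_chunk) are hypothetical, and it assumes the input file
stays in place and unchanged between indexing and reading.

```python
from typing import List, Tuple


def index_chunks(path: str, start_prefix: bytes) -> List[Tuple[int, int]]:
    """Return (offset, length) in bytes for each chunk in the file.

    A chunk starts at a line beginning with start_prefix and runs up to the
    next such line (or end of file). Only the positions are kept in memory.
    """
    spans: List[Tuple[int, int]] = []
    start = None   # byte offset of the chunk currently open, if any
    pos = 0        # byte offset of the line being looked at
    with open(path, "rb") as fh:   # binary mode so offsets are exact bytes
        for line in fh:
            if line.startswith(start_prefix):
                if start is not None:
                    spans.append((start, pos - start))
                start = pos
            pos += len(line)
    if start is not None:
        spans.append((start, pos - start))
    return spans


def read_chunk(path: str, span: Tuple[int, int]) -> bytes:
    """Fetch the raw bytes of one chunk on demand."""
    offset, length = span
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)
```

The trade-off is an extra seek and read per chunk when the data is finally
wanted, in exchange for holding two integers per chunk instead of its lines.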

Chris