[Bioperl-l] SearchIO speed up

aaron.j.mackey at gsk.com aaron.j.mackey at gsk.com
Mon Aug 14 12:33:30 UTC 2006


A "pull parser" need not read everything (i.e. the entire file) into 
memory, just the current/next chunk, right?
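
To make the question concrete, here is a minimal sketch of what I mean by "pull", written in Python rather than Perl purely for illustration. The function name iter_chunks and the "Query=" record delimiter are assumptions for the example, not anything in SearchIO.

```python
import io

def iter_chunks(handle, start_marker="Query="):
    """Yield one raw-text record at a time from a line-oriented handle."""
    current = []
    for line in handle:
        # A new record begins: hand back the one collected so far.
        if line.startswith(start_marker) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)

# Hypothetical two-record report standing in for a real file or pipe.
report = io.StringIO("Query= seq1\nhit A\nQuery= seq2\nhit B\nhit C\n")
chunks = iter_chunks(report)
# Only the first record (plus the header line of the next) has been read here.
first = next(chunks)
```

The caller asks for the next chunk only when it wants one, so at most one record's worth of text is held at a time.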

It was the current "push parser" architecture that had me thinking about 
file pointers: if we're forced to make an initial pass through the entire 
file to build up all the top-level objects before being able to access the 
first one (as the current SearchIO does), then it would be advantageous to 
minimize the memory impact of all those top-level objects with file 
pointers rather than in-memory blobs.

But in a "pull" architecture, that consideration is no longer so 
important.
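
For the push case, the file-pointer idea could look roughly like the sketch below, again in Python for illustration only. LazyChunk, index_chunks and the "Query=" delimiter are hypothetical names, and the handle is assumed to be seekable and opened in binary mode so that offsets are byte counts.

```python
import io

class LazyChunk:
    """Stores only (offset, length); the raw text is read when asked for."""
    def __init__(self, handle, offset, length):
        self._handle = handle
        self._offset = offset
        self._length = length

    def read_text(self):
        # Seek to the record and read exactly its bytes; nothing is cached.
        self._handle.seek(self._offset)
        return self._handle.read(self._length).decode()

def index_chunks(handle, start_marker=b"Query="):
    """Make one pass over the handle, recording where each record starts."""
    starts = []
    handle.seek(0)
    offset = 0
    for line in handle:
        if line.startswith(start_marker):
            starts.append(offset)
        offset += len(line)
    # Each record ends where the next begins; the last ends at end of input.
    ends = starts[1:] + [offset]
    return [LazyChunk(handle, s, e - s) for s, e in zip(starts, ends)]

# Hypothetical in-memory stand-in for a report file on disk.
data = io.BytesIO(b"Query= seq1\nhit A\nQuery= seq2\nhit B\nhit C\n")
chunks = index_chunks(data)
second = chunks[1].read_text()
```

The initial pass still touches the whole file, but each top-level object then costs two integers instead of a blob of text.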

Please forgive me if I've misunderstood what you're describing below.

-Aaron

bioperl-l-bounces at lists.open-bio.org wrote on 08/14/2006 06:02:30 AM:

> Chris Fields wrote:
> > ...
> >> My proposal involves the "chunks" being unparsed, raw text "blobs", that
> >> are essentially blessed into a package that does the parsing only when
> >> necessary (and even then, might choose different parsing strategies, based
> >> on what's been asked for).  Thus a potentially large amount of parsing and
> >> storage is skipped.  Additionally, you now have the option of not even
> >> storing the blobs in memory, just file seek pointers (requiring temp.
> >> storage for streaming pipe data sources), and thus can process very large
> >> reports without consuming memory (currently a problem).
> > 
> > Using file pointers is a great touch.  Sendu has a slight aversion to temp
> > files but he has already indicated other ways around this.
> 
> I'm in the midst of implementing an 'Aaron'-style pull-parser which I 
> have called PullParserI. My current solution for piped input is:
> 
> '... The other thing you will need to decide when making a chunk is how 
> to handle piped input. A PullParser needs seekable data to parse, so if 
> your data is piped in and unseekable, you must decide between creating a 
> temp file or reading the input into memory, which will be done before 
> the chunk becomes usable and you can begin any parsing.'
> 
> I don't think it's really possible to avoid this initial 'read everything 
> in first' step, unless anyone has any bright ideas?
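
The temp-file-versus-memory choice described above for piped input can be sketched as follows, in Python for illustration. make_seekable and FakePipe are hypothetical names and not part of PullParserI; either branch still copies the whole input before any parsing can start.

```python
import io
import tempfile

def make_seekable(stream, use_temp_file=True):
    """Return a seekable binary handle holding everything from `stream`."""
    if stream.seekable():
        return stream
    # Unseekable input (e.g. a pipe): copy it all to a temp file or to memory.
    target = tempfile.TemporaryFile() if use_temp_file else io.BytesIO()
    for block in iter(lambda: stream.read(64 * 1024), b""):
        target.write(block)
    target.seek(0)
    return target

class FakePipe:
    """Hypothetical stand-in for piped input: readable but not seekable."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def seekable(self):
        return False

    def read(self, n=-1):
        return self._buf.read(n)

on_disk = make_seekable(FakePipe(b"Query= seq1\nhit A\n"))
in_memory = make_seekable(FakePipe(b"Query= seq1\nhit A\n"), use_temp_file=False)
```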