[Bioperl-l] GenBankParser comparison to bioperl parser

Lincoln Stein lstein@cshl.org
Fri, 13 Sep 2002 12:28:49 -0400


Maybe there are some lessons to be gained from my experience with AcePerl.

AcePerl constructs a full object tree of the AceDB objects that it receives 
from the server, which come in the form of a space-delimited text blob 
(something like Genbank flat file format, but more structured).  Initially I 
parsed the entire tree in one go, converting it into object form, but this 
got really slow when dealing with big Sequence objects, which may be 
megabytes in size (because they include annotation and raw computes in 
addition to the sequence).  So I adopted a lazy parsing scheme:

	1) Each node of the object tree has three pointers, "right", "down"
		and "raw".  The root node starts out as "raw" and contains
		the unparsed text plus a pointer indicating where the parse
		last stopped.
	2) If a request for a node comes in, then the parser goes to work.
		It evaluates down to the node, and then stops.  All intervening
		nodes now contain fully-populated right and down pointers,
		and their raw fields are undef.  The last node contains a raw
		field containing the remainder of the text to be parsed.

The net effect of this was to speed up most processing considerably.  In 
addition, for the common case in which the programmer requests the text 
representation of a (sub)tree, the module just returns the unparsed text.
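
A minimal sketch of the lazy scheme described above, written in Python for brevity rather than in Perl, and not AcePerl's actual code. The names (Node, LazyObject, get, as_text, rows_parsed) and the toy input format (one space-delimited row per line, the first word being the tag) are assumptions made for illustration. Each node carries "right", "down" and "raw"; parsing advances only as far as the requested tag, and the unparsed remainder stays on the last node reached.

```python
class Node:
    """One tree node with the three pointers: right, down and raw."""

    def __init__(self, name, raw=None):
        self.name = name
        self.right = None   # next word in the same row
        self.down = None    # head of the next row
        self.raw = raw      # unparsed text that follows this node, or None

    def parse_row(self):
        """Consume one line of self.raw and return the head node of that row.

        The remaining text moves to the new head node; self.raw becomes None.
        """
        line, _, rest = self.raw.lstrip().partition("\n")
        self.raw = None
        words = line.split()
        head = Node(words[0], raw=rest if rest.strip() else None)
        node = head
        for word in words[1:]:
            node.right = Node(word)
            node = node.right
        return head


class LazyObject:
    """Hypothetical wrapper: the root starts out holding only raw text."""

    def __init__(self, name, text):
        self.root = Node(name, raw=text if text.strip() else None)
        self.rows_parsed = 0  # bookkeeping, only to make the laziness visible

    def get(self, tag):
        """Return the row head named `tag`, parsing only as far as needed."""
        prev, link = self.root, "right"
        while True:
            nxt = getattr(prev, link)
            if nxt is None:
                if prev.raw is None:
                    return None  # everything is parsed and the tag is absent
                nxt = prev.parse_row()
                setattr(prev, link, nxt)
                self.rows_parsed += 1
            if nxt.name == tag:
                return nxt
            prev, link = nxt, "down"

    def as_text(self):
        """Text form of the object; untouched input is returned unparsed."""
        if self.root.right is None:
            return self.root.raw or ""
        lines, head = [], self.root.right
        while head is not None:
            words, node = [], head
            while node is not None:
                words.append(node.name)
                node = node.right
            lines.append(" ".join(words))
            last, head = head, head.down
        if last.raw:
            lines.append(last.raw)  # the tail that was never parsed
        return "\n".join(lines)


if __name__ == "__main__":
    text = "DNA ACGT 4\nTitle example clone\nRemark big annotation blob"
    obj = LazyObject("Sequence:demo", text)
    title = obj.get("Title")
    print(title.right.name, obj.rows_parsed, repr(title.raw))
```

In this sketch a request for "Title" parses two rows and leaves the "Remark" row as raw text on the Title node, while calling as_text() before any request returns the original string without building any nodes.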

Lincoln


On Thursday 12 September 2002 12:58 pm, Hilmar Lapp wrote:
> > -----Original Message-----
> > From: Lincoln Stein [mailto:lstein@cshl.org]
> > Sent: Thursday, September 12, 2002 6:21 AM
> > To: Elia Stupka; Ewan Birney
> > Cc: Ian Korf; John Kloss; bioperl-l@bioperl.org;
> > gishlab@species.wustl.edu
> > Subject: Re: [Bioperl-l] GenBankParser comparison to bioperl parser
> >
> >
> > A separate repository is also fine with me, but I prefer
> > Bioperl-contrib,
> > because it should not just be for utility code, and nicely echoes the
> > "contrib" directory of the X Windows Consortium code distribution.
> >
> > I'll put Boulder into a Bioperl-contrib if there is one.
>
> West coast finally comes to work, looking amazed at one's inbox. I second
> that Lincoln's suggestion is the way to go.
>
> Adding additional modules that do the same thing an existing one does
> already but with a different API, be it faster or not, is not going to be
> helpful. Nevertheless, I'm convinced John's parser is an extremely valuable
> contribution for people who parse 50MB Genbank files on an everyday basis.
>
> As some people pointed out very correctly, bioperl's generic design and
> unified API come at a price, both in that you have to learn an API which
> is not the same as the input file you're so familiar with, and in that
> creating all those objects does cost execution time.
>
> I'm sure that some of the parsing logic can be substantially improved both
> in readability and speed, but honestly I'd be very surprised if even the
> ultimately best regexp combined with the ultimately best parsing logic can
> speed up the whole thing by a factor of more than 2-3. It's the object
> tree construction that costs you the order of magnitude.
>
> I think this is something worth spending more thought on: how can
> object tree construction be sped up considerably in bioperl? I started
> thinking about 2 possible ways that may help: 1) reusable object pools,
> from which clients can claim objects and release them for reuse when finished;
> object construction then becomes resetting the state of an already created
> object and setting its attributes. 2) Lazy object tree construction; if
> someone's not going to query $seq->top_SeqFeatures() those feature objects
> would not have to be created, let alone the relevant chunks of input parsed.
>
> This isn't meant as the ultimate wisdom, but rather to instigate
> discussion and brainstorming about what makes the most sense.
>
> 	-hilmar

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================