[Bioperl-l] Philosophy, BioPerl Object Creation [was Query Unigene title from input a ACC number]

Jason Stajich jason at cgt.mc.duke.edu
Tue Mar 25 12:36:10 EST 2003


On Tue, 25 Mar 2003, Jamie Hatfield (AGCoL) wrote:

> The blessed hash was actually something I was planning on trying, I just
> wasn't sure if that was a "sanctioned" method of speeding up my code.  I
> haven't been around to hear all the discussion re:
> speed/flexibility/solid object model, so I didn't realize this topic was
> becoming a dead horse.  :-)
>
It's still alive - don't worry.  We want to fix the perf problems, but no
good solution has been suggested yet.

> Another area I was curious about:  my fpc module ISA MapIO, so when
> reading in newlines, it uses the _readline function.  Are there any
> plans to buffer this or do we assume that the os/hardware does a good
> enough job as is?  Also, What was the motivation for abstracting this
> away?  I mean, I assume you're saying that there is a significant
> performance hit in perl when calling methods (more so, I assume, than
> other programming languages).
>

There is a significant performance hit calling 'new' with the way we have
implemented it because it walks up the chained constructor hierarchy and
perl doesn't seem to want to cache that walk (actually I'm really not sure
what is going on in the guts there; others have taken a look and can
report).  But you're only calling new once to instantiate the parser, so I
don't think you have to worry about things there.
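
To make the difference concrete, here is a rough sketch of the two routes:
building an object through a chained 'new', and blessing a hash directly and
then pushing the data in through accessors.  The My::Root and My::Feature
packages are made up for illustration and are not the actual Bioperl classes;
the real constructors do a good deal more work.

```perl
use strict;
use warnings;

# Hypothetical base class (stands in for a root object; not Bioperl code).
package My::Root;
sub new {
    my ($class, %args) = @_;
    return bless {}, $class;
}

# Hypothetical child class with a chained constructor.
package My::Feature;
our @ISA = ('My::Root');
sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);    # walks up to My::Root::new
    $self->start($args{-start}) if defined $args{-start};
    $self->end($args{-end})     if defined $args{-end};
    return $self;
}
# Combined getter/setter accessors.
sub start { my $self = shift; $self->{start} = shift if @_; return $self->{start} }
sub end   { my $self = shift; $self->{end}   = shift if @_; return $self->{end} }

package main;

# Usual route: every object pays for the whole constructor chain.
my $slow = My::Feature->new(-start => 1, -end => 100);

# Shortcut: bless a hash into the class, then set the data via accessors.
my $fast = bless {}, 'My::Feature';
$fast->start(1);
$fast->end(100);
```

The catch with the shortcut is that anything a parent constructor would have
initialized is skipped, so it is only safe when you know exactly what the
class expects to find in the hash.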

I am assuming perl is doing just fine buffering things.

> More than half of the time spent reading in a fpc file has ended up in
> the _readline method, but it really doesn't take that long to read the
> file in if you do it yourself with open, <>, close and such.  I'm just
> trying to find a good way to keep within the object model, but still
> make this a useable object.
>

I didn't think there was a serious performance hit with the IO class, but
it would be good to try and quantify it.  There are several reasons to use
the class we have set up, not the least of which is that we transparently
support either -fh or -file to specify a filehandle or a filename.  Using
open and <> yourself means you assume you are always being given a file.
Also, we provide a method called _pushback so that when you realize you've
read too far while parsing, you can push the last line (or lines) you read
back onto the stack.  The class also allows us to unify access to data
streams across the project so that all the parser modules behave in a
common way.  This is by far the main reason for having a common module for
IO access.
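
As a rough illustration of the -fh/-file and pushback ideas, here is a
stripped-down reader.  My::LineReader is a hypothetical package written for
this example; it borrows the _readline and _pushback names but is not the
Bioperl IO code, which handles many more cases.

```perl
use strict;
use warnings;

package My::LineReader;    # hypothetical; a sketch, not the Bioperl IO class

# Accept either an open filehandle (-fh) or a filename (-file).
sub new {
    my ($class, %args) = @_;
    my $fh   = $args{-fh};
    my $file = $args{-file};
    if (!defined $fh) {
        defined $file or die "need -fh or -file\n";
        open($fh, '<', $file) or die "cannot open $file: $!\n";
    }
    return bless { fh => $fh, buffer => [] }, $class;
}

# Return pushed-back lines first, then fall through to the filehandle.
sub _readline {
    my $self = shift;
    return shift @{ $self->{buffer} } if @{ $self->{buffer} };
    my $fh = $self->{fh};
    return scalar <$fh>;
}

# Put a line back so the next _readline call returns it again.
sub _pushback {
    my ($self, $line) = @_;
    unshift @{ $self->{buffer} }, $line if defined $line;
    return;
}

package main;

# An in-memory filehandle stands in for a real data file.
my $data = "record 1\nfield a\nrecord 2\nfield b\n";
open(my $in, '<', \$data) or die "cannot open in-memory handle: $!\n";

my $reader = My::LineReader->new(-fh => $in);
my @first;
while (defined(my $line = $reader->_readline)) {
    if (@first && $line =~ /^record/) {
        $reader->_pushback($line);    # read one line too far; put it back
        last;
    }
    push @first, $line;
}
my $next = $reader->_readline;        # the start of the second record again
```

The parser for the first record never has to know where the second one
begins ahead of time; it just reads until it sees a line that isn't its own
and hands that line back.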

> I really am not trying to be argumentative/critical.  Just trying to
> make it good and make it fast.

No, it's good to question these things; we get stuck in our ways sometimes,
I think.  Usually that's because we feel we've solved a problem and want to
move on, but sometimes more appropriate or creative solutions should be
applied to 'old' problems.

My feeling is people can criticize the project all they want, just be
willing to step to the plate and lead an effort to improve things (and
still maintain our attempt at quality stable releases).  If you just stand
afar and criticize, you aren't really helping, since most of us already
have quite full plates and Bioperl is mostly just a free-time sort of thing.

>
> Is there a developer paper/primer that I should read that has a lot of
> this discussion in it?
>
Our OO design stuff mainly comes from Damian Conway's principles in his
book and some early frameworks that Steve Chervitz helped establish, and
then some lighter-weight stuff that Ewan and others helped pioneer.

How I wish there was a real document about all of this - the mailing list
is a wealth of these things, but no one really has collected various
edicts or positions on things into the appropriate documents.  The wiki
was supposed to be the place for this but really has not worked out how
one might have hoped.

Other texts are emerging slowly from discussions that have been had
off-line, and some attempts at framing the future directions of the project
look like they'll be finished in the coming weeks.  RFCs will be posted as
soon as the core devs have agreed on what we envision should be part of the
development efforts over the next 6-9 months.


In case you are wondering, there's nothing too magical about the core devs -
we're just folks who have agreed to invest a fair amount of our time in
making sure the development effort is coordinated, releases code that meets
a certain threshold of testing, only breaks backwards compatibility when it
is appropriate, has at least a minimum of documentation, and tries to
establish a coherent design philosophy.

Anyone and everyone is always welcome to post their own RFC about project
directions or improvements to the toolkit and lead an effort.

Hope that helps some.

-jason

> Thanks for your help and advice.
>
> > -----Original Message-----
> > From: Jason Stajich [mailto:jason at cgt.mc.duke.edu]
> >
> > On Tue, 25 Mar 2003, Jamie Hatfield (AGCoL) wrote:
> >
> > > Maybe it's just me, but I've never been too pleased with BioPerl's
> > > ability to handle large amounts of data like these unigene clusters.
> > > You all might remember I recently proposed a FPC module for reading in
> > > FPC data files.  Well, that is still in progress, but it is DOG slow,
> > > and the only reason I can seem to make out of it is that object
> > > creation is a bear.
> > >
> > > I would really like some input myself, from the BioPerl experts about
> > > what I can do to speed up the creation of say . . . 100k objects?  :-)
> > >
> > You have to take a different approach then.  We've gone back and forth
> > on this a lot wrt speed and flexibility and a solid object model.
> > Apparently Perl doesn't make it easy to have all three.
> >
> > You can get around some of the problems if, instead of building things
> > with new, you bless a hash and then call some methods to push the data
> > in.  This prevents the walk-up-the-tree for inheritance that happens on
> > every new() call, which is the main bottleneck.  We do this with
> > features and locations in the genbank parser right now to get a modest
> > performance gain.  It is still an area that we are trying to rethink
> > and improve.
> >
> > I think we want to also move more in the realm of event based parsing
> > which would allow you to attach a listener which would only catch
> > certain events and perhaps wouldn't need to actually create objects for
> > certain quick and dirty tasks.  But the framework for this needs to be
> > laid pretty explicitly to make it really work.
> >
> > I believe Ensembl hit this perf problem and went with a simpler object
> > initialization scheme to buy them the performance they needed.  It
> > means that you have to code up more things when you inherit from an
> > object (and have to remember to update all child classes whenever a
> > parent class changes) but you get some performance increase.
> >
> > -jason
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> >
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

