[Biopython-dev] BioSQL : BatchLoader

Wed Apr 23 08:56:33 UTC 2008

>  GenBank parsing with psyco      - 1 minute, 20 seconds
>  GenBank parsing without psyco   - 2 minutes, 15 seconds
>
>  BioSQL.Loader with psyco        - 4 minutes, 54 seconds
>  BioSQL.Loader without psyco     - 6 minutes, 10 seconds
>
>  BatchLoader with psyco          - 1 minute, 38 seconds
>  BatchLoader without psyco       - 2 minutes, 42 seconds

That's impressive - you seem to have got the database side of things
down to about 30 seconds, a fraction of the time it takes to parse the
GenBank file!  Although, as you pointed out, there are a lot of
provisos here.

There are still some slow bits in the current GenBank parser which
would be an obvious next target for you in your quest for speed.  I
did a little investigation a while ago, and concluded the parsing of
the feature locations was the biggest bottleneck.  However, this is a
rather complicated lump of code, so it's not such an easy task.  I
tried out a "hack" which special-cased the most common feature
location types, falling back on the original parser otherwise, which gave
much better performance.  I didn't check this in as it made some
already complex code WAY more complicated!
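
To illustrate the special-casing idea, here is a minimal sketch.  It is
not the code that was actually tried, and it does not use the real
Bio.GenBank internals: parse_location and slow_parser are hypothetical
names, and the (start, end, strand) tuple is just an assumed return
shape.  Only the two simplest location forms take the fast path;
everything else is handed to the existing parser unchanged.

```python
import re

# Fast path: only two very common location shapes are recognised,
# e.g. "123..456" and "complement(123..456)".
_SIMPLE = re.compile(r"(\d+)\.\.(\d+)")
_COMPLEMENT = re.compile(r"complement\((\d+)\.\.(\d+)\)")


def parse_location(text, slow_parser):
    """Return (start, end, strand) for simple locations.

    slow_parser is a hypothetical stand-in for the original, fully
    general location parser; it is called for anything unusual
    (joins, fuzzy positions, remote references and so on).
    """
    match = _SIMPLE.fullmatch(text)
    if match:
        return int(match.group(1)), int(match.group(2)), +1
    match = _COMPLEMENT.fullmatch(text)
    if match:
        return int(match.group(1)), int(match.group(2)), -1
    # Not one of the special cases: fall back on the general parser.
    return slow_parser(text)
```

The gain depends entirely on how many locations in a given file hit the
fast path, and the cost is a second code path that must agree with the
general parser on every input it accepts.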

> > >  I reckon this could be faster, for example the sequence parsing could
> > >  be threaded on multi-core machines.

Did you mean simply one GenBank file per core, or something more
complicated where parsing a single file is done using multiple cores?
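
For the simple "one file per core" reading, a rough sketch might look
like the following.  It assumes separate processes rather than threads,
since pure-Python parsing in threads is generally serialised by the
interpreter lock.  count_records is only a placeholder for real
parsing (it counts LOCUS lines), and parse_files, worker and
executor_cls are hypothetical names, not anything in Biopython.

```python
from concurrent.futures import ProcessPoolExecutor


def count_records(path):
    """Placeholder for real parsing: count LOCUS lines in one file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith("LOCUS"))


def parse_files(paths, worker=count_records, executor_cls=ProcessPoolExecutor):
    """Run worker on each path in a pool; results keep the input order."""
    with executor_cls() as pool:
        return list(pool.map(worker, paths))
```

When run as a script, the call to parse_files should sit under an
"if __name__ == '__main__':" guard so that worker processes can be
started safely on every platform.  Splitting the parsing of a single
file across cores is a different and harder problem, as record
boundaries would have to be found before the work could be divided.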

Peter