[Biopython-dev] BioSQL : BatchLoader
biopython at maubp.freeserve.co.uk
Wed Apr 23 08:56:33 UTC 2008
> GenBank parsing with psyco - 1 minute, 20 seconds
> GenBank parsing without psyco - 2 minutes, 15 seconds
> BioSQL.Loader/psyco - 4 minutes, 54 seconds
> BioSQL.Loader without psyco - 6 minutes, 10 seconds
> BatchLoader/psyco - 1 minute, 38 seconds
> BatchLoader without psyco - 2 minutes, 42 seconds
That's impressive - you seem to have got the database side of things
down to about 30 seconds, a fraction of the time it takes to parse the
GenBank file! Although, as you pointed out, there are a lot of provisos here.
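The database-side figure can be checked by subtracting the plain parsing time from each loader time in the quoted numbers. Here is a quick sketch of that arithmetic. It assumes each loader timing includes the parsing step, and the variable names are purely illustrative:

```python
def seconds(minutes, secs):
    """Convert a 'minutes, seconds' timing into plain seconds."""
    return minutes * 60 + secs

# Timings quoted above, keyed by whether psyco was enabled.
parse = {True: seconds(1, 20), False: seconds(2, 15)}
loader = {True: seconds(4, 54), False: seconds(6, 10)}
batch = {True: seconds(1, 38), False: seconds(2, 42)}

# Assumption: each loader run includes parsing, so
# database time = total time - parsing time.
batch_overhead = {psyco: batch[psyco] - parse[psyco] for psyco in (True, False)}
loader_overhead = {psyco: loader[psyco] - parse[psyco] for psyco in (True, False)}

print(batch_overhead)   # {True: 18, False: 27}
print(loader_overhead)  # {True: 214, False: 235}
```

So the BatchLoader adds roughly 18 to 27 seconds on top of parsing, against roughly 214 to 235 seconds for BioSQL.Loader.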
There are still some slow bits in the current GenBank parser which
would be an obvious next target for you in your quest for speed. I
did a little investigation a while ago, and concluded the parsing of
the feature locations was the biggest bottleneck. However, this is a
rather complicated lump of code, so it's not such an easy task. I
tried out a "hack" which special-cased the most common feature
location types, with a fallback to the original parser, which gave
much better performance. I didn't check this in as it made some
already complex code WAY more complicated!
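To illustrate the general shape of such a special case (this is not the actual hack, which was never checked in), here is a minimal sketch. It assumes the common cases are plain "start..end" locations and their "complement(...)" form. The function names and the returned tuple are hypothetical, and the slow parser is only a stand-in:

```python
import re

# Fast path patterns for the two location shapes assumed to be most common.
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")

def parse_location_slow(text):
    """Hypothetical stand-in for the full, general-purpose location parser."""
    raise NotImplementedError("general parser not shown: %r" % text)

def parse_location(text, fallback=parse_location_slow):
    """Return (start, end, strand), using coordinates exactly as written."""
    match = _SIMPLE.match(text)
    if match:
        return int(match.group(1)), int(match.group(2)), +1
    match = _COMPLEMENT.match(text)
    if match:
        return int(match.group(1)), int(match.group(2)), -1
    # Anything else (join(...), fuzzy ends, remote references, ...) takes
    # the slow but complete route.
    return fallback(text)
```

The cost is the one mentioned above: two code paths now have to agree on every location they both accept, which is where the extra complexity comes from.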
> > > I reckon this could be faster, for example the sequence parsing could
> > > be threaded on multi-core machines.
Did you mean simply one GenBank file per core, or something more
complicated where parsing a single file is done using multiple cores?
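For the simpler of the two readings, one file per core, a minimal sketch might look like the following. It uses worker processes rather than threads, since CPU-bound work in CPython threads is serialised by the global interpreter lock. The record counting is only a placeholder for real parsing, and all function names are illustrative:

```python
from multiprocessing import Pool
import sys

def count_records(handle):
    """Count GenBank records by their '//' terminator lines.

    Placeholder for real parsing work; accepts any iterable of lines.
    """
    return sum(1 for line in handle if line.rstrip() == "//")

def parse_file(path):
    """Work done by one worker: process a single file from start to end."""
    with open(path) as handle:
        return path, count_records(handle)

def parse_files(paths, processes=None):
    """Farm out one file per task; processes=None means one worker per core."""
    with Pool(processes) as pool:
        return dict(pool.map(parse_file, paths))

if __name__ == "__main__":
    paths = sys.argv[1:]
    if paths:
        for path, count in sorted(parse_files(paths).items()):
            print(path, count)
```

Splitting a single file across cores would be harder, since the record boundaries have to be found before the work can be divided.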