[Biopython-dev] BioSQL : BatchLoader

Wed Apr 23 08:08:57 EDT 2008

Peter wrote:

> That's impressive - you seem to have got the database side of things
> down to about 30 seconds; a fraction of the time to parse the GenBank
> file!  Although, as you pointed out, there are a lot of provisos here.

Yep.

Would it be helpful to do anything further with this code, i.e. put it
into CVS and document it on the Wiki, perhaps once it has been tested a
bit more?

> There are still some slow bits in the current GenBank parser which
> would be an obvious next target for you in your quest for speed.  I
> did a little investigation a while ago, and concluded the parsing of
> the feature locations was the biggest bottleneck.  However, this is a
> rather complicated lump of code, so it's not such an easy task.  I
> tried out a "hack" which special-cased the most common feature
> location types, with a fall back on the original parser, which gave
> much better performance.  I didn't check this in as it made some
> already complex code WAY more complicated!

Aha, sounds good. I haven't profiled the Biopython code yet, but I will
check this. I'm mostly dealing with bacterial sequences, which tend to
have simple location identifiers, so there could well be some mileage here.
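
To make the special-casing idea concrete, here is a rough sketch of what
a fast path with a fallback might look like. This is not the code Peter
tried and not the existing Biopython parser: the function name, the
tuple it returns and the slow_parser argument are all made up for
illustration, and it only recognises the two plainest location forms
("start..end" and "complement(start..end)").

```python
import re

# Hypothetical fast path: only the two simplest location forms are handled.
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")


def parse_location(text, slow_parser):
    """Return (start, end, strand) for simple locations, else defer.

    slow_parser is a stand-in for the full (slow but general) location
    parser; it is only called when neither fast pattern matches.
    """
    match = _SIMPLE.match(text)
    if match:
        return (int(match.group(1)), int(match.group(2)), +1)
    match = _COMPLEMENT.match(text)
    if match:
        return (int(match.group(1)), int(match.group(2)), -1)
    # Anything else (join(...), fuzzy ends, remote references, ...) falls back.
    return slow_parser(text)
```

Whether this pays off depends on how many locations in a given file hit
the fast path, which is something profiling would have to confirm.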

>>>>  I reckon this could be faster, for example the sequence parsing could
>>>>  be threaded on a multi-core machine.
> 
> Did you mean simply one GenBank file per core, or something more
> complicated where parsing a single file is done using multiple cores?

I mean process one GenBank file per core.

Locally, that would mean that on a 4-core machine you could have 3 parser
threads working concurrently, each passing its generated Seq object to
the Loader once it has been read.
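
As a rough sketch of that arrangement (not the BatchLoader code itself),
something along these lines would do: a pool of workers parses one file
each, and a single loader consumes the records as they become available.
The names parse_and_load, parse_file and load_record are placeholders I
have made up here; parse_file is assumed to take a path and return an
iterable of records, and load_record is assumed to write one record to
the database.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_and_load(paths, parse_file, load_record, workers=3):
    """Parse each file in a worker and feed the results to one loader.

    parse_file and load_record are hypothetical callables supplied by
    the caller. Returns the number of records handed to the loader.
    """
    loaded = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One parsing task per GenBank file.
        futures = [pool.submit(parse_file, path) for path in paths]
        # Loading happens here, in the calling thread, so the database
        # only ever sees a single writer.
        for future in as_completed(futures):
            for record in future.result():
                load_record(record)
                loaded += 1
    return loaded
```

One caveat: in CPython, threads doing pure-Python parsing share the
global interpreter lock, so to really use several cores the workers
would probably need to be separate processes (for example by swapping in
ProcessPoolExecutor, which requires parse_file and its results to be
picklable).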

Cheers

Nick.