[Biopython-dev] BioSQL : BatchLoader
Nick Loman
n.j.loman at bham.ac.uk
Wed Apr 23 12:08:57 UTC 2008
Peter wrote:
> That's impressive - you seem to have got the database side of things
> down to about 30 seconds; a fraction of the time to parse the GenBank
> file! Although, as you pointed out, there are a lot of provisos here.
Yep.
Would it be helpful to do anything further with this code, i.e. put it
into CVS and document it on the Wiki, perhaps when it's been a bit more tested?
> There are still some slow bits in the current GenBank parser which
> would be an obvious next target for you in your quest for speed. I
> did a little investigation a while ago, and concluded the parsing of
> the feature locations was the biggest bottleneck. However, this is a
> rather complicated lump of code, so it's not such an easy task. I
> tried out a "hack" which special-cased the most common feature
> location types, with a fall back on the original parser, which gave
> much better performance. I didn't check this in as it made some
> already complex code WAY more complicated!
Aha, sounds good. I haven't profiled the Biopython code but I will check
this. I'm mostly dealing with bacterial sequences, which mainly have
simple location identifiers, so there could well be some mileage here.
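
As a rough illustration of the "special-case the common location types, fall
back on the original parser" idea Peter describes, here is a minimal sketch.
It is not the code Peter tried and not Biopython's parser: the function name
parse_location, the two regular expressions, the (start, end, strand) tuple
and the fallback argument are all hypothetical, chosen only for this example.

```python
import re

# Hypothetical fast-path patterns for the two simplest location forms,
# e.g. "100..200" and "complement(100..200)".
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")


def parse_location(text, fallback):
    """Return (start, end, strand) for a feature location string.

    Assumption for this sketch: start is converted to 0-based and end is
    left as-is, so the pair works as a Python slice. Anything that is not
    one of the two simple forms is handed to `fallback`, which stands in
    for the full (slow) location parser.
    """
    match = _SIMPLE.match(text)
    if match:
        return int(match.group(1)) - 1, int(match.group(2)), 1
    match = _COMPLEMENT.match(text)
    if match:
        return int(match.group(1)) - 1, int(match.group(2)), -1
    # Fuzzy positions, join(...), order(...) etc. take the slow path.
    return fallback(text)
```

The point of the design is that the fast path never has to be complete: any
location it does not recognise goes through the existing parser unchanged, so
only the common cases get quicker.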
>>>> I reckon this could be faster, for example the sequence parsing could
>>>> be threaded on multi-core machines.
>
> Did you mean simply one GenBank file per core, or something more
> complicated where parsing a single file is done using multiple cores?
I mean process one GenBank file per core.
Locally, that would mean that on a 4-core machine you could have 3 parser
threads working concurrently, each passing the generated Seq object to
the Loader when read.
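
A minimal sketch of that "one file per worker, single loader" layout, under
some assumptions: load_files, parse and load are hypothetical names, with
parse standing in for the GenBank parsing step and load for the BatchLoader
call. The sketch defaults to worker processes rather than threads because, in
CPython, the global interpreter lock generally stops pure-Python threads from
parsing on several cores at once.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def load_files(paths, parse, load, workers=3,
               executor_cls=ProcessPoolExecutor):
    """Parse each path in a worker and load results in the caller.

    `parse(path)` runs in the pool; `load(result)` runs in the calling
    process, one result at a time, in whatever order parsing finishes.
    Returns the number of results handed to `load`.
    """
    loaded = 0
    with executor_cls(max_workers=workers) as pool:
        futures = [pool.submit(parse, path) for path in paths]
        for future in as_completed(futures):
            load(future.result())  # re-raises any error from the worker
            loaded += 1
    return loaded
```

With the default process pool, parse has to be a picklable top-level function
and its return value has to be picklable too. Keeping load in the calling
process means only one database connection is needed.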
Cheers
Nick.