[Biopython-dev] BioSQL : BatchLoader

Wed Apr 23 15:46:35 UTC 2008

Peter wrote:

> I'm not ready to put this into the main Biopython CVS.  But by all
> means, add a new page to the wiki to describe your approach.
> Hopefully there are a few others who might be interested, and we'll
> see.

Okay!

>>  I mean process one GenBank file per core.
>>
>>  Locally that would mean on a 4-core machine you could have 3 parser threads
>> working concurrently, each passing the generated Seq object to the Loader
>> when read.
> 
> I see - that means there is only one thread/job writing to the
> database, which keeps that side of things thread-safe.  To be honest,
> unless you are trying to import several hundred bacterial genomes into
> BioSQL, I don't think this level of complexity is a worthwhile
> payoff.  Right now, I would target the GenBank parsing itself (which
> would be useful outside the task of loading sequences into BioSQL).

I agree. I will take a look at GenBank parsing next, and then at
concurrency after that.
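
For illustration, the layout discussed above (several parsers running
concurrently, with a single loader doing all the database writes) could
be sketched roughly as follows. This is only a sketch under stated
assumptions: parse_file, load_all and load_record are hypothetical names,
and parse_file is a stand-in rather than the real GenBank parser. In real
use it might wrap Bio.SeqIO.parse(path, "genbank"), and load_record might
wrap a BioSQL load call. Note also that with CPython threads, CPU-bound
parsing is unlikely to speed up because of the GIL, so a process pool
would probably be needed to actually use one core per file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_file(path):
    # Hypothetical stand-in for a GenBank parser: returns a list of
    # "records" for one file. A real version might return the records
    # produced by Bio.SeqIO.parse(path, "genbank").
    return ["%s:record%d" % (path, i) for i in range(2)]


def load_all(paths, load_record, n_parsers=3):
    """Parse files concurrently; load records from one thread only."""
    loaded = 0
    with ThreadPoolExecutor(max_workers=n_parsers) as pool:
        futures = [pool.submit(parse_file, p) for p in paths]
        # Only this (calling) thread ever calls load_record, so the
        # database side never sees concurrent writers.
        for future in as_completed(futures):
            for record in future.result():
                load_record(record)
                loaded += 1
    return loaded


if __name__ == "__main__":
    store = []  # stands in for the database
    count = load_all(["a.gbk", "b.gbk", "c.gbk", "d.gbk"], store.append)
    print(count, "records loaded")
```

The parsers never touch the database themselves; they only hand back
finished records, which keeps the single-writer property Peter describes
below.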

The reason I'm doing this is that I need to import all 1686 
complete/incomplete bacterial genomes in RefSeq - and plenty more besides!

> Something else you may want to consider is timing the BioPerl scripts
> for importing a GenBank file into BioSQL.  There will probably be some
> minor differences in their interpretation of the data and exactly how
> they store it, but it would be a useful benchmark.

I did this; it was incredibly slow, at least 5x slower. We've been using
BioPerl for some time. I realised I needed a faster script, so I
investigated the same approach with BioPerl, but I thought I'd be able to
hack the Biopython code a bit faster, as its BioSQL code seems a bit
less complex. Plus Python is easy to read ;)

Cheers

Nick.