[Biopython-dev] BioSQL : BatchLoader
Nick Loman
n.j.loman at bham.ac.uk
Wed Apr 23 15:46:35 UTC 2008
Peter wrote:
> I'm not ready to put this into the main Biopython CVS. But by all
> means, add a new page to the wiki to describe your approach.
> Hopefully there are a few others who might be interested, and we'll
> see.
Okay!
>> I mean process one GenBank file per core.
>>
>> Locally that would mean on a 4-core machine you could have 3 parser threads
>> working concurrently, each passing the generated Seq object to the Loader
>> when read.
>
> I see - that means there is only one thread/job writing to the
> database, which keeps that side of things thread-safe. To be honest,
> unless you are trying to import several hundred bacterial genomes into
> BioSQL, I don't think this level of complexity is a worthwhile
> payoff. Right now, I would target the GenBank parsing itself (which
> would be useful outside the task of loading sequences into BioSQL).
I agree, I will take a look at GenBank parsing next, and then
concurrency after that.
The reason I'm doing this is that I need to import all 1686
complete/incomplete bacterial genomes in RefSeq - and plenty more besides!
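For reference, the one-file-per-core idea quoted above could be sketched
roughly as below. This is only an illustration under stated assumptions, not
the BatchLoader code: load_in_parallel, parse_file and load_records are
hypothetical names, and the Biopython/BioSQL calls appear only in comments as
one possible way to fill them in. Parsing runs in worker processes, while the
loading step runs only in the calling process, so a single job writes to the
database.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def load_in_parallel(paths, parse_file, load_records, workers=3,
                     executor_cls=ProcessPoolExecutor):
    """Parse each file in a worker; load the results from the caller only.

    parse_file(path)      -> list of records (hypothetical; runs in a worker)
    load_records(records) -> number of records loaded (hypothetical; runs
                             only here, so there is a single database writer)
    """
    total = 0
    with executor_cls(max_workers=workers) as pool:
        # One parsing job per input file.
        futures = [pool.submit(parse_file, path) for path in paths]
        # Hand each parsed batch to the loader as soon as it is ready.
        for future in as_completed(futures):
            total += load_records(future.result())
    return total


# One possible pairing with Biopython (an assumption, not tested here):
#
#     from Bio import SeqIO
#
#     def parse_file(path):
#         return list(SeqIO.parse(path, "genbank"))
#
#     # db is a BioSQL database object, e.g. server["bacteria"]
#     count = load_in_parallel(genbank_paths, parse_file, db.load)
```

With the default process pool, parse_file has to be a module-level function
and the records have to be picklable. On a 4-core machine, workers=3 leaves
one core for the loading side.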
> Something else you may want to consider is timing the BioPerl scripts
> for importing a GenBank file into BioSQL. There will probably be some
> minor differences in their interpretation of the data and exactly how
> they store it, but it would be a useful benchmark.
I did this, and it was incredibly slow: at least 5x slower. We've been
using BioPerl for some time. I realised I needed a faster script, so I
investigated the same approach with BioPerl, but I thought I'd be able to
hack the Biopython stuff a bit faster, as the BioSQL stuff seems a bit
less complex. Plus Python is easy to read ;)
Cheers
Nick.