[Biopython-dev] BioSQL : BatchLoader

Wed Apr 23 04:09:54 EDT 2008

Hi Peter

>>  As importing data into PostgreSQL is much faster when using the batch
>> "COPY" method I decided I would hack BioSQL.Loader to produce COPY
>> statements for the bulk of the data in a typical GenBank file.
> 
> Can I ask what version of Biopython you're using?

1.45.
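
For anyone following along, here is a rough sketch of what "producing COPY statements" means in practice. It is not the BatchLoader code itself; the helper names `escape_copy_value` and `copy_block` are made up for illustration, and the `biosequence` columns are only an example. It assumes PostgreSQL's default text format for COPY (tab-separated columns, `\N` for NULL, `\.` as the end-of-data marker).

```python
def escape_copy_value(value):
    """Render one value for PostgreSQL's COPY text format (hypothetical helper)."""
    if value is None:
        return "\\N"  # COPY's default marker for NULL
    text = str(value)
    # Escape the backslash first so later escapes are not doubled up.
    for raw, escaped in (("\\", "\\\\"), ("\t", "\\t"),
                         ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return text


def copy_block(table, columns, rows):
    """Build a 'COPY ... FROM stdin;' block that psql can read from a file."""
    header = "COPY %s (%s) FROM stdin;" % (table, ", ".join(columns))
    body = ["\t".join(escape_copy_value(v) for v in row) for row in rows]
    # "\." on a line of its own ends the data section.
    return "\n".join([header] + body + ["\\."]) + "\n"


if __name__ == "__main__":
    print(copy_block("biosequence", ["bioentry_id", "seq"],
                     [(1, "ACGT"), (2, None)]))
```

The point is that one block like this replaces thousands of individual INSERT statements, which is where most of the speed-up should come from.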

> is there anything you think should be
> added to the wiki documentation?:
> http://biopython.org/wiki/BioSQL

I've added a few lines on Postgres.

>>  As index updating/foreign key checking is also slow, I split the BioSQL
>> schema. I put table definitions in one file and then indexes/foreign key
>> constraints in a separate one.
> 
> While this is fine for your own use - you'd have to take this up
> on the BioSQL mailing list if you wanted it to become a standard (i.e.
> it's not just up to us at Biopython).  It might be worth moving some of
> this discussion there anyway.

Yep, appreciate that!

The problem is that you would never want non-indexed tables if you were
updating with the traditional 'interactive' scripts, as those scripts
will slow to a crawl as more data is imported.

So this approach is only really good for this kind of batch-import model.

However, I guess it is still reasonably friendly to ask people to import
two scripts in a row.
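
As a rough illustration of the split (I did mine by hand, so this is only a sketch), something like the following could separate a schema file into a "tables" part and a "deferred" part holding the indexes and foreign keys. The function name `split_schema` is hypothetical. It assumes statements are separated by ";" with no semicolons inside string literals, and that comments are full-line `--` comments.

```python
def split_schema(sql_text):
    """Split DDL into (table statements, deferred index/constraint statements)."""
    # Drop full-line "--" comments so they cannot confuse the naive split.
    lines = [ln for ln in sql_text.splitlines()
             if not ln.lstrip().startswith("--")]
    statements = [s.strip() for s in "\n".join(lines).split(";") if s.strip()]

    tables, deferred = [], []
    for stmt in statements:
        head = " ".join(stmt.split()[:2]).upper()
        # Index creation and ALTER TABLE (e.g. foreign keys) wait until
        # after the bulk load; everything else goes into the first script.
        if head in ("CREATE INDEX", "CREATE UNIQUE", "ALTER TABLE"):
            deferred.append(stmt + ";")
        else:
            tables.append(stmt + ";")
    return tables, deferred


if __name__ == "__main__":
    demo = """
    -- toy schema, not the real BioSQL one
    CREATE TABLE a (id INT);
    CREATE INDEX a_id ON a(id);
    ALTER TABLE a ADD CONSTRAINT fk FOREIGN KEY (id) REFERENCES b(id);
    """
    first, second = split_schema(demo)
    print("\n".join(first))
    print("\n".join(second))
```

The load order is then: first script, bulk COPY of the data, second script.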

>>   load_seqdatabase.pl   - not directly comparable as needs foreign
>>                          keys/rules to run correctly, but
>>                          conservatively >20 minutes
>>  +Import the output     - 8 seconds
> 
> Did you run the numbers for a plain Biopython BioSQL.Loader import
> (without psyco)?  If you do go back and run some more tests, could you
> also try just parsing the GenBank file without actually doing anything
> with the data (to see what the overhead is on your machine).

Yep, sure.

GenBank parsing without psyco - 2 minutes, 15 seconds
GenBank parsing with psyco    - 1 minute, 20 seconds

>>   BioSQL.Loader/psyco   - 4 minutes, 54 seconds
BioSQL.Loader without psyco - 6 minutes, 10 seconds
>>   BatchLoader/psyco     - 1 minute, 38 seconds
BatchLoader without psyco     - 2 minutes, 42 seconds
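
Subtracting the bare parsing time from each run gives a rough idea of what each loader adds on top of parsing. This assumes the parsing cost inside each loader is the same as in the parse-only test, which I have not checked, so treat it as back-of-envelope arithmetic on the figures above.

```python
def seconds(minutes, secs):
    """Convert a 'minutes, seconds' timing into seconds."""
    return minutes * 60 + secs


# Parse-only timings from the test above.
parse_only = {"psyco": seconds(1, 20), "no psyco": seconds(2, 15)}

# Full runs, keyed by (loader, mode).
runs = {
    ("BioSQL.Loader", "psyco"): seconds(4, 54),
    ("BioSQL.Loader", "no psyco"): seconds(6, 10),
    ("BatchLoader", "psyco"): seconds(1, 38),
    ("BatchLoader", "no psyco"): seconds(2, 42),
}

# Time spent beyond bare GenBank parsing, in seconds.
overhead = {key: total - parse_only[key[1]] for key, total in runs.items()}

if __name__ == "__main__":
    for (loader, mode), extra in sorted(overhead.items()):
        print("%-14s %-9s %4d s beyond parsing" % (loader, mode, extra))
```

On these numbers, parsing accounts for most of the BatchLoader run time but well under half of the BioSQL.Loader run time.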

>>  I reckon this could be faster; for example, the sequence parsing could be
>>  threaded on multi-core machines.
> 
> You should in principle be able to run multiple imports even without
> making any code changes to Biopython, although I suspect there is some
> scope for clashes (e.g. two threads both adding new entries to the
> taxonomy tables).

Yep, with the interactive version I reckon this would work without many
problems (most taxa should be pulled out of NCBI anyway), but with my
flat-file version it wouldn't work unless the code were specifically
designed for it. I could parallelise the GB parsing stage though, as that
is the current bottleneck for my app.
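
One possible way to parallelise the parsing stage, sketched here and not something I have implemented: cut the flat file into per-record chunks at the "//" terminator lines and hand each chunk to a worker process. The names `iter_genbank_records`, `locus_name` and `parse_in_parallel` are hypothetical, and `locus_name` is only a stand-in for a real worker that would parse the chunk with Biopython.

```python
from concurrent.futures import ProcessPoolExecutor


def iter_genbank_records(handle):
    """Yield the text of one GenBank record at a time from an iterable of lines."""
    buf = []
    for line in handle:
        buf.append(line)
        if line.rstrip() == "//":  # GenBank end-of-record marker
            yield "".join(buf)
            buf = []
    tail = "".join(buf)
    if tail.strip():  # unterminated trailing record, if any
        yield tail


def locus_name(record_text):
    """Stand-in worker: return the name from the LOCUS line, or None."""
    parts = record_text.lstrip().splitlines()[0].split()
    return parts[1] if len(parts) > 1 and parts[0] == "LOCUS" else None


def parse_in_parallel(path, worker=locus_name, processes=2):
    """Run `worker` on every record of the file at `path` in separate processes."""
    with open(path) as handle, ProcessPoolExecutor(max_workers=processes) as pool:
        # The worker must be a module-level function so it can be pickled.
        return list(pool.map(worker, iter_genbank_records(handle), chunksize=16))
```

Results come back in file order, so a single process could still write the flat-file output sequentially and assign primary keys without clashes.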

>>  Code is here:
>>  http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/
>>
>>  I'd be grateful for any feedback on how this might be improved, and how we
>> can make it even faster!
> 
> That seems to be password protected at the moment.

My bad, it's open now.

Regards,

Nick.