[Biopython-dev] BioSQL : BatchLoader

Tue Apr 22 16:50:36 UTC 2008

Dear biopython-developers,

As importing data into PostgreSQL is much faster when using the batch 
"COPY" method I decided I would hack BioSQL.Loader to produce COPY 
statements for the bulk of the data in a typical GenBank file.

As index updating/foreign key checking is also slow, I split the BioSQL 
schema. I put table definitions in one file and then indexes/foreign key 
constraints in a separate one.

I import the schema file, then apply indexes/FK only after the data is 
loaded. Caveat, pbviously this can't be done on a "live" database and it 
relies on only a single import process being run at any one time.

A modified 'Loader' uses a new class called 'FakeTable'. FakeTable acts 
as a very, very basic data store attempting to simulate the behavior of 
Postgres. FakeTable.dump() outputs COPY statements to stdout instead of 
SQL commands.

I benchmarked load_seqdatabase.pl vs. BioSQL.loader vs. FakeTable with a 
GenBank file 42MB large (microbial32.genomic.gbff from RefSeq).

  load_seqdatabase.pl   - not directly comparable as needs foreign
                          keys/rules to run correctly, but
                          conservatively >20 minutes

  BioSQL.Loader/psyco   - 4 minutes, 54 seconds

  BatchLoader/psyco     - 1 minute, 38 seconds
+Import the output     - 8 seconds

Postgres 8.3.1, Gentoo/Linux, 8GB RAM.

As the number of sequence files increases, there should be even greater 
gains, as the interactive version will take longer to execute each query.

This is not production-quality code but might act as a starting poing 
for hacking about with. I would be grateful for any comments. If the 
team felt this would be a useful inclusion into BioPython I am happy to 
work it up a bit more.

A MySQL compatible version would not be very hard, for example.

I reckon this could be faster, for example the sequence parsing could be 
threaded on a multi-core machines.

Code is here:
http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/

I'd be grateful for any feedback on how this might be improved, and how 
we can make it even faster!

Many thanks

Nick.