[Biopython-dev] BioSQL : BatchLoader
Nick Loman
n.j.loman at bham.ac.uk
Tue Apr 22 16:50:36 UTC 2008
Dear biopython-developers,
As importing data into PostgreSQL is much faster when using the batch
"COPY" method I decided I would hack BioSQL.Loader to produce COPY
statements for the bulk of the data in a typical GenBank file.
As index updating/foreign key checking is also slow, I split the BioSQL
schema. I put table definitions in one file and then indexes/foreign key
constraints in a separate one.
I import the schema file, then apply indexes/FK only after the data is
loaded. Caveat, pbviously this can't be done on a "live" database and it
relies on only a single import process being run at any one time.
A modified 'Loader' uses a new class called 'FakeTable'. FakeTable acts
as a very, very basic data store attempting to simulate the behavior of
Postgres. FakeTable.dump() outputs COPY statements to stdout instead of
SQL commands.
I benchmarked load_seqdatabase.pl vs. BioSQL.loader vs. FakeTable with a
GenBank file 42MB large (microbial32.genomic.gbff from RefSeq).
load_seqdatabase.pl - not directly comparable as needs foreign
keys/rules to run correctly, but
conservatively >20 minutes
BioSQL.Loader/psyco - 4 minutes, 54 seconds
BatchLoader/psyco - 1 minute, 38 seconds
+Import the output - 8 seconds
Postgres 8.3.1, Gentoo/Linux, 8GB RAM.
As the number of sequence files increases, there should be even greater
gains, as the interactive version will take longer to execute each query.
This is not production-quality code but might act as a starting poing
for hacking about with. I would be grateful for any comments. If the
team felt this would be a useful inclusion into BioPython I am happy to
work it up a bit more.
A MySQL compatible version would not be very hard, for example.
I reckon this could be faster, for example the sequence parsing could be
threaded on a multi-core machines.
Code is here:
http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/
I'd be grateful for any feedback on how this might be improved, and how
we can make it even faster!
Many thanks
Nick.
More information about the Biopython-dev
mailing list