[Biopython-dev] BioSQL : BatchLoader

Tue Apr 22 17:38:24 UTC 2008

On Tue, Apr 22, 2008 at 5:50 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> Dear biopython-developers,
>
>  As importing data into PostgreSQL is much faster when using the batch
> "COPY" method I decided I would hack BioSQL.Loader to produce COPY
> statements for the bulk of the data in a typical GenBank file.

Can I ask what version of Biopython you're using?  And given you've
got it running on PostgreSQL, is there anything you think should be
added to the wiki documentation?
http://biopython.org/wiki/BioSQL
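
For anyone following along who hasn't used COPY, here is a minimal
sketch of writing rows in PostgreSQL's COPY text format (tab separated
columns, \N for NULL, backslash escapes, and a closing "\." line).  I
couldn't see Nick's code, so this is NOT how his BatchLoader works; the
function names copy_escape and copy_block, and the table and column
names in the usage note, are made up for illustration.

```python
def copy_escape(value):
    """Format one value for PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    text = str(value)
    # Escape the backslash first, so the escapes added below are not
    # themselves escaped a second time.
    text = text.replace("\\", "\\\\")
    text = text.replace("\t", "\\t")
    text = text.replace("\n", "\\n")
    text = text.replace("\r", "\\r")
    return text


def copy_block(table, columns, rows):
    """Return a 'COPY ... FROM stdin' block as a single string.

    table and columns are trusted identifiers here (no quoting is
    attempted); rows is an iterable of tuples, one value per column.
    """
    lines = ["COPY %s (%s) FROM stdin;" % (table, ", ".join(columns))]
    for row in rows:
        lines.append("\t".join(copy_escape(value) for value in row))
    lines.append("\\.")  # end-of-data marker
    return "\n".join(lines) + "\n"
```

Writing copy_block("example_table", ["id", "name"], rows) to a file and
feeding that file to psql should load all the rows in one statement,
rather than one INSERT per row.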

>  As index updating/foreign key checking is also slow, I split the BioSQL
> schema. I put table definitions in one file and then indexes/foreign key
> constraints in a separate one.

While this is fine for your own use, you'd have to take this up on
the BioSQL mailing list if you wanted it to become a standard (i.e.
it's not just up to us at Biopython).  It might be worth moving some of
this discussion there anyway.

>  I benchmarked load_seqdatabase.pl vs. BioSQL.loader vs. FakeTable with a
> GenBank file 42MB large (microbial32.genomic.gbff from RefSeq).
>
>   load_seqdatabase.pl   - not directly comparable as needs foreign
>                          keys/rules to run correctly, but
>                          conservatively >20 minutes
>
>   BioSQL.Loader/psyco   - 4 minutes, 54 seconds
>
>   BatchLoader/psyco     - 1 minute, 38 seconds
>  +Import the output     - 8 seconds
>
>  Postgres 8.3.1, Gentoo/Linux, 8GB RAM.

Did you run the numbers for a plain Biopython BioSQL.Loader import
(without psyco)?  If you do go back and run some more tests, could you
also try just parsing the GenBank file without actually doing anything
with the data (to see what the parsing overhead is on your machine)?
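
Something along these lines would do for the parse-only timing.  It is
just a sketch: the helper names time_iteration and time_genbank_parse
are my own, and it assumes a Biopython recent enough to include
Bio.SeqIO.  The records are counted and then thrown away, so the time
reported is parsing alone, with no database work.

```python
import time


def time_iteration(records):
    """Consume an iterator, returning (record_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in records:
        count += 1  # each record is discarded once counted
    return count, time.perf_counter() - start


def time_genbank_parse(path):
    """Time parsing a GenBank file without using the parsed data."""
    # Imported here so time_iteration can be used without Biopython.
    from Bio import SeqIO
    with open(path) as handle:
        return time_iteration(SeqIO.parse(handle, "genbank"))
```

Calling time_genbank_parse("microbial32.genomic.gbff") and subtracting
the result from the full import time would give a rough idea of how
much of the total is spent talking to the database.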

>  I reckon this could be faster, for example the sequence parsing could be
>  threaded on multi-core machines.

You should in principle be able to run multiple imports even without
making any code changes to Biopython, although I suspect there is some
scope for clashes (e.g. two threads both adding new entries to the
taxonomy tables).
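
If you have several input files, one simple way to try this is to
share them out between separate import processes, one per core.  A
rough sketch (the function name partition is mine, and this only
divides up the work; it does nothing about the possible taxonomy
clashes mentioned above):

```python
def partition(paths, n_workers):
    """Deal a list of paths round-robin into n_workers lists.

    Each returned list could then be handed to a separate import
    process.  Some lists will be empty if there are fewer paths
    than workers.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [paths[i::n_workers] for i in range(n_workers)]
```

Round-robin dealing keeps the number of files per process even, but
not necessarily the amount of data, so with files of very different
sizes one process may finish well after the others.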

>  Code is here:
>  http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/
>
>  I'd be grateful for any feedback on how this might be improved, and how we
> can make it even faster!

That seems to be password protected at the moment.

Peter