[Biopython-dev] BioSQL : BatchLoader
n.j.loman at bham.ac.uk
Wed Apr 23 08:09:54 UTC 2008
>> As importing data into PostgreSQL is much faster when using the batch
>> "COPY" method I decided I would hack BioSQL.Loader to produce COPY
>> statements for the bulk of the data in a typical GenBank file.
> Can I ask what version of Biopython you're using?
> Is there anything you think should be
> added to the wiki documentation?
I've added a few lines on Postgres.
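To make the COPY approach concrete, here is a minimal sketch of how rows can be rendered into PostgreSQL's COPY text format and streamed in via psycopg2. The helper names (`copy_escape`, `rows_to_copy_buffer`, `bulk_load`) are my own illustrations, not the actual BatchLoader code, and the `bulk_load` function assumes a psycopg2 connection:

```python
import io

def copy_escape(value):
    """Escape one field for PostgreSQL's COPY text format.

    None becomes \\N; backslash, tab, newline and carriage return are
    escaped so they cannot break the row/field structure.
    """
    if value is None:
        return r"\N"
    return (str(value)
            .replace("\\", "\\\\")
            .replace("\t", "\\t")
            .replace("\n", "\\n")
            .replace("\r", "\\r"))

def rows_to_copy_buffer(rows):
    """Render rows (sequences of fields) as an in-memory COPY stream."""
    lines = ("\t".join(copy_escape(f) for f in row) for row in rows)
    return io.StringIO("\n".join(lines) + "\n")

def bulk_load(conn, table, columns, rows):
    """Stream rows into `table` in one round trip (conn: psycopg2 connection)."""
    cur = conn.cursor()
    # copy_from defaults match the format above: sep='\t', null='\\N'
    cur.copy_from(rows_to_copy_buffer(rows), table, columns=columns)
    conn.commit()
```

The win over per-row INSERT statements is that the server parses one statement and ingests the data as a stream, rather than planning and executing an INSERT per record.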
>> As index updating/foreign key checking is also slow, I split the BioSQL
>> schema. I put table definitions in one file and then indexes/foreign key
>> constraints in a separate one.
> While this is fine for your own use - you'd have to take this up
> on the BioSQL mailing list if you wanted it to become a standard (i.e.
> it's not just up to us at Biopython). It might be worth moving some of
> this discussion there anyway.
Yep, appreciate that!
The problem is that you would never want non-indexed tables if you were
updating with the traditional 'interactive' scripts, as those slow to a
crawl as more data is imported.
So this approach is only really suited to this kind of batch-import model.
However, I think it is still reasonably friendly to ask people to run
2 SQL scripts in a row.
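The split-schema idea can be illustrated end-to-end with the standard library's sqlite3 module (Postgres behaves the same way in principle, though the syntax and the size of the win differ). The table and index names here are invented for the example:

```python
import sqlite3

def load_then_index(rows):
    """Create a bare table, bulk-insert, and only then add the index.

    Illustrates the split-schema idea: indexes and constraints live in a
    second script that runs after the bulk import, so the import itself
    never pays per-row index-maintenance costs.
    """
    conn = sqlite3.connect(":memory:")
    # Script 1: table definitions only -- no indexes, no constraints.
    conn.execute("CREATE TABLE bioentry (accession TEXT, name TEXT)")
    conn.executemany("INSERT INTO bioentry VALUES (?, ?)", rows)
    # Script 2: indexes applied once, after all the data is in.
    conn.execute("CREATE INDEX bioentry_acc ON bioentry (accession)")
    return conn
```

Building the index once over the finished table is a single sort, whereas maintaining it during the import means a B-tree update per inserted row.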
>> load_seqdatabase.pl - not directly comparable as it needs foreign
>> keys/rules to run correctly, but
>> conservatively >20 minutes
>> +Import the output - 8 seconds
> Did you run the numbers for a plain Biopython BioSQL.Loader import
> (without psyco)? If you do go back and run some more tests, could you
> also try just parsing the GenBank file without actually doing anything
> with the data (to see what the overhead is on your machine).
GenBank parsing without psyco - 2 minutes, 15 seconds
GenBank parsing with psyco - 1 minute, 20 seconds
>> BioSQL.Loader/psyco - 4 minutes, 54 seconds
BioSQL.Loader without psyco - 6 minutes, 10 seconds
>> BatchLoader/psyco - 1 minute, 38 seconds
BatchLoader without psyco - 2 minutes, 42 seconds
>> I reckon this could be faster, for example the sequence parsing could be
>> threaded on a multi-core machine.
> You should in principle be able to run multiple imports even without
> making any code changes to Biopython, although I suspect there is some
> scope for clashes (e.g. two threads both adding new entries to the
> taxonomy tables).
Yep, with the interactive version I reckon this would work without many
problems (most taxa should be pulled out of NCBI anyway), but with my
flat-file version it wouldn't work unless specifically designed for it. I
could parallelise the GenBank parsing stage, though, as that is the current
bottleneck for my app.
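One way to parallelise the parsing stage is to split the flat file on the "//" record terminator and farm the record texts out to worker processes. This is a hypothetical sketch, not BatchLoader code; the `locus_name` worker is a stand-in where a real worker would call Bio.SeqIO instead:

```python
import multiprocessing

def split_records(flatfile_text):
    """Split a concatenated GenBank flat file on the '//' record terminator."""
    records, current = [], []
    for line in flatfile_text.splitlines():
        current.append(line)
        if line.startswith("//"):
            records.append("\n".join(current))
            current = []
    return records

def locus_name(record_text):
    """Stand-in worker: pull the LOCUS name from one record's text.

    A real worker would parse the record properly, e.g. with
    Bio.SeqIO.read(io.StringIO(record_text), "genbank").
    """
    for line in record_text.splitlines():
        if line.startswith("LOCUS"):
            return line.split()[1]
    return None

def parse_parallel(flatfile_text, processes=2):
    """Map the worker over the records in a process pool."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(locus_name, split_records(flatfile_text))
```

Because each GenBank record is self-contained, the workers need no shared state; only the database-loading stage would then have to worry about clashes such as concurrent taxonomy inserts.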
>> Code is here:
>> I'd be grateful for any feedback on how this might be improved, and how we
>> can make it even faster!
> That seems to be password protected at the moment.
My bad, it's open now.