[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Wed Jun 9 08:55:37 UTC 2010

On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> The version you tried didn't do anything clever with the SQLite
>> indexes, batched inserts etc. I'm hoping the current code will be
>> faster (although there is likely a penalty from having two switchable
>> back ends). Brent, could you re-run this benchmark with this code:
>> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>> ...
>
> done.

Thank you Brent :)

> the previous times and the current were using py-tcdb not bsddb.
> the author of tcdb made some improvements so it's faster this time,

OK, so you are using Tokyo Cabinet to store the lookup table here
rather than BDB. Link: http://code.google.com/p/py-tcdb/

> and your SeqIO implementation is almost 2x as fast to load as the
> previous one. that's a nice implementation. i didn't try get_raw.

I've got some more re-factoring in mind which should help a little
more (but mainly to make the structure clearer).

> these timings are with your latest version, and the version of
> screed pulled from http://github.com/acr/screed master today.

Having had a quick look, they are using SQLite3 in much the
same way as I was initially. They create the index before loading
(rather than after loading) and they use a single insert per
offset (rather than using a batch in a transaction or the
executemany method). I'm pretty sure from my experiments
those changes would speed up screed's loading time a lot
(probably in line with the speed up I achieved).
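
To illustrate those two changes, here is a minimal sketch using
Python's sqlite3 module. It is not the actual Biopython or screed
code: the table name, column names, function names and batch size
are all made up for illustration. Rows are inserted in batches with
executemany (one commit per batch), and the index on the key column
is only created once all the rows are loaded.

```python
import sqlite3


def build_offset_index(db_path, records, batch_size=10000):
    """Store (key, offset) pairs in SQLite; names here are hypothetical.

    Inserts are sent in batches via executemany, with one commit per
    batch, and the index is created only after all rows are loaded.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE offset_data (record_key TEXT, file_offset INTEGER)"
    )
    batch = []
    for key, offset in records:
        batch.append((key, offset))
        if len(batch) >= batch_size:
            con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
            con.commit()
            batch = []
    if batch:
        # Flush whatever is left over from the last partial batch.
        con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
        con.commit()
    # Building the index after loading avoids updating it on every insert.
    con.execute(
        "CREATE UNIQUE INDEX record_key_index ON offset_data (record_key)"
    )
    con.commit()
    return con


def lookup_offset(con, key):
    """Return the stored file offset for key, or raise KeyError."""
    row = con.execute(
        "SELECT file_offset FROM offset_data WHERE record_key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]
```

The best batch size will depend on the data and the machine, so
treat the default above as a placeholder rather than a tuned value.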

> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 699.210
> search: 51.043
>
> biopython-sqlite
> ----------------
> create: 386.647
> search: 93.391
>
> fileindex
> ---------
> create: 184.088
> search: 48.887

That's got us looking more competitive. As noted above, I think
screed's loading time could be much reduced by tweaking how
they use SQLite3. I wonder what the breakdown for fileindex is
between calling Tokyo Cabinet and the fileindex code itself?
I guess we should try TC as the back end in Bio.SeqIO.index()
for comparison.

Peter

P.S. Could you measure the database file sizes on disk?