[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Mon Jun 7 17:45:57 UTC 2010

Hi all,

Thanks for the lively discussion on the main list,

http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
...
http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html

I've spent the afternoon updating my old branch which uses SQLite
to store the record identifier to file offset mapping. Using the code
on this branch, Bio.SeqIO.index() supports a new optional argument
currently called "db" (other names I like include "cache"; suggestions
welcome):

http://github.com/peterjc/biopython/tree/index-sqlite

The default (False) is not to use SQLite, but to continue with an
in-memory Python dictionary. As long as you have enough RAM
and don't plan to use the index at a later date, this will be fastest.
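
To illustrate what the in-memory mapping holds, here is a minimal
sketch for a FASTA file. It is not the code on the branch: the
function name build_offset_dict is made up for this example, and it
assumes the record identifier is the first word after the ">".

```python
import os
import tempfile


def build_offset_dict(filename):
    """Map each FASTA record identifier to the byte offset of its '>' line.

    Hypothetical helper for illustration only, not Biopython's own code.
    """
    offsets = {}
    # Binary mode, so tell() reports plain byte offsets.
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if words:
                    # Assumption: the identifier is the first word of the title.
                    offsets[words[0].decode("ascii")] = offset
    return offsets


# Small demonstration on a temporary two-record FASTA file.
tmp = tempfile.NamedTemporaryFile(suffix=".fasta", delete=False)
tmp.write(b">alpha desc\nACGT\n>beta\nGGTT\n")
tmp.close()
offsets = build_offset_dict(tmp.name)
with open(tmp.name, "rb") as handle:
    # Jump straight to a record using its stored offset.
    handle.seek(offsets["beta"])
    beta_header = handle.readline()
os.remove(tmp.name)
```

The whole dictionary lives in RAM and is lost when the script exits,
which is the trade-off described above.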

If set to True or a filename, then an SQLite index is used to hold
the offsets. This means very low RAM requirements, but is a lot
slower because the offsets are written to disk and the SQLite
index is updated as we go. I expect this part can be optimised
(e.g. try to build the index at the end, try committing in batches).
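
As a rough sketch of those two optimisation ideas (commit in batches,
and only build the SQLite index once all the rows are in), something
like the following could work. The table and column names
(record_offsets, record_id, file_offset), the function names and the
batch size are all assumptions for this example, not the schema used
on the branch.

```python
import sqlite3


def store_offsets(con, pairs, batch_size=1000):
    """Store (record_id, file_offset) pairs, committing once per batch.

    Hypothetical schema for illustration only.
    """
    con.execute(
        "CREATE TABLE record_offsets (record_id TEXT, file_offset INTEGER)"
    )
    insert_sql = (
        "INSERT INTO record_offsets (record_id, file_offset) VALUES (?, ?)"
    )
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) >= batch_size:
            con.executemany(insert_sql, batch)
            con.commit()  # one commit per batch, not one per record
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch.
        con.executemany(insert_sql, batch)
        con.commit()
    # Build the index only after all rows are inserted. Being UNIQUE,
    # it raises sqlite3.IntegrityError here if an identifier is duplicated.
    con.execute("CREATE UNIQUE INDEX record_id_index ON record_offsets (record_id)")
    con.commit()


def lookup_offset(con, record_id):
    """Return the stored offset, or raise KeyError like a dictionary would."""
    row = con.execute(
        "SELECT file_offset FROM record_offsets WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    return row[0]


# Demonstration with an in-memory database and a tiny batch size.
con = sqlite3.connect(":memory:")
store_offsets(con, [("alpha", 0), ("beta", 17), ("gamma", 42)], batch_size=2)
beta_offset = lookup_offset(con, "beta")
row_count = con.execute("SELECT COUNT(*) FROM record_offsets").fetchone()[0]
```

Whether this actually beats updating the index as we go would need
timing on real files; I have not measured it here.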

I'm still testing this, but I think the core of the work is done.
Once we're happy with the public API, we can concentrate
on things like the SQLite schema, and optimising the code.

Peter

P.S. I know it will need a little work to fail gracefully on Python 2.4
when SQLite isn't installed.
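
One possible way to do that is the usual optional-import pattern,
sketched below. The helper name require_sqlite and the wording of the
error message are made up for this example; the branch may end up
handling it differently.

```python
try:
    import sqlite3
except ImportError:
    # e.g. Python 2.4, where sqlite3 is not in the standard library
    sqlite3 = None


def require_sqlite():
    """Return the sqlite3 module, or raise a clear error if it is missing.

    Hypothetical helper, to be called only when an SQLite index is requested,
    so the default in-memory dictionary keeps working without SQLite.
    """
    if sqlite3 is None:
        raise ImportError(
            "An SQLite-backed index was requested, but the sqlite3 "
            "module is not available in this Python installation."
        )
    return sqlite3


# Checked at module level so the outcome can be inspected directly.
try:
    sqlite_module = require_sqlite()
    sqlite_available = True
except ImportError:
    sqlite_module = None
    sqlite_available = False
```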