[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Mon Aug 31 13:44:44 EDT 2009

On Mon, Aug 31, 2009 at 2:49 PM, Peter wrote:
> All the above make me lean towards a less ambitious target
> (read only dictionary access to a sequence file), which just
> requires having an (on disk) index of file offsets (which could
> be done with SQLite or anything else suitable). This choice
> could even be done on the fly at run time (e.g. we look at the
> size of the file to decide if we should use an in memory index
> or on disk - or start out in memory and if the number of records
> gets too big, switch to on disk).

With the current code (in memory dictionary mapping keys to
file offsets), the 7 million record FASTQ file (1.3GB on disk)
required almost 700MB in memory. Indexing took about 1 min.
This is probably OK for many potential uses.
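
To make the idea concrete, here is a minimal sketch of an in-memory
offset index. It is not the code on the branch: the function names
are made up for illustration, and it assumes strict four-line FASTQ
records (no line wrapping) with the record ID as the first word
after the "@".

```python
def index_fastq_offsets(path):
    """Map each record ID to the byte offset of its '@' line.

    Assumes every record is exactly four lines long.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.startswith(b"@"):
                raise ValueError("Expected '@' at offset %i" % offset)
            key = header[1:].split(None, 1)[0].decode("ascii")
            offsets[key] = offset
            for _ in range(3):
                handle.readline()  # skip sequence, '+', quality
    return offsets


def get_raw_record(path, offsets, key):
    """Seek to the stored offset and return the raw four lines."""
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        return b"".join(handle.readline() for _ in range(4))
```

Only the keys and integer offsets are held in RAM, so memory use
grows with the number of records rather than the file size.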

I just did a quick hack to use shelve (default settings) to hold the
key to file offset mapping. RAM usage was about 10MB and the
index file was about 320MB (it could have been a little more, as
my code cleaned up after itself), but indexing took about 12 minutes.
http://github.com/peterjc/biopython/tree/index-shelve
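
Roughly, the shelve variant amounts to something like the sketch
below. Again this is illustrative rather than the branch code: the
function name is hypothetical and four-line FASTQ records are
assumed. The shelf is opened with default settings apart from
flag="n", which always starts a new, empty index.

```python
import shelve


def build_shelve_index(fastq_path, index_path):
    """Store record ID -> byte offset in an on-disk shelf.

    Assumes every FASTQ record is exactly four lines long.
    """
    with shelve.open(index_path, flag="n") as index:
        with open(fastq_path, "rb") as handle:
            while True:
                offset = handle.tell()
                header = handle.readline()
                if not header:
                    break  # end of file
                # shelve keys must be strings
                key = header[1:].split(None, 1)[0].decode("ascii")
                index[key] = offset
                for _ in range(3):
                    handle.readline()  # skip the rest of the record
```

Reopening the shelf later with shelve.open(index_path, flag="r")
gives read-only access to the offsets without rescanning the file.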

I also did a proof of principle implementation using SQLite to
hold the key to file offset mapping. This also needed only about
10MB of RAM, the SQLite index file was about 400MB and
indexing took about 8 minutes. Perhaps this can be sped up...
http://github.com/peterjc/biopython/tree/index-sqlite
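
For illustration, a minimal sketch of the SQLite variant follows.
The table and column names and the batch_size parameter are
assumptions, not what the branch uses, and four-line FASTQ records
are assumed. It batches the inserts with executemany inside a single
transaction, which is a common way to speed up SQLite bulk loads;
whether it helps on the 7 million record file is untested here.

```python
import sqlite3


def build_sqlite_index(fastq_path, index_path, batch_size=100000):
    """Store record ID -> byte offset in an SQLite table."""
    con = sqlite3.connect(index_path)
    con.execute(
        "CREATE TABLE offsets "
        "(rec_key TEXT PRIMARY KEY, file_offset INTEGER)"
    )
    insert = "INSERT INTO offsets VALUES (?, ?)"
    batch = []
    with open(fastq_path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            key = header[1:].split(None, 1)[0].decode("ascii")
            batch.append((key, offset))
            if len(batch) >= batch_size:
                con.executemany(insert, batch)
                batch = []
            for _ in range(3):
                handle.readline()  # skip the rest of the record
    if batch:
        con.executemany(insert, batch)
    con.commit()  # one commit for the whole load
    con.close()


def lookup_offset(index_path, key):
    """Return the byte offset for key, or raise KeyError."""
    con = sqlite3.connect(index_path)
    try:
        row = con.execute(
            "SELECT file_offset FROM offsets WHERE rec_key = ?",
            (key,),
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise KeyError(key)
    return row[0]
```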

On the bright side, these all work for all the previously supported
indexable file formats, even SFF - which is pretty cool.

The trade-off of 1 minute and 700MB RAM (in memory) versus
8 minutes but only 10MB RAM (using SQLite) means neither
solution will suit every use case. So unless the SQLite dict
approach can be sped up, it may be worthwhile to support
both it and the in-memory index, although I haven't worked
out how best to arrange my code to achieve this elegantly.
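
One possible arrangement, sketched below purely as an assumption,
is to keep the record lookup in a single read-only class that
accepts any mapping of offsets, and to pick the backing store from
the file size. The class and function names and the size threshold
are hypothetical; shelve stands in for whichever on-disk store is
chosen, and four-line FASTQ records are assumed.

```python
import os
import shelve


def iter_fastq_offsets(handle):
    """Yield (record ID, byte offset); assumes four-line records."""
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        yield header[1:].split(None, 1)[0].decode("ascii"), offset
        for _ in range(3):
            handle.readline()  # skip the rest of the record


class OffsetIndexedFastq:
    """Read-only, dictionary-like access to raw FASTQ records."""

    def __init__(self, fastq_path, offsets):
        self._handle = open(fastq_path, "rb")
        self._offsets = offsets  # any mapping: dict, shelf, ...

    def __getitem__(self, key):
        self._handle.seek(self._offsets[key])
        return b"".join(self._handle.readline() for _ in range(4))

    def close(self):
        self._handle.close()
        if hasattr(self._offsets, "close"):
            self._offsets.close()


def open_indexed(fastq_path, index_path,
                 max_in_memory_bytes=100 * 1024 ** 2):
    """Use a dict for small files, an on-disk shelf otherwise."""
    if os.path.getsize(fastq_path) <= max_in_memory_bytes:
        offsets = {}
    else:
        offsets = shelve.open(index_path, flag="n")
    with open(fastq_path, "rb") as handle:
        for key, offset in iter_fastq_offsets(handle):
            offsets[key] = offset
    return OffsetIndexedFastq(fastq_path, offsets)
```

The calling code sees the same interface either way, so the choice
of backend could later be made on record count instead of file size.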

Anyway, using SQLite like this seems workable (especially
since for Python 2.5+ it is included in the standard library).

Another option is the Berkeley DB library (especially if we can
do this following the OBF OBDA standard for the index file),
but while bsddb was included in Python 2.x, it has been
deprecated as of Python 2.6 and removed in Python 3.0.
It is still available as a third-party install though...

Peter