[Biopython-dev] poor man's databases for large sequence files

Tue Sep 25 07:15:48 EDT 2007

On the discussion list I wrote:
> I've been thinking about extending Bio.SeqIO to support a (read only) 
> dictionary like interface for large sequence files (WITHOUT having 
> everything in memory).
> 
> Some of the older Biopython sequence format specific modules have an 
> index_file function and matching Dictionary class to do this (based 
> internally on either Martel/Mindy or a DIY Biopython indexer based on 
> pickle).

Some thoughts and timings using Bio.SwissProt.SProt, and the 1.1 GB 
UniProt file.  I have enough RAM that Linux has probably cached the 
entire flat file for me.  Just in case, I have run these timings a few 
times to be fair.

Note that just counting the records takes about 6 minutes using the 
SeqRecord parser.  I think we can do a lot better.  Anyway, I wanted to 
talk about indexing files as simple read-only databases.

Using the current (old) SProt indexing functions:
index_file - about 7 or 8 mins, one file of 34 MB (small!)
Dictionary (loading the index) - about 16s
random access - well under 0.1s

This old code works using Bio.Index to store the start (seek position) 
and length of each record (as determined by parsing the entire file) 
using cPickle.

In theory, any sequential file format could be handled this way - 
provided the parser leaves the handle's seek position in a sensible 
place when returning records.  This approach will not work for 
non-sequential file formats (e.g. most alignments).
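
To make the offset idea concrete, here is a minimal sketch in current 
Python (where cPickle has been folded into pickle).  It is NOT the 
Bio.Index code: it assumes a FASTA-like file in which every record starts 
with a ">" header line that carries an identifier, and the function names 
(build_offset_index, save_index, load_index, fetch_raw) are made up for 
illustration.

```python
import pickle


def build_offset_index(path):
    """Scan the file once; map record id -> (start, length) in bytes.

    Assumption: FASTA-like input, each record begins with a '>' line
    whose first word is a unique identifier.
    """
    index = {}
    current_id, start = None, 0
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line or line.startswith(b">"):
                # A new header (or end of file) closes the previous record.
                if current_id is not None:
                    index[current_id] = (start, pos - start)
                if not line:
                    break
                current_id = line[1:].split()[0].decode("ascii")
                start = pos
    return index


def save_index(index, index_path):
    """Store the small offset table with pickle."""
    with open(index_path, "wb") as out:
        pickle.dump(index, out, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(index_path):
    """Load the offset table (only do this for index files you trust)."""
    with open(index_path, "rb") as inp:
        return pickle.load(inp)


def fetch_raw(path, index, key):
    """Random access: seek to the record and return its raw bytes."""
    start, length = index[key]
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(length)
```

The raw bytes returned by fetch_raw would still have to be handed to a 
parser to get a record object, so each lookup pays for parsing one record 
but the index itself stays small.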

My experimental code instead stores every SeqRecord object in full using 
cPickle (in one large file), and the seek positions for these pickled 
records in a second small index file (as a dict stored with cPickle).
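
Roughly, the two-file layout looks like the sketch below.  This is an 
illustration rather than my actual experimental code: it uses the modern 
pickle module instead of cPickle, works on any picklable object, and the 
names write_pickled_db and PickledDB are invented for this example.

```python
import pickle


def write_pickled_db(records, data_path, index_path):
    """Pickle each record into one big file and remember where it starts.

    `records` is an iterable of (key, picklable_object) pairs.
    """
    offsets = {}
    with open(data_path, "wb") as data:
        for key, record in records:
            offsets[key] = data.tell()  # seek position of this pickle
            pickle.dump(record, data, protocol=pickle.HIGHEST_PROTOCOL)
    # The second, small file holds only the key -> offset dict.
    with open(index_path, "wb") as idx:
        pickle.dump(offsets, idx, protocol=pickle.HIGHEST_PROTOCOL)


class PickledDB:
    """Read-only, dictionary-like access to the pickled records."""

    def __init__(self, data_path, index_path):
        with open(index_path, "rb") as idx:
            self._offsets = pickle.load(idx)
        self._data = open(data_path, "rb")

    def __getitem__(self, key):
        self._data.seek(self._offsets[key])
        return pickle.load(self._data)  # reads exactly one pickled record

    def __contains__(self, key):
        return key in self._offsets

    def __len__(self):
        return len(self._offsets)

    def keys(self):
        return self._offsets.keys()

    def close(self):
        self._data.close()
```

Only the offset dict is held in memory; each lookup is one seek plus one 
unpickle, with no parsing of the original flat file.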

Experimental code with pickled SeqRecord objects:
indexing file - about 7 or 8 mins (similar), two files, 554 MB (big!)
loading index - under 1s (much faster)
random access - well under 0.1s (similar, maybe faster)

This approach will work on any file format (and even for objects other 
than SeqRecord objects, provided they can be pickled).  It seems to be a 
lot faster when loading the index, at the expense of requiring a LARGE 
index file.  The indexing times for the two methods are very similar - 
about 6 mins of this is parsing the records in the first place.

I haven't yet looked at using the Python shelve library to provide a 
read only dictionary.  Python's marshal library may also be useful.
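
For reference, a shelve-based version might look something like this.  
It is an untimed sketch, not something I have benchmarked; the helper 
names build_shelf and open_readonly are hypothetical, and shelve keys 
must be strings.

```python
import shelve


def build_shelf(records, path):
    """Write (key, picklable_object) pairs into a new shelf at `path`.

    flag="n" always starts from an empty database.  Depending on the
    dbm backend, one or more files with added extensions may be created.
    """
    with shelve.open(path, flag="n") as db:
        for key, record in records:
            db[key] = record


def open_readonly(path):
    """Open an existing shelf for reading only (flag="r")."""
    return shelve.open(path, flag="r")
```

Usage would be along the lines of db = open_readonly("uniprot_shelf"), 
then db["some_id"], and db.close() when finished.  Values are pickled 
behind the scenes, so the file size would presumably be comparable to the 
pickled SeqRecord approach.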

Then there is the Mindy back end, used in Bio.Fasta and Bio.GenBank for 
their index_file and Dictionary classes (which replaced previous 
Bio.Index based code). I haven't timed these.

Peter

P.S. Using any of pickle, shelve or marshal does leave a potential 
security hole if anyone could prepare a malicious index file.
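
One partial mitigation, sketched here following the restricted Unpickler 
pattern described in the Python pickle documentation: if the index is 
just a dict of strings, ints and tuples, loading it needs no classes at 
all, so every global lookup can be refused.  The names IndexUnpickler 
and load_plain_index are made up for this example.  This only covers a 
plain offset index; it would not work as-is for pickled SeqRecord 
objects, which do need classes to be importable.

```python
import pickle


class IndexUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any class or function."""

    def find_class(self, module, name):
        # Plain dicts, strings, ints and tuples never reach this method.
        raise pickle.UnpicklingError(
            "global %s.%s is forbidden in an index file" % (module, name)
        )


def load_plain_index(index_path):
    """Load an index made only of built-in containers and scalars."""
    with open(index_path, "rb") as handle:
        return IndexUnpickler(handle).load()
```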