[BioPython] poor man's databases for large sequence files

Mon Sep 24 17:47:13 EDT 2007

I've been thinking about extending Bio.SeqIO to support a read-only,
dictionary-like interface for large sequence files (WITHOUT holding
everything in memory).

Some of the older format-specific Biopython sequence modules have an
index_file function and a matching Dictionary class to do this (based
internally on either Martel/Mindy or a DIY Biopython indexer built on
pickle).

For a format-agnostic SeqRecord dictionary, the "Shelf" object from
Python's built-in "shelve" library looks like a good choice.  I could
add a Bio.SeqIO.to_shelf() function similar to the existing
Bio.SeqIO.to_dict() function.
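
As a rough sketch of the idea (to_shelf() is only the proposed name, not
an existing Bio.SeqIO function), something like the following could work.
To keep it runnable without Biopython, plain dicts stand in for SeqRecord
objects and parse_fasta() is a hypothetical minimal parser, not
Bio.SeqIO.parse():

```python
import shelve


def parse_fasta(handle):
    """Yield one record at a time from a FASTA handle.

    Each record is a plain dict standing in for a SeqRecord (assumption
    made so the sketch runs without Biopython installed).
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited word of the header.
    words = header.split()
    return {
        "id": words[0] if words else "",
        "description": header,
        "seq": "".join(chunks),
    }


def to_shelf(records, filename, key_function=lambda rec: rec["id"]):
    """Hypothetical to_shelf(): write records to an on-disk shelf.

    Records are consumed one by one, so only a single record needs to be
    in memory at a time.  Duplicate keys raise ValueError.
    """
    # flag="n" always starts from a new, empty database.
    with shelve.open(filename, flag="n") as db:
        for rec in records:
            key = key_function(rec)
            if key in db:
                raise ValueError("Duplicate key %r" % key)
            db[key] = rec
```

Once built, the shelf can be reopened with shelve.open(filename,
flag="r") and used like a dictionary, with each record unpickled from
disk only when its key is looked up.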

The only downside I've thought of so far is updating a shelf database,
something shelve supports but with a few gotchas when the stored values
are mutable (like dictionaries): in-place changes are not saved unless
the shelf is opened with writeback enabled.  The need I am thinking
about addressing is less demanding: read-only, low-memory access to a
large collection of SeqRecords (typically from a large sequence file).
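
A small illustration of the update gotcha, using only the standard
library.  The record layout here (a dict with an "annotations" entry) is
made up for the example; the point is that editing a value fetched from
a shelf changes an unpickled copy, not the stored data:

```python
import os
import shelve
import tempfile

# Throwaway location for the demo database.
path = os.path.join(tempfile.mkdtemp(), "demo")

with shelve.open(path) as db:
    db["rec1"] = {"id": "rec1", "annotations": {}}
    # Gotcha: db["rec1"] returns a fresh unpickled copy, so this edit
    # is made to a temporary object and never reaches the file.
    db["rec1"]["annotations"]["source"] = "lost"

with shelve.open(path) as db:
    lost = db["rec1"]["annotations"]  # the edit above did not persist
    # Workaround: fetch, modify, then assign back under the same key.
    rec = db["rec1"]
    rec["annotations"]["source"] = "kept"
    db["rec1"] = rec

# For the read-only use case, open with flag="r" and just look things up.
with shelve.open(path, flag="r") as db:
    kept = db["rec1"]["annotations"]
```

Opening with writeback=True also avoids the lost update, but it caches
every accessed entry in memory until the shelf is synced or closed,
which works against the low-memory goal.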

Does anyone already use Python's shelve library with sequence data?

Peter