[BioPython] poor man's databases for large sequence files

Tue Sep 25 01:40:21 UTC 2007

Peter wrote:
> I've been thinking about extending Bio.SeqIO to support a read-only,
> dictionary-like interface for large sequence files (WITHOUT having
> everything in memory).
>
> Some of the older Biopython sequence format specific modules have an 
> index_file function and matching Dictionary class to do this (based 
> internally on either Martel/Mindy or a DIY Biopython indexer based on 
> pickle).
>
> When thinking about a format-agnostic SeqRecord dictionary, the "Shelf"
> object from Python's built-in "shelve" library looks like a good choice.
> I could add a Bio.SeqIO.to_shelf() function similar to the existing
> Bio.SeqIO.to_dict() function.
>
> The only downside I've thought of so far is updating a shelf database,
> something supported by shelve but with a few gotchas when dealing with
> non-trivial datatypes (like dictionaries).  The need I am thinking about
> addressing is a little less flexible: read-only, low-memory access to a
> large collection of SeqRecords (typically from a large sequence file).
>
> Does anyone already use Python's shelve library with sequence data?
>   
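
For anyone following along, here is a minimal sketch of the shelve idea
described above. It is only an illustration under stated assumptions:
to_shelf() is the hypothetical function from the proposal, not an existing
Bio.SeqIO call; fasta_records() is a stand-in parser so the example runs
without Biopython; and the shelf stores plain dictionaries rather than real
SeqRecord objects.

```python
import io
import os
import shelve
import shutil
import tempfile


def fasta_records(handle):
    """Stand-in FASTA parser: yield (identifier, description, sequence)."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield _record(title, chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield _record(title, chunks)


def _record(title, chunks):
    # The identifier is taken to be the first word of the title line.
    words = title.split()
    identifier = words[0] if words else ""
    return identifier, title, "".join(chunks)


def to_shelf(records, filename):
    """Hypothetical helper: store each record in a shelf, keyed by identifier."""
    db = shelve.open(filename, flag="n")  # "n" always creates a new, empty shelf
    try:
        for identifier, description, sequence in records:
            if identifier in db:
                raise ValueError("Duplicate key %r" % identifier)
            db[identifier] = {
                "id": identifier,
                "description": description,
                "seq": sequence,
            }
    finally:
        db.close()


# Build the shelf once from a (here, tiny in-memory) sequence file...
example = io.StringIO(">alpha first record\nACGT\nACGT\n>beta\nGGCC\n")
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "records.shelf")
to_shelf(fasta_records(example), path)

# ...then reopen it read-only and look records up by key.
db = shelve.open(path, flag="r")
try:
    print(db["alpha"]["seq"])
    print(sorted(db.keys()))
finally:
    db.close()

shutil.rmtree(workdir)
```

The parser is a generator and each lookup unpickles only the value stored
under the requested key, so the whole file is never held in memory at once.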

Just out of curiosity, Peter, would this extension also handle small
collections of large sequences (finished genomes, for example)?

Sean