[BioPython] poor man's databases for large sequence files

Sean Davis sdavis2 at mail.nih.gov
Tue Sep 25 11:41:25 UTC 2007


Peter wrote:
> Sean Davis wrote:
>> Peter wrote:
>>> I've been thinking about extending Bio.SeqIO to support a (read-only)
>>> dictionary-like interface for large sequence files (WITHOUT having
>>> everything in memory).
>>>
>>> ...
>>>
>>> Does anyone already use python's shelve library with sequence data?
>>>   
>>
>> Just out of curiosity, Peter, would this extension deal with small
>> collections of large sequences (finished genomes, for example)?
> 
> Hi Sean,
> 
> What I had in mind was, say, indexing all of UniProt, which is currently
> 1.1 GB in the SwissProt flat file format, but where each record is pretty
> small.
> 
> However, in theory this (largely unwritten) code could be used on any
> number of records of any size - but you would need enough RAM to hold
> any one record in memory at once, plus some more RAM for the hopefully
> modest database overhead, Python, your script, etc.
> 
> I suppose having all the chromosomes for a given eukaryote (e.g. mouse
> or fruit fly) would also be a sensible example: tens of records, each
> tens of MB in size.  Is that the sort of thing you had in mind, Sean?
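(As an aside, a rough sketch of the shelve approach Peter asks about
further up - not anything that exists in Bio.SeqIO, just one way it could
look.  The filenames, the "swiss" format choice and the example key are
made up, and each record is pickled whole, so any one record still has to
fit in memory on its own:)

import shelve
from Bio import SeqIO

def build_shelf(seq_filename, shelf_filename, format="swiss"):
    """Pickle every SeqRecord into a shelve database keyed by record id."""
    shelf = shelve.open(shelf_filename, flag="n")   # "n" always creates a fresh shelf
    handle = open(seq_filename)
    for record in SeqIO.parse(handle, format):
        shelf[record.id] = record                   # one pickled SeqRecord per key
    handle.close()
    shelf.close()

build_shelf("uniprot_sprot.dat", "uniprot_sprot.shelf")

shelf = shelve.open("uniprot_sprot.shelf", flag="r")  # read-only, dictionary-like
record = shelf["P69905"]                              # hypothetical accession
print(record.description)
shelf.close()

The keys live on disk in the dbm file, so lookups stay cheap; the obvious
cost is pickling and unpickling whole records, compared with just indexing
raw file offsets.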

Yes, that's the sort of thing I had in mind.  Lincoln Stein wrote some
indexing code in Perl that allows essentially random access to sequence
records, as well as to subsets of individual records.  It makes it
possible to do range queries on individual sequences with very modest
memory; on a machine with more memory, one might imagine the queries
becoming very fast as the files get cached.
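To make the idea concrete, here is a minimal Python sketch of the same
trick for plain FASTA - not Lincoln's Perl code, just the general
approach: record where each sequence's data starts and how its lines are
laid out, then seek() straight to the bytes a range query needs.  The
function names and the example file are invented, and it assumes
fixed-width sequence lines within each record.

def build_fasta_index(filename):
    """Return {id: (offset of first seq byte, seq length, bases per line, bytes per line)}."""
    index = {}
    with open(filename, "rb") as handle:
        name = None
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = [handle.tell(), 0, 0, 0]   # sequence starts on the next line
            elif name is not None:
                stripped = len(line.rstrip(b"\r\n"))
                entry = index[name]
                entry[1] += stripped                     # total sequence length
                if entry[2] == 0:
                    entry[2] = stripped                  # bases per line
                    entry[3] = len(line)                 # bytes per line (incl. newline)
    return {key: tuple(value) for key, value in index.items()}

def fetch(filename, index, name, start, end):
    """Return sequence[start:end] (0-based, end exclusive), reading only that region."""
    offset, length, per_line, per_line_bytes = index[name]
    end = min(end, length)
    if end <= start:
        return ""
    first_byte = offset + (start // per_line) * per_line_bytes + (start % per_line)
    last_byte = offset + ((end - 1) // per_line) * per_line_bytes + ((end - 1) % per_line) + 1
    with open(filename, "rb") as handle:
        handle.seek(first_byte)
        data = handle.read(last_byte - first_byte)
    return data.replace(b"\n", b"").replace(b"\r", b"").decode()

# Usage (hypothetical file): pull 1 kb out of a chromosome without
# reading the rest of it.
# idx = build_fasta_index("mouse_chr1.fa")
# print(fetch("mouse_chr1.fa", idx, "chr1", 1000000, 1001000))

The in-memory index is just a few numbers per record, so memory stays
modest however large the sequences are, and repeated queries benefit from
the operating system's file cache.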

Sean


