[BioPython] poor man's databases for large sequence files

Tue Sep 25 08:14:50 UTC 2007

Sean Davis wrote:
> Peter wrote:
>> I've been thinking about extending Bio.SeqIO to support a (read only) 
>> dictionary like interface for large sequence files (WITHOUT having 
>> everything in memory).
>>
>> ...
>>
>> Does anyone already use python's shelve library with sequence data?
>>   
> 
> Just a curiosity, Peter, but would this extension deal with small 
> collections of large sequences (finished genomes, for example)? 
> 

Hi Sean,

What I had in mind was, say, indexing all of UniProt, which is currently 
1.1 GB in the SwissProt flat file format, although each record is pretty 
small.

However, in theory this (largely unwritten) code could be used on any 
number of records of any size. You would need enough RAM to hold any 
one record in memory at once, plus some more for the hopefully modest 
database overhead, Python, your script, etc.
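
To make the memory argument concrete, here is a rough sketch of the idea. 
It is not Biopython code: the function names build_index and fetch_record 
are made up for illustration, and it assumes a plain FASTA file where the 
record ID is the first word after the ">". Only the ID-to-offset 
dictionary stays in memory; each record is read from disk when asked for.

```python
import os
import tempfile


def build_index(path):
    """Map each record ID to the byte offset of its '>' header line.

    Hypothetical helper: assumes FASTA, with the ID being the first
    whitespace-delimited word after '>'.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:  # skip headers with no ID at all
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(path, index, record_id):
    """Read a single record from disk and return (description, sequence)."""
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        header = handle.readline().decode("ascii").rstrip()
        chunks = []
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip().decode("ascii"))
    return header[1:], "".join(chunks)


if __name__ == "__main__":
    # Tiny made-up file, just to exercise the two functions.
    fd, demo_path = tempfile.mkstemp(suffix=".fasta")
    with os.fdopen(fd, "wb") as out:
        out.write(b">seq1 example\nACGT\nACGT\n>seq2\nTTTT\n")
    try:
        demo_index = build_index(demo_path)
        print(fetch_record(demo_path, demo_index, "seq2"))
    finally:
        os.remove(demo_path)
```

The dictionary itself could be kept in something like shelve if even the 
index is too big for memory, but that is a separate design choice from 
the offset trick shown here.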

I suppose having all the chromosomes for a given eukaryote (e.g. mouse 
or fruit fly) would also be a sensible example: tens of records, each 
tens of MB in size. Is that the sort of thing you had in mind, Sean?

Peter