[BioPython] poor man's databases for large sequence files
Peter
biopython at maubp.freeserve.co.uk
Tue Sep 25 08:14:50 UTC 2007
Sean Davis wrote:
> Peter wrote:
>> I've been thinking about extending Bio.SeqIO to support a (read only)
>> dictionary like interface for large sequence files (WITHOUT having
>> everything in memory).
>>
>> ...
>>
>> Does anyone already use python's shelve library with sequence data?
>>
>
> Just a curiosity, Peter, but would this extension deal with small
> collections of large sequences (finished genomes, for example)?
>
Hi Sean,
What I had in mind was, say, indexing all of UniProt, which is currently
1.1 GB in the SwissProt flat file format, although each record is pretty
small. However, in theory this (largely unwritten) code could be used on
any number of records of any size - but you would need enough RAM to hold
any one record in memory at once, plus some more RAM for the hopefully
modest database overhead, Python, your script, etc.
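
To make the idea concrete, here is a rough sketch of one way such a
lookup could work. It is not the unwritten Bio.SeqIO code, just an
illustration in plain Python. It assumes a SwissProt-style flat file in
which every record starts with an "ID   " line and ends with a "//"
line. The function names (build_offset_index, fetch_record) and the two
tiny demo records are made up for this example.

```python
import os
import tempfile


def build_offset_index(path):
    """Scan the file once; map each record ID to (byte offset, byte length)."""
    index = {}
    with open(path, "rb") as handle:
        start = None
        key = None
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"ID   "):
                # Assumption: the record name is the second field of the ID line.
                start = offset
                key = line.split()[1].decode("ascii")
            elif line.startswith(b"//") and key is not None:
                # End of record: remember where it lives, not what it contains.
                index[key] = (start, handle.tell() - start)
                key = None
    return index


def fetch_record(path, index, key):
    """Read a single record from disk using the stored offset and length."""
    offset, length = index[key]
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode("ascii")


# Two invented records, only to exercise the functions above.
demo = (
    "ID   TEST1_HUMAN   Reviewed;   4 AA.\n"
    "SQ   SEQUENCE   4 AA;\n"
    "     MKVL\n"
    "//\n"
    "ID   TEST2_MOUSE   Reviewed;   3 AA.\n"
    "SQ   SEQUENCE   3 AA;\n"
    "     MAG\n"
    "//\n"
)

with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    tmp.write(demo.encode("ascii"))
    demo_path = tmp.name

index = build_offset_index(demo_path)
record = fetch_record(demo_path, index, "TEST2_MOUSE")
print(record)
os.remove(demo_path)
```

Only the ID to (offset, length) mapping stays in memory, so the cost per
record is small regardless of record size; fetching a record reads just
that record. The mapping itself could presumably be stored with
something like shelve so that the file need not be rescanned each time.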
I suppose having all the chromosomes for a given Eukaryote (e.g. mouse
or fruit fly) would also be a sensible example: tens of records, each
tens of MB in size. Is that the sort of thing you had in mind, Sean?
Peter