[Biopython] Indexing large sequence files

Peter biopython at maubp.freeserve.co.uk
Thu Jun 18 22:16:48 UTC 2009


Hi again,

This is off list as I haven't really tested this properly... but using
shelve on a tiny sample file seems to work:

from Bio import SeqIO
import shelve, os
fasta_file = "Doc/examples/ls_orchid.fasta"
index_file = "Doc/examples/ls_orchid.index"
#I don't want to worry about editing an existing index
if os.path.isfile(index_file) :
    os.remove(index_file)
#Create new shelve index
shelf = shelve.open(index_file, flag="n", protocol=2)
for record in SeqIO.parse(open(fasta_file), "fasta") :
    #Each SeqRecord gets pickled and stored under its identifier
    shelf[record.id] = record
shelf.close()
del shelf
#Now test it!
shelf = shelve.open(index_file)
print shelf["gi|2765570|emb|Z78445.1|PUZ78445"]

Perhaps once this has been field tested it would make a good cookbook
example?
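
For the actual job of pulling out the matching records, you could then
do something along these lines on top of that index (again untested -
the wanted_ids list and the output filename are just placeholders):

from Bio import SeqIO
import shelve
index_file = "Doc/examples/ls_orchid.index"
#In practice this would be your list/set of matching identifiers
wanted_ids = ["gi|2765570|emb|Z78445.1|PUZ78445"]
shelf = shelve.open(index_file, flag="r")
#Fetch just the wanted records, without re-parsing the FASTA file
records = (shelf[key] for key in wanted_ids if key in shelf)
out_handle = open("selected.fasta", "w")
count = SeqIO.write(records, out_handle, "fasta")
out_handle.close()
shelf.close()
print("Saved %i records" % count)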

>> Are you looking for matches based on their identifier? If you can afford
>> to have two python sets in memory with 5 million keys, then can you do
>> something like this?:
>>
> I don't have a good sense of whether I can keep 2 * 5million keys in
> dictionaries in python. Haven't tried it before.

To be honest, neither have I. This will ultimately boil down to the
amount of RAM you have and the OS (which may impose limits).

Quick guesstimate: I would say two datasets, times 5 million entries,
times 20 letters per ID, times 1 byte per letter, would be 200 MB - then
allowing something for overheads you should be well under 1 GB.
i.e. Using sets of strings is maybe worth a try (assuming no
stupid mistakes in my numbers).
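
To make that concrete, the sort of thing I had in mind for the set
approach is sketched below (untested, and the two filenames are
made up):

from Bio import SeqIO
#Made up filenames - substitute your two large FASTA files
file_a = "first.fasta"
file_b = "second.fasta"
#Only keep the identifiers as strings, not the full records
ids_a = set(rec.id for rec in SeqIO.parse(open(file_a), "fasta"))
ids_b = set(rec.id for rec in SeqIO.parse(open(file_b), "fasta"))
common = ids_a & ids_b
print("%i identifiers in common" % len(common))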

Note - with dictionaries (and sets) Python stores the keys themselves
plus their hashes, and then there is the size of any values on top of
that. For a ball park guess, take the FASTA file size and double it.
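
If you want to sanity check the memory side before parsing 5 million
records, one quick and dirty test (needs Python 2.6 or later for
sys.getsizeof, and it only counts the set and its strings, nothing
else) would be:

import sys
#Five million made up identifiers, 20 letters each
ids = set("ID%018i" % i for i in range(5000000))
total = sys.getsizeof(ids) + sum(sys.getsizeof(key) for key in ids)
print("About %i MB for %i keys" % (total // (1024 * 1024), len(ids)))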

Peter


