[Biopython] Indexing large sequence files

Fri Jun 19 07:53:09 EDT 2009

On Fri, Jun 19, 2009 at 12:12 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Either way, having an index file holding even compressed
> pickled versions of SeqRecord objects takes at least three
> times as much space as the original FASTQ file.
>
> So, for millions of records, I am going off the shelve/pickle
> idea. Storing offsets in the original sequence file does seem
> more practical here.

How does this following code work for you? It is all in memory,
no index files on disk. I've been testing it on uniprot_sprot.fasta
which has only 470369 records (this example takes about 8s),
but the same approach also works on a FASTQ file with seven
million records (taking about 1min). These times are to build
the index, and access two records for testing.

#Start of code
from Bio import SeqIO

class FastaDict(object) :
    """Read only dictionary interface to a FASTA file.

    Keeps the keys in memory, reads the file to access
    entries as SeqRecord objects using Bio.SeqIO."""
    def __init__(self, filename, alphabet=None) :
        #TODO - Take a handle instead, provided it has
        #seek and tell methods?
        self._index = dict()
        self._alphabet = alphabet
        handle = open(filename, "rU")
        while True :
            pos = handle.tell()
            line = handle.readline()
            if not line : break #End of file
            if line.startswith(">") :
                self._index[line[1:].rstrip().split(None,1)[0]] = pos
        handle.seek(0)
        self._handle = handle

    def keys(self) :
        return self._index.keys()

    def __len__(self) :
        return len(self._index)

    def __getitem__(self, index) :
        handle = self._handle
        handle.seek(self._index[index])
        return SeqIO.parse(handle, "fasta", self._alphabet).next()

import time
start = time.time()
my_dict = FastaDict("uniprot_sprot.fasta")
print len(my_dict)
print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first
print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last
print "Took %0.2fs" % (time.time()-start)
#End of code

Peter