[Biopython] Indexing large sequence files

Fri Jun 19 14:56:19 UTC 2009

Peter, I apreciate all this hard work you are doing for me. I won't be  
able to test any of it until I'm back in the office on Tuesday, but  
I'll let you know how it goes then.

Best,
Cedar
Sent via phone

On Jun 19, 2009, at 4:53 AM, Peter <biopython at maubp.freeserve.co.uk>  
wrote:

> On Fri, Jun 19, 2009 at 12:12 PM, Peter<biopython at maubp.freeserve.co.uk 
> > wrote:
>> Either way, having an index file holding even compressed
>> pickled versions of SeqRecord objects takes at least three
>> times as much space as the original FASTQ file.
>>
>> So, for millions of records, I am going off the shelve/pickle
>> idea. Storing offsets in the original sequence file does seem
>> more practical here.
>
> How does this following code work for you? It is all in memory,
> no index files on disk. I've been testing it on uniprot_sprot.fasta
> which has only 470369 records (this example takes about 8s),
> but the same approach also works on a FASTQ file with seven
> million records (taking about 1min). These times are to build
> the index, and access two records for testing.
>
> #Start of code
> from Bio import SeqIO
>
> class FastaDict(object) :
>    """Read only dictionary interface to a FASTA file.
>
>    Keeps the keys in memory, reads the file to access
>    entries as SeqRecord objects using Bio.SeqIO."""
>    def __init__(self, filename, alphabet=None) :
>        #TODO - Take a handle instead, provided it has
>        #seek and tell methods?
>        self._index = dict()
>        self._alphabet = alphabet
>        handle = open(filename, "rU")
>        while True :
>            pos = handle.tell()
>            line = handle.readline()
>            if not line : break #End of file
>            if line.startswith(">") :
>                self._index[line[1:].rstrip().split(None,1)[0]] = pos
>        handle.seek(0)
>        self._handle = handle
>
>    def keys(self) :
>        return self._index.keys()
>
>    def __len__(self) :
>        return len(self._index)
>
>    def __getitem__(self, index) :
>        handle = self._handle
>        handle.seek(self._index[index])
>        return SeqIO.parse(handle, "fasta", self._alphabet).next()
>
> import time
> start = time.time()
> my_dict = FastaDict("uniprot_sprot.fasta")
> print len(my_dict)
> print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first
> print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last
> print "Took %0.2fs" % (time.time()-start)
> #End of code
>
> Peter