[Biopython] extremely long execution time

Thu Jun 21 16:08:24 UTC 2012

Hello Biopythoners!

I work with FASTQ files regularly and have processed them with many different scripts, but for the first time I have a script that would take 8+ days to process a single file. I figure I must be doing something wrong. Here is what I am trying to do:

I have a file with a list of record IDs (~3 million IDs): recidfile
I have two paired files that the record IDs originally came from (~10 million records each): infile
I want to run each of the paired files individually and create two files from it (withfile and withoutfile). One new file will hold the records whose IDs are in the list, and the other will hold the records whose IDs are not in the list.

I have tried doing this with and without the FastqGeneralIterator, but both approaches would take at least 8 days.

Here is my code with the FastqGeneralIterator:

from time import time
from Bio.SeqIO.QualityIO import FastqGeneralIterator

start = time()

# Read the record IDs, dropping the last two characters of each line
recidlist = []
recids = open(recidfile, 'r')
for item in recids:
    recidlist.append(item[0:-2])
recids.close()

handle1 = open(withfile, 'w')
handle2 = open(withoutfile, 'w')

# FastqGeneralIterator strips the leading '@', so add it back when writing
for header, seq, qual in FastqGeneralIterator(open(infile)):
    if header[:-1] in recidlist:
        handle1.write('@%s\n%s\n+\n%s\n' % (header, seq, qual))
    else:
        handle2.write('@%s\n%s\n+\n%s\n' % (header, seq, qual))

handle1.close()
handle2.close()
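
One thing that may be worth checking is the `in recidlist` test. On a Python list each lookup scans the list, so with ~3 million IDs and ~10 million records that adds up to a very large number of comparisons, whereas a lookup in a set takes roughly constant time. Below is a minimal sketch of the same split using a set. The names load_ids and split_by_id and the key parameter are made up for illustration, and the sketch assumes each ID in the list matches the (possibly trimmed) header exactly. The records can be any iterable of (header, seq, qual) tuples, such as FastqGeneralIterator(open(infile)).

```python
import io


def load_ids(lines):
    """Collect record IDs (one per line) into a set for fast lookups."""
    # strip() removes the newline and surrounding whitespace; blank lines are skipped
    return {line.strip() for line in lines if line.strip()}


def split_by_id(records, ids, with_handle, without_handle, key=lambda h: h):
    """Write each FASTQ record to one of two handles depending on its ID.

    records: iterable of (header, seq, qual) tuples, header without the '@'
    key: hypothetical hook to trim the header before lookup, e.g. lambda h: h[:-1]
    Returns (number written to with_handle, number written to without_handle).
    """
    n_with = n_without = 0
    for header, seq, qual in records:
        line = '@%s\n%s\n+\n%s\n' % (header, seq, qual)
        if key(header) in ids:  # set membership, not a list scan
            with_handle.write(line)
            n_with += 1
        else:
            without_handle.write(line)
            n_without += 1
    return n_with, n_without


# Tiny in-memory demo with made-up records
demo_ids = load_ids(io.StringIO("read1\nread3\n"))
demo_records = [("read1", "ACGT", "IIII"), ("read2", "GGCC", "HHHH")]
with_out, without_out = io.StringIO(), io.StringIO()
print(split_by_id(demo_records, demo_ids, with_out, without_out))
```

With the real files, the call would look something like split_by_id(FastqGeneralIterator(open(infile)), load_ids(open(recidfile)), open(withfile, 'w'), open(withoutfile, 'w'), key=lambda h: h[:-1]), assuming the trimming in key matches how the IDs were written to recidfile.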

Can anyone advise me on how I can possibly make this go faster? I would prefer 8 minutes over 8 days.

Thanks in advance

Anita