[Biopython] MultiProcess SeqIO objects

Willis, Jordan R jordan.r.willis at Vanderbilt.Edu
Tue Mar 6 03:37:30 UTC 2012


Hello BioPython,

I was wondering if anyone has used the multiprocessing module in conjunction with Biopython objects? Here is my problem: I have 60 million sequences in fastq format, and I want to process them in parallel without having to iterate through the file multiple times.

So I have something like this:

from multiprocessing import Pool
from Bio import SeqIO

input_handle = open("huge_fastaqf_file.fastq")


def convert_to_fasta(input):
	return [[record.id, record.seq.reverse_complement()] for record in SeqIO.parse(input, 'fastq')]

p = Pool(processes=4)
g = p.map(convert_to_fasta, input_handle)

for i in g:
	print i[0],i[1]

Unfortunately, it seems to divide up the handle line by line, so the input to convert_to_fasta ends up being a single line of the file rather than a parseable record. What I want is to divide up the fastq records themselves and run my function on 4 processors.

I can't figure out how to do this.
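For reference, here is the pattern I have been toying with: group the records into batches myself and hand whole batches to Pool.map, rather than the raw handle. This is only a sketch; the batch_iterator and reverse_complement_batch names are mine, and I use plain (id, sequence) tuples here so it runs standalone. With Biopython you would fill the batches from SeqIO.parse(handle, "fastq") instead.

```python
from itertools import islice
from multiprocessing import Pool

# Simple string-based complement table, standing in for Seq.reverse_complement()
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def batch_iterator(iterator, batch_size):
    """Yield lists of up to batch_size items from any iterator.

    With Biopython, pass SeqIO.parse(handle, "fastq") as the iterator
    so each worker receives a chunk of whole records, not raw lines.
    """
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def reverse_complement_batch(batch):
    """Worker: reverse-complement every sequence in one batch."""
    return [(rec_id, seq.translate(COMPLEMENT)[::-1]) for rec_id, seq in batch]

if __name__ == "__main__":
    # Toy stand-in for 60 million fastq records
    records = [("read%d" % i, "ACGTTTCA") for i in range(10)]
    with Pool(processes=4) as pool:
        results = pool.map(reverse_complement_batch, batch_iterator(records, 3))
    # Flatten the per-batch result lists back into one stream
    for batch in results:
        for rec_id, seq in batch:
            print(rec_id, seq)
```

The batch size would need tuning: too small and the inter-process pickling overhead dominates; too large and each worker holds a big chunk of records in memory at once.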

Thanks,
jordan 





