[Biopython] MultiProcess SeqIO objects
Brad Chapman
chapmanb at 50mail.com
Tue Mar 6 06:55:13 EST 2012
Jordan,
> I was wondering if anyone has used the multiprocessing tool in
> conjunction with Biopython type objects? Here is my problem, I have 60
> million sequences given in fastq format and I want to multiprocess
> these without having to iterate through the list multiple times.
Are you trying to make the parsing run in parallel, or to have some
downstream processing happen in parallel? The latter is definitely
preferable if you are looking for speedups, since the parsing will be
primarily IO bound.
You can make the processing faster by avoiding SeqIO objects, since
the conversion of quality scores will take the most time. Here is a
working example:

from multiprocessing import Pool
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from Bio.Seq import Seq

def do_something_with_record(info):
    # Runs in a worker process; put the real per-record work here.
    name, seq = info
    return name, seq

def convert_to_fasta(in_handle):
    # FastqGeneralIterator yields plain (title, sequence, quality) strings,
    # so no quality scores are converted.
    for rec_id, seq, _ in FastqGeneralIterator(in_handle):
        yield rec_id, str(Seq(seq).reverse_complement())

with open("example.fastq") as input_handle:
    p = Pool(processes=4)
    g = p.map(do_something_with_record, convert_to_fasta(input_handle))
    for i in g:
        print(i)

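With 60 million reads, it may also help that Pool.map turns its input into a
full list and returns a full list of results. Below is a rough sketch of a
streaming variant that uses Pool.imap with a chunksize instead, and that moves
the reverse complement into the worker so the downstream step is what runs in
parallel. To keep it self-contained it does not use Biopython: read_fastq,
reverse_complement, process_record and stream_results are hypothetical helper
names, the reader assumes simple four-line FASTQ records with no line
wrapping, and the chunksize of 10000 is an untuned guess.

```python
import sys
from multiprocessing import Pool

# Translation table for complementing DNA (assumes only A, C, G, T, N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a plain DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(handle):
    """Yield (name, sequence) pairs, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored
        yield header[1:].strip(), seq


def process_record(info):
    """Per-record work done in a worker process."""
    name, seq = info
    return name, reverse_complement(seq)


def stream_results(path, processes=4, chunksize=10000):
    """Yield results in input order without building one big list."""
    with open(path) as handle, Pool(processes=processes) as pool:
        for result in pool.imap(process_record, read_fastq(handle),
                                chunksize=chunksize):
            yield result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py example.fastq
    for name, seq in stream_results(sys.argv[1]):
        print(name, seq)
```

Whether this is actually faster depends on how expensive the per-record work
is. For something as cheap as a reverse complement, the cost of passing
records between processes may well dominate.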
Hope this helps,
Brad
> So I have something like this:
>
> from multiprocessing import Pool
> from Bio import SeqIO
>
> input_handle = open("huge_fastaqf_file.fastq")
>
>
> def convert_to_fasta(input):
>     return [[record.id, record.seq.reverse_complement()] for record in SeqIO.parse(input, 'fastq')]
>
> p = Pool(processes=4)
> g = p.map(convert_to_fasta, input_handle)
>
> for i in g:
>     print i[0],i[1]
>
> Unfortunately, it seems to divide up the handle by all the names and tries to make the input to the function convert_to_fasta the first line of input. What I want it to do is divide up the fastq object and run my function on 4 processors.
>
> I can't figure out how in the world to do this though.
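
The behaviour described in the quoted message follows from how file handles
iterate: looping over a handle gives one line at a time, so a map-style call
hands each line, not each record, to the function. A small illustration,
using an in-memory handle and made-up reads (fastq_text is a hypothetical
two-record example):

```python
import io

# Two made-up FASTQ records, four lines each.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"
handle = io.StringIO(fastq_text)

# Iterating a file-like handle yields lines, so these are the work items
# a map over the handle would see.
items = [line.rstrip("\n") for line in handle]

print(items[0])    # the first work item is only the header line: @read1
print(len(items))  # 8 work items for 2 records
```

This is why the parsing has to happen first, in a generator, with the parsed
records handed to the pool.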