[Biopython] multiprocessing problem with pysam
Michal
mictadlo at gmail.com
Mon Apr 11 07:57:17 EDT 2011
On 04/10/2011 09:15 PM, Brad Chapman wrote:
> Michal;
>
>> I have tried to rewrite the following code from
>> http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html
> [...]
>> with the following multiprocessing code:
> [...]
>> pool = Pool()
>>
>> samfile = pysam.Samfile("ex1.bam", "rb")
>> references = samfile.references
>>
>> for reference in samfile.references:
>>     print ">", reference
>>     pool.apply_async(calc_pileup, [samfile, reference, 100, 120])
> [...]
>> However, I got the following out:
> [...]
>> TypeError: _open() takes at least 1 positional argument (0 given)
> You are passing the open file handle 'samfile' to your multiprocessing
> function. The arguments you pass through need to be able to be pickled
> by Python; normally you need to stick with more basic data structures.
> Specifically, I would suggest passing in the filename and then opening a
> pysam reference within the worker functions.
>
> def calc_pileup(fname, reference_name, start_pos, end_pos):
>     samfile = pysam.Samfile(fname, "rb")
>     coverages = []
>     print reference_name, os.getpid()
>
> if __name__ == '__main__':
>     pool = Pool()
>     fname = "ex1.bam"
>     samfile = pysam.Samfile(fname, "rb")
>     references = samfile.references
>     samfile.close()
>     for reference in references:
>         print ">", reference
>         pool.apply_async(calc_pileup, [fname, reference, 100, 120])
>
> My more general suggestion with multiprocessing is to start with a
> simple workflow and expand. This will let you get a sense of where
> your objects may be too complex to pickle and you need to simplify.
>
> Hope this helps,
> Brad
>
>
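As a rough illustration of the pickling point above, the sketch below checks whether an object survives `pickle.dumps` before it is handed to a worker. It is written for Python 3, uses a plain open file as a stand-in for the pysam handle, and the helper name `is_picklable` and the temporary file are assumptions made only for this example.

```python
import os
import pickle
import tempfile


def is_picklable(obj):
    """Return True if obj can be serialised with pickle (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# Create a throwaway file so the example does not depend on ex1.bam.
with tempfile.NamedTemporaryFile(suffix=".bam", delete=False) as tmp:
    tmp.write(b"placeholder")
    fname = tmp.name

# An open file handle stands in for the open pysam handle here.
with open(fname, "rb") as handle:
    handle_ok = is_picklable(handle)

# The filename is a plain string, so it can be sent to a worker.
name_ok = is_picklable(fname)

os.remove(fname)
print("handle picklable:", handle_ok)
print("filename picklable:", name_ok)
```

Running a check like this on each argument before calling `pool.apply_async` is one way to find out early which objects need to be replaced by simpler data.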
Thank you for your response. I changed the code in the following way:
--------------------------
import pysam
import os
from multiprocessing import Pool
from pprint import pprint

class Pileup_info():
    def __init__(pileup_pos, coverage):
        self.pileup_pos = pileup_pos
        self.coverage = coverage
        reads = []

class Reads_info():
    def __init__(read_name, read_base):
        self.read_name = read_name
        self.read_base = read_base

def calc_pileup(fname, reference_name, start_pos, end_pos):
    samfile = pysam.Samfile(fname, "rb")
    coverages = []
    print reference_name, os.getpid()
    for pileupcolumn in samfile.pileup(reference_name, start_pos, end_pos):
        pileup_inf = Pileup_info(pileupcolumn.pos, pileupcolumn.n)
        #print 'coverage at base %s = %s' % (pileupcolumn.pos, pileupcolumn.n)
        for pileupread in pileupcolumn.pileups:
            #print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos])
            pileup_inf.reads.append(Reads_info(pileupread.alignment.qname,
                pileupread.alignment.seq[pileupread.qpos]))
        coverages.append(pileup_inf)
    samfile.close()
    return (reference_name, coverages)

def output(coverage):
    #for
    print
    print

if __name__ == '__main__':
    pool = Pool()
    fname = "ex1.bam"
    samfile = pysam.Samfile(fname, "rb")
    references = samfile.references
    samfile.close()
    results = [pool.apply_async(calc_pileup, [fname, reference, 100, 120])
               for reference in references]
    #print ">", reference
    #results = pool.apply_async(calc_pileup, [fname, reference, 100, 120])
    pool.close()
    pool.join()
    for r in results:
        print r
        pprint(r.get())
--------------------------
and I got this error:
--------------------------
$ python multi.py
chr1 6056
chr2 6057
<multiprocessing.pool.ApplyResult object at 0xeb7bd0>
Traceback (most recent call last):
  File "multi.py", line 54, in <module>
    pprint(r.get())
  File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/pool.py", line 491, in get
    raise self._value
TypeError: __init__() takes exactly 2 arguments (3 given)
--------------------------
What did I do wrong?
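
One likely reading of the traceback: both `__init__` methods are declared without `self`, so Python counts the instance itself as an extra argument and reports 3 given where 2 are expected. The sketch below reproduces that failure and shows one possible correction. It is written for Python 3, the class names `BrokenPileupInfo`, `PileupInfo` and `ReadInfo` are hypothetical renames for this example, and storing `reads` as `self.reads` is an assumption about what the original code intended.

```python
class BrokenPileupInfo(object):
    # Mirrors the posted pattern: no `self`, so the instance is bound
    # to `pileup_pos` and one positional argument is left over.
    def __init__(pileup_pos, coverage):
        pass


class PileupInfo(object):
    def __init__(self, pileup_pos, coverage):
        self.pileup_pos = pileup_pos
        self.coverage = coverage
        # Assumed intent: one list of reads per pileup column.
        self.reads = []


class ReadInfo(object):
    def __init__(self, read_name, read_base):
        self.read_name = read_name
        self.read_base = read_base


# The broken class raises TypeError as soon as it is instantiated.
try:
    BrokenPileupInfo(100, 3)
    raised = False
except TypeError:
    raised = True

# The corrected classes keep a separate `reads` list per instance.
first = PileupInfo(100, 2)
second = PileupInfo(101, 3)
first.reads.append(ReadInfo("read_1", "A"))

print("broken class raised TypeError:", raised)
print("reads in first column:", len(first.reads))
print("reads in second column:", len(second.reads))
```

In the posted script the exception is raised inside the worker process and only surfaces when `r.get()` is called, which would explain why the traceback points into `multiprocessing/pool.py` rather than at the class definition.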