[Biopython-dev] samtools threaded daemon
Chris Mitchell
chris.mit7 at gmail.com
Thu Apr 11 09:46:41 EDT 2013
For threading I am currently using map_async and subprocess to handle the
threading and the calls to samtools. There are some other details, like
using a generator to reduce the memory overhead. map_async by itself runs
the entire list and holds all the return values in memory before handing
them back, which is obviously a terrible idea for 10000+ pileups. A
generator gets around this by chunking the input, but that reduces the
gains from threading, because it waits until all jobs in a batch are
finished before submitting the next batch.
If anyone knows of a way to have map_async or a similar method return
values to the callback as threads finish, that would be good to know.
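
One possible answer to the question above, as a minimal sketch rather than the code in the daemon: `imap_unordered` on a `multiprocessing.pool.ThreadPool` yields each result as soon as its job completes, so a callback can be invoked without waiting for a whole batch. The names `run_job` and `stream_results` are hypothetical, and the external command is a stand-in (the Python interpreter echoing its argument) so the example runs without samtools installed.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_job(region):
    """Run one external command for a region and return (region, stdout).

    Stand-in command: the Python interpreter echoes the region back.
    A real wrapper would call samtools here instead (assumption).
    """
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", region]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return region, done.stdout.strip()


def stream_results(regions, callback, threads=4):
    """Call callback(region, output) for each job as soon as it finishes."""
    with ThreadPool(threads) as pool:
        # imap_unordered hands back each result when its job completes,
        # instead of collecting the full result list first like map_async.
        for region, output in pool.imap_unordered(run_job, regions):
            callback(region, output)


if __name__ == "__main__":
    regions = ("chr1:%d-%d" % (i, i + 1) for i in range(2000001, 2000011))
    stream_results(regions, lambda region, out: print(region, out))
```

Results arrive in completion order, not input order, which is why the sketch returns the region alongside the output.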
From a user perspective, this is a simple example of how it works now:
st = SamTools(bamSource, binary=sTools, threads=30)
st.mpileup(f=hg19,
           r=['chr1:%d-%d' % (i, i + 1) for i in xrange(2000001, 2001001)],
           callback=processPileup)
print st.mpileup(f=hg19, r='chr1:2000000-2000010')
For arguments where multiple values make sense, like BAM files, BED
files, or positions, passing a list as that keyword makes the wrapper
thread over it. The first mpileup call above runs on 30 threads and calls
processPileup with the output.
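
The list-versus-single-value behaviour described here could be sketched roughly as follows. This is an illustration under assumptions, not the SamTools class from the example: `build_mpileup_cmd`, `run_cmd`, `mpileup` and the `runner` parameter are hypothetical names. The only samtools details relied on are the `mpileup` subcommand and its `-f` (reference FASTA) and `-r` (region) options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_mpileup_cmd(binary, bam, reference, region):
    """Build one `samtools mpileup` command line for a single region."""
    return [binary, "mpileup", "-f", reference, "-r", region, bam]


def run_cmd(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def mpileup(binary, bam, reference, regions, threads=4, callback=None,
            runner=run_cmd):
    """Run mpileup for one region (a string) or many (any other iterable).

    A single region is run directly and its output returned. For several
    regions, one process per region is started on a thread pool; each
    output goes to `callback` as it completes, or into a returned list
    when no callback is given. `runner` is injectable (hypothetical) so
    the fan-out logic can be exercised without samtools installed.
    """
    if isinstance(regions, str):
        return runner(build_mpileup_cmd(binary, bam, reference, regions))
    results = []
    handle = callback if callback is not None else results.append
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(runner, build_mpileup_cmd(binary, bam, reference, region))
            for region in regions
        ]
        # as_completed yields each future as soon as its process has finished.
        for future in as_completed(futures):
            handle(future.result())
    return None if callback is not None else results
```

Threads are sufficient here because the work happens in the external samtools processes; the Python side only waits on them.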
Since there seems to be some interest, I'm going to look at the existing
command line wrappers to make it consistent with Biopython's approach.
Also, if a binary can't be found, having it fall back to the future
Biopython parser seems like it might be a good idea (provided it has
similar functionality, like creating pileups; does it?).
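
The fallback idea could look something like this minimal sketch. `choose_backend` and the "binary"/"parser" labels are hypothetical; only `shutil.which` is a real standard library call, and the pure-Python parser it would fall back to is not shown because it does not exist yet.

```python
import shutil


def choose_backend(binary="samtools"):
    """Pick a backend depending on whether the executable can be found.

    Returns ("binary", path) when `binary` is found on PATH, otherwise
    ("parser", None) to signal that a pure-Python fallback should be used.
    """
    path = shutil.which(binary)
    if path is not None:
        return "binary", path
    return "parser", None
```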
Chris
On Thu, Apr 11, 2013 at 5:55 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Apr 11, 2013 at 7:10 AM, Christian Brueffer
> <christian at brueffer.de> wrote:
> > On 4/11/13 4:57 , Chris Mitchell wrote:
> >> Hi everyone,
> >>
> >> I've been doing a ton of mpileup work recently with samtools so I made a
> >> python daemon to parallelize the process. Is there any interest in a
> >> generic SamTools package for BioPython? I know pysam exists, but it'd be
> >> an added dependency as well as not threaded. In my experience, for
> >> querying a ton of positions threading mpileup is the best way to go (much
> >> faster than -l bed_file in my use cases). If there's interest, I'll
> >> package it as a general SamTools command line wrapper with the added
> >> bonuses that for certain operations you can input a list and thread those
> >> parts.
> >>
> >
> > Hi Chris,
> >
> > sounds great! I use samtools/pysam a lot, so I'd appreciate another
> > option. My colleague uses mpileup with pysam a lot as well, I'm sure he
> > wouldn't mind some speedup in that area.
> >
> > Cheers,
> >
> > Chris
>
> A samtools command line wrapper sounds useful in itself. Saket has
> done a bwa wrapper I need to merge:
> https://github.com/biopython/biopython/pull/167
>
> I think he was planning to do samtools next:
> https://github.com/saketkc/biopython/tree/samtools_wrapper
>
> What did you have in mind for threading? Automatically calling
> multiple independent samtools processes in the background?
>
> Peter
>
> P.S. See also this thread:
> http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html
>