[Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)

Tue Apr 2 10:01:00 UTC 2013

I did a small test, just getting rec.rname and rec.pos (using Peter's
parser). This is something I actually need to do, to calculate basic
statistics.

Indeed, for 1M reads samtools takes 3s, whereas the pure Python parser takes 20s.
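
For reference, a minimal sketch of the kind of per-reference tally described above. It is an assumption-laden stand-in: it reads SAM text lines (for example the output of `samtools view`) rather than using either parser's API, and the function name `reads_per_reference` and the toy records are invented for illustration.

```python
from collections import Counter


def reads_per_reference(sam_lines):
    """Count mapped reads per reference name (RNAME) from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: QNAME, FLAG, RNAME, POS, ... (POS is 1-based)
        rname, pos = fields[2], int(fields[3])
        if rname == "*" or pos == 0:  # no reference / no position assigned
            continue
        counts[rname] += 1
    return counts


# Toy records, made up for illustration only.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(reads_per_reference(example))
```

On real data the same function could be fed `sys.stdin` with `samtools view` piped in; that says nothing about how fast either BAM parser is, it only shows the statistic being computed.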

Tiago

On Tue, Apr 2, 2013 at 10:32 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> On Tue, Apr 2, 2013 at 9:52 AM, Kevin Murray <k.d.murray.91 at gmail.com>
>>> wrote:
>>> > Hi All,
>>> >
>>> > Peter and I have discussed
>>> > <https://github.com/kdmurray91/biopython/commit/e6343ebae50e4ff0633476a5761b47aa5ecacec4#commitcomment-2905033>
>>> > including the SamBam parser he has worked on into the master branch. I've
>>> > offered to help with test coverage/missing features/testing.
>>> >
>>> > The performance is very good; sequentially reading all reads from an 11 MB
>>> > (540k reads) BAM file took:
>>> > CPython with Kevin's Pure-python parser: 0m17.531s real, 0m17.452s user
>>> > CPython with Peter's Pure-python parser: 0m5.589s real, 0m5.560s user
>>> > CPython with pysam: 4m29.240s real, 4m25.576s user
>>> > Pypy1.9 with Kevin's Pure-python parser: 0m6.125s real, 0m6.056s user
>>> > Pypy1.9 with Peter's Pure-python parser: 0m1.716s real, 0m1.624s user
>>> >
>>> > What are everyone's thoughts on including this into the master branch?
>>> > (with a BiopythonExperimentalWarning)
>>> >
>>> > Regards,
>>> > Kevin
>
>> On 2 April 2013 19:59, Tiago Antão <tiagoantao at gmail.com> wrote:
>>>
>>> Regarding the performance comparison to pysam: wow! Fantastic!
>>>
>
> On Tue, Apr 2, 2013 at 10:09 AM, Kevin Murray <k.d.murray.91 at gmail.com> wrote:
>> Hi Tiago,
>> It is indeed impressive, which makes me suspect I've screwed something up in
>> my benchmarks. I'll whack them up onto github for closer inspection sometime
>> tomorrow (Aussie time).
>>
>> However, in general the code is:
>>
>> bam = BamParser("path")
>> print next(bam)
>> for mapping in bam:
>>     pass
>>
>> Regards
>> Kevin Murray
>
> Those benchmark numbers are surprising - I suspect this is
> not a fair comparison. The different parsers likely have very
> different __str__ output for a BAM record (for mine this gives
> a SAM format string, pysam does something close to SAM
> but without the reference name).
>
> Something like BAM to SAM and then SAM to BAM would be
> better for profiling the basic parsing and writing performance.
> After that, random access, and maybe something where lazy
> loading might have a chance to shine - perhaps counting the
> number of reads mapped to the reverse strand (i.e. iterate
> and look at the FLAG only).
>
> Peter
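
A minimal sketch of the FLAG-only count suggested above, under the assumption that records arrive as SAM text lines rather than through either parser's API. The bit values 0x10 (reverse strand) and 0x4 (unmapped) are from the SAM specification; the function name `count_reverse_strand` and the toy records are hypothetical.

```python
REVERSE = 0x10   # SAM FLAG bit: read mapped to the reverse strand
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped


def count_reverse_strand(sam_lines):
    """Count mapped reads on the reverse strand, looking at FLAG only."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        # Split only as far as needed: FLAG is the second column.
        flag = int(line.split("\t", 2)[1])
        if flag & REVERSE and not flag & UNMAPPED:
            total += 1
    return total


# Toy records, made up for illustration only.
toy = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",      # forward
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",     # reverse
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",             # unmapped
    "r4\t83\tchr1\t300\t60\t4M\t=\t250\t-54\tACGT\tIIII", # paired, reverse
    "r5\t20\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",            # unmapped + 0x10
]
print(count_reverse_strand(toy))
```

Because only the second column is touched, this is the sort of task where a parser that decodes fields lazily would be expected to do relatively well.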

-- 
“Grant me chastity and continence, but not yet” - St Augustine