[Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 2 22:52:24 UTC 2013

On Tue, Apr 2, 2013 at 7:28 PM, Tiago Antão <tiagoantao at gmail.com> wrote:
> On Tue, Apr 2, 2013 at 5:07 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Those numbers are more believable. Was that using SAM or BAM?
>> Which Python?
> Bam. CPython 2.7.3
> T

Here's a quick test script which I presume does something
close to yours (any major differences would be interesting),
which tests BAM iteration and accessing the FLAG, RNAME
and POS fields only:


I grabbed three test BAM files for this, all from here:

NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam    -  2.5 GB
NA12878.chrom1.SLX.maq.SRP000032.2009_07.bam       - 12   GB
NA12878.chrom1.SOLID.corona.SRP000032.2009_08.bam  -  8.6 GB

I'm using pysam 0.7.4, and my SamBam2013 branch as of this commit:

Then using CPython 2.7.2 (on Mac OS X):

Using NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam (2.5 GB)
Peter's pure Python BAM iterator. - 389.1s giving 15126090/15126090 mapped
PySam's Samfile as BAM iterator. - 86.8s giving 15126090/15126090 mapped
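
For reference, the kind of loop being timed can be sketched roughly as
follows. This is a minimal illustration, not the test script itself: the
function names count_mapped and timed are hypothetical, records are
assumed to be plain (flag, rname, pos) tuples rather than real pysam or
Biopython objects, and "mapped" is assumed to mean the SAM FLAG bit 0x4
(segment unmapped) is not set.

```python
import time

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def count_mapped(records):
    """Iterate once over (flag, rname, pos) tuples; return (mapped, total)."""
    mapped = total = 0
    for flag, rname, pos in records:
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped, total


def timed(label, records):
    """Time one pass over the records and print a summary line."""
    start = time.time()
    mapped, total = count_mapped(records)
    elapsed = time.time() - start
    print("%s - %0.1fs giving %i/%i mapped" % (label, elapsed, mapped, total))
    return mapped, total


# Stand-in data; a real run would wrap a BAM iterator instead, for example
# (an assumption, not checked against the script):
#   ((r.flag, r.tid, r.pos) for r in pysam.Samfile(path, "rb"))
demo = [(0, "chr1", 100), (4, "*", 0), (16, "chr1", 200)]
timed("Demo iterator.", demo)
```

Feeding both parsers through the same counting function keeps the
comparison to iteration plus three field accesses.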

I've not run this many times (should I pick a smaller test set?)
but it looks like my BAM iterator takes 4x to 5x as long as pysam
here (389.1s vs 86.8s, about 4.5x). That is better than Tiago's
earlier figure of about 6x to 7x slower, but in the same ballpark.
I'll hopefully run this again tomorrow, and include the two larger
files too.

Note I've not really tried to optimise this branch for speed - there
is likely some low-hanging fruit, such as extra assert statements.

One of the fun things to try would be a multi-threaded BGZF
parser which simply reads a few blocks ahead and delegates
block decompression to worker threads.
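
A rough sketch of that read-ahead idea, under several assumptions: it
uses Python 3's concurrent.futures (not what the branch targets), the
names make_bgzf_block, read_raw_blocks, inflate and iter_blocks_threaded
are hypothetical, and it relies on zlib releasing the GIL while
decompressing. It is not code from the SamBam2013 branch. The main
thread only slices out raw blocks using the BSIZE value in the gzip
extra field; workers inflate them, and results are yielded in file order.

```python
import io
import struct
import zlib
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def make_bgzf_block(data):
    """Build one BGZF block from under 64 KB of data (demo helper only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(data) + comp.flush()
    bsize = len(cdata) + 25  # total block size (18 + cdata + 8) minus 1
    header = struct.pack("<BBBBIBBHBBHH", 31, 139, 8, 4, 0, 0, 255,
                         6, 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + cdata + trailer


def read_raw_blocks(handle):
    """Yield each block's compressed payload plus 8-byte trailer."""
    while True:
        header = handle.read(12)
        if not header:
            return
        if len(header) < 12 or header[:4] != b"\x1f\x8b\x08\x04":
            raise ValueError("Not a BGZF block")
        xlen = struct.unpack("<H", header[10:12])[0]
        extra = handle.read(xlen)
        bsize, pos = None, 0
        while pos + 4 <= len(extra):  # look for the 'BC' subfield
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2, slen) == (66, 67, 2):
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
            pos += 4 + slen
        if bsize is None:
            raise ValueError("Missing BGZF BC subfield")
        yield handle.read(bsize + 1 - 12 - xlen)


def inflate(payload):
    """Decompress one payload and check its CRC32 and length."""
    crc, isize = struct.unpack("<II", payload[-8:])
    data = zlib.decompress(payload[:-8], -15)
    if len(data) != isize or zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("Corrupt BGZF block")
    return data


def iter_blocks_threaded(handle, workers=4, read_ahead=8):
    """Yield decompressed blocks in order, inflating up to read_ahead at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        for payload in read_raw_blocks(handle):
            pending.append(pool.submit(inflate, payload))
            if len(pending) >= read_ahead:
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


raw = b"".join(make_bgzf_block(c) for c in [b"hello ", b"BGZF ", b"world"])
print(b"".join(iter_blocks_threaded(io.BytesIO(raw))))
```

Whether this gives any real speed-up would need measuring; it does
nothing for the pure Python record parsing that happens after
decompression.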

