[Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)

Peter Cock p.j.a.cock at googlemail.com
Wed Apr 3 13:10:16 EDT 2013


On Tue, Apr 2, 2013 at 11:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Apr 2, 2013 at 7:28 PM, Tiago Antão <tiagoantao at gmail.com> wrote:
>> On Tue, Apr 2, 2013 at 5:07 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> Those numbers are more believable. Was that using SAM or BAM?
>>> Which Python?
>>
>> Bam. CPython 2.7.3
>>
>> T
>
> Here's a quick test script, which I presume does something
> close to yours (any major differences would be interesting).
> It tests BAM iteration, accessing only the FLAG, RNAME and
> POS fields:
>
> https://github.com/peterjc/picobio/blob/master/sambam/profile/bench_iter.py
>
> I grabbed three test BAM files for this, all from here:
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/NA12878/alignment/
>
> NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam    -  2.5 GB
> NA12878.chrom1.SLX.maq.SRP000032.2009_07.bam       - 12   GB
> NA12878.chrom1.SOLID.corona.SRP000032.2009_08.bam  -  8.6 GB
>
> I'm using pysam 0.7.4, and my SamBam2013 branch as of this commit:
> https://github.com/peterjc/biopython/commit/316125a41f0284198e1a445486b307948f8c9cd9
>
> Then using CPython 2.7.2 (on Mac OS X):
>
> Using NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam (2.5 GB)
> Peter's pure Python BAM iterator. - 389.1s giving 15126090/15126090 mapped
> PySam's Samfile as BAM iterator. - 86.8s giving 15126090/15126090 mapped
>
> I've not run this many times (should I pick a smaller test set?),
> but here it looks like my BAM iterator takes 4x to 5x longer than
> pysam. That is better than Tiago's earlier figure of about 6x to
> 7x slower, but in the same ballpark. I'll hopefully run this again
> tomorrow, and include the two larger files too.
>
> Note I've not really tried to optimise this branch for speed - there
> is likely some low-hanging fruit, such as extra assert statements.
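
For anyone without the script to hand, the shape of this kind of benchmark
can be sketched as below. This is not the linked bench_iter.py: the Read
tuple, the bench function and the demo records are hypothetical stand-ins,
and treating a read as mapped when FLAG bit 0x4 is unset is an assumption
about how the mapped count is derived.

```python
import time
from collections import namedtuple

# Hypothetical minimal record type; real parsers return richer objects.
Read = namedtuple("Read", ["flag", "rname", "pos"])

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def bench(reads):
    """Iterate once over reads; return (seconds, mapped, total)."""
    start = time.time()
    mapped = total = 0
    for read in reads:
        total += 1
        # Touch all three fields so a lazy parser has to decode them.
        flag, rname, pos = read.flag, read.rname, read.pos
        if not flag & UNMAPPED:
            mapped += 1
    return time.time() - start, mapped, total


# Tiny in-memory demo; a real run would pass a BAM iterator instead.
demo = [Read(0, "chr1", 100), Read(16, "chr1", 250), Read(4, None, -1)]
secs, mapped, total = bench(demo)
print("%.1fs giving %i/%i mapped" % (secs, mapped, total))
```

Any iterable whose items expose flag, rname and pos attributes could be
passed to bench in place of the demo list.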

Same machine, also using PyPy 1.9, and with the larger BAM files tested too:

Using NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam (2.5 GB)
CPython 2.7.2 (on Mac OS X)
Peter's pure Python BAM iterator. - 401.2s giving 15126090/15126090 mapped
PyPy 1.9 (on Mac OS X)
Peter's pure Python BAM iterator. - 146.7s giving 15126090/15126090 mapped
CPython 2.7.2 (on Mac OS X)
PySam's Samfile as BAM iterator. - 85.3s giving 15126090/15126090 mapped

Using NA12878.chrom1.SLX.maq.SRP000032.2009_07.bam (12 GB)
CPython 2.7.2 (on Mac OS X)
Peter's pure Python BAM iterator. - 4706.8s giving 196354464/201240699 mapped
PyPy 1.9 (on Mac OS X)
Peter's pure Python BAM iterator. - 1248.5s giving 196354464/201240699 mapped
CPython 2.7.2 (on Mac OS X)
PySam's Samfile as BAM iterator. - 795.7s giving 196354464/201240699 mapped

Using NA12878.chrom1.SOLID.corona.SRP000032.2009_08.bam (8.6 GB)
CPython 2.7.2 (on Mac OS X)
Peter's pure Python BAM iterator. - 3445.7s giving 145879316/145879316 mapped
PyPy 1.9 (on Mac OS X)
Peter's pure Python BAM iterator. - 875.9s giving 145879316/145879316 mapped
CPython 2.7.2 (on Mac OS X)
PySam's Samfile as BAM iterator. - 602.1s giving 145879316/145879316 mapped

Using PyPy the run times are approaching those of pysam (about 1.5x to
1.7x slower on these three files, versus about 4.7x to 5.9x on CPython),
and might perhaps match or even beat pysam after some more time spent
profiling. I should try the PyPy 2.0 beta as well...
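
To make the comparison with pysam explicit, the slowdown factors can be
worked out from the timings above. This is only a small arithmetic sketch:
the dictionary layout and the slowdown helper are made up for illustration,
and the numbers are the single-run wall-clock times reported in this email.

```python
# Wall-clock seconds copied from the results above (single runs).
timings = {
    "454 (2.5GB)": {"cpython": 401.2, "pypy": 146.7, "pysam": 85.3},
    "SLX (12GB)": {"cpython": 4706.8, "pypy": 1248.5, "pysam": 795.7},
    "SOLID (8.6GB)": {"cpython": 3445.7, "pypy": 875.9, "pysam": 602.1},
}


def slowdown(times):
    """Return each pure Python run time as a multiple of the pysam time."""
    return {key: times[key] / times["pysam"] for key in ("cpython", "pypy")}


for name in sorted(timings):
    ratio = slowdown(timings[name])
    print("%-14s CPython %.1fx, PyPy %.1fx the pysam run time"
          % (name, ratio["cpython"], ratio["pypy"]))
```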

Anyway, right now on CPython this code is not speed-competitive with
pysam for parsing large BAM files. That doesn't mean it isn't useful,
but it does make it harder to justify including it in Biopython.

Peter


