[Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed May 5 13:37:54 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=2905





------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-05-05 09:37 EST -------
Created an attachment (id=1498)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1498&action=view)
Basic SAM/BAM parser for Bio.SeqIO

This file would go in Bio/SeqIO/SamBamIO.py, with the usual additions to
Bio/SeqIO/__init__.py to define the "sam" and "bam" format names and to flag
"bam" as a binary file format. There are docstring unit tests using ex1.sam
and ex1.bam, borrowed from the pysam project.

(In reply to comment #3)
> I'd really like to see our support for this re-use the work in the pysam
> project. Agreed that both a pure Python implementation of BAM parsing and
> Biopython-interoperable objects are useful, and we should either contribute it
> as part of pysam or consider discussing a closer collaboration with the pysam
> authors.
> 
> Biopython should be taking the lead on encouraging better interoperability
> with other projects. pysam is useful to me in my work right now, and we
> should support that effort. 

Hi Brad,

What I was (for now) focussing on was SAM/BAM parser support in Bio.SeqIO,
which is really quite narrow in scope. It is also quite simple - I have
attached a proof of principle implementation to this bug. The gzip/struct
code to interpret the BAM fields is pretty straightforward (having done a
lot of similar work on the SFF support helped). The only challenging bit is
turning the data into a SeqRecord (and this part seems irrelevant to pysam).
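
To illustrate the gzip/struct point, here is a minimal sketch (not the code
in the attachment) that reads just the BAM header. It assumes Python 3 and
the header layout given in the SAM/BAM specification: the magic bytes
"BAM\1", a little-endian int32 text length, the SAM header text, an int32
reference count, then a NUL-terminated name and an int32 length for each
reference. It relies on BGZF blocks being ordinary gzip members, so the
standard gzip module can decompress them. The function name read_bam_header
and the demo data are made up for this example.

```python
import gzip
import io
import struct


def read_bam_header(handle):
    """Return (header_text, [(ref_name, ref_length), ...]) from a BAM handle.

    Hypothetical helper for illustration only; handle must be a binary
    file-like object positioned at the start of the BAM data.
    """
    # BGZF is a series of concatenated gzip members, which GzipFile reads
    # transparently as one decompressed stream.
    bam = gzip.GzipFile(fileobj=handle)
    if bam.read(4) != b"BAM\x01":
        raise ValueError("Not a BAM file (bad magic bytes)")
    # Length of the plain-text SAM header, then the text itself.
    (text_len,) = struct.unpack("<i", bam.read(4))
    text = bam.read(text_len).decode("utf-8")
    # Number of reference sequences, then one (name, length) entry each.
    (ref_count,) = struct.unpack("<i", bam.read(4))
    references = []
    for _ in range(ref_count):
        (name_len,) = struct.unpack("<i", bam.read(4))
        # The stored name includes a trailing NUL byte.
        name = bam.read(name_len).rstrip(b"\x00").decode("ascii")
        (ref_len,) = struct.unpack("<i", bam.read(4))
        references.append((name, ref_len))
    return text, references


# Demonstration on a synthetic in-memory header (made-up values).
sam_text = b"@HD\tVN:1.0\n"
raw = (
    b"BAM\x01"
    + struct.pack("<i", len(sam_text)) + sam_text
    + struct.pack("<i", 1)                     # one reference sequence
    + struct.pack("<i", 5) + b"chr1\x00"       # name length includes the NUL
    + struct.pack("<i", 1575)                  # reference length
)
header_text, references = read_bam_header(io.BytesIO(gzip.compress(raw)))
print(references)
```

The alignment records that follow the header use the same idea (an int32
block size, then fixed-width fields unpacked with struct), which is why the
parsing side is the easy part compared to building the SeqRecord.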

Going beyond basic access to the reads, the next step up is working on the
alignment data structure - e.g. extracting columns to look at SNPs. Here
there are a lot of neat things like indexing schemes etc where the SAMtools
API (and thus pysam) is probably a sensible choice. You'll notice in the
draft module docstring I've suggested this (and this wasn't prompted by your
comment either - grin).

On the licence side, pysam and SAMtools both use the MIT licence, so no
problems there.

Regarding dependencies and cross platform support, pysam is a lightweight
wrapper of the SAMtools C API, using Pyrex. If we want to use pysam in
Biopython that means build time dependencies on SAMtools and Pyrex. This
won't work under Jython, and at the time of writing pysam doesn't appear
to support Windows either. So I'm not so comfortable about this.

It would be interesting to see if pysam could have a pure Python back end
as an alternative to calling the SAMtools C API (and I'm happy for any of
my code to be used for that - but this would have to cover far more than
just parsing). That would allow pysam under Jython, and might help on Windows
too.

So in the short term, I don't see any overlap between SAM/BAM support in
Bio.SeqIO and the pysam project. In the medium/long term, once we are working
with the CIGAR strings and of course the alignments rather than just the
reads, then yes absolutely - some level of discussion or collaboration would
be sensible and desirable.
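
As a rough idea of what working with the CIGAR strings involves, here is a
small sketch in Python 3. It is not part of the attached parser. The function
names are invented for this example, and the operation letters (MIDNSHP plus
= and X) are taken from the current SAM specification, so treat that set as
an assumption.

```python
import re

# One CIGAR operation: a run length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a SAM CIGAR string into a list of (length, operation) pairs."""
    if cigar == "*":
        # SAM uses "*" when no alignment information is available.
        return []
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to catch anything the regular expression skipped.
    if "".join("%i%s" % pair for pair in ops) != cigar:
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return ops


def reference_span(ops):
    """Number of reference bases covered by the alignment."""
    # Only these operations advance the position on the reference.
    return sum(length for length, op in ops if op in "MDN=X")


def query_length(ops):
    """Number of read bases implied by the CIGAR string."""
    # Only these operations consume bases from the read sequence.
    return sum(length for length, op in ops if op in "MIS=X")


ops = parse_cigar("3S10M2I5M1D4M")
print(ops, reference_span(ops), query_length(ops))
```

Anything beyond this (pileups, column extraction, indexed region lookups) is
where the SAMtools API, and hence pysam, starts to look like the better tool.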

Peter




