[Biopython-dev] Alignment object

Aaron Quinlan aaronquinlan at gmail.com
Thu Mar 4 08:33:40 EST 2010


Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/).  The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc.

I added BAM support to my BEDTools package (http://code.google.com/p/bedtools/) using the BamTools libraries.  Save for a few minor bugs along the way, it was rather straightforward to include.

Aaron

Aaron Quinlan, Ph.D.
NRSA Postdoctoral Fellow
Hall Laboratory
University of Virginia
Biochem. & Mol. Genetics
aaronquinlan at gmail.com

On Mar 4, 2010, at 8:13 AM, Brad Chapman wrote:

> Kevin and Peter;
> 
>> I was aware of pysam but am concerned about the dependencies:
>> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
>> itself - which may all be fine on Linux, but will likely be trouble for
>> us on other platforms (especially Windows).
> 
> I believe you can remove the pyrex requirement by shipping the
> generated C file with the distribution. Samtools itself may be an
> issue; however, right now it is probably a practical need for dealing
> with SAM/BAM since it implements a lot of BAM generation, sorting,
> merging and indexing you need in workflows. Also, the C code is
> included with the distribution so it is more a matter of getting it
> compiled than introducing extra dependencies. The bioconductor work
> appears to do the same thing.
> 
>>> I agree that we should work towards supporting SAM (and perhaps
>>> also BAM) in Biopython, and other projects' APIs can be very
>>> useful for inspiration or guidance.
> 
> All of my work converts SAM directly into sorted and indexed BAM,
> and then builds from that. For me, direct SAM parsing wouldn't be as
> useful as BAM.
> 
>> Honestly, the SAM/BAM format specification is pretty dodgy.  Thankfully
>> between samtools and Picard source code, I've been able to work out most of
>> the tricky bits.  I'm glad to know that the R folks are also working on
>> this, since they're usually very good about generating clear documentation.
> 
> Agreed, but at least we are converging on something instead of
> having to write a parser every time you use a new aligner. The
> bioconductor SVN is here:
> 
> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/
> (user: readonly, pass: readonly)
> 
> I think the pysam API does a decent job for reading and exposing
> this. The higher level things that would be nice to add are:
> 
> - Converting the CIGAR string into something more useful.
> - Smartly dealing with the X? fields from various aligners. These
>  often contain very useful information missing from the SAM
>  specification. Where the data actually is will be aligner
>  specific.
> - More generally, making the optional fields easier to work with.
> 
>> Parsing SAM is pretty simple and I can certainly help with gluing it into
>> Biopython (with some help on the Biopython side, since I'm still a newb).
>> I'm about half-way to having a BAM reader and writer for my own purposes.
>> I'm coding the time-critical parts in Cython with a fallback to pure
>> Python, so it may not be ideal for use in Biopython.
> 
> Cool. Does the BAM reader require samtools C code or is it
> independent of that?
> 
> Brad
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
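
As a rough illustration of the higher-level conveniences listed in the quoted message (a more useful CIGAR representation and easier access to the optional fields), here is a minimal pure-Python sketch. It is not the pysam, BamTools, or Biopython API: the names parse_cigar, reference_length, parse_tags, and parse_sam_line are hypothetical, the example record is made up, and only the integer and float tag types are converted (everything else is left as a string).

```python
import re

# One CIGAR element: a run length followed by an operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I2M' into a list of (length, op) tuples."""
    if cigar == "*":  # '*' means no CIGAR is available
        return []
    ops = [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return ops


def reference_length(ops):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in ops if op in "MDN=X")


def parse_tags(fields):
    """Turn optional TAG:TYPE:VALUE fields into a dict keyed by tag."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        # Other types (A, Z, H, B) are kept as strings in this sketch.
        tags[tag] = value
    return tags


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line into a small dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM line needs 11 mandatory fields")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": parse_cigar(fields[5]),
        "seq": fields[9],
        "qual": fields[10],
        "tags": parse_tags(fields[11:]),
    }


# A made-up record; XT is an aligner-specific tag of the "X?" kind.
EXAMPLE = "read1\t0\tchr1\t100\t60\t3M1I2M\t*\t0\t0\tACGTAC\tIIIIII\tNM:i:1\tXT:A:U"
record = parse_sam_line(EXAMPLE)
```

Because the meaning of the X? tags depends on the aligner, a sketch like this deliberately stops at exposing them in a dict; interpreting them would need per-aligner logic on top.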



