[Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed May 5 12:29:30 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=2905


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         OS/Version|Mac OS                      |All
            Version|1.51                        |Not Applicable




------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-05-05 08:29 EST -------
I've recently started looking at parsing SAM and BAM files. These files contain
only the reads; they do not include the reference sequence, which is usually
kept in a separate FASTA file. I therefore think it would make sense to parse
each read as a SeqRecord in Bio.SeqIO.

The SAM format is basically tab-separated plain text. Parsing it is
straightforward; the complication is turning each line into a suitable
SeqRecord object.
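
To illustrate the tab-separated layout, here is a minimal sketch using only the
standard library. The names SamRead, parse_sam_line and parse_sam are
hypothetical (they are not Biopython API), and the sketch assumes the eleven
mandatory SAM columns with Phred+33 encoded qualities. It stops short of
building a SeqRecord, since how to do that is the open question.

```python
from collections import namedtuple

# Hypothetical container for illustration only; not a Biopython class.
SamRead = namedtuple("SamRead", "name flag ref pos mapq cigar seq quals tags")


def parse_sam_line(line):
    """Split one SAM alignment line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("Expected at least 11 tab-separated columns")
    name, flag, ref, pos, mapq, cigar = fields[:6]
    seq, qual = fields[9], fields[10]
    # "*" means no qualities were stored; otherwise assume Phred+33 ASCII.
    quals = None if qual == "*" else [ord(char) - 33 for char in qual]
    # Optional TAG:TYPE:VALUE columns are kept as raw strings.
    return SamRead(name, int(flag), ref, int(pos), int(mapq), cigar,
                   seq, quals, fields[11:])


def parse_sam(handle):
    """Yield a SamRead per alignment line, skipping '@' header lines."""
    for line in handle:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)
```

The optional tag columns are left unparsed here because their representation
on a SeqRecord is exactly the undecided part.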

The BAM format can be decompressed in Python using the gzip library (built in),
and decoded with the struct library (also built in - we already use this for
parsing the binary SFF file format), i.e. this is fairly straightforward to do
in pure Python, without any dependence on the samtools C library. Wrapping that
C library is the alternative approach, and is how pysam works. See
http://code.google.com/p/pysam/
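
As a rough sketch of the gzip plus struct approach, the following reads just
the name, sequence and qualities from each BAM record. The function name
iter_bam_reads is hypothetical, and the byte layout used (magic bytes, header
text, reference list, then 32 bytes of fixed fields per record) is my reading
of the SAM/BAM specification, so it should be checked against the spec before
being relied upon. CIGAR operations and tags are skipped.

```python
import gzip
import struct

# 4-bit base codes as assumed from the BAM specification.
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def _read_exact(handle, size):
    data = handle.read(size)
    if len(data) != size:
        raise ValueError("Truncated BAM data")
    return data


def iter_bam_reads(path_or_file):
    """Yield (name, sequence, qualities) for each read in a BAM file."""
    with gzip.open(path_or_file, "rb") as handle:
        if _read_exact(handle, 4) != b"BAM\x01":
            raise ValueError("Missing BAM magic bytes")
        # Skip the embedded SAM header text.
        (l_text,) = struct.unpack("<i", _read_exact(handle, 4))
        _read_exact(handle, l_text)
        # Skip the reference names and lengths.
        (n_ref,) = struct.unpack("<i", _read_exact(handle, 4))
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<i", _read_exact(handle, 4))
            _read_exact(handle, l_name + 4)  # name, then int32 length
        while True:
            size_bytes = handle.read(4)
            if not size_bytes:
                break  # clean end of file
            if len(size_bytes) != 4:
                raise ValueError("Truncated BAM data")
            (block_size,) = struct.unpack("<i", size_bytes)
            block = _read_exact(handle, block_size)
            (_ref_id, _pos, l_read_name, _mapq, _bin, n_cigar, _flag,
             l_seq, _next_ref, _next_pos, _tlen) = struct.unpack(
                "<iiBBHHHiiii", block[:32])
            offset = 32
            # The read name is NUL terminated; drop the final byte.
            name = block[offset:offset + l_read_name - 1].decode("ascii")
            offset += l_read_name + 4 * n_cigar  # skip name and CIGAR
            packed = block[offset:offset + (l_seq + 1) // 2]
            offset += len(packed)
            # Two bases per byte, high nibble first.
            seq = "".join(SEQ_CODES[b >> 4] + SEQ_CODES[b & 0x0F]
                          for b in packed)[:l_seq]
            # Raw Phred values; 0xFF bytes are assumed to mean "missing".
            quals = list(block[offset:offset + l_seq])
            yield name, seq, quals
```

This relies on the gzip module reading the concatenated compressed blocks in a
BAM file as one continuous stream. It does not give random access, which would
need the block offsets to be tracked separately.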

Extracting just the read name, sequence, and PHRED quality scores when building
the SeqRecord objects is sufficient to implement SAM/BAM to FASTQ/FASTA/QUAL
conversion with Bio.SeqIO. The harder part will be deciding how to represent
all the other annotation information for each read...
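
To show why those three values are enough for the FASTQ case, here is a small
standard library sketch. The function name to_fastq is hypothetical, and it
assumes Sanger style FASTQ (Phred+33). Inside Bio.SeqIO the same values would
instead be stored on a SeqRecord and written by the existing FASTQ output code.

```python
def to_fastq(name, seq, quals):
    """Format one read as a four-line Sanger FASTQ record."""
    if len(seq) != len(quals):
        raise ValueError("Sequence and quality lengths differ")
    # Cap at Phred 93 so every score maps to a printable ASCII character.
    qual_str = "".join(chr(min(q, 93) + 33) for q in quals)
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual_str)


print(to_fastq("read1", "ACGT", [40, 40, 30, 30]), end="")
```

Note that this ignores the strand flag. SAM/BAM store reads mapped to the
reverse strand as their reverse complement, so recovering the original read
orientation would need an extra step.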

