[Biopython-dev] Merging Bio.SeqIO SFF support?
Kevin Jacobs <jacobs@bioinformed.com>
bioinformed at gmail.com
Tue Mar 2 07:28:22 EST 2010
On Tue, Mar 2, 2010 at 5:08 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:
> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs <jacobs at bioinformed.com>
> <bioinformed at gmail.com> wrote:
> > I've tried out the recently landed SFF SeqIO code and am pleased to
> > report that it works very well.
>
> Great :)
>
> If you have suggestions for the documentation please voice them.
> Also did the handling of trimmed reads seem sensible? Until we
> release this we can tweak the API.
I only looked at the module documentation and it was more than sufficient to
get started. I've never really used BioPython before, so I was pleasantly
surprised at how easy it was to get started. The BioPython SFF parser and
indexed access replaced a hairy process of extracting data using 454's
sffinfo and packing it into a BDB file.
> > I am parsing gsMapper 454PairAlign.txt output and
> > converting it to SAM/BAM format to view in IGV (among other things) and
> > wanted to include per-based quality score information from the SFF files.
>
> Are you reading and writing SAM/BAM format with Python? Looking
> into this is on my (long) todo list.
>
Yes-- so far I have code to populate the basic data for unpaired reads, but
none of the optional annotations. My script reads the 454 pairwise
alignment data, finds each read in the source SFF file, figures out if extra
trimming was applied by gsMapper, and extracts the matching PHRED quality
scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
FAQ). The script can output SAM records or create a subprocess to sort the
records and recode to BAM format using samtools. I've attached the current
version script and you are welcome to use it for any purpose.
> My only other comment is that several file reads and struct.unpacks can be
> > merged in _sff_read_seq_record. Given the number of records in most 454
> SFF
> > files, I suspect the micro-optimization effort will be worth the slight
> cost
> > in code clarity.
>
> [...]I guess you mean the flowgram values, flowgram index, bases
and qualities might be loaded with a single read? That would
> be worth trying.
>
Exactly! Also, flowgrams do not need to be unpacked when trimming. My own
bias is to encode the quality scores and flowgrams in numpy arrays rather
than lists, however I understand that the goal is to keep the external
dependencies to a minimum (although NumPy is required elsewhere).
Also, the test "chr(0)*padding != handle.read(padding)" could be written
just as clearly as "handle.read(padding).count('\0') != padding" and not
generate as many temporary objects.
Best regards,
-Kevin
-------------- next part --------------
# -*- coding: utf-8 -*-
# Convert 454PairAlign.txt and the corresponding SFF files into SAM/BAM format
import re
import sys
from operator import getitem, itemgetter
from itertools import izip, imap, groupby, repeat
from subprocess import Popen, PIPE
import numpy as np
try:
# Import fancy versions of basic IO functions from my GLU package
# see http://code.google.com/p/glu-genetics
from glu.lib.fileutils import autofile,hyphen,table_writer,table_reader
except ImportError:
import csv
# The real version handles automatic gz/bz2 (de)compression
autofile = file
def hyphen(filename,default):
if filename=='-' and default is not None:
return default
return filename
# Write a tab-delimited ASCII file
# The real version handles many more formats (CSV, XLS, Stata), column
# selection, header optionds, row filters, and other toys.
def table_writer(filename,hyphen=None):
if filename=='-' and hyphen is not None:
dest = hyphen
else:
dest = autofile(filename,'wb')
return csv.writer(dest, dialect='excel-tab')
# Read a tab-delimited ASCII file
# The real version handles many more formats (CSV, XLS, Stata), column
# selection, header optionds, row filters, and other toys.
def table_reader(filename,hyphen=None):
if filename=='-' and hyphen is not None:
dest = hyphen
else:
dest = autofile(filename,'rb')
return csv.reader(dest, dialect='excel-tab')
CIGAR_map = { ('-','-'):'P' }
for a in 'NACGTacgt':
CIGAR_map[a,'-'] = 'I'
CIGAR_map['-',a] = 'D'
for b in 'NACGTacgt':
CIGAR_map[a,b] = 'M'
def make_cigar_py(query,ref):
assert len(query)==len(ref)
igar = imap(getitem, repeat(CIGAR_map), izip(query,ref))
cigar = ''.join('%d%s' % (len(list(run)),code) for code,run in groupby(igar))
return cigar
# Try to import the optimized Cython version
# The Python version is pretty fast, but I wanted to play with Cython.
try:
from cigar import make_cigar
except ImportError:
make_cigar = make_cigar_py
class SFFIndex(object):
def __init__(self, sfffiles):
self.sffindex = sffindex = {}
for sfffile in sfffiles:
from Bio import SeqIO
prefix,ext = sfffile[-13:].split('.')
assert ext=='sff'
print >> sys.stderr,'Loading SFF index for',sfffile
reads = SeqIO.index(sfffile, 'sff-trim')
sffindex[prefix] = reads
def get_quality(self, qname, query, qstart, qstop):
prefix = qname[:9]
sff = self.sffindex.get(prefix)
if not sff:
return '*'
rec = sff[qname]
phred = rec.letter_annotations['phred_quality']
sffqual = np.array(phred,dtype=np.uint8)
sffqual += 33
sffqual = sffqual.tostring()
# Align the query to the original read to find the matching quality
# score information. This is complicated by the extra trimming done by
# gsMapper. We could obtain this information by parsing the
# 454TrimStatus.txt, but it is easier to search for the sub-sequence in
# the reference. Ones hopes the read maps uniquely, but this is not
# checked.
# CASE 1: Forward read alignment
if qstart<qstop:
# Try using specified cut-points
read = str(rec.seq)
seq = read[qstart-1:qstop]
# If it matches, then compute quality
if seq==query:
qual = sffqual[qstart-1:qstop]
else:
# otherwise gsMapper applied extra trimming, so we have to manually find the offset
start = read.index(query)
seq = read[start:start+len(query)]
if seq==query:
#print >> sys.stderr,'MATCHED TYPE F2: name=%s, qstart=%d(%d), qstop=%d, qlen=%d, len.query=%d' % (qname,start+1,qstart,qstop,qlen,len(query))
qual = sffqual[start:start+len(query)]
# CASE 2: Backward read alignment
else:
# Try using specified cut-points
read = str(rec.seq.complement())
seq = read[qstop-1:qstart][::-1]
read = read[::-1]
# If it matches, then compute quality
if seq==query:
qual = sffqual[qstop-1:qstart][::-1]
else:
# otherwise gsMapper applied extra trimming, so we have to manually find the offset
start = read.index(query)
seq = read[start:start+len(query)]
if seq==query:
#print >> sys.stderr,'MATCHED TYPE R2: name=%s, qstart=%d, qstop=%d(%d), qlen=%d, len.query=%d' % (qname,qstart,start+1,qstop,qlen,len(query))
qual = sffqual[::-1][start:start+len(query)]
assert seq==query
assert len(qual) == len(query)
return qual
def pair_align(filename, sffindex):
records = autofile(filename)
split = re.compile('[\t ,.]+').split
mrnm = '*'
mpos = 0
isize = 0
mapq = 60
for line in records:
assert line.startswith('>')
fields = split(line)
qname = fields[0][1:]
qstart = int(fields[1])
qstop = int(fields[2])
#qlen = int(fields[4])
rname = fields[6]
rstart = int(fields[7])
rstop = int(fields[8])
#rlen = int(fields[10])
query = split(records.next())[2]
qq = query.replace('-','')
ref = split(records.next())[2]
cigar = make_cigar(query,ref)
qual = sffindex.get_quality(qname, qq, qstart, qstop)
flag = 0
if qstart>qstop:
flag |= 0x10
if rstart>rstop:
flag |= 0x20
yield [qname, flag, rname, rstart, mapq, cigar, mrnm, mpos, isize, qq, qual]
def option_parser():
import optparse
usage = 'usage: %prog [options] 454PairAlign.txt[.gz] [SFFfiles.sff..]'
parser = optparse.OptionParser(usage=usage)
parser.add_option('-r', '--reflist', dest='reflist', metavar='FILE',
help='Reference genome contig list')
parser.add_option('-o', '--output', dest='output', metavar='FILE', default='-',
help='Output SAM file')
return parser
def main():
parser = option_parser()
options,args = parser.parse_args()
if not args:
parser.print_help(sys.stderr)
sys.exit(2)
sffindex = SFFIndex(args[1:])
alignment = pair_align(hyphen(args[0],sys.stdin), sffindex)
write_bam = options.output.endswith('.bam')
if write_bam:
if not options.reflist:
raise ValueError('Conversion to BAM format requires a reference genome contig list (-r/--reflist)')
# Creating the following two-stage pipeline deadlocks due to problems with subprocess
# -- use the shell method below instead
#sammer = Popen(['samtools','import',options.reflist,'-','-'],stdin=PIPE,stdout=PIPE)
#bammer = Popen(['samtools','sort','-', options.output[:-4]], stdin=sammer.stdout)
cmd = 'samtools import "%s" - - | samtools sort - "%s"' % (options.reflist,options.output[:-4])
bammer = Popen(cmd,stdin=PIPE,shell=True,bufsize=-1)
out = table_writer(bammer.stdin)
else:
out = table_writer(options.output,hyphen=sys.stdout)
out.writerow(['@HD', 'VN:1.0'])
if options.reflist:
reflist = table_reader(options.reflist)
for row in reflist:
if len(row)<2:
continue
contig_name = row[0]
contig_len = int(row[1])
out.writerow(['@SQ', 'SN:%s' % contig_name, 'LN:%d' % contig_len])
print >> sys.stderr, 'Generating alignment from %s to %s' % (args[0],options.output)
for qname,qalign in groupby(alignment,itemgetter(0)):
qalign = list(qalign)
if len(qalign)>1:
# Set MAPQ to 0 for multiply aligned reads
for row in qalign:
row[4] = 0
out.writerow(row)
else:
out.writerow(qalign[0])
if write_bam:
print >> sys.stderr,'Finishing BAM encoding...'
bammer.communicate()
if __name__=='__main__':
if 1:
main()
else:
try:
import cProfile as profile
except ImportError:
import profile
import pstats
prof = profile.Profile()
try:
prof.runcall(main)
finally:
stats = pstats.Stats(prof)
stats.strip_dirs()
stats.sort_stats('time', 'calls')
stats.print_stats(25)
More information about the Biopython-dev
mailing list