[Biopython-dev] Merging Bio.SeqIO SFF support?

Tue Mar 2 12:28:22 UTC 2010

On Tue, Mar 2, 2010 at 5:08 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs <jacobs at bioinformed.com>
> <bioinformed at gmail.com> wrote:
> > I've tried out the recently landed SFF SeqIO code and am pleased to
> > report that it works very well.
>
> Great :)
>
> If you have suggestions for the documentation please voice them.
> Also did the handling of trimmed reads seem sensible? Until we
> release this we can tweak the API.

I only looked at the module documentation and it was more than sufficient to
get started.  I've never really used BioPython before, so I was pleasantly
surprised at how easy it was to get started.  The BioPython SFF parser and
indexed access replaced a hairy process of extracting data using 454's
sffinfo and packing it into a BDB file.

> > I am parsing gsMapper 454PairAlign.txt output and
> > converting it to SAM/BAM format to view in IGV (among other things) and
> > wanted to include per-based quality score information from the SFF files.
>
> Are you reading and writing SAM/BAM format with Python? Looking
> into this is on my (long) todo list.
>

Yes-- so far I have code to populate the basic data for unpaired reads, but
none of the optional annotations.  My script reads the 454 pairwise
alignment data, finds each read in the source SFF file, figures out if extra
trimming was applied by gsMapper, and extracts the matching PHRED quality
scores.  Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
FAQ).  The script can output SAM records or create a subprocess to sort the
records and recode to BAM format using samtools.  I've attached the current
version script and you are welcome to use it for any purpose.

> My only other comment is that several file reads and struct.unpacks can be
> > merged in _sff_read_seq_record.  Given the number of records in most 454
> SFF
> > files, I suspect the micro-optimization effort will be worth the slight
> cost
> > in code clarity.
>
> [...]I guess you mean the flowgram values, flowgram index, bases

and qualities might be loaded with a single read? That would
> be worth trying.
>

Exactly!  Also, flowgrams do not need to be unpacked when trimming.  My own
bias is to encode the quality scores and flowgrams in numpy arrays rather
than lists, however I understand that the goal is to keep the external
dependencies to a minimum (although NumPy is required elsewhere).

Also, the test "chr(0)*padding != handle.read(padding)" could be written
just as clearly as "handle.read(padding).count('\0') != padding" and not
generate as many temporary objects.

Best regards,
-Kevin
-------------- next part --------------
# -*- coding: utf-8 -*-
# Convert 454PairAlign.txt and the corresponding SFF files into SAM/BAM format

import re
import sys

from   operator          import getitem, itemgetter
from   itertools         import izip, imap, groupby, repeat
from   subprocess        import Popen, PIPE

import numpy as np

try:
  # Import fancy versions of basic IO functions from my GLU package
  # see http://code.google.com/p/glu-genetics
  from   glu.lib.fileutils import autofile,hyphen,table_writer,table_reader

except ImportError:
  import csv

  # The real version handles automatic gz/bz2 (de)compression
  autofile = file

  def hyphen(filename,default):
    if filename=='-' and default is not None:
      return default
    return filename

  # Write a tab-delimited ASCII file
  # The real version handles many more formats (CSV, XLS, Stata), column
  # selection, header optionds, row filters, and other toys.
  def table_writer(filename,hyphen=None):
    if filename=='-' and hyphen is not None:
      dest = hyphen
    else:
      dest = autofile(filename,'wb')

    return csv.writer(dest, dialect='excel-tab')

  # Read a tab-delimited ASCII file
  # The real version handles many more formats (CSV, XLS, Stata), column
  # selection, header optionds, row filters, and other toys.
  def table_reader(filename,hyphen=None):
    if filename=='-' and hyphen is not None:
      dest = hyphen
    else:
      dest = autofile(filename,'rb')

    return csv.reader(dest, dialect='excel-tab')

CIGAR_map = { ('-','-'):'P' }
for a in 'NACGTacgt':
  CIGAR_map[a,'-'] = 'I'
  CIGAR_map['-',a] = 'D'
  for b in 'NACGTacgt':
    CIGAR_map[a,b] = 'M'

def make_cigar_py(query,ref):
  assert len(query)==len(ref)
  igar  = imap(getitem, repeat(CIGAR_map), izip(query,ref))
  cigar = ''.join('%d%s' % (len(list(run)),code) for code,run in groupby(igar))
  return cigar

# Try to import the optimized Cython version
# The Python version is pretty fast, but I wanted to play with Cython.
try:
  from cigar import make_cigar
except ImportError:
  make_cigar = make_cigar_py

class SFFIndex(object):
  def __init__(self, sfffiles):
    self.sffindex = sffindex = {}
    for sfffile in sfffiles:
      from Bio import SeqIO
      prefix,ext = sfffile[-13:].split('.')
      assert ext=='sff'
      print >> sys.stderr,'Loading SFF index for',sfffile
      reads      = SeqIO.index(sfffile, 'sff-trim')
      sffindex[prefix] = reads

  def get_quality(self, qname, query, qstart, qstop):
    prefix = qname[:9]
    sff = self.sffindex.get(prefix)

    if not sff:
      return '*'

    rec      = sff[qname]
    phred    = rec.letter_annotations['phred_quality']
    sffqual  = np.array(phred,dtype=np.uint8)
    sffqual += 33
    sffqual  = sffqual.tostring()

    # Align the query to the original read to find the matching quality
    # score information.  This is complicated by the extra trimming done by
    # gsMapper.  We could obtain this information by parsing the
    # 454TrimStatus.txt, but it is easier to search for the sub-sequence in
    # the reference.  Ones hopes the read maps uniquely, but this is not
    # checked.

    # CASE 1: Forward read alignment
    if qstart<qstop:
      # Try using specified cut-points
      read = str(rec.seq)
      seq  = read[qstart-1:qstop]

      # If it matches, then compute quality
      if seq==query:
        qual  = sffqual[qstart-1:qstop]
      else:
        # otherwise gsMapper applied extra trimming, so we have to manually find the offset
        start = read.index(query)
        seq   = read[start:start+len(query)]
        if seq==query:
          #print >> sys.stderr,'MATCHED TYPE F2: name=%s, qstart=%d(%d), qstop=%d, qlen=%d, len.query=%d' % (qname,start+1,qstart,qstop,qlen,len(query))
          qual  = sffqual[start:start+len(query)]

    # CASE 2: Backward read alignment
    else:
      # Try using specified cut-points
      read = str(rec.seq.complement())
      seq  = read[qstop-1:qstart][::-1]
      read = read[::-1]

      # If it matches, then compute quality
      if seq==query:
        qual  = sffqual[qstop-1:qstart][::-1]
      else:
        # otherwise gsMapper applied extra trimming, so we have to manually find the offset
        start = read.index(query)
        seq   = read[start:start+len(query)]

        if seq==query:
          #print >> sys.stderr,'MATCHED TYPE R2: name=%s, qstart=%d, qstop=%d(%d), qlen=%d, len.query=%d' % (qname,qstart,start+1,qstop,qlen,len(query))
          qual = sffqual[::-1][start:start+len(query)]

    assert seq==query
    assert len(qual) == len(query)
    return qual

def pair_align(filename, sffindex):
  records = autofile(filename)
  split   = re.compile('[\t ,.]+').split
  mrnm    = '*'
  mpos    = 0
  isize   = 0
  mapq    = 60

  for line in records:
    assert line.startswith('>')

    fields  = split(line)
    qname   = fields[0][1:]
    qstart  = int(fields[1])
    qstop   = int(fields[2])
    #qlen   = int(fields[4])
    rname   = fields[6]
    rstart  = int(fields[7])
    rstop   = int(fields[8])
    #rlen   = int(fields[10])
    query   = split(records.next())[2]
    qq      = query.replace('-','')
    ref     = split(records.next())[2]
    cigar   = make_cigar(query,ref)
    qual    = sffindex.get_quality(qname, qq, qstart, qstop)
    flag    = 0
    if qstart>qstop:
      flag |= 0x10
    if rstart>rstop:
      flag |= 0x20

    yield [qname, flag, rname, rstart, mapq, cigar, mrnm, mpos, isize, qq, qual]

def option_parser():
  import optparse

  usage = 'usage: %prog [options] 454PairAlign.txt[.gz] [SFFfiles.sff..]'
  parser = optparse.OptionParser(usage=usage)

  parser.add_option('-r', '--reflist', dest='reflist', metavar='FILE',
                    help='Reference genome contig list')
  parser.add_option('-o', '--output', dest='output', metavar='FILE', default='-',
                    help='Output SAM file')
  return parser

def main():
  parser = option_parser()
  options,args = parser.parse_args()

  if not args:
    parser.print_help(sys.stderr)
    sys.exit(2)

  sffindex  = SFFIndex(args[1:])
  alignment = pair_align(hyphen(args[0],sys.stdin), sffindex)

  write_bam = options.output.endswith('.bam')

  if write_bam:
    if not options.reflist:
      raise ValueError('Conversion to BAM format requires a reference genome contig list (-r/--reflist)')

    # Creating the following two-stage pipeline deadlocks due to problems with subprocess
    # -- use the shell method below instead
    #sammer = Popen(['samtools','import',options.reflist,'-','-'],stdin=PIPE,stdout=PIPE)
    #bammer = Popen(['samtools','sort','-', options.output[:-4]], stdin=sammer.stdout)

    cmd    = 'samtools import "%s" - - | samtools sort - "%s"' % (options.reflist,options.output[:-4])
    bammer = Popen(cmd,stdin=PIPE,shell=True,bufsize=-1)
    out    = table_writer(bammer.stdin)
  else:
    out = table_writer(options.output,hyphen=sys.stdout)

  out.writerow(['@HD', 'VN:1.0'])

  if options.reflist:
    reflist = table_reader(options.reflist)
    for row in reflist:
      if len(row)<2:
        continue

      contig_name = row[0]
      contig_len  = int(row[1])

      out.writerow(['@SQ', 'SN:%s' % contig_name, 'LN:%d' % contig_len])

  print >> sys.stderr, 'Generating alignment from %s to %s' % (args[0],options.output)

  for qname,qalign in groupby(alignment,itemgetter(0)):
    qalign = list(qalign)

    if len(qalign)>1:
      # Set MAPQ to 0 for multiply aligned reads
      for row in qalign:
        row[4] = 0
        out.writerow(row)
    else:
      out.writerow(qalign[0])

  if write_bam:
    print >> sys.stderr,'Finishing BAM encoding...'
    bammer.communicate()

if __name__=='__main__':
  if 1:
    main()
  else:
    try:
      import cProfile as profile
    except ImportError:
      import profile
    import pstats

    prof = profile.Profile()
    try:
      prof.runcall(main)
    finally:
      stats = pstats.Stats(prof)
      stats.strip_dirs()
      stats.sort_stats('time', 'calls')
      stats.print_stats(25)