[Biopython] Ace Contig parsing Bio.AlignIO

Thu Nov 26 15:37:09 UTC 2015

Thanks Peter,

we also discussed using MIRA, but we have a workaround with GNU parallel to 'multithread' cap3 on our linux systems and are satisfied with the speed and accuracy of cap3 in this setting.

For all interested in ace parsing with consensus stripping, here is my script:

eike

#! /usr/bin/env python
# Script based on Biopython cookbook example http://biopython.org/wiki/ACE_contig_to_alignment
# which is adapted to multiple input files and multiple contig entries in the
# .ace file
# author: Eike Mayland-Quellhorst
#         Carl-von-Ossietzky Universität Oldenburg
#         eike.mayland.quellhorst at uni-oldenburg.de
#         2015-11-26

Usage = '''
ace_consensus.py - version 1.0
This script take a cap3 (Phrap) .ace alignment and calculates a quick 90%
consensus. Output will be the consensus sequences of contigs in a fasta file,
naming will be the original file name with .con as appendix.

Usage:
     ace_consensus.py *.ace
'''

from Bio.Sequencing import Ace
from Bio.Align import MultipleSeqAlignment
from Bio.Align import AlignInfo
from Bio import AlignIO
from Bio.Alphabet import IUPAC
from Bio.Alphabet import Gapped
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

def cut_ends(read, start, end):
   '''Replace residues on either end of a sequence with gaps.

   In this case we want to cut out the sections of each read which the assembler has
   decided are not good enough to include in the contig and replace them with gaps
   so the other information about positions in the read is maintained
   '''
   return (start-1) * '-' + read[start-1:end] + (len(read)-end) * '-'

def pad_read(read, start, conlength):
   ''' Pad out either end of a read so it fits into an alignment.

   The start argument is the position of the first base of the reads sequence in
   the contig it is part of. If the start value is negative (or 0 since ACE
   files count from 1, not 0) we need to take some sequence off the start
   otherwise each end is padded to the length of the consensus with gaps.
   '''
   if start < 1:
     seq = read[-1*start+1:]
   else:
     seq = (start-1) * '-' + read
   if len(read) < conlength:
     seq = seq + (conlength-len(seq)) * '-'
   else:
     seq = seq[-1*start+1:conlength]
   return seq

if len(sys.argv)<1:
     print Usage
else:
     # take files
     FileList = sys.argv[1:]
     MasterList = []
     FileNum = 0
     for InFileName in FileList:
     # We will use the Ace parser to read individual contigs from file. Be aware
     # that using this iterator can miss WA, CT, RT and WR tags (which can be
     # anywhere in the file, e.g. the end). Read the file specification here:
     # http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt
     # If you need these tags you'll need to use  Ace.read() (and lots of RAM).
         InFile = open(InFileName, 'r')
         ace_gen = Ace.parse(InFile)
         contigs = list(ace_gen)
         consensus_records = list()
         # create consensus file .con with original Filename outputfile
         output_handle = open("%s.con" % InFileName, "a")
         # Now we have started our alignment we can add sequences to it,
         # we will loop through contig's reads and get quality clipping from
         # .reads[readnumber].qa and the position of each read in the contig
         # .af[readnumber].padded_start and use the functions above to cut and
         # pad the sequences before they are added
         for contig in contigs:
             align = MultipleSeqAlignment([])
             for readn in range(contig.nreads):
                 print "hier"
                 clipst = contig.reads[readn].qa.qual_clipping_start
                 clipe = contig.reads[readn].qa.qual_clipping_end
                 start = contig.af[readn].padded_start
                 seq = cut_ends(contig.reads[readn].rd.sequence, clipst, clipe)
                 seq = pad_read(seq, start, len(contig.sequence))
                 print len(seq)
                 seqrecord = SeqRecord(Seq(seq, generic_dna), id=contig.name)
                 print seqrecord.id
                 align.append(seqrecord)
                 summary_align = AlignInfo.SummaryInfo(align)
             consensus = summary_align.dumb_consensus(threshold=0.9, ambiguous='N',require_multiple=10)
             consensus_name = contig.name
             consensus_record = SeqRecord(consensus, id=consensus_name)
             consensus_records.append(consensus_record)
         SeqIO.write(consensus_records, output_handle, "fasta")
         output_handle.close()

Am 26.11.2015 um 15:37 schrieb Peter Cock:
> That's good news Eike,
>
> For anyone interested, I think Eike was referring to these old threads:
>
> http://lists.open-bio.org/pipermail/biopython/2008-June/010448.html
> http://lists.open-bio.org/pipermail/biopython-dev/2008-June/012957.html
> http://lists.open-bio.org/pipermail/biopython/2009-April/011305.html
>
> The result was SeqIO has support for ACE files where you get a
> SeqRecord using the consensus sequence (and ignoring all the
> supporting reads), using the ACE parser under Bio.Sequencing.
>
> It would still be possible to add AlignIO support returning each
> contig as a multiple sequence alignment containing all the supporting
> reads, based on the recipe from David Winters:
>
> http://biopython.org/wiki/ACE_contig_to_alignment
>
> I've not used ACE files for a long time now, and you are the first
> person to ask about this in a while now. However, since we already
> have the ACE parser, the new code for Bio.AlignIO would not be
> that much work - and it sounds like you're willing to test this on
> real world data which is always useful :)
>
> David - are you still interested in this?
>
> (Separately I have wondered about adding support for MIRA's
> alignment format which is a bit like the ACE format - although
> now MIRA can output SAM, I have tended to use that instead.)
>
> Peter
>
> On Thu, Nov 26, 2015 at 2:01 PM, Eike Mayland-Quellhorst
> <eike.mayland.quellhorst at uni-oldenburg.de> wrote:
> > Sorry, I found my mistake (missplaced align object, so it was not updated
> > after each iteration), so my script is working now. But in gerenal I would
> > like to see an option in AlignIO for .ace files, as we like the old Cap3
> > assembler for assembling MiSeq Amplicon data (in a pipe starting with BBmap
> > tools, cap3, filtering, MAFFT alignment with reference sequence of locus,
> > visual inspection of alignments).
> >
> > all the best
> > Eike Mayland-Quellhorst
> >
> >
> >
> > Am 26.11.2015 um 14:06 schrieb Eike Mayland-Quellhorst:
> >>
> >> Dear list, dear Peter (as you have asked years ago in 2008 whether
> >> implementing into SeqIO or AlignIO),
> >>
> >> for a Amplicon sequencing project I am doing Cap3 assemblies with
> >> pre-treated sequences. The consensus of these assemblies should go into
> >> MAFFT alignments. But as input I would like to have consensus sequences
> >> with ambiguities as produced with AlignInfo:dumb_consensus. For now I am
> >> struggling with iterating through the contigs and produce this
> >> information. I tried the cookbook example (and modifications of it), but
> >> I am not able to produce an object and consensus for each contig. The
> >> cookbook script seems to expect only one individual contig in the file.
> >> Is there a way to get the .ace contigs as an alignment, with which I
> >> could make the dumb_consensus, which I can export as fasta to pipe it to
> >> MAFFT?
> >>
> >> with best regards
> >> Eike Mayland-Quellhorst
> >>
> >
> > --
> > __________________________________
> > Eike Mayland-Quellhorst
> > AG Biodiversität und Evolution der Pflanzen
> > Fakultät V - IBU
> > Carl von Ossietzky Universität Oldenburg
> >
> > Carl-von-Ossietzky Str. 9-11
> > 26111 Oldenburg
> > Germany
> >
> > Tel. 0049-441-798-3386
> >
> >
> > _______________________________________________
> > Biopython mailing list  -  Biopython at mailman.open-bio.org
> > http://mailman.open-bio.org/mailman/listinfo/biopython

-- 
__________________________________
Eike Mayland-Quellhorst
AG Biodiversität und Evolution der Pflanzen
Fakultät V - IBU
Carl von Ossietzky Universität Oldenburg

Carl-von-Ossietzky Str. 9-11
26111 Oldenburg
Germany

Tel. 0049-441-798-3386

-------------- next part --------------
A non-text attachment was scrubbed...
Name: eike_mayland_quellhorst.vcf
Type: text/x-vcard
Size: 394 bytes
Desc: not available
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151126/f4dad612/attachment.vcf>