[BioPython] ACE contig to alignment

David Winter winda002 at student.otago.ac.nz
Tue Mar 3 22:03:36 UTC 2009


Hi all,

I'd like to start by thanking everyone that's contributed to biopython 
and especially the cookbook/tutorial - its been a great help to this 
empiricist getting into some (decidedly amateur) bioinformatics.However, 
for the first time I've run into a problem the available docs can't help 
me with.

I want to be able to represent all of the reads that contribute to a 454 
sequencing contig as a generic biopython alignment. I've written some 
code that I thought would pad/cut the reads to size and add them to an 
alignment but when I run it a significant minority of the contigs in the 
files I'm working with have misalignments. I was wondering if someone 
more familiar with the ace parser or generic alignment class could tell 
me if I'm making some elementary mistake (it is possible that original 
alignment was bad, just seems more likely I did something dumb). I can 
send along an ACE file if you want to run the script (didn't want to 
spam the list with attachments).

Thanks in advance for any pointers and I'm sorry to force people to read 
what I'm sure is inelegant code:
 
from Bio.Sequencing import Ace
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

ace_handle = open('eldoni.ace', 'r')
contigs = Ace.parse(ace_handle)
alignments = [] #start the list to which we'll add the contig data

for contig in contigs:    
  conname = contig.name + " numreads=" + str(contig.nreads)
  conlength = len(contig.sequence)
  align = Alignment(Gapped(IUPAC.ambiguous_dna, "*"))
  for readn in range(len(contig.reads)):
    start = contig.af[readn].padded_start # position rel to consensus
    if start < 1:
      # If 'start' is negative or zero we need to ignore bases
      readseq =  contig.reads[readn].rd.sequence[-1 * start+1:]
    else:
      # If it's larger then the start needs to be padded with gaps
      readseq =  (start-1) * '*' + contig.reads[readn].rd.sequence
    #Finally, pad the end then cut to size
    readseq = readseq + (conlength-len(readseq)) * '*'
    readseq = readseq[:conlength]
    align.add_sequence(readn+1, readseq)
  condata = conname, align
  alignments.append(condata)

-- 
PhD Student
Allan Wilson Centre 
Department of Zoology
University of Otago, PO Box 56, Dunedin 9054

p: +64-3-4798459
m: +64-27-3326815 
e: winda002 at student.otago.ac.nz




More information about the Biopython mailing list