[BioPython] ACE contig to alignment (found my error)

Wed Mar 4 02:40:08 UTC 2009

Hi again all,

After digging around a little more I realised the dumb mistake I made. 
In case anyone was interested and to prevent future suffering by getting 
the answer on to google:

The code as written is adding the entirety of each read to the alignment 
but when the assembly was made some reads where clipped on either side 
for quality. Including the low quality bases from each read makes some 
of the alignments nasty.  In my case "contig.reads[readn].qa" contains 
the start and end clipping points needed to get just the 'good' bases of 
each read into the alignment.

Cheers,
David

David Winter wrote:
> Hi all,
>
> I'd like to start by thanking everyone that's contributed to biopython 
> and especially the cookbook/tutorial - its been a great help to this 
> empiricist getting into some (decidedly amateur) 
> bioinformatics.However, for the first time I've run into a problem the 
> available docs can't help me with.
>
> I want to be able to represent all of the reads that contribute to a 
> 454 sequencing contig as a generic biopython alignment. I've written 
> some code that I thought would pad/cut the reads to size and add them 
> to an alignment but when I run it a significant minority of the 
> contigs in the files I'm working with have misalignments. I was 
> wondering if someone more familiar with the ace parser or generic 
> alignment class could tell me if I'm making some elementary mistake 
> (it is possible that original alignment was bad, just seems more 
> likely I did something dumb). I can send along an ACE file if you want 
> to run the script (didn't want to spam the list with attachments).
>
> Thanks in advance for any pointers and I'm sorry to force people to 
> read what I'm sure is inelegant code:
>
> from Bio.Sequencing import Ace
> from Bio.Align.Generic import Alignment
> from Bio.Alphabet import IUPAC, Gapped
>
> ace_handle = open('eldoni.ace', 'r')
> contigs = Ace.parse(ace_handle)
> alignments = [] #start the list to which we'll add the contig data
>
> for contig in contigs:     conname = contig.name + " numreads=" + 
> str(contig.nreads)
>  conlength = len(contig.sequence)
>  align = Alignment(Gapped(IUPAC.ambiguous_dna, "*"))
>  for readn in range(len(contig.reads)):
>    start = contig.af[readn].padded_start # position rel to consensus
>    if start < 1:
>      # If 'start' is negative or zero we need to ignore bases
>      readseq =  contig.reads[readn].rd.sequence[-1 * start+1:]
>    else:
>      # If it's larger then the start needs to be padded with gaps
>      readseq =  (start-1) * '*' + contig.reads[readn].rd.sequence
>    #Finally, pad the end then cut to size
>    readseq = readseq + (conlength-len(readseq)) * '*'
>    readseq = readseq[:conlength]
>    align.add_sequence(readn+1, readseq)
>  condata = conname, align
>  alignments.append(condata)