[Biopython] still more questions about NGS sequence trimming

Thu Oct 25 14:49:59 UTC 2012

Thanks, Peter. I am writing my quality functions. Another question about trimming. As you mentioned, the quality of the ends tend to be lower than in the middle. Could that be fixed just by using "sff-trim" when I create my FASTQ file? If I don't do that I get sequences with small and capital letters. Are you suggesting further trimming than just "sff-trim".

Csaba 

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com] 
Sent: Thursday, October 25, 2012 2:15 AM
To: Kiss, Csaba
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] still more questions about NGS sequence trimming

On Wed, Oct 24, 2012 at 7:01 PM, Kiss, Csaba <csaba.kiss at lanl.gov> wrote:
> Thanks, Peter. I am looking at this example now:
>
> from Bio import SeqIO
> good_reads = (rec for rec in \
>               SeqIO.parse("SRR020192.fastq", "fastq") \
>               if min(rec.letter_annotations["phred_quality"]) >= 20) 
> count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") print 
> "Saved %i reads" % count
>
> That's a rather crude quality filtering. Is there any more 
> sophisticated options already in biopython? Ie. quality_average
>
> Or other options?

Average (mean) quality is easy, take the sum and divide by the length (or in this case, I've moved the divide to a multiply on the other side of the inequality since generally multiplication is faster than division):

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if sum(rec.letter_annotations["phred_quality"]) >= 20*len(rec)) count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") print "Saved %i reads" % count

However, for most sequencing reads you'd want to use a trimming step first as read quality tends to decline with length - the first half might be good and the second half bad, meaning the average is poor.

You could write a little function to do that, and slice the SeqRecord to select the good chunk. There are examples of that in the Tutorial for removing an adapter/adaptor or PCR primer from FASTQ files.

Peter