[Biopython] still more questions about NGS sequence trimming

Thu Oct 25 08:14:50 UTC 2012

On Wed, Oct 24, 2012 at 7:01 PM, Kiss, Csaba <csaba.kiss at lanl.gov> wrote:
> Thanks, Peter. I am looking at this example now:
>
> from Bio import SeqIO
> good_reads = (rec for rec in \
>               SeqIO.parse("SRR020192.fastq", "fastq") \
>               if min(rec.letter_annotations["phred_quality"]) >= 20)
> count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
> print "Saved %i reads" % count
>
> That's a rather crude quality filtering. Is there any more sophisticated options
> already in biopython? Ie. quality_average
>
> Or other options?

Average (mean) quality is easy, take the sum and divide by the length
(or in this case, I've moved the divide to a multiply on the other side
of the inequality since generally multiplication is faster than division):

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if sum(rec.letter_annotations["phred_quality"]) >= 20*len(rec))
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count

However, for most sequencing reads you'd want to use a trimming
step first as read quality tends to decline with length - the first half
might be good and the second half bad, meaning the average is
poor.

You could write a little function to do that, and slice the SeqRecord
to select the good chunk. There are examples of that in the Tutorial
for removing an adapter/adaptor or PCR primer from FASTQ files.

Peter