[Biopython] still more questions about NGS sequence trimming

Wed Oct 24 18:01:23 UTC 2012

Thanks, Peter. I am looking at this example now:

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count

That's a rather crude quality filtering. Is there any more sophisticated options already in biopython? 
Ie. quality_average

Or other options?

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com] 
Sent: Wednesday, October 24, 2012 11:27 AM
To: Sebastian Schmeier
Cc: Kiss, Csaba; biopython at lists.open-bio.org
Subject: Re: [Biopython] still more questions about NGS sequenbce trimming

On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier 
> <s.schmeier at gmail.com> wrote:
>> A very quick and dirty approach for your reject function (I hope I 
>> understood correctly) in script form:
>>
>> #!/usr/bin/env python
>> import sys, re
>> from Bio import SeqIO
>>
>> def main():
>>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>>         if not discard(str(record.seq)):
>>             SeqIO.write(record, sys.stdout, 'fasta')
>>
>> def discard(seq):
>>     oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>>     if oRes: return 1
>>     else: return 0
>>
>> if __name__ == '__main__':
>>     sys.exit(main())
>
> Minor suggestions - if you are going to use a regular expression many 
> times (here once per read), compile it once first. Also Python defines 
> "True" and "False" which are more natural than 1 and 0, but in fact 
> you could do:
>
> def discard(seq):
>      return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))
>
> At that point defining and using a function seems a bit of an 
> unnecessary overhead so:
>
> def main():
>      for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>          if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
>              SeqIO.write(record, sys.stdout, 'fasta')
>
> Next a much more important point - try and make a single call to 
> SeqIO.write, with all the records (using an iterator
> approach) rather than many calls to SeqIO.write (which isn't supported 
> for output in formats like SFF). This should be faster:

Sorry Sebastian - I had a hiccup with my mouse focus and accidentally sent that email half finished. I meant something like this:

def main():
    wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \
                  if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})',
str(rec.seq)))
    count = SeqIO.write(wanted, sys.stdout, 'fasta')

There are other examples of filtering sequence files in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html

I hope that is useful,

Peter