[Biopython] still more questions about NGS sequence trimming
Kiss, Csaba
csaba.kiss at lanl.gov
Wed Oct 24 14:01:23 EDT 2012
Thanks, Peter. I am looking at this example now:
from Bio import SeqIO
good_reads = (rec for rec in \
SeqIO.parse("SRR020192.fastq", "fastq") \
if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count
That's a rather crude quality filtering. Is there any more sophisticated options already in biopython?
Ie. quality_average
Or other options?
-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Wednesday, October 24, 2012 11:27 AM
To: Sebastian Schmeier
Cc: Kiss, Csaba; biopython at lists.open-bio.org
Subject: Re: [Biopython] still more questions about NGS sequenbce trimming
On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
> <s.schmeier at gmail.com> wrote:
>> A very quick and dirty approach for your reject function (I hope I
>> understood correctly) in script form:
>>
>> #!/usr/bin/env python
>> import sys, re
>> from Bio import SeqIO
>>
>> def main():
>> for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>> if not discard(str(record.seq)):
>> SeqIO.write(record, sys.stdout, 'fasta')
>>
>> def discard(seq):
>> oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>> if oRes: return 1
>> else: return 0
>>
>> if __name__ == '__main__':
>> sys.exit(main())
>
> Minor suggestions - if you are going to use a regular expression many
> times (here once per read), compile it once first. Also Python defines
> "True" and "False" which are more natural than 1 and 0, but in fact
> you could do:
>
> def discard(seq):
> return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))
>
> At that point defining and using a function seems a bit of an
> unnecessary overhead so:
>
> def main():
> for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
> if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
> SeqIO.write(record, sys.stdout, 'fasta')
>
> Next a much more important point - try and make a single call to
> SeqIO.write, with all the records (using an iterator
> approach) rather than many calls to SeqIO.write (which isn't supported
> for output in formats like SFF). This should be faster:
Sorry Sebastian - I had a hiccup with my mouse focus and accidentally sent that email half finished. I meant something like this:
def main():
wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \
if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})',
str(rec.seq)))
count = SeqIO.write(wanted, sys.stdout, 'fasta')
There are other examples of filtering sequence files in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html
I hope that is useful,
Peter
More information about the Biopython
mailing list