[Biopython] still more questions about NGS sequenbce trimming

Peter Cock p.j.a.cock at googlemail.com
Wed Oct 24 17:27:14 UTC 2012


On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
> <s.schmeier at gmail.com> wrote:
>> A very quick and dirty approach for your reject function (I hope I
>> understood correctly) in script form:
>>
>> #!/usr/bin/env python
>> import sys, re
>> from Bio import SeqIO
>>
>> def main():
>>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>>         if not discard(str(record.seq)):
>>             SeqIO.write(record, sys.stdout, 'fasta')
>>
>> def discard(seq):
>>     oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>>     if oRes: return 1
>>     else: return 0
>>
>> if __name__ == '__main__':
>>     sys.exit(main())
>
> Minor suggestions - if you are going to use a regular expression
> many times (here once per read), compile it once first. Also
> Python defines "True" and "False" which are more natural
> than 1 and 0, but in fact you could do:
>
> def discard(seq):
>      return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))
>
> At that point defining and using a function seems a bit of
> an unnecessary overhead so:
>
> def main():
>      for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>          if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
>              SeqIO.write(record, sys.stdout, 'fasta')
>
> Next a much more important point - try and make a single
> call to SeqIO.write, with all the records (using an iterator
> approach) rather than many calls to SeqIO.write (which
> isn't supported for output in formats like SFF). This should
> be faster:

Sorry Sebastian - I had a hiccup with my mouse focus and
accidentally sent that email half finished. I meant something
like this:

def main():
    wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \
                  if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})',
str(rec.seq)))
    count = SeqIO.write(wanted, sys.stdout, 'fasta')

There are other examples of filtering sequence files in the
Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html

I hope that is useful,

Peter



More information about the Biopython mailing list