[Biopython] still more questions about NGS sequenbce trimming

Wed Oct 24 13:22:57 EDT 2012

On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
<s.schmeier at gmail.com> wrote:
> A very quick and dirty approach for your reject function (I hope I
> understood correctly) in script form:
>
> #!/usr/bin/env python
> import sys, re
> from Bio import SeqIO
>
> def main():
>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>         if not discard(str(record.seq)):
>             SeqIO.write(record, sys.stdout, 'fasta')
>
> def discard(seq):
>     oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>     if oRes: return 1
>     else: return 0
>
> if __name__ == '__main__':
>     sys.exit(main())

Minor suggestions - if you are going to use a regular expression
many times (here once per read), compile it once first. Also
Python defines "True" and "False" which are more natural
than 1 and 0, but in fact you could do:

def discard(seq):
     return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))

At that point defining and using a function seems a bit of
an unnecessary overhead so:

def main():
     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
         if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
             SeqIO.write(record, sys.stdout, 'fasta')

Next a much more important point - try and make a single
call to SeqIO.write, with all the records (using an iterator
approach) rather than many calls to SeqIO.write (which
isn't supported for output in formats like SFF). This should
be faster:

     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
         if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
             SeqIO.write(record, sys.stdout, 'fasta')