[Biopython] still more questions about NGS sequenbce trimming

Peter Cock p.j.a.cock at googlemail.com
Wed Oct 24 13:22:57 EDT 2012


On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
<s.schmeier at gmail.com> wrote:
> A very quick and dirty approach for your reject function (I hope I
> understood correctly) in script form:
>
> #!/usr/bin/env python
> import sys, re
> from Bio import SeqIO
>
> def main():
>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>         if not discard(str(record.seq)):
>             SeqIO.write(record, sys.stdout, 'fasta')
>
> def discard(seq):
>     oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>     if oRes: return 1
>     else: return 0
>
> if __name__ == '__main__':
>     sys.exit(main())

Minor suggestions - if you are going to use a regular expression
many times (here once per read), compile it once first. Also
Python defines "True" and "False" which are more natural
than 1 and 0, but in fact you could do:

def discard(seq):
     return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))

At that point defining and using a function seems a bit of
an unnecessary overhead so:

def main():
     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
         if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
             SeqIO.write(record, sys.stdout, 'fasta')

Next a much more important point - try and make a single
call to SeqIO.write, with all the records (using an iterator
approach) rather than many calls to SeqIO.write (which
isn't supported for output in formats like SFF). This should
be faster:


     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
         if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
             SeqIO.write(record, sys.stdout, 'fasta')


More information about the Biopython mailing list