[Biopython] still more questions about NGS sequenbce trimming
Peter Cock
p.j.a.cock at googlemail.com
Wed Oct 24 13:27:14 EDT 2012
On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
> <s.schmeier at gmail.com> wrote:
>> A very quick and dirty approach for your reject function (I hope I
>> understood correctly) in script form:
>>
>> #!/usr/bin/env python
>> import sys, re
>> from Bio import SeqIO
>>
>> def main():
>> for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>> if not discard(str(record.seq)):
>> SeqIO.write(record, sys.stdout, 'fasta')
>>
>> def discard(seq):
>> oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>> if oRes: return 1
>> else: return 0
>>
>> if __name__ == '__main__':
>> sys.exit(main())
>
> Minor suggestions - if you are going to use a regular expression
> many times (here once per read), compile it once first. Also
> Python defines "True" and "False" which are more natural
> than 1 and 0, but in fact you could do:
>
> def discard(seq):
> return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))
>
> At that point defining and using a function seems a bit of
> an unnecessary overhead so:
>
> def main():
> for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
> if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
> SeqIO.write(record, sys.stdout, 'fasta')
>
> Next a much more important point - try and make a single
> call to SeqIO.write, with all the records (using an iterator
> approach) rather than many calls to SeqIO.write (which
> isn't supported for output in formats like SFF). This should
> be faster:
Sorry Sebastian - I had a hiccup with my mouse focus and
accidentally sent that email half finished. I meant something
like this:
def main():
wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \
if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})',
str(rec.seq)))
count = SeqIO.write(wanted, sys.stdout, 'fasta')
There are other examples of filtering sequence files in the
Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html
I hope that is useful,
Peter
More information about the Biopython
mailing list