[Biopython] Adaptor trimmer and dimers
Brad Chapman
chapmanb at 50mail.com
Wed Oct 21 08:34:22 EDT 2009
Hi Anastasia;
Thanks for the additional info.
> I also had to add an additional test for the length of the alignment output,
> as I got an index Error for the cases the adapter does not align at
> all.
Good catch on this. I updated the trimming code to handle that case:
http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py
> My main problem now is performance of this script: On a file of 19
> million reads of 76 bp it is running for more than 12 hours! So I copy
> here my code and would be very grateful if someone could indicate parts
> where it could be sped up.
Peter had a good suggestion on profiling. The Python profile module
is quick to learn and can quickly point you in the direction of the
most used functions:
http://docs.python.org/library/profile.html
Based on reading your code there are a couple of things that stick
out to me:
- You are calling the pairwise2 alignment 3 times. You should call
this once, assign the alignment information to a variable, and then
perform your if/else tests on that. The updated trimming code above
is a good example of doing this.
- You are slicing SeqRecord objects, and then never using the sliced
records. Your code doesn't look like adaptor trimming, but rather
filtering out reads without a sequence. If you don't need the
trimmed record, pass a string (str(rec1.seq) and str(rec2.seq)) to
the handle_adaptor function instead of the record; the slicing is
then done on a much simpler object and you avoid the substantial
overhead of slicing up quality scores that are never used.
If you end up needing trimmed fastq sequences, here is how I would
reimplement your basic logic with the trimmer and Peter's suggestion:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from adaptor_trim import trim_adaptor_w_qual
in_file = "test.fastq"
out_file = "trimmed.fastq"
in_handle = open(in_file)
out_handle = open(out_file, "w")
iterator = FastqGeneralIterator(in_handle)
adaptor = "AAAAAAAAAAAAAAAAAAAA"
num_errors = 2
while 1:
try:
title1, seq1, qual1 = iterator.next()
title2, seq2, qual2 = iterator.next()
except StopIteration:
break
tseq1, tqual1 = trim_adaptor_w_qual(seq1, qual1, adaptor, num_errors)
tseq2, tqual2 = trim_adaptor_w_qual(seq2, qual2, adaptor, num_errors)
# if neither has the adaptor
if len(tseq1) == len(seq1) and len(tseq2) == len(seq2):
out_handle.write("@%s\n%s\n+\n%s\n" % (title1, tseq1, tqual1))
out_handle.write("@%s\n%s\n+\n%s\n" % (title2, tseq2, tqual2))
out_handle.close()
in_handle.close()
Hope this helps,
Brad
More information about the Biopython
mailing list