[Biopython] Adaptor trimmer and dimers

Brad Chapman chapmanb at 50mail.com
Wed Oct 21 12:34:22 UTC 2009


Hi Anastasia;
Thanks for the additional info.

> I also had to add an additional test for the length of the alignment output,
> as I got an index Error for the cases the adapter does not align at
> all. 

Good catch on this. I updated the trimming code to handle that case:

http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py

> My main problem now is performance of this script: On a file of 19
> million reads of 76 bp it is running for more than 12 hours! So I copy
> here my code and would be very grateful if someone could indicate parts
> where it could be sped up. 

Peter had a good suggestion on profiling. The Python profile module
is quick to learn and can quickly point you in the direction of the
most used functions:

http://docs.python.org/library/profile.html

Based on reading your code there are a couple of things that stick
out to me:

- You are calling the pairwise2 alignment 3 times. You should call
  this once, assign the alignment information to a variable, and then
  perform your if/else tests on that. The updated trimming code above 
  is a good example of doing this.

- You are slicing SeqRecord objects, and then never using the sliced
  records. Your code doesn't look like adaptor trimming, but rather
  filtering out reads without a sequence. If you don't need the
  trimmed record, pass a string (str(rec1.seq) and str(rec2.seq)) to
  the handle_adaptor function instead of the record; the slicing is
  then done on a much simpler object and you avoid the substantial 
  overhead of slicing up quality scores that are never used.

If you end up needing trimmed fastq sequences, here is how I would
reimplement your basic logic with the trimmer and Peter's suggestion:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
from adaptor_trim import trim_adaptor_w_qual

in_file = "test.fastq"
out_file = "trimmed.fastq"

in_handle = open(in_file)
out_handle = open(out_file, "w")
iterator = FastqGeneralIterator(in_handle)
adaptor = "AAAAAAAAAAAAAAAAAAAA"
num_errors = 2
while 1:
    try:
        title1, seq1, qual1 = iterator.next()
        title2, seq2, qual2 = iterator.next()
    except StopIteration:
        break

    tseq1, tqual1 = trim_adaptor_w_qual(seq1, qual1, adaptor, num_errors)
    tseq2, tqual2 = trim_adaptor_w_qual(seq2, qual2, adaptor, num_errors)

    # if neither has the adaptor
    if len(tseq1) == len(seq1) and len(tseq2) == len(seq2):
        out_handle.write("@%s\n%s\n+\n%s\n" % (title1, tseq1, tqual1))
        out_handle.write("@%s\n%s\n+\n%s\n" % (title2, tseq2, tqual2))
out_handle.close()
in_handle.close()

Hope this helps,
Brad



More information about the Biopython mailing list