[Biopython] Adaptor trimmer and dimers

Fri Oct 23 08:28:43 EDT 2009

Hi Anastasia;

> Again, not very clean code as I have been oscillating between
> trimming/removing  for some days now. I finally decided that if I
> don't have a big proportion of nearly exact (max 2 errors) matches
> to the adaptor in my reads, I may just discard them, as trimming a
> 33/37 bp adaptor from a 55-bp read does not leave much anyway. 
>
> The revised script
> (for removing, but taking into account all your suggestions, so using
> the general iterator) is still running for very long, 

This was written with the idea that the adaptor would be present in
most of the sequences. This was the case with the data I was using
it on -- expression profiling with short tags -- but does not sound
like what you are tackling here. My approach speeds up the trimming
by avoiding doing local alignments for many reads since an exact
match is often found. Only in cases where the adaptor is missing or
has one or more sequencing errors does the expensive local alignment
need to be done.

If most reads do not have adaptors, then this approach is
algorithmically slow. Doing a local alignment for nearly every read
is going to take time, independent of the implementation. Profiling
this should reveal most of the time is spent in pairwise alignment.

My suggestion would be to use a heuristic seed-based approach similar to
what short query aligners do:

- Break your adaptor into three smaller seed regions of 12bp
- For each read:
  - Do a fast string find() with the seed regions to the read
  - If two or more of the seed regions match exactly, discard the
    read

This will run much quicker and should catch a majority of the cases
where you have reads. Regions with lots of errors, or errors spaced
evenly through the adaptor, will be missed. Making the code
tractable is probably worth that few that you'll let through.

Hope this helps,
Brad