[Biopython] Trimming adaptors sequences

Mon Aug 10 13:16:50 UTC 2009

Hi Peter;

> Brad's got an interesting blog post up on using Biopython for trimming
> adaptors for next gen sequencing reads, using Bio.pairwise2 for
> pairwise alignments between the adaptor and the reads:
> 
> http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/
> 
> The basic idea is similar to what Giles Weaver was describing last
> month, although Giles was using EMBOSS needle to do a global pairwise
> alignment via BioPerl:
> http://lists.open-bio.org/pipermail/biopython/2009-July/005338.html

Yes, same idea. When I started messing with this I was thinking I
could be tricky and get something that avoided doing alignments and
would be faster. Unfortunately I didn't have good luck with the pure
string based approaches.

> We already had a simple FASTQ "primer trimming" example in the
> tutorial, which I have just extended to add a more general FASTQ
> "adaptor trimming" example. For this I am deliberately only looking
> for exact matches. This is faster of course, but it also makes the
> example much more easily understood as well - something important for
> an introductory example.

Agreed. I like the examples and was thinking of this as an extension
of the exact matching approach. I am definitely happy to roll this
or some derivative of it into Biopython.

> A full cookbook example of how to use pairwise alignments would seem
> like a great idea for a cookbook entry on the wiki. It would be
> interesting to see which is faster - using EMBOSS needle/water or
> Bio.pairwise2. Both are written in C, but using EMBOSS we'd have the
> overhead of parsing the output file.

In terms of speed, I was thinking of this as a good target
for parallelization using the multiprocessing library
(http://docs.python.org/library/multiprocessing.html) but didn't have
time yet to look into that.

> Brad - why are you using a local alignment and not a global alignment?
> Shouldn't we be looking for the entire adaptor sequence? It looks like
> you don't consider the the unaligned parts of the adaptor when you
> count the mismatches - is this a bug? 

Good call -- this should consider the number of matches in the
aligning region to the full adaptor to see if we've got it. This is
fixed in the GitHub version now. Thanks for pointing it out.

> I wonder if it would be simpler (and faster) to take a score based threshold.

Maybe, but I find comfort in being able to describe the algorithms simply: any 
matches to the adaptor with 2 or less errors. I'd imagine most of
the time is being taken up doing the actual alignment work.

Thanks for the feedback on this. It was really helpful,
Brad