[Biopython] Trimming adaptors sequences

Brad Chapman chapmanb at 50mail.com
Thu Aug 13 08:44:32 EDT 2009


Hi Peter;

> Is your aim to have a very fast pipeline, or an understandable
> reference implementation (a worked example)? If this is for a real
> pipeline, does it have to be FASTQ to FASTA?

Ideally both. It is motivated by fitting into an experiment I am
analyzing, but the purpose of the blog posting is to try and explain
the logic and solicit feedback.

In terms of the work, I don't need fastq downstream so was going the
easier fasta route. But I can certainly see myself needing fastq in
the future so prefer to be generalized.

> Further to your blog comment about slicing SeqRecord objects slowing
> things down, I agree - if you don't need the qualities, then having to
> slice them is a pointless overhead. As usual in programming, there are
> several options trading off elegant/general for speed. Personally I
> would want to keep the qualities for the assembly/mapping step.

Agreed. Unfortunately, it was unusably slow with the slicing as
currently implemented: it ran for about 16 hours and was 1/3 of the
way finished so was looking like a 2 day run, or about 12x slower
than the reference implementation.

> However, if you just want speed AND you really want to have a FASTQ
> input file, try the underlying
> Bio.SeqIO.QualityIO.FastqGeneralIterator parser which gives plain
> strings, and handle the output yourself. Working directly with Python
> strings is going to be faster than using Seq and SeqRecord objects.
> You can even opt for outputting FASTQ files - as long as you leave the
> qualities as an encoded string, you can just slice that too. The
> downside is the code will be very specific. e.g. something along these
> lines:
> 
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> in_handle = open(input_fastq_filename)
> out_handle = open(output_fastq_filename, "w")
> for title, seq, qual in FastqGeneralIterator(in_handle) :
>     #Do trim logic here on the string seq
>     if trim :
>         seq = seq[start:end]
>         qual = qual[start:end] # kept as ASCII string!
>     #Save the (possibly trimmed) FASTQ record:
>     out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
> out_handle.close()
> in_handle.close()

Nice -- I will have to play with this. I hadn't dug into the current
SeqRecord slicing code at all but I wonder if there is a way to keep
the SeqRecord interface but incorporate some of these speed ups for
common cases like this FASTQ trimming.

Brad


More information about the Biopython mailing list