[Biopython] Trimming adaptors sequences

Peter biopython at maubp.freeserve.co.uk
Wed Aug 12 19:21:36 EDT 2009


On Mon, Aug 10, 2009 at 2:16 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>
> Agreed. I like the examples and was thinking of this as an extension
> of the exact matching approach. I am definitely happy to roll this
> or some derivative of it into Biopython.

Hi Brad,

Is your aim to have a very fast pipeline, or an understandable
reference implementation (a worked example)? If this is for a real
pipeline, does it have to be FASTQ to FASTA?

Further to your blog comment about slicing SeqRecord objects slowing
things down, I agree - if you don't need the qualities, then having to
slice them is a pointless overhead. As usual in programming, there are
several options trading off elegant/general for speed. Personally I
would want to keep the qualities for the assembly/mapping step.

While keeping things general, as you don't care about the qualities,
you could do the whole operation on FASTA files which are faster to
read in and when you slice the resulting SeqRecord you don't have the
overhead of the slicing the qualities.

However, if you just want speed AND you really want to have a FASTQ
input file, try the underlying
Bio.SeqIO.QualityIO.FastqGeneralIterator parser which gives plain
strings, and handle the output yourself. Working directly with Python
strings is going to be faster than using Seq and SeqRecord objects.
You can even opt for outputting FASTQ files - as long as you leave the
qualities as an encoded string, you can just slice that too. The
downside is the code will be very specific. e.g. something along these
lines:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
in_handle = open(input_fastq_filename)
out_handle = open(output_fastq_filename, "w")
for title, seq, qual in FastqGeneralIterator(in_handle) :
    #Do trim logic here on the string seq
    if trim :
        seq = seq[start:end]
        qual = qual[start:end] # kept as ASCII string!
    #Save the (possibly trimmed) FASTQ record:
    out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
out_handle.close()
in_handle.close()

Note that FastqGeneralIterator is already in Biopython 1.50 and 1.51b,
but is now a bit faster in CVS/github (what will be Biopython 1.51).

Peter


More information about the Biopython mailing list