[Biopython-dev] [Biopython] Trimming adaptors sequences

Thu Aug 13 13:02:17 UTC 2009

On Thu, Aug 13, 2009 at 1:44 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>> However, if you just want speed AND you really want to have a FASTQ
>> input file, try the underlying  Bio.SeqIO.QualityIO.FastqGeneralIterator
>> parser which gives plain strings, and handle the output yourself. Working
>> directly with Python strings is going to be faster than using Seq and
>> SeqRecord objects. You can even opt for outputting FASTQ files - as
>> long as you leave the qualities as an encoded string, you can just slice
>> that too. The downside is the code will be very specific. e.g. something
>> along these lines:
>>
>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>> in_handle = open(input_fastq_filename)
>> out_handle = open(output_fastq_filename, "w")
>> for title, seq, qual in FastqGeneralIterator(in_handle) :
>>     #Do trim logic here on the string seq
>>     if trim :
>>         seq = seq[start:end]
>>         qual = qual[start:end] # kept as ASCII string!
>>     #Save the (possibly trimmed) FASTQ record:
>>     out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
>> out_handle.close()
>> in_handle.close()
>
> Nice -- I will have to play with this. I hadn't dug into the current
> SeqRecord slicing code at all but I wonder if there is a way to keep
> the SeqRecord interface but incorporate some of these speed ups
> for common cases like this FASTQ trimming.

I suggest we continue this on the dev mailing list (this reply
is cross posted), as it is starting to get rather technical.

When you really care about speed, any object creation becomes
an issue. Right now for *any* record we have at least the
following objects being created: SeqRecord, Seq, two lists (for
features and dbxrefs), two dicts (for annotation and the per letter
annotation), and the restricted dict (for per letter annotations),
and at least four strings (sequence, id, name and description).
Perhaps some lazy instantiation might be worth exploring... for
example make dbxref, features, annotations or letter_annotations
into properties where the underlying object isn't created unless
accessed. [Something to try after Biopython 1.51 is out?]

I would guess (but haven't timed it) that for trimming FASTQ
SeqRecords, a bit part of the overhead is that we are using
Python lists of integers (rather than just a string) for the scores.
So sticking with the current SeqRecord object as is, one speed
up we could try would be to leave the FASTQ quality string as an
encoded string (rather than turning into integer quality scores,
and back again on output). It would be a hack, but adding this
as another SeqIO format name, e.g. "fastq-raw" or "fastq-ascii",
might work. We'd still need a new letter_annotations key, say
"fastq_qual_ascii".  This idea might work, but it does seem ugly.

Peter