[Biopython] fastq manipulations speed

Mon Mar 18 00:33:21 UTC 2013

On Mon, Mar 18, 2013 at 12:21 AM, natassa <natassa_g_2000 at yahoo.com> wrote:
> Thanks, the length-1 was an error, it was supposed to be 0:length to get the
> qualities of the associated trimmed files. The script seems to be running
> much faster! But what would be your other suggestions?
> Natassa

You should be able to refactor the code to make a single call to
SeqIO.write by giving it a generator which constructs all the
trimmed records. That would require a bit of thought and
experience with iterators, generator functions and/or generator
expressions, but it can be a really powerful way to think about
things. I expect this to be faster, but the second idea
below will definitely be faster, perhaps five times as fast...
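
As a rough sketch of the single-call idea (I haven't seen your full
script, so this assumes you are trimming every read to one fixed
length; the names trimmed_records and trim_fastq are made up for
illustration):

```python
def trimmed_records(records, length):
    """Lazily yield each record cut down to its first `length` positions."""
    for record in records:
        # Slicing a SeqRecord keeps the per-letter qualities in step
        # with the sequence, so both are trimmed together.
        yield record[:length]


def trim_fastq(in_path, out_path, length):
    """Trim every read in a FASTQ file using one SeqIO.write call."""
    # Imported here so the generator above can be tried without Biopython.
    from Bio import SeqIO

    records = SeqIO.parse(in_path, "fastq")
    # SeqIO.write pulls records from the generator one at a time and
    # returns the number of records written.
    return SeqIO.write(trimmed_records(records, length), out_path, "fastq")
```

Something like trim_fastq("reads.fastq", "trimmed.fastq", 50) would then
write the whole file in one go (the filenames and the length are
placeholders). Only one record is held in memory at a time.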

More straightforwardly, you don't need to use SeqRecord
objects for this task - they make slicing the sequence and
quality easier, but come with a performance cost. See:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
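
A minimal sketch of that approach, using FastqGeneralIterator from
Bio.SeqIO.QualityIO, which gives plain (title, sequence, quality)
string tuples. Again this assumes a fixed trim length, and the
function names are hypothetical:

```python
def trim_fastq_tuples(reads, length):
    """Yield FASTQ text for each trimmed (title, seq, qual) string tuple."""
    for title, seq, qual in reads:
        # Sequence and quality are plain strings of equal length, so
        # the same slice keeps them in step.
        yield "@%s\n%s\n+\n%s\n" % (title, seq[:length], qual[:length])


def trim_fastq_fast(in_path, out_path, length):
    """Trim a FASTQ file without building SeqRecord objects."""
    # Imported here so the generator above can be tried without Biopython.
    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    count = 0
    with open(in_path) as in_handle, open(out_path, "w") as out_handle:
        reads = FastqGeneralIterator(in_handle)
        for text in trim_fastq_tuples(reads, length):
            out_handle.write(text)
            count += 1
    return count
```

The quality string is copied through as-is, so no decoding to
PHRED scores happens, which is where much of the saving comes from.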

In addition, consider doing the same for the FASTA file with:
from Bio.SeqIO.FastaIO import SimpleFastaParser
(requires Biopython 1.61 or later; it looks like that wasn't
highlighted in the release notes, which was an oversight).
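
The FASTA version might look something like this. SimpleFastaParser
yields (title, sequence) string tuples; the fixed trim length and
the function names are my assumptions, and note this sketch writes
each sequence on a single line with no wrapping:

```python
def trim_fasta_tuples(entries, length):
    """Yield FASTA text for each trimmed (title, seq) string tuple."""
    for title, seq in entries:
        # The title is the full header line without the leading ">".
        yield ">%s\n%s\n" % (title, seq[:length])


def trim_fasta_fast(in_path, out_path, length):
    """Trim a FASTA file without building SeqRecord objects."""
    # Imported here so the generator above can be tried without Biopython.
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    count = 0
    with open(in_path) as in_handle, open(out_path, "w") as out_handle:
        entries = SimpleFastaParser(in_handle)
        for text in trim_fasta_tuples(entries, length):
            out_handle.write(text)
            count += 1
    return count
```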

Good night,

Peter