[emboss-dev] FASTQ parsing speed in EMBOSS

Peter biopython at maubp.freeserve.co.uk
Mon Jul 27 13:39:49 EDT 2009


Hi all,

I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
some of the FASTQ issues I've raised, and I decided to do a few
simple benchmarks.

For this example, I have used a 1.3 GB standard Sanger FASTQ
file from the NCBI short read archive which contains just over
seven million short reads of length 36 bp, which I believe were
originally from a Solexa/Illumina machine. This is actually one
of a pair of FASTQ files as this was a paired end run. The file is
here (compressed):

ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/SRR001666_1.fastq.gz

Note that some of the quality lines start with "@", so you can't
use grep for "^@" to count the records. However, all the reads
have an identifier starting SRR so you can do this:

$ time grep "^@SRR" SRR001666_1.fastq | wc -l
 7047668

real	0m15.886s
user	0m18.357s
sys	0m1.268s

For this example, I want to convert the FASTQ file to FASTA
(i.e. ignore and throw away the quality scores). This is a fairly
common task, as most all assemblers will take FASTA files,
even if they don't understand FASTQ.

As I didn't want to waste disk space and I wanted a basic
check on the output, I have simply piped the output via
grep and wc to count the FASTA records:

$ time seqret -filter -sformat fastq-sanger -osformat fasta <
SRR001666_1.fastq | grep "^>" | wc -l
 7047668

real	2m48.288s
user	3m3.994s
sys	0m3.525s

I've run this several times, and this result is typical. So, using
the "fastq-sanger" format this takes about 2m48s. There is a
slight speed up using "fastq" as the EMBOSS input format
name, as this never has to convert the quality strings into
PHRED values:

$ time seqret -filter -sformat fastq -osformat fasta <
SRR001666_1.fastq | grep "^>" | wc -l
 7047668

real	2m43.566s
user	2m59.077s
sys	0m3.540s

i.e. About 2m44, saving about 4s.

Just for the record, actually doing the FASTQ to FASTA conversion
to a file (without grep and wc) takes about 2m52s:

$ time seqret -filter -sformat fastq -osformat fasta -sequence
SRR001666_1.fastq -outseq SRR001666_1.fasta

real	2m51.791s
user	2m40.545s
sys	0m4.848s

This is over 40 thousand reads per second, but I was still a
little disappointed in the run time. Improvements in the FASTQ
parsing/writing speed would help get EMBOSS used in
sequencing centre pipelines. Once we have the EMBOSS
FASTQ input/output working as intended, does trying to
speed it up further seem worthwhile?

One specific suggestions is for the "fastq" parser (function
seqReadFastq) which doesn't do anything with the quality
strings. Other than for a debug statement, there is no need
to calculate these lines:

    minqual = ajStrGetAsciiLow(qualstr);
    maxqual = ajStrGetAsciiHigh(qualstr);
    comqual = ajStrGetAsciiCommon(qualstr);

In fact, you don't really need to record qualstr at all. Could
you just verify the total length of the quality string, without
actually recording it in a buffer?

Another suggestion (although not demonstrated in the above
benchmark) is for the Solexa FASTQ parsing (and output).
>From looking at the code, you map the ASCII to a PHRED
score for each letter of every read. This is a relatively
expensive operation using powers and logs. I would try
using a precomputed look up table (something I have just
been working on for Biopython - this made a very big
difference, especially when converting to/from Solexa
scores to PHRED scores).

Peter C.


More information about the emboss-dev mailing list