[Biopython-dev] Bio.SeqIO.convert function?

Tue Aug 11 12:19:25 UTC 2009

On Mon, Aug 10, 2009 at 5:46 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> In terms of speed, this new code takes under a minute to
> convert a 7 million short read FASTQ file to another FASTQ
> variant, or to a (line wrapped) FASTA file. In comparison,
> using Bio.SeqIO parse/write takes over five minutes.

If anyone is interested in the details, here I am using a 7 million
entry FASTQ file of short reads (length 36bp) from a Solexa FASTQ
format file (downloaded from the NCBI and then converted from the
Sanger FASTQ format). I'm timing conversion from Solexa to Sanger
FASTQ as it is a more common operation, and I can include the MAQ
script for comparison. I pipe the output via grep and word count as a
check on the conversion.

Using a (patched) version of MAQ's fq_all2std.pl we get about 4 mins:

$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
 7047668
real	3m58.978s
user	4m13.475s
sys	0m3.705s

And using a patched version of EMBOSS 6.1.0 (without the optimisations
Peter Rice has mentioned), we get 3m42s.

$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
 7047668
real	3m41.625s
user	3m56.753s
sys	0m4.091s

Using the latest Biopython in CVS (or the git master branch), with
Bio.SeqIO.parse/write, takes about twice this, 7m11s:

$ time python biopython_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
 7047668
real	7m10.706s
user	7m27.597s
sys	0m3.850s

This is at least a marked improvement over Biopython 1.51b with
Bio.SeqIO.parse/write, which took about 17 minutes! The bad news is
while the Bio.SeqIO FASTQ read/write in CVS is faster than in
Biopython 1.51b, it is also much less elegant. I'm think once I've
finished adding test cases (and probably after 1.51 is out) it might
be worth while trying to make it more beautiful without sacrificing
too much of the speed gain.

Now to the good news, using my github branch with the convert function
we get a massive reduction to under a minute (52s):

$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
 7047668
real	0m51.618s
user	1m7.735s
sys	0m3.162s

We have a winner! Assuming of course there are no mistakes ;)

In fact, these measurements are a little misleading because I am
including grep (to check the record count) and the output isn't
actually going to disk. Doing the grep on its own takes about 15s:

$ time grep "^@SRR" SRR001666_1.fastq_solexa | wc -l
 7047668
real	0m15.318s
user	0m17.890s
sys	0m1.087s

However, if you actually output to a file the disk speed itself
becomes important when the conversion is this fast:

$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq
real	1m3.448s
user	0m49.672s
sys	0m4.826s

$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa > temp.fastq
real	3m55.086s
user	3m39.548s
sys	0m5.998s

$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa > temp.fastq
real	4m10.245s
user	3m54.880s
sys	0m5.085s

$ time python ../biopython/Tests/Quality/biopython_solexa2sanger.py <
SRR001666_1.fastq_solexa > temp.fastq
real	7m27.879s
user	7m9.084s
sys	0m6.008s

Nevertheless, the Bio.SeqIO.convert(...) function still wins for now.

Peter

For those interested, here are the tiny little Biopython scripts I'm using:

# biopython_solexa2sanger.py
#FASTQ conversion using Bio.SeqIO, needs Biopython 1.50 or later.
import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")

and:

#convert_solexa2sanger.py
#High performance FASTQ conversion using Bio.SeqIO.convert(...)
#function likely to be in Biopython 1.52 onwards.
import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")