[Biopython-dev] Bio.SeqIO.convert function?
Peter
biopython at maubp.freeserve.co.uk
Tue Aug 11 12:19:25 UTC 2009
On Mon, Aug 10, 2009 at 5:46 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> In terms of speed, this new code takes under a minute to
> convert a 7 million short read FASTQ file to another FASTQ
> variant, or to a (line wrapped) FASTA file. In comparison,
> using Bio.SeqIO parse/write takes over five minutes.
If anyone is interested in the details, here I am using a 7 million
entry FASTQ file of short reads (length 36bp) from a Solexa FASTQ
format file (downloaded from the NCBI and then converted from the
Sanger FASTQ format). I'm timing conversion from Solexa to Sanger
FASTQ as it is a more common operation, and I can include the MAQ
script for comparison. I pipe the output via grep and word count as a
check on the conversion.
Using a (patched) version of MAQ's fq_all2std.pl we get about 4 mins:
$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
7047668
real 3m58.978s
user 4m13.475s
sys 0m3.705s
And using a patched version of EMBOSS 6.1.0 (without the optimisations
Peter Rice has mentioned), we get 3m42s.
$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
7047668
real 3m41.625s
user 3m56.753s
sys 0m4.091s
Using the latest Biopython in CVS (or the git master branch), with
Bio.SeqIO.parse/write, takes about twice this, 7m11s:
$ time python biopython_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
7047668
real 7m10.706s
user 7m27.597s
sys 0m3.850s
This is at least a marked improvement over Biopython 1.51b with
Bio.SeqIO.parse/write, which took about 17 minutes! The bad news is
while the Bio.SeqIO FASTQ read/write in CVS is faster than in
Biopython 1.51b, it is also much less elegant. I'm think once I've
finished adding test cases (and probably after 1.51 is out) it might
be worth while trying to make it more beautiful without sacrificing
too much of the speed gain.
Now to the good news, using my github branch with the convert function
we get a massive reduction to under a minute (52s):
$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
7047668
real 0m51.618s
user 1m7.735s
sys 0m3.162s
We have a winner! Assuming of course there are no mistakes ;)
In fact, these measurements are a little misleading because I am
including grep (to check the record count) and the output isn't
actually going to disk. Doing the grep on its own takes about 15s:
$ time grep "^@SRR" SRR001666_1.fastq_solexa | wc -l
7047668
real 0m15.318s
user 0m17.890s
sys 0m1.087s
However, if you actually output to a file the disk speed itself
becomes important when the conversion is this fast:
$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq
real 1m3.448s
user 0m49.672s
sys 0m4.826s
$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa > temp.fastq
real 3m55.086s
user 3m39.548s
sys 0m5.998s
$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa > temp.fastq
real 4m10.245s
user 3m54.880s
sys 0m5.085s
$ time python ../biopython/Tests/Quality/biopython_solexa2sanger.py <
SRR001666_1.fastq_solexa > temp.fastq
real 7m27.879s
user 7m9.084s
sys 0m6.008s
Nevertheless, the Bio.SeqIO.convert(...) function still wins for now.
Peter
For those interested, here are the tiny little Biopython scripts I'm using:
# biopython_solexa2sanger.py
#FASTQ conversion using Bio.SeqIO, needs Biopython 1.50 or later.
import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")
and:
#convert_solexa2sanger.py
#High performance FASTQ conversion using Bio.SeqIO.convert(...)
#function likely to be in Biopython 1.52 onwards.
import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")
More information about the Biopython-dev
mailing list