[Biopython-dev] Bio.SeqIO.convert function?

Peter biopython at maubp.freeserve.co.uk
Tue Jul 28 14:47:57 UTC 2009


On Tue, Jul 28, 2009 at 2:14 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Finally, if anyone is interested, this was idea for the high speed
> FASTQ to FASTA conversion - as a proof of principle script
> using standard input and standard output at the command line:
>
> #High performance FASTQ to FASTA conversion for short reads.
> #This uses the low level FASTQ parser in Biopython 1.50 or
> #later. This avoids Bio.SeqIO and the associated overheads
> #of object creation and decoding the FASTQ quality string.
> import sys
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> #This just returns tuples of three strings from FASTQ:
> write = sys.stdout.write #avoid repeated attribute lookups
> for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
>    write(">%s\n" % title)
>    #Wrap at 60 characters (as done by Bio.SeqIO FASTA):
>    for i in range(0, len(sequence), 60):
>        write(sequence[i:i+60] + "\n")
>
> If you don't want line wrapping, the code is two lines shorter,
> and even faster:
>
> import sys
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> write = sys.stdout.write #avoid repeated attribute lookups
> for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
>    write(">%s\n%s\n" % (title, sequence))
>
> Peter

And here is a similar high performance script for mapping
Solexa FASTQ to Sanger FASTQ,

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator, phred_quality_from_solexa
from string import maketrans
solexa = "".join(chr(64+q) for q in range(-5,62+1))
sanger = "".join(chr(int(round(33+phred_quality_from_solexa(q)))) \
                 for q in range(-5,62+1))
mapping = maketrans(solexa, sanger)
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write("@%s\n%s\n+\n%s\n" % (title, sequence, quality.translate(mapping)))

The same basic idea works equally well for mapping between any
of the three FASTQ variants, and the speed is very similar to the
FASTQ to FASTA script, taking about 1/5 of the time using SeqIO
parse/write for this. I'm still investigating how to make the SeqIO
parsing/writing faster.

When I get an updated version of EMBOSS installed, I intend
to profile it against these scripts ;)

Peter




More information about the Biopython-dev mailing list