[Biopython-dev] Bio.SeqIO.convert function?

Peter biopython at maubp.freeserve.co.uk
Tue Jul 28 13:14:52 UTC 2009


On Tue, Jul 28, 2009 at 12:19 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>
> Any thoughts? Would this all just make SeqIO too complicated?
>

The idea of the Bio.SeqIO.convert function was twofold:
(1) Syntactic sugar (and for this alone I wouldn't add it)
(2) Faster file format conversion (e.g. for scripts or pipelines)

While we could clearly outperform EMBOSS 6.1.0 on FASTQ
to FASTA, given the possible speed ups Peter Rice is reporting
for EMBOSS seqret, it looks like this will change shortly:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006496.html

I don't see any real point in trying to compete with EMBOSS
for simple file conversion if in general seqret will be faster
(and in the next release of EMBOSS, it should be).

The real benefit of using Bio.SeqIO for any file format conversion
(rather than seqret) is that it lets the user add their own conditional
filters or modifications as needed. And for this, my proposed
function Bio.SeqIO.convert() doesn't help in any way.
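
To make the point about user filters concrete, here is a minimal sketch
(not a proposed API). It assumes the records arrive as (title, sequence,
quality) string tuples, which is the shape Biopython's low level FASTQ
parser yields. The names fastq_to_fasta and keep are hypothetical and
only for illustration, as is the in-memory test data:

```python
import io

def fastq_to_fasta(records, out, keep=lambda title, seq, qual: True):
    """Write unwrapped FASTA for each record that passes keep().

    records: any iterable of (title, sequence, quality) string tuples.
    out:     any object with a write() method (e.g. sys.stdout).
    keep:    hypothetical user-supplied predicate, not part of Biopython.
    Returns the number of records written.
    """
    write = out.write  # avoid repeated attribute lookups
    written = 0
    for title, seq, qual in records:
        if keep(title, seq, qual):
            write(">%s\n%s\n" % (title, seq))
            written += 1
    return written

# Tiny in-memory demonstration: drop reads shorter than 5 bases.
reads = [("read1", "ACGTACGT", "IIIIIIII"), ("read2", "ACG", "III")]
buffer = io.StringIO()
count = fastq_to_fasta(reads, buffer, keep=lambda t, s, q: len(s) >= 5)
```

On real data you would presumably pass FastqGeneralIterator(sys.stdin)
as records and sys.stdout as out. This kind of per-record condition is
exactly what a fixed convert() call (or seqret) has no hook for.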

So, unless anyone pipes up, I probably won't pursue this.

Finally, if anyone is interested, this was my idea for the high speed
FASTQ to FASTA conversion - as a proof of principle script
using standard input and standard output at the command line:

#High performance FASTQ to FASTA conversion for short reads.
#This uses the low level FASTQ parser in Biopython 1.50 or
#later. This avoids Bio.SeqIO and the associated overheads
#of object creation and decoding the FASTQ quality string.
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
write = sys.stdout.write #avoid repeated attribute lookups
#FastqGeneralIterator just returns tuples of three strings from FASTQ:
for title, sequence, quality in FastqGeneralIterator(sys.stdin):
    write(">%s\n" % title)
    #Wrap at 60 characters (as done by Bio.SeqIO FASTA):
    for i in range(0, len(sequence), 60):
        write(sequence[i:i+60] + "\n")

If you don't want line wrapping, the code is two lines shorter,
and even faster:

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin):
    write(">%s\n%s\n" % (title, sequence))

Peter


