[Biopython-dev] Bio.SeqIO.convert function?

Tue Jul 28 11:19:15 UTC 2009

Hi all,

As a possible enhancement to Bio.SeqIO, I've been toying with
the idea of introducing another function, essentially to provide
the following functionality:

def convert(in_handle, in_format, out_handle, out_format, alphabet=None) :
    """Converts between two file formats, returns number of records."""
    records = parse(in_handle, in_format, alphabet)
    return write(records, out_handle, out_format)

As implied by this reference implementation above, this would
be a convenience or helper function which would allow simple
conversion scripts to save a line, e.g.

import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")

becomes:

import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")

Now some people might find that in itself a small improvement,
but it does make the API a little more complex (feature creep).
However, that isn't the real aim here. Having a function like this
would allow a number of file format specific optimisations -
instead of using SeqIO.parse to create SeqRecord objects
which get converted by SeqIO.write as shown above.

For example, converting GenBank or EMBL to FASTA (or tab),
we don't need most of the annotation, so creating all those
SeqFeature objects is a waste of time and memory. The
GenBank/EMBL parser already has (buried) an option to skip
the features, and a Bio.SeqIO.convert function would be able
to exploit this.

Likewise, converting any of the FASTQ formats to FASTA
(which I think will be a fairly common task) can be speed up
greatly by ignoring the quality scores, and even more so by
never creating Seq and SeqRecord objects. I've tested this
particular example, and it is massively faster (about five
times faster in fact, which means it actually beats the
current version of EMBOSS seqret - which is cool).

Likewise converting between FASTQ formats (in particular
Solexa to Sanger, and Illumina to Sanger) are also going
to be common tasks which are currently something of a
bottle neck. Again, this can be made faster by avoiding
using Seq and SeqRecord objects within a convert function.

What I have in mind is a lookup table of special case
optimised converters (e.g. FASTQ to FASTA). If there is
no special case defined, the convert function would
default to the SeqIO parse/write code shown above.
We would need a good set of unit tests to ensure these
optimised converters did produce exactly the same
output as the parse/write solution.

Of course, if we have bottlenecks in the SeqIO parsing
and writing code, it would be worthwhile of course to fix
them - rather than writing a special case converter. Maybe
to avoid the gradual build up of too many specialised
converters, we might ask as a rule of thumb that it be
at least three times faster than using parse/write?

Any thoughts? Would this all just make SeqIO too complicated?

Peter