[Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython?

Mon Jul 20 13:57:59 EDT 2009

Hi all at Biopython (and EMBOSS-dev CC'd),

Now that EMBOSS 6.1.0 is out I've started checking it against Biopython.
As I mentioned on the Biopython mailing list a week ago, I'd particularly
like to make sure we agree on the various FASTQ variants. I'm waiting
for EMBOSS to update the documentation on their website, but as I
recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test
this afternoon, they are using:

fastq - FASTQ where the qualities are ignored (useful for input?)
fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33
fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64
fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64
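
To make the offsets above concrete, here is a rough sketch in plain
Python (no Biopython or EMBOSS calls). The names OFFSETS,
decode_qualities and solexa_to_phred are made up for illustration.
The Solexa to PHRED conversion is the standard formula, not something
taken from either code base.

```python
import math

# ASCII offsets for the three quality-bearing variants listed above.
# (Plain EMBOSS "fastq" ignores the qualities, so it is left out.)
OFFSETS = {
    "fastq-sanger": 33,    # PHRED scores
    "fastq-solexa": 64,    # Solexa scores
    "fastq-illumina": 64,  # PHRED scores
}


def decode_qualities(quality_string, variant):
    """Return the raw integer scores encoded in a FASTQ quality string."""
    offset = OFFSETS[variant]
    return [ord(char) - offset for char in quality_string]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

Note the same character decodes to different scores depending on the
variant, which is why the format names need to be agreed on.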

I was expecting "fastq" to be an input-only format in EMBOSS, given
my understanding that it means the qualities are ignored. That
makes sense for tasks like FASTQ to FASTA, where the qualities are
not needed. I was, however, surprised that using "fastq" as an output
format in EMBOSS seqret gives quality strings of double quote
characters. This ASCII character (34) is outside the range used in
the Solexa and Illumina 1.3+ FASTQ variants. If interpreted as a
Sanger style FASTQ file this means a PHRED quality of one (34 - 33 = 1,
meaning the base call is about random, a sensible default).
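
The arithmetic behind that, as a quick sketch (plain Python, variable
names are just for illustration). For comparison, a uniformly random
call among four bases would be wrong 75% of the time.

```python
DOUBLE_QUOTE = '"'

code = ord(DOUBLE_QUOTE)  # ASCII code of the double quote character

# Read as Sanger style FASTQ (PHRED, offset 33):
sanger_phred = code - 33
# PHRED is defined as Q = -10 * log10(P_error), so P_error = 10^(-Q/10)
error_probability = 10 ** (-sanger_phred / 10)

# Read with offset 64 (Solexa or Illumina 1.3+) the score would be far
# below the minimum of either variant (-5 for Solexa, 0 for Illumina 1.3+).
offset64_score = code - 64

print(sanger_phred, round(error_probability, 3), offset64_score)
```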

Enough background. The reason for this email was that (subject to
confirmation), Biopython's "fastq" matches EMBOSS's "fastq-sanger",
so I'd like to consider adding this as an alias in Bio.SeqIO. I resisted
adding aliases initially, but we now have "gb" for "genbank" to make
working with Entrez a little easier, so there is a precedent. In this case,
it will make some of the test_Emboss.py code cleaner if I can just use
"fastq-sanger" everywhere and have both Biopython and EMBOSS
understand this.
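
For illustration only, an alias could be as simple as a lookup applied
to the format name before dispatching to the parser or writer. This is
a hypothetical sketch, not how Bio.SeqIO is actually implemented, and
FORMAT_ALIASES and canonical_format are invented names.

```python
# Hypothetical alias table: alias name -> existing Biopython format name.
FORMAT_ALIASES = {
    "gb": "genbank",          # existing alias mentioned above
    "fastq-sanger": "fastq",  # proposed alias matching the EMBOSS name
}


def canonical_format(name):
    """Map an alias to its underlying format name; pass others through."""
    return FORMAT_ALIASES.get(name, name)
```

With something like this in place, test code could pass "fastq-sanger"
to both Biopython and EMBOSS without any special casing.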

Peter