[Biopython] A third FASTQ variant from Illumina 1.3+ ?!!

Fri Jun 5 07:47:45 EDT 2009

On Fri, Jun 5, 2009 at 11:57 AM, Giles Weaver
<giles.weaver at googlemail.com> wrote:
> There is a recent thread on the bioperl mailing lists where Heikki
> Lehvaslaiho has written a very detailed post
> (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the
> peculiarities of sanger/solexa/illumina quality encoding. Evidently there
> are a lot of pitfalls for the unwary, ...

Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ
thing much much worse by introducing a third version of the FASTQ file
format. Curses! Again!

http://seqanswers.com/forums/showthread.php?t=1526
http://en.wikipedia.org/wiki/FASTQ_format

In Biopython, "fastq" refers to the original Sanger FASTQ format which
encodes a Phred quality score from 0 to 90 (or 93 in the latest code)
using an ASCII offset of 33.

In Biopython, "fastq-solexa" refers to the first bastardised version of the
FASTQ format, introduced by Solexa/Illumina 1.0, which encodes a
Solexa/Illumina quality score (which can be negative) using an ASCII
offset of 64. Why they didn't make the files easily distinguishable from
Sanger FASTQ files escapes me!
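
As a rough illustration of why the two formats are so easy to confuse, here
is a minimal sketch in plain Python (not Biopython's actual parsing code) of
how a single quality character decodes under each offset. The helper name
decode_quality and the two constants are made up for this example.

```python
# Hypothetical helper for illustration only; not part of Biopython.
SANGER_OFFSET = 33  # "fastq": PHRED scores, ASCII offset 33
SOLEXA_OFFSET = 64  # "fastq-solexa": Solexa scores, ASCII offset 64


def decode_quality(char, offset):
    """Return the integer score encoded by one quality character."""
    return ord(char) - offset


# The same character means very different things in the two formats.
print(decode_quality("I", SANGER_OFFSET))  # read as Sanger FASTQ
print(decode_quality("I", SOLEXA_OFFSET))  # read as Solexa FASTQ

# Solexa scores can be negative: ";" is ASCII 59, below the offset of 64.
print(decode_quality(";", SOLEXA_OFFSET))
```

Nothing in the character itself says which offset applies, so a file has to
be labelled (or guessed at from the range of characters it uses).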

Apparently Illumina 1.3 introduces a third FASTQ format which encodes
a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they
switched to PHRED scores, they appear to have decided to stick with
the 64 offset - I can only assume this is so that existing tools expecting
the old Solexa/Illumina FASTQ format data will still more or less work
with this new variant (as for higher qualities the PHRED and Solexa
scores are approximately equal).
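
The "approximately equal" claim can be checked with a short sketch. It
assumes the usual definitions, which are not spelled out in this post: a
PHRED score is -10*log10(p) and a Solexa score is -10*log10(p/(1-p)), where p
is the estimated error probability. The function name solexa_to_phred is
made up for this example and is not necessarily what Biopython uses.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score.

    Assumes PHRED = -10*log10(p) and Solexa = -10*log10(p / (1 - p)),
    which gives PHRED = 10*log10(10**(Solexa/10) + 1).
    """
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)


# Low scores differ noticeably; high scores are nearly identical.
for q in (-5, 0, 10, 20, 40):
    print(q, round(solexa_to_phred(q), 2))

# Under the Illumina 1.3+ scheme described above, PHRED 0 to 40 with an
# offset of 64 spans these characters:
print(chr(64 + 0), chr(64 + 40))
```

The gap between the two scales is about 3 at a Solexa score of 0 and shrinks
to a small fraction of a unit by 20, which is why a tool that assumes the
old format will mostly misjudge only the low-quality bases.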

I'm going to see if I can get hold of the Illumina 1.3 or 1.4 manuals to
confirm this information... but it looks like we'll need to support a third
FASTQ format in Biopython :(

Peter