[Biopython] A third FASTQ variant from Illumina 1.3+ ?!!

Peter biopython at maubp.freeserve.co.uk
Fri Jun 5 19:10:12 UTC 2009


On Fri, Jun 5, 2009 at 1:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Jun 5, 2009 at 12:47 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ
>> thing much much worse by introducing a third version of the FASTQ file
>> format. Curses! Again!
>>
>> http://seqanswers.com/forums/showthread.php?t=1526
>> http://en.wikipedia.org/wiki/FASTQ_format
>>
>> In Biopython, "fastq" refers to the original Sanger FASTQ format which
>> encodes a Phred quality score from 0 to 90 (or 93 in the latest code)
>> using an ASCII offset of 33.
>>
>> In Biopython "fastq-solexa" refers to the first bastardised version of the
>> FASTQ format, introduced by Solexa/Illumina 1.0, which encodes
>> a Solexa/Illumina quality score (which can be negative) using an ASCII
>> offset of 64. Why they didn't make the files easily distinguishable from
>> Sanger FASTQ files escapes me!
>>
>> Apparently Illumina 1.3 introduces a third FASTQ format which encodes
>> a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they
>> switched to PHRED scores, they appear to have decided to stick with
>> the 64 offset - I can only assume this is so that existing tools expecting
>> the old Solexa/Illumina FASTQ format data will still more or less work
>> with this new variant (as for higher qualities the PHRED and Solexa
>> scores are approximately equal).
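
To make the three encodings concrete, here is a rough sketch in plain
Python (not the Bio.SeqIO code). The offsets and format names are those
given above; the helper names decode_quality and solexa_to_phred are made
up for illustration. The conversion formula assumes the usual definitions,
PHRED Q = -10*log10(p) and Solexa Q = -10*log10(p/(1-p)).

```python
import math

# ASCII offset and score type for each variant, as described above.
# The helper names below are hypothetical, not part of Biopython.
VARIANTS = {
    "fastq": (33, "phred"),           # original Sanger FASTQ
    "fastq-solexa": (64, "solexa"),   # Solexa/Illumina 1.0
    "fastq-illumina": (64, "phred"),  # Illumina 1.3+ (proposed name)
}


def decode_quality(qual_string, variant):
    """Return the integer scores encoded in a FASTQ quality string."""
    offset, _score_type = VARIANTS[variant]
    return [ord(char) - offset for char in qual_string]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (a float).

    Assumes PHRED Q = -10*log10(p) and Solexa Q = -10*log10(p/(1-p)).
    """
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)


if __name__ == "__main__":
    # The same quality string means different things in each variant.
    for name in VARIANTS:
        print(name, decode_quality("hhB;", name))
    # High scores nearly agree, low scores do not.
    for q in (40, 20, 10, 0, -5):
        print(q, round(solexa_to_phred(q), 2))
```

Decoding the same string under "fastq-solexa" and "fastq-illumina" gives
identical integers, since only the meaning of the score differs. That is
why a tool written for the old Solexa files will more or less work on the
new ones at high qualities, and why nothing in the file itself tells you
which variant you have.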

I'm proposing to support this new FASTQ variant in Bio.SeqIO under the
format name "fastq-illumina" (unless anyone has a better idea). In the
meantime, anyone happy installing Biopython from CVS/GitHub can try
this out, but be warned that it still needs full testing.

Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module
would also be welcome - you can read this online here:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython

Next week I'll try and see if one of our local sequencing centres can supply
some sample data from a Solexa/Illumina 1.3 pipeline for a test case.  If
anyone already has such data they can share please get in touch.

Thanks,

Peter


