[Biopython] A third FASTQ variant from Illumina 1.3+ ?!!

Chris Fields cjfields at illinois.edu
Fri Jun 5 20:33:08 UTC 2009


On Jun 5, 2009, at 2:10 PM, Peter wrote:

> On Fri, Jun 5, 2009 at 1:02 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>> On Fri, Jun 5, 2009 at 12:47 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>> Oh dear - it sounds like Solexa/Illumina have just made the whole
>>> FASTQ thing much much worse by introducing a third version of the
>>> FASTQ file format. Curses! Again!
>>>
>>> http://seqanswers.com/forums/showthread.php?t=1526
>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>
>>> In Biopython, "fastq" refers to the original Sanger FASTQ format,
>>> which encodes a Phred quality score from 0 to 90 (or 93 in the
>>> latest code) using an ASCII offset of 33.
>>>
>>> In Biopython, "fastq-solexa" refers to the first bastardised version
>>> of the FASTQ format, introduced by Solexa/Illumina 1.0, which encodes
>>> a Solexa/Illumina quality score (which can be negative) using an
>>> ASCII offset of 64. Why they didn't make the files easily
>>> distinguishable from Sanger FASTQ files escapes me!
>>>
>>> Apparently Illumina 1.3 introduces a third FASTQ format which
>>> encodes a PHRED quality score from 0 to 40 using ASCII 64 to 104.
>>> While they switched to PHRED scores, they appear to have decided to
>>> stick with the 64 offset - I can only assume this is so that existing
>>> tools expecting the old Solexa/Illumina FASTQ format data will still
>>> more or less work with this new variant (as for higher qualities the
>>> PHRED and Solexa scores are approximately equal).
>
> I'm proposing to support this new FASTQ variant in Bio.SeqIO under the
> format name "fastq-illumina" (unless anyone has a better idea). In the
> meantime, anyone happy installing Biopython from CVS/github can try
> this out - but be warned it will need full testing.
>
> Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module
> would also be welcome - you can read this online here:
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython
>
> Next week I'll try and see if one of our local sequencing centres can
> supply some sample data from a Solexa/Illumina 1.3 pipeline for a test
> case. If anyone already has such data they can share please get in
> touch.
>
> Thanks,
>
> Peter
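
For anyone trying to keep the three encodings straight, here is a
rough sketch of how the quality strings described above would decode.
It is plain Python and not the Bio.SeqIO code; the VARIANTS table and
both function names are made up for illustration, and the
Solexa-to-PHRED formula is the usual published conversion, which I am
assuming applies here.

```python
import math

# Hypothetical lookup for illustration only (not the Biopython API):
# format name -> (ASCII offset, kind of score stored in the file)
VARIANTS = {
    "fastq": (33, "phred"),           # original Sanger FASTQ
    "fastq-solexa": (64, "solexa"),   # Solexa/Illumina 1.0
    "fastq-illumina": (64, "phred"),  # Illumina 1.3+
}


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to a PHRED score (assumed standard formula)."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)


def decode_quality(qual_string, variant):
    """Return the PHRED scores encoded by one FASTQ quality string."""
    offset, kind = VARIANTS[variant]
    scores = [ord(char) - offset for char in qual_string]
    if kind == "solexa":
        # Solexa scores can be negative, so map them onto the PHRED scale.
        return [solexa_to_phred(q) for q in scores]
    return scores
```

At the high end the two scales nearly coincide, which fits Peter's
guess about why the 64 offset was kept: decoding "h" as fastq-solexa
or as fastq-illumina gives a PHRED score of about 40 either way.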

You might be able to get some reads off NCBI's Short Read Archive (at  
least they're publicly available).  Not sure whether these indicate  
which FASTQ format they are in...

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=main&m=main&s=main
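
If the files don't say which variant they are, the range of quality
characters can at least rule some variants out. A minimal sketch,
assuming Sanger files never go below ASCII 33, Solexa scores bottom
out at -5 (ASCII 59) and Illumina 1.3+ scores at 0 (ASCII 64); the
function name is hypothetical and this is not part of Biopython:

```python
def compatible_variants(quality_strings):
    """List the FASTQ variant names that could have produced these strings."""
    codes = [ord(char) for qual in quality_strings for char in qual]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 33:
        raise ValueError("character below ASCII 33 is not valid in any variant")
    # Any printable range is possible for Sanger (offset 33).
    candidates = ["fastq"]
    if lowest >= 59:
        # Offset 64 with Solexa scores down to -5 (assumed minimum).
        candidates.append("fastq-solexa")
    if lowest >= 64:
        # Offset 64 with PHRED scores, which cannot be negative.
        candidates.append("fastq-illumina")
    return candidates
```

This only narrows things down. A file whose characters all sit at
ASCII 64 or above is consistent with all three variants, so a low
character somewhere in the data is the only real giveaway.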

chris




