[Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp

Chris Fields cjfields at illinois.edu
Tue Sep 15 17:43:26 UTC 2009


On Sep 15, 2009, at 10:08 AM, Peter wrote:

> On Tue, Sep 15, 2009 at 4:02 PM, natassa <natassa_g_2000 at yahoo.com>  
> wrote:
>>
>>> That does look like a FASTQ file, and you probably know that it
>>> came from a Solexa/Illumina machine. However, it could be an early
>>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO),
>>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED
>>> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I
>>> would guess this is probably a "fastq-illumina" file, but you
>>> should double check this, as it does matter for poor quality reads.
>>
>> Because you created some doubts in my already confused mind:
>> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp
>> reads from pipeline v1.3 and v1.4, respectively. In the pipeline
>> manuals they say that the scoring scheme is Phred.  I know
>> there is a lot of confusion about the terms, which is why I
>> preferred to use SeqIO - I hope I did not mix up the formats...
>
> That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use
> PHRED scores (with a FASTQ ASCII offset of 64), and in
> Biopython we call this the "fastq-illumina" format.
>
> Peter
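
To make the offset of 64 mentioned above concrete, here is a minimal
sketch in plain Python (no Biopython needed) that decodes a FASTQ
quality line. The function name and the example quality string are made
up for illustration; they are not taken from any real file or from the
SeqIO code.

```python
# ASCII offset quoted above for Illumina pipeline 1.3/1.4 FASTQ files.
ILLUMINA_OFFSET = 64


def decode_qualities(quality_line, offset=ILLUMINA_OFFSET):
    """Turn a FASTQ quality string into a list of integer scores.

    Hypothetical helper: each character's ASCII code minus the offset
    gives the quality score for that base.
    """
    return [ord(ch) - offset for ch in quality_line]


if __name__ == "__main__":
    example = "hhhgfB@"  # invented quality string, for illustration only
    print(decode_qualities(example))
```

Note that the decoding step is the same whether the numbers are PHRED
or Solexa scores; the offset only says how the integers are stored, not
how they were calculated.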

I should add a very important caveat here.  As I mentioned to Peter,
I met with our local next-gen sequencing lead and was able to check
the Illumina 1.4 pipeline manual.  It confirms that the ASCII offset
for FASTQ is 64, but it gives the quality score calculation as (p. 122
of the Genome Pipeline manual for 1.4):

Q = 10*log10(p/(1-p))

Look familiar?  Hint: it's not PHRED.  Can anyone else confirm this?
It appears Illumina has switched back to using Solexa scores.

chris
