[Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp

Chris Fields cjfields at illinois.edu
Tue Sep 15 15:27:57 EDT 2009


On Sep 15, 2009, at 12:43 PM, Chris Fields wrote:

> On Sep 15, 2009, at 10:08 AM, Peter wrote:
>
>> On Tue, Sep 15, 2009 at 4:02 PM, natassa <natassa_g_2000 at yahoo.com>  
>> wrote:
>>>
>>>> That does look like a FASTQ file, and you probably know that it
>>>> came from a Solexa/Illumina machine. However, it could be an early
>>>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO),
>>>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED
>>>> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I
>>>> would guess this probably is an "fastq-illumina" file, but you
>>>> should double check this, as it does matter for poor quality reads.
>>>
>>> Because you created some doubts in my already confused mind:
>>> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp
>>> reads from pipeline v1.3 and v1.4, respectively. In the pipeline
>>> manuals they say that the scoring scheme is Phred.  I know
>>> there is a lot of confusion about the terms, this is why I
>>> preferred to use the seqIO -I hope I did not mix the formats....
>>
>> That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use
>> PHRED scores (with a FASTQ ASCII offset of 64), and in
>> Biopython we call this the "fastq-illumina" format.
>>
>> Peter
>
> I should add a very important caveat here.  As I had mentioned to  
> Peter I met with our local nextgen sequencing lead and was able to  
> check the Illumina 1.4 pipeline manual.  It indicates the ASCII  
> offset for FASTQ is correct (64), but the quality score is  
> calculated as (pg 122 of Genome Pipeline manual for 1.4):
>
> Q = 10*log10(p/(1-p))
>
> Look familiar?  Hint: it's not PHRED.  I'm wondering if anyone else  
> can confirm this, as it appears Illumina has switched back to using  
> Solexa scores again.
>
> chris

I just got off the phone with Illumina customer support to double-check
this, and I think it may be a false alarm, though I'm getting
conflicting accounts (our local people say the quality scores are
Solexa, not PHRED).

According to Illumina tech support, quality scores coming off the 1.4
pipeline should be converted to PHRED scores prior to output (which
matches what natassa reports).  The manual refers to the older
(Solexa/Illumina 1.0) scoring because that scoring scheme can be
selected as an option instead of PHRED.
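
For reference, here is a minimal sketch of how the two scoring schemes
relate.  It assumes the p in the manual's formula is the probability
that the base call is correct, which would make it equivalent to the
usual Solexa definition written in terms of the error probability.  The
function names are my own and are not taken from any toolkit.

```python
import math


def phred_from_error(p_err):
    """PHRED quality for an error probability p_err (0 < p_err <= 1)."""
    return -10 * math.log10(p_err)


def solexa_from_error(p_err):
    """Solexa quality for an error probability p_err (0 < p_err < 1).

    Assumption: this equals 10*log10(p/(1-p)) with p = 1 - p_err,
    i.e. p read as the probability that the call is correct.
    """
    return -10 * math.log10(p_err / (1 - p_err))


def phred_from_solexa(q_solexa):
    """Convert a Solexa score to the PHRED score for the same error rate."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# The two scales nearly agree for good reads and diverge for poor ones.
for p_err in (0.001, 0.01, 0.1, 0.5):
    print(p_err, round(phred_from_error(p_err), 2),
          round(solexa_from_error(p_err), 2))
```

The divergence at high error rates is why the distinction matters
mostly for poor quality reads.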

If anyone out there using the 1.4 pipeline can confirm this, that would
be most helpful, as all the Bio* toolkits and EMBOSS are updating their
FASTQ parsing.
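
One rough way to check a file, sketched below, is to look at the lowest
quality character it contains.  With an ASCII offset of 64, PHRED
scores cannot go below '@' (64), while Solexa scores can be negative
(typically down to -5, i.e. ';').  This is only a heuristic: it assumes
unwrapped four-line FASTQ records, the function names are hypothetical,
and a file with no low characters could still hold Solexa scores if it
has no very poor reads.

```python
def min_quality_ord(lines):
    """Lowest ASCII code among quality characters.

    Assumes each record is exactly four lines (no line wrapping),
    so every fourth line is a quality string.
    """
    lowest = None
    for i, line in enumerate(lines):
        if i % 4 == 3:
            for ch in line.rstrip("\r\n"):
                if lowest is None or ord(ch) < lowest:
                    lowest = ord(ch)
    return lowest


def guess_variant(lines):
    """Guess a SeqIO format name from the lowest quality character."""
    lowest = min_quality_ord(lines)
    if lowest is None:
        return None  # no quality lines seen
    if lowest < 59:
        # Too low for offset 64 with scores >= -5; likely offset 33.
        return "fastq-sanger"
    if lowest < 64:
        # Negative scores at offset 64 point to Solexa scoring.
        return "fastq-solexa"
    # Consistent with PHRED at offset 64, but not proof of it.
    return "fastq-illumina"


example = ["@read1", "ACGT", "+", "hhh;"]
print(guess_variant(example))
```

The returned name is meant to be used as the format argument to
Bio.SeqIO.parse or Bio.SeqIO.convert, after checking it against what
the pipeline documentation says.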

chris

