[Bioperl-l] Creating a fastq format file?
Heikki Lehvaslaiho
heikki.lehvaslaiho at gmail.com
Fri May 15 09:39:55 EDT 2009
Sorry for confusingly composed post. Here is a cleaner version.
-Heikki
I started from a series of probability estimates of true nucleotide
calls [1, 2]. They are in column 1 (Prob) in the table below. The
second column (phd) gives the corresponding phred quality. The third
has the solexa qualities as defined in [1].
>From these columns, I can see that phred and solexa qualities are
identical at value 6 and above. That quality is a lot lower than any
sensible threshold. 6 means one out of five nucleotides are wrong.
Quality score 20 that is often used as threshold means that one out of
100 nucleotides is wrong. In practice, quality values 6 or under are
automatically discarded and never considered seriously. I find it
difficult to understand why these qualities are considered to be so
different!
Both phred and solexa qualities are encoded into one character
representations into FASTQ text format [3, 4]. The rules of encoding
are slightly different.
Phred qualities are positive integers. The Sanger FASTQ specifications
[6] show that the encoding goes from "!" (ASCII 33 corresponding to
phred quality 0) up to "~" (ASCII 126, phred quality 93, probability
5e-10 of nucleotide being wrong ). See column 4 (sac[ASCII]) for
sanger encoding.
The wikipedia FASTQ page [5] says that "Sanger format encodes a Phred
quality score from 0 to 60 using ASCII 33 to 93." It generally
accepted that Phred qualities are limited to 60. See column 5
(sa6(ASCII)). Or are there exceptions?
Could it the that the authors of the maq web pages [3] were confusing
quality 60 for ASCII 60 and that lead to talking about quality 93
being the limit?
The following perl code snippet ($q = chr(($Q<=93? $Q : 93) + 33);)
from [3] seems to point that way?
Aside: What confuses me even more is that bioperl test data set
contains a quality file (t/data/qualfile.qual) where values go from 0
to 90. Where do these values come from? Which program generated them?
Solexa quality is encoded based on 64 and is limited by upper value of
40. There is no predefined lower limit although in practice used
values do not go lower than -5. In other words, the encoding goes from
"" (ASCII xx) to "@" (ASCII 64, solexa quality 0, probability 0.5 ) up
to "h" (ASCII 104, solexa quality 40, probability 1e-04) (Column 6,
colc(ASCII) ).
It was reported [6], that high solexa quality scores are
over-optimistic and low scores underestimate the data quality. It is
therefore for the better that Solexa quality is now only of historical
interest, as Solexa/Illumina pipeline v.1.3 ("illumina") uses phred
qualities 0 to 40. It still uses 64 as a base. So: The character
encoding goes from "@" (ASCII 64, phred quality 0, probability 0.8) up
to "h" (ASCII 104, phred quality 40, probability 1e-04) (final column
illc(ASCII) ).
If the facts and assumptions from above can be agreed on, we can move
on to coding. :)
Table:
The columns are probability, phred quality, solexa quality, sanger
encoding of phred qualities with corresponding ASCII values, sanger
encoding of phred qualities limited to Q60 with corresponding ASCII
values, solexa/illumina (1.0) encoding of solexa qualities with
corresponding ASCII values, and illumina (1.3) encoding of phred
quality with corresponding ASCII values.
Prob phq soq sac(ASCII) sa6(ASCII) solc(ASCII) illc(ASCII)
0.999 0 -29 !( 33) !( 33) "( 34) @( 64)
0.9 0 -9 !( 33) !( 33) 6( 54) @( 64)
0.8 1 -5 !( 33) !( 33) 9( 57) @( 64)
0.7 1 -3 "( 34) "( 34) <( 60) A( 65)
0.6 2 -1 #( 35) #( 35) >( 62) B( 66)
0.5 3 0 $( 36) $( 36) @( 64) C( 67)
0.4 4 1 $( 36) $( 36) A( 65) C( 67)
0.3 5 3 &( 38) &( 38) C( 67) E( 69)
0.2 7 6 '( 39) '( 39) F( 70) F( 70)
0.1 10 9 +( 43) +( 43) I( 73) J( 74)
0.01 20 20 5( 53) 5( 53) S( 83) T( 84)
0.001 30 30 ?( 63) ?( 63) ]( 93) ^( 94)
0.0001 40 40 I( 73) I( 73) g(103) h(104)
1e-05 50 50 S( 83) S( 83) h(104) h(104)
1e-06 60 60 ]( 93) ]( 93) h(104) h(104)
1e-07 70 70 g(103) ]( 93) h(104) h(104)
1e-08 80 80 q(113) ]( 93) h(104) h(104)
1e-09 90 90 z(122) ]( 93) h(104) h(104)
5e-10 93 93 ~(126) ]( 93) h(104) h(104)
1e-10 100 100 ~(126) ]( 93) h(104) h(104)
1e-11 110 110 ~(126) ]( 93) h(104) h(104)
Code. This perl code generates the above table.
--------------------------------------------------------------------
#!/usr/bin/env perl
use strict;
use warnings;
my @p = (0.999, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1,
0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001,
0.00000001, 0.000000001, 0.0000000005, 0.0000000001,
0.00000000001);
printf "%7s %3s %3s %s%6s %s%6s %s%6s %s%6s\n",
'Prob', 'phq', 'soq', 'sac', '(ASCII)', 'sa6', '(ASCII)',
'solc', '(ASCII)', 'illc', '(ASCII)';
for my $e (@p) {
my $Q = -10 * log($e) / log(10); #phred quality
my $sQ = -10 * log($e / (1 - $e)) / log(10); # solexa quality
my $qc = chr(($Q<=93 ? $Q : 93) + 33); # sanger character
my $q6 = chr(($Q<=60 ? $Q : 60) + 33); # sanger character, limited to Q60
my $qs = chr(($Q<=40 ? $sQ : 40) + 64); # solexa character
# based on phred quality
my $qi = chr(($Q<=40 ? $Q : 40) + 64); #illumina character
# I've added 0.1 some int values to couteract a strange downward
# rounding of values by perl
printf "%7s %3d %3d %s(%3d) %s(%3d) %s(%3d) %s(%3d)\n",
$e, $Q+0.1, $sQ+0.1, $qc, ord($qc), $q6, ord($q6),
$qs, ord($qs), $qi, ord($qi);
}
--------------------------------------------------------------------
References:
1. http://maq.sourceforge.net/qual.shtml
2. http://en.wikipedia.org/wiki/Phred_quality_score
3. http://maq.sourceforge.net/fastq.shtml
4. http://en.wikipedia.org/wiki/FASTQ_format
5. http://en.wikipedia.org/wiki/FASTQ_format#Encoding
6. http://maq.sourceforge.net/fastq.shtml#spec
7. Dohm JC, Lottaz C, Borodina T and Himmelbauer H: Substantial biases
in ultra-short read data sets from high-throughput DNA sequencing
Nucleic Acids Research 2008 36(16):e105; doi:10.1093/nar/gkn425
http://nar.oxfordjournals.org/cgi/content/abstract/36/16/e105
8. Solution to Sanger/Solexa/Illumina FASTQ confusion
http://seqanswers.com/forums/showthread.php?t=1526
2009/5/15 Heikki Lehvaslaiho <heikki.lehvaslaiho at gmail.com>:
> My initial assumption of linear relationship between different integer
> ranges used to represent quality values was overly simplistic. I've
> spent some time trying to understand the real relationships between
> quality integers and their ASCII encodings.
>
> My notes turned into a small essay which makes it quite long for an
> email. Please bear with me and reply back to the list if you think you
> have an insight into these matters.
>
> -Heikki
>
>
>
> I started from a series of probability estimates of true nucleotide
> calls [1, 2]. They are in column 1 (Prob) in the table below. The
> second column (phd) gives the corresponding phred quality. The third
> has the solexa qualities as defined in [1].
>
> From these columns, I can see that phred and solexa qualities are
> identical at value 6 and above. That quality is a lot lower than any
> sensible threshold. 6 means one out of five nucleotides are wrong.
> Quality score 20 that is often used as threshold means that one out of
> 100 nucleotides is wrong. In practice, quality values 6 or under are
> automatically discarded and never considered seriously. I find it
> difficult to understand why these qualities are considered to be so
> different!
>
> Both phred and solexa qualities are encoded into one character
> representations into FASTQ text format [3, 4]. The rules of encoding
> are slightly different.
>
> Phred qualities are positive integers. The Sanger FASTQ specifications
> [6] show that the encoding goes from "!" (ASCII 33 corresponding to
> phred quality 0) up to "~" (ASCII 126, phred quality 93, probability
> 5e-10 of nucleotide being wrong ). See column 4 (sac[ASCII]) for
> sanger encoding.
>
> The wikipedia FASTQ page [5] says that "Sanger format encodes a Phred
> quality score from 0 to 60 using ASCII 33 to 93." It generally
> accepted that Phred qualities are limited to 60. See column 5
> (sa6(ASCII)). Or are there exceptions?
>
> Could it the that the authors of the maq web pages [3] were confusing
> quality 60 for ASCII 60 and that lead to talking about quality 93
> being the limit?
>
> The following perl code snippet ($q = chr(($Q<=93? $Q : 93) + 33);)
> from [3] seems to point that way?
>
> Aside: What confuses me even more is that bioperl test data set
> contains a quality file (t/data/qualfile.qual) where values go from 0
> to 90. Where do these values come from? Which program generated them?
>
> Solexa quality is encoded based on 64 and is limited by upper value
> of 40. There is no predefined lower limit although in practice used
> values do not go lower than -5. In other words, the encoding goes from
> "" (ASCII xx) to "@" (ASCII 64, solexa quality 0, probability 0.5 ) up
> to "h" (ASCII 104, solexa quality 40, probability 1e-04) (Column 6,
> colc(ASCII) ).
>
> It was reported [6], that high solexa quality scores are
> over-optimistic and low scores underestimate the data quality. It is
> therefore for the better that Solexa quality is now only of historical
> interest, as Solexa/Illumina pipeline v.1.3 ("illumina") uses phred
> qualities 0 to 40. It still uses 64 as a base. So: The character
> encoding goes from "@" (ASCII 64, phred quality 0, probability 0.8) up
> to "h" (ASCII 104, phred quality 40, probability 1e-04) (final column
> illc(ASCII) ).
>
> If the facts and assumptions from above can be agreed on, we can move
> on to coding. :)
>
> Table:
>
> The columns are probability, phred quality, solexa quality, sanger
> encoding of phred qualities with corresponding ASCII values, sanger
> encoding of phred qualities limited to Q60 with corresponding ASCII
> values, solexa/illumina (1.0) encoding of solexa qualities with
> corresponding ASCII values, and illumina (1.3) encoding of phred
> quality with corresponding ASCII values.
>
>
>
> I started from a series of probability estimates of true nucleotide
> calls [References 1, 2]. They are in column 1 (Prob) in the table
> below. The second
> column (phd) gives the corresponding phred quality. The third has the
> solexa qualities as defined in [1].
>
> From these columns, I can see that phred and solexa qualities are
> identical at value 6 and above. That quality is a lot lower than any
> sensible threshold. 6 means one out of five nucleotides are
> wrong. Quality score 20 that is often used as threshold means that one
> out of 100 nucleotides is wrong. In practice, quality values 6 or
> under are automatically discarded and never considered seriously. I
> find it difficult to understand why these qualities are considered to
> be so different!
>
> Both phred and solexa qualities are encoded into one character
> representations into FASTQ text format [3, 4]. The rules of encoding
> are slightly different.
>
> Phred qualities are positive integers. The Sanger FASTQ specifications
> [6] show that the encoding goes from "!" (ASCII 33 corresponding to
> phred quality 0) up to "~" (ASCII 126, phred quality 93, probability
> 5e-10 of nucleotide being wrong ). See column 4 (sac[ASCII]) for
> sanger encoding.
>
> The wikipedia FASTQ page [5] says that "Sanger format encodes a Phred
> quality score from 0 to 60 using ASCII 33 to 93." It generally
> accepted that Phred qualities are limited to 60. See column 5
> (sa6(ASCII)). Or are there exceptions?
>
> Could it the that the authors of the maq web pages [3] were confusing
> quality 60 for ASCII 60 and that lead to talking about quality 93
> being the limit?
>
> The following perl code snippet ($q = chr(($Q<=93? $Q : 93) + 33);)
> from [3] seems to point that way?
>
> Aside: What confuses me even more is that bioperl test data set
> contains a quality file (t/data/qualfile.qual) where values go from 0
> to 90. Where do these values come from? Which program generated them?
>
>
> Solexa quality is encoded based on 64 and is limited by upper value of
> 40. There is no predefined lower limit although in practice used
> values do not go lower than -5. In other words, the encoding goes from
> "" (ASCII xx) to "@" (ASCII 64, solexa quality 0, probability 0.5 ) up
> to "h" (ASCII 104, solexa quality 40, probability 1e-04) (Column 6,
> colc(ASCII) ).
>
> It was reported [6], that high solexa quality scores are
> over-optimistic and low scores underestimate the data quality. It is
> therefore for the better that Solexa quality is now only of historical
> interest, as Solexa/Illumina pipeline v.1.3 ("illumina") uses phred
> qualities 0 to 40. It still uses 64 as a base. So: The character
> encoding goes from "@" (ASCII 64, phred quality 0, probability 0.8) up
> to "h" (ASCII 104, phred quality 40, probability 1e-04) (final column
> illc(ASCII) ).
>
> If the facts and assumptions from above can be agreed on, we can move
> on to coding. :)
>
>
>
> Table:
>
> The columns are probability, phred quality, solexa quality, sanger
> encoding of phred qualities with corresponding ASCII values, sanger
> encoding of phred qualities limited to Q60 with corresponding ASCII
> values, solexa/illumina (1.0) encoding of solexa qualities with
> corresponding ASCII values, and illumina (1.3) encoding of phred
> quality with corresponding ASCII values.
>
>
> Prob phq soq sac(ASCII) sa6(ASCII) solc(ASCII) illc(ASCII)
> 0.999 0 -29 !( 33) !( 33) "( 34) @( 64)
> 0.9 0 -9 !( 33) !( 33) 6( 54) @( 64)
> 0.8 1 -5 !( 33) !( 33) 9( 57) @( 64)
> 0.7 1 -3 "( 34) "( 34) <( 60) A( 65)
> 0.6 2 -1 #( 35) #( 35) >( 62) B( 66)
> 0.5 3 0 $( 36) $( 36) @( 64) C( 67)
> 0.4 4 1 $( 36) $( 36) A( 65) C( 67)
> 0.3 5 3 &( 38) &( 38) C( 67) E( 69)
> 0.2 7 6 '( 39) '( 39) F( 70) F( 70)
> 0.1 10 9 +( 43) +( 43) I( 73) J( 74)
> 0.01 20 20 5( 53) 5( 53) S( 83) T( 84)
> 0.001 30 30 ?( 63) ?( 63) ]( 93) ^( 94)
> 0.0001 40 40 I( 73) I( 73) g(103) h(104)
> 1e-05 50 50 S( 83) S( 83) h(104) h(104)
> 1e-06 60 60 ]( 93) ]( 93) h(104) h(104)
> 1e-07 70 70 g(103) ]( 93) h(104) h(104)
> 1e-08 80 80 q(113) ]( 93) h(104) h(104)
> 1e-09 90 90 z(122) ]( 93) h(104) h(104)
> 5e-10 93 93 ~(126) ]( 93) h(104) h(104)
> 1e-10 100 100 ~(126) ]( 93) h(104) h(104)
> 1e-11 110 110 ~(126) ]( 93) h(104) h(104)
>
>
> Code. This perl code generates the above table.
>
> --------------------------------------------------------------------
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
>
> my @p = (0.999, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1,
> 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001,
> 0.00000001, 0.000000001, 0.0000000005, 0.0000000001,
> 0.00000000001);
>
> printf "%7s %3s %3s %s%6s %s%6s %s%6s %s%6s\n",
> 'Prob', 'phq', 'soq', 'sac', '(ASCII)', 'sa6', '(ASCII)',
> 'solc', '(ASCII)', 'illc', '(ASCII)';
>
> for my $e (@p) {
>
> my $Q = -10 * log($e) / log(10); #phred quality
> my $sQ = -10 * log($e / (1 - $e)) / log(10); # solexa quality
>
> my $qc = chr(($Q<=93 ? $Q : 93) + 33); # sanger character
> my $q6 = chr(($Q<=60 ? $Q : 60) + 33); # sanger character, limited to Q60
> my $qs = chr(($Q<=40 ? $sQ : 40) + 64); # solexa character
> # based on phred quality
> my $qi = chr(($Q<=40 ? $Q : 40) + 64); #illumina character
>
> # I've added 0.1 some int values to couteract a strange downward
> # rounding of values by perl
> printf "%7s %3d %3d %s(%3d) %s(%3d) %s(%3d) %s(%3d)\n",
> $e, $Q+0.1, $sQ+0.1, $qc, ord($qc), $q6, ord($q6),
> $qs, ord($qs), $qi, ord($qi);
> }
> --------------------------------------------------------------------
>
> References:
>
> 1. http://maq.sourceforge.net/qual.shtml
>
> 2. http://en.wikipedia.org/wiki/Phred_quality_score
>
> 3. http://maq.sourceforge.net/fastq.shtml
>
> 4. http://en.wikipedia.org/wiki/FASTQ_format
>
> 5. http://en.wikipedia.org/wiki/FASTQ_format#Encoding
>
> 6. http://maq.sourceforge.net/fastq.shtml#spec
>
> 7. Dohm JC, Lottaz C, Borodina T and Himmelbauer H: Substantial biases
> in ultra-short read data sets from high-throughput DNA sequencing
> Nucleic Acids Research 2008 36(16):e105; doi:10.1093/nar/gkn425
> http://nar.oxfordjournals.org/cgi/content/abstract/36/16/e105
>
> 8. Solution to Sanger/Solexa/Illumina FASTQ confusion
> http://seqanswers.com/forums/showthread.php?t=1526
>
>
>
>
>
>
>
>
> 2009/5/13 Chris Fields <cjfields at illinois.edu>:
>> Heikki,
>>
>> Did you still want to commit this? I think it's a good idea and would be
>> worth including in the next 1.6 point release.
>>
>> chris
>>
>> ------------------------------------------------------------
>> I convinced at least myself to the degree that I wrote the
>> range_convert() method - with plenty of tests. I mention this now so
>> that no-one else need to start thinking through all the edge values.
>> :)
>>
>> I'll contribute it to the code base once there is a consensus of best
>> way forward.
>>
>> -Heikki
>>
>> 2009/4/27 Heikki Lehvaslaiho <heikki.lehvaslaiho at gmail.com>:
>>>> I have tried to summarise this in a central place:
>>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>
>>> Torsten,
>>>
>>> Thanks for putting this together. Very helpful.
>>>
>>> Do you have a plan of action? Let me propose one for BioPerl. It
>>> based on following assumptions:
>>>
>>> 1. There is multitude of different ways of coding quality values out
>>> there.
>>> 2. Bio::Seq::Quality is agnostic of any quality value range rules
>>> 3. The emerging open standard is the Sanger fastq specification
>>> 4. Open source programs use the Sanger fastq specs
>>>
>>>
>>> From these it follows that:
>>>
>>>
>>> 1. BioPerl should support Sanger fastq standard
>>>
>>> 1.1. it already does and there are other SeqIO modules for dealing
>>> with other non-fastq formats.
>>>
>>> 2. BioPerl should offer simple ways of converting between quality range
>>> rules
>>>
>>> 2.1. Have a generic method accessible from Bio::Seq::Quality with
>>> preset versions of the method for converting between known variants
>>> (Sanger fastq and the two Illumina versions)
>>>
>>> For example:
>>>
>>> range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
>>> throw if $value < $from_lower or $value > $from_upper
>>> return $newvalue
>>>
>>> range_convert_illumina2fastq(), range_convert_fastq2illumina(),
>>> range_convert_fastq2phred(), range_convert_phred2fastq()....
>>>
>>> (assuming that illumina 1.3 eq phred)
>>>
>>> 2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
>>> qualities into Sanger fastq on the fly
>>>
>>> 2.2.1 Bio::SeqIO::Fastq::next_seq should detect the incoming stream of
>>> quality value range either automatically or be given a keyword
>>> parameter indicating the range.
>>>
>>> 2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
>>> a quality value out of range.
>>>
>>> 2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
>>> detects a quality value out of range.
>>>
>>> 2.2.4. It would be useful but not absolutely necessary for
>>> Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
>>> ranges
>>>
>>>
>>> What do you think?
>>>
>>> -Heikki
>>>
>>> 2009/4/26 Torsten Seemann <torsten.seemann at infotech.monash.edu.au>:
>>>>> > This might be a good place to ask the question: having looked at the
>>>>> > fastq.pm page, is the fastq format defined (only) by a "@'" followed
>>>>> > by
>>>>> a
>>>>> > sequence line and a "+" header followed by a quality line and the two
>>>>> > headers have to agree? Now that Illumina is using phred scaling, are
>>>>> > 'Sanger' and 'Illumina' versions the same?
>>>>>
>>>>> No they aren't the same, Illumina still encodes the ascii as value + 64
>>>>> and Sanger as value + 33.
>>>>>
>>>>
>>>> Illumina have now CHANGED how they calculate the quality value however in
>>>> the last month or so... Their Q range used to be -5..40 mapped to ASCII
>>>> 64+,
>>>> but now they produce Q >= 0 and it is unclear if they start at 69 or 64
>>>> now...
>>>>
>>>> I have tried to summarise this in a central place:
>>>>
>>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>>
>>>> Corrections welcome!
>>>>
>>>>
>>>> --Torsten Seemann
>>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>>> University, AUSTRALIA
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>>>
>>> --
>>> -Heikki
>>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>>> cell: +27 (0)714328090
>>> Sent from Claremont, WC, South Africa
>>>
>>
>>
>>
>> --
>> -Heikki
>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>> cell: +27 (0)714328090
>> Sent from Claremont, WC, South Africa
>>
>
>
>
> --
> -Heikki
> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
> cell: +27 (0)714328090
> Sent from Claremont, WC, South Africa
>
--
-Heikki
Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
cell: +27 (0)714328090
Sent from Claremont, WC, South Africa
More information about the Bioperl-l
mailing list