[BioRuby] FastQ and Quality Scores
Raoul Bonnal
bonnal at ingm.org
Fri Mar 25 17:06:31 UTC 2011
Dear All,
probably I don't know very much this part of the bioruby library (not yet). In this days I'm playing a bit with fastq data :-) and bioruby
today I was reading a fastq file coming from Illumina HiScanSQ:
r=Bio::FlatFile.auto("spec/fixture/test.fastq")
x=r.first
=> #<Bio::Fastq:0x00000100849b20 @definition="H125:1:1108:1147:2046#0/1", @sequence_string="CGGGAAATGGCGGAAATGTGCAAGGATGTTATATGAATTGTGTTGGTTGGCCTAAAACACAGAGCCGGCTTGAAGT", @definition2="", @quality_string="[XSZGXPSKPJPP]TWQRFRHXW\\WXXX]PUX[W_^R^[XRNRYV]^Y[]UUNRT]X_BBBBBBBBBBBBBBBBBB">
No format specified, but by default the format of the library is sanger:
x.quality_scores
=> [58, 55, 50, 57, 38, 55, 47, 50, 42, 47, 41, 47, 47, 60, 51, 54, 48, 49, 37, 49, 39, 55, 54, 59, 54, 55, 55, 55, 60, 47, 52, 55, 58, 54, 62, 61, 49, 61, 58, 55, 49, 45, 49, 56, 53, 60, 61, 56, 58, 60, 52, 52, 45, 49, 51, 60, 55, 62, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33]
\">"
x
=> #<Bio::Fastq:0x00000100849b20 @definition="H125:1:1108:1147:2046#0/1", @sequence_string="CGGGAAATGGCGGAAATGTGCAAGGATGTTATATGAATTGTGTTGGTTGGCCTAAAACACAGAGCCGGCTTGAAGT", @definition2="", @quality_string="[XSZGXPSKPJPP]TWQRFRHXW\\WXXX]PUX[W_^R^[XRNRYV]^Y[]UUNRT]X_BBBBBBBBBBBBBBBBBB", @format=#<Bio::Fastq::FormatData::FASTQ_SANGER:0x00000100881ea8 @name="fastq-sanger", @symbol=:fastq_sanger, @offset=33, @score_range=0..93>, @quality_scores=[58, 55, 50, 57, 38, 55, 47, 50, 42, 47, 41, 47, 47, 60, 51, 54, 48, 49, 37, 49, 39, 55, 54, 59, 54, 55, 55, 55, 60, 47, 52, 55, 58, 54, 62, 61, 49, 61, 58, 55, 49, 45, 49, 56, 53, 60, 61, 56, 58, 60, 52, 52, 45, 49, 51, 60, 55, 62, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33]>
so the sanger is the default.
Is Illumina the right format for the newest technology/datasets ?
x.format = :fastq_illumina
=> :fastq_illumina
x.quality_scores
=> [27, 24, 19, 26, 7, 24, 16, 19, 11, 16, 10, 16, 16, 29, 20, 23, 17, 18, 6, 18, 8, 24, 23, 28, 23, 24, 24, 24, 29, 16, 21, 24, 27, 23, 31, 30, 18, 30, 27, 24, 18, 14, 18, 25, 22, 29, 30, 25, 27, 29, 21, 21, 14, 18, 20, 29, 24, 31, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
Note that two formats are using different offsets and ranges.
Obviously also the error_probabilities are different. I think that this behavior is misleading at least for me, using data coming from illumina using their metric and finding the phred right there.
x.error_probabilities #illumina
=> [0.001995262314968879, 0.003981071705534973, 0.012589254117941675, 0.0025118864315095794, 0.19952623149688797, 0.003981071705534973, 0.025118864315095794, 0.012589254117941675, 0.07943282347242814, 0.025118864315095794, 0.1, 0.025118864315095794, 0.025118864315095794, 0.0012589254117941675, 0.01, 0.005011872336272725, 0.0199526231496888, 0.015848931924611134, 0.251188643150958, 0.015848931924611134, 0.15848931924611134, 0.003981071705534973, 0.005011872336272725, 0.001584893192461114, 0.005011872336272725, 0.003981071705534973, 0.003981071705534973, 0.003981071705534973, 0.0012589254117941675, 0.025118864315095794, 0.007943282347242814, 0.003981071705534973, 0.001995262314968879, 0.005011872336272725, 0.0007943282347242813, 0.001, 0.015848931924611134, 0.001, 0.001995262314968879, 0.003981071705534973, 0.015848931924611134, 0.039810717055349734, 0.015848931924611134, 0.0031622776601683794, 0.00630957344480193, 0.0012589254117941675, 0.001, 0.0031622776601683794, 0.001995262314968879, 0.0012589254117941675, 0.007943282347242814, 0.007943282347242814, 0.039810717055349734, 0.015848931924611134, 0.01, 0.0012589254117941675, 0.003981071705534973, 0.0007943282347242813, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932, 0.6309573444801932]
x.error_probabilities #phred
=> [1.584893192461114e-06, 3.162277660168379e-06, 1.0e-05, 1.9952623149688787e-06, 0.00015848931924611142, 3.162277660168379e-06, 1.9952623149688786e-05, 1.0e-05, 6.309573444801929e-05, 1.9952623149688786e-05, 7.943282347242822e-05, 1.9952623149688786e-05, 1.9952623149688786e-05, 1.0e-06, 7.943282347242822e-06, 3.981071705534969e-06, 1.584893192461114e-05, 1.2589254117941661e-05, 0.00019952623149688788, 1.2589254117941661e-05, 0.00012589254117941674, 3.162277660168379e-06, 3.981071705534969e-06, 1.2589254117941661e-06, 3.981071705534969e-06, 3.162277660168379e-06, 3.162277660168379e-06, 3.162277660168379e-06, 1.0e-06, 1.9952623149688786e-05, 6.30957344480193e-06, 3.162277660168379e-06, 1.584893192461114e-06, 3.981071705534969e-06, 6.30957344480193e-07, 7.943282347242822e-07, 1.2589254117941661e-05, 7.943282347242822e-07, 1.584893192461114e-06, 3.162277660168379e-06, 1.2589254117941661e-05, 3.1622776601683795e-05, 1.2589254117941661e-05, 2.5118864315095823e-06, 5.011872336272725e-06, 1.0e-06, 7.943282347242822e-07, 2.5118864315095823e-06, 1.584893192461114e-06, 1.0e-06, 6.30957344480193e-06, 6.30957344480193e-06, 3.1622776601683795e-05, 1.2589254117941661e-05, 7.943282347242822e-06, 1.0e-06, 3.162277660168379e-06, 6.30957344480193e-07, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725, 0.0005011872336272725]
a) What do you think about specify the format right in the loading phase (initialize) ?
a.1) the problem is that illumina's documentation makes clear reference to
NOTE
The quality scoring scheme Illumina uses is the Phred scoring scheme, encoded as an ASCII character by adding 64 to the Phred value. A Phred score of a base is:
Qphred =-10 log10(e) where e is the estimated probability of a base being wrong.
b) try to identify the format from the header of the first sequence or use a convention in the extension ?
c) ALWAYS use sanger
I'm lost O_o
--
Ra
linkedin: http://it.linkedin.com/in/raoulbonnal
twitter: http://twitter.com/ilpuccio
skype: ilpuccio
irc.freenode.net: Helius
github: https://github.com/helios
More information about the BioRuby
mailing list