[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
Peter
biopython at maubp.freeserve.co.uk
Wed Jul 29 06:15:55 EDT 2009
Hi all,
This is a follow-up to the earlier discussion about high quality scores
in Solexa or Illumina 1.3+ FASTQ files and the problem of non-printable
ASCII codes (which can occur if converting from Sanger FASTQ).
> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields<cjfields at illinois.edu> wrote:
>>
>>> Now, here comes the problem. I believe FASTQ files directly
>>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>>> range 0 to 40 (as in this example). However, much higher
>>> PHRED scores are possible during assembly / contig'ing
>>> and read mapping. For example, the tool MAQ will output
>>> Sanger style FASTQ files with PHRED scores in the range
>>> 0 to 93 inclusive.
>>
>> We can support it as Illumina 1.3, but my point is this may be getting into a
>> grey area and may be something that Illumina doesn't/wouldn't support.
>> Reminds me a little of the multiple GFF2 variations (one of the main
>> reasons for a GFF3).
>
> I agree this is a grey area (high scores in Solexa/Illumina
> FASTQ files).
>
> ...
>
> i.e. An Illumina FASTQ format file can hold PHRED scores in the
> range 0 to 62 without using problem characters. And likewise
> for a Solexa FASTQ file (Solexa scores up to 62).
Peter Rice and I have been talking about this off list, and have
a proposal for the high score problem. Basically we want to
restrict FASTQ quality strings to printable ASCII, which means
126 (0x7e) is a firm upper limit, while otherwise allowing scores
as high as possible. This limit comes from ASCII 127 being
"delete", and the even higher characters also being non-printable.
i.e. We are suggesting:
"fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
0x21 to 0x7e). This is as defined on the MAQ web pages.
"fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
mapped with an ASCII offset of 64 to ASCII characters 64 to 104
(or in hex, 0x40 to 0x68). It is a reasonable and well defined
extension to permit PHRED scores from 0 to 62 inclusive, which
map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
non printing characters, and gives some head room for improved
sequencing technology from Illumina giving higher raw scores.
"fastq-solexa" - Believed to use Solexa scores from -5 to at least
40, again mapped with an ASCII offset of 64 giving ASCII characters
59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
defined extension would permit Solexa scores in the range -5 to 62
inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).
[Peter R. - please correct me if any of the above is not what you had
in mind]
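To make the three proposed ranges concrete, here is a minimal Python sketch of the mapping. The names FASTQ_VARIANTS, encode_quality and decode_quality are made up for illustration and are not taken from Biopython, BioPerl or EMBOSS; the offsets and score limits are just the ones suggested above.

```python
# Hypothetical table of the proposed variants:
# name -> (ASCII offset, lowest score, highest score)
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),     # PHRED, ASCII 33 to 126
    "fastq-illumina": (64, 0, 62),   # PHRED, ASCII 64 to 126
    "fastq-solexa": (64, -5, 62),    # Solexa scores, ASCII 59 to 126
}


def encode_quality(scores, variant):
    """Turn a list of integer scores into a quality string."""
    offset, low, high = FASTQ_VARIANTS[variant]
    chars = []
    for score in scores:
        if not low <= score <= high:
            raise ValueError(
                "score %i outside %i to %i for %s" % (score, low, high, variant)
            )
        chars.append(chr(score + offset))
    return "".join(chars)


def decode_quality(text, variant):
    """Turn a quality string back into a list of integer scores."""
    offset, low, high = FASTQ_VARIANTS[variant]
    scores = []
    for char in text:
        score = ord(char) - offset
        if not low <= score <= high:
            raise ValueError("character %r not valid for %s" % (char, variant))
        scores.append(score)
    return scores
```

With these limits the top score in every variant lands on ASCII 126 ("~"), so no variant can produce a non-printable character.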
If in the process of converting between formats, a quality score
is too high (it would result in ASCII 127 or higher), then I would
argue any of the following would be acceptable:
(a) Silently impose the maximum score (ASCII 126, 0x7e)
(b) Impose the maximum score with a warning
(c) Raise an error
I don't think EMBOSS, BioPerl and Biopython have to handle
this exactly the same way, but I would favour (b) then (a).
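As a rough sketch of options (a), (b) and (c), the following Python function handles a score which would map past ASCII 126. The function name score_to_char and its on_overflow argument are hypothetical, and none of the three projects necessarily does it this way. It only deals with the upper limit, not with scores below the minimum.

```python
import warnings

MAX_PRINTABLE = 126  # "~", the last printable ASCII character


def score_to_char(score, offset, on_overflow="warn"):
    """Map one score to its character, applying an overflow policy.

    on_overflow is a made-up argument for this sketch:
    "silent" is option (a), "warn" is option (b), "error" is option (c).
    """
    if on_overflow not in ("silent", "warn", "error"):
        raise ValueError("unknown overflow policy %r" % on_overflow)
    code = score + offset
    if code > MAX_PRINTABLE:
        if on_overflow == "error":
            raise ValueError(
                "score %i with offset %i exceeds ASCII 126" % (score, offset)
            )
        if on_overflow == "warn":
            warnings.warn(
                "score %i too high, capped at %i"
                % (score, MAX_PRINTABLE - offset)
            )
        code = MAX_PRINTABLE  # impose the maximum score
    return chr(code)
```

For example, converting a Sanger PHRED score of 93 to "fastq-illumina" (offset 64) would give ASCII 157, so with the default policy it is capped at "~" (PHRED 62) and a warning is issued.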
Peter