[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Chris Fields cjfields at illinois.edu
Wed Aug 5 15:12:18 UTC 2009


On Jul 29, 2009, at 5:15 AM, Peter wrote:

> Hi all,
>
> This is a follow up to the earlier discussion about high quality scores
> in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
> ASCII codes (which can occur if converting from Sanger FASTQ).
>
>> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>>
>>>> Now, here comes the problem. I believe FASTQ files directly
>>>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>>>> range 0 to 40 (as in this example). However, much higher
>>>> PHRED scores are possible during assembly / contig'ing
>>>> and read mapping. For example, the tool MAQ will output
>>>> Sanger style FASTQ files with PHRED scores in the range
>>>> 0 to 93 inclusive.
>>>
>>> We can support it as Illumina 1.3, but my point is this may be getting
>>> into a grey area and may be something that Illumina doesn't/wouldn't
>>> support. Reminds me a little of the multiple GFF2 variations (one of
>>> the main reasons for a GFF3).
>>
>> I agree this is a grey area (high scores in Solexa/Illumina
>> FASTQ files).
>>
>> ...
>>
>> i.e. An Illumina FASTQ format file can hold PHRED scores in the
>> range 0 to 62 without using problem characters. And likewise
>> for a Solexa FASTQ file (Solexa scores up to 62).
>
> Peter Rice and I have been talking about this off list, and have
> a proposal for the high score problem. Basically we want to
> restrict FASTQ quality strings to printable ASCII, which means
> 126 (0x7e) is a firm upper limit, while otherwise allowing scores
> as high as possible. This limit comes from ASCII 127 being
> "delete", and the even higher characters also being non-printable.
>
> i.e. We are suggesting:
>
> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>
> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
> (or in hex, 0x40 to 0x68). It is a reasonable and well defined
> extension to permit PHRED scores from 0 to 62 inclusive, which
> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
> non printing characters, and gives some head room for improved
> sequencing technology from Illumina giving higher raw scores.
>
> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
> 40, again mapped with an ASCII offset of 64 giving ASCII characters
> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
> defined extension would permit Solexa scores in the range -5 to 62
> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).
>
> [Peter R. - please correct me if any of the above is not what you had
> in mind]
>
> If in the process of converting between formats, a quality score
> is too high (it would result in ASCII 127 or higher), then I would
> argue any of the following would be acceptable:
> (a) Silently impose the maximum score (ASCII 126, 0x7e)
> (b) Impose the maximum score with a warning
> (c) Raise an error
>
> I don't think EMBOSS, BioPerl and Biopython have to handle
> this exactly the same way, but I would favour (b) then (a).
>
> Peter
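
The three proposed ranges and the overflow options (a) to (c) can be
sketched in a few lines of Python. This is only an illustration of the
proposal as worded above, not the Biopython, BioPerl, or EMBOSS
implementation. The names FASTQ_VARIANTS, encode_quality, and the
on_overflow parameter are hypothetical, and rejecting scores below the
minimum is an assumption the proposal does not address.

```python
import warnings

# Assumed layout: variant name -> (ASCII offset, lowest score, highest score).
# The highest score in each case maps to ASCII 126 (0x7e).
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),     # PHRED, ASCII 33 to 126
    "fastq-illumina": (64, 0, 62),   # PHRED, ASCII 64 to 126
    "fastq-solexa": (64, -5, 62),    # Solexa scores, ASCII 59 to 126
}


def encode_quality(scores, variant, on_overflow="warn"):
    """Encode integer scores as a printable-ASCII quality string.

    on_overflow selects the handling of too-high scores:
    "clamp" is option (a), "warn" is option (b), "error" is option (c).
    """
    if on_overflow not in ("clamp", "warn", "error"):
        raise ValueError(f"unknown overflow policy: {on_overflow!r}")
    offset, low, high = FASTQ_VARIANTS[variant]
    chars = []
    for score in scores:
        if score < low:
            # Assumption: too-low scores are always treated as errors.
            raise ValueError(f"score {score} is below the {variant} minimum {low}")
        if score > high:
            message = f"score {score} exceeds the {variant} maximum {high}"
            if on_overflow == "error":
                raise ValueError(message)
            if on_overflow == "warn":
                warnings.warn(message + "; using the maximum instead")
            score = high  # impose the maximum (ASCII 126)
        chars.append(chr(score + offset))
    return "".join(chars)
```

For example, encode_quality([0, 40, 62], "fastq-illumina") gives "@h~",
and a Sanger-range score of 93 written as "fastq-illumina" comes out as
"~" under the "clamp" and "warn" policies.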

I think, based on Aaron's comments, with BioPerl we'll adopt (b) to
deal with format validation, but try to do it in a way that 'caches'
bad data so it doesn't report a warning on every out-of-range value.
I am planning on a Moose-based parser at some point that will do the
same.
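
One way to read the 'caching' idea is to collect out-of-range values
while clamping and report them in a single warning at the end. The
sketch below is in Python rather than Perl and is only a guess at the
general shape, not the BioPerl code. OverflowTracker and its methods
are hypothetical names.

```python
import warnings
from collections import Counter


class OverflowTracker:
    """Clamp too-high scores and remember them for one summary warning."""

    def __init__(self, high):
        self.high = high        # highest score the output format allows
        self.seen = Counter()   # out-of-range score -> times encountered

    def clamp(self, score):
        """Return the score, limited to self.high, recording any overflow."""
        if score > self.high:
            self.seen[score] += 1
            return self.high
        return score

    def report(self):
        """Emit at most one warning covering every cached bad value."""
        if self.seen:
            total = sum(self.seen.values())
            warnings.warn(
                f"{total} score(s) above {self.high} were clamped "
                f"(highest seen: {max(self.seen)})"
            )
```

A parser would call clamp() on each score as it goes and report() once
per file (or per record), so a file full of high scores produces one
message instead of thousands.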

chris




