[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Chris Fields cjfields at illinois.edu
Thu Jul 30 15:46:51 UTC 2009


On Jul 30, 2009, at 5:18 AM, Peter wrote:

> On Wed, Jul 29, 2009 at 11:15 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> This is a follow up to the earlier discussion about high quality scores
>> in Solexa or Illumina 1.3+ FASTQ files and the problem of non-printable
>> ASCII codes (which can occur if converting from Sanger FASTQ).
>>
>> ...
>>
>> Peter Rice and I have been talking about this off list, and have
>> a proposal for the high score problem. Basically we want to
>> restrict FASTQ quality strings to printable ASCII, which means
>> 126 (0x7e) is a firm upper limit, while otherwise allowing for as
>> high a score as possible. This limit comes from ASCII 127 being
>> "delete", and the even higher characters also being non-printable.
>>
>> i.e. We are suggesting:
>>
>> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
>> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
>> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>>
>> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
>> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
>> (or in hex, 0x40 to 0x68). It is a reasonable and well defined
>> extension to permit PHRED scores from 0 to 62 inclusive, which
>> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
>> non printing characters, and gives some head room for improved
>> sequencing technology from Illumina giving higher raw scores.
>>
>> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
>> 40, again mapped with an ASCII offset of 64 giving ASCII characters
>> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
>> defined extension would permit Solexa scores in the range -5 to 62
>> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).
>
> The latest version of Biopython in our repository now follows this,
> avoiding any non-printing characters (which should trigger an error
> on parsing).
>
>> If in the process of converting between formats, a quality score
>> is too high (it would result in ASCII 127 or higher), then I would
>> argue any of the following would be acceptable:
>> (a) Silently impose the maximum score (ASCII 126, 0x7e)
>> (b) Impose the maximum score with a warning
>> (c) Raise an error
>>
>> I don't think EMBOSS, BioPerl and Biopython have to handle
>> this exactly the same way, but I would favour (b) then (a).
>
> The EMBOSS patch I was testing from Peter Rice went for a
> silent truncation, and in Biopython we have also for the moment
> gone for silently imposing the maximum scores (ASCII 126, 0x7e)
> of 93, 62 and 62 for the three formats. Another reason for this
> is speed.
>
> Peter

Speed is one reason to worry, but we also should think carefully about  
silently truncating the data w/o the user's knowledge.  One thing we  
don't want to propagate is loss of data w/o warning.

chris
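
For illustration, here is a minimal Python sketch of the three capping policies (a), (b) and (c) discussed above, using the score ranges and ASCII offsets from Peter's proposal. The names FASTQ_VARIANTS, encode_quality and on_overflow are hypothetical and chosen only for this example; this is not the actual Biopython, BioPerl or EMBOSS code.

```python
import warnings

# (ASCII offset, lowest score, highest score) for each variant, taken from
# the ranges proposed in the quoted message. Hypothetical table name.
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),     # PHRED 0..93  -> ASCII 33..126
    "fastq-illumina": (64, 0, 62),   # PHRED 0..62  -> ASCII 64..126
    "fastq-solexa": (64, -5, 62),    # Solexa -5..62 -> ASCII 59..126
}


def encode_quality(scores, variant, on_overflow="warn"):
    """Encode integer scores as a printable-ASCII FASTQ quality string.

    on_overflow selects what happens when a score exceeds the maximum:
      "silent" - option (a): cap at the maximum without comment
      "warn"   - option (b): cap at the maximum and issue one warning
      "error"  - option (c): raise ValueError
    """
    if on_overflow not in ("silent", "warn", "error"):
        raise ValueError("unknown on_overflow policy: %r" % on_overflow)
    offset, low, high = FASTQ_VARIANTS[variant]
    chars = []
    capped = 0
    for score in scores:
        if score < low:
            raise ValueError("score %d is below the minimum %d" % (score, low))
        if score > high:
            if on_overflow == "error":
                raise ValueError("score %d exceeds the maximum %d" % (score, high))
            capped += 1
            score = high
        chars.append(chr(score + offset))
    # A single warning per call keeps the overhead small on large files.
    if capped and on_overflow == "warn":
        warnings.warn("%d score(s) capped at %d for %s" % (capped, high, variant))
    return "".join(chars)
```

Counting the capped scores and warning once per record, rather than once per base, is one possible way to keep option (b) from costing much more than option (a).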

