[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Peter biopython at maubp.freeserve.co.uk
Thu Jul 30 06:18:26 EDT 2009


On Wed, Jul 29, 2009 at 11:15 AM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> This is a follow up to the earlier discussion about high quality scores
> in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
> ASCII codes (which can occur if converting from Sanger FASTQ).
>
> ...
>
> Peter Rice and I have been talking about this off list, and have
> a proposal for the high score problem. Basically we want to
> restrict FASTQ quality strings to printable ASCII, which means
> 126 (0x7e) is a firm upper limit, while otherwise allowing scores
> as high as possible. This limit comes from ASCII 127 being
> "delete", and the even higher characters also being non-printable.
>
> i.e. We are suggesting:
>
> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>
> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
> (or in hex, 0x40 to 0x68). It is a reasonable and well defined
> extension to permit PHRED scores from 0 to 62 inclusive, which
> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
> non printing characters, and gives some head room for improved
> sequencing technology from Illumina giving higher raw scores.
>
> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
> 40, again mapped with an ASCII offset of 64 giving ASCII characters
> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
> defined extension would permit Solexa scores in the range -5 to 62
> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).

The latest version of Biopython in our repository now follows this,
avoiding any non-printing characters (which should trigger an error
on parsing).
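
To make the three proposed ranges concrete, here is a minimal Python
sketch of the offset arithmetic. It is an illustration only, not the
actual Biopython code: the names FASTQ_VARIANTS, decode_quality and
encode_quality are made up for this example, and the limits are the
ones proposed in the quoted text.

```python
# Hypothetical lookup table (not a Biopython API):
# variant name -> (ASCII offset, lowest score, highest score)
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),    # PHRED 0..93   -> ASCII 33..126
    "fastq-illumina": (64, 0, 62),  # PHRED 0..62   -> ASCII 64..126
    "fastq-solexa": (64, -5, 62),   # Solexa -5..62 -> ASCII 59..126
}


def decode_quality(qual_string, variant):
    """Turn a FASTQ quality string into a list of integer scores."""
    offset, low, high = FASTQ_VARIANTS[variant]
    scores = []
    for char in qual_string:
        score = ord(char) - offset
        if not low <= score <= high:
            # Covers non-printing characters as well as characters that
            # fall below the variant's lowest permitted score.
            raise ValueError(
                "Character %r is outside the %s range" % (char, variant)
            )
        scores.append(score)
    return scores


def encode_quality(scores, variant):
    """Turn a list of integer scores into a FASTQ quality string."""
    offset, low, high = FASTQ_VARIANTS[variant]
    for score in scores:
        if not low <= score <= high:
            raise ValueError(
                "Score %r is outside the %s range" % (score, variant)
            )
    return "".join(chr(score + offset) for score in scores)
```

For example, decode_quality("!~", "fastq-sanger") gives [0, 93], and
encode_quality([40], "fastq-illumina") gives "h" (ASCII 104).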

> If in the process of converting between formats, a quality score
> is too high (it would result in ASCII 127 or higher), then I would
> argue any of the following would be acceptable:
> (a) Silently impose the maximum score (ASCII 126, 0x7e)
> (b) Impose the maximum score with a warning
> (c) Raise an error
>
> I don't think EMBOSS, BioPerl and Biopython have to handle
> this exactly the same way, but I would favour (b) then (a).

The EMBOSS patch from Peter Rice that I was testing went for silent
truncation. In Biopython we have also, for the moment, gone for
silently imposing the maximum score (ASCII 126, 0x7e), i.e. scores
of 93, 62 and 62 for the three formats. Another reason for this
is speed.
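
The three options (a), (b) and (c) could be sketched roughly as below.
This is only an illustration under my own assumptions, not what
Biopython or EMBOSS actually do internally: MAX_SCORES, cap_scores and
the on_overflow argument are hypothetical names for this example.

```python
import warnings

# Hypothetical table of the proposed maximum scores; each one maps to
# ASCII 126 (0x7e) under the variant's own offset.
MAX_SCORES = {"fastq-sanger": 93, "fastq-illumina": 62, "fastq-solexa": 62}


def cap_scores(scores, variant, on_overflow="clamp"):
    """Apply the upper limit for a variant to a list of scores.

    on_overflow selects the behaviour when a score is too high:
      "clamp" - (a) silently impose the maximum score
      "warn"  - (b) impose the maximum score with a warning
      "error" - (c) raise an error
    """
    if on_overflow not in ("clamp", "warn", "error"):
        raise ValueError("Unknown on_overflow mode: %r" % (on_overflow,))
    limit = MAX_SCORES[variant]
    if all(score <= limit for score in scores):
        return list(scores)  # nothing to do
    if on_overflow == "error":
        raise ValueError("Score above %i for %s" % (limit, variant))
    if on_overflow == "warn":
        warnings.warn("Scores above %i capped for %s" % (limit, variant))
    return [min(score, limit) for score in scores]
```

For example, cap_scores([10, 70, 93], "fastq-illumina") returns
[10, 62, 62] with no message, since "clamp" is the default here.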

Peter

