[Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS

Fri Jul 24 15:12:57 UTC 2009

On Fri, Jul 24, 2009 at 2:53 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Jul 24, 2009 at 2:32 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>>
>>> Don't be surprised if there are still bugs lurking about, just let me
>>> know and I'll fix 'em.
>>
>> I've got a bug report coming up in a second email, but the basics work :)
>
> I think I have found a bug in BioPerl's conversion from fastq-solexa
> to fastq-sanger concerning lower quality scores.

Next up is an issue with BioPerl converting from Sanger to Illumina.
In principle this is simple - the quality strings both use PHRED scores
just with different offsets. With lower PHRED scores, everything is fine:

$ more sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

Again, this is an example constructed by hand to cover a broad
range of valid scores, and can be found in the Biopython
repository under biopython/Tests/Quality

$ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED
qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+Test PHRED qualities from 40 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

$ python biopython_sanger2illumina.py < sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

So, BioPerl and Biopython (and EMBOSS) agree - apart from
the repeating second title on the plus line. I understand that
EMBOSS will in future omit the repeated title on the plus line:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html

Now, here comes the problem. I believe FASTQ files directly
from an Illumina 1.3+ pipeline will have PHRED scores in the
range 0 to 40 (as in this example). However, much higher
PHRED scores are possible during assembly / contig'ing
and read mapping. For example, the tool MAQ will output
Sanger style FASTQ files with PHRED scores in the range
0 to 93 inclusive.

Now, in the Sanger FASTQ format, PHRED scores of 0 to
93 map onto ASCII values of 33 to 126 (! to ~). There is a
reason for stopping at 126, since ASCII 127 is "delete".

However, in the Illumina 1.3+ FASTQ format, PHRED
scores of 0 to 93 would map to ASCII values of 64 to
157, which includes a lot of non printing characters.
Working with such files at the command line or in an
editor is a big problem. Clearly, Illumina never intended
to include such high scores in their FASTQ files!

Nevertheless, it is possible to write a FASTQ format
following the Illumina 1.3+ encoding with these values.
Biopython and EMBOSS attempt to do this - although I
would regard throwing an error as equally acceptable.

So, here is another hand constructed example of a
Sanger style FASTQ file using the full quality range:

$ more sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

Again, this example is in the Biopython repository under
biopython/Tests/Quality

Just to check:

$ python biopython_sanger2qual.py < sanger_93.fastq
>Test PHRED qualities from 93 to 0 inclusive
93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74
73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54
53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34
33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
13 12 11 10 9 8 7 6 5 4 3 2 1 0

So, here we go - apologies for the expected line mangling:

$ seqret -filter -sformat fastq-sanger -osformat fastq-illumina <
sanger_93.fastq | hexdump -C -v
00000000  40 54 65 73 74 20 50 48  52 45 44 20 71 75 61 6c  |@Test PHRED qual|
00000010  69 74 69 65 73 20 66 72  6f 6d 20 39 33 20 74 6f  |ities from 93 to|
00000020  20 30 20 69 6e 63 6c 75  73 69 76 65 0a 41 43 54  | 0 inclusive.ACT|
00000030  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000040  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000050  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000060  47 41 43 54 47 41 43 54  47 0a 41 43 54 47 41 43  |GACTGACTG.ACTGAC|
00000070  54 47 41 43 54 47 41 43  54 47 41 43 54 47 41 43  |TGACTGACTGACTGAC|
00000080  54 47 41 43 54 47 41 43  54 47 41 4e 0a 2b 54 65  |TGACTGACTGAN.+Te|
00000090  73 74 0a 9d 9c 9b 9a 99  98 97 96 95 94 93 92 91  |st..............|
000000a0  90 8f 8e 8d 8c 8b 8a 89  88 87 86 85 84 83 82 81  |................|
000000b0  80 7f 7e 7d 7c 7b 7a 79  78 77 76 75 74 73 72 71  |..~}|{zyxwvutsrq|
000000c0  70 6f 6e 6d 6c 6b 6a 69  68 67 66 65 64 63 62 0a  |ponmlkjihgfedcb.|
000000d0  61 60 5f 5e 5d 5c 5b 5a  59 58 57 56 55 54 53 52  |a`_^]\[ZYXWVUTSR|
000000e0  51 50 4f 4e 4d 4c 4b 4a  49 48 47 46 45 44 43 42  |QPONMLKJIHGFEDCB|
000000f0  41 40 0a                                          |A at .|
000000f3

$ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C
-v00000000  40 54 65 73 74 20 50 48  52 45 44 20 71 75 61 6c  |@Test
PHRED qual|
00000010  69 74 69 65 73 20 66 72  6f 6d 20 39 33 20 74 6f  |ities from 93 to|
00000020  20 30 20 69 6e 63 6c 75  73 69 76 65 0a 41 43 54  | 0 inclusive.ACT|
00000030  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000040  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000050  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000060  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000070  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  |GACTGACTGACTGACT|
00000080  47 41 43 54 47 41 43 54  47 41 4e 0a 2b 0a 9d 9c  |GACTGACTGAN.+...|
00000090  9b 9a 99 98 97 96 95 94  93 92 91 90 8f 8e 8d 8c  |................|
000000a0  8b 8a 89 88 87 86 85 84  83 82 81 80 7f 7e 7d 7c  |.............~}||
000000b0  7b 7a 79 78 77 76 75 74  73 72 71 70 6f 6e 6d 6c  |{zyxwvutsrqponml|
000000c0  6b 6a 69 68 67 66 65 64  63 62 61 60 5f 5e 5d 5c  |kjihgfedcba`_^]\|
000000d0  5b 5a 59 58 57 56 55 54  53 52 51 50 4f 4e 4d 4c  |[ZYXWVUTSRQPONML|
000000e0  4b 4a 49 48 47 46 45 44  43 42 41 40 0a           |KJIHGFEDCBA at .|
000000ed

Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree
on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to
64 in decimal, which after subtracting the Illumina offset of 64, gives
PHRED scores of 93 to 0 as desired.

Now to BioPerl,

$ perl bioperl_sanger2illumina.pl < sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test PHRED qualities from 93 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

$ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
...

BioPerl has output an invalid FASTQ file - it seems to omit the
quality scores for the top scoring nucleotides at the start. The
BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40
(in hex), giving 104 to 64 in decimal, giving PHRED values of
40 to 0. I think BioPerl should either throw an error, or output
the non printing characters as done by Biopython and EMBOSS.

Regards,

Peter C.
(@Biopython)