[Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS
Peter
biopython at maubp.freeserve.co.uk
Fri Jul 24 15:12:57 UTC 2009
On Fri, Jul 24, 2009 at 2:53 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Jul 24, 2009 at 2:32 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>>
>>> Don't be surprised if there are still bugs lurking about, just let me
>>> know and I'll fix 'em.
>>
>> I've got a bug report coming up in a second email, but the basics work :)
>
> I think I have found a bug in BioPerl's conversion from fastq-solexa
> to fastq-sanger concerning lower quality scores.
Next up is an issue with BioPerl converting from Sanger to Illumina.
In principle this is simple - the quality strings both use PHRED scores
just with different offsets. With lower PHRED scores, everything is fine:
$ more sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
Again, this is an example constructed by hand to cover a broad
range of valid scores, and can be found in the Biopython
repository under biopython/Tests/Quality
$ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED
qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+Test PHRED qualities from 40 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
$ python biopython_sanger2illumina.py < sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
So, BioPerl and Biopython (and EMBOSS) agree - apart from
the repeating second title on the plus line. I understand that
EMBOSS will in future omit the repeated title on the plus line:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html
Now, here comes the problem. I believe FASTQ files directly
from an Illumina 1.3+ pipeline will have PHRED scores in the
range 0 to 40 (as in this example). However, much higher
PHRED scores are possible during assembly / contig'ing
and read mapping. For example, the tool MAQ will output
Sanger style FASTQ files with PHRED scores in the range
0 to 93 inclusive.
Now, in the Sanger FASTQ format, PHRED scores of 0 to
93 map onto ASCII values of 33 to 126 (! to ~). There is a
reason for stopping at 126, since ASCII 127 is "delete".
However, in the Illumina 1.3+ FASTQ format, PHRED
scores of 0 to 93 would map to ASCII values of 64 to
157, which includes a lot of non printing characters.
Working with such files at the command line or in an
editor is a big problem. Clearly, Illumina never intended
to include such high scores in their FASTQ files!
Nevertheless, it is possible to write a FASTQ format
following the Illumina 1.3+ encoding with these values.
Biopython and EMBOSS attempt to do this - although I
would regard throwing an error as equally acceptable.
So, here is another hand constructed example of a
Sanger style FASTQ file using the full quality range:
$ more sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
Again, this example is in the Biopython repository under
biopython/Tests/Quality
Just to check:
$ python biopython_sanger2qual.py < sanger_93.fastq
>Test PHRED qualities from 93 to 0 inclusive
93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74
73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54
53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34
33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
13 12 11 10 9 8 7 6 5 4 3 2 1 0
So, here we go - apologies for the expected line mangling:
$ seqret -filter -sformat fastq-sanger -osformat fastq-illumina <
sanger_93.fastq | hexdump -C -v
00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 0a 41 43 54 47 41 43 |GACTGACTG.ACTGAC|
00000070 54 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 |TGACTGACTGACTGAC|
00000080 54 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 54 65 |TGACTGACTGAN.+Te|
00000090 73 74 0a 9d 9c 9b 9a 99 98 97 96 95 94 93 92 91 |st..............|
000000a0 90 8f 8e 8d 8c 8b 8a 89 88 87 86 85 84 83 82 81 |................|
000000b0 80 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 |..~}|{zyxwvutsrq|
000000c0 70 6f 6e 6d 6c 6b 6a 69 68 67 66 65 64 63 62 0a |ponmlkjihgfedcb.|
000000d0 61 60 5f 5e 5d 5c 5b 5a 59 58 57 56 55 54 53 52 |a`_^]\[ZYXWVUTSR|
000000e0 51 50 4f 4e 4d 4c 4b 4a 49 48 47 46 45 44 43 42 |QPONMLKJIHGFEDCB|
000000f0 41 40 0a |A at .|
000000f3
$ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C
-v00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test
PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000070 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000080 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 0a 9d 9c |GACTGACTGAN.+...|
00000090 9b 9a 99 98 97 96 95 94 93 92 91 90 8f 8e 8d 8c |................|
000000a0 8b 8a 89 88 87 86 85 84 83 82 81 80 7f 7e 7d 7c |.............~}||
000000b0 7b 7a 79 78 77 76 75 74 73 72 71 70 6f 6e 6d 6c |{zyxwvutsrqponml|
000000c0 6b 6a 69 68 67 66 65 64 63 62 61 60 5f 5e 5d 5c |kjihgfedcba`_^]\|
000000d0 5b 5a 59 58 57 56 55 54 53 52 51 50 4f 4e 4d 4c |[ZYXWVUTSRQPONML|
000000e0 4b 4a 49 48 47 46 45 44 43 42 41 40 0a |KJIHGFEDCBA at .|
000000ed
Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree
on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to
64 in decimal, which after subtracting the Illumina offset of 64, gives
PHRED scores of 93 to 0 as desired.
Now to BioPerl,
$ perl bioperl_sanger2illumina.pl < sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test PHRED qualities from 93 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
$ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
...
BioPerl has output an invalid FASTQ file - it seems to omit the
quality scores for the top scoring nucleotides at the start. The
BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40
(in hex), giving 104 to 64 in decimal, giving PHRED values of
40 to 0. I think BioPerl should either throw an error, or output
the non printing characters as done by Biopython and EMBOSS.
Regards,
Peter C.
(@Biopython)
More information about the Biopython-dev
mailing list