[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
Chris Fields
cjfields at illinois.edu
Sat Jul 25 19:50:13 UTC 2009
On Jul 24, 2009, at 10:12 AM, Peter wrote:
> On Fri, Jul 24, 2009 at 2:53 PM,
> Peter<biopython at maubp.freeserve.co.uk> wrote:
>> On Fri, Jul 24, 2009 at 2:32 PM, Peter<biopython at maubp.freeserve.co.uk
>> > wrote:
>>>>
>>>> Don't be surprised if there are still bugs lurking about, just
>>>> let me
>>>> know and I'll fix 'em.
>>>
>>> I've got a bug report coming up in a second email, but the basics
>>> work :)
>>
>> I think I have found a bug in BioPerl's conversion from fastq-solexa
>> to fastq-sanger concerning lower quality scores.
>
> Next up is an issue with BioPerl converting from Sanger to Illumina.
> In principle this is simple - the quality strings both use PHRED
> scores
> just with different offsets. With lower PHRED scores, everything is
> fine:
>
> $ more sanger_faked.fastq
> @Test PHRED qualities from 40 to 0 inclusive
> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
> +
> IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
>
> Again, this is an example constructed by hand to cover a broad
> range of valid scores, and can be found in the Biopython
> repository under biopython/Tests/Quality
>
> $ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED
> qualities from 40 to 0 inclusive
> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
> +Test PHRED qualities from 40 to 0 inclusive
> hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
>
> $ python biopython_sanger2illumina.py < sanger_faked.fastq
> @Test PHRED qualities from 40 to 0 inclusive
> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
> +
> hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
>
> So, BioPerl and Biopython (and EMBOSS) agree - apart from
> the repeating second title on the plus line. I understand that
> EMBOSS will in future omit the repeated title on the plus line:
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html
We'll make it optional and default to no header.
> Now, here comes the problem. I believe FASTQ files directly
> from an Illumina 1.3+ pipeline will have PHRED scores in the
> range 0 to 40 (as in this example). However, much higher
> PHRED scores are possible during assembly / contig'ing
> and read mapping. For example, the tool MAQ will output
> Sanger style FASTQ files with PHRED scores in the range
> 0 to 93 inclusive.
Is this behavior documented anywhere, specifically by Illumina (that
values can exceed 40)? If Illumina 1.3 is specified as being PHRED
0-40, and another (non-Illumina) software package pushes that limit
above the specified range of Illumina values, I would consider that
unfortunately yet another variant.
We can support it as Illumina 1.3, but my point is this may getting
into a grey area and may be something that Illumina doesn't/wouldn't
support. Reminds me a little of the multiple GFF2 variations (one of
the main reasons for a GFF3).
> Now, in the Sanger FASTQ format, PHRED scores of 0 to
> 93 map onto ASCII values of 33 to 126 (! to ~). There is a
> reason for stopping at 126, since ASCII 127 is "delete".
>
> However, in the Illumina 1.3+ FASTQ format, PHRED
> scores of 0 to 93 would map to ASCII values of 64 to
> 157, which includes a lot of non printing characters.
> Working with such files at the command line or in an
> editor is a big problem. Clearly, Illumina never intended
> to include such high scores in their FASTQ files!
Exactly.
> Nevertheless, it is possible to write a FASTQ format
> following the Illumina 1.3+ encoding with these values.
> Biopython and EMBOSS attempt to do this - although I
> would regard throwing an error as equally acceptable.
>
> So, here is another hand constructed example of a
> Sanger style FASTQ file using the full quality range:
>
> $ more sanger_93.fastq
> @Test PHRED qualities from 93 to 0 inclusive
> ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
> +
> ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:
> 9876543210/.-,+*)('&%$#"!
>
> Again, this example is in the Biopython repository under
> biopython/Tests/Quality
>
> Just to check:
>
> $ python biopython_sanger2qual.py < sanger_93.fastq
>> Test PHRED qualities from 93 to 0 inclusive
> 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74
> 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54
> 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34
> 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
> 13 12 11 10 9 8 7 6 5 4 3 2 1 0
>
> So, here we go - apologies for the expected line mangling:
>
> $ seqret -filter -sformat fastq-sanger -osformat fastq-illumina <
> sanger_93.fastq | hexdump -C -v
> 00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test
> PHRED qual|
> 00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities
> from 93 to|
> 00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0
> inclusive.ACT|
> 00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000060 47 41 43 54 47 41 43 54 47 0a 41 43 54 47 41 43 |
> GACTGACTG.ACTGAC|
> 00000070 54 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 |
> TGACTGACTGACTGAC|
> 00000080 54 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 54 65 |
> TGACTGACTGAN.+Te|
> 00000090 73 74 0a 9d 9c 9b 9a 99 98 97 96 95 94 93 92 91 |
> st..............|
> 000000a0 90 8f 8e 8d 8c 8b 8a 89 88 87 86 85 84 83 82 81
> |................|
> 000000b0 80 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 |..~}|
> {zyxwvutsrq|
> 000000c0 70 6f 6e 6d 6c 6b 6a 69 68 67 66 65 64 63 62 0a |
> ponmlkjihgfedcb.|
> 000000d0 61 60 5f 5e 5d 5c 5b 5a 59 58 57 56 55 54 53 52 |a`_^]\
> [ZYXWVUTSR|
> 000000e0 51 50 4f 4e 4d 4c 4b 4a 49 48 47 46 45 44 43 42 |
> QPONMLKJIHGFEDCB|
> 000000f0 41 40 0a |A at .|
> 000000f3
>
> $ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C
> -v00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test
> PHRED qual|
> 00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities
> from 93 to|
> 00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0
> inclusive.ACT|
> 00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000060 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000070 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |
> GACTGACTGACTGACT|
> 00000080 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 0a 9d 9c |
> GACTGACTGAN.+...|
> 00000090 9b 9a 99 98 97 96 95 94 93 92 91 90 8f 8e 8d 8c
> |................|
> 000000a0 8b 8a 89 88 87 86 85 84 83 82 81 80 7f 7e 7d 7c
> |.............~}||
> 000000b0 7b 7a 79 78 77 76 75 74 73 72 71 70 6f 6e 6d 6c |
> {zyxwvutsrqponml|
> 000000c0 6b 6a 69 68 67 66 65 64 63 62 61 60 5f 5e 5d 5c |
> kjihgfedcba`_^]\|
> 000000d0 5b 5a 59 58 57 56 55 54 53 52 51 50 4f 4e 4d 4c |
> [ZYXWVUTSRQPONML|
> 000000e0 4b 4a 49 48 47 46 45 44 43 42 41 40 0a |
> KJIHGFEDCBA at .|
> 000000ed
>
> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree
> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to
> 64 in decimal, which after subtracting the Illumina offset of 64,
> gives
> PHRED scores of 93 to 0 as desired.
>
> Now to BioPerl,
>
> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq
> @Test PHRED qualities from 93 to 0 inclusive
> ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
> +Test PHRED qualities from 93 to 0 inclusive
> hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
>
> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
> ...
>
> BioPerl has output an invalid FASTQ file - it seems to omit the
> quality scores for the top scoring nucleotides at the start. The
> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40
> (in hex), giving 104 to 64 in decimal, giving PHRED values of
> 40 to 0. I think BioPerl should either throw an error, or output
> the non printing characters as done by Biopython and EMBOSS.
>
> Regards,
>
> Peter C.
> (@Biopython)
If this is accepted as common practice between BioPython and EMBOSS we
will follow similarly. I do think it's worth at least a warning for
the reasons outlined above (e.g. it likely isn't Illumina's intent to
support qual values outside the specified range). Might be worth
checking into.
From this it could be summarized that converting to sanger format is
least problematic, as possible issues may be encountered when
converting to the other variants. We'll need to fix the solexa
quality calculations in the BioPerl parser as noted in your previous
post; I'll work on that.
chris
More information about the Bioperl-l
mailing list