[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Chris Fields cjfields at illinois.edu
Sat Jul 25 19:50:13 UTC 2009


On Jul 24, 2009, at 10:12 AM, Peter wrote:

> On Fri, Jul 24, 2009 at 2:53 PM,  
> Peter<biopython at maubp.freeserve.co.uk> wrote:
>> On Fri, Jul 24, 2009 at 2:32 PM, Peter<biopython at maubp.freeserve.co.uk 
>> > wrote:
>>>>
>>>> Don't be surprised if there are still bugs lurking about, just  
>>>> let me
>>>> know and I'll fix 'em.
>>>
>>> I've got a bug report coming up in a second email, but the basics  
>>> work :)
>>
>> I think I have found a bug in BioPerl's conversion from fastq-solexa
>> to fastq-sanger concerning lower quality scores.
>
> Next up is an issue with BioPerl converting from Sanger to Illumina.
> In principle this is simple - the quality strings both use PHRED  
> scores
> just with different offsets. With lower PHRED scores, everything is  
> fine:
>
> $ more sanger_faked.fastq
> @Test PHRED qualities from 40 to 0 inclusive
> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
> +
> IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
>
> Again, this is an example constructed by hand to cover a broad
> range of valid scores, and can be found in the Biopython
> repository under biopython/Tests/Quality
>
> $ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED
> qualities from 40 to 0 inclusive
> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
> +Test PHRED qualities from 40 to 0 inclusive
> hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
>
> $ python biopython_sanger2illumina.py < sanger_faked.fastq
> @Test PHRED qualities from 40 to 0 inclusive
> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
> +
> hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
>
> So, BioPerl and Biopython (and EMBOSS) agree - apart from
> the repeating second title on the plus line. I understand that
> EMBOSS will in future omit the repeated title on the plus line:
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html

We'll make it optional and default to no header.

> Now, here comes the problem. I believe FASTQ files directly
> from an Illumina 1.3+ pipeline will have PHRED scores in the
> range 0 to 40 (as in this example). However, much higher
> PHRED scores are possible during assembly / contig'ing
> and read mapping. For example, the tool MAQ will output
> Sanger style FASTQ files with PHRED scores in the range
> 0 to 93 inclusive.

Is this behavior documented anywhere, specifically by Illumina (that  
values can exceed 40)?  If Illumina 1.3 is specified as being PHRED  
0-40, and another (non-Illumina) software package pushes that limit  
above the specified range of Illumina values, I would consider that  
unfortunately yet another variant.

We can support it as Illumina 1.3, but my point is this may getting  
into a grey area and may be something that Illumina doesn't/wouldn't  
support.  Reminds me a little of the multiple GFF2 variations (one of  
the main reasons for a GFF3).

> Now, in the Sanger FASTQ format, PHRED scores of 0 to
> 93 map onto ASCII values of 33 to 126 (! to ~). There is a
> reason for stopping at 126, since ASCII 127 is "delete".
>
> However, in the Illumina 1.3+ FASTQ format, PHRED
> scores of 0 to 93 would map to ASCII values of 64 to
> 157, which includes a lot of non printing characters.
> Working with such files at the command line or in an
> editor is a big problem. Clearly, Illumina never intended
> to include such high scores in their FASTQ files!

Exactly.

> Nevertheless, it is possible to write a FASTQ format
> following the Illumina 1.3+ encoding with these values.
> Biopython and EMBOSS attempt to do this - although I
> would regard throwing an error as equally acceptable.
>
> So, here is another hand constructed example of a
> Sanger style FASTQ file using the full quality range:
>
> $ more sanger_93.fastq
> @Test PHRED qualities from 93 to 0 inclusive
> ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
> +
> ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;: 
> 9876543210/.-,+*)('&%$#"!
>
> Again, this example is in the Biopython repository under
> biopython/Tests/Quality
>
> Just to check:
>
> $ python biopython_sanger2qual.py < sanger_93.fastq
>> Test PHRED qualities from 93 to 0 inclusive
> 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74
> 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54
> 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34
> 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
> 13 12 11 10 9 8 7 6 5 4 3 2 1 0
>
> So, here we go - apologies for the expected line mangling:
>
> $ seqret -filter -sformat fastq-sanger -osformat fastq-illumina <
> sanger_93.fastq | hexdump -C -v
> 00000000  40 54 65 73 74 20 50 48  52 45 44 20 71 75 61 6c  |@Test  
> PHRED qual|
> 00000010  69 74 69 65 73 20 66 72  6f 6d 20 39 33 20 74 6f  |ities  
> from 93 to|
> 00000020  20 30 20 69 6e 63 6c 75  73 69 76 65 0a 41 43 54  | 0  
> inclusive.ACT|
> 00000030  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000040  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000050  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000060  47 41 43 54 47 41 43 54  47 0a 41 43 54 47 41 43  | 
> GACTGACTG.ACTGAC|
> 00000070  54 47 41 43 54 47 41 43  54 47 41 43 54 47 41 43  | 
> TGACTGACTGACTGAC|
> 00000080  54 47 41 43 54 47 41 43  54 47 41 4e 0a 2b 54 65  | 
> TGACTGACTGAN.+Te|
> 00000090  73 74 0a 9d 9c 9b 9a 99  98 97 96 95 94 93 92 91  | 
> st..............|
> 000000a0  90 8f 8e 8d 8c 8b 8a 89  88 87 86 85 84 83 82 81   
> |................|
> 000000b0  80 7f 7e 7d 7c 7b 7a 79  78 77 76 75 74 73 72 71  |..~}| 
> {zyxwvutsrq|
> 000000c0  70 6f 6e 6d 6c 6b 6a 69  68 67 66 65 64 63 62 0a  | 
> ponmlkjihgfedcb.|
> 000000d0  61 60 5f 5e 5d 5c 5b 5a  59 58 57 56 55 54 53 52  |a`_^]\ 
> [ZYXWVUTSR|
> 000000e0  51 50 4f 4e 4d 4c 4b 4a  49 48 47 46 45 44 43 42  | 
> QPONMLKJIHGFEDCB|
> 000000f0  41 40 0a                                          |A at .|
> 000000f3
>
> $ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C
> -v00000000  40 54 65 73 74 20 50 48  52 45 44 20 71 75 61 6c  |@Test
> PHRED qual|
> 00000010  69 74 69 65 73 20 66 72  6f 6d 20 39 33 20 74 6f  |ities  
> from 93 to|
> 00000020  20 30 20 69 6e 63 6c 75  73 69 76 65 0a 41 43 54  | 0  
> inclusive.ACT|
> 00000030  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000040  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000050  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000060  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000070  47 41 43 54 47 41 43 54  47 41 43 54 47 41 43 54  | 
> GACTGACTGACTGACT|
> 00000080  47 41 43 54 47 41 43 54  47 41 4e 0a 2b 0a 9d 9c  | 
> GACTGACTGAN.+...|
> 00000090  9b 9a 99 98 97 96 95 94  93 92 91 90 8f 8e 8d 8c   
> |................|
> 000000a0  8b 8a 89 88 87 86 85 84  83 82 81 80 7f 7e 7d 7c   
> |.............~}||
> 000000b0  7b 7a 79 78 77 76 75 74  73 72 71 70 6f 6e 6d 6c  | 
> {zyxwvutsrqponml|
> 000000c0  6b 6a 69 68 67 66 65 64  63 62 61 60 5f 5e 5d 5c  | 
> kjihgfedcba`_^]\|
> 000000d0  5b 5a 59 58 57 56 55 54  53 52 51 50 4f 4e 4d 4c  | 
> [ZYXWVUTSRQPONML|
> 000000e0  4b 4a 49 48 47 46 45 44  43 42 41 40 0a           | 
> KJIHGFEDCBA at .|
> 000000ed
>
> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree
> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to
> 64 in decimal, which after subtracting the Illumina offset of 64,  
> gives
> PHRED scores of 93 to 0 as desired.
>
> Now to BioPerl,
>
> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq
> @Test PHRED qualities from 93 to 0 inclusive
> ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
> +Test PHRED qualities from 93 to 0 inclusive
> hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
>
> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
> ...
>
> BioPerl has output an invalid FASTQ file - it seems to omit the
> quality scores for the top scoring nucleotides at the start. The
> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40
> (in hex), giving 104 to 64 in decimal, giving PHRED values of
> 40 to 0. I think BioPerl should either throw an error, or output
> the non printing characters as done by Biopython and EMBOSS.
>
> Regards,
>
> Peter C.
> (@Biopython)


If this is accepted as common practice between BioPython and EMBOSS we  
will follow similarly.  I do think it's worth at least a warning for  
the reasons outlined above (e.g. it likely isn't Illumina's intent to  
support qual values outside the specified range).  Might be worth  
checking into.

 From this it could be summarized that converting to sanger format is  
least problematic, as possible issues may be encountered when  
converting to the other variants.  We'll need to fix the solexa  
quality calculations in the BioPerl parser as noted in your previous  
post; I'll work on that.

chris




More information about the Bioperl-l mailing list