[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Mon Aug 24 10:18:20 EDT 2009

On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
> I added this (and the others) to our ticket tracking this.  Looks like
> solexa conversion either way is borked, which is very likely an issue
> with conversion.

Hi Chris,

I've been digging into the current SVN code for BioPerl's FASTQ
support - I realised you are doing the Solexa to PHRED mapping
twice when parsing "fastq-solexa" files. Using "qual" output (which
shows the PHRED scores in plain text) makes it very clear
something is wrong:

$ cat solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

That is Solexa scores from 40 (h) down to -5 (;), which should
map onto PHRED scores from 40 down to 1 (according to our
prior discussions).

$ ./bioperl_solexa2qual.pl < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4

For reference,

$ python biopython_solexa2qual.py < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1 1

I can "fix" this in fastq.pm by commenting out one of the log mappings,
for example see the patch I've just uploaded to Bug 2857:
http://bugzilla.open-bio.org/show_bug.cgi?id=2857

That brings me to another problem, consider the following (with the
double conversion fixed):

$ ./bioperl_solexa2solexa.pl < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJJHGFEDDBB@@>><<

If you compare that to the original, you'll notice a loss of detail in
the poor quality reads. e.g. Solexa scores 9 (I) and 10 (J) have
both been mapped onto 10 (J).

I believe this happens because BioPerl is converting the Solexa
scores to PHRED scores on loading (which is fine - EMBOSS
does this too), but you are also storing them as integers! In order
to preserve these details, I think you'll have to hold the converted
PHRED scores as floating point numbers (which I think is what
EMBOSS does). This has the downside of taking more memory,
and may also complicate file output (you may need to round things).

Regards,

Peter
(@Biopython)