[Biopython] internal function to convert illumina quality scores to phred

Tue Feb 1 16:16:18 UTC 2011

On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Alan and Peter;
> Alan, nice suggestions on conversion from phred. On the barcode
> sorting side there was just some discussion of this on the
> development list; I have a script that does barcode sorting
> and trimming with mismatches using Biopython:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> It does not use qualities, but this might be a framework you could
> build off to add that support.
>
> Peter, how hard do you think it would be to have SeqIO only convert
> from the fastq encoding to phred scores on demand? Most of the time
> when dealing with fastq I do not need any conversion at all and use
> the FastqGeneralIterator to just pull out the name, sequence and
> quality.
>
> You've done a lot of nice work with the correct conversions and it
> would be great to expose that directly through on-demand conversion
> as Alan is suggesting. Ideally you would use SeqIO as normal with
> fastq files, but the quality score would not be converted to solexa
> during parsing, only when letter_annotations["solexa_quality"] was
> accessed.

I actually implemented a proof of concept that does that. In order
to not alter the SeqRecord behaviour, it was a new object which
acted like a list of integers in many respects. The data is held
as a FASTQ encoded string, and decoded (and then cached) on
demand only. On output if it was already in the right encoding
the string could be used as is, otherwise the conversion could
be done very quickly with a precomputed table and the string
translate() method (without having to go via a list of integers).
It seemed to work, but I wasn't convinced about the benefits
(given the complexity). I'd really want some real world FASTQ
benchmarks to try it on... something you might have in the form
of your scripts and the real data they were written for?
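
For illustration only, the lazy-decoding idea described above could be
sketched roughly as follows. This is not the proof-of-concept code itself
(which was not posted); the names LazyQuality and make_offset_table are
hypothetical, and the sketch only covers PHRED-scaled encodings that
differ by ASCII offset (for example Illumina 1.3+ at offset 64 and Sanger
at offset 33).

```python
# Hypothetical sketch: these names are illustrative, not Biopython API.

_TABLES = {}  # cache of translate() tables, keyed by (src_offset, dst_offset)


def make_offset_table(src_offset, dst_offset):
    """Return a str.translate() table that shifts quality characters.

    Only meaningful between PHRED-scaled encodings. Characters whose
    shifted value would fall outside printable ASCII (33..126) are left
    out of the table, so translate() would pass them through unchanged.
    """
    key = (src_offset, dst_offset)
    if key not in _TABLES:
        shift = dst_offset - src_offset
        _TABLES[key] = {
            code: code + shift
            for code in range(src_offset, 127)
            if 33 <= code + shift <= 126
        }
    return _TABLES[key]


class LazyQuality:
    """List-like view of a FASTQ quality string, decoded on first access."""

    def __init__(self, encoded, offset=33):
        self._encoded = encoded
        self._offset = offset
        self._decoded = None  # filled in (and cached) on demand

    def _scores(self):
        if self._decoded is None:
            self._decoded = [ord(c) - self._offset for c in self._encoded]
        return self._decoded

    def __len__(self):
        # The length is known without decoding anything.
        return len(self._encoded)

    def __getitem__(self, index):
        return self._scores()[index]

    def __iter__(self):
        return iter(self._scores())

    def as_string(self, offset=33):
        """Return the qualities as a FASTQ string using the given offset."""
        if offset == self._offset:
            return self._encoded  # already in the right encoding
        # Re-encode directly, without going via a list of integers.
        return self._encoded.translate(make_offset_table(self._offset, offset))
```

With this sketch, LazyQuality("hhhh", offset=64) decodes to four scores
of 40 only when indexed or iterated, and as_string(33) re-encodes the
string with translate() alone.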

I'm pretty sure this code is in a local git branch on one of my
machines (probably at home), but I don't think I pushed it to
github. I should do that...

> Another option would just be to expose a function so folks
> could do:
>
> convert_fastq_illumina_to_quality(illumina_encoded_string)
>
> to get the phred quality scores for a string they were interested
> in. This way you could use FastqGeneralIterator for no
> SeqRecord/Seq overhead, but still make use of your
> conversion work.

Yeah, three or four helper functions for the three decodings
would be sensible. It looks like there is demand for it then...
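
As a rough sketch of what such helpers might look like (these are not
existing Biopython functions; convert_fastq_illumina_to_quality is the
name proposed in the quoted text, and the other names are made up to
match it), assuming Sanger FASTQ uses PHRED scores at ASCII offset 33,
Illumina 1.3+ uses PHRED scores at offset 64, and Solexa uses Solexa
scores at offset 64:

```python
import math

# Hypothetical helpers, written in plain Python for illustration.
SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64
SOLEXA_OFFSET = 64


def convert_fastq_sanger_to_quality(encoded):
    """Decode a Sanger FASTQ quality string to a list of PHRED scores."""
    return [ord(c) - SANGER_OFFSET for c in encoded]


def convert_fastq_illumina_to_quality(encoded):
    """Decode an Illumina 1.3+ quality string to a list of PHRED scores."""
    return [ord(c) - ILLUMINA_OFFSET for c in encoded]


def convert_fastq_solexa_to_solexa_quality(encoded):
    """Decode a Solexa quality string to a list of Solexa scores."""
    return [ord(c) - SOLEXA_OFFSET for c in encoded]


def solexa_to_phred(solexa_score):
    """Convert one Solexa score to a PHRED score (as a float)."""
    return 10 * math.log10(10 ** (solexa_score / 10.0) + 1)
```

These take just the quality string, so they could be applied to the
third item of each tuple from FastqGeneralIterator without building any
SeqRecord objects. For single Solexa scores, Bio.SeqIO.QualityIO already
has phred_quality_from_solexa.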

Peter