[Biopython] internal function to convert illumina quality scores to phred

Peter biopython at maubp.freeserve.co.uk
Tue Feb 1 20:39:30 UTC 2011


On Tue, Feb 1, 2011 at 4:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>
>> Peter, how hard do you think it would be to have SeqIO only convert
>> from the fastq encoding to phred scores on demand? Most of the time
>> when dealing with fastq I do not need any conversion at all and use
>> the FastqGeneralIterator to just pull out the name, sequence and
>> quality.
>>
>> You've done a lot of nice work with the correct conversions and it
>> would be great to expose that directly through on-demand conversion
>> as Alan is suggesting. Ideally you would use SeqIO as normal with
>> fastq files, but the quality score would not be converted to solexa
>> during parsing, only when letter_annotations["solexa_quality"] was
>> accessed.
>
> I actually implemented a proof of concept that does that. In order
> to not alter the SeqRecord behaviour, it was a new object which
> acted like a list of integers in many respects. The data is held
> as a FASTQ encoded string, and decoded (and then cached) on
> demand only. On output if it was already in the right encoding
> the string could be used as is, otherwise the conversion could
> be done very quickly with a precomputed table and the string
> translate() method (without having to go via a list of integers).
> It seemed to work, but I wasn't convinced about the benefits
> (given the complexity). I'd really want some real world FASTQ
> benchmarks to try it on... something you might have in the form
> of your scripts and the real data they were written for?
>
> I'm pretty sure this code is in a local git branch on one of my
> machines (probably at home), but I don't think I pushed it to
> github. I should do that...

Found it and pushed it:
https://github.com/peterjc/biopython/tree/fastq-tricks
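
For anyone who does not want to dig through the branch, the idea is
roughly as follows. This is only a minimal sketch and not the code in
the branch: the class name LazyQuality, the method as_sanger_string()
and the table ILLUMINA_TO_SANGER are made up for illustration. It
assumes Python 3 and Illumina 1.3+ style PHRED scores (ASCII offset 64);
true Solexa scores would need the logarithmic mapping instead of a plain
shift.

```python
SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64

# Precomputed table: Illumina 1.3+ character -> Sanger character.
# PHRED 0..62 covers ASCII 64..126 on the Illumina side.
ILLUMINA_TO_SANGER = str.maketrans(
    {chr(q + ILLUMINA_OFFSET): chr(q + SANGER_OFFSET) for q in range(63)}
)


class LazyQuality:
    """Hypothetical list-like view of PHRED scores.

    The data is held as the FASTQ encoded string and only decoded
    to integers (and then cached) when someone asks for them.
    """

    def __init__(self, encoded, offset=SANGER_OFFSET):
        self._encoded = encoded
        self._offset = offset
        self._decoded = None  # filled in on first access

    def _scores(self):
        if self._decoded is None:
            self._decoded = [ord(c) - self._offset for c in self._encoded]
        return self._decoded

    def __len__(self):
        # The length is known without decoding anything.
        return len(self._encoded)

    def __getitem__(self, index):
        return self._scores()[index]

    def __iter__(self):
        return iter(self._scores())

    def as_sanger_string(self):
        """Return the quality string in Sanger encoding for output."""
        if self._offset == SANGER_OFFSET:
            # Already in the right encoding, use the string as is.
            return self._encoded
        # String to string conversion, no list of integers involved.
        # Characters missing from the table pass through unchanged.
        return self._encoded.translate(ILLUMINA_TO_SANGER)
```

If a script only copies or filters records, nothing is ever decoded; the
list of integers is built the first time it is indexed or iterated.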

Note there are unit test failures (e.g. as currently implemented
there is no range checking on the characters in the quality strings
at parse time). We may want to continue this on the dev mailing list...
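
To illustrate the kind of range check that is missing, here is one cheap
way it could be done with string methods alone. Again this is only a
sketch under my own assumptions (Python 3, Sanger encoding with
printable ASCII 33 to 126); the function name check_sanger_quality is
hypothetical and is not part of Biopython.

```python
# All characters allowed in a Sanger FASTQ quality string (ASCII 33..126).
VALID_SANGER = "".join(chr(i) for i in range(33, 127))

# Table that deletes every valid character, so anything left is invalid.
_DELETE_VALID = str.maketrans("", "", VALID_SANGER)


def check_sanger_quality(quality):
    """Raise ValueError if the string holds a character outside 33..126."""
    leftover = quality.translate(_DELETE_VALID)
    if leftover:
        raise ValueError(
            "Invalid character in quality string: %r" % leftover[0]
        )
```

This avoids a Python level loop over the characters, but it is still an
extra pass over every quality string at parse time.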

Peter


