[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Fri Jul 31 10:04:41 EDT 2009

Aaron Mackey wrote:
>>> I would strongly warn against truncation, for any reason.  Use the
>>> formulas you have for quality-encoding conversions, but do not
>>> assume that you know more than I do about what my data contains,
>>> or that you are in any way helping me by altering my data, silently or
>>> otherwise.  Said another way, feel free to warn me that my data may
>>> contain garbage, and utterly fail to convert it for me, but do not try
>>> to fix it for me.

http://lists.open-bio.org/pipermail/open-bio-l/2009-July/000520.html
Earlier I wrote:
>>>> If in the process of converting between formats, a quality score
>>>> is too high (it would result in ASCII 127 or higher), then I would
>>>> argue any of the following would be acceptable:
>>>> (a) Silently impose the maximum score (ASCII 126, 0x7e)
>>>> (b) Impose the maximum score with a warning
>>>> (c) Raise an error
>>>>
>>>> I don't think EMBOSS, BioPerl and Biopython have to handle
>>>> this exactly the same way, but I would favour (b) then (a).

Aaron, are you saying you support raising an error (option c), or
truncation with a warning (option b), but are against a silent score
truncation (option a)?

The problem with just raising an error (option c) is that it prevents
a valid operation (conversion with truncation).
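
To make the three options concrete, here is a rough Python sketch. It is
only an illustration, not the code in EMBOSS, BioPerl or Biopython: the
function name encode_quality and its "policy" argument are made up for
this example, and it only checks the upper limit (ASCII 126).

```python
import warnings

MAX_ASCII = 126  # highest printable ASCII character, "~" (0x7e)


def encode_quality(score, offset=33, policy="warn"):
    """Return the FASTQ character for one integer quality score.

    Hypothetical helper for illustration only. policy selects one of
    the options discussed above:
      "silent" - (a) impose the maximum score without comment
      "warn"   - (b) impose the maximum score and issue a warning
      "error"  - (c) refuse to convert
    """
    if policy not in ("silent", "warn", "error"):
        raise ValueError("unknown policy %r" % policy)
    code = score + offset
    if code > MAX_ASCII:
        limit = MAX_ASCII - offset
        if policy == "error":
            raise ValueError("quality %i is above the maximum %i" % (score, limit))
        if policy == "warn":
            warnings.warn("quality %i truncated to %i" % (score, limit))
        code = MAX_ASCII
    return chr(code)
```

With the default offset of 33, encode_quality(40) gives "I", while a
score of 100 gives "~" under "silent" or "warn" and raises ValueError
under "error".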

Peter Rice wrote:
>> We should bear in mind what the outer limit quality scores are. A quality
>> score of 60 means a 1 in a million chance of an error. A quality of 90 means
>> a 1 in a billion chance of an error (or 3 in an entire mammalian genome).
>> Quality scores below 1 (phred) or -5 (solexa) mean the base is wrong
>> (worse than random).
>>
>> I do not believe we are losing anything biologically significant by the
>> score limits...

Good point.
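
For anyone wanting to check those numbers, here is a small sketch using
the usual definitions (PHRED is -10*log10(p), Solexa is
-10*log10(p/(1-p)), where p is the error probability). The function
names are just for this example.

```python
import math


def phred_to_error(q):
    """Error probability for a PHRED quality q."""
    return 10 ** (-q / 10.0)


def error_to_phred(p):
    """PHRED quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_to_solexa(p):
    """Solexa quality for an error probability p (0 < p < 1)."""
    return -10 * math.log10(p / (1.0 - p))


# PHRED 60 is about 1 error in a million, PHRED 90 about 1 in a billion.
for q in (60, 90):
    print("PHRED %i -> p = %g" % (q, phred_to_error(q)))

# Expected errors over roughly 3e9 bases at PHRED 90: about 3.
print("errors per 3e9 bases: %g" % (3e9 * phred_to_error(90)))

# A random guess among four bases is wrong with p = 0.75, which is
# where the "worse than random" thresholds come from.
print("PHRED  at p=0.75: %.2f" % error_to_phred(0.75))
print("Solexa at p=0.75: %.2f" % error_to_solexa(0.75))
```

The last two lines come out at roughly 1.25 (PHRED) and -4.77 (Solexa),
in line with the thresholds of 1 and -5 quoted above.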

On Fri, Jul 31, 2009 at 2:19 AM, Chris Fields<cjfields at illinois.edu> wrote:
>
> I do tend to agree, and I don't think any savings from a performance hit
> will be worth the headache of having to repeatedly explain why it's
> (silently) doing so, when a simple warning or error message ('value X out of
> range for fastq format y') would suffice.

That's a shift from your earlier stance:
> I do think if it affects performance to a significant enough degree we
> can do this silently, we just need to ensure this is well-documented.

Still, I guess it boils down to how big a penalty the warnings would
impose on typical conversions. And for Biopython, it looks like the
answer is not much.

I've updated Biopython to issue warnings on writing FASTQ files if a
quality score has to be truncated to fit the given encoding, i.e. if you
have a PHRED quality above 93 for "fastq-sanger" or above 62 for
"fastq-illumina", or a Solexa quality above 62 for "fastq-solexa". As
implemented there is a speed penalty, but *only* for these fringe cases.
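
Those limits follow from the ASCII offsets (33 for Sanger, 64 for Solexa
and Illumina) and the top printable character (126). The sketch below
shows one way to confine the extra work to the fringe cases. It is NOT
the actual Biopython code: quality_string and max_quality are names made
up for this example, the scores are assumed to already be integers on
the right scale for the format, and the lower limit is not checked.

```python
import warnings

MAX_ASCII = 126  # "~"
OFFSETS = {
    "fastq-sanger": 33,    # PHRED scores
    "fastq-solexa": 64,    # Solexa scores
    "fastq-illumina": 64,  # PHRED scores
}


def max_quality(fmt):
    """Largest score that fits in the given FASTQ variant."""
    return MAX_ASCII - OFFSETS[fmt]


def quality_string(scores, fmt):
    """Encode a list of integer scores as a FASTQ quality string."""
    offset = OFFSETS[fmt]
    limit = MAX_ASCII - offset
    if not scores or max(scores) <= limit:
        # Common case: nothing to truncate, so no per-score check.
        return "".join(chr(q + offset) for q in scores)
    # Fringe case: truncate and warn once for the whole record.
    warnings.warn("Data loss: max quality %i in %s" % (limit, fmt))
    return "".join(chr(min(q, limit) + offset) for q in scores)
```

For example, quality_string([0, 40, 93], "fastq-sanger") returns "!I~"
without a warning, while quality_string([94], "fastq-sanger") returns
"~" with one.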

Peter