[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Fri Jul 31 08:16:20 UTC 2009

Aaron Mackey wrote:
> I would strongly warn against truncation, for any reason.  Use the formulas
> you have for quality-encoding conversions, but do not assume that you know
> more than I do about what my data contains, or that you are in any way
> helping me by altering my data, silently or otherwise.  Said another way,
> feel free to warn me that my data may contain garbage, and utterly fail to
> convert it for me, but do not try to fix it for me.

We should bear in mind what the extreme quality scores mean. A 
quality score of 60 means a 1 in a million chance of an error. A quality 
of 90 means a 1 in a billion chance of an error (or about 3 errors in an 
entire mammalian genome). Quality scores below 1 (phred) or -5 (solexa) 
mean the base call is more likely to be wrong than a random guess.
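
For anyone who wants to check those numbers, here is a minimal sketch of the 
usual phred and solexa definitions. The function names are my own for 
illustration and are not taken from Biopython, BioPerl, or EMBOSS.

```python
def phred_to_error_prob(q):
    """Error probability for a phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


def solexa_to_error_prob(q):
    """Error probability for a solexa score.

    Solexa scores are defined on the odds, Q = -10 * log10(P / (1 - P)),
    so we recover the odds first and then convert them to a probability.
    """
    odds = 10 ** (-q / 10.0)
    return odds / (1.0 + odds)


# Illustrative figure only: roughly 3 billion bases in a mammalian genome.
GENOME_BASES = 3e9

if __name__ == "__main__":
    for q in (60, 90):
        p = phred_to_error_prob(q)
        print("phred", q, "error probability", p,
              "expected errors per genome", p * GENOME_BASES)
    # A random guess among four bases is wrong 3 times in 4.
    print("phred 1 worse than random:", phred_to_error_prob(1) > 0.75)
    print("solexa -5 worse than random:", solexa_to_error_prob(-5) > 0.75)
```

At high scores the two scales agree closely, which is why the limits only 
matter at the low end.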

I do not believe we are losing anything biologically significant by the 
score limits. What we gain is a tighter definition of the FASTQ format 
that protects other parsers from terrible errors with, for example, 
signed characters.

On the subject of warnings ... While I am happy to issue warnings, I 
suggest we take some care over what happens when someone picks the wrong 
format and a million reads have quality scores out of range.

We could, for example, report the first error and then count up so we 
can later (at the end or when another error occurs) say "and another 
987654 up to ..." and give the latest one.
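
As a rough sketch of that idea (not code from any of the three projects; the 
class name, method names, and message wording are all hypothetical), the 
first out-of-range score is reported at once and later ones are only counted:

```python
class OutOfRangeReporter:
    """Report the first out-of-range quality score, count the rest.

    The low/high limits are supplied by the caller; nothing here assumes
    a particular FASTQ variant.
    """

    def __init__(self, low, high):
        self.low = low
        self.high = high
        self.first = None      # (read_id, score) of the first offender
        self.latest = None     # (read_id, score) of the most recent one
        self.suppressed = 0    # offenders counted but not reported
        self.messages = []     # warnings actually issued

    def check(self, read_id, score):
        """Return True if score is in range; otherwise record it."""
        if self.low <= score <= self.high:
            return True
        if self.first is None:
            # First offender: issue a full warning straight away.
            self.first = (read_id, score)
            self.messages.append(
                "quality %d out of range [%d, %d] in read %s"
                % (score, self.low, self.high, read_id)
            )
        else:
            # Later offenders: just count, and remember the latest.
            self.suppressed += 1
            self.latest = (read_id, score)
        return False

    def summary(self):
        """Return the 'and another N up to ...' line, or None."""
        if self.suppressed == 0:
            return None
        read_id, score = self.latest
        return "and another %d up to read %s (quality %d)" % (
            self.suppressed, read_id, score)
```

A parser would call check() for each score and print summary() once at the 
end of the file, so a million bad reads produce two lines instead of a 
million.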

regards,

Peter Rice