[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
Peter Rice
pmr at ebi.ac.uk
Fri Jul 31 08:16:20 UTC 2009
Aaron Mackey wrote:
> I would strongly warn against truncation, for any reason. Use the formulas
> you have for quality-encoding conversions, but do not assume that you know
> more than I do about what my data contains, or that you are in any way
> helping me by altering my data, silently or otherwise. Said another way,
> feel free to warn me that my data may contain garbage, and utterly fail to
> convert it for me, but do not try to fix it for me.
We should bear in mind what quality scores at the outer limits mean. A
quality score of 60 means a 1 in a million chance of an error. A quality
score of 90 means a 1 in a billion chance of an error (or about 3 errors
in an entire mammalian genome). Quality scores below 1 (phred) or -5
(solexa) mean the base is wrong (worse than random).
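
As a rough check on the figures above, here is a minimal sketch of the two standard score-to-probability conversions. The function names are illustrative only and are not taken from Biopython, BioPerl, or EMBOSS. The genome size of 3e9 bases is an assumed round number.

```python
def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Solexa scale: Q = -10 * log10(p / (1 - p)), so p = 1 / (1 + 10 ** (Q / 10))."""
    return 1 / (1 + 10 ** (q / 10))


# Assumed round figure for a mammalian genome, in bases.
GENOME_SIZE = 3e9

# Chance of picking the wrong base when guessing one of four at random.
RANDOM_ERROR = 0.75

if __name__ == "__main__":
    for q in (60, 90):
        p = phred_to_error_prob(q)
        print(f"phred {q}: p(error) = {p:.1e}, "
              f"expected errors per genome = {p * GENOME_SIZE:.3g}")
    # Error probabilities at the lower limits, compared with a random guess.
    print(f"phred 1:   p(error) = {phred_to_error_prob(1):.3f}")
    print(f"solexa -5: p(error) = {solexa_to_error_prob(-5):.3f}")
```

On both scales the error probability at the lower limit already exceeds 0.75, which is what a random choice among four bases would give.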
I do not believe we are losing anything biologically significant by the
score limits - but we are using a tighter definition of the FASTQ format
to protect other parsers from terrible errors with, for example, signed
characters.
On the subject of warnings ... While I am happy to issue warnings, I
suggest we take some care over what happens when someone picks the wrong
format and a million reads have quality scores out of range.
We could, for example, report the first error and then count up so we
can later (at the end or when another error occurs) say "and another
987654 up to ..." and give the latest one.
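
The "report the first, then count" idea could be sketched roughly as follows. This is only an illustration under assumptions: the class name, method names, and message wording are hypothetical and do not correspond to anything in EMBOSS, Biopython, or BioPerl.

```python
class RangeWarningCounter:
    """Report the first out-of-range score in full, then only count the rest."""

    def __init__(self):
        self.count = 0      # total out-of-range scores seen so far
        self.latest = None  # (read_id, score) of the most recent suppressed one

    def record(self, read_id, score):
        """Return a warning string for the first problem, None for later ones."""
        self.count += 1
        if self.count == 1:
            return f"warning: quality score {score} out of range in read {read_id}"
        self.latest = (read_id, score)
        return None

    def summary(self):
        """Return one line covering the suppressed warnings, or None if there were none."""
        if self.count <= 1:
            return None
        read_id, score = self.latest
        return f"and another {self.count - 1} up to read {read_id} (score {score})"


if __name__ == "__main__":
    counter = RangeWarningCounter()
    # Hypothetical read names and scores, for illustration only.
    for read_id, score in [("read1", 99), ("read2", 100), ("read3", 101)]:
        message = counter.record(read_id, score)
        if message:
            print(message)
    tail = counter.summary()
    if tail:
        print(tail)
```

A real parser would call `summary()` at end of input, or when a different kind of error occurs, so that a wrongly chosen format produces two lines of output rather than a million.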
regards,
Peter Rice