[emboss-dev] EMBOSS seqret FASTQ support

Fri Jul 24 10:32:50 UTC 2009

Hi again Peter,

I have another query regarding how EMBOSS treats "fastq" as
a format name.

>From our earlier discussions I was expecting "fastq" to be an
EMBOSS input *only* format where you would ignore the qualities.
This would allow tasks like FASTQ to FASTA without having to
worry if the scores where encoded following the Sanger standard,
the original Solexa scheme, or the Illumina 1.3+ encoding.

When I found EMBOSS offered "fastq" as an output format,
I initially thought it might produce files with dummy quality
values (even if the input file had qualities). This puzzled me,
as I couldn't see a use for this, but in fact this isn't the case.

Instead, "fastq" as an output format seems to act like the
"fastq-sanger" format. I notice you use dummy values for the
quality if there are unknown, specifically a PHRED quality of
one (meaning about random, a sensible default in some cases).
e.g.

$ more example.fasta
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
>EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
>EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG

Converting "fasta" (with no qualities) to "fastq-sanger", seqret
assigns a PHRED quality of 1 (the double quote, ASCII 34):

$ seqret -sequence example.fasta -sformat fasta -osformat fastq-sanger
-stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
"""""""""""""""""""""""""

Converting "fasta" (with no qualities) to "fastq" seems
to act just like conversion to "fastq-sanger":

$ seqret -sequence example.fasta -sformat fasta -osformat fastq -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
"""""""""""""""""""""""""

As an aside, FASTA to Illumina FASTQ also uses PHRED quality
one (ASCII 64+1 = 65 is the letter A):

$ seqret -sequence example.fasta -sformat fasta -osformat
fastq-illumina -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
AAAAAAAAAAAAAAAAAAAAAAAAA
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
AAAAAAAAAAAAAAAAAAAAAAAAA
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
AAAAAAAAAAAAAAAAAAAAAAAAA

(Due to the rounding issue I have not included a FASTA to Solexa FASTQ example)

Have I understood correctly? i.e. in EMBOSS 6.1.0:

"fastq" on input - ignores quality strings
"fastq" on output - acts like "fastq-sanger"
"fastq-sanger" - PHRED scores offset 31
"fastq-solexa" - Solexa scores offset 64
"fastq-illumina" - PHRED scores offset 64

If this is correct, the "fastq" format behaviour strikes me as very odd.
I would have either made "fastq" and "fastq-solexa" the same, or
made "fastq" an input only format.

Consider this very surprising behaviour that this results in...

$ more example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333

You might want to use seqret to "clean up" a FASTQ file, for
example to standardize the line wrapping and the captions.
As this example is a Sanger style FASTQ file, this works:

 seqret -sequence example.fastq -sformat fastq-sanger -osformat
fastq-sanger -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

Notice EMBOSS has filled in the (optional) repeated caption on the
plus lines (and would have wrapped long reads). However, consider
the more natural thing to type:

$ seqret -sequence example.fastq -sformat fastq -osformat fastq -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
"""""""""""""""""""""""""

I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like
this threw away the quality scores - and I'm sure other people will also
make this mistake.

Peter C.