[EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

Peter Rice pmr at ebi.ac.uk
Thu Sep 17 05:52:48 EDT 2009


Peter C. wrote:
> So if the EMBOSS tools are used to read a FASTQ file without specifying
> the FASTQ variant, do they currently detect it is FASTQ and default to the
> "fastq" setting and ignore the quality information?

Yes, exactly so.

Reading the sequence data is safe, and may be all the user wanted to do.

>> What we could do is provide a utility that reads in fastq-sanger format and
>> checks whether the quality scores make most sense as Sanger, Solexa or
>> Illumina.
> 
> That could be useful - I guess you could scan all the reads building up
> a histogram of the ASCII characters used. This could immediately
> rule out some of the options, and then based on the distribution (if
> you assume they are raw reads) you could make a good guess.

The ACD file would be 'interesting'. We could set the default format to
be "fastq-sanger" and issue a warning if we find the user has tried
to change it. That way the application would run with a filename as the
input, though it would appear to interfaces to be able to read any
sequence input.

Are there rules we can use to decide on improbable qualities? Values 
below the Illumina and Solexa minima would seem a good guide, and 
perhaps values above the likely short read maximum score.
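
As a rough illustration of the histogram idea, here is a minimal Python
sketch that counts the quality characters and rules out any variant whose
ASCII range does not cover them. The ranges used (Sanger 33-126, Solexa
59-126, Illumina 1.3+ 64-126) are the commonly cited ones, and the names
VARIANT_RANGES, quality_histogram and possible_variants are hypothetical;
none of this is EMBOSS code. It also assumes strict four-line records.

```python
from collections import Counter

# Commonly cited ASCII ranges for each FASTQ variant. These bounds and the
# variant labels are assumptions for this sketch, not taken from EMBOSS.
VARIANT_RANGES = {
    "fastq-sanger": (33, 126),    # PHRED 0..93, offset 33
    "fastq-solexa": (59, 126),    # Solexa -5..62, offset 64
    "fastq-illumina": (64, 126),  # PHRED 0..62, offset 64
}


def quality_histogram(lines):
    """Count quality characters, assuming strict four-line FASTQ records."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 3:  # fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


def possible_variants(counts):
    """Return the variants whose ASCII range covers every observed character."""
    if not counts:
        return list(VARIANT_RANGES)
    lowest = min(ord(c) for c in counts)
    highest = max(ord(c) for c in counts)
    return [
        name
        for name, (low, high) in VARIANT_RANGES.items()
        if low <= lowest and highest <= high
    ]


# One made-up record whose quality line contains ASCII 34 (double quote).
records = ["@read1\n", "ACGT\n", "+\n", '"5;I\n']
print(possible_variants(quality_histogram(records)))
```

This only applies the hard lower and upper bounds. Choosing between the
variants that remain would need the distribution-based guess discussed
above, with cutoffs that this sketch does not attempt to supply.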

Maybe some existing pipelines have some cutoff values we could adopt?

>> I consider reading as fastq-sanger by default to be rather dangerous.
> 
> That is understandable. How about removing the current "fastq" output
> then? That might prevent some of the confusion at the moment. I'm
> struggling to see any purpose for the current "fastq" output - can you
> give me any example use case? Right now it has to pick an arbitrary
> quality symbol, and uses ASCII 34 (double quote) which means PHRED
> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
> Illumina 1.3+ FASTQ file.

It is an alias for fastq-sanger which should be OK. I prefer to have an 
output format name for each input format name where it looks sensible, 
so if we read "fastq" as an input format it should do something on 
output. Unfortunately that means it has to write quality scores somehow.
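
To make the arithmetic behind the ASCII 34 point concrete, here is a small
Python sketch that decodes the character under the two offsets in common
use (33 for Sanger, 64 for Solexa and Illumina 1.3+). The function names
are hypothetical and the offsets are the commonly cited ones, not values
read from the EMBOSS source.

```python
SANGER_OFFSET = 33            # assumed offset for fastq-sanger
SOLEXA_ILLUMINA_OFFSET = 64   # assumed offset for Solexa and Illumina 1.3+


def decode_quality(char, offset):
    """Convert one quality character to its numeric score for an offset."""
    return ord(char) - offset


def error_probability(phred):
    """Error probability implied by a PHRED score: 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


symbol = chr(34)  # the double quote character
sanger_score = decode_quality(symbol, SANGER_OFFSET)
other_score = decode_quality(symbol, SOLEXA_ILLUMINA_OFFSET)

print("Sanger score:", sanger_score)
print("Sanger error probability:", error_probability(sanger_score))
print("Offset-64 score:", other_score)
```

Under offset 33 the character decodes to 1, a valid if very low PHRED
score. Under offset 64 it decodes to -30, which is below the usual Solexa
minimum of -5 and the Illumina 1.3+ minimum of 0, so the same file cannot
be read as either of those variants.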

regards,

Peter

