[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Chris Fields cjfields at illinois.edu
Sat Jul 25 17:47:11 EDT 2009


On Jul 25, 2009, at 4:12 PM, Peter wrote:

> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>
>> If this is accepted as common practice between BioPython and EMBOSS
>> we will follow similarly.  I do think it's worth at least a warning
>> for the reasons outlined above (e.g. it likely isn't Illumina's
>> intent to support qual values outside the specified range).  Might
>> be worth checking into.
>
> True. I think what EMBOSS and Biopython are doing is reasonable
> (although a warning in this situation makes sense). Equally, an
> error is a valid option. However, one question is when would you
> issue the warning/error? For a PHRED score above 40? (Assuming
> we have a definitive reference for Illumina using just 0 to 40).
> How about if a problem character would result? Since ASCII
> 64+63=127, the first problem character would be for PHRED
> score 63.
>
> i.e. An Illumina FASTQ format file can hold PHRED scores in the
> range 0 to 62 without using problem characters. And likewise
> for a Solexa FASTQ file (Solexa scores up to 62).

I don't think there is a middle ground: we either indicate that the
score falls outside the specified range (and warn/throw), or we allow
it completely and just run the conversion w/o warnings, regardless of
output.  The former would at least let the user know what the problem
is when they look at their output.
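
To make the ranges under discussion concrete, here is a minimal Python
sketch of the offset arithmetic. It assumes the usual ASCII offsets (33
for Sanger, 64 for Solexa/Illumina) and treats 126 ('~') as the last
safe character, since 127 is DEL. The function names are made up for
illustration and are not from BioPerl, Biopython, or EMBOSS.

```python
SANGER_OFFSET = 33     # Sanger FASTQ: score 0 is '!'
ILLUMINA_OFFSET = 64   # Illumina (and Solexa) FASTQ: score 0 is '@'
MAX_PRINTABLE = 126    # '~'; ASCII 127 (DEL) is the first problem character


def max_encodable_score(offset):
    """Highest score that still maps to a printable ASCII character."""
    return MAX_PRINTABLE - offset


def encode_quality(score, offset):
    """Encode one score as a FASTQ quality character, or raise ValueError."""
    code = score + offset
    if not 33 <= code <= MAX_PRINTABLE:
        raise ValueError(
            "score %d with offset %d gives ASCII %d, outside 33..%d"
            % (score, offset, code, MAX_PRINTABLE)
        )
    return chr(code)
```

With these offsets the ceilings come out as 93 for Sanger and 62 for
Illumina, matching the numbers quoted above; a score of 63 with offset
64 is the first one to be rejected.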

If we issue a warning it would pop up only if the bounds are exceeded.
I will probably set this up to warn only once (if needed I could
cache the out-of-range quals and print them).
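
Roughly what I have in mind, sketched in Python rather than Perl for
brevity. This is only an illustration of the warn-once-and-cache idea,
not the eventual BioPerl code; the class name, the method name, and the
default 0 to 40 bounds are all assumptions.

```python
import warnings


class QualityRangeChecker:
    """Warn once when scores leave the expected range; cache offenders."""

    def __init__(self, low=0, high=40):
        # The 0..40 default is an assumed Illumina range, not a confirmed spec.
        self.low = low
        self.high = high
        self.out_of_range = []   # cached (read_id, score) pairs
        self._warned = False

    def check(self, read_id, scores):
        """Return True if every score is in range; record any that are not."""
        bad = [(read_id, s) for s in scores if not self.low <= s <= self.high]
        self.out_of_range.extend(bad)
        if bad and not self._warned:
            warnings.warn(
                "quality score outside %d..%d (first seen in read %s)"
                % (self.low, self.high, read_id),
                stacklevel=2,
            )
            self._warned = True
        return not bad
```

The cached pairs in `out_of_range` can be printed at the end of a run
if the user wants the full list instead of the single warning.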

>> From this it could be summarized that converting to sanger format
>> is least problematic, as possible issues may be encountered when
>> converting to the other variants.
>
> Yes. The Sanger FASTQ format will hold PHRED scores from 0 to 93
> while using nice ASCII characters - this means it is suitable for both
> raw reads and processed data from assemblies or read mappings.
>
> In my personal experience, Solexa/Illumina FASTQ files tend to get
> converted into the Sanger FASTQ format for downstream analysis
> (e.g. the MAQ tool, or the NCBI short read archive).
>
> So writing high quality reads (i.e. above PHRED 40) to Solexa or
> Illumina FASTQ files is unlikely.

Yes, though unfortunately we can never rule it out; we can only try to
account for the possibility in some way.

>> We'll need to fix the solexa quality calculations in the BioPerl
>> parser as noted in your previous post; I'll work on that.
>
> Great.
>
> Peter
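
For reference, the standard relationship between Solexa and PHRED
scores is Q_phred = 10 * log10(10^(Q_solexa/10) + 1). A minimal Python
sketch of that formula and its inverse follows; the function names are
illustrative only, and this is not the BioPerl parser code.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to a PHRED score (returns a float)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_solexa(q_phred):
    """Convert a PHRED score to a Solexa score (returns a float).

    PHRED 0 has no finite Solexa equivalent, so it is rejected here;
    how a real converter should handle it is left open.
    """
    if q_phred <= 0:
        raise ValueError("PHRED score must be > 0 for this conversion")
    return 10.0 * math.log10(10.0 ** (q_phred / 10.0) - 1.0)
```

The two scales differ noticeably at the low end (Solexa 0 is about
PHRED 3) and are nearly identical for higher scores, so rounding only
matters for poor quality bases.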

chris



