[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
Chris Fields
cjfields at illinois.edu
Sat Jul 25 17:47:11 EDT 2009
On Jul 25, 2009, at 4:12 PM, Peter wrote:
> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields<cjfields at illinois.edu>
> wrote:
>>
>> If this is accepted as common practice between BioPython and EMBOSS
>> we will follow similarly. I do think it's worth at least a warning
>> for the reasons outlined above (e.g. it likely isn't Illumina's
>> intent to support qual values outside the specified range). Might be
>> worth checking into.
>
> True. I think what EMBOSS and Biopython are doing is reasonable
> (although a warning in this situation makes sense). Equally, an
> error is a valid option. However, one question is when would you
> issue the warning/error? For a PHRED score above 40? (Assuming
> we have a definitive reference for Illumina using just 0 to 40).
> How about if a problem character would result? Since ASCII
> 64+63=127, the first problem character would be for PHRED
> score 63.
>
> i.e. An Illumina FASTQ format file can hold PHRED scores in the
> range 0 to 62 without using problem characters. And likewise
> for a Solexa FASTQ file (Solexa scores up to 62).
I don't think there is a middle ground: we either indicate that the
score falls outside the specified range (and warn/throw), or we allow
it completely and just run the conversion w/o warnings, regardless of
output. The former would at least let the user know what the problem
is when they look at their output.
If we issue a warning it would pop up only if the bounds are exceeded.
I will probably set this up to warn only once (if needed I could
cache the out-of-range quals and print them).
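A minimal Python sketch of the warn-once idea, for illustration only.
The class name QualityEncoder, the LIMITS table, and the 0 to 40
Illumina bound are assumptions (the bound is exactly the unconfirmed
figure discussed above); this is not the BioPerl implementation.

```python
import warnings

# (ASCII offset, highest in-range score) per FASTQ variant.
# The Illumina bound of 40 is an assumption, pending a definitive reference.
LIMITS = {"sanger": (33, 93), "illumina": (64, 40)}


class QualityEncoder:
    """Encode PHRED scores, warning once and caching out-of-range values."""

    def __init__(self, variant):
        self.variant = variant
        self.offset, self.upper = LIMITS[variant]
        self.out_of_range = set()  # cached offending scores, for later printing
        self._warned = False

    def encode(self, scores):
        bad = {q for q in scores if not 0 <= q <= self.upper}
        if bad:
            self.out_of_range |= bad
            if not self._warned:
                warnings.warn(
                    f"{self.variant} score outside 0..{self.upper}; "
                    "converting anyway"
                )
                self._warned = True
        # The conversion itself is unchanged by the check.
        return "".join(chr(self.offset + q) for q in scores)
```

Reusing one encoder for a whole file gives a single warning, and
sorted(encoder.out_of_range) lists the offending scores afterwards.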
>> From this it could be summarized that converting to sanger format
>> is least problematic, as possible issues may be encountered when
>> converting to the other variants.
>
> Yes. The Sanger FASTQ format will hold PHRED scores from 0 to 93
> while using nice ASCII characters - this means it is suitable for both
> raw reads and processed data from assemblies or read mappings.
>
> In my personal experience, Solexa/Illumina FASTQ files tend to get
> converted into the Sanger FASTQ format for downstream analysis
> (e.g. the MAQ tool, or the NCBI short read archive).
>
> i.e. Writing high quality reads (above PHRED 40) to Solexa or
> Illumina FASTQ files is unlikely.
Yes, though unfortunately we can never rule it out; we can only try to
account for the possibility in some way.
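For illustration, a minimal Python sketch of why the Sanger direction
is the safe one. The function names are hypothetical, and the
Solexa-to-PHRED mapping is the commonly used formula
Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), not something taken from
the BioPerl parser.

```python
import math


def illumina_to_sanger(qual):
    """Re-encode an Illumina (offset 64) quality string at offset 33."""
    # Any PHRED score 0..62 storable at offset 64 fits within Sanger's 0..93.
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


def solexa_to_phred(q_solexa):
    """Map a Solexa score to the nearest integer PHRED score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa_to_sanger(qual):
    """Convert a Solexa (offset 64) quality string to Sanger encoding."""
    return "".join(chr(33 + solexa_to_phred(ord(c) - 64)) for c in qual)
```

Going the other way, a Sanger score above 62 has no printable
character at offset 64, which is where a warning or error would apply.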
>> We'll need to fix the solexa quality calculations in the BioPerl
>> parser as noted in your previous post; I'll work on that.
>
> Great.
>
> Peter
chris