[Bioperl-l] Next-gen modules

Tue Jun 23 21:39:48 UTC 2009

On Tue, Jun 23, 2009 at 1:22 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter wrote:
>> On Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>>> We just added FASTQ parsing to EMBOSS and faced the same issues.
>>>
>>
>> I was going to chat to you about this at BOSC, and suggest this be
>> added to EMBOSS - but you are well ahead of me ;)
>
> Not that well ahead really ... someone asked for it in our BoF at
> BOSC/ISMB last year so we thought we'd better get it done before this
> one. it was implemented a couple of days ago :-)
>

Well, ahead of my asking!

>>> Quality scores are kept as phred values. Phred of 0 means unknown,
>>> which in Solexa is -5 (0.75 error rate = could be anything).
>>
>> A Phred quality of 0 means probability of error is 1, so yes, unknown. I don't
>> quite follow your leap that this corresponds to a Solexa quality of -5. Could
>> you clarify?
>
> Phred score is -10 log(p) where p is the probability of error. A phred
> of 0 implies 1.0 (certainty) of error, but 0.75 is a better estimate
> (3/4 chance that any base you pick is wrong).
>
> Solexa scores are -10 log(p/(1-p)) so p=0.75 comes out at -5. This is
> why Solexa scores can go down to -5 in their fastq format.
>
>>> We assume lower quality scores are from alignments rather than
>>> single reads.
>>
>> Did you mean to say "higher quality scores" (i.e. lower probability of error),
>> e.g a PHRED score of 80 which you can get from MAQ doing read mapping
>> or something consensus based.
>
> Actually I mean both. Error probabilities below 0.75 for a single base
> are silly, and error probabilities below 0.0001 make sense only when two
> or more high quality bases are aligned.

I see what you mean - a probability of error of 0.75 matches that
for a random base call, obvious when you put it like that. Of course,
there is this nasty little thought at the back of my mind that sooner
or later someone will use FASTQ files for proteins (e.g. from some
mass-spec protein sequencing).

A probability less than that (e.g. 0) is actually worse than random and
could be considered as mean "we're pretty sure this isn't the stated
letter". But that would be silly, as you say.

>>> We gave up on trying to guess the quality score standard and require
>>> users to say whether they are sanger, solexa (1.0) or Illumina (1.3)
>>> format files. If we only want the sequence then we don't care so we allow
>>> "fastq" as a sequence format and ignore the quality scores in that case.
>>
>> What format names have you used? Ideally we'd have the same names
>> in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and
>> "fastq-illumina").
>
> We don't normally use '-' in our format names so we have fastqsanger,
> fastqsolexa, fastqillumina and fastqint. None of these have been tried
> on users as yet.
>
> The '-' names look nice though. We can consider introducing them. Do you
> have a full list of format names (sequence, feature, alignment, etc.) we
> can try to conform to?

See http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO

Getting EMBOSS to conforming should be trivial - in general when
picking a format name for Biopython's SeqIO or AlignIO (and we
have avoided multiple aliases with one exception) we have tried to
use anything shared by BioPerl and EMBOSS. The FASTQ variants
are unusual in that Biopython got to invent some names.

In future where would be a good place to discuss these kinds of
cross-platform issues? (i.e. BioPerl, Biopython, EMBOSS, etc).

>>> We also allow the integer quality score format ... is anyone still
>>> using that (it looks horrible to me :-)
>>
>> Do you mean the QUAL file format holding PHRED scores?
>> Roche provide tools to turn their SFF files into FASTA and
>> QUAL files, so they are still used.
>
> Probably ... unless there is a Solexa version too.

We may be talking at cross purposes here, this is QUAL format:
http://www.bioperl.org/wiki/Qual_sequence_format

Peter