[Bioperl-l] Next-gen modules

Tue Jun 23 08:22:33 EDT 2009

Peter wrote:
> On Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>> We just added FASTQ parsing to EMBOSS and faced the same issues.
>>
> 
> I was going to chat to you about this at BOSC, and suggest this be
> added to EMBOSS - but you are well ahead of me ;)

Not that well ahead really ... someone asked for it in our BoF at
BOSC/ISMB last year so we thought we'd better get it done before this
one. it was implemented a couple of days ago :-)

>> Parsing was easy - find the '@' line, read sequence until the '+' line
>> is reached, then read (seqlen) quality characters ... and check the next
>> line starts with '@'
> 
> That is basically what I did for Biopython.
> 
>> Quality scores are kept as phred values. Phred of 0 means unknown,
>> which in Solexa is -5 (0.75 error rate = could be anything).
> 
> A Phred quality of 0 means probability of error is 1, so yes, unknown. I don't
> quite follow your leap that this corresponds to a Solexa quality of -5. Could
> you clarify?

Phred score is -10 log(p) where p is the probability of error. A phred
of 0 implies 1.0 (certainty) of error, but 0.75 is a better estimate
(3/4 chance that any base you pick is wrong).

Solexa scores are -10 log(p/(1-p)) so p=0.75 comes out at -5. This is
why Solexa scores can go down to -5 in their fastq format.

>> We assume lower quality scores are from alignments rather than single reads.
> 
> Did you mean to say "higher quality scores" (i.e. lower probability of error),
> e.g a PHRED score of 80 which you can get from MAQ doing read mapping
> or something consensus based.

Actually I mean both. Error probabilities below 0.75 for a single base
are silly, and error probabilities below 0.0001 make sense only when two
or more high quality bases are aligned.

>> We gave up on trying to guess the quality score standard and require
>> users to say whether they are sanger, solexa (1.0) or Illumina (1.3)
>> format files. If we only want the sequence then we don't care so we allow
>> "fastq" as a sequence format and ignore the quality scores in that case.
> 
> What format names have you used? Ideally we'd have the same names
> in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and
> "fastq-illumina").

We don't normally use '-' in our format names so we have fastqsanger,
fastqsolexa, fastqillumina and fastqint. None of these have been tried
on users as yet.

The '-' names look nice though. We can consider introducing them. Do you
have a full list of format names (sequence, feature, alignment, etc.) we
can try to conform to?

>> We also allow the integer quality score format ... is anyone still using
>> that (it looks horrible to me :-)
> 
> Do you mean the QUAL file format holding PHRED scores? Roche provide
> tools to turn their SFF files into FASTA and QUAL files, so they are still used.

Probably ... unless there is a Solexa version too.

regards,

Peter