[Bioperl-l] Next-gen modules

Tue Jun 23 11:00:38 UTC 2009

We just added FASTQ parsing to EMBOSS and faced the same issues.

Parsing was easy: find the '@' line, read sequence lines until the '+'
line is reached, then read (seqlen) quality characters, and check that
the next line starts with '@'.
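
For anyone who wants to try the same approach, here is a minimal Python sketch of those steps. It is not the EMBOSS code (which is C); the function name read_fastq is made up for illustration. Counting quality characters instead of looking for the next '@' matters because '@' and '+' are both valid quality characters.

```python
def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text handle.

    Hypothetical helper for illustration; multi-line records are allowed.
    """
    line = handle.readline()
    while line:
        if not line.strip():              # skip blank lines between records
            line = handle.readline()
            continue
        if not line.startswith("@"):      # every record must start with '@'
            raise ValueError("expected '@' header, got %r" % line)
        name = line[1:].rstrip()

        # Read sequence lines until the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("no '+' line for record %s" % name)
        seq = "".join(seq_parts)

        # Read exactly len(seq) quality characters, however many lines
        # that takes. A quality line may itself begin with '@' or '+'.
        qual_parts = []
        qlen = 0
        while qlen < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("truncated quality for record %s" % name)
            chunk = line.rstrip("\r\n")
            qual_parts.append(chunk)
            qlen += len(chunk)
        qual = "".join(qual_parts)
        if len(qual) != len(seq):
            raise ValueError("quality length mismatch for record %s" % name)

        yield name, seq, qual
        # The next pass of the loop checks that this line starts with '@'.
        line = handle.readline()
```

Usage would be something like list(read_fastq(open("reads.fastq"))), where reads.fastq is a placeholder file name.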

Quality scores are kept as phred values. Phred of 0 means unknown, which
in Solexa is -5 (0.75 error rate = could be anything). We assume lower
quality scores are from alignments rather than single reads.
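
To make the -5 figure concrete, here is a small sketch using the usual published definitions: phred is -10*log10(p) and Solexa is -10*log10(p/(1-p)), where p is the error probability. The function names are invented for this example, and how a converted value gets rounded (or mapped to 0 for "unknown") is an implementation choice not shown here.

```python
import math

def solexa_to_error(q_solexa):
    """Error probability implied by a Solexa score (odds-based scale)."""
    return 1.0 / (1.0 + 10.0 ** (q_solexa / 10.0))

def phred_to_error(q_phred):
    """Error probability implied by a phred score."""
    return 10.0 ** (-q_phred / 10.0)

def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent (unrounded) phred score."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)

# Solexa -5 implies an error probability of roughly 0.76, i.e. close to
# the 0.75 expected when picking one of four bases at random.
p = solexa_to_error(-5)
q = solexa_to_phred(-5)
```

With these formulas Solexa -5 comes out at a phred value a little above 1, and the two scales agree closely at high quality (Solexa 40 is within a thousandth of phred 40).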

We gave up on trying to guess the quality score standard and require
users to say whether their files are in Sanger, Solexa (1.0) or Illumina
(1.3) format. If we only want the sequence then we don't care, so we
allow "fastq" as a sequence format and ignore the quality scores in that
case.

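As a rough illustration of why the declared format matters, the sketch below decodes a quality string under each variant. The offsets are the commonly documented ones (Sanger: ASCII offset 33, phred scores; Solexa 1.0: offset 64, Solexa scores; Illumina 1.3: offset 64, phred scores). The format keys and the function name decode_quality are made up for this example and are not the EMBOSS names.

```python
import math

# Hypothetical format table: name -> (ASCII offset, scores are Solexa-scaled)
FORMATS = {
    "sanger":   (33, False),
    "solexa":   (64, True),
    "illumina": (64, False),
}

def decode_quality(qual, fmt):
    """Return phred-scaled scores for a quality string, or None for
    plain "fastq", where only the sequence is wanted."""
    if fmt == "fastq":
        return None                      # quality scores are ignored
    if fmt not in FORMATS:
        raise ValueError("unknown FASTQ variant: %r" % fmt)
    offset, is_solexa = FORMATS[fmt]
    scores = [ord(c) - offset for c in qual]
    if is_solexa:
        # Convert from the Solexa (odds-based) scale to phred.
        scores = [10.0 * math.log10(10.0 ** (s / 10.0) + 1.0) for s in scores]
    return scores
```

The same character decodes very differently depending on the variant: 'I' is phred 40 in Sanger files but only 9 under an offset of 64, which is why guessing from the data alone is unreliable for short or uniformly high-quality files.
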
We also allow the integer quality score format ... is anyone still using
that? (It looks horrible to me :-)

Code is in the EMBOSS CVS, and will appear in release 6.1.0 on July 15th.

Any further tips would be very useful.

regards,

Peter Rice