[Bioperl-l] Update of SeqIO:: fastq Module for PacBio

Peter Cock p.j.a.cock at googlemail.com
Thu Sep 20 12:31:50 EDT 2012


Thanks Dan - I got the attachment too. I've CC'd the list again.

On Thu, Sep 20, 2012 at 5:01 PM, Dan Nasko <dan.nasko at gmail.com> wrote:
> Hi Peter,
>
> Not sure how familiar you are with PacBio, but here's a quick synopsis.
>
> For each read the sequencer sequences 2 to 6 copies of the read at extremely
> low quality (0-14). Many of these reads are exceedingly small (1 to 3 bases)
> -- which is where I run into that issue of quality encoded value of 0
> throwing an error.
>
> From there these 2 to 6 copies of the read are aligned into a final
> consensus sequence which will have extremely high quality scores -- which is
> where I'm getting this >93 problem.
>
> I was at a PacBio conference a couple of weeks ago in Baltimore and one of
> their representatives ensured me they're encoding in Phred33, so this isn't
> the issue. Using Perl's ord function I created a hash indexing quality
> character, the score, and number of occurrences. Here are the last line of
> output:
>
> ...
> y 88 3683
> z 89 3662
> { 90 4363
> | 91 3750
> } 92 2831
> ~ 93 3150
> 94 9219
> ? 95 2221
> ? 96 4232
> ? 97 1554
> ? 98 1399
> ? 99 1723
> ? 100 25991

Right - I understand, but believe this is not valid FASTQ because they
are using non-printing characters. i.e. I think BioPerl is doing the right
thing here (and Biopython, BioRuby, BioJava, EMBOSS should too):
http://dx.doi.org/10.1093/nar/gkp1137

To stay within the printable character set, PacBio must cap their FASTQ
output at chr(127). This equivalently means capping the PHRED scores
at 93 (although it might be wise to cap at less than that as IIRC some
tools may treat the max value specially - so perhaps cap at 90?).

The same restriction also applies to SAM/BAM, since SAM also
uses the Sanger FASTQ encoding (PHRED + 33). This range is
explicit in the specification as [!-~]+ (regular expression).

Do you have contact details for a PacBio technical representative?
I think they will need to fix this.

Regards,

Peter


More information about the Bioperl-l mailing list