[Bioperl-l] Update of SeqIO:: fastq Module for PacBio

Dan Nasko dan.nasko at gmail.com
Thu Sep 20 18:00:11 UTC 2012


Peter,

Here is Frank Boellmann's email, he's a Field Application Specialist and Bioinformatician at PacBio who is generally great at resolving these issues:
fboellmann at pacificbiosciences.com

I've gone ahead and contacted another one of their software engineers and am waiting for a response from him.

Dan

On Sep 20, 2012, at 12:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Thanks Dan - I got the attachment too. I've CC'd the list again.
> 
> On Thu, Sep 20, 2012 at 5:01 PM, Dan Nasko <dan.nasko at gmail.com> wrote:
>> Hi Peter,
>> 
>> Not sure how familiar you are with PacBio, but here's a quick synopsis.
>> 
>> For each read the sequencer sequences 2 to 6 copies of the read at extremely
>> low quality (0-14). Many of these reads are exceedingly small (1 to 3 bases)
>> -- which is where I run into that issue of quality encoded value of 0
>> throwing an error.
>> 
>> From there these 2 to 6 copies of the read are aligned into a final
>> consensus sequence which will have extremely high quality scores -- which is
>> where I'm getting this >93 problem.
>> 
>> I was at a PacBio conference a couple of weeks ago in Baltimore and one of
>> their representatives ensured me they're encoding in Phred33, so this isn't
>> the issue. Using Perl's ord function I created a hash indexing quality
>> character, the score, and number of occurrences. Here are the last line of
>> output:
>> 
>> ...
>> y 88 3683
>> z 89 3662
>> { 90 4363
>> | 91 3750
>> } 92 2831
>> ~ 93 3150
>> 94 9219
>> ? 95 2221
>> ? 96 4232
>> ? 97 1554
>> ? 98 1399
>> ? 99 1723
>> ? 100 25991
> 
> Right - I understand, but believe this is not valid FASTQ because they
> are using non-printing characters. i.e. I think BioPerl is doing the right
> thing here (and Biopython, BioRuby, BioJava, EMBOSS should too):
> http://dx.doi.org/10.1093/nar/gkp1137
> 
> To stay within the printable character set, PacBio must cap their FASTQ
> output at chr(127). This equivalently means capping the PHRED scores
> at 93 (although it might be wise to cap at less than that as IIRC some
> tools may treat the max value specially - so perhaps cap at 90?).
> 
> The same restriction also applies to SAM/BAM, since SAM also
> uses the Sanger FASTQ encoding (PHRED + 33). This range is
> explicit in the specification as [!-~]+ (regular expression).
> 
> Do you have contact details for a PacBio technical representative?
> I think they will need to fix this.
> 
> Regards,
> 
> Peter





More information about the Bioperl-l mailing list