[Bioperl-l] fastq parsing problem

Sat May 9 10:55:29 UTC 2009

Michael Muratet wrote:
> I've got a problem parsing fastq output from the maq aligner. The
> parser is throwing an exception for the following record:
>
> @HWI-EAS146:3:1:2:177#0/1
> CTCCGCTNNCTTCTCAG[...]
> +
> @,AB=>-&&:5).;+*=[...]
>
> I looked up the line in fastq.pm that does the parsing:
>
>     116   my ($top,$sequence,$top2,$qualsequence) = [...]

This is the fastq parser from 1.5.2 or thereabouts, which had a bug (in
the definition of $/, Perl's input record separator, just above this
code) that prevented it from parsing a record whose quality line starts
with "@".  This was probably not recognised as a bug for a long time due
to the enduring myth that fastq quality lines always start with "!".
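
To illustrate why a leading "@" in the quality line need not be a problem,
here is a minimal sketch in Python (not Perl, and not BioPerl's actual code;
the function name read_fastq and the sample records are made up for
illustration).  It assumes the simple four-line, unwrapped fastq layout and
reads each record by line position instead of splitting the input on "@":

```python
import io


def read_fastq(handle):
    """Yield (id, sequence, quality) tuples from four-line fastq records.

    Assumes unwrapped records: header, sequence, '+' line, quality.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header line, got {header!r}")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' line, got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # The quality line is taken by position, so it may start with '@'.
        yield header[1:], seq, qual


# Invented sample data; the first quality line starts with '@'.
sample = "@read1\nACGTN\n+\n@,AB=\n@read2\nGGCC\n+read2\n!!@@\n"
for record in read_fastq(io.StringIO(sample)):
    print(record)
```

Because the header is the only line checked for a leading "@", a quality
string beginning with "@" is never mistaken for the start of a new record.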

The fastq next_seq() was rewritten for 1.6.0 and parses this record
successfully.  (Unfortunately the documentation at the top of fastq.pm was
not updated and still reflects the false belief, which the code no longer
relies on, that quality lines start with "!".)

You may be able to just drop 1.6.0's Bio/SeqIO/fastq.pm in front of your
existing Bioperl installation, if you're a little crazy and don't want to
update the installation properly.  If you do that, or if you update,
you'll find that the new parser emits the following pedantic warning for
your fastq sequences:

MSG: Seq/Qual descriptions don't match; using sequence description

In practice, lots of people (probably even most!) don't bother putting the
sequence id on the "+" line, as it is entirely pointless duplication,
instead leaving the "+" line otherwise empty.  So I hope the maintainers
agree that this warning should be relaxed, such as in the attached patch. 
Or even removed -- there was no equivalent warning in the previous code.
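
The relaxation suggested above could look roughly like the following Python
sketch.  This is not the attached patch and not BioPerl code; the helper
name descriptions_conflict is hypothetical.  The idea is to warn only when
the "+" line carries a description that actually differs from the one on
the "@" line, and to stay silent when the "+" line is otherwise empty:

```python
def descriptions_conflict(seq_header, plus_line):
    """Return True only if the '+' line has a non-empty, different description.

    seq_header is the full '@' line and plus_line the full '+' line.
    """
    seq_desc = seq_header[1:].strip()   # drop the leading '@'
    qual_desc = plus_line[1:].strip()   # drop the leading '+'
    # An empty '+' line is the common case and is not treated as a mismatch.
    return bool(qual_desc) and qual_desc != seq_desc


header = "@HWI-EAS146:3:1:2:177#0/1"
print(descriptions_conflict(header, "+"))
print(descriptions_conflict(header, "+HWI-EAS146:3:1:2:177#0/1"))
print(descriptions_conflict(header, "+something-else"))
```

Only the last call reports a conflict; a bare "+" line and a "+" line that
repeats the sequence id both pass quietly.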

Cheers,

    John

-------------- next part --------------
A non-text attachment was scrubbed...
Name: qualdesc.diff
Type: application/octet-stream
Size: 580 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20090509/0e25663d/attachment-0004.obj>