[Bioperl-l] fastq parsing problem

Fri May 8 19:29:38 UTC 2009

Greetings

I've got a problem parsing fastq output from the maq aligner. The  
parser is throwing an exception for the following record:

@HWI-EAS146:3:1:2:177#0/1
CTCCGCTNNCTTCTCAGCTTTCTTGTAGGCGATAGACTTCCCGAGCCTANCCAGAGCAACGAGCNTNNNGNNNNTN
+
@,AB=>-&&:5).;+*=<*8?%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
%%%%%

I looked up the lines in fastq.pm that do the parsing (lines 116-121):

    my ($top,$sequence,$top2,$qualsequence) = $entry =~ /^
                                                        \@?(.+?)\n
                                                        ([^\@]*?)\n
                                                        \+?(.+?)\n
                                                        (.*)\n
                                                      /xs

I don't consider myself a regex-pert, but I would interpret the above  
as "put everything after one or zero @ characters on the first line in  
$top; then put anything that is not @ on the second line in $sequence;  
then everything after one or zero + characters on the third line in  
$top2; then everything on the fourth line in $qualsequence; and don't  
be greedy".

It seems like the fastq record above should parse with these rules. I  
note that the @ character is escaped in the regex and appears in  
several of the problem records, but not all. Has anyone come across  
this before? I don't see this exact problem in the list archives.
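
One way to narrow this down is to run the quoted pattern against the record outside of BioPerl. The sketch below is only a test harness, not the fastq.pm code: the variable names are made up for illustration, and the second case rests on an unverified assumption, namely that the reader splits its input on a newline followed by @ before the pattern is applied. If that assumption holds, a quality line that starts with @ would cut the record short.

```perl
use strict;
use warnings;

# The record from this message, with the quality string on a single line.
my $record = join "\n",
    '@HWI-EAS146:3:1:2:177#0/1',
    'CTCCGCTNNCTTCTCAGCTTTCTTGTAGGCGATAGACTTCCCGAGCCTANCCAGAGCAACGAGCNTNNNGNNNNTN',
    '+',
    '@,AB=>-&&:5).;+*=<*8?' . ('%' x 55),
    '';    # trailing empty element gives the final newline

# The pattern quoted above, with the mail-wrapped lines rejoined.
my $fastq_re = qr/^
    \@?(.+?)\n
    ([^\@]*?)\n
    \+?(.+?)\n
    (.*)\n
/xs;

# Case 1: the pattern is given the complete four-line record.
my @whole = $record =~ $fastq_re;
print "whole record: ", (@whole ? "matched" : "no match"), "\n";

# Case 2 (ASSUMPTION, not confirmed from fastq.pm): the input is first
# split on "\n@", so a quality line beginning with @ ends the chunk early.
my @chunks = split /\n\@/, $record;
my @first  = $chunks[0] =~ $fastq_re;
print "chunk after split on newline-\@: ", (@first ? "matched" : "no match"), "\n";
```

If the first case matches and the second does not, the pattern itself is probably fine and the place to look is how $entry is read, since only the quality line of this record begins with @.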

Thanks

Mike