[Bioperl-l] Bio::Index::Fastq '@' in qual

Tue Nov 1 09:40:30 EDT 2011

On Oct 24, 2011, at 9:58 AM, Sofia Robb wrote:

> Hi,
> 
> I am having problems running Bio::Index::Fastq.  I get the following error when a quality line begins with '@'.
> 
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: No description line parsed
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:368
> STACK: Bio::SeqIO::fastq::next_dataset /usr/share/perl5/Bio/SeqIO/fastq.pm:71
> STACK: Bio::SeqIO::fastq::next_seq /usr/share/perl5/Bio/SeqIO/fastq.pm:29
> STACK: Bio::Index::AbstractSeq::fetch /usr/share/perl5/Bio/Index/AbstractSeq.pm:147
> STACK: Bio::Index::AbstractSeq::get_Seq_by_id /usr/share/perl5/Bio/Index/AbstractSeq.pm:198
> STACK: /home_stajichlab/robb/bin/clean_pairs_indexed.pl:68
> 
> 
> Here is an example of a fastq record that is causing this error, The last line which starts with an '@'  is actually the qual line.
> 
> @5:105:15806:16092:Y
> GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG
> +
> @9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?
> 
> 
> 
> i see that chris has partially addressed this in the mailing list
> http://bioperl.org/pipermail/bioperl-l/2011-January/034481.html
> 
> However as he pointed out at the time, it appears this may be a fairly large problem.

The indexer is being refactored to address this problem; the Bio::SeqIO parser actually does parse this, but the (very simple) indexer does not.  I can try to push this to the forefront this week, the fix shouldn't be too hard to implement.  In essence it would simply use a few SeqIO methods I built in to parse out each bit of data in chunks; would just need to track the start and length of each chunk while the parser is running.

> My fastq seq and qual lines are alway only one line, so I think that adding a line count and only checking for @ in the lines that $line_count%4 ==0  would work since the header lines are always the first of 4 lines , 0,4,8, etc.

That doesn't work for all cases, however (some FASTQ wraps the seq and qual, like FASTA). Peter and I have discussed this elsewhere; a possible solutions is to add in an optimized parser that takes this assumption into account. 

One problem the various Bio* indexers have currently is the lack of standardization on a specific schema for indexing.  There are in-roads towards this (OBDA) that haven't been adequately traveled IMHO, which need to be taken up again.

A second, and maybe this is more specific to BioPerl, is that the parsers and indexers essentially reimplement the format parsing in each module, so if there are bugs they have to be independently fixed (hence why SeqIO works and the indexer doesn't; I wrote the first but not the second).  The best place for any optimizations would be in a unified parser that both the SeqIO and indexer modules could use.

> But if there are multiple lines of seq and qual i think that the /^+$/ of /^+$id/ can be used to identify the end of the sequence and the number of lines of quality should be equal to the number of lines of sequence
> 
> 
> ## only for single line seq and qual
> my $line_count = 0;
>   while (<$FASTQ>) {
>       if (/^@/ and  $line_count % 4 == 0) {
>           # $begin is the position of the first character after the '@'
>           my $begin = tell($FASTQ) - length( $_ ) + 1;
>           foreach my $id (&$id_parser($_)) {
>               $self->add_record($id, $i, $begin);
>               $c++;
>           }
>       }
>       $line_count++;
>   }
> 
> 
> --
> BioPerl fastq parsing issues aside, is there another tool which allows you to retrieve arbitrary sequences from a fastq file by sequence ID?
> 
> There's one called cdbfasta which looks like it might work — does anyone have experience with it?

I haven't, but it appears FASTA-specific.  Does it parse FASTQ as well?  

I recall Sanger has a C-based FASTQ/FASTQ hybrid one as well.  May have to look that one up.

> Thanks,
> sofia
> 
> P.S. I am CCing Peter Cock in case BioPython has solved this issue already — if so, perhaps their solution could be applied here.

chris