[Bioperl-l] Next-gen modules

Fri Jul 3 15:35:20 UTC 2009

Chris, I've just tested your Illumina/Solexa fastq parsing code and am
pleased to report that I haven't encountered any issues thus far.

To give an idea of the processing overhead of object instantiation, fastq
parsing performance on a lowly 3GHz Core 2 Duo (using one core) is as
follows:
Illumina fastq with next_dataset: ~1 million sequences/minute
Solexa fastq with next_dataset: ~500000 sequences/minute
Illumina fastq with next_seq: ~215000 sequences/minute
Solexa fastq with next_seq: ~175000 sequences/minute

My quality trimming script does about 300000 sequences/minute with
next_dataset, up from ~130000 sequences/minute with next_seq, so it shaves
hours off the run time, thanks!

Giles

2009/7/1 Chris Fields <cjfields at illinois.edu>

> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:
>
> ...
>
>  Peter, the trimming algorithm I use employs a sliding window, as follows:
>>
>>  - For each sequence position calculate the mean phred quality score for a
>>  window around that position.
>>  - Record whether the mean score is above or below a threshold as an array
>>  of zeros and ones.
>>  - Use a regular expression on the joined array to find the start and end
>>  of the good quality sequence(s).
>>  - Extract the quality sequence(s) and replace any bases below the quality
>>  threshold with N.
>>  - Trim any Ns from the ends.
>>
>> A refinement would be to weight the scores from positions in the window,
>> but
>> this could give a performance hit, and the method seems to work well
>> enough
>> as is.
>>
>> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
>> parsing a workout soon. Peter, as much as I'd love to help out with
>> biopython, I'm under too much time pressure right now!
>>
>
> Just let me know if the qual values match up with what is expected.  You
> can also iterate through the data with hashrefs using next_dataset (faster
> than objects).  This is from the fastq tests in core:
>
> -----------------------------------------
> $in_qual  = Bio::SeqIO->new(-file     =>
> test_input_file('fastq','test3_illumina.fastq'),
>                            -variant  => 'illumina',
>                            -format   => 'fastq');
>
> $qual = $in_qual->next_dataset();
>
> isa_ok($qual, 'HASH');
> is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA');
> is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
> is($qual->{-id}, 'FC12044_91407_8_200_406_24');
> is($qual->{-desc}, '');
> is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');
> is(join(',',@{$qual->{-qual}}[0..10]), '19,24,24,20,24,24,24,24,24,24,24');
> -----------------------------------------
>
> So one could check those values directly and then filter them through as
> needed directly into Bio::Seq::Quality if necessary (note some of the key
> values are constructor args):
>
> my $qualobj = Bio::Seq::Quality->new(%$qual);
>
> chris
>