[Bioperl-l] Next-gen modules

Fri Jul 3 13:34:25 EDT 2009

No problem.  Scary to see that creating an instance is 2-4x slower  
than a simple hash ref.  Not sure there is an easy way around that;  
maybe we need a direct_new?

The next step is to ensure this works cross-platform and get indexing  
(via Bio::Index::Fastq) optimized.  Would be nice to get output  
working with the hash refs as well.

chris

On Jul 3, 2009, at 10:35 AM, Giles Weaver wrote:

> Chris, I've just tested your Illumina/Solexa fastq parsing code and  
> am pleased to report that I haven't encountered any issues thus far.
>
> To give an idea of the processing overhead of object instantiation,  
> fastq parsing performance on a lowly 3GHz Core 2 Duo (using one  
> core) is as follows:
> Illumina fastq with next_dataset: ~1 million sequences/minute
> Solexa fastq with next_dataset: ~500000 sequences/minute
> Illumina fastq with next_seq: ~215000 sequences/minute
> Solexa fastq with next_seq: ~175000 sequences/minute
>
> My quality trimming script does about 300000 sequences/minute with  
> next_dataset, up from ~130000 sequences/minute with next_seq, so it  
> shaves hours off the run time, thanks!
>
> Giles
>
> 2009/7/1 Chris Fields <cjfields at illinois.edu>
> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:
>
> ...
>
>
> Peter, the trimming algorithm I use employs a sliding window, as  
> follows:
>
>  - For each sequence position calculate the mean phred quality score  
> for a
>  window around that position.
>  - Record whether the mean score is above or below a threshold as an  
> array
>  of zeros and ones.
>  - Use a regular expression on the joined array to find the start  
> and end
>  of the good quality sequence(s).
>  - Extract the quality sequence(s) and replace any bases below the  
> quality
>  threshold with N.
>  - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the  
> window, but
> this could give a performance hit, and the method seems to work well  
> enough
> as is.
>
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!
>
> Just let me know if the qual values match up with what is expected.   
> You can also iterate through the data with hashrefs using  
> next_dataset (faster than objects).  This is from the fastq tests in  
> core:
>
> -----------------------------------------
> $in_qual  = Bio::SeqIO->new(-file     =>  
> test_input_file('fastq','test3_illumina.fastq'),
>                            -variant  => 'illumina',
>                            -format   => 'fastq');
>
> $qual = $in_qual->next_dataset();
>
> isa_ok($qual, 'HASH');
> is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA');
> is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
> is($qual->{-id}, 'FC12044_91407_8_200_406_24');
> is($qual->{-desc}, '');
> is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');
> is(join(',',@{$qual->{-qual}}[0..10]),  
> '19,24,24,20,24,24,24,24,24,24,24');
> -----------------------------------------
>
> So one could check those values directly and then filter them  
> through as needed directly into Bio::Seq::Quality if necessary (note  
> some of the key values are constructor args):
>
> my $qualobj = Bio::Seq::Quality->new(%$qual);
>
> chris
>