[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Fri Jul 3 17:34:25 UTC 2009
No problem. Scary to see that creating an instance is 2-4x slower
than a simple hash ref. Not sure there is an easy way around that;
maybe we need a direct_new?
The next step is to ensure this works cross-platform and get indexing
(via Bio::Index::Fastq) optimized. Would be nice to get output
working with the hash refs as well.
chris
On Jul 3, 2009, at 10:35 AM, Giles Weaver wrote:
> Chris, I've just tested your Illumina/Solexa fastq parsing code and
> am pleased to report that I haven't encountered any issues thus far.
>
> To give an idea of the processing overhead of object instantiation,
> fastq parsing performance on a lowly 3GHz Core 2 Duo (using one
> core) is as follows:
> Illumina fastq with next_dataset: ~1 million sequences/minute
> Solexa fastq with next_dataset: ~500000 sequences/minute
> Illumina fastq with next_seq: ~215000 sequences/minute
> Solexa fastq with next_seq: ~175000 sequences/minute
>
> My quality trimming script does about 300000 sequences/minute with
> next_dataset, up from ~130000 sequences/minute with next_seq, so it
> shaves hours off the run time, thanks!
>
> Giles
>
> 2009/7/1 Chris Fields <cjfields at illinois.edu>
> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:
>
> ...
>
>
> Peter, the trimming algorithm I use employs a sliding window, as
> follows:
>
> - For each sequence position calculate the mean phred quality score
> for a
> window around that position.
> - Record whether the mean score is above or below a threshold as an
> array
> of zeros and ones.
> - Use a regular expression on the joined array to find the start
> and end
> of the good quality sequence(s).
> - Extract the quality sequence(s) and replace any bases below the
> quality
> threshold with N.
> - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the
> window, but
> this could give a performance hit, and the method seems to work well
> enough
> as is.
>
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!
>
> Just let me know if the qual values match up with what is expected.
> You can also iterate through the data with hashrefs using
> next_dataset (faster than objects). This is from the fastq tests in
> core:
>
> -----------------------------------------
> $in_qual = Bio::SeqIO->new(-file =>
> test_input_file('fastq','test3_illumina.fastq'),
> -variant => 'illumina',
> -format => 'fastq');
>
> $qual = $in_qual->next_dataset();
>
> isa_ok($qual, 'HASH');
> is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA');
> is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
> is($qual->{-id}, 'FC12044_91407_8_200_406_24');
> is($qual->{-desc}, '');
> is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');
> is(join(',',@{$qual->{-qual}}[0..10]),
> '19,24,24,20,24,24,24,24,24,24,24');
> -----------------------------------------
>
> So one could check those values directly and then filter them
> through as needed directly into Bio::Seq::Quality if necessary (note
> some of the key values are constructor args):
>
> my $qualobj = Bio::Seq::Quality->new(%$qual);
>
> chris
>
More information about the Bioperl-l
mailing list