[BioRuby] Fastq performances
Raoul Bonnal
bonnal at ingm.org
Mon Mar 28 15:49:37 UTC 2011
Always testing :-)
I have a fastq file, about 95 MB and X sequences.
I ran a trimming procedure that scans all the sequences, but I have a performance issue: it takes too long.
The code scans each read to identify the sequences that need trimming and trims them. Then I write all the input sequences to the output.
I found that the bottleneck is the conversion step, when I need to write out the sequences.
Total number of sequences: 514299.
THIS TASK IS USING FULL CONVERSION
if read.to_trim?
  trim sequence and quality string, create a (new fastq obj).to_biosequence.output(:fastq_illumina)
else
  read.to_biosequence.output(:fastq_illumina)
end
MacBook-Pro-RaoulB:bioruby-ngs bonnalraoul$ time ./bin/biongs convert:illumina:fastq:trim_b /Users/bonnalraoul/Desktop/s_1_1_1108_qseq.fastq
WARNING: no program is associated with BCLQSEQ task, does not make sense to create a thor task.
real 2m1.749s
user 1m58.079s
sys 0m3.038s
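To make the "trim sequence and quality string" step concrete, here is a minimal sketch in plain Ruby. It is only a guess at what the task does: judging from the task name trim_b, it assumes that trimming means dropping a trailing run of 'B' quality characters and cutting the sequence to the same length. The helper names needs_trim? and trim_b_tail are hypothetical and are not part of BioRuby or bioruby-ngs.

```ruby
# Hypothetical helpers, not part of BioRuby or bioruby-ngs.
# Assumption: a read needs trimming when its quality string ends with 'B'.
def needs_trim?(qual)
  qual.end_with?("B")
end

# Drops the trailing run of 'B' quality characters and cuts the
# sequence to the same length, so both strings stay aligned.
def trim_b_tail(seq, qual)
  keep = qual.sub(/B+\z/, "").length
  [seq[0, keep], qual[0, keep]]
end

seq  = "ACGTACGT"
qual = "hhhhgBBB"
seq, qual = trim_b_tail(seq, qual) if needs_trim?(qual)
puts seq   # ACGTA
puts qual  # hhhhg
```

Working on the two strings directly means the trimming itself never needs an intermediate sequence object.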
THIS TASK IS USING RAW CONVERSION (formatting fastq sequence on the fly)
Same as above, but I don't create the biosequence object.
MacBook-Pro-RaoulB:bioruby-ngs bonnalraoul$ time ./bin/biongs convert:illumina:fastq:trim_b /Users/bonnalraoul/Desktop/s_1_1_1108_qseq.fastq
WARNING: no program is associated with BCLQSEQ task, does not make sense to create a thor task.
real 0m43.546s
user 0m41.611s
sys 0m1.133s
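For reference, "formatting on the fly" can be as small as the sketch below. It is not the code used in the task: format_fastq_raw is a hypothetical helper, and it assumes the header, sequence and quality are already available as plain strings, so no biosequence object is built.

```ruby
# Hypothetical helper, not the bioruby-ngs implementation.
# Builds one four-line FASTQ record directly from strings:
# no wrapping, and the header is not repeated on the '+' line.
def format_fastq_raw(header, seq, qual)
  "@#{header}\n#{seq}\n+\n#{qual}\n"
end

print format_fastq_raw("read_1", "ACGTACGT", "hhhhhhhh")
```

The record is a single string interpolation per read, which is presumably why this path is so much cheaper than converting each read first.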
The time difference is huge (about 2m02s versus 44s of wall-clock time), considering that this is a tiny dataset. I can gain another 10 seconds if I don't wrap the output string to 70 characters (see the note below).
Note: the output of to_biosequence.output(:fastq_illumina) is not identical to the input (which also comes from Illumina): both the nucleotide sequence and the quality string are wrapped to 70 characters, and the header is repeated. Is it my fault, in some part of my code? I'll put the code on GitHub asap.
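To illustrate the difference described in the note, the sketch below reproduces that layout in plain Ruby: lines wrapped at 70 characters and the header repeated after the '+'. It only mimics the observed output and is not BioRuby's formatter. The names wrap and format_fastq_wrapped are hypothetical.

```ruby
# Hypothetical helpers that mimic the wrapped layout; not BioRuby code.
# Splits a string into chunks of at most `width` characters.
def wrap(str, width = 70)
  str.scan(/.{1,#{width}}/).join("\n")
end

# Wrapped record: sequence and quality are folded at `width`
# and the header is repeated on the '+' line.
def format_fastq_wrapped(header, seq, qual, width = 70)
  "@#{header}\n#{wrap(seq, width)}\n+#{header}\n#{wrap(qual, width)}\n"
end

seq  = "A" * 100
qual = "h" * 100
record = format_fastq_wrapped("read_1", seq, qual)
puts record
puts record.lines.length  # 6 lines instead of 4
```

A 100-base read therefore takes six lines instead of four, and every record pays for the extra scan and join, which fits with the roughly 10 seconds saved when wrapping is skipped.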
--
Ra
linkedin: http://it.linkedin.com/in/raoulbonnal
twitter: http://twitter.com/ilpuccio
skype: ilpuccio
irc.freenode.net: Helius
github: https://github.com/helios