[BioRuby] Fastq performances

Mon Mar 28 15:49:37 UTC 2011

Always testing :-)

I have a FASTQ file of about 95 MB, containing 514299 sequences.
I ran a trimming procedure over all the sequences, but I have a performance issue: it takes too long.
The code scans each read to identify a particular sequence and trims it. Then I write all the input sequences to the output, trimmed or not.
I found that the bottleneck is the conversion step, when I need to write out the sequences.

Total number of sequences 514299.

THIS TASK IS USING FULL CONVERSION
if read.to_trim?
  # trim the sequence and quality strings, then create a new fastq object
  (new fastq obj).to_biosequence.output(:fastq_illumina)
else
  read.to_biosequence.output(:fastq_illumina)
end

MacBook-Pro-RaoulB:bioruby-ngs bonnalraoul$ time ./bin/biongs convert:illumina:fastq:trim_b /Users/bonnalraoul/Desktop/s_1_1_1108_qseq.fastq
WARNING: no program is associated with BCLQSEQ task, does not make sense to create a thor task.

real	2m1.749s
user	1m58.079s
sys	0m3.038s

THIS TASK IS USING RAW CONVERSION (formatting fastq sequence on the fly)
Same as above, but I don't create the biosequence object.
MacBook-Pro-RaoulB:bioruby-ngs bonnalraoul$ time ./bin/biongs convert:illumina:fastq:trim_b /Users/bonnalraoul/Desktop/s_1_1_1108_qseq.fastq
WARNING: no program is associated with BCLQSEQ task, does not make sense to create a thor task.

real	0m43.546s
user	0m41.611s
sys	0m1.133s

The difference in time is huge, considering that this is a tiny dataset. I can gain another 10 seconds if I don't wrap the output string at 70 characters (see the note below).
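
For reference, the raw conversion path could be sketched roughly as follows. This is only an illustration in plain Ruby, not the code from bioruby-ngs: the helper names trim_trailing_b and format_fastq are hypothetical, and the trimming rule (dropping a trailing run of 'B' quality characters) is an assumption based on the task name trim_b.

```ruby
# Hypothetical helpers, not part of BioRuby or bioruby-ngs.

# Drop the trailing run of 'B' quality characters (an assumption about
# what the trim_b task does) and cut the sequence to the same length.
def trim_trailing_b(seq, qual)
  keep = qual.sub(/B+\z/, '').length
  [seq[0, keep], qual[0, keep]]
end

# Build a four-line FASTQ record as a plain string, with no line
# wrapping and no intermediate sequence object.
def format_fastq(id, seq, qual)
  "@#{id}\n#{seq}\n+\n#{qual}\n"
end

seq, qual = trim_trailing_b('ACGTACGT', 'hhhhhBBB')
record = format_fastq('read_1', seq, qual)
puts record
```

The point of the sketch is that each record is assembled with a single string interpolation, so no per-read object conversion is needed when writing the output.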

Note: the output of to_biosequence.output(:fastq_illumina) is not identical to the input (which is also Illumina FASTQ):
the sequence (both NA and quality) is wrapped at 70 characters and the header is repeated. Is this my fault, somewhere in my code? I'll put the code on GitHub asap.
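
To make the difference concrete, here is a small plain-Ruby sketch that compares a wrapped record with an unwrapped one. The wrap helper is hypothetical, and I am assuming that "the header is repeated" means the read id appears again on the '+' line. It only imitates the shape of the output described above, not BioRuby's actual formatter.

```ruby
# Hypothetical helper: break a string into lines of at most `width` characters.
def wrap(str, width = 70)
  str.scan(/.{1,#{width}}/).join("\n")
end

seq  = 'A' * 100
qual = 'h' * 100

# Wrapped at 70 characters, with the id repeated on the '+' line (assumed).
wrapped   = "@read_1\n#{wrap(seq)}\n+read_1\n#{wrap(qual)}\n"
# One line per field, as in the original Illumina input.
unwrapped = "@read_1\n#{seq}\n+\n#{qual}\n"

puts wrapped.lines.size    # 6 lines for a 100-base read
puts unwrapped.lines.size  # always 4 lines
```

With wrapping, the number of lines per record depends on the read length, so a simple four-lines-per-record reader can no longer be used on the output.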

--
Ra

linkedin: http://it.linkedin.com/in/raoulbonnal
twitter: http://twitter.com/ilpuccio
skype: ilpuccio
irc.freenode.net: Helius
github: https://github.com/helios