[BioRuby] Benchmarking FASTA file parsing
Martin Asser Hansen
mail at maasha.dk
Fri Aug 13 12:25:46 UTC 2010
Hello,
I am new to Ruby and have been testing BioRuby (1.4.0) for parsing FASTA files. A
rough comparison with Perl indicated that the BioRuby parser was slow, so I
hacked together a parser of my own in Ruby to benchmark the BioRuby parser
against. The result is disappointing: my hack is roughly 3x faster.
Admittedly, my hack should probably do some format consistency checking,
but that should only take a few percent off the speed.
Could someone explain why the BioRuby parser is so slow?
Is it possible to optimize the code without major rewriting?
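The format consistency checking mentioned above could look roughly like the following. This is only a sketch under assumptions: `FastaFormatError` and `check_entry` are hypothetical names (neither is part of BioRuby), the record layout is the `:seq_name`/`:seq` hash used by the hack, and the set of legal residue symbols is a guess that may be too strict or too loose for real data.

```ruby
# Hypothetical error class for this sketch; not part of BioRuby.
class FastaFormatError < StandardError; end

# Hypothetical helper: validates one parsed record of the form
# { :seq_name => String, :seq => String } and returns it unchanged.
def check_entry(entry)
  name = entry[:seq_name]
  seq  = entry[:seq]

  raise FastaFormatError, "missing sequence name" if name.nil? || name.empty?
  raise FastaFormatError, "missing sequence for #{name}" if seq.nil? || seq.empty?

  # Assumption: letters plus '*' and '-' are the only legal residue symbols.
  unless seq =~ /\A[A-Za-z*\-]+\z/
    raise FastaFormatError, "illegal character in sequence #{name}"
  end

  entry
end
```

Such a check would be called on each record just before it is yielded, so its cost is paid once per entry.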
Here is the benchmark result:
            user     system      total        real
Hack    5.440000   0.010000   5.450000 (  5.494207)
Bio    18.410000   0.020000  18.430000 ( 18.579867)
The code is shown below.
Cheers,
Martin
#!/usr/bin/env ruby
require 'stringio'
require 'bio'
require 'benchmark'
class Fasta
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    while entry = get_entry do
      yield entry
    end
  end

  def get_entry
    block = @io.gets("\n>")
    return nil if block.nil?

    block.chomp!("\n>")
    block.sub!( /^\s|^>/, "")

    (seq_name, seq) = block.split("\n", 2)
    seq.gsub!(/\s/, "")

    entry = {}
    entry[:seq_name] = seq_name
    entry[:seq] = seq
    entry
  end
end
data = <<DATA
>5_gECOjxwXsN1/1
AACGNTACTATCGTGACATGCGTGCAGGATTACAC
>3_8ICOjxwXsN1/1
ACTCNAGGGTTCGATTCCCTTCAACCGCCCCATAA
>3_GUCOjxwXsN1/1
TTGCNTCCTTCTTCTGCCTTCGTTGGCTCAGATTG
>5_BWCOjxwXsN1/1
TATATACAGGAATCCATTGTTGTTTAGATTCAGTT
>7_NZCOjxwXsN1/1
AGGTGATCCAGCCGCACCTTCCGATACGGCTACCT
>3_2VCOjxwXsN1/1
CTTTTCCAGGTGTGTAGACATCTTCACCCATTAAG
>5_kVCOjxwXsN1/1
CTACACCTAAGTTACATCGTCCATTATTTTCCAAT
>1_GbCOjxwXsN1/1
CCAGACAACTAGGATGTTGGCTTAGAAGCAGCCAT
>5_fTCOjxwXsN1/1
TTAGCTTTAACCATTTTCTTTTTGTCTAAAGCAAA
>3_VWCOjxwXsN1/1
TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
DATA
io1 = StringIO.new(data)
io2 = StringIO.new(data)
fasta1 = Fasta.new(io1)
fasta2 = Bio::FastaFormat.open(io2)
Benchmark.bm(5) do |timer|
  timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
  timer.report('Bio') { 10_000_000.times { fasta2.each { |entry2| } } }
end
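
One caveat about the benchmark loop above: the StringIO is never rewound, and `StringIO#gets` returns nil once the stream is at EOF. For the 'Hack' case this means only the first pass parses any records, and the remaining iterations return immediately. Whether the BioRuby reader behaves the same way at EOF is not verified here, but it seems worth checking before comparing the two timings. The sketch below demonstrates the effect using a compact copy of the hack and two made-up records, then times a loop that rewinds before every pass. The iteration count `passes` is a hypothetical value chosen to keep the run short.

```ruby
require 'stringio'
require 'benchmark'

# Compact copy of the hand-written parser from the post.
class Fasta
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    while (entry = get_entry)
      yield entry
    end
  end

  def get_entry
    block = @io.gets("\n>")
    return nil if block.nil?

    block.chomp!("\n>")
    block.sub!(/^\s|^>/, "")

    seq_name, seq = block.split("\n", 2)
    seq.gsub!(/\s/, "")

    { :seq_name => seq_name, :seq => seq }
  end
end

# Two made-up records, for illustration only.
data  = ">id1\nAACGNTACTA\n>id2\nACTCNAGGGT\n"
io    = StringIO.new(data)
fasta = Fasta.new(io)

first_pass  = fasta.count   # parses both records
second_pass = fasta.count   # stream is at EOF, so nothing is parsed

io.rewind
after_rewind = fasta.count  # parses both records again

puts "first=#{first_pass} second=#{second_pass} after_rewind=#{after_rewind}"

# Rewinding inside the timed loop makes every iteration parse real data.
passes = 1_000  # hypothetical iteration count
Benchmark.bm(5) do |timer|
  timer.report('Hack') do
    passes.times do
      io.rewind
      fasta.each { |entry| }
    end
  end
end
```

The `puts` line should print `first=2 second=0 after_rewind=2`. A fair comparison would apply the same rewind (or reopen the input) for the BioRuby side on each pass.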