[BioRuby] Benchmarking FASTA file parsing

Martin Asser Hansen mail at maasha.dk
Fri Aug 13 08:25:46 EDT 2010


Hello,


I am new to Ruby and have been testing bioruby (1.4.0) for parsing FASTA
files. A rough comparison with Perl indicated that the bioruby parser was
slow, so I hacked together a parser of my own in Ruby to benchmark against
it. The result is disappointing: my hack is roughly 3x faster. Admittedly,
my hack should probably do some format consistency checking, but I expect
that to cost only a few percent in speed.

Could someone explain why the bioruby parser is so slow?

Is it possible to optimize the code without major rewriting?

Here is the benchmark result:

           user     system      total        real
Hack   5.440000   0.010000   5.450000 (  5.494207)
Bio   18.410000   0.020000  18.430000 ( 18.579867)


The code is shown below.

Cheers,


Martin

#!/usr/bin/env ruby

require 'stringio'
require 'bio'
require 'benchmark'

class Fasta
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    while entry = get_entry do
      yield entry
    end
  end

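  # Read one record: everything up to the next "\n>" separator (or EOF).
  def get_entry
    block = @io.gets("\n>")
    return nil if block.nil?

    block.chomp!("\n>")          # drop the trailing record separator
    block.sub!( /^\s|^>/, "")    # drop the leading ">" of the first record

    (seq_name, seq) = block.split("\n", 2)
    seq.gsub!(/\s/, "")          # join sequence lines (no check for a missing sequence)

    entry = {}
    entry[:seq_name] = seq_name
    entry[:seq]      = seq
    entry
  end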
end

data  = <<DATA
>5_gECOjxwXsN1/1
AACGNTACTATCGTGACATGCGTGCAGGATTACAC
>3_8ICOjxwXsN1/1
ACTCNAGGGTTCGATTCCCTTCAACCGCCCCATAA
>3_GUCOjxwXsN1/1
TTGCNTCCTTCTTCTGCCTTCGTTGGCTCAGATTG
>5_BWCOjxwXsN1/1
TATATACAGGAATCCATTGTTGTTTAGATTCAGTT
>7_NZCOjxwXsN1/1
AGGTGATCCAGCCGCACCTTCCGATACGGCTACCT
>3_2VCOjxwXsN1/1
CTTTTCCAGGTGTGTAGACATCTTCACCCATTAAG
>5_kVCOjxwXsN1/1
CTACACCTAAGTTACATCGTCCATTATTTTCCAAT
>1_GbCOjxwXsN1/1
CCAGACAACTAGGATGTTGGCTTAGAAGCAGCCAT
>5_fTCOjxwXsN1/1
TTAGCTTTAACCATTTTCTTTTTGTCTAAAGCAAA
>3_VWCOjxwXsN1/1
TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
DATA

io1    = StringIO.new(data)
io2    = StringIO.new(data)
fasta1 = Fasta.new(io1)
fasta2 = Bio::FastaFormat.open(io2)

Benchmark.bm(5) do |timer|
  timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
  timer.report('Bio')  { 10_000_000.times { fasta2.each { |entry2| } } }
end
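
One caveat about the benchmark above: a StringIO is left at end-of-file after the first full pass, so unless the stream is rewound, every later call to each on the hand-written parser returns almost immediately, and the timing mostly reflects iterating over an exhausted stream. The same is presumably true of the bioruby object, though that is an assumption and not verified here. Below is a minimal sketch of a variant that rewinds before each pass. It uses only the hand-written parser so that it runs without bioruby installed; the class name SimpleFasta, the helper count_entries, the two-record test data and the pass count are all illustrative and not part of the original script.

```ruby
require 'stringio'
require 'benchmark'

# Same parsing approach as the Fasta class above, condensed for illustration.
class SimpleFasta
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    while (entry = get_entry)
      yield entry
    end
  end

  def get_entry
    block = @io.gets("\n>")
    return nil if block.nil?

    block.chomp!("\n>")
    block.sub!(/\A>/, "")
    seq_name, seq = block.split("\n", 2)
    { seq_name: seq_name, seq: seq.to_s.gsub(/\s/, "") }
  end
end

# Hypothetical helper: count the entries yielded by one call to each.
def count_entries(fasta)
  n = 0
  fasta.each { n += 1 }
  n
end

data  = ">id1\nAACG\n>id2\nTTGC\n"
io    = StringIO.new(data)
fasta = SimpleFasta.new(io)

first_pass  = count_entries(fasta)   # parses the records
second_pass = count_entries(fasta)   # stream is at EOF, nothing left to parse
io.rewind
third_pass  = count_entries(fasta)   # parses the records again

Benchmark.bm(5) do |timer|
  timer.report('Hack') do
    10_000.times do
      io.rewind                      # start every pass from the beginning
      fasta.each { |entry| }
    end
  end
end
```

With the rewind in place, each pass parses the full data set, so the pass count can be much smaller than 10_000_000 and the numbers would not be directly comparable to the ones reported above.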


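The format consistency checking mentioned earlier in the message could look something like the sketch below. The method name parse_fasta_block, the error messages, and the choice of what counts as malformed (empty header, missing sequence, characters outside an assumed alphabet) are assumptions for illustration, not rules taken from bioruby.

```ruby
# Hypothetical helper: parse one FASTA record (header line without the
# leading ">", followed by sequence lines) and raise on malformed input.
def parse_fasta_block(block)
  seq_name, seq = block.split("\n", 2)
  raise ArgumentError, "missing header" if seq_name.nil? || seq_name.strip.empty?
  raise ArgumentError, "missing sequence for #{seq_name}" if seq.nil?

  seq = seq.gsub(/\s/, "")
  raise ArgumentError, "empty sequence for #{seq_name}" if seq.empty?
  # Assumed alphabet: letters plus "*" and "-"; adjust as needed.
  raise ArgumentError, "invalid characters in #{seq_name}" unless seq =~ /\A[A-Za-z*\-]+\z/

  { seq_name: seq_name, seq: seq }
end

entry = parse_fasta_block("5_gECOjxwXsN1/1\nAACGNTACTATCG\n")
```

Checks of this kind add a few comparisons and one regular expression match per record, so their cost relative to the parsing itself would need to be measured, not assumed.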