[BioRuby] Benchmarking FASTA file parsing

Sat Aug 14 08:21:39 UTC 2010

I was hoping for an easy to use generic FASTA parser in bioruby. I think it
would be confusing with two flavors of parsers for short/long entries. Also,
I think that with a minor effort the existing parser could be optimized a
fair bit. Subparsing of the defline should not be done for generic parsing,
but rather when needed. Without any experience I think that disabling GC
sounds like a bad idea. Of cause C is always faster, but Ruby is nicer.

Cheers,

Martin

On Sat, Aug 14, 2010 at 5:42 AM, Tomoaki NISHIYAMA <
tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> Hi,
>
> Mind you that the Benchmark is performed on StringIO data, and that the
> script does not touch the disk!
>
> In a real test, it will be much slower!
>
>
> My initial thought was :- That's true, and therefore the pure parser part
> which runs fairly fast with O(N)
> is not the primary problem.  If you push the entries into a hash it will be
> much more time consuming.
>
> But realized that its two orders slower... (due to the benchmark code as
> pointed out by goto-san)
> 20 min for 100 M could be painful.
>
> I have been trying to get an overview of the code in Bio::FastaFormat,
>
> but I find it hard to read (that could be because I am not used to Ruby, or
> OO for that matter).
>
>
> For one thing, the Bio::FastaFormat is designed to work with Bio::FlatFile.
> If you write a dedicated fasta parser that could run much faster.
>
>
> # I would write C codes for a very simple operation on NGS data.
>
> # That will run 100 times faster.
>
> # When the necessary operation is a bit more complex, I would use ruby.
> much much more time consuming....
>
>
> Perhaps the target is to process about 20 ~ 1000 M reads with each of
>
> them having 25 to 150 nt for the time being.
>
> Thats quite different situation compared to process the
>
> ~ 0.1 M entry of 50-10000 aa residues or nucleotides in a genome.
>
> The relative cost for the entry separation becomes higher compared with the
> sequence
>
> processing within the entry.
>
>
> So, it may worth to write NGS dedicated parser rather than sticking on
> FlatFile.
>
>
> Playing around the benchmark, about the half of execution time is for
> garbage collection,
>
> and the order of execution is somewhat relevant to get the number.
>
> If you can suppress unnecessary object generation to the minimum and
> disable GC, that will
>
> perhaps make it run much faster.
>
>
> $ diff -u benchfasta benchfasta-hash-GC-b
> --- benchfasta  2010-08-13 21:45:21.000000000 +0900
> +++ benchfasta-hash-GC-b        2010-08-14 11:53:20.000000000 +0900
> @@ -34,6 +34,9 @@
>    end
>  end
>
> +count = ARGV.shift.to_i
> +count = 2 if count == nil
> +
>  data  = <<DATA
>  >5_gECOjxwXsN1/1
>  AACGNTACTATCGTGACATGCGTGCAGGATTACAC
> @@ -57,12 +60,23 @@
>  TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
>  DATA
>
> -io1    = StringIO.new(data)
> -io2    = StringIO.new(data)
> +io0    = StringIO.new(data * count)
> +io1    = StringIO.new(data * count)
> +io2    = StringIO.new(data * count)
> +fasta0 = Fasta.new(io0)
>  fasta1 = Fasta.new(io1)
>  fasta2 = Bio::FastaFormat.open(io2)
>
> -Benchmark.bm(5) do |timer|
> -  timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
> -  timer.report('Bio')  { 10_000_000.times { fasta2.each { |entry2| } } }
> +hash0=Hash.new
> +hash1=Hash.new
> +hash2=Hash.new
> +
> +Benchmark.bm(8) do |timer|
> +  GC.enable;GC.start;GC.disable;
> +  timer.report('Bio')  { i=0; fasta2.each { |entry2| i+=1;
> hash2[entry2.definition + i.to_s]  = entry2.seq[2..25]}  }
> +  hash2 = nil; GC.enable;GC.start;GC.disable;
> +  timer.report('Hack') { i=0;  fasta0.each { |entry1| i+=1;
> hash0[entry1[:seq_name] + i.to_s] = entry1[:seq][2..25]}  }
> +  hash0 = nil; GC.enable;GC.start;GC.disable;
> +  timer.report('Hack-seq') { i=0; fasta1.each { |entry1| i+=1;
> hash1[entry1[:seq_name] + i.to_s] =
> Bio::Sequence::NA.new(entry1[:seq])[2..25]}  }
> +  hash1 = nil; GC.enable;GC.start;GC.disable;
>  end
>
>
>
>
>
> --
>
> Tomoaki NISHIYAMA
>
>
> Advanced Science Research Center,
>
> Kanazawa University,
>
> 13-1 Takara-machi,
>
> Kanazawa, 920-0934, Japan
>
>
> On 2010/08/13, at 23:51, Martin Asser Hansen wrote:
>
>
>> As you stated 3 times faster with the hack, you may be already using ruby
>> 1.9.
>>
>>
> I am using ruby 1.9.1, and I am using a fairly fast computer, but I am
> actually questioning the quality of the code.
>
>
>> Anyway, I think 13 or 18 seconds for 100 M entry is fast enough and this
>> part will not be the bottle neck of any application.
>> How fast do you need it be?
>>
>
> Mind you that the Benchmark is performed on StringIO data, and that the
> script does not touch the disk! In a real test, it will be much slower! I
> did not test on real data and more speed issues may surface (I have no idea
> how Ruby's file buffering compares to Perl's, performance-wise).
>
> I was contemplating porting some Biopieces (www.biopieces.org) from Perl
> to Ruby. Biopieces are used for everyday slicing and dicing of all sorts of
> biological data in a very simple and flexible manner. While Biopieces are
> not as fast as dedicated scripts, they are fast enough
> for convenient analysis of NGS data, but I will not accept a +300% speed
> penalty (i.e. read_fasta).
>
> I have been trying to get an overview of the code in Bio::FastaFormat, but
> I find it hard to read (that could be because I am not used to Ruby, or OO
> for that matter). It strikes me that the FastaFormat class does a number of
> irrelevant things like subparsing comments when not strictly necessary. In
> fact, the FASTA format actually don't use comments prefixed with #
> (semicolon can be used, but I will strongly advice against it since most
> software don't deal with it). Also, parsing is dependent on the record
> separator being '\n' - that could be considered a bug. There seem to be an
> overuse of substitutions, transliterations and regex matching. How about
> keeping it nice an tight? ala:
>
> SEP         = $/
> FASTA_REGEX = /\s*>?([^#{SEP}]+)#{SEP}(.+)>?$/
>
> def get_entry
>   block = @io.gets(SEP + ">")
>   return nil if block.nil?
>
>   if block =~ FASTA_REGEX
>     seq_name = $1
>     seq      = $2
>   else
>     raise "Bad FASTA entry->#{block}"
>   end
>
>   seq.gsub!(/\s/, "")
> end
>
>
> Cheers,
>
>
> Martin
>
>
>>  --
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>>
>
>