[BioRuby] Benchmarking FASTA file parsing

Tomoaki NISHIYAMA tomoakin at kenroku.kanazawa-u.ac.jp
Sat Aug 14 10:52:57 EDT 2010


Hi,

> Subparsing of the defline should not be done for generic parsing,  
> but rather when needed.


To my understanding, the subparsing of the definition occurs only
when needed, ie when entry_id, identifiers, gi, etc. is called, in  
current code.
If only definition is called, it is not further parsed.

>  Without any experience I think that disabling GC sounds like a bad  
> idea.

Yes, completely disabling GC is generally a bad idea.
A code running with 6 Gbytes mem could eat 60 Gbytes or more...
(Yes it seems two or three-fold faster if there is enough memory,
but this trade-off is too extreme).

But since the GC dominates the running time,
it is an important target for optimization.
http://en.wikibooks.org/wiki/Ruby_Programming/Reference/Objects/GC

A more moderate reduction of GC frequency will surely speedup the  
process 30~50%.
Admittedly, explicit GC.disable, GC.start make the code ugly.
Trial on tweaking the parameters in gc.c did only a minor (~5%)  
improvement.

Careful coding to reduce object creation might contribute to speed up.
One of questionable variable is
@entry_overrun
Is this variable and attr_reader :entry_overrun
really required yet or is just a trace of older code? > Goto-San

Since there is only two other variables, which is apparently essential,
this third variable might account significant speed reduction.
A tests suggested again removing 3 lines can improve 5%.  
(Unfortunately not 50%)

diff --git a/lib/bio/db/fasta.rb b/lib/bio/db/fasta.rb
index 7ea668e..95f3be4 100644
--- a/lib/bio/db/fasta.rb
+++ b/lib/bio/db/fasta.rb
@@ -111,7 +111,7 @@ module Bio
      # The seuqnce lines in text.
      attr_accessor :data

-    attr_reader :entry_overrun
+#    attr_reader :entry_overrun

      # Stores the comment and sequence information from one entry of  
the
      # FASTA format string.  If the argument contains more than one
@@ -119,8 +119,8 @@ module Bio
      def initialize(str)
        @definition = str[/.*/].sub(/^>/, '').strip      # 1st line
        @data = str.sub(/.*/, '')                                # rests
-      @data.sub!(/^>.*/m, '')  # remove trailing entries for sure
-      @entry_overrun = $&
+#      @data.sub!(/^>.*/m, '') # remove trailing entries for sure
+#      @entry_overrun = $&
      end

      # Returns the stored one entry as a FASTA format. (same as to_s)
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2010/08/14, at 17:21, Martin Asser Hansen wrote:

> I was hoping for an easy to use generic FASTA parser in bioruby. I  
> think it would be confusing with two flavors of parsers for short/ 
> long entries. Also, I think that with a minor effort the existing  
> parser could be optimized a fair bit. Subparsing of the defline  
> should not be done for generic parsing, but rather when needed.  
> Without any experience I think that disabling GC sounds like a bad  
> idea. Of cause C is always faster, but Ruby is nicer.
>
>
> Cheers,
>
>
> Martin
>
>
>
>
> On Sat, Aug 14, 2010 at 5:42 AM, Tomoaki NISHIYAMA  
> <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> Hi,
>
>> Mind you that the Benchmark is performed on StringIO data, and  
>> that the script does not touch the disk!
>> In a real test, it will be much slower!
>
> My initial thought was :- That's true, and therefore the pure  
> parser part which runs fairly fast with O(N)
> is not the primary problem.  If you push the entries into a hash it  
> will be much more time consuming.
>
> But realized that its two orders slower... (due to the benchmark  
> code as pointed out by goto-san)
> 20 min for 100 M could be painful.
>
>> I have been trying to get an overview of the code in  
>> Bio::FastaFormat,
>> but I find it hard to read (that could be because I am not used to  
>> Ruby, or OO for that matter).
>
>
> For one thing, the Bio::FastaFormat is designed to work with  
> Bio::FlatFile.
> If you write a dedicated fasta parser that could run much faster.
>
> # I would write C codes for a very simple operation on NGS data.
> # That will run 100 times faster.
> # When the necessary operation is a bit more complex, I would use  
> ruby. much much more time consuming....
>
> Perhaps the target is to process about 20 ~ 1000 M reads with each of
> them having 25 to 150 nt for the time being.
> Thats quite different situation compared to process the
> ~ 0.1 M entry of 50-10000 aa residues or nucleotides in a genome.
> The relative cost for the entry separation becomes higher compared  
> with the sequence
> processing within the entry.
>
> So, it may worth to write NGS dedicated parser rather than sticking  
> on FlatFile.
>
> Playing around the benchmark, about the half of execution time is  
> for garbage collection,
> and the order of execution is somewhat relevant to get the number.
> If you can suppress unnecessary object generation to the minimum  
> and disable GC, that will
> perhaps make it run much faster.
>
> $ diff -u benchfasta benchfasta-hash-GC-b
> --- benchfasta  2010-08-13 21:45:21.000000000 +0900
> +++ benchfasta-hash-GC-b        2010-08-14 11:53:20.000000000 +0900
> @@ -34,6 +34,9 @@
>    end
>  end
>
> +count = ARGV.shift.to_i
> +count = 2 if count == nil
> +
>  data  = <<DATA
>  >5_gECOjxwXsN1/1
>  AACGNTACTATCGTGACATGCGTGCAGGATTACAC
> @@ -57,12 +60,23 @@
>  TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
>  DATA
>
> -io1    = StringIO.new(data)
> -io2    = StringIO.new(data)
> +io0    = StringIO.new(data * count)
> +io1    = StringIO.new(data * count)
> +io2    = StringIO.new(data * count)
> +fasta0 = Fasta.new(io0)
>  fasta1 = Fasta.new(io1)
>  fasta2 = Bio::FastaFormat.open(io2)
>
> -Benchmark.bm(5) do |timer|
> -  timer.report('Hack') { 10_000_000.times { fasta1.each { | 
> entry1| } } }
> -  timer.report('Bio')  { 10_000_000.times { fasta2.each { | 
> entry2| } } }
> +hash0=Hash.new
> +hash1=Hash.new
> +hash2=Hash.new
> +
> +Benchmark.bm(8) do |timer|
> +  GC.enable;GC.start;GC.disable;
> +  timer.report('Bio')  { i=0; fasta2.each { |entry2| i+=1; hash2 
> [entry2.definition + i.to_s]  = entry2.seq[2..25]}  }
> +  hash2 = nil; GC.enable;GC.start;GC.disable;
> +  timer.report('Hack') { i=0;  fasta0.each { |entry1| i+=1; hash0 
> [entry1[:seq_name] + i.to_s] = entry1[:seq][2..25]}  }
> +  hash0 = nil; GC.enable;GC.start;GC.disable;
> +  timer.report('Hack-seq') { i=0; fasta1.each { |entry1| i+=1;  
> hash1[entry1[:seq_name] + i.to_s] = Bio::Sequence::NA.new(entry1 
> [:seq])[2..25]}  }
> +  hash1 = nil; GC.enable;GC.start;GC.disable;
>  end
>
>
>
>
>
>
> -- 
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2010/08/13, at 23:51, Martin Asser Hansen wrote:
>
>>
>> As you stated 3 times faster with the hack, you may be already  
>> using ruby 1.9.
>>
>>
>> I am using ruby 1.9.1, and I am using a fairly fast computer, but  
>> I am actually questioning the quality of the code.
>>
>> Anyway, I think 13 or 18 seconds for 100 M entry is fast enough  
>> and this
>> part will not be the bottle neck of any application.
>> How fast do you need it be?
>>
>> Mind you that the Benchmark is performed on StringIO data, and  
>> that the script does not touch the disk! In a real test, it will  
>> be much slower! I did not test on real data and more speed issues  
>> may surface (I have no idea how Ruby's file buffering compares to  
>> Perl's, performance-wise).
>>
>> I was contemplating porting some Biopieces (www.biopieces.org)  
>> from Perl to Ruby. Biopieces are used for everyday slicing and  
>> dicing of all sorts of biological data in a very simple and  
>> flexible manner. While Biopieces are not as fast as dedicated  
>> scripts, they are fast enough for convenient analysis of NGS data,  
>> but I will not accept a +300% speed penalty (i.e. read_fasta).
>>
>> I have been trying to get an overview of the code in  
>> Bio::FastaFormat, but I find it hard to read (that could be  
>> because I am not used to Ruby, or OO for that matter). It strikes  
>> me that the FastaFormat class does a number of irrelevant things  
>> like subparsing comments when not strictly necessary. In fact, the  
>> FASTA format actually don't use comments prefixed with #  
>> (semicolon can be used, but I will strongly advice against it  
>> since most software don't deal with it). Also, parsing is  
>> dependent on the record separator being '\n' - that could be  
>> considered a bug. There seem to be an overuse of substitutions,  
>> transliterations and regex matching. How about keeping it nice an  
>> tight? ala:
>>
>> SEP         = $/
>> FASTA_REGEX = /\s*>?([^#{SEP}]+)#{SEP}(.+)>?$/
>>
>> def get_entry
>>   block = @io.gets(SEP + ">")
>>   return nil if block.nil?
>>
>>   if block =~ FASTA_REGEX
>>     seq_name = $1
>>     seq      = $2
>>   else
>>     raise "Bad FASTA entry->#{block}"
>>   end
>>
>>   seq.gsub!(/\s/, "")
>> end
>>
>>
>> Cheers,
>>
>>
>> Martin
>>
>> -- 
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>>
>
>



More information about the BioRuby mailing list