[BioRuby] Benchmarking FASTA file parsing

Fri Aug 13 14:51:43 UTC 2010

>
>
> As you stated 3 times faster with the hack, you may be already using ruby
> 1.9.
>
>
I am using Ruby 1.9.1 on a fairly fast computer, but what I am actually
questioning is the quality of the code.

> Anyway, I think 13 or 18 seconds for 100 M entry is fast enough and this
> part will not be the bottle neck of any application.
> How fast do you need it be?
>

Mind you, the benchmark is performed on StringIO data, so the script does
not touch the disk! In a real test, it will be much slower! I did not test
on real data, and more speed issues may surface (I have no idea how Ruby's
file buffering compares to Perl's, performance-wise).
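
One way to see how much the StringIO numbers understate the real cost would
be to time the same loop against an in-memory buffer and against a file. The
sketch below is only an illustration: count_entries and the synthetic
records are made up here and are not part of BioRuby or Biopieces.

```ruby
require "benchmark"
require "stringio"
require "tempfile"

# Hypothetical helper: counts FASTA records by reading blocks that end at
# a newline followed by ">" (or at end of input).
def count_entries(io)
  count = 0
  while io.gets("\n>")
    count += 1
  end
  count
end

# Synthetic data, invented purely for this illustration.
data = (1..10_000).map { |i| ">seq#{i}\nACGTACGTAC\nGGTTAACCGG\n" }.join

memory_count = nil
disk_count   = nil

Tempfile.create("bench_fasta") do |file|
  file.write(data)
  file.flush

  Benchmark.bm(10) do |bm|
    bm.report("StringIO:") { memory_count = count_entries(StringIO.new(data)) }
    bm.report("File:") do
      File.open(file.path) { |io| disk_count = count_entries(io) }
    end
  end
end
```

Note that a freshly written temp file will usually still sit in the
operating system's cache, so the File timing is likely optimistic compared
with a cold read of a large file.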

I was contemplating porting some Biopieces (www.biopieces.org) from Perl to
Ruby. Biopieces are used for everyday slicing and dicing of all sorts of
biological data in a very simple and flexible manner. While Biopieces are
not as fast as dedicated scripts, they are fast enough for convenient
analysis of NGS data, but I will not accept a +300% speed penalty (i.e.
read_fasta).

I have been trying to get an overview of the code in Bio::FastaFormat, but I
find it hard to read (that could be because I am not used to Ruby, or OO for
that matter). It strikes me that the FastaFormat class does a number of
irrelevant things, like subparsing comments when not strictly necessary. In
fact, the FASTA format does not actually use comments prefixed with #
(a semicolon can be used, but I would strongly advise against it since most
software does not deal with it). Also, parsing depends on the record
separator being '\n', which could be considered a bug. There seems to be an
overuse of substitutions, transliterations and regex matching. How about
keeping it nice and tight? A la:

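SEP         = $/
# /m lets "." match newlines so a multi-line sequence is captured whole.
FASTA_REGEX = /\A\s*>?([^#{SEP}]+)#{SEP}(.*?)>?\s*\z/m

# Reads the next entry from @io; returns [seq_name, seq], or nil at EOF.
def get_entry
  block = @io.gets(SEP + ">")
  return nil if block.nil?

  if block =~ FASTA_REGEX
    seq_name = $1
    seq      = $2
  else
    raise "Bad FASTA entry->#{block}"
  end

  # gsub rather than gsub!, which returns nil when nothing is replaced.
  [seq_name, seq.gsub(/\s/, "")]
end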
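
To try the get_entry idea on its own, it can be wrapped in a small class
and fed from a StringIO. This is a minimal sketch, not BioRuby code: the
class name FastaReader and the each method are hypothetical, and the regex
is assumed to be anchored and multiline so that a sequence spanning several
lines is captured in one go.

```ruby
require "stringio"

# Hypothetical wrapper class; FastaReader is not part of BioRuby.
class FastaReader
  SEP         = $/
  FASTA_REGEX = /\A\s*>?([^#{SEP}]+)#{SEP}(.*?)>?\s*\z/m

  def initialize(io)
    @io = io
  end

  # Returns [seq_name, seq] for the next entry, or nil at end of input.
  def get_entry
    block = @io.gets(SEP + ">")
    return nil if block.nil?

    match = FASTA_REGEX.match(block)
    raise "Bad FASTA entry->#{block}" if match.nil?

    # Strip line breaks and any other whitespace from the sequence.
    [match[1], match[2].gsub(/\s/, "")]
  end

  # Yields each entry in turn until the input is exhausted.
  def each
    while (entry = get_entry)
      yield entry
    end
  end
end

data    = ">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n"
entries = []
FastaReader.new(StringIO.new(data)).each { |name, seq| entries << [name, seq] }
```

Each block is read once and matched once, with a single substitution to
remove whitespace from the sequence.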

Cheers,

Martin

> --
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>