[BioRuby] Benchmarking FASTA file parsing

Fri Aug 13 14:46:20 UTC 2010

Hi,

Thank you for your interesting post. :-)

I used to love benchmarking bottlenecks in BioRuby.
Could you also try to compare parsing whole GenBank or
thousands of BLAST results or FASTQ files produced by NGSs
with BioPerl, BioRuby and hopefully your version?

As the FASTA is a most simple (but fuzzy) format in biology,
I suppose the speed of parsing FASTA entry may depend on
how many variations do you expect to allow in the defline
of the (loosely defined) FASTA format.

Most importantly, I also believe the current speed of parsing
FASTA files is practically enough as Nishiyama-san stated.

It required me >30min to download a file containing ~5.6 million
protein sequences from KEGG (only 2.3G bytes).

ftp://ftp.genome.jp/pub/kegg/genes/fasta/genes.pep

The cat and grep commands took 1 min to read through the file.

------------------------------------------------------------
% time cat genes.pep > /dev/null
cat genes.pep > /dev/null  0.02s user 1.84s system 3% cpu 1:00.91 total

% time egrep '^>' genes.pep | wc -l
 5604761
egrep '^>' genes.pep  12.71s user 2.13s system 23% cpu 1:04.46 total
wc -l  0.44s user 0.21s system 1% cpu 1:04.46 total
------------------------------------------------------------

I modified your benchmark to do some real tasks -- counting sequences,
printing sequence ID and the sequence length.

------------------------------------------------------------
file   = "genes.pep"
io1    = File.open(file)
io2    = file
fasta1 = Fasta.new(io1)
fasta2 = Bio::FlatFile.auto(io2)
c1 = 0
c2 = 0

Benchmark.bm(5) do |timer|
 timer.report('Hack') { 1.times { fasta1.each { |entry1| c1 += 1; $stderr.print c1, "\t", entry1[:seq_name][/^\S+/], "\t", entry1[:seq].length, "\n" } } }
 timer.report('Bio')  { 1.times { fasta2.each { |entry2| c2 += 1; $stderr.print c2, "\t", entry2.entry_id, "\t", entry2.length, "\n" } } }
end
------------------------------------------------------------

Then, your code took 3 min (sounds great!) and the current BioRuby implementation took 9 min.

% ruby-1.8 benchfasta.rb genes.pep
           user     system      total        real
Hack 146.180000  27.820000 174.000000 (191.343770)
Bio  480.940000  38.060000 519.000000 (557.216022)

It could be painful if you need to deal with more sequences, however,
please note that the number of whole protein entries in UniProt (which
is believed to contain known protein universe to date) is only twice
larger than the KEGG (which covers almost all protein sequences
in >1200 completed genomes).

http://www.expasy.org/sprot/relnotes/relstat.html
http://www.ebi.ac.uk/uniprot/TrEMBLstats/
http://www.genome.jp/en/db_growth.html#genes

> Is it possible to optimize the code without major rewriting?

Of course, it would be great if you could contribute improved codes
or suggest some possible ways to optimize the current implementation.

Regards,
Toshiaki Katayama, just back from summer vacation ;-)

On 2010/08/13, at 21:25, Martin Asser Hansen wrote:

> Hello,
> 
> 
> I am new to Ruby and was testing bioruby (1.4.0) for parsing FASTA files. A
> rough comparison with Perl indicated that the bioruby parser was slow. Now I
> have hacked a parser of my own in Ruby in order to benchmark the bioruby
> parser. The result is disappointing -> my hack is roughly 3x faster.
> Admittedly, my hack should probably do a bit of format consistency checking,
> but that will only take a few % off the speed.
> 
> Could someone explain why the bioruby parser is so slow?
> 
> Is it possible to optimize the code without major rewriting?
> 
> Here is the benchmark result:
> 
>           user     system      total        real
> Hack   5.440000   0.010000   5.450000 (  5.494207)
> Bio   18.410000   0.020000  18.430000 ( 18.579867)
> 
> 
> The code is shown below.
> 
> Cheers,
> 
> 
> Martin
> 
> #!/usr/bin/env ruby
> 
> require 'stringio'
> require 'bio'
> require 'benchmark'
> 
> class Fasta
>  include Enumerable
> 
>  def initialize(io)
>    @io = io
>  end
> 
>  def each
>    while entry = get_entry do
>      yield entry
>    end
>  end
> 
>  def get_entry
>    block = @io.gets("\n>")
>    return nil if block.nil?
> 
>    block.chomp!("\n>")
>    block.sub!( /^\s|^>/, "")
> 
>    (seq_name, seq) = block.split("\n", 2)
>    seq.gsub!(/\s/, "")
> 
>    entry = {}
>    entry[:seq_name] = seq_name
>    entry[:seq]      = seq
>    entry
>  end
> end
> 
> data  = <<DATA
>> 5_gECOjxwXsN1/1
> AACGNTACTATCGTGACATGCGTGCAGGATTACAC
>> 3_8ICOjxwXsN1/1
> ACTCNAGGGTTCGATTCCCTTCAACCGCCCCATAA
>> 3_GUCOjxwXsN1/1
> TTGCNTCCTTCTTCTGCCTTCGTTGGCTCAGATTG
>> 5_BWCOjxwXsN1/1
> TATATACAGGAATCCATTGTTGTTTAGATTCAGTT
>> 7_NZCOjxwXsN1/1
> AGGTGATCCAGCCGCACCTTCCGATACGGCTACCT
>> 3_2VCOjxwXsN1/1
> CTTTTCCAGGTGTGTAGACATCTTCACCCATTAAG
>> 5_kVCOjxwXsN1/1
> CTACACCTAAGTTACATCGTCCATTATTTTCCAAT
>> 1_GbCOjxwXsN1/1
> CCAGACAACTAGGATGTTGGCTTAGAAGCAGCCAT
>> 5_fTCOjxwXsN1/1
> TTAGCTTTAACCATTTTCTTTTTGTCTAAAGCAAA
>> 3_VWCOjxwXsN1/1
> TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
> DATA
> 
> io1    = StringIO.new(data)
> io2    = StringIO.new(data)
> fasta1 = Fasta.new(io1)
> fasta2 = Bio::FastaFormat.open(io2)
> 
> Benchmark.bm(5) do |timer|
>  timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
>  timer.report('Bio')  { 10_000_000.times { fasta2.each { |entry2| } } }
> end
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby