[BioRuby] Benchmarking FASTA file parsing
Toshiaki Katayama
ktym at hgc.jp
Fri Aug 13 14:46:20 UTC 2010
Hi,
Thank you for your interesting post. :-)
I used to love benchmarking bottlenecks in BioRuby.
Could you also try to compare parsing whole GenBank or
thousands of BLAST results or FASTQ files produced by NGSs
with BioPerl, BioRuby and hopefully your version?
As the FASTA is a most simple (but fuzzy) format in biology,
I suppose the speed of parsing FASTA entry may depend on
how many variations do you expect to allow in the defline
of the (loosely defined) FASTA format.
Most importantly, I also believe the current speed of parsing
FASTA files is practically enough as Nishiyama-san stated.
It required me >30min to download a file containing ~5.6 million
protein sequences from KEGG (only 2.3G bytes).
ftp://ftp.genome.jp/pub/kegg/genes/fasta/genes.pep
The cat and grep commands took 1 min to read through the file.
------------------------------------------------------------
% time cat genes.pep > /dev/null
cat genes.pep > /dev/null 0.02s user 1.84s system 3% cpu 1:00.91 total
% time egrep '^>' genes.pep | wc -l
5604761
egrep '^>' genes.pep 12.71s user 2.13s system 23% cpu 1:04.46 total
wc -l 0.44s user 0.21s system 1% cpu 1:04.46 total
------------------------------------------------------------
I modified your benchmark to do some real tasks -- counting sequences,
printing sequence ID and the sequence length.
------------------------------------------------------------
file = "genes.pep"
io1 = File.open(file)
io2 = file
fasta1 = Fasta.new(io1)
fasta2 = Bio::FlatFile.auto(io2)
c1 = 0
c2 = 0
Benchmark.bm(5) do |timer|
timer.report('Hack') { 1.times { fasta1.each { |entry1| c1 += 1; $stderr.print c1, "\t", entry1[:seq_name][/^\S+/], "\t", entry1[:seq].length, "\n" } } }
timer.report('Bio') { 1.times { fasta2.each { |entry2| c2 += 1; $stderr.print c2, "\t", entry2.entry_id, "\t", entry2.length, "\n" } } }
end
------------------------------------------------------------
Then, your code took 3 min (sounds great!) and the current BioRuby implementation took 9 min.
% ruby-1.8 benchfasta.rb genes.pep
user system total real
Hack 146.180000 27.820000 174.000000 (191.343770)
Bio 480.940000 38.060000 519.000000 (557.216022)
It could be painful if you need to deal with more sequences, however,
please note that the number of whole protein entries in UniProt (which
is believed to contain known protein universe to date) is only twice
larger than the KEGG (which covers almost all protein sequences
in >1200 completed genomes).
http://www.expasy.org/sprot/relnotes/relstat.html
http://www.ebi.ac.uk/uniprot/TrEMBLstats/
http://www.genome.jp/en/db_growth.html#genes
> Is it possible to optimize the code without major rewriting?
Of course, it would be great if you could contribute improved codes
or suggest some possible ways to optimize the current implementation.
Regards,
Toshiaki Katayama, just back from summer vacation ;-)
On 2010/08/13, at 21:25, Martin Asser Hansen wrote:
> Hello,
>
>
> I am new to Ruby and was testing bioruby (1.4.0) for parsing FASTA files. A
> rough comparison with Perl indicated that the bioruby parser was slow. Now I
> have hacked a parser of my own in Ruby in order to benchmark the bioruby
> parser. The result is disappointing -> my hack is roughly 3x faster.
> Admittedly, my hack should probably do a bit of format consistency checking,
> but that will only take a few % off the speed.
>
> Could someone explain why the bioruby parser is so slow?
>
> Is it possible to optimize the code without major rewriting?
>
> Here is the benchmark result:
>
> user system total real
> Hack 5.440000 0.010000 5.450000 ( 5.494207)
> Bio 18.410000 0.020000 18.430000 ( 18.579867)
>
>
> The code is shown below.
>
> Cheers,
>
>
> Martin
>
> #!/usr/bin/env ruby
>
> require 'stringio'
> require 'bio'
> require 'benchmark'
>
> class Fasta
> include Enumerable
>
> def initialize(io)
> @io = io
> end
>
> def each
> while entry = get_entry do
> yield entry
> end
> end
>
> def get_entry
> block = @io.gets("\n>")
> return nil if block.nil?
>
> block.chomp!("\n>")
> block.sub!( /^\s|^>/, "")
>
> (seq_name, seq) = block.split("\n", 2)
> seq.gsub!(/\s/, "")
>
> entry = {}
> entry[:seq_name] = seq_name
> entry[:seq] = seq
> entry
> end
> end
>
> data = <<DATA
>> 5_gECOjxwXsN1/1
> AACGNTACTATCGTGACATGCGTGCAGGATTACAC
>> 3_8ICOjxwXsN1/1
> ACTCNAGGGTTCGATTCCCTTCAACCGCCCCATAA
>> 3_GUCOjxwXsN1/1
> TTGCNTCCTTCTTCTGCCTTCGTTGGCTCAGATTG
>> 5_BWCOjxwXsN1/1
> TATATACAGGAATCCATTGTTGTTTAGATTCAGTT
>> 7_NZCOjxwXsN1/1
> AGGTGATCCAGCCGCACCTTCCGATACGGCTACCT
>> 3_2VCOjxwXsN1/1
> CTTTTCCAGGTGTGTAGACATCTTCACCCATTAAG
>> 5_kVCOjxwXsN1/1
> CTACACCTAAGTTACATCGTCCATTATTTTCCAAT
>> 1_GbCOjxwXsN1/1
> CCAGACAACTAGGATGTTGGCTTAGAAGCAGCCAT
>> 5_fTCOjxwXsN1/1
> TTAGCTTTAACCATTTTCTTTTTGTCTAAAGCAAA
>> 3_VWCOjxwXsN1/1
> TTATGATGCGCGTGGCGAACGTGAACGCGTTAAAC
> DATA
>
> io1 = StringIO.new(data)
> io2 = StringIO.new(data)
> fasta1 = Fasta.new(io1)
> fasta2 = Bio::FastaFormat.open(io2)
>
> Benchmark.bm(5) do |timer|
> timer.report('Hack') { 10_000_000.times { fasta1.each { |entry1| } } }
> timer.report('Bio') { 10_000_000.times { fasta2.each { |entry2| } } }
> end
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
More information about the BioRuby
mailing list