[BioRuby] GFF3
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Mon Aug 30 22:12:37 EDT 2010
Hi,
> When I parse a 500Mb GFF3 file, without FASTA information, with
> BioRuby it consumes 8.5 Gb RAM and takes 20 minutes. My NoCache
> version takes 1Gb RAM and 13 minutes.
This sounds nice!
> I am not 100% sure why this is, but I know that BioRuby consumes the
> whole file in memory first, splits it by line and, next, starts
> parsing GFF. Probably memory allocation and regex are expensive with
> really large buffers.
During the conversation on "Benchmarking FASTA file parsing",
I realized that GC takes quite a lot of time when a large amount of
memory is in use. The mark-and-sweep algorithm in Matz's Ruby
implementation scans all allocated objects every time the GC runs
(the GC is not invoked from Ruby code, but runs implicitly unless
suppressed).
Since ruby-1.9.2 seems to have much better GC performance, I am
interested in how the performance compares under ruby-1.9.2.
(I am also interested in the GC.disable condition,
though that may not work with 15 Gbytes.)
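A minimal sketch of how the two conditions could be compared. The workload (allocate_records) and the helper names are hypothetical stand-ins, not part of BioRuby or of the script under discussion; only GC.enable, GC.disable, GC.start, GC.count and Benchmark.realtime are standard Ruby calls.

```ruby
require 'benchmark'

# Hypothetical workload: allocate many small strings and arrays,
# roughly what a line-by-line GFF3 parser would produce.
def allocate_records(n)
  Array.new(n) { |i| "chr1\tsrc\texon\t#{i}\t#{i + 100}".split("\t") }
end

# Time the workload with the GC running normally.
# Returns [elapsed seconds, number of GC runs during the workload].
def time_with_gc(n)
  GC.enable
  GC.start
  before = GC.count
  elapsed = Benchmark.realtime { allocate_records(n) }
  [elapsed, GC.count - before]
end

# Time the same workload with the GC suppressed.
# The GC is switched back on afterwards, even if the workload raises.
def time_without_gc(n)
  GC.start
  GC.disable
  before = GC.count
  elapsed = Benchmark.realtime { allocate_records(n) }
  [elapsed, GC.count - before]
ensure
  GC.enable
end

n = 100_000
on_time, on_runs = time_with_gc(n)
off_time, off_runs = time_without_gc(n)
puts format('GC enabled:  %.3f s, %d GC runs', on_time, on_runs)
puts format('GC disabled: %.3f s, %d GC runs', off_time, off_runs)
```

With the GC disabled, memory use grows with every allocation, so this is only practical when the whole run fits in RAM.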
Running your script with ruby 1.9 caused several errors, all related to
the "case ... when ... :" syntax.
Removing the colon at the end of each when line, and replacing the
colon with a newline where it is not at the end of the line, was
sufficient to run with ruby 1.9.2 (diff at the end).
Any one of newline, semicolon, or "then" seems to work.
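For illustration, the three separators that work in place of the old colon can be shown in one case expression. The method name feature_label and the returned symbols are made up for this example.

```ruby
# Each "when" below uses a different separator accepted by ruby 1.9.
def feature_label(type)
  case type
  when 'mRNA' then :mrna   # "then" on the same line
  when 'CDS'; :cds         # semicolon on the same line
  when 'exon'              # newline before the body
    :exon
  else
    :other
  end
end

puts feature_label('mRNA')
puts feature_label('CDS')
puts feature_label('exon')
puts feature_label('gene')
```

As a side note, alternatives within a single when are separated by commas (when 'mrna', 'mRNA'). An expression such as 'mrna' || 'mRNA' evaluates to 'mrna' before the comparison, so only the first string is ever matched.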
> I only store file seek positions in
> memory, and reload and parse a record from disk every time.
The other good reason is that the data is perhaps not read from the
disk many times, but is cached by the operating system and retained
in memory.
So this is not as bad as it sounds.
With 15 Gbytes of RAM, a 500 Mbyte file presumably need not be flushed
from the cache.
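The seek-position approach described in the quoted text might look roughly like the sketch below. It is not the NoCache implementation; build_offset_index and fetch_record are hypothetical names, and the index is keyed on the ID attribute purely as an assumption for the example.

```ruby
require 'tempfile'

# Scan the file once, remembering only the byte offset of each record
# that carries an ID attribute (column 9 of a GFF3 line).
def build_offset_index(path)
  index = {}
  File.open(path, 'rb') do |f|
    until f.eof?
      offset = f.pos
      line = f.gets
      next if line.start_with?('#') || line.strip.empty?
      attrs = line.chomp.split("\t")[8].to_s
      id = attrs[/(?:\A|;)ID=([^;]+)/, 1]
      index[id] = offset if id
    end
  end
  index
end

# Re-read and split a single record from disk by seeking to its offset.
def fetch_record(path, offset)
  File.open(path, 'rb') do |f|
    f.seek(offset)
    f.gets.chomp.split("\t")
  end
end

# Small demonstration on a temporary file.
tmp = Tempfile.new('demo_gff3')
tmp.write("##gff-version 3\n" \
          "chr1\t.\tmRNA\t100\t900\t.\t+\t.\tID=mrna1\n" \
          "chr1\t.\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1\n")
tmp.close
index = build_offset_index(tmp.path)
puts fetch_record(tmp.path, index['exon1']).inspect
tmp.unlink
```

Only one integer per record stays in memory. Repeated fetch_record calls on the same file are likely to be served from the operating system's page cache rather than from the disk.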
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan
diff --git a/bin/gff3-fetch b/bin/gff3-fetch
index b8d4718..36e61f7 100755
--- a/bin/gff3-fetch
+++ b/bin/gff3-fetch
@@ -39,17 +39,17 @@ ARGV.each do | fn |
gffdb = Bio::GFFbrowser::GFFdb.new(fn,options)
gff = gffdb.assembler
case gfftype
- when 'mrna'||'mRNA' :
+ when 'mrna'||'mRNA'
gff.each_mRNA_seq do | id, seq |
puts ">"+id
puts seq
end
- when 'exon':
+ when 'exon'
gff.each_exon_seq do | id, seq |
puts ">"+id
puts seq
end
- when 'CDS':
+ when 'CDS'
gff.each_CDS_seq do | id, seq |
puts ">"+id
puts seq
diff --git a/lib/bio/db/gff/gffdb.rb b/lib/bio/db/gff/gffdb.rb
index 5325fb9..9540154 100644
--- a/lib/bio/db/gff/gffdb.rb
+++ b/lib/bio/db/gff/gffdb.rb
@@ -26,7 +26,7 @@ module Bio
cache_recs = options[:cache_records]
@assembler =
case cache_recs
- when :cache_none :
+ when :cache_none
NoCache.new(filename, options)
else
InMemory.new(filename, options) # default
diff --git a/lib/bio/db/gff/gffparser.rb b/lib/bio/db/gff/gffparser.rb
index 5522d81..e1ed9db 100644
--- a/lib/bio/db/gff/gffparser.rb
+++ b/lib/bio/db/gff/gffparser.rb
@@ -30,9 +30,12 @@ module Bio
info "Added #{rec.feature_type} with component ID #{id}"
else
case rec.feature_type
- when 'mRNA' || 'SO:0000234' : @mrnalist.add(id,rec)
- when 'CDS' || 'SO:0000316' : @cdslist.add(id,rec)
- when 'exon' || 'SO:0000147' : @exonlist.add(id,rec)
+ when 'mRNA' || 'SO:0000234'
+ @mrnalist.add(id,rec)
+ when 'CDS' || 'SO:0000316'
+ @cdslist.add(id,rec)
+ when 'exon' || 'SO:0000147'
+ @exonlist.add(id,rec)
else
if !IGNORE_FEATURES.include?(rec.feature_type)
@unrecognized_features[rec.feature_type] = true