[BioRuby] GFF3

Mon Aug 30 22:12:37 EDT 2010

Hi,

> When I parse a 500Mb GFF3 file, without FASTA information, with
> BioRuby it consumes 8.5 Gb RAM and takes 20 minutes.  My NoCache
> version takes 1Gb RAM and 13 minutes.

This sounds nice!

> I am not 100% sure why this is, but I know that BioRuby consumes the
> whole file in memory first, splits it by line and, next, starts
> parsing GFF. Probably memory allocation and regex are expensive with
> really large buffers.


During the conversation on "Benchmarking FASTA file parsing",
I realized that GC takes quite a lot of time when a large amount of memory
is in use.  The mark-and-sweep algorithm in Matz's Ruby implementation
scans all the allocated objects every time the GC runs (it is not
called from Ruby code, but runs implicitly unless suppressed).

Since ruby-1.9.2 seems to have much better GC performance, I am interested
in how the performance compares under ruby-1.9.2.
(I am also interested in the GC.disable condition,
although that may not fit even in 15 Gbytes of memory.)
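
For the GC.disable comparison, something like the following could be
used.  This is only a rough sketch: without_gc, split_lines and seconds
are hypothetical helpers, and the workload is a stand-in for a real GFF3
parser, so the numbers say nothing about BioRuby itself.

```ruby
# Hypothetical helper: run a block with GC switched off, then restore
# the previous state.  Only sensible if all garbage fits in RAM.
def without_gc
  was_disabled = GC.disable   # true if GC was already disabled
  yield
ensure
  GC.enable unless was_disabled
end

# Stand-in workload: split many tab-separated lines, as a parser would.
def split_lines(n)
  (1..n).map { |i| "chr1\t.\texon\t#{i}\t#{i + 100}".split("\t") }.size
end

# Wall-clock seconds taken by the block.
def seconds
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

gc_on  = seconds { split_lines(200_000) }
gc_off = seconds { without_gc { split_lines(200_000) } }
printf("GC enabled: %.3f s, GC disabled: %.3f s\n", gc_on, gc_off)
```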

Running your script with ruby 1.9 caused several errors, all related to
"case ... when ... :".  Removing the colon at the end of each when line,
and replacing the colon with a newline where it is not at the end of the
line, was sufficient to run with ruby 1.9.2 (diff at the end).
Any of a newline, a semicolon, or "then" seems to work.
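
As a small illustration of the three spellings (feature_list and
feature_list_so are hypothetical methods, not part of the script):

```ruby
# The three separators that can follow a "when" condition.
def feature_list(feature_type)
  case feature_type
  when 'mRNA' then :mrna    # "then" on the same line
  when 'CDS';  :cds         # semicolon
  when 'exon'               # newline
    :exon
  else
    :other
  end
end

# Side note: 'mRNA' || 'SO:0000234' evaluates to 'mRNA' before the
# comparison, so the second string is never tested.  A comma-separated
# list tests both values.
def feature_list_so(feature_type)
  case feature_type
  when 'mRNA', 'SO:0000234' then :mrna
  else :other
  end
end

puts feature_list('CDS')
puts feature_list_so('SO:0000234')
```

The side note is independent of the ruby 1.9 errors; the diff below only
changes the separators and leaves the "||" conditions as they were.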

> I only store file seek positions in
> memory, and reload and parse a record from disk every time.


Another good reason is that the data is probably not read from the disk
each time, but cached by the operating system and retained in memory.
So this is not as bad as it sounds.
With 15 Gbytes of RAM, a 500 Mbyte file presumably need not be flushed
from the cache.
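
The seek-position idea can be sketched roughly as follows.  This is
my own minimal illustration and not the actual NoCache code:
build_offset_index and fetch_record are hypothetical names, and it
assumes one record per line with an ID= attribute in column 9.

```ruby
require 'tempfile'

# Remember the byte offset at which each feature line starts, keyed by ID.
def build_offset_index(path)
  index = {}
  File.open(path, 'rb') do |f|
    until f.eof?
      pos  = f.pos
      line = f.gets
      next if line.start_with?('#') || line.strip.empty?
      id = line.chomp.split("\t")[8].to_s[/(?:\A|;)ID=([^;]+)/, 1]
      index[id] = pos if id
    end
  end
  index
end

# Re-read and split a single record from disk on demand.
def fetch_record(path, offset)
  File.open(path, 'rb') do |f|
    f.seek(offset)
    f.gets.chomp.split("\t")
  end
end

# Small made-up file, only to exercise the two methods.
Tempfile.create('demo.gff3') do |tmp|
  tmp.write("##gff-version 3\n" \
            "chr1\t.\tmRNA\t1\t900\t.\t+\t.\tID=mrna1\n" \
            "chr1\t.\texon\t1\t300\t.\t+\t.\tID=exon1;Parent=mrna1\n")
  tmp.flush
  index = build_offset_index(tmp.path)
  puts fetch_record(tmp.path, index['exon1'])[2]
end
```

Only the ID-to-offset hash stays in Ruby's memory; repeated fetches of
the same region are likely to be served from the OS page cache.
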
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

diff --git a/bin/gff3-fetch b/bin/gff3-fetch
index b8d4718..36e61f7 100755
--- a/bin/gff3-fetch
+++ b/bin/gff3-fetch
@@ -39,17 +39,17 @@ ARGV.each do | fn |
    gffdb = Bio::GFFbrowser::GFFdb.new(fn,options)
    gff = gffdb.assembler
    case gfftype
-    when 'mrna'||'mRNA' :
+    when 'mrna'||'mRNA'
            gff.each_mRNA_seq do | id, seq |
              puts ">"+id
              puts seq
            end
-    when 'exon':
+    when 'exon'
            gff.each_exon_seq do | id, seq |
              puts ">"+id
              puts seq
            end
-    when 'CDS':
+    when 'CDS'
            gff.each_CDS_seq do | id, seq |
              puts ">"+id
              puts seq
diff --git a/lib/bio/db/gff/gffdb.rb b/lib/bio/db/gff/gffdb.rb
index 5325fb9..9540154 100644
--- a/lib/bio/db/gff/gffdb.rb
+++ b/lib/bio/db/gff/gffdb.rb
@@ -26,7 +26,7 @@ module Bio
          cache_recs    = options[:cache_records]
          @assembler =
            case cache_recs
-            when :cache_none :
+            when :cache_none
                NoCache.new(filename, options)
              else
                InMemory.new(filename, options)  # default
diff --git a/lib/bio/db/gff/gffparser.rb b/lib/bio/db/gff/gffparser.rb
index 5522d81..e1ed9db 100644
--- a/lib/bio/db/gff/gffparser.rb
+++ b/lib/bio/db/gff/gffparser.rb
@@ -30,9 +30,12 @@ module Bio
                info "Added #{rec.feature_type} with component ID #{id}"
              else
                case rec.feature_type
-                when 'mRNA' || 'SO:0000234' : @mrnalist.add(id,rec)
-                when 'CDS'  || 'SO:0000316' : @cdslist.add(id,rec)
-                when 'exon' || 'SO:0000147' : @exonlist.add(id,rec)
+                when 'mRNA' || 'SO:0000234'
+                  @mrnalist.add(id,rec)
+                when 'CDS'  || 'SO:0000316'
+                  @cdslist.add(id,rec)
+                when 'exon' || 'SO:0000147'
+                  @exonlist.add(id,rec)
                  else
                    if !IGNORE_FEATURES.include?(rec.feature_type)
                      @unrecognized_features[rec.feature_type] = true