[BioRuby] Problem with Bio::GFF::GFF2

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Tue Jun 9 13:44:19 UTC 2009


Hi George,

On Tue, 9 Jun 2009 15:26:45 +0300
George Githinji <georgkam at gmail.com> wrote:

> Hi all,
> I am try to parse a GFF file. The file looks like this
> 
> ##gff-version 2
> ##source-version bepipred-1.0b
> ##date 2009-06-09
> ##Type Protein seq1
> # seqname            source        feature      start   end   score  N/A   ?
> # ---------------------------------------------------------------------------
> seq1   bepipred-1.0b epitope          1     1   0.173  . .   .
> seq1   bepipred-1.0b epitope          2     2  -0.043  . .   .
> seq1  bepipred-1.0b epitope          3     3  -0.014  . .   .
> seq1   bepipred-1.0b epitope          4     4   0.144  . .   .
> seq1   bepipred-1.0b epitope          5     5   0.250  . .   .
> seq1   bepipred-1.0b epitope          6     6   0.218  . .   .
> 
> ....truncated

The above GFF records do not contain any "attributes".
The field definition of each GFF line is:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

When talking about GFF, the word "attributes" refers to the
"attributes" field in each GFF line.

See the GFF2 specifications document for details.
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
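
To make the field layout concrete, here is a minimal sketch in plain
Ruby that does not use BioRuby. The helper name split_gff2_line and
the GFF2_FIELDS constant are hypothetical, made up for illustration.
It assumes tab-separated columns and ignores trailing comments, so it
is a simplification and not a full GFF2 parser.

```ruby
# Hypothetical helper, not part of BioRuby: split one GFF2 data line
# into the eight mandatory fields plus the optional attributes field.
GFF2_FIELDS = %w[seqname source feature start end score strand frame attributes].freeze

def split_gff2_line(line)
  # Assumes tab-separated columns; trailing comments are not handled.
  columns = line.chomp.split("\t", 10)
  fields = {}
  GFF2_FIELDS.each_with_index do |name, i|
    value = columns[i]
    # A lone "." is the GFF placeholder for "no value".
    fields[name] = (value.nil? || value.empty? || value == '.') ? nil : value
  end
  # Convert the numeric fields so they can be used in calculations.
  fields['start'] = Integer(fields['start']) if fields['start']
  fields['end']   = Integer(fields['end'])   if fields['end']
  fields['score'] = Float(fields['score'])   if fields['score']
  fields
end

# A line shaped like the records above, with only the eight mandatory fields.
line = %w[seq1 bepipred-1.0b epitope 1 1 0.173 . .].join("\t")
record = split_gff2_line(line)
p record['start'], record['end'], record['score']
p record['attributes']
```

With only eight columns present, the "attributes" entry stays nil,
while start, end and score are still available as ordinary fields.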

> and i have written the following lines with an aim of extracting the start,
> end and score attributes. but before that i wanted to know whether the full
> attributes are available. so i did the following.
> 
> require 'rubygems'
> require 'bio'
> bep_gff = Bio::GFF::GFF2.new(File.open('/home/george/bpred.gff'))
> 
>  bep_gff.records.each do |record|
>     puts record.attributes_to_hash.inspect
> end
> 
> However, i get empty hashes.
> Any ideas?

Because the Bio::GFF::GFF2::Record#attributes_to_hash method returns
only the "attributes" field as a hash, and the "attributes" fields of
all the GFF2 records above are empty, getting empty hashes is the
expected result.

If you really want a hash, adding each field into a hash would
be the easiest way. For example,

  bep_gff.records.each do |record|
     h = {}
     h['seqname']    = record.seqname
     h['source']     = record.source
     h['feature']    = record.feature
     h['start']      = record.start
     h['end']        = record.end
     h['score']      = record.score
     h['strand']     = record.strand
     h['frame']      = record.frame
     h['attributes'] = record.attributes_to_hash
     p h
  end

Bio::GFF::GFF2::Record has seqname, source, feature, start, end,
score, strand, and frame attributes (in the Ruby sense of the word),
which are inherited from the Bio::GFF::Record class.
Normally, it is more natural to use these Ruby attributes
directly, without creating a hash.

Note that using attributes_to_hash may lose some data when
there are two or more values with the same tag name in an
"attributes" field.
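
The kind of loss meant here can be sketched in plain Ruby, without
BioRuby. The tag-value pairs below are hypothetical and only stand in
for a parsed "attributes" field in which the tag "Note" appears twice.
Which value BioRuby itself keeps is not shown here; the sketch only
illustrates why a flat hash cannot hold both.

```ruby
# Hypothetical tag-value pairs with a repeated tag name.
pairs = [['Target', 'seq2'], ['Note', 'first'], ['Note', 'second']]

# A flat Hash holds one value per key, so in this loop the later
# "Note" value overwrites the earlier one.
flat = {}
pairs.each { |tag, value| flat[tag] = value }

# Collecting the values into an Array per tag keeps all of them.
grouped = Hash.new { |hash, tag| hash[tag] = [] }
pairs.each { |tag, value| grouped[tag] << value }

p flat
p grouped
```

If repeated tags matter for your data, a grouped structure like the
second one is one possible way to avoid dropping values.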

When creating new data that uses "attributes" extensively,
GFF3 is recommended, because the design of GFF2 attributes is
somewhat broken.

> Thank you
> 
> 
> -- 
> ---------------
> Sincerely
> George
> 
> Skype: george_g2
> Blog: http://biorelated.wordpress.com/

Your blog is nice!

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


