[BioRuby] GFF attributes

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Wed Sep 24 10:05:26 EDT 2008


Hi,

In my GitHub repository, I've made incompatible changes
to the Bio::GFF::GFF2 and Bio::GFF::GFF3 classes.

Now, attributes are stored as an Array containing
[ tag, value ] pairs, for example,
  [ [ 'Gene', 'CEN1' ], [ 'E_value', '0.0003' ],
    [ 'Note', 'CEN1; Chromosome I Centromere' ] ].
To get an attribute, it is recommended to use the new method
Record#attribute(tag) and related methods.
String escaping in free text is processed automatically.
In addition, a GFF2 attribute value with multiple tokens,
e.g. 'Target "HBA_HUMAN" 11 55', is parsed into a
Bio::GFF::GFF2::Record::Value object. (Note that a value
with a single token is still a String.)
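
As a rough illustration of the data layout described above, here is a
minimal plain-Ruby sketch. It does not use BioRuby at all: the helper
names first_value, all_values and split_tokens are hypothetical and are
not the library's API, and the escape handling is deliberately
simplified (a backslash is dropped and the following character kept).

```ruby
# Plain-Ruby sketch; none of these helpers are BioRuby methods.

# Attributes as an ordered list of [tag, value] pairs.
ATTRIBUTES = [
  ['Gene', 'CEN1'],
  ['E_value', '0.0003'],
  ['Note', 'CEN1; Chromosome I Centromere'],
  ['Align', '101 11'],
  ['Align', '179 36']
]

# First value stored for a tag, or nil when the tag is absent.
def first_value(pairs, tag)
  pair = pairs.assoc(tag)
  pair && pair[1]
end

# Every value stored for a tag, in their original order.
def all_values(pairs, tag)
  pairs.select { |t, _| t == tag }.map { |_, v| v }
end

# Split a GFF2 attribute value into tokens. Double quotes around free
# text are removed; inside quotes, a backslash is dropped and the next
# character is kept as-is (a simplification of real escape rules).
def split_tokens(value)
  value.scan(/"((?:[^"\\]|\\.)*)"|(\S+)/).map do |quoted, bare|
    quoted ? quoted.gsub(/\\(.)/) { $1 } : bare
  end
end

p first_value(ATTRIBUTES, 'Gene')
p all_values(ATTRIBUTES, 'Align')
p split_tokens('"HBA_HUMAN" 11 55')
```

Because the pairs are kept in an Array, repeated tags such as 'Align'
are all retained and keep their order.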

To keep backward compatibility, the specification of
Bio::GFF itself is unchanged apart from bug fixes.
To use the new features, Bio::GFF::GFF2 or Bio::GFF::GFF3
must be used explicitly.
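
To show why the pair list and a hash-style view differ, here is a small
plain-Ruby sketch. It assumes the older interface exposes attributes as
a Hash, as described in the quoted message below; to_legacy_hash is a
hypothetical helper written for this example, not a BioRuby method.

```ruby
# Plain-Ruby sketch; to_legacy_hash is a hypothetical helper.

PAIRS = [
  ['Target', '"HBA_HUMAN" 11 55'],
  ['Align', '101 11'],
  ['Align', '179 36']
]

# Collapse [tag, value] pairs into a Hash. A later pair with the same
# tag replaces the earlier one, so a repeated tag keeps only its last
# value.
def to_legacy_hash(pairs)
  pairs.each_with_object({}) { |(tag, value), hash| hash[tag] = value }
end

LEGACY = to_legacy_hash(PAIRS)

puts LEGACY['Align']                          # prints: 179 36
puts PAIRS.count { |tag, _| tag == 'Align' }  # prints: 2
```

The Hash view is convenient for code that expects one value per tag,
but only the pair list preserves every 'Align' entry.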

For more details, please see
http://github.com/ngoto/bioruby/commit/95391949d217e6f7c9ee7444afebec6ee8677035

If no problems are found, these changes will be merged into
the main bioruby repository.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 17 Sep 2008 12:56:19 +0900
Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp> wrote:

> Hi,
> 
> On Thu, 11 Sep 2008 11:34:36 +0900
> Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> 
> > Hi
> > 
> > > To prevent repeating the bug, I want to use the GFF string
> > > described in your mail for the test script in BioRuby.
> > > (test/unit/bio/db/test_gff.rb)
> > > Can you give permission?
> > 
> > Surely, I have no objection.
> > The string is one of the lines in the Poplar genome annotation from
> > the JGI site.
> > ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
> > So, I think acknowledging them is a good idea.
> 
> Thank you. I'll add above URL in the comments of the test.
> 
> > As a test string, I think another pattern, with multiple values for
> > one key, is worth adding.
> > The example from
> > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
> > seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003
> > 
> > Perhaps the current implementation will return '"HBA_HUMAN" 11 55'
> > as the value for 'Target'.
> > But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
> > sensible, or represent more of the meaning of the specification.
> 
> In this case, string escaping and quotation in free text
> can also be processed by the class, and
> [ 'HBA_HUMAN', '11', '55' ] can be returned.
> 
> > Since changing this return value will introduce incompatibilities,
> > I'm not sure whether it can be changed.
> > But if it is ever to be changed, it is better changed early, or
> > stated as such.
> > If it is too late, perhaps we can make a method under a different
> > name so that currently working code will not be affected.
> 
> Indeed, for GFF2 attributes, I've already found a
> design problem in the current Bio::GFF::GFF2#attributes.
> Currently, a hash is used to store attributes, but
> the GFF2 spec allows two or more tags with the same name.
> 
> For example, 
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature
>         Align 101 11 ; Align 179 36 ;
> 
> In this case, with the current bioruby implementation,
> "Align 101 11" is overwritten by the later "Align 179 36",
> and we can only get  { "Align" => "179 36" }.
> 
> To solve the problem, I can think of the following two approaches.
> 
>  1. Using an Array to store values from multiple tags.
> 
>     For example, in the above case,
>     @attributes = {}
>     @attributes['Align'] = [ '101 11', '179 36' ]
>     @attributes['Target'] =  '"HBA_HUMAN" 11 54'
> 
>     I already took this approach in GFF3 with incompatible
>     changes, because the previous implementation of
>     GFF3#attributes was broken and could not be used.
>     But now I think this approach is not good and
>     I want to change it, because it requires checking
>     whether the value is an array or not every time.
> 
>     In addition, in this case, we cannot parse
>     '"HBA_HUMAN" 11 54' to [ 'HBA_HUMAN', '11', '54' ],
>     because it is impossible to distinguish values from
>     multiple tags from parsed values, unless an array is
>     always used.
> 
>  2. Giving up the hash, and using an array (or possibly
>     a new class, e.g. GFF2::Attributes) of [ tag, value ]
>     pairs.
> 
>     For backward compatibility, a hash can be dynamically
>     generated when the old behavior is requested.
> 
>     I think this approach is better.
>     I'll implement this later.
>  
> Any comments and suggestions are welcome.
> 
> 
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> 

