[BioRuby] GFF attributes

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Tue Sep 16 23:56:19 EDT 2008


Hi,

On Thu, 11 Sep 2008 11:34:36 +0900
Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> Hi
> 
> > To prevent repeating the bug, I want to use the GFF string
> > described in your mail for the test script in BioRuby.
> > (test/unit/bio/db/test_gff.rb)
> > Can you give permission?
> 
> Surely, I have no objection.
> The string is one of the lines in the Poplar genome annotation from
> the JGI site.
> ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
> So, I think acknowledging them is a good idea.

Thank you. I'll add the above URL to the comments of the test.

> For the test string, I think another pattern with multiple values for
> one key is worth adding.
> The example from
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
> seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003
> 
> Perhaps the current implementation will return '"HBA_HUMAN" 11 55' as the
> value for 'Target'.
> But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
> sensible, or better represent the meaning of the specification.

In this case, string escaping and quotation in free text
can also be processed by the class, and
[ 'HBA_HUMAN', '11', '55' ] can be returned.
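
A minimal sketch of that kind of value splitting, in plain Ruby. The
helper name split_gff2_value is hypothetical and is not part of BioRuby,
and the escaping rules are an assumption (only \" and \\ are unescaped
inside double quotes).

```ruby
# Hypothetical helper, not BioRuby API: split a GFF2 attribute value
# into tokens and strip the double quotes around free text.
def split_gff2_value(str)
  tokens = []
  # Each match is either a double-quoted string (backslash escapes
  # allowed inside) or a run of non-whitespace characters.
  str.scan(/"((?:[^"\\]|\\.)*)"|(\S+)/) do |quoted, bare|
    if quoted
      # Assumption: only \" and \\ are unescaped; other escapes are kept.
      tokens << quoted.gsub(/\\(["\\])/) { $1 }
    else
      tokens << bare
    end
  end
  tokens
end

p split_gff2_value('"HBA_HUMAN" 11 55')   # => ["HBA_HUMAN", "11", "55"]
```

All tokens stay Strings here; converting '11' and '55' to Integers would
be a separate design decision.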

> Since changing this return value will introduce incompatibilities,
> I'm not sure whether it can be changed.
> But if it is ever to be changed, it is better changed early, or
> stated as such.
> If it is too late, perhaps we can make a method under a different
> name so that currently working code will not be affected.

Indeed, for GFF2 attributes, I've already found a
design problem in the current Bio::GFF::GFF2#attributes.
Currently, a hash is used to store attributes, but
the GFF2 spec allows two or more tags with the same name.

For example, 
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature
        Align 101 11 ; Align 179 36 ;

In this case, with the current BioRuby implementation, the
"Align 101 11" is overwritten by the later "Align 179 36",
and we can only get { "Align" => "179 36" }.
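
The overwriting can be reproduced with a simplified stand-in for
hash-based storage. This is only an illustrative sketch, not the actual
Bio::GFF::GFF2 parsing code.

```ruby
# Simplified stand-in for hash-based attribute storage
# (not the actual Bio::GFF::GFF2 implementation).
attr_string = 'Align 101 11 ; Align 179 36 ;'

attributes = {}
attr_string.split(';').each do |field|
  field = field.strip
  next if field.empty?
  # The first word is the tag, the rest of the field is the value.
  tag, value = field.split(/\s+/, 2)
  attributes[tag] = value   # a later "Align" replaces the earlier one
end

p attributes   # only "179 36" is left under "Align"
```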

To solve the problem, I can think of the following two ways.

 1. Using an Array to store values from multiple tags.

    For example, in the above case,
    @attributes = {}
    @attributes['Align'] = [ '101 11', '179 36' ]
    @attributes['Target'] =  '"HBA_HUMAN" 11 54'

    I already took this approach in GFF3, with incompatible
    changes, because the previous implementation of
    GFF3#attributes was broken and could not be used.
    But I now think this approach is not good, and
    I want to change it, because every caller has to check
    whether the value is an array or not.

    In addition, with this approach we cannot parse
    '"HBA_HUMAN" 11 54' into [ 'HBA_HUMAN', '11', '54' ],
    because it is impossible to tell whether an array holds
    values from multiple tags or the parsed values of a single
    tag, unless an array is always used.

 2. Giving up the hash, and using an array (or possibly
    a new class, e.g. GFF2::Attributes) of [ tag, value ]
    pairs.

    For backward compatibility, a hash can be generated
    dynamically when the old behavior is requested.

    I think this approach is better.
    I'll implement this later.
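
A rough sketch of the second approach, in plain Ruby. The class name
AttributePairs and its methods are hypothetical, chosen only for
illustration; they are not BioRuby's API and do not describe how the
final implementation will look.

```ruby
# Hypothetical sketch; names are illustrative only, not BioRuby API.
class AttributePairs
  def initialize
    @pairs = []   # ordered list of [tag, value] pairs
  end

  # Append a pair; tags with the same name are kept side by side.
  def add(tag, value)
    @pairs << [tag, value]
    self
  end

  # All values stored under a tag, always returned as an Array.
  def values_for(tag)
    @pairs.select { |t, _| t == tag }.map { |_, v| v }
  end

  # Old-style view: a Hash in which a later tag overwrites an earlier one.
  def to_h
    @pairs.each_with_object({}) { |(t, v), h| h[t] = v }
  end
end

attrs = AttributePairs.new
attrs.add('Align', '101 11').add('Align', '179 36')
attrs.add('Target', '"HBA_HUMAN" 11 54')

p attrs.values_for('Align')   # => ["101 11", "179 36"]
p attrs.to_h['Align']         # => "179 36"
```

Because values_for always returns an Array, callers do not need to check
the type of the value, and the hash view is built only on request.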
 
Any comments and suggestions are welcome.


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


