[BioRuby] GFF attributes
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Wed Sep 17 03:56:19 UTC 2008
Hi,
On Thu, 11 Sep 2008 11:34:36 +0900
Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> Hi
>
> > To prevent repeating the bug, I want to use the GFF string
> > described in your mail for the test script in BioRuby.
> > (test/unit/bio/db/test_gff.rb)
> > Can you give permission?
>
> Surely, I have no objection.
> The string is one of the lines in the Poplar genome annotation from
> the JGI site.
> ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/
> Poptr1_1.JamboreeModels.gff.gz
> So, I think acknowledging them is a good idea.
Thank you. I'll add the above URL to the comments of the test.
> For the test string, I think another pattern with multiple values for
> one key is worth adding.
> The example from http://www.sanger.ac.uk/Software/formats/GFF/
> GFF_Spec.shtml:
> seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11
> 55 ; E_value 0.0003
>
> Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the
> value for 'Target'.
> But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
> sensible, or represent
> more of the meaning of the specification.
In this case, string escaping and quotation in free text
can also be processed by the class, and
[ 'HBA_HUMAN', '11', '55' ] can be returned.
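As a rough illustration of that kind of processing, here is a minimal
sketch. The method name split_gff2_value is hypothetical and is not part
of BioRuby; it only assumes that free text is enclosed in double quotes
and that a backslash escapes the following character.

```ruby
# Hypothetical helper (not BioRuby's API): split a GFF2 attribute value
# into tokens, removing the double quotes around free text.
def split_gff2_value(str)
  tokens = []
  # Match either a double-quoted string (allowing backslash escapes)
  # or a run of non-whitespace characters.
  str.scan(/"((?:[^"\\]|\\.)*)"|(\S+)/) do |quoted, bare|
    if quoted
      # Unescape sequences such as \" and \\ inside the free text.
      tokens << quoted.gsub(/\\(.)/) { $1 }
    else
      tokens << bare
    end
  end
  tokens
end

p split_gff2_value('"HBA_HUMAN" 11 55')
```

Note that all tokens stay strings here; converting '11' and '55' to
integers would be a separate design decision.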
> Since changing this return value will make incompatibilities, I'm not
> sure
> whether it can be changed.
> But if it is ever to be changed, it is better changed early, or
> stated as such.
> If it is too late, perhaps we can make a method under a different
> name so that
> currently working code will not be affected.
Indeed, for GFF2 attributes, I've already found a
design problem in the current Bio::GFF::GFF2#attributes.
Currently, a hash is used to store attributes, but
the GFF2 spec allows two or more tags with the same name.
For example,
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature
Align 101 11 ; Align 179 36 ;
In this case, with the current BioRuby implementation, the
"Align 101 11" is overwritten by the latter "Align 179 36",
and we can only get { "Align" => "179 36" }.
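The overwrite can be reproduced with a simplified stand-in for
hash-based parsing. This is only an illustration, not the actual BioRuby
code; parse_into_hash is a hypothetical name, and the naive split on ';'
ignores quoted free text.

```ruby
# Simplified stand-in for hash-based attribute parsing
# (illustration only, not the actual BioRuby implementation).
def parse_into_hash(attr_str)
  hash = {}
  # Naive split: a ';' inside quoted free text is not handled here.
  attr_str.split(';').each do |field|
    tag, value = field.strip.split(/\s+/, 2)
    next if tag.nil?      # skip empty fields
    hash[tag] = value     # a later tag with the same name replaces the earlier one
  end
  hash
end

p parse_into_hash('Align 101 11 ; Align 179 36 ;')
```

Only the last "Align" value survives in the resulting hash.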
To solve the problem, I can think of the following two ways.
1. Using an Array to store values from multiple tags.
For example, in the above case,
@attributes = {}
@attributes['Align'] = [ '101 11', '179 36' ]
@attributes['Target'] = '"HBA_HUMAN" 11 55'
I already took this approach in GFF3, with incompatible
changes, because the previous implementation of
GFF3#attributes was broken and could not be used.
But I now think this approach is not good, and
I want to change it, because the caller has to check
every time whether the value is an array or not.
In addition, in this case, we cannot parse
'"HBA_HUMAN" 11 55' to [ 'HBA_HUMAN', '11', '55' ],
because it is impossible to tell values that come from
multiple tags apart from parsed values, unless an array is
always used.
2. Giving up on using a hash, and using an array (or possibly
a new class, e.g. GFF2::Attributes) of [ tag, value ]
pairs.
For backward compatibility, a hash can be dynamically
generated when the old behavior is requested.
I think this approach is better.
I'll implement this later.
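To make the second approach concrete, here is a minimal sketch. The
class name AttributePairs and its methods are hypothetical and are not
BioRuby's API; the real design may differ.

```ruby
# Hypothetical sketch of approach 2; names are illustrative only.
class AttributePairs
  def initialize
    @pairs = []   # array of [tag, value] pairs, kept in file order
  end

  # Append a pair; repeated tags are simply kept side by side.
  def add(tag, value)
    @pairs << [tag, value]
    self
  end

  # All values for a tag, always returned as an Array (possibly empty),
  # so the caller never has to check the type.
  def values_for(tag)
    @pairs.select { |t, _| t == tag }.map { |_, v| v }
  end

  # Hash view for backward compatibility: for repeated tags,
  # the last value wins, as in the old behavior.
  def to_hash
    @pairs.each_with_object({}) { |(t, v), h| h[t] = v }
  end
end

attrs = AttributePairs.new
attrs.add('Align', '101 11').add('Align', '179 36')
attrs.add('Target', '"HBA_HUMAN" 11 55')

p attrs.values_for('Align')
p attrs.to_hash
```

Because every tag maps to a list of values, each value could later be
parsed into its own array of tokens without ambiguity.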
Any comments and suggestions are welcome.
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org