[BioRuby] GFF attributes
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Wed Sep 24 14:05:26 UTC 2008
Hi,
In my GitHub repository, I've made incompatible changes
to the Bio::GFF::GFF2 and Bio::GFF::GFF3 classes.
Now, attributes are stored as an Array containing
[ tag, value ] pairs, for example,
[ [ 'Gene', 'CEN1' ], [ 'E_value', '0.0003' ],
[ 'Note', 'CEN1; Chromosome I Centromere' ] ].
To get an attribute, it is recommended to use the new method
Record#attribute(tag) and related methods.
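
As a rough illustration of why [ tag, value ] pairs are used, here is a
minimal plain-Ruby sketch. It does not use BioRuby itself; the helpers
first_value and all_values are hypothetical names invented for this
example, and the Align pairs are taken from the GFF2 spec example quoted
further down in this message.

```ruby
# Attributes stored as [tag, value] pairs; a tag may appear more than once.
attributes = [
  ['Gene', 'CEN1'],
  ['E_value', '0.0003'],
  ['Align', '101 11'],
  ['Align', '179 36']
]

# Hypothetical helper: value of the first pair with the given tag, or nil.
def first_value(pairs, tag)
  pair = pairs.assoc(tag)
  pair && pair[1]
end

# Hypothetical helper: values of every pair with the given tag, in order.
def all_values(pairs, tag)
  pairs.select { |t, _| t == tag }.map { |_, v| v }
end

# Converting the pairs to a Hash keeps only the last value of a repeated tag,
# which is the data loss the Array representation avoids.
as_hash = Hash[attributes]
```

With the pairs kept in an Array, both Align entries remain reachable,
while the Hash built from the same data holds only the last one.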
String escaping in free text is automatically processed.
In addition, a GFF2 attribute value with multiple tokens,
e.g. 'Target "HBA_HUMAN" 11 55', is parsed into a
Bio::GFF::GFF2::Record::Value object. (Note that a value
with a single token is still a String.)
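
The splitting of a multi-token value can be sketched in plain Ruby as
follows. This is only an assumption about the general idea, not the
actual BioRuby implementation: parse_value is a hypothetical helper, a
plain Array stands in for the Value object, and backslash handling is
simplified to dropping the backslash before any escaped character.

```ruby
# Matches either a double-quoted token (with backslash escapes) or a bare token.
TOKEN = /"((?:[^"\\]|\\.)*)"|(\S+)/

# Hypothetical helper: split a GFF2 attribute value into tokens.
# Returns a String for a single token and an Array for several tokens.
def parse_value(str)
  tokens = str.scan(TOKEN).map do |quoted, bare|
    if quoted.nil?
      bare
    else
      # Simplified unescaping: "\x" becomes "x" for any character x.
      quoted.gsub(/\\(.)/) { Regexp.last_match(1) }
    end
  end
  tokens.size == 1 ? tokens.first : tokens
end

target = parse_value('"HBA_HUMAN" 11 55')  # several tokens
e_value = parse_value('0.0003')            # single token
```

Here target is an Array of three Strings with the quotes removed, and
e_value stays a plain String.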
To keep backward compatibility, the behavior of Bio::GFF
itself is largely unchanged, apart from bug fixes.
To use the new features, Bio::GFF::GFF2 or Bio::GFF::GFF3
must be used explicitly.
For more details, please see
http://github.com/ngoto/bioruby/commit/95391949d217e6f7c9ee7444afebec6ee8677035
If no problems are found, it will be included in the main
bioruby repository.
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
On Wed, 17 Sep 2008 12:56:19 +0900
Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi,
>
> On Thu, 11 Sep 2008 11:34:36 +0900
> Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
>
> > Hi
> >
> > > To prevent repeating the bug, I want to use the GFF string
> > > described in your mail for the test script in BioRuby.
> > > (test/unit/bio/db/test_gff.rb)
> > > Can you give permission?
> >
> > Surely, I have no objection.
> > The string is one of the lines in the Poplar genome annotation from
> > the JGI site.
> > ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
> > So, I think acknowledging them is a good idea.
>
> Thank you. I'll add the above URL to the comments of the test.
>
> > For the test string, I think another pattern including multiple values
> > for one key is worth adding.
> > The example from
> > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
> > seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
> >
> > Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the
> > value for 'Target'.
> > But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
> > sensible, or represent
> > more of the meaning of the specification.
>
> In this case, string escaping and quotation in free text
> can also be processed by the class, and
> [ 'HBA_HUMAN', '11', '55' ] can be returned.
>
> > Since changing this return value will make incompatibilities, I'm not
> > sure
> > whether it can be changed.
> > But if it is ever to be changed, it is better changed early, or
> > stated as such.
> > If it is too late, perhaps we can make a method under a different
> > name so that
> > currently working code will not be affected.
>
> Indeed, for GFF2 attributes, I've already found a
> design problem in the current Bio::GFF::GFF2#attributes.
> Currently, a hash is used to store attributes, but
> the GFF2 spec allows two or more tags with the same name.
>
> For example,
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature
> Align 101 11 ; Align 179 36 ;
>
> In this case, with current bioruby implementation, the
> "Align 101 11" is overwritten by the latter "Align 179 36",
> and we can only get { "Align" => "179 36" }.
>
> To solve the problem, I can think of the following two ways.
>
> 1. Using an Array to store values from multiple tags.
>
> For example, in the above case,
> @attributes = {}
> @attributes['Align'] = [ '101 11', '179 36' ]
> @attributes['Target'] = '"HBA_HUMAN" 11 55'
>
> I already took this approach in GFF3 with incompatible
> changes, because the previous implementation of
> GFF3#attributes was broken and could not be used.
> But I now think this approach is not good and
> I want to change it, because it requires checking whether
> the value is an array or not every time.
>
> In addition, in this case, we cannot parse
> '"HBA_HUMAN" 11 55' to [ 'HBA_HUMAN', '11', '55' ],
> because it is impossible to distinguish values from
> multiple tags from parsed values, unless an array is
> always used.
>
> 2. Giving up using a hash, and using an array (or possibly
> a new class, e.g. GFF2::Attributes) of [ tag, value ]
> pairs.
>
> For backward compatibility, a hash can be dynamically
> generated when the old behavior is requested.
>
> I think this approach is better.
> I'll implement this later.
>
> Any comments and suggestions are welcome.
>
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby