[BioRuby] GFF attributes

Tue Sep 9 07:47:46 EDT 2008

Hi,

On Fri, 5 Sep 2008 15:43:05 +0900
Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> Hi,
> 
> When extracting attributes from a GFF file,
> the older implementation seems to have eaten the last character before ";".
> The current one (downloaded very recently from GitHub) does not split well,
> because the regular expression searches for the longest match.

Thank you for reporting a bug.
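
To illustrate the report, here is a minimal sketch that runs the two
patterns from the diff quoted below against the attribute string from the
test program. It only uses StringScanner from the standard library; the
variable names are made up for this example and it is not the BioRuby code
itself.

```ruby
require 'strscan'

# Attribute column taken from the test program in the report.
attributes = 'name "grail3.0116000101"; proteinId 639579; exonNumber 3'

# Pattern before the patch: the greedy ".*" runs up to the LAST ";"
# that is not preceded by a backslash, so two attributes are merged.
greedy = StringScanner.new(attributes)
greedy.scan(/(.*[^\\])\;/)
greedy_first = greedy[1]
# => "name \"grail3.0116000101\"; proteinId 639579"

# Pattern from the patch: the body may only contain non-";" characters
# or a backslash-escaped ";", so the match stops at the FIRST ";".
patched = StringScanner.new(attributes)
patched.scan(/(([^;]|\\;)*[^\\])\;/)
patched_first = patched[1]
# => "name \"grail3.0116000101\""

p greedy_first
p patched_first
```

Note that the patched pattern still treats a semicolon inside double quotes
as a separator, since it does not look at quotes at all.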

> A patch is included, but I am not sure on the specification.
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> The specification says:
> > From version 2 onwards, the attribute field must have an tag value  
> > structure following the syntax used within objects in a .ace file,  
> > flattened onto one line by semicolon separators. Tags must be  
> > standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must  
> > be quoted with double quotes. Note: all non-printing characters in  
> > such free text value strings (e.g. newlines, tabs, control  
> > characters, etc) must be explicitly represented by their C (UNIX)  
> > style backslash-escaped representation (e.g. newlines as '\n', tabs  
> > as '\t').
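
As a rough sketch of what this paragraph of the specification asks for, the
tag rule and the backslash escapes could be checked as follows. The names
valid_gff2_tag?, unquote_gff2_value and ESCAPES are hypothetical and are not
part of BioRuby, and the set of escapes handled here is an assumption (the
specification only gives '\n' and '\t' as examples).

```ruby
# Tag syntax quoted from the GFF2 specification.
TAG_PATTERN = /\A[A-Za-z][A-Za-z0-9_]*\z/

# Assumed escape table; unknown escapes keep the escaped character as-is.
ESCAPES = { 'n' => "\n", 't' => "\t", 'r' => "\r", '"' => '"', '\\' => '\\' }

# Hypothetical helper: true if the tag is a standard identifier.
def valid_gff2_tag?(tag)
  !!(tag =~ TAG_PATTERN)
end

# Hypothetical helper: strips surrounding double quotes from a free-text
# value and decodes C-style backslash escapes. Unquoted values are
# returned unchanged.
def unquote_gff2_value(value)
  quoted = value.length >= 2 && value.start_with?('"') && value.end_with?('"')
  return value unless quoted
  # m is the two-character match, e.g. a backslash followed by "t".
  value[1..-2].gsub(/\\(.)/) { |m| ESCAPES.fetch(m[1], m[1]) }
end
```

For example, unquote_gff2_value('"grail3.0116000101"') gives the name
without its quotes, while an unquoted value such as '639579' passes through
untouched.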

I also looked at BioPerl's _from_gff2_string in Bio::Tools::GFF
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Tools/GFF.html#CODE10
It seems it still has bugs (as described in the comments in their code),
but semicolons inside double quotes are treated as ordinary characters
and not separators for attributes.
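
The behaviour described above can be sketched roughly like this. It is only
an illustration under my own assumptions, not the code in BioPerl or in the
BioRuby commit, and split_gff2_attributes is a hypothetical name.

```ruby
require 'strscan'

# Hypothetical helper: splits a GFF2 attribute column on ";" while
# treating any ";" inside a double-quoted string as an ordinary character.
def split_gff2_attributes(str)
  scanner = StringScanner.new(str)
  fields = []
  current = ''.dup
  until scanner.eos?
    if (quoted = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      # A complete quoted string (backslash escapes allowed) is copied whole.
      current << quoted
    elsif scanner.scan(/;/)
      # An unquoted ";" ends the current attribute.
      fields << current.strip
      current = ''.dup
    else
      # Any other character, including an unterminated quote, is copied as-is.
      current << scanner.getch
    end
  end
  fields << current.strip
  fields.reject(&:empty?)
end

p split_gff2_attributes('name "grail3.0116000101"; proteinId 639579; exonNumber 3')
p split_gff2_attributes('note "a; b"; id 1')
```

Each resulting field can then be divided into tag and value with
split(' ', 2), as parse_attributes in the quoted diff does.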

> So, it seems that for proper parsing, quotation with double quotes
> should be checked for free text,
> and a semicolon inside that quotation is not a separator
> for attributes, and a semicolon may not be preceded by a backslash.

I've changed the parser to do so. This means the patch was not used.

http://github.com/ngoto/bioruby/commit/e38fd48aaf41f94eaec39a639a7f6c5db62c22e8
(This is my repository. Because the change seems substantial,
I'll push it to the main BioRuby repository later,
after more checking.)

To prevent the bug from recurring, I would like to use the GFF string
from your mail in the BioRuby test script
(test/unit/bio/db/test_gff.rb).
May I have your permission?

Best regards,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> 
> Anyway, the file I am looking at now is not that complex,
> and I will go with a quick hack for the time being.
> 
> Best regards,
> 
> Tomoaki
> 
> the test program
> $ cat test-gff.rb
> #!/usr/local/bin/ruby
> require 'bio'
> gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
> Bio::GFF.new(gff_str).records.each do |fr|
>    p fr
> end
> 
> output after patch
> $ /usr/local/bin/ruby test-gff.rb
> #<Bio::GFF::Record:0x2b0ef0eb0648 @frame="0", @start="11052",  
> @comments=nil, @strand="-", @feature="CDS", @score=".",  
> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"",  
> "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064",  
> @seqname="LG_I">
> 
> output from current
> #<Bio::GFF::Record:0x2b825ff16640 @frame="0", @start="11052",  
> @comments=nil, @strand="-", @feature="CDS", @score=".",  
> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId  
> 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">
> 
> older output
> #<Bio::GFF::Record:0x1e3674 @end="11064", @seqname="LG_I",  
> @frame="0", @start="11052", @comments=nil, @strand="-",  
> @feature="CDS", @score=".", @source="JGI", @attributes= 
> {"name"=>"\"grail3.0116000101", "proteinId"=>"63957",  
> "exonNumber"=>"3"}>
> 
> diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb  2008-09-03 22:24:39.000000000 +0900
> +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
> @@ -122,7 +122,7 @@
>         def parse_attributes(attributes)
>           hash = Hash.new
>           scanner = StringScanner.new(attributes)
> -        while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
> +        while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
>             key, value = scanner[1].split(' ', 2)
>             key.strip!
>             value.strip! if value
> 
> 
> -- 
> Tomoaki NISHIYAMA
> 
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan