[BioRuby] GFF attributes
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Tue Sep 9 11:47:46 UTC 2008
Hi,
On Fri, 5 Sep 2008 15:43:05 +0900
Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> Hi,
>
> When extracting attributes from a GFF file,
> the older implementation seems to have eaten the last character before ";".
> The current one (downloaded very recently from GitHub) does not split well,
> because the regular expression searches for the largest match.
Thank you for reporting a bug.
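
For readers following along, here is a minimal sketch of the "largest match" problem, using the attribute string from the test program quoted further down. The variable names are only for illustration. The non-greedy variant shown is not a complete fix, because it would still split on a semicolon inside double quotes.

```ruby
require 'strscan'

# Attribute column taken from the test program quoted in this mail.
attributes = 'name "grail3.0116000101"; proteinId 639579; exonNumber 3'

# Greedy: .* runs up to the LAST semicolon, so two attributes are fused.
greedy = StringScanner.new(attributes)
greedy.scan(/(.*[^\\]);/)
greedy_first = greedy[1]

# Non-greedy: .*? stops at the FIRST semicolon not preceded by a backslash.
lazy = StringScanner.new(attributes)
lazy.scan(/(.*?[^\\]);/)
lazy_first = lazy[1]

p greedy_first
p lazy_first
```

The greedy result corresponds to the fused "name" value in the "output from current" shown below.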
> A patch is included, but I am not sure about the specification.
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> The specification says:
> > From version 2 onwards, the attribute field must have an tag value
> > structure following the syntax used within objects in a .ace file,
> > flattened onto one line by semicolon separators. Tags must be
> > standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
> > be quoted with double quotes. Note: all non-printing characters in
> > such free text value strings (e.g. newlines, tabs, control
> > characters, etc) must be explicitly represented by their C (UNIX)
> > style backslash-escaped representation (e.g. newlines as '\n', tabs
> > as '\t').
I also looked at BioPerl's _from_gff2_string in Bio::Tools::GFF
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Tools/GFF.html#CODE10
It seems it still has bugs (as described in comments in their code),
but semicolons inside double quotes are treated as normal characters
and not as separators for attributes.
> So, it seems that for proper parsing, quotation with double quotes
> should be checked for free text,
> and a semicolon in that quotation is not a separator
> for attributes and a semicolon may not be preceded by a backslash.
I've changed the parser to do so. This means the patch was not used.
http://github.com/ngoto/bioruby/commit/e38fd48aaf41f94eaec39a639a7f6c5db62c22e8
(This is my repository. Because the change is significant,
I'll push it to the main BioRuby repository later,
after more checking.)
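
To illustrate the idea (this is NOT the code from the commit above, only a rough sketch), attributes can be split on semicolons that are outside double quotes. The method name split_gff2_attributes is hypothetical, and the sketch assumes that backslash escapes only matter inside quoted values.

```ruby
require 'strscan'

# Hypothetical helper for illustration; not part of BioRuby.
# Returns an array of [tag, value] pairs; value is nil when absent.
def split_gff2_attributes(str)
  scanner = StringScanner.new(str)
  fields = []
  current = ''.dup
  until scanner.eos?
    if scanner.scan(/"(?:[^"\\]|\\.)*"/)
      # A complete double-quoted string: keep it whole, semicolons included.
      current << scanner.matched
    elsif scanner.scan(/;/)
      # A semicolon outside quotes ends the current attribute.
      fields << current.strip
      current = ''.dup
    else
      # Any other single character (also an unterminated quote).
      current << scanner.getch
    end
  end
  fields << current.strip
  fields.reject(&:empty?).map { |f| f.split(' ', 2) }
end

p split_gff2_attributes('name "grail3.0116000101"; proteinId 639579; exonNumber 3')
p split_gff2_attributes('note "a;b"; exonNumber 3')
```

Double quotes are kept in the values here, as in the "output after patch" below. Whether to strip them is a separate design decision.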
To prevent the bug from recurring, I would like to use the GFF string
from your mail in the BioRuby test script
(test/unit/bio/db/test_gff.rb).
Could you give me permission?
Best regards,
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
> Anyway, the file I am looking at now is not that complex,
> and I will go with a quick hack at this time.
>
> Best regards,
>
> Tomoaki
>
> the test program
> $ cat test-gff.rb
> #!/usr/local/bin/ruby
> require 'bio'
> gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
> Bio::GFF.new(gff_str).records.each do |fr|
> p fr
> end
>
> output after patch
> $ /usr/local/bin/ruby test-gff.rb
> #<Bio::GFF::Record:0x2b0ef0eb0648 @frame="0", @start="11052",
> @comments=nil, @strand="-", @feature="CDS", @score=".",
> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"",
> "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064",
> @seqname="LG_I">
>
> output from current
> #<Bio::GFF::Record:0x2b825ff16640 @frame="0", @start="11052",
> @comments=nil, @strand="-", @feature="CDS", @score=".",
> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId
> 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">
>
> older output
> #<Bio::GFF::Record:0x1e3674 @end="11064", @seqname="LG_I",
> @frame="0", @start="11052", @comments=nil, @strand="-",
> @feature="CDS", @score=".", @source="JGI", @attributes=
> {"name"=>"\"grail3.0116000101", "proteinId"=>"63957",
> "exonNumber"=>"3"}>
>
> diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb 2008-09-03 22:24:39.000000000 +0900
> +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
> @@ -122,7 +122,7 @@
> def parse_attributes(attributes)
> hash = Hash.new
> scanner = StringScanner.new(attributes)
> - while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
> + while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
> key, value = scanner[1].split(' ', 2)
> key.strip!
> value.strip! if value
>
>
> --
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby