[BioRuby] GFF attributes
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Fri Sep 5 06:43:05 UTC 2008
Hi,
When extracting attributes from a GFF file,
the older implementation seems to eat the last character before ";".
The current one (downloaded very recently from GitHub) does not split the attributes correctly,
because the regular expression is greedy and matches up to the last ";".
A patch is included, but I am not sure about the specification.
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
The specification says:
> From version 2 onwards, the attribute field must have an tag value
> structure following the syntax used within objects in a .ace file,
> flattened onto one line by semicolon separators. Tags must be
> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
> be quoted with double quotes. Note: all non-printing characters in
> such free text value strings (e.g. newlines, tabs, control
> characters, etc) must be explicitly represented by their C (UNIX)
> style backslash-escaped representation (e.g. newlines as '\n', tabs
> as '\t').
So, it seems that for proper parsing, double-quote quotation
should be checked for free text:
a semicolon inside such a quotation is not a separator
for attributes, and a semicolon may not be preceded by a backslash.
Anyway, the file I am looking at now is not that complex,
so I will go with a quick hack for now.
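For reference, the quote-aware splitting described above could be sketched roughly as follows. This is only an illustration under my reading of the specification, not BioRuby code: the helper name split_gff2_attributes is hypothetical, and it assumes that a backslash inside a quoted value escapes the next character.

```ruby
require 'strscan'

# Hypothetical helper (not part of BioRuby): split a GFF2 attribute
# column on semicolons that lie outside double-quoted free text.
def split_gff2_attributes(column)
  scanner = StringScanner.new(column)
  fields = []
  current = String.new
  until scanner.eos?
    if (quoted = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      current << quoted            # whole quoted value, kept intact
    elsif scanner.scan(/;/)
      fields << current.strip      # unquoted semicolon ends an attribute
      current = String.new
    else
      current << scanner.getch     # any other single character
    end
  end
  fields << current.strip
  # Drop empty fields (e.g. after a trailing ";") and split tag from value.
  fields.reject(&:empty?).map { |field| field.split(' ', 2) }
end

p split_gff2_attributes('name "a;b"; proteinId 639579; exonNumber 3')
```

With this approach the semicolon inside "a;b" stays part of the value, while the two unquoted semicolons separate the three attributes.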
Best regards,
Tomoaki
the test program
$ cat test-gff.rb
#!/usr/local/bin/ruby
require 'bio'
gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
Bio::GFF.new(gff_str).records.each do |fr|
p fr
end
output after patch
$ /usr/local/bin/ruby test-gff.rb
#<Bio::GFF::Record:0x2b0ef0eb0648 @frame="0", @start="11052",
@comments=nil, @strand="-", @feature="CDS", @score=".",
@source="JGI", @attributes={"name"=>"\"grail3.0116000101\"",
"proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064",
@seqname="LG_I">
output from current
#<Bio::GFF::Record:0x2b825ff16640 @frame="0", @start="11052",
@comments=nil, @strand="-", @feature="CDS", @score=".",
@source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId
639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">
older output
#<Bio::GFF::Record:0x1e3674 @end="11064", @seqname="LG_I",
@frame="0", @start="11052", @comments=nil, @strand="-",
@feature="CDS", @score=".", @source="JGI", @attributes=
{"name"=>"\"grail3.0116000101", "proteinId"=>"63957",
"exonNumber"=>"3"}>
diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
@@ -122,7 +122,7 @@
def parse_attributes(attributes)
hash = Hash.new
scanner = StringScanner.new(attributes)
- while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
+ while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
key, value = scanner[1].split(' ', 2)
key.strip!
value.strip! if value
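
The effect of the changed pattern can be checked without BioRuby. The sketch below applies the two regular expressions from the diff in the same scan loop; the helper name scan_fields is hypothetical and exists only for this illustration.

```ruby
require 'strscan'

# Hypothetical helper for illustration: apply a field pattern in the
# same loop shape as parse_attributes and collect the raw fields.
def scan_fields(attributes, field_pattern)
  scanner = StringScanner.new(attributes)
  fields = []
  while scanner.scan(field_pattern) || scanner.scan(/(.+)/)
    fields << scanner[1].strip
  end
  fields
end

attributes = 'name "grail3.0116000101"; proteinId 639579; exonNumber 3'

greedy  = /(.*[^\\])\;/            # pattern before the patch
patched = /(([^;]|\\;)*[^\\])\;/   # pattern after the patch

p scan_fields(attributes, greedy)
p scan_fields(attributes, patched)
```

The greedy pattern runs to the last ";" and so returns two fields, with proteinId swallowed into the value of name; the patched pattern stops at the first unescaped ";" and returns three.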
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan