[Bioperl-l] RE: Proposed GFF version 3

Tony Cox avc at sanger.ac.uk
Tue Feb 11 19:00:42 EST 2003



Largely my experience too!

Tony

+> 
+> Everywhere outside of WormBase and DAS I've personally seen 
+> uses '&'.  We had to implement ';' to cope with DAS.
+> 
+> ----- Original Message -----
+> From: "Richard Durbin" <rd at sanger.ac.uk>
+> To: <lstein at cshl.org>
+> Cc: <bioperl-l at bioperl.org>; <suzi at fruitfly.org>; 
+> <gff-list at sanger.ac.uk>
+> Sent: Tuesday, February 11, 2003 4:54 AM
+> Subject: Re: Proposed GFF version 3
+> 
+> 
+> > Swap them entirely.  i.e. put the attributes in column 9 and call that
+> > "attributes" and put the new hierarchical group term in column 10 and
+> > call that "group".  Or perhaps it would be better to call it something
+> > else to minimise confusion, because in gff version 1 column 9 was
+> > called group.  What about calling column 10 "cluster"?
+> >
+> > I see you have switched to URL type format for the attributes, away
+> > from acedb.  That's fine - URL format is much more universal.  But is
+> > ';' a standard separator in URLs?  I just looked and see that Ensembl
+> > uses '&' and WormBase uses ';' and I think I have seen '+' somewhere,
+> > so maybe there is no standard.
+> >
+> > Richard
+> >
+> > Lincoln Stein wrote:
+> > > Hi Richard,
+> > >
+> > > Do you mean that we should swap columns 9 and 10 entirely, or just
+> > > swap their names?  I think you mean the former, but I want to be sure.
+> > >
+> > > Lincoln
+> > >
+> > > On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
+> > >
+> > >>Hello all,
+> > >>
+> > >>This looks very nice to me.  Not surprising perhaps because I had an
+> > >>earlier involvement as Lincoln says.
+> > >>
+> > >>I have added gff-list at sanger.ac.uk to the mailing Cc: list because
+> > >>it is the "official" GFF mailing list, although it is very little
+> > >>used.
+> > >>
+> > >>I have one major comment: columns 9 (group) and 10 (attributes)
+> > >>should be switched.  Although column 9 was called "group" in GFF
+> > >>version 1, in version 2, which has been current for over two
+> > >>years, it was renamed "attribute" and contains the attribute
+> > >>information.  For consistency we should keep column 9 for the
+> > >>attributes.  Also, in many cases there will be attributes but no
+> > >>group.
+> > >>
+> > >>I like ID and Target.  I see the idea with hsp's for gapped
+> > >>alignments, though perhaps they could be called "match_block".  But
+> > >>there is a case I think to also encode gapped alignments on one
+> > >>line, perhaps using the CIGAR encoding used by ENSEMBL (and
+> > >>BioPerl?), e.g. as
+> > >>
+> > >> Target=M1:1..1000;Align=xxxxxxx
+> > >>
+> > >>(sorry, I don't know cigar format well enough to write a legal
+> > >>string.)
+> > >>
+> > >>Richard
+> > >>
+> > >>Lincoln Stein wrote:
+> > >>
+> > >>>This letter is to discuss a proposed extension to GFF.  It arises
+> > >>>from conversations with Richard Durbin during last fall's Hinxton
+> > >>>genome informatics meeting.
+> > >>>
+> > >>>Although there are many richer ways of representing genomic
+> > >>>features via XML, the stubborn persistence of a variety of ad-hoc
+> > >>>tab-delimited flat file formats declares the bioinformatics
+> > >>>community's need for a simple format that can be modified with a
+> > >>>text editor and processed with shell tools like grep.  The GFF
+> > >>>format, although widely used, has fragmented into multiple
+> > >>>incompatible dialects.  When asked why they have modified the
+> > >>>published Sanger specification, bioinformaticists frequently answer
+> > >>>that the format was insufficient for their needs, and they needed
+> > >>>to extend it.  The proposed GFF3 format addresses the most common
+> > >>>extensions to GFF, while preserving backward compatibility with
+> > >>>previous formats. The new format:
+> > >>>
+> > >>>    1) adds a mechanism for representing more than one level
+> > >>>       of hierarchical grouping of features and subfeatures.
+> > >>>    2) separates the ideas of group membership and feature name/id
+> > >>>    3) constrains the feature type field to be taken from a controlled
+> > >>>       vocabulary.
+> > >>>    4) allows a single feature, such as an exon, to belong to more than
+> > >>>       one group at a time.
+> > >>>    5) provides one level of relative addressing for subfeatures (e.g.
+> > >>>       exons can be expressed in transcript coordinates)
+> > >>>    6) provides an explicit convention for pairwise alignments
+> > >>>    7) provides an explicit convention for features that occupy
+> > >>>       disjunct regions
+> > >>>
+> > >>>The format consists of 10 columns, separated by spaces.  The
+> > >>>following unescaped characters are allowed within fields:
+> > >>>[a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be
+> > >>>escaped using the URL escaping conventions.  Unescaped quotation
+> > >>>marks, backslashes and other ad-hoc escaping conventions that have
+> > >>>been added to the GFF format are explicitly forbidden.  The =, ;
+> > >>>and % characters have reserved meanings as described below.
+> > >>>
+> > >>>Undefined fields are replaced with the "." character, as described
+> > >>>in the original GFF spec.
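
The escaping rule above could be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: gff3_escape and gff3_unescape are hypothetical names (not part of Bio::Tools::GFF), the input is assumed to be plain ASCII, and the reserved characters "=", ";" and "%" are escaped along with everything outside the allowed set.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Bio::Tools::GFF.
# Characters the proposal allows unescaped, minus the reserved "=", ";", "%".
my $SAFE = 'a-zA-Z0-9.:^*$@!+_?-';

# Percent-encode every character outside the safe set.
# An undefined or empty value becomes the "." placeholder.
sub gff3_escape {
    my ($text) = @_;
    return '.' unless defined $text && length $text;
    $text =~ s/([^${SAFE}])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Reverse the encoding; "." reads back as an undefined field.
sub gff3_unescape {
    my ($field) = @_;
    return if $field eq '.';
    $field =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $field;
}

print gff3_escape('unc-3 like; see note'), "\n";
```

Escaping the reserved characters unconditionally is the conservative choice for values; a writer that emits whole columns would add the literal ";" and "=" separators after escaping each piece.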
+> > >>>
+> > >>>Column 1: "seqid"
+> > >>>
+> > >>>The ID of the landmark used to establish the coordinate system for
+> > >>>the current feature.  IDs must contain alphanumeric characters.
+> > >>>Whitespace, if present, must be escaped using URL escaping rules
+> > >>>(e.g. space="%20").
+> > >>>
+> > >>>Column 2: "source"
+> > >>>
+> > >>>The source of the feature.  This is unchanged from the older GFF
+> > >>>specs and is not part of a controlled vocabulary.
+> > >>>
+> > >>>Column 3: "type"
+> > >>>
+> > >>>The type of the feature (previously called the "method").  This is
+> > >>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
+> > >>>accession number.  The latter alternative is distinguished using
+> > >>>the syntax SOFA:000000.
+> > >>>
+> > >>>Columns 4 & 5: "start" and "end"
+> > >>>
+> > >>>The start and end of the feature, in 1-based integer coordinates,
+> > >>>relative to the landmark given in column 1.  Start is less than
+> > >>>end.
+> > >>>
+> > >>>Column 6: "score"
+> > >>>
+> > >>>The score of the feature, a floating point number.  As in earlier
+> > >>>versions of the format, the semantics of the score are ill-defined.
+> > >>>It is strongly recommended that E-values be used for sequence
+> > >>>similarity features, and that P-values be used for ab initio gene
+> > >>>prediction features.
+> > >>>
+> > >>>Column 7: "strand"
+> > >>>
+> > >>>The strand of the feature.  + for positive strand (relative to the
+> > >>>landmark), - for minus strand, and . for features that are not
+> > >>>stranded.  In addition, ? can be used for features whose
+> > >>>strandedness is relevant, but unknown.
+> > >>>
+> > >>>Column 8: "phase"
+> > >>>
+> > >>>The phase of the feature, for protein-encoding features (primarily
+> > >>>CDSs).  This is an integer-valued field with the values 0, 1, or 2.
+> > >>>The integer indicates the offset from the start of the feature to
+> > >>>the first base of the first codon in the reading frame.  "." is
+> > >>>used for features that do not correspond to a reading frame.
+> > >>>
+> > >>>Column 9: "group"
+> > >>>
+> > >>>A list of the immediate parents of the current feature.  Multiple
+> > >>>parents are allowed (example: one exon shared by multiple
+> > >>>transcripts). Multiple parents are separated by a semicolon.
+> > >>>Parentless features have a dot in this field.
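
A reader of the group column might look something like the sketch below. parse_parents is a hypothetical name used only for illustration, and it assumes parent IDs are URL-escaped in the same way as other fields.

```perl
use strict;
use warnings;

# Hypothetical helper: return the parent IDs named in the group column.
# A "." (or a missing value) means the feature has no parent.
sub parse_parents {
    my ($group) = @_;
    return () if !defined $group || $group eq '.';
    my @parents = split /;/, $group;
    # Undo URL escapes only after splitting, so an escaped ";" stays in its ID.
    s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge for @parents;
    return @parents;
}

# An exon shared by two transcripts lists both of them.
print join(', ', parse_parents('Y74C9A.1a;Y74C9A.1b')), "\n";
```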
+> > >>>
+> > >>>Column 10: "attributes"
+> > >>>
+> > >>>A list of feature attributes in the format tag=value.  Multiple
+> > >>>tag=value pairs are separated by semicolons.  URL escaping rules
+> > >>>are used for tags or values containing whitespace, "=" characters
+> > >>>and semicolons.
+> > >>>
+> > >>>Two tags are special:
+> > >>>
+> > >>>    ID Indicates the name of the feature.  IDs must be unique
+> > >>>    within the scope of the GFF file.
+> > >>>
+> > >>>    Target Indicates the target of a nucleotide to nucleotide or
+> > >>>    nucleotide to protein alignment.  The format of the
+> > >>>    value is "target_id:start..end"  Start may be greater
+> > >>>    than end to indicate a + strand alignment to the
+> > >>>    reverse complement of a target nucleotide sequence.
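
As a rough illustration of the attributes column and the Target tag described above, here is a minimal Perl sketch. parse_attributes and parse_target are hypothetical names, not the Bio::Tools::GFF API; the sketch assumes a repeated tag simply overwrites the earlier value, which the proposal does not specify.

```perl
use strict;
use warnings;

# Hypothetical helper: split "tag=value;tag=value" into a hash reference.
# URL escapes are undone after splitting so escaped ";" and "=" survive.
sub parse_attributes {
    my ($column) = @_;
    my %attr;
    return \%attr if !defined $column || $column eq '.';
    for my $pair (split /;/, $column) {
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $tag && length $tag;
        $value = '' unless defined $value;
        s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge for $tag, $value;
        $attr{$tag} = $value;    # assumption: the last occurrence wins
    }
    return \%attr;
}

# Hypothetical helper: parse a Target value "target_id:start..end".
# Returns (id, start, end, reversed) or an empty list if malformed;
# reversed is 1 when start is greater than end.
sub parse_target {
    my ($value) = @_;
    return unless defined $value && $value =~ /^(.+):(\d+)\.\.(\d+)$/;
    my ($id, $start, $end) = ($1, $2, $3);
    return ($id, $start, $end, $start > $end ? 1 : 0);
}

my $attr = parse_attributes('ID=12345.s;Target=cb123:1001..1100');
my ($id, $start, $end, $reversed) = parse_target($attr->{Target});
print "$attr->{ID} aligns to $id from $start to $end\n";
```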
+> > >>>
+> > >>>In the example GFF3 file given below, the first column contains
+> > >>>line numbers that I have added for the purposes of the narrative.
+> > >>>Here are some common scenarios that I have attempted to illustrate:
+> > >>>
+> > >>>A) a simple feature, no public ID
+> > >>>
+> > >>>Line 2 in the example is a feature of type "repeat". It has a start
+> > >>>and an end and no ID, but it does have an attribute named "Note."
+> > >>>
+> > >>>B) a simple feature with a public ID
+> > >>>
+> > >>>Line 3 is a feature of type clone.  It has a start and an end.  Its
+> > >>>parent is undefined (empty column 9), but it has an attribute of
+> > >>>type ID with value "cTel33B."
+> > >>>
+> > >>>C) a feature with multiple attributes
+> > >>>
+> > >>>Line 5 is a feature of type "gene."  It has no parent, and has 
+> > >>>attributes of type ID, Note, and GO_term.
+> > >>>
+> > >>>D) a hierarchical grouping of features
+> > >>>
+> > >>>Lines 5-13 demonstrate a hierarchical grouping.  At the top level
+> > >>>is line 5, which defines the extent of a "gene" with ID Y74C9A.1.
+> > >>>Below this are two features of type mRNA (lines 6 and 7).  Their
+> > >>>group fields contain the ID of Y74C9A.1, indicating that this
+> > >>>feature is their immediate parent.  In the 10th column, the mRNA
+> > >>>features have their own IDs independent of the ID of the parent
+> > >>>gene.
+> > >>>
+> > >>>This pattern is repeated for the exons listed on lines 8-11.  Exons
+> > >>>e1, e2, and e4 belong to both of the transcripts.  Therefore, both
+> > >>>transcript IDs are listed in the group column, separated by
+> > >>>semicolons.
+> > >>>
+> > >>>Exon e3 belongs only to one of the transcripts, and therefore only
+> > >>>that transcript's ID is listed in the group column.
+> > >>>
+> > >>>Lines 12 and 13 indicate coding_start and coding_end features.
+> > >>>These subfeatures are hierarchically grouped underneath their
+> > >>>corresponding exons, but they do not have independent public IDs.
+> > >>>
+> > >>>E) Disjunct coordinates
+> > >>>
+> > >>>Lines 14-16 illustrate a single feature -- the CDS corresponding
+> > >>>to mRNA Y74C9A.1a -- which occupies multiple disjunct regions.  The
+> > >>>group column indicates that the CDS belongs to mRNA Y74C9A.1a.
+> > >>>However, the attribute column assigns each of lines 14-16 the same
+> > >>>ID.  Because the ID is the same, this is to be interpreted as a
+> > >>>single feature that spans multiple locations.
+> > >>>
+> > >>>F) Alignments
+> > >>>
+> > >>>Lines 17-19 demonstrate a gapped alignment of two sequences using
+> > >>>the reserved Target attribute.  Each non-gapped segment becomes a
+> > >>>line in the GFF3 file.  The segments each share the same ID,
+> > >>>thereby indicating that the segments are disjunct regions of the
+> > >>>same feature. The Target attribute indicates the ID of the target
+> > >>>sequence (which does not have to be represented in the GFF3 file)
+> > >>>and the start and end coordinates of the aligned target.
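
The "same ID means one feature" convention could be handled by a reader along these lines. This is a sketch only: merge_by_id and the hash keys id, start and end are names assumed for the illustration, and sorting the segments by start is a choice of the sketch rather than something the proposal requires.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse rows that share an ID into one feature.
# Each row is a hash reference with the (assumed) keys id, start and end.
sub merge_by_id {
    my (@rows) = @_;
    my (%segments, @order);
    for my $row (@rows) {
        # Remember the order in which IDs were first seen.
        push @order, $row->{id} unless exists $segments{ $row->{id} };
        push @{ $segments{ $row->{id} } }, [ $row->{start}, $row->{end} ];
    }
    my @features;
    for my $id (@order) {
        my @parts = sort { $a->[0] <=> $b->[0] } @{ $segments{$id} };
        my $location = join ',', map { "$_->[0]..$_->[1]" } @parts;
        $location = "join($location)" if @parts > 1;
        push @features, "$id $location";
    }
    return @features;
}

# The three match segments from lines 17-19 of the example file.
print "$_\n" for merge_by_id(
    { id => '12345.s', start => 1,   end => 100 },
    { id => '12345.s', start => 101, end => 500 },
    { id => '12345.s', start => 501, end => 1000 },
);
```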
+> > >>>
+> > >>>Unlike the GFF1 and GFF2 formats, the group field for gapped
+> > >>>alignments can be empty. However, a valid alternative
+> > >>>representation is to create a single "match" feature, and a series
+> > >>>of "hsp" features underneath it via the group field.  Lines 20-22
+> > >>>show this alternative representation.
+> > >>>
+> > >>>G) Relative coordinates
+> > >>>
+> > >>>Lines 23-26 illustrate using relative coordinate addressing in
+> > >>>feature/subfeature relationships.  Line 23 defines an mRNA that is
+> > >>>positioned on sequence landmark "I" from positions 5000 to 6000.
+> > >>>Its ID field indicates that it is M7.3.  Lines 24-26 are exon
+> > >>>subfeatures of M7.3 as indicated by their group field.  However,
+> > >>>the seqid field specifies M7.3 as the parent coordinate system,
+> > >>>thereby allowing the exons to begin at position 1.
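
Mapping such relative coordinates back onto the parent's landmark is a one-line offset. The sketch below uses a hypothetical to_absolute helper and assumes a forward-strand parent with 1-based, inclusive coordinates.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based interval given relative to a parent
# feature onto the parent's own landmark.  Assumes a forward-strand parent.
sub to_absolute {
    my ($parent_start, $rel_start, $rel_end) = @_;
    return ($parent_start + $rel_start - 1, $parent_start + $rel_end - 1);
}

# The three exons of M7.3, whose parent starts at position 5000 on "I".
printf "%d..%d\n", to_absolute(5000, @$_) for [1, 300], [301, 400], [401, 1000];
```

This prints 5000..5299, 5300..5399 and 5400..5999, which agrees with the exon coordinates in the script output quoted further down.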
+> > >>>
+> > >>>  0  ##gff-version 3
+> > >>>  1  ##sequence-region I:1..14972282
+> > >>>  2  I       wormbase  repeat        5000   5100      .    .  .  .                      Note=ALU3
+> > >>>  3  I       wormbase  clone         1      2679      .    +  .  .                      ID=cTel33B
+> > >>>  4  I       wormbase  contig        1      14972282  .    +  .  .                      ID=CHROMOSOME_I
+> > >>>  5  I       wormbase  gene          43733  44677     .    +  .  .                      ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
+> > >>>  6  I       wormbase  mRNA          43733  44677     .    +  .  Y74C9A.1               ID=Y74C9A.1a
+> > >>>  7  I       wormbase  mRNA          43733  44677     .    +  .  Y74C9A.1               ID=Y74C9A.1b
+> > >>>  8  I       wormbase  exon          43733  43961     .    +  .  Y74C9A.1a;Y74C9A.1b    ID=e1
+> > >>>  9  I       wormbase  exon          44030  44234     .    +  .  Y74C9A.1a;T:Y74C9A.1b  ID=e2
+> > >>> 10  I       wormbase  exon          44281  44328     .    +  .  Y74C9A.1b              ID=e3
+> > >>> 11  I       wormbase  exon          44521  44677     .    +  .  Y74C9A.1a;T:Y74C9A.1b  ID=e4
+> > >>> 12  I       wormbase  coding_start  43740  43740     .    +  .  e1
+> > >>> 13  I       wormbase  coding_end    44677  44677     .    +  .  e4
+> > >>> 14  I       wormbase  cds           43740  43961     .    +  0  Y74C9A.1a
+> > >>> 15  I       wormbase  cds           44030  44234     .    +  1  Y74C9A.1a
+> > >>> 16  I       wormbase  cds           44521  44677     .    +  1  Y74C9A.1a
+> > >>> 17  I       wormbase  match         1      100       100  .  .  .                      ID=12345.s;Target=cb123:1001..1100
+> > >>> 18  I       wormbase  match         101    500       20   .  .  .                      ID=12345.s;Target=cb123:1101..1500
+> > >>> 19  I       wormbase  match         501    1000      80   .  .  .                      ID=12345.s;Target=cb123:1501..2000
+> > >>> 20  I       wormbase  match         5001   6000      100  .  .  .                      ID=abc;Target=M1:1..1000
+> > >>> 21  I       wormbase  hsp           5001   5500      .    .  .  abc                    Target=M1:1..500
+> > >>> 22  I       wormbase  hsp           5501   6000      .    .  .  abc                    Target=M1:501..100
+> > >>> 23  I       wormbase  mRNA          5000   6000      +    .  .  .                      ID=M7.3
+> > >>> 24  M7.3    wormbase  exon          1      300       +    .  .  M7.3                   ID=M7.3.1
+> > >>> 25  M7.3    wormbase  exon          301    400       +    .  .  M7.3                   ID=M7.3.2
+> > >>> 26  M7.3    wormbase  exon          401    1000      +    .  .  M7.3                   ID=M7.3.3
+> > >>>
+> > >>>=================================================================
+> > >>>
+> > >>>I have extended (in an experimental way) the Bio::Tools::GFF
+> > >>>module to accommodate this new format.  Here is a test script and
+> > >>>its output when run on the above file.
+> > >>>
+> > >>>  #!/usr/bin/perl -w
+> > >>>  use strict;
+> > >>>  use lib '.';
+> > >>>
+> > >>>  use Bio::Tools::GFF;
+> > >>>
+> > >>>  # Read GFF3 from standard input and collect the top-level features.
+> > >>>  my $gffio = Bio::Tools::GFF->new(-fh => \*STDIN, -gff_version => 3);
+> > >>>  my @f = $gffio->features;
+> > >>>  format_features(\@f);
+> > >>>
+> > >>>  # Print one line per feature; subfeatures get one extra tab per level.
+> > >>>  sub format_features {
+> > >>>    my $features = shift;
+> > >>>    my $tabs     = shift || 0;
+> > >>>    for my $f (@$features) {
+> > >>>      my $type = $f->primary_tag;
+> > >>>      my $id   = $f->unique_id || '(no id)';
+> > >>>      my $alt  = ($f->alternative_locations)[0];
+> > >>>
+> > >>>      # The eval yields an empty list when there is no alternative
+> > >>>      # location, so nothing extra is printed for such features.
+> > >>>      print "\t" x $tabs,
+> > >>>            join("\t", $id, $type, $f->location->to_FTstring,
+> > >>>                 eval { $alt->location->seq_id,
+> > >>>                        $alt->location->to_FTstring }),
+> > >>>            "\n";
+> > >>>      format_features([$f->sub_SeqFeature], $tabs + 1);
+> > >>>    }
+> > >>>  }
+> > >>>
+> > >>>  1;
+> > >>>
+> > >>>OUTPUT:
+> > >>>
+> > >>>cTel33B clone 1..2679
+> > >>>CHROMOSOME_I contig 1..14972282
+> > >>>12345.s match join(101..500,1..100,501..1000)
+> > >>>M7.3 mRNA 5000..6000
+> > >>> M7.3.1 exon 5000..5299
+> > >>> M7.3.2 exon 5300..5399
+> > >>> M7.3.3 exon 5400..5999
+> > >>>abc match 5001..6000
+> > >>> (no id) hsp 5001..5500
+> > >>> (no id) hsp 5501..6000
+> > >>>(no id) repeat 5000..5100
+> > >>>Y74C9A.1 gene 43733..44677
+> > >>> Y74C9A.1a mRNA 43733..44677
+> > >>> e1 exon 43733..43961
+> > >>> (no id) coding_start 43740
+> > >>> e2 exon 44030..44234
+> > >>> e4 exon 44521..44677
+> > >>> (no id) coding_end 44677
+> > >>> (no id) cds 43740..43961
+> > >>> (no id) cds 44030..44234
+> > >>> (no id) cds 44521..44677
+> > >>> Y74C9A.1b mRNA 43733..44677
+> > >>> e1 exon 43733..43961
+> > >>> (no id) coding_start 43740
+> > >>> e3 exon 44281..44328
+> > >>
+> > >
+> >
+> >
+> 


