GENERIC FEATURE FORMAT VERSION 3

Author:  Lincoln Stein
Date:    15 June 2003
Version: 0.9

Although there are many richer ways of representing genomic features
via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
flat file formats declares the bioinformatics community's need for a
simple format that can be modified with a text editor and processed
with shell tools like grep.  The GFF format, although widely used, has
fragmented into multiple incompatible dialects.  When asked why they
have modified the published Sanger specification, bioinformaticists
frequently answer that the format was insufficient for their needs,
and they needed to extend it.  The proposed GFF3 format addresses the
most common extensions to GFF, while preserving backward compatibility
with previous formats. The new format:

    1) adds a mechanism for representing more than one level 
       of hierarchical grouping of features and subfeatures.
    2) separates the ideas of group membership and feature name/id
    3) constrains the feature type field to be taken from a controlled
       vocabulary.
    4) allows a single feature, such as an exon, to belong to more than
       one group at a time.
    5) adds an explicit convention for pairwise alignments
    6) adds an explicit convention for features that occupy disjunct
       regions

DESCRIPTION OF THE FORMAT
-------------------------

The format consists of 9 columns, separated by spaces.  The following
unescaped characters are allowed within fields:
[a-zA-Z0-9.:^*$@!+_?-].  All other characters must be escaped
using the URL escaping conventions.  Unescaped quotation marks,
backslashes and other ad-hoc escaping conventions that have been added
to the GFF format are explicitly forbidden.  The =, ; and % characters
have reserved meanings as described below, and must be escaped when
used in other contexts.
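
As an illustration of the escaping rule, the following Python sketch
percent-encodes every character outside the allowed set given above
and reverses the operation.  The function names escape_field and
unescape_field are invented for this example, and the sketch assumes
UTF-8 for characters outside ASCII, which this document does not
specify.

```python
import re
from urllib.parse import unquote

# Matches any character that is NOT in the set allowed unescaped.
_UNSAFE = re.compile(r"[^a-zA-Z0-9.:^*$@!+_?-]")

def escape_field(text: str) -> str:
    """Percent-encode every character outside the allowed set."""
    def encode(match):
        return "".join("%%%02X" % byte for byte in match.group().encode("utf-8"))
    return _UNSAFE.sub(encode, text)

def unescape_field(text: str) -> str:
    """Reverse escape_field.  A literal '+' is left untouched."""
    return unquote(text)
```

For example, escape_field("exon 1;a=b") gives "exon%201%3Ba%3Db".
Escaping is meant to be applied to an individual tag or value, not to
a whole column, so that the structural = and ; characters survive.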

Undefined fields are replaced with the "." character, as described in
the original GFF spec.

Column 1: "seqid"

The ID of the landmark used to establish the coordinate system for the
current feature.  IDs must contain alphanumeric characters.
Whitespace, if present, must be escaped using URL escaping rules
(e.g. space="%20" or "+").  Sequence IDs must *NOT* begin with an
unescaped ">".

Column 2: "source"

The source of the feature.  This is unchanged from the older GFF specs
and is not part of a controlled vocabulary.

Column 3: "type"

The type of the feature (previously called the "method").  This is
constrained to be either: (a) a term from the "lite" sequence
ontology, SOFA; or (b) a SOFA accession number.  The latter
alternative is distinguished using the syntax SO:0000000.
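
A parser might test column 3 along the following lines.  This is a
Python sketch only: SOFA_TERMS is a hand-picked excerpt of the
appendix rather than the full ontology, the name is_valid_type is
invented here, and the accession pattern assumes the seven-digit form
used in the appendix.

```python
import re

# Hypothetical excerpt; a real check would load the complete SOFA listing.
SOFA_TERMS = {"gene", "mRNA", "exon", "intron", "primary_transcript",
              "match", "cDNA_match", "EST_match"}
_ACCESSION = re.compile(r"SO:\d{7}")

def is_valid_type(value: str) -> bool:
    """True if value is a known term name or looks like an SO accession."""
    return value in SOFA_TERMS or _ACCESSION.fullmatch(value) is not None
```

The comparison is case sensitive, so "gene" passes and "Gene" does
not.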

Columns 4 & 5: "start" and "end"

The start and end of the feature, in 1-based integer coordinates,
relative to the landmark given in column 1.  Start is always less than
or equal to end.

For zero-length features, such as insertion sites, start equals end
and the implied site is to the right of the indicated base.  This
convention holds regardless of the strandedness of the feature.
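
The coordinate rules can be checked mechanically.  The Python sketch
below is one way to do it; the function name check_coordinates and
the optional region argument (the bounds taken from a
##sequence-region directive) are assumptions made for illustration.

```python
def check_coordinates(start, end, region=None):
    """Raise ValueError if a start/end pair breaks the column 4 and 5 rules."""
    if start < 1:
        raise ValueError("coordinates are 1-based, so start must be at least 1")
    if start > end:
        raise ValueError("start must be less than or equal to end")
    if region is not None:
        region_start, region_end = region
        if start < region_start or end > region_end:
            raise ValueError("feature lies outside the declared sequence region")
```

A call such as check_coordinates(1000, 9000, region=(1, 1497228))
returns silently, while check_coordinates(9000, 1000) raises.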

Column 6: "score"

The score of the feature, a floating point number.  As in earlier
versions of the format, the semantics of the score are ill-defined.
It is strongly recommended that E-values be used for sequence
similarity features, and that P-values be used for ab initio gene
prediction features.

Column 7: "strand"

The strand of the feature.  + for positive strand (relative to the
landmark), - for minus strand, and . for features that are not
stranded.  In addition, ? can be used for features whose strandedness
is relevant, but unknown.

Column 8: "phase"

For features of type "exon", the phase indicates where the feature
begins with reference to the reading frame.  The phase is one of the
integers 0, 1, or 2, indicating that the first base of the feature
corresponds to the first, second or last base of the codon,
respectively.  This is NOT to be confused with the frame, but relates
to the relative position of the translational start in whatever strand
the feature is in.

Column 9: "attributes"

A list of feature attributes in the format tag=value.  Multiple
tag=value pairs are separated by semicolons.  URL escaping rules are
used for tags or values containing the following characters: ",=;".
Whitespace should be replaced with the "+" character or the %20 URL
escape.  This will allow the file to survive text processing programs
that convert tabs into spaces.

Six tags are predefined:

    ID	   Indicates the name of the feature.  IDs must be unique
	   within the scope of the GFF file.

    Name   Display name for the feature.  This is the name to be
           displayed to the user.  Unlike IDs, there is no requirement
	   that the Name be unique within the file.

    Alias  A secondary name for the feature.  It is suggested that
	   this tag be used whenever a secondary identifier for the
	   feature is needed, such as locus names and
	   accession numbers.  Unlike ID, there is no requirement
	   that Alias be unique within the file.

    Parent Indicates the parent of the feature.  A parent ID can be
	   used to group exons into transcripts, transcripts into
	   genes, and so forth.  A feature may have multiple parents.

    Target Indicates the target of a nucleotide-to-nucleotide or
	   protein-to-nucleotide alignment.  The format of the
	   value is "target_id+start+end".

    Gap    The alignment of the feature to the target if the two are
          not colinear (e.g. contain gaps).  The alignment format is
          taken from the CIGAR format described in the Exonerate
          documentation
          (http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl).
	  This format consists of a series of (operation,length) pairs
          where operation is one of <M>atch, <I>nsert into reference
          or <D>elete from reference.  For example:

                Chr3   CAAGACCTAAACTGGAT-TCCAAT
                EST23  CAAGACCT---CTGGATATCCAAT

          Chr3 in this case is the reference sequence (the one
          referred to in the first column of the GFF3 file) and EST23
          is the sequence referred to by the Target attribute.  This
          gives a CIGAR string of "M8 D3 M6 I1 M6". The full GFF match line
          will read:
          
   Chr3 . match 1 23 . . . ID=Match1;Target=EST23+1+21;Gap=M8D3M6I1M6
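
The derivation of the Gap string from the two aligned rows can be
sketched in Python as follows.  The function name gap_from_alignment
is invented for this example, "-" is assumed to be the gap character,
and mismatched bases are counted as M, since only three operations
are defined above.

```python
from itertools import groupby

def gap_from_alignment(reference: str, target: str, gap_char: str = "-") -> str:
    """Build a Gap string from two equal-length rows of a pairwise alignment."""
    if len(reference) != len(target):
        raise ValueError("aligned rows must have equal length")
    operations = []
    for ref_base, tgt_base in zip(reference, target):
        if ref_base == gap_char and tgt_base == gap_char:
            raise ValueError("column with a gap in both rows")
        if ref_base == gap_char:
            operations.append("I")   # base present only in the target
        elif tgt_base == gap_char:
            operations.append("D")   # base present only in the reference
        else:
            operations.append("M")
    # Collapse runs of the same operation into (operation, length) pairs.
    return " ".join(f"{op}{len(list(run))}" for op, run in groupby(operations))
```

Applied to the Chr3 and EST23 rows shown above, the function returns
"M8 D3 M6 I1 M6".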

Multiple attributes of the same type are indicated by separating the
values with the comma "," character, as in:

       Parent=AF2312,AB2812,abc-3

Note that attribute names are case sensitive.  "Parent" is not the
same as "parent".

All attributes that begin with an uppercase letter are reserved for
later use.  Attributes that begin with a lowercase letter can be used
freely by applications.
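
Column 9 can be split apart with a few lines of code.  The Python
sketch below is a minimal reading of the rules in this section, not a
reference parser: the name parse_attributes is invented, every tag is
mapped to a list so that comma-separated values are handled uniformly,
and a literal "+" is left as it is.

```python
from urllib.parse import unquote

def parse_attributes(column9: str) -> dict:
    """Return {tag: [value, ...]} for a column 9 string."""
    attributes = {}
    if column9 in ("", "."):
        return attributes          # undefined field
    for pair in column9.split(";"):
        if not pair:
            continue               # tolerate a trailing semicolon
        tag, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"attribute without '=': {pair!r}")
        values = [unquote(item) for item in value.split(",")]
        attributes.setdefault(unquote(tag), []).extend(values)
    return attributes
```

Because tags are case sensitive, "Parent" and "parent" end up under
separate keys.  Splitting happens before unescaping, so an escaped
%3B or %2C inside a value is not mistaken for a separator.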

THE CANONICAL GENE
------------------

To illustrate how a canonical gene should be represented consider
Figure 1 (canonical_gene.png).  This indicates a gene named EDEN
extending from position 1000 to position 9000.  It encodes three
alternatively-spliced transcripts named EDEN.1, EDEN.2 and EDEN.3.  It
also has an identified transcriptional factor binding site located 50
bp upstream from the transcriptional start site of EDEN.1 and EDEN.2.

Here is how this gene should be described using GFF3:

 0  ##gff-version   3
 1  ##sequence-region   ctg123 1 1497228       
 2  ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
 3  ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001

 4  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 5  ctg123 . 5'-UTR          1050  1200  .  +  .  Parent=mRNA00001
 6  ctg123 . CDS             1201  1500  .  +  0  Parent=mRNA00001
 7  ctg123 . CDS             3000  3902  .  +  0  Parent=mRNA00001
 8  ctg123 . CDS             5000  5500  .  +  0  Parent=mRNA00001
 9  ctg123 . CDS             7000  7600  .  +  0  Parent=mRNA00001
10  ctg123 . 3'-UTR          7601  9000  .  +  .  Parent=mRNA00001

11  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00002;Parent=gene00001;Name=EDEN.2
12  ctg123 . 5'-UTR          1050  1200  .  +  .  Parent=mRNA00002
13  ctg123 . CDS             1201  1500  .  +  0  Parent=mRNA00002
14  ctg123 . CDS             5000  5500  .  +  0  Parent=mRNA00002
15  ctg123 . CDS             7000  7600  .  +  0  Parent=mRNA00002
16  ctg123 . 3'-UTR          7601  9000  .  +  .  Parent=mRNA00002

17  ctg123 . mRNA            1300  9000  .  +  .  ID=mRNA00003;Parent=gene00001;Name=EDEN.3
18  ctg123 . 5'-UTR          1300  1500  .  +  .  Parent=mRNA00003
19  ctg123 . 5'-UTR          3000  3300  .  +  .  Parent=mRNA00003
20  ctg123 . CDS             3301  3902  .  +  0  Parent=mRNA00003
21  ctg123 . CDS             5000  5500  .  +  2  Parent=mRNA00003
22  ctg123 . CDS             7000  7600  .  +  2  Parent=mRNA00003
23  ctg123 . 3'-UTR          7601  9000  .  +  .  Parent=mRNA00003

Line 0 gives the GFF version, and line 1 indicates the boundaries of
the region being annotated (a 1,497,228 bp region named "ctg123").

Line 2 defines the boundaries of the gene.  Column 9 of this line
assigns the gene an ID of gene00001, and a human-readable name of
EDEN.  Because the gene is not part of a larger feature, it has no
Parent.

Line 3 annotates the transcriptional factor binding site.  Since it is
logically part of the gene, its Parent attribute is gene00001.

Lines 4-10 define the first alternative transcript.  An mRNA line
indicates the start and stop of the spliced transcript as a whole, and
assigns it ID mRNA00001 and Name EDEN.1. Following this are the two
UTRs and four CDS segments, each of which has mRNA00001 as its parent.
Because the CDS segments are coding features, their phase field is
defined.  As it happens, the length of each CDS segment that precedes
another is an exact multiple of 3, so their phases are all zero.

In a similar manner, lines 11-16 define the second transcript, ID
mRNA00002, Name EDEN.2.  Note that the Parent field of the UTRs and
CDS entries is mRNA00002.

Lines 17-23 define the third transcript, EDEN.3.  Of interest in this
example is that although the UTR and the first CDS partly share the
same exon, this does not change the overall structure of the
GFF3 representation.  Also, because the alternative splicing pattern
changes the reading frame of this transcript, the phase of the second
and third CDS segments has changed.
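
The phase values in the example can be reproduced from the CDS
coordinates alone.  The Python sketch below follows the definition
given under Column 8 (0, 1 or 2 for the first, second or last base of
a codon).  It assumes that the segments are supplied in transcription
order and that the first segment begins on the first base of a codon;
the name cds_phases is invented for this example.

```python
def cds_phases(segments):
    """Return the phase of each CDS segment.

    segments: (start, end) pairs, 1-based inclusive, in transcription order.
    """
    phases = []
    consumed = 0                      # coding bases in earlier segments
    for start, end in segments:
        phases.append(consumed % 3)   # codon position of this segment's first base
        consumed += end - start + 1
    return phases
```

For EDEN.3, cds_phases([(3301, 3902), (5000, 5500), (7000, 7600)])
returns [0, 2, 2]: the first segment is 602 bases long, which leaves
two bases of an unfinished codon, so the next segment starts on the
last base of that codon.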

VARIATIONS ON THE CANONICAL GENE
--------------------------------

The recipe given above should be sufficient for most applications.
Some, however, may wish to extend the representation of the canonical
gene to explicitly refer to exons and/or introns.  The suggested way
to do this is to use the Gap tag to indicate that the spliced mRNAs
contain gaps when they are mapped to the genome:

 ctg123 . mRNA  1050  9000  .  +  . ID=mRNA00001;Parent=gene00001;Name=EDEN.1;Gap=M451D1499M903D1097M501D1499M2001
 ctg123 . mRNA  1050  9000  .  +  . ID=mRNA00002;Parent=gene00001;Name=EDEN.2;Gap=M451D3499M501D1499M2001
 ctg123 . mRNA  1300  9000  .  +  . ID=mRNA00003;Parent=gene00001;Name=EDEN.3;Gap=M201D1499M903D1097M501D1499M2001

Looking at EDEN.1, the gap string indicates that there is a 451 bp
exon followed by a 1499 bp intron, a 903 bp exon and so forth.
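
The Gap strings above follow directly from the exon coordinates.  A
minimal Python sketch, assuming non-overlapping exons given as 1-based
inclusive (start, end) pairs and using the invented name
gap_from_exons:

```python
def gap_from_exons(exons):
    """Build a Gap string (M for exons, D for introns) from exon coordinates."""
    parts = []
    previous_end = None
    for start, end in sorted(exons):
        if previous_end is not None:
            parts.append(f"D{start - previous_end - 1}")   # intron length
        parts.append(f"M{end - start + 1}")                # exon length
        previous_end = end
    return "".join(parts)
```

With the four exons of EDEN.1, (1050, 1500), (3000, 3902),
(5000, 5500) and (7000, 9000), the result is
M451D1499M903D1097M501D1499M2001.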

An alternative representation is also possible:

 ctg123 . mRNA  1050  1500  .  +  . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 ctg123 . mRNA  3000  3902  .  +  . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 ctg123 . mRNA  5000  5500  .  +  . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 ctg123 . mRNA  7000  9000  .  +  . ID=mRNA00001;Parent=gene00001;Name=EDEN.1

In this representation the mRNA is now split among multiple lines but
given the same ID.  This indicates that a single feature (mRNA00001)
occupies four discrete positions on the genome.
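
A reader of such a file has to merge lines that share an ID.  One
possible approach, sketched in Python under the assumptions that
columns are separated by whitespace as in the examples and that every
line carries an ID (segments_by_id is an invented name):

```python
from collections import defaultdict

def segments_by_id(lines):
    """Map each feature ID to the list of (start, end) segments it occupies."""
    segments = defaultdict(list)
    for line in lines:
        columns = line.split()
        attributes = dict(pair.split("=", 1) for pair in columns[8].split(";"))
        segments[attributes["ID"]].append((int(columns[3]), int(columns[4])))
    return dict(segments)
```

Fed the four mRNA lines above, the function returns a single entry,
mRNA00001, holding four coordinate pairs.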

A third representation can be used if the application wishes to
identify exons individually.  In this representation, each exon
becomes a named feature whose parent is one or more primary
transcripts:

 ctg123 . primary_transcript 1050  9000  .  +  . ID=ptr00001;Parent=gene00001
 ctg123 . primary_transcript 1050  9000  .  +  . ID=ptr00002;Parent=gene00001
 ctg123 . primary_transcript 1300  9000  .  +  . ID=ptr00003;Parent=gene00001
 ctg123 . exon  1050  1500  .  +  . ID=exon00001;Parent=ptr00001,ptr00002
 ctg123 . exon  1300  1500  .  +  . ID=exon00005;Parent=ptr00003
 ctg123 . exon  3000  3902  .  +  . ID=exon00002;Parent=ptr00001,ptr00003
 ctg123 . exon  5000  5500  .  +  . ID=exon00003;Parent=ptr00001,ptr00002,ptr00003
 ctg123 . exon  7000  9000  .  +  . ID=exon00004;Parent=ptr00001,ptr00002,ptr00003

It is important to note that the feature type primary_transcript must
be used here, rather than mRNA, because in SO, exons are part of the
primary transcript, not the mRNA.  To properly represent CDS and UTR
features, one must create mRNAs whose parent is the gene, not the
primary transcript (this is because the SO does not have a "derived
from" relationship at the current time).
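
Because a feature may list several parents, the hierarchy is a graph
rather than a strict tree.  The Python sketch below builds a
parent-to-children index from feature lines; children_by_parent is an
invented name, and whitespace-separated columns are assumed, as in
the examples.

```python
from collections import defaultdict

def children_by_parent(lines):
    """Map each parent ID to the IDs of the features that name it as Parent."""
    children = defaultdict(list)
    for line in lines:
        columns = line.split()
        attributes = dict(pair.split("=", 1) for pair in columns[8].split(";"))
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                children[parent].append(attributes.get("ID"))
    return dict(children)
```

An exon whose Parent is ptr00001,ptr00002,ptr00003 is therefore
listed under all three primary transcripts.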

If the application wishes to represent introns explicitly, they should
be represented in the same way as exons, viz.:

 ctg123 . primary_transcript 1050  9000  .  +  . ID=ptr00001;Parent=gene00001
 ctg123 . primary_transcript 1050  9000  .  +  . ID=ptr00002;Parent=gene00001
 ctg123 . primary_transcript 1300  9000  .  +  . ID=ptr00003;Parent=gene00001
 ctg123 . intron  1501  2999  .  +  . ID=intron00001;Parent=ptr00001,ptr00003
 ctg123 . intron  1501  4999  .  +  . ID=intron00002;Parent=ptr00002
 ctg123 . intron  3903  4999  .  +  . ID=intron00003;Parent=ptr00001,ptr00003
 ctg123 . intron  5501  6999  .  +  . ID=intron00004;Parent=ptr00001,ptr00002,ptr00003

NUCLEOTIDE-TO-NUCLEOTIDE MATCHES (ALIGNMENTS)
---------------------------------------------

In the SO, an alignment between the reference sequence and another
sequence is called a "match".  In addition to the generic "match"
type, there are the subclasses "cDNA_match," "EST_match,"
"translated_nucleotide_match," "nucleotide_to_protein_match," and
"nucleotide_motif."

Matches typically contain gaps; matches broken up by large gaps are
usually called "HSPs" (high-scoring segment pairs), and previous
incarnations of GFF have handled gapped alignments by breaking up the
alignment into a series of ungapped HSPs.

The SO does not have an HSP type.  Instead, gapped matches are
represented as a single feature that occupies a discontinuous location
on the reference sequence.  Figure 2 shows the same gene as before,
but with a new track added showing an alignment of a sequenced cDNA to
the genome.  For the purposes of illustration, we have shown the
regions of alignment to be exact across the three exons of the second
spliced transcript (EDEN.2).  

The recommended way to represent this alignment is with a single
feature of type "cDNA_match" and a Gap attribute that indicates that
the alignment is in three segments:

 ctg123 . cDNA_match 1050  9000  6.2e-45  +  . ID=match0001;Target=cdna0123+12+2964;Gap=M451D3499M501D1499M2001

Parsed out, the Target attribute indicates that the sequence named
"cdna0123" between bases 12 and 2964 (in cdna coordinates) aligns to
bases 1050 to 9000 of ctg123.  The Gap attribute is easier to
read when spaces are inserted:

     M451	 match 451 bases
     D3499	 skip 3499 bases in the reference ctg123 sequence
     M501	 match the next 501 bases
     D1499	 skip 1499 bases in the reference ctg123
     M2001	 match the next 2001 bases

Note that the matched region is 2953 bases, which corresponds exactly
to the matching subsequence [12,2964] of the target.  Extra bases in
the cDNA which would cause gaps in the reference sequence would be
indicated using the CIGAR "I" notation.
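
The consistency of a match line can be verified by summing the Gap
operations.  In the Python sketch below (gap_lengths is an invented
name), M and D consume reference bases while M and I consume target
bases, following the definitions given for the Gap tag.

```python
import re

def gap_lengths(gap: str):
    """Return (reference_bases, target_bases) covered by a Gap string."""
    operations = re.findall(r"([MID])(\d+)", gap)
    reference = sum(int(length) for op, length in operations if op in "MD")
    target = sum(int(length) for op, length in operations if op in "MI")
    return reference, target
```

For M451D3499M501D1499M2001 the result is (7951, 2953), which agrees
with the reference span 1050 to 9000 and the target span 12 to 2964.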

Another important item to note is that the ID corresponds to the Match
and not to the target sequence.  This avoids the confusion that has
occurred in previous incarnations of GFF which made it impossible to
distinguish between a particular alignment of a target sequence to the
genome and all alignments of a target sequence to the genome.

A limitation of the Gap representation is that the entire alignment
shares the same score (column 6).  To give each component of the match
a separate score, it can be broken across multiple lines as shown
here:

 ctg123 . cDNA_match 1050  1500  5.8e-42  +  . ID=match0001;Target=cdna0123+12+462
 ctg123 . cDNA_match 5000  5500  8.1e-43  +  . ID=match0001;Target=cdna0123+463+963
 ctg123 . cDNA_match 7000  9000  1.4e-40  +  . ID=match0001;Target=cdna0123+964+2964

Notice that the ID is the same across each of the three lines,
indicating that these lines all refer to a single feature, the Match.
Each aligning segment, however has a distinct score and Target region.

The two types of representations can be mixed, allowing large aligned
segments to have their own GFF line and score, while small gaps within
them are represented using a Gap attribute.

Matches can align to either the + or the - strand of the reference
sequence.  This should be denoted in the seventh column of the GFF
line and *not* by changing the order of the start and end positions in
the Target attribute.  To illustrate this, Figure 3 adds an EST pair
to the annotation.  The two ESTs, mjm1123.5 and mjm1123.3 correspond
to 5' and 3' EST reads from the same cDNA clone.  The following GFF3
lines describe them:

 ctg123 . EST_match 1200  3200  2.2e-30  +  . ID=match0002;Target=mjm1123.5+5+506;Gap=M301D1499M201
 ctg123 . EST_match 7000  9000  7.4e-32  -  . ID=match0003;Target=mjm1123.3+1+502;Gap=M101D1499M401

Please note that the subsequence indicated by the Target always uses
the coordinate system of the EST, regardless of the direction of the
alignment.  For the 3' EST, the seventh column contains a "-" to
indicate that the match is to the reverse complement of ctg123.  The
Gap attribute does not change as a consequence of this reverse
complementation, and is read from left to right in the usual manner.

An application may wish to group the EST pair into a single feature.
This can be accomplished by creating an implied cDNA_match that
extends from the left end of the first EST to the right end of the
last EST, and indicating that this cDNA match is the Parent of the two
ESTs:

 ctg123 . cDNA_match 1200  9000  .        .  . ID=cDNA0001
 ctg123 . EST_match  1200  3200  2.2e-30  +  . ID=match0002;Parent=cDNA0001;Target=mjm1123.5+5+506;Gap=M301D1499M201
 ctg123 . EST_match  7000  9000  7.4e-32  -  . ID=match0003;Parent=cDNA0001;Target=mjm1123.3+1+502;Gap=M101D1499M401


PROTEIN-TO-NUCLEOTIDE MATCHES (ALIGNMENTS)
------------------------------------------

In the cases described above, the alignment was between two nucleotide
sequences.  In the case of a protein to nucleotide alignment
(e.g. TBLAST), each residue of the protein Target corresponds to three
nucleotides in the reference sequence.  Care must be taken when
constructing the Gap attribute so that the M, I and D operations are
consistently represented in nucleotide coordinate space.  For example,
consider the following alignment:

 100 atgaaggag---gttattgcgaatgtcggcggt
   1 M  K  E  V  V  I  -  N  V  G  G

The appropriate GFF3 alignment line is:

 ctg123 . nucleotide_to_protein_match 100 129 . + . ID=match008;Target=p101+1+10;Gap=M9I3M6D3M12

The number of aligned protein residues equals the number of <M>atched
nucleotides divided by three, and the length of the Target equals
(sum(M)+sum(I))/3.
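
These relationships can be checked against the example.  The Python
sketch below reads I as bases present only in the Target and D as
bases present only in the reference, as defined for the Gap tag, and
keeps all lengths in nucleotide units until the final division.  The
name protein_match_lengths is invented for this example.

```python
import re

def protein_match_lengths(gap: str):
    """Return (reference_nt, aligned_residues, target_residues) for a Gap string."""
    totals = {"M": 0, "I": 0, "D": 0}
    for op, length in re.findall(r"([MID])(\d+)", gap):
        totals[op] += int(length)
    reference_nt = totals["M"] + totals["D"]
    target_nt = totals["M"] + totals["I"]
    if totals["M"] % 3 or target_nt % 3:
        raise ValueError("operation lengths are not whole codons")
    return reference_nt, totals["M"] // 3, target_nt // 3
```

For M9I3M6D3M12 the result is (30, 9, 10): 30 reference nucleotides
(100 to 129), 9 aligned residues, and a Target of 10 residues (1 to
10).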

OTHER SYNTAX
------------

Comments are preceded by the # symbol.  Meta-data and directives are
preceded by ##.  The following directives are recognized:

  ##gff-version 3        
	The GFF version, always 3 in this spec.  This must
	be the topmost line of the file.

  ##sequence-region seqid start end
        The sequence segment referred to by this file, in the format
        "seqid start end".  This element is optional, but strongly
        encouraged because it allows parsers to perform bounds
        checking on features. There may be multiple ##sequence-region
        directives, each corresponding to one of the reference
        sequences referred to in the body of the file.

  ##ontology URI
        The ontology indicated by the URI is to be loaded.  Multiple
	URIs may be added, in which case they are merged (or raise
	an exception if they cannot be merged).  The URI for the
	base Sequence Ontology is:

	  urn:lsid:song.sourceforge.net:so/solite/2003-02-24

        This directive may only occur once per file.

  ###
        This directive (three # signs in a row) indicates that all
        forward references to feature IDs that have been seen to this
        point have been resolved.  After seeing this directive, a
        program that is processing the file serially can close off any
        open objects that it has created and return them, thereby
        allowing iterative access to the file.  Otherwise, software
        cannot know that a feature has been fully populated by its
        subfeatures until the end of the file has been reached.  It
	is recommended that complex features, such as the canonical
	gene, be terminated with the ### notation.

   ##FASTA
	This notation indicates that the annotation portion of the
	file is at an end and that the remainder of the file 
	contains one or more sequences (nucleotide or protein)
	in FASTA format.  This allows features and sequences to
	be bundled together.  Example:

   ##gff-version   3
   ##sequence-region   ctg123 1 1497228       
   ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
   ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
   ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
   ctg123 . 5'-UTR          1050  1200  .  +  .  Parent=mRNA00001
   ctg123 . CDS             1201  1500  .  +  0  Parent=mRNA00001
   ctg123 . CDS             3000  3902  .  +  0  Parent=mRNA00001
   ctg123 . CDS             5000  5500  .  +  0  Parent=mRNA00001
   ctg123 . CDS             7000  7600  .  +  0  Parent=mRNA00001
   ctg123 . 3'-UTR          7601  9000  .  +  .  Parent=mRNA00001
   ctg123 . cDNA_match 1050  1500  5.8e-42  +  . ID=match0001;Target=cdna0123+12+462
   ctg123 . cDNA_match 5000  5500  8.1e-43  +  . ID=match0001;Target=cdna0123+463+963
   ctg123 . cDNA_match 7000  9000  1.4e-40  +  . ID=match0001;Target=cdna0123+964+2964
   ##FASTA
   >ctg123
   cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
   tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
   tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
   aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
   aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
   cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
   gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
   ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
   aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
   aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
   ...
   >cdna0123
   ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
   agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
   aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
   tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
   gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
   tcaaacagcggctgtaaaaatttgtgattatggttaaagg

       For backward-compatibility with the GFF version output by the
       Artemis tool, a GFF line that begins with the character >
       creates an implied ##FASTA directive.
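
The handling of ###, ##FASTA and the implied FASTA start can be
sketched in Python as follows.  This is an illustration under
simplifying assumptions, not a complete reader: the name read_gff3 is
invented, other directives and comments are simply skipped, and the
whole input is collected into lists, whereas a serial processor would
hand back each batch as soon as ### is seen.

```python
def read_gff3(lines):
    """Split GFF3 lines into feature batches (closed by ###) and a FASTA tail."""
    batches, batch, fasta = [], [], []
    in_fasta = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if in_fasta:
            fasta.append(line)
        elif line == "##FASTA":
            in_fasta = True
        elif line.startswith(">"):
            in_fasta = True            # implied ##FASTA directive
            fasta.append(line)
        elif line == "###":
            if batch:
                batches.append(batch)  # all references so far are resolved
            batch = []
        elif not line.startswith("#"):
            batch.append(line)
    if batch:
        batches.append(batch)
    return batches, fasta
```

Each batch can be turned into complete feature objects without
waiting for the end of the file, which is the purpose of the ###
directive.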

APPENDIX: SOFA
--------------

For convenience, the version of SOFA available at the time this was
written is appended.

!autogenerated-by:     DAG-Edit version 1.316
!saved-by:             suzi
!date:                 Wed Feb 19 16:38:05 SGT 2003
!version: $Revision: 1.3 $
!type: % ISA Is a
!type: < PARTOF Part of
!Sequence_ontology_Lite_Version
!This is only for comment; not for implementation
!Comments to: song-devel@sourceforge.net
$Sequence_Feature_Ontology ; SO:0000000
 %sofa ; SO:2000001
  %feature ; SO:20000000
   %chromosome ; SO:0000340
    <centromere ; SO:0000577
    <telomere ; SO:0000624
   %gene ; SO:0000704
    <regulatory_region ; SO:0005836
     %enhancer ; SO:0000165
     %TF_binding_site ; SO:0000235 ; synonym:transcription_factor_binding_site % nucleotide_motif ; SO:0000714
    <transcript ; SO:0000673
     %primary_transcript ; SO:0000185 ; synonym:precursor_RNA
      <exon ; SO:0000147
      <intron ; SO:0000188
      %noncoding_primary_transcript ; SO:0000483
       %micro_RNA_primary_transcript ; SO:0000647
       %transfer_RNA_primary_transcript ; SO:0000210
      <splice_site ; SO:0000162
       %splice_acceptor ; SO:0000164 ; synonym:acceptor_splice_site
        %transsplice_acceptor_site ; SO:0000706
       %splice_donor ; SO:0000163 ; synonym:donor_splice_site
      <transcription_start_site ; SO:0000315
     %processed_transcript ; SO:0000233
      <exon_junction ; SO:0000333
      %mRNA ; SO:0000234 ; synonym:messenger_RNA
       <coding_sequence ; SO:0000316
        <coding_end ; SO:0000327
        <coding_start ; SO:0000323
       <untranslated_region ; SO:0000203 ; synonym:UTR
        %five_prime_untranslated_region ; SO:0000204 ; synonym:5'-UTR
        %three_prime_untranslated_region ; SO:0000205 ; synonym:3'-UTR
      %ncRNA ; SO:0000655 ; synonym:noncoding_RNA
       %miRNA ; SO:0000276 ; synonym:micro_RNA
       %rRNA ; SO:0000252 ; synonym:ribosomal_RNA
       %tRNA ; SO:0000253 ; synonym:transfer_RNA
      <polyA_site ; SO:0000553
   %match ; SO:0000343
    %nucleotide_to_nucleotide_match ; SO:0000347
     %cross_genome_match ; SO:0000177
     %expressed_sequence_match ; SO:0000102
      %cDNA_match ; SO:0000689 % RNAi_reagent ; SO:0000337
      %EST_match ; SO:0000668
     %translated_nucleotide_match ; SO:0000181
    %nucleotide_to_protein_match ; SO:0000351
   %nucleotide_motif ; SO:0000714
    %CpG_island ; SO:0000307
    %TF_binding_site ; SO:0000235 ; synonym:transcription_factor_binding_site % regulatory_region ; SO:0005836
   %origin_of_replication ; SO:0000296
   %pseudogene_region ; SO:0000336
   %reagent ; SO:0000695
    %assembly_component ; SO:0000143
     %contig ; SO:0000149
     %golden_path_region ; SO:0000688
    %clone ; SO:0000151
     %cDNA_clone ; SO:0000317
     <clone_end ; SO:0000103
     %genomic_clone ; SO:0000325
    %databank_entry ; SO:2000061
    %oligonucleotide ; SO:0000696 ; synonym:primer % RNAi_reagent ; SO:0000337
    %pcr_product ; SO:0000006 ; synonym:amplicon
     %STS ; SO:0000331 ; synonym:sequence_tag_site
   %remark ; SO:0000700
    %experimental_reagent_region ; SO:0000703
     %RNAi_reagent ; SO:0000337
      %cDNA_match ; SO:0000689 % expressed_sequence_match ; SO:0000102
      %oligonucleotide ; SO:0000696 ; synonym:primer % reagent ; SO:0000695
    %potential_sequencing_error ; SO:0000701
   %repeat_region ; SO:0000657
    %direct_repeat ; SO:0000314
    %dispersed_repeat ; SO:0000658
    %inverted_repeat ; SO:0000294
    %repeat_family ; SO:0000187
     %transposable_element ; SO:0000101
      %DNA_transposon ; SO:0000182
      %retrotransposon ; SO:0000180
       %LTR_retrotransposon ; SO:0000186
       %non_LTR_retrotransposon ; SO:0000189
        %LINE_element ; SO:0000194
        %SINE_element ; SO:0000206
    %tandem_repeat ; SO:0000705
     %microsatellite ; SO:0000289
   %sequence_variant ; SO:0000109
    %deletion ; SO:0000159
     <deletion_junction ; SO:0000687
    %insertion ; SO:0000667
    %inversion ; SO:0000697
     <inversion_junction ; SO:0000692
    %substitution ; SO:1000002
    %translocation_junction ; SO:0000691
