[Bioperl-l] validate_gff_via_ontology.pl

Chris Mungall cjm at fruitfly.org
Fri Feb 21 01:39:48 EST 2003


I have just committed this to branch-1-2-collection

It is an ontology validator for GFF3 - it checks both feature types and
validates the feature containment (subfeatures hierarchy) as this must
correspond to the SOFA standard (in progress). The code could be easily
modified for any format that produces seqfeature containment hierarchies
(eg chado)

Currently it requires the go-dev perl modules to work - I am working on
moving the required functionality into the bioperl ontology classes

I have also made SeqFeature::Generic ontology-aware

  type() is a getter/setter for a Bio::Ontology::TermI

  primary_tag() has exactly the same behaviour but the underlying
  implementation delegates to a Bio::Ontology::TermI object
  (using the type() attribute)

  type_string() is a synonym for primary_tag()


See also

geneontology.sf.net (for the go-dev perl api)
song.sf.net (for the SOFA ontology)

someone remind me to move this to the main branch when the time is right!

future plans: the next step is a transmogrifier for mapping legacy formats
into a standard ontology

Here are the pod docs:


=head1 NAME

  validate_gff_via_ontology.pl

=head1 SYNOPSIS

  validate_gff_via_ontology.pl -trace -m cds=coding_sequence
myfeatures.gff sofa.ontology

=head1 DESCRIPTION

This script will validate a GFF (version3 is best) file against an
ontology of sequence features. Currently the ontology must be in
GO/GOBO format, support for other formats (eg DAML+OIL, OWL) are
forthcoming.

The validation is in two parts:

first of all the types in the GFF file are checked vs the terms in the
ontology. An identifier (which will be a SO accession if the SO/SOFA
ontology is used) is assigned if it exists (case insensitive matching
is used). If the type used in the GFF file does not exist in the
ontology file the script will notify you with a message like:

  BAD:  locus transcrpt geme

The second part of the validation checks to see if the 'subfeature'
relationships specified in the GFF (using Parent and ID attributes in
GFF3) are valid. For instance, making 'transcript' a subfeature of
'repeat' is obviously silly, and it is banned because there is no
'part-of' restriction between transcript and repeat in the canonical
SOFA ontology. The rules are a bit more subtle than this, as we also
have to traverse the subsumption hierarchy.

If a feature/subfeature containment is not allowed, you will get a
message like:

  BAD PARTOF: [coding_start PARTOF exon]

To get a full explanation of why the partof link is bad, run with the
trace switch [-trace]

=head2 REQUIREMENTS

(1) An ontology file

You can obtain the canonical SOFA ontology here:

 http://sofa.sf.net/

(2) The go-dev perl objects

 http://geneontology.sf.net/

Plus some files in GFF format!

=head2 RULES

a subfeature can only be part of a superfeature
if there is a direct part-of link for the relevant feature types
OR such a link can be obtained by traversing either of the
two graphs upwards.

the part-of links must be direct; we do not use the closure of the
part-of relationship. This means that you can *not* make an exon a
subfeature of gene, you need the intermediate object (of some kind of
transcripty type, eg mRNA).

=head2 EXAMPLES

(these depend on an imaginary version of a SO-style ontology):

=head3 Example 1

  can I make a 'mRNA' feature a subfeature of noncoding gene ('nc_gene')?

YES! 'mRNA' is a subclass of 'processed_transcript' is a subclass of
'transcript'
     'nc_gene' is a subclass of 'gene'
     'mRNA' is a part of 'gene' ** RESOLVED **

=head3 Example 2

  can I make 'exon' a subfeature of 'mRNA'?

NO! 'exon' is a subclass of only the root term
    'mRNA' is a subclass of 'processed_transcript' is a subclass of
'transcript'
    'exon' is ONLY a part of 'primary_transcript',
         which is NOT in the 'mRNA' subsumption hierarchy

    therefore this is invalid

=head2 SPECIFICATION

 notes: R* is the reflexive transitive closure of R
 notes: R+ is the transitive closure of R

(reflexive includes the relationship x R x)

  WHERE ChildFeat is an object of type SeqFeature
        ParentFeat is an object of type SeqFeature
  IF
    ChildFeat part-of ParentFeat

  THEN

  THIS MUST BE SATISFIED FOR SOME POSSIBLE BINDING OF THE VARIABLES BELOW
  (if it starts with a capital it is an unbound variable):

  ChildFeat has-feature-type ChildType
  ParentFeat has-feature-type ParentType

  ChildType is-a* ChildTypeAllPossible
  ParentType is-a* ParentTypeAllPossible

  # don't allow nonreflexive transitive closure on part-of
  # we *could* allow ChildTypeAllPossible part-of+ ParentTypeAllPossible
  # which would mean we could attach exons directly to genes

  ChildTypeAllPossible part-of ParentTypeAllPossible

=head2 FORMAL SPECIFICATION

These rules can be implemented easily in prolog - I have resisted the
temptation to do so. The rules below are implemented imperatively in
the perl script. They are included below as a formal specification of
the rules.

  ;;;; RULES IN PROLOG:
  ;;;;
  ;;;; we assume an ontology (eg SOFA) that produces 'stmt' predicates of
  ;;;; of arity-3 [CHILD REL-TYPE PARENT]
  ;;;;
  ;;;; the bioperl feature containment hierarchy can be verified
  ;;;; by the predicate [SUBFEATURE 'part-of' SUPERFEATURE]
  ;;;; the type field of the feature is assumed to be
  ;;;; stmt predicates [FEATURE 'instance-of' FEATURE-TYPE]

  ;; ========================================
  ;; general rules: closure / graph traversal
  ;; ========================================

  ;; closure of R is R+
  stmt(X, 'is-a+', Y):-
    stmt(X, 'is-a', Y).

  stmt(X, 'is-a+', Y):-
    stmt(X, 'is-a', Z),
    stmt(Z, 'is-a+', Y).

  ;; reflexive closure of R is R*
  stmt(X, 'is-a*', X).
  stmt(X, 'is-a*', Y):-
    stmt(X, 'is-a+', Y).

  ;; ========================================
  ;; core rule - checks if a feat/subfeat
  ;; link is alllowed
  ;; ========================================

  ;; a subfeature can only be part of a superfeature
  ;; if there is a direct part-of link for the relevant feature types
  ;; OR such a link can be obtained by traversing either of the
  ;; two graphs upwards

  stmt(ChildFeat, 'part-of', ParentFeat):-
    stmt(ChildFeat, 'instance-of', ChildType),
    stmt(ParentFeat, 'instance-of', ParentType),

    stmt(ChildType, 'is-a*', ChildTypeAllPossible),
    stmt(ParentType, 'is-a*', ParentTypeAllPossible),

    stmt(ChildTypeAllPossible, 'is-a*', ParentTypeAllPossible).

=head1 AUTHOR - Chris Mungall

Email cjm at fruitfly.org

=cut





More information about the Bioperl-l mailing list