[Bioperl-l] Bio::Ontology

Chris Mungall cjm@fruitfly.org
Thu, 19 Sep 2002 08:11:07 -0700 (PDT)


Ok, I have some controlled vocabulary and graph code ready to check in

I was going to check it in initially as another branch (this is mainly
because I don't want to deal with namespace changes and cvs, could get
ugly) - there seems to be a lock preventing me:
cvs server: failed to obtain dir lock in repository
`/home/repository/bioperl/bioperl-live/Bio/Tools/Run/Phylo'

anyway, for now I have stuck the code in
fruitfly.org/~cjm/bioperl-live.tar.gz
if everyone is happy with the namespaces i'll go ahead and commit it onto
the main branch

I'm including a pseudo interface spec below

So far there are implementations for the basic graph vocab stuff, but not
for associations between entities (eg gene products, markers) and CV
terms. i'd like to generate a bit of discussion on these first. Ewan,
Jason and I discussed this briefly - what should the root interface for
the associated entities be? Jason and Ewan had some reservations about the
interface (AnnotatableI) below. It does have the advantage that tools like
AmiGO could be neutral wrt whether they were associating genes to GO,
markers to a phenotypic/trait ontology, images/experiments to an anatomy
ontology.

There are also parsers; these use an event based model, but have
Bio::OntologyIO wrappers. I'd say it's ready to use for the purpose of
feature types and a sequence ontology.

! OK, here's the proposed spec for the bioperl ontologies component
! (maybe this could be the interface definitions for the other bio*
! projects too?)
! it's kind of interface heavy, but I think this is necessary if we want
! to keep this generic, flexible, etc
!
! these are the use cases assumed:
!
! flat lists of controlled biological terms, where each biological term
! can have various tracking info added (synonyms, dbxrefs, etc)
!
! structured controlled vocabularies (aka loose semantic networks),
! including the following:
!
! trees of bioterms, which represent subtype/supertype relationships
!
! DAGs of bioterms, where all arcs in the graph represent
! subtype/supertype relationships
!
! DAGs of bioterms, where the arcs can represent different relationship
! types, a la Gene Ontology - typical relationship types would be
! ISA, PARTOF. However, relationship types are restricted to those for
! which the true-path rule holds true
! (see http://www.geneontology.org/...).
! currently this is just ISA, and PARTOF when PARTOF is used in the sense
! of "necessarily part of". for instance, a 'door' is part of a 'car', but
! it isn't necessarily part of a car - it could be part of a house. we
! could introduce PARTOF(car_door, car) which is always true.
! the true path rule is useful for ontology consistency, and for answering
! recursive queries; for example, to answer 'find all genes that are
! transmembrane receptors' we would find genes associated with TM
! receptors AND all children of the TM receptor node.
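To make the recursive query above concrete, here is a rough sketch. It is in Python purely for illustration (the spec is meant to be language neutral), and the term names, the CHILDREN and ASSOCIATIONS tables and both function names are made up for the example; none of this is the checked-in code.

```python
from collections import deque

# Hypothetical toy vocabulary: parent term -> list of (relationship type, child term).
CHILDREN = {
    "receptor": [("ISA", "TM_receptor")],
    "TM_receptor": [("ISA", "GPCR"), ("ISA", "receptor_tyrosine_kinase")],
    "GPCR": [("ISA", "rhodopsin_like_GPCR")],
}

# Hypothetical gene -> term associations.
ASSOCIATIONS = {
    "geneA": ["TM_receptor"],
    "geneB": ["rhodopsin_like_GPCR"],
    "geneC": ["receptor"],
}

def all_children(term, covering=("ISA", "PARTOF")):
    """Return every recursive child of term reachable via covering arcs."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for reltype, child in CHILDREN.get(current, []):
            # Only follow relationship types for which the true path rule holds.
            if reltype in covering and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def genes_for_term(term):
    """Genes associated with term itself or with any of its recursive children."""
    wanted = {term} | all_children(term)
    return sorted(g for g, terms in ASSOCIATIONS.items() if wanted & set(terms))
```

With this toy data, genes_for_term("TM_receptor") picks up geneA (direct association) and geneB (associated with a recursive child), but not geneC, which is only annotated to the more general term.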
!
! Graphs of bioterms, in which arcs in the graph are not necessarily
! transitive, and in which cycles may be allowed. the true path rule may
! not necessarily hold. examples include vocabs that include temporal arcs
! (which may cause cycles, eg birth, cell cycles); the true path rule does
! not hold with temporal arcs (eg if a gene is expressed at term:G_phase
! it is not necessarily expressed at term:M_phase; if a gene has a
! phenotype of the right_leg, it IS correct to say that it is a limb
! phenotype, but NOT correct to say that it is an embryonic phenotype,
! even though right_leg is a recursive child of embryo via DEVELOPS-FROM
! arcs/relationships)
!
! In addition to the different kinds of vocabs/ontologies above, there is
! support for making associations between entities and vocab terms; these
! entities could be gene products (ie proteins or RNA products), genes,
! markers/alleles (eg in phenotypic/trait ontologies), sequences;
! these could be represented by Bio::SeqI, Bio::SeqFeatureI or
! Bio::Map::MarkerI objects.
!
! associations are entities in their own right, as they are assertions
! made at some particular time by a person or a computational analysis,
! and should be tracked with references. see http://... for a list of GO
! evidence criteria
! (of course this object model is not restricted to GO criteria)
!
! NOTE ON SEQUENCE FEATURE ONTOLOGIES
! for associations between a SeqFeatureI entity and a feature type
! (eg in SO)
! (currently done with the $sf->primary_tag() method) there will probably
! be no association/evidence; we should probably add $sf->feature_type
! which returns a VocabTermI object, and make $sf->primary_tag delegate to
! $sf->feature_type()->label()
!
! there is no *explicit* support for frames-style/description logic
! ontologies (you really want to be using more specialised tools and not
! bioperl for this anyway); however, provision has been made for layering
! these on in a compatible way. at this time, the most prevalent
! ontologies within biology (or at least the most prevalent use cases
! within bioperl) are structured controlled vocabularies, GO style
! (although these can easily be represented as a full frame style or DL
! ontology, provided a few constraints are followed, this ramping up of
! expressive power would force unnecessary complexity in this object
! model).
!
! one can imagine different implementations of these interfaces;
! eg memory based vs secondary storage based (while most vocabs fit
! into memory, vocabs PLUS the entities associated with the terms
! generally do not).
! different implementations may provide different semantics of some of the
! operations below. for instance, a simple graph implementation would
! traverse down the graph to implement get_all_child_nodes(); a
! graph/vocab with added semantics may choose to only traverse recursive
! relationships.
!
! TODO: further split this into components
! (triples, graphs, vocabs, associations/annotations)

namespace Bio::Graph

enum TraversalMethod { BREADTH_FIRST, DEPTH_FIRST };
enum TraversalDirection { DOWN, UP };

typedef string TripleElement
typedef string Identifier
typedef string TimeStamp
typedef Bio::Annotation::DBLink DBLink

interface TripleI extends Bio::Root::RootI
  attribute TripleElement subject
  attribute TripleElement predicate
  attribute TripleElement object

interface TripleStoreI extends Bio::Root::RootI
  add(TripleI triple):                # adds new triple to store
  get(TripleI triple): TripleI[]      # fetches matching triples

interface NodeI extends Bio::Root::RootI
  attribute string identifier
  attribute ANY node_data

interface ArcI extends Bio::Root::RootI
  attribute NodeI parent_node
  attribute NodeI child_node
  attribute NodeI arctype_node
  arc_label():   string               # description of relationship type
  attribute ANY arc_data

interface PathI extends Bio::Root::RootI
  attribute ArcI[] arcs             # attribute accessor
  reverse():

interface GraphIteratorI extends Bio::Root::RootI
  reset_cursor():
  attribute TraversalMethod traversal_method
  attribute TraversalDirection traversal_direction
  this_node(): NodeI
  next():
  next_node(): NodeI
  path():      PathI                  # path to get here from initial node
  depth():     int
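
One way the iterator could work, sketched in Python for illustration only. Several simplifications are assumptions here: arcs are plain (parent, child) pairs, path() returns the list of node names from the start node rather than a PathI of arcs, and each node is visited once even if it is reachable by several routes.

```python
from collections import deque

BREADTH_FIRST, DEPTH_FIRST = "BREADTH_FIRST", "DEPTH_FIRST"
DOWN, UP = "DOWN", "UP"

class GraphIterator:
    def __init__(self, arcs, start,
                 traversal_method=DEPTH_FIRST, traversal_direction=DOWN):
        self.arcs = arcs            # list of (parent, child) pairs
        self.start = start
        self.traversal_method = traversal_method
        self.traversal_direction = traversal_direction
        self.reset_cursor()

    def reset_cursor(self):
        # Each pending entry is the trail of nodes from the start to a node.
        self._pending = deque([[self.start]])
        self._visited = set()
        self._current = None

    def _neighbours(self, node):
        if self.traversal_direction == DOWN:
            return [c for p, c in self.arcs if p == node]
        return [p for p, c in self.arcs if c == node]

    def next_node(self):
        """Advance the cursor; return the new node, or None when exhausted."""
        while self._pending:
            if self.traversal_method == BREADTH_FIRST:
                trail = self._pending.popleft()   # queue
            else:
                trail = self._pending.pop()       # stack
            node = trail[-1]
            if node in self._visited:             # also guards against cycles
                continue
            self._visited.add(node)
            self._current = trail
            neighbours = self._neighbours(node)
            if self.traversal_method == DEPTH_FIRST:
                # Push in reverse so the first neighbour is explored first.
                neighbours = reversed(neighbours)
            for n in neighbours:
                self._pending.append(trail + [n])
            return node
        self._current = None
        return None

    def this_node(self):
        return self._current[-1] if self._current else None

    def path(self):
        return list(self._current) if self._current else []

    def depth(self):
        return len(self._current) - 1 if self._current else 0
```

Keeping the whole trail on the pending list costs some memory, but it makes path() and depth() trivial, which seemed the simplest choice for a sketch.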

# graphs are implemented on top of triple stores;
# this allows the API user to access the underlying binary
# predicates directly if desired.
# different implementations of GraphI may choose to implement the
# semantics of the methods below differently; an implementation may
# delegate directly to the underlying triple store, or it may apply some
# semantics (for instance, it may be desirable to only treat
# transitive predicates as parents/children, and effectively
# hide non-transitive predicates at the graph interface level)
interface GraphI extends TripleStoreI
  add_arc(ArcI arc):
  add_node(NodeI node):
  get_node(Identifier identifier): NodeI
  get_all_nodes(): NodeI[]
  get_all_arcs(): ArcI[]
  get_all_arctypes(): NodeI[]
  get_child_nodes(NodeI node): NodeI[]
  get_all_child_nodes(NodeI node): NodeI[]
  get_parent_nodes(NodeI node): NodeI[]
  get_all_parent_nodes(NodeI node): NodeI[]
  get_graph_iterator(NodeI node, TraversalMethod traversal_method): GraphIteratorI
  get_root_nodes(): NodeI[]
  paths_to_root(): PathI[]
  get_leaf_nodes(): NodeI[]
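
A sketch of the "graph on top of triples" idea, in Python for illustration. It delegates everything to a set of (child, arctype, parent) triples without applying any semantics. The node argument to paths_to_root() is an assumption (the spec above lists it without arguments), and nodes are plain strings rather than NodeI objects.

```python
class Graph:
    def __init__(self):
        # Underlying triple store: (child, arctype, parent) tuples.
        self.triples = set()

    def add_arc(self, child, arctype, parent):
        self.triples.add((child, arctype, parent))

    def get_all_nodes(self):
        return sorted({n for c, _, p in self.triples for n in (c, p)})

    def get_parent_nodes(self, node):
        return sorted({p for c, _, p in self.triples if c == node})

    def get_child_nodes(self, node):
        return sorted({c for c, _, p in self.triples if p == node})

    def get_root_nodes(self):
        # Roots have no parents; a fully cyclic graph has none.
        return [n for n in self.get_all_nodes() if not self.get_parent_nodes(n)]

    def get_leaf_nodes(self):
        return [n for n in self.get_all_nodes() if not self.get_child_nodes(n)]

    def paths_to_root(self, node, _trail=()):
        """All upward node paths from node; a path stops when no unvisited
        parent remains, so a path caught in a cycle is returned as far as
        it got."""
        trail = _trail + (node,)
        parents = [p for p in self.get_parent_nodes(node) if p not in trail]
        if not parents:
            return [list(trail)]
        paths = []
        for parent in parents:
            paths.extend(self.paths_to_root(parent, trail))
        return paths
```

An implementation with added semantics could filter self.triples by arctype inside get_parent_nodes() and get_child_nodes() and leave the rest unchanged.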

namespace Bio::Ontology

typedef string Identifier

interface VocabTermI extends Bio::Graph::NodeI
  attribute Identifier identifier
  attribute string label
  attribute VocabDefinitionI definition
  attribute string[] synonyms
  add_synonym(string synonym):
  attribute DBLink[] dblinks
  add_dblink(DBLink dblink):
  timestamp(): TimeStamp
  category(): VocabTermI
  is_obsolete(): boolean

interface RelationshipI extends Bio::Graph::ArcI
  attribute Identifier identifier
  attribute VocabTermI parent_term
  attribute VocabTermI child_term
  attribute VocabTermI relationship_type

interface VocabDefinitionI extends Bio::Root::RootI
  attribute string definition
  attribute DBLink reference
  timestamp(): TimeStamp

interface VocabI extends Bio::Root::RootI
  get_term(Identifier identifier): VocabTermI
  get_terms_by_label(string label): VocabTermI[]   # note: labels are not unique
  get_all_terms(): VocabTermI[]
  get_all_relationships(): RelationshipI[]
  get_all_relationship_types(): VocabTermI[]
  add_term(VocabTermI term): VocabTermI
  add_relationship(RelationshipI relationship): RelationshipI
  create_term(Identifier id, string label, string[] synonyms, DBLink[] dblinks): VocabTermI
  create_relationship(Identifier id, VocabTermI parent, VocabTermI child, VocabTermI relationship_type): RelationshipI

interface StructuredVocabI extends VocabI

! ------------------------------------------------------------------------
# A Graph Vocabulary is a Structured controlled vocabulary with vocabulary
# terms arranged in a graph (or semantic network) structure. parent/child
# relationships in the graph often represent subsumption relationships
# (ie where a more general term subsumes a more specific one) but it is
# not always safe to assume so. The relationships are often transitive,
# but this is not always the case. The graph may be acyclic or may contain
# cycles (in which case recursive traversals must be checked for cycles,
# and there will be no roots or leaves)
#
# The GraphVocabI interface provides different methods depending on
# what semantics are required; some programs may not care about the
# meaning of arcs in the graph (eg graph visualisation tools). Other
# programs may only be interested in subclass/superclass hierarchies, or
# subsumption hierarchies
#
# depending on the implementation, 'covered' and 'covered_by' may mean
# exactly the same as 'child' and 'parent' respectively; if the
# implementation provides some kind of semantics, the meaning may be more
# restricted. for instance, temporal relationship types may not be
# included in the covered/covered_by list, as they do not follow the true
# path rule. relationships that cover/subsume may be: ISA, PARTOF
#
# subclass/superclass relationships are strict inheritance hierarchies
# eg ISA
#
# each implementation of this interface should clearly specify the
# semantics of the different graph traversal calls

interface GraphVocabI extends Bio::Graph::GraphI, StructuredVocabI
  get_child_terms(VocabTermI term): VocabTermI[]
  get_all_child_terms(VocabTermI term): VocabTermI[]
  get_parent_terms(VocabTermI term): VocabTermI[]
  get_all_parent_terms(VocabTermI term): VocabTermI[]
  get_covered_terms(VocabTermI term): VocabTermI[]          # terms subsumed
  get_all_covered_terms(VocabTermI term): VocabTermI[]      # terms subsumed, recursive
  get_covered_by_terms(VocabTermI term): VocabTermI[]       # subsuming terms
  get_all_covered_by_terms(VocabTermI term): VocabTermI[]   # subsuming terms, recursive
  get_subclass_terms(VocabTermI term): VocabTermI[]         # inheriting terms
  get_all_subclass_terms(VocabTermI term): VocabTermI[]     # inheriting terms, recursive
  get_superclass_terms(VocabTermI term): VocabTermI[]       # inherited terms
  get_all_superclass_terms(VocabTermI term): VocabTermI[]   # inherited terms, recursive
  is_acyclic(boolean is_acyclic): boolean
  is_rooted(boolean is_rooted): boolean
  is_relationship_type_acyclic(VocabTermI relationship_type, boolean is): boolean
  is_relationship_type_rooted(VocabTermI relationship_type, boolean is): boolean
  is_relationship_type_covering(VocabTermI relationship_type, boolean is): boolean
  is_relationship_type_subclass(VocabTermI relationship_type, boolean is): boolean
  is_relationship_type_transitive(VocabTermI relationship_type, boolean is): boolean
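
To illustrate the difference between the plain 'child' calls and the 'covered' calls, here is a small sketch in Python (illustration only). It assumes the boolean argument to is_relationship_type_covering() is an optional setter, which is just one reading of the signature above, and it uses strings where the spec has VocabTermI objects. The class and the helper _descend() are hypothetical.

```python
class GraphVocab:
    def __init__(self):
        self.relationships = []       # (parent, child, relationship_type)
        self.covering_types = set()

    def add_relationship(self, parent, child, relationship_type):
        self.relationships.append((parent, child, relationship_type))

    def is_relationship_type_covering(self, relationship_type, value=None):
        # Combined getter/setter: pass True/False to set, nothing to query.
        if value is True:
            self.covering_types.add(relationship_type)
        elif value is False:
            self.covering_types.discard(relationship_type)
        return relationship_type in self.covering_types

    def _descend(self, term, types=None):
        """Recursive children of term; if types is given, only follow
        relationships of those types. The seen set guards against cycles."""
        found, stack, seen = [], [term], {term}
        while stack:
            current = stack.pop()
            for parent, child, reltype in self.relationships:
                if parent != current or child in seen:
                    continue
                if types is not None and reltype not in types:
                    continue
                seen.add(child)
                found.append(child)
                stack.append(child)
        return sorted(found)

    def get_all_child_terms(self, term):
        # No semantics applied: every arc counts.
        return self._descend(term)

    def get_all_covered_terms(self, term):
        # Only relationship types flagged as covering are traversed.
        return self._descend(term, types=self.covering_types)
```

With a DEVELOPS-FROM arc from embryo to limb and an ISA arc from limb to right_leg, and only ISA flagged as covering, get_all_child_terms("embryo") returns both limb and right_leg while get_all_covered_terms("embryo") returns nothing, which matches the embryonic phenotype example earlier in this mail.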


# entities associated could be:
# gene products, sequences, seqfeatures, markers (eg phenotypic/trait
# ontologies)
# an association is often to a single term, but sometimes we may want to
# make associations to multiple terms from orthogonal ontologies; e.g.
# geneX is involved in aorta + time_stage25 + growth
interface AssociationI
  attribute VocabTermI[] vocab_terms
  attribute AnnotatableI[] associated_entities
  attribute EvidenceI[] evidence
  timestamp(): TimeStamp

# question: should we have a common interface for Marker and GeneProduct
# that implements some standard methods/attributes;
# eg identifier,species, name, label, dblinks,...
# this way, a generic vocab+association tool such as AmiGO could
# be made to work with GO+genes, OR with PO+markers/alleles...

# Refactor Bio::AnnotatableI to be used by Bio::SeqI, and AnnotatableI
# has annotation_collection and convenience methods

interface AnnotatableI
  attribute Identifier identifier
  attribute string label                        # e.g. gene symbol
  attribute string full_name                    # e.g. full gene name
  attribute string description                  # e.g. text desc of gene
  attribute DBLink[] dblinks
  attribute Bio::Map::MarkerI[] markers         # any associated markers
  attribute SeqI[] seqs
  attribute SeqFeatureI[] seq_features
  attribute Bio::Species species
  attribute string source

interface EvidenceI
  attribute VocabTermI[] evidence_types
  attribute DBLink[] references                   # e.g. medline entries
  attribute DBLink[] evidence_dblinks             # e.g. swissprot accessions

# filter for fetching terms/associations; simple querying system
# examples could be species, source, evidence
interface FilterI
  attribute Bio::Species[] species
  attribute string[] sources
  attribute string[] evidences

interface AssociationStoreI
  get_all_annotatables(): AnnotatableI[]
  get_annotatables_by_terms(VocabTermI[] vocab_terms): AnnotatableI[]
  get_terms_by_annotatables(AnnotatableI[] annotatables): VocabTermI[]
  set_filter(FilterI filter):
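
A sketch of how the filter and the store might fit together, in Python for illustration only. To keep it short, species and source are flattened onto the association itself (in the spec they hang off AnnotatableI), entities and terms are plain strings, and an empty filter list is taken to mean "no restriction"; all of these are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Association:
    vocab_terms: list
    associated_entities: list                      # entity identifiers
    evidence: list = field(default_factory=list)   # evidence codes, e.g. "IDA"
    species: str = ""
    source: str = ""

@dataclass
class Filter:
    species: list = field(default_factory=list)
    sources: list = field(default_factory=list)
    evidences: list = field(default_factory=list)

    def accepts(self, assoc):
        # An empty list places no restriction on that attribute.
        return ((not self.species or assoc.species in self.species)
                and (not self.sources or assoc.source in self.sources)
                and (not self.evidences
                     or bool(set(self.evidences) & set(assoc.evidence))))

class AssociationStore:
    def __init__(self, associations):
        self.associations = list(associations)
        self.filter = Filter()

    def set_filter(self, filter_):
        self.filter = filter_

    def _visible(self):
        return [a for a in self.associations if self.filter.accepts(a)]

    def get_annotatables_by_terms(self, vocab_terms):
        wanted = set(vocab_terms)
        return sorted({e for a in self._visible()
                       if wanted & set(a.vocab_terms)
                       for e in a.associated_entities})

    def get_terms_by_annotatables(self, annotatables):
        wanted = set(annotatables)
        return sorted({t for a in self._visible()
                       if wanted & set(a.associated_entities)
                       for t in a.vocab_terms})
```

Setting the filter on the store, rather than passing it to every call, follows the set_filter() signature above; a combined vocab + store would expand vocab_terms to their covered terms before calling get_annotatables_by_terms().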

interface CombinedVocabI extends GraphVocabI, AssociationStoreI

interface FactoryI extends Bio::Root::RootI
  create_Graph(): GraphI
  create_GraphVocab(): GraphVocabI
  create_VocabTerm(): VocabTermI

!class Factory implements FactoryI extends Bio::Root::Root

!class Bio::Ontology::Triple::Triple implements TripleI
!      extends Bio::Root::Root
!class Bio::Ontology::Triple::TripleStore implements TripleStoreI
!      extends Bio::Root::Root

!class Bio::Ontology::Graph::Graph implements GraphI
!      extends Bio::Ontology::Triple::TripleStore
!class Bio::Ontology::Graph::Arc implements ArcI
!      extends Bio::Ontology::Triple::Triple
!class Bio::Ontology::Graph::Node implements NodeI
!      extends Bio::Root::Root
!class Bio::Ontology::Graph::Path implements PathI
!      extends Bio::Root::Root
!class Bio::Ontology::Graph::GraphIterator implements GraphIteratorI
!      extends Bio::Root::Root

!class Bio::Ontology::Vocab::VocabTerm implements VocabTermI
!      extends Bio::Ontology::Graph::Node
!class Bio::Ontology::Vocab::Relationship implements RelationshipI
!      extends Bio::Ontology::Graph::Arc
!class Bio::Ontology::Vocab::GraphVocab implements GraphVocabI
!      extends Bio::Ontology::Graph::Graph

!class Bio::Ontology::Association::Association implements AssociationI
!      extends Bio::Root::Root
!class Bio::Ontology::Association::Evidence implements EvidenceI
!      extends Bio::Root::Root
!class Bio::Ontology::Association::Annotatable implements AnnotatableI
!      extends Bio::Root::Root

namespace Bio::Ontology::KB

! this is meant to illustrate how the above system of interfaces could
! be extended into a full frame-style knowledge base / ontology; don't
! expect any implementations of these interfaces for a while...
! should we aim for OKBC compliance here? (see http://?)
! should this be namespaced within bioperl?

! these interfaces could be implemented in different ways; one way
! would be to layer it directly on top of the existing graph layer,
! and implement the DAML+OIL axioms (hard to have full DAML+OIL compliance
! implemented in an imperative language) - or it could just form a bridge
! with an existing ontology tool/KB

! really this is just a demonstration of how this *might* be done - this
! is getting into a gnarly area... eg do we take a DL approach or a more
! expressive approach blah blah....

interface ClassI extends VocabTermI          # concept aka Class aka Frame
    all_classes(): ClassI[]
    sub_classes(): ClassI[]
    all_sub_classes(): ClassI[]
    super_classes(): ClassI[]
    all_super_classes(): ClassI[]
    slots(SlotI []): SlotI

interface SlotI extends VocabTermI