[Bioperl-l] Bio::Ontology overhaul
hlapp at gnf.org
Wed Feb 26 17:06:28 EST 2003
(This is the email I promised to write 2 days ago :)
Triggered by what happened during the hackathon in Singapore, the
Bioperl Ontology object model needs to undergo a number of changes that
I'm going to describe below, along with a proposal for how to resolve the
issues. I have largely implemented the proposal and barring major
objections I'm ready to commit tonight or tomorrow morning. Even though
some of these things significantly change the API, we (I, backed by the
core people) propose to migrate these changes over to the 'stable' 1.2
branch, to be able to release bioperl-db (bridges to the biosql schema)
in dependency on a stable bioperl branch.
If you use the Ontology modules in bioperl, you should read this.
Otherwise chances are you will not be affected, and given the fact that
this is a long email you may want to ignore it.
In Singapore we changed and solidified (among other things) the
ontology part of the Biosql schema. These changes make it incompatible
with the current bioperl ontology object model. Coding around these
incompatibilities is painful and basically a waste of energy, so my
push was to change the bioperl ontology model to be a) compatible and
b) more sensible, which fortunately go together nicely. The gist of the
schema changes in this corner is 1) ontology terms get namespaces
('Ontology') which are not shared with the namespace of sequence
entries ('Biodatabase'), and 2) ontology term relationships also get a
namespace (again, an 'Ontology').
In addition, we think the sanest approach to release is to have
bioperl-db depend on a stable bioperl release instead of a developer's
release, which would contain code that expressly is subject to changing
its API without backwards compatibility. So, we decided to make an
unusual exception to the rule that the API not change in a stable
release, hoping that few people, if any, have already jumped on that (totally
new) part of bioperl. If you disagree for whatever reason, please speak
up and make your concerns heard.
Here's the deal that I came up with.
1) Introduce Bio::Ontology::OntologyI with methods
-authority (e.g., geneontology.org)
-identifier (not required to be publicly meaningful)
-definition (human-readable description)
-close (frees any occupied resources)
It inherits from Bio::Ontology::OntologyEngineI, so all the ontology
query methods are there, too (get_parent_terms, get_ancestor_terms,
and so on).
I can be easily talked into changing the namespace to Bio:: (from
2) Introduce a default implementation
Bio::Ontology::Ontology is-a Bio::Ontology::OntologyI.
The implementation of all ontology query methods (those inherited from
OntologyEngineI) is by composition. There is a method engine() that
lets you set the query engine to use (with a default being instantiated
if none provided). This makes it possible to let one query engine
instance manage multiple ontologies.
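
To make the composition idea in 2) concrete, here is a minimal sketch in Python rather than Perl. It is not the bioperl code: the class names SimpleEngine and Ontology, the add_relationship() helper, and the plain-string terms are all assumptions made for illustration. Only engine(), get_parent_terms and get_ancestor_terms are names taken from this email.

```python
class SimpleEngine:
    """Hypothetical query engine: keeps child -> parent links for any number of ontologies."""

    def __init__(self):
        self._parents = {}  # term name -> set of parent term names

    def add_relationship(self, child, parent):
        # Illustrative helper, not part of the proposal.
        self._parents.setdefault(child, set()).add(parent)

    def get_parent_terms(self, term):
        return sorted(self._parents.get(term, set()))

    def get_ancestor_terms(self, term):
        # Walk parent links transitively, visiting each ancestor once.
        seen = set()
        stack = [term]
        while stack:
            for parent in self._parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return sorted(seen)


class Ontology:
    """Hypothetical default implementation: answers queries by delegating to an engine."""

    def __init__(self, name, engine=None):
        self.name = name
        self._engine = engine

    def engine(self, new_engine=None):
        # Combined getter/setter; a default engine is created if none was provided.
        if new_engine is not None:
            self._engine = new_engine
        if self._engine is None:
            self._engine = SimpleEngine()
        return self._engine

    def get_parent_terms(self, term):
        return self.engine().get_parent_terms(term)

    def get_ancestor_terms(self, term):
        return self.engine().get_ancestor_terms(term)
```

Passing the same engine object to two Ontology instances is what lets one engine serve several ontologies; in this simplified sketch the engine does not separate terms by ontology, which a real implementation would presumably have to do.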
3) Add a method ontology() to TermI that accepts and returns an object
implementing Bio::Ontology::OntologyI. Remove method category() (I
added an implementation to Term.pm that ensures backward compatibility).
This is a controversial thing to do because it almost inevitably
creates memory cycles (the term points to the ontology which points to
its terms). Calling OntologyI::close() is required to break the cycle.
I thought about this for the last couple days and finally decided that
for usability's sake this is probably the right thing to do
nevertheless. Here are my reasons.
- the only alternative, copying the PrimarySeq/Seq/SeqFeatureI pattern,
loses half of its usefulness (you want name *and* query engine accessible
for it to be really useful)
- most if not all people are going to use only a few ontologies
during any given runtime, not hundreds of thousands like for features
and sequences, and those few ontologies you will want in memory anyway
- you can break all the cycles by a single call to one designated
method on an ontology, which IMHO is not asking for that much
- having the ontology() method just return a plain string or a dumb
namespace object is clunky and has very limited if any usability
outside of e.g. bioperl-db; in contrast, being able to get at the full
featured ontology with all the query methods by calling a method on any
given term is potentially very useful
- as Matt pointed out to me correctly, it is in fact possible to come
up with a query engine implementation that avoids the cycles altogether
by constructing term and relationship objects on the fly from raw hash
or array refs when such objects are requested. Given the design that
I'm proposing, it is very easy to plug that in once somebody writes it
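
The cycle described in 3), and how a single close() call breaks it, can be sketched as follows. This is a Python illustration, not the bioperl implementation: the Term and Ontology classes and the add_term() helper are assumptions. Perl frees memory purely by reference counting, so such a cycle is never reclaimed on its own; CPython also has a cycle collector, which the sketch switches off temporarily to mimic that behaviour.

```python
import gc
import weakref


class Term:
    """Hypothetical term that points back to its ontology."""

    def __init__(self, name):
        self.name = name
        self._ontology = None

    def ontology(self, ont=None):
        # Combined getter/setter, as proposed for TermI.
        if ont is not None:
            self._ontology = ont
        return self._ontology


class Ontology:
    """Hypothetical ontology that holds its terms (creating the cycle)."""

    def __init__(self, name):
        self.name = name
        self._terms = []

    def add_term(self, term):
        # Illustrative helper: the term points to the ontology, which points to the term.
        term.ontology(self)
        self._terms.append(term)

    def close(self):
        # Break every term -> ontology back-reference, then drop the terms.
        for term in self._terms:
            term._ontology = None
        self._terms = []


def still_alive_after_drop(call_close):
    """Return True if the ontology survives after the last outside reference is gone."""
    ont = Ontology("demo")
    ont.add_term(Term("t1"))
    probe = weakref.ref(ont)  # observes the object without keeping it alive
    if call_close:
        ont.close()
    del ont
    return probe() is not None


gc.disable()  # rely on reference counting only, as Perl does
try:
    leaked_without_close = still_alive_after_drop(call_close=False)
    leaked_with_close = still_alive_after_drop(call_close=True)
finally:
    gc.enable()
    gc.collect()  # clean up the deliberately leaked cycle
```

Under CPython, leaked_without_close comes out true and leaked_with_close comes out false, which is the whole argument for asking callers to invoke close() once per ontology.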
4) Add a method ontology() to RelationshipI. See 3) for the caveats and
the reasoning.
5) Change the OntologyIO parsers to adapt to these changes, and
implement a method next_ontology(), returning a
Bio::Ontology::OntologyI instance, or undef at EOI.
I also added a class Bio::Ontology::OntologyStore that acts as a
singleton and is able to resolve names to ontology objects. I
originally thought this was going to take care of the cyclic reference
problem, but it really doesn't. It might still be of use to someone ...
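
For readers unfamiliar with the pattern, a store that acts as a singleton and resolves names to ontology objects might look roughly like the Python sketch below. The method names register(), get() and remove() are made up for illustration and are not claimed to match Bio::Ontology::OntologyStore.

```python
class OntologyStore:
    """Hypothetical singleton registry mapping ontology names to ontology objects."""

    _instance = None

    def __new__(cls):
        # Every call returns the same shared instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._by_name = {}
        return cls._instance

    def register(self, name, ontology):
        self._by_name[name] = ontology

    def get(self, name):
        # Returns None for unknown names.
        return self._by_name.get(name)

    def remove(self, name):
        return self._by_name.pop(name, None)
```

Any code that calls OntologyStore() gets the same registry, so a name registered in one place can be resolved anywhere else in the program.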
Please share your comments, concerns, suggestions, criticisms :-)
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757