[Bioperl-l] Bio::Ontology overhaul

Wed Feb 26 17:06:28 EST 2003

(This is the email I promised to write 2 days ago :)

<abstract>
Triggered by what happened during the hackathon in Singapore, the 
Bioperl Ontology object model needs to undergo a number of changes that 
I'm going to describe below, along with a proposal for how to resolve 
the issues. I have largely implemented the proposal, and barring major 
objections I'm ready to commit tonight or tomorrow morning. Even though 
some of these things significantly change the API, we (I, backed by the 
core people) propose to migrate these changes over to the 'stable' 1.2 
branch, to be able to release bioperl-db (bridges to the biosql schema) 
in dependency on a stable bioperl branch.

	<b>If you use the Ontology modules in bioperl, you should read 
this.</b>

Otherwise chances are you will not be affected, and given the fact that 
this is a long email you may want to ignore it.
</abstract>

<background>
In Singapore we changed and solidified (among other things) the 
ontology part of the Biosql schema. These changes make it incompatible 
with the current bioperl ontology object model. Coding around these 
incompatibilities is painful and basically a waste of energy, so my 
push was to change the bioperl ontology model to be a) compatible and 
b) more sensible, two goals that fortunately go together nicely. The 
gist of the schema changes in this corner is that 1) ontology terms get 
namespaces ('Ontology') which are not shared with the namespace of 
sequence entries ('Biodatabase'), and 2) ontology term relationships 
also get a namespace (again, an 'Ontology').

In addition, we think the sanest approach to release is to have 
bioperl-db depend on a stable bioperl release instead of a developer's 
release, which would contain code that expressly is subject to changing 
its API without backwards compatibility. So, we decided to make an 
unusual exception to the rule that the API not change in a stable 
release, hoping that few people, if any, have already jumped on that 
(totally new) part of bioperl. If you disagree for whatever reason, 
please speak 
up and make your concerns heard.
</background>

<proposal>
Here's the deal that I came up with.

1) Introduce Bio::Ontology::OntologyI with methods

	-name       (unique)
	-authority  (e.g., geneontology.org)
	-identifier (not required to be publicly meaningful)
	-definition (human-readable description)
	-close      (frees any occupied resources)

It inherits from Bio::Ontology::OntologyEngineI, so all the ontology 
query methods are there, too (get_parent_terms, get_ancestor_terms, 
get_root_terms, etc.).

I can be easily talked into changing the namespace to Bio:: (from 
Bio::Ontology::).

2) Introduce a default implementation
Bio::Ontology::Ontology is-a Bio::Ontology::OntologyI.

The implementation of all ontology query methods (those inherited from 
OntologyEngineI) is by composition. There is a method engine() that 
lets you set the query engine to use (with a default being instantiated 
if none provided). This makes it possible to let one query engine 
instance manage multiple ontologies.
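
To make the composition idea concrete, here is a minimal sketch. It is 
not the code I'm about to commit: the Sketch::* package names, 
add_parent(), and the plain-string terms are made up for illustration, 
and only engine() and get_parent_terms() correspond to methods named 
above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a query engine; not the real engine class.
package Sketch::Engine;

sub new { my ($class) = @_; return bless({ parents => {} }, $class) }

# Record that $child has $parent as a direct parent term.
sub add_parent {
    my ($self, $child, $parent) = @_;
    push @{ $self->{parents}{$child} }, $parent;
    return;
}

sub get_parent_terms {
    my ($self, $term) = @_;
    return @{ $self->{parents}{$term} || [] };
}

# Hypothetical stand-in for an OntologyI implementation.
package Sketch::Ontology;

sub new {
    my ($class, %args) = @_;
    my $self = bless({ name => $args{-name} }, $class);
    $self->engine($args{-engine}) if $args{-engine};
    return $self;
}

sub name { my ($self) = @_; return $self->{name} }

# Get/set the engine; a default one is created if none was provided.
sub engine {
    my ($self, $engine) = @_;
    $self->{engine} = $engine if defined $engine;
    $self->{engine} ||= Sketch::Engine->new;
    return $self->{engine};
}

# Query methods are answered by delegating to the engine.
sub get_parent_terms {
    my ($self, @args) = @_;
    return $self->engine->get_parent_terms(@args);
}

package main;

# One engine instance serving two ontology objects.
my $shared = Sketch::Engine->new;
$shared->add_parent('kinase activity', 'catalytic activity');

my $ont_a = Sketch::Ontology->new(-name => 'A', -engine => $shared);
my $ont_b = Sketch::Ontology->new(-name => 'B', -engine => $shared);
my @parents = $ont_b->get_parent_terms('kinase activity');
```

The sketch does not keep the terms of different ontologies apart inside 
the shared engine; a real engine would have to.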

3) Add a method ontology() to TermI that accepts and returns an object 
implementing Bio::Ontology::OntologyI. Remove method category() (I 
added an implementation to Term.pm that ensures backward compatibility).

This is a controversial thing to do because it almost inevitably 
creates memory cycles (the term points to the ontology which points to 
its terms). Calling OntologyI::close() is required to break the cycle. 
I thought about this for the last couple days and finally decided that 
for usability's sake this is probably the right thing to do 
nevertheless. Here are my reasons.

	- the only alternative, copying the PrimarySeq/Seq/SeqFeatureI 
pattern, loses half of the usefulness (you want name *and* query engine 
accessible for it to be really useful)

     - most if not all people are going to use only a few ontologies 
during any given runtime, not hundreds of thousands like for features 
and sequences, and those few ontologies you will want in memory anyway

     - you can break all the cycles by a single call to one designated 
method on an ontology, which IMHO is not asking for that much

     - having the ontology() method just return a plain string or a dumb 
namespace object is clunky and has very limited if any usability 
outside of e.g. bioperl-db; in contrast, being able to get at the full 
featured ontology with all the query methods by calling a method on any 
given term is potentially very useful

	- as Matt pointed out to me correctly, it is in fact possible to come 
up with a query engine implementation that avoids the cycles altogether 
by constructing term and relationship objects on the fly from raw hash 
or array refs when such objects are requested. Given the design that 
I'm proposing, it is very easy to plug that in once somebody writes it 
(call $ontology->engine($my_engine_without_term_objects)).
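
For anyone who wants to see the cycle and the effect of close() in 
isolation, here is a small self-contained sketch. The Sketch::* classes, 
add_term(), and the $destroyed counter are invented for this example and 
are not the actual Term or Ontology implementations; only ontology() and 
close() mirror the method names proposed above.

```perl
use strict;
use warnings;

my $destroyed = 0;    # counts objects that Perl actually freed

package Sketch::Term;

sub new {
    my ($class, $name) = @_;
    return bless({ name => $name }, $class);
}

# Get/set the ontology this term belongs to.
sub ontology {
    my ($self, $ont) = @_;
    $self->{ontology} = $ont if defined $ont;
    return $self->{ontology};
}

sub DESTROY { $destroyed++ }

package Sketch::Ontology;

sub new { my ($class) = @_; return bless({ terms => [] }, $class) }

# Adding a term creates the cycle: ontology -> term -> ontology.
sub add_term {
    my ($self, $term) = @_;
    $term->ontology($self);
    push @{ $self->{terms} }, $term;
    return;
}

# Break every cycle in one call by dropping the references on both sides.
sub close {
    my ($self) = @_;
    delete $_->{ontology} for @{ $self->{terms} };
    $self->{terms} = [];
    return;
}

sub DESTROY { $destroyed++ }

package main;

# Without close(): both objects are still alive after leaving the scope.
{
    my $ont = Sketch::Ontology->new;
    $ont->add_term(Sketch::Term->new('leaked'));
}
my $after_leak = $destroyed;

# With close(): the cycle is broken and both objects are freed.
{
    my $ont = Sketch::Ontology->new;
    $ont->add_term(Sketch::Term->new('freed'));
    $ont->close;
}
my $after_close = $destroyed;
```

In the first block nothing is freed because each object keeps the other 
alive; in the second, the term goes away inside close() and the ontology 
at the end of the block.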

4) Add a method ontology() to RelationshipI. See 3) for the caveats and 
considerations.

5) Change the OntologyIO parsers to adapt to these changes, and 
implement a method next_ontology(), returning a 
Bio::Ontology::OntologyI instance, or undef at the end of input.
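
The calling pattern this gives you is the usual iterator loop. The 
sketch below only illustrates that contract: Sketch::OntologyIO is a 
made-up stand-in that hands out pre-built hash refs, whereas a real 
parser would read from a file and return OntologyI objects.

```perl
use strict;
use warnings;

# Hypothetical parser stand-in; not one of the OntologyIO parsers.
package Sketch::OntologyIO;

sub new {
    my ($class, @ontologies) = @_;
    return bless({ queue => [@ontologies] }, $class);
}

# Return the next ontology, or undef once the input is exhausted.
sub next_ontology {
    my ($self) = @_;
    return shift @{ $self->{queue} };
}

package main;

my $parser = Sketch::OntologyIO->new(
    { name => 'molecular function' },
    { name => 'biological process' },
);

# Loop until next_ontology() returns undef.
my @names;
while (my $ont = $parser->next_ontology) {
    push @names, $ont->{name};
}
```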

</proposal>

I also added a class Bio::Ontology::OntologyStore that acts as a 
singleton and is able to resolve names to ontology objects. I 
originally thought this was going to take care of the cyclic reference 
problem, but it really doesn't. It might still be of use to someone ...
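
As a rough idea of what "singleton that resolves names" means, here is a 
sketch. The method names get_instance(), register_ontology(), and 
get_ontology() are assumptions chosen for this example, as is the 
Sketch::OntologyStore package; none of them should be read as the 
committed interface.

```perl
use strict;
use warnings;

# Hypothetical name-to-ontology registry; method names are assumptions.
package Sketch::OntologyStore;

my $instance;    # the single shared store object

# Always hand back the same object, creating it on first use.
sub get_instance {
    my ($class) = @_;
    $instance ||= bless({ by_name => {} }, $class);
    return $instance;
}

# Make an ontology findable under its name.
sub register_ontology {
    my ($self, $ont) = @_;
    $self->{by_name}{ $ont->{name} } = $ont;
    return;
}

# Resolve a name to the registered ontology, or undef if unknown.
sub get_ontology {
    my ($self, $name) = @_;
    return $self->{by_name}{$name};
}

package main;

my $store = Sketch::OntologyStore->get_instance;
$store->register_ontology({ name => 'cellular component' });

# A second lookup of the store sees the same registrations.
my $same  = Sketch::OntologyStore->get_instance;
my $found = $same->get_ontology('cellular component');
```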

Please share your comments, concerns, suggestions, criticisms :-)

Cheers,

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------