[Dynamite] SingleModel

Ian Holmes ihh@fruitfly.org
Sun, 5 Mar 2000 12:23:15 -0800 (PST)


On Mon, 6 Mar 2000, Ewan Birney wrote:

> > >   interface Transition {
> > >     State from;
> > >     State to;
> > >     float transition_probability;
> > >     ProbabilityEmission emission; // emission on the transitions.
> > 
> > As I said before I think parameters should be in a separate object.
> > If people don't like this then we could consider having a kind of memo
> > object that represents a parameterised model.
> 
> How do we put the parameters elsewhere? Can you show me the IDL?

Something like

	interface SingleTransitionParameters {
	  float                  transition_probability;
	  Alphabet::WeightVector emission_probability;
	}

	interface SingleModelParameters {
	  SingleTransitionParameters           get_parameters (in Transition t);
	  sequence<SingleTransitionParameters> all_parameters();
	// possibly also:
	//  sequence<SingleTransitionParameters> outgoing_parameters (in State s);
	}

The easiest way to keep the parameters & the model in sync is to stipulate
that the sequence<SingleTransitionParameters> returned by
SingleModelParameters::all_parameters() is indexed by the same index as
the sequence<Transition> returned by the Model::all_transitions() method.
Ditto the outgoing_parameters() method -- if you see what I mean.

I regard this as perfectly valid coding practise as long as it's WELL
documented. We could even incoporate a sanity check, by having a
"Transition* my_transition" field in SingleTransitionParameters.

The next option for keeping the model & parameters in sync is to use the
get_parameters(Transition) method in SingleModelParameters. If we don't
use a ParameterisedModelMemento pattern (as I suggested above), then we
care quite a lot about how fast these lookups are. We would have several
implementation options:

	(0) Linear array of Transitions, searchable by brute force in
            time O(M) where M is the model size
	(1) Sorted array of Transitions, searchable by binary chop in time
            O(log M)
	(2) Large M*M array (so, O(M^2) memory)
	(3) Some kind of hashing on Transitions

This may seem like a lot of effort (maybe this is what you meant by mad
gymnastics -- I've got used to the STL doing all these algorithms for
you!).

However, I really think the parameters are a separate thing from the
model. I want to convince you.... Think about training, think about
parameterising HMMs & GeneWise; even think about models as simple as Smith
Waterman where the number of parameters is less than the number of
transitions... think about Fisher kernels (where we will also want to have
a parameter-like datastructure)... this is a somewhat intuitive feeling
on my part, so I don't want to hammer it in without a consensus.

> How do we keep the parameters in sync with the model? (without mad
> gymnastics)

Implementation detail ;-)
seriously -- see above.

> > 
> > Possibly have a "boolean Transition::is_null" field for null transitions?
> 
> null meaning does not emit things?

Yep.

> 
> > 
> > What about "fanned" transitions (i.e. if it's an "A" then go to state 1,
> > if it's a "G" then go to state 3, if it's a "C" go to state 4 etc)?
> > These can be inefficiently implemented just by having A times as many
> > Transitions (where A is the alphabet size) -- shall we just leave it at
> > this for now? (I think we probably should.)
> 
> Lets leave this...

Check..

> > I also think "SingleModel::State" should be a typedef to int, not an
> > interface of its own. States are lightweight things that are usually
> > treated as ints anyway.
> 
> I don't understand here why we don't make this an object. We want to 
> add/remove these things don't we? When we do the DP I imagine the first
> thing we do is make a little internal datastructure to drive the generic
> DP off. I don't think trying to make the objects look like what we
> drive the low-level algorthmical code off will help us...

OK, the low-level-algorithmic rationale may have been wrong; but the point
is I don't think states have any useful internal data (except their name),
and the only thing you need to be able to DO with them is to test whether
State s1 == State s2.

> force of habit: Most (all?) IDL compilers bitch about using sequence<X>
> outside of a typedef. Annoying eh?

yes indeed... oh well.
Perhaps call them XSequence to be consistent?

Ian