[Dynamite] SingleModel with metasequences

Ewan Birney birney@ebi.ac.uk
Wed, 8 Mar 2000 16:55:45 +0000 (GMT)


On Tue, 7 Mar 2000, Ian Holmes wrote:

> Here's my latest SingleModel. I'm quite pleased with it. It replaces
> Ewan's ComplexSequence with the equally badly-named Metasequence (better
> name would be welcomed!).

<grin>. Metasequence isn't so bad. But neither is complexsequence.

> 
> A metasequence is an array of scores (e.g. splice site predictions)
> associated with each residue of a sequence.. assumed to be independent of
> the sequence itself (altho it's usually not).
> 
> You can have more than one metasequence for a model (e.g. splice site
> predictions + hexamer coding potentials). Any state can use scores from
> any number of metasequences.
> 
> Also, I moved the emission scores so they're now associated with states,
> not transitions. I think this is what we're all used to...

Not me! I *always* think of emissions on the transitions. Please - emit
on transitions. Otherwise we get burnt in
	- genewise (transitions into MATCH have different emissions)
	- HMMER (J state)
	- everything else I do ;)

> 
> Local/semilocal/global alignment is specified by means of
> AlignmentTether structures.
> 
> I have labelled each interface "abstract" (meaning it doesn't need a
> factory method) or "concrete" (meaning it does).
> 
> here goes (it's also on the wiki) -- please rip to shreds & tell me what
> you think:
> 
>  module SingleModel {
> 
>   // metasequences
> 
>   typedef Score::NatScoreVector  Metasequence;   // a log odds-ratio for each residue of a sequence (e.g. splice site prediction scores)
>   typedef sequence<Metasequence> MetasequenceVector;   // multiple Metasequences for multiple prediction categories

One of the issues here is that it is very common to have a metasequence
built which the actual state machine only uses a few metasequences (for
example, only uses the 5' splice sites). I felt a drawback to my
complexsequence is that, like above, the Metasequences were just an array
and you had to know that you were looking at "2" in the Metasequence
array.

I wonder is somehow associating a name with the array is not a bad thing.
Hmmmmmmmmmmm.

> 
>   // state & transition interfaces
> 
>   typedef sequence<int> IntVector;
> 
>   interface State {    // concrete
> 
>     readonly attribute string    name;                  // non-unique state ID
>     readonly attribute boolean   emitting;              // false if this is a null state
>     readonly attribute IntVector metasequence_indices;  // list of the Metasequence scores incurred on entering this state
> 
>     boolean equals (State s);  // test for equivalence (this should NOT be based on the name, but probably some internal unique tag)
>     State   clone();           // copy constructor

Why do we need clone()?

>   };
>   
>   struct Transition {
>     State from;
>     State to;
>   };

I want emissions in here.

Also don't we need emission length or lookback length, or something?

What about emitting triplets?

> 
>   typedef sequence<State>      StateVector;
>   typedef sequence<Transition> TransitionVector;
> 
>   // now the model interfaces
> 
>   interface Model {     // abstract
> 
>     string           name();
>     StateVector      states();
>     TransitionVector transitions();
>     State            begin();           // every model is born with a begin & an end state
>     State            end();
>     int              metasequences();   // number of Metasequences that this model requires
> 
>     // perhaps we should add a non-virtual factory method for a vanilla WriteableModelParameters here
>     // (although I hate that word "vanilla"... vanilla is my favourite flavour and now it's a synonym for "default")
>   };
> 
>   interface WriteableModel : Model {    // concrete
> 
>     State new_state (in string name,
> 		     in boolean emitting,
> 		     in IntVector metasequence_indices);    // allocates a new State but doesn't add it to the model
> 
>     void  add_state (in State s);        // adds a State to the model
>     void  remove_state (in State s);     // guess what this does
> 
>     // remove_state should raise an exception if user tries to remove the begin or end states
>     
>     // don't need a factory method for Transitions since they're just struct's. but we do need add & remove methods:
> 
>     void add_transition (in Transition t);
>     void remove_transition (in Transition t);
> 
>     void clear();               // clears all states & transitions
>     void clear_transitions();   // clears transitions but leaves states
> 
>     void set_metasequences (int n);   // set the number of metasequences
> 
>     // The set_metasequences method should raise an exception unless the model is empty (i.e. no states)
>   };
> 
>   // parameter stuff
> 
>   interface ModelParameters {      // abstract
> 
>     readonly attribute Score::Scheme scoring_scheme;
> 
>     Score::NatScore       get_log_transition_probability (in Transition t);
>     Score::NatScoreVector get_log_emission_probabilities (in State s);
> 
>     // ihh: I no longer think the parallel array stuff is a good idea for the top-level interface.
>     // These get_log_.*_probability() methods are not something we want to be calling inside a DP routine,
>     // but they're fine for the top level. We can use parallel arrays (or whatever) lower down, if needs be.
>     //
>     // Actually -- let's stick our necks out and boast that any call to get_log_.*_probability()
>     // will only take O(log M) time, where M is the number of states in the model.
>   };
> 
>   interface WriteableModelParameters : ModelParameters {    // concrete
> 
>     void set_log_transition_probability (in Transition t, in State::NatScore p);
>     void set_log_emission_probabilities (in State s, in State::NatScoreVector p_vec);
> 
>     // I think it is useful to separate out the WriteableModelParameters interface
>     // from the ModelParameters interface, because we might want to have a subclass of
>     // ModelParameters that was a wrapper to a smaller parameter space (e.g. SmithWaterman) -ihh
>   };
> 
>   // counts stuff (used for training)
>   
>   interface ModelCounts {   // concrete
> 
>     readonly attribute Score::Scheme scoring_scheme;
> 
>     Score::NatScore       get_log_transition_count (in Transition t);
>     Score::NatScoreVector get_log_emission_counts (in State s);
> 
>     void zero_counts (in Model m);     // sets all log-counts to -infinity
> 
>     void increase_log_transition_count (in Transition t, in Score::NatScore log_increment);
>     void increase_log_emission_counts (in State s, in Score::NatScoreVector log_increment_vec);
> 
>     // I don't think there's any advantage in having a separate WriteableModelCounts interface -ihh
>   };
> 
>   // null model
>   
>   struct NullParameters {
>     Score::NatScoreVector emit;     // vector of log-likelihoods of each residue according to null model
>     Score::NatScore       extend;   // log-likelihood of single-residue sequence extension according to null model
>     Score::NatScore       end;      // log-likelihood of sequence termination according to null model
>   };
> 
>   // alignments
>   
>   struct Traceback {
>     int             query_start;     // sequence start index for alignment
>     StateVector     path;
>     Score::NatScore log_likelihood_ratio;
>   };
> 

Do we want the traceback to have scores per state jump?

We are implicitly banning two different transitions of the same offset
between the same states. This is what I do anyway, I just want to point it
out ;)


>   // alignment tether (specifies global, local, semi-local etc)
> 
>   struct AlignmentTether {
> 
>     boolean left_tethered;      // if these are both TRUE then it's a global alignment
>     boolean right_tethered;     // if they're both FALSE then it's local
> 
>     // Local alignment implies that the log-likelihood will have an infinite component, because if the cost
>     // of extending the "flanking" region (i.e. the unaligned part) is zero then the penalty for leaving
>     // it has to be infinite for the corresponding global model to stay normalised.
>     // Since computer languages with infinite quantities as basic types are lamentably rare,
>     // we cancel this by ensuring that the null model has exactly as many infinitely-penalised
>     // transitions as the test model.
>     // See http://www.sanger.ac.uk/Users/ihh/thesis.ps.gz pp42-44 for a worked example.
>   };
> 
>   // now the algorithm interfaces
>   
>   interface ViterbiAlgorithm {        // abstract
> 
>     // the idea is that concrete implementations (vanilla, linear-space etc) will inherit from this class
> 
>     Traceback do_Viterbi (in Model model,
> 			  in ModelParameters parameters,
> 			  in NullParameters null,
> 			  in Sequence::LightSeq query,
> 			  in MetasequenceVector metaquery,
> 			  in AlignmentTether tether);
>     
>     // ihh: I'm starting to understand more & more of the rationale behind IDL & CORBA --
>     // it makes sense to use a LightSeq interface rather than a LightSeqMomento struct here,
>     // because interfaces are passed by reference and so are "lighter" (i.e. lower-bandwidth)
>     // in a network context. c00l!

Less of the phREAker000 stuff...

>   };
> 
>   interface ForwardBackwardAlgorithm {      // abstract
>     
>     ModelCounts do_ForwardBackward (in Model model,
> 				    in ModelParameters parameters,
> 				    in NullParameters null,
> 				    in Sequence::LightSeq query,
> 				    in MetasequenceVector metaquery,
> 				    in AlignmentTether tether);
>   };
> 
>   // maybe we want a ForwardAlgorithm interface too -- this should just return a forward score (no traceback)
>   // There should also be a variant of ForwardAlgorithm that returns a ForwardMatrix object
>   // -- which the user can then obtain sample tracebacks from. Perhaps Viterbi should do this too.
> 
>   // basic training interface
>   
>   interface ParameterUpdateAlgorithm {    // abstract
>     void update_parameters (in ModelCounts counts, out WriteableModelParameters parameters);
>   };
> 
>  }
> 
> 
> _______________________________________________
> Dynamite mailing list  -  Dynamite@bioperl.org
> http://www.bioperl.org/mailman/listinfo/dynamite
>