[Dynamite] SingleModel with metasequences

Ian Holmes ihh@fruitfly.org
Tue, 7 Mar 2000 21:49:19 -0800 (PST)


Here's my latest SingleModel. I'm quite pleased with it. It replaces
Ewan's ComplexSequence with the equally badly-named Metasequence (a better
name would be welcome!).

A metasequence is an array of scores (e.g. splice site predictions),
one for each residue of a sequence. The scores are assumed to be
independent of the sequence itself (although they usually aren't).

You can have more than one metasequence for a model (e.g. splice site
predictions + hexamer coding potentials). Any state can use scores from
any number of metasequences.
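
A minimal Python sketch of how a state might pick up scores from several
metasequences. This is not part of the IDL: the names state_metascore,
splice and hexamer are hypothetical, and the sketch assumes that the score
incurred at a position is simply the sum of the selected metasequences'
entries at that position (i.e. they are treated as independent log
odds-ratios).

```python
from typing import List

Metasequence = List[float]  # one log odds-ratio per residue


def state_metascore(metaseqs: List[Metasequence],
                    metasequence_indices: List[int],
                    pos: int) -> float:
    """Sum the scores at position `pos` from the metasequences a state uses.

    Assumption: scores from different metasequences simply add.
    """
    return sum(metaseqs[i][pos] for i in metasequence_indices)


# Hypothetical scores for a 4-residue sequence.
splice = [0.0, -1.5, 2.0, 0.5]
hexamer = [0.25, 0.25, -0.5, 1.0]
metaquery = [splice, hexamer]

both = state_metascore(metaquery, [0, 1], 2)       # state using both metasequences
only_hexamer = state_metascore(metaquery, [1], 3)  # state using only the second
none = state_metascore(metaquery, [], 0)           # state using no metasequences
```

A state with an empty index list incurs no metasequence score at all.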

Also, I moved the emission scores so they're now associated with states,
not transitions. I think this is what we're all used to...

Local/semilocal/global alignment is specified by means of
AlignmentTether structures.
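
A minimal Python sketch of how the two tether flags could map onto
alignment types. The field names mirror the AlignmentTether struct in the
IDL; the helper alignment_kind is hypothetical, and treating "exactly one
end tethered" as semi-local is an assumption (the IDL comments only spell
out the global and local cases).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentTether:
    left_tethered: bool
    right_tethered: bool


def alignment_kind(tether: AlignmentTether) -> str:
    """Classify an alignment by which ends are tethered."""
    if tether.left_tethered and tether.right_tethered:
        return "global"
    if not tether.left_tethered and not tether.right_tethered:
        return "local"
    # Assumption: exactly one tethered end means semi-local.
    return "semi-local"


kinds = [alignment_kind(AlignmentTether(left, right))
         for left in (True, False)
         for right in (True, False)]
```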

I have labelled each interface "abstract" (meaning it doesn't need a
factory method) or "concrete" (meaning it does).

Here goes (it's also on the wiki) -- please rip it to shreds & tell me what
you think:

 module SingleModel {

  // metasequences

  typedef Score::NatScoreVector  Metasequence;   // a log odds-ratio for each residue of a sequence (e.g. splice site prediction scores)
  typedef sequence<Metasequence> MetasequenceVector;   // multiple Metasequences for multiple prediction categories

  // state & transition interfaces

  typedef sequence<long> IntVector;

  interface State {    // concrete

    readonly attribute string    name;                  // non-unique state ID
    readonly attribute boolean   emitting;              // false if this is a null state
    readonly attribute IntVector metasequence_indices;  // indices of the Metasequences whose scores are incurred on entering this state

    boolean equals (State s);  // test for equivalence (this should NOT be based on the name, but probably some internal unique tag)
    State   clone();           // copy constructor
  };
  
  struct Transition {
    State from;
    State to;
  };

  typedef sequence<State>      StateVector;
  typedef sequence<Transition> TransitionVector;

  // now the model interfaces

  interface Model {     // abstract

    string           name();
    StateVector      states();
    TransitionVector transitions();
    State            begin();           // every model is born with a begin & an end state
    State            end();
    long             metasequences();   // number of Metasequences that this model requires

    // perhaps we should add a non-virtual factory method for a vanilla WriteableModelParameters here
    // (although I hate that word "vanilla"... vanilla is my favourite flavour and now it's a synonym for "default")
  };

  interface WriteableModel : Model {    // concrete

    State new_state (in string name,
                     in boolean emitting,
                     in IntVector metasequence_indices);    // allocates a new State but doesn't add it to the model

    void  add_state (in State s);        // adds a State to the model
    void  remove_state (in State s);     // guess what this does

    // remove_state should raise an exception if user tries to remove the begin or end states
    
    // We don't need a factory method for Transitions since they're just structs, but we do need add & remove methods:

    void add_transition (in Transition t);
    void remove_transition (in Transition t);

    void clear();               // clears all states & transitions
    void clear_transitions();   // clears transitions but leaves states

    void set_metasequences (in long n);   // set the number of metasequences

    // The set_metasequences method should raise an exception unless the model is empty (i.e. no states)
  };

  // parameter stuff

  interface ModelParameters {      // abstract

    readonly attribute Score::Scheme scoring_scheme;

    Score::NatScore       get_log_transition_probability (in Transition t);
    Score::NatScoreVector get_log_emission_probabilities (in State s);

    // ihh: I no longer think the parallel array stuff is a good idea for the top-level interface.
    // These get_log_.*_probability() methods are not something we want to be calling inside a DP routine,
    // but they're fine for the top level. We can use parallel arrays (or whatever) lower down, if needs be.
    //
    // Actually -- let's stick our necks out and boast that any call to get_log_.*_probability()
    // will only take O(log M) time, where M is the number of states in the model.
  };

  interface WriteableModelParameters : ModelParameters {    // concrete

    void set_log_transition_probability (in Transition t, in Score::NatScore p);
    void set_log_emission_probabilities (in State s, in Score::NatScoreVector p_vec);

    // I think it is useful to separate out the WriteableModelParameters interface
    // from the ModelParameters interface, because we might want to have a subclass of
    // ModelParameters that was a wrapper to a smaller parameter space (e.g. SmithWaterman) -ihh
  };

  // counts stuff (used for training)
  
  interface ModelCounts {   // concrete

    readonly attribute Score::Scheme scoring_scheme;

    Score::NatScore       get_log_transition_count (in Transition t);
    Score::NatScoreVector get_log_emission_counts (in State s);

    void zero_counts (in Model m);     // sets all log-counts to -infinity

    void increase_log_transition_count (in Transition t, in Score::NatScore log_increment);
    void increase_log_emission_counts (in State s, in Score::NatScoreVector log_increment_vec);

    // I don't think there's any advantage in having a separate WriteableModelCounts interface -ihh
  };

  // null model
  
  struct NullParameters {
    Score::NatScoreVector emit;     // vector of log-likelihoods of each residue according to null model
    Score::NatScore       extend;   // log-likelihood of single-residue sequence extension according to null model
    Score::NatScore       end;      // log-likelihood of sequence termination according to null model
  };

  // alignments
  
  struct Traceback {
    long            query_start;     // sequence start index for alignment
    StateVector     path;
    Score::NatScore log_likelihood_ratio;
  };

  // alignment tether (specifies global, local, semi-local etc)

  struct AlignmentTether {

    boolean left_tethered;      // if these are both TRUE then it's a global alignment
    boolean right_tethered;     // if they're both FALSE then it's local; if exactly one is TRUE it's semi-local

    // Local alignment implies that the log-likelihood will have an infinite component, because if the cost
    // of extending the "flanking" region (i.e. the unaligned part) is zero then the penalty for leaving
    // it has to be infinite for the corresponding global model to stay normalised.
    // Since computer languages with infinite quantities as basic types are lamentably rare,
    // we cancel this by ensuring that the null model has exactly as many infinitely-penalised
    // transitions as the test model.
    // See http://www.sanger.ac.uk/Users/ihh/thesis.ps.gz pp42-44 for a worked example.
  };

  // now the algorithm interfaces
  
  interface ViterbiAlgorithm {        // abstract

    // the idea is that concrete implementations (vanilla, linear-space etc) will inherit from this class

    Traceback do_Viterbi (in Model model,
                          in ModelParameters parameters,
                          in NullParameters null,
                          in Sequence::LightSeq query,
                          in MetasequenceVector metaquery,
                          in AlignmentTether tether);
    
    // ihh: I'm starting to understand more & more of the rationale behind IDL & CORBA --
    // it makes sense to use a LightSeq interface rather than a LightSeqMomento struct here,
    // because interfaces are passed by reference and so are "lighter" (i.e. lower-bandwidth)
    // in a network context. c00l!
  };

  interface ForwardBackwardAlgorithm {      // abstract
    
    ModelCounts do_ForwardBackward (in Model model,
                                    in ModelParameters parameters,
                                    in NullParameters null,
                                    in Sequence::LightSeq query,
                                    in MetasequenceVector metaquery,
                                    in AlignmentTether tether);
  };

  // maybe we want a ForwardAlgorithm interface too -- this should just return a forward score (no traceback)
  // There should also be a variant of ForwardAlgorithm that returns a ForwardMatrix object
  // -- which the user can then obtain sample tracebacks from. Perhaps Viterbi should do this too.

  // basic training interface
  
  interface ParameterUpdateAlgorithm {    // abstract
    void update_parameters (in ModelCounts counts, out WriteableModelParameters parameters);
  };

 };
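
A minimal Python sketch of the log-space bookkeeping that the ModelCounts
methods imply. It is an illustration only: the class LogCounts and the
helper log_add are hypothetical, transitions are represented as plain
(from, to) name tuples, and it assumes that "increasing" a log-count by a
log-increment means adding the two quantities in ordinary count space,
i.e. computing log(exp(a) + exp(b)).

```python
import math

NEG_INF = float("-inf")  # log of a zero count


def log_add(a: float, b: float) -> float:
    """Return log(exp(a) + exp(b)) without leaving log space."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


class LogCounts:
    """Hypothetical stand-in for the transition half of ModelCounts."""

    def __init__(self) -> None:
        self._log_count = {}

    def zero_counts(self) -> None:
        # Forgetting everything is equivalent to setting every log-count to -infinity.
        self._log_count.clear()

    def get_log_transition_count(self, transition) -> float:
        return self._log_count.get(transition, NEG_INF)

    def increase_log_transition_count(self, transition, log_increment: float) -> None:
        current = self.get_log_transition_count(transition)
        self._log_count[transition] = log_add(current, log_increment)


counts = LogCounts()
t = ("begin", "match")  # hypothetical transition
counts.increase_log_transition_count(t, math.log(2.0))
counts.increase_log_transition_count(t, math.log(3.0))
total = math.exp(counts.get_log_transition_count(t))  # about 5 in count space
```

Working relative to the larger of the two terms keeps exp() from
overflowing when the log-counts are large.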