[Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path

Thu Feb 3 22:47:04 UTC 2011

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

walter_gillett at hotmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |walter_gillett at hotmail.com

------- Comment #4 from walter_gillett at hotmail.com  2011-02-03 17:47 EST -------
Short answer:

The fix looks good - I have dug into the logic in detail and stepped through
the example. However, appears to me that there is still a bug in this line of
code in the viterbi method:

    for cur_state in self.transitions_from(main_state):

In this context, "cur_state" is a state prior to "main_state", so what we
really need here is the set of states that lead to main_state, not the set of
states that can be reached from main_state. This bug won't cause trouble in
practice unless you use a non-ergodic HMM, that is, a model in which some state
transitions are disallowed. (The variable names here are confusing, would be
better to rename main_state to cur_state and cur_state to previous_state, or
something like that.) This bug is unrelated to the problem originally reported,
other than appearing in the same part of the code, so perhaps it should be
handled in a separate ticket.

I would be happy to code up a fix if that makes sense.

Longer answer:

I had spent a bunch of time recently investigating this - should have noted
that in bugzilla to avoid duplication of effort. But still seems worthwhile
writing down my notes to document this better, so I'll do that here. 

There was a error in the Viterbi algorithm termination logic, as implemented in
the method MarkovModel#viterbi. The Viterbi probabilities were being multiplied
by the log-probability of a transition back to an end state (state 0). This was
incorrect because in log space the log-probabilities should be added, not
multiplied. Peter's fix removes that multiplication, thus dropping the end
state transition entirely (which Durbin considers optional, so that's fine; and
it was causing trouble). With the bug fixed, the most probable state path to
generate 6 tails (in the example model described by the bug reporter) becomes
"uuuuuu" as expected - no final "f".

At a higher level, there was (in versions 1.56 and prior, but no longer in
trunk) an important undocumented (as far as I can see) requirement that the
model always starts in state 0. The bug reporter complained that the results of
the Viterbi path calculation are wrong because "apparently they depend upon the
order of the state alphabet," which was true. In the example model, providing
the state alphabet ["f", "u"] causes the system to start in state f. Since
there is a big penalty in his example for switching states, you get "ff" as the
most likely state path for the output sequence [tails, tails], even though the
unfair coin is much more likely than the fair coin to yield tails.

Looks like Peter's fix treats all starting states as equally probable, there is
no longer a special start state. That's reasonable, although the coding is a
little confusing:

    # v_{0}(0) = 0
    viterbi_probs[(state_letters[0], -1)] = 0
    # v_{k}(0) = 0 for k > 0
    for state_letter in state_letters[1:]:
    viterbi_probs[(state_letter, -1)] = 0

because it could now more naturally be done in two lines of code rather than
three. Possibly it's useful to keep the assignment for state 0 separate in case
we want to change it. 

A good long-term improvement would be to have a special hidden start state like
the "MagicalState" used by BioJava (see
http://www.biojava.org/wiki/BioJava:Tutorial:Simple_HMMs_with_BioJava). That
would make it possible to specify a probability distribution for what the
initial state should be, a typical HMM feature (see Durbin's book, for
example).

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.