[Biojava-l] HMM

Sun Jan 28 22:01:10 UTC 2007

Hi,

I have a question regarding HMMs. I created a custom HMM following
the Dice example on the web site
(http://www.biojava.org/wiki/BioJava:Tutorial:Dynamic_programming_examples).
It works fine and I can either generate sequences or the corresponding
state path. However, I would like to train the model and get the
probability that a certain sequence was produced by this model. I
tried the following:

try {
       DP dp = DPFactory.DEFAULT.createDP(createMyModel());
       StatePath obs_rolls = dp.generate(4);
       SymbolList roll_sequence = obs_rolls
           .symbolListForLabel(StatePath.SEQUENCE);
       SymbolList[] res_array = { roll_sequence };
       StatePath v = dp.viterbi(res_array, ScoreType.PROBABILITY);

       BaumWelchTrainer bwt = new BaumWelchTrainer(dp);
       StoppingCriteria sc = new StoppingCriteria() {
         public boolean isTrainingComplete(TrainingAlgorithm arg0) {
           if (arg0.getCycle() > 100)
           //if (Math.abs(arg0.getLastScore() - arg0.getCurrentScore()) < 0.5)
             return true;
           return false;
         }
       };

       try {
         BufferedReader br = new BufferedReader(new FileReader(args[0]));
         SequenceDB db = new HashSequenceDB();
         myAlphabet.putTokenization("token",
             new NameTokenization(myAlphabet, true));
         while (br.ready()) {
           String line = br.readLine();
           SymbolList sym = new SimpleSymbolList(
               myAlphabet.getTokenization("token"), line);
           db.addSequence(new SimpleSequence(sym, "",
               line.replaceAll(" ", ""), Annotation.EMPTY_ANNOTATION));
         }
         bwt.train(db, 0.1, sc);
         for (Iterator i=db.ids().iterator(); i.hasNext(); ) {
           Sequence seq = db.getSequence(i.next().toString());
           System.out.println(seq.seqString() + "\tprobability\t"
               + bwt.getDP().forward(new SymbolList[] {seq},
                   ScoreType.PROBABILITY));
         }
       } catch (FileNotFoundException e) {
         e.printStackTrace();
       } catch (IOException e) {
         e.printStackTrace();
       } catch (ChangeVetoException e) {
         e.printStackTrace();
       }

       SymbolList realstates = obs_rolls.symbolListForLabel(StatePath.STATES);
       SymbolList realsymbols =
           obs_rolls.symbolListForLabel(StatePath.SEQUENCE);
       SymbolList states = v.symbolListForLabel(StatePath.STATES);
       SymbolList symbols = v.symbolListForLabel(StatePath.SEQUENCE);// */

       System.out.println("Output:\t" + realsymbols.seqString());
       System.out.println("Position:\t" + realstates.seqString());
       System.out.println("Probability:\t" + dp.forward(
           new SymbolList[] {realsymbols}, ScoreType.PROBABILITY));

     } catch (IllegalArgumentException e) {
       e.printStackTrace();
     } catch (BioException e) {
       e.printStackTrace();
     }

In createMyModel() I create my custom model, which is a modified
version of the aforementioned example.
When I comment out the line bwt.train(db, 0.1, sc); the output of the line

System.out.println("Probability:\t" + dp.forward(
    new SymbolList[] {realsymbols}, ScoreType.PROBABILITY));

gives negative probabilities such as

Probability:	-5.851716517873089

Otherwise (when I use the BaumWelchTrainer), the probabilities are
even NaN.

What is the meaning of this? Why are the probabilities not between 0
and 1, and why does the BaumWelchTrainer produce NaN values?
So my question is: how can I get the probability that the HMM emits a
given sequence, and how can I train the HMM properly?
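
One possible reading of the negative value, stated here as an assumption and not as confirmed BioJava behaviour: if forward() reports its score in natural-log space, then -5.85 would be a log probability, and exponentiating it would give a number between 0 and 1. The sketch below only illustrates that conversion with plain Java; the class and method names (LogScoreDemo, toProbability) are hypothetical and not part of BioJava.

```java
public class LogScoreDemo {

    // Assumption: the score is a natural-log probability.
    // If the library used a different base or a log-odds score,
    // this conversion would not apply as written.
    static double toProbability(double logScore) {
        return Math.exp(logScore);
    }

    public static void main(String[] args) {
        // Value quoted from the output above.
        double logScore = -5.851716517873089;
        double probability = toProbability(logScore);

        System.out.println("log score:   " + logScore);
        System.out.println("probability: " + probability);
    }
}
```

For long sequences the exponentiated value can underflow to 0.0 in double precision, so it is usually safer to compare the log scores directly than to convert them.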

I appreciate every answer!

Cheers
Andreas

---
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091