From tsa at bioinf.med.uni-goettingen.de Mon Jan 15 08:16:29 2007
From: tsa at bioinf.med.uni-goettingen.de (Tilman Sauer)
Date: Mon, 15 Jan 2007 14:16:29 +0100
Subject: [Biojava-l] IntegerAlphabet, creating Symbols
Message-ID: <6.1.2.0.0.20070115141038.01f79af8@mailin.mi.med.uni-goettingen.de>

Hi there!

I have some runtime problems creating a SymbolList using an IntegerAlphabet
to create the Symbols. That's what my code looks like:

public static SymbolList integerArray2IntegerSymbolList(int[] intAln)
        throws IllegalAlphabetException {
    int len = intAln.length;
    Symbol[] syms = new Symbol[len];
    IntegerAlphabet ialph = IntegerAlphabet.getInstance();
    for(int i=0; i

Hi,

I do have a question regarding HMMs. I created a custom HMM following the
Dice example on the web site
(http://www.biojava.org/wiki/BioJava:Tutorial:Dynamic_programming_examples).
It works fine and I can generate either sequences or the corresponding
state path. However, I would like to train the model and to get the
probability that a certain sequence was produced by this model. I tried
the following:

try {
    DP dp = DPFactory.DEFAULT.createDP(createMyModel());
    StatePath obs_rolls = dp.generate(4);
    SymbolList roll_sequence = obs_rolls.symbolListForLabel(StatePath.SEQUENCE);
    SymbolList[] res_array = { roll_sequence };
    StatePath v = dp.viterbi(res_array, ScoreType.PROBABILITY);
    BaumWelchTrainer bwt = new BaumWelchTrainer(dp);
    StoppingCriteria sc = new StoppingCriteria() {
        public boolean isTrainingComplete(TrainingAlgorithm arg0) {
            if (arg0.getCycle() > 100)
            //if (Math.abs(arg0.getLastScore() - arg0.getCurrentScore()) < 0.5)
                return true;
            return false;
        }
    };
    try {
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        SequenceDB db = new HashSequenceDB();
        myAlphabet.putTokenization("token", new NameTokenization(myAlphabet, true));
        while (br.ready()) {
            String line = br.readLine();
            SymbolList sym = new SimpleSymbolList(myAlphabet.getTokenization("token"), line);
            db.addSequence(new SimpleSequence(sym, "", line.replaceAll(" ", ""),
                    Annotation.EMPTY_ANNOTATION));
        }
        bwt.train(db, 0.1, sc);
        for (Iterator i=db.ids().iterator(); i.hasNext(); ) {
            Sequence seq = db.getSequence(i.next().toString());
            System.out.println(seq.seqString()+"\tprobability\t"+
                    bwt.getDP().forward(new SymbolList[] {seq}, ScoreType.PROBABILITY));
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (ChangeVetoException e) {
        e.printStackTrace();
    }
    SymbolList realstates = obs_rolls.symbolListForLabel(StatePath.STATES);
    SymbolList realsymbols = obs_rolls.symbolListForLabel(StatePath.SEQUENCE);
    SymbolList states = v.symbolListForLabel(StatePath.STATES);
    SymbolList symbols = v.symbolListForLabel(StatePath.SEQUENCE);// */
    System.out.println("Output:\t" + realsymbols.seqString());
    System.out.println("Position:\t" + realstates.seqString());
    System.out.println("Probability:\t" + dp.forward(new SymbolList[] {realsymbols},
            ScoreType.PROBABILITY));
} catch (IllegalArgumentException e) {
    e.printStackTrace();
} catch (BioException e) {
    e.printStackTrace();
}

In createMyModel() I create my custom model, which is a modified version
of the aforementioned example. When I comment out the line

    bwt.train(db, 0.1, sc);

the output of the line

    System.out.println("Probability:\t" + dp.forward(new SymbolList[] {realsymbols},
            ScoreType.PROBABILITY));

will give negative probabilities like

    Probability: -5.851716517873089

otherwise (when I use the BaumWelchTrainer) the probabilities will even be
NaN. What is the meaning of this? Why are the probabilities not between 0
and 1, and why does the BaumWelchTrainer produce NaN values?

So my question is: how can I get the probability that the HMM emits a
given sequence, and how can I train the HMM properly?

I appreciate every answer!

Cheers
Andreas

---
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091
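
One possible reading of the negative value, offered only as an assumption since nothing in this message confirms it: many HMM implementations run the forward algorithm in log space and return ln P(sequence | model) rather than the probability itself. A value such as -5.85 would then map back into the range 0 to 1 through Math.exp. The sketch below is plain Java and does not use any BioJava class. The class LogScoreDemo, its methods, and the two-state fair/loaded dice parameters are all hypothetical and chosen only for illustration.

```java
/** Illustrative only: a tiny log-space forward algorithm, independent of BioJava. */
final class LogScoreDemo {

    /** Numerically stable ln(exp(a) + exp(b)); handles ln(0) = -Infinity. */
    static double logSumExp(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b;
        if (b == Double.NEGATIVE_INFINITY) return a;
        double max = Math.max(a, b);
        return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
    }

    /**
     * Forward algorithm in log space.
     * start[s], trans[s][t] and emit[s][symbol] are ordinary probabilities.
     * Returns ln P(obs | model), which is never positive.
     */
    static double logForward(double[] start, double[][] trans,
                             double[][] emit, int[] obs) {
        int n = start.length;
        double[] alpha = new double[n];
        for (int s = 0; s < n; s++) {
            alpha[s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double acc = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < n; i++) {
                    acc = logSumExp(acc, alpha[i] + Math.log(trans[i][j]));
                }
                next[j] = acc + Math.log(emit[j][obs[t]]);
            }
            alpha = next;
        }
        double total = Double.NEGATIVE_INFINITY;
        for (double a : alpha) {
            total = logSumExp(total, a);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical parameters: state 0 = fair die, state 1 = loaded die.
        double[] start = {0.5, 0.5};
        double[][] trans = {{0.95, 0.05}, {0.10, 0.90}};
        double[][] emit = {
            {1 / 6.0, 1 / 6.0, 1 / 6.0, 1 / 6.0, 1 / 6.0, 1 / 6.0},
            {0.1, 0.1, 0.1, 0.1, 0.1, 0.5}
        };
        int[] rolls = {5, 5, 2, 0};  // faces 6, 6, 3, 1 as zero-based indices

        double logScore = logForward(start, trans, emit, rolls);
        System.out.println("log score:   " + logScore);
        System.out.println("probability: " + Math.exp(logScore));
    }
}
```

If the log-score assumption holds for the value quoted above, Math.exp(-5.851716517873089) comes to roughly 0.0029, a plausible likelihood for four rolls. The sketch does not address the NaN values seen after training.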