[Biojava-l] Null Model
Ren, Zhen
zren at amylin.com
Tue May 6 00:52:16 EDT 2003
Thank you so much for the reply. BioJava is such a valuable resource. Since I am learning it, I have to ask more questions in order to understand them:
1) Notice in the protein version code in my previous email, I used ProteinTools.getAlphabet() method instead of ProteinTools.getTAlphabet(). Actually when you try both, you will get the same output. I expected 1/21 for the former alphabet since it doesn't really include the TERM magic symbol.
2) The magic symbol TERM is really helpful when making things like codon -> amino acid translation tables. However, this is not needed when making a null model. How can I make 1/20 for the most common 20 amino acids and 0 for both SEC and TERM symbols as the null weight? Is there a way to delete those two symbols or do I have to create my own alphabet?
3) In Thomas's email, he mentioned "...For proteins, that's never true, so if you are using the nullModel stuff (for instance, if you're using the DP toolkit in log-odds scoring mode), then you really ought to set it to something more sensible. Just parsing through swissprot and counting the overall amino acid usage in that would at least be vaguely sane. It really only matters if you want odds rather than probabilities..." This is absolutely true and I totally agree. However, I am not really concerned about what kind of null model I should use. I'd like to know how I set up my null model which is different from the default one using 1/22 as the null weight. The simplest example is to change from 1/22 to 1/20. This is the first part of the question.
I would still use the WeightMatrixDemo as my example, listed below are my understanding now. The sequences used to do the alignment (numbers at top indicate the column number in the alignment) are:
12345
aggag
aggaa
aggag
aagag
If without Pseudocounts:
Column# T C G A Total
1 0 0 0 4 4
2 0 0 3 1 4
3 0 0 4 0 4
4 0 0 0 4 4
5 0 0 3 1 4
If with Pseudocounts (The example uses 0.01 as null weight):
Column# T C G A Total
1 0.0025 0.0025 0.0025 4.0025 4.01
2 0.0025 0.0025 3.0025 1.0025 4.01
3 0.0025 0.0025 4.0025 0.0025 4.01
4 0.0025 0.0025 0.0025 4.0025 4.01
5 0.0025 0.0025 3.0025 1.0025 4.01
So, the probabilities of the weight matrix are in Table 1:
Column# T C G A
1 6.234413965087281E-4 6.234413965087281E-4 6.234413965087281E-4 0.9981296758104737
2 6.234413965087281E-4 6.234413965087281E-4 0.7487531172069826 0.25
3 6.234413965087281E-4 6.234413965087281E-4 0.9981296758104737 6.234413965087281E-4
4 6.234413965087281E-4 6.234413965087281E-4 6.234413965087281E-4 0.9981296758104737
5 6.234413965087281E-4 6.234413965087281E-4 0.7487531172069826 0.25
If you take natural log for each number in Table 1 and you will generate Table 2 here:
Column# T C G A
1 -7.380255788426459 -7.380255788426459 -7.380255788426459 -0.001872075429745
2 -7.380255788426459 -7.380255788426459 -0.289345966346476 -1.386294361119890
3 -7.380255788426459 -7.380255788426459 -0.001872075429745 -7.380255788426459
4 -7.380255788426459 -7.380255788426459 -7.380255788426459 -0.001872075429745
5 -7.380255788426459 -7.380255788426459 -0.289345966346476 -1.386294361119890
If you take natural log for (each number in Table 1 / 0.25) and you will generate Table 3 which is the log-odd ratio considering a psudocount with 0.01 as the null weight:
Column# T C G A
1 -5.993961427306569 -5.993961427306569 -5.993961427306569 1.384422285690145
2 -5.993961427306569 -5.993961427306569 1.096948394773414 0
3 -5.993961427306569 -5.993961427306569 1.384422285690145 -5.993961427306569
4 -5.993961427306569 -5.993961427306569 -5.993961427306569 1.384422285690145
5 -5.993961427306569 -5.993961427306569 1.096948394773414 0
00000000011111111112222
12345678901234567890123
aaagcctaggaagaggagctgat is the sequence to score against using the above weight matrix. Here the number indicates the position and please read vertically. The first match in the sequence is located at 8-12, which is aggaa. The demo program calculates the probability like this (using Table 1):
aggaa = Prob(a at pos 1) * Prob(g at pos 2) * Prob(g at pos 3) * Prob(a at pos 4) * Prob(a at pos 5)
= 0.9981296758104737 * 0.7487531172069826 * 0.9981296758104735 * 0.9981296758104737 * 0.25
= 0.186139934193745
And actually BioJava internally does this (using Table 2):
aggaa = 0.9981296758104737 * 0.7487531172069826 * 0.9981296758104735 * 0.9981296758104737 * 0.25
ln(aggaa) = ln(0.9981296758104737) + ln(0.7487531172069826) + ln(0.9981296758104735) + ln(0.9981296758104737) + ln(0.25)
= -0.001872075429745 -0.289345966346476 -0.001872075429745 -0.001872075429745 -1.386294361119890
= -1.681256553755602
So,
aggaa = exp(-1.681256553755602)
= 0.186139934193745
What I am really looking for is log-odd ratio instead of probability, as shown below:
aggaa = (0.9981296758104737 * 0.7487531172069826 * 0.9981296758104735 * 0.9981296758104737 * 0.25) / (0.25 * 0.25 * 0.25 * 0.25 * 0.25)
= 0.186139934193745 / 0.0009765625
= 190.607292614395628
ln(aggaa) = ln(190.607292614395628)
= 5.250215251843850
OR
aggaa = (0.9981296758104737 * 0.7487531172069826 * 0.9981296758104735 * 0.9981296758104737 * 0.25) / (0.25 * 0.25 * 0.25 * 0.25 * 0.25)
ln(aggaa) = ln(0.9981296758104737 / 0.25) + ln(0.7487531172069826 / 0.25) + ln(0.9981296758104735 / 0.25) + ln(0.9981296758104737 / 0.25) + ln(0.25 / 0.25)
= 1.384422285690145 + 1.096948394773414 + 1.384422285690145 + 1.384422285690145 + 0
= 5.250215251843850
The second part of the question is how to do this shown above using BioJava API. Notice although there is a null model behind this, it is actually never used. How would I use the null model here?
Thanks again in advance for any input you have.
Zhen
More information about the Biojava-l
mailing list