[Bioperl-l] E-value of a combined alignment?

Wed Sep 3 10:02:18 EDT 2003

> 	I am aligning mRNAs against human genome using ungapped tblastx. I
> got a bunch of HSPs with different e-values. I can observed that some of
> them should be in the same group because they are exons of a gene. But
> then what is the e-value of all these HSPs combined? 
> 
> 	I know the formulas of e-value and bit score for BLOSUM62:
> 
> Let S' be bit score, S be score, e be e-value, m be the length of HSP,
> n be length of database.
> 
> S' = (0.318 * S - ln(0.135)) / ln(2)
> 
> e = m * n / (2^(S'))
> 
> 	I am guessing the formula for the e-value of
> non-overlapping combined e-value to be:
> 
> S'' = (0.318 * sum_of_S - ln(0.135)) / ln(2)
> 
> e' = sum_of_m * n / (2^(S''))
> 
> 	Is this correct? Or do you know the right way to calculate it?

If you assume that gaps are irrelevant to your similarity metric (say,
you want to totally discount introns AND assume that all gaps are
caused by non conservation in introns) this is the way to go.  This
formula pretends that the gaps don't exist and that you're dealing with
one long HSP.

I believe that this is actually the behavior of NCBI's BLASTP.  All of the
HSP's in a hit get the same evalue, which is about what you would get if
you summed the bit scores of the HSPs and then calculated a final evalue.

If you want to include some sort of penalty for the gaps, Ian Korf's
suggestion (log(KMN) for each gap) is a good starting point.

If "p" scores were really probabilities, we could combine them using the
formulas for either dependent or independent events.  Has anyone tried
this?

-Chris Dwan
 CCGB, University of Minnesota.