[Biopython-dev] Chi2 test in Bio.Phylo.PAML.chi2

Eric Talevich eric.talevich at gmail.com
Sat Sep 14 13:24:17 EDT 2013


On Sat, Sep 14, 2013 at 9:49 AM, Zheng Ruan <zruan1991 at gmail.com> wrote:

> Hi all,
>
> I am trying to use chi2 test within Biopython to reduce my dependency of
> scipy. However, the chi2 test is very slow in some case of stat value when
> degree of freedom is 1 (MK test has a df of 1). Here is a small example:
>
> >>> from Bio.Phylo.PAML import chi2
> >>> chi2.cdf_chi2(1, 1)
> 0.3173105078923443
> >>> chi2.cdf_chi2(1, 2)
> 0.1572992072733692
> >>> chi2.cdf_chi2(1, 3)
> 0.08326451704454607
> >>> chi2.cdf_chi2(1, 4)
> 0.04550026405390195
> >>> chi2.cdf_chi2(1, 5)
> ^CTraceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line
> 20, in cdf_chi2
>     prob = 1 - _incomplete_gamma(x, alpha)
>   File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line
> 116, in _incomplete_gamma
>     pn[i] /= overflow
> KeyboardInterrupt
> >>> chi2.cdf_chi2(1, 6)
> 0.014305878510978087
> >>> chi2.cdf_chi2(1, 7)
> 0.00815097160412992
> >>> chi2.cdf_chi2(1, 8)
> 0.004677734999637195
> >>> chi2.cdf_chi2(1, 9)
> ^CTraceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line
> 20, in cdf_chi2
>     prob = 1 - _incomplete_gamma(x, alpha)
>   File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line
> 112, in _incomplete_gamma
>     pn[i] = pn[i+2]
> KeyboardInterrupt
>
>
> The behavior of chi2.cdf_chi2 is quite wiered. Could someone clarify this?
> Thanks!
>
> Best,
> Zheng Ruan
>

It looks like that implementation of chi2 (based on PAML's C
implementation) has trouble with convergence at df=1. I wrote another
Python implementation of chi2 based on the SciPy source code (to avoid a
hard SciPy dependency in CladeCompare, which also uses a G-test), which you
can use if you find it works better:
https://github.com/etal/biofrills/blob/master/biofrills/stats/chisq.py

It imports the original scipy version at the end in case the user does have
scipy installed, since that compiled version will be much faster. This
hasn't been tested as much as Bio.Phylo.PAML.chi2, though, and I haven't
benchmarked the two Python implementations against each other. Also note
that it uses math.lgamma, which was only added in Python 2.7, so for 2.6
compatibility you'll need to copy in the pure-Python log-gamma
implementation from Bio.Phylo.PAML.chi2. (We could add this conditional
import of math.lgamma to Bio.Phylo.PAML.chi2, too.)

Or, you could try increasing the tolerance used for testing convergence in
Bio.Phylo.PAML.chi2.

Best,
Eric


More information about the Biopython-dev mailing list