Chi-square distribution and acceptable range.
When to use χ2 test?
The χ2 test is used to check the goodness of fit of data points calculated by a function against observed data points. One application I know of is when a function, given a set of unknown parameters, produces the observed data.
function(parameters) --> data points
In such cases we can back-calculate the parameters from the observed data points. The way to solve these problems is to optimize the parameters: pass them to the function and minimize the sum of squared residuals (SSR), i.e. the sum of squared differences between the calculated and observed values. Following is a rough pseudo-code for this.
function(parameters):
    # program statements
    return calculated_data

function_ssr(calculated_data, observed_data):
    return SSR

minimize SSR:
    w.r.t. parameters
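The pseudo-code above can be made concrete with a small sketch. It assumes a hypothetical two-parameter straight-line model, synthetic data with a known measurement standard deviation `sigma`, and `scipy.optimize.minimize` as the optimizer; none of these come from an actual project. The residuals are divided by `sigma` so that the SSR is on the scale the χ2 test expects.

```python
import numpy as np
from scipy.optimize import minimize


def model(parameters, t):
    """Hypothetical model: a straight line a*t + b."""
    a, b = parameters
    return a * t + b


def ssr(parameters, t, observed, sigma):
    """Sum of squared residuals, each scaled by the measurement std dev."""
    residuals = (model(parameters, t) - observed) / sigma
    return np.sum(residuals ** 2)


# Synthetic "observed" data (an assumption made for illustration only).
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
sigma = 0.5
true_parameters = np.array([2.0, 1.0])
observed = model(true_parameters, t) + rng.normal(0, sigma, t.size)

# Minimize the SSR with respect to the parameters.
result = minimize(ssr, x0=[1.0, 0.0], args=(t, observed, sigma))
print("fitted parameters:", result.x)
print("minimum SSR:", result.fun)
```

Here `result.x` holds the parameter values that minimize the SSR and `result.fun` is the SSR value that would be checked against the χ2 distribution.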
In most cases, we will get some minimum value of SSR and we would accept the parameter values that produced this minimum value.
However, to check whether the results are statistically valid, we have to see if they pass the χ2 test.
We will take an example where we already have a set of fitted parameters and an SSR value. We will not go into any specific optimization problem, as that varies from project to project. Here, we will only look at how to check whether the results are statistically acceptable.
χ2 test example
Let’s take an example where we have:
- 76 observed data points
- 28 parameters we are trying to find
After running the optimization function, we get a set of parameter values and an SSR value. Now we want to check whether this solution is acceptable.
The χ2 distribution for the above data points and parameters has 76 − 28 = 48 degrees of freedom.
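As a quick sanity check, the degrees of freedom and the typical size of the SSR can be computed directly. This is a minimal sketch; the variable names are only illustrative. The mean of a χ2 distribution equals its degrees of freedom, so an SSR near 48 would be typical for this example.

```python
from scipy.stats import chi2

n_observations = 76
n_parameters = 28
df = n_observations - n_parameters  # degrees of freedom

print(df)             # 48
print(chi2.mean(df))  # 48.0, the mean equals the degrees of freedom
print(chi2.std(df))   # sqrt(2 * 48), about 9.80
```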
Following is the code to visualize this distribution:
Plotting Chi-square (χ2) distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.arange(0.01, 80, 0.1)
df = 48
y = chi2.pdf(x, df=df)

plt.plot(x, y)
plt.title('Chi-square distribution df=48')
plt.show()
This is the probability distribution of the SSR values we would expect to get after optimization, provided the residuals are scaled by the measurement standard deviations and the measurement errors are normally distributed. Generally, the SSR should fall within the central 95% interval of this distribution, that is, between the values at 2.5% and 97.5% cumulative probability.
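The claim can be illustrated with a small simulation. The sketch below assumes a hypothetical linear model with a random design matrix and unit-variance normal noise (none of which come from a real project), fits 28 parameters to 76 data points many times with `np.linalg.lstsq`, and checks how often the resulting SSR lands inside the central 95% range of the χ2 distribution with 48 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
n_obs, n_params, n_trials = 76, 28, 2000
df = n_obs - n_params

# Hypothetical linear model: observed = design @ parameters + noise
design = rng.normal(size=(n_obs, n_params))
true_parameters = rng.normal(size=n_params)

ssr_values = np.empty(n_trials)
for i in range(n_trials):
    noise = rng.normal(0.0, 1.0, n_obs)  # unit-variance measurement error
    observed = design @ true_parameters + noise
    # Least-squares fit of all 28 parameters to the noisy data.
    fitted, *_ = np.linalg.lstsq(design, observed, rcond=None)
    ssr_values[i] = np.sum((observed - design @ fitted) ** 2)

lo, hi = chi2.ppf([0.025, 0.975], df=df)
inside = np.mean((ssr_values >= lo) & (ssr_values <= hi))
print(f"mean SSR = {ssr_values.mean():.1f}")
print(f"fraction inside 95% range = {inside:.3f}")
```

Under these assumptions the mean SSR should come out close to 48 and roughly 95% of the simulated SSR values should fall inside the range.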
Below is another way of representing this distribution.
Here, as we move along the X-axis, the curve shows what fraction of the distribution has been accounted for (the cumulative distribution function). We can see that around x=75 the cumulative probability is very close to 1.
x = np.arange(0.01, 1, 0.01)
df = 48
y = chi2.ppf(x, df=df)

plt.plot(y, x)
plt.xlim(20, 80)
plt.show()
The acceptable range of SSR is the set of X-axis values whose cumulative probability lies between 0.025 and 0.975, corresponding to a 95% confidence level.
The exact values can be obtained with the following commands:
chi2.ppf(0.025, df=df)
chi2.ppf((1-0.025), df=df)
Below we will plot the same figure and print out the X-values corresponding to 0.025 and 0.975 cumulative probability.
x = np.arange(0.01, 1, 0.01)
df = 48  # 76 data points - 28 parameters
y = chi2.ppf(x, df=df)

print(f'{chi2.ppf(0.025, df=df)=}')
print(f'{chi2.ppf((1-0.025), df=df)=}')

plt.plot(y, x)
plt.scatter(chi2.ppf(0.025, df=df), 0.025)
plt.scatter(chi2.ppf((1-0.025), df=df), (1-0.025))
plt.show()
chi2.ppf(0.025, df=df)=30.754505709372925
chi2.ppf((1-0.025), df=df)=69.02258578966607
We found that the SSR we get after optimization should be between 30.75 and 69.02 for the solution to be statistically acceptable.
In these kinds of problems, when the SSR is not within the acceptable range, one of two scenarios occurs:
- SSR lower than the lowest acceptable value
- SSR higher than the highest acceptable value
In the first case, when the SSR is too low, the calculated data points fit the observations too well, i.e. they overfit. This may also be a sign of data manipulation.
In the second case, where the SSR is too high, either:
- The function that generates the calculated data points must be improved, or
- There are errors in the data measurements.
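The decision rule described above can be wrapped in a small helper. This is only a sketch: the function name `check_ssr`, its arguments, and its return format are hypothetical choices made for illustration.

```python
from scipy.stats import chi2


def check_ssr(ssr, n_observations, n_parameters, confidence=0.95):
    """Classify an SSR value against the central chi-square interval."""
    df = n_observations - n_parameters
    tail = (1 - confidence) / 2
    lo = chi2.ppf(tail, df=df)      # lowest acceptable SSR
    hi = chi2.ppf(1 - tail, df=df)  # highest acceptable SSR
    if ssr < lo:
        verdict = "too low"
    elif ssr > hi:
        verdict = "too high"
    else:
        verdict = "acceptable"
    return verdict, lo, hi


# Example with 76 data points and 28 parameters (df = 48).
for ssr in (25.0, 50.0, 75.0):
    verdict, lo, hi = check_ssr(ssr, 76, 28)
    print(f"SSR={ssr}: {verdict} (range {lo:.2f} to {hi:.2f})")
```

For the example in this post, an SSR of 25 is flagged as too low, 50 as acceptable, and 75 as too high.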
For fun, we will plot the χ2 probability distribution graph for our example scenario with the acceptable range.
x = np.arange(0.01, 80, 0.1)
df = 48
y = chi2.pdf(x, df=df)

print(f'{chi2.ppf(0.025, df=df)=}')
print(f'{chi2.ppf((1-0.025), df=df)=}')

plt.plot(x, y)

lo_x = chi2.ppf(0.025, df=df)
hi_x = chi2.ppf((1-0.025), df=df)

for p in np.arange(lo_x, hi_x+0.2, 0.2):
    plt.bar(p,
        height=chi2.pdf(p, df=df),
        width=0.2,
        color='grey',
        alpha=0.5
    )
plt.title('95% CI acceptable range')
plt.show()
chi2.ppf(0.025, df=df)=30.754505709372925
chi2.ppf((1-0.025), df=df)=69.02258578966607
Note that we have used:
- SciPy for statistical analysis.
- NumPy to generate arrays.
- Matplotlib for drawing graphs.