
Is there an intuition behind the formula of chi-square?

Cross Validated, asked on November 2, 2021

Why in this formula $\chi^2 = \sum_i \frac{(x_i - m_i)^2}{m_i}$ do we divide by $m_i$ instead of $m_i^2$? Dividing by the squared value has a clear logic for me: we would be comparing the difference against the mean in terms of a ratio (I am not saying that it is correct :), but dividing by the means themselves is not clear to me. Are we doing some kind of normalization here, like variance per unit of mean value?

One Answer

There are several possible explanations. Here is one of them. It should be viewed as partly intuitive rather than entirely rigorous.

Suppose you have $K$ categories and your null hypothesis is that the number of occurrences of the $i$th category is $\mathsf{Pois}(\lambda_i).$ Then the count in the $i$th category is $X_i \sim \mathsf{Pois}(\lambda_i).$

For sufficiently large counts, $X_i$ is nearly normal with $\mu_i = E(X_i) = \lambda_i$ and $\sigma_i^2 = \operatorname{Var}(X_i) = \lambda_i.$ Standardizing, you get that $Z_i = \frac{X_i - \lambda_i}{\sqrt{\lambda_i}} \stackrel{aprx}{\sim} \mathsf{Norm}(0,1).$ And then $Z_i^2 = \frac{(X_i - \lambda_i)^2}{\lambda_i} \stackrel{aprx}{\sim} \mathsf{Chisq}(1).$
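Here is a small simulation sketch (not part of the original answer; it assumes numpy and scipy are available and uses an arbitrary mean of 50) illustrating that for a large Poisson mean, $(X - \lambda)^2/\lambda$ behaves approximately like a $\mathsf{Chisq}(1)$ variable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam = 50                                  # "sufficiently large" Poisson mean (illustrative choice)
x = rng.poisson(lam, size=100_000)        # simulated counts X
z_sq = (x - lam) ** 2 / lam               # standardized and squared: (X - lambda)^2 / lambda

# Compare a few empirical quantiles with the Chisq(1) reference distribution.
probs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(z_sq, probs))           # empirical quantiles of Z^2
print(stats.chi2.ppf(probs, df=1))        # theoretical Chisq(1) quantiles
```

The two sets of quantiles should be close, and they get closer as the Poisson mean grows.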

Then you estimate the $\lambda_i$ from data according to the null hypothesis. If, for example, the null hypothesis is that all $K$ categories are equally likely, then we would use $E_i = \hat\lambda_i = \frac{\sum_i X_i}{K} = \frac{T}{K}.$ If the terms $C_i = \frac{(X_i - E_i)^2}{E_i}$ were independent, then the chi-squared statistic $Q = \sum_i C_i$ would be approximately distributed as $\mathsf{Chisq}(K).$
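As a worked illustration (not from the original answer; the counts are made up and numpy/scipy are assumed), here is how $Q$ is computed under the equally-likely null:

```python
import numpy as np
from scipy import stats

x = np.array([18, 25, 22, 35])            # observed counts X_i (made-up data)
T = x.sum()                               # total count T
K = len(x)                                # number of categories
E = np.full(K, T / K)                     # E_i = T / K under the equally-likely null
C = (x - E) ** 2 / E                      # per-category contributions C_i
Q = C.sum()                               # chi-squared statistic Q = sum_i C_i

print(Q)                                  # the statistic
print(stats.chisquare(x))                 # same statistic, with a Chisq(K-1) p-value
```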

However, the terms are not quite independent because $\sum_i E_i = \sum_i X_i = T.$ So it turns out that $Q \stackrel{aprx}{\sim} \mathsf{Chisq}(K-1)$ (arm-waving here) because of the one linear constraint on the $E_i$s.
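A quick simulation sketch (again not part of the original answer; the choices of $K = 5$, sample size 200, and 20,000 replications are arbitrary) shows that $Q$ tracks $\mathsf{Chisq}(K-1)$ rather than $\mathsf{Chisq}(K)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K, n, reps = 5, 200, 20_000               # categories, sample size, replications

counts = rng.multinomial(n, [1 / K] * K, size=reps)   # counts under the equally-likely null
E = n / K                                             # expected count per category
Q = ((counts - E) ** 2 / E).sum(axis=1)               # one Q per simulated experiment

print(Q.mean())                           # close to K - 1 = 4, the mean of Chisq(K-1)
probs = [0.9, 0.95, 0.99]
print(np.quantile(Q, probs))              # empirical quantiles of Q
print(stats.chi2.ppf(probs, df=K - 1))    # Chisq(K-1) reference quantiles
```

The empirical mean and quantiles line up with $\mathsf{Chisq}(K-1)$, reflecting the one degree of freedom lost to the constraint on the totals.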

Answered by BruceET on November 2, 2021
