Pearson's Chi-Squared Test is a commonly used statistical test applied to categorical data. It has two major use-cases:
- A goodness of fit test, measuring how well observed data fit a theoretical distribution.
- A test of independence, assessing whether two categorical variables are independent of one another.
The procedure for performing a Chi-Squared Test is as follows:
- Compute the Chi-Squared statistic $\chi^2$.
- Compute the degrees of freedom.
- Use the Chi-Squared distribution to compute the p-value.
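The last step can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name chi2_sf is hypothetical, and it evaluates the upper tail of the Chi-Squared distribution through a series expansion of the regularized incomplete gamma function. The series is fine for moderate statistics but loses precision for p-values very close to zero.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for X ~ chi-squared(df).

    Hypothetical helper: uses the power series for the regularized
    lower incomplete gamma function P(a, x) with a = df/2, x = stat/2.
    """
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    # gamma(a, x) = exp(-x) * x**a * sum_k x**k / (a * (a+1) * ... * (a+k))
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > 1e-15 * abs(total):
        k += 1
        term *= x / (a + k)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# The p-value is the upper-tail probability of the observed statistic.
p_value = chi2_sf(4.0, 2)
```

A quick sanity check is that for 2 degrees of freedom the tail probability has the closed form exp(-stat / 2), which the sketch should reproduce.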
Goodness of Fit
Imagine we have been rolling a dice and want to test if it is fair. We collect a table of rolls:
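Our test statistic is calculated as:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where:
- $O_i$ is the observed count in category $i$
- $E_i$ is the expected count under the theoretical distribution
- $n$ is the number of categories
- $N$ is the total number of observations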
In the dice case, $n = 6$ as we have six sides. As our theoretical distribution is that the dice is fair, $E_i$ is simply $N / 6$. A more complicated case might be if the theoretical distribution was a normal distribution. Here we would need to integrate the probability density over various intervals (e.g. $P(a < X \le b)$) and multiply by $N$.
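For the normal case, the expected counts could be computed along these lines. This is only a sketch under stated assumptions: the function name, the bin edges, the mean, the standard deviation and the sample size are all hypothetical, and the two outer bins are taken to be open-ended so that the probabilities sum to one.

```python
from statistics import NormalDist

def expected_counts_normal(edges, mu, sigma, n_obs):
    """Expected count per bin under Normal(mu, sigma).

    `edges` are the interior cut points, so there are len(edges) + 1 bins;
    the first and last bins extend to minus and plus infinity.
    """
    dist = NormalDist(mu, sigma)
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    # Probability of each interval is the difference of adjacent CDF values.
    return [n_obs * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Hypothetical example: 200 observations, bins split at -1, 0 and 1.
expected = expected_counts_normal([-1.0, 0.0, 1.0], mu=0.0, sigma=1.0, n_obs=200)
```

The expected counts always sum to the number of observations, which is a useful check before computing the statistic.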
The degrees of freedom is $n - 1 - p$, where $p$ is the number of estimated parameters. For the dice example the degrees of freedom is 5, as no parameters need to be estimated. If the theoretical distribution was normal with unknown mean and variance, the degrees of freedom would be $n - 3$: $p$ is 2, as we would need to estimate the mean and variance from the sample.
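As an illustration of the dice example, the statistic and degrees of freedom might be computed as below. The roll counts are made up for this sketch and are not the data from the original table; the function name goodness_of_fit is likewise hypothetical.

```python
def goodness_of_fit(observed, expected, n_estimated_params=0):
    """Return the Chi-Squared statistic and degrees of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - n_estimated_params
    return stat, df

# Hypothetical counts for faces 1..6 (60 rolls in total).
observed = [8, 12, 9, 11, 10, 10]
total = sum(observed)
# A fair dice gives every face the same expected count.
expected = [total / 6] * 6

stat, df = goodness_of_fit(observed, expected)
print(f"statistic = {stat:.3f}, degrees of freedom = {df}")
```

With these made-up counts the statistic works out to 1.0 on 5 degrees of freedom, which is a small deviation from the fair-dice expectation.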
Test of Independence
The most common use-case of the Chi-Squared test is for testing if there is an association between two categorical variables. Some examples are:
- Is there a relationship between gender and the party you vote for?
- Is there a relationship between a landing page version and conversions (A/B testing)?
One important assumption is that the groups must be independent and mutually exclusive, e.g. you can only vote for one party, and you are only shown one landing page.
To begin we form a contingency table, such as:
This example is a 2x2 table, but we can also have tables of 3x2, 3x3 etc.
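If we have $r$ rows and $c$ columns, the expected frequency for the cell in row $i$ and column $j$ is:
$$E_{i,j} = \frac{\left(\sum_{k=1}^{c} O_{i,k}\right)\left(\sum_{k=1}^{r} O_{k,j}\right)}{N}$$
i.e. Row Total * Column Total / N.
The test statistic is then:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
The number of degrees of freedom is $(r - 1)(c - 1)$.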
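The calculation for a contingency table can be sketched as follows. The 2x2 counts are hypothetical and chosen only to make the arithmetic easy to follow; the function name independence_test is an assumption of this sketch, and it works for any table given as a list of rows.

```python
def independence_test(table):
    """Return (statistic, degrees of freedom, expected table)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = Row Total * Column Total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

# Hypothetical 2x2 table: rows are groups, columns are outcomes.
table = [[30, 20],
         [20, 30]]
stat, df, expected = independence_test(table)
print(f"statistic = {stat:.2f}, degrees of freedom = {df}")
```

Here every expected count is 25, each cell deviates by 5, and the statistic is 4.0 on 1 degree of freedom.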
Our null hypothesis is that the variables (gender and party, for instance) are independent. Our alternative hypothesis is that they are associated; the test does not tell us the direction of the association. If the p-value is below the chosen significance level, we reject the null hypothesis, and the evidence suggests there is a relationship between the two variables.
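For a 2x2 table the decision rule can be sketched without any external library, because a Chi-Squared variable with 1 degree of freedom is the square of a standard normal variable. The statistic and the significance level below are hypothetical values chosen for illustration, and the helper name is an assumption of this sketch.

```python
import math

def p_value_df1(stat):
    """P(X >= stat) for X ~ chi-squared(1), valid only for 1 degree of freedom.

    Uses P(Z**2 >= s) = P(|Z| >= sqrt(s)) = erfc(sqrt(s / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))

stat = 4.0      # hypothetical statistic from a 2x2 table
alpha = 0.05    # hypothetical significance level

p_value = p_value_df1(stat)
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```

With these values the p-value is roughly 0.046, so the null hypothesis of independence would be rejected at the 5% level. Tables larger than 2x2 have more degrees of freedom and need the general Chi-Squared distribution instead.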
Small Cell Counts
The test assumes that the test statistic is chi-squared distributed. This holds only in the limit, as the sample size grows large.
The assumption can break down when a cell has a very small count. Often, if an expected cell count is less than 5 (sometimes 10), Yates' correction for continuity is applied. This makes the test more conservative, but it also increases the likelihood of a type II error.
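Yates' correction subtracts 0.5 from each absolute deviation before squaring, i.e. each term becomes (|O - E| - 0.5)^2 / E. A minimal sketch for a 2x2 table follows; the function name and the counts are hypothetical, and capping the adjustment at the deviation itself is a guard added in this sketch so that the correction never overshoots.

```python
def yates_chi_squared(table):
    """Chi-Squared statistic with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            diff = abs(o - e)
            # Shrink the deviation by 0.5, but never below zero.
            stat += (diff - min(0.5, diff)) ** 2 / e
    return stat

# Hypothetical 2x2 table; the uncorrected statistic for it is 4.0.
corrected = yates_chi_squared([[30, 20], [20, 30]])
print(f"corrected statistic = {corrected:.2f}")
```

For these counts the corrected statistic is 3.24, which is smaller than the uncorrected 4.0 and therefore gives a larger p-value. That is the sense in which the correction is more conservative.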
If the cell counts are less than 10 and it is a 2x2 contingency table, an alternative is to apply Fisher's exact test. The test can also be applied to general r x c contingency tables using Monte Carlo simulation.
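To show what Fisher's exact test computes in the 2x2 case, here is a small sketch based on the hypergeometric distribution. It assumes the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one; the function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with p_obs are included.
    tail = sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, tail)

# Hypothetical small-count table.
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(f"p-value = {p_value:.4f}")
```

For this table the possible top-left counts 0 to 4 have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value is 34/70, about 0.486.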
The chisq.test function in R implements the Chi-Squared test, including the ability to use Yates' correction.
The fisher.test function implements Fisher's exact test.