
Pearson's Chi-Squared Test

Pearson's Chi-Squared Test is a commonly used statistical test for categorical data. It has two major use cases:

  • A goodness of fit test, measuring how well observed data fit a theoretical distribution.
  • A test of independence, assessing whether two categorical variables are independent of one another.

Procedure

The procedure for performing a Chi-Squared Test is as follows:

  1. Compute the Chi-Squared statistic \chi^2.
  2. Compute the degrees of freedom.
  3. Use the Chi-Squared distribution to compute the p-value.
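Step 3 can be sketched without a statistics library, because for integer degrees of freedom the upper tail of the Chi-Squared distribution has a closed form. The helper name `chi2_sf` below is hypothetical and chosen only for illustration; in practice a library routine would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable.

    Hypothetical helper for illustration; assumes df is a positive integer.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        q = math.exp(-z)               # closed form for df = 2
        a = 1.0
    else:
        q = math.erfc(math.sqrt(z))    # closed form for df = 1
        a = 0.5
    # Raise the degrees of freedom two at a time using the recurrence
    # Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# The p-value is the probability of a statistic at least as large as the
# one observed, e.g. a statistic of 3.84 with 1 degree of freedom.
p_value = chi2_sf(3.84, 1)   # roughly 0.05
```

The returned value is the p-value for a statistic `x` with `df` degrees of freedom: the smaller it is, the less compatible the data are with the null hypothesis.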

Goodness of fit

Imagine we have been rolling a dice and want to test if it is fair. We collect a table of rolls:

Side  Count
1     10
2     12
3     14
4     7
5     20
6     5

Our test statistic is calculated as:

\chi^2 = \sum^n_{i=1}\frac{(O_i - E_i)^2}{E_i}

where

O_i is the observed count

E_i is the expected count under the theoretical distribution

n is the number of categories

N is total observations

In the dice case, n=6 as we have six sides. As our theoretical distribution is that the dice is fair, E_i is simply \frac{N}{6}. A more complicated case would be a theoretical distribution that is normal. Here we would need to compute the probability of an observation falling in each interval (e.g. P(a \lt x \leq b)) and multiply it by N.
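As a rough sketch of the normal case, the expected count for each interval can be obtained from differences of the cumulative distribution function. The sample size, bin edges, and distribution parameters below are made up for illustration and do not come from any real data set.

```python
from statistics import NormalDist

# Hypothetical setup: 200 observations, binned at the edges below, and
# tested against a normal distribution with mean 0 and standard deviation 1.
N = 200
edges = [-1.0, 0.0, 1.0]             # interior bin boundaries (assumed)
dist = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probabilities at the edges, padded with 0 and 1 so that the
# first and last bins extend to minus and plus infinity.
cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]

# P(a < x <= b) for each bin is the difference of consecutive CDF values.
probs = [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]

# Expected count E_i for each bin.
expected = [N * p for p in probs]
```

Because the bins cover the whole real line, the expected counts sum to N, just as the observed counts do.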

The number of degrees of freedom is n - (s + 1), where s is the number of parameters estimated from the data. For the dice example the degrees of freedom is 5, as no parameters need to be estimated. If the theoretical distribution were normal with unknown mean and variance, the degrees of freedom would be n - (2 + 1): here s is 2, as we would need to estimate the mean and variance from the sample.
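The dice example can be worked through in a few lines. This is a minimal sketch using the counts from the table above; the 5% significance level is an assumption made for illustration, and 11.070 is the standard tabulated critical value of the Chi-Squared distribution with 5 degrees of freedom at that level.

```python
# Observed counts for sides 1 to 6, taken from the table of rolls.
observed = [10, 12, 14, 7, 20, 5]

N = sum(observed)            # total observations
n = len(observed)            # number of categories

# Under the hypothesis of a fair dice every side is equally likely.
expected = [N / n] * n

# Chi-squared statistic: sum of (O_i - E_i)^2 / E_i.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# No parameters were estimated from the data, so s = 0.
s = 0
df = n - (s + 1)

# Tabulated 5% critical value for 5 degrees of freedom (assumed level).
CRITICAL_5PCT_DF5 = 11.070
reject_fair = chi2 > CRITICAL_5PCT_DF5
```

Here the statistic comes out at about 12.65, which is above the critical value, so at the 5% level the hypothesis that the dice is fair would be rejected.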

Test of Independence

The most common use case of the Chi-Squared test is testing whether there is an association between two categorical variables. Some examples are:

  • Is there a relationship between gender and the party you vote for?
  • Is there a relationship between a landing page version and conversions (A/B testing)?

One important assumption is that the observations are independent and each one falls into exactly one cell of the table, e.g. you can only vote for one party, and you are only shown one landing page.

To begin we form a contingency table, such as:

        Democrat  Republican
Male    20        30
Female  30        20

This example is a 2x2 table, but we can also have tables of 3x2, 3x3 etc.

If we have r rows and c columns, the expected frequency for the cell in row i and column j is:

E_{i,j}= \frac{(\sum_{k=1}^cO_{i,k})(\sum_{k=1}^rO_{k,j})}{N}

i.e. Row Total * Column Total / N.

The test statistic is then:

\chi^2=\sum^r_{i=1}\sum^c_{j=1}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

The number of degrees of freedom is (r - 1)(c - 1).

Our null hypothesis is that the variables (gender and party, for instance) are independent. Our alternative hypothesis is that they are associated; the test says nothing about the direction of the association. If the p-value is below the significance level, we reject the null hypothesis and conclude that the evidence suggests a relationship between the two variables.
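The whole test of independence can be sketched for the gender and party table above. The variable names and the 5% significance level are assumptions made for illustration. The p-value line relies on the closed form for 1 degree of freedom, so it only applies to a 2x2 table.

```python
import math

# Contingency table from the example: rows are Male and Female,
# columns are Democrat and Republican.
observed = [[20, 30],
            [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

# Expected frequency for each cell: Row Total * Column Total / N.
expected = [[r * c / N for c in col_totals] for r in row_totals]

# Chi-squared statistic summed over every cell of the table.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (r - 1)(c - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# With df = 1 the upper tail is erfc(sqrt(chi2 / 2)); this shortcut is
# NOT valid for larger tables.
p_value = math.erfc(math.sqrt(chi2 / 2.0))

alpha = 0.05                 # assumed significance level
reject_independence = p_value < alpha
```

For this table every expected count is 25, the statistic is 4.0, and the p-value is roughly 0.046, so independence would be rejected at the 5% level, though not by a wide margin.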

Small Cell Counts

The test assumes that the test statistic follows a Chi-Squared distribution. This holds only in the limit, as the sample size grows.

The assumption can break down when a cell has a very small count. Often, if an expected cell count is less than 5 (sometimes 10), Yates' correction for continuity is applied. This makes the test more conservative, but it also increases the likelihood of a type II error.
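The effect of the correction can be seen on a 2x2 table: 0.5 is subtracted from each absolute difference before squaring. The sketch below reuses the gender and party counts purely for illustration (their expected counts of 25 are not small, so the correction would not normally be needed) and assumes every |O - E| is at least 0.5.

```python
import math

observed = [[20, 30],
            [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)
expected = [[r * c / N for c in col_totals] for r in row_totals]

# Flatten both tables into matching lists of (O, E) pairs.
cells = [(o, e)
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

# Uncorrected statistic.
chi2_plain = sum((o - e) ** 2 / e for o, e in cells)

# Yates' correction: shrink each |O - E| by 0.5 before squaring.
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

# p-values for 1 degree of freedom (2x2 table only).
p_plain = math.erfc(math.sqrt(chi2_plain / 2.0))
p_yates = math.erfc(math.sqrt(chi2_yates / 2.0))
```

The corrected statistic is smaller (3.24 against 4.0), so the p-value rises from about 0.046 to about 0.072, which illustrates why the corrected test rejects less often.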

If the cell counts are less than 10 and the table is a 2x2 contingency table, an alternative is to apply Fisher's exact test. The test can also be applied to general r x c contingency tables using Monte Carlo simulation.
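For a 2x2 table, Fisher's exact test can be sketched directly from the hypergeometric distribution. The function name `fisher_exact_2x2` is hypothetical, and the two-sided p-value here follows one common convention: sum the probabilities of all tables with the same margins that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Hypothetical helper for illustration only.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # given the fixed row and column totals.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    # Range of top-left values that keeps every cell non-negative.
    lo = max(0, col1 - row2)
    hi = min(row1, col1)

    p_obs = prob(a)
    # Small relative tolerance so ties are not lost to rounding.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# A small table where the Chi-Squared approximation would be doubtful.
p_value = fisher_exact_2x2(3, 1, 1, 3)
```

For the small table in the usage line the result is 34/70, or about 0.486, so there is no evidence of an association.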

Implementation

The chisq.test function in R implements the Chi-Squared test, including the option to apply Yates' correction.

The fisher.test function implements Fisher's exact test.