Pearson's Chi-Squared Test

Pearson's Chi-Squared Test is a commonly used statistical test applied to categorical data. It has two major use-cases:

  • A goodness of fit test, measuring how well observed data fit a theoretical distribution.
  • A test of independence, assessing whether two categorical variables are independent of one another.

Procedure

The procedure for performing a Chi-Squared Test is as follows:

  1. Compute the Chi-Squared statistic \chi^2.
  2. Compute the degrees of freedom.
  3. Use the Chi-Squared distribution to compute the p-value.
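As a minimal sketch of these three steps in R (the observed counts here are hypothetical, purely for illustration):

    # Hypothetical observed counts over three categories.
    observed <- c(18, 25, 17)
    expected <- rep(sum(observed) / 3, 3)   # uniform theoretical distribution

    # 1. Compute the Chi-Squared statistic.
    chi_sq <- sum((observed - expected)^2 / expected)

    # 2. Compute the degrees of freedom (no parameters estimated here).
    df <- length(observed) - 1

    # 3. Use the Chi-Squared distribution to get the p-value.
    p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)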

Goodness of fit

Imagine we have been rolling a die and want to test whether it is fair. We collect a table of rolls:

Side   Count
1      10
2      12
3      14
4      7
5      20
6      5

Our test statistic is calculated as:

\chi^2 = \sum^n_{i=1}\frac{(O_i - E_i)^2}{E_i}

where

O_i is the observed count

E_i is the expected count under the theoretical distribution

n is the number of categories

N is total observations

In the die case, n = 6 as we have six sides. As our theoretical distribution is that the die is fair, E_i is simply \frac{N}{6}. A more complicated case might be if the theoretical distribution were a normal distribution. Here we would need to compute the probability of each interval (e.g. P(a \lt x \leq b)) and multiply by N.
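A rough sketch of the normal case in R, where the bin edges, mean, standard deviation and N are hypothetical values chosen purely for illustration:

    # Intervals (a, b] with hypothetical edges; P(a < x <= b) via the CDF.
    breaks <- c(-Inf, -1, 0, 1, Inf)
    probs  <- diff(pnorm(breaks, mean = 0, sd = 1))

    # Expected count per interval is the interval probability times N.
    N        <- 100
    expected <- N * probs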

The degrees of freedom are n - (s + 1), where s is the number of estimated parameters. For the die example the degrees of freedom are 5, as no free parameters need to be estimated. If the theoretical distribution were normal with unknown mean and variance, the degrees of freedom would be n - (2 + 1); s is 2 because we would need to estimate the sample mean and variance.
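Putting the die example together in R (chisq.test defaults to a uniform null, but the fair-die probabilities are passed explicitly here for clarity):

    # Observed counts from the table of rolls above.
    rolls <- c(10, 12, 14, 7, 20, 5)

    # Goodness of fit against a fair die: E_i = N / 6 for every side.
    chisq.test(rolls, p = rep(1/6, 6))

    # The statistic by hand, for comparison with X-squared above.
    E <- sum(rolls) / 6
    sum((rolls - E)^2 / E)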

Test of Independence

The most common use-case of the Chi-Squared test is testing whether there is an association between two categorical variables. Some examples are:

  • Is there a relationship between gender and the party you vote for?
  • Is there a relationship between a landing page version and conversions (A/B testing)?

One important assumption is that the groups must be independent, e.g. you can only vote for one party, and you are only shown one landing page.

To begin we form a contingency table, such as:

          Democrat   Republican
Male      20         30
Female    30         20

This example is a 2x2 table, but we can also have tables of 3x2, 3x3 etc.

If we have r rows and c columns the expected frequency is:

E_{i,j}= \frac{(\sum_{k=1}^cO_{i,k})(\sum_{k=1}^rO_{k,j})}{N}

i.e. Row Total * Column Total / N.
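A sketch of this calculation in R, using the contingency table above (row and column totals via rowSums and colSums):

    # Observed contingency table from above.
    O <- matrix(c(20, 30,
                  30, 20), nrow = 2, byrow = TRUE,
                dimnames = list(c("Male", "Female"),
                                c("Democrat", "Republican")))

    # E[i, j] = row total * column total / N.
    E <- outer(rowSums(O), colSums(O)) / sum(O)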

The test statistic is then:

\chi^2=\sum^r_{i=1}\sum^c_{j=1}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

The number of degrees of freedom is (r - 1)(c - 1).

Our null hypothesis is that the variables (gender and party, for instance) are independent. Our alternative hypothesis is that they have an association (it says nothing about the direction of the association). If the p-value is below the significance level, we reject the null hypothesis and the evidence suggests there is a relationship between the two variables.
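In R the whole test is a single call; a sketch on the table above:

    # chisq.test on a matrix performs the test of independence.
    O <- matrix(c(20, 30,
                  30, 20), nrow = 2, byrow = TRUE)
    chisq.test(O)   # df = (2 - 1)(2 - 1) = 1

Note that for 2x2 tables chisq.test applies a continuity correction by default; more on this in the next section.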

Small Cell Counts

The test assumes that the test statistic is chi-squared distributed. This is only true asymptotically, as the number of observations grows.

The assumption can break down when a cell has a very small count. Often if an expected cell count is less than 5 (sometimes 10), Yates' correction for continuity is applied. The effect of this is to make the test more conservative, but it also increases the likelihood of a type II error.

If the cell counts are less than 10 and it is a 2x2 contingency table, an alternative is to apply Fisher's exact test. The test can also be applied to general r x c contingency tables using Monte Carlo simulation.

Implementation

The chisq.test function in R implements the Chi-Squared test, including the option to apply Yates' correction.
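For example, on the 2x2 table from earlier (the correction only applies to 2x2 tables):

    tbl <- matrix(c(20, 30, 30, 20), nrow = 2)
    chisq.test(tbl, correct = TRUE)    # Yates' correction, the default for 2x2
    chisq.test(tbl, correct = FALSE)   # uncorrected statistic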

The fisher.test function implements Fisher's exact test.
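A sketch of both uses, with a hypothetical 3x3 table for the Monte Carlo case:

    # Exact test on the 2x2 table.
    tbl <- matrix(c(20, 30, 30, 20), nrow = 2)
    fisher.test(tbl)

    # Larger r x c tables can use a simulated (Monte Carlo) p-value.
    big <- matrix(c(8, 3, 5,
                    2, 6, 4,
                    1, 2, 9), nrow = 3, byrow = TRUE)
    fisher.test(big, simulate.p.value = TRUE, B = 10000)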

Area Under the Receiver Operating Characteristic Curve (AUC)

AUC is a common way of evaluating binary classifiers. Two main reasons are:

  • It is insensitive to class imbalance, e.g. fraud detection, conversion prediction.
  • It does not require the predictions to be thresholded (i.e. assigned to one class or the other); it operates directly on classification scores.

An AUC of 0.5 is random performance, while an AUC of 1 is perfect classification. Personally, I transform this to Gini = 2 * AUC - 1, mainly because 0 being random and 1 being perfect is more intuitive to me.

Understanding AUC

First let us look at a confusion matrix:

                      Positive Label         Negative Label
Predicted Positive    True Positive (TP)     False Positive (FP)
                                             (Type I Error)
Predicted Negative    False Negative (FN)    True Negative (TN)
                      (Type II Error)

True Positive Rate  = TPR = TP / (TP + FN)
False Positive Rate = FPR = FP / (FP + TN)

Imagine we have a set of scores from a classification model (these could be probabilities, etc.) along with their true labels. The numbers in the confusion matrix assume we have picked a threshold t: scores >= t are assigned the positive label and scores < t the negative label.
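A sketch in R, with hypothetical scores and labels purely for illustration:

    # Hypothetical classifier scores and true labels.
    scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
    labels <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)

    # Confusion-matrix counts at a chosen threshold.
    thresh <- 0.5
    pred   <- scores >= thresh
    TP <- sum(pred & labels == 1);  FP <- sum(pred & labels == 0)
    FN <- sum(!pred & labels == 1); TN <- sum(!pred & labels == 0)

    TPR <- TP / (TP + FN)
    FPR <- FP / (FP + TN)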

If we have M unique scores from the classification model, there are M + 1 possible choices of t. If we plot the True Positive Rate against the False Positive Rate over all choices of t, we get our Receiver Operating Characteristic (ROC) Curve.
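Continuing the sketch above, sweeping over all M + 1 thresholds traces out the curve:

    scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
    labels <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)

    # One threshold above every score, then one at each unique score.
    thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

    roc <- t(sapply(thresholds, function(th) {
      pred <- scores >= th
      c(FPR = sum(pred & labels == 0) / sum(labels == 0),
        TPR = sum(pred & labels == 1) / sum(labels == 1))
    }))

    plot(roc[, "FPR"], roc[, "TPR"], type = "l")   # the ROC curve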

Figure 1: ROC Curve

Figure 1 shows an example ROC. The blue line is our performance, the grey dotted line is an example of what we would see from random classification.

The Area Under the Curve (AUC) for this example would be the area under the blue line. The area under the grey line would be 0.5, hence why an AUC of 0.5 is random performance and an AUC of 1 is perfect classification.

Score Distributions

"Receiver Operating Characteristic" by kakau. Original uploader was Kku at de.wikipedia - Transferred from de.wikipedia; transferred to Commons by User:Jutta234 using CommonsHelper.(original text: Selbstgepinselt mit PowerPoint). Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Receiver_Operating_Characteristic.png#mediaviewer/File:Receiver_Operating_Characteristic.png
Figure 2: Moving the classification threshold

In the top-left of figure 2 we have a hypothetical score distribution of the output of a classifier. The two "clumps" of scores are the negative (blue) and positive (red) examples. You can imagine as we move the threshold line from left to right, we will change the values of the TP and FP and therefore the TPR and FPR.

How much the score distributions of the negatives and positives overlap determines our AUC. In figure 3 we show some different score distributions; the red distribution is the positive class, the blue the negative.

Figure 3: Examples of different score distributions

In the top example, the two distributions do not overlap, so we can pick a threshold with a TPR of 1 and an FPR of 0. In this case we would get an AUC of 1.

In the middle example, wherever we set the threshold, our TPR will be the same as our FPR. Hence we will see an AUC of 0.5.

In the bottom example, we have some overlap of the score distributions, so our AUC will be somewhere between 0.5 and 1.

The shape of the score distributions and amount of overlap will affect the shape of the ROC Curve and therefore the AUC we end with.

Implementation

The R pROC package provides functions for computing ROC curves and AUC, and for plotting.
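A minimal sketch of pROC usage, with the hypothetical scores and labels from earlier:

    # install.packages("pROC")
    library(pROC)

    labels <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)
    scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)

    roc_obj <- roc(response = labels, predictor = scores)
    auc(roc_obj)    # area under the curve
    plot(roc_obj)   # ROC plot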