
Testing for Normality

The most widely used distribution in statistical analysis is the normal distribution.  Often when performing some sort of data analysis, we need to answer the question:

Does this data sample look normally distributed?

There are a variety of statistical tests that can help us answer this question; however, we should first try to visualise the data.

Visualisation

Below is a simple piece of R code to create a histogram and normal Q-Q plot (a minimal sketch, using a randomly generated normal sample in place of real data):

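    # Draw a sample from a normal distribution (a stand-in for real data;
    # no seed is set, so exact numbers will vary between runs)
    x <- rnorm(100, mean = 0, sd = 1)

    # Histogram on the density scale, with the normal density overlaid
    # using the sample mean and standard deviation
    m <- mean(x)
    s <- sd(x)
    hist(x, freq = FALSE, main = "Histogram of x", xlab = "x")
    curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red")

    # Normal Q-Q plot with a reference line
    qqnorm(x)
    qqline(x, col = "red")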

On top of the histogram we have overlaid the normal probability density function, using the sample mean and standard deviation.

In the Q-Q plot, we are plotting the quantiles of the sample data against the quantiles of a standard normal distribution. We are looking to see a roughly linear relationship between the sample and the theoretical quantiles.

In this example, we would probably conclude, based on the graphs, that the data is roughly normally distributed (which is true in this case as we randomly generated it from a normal distribution).

It is worth bearing in mind that many data analysis techniques assume normality (linear regression, PCA, etc.) but remain useful even with some deviation from normality.

Statistical Tests for Normality

We can also use several tests for normality. The two most common are the Anderson-Darling test and the Shapiro-Wilk test.

The Null Hypothesis of both these tests is that the data is normally distributed.

Anderson-Darling Test

The test is based on the distance between the empirical distribution function (EDF) and the cumulative distribution function (CDF) of the hypothesised underlying distribution (e.g. normal). The statistic is a weighted sum of the squared differences between the EDF and the CDF, with more weight applied at the tails (making it better at detecting non-normality in the tails).

Below is the R code to perform this test (a minimal sketch, assuming the ad.test function from the nortest package and the sample x generated above):
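    # Anderson-Darling normality test (ad.test from the nortest package)
    library(nortest)
    ad.test(x)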

As the p-value is above the significance level, we fail to reject the null hypothesis (normality).

Shapiro-Wilk Test

The Shapiro-Wilk test is a regression-type test that uses the correlation of the sample order statistics (the sample values arranged in ascending order) with those of a normal distribution. Below is the R code to perform the test (shapiro.test is part of base R; again this uses the sample x generated above):
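    # Shapiro-Wilk normality test (shapiro.test is in base R's stats package)
    shapiro.test(x)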

Again, as the p-value is above the significance level, we fail to reject the null hypothesis (normality).

The Shapiro-Wilk test is slightly more powerful than the Anderson-Darling test, but R's implementation is limited to sample sizes of at most 5000.

Test Caution

While these tests can be useful, they are not infallible. I would recommend looking at the histogram and Q-Q plot first, and using the tests as an additional check.

In particular, very small and very large sample sizes can cause problems for the tests.

With a sample size of 10 or less, it is unlikely the test will detect non-normality (i.e. reject the null hypothesis) even if the distribution is truly non-normal.

With a large sample size of 1000 or more, a small deviation from normality (e.g. some noise in the sample) may be flagged as significant, causing the null hypothesis to be rejected.
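A quick sketch illustrating both effects (the distributions and sample sizes here are illustrative choices, and the exact p-values will vary between runs):

    # With only 10 observations, even clearly skewed data often fails to
    # be flagged as non-normal (low power)
    shapiro.test(rexp(10))

    # With thousands of observations, a small contamination of an otherwise
    # normal sample can produce a very small p-value
    shapiro.test(c(rnorm(4900), runif(100)))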

Overall, your two best tools are the histogram and the Q-Q plot, with the tests potentially used as additional indicators.

Pearson's Chi-Squared Test

Pearson's Chi-Squared test is a commonly used statistical test applied to categorical data. It has two major use-cases:

  • A goodness of fit test, measuring how well observed data fits a theoretical distribution.
  • A test of independence, assessing whether two categorical variables are independent of one another.

Procedure

The procedure for performing a Chi-Squared Test is as follows:

  1. Compute the Chi-Squared statistic \chi^2.
  2. Compute the degrees of freedom.
  3. Use the Chi-Squared distribution to compute the p-value.

Goodness of Fit

Imagine we have been rolling a dice and want to test if it is fair. We collect a table of rolls:

Side    Count
1       10
2       12
3       14
4       7
5       20
6       5

Our test statistic is calculated as:

\chi^2 = \sum^n_{i=1}\frac{(O_i - E_i)^2}{E_i}

where

O_i is the observed count

E_i is the expected count under the theoretical distribution

n is the number of categories

N is the total number of observations

In the dice case, n = 6 as we have six sides. As our theoretical distribution is that the dice is fair, E_i is simply \frac{N}{6}. A more complicated case might be if the theoretical distribution were a normal distribution; here we would need to compute the probability of each interval (e.g. P(a \lt x \leq b)) and multiply by N.
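As a rough sketch of that more complicated case (the data and bin edges below are purely illustrative), the expected counts can be computed from interval probabilities:

    # Sketch: expected counts when fitting a normal distribution
    samp <- rnorm(100)                                # illustrative data
    breaks <- c(-Inf, -1, 0, 1, Inf)                  # illustrative bin edges
    mu <- mean(samp)                                  # estimated mean
    sigma <- sd(samp)                                 # estimated standard deviation
    p <- diff(pnorm(breaks, mean = mu, sd = sigma))   # P(a < x <= b) for each bin
    expected <- p * length(samp)                      # multiply by N
    observed <- table(cut(samp, breaks))              # observed counts in the same bins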

The number of degrees of freedom is n - (s + 1), where s is the number of estimated parameters. For the dice example the degrees of freedom are 6 - (0 + 1) = 5, as no free parameters need to be estimated. If the theoretical distribution were normal with unknown mean and variance, the degrees of freedom would be n - (2 + 1); s is 2 because we would need to estimate the sample mean and variance.
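Returning to the dice example, a minimal sketch of the goodness of fit test using R's chisq.test function:

    # Observed counts for sides 1 to 6
    observed <- c(10, 12, 14, 7, 20, 5)

    # Under the null hypothesis of a fair dice, each side has probability 1/6
    chisq.test(observed, p = rep(1/6, 6))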

Test of Independence

The most common use-case of the Chi-Squared test is testing if there is an association between two categorical variables. Some examples are:

  • Is there a relationship between gender and the party you vote for?
  • Is there a relationship between a landing page version and conversions (A/B testing)?

One important assumption is that the groups must be independent, e.g. you can only vote for one party, and you are only shown one landing page.

To begin we form a contingency table, such as:

        Democrat    Republican
Male    20          30
Female  30          20

This example is a 2x2 table, but we can also have tables of 3x2, 3x3 etc.

If we have r rows and c columns, the expected frequency for cell (i, j) is:

E_{i,j}= \frac{(\sum_{k=1}^cO_{i,k})(\sum_{k=1}^rO_{k,j})}{N}

i.e. Row Total * Column Total / N.

The test statistic is then:

\chi^2=\sum^r_{i=1}\sum^c_{j=1}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

The number of degrees of freedom is (r - 1)(c - 1).

Our null hypothesis is that the variables (gender and party, for instance) are independent. Our alternative hypothesis is that there is an association (the test does not tell us the direction or nature of the association). If the p-value is below the significance level, we reject the null hypothesis, and the evidence suggests there is a relationship between the two variables.
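A minimal sketch of this test in R, using the contingency table above:

    # Gender vs party contingency table from above
    tbl <- matrix(c(20, 30,
                    30, 20),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Gender = c("Male", "Female"),
                                  Party  = c("Democrat", "Republican")))

    chisq.test(tbl)                    # 2x2 tables apply Yates' correction by default
    chisq.test(tbl, correct = FALSE)   # the same test without the continuity correction
    chisq.test(tbl)$expected           # the expected frequencies E_{i,j}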

Small Cell Counts

The test assumes that the test statistic is chi-squared distributed; this is only true asymptotically (i.e. in the limit of a large sample).

The assumption can break down when a cell has a very small count. Often, if an expected cell count is less than 5 (sometimes 10), Yates' correction for continuity is applied. The effect of this is to make the test more conservative, but it also increases the likelihood of a type II error.

If the cell counts are less than 10 and it is a 2x2 contingency table, an alternative is to apply Fisher's exact test. The test can also be applied to general r x c contingency tables using Monte Carlo simulation.
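A minimal sketch, reusing the 2x2 table defined above:

    # Fisher's exact test on the 2x2 table defined earlier
    fisher.test(tbl)

    # For larger r x c tables, a Monte Carlo p-value can be requested, e.g.
    # fisher.test(larger_tbl, simulate.p.value = TRUE, B = 10000)
    # (larger_tbl here is a hypothetical r x c table)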

Implementation

The chisq.test function in R implements the Chi-Squared test, including the ability to use Yates' correction.

The fisher.test function implements Fisher's exact test.