# Testing for Normality

The normal distribution is the most widely used distribution in statistical analysis. When analysing data, we often need to answer the question:

Does this data sample look normally distributed?

There are a variety of statistical tests that can help us answer this question, but first we should try to visualise the data.

# Visualisation

Below is a simple piece of R code to create a histogram and normal Q-Q plot:
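A minimal sketch of such a script, assuming a sample of 200 draws from a normal distribution. The seed, sample size, mean, standard deviation and the name `sample_data` are illustrative choices, not values from the original analysis.

```r
# Illustrative sample (assumption): 200 draws from N(10, 2^2)
set.seed(42)
sample_data <- rnorm(200, mean = 10, sd = 2)

# Compute the sample statistics up front, because inside curve() the
# symbol `x` refers to the plotting grid rather than to our sample
sample_mean <- mean(sample_data)
sample_sd <- sd(sample_data)

# Draw the two plots side by side
old_par <- par(mfrow = c(1, 2))

# Histogram on the density scale so the normal curve is comparable
hist(sample_data, freq = FALSE, main = "Histogram", xlab = "Sample value")
curve(dnorm(x, mean = sample_mean, sd = sample_sd),
      add = TRUE, col = "red", lwd = 2)

# Normal Q-Q plot with a reference line through the quartiles
qqnorm(sample_data)
qqline(sample_data, col = "red")

# Restore the previous plotting layout
par(old_par)
```

Setting `freq = FALSE` matters here: with raw counts on the y-axis, the density curve would sit almost flat along the bottom of the plot.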

On top of the histogram we have overlaid the normal probability density function, using the sample mean and standard deviation.

In the Q-Q plot, we are plotting the quantiles of the sample data against the quantiles of a standard normal distribution. We are looking to see a roughly linear relationship between the sample and the theoretical quantiles.

In this example, we would probably conclude, based on the graphs, that the data is roughly normally distributed (which is true in this case as we randomly generated it from a normal distribution).

It is worth bearing in mind that many data analysis techniques assume normality (linear regression, PCA, etc.) but remain useful even with some deviation from normality.

# Statistical Tests for Normality

We can also use several tests for normality. The two most common are the Anderson-Darling test and the Shapiro-Wilk test.

The null hypothesis of both tests is that the data is normally distributed.

## Anderson-Darling Test

The test is based on the distance between the empirical distribution function (EDF) of the sample and the cumulative distribution function (CDF) of the hypothesised distribution (e.g. normal). The statistic is a weighted sum of the squared differences between the EDF and the CDF, with more weight applied at the tails, which makes it better at detecting non-normality in the tails.
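For reference, the statistic can be written in its standard integral form. The symbols are notation introduced here: $F_n$ is the EDF of a sample of size $n$ and $F$ is the hypothesised CDF.

$$
A^2 = n \int_{-\infty}^{\infty} \frac{\left(F_n(x) - F(x)\right)^2}{F(x)\,\left(1 - F(x)\right)} \, dF(x)
$$

The weight $1 / \left[F(x)\,(1 - F(x))\right]$ grows large where $F(x)$ is close to 0 or 1, which is what gives the tails their extra influence.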

Below is the R code to perform this test:
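A minimal sketch, assuming the `nortest` package is installed (it provides `ad.test`) and using an illustrative normal sample. The seed, sample size and variable names are assumptions, and the exact p-value will depend on them.

```r
# install.packages("nortest")  # one-off step if the package is missing
library(nortest)

# Illustrative sample (assumption): 200 draws from N(10, 2^2)
set.seed(42)
sample_data <- rnorm(200, mean = 10, sd = 2)

# Anderson-Darling test for normality
ad_result <- ad.test(sample_data)
print(ad_result)

# The p-value can be extracted for use in further code
ad_p_value <- ad_result$p.value
```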

As the p-value is above the significance level, we would fail to reject the null hypothesis of normality.

## Shapiro-Wilk Test

The Shapiro-Wilk test is a regression-type test that uses the correlation of sample order statistics (the sample values arranged in ascending order) with those of a normal distribution. Below is the R code to perform the test:
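A minimal sketch using `shapiro.test` from base R's `stats` package on an illustrative normal sample. The seed, sample size and variable names are assumptions, and the exact p-value will depend on them.

```r
# Illustrative sample (assumption): 200 draws from N(10, 2^2)
set.seed(42)
sample_data <- rnorm(200, mean = 10, sd = 2)

# Shapiro-Wilk test for normality
sw_result <- shapiro.test(sample_data)
print(sw_result)

# The p-value can be extracted for use in further code
sw_p_value <- sw_result$p.value
```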

Again, as the p-value is above the significance level, we would fail to reject the null hypothesis of normality.

The Shapiro-Wilk test is slightly more powerful than the Anderson-Darling test, but R's `shapiro.test` only accepts samples of between 3 and 5000 observations.

# Test Caution

While these tests can be useful, they are not infallible. I would recommend looking at the histogram and Q-Q plot first, and using the tests as an additional check.

In particular, small and large sample sizes can cause problems for the tests.

With a sample size of 10 or fewer, the tests are unlikely to detect non-normality (i.e. reject the null) even if the distribution is truly non-normal.

With a large sample size of 1000 or more, a small deviation from normality (some noise in the sample) may be flagged as significant, leading the test to reject the null.
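The effect of sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the data come from a t-distribution with 10 degrees of freedom (a mildly heavy-tailed, non-normal distribution chosen for illustration), and the helper `rejection_rate`, the 200 repetitions and the 5% level are all hypothetical choices.

```r
set.seed(1)

# Fraction of simulated samples of size n for which the Shapiro-Wilk
# test rejects normality at level alpha. The data are drawn from a
# t-distribution with 10 degrees of freedom, so they are never
# exactly normal.
rejection_rate <- function(n, reps = 200, alpha = 0.05) {
  p_values <- replicate(reps, shapiro.test(rt(n, df = 10))$p.value)
  mean(p_values < alpha)
}

rate_small <- rejection_rate(10)    # small samples
rate_large <- rejection_rate(5000)  # largest size shapiro.test accepts

cat("Rejection rate at n = 10:  ", rate_small, "\n")
cat("Rejection rate at n = 5000:", rate_large, "\n")
```

The underlying distribution is the same in both cases; only the sample size changes, yet the test rejects far more often on the large samples.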

Overall, your two best tools are the histogram and the Q-Q plot, with the tests serving as additional indicators.