Tag Archives: p-value

Random Permutation Tests

At the 2014 Strata + Hadoop conference John Rauser gave a great keynote title "Statistics Without the Agonizing Pain".  It is probably worth watching before reading the rest of this article, in it he introduces the concept of Random Permutation Tests.

"Classic" statistical tests usually make some sort of assumption about the distribution of the data e.g. normally distribution data . Are these assumptions always true? Probably not, but they are often approximately close enough to give you a useful result. By making these assumptions, these tests are called parametric.

Random Permutation Tests make no assumptions on the underlying distribution of the data. They are considered non-parametric tests. This can be extremely useful when:

  • Your data just doesn't seem to fit the distribution the classic statistical test assumes. For instance, perhaps it is bi-modal and the test assumes normality.
  • You have outliers e.g. users who spend significantly more than others.
  • You have a small sample size.

Random Permutation Tests can be used in almost any setting where you would compute a p-value. In this article I will focus on there use in experimental studies, you want to see if there is a difference between two treatment groups (A/B Tests, medical studies, etc.)


The essential idea behind random permutation tests is:

  1. Compute a test statistic between two (or more) groups. This could be the difference between two proportions, the difference between the means of the two groups etc.
  2. Now randomly shuffle the data assigned to each group.
  3. Measure the test statistic again on the shuffled data.
  4. Repeat 2 and 3 many times
  5. Look at where the test statistic from 1 falls in the distribution of test statistics from 2-4.

We have used steps 2-4 to empirically estimate the sampling distribution of the test statistic. From this distribution you can compute the p-value for your observed test statistic.


Let's imagine we want to add a new widget on our checkout page of our e-commerce site to upsell products to a user.

The question we want to answer is, does adding the widget increase our revenue?

We run an A/B test with:

  • Original checkout page
  • Checkout page with widget

We know how much each user spent and what variant they have been given.

Lets generate some example transaction data in R:

Figure 1. Hypothetical results of the A/B test

We have randomly sampled the data from a log-normal distribution with equal mean and variance. We set the seed to ensure the results are repeatable. So in this case, we are looking to find there is no significant difference between the two datasets.

The classic statistical approach here would be to use a t-test. Let's instead apply our random permutation test.

First, let's compute the difference between the means of the two groups:

This gives a difference of 193.47. How likely is this to have happened by chance?

What we want to do is randomly shuffle our data between the two groups. If we were to sample (without replacement) once and compute the difference using the randomly shuffled version of groups:

This gave me a difference of 45.69. Now we will repeat this many times:

If we plot the examples, along with where our observed difference falls:

Figure 2. The histogram produced by randomly re-shuffling the group labels. Black line shows the observed data

Straight away, it is fairly clear this observation could  just be due to random chance. It is not on the very extreme of the distribution. However let's compute a p-value

On the two-tailed test, we get a p-value of 0.102, so we would accept the null hypothesis of there being no difference. Notice the add one here on both the denominator and numerator. Essentially we are adding our original measured test statistic to the random permutations. This ensures we never get a zero probability.

The standard t-test would give a p-value of 0.0985, so roughly similar. However, what if we didn't care about the mean, but the median transaction value? Using random permutation tests, this is very simple to compute (a simple change to our code). Under classic statistical tests, we would have to go off and find the exact test we need to use under those conditions.

Speed Improvements

On simple speed improvement is to parallelise the loop to compute the re-samples:

Coin Package

As usual, R already has a package to help us do all of this. Using the same data as before we would run:

Generally you will find this is much faster for running large numbers of iterations.

One downside is you don't get the visualisation of how extreme the observed data is compared to the empirical sampled histogram (Figure 2). I find this graph extremely useful when explaining how extreme a result appears to be, extremely to a non-statistical audience


Random permutation tests are a nice alternative to classic hypothesis tests. In many cases they will give you almost exactly the same results. Being able to visualise the distribution (Figure 2) can be a massive assistance in explaining the p-value.

Overall the main advantages are:

  • Almost no assumptions on the underlying dataset being analysed
  • Can be used for any test statistic (either it is implemented in coin or can be programmed yourself).
  • Can be applied to all sorts of data types (numerical, ordinal, categorical) without having to remember the exact parametric test you should use.

Disadvantages can be:

  • Computing large number of re-samples is potentially slow. Although on modern computers this is less of a concern
  • Relies on the null hypothesis, that there is no association between the dataset and so the group labels are interchangeable under the null hypothesis.

Personally, I like to use both classic statistical tests and random permutation tests, even if all they do is validate one another.