
Random Permutation Tests

At the 2014 Strata + Hadoop conference, John Rauser gave a great keynote titled "Statistics Without the Agonizing Pain". It is probably worth watching before reading the rest of this article; in it he introduces the concept of Random Permutation Tests.

"Classic" statistical tests usually make some sort of assumption about the distribution of the data, e.g. that the data are normally distributed. Are these assumptions always true? Probably not, but they are often close enough to give you a useful result. Because they make these assumptions, these tests are called parametric.

Random Permutation Tests make no assumptions about the form of the underlying distribution of the data. They are considered non-parametric tests. This can be extremely useful when:

  • Your data just doesn't seem to fit the distribution the classic statistical test assumes. For instance, perhaps it is bi-modal and the test assumes normality.
  • You have outliers e.g. users who spend significantly more than others.
  • You have a small sample size.

Random Permutation Tests can be used in almost any setting where you would compute a p-value. In this article I will focus on their use in experimental studies, where you want to see if there is a difference between two treatment groups (A/B tests, medical studies, etc.).

Overview

The essential idea behind random permutation tests is:

  1. Compute a test statistic between two (or more) groups. This could be the difference between two proportions, the difference between the means of the two groups etc.
  2. Now randomly shuffle the data assigned to each group.
  3. Measure the test statistic again on the shuffled data.
  4. Repeat 2 and 3 many times.
  5. Look at where the test statistic from 1 falls in the distribution of test statistics from 2-4.

We have used steps 2-4 to empirically estimate the sampling distribution of the test statistic. From this distribution you can compute the p-value for your observed test statistic.

Example

Let's imagine we want to add a new widget to the checkout page of our e-commerce site to upsell products to a user.

The question we want to answer is, does adding the widget increase our revenue?

We run an A/B test with:

  • Original checkout page
  • Checkout page with widget

We know how much each user spent and what variant they have been given.

Let's generate some example transaction data in R:
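As a minimal sketch of how such data could be generated: the seed, the group sizes and the log-normal parameters below are my own illustrative assumptions, so the figures this produces will not match the ones quoted in this article.

```r
# Assumed values: seed, 500 users per variant, and the log-normal parameters.
set.seed(42)
n_per_group <- 500

# Both variants are drawn from the same log-normal distribution,
# so any difference between them is down to chance.
transactions <- data.frame(
  group = factor(rep(c("A", "B"), each = n_per_group)),
  spend = c(rlnorm(n_per_group, meanlog = 5, sdlog = 1.5),
            rlnorm(n_per_group, meanlog = 5, sdlog = 1.5))
)

# Mean spend per variant
aggregate(spend ~ group, data = transactions, FUN = mean)
```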

Figure 1. Hypothetical results of the A/B test

We have randomly sampled the data for both groups from the same log-normal distribution, so the two groups share the same underlying mean and variance. We set the seed to ensure the results are repeatable. So in this case, we expect to find no significant difference between the two datasets.

The classic statistical approach here would be to use a t-test. Let's instead apply our random permutation test.

First, let's compute the difference between the means of the two groups:
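A sketch of this step, again on assumed data (the seed and log-normal parameters are illustrative, so the value printed will differ from the one quoted in this article):

```r
# Assumed example data: two groups drawn from the same log-normal distribution.
set.seed(42)
spend_a <- rlnorm(500, meanlog = 5, sdlog = 1.5)  # original checkout page
spend_b <- rlnorm(500, meanlog = 5, sdlog = 1.5)  # checkout page with widget

# Test statistic: difference between the group means
observed_diff <- mean(spend_b) - mean(spend_a)
observed_diff
```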

This gives a difference of 193.47. How likely is this to have happened by chance?

What we want to do is randomly shuffle our data between the two groups. If we were to sample (without replacement) once and compute the difference using the randomly shuffled version of groups:
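One way to sketch a single shuffle is to permute the group labels with sample(), which samples without replacement by default. The data setup is assumed, as before, so the printed value is only illustrative.

```r
# Assumed example data
set.seed(42)
spend <- c(rlnorm(500, meanlog = 5, sdlog = 1.5),
           rlnorm(500, meanlog = 5, sdlog = 1.5))
group <- rep(c("A", "B"), each = 500)

# Permute the labels: every spend value keeps its place, but the
# group it is assigned to is now random.
shuffled <- sample(group)

shuffled_diff <- mean(spend[shuffled == "B"]) - mean(spend[shuffled == "A"])
shuffled_diff
```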

This gave me a difference of 45.69. Now we will repeat this many times:
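Repeating the shuffle could look something like the sketch below. The helper name mean_diff and the choice of 10,000 permutations are assumptions for illustration.

```r
# Assumed example data
set.seed(42)
spend <- c(rlnorm(500, meanlog = 5, sdlog = 1.5),
           rlnorm(500, meanlog = 5, sdlog = 1.5))
group <- rep(c("A", "B"), each = 500)

# Difference in means for a given assignment of labels (hypothetical helper)
mean_diff <- function(labels) {
  mean(spend[labels == "B"]) - mean(spend[labels == "A"])
}

# Shuffle the labels and recompute the statistic many times
n_perm <- 10000
perm_diffs <- replicate(n_perm, mean_diff(sample(group)))

summary(perm_diffs)
```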

If we plot the resampled differences, along with where our observed difference falls:
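A plot along these lines can be sketched with base graphics. The data, number of permutations and plot labels are all assumptions, so the picture will not be identical to Figure 2.

```r
# Assumed example data and resampling setup
set.seed(42)
spend <- c(rlnorm(500, meanlog = 5, sdlog = 1.5),
           rlnorm(500, meanlog = 5, sdlog = 1.5))
group <- rep(c("A", "B"), each = 500)
mean_diff <- function(labels) {
  mean(spend[labels == "B"]) - mean(spend[labels == "A"])
}
observed_diff <- mean_diff(group)
perm_diffs <- replicate(5000, mean_diff(sample(group)))

# Histogram of the permuted differences, with the observed one marked
hist(perm_diffs, breaks = 50,
     main = "Differences under shuffled labels",
     xlab = "Difference in mean spend (B - A)")
abline(v = observed_diff, lwd = 2)
```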

Figure 2. The histogram produced by randomly re-shuffling the group labels. The black line shows the observed difference.

Straight away, it is fairly clear this observation could just be due to random chance. It is not at the very extreme of the distribution. However, let's compute a p-value:
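A two-tailed p-value can be sketched as the share of permuted differences that are at least as large in absolute value as the observed one. The data setup is assumed, so the result will not reproduce the figure quoted in this article.

```r
# Assumed example data and resampling setup
set.seed(42)
spend <- c(rlnorm(500, meanlog = 5, sdlog = 1.5),
           rlnorm(500, meanlog = 5, sdlog = 1.5))
group <- rep(c("A", "B"), each = 500)
mean_diff <- function(labels) {
  mean(spend[labels == "B"]) - mean(spend[labels == "A"])
}
observed_diff <- mean_diff(group)
n_perm <- 10000
perm_diffs <- replicate(n_perm, mean_diff(sample(group)))

# Two-tailed: count permutations at least as extreme in either direction.
# The +1 in numerator and denominator counts the observed statistic itself.
p_value <- (sum(abs(perm_diffs) >= abs(observed_diff)) + 1) / (n_perm + 1)
p_value
```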

For the two-tailed test, we get a p-value of 0.102, so we fail to reject the null hypothesis of there being no difference. Notice the one added to both the numerator and the denominator. Essentially we are adding our original measured test statistic to the random permutations. This ensures we never get a zero probability.

The standard t-test would give a p-value of 0.0985, so roughly similar. However, what if we didn't care about the mean, but the median transaction value? Using random permutation tests, this is very simple to compute (a simple change to our code). Under classic statistical tests, we would have to go off and find the exact test we need to use under those conditions.
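To illustrate how small that change is, here is a sketch where the statistic is passed in as a function. The function name perm_test and its arguments are hypothetical, and the data are assumed as before.

```r
# Hypothetical helper: permutation test for the difference in any
# summary statistic between groups "A" and "B".
perm_test <- function(values, labels, statistic = mean, n_perm = 5000) {
  stat_diff <- function(l) {
    statistic(values[l == "B"]) - statistic(values[l == "A"])
  }
  observed <- stat_diff(labels)
  resampled <- replicate(n_perm, stat_diff(sample(labels)))
  p <- (sum(abs(resampled) >= abs(observed)) + 1) / (n_perm + 1)
  list(observed = observed, p_value = p)
}

# Assumed example data
set.seed(42)
spend <- c(rlnorm(500, meanlog = 5, sdlog = 1.5),
           rlnorm(500, meanlog = 5, sdlog = 1.5))
group <- rep(c("A", "B"), each = 500)

# Swapping the mean for the median is a one-argument change
median_result <- perm_test(spend, group, statistic = median)
median_result
```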

Speed Improvements

One simple speed improvement is to parallelise the loop that computes the re-samples:
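One possible way to do this is with the parallel package that ships with R. This is a sketch rather than the only approach: mclapply relies on forking, which is not available on Windows, so the code falls back to a single core there. The number of cores and the data setup are assumptions.

```r
library(parallel)

# RNG kind that gives forked workers independent random streams
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Assumed example data
spend <- c(rlnorm(500, meanlog = 5, sdlog = 1.5),
           rlnorm(500, meanlog = 5, sdlog = 1.5))
group <- rep(c("A", "B"), each = 500)
mean_diff <- function(labels) {
  mean(spend[labels == "B"]) - mean(spend[labels == "A"])
}

n_perm <- 10000
# Forking is unsupported on Windows, so use one core there (assumed: 2 elsewhere)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

perm_diffs <- unlist(mclapply(
  seq_len(n_perm),
  function(i) mean_diff(sample(group)),
  mc.cores = n_cores
))
length(perm_diffs)
```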

Coin Package

As usual, R already has a package to help us do all of this. Using the same data as before we would run:
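A sketch of what this might look like with coin's oneway_test, using an approximate (Monte Carlo) null distribution. The data are assumed as before. The nresample argument is what recent versions of coin use, and older versions named it differently, so check the documentation for your installed version.

```r
# Assumed example data, with the group as a factor
set.seed(42)
transactions <- data.frame(
  group = factor(rep(c("A", "B"), each = 500)),
  spend = c(rlnorm(500, meanlog = 5, sdlog = 1.5),
            rlnorm(500, meanlog = 5, sdlog = 1.5))
)

coin_p <- NA_real_
if (requireNamespace("coin", quietly = TRUE)) {
  # Permutation test for a difference in location between the groups
  result <- coin::oneway_test(
    spend ~ group, data = transactions,
    distribution = coin::approximate(nresample = 10000)
  )
  print(result)
  coin_p <- as.numeric(coin::pvalue(result))
}
coin_p
```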

Generally you will find this is much faster for running large numbers of iterations.

One downside is you don't get the visualisation of how extreme the observed data is compared to the empirically sampled histogram (Figure 2). I find this graph extremely useful when explaining how extreme a result appears to be, especially to a non-statistical audience.

Summary

Random permutation tests are a nice alternative to classic hypothesis tests. In many cases they will give you almost exactly the same results. Being able to visualise the distribution (Figure 2) can be a massive help in explaining the p-value.

Overall the main advantages are:

  • Almost no assumptions on the underlying dataset being analysed
  • Can be used for any test statistic (either it is implemented in coin or you can program it yourself).
  • Can be applied to all sorts of data types (numerical, ordinal, categorical) without having to remember the exact parametric test you should use.

Disadvantages can be:

  • Computing a large number of re-samples is potentially slow, although on modern computers this is less of a concern.
  • Relies on the group labels being interchangeable under the null hypothesis, i.e. that there is no association between the data and the group it was assigned to.

Personally, I like to use both classic statistical tests and random permutation tests, even if all they do is validate one another.

 

Significance Testing and Sample Size

A common question when analysing any sort of online data:

Is this result statistically significant?

Examples of where I have had to answer this question:

  • A/B Testing
  • Survey Analysis

In this post, we will explore ways to answer this question and determine the sample sizes we need.

Note: This post was heavily influenced by Marketing Distillery's excellent A/B Tests in Marketing.

Terminology

First let us define some standard statistical terminology:

  • We are deciding between two hypotheses - the null hypothesis and the alternative hypothesis.
  • The null hypothesis is the default assumption that there is no relationship between two values.  For instance, if comparing conversion rates of two landing pages, the null hypothesis is that the conversion rates are the same.
  • The alternative hypothesis is what we assume if the null hypothesis is rejected. It can be two-tailed, the conversion rates are not the same, or one-tailed, where we state the direction of the inequality.
  • A result is determined to be statistically significant if the observed effect was unlikely to have been seen by random chance.
  • How unlikely is determined by the significance level. Commonly it is set to 5%. It is the probability of rejecting the null hypothesis when it is true. This probability is denoted by \alpha

\alpha = P(rejecting\ null\ hypothesis\ |\ null\ hypothesis\ is\ true)

  • Rejecting the null hypothesis when the null hypothesis is true, is called a type I error.
  • A type II error occurs when the null hypothesis is false, but we erroneously fail to reject it. The probability of this is denoted by \beta

\beta = P(fail\ to\ reject\ null\ hypothesis\ |\ null\ hypothesis\ is\ false)

                               | Null hypothesis (H0) is true    | Null hypothesis (H0) is false
Reject null hypothesis         | Type I error (false positive)   | Correct outcome (true positive)
Fail to reject null hypothesis | Correct outcome (true negative) | Type II error (false negative)

  • The power of a test is 1-\beta. Commonly this is set to 90%.
  • The p-value is calculated from the test statistic. It is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true.
  • To reject the null hypothesis we need a p-value that is lower than the selected significance level.

Types of Data

Usually we have two types of data we want to perform significance tests with:

  1. Proportions e.g. Conversion Rates, Percentages
  2. Real valued numbers e.g. Average Order Values, Average Time on Page

In this post, we will look at both.

Tests for Proportions

A common scenario is we have run an A/B test of two landing pages and we wish to test if the conversion rate is significantly different between the two.

An important assumption here is that the two groups are mutually exclusive e.g. you can only have been shown one of the landing pages.

The null hypothesis is that the proportions are equal in both groups.

The test can be performed in R using:
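As an illustration, here is a sketch of a call to prop.test. The conversion counts are made up for the example (120 and 145 conversions from 2,000 visitors per page) and are not taken from any real test.

```r
# Hypothetical A/B test counts
conversions <- c(A = 120, B = 145)
visitors    <- c(A = 2000, B = 2000)

# Two-sided test of equal proportions, with continuity correction (the default)
result <- prop.test(x = conversions, n = visitors,
                    alternative = "two.sided",
                    conf.level = 0.95, correct = TRUE)
result

# The same data as a 2x2 matrix of successes and failures
counts <- cbind(success = conversions, failure = visitors - conversions)
result_matrix <- prop.test(counts)
```

With these made-up counts the p-value comes out above 0.05.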

Under the hood this is performing a Pearson's Chi-Squared Test.

Regarding the parameters:

x - Is usually a 2x2 matrix giving the counts of successes and failures in each group (conversions and non-conversions for instance).

n - Number of trials performed in each group. Can be left NULL if you provide x as a matrix.

p - Only if you want to test for equality against a specific proportion (e.g. 0.5).

alternative - Generally only used when testing against a specific p. Changes the underlying test to use a z-test. See Two-Tailed Test of Population Proportion for more details. The z-test may not be a good assumption with small sample sizes.

conf.level - Used only in the calculation of a confidence interval. Not used as part of the actual test.

correct - Usually safe to apply continuity correction. See my previous post for more details on the correction.

The important part of the output of prop.test is the p-value. If this is less than your desired significance level (say 0.05) you reject the null hypothesis.

In this example, the p-value is not less than our desired significance level (0.05) so we cannot reject the null hypothesis.

Before running your test, you should fix your sample size in each group in advance. The R library pwr has various functions for helping you do this. The function pwr.2p.test can be used for this:
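A sketch of such a call follows. The effect size h = 0.1 and the 90% power are assumed values chosen for illustration, and the code is skipped if pwr is not installed.

```r
plan_n <- NA_real_
if (requireNamespace("pwr", quietly = TRUE)) {
  # n is left out, so the function solves for the sample size per group
  plan <- pwr::pwr.2p.test(h = 0.1,           # assumed minimum effect size
                           sig.level = 0.05,
                           power = 0.9,
                           alternative = "two.sided")
  print(plan)
  plan_n <- ceiling(plan$n)  # round up to whole users per group
}
plan_n
```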

Any one of the parameters can be left blank and the function will estimate its value. For instance, leaving n, the sample size, blank will mean the function will compute the desired sample size.

The only new parameter here is h. This is the minimum effect size you wish to be able to detect. h is calculated as the difference of the arcsine transformation of two proportions:

h = 2 \arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})

Assuming you have an idea of the base rate proportion (e.g. your current conversion rate) and the minimum change you want to detect, you can use the following R code to calculate h:
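The formula translates directly into base R. In this sketch the function name effect_size_h and the two rates (a 6% base rate and a 7% target) are assumptions for illustration. The pwr package also provides ES.h for the same calculation.

```r
# Hypothetical helper implementing h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
effect_size_h <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

base_rate   <- 0.06  # assumed current conversion rate
target_rate <- 0.07  # assumed smallest rate worth detecting

h <- effect_size_h(target_rate, base_rate)
h
```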

Tests for Real Values

Now imagine we want to test if a change to an eCommerce web page increases the average order value.

The major assumption we are going to make is that the data we are analysing is normally distributed. See my previous post on how to check if the data is normally distributed.

It may be possible to transform the data to a normal distribution, for instance if the data is log-normal. Time on page often appears to fit a log-normal distribution; in this case you can just take the log of the times on page.

The test we will run is the two-sample t-test. We are testing if the means of the two groups are significantly different.

The parameters:

x - The samples in one group

y - The samples in the other group. Leave NULL for a single sample test

alternative - Perform two-tailed or one-tailed test?

mu - The hypothesised value of the mean (or of the difference in means for a two-sample test) under the null hypothesis. Defaults to 0.

paired - TRUE for the paired t-test, used for data where the two samples have a natural partner in each set, for instance comparing the weights of people before and after a diet.

var.equal - Assume equal variance in the two groups or not

conf.level - Used in calculating the confidence interval around the means. Not part of the actual test.

The test can be run using:
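Here is a small sketch using t.test with made-up order values. The numbers are purely illustrative and deliberately tiny so the result is easy to check by hand.

```r
# Hypothetical order values for the two page versions
orders_a <- c(10, 12, 14, 16, 18)
orders_b <- c(11, 13, 15, 17, 19)

# Two-sample, two-tailed t-test without assuming equal variances (Welch)
result <- t.test(x = orders_a, y = orders_b,
                 alternative = "two.sided",
                 mu = 0, paired = FALSE,
                 var.equal = FALSE, conf.level = 0.95)
result
```

For this toy data the group means are 14 and 15 and the t statistic is -0.5, so the p-value is far above 0.05.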

In this example, our p-value is greater than the 0.05 significance level, so we cannot reject the null hypothesis.

As before, before running the experiment we should set the sample size required. Using the pwr library we can use:
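A sketch of the sample size calculation with pwr.t.test. The effect size d = 0.2 and the 90% power are assumed values, and the code is skipped if pwr is not installed.

```r
plan_n <- NA_real_
if (requireNamespace("pwr", quietly = TRUE)) {
  # n is left out, so the function solves for the sample size per group
  plan <- pwr::pwr.t.test(d = 0.2,            # assumed effect size
                          sig.level = 0.05,
                          power = 0.9,
                          type = "two.sample",
                          alternative = "two.sided")
  print(plan)
  plan_n <- ceiling(plan$n)  # round up to whole observations per group
}
plan_n
```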

d is the effect size. For t-tests the effect size is assessed as:

d = \frac{|\mu_1 - \mu_2|}{\sigma}

where \mu_1 is the mean of group 1, \mu_2 is the mean of group 2 and \sigma is the pooled standard deviation.

Ideally you should set d based on your problem domain, quantifying the effect size you expect to see. If this is not possible, Cohen's book on power analysis "Statistical Power Analysis for the Behavioral Sciences", suggests setting d to be 0.2, 0.5 and 0.8 for small, medium and large effect sizes respectively.

Pearson's Chi-Squared Test

The Pearson's Chi-Squared Test is a commonly used statistical test applied to categorical data. It has two major use-cases:

  • A goodness of fit test measuring how well observed data fits a theoretical distribution.
  • A test of independence assessing how likely two categorical variables are independent of one another. 

Procedure

The procedure for performing a Chi-Squared Test is as follows:

  1. Compute the Chi-Squared statistic \chi^2.
  2. Compute the degrees of freedom.
  3. Use the Chi-Squared distribution to compute the p-value.

Goodness of fit

Imagine we have been rolling a dice and want to test if it is fair. We collect a table of rolls:

Side | Count
1    | 10
2    | 12
3    | 14
4    | 7
5    | 20
6    | 5

Our test statistic is calculated as:

\chi^2 = \sum^n_{i=1}\frac{(O_i - E_i)^2}{E_i}

where

O_i is the observed count

E_i is the expected count under the theoretical distribution

n is the number of categories

N is total observations

In the dice case, n=6 as we have six sides. As our theoretical distribution is that the dice is fair, E_i is simply \frac{N}{6}. A more complicated case might be if the theoretical distribution was a normal distribution. Here we would need to compute the probability of falling in each of various intervals (e.g. P(a \lt x \leq b)) and multiply by N.

The degrees of freedom is n - (s + 1) where s is the number of estimated parameters. For the dice example the degrees of freedom is 5, as no parameters need to be estimated (s = 0). If the theoretical distribution was normal with unknown mean and variance, the degrees of freedom would be n-(2 + 1); s is 2 as we would need to estimate the mean and variance from the sample.
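The dice example can be worked through by hand in R, following the procedure above, and checked against chisq.test. The variable names are my own.

```r
# Observed counts for sides 1 to 6, from the table of rolls
observed <- c(10, 12, 14, 7, 20, 5)

N <- sum(observed)               # total observations
n_categories <- length(observed)

# Fair dice: every side is expected N / 6 times
expected <- rep(N / n_categories, n_categories)

# 1. Chi-squared statistic
chi_sq <- sum((observed - expected)^2 / expected)

# 2. Degrees of freedom: n - (s + 1), with s = 0 estimated parameters
dof <- n_categories - (0 + 1)

# 3. p-value from the upper tail of the chi-squared distribution
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

c(chi_sq = chi_sq, dof = dof, p_value = p_value)

# The built-in test assumes equal probabilities by default
builtin <- chisq.test(observed)
```

For these counts the statistic is about 12.65 on 5 degrees of freedom, which puts the p-value between 0.025 and 0.05, so at the 5% level we would reject the hypothesis that the dice is fair.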

Test of Independence

The most common use-case of the Chi-Squared test is for testing if there is an association between two categorical variables. Some examples are:

  • Is there a relationship between gender and the party you vote for?
  • Is there a relationship between a landing page version and conversions (A/B testing)

One important assumption is that the groups must be independent, e.g. you can only vote for one party, you are only shown one landing page.

To begin we form a contingency table, such as:

       | Democrat | Republican
Male   | 20       | 30
Female | 30       | 20

This example is a 2x2 table, but we can also have tables of 3x2, 3x3 etc.

If we have r rows and c columns the expected frequency is:

E_{i,j}= \frac{(\sum_{k=1}^cO_{i,k})(\sum_{k=1}^rO_{k,j})}{N}

i.e. Row Total * Column Total / N.

The test statistic is then:

\chi^2=\sum^r_{i=1}\sum^c_{j=1}\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

The number of degrees of freedom is (r - 1)(c - 1).

Our null hypothesis is the variables (gender and party for instance) are independent. Our alternative hypothesis is that they have an association (but not the direction of the association). If the p-value is below the significance level, we reject the null hypothesis and the evidence suggests there is a relationship between the two variables.
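The formulas above can be applied to the example contingency table directly. This is a sketch with my own variable names, and it computes the uncorrected statistic.

```r
# The example contingency table
observed <- matrix(c(20, 30,
                     30, 20),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("Male", "Female"),
                                   party  = c("Democrat", "Republican")))

N <- sum(observed)

# Expected counts: row total * column total / N for every cell
expected <- outer(rowSums(observed), colSums(observed)) / N

# Test statistic and degrees of freedom (r - 1)(c - 1)
chi_sq <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
c(chi_sq = chi_sq, dof = dof, p_value = p_value)
```

Every expected count is 25, the statistic is 4 on 1 degree of freedom, and the p-value is about 0.046, just under the 5% level.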

Small Cell Counts

The assumption made is that the test statistic is chi-squared distributed. This is only true in the limit, as the sample size grows.

The assumption can break down when a cell has a very small count. Often if the expected cell count is less than 5 (sometimes 10), Yates' correction for continuity is applied. The effect of this is to make a more conservative estimate, but it also increases the likelihood of a type II error.

If the cell counts are less than 10 and it is a 2x2 contingency table, an alternative is to apply Fisher's exact test. The test can be applied to general r x c contingency tables using Monte Carlo simulation.

Implementation

The chisq.test function in R implements the Chi-Squared test, including the ability to use Yates' correction.

The fisher.test function implements Fisher's exact test.
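As a short sketch, here are both functions applied to the gender and party table from earlier. The correct argument of chisq.test switches Yates' correction on or off for 2x2 tables.

```r
# The example contingency table
votes <- matrix(c(20, 30,
                  30, 20),
                nrow = 2, byrow = TRUE,
                dimnames = list(gender = c("Male", "Female"),
                                party  = c("Democrat", "Republican")))

# Pearson's Chi-Squared test without and with Yates' correction
uncorrected <- chisq.test(votes, correct = FALSE)
corrected   <- chisq.test(votes, correct = TRUE)

# Fisher's exact test on the same table
exact <- fisher.test(votes)

c(uncorrected = uncorrected$p.value,
  corrected   = corrected$p.value,
  fisher      = exact$p.value)
```

Without the correction the statistic is 4 and the p-value is about 0.046. With it the statistic drops to 3.24 and the p-value rises to about 0.072, which shows how the more conservative estimate can change the conclusion at the 5% level.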