Multiple Statistical Tests

A/B testing is commonplace in marketing and website optimisation. But why stop at running just two variations? Why not three or eight? I have certainly seen A/B tests suddenly grow to A/B/C/D... and so on. In the statistical literature, this is called multiple testing.

The short answer is that there is nothing stopping you from doing this. However, you do need to adjust your techniques for interpreting the test results, or you risk interpreting noise as signal.

In fact, many A/B testing tools (Optimizely, Maxymiser, etc.) now offer "multivariate testing", which is just a fancy term for running factorial experiments, and means you will be running multiple tests at the same time.

• The common pitfalls when running multiple statistical tests
• Techniques to adjust for multiple comparisons
• How this affects sample size when planning experiments

The dangers of multiple comparisons

One of my favourite examples of the danger of multiple comparisons is a study by Bennett et al. 1

The very brief summary of the paper is:

• A very much dead Atlantic Salmon was placed in an fMRI machine.
• The fMRI measures the blood oxygen levels in units called voxels (think image pixels). Each voxel is measuring the change in blood oxygen levels. Typically fMRI machines will measure 130,000 voxels.
• The (dead) salmon was then shown a series of emotionally charged images and the difference between blood oxygen levels was measured.
• Funnily enough, with that many comparisons, you do find a statistically significant result. Can we conclude the dead salmon's brain was responding to the images? Well no, obviously, it's dead. If you do enough comparisons without any correction, you will end up with something significant.

Let's see this in action. We will randomly sample from a binomial distribution, with p=0.5, 100 times. Each time we will test if the observed proportion is significantly different from 0.5:
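A minimal sketch of such a simulation. The sample size of 1,000 trials per test is my assumption for illustration, and the variable names are arbitrary:

```r
# Simulate 100 samples where the null hypothesis is TRUE (p = 0.5)
# set.seed(1)     # uncomment for a reproducible run
m     <- 100      # number of tests
n     <- 1000     # trials per test (assumed)
alpha <- 0.05

successes <- rbinom(m, size = n, prob = 0.5)

# Test each observed proportion against 0.5
p_values <- sapply(successes, function(x) prop.test(x, n, p = 0.5)$p.value)

# Number of "significant" results, even though every null hypothesis is true
sum(p_values < alpha)
```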

Run it a few times. Most times you should see at least one of the p-values come out below the alpha value. For example:

We need to control for the fact we are doing a large number of tests. Next we will look at ways to deal with these issues.

Techniques

Family Wise Error Rate

The Family Wise Error Rate (FWER) is the probability of making at least one type I error. If V is the number of type I errors:

$FWER=P(V \gt0)$

We will look at techniques that control the FWER to ensure $FWER \le \alpha$. We will only look at methods that control this in the strong sense, that is, the control is valid for all configurations of true and false hypotheses.
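To get a feel for how quickly the FWER grows without any correction, here is a small sketch. It assumes the $m$ tests are independent and every null hypothesis is true, in which case $FWER = 1 - (1 - \alpha)^m$. The independence assumption is mine, made to keep the arithmetic simple:

```r
alpha <- 0.05
m     <- c(1, 3, 8, 20, 100)   # number of tests

# Probability of at least one type I error across m independent tests
fwer <- 1 - (1 - alpha)^m

data.frame(tests = m, fwer = round(fwer, 3))
```

With 100 uncorrected tests, at least one false positive is close to certain.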

Bonferroni

The Bonferroni correction is the simplest method for controlling the FWER. You simply reject the null hypothesis if:

$p_i \le \frac{\alpha}{m}$

where $p_i$ is the p-value of the i-th test and $m$ is the number of tests.

This is a very conservative correction and if you have a large number of tests, you need very small p-values to find anything significant.

To adjust p-values based on the Bonferroni correction (capping at 1, since a p-value cannot exceed 1):

$\hat{p}_i = \min(m p_i, 1)$

To show this in R, let's go back to our previous example:
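A sketch of what this might look like, regenerating the simulated p-values so the block runs on its own (the 1,000 trials per test is again my assumption):

```r
alpha <- 0.05
m     <- 100
n     <- 1000   # trials per test (assumed)

# Same simulation as before: every null hypothesis is true
p_values <- sapply(rbinom(m, size = n, prob = 0.5),
                   function(x) prop.test(x, n, p = 0.5)$p.value)

# Bonferroni-adjusted p-values
p_bonf <- p.adjust(p_values, method = "bonferroni")

sum(p_values < alpha)   # significant before adjustment
sum(p_bonf < alpha)     # significant after adjustment
```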

We should now, hopefully, see no significant results.

The p.adjust method is what we use to obtain the adjusted p-values. It is also possible to use pairwise.prop.test to do this all in one go, but I personally prefer to keep them separate (e.g. compute p-values, then adjust them).

Holm

In Holm's method you sort the p-values in increasing order, so that $p_1 \le p_2 \le \dots \le p_m$. It is called a step down procedure. You compute a critical value for each p-value:

$\alpha_i = \frac{\alpha}{m - i + 1}$

Starting with the smallest p-value, keep rejecting hypotheses until the first $i$ where $p_i > \alpha_i$, at which point you fail to reject $H_i$ and all remaining hypotheses.

$\hat{p}_i= \left\{\begin{array}{l l}mp_i & \quad \text{if i=1}\\ \max(\hat{p}_{i-1},(m -i + 1)p_i) & \quad \text{if i=2,\dots,m}\end{array} \right.$

In R we can simply use:
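A minimal sketch using `p.adjust`, alongside a hand-rolled version of the adjusted p-value formula above so the two can be compared. The p-values are made up for illustration:

```r
p <- c(0.040, 0.010, 0.200, 0.020, 0.030)   # made-up p-values
m <- length(p)

# Built-in Holm adjustment
holm_builtin <- p.adjust(p, method = "holm")

# Hand-rolled: sort, multiply by (m - i + 1), enforce monotonicity, cap at 1
o <- order(p)
i <- seq_len(m)
holm_sorted <- pmin(1, cummax((m - i + 1) * p[o]))
holm_manual <- holm_sorted[order(o)]        # back to the original order

cbind(p, holm_builtin, holm_manual)
```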

Holm's method makes the same assumptions as Bonferroni (none about the dependence between tests) and is less conservative. Therefore it is always suggested to use Holm's method over Bonferroni.

Hochberg

Holm's method is known as a step down procedure, as you start with the smallest p-value and work upwards.

Hochberg's method is a step up method. You start with the largest p-value and work downwards until you find the first $p_i \le \alpha_i$; you then reject $H_i$ and every hypothesis with a smaller p-value. It uses the same critical values as Holm's method.

The adjusted p-values are computed using:

$\hat{p}_i= \left\{\begin{array}{l l}p_m & \quad \text{if i=m}\\ \min(\hat{p}_{i+1},(m -i + 1)p_i) & \quad \text{if i=m-1,\dots,1}\end{array} \right.$

In R we can use:
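A minimal sketch using `p.adjust`, with a hand-rolled version of the formula above for comparison. The p-values are made up for illustration:

```r
p <- c(0.040, 0.010, 0.200, 0.020, 0.030)   # made-up p-values
m <- length(p)

# Built-in Hochberg adjustment
hochberg_builtin <- p.adjust(p, method = "hochberg")

# Hand-rolled: sort, multiply by (m - i + 1), then take a running
# minimum starting from the largest p-value, and cap at 1
o <- order(p)
i <- seq_len(m)
hochberg_sorted <- pmin(1, rev(cummin(rev((m - i + 1) * p[o]))))
hochberg_manual <- hochberg_sorted[order(o)]   # back to the original order

cbind(p, hochberg_builtin, holm = p.adjust(p, method = "holm"))
```

The Hochberg adjusted p-values are never larger than the Holm ones, which is where the extra power comes from.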

An important consideration with Hochberg's method is that the p-values must be independent or positively dependent. If you are running A/B tests and have ensured users have only ever seen one variant, then this assumption is valid.

Hochberg's method is considered more powerful than Holm's as it will reject more hypotheses.

Hommel

Hommel's procedure is more powerful than Hochberg's but slightly harder to understand. It has exactly the same assumptions as Hochberg's (independent or positively dependent p-values) and, being more powerful, it should be preferred over Hochberg's.

Hommel's procedure rejects all p-values that are $\le \frac{\alpha}{j}$

The value for $j$ is found by:

$j = \max\left\{ i \in \{1,\dots,m\} : p_{m-i+k}\gt\frac{k\alpha}{i} \textrm{ for all } k=1,\dots,i \right\}$

Think of this as us looking at all the sets of hypotheses. We are trying to find the largest $i$ for which the condition $p_{m-i+k}\gt\frac{k\alpha}{i}$ is true for all the values of $k$; that $i$ is our $j$. If no such $i$ exists, all hypotheses are rejected.

Let's look at this in practice 2. Suppose we have 3 hypotheses with p-values $p_1=0.024$, $p_2=0.030$ and $p_3=0.073$. To find $j$ we compute:

For $i=1$: $p_3 = 0.073 > \alpha = 0.05$

For $i=2$: $p_3 = 0.073 > \alpha = 0.05$ and $p_2 = 0.030 >\frac{ \alpha}{2}=0.025$

For $i=3$: $p_3 = 0.073 > \alpha = 0.05$ and $p_1=0.024 > \frac{\alpha}{3} = 0.0167$, but $p_2 = 0.030 <\frac{ 2\alpha}{3}=0.033$, so the condition fails

The statement is true for all $k$ for two values of $i$, namely $\{1,2\}$, so $j=\max\{1,2\}=2$. We use $j$ and reject any hypothesis with a p-value less than or equal to $\frac{\alpha}{2} = 0.025$, which in this case is only $p_1=0.024$.

In the following R code, I have provided code for calculating $j$ and the adjusted p-values. The adjusted p-value calculation followed the algorithm in the Appendix of Wright's paper:
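A minimal sketch of the $j$ calculation only, applied to the three p-values from the example. This is not Wright's adjusted p-value algorithm, and `hommel_j` is a hypothetical helper name of my own:

```r
# Find j: the largest i such that p_(m-i+k) > k * alpha / i for all k = 1..i
hommel_j <- function(p, alpha = 0.05) {
  p <- sort(p)
  m <- length(p)
  ok <- sapply(seq_len(m), function(i) {
    k <- seq_len(i)
    all(p[m - i + k] > k * alpha / i)
  })
  # 0 signals that no i satisfies the condition, i.e. reject everything
  if (any(ok)) max(which(ok)) else 0L
}

p     <- c(0.024, 0.030, 0.073)
alpha <- 0.05
j     <- hommel_j(p, alpha)

# Reject every hypothesis with p <= alpha / j (or all of them if j is 0)
rejected <- if (j == 0) rep(TRUE, length(p)) else p <= alpha / j
data.frame(p, rejected)
```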

As with the other methods, R provides a pre-built function for calculating the adjusted p-values:
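A minimal sketch of that call, using the same three p-values as the worked example:

```r
p     <- c(0.024, 0.030, 0.073)
alpha <- 0.05

# Hommel-adjusted p-values
p_hommel <- p.adjust(p, method = "hommel")

# Compare each adjusted p-value against alpha
data.frame(p, p_hommel, rejected = p_hommel <= alpha)
```

Only the first hypothesis should come out as rejected, matching the result of the $j$ calculation.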

False Discovery Rate

Up till now we have focused on methods to control the Family Wise Error Rate (FWER). Let V be the number of false positives among the hypotheses we reject and P be the total number of hypotheses rejected. In FWER methods we are ensuring:

$P(V \gt 0) \le\alpha$

We have applied various methods (Holm, Hochberg, etc.) in order to ensure this. If we want to control the False Discovery Rate (FDR), we will ensure:

$FDR =\mathbb{E}(\frac{V}{P}) \le \alpha$

That is, the expected proportion of false positives among the rejected hypotheses will be below $\alpha$. Many FWER methods can be seen as too conservative and often suffer from low power. FDR methods offer a way to increase power but maintain a principled bound on the error.

The FDR methods were developed on the observation

4 false discoveries out of 10 rejected null hypotheses

is much more serious than

20 false discoveries out of 100 rejected null hypotheses

That is, when we are running a large number of tests, we are possibly willing to accept a percentage of false discoveries if we still find something interesting.

Christopher Genovese has an extremely good tutorial on FDR that is worth looking over for much more detail.

R provides two methods for calculating adjusted p-values (sometimes called q-values in the FDR context) based on controlling the FDR:
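A minimal sketch of both, via the `"BH"` and `"BY"` methods of `p.adjust`. The p-values are made up for illustration:

```r
p     <- c(0.001, 0.008, 0.012, 0.041, 0.049, 0.200, 0.450, 0.900)  # made up
alpha <- 0.05

q_bh <- p.adjust(p, method = "BH")   # Benjamini-Hochberg
q_by <- p.adjust(p, method = "BY")   # Benjamini-Yekutieli

data.frame(p, q_bh, q_by,
           reject_bh = q_bh <= alpha,
           reject_by = q_by <= alpha)
```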

BH is the original Benjamini-Hochberg procedure, which introduced the concept of FDR. It has the same assumptions as the Hochberg procedure (i.e. independent or positively dependent p-values). BY is the Benjamini–Yekutieli procedure. It makes no assumptions about the dependency of the p-values, but as a result it is more conservative than BH.

FDR methods have become popular in genomic studies, neuroscience, biochemistry, oncology and plant sciences, particularly where there is a large number of tests (often thousands) to be performed but you don't want to miss interesting interactions.

Choosing a method

We have covered a variety of techniques to handle multiple tests, but which one should you use? These are my very crude suggestions for choosing which method to use:

1. I have 50 or fewer tests and each test is independent or positively dependent - Use Hommel.
2. I have 50 or fewer tests but I do not know if the tests are independent or positively dependent - Use Holm.
3. I have more than 50 tests and each test is independent or positively dependent - Use FDR Benjamini-Hochberg.
4. I have more than 50 tests but I do not know if the tests are independent or positively dependent - Use FDR Benjamini–Yekutieli.

The 50 tests is an arbitrary choice but in general if you are using a lot of tests you might want to consider FDR methods.
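The rules of thumb above can be written down as a small helper. `choose_method` is a hypothetical function name of my own, and it simply returns the matching `method` string for `p.adjust`:

```r
# Map the crude rules of thumb onto a p.adjust() method name
choose_method <- function(m, positively_dependent, cutoff = 50) {
  if (m <= cutoff) {
    if (positively_dependent) "hommel" else "holm"
  } else {
    if (positively_dependent) "BH" else "BY"
  }
}

p      <- c(0.024, 0.030, 0.073)
method <- choose_method(length(p), positively_dependent = FALSE)
p.adjust(p, method = method)
```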

Sample Size

Intuitively, the more tests you run, the larger the sample size you will need. While there are certainly more complicated methods for determining sample size, I will describe one simple method:

• Apply the Bonferroni correction $\alpha'=\frac{\alpha}{m}$
• Plug this $\alpha'$ into the sample size calculations discussed previously.

This is likely to be an over-estimate of the sample size required, but will give you a good indication of how many more samples will be needed.
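As a rough illustration, the sketch below uses base R's `power.prop.test` (my choice here, to avoid any package dependency) with a hypothetical base conversion rate of 10%, a target of 12% and four comparisons:

```r
alpha <- 0.05
m     <- 4        # number of comparisons (assumed)
p1    <- 0.10     # hypothetical base conversion rate
p2    <- 0.12     # hypothetical rate we want to be able to detect

# Per-group sample size for a single test
n_single <- power.prop.test(p1 = p1, p2 = p2,
                            sig.level = alpha, power = 0.9)$n

# Per-group sample size using the Bonferroni-corrected alpha
n_bonf <- power.prop.test(p1 = p1, p2 = p2,
                          sig.level = alpha / m, power = 0.9)$n

ceiling(c(single_test = n_single, bonferroni = n_bonf))
```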

Alternatives

This is by no means an exhaustive description of how to deal with multiple tests. In this article I have focused on ways to adjust existing p-values. Other methods that may be worth exploring:

• ANOVA and Tukey HSD tests (Post-hoc analysis in general). One issue here is that it assumes normality of the data. This may mean you need to perform transformations like Arcsine Square Root 3 on proportion data.
• Bootstrap methods - Simulating how unlikely a sample is to occur if they were not independent.
• Bayesian methods

Potentially in later articles I will try to explore some of these methods further.

1. See http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
2. Example courtesy of http://www2.math.uu.se/research/pub/Ekenstierna.pdf
3. https://www.biostars.org/p/6189/

Significance Testing and Sample Size

A common question when analysing any sort of online data:

Is this result statistically significant?

• A/B Testing
• Survey Analysis

In this post, we will explore ways to answer this question and determine the sample sizes we need.

Note: This post was heavily influenced by Marketing Distillery's excellent A/B Tests in Marketing.

Terminology

First let us define some standard statistical terminology:

• We are deciding between two hypotheses - the null hypothesis and the alternative hypothesis.
• The null hypothesis is the default assumption that there is no relationship between two values.  For instance, if comparing conversion rates of two landing pages, the null hypothesis is that the conversion rates are the same.
• The alternative hypothesis is what we assume if the null hypothesis is rejected. It can be two-tailed, the conversion rates are not the same, or one-tailed, where we state the direction of the inequality.
• A result is determined to be statistically significant if the observed effect was unlikely to have been seen by random chance.
• How unlikely is determined by the significance level. Commonly it is set to 5%. It is the probability of rejecting the null hypothesis when it is true. This probability is denoted by $\alpha$

$\alpha = P(rejecting\ null\ hypothesis\ |\ null\ hypothesis\ is\ true)$

• Rejecting the null hypothesis when the null hypothesis is true is called a type I error.
• A type II error occurs when the null hypothesis is false but we erroneously fail to reject it. The probability of this is denoted by $\beta$

$\beta = P(fail\ to\ reject\ null\ hypothesis\ |\ null\ hypothesis\ is\ false)$

| | Null hypothesis (H0) is true | Null hypothesis (H0) is false |
|---|---|---|
| Reject null hypothesis | Type I error (false positive) | Correct outcome (true positive) |
| Fail to reject null hypothesis | Correct outcome (true negative) | Type II error (false negative) |

• The power of a test is $1-\beta$. Commonly this is set to 90%.
• The p-value is calculated from the test statistic. It is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true.
• To reject the null hypothesis we need a p-value that is lower than the selected significance level.
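To make the meaning of $\alpha$ concrete, here is a small simulation sketch. It repeatedly runs a t-test on two groups drawn from the same distribution (so the null hypothesis is true) and counts how often we wrongly reject. The group size, number of simulations and seed are arbitrary choices of mine:

```r
set.seed(123)      # arbitrary seed for reproducibility
alpha  <- 0.05
n_sims <- 2000

# Both groups come from the same normal distribution, so any
# rejection of the null hypothesis is a type I error
p_values <- replicate(n_sims, t.test(rnorm(30), rnorm(30))$p.value)

type1_rate <- mean(p_values < alpha)
type1_rate         # should be close to alpha
```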

Types of Data

Usually we have two types of data we want to perform significance tests with:

1. Proportions e.g. Conversion Rates, Percentages
2. Real valued numbers e.g. Average Order Values, Average Time on Page

In this post, we will look at both.

Tests for Proportions

A common scenario is we have run an A/B test of two landing pages and we wish to test if the conversion rate is significantly different between the two.

An important assumption here is that the two groups are mutually exclusive, i.e. you can only have been shown one of the landing pages.

The null hypothesis is that the proportions are equal in both groups.

The test can be performed in R using:
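A minimal sketch of a call to `prop.test`, with the arguments written out explicitly. The conversion counts are hypothetical:

```r
# Hypothetical counts: 100 of 1000 visitors converted on page A,
# 110 of 1000 converted on page B
conversions <- c(A = 100, B = 110)
visitors    <- c(A = 1000, B = 1000)

# 2x2 matrix of successes and failures, one row per group
x <- cbind(success = conversions, failure = visitors - conversions)

result <- prop.test(x,
                    alternative = "two.sided",
                    conf.level  = 0.95,
                    correct     = TRUE)

result$p.value
```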

Under the hood this is performing a Pearson's Chi-Squared Test.

Regarding the parameters:

x - Is usually a 2x2 matrix giving the counts of successes and failures in each group (conversions and non-conversions for instance).

n - Number of trials performed in each group. Can be left NULL if you provide x as a matrix.

p - Only if you want to test for equality against a specific proportion (e.g. 0.5).

alternative - Generally only used when testing against a specific p. Changes the underlying test to use a z-test. See Two-Tailed Test of Population Proportion for more details. The z-test may not be a good approximation with small sample sizes.

conf.level - Used only in the calculation of a confidence interval. Not used as part of the actual test.

correct - Usually safe to apply continuity correction. See my previous post for more details on the correction.

The important part of the output of prop.test is the p-value. If this is less than your desired significance level (say 0.05) you reject the null hypothesis.

In this example, the p-value is not less than our desired significance level (0.05) so we cannot reject the null hypothesis.

Before running your test, you should fix your sample size in each group in advance. The R library pwr has various functions for helping you do this. The function pwr.2p.test can be used for this:

Any one of the parameters can be left blank and the function will estimate its value. For instance, leaving n, the sample size, blank will mean the function will compute the desired sample size.

The only new parameter here is h. This is the minimum effect size you wish to be able to detect. h is calculated as the difference of the arcsine transformation of two proportions:

$h = 2 \arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})$

Assuming you have an idea of the base rate proportion (e.g. your current conversion rate) and the minimum change you want to detect, you can use the following R code to calculate h:
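A minimal sketch, assuming a hypothetical base rate of 10% and a minimum detectable change of 2 percentage points. The call to `pwr.2p.test` is only run if the pwr package is installed:

```r
base_rate  <- 0.10   # hypothetical current conversion rate
min_change <- 0.02   # hypothetical smallest absolute change worth detecting

# Effect size h from the arcsine transformation of the two proportions
h <- 2 * asin(sqrt(base_rate + min_change)) - 2 * asin(sqrt(base_rate))
h

# Leave n out so that pwr.2p.test solves for the per-group sample size
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.2p.test(h = h, sig.level = 0.05, power = 0.9))
}
```

The pwr package also provides `ES.h`, which computes the same effect size from two proportions.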

Tests for real-values

Now imagine we want to test if a change to an eCommerce web-page increases the average order value.

The major assumption we are going to make is that the data we are analysing is normally distributed. See my previous post on how to check if the data is normally distributed.

It may be possible to transform the data to a normal distribution, for instance if the data is log-normal. Time on page often looks to fit a log-normal distribution. In this case you can just take the log of the times on page.
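As a quick illustration of the log transformation, the sketch below simulates log-normal "time on page" data and checks it before and after taking logs. Using the Shapiro-Wilk test here is my own choice of normality check, and the simulated data and parameters are hypothetical:

```r
set.seed(7)   # arbitrary seed for reproducibility

# Simulated, hypothetical times on page (seconds), log-normally distributed
time_on_page <- rlnorm(500, meanlog = 3, sdlog = 1)

# A small p-value suggests the data is NOT normally distributed
p_raw <- shapiro.test(time_on_page)$p.value
p_log <- shapiro.test(log(time_on_page))$p.value

c(raw = p_raw, logged = p_log)
```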

The test we will run is the two-sample t-test. We are testing if the means of the two groups are significantly different.

The parameters:

x - The samples in one group

y - The samples in the other group. Leave NULL for a single sample test

alternative - Perform two-tailed or one-tailed test?

mu - The value of the true mean (or of the difference in means for a two-sample test) assumed under the null hypothesis

paired - TRUE for the paired t-test, used for data where the two samples have a natural partner in each set, for instance comparing the weights of people before and after a diet.

var.equal - Assume equal variance in the two groups or not

conf.level - Used in calculating the confidence interval around the means. Not part of the actual test.

The test can be run using:
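A minimal sketch of a call to `t.test`, with the parameters above written out explicitly. The order values are hypothetical:

```r
# Hypothetical order values for the existing page and the changed page
control <- c(50, 52, 48, 51, 49, 53, 47, 50)
variant <- c(51, 49, 52, 50, 48, 54, 46, 51)

result <- t.test(x = control, y = variant,
                 alternative = "two.sided",
                 mu          = 0,
                 paired      = FALSE,
                 var.equal   = FALSE,
                 conf.level  = 0.95)

result$p.value
```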

In this example, our p-value is greater than the 0.05 significance level, so we cannot reject the null hypothesis.

As before, we should set the required sample size before running the experiment. Using the pwr library we can use:

d is the effect size. For t-tests the effect size is assessed as:

$d = \frac{|\mu_1 - \mu_2|}{\sigma}$

where $\mu_1$ is the mean of group 1, $\mu_2$ is the mean of group 2 and $\sigma$ is the pooled standard deviation.

Ideally you should set d based on your problem domain, quantifying the effect size you expect to see. If this is not possible, Cohen's book on power analysis, "Statistical Power Analysis for the Behavioral Sciences", suggests setting d to 0.2, 0.5 and 0.8 for small, medium and large effect sizes respectively.
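A minimal sketch of the effect size and sample size calculation, assuming hypothetical means and standard deviation. It uses base R's `power.t.test` so that it runs without extra packages, and calls `pwr.t.test` only if the pwr package is installed:

```r
mu1   <- 52   # hypothetical mean of group 1
mu2   <- 50   # hypothetical mean of group 2
sigma <- 4    # hypothetical pooled standard deviation

# Effect size d
d <- abs(mu1 - mu2) / sigma

# Base R: with sd = 1, delta is the standardised effect size
n_base <- power.t.test(delta = d, sd = 1, sig.level = 0.05,
                       power = 0.9, type = "two.sample")$n
ceiling(n_base)   # per-group sample size

# Equivalent calculation with pwr, leaving n out so it is solved for
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.t.test(d = d, sig.level = 0.05, power = 0.9,
                        type = "two.sample"))
}
```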