A/B testing is commonplace in marketing and website optimisation. But why stop at running two variations? Why not three, or eight? I have certainly seen A/B tests suddenly grow into A/B/C/D tests and beyond. In the statistical literature, this is called multiple testing.

The short answer is that there is nothing stopping you from doing this. However, you do need to adjust your techniques for interpreting the test results, or you risk mistaking noise for signal.

In fact, many A/B testing tools (Optimizely, Maxymiser, etc.) now offer "multivariate testing", which is just a fancy term for running factorial experiments: you will be running multiple tests at the same time.

In this article I will discuss:

- The common pitfalls when running multiple statistical tests
- Techniques to adjust for multiple comparisons
- How this affects sample size when planning experiments

# The dangers of multiple comparisons

One of my favourite examples of the danger of multiple comparisons is a study by Bennett et al. ^{1}

The very brief summary of the paper is:

- A very much dead Atlantic Salmon was placed in an fMRI machine.
- The fMRI machine measures blood oxygen levels in units called voxels (think image pixels). Each voxel measures the change in blood oxygen level, and a typical fMRI scan measures around 130,000 voxels.
- The (dead) salmon was then shown a series of emotionally charged images and the difference between blood oxygen levels was measured.
- Funnily enough, with that many comparisons, you do find a statistically significant result. Can we conclude the dead salmon's brain was responding to the images? Obviously not: it's dead. If you run enough comparisons without any correction, you will end up with something significant.

Let's see this in action. We will randomly sample from a binomial distribution with p=0.5, 100 times. Each time we will test whether the observed proportion is significantly different from 0.5:

```r
runs <- 100
trials <- 100
true.p <- 0.5
alpha <- 0.05

a <- unlist(lapply(1:runs, function(x){
  a <- sum(rbinom(trials, 1, true.p))
  k <- prop.test(a, trials, p=true.p)
  if (k$p.value < alpha) print(k)
  k$p.value < alpha
}))
```

Run it a few times; most times you should see at least one of the p-values come out below the alpha value. For example:

```
    1-sample proportions test with continuity correction

data:  a out of trials, null probability true.p
X-squared = 5.29, df = 1, p-value = 0.02145
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5170589 0.7136053
sample estimates:
   p
0.62
```

We need to control for the fact that we are doing a large number of tests. Next we will look at ways to deal with this.

# Techniques

## Family Wise Error Rate

The Family Wise Error Rate (FWER) is the probability of making at least one type I error. If $V$ is the number of type I errors:

$$\text{FWER} = P(V \geq 1)$$

We will be looking at techniques that control the FWER to ensure $\text{FWER} \leq \alpha$. We will only look at methods that control this in the **strong sense**, meaning the bound is valid for all configurations of true and false hypotheses.
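To see why this needs controlling: for $m$ independent tests each run at level $\alpha$, the FWER is $1 - (1 - \alpha)^m$, which grows quickly with $m$. A quick sketch in R:

```r
# Probability of at least one false positive across m independent
# tests, each run at significance level alpha = 0.05.
alpha <- 0.05
m <- c(1, 5, 10, 50, 100)
fwer <- 1 - (1 - alpha)^m
round(fwer, 3)
# 1 test: 0.05; 10 tests: ~0.40; 100 tests: ~0.994
```

By 100 uncorrected tests, at least one false positive is a near certainty.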

### Bonferroni

The Bonferroni correction is the simplest method for controlling the FWER. You simply now reject the null hypothesis if:

$$p_i \leq \frac{\alpha}{m}$$

where $p_i$ is the p-value of the i-th test and $m$ is the number of tests.

This is a very conservative correction, and if you have a large number of tests you will need very small p-values to find anything significant.

To adjust p-values based on the Bonferroni correction:

$$\tilde{p}_i = \min(1,\ m \, p_i)$$

To show this in R, let's go back to our previous example:

```r
a <- unlist(lapply(1:runs, function(x){
  a <- sum(rbinom(trials, 1, true.p))
  k <- prop.test(a, trials, p=true.p)
  k$p.value
}))

result <- p.adjust(a, method="bonferroni")
sum(result < alpha)
```

We should now, hopefully, see no significant results.

The **p.adjust** function is what we use to obtain the adjusted p-values. It is also possible to use **pairwise.prop.test** to do this all in one go, but I personally prefer to keep the steps separate (i.e. compute p-values, then adjust them).

### Holm

In Holm's method you order the p-values in increasing order, $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$. It is called a step-down procedure. You compute a critical value for each p-value:

$$\alpha_i = \frac{\alpha}{m - i + 1}$$

Starting with the smallest p-value, keep rejecting hypotheses until the first $i$ where $p_{(i)} > \alpha_i$, at which point accept $H_{(i)}$ and all remaining hypotheses.

The adjusted p-values are:

$$\tilde{p}_{(i)} = \max_{j \leq i} \left\{ \min\left( (m - j + 1)\, p_{(j)},\ 1 \right) \right\}$$

In R we can simply use:

```r
result <- p.adjust(a, method="holm")
```

Holm's method relies on the same (minimal) assumptions as Bonferroni, and it is less conservative. It is therefore always suggested to use Holm's method over Bonferroni.
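To make the step-down mechanics concrete, here is a minimal hand-rolled version of Holm's adjustment (the function name is my own); it should agree with **p.adjust**:

```r
# Minimal Holm step-down adjustment: sort the p-values, scale the i-th
# smallest by (m - i + 1), then enforce monotonicity with a running max.
holm.adjust <- function(p) {
  m <- length(p)
  o <- order(p)                     # indices of p in increasing order
  adj <- pmin(1, cummax((m - seq_len(m) + 1) * p[o]))
  adj[order(o)]                     # return in the original order
}

p <- c(0.001, 0.01, 0.04, 0.20)
all.equal(holm.adjust(p), p.adjust(p, method="holm"))
```

The running maximum is what encodes "stop at the first non-rejection": once one scaled p-value exceeds a later one, the later hypothesis cannot be rejected either.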

### Hochberg

Holm's method is known as a step-down procedure, as you start with the smallest p-value and work upwards.

Hochberg's method is a step-up method. You start with the largest p-value and work downwards until you find the first $i$ where $p_{(i)} \leq \alpha_i$; you then reject $H_{(i)}$ and all hypotheses with smaller p-values. It uses the same critical values as Holm's method, $\alpha_i = \alpha / (m - i + 1)$.

The adjusted p-values are computed using:

$$\tilde{p}_{(i)} = \min_{j \geq i} \left\{ \min\left( (m - j + 1)\, p_{(j)},\ 1 \right) \right\}$$

In R we can use:

```r
result <- p.adjust(a, method="hochberg")
```

An important consideration with Hochberg's method is that each p-value must be **independent** or positively dependent. If you are running A/B tests and have ensured users only ever see one variant, then this assumption is valid.

Hochberg's method is considered more powerful than Holm's: given the same p-values, it will reject at least as many hypotheses.
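The step-up direction is also easy to express in code. Here is a minimal sketch of Hochberg's adjustment (the function name is my own), which should match **p.adjust**:

```r
# Minimal Hochberg step-up adjustment: scale the sorted p-values as in
# Holm, then take a running minimum from the largest p-value downwards.
hochberg.adjust <- function(p) {
  m <- length(p)
  o <- order(p)
  scaled <- (m - seq_len(m) + 1) * p[o]
  adj <- pmin(1, rev(cummin(rev(scaled))))
  adj[order(o)]
}

p <- c(0.001, 0.01, 0.04, 0.20)
all.equal(hochberg.adjust(p), p.adjust(p, method="hochberg"))
```

Note the only structural difference from Holm is the direction of the running extremum: a maximum moving up from the smallest p-value becomes a minimum moving down from the largest.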

### Hommel

Hommel's procedure is more powerful than Hochberg's, but slightly harder to understand. It has exactly the same assumptions as Hochberg's (independent or positively dependent p-values), and being more powerful it should be preferred over Hochberg's.

Hommel's procedure rejects all p-values that are $\leq \alpha / j$.

The value for $j$ is found by:

$$j = \max \left\{ i \in \{1, \dots, m\} : p_{(m - i + k)} > \frac{k \alpha}{i} \ \text{ for all } k = 1, \dots, i \right\}$$

Think of this as looking at the sets formed by the $i$ largest p-values. We are trying to find the largest $i$ where the condition is true for all the values of $k$. (If no such $i$ exists, all hypotheses are rejected.)

Let's look at this in practice ^{2}. As an illustration, suppose we have 3 hypotheses with sorted p-values $p_{(1)} = 0.02$, $p_{(2)} = 0.03$ and $p_{(3)} = 0.06$, and $\alpha = 0.05$. To find $j$ we compute:

For $i = 1$: $p_{(3)} > \alpha$, i.e. $0.06 > 0.05$. True.

For $i = 2$: $p_{(2)} > \alpha/2$ and $p_{(3)} > \alpha$, i.e. $0.03 > 0.025$ and $0.06 > 0.05$. Both true.

For $i = 3$: $p_{(1)} > \alpha/3$, $p_{(2)} > 2\alpha/3$ and $p_{(3)} > \alpha$; but $0.03 > 0.0\overline{3}$ is false.

We find two sets of hypotheses where the statement is true for all $k$. The values of $i$ for these two sets were 1 and 2, so $j = 2$. We use $\alpha/j = 0.025$ and reject any p-values less than or equal to it, which in this case is just $p_{(1)} = 0.02$.
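A worked example like this can be sanity-checked against R's built-in implementation. Using illustrative p-values of 0.02, 0.03 and 0.06 (my own numbers), only the smallest survives at $\alpha = 0.05$:

```r
# Illustrative (made-up) p-values for three hypotheses.
p <- c(0.02, 0.03, 0.06)
adj <- p.adjust(p, method="hommel")
adj              # 0.045 0.060 0.060
adj <= 0.05      # TRUE FALSE FALSE: only the smallest p-value is rejected
```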

In the following R code, I have provided code for calculating $j$ and the adjusted p-values. The adjusted p-value calculation follows the algorithm in the Appendix of Wright's paper:

```r
hommel.method <- function(tests, alpha) {
  n <- length(tests)
  p <- sort(tests)
  a <- p
  valid.hypotheses <- c()

  for (m in n:1) {
    q <- (n-m+1):n

    result <- vapply(q, function(i) {
      adj.alpha <- (m + i - n) * alpha / m
      adj.p.value <- m * p[i] / (m + i - n)
      c(adj.p.value, p[i] > adj.alpha)
    }, c("p.vals"=0, "test"=0))

    cur.simes <- result[1,]
    cur.tests <- result[2,]

    if (sum(cur.tests) == length(cur.tests)) {
      valid.hypotheses <- c(valid.hypotheses, m)
    }

    c.min <- min(cur.simes)
    for (i in q) {
      if (a[i] < c.min) a[i] <- c.min
    }

    for (i in seq_len(n-m)) {
      c.i <- min(c.min, m * p[i])
      if (a[i] < c.i) a[i] <- c.i
    }
  }

  j <- max(valid.hypotheses)
  adj.alpha <- alpha / j

  list("alpha"=adj.alpha, "adj.p.values"=a)
}
```

As with the other methods, R provides a pre-built function for calculating the adjusted p-values:

```r
result <- p.adjust(a, method="hommel")
```

## False Discovery Rate

Up till now we have focused on methods that control the Family Wise Error Rate (FWER). Let $V$ be the number of false positives among the hypotheses we reject, and $R$ be the total number of hypotheses rejected. The FWER methods ensure:

$$P(V \geq 1) \leq \alpha$$

We have applied various methods (Holm, Hochberg, etc.) in order to ensure this. If instead we want to control the False Discovery Rate (FDR), we ensure:

$$\text{FDR} = E\left[\frac{V}{\max(R, 1)}\right] \leq \alpha$$

That is, the expected proportion of false discoveries among the rejections will be below $\alpha$. Many FWER methods can be seen as too conservative and often suffer from low power. FDR methods offer a way to increase power while maintaining a principled bound on the error.

The FDR methods were motivated by the observation that

4 false discoveries out of 10 rejected null hypotheses

is much more serious than

20 false discoveries out of 100 rejected null hypotheses

That is, when we are running a large number of tests, we may be willing to accept a percentage of false discoveries if we still find something interesting.

Christopher Genovese has an extremely good tutorial on FDR, which is worth looking over for much more detail.

R provides two methods for calculating adjusted p-values (sometimes called q-values in the FDR context) based on controlling the FDR:

```r
result <- p.adjust(p, method="BH")
result <- p.adjust(p, method="BY")
```

BH is the original Benjamini-Hochberg procedure, in which the concept of FDR was introduced. It has the same assumptions as the Hochberg procedure (i.e. independent or positively dependent p-values). BY is the Benjamini–Yekutieli procedure; it makes no assumptions about the dependency of the p-values, but as a result it is more conservative than BH.
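To get a feel for the power difference between FWER and FDR control, here is a small simulated sketch (the mixture of null and "effect" p-values is my own construction):

```r
# Simulate 100 tests: 80 true nulls (p ~ Uniform(0,1)) and 20 genuine
# effects (p-values concentrated near zero), then compare adjustments.
set.seed(42)
p.null <- runif(80)                       # true null hypotheses
p.effect <- rbeta(20, 0.5, 20)            # genuine effects: small p-values
p <- c(p.null, p.effect)

alpha <- 0.05
sum(p.adjust(p, method="holm") < alpha)   # FWER control: few rejections
sum(p.adjust(p, method="BH") < alpha)     # FDR control: usually many more
```

BH will always reject at least as many hypotheses as Holm on the same p-values; the price is that a controlled fraction of those rejections may be false discoveries.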

FDR methods have become popular in genomics, neuroscience, biochemistry, oncology and plant sciences: fields where there are large numbers (often thousands) of tests to be performed, but you don't want to miss interesting effects.

## Choosing a method

We have covered a variety of techniques to handle multiple tests, but which one should you use? These are my very crude suggestions for choosing a method:

- I have 50 or fewer tests and each test is independent or positively dependent: **use Hommel.**
- I have 50 or fewer tests but I do not know if the tests are independent or positively dependent: **use Holm.**
- I have more than 50 tests and each test is independent or positively dependent: **use FDR Benjamini-Hochberg.**
- I have more than 50 tests but I do not know if the tests are independent or positively dependent: **use FDR Benjamini–Yekutieli.**

The threshold of 50 tests is an arbitrary choice, but in general, if you are running a lot of tests you should consider FDR methods.

## Sample Size

Intuitively, the more tests you run, the larger the sample size you will need. While there are certainly more sophisticated methods for determining sample size, I will describe one simple approach:

- Apply the Bonferroni correction to the significance level (i.e. use $\alpha/m$).
- Plug this into the sample size calculations discussed previously.

This is likely to be an over-estimate of the required sample size, but it will give you a good indication of how many more samples you will need.
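As a sketch of that recipe in R, using the built-in **power.prop.test** (the baseline conversion rate, lift and number of variants below are made-up numbers):

```r
# Bonferroni-adjusted sample size: testing 10 variants against a
# baseline conversion rate of 10%, hoping to detect a lift to 12%.
m <- 10
alpha <- 0.05

unadjusted <- power.prop.test(p1=0.10, p2=0.12, sig.level=alpha, power=0.8)
adjusted <- power.prop.test(p1=0.10, p2=0.12, sig.level=alpha/m, power=0.8)

ceiling(unadjusted$n)   # per-group sample size at alpha
ceiling(adjusted$n)     # noticeably larger at alpha / m
```

The adjusted significance level roughly doubles the per-group sample size here, which is worth knowing before promising a launch date.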

### Alternatives

This is by no means an exhaustive description of how to deal with multiple tests. In this article I have focused on ways to adjust existing p-values. Other methods that may be worth exploring:

- ANOVA and Tukey HSD tests (post-hoc analysis in general). One issue here is that ANOVA assumes normality of the data. This may mean you need to perform transformations like the arcsine square root ^{3} on proportion data.
- Bootstrap methods: simulating how unlikely the observed samples are to occur under the null hypothesis.
- Bayesian methods

Potentially in later articles I will try to explore some of these methods further.