A common question when analysing any sort of online data:

Is this result statistically significant?

Examples of where I have had to answer this question:

- A/B Testing
- Survey Analysis

In this post, we will explore ways to answer this question and determine the sample sizes we need.

**Note:** This post was heavily influenced by Marketing Distillery’s excellent A/B Tests in Marketing.

## Terminology

First let us define some standard statistical terminology:

- We are deciding between two hypotheses: the **null hypothesis** and the **alternative hypothesis**.
- The **null hypothesis** is the default assumption that there is no relationship between two values. For instance, if comparing the conversion rates of two landing pages, the null hypothesis is that the conversion rates are the same.
- The **alternative hypothesis** is what we assume if the null hypothesis is rejected. It can be *two-tailed* (the conversion rates are not the same) or *one-tailed*, where we state the direction of the inequality.
- A result is determined to be **statistically significant** if the observed effect is unlikely to have occurred by random chance.
- How unlikely is determined by the **significance level**, commonly set to 5%. It is the probability of rejecting the null hypothesis when it is true. This probability is denoted by \(\alpha\):

$$\alpha = P(\text{reject null hypothesis} \mid \text{null hypothesis is true})$$

- Rejecting the null hypothesis when the null hypothesis is true is called a *type I error*.
- A *type II error* occurs when the null hypothesis is false but we erroneously fail to reject it. The probability of this is denoted by \(\beta\):

$$\beta = P(\text{fail to reject null hypothesis} \mid \text{null hypothesis is false})$$

- The **power** of a test is \(1-\beta\). Commonly this is set to 90%.
- The *p-value* is calculated from the *test statistic*. It is the probability that the observed result would have been seen by chance, assuming the null hypothesis is true.
- To **reject the null hypothesis** we need a *p-value* that is lower than the selected significance level.
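
To make the significance level concrete, here is a small illustrative simulation (not from the original post): if the null hypothesis is true, a test at \(\alpha = 0.05\) should reject it in roughly 5% of repeated experiments.

```r
set.seed(42)
alpha <- 0.05

# Simulate 2,000 "A/A tests": both groups share the same true rate,
# so the null hypothesis is true and every rejection is a type I error
p_values <- replicate(2000, {
  a <- rbinom(1, 1000, 0.2)  # conversions in group A out of 1000
  b <- rbinom(1, 1000, 0.2)  # conversions in group B out of 1000
  prop.test(c(a, b), c(1000, 1000))$p.value
})

# The observed type I error rate should be close to alpha
mean(p_values < alpha)
```

Note that prop.test applies a continuity correction by default, which makes the test slightly conservative, so the observed rate tends to fall a little below 5%.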

## Types of Data

Usually we have two types of data we want to perform significance tests with:

- Proportions e.g. Conversion Rates, Percentages
- Real valued numbers e.g. Average Order Values, Average Time on Page

In this post, we will look at both.

## Tests for Proportions

A common scenario: we have run an A/B test of two landing pages and wish to test whether the conversion rate differs significantly between the two.

An important assumption here is that the two groups are mutually exclusive, i.e. each visitor can only have been shown one of the landing pages.

The null hypothesis is that the proportions are equal in both groups.

The test can be performed in R using:

```r
prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)
```

Under the hood this is performing a Pearson’s Chi-Squared Test.

Regarding the parameters:

x – Usually a 2×2 matrix giving the counts of successes and failures in each group (conversions and non-conversions, for instance).

n – Number of trials performed in each group. Can be left NULL if you provide x as a matrix.

p – Only if you want to test for equality against a specific proportion (e.g. 0.5).

alternative – Generally only used when testing against a specific p. Changes the underlying test to use a z-test. See Two-Tailed Test of Population Proportion for more details. The z-test may not be a good assumption with small sample sizes.

conf.level – Used only in the calculation of a confidence interval. Not used as part of the actual test.

correct – Usually safe to apply continuity correction. See my previous post for more details on the correction.

The important part of the output of prop.test is the **p-value**. If this is less than your desired significance level (say 0.05) you reject the null hypothesis.

```r
heads <- rbinom(1, size = 100, prob = 0.5)
prop.test(heads, 100)

	1-sample proportions test with continuity correction

data:  heads out of 100, null probability 0.5
X-squared = 0.81, df = 1, p-value = 0.3681
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4475426 0.6485719
sample estimates:
   p
0.55
```

In this example, the p-value is not less than our desired significance level (0.05) so we cannot reject the null hypothesis.
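
For the two-landing-page scenario, x and n can instead be vectors giving the conversions and visitors in each group. The counts below are made up for illustration:

```r
# Hypothetical A/B test: conversions out of 1000 visitors per landing page
conversions <- c(200, 230)
visitors <- c(1000, 1000)

result <- prop.test(conversions, visitors)
result$estimate  # observed conversion rates: 0.20 and 0.23
result$p.value   # reject the null hypothesis if below 0.05
```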

Before running your test, you should fix your sample size in each group in advance. The R library **pwr** has various functions to help with this; for proportions, use **pwr.2p.test**:

```r
pwr.2p.test(h = NULL, n = NULL, sig.level = 0.05, power = NULL,
            alternative = c("two.sided", "less", "greater"))
```

Any one of the parameters can be left as NULL and the function will estimate its value. For instance, leaving n (the sample size) as NULL means the function will compute the required sample size.

The only new parameter here is h. This is the minimum effect size you wish to be able to detect. h is calculated as the difference of the arcsine transformation of two proportions:

$$h = 2 \arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})$$

Assuming you have an idea of the base rate proportion (e.g. your current conversion rate) and the minimum change you want to detect, you can use the following R code to calculate h:

```r
base <- 0.2
change <- 0.1
new <- base + change
h <- abs(2 * asin(sqrt(base)) - 2 * asin(sqrt(new)))
```
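
Continuing this example, h can then be fed into pwr.2p.test to obtain the required sample size per group (assuming the pwr package is installed; the base rate and change are the illustrative values above):

```r
library(pwr)

base <- 0.2
change <- 0.1
new <- base + change
h <- abs(2 * asin(sqrt(base)) - 2 * asin(sqrt(new)))

# Leave n out so pwr.2p.test solves for the sample size per group
result <- pwr.2p.test(h = h, sig.level = 0.05, power = 0.9)
ceiling(result$n)  # round up: visitors needed in each group
```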

## Tests for Real Values

Now imagine we want to test if a change to an eCommerce web page increases the average order value.

The major assumption we are going to make is that the data we are analysing is normally distributed. See my previous post on how to check if the data is normally distributed.

It may be possible to transform the data to a normal distribution, for instance if the data is log-normal. Time on page often appears to fit a log-normal distribution; in this case you can simply take the log of the times on page.
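
A minimal sketch of this transformation, using simulated data since no dataset accompanies the post (the sample size and distribution parameters are arbitrary):

```r
set.seed(1)
# Simulated time-on-page values (seconds): log-normal, i.e. heavily skewed
time_on_page <- rlnorm(500, meanlog = 3, sdlog = 0.8)

# Taking logs recovers a normal distribution, so normal-theory
# tests can be applied to log_times
log_times <- log(time_on_page)
shapiro.test(log_times)  # Shapiro-Wilk test for departure from normality
```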

The test we will run is the two-sample t-test. We are testing if the means of the two groups are significantly different.

```r
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
```

The parameters:

x – The samples in one group

y – The samples in the other group. Leave NULL for a single sample test

alternative – Whether to perform a two-tailed or one-tailed test

mu – The true mean under the null hypothesis (one-sample test), or the hypothesized difference in means (two-sample test). Defaults to 0

paired – TRUE for the paired t-test, used for data where the two samples have a natural partner in each set, for instance comparing the weights of people before and after a diet.

var.equal – Whether to assume equal variance in the two groups. The default (FALSE) gives Welch’s t-test

conf.level – Used in calculating the confidence interval around the means. Not part of the actual test.

The test can be run using:

```r
t.test(x, y)

	Welch Two Sample t-test

data:  x and y
t = 1.4174, df = 35.85, p-value = 0.165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3929974  2.2162110
sample estimates:
mean of x mean of y
1.3699132 0.4583064
```

In this example, the p-value is greater than our 0.05 significance level, so we cannot reject the null hypothesis.
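
The paired = TRUE case mentioned above can be sketched with hypothetical before-and-after data (the weights are simulated for illustration):

```r
set.seed(2)
# Hypothetical weights (kg) for 30 people before and after a diet
before <- rnorm(30, mean = 80, sd = 10)
after <- before - rnorm(30, mean = 2, sd = 3)  # average loss of 2 kg

# Pairing uses each person as their own control, which removes the
# large person-to-person variation in baseline weight
diet_test <- t.test(before, after, paired = TRUE)
diet_test
```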

As before, we should fix the required sample size in advance of running the experiment. From the **pwr** library we can use:

```r
pwr.t.test(n = NULL, d = NULL, sig.level = 0.05, power = NULL,
           type = c("two.sample", "one.sample", "paired"))
```

d is the effect size. For t-tests the effect size is assessed as:

$$d = \frac{|\mu_1 - \mu_2|}{\sigma}$$

where \(\mu_1\) is the mean of group 1, \(\mu_2\) is the mean of group 2 and \(\sigma\) is the pooled standard deviation.

Ideally you should set d based on your problem domain, quantifying the effect size you expect to see. If this is not possible, Cohen’s book on power analysis, “Statistical Power Analysis for the Behavioral Sciences”, suggests setting d to 0.2, 0.5 and 0.8 for small, medium and large effect sizes respectively.
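
For instance, to find the sample size per group needed to detect a medium effect (d = 0.5) at a 5% significance level and 90% power (assuming the pwr package is installed):

```r
library(pwr)

# Leave n out so pwr.t.test solves for the sample size per group
result <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.9,
                     type = "two.sample")
ceiling(result$n)  # participants needed in each group
```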