A common question when analysing any sort of online data:
Is this result statistically significant?
Examples of where I have had to answer this question:
- A/B Testing
- Survey Analysis
In this post, we will explore ways to answer this question and determine the sample sizes we need.
Note: This post was heavily influenced by Marketing Distillery’s excellent A/B Tests in Marketing.
Terminology
First let us define some standard statistical terminology:
- We are deciding between two hypotheses – the null hypothesis and the alternative hypothesis.
- The null hypothesis is the default assumption that there is no relationship between two values. For instance, if comparing conversion rates of two landing pages, the null hypothesis is that the conversion rates are the same.
- The alternative hypothesis is what we assume if the null hypothesis is rejected. It can be two-tailed (the conversion rates are not the same) or one-tailed, where we state the direction of the inequality.
- A result is determined to be statistically significant if the observed effect is unlikely to have been seen by random chance.
- How unlikely is determined by the significance level. Commonly it is set to 5%. It is the probability of rejecting the null hypothesis when it is true. This probability is denoted by \(\alpha\):
$$\alpha = P(\text{reject null hypothesis} \mid \text{null hypothesis is true})$$
- Rejecting the null hypothesis when the null hypothesis is true is called a type I error.
- A type II error occurs when the null hypothesis is false but we erroneously fail to reject it. The probability of this is denoted by \(\beta\):
$$\beta = P(\text{fail to reject null hypothesis} \mid \text{null hypothesis is false})$$
- The power of a test is \(1-\beta\). Commonly this is set to 90%.
- The p-value is calculated from the test statistic. It is the probability that the observed result would have been seen by chance, assuming the null hypothesis is true.
- To reject the null hypothesis we need a p-value that is lower than the selected significance level.
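To make \(\alpha\) concrete, here is an illustrative simulation (not from the original post): when the null hypothesis is true, a test at the 5% significance level should reject only about 5% of the time.

```r
# Simulate 10,000 experiments where the null hypothesis is true (a fair
# coin, p = 0.5). The fraction of p-values below 0.05 estimates the type I
# error rate, which should be at most roughly alpha = 5% (the continuity
# correction makes the test slightly conservative).
set.seed(42)
p_values <- replicate(10000, prop.test(rbinom(1, 100, 0.5), 100)$p.value)
mean(p_values < 0.05)
```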
Types of Data
Usually we have two types of data we want to perform significance tests with:
- Proportions, e.g. Conversion Rates, Percentages
- Real-valued numbers, e.g. Average Order Values, Average Time on Page
In this post, we will look at both.
Tests for Proportions
A common scenario: we have run an A/B test of two landing pages and we wish to test whether the conversion rate is significantly different between the two.
An important assumption here is that the two groups are mutually exclusive, i.e. a visitor can only have been shown one of the landing pages.
The null hypothesis is that the proportions are equal in both groups.
The test can be performed in R using:

prop.test(x, n, p=NULL, alternative=c("two.sided","less","greater"), conf.level=0.95, correct=TRUE) 
Under the hood this performs Pearson’s Chi-Squared Test.
Regarding the parameters:
x – Usually a 2×2 matrix giving the counts of successes and failures in each group (conversions and non-conversions, for instance).
n – Number of trials performed in each group. Can be left NULL if you provide x as a matrix.
p – Only needed if you want to test for equality against a specific proportion (e.g. 0.5).
alternative – Generally only used when testing against a specific p. Changes the underlying test to use a z-test. See Two-Tailed Test of Population Proportion for more details. The z-test may not be a good approximation with small sample sizes.
conf.level – Used only in the calculation of a confidence interval. Not used as part of the actual test.
correct – Usually safe to apply continuity correction. See my previous post for more details on the correction.
The important part of the output of prop.test is the p-value. If this is less than your desired significance level (say 0.05), you reject the null hypothesis.

heads <- rbinom(1, size = 100, prob = 0.5)
prop.test(heads, 100)

	1-sample proportions test with continuity correction

data:  heads out of 100, null probability 0.5
X-squared = 0.81, df = 1, p-value = 0.3681
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4475426 0.6485719
sample estimates:
   p 
0.55 
In this example, the p-value is not less than our desired significance level (0.05), so we cannot reject the null hypothesis.
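The call above tests a single proportion against 0.5. For the two-sample A/B scenario described earlier, prop.test also accepts vectors of success counts and trial counts; a sketch with hypothetical conversion counts:

```r
# Hypothetical A/B results: 200/1000 conversions on page A,
# 150/1000 conversions on page B.
conversions <- c(200, 150)
visitors <- c(1000, 1000)
res <- prop.test(conversions, visitors)
res$p.value  # well below 0.05, so we reject the null of equal conversion rates
```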
Before running your test, you should fix the sample size of each group in advance. The R library pwr has various functions to help with this. The function pwr.2p.test can be used here:

pwr.2p.test(h = NULL, n = NULL, sig.level = 0.05, power = NULL, alternative = c("two.sided","less","greater")) 
Any one of the parameters can be left as NULL and the function will estimate its value. For instance, leaving n, the sample size, as NULL means the function will compute the required sample size per group.
The only new parameter here is h. This is the minimum effect size you wish to be able to detect. h is calculated as the difference of the arcsine transformations of the two proportions:
$$h = 2\arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})$$
Assuming you have an idea of the base rate proportion (e.g. your current conversion rate) and the minimum change you want to detect, you can use the following R code to calculate h:

base <- 0.2
change <- 0.1
new <- base + change
h <- abs(2 * asin(sqrt(base)) - 2 * asin(sqrt(new)))
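Plugging this h into pwr.2p.test(h = h, sig.level = 0.05, power = 0.9) with n left as NULL returns the required sample size per group. As a cross-check, the same number falls out of the normal-approximation formula behind the test, sketched here in base R (the 0.2 baseline and 0.1 change are the hypothetical values from above):

```r
base <- 0.2
change <- 0.1
h <- abs(2 * asin(sqrt(base)) - 2 * asin(sqrt(base + change)))

# Normal-approximation sample size per group for a two-sided test at
# alpha = 0.05 with 90% power:
#   n = 2 * ((z_{1-alpha/2} + z_{power}) / h)^2
# which is what pwr::pwr.2p.test solves for when n is left NULL.
n <- 2 * ((qnorm(0.975) + qnorm(0.9)) / h)^2
ceiling(n)  # required sample size in each group
```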
Tests for Real-Valued Data
Now imagine we want to test whether a change to an eCommerce webpage increases the average order value.
The major assumption we are going to make is that the data we are analysing is normally distributed. See my previous post on how to check if the data is normally distributed.
It may be possible to transform the data to a normal distribution, for instance if the data is log-normal. Time on page often fits a log-normal distribution; in this case you can simply take the log of the times on page.
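For instance, a sketch with simulated log-normal times on page (all numbers hypothetical):

```r
# Hypothetical example: times on page (seconds) for two page variants,
# drawn from log-normal distributions. We test the log-transformed values,
# which are normally distributed, rather than the skewed raw values.
set.seed(1)
time_a <- rlnorm(50, meanlog = 3.0, sdlog = 0.5)
time_b <- rlnorm(50, meanlog = 3.2, sdlog = 0.5)
res <- t.test(log(time_a), log(time_b))
res$p.value
```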
The test we will run is the two-sample t-test. We are testing whether the means of the two groups are significantly different.

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...) 
The parameters:
x – The samples in one group
y – The samples in the other group. Leave NULL for a one-sample test.
alternative – Perform a two-tailed or one-tailed test.
mu – The true value of the mean (one-sample test) or of the difference in means (two-sample test) under the null hypothesis. Defaults to 0.
paired – TRUE for the paired t-test, used for data where each sample in one group has a natural partner in the other, for instance comparing the weights of people before and after a diet.
var.equal – Whether to assume equal variance in the two groups.
conf.level – Used in calculating the confidence interval around the means. Not part of the actual test.
The test can be run using:

t.test(x, y)

	Welch Two Sample t-test

data:  x and y
t = 1.4174, df = 35.85, p-value = 0.165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3929974  2.2162110
sample estimates:
mean of x mean of y 
1.3699132 0.4583064 
In this example, the p-value is greater than the 0.05 significance level, so we cannot reject the null hypothesis.
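For the paired case mentioned above, a small sketch with hypothetical before/after measurements:

```r
# Hypothetical paired data: the same five people weighed (kg) before and
# after a diet, so each "before" value has a natural partner in "after".
before <- c(82.1, 90.4, 77.3, 85.0, 95.2)
after  <- c(80.5, 88.9, 77.8, 83.1, 93.0)
res <- t.test(before, after, paired = TRUE)
res
```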
As before, we should fix the required sample size before running the experiment. From the pwr library we can use:

pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired")) 
d is the effect size. For t-tests the effect size is assessed as:
$$d = \frac{\mu_1 - \mu_2}{\sigma}$$
where \(\mu_1\) is the mean of group 1, \(\mu_2\) is the mean of group 2 and \(\sigma\) is the pooled standard deviation.
Ideally you should set d based on your problem domain, quantifying the effect size you expect to see. If this is not possible, Cohen’s book on power analysis, “Statistical Power Analysis for the Behavioral Sciences”, suggests setting d to 0.2, 0.5, or 0.8 for small, medium, and large effect sizes respectively.
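As a hypothetical planning example (not from the post): with sd = 1, the delta parameter of base R's power.t.test equals the standardised effect size d, so it answers the same question as pwr.t.test for a medium effect.

```r
# How many samples per group are needed to detect a "medium" effect
# (d = 0.5) at the 5% significance level with 90% power?
# With sd = 1, delta is the standardised effect size d, so base R's
# power.t.test matches pwr::pwr.t.test here.
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.9,
                    type = "two.sample")
ceiling(res$n)  # required sample size in each group
```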