Random Permutation Tests

At the 2014 Strata + Hadoop conference John Rauser gave a great keynote title "Statistics Without the Agonizing Pain".  It is probably worth watching before reading the rest of this article, in it he introduces the concept of Random Permutation Tests.

"Classic" statistical tests usually make some sort of assumption about the distribution of the data e.g. normally distribution data . Are these assumptions always true? Probably not, but they are often approximately close enough to give you a useful result. By making these assumptions, these tests are called parametric.

Random Permutation Tests make no assumptions on the underlying distribution of the data. They are considered non-parametric tests. This can be extremely useful when:

• Your data just doesn't seem to fit the distribution the classic statistical test assumes. For instance, perhaps it is bi-modal and the test assumes normality.
• You have outliers e.g. users who spend significantly more than others.
• You have a small sample size.

Random Permutation Tests can be used in almost any setting where you would compute a p-value. In this article I will focus on there use in experimental studies, you want to see if there is a difference between two treatment groups (A/B Tests, medical studies, etc.)

Overview

The essential idea behind random permutation tests is:

1. Compute a test statistic between two (or more) groups. This could be the difference between two proportions, the difference between the means of the two groups etc.
2. Now randomly shuffle the data assigned to each group.
3. Measure the test statistic again on the shuffled data.
4. Repeat 2 and 3 many times
5. Look at where the test statistic from 1 falls in the distribution of test statistics from 2-4.

We have used steps 2-4 to empirically estimate the sampling distribution of the test statistic. From this distribution you can compute the p-value for your observed test statistic.

Example

Let's imagine we want to add a new widget on our checkout page of our e-commerce site to upsell products to a user.

The question we want to answer is, does adding the widget increase our revenue?

We run an A/B test with:

• Original checkout page
• Checkout page with widget

We know how much each user spent and what variant they have been given.

Lets generate some example transaction data in R:

We have randomly sampled the data from a log-normal distribution with equal mean and variance. We set the seed to ensure the results are repeatable. So in this case, we are looking to find there is no significant difference between the two datasets.

The classic statistical approach here would be to use a t-test. Let's instead apply our random permutation test.

First, let's compute the difference between the means of the two groups:

This gives a difference of 193.47. How likely is this to have happened by chance?

What we want to do is randomly shuffle our data between the two groups. If we were to sample (without replacement) once and compute the difference using the randomly shuffled version of groups:

This gave me a difference of 45.69. Now we will repeat this many times:

If we plot the examples, along with where our observed difference falls:

Straight away, it is fairly clear this observation could  just be due to random chance. It is not on the very extreme of the distribution. However let's compute a p-value

On the two-tailed test, we get a p-value of 0.102, so we would accept the null hypothesis of there being no difference. Notice the add one here on both the denominator and numerator. Essentially we are adding our original measured test statistic to the random permutations. This ensures we never get a zero probability.

The standard t-test would give a p-value of 0.0985, so roughly similar. However, what if we didn't care about the mean, but the median transaction value? Using random permutation tests, this is very simple to compute (a simple change to our code). Under classic statistical tests, we would have to go off and find the exact test we need to use under those conditions.

Speed Improvements

On simple speed improvement is to parallelise the loop to compute the re-samples:

Coin Package

As usual, R already has a package to help us do all of this. Using the same data as before we would run:

Generally you will find this is much faster for running large numbers of iterations.

One downside is you don't get the visualisation of how extreme the observed data is compared to the empirical sampled histogram (Figure 2). I find this graph extremely useful when explaining how extreme a result appears to be, extremely to a non-statistical audience

Summary

Random permutation tests are a nice alternative to classic hypothesis tests. In many cases they will give you almost exactly the same results. Being able to visualise the distribution (Figure 2) can be a massive assistance in explaining the p-value.

• Almost no assumptions on the underlying dataset being analysed
• Can be used for any test statistic (either it is implemented in coin or can be programmed yourself).
• Can be applied to all sorts of data types (numerical, ordinal, categorical) without having to remember the exact parametric test you should use.

• Computing large number of re-samples is potentially slow. Although on modern computers this is less of a concern
• Relies on the null hypothesis, that there is no association between the dataset and so the group labels are interchangeable under the null hypothesis.

Personally, I like to use both classic statistical tests and random permutation tests, even if all they do is validate one another.

Multiple Statistical Tests

A/B testing is common place in marketing and website optimisation. But why just stop at running two variations? Why not three or eight? I have certainly seen A/B tests suddenly grow to A/B/C/D... and so on. In statical literature, this is called multiple testing.

The short answer is there is nothing stopping you from doing this, however you do need to adjust your techniques for interpreting the test results or risk interpreting noise results as signal.

In fact, many A/B testing tools now offer you "multivariate testing" (Optimizely, maxymiser, etc.), which is just a fancy term for running factorial experiments and you will be running multiple tests at the same time.

• The common pitfalls when running multiple statistical tests
• Techniques to adjust for multiple comparisons
• How this affects sample size when planning experiments

The dangers of multiple comparisons

One of my favourite examples of the danger of multiple comparisons is a study by Benett et al. 1

The very brief summary of the paper is:

• A very much dead Atlantic Salmon was placed in an fMRI machine.
• The fMRI measures the blood oxygen levels in units called voxels (think image pixels). Each voxel is measuring the change in blood oxygen levels. Typically fMRI machines will measure 130,000 voxels.
• The (dead) salmon was then shown a series of emotionally charged images and the difference between blood oxygen levels was measured.
• Funnily enough, with that many comparisons, you do find a statistically significant result. Can we conclude the dead's salmon brain was responding to the images? Well no, obviously, it's dead. If you do enough comparisons without any correction, you will end up with something significant.

Let's see this in action. We will randomly sample from a binomial distribution, with p=0.5, 100 times. Each time we will test if the observed proportion is significantly different from 0.5:

Run it a few times, most times you should see one one of the p-values come out as below the alpha value. For example:

We need to control for the fact we are doing a large number of tests. Next we will look at ways to deal with these issues.

Techniques

Family Wise Error Rate

The Family Wise Error Rate (FWER) is the probability of making at least one type I error. If V is the number of type I errors:

$FWER=P(V \gt0)$

We will looking at techniques that control the FWER to ensure $FWER \le \alpha$. We will only look at methods that control this in the strong sense, that it is valid for all configurations of true and false hypotheses.

Bonferroni

The Bonferroni correction is the simplest method for controlling the FWER.  Simply you now reject the null hypothesis if:

$p_i \le \frac{\alpha}{m}$

where $p_i$ is the p-value of the i-th test and $m$ is the number of tests.

This is a very conservative correction and if you have large number of tests, you need very small p-values to find anything significant.

To adjust p-values based on Bonferroni correction:

$\hat{p}_i =m p_i$

To show this in R, let's go back to our previous example:

We should now, hopefully, see no significant results.

The p.adjust method is what we use to obtain the adjusted p-values. It is also possible to use pairwise.prop.test to do this all in one go, but I personally prefer to keep them separate (e.g. compute p-values, then adjust them).

Holm

In Holm's method you order the p-values in increasing order. It is called a step down procedure. You compute a critical value for each p-value:

$a_i = \frac{\alpha}{m - i - 1}$

Starting with the smallest p-value, keeping rejecting hypotheses until the first where $p_i > \alpha_i$, at which point accept $H_i$ and all remaining hypothesises.

$\hat{p}_i= \left\{\begin{array}{l l}mp_i & \quad \text{if i=1}\\ \max(\hat{p}_{i-1},(m -i + 1)p_i) & \quad \text{if i=2,\dots,m}\end{array} \right.$

In R we can simply use:

Holm's method is equally as powerful in terms of assumptions as Bonferroni, and it less conservative. Therefore it is always suggested to use Holm's method over Bonferroni.

Hochberg

Holm's method was known as a step down procedure, as you start with the smallest p-value and work upwards.

Hochbeg's method is a step up method. You start with the largest p-value , until you find the first $p_i \le \alpha_i$. It uses the same critical values as Holm's method.

The adjusted p-values are computed using:

$\hat{p}_i= \left\{\begin{array}{l l}p_m & \quad \text{if i=m}\\ \min(\hat{p}_{i+1},(m -i + 1)p_i) & \quad \text{if i=m-1,\dots,1}\end{array} \right.$

In R we can use:

An important consideration Hochberg's method is that each p-value must be independent or positively dependent.  If you are running A/B tests and ensured users have only ever seen one variant, then this assumption is valid.

Hochberg's method is considered more powerful than Holm's as it will reject more hypotheses.

Hommel

Hommel's procedure is more powerful than Hochberg's but slightly harder to understand. It has exactly the assumptions at Hochberg's (independent or positively dependent p-values) and being more powerful it should be preferred over Hochberg's.

Hommel's procure rejects all p-values that are $\le \frac{\alpha}{j}$

The value for $j$ is found by:

$j = \max_{i=1,\dots,m}\{ p_{m-i+k}\gt\frac{k\alpha}{i}\textrm{for }k=1,\dots,i\}$

Think of this as us looking at all the sets of hypotheses. We are trying to find the largest $j$ where the condition $p_{n-i+k}\gt\frac{k\alpha}{i}$ is true for all the values of $k$.

Let's look at this in practice 2. Suppose we have 3 hypotheses with p-values $p_1=0.024$, $p_2=0.030$ and $p_3=0.073$. To find $j$ we compute:

For $i=1$, $p_3 = 0.073 > \alpha = 0.05$

For $i=2$, $p_3 = 0.073 > \alpha = 0.05, p_2 = 0.030 >\frac{ \alpha}{2}=0.025$

For $i=3$, $p_3 = 0.073 > \alpha = 0.05, p_2 = 0.030 <\frac{ 2\alpha}{3}=0.033,$

$p_1=0.024 > \frac{\alpha}{3} = 0.0167$

We find two sets of hypotheses where the statement is true for all $k$. The values of $i$ for these two sets were $\{1,2\}$, so $j=\max\{1,2\}=2$. We use $j$ and reject any p-values less than $\frac{\alpha}{2}$. Which in this case is $p_1=0.024$.

In the following R code, I have provided code for calculating $j$ and the adjusted p-values. The adjusted p-value calculation followed the algorithm in the Appendix of Wright's paper:

As with the other methods, R provides a pre-built function for calculating the adjusted p-values:

False Discovery Rate

Up till now we have focused on methods to control the Family Wise Error Rate (FWER). Let V be the number of false positives in hypotheses we reject and P be the number of hypotheses rejected. In FWER methods we are ensuring:

$P(V \ge 0) \le\alpha$

We have applied various method (Holm, Hochberg, etc.) in order to ensure this. If we want to control False Discovery Rate (FDR), we will ensure:

$FDR =\mathbb{E}(\frac{V}{P}) \le \alpha$

That is the expected false positive rate will be below $\alpha$. Many FWER methods can be seen as too conservative and often suffer from low power. FDR methods offer a way to increase power but maintain a principled bound on the error.

The FDR methods were developed on the observation

4 false discoveries out of 10 rejected null hypotheses

is much more serious than

20 false discoveries out of 100 rejected null hypotheses

That is when we are running a large number of tests, we are possibly willing to accept a percentage of false discoveries if we still find something interesting.

Christopher Genovese has an extremely good tutorial on FDR and worth looking over for much more detail.

R provides two methods for calculating adjusted p-values (sometimes called q-values in the FDR context) based on controlling the FDR:

BH is the original Benjamini-Hochberg procedure where the concept of FDR was introduced. It has the same assumptions as the Hochberg procedure (e.g. independent or positively dependent p-values). BY is Benjamini–Yekutieli procedure, it has no assumptions on the dependency of the p-values (but as such it is more conservative than BH).

FDR methods have become popular in genomic studies, neuroscience, biochemistry, oncology and plant sciences.  Particularly where there is a large number (thousands often) of tests to be performed, but you don't want to miss interesting interactions.

Choosing a method

We have covered a variety of techniques to handle multiple tests, but what one should you use? This is my very crude suggestions for choosing which method to use:

1. I have less than or equal to 50 tests and each test is independent or positively dependent - Use Hommel.
2. I have less than or equal to 50 tests but I do not know if the tests are independent or positively dependent - Use Holm.
3. I have more than 50 tests and each test is independent or positively dependent - Use FDR Benjamini-Hochberg.
4. I have more than 50 tests but I do not know if the tests are independent or positively dependent - Use FDR Benjamini–Yekutieli.

The 50 tests is an arbitrary choice but in general if you are using a lot of tests you might want to consider FDR methods.

Sample Size

Intuitively, the more tests you run, the larger sample size you will need. While there a certainly more complicated methods for determining sample size, I will describe one simple method:

• Apply the Bonferroni correction $\alpha'=\frac{\alpha}{m}$
• Plug this $\alpha'$ into the sample size calculations discussed previously.

This is likely to be an over-estimate of the sample size required, but will give you a good indication of how many more samples will be needed.

Alternatives

This is by no means an exhaustive description of how to deal with multiple tests. In this article I have focused on ways to adjust existing p-values. Other methods it may be worth exploring:

• ANOVA and Tukey HSD tests (Post-hoc analysis in general). One issue here is that it assumes normality of the data. This may mean you need to perform transformations like Arcsine Square Root 3 on proportion data.
• Bootstrap methods - Simulating how unlikely a sample is to occur if they were not independent.
• Bayesian methods

Potentially in later articles I will try to explore some of these methods further.

1. See  http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
2. Example courtesy of http://www2.math.uu.se/research/pub/Ekenstierna.pdf
3. https://www.biostars.org/p/6189/

R Performance (Part II)

In R performance (Part I), we looked into how to write R code in an efficient way.

In the second part, we will look into more explicit ways of improving R performance.

Parallel Loops

Doing some sort of loop is almost unavoidable in R. A simple optimisation is to run the loop in parallel.

A obvious machine learning application is when running cross-validation. If you want to run 10-fold cross-validation, if you have 10 cores available, you can run them all in parallel.

The parallel library has a multicore version of lapply, let's look at the example below:

If you already make use of lapply etc. in your code, modifying to use the multi-core version requires very little code changes.

Sadly this code will not work on Windows as it relies on Unix's fork command. Instead use the following on a Windows machine:

Update: Nathan VanHoudnos has re-implemented the mclapply function to support Windows, see his comment for more details on how to use this.

Update: One issue I have observed when using mclapply on Linux, if the machine runs out of memory, R will arbitrarily and silently kill processes on some of the cores. This will mean you will not get as many results as you expect. A good error check is to ensure your results has the correct size e.g. same size as your inputs

If instead you prefer for loop syntax,  you can use the foreach package:

The important command here is %dopar%, this says to perform the loop in parallel. If you were to use  %do% it would run on a single process.

Memory Usage

In a Unix system (I have no idea how this will work on Windows), when you fork a process (what mclapply does), you get a copy of the current processes memory (think R environment).

However this is called "copy-on-write", which means unless you modifying the data it will never physically be copied.

This is important, as much as possible you should try to avoid modifying the data inside your mclapply function. However, often this is unavoidable e.g. in your cross validation loop you will need to split data into test and train. In this case, you just need to be aware you will be increasing your memory usage. You may have to trade off how many cores you can use with how much memory you have available.

Optimised Linear Algebra Library

The default BLAS library used by R is not particular well tuned and has no support for multiple cores. Switching to an alternative BLAS implementation can give a significant speed boost.

Nathan VanHoudnos has an excellent guide on installing alternative BLAS libraries. Here I will summarise the different options:

OpenBLAS

OpenBLAS is generally the easiest to install (on Ubuntu you can use apt-get) and has out of the box support for multicore matrix operations.

On major issue (described here) is that currently OpenBLAS multicore matrix operations do not play well with R's other multicore functionality (parallel, foreach). Trying to use them together will result in segfaults. However, as long as you are aware of this you can design your code around this e.g. only using parallel loops when nothing inside of the loop utilises matrix algebra.

ATLAS

ATLAS has the potential to be the most tuned to your particular machine setup. However using out-of-the-box installations (e.g. via apt-get) will generally only support a single core.

In order to get the most out of ATLAS (multicore, optimised to your machine) you will need to compile it from source and this can be a painful experience and probably only worthwhile if you are familiar with compiling from source on Unix machines.

MKL

Intel provide their Math Kernel Library (MKL) optimised version of BLAS. It works on Intel and Intel compatible processors. Revolution R comes with MKL pre-packaged with it.

R-Bloggers has a guide to installing MKL. Note, MKL is free for non-commercial usage, but commercial usage will require a license.

vecLib

If you are using Mac OS X, Apple kindly provide you with a optimised BLAS library. Details of how to link with R can be found here. It is very easy to link and provides excellent performance.

However, this only works with Mac OS X, so not really relevant if you are planning to work in a server environment,

Which to use?

Obviously what to use is completely up to you, all have some disadvantages. Personally, I use vecLib on my Mac and we use OpenBLAS on our servers. This means we have to write our R code to not to use parallel loops and mutlicore matrix operations at the same time. However this is not a massive overhead (if you are trying to do both at the same time you will generally end up thrashing your CPUs anyway). The advantage is, spinning up new R servers does not involve any compilation from source.

Pay

At this point, linking to optimised BLAS version may look quite painful. Instead an alternative option is to pay for Revolution R. They pre-build their version of R with an optimised multicore BLAS version. They have various benchmarks on the performance improvements.

Revolution R also has various libraries for handling large datasets, parallelised statistical modelling algorithms, etc. Although it all comes with a fairly hefty price tag.

Update: Revolution R have just announced Revolution R Open a free version of Revolution R. In particular it comes linked against MKL and has the Reproducible R Toolkit to manage package upgrades. Currently this looks like the best option for using R in a production environment.

Byte Code Compiler

The compiler package allows you to compile R functions to a lower-level byte code. This can provide performance improvements of between 2-5X.

Let's look at a very simple example below:

The cmpfun is used to compile the function to byte code. You call your function in the exact same way you would before. To compare performance:

In this case, we see around a 2.9X speedup. You will see the best speed-up on functions that involve mostly numerical calculations. If your functions mainly call pre-built R functions or manipulate data types, you probably won't see any drastic speed-up.

You can also enable Just-In-Time (JIT) compilation, removing the need to call cmpfun directly:

The value passed to enableJIT controls the level of compilation, it should be between 0 and 3, 0 being no compilation; 3 being max compilation. This may initially slow down R as all the functions need to be compiled, but may later speed it up. You can also enable it via the R_ENABLE_JIT environment variable.

For more information, R-statistics has a great tutorial on compiler library and JIT.

Summary

R is constantly evolving, so along with these tips you should always try to keep your R version up to date to get the latest performance improvements.

Radford Neal has done a bunch of optimisations, some of which were adopted into R Core, and many others which were forked off into pqR. At the time of writing, I don't think pqR is ready for production work, but definitely worth watching.

With well optimised code, the right libraries, R is capable of handling pretty large data problems. At some point, your data may be too large for R to handle. At this point I look to Hadoop and Spark to scale even further. My rough guide, if your data is greater than 50GB (after pre-processing) R is probably not the right choice.

R Performance (Part I)

R as a programming language is often considered slow. However, more often than not it is how the R code is written that makes it slow. I've see people wait hours for an R script to finish, while with a few modifications it will take minutes.

In this post I will explore various ways of speeding up your R code by writing better code. In part II, I will focus on tools ands libraries you can use to optimise R.

Vectorisation

The single most important advice when writing R code is to vectorise it as much as possible. If you have ever used MATLAB, you will be aware of the difference vectorised vs. un-vectorised code in terms of speed.

Let us look at an example:

Here we have used a loop to increment the contents of a. Now using a vectorised approach:

Notice the massive performance increase in elapsed time.

Another consideration is to look at using  inherently vectorised commands like ifelse and diff. Let's look at the example below:

Again we see elapsed time has been massively reduced, a 93X reduction.

When you have a for loop in your code, think about how you can rewrite it in a vectorised way.

Looping

Sometimes it is impossible to avoid a loop, for example:

• When the result depends on the previous iteration of the loop

If this is the case some things to consider:

• Ensure you are doing the absolute minimum inside the loop. Take any non-loop dependent calculations outside of the loop.
• Make the number of iterations as small as possible. For instance if your choice is to iterate over the levels of a factor or iterate over all the elements, usually iterating over the levels will be much faster

If you have to loop, do as little as possible in it

Growing Objects

A common pitfall is growing an object inside of a loop.  Below I give an example of this:

Here we are constantly growing the vector inside of the loop. As the vector grows, we need more space to hold it, so we end up copy data to a new location. This constant allocation and copying causes the code to be very slow and memory fragmentation.

In the next example, we have pre-allocated the space we needed. This time the code is 266X faster.

We can of course do this allocation directly without the loop, making the code even faster:

If you don't know how much space you will need, it may be useful to allocate an upper-bound of space, then remove anything unused once your loop is complete.

A more common scenario is to see something along the lines of: