# Multiple Statistical Tests

A/B testing is common place in marketing and website optimisation. But why just stop at running two variations? Why not three or eight? I have certainly seen A/B tests suddenly grow to A/B/C/D… and so on. In statical literature, this is called multiple testing.

The short answer is there is nothing stopping you from doing this, however you do need to adjust your techniques for interpreting the test results or risk interpreting noise results as signal.

In fact, many A/B testing tools now offer you “multivariate testing” (Optimizely, maxymiser, etc.), which is just a fancy term for running factorial experiments and you will be running multiple tests at the same time.

In this article I will discuss:

• The common pitfalls when running multiple statistical tests
• Techniques to adjust for multiple comparisons
• How this affects sample size when planning experiments

# The dangers of multiple comparisons

One of my favourite examples of the danger of multiple comparisons is a study by Benett et al. 1 The very brief summary of the paper is:

• A very much dead Atlantic Salmon was placed in an fMRI machine.
• The fMRI measures the blood oxygen levels in units called voxels (think image pixels). Each voxel is measuring the change in blood oxygen levels. Typically fMRI machines will measure 130,000 voxels.
• The (dead) salmon was then shown a series of emotionally charged images and the difference between blood oxygen levels was measured.
• Funnily enough, with that many comparisons, you do find a statistically significant result. Can we conclude the dead’s salmon brain was responding to the images? Well no, obviously, it’s dead. If you do enough comparisons without any correction, you will end up with something significant.

Let’s see this in action. We will randomly sample from a binomial distribution, with p=0.5, 100 times. Each time we will test if the observed proportion is significantly different from 0.5:

Run it a few times, most times you should see one one of the p-values come out as below the alpha value. For example:

We need to control for the fact we are doing a large number of tests. Next we will look at ways to deal with these issues.

# Techniques

## Family Wise Error Rate

The Family Wise Error Rate (FWER) is the probability of making at least one type I error. If V is the number of type I errors:

$$FWER=P(V \gt0)$$

We will looking at techniques that control the FWER to ensure $$FWER \le \alpha$$. We will only look at methods that control this in the strong sense, that it is valid for all configurations of true and false hypotheses.

### Bonferroni

The Bonferroni correction is the simplest method for controlling the FWER.  Simply you now reject the null hypothesis if:

$$p_i \le \frac{\alpha}{m}$$

where $$p_i$$ is the p-value of the i-th test and$$m$$ is the number of tests.

This is a very conservative correction and if you have large number of tests, you need very small p-values to find anything significant.

To adjust p-values based on Bonferroni correction:

$$\hat{p}_i =m p_i$$

To show this in R, let’s go back to our previous example:

We should now, hopefully, see no significant results.

The p.adjust method is what we use to obtain the adjusted p-values. It is also possible to use pairwise.prop.test to do this all in one go, but I personally prefer to keep them separate (e.g. compute p-values, then adjust them).

### Holm

In Holm’s method you order the p-values in increasing order. It is called a step down procedure. You compute a critical value for each p-value:

$$a_i = \frac{\alpha}{m – i – 1}$$

Starting with the smallest p-value, keeping rejecting hypotheses until the first where $$p_i > \alpha_i$$, at which point accept $$H_i$$ and all remaining hypothesises.

The adjusted p-values are:

$$\hat{p}_i= \left\{\begin{array}{l l}mp_i & \quad \text{if i=1}\\ \max(\hat{p}_{i-1},(m -i + 1)p_i) & \quad \text{if i=2,\dots,m}\end{array} \right.$$

In R we can simply use:

Holm’s method is equally as powerful in terms of assumptions as Bonferroni, and it less conservative. Therefore it is always suggested to use Holm’s method over Bonferroni.

### Hochberg

Holm’s method was known as a step down procedure, as you start with the smallest p-value and work upwards.

Hochbeg’s method is a step up method. You start with the largest p-value , until you find the first $$p_i \le \alpha_i$$. It uses the same critical values as Holm’s method.

The adjusted p-values are computed using:

$$\hat{p}_i= \left\{\begin{array}{l l}p_m & \quad \text{if i=m}\\ \min(\hat{p}_{i+1},(m -i + 1)p_i) & \quad \text{if i=m-1,\dots,1}\end{array} \right.$$

In R we can use:

An important consideration Hochberg’s method is that each p-value must be independent or positively dependent.  If you are running A/B tests and ensured users have only ever seen one variant, then this assumption is valid.

Hochberg’s method is considered more powerful than Holm’s as it will reject more hypotheses.

### Hommel

Hommel’s procedure is more powerful than Hochberg’s but slightly harder to understand. It has exactly the assumptions at Hochberg’s (independent or positively dependent p-values) and being more powerful it should be preferred over Hochberg’s.

Hommel’s procure rejects all p-values that are $$\le \frac{\alpha}{j}$$

The value for $$j$$ is found by:

$$j = \max_{i=1,\dots,m}\{ p_{m-i+k}\gt\frac{k\alpha}{i}\textrm{for }k=1,\dots,i\}$$

Think of this as us looking at all the sets of hypotheses. We are trying to find the largest $$j$$ where the condition $$p_{n-i+k}\gt\frac{k\alpha}{i}$$ is true for all the values of $$k$$.

Let’s look at this in practice 2. Suppose we have 3 hypotheses with p-values $$p_1=0.024$$,$$p_2=0.030$$ and$$p_3=0.073$$. To find $$j$$ we compute:

For $$i=1$$, $$p_3 = 0.073 > \alpha = 0.05$$

For $$i=2$$, $$p_3 = 0.073 > \alpha = 0.05, p_2 = 0.030 >\frac{ \alpha}{2}=0.025$$

For $$i=3$$, $$p_3 = 0.073 > \alpha = 0.05, p_2 = 0.030 <\frac{ 2\alpha}{3}=0.033$$,

$$p_1=0.024 > \frac{\alpha}{3} = 0.0167$$

We find two sets of hypotheses where the statement is true for all $$k$$. The values of $$i$$ for these two sets were $$\{1,2\}$$, so $$j=\max\{1,2\}=2$$. We use $$j$$ and reject any p-values less than $$\frac{\alpha}{2}$$. Which in this case is $$p_1=0.024$$.

In the following R code, I have provided code for calculating $$j$$ and the adjusted p-values. The adjusted p-value calculation followed the algorithm in the Appendix of Wright’s paper:

As with the other methods, R provides a pre-built function for calculating the adjusted p-values:

## False Discovery Rate

Up till now we have focused on methods to control the Family Wise Error Rate (FWER). Let V be the number of false positives in hypotheses we reject and P be the number of hypotheses rejected. In FWER methods we are ensuring:

$$P(V \ge 0) \le\alpha$$

We have applied various method (Holm, Hochberg, etc.) in order to ensure this. If we want to control False Discovery Rate (FDR), we will ensure:

$$FDR =\mathbb{E}(\frac{V}{P}) \le \alpha$$

That is the expected false positive rate will be below $$\alpha$$. Many FWER methods can be seen as too conservative and often suffer from low power. FDR methods offer a way to increase power but maintain a principled bound on the error.

The FDR methods were developed on the observation

4 false discoveries out of 10 rejected null hypotheses

is much more serious than

20 false discoveries out of 100 rejected null hypotheses

That is when we are running a large number of tests, we are possibly willing to accept a percentage of false discoveries if we still find something interesting.

Christopher Genovese has an extremely good tutorial on FDR and worth looking over for much more detail.

R provides two methods for calculating adjusted p-values (sometimes called q-values in the FDR context) based on controlling the FDR:

BH is the original Benjamini-Hochberg procedure where the concept of FDR was introduced. It has the same assumptions as the Hochberg procedure (e.g. independent or positively dependent p-values). BY is Benjamini–Yekutieli procedure, it has no assumptions on the dependency of the p-values (but as such it is more conservative than BH).

FDR methods have become popular in genomic studies, neuroscience, biochemistry, oncology and plant sciences.  Particularly where there is a large number (thousands often) of tests to be performed, but you don’t want to miss interesting interactions.

## Choosing a method

We have covered a variety of techniques to handle multiple tests, but what one should you use? This is my very crude suggestions for choosing which method to use:

1. I have less than or equal to 50 tests and each test is independent or positively dependent – Use Hommel.
2. I have less than or equal to 50 tests but I do not know if the tests are independent or positively dependent – Use Holm.
3. I have more than 50 tests and each test is independent or positively dependent – Use FDR Benjamini-Hochberg.
4. I have more than 50 tests but I do not know if the tests are independent or positively dependent – Use FDR Benjamini–Yekutieli.

The 50 tests is an arbitrary choice but in general if you are using a lot of tests you might want to consider FDR methods.

## Sample Size

Intuitively, the more tests you run, the larger sample size you will need. While there a certainly more complicated methods for determining sample size, I will describe one simple method:

• Apply the Bonferroni correction $$\alpha’=\frac{\alpha}{m}$$
• Plug this $$\alpha’$$ into the sample size calculations discussed previously.

This is likely to be an over-estimate of the sample size required, but will give you a good indication of how many more samples will be needed.

### Alternatives

This is by no means an exhaustive description of how to deal with multiple tests. In this article I have focused on ways to adjust existing p-values. Other methods it may be worth exploring:

• ANOVA and Tukey HSD tests (Post-hoc analysis in general). One issue here is that it assumes normality of the data. This may mean you need to perform transformations like Arcsine Square Root 3 on proportion data.
• Bootstrap methods – Simulating how unlikely a sample is to occur if they were not independent.
• Bayesian methods

Potentially in later articles I will try to explore some of these methods further.

1. See  http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
2. Example courtesy of http://www2.math.uu.se/research/pub/Ekenstierna.pdf
3. https://www.biostars.org/p/6189/

# R Performance (Part II)

In R performance (Part I), we looked into how to write R code in an efficient way.

In the second part, we will look into more explicit ways of improving R performance.

# Parallel Loops

Doing some sort of loop is almost unavoidable in R. A simple optimisation is to run the loop in parallel.

A obvious machine learning application is when running cross-validation. If you want to run 10-fold cross-validation, if you have 10 cores available, you can run them all in parallel.

The parallel library has a multicore version of lapply, let’s look at the example below:

If you already make use of lapply etc. in your code, modifying to use the multi-core version requires very little code changes.

Sadly this code will not work on Windows as it relies on Unix’s fork command. Instead use the following on a Windows machine:

Update: Nathan VanHoudnos has re-implemented the mclapply function to support Windows, see his comment for more details on how to use this.

Update: One issue I have observed when using mclapply on Linux, if the machine runs out of memory, R will arbitrarily and silently kill processes on some of the cores. This will mean you will not get as many results as you expect. A good error check is to ensure your results has the correct size e.g. same size as your inputs

If instead you prefer for loop syntax,  you can use the foreach package:

The important command here is %dopar%, this says to perform the loop in parallel. If you were to use  %do% it would run on a single process.

#### Memory Usage

In a Unix system (I have no idea how this will work on Windows), when you fork a process (what mclapply does), you get a copy of the current processes memory (think R environment).

However this is called “copy-on-write”, which means unless you modifying the data it will never physically be copied.

This is important, as much as possible you should try to avoid modifying the data inside your mclapply function. However, often this is unavoidable e.g. in your cross validation loop you will need to split data into test and train. In this case, you just need to be aware you will be increasing your memory usage. You may have to trade off how many cores you can use with how much memory you have available.

# Optimised Linear Algebra Library

The default BLAS library used by R is not particular well tuned and has no support for multiple cores. Switching to an alternative BLAS implementation can give a significant speed boost.

Nathan VanHoudnos has an excellent guide on installing alternative BLAS libraries. Here I will summarise the different options:

###### OpenBLAS

OpenBLAS is generally the easiest to install (on Ubuntu you can use apt-get) and has out of the box support for multicore matrix operations.

On major issue (described here) is that currently OpenBLAS multicore matrix operations do not play well with R’s other multicore functionality (parallel, foreach). Trying to use them together will result in segfaults. However, as long as you are aware of this you can design your code around this e.g. only using parallel loops when nothing inside of the loop utilises matrix algebra.

###### ATLAS

ATLAS has the potential to be the most tuned to your particular machine setup. However using out-of-the-box installations (e.g. via apt-get) will generally only support a single core.

In order to get the most out of ATLAS (multicore, optimised to your machine) you will need to compile it from source and this can be a painful experience and probably only worthwhile if you are familiar with compiling from source on Unix machines.

###### MKL

Intel provide their Math Kernel Library (MKL) optimised version of BLAS. It works on Intel and Intel compatible processors. Revolution R comes with MKL pre-packaged with it.

R-Bloggers has a guide to installing MKL. Note, MKL is free for non-commercial usage, but commercial usage will require a license.

###### vecLib

If you are using Mac OS X, Apple kindly provide you with a optimised BLAS library. Details of how to link with R can be found here. It is very easy to link and provides excellent performance.

However, this only works with Mac OS X, so not really relevant if you are planning to work in a server environment,

###### Which to use?

Obviously what to use is completely up to you, all have some disadvantages. Personally, I use vecLib on my Mac and we use OpenBLAS on our servers. This means we have to write our R code to not to use parallel loops and mutlicore matrix operations at the same time. However this is not a massive overhead (if you are trying to do both at the same time you will generally end up thrashing your CPUs anyway). The advantage is, spinning up new R servers does not involve any compilation from source.

# Pay

At this point, linking to optimised BLAS version may look quite painful. Instead an alternative option is to pay for Revolution R. They pre-build their version of R with an optimised multicore BLAS version. They have various benchmarks on the performance improvements.

Revolution R also has various libraries for handling large datasets, parallelised statistical modelling algorithms, etc. Although it all comes with a fairly hefty price tag.

Update: Revolution R have just announced Revolution R Open a free version of Revolution R. In particular it comes linked against MKL and has the Reproducible R Toolkit to manage package upgrades. Currently this looks like the best option for using R in a production environment.

# Byte Code Compiler

The compiler package allows you to compile R functions to a lower-level byte code. This can provide performance improvements of between 2-5X.

Let’s look at a very simple example below:

The cmpfun is used to compile the function to byte code. You call your function in the exact same way you would before. To compare performance:

In this case, we see around a 2.9X speedup. You will see the best speed-up on functions that involve mostly numerical calculations. If your functions mainly call pre-built R functions or manipulate data types, you probably won’t see any drastic speed-up.

You can also enable Just-In-Time (JIT) compilation, removing the need to call cmpfun directly:

The value passed to enableJIT controls the level of compilation, it should be between 0 and 3, 0 being no compilation; 3 being max compilation. This may initially slow down R as all the functions need to be compiled, but may later speed it up. You can also enable it via the R_ENABLE_JIT environment variable.

For more information, R-statistics has a great tutorial on compiler library and JIT.

# Summary

R is constantly evolving, so along with these tips you should always try to keep your R version up to date to get the latest performance improvements.

Radford Neal has done a bunch of optimisations, some of which were adopted into R Core, and many others which were forked off into pqR. At the time of writing, I don’t think pqR is ready for production work, but definitely worth watching.

With well optimised code, the right libraries, R is capable of handling pretty large data problems. At some point, your data may be too large for R to handle. At this point I look to Hadoop and Spark to scale even further. My rough guide, if your data is greater than 50GB (after pre-processing) R is probably not the right choice.

# R Performance (Part I)

R as a programming language is often considered slow. However, more often than not it is how the R code is written that makes it slow. I’ve see people wait hours for an R script to finish, while with a few modifications it will take minutes.

In this post I will explore various ways of speeding up your R code by writing better code. In part II, I will focus on tools ands libraries you can use to optimise R.

# Vectorisation

The single most important advice when writing R code is to vectorise it as much as possible. If you have ever used MATLAB, you will be aware of the difference vectorised vs. un-vectorised code in terms of speed.

Let us look at an example:

Here we have used a loop to increment the contents of a. Now using a vectorised approach:

Notice the massive performance increase in elapsed time.

Another consideration is to look at using  inherently vectorised commands like ifelse and diff. Let’s look at the example below:

Again we see elapsed time has been massively reduced, a 93X reduction.

When you have a for loop in your code, think about how you can rewrite it in a vectorised way.

# Looping

Sometimes it is impossible to avoid a loop, for example:

• When the result depends on the previous iteration of the loop

If this is the case some things to consider:

• Ensure you are doing the absolute minimum inside the loop. Take any non-loop dependent calculations outside of the loop.
• Make the number of iterations as small as possible. For instance if your choice is to iterate over the levels of a factor or iterate over all the elements, usually iterating over the levels will be much faster

If you have to loop, do as little as possible in it

# Growing Objects

A common pitfall is growing an object inside of a loop.  Below I give an example of this:

Here we are constantly growing the vector inside of the loop. As the vector grows, we need more space to hold it, so we end up copy data to a new location. This constant allocation and copying causes the code to be very slow and memory fragmentation.

In the next example, we have pre-allocated the space we needed. This time the code is 266X faster.

We can of course do this allocation directly without the loop, making the code even faster:

If you don’t know how much space you will need, it may be useful to allocate an upper-bound of space, then remove anything unused once your loop is complete.

A more common scenario is to see something along the lines of:

At the bottom of your loop, you are rbinding or cbinding the result you calculated in your loop to an existing data frame.

Instead, build a list of pieces and put them all together in one go:

Avoid growing data structures inside a loop.

# Apply Functions

The R library has a whole host of apply functions:

It is worth becoming familiar with them all. Neil Saunders has created a great introduction to all the apply functions.

In most situations using apply may not be any faster than using a loop (for instance the apply function is just doing a loop under the hood). The main advantage is that it avoids growing objects in the loop as the apply functions handle stitching the data together.

In Part II we will introduced the parallel versions of apply that can increase performance further.

Know your apply functions and use them where it makes sense

# Mutable Functions

One important point to remember about R, is that parameters to functions are passed by value. This, in theory, means that each time you pass something to a function it creates a copy. However, in practice, R will under the hood not create a copy as long as you don’t mutate the value of the variable inside the function.

If you can make your functions immutable (e.g. don’t change the values of the parameters passed in), you will save significant amounts of memory and CPU time by not copying.

Let’s looks at a really simple case:

Here f1 mutates x, while f2 does not. Running with a fairly large vector,  several times and looking at the average:

We see that on average, f2 is slightly quicker as we have avoided the additional temporary copy that is done under the hood in f1.

Try to make functions immutable.

## Summary

To reiterate the main points:

1. When you have a for loop in your code, think about how you can rewrite it in a vectorised way.
2. If you have to loop, do as little as possible in it
3. Avoid growing data structures inside a loop
4. Know your apply functions and use them where it makes sense
5. Try to make functions immutable

Following these tips when writing your R code should greatly improve the efficiency. For some more general tips to help your R code I also recommend:

# Calibrating Classifier Probabilties

You’ve built your classifier, run cross-validation and have a super high AUC. So you are done right? Maybe not.

Most classifiers output a score of how likely an observation is to be in the positive class. Usually these scores are between 0 and 1 and get called probabilities. However these probabilities often do not reflect reality, e.g. a probability of 20% may not mean it has a 20% chance of happening, it could have a 10%, 50%, 70%, etc. chance of happening. Our aim should be that our model outputs accurately reflect posterior probabilities $$P(Y=1|x)$$.

In the post we will mainly focus on binary classifiers. Later in the post will will talk about how to extend these ideas to mutliclass problems.

# Why it happens

Our models can output inaccurate probabilities for a variety of reasons:

• Flawed model assumptions (e.g. independence in a Naive Bayes Model)
• Hidden features not available at training time
• Deficiencies in the learning algorithm.

In terms of learning algorithms, as noted in Niculescu-Mizil et al and through my own research:

• Naive Bayes tends to push probabilities toward the extremes of 0 and 1.
• SVMs and boosted trees tend to push probabilities away from 0 and 1 (toward the centre)
• Neural networks and random forests tend to have well calibrated probabilities.
• Regularisation also tends to push probabilities toward the centre.

# Do we care?

Whether or not we want well calibrated probabilities depends entirely on the problem we are trying to solve.

If we only need to rank observations from most likely to least likely, then calibration is unnecessary.

Examples of problems I have worked on where calibrated probabilities are extremely important:

• Loan default prediction – Banks will generally be setting thresholds on the probabilities, auto-reject if probability of default is above 30%, etc.
• Ad Click Prediction – Decided what ad to show, how much to bid. You might use a baseline Click Through Rate (CTR), and compare your prediction to this to see how much more you are willing to pay for this ad impression. 1
• Demographics Estimation of Websites – Imagine you have predictions of gender/ages, as probabilities, for a number of web users. Estimating the gender distribution on a website, can be done by just averaging the probabilities of the users seen on the site. Any bias in the probabilities, will generate a bias in your estimate.

# Visualisation

A reliability diagram is a relatively simple technique for visualising how well calibrated our classifier is.  As described in Niculescu-Mizil et al:

On real problems where the true conditional probabilities are not known, model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982). First, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc.
For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line.

Below we provide a piece of R code for producing relibability diagrams. Here we generalise the number of bins to be a user defined parameter.

In the figure below we show an example reliability plot. Ideally the reliability plot (red line) should be as close to the diagonal line as possible. As there is significant deviation from the diagonal, calibrating the probabilities will possible help. It is also worth mentioning that if you take the mean of the score distribution, it should ideally be close to the prior.

# Techniques for calibration

## Overfitting

The most important step is to create a separate dataset to perform calibration with. Our steps for calibration are:

• Split dataset into test and train
• Split the train set into model training and calibration.
• Train the model on train set
• Score test and calibration set
• Train the calibration model on calibration set
• Score the test set using calibration

How much data to use for calibration will depend on the amount of data you have available. The calibration model will generally only be fitting a small number of parameters (so you do not need a huge volume of data). I would aim for around 10% of your training data, but at a minimum of at least 50 examples.

## Platt Scaling

Platt Scaling essentially involves fitting a logistic regression on the classifier output. Originally developed to fit probabilities to the outputs of SVM 2, it is also well suited to the output of most other classifiers.

The reliability diagram below shows the original reliability plot (green) and after Platt Scaling (red).

The Platt Scalding should not change the rank of the observations, so measures such as AUC will be unaffected. However, measures like Log Loss 3 will be improved. In this example, Log Loss was originally 0.422 and improved to 0.418. In Platt’s original paper suggests , instead of using the original {0,1} targets in the calibration sample, it suggests to mapping to:

$$t_+=\frac{N_+ + 1}{N_+ + 2}$$

$$t_-=\frac{1}{N_-+2}$$

where $$N_+$$  and $$N_-$$ are the number of positive and negative examples in the calibration sample.

To some extent this introduces a level of regularisation. Imagine if you only gave probabilities of either 0 or 1 and you correctly predicted all examples. Your Log Loss would be zero. With Platt’s transformation, you Log Loss would be non-zero. As the Log Loss is what you are optimising when fitting the logistic regression, a level of regularisation is introduced.

In my experiments, this transformation had little to no effect on the reliability diagram and Log Loss, so seems an unnecessary step. It may be useful if you have very few examples and overfitting is more of a concern (therefore regularisation would help). You could also use a ridge or lasso regression.

## Isotonic Regression

With Isotonic Regression you make the assumption:

$$y_i = m(f_i) + \epsilon_{i}$$

where $$m$$ is an isotonic (monotonically increasing or decreasing) function. This is the exact same assumptions we would use for least squares, except $$m$$ is now a isotonic function instead of linear.

Below is an R example of how to perform isotonic regression using the isoreg function.

In the figure below we show an example of the sort of function fitted by the isotonic regression model: Notice how it goes up in steps instead of a smooth curve. To smooth the fit, we perform a linear interpolation between each step. In the reliability plot above, the original uncalibarated scores are shown in green and the isotonic regression scores are shown in red. In this example we find isotonic regression actually made it worse. The Log Loss for instance went from 0.422 to 0.426. The AUC was also reduced.

# Multiclass Classification

What happens if you have more than two classes? Firstly I would recommend visualising the problem as a series of reliability diagrams. For k classes, you can create k reliability diagrams.

Secondly, you can take the score for each of your k classes and plug them into a multinomial logistic regression. The superb glmnet package implements a multinomial logistic regression. You can set the regularisation parameter to something quite small. One word of caution, if you have many classes, overfitting can become an issue. At this point it is worth optimising the regularisation parameter.

If your favourite machine learning model (e.g. SVM) doesn’t directly support multi-class classification, you can fit a 1 vs. all set of classifiers and then plug each of those scores into the  multinomial logistic regression.

# Summary

Classifier probability calibration can be an important step in your machine learning pipeline. The first step is to always visualise and see how much of an issue you have. In general I have found Platt Scaling to be the simplest and most effective approach to most calibration of classification problems.

1. See Google’s paper on issues with ad click prediction: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf
2. See Platt’s original paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
3. Log loss definition: https://www.kaggle.com/wiki/LogarithmicLoss

# Reading large files in R

Reading large data files into R using  read.table can be painfully slow due to the large number of checks it performs on the data (inferring data types, checking number of columns, etc.). Below are a number of tips to improve its speed:

1. Set  nrows the number of records in your dataset. An easier way to obtain this is to use wc -l  in the terminal (only works on Unix systems). A handy function to do this from R:
2. Ensure  comment.char="" to turn off the interpretation of comments.
3. Supply the column types in colClasses  e.g.  colClasses = c("numeric", "character", "logical")
4. Setting multi.line=FALSE  can help the performance of the scan.
5. In general, it is worth using  as.is=TRUE to prevent strings being turned into factors automatically.  If you provide  colClasses  this is redundant (as the column types will be determined by colClasses ).

If your data is very large (5GB+), you should look at using the data.table package. In particular  fread can be used to read very large files to data tables.

# Significance Testing and Sample Size

A common question when analysing any sort of online data:

Is this result statistically significant?

Examples of where I have had to answer  this question:

• A/B Testing
• Survey Analysis

In this post, we will explore ways to answer this question and determine the sample sizes we need.

Note: This post was heavily influenced by Marketing Distillery’s excellent  A/B Tests in Marketing.

## Terminology

First let us define some standard statistical terminology:

• We are deciding between two hypotheses – the null hypothesis and the alternative hypothesis.
• The null hypothesis is the default assumption that there is no relationship between two values.  For instance, if comparing conversion rates of two landing pages, the null hypothesis is that the conversion rates are the same.
• The alternative hypothesis is what we assume if the null hypothesis is rejected. It can be two-tailed, the conversion rates are not the same, or one-tailed, where we state the direction of the inequality.
• A result is determined to be statistically significant if the observed effect was unlikely to have been seen by random chance.
• How unlikely is determined by the significance level. Commonly it is set to 5%. It is the probability of rejecting the null hypothesis when it is true. This probability is denoted by $$\alpha$$

$$\alpha = P(rejecting\ null\ hypothesis\ |\ null\ hypothesis\ is\ true)$$

• Rejecting the null hypothesis when the null hypothesis is true, is called a type I error.
• A type II error occurs when the null hypothesis is false,  but is erroneously failed to be rejected. This probability of this is denoted by $$\beta$$

$$\beta = P(fail\ to\ reject\ null\ hypothesis\ |\ null\ hypothesis\ is\ false)$$

• The power of a test is $$1-\beta$$. Commonly this is set to 90%.
• The p-value is calculated from the test statistic. It is the probability that the observed result would have been seen by chance assuming the null hypothesis is true.
• To reject the null hypothesis we need a p-value that is lower than the selected significance level.

## Types of Data

Usually we have two types of data we want to perform significance tests with:

1. Proportions e.g. Conversion Rates, Percentages
2. Real valued numbers e.g. Average Order Values, Average Time on Page

In this post, we will look at both.

## Tests for Proportions

A common scenario is we have run an A/B test of two landing pages and we wish to test if the conversion rate is significantly different between the two.

An important assumption here is that the two groups are mutually exclusive e.g. you can only have been shown one of the landing pages.

Null hypothesis is that the proportions are equal in both groups.

The test can be performed in R using:

Under the hood this is performing a Pearson’s Chi-Squared Test.

Regarding the parameters:

x – Is usually a 2×2 matrix giving the counts of successes and failures in each group (conversions and non-conversions for instance).

n – Number of trials performed in each group. Can leave null if you provide x as a matrix.

p – Only if you want to test for equality against a specific proportion (e.g. 0.5).

alternative – Generally only used when testing against a specific p. Changes the underlying test to use a z-test. See Two-Tailed Test of Population Proportion for more details. The z-test may not be a good assumption with small sample sizes.

conf.level – Used only in the calculation of a confidence interval. Not used as part of the actual test.

correct – Usually safe to apply continuity correction. See my previous post for more details on the correction.

The important part of the output of prop.test is the p-value. If this is less than your desired significance level (say 0.05) you reject the null hypothesis.

In this example, the p-value is not less than our desired significance level (0.05) so we cannot reject the null hypothesis.

Before running your test, you should fix your sample size in each group in advance. The R library pwr has various functions for helping you do this. The function pwr.2p.test can be used for this:

Any one of the parameters can be left blank and the function will estimate its value. For instance, leaving n, the sample size, blank will mean the function will compute the desired sample size.

The only new parameter here is h. This is the minimum effect size you wish to be able to detect. h is calculated as the difference of the arcsine transformation of two proportions:

$$h = 2 \arcsin(\sqrt{p_1}) – 2\arcsin(\sqrt{p_2})$$

Assuming you have an idea of the base rate proportion (e.g. your current conversion rate) and the minimum change you want to detect, you can use the follow R code to calculate h:

# Tests for real-values

Now imagine we want to test if a change to a eCommerce web-page increases the average order value.

The major assumption we are going to make is that the data we are analysing is normally distributed. See my previous post on how to check if the data is normally distributed.

It may be possible to transform the data to a normal distribution, for instance if the data is log-normal. Time on page often looks to fit a log-normal distribution, in this case you can just take the log of the times on page.

The test we will run is the two-sample t-test. We are testing if the means of the two groups are significantly different.

The parameters:

x – The samples in one group

y – The samples in the other group. Leave NULL for a single sample test

alternative – Perform two-tailed or one-tailed test?

mu – Use this if you know the true mean

paired – TRUE for the paired t-test, used for data where the two samples have a natural partner in each set, for instance comparing the weights of people before and after a diet.

var.equal – Assume equal variance in the two groups or not

conf.level – Used in calculating the confidence interval around the means. Not part of the actual test.

The test can be run using:

In this example, as our p-value is greater than 0.05 significance level, so we cannot reject the null hypothesis.

As before, before running the experiment we should set the sample size required. Using the pwr library we can use:

d is the effect size. For t-tests the affect size is assessed as:

$$d = \frac{|\mu_1 – \mu_2|}{\sigma}$$

where $$\mu_1$$ is the mean of group 1, $$\mu_2$$ is the mean of group 2 and $$\sigma$$ is the pooled standard deviation.

Ideally you should set d based on your problem domain, quantifying the effect size you expect to see. If this is not possible, Cohen’s book on power analysis “Statistical Power Analysis for the Behavioral Sciences”, suggests setting d to be 0.2, 0.5 and 0.8 for small, medium and large effect sizes respectively.

# Testing for Normality

The most widely used distribution in statistical analysis is the normal distribution.  Often when performing some sort of data analysis, we need to answer the question:

Does this data sample look normally distributed?

There are a variety of statistical tests that can help us answer this question, however first we should try to visualise the  data.

# Visualisation

Below is a simple piece of R code to create a histogram and normal Q-Q plot: On top of the histogram we have have overlayed the normal probability density distribution using the sample mean and standard deviation.

In the Q-Q plot, we are plotting the quantiles of the sample data against the quantiles of a standard normal distribution. We are looking to see a roughly linear relationship between the sample and the theoretical quantiles.

In this example, we would probably conclude, based on the graphs, that the data is roughly normally distributed (which is true in this case as we randomly generated it from a normal distribution).

It is worth bearing in mind, many data analysis techniques assume normality (linear regression, PCA, etc.) but are still useful even with some deviations from normality.

# Statistical Tests for Normality

We can also use several tests for normality. The two most common are the Anderson-Darling test and the Shapiro-Wilk test.

The Null Hypothesis of both these tests is that the data is normally distributed.

## Anderson-Daring Test

The test is based on the distance between the empirical distribution function (EDF) and the cumulative distribution function (CDF) of the underlying distribution (e.g. Normal). The statistic is the weighted sum of the difference between the EDF and CDF, with more weight applied at the tails (making it better at detecting non-normality at the tails).

Below is the R code to perform this test:

As the p-value is above the significance level, we would accept the null hypothesis (normality).

## Shapiro-Wilk Test

The Shapiro-Wilk test is a regression-type test that uses the correlation of sample order statistics (the sample values arranged in ascending order) with those of a normal distribution. Below is the R code to perform the test:

Again, as the p-value is above the significance level, we would accept the null hypothesis (normality).

The Shapiro-Wilk test is slightly more powerful than the Anderson test, but is limited to 5000 samples.

# Test Caution

While these tests can be useful, they are not infallible. I would recommend looking at the histogram and Q-Q plots first, and use the tests as another check.

In particular small and large samples sizes can cause problems for the tests.

With a sample size of 10 or less, it is unlikely the test will detect non-normality (e.g. reject the null) even if the distribution is truly non-normal.

With a large sample size of 1000 samples or more, a small deviation from normality (some noise in the sample) may be concluded as significant and reject the null.

Overall your two best tools are the histogram and Q-Q plot. Potentially using the tests as additional indicators.

# Pearson’s Chi-Squared Test

The Pearson’s Chi-Squared Test is a commonly used statistical test applied to categorical data. It has two major use-cases:

• A goodness of fit test measuring how well a observed data fits a theoretical distribution.
• A test of independence assessing how likely two categorical variables are independent of one another.

## Procedure

The procedure for performing a Chi-Squared Test is as follows:

1. Compute the Chi-Squared statistic $$\chi^2$$.
2. Compute the degrees of freedom.
3. Use the Chi-Squared distribution to compute the p-value.

## Goodness of fit

Imagine we have been rolling a dice and want to test if it is fair. We collect a table of rolls:

Our test statistic is calculated as:

$$\chi^2 = \sum^n_{i=1}\frac{(O_i – E_i)^2}{E_i}$$

where

$$O_i$$ is the observed count

$$E_i$$ is the expected count under the theoretical distribution

$$n$$ is the number of categories

$$N$$ is total observations

In the dice case,$$n=6$$ as we have six sides. As our theoretical distribution is that the dice is fair, $$E_i$$ is simply $$\frac{N}{6}$$. A more complicated case might be if the theoretical distribution was a normal distribution. Here we would need to compute the probability density over various intervals (e.g. $$P(a \lt x \leq b)$$ and multiply by$$N$$.

The degrees of freedom is $$n – (s + 1)$$ where $$s$$ is the number of estimated parameters. For the dice example the degrees of freedom is 5, as there are no free parameters needed to be estimated. If the theoretical distribution was normal with unknown mean and variance, the degrees of freedom would be $$n-(2 + 1)$$, $$s$$ is 2 as we would need to compute the sample mean and variance.

## Test of Independence

The most common use-case of the Chi-Squared test is for testing if their is an association between two categorical variables. Some examples are:

• Is there a relationship between gender and the party you vote for?
• Is there a relationship between a landing page version and conversions (A/B testing)

One important assumption is that they must be independent groups e.g. you can only vote for one party, you are only shown one landing page.

To begin we form a contingency table, such as:

This example is a 2×2 table, but we can also have tables of 3×2, 3×3 etc.

If we have $$r$$ rows and $$c$$ columns the expected frequency is:

$$E_{i,j}= \frac{(\sum_{k=1}^cO_{i,k})(\sum_{k=1}^rO_{k,j})}{N}$$

e.g. Row Total * Column Total / N.

The test statistic is then:

$$\chi^2=\sum^r_{i=1}\sum^c_{j=1}\frac{(O_{i,j} – E_{i,j})^2}{E_{i,j}}$$

The number of degrees of freedom is $$(r – 1)(c – 1)$$.

Our null hypothesis is the variables (gender and party for instance) are independent. Our alternative hypothesis is that they have an association (but not the direction of the association). If the p-value is below the significance level, we reject the null hypothesis and the evidence suggests there is a relationship between the two variables.

## Small Cell Counts

The assumption made is that test statistic is chi-squared distributed. The assumption is true in the limit.

The assumption can break down when a cell has a very small count. Often if the expected cell count is less than 5 (sometimes 10), the Yate’s correction for continuity is applied. The effect of this is to make a more conservative estimate, but does also increasing the likelihood of a type II error.

If the cell counts are less than 10 and it is a 2×2 contingency table, an alternative is to apply Fisher’s exact test. The test can be applied on general r x c contingency tables using monte-carlo simulation.

## Implementation

The chisq.test function in R implements the Chi-Squared test including the ability for use Yates’ correction.

The fisher.test function implements Fisher’s exact test.

# Area Under the receiver operator Curve (AUC)

AUC is a common way of evaluating binary classification problems. Two main reason are:

• It’s insensitive to unbalanced datasets e.g. fraud detection, conversion prediction.
• It does not require the predictions to be thresholded (e.g. assign to one class or the other). Operates directly on classification scores.

An AUC of 0.5 is random performance, while an AUC of 1 is perfect classification. Personally I transform this to GINI = 2AUC – 1.  The main reason being, that intuitively for me 0 being random and 1 being perfect makes more sense.

# Understanding AUC

First let us look at a confusion matrix:

Imagine we have a set of scores from a classification model (this could be probabilities, etc.) along with their true labels.  The numbers in the confusion matrix assume we have pick a threshold $$t$$, where $$>= t$$ we assign the positive label and $$< t$$ we assign the negative label.

If we have $$M$$ unique scores from the classification model, we could have $$M +1$$ possible choices of $$t$$. If we plot the True Positive Rate and False Positive Rate over all the ranges of $$t$$, we get our Receiver Operating Characteristic (ROC) Curve. Figure 1: ROC Curve

Figure 1 shows an example ROC. The blue line is our performance, the grey dotted line is an example of what we would see from random classification.

The Area Under the receiver operator Curve (AUC) for this example would be the area under the blue line. The area under the grey line would be 0.5, hence why an AUC of 0.5 is random performance and AUC of 1 is perfect classification.

# Score Distributions Figure 2: Moving the classification threshold

In the top-left of figure 2 we have a hypothetical score distribution of the output of a classifier. The two “clumps” of scores are the negative (blue) and positive (red) examples. You can imagine as we move the threshold line from left to right, we will change the values of the TP and FP and therefore the TPR and FPR.

How much the score distributions of the negatives and positive overlap will determine our AUC.  In figure 3 we show some different score distributions, the red distribution is the positive class, the blue the negative. Figure 3: Examples of different score distributions

In the top example, wherever we set the threshold we get a perfect TPR. In this case we would get an AUC of 1.

In the middle example, wherever we set the threshold, our TPR will be the same as our FPR. Hence we will see an AUC of 0.5

In the bottom example, we have some overlap of the score distributions, so our AUC will be somewhere between 0.5 and 1.

The shape of the score distributions and amount of overlap will affect the shape of the ROC Curve and therefore the AUC we end with.

# Implementation

The R pROC library has functions for computing ROC, AUC and plotting.