# Collaborative Filtering using Alternating Least Squares

Collaborative filtering is commonly used in recommender systems. The idea is if you have a large set of item-user preferences, you use collaborative filtering techniques to predict missing item-user preferences. For example, you have the purchase history of all users on an eCommerce website. You use collaborative filtering to recommend which products a user might purchase next. The key assumption here is people that agreed in the past (purchased the same products) will agree in the future.

In this post I will look at how to perform Matrix Factorisation via Alternating Least Squares as a method of Collaborative Filtering. This method gave one of the best results of a single method on the Netflix dataset.

# Matrix Factorisation

The basic idea behind any matrix factorisation method is to find a smaller set of latent factors that can be used to predict missing entries

In the above example, we want to decompose the purchases matrix into a set of user latent factors and a set of product latent factors.

Usually we are not able to perform this factorisation using direct methods (e.g. SVD) as the input matrix contains mostly missing values. To solve this we treat the latent factors as parameters to be learned and treat the factorisation as an optimisation problem.

# Explicit versus Implicit data

One major distinction is the type of data we have available as this will affect how we model the data. Broadly speaking it is split into explicit and implicit data.

Examples of explicit data collection include the following:

• Asking a user to rate an item on a sliding scale.
• Asking a user to rank a collection of items from favourite to least favourite.
• Presenting two items to a user and asking him/her to choose the better one of them.
• Asking a user to create a list of items that he/she likes.

Examples of implicit data collection include the following:

• Page views in an online store
• Online purchases online.
• Obtaining a list of items a users listens, watches on their computer. Apps they have installed.

Generally companies wishing to solve this sorts of problems have large amounts of implicit data available. I will present how the Alternating Least Squares (ALS) method can be used for explicit data and then extend it to implicit data.

# Preliminaries

Lets assume we have a number of users, and a number of items. We refer to users using the syntax $u$ and $v$ and refer to items with $i$ and $j$.

The users and items are associated through $r_{u,i}$ which we will call observations. For explicit datasets this would be the ratings a user has given an item. Higher numbers mean stronger preference. For implicit datasets these would indicate user actions. For instance it might be the number of times a user has purchased an item or viewed an item online.

# Model

We will associate each user $u$ with a user-factor vector $x_u$ and each item with a item-factor vector $y_i$. Our aim is to predict our observations from the latent factors:

$\hat{r}_{u,i} = x^\top_uy_i$

We want our predictions to be as close to the ground truth as possible. Therefore to find the latent vectors we frame it as an optimisation problem using a standard squared loss with regularisation:

$\min_{x, y} \sum_{r_{u,i}\text{is known}}(r_{u,i} -x^\top_uy_i )^2 + \lambda(\sum_u||x_u||^2 + \sum_i||y_i||^2)$

Let's group all the user-factors in an $m \times f$ matrix $X$ and all the item-factors in an $n \times f$ matrix $Y$. If we assume either the user-factors or item-factors was fixed, this would look exactly like a regularised least squares problem. So, first lets assume $Y$ is fixed and solve for $x_u$:

$x_u=(Y^{\top}Y + \lambda I)^{-1}Y^\top r_u$

Likewise if we assume $X$ is fixed, and solve for $y_i$:

$y_u=(X^{\top}X + \lambda I)^{-1}X^\top r_i$

The idea is to iterate these two steps until some stopping criterion is reached. The original paper suggested 10 iterations.

For further ways to model explicit data, see this paper on the Netflix prize.

Example python implementation

# Implicit data

Using the raw observations directly has been found not to work as well on implicit data 1. Instead we define a set of binary observation variables:

$p_{u,i}=\begin{cases}1& \text{if } r_{u,i}\gt 0\\0&\text{otherwise}\end{cases}$

Now we want to define some confidence levels for each $p_{u,i}$. When $r_{u,i} = 0$ we have low confidence, it may be the user has never been exposed to that item or it may be unavailable at the time. When $r_{u,i}$ is small, it could be explained by the user buying a gift for someone else etc., hence we should still have low confidence. When $r_{u,i}$ is larger, we should have much more confidence.

We have complete freedom how we define confidence. The original proposal was

$c_{u,i}=1+\alpha r_{u,i}$

We have some level of confidence for each $p_{u,i}$. The $\alpha$ parameter is tuneable, to the dataset via cross-validation.

Our new optimisation function is

$\min_{x, y} \sum_{r_{u,i}\text{is known}}c_{u,i}(p_{u,i} -x^\top_uy_i )^2 + \lambda(\sum_u||x_u||^2 + \sum_i||y_i||^2)$

We define the diagonal matrix $C^i$, where $C^i_{u,u}=c_{u.i}$. The parameter learning is the same as a standard weighted least squares setup:

$y_u=(X^{\top}C^iX + \lambda I)^{-1}X^\top C^ip_i$

$x_u=(Y^{\top}C^uY + \lambda I)^{-1}Y^\top C^up_u$

Computing $X^{\top}C^iX$ can be expensive. One optimisation can be used is the fact: $X^{\top}C^iX=X^{\top}X +X^{\top}(C^i - I)X$. You can pre-compute $X^{\top}X$ at each step, and notice that $(C^i - I)$ only has the non-zero entries where $r_{u,i}$ was non-zero.

You might wonder, why bother with the confidence weights and why not scale the ratings instead? The problem there is that you turn a very sparse problem into a very dense problem and you will have trouble even numerically computing the MSE on the dense problem.

Example python implementation.

# Discussion

In the last section we will discuss a number of points when applying this algorithms on practical problem.

## Convergence

Let's look at the loss function again

$\mathcal{L} = \min_{x, y} \sum_{r_{u,i}\text{is known}}(r_{u,i} -x^\top_uy_i )^2 + \lambda(\sum_u||x_u||^2 + \sum_i||y_i||^2)$

The loss is non-convex in $(X, Y)$. When we assume $Y$ is fixed, $\mathcal{L}$ is now convex in $X$. Each step will move us toward a minimum, but not necessarily a global minimum.

This will mean that each time you the algorithm you will potentially get a different result depending on how you initialise $X$ or $Y$. It is worth running the algorithm several times when testing different hyperparameter values.

Also note, if you make $\lambda$ large enough you will converge to the same solution each time. This is because are constraining your solution space.

## Initialisation

As the problem is non-convex, initialising the parameters in a good way can help you converge to a solution faster.

One suggested approch, currently used in Spark:

Choose a unit vector uniformly at random from the unit sphere, but from the "first quadrant" where all elements are nonnegative. This can be done by choosing elements distributed as Normal(0,1) and taking the absolute value, and then normalising.

One interesting thought here is that better initialisations could start you off closer to a global minimum. I would welcome any comments if anyone is aware of better approaches, I've not seen any literature on this.

## Hyperparameters

Exact values of $\lambda$ and $\alpha$ are data dependent and should be determined  by cross-validation.

## Intercepts

Interestingly, the original paper does not suggest using any form of user or item biases. A better approach could be to define:

$\hat{r}_{u,i} = \mu + \beta_u + \beta_i + x^\top_uy_i$

where $\mu$ would be a global average, $\beta_u$ represents a user bias and $\beta_i$ represents an item bias. These biases help us account for the difference in users and items e.g. some users buying lots of products.

The global average $\mu$ makes sense with explicit data (the average overall rating). It may not sense to use with implicit data (where most of $r_{u,i}=0)$.

## Regularisation

Zhou et al. suggest weighting each term in the regularisation by $n_u$ and $n_i$ the number of observations of user $u$ and items $i$ respectively. The only reason I can see do to this is that it makes $\lambda$ less dependent on the scale of the dataset, so $\lambda$ from a sampled dataset should work well on a full dataset.

## Stopping Criteria

When to stop the algorithm is usually governed by:

• If a maximum number of iterations is met
• If the difference between the measured MSE of the current iteration and previous iteration goes below some epsilon

## Confidence Functions

You are free to use any function for the confidence weight you like. Straightforward linear scaling might not be what you want. For instance, in a e-Commerce scenario, you might use weights along the lines of

• 3 for a product page view
• 40 for a product purchase
• etc.

## ALS versus SGD

The loss function $\mathcal{L}$ could be optimised via stochastic gradient descent. It would potentially be easier and faster than ALS. However ALS offers two advantages:

• ALS is much easier to parallelise.
• In my experience, ALS converges very quickly. Generally I find less than 10 iterations will converge to a decent solution.

Hopefully this post provides an overall introduction to ALS and highlights some of the considerations you should take into account when implementing this algorithm.

1. http://yifanhu.net/PUB/cf.pdf

# Useful External Resources

Links to various useful external resources. Will be continually updated

##### Numpy for R users

http://mathesaurus.sourceforge.net/r-numpy.html

#### Jupyter Features

http://arogozhnikov.github.io/2016/09/10/jupyter-features.html

##### Modern Pandas

https://tomaugspurger.github.io/modern-1.html

##### Interactive Tutorial on Numerical Optimisation

http://www.benfrederickson.com/numerical-optimization/

# Computing PCA with SVD

PCA is a great tool for performing dimensionality reduction. Two reason you might want to use SVD to compute PCA:

• SVD is more numerically stable if the columns are close to collinear. I have seen this happen in text data, when certain terms almost always appear together.
• Spark's PCA implementation currently doesn't support very wide matrices. The SVD implementation does however.

# Singular Value Decomposition (SVD)

Below we briefly recap Singular Value Decomposition (SVD).

Let $\mathbf{A}$ be a $m \times n$ matrix, the singular value decomposition gives

$\mathbf{A} = \mathbf{U}_{m\times m}\mathbf{\Sigma}_{m\times n}\mathbf{V}^\top_{n\times n}$

$\mathbf{U}$ is an orthonormal matrix and is the eigenvectors of $\mathbf{A}\mathbf{A}^\top$.

$\mathbf{V}$ is an orthonormal matrix and is the eigenvectors of $\mathbf{A}^\top\mathbf{A}$.

$\mathbf{\Sigma}$ is a diagonal matrix and contains the square-roots of the eigenvalues of$\mathbf{A}\mathbf{A}^\top$ and  $\mathbf{A}^\top\mathbf{A}$ e.g.

$\mathbf{\Sigma}=\begin{pmatrix}\sqrt{\lambda_1}&0&\cdots& 0\\0 &\sqrt{\lambda_2} & \cdots &0\\\vdots & \vdots & \ddots & \vdots \\0&0& \cdots &\sqrt{\lambda_k}\end{pmatrix}$

Remember, as $\mathbf{U}$ is an orthonormal matrix $\mathbf{U}^\top=\mathbf{U}^{-1}$

# Computing PCA

1. Mean centre the matrix $\mathbf{A}$
2. Optionally scale each column by their standard deviation. You may want to do this if the variables are measured on different scales.

We noted in the previous section that $\mathbf{V}$ is the eigenvectors of $\mathbf{A}^\top\mathbf{A}$ (the covariance matrix). Thus the principal component decomposition is

$$$\begin{split}\mathbf{T} & = \mathbf{A}\mathbf{V}\\ & =\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\mathbf{V} \\ & =\mathbf{U}\mathbf{\Sigma}\end{split}$$$

To reduce the dimensionality of $\mathbf{A}$, select the $k$ largest singular values ($k \le n$), select the first $k$ columns from $\mathbf{U}$ and the $k \times k$ upper-left part of $\mathbf{\Sigma}$. The reduced dimensionality is given by

$\mathbf{T}_k =\mathbf{U}_k\mathbf{\Sigma}_k$

Hadoop 2 is a complete overall of some of the core Hadoop libraries. It is a fundamental shift in the way applications run on top of Hadoop and it is worth understanding these changes.

# YARN

In Hadoop 1, the programming API (MapReduce) and resource management of the cluster were all bundled together. In Hadoop 2, resource management is now handled by YARN (Yet Another Resource Negotiator).

YARN manages the resources available to us on the cluster. To understand what YARN does, we need to look at the components that make it up:

## Resource Manager

• Runs on a single master node
• Global resource scheduler across the node
• Arbitrates resources between competing applications
• Nodes have resources - memory and CPU cores - which the resource manager allocates.

## Node Manager

• Sits on each slave node
• Handles communication with Resource Manager

## Applications

• Applications are jobs submitted to the YARN framework
• Could be MapReduce job, Spark job etc.

# Application Master

• One per-application
• Requests containers to actually run the job. Containers will be distributed across the container.

## Containers

• Create by the RM upon request
• Allocate a certain amount of resources ( CPU and memory) on a slave node.

# Applications

In Hadoop 2, applications are no longer limited to just MapReduce. Cluster can be used for multiple different systems at the same time.  The cluster resources can be between utilised and new systems can integrate by implementing the YARN API.

# Scheduling

• Hierarchical queue system
• Various scheduling mechanisms (Capacity Scheduler, Fair Scheduler)
• Cloudera CDH5 uses Fair Scheduling by default.

Useful commands when using Hadoop on the command line

# Filesystem

Full reference can be found in Hadoop Documentation.

#### ls

List the contents of provided directory.

#### put

Put the local file to provided HDFS location

#### get

Copy the file to the local file system

#### cat/text

Outputs the contents of HDFS file to standard output. Text command will also read compressed files and output uncompressed data.

Common usecase is that you want to check contents of the file, but not output the whole file. Pipe the contents to head.

#### cp/mv

cp is short for copy, copy file from source to destination.

mv is short for move, move file from source to destination.

#### chmod

Change the permissions of the file/directory. Uses standard Unix file permissions.

#### getmerge

Takes a source directory and concatenates all the content and outputs to a local file. Very useful as commonly Hadoop jobs will output multiple output files depending on the number of mappers/reducers you have.

#### rm

Deletes a file from HDFS. The -r means perform recursively. You will need to do this for directories.

By default the files will be moved to trash that will eventually be cleaned up. This means the space will not be immediately freed up. If you need the space immediately you can use -skipTrash, note this will mean you can reverse the delete.

#### du

Displays the sizes of directories/files. Does this recursively, so extremely useful for find out how much space you have. The -h option makes the sizes human readable.  The -s option summarises all the files, instead of giving you individual file sizes.

One thing to note is that the size reported is un-replicated.  If your replication factor is 3, the actual disk usage will be 3 times this size.

#### count

I commonly use this command to find out how much quota I have available on a specific directory (you need to add the -q options for this).

To work out how much quota you have used, SPACE_QUOTA - REMAINING_SPACE_QUOTA, e.g. 54975581388800 - 5277747062870 or 54.97TB - 5.27TB = 49.69TB left.

Note this figures are the replicated numbers.

Useful command for finding out total usage on the cluster. Even without superuser access you can see current total capacity and usage.

Commonly you will have a fat jar file containing all the code for your map reduce job. Launch via:

A Scalding job is launched using:

If you need to kill a map-reduce job, use:

You can find the job id in the resource manager or in the log of the job launch. This can be used to kill any map-reduce job (Standard Hadoop, Scalding, Hive, etc.) but not Impala or Spark jobs for instance.

# Setting up IntelliJ for Spark

Brief guide to setting up IntelliJ to build Spark applications.

# Create new Scala Project

Select:

1. Create New Project
2. Scala Module
3. Give it an appropriate name

# Setup Directory Structure

Move to the project root. Run the following:

# Setup gen-idea plugin

In the project directory you just created, create a new file called plugins.sbt with the following content:

# Create the build file

In the project root, create a file called build.sbt containing:

# Run SBT

At the project root level, run the following

You should now be setup and ready to write Spark Applications

# Common Pitfalls in Machine Learning

Over the past few years I have worked on numerous different machine learning problems. Along the way I have fallen foul of many sometimes subtle and sometimes not so subtle pitfalls when building models.  Falling into these pitfalls will often mean when you think you have a great model, actually in real-life it performs terribly. If your aim is that business decisions are being made based on your models, you want them to be right!

I hope to convey these pitfalls to you and offer advice on avoiding and detecting them. This is by no means an exhaustive list and I would welcome comments on other common pitfalls I have not covered.

In particular, I have not tried to cover pitfalls you might come across when trying to build production machine learning systems. This article is focused more on the prototyping/model building stage.

Finally, while some of this is based on my own experience, I have also drawn upon John Langford's Clever Methods of Overfitting, Ben Hamner's talk on Machine Learning Gremlins and Kaufman et al. Leakage in Data Mining. All are worth looking at.

Traditional overfitting is where you fit an overly complicated model to the trained dataset. For example, by allowing it to have too many free parameters compared to the number of training points.

To detect this, you should always be using a test set (preferably you are using cross validation). Plot your train and test performance and compare the two. You should expect to see a graph like figure 1. As your model complexity increases, your train error goes to zero. However, your test error follows an elbow shape, it improves to a point then gets worse again.

Potential ways to avoid traditional overfitting:

#### More training data

In general, the more training data you have the better. But it may be expensive to obtain more.

Before going down this route it is worth seeing if more train data will actually help. The usual way to do this is to plot a learning curve, how the training sample size affects your error:

In the left-hand graph, the gradient of the line at our maximum training size is still very steep. Clearly here more training data will help.

In the right-hand graph, we have started to reach a plateau, more training data is not going to help too much.

#### Simpler Predictor Function

Two ways you could use a simpler predictor:

• Use a more restricted model e.g. logistic regression instead of a neural network.
• Use your test/train graph (figure 1) to find an appropriate level of model complexity.

#### Regularisation

Many techniques have been developed to penalise models that are overly complicated (Lasso, Ridge, Dropout, etc.). Usually this involve setting some form of hyper-parameter. One danger here is that you tune the hyper-parameter to fit the test data, which we will discuss in Parameter Tweak Overfitting.

#### Integrate over many predictors

In Bayesian Inference, to predict a new data point $x$:

$p(x|\mathbf{D},\alpha) = \int_{\theta} p(x|\theta,\alpha) \, p(\theta|\mathbf{D},\alpha)$

where $\alpha$ is some hyper-parameters, $\mathbf{D}$ is our training data and $\theta$ are the model parameters. Essentially we integrate out the parameters, weighting each one by how likely they are given the data.

# Parameter Tweak Overfitting

This is probably the type of overfitting I see most commonly. Say you use cross validation to produce the plot in figure 1. Based on the plot you decide the 5 parameters is optimal and you state your generalisation error to be 40%.

However, you have essentially tuned your parameters to the test data set. Even if you use cross-validation, there is some level of tuning happening. This means your true generalisation error is not 40%.

Chapter 7 of the Elements of Statistical Learning discuss this in more detail. To get a reliable estimate of generalisation error, you need to put the parameter selection, model building inside, etc. in an inner loop. Then run a outer loop of cross validation to estimate the generalisation error:

For instance, in your inner loop you may fit 10 models, each using 5X CV. 50 models in total. From this inner loop, you pick the best model. In the outer loop, you run 5X CV, using optimal model from the inner loop and the test data to estimate your generalisation error.

# Choice of measure

You should use whatever measure is canonical to your problem or makes the most business sense (Accuracy, AUC, GINI, F1, etc.). For instance, in credit default prediction, GINI is the widely accepted measurement.

It is good to practice to measure multiple performance statistics when validating your model, but you need to focus on one for measuring improvement.

One pitfall I see all the time is using accuracy for very imbalanced problems. Telling me that you achieved an accuracy of 95%, when the prior is 95% means that you have achieved random performance. You must use a measure that suits your problem.

# Resampling Bias and Variance

When measuring your test performance you will likely use some form of resampling to create multiple test sets:

• K-fold cross validation
• Repeated k-fold cross validation
• Leave One Out Cross Validation (LOOCV)
• Leave Group Out Cross Validation (LGOCV)
• Bootstrap

Ideally we want to use a method that achieves low bias and low variance. By that I mean:

• Low bias - The generalisation error produced is close to the true generalisation error.
• Low variance (high precision) - The spread of the re-sampled test results is small

In general, the more folds you use the lower the bias, but the higher the variance.

You will need to pick a resampling method to suit your problem. With large training sets, k-fold cross validation with a k of 5 may suit. With small training sets, you will need a larger k.

Max Kuhn has done a fantastic empirical analysis of the different methods in relation to bias and variance (part I, part II). In general repeated 10-fold cross validation seems to be quite stable across most problems in terms of bias and variance.

Once you have computed the results of your cross validation, you will have a series of measurement of your test performance. The usual thing to do here is to take the mean and report this as your performance.

However just reporting the mean may hide the fact there is significant variation between the measurements.

It is useful to plot the distribution of the performance estimates. This will give you and idea  of how much variation there is:If you need to boil it down to a single number, computing the standard deviation (or variance) can be useful. However you often see performance quoted like: 80% +/- 2% where the 80% is the mean and the 2% is the standard deviation. Personally I dislike this way of reporting as it suggests the true performance is between 78%-82%. I would quote them separately; mean and standard deviation.

If you want to write 80% +/- 2%, you would need to compute bounds on the estimate.

# Information Leakage

Information Leakage occurs when data on your training label leaks into your features. Potentially it is even more subtle, where irrelvant features appear as highly predictive, just because you have some sort of bias in the data you collected for training.

As an example, imagine you are an eCommerce website and want to predict converters from the way people visit your website. You build features based on the raw URLs the users visit, but take special care to remove the URLs of the conversion page (e.g. Complete Purchase). You split your users into converters (those reaching the conversion page) and non-converters. However, there will be URLs immediately before the conversion page (checkout page, etc.) that will be present in all the converters and almost none of the non-converters. Your model will end up putting an extremely high weight on these features and running your cross-validation will give your model a very high accuracy. What needed to be done here was remove any URLs that always occur immediately before the conversion page.

#### Feature Selection Leakage

Another example I see regularly is applying a feature selection method that looks at the data label (Mutual Information for instance) on all of the dataset. Once you select your features, you build the model and use cross-validation to measure your performance. However your feature selection has already looked at all the data and selected the best features. In this sense your choice of features leaks information about the data label. Instead you should have performed inner-loop cross validation discussed previously.

#### Detection

Often it can be difficult to spot these sorts of information leakage without domain knowledge of the the problem (e.g. eCommerce, medicine, etc.).

The best advice I can suggest to avoid this is to look at the top features that are selected by your model (say top 20, 50). Do they make some sort of intuitive sense? If not, potentially you need to look into them further to identify if their is some information leakage occurring.

#### Label Randomisation

A nice method to help with the feature selection leakage is to completely randomly shuffle your training labels right at the start of your data processing pipeline.

Once you get to cross validation, if your model says it has some sort of signal (an AUC > 0.5 for instance), you have probably leaked information somewhere along the line.

# Human-loop overfitting

This is a bit of a subtle one. Essentially when picking what parameters to use, what features to include, it should all be done by your program. You should be able to run it end-to-end to perform all the modelling and performance estimates.

It is ok to be a bit more "hands-on" initially when exploring different ideas. However your final "production" model should remove the human element as much as possible. You shouldn't be hard-coding parameter settings.

I have also seen this occurring when particular examples are hard to predict and the modeller decides to just exclude these. Obviously this should not be done.

# Non-Stationary Distributions

Does your training data contain all the possible cases? Or could it be biased? Could the potential labels change in the future?

Say for example you build a handwritten digit classifier trained on the the MNIST database. You can classify between the numbers 0-9. Then someone gives you handwritten digits in Thai:

How will your classifier behave? Possibly you need to obtain handwritten digits for other languages, or have an other category that could incorporate non Western Arabic numerals.

# Sampling

Potentially, your training data may have gone through some sort of sampling procedure before you were provided it. One significant danger here is that this was sampling with replacement and you end up with repeated data points in both the train and test set. This will cause your performance to be over-estimated.

If you have a unique ID for each row, check these are not repeated. In general check how many rows are repeated, you may expect some, but is it more frequent than expected?

# Summary

My one key bit of advice when building machine learning models is:

If it seems to good to be true, it probably is.

I can't stress this enough, to be a really good at building machine learning models you should be naturally sceptical. If your AUC suddenly increases by 20 points or you accuracy becomes 100%. You should stop and really look at what you have done. Have you fallen trap of one of the pitfalls described here?

When building your models, it is always a good idea to try the following:

• Plot learning curves to see if you need more data.
• Use test/train error graphs to see if your model is too complicated.
• Ensure you are running inner-loop cross validation if you are tweaking parameters.
• Use repeated k-fold cross validation if possible.
• Check your top features - Do they make sense?
• Perform the random label test.
• Check for repeated unique IDs or repeated rows.

# Random Permutation Tests

At the 2014 Strata + Hadoop conference John Rauser gave a great keynote title "Statistics Without the Agonizing Pain".  It is probably worth watching before reading the rest of this article, in it he introduces the concept of Random Permutation Tests.

"Classic" statistical tests usually make some sort of assumption about the distribution of the data e.g. normally distribution data . Are these assumptions always true? Probably not, but they are often approximately close enough to give you a useful result. By making these assumptions, these tests are called parametric.

Random Permutation Tests make no assumptions on the underlying distribution of the data. They are considered non-parametric tests. This can be extremely useful when:

• Your data just doesn't seem to fit the distribution the classic statistical test assumes. For instance, perhaps it is bi-modal and the test assumes normality.
• You have outliers e.g. users who spend significantly more than others.
• You have a small sample size.

Random Permutation Tests can be used in almost any setting where you would compute a p-value. In this article I will focus on there use in experimental studies, you want to see if there is a difference between two treatment groups (A/B Tests, medical studies, etc.)

# Overview

The essential idea behind random permutation tests is:

1. Compute a test statistic between two (or more) groups. This could be the difference between two proportions, the difference between the means of the two groups etc.
2. Now randomly shuffle the data assigned to each group.
3. Measure the test statistic again on the shuffled data.
4. Repeat 2 and 3 many times
5. Look at where the test statistic from 1 falls in the distribution of test statistics from 2-4.

We have used steps 2-4 to empirically estimate the sampling distribution of the test statistic. From this distribution you can compute the p-value for your observed test statistic.

# Example

Let's imagine we want to add a new widget on our checkout page of our e-commerce site to upsell products to a user.

The question we want to answer is, does adding the widget increase our revenue?

We run an A/B test with:

• Original checkout page
• Checkout page with widget

We know how much each user spent and what variant they have been given.

Lets generate some example transaction data in R:

We have randomly sampled the data from a log-normal distribution with equal mean and variance. We set the seed to ensure the results are repeatable. So in this case, we are looking to find there is no significant difference between the two datasets.

The classic statistical approach here would be to use a t-test. Let's instead apply our random permutation test.

First, let's compute the difference between the means of the two groups:

This gives a difference of 193.47. How likely is this to have happened by chance?

What we want to do is randomly shuffle our data between the two groups. If we were to sample (without replacement) once and compute the difference using the randomly shuffled version of groups:

This gave me a difference of 45.69. Now we will repeat this many times:

If we plot the examples, along with where our observed difference falls:

Straight away, it is fairly clear this observation could  just be due to random chance. It is not on the very extreme of the distribution. However let's compute a p-value

On the two-tailed test, we get a p-value of 0.102, so we would accept the null hypothesis of there being no difference. Notice the add one here on both the denominator and numerator. Essentially we are adding our original measured test statistic to the random permutations. This ensures we never get a zero probability.

The standard t-test would give a p-value of 0.0985, so roughly similar. However, what if we didn't care about the mean, but the median transaction value? Using random permutation tests, this is very simple to compute (a simple change to our code). Under classic statistical tests, we would have to go off and find the exact test we need to use under those conditions.

## Speed Improvements

On simple speed improvement is to parallelise the loop to compute the re-samples:

# Coin Package

As usual, R already has a package to help us do all of this. Using the same data as before we would run:

Generally you will find this is much faster for running large numbers of iterations.

One downside is you don't get the visualisation of how extreme the observed data is compared to the empirical sampled histogram (Figure 2). I find this graph extremely useful when explaining how extreme a result appears to be, extremely to a non-statistical audience

# Summary

Random permutation tests are a nice alternative to classic hypothesis tests. In many cases they will give you almost exactly the same results. Being able to visualise the distribution (Figure 2) can be a massive assistance in explaining the p-value.

• Almost no assumptions on the underlying dataset being analysed
• Can be used for any test statistic (either it is implemented in coin or can be programmed yourself).
• Can be applied to all sorts of data types (numerical, ordinal, categorical) without having to remember the exact parametric test you should use.

• Computing large number of re-samples is potentially slow. Although on modern computers this is less of a concern
• Relies on the null hypothesis, that there is no association between the dataset and so the group labels are interchangeable under the null hypothesis.

Personally, I like to use both classic statistical tests and random permutation tests, even if all they do is validate one another.

# Multiple Statistical Tests

A/B testing is common place in marketing and website optimisation. But why just stop at running two variations? Why not three or eight? I have certainly seen A/B tests suddenly grow to A/B/C/D... and so on. In statical literature, this is called multiple testing.

The short answer is there is nothing stopping you from doing this, however you do need to adjust your techniques for interpreting the test results or risk interpreting noise results as signal.

In fact, many A/B testing tools now offer you "multivariate testing" (Optimizely, maxymiser, etc.), which is just a fancy term for running factorial experiments and you will be running multiple tests at the same time.

• The common pitfalls when running multiple statistical tests
• Techniques to adjust for multiple comparisons
• How this affects sample size when planning experiments

# The dangers of multiple comparisons

One of my favourite examples of the danger of multiple comparisons is a study by Benett et al. 1

The very brief summary of the paper is:

• A very much dead Atlantic Salmon was placed in an fMRI machine.
• The fMRI measures the blood oxygen levels in units called voxels (think image pixels). Each voxel is measuring the change in blood oxygen levels. Typically fMRI machines will measure 130,000 voxels.
• The (dead) salmon was then shown a series of emotionally charged images and the difference between blood oxygen levels was measured.
• Funnily enough, with that many comparisons, you do find a statistically significant result. Can we conclude the dead's salmon brain was responding to the images? Well no, obviously, it's dead. If you do enough comparisons without any correction, you will end up with something significant.

Let's see this in action. We will randomly sample from a binomial distribution, with p=0.5, 100 times. Each time we will test if the observed proportion is significantly different from 0.5:

Run it a few times, most times you should see one one of the p-values come out as below the alpha value. For example:

We need to control for the fact we are doing a large number of tests. Next we will look at ways to deal with these issues.

# Techniques

## Family Wise Error Rate

The Family Wise Error Rate (FWER) is the probability of making at least one type I error. If V is the number of type I errors:

$FWER=P(V \gt0)$

We will looking at techniques that control the FWER to ensure $FWER \le \alpha$. We will only look at methods that control this in the strong sense, that it is valid for all configurations of true and false hypotheses.

### Bonferroni

The Bonferroni correction is the simplest method for controlling the FWER.  Simply you now reject the null hypothesis if:

$p_i \le \frac{\alpha}{m}$

where $p_i$ is the p-value of the i-th test and $m$ is the number of tests.

This is a very conservative correction and if you have large number of tests, you need very small p-values to find anything significant.

To adjust p-values based on Bonferroni correction:

$\hat{p}_i =m p_i$

To show this in R, let's go back to our previous example:

We should now, hopefully, see no significant results.

The p.adjust method is what we use to obtain the adjusted p-values. It is also possible to use pairwise.prop.test to do this all in one go, but I personally prefer to keep them separate (e.g. compute p-values, then adjust them).

### Holm

In Holm's method you order the p-values in increasing order. It is called a step down procedure. You compute a critical value for each p-value:

$a_i = \frac{\alpha}{m - i - 1}$

Starting with the smallest p-value, keeping rejecting hypotheses until the first where $p_i > \alpha_i$, at which point accept $H_i$ and all remaining hypothesises.

$\hat{p}_i= \left\{\begin{array}{l l}mp_i & \quad \text{if i=1}\\ \max(\hat{p}_{i-1},(m -i + 1)p_i) & \quad \text{if i=2,\dots,m}\end{array} \right.$

In R we can simply use:

Holm's method is equally as powerful in terms of assumptions as Bonferroni, and it less conservative. Therefore it is always suggested to use Holm's method over Bonferroni.

### Hochberg

Holm's method was known as a step down procedure, as you start with the smallest p-value and work upwards.

Hochbeg's method is a step up method. You start with the largest p-value , until you find the first $p_i \le \alpha_i$. It uses the same critical values as Holm's method.

The adjusted p-values are computed using:

$\hat{p}_i= \left\{\begin{array}{l l}p_m & \quad \text{if i=m}\\ \min(\hat{p}_{i+1},(m -i + 1)p_i) & \quad \text{if i=m-1,\dots,1}\end{array} \right.$

In R we can use:

An important consideration Hochberg's method is that each p-value must be independent or positively dependent.  If you are running A/B tests and ensured users have only ever seen one variant, then this assumption is valid.

Hochberg's method is considered more powerful than Holm's as it will reject more hypotheses.

### Hommel

Hommel's procedure is more powerful than Hochberg's but slightly harder to understand. It has exactly the assumptions at Hochberg's (independent or positively dependent p-values) and being more powerful it should be preferred over Hochberg's.

Hommel's procure rejects all p-values that are $\le \frac{\alpha}{j}$

The value for $j$ is found by:

$j = \max_{i=1,\dots,m}\{ p_{m-i+k}\gt\frac{k\alpha}{i}\textrm{for }k=1,\dots,i\}$

Think of this as us looking at all the sets of hypotheses. We are trying to find the largest $j$ where the condition $p_{n-i+k}\gt\frac{k\alpha}{i}$ is true for all the values of $k$.

Let's look at this in practice 2. Suppose we have 3 hypotheses with p-values $p_1=0.024$, $p_2=0.030$ and $p_3=0.073$. To find $j$ we compute:

For $i=1$, $p_3 = 0.073 > \alpha = 0.05$

For $i=2$, $p_3 = 0.073 > \alpha = 0.05, p_2 = 0.030 >\frac{ \alpha}{2}=0.025$

For $i=3$, $p_3 = 0.073 > \alpha = 0.05, p_2 = 0.030 <\frac{ 2\alpha}{3}=0.033,$

$p_1=0.024 > \frac{\alpha}{3} = 0.0167$

We find two sets of hypotheses where the statement is true for all $k$. The values of $i$ for these two sets were $\{1,2\}$, so $j=\max\{1,2\}=2$. We use $j$ and reject any p-values less than $\frac{\alpha}{2}$. Which in this case is $p_1=0.024$.

In the following R code, I have provided code for calculating $j$ and the adjusted p-values. The adjusted p-value calculation followed the algorithm in the Appendix of Wright's paper:

As with the other methods, R provides a pre-built function for calculating the adjusted p-values:

## False Discovery Rate

Up till now we have focused on methods to control the Family Wise Error Rate (FWER). Let V be the number of false positives in hypotheses we reject and P be the number of hypotheses rejected. In FWER methods we are ensuring:

$P(V \ge 0) \le\alpha$

We have applied various method (Holm, Hochberg, etc.) in order to ensure this. If we want to control False Discovery Rate (FDR), we will ensure:

$FDR =\mathbb{E}(\frac{V}{P}) \le \alpha$

That is the expected false positive rate will be below $\alpha$. Many FWER methods can be seen as too conservative and often suffer from low power. FDR methods offer a way to increase power but maintain a principled bound on the error.

The FDR methods were developed on the observation

4 false discoveries out of 10 rejected null hypotheses

is much more serious than

20 false discoveries out of 100 rejected null hypotheses

That is when we are running a large number of tests, we are possibly willing to accept a percentage of false discoveries if we still find something interesting.

Christopher Genovese has an extremely good tutorial on FDR and worth looking over for much more detail.

R provides two methods for calculating adjusted p-values (sometimes called q-values in the FDR context) based on controlling the FDR:

BH is the original Benjamini-Hochberg procedure where the concept of FDR was introduced. It has the same assumptions as the Hochberg procedure (e.g. independent or positively dependent p-values). BY is Benjamini–Yekutieli procedure, it has no assumptions on the dependency of the p-values (but as such it is more conservative than BH).

FDR methods have become popular in genomic studies, neuroscience, biochemistry, oncology and plant sciences.  Particularly where there is a large number (thousands often) of tests to be performed, but you don't want to miss interesting interactions.

## Choosing a method

We have covered a variety of techniques to handle multiple tests, but what one should you use? This is my very crude suggestions for choosing which method to use:

1. I have less than or equal to 50 tests and each test is independent or positively dependent - Use Hommel.
2. I have less than or equal to 50 tests but I do not know if the tests are independent or positively dependent - Use Holm.
3. I have more than 50 tests and each test is independent or positively dependent - Use FDR Benjamini-Hochberg.
4. I have more than 50 tests but I do not know if the tests are independent or positively dependent - Use FDR Benjamini–Yekutieli.

The 50 tests is an arbitrary choice but in general if you are using a lot of tests you might want to consider FDR methods.

## Sample Size

Intuitively, the more tests you run, the larger sample size you will need. While there a certainly more complicated methods for determining sample size, I will describe one simple method:

• Apply the Bonferroni correction $\alpha'=\frac{\alpha}{m}$
• Plug this $\alpha'$ into the sample size calculations discussed previously.

This is likely to be an over-estimate of the sample size required, but will give you a good indication of how many more samples will be needed.

### Alternatives

This is by no means an exhaustive description of how to deal with multiple tests. In this article I have focused on ways to adjust existing p-values. Other methods it may be worth exploring:

• ANOVA and Tukey HSD tests (Post-hoc analysis in general). One issue here is that it assumes normality of the data. This may mean you need to perform transformations like Arcsine Square Root 3 on proportion data.
• Bootstrap methods - Simulating how unlikely a sample is to occur if they were not independent.
• Bayesian methods

Potentially in later articles I will try to explore some of these methods further.

1. See  http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
2. Example courtesy of http://www2.math.uu.se/research/pub/Ekenstierna.pdf
3. https://www.biostars.org/p/6189/

# R Performance (Part II)

In R performance (Part I), we looked into how to write R code in an efficient way.

In the second part, we will look into more explicit ways of improving R performance.

# Parallel Loops

Doing some sort of loop is almost unavoidable in R. A simple optimisation is to run the loop in parallel.

A obvious machine learning application is when running cross-validation. If you want to run 10-fold cross-validation, if you have 10 cores available, you can run them all in parallel.

The parallel library has a multicore version of lapply, let's look at the example below:

If you already make use of lapply etc. in your code, modifying to use the multi-core version requires very little code changes.

Sadly this code will not work on Windows as it relies on Unix's fork command. Instead use the following on a Windows machine:

Update: Nathan VanHoudnos has re-implemented the mclapply function to support Windows, see his comment for more details on how to use this.

Update: One issue I have observed when using mclapply on Linux, if the machine runs out of memory, R will arbitrarily and silently kill processes on some of the cores. This will mean you will not get as many results as you expect. A good error check is to ensure your results has the correct size e.g. same size as your inputs

If instead you prefer for loop syntax,  you can use the foreach package:

The important command here is %dopar%, this says to perform the loop in parallel. If you were to use  %do% it would run on a single process.

#### Memory Usage

In a Unix system (I have no idea how this will work on Windows), when you fork a process (what mclapply does), you get a copy of the current processes memory (think R environment).

However this is called "copy-on-write", which means unless you modifying the data it will never physically be copied.

This is important, as much as possible you should try to avoid modifying the data inside your mclapply function. However, often this is unavoidable e.g. in your cross validation loop you will need to split data into test and train. In this case, you just need to be aware you will be increasing your memory usage. You may have to trade off how many cores you can use with how much memory you have available.

# Optimised Linear Algebra Library

The default BLAS library used by R is not particular well tuned and has no support for multiple cores. Switching to an alternative BLAS implementation can give a significant speed boost.

Nathan VanHoudnos has an excellent guide on installing alternative BLAS libraries. Here I will summarise the different options:

###### OpenBLAS

OpenBLAS is generally the easiest to install (on Ubuntu you can use apt-get) and has out of the box support for multicore matrix operations.

On major issue (described here) is that currently OpenBLAS multicore matrix operations do not play well with R's other multicore functionality (parallel, foreach). Trying to use them together will result in segfaults. However, as long as you are aware of this you can design your code around this e.g. only using parallel loops when nothing inside of the loop utilises matrix algebra.

###### ATLAS

ATLAS has the potential to be the most tuned to your particular machine setup. However using out-of-the-box installations (e.g. via apt-get) will generally only support a single core.

In order to get the most out of ATLAS (multicore, optimised to your machine) you will need to compile it from source and this can be a painful experience and probably only worthwhile if you are familiar with compiling from source on Unix machines.

###### MKL

Intel provide their Math Kernel Library (MKL) optimised version of BLAS. It works on Intel and Intel compatible processors. Revolution R comes with MKL pre-packaged with it.

R-Bloggers has a guide to installing MKL. Note, MKL is free for non-commercial usage, but commercial usage will require a license.

###### vecLib

If you are using Mac OS X, Apple kindly provide you with a optimised BLAS library. Details of how to link with R can be found here. It is very easy to link and provides excellent performance.

However, this only works with Mac OS X, so not really relevant if you are planning to work in a server environment,

###### Which to use?

Obviously what to use is completely up to you, all have some disadvantages. Personally, I use vecLib on my Mac and we use OpenBLAS on our servers. This means we have to write our R code to not to use parallel loops and mutlicore matrix operations at the same time. However this is not a massive overhead (if you are trying to do both at the same time you will generally end up thrashing your CPUs anyway). The advantage is, spinning up new R servers does not involve any compilation from source.

# Pay

At this point, linking to optimised BLAS version may look quite painful. Instead an alternative option is to pay for Revolution R. They pre-build their version of R with an optimised multicore BLAS version. They have various benchmarks on the performance improvements.

Revolution R also has various libraries for handling large datasets, parallelised statistical modelling algorithms, etc. Although it all comes with a fairly hefty price tag.

Update: Revolution R have just announced Revolution R Open a free version of Revolution R. In particular it comes linked against MKL and has the Reproducible R Toolkit to manage package upgrades. Currently this looks like the best option for using R in a production environment.

# Byte Code Compiler

The compiler package allows you to compile R functions to a lower-level byte code. This can provide performance improvements of between 2-5X.

Let's look at a very simple example below:

The cmpfun is used to compile the function to byte code. You call your function in the exact same way you would before. To compare performance:

In this case, we see around a 2.9X speedup. You will see the best speed-up on functions that involve mostly numerical calculations. If your functions mainly call pre-built R functions or manipulate data types, you probably won't see any drastic speed-up.

You can also enable Just-In-Time (JIT) compilation, removing the need to call cmpfun directly:

The value passed to enableJIT controls the level of compilation, it should be between 0 and 3, 0 being no compilation; 3 being max compilation. This may initially slow down R as all the functions need to be compiled, but may later speed it up. You can also enable it via the R_ENABLE_JIT environment variable.

For more information, R-statistics has a great tutorial on compiler library and JIT.

# Summary

R is constantly evolving, so along with these tips you should always try to keep your R version up to date to get the latest performance improvements.

Radford Neal has done a bunch of optimisations, some of which were adopted into R Core, and many others which were forked off into pqR. At the time of writing, I don't think pqR is ready for production work, but definitely worth watching.

With well optimised code, the right libraries, R is capable of handling pretty large data problems. At some point, your data may be too large for R to handle. At this point I look to Hadoop and Spark to scale even further. My rough guide, if your data is greater than 50GB (after pre-processing) R is probably not the right choice.