Category Archives: Machine Learning

Collaborative Filtering using Alternating Least Squares

Collaborative filtering is commonly used in recommender systems. The idea is if you have a large set of item-user preferences, you use collaborative filtering techniques to predict missing item-user preferences. For example, you have the purchase history of all users on an eCommerce website. You use collaborative filtering to recommend which products a user might purchase next. The key assumption here is people that agreed in the past (purchased the same products) will agree in the future.

In this post I will look at how to perform Matrix Factorisation via Alternating Least Squares as a method of Collaborative Filtering. This method gave one of the best results of a single method on the Netflix dataset.

Matrix Factorisation

The basic idea behind any matrix factorisation method is to find a smaller set of latent factors that can be used to predict missing entries

Matrix Factorisation (1)


In the above example, we want to decompose the purchases matrix into a set of user latent factors and a set of product latent factors.

Usually we are not able to perform this factorisation using direct methods (e.g. SVD) as the input matrix contains mostly missing values. To solve this we treat the latent factors as parameters to be learned and treat the factorisation as an optimisation problem.

Explicit versus Implicit data

One major distinction is the type of data we have available as this will affect how we model the data. Broadly speaking it is split into explicit and implicit data.

Examples of explicit data collection include the following:

  • Asking a user to rate an item on a sliding scale.
  • Asking a user to rank a collection of items from favourite to least favourite.
  • Presenting two items to a user and asking him/her to choose the better one of them.
  • Asking a user to create a list of items that he/she likes.

Examples of implicit data collection include the following:

  • Page views in an online store
  • Online purchases online.
  • Obtaining a list of items a users listens, watches on their computer. Apps they have installed.

Generally companies wishing to solve this sorts of problems have large amounts of implicit data available. I will present how the Alternating Least Squares (ALS) method can be used for explicit data and then extend it to implicit data.


Lets assume we have a number of users, and a number of items. We refer to users using the syntax u and v and refer to items with i and j.

The users and items are associated through r_{u,i} which we will call observations. For explicit datasets this would be the ratings a user has given an item. Higher numbers mean stronger preference. For implicit datasets these would indicate user actions. For instance it might be the number of times a user has purchased an item or viewed an item online.


We will associate each user u with a user-factor vector x_u and each item with a item-factor vector y_i. Our aim is to predict our observations from the latent factors:

\hat{r}_{u,i} = x^\top_uy_i

We want our predictions to be as close to the ground truth as possible. Therefore to find the latent vectors we frame it as an optimisation problem using a standard squared loss with regularisation:

\min_{x, y} \sum_{r_{u,i}\text{is known}}(r_{u,i} -x^\top_uy_i )^2 + \lambda(\sum_u||x_u||^2 + \sum_i||y_i||^2)

Let's group all the user-factors in an m \times f matrix X and all the item-factors in an n \times f matrix Y. If we assume either the user-factors or item-factors was fixed, this would look exactly like a regularised least squares problem. So, first lets assume Y is fixed and solve for x_u:

x_u=(Y^{\top}Y + \lambda I)^{-1}Y^\top r_u

Likewise if we assume X is fixed, and solve for y_i:

y_u=(X^{\top}X + \lambda I)^{-1}X^\top r_i

The idea is to iterate these two steps until some stopping criterion is reached. The original paper suggested 10 iterations.

For further ways to model explicit data, see this paper on the Netflix prize.

Example python implementation

Implicit data

Using the raw observations directly has been found not to work as well on implicit data 1. Instead we define a set of binary observation variables:

p_{u,i}=\begin{cases}1& \text{if } r_{u,i}\gt 0\\0&\text{otherwise}\end{cases}

Now we want to define some confidence levels for each p_{u,i}. When r_{u,i} = 0 we have low confidence, it may be the user has never been exposed to that item or it may be unavailable at the time. When r_{u,i} is small, it could be explained by the user buying a gift for someone else etc., hence we should still have low confidence. When r_{u,i} is larger, we should have much more confidence.

We have complete freedom how we define confidence. The original proposal was

c_{u,i}=1+\alpha r_{u,i}

We have some level of confidence for each p_{u,i}. The \alpha parameter is tuneable, to the dataset via cross-validation.

Our new optimisation function is

\min_{x, y} \sum_{r_{u,i}\text{is known}}c_{u,i}(p_{u,i} -x^\top_uy_i )^2 + \lambda(\sum_u||x_u||^2 + \sum_i||y_i||^2)

We define the diagonal matrix C^i, where C^i_{u,u}=c_{u.i}. The parameter learning is the same as a standard weighted least squares setup:

y_u=(X^{\top}C^iX + \lambda I)^{-1}X^\top C^ip_i

x_u=(Y^{\top}C^uY + \lambda I)^{-1}Y^\top C^up_u

Computing X^{\top}C^iX can be expensive. One optimisation can be used is the fact: X^{\top}C^iX=X^{\top}X +X^{\top}(C^i - I)X. You can pre-compute X^{\top}X at each step, and notice that (C^i - I) only has the non-zero entries where r_{u,i} was non-zero.

You might wonder, why bother with the confidence weights and why not scale the ratings instead? The problem there is that you turn a very sparse problem into a very dense problem and you will have trouble even numerically computing the MSE on the dense problem.

Example python implementation.


In the last section we will discuss a number of points when applying this algorithms on practical problem.


Let's look at the loss function again

\mathcal{L} = \min_{x, y} \sum_{r_{u,i}\text{is known}}(r_{u,i} -x^\top_uy_i )^2 + \lambda(\sum_u||x_u||^2 + \sum_i||y_i||^2)

The loss is non-convex in (X, Y). When we assume Y is fixed, \mathcal{L} is now convex in X. Each step will move us toward a minimum, but not necessarily a global minimum.

This will mean that each time you the algorithm you will potentially get a different result depending on how you initialise X or Y. It is worth running the algorithm several times when testing different hyperparameter values.

Also note, if you make \lambda large enough you will converge to the same solution each time. This is because are constraining your solution space.


As the problem is non-convex, initialising the parameters in a good way can help you converge to a solution faster.

One suggested approch, currently used in Spark:

Choose a unit vector uniformly at random from the unit sphere, but from the "first quadrant" where all elements are nonnegative. This can be done by choosing elements distributed as Normal(0,1) and taking the absolute value, and then normalising.

One interesting thought here is that better initialisations could start you off closer to a global minimum. I would welcome any comments if anyone is aware of better approaches, I've not seen any literature on this.


Exact values of \lambda and \alpha are data dependent and should be determined  by cross-validation.


Interestingly, the original paper does not suggest using any form of user or item biases. A better approach could be to define:

\hat{r}_{u,i} = \mu + \beta_u + \beta_i + x^\top_uy_i

where \mu would be a global average, \beta_u represents a user bias and \beta_i represents an item bias. These biases help us account for the difference in users and items e.g. some users buying lots of products.

The global average \mu makes sense with explicit data (the average overall rating). It may not sense to use with implicit data (where most of r_{u,i}=0).


Zhou et al. suggest weighting each term in the regularisation by n_u and n_i the number of observations of user u and items i respectively. The only reason I can see do to this is that it makes \lambda less dependent on the scale of the dataset, so \lambda from a sampled dataset should work well on a full dataset.

Stopping Criteria

When to stop the algorithm is usually governed by:

  • If a maximum number of iterations is met
  • If the difference between the measured MSE of the current iteration and previous iteration goes below some epsilon

Confidence Functions

You are free to use any function for the confidence weight you like. Straightforward linear scaling might not be what you want. For instance, in a e-Commerce scenario, you might use weights along the lines of

  • 3 for a product page view
  • 10 for an add to basket
  • 40 for a product purchase
  • etc.

 ALS versus SGD

The loss function \mathcal{L} could be optimised via stochastic gradient descent. It would potentially be easier and faster than ALS. However ALS offers two advantages:

  • ALS is much easier to parallelise.
  • In my experience, ALS converges very quickly. Generally I find less than 10 iterations will converge to a decent solution.

Hopefully this post provides an overall introduction to ALS and highlights some of the considerations you should take into account when implementing this algorithm.


Computing PCA with SVD

PCA is a great tool for performing dimensionality reduction. Two reason you might want to use SVD to compute PCA:

  • SVD is more numerically stable if the columns are close to collinear. I have seen this happen in text data, when certain terms almost always appear together.
  • Spark's PCA implementation currently doesn't support very wide matrices. The SVD implementation does however.

Singular Value Decomposition (SVD)

Below we briefly recap Singular Value Decomposition (SVD).

Let \mathbf{A} be a m \times n matrix, the singular value decomposition gives

\mathbf{A} = \mathbf{U}_{m\times m}\mathbf{\Sigma}_{m\times n}\mathbf{V}^\top_{n\times n}

\mathbf{U} is an orthonormal matrix and is the eigenvectors of \mathbf{A}\mathbf{A}^\top.

\mathbf{V} is an orthonormal matrix and is the eigenvectors of \mathbf{A}^\top\mathbf{A}.

\mathbf{\Sigma} is a diagonal matrix and contains the square-roots of the eigenvalues of\mathbf{A}\mathbf{A}^\top and  \mathbf{A}^\top\mathbf{A} e.g.

\mathbf{\Sigma}=\begin{pmatrix}\sqrt{\lambda_1}&0&\cdots& 0\\0 &\sqrt{\lambda_2} & \cdots &0\\\vdots & \vdots & \ddots & \vdots \\0&0& \cdots &\sqrt{\lambda_k}\end{pmatrix}

Remember, as \mathbf{U} is an orthonormal matrix \mathbf{U}^\top=\mathbf{U}^{-1}

Computing PCA

Start with the standard steps of PCA:

  1. Mean centre the matrix \mathbf{A}
  2. Optionally scale each column by their standard deviation. You may want to do this if the variables are measured on different scales.

We noted in the previous section that \mathbf{V} is the eigenvectors of \mathbf{A}^\top\mathbf{A} (the covariance matrix). Thus the principal component decomposition is

\begin{equation} \begin{split}\mathbf{T} & = \mathbf{A}\mathbf{V}\\ & =\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\mathbf{V} \\ & =\mathbf{U}\mathbf{\Sigma}\end{split}\end{equation}

To reduce the dimensionality of \mathbf{A}, select the k largest singular values (k \le n), select the first k columns from \mathbf{U} and the k \times k upper-left part of \mathbf{\Sigma}. The reduced dimensionality is given by

\mathbf{T}_k =\mathbf{U}_k\mathbf{\Sigma}_k

Common Pitfalls in Machine Learning

Over the past few years I have worked on numerous different machine learning problems. Along the way I have fallen foul of many sometimes subtle and sometimes not so subtle pitfalls when building models.  Falling into these pitfalls will often mean when you think you have a great model, actually in real-life it performs terribly. If your aim is that business decisions are being made based on your models, you want them to be right!

I hope to convey these pitfalls to you and offer advice on avoiding and detecting them. This is by no means an exhaustive list and I would welcome comments on other common pitfalls I have not covered.

In particular, I have not tried to cover pitfalls you might come across when trying to build production machine learning systems. This article is focused more on the prototyping/model building stage.

Finally, while some of this is based on my own experience, I have also drawn upon John Langford's Clever Methods of Overfitting, Ben Hamner's talk on Machine Learning Gremlins and Kaufman et al. Leakage in Data Mining. All are worth looking at.

Traditional Overfitting

Traditional overfitting is where you fit an overly complicated model to the trained dataset. For example, by allowing it to have too many free parameters compared to the number of training points.

To detect this, you should always be using a test set (preferably you are using cross validation). Plot your train and test performance and compare the two. You should expect to see a graph like figure 1. As your model complexity increases, your train error goes to zero. However, your test error follows an elbow shape, it improves to a point then gets worse again.

Figure 1. Train/Test Performance when varying number of parameters in the model

Potential ways to avoid traditional overfitting:

More training data

In general, the more training data you have the better. But it may be expensive to obtain more.

Before going down this route it is worth seeing if more train data will actually help. The usual way to do this is to plot a learning curve, how the training sample size affects your error:

Figure 2. Two example learning curves

In the left-hand graph, the gradient of the line at our maximum training size is still very steep. Clearly here more training data will help.

In the right-hand graph, we have started to reach a plateau, more training data is not going to help too much.

Simpler Predictor Function

Two ways you could use a simpler predictor:

  • Use a more restricted model e.g. logistic regression instead of a neural network.
  • Use your test/train graph (figure 1) to find an appropriate level of model complexity.


Many techniques have been developed to penalise models that are overly complicated (Lasso, Ridge, Dropout, etc.). Usually this involve setting some form of hyper-parameter. One danger here is that you tune the hyper-parameter to fit the test data, which we will discuss in Parameter Tweak Overfitting.

Integrate over many predictors

In Bayesian Inference, to predict a new data point x:

p(x|\mathbf{D},\alpha) = \int_{\theta} p(x|\theta,\alpha) \, p(\theta|\mathbf{D},\alpha)

where \alpha is some hyper-parameters, \mathbf{D} is our training data and \theta are the model parameters. Essentially we integrate out the parameters, weighting each one by how likely they are given the data.

Parameter Tweak Overfitting

This is probably the type of overfitting I see most commonly. Say you use cross validation to produce the plot in figure 1. Based on the plot you decide the 5 parameters is optimal and you state your generalisation error to be 40%.

However, you have essentially tuned your parameters to the test data set. Even if you use cross-validation, there is some level of tuning happening. This means your true generalisation error is not 40%.

Chapter 7 of the Elements of Statistical Learning discuss this in more detail. To get a reliable estimate of generalisation error, you need to put the parameter selection, model building inside, etc. in an inner loop. Then run a outer loop of cross validation to estimate the generalisation error:

For instance, in your inner loop you may fit 10 models, each using 5X CV. 50 models in total. From this inner loop, you pick the best model. In the outer loop, you run 5X CV, using optimal model from the inner loop and the test data to estimate your generalisation error.

Choice of measure

You should use whatever measure is canonical to your problem or makes the most business sense (Accuracy, AUC, GINI, F1, etc.). For instance, in credit default prediction, GINI is the widely accepted measurement.

It is good to practice to measure multiple performance statistics when validating your model, but you need to focus on one for measuring improvement.

One pitfall I see all the time is using accuracy for very imbalanced problems. Telling me that you achieved an accuracy of 95%, when the prior is 95% means that you have achieved random performance. You must use a measure that suits your problem.

Resampling Bias and Variance

When measuring your test performance you will likely use some form of resampling to create multiple test sets:

  • K-fold cross validation
  • Repeated k-fold cross validation
  • Leave One Out Cross Validation (LOOCV)
  • Leave Group Out Cross Validation (LGOCV)
  • Bootstrap

Ideally we want to use a method that achieves low bias and low variance. By that I mean:

  • Low bias - The generalisation error produced is close to the true generalisation error.
  • Low variance (high precision) - The spread of the re-sampled test results is small


In general, the more folds you use the lower the bias, but the higher the variance.

You will need to pick a resampling method to suit your problem. With large training sets, k-fold cross validation with a k of 5 may suit. With small training sets, you will need a larger k.

Max Kuhn has done a fantastic empirical analysis of the different methods in relation to bias and variance (part I, part II). In general repeated 10-fold cross validation seems to be quite stable across most problems in terms of bias and variance.

Bad Statistics

Once you have computed the results of your cross validation, you will have a series of measurement of your test performance. The usual thing to do here is to take the mean and report this as your performance.

However just reporting the mean may hide the fact there is significant variation between the measurements.

It is useful to plot the distribution of the performance estimates. This will give you and idea  of how much variation there is:ScoresHistIf you need to boil it down to a single number, computing the standard deviation (or variance) can be useful. However you often see performance quoted like: 80% +/- 2% where the 80% is the mean and the 2% is the standard deviation. Personally I dislike this way of reporting as it suggests the true performance is between 78%-82%. I would quote them separately; mean and standard deviation.

If you want to write 80% +/- 2%, you would need to compute bounds on the estimate.

Information Leakage

Information Leakage occurs when data on your training label leaks into your features. Potentially it is even more subtle, where irrelvant features appear as highly predictive, just because you have some sort of bias in the data you collected for training.

As an example, imagine you are an eCommerce website and want to predict converters from the way people visit your website. You build features based on the raw URLs the users visit, but take special care to remove the URLs of the conversion page (e.g. Complete Purchase). You split your users into converters (those reaching the conversion page) and non-converters. However, there will be URLs immediately before the conversion page (checkout page, etc.) that will be present in all the converters and almost none of the non-converters. Your model will end up putting an extremely high weight on these features and running your cross-validation will give your model a very high accuracy. What needed to be done here was remove any URLs that always occur immediately before the conversion page.


Feature Selection Leakage

Another example I see regularly is applying a feature selection method that looks at the data label (Mutual Information for instance) on all of the dataset. Once you select your features, you build the model and use cross-validation to measure your performance. However your feature selection has already looked at all the data and selected the best features. In this sense your choice of features leaks information about the data label. Instead you should have performed inner-loop cross validation discussed previously.


Often it can be difficult to spot these sorts of information leakage without domain knowledge of the the problem (e.g. eCommerce, medicine, etc.).

The best advice I can suggest to avoid this is to look at the top features that are selected by your model (say top 20, 50). Do they make some sort of intuitive sense? If not, potentially you need to look into them further to identify if their is some information leakage occurring.

Label Randomisation

A nice method to help with the feature selection leakage is to completely randomly shuffle your training labels right at the start of your data processing pipeline.

Once you get to cross validation, if your model says it has some sort of signal (an AUC > 0.5 for instance), you have probably leaked information somewhere along the line.

Human-loop overfitting

This is a bit of a subtle one. Essentially when picking what parameters to use, what features to include, it should all be done by your program. You should be able to run it end-to-end to perform all the modelling and performance estimates.

It is ok to be a bit more "hands-on" initially when exploring different ideas. However your final "production" model should remove the human element as much as possible. You shouldn't be hard-coding parameter settings.

I have also seen this occurring when particular examples are hard to predict and the modeller decides to just exclude these. Obviously this should not be done.

Non-Stationary Distributions

Does your training data contain all the possible cases? Or could it be biased? Could the potential labels change in the future?

Say for example you build a handwritten digit classifier trained on the the MNIST database. You can classify between the numbers 0-9. Then someone gives you handwritten digits in Thai:



How will your classifier behave? Possibly you need to obtain handwritten digits for other languages, or have an other category that could incorporate non Western Arabic numerals.


Potentially, your training data may have gone through some sort of sampling procedure before you were provided it. One significant danger here is that this was sampling with replacement and you end up with repeated data points in both the train and test set. This will cause your performance to be over-estimated.

If you have a unique ID for each row, check these are not repeated. In general check how many rows are repeated, you may expect some, but is it more frequent than expected?


My one key bit of advice when building machine learning models is:

If it seems to good to be true, it probably is.

I can't stress this enough, to be a really good at building machine learning models you should be naturally sceptical. If your AUC suddenly increases by 20 points or you accuracy becomes 100%. You should stop and really look at what you have done. Have you fallen trap of one of the pitfalls described here?

When building your models, it is always a good idea to try the following:

  • Plot learning curves to see if you need more data.
  • Use test/train error graphs to see if your model is too complicated.
  • Ensure you are running inner-loop cross validation if you are tweaking parameters.
  • Use repeated k-fold cross validation if possible.
  • Check your top features - Do they make sense?
  • Perform the random label test.
  • Check for repeated unique IDs or repeated rows.

Calibrating Classifier Probabilties

You've built your classifier, run cross-validation and have a super high AUC. So you are done right? Maybe not.

Most classifiers output a score of how likely an observation is to be in the positive class. Usually these scores are between 0 and 1 and get called probabilities. However these probabilities often do not reflect reality, e.g. a probability of 20% may not mean it has a 20% chance of happening, it could have a 10%, 50%, 70%, etc. chance of happening. Our aim should be that our model outputs accurately reflect posterior probabilities P(Y=1|x).

In the post we will mainly focus on binary classifiers. Later in the post will will talk about how to extend these ideas to mutliclass problems.

Why it happens

Our models can output inaccurate probabilities for a variety of reasons:

  • Flawed model assumptions (e.g. independence in a Naive Bayes Model)
  • Hidden features not available at training time
  • Deficiencies in the learning algorithm.

In terms of learning algorithms, as noted in Niculescu-Mizil et al and through my own research:

  • Naive Bayes tends to push probabilities toward the extremes of 0 and 1.
  • SVMs and boosted trees tend to push probabilities away from 0 and 1 (toward the centre)
  • Neural networks and random forests tend to have well calibrated probabilities.
  • Regularisation also tends to push probabilities toward the centre.

Do we care?

Whether or not we want well calibrated probabilities depends entirely on the problem we are trying to solve.

If we only need to rank observations from most likely to least likely, then calibration is unnecessary.

Examples of problems I have worked on where calibrated probabilities are extremely important:

  • Loan default prediction - Banks will generally be setting thresholds on the probabilities, auto-reject if probability of default is above 30%, etc.
  • Ad Click Prediction - Decided what ad to show, how much to bid. You might use a baseline Click Through Rate (CTR), and compare your prediction to this to see how much more you are willing to pay for this ad impression. 1
  • Demographics Estimation of Websites - Imagine you have predictions of gender/ages, as probabilities, for a number of web users. Estimating the gender distribution on a website, can be done by just averaging the probabilities of the users seen on the site. Any bias in the probabilities, will generate a bias in your estimate.


A reliability diagram is a relatively simple technique for visualising how well calibrated our classifier is.  As described in Niculescu-Mizil et al:

On real problems where the true conditional probabilities are not known, model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982). First, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc.
For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line.

Below we provide a piece of R code for producing relibability diagrams. Here we generalise the number of bins to be a user defined parameter.

In the figure below we show an example reliability plot. Ideally the reliability plot (red line) should be as close to the diagonal line as possible. As there is significant deviation from the diagonal, calibrating the probabilities will possible help.


It is also worth mentioning that if you take the mean of the score distribution, it should ideally be close to the prior.

Techniques for calibration


The most important step is to create a separate dataset to perform calibration with. Our steps for calibration are:

  • Split dataset into test and train
  • Split the train set into model training and calibration.
  • Train the model on train set
  • Score test and calibration set
  • Train the calibration model on calibration set
  • Score the test set using calibration

How much data to use for calibration will depend on the amount of data you have available. The calibration model will generally only be fitting a small number of parameters (so you do not need a huge volume of data). I would aim for around 10% of your training data, but at a minimum of at least 50 examples.

Platt Scaling

Platt Scaling essentially involves fitting a logistic regression on the classifier output. Originally developed to fit probabilities to the outputs of SVM 2, it is also well suited to the output of most other classifiers.

The reliability diagram below shows the original reliability plot (green) and after Platt Scaling (red).

The Platt Scalding should not change the rank of the observations, so measures such as AUC will be unaffected. However, measures like Log Loss 3 will be improved. In this example, Log Loss was originally 0.422 and improved to 0.418.

In Platt's original paper suggests , instead of using the original {0,1} targets in the calibration sample, it suggests to mapping to:

t_+=\frac{N_+ + 1}{N_+ + 2}


where N_+ and N_- are the number of positive and negative examples in the calibration sample.

To some extent this introduces a level of regularisation. Imagine if you only gave probabilities of either 0 or 1 and you correctly predicted all examples. Your Log Loss would be zero. With Platt's transformation, you Log Loss would be non-zero. As the Log Loss is what you are optimising when fitting the logistic regression, a level of regularisation is introduced.

In my experiments, this transformation had little to no effect on the reliability diagram and Log Loss, so seems an unnecessary step. It may be useful if you have very few examples and overfitting is more of a concern (therefore regularisation would help). You could also use a ridge or lasso regression.

Isotonic Regression

With Isotonic Regression you make the assumption:

y_i = m(f_i) + \epsilon_{i}

where m is an isotonic (monotonically increasing or decreasing) function. This is the exact same assumptions we would use for least squares, except m is now a isotonic function instead of linear.

Below is an R example of how to perform isotonic regression using the isoreg function.

In the figure below we show an example of the sort of function fitted by the isotonic regression model:IsoModelNotice how it goes up in steps instead of a smooth curve. To smooth the fit, we perform a linear interpolation between each step.IsoScaled

In the reliability plot above, the original uncalibarated scores are shown in green and the isotonic regression scores are shown in red. In this example we find isotonic regression actually made it worse. The Log Loss for instance went from 0.422 to 0.426. The AUC was also reduced.

Multiclass Classification

What happens if you have more than two classes? Firstly I would recommend visualising the problem as a series of reliability diagrams. For k classes, you can create k reliability diagrams.

Secondly, you can take the score for each of your k classes and plug them into a multinomial logistic regression. The superb glmnet package implements a multinomial logistic regression. You can set the regularisation parameter to something quite small. One word of caution, if you have many classes, overfitting can become an issue. At this point it is worth optimising the regularisation parameter.

If your favourite machine learning model (e.g. SVM) doesn't directly support multi-class classification, you can fit a 1 vs. all set of classifiers and then plug each of those scores into the  multinomial logistic regression.


Classifier probability calibration can be an important step in your machine learning pipeline. The first step is to always visualise and see how much of an issue you have. In general I have found Platt Scaling to be the simplest and most effective approach to most calibration of classification problems.

  1. See Google's paper on issues with ad click prediction:
  2. See Platt's original paper: 
  3. Log loss definition:

Area Under the receiver operator Curve (AUC)

AUC is a common way of evaluating binary classification problems. Two main reason are:

  • It's insensitive to unbalanced datasets e.g. fraud detection, conversion prediction.
  • It does not require the predictions to be thresholded (e.g. assign to one class or the other). Operates directly on classification scores.

An AUC of 0.5 is random performance, while an AUC of 1 is perfect classification. Personally I transform this to GINI = 2AUC - 1.  The main reason being, that intuitively for me 0 being random and 1 being perfect makes more sense.

Understanding AUC

First let us look at a confusion matrix:

 Positive LabelNegative Label
Predicted PositiveTrue Positive (TP)False Positive (FP)
Type I Error
Predictive NegativeFalse Negative (FN)
Type II Error
True Negative (TN)
True Positive Rate = TPR = TP / (TP + FN)False Positive Rate = FPR = FP / (FP + TN)

Imagine we have a set of scores from a classification model (this could be probabilities, etc.) along with their true labels.  The numbers in the confusion matrix assume we have pick a threshold t, where >= t we assign the positive label and < t we assign the negative label.

If we have M unique scores from the classification model, we could have M +1 possible choices of t. If we plot the True Positive Rate and False Positive Rate over all the ranges of t, we get our Receiver Operating Characteristic (ROC) Curve.

Figure 1: ROC Curve

Figure 1 shows an example ROC. The blue line is our performance, the grey dotted line is an example of what we would see from random classification.

The Area Under the receiver operator Curve (AUC) for this example would be the area under the blue line. The area under the grey line would be 0.5, hence why an AUC of 0.5 is random performance and AUC of 1 is perfect classification.

Score Distributions

"Receiver Operating Characteristic" by kakau. Original uploader was Kku at de.wikipedia - Transferred from de.wikipedia; transferred to Commons by User:Jutta234 using CommonsHelper.(original text: Selbstgepinselt mit PowerPoint). Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons -
Figure 2: Moving the classification threshold

In the top-left of figure 2 we have a hypothetical score distribution of the output of a classifier. The two "clumps" of scores are the negative (blue) and positive (red) examples. You can imagine as we move the threshold line from left to right, we will change the values of the TP and FP and therefore the TPR and FPR.

How much the score distributions of the negatives and positive overlap will determine our AUC.  In figure 3 we show some different score distributions, the red distribution is the positive class, the blue the negative.

Figure 3: Examples of different score distributions

In the top example, wherever we set the threshold we get a perfect TPR. In this case we would get an AUC of 1.

In the middle example, wherever we set the threshold, our TPR will be the same as our FPR. Hence we will see an AUC of 0.5

In the bottom example, we have some overlap of the score distributions, so our AUC will be somewhere between 0.5 and 1.

The shape of the score distributions and amount of overlap will affect the shape of the ROC Curve and therefore the AUC we end with.


The R pROC library has functions for computing ROC, AUC and plotting.