R Performance (Part I)

R as a programming language is often considered slow. However, more often than not it is how the R code is written that makes it slow. I’ve seen people wait hours for an R script to finish when, with a few modifications, it would take minutes.

In this post I will explore various ways of speeding up your R code by writing better code. In Part II, I will focus on tools and libraries you can use to optimise R.

Vectorisation

The single most important piece of advice when writing R code is to vectorise it as much as possible. If you have ever used MATLAB, you will be aware of the difference in speed between vectorised and un-vectorised code.

Let us look at an example:

Here we have used a loop to increment the contents of a. Now using a vectorised approach:

Notice the massive performance increase in elapsed time.
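As a minimal sketch of this kind of comparison (the vector size and the names b, loop_time and vec_time are illustrative assumptions, and timings will vary by machine):

```r
# Illustrative sizes and names; not the original benchmark.
n <- 1e6
a <- numeric(n)

# Loop version: increment one element at a time.
loop_time <- system.time({
  for (i in seq_along(a)) {
    a[i] <- a[i] + 1
  }
})

# Vectorised version: a single expression updates the whole vector.
b <- numeric(n)
vec_time <- system.time({
  b <- b + 1
})

print(loop_time["elapsed"])
print(vec_time["elapsed"])
```

Both versions produce the same vector; only the time taken to get there differs.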

Another option is to use inherently vectorised functions such as ifelse and diff. Let’s look at the example below:

Again we see elapsed time has been massively reduced, a 93X reduction.
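A sketch of what replacing loops with ifelse and diff can look like is shown here. The data, the 0.5 threshold and the variable names are assumptions chosen for illustration.

```r
set.seed(42)
x <- runif(1e5)  # illustrative data

# Loop version of a conditional recode.
labels_loop <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0.5) {
    labels_loop[i] <- "high"
  } else {
    labels_loop[i] <- "low"
  }
}

# Vectorised equivalent: ifelse evaluates the condition for every element at once.
labels_vec <- ifelse(x > 0.5, "high", "low")

# Loop version of successive differences.
d_loop <- numeric(length(x) - 1)
for (i in seq_len(length(x) - 1)) {
  d_loop[i] <- x[i + 1] - x[i]
}

# Vectorised equivalent: diff computes x[i + 1] - x[i] for all i.
d_vec <- diff(x)
```

Wrapping each variant in system.time() shows the difference on your own machine.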

When you have a for loop in your code, think about how you can rewrite it in a vectorised way.

Looping

Sometimes it is impossible to avoid a loop, for example:

  • When the result depends on the previous iteration of the loop

If this is the case some things to consider:

  • Ensure you are doing the absolute minimum inside the loop. Move any calculations that do not depend on the loop variable outside of the loop.
  • Make the number of iterations as small as possible. For instance, if your choice is between iterating over the levels of a factor and iterating over all the elements, iterating over the levels will usually be much faster.
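The two points above can be sketched as follows. The grouping factor, the data and the smoothing constant alpha are hypothetical and chosen only to illustrate the idea.

```r
set.seed(1)
g <- factor(sample(c("a", "b", "c"), 1e4, replace = TRUE))
v <- runif(1e4)

# Iterating over every element: 10,000 iterations.
sums_elem <- c(a = 0, b = 0, c = 0)
for (i in seq_along(v)) {
  key <- as.character(g[i])
  sums_elem[key] <- sums_elem[key] + v[i]
}

# Iterating over the levels: 3 iterations, with vectorised work inside each.
sums_lvl <- c(a = 0, b = 0, c = 0)
for (lv in levels(g)) {
  sums_lvl[lv] <- sum(v[g == lv])
}

# A loop that cannot be avoided: each value depends on the previous one.
alpha <- 0.3
one_minus_alpha <- 1 - alpha  # loop-invariant, so computed once outside the loop
s <- numeric(length(v))
s[1] <- v[1]
for (i in 2:length(v)) {
  s[i] <- alpha * v[i] + one_minus_alpha * s[i - 1]
}
```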

If you have to loop, do as little as possible in it.

Growing Objects

A common pitfall is growing an object inside a loop. Below I give an example of this:

Here we are constantly growing the vector inside the loop. As the vector grows, we need more space to hold it, so we end up copying the data to a new location. This constant allocation and copying makes the code very slow and fragments memory.

In the next example, we have pre-allocated the space we needed. This time the code is 266X faster.
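A minimal sketch of the two patterns side by side is given here. The loop body (squaring the index) and the size n are assumptions for illustration, so the speed-up you measure will differ from the figure quoted above.

```r
n <- 1e4  # illustrative size

# Growing the vector: every c() call allocates a new, longer vector and copies.
grow_time <- system.time({
  grown <- c()
  for (i in 1:n) {
    grown <- c(grown, i^2)
  }
})

# Pre-allocating: the vector is created once at its final length and filled in place.
pre_time <- system.time({
  prealloc <- numeric(n)
  for (i in 1:n) {
    prealloc[i] <- i^2
  }
})

print(grow_time["elapsed"])
print(pre_time["elapsed"])
```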

We can of course do this allocation directly without the loop, making the code even faster:
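For a loop body that is a simple element-wise calculation, the direct form might look like this sketch (squaring the index is an assumed example, not necessarily the original one):

```r
n <- 1e5  # illustrative size

# Pre-allocated loop, kept for comparison.
looped <- numeric(n)
for (i in seq_len(n)) {
  looped[i] <- i^2
}

# Direct construction: the whole result is produced by one vectorised expression.
direct <- seq_len(n)^2
```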

If you don’t know how much space you will need, it may be useful to allocate an upper bound, then trim the unused space once your loop is complete.
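One way to sketch this, assuming a hypothetical upper bound max_n and a filter that keeps only some of the values generated in the loop:

```r
set.seed(7)
max_n <- 1000           # hypothetical upper bound on the number of results
out <- numeric(max_n)   # allocate the worst case once
k <- 0                  # number of slots actually used

for (i in seq_len(max_n)) {
  x <- runif(1)
  if (x > 0.8) {        # keep only some of the values
    k <- k + 1
    out[k] <- x
  }
}

# Drop the unused tail once the loop is complete.
out <- out[seq_len(k)]
```

Using seq_len(k) rather than 1:k keeps the trim correct even when nothing was kept and k is 0.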

A more common scenario is to see something along the lines of:

At the bottom of your loop, you rbind or cbind the result you calculated in that iteration onto an existing data frame.

Instead, build a list of pieces and put them all together in one go:
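A sketch of both patterns follows. The helper make_row is hypothetical and stands in for whatever each iteration computes.

```r
n <- 200

# Hypothetical per-iteration result: a one-row data frame.
make_row <- function(i) data.frame(id = i, value = i^2)

# Slow pattern: the data frame is copied and extended on every iteration.
slow <- data.frame()
for (i in seq_len(n)) {
  slow <- rbind(slow, make_row(i))
}

# Faster pattern: collect the pieces in a pre-allocated list, then bind once.
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- make_row(i)
}
fast <- do.call(rbind, pieces)
```

The single do.call(rbind, pieces) at the end replaces n incremental rbind calls.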

Avoid growing data structures inside a loop.

Apply Functions

The R library has a whole host of apply functions:

It is worth becoming familiar with them all. Neil Saunders has created a great introduction to all the apply functions.

In most situations, using an apply function is not any faster than using a loop (for instance, the apply function is just doing a loop under the hood). The main advantage is that it avoids growing objects in the loop, as the apply functions handle stitching the results together.
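A short sketch of a few members of the family next to the equivalent loop (the matrix and the inputs are illustrative assumptions):

```r
m <- matrix(1:12, nrow = 3)

# Explicit loop with a pre-allocated result.
row_sums_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_sums_loop[i] <- sum(m[i, ])
}

# apply: call a function on each row (MARGIN = 1) of a matrix.
row_sums_apply <- apply(m, 1, sum)

# vapply: like sapply, but the type and length of each result are declared up front.
squares <- vapply(1:5, function(i) i^2, numeric(1))

# lapply: always returns a list, one element per input.
lens <- lapply(list(a = 1:3, b = letters), length)
```

In each case the apply function allocates and assembles the result, so there is no object to grow by hand.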

In Part II we will introduce the parallel versions of apply that can increase performance further.

Know your apply functions and use them where it makes sense.

Mutable Functions

One important point to remember about R is that parameters to functions are passed by value. In theory, this means that each time you pass something to a function it creates a copy. In practice, however, R will not create a copy under the hood as long as you don’t mutate the value of the variable inside the function.

If you can make your functions immutable (i.e. don’t change the values of the parameters passed in), you will save significant amounts of memory and CPU time by not copying.

Let’s look at a really simple case:

Here f1 mutates x, while f2 does not. Running with a fairly large vector several times and looking at the average:

We see that on average, f2 is slightly quicker as we have avoided the additional temporary copy that is done under the hood in f1.
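The definitions below are one hypothetical way to write such a pair, not necessarily the functions timed above. The bodies, the vector size and the repetition count are all assumptions.

```r
# Hypothetical f1: assigns into its argument, so R copies the vector first.
f1 <- function(x) {
  x[1] <- 0
  sum(x)
}

# Hypothetical f2: computes the same result while only reading x, so no copy is made.
f2 <- function(x) {
  sum(x) - x[1]
}

x <- as.numeric(1:1e6)
reps <- 20

t1 <- system.time(for (k in seq_len(reps)) f1(x))
t2 <- system.time(for (k in seq_len(reps)) f2(x))

# Average elapsed time per call.
print(t1["elapsed"] / reps)
print(t2["elapsed"] / reps)
```

Because arguments are passed by value, the caller’s x is unchanged after f1(x); the modification only ever touches the copy.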

Try to make functions immutable.

Summary

To reiterate the main points:

  1. When you have a for loop in your code, think about how you can rewrite it in a vectorised way.
  2. If you have to loop, do as little as possible in it.
  3. Avoid growing data structures inside a loop.
  4. Know your apply functions and use them where it makes sense.
  5. Try to make functions immutable.

Following these tips when writing your R code should greatly improve its efficiency. For some more general tips to help your R code I also recommend: