R as a programming language is often considered slow. However, more often than not it is how the R code is written that makes it slow. I’ve seen people wait hours for an R script to finish when, with a few modifications, it would take minutes.
In this post I will explore various ways of speeding up your R code by writing better code. In Part II, I will focus on tools and libraries you can use to optimise R.
Vectorisation
The single most important piece of advice when writing R code is to vectorise it as much as possible. If you have ever used MATLAB, you will be aware of the difference in speed between vectorised and un-vectorised code.
Let us look at an example:
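```r
# a is not defined in the original snippet; assumed to match the
# vector used in the vectorised version
a <- rep(10, 100000)

print(system.time(
  # Loop
  for (i in seq_along(a)) {
    a[i] <- a[i] + 10
  }
))
#    user  system elapsed
#   0.191   0.008   0.216
```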
Here we have used a loop to increment the contents of a. Now using a vectorised approach:
```r
a <- rep(10, 100000)

print(system.time(
  # Vector
  a <- a + 10
))
#    user  system elapsed
#       0       0       0
```
Notice the massive performance increase in elapsed time.
Another consideration is to look at using inherently vectorised commands like ifelse and diff. Let’s look at the example below:
```r
large.number <- 100000
hit <- NA

print(system.time(
  # Using if
  for (i in 1:large.number) {
    if (runif(1) < 0.3) hit[i] <- TRUE
  }
))
#    user  system elapsed
#   2.853   1.152   4.012

print(system.time(
  # Using ifelse
  ifelse(runif(large.number) < 0.3, TRUE, NA)
))
#    user  system elapsed
#   0.041   0.001   0.043
```
Again the elapsed time drops dramatically, by a factor of about 93.
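The same idea applies to diff, mentioned above alongside ifelse. The following is a minimal sketch rather than a benchmark: the vector x is hypothetical data made up for illustration, and the loop is only there to show what diff replaces.

```r
# Hypothetical data (not from the original post)
set.seed(1)
x <- cumsum(runif(100000))

# Loop version: difference between consecutive elements
d.loop <- numeric(length(x) - 1)
for (i in seq_len(length(x) - 1)) {
  d.loop[i] <- x[i + 1] - x[i]
}

# Vectorised version: diff() computes the same differences in one call
d.vec <- diff(x)

# Both approaches agree
stopifnot(isTRUE(all.equal(d.loop, d.vec)))
```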
When you have a for loop in your code, think about how you can rewrite it in a vectorised way.
Looping
Sometimes it is impossible to avoid a loop, for example:
- When the result depends on the previous iteration of the loop
If this is the case, here are some things to consider:
- Ensure you are doing the absolute minimum inside the loop. Move any calculation that does not depend on the loop outside of it.
- Make the number of iterations as small as possible. For instance, if your choice is between iterating over the levels of a factor and iterating over all the elements, iterating over the levels will usually be much faster.
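Both points can be sketched together. The data, the names group, value and scale.factor, and the per-group total being computed are all assumptions chosen for illustration; the point is only the shape of the two loops.

```r
# Hypothetical data: a grouping factor and one numeric value per element
set.seed(42)
n.obs <- 10000
group <- factor(sample(c("a", "b", "c"), n.obs, replace = TRUE))
value <- runif(n.obs)
scale.factor <- 2.5  # does not depend on the loop index

# Slower: one iteration per element, with constant work repeated each time
total.slow <- c(a = 0, b = 0, c = 0)
for (i in seq_len(n.obs)) {
  k <- sqrt(scale.factor)  # recomputed on every iteration
  g <- as.character(group[i])
  total.slow[g] <- total.slow[g] + k * value[i]
}

# Faster: constant hoisted out of the loop, one iteration per factor level
k <- sqrt(scale.factor)
total.fast <- c(a = 0, b = 0, c = 0)
for (g in levels(group)) {
  total.fast[g] <- k * sum(value[group == g])
}

# Same totals either way
stopifnot(isTRUE(all.equal(total.slow, total.fast)))
```

The second loop runs three times instead of ten thousand, and the work inside each iteration is itself vectorised.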
If you have to loop, do as little as possible in it
Growing Objects
A common pitfall is growing an object inside of a loop. Below I give an example of this:
```r
n <- 100000

print(system.time({
  # Growing inside a loop
  vec <- numeric(0)
  for (i in 1:n) {
    vec <- c(vec, i)
  }
}))
#    user  system elapsed
#  18.649  19.105  37.876
```
Here we are constantly growing the vector inside of the loop. As the vector grows, we need more space to hold it, so we end up copying the data to a new location. This constant allocation and copying makes the code very slow and fragments memory.
In the next example, we have pre-allocated the space we needed. This time the code is 266X faster.
```r
print(system.time({
  # Pre-allocated and using subscripts
  vec <- numeric(n)
  for (i in 1:n) vec[i] <- i
}))
#    user  system elapsed
#   0.140   0.001   0.142
```
We can of course do this allocation directly without the loop, making the code even faster:
```r
print(system.time(
  # Direct allocation
  vec <- 1:n
))
#    user  system elapsed
#   0.001   0.000   0.000
```
If you don’t know how much space you will need, it may be useful to allocate an upper-bound of space, then remove anything unused once your loop is complete.
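A minimal sketch of the upper-bound approach follows. The task (keeping the random draws that fall below 0.3) and the names max.n, kept and n.kept are hypothetical; in real code this particular filter would simply be vectorised, which is how the result is checked at the end.

```r
# Hypothetical task: keep the uniform draws that fall below 0.3
set.seed(123)
max.n <- 100000                 # upper bound on the number of results
draws <- runif(max.n)

kept <- numeric(max.n)          # allocate the upper bound once
n.kept <- 0                     # number of slots actually used
for (i in seq_len(max.n)) {
  if (draws[i] < 0.3) {
    n.kept <- n.kept + 1
    kept[n.kept] <- draws[i]
  }
}
kept <- kept[seq_len(n.kept)]   # drop the unused tail

# Matches the vectorised equivalent
stopifnot(identical(kept, draws[draws < 0.3]))
```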
A more common scenario is to see something along the lines of:
```r
a <- c()
for (i in 1:x) {
  # ... compute b ...
  a <- rbind(a, b)  # Or cbind(a, b)
}
```
At the bottom of your loop, you are rbinding or cbinding the result you calculated in your loop to an existing data frame.
Instead, build a list of pieces and put them all together in one go:
```r
my.list <- vector('list', n)
for (i in 1:n) {
  my.list[[i]] <- runif(50)
}
# Binding numeric vectors row-wise yields a matrix, one row per piece
my.df <- do.call('rbind', my.list)  # Using rbindlist may be faster for data frame pieces
```
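When each iteration produces a data frame rather than a plain vector, the same pattern applies. This is a small sketch under assumed names (pieces, combined) and made-up columns; the data.table package's rbindlist accepts a list like this too, but is not used here so the example stays in base R.

```r
# Hypothetical pieces: each iteration produces a small data frame
n.pieces <- 100
pieces <- vector('list', n.pieces)
for (i in seq_len(n.pieces)) {
  pieces[[i]] <- data.frame(id = i, value = runif(5))
}

# Combine once, after the loop, instead of rbinding on every iteration
combined <- do.call('rbind', pieces)
```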
Avoid growing data structures inside a loop.
Apply Functions
Base R has a whole host of apply functions:
```
??apply

base::apply     Apply Functions Over Array Margins
base::.subset   Internal Objects in Package 'base'
base::by        Apply a Function to a Data Frame Split by Factors
base::eapply    Apply a Function Over Values in an Environment
base::lapply    Apply a Function over a List or Vector
base::mapply    Apply a Function to Multiple List or Vector Arguments
base::rapply    Recursively Apply a Function to a List
base::tapply    Apply a Function Over a Ragged Array
```
It is worth becoming familiar with them all. Neil Saunders has created a great introduction to all the apply functions.
In most situations using apply may not be any faster than using a loop (for instance the apply function is just doing a loop under the hood). The main advantage is that it avoids growing objects in the loop as the apply functions handle stitching the data together.
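To illustrate the stitching point, here is a minimal sketch comparing a loop that grows its result with an apply-style call. The list samples is hypothetical data, and vapply (a base R relative of sapply that checks the type of each result) is just one reasonable choice here.

```r
# Hypothetical input: a list of numeric vectors
set.seed(7)
samples <- lapply(1:1000, function(i) runif(50))

# Loop that grows its result on every iteration (the pattern to avoid)
means.loop <- c()
for (s in samples) {
  means.loop <- c(means.loop, mean(s))
}

# vapply allocates the result once and fills it in for us
means.apply <- vapply(samples, mean, numeric(1))

# Same values either way
stopifnot(isTRUE(all.equal(means.loop, means.apply)))
```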
In Part II we will introduce the parallel versions of apply that can increase performance further.
Know your apply functions and use them where it makes sense
Mutable Functions
One important point to remember about R is that parameters to functions are passed by value. In theory, this means that each time you pass something to a function it creates a copy. In practice, however, R will not create a copy under the hood as long as you don’t mutate the value of the variable inside the function.
If you can make your functions immutable (i.e. they don’t change the values of the parameters passed in), you will save significant amounts of memory and CPU time by not copying.
Let’s look at a really simple case:
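The listing that defined f1 and f2 is not available here. As a minimal sketch, assuming f1 modifies its argument before using it and f2 only reads it, the two functions might look like this. The bodies are hypothetical, not the original definitions, and are written so that both return the same value.

```r
# Hypothetical definitions (the original listing is missing)
f1 <- function(x) {
  x[1] <- x[1] + 1  # modifies the argument, so R copies the vector first
  sum(x)
}

f2 <- function(x) {
  sum(x) + 1        # only reads the argument, so no copy is needed
}

v <- c(1, 2, 3)
f1(v)  # 7
f2(v)  # 7
v      # still 1 2 3: the caller's vector is untouched either way
```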
Here f1 mutates x, while f2 does not. Running both several times on a fairly large vector and averaging the elapsed times:
```r
a <- runif(10000000)

# Time each function 100 times and keep the elapsed component
dd <- t(sapply(1:100, function(x) {
  c(system.time(f1(a))[3], system.time(f2(a))[3])
}))
colMeans(dd)
# elapsed elapsed
# 0.23410 0.18562
```
We see that on average, f2 is slightly quicker as we have avoided the additional temporary copy that is done under the hood in f1.
Try to make functions immutable.
Summary
To reiterate the main points:
- When you have a for loop in your code, think about how you can rewrite it in a vectorised way.
- If you have to loop, do as little as possible in it
- Avoid growing data structures inside a loop
- Know your apply functions and use them where it makes sense
- Try to make functions immutable
Following these tips when writing your R code should greatly improve its efficiency. For some more general tips to help your R code, I also recommend:
- Reading through R Inferno
- Reading and following Google’s R Style Guide