Category Archives: Programming

Setting up Jupyter for Deep Learning on EC2

This post is a simple guide on setting up an EC2 server for deep learning as quickly as possible.

Launching an Instance

In EC2, select ‘Launch Instance’. For the AMI, choose the Deep Learning AMI (Ubuntu). You could use the Amazon Linux version; I just personally prefer Ubuntu.

For actual deep learning, you will want to select a p* instance, ideally a p3 instance. If you are initially testing and writing prototype code, you will probably want to select a small t* instance. You can change this later on the EC2 Dashboard.

Move to Review and Launch. In the security groups you will need to open ports 22, 443 and 8888.

Click ‘Launch’ and select your key pair (or create a new one).

A note on security: while we are opening ports 22, 443 and 8888 to the world, this should not be a significant risk. For SSH, only someone holding your key pair will be able to log in. Ports 443 and 8888 are for Jupyter; as long as your Jupyter instance remains password or token protected (it is by default), it should not be a security risk.

Configuring HTTPS

SSH into the server with:

It is a good idea to update any packages using
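These two steps might look like the following (the key pair name and instance DNS are placeholders for your own):

```shell
# SSH in using your key pair (hypothetical key name and host)
ssh -i my-key-pair.pem ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

# Update packages on the Ubuntu AMI
sudo apt-get update && sudo apt-get upgrade -y
```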

Create a Jupyter config

Create a certificate directory

Create a self-signed certificate valid for one year
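A sketch of these three steps, assuming a `~/ssl` directory and the file names `mykey.key`/`mycert.pem` (the names are my own choice, not mandated):

```shell
# Generate ~/.jupyter/jupyter_notebook_config.py
jupyter notebook --generate-config

# Directory to hold the certificate
mkdir ~/ssl
cd ~/ssl

# Self-signed certificate, valid for 365 days
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout mykey.key -out mycert.pem
```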

Edit your Jupyter config (~/.jupyter/jupyter_notebook_config.py) to include the following four lines
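Assuming the certificate paths from the earlier step, the four lines would be along these lines:

```python
c.NotebookApp.certfile = u'/home/ubuntu/ssl/mycert.pem'
c.NotebookApp.keyfile = u'/home/ubuntu/ssl/mykey.key'
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = 8888
```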

Note: I don’t set a password, as I think it is safer to use the token (you will need to be SSH’d into the server to obtain it).

Running Notebooks

Now run:
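That is simply:

```shell
jupyter notebook
```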

Copy and paste the URL into your browser. It will look like https://localhost:8888/?token=ef0f…

Change ‘localhost’ to be the IP or DNS of your server.

You will get a warning screen. As this is your own self-signed certificate, click Advanced and Proceed.

At this point you will have your notebook interface!

Installing additional packages

The AMI comes with most packages you need pre-installed. If you need to install anything else, go to the terminal. Ensure you activate the conda environment you are using in your notebook, e.g. if you are using TensorFlow with Python 3, run

Then install your package via conda
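For example (the environment name `tensorflow_p36` and the package are assumptions based on the Deep Learning AMI of the time):

```shell
source activate tensorflow_p36
conda install scikit-learn   # example package
```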


Static Blog: Jekyll, Hyde and GitHub Pages

This post is a short tutorial on setting up a static Jekyll blog using GitHub pages.

Install Jekyll

Instructions for installing Jekyll can be found in the official Jekyll documentation:

MacOS X Install Instructions

  1. Download Xcode via the App Store – search for ‘Xcode’ and click install.
  2. In the latest version of Xcode, the Command Line Tools will be automatically installed.
  3. On the terminal run gem install jekyll bundler

Setting up Hyde

We will be using Hydeout, an updated version of the Hyde template for Jekyll.

On the command line, clone the Hydeout template into a new directory using git clone.

Move into this directory and run
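A sketch of both steps (the repository URL was correct at the time of writing; `myblog` is a placeholder directory name):

```shell
git clone https://github.com/fongandrew/hydeout.git myblog
cd myblog
jekyll serve
```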

You will now have a working Hydeout blog running locally on your machine.

Setting up GitHub Pages

The following assumes you are using the master branch for your blog.

Repoint the Hydeout template to your personal git repo
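One way to repoint the clone (substitute your own GitHub username; the placeholder is intentional):

```shell
git remote set-url origin git@github.com:<username>/<username>.github.io.git
```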

Push your changes to this repo.

In GitHub:

  1. Go to the Settings of your Project
  2. Go to GitHub pages
  3. Set Source to be your master branch
  4. Wait 30 seconds or so before trying to load your new website.

Updating a post

  1. Ensure your local repo is up-to-date using git pull
  2. Create a new Markdown post under the _posts directory
  3. Run jekyll serve to test that your post looks the way you want
  4. Add, commit and push your changes to master
  5. Your new post will now appear on the website

Hadoop 2 Introduction

Hadoop 2 is a complete overhaul of some of the core Hadoop libraries. It is a fundamental shift in the way applications run on top of Hadoop, and it is worth understanding these changes.



In Hadoop 1, the programming API (MapReduce) and resource management of the cluster were all bundled together. In Hadoop 2, resource management is now handled by YARN (Yet Another Resource Negotiator).

YARN manages the resources available to us on the cluster. To understand what YARN does, we need to look at the components that make it up:


Resource Manager

  • Runs on a single master node
  • Global resource scheduler across the cluster
  • Arbitrates resources between competing applications
  • Nodes have resources – memory and CPU cores – which the resource manager allocates.

Node Manager

  • Sits on each slave node
  • Handles communication with Resource Manager


Applications

  • Applications are jobs submitted to the YARN framework
  • Could be a MapReduce job, Spark job, etc.

Application Master

  • One per application
  • Requests containers to actually run the job. Containers will be distributed across the cluster.


Containers

  • Created by the RM upon request
  • Allocate a certain amount of resources (CPU and memory) on a slave node.


In Hadoop 2, applications are no longer limited to just MapReduce. The cluster can be used by multiple different systems at the same time, cluster resources can be better utilised, and new systems can integrate simply by implementing the YARN API.



Scheduling

  • Hierarchical queue system
  • Various scheduling mechanisms (Capacity Scheduler, Fair Scheduler)
  • Cloudera CDH5 uses Fair Scheduling by default.

Hadoop Command Line Cheatsheet

Useful commands when using Hadoop on the command line


Full reference can be found in Hadoop Documentation.


List the contents of provided directory.
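For example (paths here and below are hypothetical):

```shell
hadoop fs -ls /user/alice
```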


Put the local file to provided HDFS location
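For example:

```shell
hadoop fs -put localfile.txt /user/alice/
```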


Copy the file to the local file system
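For example:

```shell
hadoop fs -get /user/alice/file.txt .
```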


Outputs the contents of an HDFS file to standard output. The text command will also read compressed files and output uncompressed data.

A common use case is that you want to check the contents of a file, but not output the whole file. Pipe the contents to head.
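For example:

```shell
hadoop fs -cat /user/alice/file.txt | head

# text also handles compressed files
hadoop fs -text /user/alice/file.txt.gz | head
```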


cp is short for copy: copy a file from a source to a destination.

mv is short for move: move a file from a source to a destination.
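For example:

```shell
hadoop fs -cp /user/alice/a.txt /user/alice/b.txt
hadoop fs -mv /user/alice/a.txt /user/alice/archive/
```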


Change the permissions of the file/directory. Uses standard Unix file permissions.
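For example:

```shell
hadoop fs -chmod 644 /user/alice/file.txt
```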


Takes a source directory and concatenates all the content and outputs to a local file. Very useful as commonly Hadoop jobs will output multiple output files depending on the number of mappers/reducers you have.
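For example:

```shell
hadoop fs -getmerge /user/alice/job-output merged.txt
```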


Deletes a file from HDFS. The -r means perform the delete recursively; you will need this for directories.

By default the files will be moved to trash, which is only cleaned up later, so the space will not be immediately freed. If you need the space immediately you can use -skipTrash; note this means you cannot reverse the delete.
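For example:

```shell
hadoop fs -rm -r /user/alice/old-dir
hadoop fs -rm -r -skipTrash /user/alice/old-dir   # frees space immediately, irreversible
```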


Displays the sizes of directories/files. It does this recursively, so it is extremely useful for finding out how much space you have used. The -h option makes the sizes human readable. The -s option summarises all the files, instead of giving you individual file sizes.
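For example:

```shell
hadoop fs -du -s -h /user/alice
```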

One thing to note is that the size reported is un-replicated. If your replication factor is 3, the actual disk usage will be 3 times this size.


I commonly use this command to find out how much quota I have available on a specific directory (you need to add the -q options for this).
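For example:

```shell
hadoop fs -count -q /user/alice
```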

To work out how much quota you have used, compute SPACE_QUOTA - REMAINING_SPACE_QUOTA, e.g. 54975581388800 - 5277747062870, i.e. 54.97TB - 5.27TB = 49.70TB used (5.27TB remaining).

Note these figures are the replicated numbers.

Admin Report

Useful command for finding out total usage on the cluster. Even without superuser access you can see current total capacity and usage.
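The command is:

```shell
hdfs dfsadmin -report
```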

Hadoop Jobs

Launching Hadoop Jobs

Commonly you will have a fat jar file containing all the code for your map-reduce job. Launch it via:
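For example (the jar name, main class and paths are placeholders):

```shell
hadoop jar my-job.jar com.example.MyJob /input /output
```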

A Scalding job is launched using:
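A typical Scalding invocation goes through com.twitter.scalding.Tool; the jar name, job class and paths below are placeholders:

```shell
hadoop jar my-scalding-job.jar com.twitter.scalding.Tool \
    com.example.MyScaldingJob --hdfs --input /input --output /output
```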

If you need to kill a map-reduce job, use:
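For example (the job id is a placeholder):

```shell
mapred job -kill job_201401011200_0001
```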

You can find the job id in the resource manager or in the log of the job launch. This can be used to kill any map-reduce job (standard Hadoop, Scalding, Hive, etc.), but not, for instance, Impala or Spark jobs.

Setting up IntelliJ for Spark

Brief guide to setting up IntelliJ to build Spark applications.

Create new Scala Project


  1. Create New Project
  2. Scala Module
  3. Give it an appropriate name

Setup Directory Structure

Move to the project root. Run the following:
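A minimal sketch of the standard sbt layout:

```shell
mkdir -p src/main/scala src/main/resources src/test/scala
mkdir -p project
```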

Setup gen-idea plugin

In the project directory you just created, create a new file called plugins.sbt with the following content:
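The gen-idea plugin of that era was sbt-idea; the version number here is an assumption:

```scala
addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0")
```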

Create the build file

In the project root, create a file called build.sbt containing:
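A minimal build.sbt sketch; the project name and the Scala/Spark versions are assumptions and should match your cluster:

```scala
name := "my-spark-app"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
```

Marking spark-core as "provided" keeps it out of your fat jar, since the cluster supplies it at runtime.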


At the project root level, run the following
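That is:

```shell
sbt update gen-idea
```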

Re-open your project in IntelliJ.

You should now be set up and ready to write Spark applications.




R Performance (Part II)

In R performance (Part I), we looked into how to write R code in an efficient way.

In the second part, we will look into more explicit ways of improving R performance.

Parallel Loops

Doing some sort of loop is almost unavoidable in R. A simple optimisation is to run the loop in parallel.

An obvious machine learning application is cross-validation: if you want to run 10-fold cross-validation and have 10 cores available, you can run all 10 folds in parallel.

The parallel library has a multicore version of lapply called mclapply; let’s look at the example below:
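A minimal sketch (the fold function is a placeholder for a real model fit):

```r
library(parallel)

# Hypothetical "fold" function standing in for a real cross-validation fold
run_fold <- function(i) {
  mean(rnorm(1e6)) + i
}

# Run the 10 folds across 10 cores (Unix only)
results <- mclapply(1:10, run_fold, mc.cores = 10)
```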

If you already make use of lapply etc. in your code, modifying it to use the multicore version requires very few changes.

Sadly this code will not work on Windows as it relies on Unix’s fork command. Instead use the following on a Windows machine:
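A socket-based sketch using makeCluster and parLapply (the cluster size is an assumption):

```r
library(parallel)

cl <- makeCluster(4)   # socket cluster, so works on Windows too
results <- parLapply(cl, 1:10, function(i) mean(rnorm(1e6)) + i)
stopCluster(cl)
```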

Update: Nathan VanHoudnos has re-implemented the mclapply function to support Windows, see his comment for more details on how to use this.

Update: One issue I have observed when using mclapply on Linux: if the machine runs out of memory, R will arbitrarily and silently kill processes on some of the cores, which means you will not get as many results as you expect. A good error check is to ensure your results have the correct size, e.g. the same size as your inputs.

If you instead prefer for-loop syntax, you can use the foreach package:
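A sketch using doParallel as the backend (the core count is an assumption):

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)

# %dopar% runs the iterations in parallel; %do% would run them serially
results <- foreach(i = 1:10) %dopar% {
  mean(rnorm(1e6)) + i
}
```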

The important command here is %dopar%, which says to perform the loop in parallel. If you were to use %do%, it would run on a single process.

Memory Usage

On a Unix system (I have no idea how this works on Windows), when you fork a process (which is what mclapply does), you get a copy of the current process’s memory (think of the R environment).

However, this copy is “copy-on-write”, which means that unless you modify the data, it will never physically be copied.

This is important, as much as possible you should try to avoid modifying the data inside your mclapply function. However, often this is unavoidable e.g. in your cross validation loop you will need to split data into test and train. In this case, you just need to be aware you will be increasing your memory usage. You may have to trade off how many cores you can use with how much memory you have available.

Optimised Linear Algebra Library

The default BLAS library used by R is not particularly well tuned and has no support for multiple cores. Switching to an alternative BLAS implementation can give a significant speed boost.

Nathan VanHoudnos has an excellent guide on installing alternative BLAS libraries. Here I will summarise the different options:


OpenBLAS

OpenBLAS is generally the easiest to install (on Ubuntu you can use apt-get) and has out-of-the-box support for multicore matrix operations.

One major issue (described here) is that currently OpenBLAS multicore matrix operations do not play well with R’s other multicore functionality (parallel, foreach); trying to use them together will result in segfaults. However, as long as you are aware of this you can design your code around it, e.g. only using parallel loops when nothing inside the loop uses matrix algebra.


ATLAS

ATLAS has the potential to be the most tuned to your particular machine setup. However, out-of-the-box installations (e.g. via apt-get) will generally only support a single core.

In order to get the most out of ATLAS (multicore, optimised to your machine) you will need to compile it from source. This can be a painful experience, and is probably only worthwhile if you are familiar with compiling from source on Unix machines.


MKL

Intel provide the Math Kernel Library (MKL), an optimised version of BLAS. It works on Intel and Intel-compatible processors. Revolution R comes with MKL pre-packaged.

R-Bloggers has a guide to installing MKL. Note, MKL is free for non-commercial usage, but commercial usage will require a license.


vecLib

If you are using Mac OS X, Apple kindly provide you with an optimised BLAS library (vecLib). Details of how to link it with R can be found here. It is very easy to link and provides excellent performance.

However, this only works on Mac OS X, so it is not really relevant if you are planning to work in a server environment.

Which to use?

Obviously what to use is completely up to you; all have some disadvantages. Personally, I use vecLib on my Mac and we use OpenBLAS on our servers. This means we have to write our R code so as not to use parallel loops and multicore matrix operations at the same time. However, this is not a massive overhead (if you try to do both at the same time you will generally end up thrashing your CPUs anyway). The advantage is that spinning up new R servers does not involve any compilation from source.


Revolution R

At this point, linking to an optimised BLAS version may look quite painful. An alternative option is to pay for Revolution R. They pre-build their version of R with an optimised multicore BLAS. They have various benchmarks on the performance improvements.

Revolution R also has various libraries for handling large datasets, parallelised statistical modelling algorithms, etc. Although it all comes with a fairly hefty price tag.

Update: Revolution R have just announced Revolution R Open a free version of Revolution R. In particular it comes linked against MKL and has the Reproducible R Toolkit to manage package upgrades. Currently this looks like the best option for using R in a production environment.

Byte Code Compiler

The compiler package allows you to compile R functions to lower-level byte code. This can provide performance improvements of between 2X and 5X.

Let’s look at a very simple example below:

The cmpfun function is used to compile a function to byte code. You call the compiled function in exactly the same way as before. To compare performance:
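A sketch of what the example might look like (the function itself is a hypothetical numeric-heavy loop):

```r
library(compiler)

# Toy numeric function: partial harmonic sum
f <- function(n) {
  s <- 0
  for (i in 1:n) s <- s + 1 / i
  s
}
fc <- cmpfun(f)  # compile to byte code

system.time(f(1e6))
system.time(fc(1e6))
```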

In this case, we see around a 2.9X speedup. You will see the best speed-up on functions that involve mostly numerical calculations. If your functions mainly call pre-built R functions or manipulate data types, you probably won’t see any drastic speed-up.

You can also enable Just-In-Time (JIT) compilation, removing the need to call cmpfun directly:
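A minimal sketch:

```r
library(compiler)
enableJIT(3)  # 0 = no compilation, 3 = maximum compilation
```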

The value passed to enableJIT controls the level of compilation; it should be between 0 and 3, with 0 being no compilation and 3 being maximum compilation. This may initially slow R down as functions are compiled, but can speed things up thereafter. You can also enable it via the R_ENABLE_JIT environment variable.

For more information, R-statistics has a great tutorial on the compiler library and JIT.


R is constantly evolving, so along with these tips you should always try to keep your R version up to date to get the latest performance improvements.

Radford Neal has done a bunch of optimisations, some of which were adopted into R Core, and many others which were forked off into pqR. At the time of writing, I don’t think pqR is ready for production work, but definitely worth watching.

With well-optimised code and the right libraries, R is capable of handling pretty large data problems. At some point, though, your data may be too large for R to handle; at that point I look to Hadoop and Spark to scale even further. My rough guide: if your data is greater than 50GB (after pre-processing), R is probably not the right choice.

R Performance (Part I)

R as a programming language is often considered slow. However, more often than not it is how the R code is written that makes it slow. I’ve seen people wait hours for an R script to finish, when with a few modifications it would take minutes.

In this post I will explore various ways of speeding up your R code by writing better code. In Part II, I will focus on tools and libraries you can use to optimise R.


Vectorise

The single most important piece of advice when writing R code is to vectorise it as much as possible. If you have ever used MATLAB, you will be aware of the speed difference between vectorised and un-vectorised code.

Let us look at an example:

Here we have used a loop to increment the contents of a. Now using a vectorised approach:
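A sketch of both versions (the vector size is an arbitrary choice):

```r
a <- rnorm(1e7)

# Loop version: increment each element in turn
system.time(
  for (i in seq_along(a)) a[i] <- a[i] + 1
)

# Vectorised version: one operation over the whole vector
system.time(
  a <- a + 1
)
```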

Notice the massive performance increase in elapsed time.

Another consideration is to use inherently vectorised commands like ifelse and diff. Let’s look at the example below:
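A sketch of the kind of comparison meant here (the sign-thresholding task is a hypothetical example):

```r
x <- rnorm(1e7)

# Loop version
system.time({
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    if (x[i] > 0) y[i] <- 1 else y[i] <- -1
  }
})

# Vectorised version using ifelse
system.time(
  y <- ifelse(x > 0, 1, -1)
)
```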

Again we see elapsed time has been massively reduced, a 93X reduction.

When you have a for loop in your code, think about how you can rewrite it in a vectorised way.


Loops

Sometimes it is impossible to avoid a loop, for example:

  • When the result depends on the previous iteration of the loop

If this is the case some things to consider:

  • Ensure you are doing the absolute minimum inside the loop. Take any non-loop dependent calculations outside of the loop.
  • Make the number of iterations as small as possible. For instance, if your choice is to iterate over the levels of a factor or to iterate over all the elements, iterating over the levels will usually be much faster.

If you have to loop, do as little as possible in it

Growing Objects

A common pitfall is growing an object inside of a loop. Below I give an example of this:
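A sketch of the anti-pattern (the sizes and arithmetic are arbitrary):

```r
n <- 1e5
v <- c()
for (i in 1:n) {
  v <- c(v, i * 2)   # the vector is reallocated and copied as it grows
}
```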

Here we are constantly growing the vector inside the loop. As the vector grows, we need more space to hold it, so we end up copying the data to a new location. This constant allocation and copying makes the code very slow and fragments memory.

In the next example, we have pre-allocated the space we needed. This time the code is 266X faster.
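The pre-allocated version might look like:

```r
n <- 1e5
v <- numeric(n)    # allocate all the space up front
for (i in 1:n) {
  v[i] <- i * 2
}
```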

We can of course do this allocation directly without the loop, making the code even faster:
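That is, a fully vectorised version:

```r
n <- 1e5
v <- (1:n) * 2
```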

If you don’t know how much space you will need, it may be useful to allocate an upper-bound of space, then remove anything unused once your loop is complete.

A more common scenario is to see something along the lines of:
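A sketch of that pattern (the data frame contents are hypothetical):

```r
results <- data.frame()
for (i in 1:100) {
  piece <- data.frame(id = i, value = i * 2)
  results <- rbind(results, piece)   # copies the whole frame every iteration
}
```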


At the bottom of your loop, you are rbinding or cbinding the result you calculated in your loop to an existing data frame.

Instead, build a list of pieces and put them all together in one go:
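A sketch of the list-then-combine approach:

```r
pieces <- vector("list", 100)
for (i in 1:100) {
  pieces[[i]] <- data.frame(id = i, value = i * 2)
}
results <- do.call(rbind, pieces)   # one combine at the end
```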

Avoid growing data structures inside a loop.

Apply Functions

The R library has a whole host of apply functions:
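The main members of the family, with illustrative calls:

```r
m <- matrix(1:6, nrow = 2)

apply(m, 1, sum)                   # over rows (or columns) of a matrix
lapply(1:3, sqrt)                  # over a list/vector, returns a list
sapply(1:3, sqrt)                  # like lapply, but simplifies the result
vapply(1:3, sqrt, numeric(1))      # like sapply, with a declared return type
mapply(rep, 1:3, 3:1)              # multivariate version of sapply
tapply(1:6, c(1, 1, 2, 2, 3, 3), sum)  # over groups defined by a factor
```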

It is worth becoming familiar with them all. Neil Saunders has created a great introduction to all the apply functions.

In most situations using apply may not be any faster than using a loop (for instance the apply function is just doing a loop under the hood). The main advantage is that it avoids growing objects in the loop as the apply functions handle stitching the data together.

In Part II we introduce the parallel versions of apply, which can increase performance further.

Know your apply functions and use them where it makes sense

Mutable Functions

One important point to remember about R is that parameters to functions are passed by value. In theory, this means that each time you pass something to a function it is copied. In practice, R will not actually create a copy under the hood as long as you don’t mutate the value of the variable inside the function.

If you can make your functions immutable (e.g. don’t change the values of the parameters passed in), you will save significant amounts of memory and CPU time by not copying.

Let’s look at a really simple case:

Here f1 mutates x, while f2 does not. Running each with a fairly large vector several times and looking at the average:
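A sketch of what the pair of functions and the comparison might look like (both compute the same quantity):

```r
f1 <- function(x) {
  x[1] <- 0        # mutates the argument, forcing a full copy of x
  sum(x)
}

f2 <- function(x) {
  sum(x) - x[1]    # same result, no mutation, so no copy
}

x <- rnorm(1e8)
system.time(f1(x))
system.time(f2(x))
```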

We see that on average, f2 is slightly quicker as we have avoided the additional temporary copy that is done under the hood in f1.

Try to make functions immutable.


Summary

To reiterate the main points:

  1. When you have a for loop in your code, think about how you can rewrite it in a vectorised way.
  2. If you have to loop, do as little as possible in it
  3. Avoid growing data structures inside a loop
  4. Know your apply functions and use them where it makes sense
  5. Try to make functions immutable

Following these tips when writing your R code should greatly improve its efficiency.

Reading large files in R

Reading large data files into R using read.table can be painfully slow due to the large number of checks it performs on the data (inferring data types, checking the number of columns, etc.). Below are a number of tips to improve its speed:

  1. Set nrows to the number of records in your dataset. An easy way to obtain this is to run wc -l on the file in the terminal (only works on Unix systems).
  2. Set comment.char="" to turn off the interpretation of comments.
  3. Supply the column types in colClasses, e.g. colClasses = c("numeric", "character", "logical").
  4. Setting multi.line=FALSE can help the performance of the scan.
  5. In general, it is worth setting stringsAsFactors=FALSE to prevent strings being turned into factors automatically. If you provide colClasses this is redundant (as the column types will be determined by colClasses).
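Putting these tips together (the file name and column types are hypothetical, and the wc -l helper is my own sketch):

```r
# Count lines via wc -l (Unix only); hypothetical helper
count_lines <- function(path) {
  out <- system(paste("wc -l", path), intern = TRUE)
  as.integer(strsplit(trimws(out), " ")[[1]][1])
}

df <- read.table("big_file.tsv",
                 nrows            = count_lines("big_file.tsv"),
                 comment.char     = "",
                 colClasses       = c("numeric", "character", "logical"),
                 stringsAsFactors = FALSE)
```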

If your data is very large (5GB+), you should look at using the data.table package. In particular, fread can be used to read very large files into data tables.
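A minimal sketch (the file name is a placeholder):

```r
library(data.table)
dt <- fread("big_file.csv")   # infers types from a sample; much faster than read.table
```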