Collaborative Filtering using Alternating Least Squares

Collaborative filtering is commonly used in recommender systems. The idea is that, given a large set of item-user preferences, you can use collaborative filtering techniques to predict the missing item-user preferences. For example, suppose you have the purchase history of all users on an eCommerce website. You can use collaborative filtering to recommend which products a user might purchase […]
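To make the idea of predicting missing preferences concrete, here is a minimal sketch of alternating least squares in plain Python. It is deliberately simplified and is not taken from the post: it assumes a single latent factor per user and per item (rank 1), so each least-squares step has a closed form. The function name als_rank1, the dictionary layout of the ratings, and the parameters reg and n_iters are all hypothetical choices made for illustration.

```python
import random


def als_rank1(ratings, n_users, n_items, reg=0.1, n_iters=20, seed=0):
    """Rank-1 alternating least squares (illustrative simplification).

    ratings: dict mapping (user, item) -> observed preference value.
    Returns (x, y) so that x[u] * y[i] approximates the preference of
    user u for item i, including pairs that were never observed.
    """
    rng = random.Random(seed)
    x = [0.5 + rng.random() for _ in range(n_users)]
    y = [0.5 + rng.random() for _ in range(n_items)]

    # Index the observed ratings by user and by item.
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for (u, i), r in ratings.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    for _ in range(n_iters):
        # Hold item factors fixed; solve each user factor in closed form.
        for u in range(n_users):
            num = sum(r * y[i] for i, r in by_user[u])
            den = sum(y[i] ** 2 for i, _ in by_user[u]) + reg
            x[u] = num / den
        # Hold user factors fixed; solve each item factor in closed form.
        for i in range(n_items):
            num = sum(r * x[u] for u, r in by_item[i])
            den = sum(x[u] ** 2 for u, _ in by_item[i]) + reg
            y[i] = num / den
    return x, y
```

A practical implementation would use several latent factors per user and item and solve a small regularized linear system at each step; the alternating structure stays the same.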


Hadoop 2 Introduction

Hadoop 2 is a complete overhaul of some of the core Hadoop libraries. It is a fundamental shift in the way applications run on top of Hadoop, and it is worth understanding these changes.

YARN: In Hadoop 1, the programming API (MapReduce) and resource management of the cluster were all bundled together. In Hadoop 2, resource management […]


Hadoop Command Line Cheatsheet

Useful commands when using Hadoop on the command line.

Filesystem (full reference can be found in the Hadoop documentation):

ls: List the contents of the provided directory.
put: Put the local file to the provided HDFS location.
get: Copy the file to the local file system.
cat/text: Outputs the contents of an HDFS file to […]


Setting up IntelliJ for Spark

Brief guide to setting up IntelliJ to build Spark applications.

Create new Scala project: Select Create New Project, then Scala Module, and give it an appropriate name.

Set up directory structure: Move to the project root. Run the following:

Set up gen-idea plugin: In the project directory you just created, create a new file called plugins.sbt with the […]


Random Permutation Tests

At the 2014 Strata + Hadoop conference, John Rauser gave a great keynote titled "Statistics Without the Agonizing Pain". It is probably worth watching before reading the rest of this article; in it, he introduces the concept of Random Permutation Tests. "Classic" statistical tests usually make some sort of assumption about the distribution of the data […]
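As a rough illustration of the general technique (not code from the article or the keynote), a two-sample random permutation test on the difference in means might be sketched as follows. The function name permutation_test, the choice of the mean as the test statistic, the two-sided comparison, and the default of 10,000 shuffles are all assumptions made for this example.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided random permutation test on the difference in means.

    Returns (observed_difference, estimated_p_value). The p-value is the
    fraction of random relabelings whose difference in means is at least
    as extreme as the one actually observed.
    """
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)

    pooled = list(a) + list(b)
    n_a = len(a)
    n_b = len(pooled) - n_a

    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled data, which reassigns group labels at random.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations
```

The appeal of this approach is that it only assumes the group labels are exchangeable under the null hypothesis, rather than assuming a particular distribution for the data.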