Hadoop 2 Introduction

Hadoop 2 is a complete overhaul of some of the core Hadoop libraries. It fundamentally changes the way applications run on top of Hadoop, and these changes are worth understanding.

YARN

In Hadoop 1, the programming API (MapReduce) and the resource management of the cluster were bundled together. In Hadoop 2, resource management […]

Hadoop Command Line Cheatsheet

Useful commands when using Hadoop on the command line.

Filesystem

The full reference can be found in the Hadoop documentation.

ls

List the contents of the provided directory.

put

Put the local file to the provided HDFS location.

get

Copy the file to the local file system.

cat/text

Outputs the contents of an HDFS file to […]

Setting up IntelliJ for Spark

A brief guide to setting up IntelliJ to build Spark applications.

Create new Scala Project

Select: Create New Project, then Scala Module, and give it an appropriate name.

Set up Directory Structure

Move to the project root. Run the following:

Set up gen-idea plugin

In the project directory you just created, create a new file called plugins.sbt with the […]

Random Permutation Tests

At the 2014 Strata + Hadoop conference John Rauser gave a great keynote titled "Statistics Without the Agonizing Pain". It is probably worth watching before reading the rest of this article; in it he introduces the concept of Random Permutation Tests. "Classic" statistical tests usually make some sort of assumption about the distribution of the data […]
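
As a rough illustration of the idea, a random permutation test might be sketched in Python as below. This is not taken from the article or the keynote: the function name `permutation_test`, the use of the difference in means as the test statistic, the number of shuffles, and the sample data are all assumptions chosen for the example.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b.

    Hypothetical helper for illustration only. It repeatedly shuffles the
    pooled observations, splits them into two groups of the original sizes,
    and counts how often the shuffled difference in means is at least as
    large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            extreme += 1

    return extreme / n_permutations


if __name__ == "__main__":
    # Made-up data: two groups that are clearly separated.
    group_a = [10, 11, 12, 13, 14, 15]
    group_b = [1, 2, 3, 4, 5, 6]
    print(permutation_test(group_a, group_b))
```

The only assumption the sketch leans on is that, if the two groups really came from the same process, the group labels are interchangeable; no particular distribution is assumed for the data.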