
Hadoop 2 Introduction

Hadoop 2 is a complete overhaul of some of the core Hadoop libraries. It represents a fundamental shift in the way applications run on top of Hadoop, and it is worth understanding these changes.


YARN

In Hadoop 1, the programming API (MapReduce) and resource management of the cluster were all bundled together. In Hadoop 2, resource management is now handled by YARN (Yet Another Resource Negotiator).

YARN manages the resources available to us on the cluster. To understand what YARN does, we need to look at the components that make it up:

(Figure: YARN architecture)

Resource Manager

  • Runs on a single master node
  • Global resource scheduler across the cluster
  • Arbitrates resources between competing applications
  • Nodes have resources - memory and CPU cores - which the resource manager allocates.

Node Manager

  • Sits on each slave node
  • Handles communication with Resource Manager

Applications

  • Applications are jobs submitted to the YARN framework
  • Could be a MapReduce job, a Spark job, etc.

Application Master

  • One per application
  • Requests containers to actually run the job. Containers will be distributed across the cluster.

Containers

  • Created by the Resource Manager upon request
  • Allocated a certain amount of resources (CPU and memory) on a slave node.

Applications

In Hadoop 2, applications are no longer limited to just MapReduce. The cluster can be used by multiple different systems at the same time, cluster resources can be better utilised, and new systems can integrate by implementing the YARN API.
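
A quick way to see the different application types sharing a cluster is the YARN CLI (the exact output columns vary between versions):

  # list applications currently running on the cluster (MapReduce, Spark, etc.)
  yarn application -list
  # include finished applications as well
  yarn application -list -appStates ALL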


Scheduling

  • Hierarchical queue system
  • Various scheduling mechanisms (Capacity Scheduler, Fair Scheduler)
  • Cloudera CDH5 uses the Fair Scheduler by default; a sketch for listing the configured queues follows below.
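
A rough sketch for inspecting the queue setup from the command line (the exact output depends on the scheduler in use):

  # list the configured queues and their scheduling info
  mapred queue -list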

Hadoop Command Line Cheatsheet

Useful commands when using Hadoop on the command line

Filesystem

Full reference can be found in the Hadoop documentation.

ls

List the contents of the provided directory.
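
For example (the path is just a placeholder):

  # list the contents of an HDFS directory
  hadoop fs -ls /user/myuser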

put

Copy a local file to the provided HDFS location.
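
For example (paths are placeholders):

  # copy a local file into HDFS
  hadoop fs -put data.csv /user/myuser/data.csv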

get

Copy a file from HDFS to the local file system.
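
For example (paths are placeholders):

  # copy a file from HDFS into the current local directory
  hadoop fs -get /user/myuser/data.csv .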

cat/text

Outputs the contents of an HDFS file to standard output. The text command will also read compressed files and output uncompressed data.

A common use case is wanting to check the contents of a file without outputting the whole thing; pipe the contents to head.
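
For example (paths are placeholders):

  # show just the first few lines of a file
  hadoop fs -cat /user/myuser/output/part-00000 | head
  # -text also handles compressed files
  hadoop fs -text /user/myuser/output/part-00000.gz | head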

cp/mv

cp is short for copy: copy a file from source to destination.

mv is short for move: move a file from source to destination.
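
For example (paths are placeholders):

  # copy a file within HDFS
  hadoop fs -cp /user/myuser/data.csv /user/myuser/backup/data.csv
  # move (rename) a file within HDFS
  hadoop fs -mv /user/myuser/data.csv /user/myuser/archive/data.csv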

chmod

Change the permissions of the file/directory. Uses standard Unix file permissions.
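
For example (the paths and permission bits are placeholders):

  # make a file world readable
  hadoop fs -chmod 644 /user/myuser/data.csv
  # apply permissions recursively to a directory
  hadoop fs -chmod -R 755 /user/myuser/shared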

getmerge

Takes a source directory, concatenates all the content, and outputs it to a local file. Very useful, as Hadoop jobs will commonly produce multiple output files depending on the number of mappers/reducers you have.
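
For example (paths are placeholders):

  # concatenate all files in an HDFS directory into a single local file
  hadoop fs -getmerge /user/myuser/job-output merged-output.txt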

rm

Deletes a file from HDFS. The -r flag performs the delete recursively; you will need it for directories.

By default the files will be moved to trash, which will eventually be cleaned up, so the space will not be immediately freed up. If you need the space immediately you can use -skipTrash; note this means you cannot reverse the delete.
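
For example (the path is a placeholder):

  # delete a directory and its contents (moved to trash by default)
  hadoop fs -rm -r /user/myuser/old-output
  # delete immediately, bypassing trash - this cannot be reversed
  hadoop fs -rm -r -skipTrash /user/myuser/old-output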

du

Displays the sizes of directories/files. Does this recursively, so it is extremely useful for finding out how much space you are using. The -h option makes the sizes human readable. The -s option summarises all the files, instead of giving you individual file sizes.
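
For example (the path is a placeholder):

  # human-readable, summarised size of a directory
  hadoop fs -du -s -h /user/myuser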

One thing to note is that the size reported is un-replicated.  If your replication factor is 3, the actual disk usage will be 3 times this size.

count

I commonly use this command to find out how much quota I have available on a specific directory (you need to add the -q option for this).
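
For example (the path is a placeholder); the output includes the SPACE_QUOTA and REMAINING_SPACE_QUOTA values used below:

  # show quotas and usage for a directory
  hadoop fs -count -q /user/myuser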

To work out how much quota you have used: SPACE_QUOTA - REMAINING_SPACE_QUOTA, e.g. 54975581388800 - 5277747062870, or 54.97TB - 5.27TB ≈ 49.70TB used.

Note these figures are the replicated numbers.

Admin Report

Useful command for finding out total usage on the cluster. Even without superuser access you can see current total capacity and usage.
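
For example:

  # report configured capacity, used and remaining space across the cluster
  hdfs dfsadmin -report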

Hadoop Jobs

Launching Hadoop Jobs

Commonly you will have a fat jar file containing all the code for your MapReduce job. Launch via:
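
A minimal sketch; the jar name, main class and paths are placeholders for your own job:

  # run a MapReduce job packaged in a fat jar
  hadoop jar my-job.jar com.example.MyJob /input/path /output/path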

A Scalding job is launched using:
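
A sketch assuming a fat jar built with the Scalding assembly; the jar name, job class and paths are placeholders, and com.twitter.scalding.Tool is the usual entry point:

  # run a Scalding job on the cluster
  hadoop jar my-scalding-jobs.jar com.twitter.scalding.Tool com.example.MyScaldingJob --hdfs --input /input/path --output /output/path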

If you need to kill a map-reduce job, use:
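
For example (the job id shown is a placeholder):

  # kill a running map-reduce job by its job id
  mapred job -kill job_1406116046787_0001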

You can find the job id in the Resource Manager UI or in the log output of the job launch. This can be used to kill any map-reduce job (standard Hadoop, Scalding, Hive, etc.), but not Impala or Spark jobs, for instance.