8552968000_9da6bffe9a_z

Hadoop 2 Introduction

Hadoop 2 is a complete overall of some of the core Hadoop libraries. It is a fundamental shift in the way applications run on top of Hadoop and it is worth understanding these changes.

yarn

YARN

In Hadoop 1, the programming API (MapReduce) and resource management of the cluster were all bundled together. In Hadoop 2, resource management is now handled by YARN (Yet Another Resource Negotiator).

YARN manages the resources available to us on the cluster. To understand what YARN does, we need to look at the components that make it up:

yarn_architecture

Resource Manager

  • Runs on a single master node
  • Global resource scheduler across the node
  • Arbitrates resources between competing applications
  • Nodes have resources - memory and CPU cores - which the resource manager allocates.

Node Manager

  • Sits on each slave node
  • Handles communication with Resource Manager

Applications

  • Applications are jobs submitted to the YARN framework
  • Could be MapReduce job, Spark job etc.

Application Master

  • One per-application
  • Requests containers to actually run the job. Containers will be distributed across the container.

Containers

  • Create by the RM upon request
  • Allocate a certain amount of resources ( CPU and memory) on a slave node.

Applications

In Hadoop 2, applications are no longer limited to just MapReduce. Cluster can be used for multiple different systems at the same time.  The cluster resources can be between utilised and new systems can integrate by implementing the YARN API.

YARN2

Scheduling

  • Hierarchical queue system
  • Various scheduling mechanisms (Capacity Scheduler, Fair Scheduler)
  • Cloudera CDH5 uses Fair Scheduling by default.

One thought on “Hadoop 2 Introduction”

Leave a Reply

Your email address will not be published. Required fields are marked *