Hadoop 2 is a complete overhaul of some of the core Hadoop libraries. It is a fundamental shift in the way applications run on top of Hadoop, and it is worth understanding these changes.
In Hadoop 1, the programming API (MapReduce) and resource management of the cluster were all bundled together. In Hadoop 2, resource management is now handled by YARN (Yet Another Resource Negotiator).
YARN manages the resources available to us on the cluster. To understand what YARN does, we need to look at the components that make it up:
- The Resource Manager (RM) runs on a single master node
- Acts as the global resource scheduler across the cluster
- Arbitrates resources between competing applications
- Nodes have resources (memory and CPU cores), which the Resource Manager allocates
- A Node Manager sits on each slave node
- Handles communication with the Resource Manager
- Applications are jobs submitted to the YARN framework
- Could be a MapReduce job, a Spark job, etc.
- There is one Application Master per application
- Requests containers to actually run the job. Containers will be distributed across the cluster.
- Containers are created by the RM upon request
- Each allocates a certain amount of resources (CPU and memory) on a slave node
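
To make the division of labour concrete, here is a minimal toy model of the flow described above: an application master asks the resource manager for containers, and the resource manager carves them out of the memory and CPU cores available on the nodes. This is only an illustrative sketch, not the YARN API. The names `Node`, `Container`, `ToyResourceManager` and `run_application_master` are hypothetical, and the "most free memory first" placement rule is an assumption chosen for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A slave node as tracked by the (toy) resource manager."""
    name: str
    free_mem_mb: int
    free_vcores: int


@dataclass
class Container:
    """A slice of one node's memory and CPU cores granted to an application."""
    node: str
    mem_mb: int
    vcores: int


class ToyResourceManager:
    """Hypothetical stand-in for the Resource Manager; not the real YARN API."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, mem_mb, vcores):
        """Grant a container on a node that can fit the request, or return None."""
        candidates = [
            n for n in self.nodes
            if n.free_mem_mb >= mem_mb and n.free_vcores >= vcores
        ]
        if not candidates:
            return None
        # Assumed placement rule: pick the node with the most free memory.
        node = max(candidates, key=lambda n: n.free_mem_mb)
        node.free_mem_mb -= mem_mb
        node.free_vcores -= vcores
        return Container(node.name, mem_mb, vcores)


def run_application_master(rm, num_containers, mem_mb, vcores):
    """Request containers the way an application master would; return those granted."""
    granted = []
    for _ in range(num_containers):
        container = rm.allocate(mem_mb, vcores)
        if container is None:
            break  # the cluster has no room left for this request size
        granted.append(container)
    return granted
```

For example, with two nodes of 8192 MB / 4 cores and 4096 MB / 2 cores, an application asking for four 2048 MB, 1-core containers gets all four, while a second application asking for three more gets only two before the cluster is full.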
In Hadoop 2, applications are no longer limited to just MapReduce. The cluster can be used by multiple different systems at the same time. The cluster resources can be better utilised, and new systems can integrate by implementing the YARN API.
- Hierarchical queue system
- Various scheduling mechanisms (Capacity Scheduler, Fair Scheduler)
- Cloudera CDH5 uses the Fair Scheduler by default.
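
As a rough illustration of how a hierarchical queue system can divide a cluster, the sketch below splits a pool of memory between queues in proportion to their weights, level by level. It is a simplified assumption of the idea behind fair sharing, not the actual Fair Scheduler algorithm, which also accounts for things like demand and minimum shares. The function `fair_shares`, the queue names and the weights are all hypothetical.

```python
def fair_shares(total_mb, queue, prefix="root"):
    """Return {queue_path: share_mb} for leaf queues, splitting by weight at each level."""
    children = queue.get("children")
    if not children:
        # A leaf queue keeps everything handed down to it.
        return {prefix: total_mb}
    total_weight = sum(child.get("weight", 1) for child in children.values())
    shares = {}
    for name, child in children.items():
        # Each child receives a slice proportional to its weight (default 1).
        child_mb = total_mb * child.get("weight", 1) / total_weight
        shares.update(fair_shares(child_mb, child, f"{prefix}.{name}"))
    return shares


# Hypothetical queue hierarchy: "production" is weighted 3:1 against "adhoc",
# and "adhoc" is split evenly between two users.
queues = {
    "children": {
        "production": {"weight": 3},
        "adhoc": {"weight": 1, "children": {"alice": {}, "bob": {}}},
    }
}

shares = fair_shares(16384, queues)
for path, mb in shares.items():
    print(path, mb)
```

With 16384 MB to share, `root.production` receives 12288 MB and the two `adhoc` users receive 2048 MB each.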