YARN

The MapR Distribution including Apache Hadoop provides advanced features for YARN, allowing both existing and new Hadoop users to deploy YARN easily in production environments. YARN on MapR provides the following unique capabilities:

  • Label-based scheduling (or job placement control) for YARN jobs, which lets users run YARN jobs on specific nodes within the cluster. This feature comes in especially handy in a multi-tenant production environment where departmental data may be stored on specific nodes within the cluster.
  • Disk I/O characteristics are taken into account in YARN resource calculation. This goes beyond the memory and CPU characteristics supported by YARN and allows disk I/O bottlenecks to be correctly identified and managed.
  • MapReduce v1 and YARN jobs can co-exist on the same node. This provides the easiest migration path to YARN, as existing production applications can continue to work without any changes.
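
The first two capabilities can be pictured with a small model. The following is only an illustrative Python sketch, not MapR's scheduler or any real YARN API: the `Node` and `Request` classes, the `eligible_nodes` function, and the unit used for disk capacity are all hypothetical names and assumptions introduced here. It shows a placement check that honors a node label and treats disk as a third resource alongside memory and CPU.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical view of a worker node's free resources."""
    name: str
    labels: Set[str]
    free_memory_mb: int
    free_vcores: int
    free_disks: float  # assumed unit: share of disk I/O capacity


@dataclass
class Request:
    """Hypothetical resource request for one job."""
    label: Optional[str]  # None means "any node"
    memory_mb: int
    vcores: int
    disks: float


def eligible_nodes(nodes: List[Node], request: Request) -> List[str]:
    """Return the names of nodes that match the label and fit all three resources."""
    result = []
    for node in nodes:
        # Label-based placement: skip nodes that do not carry the requested label.
        if request.label is not None and request.label not in node.labels:
            continue
        # Resource check: memory, CPU, and disk must all fit.
        if (node.free_memory_mb >= request.memory_mb
                and node.free_vcores >= request.vcores
                and node.free_disks >= request.disks):
            result.append(node.name)
    return result


nodes = [
    Node("node1", {"finance"}, 8192, 4, 2.0),
    Node("node2", {"finance"}, 8192, 4, 0.5),
    Node("node3", {"marketing"}, 16384, 8, 4.0),
]
request = Request("finance", 4096, 2, 1.0)
print(eligible_nodes(nodes, request))  # prints ['node1']
```

In this example node2 has enough memory and CPU but too little disk capacity, so a scheduler that ignored disk would have placed the job there; node3 has ample resources but the wrong label.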

TECH BRIEF

TeraSort Benchmark Comparison for YARN

TeraSort is a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure MapReduce performance of an Apache™ Hadoop® cluster. The following report compares performance of a YARN-scheduled TeraSort job on MapR and other distributions.

YARN Introduction

YARN (Yet Another Resource Negotiator) is a core component of Hadoop, managing access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience. Some new capabilities unlocked with YARN include:

  • In-memory Execution: Apache Spark is a data processing engine for Hadoop, offering performance-enhancing features like in-memory processing and cyclic data flow. By interacting directly with YARN, Spark is able to reach its full performance potential on a Hadoop cluster.
  • Real-time Processing: Apache Storm lets users define a multi-stage processing pipeline to process data as it enters a Hadoop cluster. Users expect Storm to process millions of events each second with low latency, so customers wanting to run Storm and batch processing engines like MapReduce on the same cluster need YARN to manage resource sharing.

Technical Details

Before YARN, when MapReduce managed cluster resources, it did so using a centralized JobTracker daemon running on a master node and TaskTracker daemons running on multiple worker nodes in the cluster. When a user executed a MapReduce job, the JobTracker first divided the job into multiple tasks, farmed those tasks out to available TaskTrackers, and monitored the status of each task, restarting it on a new TaskTracker in case of failure. The JobTracker and TaskTracker daemons were shared resources, used by all users, applications, and jobs on the cluster.
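
The JobTracker flow described above can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not Hadoop code: the `run_job` function, the round-robin assignment order, and the `failing` set that stands in for task failures are all hypothetical.

```python
def run_job(num_tasks, trackers, failing=frozenset()):
    """Toy JobTracker: place each task on a tracker, moving on to another on failure.

    Returns a dict mapping task id to the tracker that finally ran it.
    """
    placement = {}
    for task_id in range(num_tasks):
        # Start at a round-robin offset, then try the remaining trackers in turn.
        for attempt in range(len(trackers)):
            tracker = trackers[(task_id + attempt) % len(trackers)]
            if tracker not in failing:
                placement[task_id] = tracker
                break
        else:
            # No tracker could run the task.
            raise RuntimeError(f"task {task_id} failed on every tracker")
    return placement


print(run_job(4, ["tt1", "tt2", "tt3"], failing={"tt2"}))
# prints {0: 'tt1', 1: 'tt3', 2: 'tt3', 3: 'tt1'}
```

Task 1 is first sent to tt2, fails there, and is restarted on tt3. Note that a single function plays the role of scheduler and monitor for every task of every job, which mirrors the centralized design of the JobTracker.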

With YARN, new concepts and daemons were introduced in order to make applications more scalable and robust. The main new concept is that of a container, which can be thought of as a unit of computing on a node, with some amount of CPU and RAM allocated to it. Containers are used to encapsulate any tasks that run on top of YARN. The new daemons include:

  • Node Manager - Single instance per node. Responsible for monitoring and reporting on local container status.
  • Resource Manager - Single instance per cluster. Responsible for keeping track of all containers in the cluster. Communicates with Node Managers to allocate containers to tasks.
  • Application Master - Single instance per job. Spawned within a YARN container when a new job is submitted by a client, and requests additional containers for handling of any sub-tasks.
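
How the three daemons divide the work can be sketched with a toy model. This is a minimal illustration, not the YARN API: the class and method names, the first-fit placement policy, and the resource sizes are assumptions made for the example.

```python
class NodeManager:
    """One per node: tracks local free resources and running containers."""

    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores
        self.containers = []

    def launch(self, container_id, memory_mb, vcores):
        """Start a container if it fits; return True on success."""
        if memory_mb > self.free_memory_mb or vcores > self.free_vcores:
            return False
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append(container_id)
        return True


class ResourceManager:
    """One per cluster: keeps track of every container and where it runs."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.allocations = {}  # container id -> node name
        self._next_id = 0

    def allocate(self, memory_mb, vcores):
        """First-fit placement (an assumed policy); return a container id or None."""
        container_id = f"container_{self._next_id}"
        for nm in self.node_managers:
            if nm.launch(container_id, memory_mb, vcores):
                self._next_id += 1
                self.allocations[container_id] = nm.name
                return container_id
        return None


class ApplicationMaster:
    """One per job: runs in its own container and requests more for sub-tasks."""

    def __init__(self, rm, memory_mb=512, vcores=1):
        self.rm = rm
        self.own_container = rm.allocate(memory_mb, vcores)
        if self.own_container is None:
            raise RuntimeError("no capacity for the Application Master")
        self.task_containers = []

    def request_containers(self, count, memory_mb, vcores):
        """Ask the Resource Manager for up to `count` task containers."""
        for _ in range(count):
            container_id = self.rm.allocate(memory_mb, vcores)
            if container_id is None:
                break  # cluster is full
            self.task_containers.append(container_id)
        return self.task_containers


rm = ResourceManager([NodeManager("node1", 2048, 2), NodeManager("node2", 2048, 2)])
am = ApplicationMaster(rm)
am.request_containers(3, memory_mb=1024, vcores=1)
print(rm.allocations)
# prints {'container_0': 'node1', 'container_1': 'node1',
#         'container_2': 'node2', 'container_3': 'node2'}
```

Unlike the JobTracker, the Resource Manager in this sketch knows nothing about tasks or retries. It only hands out containers, while each job's Application Master decides what to run in them.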