MapReduce

Hadoop MapReduce is a powerful framework for processing large, distributed sets of structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is that it moves the computation to the data: processing runs across the entire cluster, with each node working on the data stored locally on that node. This data locality makes MapReduce orders of magnitude faster than legacy approaches to processing big data, which typically relied on a single node pulling data from remote SAN or NAS devices for processing.

MapReduce abstracts away the complexity of distributed programming, allowing programmers to describe the processing they'd like to perform in terms of a map function and a reduce function. At execution time, during the map phase, multiple nodes in the cluster, called mappers, read their local raw data and transform it into key-value pairs. This is followed by a sort and shuffle phase, in which each mapper sorts its results by key and forwards ranges of keys to other nodes in the cluster, called reducers. Finally, in the reduce phase, each reducer processes the data for the keys it received from the mappers.
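
To make the model concrete, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API. The stock Hadoop classes and method signatures below are real; the WordCount wrapper class itself is just illustrative. The map function turns raw input lines into (word, 1) key-value pairs, and the reduce function sums the counts delivered to it for each word after the sort and shuffle.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative wrapper class for the word-count map and reduce functions.
public class WordCount {

  // Map phase: each mapper reads a line of its local input split and
  // emits a (word, 1) key-value pair for every token on the line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // emit a key-value pair
        }
      }
    }
  }

  // Reduce phase: after the sort and shuffle, each reducer receives all
  // of the counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);  // emit (word, total count)
    }
  }
}
```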

MapReduce v1, included in all versions of the MapR Distribution, serves two purposes in the Hadoop cluster. First, MapReduce acts as the resource manager for the nodes in the cluster. It employs a JobTracker to divide each job into multiple tasks, distribute those tasks to one or more TaskTrackers, and monitor their progress as the TaskTrackers perform the work in parallel. In this role it is a key component of the cluster, serving as the platform for many higher-level Hadoop applications, including Pig(link) and Hive(link). Second, MapReduce serves as a data processing engine, executing jobs that are expressed with map and reduce semantics.
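
As a sketch of what driving MapReduce v1 looks like, the snippet below submits a job through the classic org.apache.hadoop.mapred API: JobClient hands the configured job to the JobTracker, which splits it into tasks for the TaskTrackers. The mapper and reducer settings are left as hypothetical placeholders, since v1 classes must implement the old mapred interfaces rather than the newer mapreduce ones used above.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitV1Job {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitV1Job.class);
    conf.setJobName("wordcount-v1");

    // Hypothetical placeholders: a real v1 job would set classes that
    // implement org.apache.hadoop.mapred.Mapper and .Reducer here.
    // conf.setMapperClass(MyV1Mapper.class);
    // conf.setReducerClass(MyV1Reducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits to the JobTracker and blocks until the job completes,
    // printing task progress along the way.
    JobClient.runJob(conf);
  }
}
```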

Starting with the MapR 4.0 release, MapR includes MapReduce v2 in addition to v1. MapReduce v2 was redesigned to perform only as a data processing engine, spinning off the resource management functionality into a new component called YARN (Yet Another Resource Negotiator)(link). Before this split, higher-level applications that needed access to cluster resources had to express their work using map and reduce semantics, with each job going through the full map, sort/shuffle, and reduce cycle. This was unsuitable for jobs that didn't fit the MapReduce paradigm, either because they required faster response times than a full MapReduce cycle allows, or because they required more complex processing than can be expressed in a single MapReduce job, such as graph processing. With YARN, Hadoop clusters become much more versatile: the same cluster can serve classic batch MapReduce processing as well as interactive workloads such as SQL queries.
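
For contrast, here is a sketch of submitting the same word-count work through the MapReduce v2 API, reusing the illustrative WordCount classes from the earlier sketch. Under YARN, MapReduce is just one application type among many: the execution framework is chosen by configuration (the mapreduce.framework.name property, normally set cluster-wide) rather than being hard-wired to a JobTracker.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitV2Job {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Routes the job to YARN; on a configured cluster this property is
    // usually already set in mapred-site.xml rather than in code.
    conf.set("mapreduce.framework.name", "yarn");

    Job job = Job.getInstance(conf, "wordcount-v2");
    job.setJarByClass(SubmitV2Job.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Blocks until the YARN application finishes, printing progress.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```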