In this post, we look at the different approaches for launching multiple MapReduce jobs, and analyze their benefits and shortfalls. Topics covered include how to implement job control in the driver, how to use chaining, and how to work with Oozie to manage MapReduce workflows.
Managing Multiple Jobs
Because the MapReduce programming model is deliberately simple, you usually cannot solve a complete programming problem with a single job. Instead, you often need to run a sequence of MapReduce jobs, using the output of one as the input to the next. And of course there may be other non-MapReduce applications, such as Hive, Drill, and Pig, that you wish to leverage in a workflow.
You have several choices for managing multiple MapReduce jobs. You could manage and launch everything from a single central Java driver or a shell script, but both of these approaches can be a challenge to write and maintain. You would, for example, be responsible for writing the code that starts, stops, tests, and restarts each job.
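To give a sense of what a hand-written driver takes on, here is a minimal sketch in plain Java. It is not Hadoop code: `JobStep` is a hypothetical stand-in for "submit one job and wait for it", and the `-stepN` directory naming is an assumption made for illustration. The sketch runs the steps in order, feeds each step's output directory to the next step as input, and stops at the first failure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SequentialDriverSketch {

    /** Hypothetical stand-in for submitting one job and waiting for it to finish. */
    interface JobStep {
        boolean run(String inputDir, String outputDir);
    }

    /**
     * Runs the steps in order, passing each step's output directory to the
     * next step as its input. Stops at the first failed step and returns the
     * output directories that were produced successfully.
     */
    static List<String> runPipeline(String inputDir, List<JobStep> steps) {
        List<String> produced = new ArrayList<>();
        String currentInput = inputDir;
        for (int i = 0; i < steps.size(); i++) {
            String outputDir = inputDir + "-step" + (i + 1); // assumed naming scheme
            if (!steps.get(i).run(currentInput, outputDir)) {
                break; // a real driver would also have to clean up, retry, or alert
            }
            produced.add(outputDir);
            currentInput = outputDir;
        }
        return produced;
    }

    public static void main(String[] args) {
        JobStep succeeds = (in, out) -> true;
        JobStep fails = (in, out) -> false;
        // prints [/data/in-step1, /data/in-step2]
        System.out.println(runPipeline("/data/in", Arrays.asList(succeeds, succeeds)));
        // prints [/data/in-step1]
        System.out.println(runPipeline("/data/in", Arrays.asList(succeeds, fails, succeeds)));
    }
}
```

Even this toy version has to decide what happens on failure; retries, cleanup of partial output, and monitoring would all be additional code for you to write and maintain.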
An easier approach is to use a service that is designed specifically for managing multiple jobs, such as Apache Oozie.
You can use a JobControl object to manage multiple sequential MapReduce jobs. But the JobControl object does not interface with non-MapReduce jobs, and it gives you only limited control over the order of jobs. So, generally, Oozie is the better choice. And since Oozie workflows are defined in XML, it is relatively easy to modify a multi-job configuration.
Yet another solution for managing multiple MapReduce jobs is called job chaining. The primary advantage of this solution over the others is that data from mappers is passed to the next mappers or reducers in memory rather than on disk. This can be a huge performance gain for programs that lend themselves to this model. However, chaining works only with MapReduce; it cannot interface with other tools in the ecosystem.
MapReduce job chaining allows you to launch multiple mapper and reducer classes within the same task. Output from a given Mapper class can either be sent as input to the next Mapper class or as input to a Reducer class. Similarly, output from a reducer can be sent as input to a Mapper class.
All output is transferred in memory, except for the "normal" shuffle from the map phase to the reduce phase.
Note that job chaining is tied to the API package you use: this post assumes the older mapred package API, since early releases of the newer mapreduce package API did not support chaining.
The flow of data and execution when using the MapReduce job chaining pattern is:
- One or more mappers, followed by
- Shuffle phase, followed by
- Exactly one reducer, followed by
- Zero or more mappers
Note that you can run multiple mappers in the map phase before the shuffle/sort. Similarly, you can run multiple mappers in the reduce phase after the single Reducer is called.
The code snippet above shows an implementation with ChainMapper and ChainReducer. In this example, we are constructing a chain of 2 mappers (Amap.class and Bmap.class) followed by a single reducer (Reduce.class).
Note that there is a single JobConf object (called "conf") that manages the entire job. There are also individual JobConf objects for the chain mapper job and chain reducer job.
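The data flow described in this section (one or more mappers, a shuffle, exactly one reducer, then zero or more mappers) can be modelled in plain Java. The sketch below is not the Hadoop ChainMapper/ChainReducer API and uses no Hadoop classes; the method names and the word-count logic are assumptions chosen for illustration. It only shows how records pass from one stage to the next in memory, with the shuffle as the single grouping step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ChainFlowSketch {

    // First mapper in the map phase: split a line into tokens.
    static List<String> tokenizeMapper(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Second mapper in the map phase: normalize the token and emit (word, 1).
    static Map.Entry<String, Integer> countMapper(String token) {
        return new SimpleEntry<>(token.toLowerCase(), 1);
    }

    // The single reducer: sum all values for one key.
    static int sumReducer(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Mapper that runs after the reducer: keep only words seen at least twice.
    static boolean frequentWordMapper(int total) {
        return total >= 2;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Map phase: each record flows through both mappers in memory.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : lines) {
            for (String token : tokenizeMapper(line)) {
                mapOutput.add(countMapper(token));
            }
        }
        // Shuffle: group values by key, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: one reducer, then a trailing mapper, again in memory.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int total = sumReducer(group.getValue());
            if (frequentWordMapper(total)) {
                result.put(group.getKey(), total);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=2, oozie=2}
        System.out.println(run(Arrays.asList("Hadoop hadoop Oozie", "oozie Pig")));
    }
}
```

In a real chained job the framework, not a loop in your own code, moves records between the stages, but the ordering constraint is the same: only the shuffle separates the map-side chain from the reduce-side chain.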
Oozie is a client-server workflow engine for MapReduce and the Hadoop ecosystem for tasks that run on the same cluster. Workflows can be expressed as directed acyclic graphs that contain control flow and action nodes. A control flow node marks either the beginning or end of a workflow, and the action nodes are the intermediate nodes in the DAG.
In general, Oozie is the best choice for managing multiple jobs. It can be modified relatively easily and also allows job monitoring by a cluster administrator.
Multiple Jobs Workflow
Let’s review the workflow for the simple wordcount MapReduce job. The "start," "end," and "kill" nodes are control flow nodes, and the "map-reduce wordcount" node is an action node. In this directed acyclic graph, the workflow terminates in one of two ways: the wordcount application returns successfully, or it fails with an error. Oozie is, of course, intended to manage much more complex workflows than this.
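The control flow of this small workflow can be modelled as a state machine. The sketch below is plain Java rather than an Oozie workflow definition (those are written in XML), and the `Node` enum and `run` method are hypothetical names. It assumes the usual arrangement in which the action transitions to "end" on success and to "kill" on error.

```java
import java.util.function.BooleanSupplier;

final class WordcountWorkflowSketch {

    /** The four nodes of the wordcount workflow: three control nodes, one action node. */
    enum Node { START, WORDCOUNT, END, KILL }

    /**
     * Walks the workflow from START and returns the terminal node reached.
     * The action reports success by returning true; returning false or
     * throwing is treated as an error.
     */
    static Node run(BooleanSupplier wordcountAction) {
        Node current = Node.START;
        while (current != Node.END && current != Node.KILL) {
            if (current == Node.START) {
                current = Node.WORDCOUNT; // start always hands off to the action
            } else {
                boolean succeeded;
                try {
                    succeeded = wordcountAction.getAsBoolean();
                } catch (RuntimeException e) {
                    succeeded = false;
                }
                current = succeeded ? Node.END : Node.KILL;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(run(() -> true));  // prints END
        System.out.println(run(() -> false)); // prints KILL
    }
}
```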
Creating and modifying Oozie workflows is performed by editing the corresponding XML file associated with the workflow. This makes Oozie very easy to use.
A more complex workflow DAG is shown in the image above. Note that Oozie can manage non-MapReduce jobs like Pig, Hive, and standalone Java programs. Also note that fork and join operations are supported, allowing for parallel processing in an Oozie workflow.
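As a rough analogy for fork and join, the plain-Java sketch below starts several branches concurrently and waits for all of them before deciding where to go next. It is not Oozie code; `forkThenJoin` and the "end"/"kill" return values are hypothetical, and the sketch assumes that a failure in any branch sends the workflow down the error path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class ForkJoinSketch {

    /**
     * Fork: start every branch concurrently.
     * Join: wait for all branches, then continue to "end" only if every
     * branch succeeded; otherwise take the "kill" path.
     */
    static String forkThenJoin(List<Supplier<Boolean>> branches) {
        List<CompletableFuture<Boolean>> running = new ArrayList<>();
        for (Supplier<Boolean> branch : branches) {
            running.add(CompletableFuture.supplyAsync(branch));
        }
        boolean allSucceeded = true;
        for (CompletableFuture<Boolean> future : running) {
            if (!future.join()) { // blocks until this branch has finished
                allSucceeded = false;
            }
        }
        return allSucceeded ? "end" : "kill";
    }

    public static void main(String[] args) {
        List<Supplier<Boolean>> branches = Arrays.asList(() -> true, () -> true);
        System.out.println(forkThenJoin(branches)); // prints end
    }
}
```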
Installing the Oozie Service
Installing the Oozie service requires 3 steps:
To install on the server use:
- yum install mapr-oozie-internal -y
- service mapr-oozie start
To install on a Linux or Mac OS X client use:
- yum install mapr-oozie
- Configure PATH and OOZIE_URL
To make the required Hadoop cluster modification:
- Configure proxy user in core-site.xml
That covers the main approaches for launching multiple MapReduce jobs, along with their benefits and shortfalls: job control in the driver, chaining, and Oozie workflows. For full-length courses on a range of Hadoop technologies, be sure to check out our free Hadoop On-Demand Training.