Apache Oozie

Apache Oozie is a valuable tool that lets Hadoop users automate commonly performed tasks, saving time and reducing user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute when a specified condition is met, and even combine multiple workflows and schedules into a package to manage their full lifecycle.

Oozie workflow jobs are expressed as Directed Acyclic Graphs (DAGs) of Hadoop jobs. In other words, users choose the jobs they'd like to perform from a variety of supported types (MapReduce, Oozie, Pig, Sqoop, and others), and specify which jobs must complete before each dependent job can start. When a workflow job is executed, Oozie dispatches jobs to MapReduce in the order specified by the DAG, and monitors their completion to determine when the next set of jobs should be dispatched.
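The dispatch order described above can be illustrated with a small sketch. This is not Oozie's implementation or API (Oozie workflows are defined in XML, not Python); it only models the idea of releasing each set of jobs once everything it depends on has finished. The function name dispatch_waves and the job names in example_workflow are hypothetical, and the sketch assumes every job succeeds.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dispatch_waves(dependencies):
    """Return lists of jobs that can run together, in dispatch order.

    `dependencies` maps each job name to the set of jobs it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        # Jobs whose prerequisites have all been marked as done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Assumption: every job in this wave finishes successfully.
        sorter.done(*ready)
    return waves


# Hypothetical workflow: two imports feed a join, which feeds a report.
example_workflow = {
    "import_users": set(),
    "import_orders": set(),
    "join": {"import_users", "import_orders"},
    "report": {"join"},
}

if __name__ == "__main__":
    for number, wave in enumerate(dispatch_waves(example_workflow), start=1):
        print(number, wave)
```

In this sketch the two import jobs have no prerequisites, so they form the first set to be dispatched; the join job is released only after both have completed, and the report job only after the join.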

An Oozie coordinator job allows a workflow job to be executed when a specified condition is met. The most common condition is a time interval: the workflow is scheduled to execute every day, hour, or minute. Users can also schedule a coordinator job to execute based on an external event, such as when a specific piece of data becomes available.
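The two kinds of trigger can be sketched as follows. This is an illustration only, not how Oozie evaluates coordinator jobs: the names nominal_times and ready_to_run are hypothetical, and treating the schedule as a half-open interval (start included, end excluded) is an assumption made for the example.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """List the times at which a time-triggered run is due in [start, end)."""
    if frequency <= timedelta(0):
        raise ValueError("frequency must be positive")
    times = []
    current = start
    while current < end:
        times.append(current)
        current += frequency
    return times


def ready_to_run(nominal_time, now, data_available):
    """A run is released once its scheduled time has passed and its input exists."""
    return now >= nominal_time and data_available


if __name__ == "__main__":
    # Hypothetical schedule: one run per hour over a three-hour window.
    start = datetime(2024, 1, 1, 0, 0)
    end = datetime(2024, 1, 1, 3, 0)
    for due in nominal_times(start, end, timedelta(hours=1)):
        print(due.isoformat(), ready_to_run(due, end, data_available=True))
```

The first function covers the time-interval case, and the data_available flag stands in for an external event such as the arrival of an input dataset.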

Oozie bundles allow users to package together a pipeline of Oozie coordinator jobs. Unlike a workflow job, whose underlying jobs run in a strict dependency order, a bundle job specifies no strict dependencies among its underlying coordinator jobs.