MapReduce Blog Posts

Posted on November 2, 2015 by Carol McDonald

This blog post is the first in a series that discusses design patterns from the book MapReduce Design Patterns and shows how these patterns can be implemented in Apache Spark®.

Posted on June 17, 2015 by Anoop Dawar

In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.

Posted on February 26, 2015 by James Casaletto

In this blog post, we introduce the concept of using non-Java programs, or streaming, for MapReduce jobs. MapReduce's streaming feature allows programmers to use languages other than Java, such as Perl or Python, to write their MapReduce programs. You can use streaming either for rapid prototyping with sed/awk or for full-blown MapReduce deployments. Note that the streaming feature does not cover C++ programs; those are supported through a similar feature called Pipes.
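To make the streaming contract concrete, here is a minimal word-count sketch in Python. It assumes the standard Hadoop Streaming conventions (records on stdin/stdout, tab-separated key/value pairs, mapper output sorted by key before it reaches the reducer); the script name and invocation below are illustrative, not from the post.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count in Python. Hadoop runs the mapper
# and reducer as separate OS processes, piping records through
# stdin/stdout; writing both steps as plain functions also lets the
# pipeline be exercised locally without a cluster.
import sys

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as streaming expects."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    """Sum counts per word; streaming delivers input sorted by key."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Select the stage on the command line: "map" or "reduce".
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Such a script would be submitted with the streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper "wordcount.py map" -reducer "wordcount.py reduce"` (paths and job options illustrative).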

Posted on February 25, 2015 by James Casaletto

In this post, we look at different approaches for launching multiple MapReduce jobs and analyze their benefits and shortcomings. Topics covered include how to implement job control in the driver, how to use chaining, and how to work with Oozie to manage MapReduce workflows. Because the MapReduce programming model is deliberately simple, you usually cannot solve a programming problem completely with one program. Instead, you often need to run a sequence of MapReduce jobs, using the output of one as the input to the next. There may also be non-MapReduce applications, such as Hive, Drill, and Pig, that you wish to leverage in a workflow.
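The output-of-one-feeds-the-next pattern can be sketched locally. This toy driver is only a stand-in for sequential job submission in a real Hadoop driver (submit a job, wait for completion, point the next job at its output directory); the example jobs and data are invented for illustration.

```python
# Local sketch of chaining MapReduce jobs: each "job" is one map+reduce
# pass whose output records become the next job's input, mirroring a
# driver that submits jobs sequentially and waits for each to finish.
def run_job(records, map_fn, reduce_fn):
    mapped = [kv for r in records for kv in map_fn(r)]
    grouped = {}
    for k, v in sorted(mapped):          # stand-in for shuffle and sort
        grouped.setdefault(k, []).append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(grouped.items())]

# Job 1 counts words; Job 2 consumes Job 1's output and inverts the
# records to (count, word) so results are grouped by frequency.
counts = run_job(["a b a", "b a"],
                 lambda line: [(w, 1) for w in line.split()],
                 lambda k, vs: (k, sum(vs)))
ranked = run_job(counts,
                 lambda kv: [(kv[1], kv[0])],
                 lambda k, vs: (k, sorted(vs)))
```

In real driver code the same sequencing is typically expressed by calling the first job's blocking submit method and, on success, configuring the second job's input path to the first job's output path.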

Posted on February 17, 2015 by James Casaletto

Hadoop MapReduce is a framework that simplifies the process of writing big data applications that run in parallel on large clusters of commodity hardware. The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and one MRAppMaster per application (see the YARN Architecture Guide). Each MapR software release supports and ships with a specific version of Hadoop. For example, MapR 3.0.1 shipped with Hadoop 0.20.2, while MapR 4.0.1 uses Hadoop 2.4, including YARN.

Posted on February 5, 2015 by James Casaletto

In this post, we detail how to work with counters to track MapReduce job progress. We will look at how to work with Hadoop's built-in counters, as well as custom counters. In part 2, we will discuss how to use the MapR Control System (MCS) to monitor jobs. We'll also detail how to manage and display jobs, history, and logs using the command line interface. Counters are used to determine if, and how often, a particular event occurred during job execution. There are four categories of counters in Hadoop: file system, job, framework, and custom.

Posted on February 3, 2015 by James Casaletto

In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN), and describe the MapReduce job execution framework. We also take a detailed look at how jobs are executed and managed in YARN and how YARN differs from MapReduce v1. To begin, a user runs a MapReduce program on the client node, which instantiates a job client object. Next, the job client submits the job to the JobTracker. The JobTracker then creates a set of map and reduce tasks, which are sent to the appropriate TaskTrackers. Each TaskTracker launches a child process, which in turn runs the map or reduce task. Finally, the task continuously updates the TaskTracker with status and counters and writes its output to its context.

Posted on January 29, 2015 by James Casaletto

In this blog post, we detail how data is transformed as it moves through the MapReduce framework; how to design and implement the Mapper, Reducer, and Driver classes; and how to execute a MapReduce program. To write MapReduce applications, you need to understand how data is transformed at each stage. From start to finish, there are four fundamental transformations.
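Those four transformations can be sketched end to end on a toy dataset. This local simulation assumes the usual stages (record reader producing key/value pairs, map, shuffle-and-sort grouping by key, reduce); the function names are illustrative, not Hadoop API names.

```python
# Local sketch of the four MapReduce transformations:
# 1) the record reader turns an input split into key/value pairs,
# 2) the mapper transforms each pair, 3) shuffle-and-sort groups the
# mapper output by key, 4) the reducer folds each group to final output.
from itertools import groupby
from operator import itemgetter

def record_reader(split):
    """Yield (byte offset, line) pairs, like TextInputFormat."""
    offset = 0
    for line in split:
        yield offset, line
        offset += len(line) + 1  # +1 for the newline

def map_fn(offset, line):
    for word in line.split():
        yield word, 1

def shuffle_sort(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_fn(key, values):
    yield key, sum(values)

def run(split):
    mapped = (kv for off, line in record_reader(split)
                 for kv in map_fn(off, line))
    return [out for k, vs in shuffle_sort(mapped)
                for out in reduce_fn(k, vs)]
```

In a real job, each stage is distributed: record reading and mapping happen per input split, and the shuffle moves map output across the network to the reducers.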

Posted on February 11, 2014 by Anoop Dawar

It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.


Posted on December 20, 2013 by Aaron Eng
Improving performance by letting MapR-FS do the right thing
Posted on November 1, 2013 by Karen Whipple
The Amazon Elastic MapReduce (EMR) team has been hard at work on a series of updates and new features. You now have access to Hadoop 2.2 and new versions of Hive, Pig, HBase, and Mahout. Cluster startup time has been reduced, S3DistCp (for data movement) has been augmented, and MapR M7 is now supported.

Read more about what is new from AWS on the Amazon Web Services Blog.
Posted on October 15, 2013 by Peter Conrad

Deploying a MapR Cluster

Posted on March 14, 2013 by Jim Fiori


Running MapReduce jobs on ingested data is traditionally batch-oriented: the data must be first transferred to a local file system accessible to the Hadoop cluster, then copied into HDFS with Flume or the “hadoop fs” command. Only once the transfers are complete can MapReduce be run on the ingested files.

Posted on February 22, 2013 by Jim Fiori


Profiling Java applications can be accomplished with many tools, such as the built-in HPROF JVM native agent library for profiling heap and CPU usage. In the world of Hadoop and MapReduce, there are a number of properties you can set to enable profiling of your mapper and reducer code.

With MapR's enterprise-grade distribution of Hadoop, there are three unique features that make the task of profiling MapReduce code easier. They are:

Posted on August 23, 2011 by Tomer Shiran
MapReduce 2.0 is the codename for a new execution engine for Hadoop (developed primarily by Yahoo! engineers who are now at Hortonworks). MapReduce 2.0 is expected to become available in the next major release of Hadoop (0.23). The source code directory structure can be accessed at
