This blog is a first in a series that discusses some design patterns from the book MapReduce design patterns and shows how these patterns can be implemented in Apache Spark(R).
MapReduce Blog Posts
In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.
In this blog post, we introduce the concept of using non-Java programs or streaming for MapReduce jobs. MapReduce’s streaming feature allows programmers to use languages other than Java such as Perl or Python to write their MapReduce programs. You can use streaming for either rapid prototyping using sed/awk, or for full-blown MapReduce deployments. Note that the streaming feature does not include C++ programs – these are supported through a similar feature called pipes.
In this post, we look at the different approaches for launching multiple MapReduce jobs, and analyze their benefits and shortfalls. Topics covered include how to implement job control in the driver, how to use chaining, and how to work with Oozie to manage MapReduce workflows. Because the MapReduce programming model is simplistic, you usually cannot completely solve a programming problem with one program. Instead, you often need to run a sequence of MapReduce jobs, using the output of one as the input to the next. And of course there may be other non-MapReduce applications, such as Hive, Drill, and Pig, that you wish to leverage in a workflow.
Hadoop MapReduce is a framework that simplifies the process of writing big data applications running in parallel on large clusters of commodity hardware. The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and one MRAppMaster per application (see the YARN Architecture Guide). Each MapR software release supports and ships with a specific version of Hadoop. For example, MapR 3.0.1 shipped with Hadoop 0.20.2, while MapR 4.0.1 uses Hadoop 2.4 including YARN.
In this post, we detail how to work with counters to track MapReduce job progress. We will look at how to work with Hadoop’s built-in counters, as well as custom counters. In part 2, we will discuss how to use the MapR Control System (MCS) to monitor jobs. We’ll also detail how to manage and display jobs, history, and logs using the command line interface. Counters are used to determine if and how often a particular event occurred during a job execution. There are 4 categories of counters in Hadoop: file system, job, framework, and custom.
In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN), and describe the MapReduce Job Execution framework. We also take a detailed look at how jobs are executed and managed in YARN and how YARN differs from MapReduce v1. To begin, a user runs a MapReduce program on the client node which instantiates a Job client object. Next, the Job client submits the job to the JobTracker. Then the job tracker creates a set of map and reduce tasks which get sent to the appropriate task trackers. The task tracker launches a child process which in turns runs the map or reduce task. Finally the task continuously updates the task tracker with status and counters and writes its output to its context.
In this blog post we detail how data is transformed as it executes in the MapReduce framework, how to design and implement the Mapper, Reducer, and Driver classes; and execute the MapReduce program. In order to write MapReduce applications you need to have an understanding of how data is transformed as it executes in the MapReduce framework. From start to finish, there are four fundamental transformations.
It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.
|Improving performance by letting MapR-FS do the right thing|
Read more about what is new from AWS on the Amazon Web Services Blog.
Deploying a MapR Cluster
Running MapReduce jobs on ingested data is traditionally batch-oriented: the data must be first transferred to a local file system accessible to the Hadoop cluster, then copied into HDFS with Flume or the “hadoop fs” command. Only once the transfers are complete can MapReduce be run on the ingested files.
Profiling Java applications can be accomplished with many tools, such as the built-in HPROF JVM native agent library for profiling heap and CPU usage. In the world of Hadoop and MapReduce, there are a number of properties you can set to enable profiling of your mapper and reducer code.
With MapR’s enterprise-grade distribution of Hadoop, there are 3 unique features that make this task of profiling MapReduce code easier. They are:
Blog Sign Up
Sign up and get the top posts from each week delivered to your inbox every Friday!