James is a Principal Solutions Architect for MapR, where he develops and deploys big data solutions with Apache Hadoop.
In this blog post, we introduce streaming, which lets you use non-Java programs for MapReduce jobs. MapReduce’s streaming feature allows programmers to write their MapReduce programs in languages other than Java, such as Perl or Python. You can use streaming either for rapid prototyping with sed/awk or for full-blown MapReduce deployments. Note that the streaming feature does not include C++ programs; these are supported through a similar feature called pipes.
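As a rough illustration of the streaming model, here is a minimal word-count sketch in Python. It assumes the usual streaming contract: the mapper and reducer exchange tab-separated key/value lines, and the reducer receives its input sorted by key. The function names `map_lines` and `reduce_lines` are hypothetical, and the shuffle is simulated locally with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key.

    Assumes the input is already sorted by key, which is how a
    streaming reducer normally receives it.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In an actual streaming job, each function would typically live in its own script that reads `sys.stdin` and prints to standard output, with the scripts passed to the streaming jar through its `-mapper` and `-reducer` options.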
In this post, we look at the different approaches for launching multiple MapReduce jobs, and analyze their benefits and shortfalls. Topics covered include how to implement job control in the driver, how to use chaining, and how to work with Oozie to manage MapReduce workflows. Because the MapReduce programming model is simple, you usually cannot solve a programming problem completely with one program. Instead, you often need to run a sequence of MapReduce jobs, using the output of one as the input to the next. And of course there may be other non-MapReduce applications, such as Hive,...
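The idea of feeding one job's output into the next can be sketched in Python with a small in-memory simulation. This is not Hadoop's job-control API or an Oozie workflow; `run_job`, `run_chain`, and the mapper and reducer names are hypothetical helpers used only to show the data flow between two chained jobs.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one simulated MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: count how often each word occurs.
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, counts):
    yield word, sum(counts)


# Job 2: group words by their count, reading job 1's (word, count) pairs.
def invert_mapper(pair):
    word, count = pair
    yield count, word


def invert_reducer(count, words):
    yield count, sorted(words)


def run_chain(lines):
    """Driver: the output of the first job is the input to the second."""
    word_counts = run_job(lines, count_mapper, count_reducer)
    return run_job(word_counts, invert_mapper, invert_reducer)
```

On a real cluster the intermediate result would be written to a directory in the distributed file system, and the driver would start the second job only after the first one completed successfully.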
Hadoop MapReduce is a framework that simplifies the process of writing big data applications running in parallel on large clusters of commodity hardware. The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and one MRAppMaster per application (see the YARN Architecture Guide). Each MapR software release supports and ships with a specific version of Hadoop. For example, MapR 3.0.1 shipped with Hadoop 0.20.2, while MapR 4.0.1 uses Hadoop 2.4 including YARN.
In this post, we will discuss how to use the MapR Control System (MCS) to monitor MRv1 jobs. We will also see how to manage and display jobs, history, and logs using the command line interface.
In this post, we detail how to work with counters to track MapReduce job progress. We will look at how to work with Hadoop’s built-in counters, as well as custom counters. In part 2, we will discuss how to use the MapR Control System (MCS) to monitor jobs. We’ll also detail how to manage and display jobs, history, and logs using the command line interface. Counters are used to determine if and how often a particular event occurred during a job execution. There are 4 categories of counters in Hadoop: file system, job, framework, and custom.
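To make the idea of a custom counter concrete, here is a small Python sketch that tallies how often two events occur while parsing records. The group name `Quality`, the counter names, and the record layout are all assumptions made for illustration. The `reporter:counter:<group>,<counter>,<amount>` line follows the convention Hadoop Streaming tasks use to update counters through standard error; a Java job would use the counter API on its task context instead.

```python
import sys
from collections import Counter


def counter_line(group, name, amount=1):
    """Format a counter update in the Hadoop Streaming stderr convention."""
    return f"reporter:counter:{group},{name},{amount}"


def parse_records(lines, report=lambda msg: print(msg, file=sys.stderr)):
    """Parse 'name,number' records and count good versus malformed ones.

    `report` receives one counter-update line per record; by default it
    writes to stderr, where a streaming task would pick it up.
    """
    tally = Counter()
    parsed = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) == 2 and fields[1].isdigit():
            parsed.append((fields[0], int(fields[1])))
            event = "GOOD_RECORDS"
        else:
            event = "MALFORMED_RECORDS"
        tally[event] += 1
        report(counter_line("Quality", event))
    return parsed, tally
```

The local `tally` is only there so the sketch can be checked without a cluster; in a real job the framework aggregates counter updates across all tasks.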
In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN), and describe the MapReduce Job Execution framework. We also take a detailed look at how jobs are executed and managed in YARN and how YARN differs from MapReduce v1. To begin, a user runs a MapReduce program on the client node, which instantiates a Job client object. Next, the Job client submits the job to the JobTracker. Then the JobTracker creates a set of map and reduce tasks, which get sent to the appropriate TaskTrackers. The TaskTracker launches a child process, which in turn runs the map or reduce task. Finally the...
In this blog post, we detail how data is transformed as it executes in the MapReduce framework, how to design and implement the Mapper, Reducer, and Driver classes, and how to execute the MapReduce program. To write MapReduce applications, you need to understand how data is transformed as it moves through the MapReduce framework. From start to finish, there are four fundamental transformations.
In this week's Whiteboard Walkthrough, James Casaletto walks you through how to configure the network for the MapR Hadoop Sandbox. Whether you use VirtualBox, VMware Fusion, VMware Player, or pretty much any hypervisor on your laptop to support your MapR Sandbox, you'll need to configure the network. There are essentially three different settings that you can use to configure the network for your Sandbox. One is NAT, one is host-only, and one is bridged.