Featured Author

James Casaletto
Solutions Architect, MapR

James is a Principal Solutions Architect for MapR, where he develops and deploys big data solutions with Apache Hadoop.

Author's Posts

Posted on February 26, 2015 by James Casaletto

In this blog post, we introduce the concept of using non-Java programs, or streaming, for MapReduce jobs. MapReduce's streaming feature allows programmers to write their MapReduce programs in languages other than Java, such as Perl or Python. You can use streaming either for rapid prototyping with sed/awk or for full-blown MapReduce deployments. Note that the streaming feature does not support C++ programs; those are handled by a similar feature called pipes.
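As a quick sketch of the streaming idea, here is a minimal word-count mapper and reducer in Python. The file name (`wordcount.py`) and the `map`/`reduce` command-line switch are illustrative, not part of any Hadoop API; the only contract streaming imposes is reading records from stdin and writing tab-separated key/value pairs to stdout.

```python
#!/usr/bin/env python3
"""Illustrative Hadoop Streaming word count (hypothetical file: wordcount.py).
Streaming feeds input records on stdin and expects tab-separated
key/value pairs on stdout."""
import sys

def map_line(line):
    # Emit a (word, 1) pair for every whitespace-separated token.
    return [(word, 1) for word in line.split()]

def reduce_pairs(word, counts):
    # Streaming delivers mapper output sorted by key, so all counts for
    # one word arrive together and can simply be summed.
    return (word, sum(counts))

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        for line in sys.stdin:
            for word, count in map_line(line):
                sys.stdout.write("%s\t%d\n" % (word, count))
    else:
        # Reduce mode: sum consecutive runs of the same (sorted) key.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    sys.stdout.write("%s\t%d\n" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            sys.stdout.write("%s\t%d\n" % (current, total))
```

A streaming job wiring these scripts in might look along these lines (jar path illustrative): `hadoop jar hadoop-streaming.jar -input in -output out -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -file wordcount.py`.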

Posted on February 25, 2015 by James Casaletto

In this post, we look at the different approaches for launching multiple MapReduce jobs and analyze their benefits and shortcomings. Topics covered include how to implement job control in the driver, how to use chaining, and how to work with Oozie to manage MapReduce workflows. Because the MapReduce programming model is deliberately simple, you usually cannot solve a programming problem completely with one program. Instead, you often need to run a sequence of MapReduce jobs, using the output of one as the input to the next. And of course there may be other non-MapReduce applications, such as Hive,...
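The job-control pattern can be sketched in plain Python (this is a toy simulation, not Hadoop driver code): each "job" consumes the previous job's output, and the driver aborts as soon as one job reports failure, mirroring the check a Java driver makes on `waitForCompletion()` before submitting the next job. The stage names and functions are hypothetical.

```python
def run_pipeline(jobs, data):
    # Driver-style job control: run each job in order, feeding the
    # output of one job in as the input of the next, and stop the
    # whole pipeline if any job reports failure.
    for name, job in jobs:
        ok, data = job(data)
        if not ok:
            raise RuntimeError("job %r failed; aborting pipeline" % name)
    return data

# Hypothetical two-job chain: tokenize the input, then count words.
def tokenize(text):
    return True, text.split()

def count_words(words):
    return True, {w: words.count(w) for w in set(words)}

chained = [("tokenize", tokenize), ("count", count_words)]
```

In a real Hadoop driver the same sequencing is done with temporary output directories between jobs, or delegated to a workflow manager such as Oozie, as the post describes.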

Posted on February 17, 2015 by James Casaletto

Hadoop MapReduce is a framework that simplifies the process of writing big data applications that run in parallel on large clusters of commodity hardware. The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and one MRAppMaster per application (see the YARN Architecture Guide). Each MapR software release supports and ships with a specific version of Hadoop. For example, MapR 3.0.1 shipped with Hadoop 0.20.2, while MapR 4.0.1 uses Hadoop 2.4, including YARN.

Posted on February 10, 2015 by James Casaletto

In this post, we will discuss how to use the MapR Control System (MCS) to monitor MRv1 jobs. We will also see how to manage and display jobs, history, and logs using the command line interface.

Posted on February 5, 2015 by James Casaletto

In this post, we detail how to work with counters to track MapReduce job progress. We will look at how to work with Hadoop's built-in counters, as well as custom counters. In part 2, we will discuss how to use the MapR Control System (MCS) to monitor jobs. We'll also detail how to manage and display jobs, history, and logs using the command line interface. Counters are used to determine whether, and how often, a particular event occurred during job execution. There are four categories of counters in Hadoop: file system, job, framework, and custom.

Posted on February 3, 2015 by James Casaletto

In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN), and describe the MapReduce job execution framework. We also take a detailed look at how jobs are executed and managed in YARN and how YARN differs from MapReduce v1. To begin, a user runs a MapReduce program on the client node, which instantiates a Job client object. Next, the Job client submits the job to the JobTracker. Then the JobTracker creates a set of map and reduce tasks, which get sent to the appropriate TaskTrackers. The TaskTracker launches a child process, which in turn runs the map or reduce task. Finally the...

Posted on January 29, 2015 by James Casaletto

In this blog post we detail how data is transformed as it executes in the MapReduce framework; how to design and implement the Mapper, Reducer, and Driver classes; and how to execute the MapReduce program. In order to write MapReduce applications, you need to understand how data is transformed as it moves through the MapReduce framework. From start to finish, there are four fundamental transformations.
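Those transformations can be simulated in plain Python (word count as a stand-in example; this mimics the data flow, not the Hadoop API): input splits become (key, value) records, the map phase emits intermediate pairs, shuffle/sort groups the pairs by key, and the reduce phase aggregates each group.

```python
from itertools import groupby
from operator import itemgetter

def mapper(_, line):
    # 2. Map: turn one (offset, line) record into (word, 1) pairs.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # 4. Reduce: aggregate all counts observed for one word.
    yield (word, sum(counts))

def run(lines):
    # 1. Input: each line becomes a (byte-offset-like key, line) record.
    intermediate = [pair for i, line in enumerate(lines)
                    for pair in mapper(i, line)]
    # 3. Shuffle/sort: order intermediate pairs so equal keys are adjacent.
    intermediate.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return dict(out)
```

In a real job the Driver class configures these phases on a `Job` object; the shape of the data at each boundary is the same as in this simulation.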

Posted on December 10, 2014 by James Casaletto

In this week's Whiteboard Walkthrough, James Casaletto walks you through how to configure the network for the MapR Hadoop Sandbox. Whether you use VirtualBox, VMware Fusion, VMware Player, or pretty much any hypervisor on your laptop to support your MapR Sandbox, you'll need to configure the network. There are essentially three different settings that you can use to configure the network for your Sandbox: NAT, host-only, and bridged.
