How to Use Hadoop without Knowing Hadoop

A brief history of the situation

One of the challenges with Hadoop is getting value out of it without having to learn all the new skillsets that you need to truly harness Hadoop’s power. The reality of using the MapR Distribution including Hadoop is… you don’t have to know Hadoop to use Hadoop!

I recently came up against this again and thought I would throw it out there and hopefully make someone’s journey to their first Hadoop job a no-brainer.

Let’s start with the situation where MapR came to the rescue. We had a large amount of sensor data from some legacy gear that was not going anywhere anytime soon. After all, if the gear is in the field and working, then don’t just replace it for the sake of replacing it, as this can be very expensive!

The problem was that the resilient sensors were outlasting the people who knew how the information was constructed and stored, so we needed to move it to something more current: something the new staff could consume without thinking about it.

The decision was made to give Hadoop a shot, but the task was to use as much of the existing code and processes as possible. The goal was to get the data to a format that could be used by other tools for searching and other uses. It’s interesting to note that most of the skills of the analytics team revolved around SQL.

The Challenge: Put a solution together that could take proprietary data from its current form to one that could be consumed with SQL, using the existing code base and skills. All of this had to be done without having to undergo Hadoop training to write, implement, or maintain the solution.

The Result: After about an hour with someone who knew the existing process, we had it up, running, and done without having to understand any Hadoop libraries. We tweaked existing code and took what they were currently doing on the operators’ computers and applied it to a MapR Hadoop cluster without using any Hadoop commands except for the Hadoop streaming process. That is the benefit of using a real file system with true read/write capabilities in a POSIX environment.

The rest of this article contains a generalized example of how you can do the same. Let’s take a quick look at the ETL solution we put together using MapR, a jar file, a bash script, and Apache Drill. The raw data is pushed into the MapR cluster via NFS. It lands and is kept in the daily staging area “/data/new”. On an hourly basis, it is processed into JSON files and moved to “/data/json”. Once the data has been processed, the original files are moved to “/data/archive”. From there, the JSON data can be queried with ANSI SQL using Apache Drill.
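Concretely, pushing a file into the staging area from any machine with the cluster’s NFS mount can be as simple as the sketch below. The mount point /mapr/cluster-name and the file name are placeholders:

# Push a raw sensor dump into the daily staging area over the NFS mount
cp sensor_dump.bin /mapr/cluster-name/data/new/

# After the hourly pass, the converted data shows up here...
ls /mapr/cluster-name/data/json/
# ...and the original file is parked here
ls /mapr/cluster-name/data/archive/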

Java JSON formatting operation
As it turns out, there was already a Java program that could convert the data to a JSON object. The staff had been pulling section after section and taking a look at the data a month at a time. After spending some time understanding how it worked, I decided to use it as is and combine it with a simple bash script and Hadoop streaming to convert large amounts of data to JSON files for Apache Drill to search.

The existing application took three arguments. The first was the input file containing the compressed sensor data. The second was the conversion format; JSON was the only one I needed to know about. The third was the output directory. We made some changes so the output formed a searchable file structure that broke the data up into folders like targetOutputLocation/sensorid/day/data. We then created an executable jar file that took three arguments:

- The input file containing the compressed sensor data
- The conversion format (JSON in this case)
- The location to save the JSON-formatted metrics file; the file is saved as clustered/date.json

To execute the jar file, a command like this was used:
java -jar /my/jar/location/myJar.jar /my/input/compressed/file JSON /my/output/directory

After a quick successful test on the MacBook, the jar file was moved to the MapR cluster and the test was repeated successfully. With other flavors of Hadoop, NFS access can be a bit tricky, but MapR has a true POSIX read/write file system that allows standard Linux commands. Because of this, most code that can execute on Linux can be used on data in the cluster without changing it. That is what we did when we tested the jar file by running it on a node in the cluster.
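As a rough illustration, the same test on a cluster node just swaps the local paths for paths under the NFS mount point; the mount point and directory names below are placeholders:

java -jar /my/jar/location/myJar.jar /mapr/cluster-name/data/new/compressed_file JSON /mapr/cluster-name/data/json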

Now we could have just processed all the data this way, but that would have put all the compute burden on one node; we can do better than that. By using Hadoop streaming, we can take thousands of files and split them up across the nodes so that all the nodes in the cluster can be used to process the operations. We could also limit the execution to specific nodes in the cluster if we wanted to be mindful of other production workloads, but that is not covered in this blog post. We are going to focus on using Hadoop streaming to execute MapReduce jobs in Hadoop, without knowing anything about Hadoop other than the basics of one command.

The Hadoop Streaming job execution
In order to control the execution without needing to rebuild the jar file when changes are needed, a bash script is used as the mapper by the Hadoop streaming job to control the ETL operation. There are many other ways to do this same task, but this approach lets us reuse existing tools and get a quick return on our Hadoop effort. Next we’ll take a look at the placement of the files and the execution of the job.

File placements for execution
There are two main scripts needed to execute this ETL process. The first is the existing code that we put in an executable Java jar titled myJar.jar. This jar reads the compressed data in its proprietary form and ETLs it to JSON format in a specified output location. The second is the bash script that the Hadoop streaming job uses to map the files for execution. The other two locations of note are the input file and the output directory. The locations and files can be anywhere you want them to be.

Job Execution
This ETL operation was built on a MapR 4.0.1 cluster and uses YARN as the resource manager for the Hadoop streaming job. One interesting thing to note about MapR is that MapReduce v1 and YARN can be used in the cluster at the same time, even on the same node. In this case, when YARN fires up the Hadoop streaming job, it pulls in the list of input files and divides them up into tasks. Since the Java code has no knowledge of Hadoop, we need to make a few changes so the Java code knows where the file it will be working on lives. The bash script is used as the mapper; it takes the specified input file and passes the file location to the Java jar, which completes the ETL operation. This was done so the jar could still ETL data outside of Hadoop, while also being usable in a MapReduce job across the cluster. Below is an example of executing the Hadoop streaming job based on the file locations listed above.

hadoop2 jar \
  -input /data/my_raw_data/*/*/* \
  -mapper /mapr/cluster-name/user/user1/ \
  -output /data/projectlog/01NOV14

Below is an explanation of the arguments used in the Hadoop streaming command:

-input specifies the files to be processed by the job; here the wildcards pull in every sensor and every day
-mapper specifies the location of the bash script we are using as the mapper
-output specifies the directory where the job writes its output (the standard-out messages from the mapper tasks)
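For reference, a fully spelled-out version of the command might look like the sketch below; the hadoop-streaming jar path and the mapper script name (json_mapper.sh) are assumptions, so substitute your own locations:

hadoop2 jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar \
  -input /data/my_raw_data/*/*/* \
  -mapper /mapr/cluster-name/user/user1/json_mapper.sh \
  -output /data/projectlog/01NOV14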

Accessing multiple files with Hadoop streaming
Since we have many files that may need processing, you can execute a MapReduce job for each directory, include multiple “-input” designators, or aggregate all the files into one list. Since our files have a uniform layout, we can use the wildcard * to pull in multiple directories of files, as shown below. This works because all files sit at the same level in the directory structure: /data/my_raw_data/sensorid/date/compressed_file.
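Both forms get you to the same set of files; the sensor ID and dates below are only illustrative:

# name the directories explicitly with repeated -input flags...
-input /data/my_raw_data/AC500E1A36B/130322 -input /data/my_raw_data/AC500E1A36B/130323
# ...or let the wildcards cover every sensor and every day at once
-input /data/my_raw_data/*/*/*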

If you are using MapR and can list all the files you want with an ls command, then you can use that same pattern in a Hadoop streaming job: “ls /mapr/cluster-name/data/my_raw_data/*/*/*” ends up as “-input /data/my_raw_data/*/*/*” in our job command.

Notes on the job output location
Since the mapper is actually executing the ETL script, the only things that end up in the output directory are the standard-out messages generated during the job. You can discard this information or save it as a processing log.
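Because the output directory is just another directory on the NFS mount, the log can be reviewed or discarded with ordinary commands; the part-* file naming assumed here is the usual Hadoop streaming output convention:

cat /mapr/cluster-name/data/projectlog/01NOV14/part-*   # review the processing log
rm -r /mapr/cluster-name/data/projectlog/01NOV14        # or throw it away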

Mapper script explanation
As Hadoop streaming processes the input file list, it divides the files across the nodes and launches a task for each file. The mapper is spun up and launched with no arguments. The script pulls the current file it has been assigned from the environment variable “mapreduce_map_input_file”, which is set by the Hadoop streaming task when it launches the mapper script. The input file arrives in the “maprfs:/data/my_raw_data/AC500E1A36B/130322/compressed_file” format, which cannot be used by the jar file, since the Java code has no knowledge of Hadoop. The script replaces the “maprfs:” prefix with the NFS mount point “/mapr/cluster-name” and then launches the jar file to do ETL on the file “/mapr/cluster-name/data/my_raw_data/AC500E1A36B/130322/compressed_file”. The output location passed to the jar file is also reached through the NFS mount point.

An example is listed below.
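This is a minimal sketch rather than the original script; the script name (json_mapper.sh), jar path, cluster name, and output location are the placeholders used earlier in this post.

#!/bin/bash
# json_mapper.sh -- used as the -mapper for the Hadoop streaming job.
# Hadoop streaming sets mapreduce_map_input_file to something like:
#   maprfs:/data/my_raw_data/AC500E1A36B/130322/compressed_file

# We only need the file name, not the record stream, so drain stdin.
cat > /dev/null

# Swap the maprfs: prefix for the NFS mount point so the non-Hadoop-aware jar can read the file.
INPUT_FILE="/mapr/cluster-name${mapreduce_map_input_file#maprfs:}"

# The output location is also reached through the NFS mount point (placeholder path).
OUTPUT_DIR="/mapr/cluster-name/data/json"

# Run the existing ETL jar against this one file.
java -jar /my/jar/location/myJar.jar "$INPUT_FILE" JSON "$OUTPUT_DIR"

# Anything echoed here ends up in the job's -output directory (the processing log).
echo "processed $INPUT_FILE"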

That’s all it takes. Now the code and the coder that had no knowledge of Hadoop have a running Hadoop MapReduce job.

Hope this helped.

Happy big data prospecting,
Jim Bates

