Profiling MapReduce with Central Configuration, MapR-FS and Centralized Logging

Overview

Profiling Java applications can be accomplished with many tools, such as HPROF, the JVM’s built-in native agent library for profiling heap and CPU usage. In the world of Hadoop and MapReduce, there are a number of properties you can set to enable profiling of your mapper and reducer code.

With MapR’s enterprise-grade distribution of Hadoop, there are three features that make the task of profiling MapReduce code easier. They are:

  • Central Configuration
  • MapR-FS
  • Centralized Logging

This blog describes how these features enable easy profiling of Hadoop jobs on MapR, and how to enable and use Hadoop’s MapReduce profiling.

Central Configuration

The first step in profiling an application is to enable some MapReduce properties. Here are some of these properties and some sample values:

  • mapred.task.profile=true
  • mapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s
  • mapred.task.profile.maps=0-6
  • mapred.task.profile.reduces=0-0

The above properties enable profiling, specify which HPROF options to pass to the JVM when tasks are started, and specify which of the mapper and reducer tasks to profile. Typically only a few profiles are needed for analysis, since all mappers (and likewise all reducers) generally have very similar profiling signatures.

One way of specifying these parameters is on the Hadoop command line. But what if you have a series of MapReduce jobs to profile, and you’d prefer to modify the Hadoop configuration file (mapred-site.xml) to add these properties and profile all jobs? This would typically involve making the changes to mapred-site.xml, and copying that file across all nodes in the cluster.
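For a one-off job, here is a minimal sketch of the command-line approach mentioned above (assuming the job’s driver uses ToolRunner/GenericOptionsParser so that -D options are honored; the jar name, class name, and input/output paths are placeholders):

$ hadoop jar my-job.jar com.example.MyJob \
    -Dmapred.task.profile=true \
    -Dmapred.task.profile.params="-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s" \
    -Dmapred.task.profile.maps=0-6 \
    -Dmapred.task.profile.reduces=0-0 \
    /input /output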

MapR’s Central Configuration feature lets you place a new configuration file in the MapR-FS distributed file system so it is available to all nodes, and then have each node pull a copy to the local configuration directory that Hadoop uses.

So where do we place the new mapred-site.xml file in MapR-FS? For each service, MapR keeps a path at which it looks for central configuration files. These paths live in files in this directory:

  • /opt/mapr/servicesconf

For example, for Hadoop’s TaskTracker service, we use:

$ cat /opt/mapr/servicesconf/tasktracker
hadoop/hadoop-0.20.2/conf

The MapR-FS location for central configuration files is here:

  • maprfs://my.cluster.com/var/mapr/configuration

There are two sub-directories located here: one called ‘default’ and one called ‘nodes’. The ‘nodes’ directory allows per-node central configuration. We will assume all nodes use the same mapred-site.xml file, so we will create the central configuration file under ‘default’.

In order to do this, we simply copy an existing mapred-site.xml file from the local per-node configuration directory (under /opt/mapr) to the central location in MapR-FS and make the change there; MapR will then automatically pull that file from MapR-FS to the local directory. Here are the steps (note that we’re using MapR’s Direct NFS feature so we can use regular Linux commands rather than ‘hadoop fs’, which of course would also work):

$ mkdir -p /mapr/my.cluster.com/var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf
$ cd /mapr/my.cluster.com/var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf
$ cp /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml .
$ vi mapred-site.xml

Here are the properties to add at the end of mapred-site.xml (just before the closing </configuration> tag) to enable profiling:

<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.profile.params</name>
  <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
</property>
<property>
  <name>mapred.task.profile.maps</name>
  <value>0-6</value>
</property>
<property>
  <name>mapred.task.profile.reduces</name>
  <value>0-0</value>
</property>
</configuration>
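If you prefer not to use the NFS mount, a rough equivalent with ‘hadoop fs’ is to edit a local copy of the file and push it to the central location (a sketch; the paths assume the default Hadoop 0.20.2 layout used above):

$ cp /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml /tmp/mapred-site.xml
$ vi /tmp/mapred-site.xml   # add the profiling properties shown above
$ hadoop fs -mkdir /var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf
$ hadoop fs -put /tmp/mapred-site.xml /var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf/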

Now that we’ve set up this central config file, the MapR warden process will check every 5 minutes (the interval can be customized) and will invoke /opt/mapr/server/pullcentralconfig to bring the new file over to the local /opt/mapr directory. Note that if the modification time on the local mapred-site.xml file is newer than that of the central file, the local file will not be overwritten.
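Once the warden has pulled the file, one quick sanity check is to diff the local copy against the central copy over the NFS mount (using the paths from this example):

$ diff /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml \
    /mapr/my.cluster.com/var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf/mapred-site.xml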

MapR-FS

MapR-FS is API-compatible with HDFS, but offers significant advantages in ease of use, dependability, and performance at scale. For the purposes of this blog, we’ll focus on one of these capabilities.

Unlike traditional Hadoop distributions, MapR handles small files extremely efficiently. We can therefore keep all output from MapReduce jobs, including log files, map outputs, and TaskTracker and JobTracker messages, in MapR-FS rather than on the local file system.

Another benefit is that all of our profile data will be located in MapR-FS rather than on the local file systems of the nodes, which makes it much easier to gather and analyze the profile output.

Centralized Logging

After making the changes to the central mapred-site.xml file, we run the Hadoop jobs as we normally would. Upon completion, it’s now time to find the profile output and analyze the results.

With MapR, any node has access to all of the profile files (the file name is “profile.out”) because this output was written to the distributed file system. The location in MapR-FS is:

/var/mapr/local/<nodename>/logs/mapred/userlogs/<jobid>/<task-attempt-dir>
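Because these are ordinary files in the distributed file system, you could, for example, list every profile file for a job across all nodes from a single NFS mount point (a sketch; the job id matches the example job used below):

$ ls /mapr/my.cluster.com/var/mapr/local/*/logs/mapred/userlogs/job_201302040944_0003/*/profile.out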

But there’s an even easier way of accessing the profile output files: MapR’s Centralized Logging. You provide a job id, and MapR builds a directory of links to the associated files, regardless of which node they came from. Here is the command to do this:

$ maprcli job linklogs -jobid job_201302040944_0003 -todir /user/mapr/mylogdir

Now assuming you’re running MapR NFS, simply navigate to the new ‘mylogdir’ directory and find the profile files located there (again, they are simply links to the MapR-FS location):

$ cd /mapr/my.cluster.com/user/mapr/mylogdir/job_201302040944_0003
$ ls -l
total 2
drwxr-xr-x 5 mapr mapr  3 Feb  8 13:49 hosts
drwxr-xr-x 2 mapr mapr 59 Feb  8 13:49 mappers
drwxr-xr-x 2 mapr mapr  1 Feb  8 13:49 reducers

So we have the log directories grouped by hosts, mappers, and reducers. In our case, we can go to the mappers directory and see the log directories for all nodes:

$ ls
attempt_201302040944_0003_m_000000_0
attempt_201302040944_0003_m_000001_0
attempt_201302040944_0003_m_000002_0 …

Let’s see what is in one of these directories:

$ ls attempt_201302040944_0003_m_000000_0
log.index  profile.out  stderr  stdout  syslog

So we can see the mapper’s standard out and error files, and other log information, as well as our profile.out file. Now let’s look for all profile outputs:

$ ls */profile*
attempt_201302040944_0003_m_000000_0/profile.out
attempt_201302040944_0003_m_000001_0/profile.out
attempt_201302040944_0003_m_000002_0/profile.out
attempt_201302040944_0003_m_000003_0/profile.out
attempt_201302040944_0003_m_000004_0/profile.out
attempt_201302040944_0003_m_000005_0/profile.out
attempt_201302040944_0003_m_000006_0/profile.out

Here we have all 7 of our mapper profile outputs conveniently located in one hierarchy regardless of what node they executed on (recall we set mapred.task.profile.maps=0-6). The single reducer profile output can be found in the reducers directory:

$ ls attempt_201302040944_0003_r_000000_0
log.index  profile.out  stderr  stdout  syslog
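Since these entries are just links to files in MapR-FS, gathering every mapper profile into one flat directory for offline analysis is a small shell loop (a sketch; the destination directory is arbitrary):

$ mkdir /tmp/profiles
$ cd /mapr/my.cluster.com/user/mapr/mylogdir/job_201302040944_0003/mappers
$ for d in attempt_*; do cp $d/profile.out /tmp/profiles/$d.profile.out; done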

Analyzing the Profile Data

Now that we’ve conveniently located the profile data, it’s a matter of performing the analysis. A full description of HPROF is beyond the scope of this blog, but the key things to look for in the profile.out file are the CPU SAMPLES section, which shows which Java methods were the hottest, and the SITES section, which shows which Java objects are using the most memory. Both of these sections are at the end of the profile.out file:

SITES BEGIN (ordered by live bytes) Tue Feb  5 10:47:39 2013
          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1 29.25% 29.25%   4411600  269   4969200   303 313427 char[]
    2 18.27% 47.52%   2755200  168   5510400   336 313268 char[]
    3 14.64% 62.16%   2207952  269   2487024   303 313424 byte[]
    4  9.14% 71.30%   1378944  168   4210704   513 301504 byte[]
    5  1.03% 72.34%    156024 2622   1383512 20399 303897 char[]

CPU SAMPLES BEGIN (total = 1707) Tue Feb  5 10:47:39 2013
 rank   self  accum   count trace method
    1 10.54% 10.54%     180 300072 java.util.zip.ZipFile.open
    2  3.81% 14.35%      65 303897 java.util.Arrays.copyOfRange
    3  3.34% 17.69%      57 313140 org.apache.hadoop.mapred.IFile$Reader.nextRawValue
    4  2.99% 20.68%      51 303946 java.lang.Object.<init>
    5  2.93% 23.61%      50 304069 java.lang.Object.<init>

Note that the SITES section shows the top allocation site, a char array, accounting for roughly 29% of the live bytes. If you want to see which stack generated that allocation, simply search the file for the stack trace value, which is 313427; that will show which thread and stack did the allocation.
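For example, to see the allocating stack for site 313427, search the profile for its trace entry (a sketch; depth=6 was configured above, so six frames follow the TRACE line):

$ grep -A 6 "TRACE 313427" profile.out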

In the CPU SAMPLES section we see that the method java.util.zip.ZipFile.open accounted for about 10% of the CPU samples in the reduce task.

Summary

Profiling Hadoop MapReduce jobs can be a bit tricky, both in configuring the profiling and in chasing down the profile data that is generated. MapR provides features such as Central Configuration, Centralized Logging, MapR-FS, and NFS that make the job of profiling easier.
