Understanding memory allocation can help you better manage and monitor the health of your nodes and get the most out of your Hadoop cluster. Here are some insights into how memory allocation works, especially in the MapReduce v1 world.
Memory allocation at startup
MapR takes care of allocating memory among all the installed services on a node based on a percentage of the available physical memory. When the packages are installed and the configure.sh script is run, MapR generates the /opt/mapr/conf/warden.conf file, which details what percentage of physical memory should be allocated to each service. For example, if you install FileSystem and MapReduce on a node, the memory allocation is split between those two services. If the node also runs HBase™ services, configure.sh makes sure that the available memory is divided among the FileSystem, MapReduce, and HBase services.
To look at it in more detail, let us extend the first scenario above. If FileSystem (the mfs process) and MapReduce (the tasktracker process) are installed on a node, MapR allocates 20% of physical memory to FileSystem, reserves about 5-8% for the OS and other applications, and gives the rest to the MapReduce services (the tasktracker and its tasks in this case). On average, about 75% of physical memory is assigned to MapReduce in this kind of setup. Note that MapR pre-allocates the 20% for the mfs process, meaning mfs grabs that memory immediately. Hence, when you run the top command on a node, you will see mfs already using 20% of available memory; this is common and expected. MapReduce services, on the other hand, start off low and eventually grow to 75% of physical memory, because memory is not pre-allocated when you configure and start the TaskTracker service.
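As a rough sketch of the split described above (the exact percentages vary by configuration; the 6.5% OS reservation here is an assumed midpoint of the 5-8% range):

```python
# Hypothetical sketch of the default startup memory split on a node
# running only FileSystem (mfs) and MapReduce. Percentages are
# illustrative, not authoritative.
def startup_split(physical_mb):
    mfs_mb = int(physical_mb * 0.20)    # pre-allocated; top shows it immediately
    os_mb = int(physical_mb * 0.065)    # ~5-8% kept for the OS and other apps
    mapreduce_mb = physical_mb - mfs_mb - os_mb  # TaskTracker + tasks grow into this
    return mfs_mb, os_mb, mapreduce_mb

mfs, os_resv, mr = startup_split(48 * 1024)  # a 48G node
print(mfs, os_resv, mr)  # MapReduce ends up with roughly 75% of the node
```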
Memory allocation for MapReduce
Given how memory is allocated on startup, let's see how memory management works for the MapReduce layer. On each node, MapR updates mapred-site.xml with the number of map and reduce slots, calculated from the number of CPUs on the node. (You can change that to an absolute value, or build a formula that includes other parameters such as memory, to support heterogeneous nodes.) This defines the compute capacity of the node. Once the software is installed and configured (by running configure.sh), running tasks are assigned a default maximum amount of memory, calculated from the number of map and reduce slots.
Let's consider an example. On a box with 48G of physical memory, assume we configure 9 map slots and 6 reduce slots. Per the previous discussion, about 75% of memory is allocated to all of the MapReduce services: TaskTracker, map tasks, and reduce tasks. TaskTracker usually consumes about 130MB, so we can ignore it for calculation purposes. 75% of physical memory in this case is approximately 36G, which has to be divided among the 9 map and 6 reduce slots. Based on a formula (see below), the 36G works out to 2.1G for each map task and 2.8G for each reduce task. When tasks are spawned on a node, MapR sets Xmx to the right value for the map and reduce slots, so a mapper on that node will never go over 2.1G and a reducer will never go over 2.8G. Note that, as of now, MapR does not allocate unused memory to running tasks: even if a node has 10G of free memory and only one map task running, that task will be killed with an OOM exception if it needs more than 2.1G.
Formula for calculating Xmx for MapReduce tasks
Default memory for a map task = (Total memory reserved for MapReduce) * (1 / (#mapslots + 1.3 * #reduceslots))
Default memory for a reduce task = (Total memory reserved for MapReduce) * (1.3 / (#mapslots + 1.3 * #reduceslots))
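In code, the formula and the 48G example above work out as follows (a minimal sketch; function and variable names are ours, not MapR's):

```python
def default_task_memory(mapreduce_mb, map_slots, reduce_slots):
    """Default per-task Xmx, per the slot-weighted formula above."""
    denom = map_slots + 1.3 * reduce_slots
    map_mb = mapreduce_mb * (1.0 / denom)     # each map slot's share
    reduce_mb = mapreduce_mb * (1.3 / denom)  # reduce slots weighted 1.3x
    return map_mb, reduce_mb

# 48G box, ~75% (about 36G) reserved for MapReduce, 9 map / 6 reduce slots
map_mb, reduce_mb = default_task_memory(36 * 1024, 9, 6)
print(round(map_mb), round(reduce_mb))  # roughly 2.1G and 2.8G per task
```

Note that 9 map shares plus 6 reduce shares sum back to the full 36G, which is why the defaults cannot oversubscribe the node.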
Given this, on a normally running system where all the mappers and reducers on a node are utilizing close to 100% of their allocated memory, about 75% of physical memory will be in use on that node. If any task tries to go over its limit, it will be killed with an OOM exception.
With the above defaults, things should work without problems no matter what the map and reduce slot counts are, because only 75% of memory is divided among the available slots. Problems may surface when you override the default calculated Xmx value for map and reduce slots. Assume that, based on the slot counts, MapR calculates 1G for map tasks. If you override that to 4G, the node's memory will be oversubscribed and the system will start swapping, causing stability problems for services across the cluster.
Therefore, it is important to understand the workload first, and then manage memory on your cluster so that it stays healthy at all times.
Config variables available to fine-tune this behavior
Given how memory allocation and management work, there are a few config variables that can be used to fine-tune this.
All these variables do is override the default calculations. Some of them are shown here:
mapreduce.tasktracker.reserved.physicalmemory.mb. By default, 75% of memory is allocated to the MapReduce layer. To change that, specify an absolute value in megabytes.
mapred.tasktracker.map.tasks.maximum. By default, MapR calculates the number of map slots from the number of CPUs. Increasing this number gives each map task less memory; decreasing it gives each map task more memory.
mapred.tasktracker.reduce.tasks.maximum. This is similar to the above, but for reduce tasks.
mapred.job.map.memory.physical.mb. This defines how much memory (Xmx) is set for map tasks. The default is calculated by the formula above; to change the allocation, specify an absolute value in megabytes.
mapred.job.reduce.memory.physical.mb. This is similar to the above, but for reduce tasks.
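For example, overriding two of these in mapred-site.xml might look like the fragment below. The values are purely illustrative, not recommendations; set them only after measuring your workload:

```xml
<!-- Illustrative overrides only; tune against measured workload stats -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>9</value>
</property>
<property>
  <name>mapred.job.map.memory.physical.mb</name>
  <value>2048</value>
</property>
```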
Some Best Practices
Given this information, you can improve cluster utilization further with the steps below.
- Gather job stats. Gather the nature of the jobs and how much memory map and reduce tasks consume on the cluster. This can be done with a combination of Ganglia and custom monitoring scripts.
- Identify requirements. With the stats in hand, analyze memory usage patterns and identify the requirements of the jobs running on the cluster.
- Tune the cluster. Using the stats and requirements, tune the cluster with the config variables described above.
- Validate changes. Re-gather the stats and measure cluster utilization and throughput to validate that the changes improved them.