TeraSort Benchmark Comparison for YARN


TeraSort is a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure MapReduce performance of an Apache™ Hadoop® cluster. The following report compares performance of a YARN-scheduled TeraSort job on MapR and other distributions.

Test Results

The MapR Distribution including Apache Hadoop continues to be the fastest Hadoop distribution on the market. As seen in the figure, MapR sorted 1 TB of data on a 21-node cluster in 494 seconds, while the other distribution (Cloudera CDH was chosen for comparison purposes) took 822 seconds under the same conditions. Please refer to the Appendix for test environment details.
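The two timings can be turned into a speedup factor and an aggregate sort throughput with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative, and 1 TB is assumed to mean 10^12 bytes, which the report does not specify.

```python
# Timings quoted in the report for sorting 1 TB on the 21-node cluster.
MAPR_SECONDS = 494
OTHER_SECONDS = 822
DATA_BYTES = 10**12  # assumption: 1 TB = 10^12 bytes (not stated in the report)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate run is than the baseline run."""
    return baseline_s / candidate_s


def time_reduction_pct(baseline_s, candidate_s):
    """Percentage of wall-clock time saved relative to the baseline run."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


def throughput_gb_per_s(data_bytes, seconds):
    """Aggregate sort throughput in GB/s (1 GB = 10^9 bytes)."""
    return data_bytes / 1e9 / seconds


if __name__ == "__main__":
    print(f"speedup:        {speedup(OTHER_SECONDS, MAPR_SECONDS):.2f}x")
    print(f"time reduction: {time_reduction_pct(OTHER_SECONDS, MAPR_SECONDS):.1f}%")
    print(f"MapR:           {throughput_gb_per_s(DATA_BYTES, MAPR_SECONDS):.2f} GB/s")
    print(f"other:          {throughput_gb_per_s(DATA_BYTES, OTHER_SECONDS):.2f} GB/s")
```

Under these assumptions the MapR run is roughly 1.66 times faster, which corresponds to about 40% less wall-clock time.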

MapR shows a significant performance advantage over other distributions for two primary reasons:

MapR Data Platform Advantage

MapR has set world records for MapReduce performance thanks to a number of differentiating performance features, including:

  • Distributed metadata to eliminate bottlenecks

  • C++ implementation in key components

  • Fast, direct disk I/O (vs. layered I/O on top of the Linux file system)

  • Optimized MapReduce shuffle algorithm

All of these features continue to provide performance benefits and a lower infrastructure footprint when applied to MapReduce v2 jobs scheduled using YARN.

Taking Disk I/O into Account for YARN Scheduling

To calculate the system resources required for a job, the YARN scheduler today takes the memory and CPU characteristics of the nodes into account. For instance, for a MapReduce job, the optimum number of map and reduce slots is calculated based on CPU and memory availability across the nodes.

MapR allows the YARN scheduler to also take disk I/O characteristics into account when calculating system resources. This ensures that disk bottlenecks are correctly identified during resource allocation, which makes YARN jobs perform much better.
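The effect of adding disk as a third scheduling dimension can be illustrated with a small model. This is only a sketch of the general idea, not MapR's or YARN's actual algorithm: the `NodeResources` and `ContainerRequest` types, and in particular the `disk_share` value (the fraction of one disk a task is assumed to need), are hypothetical. The node figures are taken from the test environment in this report.

```python
from dataclasses import dataclass


@dataclass
class NodeResources:
    memory_mb: int
    vcores: int
    disks: int


@dataclass
class ContainerRequest:
    memory_mb: int
    vcores: int
    disk_share: float  # hypothetical: fraction of one disk a task needs


def containers_per_node(node, req, use_disk=True):
    """Return (container count, limiting resource) for one node.

    Each resource allows some number of concurrent containers; the
    smallest of those numbers is the bottleneck.
    """
    limits = {
        "memory": node.memory_mb // req.memory_mb,
        "cpu": node.vcores // req.vcores,
    }
    if use_disk:
        # Illustrative disk model only; not a real scheduler formula.
        limits["disk"] = int(node.disks / req.disk_share)
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


if __name__ == "__main__":
    # 128 GB RAM, 2x16 cores, 11 disks, as in the test environment.
    node = NodeResources(memory_mb=128 * 1024, vcores=32, disks=11)
    # 1024 MB per map task as in the test parameters; disk_share is made up.
    map_task = ContainerRequest(memory_mb=1024, vcores=1, disk_share=0.5)
    print(containers_per_node(node, map_task, use_disk=False))
    print(containers_per_node(node, map_task, use_disk=True))
```

With memory and CPU alone, the model allows 32 map containers per node, limited by CPU. With the assumed disk share it allows 22, limited by disk. This is the kind of bottleneck a scheduler that looks only at CPU and memory would miss.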


MapR provides the best Hadoop performance for a variety of workloads, proven by MapReduce v1, MapReduce v2 (YARN), and YCSB benchmarks. Along with high reliability and the random read-write NFS capability, the MapR performance advantage continues to be one of many key benefits for end users. MapR clusters have proven to be the most cost-efficient Hadoop deployments by requiring a much smaller hardware footprint compared to other distributions.

MapR World-Record Setting Benchmark

MapR holds the TeraSort world record, sorting 1 TB in 54 seconds on 1003 virtual nodes on the Google Cloud platform. Details of the MapR world-record setting benchmark can be found in the MapR blogs.

Test Environment Details

Number of Nodes: 20, plus 1 node for NameNode/CLDB and YARN ResourceManager

RAM: 128 GB

Disks: 11 disks, 110 GB

CPU: 2x16 cores

Network: 10 GbE

CDH Version: CDH 5.1.0 YARN

MapR Version: MapR 4.0.1 YARN

Test parameters*                                      Value
mapreduce.reduce.memory.mb                            3072
mapreduce.map.memory.mb                               1024
mapred.maxthreads.generate.mapoutput                  2
mapreduce.tasktracker.reserved.physicalmemory.mb.low  0.95
mapred.maxthreads.partition.closer                    2
mapreduce.map.sort.spill.percent                      0.99
mapreduce.reduce.merge.inmem.threshold                0
mapreduce.job.reduce.slowstart.completedmaps          1
mapreduce.reduce.shuffle.parallelcopies               40
mapreduce.map.speculative                             false
mapreduce.reduce.speculative                          false
mapreduce.map.output.compress                         false
mapreduce.job.reduces                                 160
mapreduce.task.io.sort.mb                             480
mapreduce.task.io.sort.factor                         400
mfs.heapsize                                          35
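One way to apply job-level parameters like these is to pass them to the stock TeraSort example as `-D key=value` generic options. The sketch below only builds such a command line and does not run it. The jar name and the input and output paths are placeholders, not values from the report. Some of the keys appear to be distribution-specific, so whether each one is honored as a per-job option is an assumption. `mfs.heapsize` looks like a file-system service setting rather than a job property and is left out for that reason, which is also an assumption.

```python
import shlex

# Job parameters from the table above. mfs.heapsize is omitted on the
# assumption that it is a file-system service setting, not a job property.
JOB_PARAMS = {
    "mapreduce.reduce.memory.mb": "3072",
    "mapreduce.map.memory.mb": "1024",
    "mapred.maxthreads.generate.mapoutput": "2",
    "mapreduce.tasktracker.reserved.physicalmemory.mb.low": "0.95",
    "mapred.maxthreads.partition.closer": "2",
    "mapreduce.map.sort.spill.percent": "0.99",
    "mapreduce.reduce.merge.inmem.threshold": "0",
    "mapreduce.job.reduce.slowstart.completedmaps": "1",
    "mapreduce.reduce.shuffle.parallelcopies": "40",
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
    "mapreduce.map.output.compress": "false",
    "mapreduce.job.reduces": "160",
    "mapreduce.task.io.sort.mb": "480",
    "mapreduce.task.io.sort.factor": "400",
}


def terasort_command(examples_jar, input_dir, output_dir, params=JOB_PARAMS):
    """Build the argument list for a TeraSort run with -D overrides."""
    cmd = ["hadoop", "jar", examples_jar, "terasort"]
    for key, value in params.items():
        cmd += ["-D", f"{key}={value}"]
    cmd += [input_dir, output_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder jar name and paths; adjust for the actual cluster.
    cmd = terasort_command(
        "hadoop-mapreduce-examples.jar", "/terasort/in", "/terasort/out"
    )
    print(shlex.join(cmd))
```

Returning the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting issues.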
