Breaking the Minute Barrier for TeraSort

The following is a re-post from Wired Innovation Insights.

A new TeraSort world record was established using about $9 of Google Compute Engine resources and the MapR Distribution for Hadoop.

TeraSort is a commonly used benchmark for Hadoop storage and MapReduce performance. The TeraSort benchmark measures the time it takes to sort 1 TB of randomly generated data.

The new TeraSort record run completed in 54 seconds using 1,003 virtual nodes backed by 4,012 cores, 1,003 disks and 1,003 network ports. The task was completed with significantly less hardware than before: roughly 2/3 the servers, 1/3 the cores, 1/6 the disks and 2/3 the network ports of the previous record. That previous record took 62 seconds and used 1,460 physical servers supported by 11,680 cores, 5,840 disks and 1,460 network ports.
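The ratios quoted above follow directly from the two hardware configurations. As a quick sanity check (a small sketch, using only the numbers reported in this article), the figures can be compared programmatically:

```python
# Hardware counts from the article: previous record vs. the new MapR/GCE run.
prev = {"servers": 1460, "cores": 11680, "disks": 5840, "network ports": 1460}
new = {"servers": 1003, "cores": 4012, "disks": 1003, "network ports": 1003}

# Each ratio matches the article's rough fractions:
# servers ~2/3, cores ~1/3, disks ~1/6, network ports ~2/3.
for resource in prev:
    ratio = new[resource] / prev[resource]
    print(f"{resource}: {ratio:.2f}")

# Sorting 1 TB (10^12 bytes) in 54 seconds implies an aggregate
# sort throughput of roughly 18.5 GB/s across the cluster.
throughput_gb_per_s = 1e12 / 54 / 1e9
print(f"throughput: {throughput_gb_per_s:.1f} GB/s")
```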

Since the servers used in MapR's world record were virtual instances in the cloud, the estimated cost of running the TeraSort was about $9, compared to an estimated cost of over $5M for the hardware behind the previous record.

This record brings a couple of interesting elements to the forefront. First is the power and stability of Google Compute Engine. MapR had no control over the hardware during this process, relying entirely on Google Compute Engine's user interface to set up and instantiate the nodes. The speed and consistent performance that TeraSort demands were delivered throughout, without fail.

The second interesting element is the performance of MapR's Hadoop distribution. The TeraSort world record points to several performance advantages, including changes to the Hadoop JobTracker, the reducer-scheduling algorithm, network connection management and the mapper-sorting algorithm, all of which are available in the MapR Distribution for Hadoop.
