Hadoop Benchmarks – Looking Beyond the Splash

Recently a world record was claimed for a Hadoop benchmark. MapR has run numerous benchmarks in which MapR performs 2 to 5 times faster than other distributions, and has published these results, so a world record was quite a claim. We were surprised to see that this world record was for a TeraSort benchmark on 100GB of data.

TeraSort is a standard benchmark, and the name is derived from “sorting a terabyte”. Any record claim for sorting a 100GB dataset across a 20-node cluster with 10 times as much memory as data is comical. The test is named TeraSort, not GigaSort. The world-record claim is like someone splashing across a kiddie pool and announcing a swimming record. It doesn’t tell you anything about how fast they swim, and it’s quite possible that this “world record holder” may drown before reaching the deep end of a real pool.

Hadoop is about Big Data and maintaining performance while scaling. TeraSort is a benchmark for Big Data, not Big Memory. Modern machines have a 1000 to 1 ratio of disk to memory, and any benchmark for Big Data should reflect that fact by ensuring that the data processed is indeed Big.

It’s more than a little disingenuous to consider the world record claim representative of Big Data. A representative benchmark should involve data that’s at least 10X the memory (and not 1/10th, as in the kiddie pool test). In particular, the test doesn’t show how well their code handles disk operations, which are the slowest part of a TeraSort test. Similarly, any speed claims from an HBase™ test where everything fits into memory aren’t valid.
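The 10X rule of thumb can be written down as a quick sanity check. This is only a sketch: the function name is made up for illustration, and the 1000GB cluster memory figure is inferred from the "10 times as much memory" description of the record run rather than from any published configuration.

```python
def is_representative(data_gb: float, cluster_memory_gb: float,
                      min_ratio: float = 10.0) -> bool:
    """Return True when the dataset is at least min_ratio times cluster memory.

    The default of 10.0 is the rule of thumb from the text, not a standard.
    """
    return data_gb >= min_ratio * cluster_memory_gb


# Record run: 100GB of data, assumed ~1000GB of total cluster memory.
kiddie_pool = is_representative(data_gb=100, cluster_memory_gb=1000)

# Hypothetical run on the same memory with 10TB of data.
deep_end = is_representative(data_gb=10_000, cluster_memory_gb=1000)

print(kiddie_pool, deep_end)
```

By this check the record run misses the bar by a factor of 100: the data would have to be 100 times larger on the same cluster before disk behavior dominates the way the rule of thumb expects.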

Just as it is silly to time a run across a kiddie pool and extrapolate that to a marathon swimming distance, it is invalid to take results from the 100GB memory-only test and extrapolate that rate to a 1TB TeraSort. But even if we did perform a straight-line extrapolation from the 130-second 100GB sort, it would take roughly 22 minutes to perform a 1TB TeraSort. MapR has published a 1TB TeraSort result that took 22 minutes on a 10-node cluster with half the CPUs and one quarter of the memory. Given the differences in cluster capabilities, MapR is at least 2.5X faster, and that’s in comparison with a highly questionable extrapolation for the other distribution. The conclusion here is that this world record is not a splash, but is certainly all wet.
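For readers who want to check the 22-minute figure, here is the straight-line arithmetic as a short sketch. The function name is illustrative, and it assumes 1TB = 1000GB; it also inherits the questionable assumption that sort time scales linearly with data size.

```python
def extrapolate_seconds(measured_seconds: float, measured_gb: float,
                        target_gb: float) -> float:
    """Scale a measured run time linearly with dataset size."""
    return measured_seconds * (target_gb / measured_gb)


# 130 seconds for 100GB, scaled to 1TB (taken here as 1000GB).
estimate_seconds = extrapolate_seconds(130, 100, 1000)
estimate_minutes = estimate_seconds / 60

print(f"{estimate_seconds:.0f} seconds, about {estimate_minutes:.0f} minutes")
```

A linear estimate is generous to the in-memory result, since a 1TB sort on that cluster would no longer fit comfortably in memory and would have to pay for disk I/O.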
