Third-Party Comparative Study of Hadoop Distributions

At Flux7 Labs, a solutions company, we help customers maximize performance per dollar. To help our customers make the right decisions, we constantly research and evaluate the solutions available to them, building and strengthening our internal knowledge along the way. As part of this research, we evaluated the most common Hadoop distributions on various metrics. The distributions we tested were from Intel, Cloudera, Hortonworks, and MapR. We conducted this testing independently across all the distributions.

Our evaluation was based on both subjective measures, like ease of use, and objective measures, like the performance of each distribution, so that users can make a more informed decision. Performance holds a special place at Flux7 Labs. Both our founders have a strong background in performance analysis in the processor design world, and we bring the same data-driven philosophy to the cloud infrastructure space. Without further ado, let’s take a look at the results. Note that the graph shows the time taken for each run, so lower values are better:

The data is pretty clear. All the platforms showed more or less the same results, except for MapR. To be honest, MapR wasn’t our favorite going in. We were inclined to believe that Intel would outdo the rest. Intel has the microarchitectural expertise, and it claims the autotuning features of its distribution as the major selling point. In addition, the HiBench benchmark suite was developed by Intel. The results were indeed astounding, with MapR being the only distribution that stood apart.

The reason for MapR’s exceptional performance is its ground-up architecture, which focuses on speeding up DFS accesses and improving scalability. The graph below shows the results of running DFSIO. It clearly indicates that the read throughput in MapR is almost 2x, and the write throughput around 8x, that of the other distributions. Interestingly, these results were seen on a fairly small cluster of just 5 nodes, so they do not even make use of the advantage of MapR’s fully distributed and highly scalable architecture. In larger installations we expect the speedup from MapR to be significantly higher.

A common objection leveled against performance tests is, “As long as I get my results, performance doesn’t matter.” But the truth is that performance, especially through a more efficient solution, always matters. At the first level, in a scalable distributed system, performance efficiency is equivalent to cost efficiency: to get equivalent performance you can use fewer resources. The cloud flips this equation even further with the concept of on-demand resources, which lets you have your cake and eat it too. You bring up more resources, get your job done faster, and release the resources once you are done with them, giving you the performance of a larger cluster at the cost of a smaller one. In such a scenario especially, the scalability of the architecture is extremely important.
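The on-demand argument above comes down to simple arithmetic, which a minimal Python sketch can make concrete. Everything in it is an illustrative assumption rather than data from our study: the function name `job_cost`, the 40 node-hour job, the $0.50 per node-hour price, and the simple `scaling_efficiency` model (1.0 means perfectly linear scaling).

```python
def job_cost(nodes, hours_on_one_node, price_per_node_hour, scaling_efficiency=1.0):
    """Return (wall_clock_hours, total_cost) for a job run on `nodes` nodes.

    Hypothetical model: each node beyond the first contributes
    `scaling_efficiency` of a full node's worth of speedup.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    speedup = 1 + (nodes - 1) * scaling_efficiency
    wall_clock_hours = hours_on_one_node / speedup
    # On-demand billing: pay only for the node-hours actually used.
    total_cost = nodes * wall_clock_hours * price_per_node_hour
    return wall_clock_hours, total_cost


# Assumed numbers for illustration only.
small = job_cost(nodes=5, hours_on_one_node=40.0, price_per_node_hour=0.5)
large = job_cost(nodes=20, hours_on_one_node=40.0, price_per_node_hour=0.5)
lossy = job_cost(nodes=20, hours_on_one_node=40.0, price_per_node_hour=0.5,
                 scaling_efficiency=0.8)

print("5 nodes, linear scaling:  %.2f h, $%.2f" % small)
print("20 nodes, linear scaling: %.2f h, $%.2f" % large)
print("20 nodes, 80%% efficiency: %.2f h, $%.2f" % lossy)
```

Under perfectly linear scaling the 20-node run finishes four times sooner than the 5-node run for the same total cost. Once scaling efficiency drops, the larger cluster costs more for the same job, which is why the scalability of the underlying architecture matters so much in the cloud.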

At the second level, higher performance reduces human time, which is the most expensive resource any organization has. Higher speed means earlier results, rapid development, more experiments, more iterations, and higher-quality results. What is really powerful, though, is not a 2x performance difference but an order-of-magnitude difference. Such a difference can allow a complete paradigm shift. Think about the increase in developer productivity in switching from assembly programs to Python. I only wonder what awaits in the world of data analytics with such an improvement.

In conclusion, we at Flux7 Labs were very impressed by the performance results observed on the MapR platform. They really show the effort put forth by the MapR team in making the system as efficient and scalable as possible, and we’re happy to endorse the distribution for its performance capabilities. Feel free to read the complete whitepaper and see our detailed comparison, along with results and methodology, here.

