Separating Hadoop Myths from Reality

As I discussed in my keynote presentation at Strata + Hadoop World in New York, there are a lot of myths and misconceptions when it comes to Hadoop. Let’s take a closer look at the architecture and customer use cases that highlight the power of Hadoop and separate the myths from reality.

The Hadoop Space is Incredibly Competitive—a Knock-Down, Drag-Out Fight across the Major Distributions.

Myth or Reality?

There are many commercial Hadoop distributions in the marketplace today, but the reality is that we all share the same open source Apache code. Hadoop is one of the first markets that’s actually been created by open source technology, and given its early stage, it’s appropriate and, in many cases, required to combine open source code with innovations to meet customer requirements. The result? An incredibly strong ecosystem: Hadoop is by far the fastest growing Big Data technology, and is one of the top 10 fastest growing technologies overall in terms of job growth.

NoSQL Solutions are Equal and Perform the Same on all Hadoop Distributions.

Myth or Reality?

Hadoop is at the center of Big Data, in stark contrast to the NoSQL market, where there is no consensus, no common API, and no ability to seamlessly move workloads across solutions. However, the one NoSQL solution that has an inherent advantage is Apache HBase, which is integrated with Hadoop and included with every commercial distribution. You might think that if HBase is included in every distribution, and every distribution shares the same open source code, then HBase must run the same across all distributions. This is not the case, because the reality is that architecture matters. Take a look at the architecture that supports Apache HBase applications:

architecture matters

On the left, you can see HBase running on Java and writing its data into the Hadoop Distributed File System (HDFS), which runs on another Java instance, which writes to the Linux file system, which in turn writes to disk. To make matters worse, database operations are trying to write to HDFS, a write-once storage layer. On the right-hand side is an example of the MapR M7 architecture, which eliminates the Java dependency, collapses those intermediate data layers, and removes the complexity. As a result, HBase applications experience drastically improved performance.

As the graph below illustrates, the performance results of the MapR M7 architecture are dramatic. In orange, you can see the performance of HBase applications on other distributions, which shows tremendous latency spikes. Imagine trying to program a real-time online application with HBase given those results. That’s in sharp contrast to the blue line, which shows consistently low latency for HBase applications on a 24x7 basis with the MapR M7 distribution.

latency graph
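
To see why spikes like these hurt a real-time application more than an average would suggest, here is a minimal Python sketch. The latency values are synthetic and invented purely for illustration; they are not measurements of HBase, MapR M7, or any other distribution, and the helper names `percentile` and `summarize` are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Ceiling of pct% of the sample count, using integer arithmetic only.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return the mean, median, 99th percentile, and maximum of the samples."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


# Synthetic latencies in milliseconds (made-up numbers, not benchmark data).
steady = [5] * 1000               # every request takes 5 ms
spiky = [5] * 980 + [500] * 20    # 2% of requests hit a 500 ms spike

steady_stats = summarize(steady)
spiky_stats = summarize(spiky)

print("steady:", steady_stats)
print("spiky: ", spiky_stats)
```

In this made-up data the median of the spiky series is identical to the steady one, yet its 99th percentile is a hundred times higher. That tail, rather than the average, is what users of an online application would notice.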

Hadoop is Ready for Prime Time.

Myth or Reality?

The reality is that a significant number of companies are already enjoying production success with Hadoop. Here are just a few examples:
  • 1 trillion log lines are being analyzed and processed by Solutionary as part of their security service.
  • 90 billion ad auctions are processed per day by the Rubicon Project.
  • 1.7 trillion events are processed per month by comScore, the leading Internet analytics company.
You may be thinking, “But these are Web 2.0 companies. Traditional enterprises are still experimenting; maybe they’re doing some lightweight ETL, but nobody’s really seriously using Hadoop in production.”

The reality is that a significant number of traditional enterprises are also achieving powerful results with Hadoop. Here are just a few:

  • One Fortune 100 retailer has more than 2,000 nodes running on Hadoop, and it’s a key part of its retail and merchandising operations, including the ability to leverage social media to better understand and meet the needs of shoppers.
  • A financial services company has over 1,000 nodes running on Hadoop, and it’s being used to mitigate risk, personalize offers, and streamline operations.

In addition to the many Hadoop use cases and production successes in healthcare, manufacturing, telco, and government agencies, here are a few of the more unusual production use cases for Hadoop:

  • Garbage. A leading waste and recycling management company is using Hadoop to combine location information with delivery data to optimize fleet efficiency.
  • Whiskey. The next time you’re in Japan, and you encounter a beverage kiosk that uses facial recognition to customize the interface, thank Hadoop!
  • Weather. The Climate Corporation is using Hadoop to help farmers protect and improve their farming operations worldwide.

These examples are just a subset of the more than 500 paying MapR customers that are using Hadoop today. Many have switched seamlessly from another Hadoop distribution and now enjoy production success with MapR. Is Hadoop ready for prime time? Absolutely.

