Top Misconceptions about Big Data and Hadoop

February 7, 2012
The Hadoop market is a fast-growing, expanding, and exciting ecosystem, but that growth can also be accompanied by confusion. I thought I’d take a stab at addressing some of the big misconceptions about Big Data.

First of all, the term Big Data is approaching Cloud in its utter lack of descriptiveness. That said, Big Data is not simply about massive amounts of data, petabytes and beyond. Big Data represents a paradigm shift. It’s about new, unstructured data sources. It’s about avoiding schema definitions and transformations: there’s no need to structure data before you can derive benefits from it. It’s about bringing compute to the data to perform better and faster analysis. Through Hadoop, organizations can benefit even with relatively small amounts of data.
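
To make the compute-to-data point concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API of this era; the class names are illustrative. Each map task runs on a node holding a block of the input file and consumes raw text, with no schema defined up front.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: runs on the node where each input block lives, so the
    // computation moves to the data rather than the other way around.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Raw, schema-free text in; (word, 1) pairs out.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the counts emitted by the mappers for each word.
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(word, new IntWritable(sum));
      }
    }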

Since Hadoop is a funny name and somewhat new to people, they assume it must be risky. In fact, huge amounts of investment and work have addressed these concerns, and Hadoop has emerged as a standard. The rich ecosystem around Hadoop provides flexibility, choice, and trained professionals. There are production-grade distributions available, such as MapR, that provide full data protection, automatic stateful failover, and business continuity. The deployed footprint, complementary products, and available technical resources all contribute to Hadoop adoption, and with that, the number and breadth of deployed Hadoop applications have expanded rapidly.

Another misconception about Hadoop is that it is batch-only. This is an artifact of the HDFS implementation, not a limitation of Hadoop per se: HDFS files are write-once, so data must be loaded in bulk before jobs can read it. MapR, for example, provides full support for streaming analytics and real-time processing.
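
As a rough illustration of the real-time side: MapR exposes the cluster’s file system over NFS, so ordinary file I/O can read data as producers write it rather than waiting for a batch load. The sketch below tails a growing log file using nothing but standard Java file I/O; the /mapr mount path is hypothetical.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: tail a log file that producers are appending to on an
    // NFS-mounted MapR cluster. The mount point below is hypothetical.
    public class ClusterFileTailer {
      public static void main(String[] args)
          throws IOException, InterruptedException {
        String path = "/mapr/my.cluster.com/logs/events.log"; // hypothetical
        RandomAccessFile file = new RandomAccessFile(path, "r");
        long position = file.length(); // start at the current end of file

        while (true) { // runs until interrupted; a sketch, not a daemon
          long length = file.length();
          if (length > position) {
            file.seek(position);
            String line;
            while ((line = file.readLine()) != null) {
              System.out.println(line); // analyze records as they arrive
            }
            position = file.getFilePointer();
          }
          Thread.sleep(500); // poll for newly appended data
        }
      }
    }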

Perhaps the biggest misconception is that Hadoop is a single, monolithic component. Hadoop is a framework: a complete stack for distributing applications and data. It supports multiple programming paradigms and includes packages such as Pig and Hive, packages for data ingress/egress, ETL, and data integration (Sqoop and Flume, for example), and specific components for machine learning (such as Mahout). Most distributions integrate, test, and harden these packages, along with some proprietary extensions.
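
As a small illustration of the framework at work, here is a minimal driver sketch, again assuming the Hadoop 1.x Java API, that wires the mapper and reducer from the earlier example into a job. Hadoop itself handles shipping the jar to the cluster, scheduling tasks near the data, and retrying failed tasks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: describes the job; the framework distributes the code
    // to the cluster and runs tasks near the data they read.
    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Hadoop 1.x constructor

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

In Pig or Hive, the same computation collapses to a few lines of Pig Latin or SQL-like HiveQL, running on the same underlying stack.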

With respect to open source, the question about a distribution is not a simple binary of “open” or “closed”. The question is which components are open and what areas the proprietary value-added components address. In the case of Cloudera, the proprietary extensions are in the management tools. MapR has chosen to innovate in the areas that provide the most benefit to customers while also being the most difficult for the community to address effectively. These also happen to be the areas customers have the least desire to modify, such as the underlying storage services. MapR’s distribution includes these value-added improvements along with all of the open source programming, data access, and machine learning packages.

These are some of the top misconceptions. Let me know what other areas you’d like us to address.