Editor's Note: In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, gets you up to speed on Apache Hadoop and NoSQL. He talks about the similarities and differences between the two, but most importantly how both technologies should be a requirement for any true big data environment. See his Whiteboard Walkthrough and accompanying blog post below.
If the term “big data” has been bandied about your organization as something that should be further explored, then you’ve likely also heard about Hadoop and NoSQL. Both technologies are closely associated with big data, so there’s overlap in terms of what they are designed to do. For example, they’re both great for managing large and rapidly growing data sets, and they’re both great for handling a variety of data formats, even if those formats change over time.
To elaborate on the issue of data volume, both can leverage commodity hardware working together as a cluster. To handle larger data sets, you simply add more hardware to the cluster in a model known as horizontal scaling, also referred to as scaling out. Contrast this with scaling up, in which you upgrade your existing servers with more powerful hardware.
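To make scaling out concrete, here is a minimal Python sketch of how records might be spread across the servers in a cluster. It is an illustration only, not how Hadoop or any particular NoSQL database places data: the function names `node_for_key` and `records_per_node`, the `user-N` keys, and the simple hash-modulo placement are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def node_for_key(key: str, node_count: int) -> int:
    """Pick a node for a record by hashing its key (hypothetical placement rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def records_per_node(keys, node_count):
    """Count how many records land on each node of a cluster of the given size."""
    counts = Counter(node_for_key(key, node_count) for key in keys)
    return [counts.get(node, 0) for node in range(node_count)]


# Ten thousand made-up record keys.
keys = [f"user-{i}" for i in range(10_000)]

# Scaling out: the same data set on a 4-node and then an 8-node cluster.
small_cluster = records_per_node(keys, 4)
large_cluster = records_per_node(keys, 8)

print("4 nodes:", small_cluster)
print("8 nodes:", large_cluster)
```

Doubling the number of servers roughly halves the data each one has to hold, which is the whole point of adding hardware rather than upgrading it. Real systems typically use more careful placement schemes so that adding a node does not force most records to move.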
With regard to data formats, both technologies are suitable for the different types you want to manage, including log files, documents, and rich media. Just as importantly, if you have structured data in which the structure differs between records, or if the structure likely will change in the future, then NoSQL and Hadoop are appropriate technologies for your environment.
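As a rough illustration of structure that differs between records, the Python sketch below keeps three customer records with different fields side by side and still queries them. The records, the field names, and the helper functions `emails` and `field_names` are invented for this example and do not reflect any specific product's data model.

```python
# Three hypothetical customer records; no two share exactly the same fields.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "phones": ["555-0100"]},
    {"id": 3, "name": "Linus", "email": "linus@example.com", "loyalty_tier": "gold"},
]


def emails(recs):
    """Collect email addresses, skipping records that have no such field."""
    return [rec["email"] for rec in recs if "email" in rec]


def field_names(recs):
    """List every field that appears in at least one record."""
    return sorted({field for rec in recs for field in rec})


print(emails(records))
print(field_names(records))
```

Adding the `loyalty_tier` field to one record required no change to the others, which is the kind of flexibility a fixed table schema does not offer without a migration.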
So with these overlapping capabilities, it might seem that NoSQL and Hadoop are direct competitors, right? Nope. While both technologies are great for big data, they are intended for different types of workloads. NoSQL, on one hand, is about real-time, interactive access to data. NoSQL use cases often entail end user interactivity, as in web applications, but more broadly they are about reading and writing data very quickly.
Hadoop, on the other hand, is about large-scale processing of data. To process large volumes of data, you want to do the work in parallel, and typically across many servers. Hadoop manages the distribution of work across many servers in a divide-and-conquer methodology known as MapReduce. Since each server houses a subset of your overall data set, MapReduce lets you move the processing close to the data, which minimizes the network transfers that would otherwise slow down the task.
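The divide-and-conquer idea can be sketched in a few lines of Python. This is a single-process toy that only mimics the map, shuffle, and reduce steps; it does not use Hadoop's APIs, and the log lines, the `partitions` list, and the function names are assumptions made for illustration. Each inner list stands in for the subset of data housed on one server.

```python
from collections import defaultdict

# Each inner list plays the role of the data stored on one server.
partitions = [
    ["error disk full", "info job started"],
    ["error network timeout", "info job finished"],
    ["warn retry", "error disk full"],
]


def map_phase(lines):
    """Runs where the data lives: emit a (log level, 1) pair for each line."""
    return [(line.split()[0], 1) for line in lines]


def shuffle(mapped_partitions):
    """Group the emitted values by key across all partitions."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


mapped = [map_phase(partition) for partition in partitions]
level_counts = reduce_phase(shuffle(mapped))
print(level_counts)
```

Only the small (key, count) pairs cross partition boundaries in the shuffle step; the raw log lines never leave the partition that holds them, which is the data-locality benefit described above.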
That said, NoSQL and Hadoop work together quite well as components in an enterprise data architecture. In fact, I would assert that for any environment that has true big data requirements, Hadoop and NoSQL must be deployed together. In a typical architecture, you have your NoSQL database for interactive data, and your Hadoop cluster for large-scale data processing and analytics. You might use NoSQL to manage user transaction data, sensor data, or customer profile data. You can then use Hadoop to analyze that data for outcomes like generating recommendations, performing predictive analytics, and detecting fraudulent activities.
If “whenever” is one of your favorite words, and you take a batch-oriented approach to big data, then you probably accept the separation of Hadoop and NoSQL workloads. This essentially means that you run these technologies in distinct clusters. As you create and update data on the NoSQL side, you periodically move the data in batch jobs over to Hadoop, where you can run your large-scale analytics. Sure, there are overhead and delays associated with the movement of data, extra administrative effort for the separate clusters, and duplication of tool sets as a result of the separate instances of the same data, and none of that is good. As more enterprises push harder for “now” and real-time, big data vendors will step up in the coming years to eliminate this unnecessary overhead.
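To show where the delay in this batch-oriented approach comes from, here is a small Python sketch of a periodic export job. It is a simplified model under stated assumptions: the `operational_store` list stands in for the NoSQL side, `analytics_copy` for the Hadoop side, and `export_batch` with its `updated_at` watermark is a hypothetical helper, not a real tool.

```python
def export_batch(store, last_watermark):
    """Return records changed since the last export, plus the new watermark."""
    batch = [rec for rec in store if rec["updated_at"] > last_watermark]
    new_watermark = max((rec["updated_at"] for rec in batch), default=last_watermark)
    return batch, new_watermark


# Stand-in for the NoSQL side; timestamps are arbitrary integers.
operational_store = [
    {"id": "a", "updated_at": 100},
    {"id": "b", "updated_at": 205},
    {"id": "c", "updated_at": 310},
]

# Stand-in for the second copy of the data kept on the Hadoop side.
analytics_copy = []

# One run of the periodic job, picking up everything changed after time 200.
batch, watermark = export_batch(operational_store, last_watermark=200)
analytics_copy.extend(batch)

print([rec["id"] for rec in batch], watermark)
```

Anything written to the operational store after this run stays invisible to analytics until the next run, and the same records now exist in two places. Those are the delay and duplication costs described above.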
Fortunately, MapR has solved this problem already. With MapR-DB, and the MapR Distribution including Hadoop, you get the best of both worlds in a single platform. So let’s remove this obsolete separation of data processing capabilities. And since these two capabilities are integrated from the ground up, rather than being duct-taped together after the fact, you get the processing efficiency necessary to run multiple workloads together in the same cluster. We have many customers doing exactly this.
Not only do you eliminate slow data movement and duplication, you also gain immediacy: recommendations and predictive analytics can include more recent events, and fraud detection can become more about fraud prevention.
If you want to dig in further with MapR and MapR-DB, be sure to download the free Community Edition, which is available for unlimited production use. That’s a great way to get started, and if you need powerful enterprise-grade features like high availability, disaster recovery, and snapshots, you can use the Enterprise Database Edition. To see which edition is right for you, go here.