One of the best ways to figure out how to succeed with your own large-scale projects is to see what others are doing – what has worked for them and what has not. To help you do that, my co-author Ted Dunning and I have looked at a wide array of real-world use cases and put together our observations in a new book called Real World Hadoop, published this week by O’Reilly. A lot has already been written about Apache Hadoop and related technologies. One of the things that sets this book apart is that it examines real-world situations, including Hadoop being used successfully in production.
Many people are using Apache Hadoop, NoSQL-based solutions, and the growing ecosystem of tools designed to run on a Hadoop platform to meet the challenges of large-scale data, because these technologies offer extreme scalability and flexibility while remaining cost-effective. Interest in Hadoop continues to grow, as suggested by the pattern of search terms shown in Figure 1, and experienced users are expanding the ways in which they put these technologies to work.
Figure 1: Interest level in several terms used in Google searches. We do not include “cassandra” because its use as a personal name makes it difficult to attribute search results to the big data technology of that name. Image © Friedman and Dunning 2015, from Chapter 1 of “Real World Hadoop”.
If you’re new to Hadoop…
You may be wondering when and how to get started. There’s a big advantage to jumping in now. You may already need a more cost-effective way to deal with the current scale of data volume and to be prepared for the ever-increasing flood of future data. Getting started sooner rather than later helps you future-proof your organization by building experience with Hadoop and NoSQL among your teams. One of the best ways to begin is to pick one focused goal rather than trying to understand all the different ways you may eventually want to use these approaches.
A good first project that many choose is optimization of how sophisticated and expensive enterprise data warehouse (DW) resources are used. By offloading many of the early steps in ETL to a Hadoop data platform, you can save a lot of time and money as well as reduce traffic strain on the enterprise DW so that it also performs better. ETL offload provides a focused way to get Hadoop experience that is also quick to show tangible value in cost savings and improved performance.
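As a rough illustration of what "early steps in ETL" can mean, the sketch below parses raw log lines, drops malformed records, and reduces them to a small per-customer summary, which is the only part that would then be loaded into the data warehouse. It is plain Python standing in for a job that would run on the Hadoop platform; the record layout, the sample data, and the function names are assumptions made for this example and do not come from the book.

```python
from collections import defaultdict

# Hypothetical raw log lines in the form "timestamp,customer_id,bytes_used".
RAW_LINES = [
    "2015-02-01T10:00:00,c1,100",
    "2015-02-01T10:05:00,c2,250",
    "not a valid record",
    "2015-02-01T11:00:00,c1,50",
]


def parse(line):
    """Extract step: return (customer_id, bytes_used), or None if malformed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    _, customer_id, bytes_used = parts
    try:
        return customer_id, int(bytes_used)
    except ValueError:
        return None


def summarize(lines):
    """Transform step: total usage per customer, skipping bad records."""
    totals = defaultdict(int)
    for line in lines:
        record = parse(line)
        if record is not None:
            customer_id, bytes_used = record
            totals[customer_id] += bytes_used
    return dict(totals)


# Only this compact summary would be handed to the warehouse load step.
summary = summarize(RAW_LINES)
print(summary)
```

The point of the sketch is the division of labor: the bulky, messy parsing and filtering happens outside the warehouse, and the warehouse receives only the cleaned, much smaller result.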
The book Real World Hadoop not only examines a selection of use cases to help you get an idea of what’s right for you, but also offers a collection of tips for best practice. Many of these tips are especially aimed at first-time Hadoop users.
As a newcomer to Hadoop, you have the good luck to benefit from the experience of Hadoop pioneers who have been using this ecosystem of tools for years, without having to suffer the hard knocks they experienced as they learned. If you start now, you won’t be a pioneer, but you will still have an early mover advantage that can keep you competitive.
If you’re an experienced Hadoop user…
You are in a good position to benefit from seeing how others are using these approaches. As people gain experience with Hadoop and NoSQL, they begin to see new ways these technologies can be used to good effect. It’s a natural progression to start conservatively with a relatively small cluster, and then to expand as the size and number of projects grow. The good news is that it’s relatively easy to expand a system like this as well as to manage new projects on the existing system.
What real world lessons might help the experienced Hadoop and NoSQL user? Seeing how others are gaining valuable insights from using new data sources and new data formats, including unstructured and nested data, may give you ideas about how you could do something similar in your own projects. Another pattern for success is to re-use data in a variety of ways. Because it makes sense with a Hadoop data platform to save more data in original or “raw” form and for longer periods of time, the door is open to using this data in ways you might not have imagined when it was first ingested.
Tip for success for the experienced or new user: You aren’t stuck with your first decision about how to extract and use data.
Under the traditional approach, data that is not included in what you extract for a current project, such as billing, is discarded. Now different groups within your organization can go back and ask valuable questions that probe the original data for different purposes. This approach can apply to many different situations, including marketing optimization and anomaly detection. Doing these things successfully requires a new way of thinking about what data has to offer and how you can best design your applications and your overall organization to take advantage of these new options.
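To make the idea of going back to the original data concrete, here is a minimal sketch in plain Python. The same retained raw events serve a billing extract first and an anomaly check later, and the anomaly check relies on a field that billing never needed. The event fields, the sample values, and the simple "several times the median" rule are all assumptions chosen for illustration, not a method described in the book.

```python
from statistics import median

# Hypothetical raw events, kept in their original form after ingestion.
RAW_EVENTS = [
    {"customer": "c1", "amount": 10.0, "duration_s": 30},
    {"customer": "c2", "amount": 12.0, "duration_s": 28},
    {"customer": "c1", "amount": 11.0, "duration_s": 31},
    {"customer": "c3", "amount": 9.0, "duration_s": 400},
]


def billing_totals(events):
    """First use: billing only needs the customer and amount fields."""
    totals = {}
    for event in events:
        customer = event["customer"]
        totals[customer] = totals.get(customer, 0.0) + event["amount"]
    return totals


def unusual_durations(events, factor=5):
    """Later use: flag events whose duration exceeds factor x the median.

    This reads duration_s, a field the billing extract ignored. It is only
    available because the raw events were kept rather than discarded.
    """
    typical = median(event["duration_s"] for event in events)
    return [event for event in events if event["duration_s"] > factor * typical]


print(billing_totals(RAW_EVENTS))
print(unusual_durations(RAW_EVENTS))
```

Had only the billing extract been saved, the second question could not be asked at all, which is the practical argument for retaining data in raw form.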
With that in mind, it’s not surprising that one of the most powerful and widespread use cases for advanced Hadoop and NoSQL users is to build a centralized data hub. For our real world stories, we looked at how MapR customers are using these new technologies, and the enterprise data hub was a common theme, particularly with the MapR distribution including Hadoop, since its real-time read/write file system (MapR-FS) makes it especially accessible to non-Hadoop applications as well as to Hadoop. Besides describing the design of a data hub as a general Hadoop use case, we also included specific customer stories about how they employ a data hub to meet their demands. This view of Hadoop as part of a larger organization that includes new and traditional tools is a great way to begin seeing your own situation.
Real World Hadoop also offers practical, technical tips for the experienced user, such as how to avoid the counter-intuitive result of decreased performance when you make a small increase in the size of your cluster. This can happen when new data and old are distributed differently, as illustrated in Figure 2.
Figure 2: Tips on how to expand a cluster successfully. When you expand a cluster by only a small percentage, you must be careful to rebalance data distribution to avoid having most new data (and activity) focus on just the few new nodes of the cluster. This control of data locality is particularly easy to do with a MapR cluster, but there are ways to deal with this situation with any Hadoop distribution. From Chapter 4 of “Real World Hadoop”. Image © Friedman and Dunning 2015.
Our stories come from observing what MapR customers are doing with the MapR Distribution including Apache Hadoop. This distribution includes the MapR distributed file system (MapR-FS) and an enterprise-grade NoSQL database, MapR-DB, that is API compatible with Apache HBase. The stories we tell are only a small sample, as there are over 700 paying MapR customers, but the patterns that emerge have widespread applicability for both MapR users and for those using other Hadoop or NoSQL technologies. The specific tools of choice may differ, but the goals for what you’d like to accomplish and the concepts behind the innovative new ways you have to address them are not tied to any one technology.
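The effect described in Figure 2 can be illustrated with a toy simulation. The sketch assumes a deliberately simplified placement policy in which each new block goes to the node currently holding the least data. That policy, the node counts, and the block counts are assumptions for illustration only, and this is not how MapR-FS or any particular Hadoop distribution actually places data.

```python
def place_blocks(used, n_blocks):
    """Place each new block on the least-used node (toy policy).

    Returns how many of the new blocks each node received.
    """
    used = list(used)  # work on a copy so the caller's list is unchanged
    received = [0] * len(used)
    for _ in range(n_blocks):
        target = min(range(len(used)), key=lambda i: used[i])
        used[target] += 1
        received[target] += 1
    return received


def rebalance(used):
    """Spread the existing blocks as evenly as possible across all nodes."""
    base, extra = divmod(sum(used), len(used))
    return [base + (1 if i < extra else 0) for i in range(len(used))]


OLD_NODES, NEW_NODES, NEW_BLOCKS = 10, 2, 100

# Ten established nodes holding 90 blocks each, plus two empty new nodes.
cluster = [90] * OLD_NODES + [0] * NEW_NODES

without = place_blocks(cluster, NEW_BLOCKS)
with_rebalance = place_blocks(rebalance(cluster), NEW_BLOCKS)

share_without = sum(without[OLD_NODES:]) / NEW_BLOCKS
share_with = sum(with_rebalance[OLD_NODES:]) / NEW_BLOCKS

print(f"new data on new nodes, no rebalance:   {share_without:.0%}")
print(f"new data on new nodes, with rebalance: {share_with:.0%}")
```

Under these assumptions, every new block lands on the two new nodes when the cluster is not rebalanced, so the freshest and typically most active data is served by a small fraction of the hardware. After rebalancing, new blocks spread across all twelve nodes and the new nodes receive only about their proportional share.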
So whether you are new to Hadoop or a polished pro, and whether you choose the MapR data platform or another Hadoop distribution, you should find the lessons described in Real World Hadoop helpful.
To download the free ebook, Real World Hadoop, go here.
Related talks at Strata Hadoop World San Jose 2015:
“Big Data Stories: Decisions that Drive Successful Projects” Ellen Friedman Wed 18 Feb 2015 at 9:30 am
“Real World Use Cases: Hadoop and NoSQL in Production” Ted Dunning & Ellen Friedman Thur 19 Feb 2015 at 10:40am
“YARN vs. Mesos: Can’t We All Just Get Along” Ted Dunning Fri 20 Feb 2015 at 2:20pm