Forrester Research principal analyst Mike Gualtieri, along with MapR CMO Jack Norris, joined us for a webinar titled “3 Things You Didn’t Know You Could Do With Hadoop.”
Mike began the webinar by discussing a commissioned study conducted by Forrester Consulting in October 2013, which showed that a huge POC-to-production wave is coming. In that study, 45% of the large organizations surveyed were using Apache™ Hadoop® in one or more POCs, and 16% were using Hadoop in production. Those moving POCs into production are starting to think about some interesting use cases.
Organizations are using big data in order to understand customers on an individual level, and this is one of the biggest trends driving Hadoop projects. Mike urged companies to “treat customers like royalty to get their loyalty.” Companies are now taking that advice to heart, and are starting to use big data to better understand their customers and create individualized customer experiences.
Here are three additional trends that are driving Hadoop projects:
- Hadoop can be thought of as a data operating system. Hadoop is really two things at its core: a distributed file system and a distributed processing framework.
- A Hadoop data lake is a convenient, cost-effective way to combine and process silos of data. One of the more interesting use cases is the creation of a Hadoop data lake (or “enterprise data hub,” as referenced in this whitepaper). With a data lake, firms can break down those silos and use the lake as a single repository for all of their data, across all applications.
- Hadoop’s “Fresh Architecture” (also known as the Lambda Architecture) combines the power of offline analytics with real-time analytics. Hadoop is unique in that it can be used both for offline data science and for real-time detection of what’s happening with the customer. By leveraging both offline and real-time analytics, organizations can gain fresh insights that can be used immediately to create personalized and predictive apps.
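The “Fresh Architecture” idea in the last bullet can be sketched in a few lines of code. This is a minimal illustration of the Lambda pattern only, not a MapR or Hadoop API: a batch view precomputed offline over all data, a speed view updated per event, and queries that merge the two. All class and method names here are hypothetical.

```python
# Minimal sketch of the Lambda Architecture: offline (batch) views
# merged with real-time (speed) views at query time. Illustrative
# only; names are not from any Hadoop or MapR API.

from collections import Counter

class LambdaView:
    def __init__(self):
        self.batch_view = Counter()   # rebuilt periodically by offline jobs
        self.speed_view = Counter()   # updated per event in real time

    def batch_recompute(self, all_events):
        """Offline pass over the full data set (the Hadoop side)."""
        self.batch_view = Counter(all_events)
        self.speed_view.clear()       # speed layer resets after a batch run

    def on_event(self, event):
        """Real-time update (the streaming side)."""
        self.speed_view[event] += 1

    def query(self, key):
        """Serve fresh results by merging offline and real-time views."""
        return self.batch_view[key] + self.speed_view[key]
```

The point of the design is that queries stay fresh (they see events that arrived after the last batch run) without giving up the thoroughness of offline recomputation.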
Want to learn more? Check out these resources on Hadoop, MapR, and the MapR Sandbox:
- Key Considerations: Comparing Hadoop Distributions
- Why MapR: Architecture Matters
- Fastest On-Ramp: MapR Sandbox for Hadoop
- Watch the webinar
The following questions were also asked during the webinar, but were not answered due to lack of time. Here are those questions, along with answers:
Q: What does it take to move towards Hadoop for an MS SQL/BI developer?
A: As Mike mentioned in the webinar, solving business problems with Hadoop starts with being creative and asking questions. That said, your creativity can be tapped by exploring free versions of Hadoop. Experienced Java programmers will find the most value from the Sandbox because many analytical jobs are written in Java, but other Hadoop-specific tools like Pig are also available.
One example of a free version of Hadoop is the MapR Sandbox for Hadoop, packaged as a virtual machine with tutorials and browser-based user interfaces to help you ramp up on Hadoop quickly. That is the first step in understanding what Hadoop is about, and the starting point for getting your creative juices flowing.
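To make the answer above concrete, the “analytical jobs” mentioned are typically MapReduce programs. Below is a sketch of the classic word-count job as plain Python functions, just to show the shape of the map and reduce phases; a real job would run the map step in parallel across blocks of a distributed file, and the function names here are illustrative, not a Hadoop API.

```python
# Word count, the canonical first MapReduce example, sketched as
# plain Python to show the map -> shuffle/sort -> reduce shape.
# This is not a real Hadoop job; names are illustrative.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit (word, 1) for every word, as a mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum counts per word after a shuffle/sort, as a reducer would."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase([
    "Hadoop is a data operating system",
    "Hadoop is two things",
])))
```

Tools like Pig let you express the same computation declaratively instead of writing the mapper and reducer by hand, which is why they are a popular on-ramp for SQL/BI developers.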
Q: In your slide, you show Storm broken out as a separate cluster. What about using YARN on MapR, and putting Storm on the MapR cluster?
A: That slide was actually meant to show Storm running on the same cluster as the rest of the MapR Distribution for Hadoop. MapR customers do this today without requiring YARN, but MapR also supports YARN, giving customers a wider range of choices for Apache projects and other OSS. One great capability of the MapR Distribution for Hadoop is that it can handle many different workloads in a single cluster. Operational and analytical workloads, high-speed streams of incoming data, different user groups, and different data sets can all be consolidated and processed simultaneously in the same cluster with MapR. This is possible because of the innovations and optimizations that MapR built into its distribution. Features like higher throughput, greater scalability, consistent responsiveness, volumes to support multi-tenancy, and integrated security (Kerberos, or leveraging LDAP and other standard protocols, because MapR is POSIX-compliant) all enable this wide range of workloads in a single cluster. This means you can put all your data into a single cluster, giving you the opportunity to analyze it together and gain insights that you would not get with only a subset of your data. Moreover, you can close the loop faster between analytics and operations, and impact the business as it happens instead of after the fact.
Q: Being distributed, what implications do you think this has for expensive SANs?
A: When thinking about implementing Hadoop, consider all of the huge volumes of data that you might not have exploited before, perhaps because of cost restrictions around storage and compute resources. Hadoop can run on direct-attached storage (DAS) while maintaining great failover with software replication. This changes the economics to make processing your big data sets more feasible. SANs provide great data protection and failover beyond software replication, but if you choose the right distribution of Hadoop, one with consistent point-in-time snapshots, automatic HA, wire-level security, and self-healing capabilities, you can ensure even greater data protection. Architecture matters!
Q: How long will it take Hadoop to excel in industry?
A: Although we talked about some forward-thinking uses of Hadoop in this webinar, Hadoop is already recognized as a leading technology for solving business problems today. Many of our customers are running in production because they’ve found ways to improve operations by processing big data on the MapR Distribution for Hadoop. Hadoop is suitable for big, established organizations as well as small companies, even startups, that collect massive amounts of data from which valuable insights can be derived.
Q: How big are the average POC Hadoop clusters? We created a small POC cluster with only six nodes, and it turns out our data guys ran through that so quickly that we had to double it, even before we knew what sort of requirements they were going to have for a Hadoop cluster.
A: We see many successful POCs using 5-10 nodes. That said, cluster size is better measured in spindles, GB, and/or cores, and only rarely in terms of nodes. This gives a more accurate picture of cluster capacity, measured by:
- CPU capacity across all nodes (6 nodes of 4 cores each vs 4 nodes of 16 cores each will behave very differently)
- Memory capacity, and memory to core ratio (plan on 2 to 8 GB per CPU core for reasonable performance)
- Storage capacity (total and per node)
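The three bullets above reduce to simple arithmetic. Here is a small sketch that totals a cluster's capacity and checks the 2-8 GB-per-core memory guideline from the answer; the node specifications used are made-up examples, not recommendations.

```python
# Rough cluster-capacity arithmetic following the bullets above.
# Node specs below are illustrative examples, not sizing advice.

def cluster_capacity(nodes, cores_per_node, gb_ram_per_node, tb_disk_per_node):
    total_cores = nodes * cores_per_node
    total_ram_gb = nodes * gb_ram_per_node
    ram_per_core = total_ram_gb / total_cores
    return {
        "cores": total_cores,
        "ram_gb": total_ram_gb,
        "disk_tb": nodes * tb_disk_per_node,
        "ram_per_core_gb": ram_per_core,
        # 2-8 GB of RAM per core is the guideline cited in the answer.
        "ram_per_core_in_range": 2 <= ram_per_core <= 8,
    }

# Two six-node clusters can behave very differently:
small = cluster_capacity(nodes=6, cores_per_node=4,
                         gb_ram_per_node=32, tb_disk_per_node=8)
big   = cluster_capacity(nodes=6, cores_per_node=16,
                         gb_ram_per_node=64, tb_disk_per_node=24)
```

Comparing `small` (24 cores) and `big` (96 cores) shows why node count alone, as in the six-node POC described in the question, is a poor measure of capacity.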