So you’ve researched the general capabilities of Hadoop, and have worked with your colleagues to identify the first set of big data use cases to tackle. Now, you’re ready to take the plunge and select the right Hadoop solution to invest in. After thoroughly doing your homework, you might think the final selection step would be simple, knowing that open source packages appear to be the same across the different Hadoop distributions out there. Unfortunately, that is not the case. You have a significant decision to make that involves a complex set of capabilities—just as you would when investigating other types of enterprise software. The available Hadoop distributions differ quite a bit in what they provide in terms of speeds and feeds, so you’ll need to carefully design a comprehensive RFP to make sure you get your core platform capabilities.
Based on our interactions with hundreds of customers, we recently published a high-level guide on questions to consider when building a Hadoop RFP. In that document, we describe the following four tenets that are critical to assess any Hadoop distribution:
1) Performance and Scalability
After identifying the initial set of use cases, remember that you are looking at a platform that is going to grow in size over time. You want a solution that gives you the best ROI as your Hadoop cluster scales, and also provides the best performance across a varied set of use cases that could be running on the same cluster. Considerations around single cluster scalability, performance latencies and architectural complexity for scaling are critical for your sustained Hadoop use case growth. Look out for those when you build the RFP.
2) Platform Reliability
Who does not want their cluster to never lose data and be up and running 24x7x365? However, reliability is perhaps not understood well when it comes to the Hadoop platform. We often run into situations where the user thinks 3x replication is enough for any reliability needs. If you want a true enterprise software platform, you should consider deeper topics such as RTO, RPO, and data protection against inadvertent errors. Be sure to consider how complicated things can get for the operations team in order to achieve the level of reliability you need. The lesson here is to get the Ops folks involved early on in the process.
3) Platform Manageability
Speaking of Ops, how do you plan to manage your cluster on a daily basis? Do you intend to run operational applications on Hadoop—like an HBase app—and if so, what are the implications for the Ops team? Thinking through your needs for multi-tenancy, data and job placement control, and support for different hardware configurations is an important step when choosing your distribution.
4) Data Access Capabilities
Hadoop is often used to capture data from across many data sources and systems, so interoperability and security are critical aspects of any Hadoop deployment. Real-time capabilities of the platform are often nebulous until one implements a use case and enforces a stringent latency requirement. Thinking through some of the capabilities upfront is critical for your success, especially for real-time Hadoop use cases.
To summarize, you’ll need to assess Hadoop the same way you would assess a critical enterprise software platform, given that Hadoop is here to stay and may eventually host a majority of your analytics-bound data. Picking the best distribution to lay the right Hadoop foundations is critical and may be time-consuming, but it is totally worth the effort. Invest wisely! To get started on the right track, read Key Elements to Include in Your RFP, the essential guide to developing a Hadoop RFP.