Robert D. Schneider, a Silicon Valley-based author and consultant who has provided technical expertise to a wide variety of enterprises in the Big Data, cloud, and analytics sectors, along with Mark Baker from Canonical, and Anoop Dawar from MapR, joined us for a webinar titled Navigating Your Hadoop Journey. Following the webinar, Robert, Anoop and Mark answered a number of questions from the webinar participants.
Watch Robert's more in-depth webinar titled Hadoop or Bust: Key Considerations for Your High-Performance Analytics Platform.
Q: How do I evaluate Hadoop 2.x in this context?
RS: There are a lot of interesting things coming out with Hadoop 2.x in terms of its scheduling capabilities and the work that's been done on the Hadoop File System. But the considerations I've shared today are basically agnostic in terms of the version of Hadoop that you are deploying. It's all about making Hadoop part of the overall IT landscape; versions change, but the general concepts don't. However, I know that MapR is doing a lot of interesting things with Hadoop 2.x, so I'll hand it off to Anoop, who will elaborate.
AD: Hadoop 2.x is very promising; the YARN resource management framework allows for a lot of new applications that can now be done on top of Hadoop. In terms of evaluating Hadoop, I would use the same criteria that Robert shared in the webinar. If someone has deployed Hadoop, or is about to go into a Hadoop production deployment and they want to look at Hadoop 2.x, my recommendation would be to go into production with what you have. Do a small cluster on Hadoop 2.x so that you can learn and understand 2.x. There is a learning curve here—if you want to port your applications into YARN, it will take your developers some time to understand it. Bring up a sandbox and really play with it, so that you understand its level of maturity. This will help you evaluate Hadoop 2.x and decide when you are ready to move it into production.
Q: With the evolution of this technology, do you see the need for a Canonical model diminish?
AD: Hadoop is evolving to support a variety of use cases, from batch to interactive to real time, using a variety of data: unstructured, semi-structured, self-describing, and schema-oriented. This allows for polyglot persistence (the ability to store the data in its native format and modify it as needed for various use cases). This would decrease the need to create exchange formats like canonical models.
Q: Can you use virtualization or cloud infrastructure to run a Hadoop cluster?
RS: Absolutely, they are not orthogonal at all; actually, they're quite complementary.
MB: It's a common question among those who want to run different workloads in the cloud. There used to be a theory in the early days of Hadoop that it needed to be run natively on bare metal for optimum performance. But now, any degradation of performance is marginal compared with the flexibility you get, especially if you have variable workloads running in a virtual environment or in the cloud.
Q: Is it easy to move from one distribution to the other, in case I change my mind?
AD: Absolutely. The great thing about this is that all of these distributions are Apache Hadoop and follow the same APIs. It is equally easy to move from one distribution to the other. It's mostly a matter of moving the data over to the different distribution using the various copy tools available.
RS: What's really interesting about this approach is the fact that Hadoop basically shields the average developer from the complexities underneath, and we're even seeing that with the ability to port. Yes, you have to mechanically move information, but the applications themselves are shielded from what's going on behind the scenes.
Q: What is your suggestion on a technology stack for building BI analytics on Hadoop?
AD: There are a lot of new stacks being built, as well as existing stacks that are building connectors to Hadoop, so I would recommend that you pick the tool that works best for you, and then pick the connectors for it, instead of the other way around, where you let the Hadoop infrastructure dictate the analytics stack. Almost all of the leading-edge analytics stacks are building connectors to Hadoop. Your performance with those stacks will vary depending on the particular distribution's features. For example, with high-performance NFS, it may be easier to load and get information out so that your analytics will perform slightly better, but in general that shouldn't be the primary factor in picking your analytics stack.
RS: I agree with that. Hadoop is becoming part of the regular IT landscape, and you shouldn't have to make special considerations for Hadoop. It's best to pick the analytics stack that works for you. There's a very good chance that it already has a connector to Hadoop, or it will shortly.
Q: Has anyone used Hadoop with Isilon clustered storage?
AD: You could use Hadoop with NAS devices; however, MapR offers enterprise-class storage with data protection and disaster recovery that allows for storage at hundreds of dollars per terabyte instead of thousands of dollars with NAS appliances. Additionally, it automatically scales to thousands of nodes.
Q: Are there any thoughts on the IBM Big Insights distribution? They replaced JT/TT with a very powerful scheduler.
AD: The broader questions to consider when selecting a Hadoop distribution still apply here, and your decision depends on which of these criteria are more important. MapR provides a JobTracker HA with no disruption to tasks, and has MapReduce optimizations. IBM's use of synthetic MapReduce payloads to craft a scheduler is one example of such an optimization. However, the nature of Hadoop workloads is about to change drastically as YARN gets adopted.
Q: How do I figure out performance implications using Hadoop in the cloud vs. Hadoop on bare metal servers?
AD: There are several options that you need to consider when looking at performance: distribution choice, hardware choice (CPU, memory, disks), in-cluster network performance, client-to-cluster network performance, as well as scalability dimensions. The MapR Hadoop distribution eliminates a lot of performance bottlenecks inherent in HDFS. For Hadoop performance, the following benchmarks can be done: DFSIO for IO performance; TeraSort for MapReduce performance; and YCSB for HBase performance. Both the MinuteSort and TeraSort world records were set recently using MapR.
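One way to turn the benchmark runs mentioned above into a cloud-versus-bare-metal comparison is to run the same benchmark, with the same data size and settings, in both environments and compare the average wall-clock times. The Python sketch below is only an illustration of that arithmetic: the function names `relative_slowdown` and `summarize` are hypothetical helpers, not part of Hadoop or any distribution, and it assumes you have already collected the timings yourself.

```python
from statistics import mean


def relative_slowdown(bare_metal_secs, cloud_secs):
    """Return the fractional slowdown of cloud relative to bare metal.

    Both arguments are lists of wall-clock times (in seconds) from repeated
    runs of the same benchmark with identical data size and settings.
    A result of 0.10 means the cloud runs took 10% longer on average;
    a negative result means the cloud runs were faster.
    """
    if not bare_metal_secs or not cloud_secs:
        raise ValueError("need at least one timing per environment")
    baseline = mean(bare_metal_secs)
    if baseline <= 0:
        raise ValueError("timings must be positive")
    return (mean(cloud_secs) - baseline) / baseline


def summarize(results):
    """Map benchmark name -> slowdown, given {name: (bare_metal, cloud)}."""
    return {
        name: relative_slowdown(bare_metal, cloud)
        for name, (bare_metal, cloud) in results.items()
    }
```

To use it, pass your own measured timings keyed by benchmark name, for example `summarize({"TeraSort": (bare_metal_times, cloud_times)})`. Repeating each run several times matters, because variability between runs in a shared environment can be as large as the difference you are trying to measure.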