Keeping the Focus on Hadoop in Production

For those of you who feel like Strata Conference + Hadoop World 2013 occurred a lifetime ago, let me assure you that many of us at MapR feel the same way. A lot has happened in the Big Data world since then: Facebook open sourced Presto, its distributed SQL query engine; Apache Tez, a Hadoop query-processing framework built on top of YARN, is gaining momentum; and the drumbeat for YARN keeps getting louder (although I've yet to see a successful YARN-based TeraSort run that can beat MapR's record). I'm hopeful that YARN will gain momentum soon, as it's a critical piece of the Hadoop 2.x promise. In the midst of all these major announcements, it can be hard to focus on the present and on how to derive real value from your Hadoop environment. Raymie Stata's article "Hadoop in Production: 5 Steps to Success" is a strong reminder that while the technology will keep evolving and new features and capabilities will keep being announced, what matters is moving these technologies from proof-of-concept projects into real production and post-production use. Here are five key issues from Raymie's article that you need to address to ensure the ongoing success of your Hadoop project, along with a few helpful tips from MapR on how to implement them.

1. Keeping your software up to date

It is true that with many Hadoop distributions, updating the Hadoop software can be a challenge. MapR's rolling upgrades, a feature that has been part of our products for some time, make this a simpler and more familiar process.

2. Scaling your cluster

Going from a half-rack to full-rack to multiple-rack cluster configurations can expose numerous scalability issues, including the limited HA capabilities of both the NameNode and ResourceManager. MapR's distributed file system was written to eliminate these issues, enabling customers to scale from a handful of nodes to hundreds of nodes without experiencing any scalability challenges.

3. Get your security in order

This is easier said than done, especially once a system goes into production. Organizations often struggle with compliance regulations and enforcement policies once sensitive data starts pouring in. MapR's recently introduced built-in security features offer a quick and easy way to authenticate users, encrypt network traffic, and enforce fine-grained access control on files and tables, including powerful Boolean expression-based ACLs (access control lists) that can be applied at the table, column, and column-family level. Partner solutions can also mask sensitive fields so that business analysts see only the data they are authorized to access.
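To make the idea of Boolean expression-based ACLs concrete, here is a minimal sketch of how such expressions might be evaluated. The syntax and evaluator below are illustrative assumptions (terms like `u:<user>` and `g:<group>` combined with `&`, `|`, `!`, and parentheses), not MapR's actual grammar or implementation.

```python
# Illustrative sketch only: a tiny evaluator for Boolean access-control
# expressions over user/group terms. The u:/g: syntax here is an assumption
# for demonstration, not MapR's real ACE grammar.
import re

TOKEN = re.compile(r'\s*([ug]:[\w-]+|[()&|!])')

def tokenize(expr):
    expr = expr.rstrip()
    pos, tokens = 0, []
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"bad token at {expr[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def evaluate(expr, user, groups):
    """Recursive-descent evaluation, precedence: ! binds tighter than &, then |."""
    tokens = tokenize(expr)
    i = 0

    def peek():
        return tokens[i] if i < len(tokens) else None

    def eat(tok):
        nonlocal i
        if peek() != tok:
            raise ValueError(f"expected {tok!r}, got {peek()!r}")
        i += 1

    def parse_or():
        v = parse_and()
        while peek() == '|':
            eat('|')
            v = parse_and() or v
        return v

    def parse_and():
        v = parse_not()
        while peek() == '&':
            eat('&')
            v = parse_not() and v
        return v

    def parse_not():
        if peek() == '!':
            eat('!')
            return not parse_not()
        return parse_atom()

    def parse_atom():
        nonlocal i
        tok = peek()
        if tok == '(':
            eat('(')
            v = parse_or()
            eat(')')
            return v
        i += 1
        kind, name = tok.split(':', 1)
        # u: matches the requesting user; g: matches any of their groups.
        return name == user if kind == 'u' else name in groups

    result = parse_or()
    if i != len(tokens):
        raise ValueError("trailing tokens")
    return result
```

For example, `evaluate("g:analysts & !g:interns", "bob", {"analysts"})` grants access, while a user who is also in `interns` would be denied. The same kind of expression can be attached per table, column family, or column to express policies that plain owner/group/other permissions cannot.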

4. Supporting your users

To reduce the probability of job failures and ensure that tasks run to completion, you'll need to eliminate single points of failure so that jobs keep running even in the event of a JobTracker failure. When failures do happen, they are easier to debug with sophisticated monitoring tools like the MapR Control System, which lets you immediately see which nodes are healthy; how much bandwidth, disk space, and CPU time are in use; and the status of MapReduce jobs.

5. Keeping tabs on technology

The ecosystem is indeed evolving rapidly, and it's quite understandable to want to jump from one proof-of-concept project to another one that features new Hadoop technologies. It is quite a different situation, however, when you have a production cluster that your users depend on. Raymie makes some really good points about the importance of evaluating the track record of any new Hadoop component first, and doing a quick evaluation in a small sandbox before deployment. Good luck with your Hadoop deployments, and may resiliency be with you!

