Back in 2011, Paul Yang of Facebook's engineering team published a fascinating blog post detailing how the company migrated a 30PB Hadoop cluster from one server base to another. As one of the world's largest Hadoop deployments, Facebook runs several Hadoop clusters that collectively number over 5,000 nodes, and it has amassed an impressive roster of hundreds of Hadoop engineers. This tremendous talent base, probably unmatched in the commercial Hadoop space and representing a significant percentage of the worldwide Hadoop talent, has allowed Facebook to leverage the power of Hadoop at scale.
This intriguing blog post not only gives readers an overview of the surface-level challenges involved in migrating 30PB of data; it also gives the "mere mortal" some insight into the level of customized code Facebook must deploy in order to take advantage of Hadoop's power. Facebook not only developed its own Hadoop distribution (in fact, multiple distributions, each tuned for a specific workload such as MapReduce or HBase); the blog entry revealed that Facebook also had to develop and deploy its own data replication layer, owing to the unique challenges of running Hadoop as an enterprise platform within a large, data-centric enterprise. Furthermore, many gyrations were required to migrate the data while keeping operational disruption to a minimum. Facebook runs much of its business on Hadoop, so taking the entire business down for a few weeks to complete the migration was not an option.
In the blog's comment section, most readers applauded the magnitude of Facebook's accomplishment. Certainly, migrating 30PB of data is a massive task, and Facebook is to be congratulated on successfully completing an initiative that is likely unmatched in the history of data migration. Several readers also asked Facebook to release its code to the general public, a request Facebook politely declined, noting that the code is likely of little use outside Facebook because of the highly customized nature of its Hadoop deployment.
What should a typical IT organization, not blessed with Facebook's talent, budget, or highly customized technology, do when confronted with the likelihood that it will someday need to accomplish a similar task? Even though the data sets to be migrated might be one or two orders of magnitude smaller than what Facebook needed to move, it's clear from the blog post that accomplishing the task would have been impossible without Facebook's resources. Does that mean you shouldn't deploy Hadoop, or worse, that you should take on a fool's errand by attempting the same approach without a similar set of resources? There must be a better way.
Thankfully, there is. Very few organizations can, want to, or should be in the business of developing their own Hadoop distribution (not to mention the hard-to-find talent that goes with that chore). Why be in the business of designing your own transmission when the real challenge is to steer the car well enough to win the race? Sure, you’re unlikely to win the race without a great car, but doesn’t it make more sense to acquire the car and drive it to victory, rather than build the car and potentially never make it out of the garage?
Building your own Hadoop distribution may seem easy and even fun. It might even appear to be a perpetual employment machine, guaranteeing a paycheck in an era when Hadoop skills are hard to find in the open market. But it doesn't serve the organization well. Deploying Hadoop inside an enterprise is a daunting task, made significantly more difficult by the fact that Hadoop's original design point was to facilitate analyzing large sets of log data over short periods of time, then throwing the data away.
Hadoop, and specifically HDFS, the foundation of Hadoop, was implemented as a write-once, read-many, CD-ROM-like filesystem that is neither fast nor flexible, making it less than ideal for real-time ingest, interactive analysis, or many of the other use cases currently of interest to Hadoop adopters. It wasn't designed to retain data for long periods of time (ask Owen O'Malley, formerly of Yahoo and now with Hortonworks, how many early Hadoop adopters have accidentally deleted massive amounts of Hadoop data). Forcing HDFS to do what it wasn't designed to do is not the best use of your enterprise talent. Don't be fooled into thinking, "Hey, if Facebook could do it, so can I." Facebook clearly has a Hadoop resource base matched by few other companies. If it were that easy, they wouldn't need hundreds of Hadoop engineers and the compute, storage, power, space, and cooling resources needed to drive 30PB of data.
Many companies in the "adtech" business know more about Hadoop than those in any other vertical market. They've looked at the Hadoop landscape and adopted an enterprise-grade Hadoop solution because they've learned that leveraging a third party's Hadoop distribution, with features beyond the generic open source version, makes more sense than building it themselves. They've realized, through painful experience, that they don't want to keep rebuilding their transmissions (i.e., making low-level infrastructure changes) when their very business relies on their ability to win the race. It makes far more sense to focus your energies on driving the car better than anyone else than on building transmissions. By getting out of the Hadoop business, you'll get more out of Hadoop.