A Brief Look at the Evolution of Hadoop

This blog post is part of a four-part series, and is based on the Radiant Advisors white paper titled “Driving the Next Generation Data Architecture with Hadoop Adoption.”

As you probably know, Apache Hadoop was inspired by Google’s MapReduce and Google File System papers and cultivated at Yahoo! It started as a large-scale distributed batch processing infrastructure, and was designed to meet the need for an affordable, scalable and flexible data structure that could be used for working with very large data sets.

In the early days, big data required a lot of raw computing power, storage, and parallelism, which meant that organizations had to spend a lot of money to build the infrastructure needed to support big data analytics. Given the large price tag, only the largest Fortune 500 organizations could afford such an infrastructure.

The Birth of MapReduce: The only way to get around this problem was to break big data down into manageable chunks and run smaller jobs in parallel on low-cost hardware, with fault tolerance and self-healing managed in the software. This was the primary goal of the Hadoop Distributed File System (HDFS). To fully capitalize on big data, MapReduce came on the scene: a programming paradigm that made massive scalability possible across hundreds or thousands of servers in a Hadoop cluster.
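
To make the split-and-combine idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and the reducer are plain Python functions that read from stdin and write tab-separated key-value pairs to stdout. The script name, the local test command, and the sample input are illustrative assumptions, not something from the original post.

```python
#!/usr/bin/env python3
# wordcount.py -- minimal word-count sketch in the Hadoop Streaming style
# (illustrative only). The mapper emits (word, 1) pairs; the framework sorts
# them by key and hands them to the reducer, which sums the counts per word.
# Local test:
#   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
import sys
from itertools import groupby

def mapper(lines):
    """Split each input line into words and emit 'word<TAB>1' per word."""
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    """Sum the counts for each word; assumes the input is sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```

On a real cluster, Hadoop Streaming runs many copies of the mapper, roughly one per HDFS block of input, and shuffles the sorted pairs to the reducers, which is where the massive parallelism described above comes from.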

YARN Comes on the Scene: The first generation of Hadoop provided affordable scalability and a flexible data structure, but it was only the first step in the journey. Its batch-oriented job processing and its consolidated resource management (a single JobTracker handled both scheduling and cluster resources) were limitations that drove the development of Yet Another Resource Negotiator (YARN). YARN essentially became the architectural center of Hadoop, since it allows multiple data processing engines to work with data stored in a single platform.

This new, modern data architecture made it possible for Hadoop to become a true data operating system and platform. YARN separated the data persistence functions from the different execution models, unifying the data for multiple workloads. Hadoop Version 2 provides the foundation for today’s data lake strategy, in which a large object-based storage repository holds data in its native format until it is needed. However, using the data lake only as a consolidated data repository is shortsighted; Hadoop is really meant to be used as an interactive, multi-workload, operational data platform.

Challenging Fixed, Predefined Schemas for Agile Data Applications: Big data volatility, that is, data that evolves and changes fluidly, makes it difficult to define schemas up front. A shift in data architecture toward key-value pairs and JSON files gave users the ability to read or define the data schema on access, rather than before writing and storing the data on disk. Hadoop now makes it possible to store data files in almost any format: relational or otherwise known structures, self-describing and changing structures, or raw data whose schema is defined on read and optimized via standardized file formats. This flexibility is a key factor in meeting today’s big data and application agility needs. Since everything can now be stored as data files, self-describing JSON documents, or key-value store files, you can keep thousands or millions of documents on a distributed file system, giving you the flexibility you need without being tied to a fixed schema.
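
As a rough illustration of schema-on-read (not something shown in the original post), the sketch below uses PySpark to load self-describing JSON documents directly from a distributed file system and lets the engine infer their structure at access time; the HDFS path and the eventType field are hypothetical, and a working Spark installation is assumed.

```python
# Schema-on-read sketch with PySpark (assumes Spark is installed; the HDFS path
# and the 'eventType' field are hypothetical). No schema is declared up front:
# Spark inspects the self-describing JSON documents when they are read, so new
# or changed fields simply appear in the inferred schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Each line of the input files is one self-describing JSON document.
events = spark.read.json("hdfs:///data/raw/events/")

events.printSchema()                          # schema discovered on read, not predefined
events.groupBy("eventType").count().show()    # query a field that was never declared
```

Because the schema travels with the data rather than living in a database catalog, adding a new field to the JSON documents requires no migration; the next read simply picks it up.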

In the next post in this series, we’ll talk about embracing the data operating system.

What does the future hold for Hadoop in the enterprise? Share your thoughts in the comments section below.


