This blog post is the final post of a four-part series, and is based on the Radiant Advisors white paper titled “Driving the Next Generation Data Architecture with Hadoop Adoption” which examines the emergence of Hadoop as an operational data platform. (Here's part 1, 2, and 3 of the series).
The advent of YARN made it possible for multiple data processing engines to handle data stored in Hadoop, and this shift in data architecture makes it possible for Hadoop to become a true data operating system. Before YARN, MapReduce was the only compute engine that could be resource-scheduled in Hadoop. YARN opened up Hadoop to run new compute engines. What’s more, YARN applications make it possible to access data without having to duplicate it or move it for a different workload usage. Organizations can now migrate historical data to Hadoop, and they are benefiting from having a unified data repository that can reuse data without duplication or latency issues. In addition, one early design decision by the founding Hadoop developers was to allow replacement of the default file system, HDFS. This design opened the door for higher-performance file systems like MapR-FS to be deployed to handle operational requirements in Hadoop.
With YARN, analytic applications can now be used alongside transactional applications in this new horizontal data operating system paradigm. While this model is still relatively new, the eventual goal is to have data created and analyzed in Hadoop with operational, transactional, and analytic YARN applications accessing it. Latency will be a problem of the past, and the “store-once/use-many” approach will become the default for all new application development.
Organizations these days are challenged with the task of storing, processing, and managing the vast amount of data created by a multitude of devices. Given this big data deluge, the only real option for storing data at this scale is the Hadoop data platform. By combining the advanced analytics and data mining opportunities that come from all this new behavioral data with data from operational systems like CRM or MDM, organizations can now detect insights that otherwise would be missed in high-velocity data streams.
Improvements found in the third generation of Hadoop will fuel the adoption of the data lake and data operating system strategies within the enterprise data architecture. In addition, YARN applications will continue to evolve in order to meet organizations’ needs. And with the widespread interest in Hadoop ecosystem tools such as Apache Hive and Apache Drill, organizations have even more powerful ways to interact with data.
Within the big data landscape, there are many different approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs. As different YARN applications for SQL access continue to improve, it’s important to look for those that are optimized for transaction processing, including record read/write operations with transaction integrity and full ACID (atomicity, consistency, isolation, and durability) capability.
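The ACID transaction semantics described above can be illustrated with a minimal, self-contained sketch. This example uses Python's built-in sqlite3 module purely as a stand-in for a transactional SQL-on-Hadoop engine; the accounts table and balances are hypothetical:

```python
import sqlite3

# In-memory database purely for illustration; a transactional SQL-on-Hadoop
# engine would expose similar semantics over data stored in the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 100)")
conn.commit()

# Atomicity: both updates succeed together, or neither is applied.
try:
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        new_balance = conn.execute(
            "SELECT balance FROM accounts WHERE id = 1"
        ).fetchone()[0]
        if new_balance < 0:
            raise ValueError("insufficient funds")  # aborts the whole transfer
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass

# Consistency preserved: the failed transfer left both balances unchanged.
balances = [row[0] for row in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(balances)  # both accounts still hold 100
```

The key point for an operational data platform is the same: a partially applied write must never become visible to other readers, regardless of which compute engine issued it.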
Given the rapid pace of innovation in the Hadoop space, the future is bright for organizations that want to incorporate a Hadoop data lake or data operating system strategy in their strategic IT and data architecture roadmaps. Be sure to seek out technologies that maintain interoperability within the Hadoop/YARN framework, so that all data can be seamlessly leveraged by the entire Hadoop ecosystem.
What are your thoughts about using Hadoop as an operational data platform? Share your thoughts in the comment section below.