This blog post is the second post of a four-part series, and is based on the Radiant Advisors white paper titled “Driving the Next Generation Data Architecture with Hadoop Adoption” which examines the emergence of Hadoop as an operational data platform. (See part one here.)
Hadoop has been a phenomenon for big data, and increasingly for operational workloads as well. It has transformed from its batch-oriented roots into an interactive platform by incorporating a number of components, including technologies that provide SQL and distributed in-memory capabilities. And while basic data management principles guide Hadoop adoption, a change in mindset is also needed to move Hadoop beyond a big data and analytics platform.
Many organizations today are already using Hadoop to run a wide range of mission-critical production applications, and they are demanding faster performance along with the ability to do random writes and updates. In other words, they want to retain the enterprise-grade capabilities they've come to expect of their traditional technologies, which should come as no surprise. Meeting these demands is challenging because transactional workloads were not the Hadoop community's initial focus. Fortunately, solutions like Apache HBase (and of course, MapR-DB) support applications that require high insert and retrieval rates on Hadoop, and other technologies will likely emerge to move Hadoop even further toward operational workloads.
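To make the "random writes and updates" point concrete, here is a minimal, hypothetical in-memory sketch of the wide-column data model that stores like HBase are built around (rows keyed by a row key, columns grouped into families). This is an illustration of the model only, not the actual HBase client API; the class and table names are invented for the example.

```python
# Hypothetical sketch of an HBase-style wide-column data model.
# Rows are addressed by a row key; each cell is addressed by a
# "family:qualifier" column name. Not the real HBase API.

class WideColumnTable:
    def __init__(self):
        self.rows = {}  # row_key -> {"family:qualifier": value}

    def put(self, row_key, columns):
        """Insert or update column values for a row (a random write)."""
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        """Retrieve all columns for a row (a random read)."""
        return self.rows.get(row_key, {})

table = WideColumnTable()
table.put("user#1001", {"info:name": "Alice", "info:city": "Austin"})
table.put("user#1001", {"info:city": "Boston"})  # update in place
print(table.get("user#1001"))
```

The key property illustrated is that a single row can be written and re-read by key at any time, rather than appended to a batch file and rewritten wholesale, which is what operational applications need from Hadoop.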
What else is in store for Hadoop? Going forward, you'll see SQL-based transaction systems with full create, read, update, and delete (CRUD) capabilities that can 1) leverage a robust Hadoop SQL engine, combined with a high-performance file system capable of POSIX compliance, and 2) support high transaction volumes and operational service levels.
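The four CRUD operations are the same regardless of engine. The sketch below uses Python's built-in SQLite purely as a stand-in (the table and column names are invented); these are the kinds of statements a transactional Hadoop SQL engine would need to execute with full support.

```python
import sqlite3

# CRUD against SQLite as a stand-in for a transactional SQL engine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")

cur.execute("INSERT INTO orders (id, status) VALUES (1, 'new')")        # create
row = cur.execute("SELECT status FROM orders WHERE id = 1").fetchone()  # read
cur.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")        # update
cur.execute("DELETE FROM orders WHERE id = 1")                          # delete
conn.commit()
print(row)  # ('new',)
```

Historically, Hadoop's file-oriented storage made the update and delete steps the hard part, which is why full CRUD support is tied to the file system capabilities mentioned above.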
Hadoop will ultimately need to add capabilities that have long existed in relational database management systems (RDBMSs). For example, support for ACID transactions, meaning each transaction is atomic, consistent, isolated, and durable, would let Hadoop maintain data integrity across an entire database, which is critical when storing master data. In a multi-tenant environment with a large number of users, where many applications update data concurrently, the ability to maintain data consistency is crucial. It's reasonable to believe that Hadoop can one day be used for a majority of RDBMS-based applications, provided it adds the necessary built-in capabilities for maintaining transactional integrity.
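Atomicity, the property most often missing from file-oriented storage, is easy to demonstrate with any ACID-compliant store. In this sketch SQLite again stands in for such a store (the account data is invented): a transfer fails halfway through, and the partial debit is rolled back rather than left behind.

```python
import sqlite3

# Atomicity in practice: either every statement in the transaction
# commits, or none do. SQLite stands in for any ACID-compliant store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass

# The partial debit was rolled back; balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Without this guarantee, the simulated failure would have left 50 missing from one account and never credited to the other, exactly the kind of corruption that is unacceptable for master data.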
Other requirements for running a transaction system on Hadoop include backup, recovery, fault-tolerance, and disaster recovery capabilities. To accelerate the adoption of next-generation Hadoop, it's important that it be compatible with the existing IT tools that perform these functions for RDBMSs.
Now that YARN is available to provide resource management for Hadoop clusters, operational Hadoop requires fast response times and near-real-time processing. These requirements are driving the development of projects like Apache Storm, as well as distributed in-memory architectures like Apache Spark. Other technologies that address these requirements include SQL-on-Hadoop engines such as Apache Hive and Apache Drill, as well as higher-performance HDFS replacements such as MapR-FS.
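One model behind near-real-time processing is micro-batching, popularized by distributed in-memory engines like Spark Streaming: events arrive in small batches, each batch is processed in memory, and running state is updated incrementally. The sketch below illustrates the idea in plain Python (the event stream and batch size are invented); it is not Spark code, and a real engine would distribute and checkpoint this work across a cluster.

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Group an event stream into fixed-size micro-batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

running_counts = Counter()
stream = ["login", "click", "click", "login", "purchase", "click"]

# Process each small batch in memory, updating running state as we go,
# rather than waiting to scan the full dataset in one batch job.
for batch in micro_batches(stream, batch_size=2):
    running_counts.update(batch)

print(running_counts)  # Counter({'click': 3, 'login': 2, 'purchase': 1})
```

The trade-off is latency versus throughput: smaller batches give fresher results, while larger batches amortize per-batch overhead, which is the tuning knob these in-memory architectures expose.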
In the next post in this series, we’ll talk about the importance of incorporating a data operating system in your strategic IT and data architecture roadmaps.
What are your thoughts about operationalizing Hadoop? Share them in the comments section below.