Hadoop 2.0: The Ultimate Foundation for the Next-Generation Data Lake Architecture

There's never a dull moment in the world of Hadoop. The Apache Hadoop community recently announced the general availability (GA) of Apache Hadoop 2.0 - a clear step forward in Hadoop's journey to become the de facto platform for the Data Lakes of today and tomorrow.

My congratulations go to everyone involved in this effort, which took almost four years. It takes an inordinate amount of commitment, passion and energy to accomplish something of this magnitude. Hats off to the entire community!

With Apache Hadoop 2.0, we are taking a significant step forward towards making Hadoop pervasive. The noteworthy aspect of Apache Hadoop 2.0 is the introduction of YARN, a distributed resource scheduler that supports additional execution engines beyond MapReduce. The MapR Data Platform is an essential foundation for YARN:

1) Many non-MapReduce execution engines are not designed to use the HDFS API and require read/write (R/W) access to data. For example, work is underway to allow Apache Storm to run within the YARN framework. The MapR Data Platform's unique ability to support simultaneous reads and writes enables Storm to feed directly from the underlying distributed file system.

2) As organizations look to deploy varied applications alongside their MapReduce workloads, the expectations and demands placed on their Hadoop deployments will increase. With more reliance on Hadoop come higher expectations of business continuity, reliability and ease of integration with existing tools. Fortunately, MapR customers have the foundation in place with a distribution that provides business continuity out of the box, including high availability (HA), data protection and disaster recovery.
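To make the simultaneous read/write point in item 1 concrete, here is a minimal Python sketch of the access pattern: one thread appends records to a file while another follows the same file and consumes each record as it lands. It is only an illustration under stated assumptions. A local temporary directory stands in for a cluster file system mounted over NFS, and `append_events`, `follow` and `demo` are hypothetical helper names, not part of Storm, YARN or any MapR API.

```python
import os
import tempfile
import threading
import time


def append_events(path, events, delay=0.005):
    """Producer: append one record per line, flushing after every write."""
    with open(path, "a") as writer:
        for event in events:
            writer.write(event + "\n")
            writer.flush()  # make the bytes visible to readers right away
            time.sleep(delay)


def follow(path, expected, poll_interval=0.01, timeout=5.0):
    """Consumer: read complete lines as they are appended to `path`.

    Stops once `expected` records have been read or `timeout` seconds pass.
    """
    records = []
    pending = ""  # holds a partially written line, if we ever see one
    deadline = time.monotonic() + timeout
    with open(path, "r") as reader:
        while len(records) < expected and time.monotonic() < deadline:
            chunk = reader.readline()
            if not chunk:
                # Reached the current end of file; wait for the writer.
                time.sleep(poll_interval)
                continue
            pending += chunk
            if pending.endswith("\n"):
                records.append(pending.rstrip("\n"))
                pending = ""
    return records


def demo():
    """Run producer and consumer concurrently against the same file."""
    events = [f"event-{i}" for i in range(5)]
    # Assumption: this temp directory stands in for an NFS mount point.
    with tempfile.TemporaryDirectory() as mount_point:
        path = os.path.join(mount_point, "events.log")
        open(path, "w").close()  # create the file before the reader opens it
        producer = threading.Thread(target=append_events, args=(path, events))
        producer.start()
        consumed = follow(path, expected=len(events))
        producer.join()
    return consumed
```

Calling `demo()` returns the five records in the order they were written. Whether a given file system lets a reader see data while the writer still holds the file open is exactly the property item 1 is concerned with; the sketch only demonstrates the pattern on a local file.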

MapR has always been focused on expanding the use cases for Hadoop.

We delivered the MapR Data Platform to expand the use cases through innovations in the storage layer, and we're excited to deliver YARN to expand the use cases in the compute layer. The combination of the MapR Data Platform and the YARN resource scheduler forms the ultimate foundation for the next-generation data lake architecture.

It's also worth noting that some enhancements have been made to HDFS in the Hadoop 2.0 release, such as NameNode HA, HDFS Federation, HDFS Snapshots and NFS support. In fact, these features have been available in commercial HDFS-based distributions for many months, though they have achieved little adoption and acceptance from customers that deploy Hadoop in production. This is because the architectural limitations of HDFS make it impossible to support consistent snapshots, R/W NFS access and real HA. MapR has uniquely addressed these architectural limitations via the MapR Data Platform. We'll explain this in more detail in an upcoming blog post.

What does the arrival of Hadoop 2.x mean for existing and future MapR customers?

MapR is an open Hadoop distribution that adheres to the same open API standards. We take particular care and pride in ensuring that the ecosystem packages surrounding Hadoop work, and work well, with the MapR Data Platform. It should come as no surprise that all the Hadoop 2.x enhancements will be available on MapR. The really good news is that almost all of the new enhancements will work better on MapR because of its stronger architectural base.

MapR continues to be the only choice for customers who want to achieve production success, thanks to full data protection, no single point of failure, self-healing, consistent snapshots, disaster recovery, and arguably the most reliable and performant NoSQL implementation on Hadoop.

At MapR, we are focused on production deployment needs, and we combine the best of what's possible in the open source community with architectural innovations to ensure a successful business outcome for our customers. Integrating Hadoop 2.0 is one more step along the way.

