Apache Hadoop has been steadily evolving into a robust technology ecosystem since its founding in 2005. Today saw more evidence of that evolution with Cloudera’s announcement of a new open source project called Kudu, a technology described as a “complement to HDFS and Apache HBase...designed to fill gaps in Hadoop’s storage layer.” Apparently Cloudera’s development team “... eventually came to the conclusion that large architectural changes were necessary to achieve our goals”.
MapR arrived at a similar conclusion 6 years ago. MapR was founded to create an industrial-strength data and storage platform. In terms of data storage, MapR has taken a broader view that addresses the requirements for enterprise storage and files, in addition to database operations (which is where Kudu is focused). The MapR platform now includes a vastly scalable file system (while keeping HDFS APIs) and the industry’s only in-Hadoop NoSQL database, now with native support for JSON documents. If you are intrigued by Kudu, then you should also evaluate the MapR Community Edition. It includes MapR-DB, which has been proven over and over in production environments.
Kudu is described as a “storage system for tables of structured data. Tables have a well-defined schema consisting of a predefined number of typed columns.” That sounds a lot like a database, or more specifically a storage subsystem for a database. When combined with the Impala query engine, it seems to be squarely focused on the (columnar) RDBMS market, though Cloudera apparently chose not to use the words “relational” or “database”. But it is clear that Kudu is designed to deliver faster analytics from structured data - a use case which Hadoop and HBase were having a hard time fulfilling.
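To make the quoted description concrete, here is a minimal sketch of what a table with “a predefined number of typed columns” implies for writers. It is plain Python, not Kudu's API: the `Column` and `Table` names and the `insert` method are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column of the schema: a name plus the Python type it accepts."""
    name: str
    type: type


class Table:
    """Hypothetical in-memory table with a fixed, typed schema (not Kudu's API)."""

    def __init__(self, columns):
        self.columns = tuple(columns)  # the schema is fixed at creation time
        self.rows = []

    def insert(self, *values):
        # A row must supply exactly one value per declared column.
        if len(values) != len(self.columns):
            raise ValueError(
                f"expected {len(self.columns)} values, got {len(values)}"
            )
        # Every value must match the declared type of its column.
        for column, value in zip(self.columns, values):
            if not isinstance(value, column.type):
                raise TypeError(
                    f"column {column.name!r} expects {column.type.__name__}"
                )
        self.rows.append(values)


metrics = Table([Column("host", str), Column("ts", int), Column("cpu", float)])
metrics.insert("node-1", 1443484800, 0.42)
```

A row with the wrong number of values or a value of the wrong type is rejected at write time, which is the kind of guarantee a schema-free file in HDFS does not give you.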
Being slow and reliable is easy. Being fast and reliable is difficult, and it remains to be seen whether Kudu’s initial product can first outperform HDFS or HBase, and then reach the levels of performance needed to match MapR-DB running on the MapR Data Platform.
For further analysis of Kudu and MapR-DB, check out this video from MapR CTO and co-founder M.C. Srivas:
So, What Problem is Kudu Aimed at Solving?
In short, Kudu seems to be aimed at:
- Reducing architectural complexity
- Performance (for table-based operations)
- Reliability across globally-distributed data centers
Reducing Architectural Complexity
Cloudera references “hybrid architectures” which customers have had to build with Hadoop to achieve efficiencies in data storage (updates in place) and analytics. MapR calls this “operational analytics”: simply being able to run operational (read: database) applications seamlessly alongside analytic applications on a single data platform using MapR-DB, reducing data duplication, latency, and complexity. Here’s what it looks like in MapR, and this is why we have been saying for 6 years that “architecture matters” and giving customers a very detailed perspective on the architecture required for this.
The initial focus of Kudu appears to be on performance for database operations, primarily for use cases where Impala is used for querying and performing analytics. Yet Kudu's design goal of being "nearly as fast as HDFS and HBase" suggests that there is a trade-off required when trying to handle streaming versus random access workloads. We've shown that MapR can do both without compromise. We built MapR-DB as a native extension of the MapR Data Platform so that files and tables are managed together seamlessly, reducing the need for separate clusters. One observation we had of customer deployments was that they often required separate HDFS and HBase clusters. Now a third silo has been introduced, so a typical Cloudera installation would require three clusters - certainly not reducing complexity. Should you create an HDFS, HBase, or Kudu cluster? Well, it depends. Why not have one data platform that can handle all of the workloads?
Cloudera is closing a gap in their HDFS-based data architecture to take advantage of advances in modern hardware, but there is an even bigger gap in the market which customers already realize, which is why they are choosing MapR for mission-critical Hadoop applications.
MapR-FS already provides significantly higher performance than HDFS on HDD hardware architecture, with MapR providing 1.5 GB/second average throughput per node and a whopping 16 GB/second sequential reads (and 10 GB/second writes) on Samsung NVMe, as highlighted at a Samsung event in August 2015 (skip to 26:50). “The World’s Fastest Big Data. Bar none”.
For database operations, benchmarks show MapR-DB is on average 2-7x faster than Apache HBase, depending on the workload, providing significantly higher throughput and much more consistent low latency during ongoing operations - even more important for production workloads. Our tests show scan operations are 3x faster on hard disk drives (HDD) and 11x faster on SSDs. According to Cloudera’s published research, Kudu is slower than HBase across all dimensions of the YCSB workloads except one: the test with 50% random reads and 50% updates on a special uniform access distribution.
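For readers unfamiliar with that workload shape, the sketch below shows what a 50% read / 50% update mix over uniformly chosen keys looks like. It is an illustration only, not YCSB code and not the configuration Cloudera used: the `generate_ops` function, its parameters, and the key counts are all assumptions.

```python
import random


def generate_ops(n_ops, n_keys, read_fraction=0.5, seed=0):
    """Return a list of (operation, key) pairs for a read/update mix.

    Keys are drawn uniformly, so every row is equally likely to be touched.
    This is a hypothetical helper, not part of YCSB.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    ops = []
    for _ in range(n_ops):
        key = rng.randrange(n_keys)  # uniform access distribution
        kind = "read" if rng.random() < read_fraction else "update"
        ops.append((kind, key))
    return ops


ops = generate_ops(n_ops=10_000, n_keys=1_000)
reads = sum(1 for kind, _ in ops if kind == "read")
print(f"reads: {reads}, updates: {len(ops) - reads}")
```

Because keys are uniform, no small hot set of rows benefits from caching, which is worth keeping in mind when comparing that single result against production workloads, where access is often skewed.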
Reliability Across Globally-Distributed Data Centers
The MapR Data Platform has long been known as the only Hadoop distribution that supports disaster recovery out of the box for the file system, and we enhanced MapR-DB earlier this year to provide multi-master table replication, giving real-time access to live data distributed across multiple clusters and multiple data centers around the world. Replication is also essential to a comprehensive mission-critical disaster recovery strategy, because it significantly reduces the risk of data loss should a site-wide disaster occur.
In summary, Kudu is aimed at solving a subset of the problems which MapR addressed years ago within the Hadoop and Spark ecosystem for customers who are serious about addressing modern data application requirements in production settings. Many of the stated “issues” being addressed are the same reasons why customers who have had experience with other distributions selected MapR in the first place, as seen from this independent survey of MapR customers from TechValidate: performance, availability, less hardware (and therefore less complexity), along with other critical capabilities for production success.
Now, 6 years later, Cloudera has realized they need a better data platform and so should you. But why wait? Get started today with MapR Community Edition with MapR-DB - completely free and ready for prime time. See why MapR is the production choice for Hadoop and Spark.