The recent release of Apache Hive 0.13 prompted me to think about how far we have come in serving both operational and analytical systems in the context of the Hadoop ecosystem.
So, let's step back a bit. The MapReduce paradigm, still considered one of the tenets of Hadoop, enjoys many production deployments and success stories. However, due to its batch nature, people have spent the past two to three years working on interactive large-scale query systems (again, in the context of Hadoop; such systems have obviously been around far longer).
Further, enterprises in fact depend on both operational (think: customer purchase orders, etc.) and analytical (for example, BI tools) systems. Now the question arises: can we provide one platform that is capable of serving both? A bit like an Oracle database, just at scale and with a sensible price tag.
In the figure above I've tried to provide a rough orientation of where we stand at the time of writing. Let's now take a closer look at what is already available and what work items remain.
SQL and NoSQL
Acknowledging that most use cases benefit from a polyglot persistence mindset—using the right datastore for a certain task, depending on the nature of the data, the workload and the SLAs—I'd argue that in practice it is SQL and NoSQL rather than SQL or NoSQL. In this sense, the available SQL-on-Hadoop offerings along with the capabilities of the Apache Spark stack enable us to address the different types of workloads we encounter in the enterprise:
- Batch SQL. Technologies such as Hive were initially designed for batch queries, providing a declarative abstraction layer on top of the MapReduce processing framework; with Hive 0.13 the final phase of the Stinger Initiative can be considered complete, and Hive moves more and more into the next category ...
- Interactive SQL. Query engines such as Impala and Apache Drill provide interactive query capabilities, enabling traditional business intelligence and analytics on Hadoop-scale datasets.
- Operational SQL. Point queries are typically found in OLTP settings; they operate over smaller datasets and typically include inserts, updates, and deletes. The expected latency is usually very low (e.g., milliseconds) due to the high volume of requests from these applications.
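The distinction between the analytical and operational categories is mostly one of query shape: full scans with aggregation versus keyed single-row lookups and mutations. A minimal sketch, using Python's sqlite3 as a local stand-in (in practice the analytical queries would run on Hive or Impala and the point queries on an operational store; the table and column names are illustrative only):

```python
import sqlite3

# SQLite as a local stand-in; not Hive/Impala, just to contrast query shapes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical purchase-order table (names made up for illustration).
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "acme", 120.0), (2, "acme", 80.0), (3, "initech", 42.5)])

# Analytical-style query (batch/interactive SQL): scan and aggregate.
cur.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer")
totals = cur.fetchall()

# Operational-style access (point query on a key): single-row update and lookup.
cur.execute("UPDATE orders SET amount = amount + 10 WHERE order_id = ?", (3,))
cur.execute("SELECT amount FROM orders WHERE order_id = ?", (3,))
row = cur.fetchone()
conn.close()
```

The same declarative language serves both workloads; what differs is the data volume touched per query and the latency budget.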
In the NoSQL category, we typically find Apache HBase and M7 utilised, providing low-latency access to structured and semi-structured data at scale: billions of records and tens of thousands of concurrent users, with latency SLAs in the hundreds of milliseconds.
Towards 100% Operational
It's fair to say that we've come a long way, being able to cover most operational and analytical use cases with our MapR Platform. However, there is one work item left to address before we can offer 100% coverage, and that is transactions. In this context we consider a commit the unit of work, including rollback. There are several strands of work:
- In the context of Hive, the issue HIVE-5317 (Hive with full ACID support) aims to introduce this functionality to the Hadoop ecosystem.
- Both Apache Drill and Apache Spark have related plans in this direction.
- YARN offers a lot of flexibility in terms of scheduling and will serve as one piece of the puzzle to achieve transactional behavior.
- We are also working on extensions within our MapR Data Platform that enable us to address transactions.
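To make the "commit as the unit of work, including rollback" notion concrete, here is a small sketch of the semantics these efforts aim for. It again uses Python's sqlite3 purely as a stand-in (none of the Hadoop-ecosystem projects above are involved; the account table is a made-up example):

```python
import sqlite3

# sqlite3 stand-in to illustrate transactional semantics:
# all statements in a transaction take effect together, or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # Transfer funds; both updates must succeed or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 70 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 70 WHERE id = 2")
    raise RuntimeError("simulated failure mid-transaction")
    conn.commit()  # never reached in this run
except RuntimeError:
    conn.rollback()  # the whole unit of work is undone

# Both balances are back to their pre-transaction values.
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
conn.close()
```

Delivering exactly this guarantee over distributed, append-oriented storage is what makes the work items above hard, and why they span the query engines, YARN, and the storage layer.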
Last but not least, a purely functional view (being able to deal with transactions) is not sufficient in an enterprise setup. Any solution that offers this feature must also be scalable, reliable, and secure, guaranteeing business continuity and disaster recovery, ideally out of the box. You expect this from your enterprise database, so why would you expect less from a Hadoop solution?