Within the big data landscape, there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. The right choice depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.
Below is a discussion segmented by broad latency characteristics of each approach.
Technologies such as Hive are designed for batch queries on Hadoop. They provide a declarative abstraction layer (HiveQL) that uses the MapReduce processing framework in the background. Hive is used primarily for queries on very large data sets and for large ETL jobs. Queries can take anywhere from a few minutes to several hours, depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver interactive query capabilities in the future. MapR ships and supports Apache Hive today.
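To make the idea of a declarative layer over MapReduce concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce phases that a simple aggregation query implies. It is an illustration only, not how Hive actually plans or executes queries; the `rows` sample data, the `sales` table name in the comment, and the three function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table stored in Hadoop.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for record in records:
        yield record["region"], record["amount"]

def shuffle_phase(pairs):
    """Group all values that share a key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's values into a single aggregate."""
    return {key: sum(values) for key, values in grouped.items()}

# Conceptually what a query like
#   SELECT region, SUM(amount) FROM sales GROUP BY region
# asks for (the table name "sales" is an assumption for illustration).
totals = reduce_phase(shuffle_phase(map_phase(rows)))
print(totals)
```

The point of the declarative layer is that the analyst writes only the one-line query in the comment, and the engine is responsible for producing and scheduling the equivalent of the three phases.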
Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on Hadoop-scale datasets. Response times range from milliseconds to minutes, depending on query complexity. Users expect SQL-on-Hadoop technologies to support common BI tools, such as Tableau and MicroStrategy, for reporting and ad-hoc queries. MapR supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available in Q2 2014.
In-Memory SQL and Streaming
In-memory computing has enabled new ecosystem projects such as Apache Storm and Apache Spark to further accelerate stream and query processing, respectively. Shark is a new project that also uses in-memory computing while retaining full Hive compatibility, providing queries up to 100x faster than Hive. MapR customers are using Storm and Shark on Spark with the MapR Distribution of Hadoop today.
Batch and interactive queries are used by business teams for decision making and run as read-only operations on large datasets (OLAP). Point queries, by contrast, are typically issued by OLTP and web applications, operate over smaller datasets, and usually include inserts, updates, and deletes. The expected latency is usually very low (e.g., milliseconds) because of the high volume of requests from these applications. MapR ships and supports operational SQL capabilities with both Apache HBase and MapR M7 Enterprise Database Edition.
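The difference between the two workload shapes can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in here so the example runs anywhere; it is not a Hadoop component, and the `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "globex", 80.0), (3, "acme", 40.0)],
)

# Analytic (OLAP-style) query: read-only, scans and aggregates many rows.
analytic = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Point (OLTP-style) operations: each one touches a single row by key.
conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (95.0, 2))
conn.execute("DELETE FROM orders WHERE id = ?", (3,))
point = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(analytic, point, remaining)
```

The analytic query reads every row once and never writes, while each point operation is keyed on `id` and modifies or reads exactly one row, which is why the two workloads place such different latency demands on the underlying store.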
Interactive SQL-on-Hadoop Technology Landscape
SQL technologies complement traditional data warehouse and analytical environments for:
Technologies and approaches for interactive SQL vary and include (but are not limited to)
Key Considerations for SQL-on-Hadoop Approaches
For organizations with existing skills in SQL and investments in business intelligence (BI) tools, ANSI SQL completeness is key for easy adoption and reuse.
The protocols and interfaces that clients use to access the SQL engine differ across SQL-on-Hadoop approaches and are worth considering, depending on your preferences and use case.
SQL-on-Hadoop technologies handle metadata in different ways. Metadata provides a mechanism to capture and maintain information about where data is stored, how it is structured, and more.
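As a rough illustration of what such metadata holds, the sketch below models a tiny in-process catalog in Python. It does not reflect the API of any particular SQL-on-Hadoop engine or metastore; `TableMetadata`, `register`, `describe`, and the `web_logs` entry are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """What a catalog might record about one table (hypothetical shape)."""
    name: str                # logical table name used in queries
    location: str            # where the data is stored
    file_format: str         # how the bytes on disk are encoded
    columns: dict = field(default_factory=dict)  # column name -> type

# Simple in-memory catalog keyed by table name.
catalog = {}

def register(table):
    """Add or replace a table definition in the catalog."""
    catalog[table.name] = table

def describe(name):
    """Return a one-line summary of a registered table."""
    table = catalog[name]
    cols = ", ".join(f"{col} {typ}" for col, typ in table.columns.items())
    return f"{table.name} ({cols}) stored as {table.file_format} at {table.location}"

register(
    TableMetadata(
        name="web_logs",
        location="/data/web_logs",
        file_format="text",
        columns={"ts": "timestamp", "url": "string"},
    )
)
summary = describe("web_logs")
print(summary)
```

Whether an engine requires this information to be declared up front in a central store or can discover it at query time is one of the differences to weigh between approaches.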
Harnessing the value of big data requires the ability to ingest and analyze multiple data types. Also, as companies mature in their use of Hadoop, they use tables (Apache HBase) and files (HDFS) for storing different data types. Consider how each SQL-on-Hadoop approach supports queries on different Hadoop ecosystem sources and data types.
In addition, each SQL-on-Hadoop technology supports a different set of file formats. This affects interoperability and the ease of switching between approaches as these technologies mature, which is another key consideration when deciding which to use.