Apache Hive™

Apache Hive is an open source Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, and then perform batch SQL-like queries on that data.

Queries are written in a SQL-like language called HiveQL, which Hive translates into MapReduce jobs that are executed on the Hadoop cluster. More complex queries are supported through User Defined Functions (UDFs), which can be written in Java and referenced by a HiveQL query.
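To make the translation step more concrete, here is a minimal Python sketch of how a grouped count, such as a HiveQL query of the form SELECT page, COUNT(*) FROM visits GROUP BY page, could be broken into map, shuffle, and reduce phases. It is a conceptual illustration only, not Hive's actual query planner. The visits table, its columns, and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for records of a "visits" table.
ROWS = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/docs"},
    {"user": "alice", "page": "/docs"},
    {"user": "carol", "page": "/home"},
    {"user": "bob", "page": "/docs"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values; summing the 1s implements COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in order over the given rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

On a real cluster the map and reduce phases run in parallel across many nodes, which is why even a simple query carries the startup cost of a batch job.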

Structure is applied to data when it is read, so users don’t need to format the data when they store it in their Hadoop cluster. Data can be read from a variety of formats, from unstructured flat files with comma- or space-separated text, to semi-structured JSON files, to structured HBase tables.
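The idea of applying structure at read time can be sketched in a few lines of Python. The example below is an illustration under stated assumptions rather than a description of how Hive does it: the schema, the column names, and the sample data are all hypothetical. The same schema is applied to comma-separated text and to JSON lines only when the data is read.

```python
import csv
import io
import json

# Hypothetical schema: column names paired with the type to apply at read time.
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]

# Raw data as it might sit in storage, with no types or column names attached.
RAW_CSV = "1,US,19.99\n2,DE,5.00\n"
RAW_JSON_LINES = '{"user_id": "3", "country": "FR", "amount": "7.5"}\n'


def apply_schema(values):
    """Cast a sequence of raw values to the schema's column names and types."""
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, values)}


def read_csv(text):
    """Interpret comma-separated text using the schema, one dict per line."""
    return [apply_schema(fields) for fields in csv.reader(io.StringIO(text))]


def read_json_lines(text):
    """Interpret JSON lines with the same schema, picking fields by name."""
    rows = []
    for line in text.splitlines():
        record = json.loads(line)
        rows.append(apply_schema([record[name] for name, _ in SCHEMA]))
    return rows


if __name__ == "__main__":
    print(read_csv(RAW_CSV))
    print(read_json_lines(RAW_JSON_LINES))
```

Because the stored text is never rewritten, a different schema could later be applied to the same files without reloading them.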

External access to Hive tables is supported through a component called HiveServer2. Using an ODBC driver, many SQL-based applications can interact with HiveServer2 and treat Hadoop data like a database. Applications take advantage of this capability for a variety of use cases, including:

  • Data Mining
  • Ad-hoc Analysis
  • Business Intelligence
  • Data Visualization

Related Projects

Since its introduction in 2009, Hive has become popular because of its ease of use and its compatibility with existing business applications through ODBC. However, because Hive has historically run on MapReduce, it has been limited to long-running batch operations. Several efforts have emerged to improve the performance of Hive, and MapR either supports or has a roadmap to support each.

  • Hive on Tez: With the introduction of YARN as an independent resource manager, Tez has emerged as a complementary high-performance execution engine. Hive is in the process of being modified to run on Tez, allowing queries to run significantly faster.
  • Shark: The Shark project adds functionality to Hive to allow it to run on top of the Spark execution engine, which optimizes workflows and offers in-memory processing to improve performance significantly.
  • Impala: While technically a different component than Hive, Impala leverages Hive’s query language (HiveQL) and metadata to bring interactive SQL to Hadoop.
  • Drill: Apache Drill offers ANSI SQL rather than HiveQL, but it will also be able to use the metadata in the Hive metastore for querying. This is in addition to querying nested data with dynamic schemas.