Shark is an open source project that uses the Apache Spark execution engine to accelerate SQL-like queries over data stored in Hadoop. Shark reuses Hive's query language, its metadata, and its interfaces, so, like Hive, it offers a simple way to apply structure to large amounts of unstructured data and then run batch SQL-like queries against that data. More information about Hive is available here.
Like Hive queries, Shark queries are written in a SQL-like language called HiveQL, which Shark translates into Spark jobs, expressed as Directed Acyclic Graphs (DAGs), that are executed on the Hadoop cluster. More complex logic is supported through User Defined Functions (UDFs), which can be written in Java and referenced from a HiveQL query.
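As a sketch of the UDF workflow, a Java UDF compiled into a jar can be registered and invoked from HiveQL as shown below; the jar path, class name, function name, and table are hypothetical:

```sql
-- Make the compiled UDF jar available to the session (path is hypothetical)
ADD JAR /tmp/my-udfs.jar;

-- Register the Java class (one that extends Hive's UDF base class)
-- under a name usable in queries
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.udf.ToUpper';

-- Use it like any built-in function
SELECT to_upper(name) FROM customers;
```

Because Shark is compatible with HiveQL, existing Hive UDF jars can be registered this way without changes.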
In addition to full compatibility with Hive, Shark brings the following new features and benefits:
- Speed: Like Spark, Shark allows data sets to be held in memory. When tables are created, users can use a simple Shark API to indicate that a table should be held in memory. Queries against in-memory tables run up to 100x faster than standard Hive queries. Even for on-disk tables, latency is much lower, sometimes by 10x, since queries no longer have to be converted into MapReduce jobs and intermediate data is held in memory rather than written to disk.
- Spark Integration: Shark allows Hive tables to be queried from within a Spark job, so query results can be combined with logic from Spark libraries such as MLlib.
- Scalability: Shark is great for quickly returning results for simple queries. However, sometimes users need to do batch-style processing, like executing a complex query on a multi-petabyte table. Shark is ideal for these jobs as well, supporting mid-query fault tolerance to ensure the whole job completes as quickly as possible even if nodes fail.
- Hive Compatibility: Shark is fully compatible with Hive and HiveQL, making migration to a faster execution engine simple.
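The in-memory table feature described above is exposed through HiveQL itself: Shark caches a table when it carries the `shark.cache` table property (early releases also cached any table whose name ends in `_cached`). A minimal sketch, with hypothetical table names:

```sql
-- Create an in-memory copy of an existing table; the shark.cache
-- property tells Shark to keep the table's data in cluster memory
CREATE TABLE logs_cached TBLPROPERTIES ("shark.cache" = "true")
AS SELECT * FROM logs;

-- Subsequent queries against logs_cached are served from memory
SELECT status, COUNT(*) FROM logs_cached GROUP BY status;
```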
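For the Spark-integration point, Shark exposes query results as ordinary Spark RDDs, which is what lets them flow into libraries like MLlib. A hedged Scala sketch, assuming a `SharkContext` named `sc` and a hypothetical `users` table; `sql2rdd` is Shark's bridge from HiveQL into Spark, and the row accessor names here are assumptions:

```scala
// Run a HiveQL query and get back an RDD of rows instead of printed results
// (table and column names are hypothetical)
val rows = sc.sql2rdd("SELECT age, income FROM users")

// The result is an ordinary Spark RDD, so it composes with Spark code,
// e.g. mapping each row into a numeric feature pair for further processing
val features = rows.map(row => (row.getDouble(0), row.getDouble(1)))
```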