Hands-on Hive-on-Spark in the AWS Cloud

Nearly one year ago the Hadoop community began to embrace Apache Spark as a powerful batch-processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this trend, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292, which is one of the most popular JIRAs in the Hadoop ecosystem. Three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.

Since that demo, we have made tremendous progress: we have finished Map Join and Bucket Map Join, integrated with HiveServer2, and, importantly, integrated our Spark Client (aka Remote Spark Context, or RSC). The Remote Spark Context is important because it is not possible to have multiple SparkContexts within a single process. The RSC API allows us to run the SparkContext on the server in a container while utilizing the Spark API on the client, in this case HiveServer2.

Many users have proactively started using the Spark branch and providing feedback. Today, we'd like to offer you the first chance to try Hive on Spark yourself. As this work is under active development, we do not recommend that most users attempt to run this code outside of the packaged Amazon Machine Image (AMI) provided. The AMI ami-35ffed70 (named hos-demo-4) is available in us-west-1, and we recommend an instance type of m3.large or larger.
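
If you prefer to launch the AMI from the command line, a minimal sketch using the AWS CLI follows. The region, AMI ID, and instance type come from the paragraph above; the key pair and security group names are placeholders you would replace with your own.

    # Launch the hos-demo-4 AMI in us-west-1 on an m3.large instance.
    # --key-name and --security-groups are placeholders for your own values.
    aws ec2 run-instances \
        --region us-west-1 \
        --image-id ami-35ffed70 \
        --instance-type m3.large \
        --count 1 \
        --key-name YOUR_KEY_PAIR \
        --security-groups YOUR_SECURITY_GROUP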

Once logged in as ubuntu, change to the hive user (e.g. sudo su - hive) and you will be greeted with instructions on how to start Hive on Spark. Pre-loaded on the AMI are a small TPC-DS dataset and some sample queries. Users are strongly encouraged to load their own sample datasets and try their own queries. We hope not only to showcase our progress on Hive on Spark but also to find areas for improvement early. If you find any issues, please email hos-ami@cloudera.org and the cross-vendor team will do its best to investigate.
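
To make that first session concrete, here is a rough sketch of connecting to the instance and running a query with Spark as the execution engine. The host name, key file, and the store_sales table name are illustrative assumptions; the actual pre-loaded TPC-DS tables and sample queries on the AMI may differ.

    # Connect to the instance (host name and key file are placeholders).
    ssh -i YOUR_KEY_PAIR.pem ubuntu@YOUR_INSTANCE_PUBLIC_DNS

    # On the instance, switch to the hive user as described above.
    sudo su - hive

    # Select Spark as the execution engine and run a query; store_sales is an
    # illustrative TPC-DS table name, not necessarily what ships on the AMI.
    hive -e "set hive.execution.engine=spark; SELECT count(*) FROM store_sales;"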

Despite spanning the globe, the cross-company engineering team has grown close. We would like to thank our employers for sponsoring this project: MapR, Intel, IBM, and Cloudera.

Byline: Rui Li (Intel), Na Yang (MapR), and Brock Noland (Cloudera)

  • Na Yang is a staff software engineer at MapR and a contributor to Hive.
  • Brock Noland is an engineering manager at Cloudera and a Hive PMC member.
  • Rui Li is a software engineer at Intel and a contributor to Hive.
