This blog post was jointly written by Cloudera (Alex Gutow), Intel (Weihua Jiang), and MapR (Nitin Bandugula) – all companies that are part of the Hive-on-Spark Team.
Apache Spark is one of the most popular tools in the Apache Hadoop ecosystem, and there’s been a lot of noise made about it – for good reason. It complements the existing Hadoop ecosystem by adding easy-to-use APIs and data-pipelining capabilities to Hadoop data, and support for the project continues to grow. Since its launch in 2010, Spark has seen over 400 contributors from more than 50 different companies.
This true community effort has secured Spark’s place as an open standard within Hadoop. Its robust engineering focus, quality, and popularity have ensured its portability, with support from all the major Hadoop vendors. Its production use has also led leading software companies to develop and certify Spark applications – opening up Spark to more use cases and users.
One of the most exciting projects around Spark is the community effort to improve batch processing by using Spark as the execution backend. As a powerful batch processing engine, Spark will not only improve the performance of several popular projects such as Apache Hive, Apache Pig, and Apache Sqoop, but will also drive standardization as an execution backend – making management and development more efficient. Back in July, Cloudera, Databricks, IBM, Intel, and MapR announced an industry-wide collaboration to port open source MapReduce-based tools to run on Spark. Since the initial announcement, there has been a lot of progress towards making this a reality. Here’s a look at what’s been accomplished since:
- Apache Crunch 0.11 was released with a SparkPipeline, making it easy to migrate data processing applications from MapReduce to Spark.
- Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
- Sigmoid Analytics has been driving the development of Pig on Spark, successfully passing 100% of the end-to-end test cases on Pig. They are currently working to merge their work with Pig for availability in an upstream release in the near future.
- Another open standard, Apache Solr, added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving searchable complex data. We also expect to see a Solr-on-Spark solution in the near future.
- The first demo of Hive on Spark is available, the result of a strong community effort with over 140 commits to the main project.
- Based on joint work from Cloudera, Intel, and MapR, the first Hive-on-Spark AMI is now available on Amazon. The VM lets you quickly try out Spark in conjunction with one of the most widely used Hadoop tools.
Also Coming Soon
- Work is also being done (details forthcoming) to provide seamless integration between Spark and Apache HBase, for example for use cases that involve massive operations on tree/DAG/graph structures stored in HBase.
Spark has come a long way at an impressive rate, thanks to the community rallying behind it as an open standard in Hadoop. With such robust developer support, we expect to see continued advancements around Spark, especially as it continues to progress as a standard execution engine for key workloads.