MapR Integrates the Complete Apache Spark Stack

This was originally posted on the Databricks blog

With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution. These customers range from the world’s largest cable telcos and retailers to Silicon Valley startups such as Quantifind, which recently talked about its use of Spark on MapR in an interview with Stefan Groschupf, CEO of Datameer.

Today, I am happy to announce and share with you the beginning of our journey with Databricks, and the addition of the complete Spark stack to the MapR Distribution for Apache Hadoop. We are now the only Hadoop distribution to support the complete Spark stack, including Spark, Spark Streaming (stream processing), Shark (Hive on Spark), MLLib (machine learning) and GraphX (graph processing). This is a testament to our commitment to open source and to providing our customers with maximum flexibility to pick and choose the right tool for the job.

Why Spark?

One of the challenges organizations face when adopting Hadoop is a shortage of developers who have experience building Hadoop applications. Our professional services organization has helped dozens of companies with the development and deployment of Hadoop applications, and our training department has trained countless engineers. Organizations are hungry for solutions that make it easier to develop Hadoop applications while increasing developer productivity, and Spark fits this bill. Spark jobs can require as little as 1/5th the number of lines of code. Spark provides a simple programming abstraction allowing developers to design applications as operations on data collections (known as RDDs, or Resilient Distributed Datasets). Developers can build these applications in multiple programming languages, including Java, Scala and Python, and the same code can be reused across batch, interactive and streaming applications.

In addition to making developers happier and more productive, Spark provides significant benefits with respect to end-to-end application performance. To this end, Spark provides a general-purpose execution framework with in-memory pipelining. For many applications, this results in a 5-100x performance improvement, because some or all steps can execute in memory without unnecessarily writing to and reading from disk. The performance advantage of the Spark engine, combined with the industry-leading performance of the MapR Distribution, provides customers with the highest-performance platform for big data applications. (Learn more about Spark here.) 

Why Databricks?

Databricks was founded by the creators of Apache Spark, and is currently the driving force behind the project. When we decided to add the Spark stack to our distribution and double down on our involvement in the Spark community, a strategic partnership with Databricks was a no-brainer. This partnership will benefit MapR customers who are interested in 24x7 support for Spark or any of the other projects in the stack, including Spark Streaming, Shark, MLLib and GraphX (with several other projects coming soon). In addition, MapR will be working closely with Databricks to drive the Spark roadmap and accelerate the development of new features, benefiting both MapR customers and the broader community.

We are very excited about the upcoming Apache Spark 1.0 release, expected later this month. We are looking forward to a great journey with Databricks and the other members of the Spark community.

Register for an upcoming joint webinar to learn more about the benefits of the complete Spark stack on MapR.


Streaming Data Architecture:

New Designs Using Apache Kafka and MapR Streams




Download for free