We are excited to announce that Spark 1.5.2 is here and is part of the MapR Converged Data Platform. In this blog post, I’ll share a few details on some of the latest capabilities in Spark. If you’re a data engineer, data scientist, or application developer, Spark 1.5.2 has new capabilities that you should take advantage of.
To get the latest and greatest documentation on installing, upgrading, configuring and using Spark with MapR, please check out our Apache Spark documentation. If you’re new to Spark, the MapR Sandbox provides the easiest way to get started with Spark.
Backpressure support for Spark Streaming is new in Spark 1.5. This feature controls the ingestion rate automatically and dynamically, so that a streaming application can cope with bursty input streams. Data pipelines built on Spark Streaming can therefore adapt to changes in ingestion rates. This works with receivers as well as with the direct Kafka approach.
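To make the idea of dynamic rate control more concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the `RateEstimator` class, its parameters, and the default gains are hypothetical and chosen for illustration, loosely following a PID-style controller. In Spark itself, backpressure is switched on with the `spark.streaming.backpressure.enabled` configuration property, and the rate limit is then managed for you.

```python
from dataclasses import dataclass


@dataclass
class RateEstimator:
    """Toy PID-style ingestion rate estimator (illustrative only, not Spark code)."""

    batch_interval_s: float      # length of one streaming batch, in seconds
    proportional: float = 1.0    # weight of the current error (assumed value)
    integral: float = 0.2        # weight of the accumulated backlog (assumed value)
    min_rate: float = 100.0      # never throttle below this many records/second
    latest_rate: float = -1.0    # current limit; negative means "no estimate yet"

    def update(self, num_records: int, processing_s: float,
               scheduling_delay_s: float) -> float:
        """Return a new ingestion limit (records/second) after a finished batch."""
        # Rate at which the last batch was actually processed.
        processing_rate = num_records / processing_s

        if self.latest_rate < 0:
            # First batch: adopt the observed processing rate as the limit.
            self.latest_rate = max(processing_rate, self.min_rate)
            return self.latest_rate

        # Proportional term: how far the current limit exceeds what we can process.
        error = self.latest_rate - processing_rate
        # Integral term: backlog that built up while batches waited to be scheduled,
        # expressed as a rate over one batch interval.
        backlog = scheduling_delay_s * processing_rate / self.batch_interval_s

        new_rate = (self.latest_rate
                    - self.proportional * error
                    - self.integral * backlog)
        self.latest_rate = max(new_rate, self.min_rate)
        return self.latest_rate


estimator = RateEstimator(batch_interval_s=1.0)
first = estimator.update(num_records=1000, processing_s=1.0, scheduling_delay_s=0.0)
# A burst slows processing down and batches start to queue: the limit drops.
throttled = estimator.update(num_records=1000, processing_s=2.0, scheduling_delay_s=1.0)
# Processing catches up again: the limit recovers.
recovered = estimator.update(num_records=400, processing_s=0.4, scheduling_delay_s=0.0)
print(first, throttled, recovered)
```

The point of the sketch is the feedback loop: each completed batch reports how fast it was processed and how long it waited, and the ingestion limit for the next batch is adjusted accordingly.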
For those of you who didn’t catch the news, MapR announced a new product called MapR Streams. MapR Streams is a global publish-subscribe event streaming system that connects data producers and consumers globally. MapR Streams uses the same API as Kafka for publish and subscribe, and can serve as the ingest, transport, and buffering layer for Spark Streaming. This can enable real-time operations such as calculations and aggregations on data as it’s delivered. Integration of Spark Streaming with MapR Streams will be generally available in early 2016, so be sure to watch for that.
SparkR and MLlib
Apache Spark for MapR includes support for SparkR. This enables data scientists to run large-scale data analysis from the R shell. SparkR first came out with Spark 1.4 as an alpha release, and MapR has taken the time to test and integrate this completely. SparkR offers several benefits:
SparkR users can now use MLlib to fit machine learning linear models to large-scale datasets, using the same syntax as in R’s lm/glm/glmnet.
Modeling approaches such as linear regression and logistic regression can now be implemented using SparkR.
SparkR DataFrames now have functions that allow them to better resemble local R data frames.
New algorithms such as the multilayer perceptron classifier and PrefixSpan for sequential pattern mining can now be run on Spark 1.5, and existing algorithms such as ensembles and linear trees have been updated. Another important and exciting enhancement is the ability to get model summary statistics for linear and logistic regressions. After importing data, the first thing to do is usually to get a bird’s-eye view of it, and summary statistics let data scientists communicate large amounts of information concisely. R provides extensive functionality for inspecting a model and its results, and this is now available with SparkR.
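As a rough illustration of what model summary statistics convey, the sketch below fits a one-feature linear regression by ordinary least squares in plain Python and reports two common summary values, R² and root mean squared error. It does not use Spark or SparkR; the function name `fit_and_summarize` and the returned keys are hypothetical, and the sample data is made up for the example.

```python
import math
from typing import Dict, Sequence


def fit_and_summarize(xs: Sequence[float], ys: Sequence[float]) -> Dict[str, float]:
    """Fit y = intercept + slope * x by least squares and return summary statistics."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more (x, y) pairs of equal length")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must each contain at least two distinct values")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares of the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1.0 - sse / sst,        # share of variance explained by the model
        "rmse": math.sqrt(sse / n),   # typical size of a prediction error
    }


# Made-up sample data, for illustration only.
summary = fit_and_summarize([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A handful of numbers like these summarizes how well the model fits without inspecting every prediction, which is the kind of concise view the summary statistics are meant to provide.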
Some of the more significant enhancements in Spark 1.5 are centered around improved metrics, reporting, and visualization of query plans and SQL. The three key enhancements are:
Plan visualization for DataFrames/SQL
Web UI for displaying metrics for runtime memory usage
Web UI pagination for jobs with large numbers of tasks
Spark 1.5.2 also delivers on the first components of Project Tungsten, which provides enhancements in memory management and binary processing, cache-aware computation, and code generation.
If you didn’t know, MapR was the first Hadoop distribution to support the entire Spark stack. This is important, as MapR customers were the first to recognize the broad implications of Spark as a new computation engine and the role it would play in production applications. Being the first to support the entire Spark stack means that MapR has the most experience among Hadoop providers in supporting all of the different components, including the core Spark engine, Spark SQL, Spark Streaming, GraphX, and MLlib.
The best testimonials are customers who are running mission-critical Spark applications on the MapR Converged Data Platform. Use cases include predictive analytics for telecommunication service providers, customer analytics for retail and banking, and drug discovery in the pharmaceutical industry. MapR also has Quick Start Solutions for use cases such as security log analytics and time-series analytics that allow you to quickly get up and running with Spark.
Congrats to the Spark community on the 1.5.2 release! We are looking forward to additional capabilities coming in 2016. In the meantime, we encourage you to start exploring the benefits of running Spark on the MapR Converged Data Platform.
If you have any questions regarding Spark 1.5.2, please ask them in the comments section below.