Apache Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and why you should care. Whether you’re a data engineer, a data scientist, or an application developer, Spark 1.6 has new capabilities you can test and prove out on the MapR Platform before going into production.
To get the latest and greatest documentation on installing, upgrading, configuring, and using Spark 1.6 Developer Preview with MapR, please check out our Apache Spark documentation. If you’re new to Spark, the MapR Sandbox provides the easiest way to get started with Spark.
Memory usage in Spark falls under two broad categories: execution memory and storage memory. Historically, Spark users processing large datasets have had to tune memory settings by hand for each workload. The downside of this approach is that the memory split stays fixed while workloads change with the amount of data being processed, so memory is used inefficiently.
With automatic memory management in Spark 1.6, the sizes of these two memory regions adjust dynamically based on workload characteristics. Execution memory can now borrow free memory from the storage region and vice versa. Execution can also evict cached data to reclaim memory from storage, as long as storage memory does not fall below a certain threshold. The benefits include performance gains from avoiding unnecessary disk spills and better use of available resources, without manual tuning of memory configurations. Together, these make Spark applications easier to develop and operate.
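To make the borrowing rule concrete, here is a minimal sketch in plain Scala that needs no Spark dependency. It is a simplified model, not Spark's implementation. The constants are assumptions taken from the Spark 1.6 configuration defaults (`spark.memory.fraction` of 0.75, `spark.memory.storageFraction` of 0.5, and 300 MB reserved), and the object and function names are hypothetical.

```scala
// Simplified model of unified memory sizing; an illustration, not Spark's implementation.
object UnifiedMemorySketch {
  // Assumption from the Spark 1.6 configuration docs: 300 MB is set aside before the split.
  val ReservedBytes: Long = 300L * 1024 * 1024

  /** Shared execution + storage region: (heap - reserved) * spark.memory.fraction. */
  def unifiedRegion(heapBytes: Long, memoryFraction: Double = 0.75): Long =
    ((heapBytes - ReservedBytes) * memoryFraction).toLong

  /** Storage threshold below which cached blocks are not evicted
    * (unified region * spark.memory.storageFraction). */
  def protectedStorage(unifiedBytes: Long, storageFraction: Double = 0.5): Long =
    (unifiedBytes * storageFraction).toLong

  /** Upper bound on additional memory execution could obtain: free space plus
    * whatever storage currently holds above its protected threshold. */
  def maxExecutionGrant(unifiedBytes: Long, storageUsed: Long,
                        executionUsed: Long, protectedBytes: Long): Long = {
    val free = unifiedBytes - storageUsed - executionUsed
    val evictable = math.max(0L, storageUsed - protectedBytes)
    free + evictable
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val unified = unifiedRegion(4096 * mb) // hypothetical 4 GB executor heap
    val protectedBytes = protectedStorage(unified)
    // Hypothetical state: storage has grown to 2000 MB while execution uses 500 MB.
    val grant = maxExecutionGrant(unified, 2000 * mb, 500 * mb, protectedBytes)
    println(s"unified=${unified / mb} MB, protected=${protectedBytes / mb} MB, " +
      s"execution can still obtain ${grant / mb} MB")
  }
}
```

If you do need to override the defaults, the two properties named above can be set on the Spark configuration like any other setting, though the point of the new model is that most workloads should not need it.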
Spark 1.6 introduces a new experimental interface called the Dataset API, an extension of the DataFrames API. Datasets can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java, with Python support to be added in future releases. Datasets are similar to RDDs, but instead of Kryo or Java serialization they use encoders to convert between JVM objects and a tabular representation.
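As a rough illustration of what this looks like in Scala, the sketch below builds a Dataset from case-class instances and applies typed `filter` and `map` transformations. It assumes Spark 1.6 on the classpath and a local master; the `Person` type, the `DatasetSketch` object, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; defined at the top level so an encoder can be derived for it.
case class Person(name: String, age: Int)

object DatasetSketch {
  /** Returns the names of people aged 18 or over, using typed Dataset transformations. */
  def adultNames(sqlContext: SQLContext, people: Seq[Person]): Array[String] = {
    import sqlContext.implicits._ // brings toDS() and the implicit encoders into scope
    val ds = people.toDS()        // Dataset[Person] built from JVM objects
    ds.filter(_.age >= 18)        // the lambda is checked against Person at compile time
      .map(_.name)                // Dataset[String]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dataset-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val people = Seq(Person("Ann", 34), Person("Bo", 12), Person("Cy", 51))
      adultNames(sqlContext, people).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the lambdas operate on `Person` rather than on untyped rows, a misspelled field name is caught by the compiler instead of failing at runtime.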
Machine Learning Pipeline Persistence
Spark 1.6 adds new machine learning features that take persistence beyond individual models to the entire pipeline, including transformers and estimators. You can now save and reload a complete workflow, the pipeline along with its fitted models, without writing custom export or import code.
As a data scientist, you can now train your models and pipelines using historical batch data and export the same model and pipeline to production. You can train a pipeline in a nightly job, and then apply it to production data in a production job. This feature is currently supported in Scala, with other language support to be added in future releases.
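The train-then-reuse flow described above might look like the following sketch. It assumes Spark 1.6 with the `spark.ml` package on the classpath and a local master; the toy training data, the object name, and the save path are hypothetical, and the stages used (Tokenizer, HashingTF, LogisticRegression) are only one possible pipeline.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext

object PipelinePersistenceSketch {
  /** Fits a small text-classification pipeline, saves it to `path`, and loads it back. */
  def trainSaveLoad(sqlContext: SQLContext, path: String): PipelineModel = {
    // Toy training data standing in for historical batch data.
    val training = sqlContext.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // "Nightly job": fit the whole pipeline and persist the fitted result.
    val model = pipeline.fit(training)
    model.save(path) // fails if the path already exists

    // "Production job": load the fitted pipeline back without retraining.
    PipelineModel.load(path)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pipeline-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      // Hypothetical location: a fresh subdirectory of a temporary directory.
      val path = Files.createTempDirectory("pipeline-sketch").resolve("model").toString
      val loaded = trainSaveLoad(sqlContext, path)
      println(s"Loaded pipeline with ${loaded.stages.length} stages")
    } finally {
      sc.stop()
    }
  }
}
```

In practice the save and load calls would live in separate jobs, with the path pointing at shared storage that both jobs can reach.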
Several new machine learning algorithms have also been added in Spark 1.6. These include online hypothesis testing, which enables A/B testing in Spark Streaming; bisecting K-Means, a fast top-down variant of K-Means clustering; and a normal-equation solver for least squares, which provides R-like model summary statistics. The key takeaway from all of these machine learning enhancements is that you can now run more algorithms using Spark and be more productive with pipeline persistence.
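As one example, bisecting K-Means can be tried on a handful of points. This is a minimal sketch assuming the `org.apache.spark.mllib.clustering.BisectingKMeans` class that ships with Spark 1.6 and a local master; the object name, the seed, and the sample points are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object BisectingKMeansSketch {
  /** Clusters the points top-down into k groups and returns each point's cluster index. */
  def clusterAssignments(sc: SparkContext, points: Seq[Vector], k: Int): Seq[Int] = {
    val data = sc.parallelize(points).cache() // the algorithm makes several passes over the data
    val model = new BisectingKMeans()
      .setK(k)
      .setSeed(1L) // arbitrary fixed seed so repeated runs behave the same
      .run(data)
    points.map(p => model.predict(p))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("bisecting-kmeans-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Two well-separated groups of toy points.
      val points = Seq(
        Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
        Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1))
      clusterAssignments(sc, points, k = 2).zip(points).foreach {
        case (cluster, point) => println(s"$point -> cluster $cluster")
      }
    } finally {
      sc.stop()
    }
  }
}
```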
The best testimonials come from customers who are running mission-critical Spark applications on the MapR Converged Data Platform. Use cases include predictive analytics for telecommunication service providers, customer analytics for retail and banking, and drug discovery in the pharmaceutical industry. MapR also offers Quick Start Solutions for use cases such as security log analytics and time-series analytics, which help you get up and running with Spark quickly.
Congrats to the Spark community on the 1.6 release! We are looking forward to additional capabilities coming in future releases. In the meantime, we encourage you to start testing Spark 1.6 on the MapR Converged Data Platform.
If you have any questions regarding the Spark 1.6 Developer Preview, please ask them in the comments section below.