Apache Spark, a powerful general-purpose engine for processing large amounts of data, has seen rapid adoption since its release. Recognizing its impact very early on, MapR has supported and invested in Spark as part of our Hadoop distribution so that enterprises can build applications with Spark and deploy them reliably in production. On June 6, 2016, we announced a separate distribution for the complete Spark stack: the MapR Platform including Apache Spark. Customers can now use this Spark-focused distribution to accelerate their big data projects.
- What is in the MapR Platform including Spark?
- How is it different from the MapR Platform including Hadoop?
- What if I still need some of the Hadoop tools?
- What if I need NoSQL or Event Streaming?
- What are some of the use cases I can develop with the MapR Platform including Spark?
- Get Started with MapR Platform including Spark
What is in the MapR Platform including Spark?
Enterprises across different industries have started using Spark as a unified computing engine for many of their critical use cases. The MapR Platform including Spark addresses big data challenges and requirements with the following components:
Full Stack Spark: The complete Spark computing engine, which includes:
- Spark Core - For faster batch analytics and extract/transform/load (ETL)
- Spark MLlib - For advanced predictive analytics
- Spark SQL - For SQL analytics on structured data
- Spark Streaming - For real-time ETL and analytics
- GraphX - For iterative graph computations and analytics
- SparkR - For large-scale data analytics from the R shell
Data Management Platform: A reliable, enterprise-grade converged data management platform that eliminates silos and minimizes data movement, thereby accelerating the insight-to-action cycle. The MapR Converged Data Platform includes a web-scale storage platform, scalable NoSQL, and event streaming, and also provides the following features:
- System-wide high availability - the system can remain up and running despite multiple unforeseen failures, avoiding unplanned downtime or service disruption.
- Mission-critical disaster recovery - the system can return to operating status after a site-wide disaster. Disaster recovery (DR) enables business continuity after significant data center failures that high availability features cannot cover.
- Consistent snapshots - the state of your data can be captured at an exact point in time and used to fully recover data lost due to application or user error.
- Multi-tenancy - distinct tenants can be isolated, in terms of both the data stored in the platform and the compute resources they use.
Workflow Management: A scheduling system that manages large numbers of Spark jobs as directed acyclic graphs (DAGs) of actions triggered by time and data availability.
Quick Start Solutions (QSS): A set of purpose-built solutions that allow you to jump-start your most valuable and critical use cases for Spark.
Notebook*: An intuitive web-based notebook that data analysts and scientists can use to perform interactive analytics and visualization.
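The workflow management component above describes Spark jobs arranged as directed acyclic graphs and triggered by time and data availability. The following minimal Python sketch illustrates that idea using only the standard library. It is not the scheduler shipped with the MapR Platform, and the job names, the ready_to_start helper, and the hour-based trigger are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "train_model": {"clean_logs"},
    "build_dashboard": {"clean_logs"},
    "publish_report": {"train_model", "build_dashboard"},
}


def ready_to_start(scheduled_hour, current_hour, required_inputs, available_inputs):
    """Trigger check: the scheduled time has passed and every input dataset exists."""
    return current_hour >= scheduled_hour and required_inputs <= available_inputs


def execution_order(dag):
    """Return one valid order in which every job runs after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


def run_workflow(dag, run_job):
    """Run each job in dependency order; run_job is a caller-supplied callable."""
    for job in execution_order(dag):
        run_job(job)


if __name__ == "__main__":
    if ready_to_start(2, 3, {"raw_logs"}, {"raw_logs", "reference_data"}):
        run_workflow(workflow, lambda job: print("running", job))
```

In a real deployment, run_job would submit a Spark application rather than print a name, and the trigger check would query the file system for the expected input data.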
How is it different from the MapR Platform including Hadoop?
Spark has received a lot of attention lately because of its easy programming paradigm and faster performance as compared to MapReduce in Hadoop. It also has a growing ecosystem of projects, which lets it handle a wider range of big data workloads.
With the emergence of Spark as a unified computing engine, developers can perform ETL and advanced analytics in both continuous (streaming) and batch mode, either programmatically (using Scala, Java, Python, or R) or with SQL (using Spark SQL or HiveQL).
With MapR converging the data management platform, you can now take a Spark-first approach. This differs from the traditional approach of starting with extended Hadoop tools and then adding Spark as part of your big data technology stack. As a unified computing engine, Spark can be used for faster batch ETL and analytics (with Spark Core instead of MapReduce and Hive), machine learning (with Spark MLlib instead of Mahout), and streaming ETL and analytics (with Spark Streaming instead of Storm).
What if I still need some of the Hadoop tools?
Enterprises that want to use both Hadoop tools and Spark can get both. With this new distribution, we simply give customers the choice to start with Spark. You can add Hadoop tools on top of Spark for a very attractive incremental price. The end result (and price) is no different from starting with Hadoop and adding Spark. So you can enjoy the benefits of the MapR Platform including Spark while also running Hadoop tools like MapReduce, Hive, Pig, and Mahout.
One thing to note: Spark SQL depends on the Hive metastore for retrieving table schema information and accessing persistent tables registered in the metastore. Support for this Hive metastore is included as part of the MapR Platform including Spark, without the need for the full Hadoop module. A similar level of support is also included for enterprises that choose to run Spark on YARN.
What if I need NoSQL or Event Streaming?
The MapR Platform including Spark includes web-scale storage (via MapR-FS) that exposes both HDFS and NFS interfaces, but what if you need key-value, wide-column, and/or document NoSQL databases? And what if you need a global publish-subscribe event streaming engine? With this Spark distribution, you always have the option to add modules such as MapR-DB (NoSQL) and MapR Streams (event streaming). These provide real-time and operational analytics capabilities. You also have the option of adding Apache Drill for BI/ad-hoc/exploratory analytics.
What are some of the use cases I can develop with the MapR Platform including Spark?
All Hadoop use cases can be addressed, including:
- Data lake and data hub: Pool vast and varied types of raw data in MapR-FS and use Spark to process the relevant data that can answer business questions.
- Large-scale ELT: Use Spark to extract data from the MapR Converged Data Platform and load it into memory, then transform it into a format suitable for advanced analytics or prepare it for BI/ad-hoc/exploratory analytics with Apache Drill.
- 360-degree customer view: Aggregate data from various customer touch points in both structured and unstructured formats, persist it in the MapR Converged Data Platform, analyze it with Spark machine learning to understand customer behavior patterns, and deliver a personalized and unified experience to your customers.
- Log analytics/management: Collect and manage huge volumes of machine- or system-generated data and analyze it for troubleshooting, for mining common error patterns using Spark machine learning, or for compliance with security, audit, or regulatory requirements.
- Genomics analysis: Analyze massive amounts of sequenced human genome data with very low latency and cost, while improving system performance and reliability.
- Predictive analytics: Exploit patterns found in historical data stored in a data lake or data hub by building models that capture the relationships among many factors. Deploy these models in real time to predict the outcome of an event, score individuals, forecast demand for products, and so on.
- Marketing and social media analytics: Measure the impact of your marketing or social media campaign to determine the ROI, effectively target customers, or optimize the campaign further.
- Time-series analytics: Use Spark with MapR-FS and/or MapR-DB to measure and analyze data over a time interval, and build dashboards or alerting/monitoring systems.
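To make the time-series use case above more concrete, here is a minimal sketch of the underlying pattern: group readings into fixed time windows, aggregate each window, and flag windows that should raise an alert. It uses plain Python rather than Spark, MapR-FS, or MapR-DB, and the function names, sample readings, window size, and threshold are all hypothetical.

```python
from collections import defaultdict


def tumbling_window_means(readings, window_seconds):
    """Group (timestamp, value) pairs into fixed windows and average each one.

    Timestamps are integer seconds; each window is keyed by its start time.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def windows_over_threshold(window_means, threshold):
    """Return the start times of windows whose mean exceeds the alert threshold."""
    return [start for start, mean in window_means.items() if mean > threshold]


# Hypothetical sensor readings as (seconds since start, measured value).
readings = [(0, 10.0), (30, 20.0), (60, 90.0), (90, 110.0), (120, 15.0)]
means = tumbling_window_means(readings, window_seconds=60)
alerts = windows_over_threshold(means, threshold=50.0)
```

At cluster scale the same grouping and aggregation would be expressed as a distributed Spark job, with the per-window results feeding a dashboard or an alerting system.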
Get Started with MapR Platform including Spark
If you’d like to learn more about developing applications with Spark, MapR provides free Apache Spark on-demand training courses:
- Apache Spark Essentials (DEV 360)
- Build and Monitor Apache Spark Applications (DEV 361)
- Create Data Pipelines Using Apache Spark (DEV 362)
If you’d like to take Spark for a test drive and experience all the powerful features that are part of the complete distribution, try out the MapR Sandbox with Apache Spark.
Have you completed the MapR Spark certification courses and the test drive? Interested in jump-starting your Spark application and shifting into high gear? Check out our Quick Start Solutions (QSS) for Spark, which include:
- Stream Processing for Real-Time Analytics and Dashboards
- Real-Time Security Log Analytics
- Time Series Analytics
- Genome Sequencing
Looking for a one-stop destination for all things related to Spark? Explore our Apache Spark Resources & Product Information and free Apache Spark resources pages (ebook, videos, whitepapers, and more).
*Notebook for the MapR Platform including Spark will be added in later releases.