Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal. Data Science Central co-founder Tim Matteson, along with Pat McDonough, Director of Client Solutions for Databricks, and MapR Sr. Director of Product Management Anoop Dawar, discuss the benefits of running Spark on Hadoop.
Want to learn more? Check out these resources on MapR, Spark and Hadoop:
- An easy guide to installing Spark
- Apache Spark Streaming Programming Guide
- Apache Spark Developer Cheat Sheet
- Apache Spark Streaming Page on Apache.org
- MapR Integrates the Complete Apache Spark Stack
- Databricks and MapR Partner to Provide Enterprise Support for Spark
- Download Sandbox for Hadoop
- Watch the Webinar
The following questions were also asked during the webinar, but were not answered due to lack of time. Here are those questions and answers:
Q: What parts of Spark do you miss if you don't use the MapR Distribution for Hadoop?
MapR is the only distribution that supports the complete Spark stack, including the interactive SQL engine, Shark, as well as Spark Streaming, MLlib and GraphX.
Q: Can I use Apache Giraph for iterative processing?
Several customers have Giraph running on MapR clusters successfully. In addition to the Apache open source packages in the MapR Distribution, customers can choose to run several other packages depending on their needs. MapR includes newer packages in the distribution based on customer demand.
Q: Is there a book on Spark?
This is addressed along with the training question later in this list.
Q: Can you compare the performance of Python, Scala and Java APIs?
This boils down to general application performance of Python (dynamic & interpreted) vs. Scala/Java (static & compiled).
Q: Is there also an interface to R?
SparkR is a research project at UC Berkeley's AMPlab, where they are working on Spark support in R: https://github.com/amplab-extras/SparkR-pkg
Q: Does RDD = graph?
RDDs are distributed datasets with operator graphs. On the other hand, GraphX is a subproject that provides a graph processing library using Spark's RDDs.
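To illustrate the distinction, here is a minimal plain-Python sketch (a toy, not Spark's actual implementation) of how an RDD records a lineage of operators and only computes when an action such as collect() runs:

```python
# Toy illustration of RDD lineage: each map() records an operator in a
# graph; nothing is computed until an action (collect) walks the lineage.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # Lazy: just link a new node to its parent, no computation yet.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # Action: recursively evaluate the operator graph.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

rdd = ToyRDD(data=[1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
print(rdd.collect())  # [3, 5, 7]
```

GraphX, by contrast, is a library for processing graph-structured *data* (vertices and edges) on top of these RDDs.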
Q: Do you have any insights on how to do capacity planning for Spark, as it does use memory allocated to nodes?
Spark can certainly take advantage of lots of RAM, and processing data from RAM will be inherently faster than from disk. Any capacity planning exercise is use case specific, but expect to leverage more memory in Spark as compared to traditional Hadoop systems.
Q: When is Spark 1.0 planned for release?
Spark 1.0 release candidates are out now. Typically, the GA release takes place within a few weeks of RC releases.
Q: Do we need to tell Spark to access data from cache? Won’t it do that automatically?
When using the Spark API, you explicitly choose which datasets to cache. Applications built on top of Spark, such as Shark or an end-user program, can use their domain knowledge to implement some form of implicit caching policy.
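As a rough plain-Python sketch of that explicit choice (this is not the Spark API, just an analogy), the user calls cache() on the dataset worth keeping, and only then do repeated actions stop recomputing it:

```python
# Sketch: caching is an explicit user decision. Without cache(), each
# action re-evaluates the dataset; after cache(), results come from memory.
class Dataset:
    def __init__(self, compute):
        self.compute = compute   # thunk producing the data
        self._cached = None
        self.evals = 0           # count recomputations for illustration

    def cache(self):
        self._cached = self.compute()
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached  # served from memory
        self.evals += 1
        return self.compute()    # recomputed every time

expensive = Dataset(lambda: [x * x for x in range(5)])
expensive.collect(); expensive.collect()
print(expensive.evals)  # 2 -- recomputed on each action
expensive.cache()
expensive.collect()
print(expensive.evals)  # still 2 -- now served from cache
```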
Q: How can I store a large lookup (multiple GB) into memory?
Multiple GBs of lookup data may actually be considered somewhat small on most clusters, and can be broadcast to all nodes for optimal performance if there is enough space available. Otherwise, you can load that data into an RDD and cache it.
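A plain-Python sketch of the broadcast idea (in real Spark you would wrap the table with sc.broadcast; the names below are illustrative): replicating the lookup table to every worker lets the join happen map-side, with no shuffle:

```python
# Sketch of a map-side join against a broadcast lookup table. The table
# is assumed small enough to replicate to every worker node.
lookup = {"US": "United States", "DE": "Germany"}  # hypothetical data

def map_side_join(records, broadcast_lookup):
    # Every "worker" has the full lookup locally, so each record can be
    # enriched independently -- no shuffle required.
    return [(rid, broadcast_lookup.get(code, "?")) for rid, code in records]

records = [(1, "US"), (2, "DE"), (3, "FR")]
print(map_side_join(records, lookup))
# [(1, 'United States'), (2, 'Germany'), (3, '?')]
```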
Q: Does Spark support message passing among the data nodes and/or master node?
Spark uses a functional programming style to execute a DAG. Spark is not an MPI-based platform.
Q: What parallelization paradigms (besides MapReduce) does Spark support?
Spark is a general data-parallel framework. Beyond map and reduce, its API includes operations such as filter, join, groupBy, and cogroup, and it supports iterative algorithms efficiently.
Q: Could we also shuffle B and keep the partitioning of E? How do we specify this?
In the example used, RDD E's partitions were laid out according to where the DFS blocks were originally stored and not shuffled yet, so they had to be shuffled in order to join.
Q: What does lambda mean?
The Lambda architecture is a design that serves data through both a batch layer and a speed layer. The batch layer is slow, so the speed layer computes real-time analytics between batch jobs, and queries merge the results of both layers.
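A minimal plain-Python sketch of the serving idea (the view names and numbers are hypothetical): a query merges the complete-but-stale batch view with the fresh speed-layer view:

```python
# Sketch of lambda-architecture serving: combine a batch view (complete
# but computed by a slow periodic job) with a speed-layer view (small,
# incremental, real-time updates since the last batch run).
batch_view = {"clicks": 100}   # produced by the batch layer
speed_view = {"clicks": 7}     # produced by the speed layer

def serve(metric):
    # A query merges both views to get a complete, up-to-date answer.
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(serve("clicks"))  # 107
```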
Q: Can I use my Pig UDFs in it?
There have been some efforts in the community (e.g. Spork) to use Pig on Spark.
Q: Can Spark SQL be used to create intermediate tables?
Yes. The results of Spark SQL queries are just RDDs, so they can be queried again and used in the same fashion as intermediate tables.
Q: I thought you have to have the data in HDFS before you transform the data.
You could use Spark Streaming and transform data before it goes into the file system, but your experience will vary based on the complexity of the transformation.
Q: We have our data in HBase. How well does Spark work with HBase?
Spark can use any of Hadoop's input formats, so it can work with HBase data via the HBase input formats. Optimized support of HBase via Catalyst is on the roadmap.
Q: How is Spark Streaming better than Storm?
Spark Streaming differs from Storm in that it supports exactly-once semantics, uses a micro-batch architecture, can run on existing Spark clusters, leverages the same APIs and framework as the Spark core, and provides window functions.
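The micro-batch model can be sketched in plain Python (an illustration of the concept, not Spark Streaming's API): the input stream is sliced into small batches, and each batch is processed with ordinary batch logic:

```python
# Sketch of micro-batching: slice a stream into fixed-size batches and
# apply the same batch-style computation (here, a sum) to each slice.
def micro_batches(stream, batch_size):
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

events = list(range(10))                           # stand-in for a stream
counts = [sum(b) for b in micro_batches(events, 3)]
print(counts)  # per-batch aggregates: [3, 12, 21, 9]
```

Because each micro-batch is an ordinary batch computation, the same code and libraries used for batch jobs carry over to streaming.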
Q: Is there any training available on Spark? Or is there a book that includes a comprehensive description of Spark?
Databricks is working on providing training workshops; check www.databricks.com for more information.
Q: When I tried using Spark, it returned results after processing my data, but after I changed the code to use a Tuple4<String, String, String, String> extractKey(String s) function, it started giving me a Java heap space error. I was running on localhost.
Check the documentation on the Spark website for configuration and tuning parameters; in particular, you may need to increase memory settings such as spark.executor.memory for your application.
Q: The packaging time during each compilation takes about 370 seconds (~6 minutes). Is there some way to reduce this time?
If you are using sbt, continuous compilation is a good option.
Q: You talked about optimizing the process represented by the execution graph. Can Spark "decide" when to apply the .cache function to data automatically?
Caching is explicitly defined by the user.
Q: Will Shark merge with Spark SQL?
Shark will migrate toward using Spark SQL as a backend.
Q: Is it possible to push processed data from Spark Streaming directly into a Shark table?
For data sharing between applications (Streaming and Shark, for example), take a look at Tachyon.
Q: How does Spark compare to Impala?
Spark is a general data-parallel processing platform, with built-in libraries that include SQL processing. Impala is an MPP-style SQL engine.
Q: If I created a Spark cluster through AWS APIs, I would get the master DNS, so instead of initiating Spark through localhost, do I only have to change the localhost to Master DNS? How can I put my program to run on Hadoop on the fly?
MapR is going to make Spark available on Amazon EMR M3, M5, and M7 soon, making it easier to use MapR in Amazon. Spark applications choose a master at start-up time. Check the Spark programming guide for more info.
Q: Is there a visualization tool that integrates well with Spark, like Kibana for Elasticsearch?
Shark supports some of the same ODBC/JDBC drivers as Hive, so visualization and BI tools that connect to Hive can generally connect to Shark as well.
Q: Could SparkR do all the ML algorithms better?
SparkR provides an API to work with Spark within the R programming environment. Algorithms in R will not automatically use Spark unless they are written with the Spark API.
Q: Does Spark allow functions that return a complex type, such as a tuple (Int,Int,Int). What would be the return type since there appears to be no Tuple type in Spark/Scala?
Tuples are well supported in Scala and Python, and can be used in Java as well with the Tuple2 type.
Q: How mature is Spark Streaming? I tried using it with Flume, and it couldn't handle large batches or volumes of data. Also, I heard about Shark streaming. Do you know anything about that?
Spark Streaming is GA. Improvements to the Flume input are under way for Spark 1.1.
Q: When will MapR be releasing the version (3.03?) that supports Spark?
MapR plans to release Spark packages imminently.
Q: Will there (soon) be a Random Forest in MLlib? There is great interest from the bioinformatics community.
Random Forests are high on the list of new algorithms, and could potentially make it into Spark 1.1.
Q: Can you point Spark to a Hadoop directory and have it process all files within that directory?
Yes. Spark's file-based input methods (such as textFile) accept directories and wildcards, so all files in a directory can be processed together.
Q: Where is the best place to find tutorials for using GraphX in Spark? We have MapR and Spark installed, but cannot find any help or documents on GraphX.
The GraphX programming guide on the Apache Spark documentation page is your best source.
Q: If setting up a completely new environment, why should I add anything besides Spark?
Note that Spark does not provide a persistent store for your data. Also, although Spark offers unified batch, interactive, and streaming computational models, there are other functions, such as data ingest, workflow processing, and data warehousing, that may require other projects. Hadoop is the overarching platform that enables all of these, and adding Spark on top gives you the broadest set of choices now and in the future.
Q: Can I get RPMs directly from the MapR repository and install them? Also, does the Warden service manage Spark daemons as well?
MapR plans to publish RPMs that will make installing Spark on MapR extremely easy, and the packages will be integrated with the MapR Control System (management dashboard).
Q: Is Spark going to replace Apache Mahout?
Mahout is working to incorporate the benefits of Spark and is exploring other high performance back-ends as well.
Q: If running Spark, when should I use Mesos instead of YARN?
Spark supports either Mesos or YARN.
Q: Can you recommend any good tutorials for Java?
Read the Java Programming Guide on Apache Spark.
Q: How much of R will SparkR cover?
SparkR is a library used within R.
Q: Can Spark run on MapR now, or should we wait for MapR support of YARN?
Spark can run standalone on MapR today. MapR will soon publish RPMs that make Spark installation on MapR extremely easy; the packages will be integrated with MapR's management dashboards.
Q: When can we expect to have SparkR?
SparkR is currently a research project at the UC Berkeley AMPlab. You can follow progress and contribute to the project here: http://amplab-extras.github.io/SparkR-pkg/
Q: Will Spark support named fields for tuples?
Spark's Scala API supports case classes, which give tuples named fields; in Python, you can use namedtuples.
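For example, Python's standard-library namedtuple gives tuple fields names (this is plain Python; records like this can be used inside Spark map functions):

```python
from collections import namedtuple

# A tuple-like record type with named fields, analogous to a Scala
# case class. "Person", "name", and "age" are illustrative names.
Person = namedtuple("Person", ["name", "age"])

p = Person(name="Ada", age=36)
print(p.name, p.age)  # Ada 36
print(p[0])           # fields are still accessible by position: Ada
```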