The recent Attunity and MapR webinar “Give your Enterprise a Spark: How to Deploy Hadoop with Spark in Production” proved to be highly interactive and engaging. As promised, here are answers to the questions we were not able to get to during the webinar. If you have any other unanswered questions, just comment here and we will respond.
If you missed the webinar, you can watch the replay.
Answer: Although HDFS does not support updating data in place, the MapR file system (MapR-FS) is a POSIX-compliant read-write filesystem that allows data to be processed even while the files are open. Spark can therefore update data on disk on the MapR platform.
Question: What is the advantage of Spark inside MapReduce (SIMR)?
Answer: If you are a heavy Hadoop MapReduce user already but still want to take advantage of Spark APIs for faster programming, SIMR allows you to launch a Spark program from within a MapReduce job. We have not seen a lot of these situations in our customer base.
Question: What about running Spark on Mesos vs HDFS?
Answer: For big data workloads, most users are looking at running Spark apps within the context of a big data storage platform. Since Spark does not have a storage layer of its own, it generally relies on systems such as HDFS for distributed storage. Mesos, by contrast, is a general-purpose resource manager rather than a storage layer; it can schedule many kinds of workloads, including Hadoop YARN jobs (see the Apache Myriad project).
Question: How much RAM is required if I need to process 20GB of data in Spark as it is an in memory process?
Answer: By default, MapR assigns 2GB of RAM to each Spark slave. Datasets that are larger than the aggregate memory available will be spilled to disk.
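As a rough illustration of the arithmetic behind this answer, the sketch below estimates how much of a dataset fits in aggregate memory and how much would spill to disk. The function name `memory_plan` and the `usable_fraction` parameter are assumptions made for the example, not MapR or Spark settings, and the estimate ignores the fact that data often takes more space in memory than on disk.

```python
def memory_plan(dataset_gb, workers, mem_per_worker_gb=2.0, usable_fraction=0.6):
    """Estimate how much of a dataset fits in the cluster's aggregate memory.

    usable_fraction is an assumed share of each worker's RAM that is
    actually available for holding data; the real value depends on the
    Spark version and its configuration.
    """
    aggregate_gb = workers * mem_per_worker_gb * usable_fraction
    spilled_gb = max(0.0, dataset_gb - aggregate_gb)
    return {
        "aggregate_gb": aggregate_gb,
        "spilled_gb": spilled_gb,
        "fits": spilled_gb == 0.0,
    }


# 20GB of data on ten workers with the 2GB default mentioned above.
plan = memory_plan(dataset_gb=20, workers=10)
print(plan)
```

Under these assumed numbers, part of the 20GB dataset would spill; adding workers or raising the per-worker memory changes the result.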
Question: Any document on best practices to achieve better performance?
Answer: Not that we’re aware of. The Apache Spark project discusses some of the considerations in its guide pages. Performance tuning for Spark (and most Hadoop applications) varies widely with the workload. Some simple rules apply to Spark execution in general: aim for one Spark executor per node; allocate one Spark “core” per core in the node (the term “core” is misleading; it is actually the number of Spark tasks that can execute concurrently inside a single Spark executor); and allocate memory based on the data set size. Spark RDDs will remain in memory as long as they are in use. When the cache fills up, Spark by default evicts cached partitions in least-recently-used (LRU) order.
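The rules of thumb above can be turned into `spark-submit` options. The following is a minimal sketch, not a tuning recommendation: the helper name `executor_settings`, the `reserved_gb` headroom, the example cluster size, the application name `my_app.py`, and the `--master yarn` setting are all assumptions made for illustration. `--num-executors`, `--executor-cores`, and `--executor-memory` are standard `spark-submit` options when running on YARN.

```python
def executor_settings(nodes, cores_per_node, mem_per_node_gb, reserved_gb=4):
    """Apply the rule of thumb: one executor per node, one task slot per core.

    reserved_gb is a hypothetical amount of RAM left on each node for the
    OS and other Hadoop services; tune it for your own cluster.
    """
    executor_mem_gb = mem_per_node_gb - reserved_gb
    if executor_mem_gb <= 0:
        raise ValueError("reserved_gb leaves no memory for the executor")
    return [
        "--num-executors", str(nodes),
        "--executor-cores", str(cores_per_node),
        "--executor-memory", f"{executor_mem_gb}g",
    ]


# Hypothetical cluster: 5 nodes, 8 cores and 32GB of RAM each.
args = executor_settings(nodes=5, cores_per_node=8, mem_per_node_gb=32)
print(" ".join(["spark-submit", "--master", "yarn"] + args + ["my_app.py"]))
```

Treat the output as a starting point; the right values still depend on the workload and on what else is running on the nodes.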
Question: Why is it faster as compared with MapReduce? How is Spark internally managing to achieve this?
Answer: Traditional MapReduce is optimized for reliability, which means that at critical points the intermediate results from the map and reduce stages are written back to disk. If the job crashes, you can restart it from where you left off, but the extra disk I/O creates performance bottlenecks. Spark, by contrast, keeps most datasets produced by transformations in memory rather than writing them to disk between steps. Spark addresses the reliability issue by tracking each RDD’s lineage as a DAG of the transformations that produced it, so that after a failure it can quickly rebuild the lost RDD.
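To make the idea of rebuilding a lost RDD concrete, here is a toy model in plain Python. It is not Spark's implementation: `ToyRDD` is a hypothetical class invented for this example that remembers its parent and the transformation applied to it, so a lost in-memory result can be recomputed by replaying those steps.

```python
class ToyRDD:
    """A toy stand-in for an RDD: it remembers how to rebuild itself."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # base data (only set on the root dataset)
        self.parent = parent   # the dataset this one was derived from
        self.fn = fn           # transformation applied to the parent's rows
        self.cached = None     # in-memory result, which may be lost

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from the lineage only when no in-memory copy exists.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.fn(self.parent.collect())
        return self.cached


base = ToyRDD(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.cached = None        # simulate losing the in-memory copy
rebuilt = evens_squared.collect()  # rebuilt by replaying the recorded steps
print(first == rebuilt)
```

Only the lost result is recomputed here; the filtered parent is still held in memory, which is why recovery can be much cheaper than rerunning the whole job.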
Question: From an infrastructure setup point of view, do we just need additional memory to run Spark on a Hadoop data node? Do we need to consider anything else?
Answer: Most often, the considerations are around memory, as you point out. Allocating memory in a cluster to Spark reduces the memory available for other Hadoop services and tasks. Using YARN as the execution mechanism for Spark allows much finer-grained control over the resources used.
Question: Spark holds the RDD in memory, so at what data size is Spark performance noticeably impacted? And what are example use cases where Spark should not be considered?
Answer: The biggest performance hit comes when a single RDD cannot be held in the aggregate memory of all the Spark slaves. In this case, spilling to disk will occur and will impact the performance. However, splitting the input files (for example) can allow those RDDs to work in memory, but will force Spark to work on multiple RDDs rather than one. Very large, single file input datasets that cannot be split readily or easily may be candidates for other Hadoop tools: Apache Drill, Pig and Hive or a combination of all of these.
Question: Does Spark provide the visualization component or do you need a third party tool?
Answer: Apache Spark does not provide any visualization components, but it integrates with several tools, including those used for BI. In newer versions of Spark, SparkR is available. While this is a specific implementation of R in the Spark environment, many of the visualization tools available in R are usable from Spark.
Question: What advantages do I get by running Spark on MapR?
Answer: By running Spark on the MapR platform, you get all the inherent advantages of the MapR platform, which is enterprise grade. The high performance and reliability of the platform, coupled with the in-memory processing of Spark, make the combination powerful.