Apache Spark is a general-purpose data processing engine, suitable for use in a wide range of circumstances. In its current form, however, Spark is not designed to deal with the data management and cluster administration tasks associated with running data processing and analysis workloads at scale.
Rather than investing effort in building these capabilities into Spark, the project currently leverages the strengths of other open source projects, relying upon them for everything from cluster management and data persistence to disaster recovery and compliance. Projects like Apache Mesos offer a powerful and growing set of capabilities around distributed cluster management. However, most Spark deployments today still tend to use Apache Hadoop and its associated projects to fulfill these requirements.
In this blog post, I’ll talk about the relationship between Spark and Hadoop, what Hadoop gives Spark, and what Spark gives Hadoop. Keep in mind that although Spark is a viable alternative to Hadoop MapReduce in a range of circumstances, it is not a replacement for Hadoop. Instead, think of Spark as a great companion to a modern Hadoop cluster deployment.
Which Is Better, Hadoop or Spark? Neither—It’s the Wrong Question
Despite the hype, Spark is not a replacement for Hadoop. Nor is MapReduce dead.
Spark can run on top of Hadoop, benefiting from Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). Spark can also run completely separately from Hadoop, integrating with alternative cluster managers like Mesos and alternative storage platforms such as Cassandra and Amazon S3.
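To make the two deployment styles concrete, here is a minimal sketch of how the same application might be submitted either way. The `--master` option and the `yarn`, `mesos://`, `spark://`, and `local[*]` URL forms are standard `spark-submit` usage. The host names, ports, bucket, paths, and the `wordcount.py` application are placeholders I made up for illustration, and the helper function is hypothetical rather than part of Spark.

```python
from typing import List

# Master URL forms accepted by spark-submit's --master option.
# Host names and ports are placeholders, not values from this article.
MASTER_URLS = {
    "yarn": "yarn",                                   # Hadoop YARN (cluster found via HADOOP_CONF_DIR)
    "mesos": "mesos://mesos-master.example:5050",     # Apache Mesos
    "standalone": "spark://spark-master.example:7077",  # Spark's own standalone manager
    "local": "local[*]",                              # single machine, all cores
}


def spark_submit_command(cluster_manager: str, app: str, input_uri: str) -> List[str]:
    """Build a spark-submit argument list for the chosen cluster manager.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if cluster_manager not in MASTER_URLS:
        raise ValueError(f"unknown cluster manager: {cluster_manager}")
    return ["spark-submit", "--master", MASTER_URLS[cluster_manager], app, input_uri]


# On Hadoop: YARN schedules the job, HDFS holds the data.
on_hadoop = spark_submit_command("yarn", "wordcount.py", "hdfs:///data/events")

# Without Hadoop: Mesos schedules the job, Amazon S3 holds the data.
off_hadoop = spark_submit_command("mesos", "wordcount.py", "s3a://example-bucket/events")

print(" ".join(on_hadoop))
print(" ".join(off_hadoop))
```

The application itself is the same in both cases. Only the cluster manager and the storage URI change, which is the sense in which Spark is independent of Hadoop without being a replacement for it.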
Much of the confusion around Spark’s relationship with Hadoop dates back to the early years of Spark’s development. At that time, Hadoop relied upon MapReduce for the bulk of its data processing. MapReduce also managed scheduling and task allocation processes within the cluster. Even workloads that were not best suited to batch processing were passed through Hadoop’s MapReduce engine, adding complexity and reducing performance.
MapReduce is really a programming model. In Hadoop MapReduce, multiple MapReduce jobs were strung together to create a data pipeline. Between every stage of that pipeline, the MapReduce code would read its input from disk and then write its output back to disk. This was inefficient, and it is where Spark comes into play. Using a similar programming model, Spark improved performance by as much as tenfold because it could keep intermediate data in memory instead of writing it back to disk after every stage. For this kind of multi-stage pipeline, Spark offers a far faster way to process data than passing it through a chain of disk-bound MapReduce jobs.
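The difference between the two approaches can be sketched in plain Python. This is a toy illustration of the idea only, not how Hadoop MapReduce or Spark is implemented: the three stages, the JSON files, and the function names are all assumptions made for the example. The first pipeline writes every intermediate result to disk and reads it back, as chained MapReduce jobs do; the second passes the same data between stages in memory.

```python
import json
import os
import tempfile
from collections import Counter


def tokenize(lines):
    """Stage 1: split lines into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count occurrences of each word."""
    return dict(Counter(words))


def top(counts, n):
    """Stage 3: the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def run_with_disk(lines, n, workdir):
    """Chained-jobs style: each stage's output goes to disk and is read back."""
    def spill(name, data):
        path = os.path.join(workdir, name)
        with open(path, "w") as f:
            json.dump(data, f)
        with open(path) as f:
            return json.load(f)

    words = spill("stage1.json", tokenize(lines))
    counts = spill("stage2.json", count(words))
    return top(counts, n)


def run_in_memory(lines, n):
    """In-memory style: intermediate results never leave memory."""
    return top(count(tokenize(lines)), n)


lines = ["spark and hadoop", "spark on yarn", "hadoop and yarn and spark"]

with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_with_disk(lines, 2, workdir)
    files_written = sorted(os.listdir(workdir))

memory_result = run_in_memory(lines, 2)

print(disk_result == memory_result)  # same answer either way
print(files_written)                 # intermediate files only the disk version needs
```

Both pipelines produce the same answer. The disk-based version simply pays for two extra write-and-read round trips, and on a real cluster with large data volumes that cost is what dominates.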
Hadoop has since moved on with the development of the YARN cluster manager, freeing the project from its early dependence upon Hadoop MapReduce. MapReduce is still available within Hadoop for running static batch processes; other data processing tasks can be assigned to different processing engines (including Spark), with YARN handling the management and allocation of cluster resources.
What Hadoop Gives Spark
Spark is often deployed in conjunction with a Hadoop cluster, and is consequently able to benefit from a number of Hadoop's capabilities. By itself, Spark is a powerful tool for processing large volumes of data, but on its own it is not well suited to production workloads in the enterprise. Integration with Hadoop gives Spark many of the capabilities that broad adoption and use in production environments will require, including:
- YARN resource manager, which takes responsibility for scheduling tasks across available nodes in the cluster.
- A distributed file system (such as HDFS), which can hold data that does not fit in the cluster's memory, and which persistently stores historical data when Spark is not running.
- Disaster recovery capabilities, inherent to Hadoop, which enable recovery of data when individual nodes fail. These capabilities include basic (but reliable) data mirroring across the cluster, and richer snapshot and mirroring capabilities.
- Data security, which becomes increasingly important as Spark tackles production workloads in regulated industries such as healthcare and financial services. Projects like Apache Knox and Apache Ranger offer data security capabilities that augment Hadoop, and each of the big three Hadoop vendors has its own approach to security that complements Spark. Hadoop's core code, too, is increasingly recognizing the need to expose advanced security capabilities that Spark is able to exploit.
- A distributed data platform, benefiting from all of the preceding points, meaning that Spark jobs can be deployed on available resources anywhere in a distributed cluster without the need to manually allocate and track those individual jobs.
What Spark Gives Hadoop
Hadoop has come a long way since its early versions, which were essentially concerned with facilitating the batch processing of MapReduce jobs on large volumes of data stored in HDFS. Particularly since the introduction of the YARN resource manager, Hadoop is now better able to manage a wide range of data processing tasks, from batch processing to streaming data and graph analysis.
Via YARN, Spark is able to contribute to Hadoop-based jobs. In particular, Spark’s machine learning module (MLlib) delivers capabilities that are not easily available in Hadoop without Spark. Spark’s original purpose, enabling rapid in-memory processing of sizeable data volumes, remains an important contribution to the capabilities of a Hadoop cluster.
In certain circumstances, Spark’s SQL capabilities, streaming capabilities (otherwise available to Hadoop through Storm, for example), and graph processing capabilities (otherwise available to Hadoop through Neo4j or Giraph) may also prove to be of value in enterprise use cases.
In this blog post, I discussed the relationship between Spark and Hadoop, what Hadoop gives Spark, and what Spark in turn gives Hadoop. When thinking of Spark, remember to think of it as a great companion to a modern Hadoop cluster deployment.
If you have any additional questions about using Spark, please ask them in the comments section below.