Apache Spark is a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages on a variety of architectures. Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers and relatively accessible to those learning to work with it for the first time.
In this blog post, I’m going to go out on a limb and make a connection between Spark and Legos. Legos are a product of The LEGO Group, designed to be used by a range of consumers in a variety of sets. Legos’ fun, simplicity, and broad support for existing construction toys and building systems make them increasingly popular with a wide range of artists and designers, and relatively accessible to those learning to work with them for the first time. See the similarities? (Learn more about using Spark in the ebook Getting Started with Apache Spark: From Inception to Production.)
Lego Batman vs. Lego Indiana Jones: Using Different Programming Languages
Let’s take it a step further. Spark’s capabilities can all be accessed and controlled through a rich API. Just as Lego has incorporated franchises such as Harry Potter, the Avengers, Indiana Jones, and Batman, Spark supports multiple existing programming languages, including Java, Python, Scala, SQL, and R. And just as each Lego set includes an instruction manual, there are extensive examples and tutorials for Spark. For tutorials with code samples in Java, Python, and Scala, check out the Apache Spark project website. Spark SQL, an Apache Spark module, offers native support for SQL and simplifies the process of querying data stored in Spark’s own Resilient Distributed Dataset (RDD) model. Support for R is more recent: the SparkR package first appeared in June 2015, in release 1.4 of Apache Spark.
Spark gained a new DataFrames API in 2015. The DataFrames API offers:
- Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
- Support for a wide array of data formats and storage systems
- State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
- Seamless integration with all big data tooling and infrastructure via Spark
- APIs for Python, Java, Scala, and R
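To give a feel for the DataFrames programming model, here is a toy sketch in plain Python rather than PySpark itself: a chain of `filter` and `select` calls over rows of named columns. The `Frame` class, column names, and data are all hypothetical, invented purely for illustration.

```python
class Frame:
    """A minimal in-memory stand-in for a DataFrame: a list of row dicts."""
    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # Returns a new Frame; the original is left untouched.
        return Frame(r for r in self.rows if predicate(r))

    def select(self, *cols):
        # Keep only the named columns in each row.
        return Frame({c: r[c] for c in cols} for r in self.rows)

    def count(self):
        return len(self.rows)

people = Frame([
    {"name": "Ada",   "age": 36},
    {"name": "Alan",  "age": 41},
    {"name": "Grace", "age": 29},
])

# Declarative, chained operations, much like a DataFrame query.
adults_over_30 = people.filter(lambda r: r["age"] > 30).select("name")
print(adults_over_30.count())  # → 2
```

In real PySpark, the same shape of query runs unchanged whether the data is kilobytes on a laptop or petabytes on a cluster; that scaling is precisely what this toy version cannot do.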
Deployment and Storage Options, or Do Your Legos Need a Second Basement?
You can set up a small Lego playset on the side of your desk. But if you feel the need to own the life-sized pirate ship, giant magic castle, and $350 deluxe edition superhero set, you’re going to need more room. Similarly, Spark can run either standalone or as part of a cluster. It’s easy to download and install Spark on a laptop or virtual machine, but that is unlikely to be sufficient for production workloads operating at scale. In those circumstances, Spark will normally run on an existing big data cluster. Such clusters are often also used for Hadoop jobs, and are usually managed by Hadoop’s YARN resource manager; see Running Spark on YARN for more details. Spark can also run just as easily on Amazon Web Services’ Elastic Compute Cloud (EC2) or on clusters controlled by Apache Mesos.
Regarding storage, Spark can integrate with a range of commercial or open source third-party data storage systems, including:
- MapR (file system and database)
- Google Cloud
- Amazon S3
- Apache Cassandra
- Apache Hadoop (HDFS)
- Apache HBase
- Apache Hive
- Berkeley’s Tachyon project
Developers are most likely to choose the data storage system they are already using elsewhere in their workflow.
Building the Spark Stack
The Spark project stack currently comprises Spark Core and a set of libraries, each optimized to address the requirements of a different use case. Individual applications will typically require Spark Core and at least one of these libraries. Spark’s flexibility and power become most apparent in applications that combine two or more of these libraries on top of Spark Core.
The Building Blocks
- Spark Core: This is the heart of Spark, and is responsible for management functions such as task scheduling. Think of it as the popular flat green Lego base that everything else is built on top of, and imagine the modules as Lego blocks. Spark Core implements and depends upon a programming abstraction known as Resilient Distributed Datasets, which are discussed in more detail below.
- Spark SQL: This is Spark’s module for working with structured data, and it is designed to support workloads that combine familiar SQL database queries with more complicated, algorithm-based analytics. Spark SQL supports the open source Hive project and its SQL-like HiveQL query syntax. Spark SQL also supports JDBC and ODBC connections, enabling a degree of integration with existing databases, data warehouses, and business intelligence tools. JDBC connectors can also be used to integrate with Apache Drill, opening up access to an even broader range of data sources.
- Spark Streaming: This module supports scalable and fault-tolerant processing of streaming data, and can integrate with established sources of data streams like Flume (optimized for data logs) and Kafka (optimized for distributed messaging). Spark Streaming’s design, together with its use of Spark’s RDD abstraction, is meant to ensure that applications written for streaming data can be repurposed to analyze batches of historical data with little modification.
- MLlib: This is Spark’s scalable machine learning library, which implements a set of commonly used machine learning and statistical algorithms. These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
- GraphX: This module began life as a separate UC Berkeley research project and was eventually donated to the Apache Spark project. GraphX supports analysis of and computation over graphs of data, and also supports a version of the Pregel graph-processing API. GraphX includes a number of widely understood graph algorithms, including PageRank.
- SparkR: This module was added in the 1.4.x release of Apache Spark, providing data scientists and statisticians using R with a lightweight mechanism for calling upon Spark’s capabilities.
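To illustrate the kind of workload Spark SQL handles, familiar SQL queries over structured data, here is a stand-in sketch using Python’s built-in sqlite3 module rather than Spark SQL itself; in PySpark, an equivalent query would be issued through a SparkSession instead. The table and column names are invented for the example.

```python
import sqlite3

# An in-memory SQL table standing in for structured data registered
# with Spark SQL. Hypothetical schema: user clicks per session.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ada", 3), ("alan", 7), ("grace", 5)])

# A familiar SQL aggregate query over the structured data.
total = conn.execute("SELECT SUM(clicks) FROM events").fetchone()[0]
print(total)  # → 15
```

The point of Spark SQL is that exactly this style of query can be combined, in the same application, with algorithm-based analytics over the same data.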
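Spark Streaming’s promise that streaming code can be reused on historical batches can be sketched in plain Python (this mimics the micro-batch concept only, not Spark’s implementation): one batch-processing function serves both a live stream, consumed in small batches, and a full historical dataset.

```python
# Hypothetical example: counting error lines in a log stream.
def process_batch(records):
    """Count error lines in a batch; works for micro-batches or full history."""
    return sum(1 for r in records if "ERROR" in r)

def stream_in_batches(stream, batch_size):
    """Consume a stream in fixed-size micro-batches, yielding one result each."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield process_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield process_batch(batch)

live = ["INFO ok", "ERROR disk", "INFO ok", "ERROR net", "INFO ok"]

# Streaming mode: results arrive per micro-batch.
streaming_counts = list(stream_in_batches(iter(live), batch_size=2))

# Batch mode: the same function, applied to the whole history at once.
historical_count = process_batch(live)

print(streaming_counts, historical_count)  # → [1, 1, 0] 2
```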
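MLlib implements statistics such as correlation at cluster scale; as a single-machine illustration of what one of those algorithms actually computes, here is Pearson correlation in plain Python, with made-up data. This is a sketch of the statistic, not MLlib’s API.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: study hours vs. test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson(hours, scores)
print(round(r, 3))  # → 0.993 (a strong positive correlation)
```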
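GraphX ships a distributed PageRank; the following is a single-machine, plain-Python sketch of the same iterative algorithm, run over a hypothetical four-page link graph. It conveys the idea, not GraphX’s actual API or scale.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict of node -> outgoing links."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node keeps a base share, plus damped shares from its in-links.
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node, outgoing in links.items():
            share = rank[node] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

links = {  # each page and the pages it links to (hypothetical)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # → c (the most heavily linked-to page)
```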
Resilient Distributed Datasets (RDDs)
The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient. Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes. Once data is loaded into an RDD, two basic types of operation can be carried out:
- Transformations, which create a new RDD by changing the original through processes such as mapping and filtering. If we visualize the RDD as a fully built Lego set, a transformation would involve duplicating the set and then adding or taking away blocks; the original set is never touched.
- Actions, such as counts, which measure but do not change the original data. For our Lego RDD, imagine someone writing down observations about the Legos but not touching anything.
The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node.
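This logging-and-replay idea can be sketched in plain Python (a sketch of the concept, not of Spark’s implementation): the lineage is an ordered list of transformations that can simply be re-applied to the source data to rebuild a lost result.

```python
source = [1, 2, 3, 4, 5, 6]

# The lineage: an ordered log of transformations from RDD1 to RDDn.
lineage = [
    lambda data: [x * 10 for x in data],        # a map step
    lambda data: [x for x in data if x > 20],   # a filter step
]

def rebuild(source, lineage):
    """Replay every logged transformation over the source, in order."""
    data = source
    for transform in lineage:
        data = transform(data)
    return data

result = rebuild(source, lineage)
# Simulate losing `result` on a failed node, then recompute it from lineage:
recovered = rebuild(source, lineage)
print(result == recovered, result)  # → True [30, 40, 50, 60]
```

Note that `source` is never modified: as with RDDs, every transformation yields new data and leaves the original intact.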
Transformations are said to be lazily evaluated, meaning that they are not executed until a subsequent action needs the result. This will normally improve performance, as it avoids processing data unnecessarily. It can also, in certain circumstances, introduce processing bottlenecks that cause applications to stall while waiting for a processing action to conclude.
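Python’s generators give a convenient plain-Python sketch of this lazy behavior: like Spark transformations, a generator pipeline only builds up a recipe, and no work happens until an "action" (here, `sum`) pulls results through it. This is an analogy, not Spark code.

```python
executed = []  # records when work actually happens

def numbers():
    for n in range(1, 6):
        executed.append(n)  # side effect: note that this element was processed
        yield n

# "Transformations": nothing runs yet; this only builds the pipeline.
pipeline = (x * x for x in numbers() if x % 2 == 1)
assert executed == []  # no data has been processed so far

# "Action": only now is the whole chain evaluated.
total = sum(pipeline)
print(total)     # → 35 (1 + 9 + 25)
print(executed)  # → [1, 2, 3, 4, 5]
```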
These RDDs remain in memory where possible. This greatly increases the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes.
Why Lego and Spark Have the Same Idea
There is not yet an official Spark-themed Lego set, but I think it would work very well. For example, compare the different programming languages Spark supports to real-world franchises that have been turned into Lego. Lego can express the same basic building idea through everything from Teenage Mutant Ninja Turtles to the Lord of the Rings. Similarly, Spark can be utilized through multiple languages ranging from Python to R. Is Java really that different from Jurassic World? Could Scala be the tech version of Spiderman?
Furthermore, just as your Legos can go from one desk to taking up your entire basement, Spark can scale from one small machine to part of a cluster. And using the different modules to create a Spark stack is much like stacking Lego bricks on top of a base. Think of how easy it is to build different ideas on the same platform, both in Lego and in Spark. In fact, some people will tell you that using Spark is even more fun than building with Legos!
In this blog post, I made a connection between Spark and Legos, and reviewed Spark deployment and storage options, the building blocks of the Spark stack, and Resilient Distributed Datasets.
If you have any additional questions about using Spark, please ask them in the comments section below.
Want to learn more?