Spark is a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages and on a variety of architectures. Spark's speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time. The project supporting Spark's ongoing development is one of Apache's largest and most vibrant, with over 500 contributors from more than 200 organizations responsible for code in the current software release.
Comprehensive support for the development languages with which developers are already familiar is important so that Spark can be learned relatively easily, and incorporated into existing applications as straightforwardly as possible. Programming languages supported by Spark include Scala, Java, Python, R, and SQL.
Languages like Python are often regarded as poor performers, especially in relation to alternatives such as Java. Although this concern is justified in some development environments, it is less significant in the distributed cluster model in which Spark is typically deployed. Any slight loss of performance introduced by the use of Python can be compensated for elsewhere in the design and operation of the cluster. Familiarity with your chosen language is likely to matter far more than the raw speed of code written in that language.
Extensive examples and tutorials exist for Spark in a number of places, including the Apache Spark project website itself. These tutorials normally include code snippets in Java, Python and Scala.
The Structured Query Language, SQL, is widely used in relational databases, and simple SQL queries are normally well understood by developers, data scientists, and others who are familiar with asking questions of any data storage system. Spark SQL, an Apache Spark module, offers native support for SQL and simplifies the process of querying data stored in Spark's own Resilient Distributed Dataset model, alongside data from external sources such as relational databases and data warehouses.
Support for R, a language widely used by data scientists and statisticians, is more recent. The SparkR package first appeared in release 1.4 of Apache Spark (in June 2015), but given R's popularity among those communities, it is likely to prove an important addition to Spark's set of supported languages.
As noted in the previous chapter, Spark is easy to download and install on a laptop or virtual machine. Spark was built to run either standalone or as part of a cluster.
But for production workloads operating at scale, a single laptop or virtual machine is unlikely to be sufficient. In these circumstances, Spark will normally run on an existing big data cluster. Such clusters are often also used for Hadoop jobs, and Hadoop's YARN resource manager is generally used to manage the cluster, Spark included. Running Spark on YARN, from the Apache Spark project, provides more configuration details.
For those who prefer alternative resource managers, Spark can also run just as easily on clusters controlled by Apache Mesos. Running Spark on Mesos, from the Apache Spark project, provides more configuration details.
A series of scripts bundled with current releases of Spark simplify the process of launching Spark on Amazon Web Services' Elastic Compute Cloud (EC2). Running Spark on EC2, from the Apache Spark project, provides more configuration details.
Although often linked with the Hadoop Distributed File System (HDFS), Spark can integrate with a range of commercial or open source third-party data storage systems, including:
Developers are most likely to choose the data storage system they are already using elsewhere in their workflow.
The Spark project stack currently comprises Spark Core and four libraries that are optimized to address the requirements of four different use cases. Individual applications will typically require Spark Core and at least one of these libraries. Spark's flexibility and power become most apparent in applications that combine two or more of these libraries on top of Spark Core.
The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient. Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes. Once data is loaded into an RDD, two basic types of operation can be carried out: transformations, which apply a function such as map or filter to the data and yield a new RDD; and actions, such as count or collect, which return a result to the driver program.
The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node.
Transformations are said to be lazily evaluated, meaning that they are not executed until a subsequent action needs the result. This will normally improve performance, as it avoids processing data unnecessarily. It can also, in certain circumstances, introduce processing bottlenecks that cause applications to stall while waiting for an action to complete.
Where possible, these RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes.
Spark's capabilities can all be accessed and controlled through a rich API. This API supports Spark's four principal development languages (Scala, Java, Python, and R), and extensive documentation describes the API's instantiation in each of these languages. The Spark Programming Guide provides further detail, with comprehensive code snippets in Scala, Java, and Python. The Spark API is optimized for manipulating data, with a design that reduces common data science tasks from hundreds or thousands of lines of code to only a few.
An additional DataFrames API was added to Spark in 2015. DataFrames offer:
Developers already familiar with a DataFrame API in other languages, such as R or Python's pandas, will feel right at home. For those new to the API but already familiar with Spark, it will ease application development while helping to improve performance through built-in optimizations and code generation.
Much of Spark's power lies in its ability to combine very different techniques and processes into a single, coherent whole. Outside Spark, the discrete tasks of selecting data, transforming that data in various ways, and analyzing the transformed results might easily require a series of separate processing frameworks, coordinated by tools such as Apache Oozie. Spark, on the other hand, offers the ability to combine these tasks, crossing boundaries between batch, streaming, and interactive workflows in ways that make the user more productive.
Spark jobs perform multiple operations consecutively, in memory, spilling to disk only when required by memory limitations. Spark simplifies the management of these disparate processes, offering an integrated whole: a data pipeline that is easier to configure, easier to run, and easier to maintain. In use cases such as ETL, these pipelines can become extremely rich and complex, combining large numbers of inputs and a wide range of processing steps into a unified whole that consistently delivers the desired result.