Apache Spark™

Apache Spark is a general-purpose data processing engine for Hadoop that allows users to analyze large data sets with very high performance. One common use case for Spark is executing MapReduce-style workloads, achieving high-performance batch processing in Hadoop. In addition to batch processing, Spark brings several new features and benefits:

  • Generalized Workflows: In contrast to MapReduce, where every job is expressed in rigid map and reduce semantics, Spark jobs can describe arbitrary workflows with one, two, or many stages of processing. More details on this below.
  • High Performance: Spark has several architectural innovations that enable high performance. First, Spark allows data to be held in memory, drastically increasing performance when a data set needs to be manipulated through multiple stages of processing. Next, Spark's generalized workflows greatly improve performance for jobs that don't fit perfectly into the MapReduce paradigm.
  • Simple Programming: Spark exposes a simple, abstract API and natively supports Java, Scala, and Python. Users can get started in an interactive shell for either Scala or Python, allowing them to explore data interactively (a brief sketch follows this list).
  • Extensible: Spark serves as the foundation for many higher-level projects, such as Shark (SQL), MLlib (machine learning), GraphX (graph processing), and Spark Streaming (stream processing). Each of these is implemented as a library on top of Spark, allowing developers to write jobs that take advantage of two or more projects simultaneously.
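As an illustration of the interactive style, here is a minimal, hypothetical session in the Scala shell (spark-shell), where a SparkContext is already available as sc; the input path is a placeholder:

    // Load a text file from the distributed file system as an RDD of lines.
    // The path is a placeholder; point it at any file the cluster can read.
    val logs = sc.textFile("hdfs:///data/app.log")

    // Keep only the lines containing "ERROR", then count them.
    val errors = logs.filter(line => line.contains("ERROR"))
    println(errors.count())

The same logic can be written almost identically in Python from the pyspark shell, or in Java as a compiled application.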

Anatomy of a Spark Job

Spark features an advanced Directed Acyclic Graph (DAG) execution engine that supports cyclic data flow. Each Spark job creates a DAG of task stages to be performed on the cluster. Whereas MapReduce creates a DAG with exactly two predefined stages, Map and Reduce, the DAGs created by Spark can contain any number of stages. This allows some jobs to complete faster than they would in MapReduce: simple jobs finish after just one stage, and more complex jobs complete in a single run of many stages rather than having to be split into multiple jobs.
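As a sketch of how a multi-stage DAG arises, consider the classic word count below, written in Scala with placeholder paths. The shuffle introduced by reduceByKey() creates one stage boundary and sortByKey() adds another; in MapReduce, the final sort would typically require a separate job:

    val counts = sc.textFile("hdfs:///data/books")     // placeholder input path
      .flatMap(line => line.split("\\s+"))             // split lines into words
      .map(word => (word, 1))                          // pair each word with a count of 1
      .reduceByKey(_ + _)                              // shuffle: first stage boundary
      .sortByKey()                                     // shuffle: second stage boundary
    counts.saveAsTextFile("hdfs:///data/wordcounts")   // placeholder output path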

Spark jobs perform work on Resilient Distributed Datasets (RDDs), an abstraction for a collection of elements that can be operated on in parallel. When running Spark in a Hadoop cluster, RDDs are created from files in the distributed file system in any Hadoop-supported format: text files, SequenceFiles, or anything else readable through a Hadoop InputFormat.
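A few of the ways an RDD can be created from Hadoop-readable storage are sketched below; the paths are placeholders:

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // Plain text files: one record per line.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // SequenceFiles of (key, value) Writable records.
    val pairs = sc.sequenceFile("hdfs:///data/seqfiles", classOf[IntWritable], classOf[Text])

    // Anything else readable through a Hadoop InputFormat.
    val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/custom")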

Once data is read into an RDD object in Spark, a variety of operations can be performed on it through Spark's abstract API. The two major types of operation available are:

  • Transformations: Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union().
  • Actions: Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach(). Both kinds of operation are illustrated in the sketch after this list.
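A minimal sketch of the difference, using a small in-memory data set created with parallelize(): transformations only describe new RDDs and run lazily, while actions trigger the actual computation and return a result.

    val nums = sc.parallelize(1 to 100)      // an RDD of the integers 1 through 100
    val evens = nums.filter(_ % 2 == 0)      // transformation: a new RDD, nothing runs yet
    val doubled = evens.map(_ * 2)           // transformation: still lazy
    println(doubled.reduce(_ + _))           // action: runs the job and prints 5100
    println(doubled.count())                 // action: prints 50
    println(doubled.first())                 // action: prints 4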

Some Spark jobs will require that several actions or transformations be performed on a particular data set, making it highly desirable to hold RDDs in memory for rapid access. Spark exposes a simple method to do this: cache(). Calling cache() marks an RDD to be kept in memory; the data is materialized the first time an action computes it, and subsequent operations on the RDD then return in a fraction of the time they would take if the data were re-read from disk.
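A brief sketch of this pattern, with a placeholder input path:

    val purchases = sc.textFile("hdfs:///data/events")   // placeholder path
      .filter(line => line.contains("purchase"))
    purchases.cache()             // mark the RDD for in-memory storage
    println(purchases.count())    // first action: reads from disk and populates the cache
    println(purchases.count())    // second action: served from memory, typically much faster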