This post will help you get started using the Apache Spark Web UI to understand how your Spark application is executing on a Hadoop cluster. The Spark Web UI displays useful information about your application, including:
- A list of scheduler stages and tasks
- A summary of RDD sizes and memory usage
- Environmental information
- Information about the running executors
This post will go over:
- Components and lifecycle of a Spark program
- How your Spark application runs on a Hadoop cluster
- Using the Spark web UI to view the behavior and performance of your Spark application
This post assumes a basic understanding of Spark concepts. If you have not already read the tutorial on Getting Started with Spark on MapR Sandbox, it would be good to read that first.
This tutorial will run on the MapR Sandbox. Version 4.1 needs Spark to be installed , Version 5 includes Spark.
- The example in this post can be run in the spark-shell
- You can also run the code as a standalone application as described in the tutorial on Getting Started with Spark on MapR Sandbox
Spark Components of Execution
Before looking at the web UI, you need to understand the components of execution for a Spark application. Let’s go over this using the word count example from the Getting Started with Spark on MapR Sandbox tutorial, shown in the image below.
In order to run this example, first at the command line you can get the text file with this command:
Then you can enter the code shown below in the Spark shell. The example code results in the wordcounts RDD, which defines a directed acyclic graph (DAG) of RDDs that will be used later when an action is called. Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph. You can print out this RDD lineage with toDebugString as shown below.
The first RDD, HadoopRDD, was created by calling sc.textFile(), the last RDD in the lineage is the ShuffledRDD created by reduceByKey. Below on the left is a diagram of the DAG graph for the wordcount RDD; the green inner rectangles in the RDDs represent partitions. When the action collect() is called, Spark’s scheduler creates a Physical Execution Plan for the job, shown on the right, to execute the action.
The scheduler splits the RDD graph into stages, based on the transformations. The narrow transformations (transformations without data movement) will be grouped (pipe-lined) together into a single stage. This physical plan has two stages, with everything before ShuffledRDD in the first stage.
Each stage is comprised of tasks, based on partitions of the RDD, which will perform the same computation in parallel. The scheduler submits the stage task set to the task scheduler, which launches tasks via a cluster manager.
Here is a summary of the components of execution:
- Task: a unit of execution that runs on a single machine
- Stage: a group of tasks, based on partitions of the input data, which will perform the same computation in parallel
- Job: has one or more stages
- Pipelining: collapsing of RDDs into a single stage, when RDD transformations can be computed without data movement
- DAG: Logical graph of RDD operations
- RDD: Parallel dataset with partitions
How your Spark Application runs on a Hadoop cluster
The diagram below shows Spark on an example Hadoop cluster:
Here is how a Spark application runs:
- A Spark application runs as independent processes, coordinated by the SparkContext object in the driver program.
- The task scheduler launches tasks via the cluster manager; in this case it’s YARN.
- The cluster manager assigns tasks to workers, one task per partition.
- A task applies its unit of work to the elements in its partition, and outputs a new partition.
- Partitions can be read from an HDFS block, HBase or other source and cached on a worker node (data does not have to be written to disk between tasks like with MapReduce).
- Results are sent back to the driver application.
Using the Spark web UI to view the behavior and performance of your Spark application
You can view the behavior of your Spark application in the Spark web UI at http://<host ip>:4040. Here is a screen shot of the web UI after running the word count job. Under the "Jobs" tab, you see a list of jobs that have been scheduled or run, which in this example is the word count collect job. The Jobs table displays job, stage, and task progress.
Under the Stages tab, you can see the details for stages. Below is the stages page for the word count job, Stage 0 is named after the last RDD in the stage pipeline, and Stage 1 is named after the action.
You can view RDDs in the Storage tab.
Under the Executors tab, you can see processing and storage for each executor. You can look at the thread call stack by clicking on the thread dump link.
This concludes the Getting Started with the Spark web UI tutorial. If you have any further questions about using the Spark web UI, please ask them in the section below.
You can find more information about these technologies here: