Getting Started with the Spark Web UI

Summary

This post will help you get started using the Apache Spark Web UI to understand how your Spark application is executing on a Hadoop cluster. The Spark Web UI displays useful information about your application, including:

  • A list of scheduler stages and tasks
  • A summary of RDD sizes and memory usage
  • Environmental information
  • Information about the running executors

This post will go over:

  • Components and lifecycle of a Spark program
  • How your Spark application runs on a Hadoop cluster
  • Using the Spark web UI to view the behavior and performance of your Spark application

This post assumes a basic understanding of Spark concepts. If you have not already read the tutorial on Getting Started with Spark on MapR Sandbox, it would be good to read that first.

Software

This tutorial runs on the MapR Sandbox. Sandbox version 4.1 requires Spark to be installed separately; version 5 includes Spark.

Spark Components of Execution

Before looking at the web UI, you need to understand the components of execution for a Spark application. Let's go over them using the word count example from the Getting Started with Spark on MapR Sandbox tutorial, shown in the image below.

Apache Spark web UI

To run this example, first download the text file at the command line:

wget -O /user/user01/alice.txt http://www.gutenberg.org/cache/epub/11/pg11.txt

Then you can enter the code shown below in the Spark shell. The example code results in the wordcount RDD, which defines a directed acyclic graph (DAG) of RDDs that will be used later when an action is called. Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph. You can print out this RDD lineage with toDebugString, as shown below.

val file = sc.textFile("/user/user01/alice.txt").cache()

val wordcount = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// show the RDD and its recursive dependencies for debugging
wordcount.toDebugString
// res0: String =
// (2) ShuffledRDD[4] at reduceByKey at <console>:23 []
// +-(2) MapPartitionsRDD[3] at map at <console>:23 []
//    |  MapPartitionsRDD[2] at flatMap at <console>:23 []
//    |  /user/user01/alice.txt MapPartitionsRDD[1] at textFile at <console>:21 []
//    |  /user/user01/alice.txt HadoopRDD[0] at textFile at <console>:21 []

wordcount.collect()

The first RDD in the lineage, HadoopRDD, was created by calling sc.textFile(); the last RDD is the ShuffledRDD created by reduceByKey. (The number in parentheses in the toDebugString output is the number of partitions of each RDD.) Below on the left is a diagram of the DAG for the wordcount RDD; the green inner rectangles in the RDDs represent partitions. When the action collect() is called, Spark's scheduler creates a Physical Execution Plan for the job, shown on the right, to execute the action.

HadoopRDD graph

The scheduler splits the RDD graph into stages based on the transformations. Narrow transformations (transformations that require no data movement) are grouped (pipelined) together into a single stage. This physical plan has two stages, with everything before the ShuffledRDD in the first stage.

shuffledRDD graph
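To see where a stage boundary falls for yourself, you can compare a chain of narrow transformations with one that shuffles. The snippet below is a small sketch to paste into the Spark shell (the nums RDD and its partition count are invented for illustration); the indentation change in the toDebugString output marks the shuffle boundary between stages.

// Narrow transformations (map, filter) pipeline into a single stage;
// a wide transformation (reduceByKey) introduces a shuffle and a new stage.
val nums = sc.parallelize(1 to 1000, 4)                          // an RDD with 4 partitions
val pipelined = nums.map(_ * 2).filter(_ % 3 == 0)               // narrow: stays in one stage
val counts = pipelined.map(n => (n % 10, n)).reduceByKey(_ + _)  // wide: starts a new stage
counts.toDebugString                                             // the indentation marks the shuffle (stage) boundary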

Each stage is made up of tasks, one per partition of the RDD, which perform the same computation in parallel. The scheduler submits the stage's task set to the task scheduler, which launches the tasks via a cluster manager.

Here is a summary of the components of execution:

  • Task: a unit of execution that runs on a single machine
  • Stage: a group of tasks, based on partitions of the input data, which will perform the same computation in parallel
  • Job: has one or more stages
  • Pipelining: collapsing of RDDs into a single stage, when RDD transformations can be computed without data movement
  • DAG: Logical graph of RDD operations
  • RDD: Parallel dataset with partitions
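As a quick check of how these pieces line up for the word count example, you can inspect partition counts in the shell; a stage launches one task per partition. This sketch assumes the file and wordcount RDDs defined earlier are still in scope.

// One task is launched per partition of each stage's final RDD.
file.partitions.length        // number of tasks in the first (pipelined) stage
wordcount.partitions.length   // number of tasks in the second (shuffle) stage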

How your Spark Application runs on a Hadoop cluster

The diagram below shows Spark on an example Hadoop cluster:

Spark application on Hadoop

Here is how a Spark application runs:

  • A Spark application runs as a set of independent processes, coordinated by the SparkContext object in the driver program.
  • The task scheduler launches tasks via the cluster manager; in this case, the cluster manager is YARN.
  • The cluster manager assigns tasks to workers, one task per partition.
  • A task applies its unit of work to the elements in its partition and outputs a new partition.
    • Partitions can be read from an HDFS block, HBase, or another source and cached on a worker node (data does not have to be written to disk between tasks as it does with MapReduce).
  • Results are sent back to the driver application, as the sketch after this list illustrates.
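To make the last point concrete, actions differ in how much data they send back to the driver. The sketch below (reusing the wordcount RDD from earlier) pulls back only the ten most frequent words instead of collecting the entire result:

// take() and collect() return results to the driver, so keep what you pull back small.
val top10 = wordcount.map { case (word, count) => (count, word) }.sortByKey(ascending = false).take(10)
top10.foreach(println)   // only ten (count, word) pairs travel back to the driver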

Using the Spark web UI to view the behavior and performance of your Spark application

You can view the behavior of your Spark application in the Spark web UI at http://<host ip>:4040. Here is a screenshot of the web UI after running the word count job. Under the "Jobs" tab, you see a list of jobs that have been scheduled or run; in this example, it is the word count collect job. The Jobs table displays job, stage, and task progress.

Spark Jobs

Under the Stages tab, you can see the details for each stage. Below is the stages page for the word count job. Stage 0 is named after the last RDD in its stage pipeline, and Stage 1 is named after the action.

Details for Spark Job

You can view cached RDDs in the Storage tab. The text file RDD appears here because we called cache() on it and the collect() action materialized it.

Spark RDD storage
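If you want a cached RDD to be easier to spot in the Storage tab, you can give it a name before caching it. This is a small optional sketch, not part of the original example; the name is made up.

// setName() labels the RDD in the web UI's Storage tab.
val lines = sc.textFile("/user/user01/alice.txt").setName("alice.txt lines").cache()
lines.count()   // an action materializes the cache, so the named RDD now shows up under Storage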

Under the Executors tab, you can see processing and storage for each executor. You can look at the thread call stack by clicking on the thread dump link.

Spark Executors
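You can also get a rough, programmatic view of executor memory from the shell. The sketch below uses SparkContext's getExecutorMemoryStatus, which reports, for each executor, the maximum memory available for caching and the amount remaining:

// Prints (max memory for caching, remaining memory) per executor, in bytes.
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
  println(s"$executor: max=$maxMem bytes, remaining=$remaining bytes")
}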

Summary

This concludes the Getting Started with the Spark Web UI tutorial. If you have any further questions about using the Spark web UI, please ask them in the comments section below.
