DEV 3600 - Developing Apache Spark Applications


Upcoming sessions:

  • Feb 22: Apache Spark Essentials, Tokyo, Japan, 99,500 yen (+ tax)
  • Feb 22-24: Tokyo, Japan, 298,500 yen (+ tax)
  • Mar 6-8: London, UK, £1695 + VAT

About this course

This course enables developers to start building big data applications with Apache Spark. In the first part of the course, you will use Spark's interactive shell to load and inspect data. The course then describes the various modes for launching a Spark application, and you will build and launch a standalone Spark application. The concepts are taught through scenarios that also form the basis of the hands-on labs.

Right for you?

  • For developers interested in designing and developing Spark applications.
  • This is a programming course; you must have Java programming experience to do the exercises.


Prerequisites for Success in the Course

Review the following prerequisites carefully and decide whether you are ready to succeed in this programming-oriented course. The instructor will proceed with the lab exercises on the assumption that you have mastered the skills listed below.


  • Basic to intermediate Linux knowledge, including the ability to use a text editor such as vi, and familiarity with basic commands such as mv, cp, ssh, grep, cd, and useradd
  • Knowledge of application development principles
  • A Linux, Windows or MacOS computer with the MapR Sandbox installed (for the on-demand course)
  • Connection to a Hadoop cluster via SSH and web browser (for the ILT and vILT courses)


Recommended:

  • Knowledge of functional programming
  • Knowledge of Scala or Python
  • Basic fluency with SQL
  • Completion of HDE 100 - Hadoop Essentials


Included in this 3-day course are:

  • Access to a multi-node Amazon Web Services (AWS) cluster
  • Slide Guide (PDF)
  • Lab Guide (PDF)
  • Lab Code

Course Outline

Day 1

  • Lesson 1 – Introduction to Apache Spark
    • Describe the features of Apache Spark
      • Advantages of Spark
      • How Spark fits in with the Big Data application stack
      • How Spark fits in with Hadoop
    • Define Apache Spark components
  • Lesson 2 – Load and Inspect Data in Spark
    • Describe different ways of getting data into Spark
    • Create and use Resilient Distributed Datasets (RDDs)
    • Apply transformations to RDDs
    • Use actions on RDDs
      • Lab: Load and inspect data in RDD
    • Cache intermediate RDDs
    • Use Spark DataFrames for simple queries
      • Lab: Load and inspect data in DataFrames
  • Lesson 3 – Build a Simple Spark Application
    • Define the lifecycle of a Spark program
    • Define the function of SparkContext
      • Lab: Create the application
    • Define different ways to run a Spark application
    • Run your Spark application
      • Lab: Launch the application
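Lesson 2 hinges on the split between transformations (lazy) and actions (eager). A dependency-free Python sketch of that evaluation model, using a toy `LazyRDD` class (an illustrative stand-in, not Spark's API):

```python
class LazyRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded lazily
    and only evaluated when an action is called, mirroring Spark's model."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def map(self, f):      # transformation: returns a new RDD, runs nothing
        return LazyRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):   # transformation: returns a new RDD, runs nothing
        return LazyRDD(self.data, self.ops + [("filter", f)])

    def collect(self):     # action: the recorded pipeline executes here
        out = self.data
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

    def count(self):       # action: forces evaluation, then counts
        return len(self.collect())

lines = ["spark is fast", "", "hadoop integrates"]
rdd = LazyRDD(lines).filter(lambda l: l != "").map(str.upper)
print(rdd.count())    # 2
print(rdd.collect())  # ['SPARK IS FAST', 'HADOOP INTEGRATES']
```

In real Spark the equivalent pipeline would be built from `sc.textFile(...)`, with the work deferred until the action in the same way; that deferral is also why caching an intermediate RDD (covered in the same lesson) pays off when it is reused.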

Day 2

  • Lesson 4 – Work with Pair RDDs
    • Describe pair RDDs
    • Explain why to use pair RDDs
    • Create pair RDDs
    • Apply transformations and actions to pair RDDs
    • Control partitioning across nodes
    • Change partitions
    • Determine the partitioner
  • Lesson 5 - Work with Spark DataFrames
    • Create Apache Spark DataFrames
    • Work with data in DataFrames
    • Create user-defined functions
    • Repartition a DataFrame
  • Lesson 6 - Monitor a Spark Application
    • Describe the components of the Spark execution model
    • Use the Spark UI to monitor a Spark application
    • Debug and tune Spark applications
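Lesson 4's core pattern is reducing a pair RDD by key, as Spark's `reduceByKey` does. The semantics can be sketched locally on a plain list of (key, value) tuples (the `reduce_by_key` helper is illustrative, not Spark code):

```python
def reduce_by_key(pairs, func):
    """Mimic Spark's reduceByKey on a local list of (key, value) tuples:
    all values sharing a key are merged with a binary function.
    (Output is sorted here for determinism; Spark guarantees no order.)"""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

# Word count, the canonical pair-RDD example
words = ["spark", "rdd", "spark", "spark", "rdd"]
pairs = [(w, 1) for w in words]
print(reduce_by_key(pairs, lambda a, b: a + b))  # [('rdd', 2), ('spark', 3)]
```

In a cluster the merge function runs per partition before results are shuffled, which is why controlling partitioning (also in Lesson 4) matters for performance.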

Day 3

  • Lesson 7 – Introduction to Apache Spark Data Pipelines
    • Identify components of Apache Spark Unified Stack
    • Describe the benefits of the Apache Spark Unified Stack over the Hadoop ecosystem
    • Describe data pipeline use cases
  • Lesson 8 – Create an Apache Spark Streaming Application
    • Describe the Spark Streaming architecture
    • Create DStreams
    • Create a simple Spark Streaming application
      • Lab: Create a Spark Streaming application
    • Apply DStream operations
      • Lab: Apply operations on DStreams
    • Use Spark SQL to query DStreams
    • Define window operations
      • Lab: Add windowing operations
    • Describe how DStreams are fault-tolerant
  • Lesson 9 – Use Apache Spark GraphX to Analyze Flight Data
    • Describe GraphX
    • Define a property graph
      • Lab: Create a property graph
    • Perform operations on graphs
      • Lab: Apply graph operations
  • Lesson 10 – Use Apache Spark MLlib to Predict Flight Delays
    • Describe Spark MLlib
    • Describe a generic classification workflow
    • Describe common terms for supervised learning
    • Use a decision tree for classification and regression
    • Lab: Create a DecisionTree model to predict flight delays on streaming data
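Lesson 8's window operations aggregate over the last N micro-batches of a DStream. The sliding-window behavior can be sketched locally with a bounded deque (`windowed_counts` and `window_size` are illustrative names, not Spark parameters):

```python
from collections import deque

def windowed_counts(batches, window_size):
    """Mimic a windowed count over a DStream: after each micro-batch
    arrives, report the total records across the last `window_size` batches."""
    window = deque(maxlen=window_size)  # oldest batch falls off automatically
    totals = []
    for batch in batches:
        window.append(batch)
        totals.append(sum(len(b) for b in window))
    return totals

# Four micro-batches; the window covers the two most recent
batches = [["a", "b"], ["c"], ["d", "e", "f"], []]
print(windowed_counts(batches, window_size=2))  # [2, 3, 4, 3]
```

Spark Streaming expresses the same idea with window and slide durations measured in time rather than batch counts, and recomputes lost window state from lineage, which is what makes DStreams fault-tolerant.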