About this course
This course is the second in the Apache Spark series. You will learn to create and modify pair RDDs, perform aggregations, and control the layout of pair RDDs across nodes with data partitioning. The course also covers Spark SQL and DataFrames, the programming abstraction of Spark SQL. You will learn the different ways to load data into DataFrames, perform operations on DataFrames using DataFrame functions, actions, and language-integrated queries, and create and use user-defined functions with DataFrames. Finally, the course describes the components of the Spark execution model and shows how to use the Spark Web UI to monitor Spark applications. Concepts are taught through scenarios in Scala that also form the basis of the hands-on labs. Lab solutions are provided in Scala and Python.
Right for you?
- For application developers
Prerequisites for success in the course:
- DEV 360 - Apache Spark Essentials
- Basic to intermediate Linux knowledge, including:
  - The ability to use a text editor, such as vi
  - Familiarity with basic commands such as mv, cp, ssh, grep, cd, and useradd
- Knowledge of application development principles
- A Linux, Windows, or macOS computer with the MapR Sandbox installed (on-demand course)
- A connection to a Hadoop cluster via SSH and a web browser (for the ILT and vILT courses)
- Knowledge of functional programming
- Knowledge of Scala or Python
- Beginner fluency with SQL
- ESS 100 – Introduction to Big Data
This course helps prepare you for the MCSD – MapR Certified Spark Developer certification exam.
Work with Pair RDDs
Work with DataFrames
Monitor Apache Spark Applications