DA 4500 - Data Analysis with Apache Pig and Apache Hive


About this course

This course covers how to use Apache Pig and Apache Hive together in a single data flow on a Hadoop cluster. The course begins with manipulating semi-structured raw data files in Pig, using the Grunt shell and the Pig Latin language. Once the raw data has been shaped into structured tables, those tables are exported from Pig and imported into Hive, where the data can be queried and some basic data analysis performed.
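As a minimal sketch of that flow (all file, field, and table names here are hypothetical), the Pig side might look like:

```pig
-- Pig (Grunt shell): load a semi-structured, tab-delimited log file
raw = LOAD '/data/web_logs.tsv' USING PigStorage('\t')
      AS (ip:chararray, ts:chararray, url:chararray, bytes:int);

-- Keep only well-formed records and store the structured result
clean = FILTER raw BY bytes IS NOT NULL;
STORE clean INTO '/data/clean_logs' USING PigStorage('\t');
```

and the Hive side might then expose and query that output:

```sql
-- Hive: expose the Pig output as an external table and query it
CREATE EXTERNAL TABLE web_logs (
  ip STRING, ts STRING, url STRING, bytes INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/clean_logs';

SELECT url, SUM(bytes) AS total_bytes
FROM web_logs
GROUP BY url;
```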

Prerequisites for Success in the Course

Review the following prerequisites carefully and decide whether you are ready to succeed in this programming-oriented course. The instructor will move forward with lab exercises assuming that you have mastered the skills listed below.

  • Required:
    • Familiarity with a command-line interface, such as a Unix shell
    • Familiarity with relational databases (RDBMS) and SQL
    • Access to, and the ability to use, a laptop with an internet connection and a terminal program installed (such as Terminal on macOS, or PuTTY on Windows)

Right for you?

  • For data analysts and developers interested in the data pipeline
  • For data scientists and business analysts who are familiar with SQL and want to work with data stored in HDFS
  • This is a programming course; you must have some programming experience to do the exercises

What’s next?

This course prepares you for the MapR Certified Data Analyst (MCDA) certification exam.

Included in this two-day course are:

  • Access to a multi-node Amazon Web Services (AWS) cluster
  • Slide guides
  • Lab guides
  • Lab files

Day 1
  • Lesson 1 – Describe how Apache Pig fits in the Hadoop ecosystem
    • Understand the data pipeline
    • Understand the Pig Philosophy
  • Lesson 2 – Extract, Transform, and Load Data with Apache Pig
    • Load data into relations
    • Debug Pig scripts
    • Perform simple manipulations
    • Save relations as files
  • Lesson 3 – Manipulate Data with Apache Pig
    • Subset relations
    • Combine relations
    • Use UDFs on relations
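As a taste of the Day 1 material, the manipulations above might look like this in Pig Latin (the relation names, file paths, and schemas are hypothetical; UPPER is one of Pig's built-in functions):

```pig
users  = LOAD '/data/users.tsv'  AS (id:int, name:chararray, age:int);
orders = LOAD '/data/orders.tsv' AS (id:int, user_id:int, total:double);

-- Subset a relation
adults = FILTER users BY age >= 18;

-- Combine relations
joined = JOIN adults BY id, orders BY user_id;

-- Apply a built-in UDF to a field
names = FOREACH joined GENERATE UPPER(adults::name) AS name, orders::total;

DUMP names;
```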

Day 2
  • Lesson 1 – Describe how Apache Hive fits in the Hadoop ecosystem
    • Understand the data pipeline
    • Describe other SQL-on-Hadoop tools
  • Lesson 2 – Create tables and load data in Apache Hive
    • Create databases
    • Create simple, external, and partitioned tables
    • Alter and drop tables
  • Lesson 3 – Query data with Apache Hive
    • Query tables
    • Manipulate tables with UDFs
    • Combine and store tables
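The Day 2 topics above might look like this in HiveQL (the database, table, and path names are hypothetical; ROUND is one of Hive's built-in functions):

```sql
-- Create a database and a partitioned table
CREATE DATABASE IF NOT EXISTS sales;
USE sales;

CREATE TABLE orders (id INT, total DOUBLE)
PARTITIONED BY (order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load data into one partition
LOAD DATA INPATH '/data/orders_day1.tsv'
INTO TABLE orders PARTITION (order_date = '2015-01-01');

-- Query with a built-in UDF and store the result in a new table
CREATE TABLE daily_totals AS
SELECT order_date, ROUND(SUM(total), 2) AS revenue
FROM orders
GROUP BY order_date;
```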