DA 4500 - Data Analysis with Apache Pig and Apache Hive

About this course

This course covers how to use Pig and Hive as part of a single data flow in a Hadoop cluster. The course begins with manipulating semi-structured raw data files in Pig, using the grunt shell and the Pig Latin programming language. Once the raw data has been manipulated into structured tables, those tables are exported from Pig and imported into Hive, where the structured data can be queried and some basic data analysis performed.
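As a rough illustration of that Pig-to-Hive handoff, the sketch below loads a raw tab-delimited file in Pig, writes the cleaned result back to HDFS, and exposes it to Hive as an external table. The file paths, field names, and schema here are hypothetical placeholders, not the actual lab data.

```pig
-- Pig side: load raw semi-structured data (path and schema are made up)
raw = LOAD '/user/train/web_logs.txt' USING PigStorage('\t')
      AS (ip:chararray, ts:chararray, url:chararray, bytes:int);

-- drop malformed records, then store the structured result for Hive
clean = FILTER raw BY url IS NOT NULL;
STORE clean INTO '/user/train/clean_logs' USING PigStorage('\t');
```

```sql
-- Hive side: an external table pointed at the directory Pig wrote
CREATE EXTERNAL TABLE clean_logs (
  ip STRING, ts STRING, url STRING, bytes INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/train/clean_logs';
```

Because the table is EXTERNAL, Hive reads the files in place; dropping the table later would not delete the data Pig produced.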

Prerequisites for Success in the Course

Review the following prerequisites carefully and decide whether you are ready to succeed in this programming-oriented course. The instructor will move through the lab exercises assuming that you have mastered the skills listed below.

  • Required:
    • Familiarity with a command-line interface, such as a Unix shell
    • Familiarity with relational database (RDBMS) concepts and SQL
    • Access to, and the ability to use, a laptop with an internet connection and a terminal program installed (such as terminal on the Mac, or PuTTY on Windows).
  • Recommended:

Right for you?

  • For data analysts and developers interested in the data pipeline
  • For data scientists and business analysts who are familiar with SQL and want to analyze data stored in HDFS
  • This is a programming course; you must have some programming experience to do the exercises

What’s next?

Certification

This course prepares you for the MapR Certified Hadoop Professional: Data Analyst (MCHP: DA) certification exam. This exam is coming soon.

Syllabus

Included in this two-day course are:

  • Access to a multi-node Amazon Web Services (AWS) cluster
  • Slide guides
  • Lab guides
  • Lab files
Day 1
  • Lesson 1 – Describe how Apache Pig fits in the Hadoop ecosystem
    • Understand the data pipeline
    • Understand the Pig Philosophy
  • Lesson 2 – Extract, Transform, and Load Data with Apache Pig
    • Load data into relations
    • Debug Pig scripts
    • Perform simple manipulations
    • Save relations as files
  • Lesson 3 – Manipulate Data with Apache Pig
    • Subset relations
    • Combine relations
    • Use UDFs on relations
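The Day 1 manipulations above (subsetting, combining, and applying UDFs to relations) can be sketched in Pig Latin roughly as follows. The file names, schemas, and threshold are illustrative assumptions only.

```pig
-- hypothetical lab-style data
users  = LOAD 'users.csv'  USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (id:int, user_id:int, total:double);

-- subset a relation with FILTER
big = FILTER orders BY total > 100.0;

-- combine relations with JOIN
joined = JOIN big BY user_id, users BY id;

-- apply a built-in UDF (UPPER) in a FOREACH ... GENERATE projection
named = FOREACH joined GENERATE big::id, UPPER(users::name) AS name, big::total;

-- debugging aids covered in Lesson 2
DESCRIBE named;   -- show the schema of the relation
DUMP named;       -- print the relation to the console
```

DESCRIBE and DUMP (along with ILLUSTRATE and EXPLAIN) are the standard grunt-shell tools for inspecting a script step by step before storing results.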
Day 2
  • Lesson 1 – Describe how Apache Hive fits in the Hadoop ecosystem
    • Understand the data pipeline
    • Describe other SQL-on-Hadoop tools
  • Lesson 2 – Create tables and load data in Apache Hive
    • Create databases
    • Create simple, external, and partitioned tables
    • Alter and drop tables
  • Lesson 3 – Query data with Apache Hive
    • Query tables
    • Manipulate tables with UDFs
    • Combine and store tables
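The Day 2 topics (databases, external and partitioned tables, and querying with UDFs) might look roughly like the HiveQL below. The database name, table layout, partition value, and paths are hypothetical examples, not the course's lab material.

```sql
-- create and select a database
CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;

-- an external, partitioned table over data already in HDFS
CREATE EXTERNAL TABLE logs (
  ip STRING, ts STRING, url STRING, bytes INT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/train/clean_logs';

-- register one day's partition
ALTER TABLE logs ADD PARTITION (dt = '2015-06-01');

-- query with a built-in UDF (parse_url) and an aggregation
SELECT parse_url(url, 'HOST') AS host, SUM(bytes) AS total_bytes
FROM logs
WHERE dt = '2015-06-01'
GROUP BY parse_url(url, 'HOST');
```

Restricting the WHERE clause to a partition column (dt here) lets Hive prune unneeded directories instead of scanning the whole table.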

 

Related Resources

  • MapR Sandbox for Hadoop
  • MapR Blog: Advice from the front
  • Apache Drill Website
  • On-demand Training: ESS 100 – Introduction to Big Data