Data Factory with Pig

Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team as a Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis.


Expected Duration
113 minutes

Hadoop is open source software for affordable supercomputing. It provides the distributed file system and the parallel processing required to run a massive computing cluster. This course explains Pig as a data flow scripting tool for interfacing with Hadoop. You’ll learn how to install and configure Pig and explore a demonstration of Pig in action. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.


The Purpose of Pig

  • start the course
  • describe Pig and its strengths

Setup for Pig

  • recall the minimal edits that need to be made to the configuration file
  • install and configure Pig

Details of Pig

  • recall the complex data types used by Pig
  • recall some of the relational operators used by Pig

Operations for Pig

  • use the Grunt shell with Pig Latin
  • set parameters both from a text file and on the command line
  • write a Pig script
  • use a Pig script to filter data
  • use the FOREACH operator with a Pig script
  • set parameters and arguments in a Pig script
  • write a Pig script to count data

Working with Pig Operators

  • perform data joins using a Pig script
  • group data using a Pig script
  • cogroup data with a Pig script
  • flatten data using a Pig script
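
As a rough illustration of what the grouping operators listed above produce, the following Python sketch imitates the output shape of GROUP, COGROUP, and FLATTEN on small in-memory lists. It is not Pig Latin and does not use any Pig API. The function names and the sample data are hypothetical, and keys are sorted only to make the result predictable (Pig itself does not guarantee an output order).

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Imitate GROUP: one (key, bag) pair per distinct key."""
    bags = defaultdict(list)
    for record in relation:
        bags[record[key_index]].append(record)
    # Sorted only so the sketch is deterministic.
    return [(key, bags[key]) for key in sorted(bags)]


def cogroup(left, left_key, right, right_key):
    """Imitate COGROUP: one (key, left_bag, right_bag) triple per key
    found in either relation; a missing side yields an empty bag."""
    left_bags = dict(group_by(left, left_key))
    right_bags = dict(group_by(right, right_key))
    keys = sorted(set(left_bags) | set(right_bags))
    return [(k, left_bags.get(k, []), right_bags.get(k, [])) for k in keys]


def flatten(grouped):
    """Imitate FLATTEN applied to the bag of a grouped relation:
    each inner tuple becomes its own row, prefixed by the group key."""
    return [(key,) + record for key, bag in grouped for record in bag]


# Hypothetical sample data: (customer, amount) and (customer, city).
orders = [("ann", 10), ("bob", 5), ("ann", 7)]
cities = [("ann", "Oslo"), ("cat", "Rome")]

grouped = group_by(orders, 0)
cogrouped = cogroup(orders, 0, cities, 0)
flattened = flatten(grouped)
```

The point of the sketch is the nesting: grouping wraps matching records in a bag per key, cogrouping keeps one bag per input relation, and flattening removes one level of nesting again.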

User-defined Functions for Pig

  • recall the languages that can be used to write user-defined functions
  • create a user-defined function for Pig
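
Python is one of the languages in which Pig user-defined functions can be written, so a minimal sketch of such a function might look like the following. The function name `email_domain` and its schema string are hypothetical. The fallback decorator is an assumption added so the file can also be run and tested outside Pig, where no `outputSchema` decorator is provided.

```python
try:
    # Provided by Pig when the script runs as a streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Assumed stand-in for running outside Pig: it only records the
    # schema string on the function and leaves behavior unchanged.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain part of an e-mail address,
    or None when the input is missing or has no '@' sign."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()
```

In a Pig script, a file like this would typically be made available with a REGISTER statement and then called inside a FOREACH ... GENERATE expression. The exact registration options depend on the Pig version and on which Python engine is used.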

Troubleshooting for Pig

  • recall the different categories of errors
  • use the EXPLAIN operator in a Pig script

Practice: Configuring and Using Pig

  • install Pig, use Pig operators and Pig Latin, and retrieve and group records





