This 5-day training course is designed for developers who need to create applications that analyze Big Data stored in Apache Hadoop using Apache Pig and Apache Hive, and to develop applications on Apache Spark. Topics include: an essential understanding of Big Data and its capabilities; Hadoop, YARN, HDFS, and MapReduce/Tez; data ingestion; using Pig and Hive to perform data analytics on Big Data; and an introduction to Spark Core, Spark SQL, Apache Zeppelin, and additional Spark features.
PREREQUISITES
Students should be familiar with programming principles and have experience in software development. SQL and light scripting knowledge is also helpful. No prior Hadoop knowledge is required.
TARGET AUDIENCE
Developers and data engineers who need to understand and develop applications on Big Data
FORMAT
50% Lecture/Discussion
50% Hands-On Labs
AGENDA SUMMARY
Day 1: Big Data Essentials and an Introduction to Pig
Day 2: Apache Hive
Day 3: Spark Programming
Day 4: Real-Time Processing with Spark Streaming
Day 5: Real-Time Processing with Spark Streaming Continued
DAY 1 OBJECTIVES
Describe the Case for Hadoop
Describe the Trends of Volume, Velocity and Variety
Discuss the Importance of Open Enterprise Hadoop
Describe the Hadoop Ecosystem Frameworks Across the Following Five Architectural Categories:
Data Management
Data Access
Data Governance & Integration
Security
Operations
Describe the Function and Purpose of the Hadoop Distributed File System (HDFS)
List the Major Architectural Components of HDFS and their Interactions
Describe Data Ingestion
Describe Batch/Bulk Ingestion Options
Describe the Streaming Framework Alternatives
Describe the Purpose and Function of MapReduce
Describe the Purpose and Components of YARN
Describe the Major Architectural Components of YARN and their Interactions
Define the Purpose and Function of Apache Pig
Work with the Grunt Shell
Work with Pig Latin Relation Names and Field Names
Describe the Pig Data Types and Schema
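The MapReduce objective above can be illustrated without a cluster. The following is a minimal word-count sketch in plain Python (an assumption of this outline, not course material) showing the map, shuffle, and reduce phases that Hadoop runs in parallel:

```python
from collections import defaultdict

# A word-count sketch of the MapReduce programming model in plain Python.
# This shows the phases only; a real job would be scheduled by YARN.

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # 3
```

Pig Latin scripts compile down to exactly this kind of map/shuffle/reduce pipeline, which is why Pig relations behave like grouped collections rather than rows in a table.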
LABS
Starting a Big Data Cluster
Using HDFS Commands
Demonstration: Understanding Apache Pig
Getting Started with Apache Pig
Exploring Data with Pig
DAY 2 OBJECTIVES
Demonstrate Common Operators Such as:
ORDER BY
CASE
DISTINCT
PARALLEL
FOREACH
Understand how Hive Tables are Defined and Implemented
Use Hive to Explore and Analyze Data Sets
Explain and Use the Various Hive File Formats
Create and Populate a Hive Table that Uses ORC File Formats
Use Hive to Run SQL-like Queries to Perform Data Analysis
Use Hive to Join Datasets Using a Variety of Techniques
Write Efficient Hive Queries
Explain the Uses and Purpose of HCatalog
Use HCatalog with Pig and Hive
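One of the join techniques covered above, the hash (map-side) join that Hive applies when one table is small enough to broadcast, can be sketched in plain Python. The tables below are made-up illustrations:

```python
# A sketch of a hash (map-side) join: build a hash table from the smaller
# table, then stream the larger one. Table contents are hypothetical.

users = [(1, "ana"), (2, "bo"), (3, "cy")]   # (user_id, name)
orders = [(101, 1), (102, 3), (103, 1)]      # (order_id, user_id)

# Build phase: hash the smaller table by join key.
names_by_id = {user_id: name for user_id, name in users}

# Probe phase: stream the larger table and look up each key.
joined = [(order_id, names_by_id[user_id])
          for order_id, user_id in orders
          if user_id in names_by_id]

print(joined)  # [(101, 'ana'), (102, 'cy'), (103, 'ana')]
```

Because the probe side is only streamed, this strategy avoids the shuffle that a reduce-side join would require, which is one reason it features in the "Write Efficient Hive Queries" objective.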
LABS
Splitting a Dataset
Joining Datasets
Preparing Data for Apache Hive
Understanding Apache Hive Tables
Demonstration: Understanding Partitions and Skew
Analyzing Big Data with Apache Hive
Demonstration: Computing Ngrams
Joining Datasets in Apache Hive
Computing Ngrams of Emails in Avro Format
Using HCatalog with Apache Pig
DAY 3 OBJECTIVES
Describe the Real-Time Architecture
Define the Purpose and Function of Apache Hadoop
Describe the Hadoop Ecosystem Frameworks
Describe the Role of Hadoop in the Datacenter
Describe the Hadoop Distributed File System (HDFS)
Detail the Major Architectural Components of HDFS and their Interactions
Demonstrate How to Use Apache Zeppelin with Apache Spark
List the Major Functions of Apache Zeppelin
Describe the Purpose and Benefits of Apache Spark
List the Spark High-Level Tools
Define Spark REPLs and Application Architecture
Explain the Purpose and Function of Resilient Distributed Datasets (RDDs)
List the Characteristics of an RDD
Explain Spark Programming Basics
Define and Use Basic Spark Transformations
Define and Use Basic Spark Actions
Describe an Anonymous Function
Invoke Functions for Multiple RDDs, Create Named Functions and Use Numeric Operations
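The transformation/action distinction in the objectives above can be mimicked with Python's lazy iterators (a teaching sketch, not the Spark API): `map` and `filter` build a lazy pipeline the way RDD transformations do, and only a terminal call such as `sum` forces evaluation, the way an action does:

```python
# A sketch of Spark's lazy evaluation model using plain Python iterators.
# map/filter play the role of RDD transformations (lazy); sum() plays the
# role of an action that actually triggers the computation.

data = range(1, 11)                             # stand-in for an RDD of 1..10

squared = map(lambda x: x * x, data)            # "transformation": nothing runs yet
evens = filter(lambda x: x % 2 == 0, squared)   # still lazy

total = sum(evens)                              # "action": pipeline executes here
print(total)  # 220 (4 + 16 + 36 + 64 + 100)
```

The anonymous functions passed to `map` and `filter` here correspond to the lambdas passed to Spark transformations, which is why the "Describe an Anonymous Function" objective precedes the RDD labs.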
LABS
Validating the Lab Environment
Using HDFS Commands
Introduction to Spark REPLs and Zeppelin
Creating and Manipulating RDDs
DAY 4 OBJECTIVES
Define and Create Pair RDDs
Perform Common Operations on Pair RDDs
Describe Spark Streaming
Create and View Basic Data Streams
Perform Basic Transformations on Streaming Data
Utilize Window Transformations on Streaming Data
Recognize Use Cases for Apache Kafka
Explain the Concept of a Topic Leader and Followers
Describe the Publication and Consumption of Kafka Messages
Describe the Function and Purpose of Apache HBase
List Apache HBase Key Features
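The window transformations listed above slide a fixed-length window over the batches of a stream. A plain-Python sketch (illustrative only; `window_size` here stands in for a Spark Streaming window of three batch intervals) shows the idea:

```python
from collections import deque

# A sketch of a sliding-window transformation over a "stream" of per-batch
# counts. Each output value aggregates the last `window_size` batches.

def windowed_sums(batches, window_size=3):
    window = deque(maxlen=window_size)  # oldest batch falls out automatically
    sums = []
    for batch in batches:
        window.append(batch)
        sums.append(sum(window))
    return sums

print(windowed_sums([1, 2, 3, 4, 5]))  # [1, 3, 6, 9, 12]
```

In Spark Streaming the same effect comes from windowed operations over DStream batches; the deque simply makes the "keep the last N batches" bookkeeping visible.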
LABS
Creating and Manipulating Pair RDDs
Basic Spark Streaming
Basic Spark Streaming Transformations
Spark Streaming Window Transformations
DAY 5 OBJECTIVES
List the Components of the Apache HBase Architecture
Describe Apache HBase as a Set of Value Mappings
Identify Apache HBase as Either a Row- or Column-Oriented Database
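The "set of value mappings" view of HBase above is often summarized as a sorted, multidimensional map. A dict-of-dicts sketch in plain Python (row key, then column family:qualifier, then value; the table contents are made up) captures the shape:

```python
# A sketch of HBase's data model as nested value mappings:
#   row key -> "family:qualifier" column -> value.
# Real HBase also versions each cell by timestamp and keeps rows sorted
# by row key; this toy table is hypothetical.

table = {
    "row1": {"info:name": "ana", "info:city": "Oslo"},
    "row2": {"info:name": "bo"},
}

# A "get" is two lookups: row key, then column.
print(table["row1"]["info:city"])  # Oslo

# Sparse rows cost nothing: row2 simply has no info:city cell.
print("info:city" in table["row2"])  # False
```

This also motivates the row- vs column-oriented question in the objectives: rows are retrieved by key, but within a row the data is organized and stored by column family.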