This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. Tools and programming languages covered include Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, Scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.
PREREQUISITES
Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the
TARGET AUDIENCE
Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Hadoop.
FORMAT
50% Lecture/Discussion
50% Hands-On Labs
AGENDA SUMMARY
Day 1: Introduction to Big Data, the Data Science Life Cycle and Pig
Day 2: Algorithms in Spark ML and Scikit-Learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees
Day 3: Python Programming and Machine Learning Algorithms
Day 4: Python Programming (Continued), Natural Language Processing
Day 5: Spark ML & Zeppelin
DAY 1 OBJECTIVES
List Data Science Use Cases
Define Data Science and What a Data Scientist Does
List Reasons to use Hadoop for Data Science
Describe the Hadoop Distributed File System (HDFS)
Describe Block Storage
Describe the Function and Purpose of NameNodes and DataNodes
List Common HDFS Commands
Describe MapReduce, the Map Phase and the Reduce Phase
Describe Hadoop Streaming and MapReduce
Define HDFS Federation
Explain How NameNode High Availability is Implemented
Define YARN
Define Apache Slider
Describe Machine Learning and How Machines Learn
List Examples of Machine Learning Tasks
Describe Hadoop Machine Learning Capabilities
Describe the Data Science Life Cycle Process Flow
Describe Apache Pig
Describe Pig Latin
Define a Schema
Use Common Pig Operators
LABS AND DEMONSTRATIONS
Setting Up a Development Environment
Using HDFS Commands
Demonstration: Understanding MapReduce
Using Mahout for Machine Learning
DAY 2 OBJECTIVES
Discuss categories and use cases of the various ML Algorithms
Understand Linear Regression, Logistic Regression, and Support Vectors
Understand Decision Trees and their limitations
Understand Nearest-Neighbors
Discuss and demonstrate a Spam Classifier
LABS AND DEMONSTRATIONS
Linear Regression as a Projection
Logistic Regression
Support Vectors
Decision Trees
Linear Regression as a Classifier
DAY 3 OBJECTIVES
List and Describe Python Programming Concepts
Import Python Modules
Develop Python Code
List the Components of the Scientific Python Ecosystem:
NumPy
Pandas
SciPy Library
matplotlib
List options for running Python on Hadoop
Invoke Python using Hadoop Streaming
Invoke Python using Pig User Defined Functions (UDFs)
Invoke Python Using the Pig STREAM Command
Describe Hadoop Machine Learning Tools
Describe the Scikit-Learn Library
Describe Machine Learning Algorithms:
Recommender Systems
Support Vector Machines
Naïve Bayes
Nearest Neighbor
Deploy Python Machine Learning Algorithms on Hadoop
LABS AND DEMONSTRATIONS
Getting Started with Apache Pig
Exploring Data with Apache Pig
Using the IPython Notebook
Demonstration: Understanding the NumPy Package
Demonstration: The Pandas Library
Data Analysis with Python
Interpolating Data Points
Defining a Pig UDF in Python
Streaming Python with Apache Pig
Demonstration: Classification with Scikit-Learn
Computing K-Nearest Neighbor
Generating a K-Means Clustering
DAY 4 OBJECTIVES
Define Natural Language Processing
Describe Common NLP Tasks
Use the Natural Language Toolkit
LABS AND DEMONSTRATIONS
Demonstration: POS Tagging Using a Decision Tree
Using the Python Natural Language Toolkit
Classifying Text Using Naïve Bayes
DAY 5 OBJECTIVES
Describe Apache Spark
List Spark Components and their Functionalities
Use Resilient Distributed Datasets
Describe the Spark MLlib
Perform Spark Operations
Describe the Process to Implement a Data Science Production Deployment for: