This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces many of the core concepts behind today’s most commonly used algorithms and applies them in practical settings. We’ll discuss concepts and key algorithms in all of the major areas (Classification, Regression, Clustering, and Dimensionality Reduction) and include a primer on Neural Networks. We’ll cover both single-server tools and frameworks (Python, NumPy, pandas, SciPy, Scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin).
TARGET AUDIENCE
Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Spark/Hadoop.
AGENDA SUMMARY
Day 1 : An Introduction to Data Science, SciKit-Learn, HDFS, Reviewing Spark apps, DataFrames and NoSQL
Day 2 : Algorithms in Spark ML and SciKit-Learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees
Day 3 : K-Means & GMM Clustering, Essential TensorFlow, NLP with NLTK, NLP with Stanford CoreNLP
Day 4 : Hyper-Parameter Tuning, K-Fold Validation, Ensemble Methods, ML Pipelines in SparkML
DAY 1 OBJECTIVES
Discuss aspects of Data Science, the team members, and the team roles
Discuss use cases for Data Science
Discuss the current State of the Art and its future direction
Review HDFS, Spark, Jupyter, and Zeppelin
Work with SciKit-Learn, Pandas, NumPy, Matplotlib, and Seaborn
LABS
Hello, ML w/ SciKit-Learn
Spark REPLs, Spark Submit, & Zeppelin Review
HDFS Review
Spark DataFrames and Files
NiFi Review
DAY 2 OBJECTIVES
Discuss categories and use cases of the various ML Algorithms
Understand Linear Regression, Logistic Regression, and Support Vectors
Understand Decision Trees and their limitations
Understand Nearest-Neighbors
Discuss and demonstrate a Spam Classifier
LABS
Linear Regression as a Projection
Logistic Regression
Support Vectors
Decision Trees
Linear Regression as a Classifier
DAY 3 OBJECTIVES
Discuss and understand Clustering Algorithms
Work with TensorFlow to create a basic neural network
Discuss Natural Language Processing
Discuss Dimensionality Reduction Algorithms
LABS
K-Means Clustering
GMM Clustering
Essential TensorFlow
Sentiment Analysis
Dimensionality Reduction with PCA
DAY 4 OBJECTIVES
Discuss Hyper-Parameter Tuning and K-Fold Validation