HDP Analyst Data Science

This Course Includes neXT LIVE 365

LEARN FOR 365 DAYS!

Sunset Learning Institute believes in a 365-day learning experience that begins immediately, regardless of when you attend your instructor-led training (ILT) course. At SLI, you get a range of learning opportunities, from instructor-led, hands-on training to self-directed, customizable learning paths based on your environment, your needs, and your level of experience. We provide the tools and options, and you decide what you need, when you need it, and how you want to learn it!

Immediate access to supplemental learning assets that are INCLUDED with your purchase of the above instructor-led training course: 
  • 365 Days of Access to SLI’s Entire Contact Center Video Reference Library (VRL), not just the 5-day class you sign up for (hundreds of searchable, on-demand learning bytes in 5-15-minute videos)
  • 365 Days of Unlimited Access to Delta Sessions - What’s Not Covered in Class! (Version Upgrades, Industry Updates, Etc.)
  • 365 Days of Unlimited 24x7 Access to SLI's Community - Collaborate with SLI Instructors and Other Members (Monitored Daily by SLI Instructors)
  • 365 Days of Unlimited Access to Interactive neXTpertise Sessions and other IT Resources with SLI Instructors (featured hot topics, exam prep, etc.)
  • Unlimited Access to Hosted Webinars and All Previously Recorded Sessions
  • Unlimited Access to your Digital Courseware

Benefits:
  • Training that fits your needs (from high intensity to small learning bytes)
  • Build immediate competency - start at time of purchase!
  • Gain know-how and close skills gaps with limited work disruption
  • Get quick answers to daily challenges - live interaction!


Overview

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

Target Audience

Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Hadoop.

Prerequisites

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Course Objectives

  • Recognize use cases for data science on Hadoop
  • Describe the Hadoop and YARN architecture
  • Describe the differences between supervised and unsupervised learning
  • Use Mahout to run a machine learning algorithm on Hadoop
  • Describe the data science life cycle
  • Use Pig to transform and prepare data on Hadoop
  • Write a Python script
  • Describe options for running Python code on a Hadoop cluster
  • Write a Pig User-Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Use machine learning algorithms
  • Describe use cases for Natural Language Processing (NLP)
  • Use the Natural Language Toolkit (NLTK)
  • Describe the components of a Spark application
  • Write a Spark application in Python
  • Run machine learning algorithms using Spark MLlib
  • Take data science into production
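
To give a feel for the "Pig streaming with a Python script" objective above, here is a minimal sketch and not course material. Pig streaming passes each tuple to an external program as a line of tab-separated text on standard input and reads transformed lines back from standard output. The function names (`transform`, `process`) and the sample records are hypothetical, chosen only for illustration.

```python
import sys


def transform(line):
    """Uppercase the first field and append the original field count.

    Assumes tab-separated input, which is Pig streaming's default format.
    """
    fields = line.rstrip("\n").split("\t")
    count = len(fields)
    fields[0] = fields[0].upper()
    fields.append(str(count))
    return "\t".join(fields)


def process(lines):
    """Yield one transformed line per non-empty input line."""
    for line in lines:
        if line.strip():
            yield transform(line)


# Hypothetical sample records standing in for lines arriving on stdin.
sample = ["alice\t30\n", "\n", "bob\t25\tnyc\n"]
for out in process(sample):
    print(out)

# In an actual streaming script, the loop would read from sys.stdin instead:
#     for out in process(sys.stdin):
#         sys.stdout.write(out + "\n")
```

In a Pig script, a program like this would typically be invoked with the `STREAM ... THROUGH` operator; the exact deployment details depend on the cluster setup.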

Course Outline

Day 1: An Introduction to Data Science, Python, Hadoop and Machine Learning


OBJECTIVES

  • Define Data Science and Explain What a Data Scientist Does
  • Differentiate Between Different Types of Data Roles
  • List a Number of Data Science Use Cases
  • Present an Overview of Python
  • Describe the Components of the Big Data Scientific Stack

LABS

  • Using IPython
  • Data Analysis with Python
  • Using HDFS Commands
  • Introduction to Spark REPLs and Zeppelin
  • Using Apache Mahout for Machine Learning


Day 2: Working with Spark RDDs, DataFrames and SparkSQL, Visualization in Zeppelin


OBJECTIVES

  • Explain What an RDD Is
  • Explain How RDDs are Partitioned
  • Create, Manipulate, and Restore RDDs
  • Use Spark SQL to Create Tables
  • Create an Application and Submit to the Cluster

LABS

  • Create and Manipulate RDDs
  • Create and Save DataFrames
  • Build and Submit Spark Applications


Day 3: Machine Learning Algorithms, Natural Language Processing, and Spark MLlib


OBJECTIVES

  • Describe Common Machine Learning Applications
  • List the Pros and Cons of Various Algorithms
  • Explain What Natural Language Processing Is
  • Explain the Feature Engineering Capabilities of Spark MLlib

LABS

  • Use the Python Natural Language Toolkit (NLTK)
  • Classify Text Using Naïve Bayes
  • Compute K-Nearest Neighbors
  • Creating a Spam Classifier with MLlib
  • Sentiment Analysis with Spark MLlib
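
As a rough illustration of the idea behind the Naïve Bayes and spam classifier labs listed above, the following is a minimal plain-Python sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It is not the lab code and uses neither NLTK nor Spark MLlib; the function names and the four training messages are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count documents per label and words per label.

    examples: iterable of (text, label) pairs.
    Returns (label_counts, word_counts, vocabulary).
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        # Log prior: fraction of training documents with this label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
training = [
    ("win cash prize now", "spam"),
    ("cheap prize offer now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(predict(model, "free prize now"))
print(predict(model, "monday meeting"))
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the scores are summed rather than multiplied.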
