HDP Analyst Data Science

ALL SLI DATES ARE GUARANTEED TO RUN!

Check out our full list of training locations and learning formats. Please note that the location you choose may be an Established HD-ILT location.

What's Included With This Class?

Your neXT membership includes…

  • A 365 Day neXT Learning Membership is included with the class, giving you access to the resources below. Join thousands of other neXT members in your learning journey!
  • Video Reference Library: Thousands of recorded topics, many of which relate to the official technology curriculum, broken down into short, consumable videos. These videos are all on-demand and searchable by subject or course name. Get access to content and recordings from the entire technology stack, not just this class!
  • Online Discussion Forums: Technical discussion boards are available for you to interact with SLI instructors, SMEs, and other neXT Learning members. You can post questions and expect quick responses, as the discussion boards are monitored daily.
  • Tech Talk Webinars: SLI hosts a series of technical webinars each quarter. These are virtual, interactive sessions in which customers, instructors, and SMEs engage on a variety of topics driven by our members. Sessions are recorded and archived for future viewing. Session types include Delta & New Featured Topics, Open Q&A Workshops, Exam Prep & Guidance, and Lab Demos. We are always open to new ideas and topics!
  • Goal-Based Learning Paths: Learning paths are available for members who have a specific end goal in sight. SLI instructors have developed these paths, which may contain videos, blogs, articles, or quizzes, combined to help learners meet specific objectives. Example learning paths: CCNA Exam Prep, Scripting for Beginners.

Learn More About Our Annual neXT Learning Memberships

Overview

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

Target Audience

Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.

Prerequisites

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Course Objectives

  • Recognize use cases for data science on Hadoop
  • Describe the Hadoop and YARN architecture
  • Describe the differences between supervised and unsupervised learning
  • Use Mahout to run a machine learning algorithm on Hadoop
  • Describe the data science life cycle
  • Use Pig to transform and prepare data on Hadoop
  • Write a Python script
  • Describe options for running Python code on a Hadoop cluster
  • Write a Pig User-Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Use machine learning algorithms
  • Describe use cases for Natural Language Processing (NLP)
  • Use the Natural Language Toolkit (NLTK)
  • Describe the components of a Spark application
  • Write a Spark application in Python
  • Run machine learning algorithms using Spark MLlib
  • Take data science into production

Full Course Outline

Day 1: An Introduction to Data Science, Python, Hadoop and Machine Learning


OBJECTIVES

  • Define Data Science and Explain What a Data Scientist Does
  • Differentiate Between Different Types of Data Roles
  • List a Number of Data Science Use Cases
  • Present an Overview of Python
  • Describe the Components of the Big Data Scientific Stack

LABS

  • Using IPython
  • Data Analysis with Python
  • Using HDFS Commands
  • Introduction to Spark REPLs and Zeppelin
  • Using Apache Mahout for Machine Learning


Day 2: Working with Spark RDDs, DataFrames and SparkSQL, Visualization in Zeppelin


OBJECTIVES

  • Explain What an RDD Is
  • Explain How RDDs are Partitioned
  • Create, Manipulate, and Restore RDDs
  • Use Spark SQL to Create Tables
  • Create an Application and Submit to the Cluster

LABS

  • Create and Manipulate RDDs
  • Create and Save DataFrames
  • Build and Submit Spark Applications


Day 3: Machine Learning Algorithms, Natural Language Processing, and Spark MLlib


OBJECTIVES

  • Describe Common Machine Learning Applications
  • List the Pros and Cons of Various Algorithms
  • Explain What Natural Language Processing Is
  • Explain the Feature Engineering Capabilities of Spark MLlib

LABS

  • Use the Python Natural Language Toolkit (NLTK)
  • Classify Text Using Naïve Bayes
  • Compute K-Nearest Neighbors
  • Creating a Spam Classifier with MLlib
  • Sentiment Analysis with Spark MLlib

Exclusive Videos Included With This Course

  • How to Load Ambari from Scratch
  • Configuring Local Repositories
  • HDPCD - Big Data Certified Developer Exam Prep
  • HDPCA - Big Data Certified Administrator Exam Prep
  • Free Open Source Components to Solve Big/“ANY” Data Problems
  • Deep Dive: Kafka