ALL DATES GUARANTEED
Check out our full list of training locations and learning formats. Please note that the location you choose may be an established HD-ILT location with a virtual live instructor.
COURSE DELIVERY OPTIONS
Train face-to-face with the live instructor.
Interact with a live, remote instructor from a specialized, HD-equipped classroom near you.
Attend the live class from the comfort of your home or office.
Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.
Target Audience: Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.
- Recognize use cases for data science on Hadoop
- Describe the Hadoop and YARN architecture
- Describe supervised and unsupervised learning differences
- Use Mahout to run a machine learning algorithm on Hadoop
- Describe the data science life cycle
- Use Pig to transform and prepare data on Hadoop
- Write a Python script
- Describe options for running Python code on a Hadoop cluster
- Write a Pig User-Defined Function in Python
- Use Pig streaming on Hadoop with a Python script
- Use machine learning algorithms
- Describe use cases for Natural Language Processing (NLP)
- Use the Natural Language Toolkit (NLTK)
- Describe the components of a Spark application
- Write a Spark application in Python
- Run machine learning algorithms using Spark MLlib
- Take data science into production
Day 1: An Introduction to Data Science, Python, Hadoop and Machine Learning
- Define Data Science and Explain What a Data Scientist Does
- Differentiate Between Different Types of Data Roles
- List a Number of Data Science Use Cases
- Present an Overview of Python
- Describe the Components of the Big Data Scientific Stack
- Using IPython
- Data Analysis with Python
- Using HDFS Commands
- Introduction to Spark REPLs and Zeppelin
- Using Apache Mahout for Machine Learning
Day 2: Working with Spark RDDs, DataFrames and SparkSQL, Visualization in Zeppelin
- Explain What an RDD Is
- Explain How RDDs are Partitioned
- Create, Manipulate, and Restore RDDs
- Use Spark SQL to Create Tables
- Create an Application and Submit to the Cluster
- Create and Manipulate RDDs
- Create and Save DataFrames
- Build and Submit Spark Applications
Day 3: Machine Learning Algorithms, Natural Language Processing, and Spark MLlib
- Describe Common Machine Learning Applications
- List the Pros and Cons of Various Algorithms
- Explain What Natural Language Processing Is
- Explain the Feature Engineering Capabilities of Spark MLlib
- Use the Python Natural Language Toolkit (NLTK)
- Classify Text Using Naïve Bayes
- Compute K-Nearest Neighbors
- Creating a Spam Classifier with MLlib
- Sentiment Analysis with Spark MLlib
What's Included With This Class?
This course includes a 365-day membership to our neXT Learning Community! You will join thousands of other neXT members, allowing you to interact with other IT professionals, get your questions answered, and achieve your learning goals. Upon registration, you will get immediate access to the following resources:
Join thousands of other members in our neXT Learning Community for an entire year!
Thousands of recorded topics, many of which relate to official technology curriculum.
Interact with instructors and other neXT members. You can expect a quick response as discussion boards are monitored daily.
Virtual, interactive sessions including exam prep, open Q&A workshops, lab demos, and featured exclusive topics.
Learning paths can contain videos, blogs, articles, and quizzes combined to help meet specific objectives.