ALL DATES GUARANTEED
Check out our full list of training locations and learning formats. Please note that the location you choose may be an Established HD-ILT location with a virtual live instructor.
COURSE DELIVERY OPTIONS
Train face-to-face with the live instructor.
Interact with a live, remote instructor from a specialized, HD-equipped classroom near you.
Attend the live class from the comfort of your home or office.
This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API. This includes exploring possible performance issues and strategies for optimization. The course also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and integrating with the Kafka server.
Students should be familiar with programming principles and have previous experience in software development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not required.
Software engineers who are looking to develop in-memory applications for time-sensitive and highly iterative workloads in an enterprise HDP environment.
- Acquire and Install Spark
- Identify Supported Data Formats
- Use Broadcast Variables and Accumulators
- Configure and Create a SparkSession
DAY 1: Scala Ramp Up, Introduction to Spark
- Scala Introduction
- Working with: Variables, Data Types, and Control Flow
- The Scala Interpreter
- Collections and their Standard Methods (e.g. map())
- Working with: Functions, Methods, and Function Literals
- Define the Following as they Relate to Scala: Class, Object, and Case Class
- Overview, Motivations, Spark Systems
- Spark Ecosystem
- Spark vs. Hadoop
- Acquiring and Installing Spark
- The Spark Shell, SparkContext
- Setting Up the Lab Environment
- Starting the Scala Interpreter
- A First Look at Spark
- A First Look at the Spark Shell
DAY 2: RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets
- RDD Concepts, Lifecycle, Lazy Evaluation
- RDD Partitioning and Transformations
- Working with RDDs Including: Creating and Transforming
- An Overview of RDDs
- SparkSession, Loading/Saving Data, Data Formats
- Introducing DataFrames and DataSets
- Identify Supported Data Formats
- Working with the DataFrame (untyped) Query DSL
- SQL-based Queries
- Working with the DataSet (typed) API
- Mapping and Splitting
- DataSets vs. DataFrames vs. RDDs
- RDD Basics
- Operations on Multiple RDDs
- Data Formats
- Spark SQL Basics
- DataFrame Transformations
- The DataSet Typed API
- Splitting Up Data
DAY 3: Shuffling, Transformations and Performance, Performance Tuning
- Working with: Grouping, Reducing, Joining
- Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
- Exploring the Catalyst Query Optimizer
- The Tungsten Optimizer
- Discuss Caching, Including: Concepts, Storage Type, Guidelines
- Minimizing Shuffling for Increased Performance
- Using Broadcast Variables and Accumulators
- General Performance Guidelines
- Exploring Group Shuffling
- Seeing Catalyst at Work
- Seeing Tungsten at Work
- Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
- Broadcast General Guidelines
DAY 4: Creating Standalone Applications and Spark Streaming
- Core API, SparkSession.Builder
- Configuring and Creating a SparkSession
- Building and Running Applications
- Application Lifecycle (Driver, Executors, and Tasks)
- Cluster Managers (Standalone, YARN, Mesos)
- Logging and Debugging
- Introduction and Streaming Basics
- Spark Streaming (Spark 1.0+)
- Structured Streaming (Spark 2+)
- Consuming Kafka Data
- Spark Job Submission
- Additional Spark Capabilities
- Spark Streaming
- Spark Structured Streaming
- Spark Structured Streaming with Kafka
What's Included With This Class?
This course includes a 365-day membership to our neXT Learning Community! You will join thousands of other neXT members, allowing you to interact with other IT professionals, get your questions answered, and achieve your learning goals. Upon registration, you will get immediate access to the following resources:
Join thousands of other members in our neXT Learning Community for an entire year!
Thousands of recorded topics, many of which relate to official technology curriculum.
Interact with instructors and other neXT members. You can expect a quick response as discussion boards are monitored daily.
Virtual, interactive sessions including exam prep, open Q&A workshops, lab demos, and featured exclusive topics.
Learning paths can contain videos, blogs, articles, and quizzes combined to help meet specific objectives.