Apache Spark Programming (DB 105)

Ask us about our bundle discount when combining this course with Apache Spark for Machine Learning and Data Science (DB 301).

This 2.5-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.

The course covers the fundamentals of Apache Spark, including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.

Course Information

Price: $2,500.00
Duration: 2.5 days
Learning Credits:

All Dates Guaranteed To Run

Check out our full list of training locations and learning formats. Please note that the location you choose may be an Established HD-ILT location with a virtual live instructor.

Course Delivery Options

Train face-to-face with the live instructor.

Interact with a live, remote instructor from a specialized, HD-equipped classroom near you.

Attend the live class from the comfort of your home or office.



Prerequisites:

  • Some familiarity with Apache Spark is helpful but not required.
  • Some familiarity with machine learning and data science concepts is highly recommended but not required.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.


Target Audience:

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.


Course Objectives:

After taking this class, students will be able to:

  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Work with graph-structured data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Understand the basics of Spark’s internals


Course Outline:

Module 1: Spark Overview
  • Databricks Overview
  • Spark Capabilities
  • Spark Ecosystem
  • Basic Spark Components
  • Databricks Lab Environment
  • Working with Notebooks
  • Spark Clusters and Files

Module 2: Spark SQL and DataFrames
  • Use of Spark SQL
  • Use of DataFrames / Datasets
  • Reading from CSV, JSON, JDBC, Parquet Files & more
  • Writing Data
  • DataFrame, Dataset and SQL APIs
  • Aggregations
  • SQL Joins with DataFrames
  • Broadcasting
  • Catalyst Query Optimization
  • Tungsten
  • ETL
  • Creating DataFrames
  • Querying with DataFrames and SQL
  • ETL with DataFrames
  • Caching
  • Visualization

Module 3: Spark Internals
  • Jobs, Stages and Tasks
  • Partitions and Shuffling
  • Job Performance
  • Visualizing SQL Queries
  • Observing Task Execution
  • Understanding Performance
  • Measuring Memory Use

Module 4: Structured Streaming
  • Streaming Sources and Sinks
  • Structured Streaming APIs
  • Windowing and Aggregation
  • Checkpointing
  • Watermarking
  • Reliability and Fault Tolerance
  • Reading from TCP
  • Reading from Kafka
  • Continuous Visualization

Module 5: Machine Learning
  • Spark ML Pipeline API
  • Built-in Featurizing and Algorithms
  • Featurization
  • Building a Machine Learning Pipeline

Module 6: Graph Processing with GraphFrames
  • Basic Graph Analysis
  • GraphFrames API
  • GraphFrames ETL
  • PageRank and Label Propagation with GraphFrames