Ask us about our bundle discount when combining this course with Apache Spark for Machine Learning and Data Science (DB 301).
This 2 1/2-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.
The course covers the fundamentals of Apache Spark, including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.
Prerequisites
- Some familiarity with Apache Spark is helpful but not required.
- Some familiarity with Machine Learning and Data Science concepts is highly recommended but not required.
- Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.
Learning objectives
- Use the core Spark APIs to operate on data
- Articulate and implement typical use cases for Spark
- Build data pipelines and query large data sets using Spark SQL and DataFrames
- Analyze Spark jobs using the administration UIs inside Databricks
- Create Structured Streaming jobs
- Work with relational data using the GraphFrames APIs
- Understand how a Machine Learning pipeline works
- Understand the basics of Spark’s internals
Module 1
- Databricks Overview
- Spark Capabilities
- Spark Ecosystem
- Basic Spark Components
- Databricks Lab Environment
- Working with Notebooks
- Spark Clusters and Files
Module 2: Spark SQL and DataFrames
- Use of Spark SQL
- Use of DataFrames / Datasets
- Reading from CSV, JSON, JDBC, Parquet Files & more
- Writing Data
- DataFrame, Dataset and SQL APIs
- SQL Joins with DataFrames
- Catalyst Query Optimization
- Creating DataFrames
- Querying with DataFrames and SQL
- ETL with DataFrames
Module 3: Spark Internals
- Jobs, Stages and Tasks
- Partitions and Shuffling
- Job Performance
- Visualizing SQL Queries
- Observing Task Execution
- Understanding Performance
- Measuring Memory Use
Module 4: Structured Streaming
- Streaming Sources and Sinks
- Structured Streaming APIs
- Windowing and Aggregation
- Reliability and Fault Tolerance
- Reading from TCP
- Reading from Kafka
- Continuous Visualization
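As a rough illustration of the windowing and aggregation topic in this module, the sketch below groups timestamped events into fixed, non-overlapping (tumbling) windows and counts them per key. It is plain Python, not the Structured Streaming API; the function name `tumbling_window_counts`, the event layout, and the sample data are assumptions made up for this example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    This is a conceptual sketch only; it does not use Spark.
    """
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample events: (seconds since start, event type).
events = [(1, "click"), (4, "click"), (9, "view"), (11, "click"), (19, "view")]
counts = tumbling_window_counts(events, window_seconds=10)

for (window_start, key), n in sorted(counts.items()):
    print(f"window [{window_start}, {window_start + 10}) {key}: {n}")
```

A streaming engine applies the same grouping idea incrementally as data arrives, and additionally has to deal with late and out-of-order events, which this sketch ignores.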
Module 5: Machine Learning
- Spark ML Pipeline API
- Built-in Featurizing and Algorithms
- Building a Machine Learning Pipeline
Module 6: Graph Processing with GraphFrames
- Basic Graph Analysis
- GraphFrames API
- GraphFrames ETL
- PageRank and Label Propagation with GraphFrames
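For readers new to the PageRank topic listed in this module, the following is a minimal power-iteration sketch in plain Python. It is not the GraphFrames API and makes no claim about how GraphFrames normalizes or computes ranks; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy edge list are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges.

    Conceptual sketch only; ranks are normalized to sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the random-jump ("teleport") share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy graph: a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)

for node, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node}: {rank:.4f}")
```

In this toy graph, node `c` receives links from two nodes and ends up with the highest rank, while `d`, which nothing links to, keeps only the random-jump share.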