Advanced Apache Spark™ for Developers (5 Days)

@jaceklaskowski / StackOverflow / GitHub
Books: Mastering Apache Spark / Mastering Spark SQL / Spark Structured Streaming

  • Jacek Laskowski is an independent consultant
  • Specializing in Spark, Kafka, Kafka Streams, Scala
  • Development | Consulting | Training
  • Contributor to Apache Spark (since 1.6.0)
  • Contact me at jacek@japila.pl
  • Follow @JacekLaskowski on Twitter for more #ApacheSpark

Jacek is best known for his GitBooks:
  1. Mastering Apache Spark
  2. Mastering Spark SQL
  3. Spark Structured Streaming
  4. Mastering Kafka Streams
  5. Apache Kafka Notebook

Professional Objectives

  • Advance in solving analytical problems using Spark SQL
  • Explore the recent features of Apache Spark 2.3
  • Understand the internals of Apache Spark 2.x and its modules (Spark SQL, Spark Structured Streaming and Spark MLlib)
  • Understand performance tuning of Apache Spark applications and the advanced features of Apache Spark

Training content

  • Anatomy of Spark Core Data Processing Platform
  • Foundations of Spark SQL
  • Internals of Structured Query Execution
  • Standard, User-Defined and User-Defined Aggregate Functions (Spark SQL)
  • Basic, Windowed and Multi-Dimensional Aggregations (Spark SQL)
  • Monitoring Spark Applications Using web UI and SparkListeners
  • Join Optimization with Bucketing (Spark SQL)
  • Stream Processing with Spark Structured Streaming
  • Machine Learning with Spark MLlib

Agenda


  Day 1 — Spark Core

  Day 2 — Spark SQL

  Day 3 — Spark SQL

  Day 4 — Spark SQL & Spark Structured Streaming

  Day 5 — Spark MLlib & Troubleshooting and Monitoring

Day 2 — Spark SQL


  Spark SQL

  The Internals of Structured Query Execution — QueryExecution, Catalyst, Query Plans, Analyzer, Optimizer, Planner, Tungsten

  Lunch Break (12:45pm)

  Columns and Dataset Operators

Day 3 — Spark SQL


  Standard and User-Defined Functions

  Lunch Break (12:45pm)

  Joins

  Basic Aggregation

Day 4 — Spark SQL & Structured Streaming


  Windowed Aggregation

  Spark SQL Exercises

  Lunch Break (12:45pm)

  Spark Structured Streaming

Day 5 — Spark MLlib & Performance Tuning


  Machine Learning with Spark MLlib

  ML Pipelines and Components

  Supervised and Unsupervised Learning with Spark MLlib

  Model Selection and Hyperparameter Tuning

  ML Persistence — Saving and Loading Models and Pipelines

  Lunch Break (12:45pm)

  Performance Tuning, Troubleshooting and Monitoring

  web UI

Prerequisites

Be prepared to get the most out of the workshop

Prerequisites / Programming Experience

  • Good knowledge of Scala
  • Significant experience in Apache Spark 1.x

Prerequisites / To Be Installed

In-Class Preparations

Make the Instructor's Life Slightly Easier. Thanks!

Introduce Yourself

  1. First name
  2. What do you expect from the workshop?
  3. Where do you want to be with Spark after 5 days?

Addendum

  1. Write your name on a piece of paper and put it in front of you (or stick it to your laptop)
  2. Is lunch at 12:45pm OK?