spark-workshop

Advanced Apache Spark for Developers Workshop (5 days)

What You Will Learn / Objectives

The goal of the Advanced Apache Spark for Developers Workshop is to build a deeper understanding of the internals of Apache Spark (Spark Core) and of the modules in Apache Spark 2 (Spark SQL, Spark Structured Streaming and Spark MLlib). The workshop also teaches you how to tune the performance of Apache Spark applications and how to use the more advanced features of Apache Spark 2.

NOTE: The workshop uses Apache Spark 2.2.0 and is particularly well-suited to Spark developers who have worked with Apache Spark 1.x.

The workshop follows a very intense learn-by-doing approach in which the modules start with just enough knowledge to get you going and quickly move on to applying the concepts in practical exercises.

The workshop includes many practical sessions that should meet (and quite likely exceed) the expectations of software developers with significant experience in Apache Spark and a good knowledge of Scala, as well as senior administrators, operators, devops, and senior support engineers.

CAUTION: The workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After just a couple of days your mind, eyes, and hands will all be trained to recognise the patterns for setting up and operating Spark infrastructure for your Big Data and Predictive Analytics projects.

Duration

5 days

Target Audience

Experienced Software Developers
- Good knowledge of Scala
- Significant experience in Apache Spark 1.x
Senior Administrators
Senior Support Engineers

Agenda

Spark Core (1.5 Days)

Anatomy of Spark Core Data Processing
1. SparkContext and SparkConf
2. Transformations and Actions
3. Units of Physical Execution: Jobs, Stages, Tasks and Job Groups
4. RDD Lineage
  - DAG View of RDDs
  - Logical Execution Plan
5. Spark Execution Engine
  - DAGScheduler
  - TaskScheduler
  - Scheduler Backends
  - Executor Backends
6. Partitions and Partitioning
7. Shuffle
8. Caching and Persistence
9. Checkpointing
Elements of Spark Runtime Environment
1. The Driver and Executors
2. Deploy Modes
3. Spark Clusters
  - Master and Workers
Spark Tools
- spark-shell
- spark-submit
- spark-class
Troubleshooting and Monitoring
1. web UI
2. Log Files
3. SparkListeners
  - StatsReportListener
  - Event Logging using EventLoggingListener and History Server
  - Exercise: Event Logging using EventLoggingListener
  - Exercise: Developing Custom SparkListener
4. Spark Metrics System
Tuning Spark Infrastructure
1. Exercise: Configuring CPU and Memory for Driver and Executors
2. Scheduling Modes: FIFO and FAIR
3. Exercise: Configuring Pools in FAIR Scheduling Mode

Spark SQL (2 Days)

SparkSession
Dataset, DataFrame and Encoders
QueryExecution — Query Execution of Dataset
Exercise: Debugging Query Execution
web UI
DataSource API
Columns, Operators, Standard Functions and UDFs
Joins
Basic Aggregation
- groupBy and groupByKey operators
- Case Study: Number of Partitions for groupBy Aggregation
Windowed Aggregation
Multi-Dimensional Aggregation
Caching and Persistence
Catalyst — Tree Manipulation Framework
1. Expressions, LogicalPlans and SparkPlans
2. Logical and Physical Operators
Analyzer — Logical Query Plan Analyzer
SparkOptimizer — Logical Query Optimizer
1. Logical Plan Optimizations
SparkPlanner — Query Planner with no Hive Support
1. Execution Planning Strategies
Physical Plan Preparation Rules
Tungsten Execution Backend (aka Project Tungsten)
1. Whole-Stage Code Generation (aka Whole-Stage CodeGen)
2. InternalRow and UnsafeRow

Spark Structured Streaming (0.5 Days)

Spark Structured Streaming

Spark MLlib (1 Day)

ML Pipelines and PipelineStages (spark.ml)
ML Pipeline Components
1. Transformers
2. Estimators
3. Models
4. Evaluators
5. CrossValidator
6. Params (and ParamMaps)
Supervised and Unsupervised Learning with Spark MLlib
1. Classification and Regression
2. Clustering
3. Collaborative Filtering
Model Selection and Tuning
ML Persistence — Saving and Loading Models and Pipelines

Requirements

Training classes work best for groups of up to 12 participants
Participants should have decent computers, preferably running Linux or macOS
- There are issues with running Spark on Windows (mostly with Spark SQL / Hive).
Participants should install the following packages:
- Apache Spark 2.2
- Java SE Development Kit 8
- IntelliJ IDEA Community Edition with the Scala plugin
- sbt
- Apache Kafka 0.11.0.1
- PostgreSQL 10 or any other relational database
Participants should download the following packages:
- PostgreSQL JDBC 4.2 Driver, 42.1.4
