spark-workshop

Advanced Apache Spark for Developers Workshop (5 days)

What You Will Learn / Objectives

The goal of the Advanced Apache Spark for Developers Workshop is to build a deeper understanding of the internals of Apache Spark (Spark Core) and the modules in Apache Spark 2 (Spark SQL, Spark Structured Streaming and Spark MLlib). The workshop teaches you how to tune the performance of Apache Spark applications and how to use the more advanced features of Apache Spark 2.

NOTE: The workshop uses Apache Spark 2.2.0 and is particularly well-suited to Spark developers who have worked with Apache Spark 1.x.

The workshop follows a very intense learn-by-doing approach in which the modules start with just enough knowledge to get you going and quickly move on to applying the concepts in practical exercises.

The workshop includes many practical sessions that should meet (and quite likely exceed) the expectations of software developers with significant experience in Apache Spark and a good knowledge of Scala, as well as senior administrators, operators, DevOps engineers, and senior support engineers.

CAUTION: The workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After just a couple of days your mind, eyes, and hands will all be trained to recognise the patterns for setting up and operating Spark infrastructure for your Big Data and Predictive Analytics projects.

Duration

5 days

Target Audience

Agenda

Spark Core (1.5 Days)

  1. Anatomy of Spark Core Data Processing
    1. SparkContext and SparkConf
    2. Transformations and Actions
    3. Units of Physical Execution: Jobs, Stages, Tasks and Job Groups
    4. RDD Lineage
      • DAG View of RDDs
      • Logical Execution Plan
    5. Spark Execution Engine
      • DAGScheduler
      • TaskScheduler
      • Scheduler Backends
      • Executor Backends
    6. Partitions and Partitioning
    7. Shuffle
    8. Caching and Persistence
    9. Checkpointing
  2. Elements of Spark Runtime Environment
    1. The Driver and Executors
    2. Deploy Modes
    3. Spark Clusters
      • Master and Workers
  3. Spark Tools
    • spark-shell
    • spark-submit
    • spark-class
  4. Troubleshooting and Monitoring
    1. web UI
    2. Log Files
    3. SparkListeners
      • StatsReportListener
      • Event Logging using EventLoggingListener and History Server
      • Exercise: Event Logging using EventLoggingListener
      • Exercise: Developing Custom SparkListener
    4. Spark Metrics System
  5. Tuning Spark Infrastructure
    1. Exercise: Configuring CPU and Memory for Driver and Executors
    2. Scheduling Modes: FIFO and FAIR
    3. Exercise: Configuring Pools in FAIR Scheduling Mode

Spark SQL (2 Days)

  1. SparkSession
  2. Dataset, DataFrame and Encoders
  3. QueryExecution — Query Execution of Dataset
  4. Exercise: Debugging Query Execution
  5. web UI
  6. DataSource API
  7. Columns, Operators, Standard Functions and UDFs
  8. Joins
  9. Basic Aggregation
  10. Windowed Aggregation
  11. Multi-Dimensional Aggregation
  12. Caching and Persistence
  13. Catalyst — Tree Manipulation Framework
    1. Expressions, LogicalPlans and SparkPlans
    2. Logical and Physical Operators
  14. Analyzer — Logical Query Plan Analyzer
  15. SparkOptimizer — Logical Query Optimizer
    1. Logical Plan Optimizations
  16. SparkPlanner — Query Planner with no Hive Support
    1. Execution Planning Strategies
  17. Physical Plan Preparations Rules
  18. Tungsten Execution Backend (aka Project Tungsten)
    1. Whole-Stage Code Generation (aka Whole-Stage CodeGen)
    2. InternalRow and UnsafeRow

Spark Structured Streaming (0.5 Days)

  1. Spark Structured Streaming

Spark MLlib (1 Day)

  1. ML Pipelines and PipelineStages (spark.ml)
  2. ML Pipeline Components
    1. Transformers
    2. Estimators
    3. Models
    4. Evaluators
    5. CrossValidator
    6. Params (and ParamMaps)
  3. Supervised and Unsupervised Learning with Spark MLlib
    1. Classification and Regression
    2. Clustering
    3. Collaborative Filtering
  4. Model Selection and Tuning
  5. ML Persistence — Saving and Loading Models and Pipelines

Requirements