spark-workshop

Spark Administration and Monitoring Workshop

What You Will Learn (aka Goals)

The goal of the Spark Administration and Monitoring Workshop is to give you a practical, complete, and hands-on introduction to Apache Spark and to the tools for managing, monitoring, troubleshooting, and fine-tuning Spark infrastructure.

NOTE: The workshop uses a tailor-made Docker image, but it is also possible to use a commercial Spark distribution such as Cloudera’s CDH, Hortonworks Data Platform (HDP), or MapR Sandbox.

The workshop uses an intense learn-by-doing approach in which the modules start with just enough knowledge to get you going and quickly move on to applying the concepts in assignments. There are a lot of practical exercises.

The workshop comes with many practical sessions that should meet (and possibly exceed) the expectations of administrators, operators, DevOps engineers, and other technical roles like system architects or technical leads. Software developers may find the Spark and Scala (Application Development) Workshop a better fit.

CAUTION: The workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After just a couple of days your mind, eyes, and hands will all be trained to recognise the patterns of setting up and operating Spark infrastructure in your Big Data projects.

CAUTION: I have already trained people who expressed concern that there were too many exercises. Your dear drill sergeant, Jacek.

Duration

5 days

Target Audience

Agenda

  1. Anatomy of Spark Data Processing
  2. SparkContext
    • SparkConf
  3. Transformations and Actions
  4. Units of Physical Execution: Jobs, Stages, and Tasks
  5. RDD Lineage
    • DAG View of RDDs
    • Logical Execution Plan
  6. Spark Execution Engine
    • DAGScheduler
    • TaskScheduler
    • Scheduler Backends
    • Executor Backends
  7. Partitions and Partitioning
  8. Shuffle
    • Wide and Narrow Dependencies
  9. Caching and Persistence
  10. Checkpointing (see the sketch after the agenda)
  11. Elements of Spark Runtime Environment
  12. The Driver
  13. Executors
    • TaskRunners
  14. Deploy Modes
  15. Spark Clusters
    • Master and Workers
  16. RPC Environment (RpcEnv)
  17. BlockManagers
  18. Spark Tools
  19. spark-shell
  20. spark-submit
  21. web UI
  22. spark-class
  23. Monitoring Spark Applications using web UI
  24. The Different Tabs in web UI
  25. Exercise: Monitoring using web UI
    • Executing Spark Jobs to Enable Different Statistics and Statuses
  26. Spark on Hadoop YARN cluster
  27. Exercise: Setting up Hadoop YARN
    • Accessing Resource Manager’s web UI
  28. Exercise: Submitting Applications using spark-submit
    • --master yarn
    • yarn-site.xml
    • yarn application -list
    • yarn application -status
    • yarn application -kill
  29. Runtime Properties - Meaning and Application
  30. Troubleshooting
    • log files
  31. YarnShuffleService – ExternalShuffleService on YARN
  32. Multi-tenant YARN Cluster Setup and Spark
    • Overview of YARN Schedulers (e.g. Capacity Scheduler)
    • spark-submit --queue
  33. Clustering Spark using Spark Standalone
  34. Exercise: Setting up Spark Standalone
    • Using standalone Master’s web UI
  35. Exercise: Submitting Applications using spark-submit
    • --master spark://...
    • --deploy-mode with client and cluster
  36. Tuning Spark Infrastructure
  37. Exercise: Configuring CPU and Memory for Master and Executors
  38. Exercise: Observing Shuffling using groupByKey-like Operations (see the sketch after the agenda)
  39. Scheduling Modes: FIFO and FAIR
    • Exercise: Configuring Pools in FAIR Scheduling Mode (see the sketch after the agenda)
  40. Monitoring Spark using SparkListeners
  41. LiveListenerBus
  42. StatsReportListener
  43. Event Logging using EventLoggingListener and History Server
  44. Exercise: Event Logging using EventLoggingListener
  45. Exercise: Developing a Custom SparkListener (see the sketch after the agenda)
  46. Dynamic Allocation (of Executors) (see the sketch after the agenda)
  47. External Shuffle Service
  48. Spark Metrics System
  49. (optional) Using Spark Streaming and Kafka
  50. (optional) Clustering Spark using Apache Mesos
  51. Exercise: Setting up a Mesos Cluster
  52. Exercise: Submitting Applications using spark-submit
    • --master mesos://...
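
Code Sketches

The sketches below are minimal Scala starting points for a few of the agenda topics above. Each one states its own assumptions; none is a complete solution.

A minimal sketch of RDD checkpointing in spark-shell (where sc is predefined); the checkpoint directory is an assumption, and any HDFS or local path will do:

    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val nums = sc.parallelize(1 to 100)
    nums.checkpoint()             // mark the RDD for checkpointing
    nums.count()                  // the first action triggers the checkpoint
    println(nums.isCheckpointed)  // true once the checkpoint has been written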
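
A minimal sketch for observing shuffling with a groupByKey-like operation; the key and data ranges are arbitrary. Run it in spark-shell and watch the Stages tab in web UI:

    val pairs = sc.parallelize(1 to 1000000).map(n => (n % 100, n))
    pairs.groupByKey().count()    // groupByKey forces a shuffle, i.e. a stage boundary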
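
A minimal sketch of using pools in FAIR scheduling mode, assuming the application runs with spark.scheduler.mode=FAIR; the pool name "production" is made up:

    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.parallelize(1 to 1000).count()                  // this job runs in the "production" pool
    sc.setLocalProperty("spark.scheduler.pool", null)  // reset to the default pool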
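
A minimal sketch of a custom SparkListener for spark-shell; the class name JobLoggingListener is made up:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

    class JobLoggingListener extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} ended with ${jobEnd.jobResult}")
    }

    sc.addSparkListener(new JobLoggingListener)  // or register via spark.extraListeners
    sc.parallelize(1 to 100).count()             // the listener prints job start/end events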
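
A minimal sketch of enabling Dynamic Allocation through a SparkConf, assuming the external shuffle service is available on the worker nodes; the application name and executor bounds are examples:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-demo")             // made-up application name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // required by dynamic allocation
      .set("spark.dynamicAllocation.minExecutors", "1")  // example bounds
      .set("spark.dynamicAllocation.maxExecutors", "10")
    val sc = new SparkContext(conf)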

Requirements