Apache Spark 2 Workshop (4 Days)

@jaceklaskowski / StackOverflow / GitHub / Mastering Apache Spark 2

https://github.com/jaceklaskowski

http://bit.ly/mastering-apache-spark

Among contributors to Apache Spark 1.6

Among contributors to Apache Spark 2

http://stackoverflow.com/users/1305344/jacek-laskowski

https://twitter.com/jaceklaskowski

Agenda - Day 1

  1. Using spark-shell and Spark SQL
  2. My First Spark SQL Application
    • IntelliJ IDEA, sbt and spark-submit
  3. Deploying Spark Applications to Spark Standalone Cluster
  4. Datasets and Encoders
  5. Standard Operators and Functions, and UDFs in Spark SQL
  6. RDDs vs DataFrames vs Datasets

Agenda - Day 2 (1 of 2)

  1. DataSource API
    • Reading and saving datasets
    • files (Parquet), JDBC (PostgreSQL), NoSQL (Cassandra)
  2. Spark in Cluster Mode (Spark Standalone, Hadoop YARN and Apache Mesos)
  3. Sharing a Spark master among long-running jobs (like an API) and short-lived jobs at the same time, and the best way to handle it
  4. Deploying to a Cluster - spark-submit
  5. Client, Driver, Master, Executors, Workers, Jobs, DAG, Stages, Tasks

Agenda - Day 2 (2 of 2)

  1. Monitoring / Performance Tuning
    1. web UI
    2. Spark Listeners
      • How to discover that a job has failed or takes too long
    3. Spark History Server
    4. Dynamic Allocation of Executors
    5. Speculative Execution of Tasks

Agenda - Day 3

  1. Spark SQL
    • Aggregates (and UDAFs)
    • Windows
  2. Structured Query Plan and Optimizations
  3. Thrift JDBC/ODBC Server
  4. Spark MLlib

Agenda - Day 4 (1 of 2)

  1. (00:30) Caching and Persistence
  2. (00:30) Kafka Architecture / Setup
  3. Structured Streaming

Agenda - Day 4 (2 of 2)

  1. Broadcast Variables and Accumulators
  2. Spark Streaming / Apache Kafka 0.10+
    • Introduction to Spark Streaming
    • StreamingContext and Operators
    • Kafka Direct Approach (No Receivers)

Prerequisites

  1. Some programming experience with a modern programming language (preferably on the JVM)
    • Java, Python, Scala, C#
  2. Installed
  3. Downloaded
  4. Willingness to ask PLENTY of questions

Questions?