Apache Spark 2

Administration and Monitoring 5 Days

@jaceklaskowski / StackOverflow / GitHub / Mastering Apache Spark 2

https://github.com/jaceklaskowski

https://bit.ly/mastering-apache-spark

Among contributors to Apache Spark 1.6

Among contributors to Apache Spark 2

Among contributors to Apache Spark 2.1

Ranked #96 in Spark contributors

http://stackoverflow.com/users/1305344/jacek-laskowski

https://twitter.com/jaceklaskowski

Agenda - Day 1

  1. The Elements of Apache Spark (aka Why Spark)
  2. My First Spark Application
    • IntelliJ IDEA, sbt and spark-submit

Agenda - Day 2

  1. The Core of Spark Core
  2. sbt

Agenda - Day 3

  1. Exercise: Reading a CSV file using Spark with sbt (and scopt as an external dependency)
  2. Spark Properties and web UI
  3. Exercise: Exploring map and mapPartitions
  4. Exercise: Monitoring Spark application using web UI
    • Task stragglers using TaskContext

Agenda - Day 4

  1. Exercise: Using TaskCompletionListener, TaskFailureListener, TaskContext
  2. Exercise: Different States of Spark Jobs
  3. Monitoring Spark using SparkListeners
  4. Exercise: Developing Custom SparkListener
  5. Spark and Cluster Managers
  6. Exercise: Deploying Spark Applications to a Cluster
    • Spark Standalone
    • Hadoop YARN

Agenda - Day 5

  1. Demo: Spark on YARN Internals (1:45)
  2. Spark SQL
  3. Standard Functions and UDFs
  4. spark-sql command-line tool
  5. Spark Thrift JDBC/ODBC Server
  6. Spark SQL 2 - Tungsten / Catalyst / Query Optimizer / Performance Tuning
  7. Exercise: Using explain, debug and debugCodegen for queries
  8. Structured Streaming
  9. The Internals of Spark Application

Prerequisites

  1. Some programming experience using a modern programming language (preferably on the JVM)
    • Scala, Python, Java, F#
  2. Installed Java SE 8
  3. Downloaded
  4. Willingness to ask PLENTY of questions

Questions?