spark-workshop

Spark and Scala (Application Development) Workshop

What You Will Learn (aka Goals)

This Spark and Scala workshop aims to give you a practical, complete and, more importantly, hands-on introduction to the architecture of Apache Spark and to using Spark’s Scala API (developer) and infrastructure (administrator, devops) effectively in your Big Data projects.

NOTE: Should you want a workshop about administration, monitoring, troubleshooting and fine-tuning of Apache Spark, check out Spark Administration and Monitoring Workshop.

The agenda is the result of the workshops I have hosted in several cities and a few online classes.

The workshop uses an intense code-first approach in which the modules start with just enough knowledge to get you going (mostly using scaladoc and live coding) and quickly move on to applying the concepts in programming assignments. There are a lot of them.

It comes with many practical sessions that should meet (and even exceed) the expectations of software developers (and perhaps administrators, operators, devops, and other technical roles like system architects or technical leads).

The workshop provides participants with practical skills to use the features of Apache Spark with Scala.

CAUTION: The Spark and Scala workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After 5 days your mind, eyes, and hands will all be trained to recognise where and how to use Spark and Scala in your Big Data projects.

CAUTION: Some of the people I have already trained expressed concern that there were too many exercises. Your dear drill sergeant, Jacek.

Duration

5 days

Target Audience

Outcomes

After completing the workshop participants should be able to:

Agenda

The programming language used during the course is Scala. There is a one-day “crash course” in the language during the workshop. It is optional for seasoned Scala developers who are already familiar with the fundamentals of Scala and sbt.

Scala (one-day crash course)

This module introduces Scala and the tools, i.e. sbt and the Scala REPL, that are needed to complete the other Spark modules.

This module requires Internet access to download sbt and its plugins (unless you git clone the repository - see the README).

This module covers:

Agenda:

  1. Using sbt
  2. The tasks: help, compile, test, package, update, ~, set, show, console
  3. Tab completion
  4. Configuration files and directories, i.e. build.sbt file and project directory
  5. Adding new tasks to sbt through plugins
  6. Global vs project plugins
  7. sbt-assembly
  8. sbt-updates
  9. Using sbt behind a proxy server (see the sketch after this list)
    • HTTP/HTTPS/FTP Proxy in the official documentation
    • How to use sbt from behind proxy? on StackOverflow
    • Proxy Repositories for sbt
    • Proxy Repositories in the official documentation
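
A minimal sketch of the sbt setup this module assumes: a bare build.sbt, a project/plugins.sbt that adds sbt-assembly and sbt-updates, and the JVM system properties sbt honours behind a proxy. The Spark, Scala, and plugin versions are illustrative only.

    // build.sbt -- a minimal build for the workshop exercises
    name := "spark-workshop"
    version := "1.0"
    scalaVersion := "2.10.6"

    // Spark is "provided" since spark-submit ships it at runtime
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"

    // project/plugins.sbt -- adding new tasks to sbt through plugins
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
    addSbtPlugin("com.timushev.sbt" % "sbt-updates" % "0.1.10")

    // Using sbt behind a proxy server boils down to the standard JVM proxy properties, e.g.
    //   sbt -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
    //       -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080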

Spark SQL

  1. DataFrames
  2. Exercise: Creating DataFrames
    • Seqs and toDF
    • SQLContext.createDataFrame and Explicit Schema using StructType
  3. DataFrames and Query DSL
  4. Column References: col, $, ', dfName
  5. Exercise: Using Query DSL to select columns
    • where
  6. User-Defined Functions (UDFs)
  7. functions object
  8. Exercise: Manipulating DataFrames using functions
    • withColumn
    • UDFs: split and explode
  9. Creating new UDFs
  10. DataFrameWriter and DataFrameReader
  11. SQLContext.read and load
  12. DataFrame.write and save
  13. Exercise: WordCount using DataFrames (words per line; see the sketch after this list)
    • SQLContext.read.text
    • SQLContext.read.format("text")
  14. Exercise: Manipulating data from CSV using DataFrames
    • spark-submit --packages com.databricks:spark-csv_2.10:1.4.0
    • SQLContext.read.csv vs SQLContext.read.format("csv") or format("com.databricks.spark.csv")
    • count
    • CSV Data Source for Spark
  15. Aggregating
  16. Exercise: Using groupBy and agg
  17. Exercise: WordCount using DataFrames (words per file)
  18. Windowed Aggregates (Windows)
  19. Exercise: Top N per Group
  20. Exercise: Revenue Difference per Category
  21. Exercise: Running Totals
  22. Datasets
  23. Exercise: WordCount using SQLContext.read.text
  24. Exercise: Compute Aggregates using mapGroups
    • Word Count using Datasets
  25. Caching
  26. Exercise: Measuring Query Times using web UI
  27. Accessing Structured Data using JDBC
  28. Modern / New-Age Approach
  29. Exercise: Reading Data from and Writing to PostgreSQL
    • Creating DataFrames from Tables using JDBC and PostgreSQL
  30. Integration with Hive
  31. Queries over DataFrames
    • sql
  32. Registering UDFs
  33. Temporary and permanent tables
    • registerTempTable
    • DataFrame.write and saveAsTable
  34. DataFrame performance optimizations
  35. Tungsten
  36. Catalyst
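
A minimal sketch of the WordCount exercise with DataFrames, assuming a spark-shell session (Spark 1.6) where sqlContext is predefined and README.md is the sample input:

    // WordCount using DataFrames (words per line)
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    // Every line of README.md becomes a row with a single "value" column
    val lines = sqlContext.read.text("README.md")

    // split and explode come from the functions object: one word per row, then count per word
    val wordCounts = lines
      .select(explode(split($"value", "\\s+")).as("word"))
      .groupBy("word")
      .count()

    wordCounts.orderBy($"count".desc).show()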

Spark MLlib

  1. Spark MLlib vs Spark ML
  2. (old-fashioned) RDD-based API vs (the latest and greatest) DataFrame-based API
  3. Transformers
  4. Exercise: Using Tokenizer, RegexTokenizer, and HashingTF
  5. Estimators and Models
  6. Exercise: Using KMeans
    • Fitting a model and checking for spam
  7. Exercise: Using LogisticRegression
    • Fitting a model and checking for spam
  8. Pipelines
  9. Exercise: Using Pipelines of Transformers and Estimators
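
A minimal sketch of a Pipeline of Transformers (Tokenizer, HashingTF) and an Estimator (LogisticRegression), in the spirit of the spam exercises above; the tiny training and test sets are made up and a spark-shell session with sqlContext is assumed:

    // Pipeline: Tokenizer -> HashingTF -> LogisticRegression
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import sqlContext.implicits._

    // Made-up training set: label 1.0 = spam, 0.0 = ham
    val training = Seq(
      (1.0, "free money click now"),
      (0.0, "let us meet for the workshop tomorrow"),
      (1.0, "win a prize today"),
      (0.0, "the agenda is attached")).toDF("label", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // The Pipeline itself is an Estimator; fit produces a PipelineModel
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // Score unseen messages -- the "prediction" column holds the answer
    val test = Seq((1L, "click now to win money"), (2L, "see you at the workshop")).toDF("id", "text")
    model.transform(test).select("text", "prediction").show()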

Spark Streaming

  1. Spark Streaming
  2. Exercise: ConstantInputDStream in motion in a Standalone Streaming Application (see the sketch after this list)
  3. Input DStreams (with and without Receivers)
  4. Exercise: Processing Files Using File Receiver
    • Word Count
  5. Exercise: Using Text Socket Receiver
  6. Exercise: Processing vmstat Using Apache Kafka
  7. Monitoring Streaming Applications using web UI (Streaming tab)
  8. Exercise: Monitoring and Tuning Streaming Applications
    • “Sleeping on Purpose” in map to slow down processing
  9. Spark Streaming and Checkpointing (for fault tolerance and exactly-once delivery)
  10. Exercise: Start StreamingContext from Checkpoint
  11. State Management in Spark Streaming (Stateful Operators)
  12. Exercise: Use mapWithState for stateful computation
    • Split lines into username and message to collect messages per user
  13. Spark Streaming and Windowed Operators
  14. Exercise: ???
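
A minimal sketch of the ConstantInputDStream exercise as a standalone streaming application; the object name and the 5-second batch interval are arbitrary:

    // A standalone Spark Streaming application built around a ConstantInputDStream
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.ConstantInputDStream

    object ConstantStreamApp extends App {
      val conf = new SparkConf().setMaster("local[*]").setAppName("ConstantStreamApp")
      val ssc = new StreamingContext(conf, Seconds(5))

      // The same RDD is delivered in every batch -- handy for experimenting with operators
      val rdd = ssc.sparkContext.parallelize(0 to 9)
      val stream = new ConstantInputDStream(ssc, rdd)
      stream.print()

      ssc.start()
      ssc.awaitTermination()
    }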

Spark Core

  1. Spark “Installation” and Your First Spark Application (using spark-shell)
  2. Spark API scaladoc
  3. Exercise: Counting Elements in Distributed Collection
    • SparkContext.parallelize
    • SparkContext.range
    • SparkContext.textFile
  4. Using Spark’s Core APIs in Scala - Transformations and Actions
  5. Exercise: Processing lines in README.md
    • filter, map, flatMap, foreach
  6. Exercise: Gotchas with Transformations like zipWithIndex or sortBy
    • It may or may not submit a Spark job
    • Apply to RDDs with different numbers of partitions
    • Use webUI to see completed jobs
  7. Using key-value pair operators
  8. Exercise: Key-value pair operators
    • cogroup
    • flatMapValues
    • aggregateByKey
  9. Exercise: Word Counter, i.e. counting words in README.md (see the sketch after this list)
  10. Building, Deploying and Monitoring Spark Applications (using sbt, spark-submit, and web UI)
  11. Exercise: A Complete Development Cycle of Spark Application
  12. Processing Structured Data using RDDs
  13. Traditional / Old-Fashioned Approach
  14. Exercise: Accessing Data in CSV
  15. Partitions
  16. mapPartitionsWithIndex and foreachPartition
  17. Example: FIXME
  18. Accumulators
  19. Exercise: Distributed Counter
  20. Exercise: Using Accumulators and cogroup to Count Non-Matching Records as in leftOuterJoin
    • Ensure exactly-once processing despite task failures
    • Use TaskContext to track tasks
  21. Exercise: Custom Accumulators
    • AccumulatorParam
  22. Broadcast Variables
  23. Community Packages for Apache Spark (http://spark-packages.org)
  24. Exercise: Accessing Data in Apache Cassandra using Spark-Cassandra Connector
  25. Submitting Spark Applications
  26. run-example
  27. spark-submit
  28. Specifying memory requirements and other settings
  29. Exercise: Executing Spark Examples using run-example
  30. Exercise: Executing Spark Example using spark-submit
  31. Application Log Configuration
  32. conf/log4j.properties
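
A minimal sketch of the Word Counter exercise with the Core API, assuming a spark-shell session where sc is predefined:

    // Counting words in README.md with RDD transformations and actions
    val lines = sc.textFile("README.md")

    val wordCounts = lines
      .flatMap(_.split("\\s+"))     // transformation: one word per element
      .filter(_.nonEmpty)
      .map(word => (word, 1))       // key-value pairs
      .reduceByKey(_ + _)           // transformation: sum counts per word

    // Actions trigger the (so far lazy) computation
    wordCounts.sortBy(_._2, ascending = false).take(10).foreach(println)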

Spark GraphX

  1. RDD-based Graph API
  2. GraphFrames: DataFrame-based Graphs
  3. spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
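
A minimal GraphFrames sketch, assuming a spark-shell session started with the --packages option above; the vertices and edges are made up for illustration:

    // DataFrame-based graphs with GraphFrames
    import org.graphframes.GraphFrame
    import sqlContext.implicits._

    // Vertices need an "id" column; edges need "src" and "dst" columns
    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Charlie")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows"), ("b", "c", "follows")).toDF("src", "dst", "relationship")

    val graph = GraphFrame(vertices, edges)
    graph.inDegrees.show()
    graph.edges.filter("relationship = 'follows'").count()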

Extras

  1. Exercise: Stream Processing using Spark Streaming, Spark SQL and Spark MLlib (Pipeline API).
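
One possible shape of this exercise, sketched: every batch of a text socket stream is turned into a DataFrame with foreachRDD so that Spark SQL (and, further on, a fitted Pipeline model) can process the streamed records. A spark-shell session (sc predefined) and a socket source on localhost:9999 are assumed:

    // Combining Spark Streaming with Spark SQL: a DataFrame per batch via foreachRDD
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      val messages = rdd.toDF("message")
      messages.registerTempTable("messages")
      // ...this is also where a fitted spark.ml Pipeline model could score each batch
      sqlContext.sql("SELECT count(*) AS messages FROM messages").show()
    }

    ssc.start()
    ssc.awaitTermination()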

Requirements