spark-workshop

Exercises for Apache Spark™ and Scala Workshops

This repository contains the exercises for Apache Spark™ and Scala Workshops.

Spark Core

  1. Running Spark Applications on Hadoop YARN
  2. Submitting Spark Application to Spark Standalone Cluster
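Both exercises revolve around `spark-submit`. A minimal sketch of the two submission modes (the application class, jar path, and master URL below are placeholders, not part of any exercise):

```shell
# Submit to Hadoop YARN in cluster mode
# (assumes HADOOP_CONF_DIR points at the cluster's configuration).
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkApp \
  target/scala-2.12/my-spark-app.jar

# Submit the same application to a Spark Standalone cluster
# (the master URL is a local-cluster assumption).
./bin/spark-submit \
  --master spark://localhost:7077 \
  --class com.example.MySparkApp \
  target/scala-2.12/my-spark-app.jar
```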

Spark SQL

  1. split function with variable delimiter per row
  2. Selecting the most important rows per assigned priority
  3. Adding count to the source DataFrame
  4. Limiting collect_set Standard Function
  5. Structs for column names and values
  6. Merging two rows
  7. Exploding structs array
  8. Standalone Spark Application to Display Spark SQL Version
  9. Using CSV Data Source
  10. Finding Ids of Rows with Word in Array Column
  11. Using Dataset.flatMap Operator
  12. Reverse-engineering Dataset.show Output
  13. Flattening Array Columns (From Datasets of Arrays to Datasets of Array Elements)
  14. Working With Files On Hadoop HDFS
  15. Finding Most Populated Cities Per Country
  16. Using upper Standard Function
  17. Using explode Standard Function
  18. Difference in Days Between Dates As Strings
  19. Counting Occurrences Of Years and Months For 24 Months From Now
  20. Why are all fields null when querying with schema?
  21. How to add days (as values of a column) to date?
  22. Using UDFs
  23. Calculating aggregations
  24. Finding maximum values per group (groupBy)
  25. Collect values per group
  26. Multiple Aggregations
  27. Using pivot to generate a single-row matrix
  28. Using pivot for Cost Average and Collecting Values
  29. Pivoting on Multiple Columns
  30. Generating Exam Assessment Report
  31. Flattening Dataset from Long to Wide Format
  32. Finding 1st and 2nd Bestsellers Per Genre
  33. Calculating Gap Between Current And Highest Salaries Per Department
  34. Calculating Running Total / Cumulative Sum
  35. Calculating Difference Between Consecutive Rows Per Window
  36. Converting Arrays of Strings to String
  37. Calculating percent rank
  38. Working with Datasets Using JDBC and PostgreSQL
  39. Specifying Table and SQL Query on Command Line
  40. Finding First Non-Null Value per Group
  41. Finding Longest Sequence (Window Aggregation)
  42. Finding Most Common Non-null Prefix per Group (Occurrences)
  43. Developing Custom Data Source
  44. Using rollup Operator for Total and Average Salaries by Department and Company-Wide
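As a taste of the techniques above, here is a minimal Scala sketch of two of them, `explode` (exercise 17) and `pivot` (exercises 27-29). The sample data and column names are illustrative assumptions, not the exercises' datasets:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("spark-sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// explode: one output row per element of an array column.
val books = Seq(("fiction", Array("Dune", "Neuromancer"))).toDF("genre", "titles")
books.select($"genre", explode($"titles") as "title").show()

// pivot: turn distinct values of a column into columns of an aggregation.
val sales = Seq(("2019", "Q1", 100), ("2019", "Q2", 200)).toDF("year", "quarter", "amount")
sales.groupBy("year").pivot("quarter").sum("amount").show()
```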

Spark Structured Streaming

  1. Your First Standalone Structured Streaming Application
  2. Streaming CSV Datasets
  3. Stateless Streaming Aggregation
  4. Using foreach Operator (and ForeachWriter)
  5. Using Kafka Data Source
  6. Executing SQL Statements from Kafka
  7. Writing Selected Columns to Kafka
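A first streaming application (exercise 1) can be sketched with the built-in `rate` source and the `console` sink; both are standard Spark test aids, while the app name and output mode here are illustrative choices:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-sketch")
  .master("local[*]")
  .getOrCreate()

// The rate source emits one (timestamp, value) row per second.
val rates = spark.readStream
  .format("rate")
  .load()

// Print every micro-batch to the console until stopped.
val query = rates.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```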

Spark MLlib

  1. Email Classification
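A text-classification task like this is typically expressed as an MLlib `Pipeline`. A minimal sketch with a tokenizer, hashed term frequencies, and logistic regression; the toy training rows and column names are assumptions, not the workshop's dataset:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("email-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical labeled emails: 1.0 = spam, 0.0 = ham.
val training = Seq(
  ("win a free prize now", 1.0),
  ("meeting agenda for tomorrow", 0.0)
).toDF("text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
model.transform(training).select("text", "prediction").show()
```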