spark-workshop
Exercises for Apache Spark™ and Scala Workshops
This repository contains the exercises for Apache Spark™ and Scala Workshops.
Spark Core
Running Spark Applications on Hadoop YARN
Submitting Spark Application to Spark Standalone Cluster
Spark SQL
split function with variable delimiter per row
Selecting the most important rows per assigned priority
Adding count to the source DataFrame
Limiting collect_set Standard Function
Structs for column names and values
Merging two rows
Exploding structs array
Standalone Spark Application to Display Spark SQL Version
Using CSV Data Source
Finding Ids of Rows with Word in Array Column
Using Dataset.flatMap Operator
Reverse-engineering Dataset.show Output
Flattening Array Columns (From Datasets of Arrays to Datasets of Array Elements)
Working With Files On Hadoop HDFS
Finding Most Populated Cities Per Country
Using upper Standard Function
Using explode Standard Function
Difference in Days Between Dates As Strings
Counting Occurrences Of Years and Months For 24 Months From Now
Why are all fields null when querying with schema?
How to add days (as values of a column) to date?
Using UDFs
Calculating aggregations
Finding maximum values per group (groupBy)
Collect values per group
Multiple Aggregations
Using pivot to generate a single-row matrix
Using pivot for Cost Average and Collecting Values
Pivoting on Multiple Columns
Generating Exam Assessment Report
Flattening Dataset from Long to Wide Format
Finding 1st and 2nd Bestsellers Per Genre
Calculating Gap Between Current And Highest Salaries Per Department
Calculating Running Total / Cumulative Sum
Calculating Difference Between Consecutive Rows Per Window
Converting Arrays of Strings to String
Calculating percent rank
Working with Datasets Using JDBC and PostgreSQL
Specifying Table and SQL Query on Command Line
Finding First Non-Null Value per Group
Finding Longest Sequence (Window Aggregation)
Finding Most Common Non-null Prefix per Group (Occurrences)
Developing Custom Data Source
Using rollup Operator for Total and Average Salaries by Department and Company-Wide
Spark Structured Streaming
Your First Standalone Structured Streaming Application
Streaming CSV Datasets
Stateless Streaming Aggregation
Using foreach Operator (and ForeachWriter)
Using Kafka Data Source
Executing SQL Statements from Kafka
Writing Selected Columns to Kafka
Spark MLlib
Email Classification