spark-workshop

Exercise: Collect values per group

Write a structured query (using spark-shell or Databricks Community Edition) that collects ids per group in a dataset.

Protip™: Use the collect_list standard function

Extra: The collected values should be sorted in descending order

Module: Spark SQL

Duration: 15 mins

Input Dataset

val nums = spark.range(5).withColumn("group", 'id % 2)
scala> nums.show
+---+-----+
| id|group|
+---+-----+
|  0|    0|
|  1|    1|
|  2|    0|
|  3|    1|
|  4|    0|
+---+-----+

Result

+-----+---------+
|group|      ids|
+-----+---------+
|    0|[0, 2, 4]|
|    1|   [1, 3]|
+-----+---------+
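
One possible approach (a sketch, not necessarily the intended solution)

The snippet below is a minimal sketch of how the exercise could be approached, assuming it is pasted into spark-shell, where spark is already defined. The names idsPerGroup and idsDesc are made up for illustration, and sort_array is just one way to handle the Extra part. It uses col("id") instead of 'id so it does not depend on the implicits import.

```scala
// Paste into spark-shell, where `spark` (a SparkSession) is already defined.
import org.apache.spark.sql.functions.{col, collect_list, sort_array}

// Same input dataset as in the exercise.
val nums = spark.range(5).withColumn("group", col("id") % 2)

// Basic task: gather the ids of every group into an array column.
// (idsPerGroup is a hypothetical name used only in this sketch.)
val idsPerGroup = nums
  .groupBy("group")
  .agg(collect_list("id") as "ids")

// Extra: sort each collected array in descending order.
// (idsDesc is likewise a hypothetical name.)
val idsDesc = nums
  .groupBy("group")
  .agg(sort_array(collect_list("id"), asc = false) as "ids")

// orderBy only fixes the order of the output rows, not of the array elements.
idsDesc.orderBy("group").show
```

With the input above, idsDesc should hold [4, 2, 0] for group 0 and [3, 1] for group 1. Note that collect_list on its own does not guarantee element order after a shuffle, so the explicit sort_array is what makes the ordering reliable.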