spark-workshop

Exercise: Flattening Array Columns (From Datasets of Arrays to Datasets of Array Elements)

Develop a standalone Spark SQL application (using IntelliJ IDEA) that turns a Dataset with an array column into a Dataset with one column per array element. Each new column holds the elements found at one position of the array and is named after that position.

Protips™:

  1. (intermediate) Assume that all arrays have the same number of elements

  2. (advanced) Consider the pivot operator

Module: Spark SQL

Duration: 30 mins

Input Dataset

val input = Seq(
  Seq("a","b","c"),
  Seq("X","Y","Z")).toDF

scala> input.show
+---------+
|    value|
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+

scala> input.printSchema
root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)
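The transcript above comes from spark-shell, where `spark` and its implicits are already in scope. A standalone application has to create them itself. The following is a minimal sketch of such a skeleton, not a reference solution; the object name `FlattenArrayColumnsApp`, the helper `buildInput`, and the `local[*]` master are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical application object; the name is an assumption.
object FlattenArrayColumnsApp {

  // Builds the exercise's input dataset.
  // toDF is only available once the session's implicits are imported.
  def buildInput(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Seq("a", "b", "c"),
      Seq("X", "Y", "Z")).toDF
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run from the IDE
      .appName("FlattenArrayColumnsApp")
      .getOrCreate()

    val input = buildInput(spark)
    input.show()
    input.printSchema()

    spark.stop()
  }
}
```

Keeping the dataset construction in its own method makes it easy to call from a test or from spark-shell without going through `main`.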

Result

+---+---+---+
|  0|  1|  2|
+---+---+---+
|  a|  b|  c|
|  X|  Y|  Z|
+---+---+---+
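
One possible way to get there, sketched for each of the two Protips. This is a hedged illustration rather than the official solution: the helper names `flattenFixed` and `flattenPivot` and the temporary `id` column are made up for this example. The code is written script-style so it can be pasted into spark-shell with `:paste`; in a standalone application it would sit inside an object.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, first, monotonically_increasing_id, posexplode}

val spark = SparkSession.builder()
  .master("local[*]") // assumption: local run
  .appName("flatten-array-columns")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  Seq("a", "b", "c"),
  Seq("X", "Y", "Z")).toDF

// Protip 1: assumes every array has the same length (and the dataset is non-empty).
// Read the length off the first row, then select one column per position.
def flattenFixed(df: DataFrame): DataFrame = {
  val size = df.head.getSeq[String](0).size
  val columns = (0 until size).map(i => col("value")(i).as(i.toString))
  df.select(columns: _*)
}

// Protip 2: explode every array together with element positions,
// then pivot the positions back into columns.
def flattenPivot(df: DataFrame): DataFrame =
  df.withColumn("id", monotonically_increasing_id()) // tags each array so its elements can be regrouped
    .select(col("id"), posexplode(col("value")))      // produces columns: id, pos, col
    .groupBy("id")
    .pivot("pos")
    .agg(first("col"))
    .orderBy("id")                                    // keep the original row order
    .drop("id")

flattenFixed(input).show()
flattenPivot(input).show()
```

Both calls should reproduce the table above. The pivot variant does not rely on equal array lengths: positions missing from a shorter array come out as null. It does cost a shuffle for the `groupBy`, which the fixed-size variant avoids.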