Write a structured query (using spark-shell or Databricks Community Edition) that calculates the average cost per id and type (part 1) and collects all the ship values per id and type (part 2), pivoting on the date column in both parts.
Protip™: Use the RelationalGroupedDataset.pivot operator (possible solution sketches follow the expected outputs below).
Module: Spark SQL
Duration: 30 mins
Input dataset:

+---+----+----+------+----+
| id|type|cost| date|ship|
+---+----+----+------+----+
| 0| A| 223|201603|PORT|
| 0| A| 22|201602|PORT|
| 0| A| 422|201601|DOCK|
| 1| B|3213|201602|DOCK|
| 1| B|3213|201601|PORT|
| 2| C|2321|201601|DOCK|
+---+----+----+------+----+
// Paste into spark-shell (spark.implicits._ is already in scope, so toDF works on the local Seq)
val data = Seq(
  (0, "A", 223, "201603", "PORT"),
  (0, "A", 22, "201602", "PORT"),
  (0, "A", 422, "201601", "DOCK"),
  (1, "B", 3213, "201602", "DOCK"),
  (1, "B", 3213, "201601", "PORT"),
  (2, "C", 2321, "201601", "DOCK")
).toDF("id", "type", "cost", "date", "ship")
Expected output, part 1 (average cost pivoted by date):

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
| 0| A| 422.0| 22.0| 223.0|
| 1| B|3213.0|3213.0| null|
| 2| C|2321.0| null| null|
+---+----+------+------+------+
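One possible way to produce the part 1 result (a sketch, not necessarily the intended solution; the val name is just for illustration, and the exact column order or formatting of Dataset.show may vary across Spark versions):

// Part 1: average cost per (id, type), pivoted by date
val avgCostPerDate = data
  .groupBy("id", "type")
  .pivot("date")
  .avg("cost")
  .orderBy("id")
avgCostPerDate.show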
Expected output, part 2 (ship values collected and pivoted by date):

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
| 0| A|[DOCK]|[PORT]|[PORT]|
| 1| B|[PORT]|[DOCK]| []|
| 2| C|[DOCK]| []| []|
+---+----+------+------+------+
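And a possible sketch for part 2, using collect_list as the pivot aggregation (again only a sketch with an illustrative val name; whether empty cells show up as empty arrays or nulls can depend on the Spark version):

import org.apache.spark.sql.functions.collect_list

// Part 2: collect the ship values per (id, type), pivoted by date
val shipsPerDate = data
  .groupBy("id", "type")
  .pivot("date")
  .agg(collect_list("ship"))
  .orderBy("id")
shipsPerDate.show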