Write a structured query that calculates the difference between consecutive running_total
rows over time per department.
Protip™: Use lag standard function
Module: Spark SQL
Duration: 15 mins
time,department,items_sold,running_total
1,IT,15,15
2,Support,81,81
3,Support,90,171
4,Support,25,196
5,IT,40,55
6,IT,24,79
7,Support,31,227
8,Support,1,228
9,HR,27,27
10,IT,75,154
// Mind the inferSchema so time is a numeric value
val sales = spark
.read
.option("header", true)
.option("inferSchema", true)
.csv("sales.csv")
scala> sales.show
+----+----------+----------+-------------+
|time|department|items_sold|running_total|
+----+----------+----------+-------------+
| 1| IT| 15| 15|
| 2| Support| 81| 81|
| 3| Support| 90| 171|
| 4| Support| 25| 196|
| 5| IT| 40| 55|
| 6| IT| 24| 79|
| 7| Support| 31| 227|
| 8| Support| 1| 228|
| 9| HR| 27| 27|
| 10| IT| 75| 154|
+----+----------+----------+-------------+
NOTE: Use Online Generate Test Data for more sophisticated datasets in CSV or JSON format.
// ordering (time) does not really matter
// but helps reviewing the output
scala> solution.show
+----+----------+----------+-------------+----+
|time|department|items_sold|running_total|diff|
+----+----------+----------+-------------+----+
| 9| HR| 27| 27| 27|
| 1| IT| 15| 15| 15|
| 5| IT| 40| 55| 40|
| 6| IT| 24| 79| 24|
| 10| IT| 75| 154| 75|
| 2| Support| 81| 81| 81|
| 3| Support| 90| 171| 90|
| 4| Support| 25| 196| 25|
| 7| Support| 31| 227| 31|
| 8| Support| 1| 228| 1|
+----+----------+----------+-------------+----+