spark-workshop

Exercise: Counting Occurences Of Years and Months For 24 Months From Now

Write a structured query (using spark-shell or Databricks Community Edition) that calculates the occurences of years and months for the past 24 months (2 years).

The query is supposed to include occurences (as 0s) for the missing months and years (that are time gaps in the input dataset).

The query is supposed to calculate a result for the last 24 months from the date of execution. Months and years older than 24 months from now should simply be excluded from the result.

Protip™: Use the standard functions for date and time.

Module: Spark SQL

Duration: 30 mins

Input Dataset

YEAR_MONTH,AMOUNT
202001,500
202001,600
201912,100
201910,200
201910,100
201909,400
201601,5000
val sales = spark
  .read
  .option("header", true)
  .option("inferSchema", true)
  .csv("sales.csv")
scala> sales.show
+----------+------+
|YEAR_MONTH|AMOUNT|
+----------+------+
|    202001|   500|
|    202001|   600|
|    201912|   100|
|    201910|   200|
|    201910|   100|
|    201909|   400|
|    201601|  5000|
+----------+------+

Result

// For now being Jan 2020
scala> solution.show(100, truncate = false)
+----------+------+
|year_month|amount|
+----------+------+
|202001    |1100  |
|201912    |100   |
|201911    |0     |
|201910    |300   |
|201909    |400   |
|201908    |0     |
|201907    |0     |
...
|201804    |0     |
|201803    |0     |
+----------+------+

Credits