Write a structured query (using spark-shell
or Databricks Community Edition) that calculates the occurences of years and months for the past 24 months (2 years).
The query is supposed to include occurences (as 0
s) for the missing months and years (that are time gaps in the input dataset).
The query is supposed to calculate a result for the last 24 months from the date of execution. Months and years older than 24 months from now should simply be excluded from the result.
Protip™: Use the standard functions for date and time.
Module: Spark SQL
Duration: 30 mins
YEAR_MONTH,AMOUNT
202001,500
202001,600
201912,100
201910,200
201910,100
201909,400
201601,5000
val sales = spark
.read
.option("header", true)
.option("inferSchema", true)
.csv("sales.csv")
scala> sales.show
+----------+------+
|YEAR_MONTH|AMOUNT|
+----------+------+
| 202001| 500|
| 202001| 600|
| 201912| 100|
| 201910| 200|
| 201910| 100|
| 201909| 400|
| 201601| 5000|
+----------+------+
// For now being Jan 2020
scala> solution.show(100, truncate = false)
+----------+------+
|year_month|amount|
+----------+------+
|202001 |1100 |
|201912 |100 |
|201911 |0 |
|201910 |300 |
|201909 |400 |
|201908 |0 |
|201907 |0 |
...
|201804 |0 |
|201803 |0 |
+----------+------+