spark-workshop

Exercise: Finding 1st and 2nd Bestsellers Per Genre

Write a structured query (using spark-shell or Databricks Community Edition) that gives the 1st and 2nd bestsellers per genre.

Protip™: Use rank standard function

Module: Spark SQL

Duration: 15 mins

Input Dataset

id,title,genre,quantity
1,Hunter Fields,romance,15
2,Leonard Lewis,thriller,81
3,Jason Dawson,thriller,90
4,Andre Grant,thriller,25
5,Earl Walton,romance,40
6,Alan Hanson,romance,24
7,Clyde Matthews,thriller,31
8,Josephine Leonard,thriller,1
9,Owen Boone,sci-fi,27
10,Max McBride,romance,75
val books = spark
  .read
  .option("header", true)
  .option("inferSchema", true)
  .csv("books.csv")
scala> books.show
+---+-----------------+--------+--------+
| id|            title|   genre|quantity|
+---+-----------------+--------+--------+
|  1|    Hunter Fields| romance|      15|
|  2|    Leonard Lewis|thriller|      81|
|  3|     Jason Dawson|thriller|      90|
|  4|      Andre Grant|thriller|      25|
|  5|      Earl Walton| romance|      40|
|  6|      Alan Hanson| romance|      24|
|  7|   Clyde Matthews|thriller|      31|
|  8|Josephine Leonard|thriller|       1|
|  9|       Owen Boone|  sci-fi|      27|
| 10|      Max McBride| romance|      75|
+---+-----------------+--------+--------+

NOTE: Use Online Generate Test Data for more sophisticated datasets in CSV or JSON format.

Result

+---+-------------+--------+--------+----+
| id|        title|   genre|quantity|rank|
+---+-------------+--------+--------+----+
| 10|  Max McBride| romance|      75|   1|
|  5|  Earl Walton| romance|      40|   2|
|  3| Jason Dawson|thriller|      90|   1|
|  2|Leonard Lewis|thriller|      81|   2|
|  9|   Owen Boone|  sci-fi|      27|   1|
+---+-------------+--------+--------+----+