spark-workshop

Exercise: Calculating percent rank

A dataset contains employee and salary entries in no particular order. Write a structured query that adds a new column per the following requirements:

  1. The top 30% get the value “High”
  2. The next 40% get “Average”
  3. The rest get “Low”

Protip™: Use the percent_rank window function, then derive the label with the when standard function and the Column API’s when and otherwise methods (see the sketch after the input dataset).

Module: Spark SQL

Duration: 30 mins

Input Dataset

Employee,Salary
Tony,50
Alan,45
Lee,60
David,35
Steve,65
Paul,48
Micky,62
George,80
Nigel,64
John,42
// Load salaries.csv, reading the header row and inferring column types
val salaries = spark
  .read
  .option("header", true)
  .option("inferSchema", true)
  .csv("salaries.csv")
scala> salaries.show
+--------+------+
|Employee|Salary|
+--------+------+
|    Tony|    50|
|    Alan|    45|
|     Lee|    60|
|   David|    35|
|   Steve|    65|
|    Paul|    48|
|   Micky|    62|
|  George|    80|
|   Nigel|    64|
|    John|    42|
+--------+------+
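One way to assemble the Protip™ (a minimal sketch, assuming a window ordered by Salary in descending order and cutoffs at 0.3 and 0.7; the bySalaryDesc, ranked and result names are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, percent_rank, when}

// Rank every row against the whole dataset, highest salary first
val bySalaryDesc = Window.orderBy(col("Salary").desc)

// percent_rank gives 0.0 for the highest salary and 1.0 for the lowest
val ranked = salaries.withColumn("percent_rank", percent_rank().over(bySalaryDesc))

// Top 30% => High, next 40% => Average, the rest => Low
val result = ranked
  .withColumn("Percentage",
    when(col("percent_rank") < 0.3, "High")
      .when(col("percent_rank") < 0.7, "Average")
      .otherwise("Low"))
  .drop("percent_rank")

Since the window has no PARTITION BY clause, Spark warns about moving all rows to a single partition; with ten rows that is harmless. Ordering by Salary in descending order gives the result below.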

Result

+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|  George|    80|      High|
|   Steve|    65|      High|
|   Nigel|    64|      High|
|   Micky|    62|   Average|
|     Lee|    60|   Average|
|    Tony|    50|   Average|
|    Paul|    48|   Average|
|    Alan|    45|       Low|
|    John|    42|       Low|
|   David|    35|       Low|
+--------+------+----------+

Credits