spark-workshop

Exercise: Finding Most Populated Cities Per Country

Write a structured query (using spark-shell or Databricks Community Edition) that finds the most populated city in each country, together with its population.

Protip™: Use the Dataset.groupBy operator with the max standard function, followed by Dataset.join.
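
The groupBy, max, join pattern can be sketched roughly as follows. This is only one possible shape of the query, not the reference solution: it assumes the population is already numeric, uses a few rows taken from the input dataset, and the names `cities`, `maxPerCountry` and `mostPopulated` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("max-per-group").getOrCreate()
import spark.implicits._

// A few rows from the input dataset, with population already numeric (an assumption).
val cities = Seq(
  ("Warsaw", "Poland", 1764615L),
  ("Cracow", "Poland", 769498L),
  ("Paris", "France", 2206488L),
  ("Villeneuve-Loubet", "France", 15020L)
).toDF("name", "country", "population")

// Step 1: the largest population per country.
val maxPerCountry = cities
  .groupBy("country")
  .agg(max("population").as("population"))

// Step 2: join back on both columns to recover the name of the matching city.
val mostPopulated = cities
  .join(maxPerCountry, Seq("country", "population"))
  .select("name", "country", "population")

mostPopulated.show()
```

Joining on both country and population matters for this dataset: Vilnius and Goteborg have the same population, so a join on population alone would also pair Goteborg with the Lithuanian maximum. Cities that tie for first place within one country would all be returned.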

NOTE: The population column in the input dataset is a string that uses spaces as thousands separators (e.g. 1 764 615).
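
This matters because strings compare lexicographically, so "769 498" would sort above "1 764 615". A minimal sketch of one way to get a numeric column, assuming the separators are ordinary whitespace; the column name `pop` is hypothetical and the original string is kept for display:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("clean-population").getOrCreate()
import spark.implicits._

// Two rows from the input dataset, with population as a string.
val raw = Seq(
  ("Warsaw", "Poland", "1 764 615"),
  ("Villeneuve-Loubet", "France", "15 020")
).toDF("name", "country", "population")

// Remove every whitespace character, then cast the remaining digits to a 64-bit integer.
val cleaned = raw.withColumn(
  "pop",
  regexp_replace(col("population"), "\\s+", "").cast("long"))

cleaned.printSchema()
cleaned.show()
```

Keeping the original `population` string next to the numeric `pop` makes it possible to aggregate on the number while still displaying the value in its original format.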

Module: Spark SQL

Duration: 30 mins

Input Dataset

+-----------------+-------------+----------+
|             name|      country|population|
+-----------------+-------------+----------+
|           Warsaw|       Poland| 1 764 615|
|           Cracow|       Poland|   769 498|
|            Paris|       France| 2 206 488|
|Villeneuve-Loubet|       France|    15 020|
|    Pittsburgh PA|United States|   302 407|
|       Chicago IL|United States| 2 716 000|
|     Milwaukee WI|United States|   595 351|
|          Vilnius|    Lithuania|   580 020|
|        Stockholm|       Sweden|   972 647|
|         Goteborg|       Sweden|   580 020|
+-----------------+-------------+----------+

name,country,population
Warsaw,Poland,1 764 615
Cracow,Poland,769 498
Paris,France,2 206 488
Villeneuve-Loubet,France,15 020
Pittsburgh PA,United States,302 407
Chicago IL,United States,2 716 000
Milwaukee WI,United States,595 351
Vilnius,Lithuania,580 020
Stockholm,Sweden,972 647
Goteborg,Sweden,580 020
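
One way to turn these CSV lines into a DataFrame is sketched below. To stay runnable without a file on disk, it passes the lines as an in-memory Dataset[String] (supported by DataFrameReader.csv since Spark 2.2); the name `lines` is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("load-cities").getOrCreate()
import spark.implicits._

// The CSV lines from the input dataset, header first.
val lines = Seq(
  "name,country,population",
  "Warsaw,Poland,1 764 615",
  "Cracow,Poland,769 498",
  "Paris,France,2 206 488",
  "Villeneuve-Loubet,France,15 020",
  "Pittsburgh PA,United States,302 407",
  "Chicago IL,United States,2 716 000",
  "Milwaukee WI,United States,595 351",
  "Vilnius,Lithuania,580 020",
  "Stockholm,Sweden,972 647",
  "Goteborg,Sweden,580 020"
).toDS

// header=true takes the column names from the first line.
// Without schema inference every column, including population, is read as a string.
val cities = spark.read.option("header", true).csv(lines)

cities.printSchema()
cities.show()
```

If the data is saved to a file instead, the same reader accepts a path, e.g. `spark.read.option("header", true).csv("cities.csv")`, where `cities.csv` is a hypothetical file name.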

Result

+----------+-------------+----------+
|      name|      country|population|
+----------+-------------+----------+
|    Warsaw|       Poland| 1 764 615|
|     Paris|       France| 2 206 488|
|Chicago IL|United States| 2 716 000|
|   Vilnius|    Lithuania|   580 020|
| Stockholm|       Sweden|   972 647|
+----------+-------------+----------+