spark-workshop

Exercise: Calculating aggregations

Develop a standalone Spark SQL application (using IntelliJ IDEA) that calculates aggregations defined on a command line (e.g. finds the biggest city among the cities in a dataset).

Protip™: Use the Dataset.agg operator and standard functions only (not UDFs!)
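
A minimal sketch of such an application, assuming the aggregation name and the input CSV file arrive as command-line arguments (the object name AggApp and the set of supported aggregations are illustrative, not prescribed by the exercise):

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object AggApp extends App {
  // args(0) = aggregation to calculate (e.g. max), args(1) = input CSV file
  val Array(aggName, input) = args.take(2)

  val spark = SparkSession.builder
    .appName("Calculating aggregations")
    .master("local[*]")
    .getOrCreate()

  val cities = spark.read
    .option("header", true)
    .csv(input)

  // Map the command-line name to a standard function (no UDFs)
  val agg: String => Column = aggName match {
    case "max" => max(_)
    case "min" => min(_)
    case "avg" => avg(_)
  }

  // NOTE: the population column needs cleaning up first -- see the protip below
  cities.agg(agg("population") as "population").show()

  spark.stop()
}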

The standalone application should take at least two input parameters (e.g. the aggregation to calculate and the input file).

Protip™: Mind the spaces in the population column, and then its type.
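
One way to handle this, sketched below on the cities Dataset from the snippet above: remove the spaces with regexp_replace (a standard function) and cast the column to a numeric type before aggregating.

import org.apache.spark.sql.functions._

// population arrives as a string such as "1 764 615";
// strip the spaces and cast it so max compares numbers, not strings.
val cleaned = cities.withColumn(
  "population",
  regexp_replace(col("population"), " ", "").cast("long"))

cleaned.agg(max("population") as "population").show()

The show output is the single-column Result shown at the end of this exercise.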

Extra: Include the name of the city when one aggregation is used.
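
One way to do the extra part while staying with agg and standard functions (a sketch using the cleaned Dataset from above): aggregate a struct of (population, name), which Spark orders field by field, and then unpack it.

import org.apache.spark.sql.functions._

// max over a struct compares population first, so the matching name comes along
cleaned
  .agg(max(struct(col("population"), col("name"))) as "biggest")
  .select(col("biggest.name") as "name", col("biggest.population") as "population")
  .show()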

Module: Spark SQL

Duration: 20 mins

Input Dataset

+---+-----------------+----------+
| id|             name|population|
+---+-----------------+----------+
|  0|           Warsaw| 1 764 615|
|  1|Villeneuve-Loubet|    15 020|
|  2|           Vranje|    83 524|
|  3|       Pittsburgh| 1 775 634|
+---+-----------------+----------+

The same dataset as a CSV file:

id,name,population
0,Warsaw,1 764 615
1,Villeneuve-Loubet,15 020
2,Vranje,83 524
3,Pittsburgh,1 775 634

Result

+----------+
|population|
+----------+
|   1775634|
+----------+