# Multi-Dimensional Aggregation

**Apache Spark 2.4.4 / Spark SQL**

@jaceklaskowski / StackOverflow / GitHub

The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming
## Agenda

1. [Multi-dimensional rollup Operator](#/rollup)
1. [Multi-dimensional cube Operator](#/cube)
1. [GROUPING SETS SQL clause](#/grouping-sets)
1. [grouping Aggregate Function](#/grouping)
1. [grouping_id Aggregate Function](#/grouping-id)
## Multi-dimensional rollup Operator

1. **rollup** calculates subtotals and totals over (ordered) combinations of groups

    ```scala
    val inventory = Seq(("t1", 2015, 100), ("t1", 2016, 50), ("t2", 2016, 40))
      .toDF("name", "year", "amount")
    inventory.rollup("name", "year").sum("amount")
    ```

    * **Quiz**: How many records are in the result set?
1. Advanced variant of **groupBy** with higher efficiency
1. Creates **RelationalGroupedDataset**
    * Supports untyped, Row-based **agg**
    * Shortcuts for _the usual suspects_, e.g. **avg**, **count**, **pivot**
1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
    * [rollup Aggregation Operator](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-multi-dimensional-aggregation.html#rollup)
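The quiz can be answered without a Spark cluster: `rollup("name", "year")` aggregates over the ordered prefixes of the grouping columns, i.e. the grouping sets `(name, year)`, `(name)` and the grand total `()`. A plain-Scala sketch (no Spark needed) of those grouping sets over the inventory data, with `None` standing in for the `null` Spark puts in subtotal and total rows:

```scala
// The three rows of the inventory Dataset as plain tuples
val inventory = Seq(("t1", 2015, 100), ("t1", 2016, 50), ("t2", 2016, 40))

// Sum "amount" per grouping key; None marks a rolled-up column (shown as null by Spark)
def sumBy(key: ((String, Int, Int)) => (Option[String], Option[Int])): Map[(Option[String], Option[Int]), Int] =
  inventory.groupBy(key).view.mapValues(_.map(_._3).sum).toMap

// rollup("name", "year") = union of the prefix grouping sets
val rollupRows =
  sumBy(r => (Some(r._1), Some(r._2))) ++ // grouping set (name, year): detail rows
  sumBy(r => (Some(r._1), None))       ++ // grouping set (name): per-name subtotals
  sumBy(_ => (None, None))                // grouping set (): grand total
```

`rollupRows` contains 3 detail rows, 2 per-name subtotals and 1 grand total, so the quiz answer is 6 records.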
## Multi-dimensional cube Operator

1. **cube** is similar to **rollup** but calculates subtotals and totals over **all combinations** of groups

    ```scala
    val inventory = Seq(("t1", 2015, 100), ("t1", 2016, 50), ("t2", 2016, 40))
      .toDF("name", "year", "amount")
    inventory.cube("name", "year").sum("amount") // note cube (not rollup)
    ```

    * **Quiz**: How many records are in the result set?
1. Advanced variant of **groupBy** with higher efficiency
1. Creates **RelationalGroupedDataset**
1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
    * [cube Aggregation Operator](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-multi-dimensional-aggregation.html#cube)
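Compared with rollup, `cube("name", "year")` aggregates over *all* subsets of the grouping columns, not just the prefixes: `(name, year)`, `(name)`, `(year)` and `()`. The same plain-Scala sketch as before (no Spark needed), extended with the extra `(year)` grouping set:

```scala
// The three rows of the inventory Dataset as plain tuples
val inventory = Seq(("t1", 2015, 100), ("t1", 2016, 50), ("t2", 2016, 40))

// Sum "amount" per grouping key; None marks a rolled-up column (shown as null by Spark)
def sumBy(key: ((String, Int, Int)) => (Option[String], Option[Int])): Map[(Option[String], Option[Int]), Int] =
  inventory.groupBy(key).view.mapValues(_.map(_._3).sum).toMap

// cube("name", "year") = union of ALL grouping sets over the two columns
val cubeRows =
  sumBy(r => (Some(r._1), Some(r._2))) ++ // grouping set (name, year): detail rows
  sumBy(r => (Some(r._1), None))       ++ // grouping set (name): per-name subtotals
  sumBy(r => (None, Some(r._2)))       ++ // grouping set (year): per-year subtotals
  sumBy(_ => (None, None))                // grouping set (): grand total
```

With 3 detail rows, 2 per-name subtotals, 2 per-year subtotals and 1 grand total, the quiz answer here is 8 records.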
## GROUPING SETS SQL clause

1. Spark SQL supports **GROUPING SETS** in SQL only
1. Used in GROUP BY to specify more than one grouping over the same record set
    * Equivalent to several GROUP BY queries connected by UNION

    ```scala
    Seq(("a1", "b1", 3), ("a1", "b2", 7), ("a2", "b2", 5))
      .toDF("a", "b", "c")
      .createOrReplaceTempView("t1")
    val q = sql(
      "SELECT a, b, SUM(c) FROM t1 GROUP BY a, b GROUPING SETS (a, b)"
    )
    ```

    * **Quiz**: How many records are in the result set?
1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
    * [GROUPING SETS SQL Clause](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-multi-dimensional-aggregation.html#grouping-sets)
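`GROUPING SETS (a, b)` requests exactly two groupings, `GROUP BY a` and `GROUP BY b`, with their results unioned. A plain-Scala sketch (no Spark needed) of that union over the slide's data, with `None` standing in for the `null` of a column that is not part of the current grouping set:

```scala
// The three rows of table t1 as plain tuples
val rows = Seq(("a1", "b1", 3), ("a1", "b2", 7), ("a2", "b2", 5))

// Sum "c" per grouping key; None marks a column absent from the grouping set
def sumBy(key: ((String, String, Int)) => (Option[String], Option[String])): Map[(Option[String], Option[String]), Int] =
  rows.groupBy(key).view.mapValues(_.map(_._3).sum).toMap

// GROUPING SETS (a, b) = (GROUP BY a) union (GROUP BY b)
val groupingSetRows =
  sumBy(r => (Some(r._1), None)) ++ // grouping set (a)
  sumBy(r => (None, Some(r._2)))    // grouping set (b)
```

Two distinct values of `a` plus two distinct values of `b` give the quiz answer: 4 records.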
## grouping Aggregate Function

1. **grouping** is an aggregate function that indicates whether a specified column in a GROUP BY list is aggregated or not

    ```scala
    workshops
      .cube("city", "year")
      .agg(grouping("city"), grouping("year"))
      .sort($"city".desc_nulls_last, $"year".desc_nulls_last)
    ```

    * Returns 1 if the column is aggregated, 0 otherwise
1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
    * [grouping Aggregate Function](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-functions.html#grouping)
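The value `grouping(col)` returns depends only on which grouping set produced a row: 0 when the column takes part in the grouping set, 1 when it was rolled up. A plain-Scala sketch (no Spark needed) that hard-codes the four grouping sets `cube("city", "year")` generates and computes the bit per column:

```scala
val cols = Seq("city", "year")

// The four grouping sets produced by cube("city", "year")
val cubeSets = Seq(Set("city", "year"), Set("city"), Set("year"), Set.empty[String])

// 0 = col is part of the grouping set, 1 = col is aggregated (rolled up)
def grouping(set: Set[String], col: String): Int = if (set.contains(col)) 0 else 1

val bits = cubeSets.map(set => cols.map(col => grouping(set, col)))
// detail rows -> (0, 0); year rolled up -> (0, 1);
// city rolled up -> (1, 0); grand total -> (1, 1)
```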
## grouping_id Aggregate Function

1. **Exercise** (hard): Can you guess yourself?
1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
    * [grouping_id Aggregate Function](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-functions.html#grouping_id)
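To check your guess afterwards: `grouping_id` packs the `grouping()` bits of all grouping columns into a single integer, with the first grouping column as the most significant bit. A plain-Scala sketch (no Spark needed) over the four bit patterns of `cube("city", "year")`:

```scala
// grouping_id = the grouping() bits read as a binary number
// (first grouping column = most significant bit)
def groupingId(bits: Seq[Int]): Int =
  bits.foldLeft(0)((acc, bit) => (acc << 1) | bit)

// Bit patterns for the four grouping sets of cube("city", "year")
val ids = Seq(Seq(0, 0), Seq(0, 1), Seq(1, 0), Seq(1, 1)).map(groupingId)
// 0 = detail rows, 1 = year rolled up, 2 = city rolled up, 3 = grand total
```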
## Recap

1. [Multi-dimensional rollup Operator](#/rollup)
1. [Multi-dimensional cube Operator](#/cube)
1. [GROUPING SETS SQL clause](#/grouping-sets)
1. [grouping Aggregate Function](#/grouping)
1. [grouping_id Aggregate Function](#/grouping-id)
# Questions?

* Read [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
* Read [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
* Read [The Internals of Spark Structured Streaming](https://bit.ly/spark-structured-streaming)
* Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on Twitter
* Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)