# Basic Aggregation (agg, groupBy and groupByKey)

Apache Spark 3.2 / Spark SQL

@jaceklaskowski / StackOverflow / GitHub / LinkedIn

The "Internals" Books: books.japila.pl
## Agenda

1. [Aggregate Functions](#/aggregate-functions)
1. [agg Operator](#/agg-operator)
1. [Untyped groupBy Operator](#/groupBy-operator)
1. [Typed groupByKey Operator](#/groupByKey-operator)
1. [User-Defined Aggregate Functions (UDAFs)](#/udaf)
## Aggregate Functions

1. **Aggregate functions** accept a group of records as input
   * Unlike regular functions that act on a single record
1. Available among the standard functions
   ```scala
   import org.apache.spark.sql.functions._
   ```
1. _Usual suspects_: **avg**, **collect_list**, **count**, **min**, **mean**, **sum** (example below)
1. You can create custom **user-defined aggregate functions (UDAFs)**
1. Read the [functions object's scaladoc](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$)
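A minimal sketch of a few of the standard aggregate functions in action (the `sales` data and column names are invented for the example; `spark` is a `SparkSession`, as in `spark-shell`):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// A tiny example dataset (hypothetical)
val sales = Seq(
  ("books", 10.0),
  ("books", 20.0),
  ("games", 5.0)).toDF("category", "price")

// Aggregate functions consume groups of records, not single rows
sales.agg(
  count("*") as "rows",
  avg("price") as "avg_price",
  collect_list("price") as "all_prices").show()
```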
## agg Operator

1. **agg** applies an aggregate function to all the records in a Dataset
   ```scala
   val ds = spark.range(10)
   ds.agg(sum('id) as "sum")
   ```
1. The entire Dataset acts as a single group
   * **groupBy** is used to define groups (more in the following slides)
1. Creates a DataFrame (runnable example below)
   * …hence considered untyped due to **Row** inside
   * A typed variant is available (more in the following slides)
1. Switch to [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
   * [Basic Aggregation — Typed and Untyped Grouping Operators](https://books.japila.pl/spark-sql-internals/basic-aggregation)
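A runnable version of the snippet above, showing that **agg** over the whole Dataset yields a single-row DataFrame (i.e. `Dataset[Row]`):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._  // for the 'id symbol-to-Column conversion

val ds = spark.range(10)
val summed = ds.agg(sum('id) as "sum")

// summed is a DataFrame (Dataset[Row]) with exactly one row
summed.show()
// +---+
// |sum|
// +---+
// | 45|
// +---+
```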
## Untyped groupBy Operator

1. **groupBy** groups records in a Dataset by one or more _discriminator columns_ (expressions)
   ```scala
   val nums = spark.range(10)
   nums.groupBy('id % 2 as "group").agg(sum('id) as "sum")
   ```
1. Creates a **RelationalGroupedDataset**
   * Supports untyped, Row-based **agg**
   * Shortcuts for _the usual suspects_, e.g. **avg**, **count**, **max**
   * Supports **pivot** (example below)
   * Read [RelationalGroupedDataset's scaladoc](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset)
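Since **pivot** is often the least familiar of the three, a small sketch (the data and column names are invented for the example):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val sales = Seq(
  ("2021", "books", 10.0),
  ("2021", "games", 5.0),
  ("2022", "books", 20.0)).toDF("year", "category", "price")

// pivot turns the distinct values of a column into columns of the result
sales.groupBy("year").pivot("category").agg(sum("price")).show()
// +----+-----+-----+
// |year|books|games|
// +----+-----+-----+
// |2021| 10.0|  5.0|
// |2022| 20.0| null|
// +----+-----+-----+
```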
## Typed groupByKey Operator

1. **groupByKey** is similar to the **groupBy** operator, but gives a typed interface
   ```scala
   val nums = spark.range(10)
   nums.groupByKey(_ % 2).reduceGroups(_ + _).show

   // compare to the untyped query
   nums.groupBy('id % 2 as "group").agg(sum('id) as "sum")
   ```
1. Creates a **KeyValueGroupedDataset**
   * Supports typed **agg**
   * Typed operators for working with groups, e.g. **reduceGroups**, **mapValues**, **mapGroups**, **flatMapGroups**, **cogroup** (sketch below)
   * Read [KeyValueGroupedDataset's scaladoc](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.KeyValueGroupedDataset)
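A minimal sketch of the typed interface using **mapGroups** (the result type follows from the Scala function, not from Rows):

```scala
import spark.implicits._

// as[Long] gives a Dataset of Scala Longs (spark.range returns java.lang.Longs)
val nums = spark.range(10).as[Long]

// Group by parity and compute each group's size and sum in plain Scala
nums
  .groupByKey(_ % 2)
  .mapGroups { (key, values) =>
    val vs = values.toSeq
    (key, vs.size, vs.sum)
  }
  .toDF("group", "count", "sum")
  .show()
```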
## User-Defined Aggregate Functions (UDAFs)

1. [Aggregator](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Aggregator) - the base class for user-defined aggregations, registered as untyped UDAFs with **functions.udaf** (sketch below)
1. [UserDefinedAggregateFunction](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction) - the legacy base class, deprecated since Spark 3.0 in favour of **Aggregator**
1. Switch to [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
   * [UserDefinedAggregateFunction — User-Defined Untyped Aggregate Functions (UDAFs)](https://books.japila.pl/spark-sql-internals/expressions/UserDefinedAggregateFunction)
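A minimal **Aggregator** sketch, registered as an untyped UDAF via **functions.udaf** (the `my_average` name and the data are invented for the example):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf
import spark.implicits._

// A simple average implemented as an Aggregator[IN, BUF, OUT]
object MyAverage extends Aggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(buf: (Double, Long), price: Double): (Double, Long) =
    (buf._1 + price, buf._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(buf: (Double, Long)): Double = buf._1 / buf._2
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val myAverage = udaf(MyAverage)
spark.udf.register("my_average", myAverage)  // also usable from SQL

val prices = Seq(1.0, 2.0, 3.0).toDF("price")
prices.agg(myAverage($"price") as "avg_price").show()
```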
## Recap

1. [Aggregate Functions](#/aggregate-functions)
1. [agg Operator](#/agg-operator)
1. [Untyped groupBy Operator](#/groupBy-operator)
1. [Typed groupByKey Operator](#/groupByKey-operator)
1. [User-Defined Aggregate Functions (UDAFs)](#/udaf)
# Questions?

* Read [The Internals of Apache Spark](https://books.japila.pl/apache-spark-internals/)
* Read [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
* Read [The Internals of Spark Structured Streaming](https://books.japila.pl/spark-structured-streaming-internals/)
* Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on Twitter
* Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)