Standard and User-Defined Functions
Apache Spark 3.2 / Spark SQL
@jaceklaskowski / StackOverflow / GitHub / LinkedIn
The "Internals" Books:
books.japila.pl
## Agenda

1. [Standard Functions](#/standard-functions)
1. [Standard Functions for Collections](#/standard-functions-collections)
1. [Standard Functions for Date and Time](#/standard-functions-datetime)
1. [User-Defined Functions (UDFs)](#/udf)
## Standard Functions
(1 of 2)
1. **Standard functions** (aka **native functions**) are built-in functions that transform the values of columns into new values
   * Aggregate functions, e.g. **avg**, **count**, **sum**
   * Collection functions, e.g. **explode**, **from_json**, **array_\***
   * Date and time functions, e.g. **current_timestamp**, **to_date**, **window**
   * Math functions, e.g. **conv**, **factorial**, **pow**
   * Non-aggregate functions, e.g. **array**, **broadcast**, **expr**, **lit**
   * Sorting functions, e.g. **asc**, **asc_nulls_first**, **asc_nulls_last**
   * String functions, e.g. **concat_ws**, **trim**, **upper**
   * UDF functions, e.g. **callUDF**, **udf**
   * Window functions, e.g. **rank**, **row_number**
## Standard Functions
(2 of 2)
1. Import the **org.apache.spark.sql.functions** object

   ```scala
   import org.apache.spark.sql.functions._
   ```

1. Use Dataset operators as the "execution environment"
   * **withColumn**, **select**, **filter**
1. Home exercise
   * Read up [functions object's scaladoc](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$)
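A minimal sketch of the pattern above, assuming a local `SparkSession` (a `spark-shell` session provides one as `spark`) and a hypothetical two-row dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// assumption: a local session for demonstration (spark-shell already has `spark`)
val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// hypothetical sample data
val people = Seq("jacek", "agata").toDF("name")

// standard functions execute inside Dataset operators
people
  .withColumn("name_upper", upper($"name"))                 // String function
  .select($"name_upper", length($"name_upper") as "len")    // another standard function
  .filter($"len" > 4)                                       // Column expression
  .show()
```

Note that standard functions build `Column` expressions; nothing runs until a Dataset operator (here `show`) triggers execution.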
## Standard Functions for Collections

1. Functions for "Array Algebra"
   * **array_contains**, **array_distinct**, **array_except**, **array_intersect**, **flatten**, etc.
   * **arrays_zip**, **arrays_overlap**
1. Functions for "Map Algebra"
   * **map_concat**, **map_from_entries**, **map_keys**, **map_values**
1. **explode**, **explode_outer**
1. **posexplode**, **posexplode_outer**
1. Review Spark API scaladoc for [functions object](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$)
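A quick sketch of a few of the collection functions above, assuming a `spark` session and a made-up single-row dataset with two array columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sample data: two array columns in one row
val df = Seq((Seq(1, 2, 3), Seq(3, 4))).toDF("xs", "ys")

// "Array Algebra" on Column expressions
df.select(
  array_contains($"xs", 2) as "has_2",          // true
  array_intersect($"xs", $"ys") as "common",    // [3]
  arrays_overlap($"xs", $"ys") as "overlap"     // true
).show()

// explode turns one row per array into one row per element
df.select(explode($"xs") as "x").show()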
## Standard Functions for Date and Time

1. **unix_timestamp**
1. **to_timestamp**
1. **window**
1. **from_utc_timestamp**, **to_utc_timestamp**, **months_between**
1. Review Spark API scaladoc for [functions object](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$)
1. Switch to [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
   * [Date and Time Functions](https://books.japila.pl/spark-sql-internals/spark-sql-functions-datetime)
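A sketch combining **to_timestamp**, **unix_timestamp**, and **window**, assuming a `spark` session and two hypothetical event timestamps:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical events as timestamp strings
val events = Seq("2021-10-01 12:00:00", "2021-10-01 12:07:30").toDF("ts_string")

events
  .withColumn("ts", to_timestamp($"ts_string"))     // string -> TimestampType
  .withColumn("epoch", unix_timestamp($"ts"))       // seconds since the epoch
  .groupBy(window($"ts", "5 minutes"))              // tumbling 5-minute windows
  .count()
  .show(truncate = false)
```

`window` is the same function used for event-time windowing in Structured Streaming; on a static Dataset it simply buckets rows into time ranges.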
## User-Defined Functions (UDFs)

### « HEADS-UP »

> **Use the standard functions whenever possible** before reverting to custom UDFs.
> UDFs are a black box for the Spark Optimizer, which does **not even try** to optimize them.
## UDFs — User-Defined Functions

1. User-Defined Functions extend the "vocabulary" of Spark SQL
1. Use the **udf** function to define a user-defined function

   ```scala
   // pure Scala function
   val myUpperFn = (input: String) => input.toUpperCase

   // user-defined function
   val myUpper = udf(myUpperFn)
   ```

1. Use UDFs as standard functions
   * **withColumn**, **select**, **filter**, etc.
   * Also the **callUDF** function
1. Switch to [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
   * [User-Defined Functions](https://books.japila.pl/spark-sql-internals/spark-sql-udfs/)
## Registering UDFs for SQL queries

1. Use **spark.udf.register** to register a Scala function as a user-defined function

   ```scala
   // pure Scala function
   val myUpperFn = (input: String) => input.toUpperCase

   // register the Scala function as a UDF
   spark.udf.register("myUpper", myUpperFn)

   // use myUpper as if it were a standard function
   sql("select myUpper(name) from people").show
   ```
## Deterministic UserDefinedFunctions

1. A user-defined function is **deterministic** by default
   * Evaluates to the same result for the same input(s)
1. Use the **deterministic** method to find out whether it is or not
1. Use **asNondeterministic** to disable determinism
1. Non-deterministic expressions are not allowed in some logical operators and are excluded from some optimizations
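A sketch of the flag in action, using a deliberately random zero-argument UDF (a made-up example):

```scala
import org.apache.spark.sql.functions._

// a UDF that returns a random value on every call
val rng = udf(() => scala.util.Random.nextDouble())

// deterministic is true by default -- even though this UDF is not!
rng.deterministic

// declare the truth, so the optimizer does not assume it can
// cache, reorder, or re-evaluate the expression freely
val rngNd = rng.asNondeterministic()
rngNd.deterministic   // false
```

`asNondeterministic` returns a new `UserDefinedFunction`; the original is left unchanged.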
## Recap

1. [Standard Functions](#/standard-functions)
1. [Standard Functions for Collections](#/standard-functions-collections)
1. [Standard Functions for Date and Time](#/standard-functions-datetime)
1. [User-Defined Functions (UDFs)](#/udf)
# Questions?

* Read [The Internals of Apache Spark](https://books.japila.pl/apache-spark-internals/)
* Read [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
* Read [The Internals of Spark Structured Streaming](https://books.japila.pl/spark-structured-streaming-internals/)
* Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on Twitter
* Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)