Columns and Dataset Operators
Apache Spark 3.2 / Spark SQL
@jaceklaskowski / StackOverflow / GitHub / LinkedIn
The "Internals" Books: books.japila.pl
## Agenda

1. [Columns](#/columns)
1. [Column Operators / Expressions](#/column-operators)
1. [Dataset Operators](#/dataset-operators)
## Columns

1. **Column** is a function that generates a value per row
    * Internally, a Catalyst expression with an `eval` method
1. **Column** type with methods
    * _Explained in the following slide_
1. Columns can be **free** or **bound** (not associated or associated with a Dataset, respectively)

    ```scala
    $"columnName"   // free (no Dataset) column reference
    myDataset("id") // bound column reference
    ```

1. Use **import spark.implicits._** to bring the `$"..."` syntax into scope
1. Switch to [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
    * [Column](https://books.japila.pl/spark-sql-internals/Column/)
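A minimal sketch of free versus bound column references, meant for pasting into `spark-shell` or a Scala worksheet with Spark on the classpath. The `people` DataFrame and its contents are hypothetical sample data, not part of the original slides.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Reuses the active session in spark-shell, otherwise starts a local one
val spark = SparkSession.builder().master("local[*]").appName("columns-sketch").getOrCreate()
import spark.implicits._ // brings $"..." and toDF into scope

// Hypothetical sample data
val people = Seq((0, "Ann"), (1, "Bob")).toDF("id", "name")

// Free column references: just a name, no Dataset attached
val freeId: Column = $"id"
val alsoFree: Column = col("id")

// Bound column reference: resolved against `people` right away
val boundId: Column = people("id")

// Both kinds can be passed to operators of that Dataset
val ids = people.select(freeId, boundId.as("bound_id"))
ids.show()

// A bound reference to an unknown column fails immediately,
// while a free one is only checked when a query uses it
val missing = scala.util.Try(people("missing"))
```

The free form is handy for reusable expressions, while the bound form helps disambiguate columns when two Datasets share a column name (for example, in joins).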
## Column Operators / Expressions

1. Special **★** (star) column reference (written `*` in code)
1. **Operators** to create (compound) columns
    * **as**, **alias** or **name** for aliases
    * **===** for equality _(!)_ (not `==`)
    * **desc**, **desc_nulls_first** and **desc_nulls_last** (and their **asc** counterparts) for sort order
    * **getItem** to access items in arrays and maps
    * **over** for windowed aggregates
    * **cast** for casting to another data type
    * **when** and **otherwise** for conditional values
1. Home exercise
    * Read up on [Column's scaladoc](http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html)
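The operators above can be sketched together in a single `select`. This is an illustration only: the `sales` DataFrame, its column names, and the `"high"`/`"low"` labels are assumptions made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

val spark = SparkSession.builder().master("local[*]").appName("column-operators-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: department, amount, and an array column
val sales = Seq(
  ("a", 1, Seq(10, 20)),
  ("a", 3, Seq(30)),
  ("b", 2, Seq(40, 50))
).toDF("dept", "amount", "items")

// Star column reference: selects all columns
val everything = sales.select($"*")

// Window specification for the windowed aggregate below
val byDept = Window.partitionBy($"dept")

val result = sales.select(
  $"dept",
  $"amount".cast("long").as("amount_long"),                 // cast + alias
  ($"dept" === "a").as("is_a"),                             // equality is ===, not ==
  $"items".getItem(0).as("first_item"),                     // first array element
  sum($"amount").over(byDept).as("dept_total"),             // windowed aggregate
  when($"amount" > 2, "high").otherwise("low").as("level")  // conditional value
).orderBy($"amount_long".desc_nulls_last)                   // descending, nulls last

result.show()
```

Note that the first `when` comes from `org.apache.spark.sql.functions`; the `when` and `otherwise` methods on `Column` continue such a chain.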
## Dataset Operators

1. **Dataset Operators** work with all records of a Dataset
    * **as** to convert a Row-based DataFrame to a typed Dataset
    * **createOrReplaceTempView** to register a temporary view
    * **explain** to show the logical and physical (execution) plans
    * **flatMap** to "explode" records
    * **randomSplit** to randomly split records into multiple Datasets (according to the given weights)
    * **select**, **selectExpr**, **filter**, **where**
    * … _many many more_ (read up on [Dataset's scaladoc](http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html))
1. Switch to [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
    * [Dataset](https://books.japila.pl/spark-sql-internals/Dataset/)
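A short sketch of the Dataset operators listed above, again for `spark-shell` or a worksheet. The sample rows, the view name `sentences`, and the 80/20 split weights are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("dataset-operators-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq((0L, "hello spark"), (1L, "hello sql")).toDF("id", "text")

// as: Row-based DataFrame -> typed Dataset (a case class would work the same way)
val typed: Dataset[(Long, String)] = df.as[(Long, String)]

// flatMap: one input record can yield many output records
val words: Dataset[String] = typed.flatMap { case (_, text) => text.split(" ") }

// select / selectExpr / filter / where
val upper = df.selectExpr("id", "upper(text) AS text_upper")
val onlySql = df.filter($"text".contains("sql"))
val sameRow = df.where("id = 1") // where is an alias of filter

// createOrReplaceTempView: makes the Dataset queryable from SQL
df.createOrReplaceTempView("sentences")
val viaSql = spark.sql("SELECT text FROM sentences WHERE id = 0")

// explain: prints the logical and physical plans
upper.explain(extended = true)

// randomSplit: one Dataset per weight; the seed makes the split repeatable
val Array(train, holdout) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
```

With only two rows the split sizes are not meaningful; the point is that `randomSplit` returns as many Datasets as there are weights and that together they cover the input.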
## Recap

1. [Columns](#/columns)
1. [Column Operators / Expressions](#/column-operators)
1. [Dataset Operators](#/dataset-operators)
# Questions?

* Read [The Internals of Apache Spark](https://books.japila.pl/apache-spark-internals/)
* Read [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
* Read [The Internals of Spark Structured Streaming](https://books.japila.pl/spark-structured-streaming-internals/)
* Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on Twitter
* Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)