Working with
Missing Data

Apache Spark 2.4.1 / Spark SQL

@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

Dataset API for Missing Data

DataFrameNaFunctions

  1. DataFrameNaFunctions is the interface to work with missing data in DataFrames
  2. Allows for dropping or replacing missing data
  3. Use Dataset.na operator to access the API

              na: DataFrameNaFunctions
            

DataFrameNaFunctions API

  1. Untyped transformations that return a DataFrame
  2. drop drops rows containing any missing values
  3. fill replaces missing values with a value
  4. replace replaces values matching keys in a replacement map
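
The three transformations in action, assuming a spark-shell session (the sample people dataset is made up):

```scala
import spark.implicits._

// A sample dataset with missing values in both columns
val people = Seq(
  (Some(1), Some("Alice")),
  (Some(2), None),
  (None,    Some("Carol"))).toDF("id", "name")

// drop removes rows containing any missing values
people.na.drop().show

// fill replaces missing values with a per-column value
people.na.fill(Map("id" -> -1, "name" -> "unknown")).show

// replace substitutes values matching keys in a replacement map
people.na.replace("name", Map("Carol" -> "C.")).show
```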

Functions for Missing Data

Standard and SQL Functions

  1. Available functions differ per "execution mode"
    • Dataset API and SQL
  2. A common idiom is to use the expr standard function to access a SQL-only function outside SQL mode
    
                    people.withColumn("expr1", expr("nullif(expr1, expr2)"))
                  
  3. Review nullExpressions.scala for the definitive list of the Catalyst expressions
    • AtLeastNNonNulls (used by DataFrameNaFunctions.drop)
    • Coalesce
    • IfNull, NullIf
    • IsNaN, IsNull, IsNotNull
    • NaNvl, Nvl, Nvl2

Standard Functions

  1. coalesce selects the first column that is not null, or null if all are null
  2. isnan returns true iff the column is NaN
  3. isnull returns true iff the column is null
  4. nanvl returns col1 if it is not NaN, or col2 otherwise
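
Illustrating the four functions on a throwaway two-column dataset (spark-shell):

```scala
import org.apache.spark.sql.functions.{coalesce, isnan, isnull, lit, nanvl}
import spark.implicits._

val nums = Seq(
  (Some(1.0), Some(2.0)),
  (None,      Some(Double.NaN))).toDF("a", "b")

nums.select(
  coalesce($"a", $"b", lit(0.0)) as "first_non_null",
  isnan($"b")       as "b_is_nan",
  isnull($"a")      as "a_is_null",
  nanvl($"b", $"a") as "b_unless_nan").show
```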

Standard Aggregate Functions

  1. first returns the first non-null value (when the ignoreNulls flag is on)
  2. last returns the last non-null value (when the ignoreNulls flag is on)
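
Both aggregates take the flag as a second argument; with the flag off they may return null. A quick check in spark-shell (dataset invented for illustration):

```scala
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._

val names = Seq(
  (0, null.asInstanceOf[String]),
  (1, "hello")).toDF("id", "name")

// With ignoreNulls on, both aggregates skip the missing name entirely
names.orderBy($"id").select(
  first($"name", ignoreNulls = true),
  last($"name", ignoreNulls = true)).show
```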

SQL Functions

  1. ifnull returns expr2 if expr1 is null, or expr1 otherwise
  2. isnan returns true if expr is NaN, or false otherwise
  3. isnotnull returns true if expr is not null, or false otherwise
  4. isnull returns true if expr is null, or false otherwise
  5. nanvl returns expr1 if it is not NaN, or expr2 otherwise
  6. nullif returns null if expr1 equals expr2, or expr1 otherwise
  7. nvl returns expr2 if expr1 is null, or expr1 otherwise
  8. nvl2 returns expr2 if expr1 is not null, or expr3 otherwise
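
The SQL-only functions can be exercised through spark.sql over a temporary view (sketch; the view and column names are made up):

```scala
import spark.implicits._

Seq((Some(1), None: Option[Int])).toDF("e1", "e2").createOrReplaceTempView("t")

spark.sql("""
  SELECT
    ifnull(e2, e1)   AS a,  -- e1, since e2 is null
    nvl(e2, e1)      AS b,  -- same as ifnull
    nvl2(e1, 100, 0) AS c,  -- 100, since e1 is not null
    nullif(e1, e1)   AS d   -- null, since the arguments are equal
  FROM t
""").show
```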

Column API for Missing Data

  1. Filtering (e.g. Dataset.where operator)
    • isNaN returns true if the current expression is NaN
    • isNotNull returns true if the current expression is NOT null
    • isNull returns true if the current expression is null
  2. Sorting (e.g. Dataset.sort operator)
    • asc_nulls_first / asc_nulls_last return an ascending sort with null values before / after non-null values
    • desc_nulls_first returns a descending sort with null values before non-null values (i.e. nulls first)
    • desc_nulls_last returns a descending sort with null values after non-null values (i.e. nulls last)
    
                    people.sort($"age".desc_nulls_last)
                  
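
The filtering predicates and nulls-aware sorting together, on an ad-hoc dataset (spark-shell):

```scala
import spark.implicits._

val people = Seq(
  (Some(25.0), "Alice"),
  (None,       "Bob")).toDF("age", "name")

people.where($"age".isNotNull).show      // Alice only
people.where($"age".isNull).show         // Bob only
people.sort($"age".asc_nulls_last).show  // Alice first, Bob (null age) last
```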

Window Aggregation and Missing Data

Sorting in Window Aggregation

  1. A window specification's ORDER BY sorts null values first by default (for ascending order); use the nulls-aware Column sort operators to control it
  2. Ranking functions (e.g. rank, dense_rank) follow that ordering, so null values get ranked too
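
A sketch of how the null ordering in a window specification drives a ranking function (dataset invented for illustration):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import spark.implicits._

val scores = Seq(
  ("a", Some(10)),
  ("a", None),
  ("b", Some(20))).toDF("group", "score")

// Push nulls to the end of each partition before ranking
val byGroup = Window.partitionBy("group").orderBy($"score".asc_nulls_last)
scores.withColumn("rank", rank().over(byGroup)).show
```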

Schema Nullability

  1. nullable attribute in schema
  2. Helps the query optimizer handle such columns (e.g. elide needless null checks)
  3. Not enforced; acts merely as a hint
  4. If used incorrectly (null values in a column marked non-nullable), it can lead to incorrect results or exceptions that are difficult to debug
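
The attribute lives on every StructField of a schema and shows up in the schema tree:

```scala
import org.apache.spark.sql.types._

// nullable is the third parameter of StructField
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

schema.printTreeString
// root
//  |-- id: long (nullable = false)
//  |-- name: string (nullable = true)
```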

Optimizations

  1. BooleanSimplification
  2. NullPropagation
  3. spark.sql.constraintPropagation.enabled (internal) configuration property for Constraint propagation
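
One way to peek at these optimizations is to compare the analyzed and optimized plans of an expression with a null operand (sketch; the exact plan output depends on the Spark version):

```scala
import org.apache.spark.sql.functions.lit

// The optimizer is expected to fold null + 1 into a plain null literal
spark.range(1).select(lit(null) + 1 as "sum").explain(true)

// Constraint propagation is governed by the internal property below
spark.conf.set("spark.sql.constraintPropagation.enabled", false)
```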

Defining Dataset With Missing Data

Common but not idiomatic way

val names = Seq(
  (0, null.asInstanceOf[String]), // <-- define a missing name
  (1, "hello")).toDF("id", "name")
scala> names.show
+---+-----+
| id| name|
+---+-----+
|  0| null|
|  1|hello|
+---+-----+

Idiomatic way

val names = Seq(
  (0, None), // <-- define a missing name
  (1, Some("hello"))).toDF("id", "name")
scala> names.show
+---+-----+
| id| name|
+---+-----+
|  0| null|
|  1|hello|
+---+-----+