Working with
Missing Data

Apache Spark 2.4.1 / Spark SQL

@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

Dataset API for Missing Data

DataFrameNaFunctions

  1. DataFrameNaFunctions is the interface to work with missing data in DataFrames
  2. Allows for dropping or replacing missing data
  3. Use Dataset.na operator to access the API

              na: DataFrameNaFunctions
            

DataFrameNaFunctions API

  1. Untyped transformations that return a DataFrame
  2. drop drops rows containing any missing values
  3. fill replaces missing values with a value
  4. replace replaces values matching keys in a replacement map
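
The three transformations in action, assuming a spark-shell session (the sample people dataset is made up):

```scala
import spark.implicits._

// A sample dataset with missing values in both columns
val people = Seq(
  (Some(1), Some("Alice")),
  (Some(2), None),
  (None,    Some("Carol"))).toDF("id", "name")

// drop removes rows containing any missing values
people.na.drop().show

// fill replaces missing values with a per-column value
people.na.fill(Map("id" -> -1, "name" -> "unknown")).show

// replace substitutes values matching keys in a replacement map
people.na.replace("name", Map("Carol" -> "C.")).show
```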

Functions for Missing Data

Standard and SQL Functions

  1. Available functions differ per "execution mode"
    • Dataset API and SQL
  2. A common idiom is to use the expr standard function to access a SQL-only function outside SQL mode
    
                    people.withColumn("expr1", expr("nullif(expr1, expr2)"))
                  
  3. Review nullExpressions.scala for the definitive list of the Catalyst expressions
    • AtLeastNNonNulls (used by DataFrameNaFunctions.drop)
    • Coalesce
    • IfNull, NullIf
    • IsNaN, IsNull, IsNotNull
    • NaNvl, Nvl, Nvl2

Standard Functions

  1. coalesce selects the first column that is not null, or null if all are null
  2. isnan returns true iff the column is NaN
  3. isnull returns true iff the column is null
  4. nanvl returns col1 if it is not NaN, or col2 otherwise
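
Illustrating the four functions on a throwaway two-column dataset (spark-shell):

```scala
import org.apache.spark.sql.functions.{coalesce, isnan, isnull, lit, nanvl}
import spark.implicits._

val nums = Seq(
  (Some(1.0), Some(2.0)),
  (None,      Some(Double.NaN))).toDF("a", "b")

nums.select(
  coalesce($"a", $"b", lit(0.0)) as "first_non_null",
  isnan($"b")       as "b_is_nan",
  isnull($"a")      as "a_is_null",
  nanvl($"b", $"a") as "b_unless_nan").show
```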

Standard Aggregate Functions

  1. first returns the first non-null value (when the ignoreNulls flag is on)
  2. last returns the last non-null value (when the ignoreNulls flag is on)
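
Both aggregates take the flag as a second argument; with the flag off they may return null. A quick check in spark-shell (dataset invented for illustration):

```scala
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._

val names = Seq(
  (0, null.asInstanceOf[String]),
  (1, "hello")).toDF("id", "name")

// With ignoreNulls on, both aggregates skip the missing name entirely
names.orderBy($"id").select(
  first($"name", ignoreNulls = true),
  last($"name", ignoreNulls = true)).show
```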

SQL Functions

  1. ifnull returns expr2 if expr1 is null, or expr1 otherwise
  2. isnan returns true if expr is NaN, or false otherwise
  3. isnotnull returns true if expr is not null, or false otherwise
  4. isnull returns true if expr is null, or false otherwise
  5. nanvl returns expr1 if it is not NaN, or expr2 otherwise
  6. nullif returns null if expr1 equals expr2, or expr1 otherwise
  7. nvl returns expr2 if expr1 is null, or expr1 otherwise
  8. nvl2 returns expr2 if expr1 is not null, or expr3 otherwise
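
The SQL-only functions can be exercised through spark.sql over a temporary view (sketch; the view and column names are made up):

```scala
import spark.implicits._

Seq((Some(1), None: Option[Int])).toDF("e1", "e2").createOrReplaceTempView("t")

spark.sql("""
  SELECT
    ifnull(e2, e1)   AS a,  -- e1, since e2 is null
    nvl(e2, e1)      AS b,  -- same as ifnull
    nvl2(e1, 100, 0) AS c,  -- 100, since e1 is not null
    nullif(e1, e1)   AS d   -- null, since the arguments are equal
  FROM t
""").show
```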

Column API for Missing Data

  1. Filtering (e.g. Dataset.where operator)
    • isNaN returns true if the current expression is NaN
    • isNotNull returns true if the current expression is NOT null
    • isNull returns true if the current expression is null
  2. Sorting (e.g. Dataset.sort operator)
    • asc_nulls_first / asc_nulls_last return an ascending sort with null values before / after non-null values
    • desc_nulls_first returns a descending sort with null values before non-null values (i.e. nulls first)
    • desc_nulls_last returns a descending sort with null values after non-null values (i.e. nulls last)
    
                    people.sort($"age".desc_nulls_last)
                  
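
The filtering predicates and nulls-aware sorting together, on an ad-hoc dataset (spark-shell):

```scala
import spark.implicits._

val people = Seq(
  (Some(25.0), "Alice"),
  (None,       "Bob")).toDF("age", "name")

people.where($"age".isNotNull).show      // Alice only
people.where($"age".isNull).show         // Bob only
people.sort($"age".asc_nulls_last).show  // Alice first, Bob (null age) last
```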

Window Aggregation and Missing Data

Sorting in Window Aggregation

  1. A window specification's ORDER BY sorts null values first by default (for ascending order); use the nulls-aware Column sort operators to control it
  2. Ranking functions (e.g. rank, dense_rank) follow that ordering, so null values get ranked too
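
A sketch of how the null ordering in a window specification drives a ranking function (dataset invented for illustration):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import spark.implicits._

val scores = Seq(
  ("a", Some(10)),
  ("a", None),
  ("b", Some(20))).toDF("group", "score")

// Push nulls to the end of each partition before ranking
val byGroup = Window.partitionBy("group").orderBy($"score".asc_nulls_last)
scores.withColumn("rank", rank().over(byGroup)).show
```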

Schema Nullability

  1. nullable attribute in schema
  2. Helps the query optimizer handle such columns (e.g. elide needless null checks)
  3. Not enforced; acts merely as a hint
  4. If used incorrectly (null values in a column marked non-nullable), it can lead to incorrect results or exceptions that are difficult to debug
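
The attribute lives on every StructField of a schema and shows up in the schema tree:

```scala
import org.apache.spark.sql.types._

// nullable is the third parameter of StructField
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

schema.printTreeString
// root
//  |-- id: long (nullable = false)
//  |-- name: string (nullable = true)
```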

Optimizations

  1. BooleanSimplification
  2. NullPropagation
  3. spark.sql.constraintPropagation.enabled (internal) configuration property for Constraint propagation
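
One way to peek at these optimizations is to compare the analyzed and optimized plans of an expression with a null operand (sketch; the exact plan output depends on the Spark version):

```scala
import org.apache.spark.sql.functions.lit

// The optimizer is expected to fold null + 1 into a plain null literal
spark.range(1).select(lit(null) + 1 as "sum").explain(true)

// Constraint propagation is governed by the internal property below
spark.conf.set("spark.sql.constraintPropagation.enabled", false)
```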

Defining Dataset With Missing Data

Common but not idiomatic way

val names = Seq(
  (0, null.asInstanceOf[String]), // <-- define a missing name
  (1, "hello")).toDF("id", "name")
scala> names.show
+---+-----+
| id| name|
+---+-----+
|  0| null|
|  1|hello|
+---+-----+

Idiomatic way

val names = Seq(
  (0, None), // <-- define a missing name
  (1, Some("hello"))).toDF("id", "name")
scala> names.show
+---+-----+
| id| name|
+---+-----+
|  0| null|
|  1|hello|
+---+-----+