
Optimizer — Generic Logical Query Plan Optimizer

Optimizer (Catalyst Optimizer) is an extension of the RuleExecutor abstraction for logical query plan optimizers.

Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan

Implementations

Creating Instance

Optimizer takes the following to be created:

Optimizer is an abstract class and cannot be created directly; it is instantiated only through its concrete subclasses.

Default Batches

Optimizer defines the rule batches of logical optimizations that transform the analyzed logical plan of a structured query into an optimized logical plan.

The base rule batches can be further refined (extended or excluded).
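In the sources, every batch is a name, an execution strategy, and a sequence of rules. The following is a minimal self-contained sketch of that structure (toy types for illustration only; the real `Batch`, `Once` and `FixedPoint` live in Spark's `org.apache.spark.sql.catalyst.rules` package):

```scala
// Toy model of RuleExecutor's Batch structure (illustrative only; not the
// actual Spark classes).
sealed trait Strategy { def maxIterations: Int }
case object Once extends Strategy { val maxIterations = 1 }
case class FixedPoint(maxIterations: Int) extends Strategy

// A rule transforms a plan; a batch groups rules under one strategy.
case class Rule[P](name: String, transform: P => P)
case class Batch[P](name: String, strategy: Strategy, rules: Seq[Rule[P]])

// A batch resembling "Eliminate Distinct": a single rule run Once
// (here the "plan" is just a String, to keep the sketch runnable)
val eliminateDistinct = Batch[String]("Eliminate Distinct", Once,
  Seq(Rule("EliminateDistinct", (p: String) => p.replace("DISTINCT ", ""))))
```

Each batch below follows this shape: a name, a strategy (`Once` or a fixed-point strategy), and zero or more rules.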

Eliminate Distinct

Rules:

  • EliminateDistinct

Strategy: Once

Finish Analysis

Rules:

  • EliminateResolvedHint
  • EliminateSubqueryAliases
  • EliminateView
  • ReplaceExpressions
  • RewriteNonCorrelatedExists
  • ComputeCurrentTime
  • GetCurrentDatabase
  • RewriteDistinctAggregates
  • ReplaceDeduplicateWithAggregate

Strategy: Once

Union

Rules:

Strategy: Once

OptimizeLimitZero

Rules:

  • OptimizeLimitZero

Strategy: Once

LocalRelation early

Rules:

Strategy: fixedPoint

Pullup Correlated Expressions

Rules:

Strategy: Once

Subquery

Rules:

Strategy: FixedPoint(1)

Replace Operators

Rules:

  • RewriteExceptAll
  • RewriteIntersectAll
  • ReplaceIntersectWithSemiJoin
  • ReplaceExceptWithFilter
  • ReplaceExceptWithAntiJoin
  • ReplaceDistinctWithAggregate

Strategy: fixedPoint

Aggregate

Rules:

  • RemoveLiteralFromGroupExpressions
  • RemoveRepetitionFromGroupExpressions

Strategy: fixedPoint

Operator Optimization before Inferring Filters

Rules:

  • PushProjectionThroughUnion
  • ReorderJoin
  • EliminateOuterJoin
  • PushDownPredicates
  • PushDownLeftSemiAntiJoin
  • PushLeftSemiLeftAntiThroughJoin
  • LimitPushDown
  • ColumnPruning
  • CollapseRepartition
  • CollapseProject
  • CollapseWindow
  • CombineFilters
  • CombineLimits
  • CombineUnions
  • TransposeWindow
  • NullPropagation
  • ConstantPropagation
  • FoldablePropagation
  • OptimizeIn
  • ConstantFolding
  • ReorderAssociativeOperator
  • LikeSimplification
  • BooleanSimplification
  • SimplifyConditionals
  • RemoveDispensableExpressions
  • SimplifyBinaryComparison
  • ReplaceNullWithFalseInPredicate
  • PruneFilters
  • SimplifyCasts
  • SimplifyCaseConversionExpressions
  • RewriteCorrelatedScalarSubquery
  • EliminateSerialization
  • RemoveRedundantAliases
  • RemoveNoopOperators
  • SimplifyExtractValueOps
  • CombineConcats
  • extendedOperatorOptimizationRules

Strategy: fixedPoint

Infer Filters

Rules:

Strategy: Once

Operator Optimization after Inferring Filters

Rules:

Strategy: fixedPoint

Early Filter and Projection Push-Down

Rules:

Strategy: Once

Join Reorder

Rules:

Strategy: FixedPoint(1)

Eliminate Sorts

Rules:

  • EliminateSorts

Strategy: Once

Decimal Optimizations

Rules:

Strategy: fixedPoint

Object Expressions Optimization

Rules:

  • EliminateMapObjects
  • CombineTypedFilters
  • ObjectSerializerPruning
  • ReassignLambdaVariableID

Strategy: fixedPoint

LocalRelation

Rules:

Strategy: fixedPoint

Check Cartesian Products

Rules:

  • CheckCartesianProducts

Strategy: Once

RewriteSubquery

Rules:

Strategy: Once

NormalizeFloatingNumbers

Rules:

  • NormalizeFloatingNumbers

Strategy: Once

Excluded Rules

Optimizer uses the spark.sql.optimizer.excludedRules configuration property to control which of the defaultBatches rules to exclude.
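Rules are excluded by their fully-qualified class names, comma-separated. A configuration fragment for spark-shell (the class name below assumes the rule lives in Spark 3.x's `org.apache.spark.sql.catalyst.optimizer` package, and that an active `SparkSession` is bound to `spark`):

```scala
// Exclude one of the default Operator Optimization rules by its
// fully-qualified name; multiple rules are comma-separated.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicates")
```

Note that requesting a non-excludable rule here has no effect beyond a WARN message (see Non-Excludable Rules below).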

Non-Excludable Rules

Optimizer considers some optimization rules non-excludable and necessary for correctness. They are considered so critical for query optimization that they can never be excluded (even with the spark.sql.optimizer.excludedRules configuration property).

  • EliminateDistinct
  • EliminateResolvedHint
  • EliminateSubqueryAliases
  • EliminateView
  • ReplaceExpressions
  • ComputeCurrentTime
  • GetCurrentDatabase
  • RewriteDistinctAggregates
  • ReplaceDeduplicateWithAggregate
  • ReplaceIntersectWithSemiJoin
  • ReplaceExceptWithFilter
  • ReplaceExceptWithAntiJoin
  • RewriteExceptAll
  • RewriteIntersectAll
  • ReplaceDistinctWithAggregate
  • PullupCorrelatedPredicates
  • RewriteCorrelatedScalarSubquery
  • RewritePredicateSubquery
  • NormalizeFloatingNumbers

Accessing Optimizer

Optimizer is available as the optimizer property of a session-specific SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer

You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good "debugging" tool).

val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))

FixedPoint Strategy

Optimizer uses a FixedPoint strategy with the maximum number of iterations defined by the spark.sql.optimizer.maxIterations configuration property.
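A FixedPoint batch keeps re-running its rules until the plan stops changing or the iteration cap is hit. The loop can be sketched in self-contained form (toy types; not Spark's actual RuleExecutor):

```scala
// Toy fixed-point executor: apply `rule` repeatedly until the "plan" no
// longer changes or `maxIterations` is reached (mirrors FixedPoint's contract).
def runToFixedPoint[P](plan: P, rule: P => P, maxIterations: Int): P = {
  var current = plan
  var iteration = 0
  var continue = true
  while (continue && iteration < maxIterations) {
    val next = rule(current)
    if (next == current) continue = false else current = next
    iteration += 1
  }
  current
}

// Example "rule": halve even numbers; 40 -> 20 -> 10 -> 5, then stable
val result = runToFixedPoint[Int](40, n => if (n % 2 == 0) n / 2 else n, 100)
// result: 5
```

This is why FixedPoint(1) batches (such as Subquery and Join Reorder) behave like Once: the cap stops the loop after a single pass even if the plan is still changing.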

Extended Operator Optimization Rules (Extension Point)

extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil

extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization rule batch.

extendedOperatorOptimizationRules rules are appended after the built-in operator optimization rules of both the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.

extendedOperatorOptimizationRules is used when...FIXME
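The extension-point pattern itself is straightforward to mirror: an abstract optimizer exposes a sequence of extra rules defaulting to Nil, and subclasses override it. A toy sketch of the pattern (not the actual Spark classes; "plans" are Strings here so the example is runnable standalone):

```scala
// Toy version of the extendedOperatorOptimizationRules extension point.
abstract class ToyOptimizer {
  // Built-in rules of the "Operator Optimization" batch
  protected def builtInRules: Seq[String => String] =
    Seq(_.replace("  ", " "))  // collapse double spaces

  // Extension point: extra rules appended after the built-in ones (Nil by default)
  def extendedOperatorOptimizationRules: Seq[String => String] = Nil

  def optimize(plan: String): String =
    (builtInRules ++ extendedOperatorOptimizationRules)
      .foldLeft(plan)((p, rule) => rule(p))
}

// A concrete optimizer contributing one extra rule
class MyOptimizer extends ToyOptimizer {
  override def extendedOperatorOptimizationRules: Seq[String => String] =
    Seq(_.trim)
}
```

In Spark itself, custom optimizer rules are usually contributed through the public SparkSessionExtensions API (e.g. injectOptimizerRule) rather than by subclassing Optimizer directly.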

earlyScanPushDownRules (Extension Point)

earlyScanPushDownRules: Seq[Rule[LogicalPlan]] = Nil

earlyScanPushDownRules extension point...FIXME

earlyScanPushDownRules is used when...FIXME

blacklistedOnceBatches

blacklistedOnceBatches: Set[String]

blacklistedOnceBatches...FIXME

blacklistedOnceBatches is used when...FIXME

Batches

batches: Seq[Batch]

batches uses spark.sql.optimizer.excludedRules configuration property for the rules to be excluded.

batches filters out non-excludable rules from the rules to be excluded. For any filtered-out rule, batches prints out the following WARN message to the logs:

Optimization rule '[ruleName]' was not excluded from the optimizer because this rule is a non-excludable rule.

batches then filters out the excluded rules from all defaultBatches. If a batch is left with no rules, batches prints out the following INFO message to the logs:

Optimization batch '[name]' is excluded from the optimizer as all enclosed rules have been excluded.

batches is part of the RuleExecutor abstraction.
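The filtering described above can be sketched as plain Scala (toy names and println-based "logging"; the real logic lives in Optimizer.batches):

```scala
// Toy sketch of the exclusion logic: drop excluded rule names from each
// batch, never drop non-excludable ones, and drop batches left empty.
case class SimpleBatch(name: String, ruleNames: Seq[String])

def filterBatches(
    defaultBatches: Seq[SimpleBatch],
    excluded: Set[String],
    nonExcludable: Set[String]): Seq[SimpleBatch] = {
  // WARN for each requested exclusion that cannot be honored
  excluded.intersect(nonExcludable).foreach { r =>
    println(s"WARN Optimization rule '$r' was not excluded from the optimizer " +
      "because this rule is a non-excludable rule.")
  }
  val effective = excluded -- nonExcludable
  defaultBatches.flatMap { batch =>
    val kept = batch.ruleNames.filterNot(effective.contains)
    if (kept.isEmpty) {
      println(s"INFO Optimization batch '${batch.name}' is excluded from the " +
        "optimizer as all enclosed rules have been excluded.")
      None  // drop the now-empty batch
    } else Some(batch.copy(ruleNames = kept))
  }
}
```

Note the order of operations: non-excludable rules are removed from the exclusion set first, so a batch consisting solely of non-excludable rules can never be emptied out.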


Last update: 2021-03-18