Logical Query Plan Analyzer

Analyzer (Spark Analyzer or Query Analyzer) is the logical query plan analyzer that validates and transforms an unresolved logical plan to an analyzed logical plan.

Analyzer is a RuleExecutor of rules that transform logical operators (RuleExecutor[LogicalPlan]).

Analyzer: Unresolved Logical Plan ==> Analyzed Logical Plan

Analyzer is used by QueryExecution to resolve the managed LogicalPlan and, as a follow-up, to assert that the structured query has been analyzed properly, i.e. that no failed, unresolved or otherwise broken logical plan operators and expressions remain.
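As a minimal sketch of what a failed analysis looks like from the user side, the following spark-shell snippet refers to a column that does not exist. It assumes the predefined spark session of spark-shell, and the column name no_such_column is made up for illustration. The reference cannot be resolved, so the query is rejected with an AnalysisException before any job is run.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.AnalysisException

// `no_such_column` is a hypothetical name that is not part of the schema of range(1)
val attempt = Try(spark.range(1).select("no_such_column"))

attempt match {
  case Failure(e: AnalysisException) => println(s"Analysis failed: ${e.getMessage}")
  case Failure(other)                => println(s"Unexpected failure: $other")
  case Success(df)                   => println(s"Analyzed fine: ${df.schema.simpleString}")
}
```

The failure is raised while the Dataset is being created, not when an action is executed, which is why the select call itself is wrapped in Try.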

extendedResolutionRules Extension Point

extendedResolutionRules: Seq[Rule[LogicalPlan]] = Nil

extendedResolutionRules is an extension point for additional logical evaluation rules for the Resolution batch. The rules are added at the end of the Resolution batch.

Default: empty

Note

SessionState uses its own Analyzer with custom extendedResolutionRules, postHocResolutionRules, and extendedCheckRules extension points.

postHocResolutionRules Extension Point

postHocResolutionRules: Seq[Rule[LogicalPlan]] = Nil

postHocResolutionRules is an extension point for rules in the Post-Hoc Resolution batch (that, if defined, are executed in one pass, i.e. with the Once strategy).

Default: empty
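One way user code can feed rules into these two extension points is through SparkSessionExtensions when a session is built. The sketch below is only an illustration under stated assumptions: NoopResolutionRule and AnalyzerExtensionsDemo are hypothetical names, the rule does nothing, and the code is meant to run as a standalone application (in spark-shell a session already exists, so the extensions would likely not be applied).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule, for illustration only: it returns the plan unchanged
object NoopResolutionRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object AnalyzerExtensionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analyzer-extensions-demo")
      .withExtensions { extensions =>
        // Expected to become part of extendedResolutionRules (Resolution batch)
        extensions.injectResolutionRule(_ => NoopResolutionRule)
        // Expected to become part of postHocResolutionRules (Post-Hoc Resolution batch)
        extensions.injectPostHocResolutionRule(_ => NoopResolutionRule)
      }
      .getOrCreate()

    // Inspect the session-specific Analyzer to see whether the rule was registered
    val analyzer = spark.sessionState.analyzer
    println(analyzer.extendedResolutionRules.contains(NoopResolutionRule))
    println(analyzer.postHocResolutionRules.contains(NoopResolutionRule))

    spark.stop()
  }
}
```

Because the Resolution batch uses the fixedPoint strategy, a rule injected there may be applied several times per query, so it should be idempotent. A post-hoc rule runs in a single pass.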

Batches

Hints

Rules:

Strategy: fixedPoint

Simple Sanity Check

Rules:

Strategy: Once

Substitution

Rules:

Strategy: fixedPoint

Resolution

Rules:

  • ResolveTableValuedFunctions
  • ResolveNamespace
  • ResolveCatalogs
  • ResolveInsertInto
  • ResolveRelations
  • ResolveTables
  • ResolveReferences
  • ResolveCreateNamedStruct
  • ResolveDeserializer
  • ResolveNewInstance
  • ResolveUpCast
  • ResolveGroupingAnalytics
  • ResolvePivot
  • ResolveOrdinalInOrderByAndGroupBy
  • ResolveAggAliasInGroupBy
  • ResolveMissingReferences
  • ExtractGenerator
  • ResolveGenerate
  • ResolveFunctions
  • ResolveAliases
  • ResolveSubquery
  • ResolveSubqueryColumnAliases
  • ResolveWindowOrder
  • ResolveWindowFrame
  • ResolveNaturalAndUsingJoin
  • ResolveOutputRelation
  • ExtractWindowExpressions
  • GlobalAggregates
  • ResolveAggregateFunctions
  • TimeWindowing
  • ResolveInlineTables
  • ResolveHigherOrderFunctions
  • ResolveLambdaVariables
  • ResolveTimeZone
  • ResolveRandomSeed
  • ResolveBinaryArithmetic
  • Type Coercion Rules
  • extendedResolutionRules

Strategy: fixedPoint

Post-Hoc Resolution

Rules:

Strategy: Once

Normalize Alter Table

Rules:

  • ResolveAlterTableChanges

Strategy: Once

Remove Unresolved Hints

Rules:

  • RemoveAllHints

Strategy: Once

Nondeterministic

Rules:

  • PullOutNondeterministic

Strategy: Once

UDF

Rules:

Strategy: Once

UpdateNullability

Rules:

  • UpdateAttributeNullability

Strategy: Once

Subquery

Rules:

Strategy: Once

Cleanup

Rules:

Strategy: fixedPoint

Creating Instance

Analyzer takes the following to be created:

Note

Analyzer can also be created without specifying the maxIterations argument, which is then taken from the optimizerMaxIterations configuration setting.

Analyzer is created when SessionState is requested for the analyzer.

Figure: Creating Analyzer

Accessing Analyzer

Analyzer is available as the analyzer property of SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.analyzer
org.apache.spark.sql.catalyst.analysis.Analyzer

You can access the analyzed logical plan of a structured query using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the analyzed logical plan using QueryExecution and its analyzed property (which, together with the numberedTreeString method, is a very good debugging tool).

val analyzedPlan = inventory.queryExecution.analyzed
scala> println(analyzedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
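To watch the Analyzer itself at work, you can hand it an unresolved plan directly. The following is a sketch for spark-shell (it assumes the predefined spark session). It uses the session's SQL parser to produce an unresolved logical plan and then calls the analyzer's execute method, which is internal API and may differ between Spark versions.

```scala
// Assumes a spark-shell session, i.e. `spark` is already defined
val parsed = spark.sessionState.sqlParser
  .parsePlan("SELECT id + 5 AS new_column FROM range(5)")

// The parser leaves the table-valued function and the column reference unresolved
println(parsed.resolved)

// Run the analyzer batches over the unresolved plan
val analyzed = spark.sessionState.analyzer.execute(parsed)
println(analyzed.resolved)
println(analyzed.numberedTreeString)
```

The resolved flag of a logical plan is a quick way to tell an unresolved plan from an analyzed one without reading the whole tree.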

FixedPoint

FixedPoint with maxIterations for the Hints, Substitution, Resolution and Cleanup batches.

Set when Analyzer is created (and can be defined explicitly or through optimizerMaxIterations configuration setting).
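The difference between the Once and FixedPoint strategies can be illustrated with a toy model in plain Scala. This is not Spark code: the Strategy, Batch and runBatch names only mirror the idea of Catalyst's RuleExecutor, a "plan" is just a string, and the stripParens rule is made up for the example.

```scala
// Toy model only: names mirror Catalyst's RuleExecutor, but this is not Spark code
sealed trait Strategy { def maxIterations: Int }
case object Once extends Strategy { val maxIterations = 1 }
final case class FixedPoint(maxIterations: Int) extends Strategy

final case class Batch[A](name: String, strategy: Strategy, rules: (A => A)*)

// Apply all rules of a batch in order, repeating until the plan stops changing
// or the iteration budget of the strategy is used up
def runBatch[A](batch: Batch[A], plan: A): A = {
  var current = plan
  var iteration = 0
  var changed = true
  while (changed && iteration < batch.strategy.maxIterations) {
    val next = batch.rules.foldLeft(current)((p, rule) => rule(p))
    changed = next != current
    current = next
    iteration += 1
  }
  current
}

// Hypothetical rule: removes one pair of enclosing parentheses per pass
val stripParens: String => String = s =>
  if (s.startsWith("(") && s.endsWith(")")) s.substring(1, s.length - 1) else s

val once  = runBatch(Batch("Once demo", Once, stripParens), "((id))")
val fixed = runBatch(Batch("FixedPoint demo", FixedPoint(100), stripParens), "((id))")

println(once)   // (id)
println(fixed)  // id
```

With Once the rule gets a single pass and leaves one pair of parentheses behind. With FixedPoint the batch is repeated until a pass makes no change, and maxIterations acts as a safety limit for rules that never converge.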

expandRelationName

expandRelationName(
  nameParts: Seq[String]): Seq[String]

expandRelationName...FIXME

expandRelationName is used when ResolveTables and ResolveRelations logical analysis rules are executed.

Logging

Enable ALL logging level for the respective session-specific loggers to see what happens inside Analyzer:

  • org.apache.spark.sql.internal.SessionState$$anon$1

  • org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 for Hive support

Add one of the following lines to conf/log4j.properties:

# with no Hive support
log4j.logger.org.apache.spark.sql.internal.SessionState$$anon$1=ALL

# with Hive support enabled
log4j.logger.org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1=ALL

Note

The reason for such weird-looking logger names is that the analyzer attribute is created as an anonymous subclass of the Analyzer class in the respective SessionStates.

Refer to Logging.


Last update: 2020-11-11