Subqueries (Subquery Expressions)¶
As of Spark 2.0, Spark SQL supports subqueries.
A subquery (aka subquery expression) is a query that is nested inside of another query.
There are the following kinds of subqueries:
. A subquery as a source (inside a SQL
FROM clause) . A scalar subquery or a predicate subquery (as a column)
Every subquery can also be correlated or uncorrelated.
[[scalar-subquery]] A scalar subquery is a structured query that returns a single row and a single column only. Spark SQL uses ScalarSubquery (SubqueryExpression) expression to represent scalar subqueries (while spark-sql-AstBuilder.md#visitSubqueryExpression[parsing a SQL statement]).
// FIXME: ScalarSubquery in a logical plan¶
ScalarSubquery expression appears as scalar-subquery#[exprId] [conditionString] in a logical plan.
// FIXME: Name of a ScalarSubquery in a logical plan¶
It is said that scalar subqueries should be used very rarely if at all and you should join instead.
Catalyst Optimizer uses the following optimizations for subqueries:
PullupCorrelatedPredicates.md[PullupCorrelatedPredicates] optimization to PullupCorrelatedPredicates.md#rewriteSubQueries[rewrite subqueries] and pull up correlated predicates
Spark Physical Optimizer uses PlanSubqueries physical optimization to plan queries with scalar subqueries.