Joins


Apache Spark 3.2 / Spark SQL


@jaceklaskowski / StackOverflow / GitHub / LinkedIn
The "Internals" Books: Apache Spark / Spark SQL / Delta Lake

Join

Internals and Optimizations

Join Logical Operator

  1. Join is a binary logical operator with two child logical operators (left and right), a join type and an optional join expression (the join condition)
  2. Switch to The Internals of Spark SQL
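
The shape of the operator can be sketched as a small data structure. This is a toy model in plain Python, not Spark's Catalyst class: the names `Relation`, `Join`, `join_type` and `condition` are made up for illustration and only mirror the four parts listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """Hypothetical leaf logical operator: a named table."""
    name: str


@dataclass(frozen=True)
class Join:
    """Toy binary logical operator: two children, a join type, an optional condition."""
    left: object
    right: object
    join_type: str = "inner"
    condition: Optional[str] = None  # None models a join without a join expression

    def children(self):
        # A binary operator always has exactly two child operators
        return [self.left, self.right]


# An equi-join with a condition, and a cross join without one
orders_customers = Join(Relation("orders"), Relation("customers"),
                        "inner", "orders.cid = customers.id")
cross = Join(Relation("a"), Relation("b"), "cross")
```

Because children are themselves operators, a `Join` can be nested as the left or right child of another `Join`, which is how multi-way joins form a tree.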

JoinSelection Execution Planning Strategy

  1. JoinSelection is an execution planning strategy that SparkPlanner uses to plan Join logical operators
  2. Supported joins (and their physical operators)
    1. Broadcast Hash Join (BroadcastHashJoinExec)
    2. Shuffled Hash Join (ShuffledHashJoinExec)
    3. Sort Merge Join (SortMergeJoinExec)
    4. Broadcast Nested Loop Join (BroadcastNestedLoopJoinExec)
    5. Cartesian Join (CartesianProductExec)
  3. Switch to The Internals of Spark SQL
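
To make the difference between the hash-based and sort-merge-based operators concrete, here is a minimal sketch of the two textbook algorithms for an inner equi-join. It assumes rows are plain `(key, value)` tuples held in memory; Spark's physical operators work on partitions and broadcast variables, which this sketch does not model.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner equi-join: build a hash table on the right side, probe it with the left."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def sort_merge_join(left, right):
    """Inner equi-join: sort both sides by key, then merge runs of equal keys."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against the whole right run
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out


orders = [(1, "o1"), (2, "o2"), (1, "o3"), (4, "o4")]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
```

Both functions return the same rows (possibly in a different order). The trade-off is that the hash join needs one side to fit in a hash table, while the sort-merge join pays for sorting but streams through both sides.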

Join Optimization — Bucketing

  1. Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning.
  2. Optimizes the performance of join queries by reducing shuffles (aka exchanges)
  3. Switch to The Internals of Spark SQL
  4. My talk Bucketing in Spark SQL 2.3 at Spark+AI Summit 2018
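
Why bucketing removes shuffles can be illustrated with a toy model. The sketch assumes integer join keys and uses `key % num_buckets` as the bucketing function, which is a simplification and not the hash function Spark uses; `bucketize` and `bucketed_join` are hypothetical helper names.

```python
def bucketize(rows, num_buckets):
    """Place each (key, value) row in bucket key % num_buckets (integer keys only)."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of the left only with bucket i of the right.

    Rows never move between buckets, which is the part that stands in
    for "no shuffle" in this sketch.
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides need the same number of buckets")
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        lookup = {}
        for key, value in rb:
            lookup.setdefault(key, []).append(value)
        out.extend((k, lv, rv) for k, lv in lb for rv in lookup.get(k, []))
    return out


orders = [(1, "o1"), (2, "o2"), (5, "o3"), (4, "o4")]
customers = [(1, "alice"), (2, "bob"), (5, "eve")]
result = bucketed_join(bucketize(orders, 4), bucketize(customers, 4))
```

Since both tables use the same bucketing column and the same number of buckets, equal keys always land in the same bucket index, so each pair of buckets can be joined independently.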

Join Optimization — Join Reordering

  1. Join Reordering is an optimization of a logical query plan that the Spark Optimizer uses to reorder inner joins so that, where possible, every join has at least one join condition
  2. Switch to The Internals of Spark SQL
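
The idea of reordering joins without any statistics can be sketched with a greedy heuristic: always join next a table that shares a condition with what has been joined so far, so cross joins are avoided where possible. This is a simplified illustration and not Spark's rule itself; `reorder_joins` and the `frozenset` encoding of conditions are assumptions of the sketch.

```python
def reorder_joins(tables, conditions):
    """Greedy reordering of a multi-way inner join.

    `tables` is the join order as written in the query.
    `conditions` is a set of frozenset({a, b}) pairs, one per join condition.
    Starting from the first table, pick the next table that has a condition
    with any already-joined table; if none has, keep the input order
    (which results in a cross join at that step).
    """
    remaining = list(tables)
    ordered = [remaining.pop(0)]
    while remaining:
        pick = next(
            (t for t in remaining
             if any(frozenset({t, j}) in conditions for j in ordered)),
            remaining[0],  # no connecting condition: fall back to input order
        )
        remaining.remove(pick)
        ordered.append(pick)
    return ordered


# Query written as a JOIN b JOIN c, but the conditions link a-c and c-b
tables = ["a", "b", "c"]
conditions = {frozenset({"a", "c"}), frozenset({"c", "b"})}
plan = reorder_joins(tables, conditions)
```

Here `a JOIN b` would be a cross join, so the sketch moves `c` ahead of `b`. When every join already has a condition the input order is returned unchanged.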

Join Optimization — Cost-based Join Reordering

  1. Cost-based Join Reordering is an optimization of a logical query plan that the Spark Optimizer uses for joins in cost-based optimization
  2. Switch to The Internals of Spark SQL
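
Cost-based reordering adds statistics to the picture. The following is a toy sketch only: it brute-forces all left-deep join orders and scores them by the sum of estimated intermediate row counts. The cost model, the selectivity numbers and the helper names `plan_cost` and `cheapest_order` are all assumptions for illustration and do not reproduce Spark's cost model or search algorithm.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Cost of a left-deep plan = sum of estimated intermediate result sizes."""
    joined, size, cost = [order[0]], rows[order[0]], 0.0
    for table in order[1:]:
        size *= rows[table]
        # Apply the selectivity of every condition linking `table` to the joined set;
        # a missing entry means no condition (selectivity 1.0, i.e. a cross join)
        for other in joined:
            size *= selectivity.get(frozenset({table, other}), 1.0)
        joined.append(table)
        cost += size
    return cost


def cheapest_order(rows, selectivity):
    """Try every left-deep order and keep the one with the lowest estimated cost."""
    return min(permutations(sorted(rows)),
               key=lambda order: plan_cost(order, rows, selectivity))


# Made-up statistics: estimated row counts and join selectivities
rows = {"fact": 1_000_000, "dim_small": 10, "dim_big": 10_000}
selectivity = {
    frozenset({"fact", "dim_small"}): 0.001,
    frozenset({"fact", "dim_big"}): 0.0001,
}
best = cheapest_order(rows, selectivity)
```

With these numbers the cheapest plans join `fact` with `dim_small` first and leave `dim_big` for last, because that keeps the intermediate result small. Brute force is only workable for a handful of tables, since the number of orders grows factorially.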