Skip to content

Streaming Join

[[operators]] In Spark Structured Streaming, a streaming join is a streaming query that was described (build) using the high-level streaming operators:

Streaming joins can be stateless or <>:

  • Joins of a streaming query and a batch query (stream-static joins) are stateless and no state management is required

  • Joins of two streaming queries (<>) are stateful and require streaming state (with an optional <>).

Stream-Stream Joins

Spark Structured Streaming supports stream-stream joins with the following:

Stream-stream equi-joins are planned as <> physical operators of two ShuffleExchangeExec physical operators (per <>).

=== [[join-state-watermark]] Join State Watermark for State Removal

Stream-stream joins may optionally define Join State Watermark for state removal (cf. <>).

A join state watermark can be specified on the following:

. <> (key state)

. <> (value state)

A join state watermark can be specified on key state, value state or both.

=== [[IncrementalExecution]] IncrementalExecution -- QueryExecution of Streaming Queries

Under the covers, the <> create a logical query plan with one or more Join logical operators.

TIP: Read up on https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-LogicalPlan-Join.html[Join Logical Operator] in https://bit.ly/spark-sql-internals[The Internals of Spark SQL] online book.

In Spark Structured Streaming IncrementalExecution is responsible for planning streaming queries for execution.

At query planning, IncrementalExecution uses the StreamingJoinStrategy execution planning strategy for planning stream-stream joins as StreamingSymmetricHashJoinExec physical operators.

Demos

Further Reading Or Watching