Structured Streaming

Internals

Apache Spark 2.2

@jaceklaskowski / StackOverflow / GitHub
Books: Mastering Apache Spark / Spark Structured Streaming

StreamExecution (1 of 4)


  StreamExecution — execution environment of a single continuous query (aka streaming Dataset)

  StreamExecution can read from multiple streaming sources but writes to exactly one streaming sink

  Runs the query once per trigger and adds the results to the sink

  Created only when a DataStreamWriter is started (DataStreamWriter.start)
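
A minimal sketch of the points above, meant for spark-shell or a Scala script. The two "rate" sources, the memory sink, the query name "events" and the local master are illustrative assumptions, not part of the original slides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Local session for experimenting (assumption: Spark 2.2+ on the classpath)
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("stream-execution-sketch")
  .getOrCreate()

// Two streaming sources; the built-in "rate" source generates rows on a schedule
val fast = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
val slow = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

// One streaming Dataset built over both sources
val events = fast.union(slow)

// Nothing runs until start() is called on the DataStreamWriter;
// the memory sink is the single sink of this query
val query = events.writeStream
  .format("memory")
  .queryName("events")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

// Let a few triggers fire, then shut down
query.awaitTermination(5000)
query.stop()
spark.stop()
```

While the query is running, the rows added to the sink can be inspected with spark.table("events").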

©Jacek Laskowski 2017 / @jaceklaskowski / jacek@japila.pl

StreamExecution (2 of 4)


  StreamExecution starts a thread of execution that runs the streaming query continuously and concurrently
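
A small sketch showing that start() returns right away while the query keeps running on its own thread. The "stream execution thread" name prefix used in the filter is an implementation detail and an assumption here; it may differ between Spark versions, in which case the list is simply empty.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("stream-thread-sketch")
  .getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("memory")
  .queryName("rates")
  .start()

// start() has already returned: the caller's thread continues here
val activeAfterStart = query.isActive
println(s"Caller thread: ${Thread.currentThread().getName}")

// List JVM threads whose name suggests they run a streaming query
// (assumed name prefix, see the note above)
val streamThreads = Thread.getAllStackTraces().keySet().asScala
  .map(_.getName)
  .filter(_.startsWith("stream execution thread"))
  .toSeq
streamThreads.foreach(name => println(s"Query thread: $name"))

query.stop()
spark.stop()
```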


StreamExecution (3 of 4)



StreamExecution (4 of 4)


  StreamExecution collects the durations of the execution phases of every streaming batch

  Read them from durationMs of StreamingQuery.lastProgress or StreamingQuery.recentProgress
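
A hedged sketch of reading the collected durations through the public progress API. The rate source, memory sink and query name are assumptions for illustration; the phase names printed depend on the Spark version, so none are hard-coded.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("progress-sketch")
  .getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("memory")
  .queryName("rates")
  .start()

// Give the query time to complete a few triggers
query.awaitTermination(5000)

// lastProgress is null until the first progress update is available
Option(query.lastProgress).foreach { progress =>
  println(s"Batch ${progress.batchId}")
  progress.durationMs.asScala.foreach { case (phase, ms) =>
    println(s"  $phase: $ms ms")
  }
}

// recentProgress keeps a bounded history of progress updates
val recent = query.recentProgress
recent.foreach(p => println(s"Batch ${p.batchId}: ${p.durationMs}"))

query.stop()
spark.stop()
```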


IncrementalExecution


  IncrementalExecution — QueryExecution of a streaming Dataset

  Created in the queryPlanning phase of every trigger to execute the logical query plan incrementally
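
One way to look at the outcome of this planning without touching internal classes is StreamingQuery.explain, which prints the plans of the most recently executed batch. The sketch below assumes a rate source and a memory sink; that the printed plan comes from the last IncrementalExecution is my reading of the internals, not something the public API documents.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("incremental-execution-sketch")
  .getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .groupBy("value")
  .count()
  .writeStream
  .format("memory")
  .queryName("counts")
  .outputMode("complete")
  .start()

// Wait until at least a few batches have been planned and executed
query.awaitTermination(5000)

// Prints logical and physical plans of the last executed batch
query.explain(extended = true)

query.stop()
spark.stop()
```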
