Spark
Structured Streaming

Apache Spark 2.4.4


@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

Spark Structured Streaming (1 of 2)

  1. Structured Streaming is a computation model that attempts to unify streaming, interactive, and batch query execution engines
  2. Structured Streaming is a stream processing engine with a high-level declarative streaming API built on top of Spark SQL
  3. Continuous incremental execution of a structured query
  4. Switch to The Internals of Spark Structured Streaming

Spark Structured Streaming (2 of 2)

  1. Spark Structured Streaming is part of Spark SQL
  2. When developing streaming applications, use the following dependency

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
              

DataStreamReader

  1. DataStreamReader is the interface for loading data from a streaming data source (see the sketch after this list)
    
        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.streaming.DataStreamReader
        val streamReader: DataStreamReader = spark.readStream
        // source + options
        val dataset: DataFrame = streamReader.load
                  
  2. Streaming DataFrame represents an unbounded table
  3. Streaming query is described using Dataset API
  4. Switch to The Internals of Spark Structured Streaming
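
A minimal sketch of defining a source with options (the socket source and localhost:9999 are assumptions of this sketch, not requirements of the API):

    // Read lines from a TCP socket, e.g. one started with `nc -lk 9999`
    import org.apache.spark.sql.DataFrame
    val lines: DataFrame = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load
    lines.printSchema // a single value: string column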

DataStreamWriter

  1. DataStreamWriter is the interface for writing the result of a streaming query to a data sink
    
       val dataset: DataFrame = ...
       import org.apache.spark.sql.Row
       import org.apache.spark.sql.streaming.DataStreamWriter
       val streamWriter: DataStreamWriter[Row] = dataset.writeStream
                    
  2. Switch to The Internals of Spark Structured Streaming

DataStreamWriter and Query Name

  1. queryName specifies the name of a streaming query
  2. 
    queryName(queryName: String): DataStreamWriter[T]
                  
  3. The name must be unique among all the currently active queries in the associated SparkSession
  4. 
    val streamWriter: DataStreamWriter[Row] = ...
    val namedStreamWriter: DataStreamWriter[Row] = streamWriter.queryName("name")
                  

DataStreamWriter and Output Mode

  1. Output mode specifies when and which rows of a streaming query are written to the sink (see the sketch after this list)
  2. 
            outputMode(outputMode: String): DataStreamWriter[T]
            outputMode(outputMode: OutputMode): DataStreamWriter[T]
                  
  3. append writes only the new rows in a streaming query
  4. complete writes all the rows in a streaming aggregation query every time there are updates
  5. update writes only the rows that were updated in a streaming query every time there are some updates
    • Equivalent to append mode if the query doesn't use aggregations
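
A sketch of output modes in action, assuming a rate source and a modulo-2 grouping purely for illustration:

    import spark.implicits._
    // Streaming aggregation, so complete (or update) mode applies
    val counts = spark.readStream
      .format("rate")
      .load
      .groupBy($"value" % 2 as "group")
      .count
    counts.writeStream
      .format("console")
      .outputMode("complete") // every trigger writes all aggregate rows computed so far
      .start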

Setting Trigger

  1. trigger sets how often a streaming query should be executed (triggered) to produce a result
  2. 
    trigger(trigger: Trigger): DataStreamWriter[T]
                  
  3. Use Trigger.ProcessingTime
  4. 
    trigger(Trigger.ProcessingTime("10 seconds"))
    
    import scala.concurrent.duration._
    trigger(Trigger.ProcessingTime(10.seconds))
    
    import java.util.concurrent.TimeUnit
    trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
                  
  5. Defaults to as fast as possible, i.e. Trigger.ProcessingTime(0)

foreach and ForeachWriter (1 of 2)

  1. DataStreamWriter.foreach allows for defining a custom data sink and will continually send results to the given ForeachWriter as new data arrives
  2. 
    foreach(writer: ForeachWriter[T]): DataStreamWriter[T]
                  
  3. ForeachWriter can be used to send the generated data to an external system
  4. 
    abstract class ForeachWriter[T] {
      def open(partitionId: Long, version: Long): Boolean
      def process(value: T): Unit
      def close(errorOrNull: Throwable): Unit
    }
                  

foreach and ForeachWriter (2 of 2)


val streamWriter: DataStreamWriter[Long] = ...
import org.apache.spark.sql.ForeachWriter
val streamWriterWithForeachSink: DataStreamWriter[Long] =
  streamWriter.foreach(new ForeachWriter[Long] {
    override def open(partitionId: Long, version: Long) = true

    override def process(value: Long): Unit = {
      println(s">>> $value")
    }

    override def close(errorOrNull: Throwable): Unit = {}
  })
              

foreachBatch (1 of 2)

    
    foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
                  
  1. DataStreamWriter.foreachBatch allows for defining a custom function that can work with the micro-batch output as a DataFrame for the following:
    • Pass the output rows of each batch to a library that is designed for batch jobs only
    • Reuse batch data sources for an output whose streaming version does not exist
    • Multi-writes, where the output rows are written to multiple outputs (by writing the output of every batch more than once)
  2. New in 2.4.0

foreachBatch (2 of 2)


        import org.apache.spark.sql.Dataset
        spark.readStream
          .format("rate")
          .load
          .writeStream
          .foreachBatch { (output: Dataset[_], batchId: Long) =>
            println(s"Batch ID: $batchId")
            output.show
          }
          .start
            

Starting Streaming Query

  1. start starts execution of a streaming query that will continually output results to a sink as new data arrives
  2. 
    start(): StreamingQuery
                  
  3. Returns a StreamingQuery that can be used to interact with the streaming query (end-to-end sketch after this list)
  4. 
    import org.apache.spark.sql.streaming.StreamingQuery
    val query: StreamingQuery = counter.writeStream.start
                  
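
Putting the pieces together, a minimal end-to-end sketch (the rate source, console sink, query name and 10-second trigger are all assumptions of this sketch):

    import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
    val query: StreamingQuery = spark.readStream
      .format("rate")
      .load
      .writeStream
      .format("console")
      .queryName("rate-to-console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start
    // Block the current thread until the query is stopped or fails
    query.awaitTermination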

Streaming Source

  1. Streaming Source acts as a continuous stream of data for a streaming query
  2. Defined using format method on DataStreamReader
    • Uses shortName of a source
  3. FileStreamSource and TextSocketSource
  4. KafkaSource for Apache Kafka 0.10+ (see the sketch after this list)
  5. RateStreamSource and MemoryStream for unit tests, PoCs, tutorials and debugging
  6. Switch to The Internals of Spark Structured Streaming
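
A sketch of a Kafka source (the broker address and topic name are assumptions of this sketch; the spark-sql-kafka-0-10 module has to be on the classpath):

    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load
    // Fixed Kafka schema: key, value, topic, partition, offset, timestamp, timestampType
    records.printSchema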

Streaming Sink

  1. Streaming Sink represents an external storage system to write streaming datasets to
  2. Defined using format method on DataStreamWriter
    • Uses shortName of a sink
  3. ConsoleSink, FileStreamSink and ForeachSink (file sink sketch after this list)
  4. KafkaSink for Apache Kafka 0.10+
  5. MemorySink for unit tests, tutorials and debugging
  6. Switch to The Internals of Spark Structured Streaming
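
A sketch of a file sink in the parquet format (the output and checkpoint directories are assumptions of this sketch; a file sink requires a checkpoint location):

    spark.readStream
      .format("rate")
      .load
      .writeStream
      .format("parquet")
      .option("path", "/tmp/rate-output")                   // output directory
      .option("checkpointLocation", "/tmp/rate-checkpoint") // required for file sinks
      .start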

StreamingQuery

  1. StreamingQuery represents a streaming query
    
        import org.apache.spark.sql.streaming.StreamingQuery
        val query: StreamingQuery = counter.writeStream.start
                  
  2. id is the unique id of a query
  3. runId is the unique id of the run of a query
  4. Use awaitTermination to wait for the termination of a query, either by query.stop or by an exception (see the sketch after this list)
  5. Use stop to stop execution of a query
  6. Switch to The Internals of Spark Structured Streaming
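
A sketch of interacting with a running query (a rate-to-console query is used here purely as an assumption):

    import org.apache.spark.sql.streaming.StreamingQuery
    val query: StreamingQuery = spark.readStream
      .format("rate")
      .load
      .writeStream
      .format("console")
      .start
    query.id     // unique id of the query (stable across restarts from a checkpoint)
    query.runId  // unique id of this particular run of the query
    query.stop   // stop the query
    query.awaitTermination // returns as soon as the query has terminated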

StreamingQueryManager — Streaming Query Management

  1. StreamingQueryManager is the management API for streaming queries in a SparkSession (see the sketch after this list)
    
            val qm: StreamingQueryManager = spark.streams
                  
  2. Switch to The Internals of Spark Structured Streaming
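
A sketch of the management API (assumes at least one active query, e.g. the rate-to-console query from the earlier slides):

    import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager}
    val qm: StreamingQueryManager = spark.streams
    // All queries currently active in this SparkSession
    qm.active.foreach(q => println(s"${q.name}: ${q.id}"))
    // Look up a query by its id
    val someQuery: StreamingQuery = qm.active.head
    val sameQuery: StreamingQuery = qm.get(someQuery.id)
    // Block until any of the active queries terminates
    qm.awaitAnyTermination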