Spark
Structured Streaming

Apache Spark 2.4.4


@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

Spark Structured Streaming (1 of 2)

  1. Structured Streaming is a computation model that attempts to unify streaming, interactive, and batch query execution engines
  2. Structured Streaming is a stream processing engine with a high-level declarative streaming API built on top of Spark SQL
  3. Continuous incremental execution of a structured query
  4. Switch to The Internals of Spark Structured Streaming

Spark Structured Streaming (2 of 2)

  1. Spark Structured Streaming is part of Spark SQL
  2. When developing streaming applications, use the following dependency

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
              

DataStreamReader

  1. DataStreamReader is the interface for loading data from a streaming data source (see the sketch after this list)
    
        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.streaming.DataStreamReader
        val streamReader: DataStreamReader = spark.readStream
        // source + options
        val dataset: DataFrame = streamReader.load
                  
  2. Streaming DataFrame represents an unbounded table
  3. Streaming query is described using Dataset API
  4. Switch to The Internals of Spark Structured Streaming
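
A minimal sketch of defining a source with options (the socket source and localhost:9999 are assumptions of this sketch, not requirements of the API):

    // Read lines from a TCP socket, e.g. one started with `nc -lk 9999`
    import org.apache.spark.sql.DataFrame
    val lines: DataFrame = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load
    lines.printSchema // a single value: string column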

DataStreamWriter

  1. DataStreamWriter is the interface for writing the result of a streaming query to a data sink
    
       val dataset: DataFrame = ...
       import org.apache.spark.sql.Row
       import org.apache.spark.sql.streaming.DataStreamWriter
       val streamWriter: DataStreamWriter[Row] = dataset.writeStream
                    
  2. Switch to The Internals of Spark Structured Streaming

DataStreamWriter and Query Name

  1. queryName specifies the name of a streaming query
  2. 
    queryName(queryName: String): DataStreamWriter[T]
                  
  3. The name must be unique among all the currently active queries in the associated SparkSession
  4. 
    val streamWriter: DataStreamWriter[Row] = ...
    val namedStreamWriter: DataStreamWriter[Row] = streamWriter.queryName("name")
                  

DataStreamWriter and Output Mode

  1. Output mode specifies when and which rows of a streaming query are written to the sink (see the sketch after this list)
  2. 
            outputMode(outputMode: String): DataStreamWriter[T]
            outputMode(outputMode: OutputMode): DataStreamWriter[T]
                  
  3. append writes only the new rows in a streaming query
  4. complete writes all the rows in a streaming aggregation query every time there are updates
  5. update writes only the rows that were updated in a streaming query every time there are some updates
    • Equivalent to append mode if the query doesn't use aggregations
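
A sketch of output modes in action, assuming a rate source and a modulo-2 grouping purely for illustration:

    import spark.implicits._
    // Streaming aggregation, so complete (or update) mode applies
    val counts = spark.readStream
      .format("rate")
      .load
      .groupBy($"value" % 2 as "group")
      .count
    counts.writeStream
      .format("console")
      .outputMode("complete") // every trigger writes all aggregate rows computed so far
      .start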

Setting Trigger

  1. trigger sets how often a streaming query should be executed (triggered) to produce a result
  2. 
    trigger(trigger: Trigger): DataStreamWriter[T]
                  
  3. Use Trigger.ProcessingTime
  4. 
    trigger(Trigger.ProcessingTime("10 seconds"))
    
    import scala.concurrent.duration._
    trigger(Trigger.ProcessingTime(10.seconds))
    
    import java.util.concurrent.TimeUnit
    trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
                  
  5. Defaults to as fast as possible, i.e. Trigger.ProcessingTime(0)

foreach and ForeachWriter (1 of 2)

  1. DataStreamWriter.foreach allows for defining a custom data sink and will continually send results to the given ForeachWriter as new data arrives
  2. 
    foreach(writer: ForeachWriter[T]): DataStreamWriter[T]
                  
  3. ForeachWriter can be used to send the generated data to an external system
  4. 
    abstract class ForeachWriter[T] {
      def open(partitionId: Long, version: Long): Boolean
      def process(value: T): Unit
      def close(errorOrNull: Throwable): Unit
    }
                  

foreach and ForeachWriter (2 of 2)


val streamWriter: DataStreamWriter[Long] = ...
import org.apache.spark.sql.ForeachWriter
val streamWriterWithForeachSink: DataStreamWriter[Long] =
  streamWriter.foreach(new ForeachWriter[Long] {
    override def open(partitionId: Long, version: Long) = true

    override def process(value: Long): Unit = {
      println(s">>> $value")
    }

    override def close(errorOrNull: Throwable): Unit = {}
  })
              

foreachBatch (1 of 2)

    
    foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
                  
  1. DataStreamWriter.foreachBatch allows for defining a custom function that can work with the micro-batch output as a DataFrame for the following:
    • Pass the output rows of each batch to a library that is designed for batch jobs only
    • Reuse batch data sources for an output whose streaming version does not exist
    • Multi-writes, where the output rows are written to multiple outputs (by writing the output of every batch more than once)
  2. New in 2.4.0

foreachBatch (2 of 2)


        import org.apache.spark.sql.Dataset
        spark.readStream
          .format("rate")
          .load
          .writeStream
          .foreachBatch { (output: Dataset[_], batchId: Long) =>
            println(s"Batch ID: $batchId")
            output.show
          }
          .start
            

Starting Streaming Query

  1. start starts execution of a streaming query that will continually output results to a sink as new data arrives
  2. 
    start(): StreamingQuery
                  
  3. Returns a StreamingQuery that can be used to interact with the streaming query (end-to-end sketch after this list)
  4. 
    import org.apache.spark.sql.streaming.StreamingQuery
    val query: StreamingQuery = counter.writeStream.start
                  
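
Putting the pieces together, a minimal end-to-end sketch (the rate source, console sink, query name and 10-second trigger are all assumptions of this sketch):

    import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
    val query: StreamingQuery = spark.readStream
      .format("rate")
      .load
      .writeStream
      .format("console")
      .queryName("rate-to-console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start
    // Block the current thread until the query is stopped or fails
    query.awaitTermination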

Streaming Source

  1. Streaming Source acts as a continuous stream of data for a streaming query
  2. Defined using format method on DataStreamReader
    • Uses shortName of a source
  3. FileStreamSource and TextSocketSource
  4. KafkaSource for Apache Kafka 0.10+ (see the sketch after this list)
  5. RateStreamSource and MemoryStream for unit tests, PoCs, tutorials and debugging
  6. Switch to The Internals of Spark Structured Streaming
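
A sketch of a Kafka source (the broker address and topic name are assumptions of this sketch; the spark-sql-kafka-0-10 module has to be on the classpath):

    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load
    // Fixed Kafka schema: key, value, topic, partition, offset, timestamp, timestampType
    records.printSchema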

Streaming Sink

  1. Streaming Sink represents an external storage system to write streaming datasets to
  2. Defined using format method on DataStreamWriter
    • Uses shortName of a sink
  3. ConsoleSink, FileStreamSink and ForeachSink (file sink sketch after this list)
  4. KafkaSink for Apache Kafka 0.10+
  5. MemorySink for unit tests, tutorials and debugging
  6. Switch to The Internals of Spark Structured Streaming
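
A sketch of a file sink in the parquet format (the output and checkpoint directories are assumptions of this sketch; a file sink requires a checkpoint location):

    spark.readStream
      .format("rate")
      .load
      .writeStream
      .format("parquet")
      .option("path", "/tmp/rate-output")                   // output directory
      .option("checkpointLocation", "/tmp/rate-checkpoint") // required for file sinks
      .start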

StreamingQuery

  1. StreamingQuery represents a streaming query
    
        import org.apache.spark.sql.streaming.StreamingQuery
        val query: StreamingQuery = counter.writeStream.start
                  
  2. id is the unique id of a query
  3. runId is the unique id of the run of a query
  4. Use awaitTermination to wait for the termination of a query, either by query.stop or by an exception (see the sketch after this list)
  5. Use stop to stop execution of a query
  6. Switch to The Internals of Spark Structured Streaming
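
A sketch of interacting with a running query (a rate-to-console query is used here purely as an assumption):

    import org.apache.spark.sql.streaming.StreamingQuery
    val query: StreamingQuery = spark.readStream
      .format("rate")
      .load
      .writeStream
      .format("console")
      .start
    query.id     // unique id of the query (stable across restarts from a checkpoint)
    query.runId  // unique id of this particular run of the query
    query.stop   // stop the query
    query.awaitTermination // returns as soon as the query has terminated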

StreamingQueryManager — Streaming Query Management

  1. StreamingQueryManager is the management API for streaming queries in a SparkSession (see the sketch after this list)
    
            val qm: StreamingQueryManager = spark.streams
                  
  2. Switch to The Internals of Spark Structured Streaming
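
A sketch of the management API (assumes at least one active query, e.g. the rate-to-console query from the earlier slides):

    import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager}
    val qm: StreamingQueryManager = spark.streams
    // All queries currently active in this SparkSession
    qm.active.foreach(q => println(s"${q.name}: ${q.id}"))
    // Look up a query by its id
    val someQuery: StreamingQuery = qm.active.head
    val sameQuery: StreamingQuery = qm.get(someQuery.id)
    // Block until any of the active queries terminates
    qm.awaitAnyTermination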