Spark Streaming

Apache Spark 2

@jaceklaskowski / StackOverflow / GitHub / Mastering Apache Spark 2

Heads-up

Spark Streaming saw almost no notable changes between Spark 1.6 and Spark 2.0. Jacek believes that it will be marked as deprecated once the modern Structured Streaming has been announced ready for production use (which happened at Spark Summit 2017 in San Francisco).

Spark Streaming

  1. Spark Streaming is the incremental stream processing framework for Apache Spark.
  2. Scalable, high-throughput, fault-tolerant, batch-oriented
  3. Discretized Streams (DStreams) of continuous data
  4. Integration with other Spark modules
    • Spark SQL and MLlib
  5. Switch to Mastering Apache Spark 2

From the official documentation of Apache Spark

StreamingContext

  1. StreamingContext is the entry point for all Spark Streaming functionality.
    
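            import org.apache.spark.streaming.{Seconds, StreamingContext}
            val ssc = new StreamingContext(sc, Seconds(5))  // sc is an existing SparkContext; 5-second batches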
                  
  2. Used to create built-in DStreams (covered next)
  3. Switch to Mastering Apache Spark 2
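
For a standalone application (outside spark-shell, where no `sc` exists yet), the entry point could be set up roughly as follows. This is a minimal sketch, not the only way to do it: the `local[2]` master, the application name and the `StreamingContextSketch` object are assumptions made for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextSketch {
  // Creates a StreamingContext, reports its state before start(), then shuts it down.
  def run(): String = {
    // Assumption: local mode with two threads, so a receiver and the processing can run side by side.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingContextSketch")

    // The batch interval (here 5 seconds) is fixed when the context is created.
    val ssc = new StreamingContext(conf, Seconds(5))

    // The underlying SparkContext is created for us and is available as ssc.sparkContext.
    println(s"Application: ${ssc.sparkContext.appName}")

    // No input streams or output operators are registered, so the context is never started here.
    val state = ssc.getState().toString

    ssc.stop(stopSparkContext = true)
    state
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that `start()` requires at least one output operator to be registered, which is why this sketch stops the context without starting it.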

Discretized Streams (DStreams)

  1. Discretized Stream (DStream) is the fundamental concept of Spark Streaming.
  2. A sequence of RDDs, one per batch interval, each holding the input records received in that interval.
  3. Created using StreamingContext or custom factories, e.g. KafkaUtils.
    
       val dstream = KafkaUtils.createDirectStream[String, String](...)
                  
  4. Switch to Mastering Apache Spark 2
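
To see the "one RDD per batch" idea without any external source, a DStream can be fed from a queue of pre-built RDDs with `StreamingContext.queueStream`. The sketch below is only an illustration: the object name, the 1-second batch interval, the 5-second timeout and the sample data are all assumptions.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object DStreamBatchesSketch {
  // Returns the record count of every non-empty batch, in batch order.
  def run(): Seq[Long] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamBatchesSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Three pre-built RDDs stand in for three batches of input records.
    val queue = mutable.Queue[RDD[Int]](
      sc.parallelize(1 to 3),
      sc.parallelize(4 to 8),
      sc.parallelize(9 to 10))

    // oneAtATime = true: each batch interval consumes exactly one RDD from the queue.
    val dstream = ssc.queueStream(queue, oneAtATime = true)

    val counts = mutable.ArrayBuffer.empty[Long]
    // foreachRDD hands over the RDD that backs each batch; the function runs on the driver.
    dstream.foreachRDD { (rdd: RDD[Int], time: Time) =>
      val n = rdd.count()
      println(s"Batch at $time has $n records")
      if (n > 0) counts.synchronized { counts += n }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // assumed long enough for the three batches
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    counts.synchronized { counts.toList }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Once the queue is drained, later batches are backed by empty RDDs, which is why only non-empty batches are recorded.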

DStream Operators

  1. Stream operators transform the records of input DStreams; output operators ultimately trigger the computation.
  2. Switch to Mastering Apache Spark 2
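
The split between transformations and output operators can be sketched with a small word count. The transformations (`flatMap`, `map`, `reduceByKey`) only describe the computation; nothing runs until an output operator such as `foreachRDD` (or `print`) is registered and the context is started. The object name, the queue-based input and the timeout are assumptions chosen to keep the sketch self-contained.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamOperatorsSketch {
  // Counts the words in the given lines, which are delivered as a single batch.
  def run(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamOperatorsSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Input DStream backed by one pre-built RDD (stands in for a real source).
    val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(lines))
    val input = ssc.queueStream(queue)

    // Transformations: each returns a new DStream and is evaluated lazily.
    val wordCounts = input
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Output operator: registers the work that is executed for every batch.
    val result = mutable.Map.empty[String, Int]
    wordCounts.foreachRDD { (rdd: RDD[(String, Int)]) =>
      val batch = rdd.collect()
      result.synchronized {
        batch.foreach { case (word, n) => result(word) = result.getOrElse(word, 0) + n }
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // assumed long enough for the single batch
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    result.synchronized { result.toMap }
  }

  def main(args: Array[String]): Unit = println(run(Seq("a b a", "b c")))
}
```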

Web UI

  1. Spark Streaming applications have their own web UI with a Streaming Statistics page.
  2. Switch to Mastering Apache Spark 2

Spark Streaming Demo

  1. Reading text records from socket
  2. Uses StreamingContext.socketTextStream and netcat
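
A demo along these lines might look like the sketch below. It is not the exact code shown in the talk: the host `localhost`, port `9999`, the 5-second batch interval and the object name are assumptions. Start netcat first (for example `nc -lk 9999`) and type lines into it while the application runs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketTextStreamDemo {
  def main(args: Array[String]): Unit = {
    // Assumed defaults; override with: SocketTextStreamDemo <host> <port>
    val host = if (args.length > 0) args(0) else "localhost"
    val port = if (args.length > 1) args(1).toInt else 9999

    // At least two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketTextStreamDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each line of text received from the socket becomes one record.
    val lines = ssc.socketTextStream(host, port)

    // Output operator: prints the first few records of every batch to stdout.
    lines.print()

    ssc.start()
    ssc.awaitTermination() // runs until the application is stopped
  }
}
```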

More Streaming Examples

  1. "Official" examples in examples/streaming repo
  2. Use run-example streaming.*
    • NetworkWordCount
    • SqlNetworkWordCount
    • StatefulNetworkWordCount

Questions?