
Day 3 / Apr 21 (Thu)

Continuing our journey into Apache Kafka. Today we're using Spark SQL and Spark Structured Streaming to process records from Kafka topics.

Morning Exercise

  1. Exercise: Word Count Per Record
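
A minimal sketch of one reading of the exercise (count the words in every record's value), assuming the broker on :9092 used elsewhere on this page, the topic name in args(0), and a made-up object name WordCountPerRecord:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WordCountPerRecord extends App {
  val spark = SparkSession.builder
    .appName("Word Count Per Record")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Read the input topic (name given as the first argument; broker on :9092 assumed)
  val records = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ":9092")
    .option("subscribe", args(0))
    .load()

  // Per-record word count: split every value on whitespace and count the tokens
  val counts = records
    .select($"value".cast("string") as "line")
    .withColumn("word_count", size(split($"line", "\\s+")))

  counts.writeStream
    .format("console")
    .start()
    .awaitTermination()
}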

Theory

Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)

kafka-console-consumer with options to print record keys and deserialize integer values:

./bin/kafka-console-consumer.sh \
    --property print.key=true \
    --property key.separator=" -> " \
    --bootstrap-server :9092 \
    --topic output \
    --from-beginning \
    --value-deserializer org.apache.kafka.common.serialization.IntegerDeserializer
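
The print.key and key.separator properties make the consumer display keys next to values. The --value-deserializer option matters here because the console consumer renders values as strings by default, so integer-encoded values (for example, word counts) would print as unreadable bytes without org.apache.kafka.common.serialization.IntegerDeserializer.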

Practice

  1. Write a Spark application that loads Kafka records (from a topic given by args(0)) and displays them to the console (a sketch covering both parts follows this list)

    1. Create a brand new project in IntelliJ IDEA
    2. Push the project to GitHub

    Part 1. Spark SQL and show records (using DataFrame.show)

    Part 2. Spark Structured Streaming and show records (using format("console"))

  2. Modify the above Spark application to accept two command-line arguments, topicIn and topicOut, naming the topics to load records from and save them to, respectively. The application should change record values to their UPPERCASE variants (see the second sketch after this list).

    Push the project to GitHub once finished or at the end of the day (whichever happens earlier). Report it on Slack.
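
A minimal sketch of exercise 1, assuming a broker on :9092 and the topic name in args(0); the object name ShowKafkaRecords is made up. Part 1 reads the topic as a batch DataFrame and calls show, while Part 2 runs the same read as a streaming query with the console sink:

import org.apache.spark.sql.SparkSession

object ShowKafkaRecords extends App {
  val topic = args(0)
  val spark = SparkSession.builder
    .appName("Show Kafka Records")
    .master("local[*]")
    .getOrCreate()

  // Part 1: Spark SQL (batch) -- read everything currently in the topic and show it
  spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", ":9092")
    .option("subscribe", topic)
    .load()
    .show(truncate = false)

  // Part 2: Spark Structured Streaming -- print new records as they arrive
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ":9092")
    .option("subscribe", topic)
    .load()
    .writeStream
    .format("console")
    .start()
    .awaitTermination()
}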
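
And a sketch of exercise 2 under the same assumptions. The Kafka sink needs a string (or binary) value column and, for streaming queries, a checkpoint location; the path below is only an example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UppercaseKafkaRecords extends App {
  val Array(topicIn, topicOut) = args
  val spark = SparkSession.builder
    .appName("Uppercase Kafka Records")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ":9092")
    .option("subscribe", topicIn)
    .load()
    // Kafka values arrive as binary; cast to string before transforming
    .select(upper($"value".cast("string")) as "value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ":9092")
    .option("topic", topicOut)
    .option("checkpointLocation", "/tmp/uppercase-checkpoint") // required by the Kafka sink; example path
    .start()
    .awaitTermination()
}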

Learning Resources

  1. Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
  2. How to use the console consumer to read non-string primitive keys and values