Deep Dive Into / Internals of
Kafka Streams
Kafka Streams 2.2.0
@jaceklaskowski
/
StackOverflow
/
GitHub
The "Internals" Books:
Kafka Streams
/
Apache Kafka
## My Very First #KafkaSummit!
* **Jacek Laskowski** is a freelance IT consultant
* Core competencies in Spark, Kafka, **Kafka Streams**, Scala
* Development | Consulting | Training
* Among contributors to Apache Spark
* Contact me at **jacek@japila.pl**
* Follow [@JacekLaskowski](https://twitter.com/jaceklaskowski) on Twitter for more #ApacheSpark, #ApacheKafka, #KafkaStreams
Jacek is best known for the online "Internals" books:
The Internals of Apache Spark
The Internals of Spark SQL
The Internals of Spark Structured Streaming
The Internals of Kafka Streams
The Internals of Apache Kafka
Jacek is "active" on
StackOverflow
🥳
## Agenda

1. [Kafka Streams](#/intro)
1. [Main Development Entities](#/main-development-entities)
1. [Topology](#/topology)
1. [KafkaStreams](#/kafkastreams)
1. [Main Execution Entities](#/main-execution-entities)
1. [StreamThread](#/StreamThread)
1. [TaskManager](#/TaskManager)
1. [StreamTask and StandbyTask](#/kafka-streams-tasks)
1. [StreamsPartitionAssignor](#/StreamsPartitionAssignor)
1. [RebalanceListener](#/RebalanceListener)
## Kafka Streams
(1 of 2)
1. **Kafka Streams** is a client library for **stream processing applications** that process records in Kafka topics
    * Low-level **Processor API**
    * High-level **Streams DSL** with stream processing primitives, e.g. **KStream** and **KTable** (see the DSL sketch below)
1. **Topology** to describe the processing flow
    * » One record at a time «
1. Wrapper around the Kafka Producer and Consumer APIs
1. Supports **fault-tolerant local state** for stateful operations (e.g. windowed aggregations and joins)
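To make the high-level API concrete, here is a minimal Streams DSL sketch in Scala; the topic names (`words-in`, `words-out`) are made up for illustration:

```scala
import org.apache.kafka.streams.scala._
import org.apache.kafka.streams.scala.kstream.KStream
import ImplicitConversions._
import Serdes._

val builder = new StreamsBuilder

// KStream is one of the stream processing primitives of the Streams DSL
val words: KStream[String, String] = builder.stream[String, String]("words-in")

// One record at a time: every input record is transformed and written out
words
  .mapValues(_.toUpperCase)
  .to("words-out")

val topology = builder.build
```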
## Kafka Streams
(2 of 2)
1. "Groundbreaking" facts (which changed my life): 1. When started, a topology processes one record at a time only 1. A Kafka Streams application can be run in multiple instances (e.g. as Docker containers) 1. Increasing stream processing power is to increase number of threads and instances 1. Processor itself decides whether to forward a record downstream or not
## Main Development Entities

1. As a Kafka Streams developer you work with the two main developer-facing entities:
    1. Topology
    1. KafkaStreams
## Topology

1. Represents the **stream processing logic** of a Kafka Streams application
1. Directed Acyclic Graph of **Processors** (Stream Processing Nodes)
1. Logical representation
1. Created directly or indirectly (using Streams DSL)
1. Topology API
    * Adding sources, processors, sinks, state stores (a Topology API sketch follows the examples below)
1. Can be described (and printed out to stdout)
## Example: Creating Topology

```scala
// Creating directly
import org.apache.kafka.streams.Topology
val topology = new Topology
```

```scala
// Created using Streams DSL (StreamsBuilder API)
// Scala API for Kafka Streams
import org.apache.kafka.streams.scala._
import ImplicitConversions._
import Serdes._

val builder = new StreamsBuilder
val topology = builder.build
```
## Example: Describing Topology

```
scala> println(topology.describe)
Topologies:
   Sub-topology: 0 for global store (will not generate tasks)
    Source: demo-source-processor (topics: [demo-topic])
      --> demo-processor-supplier
    Processor: demo-processor-supplier (stores: [in-memory-key-value-store])
      --> none
      <-- demo-source-processor
```
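For comparison, here is a sketch of wiring a topology with the Topology (Processor) API directly; the processor names, topics and the trivial forwarding processor are made up for illustration, and this is not the exact topology described above:

```scala
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.processor.{Processor, ProcessorContext, ProcessorSupplier}
import org.apache.kafka.streams.state.Stores

val topology = new Topology

// Source node reading from a (made-up) input topic
topology.addSource("demo-source-processor", "demo-topic")

// Processor node that simply forwards every record downstream
topology.addProcessor(
  "demo-processor",
  new ProcessorSupplier[String, String] {
    override def get(): Processor[String, String] = new Processor[String, String] {
      private var context: ProcessorContext = _
      override def init(context: ProcessorContext): Unit = this.context = context
      override def process(key: String, value: String): Unit = context.forward(key, value)
      override def close(): Unit = ()
    }
  },
  "demo-source-processor")

// Local key-value state store attached to the processor
topology.addStateStore(
  Stores.keyValueStoreBuilder(
    Stores.inMemoryKeyValueStore("in-memory-key-value-store"),
    Serdes.String, Serdes.String),
  "demo-processor")

// Sink node writing to a (made-up) output topic
topology.addSink("demo-sink-processor", "demo-output-topic", "demo-processor")
```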
## KafkaStreams

1. Manages execution of a topology of a Kafka Streams application
    * Start, close (shut down), state (see the lifecycle sketch after the example below)
1. Consumes messages from and produces processing results to Kafka topics
1. It is acceptable to create multiple KafkaStreams instances per Kafka Streams application
1. For better performance, consider running multiple KafkaStreams instances as separate instances of the Kafka Streams application
## Example: Creating KafkaStreams

```scala
import org.apache.kafka.streams.KafkaStreams

val topology: Topology = ...
val config: StreamsConfig = ...

val ks = new KafkaStreams(topology, config)
ks.start // <-- starts execution
```
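KafkaStreams also exposes its lifecycle state. A minimal sketch, assuming the `ks` instance from the example above and registering the listener before `ks.start`:

```scala
import org.apache.kafka.streams.KafkaStreams

// Register a state listener (before ks.start) to observe lifecycle transitions
ks.setStateListener(new KafkaStreams.StateListener {
  override def onChange(newState: KafkaStreams.State, oldState: KafkaStreams.State): Unit =
    println(s"KafkaStreams state changed: $oldState -> $newState")
})

// Shut the application down cleanly (stops all StreamThreads) on JVM exit
sys.addShutdownHook {
  ks.close()
}
```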
## Main Execution Entities

1. [StreamThread](#/StreamThread)
1. [TaskManager](#/TaskManager)
1. [StreamTask and StandbyTask](#/kafka-streams-tasks)
1. [StreamsPartitionAssignor](#/StreamsPartitionAssignor)
1. [RebalanceListener](#/RebalanceListener)
## StreamThread
(1 of 4)
1. Stream Processor Thread
    * Runs the **main record processing loop**
    * Thread of execution (**java.lang.Thread**)
1. StreamThread uses a Kafka consumer (to poll for records)
    * Subscribes to source topics
    * Think of Consumer Group
1. **num.stream.threads** configuration property (default: **1**) (see the config sketch below)
1. Uses Kafka Consumer "tools" for operation
    * Registers **StreamsPartitionAssignor**
    * Uses **RebalanceListener** to intercept changes to the partitions assigned
1. Uses TaskManager to manage processing tasks (on next slide)
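A minimal configuration sketch for the number of StreamThreads; the application id and broker address are made up:

```scala
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-application")   // also used as the consumer group id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3")              // default: 1 StreamThread

val config = new StreamsConfig(props)
```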
## StreamThread
(2 of 4)

## StreamThread
(3 of 4)

## StreamThread
(4 of 4)
## TaskManager
(1 of 2)
1. Task manager of a StreamThread
1. Manages active and standby tasks
    * Only active StreamTasks process records
1. Creates processor tasks for assigned partitions
    * **RebalanceListener.onPartitionsAssigned**
## TaskManager
(2 of 2)
## Stream Processor Tasks
(1 of 3)
1. Managed by **TaskManager** to run a topology of stream processors
1. **StreamTask** - the default processor task
    * As many as partitions assigned
    * Managed as a group as **AssignedStreamsTasks**
    * Only active StreamTasks process records
    * Use FIFO RecordQueues (per Kafka TopicPartition)
1. **StandbyTask** - a backup processor task
    * "Ghost" tasks for active StreamTasks
    * Default: **0** standby tasks (see the config sketch below)
    * Managed as a group as **AssignedStandbyTasks**
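The number of standby tasks per active task is controlled by **num.standby.replicas**; a minimal configuration sketch (the values are made up):

```scala
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-application")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// One standby (backup) replica per active StreamTask (default: 0)
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, "1")
```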
## Stream Processor Tasks
(2 of 3)

## Stream Processor Tasks
(3 of 3)
## StreamsPartitionAssignor
(1 of 2)
1. Custom **PartitionAssignor** from the Kafka Consumer API
    * Used for dynamic partition assignment and distributing partition ownership across the members of a consumer group
    * **partition.assignment.strategy** / **ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG** configuration property
1. Group management protocol
    * **group membership**
    * **state synchronization**
1. Assigns partitions dynamically across the instances of a Kafka Streams application
    * Requires the **application.id** / **StreamsConfig.APPLICATION_ID_CONFIG** configuration property
## StreamsPartitionAssignor
(2 of 2)
## RebalanceListener
(1 of 2)
1. Custom **ConsumerRebalanceListener** from the Kafka Consumer API (see the sketch below)
    * Callback interface for custom actions when the set of partitions assigned to the consumer changes
    * Partition re-assignment is triggered any time the members of the consumer group change
1. Intercepts changes to the partitions assigned to a single StreamThread
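For illustration, a minimal ConsumerRebalanceListener sketch using the plain Consumer API; this is not the Kafka Streams implementation, and the print statements stand in for real actions:

```scala
import java.util.{Collection => JCollection}
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition

// Callbacks invoked by the consumer when its partition assignment changes
val listener = new ConsumerRebalanceListener {
  override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit =
    println(s"Revoked: $partitions")   // e.g. commit offsets, suspend tasks
  override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
    println(s"Assigned: $partitions")  // e.g. create tasks for the new partitions
}

// Pass the listener when subscribing, e.g.:
// consumer.subscribe(java.util.Arrays.asList("demo-topic"), listener)
```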
## RebalanceListener
(2 of 2)
## Recap

1. [Kafka Streams](#/intro)
1. [Main Development Entities](#/main-development-entities)
1. [Topology](#/topology)
1. [KafkaStreams](#/kafkastreams)
1. [Main Execution Entities](#/main-execution-entities)
1. [StreamThread](#/StreamThread)
1. [TaskManager](#/TaskManager)
1. [StreamTask and StandbyTask](#/kafka-streams-tasks)
1. [StreamsPartitionAssignor](#/StreamsPartitionAssignor)
1. [RebalanceListener](#/RebalanceListener)
# Questions?

* Read [The Internals of Kafka Streams](http://bit.ly/kafka-streams-internals)
* Read [The Internals of Apache Kafka](http://bit.ly/apache-kafka-internals)
* Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on Twitter (DMs open)
* Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
* Contact me at **jacek@japila.pl**