Kafka Streams is a library for developing applications that process records from topics in Apache Kafka. It provides the high-level Streams DSL and the low-level Processor API for describing fault-tolerant, distributed streaming pipelines in Java or Scala. Kafka Streams also offers an elaborate API for stateless and stateful stream processing. That's a high-level view of Kafka Streams. Have you ever wondered how Kafka Streams does all this and what its relationship with Apache Kafka (the brokers) is? That's among the topics of this talk.
During this talk we will look under the covers of Kafka Streams and deep dive into its fault-tolerant distributed stream processing engine. You will learn the roles of StreamThreads, TaskManager, StreamTasks, StandbyTasks, StreamsPartitionAssignor, RebalanceListener and a few others. The aim of this talk is to equip you with knowledge about the internals of Kafka Streams that should help you fine-tune your stream processing pipelines for better performance.
Deep Dive Into Kafka Streams (and the Distributed Stream Processing Engine) (Jacek Laskowski, Consultant) Kafka Summit London 2019
1. DEEP DIVE INTO / INTERNALS OF KAFKA STREAMS
KAFKA STREAMS 2.2.0
THE "INTERNALS" BOOKS: KAFKA STREAMS · APACHE KAFKA
@JACEKLASKOWSKI · STACKOVERFLOW · GITHUB
2. MY VERY FIRST #KAFKASUMMIT!
3. Jacek Laskowski is a freelance IT consultant
Core competencies in Spark, Kafka, Kafka Streams, Scala
Development | Consulting | Training
Among contributors to Apache Spark
Contact me at jacek@japila.pl
Follow @JacekLaskowski on twitter for more #ApacheSpark, #ApacheKafka, #KafkaStreams
4. Jacek is best known by the online "Internals" books:
1. The Internals of Apache Spark
2. The Internals of Spark SQL
3. The Internals of Spark Structured Streaming
4. The Internals of Kafka Streams
5. The Internals of Apache Kafka
7. KAFKA STREAMS (1 OF 2)
1. Kafka Streams is a client library for stream processing applications that process records in Kafka topics
2. Stream processing primitives, e.g. KStream and KTable
   High-level Streams DSL
   Low-level Processor API
3. Topology to describe the processing flow
   » One record at a time «
4. Wrapper around the Kafka Producer and Consumer APIs
5. Supports fault-tolerant local state for stateful operations (e.g. windowed aggregations and joins)
8. KAFKA STREAMS (2 OF 2)
1. "Groundbreaking" facts (which changed my life):
   1. When started, a topology processes one record at a time
   2. A Kafka Streams application can be run in multiple instances (e.g. as Docker containers)
   3. Increasing stream processing power is about increasing the number of threads or even instances
9. MAIN DEVELOPMENT ENTITIES
1. As a Kafka Streams developer you work with the two main developer-facing entities:
   1. Topology
   2. KafkaStreams
10. TOPOLOGY
1. Represents the stream processing logic of a Kafka Streams application
2. Directed Acyclic Graph of Processors (Stream Processing Nodes)
3. Logical representation
4. Created directly or indirectly (using the Streams DSL)
5. Topology API
   Adding sources, processors, sinks, state stores
6. Can be described (and printed out to stdout)
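As an illustration of the DAG idea only (not the Topology API), a topology can be pictured as named nodes with directed edges. The source and processor names below echo the describe output shown later in this deck; demo-sink is a made-up name for illustration:

```scala
// Toy illustration only: a topology pictured as a DAG using an adjacency
// list of node names. "demo-sink" is a made-up name; the other node names
// echo the describe output shown later in this deck.
val dag: Map[String, Seq[String]] = Map(
  "demo-source-processor"   -> Seq("demo-processor-supplier"),
  "demo-processor-supplier" -> Seq("demo-sink"),
  "demo-sink"               -> Seq.empty
)

// A node with no downstream nodes acts as a sink
def sinks(g: Map[String, Seq[String]]): Set[String] =
  g.collect { case (n, downstream) if downstream.isEmpty => n }.toSet
```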
11. EXAMPLE: CREATING TOPOLOGY
// Creating directly
import org.apache.kafka.streams.Topology
val topology = new Topology
// Created using Streams DSL (StreamsBuilder API)
// Scala API for Kafka Streams
import org.apache.kafka.streams.scala._
import ImplicitConversions._
import Serdes._
val builder = new StreamsBuilder
val topology = builder.build
12. EXAMPLE: DESCRIBING TOPOLOGY
scala> println(topology.describe)
Topologies:
Sub-topology: 0 for global store (will not generate tasks)
Source: demo-source-processor (topics: [demo-topic])
--> demo-processor-supplier
Processor: demo-processor-supplier (stores: [in-memory-key-v
--> none
<-- demo-source-processor
13. KAFKASTREAMS
1. Manages execution of a topology of a Kafka Streams application
   Start, close (shut down), state
2. Consumes messages from and produces processing results to Kafka topics
3. It is acceptable to create multiple KafkaStreams instances per Kafka Streams application
4. For better performance, consider running multiple KafkaStreams instances as separate instances of the Kafka Streams application (e.g. as separate containers)
14. EXAMPLE: CREATING KAFKASTREAMS
import org.apache.kafka.streams.KafkaStreams
val topology: Topology = ...
val config: StreamsConfig = ...
val ks = new KafkaStreams(topology, config)
ks.start // <-- starts execution
15. MAIN EXECUTION ENTITIES
1. StreamThread
2. TaskManager
3. StreamTask and StandbyTask
4. StreamsPartitionAssignor
5. RebalanceListener
16. STREAMTHREAD (1 OF 2)
1. Stream Processor Thread
   Runs the main record processing loop
   Thread of execution (java.lang.Thread)
2. StreamThread uses a Kafka consumer (to poll for records)
   Subscribes to source topics
   Think of a Consumer Group
3. num.stream.threads configuration property (default: 1)
4. Uses Kafka Consumer "tools" for operation
   Registers StreamsPartitionAssignor
   Uses RebalanceListener to intercept changes to the partitions assigned
5. Uses TaskManager to manage processing tasks (on next slide)
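A minimal sketch of setting num.stream.threads; the application id, bootstrap servers and their values (demo-app, localhost:9092, 4 threads) are illustrative placeholders, not values from the talk:

```scala
import java.util.Properties

// Minimal configuration sketch; the ids and servers are illustrative.
val props = new Properties
props.put("application.id", "demo-app")
props.put("bootstrap.servers", "localhost:9092")
// One StreamThread per instance by default; raise to add processing threads
props.put("num.stream.threads", "4")
```

More instances of the application (each with its own threads) scale processing further, as slide 8 notes.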
18. TASKMANAGER (1 OF 2)
1. Task manager of a StreamThread
2. Manages active and standby tasks
   Only active StreamTasks process records
3. Creates processor tasks for assigned partitions
   RebalanceListener.onPartitionsAssigned
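As a rough mental model (not Kafka Streams' actual implementation), a task manager that creates one processor task per newly assigned partition could look like this; TopicPartition and StreamTask here are simplified stand-ins for the real classes:

```scala
// Simplified stand-ins for the real Kafka classes
case class TopicPartition(topic: String, partition: Int)
case class StreamTask(partition: TopicPartition)

// Conceptual model of a task manager: one active task per assigned partition
class TaskManager {
  private var activeTasks = Map.empty[TopicPartition, StreamTask]

  // In the real engine this is driven by RebalanceListener.onPartitionsAssigned
  def createTasks(assigned: Seq[TopicPartition]): Unit =
    assigned.foreach { tp =>
      if (!activeTasks.contains(tp))
        activeTasks += tp -> StreamTask(tp)
    }

  def numActiveTasks: Int = activeTasks.size
}
```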
20. STREAM PROCESSOR TASKS (1 OF 2)
1. Managed by TaskManager to run a topology of stream processors
2. StreamTask - the default processor task
   As many as partitions assigned
   Managed as a group as AssignedStreamsTasks
   Only active StreamTasks process records
   Use FIFO RecordQueues (one per Kafka TopicPartition)
3. StandbyTask - a backup processor task
   "Ghost" tasks for active StreamTasks
   Default: 0 standby tasks
   Managed as a group as AssignedStandbyTasks
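The per-partition FIFO buffering can be modeled with plain queues. This is a conceptual sketch, not the actual RecordQueue implementation:

```scala
import scala.collection.mutable

// Conceptual model: records buffered per partition, drained one at a time,
// in arrival (FIFO) order -- the "one record at a time" behavior of a task.
case class Record(partition: Int, value: String)

class RecordQueues {
  private val queues = mutable.Map.empty[Int, mutable.Queue[Record]]

  def addRecord(r: Record): Unit =
    queues.getOrElseUpdate(r.partition, mutable.Queue.empty[Record]) += r

  // Next buffered record of the given partition, if any
  def poll(partition: Int): Option[Record] =
    queues.get(partition).filter(_.nonEmpty).map(_.dequeue())
}
```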
22. STREAMSPARTITIONASSIGNOR (1 OF 2)
1. Custom PartitionAssignor from the Kafka Consumer API
   Used for dynamic partition assignment and distribution across the members of a consumer group
   partition.assignment.strategy / ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG configuration property
2. Group management protocol
   Group membership
   State synchronization
3. Assigns partitions dynamically across the instances
   Requires application.id / StreamsConfig.APPLICATION_ID_CONFIG
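As a toy illustration of what a partition assignor does, here is a round-robin distribution of partitions across group members. The real StreamsPartitionAssignor is considerably more involved (it assigns tasks, not bare partitions, and also places standby replicas):

```scala
// Toy round-robin assignment of partitions to consumer-group members.
// Assumes a non-empty member list; member names are illustrative.
def assign(members: Seq[String], partitions: Seq[Int]): Map[String, Seq[Int]] =
  partitions.zipWithIndex
    .groupBy { case (_, i) => members(i % members.size) }
    .map { case (m, ps) => m -> ps.map(_._1) }
    .withDefaultValue(Seq.empty)
```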
24. REBALANCELISTENER (1 OF 2)
1. Custom ConsumerRebalanceListener from the Kafka Consumer API
   Callback interface for custom actions when the set of partitions assigned to the consumer changes
   Partition re-assignment will be triggered any time the members of the consumer group change
2. Intercepts changes to the partitions assigned to a single StreamThread
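The callback interface can be sketched as follows, mirroring the shape of the Consumer API's ConsumerRebalanceListener but with simplified partition types (plain Ints instead of TopicPartitions):

```scala
// Simplified model of a rebalance callback: the engine reacts to revoked
// partitions (e.g. suspending tasks) and to newly assigned partitions
// (e.g. creating tasks).
trait RebalanceListener {
  def onPartitionsRevoked(partitions: Seq[Int]): Unit
  def onPartitionsAssigned(partitions: Seq[Int]): Unit
}

// Example listener that records the callbacks it receives
class LoggingRebalanceListener extends RebalanceListener {
  var log = Vector.empty[String]
  def onPartitionsRevoked(partitions: Seq[Int]): Unit =
    log :+= s"revoked: ${partitions.mkString(",")}"
  def onPartitionsAssigned(partitions: Seq[Int]): Unit =
    log :+= s"assigned: ${partitions.mkString(",")}"
}
```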