In the IoT-enabled fleet management domain, real-time signal tracking is crucial. Signals are datapoint readings from sensors across the vehicle, such as engine temperature, fuel level or braking force. Our solution processes batches of these signals, handling up to 8K batches (around 500K signals) per second in production. This talk explores our architectural journey, focusing on real-time processing, horizontal scalability, fault tolerance, monitoring and alerting. We used Kafka Streams' interactive queries API and a gRPC layer to store and query Protobuf-formatted data, achieving near-instantaneous data access. We will discuss key optimizations to both the Kafka topology and the cluster, aimed specifically at reducing network overhead and controlling changelog size. These optimizations not only ensure resource efficiency but also improve fault tolerance and enable rapid startups. Walk away with actionable insights for your own Kafka deployments.
14. 14
Use case: Last known data of devices
Value Proposition
Access the last known state of vehicles, regardless of their current connectivity status
Usage
• What was the last position of my fleet?
• What was the last state of charge of my electric vehicle?
• I don't want to process all the IoT data from my vehicles, but I want to periodically check the distance travelled by each of my vehicles.
15. 15
Use case: Last known data of devices
Requirements
• Keep the last state of a signal, keyed by tenant, vehicle ID and signal ID
• Allow accessing a set of signals for a set of vehicles
• Data must be available in near real time (<5 seconds end to end)
• API must be fast (used by the front end)
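The requirements above boil down to a composite key and keep-the-latest semantics. A minimal dependency-free sketch (the names `SignalKey`, `SignalValue` and the delimiter encoding are illustrative, not the talk's actual model; it assumes IDs contain no `|`):

```scala
// Hypothetical sketch of the composite key from the requirements:
// last state of a signal, keyed by tenant, vehicle ID and signal ID.
final case class SignalKey(tenantId: String, vehicleId: String, signalId: String) {
  // Delimited encoding: keys of the same tenant/vehicle stay adjacent
  // when sorted lexicographically (assumes IDs contain no '|').
  def encode: String = s"$tenantId|$vehicleId|$signalId"
}

final case class SignalValue(key: SignalKey, value: Double, timestamp: Long)

// Keep-the-last-state semantics: a newer timestamp wins.
def keepMostRecent(a: SignalValue, b: SignalValue): SignalValue =
  if (a.timestamp >= b.timestamp) a else b
```

Keeping same-vehicle keys adjacent is what later makes range queries over a set of signals cheap.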
21. 21
Architecture | API
[Diagram: Kafka cluster feeding three application instances (Instance 1, 2, 3), each with its own local store]
Data is split across many local stores, each of which only handles part of the entire state store.
22. 22
Architecture | API
[Diagram: the same three instances, each registered against the Kafka cluster]
Instance 1 (application.id: my-topology, application.server: instance1:8080, group.instance.id: instance1)
Instance 2 (application.id: my-topology, application.server: instance2:8080, group.instance.id: instance2)
Instance 3 (application.id: my-topology, application.server: instance3:8080, group.instance.id: instance3)
1/ Register the instances of the Kafka Streams application
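One way these per-instance settings might be wired up (a sketch: the string keys mirror `StreamsConfig.APPLICATION_ID_CONFIG`, `APPLICATION_SERVER_CONFIG` and the consumer's `group.instance.id`, written as literals so the snippet runs without the Kafka client on the classpath; host and port values come from the slide):

```scala
import java.util.Properties

// Per-instance Kafka Streams settings from the slide.
def instanceProperties(instance: String, port: Int): Properties = {
  val p = new Properties()
  p.put("application.id", "my-topology")          // same for every instance
  p.put("application.server", s"$instance:$port") // how peers reach this instance
  p.put("group.instance.id", instance)            // static membership id
  p
}
```

`application.server` is what the interactive-query metadata later returns, so each instance can route requests to its peers.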
23. 23
Architecture | API
[Diagram: a GET key1, key2, key3, … request arriving at one instance]
2/ When receiving a request, fetch the topology metadata to know which data is retrievable where.
24. 24
Architecture | API
[Diagram: keys key1, key2, key3 spread across the three instances' stores]
3/ Get some of the data from the local store. The rest is retrieved using RPC requests to the other instances.
28. 28
First Version | Topology
[Topology diagram. Sub Topology 1: input → messages-source-v1 → key-values-extractor-v1 → key-value-repartition-v1-repartition. Sub Topology 2: key-value-repartition-v1-repartition-source → keep-most-recent-reducer-v1 → key-value-store-v1]
29. 29
First Version | Topology
streamsBuilder
// source of input data
.stream[String, Message](
inputTopic,
Consumed
.`with`(Serdes.String(), messageSerde)
.withName("messages-source-v1")
)
30. 30
First Version | Topology
// extract key values - changing the key of the stream
.flatMap(
{ case (_, message) => message.toKeyValues },
Named.as("key-values-extractor-v1"),
)
31. 31
First Version | Topology
// Kafka Streams would otherwise do this implicitly
.repartition(
Repartitioned
.as("key-value-repartition-v1")
.withKeySerde(keyValueKeySerde)
.withValueSerde(keyValueSerde)
)
32. 32
First Version | Topology
// group by key, effectively allowing operations at key level
.groupByKey(Grouped.as("group-by-key-v1"))
// Keep the last value based on the timestamp
.reduce(
{ case (keyValue1, keyValue2) => keyValue1.maxByTimestamp(keyValue2) },
Named.as("keep-most-recent-reducer-v1"),
Materialized.as(Stores.persistentKeyValueStore("key-value-store-v1")),
)
33. 33
First Version | gRPC Service
class GrpcService(
kafkaStreams: KafkaStreams,
storeName: String,
localhost: HostInfo
) {
def getKeyValue(tenantId: String, deviceId: String, signalId: String) = {
val key = (tenantId, deviceId, signalId)
val activeHost = kafkaStreams
.queryMetadataForKey(storeName, key, implicitly[Serializer[(String, String, String)]])
.activeHost()
if (activeHost == localhost) {
queryLocalStore(…)
} else {
queryRemoteHostWithGrpc(…)
}
}
}
34. 34
First Version | Interactive Query
implicit val bytesOrdering: Ordering[Bytes] = Ordering.fromLessThan { (a, b) =>
Bytes.BYTES_LEXICO_COMPARATOR.compare(a.get(), b.get()) < 0
}
def queryLocalStore(keys: List[(String, String, String)]) = {
if (keys.isEmpty) {
List.empty
} else {
// Serialize all keys to find the min and max by lexicographical order
val allKeys = keys
.map(key => Bytes.wrap(key.toByteArray) -> key)
.sortBy(_._1)(bytesOrdering)
val (_, minKey) = allKeys.head
val (_, maxKey) = allKeys.last
35. 35
First Version | Interactive Query
// Range query to pull values
val request = StateQueryRequest
.inStore(storeName)
.withQuery(RangeQuery.withRange(minKey, maxKey))
.requireActive()
// Get the values
kafkaStreamsWrapper.kafkaStreams.query(request)
.getPartitionResults.values().asScala
.flatMap { result =>
Using(result.getResult)(_.asScala.map(_.value))
.getOrElse(Iterator.empty)
}
.filter(value => keys.contains(value.key))
.toList
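The idea behind the two slides above is a range scan plus a post-filter: everything between the smallest and largest requested key gets pulled, including keys nobody asked for, so the results must be filtered back down to the requested set. A dependency-free sketch of that strategy, using string keys and an in-memory sorted map in place of serialized `Bytes` and the RocksDB store:

```scala
import scala.collection.immutable.SortedMap

// Scan [min, max] lexicographically, then drop keys that were not
// actually requested (the store may contain keys in between).
def queryByRange(store: SortedMap[String, Int], keys: List[String]): List[Int] =
  if (keys.isEmpty) List.empty
  else {
    val sorted = keys.sorted
    val (minKey, maxKey) = (sorted.head, sorted.last)
    store
      .range(minKey, maxKey + "\u0000") // range's upper bound is exclusive; this makes maxKey inclusive
      .collect { case (k, v) if keys.contains(k) => v }
      .toList
  }
```

The trade-off is the same as in the real code: one range scan instead of N point lookups, at the cost of reading and discarding the keys that fall inside the range but were not requested.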
36. 36
First Version | Test setup
Load generator
• 100 devices
• 2 messages / device / second
• 10 to 100 signals per message
• 500 different signals
• 200 messages / second
• 10K signals per second
Production
• >10K devices
• 1K different signals
• 500K signals per second
39. 39
First Optimization | Topology
Most of the network pressure
comes from the repartition
40. 40
First Optimization | Topology
Invariant: The input topic is
already partitioned by deviceId
41. 41
First Optimization | Topology
We can remove these steps
using the processor API
45. 45
First Optimization | Topology V2
[Topology V2 diagram: input → messages-source-v2 → keep-most-recent-processor-v2 → key-value-store-v2]
// Add the last values processor, connected to the store
.process(
() => new KeyValueStoreProcessor(v2StoreName),
Named.as("keep-most-recent-processor-v2"),
v2StoreName
)
46. 46
First Optimization | Topology V2 | Processor
class KeyValueStoreProcessor(
storeName: String
) extends Processor[String, Message, Void, Void] {
private var store: KeyValueStore[KeyValue.Key, KeyValue] = _
override def init(context: ProcessorContext[Void, Void]): Unit = {
super.init(context)
// Inject the store on initialization
this.store = context.getStateStore[
KeyValueStore[KeyValue.Key, KeyValue]
](storeName)
}
47. 47
First Optimization | Topology V2 | Processor
override def process(record: Record[String, Message]): Unit = {
// For each value
extractLastValues(record.value()).foreach { value =>
// Get the current value in store
val key = value.key
val currentValue = store.get(key)
// Override if the new value has a bigger timestamp
if (null == currentValue ||
currentValue.timestamp < value.timestamp) {
store.put(key, value)
}
}
}
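The processor's put-if-newer logic can be modelled without Kafka at all, with a mutable map standing in for the injected `KeyValueStore` (a sketch, not the talk's actual class; `Reading` is a hypothetical stand-in for `KeyValue`):

```scala
import scala.collection.mutable

// Stand-in for the keep-most-recent processor: the mutable map plays
// the role of the state store injected in init().
final case class Reading(key: String, value: Double, timestamp: Long)

def processReadings(store: mutable.Map[String, Reading], readings: Seq[Reading]): Unit =
  readings.foreach { r =>
    store.get(r.key) match {
      // Override only if the new value has a strictly bigger timestamp
      case Some(current) if current.timestamp >= r.timestamp => () // keep current
      case _ => store.update(r.key, r)
    }
  }
```

Note the comparison direction matches the slide: on equal timestamps the value already in the store wins, which makes reprocessing the same record idempotent.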
52. 54
First Optimization | Topology V2 | gRPC?
class GrpcService(
kafkaStreams: KafkaStreams,
storeName: String,
localhost: HostInfo
) {
def getKeyValue(deviceId: String, signalId: String) = {
val key = deviceId
val activeHost = kafkaStreams
.queryMetadataForKey(storeName, key, implicitly[Serializer[String]])
.activeHost()
if (activeHost == localhost) {
queryLocalStore(…)
} else {
queryRemoteHostWithGrpc(…)
}
}
}
53. 55
First Optimization | Topology V2 | Take Away
• Describe your topologies to understand what is done automatically
• Study your system to understand whether all steps are required
• Load test to understand the behavior of the different operations
• The low-level API gives you a lot of control
55. 57
Second Optimization | Persistence
RocksDB Store
• Persisted to disk
• Cache in memory
• Write-ahead log
• Store can exceed available RAM
In-Memory Store
• Faster
• Size constrained by available RAM
• OOM can be thrown if the store grows too big
56. 58
Second Optimization | Persistence
RocksDB Store: used in our case.
57. 59
Second Optimization | Persistence | RocksDB
Tuning RocksDB: limit memory and disk usage
p.put(
StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
classOf[CustomRocksDBConfig]
)
class CustomRocksDBConfig extends RocksDBConfigSetter {
override def setConfig(
storeName: String,
options: Options,
configs: Map[String, AnyRef],
): Unit = {
// …
}
}
58. 60
Second Optimization | Persistence | RocksDB
RocksDB allocates off-heap memory you need to limit:
• Block cache (for reads)
• Index and filter blocks
• Memtable (write buffer)
59. 61
Second Optimization | Persistence | RocksDB
When tuning / optimizing RocksDB usage:
• Check the Kafka Streams documentation for its latest RocksDB recommendations
• The Kafka Streams store cache and RocksDB are not mutually exclusive
• Consider enabling compression (memory usage vs. performance trade-off)
• Experiment
60. 62
Second Optimization | Persistence | RocksDB
Operating RocksDB in Kubernetes
Persistent Volume
• Each instance of the application has a persistent volume where state is written
• Volumes get detached and reattached during operations
• Moving a volume to another node can be really slow (~30 min)
Ephemeral Volume
• State is written to the volume
• The volume is destroyed and recreated during operations
• Creating a volume is fast (a few seconds)
• State is loaded from the changelog at startup
61. 63
Second Optimization | Persistence | RocksDB
Operating RocksDB in Kubernetes
Persistent Volume
• Moving a volume to another node can be really slow (~30 min): a major issue
Used in our first versions
62. 64
Second Optimization | Persistence | RocksDB
Operating RocksDB in Kubernetes
Ephemeral Volume
• State is loaded from the changelog at startup
Moving to this. Let's see how.
63. 65
Second Optimization | Persistence | Changelog
Persistence of stores' states
[Diagram: Kafka Streams store writing key values to the changelog topic]
Every change in the store is replicated to the changelog topic. The changelog is a compacted topic: periodically, the topic is compacted and the last written value of each key is kept.
64. 66
Second Optimization | Persistence | Changelog
Initialization of stores' states (at start)
[Diagram: Kafka Streams store reading key values back from the changelog topic]
The changelog is completely consumed at startup to recreate the stores' state.
65. 67
Second Optimization | Persistence | Changelog
How does the changelog grow?
After 12 hours, 2GB
66. 68
Second Optimization | Persistence | Changelog
INFO o.a.k.s.p.i.StoreChangelogReader
stream-thread Restoration in progress for 32 partitions
INFO o.a.k.s.p.internals.StreamThread
stream-thread Restoration took 80684 ms for all active tasks
…
67. 69
Second Optimization | Persistence | Changelog
Restoration needs
~10MB/s bandwidth
to read the changelog
68. 70
Second Optimization | Persistence | Changelog
Forcing a compaction,
2GB -> 5MB for 100 devices
and 500 signals per device
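Those numbers are consistent with the compacted size depending only on key cardinality: 100 devices × 500 signals = 50,000 retained entries, so 5 MB works out to roughly 100 bytes per key-value record. A quick sanity check (the per-record size is my estimate, not a measured figure from the talk):

```scala
// Back-of-the-envelope: compacted changelog size ≈ keys × record size.
val devices = 100
val signalsPerDevice = 500
val retainedRecords = devices * signalsPerDevice    // 50,000 keys survive compaction
val approxBytesPerRecord = 100                      // assumed average record size
val approxCompactedBytes = retainedRecords * approxBytesPerRecord
// ≈ 5,000,000 bytes ≈ 5 MB, matching the observed post-compaction size
```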
69. 71
Second Optimization | Persistence | Changelog
Log cleaner thread 0 cleaned log kv-store-v1-key-value-store-v1-
changelog-22 (dirty section = [0, 2447900])
59.9 MB of log processed in 1.3 seconds (44.9 MB/sec).
Indexed 59.9 MB in 0.7 seconds (87.3 Mb/sec, 51.5% of total time)
Buffer utilization: 0.0%
Cleaned 59.9 MB in 0.6 seconds (92.5 Mb/sec, 48.5% of total time)
Start size: 59.9 MB (2,447,900 messages)
End size: 0.0 MB (1,540 messages)
99.9% size reduction (99.9% fewer messages)
70. 72
Second Optimization | Persistence | Changelog
INFO o.a.k.s.p.internals.StreamThread
stream-thread Restoration took 1102 ms for all active tasks
71. 73
Second Optimization | Persistence | Changelog
Restoration needs
~100kB/s bandwidth
to read the changelog
72. 74
Second Optimization | Persistence | Changelog
Properties of the changelog
• Small when compacted (number of devices × number of signal values to keep)
• The number of values grows organically
• Lots of writes: the changelog can grow fast
• How can we keep the changelog "small"?
73. 75
Second Optimization | Persistence | Changelog
Configuration | Description | Default
max.compaction.lag.ms | The maximum time a message will remain ineligible for compaction in the log | Long.MAX_VALUE
min.cleanable.dirty.ratio | Controls how frequently the log compactor will attempt to clean the log; this ratio bounds the maximum space wasted in the log by duplicates | 0.5
segment.bytes (log.segment.bytes at broker level) | The maximum size of a single log segment file | 1GB
74. 76
Second Optimization | Persistence | Changelog
How to control compaction frequency?
• Setting a max compaction lag puts an upper bound on the compaction period
• Lowering the segment size and/or the dirty ratio triggers compaction more often
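Kafka Streams can apply such settings to the changelog topics it creates through the `topic.` prefix on its configuration (the mechanism behind `StreamsConfig.topicPrefix`). A sketch with illustrative values, written with plain string keys so it runs without the Kafka client on the classpath; the actual numbers must be tuned to your data:

```scala
import java.util.Properties

// Changelog topic overrides, passed through Kafka Streams' "topic." prefix.
// Values are illustrative examples, not recommendations.
val props = new Properties()
props.put("topic.max.compaction.lag.ms", "3600000")  // compact at least hourly
props.put("topic.min.cleanable.dirty.ratio", "0.1")  // clean more eagerly than the 0.5 default
props.put("topic.segment.bytes", "104857600")        // 100 MB segments instead of 1 GB
```

Smaller segments matter because the active segment is never compacted: only rolled segments are eligible for cleaning.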
77. 79
Second Optimization | Persistence | Take Away
• Persistent volumes can bring uncontrollable entropy
• Ephemeral volumes rely on small changelogs
• Compaction is "lazy" by default
• Optimize changelogs' compaction based on your data
79. 81
Third Optimization | Resilience
What happens when an instance of a Kafka Streams application is down?
• Data assigned to the instance stops being processed
• The local store stops being available
80. 82
Third Optimization | Resilience | Standby Replicas
[Diagram: input topic → topology (active replica with its local store) → changelog topic → standby replica's store]
81. 83
Third Optimization | Resilience | Standby Replicas
The active replica processes input data
and stores the result in its local store
82. 84
Third Optimization | Resilience | Standby Replicas
The store is persisted in
its changelog topic
83. 85
Third Optimization | Resilience | Standby Replicas
The local store is replicated in the standby
replica by reading the changelog
84. 86
Third Optimization | Resilience | Standby Replicas
• Standby replicas are (eventually consistent) shadow copies of state stores
• Each task of the topology has one active replica and num.standby.replicas standby replicas
Pros
• Faster rebalancing
• Better reliability
Cons
• More resources
• Eventual consistency
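Standbys are enabled with a single setting; a sketch with a plain string key mirroring `StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG` (default 0):

```scala
import java.util.Properties

// One standby copy per task: roughly doubles store storage and changelog
// read traffic, but buys faster failover and query availability during
// rebalances.
val props = new Properties()
props.put("num.standby.replicas", "1")
```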
85. 87
Third Optimization | Resilience | Interactive Queries
public <K> KeyQueryMetadata queryMetadataForKey(
final String storeName,
final K key,
final Serializer<K> keySerializer
)
public class KeyQueryMetadata {
private final HostInfo activeHost;
private final Set<HostInfo> standbyHosts;
private final int partition;
}
86. 88
Third Optimization | Resilience | Interactive Queries
/**
* Returns LagInfo, for all store partitions (active or standby) local
* to this Streams instance. Note that the values returned are just
* estimates and meant to be used for making soft decisions on whether
* the data in the store partition is fresh enough for querying.
*
* Note: Each invocation of this method issues a call to the Kafka
* brokers. Thus its advisable to limit the frequency of invocation to
* once every few seconds.
*
* @return map of store names to another map of partition to LagInfos
*/
public Map<String, Map<Integer, LagInfo>> allLocalStorePartitionLags()
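Combining `queryMetadataForKey` with those lag estimates lets the service fall back to a standby that is fresh enough when the active host is unavailable. A dependency-free sketch of that routing decision (types simplified to strings and longs; the freshness threshold is an example, and this is my illustration of the idea, not the talk's actual code):

```scala
// Pick a host to query: prefer the active host; otherwise fall back to
// the freshest standby whose estimated offset lag is acceptable.
final case class HostLag(host: String, offsetLag: Long)

def chooseHost(
    active: Option[String],
    standbys: List[HostLag],
    maxAcceptableLag: Long
): Option[String] =
  active.orElse {
    standbys
      .filter(_.offsetLag <= maxAcceptableLag)
      .sortBy(_.offsetLag)    // freshest standby first
      .headOption
      .map(_.host)
  }
```

Returning `None` when every standby is too stale lets the caller decide between an error and knowingly serving stale data.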
88. 90
Third Optimization | Resilience | Take Away
Standby Replicas
• Faster recovery when an instance fails
• Allow serving interactive queries during rebalancing…
• … but with stale data
90. 92
Take Aways
• Describe your topologies to understand what your application really does
• Monitor your system using metrics from Kafka Streams, the Kafka cluster, your deployment environment, …
• Load test as much as possible to understand trends and potential bottlenecks before they become problematic in production
• Prototype using the high-level APIs, optimize using the low-level APIs (at your own risk)
• gRPC ❤ Kafka Streams
• Understand Kafka Streams internals (in particular RocksDB and changelog compaction) to help optimize further
• Kafka Streams is highly configurable