Walmart.com generates millions of events per second. At WalmartLabs, I work on a team called the Customer Backbone (CBB), where we wanted to upgrade to a platform capable of processing this event volume in real time and storing the state/knowledge of potentially every Walmart customer derived from that processing. Kafka Streams' event-driven architecture seemed like the only obvious choice.
However, there are a few challenges at Walmart's scale:
• the clusters need to be large, with all the problems that entails.
• infinite retention of changelog topics, wasting valuable disk.
• slow stand-by task recovery in case of a node failure (changelog topics have GBs of data)
• no repartitioning in Kafka Streams.
To address the challenges above as part of our event-driven development, I'm going to talk about some bold new ideas we developed as features/patches to Kafka Streams to deal with the scale required at Walmart.
• Cold Bootstrap: on a Kafka Streams node failure, instead of recovering from the change-log topic, we bootstrap the stand-by from the active's RocksDB using JSch, with zero event loss thanks to careful offset management.
• Dynamic Repartitioning: We added support for repartitioning in Kafka Streams where state is distributed among the new partitions. We can now elastically scale to any number of partitions and any number of nodes.
• Cloud/Rack/AZ aware task assignment: No active and standby tasks of the same partition are assigned to the same rack.
• Decreased Partition Assignment Size: with large clusters like ours (>400 nodes, 3 stream threads per node), the partition assignment of the Kafka Streams cluster runs to a few hundred MBs, so rebalances take a long time to settle.
Key Takeaways:
• Basic understanding of Kafka Streams.
• Productionizing Kafka Streams at scale.
• Using Kafka Streams as a distributed NoSQL DB (see the sketch below).
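To make the last takeaway concrete, here is a minimal sketch of a point lookup against the embedded state via Kafka Streams interactive queries; the store name "customer-store" and the String types are illustrative assumptions, not our actual schema.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public final class CustomerLookup {

    // Point lookup served straight from the embedded RocksDB of the task
    // that owns this key's partition; no external DB involved.
    static String lookup(KafkaStreams streams, String customerId) {
        ReadOnlyKeyValueStore<String, String> store =
            streams.store("customer-store", QueryableStoreTypes.keyValueStore());
        return store.get(customerId);
    }
}
```

This is what lets the same cluster double as the serving layer for customer state.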
8. Default Bootstrap
Change-log topic as the source of truth
• Slow stand-by recovery
• Log-Compacted change-log topics
• Inefficient disk usage
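For context, the stand-by tasks in question come from the stock num.standby.replicas setting; a minimal sketch, with placeholder application id and bootstrap servers:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public final class StandbyConfig {

    // Stock Kafka Streams: each active task gets a stand-by replica that is
    // kept warm by replaying the log-compacted change-log topic, which is
    // exactly the slow, disk-hungry recovery path this slide calls out.
    static Properties props() {
        Properties p = new Properties();
        p.put(StreamsConfig.APPLICATION_ID_CONFIG, "cbb-app");        // placeholder
        p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        p.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return p;
    }
}
```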
9. Cold Bootstrap
[diagram: app cluster with active task-0 and stand-by task-0', each holding an embedded RocksDB, plus a new stand-by starting with an empty RocksDB; Kafka cluster with the input topic and the change-log topic (partitions 0, 0', 0''). Default recovery replays the change-log topic, "the source of truth"; enhanced stand-by recovery (Cold Bootstrap) copies the active's RocksDB, "the new source of truth".]
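The deck names only JSch for the transfer, so the following is a minimal sketch of the copy step, assuming SFTP with password auth and illustrative host and file paths; the real patch also does careful offset management around the copy.

```java
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.JSchException;
import com.jcraft.jsch.Session;
import com.jcraft.jsch.SftpException;

public final class ColdBootstrap {

    // Pull one RocksDB file from the active's host onto the new stand-by.
    // In practice this runs per file of the store directory, after which
    // offsets are aligned so no events are lost or replayed twice.
    static void fetchFile(String activeHost, String user, String password,
                          String remotePath, String localPath)
            throws JSchException, SftpException {
        Session session = new JSch().getSession(user, activeHost, 22);
        session.setPassword(password);                    // assumption: password auth
        session.setConfig("StrictHostKeyChecking", "no"); // demo only, not for prod
        session.connect();
        try {
            ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
            sftp.connect();
            sftp.get(remotePath, localPath);              // copy the active's data file
            sftp.disconnect();
        } finally {
            session.disconnect();
        }
    }
}
```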
16. AZ/Rack Aware Task Assignment
[diagram: three availability zones (AZ1, AZ2, AZ3) holding active and stand-by copies of task-0 and task-1, each with its own RocksDB, placed so that no active and stand-by of the same task share a zone.]
StickyTaskAssignor using RACK_ID_CONFIG = “rack.id”;
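The assignor itself isn't shown in the deck; the core placement rule could look like the sketch below, where rackOf is a hypothetical map from client id to its advertised rack.id:

```java
import java.util.Map;
import java.util.Objects;

public final class RackAwarePlacement {

    // Hypothetical check applied when choosing a stand-by owner inside the
    // patched StickyTaskAssignor: reject any candidate client that sits in
    // the same rack/AZ as the client already hosting the active task.
    static boolean isValidStandbyHost(String candidate, String activeHost,
                                      Map<String, String> rackOf) {
        return !Objects.equals(rackOf.get(candidate), rackOf.get(activeHost));
    }
}
```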
21. Large Clusters: Rebalance Time
• Bottleneck: Group Leader Broker
• Partition Assignment Info
• Compression
• Better Encoding
• 24x smaller in size (see the sketch below)
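The deck doesn't show the encoding change itself; here is a minimal sketch of just the compression half, assuming the assignment info has already been serialized to bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public final class AssignmentCompression {

    // GZIP the serialized partition-assignment blob before it passes through
    // the group leader broker; together with a denser encoding, this is the
    // kind of change behind the quoted ~24x size reduction.
    static byte[] compress(byte[] assignmentInfo) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(assignmentInfo);
        }
        return out.toByteArray();
    }
}
```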
25. Up Next
• Feature Extraction and Model Inferencing 😎
• Cold Bootstrap from other stand-by 😍
• Cold-Bootstrap and Repartitioning for DSL🧐
• TTL support for RocksDB 🧐
• Merge Operator for RocksJava 😥
• Chaining Kafka Streams Apps 🧐
• Multi Tenancy 🧐 😪 🧐 😵
Interactions (search, page views, item clicks, etc.), transactions, profile changes.
Each event tells a story about the customer: what they like, dislike, and are likely to buy.
We needed a platform capable of processing this event volume and storing the knowledge/state of all of Walmart's online customers.
Alternatives: Flink, Spark Streaming, Dynomite. Kafka Streams seemed to be the only obvious choice (a library, scalable, fault tolerant, exactly-once semantics). But at Walmart's scale, even Kafka Streams came with its own bag of challenges.
Self Introduction
Forked over a year ago; made more than 100 commits to improve its scalability, fault tolerance, and performance, and to add more features. "Distributed DB"
But………
Slow Fault Recovery
Scalability: limited to the number of partitions (one-to-one mapping to input topics)
Cloud management and maintenance
RocksJava: limitations in how the RocksDB store can be used.
Large Clusters: Long rebalance time.
Single process with embedded RocksDB and an Akka server to serve keys from the DB.
N partitions. App partition zero maps to the Kafka topic's zeroth partition.
No apostrophe -> Active
One apostrophe -> Stand-by
Each broker can hold multiple partitions
Each app instance can hold multiple tasks. Recovers from the change-log topic.
Replays the entire change-log. Recovery time is proportional to the amount of data in the change-log topics.
100 Gb/s connection, 100 GB of data: 100 GB is about 800 Gb, so the transfer alone takes roughly 8 seconds.
Unwindowed stores, yet windowed change-log topics. Lose exactly-once semantics in cross-DC bootstraps. JSch for file transfer.
One-to-one mapping with the input topic: 100 partitions means 100 active and 100 stand-by tasks.
Lookups happen as a part of the restoration phase.
Background threads clean up unneeded key-value pairs.
Multiplying serving throughput.
Notes: a slide for partition-specific lookups.
Availability zones / racks going down or under maintenance.
Adding RACK_ID_CONFIG to Subscription Info and implementing StickyTaskAssignor
We always expect a few instances to go down; if so, rebalance. But if an entire AZ goes down, don't reassign. Only when the loss is above a threshold do we do reassignment on rebalance.
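As a sketch of that rule (the threshold value and the liveness counting are assumptions):

```java
public final class RebalancePolicy {

    // Tolerate a whole AZ disappearing without moving tasks; only reassign
    // once the fraction of lost instances crosses a configured threshold.
    static boolean shouldReassign(int liveInstances, int totalInstances,
                                  double lossThreshold) {
        double lostFraction = 1.0 - (double) liveInstances / totalInstances;
        return lostFraction > lossThreshold;
    }
}
```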
Replacing with the Akka server did not help as long as gets were synchronized. Talk about locking around RocksDB state changes. Serving downtime brought down to potentially zero.
acks (producer) = all
linger.ms (producer) = as required, but >0
auto.offset.reset (consumer) = earliest
retries (streams) = infinite
state.cleanup.delay.ms (streams) = sufficiently large
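In code these map to plain config keys; a minimal sketch, where the linger.ms and state.cleanup.delay.ms values are workload-dependent placeholders:

```java
import java.util.Properties;

public final class ReliabilityConfig {

    // The settings listed above, as they would be applied to the combined
    // producer/consumer/streams configuration.
    static Properties props() {
        Properties p = new Properties();
        p.put("acks", "all");                            // producer
        p.put("linger.ms", "5");                         // producer: >0, as required
        p.put("auto.offset.reset", "earliest");          // consumer
        p.put("retries", Integer.MAX_VALUE);             // streams: effectively infinite
        p.put("state.cleanup.delay.ms", Long.MAX_VALUE); // streams: sufficiently large
        return p;
    }
}
```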
We never lose state for any customer.
Potentially zero serving down-time.
Going towards zero processing down-time
Streaming engine and distributed NoSQL DB with exactly-once semantics and high lookup and processing throughput.