This document summarizes a talk on streaming architectures and libraries for processing streaming data. It gives an overview of Kafka Streams and Akka Streams, noting that Kafka Streams is ideal for ETL, aggregations, joins, and "effectively once" requirements, while Akka Streams suits low-latency, mid-volume workloads built as graphs of processing nodes. It includes Scala examples in both libraries for scoring streaming data records against a continuously updated machine learning model.
1. Streaming Microservices With Akka Streams And Kafka Streams
2. Check out these resources: Dean's book, webinars, etc.
Fast Data Architectures for Streaming Applications: Getting Answers Now from Data Sets that Never End
By Dean Wampler, Ph.D., VP of Fast Data Engineering
LIGHTBEND.COM/LEARN
5.–7. Why Kafka?
[Diagram, "Before": Service 1, Service 2, Service 3, log & other files, and internet services each connect directly to the consuming services; N producers and M consumers require N * M links. "After": all producers write to Kafka and all consumers read from Kafka, so only N + M links are needed.]
Kafka:
• Simplify dependencies between services
• Reduce data loss when a service crashes
• M producers, N consumers
• Simplicity of one "API" for communication
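A minimal sketch (not from the talk) of the "one API" point: every producing service writes with the same producer API, and every consumer reads with the same consumer API, regardless of which services sit on the other side. The topic name and broker address below are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Any of the N producing services does exactly this, whatever its payload:
producer.send(new ProducerRecord("service-events", "key1", "some event"))
producer.close()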
11. A Spectrum of Microservices
[Diagram: a spectrum of microservices, from event-driven μ-services on the left (browse, REST, API gateway, account, orders, shopping cart, inventory, …) to "record-centric" μ-services on the right (data, model training, model serving, other logic, storage); events dominate the left end, records the right.]
13. A Spectrum of Microservices
[Diagram: the same spectrum, highlighting the "record-centric" side: data, model training, model serving, other logic, storage.]
Kafka Streams emerged from the right-hand side of this spectrum. It pushes to the left, supporting many event-processing scenarios.
18. val builder = new StreamsBuilderS  // New Scala wrapper API.

val data  = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)

val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)

model.mapValues(bytes => Model.parseBytes(bytes))  // bytes => record
  .filter((key, model) => model.valid)             // Parsed successfully?
  .mapValues(model => ModelImpl.findModel(model))
  .process(() => modelProcessor, …)                // Set up actual model

data.mapValues(bytes => DataRecord.parseBytes(bytes))
  .filter((key, record) => record.valid)
  .mapValues(record => new ScoredRecord(scorer.score(record), record))
  .to(scoredRecordsTopic)

val streams = new KafkaStreams(builder.build, streamsConfiguration)
streams.start()

[Diagram: Model Training publishes model params to a Kafka topic; Model Serving consumes the raw data and model params topics and writes scored records to an output topic.]
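ModelProcessor itself is not shown in the talk. A minimal sketch, assuming it is a Kafka Streams Processor (classic Processor API) that keeps a reference to the most recent model for the Scorer to read; the value type ModelImpl is assumed from the preceding mapValues step:

import org.apache.kafka.streams.processor.AbstractProcessor

class ModelProcessor extends AbstractProcessor[Array[Byte], ModelImpl] {
  @volatile private var current: Option[ModelImpl] = None

  // Called for every record on the model topic: swap in the new model.
  override def process(key: Array[Byte], model: ModelImpl): Unit =
    current = Some(model)

  // The Scorer calls this to fetch the model to score against.
  def currentModel: Option[ModelImpl] = current
}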
30. Akka Cluster
[Diagram: Model Training publishes model params to Kafka; the model-serving service, running in an Akka Cluster, consumes the raw data and model params topics via Alpakka.]

implicit val system = ActorSystem("ModelServing")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher

val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)

val modelStream: Source[ModelImpl, Consumer.Control] =
  Consumer.atMostOnceSource(modelConsumerSettings,
      Subscriptions.topics(modelTopic))
    .map(input => Model.parseBytes(input.value()))
    .filter(model => model.valid).map(_.get)
    .map(model => ModelImpl.findModel(model))
    .filter(model => model.valid).map(_.get)

val dataStream: Source[Record, Consumer.Control] =
  Consumer.atMostOnceSource(dataConsumerSettings,
      Subscriptions.topics(rawDataTopic))
    .map(input => DataRecord.parseBytes(input.value()))
    .filter(record => record.valid).map(_.get)
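Note that Consumer.atMostOnceSource commits each offset before handing the record downstream, so records in flight can be lost if the service crashes. A hedged sketch (not from the talk) of the simplest possible wiring: run dataStream straight into a sink that scores each record with the Scorer defined above. The talk instead wires both streams into a custom graph stage (next slides) so model updates and scoring share state safely.

import akka.stream.scaladsl.Sink

dataStream
  .map(record => new ScoredRecord(scorer.score(record), record))
  .runWith(Sink.foreach(scored => println(s"scored: $scored")))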
35. val model = ModelStage(modelProcessor)

def keepModelMaterializedValue[M1, M2, M3](m1: M1, m2: M2, m3: M3): M3 = m3

val modelPredictions: Source[ScoredRecord, ReadableModelStateStore] =
  Source.fromGraph {
    GraphDSL.create(dataStream, modelStream, model)(
        keepModelMaterializedValue) { implicit builder => (d, m, w) =>
      import GraphDSL.Implicits._
      // Wire the input streams to the model stage (2 in, 1 out):
      // dataStream  --> |       |
      //                 | model | --> scored records
      // modelStream --> |       |
      d ~> w.dataRecordIn
      m ~> w.modelRecordIn
      SourceShape(w.scoringResultOut)
    }
  }

[Diagram: the Akka Cluster model-serving service, fed raw data and model params from Kafka via Alpakka.]
36. case class ModelStage(modelProcessor: …) extends
    GraphStageWithMaterializedValue[…, …] {

  val scorer = new Scorer(modelProcessor)

  val dataRecordIn     = Inlet[Record]("dataRecordIn")
  val modelRecordIn    = Inlet[ModelImpl]("modelRecordIn")
  val scoringResultOut = Outlet[ScoredRecord]("scoringOut")
  …
  setHandler(dataRecordIn, new InHandler {
    override def onPush(): Unit = {
      val record = grab(dataRecordIn)
      val newRecord = new ScoredRecord(scorer.score(record), record)
      push(scoringResultOut, newRecord)
      pull(dataRecordIn)
    }
  })
  …
}
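The slide shows only the data-side handler. A sketch of the model-side counterpart, under the assumption that ModelProcessor exposes some way to swap in a new model (setModel below is a hypothetical method name, not from the talk):

setHandler(modelRecordIn, new InHandler {
  override def onPush(): Unit = {
    val model = grab(modelRecordIn)
    modelProcessor.setModel(model)  // hypothetical: replace the served model
    pull(modelRecordIn)
  }
})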
40. val materializedReadableModelStateStore: ReadableModelStateStore =
  modelPredictions
    .to(Sink.ignore)  // we do not read the results directly
    .run()            // we run the stream, materializing the stage's StateStore
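Hypothetical usage sketch: ReadableModelStateStore's API is not shown in the talk, and currentModelInfo below is an assumed accessor. The idea is that the materialized store lets other code, e.g. an HTTP endpoint, report on the model currently being served.

val info = materializedReadableModelStateStore.currentModelInfo
println(s"serving model: $info")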
41. Other Concerns
• Scale scoring with workers and routers, across a cluster
• Persist actor state with Akka Persistence (a sketch follows below)
• Connect to almost anything with Alpakka
• Enterprise Suite for production
[Diagrams: an Akka Cluster model-serving service where Alpakka feeds a router that fans work out to worker actors, with stateful logic persisting actor state to storage via Akka Persistence; and the full pipeline, with raw data and model params flowing in via Alpakka and final records flowing out to storage.]
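A minimal sketch of the "persist actor state with Akka Persistence" bullet, using the classic PersistentActor API. The event and actor names are assumptions, not from the talk.

import akka.persistence.PersistentActor

case class ModelUpdated(modelBytes: Array[Byte])

class ModelServingActor extends PersistentActor {
  override def persistenceId: String = "model-serving"
  private var currentModel: Option[Array[Byte]] = None

  override def receiveCommand: Receive = {
    case bytes: Array[Byte] =>
      // Journal the event first; apply it to in-memory state once persisted.
      persist(ModelUpdated(bytes)) { evt =>
        currentModel = Some(evt.modelBytes)
      }
  }

  // On restart, replay journaled events to rebuild the actor's state.
  override def receiveRecover: Receive = {
    case ModelUpdated(bytes) => currentModel = Some(bytes)
  }
}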
42. Go Direct or Through Kafka?
[Diagram: Akka Streams services connected directly via Alpakka vs. connected through a Kafka "Scored Records" topic.]
Direct (Akka Streams):
• Extremely low latency
• Minimal I/O and memory overhead
• Reactive Streams backpressure
• M producers, N consumers, but directly connected (sort of)
• Use Akka Persistence for durable state
Through Kafka:
• Higher latency (including queue depth)
• Higher I/O overhead
• Very large buffer (disk size)
• M producers, N consumers, completely disconnected
• Automatic durability (topics on disk)
43. Go Direct or Through Kafka?
Direct (Akka Streams):
• Use for smaller, faster messaging between "components"
• Watch for consumer "backup"
• Use Akka Persistence for important state!
Through Kafka:
• Use for larger volumes, more coarse-grained service interactions
• Plan partitioning and replication carefully
[Diagram: the same direct-vs-Kafka comparison as the previous slide.]
46. Check out these resources: Dean's book, webinars, etc.
Fast Data Architectures for Streaming Applications: Getting Answers Now from Data Sets that Never End
By Dean Wampler, Ph.D., VP of Fast Data Engineering
LIGHTBEND.COM/LEARN
47. Serving Machine Learning Models: A Guide to Architecture, Stream Processing Engines, and Frameworks
By Boris Lublinsky, Fast Data Platform Architect
LIGHTBEND.COM/LEARN
48. For even more information:
Tutorial, "Building Streaming Applications with Kafka":
• Software Architecture Conference New York
• Strata Data Conference San Jose
• Strata Data Conference London
My talk:
• Strata Data Conference San Jose