Kafka Streams – The Power without the Weight at Streaming Morning@Lohika


Do you have any experience working with Kafka? Have you ever used Spark Streaming or Storm for real-time stream processing? Would you like to find out how to solve the same problems using the new Kafka Streams library? In this talk Dmitry gives an overview of the key features of the Kafka Streams library and highlights how it differs from existing streaming frameworks. A few demos are also included!





25–28. Word count, built up one statement at a time over four slides:

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> textLines = builder.stream(inputTopic);
    KTable<String, Long> wordCounts = textLines
        .flatMapValues(value -> toWords(value))
        .groupBy((key, word) -> word)
        .count("Counts");
    wordCounts.toStream().to(outputTopic);
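The `toWords` helper called in the word-count slides is not shown in the deck. A minimal sketch of what it might look like, assuming the intent is to lower-case each line and split it on runs of non-word characters (the class name `WordSplitter` and the splitting rule are assumptions, not taken from the talk):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class WordSplitter {
    // Hypothetical stand-in for the deck's toWords(value): lower-case the
    // line, split on runs of non-word characters, and drop empty tokens.
    static List<String> toWords(String line) {
        return Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toWords("Kafka Streams, the power without the weight"));
    }
}
```

Note that `\W` treats only ASCII letters, digits and underscore as word characters by default, so a real application processing non-English text would probably need a different pattern.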


41. Elasticity and scalability


45. Local/remote state in stream processing

Editor's notes

Request–response, or request–reply, is one of the basic methods computers use to communicate with each other: the first computer sends a request for some data and the second computer responds to it. Usually there is a series of such interchanges until the complete message is sent; browsing a web page is an example of request–response communication. The important thing to remember is that such an application can serve only future requests.
The important thing to remember is that such an application can serve only historical data.
A stream is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set, where unbounded means “of unknown or of unlimited size”. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair.
The who’s who: Kafka distinguishes producers, consumers, and brokers. In short, producers publish data to Kafka brokers, and consumers read published data from Kafka brokers. Producers and consumers are totally decoupled. A Kafka cluster consists of one or more brokers.
The data: Data is stored in topics. The topic is the most important abstraction provided by Kafka: it is a category or feed name to which data is published by producers. Every topic in Kafka is split into one or more partitions, which are replicated across Kafka brokers for fault tolerance.
Parallelism: The partitions of a Kafka topic, and especially their number, are the main factor that determines the parallelism of Kafka with regard to reading and writing data. Because of their tight integration, the parallelism of Kafka Streams is heavily influenced by and dependent on Kafka’s parallelism.
For each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in each partition are assigned a sequential id number called the offset, which uniquely identifies each record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to two days, a record is available for consumption for two days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
In fact, the only metadata retained on a per-consumer basis is the offset, or position, of that consumer in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records but, because the consumer controls the position, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from "now". This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use Kafka's command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism.
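The offset model described in these notes can be sketched with a plain in-memory list. This is only an illustration under simplifying assumptions (one partition, no retention, no broker); `PartitionLog` and its methods are hypothetical names, not part of any Kafka API. The point it shows is that reading never changes the log, and the position is ordinary state owned by the consumer.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its offset (its position in the log).
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns up to max records starting at offset; the log is not modified.
    List<String> read(long offset, int max) {
        int from = (int) Math.min(Math.max(offset, 0), records.size());
        int to = Math.min(from + max, records.size());
        return new ArrayList<>(records.subList(from, to));
    }

    // The offset the next appended record will receive.
    long endOffset() {
        return records.size();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (String r : new String[] {"a", "b", "c", "d"}) {
            log.append(r);
        }
        long position = 0;                          // consumer-owned state
        position += log.read(position, 2).size();   // normal linear advance
        position = 0;                               // rewind to reprocess the past
        System.out.println(log.read(position, 10));
        position = log.endOffset();                 // skip ahead to "now"
        System.out.println(log.read(position, 10));
    }
}
```

A second consumer would simply keep its own `position` variable, which is why adding or removing readers has no effect on the log or on each other.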
The first option for processing streams, if you already have Kafka, is to do it yourself, with no frameworks and no libraries:
• Consume some messages from Kafka
• Process them
• Produce output
Using the Kafka APIs directly works well for simple things. It doesn’t pull any heavy dependencies into your app. We called this “hipster stream processing” since it is a kind of low-tech solution that appealed to people who liked to roll their own. This works well for simple one-message-at-a-time processing, but the problem comes when you want to do something more involved, say compute aggregations or join streams. In this case inventing a solution on top of the Kafka consumer APIs is fairly involved.
It’s hard. You need to deal with a lot of things and reinvent the wheel:
• Ordering?
• Partitioning?
• Fault tolerance?
• State management?
• Window operations?
• How to reprocess data?
Pulling in a full-fledged stream processing framework gives you easy access to these more advanced operations. But the cost for a simple application is an explosion of complexity. This makes everything difficult, from debugging to performance optimization to monitoring to deployment. This is even worse if your app has both synchronous and asynchronous pieces as then you end up splitting your code between the stream processing framework and whatever mechanism you have for implementing services or apps. It’s just really hard to build and operationalize a critical part of your business in this way. This isn’t such a problem in all domains—after all, if you are already using Spark to build a batch workflow, and you want to add a Spark Streaming job into this mix for some real-time bits, the additional complexity is pretty low and it reuses the skills you already have. However if you are deploying a Spark cluster for the sole purpose of this new application, that is definitely a big complexity hit.
With Kafka Streams it looks much simpler:
• The inputs and outputs are just Kafka topics.
• The data model is just Kafka’s keyed record data model throughout.
• The partitioning model is just Kafka’s partitioning model; a Kafka partitioner works for streams too.
• The group membership mechanism that manages partitions, assignment, and liveness is just Kafka’s group membership mechanism.
• Tables and other stateful computations are just log-compacted topics.
• Metrics are unified across the producer, consumer, and streams app, so there is only one type of metric to capture for monitoring.
• The position of your app is maintained by the application’s offsets, just as for any Kafka consumer.
By using the core abstractions in Kafka as the primitives for stream processing, we wanted to provide what you would get out of a stream processing framework, but with very little additional operational complexity beyond the normal Kafka producer and consumer APIs.
Kafka Streams has a strong focus on usability and a great developer experience. It offers all the necessary stream processing primitives to allow applications to read data from Kafka as streams, process the data, and then either write the resulting data back to Kafka or send the final output to an external system. Developers can choose between a high-level DSL with commonly used operations such as filter, map, and join, and a low-level API for those who need maximum control and flexibility.
• Designed as a lightweight library in Apache Kafka, much like the Kafka producer and consumer client libraries. You can easily embed and integrate Kafka Streams into your own applications, which is a significant departure from framework-based stream processing tools that dictate many requirements upon you, such as how you must package and “submit” processing jobs to their cluster.
• Has no external dependencies on systems other than Apache Kafka and can be used in any Java application. Read: you do not need to deploy and operate a separate cluster for your stream processing needs. Your Operations and Info Sec teams, among others, will surely be happy to hear this.
• Leverages Kafka as its internal messaging layer instead of (re)implementing a custom messaging layer like many other stream processing tools. Notably, Kafka Streams uses Kafka’s partitioning model to horizontally scale processing while maintaining strong ordering guarantees. This ensures high performance, scalability, and operational simplicity for production environments. A key benefit of this design decision is that you do not have to understand and tune two different messaging layers: one for moving data streams at scale (Kafka) plus a separate one for your stream processing tool. Similarly, any performance and reliability improvements of Kafka will automatically be available to Kafka Streams, too, thus tapping into the momentum of Kafka’s strong developer community.
Employs one-record-at-a-time processing to achieve low processing latency, which is crucial for a variety of use cases.
Let’s illustrate this with an example. Imagine a table that tracks the total number of pageviews by user (first column of diagram below). Over time, whenever a new pageview event is processed, the state of the table is updated accordingly. Here, the state changes between different points in time – and different revisions of the table – can be represented as a changelog stream (second column).
Interestingly, because of the stream-table duality, the same stream can be used to reconstruct the original table (third column).
The stream-table duality describes the close relationship between streams and tables.
• Stream as Table: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can be easily turned into a “real” table by replaying the changelog from beginning to end to reconstruct the table. Similarly, in a more general analogy, aggregating data records in a stream, such as computing the total number of pageviews by user from a stream of pageview events, will return a table (here with the key and the value being the user and its corresponding pageview count, respectively).
• Table as Stream: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream’s data records are key-value pairs). A table is thus a stream in disguise, and it can be easily turned into a “real” stream by iterating over each key-value entry in the table.
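The duality can be made concrete with ordinary collections. The sketch below is an illustration only, using a `List` of key-value entries as the "stream" and a `Map` as the "table"; the class and method names are hypothetical and none of this is Kafka Streams API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StreamTableDuality {
    // Stream as table: replay the changelog from beginning to end,
    // keeping only the latest value seen for each key.
    static Map<String, Long> toTable(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> change : changelog) {
            table.put(change.getKey(), change.getValue());
        }
        return table;
    }

    // Table as stream: emit one record per key-value entry of the table.
    static List<Map.Entry<String, Long>> toStream(Map<String, Long> table) {
        List<Map.Entry<String, Long>> stream = new ArrayList<>();
        for (Map.Entry<String, Long> entry : table.entrySet()) {
            stream.add(new SimpleEntry<>(entry.getKey(), entry.getValue()));
        }
        return stream;
    }

    // Aggregating a stream of pageview events (one user name per event)
    // yields a table of pageview counts per user.
    static Map<String, Long> countByUser(List<String> pageviewUsers) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String user : pageviewUsers) {
            counts.merge(user, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> pageviews = new ArrayList<>();
        pageviews.add("alice");
        pageviews.add("bob");
        pageviews.add("alice");
        Map<String, Long> table = countByUser(pageviews);
        System.out.println(table);
        System.out.println(toTable(toStream(table)).equals(table));
    }
}
```

Turning the table into a stream and replaying that stream gives back the same table, which is the round trip the notes describe.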
Figure 1: Before adding capacity, only a single instance of your Kafka Streams application is running. At this point the corresponding “consumer group” of your application contains only a single member (this instance). All data is being read and processed by this single instance.
After adding capacity, two additional instances of your Kafka Streams application are running, and they have automatically joined the application’s consumer group for a total of three current members. These three instances are automatically splitting the processing work between each other. The splitting is based on the Kafka topic partitions from which data is being read.
If one of the application instances is stopped (e.g. intentional reduction of capacity, maintenance, machine failure), it automatically leaves the application’s consumer group, which causes the remaining instances to automatically take over the stopped instance’s processing work. How many instances can or should you run for your application? Is there an upper limit on the number of instances and, similarly, on the parallelism of your application? In a nutshell, the parallelism of a Kafka Streams application, like the parallelism of Kafka itself, is primarily determined by the number of partitions of the input topic(s) from which your application reads. For example, if your application reads from a single topic that has 10 partitions, you can run up to 10 instances of your application (you can run further instances, but these will be idle).
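The partition limit on parallelism can be illustrated with a toy assignment function. This is a sketch under a simplifying assumption: partitions are spread round-robin, which is not necessarily how Kafka's group membership protocol assigns them. `PartitionAssignment` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionAssignment {
    // Spreads partitions 0..partitionCount-1 over the instances round-robin
    // and returns one list of partition ids per instance.
    static List<List<Integer>> assign(int partitionCount, int instanceCount) {
        if (instanceCount <= 0) {
            throw new IllegalArgumentException("need at least one instance");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < instanceCount; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            assignment.get(p % instanceCount).add(p);
        }
        return assignment;
    }

    // Counts instances that received no partition and therefore sit idle.
    static long idleInstances(List<List<Integer>> assignment) {
        return assignment.stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        System.out.println(assign(10, 1));                 // one instance does everything
        System.out.println(assign(10, 3));                 // work split three ways
        System.out.println(idleInstances(assign(10, 12))); // instances beyond 10 are idle
    }
}
```

Stopping an instance corresponds to calling `assign` again with a smaller instance count: the same partitions are redistributed over the instances that remain.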
A common pattern is for the stream processing job to take input records from its input stream and, for each input record, make a remote call to a distributed database. The input stream is partitioned over multiple processors, each of which queries a remote database. And, of course, since this is a distributed database, it is itself partitioned over multiple machines. One possibility is to co-partition the database and the input processing, and then move the data so that it is directly co-located with the processing.
An easy way to understand state in stream processing is to think about the kinds of operations you might do in SQL. Imagine running SQL queries against a real-time stream of data. If your SQL query contains only filtering and single-row transformations (a simple select and where clause, say), then it is stateless. That is, you can process a single row at a time without needing to remember anything in between rows. However, if your query involves aggregating many rows (a group by) or joining together data from multiple streams, then it must maintain some state in between rows. If you are grouping data by some field and counting, then the state you maintain would be the counts that have accumulated so far in the window you are processing. If you are joining two streams, the state would be the rows in each stream waiting to find a match in the other stream.
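The difference between the two kinds of operation shows up directly in code. In the sketch below (hypothetical names, plain Java rather than any streaming API), the filter is a pure function of one row, while the count has to keep a map alive between rows.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class StatelessVsStateful {
    // Stateless, like a simple SELECT ... WHERE: each row is decided on its
    // own, and nothing is remembered between rows.
    static Optional<String> filterErrors(String row) {
        return row.contains("ERROR") ? Optional.of(row) : Optional.empty();
    }

    // Stateful, like GROUP BY key with COUNT(*): the counts accumulated so
    // far are the state that must be kept between rows.
    static final class RunningCount {
        private final Map<String, Long> counts = new HashMap<>();

        // Records one more row for the key and returns the updated count.
        long add(String key) {
            return counts.merge(key, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        System.out.println(filterErrors("ERROR disk full").isPresent());
        RunningCount perUser = new RunningCount();
        perUser.add("alice");
        System.out.println(perUser.add("alice"));
    }
}
```

If the process holding `RunningCount` crashes, the map is lost, which is exactly the fault-tolerance question that state management in a stream processor has to answer.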
Interactive Queries enables faster and more efficient use of the application state. Data is local to your application (in memory or possibly on SSDs), so you can access it very quickly. This is especially useful for applications that need to access large amounts of application state, e.g., when they do joins. There is no duplication of data between the store doing aggregation for stream processing and the store answering queries.
What do you need to do as a developer to make your Kafka Streams applications queryable? It turns out that Kafka Streams handles most of the low-level querying, metadata discovery and data fault tolerance for you. Depending on the application, you might be able to query straight out of the box with zero work (see local stores section below), or you might have to implement a layer of indirection for distributed querying.
We start simple, with a single instance of the app. That instance can query its own local state stores out of the box. It can enumerate all the stores that are in that instance (in any processor node) as well as query the values of keys in those stores. This is simple and useful for apps that have a single instance; however, apps can have multiple instances running on potentially different servers. Kafka Streams will partition the data among these instances to scale out the capacity. If I want to get the latest value for key “stock X” and that key is not in my instance, how do I go about finding it?
An application might have multiple instances running, each with its own set of state stores. We have made each instance aware of each other instance’s state stores through periodic metadata exchanges, which we provide through Kafka’s group membership protocol. Starting with Confluent Platform 3.1 and Apache Kafka 0.10.1, each instance may expose its endpoint information metadata (hostname and port, collectively known as the “application.server” config parameter) to other instances of the same application. The new Interactive Query APIs allow a developer to obtain the metadata for a given store name and key, and do that across the instances of an application. Hence, you can discover where the store is that holds a particular key by examining the metadata. So now we know which store on which application instance holds our key but how do we actually query that (potentially remote) store? It sounds like we need some sort of RPC mechanism.
Out of the box, the Kafka Streams library includes all the low-level APIs to query state stores and discover state stores across the running instances of an application. The application can then implement its own query patterns on the data and route the requests among instances as needed. Apps often have intricate processing logic and might need non-trivial routing of requests among Kafka Streams instances. For example, an app might want to locate the instance that holds a particular customer ID, and then route a call to rank that customer’s stock to that particular instance. The business logic on that instance would sort the available data and return the result to the app. In more complex cases, we could have scatter-gather query patterns where a call to an instance results in N calls from that instance to other instances and N results being collated and returned to the original caller. It is clear from these examples that there is no one API for distributed querying. Furthermore, there is no single best transport layer for the RPCs either (some users favor or must use REST within their company, others might opt for Thrift, etc.). RPCs are needed for inter-instance communication (i.e., within the same app); otherwise instance 3 of the app can’t actually retrieve state information from instance 1. They are also needed for inter-app communication, where other applications query the original application’s state.
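The "layer of indirection" for distributed querying can be sketched as a small router: find which instance owns a key, answer from the local store if that is this instance, otherwise forward the call. Everything here is hypothetical and stands in for the real pieces. The owner list plays the role of the exchanged metadata, the `BiFunction` plays the role of whatever RPC transport the app chooses, and `String.hashCode` is used purely for illustration (Kafka's default partitioner hashes the serialized key bytes with murmur2 instead).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

final class KeyRouter {
    private final String self;                    // this instance's "host:port"
    private final List<String> ownerByPartition;  // index = partition id
    private final Map<String, Long> localStore;   // state held by this instance
    private final BiFunction<String, String, Long> remoteCall; // (host, key) -> value

    KeyRouter(String self, List<String> ownerByPartition,
              Map<String, Long> localStore,
              BiFunction<String, String, Long> remoteCall) {
        this.self = self;
        this.ownerByPartition = ownerByPartition;
        this.localStore = localStore;
        this.remoteCall = remoteCall;
    }

    // Illustrative key-to-partition mapping; an assumption, not Kafka's hash.
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Looks up which instance holds the partition that the key maps to.
    String ownerOf(String key) {
        return ownerByPartition.get(partitionFor(key, ownerByPartition.size()));
    }

    // Serves the key locally when this instance owns it, else forwards the query.
    Long get(String key) {
        String owner = ownerOf(key);
        return owner.equals(self) ? localStore.get(key) : remoteCall.apply(owner, key);
    }

    public static void main(String[] args) {
        Map<String, Long> local = new HashMap<>();
        local.put("b", 5L);
        List<String> owners = java.util.Arrays.asList("host1:7070", "host2:7070");
        KeyRouter router = new KeyRouter("host1:7070", owners, local,
                (host, key) -> { System.out.println("RPC to " + host); return 42L; });
        System.out.println(router.get("b")); // answered from the local store
        System.out.println(router.get("a")); // forwarded to the owning instance
    }
}
```

A scatter-gather query would call `remoteCall` once per distinct owner and merge the results, which is why the notes conclude that no single query API or transport fits every app.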
