Confluent Workshop Series: Building Streaming Apps with ksqlDB


  1. 1. Workshop Series: Building Streaming Apps with ksqlDB (2021.10.20)
  2. 2. Speakers: Jupil Hwang, Hyunsoo Kim / Time: 14:00 – 17:00
  3. 3. Agenda: 01 02:00 - 02:10 PM | 02 Talk: Kafka, Kafka Streams & ksqlDB 02:10 - 02:30 PM | 03 Lab: 02:30 - 02:45 PM | 04 Lab: 02:45 - 03:00 PM | 05 Lab: Hands on 03:00 - 05:00 PM
  4. 4. Workshop notes • Q&A: if you have a question, please send it through the Q&A panel; the speakers will answer directly after the presentation. • Online survey: please share your feedback on today's workshop; we will use it when preparing future content. The survey link is (1) available in the Zoom chat window and (2) opened automatically in your web browser after the event ends.
  5. 5. Confluent Platform & Cloud
  6. 6. [Diagram: a traditional data-flow architecture. Apps, transactional databases, analytics databases, and a DWH wired together point-to-point through MOM, ETL, and EAI/ESB layers; pub/sub contrasted with point-to-point integration.]
  7. 7. [Diagram: the same architecture strained by new destinations such as NoSQL DBs and Big Data Analytics, raising the question of how many more point-to-point links it can absorb.]
  8. 8. A streaming platform provides a single source of truth about data for every person and every system in the organization. [Diagram: the apps, transactional and analytics databases, DWH, NoSQL DBs, and Big Data Analytics all connected through one Streaming Platform.]
  9. 9. Apache Kafka: created at LinkedIn, used by 80%+ of the Fortune 100, and commercially backed by Confluent • Decouples producers from consumers
  10. 10. [Diagram: an Event Streaming Platform connecting data sources (core systems for loans, credit cards, patient and lending data; data stores; logs; device logs; 3rd-party apps; SaaS apps; Amazon S3) to data-in-motion applications such as real-time inventory, real-time fraud detection, real-time customer 360, machine learning models, and real-time data transformation, via custom apps/microservices and a data-in-motion pipeline.]
  11. 11. A streaming platform is built around events: a sale, a shipment, a trade, a customer experience … and more
  12. 12. Event Stream Processing
  13. 13. What's stream processing good for? • Materialized cache / view • Streaming ETL pipeline (source to sink) • Event-driven microservices
  14. 14. Confluent Platform Conceptual Architecture [Diagram: OSS Apache Kafka® at the core for messaging and data integration/ETL; source and sink connectors between data sources and data sinks; POJO/microservices and Streams apps as clients; plus ksqlDB and Schema Registry.]
  15. 15. Confluent Platform Conceptual Architecture [Diagram: a Confluent Platform Kafka cluster with Connect, Replicator, ksqlDB, and REST/MQTT proxies, surrounded by enterprise security, Schema Registry, Control Center, machine learning, microservices, mobile devices, car/IoT sensors, and data sources/sinks.]
  16. 16. Confluent: Hall of Innovation, CTO Innovation Award Winner 2019, Enterprise Technology Innovation Awards • Vision: Kafka, event streaming • Category leadership: Kafka commits 80%, 1 Kafka, 5000 Kafka • Value: risk, TCO, time-to-market • Product: Kafka software, cloud-native service
  17. 17. Confluent Enterprise: Apache Kafka everywhere (cloud, on-prem, hybrid, or multi-cloud), data integration via Connect, and stream processing applications via KStreams and ksqlDB
  18. 18. Confluent 18 Open Source | Community licensed Fully Managed Cloud Service Self-managed Software Training Partners Enterprise Support Professional Services ARCHITECT OPERATOR DEVELOPER EXECUTIVE Confluent Platform Self-Balancing Clusters | Tiered Storage DevOps Operator | Ansible GUI- Control Center | Proactive Support ksqlDB Pre-built Connectors | Hub | Schema Registry Non-Java Clients | REST Proxy Admin REST APIs Multi-Region Clusters | Replicator Cluster Linking Schema Registry | Schema Validation RBAC | Secrets | Audit Logs TCO / ROI Revenue / Cost / Risk Impact Complete Engagement Model
  19. 19. What is Apache Kafka? A distributed commit log • Publish/subscribe to streams of records • Append-only writes; reads are a single seek and scan • Supports transactions [Diagram: producer apps append records 1-8 to the log in a Kafka cluster; consumer apps read them.]
  20. 20. What are Kafka Connect and Kafka Streams? • Kafka Streams API: a Java library for stream processing (KStreams / KTables), built on the Producer/Consumer APIs • Kafka Connect API: connects Kafka to external systems [Diagram: Orders and Customers topics flowing through stream processing.]
  21. 21. Multi-language development, 200+ pre-built connectors (from Confluent), and event stream processing with ksqlDB / KStreams
  22. 22. Stream Processing by Analogy: $ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt [Diagram: Connect API as the input, Stream Processing as the transformation, Connect API as the output, all running against a Kafka cluster.]
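  The same analogy can be written as a ksqlDB pipeline; this is a sketch, and the stream names (`lines`, `out_lines`), column name (`line`), and topic names are illustrative, not from the deck:
    -- Register the input topic as a stream, then filter and upper-case it,
    -- mirroring: cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
    CREATE STREAM lines (line VARCHAR)
      WITH (KAFKA_TOPIC='in', VALUE_FORMAT='JSON');
    CREATE STREAM out_lines AS
      SELECT UCASE(line) AS line
      FROM lines
      WHERE line LIKE '%ksql%';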
  23. 23. Three ways to process streams with Confluent: Kafka Clients, Kafka Streams, and ksqlDB. The same computation in each:
  Kafka Clients:
    ConsumerRecords<String, Integer> records = consumer.poll(100);
    Map<String, Integer> counts = new HashMap<>();
    for (ConsumerRecord<String, Integer> record : records) {
      String key = record.key();
      int c = counts.getOrDefault(key, 0);
      c += record.value();
      counts.put(key, c);
    }
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      int attempts = 0;
      while (attempts++ < MAX_RETRIES) {
        try {
          int stateCount = stateStore.getValue(entry.getKey());
          stateStore.setValue(entry.getKey(), entry.getValue() + stateCount);
          break;
        } catch (StateStoreException e) {
          RetryUtils.backoff(attempts);
        }
      }
    }
  Kafka Streams:
    builder
      .stream("input-stream", Consumed.with(Serdes.String(), Serdes.String()))
      .groupBy((key, value) -> value)
      .count()
      .toStream()
      .to("counts", Produced.with(Serdes.String(), Serdes.Long()));
  ksqlDB:
    SELECT x, count(*) FROM stream GROUP BY x EMIT CHANGES;
  24. 24. Kafka Clients: subscribe(), poll(), send(), flush(), beginTransaction(), … · Kafka Streams: KStream, KTable, filter(), map(), flatMap(), join(), aggregate(), transform(), … · ksqlDB: CREATE STREAM, CREATE TABLE, SELECT, JOIN, GROUP BY, SUM, … plus KSQL UDFs for stream processing
  25. 25. [Diagram: a typical streaming stack assembled from 3-5 separate pieces: source DBs with CDC connectors (1), stream processing (2), sink connectors and DBs (3), and consuming apps (4).]
  26. 26. ksqlDB consolidates those pieces: connectors, stream processing, and state stores live in one system, serving apps with both push and pull queries. [Diagram: a source DB feeds ksqlDB via a connector; apps issue PULL and PUSH queries against the stream-processing and state-store layers.]
  27. 27. SQL for the full lifecycle:
  Capture data (Connector):
    CREATE SOURCE CONNECTOR jdbcConnector WITH (
      'connector.class' = '...JdbcSourceConnector',
      'connection.url' = '...', …);
  Perform continuous transformations (Stream):
    CREATE STREAM purchases AS
      SELECT viewtime, userid, pageid,
             TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd')
      FROM pageviews;
  Create materialized views (Table):
    CREATE TABLE orders_by_country AS
      SELECT country, COUNT(*) AS order_count, SUM(order_total) AS order_total
      FROM purchases WINDOW TUMBLING (SIZE 5 MINUTES)
      LEFT JOIN user_profiles ON purchases.customer_id = user_profiles.customer_id
      GROUP BY country
      EMIT CHANGES;
  Serve lookups against materialized views (Query):
    SELECT * FROM orders_by_country WHERE country='usa';
  28. 28. Filters: route messages to a separate topic in real time [Diagram: topic 'Blue and Red Widgets' (3 partitions) filtered by stream processing into topic 'Blue Widgets Only' (3 partitions).]
  29. 29. Filters
    CREATE STREAM high_readings AS
      SELECT sensor, reading
      FROM readings
      WHERE reading > 41
      EMIT CHANGES;
  30. 30. Joins: easily merge and join topics with one another [Diagram: topics 'Blue and Red Widgets' and 'Green and Yellow Widgets' joined by stream processing into topic 'Blue and Yellow Widgets'.]
  31. 31. Joins
    CREATE STREAM enriched_readings AS
      SELECT reading, area, brand_name
      FROM readings
      INNER JOIN brands b ON b.sensor = readings.sensor
      EMIT CHANGES;
  32. 32. Aggregate: aggregate streams into tables and capture summary statistics [Diagram: topic 'Blue and Red Widgets' aggregated by stream processing into table 'Widget Count': Blue = 15, Red = 9.]
  33. 33. Aggregate
    CREATE TABLE avg_readings AS
      SELECT sensor, AVG(reading) AS avg_reading
      FROM readings
      GROUP BY sensor
      EMIT CHANGES;
  34. 34. Workshop
  35. 35. How the training works • You will be working in Zoom and a browser (instructions, the ksqlDB console, and Confluent Control Center). • If you have a question, you can post it via the Zoom chat. • Don't worry if you get stuck: use the "Raise hand" button in Zoom and a Confluent engineer will help you. • Avoid jumping ahead by copy-pasting: most people learn better when they actually type the code into the console, and you can learn from your mistakes.
  36. 36. [slide content not captured in transcript]
  37. 37. Use Case: customer ratings arrive as a stream of JSON events, e.g. received 9/12/19 12:55:05 GMT with key 5313:
    { "rating_id": 5313, "user_id": 3, "stars": 1, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "why is it so difficult to keep the bathrooms clean?" }
  38. 38. Use Case, Approach 1: move the reviews into a data warehouse. At the end of each month, process them and forward them to the departments that received a significant number of comments. This approach tells you what has already happened.
  39. 39. Use Case, Approach 2: process the reviews in real time and give the airport management team a dashboard. The dashboard sorts reviews by topic, so problems related to cleanliness can be flagged quickly. This approach tells you what is happening now.
  40. 40. Use Case, Approach 3: process the reviews in real time. Set up an alert on 3 bad reviews about bathroom cleanliness within the last 10 minutes, and automatically call the cleaning staff to deal with the problem. This approach does something based on what is happening, as sketched below.
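  A sketch of what Approach 3 could look like in ksqlDB, assuming the `ratings` stream carries the fields shown on the earlier slide; the table name, the `stars <= 2` threshold, and the message filter are illustrative:
    -- Count low-star ratings that mention bathrooms, per route,
    -- over 10-minute windows; keep only routes with 3 or more.
    CREATE TABLE bathroom_alerts AS
      SELECT route_id, COUNT(*) AS bad_reviews
      FROM ratings
      WINDOW TUMBLING (SIZE 10 MINUTES)
      WHERE stars <= 2 AND LCASE(message) LIKE '%bathroom%'
      GROUP BY route_id
      HAVING COUNT(*) >= 3
      EMIT CHANGES;
  A downstream service could subscribe to the `bathroom_alerts` changelog and page the cleaning staff.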
  41. 41. Hands on 3. 3.2.1
  42. 42. Cluster Architectural Overview [Diagram: a Datagen source connector and a MySQL CDC connector in Kafka Connect feed Kafka from a microservice website and MySQL; ksqlDB transforms, enriches, and queries the data.]
  43. 43. Ways to reach ksqlDB: the Confluent Control Center ksqlDB editor & data flow view, the ksqlDB CLI, and the ksqlDB RESTful API, all talking to ksqlDB nodes that sit in front of the Kafka brokers
  44. 44. ksqlDB console
  45. 45. ksqlDB console
    > show topics;
    > show streams;
    > print 'ratings';
  46. 46. Hands on 4. ksqlDB 4.2.2
  47. 47. Discussion: tables vs streams
    > describe extended customers;
    > select * from customers emit changes;
    > select * from customers_flat emit changes;
  48. 48. Stream <-> Table duality http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
  49. 49. Streams and Tables. Events in the Kafka topic:
    { "event_ts": "2020-02-17T15:22:00Z", "person": "robin", "location": "Leeds" }
    { "event_ts": "2020-02-17T17:23:00Z", "person": "robin", "location": "London" }
    { "event_ts": "2020-02-17T22:23:00Z", "person": "robin", "location": "Wakefield" }
    { "event_ts": "2020-02-18T09:00:00Z", "person": "robin", "location": "Leeds" }
  As a ksqlDB Stream (append-only series of events: topic + schema):
    EVENT_TS            | PERSON | LOCATION
    2020-02-17 15:22:00 | robin  | Leeds
    2020-02-17 17:23:00 | robin  | London
    2020-02-17 22:23:00 | robin  | Wakefield
    2020-02-18 09:00:00 | robin  | Leeds
  As a ksqlDB Table (state for a given key: topic + schema), the single row for 'robin' is successively Leeds, London, Wakefield, Leeds.
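  In ksqlDB terms, the stream view is just the topic plus a declared schema; a minimal sketch, assuming the events above live in a topic named `movements` (the topic and stream names are illustrative):
    -- Give the raw JSON events a schema so they can be queried as a stream.
    CREATE STREAM movements (person VARCHAR KEY,
                             event_ts VARCHAR,
                             location VARCHAR)
      WITH (KAFKA_TOPIC='movements', VALUE_FORMAT='JSON');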
  50. 50. • Streams = INSERT only: immutable, append-only • Tables = INSERT, UPDATE, DELETE: mutable, the row key (event.key) identifies which row
  51. 51. The key to mutability is … the event.key!
    Has unique key constraint?                        Stream: No      Table: Yes
    First event with key 'alice' arrives              Stream: INSERT  Table: INSERT
    Another event with key 'alice' arrives            Stream: INSERT  Table: UPDATE
    Event with key 'alice' and value == null arrives  Stream: INSERT  Table: DELETE
    Event with key == null arrives                    Stream: INSERT  Table: <ignored>
  RDBMS analogy: a Stream is ~ a Table that has no unique key and is append-only.
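  The same semantics, sketched as ksqlDB statements; all names here are illustrative, and the stream and table deliberately read the same hypothetical topic:
    -- Two views over one topic.
    CREATE STREAM profile_events (id VARCHAR KEY, name VARCHAR)
      WITH (KAFKA_TOPIC='profiles', PARTITIONS=1, VALUE_FORMAT='JSON');
    CREATE TABLE profile_state (id VARCHAR PRIMARY KEY, name VARCHAR)
      WITH (KAFKA_TOPIC='profiles', VALUE_FORMAT='JSON');
    -- Two events with the same key...
    INSERT INTO profile_events (id, name) VALUES ('alice', 'Alice');
    INSERT INTO profile_events (id, name) VALUES ('alice', 'Alice M.');
    -- ...show up as two rows in the stream (INSERT, INSERT):
    SELECT * FROM profile_events EMIT CHANGES;
    -- ...but as one row in the table (INSERT, then UPDATE):
    SELECT * FROM profile_state EMIT CHANGES;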
  52. 52. Creating a table from a stream or topic
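  A hedged sketch of the pattern, reusing the `movements` stream declared above: LATEST_BY_OFFSET materializes the most recent value per key (the table name is illustrative):
    -- Turn the append-only stream into a table of current state per person.
    CREATE TABLE current_location AS
      SELECT person,
             LATEST_BY_OFFSET(location) AS location
      FROM movements
      GROUP BY person
      EMIT CHANGES;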
  53. 53. Aggregating a stream (COUNT example)
  54. 54. Aggregating a stream (COUNT example)
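  A sketch of the COUNT aggregation, again on the `movements` stream (the table and column names are illustrative):
    -- Count how many location events have arrived per person.
    CREATE TABLE movement_counts AS
      SELECT person, COUNT(*) AS event_count
      FROM movements
      GROUP BY person
      EMIT CHANGES;
  Each new event updates the count for its key, which is exactly the stream-to-table direction of the duality above.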
  55. 55. KSQL for Data Exploration: an easy way to inspect your data in Kafka
    SHOW TOPICS;
    SELECT page, user_id, status, bytes FROM clickstream WHERE user_agent LIKE 'Mozilla/5.0%';
    PRINT 'my-topic' FROM BEGINNING;
  56. 56. KSQL for Data Transformation: quickly make derivations of existing data in Kafka
    CREATE STREAM clicks_by_user_id
      WITH (PARTITIONS=6, TIMESTAMP='view_time', VALUE_FORMAT='JSON') AS
      SELECT * FROM clickstream PARTITION BY user_id;
  (1) Change the number of partitions (2) Convert the data to JSON (3) Repartition the data
  57. 57. Hands on 4.3 4.4 Query 8
  58. 58. Recap • Data in Kafka can be queried directly • Formats can be converted on the fly • Data streams can be joined • Event streams can be queried with both push and pull queries • Try it yourself!
  59. 59. KSQL for Real-Time, Streaming ETL: filter, cleanse, and process data while it is in motion
    CREATE STREAM clicks_from_vip_users AS
      SELECT user_id, u.country, page, action
      FROM clickstream c
      LEFT JOIN users u ON c.user_id = u.user_id
      WHERE u.level = 'Platinum';
  (1) Pick only VIP users
  60. 60. CDC: only 'after' state. The JSON shows data arriving from MySQL via Debezium CDC. Notice that there is no "BEFORE" data (it is null): the record was just created, with no update yet, for example when a new customer is first added.
  61. 61. CDC: before and after. Now there is "BEFORE" data, because an update has been made to the customer record.
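  With the Debezium envelope declared as ksqlDB STRUCT columns, the before/after images can be unpacked with the `->` operator; a sketch, assuming a `customers_cdc` stream whose `before` and `after` structs carry `id` and `email` fields (all names illustrative):
    -- A NULL old_email means the record is a fresh insert (no BEFORE image).
    SELECT after->id AS customer_id,
           before->email AS old_email,
           after->email AS new_email
    FROM customers_cdc
    EMIT CHANGES;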
  62. 62. KSQL for Anomaly Detection: aggregate data to identify patterns and anomalies in real-time
    CREATE TABLE possible_fraud AS
      SELECT card_number, COUNT(*)
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 30 SECONDS)
      GROUP BY card_number
      HAVING COUNT(*) > 3;
  (1) Aggregate data (2) … per 30-sec windows
  63. 63. KSQL for Real-Time Monitoring: derive insights from events (IoT, sensors, etc.) and turn them into actions
    CREATE TABLE failing_vehicles AS
      SELECT vehicle, COUNT(*)
      FROM vehicle_monitoring_stream
      WINDOW TUMBLING (SIZE 1 MINUTE)
      WHERE event_type = 'ERROR'
      GROUP BY vehicle
      HAVING COUNT(*) >= 5;
  (1) Now we know to alert, and whom
  64. 64. Confluent Control Center
  65. 65. C3 - Connector
  66. 66. ksqlDB - Cloud UI (1/2)
  67. 67. ksqlDB - Cloud UI (2/2)
  68. 68. Monitoring ksqlDB applications: Data flow (1/2)
  69. 69. Monitoring ksqlDB applications: Data flow (2/2)
  70. 70. ksqlDB Internals
  71. 71. Partitions play a central role in Kafka: topics are partitioned, and partitions enable scalability, elasticity, and fault-tolerance. In the storage layer (brokers), data is stored, replicated, and ordered based on partitions; in the processing layer (ksqlDB, KStreams, etc.), data is read, written, joined, and processed based on partitions.
  72. 72. Topics vs. Streams and Tables: in the storage layer (brokers), a topic holds raw bytes; in the processing layer (KSQL, KStreams), a stream adds a schema via serdes (alice: Paris, bob: Sydney, alice: Rome) and a table adds aggregation (alice: 2, bob: 1).
  73. 73. Kafka Processing: data is processed per-partition. [Diagram: topic 'payments' with partitions P1-P4, read via the network by two application instances in consumer group 'my-app'.]
  74. 74. Kafka Processing: data is processed per-partition. [Diagram: the same topic feeding stream tasks 1-4 spread across two application instances, each task keeping its own state.]
  75. 75. Streams and Tables are partitioned, too. [Diagram: partitions P1-P4 map to stream tasks 1-4 across two application instances; the KTable/TABLE state held per task is 2 GB, 3 GB, 5 GB, and 2 GB.]
  76. 76. Kafka Streams Architecture
  77. 77. Advanced Features
  78. 78. Windowing: windowed queries let ksqlDB apply logic to bounded slices of time (e.g. "the last 10 minutes"). Three window types:
    Tumbling: WINDOW TUMBLING (SIZE 5 MINUTES) GROUP BY key
    Hopping: WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE) GROUP BY key
    Session: WINDOW SESSION (60 SECONDS) GROUP BY key
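  Putting one of these windows into a complete query, reusing the `readings` stream from the earlier Filters slide (the table name is illustrative):
    -- Per-sensor event counts over 5-minute tumbling windows.
    CREATE TABLE readings_per_window AS
      SELECT sensor, COUNT(*) AS reading_count
      FROM readings
      WINDOW TUMBLING (SIZE 5 MINUTES)
      GROUP BY sensor
      EMIT CHANGES;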
  79. 79. UDFs and machine learning: ksqlDB provides many built-in functions that simplify stream processing, for example: • GEODISTANCE: measures the distance between two lat/long coordinates • MASK: converts a string into a masked or obfuscated version • JSON_ARRAY_CONTAINS: checks whether an array contains a search value. You can extend the functionality available in ksqlDB by developing user-defined functions; a common use case is implementing machine-learning algorithms through ksqlDB so that those models can contribute to real-time data transformations.
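  The built-ins drop straight into ordinary queries; a small sketch using MASK, where the source stream `readings_with_owner` and its `owner_email` column are hypothetical:
    -- Obfuscate a sensitive field before exposing the stream downstream.
    CREATE STREAM readings_masked AS
      SELECT sensor, reading, MASK(owner_email) AS owner_email
      FROM readings_with_owner
      EMIT CHANGES;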
  80. 80. What can you do with ksqlDB? Streaming ETL · Anomaly detection · Real-time monitoring and analytics · Sensor data and IoT · Customer 360-view. https://docs.ksqldb.io/en/latest/#what-can-i-do-with-ksqldb
  81. 81. Example: Streaming ETL pipeline 82 * Full example here • Apache Kafka is a popular choice for powering data pipelines • ksqlDB makes it simple to transform data within the pipeline, preparing the messages for consumption by another system.
  82. 82. Example: Anomaly detection • Identify patterns and spot anomalies in real-time data with millisecond latency, enabling you to properly surface out-of-the-ordinary events and to handle fraudulent activities separately. * Full example here
  83. 83. Any questions?
  84. 84. one more …
  85. 85. Developer https://developer.confluent.io
  86. 86. Tutorials: step-by-step guides for Apache Kafka® and ksqlDB. https://kafka-tutorials.confluent.io/
  87. 87. Free eBooks Kafka: The Definitive Guide Neha Narkhede, Gwen Shapira, Todd Palino Making Sense of Stream Processing Martin Kleppmann I ❤ Logs Jay Kreps Designing Event-Driven Systems Ben Stopford http://cnfl.io/book-bundle
  88. 88. Confluent resources: Confluent Blog cnfl.io/blog · Confluent Cloud cnfl.io/confluent-cloud · Community cnfl.io/meetups
  89. 89. [slide content not captured in transcript]
  90. 90. Max processing parallelism = # of input partitions. [Diagram: a topic with partitions P1-P4 feeds application instances 1-4; instances 5 and 6 sit idle.] → Need higher parallelism? Increase the original topic's partition count. → Higher parallelism for just one use case? Derive a new topic from the original with a higher partition count, and lower its retention to save storage.
  91. 91. How to increase the number of partitions when needed. KSQL example: the statement below creates a new stream with the desired number of partitions.
    CREATE STREAM products_repartitioned
      WITH (PARTITIONS=30) AS
      SELECT * FROM products;
  92. 92. 'Hot' partitions are a problem, often caused by (1) events not being evenly distributed across partitions, or (2) events being evenly distributed but certain events taking longer to process. Strategies to address hot partitions include: 1a. Ingress: find a better partitioning function ƒ(event.key) for producers; 1b. Storage: re-partition the data into a new topic if you can't change the original; 2. Scale processing vertically, e.g. more powerful CPU instances.
  93. 93. Joining Streams and Tables: data must be 'co-partitioned'. [Diagram: Stream + Table feed a Join, producing an output Stream.]
  94. 94. Joining Streams and Tables: data must be 'co-partitioned'. [Diagram: (alice, Paris) from the stream's P2 finds a matching entry for alice (female) in the table's P2.]
  95. 95. Joining Streams and Tables: data is looked up in the same partition number. Scenario 2: key 'alice' exists in multiple table partitions, but the entry in P2 (female) is used because the stream-side event comes from the stream's partition P2.
  96. 96. Joining Streams and Tables: data is looked up in the same partition number. Scenario 3: key 'alice' exists only in the table's P1 != P2, so there is no match (null).
  97. 97. Data co-partitioning requirements in detail: 1. Same keying scheme for both input sides 2. Same number of partitions 3. Same partitioning function ƒ(event.key). Further reading on joining streams and tables: https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins and https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html. A repartition-then-join sketch follows.
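  A sketch of the repartition-then-join pattern, reusing the clickstream/users shapes from the earlier ETL slide; the derived stream names are illustrative:
    -- 1. Re-key the stream so it is co-partitioned with the users table.
    CREATE STREAM clicks_by_user AS
      SELECT * FROM clickstream
      PARTITION BY user_id;
    -- 2. The stream-table join can now match records partition-by-partition.
    CREATE STREAM enriched_clicks AS
      SELECT c.user_id, u.country, c.page
      FROM clicks_by_user c
      LEFT JOIN users u ON c.user_id = u.user_id;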
  98. 98. Why is that so? Because of how input data is mapped to stream tasks. [Diagram: the stream topic's partitions P1-P3 and the table topic's partitions P1-P3; Stream Task 2 reads the stream's P2 and the table's P2 via the network and keeps local state.]
  99. 99. How to re-partition your data when needed. KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as the event.key, so that its data is correctly co-partitioned for joining.
    CREATE STREAM products_repartitioned
      WITH (PARTITIONS=42) AS
      SELECT * FROM products
      PARTITION BY product_id;
