Confluent Workshop Series: Building Streaming Apps with ksqlDB


  1. 1. Workshop Series: Building Streaming Apps with ksqlDB (2021.10.20)
  2. 2. Speakers: Jupil Hwang, Hyunsoo Kim / Time: 14:00 – 17:00
  3. 3. Agenda: 01 02:00 - 02:10 PM | 02 Talk: Kafka, Kafka Streams & ksqlDB 02:10 - 02:30 PM | 03 Lab: 02:30 - 02:45 PM | 04 Lab: 02:45 - 03:00 PM | 05 Lab: Hands on 03:00 - 05:00 PM
  4. 4. Workshop notes • Q&A: if you have a question, please send it through the Q&A panel; the speakers will answer directly after the presentation. • Online survey: please share your feedback on today's workshop; we will use it when preparing future content. The survey link is (1) available in the Zoom chat window and (2) opened automatically in your web browser after the event ends.
  5. 5. Confluent Platform & Cloud
  6. 6. [Diagram: a traditional data-flow architecture. Apps, transactional databases, analytics databases, and a DWH wired together point-to-point through MOM, ETL, and EAI/ESB layers; pub/sub contrasted with point-to-point integration.]
  7. 7. [Diagram: the same architecture strained by new destinations such as NoSQL DBs and Big Data Analytics, raising the question of how many more point-to-point links it can absorb.]
  8. 8. A streaming platform provides a single source of truth about data for every person and every system in the organization. [Diagram: the apps, transactional and analytics databases, DWH, NoSQL DBs, and Big Data Analytics all connected through one Streaming Platform.]
  9. 9. Apache Kafka: created at LinkedIn, used by 80%+ of the Fortune 100, and commercially backed by Confluent • Decouples producers from consumers
  10. 10. [Diagram: an Event Streaming Platform connecting data sources (core systems for loans, credit cards, patient and lending data; data stores; logs; device logs; 3rd-party apps; SaaS apps; Amazon S3) to data-in-motion applications such as real-time inventory, real-time fraud detection, real-time customer 360, machine learning models, and real-time data transformation, via custom apps/microservices and a data-in-motion pipeline.]
  11. 11. A streaming platform is built around events: a sale, a shipment, a trade, a customer experience … and more
  12. 12. Event Stream Processing
  13. 13. What's stream processing good for? • Materialized cache / view • Streaming ETL pipeline (source to sink) • Event-driven microservices
  14. 14. Confluent Platform Conceptual Architecture [Diagram: OSS Apache Kafka® at the core for messaging and data integration/ETL; source and sink connectors between data sources and data sinks; POJO/microservices and Streams apps as clients; plus ksqlDB and Schema Registry.]
  15. 15. Confluent Platform Conceptual Architecture [Diagram: a Confluent Platform Kafka cluster with Connect, Replicator, ksqlDB, and REST/MQTT proxies, surrounded by enterprise security, Schema Registry, Control Center, machine learning, microservices, mobile devices, car/IoT sensors, and data sources/sinks.]
  16. 16. Confluent: Hall of Innovation, CTO Innovation Award Winner 2019, Enterprise Technology Innovation Awards • Vision: Kafka, event streaming • Category leadership: Kafka commits 80%, 1 Kafka, 5000 Kafka • Value: risk, TCO, time-to-market • Product: Kafka software, cloud-native service
  17. 17. Confluent Enterprise: Apache Kafka everywhere (cloud, on-prem, hybrid, or multi-cloud), data integration via Connect, and stream processing applications via KStreams and ksqlDB
  18. 18. Confluent 18 Open Source | Community licensed Fully Managed Cloud Service Self-managed Software Training Partners Enterprise Support Professional Services ARCHITECT OPERATOR DEVELOPER EXECUTIVE Confluent Platform Self-Balancing Clusters | Tiered Storage DevOps Operator | Ansible GUI- Control Center | Proactive Support ksqlDB Pre-built Connectors | Hub | Schema Registry Non-Java Clients | REST Proxy Admin REST APIs Multi-Region Clusters | Replicator Cluster Linking Schema Registry | Schema Validation RBAC | Secrets | Audit Logs TCO / ROI Revenue / Cost / Risk Impact Complete Engagement Model
  19. 19. What is Apache Kafka? A distributed commit log • Publish/subscribe to streams of records • Append-only writes; reads are a single seek and scan • Supports transactions [Diagram: producer apps append records 1-8 to the log in a Kafka cluster; consumer apps read them.]
  20. 20. What are Kafka Connect and Kafka Streams? • Kafka Streams API: a Java library for stream processing (KStreams / KTables), built on the Producer/Consumer APIs • Kafka Connect API: connects Kafka to external systems [Diagram: Orders and Customers topics flowing through stream processing.]
  21. 21. Multi-language development, 200+ pre-built connectors (from Confluent), and event stream processing with ksqlDB / KStreams
  22. 22. Stream Processing by Analogy: $ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt [Diagram: Connect API as the input, Stream Processing as the transformation, Connect API as the output, all running against a Kafka cluster.]
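  The same analogy can be written as a ksqlDB pipeline; this is a sketch, and the stream names (`lines`, `out_lines`), column name (`line`), and topic names are illustrative, not from the deck:
    -- Register the input topic as a stream, then filter and upper-case it,
    -- mirroring: cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
    CREATE STREAM lines (line VARCHAR)
      WITH (KAFKA_TOPIC='in', VALUE_FORMAT='JSON');
    CREATE STREAM out_lines AS
      SELECT UCASE(line) AS line
      FROM lines
      WHERE line LIKE '%ksql%';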
  23. 23. Three ways to process streams with Confluent: Kafka Clients, Kafka Streams, and ksqlDB. The same computation in each:
  Kafka Clients:
    ConsumerRecords<String, Integer> records = consumer.poll(100);
    Map<String, Integer> counts = new HashMap<>();
    for (ConsumerRecord<String, Integer> record : records) {
      String key = record.key();
      int c = counts.getOrDefault(key, 0);
      c += record.value();
      counts.put(key, c);
    }
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      int attempts = 0;
      while (attempts++ < MAX_RETRIES) {
        try {
          int stateCount = stateStore.getValue(entry.getKey());
          stateStore.setValue(entry.getKey(), entry.getValue() + stateCount);
          break;
        } catch (StateStoreException e) {
          RetryUtils.backoff(attempts);
        }
      }
    }
  Kafka Streams:
    builder
      .stream("input-stream", Consumed.with(Serdes.String(), Serdes.String()))
      .groupBy((key, value) -> value)
      .count()
      .toStream()
      .to("counts", Produced.with(Serdes.String(), Serdes.Long()));
  ksqlDB:
    SELECT x, count(*) FROM stream GROUP BY x EMIT CHANGES;
  24. 24. Kafka Clients: subscribe(), poll(), send(), flush(), beginTransaction(), … · Kafka Streams: KStream, KTable, filter(), map(), flatMap(), join(), aggregate(), transform(), … · ksqlDB: CREATE STREAM, CREATE TABLE, SELECT, JOIN, GROUP BY, SUM, … plus KSQL UDFs for stream processing
  25. 25. [Diagram: a typical streaming stack assembled from 3-5 separate pieces: source DBs with CDC connectors (1), stream processing (2), sink connectors and DBs (3), and consuming apps (4).]
  26. 26. ksqlDB consolidates those pieces: connectors, stream processing, and state stores live in one system, serving apps with both push and pull queries. [Diagram: a source DB feeds ksqlDB via a connector; apps issue PULL and PUSH queries against the stream-processing and state-store layers.]
  27. 27. SQL for the full lifecycle:
  Capture data (Connector):
    CREATE SOURCE CONNECTOR jdbcConnector WITH (
      'connector.class' = '...JdbcSourceConnector',
      'connection.url' = '...', …);
  Perform continuous transformations (Stream):
    CREATE STREAM purchases AS
      SELECT viewtime, userid, pageid,
             TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd')
      FROM pageviews;
  Create materialized views (Table):
    CREATE TABLE orders_by_country AS
      SELECT country, COUNT(*) AS order_count, SUM(order_total) AS order_total
      FROM purchases WINDOW TUMBLING (SIZE 5 MINUTES)
      LEFT JOIN user_profiles ON purchases.customer_id = user_profiles.customer_id
      GROUP BY country
      EMIT CHANGES;
  Serve lookups against materialized views (Query):
    SELECT * FROM orders_by_country WHERE country='usa';
  28. 28. Filters: route messages to a separate topic in real time [Diagram: topic 'Blue and Red Widgets' (3 partitions) filtered by stream processing into topic 'Blue Widgets Only' (3 partitions).]
  29. 29. Filters
    CREATE STREAM high_readings AS
      SELECT sensor, reading
      FROM readings
      WHERE reading > 41
      EMIT CHANGES;
  30. 30. Joins: easily merge and join topics with one another [Diagram: topics 'Blue and Red Widgets' and 'Green and Yellow Widgets' joined by stream processing into topic 'Blue and Yellow Widgets'.]
  31. 31. Joins
    CREATE STREAM enriched_readings AS
      SELECT reading, area, brand_name
      FROM readings
      INNER JOIN brands b ON b.sensor = readings.sensor
      EMIT CHANGES;
  32. 32. Aggregate: aggregate streams into tables and capture summary statistics [Diagram: topic 'Blue and Red Widgets' aggregated by stream processing into table 'Widget Count': Blue = 15, Red = 9.]
  33. 33. Aggregate
    CREATE TABLE avg_readings AS
      SELECT sensor, AVG(reading) AS avg_reading
      FROM readings
      GROUP BY sensor
      EMIT CHANGES;
  34. 34. Workshop
  35. 35. How the training works • You will be working in Zoom and a browser (instructions, the ksqlDB console, and Confluent Control Center). • If you have a question, you can post it via the Zoom chat. • Don't worry if you get stuck: use the "Raise hand" button in Zoom and a Confluent engineer will help you. • Avoid jumping ahead by copy-pasting: most people learn better when they actually type the code into the console, and you can learn from your mistakes.
  36. 36. [slide content not captured in transcript]
  37. 37. Use Case: customer ratings arrive as a stream of JSON events, e.g. received 9/12/19 12:55:05 GMT with key 5313:
    { "rating_id": 5313, "user_id": 3, "stars": 1, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "why is it so difficult to keep the bathrooms clean?" }
  38. 38. Use Case, Approach 1: move the reviews into a data warehouse. At the end of each month, process them and forward them to the departments that received a significant number of comments. This approach tells you what has already happened.
  39. 39. Use Case, Approach 2: process the reviews in real time and give the airport management team a dashboard. The dashboard sorts reviews by topic, so problems related to cleanliness can be flagged quickly. This approach tells you what is happening now.
  40. 40. Use Case, Approach 3: process the reviews in real time. Set up an alert on 3 bad reviews about bathroom cleanliness within the last 10 minutes, and automatically call the cleaning staff to deal with the problem. This approach does something based on what is happening, as sketched below.
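  A sketch of what Approach 3 could look like in ksqlDB, assuming the `ratings` stream carries the fields shown on the earlier slide; the table name, the `stars <= 2` threshold, and the message filter are illustrative:
    -- Count low-star ratings that mention bathrooms, per route,
    -- over 10-minute windows; keep only routes with 3 or more.
    CREATE TABLE bathroom_alerts AS
      SELECT route_id, COUNT(*) AS bad_reviews
      FROM ratings
      WINDOW TUMBLING (SIZE 10 MINUTES)
      WHERE stars <= 2 AND LCASE(message) LIKE '%bathroom%'
      GROUP BY route_id
      HAVING COUNT(*) >= 3
      EMIT CHANGES;
  A downstream service could subscribe to the `bathroom_alerts` changelog and page the cleaning staff.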
  41. 41. Hands on 3. 3.2.1
  42. 42. Cluster Architectural Overview [Diagram: a Datagen source connector and a MySQL CDC connector in Kafka Connect feed Kafka from a microservice website and MySQL; ksqlDB transforms, enriches, and queries the data.]
  43. 43. Ways to reach ksqlDB: the Confluent Control Center ksqlDB editor & data flow view, the ksqlDB CLI, and the ksqlDB RESTful API, all talking to ksqlDB nodes that sit in front of the Kafka brokers
  44. 44. ksqlDB console
  45. 45. ksqlDB console
    > show topics;
    > show streams;
    > print 'ratings';
  46. 46. Hands on 4. ksqlDB 4.2.2
  47. 47. Discussion: tables vs streams
    > describe extended customers;
    > select * from customers emit changes;
    > select * from customers_flat emit changes;
  48. 48. Stream <-> Table duality http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
  49. 49. Streams and Tables. Events in the Kafka topic:
    { "event_ts": "2020-02-17T15:22:00Z", "person": "robin", "location": "Leeds" }
    { "event_ts": "2020-02-17T17:23:00Z", "person": "robin", "location": "London" }
    { "event_ts": "2020-02-17T22:23:00Z", "person": "robin", "location": "Wakefield" }
    { "event_ts": "2020-02-18T09:00:00Z", "person": "robin", "location": "Leeds" }
  As a ksqlDB Stream (append-only series of events: topic + schema):
    EVENT_TS            | PERSON | LOCATION
    2020-02-17 15:22:00 | robin  | Leeds
    2020-02-17 17:23:00 | robin  | London
    2020-02-17 22:23:00 | robin  | Wakefield
    2020-02-18 09:00:00 | robin  | Leeds
  As a ksqlDB Table (state for a given key: topic + schema), the single row for 'robin' is successively Leeds, London, Wakefield, Leeds.
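  In ksqlDB terms, the stream view is just the topic plus a declared schema; a minimal sketch, assuming the events above live in a topic named `movements` (the topic and stream names are illustrative):
    -- Give the raw JSON events a schema so they can be queried as a stream.
    CREATE STREAM movements (person VARCHAR KEY,
                             event_ts VARCHAR,
                             location VARCHAR)
      WITH (KAFKA_TOPIC='movements', VALUE_FORMAT='JSON');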
  50. 50. • Streams = INSERT only: immutable, append-only • Tables = INSERT, UPDATE, DELETE: mutable, the row key (event.key) identifies which row
  51. 51. The key to mutability is … the event.key!
    Has unique key constraint?                        Stream: No      Table: Yes
    First event with key 'alice' arrives              Stream: INSERT  Table: INSERT
    Another event with key 'alice' arrives            Stream: INSERT  Table: UPDATE
    Event with key 'alice' and value == null arrives  Stream: INSERT  Table: DELETE
    Event with key == null arrives                    Stream: INSERT  Table: <ignored>
  RDBMS analogy: a Stream is ~ a Table that has no unique key and is append-only.
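  The same semantics, sketched as ksqlDB statements; all names here are illustrative, and the stream and table deliberately read the same hypothetical topic:
    -- Two views over one topic.
    CREATE STREAM profile_events (id VARCHAR KEY, name VARCHAR)
      WITH (KAFKA_TOPIC='profiles', PARTITIONS=1, VALUE_FORMAT='JSON');
    CREATE TABLE profile_state (id VARCHAR PRIMARY KEY, name VARCHAR)
      WITH (KAFKA_TOPIC='profiles', VALUE_FORMAT='JSON');
    -- Two events with the same key...
    INSERT INTO profile_events (id, name) VALUES ('alice', 'Alice');
    INSERT INTO profile_events (id, name) VALUES ('alice', 'Alice M.');
    -- ...show up as two rows in the stream (INSERT, INSERT):
    SELECT * FROM profile_events EMIT CHANGES;
    -- ...but as one row in the table (INSERT, then UPDATE):
    SELECT * FROM profile_state EMIT CHANGES;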
  52. 52. Creating a table from a stream or topic
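  A hedged sketch of the pattern, reusing the `movements` stream declared above: LATEST_BY_OFFSET materializes the most recent value per key (the table name is illustrative):
    -- Turn the append-only stream into a table of current state per person.
    CREATE TABLE current_location AS
      SELECT person,
             LATEST_BY_OFFSET(location) AS location
      FROM movements
      GROUP BY person
      EMIT CHANGES;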
  53. 53. Aggregating a stream (COUNT example)
  54. 54. Aggregating a stream (COUNT example)
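  A sketch of the COUNT aggregation, again on the `movements` stream (the table and column names are illustrative):
    -- Count how many location events have arrived per person.
    CREATE TABLE movement_counts AS
      SELECT person, COUNT(*) AS event_count
      FROM movements
      GROUP BY person
      EMIT CHANGES;
  Each new event updates the count for its key, which is exactly the stream-to-table direction of the duality above.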
  55. 55. KSQL for Data Exploration: an easy way to inspect your data in Kafka
    SHOW TOPICS;
    SELECT page, user_id, status, bytes FROM clickstream WHERE user_agent LIKE 'Mozilla/5.0%';
    PRINT 'my-topic' FROM BEGINNING;
  56. 56. KSQL for Data Transformation: quickly make derivations of existing data in Kafka
    CREATE STREAM clicks_by_user_id
      WITH (PARTITIONS=6, TIMESTAMP='view_time', VALUE_FORMAT='JSON') AS
      SELECT * FROM clickstream PARTITION BY user_id;
  (1) Change the number of partitions (2) Convert the data to JSON (3) Repartition the data
  57. 57. Hands on 4.3 4.4 Query 8
  58. 58. Recap • Data in Kafka can be queried directly • Formats can be converted on the fly • Data streams can be joined • Event streams can be queried with both push and pull queries • Try it yourself!
  59. 59. KSQL for Real-Time, Streaming ETL: filter, cleanse, and process data while it is in motion
    CREATE STREAM clicks_from_vip_users AS
      SELECT user_id, u.country, page, action
      FROM clickstream c
      LEFT JOIN users u ON c.user_id = u.user_id
      WHERE u.level = 'Platinum';
  (1) Pick only VIP users
  60. 60. CDC: only 'after' state. The JSON shows data arriving from MySQL via Debezium CDC. Notice that there is no "BEFORE" data (it is null): the record was just created, with no update yet, for example when a new customer is first added.
  61. 61. CDC: before and after. Now there is "BEFORE" data, because an update has been made to the customer record.
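  With the Debezium envelope declared as ksqlDB STRUCT columns, the before/after images can be unpacked with the `->` operator; a sketch, assuming a `customers_cdc` stream whose `before` and `after` structs carry `id` and `email` fields (all names illustrative):
    -- A NULL old_email means the record is a fresh insert (no BEFORE image).
    SELECT after->id AS customer_id,
           before->email AS old_email,
           after->email AS new_email
    FROM customers_cdc
    EMIT CHANGES;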
  62. 62. KSQL for Anomaly Detection: aggregate data to identify patterns and anomalies in real-time
    CREATE TABLE possible_fraud AS
      SELECT card_number, COUNT(*)
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 30 SECONDS)
      GROUP BY card_number
      HAVING COUNT(*) > 3;
  (1) Aggregate data (2) … per 30-sec windows
  63. 63. KSQL for Real-Time Monitoring: derive insights from events (IoT, sensors, etc.) and turn them into actions
    CREATE TABLE failing_vehicles AS
      SELECT vehicle, COUNT(*)
      FROM vehicle_monitoring_stream
      WINDOW TUMBLING (SIZE 1 MINUTE)
      WHERE event_type = 'ERROR'
      GROUP BY vehicle
      HAVING COUNT(*) >= 5;
  (1) Now we know to alert, and whom
  64. 64. Confluent Control Center
  65. 65. C3 - Connector
  66. 66. ksqlDB - Cloud UI (1/2)
  67. 67. ksqlDB - Cloud UI (2/2)
  68. 68. Monitoring ksqlDB applications: Data flow (1/2)
  69. 69. Monitoring ksqlDB applications: Data flow (2/2)
  70. 70. ksqlDB Internals
  71. 71. Partitions play a central role in Kafka: topics are partitioned, and partitions enable scalability, elasticity, and fault-tolerance. In the storage layer (brokers), data is stored, replicated, and ordered based on partitions; in the processing layer (ksqlDB, KStreams, etc.), data is read, written, joined, and processed based on partitions.
  72. 72. Topics vs. Streams and Tables: in the storage layer (brokers), a topic holds raw bytes; in the processing layer (KSQL, KStreams), a stream adds a schema via serdes (alice: Paris, bob: Sydney, alice: Rome) and a table adds aggregation (alice: 2, bob: 1).
  73. 73. Kafka Processing: data is processed per-partition. [Diagram: topic 'payments' with partitions P1-P4, read via the network by two application instances in consumer group 'my-app'.]
  74. 74. Kafka Processing: data is processed per-partition. [Diagram: the same topic feeding stream tasks 1-4 spread across two application instances, each task keeping its own state.]
  75. 75. Streams and Tables are partitioned, too. [Diagram: partitions P1-P4 map to stream tasks 1-4 across two application instances; the KTable/TABLE state held per task is 2 GB, 3 GB, 5 GB, and 2 GB.]
  76. 76. Kafka Streams Architecture
  77. 77. Advanced Features
  78. 78. Windowing: windowed queries let ksqlDB apply logic to bounded slices of time (e.g. "the last 10 minutes"). Three window types:
    Tumbling: WINDOW TUMBLING (SIZE 5 MINUTES) GROUP BY key
    Hopping: WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE) GROUP BY key
    Session: WINDOW SESSION (60 SECONDS) GROUP BY key
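  Putting one of these windows into a complete query, reusing the `readings` stream from the earlier Filters slide (the table name is illustrative):
    -- Per-sensor event counts over 5-minute tumbling windows.
    CREATE TABLE readings_per_window AS
      SELECT sensor, COUNT(*) AS reading_count
      FROM readings
      WINDOW TUMBLING (SIZE 5 MINUTES)
      GROUP BY sensor
      EMIT CHANGES;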
  79. 79. UDFs and machine learning: ksqlDB provides many built-in functions that simplify stream processing, for example: • GEODISTANCE: measures the distance between two lat/long coordinates • MASK: converts a string into a masked or obfuscated version • JSON_ARRAY_CONTAINS: checks whether an array contains a search value. You can extend the functionality available in ksqlDB by developing user-defined functions; a common use case is implementing machine-learning algorithms through ksqlDB so that those models can contribute to real-time data transformations.
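  The built-ins drop straight into ordinary queries; a small sketch using MASK, where the source stream `readings_with_owner` and its `owner_email` column are hypothetical:
    -- Obfuscate a sensitive field before exposing the stream downstream.
    CREATE STREAM readings_masked AS
      SELECT sensor, reading, MASK(owner_email) AS owner_email
      FROM readings_with_owner
      EMIT CHANGES;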
  80. 80. What can you do with ksqlDB? Streaming ETL · Anomaly detection · Real-time monitoring and analytics · Sensor data and IoT · Customer 360-view. https://docs.ksqldb.io/en/latest/#what-can-i-do-with-ksqldb
  81. 81. Example: Streaming ETL pipeline 82 * Full example here • Apache Kafka is a popular choice for powering data pipelines • ksqlDB makes it simple to transform data within the pipeline, preparing the messages for consumption by another system.
  82. 82. Example: Anomaly detection • Identify patterns and spot anomalies in real-time data with millisecond latency, enabling you to properly surface out-of-the-ordinary events and to handle fraudulent activities separately. * Full example here
  83. 83. Any questions?
  84. 84. one more …
  85. 85. Developer https://developer.confluent.io
  86. 86. Tutorials: step-by-step guides for Apache Kafka® and ksqlDB. https://kafka-tutorials.confluent.io/
  87. 87. Free eBooks Kafka: The Definitive Guide Neha Narkhede, Gwen Shapira, Todd Palino Making Sense of Stream Processing Martin Kleppmann I ❤ Logs Jay Kreps Designing Event-Driven Systems Ben Stopford http://cnfl.io/book-bundle
  88. 88. Confluent resources: Confluent Blog cnfl.io/blog · Confluent Cloud cnfl.io/confluent-cloud · Community cnfl.io/meetups
  89. 89. [slide content not captured in transcript]
  90. 90. Max processing parallelism = # of input partitions. [Diagram: a topic with partitions P1-P4 feeds application instances 1-4; instances 5 and 6 sit idle.] → Need higher parallelism? Increase the original topic's partition count. → Higher parallelism for just one use case? Derive a new topic from the original with a higher partition count, and lower its retention to save storage.
  91. 91. How to increase the number of partitions when needed. KSQL example: the statement below creates a new stream with the desired number of partitions.
    CREATE STREAM products_repartitioned
      WITH (PARTITIONS=30) AS
      SELECT * FROM products;
  92. 92. 'Hot' partitions are a problem, often caused by (1) events not being evenly distributed across partitions, or (2) events being evenly distributed but certain events taking longer to process. Strategies to address hot partitions include: 1a. Ingress: find a better partitioning function ƒ(event.key) for producers; 1b. Storage: re-partition the data into a new topic if you can't change the original; 2. Scale processing vertically, e.g. more powerful CPU instances.
  93. 93. Joining Streams and Tables: data must be 'co-partitioned'. [Diagram: Stream + Table feed a Join, producing an output Stream.]
  94. 94. Joining Streams and Tables: data must be 'co-partitioned'. [Diagram: (alice, Paris) from the stream's P2 finds a matching entry for alice (female) in the table's P2.]
  95. 95. Joining Streams and Tables: data is looked up in the same partition number. Scenario 2: key 'alice' exists in multiple table partitions, but the entry in P2 (female) is used because the stream-side event comes from the stream's partition P2.
  96. 96. Joining Streams and Tables: data is looked up in the same partition number. Scenario 3: key 'alice' exists only in the table's P1 != P2, so there is no match (null).
  97. 97. Data co-partitioning requirements in detail: 1. Same keying scheme for both input sides 2. Same number of partitions 3. Same partitioning function ƒ(event.key). Further reading on joining streams and tables: https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins and https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html. A repartition-then-join sketch follows.
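  A sketch of the repartition-then-join pattern, reusing the clickstream/users shapes from the earlier ETL slide; the derived stream names are illustrative:
    -- 1. Re-key the stream so it is co-partitioned with the users table.
    CREATE STREAM clicks_by_user AS
      SELECT * FROM clickstream
      PARTITION BY user_id;
    -- 2. The stream-table join can now match records partition-by-partition.
    CREATE STREAM enriched_clicks AS
      SELECT c.user_id, u.country, c.page
      FROM clicks_by_user c
      LEFT JOIN users u ON c.user_id = u.user_id;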
  98. 98. Why is that so? Because of how input data is mapped to stream tasks. [Diagram: the stream topic's partitions P1-P3 and the table topic's partitions P1-P3; Stream Task 2 reads the stream's P2 and the table's P2 via the network and keeps local state.]
  99. 99. How to re-partition your data when needed. KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as the event.key, so that its data is correctly co-partitioned for joining.
    CREATE STREAM products_repartitioned
      WITH (PARTITIONS=42) AS
      SELECT * FROM products
      PARTITION BY product_id;
