The document discusses streams and tables as two sides of the same coin. It proposes a dual streaming model that handles out-of-order data by continuously updating stateful operators and emitting changelog streams. This allows processing data with low latency while avoiding buffering and reordering. The model is implemented in Apache Kafka Streams and adopted in industry through tools like KSQL.
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)
1. 1
Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax12, Guozhang Wang1, Matthias Weidlich2, Johann-Christoph Freytag2
1Confluent Inc., Palo Alto (CA)
matthias@confluent.io
guozhang@confluent.io
2Humboldt-Universität zu Berlin
mjsax@informatik.hu-berlin.de
matthias.weidlich@hu-berlin.de
freytag@informatik.hu-berlin.de
@MatthiasJSax
Twelfth International Workshop on Real-Time Business Intelligence and Analytics
27 August, 2018, Rio de Janeiro
2. 2
Count Clicks per Page
BIRTE
VLDB
VLDB
Distributed Data Source
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)VLDB(4)VLDB(7)
Input Data Stream
3. 3
Count Clicks per Page
BIRTE
VLDB
VLDB
Distributed Data Source
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3) VLDB(4)VLDB(7)
Input Data Stream
Arrival order non-deterministic
Even-time semantics implies out-of-order data
4. 4
Ordering: Common Approaches
Input Data Stream
BIRTE(3) VLDB(4)VLDB(7)
BIRTE(3)VLDB(4)VLDB(7)
buffer and re-order
SPS
SPS
punctuations/watermarks
BIRTE(3) VLDB(4)VLDB(7)
time=3
6. 6
Problem Statement
How to design a model
• for the evaluation of expressive operators
• with low latency over potentially unordered data streams
• that can be implemented by mean of distributed online algorithms?
7. 7
High-Level Proposal
• To reduce latency, we need to avoid any processing delays
• Process data in arrival order
• Emit current result immediately
• Law et al.5: cannot handle out-of-order data
• To handle out-of-order data, we need to be able to update/refine previous results
• Data streams must allow for update records
• Update/delete records by Babu and Widom6: no operator semantics defined
• Borealis:7 replays data stream after “updating/reordering”; very high cost
8. 8
Data Model
• Offset: physical order (arrival/processing order)
• Timestamp: logical order (event-time)
• Key: optional for grouping
• Value: payload
9. 9
Stream Processing Operators
• Stateless, order agnostic
• filter, projection, flatMap
• No special handling necessary
• Stateful, order sensitive
• aggregation, joins, windowing
• Need to handle out-of-order data
10. 10
Data Stream Aggregation
• Model output of (windowed) aggregations as table
• State is not internal but first-class citizen
• Update stateful operator continuously
• Emit changelog stream to downstream operators
• Streams, Table, and Changelogs
• Define operator semantics over changelogs and updating tables
• Temporal operator semantics
21. 21
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
CREATE TABLE click_count_per_url AS
SELECT url, count(*)
FROM click_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE url LIKE = '%confluent%' OR url LIKE ‘%hu-berlin%’
GROUP BY url;
22. 22
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
• Widely adopted in industry
23. 23
Summary
• Suggest the Dual-Streaming-Model
• Handles out-of-order data within the processing model
• Optimized for low latency
• Streams and Tables are Dual
• Allows to trade-off processing cost, latency, completeness
• Adopted in industry via Kafka Streams and KSQL
25. 25
References
[1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries
over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19.
[2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for
Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412.
[3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams.
Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322.
[4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the
2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092.
[5] Yan-Nei Law, HaixunWang, and Carlo Zaniolo. 2004. Query Languages and Data Models for
Database Sequences and Data Streams. Proc. of the 13th Int. Conf. on Very Large Data Bases. 492-503.
[6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Records
30, 3 (2001), 109–120.
[7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial
Conf. on Innovative Data Systems Research. 277–289.
30. 30
Stream-Stream Join
• Sliding Window Join, i.e., band join
• Window size specifies additional timestamp based join predicate
SELECT * FROM stream1, stream2
WHERE
stream1.key = stream2.key
AND
stream1.ts – windowSize <= stream2.ts
AND stream2.ts <= stream1.ts + windowSize
windowSize
stream1
stream2
31. 31
Stream-Table Join
• Temporal table “lookup” join
• For each stream record, lookup for a matching table record
• Join condition: streamRecord.key == tableRecord.key
• The join is temporal is the sense, that the “correct” table version must be use
• i.e., youngest table version that is before the stream records timestamp