SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
1
Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax12, Guozhang Wang1, Matthias Weidlich2, Johann-Christoph Freytag2
1Confluent Inc., Palo Alto (CA)
matthias@confluent.io
guozhang@confluent.io
2Humboldt-Universität zu Berlin
mjsax@informatik.hu-berlin.de
matthias.weidlich@hu-berlin.de
freytag@informatik.hu-berlin.de
@MatthiasJSax
Twelfth International Workshop on Real-Time Business Intelligence and Analytics
27 August, 2018, Rio de Janeiro
2
Count Clicks per Page
BIRTE
VLDB
VLDB
Distributed Data Source
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)VLDB(4)VLDB(7)
Input Data Stream
3
Count Clicks per Page
BIRTE
VLDB
VLDB
Distributed Data Source
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3) VLDB(4)VLDB(7)
Input Data Stream
Arrival order non-deterministic
Even-time semantics implies out-of-order data
4
Ordering: Common Approaches
Input Data Stream
BIRTE(3) VLDB(4)VLDB(7)
BIRTE(3)VLDB(4)VLDB(7)
buffer and re-order
SPS
SPS
punctuations/watermarks
BIRTE(3) VLDB(4)VLDB(7)
time=3
5
Cost
Correctness/
Completeness
Latency
Buffering and Reordering
- Ref: CQL1, Trill2
Punctuations/Watermarks
- Ref: Li et al.3, Krishnamurthy et al.4
Design Space
6
Problem Statement
How to design a model
• for the evaluation of expressive operators
• with low latency over potentially unordered data streams
• that can be implemented by mean of distributed online algorithms?
7
High-Level Proposal
• To reduce latency, we need to avoid any processing delays
• Process data in arrival order
• Emit current result immediately
• Law et al.5: cannot handle out-of-order data
• To handle out-of-order data, we need to be able to update/refine previous results
• Data streams must allow for update records
• Update/delete records by Babu and Widom6: no operator semantics defined
• Borealis:7 replays data stream after “updating/reordering”; very high cost
8
Data Model
• Offset: physical order (arrival/processing order)
• Timestamp: logical order (event-time)
• Key: optional for grouping
• Value: payload
9
Stream Processing Operators
• Stateless, order agnostic
• filter, projection, flatMap
• No special handling necessary
• Stateful, order sensitive
• aggregation, joins, windowing
• Need to handle out-of-order data
10
Data Stream Aggregation
• Model output of (windowed) aggregations as table
• State is not internal but first-class citizen
• Update stateful operator continuously
• Emit changelog stream to downstream operators
• Streams, Table, and Changelogs
• Define operator semantics over changelogs and updating tables
• Temporal operator semantics
11
Example: Count Clicks per Page
url count
record stream changelog stream
BIRTE
1
<BIRTE,1>
BIRTE 1
countTable = stream.groupBy(r->url).count()
12
Example: Count Clicks per Page
url count
record stream changelog stream
1
<BIRTE,1>
BIRTE 1
VLDB
1VLDB
<VLDB,1>
countTable = stream.groupBy(r->url).count()
VLDB
13
Example: Count Clicks per Page
url count
record stream changelog stream
1
<BIRTE,1>
BIRTE 1
VLDB
2VLDB
<VLDB,2>
countTable = stream.groupBy(r->url).count()
VLDB
<VLDB,1>
14
countTable2 = countTable.filter(url=‘VLDB’).toTable()
Example: Processing a Changelog Stream
url count
changelog stream
1
<BIRTE,1>
BIRTE
VLDB
<VLDB,1>
2
<VLDB,2>
url count
2VLDB
15
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartTimestamp = recordTimestamp / windowSize
Example: Windowed Count
window ID count
record stream changelog stream
BIRTE(3)
1
<<BIRTE,0>,1>
<BIRTE,0>
VLDB(7)
1<VLDB,0>
<<VLDB,0>,1>VLDB(4) <VLDB,5> <<VLDB,5>,1>1
16
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartTimestamp = recordTimestamp / windowSize
Example: Out-of-Order Data
window ID count
record stream changelog stream
BIRTE(1)
2<BIRTE,0>
1<VLDB,0>
<VLDB,5> 1 <<BIRTE,0>,2>
17
Duality of Streams and Tables
18
Cost
Correctness/
Completeness
Latency
Buffering and Reordering
- Ref: CQL1, Trill2
Punctuations/Watermarks
- Ref: Li et al.3, Krishnamurthy et al.4
Design Space
Dual Streaming Model
- continuous updates / changelogs
- decouple latency from correctness
- trade-off latency and cost
- trade-off cost and completeness
(retention time)
19
Stream-Table Transformations
See the paper for details…
20
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("TextLinesTopic");
KTable<String, Long> wordCounts = textLines
.flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("W+")))
.groupBy((key, word) -> word)
.windowedBy(TimeWindows.of(5_000L))
.count();
wordCounts.toStream().to("WordsWithCountsTopic");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
21
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
CREATE TABLE click_count_per_url AS
SELECT url, count(*)
FROM click_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE url LIKE = '%confluent%' OR url LIKE ‘%hu-berlin%’
GROUP BY url;
22
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
• Widely adopted in industry
23
Summary
• Suggest the Dual-Streaming-Model
• Handles out-of-order data within the processing model
• Optimized for low latency
• Streams and Tables are Dual
• Allows to trade-off processing cost, latency, completeness
• Adopted in industry via Kafka Streams and KSQL
24
Thank You
We are hiring!
25
References
[1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries
over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19.
[2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for
Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412.
[3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams.
Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322.
[4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the
2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092.
[5] Yan-Nei Law, HaixunWang, and Carlo Zaniolo. 2004. Query Languages and Data Models for
Database Sequences and Data Streams. Proc. of the 13th Int. Conf. on Very Large Data Bases. 492-503.
[6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Records
30, 3 (2001), 109–120.
[7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial
Conf. on Innovative Data Systems Research. 277–289.
26
Data Streams Types
27
Evolving Table
window ID count
1<BIRTE,0>
1<VLDB,0>
window ID count
1<BIRTE,0>
window ID count
1<BIRTE,0>
1<VLDB,0>
<VLDB,5> 1
record stream
BIRTE(3)VLDB(7) VLDB(4)BIRTE(1)
table v3 table v4 table v7
28
Evolving Table
window ID count
<BIRTE,0>
1<VLDB,0>
window ID count
<BIRTE,0>
window ID count
<BIRTE,0>
1<VLDB,0>
<VLDB,5> 1
record stream
BIRTE(3)VLDB(7) VLDB(4)BIRTE(1)
table v3 table v4 table v7
window ID count
1<BIRTE,0>
table v1
2 2 2
29
Table-Table Join
30
Stream-Stream Join
• Sliding Window Join, i.e., band join
• Window size specifies additional timestamp based join predicate
SELECT * FROM stream1, stream2
WHERE
stream1.key = stream2.key
AND
stream1.ts – windowSize <= stream2.ts
AND stream2.ts <= stream1.ts + windowSize
windowSize
stream1
stream2
31
Stream-Table Join
• Temporal table “lookup” join
• For each stream record, lookup for a matching table record
• Join condition: streamRecord.key == tableRecord.key
• The join is temporal is the sense, that the “correct” table version must be use
• i.e., youngest table version that is before the stream records timestamp
32
Stream-Table Join

Weitere ähnliche Inhalte

Was ist angesagt?

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLVenu Anuganti
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedis Labs
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Oracle Database Performance Tuning Concept
Oracle Database Performance Tuning ConceptOracle Database Performance Tuning Concept
Oracle Database Performance Tuning ConceptChien Chung Shen
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluHostedbyConfluent
 

Was ist angesagt? (20)

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Oracle Database Performance Tuning Concept
Oracle Database Performance Tuning ConceptOracle Database Performance Tuning Concept
Oracle Database Performance Tuning Concept
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
 

Ähnlich wie Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Big Data Spain
 
Internet of Things in Tbilisi
Internet of Things in TbilisiInternet of Things in Tbilisi
Internet of Things in TbilisiAlexey Bokov
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paperrestassure
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
SQL Server 2008 R2 StreamInsight
SQL Server 2008 R2 StreamInsightSQL Server 2008 R2 StreamInsight
SQL Server 2008 R2 StreamInsightEduardo Castro
 
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Mia Yuan Cao
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Neo4j
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015StampedeCon
 
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...Adel Sabour
 
eBay EDW元数据管理及应用
eBay EDW元数据管理及应用eBay EDW元数据管理及应用
eBay EDW元数据管理及应用mysqlops
 
BioIT Europe 2010 - BioCatalogue
BioIT Europe 2010 - BioCatalogueBioIT Europe 2010 - BioCatalogue
BioIT Europe 2010 - BioCatalogueBioCatalogue
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseGiorgio Orsi
 
DisrupTech - Robin Bloor (2)
DisrupTech - Robin Bloor (2)DisrupTech - Robin Bloor (2)
DisrupTech - Robin Bloor (2)Inside Analysis
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Building IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureBuilding IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureIdo Flatow
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016Mark Smith
 

Ähnlich wie Streams and Tables: Two Sides of the Same Coin (BIRTE 2018) (20)

Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
 
Internet of Things in Tbilisi
Internet of Things in TbilisiInternet of Things in Tbilisi
Internet of Things in Tbilisi
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
SQL Server 2008 R2 StreamInsight
SQL Server 2008 R2 StreamInsightSQL Server 2008 R2 StreamInsight
SQL Server 2008 R2 StreamInsight
 
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
 
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
 
eBay EDW元数据管理及应用
eBay EDW元数据管理及应用eBay EDW元数据管理及应用
eBay EDW元数据管理及应用
 
BioIT Europe 2010 - BioCatalogue
BioIT Europe 2010 - BioCatalogueBioIT Europe 2010 - BioCatalogue
BioIT Europe 2010 - BioCatalogue
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash Course
 
DisrupTech - Robin Bloor (2)
DisrupTech - Robin Bloor (2)DisrupTech - Robin Bloor (2)
DisrupTech - Robin Bloor (2)
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Z04506138145
Z04506138145Z04506138145
Z04506138145
 
Building IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureBuilding IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on Azure
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
 

Mehr von confluent

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flinkconfluent
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluentconfluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkconfluent
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloudconfluent
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluentconfluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernizationconfluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataconfluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesisconfluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023confluent
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streamsconfluent
 

Mehr von confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 

Kürzlich hochgeladen

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Kürzlich hochgeladen (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

  • 1. 1 Streams and Tables: Two Sides of the Same Coin Matthias J. Sax12, Guozhang Wang1, Matthias Weidlich2, Johann-Christoph Freytag2 1Confluent Inc., Palo Alto (CA) matthias@confluent.io guozhang@confluent.io 2Humboldt-Universität zu Berlin mjsax@informatik.hu-berlin.de matthias.weidlich@hu-berlin.de freytag@informatik.hu-berlin.de @MatthiasJSax Twelfth International Workshop on Real-Time Business Intelligence and Analytics 27 August, 2018, Rio de Janeiro
  • 2. 2 Count Clicks per Page BIRTE VLDB VLDB Distributed Data Source BIRTE(3) VLDB(4) VLDB(7) BIRTE(3) VLDB(4) VLDB(7) BIRTE(3)VLDB(4)VLDB(7) Input Data Stream
  • 3. 3 Count Clicks per Page BIRTE VLDB VLDB Distributed Data Source BIRTE(3) VLDB(4) VLDB(7) BIRTE(3) VLDB(4) VLDB(7) BIRTE(3) VLDB(4)VLDB(7) Input Data Stream Arrival order non-deterministic Even-time semantics implies out-of-order data
  • 4. 4 Ordering: Common Approaches Input Data Stream BIRTE(3) VLDB(4)VLDB(7) BIRTE(3)VLDB(4)VLDB(7) buffer and re-order SPS SPS punctuations/watermarks BIRTE(3) VLDB(4)VLDB(7) time=3
  • 5. 5 Cost Correctness/ Completeness Latency Buffering and Reordering - Ref: CQL1, Trill2 Punctuations/Watermarks - Ref: Li et al.3, Krishnamurthy et al.4 Design Space
  • 6. 6 Problem Statement How to design a model • for the evaluation of expressive operators • with low latency over potentially unordered data streams • that can be implemented by mean of distributed online algorithms?
  • 7. 7 High-Level Proposal • To reduce latency, we need to avoid any processing delays • Process data in arrival order • Emit current result immediately • Law et al.5: cannot handle out-of-order data • To handle out-of-order data, we need to be able to update/refine previous results • Data streams must allow for update records • Update/delete records by Babu and Widom6: no operator semantics defined • Borealis:7 replays data stream after “updating/reordering”; very high cost
  • 8. 8 Data Model • Offset: physical order (arrival/processing order) • Timestamp: logical order (event-time) • Key: optional for grouping • Value: payload
  • 9. 9 Stream Processing Operators • Stateless, order agnostic • filter, projection, flatMap • No special handling necessary • Stateful, order sensitive • aggregation, joins, windowing • Need to handle out-of-order data
  • 10. 10 Data Stream Aggregation • Model output of (windowed) aggregations as table • State is not internal but first-class citizen • Update stateful operator continuously • Emit changelog stream to downstream operators • Streams, Table, and Changelogs • Define operator semantics over changelogs and updating tables • Temporal operator semantics
  • 11. 11 Example: Count Clicks per Page url count record stream changelog stream BIRTE 1 <BIRTE,1> BIRTE 1 countTable = stream.groupBy(r->url).count()
  • 12. 12 Example: Count Clicks per Page url count record stream changelog stream 1 <BIRTE,1> BIRTE 1 VLDB 1VLDB <VLDB,1> countTable = stream.groupBy(r->url).count() VLDB
  • 13. 13 Example: Count Clicks per Page url count record stream changelog stream 1 <BIRTE,1> BIRTE 1 VLDB 2VLDB <VLDB,2> countTable = stream.groupBy(r->url).count() VLDB <VLDB,1>
  • 14. 14 countTable2 = countTable.filter(url=‘VLDB’).toTable() Example: Processing a Changelog Stream url count changelog stream 1 <BIRTE,1> BIRTE VLDB <VLDB,1> 2 <VLDB,2> url count 2VLDB
  • 15. 15 countTable = stream.groupBy(r->url).windowedBy(5sec).count() windowID = <groupingKey,windowStartTimestamp> windowStartTimestamp = recordTimestamp / windowSize Example: Windowed Count window ID count record stream changelog stream BIRTE(3) 1 <<BIRTE,0>,1> <BIRTE,0> VLDB(7) 1<VLDB,0> <<VLDB,0>,1>VLDB(4) <VLDB,5> <<VLDB,5>,1>1
  • 16. 16 countTable = stream.groupBy(r->url).windowedBy(5sec).count() windowID = <groupingKey,windowStartTimestamp> windowStartTimestamp = recordTimestamp / windowSize Example: Out-of-Order Data window ID count record stream changelog stream BIRTE(1) 2<BIRTE,0> 1<VLDB,0> <VLDB,5> 1 <<BIRTE,0>,2>
  • 17. 17 Duality of Streams and Tables
  • 18. 18 Cost Correctness/ Completeness Latency Buffering and Reordering - Ref: CQL1, Trill2 Punctuations/Watermarks - Ref: Li et al.3, Krishnamurthy et al.4 Design Space Dual Streaming Model - continuous updates / changelogs - decouple latency from correctness - trade-off latency and cost - trade-off cost and completeness (retention time)
  • 20. 20 Implementation • Implemented in Apache Kafka (v0.10) • Kafka Streams / Streams API StreamsBuilder builder = new StreamsBuilder(); KStream<String, String> textLines = builder.stream("TextLinesTopic"); KTable<String, Long> wordCounts = textLines .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("W+"))) .groupBy((key, word) -> word) .windowedBy(TimeWindows.of(5_000L)) .count(); wordCounts.toStream().to("WordsWithCountsTopic"); KafkaStreams streams = new KafkaStreams(builder.build(), props); streams.start();
  • 21. 21 Implementation • Implemented in Apache Kafka (v0.10) • Kafka Streams / Streams API • Leveraged in Confluent’s KSQL CREATE TABLE click_count_per_url AS SELECT url, count(*) FROM click_stream WINDOW TUMBLING (SIZE 1 MINUTE) WHERE url LIKE = '%confluent%' OR url LIKE ‘%hu-berlin%’ GROUP BY url;
  • 22. 22 Implementation • Implemented in Apache Kafka (v0.10) • Kafka Streams / Streams API • Leveraged in Confluent’s KSQL • Widely adopted in industry
  • 23. 23 Summary • Suggest the Dual-Streaming-Model • Handles out-of-order data within the processing model • Optimized for low latency • Streams and Tables are Dual • Allows to trade-off processing cost, latency, completeness • Adopted in industry via Kafka Streams and KSQL
  • 25. 25 References [1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19. [2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412. [3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322. [4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092. [5] Yan-Nei Law, HaixunWang, and Carlo Zaniolo. 2004. Query Languages and Data Models for Database Sequences and Data Streams. Proc. of the 13th Int. Conf. on Very Large Data Bases. 492-503. [6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Records 30, 3 (2001), 109–120. [7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial Conf. on Innovative Data Systems Research. 277–289.
  • 27. 27 Evolving Table window ID count 1<BIRTE,0> 1<VLDB,0> window ID count 1<BIRTE,0> window ID count 1<BIRTE,0> 1<VLDB,0> <VLDB,5> 1 record stream BIRTE(3)VLDB(7) VLDB(4)BIRTE(1) table v3 table v4 table v7
  • 28. 28 Evolving Table window ID count <BIRTE,0> 1<VLDB,0> window ID count <BIRTE,0> window ID count <BIRTE,0> 1<VLDB,0> <VLDB,5> 1 record stream BIRTE(3)VLDB(7) VLDB(4)BIRTE(1) table v3 table v4 table v7 window ID count 1<BIRTE,0> table v1 2 2 2
  • 30. 30 Stream-Stream Join • Sliding Window Join, i.e., band join • Window size specifies additional timestamp based join predicate SELECT * FROM stream1, stream2 WHERE stream1.key = stream2.key AND stream1.ts – windowSize <= stream2.ts AND stream2.ts <= stream1.ts + windowSize windowSize stream1 stream2
  • 31. 31 Stream-Table Join • Temporal table “lookup” join • For each stream record, lookup for a matching table record • Join condition: streamRecord.key == tableRecord.key • The join is temporal is the sense, that the “correct” table version must be use • i.e., youngest table version that is before the stream records timestamp