This slide set was presented at my PhD defense in March 2019.
The thesis can be found at: https://depositonce.tu-berlin.de/bitstream/11303/10519/4/traub_jonas.pdf
Abstract:
The Internet of Things (IoT) consists of billions of devices which form a cloud of network connected sensor nodes. These sensor nodes supply a vast number of data streams with massive amounts of sensor data. Real-time sensor data enables diverse applications including traffic-aware navigation, machine monitoring, and home automation. Current stream processing pipelines are demand-oblivious, which means that they gather, transmit, and process as much data as possible. In contrast, a demand-based processing pipeline uses requirement specifications of data consumers, such as failure tolerances and latency limitations, to save resources. In this thesis, we present an end-to-end architecture for demand-based data stream gathering, processing, and transmission. Our solution unifies the way applications express their data demands, i.e., their requirements with respect to their input streams. This unification allows for multiplexing the data demands of all concurrently running applications. On sensor nodes, we schedule sensor reads based on the data demands of all applications, which saves up to 87% in sensor reads and data transfers in our experiments with real-world sensor data. Our demand-based control layer optimizes the data acquisition from thousands of sensors. We introduce time coherence as a fundamental data characteristic, which is the delay between the first and the last sensor read that contribute values to a tuple. A large scale parameter exploration shows that our solution scales to large numbers of sensors and operates reliably under varying latency and coherence constraints. On stream analysis systems, we tackle the problem of efficient window aggregation. We contribute a general aggregation technique, which adapts to four key workload characteristics: Stream (dis)order, aggregation types, window types, and window measures. 
Our experiments show that our solution outperforms alternative solutions by an order of magnitude in throughput, which prevents expensive system scale-out. We further derive data demands from visualization needs of applications and make these data demands available to streaming systems such as Apache Flink. This enables streaming systems to preprocess data with respect to changing visualization needs. Experiments show that our solution reliably prevents overloads when data rates increase.
Demand-based Data Stream Gathering, Processing, and Transmission - PhD Defense
1. Demand-based Data Stream Gathering, Processing, and Transmission
PhD Defense, Berlin, 22.03.2019
Candidate: Jonas Traub
Committee: Prof. Dr. Manfred Hauswirth, Prof. Dr. Amr El Abbadi, Prof. Dr. Volker Markl, Prof. Dr. Albert Bifet
2. Thesis Goal
Efficiently gather, transmit, and process sensor data
from a large number of sensor nodes, and
make it available to a large number of applications
in real-time.
Efficiency
Scalability
Timeliness
22.03.2019 Jonas Traub - Demand-based Data Stream Gathering, Processing, and Transmission 2
12. Stream Processing Pipelines
A stream processing pipeline is a series of concurrently running operators.
Data Gathering, Data Processing, Data Transmission
24. Demand-oblivious Data Streaming in the Internet of Things
Sensor Nodes: s1, s2, s3, s4, s5, s6, …, sN-1, sN
Stream Analysis System
Front-End Applications
Problems:
• Combined Stream Bandwidth
• Parallel Network Connections
• Expensive Cluster Scale-Out
• Front-End Overload
Definitions:
Data Demand: Minimum number of data points which allows for providing a desired functionality.
Demand-oblivious: Not considering the data demand of data consumers.
Demand-oblivious Stream Processing Pipelines do not suit the IoT.
34. Demand-based Data Streaming in the Internet of Things
Sensor Nodes: s1, s2, s3, s4, s5, s6, …, sN-1, sN
Sensor Control System
Stream Analysis System
Front-End Applications
Definitions:
Data Demand: Minimum number of data points which allows for providing a desired functionality.
Demand-oblivious: Not considering the data demand of data consumers.
Demand-based: Utilize requirement specifications of data consumers to save resources.
Solution:
We enable demand-based optimizations by introducing control interfaces for expressing data demands.
36. Chapter 2: Optimized On-Demand Data Streaming from Sensor Nodes
Scope of Chapter 2: Read Scheduling on Sensor Nodes
38. User-Defined Sampling Functions
Read time tolerance
Desired read time
Penalty function
39. Sensor Read Fusion
Read time tolerance
Desired read time
Penalty function
42. Local Filtering
Read time tolerance
Desired read time
Penalty function
Our scheduler minimizes the number of sensor reads and data transmissions
based on the joint data demand of all data consumers.
43. Read Scheduling Example
Symbol Key: Read time tolerance, Desired read time, Penalty function, Optimal time frame for read sharing, Sensor read time, Sum of penalty functions
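The read-sharing idea behind this example can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler from the thesis: the `ReadRequest` class, the absolute-deviation penalty function, and the candidate search are assumptions made for this sketch. Each request carries a desired read time and a read time tolerance; one shared read is placed where the sum of the penalty functions is smallest within the overlap of all tolerances.

```python
from dataclasses import dataclass


@dataclass
class ReadRequest:
    """One query's demand for a single sensor read (hypothetical structure)."""
    desired: float   # desired read time
    earliest: float  # lower bound of the read time tolerance
    latest: float    # upper bound of the read time tolerance

    def penalty(self, t: float) -> float:
        # Assumed penalty function: distance from the desired read time.
        return abs(t - self.desired)


def shared_read_time(requests):
    """Return one read time serving all requests, or None if tolerances do not overlap."""
    start = max(r.earliest for r in requests)  # overlap of all tolerances
    end = min(r.latest for r in requests)
    if start > end:
        return None
    # The summed penalty is piecewise linear, so its minimum over [start, end]
    # lies at an interval bound or at a desired read time inside the interval.
    candidates = [start, end] + [r.desired for r in requests if start <= r.desired <= end]
    return min(candidates, key=lambda t: sum(r.penalty(t) for r in requests))


requests = [
    ReadRequest(desired=10.0, earliest=8.0, latest=12.0),
    ReadRequest(desired=11.0, earliest=9.0, latest=14.0),
    ReadRequest(desired=13.0, earliest=11.0, latest=15.0),
]
read_time = shared_read_time(requests)  # one read instead of three
```

When the tolerances do not overlap, the sketch returns `None`; a real scheduler would then fall back to more than one read.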
45. Increasing the number of concurrent queries
On-demand scheduling reduces sensor reads and data transfers by up to 87%.
The number of reads and transfers increases sub-linearly with the number of queries.
48. Conclusion of Chapter 2
• We introduced user-defined sampling functions (UDSFs).
• We contributed a multi-query read scheduling algorithm.
• We optimize read times based on given read time tolerances.
• We experimentally and analytically validated our approach.
Full Paper: Optimized On-Demand Data Streaming from Sensor Nodes.
Jonas Traub, Sebastian Breß, Tilmann Rabl, Asterios Katsifodimos, and Volker Markl.
ACM Symposium on Cloud Computing 2017 (SoCC '17)
50. Chapter 3: Scalable Data Acquisition with Guaranteed Time Coherence
Scope of Chapter 3: The Sensor Control Layer
52. Architecture: Central Join Topology
Sensor Nodes: s1, s2, s3, s4, …, sN
Sensor tuple, e.g. (t3,v3) → Central Join → (t,v1,v2, …, vN)
Central Stream Joins do not scale to thousands of input streams.
61. Architecture: Sensing Loop Topology
Sensing loop over s1, s2, s3, s4 (Loop Node): (t) → (t,v1) → (t,v1,v2) → (t,v1,v2,v3) → (t,v1,v2,v3,v4)
Sensing loop over s5, s6, s7, s8: (t) → (t,v5) → (t,v5,v6) → (t,v5,v6,v7) → (t,v5,v6,v7,v8)
We dynamically combine sensing loops and central joins to solve scalability challenges.
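The tuple growth along a sensing loop can be sketched as below. This is only a sketch of the idea as it appears on the slide: the function name `run_sensing_loop` and the modelling of sensors as zero-argument functions are assumptions for illustration, and networking, timing, and failures are ignored.

```python
def run_sensing_loop(t, sensors):
    """Pass a tuple through a chain of sensors; each appends its current value."""
    tuple_so_far = (t,)  # the loop starts with (t)
    for read_sensor in sensors:
        # (t) -> (t, v1) -> (t, v1, v2) -> ...
        tuple_so_far = tuple_so_far + (read_sensor(),)
    return tuple_so_far


# Hypothetical sensors, modelled as zero-argument functions returning a value.
loop_a = [lambda: 21.5, lambda: 22.0, lambda: 20.8, lambda: 21.1]
result = run_sensing_loop(1000, loop_a)
```

Unlike a central join, no single node has to hold one open connection per sensor; each node only talks to its neighbour in the loop.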
62. Architecture: Sensor Node Internals
78. Time Coherence of Data Tuples
Definition of Time Coherence:
A tuple (t, v1, v2, …, vN) is time coherent if all its values v1, v2, …, vN were read exactly at time t.
How can we measure, optimize, and guarantee time coherence?
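As a starting point for measuring, the abstract describes time coherence as the delay between the first and the last sensor read that contribute values to a tuple. A minimal sketch of that measure and of the strict definition above follows; the function names are assumptions for illustration.

```python
def coherence(read_times):
    """Delay between the first and the last sensor read contributing to a tuple."""
    return max(read_times) - min(read_times)


def is_time_coherent(t, read_times):
    """Strict definition: every value was read exactly at time t."""
    return all(read_time == t for read_time in read_times)


reads = [100, 103, 101]  # hypothetical read times of v1, v2, v3
spread = coherence(reads)
strict = is_time_coherent(100, reads)
```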
84. Time Coherence Measures
Coherence Estimate (Ce): Based on Untrusted Clocks. Best Case: Ce = 0
Coherence Guarantee (Cg): Based on Trusted Clocks. Best Case: Cg = round trip time
91. Time Coherence Optimization
Coherence Estimate (Ce): Based on Untrusted Clocks. Best Case: Ce = 0
Coherence Guarantee (Cg): Based on Trusted Clocks. Best Case: Cg = round trip time
If untrusted clocks are correct, data quality will be better if you just trust them.
"I can't rely on untrusted clocks."
If untrusted clocks are wrong, data quality is ensured iff you don't trust them.
Dmax (Desired upper limit for Cg)
92. Example of Coherence Optimization
Symbol Key: Coherence Estimate (Ce), Coherence Guarantee (Cg), Round Trip Time, Dmax (Desired upper limit for Cg)
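One plausible reading of the two measures is sketched below; the thesis may compute them differently. The assumption here is that Ce is the spread of the read times reported by the sensor nodes' own (untrusted) clocks, while Cg is the interval between sending a read request and receiving the completed tuple, both measured on a single trusted clock. All function and parameter names are hypothetical.

```python
def coherence_estimate(sensor_timestamps):
    """Ce: spread of read times as reported by untrusted sensor clocks."""
    return max(sensor_timestamps) - min(sensor_timestamps)


def coherence_guarantee(request_sent, tuple_received):
    """Cg: all reads happened between two readings of one trusted clock."""
    return tuple_received - request_sent


def within_limit(cg, d_max):
    """Check the guarantee against Dmax, the desired upper limit for Cg."""
    return cg <= d_max


ce = coherence_estimate([500, 500, 500])                      # sensors report identical read times
cg = coherence_guarantee(request_sent=1000, tuple_received=1012)
ok = within_limit(cg, d_max=20)
```

Under this reading, Ce can reach 0 when all reported read times agree, whereas Cg can never drop below the round trip time, which matches the best cases stated on the slides.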
93. Large Scale Parameter Exploration
Our optimization works for arbitrary coherence requirements and large numbers of sensor nodes.
96. Conclusion of Chapter 3
• We present an architecture for acquiring values from large numbers of
sensors with guaranteed time coherence and low latency.
• We introduce time coherence as a fundamental data characteristic.
• We optimize the coherence of tuples and provide coherence guarantees.
Full Paper: SENSE: Scalable Data Acquisition from Distributed Sensors with Guaranteed Time Coherence.
Jonas Traub, Julius Hülsmann, Tim Stullich, Sebastian Breß, Tilmann Rabl, and Volker Markl. (Under Submission)
98. Chapter 4: Efficient Window Aggregation with General Stream Slicing
Scope of Chapter 4: Flexible Stream Discretization and Efficient Window Aggregation in Stream Analysis Systems
101. Stream Slicing Example
• The number of slices depends on data consumers. → Demand-based memory requirement.
• We store partial aggregates instead of all tuples. → Small memory footprint.
• We assign each tuple to exactly one slice. → O(1) per-tuple complexity.
• We require just a few computation steps to calculate final aggregates. → Low latency.
• We share partial aggregations among all users and queries. → Efficiency by preventing redundancy.
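The properties listed for stream slicing can be illustrated with a deliberately simple sketch. It assumes an in-order stream, a sum aggregation, fixed-length time slices, and window bounds that align with slice edges; the class `SlicedSum` is hypothetical and far less general than the technique presented in the thesis.

```python
class SlicedSum:
    """Sum aggregation over time slices of fixed length (illustrative only)."""

    def __init__(self, slice_length):
        self.slice_length = slice_length
        self.partials = {}  # slice index -> partial sum; no tuples are stored

    def add(self, timestamp, value):
        # Each tuple updates exactly one slice: O(1) work per tuple.
        index = timestamp // self.slice_length
        self.partials[index] = self.partials.get(index, 0) + value

    def window_sum(self, start, end):
        """Aggregate of the window [start, end), combined from slice partials."""
        first = start // self.slice_length
        last = end // self.slice_length
        return sum(self.partials.get(i, 0) for i in range(first, last))


slicer = SlicedSum(slice_length=5)
for timestamp, value in [(1, 2), (4, 3), (6, 1), (12, 4), (17, 5)]:
    slicer.add(timestamp, value)

# Two overlapping queries reuse the partial sum of the slice [5, 10).
query_a = slicer.window_sum(0, 10)
query_b = slicer.window_sum(5, 15)
```

Both queries are answered from the same partial aggregates, so adding a query costs a few additions per window rather than another pass over the tuples.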
119. General Stream Slicing
Workload
Characteristics
Window
Types
Context Free
Forward Context Free
Forward Context Aware
Stream
Order
in-order
out-of-order
Window
Measures
time
tuple count
arbitrary
Aggregation
Functions
distributive
algebraic
holistic
associativity
cummutativity
invertibility
22.03.2019 Jonas Traub - Demand-based Data Stream Gathering, Processing, and Transmission 37
General Stream Slicing combines generality and efficiency in a single solution.
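To make the aggregation function properties concrete, the following sketch classifies three textbook aggregations. The lift/combine/lower decomposition with an optional invert step is a common way to describe incremental aggregation; all function names here are hypothetical and chosen only for illustration.

```python
from statistics import median

# Distributive: the aggregate of the whole equals the aggregate of the
# partial aggregates. Sum is also associative, commutative, and invertible.
def sum_combine(a, b):
    return a + b

def sum_invert(total, removed):
    # Invertibility: remove a contribution without recomputing from scratch.
    return total - removed

# Algebraic: a fixed-size partial (sum, count) is enough to derive the result.
def avg_lift(value):
    return (value, 1)

def avg_combine(a, b):
    return (a[0] + b[0], a[1] + b[1])

def avg_lower(partial):
    return partial[0] / partial[1]

# Holistic: no fixed-size partial exists, so the values themselves are kept.
def median_combine(a, b):
    return a + b  # concatenates two lists of raw values

def median_lower(values):
    return median(values)

slice_1 = [4, 8]
slice_2 = [6]

total = sum_combine(sum(slice_1), sum(slice_2))
without_first = sum_invert(total, 4)

avg_partial = avg_combine(avg_combine(avg_lift(4), avg_lift(8)), avg_lift(6))
average = avg_lower(avg_partial)

middle = median_lower(median_combine(slice_1, slice_2))
```

The partial aggregate of a slice is a single number for sum, a pair for average, and the full list of values for median, which is why the function class influences how much state a slice has to keep.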
130. General Stream Slicing Internals
Part 1: Three Fundamental Operations on Slices
• Merge Slices
• Split Slices
• Update Slices
Part 2: Adapt to Workload Characteristics:
• Do we need to store original tuples?
• Do we potentially need to split slices?
• Do we potentially need to remove tuples from slices?
General Stream Slicing adapts to current workload characteristics.
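The three slice operations can be sketched as follows. This is a simplified illustration under stated assumptions, not the Scotty implementation: the aggregation is a sum, slices cover half-open time ranges, and the class name `Slice` and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    start: int            # inclusive start timestamp
    end: int              # exclusive end timestamp
    total: int = 0        # partial aggregate (a sum in this sketch)
    tuples: list = field(default_factory=list)  # (timestamp, value) pairs

    def update(self, timestamp, value):
        """Update: add one tuple to the slice and its partial aggregate."""
        self.tuples.append((timestamp, value))
        self.total += value

    def merge(self, other):
        """Merge: combine this slice with the directly following slice."""
        return Slice(self.start, other.end, self.total + other.total,
                     self.tuples + other.tuples)

    def split(self, timestamp):
        """Split: cut the slice at a timestamp.

        The partial aggregates of both halves are recomputed from the stored
        tuples, so splitting relies on keeping the original tuples.
        """
        left = Slice(self.start, timestamp)
        right = Slice(timestamp, self.end)
        for ts, value in self.tuples:
            (left if ts < timestamp else right).update(ts, value)
        return left, right


a = Slice(0, 10)
a.update(2, 5)
a.update(7, 1)
b = Slice(10, 20)
b.update(12, 4)

merged = a.merge(b)
left, right = merged.split(5)
```

In this sketch the tuple list is always kept. If a workload never needs splits, a slice could drop the list and keep only the partial aggregate, which is the kind of decision the questions in Part 2 address. Removing a tuple would be the reverse of `update`; for an invertible function such as sum it can adjust the partial aggregate directly.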
131. Performance Evaluation with 20% out-of-order Tuples
132. Conclusion of Chapter 4
• We identify workload characteristics which impact
applicability and performance of window aggregation techniques.
• We contribute the General Stream Slicing technique.
• We show that general stream slicing is generally applicable and
offers better performance than alternative approaches.
Full Paper:
Efficient Window Aggregation with General Stream Slicing
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl.
International Conference on Extending Database Technology (EDBT), 2019.
Short Paper:
Scotty: Efficient Window Aggregation for out-of-order Stream Processing
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl.
IEEE International Conference on Data Engineering (ICDE), 2018.
Open Source: https://tu-berlin-dima.github.io/scotty-window-processor/
134. Chapter 5: Interactive Real-Time Visualization for Streaming Data
Scope of Chapter 5: Connecting Stream Analysis Systems and Front-End Applications
136. Demonstration Data
The DEBS 2013 Grand Challenge
Christopher Mutschler, Holger Ziekow, Zbigniew Jerzak
Data:
● Sensor data from a football match (speed, acceleration, and position of the ball and players)
● Up to 2000 Hz frequency
● Roughly 12,000 data points per second
142. Conclusion of Chapter 5
• We connect visualization front-ends with stream analysis clusters to enable
demand-based optimizations.
• We show that our solution significantly reduces the amount of processed
and transferred data, while still providing loss-free visualizations.
• We provide an algorithm for the live visualization of time series in line charts.
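One established way to draw a line chart from far fewer points without changing the rendered pixels is to keep only the first, last, minimum, and maximum point of each pixel column (often called M4 aggregation). The sketch below illustrates that idea. Whether it matches the algorithm of Chapter 5 in detail is an assumption, and the function name `reduce_for_line_chart` and its parameters are hypothetical.

```python
def reduce_for_line_chart(points, t_start, t_end, width_px):
    """Keep the first, last, min, and max point per pixel column.

    points: iterable of (timestamp, value) pairs sorted by timestamp,
    with t_start <= timestamp < t_end.
    """
    columns = {}
    span = t_end - t_start
    for ts, value in points:
        col = int((ts - t_start) * width_px / span)
        point = (ts, value)
        if col not in columns:
            columns[col] = {"first": point, "last": point,
                            "min": point, "max": point}
            continue
        entry = columns[col]
        entry["last"] = point
        if value < entry["min"][1]:
            entry["min"] = point
        if value > entry["max"][1]:
            entry["max"] = point
    reduced = set()
    for entry in columns.values():
        reduced.update(entry.values())
    return sorted(reduced)


# 1,000 points drawn on a chart that is only 4 pixels wide.
data = [(t, (t * 37) % 101) for t in range(1000)]
reduced = reduce_for_line_chart(data, 0, 1000, 4)
```

The number of transferred points depends on the chart width rather than on the data rate, so a front-end that reports its resolution lets the back-end skip most of the stream.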
Demonstration:
I²: Interactive Real-Time Visualization for Streaming Data
Jonas Traub, Nikolaas Steenbergen, Philipp M Grulich, Tilmann Rabl, and Volker Markl.
International Conference on Extending Database Technology (EDBT), 2017.
Open Source: https://tu-berlin-dima.github.io/i2
145. List of Publications
Additional Contributions:
• Cutty: Aggregate Sharing for User-Defined Windows (CIKM 2016)
Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, Volker Markl
• Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing (EDBT 2018)
Philipp Marian Grulich, René Saitenmacher, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl
• A Concept Drift based Approach for Predicting Event-Time Progress in Data Streams (EDBT 2019)
Ahmed Awad, Jonas Traub, Sherif Sakr
• Analyzing Efficient Stream Processing on Modern Hardware (PVLDB 2019)
Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz, Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl
• Efficient SIMD Vectorization for Hashing in OpenCL (EDBT 2018)
Tobias Behrens, Viktor Rosenfeld, Jonas Traub, Sebastian Breß, Volker Markl
• Resense: Transparent Record and Replay of Sensor Data in the Internet of Things (EDBT 2019)
Dimitrios Giouroukis, Julius Hülsmann, Janis von Bleichert, Morgan Geldenhuys, Tim Stullich, Felipe Gutierrez, Jonas Traub, Kaustubh Beedkar, Volker Markl
• Apache Flink in current research (it-Information Technology 2016)
Tilmann Rabl, Jonas Traub, Asterios Katsifodimos, Volker Markl
• Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten (LWA 2015)
Jonas Traub, Tilmann Rabl, Fabian Hueske, Till Rohrmann, Volker Markl
PhD Thesis Publications:
• Optimized On-Demand Data Streaming from Sensor Nodes (SoCC 2017)
Jonas Traub, Sebastian Breß, Tilmann Rabl, Asterios Katsifodimos, Volker Markl
• SENSE: Scalable Data Acquisition from Distributed Sensors with Guaranteed Time Coherence.
Jonas Traub, Julius Hülsmann, Tim Stullich, Sebastian Breß, Tilmann Rabl, and Volker Markl
• Scotty: Efficient Window Aggregation for out-of-order Stream Processing (ICDE 2018)
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
• Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
• I²: Interactive Real-Time Visualization for Streaming Data (EDBT 2017, Best Demo)
Jonas Traub, Nikolaas Steenbergen, Philipp M Grulich, Tilmann Rabl, Volker Markl