Demand-based Data Stream Gathering, Processing, and Transmission - PhD Defense

Candidate: Jonas Traub
PhD Defense, Berlin, 22.03.2019
Committee: Prof. Dr. Manfred Hauswirth, Prof. Dr. Amr El Abbadi, Prof. Dr. Volker Markl, Prof. Dr. Albert Bifet

Thesis Goal
Efficiently gather, transmit, and process sensor data
from a large number of sensor nodes, and
make it available to a large number of applications
in real-time.

Efficiency
Scalability
Timeliness

Stream Processing Pipelines
A stream processing pipeline is a series of concurrently running operators.
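
The definition above can be illustrated with a minimal sketch. It assumes that each operator runs in its own thread and that operators are connected by FIFO queues; the three example operators and all function names are hypothetical and only mirror the gathering, processing, and transmission stages in spirit.

```python
import queue
import threading

_END = object()  # sentinel that signals the end of the stream


def run_operator(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream downstream
            return
        result = fn(item)
        if result is not None:  # None means "filtered out"
            outbox.put(result)


def run_pipeline(source, operators):
    """Run each operator in its own thread, connected by queues."""
    queues = [queue.Queue() for _ in range(len(operators) + 1)]
    threads = [
        threading.Thread(target=run_operator, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(operators)
    ]
    for t in threads:
        t.start()
    for item in source:
        queues[0].put(item)
    queues[0].put(_END)
    results = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


def to_celsius(raw):      # hypothetical conversion of a raw sensor value
    return raw / 2


def keep_warm(celsius):   # hypothetical filter; drops values of 5 or less
    return celsius if celsius > 5 else None


def as_message(celsius):  # hypothetical formatting before transmission
    return f"temp={celsius}"


output = run_pipeline([12, 55, 31, 8], [to_celsius, keep_warm, as_message])
```

All operators are active at the same time: while the last operator formats an early reading, the first one may already convert a later reading.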

Pipeline stages: Data Gathering → Data Processing → Data Transmission

Demand-oblivious Data Streaming in the Internet of Things
[Figure: Sensor Nodes (s1, s2, …, sN) → Stream Analysis System → Front-End Applications]
Definitions:
Data Demand: The minimum number of data points that allows for providing a desired functionality.
Demand-oblivious: Not considering the data demand of data consumers.
Problems of Demand-oblivious Data Streaming in the Internet of Things
• Combined Stream Bandwidth
• Parallel Network Connections
• Expensive Cluster Scale-Out
• Front-End Overload
Demand-oblivious data streaming does not suit the IoT.

Demand-based Data Streaming in the Internet of Things
[Figure: Sensor Nodes (s1, s2, …, sN), Sensor Control System, Stream Analysis System, Front-End Applications]
Definitions:
Demand-based: Utilize requirement specifications of data consumers to save resources.
Solution:
We enable demand-based optimizations by introducing control interfaces for expressing data demands.

End-to-End Architecture

Chapter 2: Optimized On-Demand Data Streaming from Sensor Nodes
Scope of Chapter 2: Read Scheduling on Sensor Nodes

Sensor Read Scheduling
• User-Defined Sampling Functions
• Sensor Read Fusion
• Local Filtering
Symbol Key: Read time tolerance, Desired read time, Penalty function
Our scheduler minimizes the number of sensor reads and data transmissions
based on the joint data demand of all data consumers.
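
To make the idea of sharing reads among consumers concrete, the following is a minimal sketch and not the scheduler from the thesis. It assumes that every request is a pair of desired read time and symmetric read time tolerance, and it uses a standard greedy interval-stabbing strategy to find the fewest read times that serve all requests. The function name and the request format are hypothetical.

```python
def fuse_reads(requests):
    """Return the fewest read times that satisfy all requests.

    requests: list of (desired_time, tolerance) pairs. A read at time r
    satisfies a request if desired_time - tolerance <= r <= desired_time + tolerance.
    """
    # Turn each request into its tolerated interval and sort by interval end.
    intervals = sorted(
        ((d - tol, d + tol) for d, tol in requests),
        key=lambda interval: interval[1],
    )
    read_times = []
    for start, end in intervals:
        # Schedule a new read only if the latest read misses this interval.
        if not read_times or read_times[-1] < start:
            read_times.append(end)
    return read_times


requests = [(10, 2), (11, 1), (13, 2), (20, 1), (21, 3)]
shared = fuse_reads(requests)  # five requests are served by two reads
```

The greedy choice reads as late as each interval allows, which minimizes the number of reads but ignores how far a read is from the desired read time. Penalty functions address that trade-off.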

Read Scheduling Example
[Figure: read scheduling example]
Symbol Key: Read time tolerance, Desired read time, Penalty function, Sum of penalty functions, Optimal time frame for read sharing, Sensor read time
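
The symbol key suggests that a shared read time is chosen inside the time frame in which all tolerances overlap, guided by the sum of penalty functions. The sketch below illustrates this under loud assumptions: the penalty of a request is taken to be a weighted absolute deviation from its desired read time, and candidate times are sampled on a fixed grid. Neither the penalty shape nor the grid search is taken from the thesis, and all names are hypothetical.

```python
def best_shared_read_time(requests, step=0.5):
    """Pick one read time for all requests that minimizes the summed penalty.

    requests: list of (desired_time, tolerance, weight) triples.
    Returns None if the tolerances do not overlap.
    """
    # Time frame in which a single read satisfies every tolerance.
    lo = max(d - tol for d, tol, _ in requests)
    hi = min(d + tol for d, tol, _ in requests)
    if lo > hi:
        return None

    def total_penalty(t):
        # Assumed penalty: weighted distance from the desired read time.
        return sum(w * abs(t - d) for d, _, w in requests)

    steps = int((hi - lo) / step)
    candidates = [lo + i * step for i in range(steps + 1)]
    return min(candidates, key=total_penalty)


requests = [(10, 2, 1.0), (11, 1, 1.0), (13, 2, 1.0)]
read_time = best_shared_read_time(requests)
no_overlap = best_shared_read_time([(10, 1, 1.0), (20, 1, 1.0)])
```

In this example the shared time frame is [11, 12], and the summed penalty is smallest at its left edge.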

Increasing the number of concurrent queries
On-demand scheduling reduces sensor reads and data transfer by up to 87%.
The number of reads and transfers increases sub-linearly with the number of queries.

Conclusion of Chapter 2
• We introduced user-defined sampling functions (UDSFs).
• We contributed a multi-query read scheduling algorithm.
• We optimize read times based on given read time tolerances.
• We experimentally and analytically validated our approach.

Full Paper: Optimized On-Demand Data Streaming from Sensor Nodes
Jonas Traub, Sebastian Breß, Tilmann Rabl, Asterios Katsifodimos, and Volker Markl.
ACM Symposium on Cloud Computing 2017 (SoCC '17)

Chapter 3: Scalable Data Acquisition with Guaranteed Time Coherence
Scope of Chapter 3: The Sensor Control Layer

Architecture: Central Join Topology
[Figure: Sensor Nodes (s1, s2, …, sN) send tuples such as (t3,v3) to a Central Join, which outputs (t,v1,v2, …, vN)]
Central stream joins do not scale to thousands of input streams.

Architecture: Sensing Loop Topology
[Figure: a Loop Node sends a tuple (t) through the Sensor Nodes s1, s2, s3, s4; each node appends its value: (t) → (t,v1) → (t,v1,v2) → (t,v1,v2,v3) → (t,v1,v2,v3,v4). Additional sensor nodes shown: s5, s6, s7, s8.]
We dynamically combine sensing loops and central joins to solve scalability challenges.
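
The two topologies can be contrasted with a small simulation. This is a sketch under simplifying assumptions: sensors are modeled as plain functions of the read time, there is no network, and the function names are hypothetical.

```python
def run_sensing_loop(t, sensors):
    """Pass a tuple that starts as (t,) through every sensor node in loop order."""
    tup = (t,)
    for read_sensor in sensors:
        tup = tup + (read_sensor(t),)  # each node appends its own value
    return tup  # the loop node receives one complete tuple


def central_join(t, sensors):
    """Every sensor sends its own (t, v) tuple; a single node joins all of them."""
    partial_tuples = [(t, read_sensor(t)) for read_sensor in sensors]
    return (t,) + tuple(v for _, v in partial_tuples)


# Four hypothetical sensors; sensor i returns 10 * i + t as its value.
sensors = [lambda t, i=i: 10 * i + t for i in range(1, 5)]
loop_result = run_sensing_loop(3, sensors)
join_result = central_join(3, sensors)
```

Both topologies produce the same tuple. The difference is where the work lands: the central join handles one input stream per sensor, whereas the loop node handles a single tuple per loop.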

Architecture: Sensor Node Internals

Time Coherence of Data Tuples
Definition of Time Coherence:
A tuple (t, v1, v2, …, vN) is time coherent
if all its values v1, v2, …, vN were read exactly at time t.
How can we measure, optimize, and guarantee time coherence?

Time Coherence Measures
Coherence Estimate (Ce): Based on untrusted clocks. Best case: Ce = 0.
Coherence Guarantee (Cg): Based on trusted clocks. Best case: Cg = round trip time.

Time Coherence Optimization
• If untrusted clocks are correct, data quality will be better if you just trust them.
• "I can't rely on untrusted clocks": If untrusted clocks are wrong, data quality is ensured iff you don't trust them.

Example of Coherence Optimization
Symbol Key: Coherence Estimate (Ce), Coherence Guarantee (Cg), Round Trip Time, Dmax (desired upper limit for Cg)
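
The following sketch shows one simplified reading of the measures in the symbol key; it is not the definition from the thesis. It assumes that Ce is the spread of the read timestamps reported by the sensor nodes (untrusted clocks), and that Cg is the time that passes on the trusted clock between sending a request and receiving the completed tuple, which corresponds to the best case of one round trip. All names and the millisecond values are hypothetical.

```python
def coherence_estimate(reported_read_times):
    """Ce: spread of read times as reported by untrusted sensor clocks."""
    return max(reported_read_times) - min(reported_read_times)


def coherence_guarantee(sent_at, received_at):
    """Cg: elapsed time on the trusted clock, assuming all reads
    happened between sending the request and receiving the tuple."""
    return received_at - sent_at


def meets_requirement(cg, d_max):
    """Check the guarantee against Dmax, the desired upper limit for Cg."""
    return cg <= d_max


# Hypothetical timestamps in milliseconds.
ce = coherence_estimate([1000, 1000, 1004])
cg = coherence_guarantee(sent_at=998, received_at=1010)
ok = meets_requirement(cg, d_max=20)
```

Under these assumptions, Ce is 0 when all sensor clocks report the same read time, while Cg can never be smaller than the round trip time.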

Large Scale Parameter Exploration
Our optimization works for arbitrary coherence requirements and large numbers of sensor nodes.

Conclusion of Chapter 3
• We present an architecture for acquiring values from large numbers of
sensors with guaranteed time coherence and low latency.
• We introduce time coherence as a fundamental data characteristic.
• We optimize the coherence of tuples and provide coherence guarantees.

Full Paper: SENSE: Scalable Data Acquisition from Distributed Sensors with Guaranteed Time Coherence
Jonas Traub, Julius Hülsmann, Tim Stullich, Sebastian Breß, Tilmann Rabl, and Volker Markl.
(Under Submission)

Chapter 4: Efficient Window Aggregation with General Stream Slicing
Scope of Chapter 4: Flexible Stream Discretization and Efficient Window Aggregation in Stream Analysis Systems

Stream Slicing Example

The number of slices depends on data consumers. → Demand-based memory requirement.

We store partial aggregates instead of all tuples. → Small memory footprint.

We assign each tuple to exactly one slice. → O(1) per-tuple complexity.

We require just a few computation steps to calculate final aggregates. → Low latency.

We share partial aggregations among all users and queries. → Efficiency by preventing redundancy.
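
The properties listed above can be seen in a minimal slicing sketch. It assumes fixed-length slices, a sum aggregation, and window bounds that align with slice edges; these restrictions are simplifications for illustration and do not reflect the generality of the technique. The class and method names are hypothetical.

```python
class SliceStore:
    """Keeps one partial sum per slice instead of all tuples."""

    def __init__(self, slice_length):
        self.slice_length = slice_length
        self.partials = {}  # slice start time -> partial sum

    def add(self, timestamp, value):
        # Each tuple belongs to exactly one slice, found in constant time.
        start = (timestamp // self.slice_length) * self.slice_length
        self.partials[start] = self.partials.get(start, 0) + value

    def window_sum(self, start, end):
        """Sum over [start, end); bounds are assumed to align with slice edges."""
        return sum(p for s, p in self.partials.items() if start <= s < end)


store = SliceStore(slice_length=5)
for timestamp, value in [(1, 10), (3, 20), (6, 5), (12, 7), (14, 1)]:
    store.add(timestamp, value)

first_window = store.window_sum(0, 10)   # uses slices [0,5) and [5,10)
second_window = store.window_sum(5, 15)  # reuses slice [5,10)
```

Five tuples are reduced to three partial aggregates, and both windows share the partial aggregate of the slice starting at 5.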

General Stream Slicing
Workload Characteristics:
• Aggregation Functions: distributive, algebraic, holistic; associativity, commutativity, invertibility
• Window Types: Context Free, Forward Context Free, Forward Context Aware
• Window Measures: time, tuple count, arbitrary
• Stream Order: in-order, out-of-order
General Stream Slicing combines generality and efficiency in a single solution.
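
The aggregation function classes above determine what a slice has to store. The sketch below uses sum, average, and median as textbook examples of distributive, algebraic, and holistic functions. The helper names (lift, combine, lower, invert) are a naming choice for this example and are not taken from the slides.

```python
import statistics


def combine_sum(a, b):
    """Distributive: partial sums combine directly into the final sum."""
    return a + b


def invert_sum(total, removed):
    """Invertible: a partial sum can be removed from a total again."""
    return total - removed


def lift_avg(values):
    """Algebraic: the average needs a fixed-size partial, here (sum, count)."""
    return (sum(values), len(values))


def combine_avg(a, b):
    return (a[0] + b[0], a[1] + b[1])


def lower_avg(partial):
    return partial[0] / partial[1]


slice_a = [4, 8]
slice_b = [1, 2, 10]

total = combine_sum(sum(slice_a), sum(slice_b))
only_b = invert_sum(total, sum(slice_a))
average = lower_avg(combine_avg(lift_avg(slice_a), lift_avg(slice_b)))
# Holistic: the median has no fixed-size partial, so all values are needed.
median = statistics.median(slice_a + slice_b)
```

For sum and average, a slice can discard its tuples after updating the partial aggregate. For the median, the original values have to be kept.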

General Stream Slicing Internals
Part 1: Three Fundamental Operations on Slices
• Merge Slices
• Split Slices
• Update Slices
Part 2: Adapt to Workload Characteristics
• Do we need to store original tuples?
• Do we potentially need to split slices?
• Do we potentially need to remove tuples from slices?
General Stream Slicing adapts to current workload characteristics.
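
The three operations on slices can be sketched as follows. This is an illustration under assumptions, not the implementation from the thesis: slices cover half-open time ranges, the aggregation is a sum, and every slice stores its original tuples so that splitting is always possible. A workload-aware implementation would store tuples only when the questions above require it.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    start: int  # inclusive
    end: int    # exclusive
    tuples: list = field(default_factory=list)  # (timestamp, value) pairs

    @property
    def aggregate(self):
        # Recomputed from stored tuples to keep the sketch short.
        return sum(value for _, value in self.tuples)

    def update(self, timestamp, value):
        """Update: add a tuple that falls into this slice."""
        if not self.start <= timestamp < self.end:
            raise ValueError("tuple does not belong to this slice")
        self.tuples.append((timestamp, value))


def merge(left, right):
    """Merge: combine two adjacent slices into one."""
    return Slice(left.start, right.end, left.tuples + right.tuples)


def split(s, at):
    """Split: cut a slice at a timestamp; this needs the original tuples."""
    left = Slice(s.start, at, [tv for tv in s.tuples if tv[0] < at])
    right = Slice(at, s.end, [tv for tv in s.tuples if tv[0] >= at])
    return left, right


original = Slice(0, 10)
original.update(2, 5)
original.update(7, 3)
left, right = split(original, 5)
merged = merge(left, right)
```

Splitting and merging are inverse here: merging the two halves yields a slice equal to the original one.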

Performance Evaluation with 20% out-of-order Tuples

Conclusion of Chapter 4
• We identify workload characteristics which impact
applicability and performance of window aggregation techniques.
• We contribute the General Stream Slicing technique.
• We show that general stream slicing is generally applicable and
offers better performance than alternative approaches.

Full Paper: Efficient Window Aggregation with General Stream Slicing
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl.
International Conference on Extending Database Technology (EDBT), 2019.
Short Paper: Scotty: Efficient Window Aggregation for out-of-order Stream Processing
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl.
IEEE International Conference on Data Engineering (ICDE), 2018.
Open Source: https://tu-berlin-dima.github.io/scotty-window-processor/

Chapter 5: Interactive Real-Time Visualization for Streaming Data
Scope of Chapter 5: Connecting Stream Analysis Systems and Front-End Applications

Demonstration Data
The DEBS 2013 Grand Challenge
Christopher Mutschler, Holger Ziekow, Zbigniew Jerzak
Data:
• Sensor data from a football match
(speed, acceleration, and position of the ball and players)
• Up to 2000 Hz frequency
• Roughly 12,000 data points per second

Conclusion of Chapter 5
• We connect visualization front-ends with stream analysis clusters to enable
demand-based optimizations.
• We show that our solution significantly reduces the amount of processed
and transferred data, while still providing loss-free visualizations.
• We provide an algorithm for the live visualization of time series in line charts.
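
The slides do not spell out the line chart algorithm, so the following sketch swaps in a known technique for illustration: an M4-style reduction that keeps only the first, last, minimum, and maximum point per pixel column. Whether this matches the algorithm of this chapter is an assumption; the function name and parameters are hypothetical.

```python
def reduce_for_line_chart(points, t_start, t_end, width_px):
    """Keep the first, last, min, and max point of every pixel column.

    points: list of (t, v) pairs sorted by t, with t_start <= t < t_end.
    width_px: chart width in pixels, as requested by the front-end.
    """
    columns = {}
    span = t_end - t_start
    for t, v in points:
        col = int((t - t_start) * width_px / span)
        entry = columns.get(col)
        if entry is None:
            columns[col] = {"first": (t, v), "last": (t, v),
                            "min": (t, v), "max": (t, v)}
            continue
        entry["last"] = (t, v)
        if v < entry["min"][1]:
            entry["min"] = (t, v)
        if v > entry["max"][1]:
            entry["max"] = (t, v)
    reduced = []
    for col in sorted(columns):
        # A point can play several roles; the set removes such duplicates.
        reduced.extend(sorted(set(columns[col].values())))
    return reduced


points = [(0, 1), (1, 5), (2, 3), (3, 0), (4, 2), (5, 2), (6, 9), (7, 4)]
reduced = reduce_for_line_chart(points, t_start=0, t_end=8, width_px=2)
```

The chart width acts as the data demand of the front-end: at most four points per pixel column need to be transferred, no matter how many points the stream delivers.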

Demonstration: I²: Interactive Real-Time Visualization for Streaming Data
Jonas Traub, Nikolaas Steenbergen, Philipp M Grulich, Tilmann Rabl, and Volker Markl.
International Conference on Extending Database Technology (EDBT), 2017.
Open Source: https://tu-berlin-dima.github.io/i2

Conclusion

List of Publications
PhD Thesis Publications:
• Optimized On-Demand Data Streaming from Sensor Nodes (SoCC 2017)
Jonas Traub, Sebastian Breß, Tilmann Rabl, Asterios Katsifodimos, Volker Markl
• SENSE: Scalable Data Acquisition from Distributed Sensors with Guaranteed Time Coherence (Under Submission)
Jonas Traub, Julius Hülsmann, Tim Stullich, Sebastian Breß, Tilmann Rabl, Volker Markl
• Scotty: Efficient Window Aggregation for out-of-order Stream Processing (ICDE 2018)
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
• Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
• I²: Interactive Real-Time Visualization for Streaming Data (EDBT 2017, Best Demo)
Jonas Traub, Nikolaas Steenbergen, Philipp M Grulich, Tilmann Rabl, Volker Markl
Additional Contributions:
• Cutty: Aggregate Sharing for User-Defined Windows (CIKM 2016)
Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, Volker Markl
• Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing (EDBT 2018)
Philipp Marian Grulich, René Saitenmacher, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl
• A Concept Drift based Approach for Predicting Event-Time Progress in Data Streams (EDBT 2019)
Ahmed Awad, Jonas Traub, Sherif Sakr
• Analyzing Efficient Stream Processing on Modern Hardware (PVLDB 2019)
Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz, Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl
• Efficient SIMD Vectorization for Hashing in OpenCL (EDBT 2018)
Tobias Behrens, Viktor Rosenfeld, Jonas Traub, Sebastian Breß, Volker Markl
• Resense: Transparent Record and Replay of Sensor Data in the Internet of Things (EDBT 2019)
Dimitrios Giouroukis, Julius Hülsmann, Janis von Bleichert, Morgan Geldenhuys, Tim Stullich, Felipe Gutierrez, Jonas Traub, Kaustubh Beedkar, Volker Markl
• Apache Flink in current research (it-Information Technology 2016)
Tilmann Rabl, Jonas Traub, Asterios Katsifodimos, Volker Markl
• Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten (LWA 2015)
Jonas Traub, Tilmann Rabl, Fabian Hueske, Till Rohrmann, Volker Markl
