7. Continuous counting
A seemingly simple application, but generally an unsolved problem
E.g., count visitors, impressions, interactions, clicks, etc.
Aggregations and OLAP cube operations are generalizations of counting
8. Counting in batch architecture
Continuous ingestion
Periodic (e.g., hourly) files
Periodic batch jobs
9. Problems with batch architecture
High latency
Too many moving parts
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
10. Counting in λ architecture
"Batch layer": what
we had before
"Stream layer":
approximate early
results
11. Problems with batch and λ
Way too many moving parts (and code dup)
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
12. Counting in streaming architecture
Message queue ensures stream durability and replay
Stream processor ensures consistent counting
13. Counting in Flink DataStream API
Number of visitors in last hour by country
DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...))   // create stream from Kafka
    .keyBy("country")                         // group by country
    .timeWindow(Time.minutes(60))             // window of size 1 hour
    .apply(new CountPerWindowFunction());     // do operations per window
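What the pipeline above computes can be sketched without any framework. The following is a minimal plain-Java illustration, not Flink's implementation: the class `HourlyCountSketch`, the `LogEvent` fields, and the `country@windowStart` key format are all hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HourlyCountSketch {
    // Hypothetical event type: only the fields the count needs.
    static final class LogEvent {
        final String country;
        final long timestampMillis;

        LogEvent(String country, long timestampMillis) {
            this.country = country;
            this.timestampMillis = timestampMillis;
        }
    }

    static final long WINDOW_MILLIS = 60L * 60L * 1000L; // one hour

    // Counts events per (country, window start); keys look like "DE@3600000".
    static Map<String, Long> countPerWindow(List<LogEvent> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (LogEvent e : events) {
            // Align the timestamp down to the start of its one-hour window.
            long windowStart = e.timestampMillis - (e.timestampMillis % WINDOW_MILLIS);
            counts.merge(e.country + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LogEvent> events = Arrays.asList(
            new LogEvent("DE", 0L),
            new LogEvent("DE", 1_000L),
            new LogEvent("US", 2_000L),
            new LogEvent("DE", WINDOW_MILLIS + 5L));
        System.out.println(countPerWindow(events));
    }
}
```

The sketch holds all events in memory and assumes non-negative timestamps; a stream processor does the same grouping incrementally, over an unbounded input.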
14. Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Based on Maslow's hierarchy of needs
20. Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable (Flink 1.1+)
21. Rest of this talk
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
23. Yahoo! Streaming Benchmark
Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
Focus on measuring end-to-end latency at low throughputs
First benchmark that was modeled after a real application
Read more:
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
24. Benchmark task: counting!
Count ad impressions grouped by campaign
Compute aggregates over last 10 seconds
Make aggregates available for queries (Redis)
25. Results (lower is better)
Flink and Storm at sub-second latencies
Spark has a latency-throughput tradeoff
170k events/sec
27. Handling high-volume streams
Scalability: how many events/sec can a system scale to, with infinite resources?
Scalability comes at a cost (systems add overhead to be scalable)
Efficiency: how many events/sec can a system scale to, with limited resources?
28. Extending the Yahoo! benchmark
Yahoo! benchmark is a great starting point to understand engine behavior.
However, benchmark stops at low write throughput and programs are not fault tolerant.
We extended the benchmark to (1) high volumes, and (2) to use Flink's built-in state.
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
29. Results (higher is better)
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3 million events/sec
15 million events/sec
31. Stateful streaming applications
Most interesting stream applications are stateful
How to ensure that state is correct after failures?
32. Fault tolerance definitions
At least once
• May over-count
• In some cases may even under-count, under a different definition
Exactly once
• Counts are the same after failure
End-to-end exactly once
• Counts appear the same in an external sink (database, file system) after failure
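The difference between the first two definitions can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions (a single counter, a replayable source, one failure); it is not how Flink implements checkpointing, and the class `ReplayCountSketch` and its parameters are hypothetical. After a failure the source always rewinds to the last snapshot; the `restoreState` flag decides whether the counter is rolled back to that same snapshot (exactly once) or left as it was (over-counting, as at-least-once may).

```java
final class ReplayCountSketch {
    // Counts totalEvents events, snapshotting every `interval` events,
    // and fails once after `failAt` events have been processed.
    static long run(int totalEvents, int interval, int failAt, boolean restoreState) {
        long count = 0;      // operator state: the running count
        int offset = 0;      // position in the replayable source
        long snapCount = 0;  // last snapshot: operator state
        int snapOffset = 0;  // last snapshot: source position
        boolean failed = false;

        while (offset < totalEvents) {
            if (!failed && offset == failAt) {
                failed = true;
                offset = snapOffset;       // source replays from the snapshot
                if (restoreState) {
                    count = snapCount;     // state rolled back consistently
                }
                continue;
            }
            count++;                       // process one event
            offset++;
            if (offset % interval == 0) {  // periodic snapshot of state + position
                snapCount = count;
                snapOffset = offset;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("state restored:     " + run(10, 4, 6, true));
        System.out.println("state not restored: " + run(10, 4, 6, false));
    }
}
```

With 10 events, a snapshot every 4 events, and a failure after 6, the two events processed between the snapshot and the failure are replayed. They are counted once when the state is restored and twice when it is not.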
33. Fault tolerance in Flink
Flink guarantees exactly once
End-to-end exactly once supported with specific sources and sinks
• E.g., Kafka → Flink → HDFS
Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
34. Savepoints
Maintaining stateful applications in production is challenging
Flink savepoints: externally triggered, durable checkpoints
Easy code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests
36. Notions of time
Episode IV: A New Hope (1977)
Episode V: The Empire Strikes Back (1980)
Episode VI: Return of the Jedi (1983)
Episode I: The Phantom Menace (1999)
Episode II: Attack of the Clones (2002)
Episode III: Revenge of the Sith (2005)
Episode VII: The Force Awakens (2015)
Episode number (order within the story): this is called event time
Release year (order in which the films arrived): this is called processing time
37. Out of order streams
Event time windows
Arrival time windows
Instant event-at-a-time
First burst of events
Second burst of events
38. Why event time
Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
Benefits of event time
• Accurate results for out of order data
• Sessions and unaligned windows
• Time travel (backstreaming)
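To make the first benefit concrete, here is a small plain-Java sketch (not Flink's windowing code) that counts the same events once by event time and once by arrival time. The class `TimeWindowSketch`, the `Event` fields, and the window size are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TimeWindowSketch {
    // Hypothetical event carrying both notions of time.
    static final class Event {
        final long eventTime;    // when it happened
        final long arrivalTime;  // when the processor saw it

        Event(long eventTime, long arrivalTime) {
            this.eventTime = eventTime;
            this.arrivalTime = arrivalTime;
        }
    }

    // Counts events per tumbling window, keyed by window start.
    static Map<Long, Integer> countByWindow(List<Event> events, long size, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long t = useEventTime ? e.eventTime : e.arrivalTime;
            long windowStart = t - (t % size); // align down to the window start
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(1L, 2L),
            new Event(9L, 12L),   // happened in window [0, 10), arrived late
            new Event(11L, 13L));
        System.out.println("event time:   " + countByWindow(events, 10L, true));
        System.out.println("arrival time: " + countByWindow(events, 10L, false));
    }
}
```

The delayed event lands in the window where it happened only when windows are assigned by event time; by arrival time it is attributed to the following window. A real stream processor additionally has to decide how long to wait for late events, which this sketch leaves out.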
40. Evolution of streaming in Flink
Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, NiFi, Elastic, ...), improved monitoring
Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
41. Upcoming features
SQL: ongoing work in collaboration with Apache Calcite
Dynamic scaling: adapt resources to stream volume, historical stream processing
Queryable state: ability to query the state inside the stream processor
Mesos support
More sources and sinks (e.g., Kinesis, Cassandra)
44. Summary
Stream processing is gaining momentum as the right paradigm for continuous data applications
Even seemingly simple applications can be complex at scale and in production; the choice of framework is crucial
Flink: unique combination of capabilities, performance, and robustness
45. Flink in the wild
30 billion events daily
2 billion events in 10 1Gb machines
Picked Flink for "Saiki" data integration & distribution platform
See talks by at
47. 2 meetups next week in Bay Area!
April 5, San Francisco
April 6, San Jose
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/
Editor's notes
3 systems (batch), or 5 systems (streaming), Need to add a new system for millisecond alerts
What if I want to count every 5 minutes, not 1 hour?
Just ignores out of order
What if I wanna do sessions?