Kostas Tzoumas
@kostas_tzoumas
Apache Flink™
Counting elements in streams
Introduction
Data streaming is becoming increasingly popular*
*Biggest understatement of 2016
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams
Counting
Continuous counting
▪ A seemingly simple application, but generally an unsolved problem
▪ E.g., count visitors, impressions, interactions, clicks, etc.
▪ Aggregations and OLAP cube operations are generalizations of counting
Counting in batch architecture
▪ Continuous ingestion
▪ Periodic (e.g., hourly) files
▪ Periodic batch jobs
Problems with batch architecture
▪ High latency
▪ Too many moving parts
▪ Implicit treatment of time
▪ Out of order event handling
▪ Implicit batch boundaries
Counting in λ architecture
▪ "Batch layer": what we had before
▪ "Stream layer": approximate early results
Problems with batch and λ
▪ Way too many moving parts (and code duplication)
▪ Implicit treatment of time
▪ Out of order event handling
▪ Implicit batch boundaries
Counting in streaming architecture
▪ Message queue ensures stream durability and replay
▪ Stream processor ensures consistent counting
Counting in Flink DataStream API
Number of visitors in last hour by country
DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...)) // create stream from Kafka
    .keyBy("country")                       // group by country
    .timeWindow(Time.minutes(60))           // window of size 1 hour
    .apply(new CountPerWindowFunction());   // do operations per window
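To make the result of the pipeline above concrete, here is a minimal plain-Java sketch of the same computation without any Flink dependency. It assumes that each event carries a country and a timestamp in milliseconds and that windows are aligned to full hours; the `LogEvent` fields, `HourlyCountSketch`, and the `country@windowStart` key format are hypothetical and only serve the illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the LogEvent type on the slide: only the two
// fields that the count needs.
final class LogEvent {
    final String country;
    final long timestampMillis;

    LogEvent(String country, long timestampMillis) {
        this.country = country;
        this.timestampMillis = timestampMillis;
    }
}

final class HourlyCountSketch {
    static final long WINDOW_MILLIS = 60L * 60L * 1000L; // window of size 1 hour

    // Start of the hour-aligned window that contains the given timestamp.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    // Groups by country and window, then counts. Keys look like "DE@3600000".
    static Map<String, Long> countPerCountryAndWindow(List<LogEvent> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (LogEvent e : events) {
            String key = e.country + "@" + windowStart(e.timestampMillis);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LogEvent> events = List.of(
            new LogEvent("DE", 10_000L),
            new LogEvent("US", 20_000L),
            new LogEvent("DE", 3_599_999L),  // last millisecond of the first hour
            new LogEvent("DE", 3_600_000L)); // first millisecond of the second hour
        // prints {DE@0=2, DE@3600000=1, US@0=1}
        System.out.println(countPerCountryAndWindow(events));
    }
}
```

Unlike this sketch, a stream processor has to emit each window's count while events are still arriving, which is where the latency, fault-tolerance, and time-handling concerns of the following slides come in.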
Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Based on Maslow's hierarchy of needs
Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
1.1+
Rest of this talk
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Latency
Yahoo! Streaming Benchmark
▪ Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
▪ Focus on measuring end-to-end latency at low throughputs
▪ First benchmark that was modeled after a real application
▪ Read more:
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Benchmark task: counting!
▪ Count ad impressions grouped by campaign
▪ Compute aggregates over last 10 seconds
▪ Make aggregates available for queries (Redis)
Results (lower is better)
Flink and Storm at sub-second latencies
Spark has a latency-throughput tradeoff
170k events/sec
Efficiency, and scalability
Handling high-volume streams
▪ Scalability: how many events/sec can a system scale to, with infinite resources?
▪ Scalability comes at a cost (systems add overhead to be scalable)
▪ Efficiency: how many events/sec can a system scale to, with limited resources?
Extending the Yahoo! benchmark
▪ The Yahoo! benchmark is a great starting point to understand engine behavior.
▪ However, the benchmark stops at low write throughput, and the programs are not fault tolerant.
▪ We extended the benchmark (1) to high volumes and (2) to use Flink's built-in state.
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Results (higher is better)
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3 million events/sec
15 million events/sec
Fault tolerance and repeatability
Stateful streaming applications
▪ Most interesting stream applications are stateful
▪ How to ensure that state is correct after failures?
Fault tolerance definitions
▪ At least once
• May over-count
• In some cases may even under-count, under a different definition
▪ Exactly once
• Counts are the same after failure
▪ End-to-end exactly once
• Counts appear the same in an external sink (database, file system) after failure
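The difference between the first two definitions can be shown with a small simulation. The sketch below is a deliberately simplified model, not Flink's actual mechanism: a single counter reads a list of events, records a snapshot every few events, and fails once. In both modes the input is replayed from the position stored in the last snapshot; what changes is whether the count is rolled back together with that position. The class name, method name, and parameters are hypothetical.

```java
import java.util.List;

final class RecoverySketch {
    // Counts all events while suffering exactly one failure, which happens
    // right after `crashAfter` events have been read.
    // A snapshot is recorded every `snapshotEvery` events.
    // snapshotIncludesCount = true  -> the count is restored with the read
    //                                  position (models exactly once)
    // snapshotIncludesCount = false -> only the read position is restored, so
    //                                  replayed events are counted again
    //                                  (models at least once)
    static long countWithOneFailure(List<String> events, int snapshotEvery,
                                    int crashAfter, boolean snapshotIncludesCount) {
        long count = 0;
        int offset = 0;
        long snapshotCount = 0;
        int snapshotOffset = 0;
        boolean crashed = false;

        while (offset < events.size()) {
            count++;  // "process" events.get(offset) by counting it
            offset++;
            if (offset % snapshotEvery == 0) {
                snapshotOffset = offset;
                snapshotCount = count;
            }
            if (!crashed && offset == crashAfter) {
                crashed = true;
                offset = snapshotOffset;      // replay input from the snapshot
                if (snapshotIncludesCount) {
                    count = snapshotCount;    // roll the state back as well
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> events = List.of("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        // prints "exactly once: 10"
        System.out.println("exactly once: " + countWithOneFailure(events, 4, 6, true));
        // prints "at least once: 12"
        System.out.println("at least once: " + countWithOneFailure(events, 4, 6, false));
    }
}
```

With ten events, a snapshot every four events, and a failure after the sixth, events five and six are read twice. Restoring the count together with the read position hides the replay; keeping the count makes them show up as an over-count of two.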
Fault tolerance in Flink
▪ Flink guarantees exactly once
▪ End-to-end exactly once supported with specific sources and sinks
• E.g., Kafka → Flink → HDFS
▪ Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
Savepoints
▪ Maintaining stateful applications in production is challenging
▪ Flink savepoints: externally triggered, durable checkpoints
▪ Easy code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests
Explicit handling of time
Notions of time
1977  Episode IV: A New Hope
1980  Episode V: The Empire Strikes Back
1983  Episode VI: Return of the Jedi
1999  Episode I: The Phantom Menace
2002  Episode II: Attack of the Clones
2005  Episode III: Revenge of the Sith
2015  Episode VII: The Force Awakens
The episode number is called event time
The release year is called processing time
Out of order streams
[Figure: event time windows, arrival time windows, and instant event-at-a-time processing over a first and a second burst of events]
Why event time
▪ Most stream processors are limited to processing/arrival time, Flink can operate on event time as well
▪ Benefits of event time
• Accurate results for out of order data
• Sessions and unaligned windows
• Time travel (backstreaming)
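The first benefit can be illustrated with a small plain-Java sketch that counts the same events twice: once grouped by the time they happened and once grouped by the time they arrived. It is only a model under stated assumptions: the `TimedEvent` type, the window size of 10 time units, and the sample data are hypothetical, and the sketch works on a finished list, whereas a real stream processor must also decide when an event-time window is complete.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type carrying both notions of time.
final class TimedEvent {
    final long eventTime;   // when the event happened at the source
    final long arrivalTime; // when the processor received it

    TimedEvent(long eventTime, long arrivalTime) {
        this.eventTime = eventTime;
        this.arrivalTime = arrivalTime;
    }
}

final class TimeNotionsSketch {
    static final long WINDOW = 10; // window size in arbitrary time units

    // Counts events per window start, using either event or arrival time.
    static Map<Long, Integer> countByWindow(List<TimedEvent> events, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (TimedEvent e : events) {
            long t = useEventTime ? e.eventTime : e.arrivalTime;
            long windowStart = t - Math.floorMod(t, WINDOW);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<TimedEvent> events = List.of(
            new TimedEvent(1, 2),
            new TimedEvent(4, 5),
            new TimedEvent(8, 13),  // delayed: arrives in the next window
            new TimedEvent(12, 14),
            new TimedEvent(9, 21)); // out of order and two windows late
        // prints {0=4, 10=1}
        System.out.println(countByWindow(events, true));
        // prints {0=2, 10=2, 20=1}
        System.out.println(countByWindow(events, false));
    }
}
```

The event-time counts depend only on the data, so replaying the same events later gives the same answer. The arrival-time counts depend on delays in transit and would change on every replay.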
What's coming up in Flink
Evolution of streaming in Flink
▪ Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
▪ Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, NiFi, Elastic, ...), improved monitoring
▪ Flink 1.0 (Mar 2016): DataStream API stability, out of core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
Upcoming features
▪ SQL: ongoing work in collaboration with Apache Calcite
▪ Dynamic scaling: adapt resources to stream volume, historical stream processing
▪ Queryable state: ability to query the state inside the stream processor
▪ Mesos support
▪ More sources and sinks (e.g., Kinesis, Cassandra)
Queryable state
Using the stream processor as a database
Closing
Summary
▪ Stream processing is gaining momentum; it is the right paradigm for continuous data applications
▪ Even seemingly simple applications can be complex at scale and in production, so the choice of framework is crucial
▪ Flink: unique combination of capabilities, performance, and robustness
Flink in the wild
30 billion events daily
2 billion events in 10 1Gb machines
Picked Flink for "Saiki" data integration & distribution platform
See talks by at
Join the community!
▪ Follow: @ApacheFlink, @dataArtisans
▪ Read: flink.apache.org/blog, data-artisans.com/blog
▪ Subscribe: (news | user | dev) @ flink.apache.org
2 meetups next week in Bay Area!
April 5, San Francisco
April 6, San Jose
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

Powering Real-Time Decisions with Continuous Data Streams
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 

Apache Flink at Strata San Jose 2016

  • 3. Data streaming is becoming increasingly popular* (*Biggest understatement of 2016)
  • 4. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
  • 5. Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams
  • 7. Continuous counting
      • A seemingly simple application, but generally an unsolved problem
      • E.g., count visitors, impressions, interactions, clicks, etc.
      • Aggregations and OLAP cube operations are generalizations of counting
  • 8. Counting in batch architecture
      • Continuous ingestion
      • Periodic (e.g., hourly) files
      • Periodic batch jobs
  • 9. Problems with batch architecture
      • High latency
      • Too many moving parts
      • Implicit treatment of time
      • Out of order event handling
      • Implicit batch boundaries
  • 10. Counting in λ architecture
      • "Batch layer": what we had before
      • "Stream layer": approximate early results
  • 11. Problems with batch and λ
      • Way too many moving parts (and code duplication)
      • Implicit treatment of time
      • Out of order event handling
      • Implicit batch boundaries
  • 12. Counting in streaming architecture
      • Message queue ensures stream durability and replay
      • Stream processor ensures consistent counting
  • 13. Counting in Flink DataStream API: number of visitors in last hour by country

      DataStream<LogEvent> stream = env
          .addSource(new FlinkKafkaConsumer(...)) // create stream from Kafka
          .keyBy("country")                       // group by country
          .timeWindow(Time.minutes(60))           // window of size 1 hour
          .apply(new CountPerWindowFunction());   // do operations per window
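
What the windowed count on slide 13 computes can be sketched in plain Java, without Flink or Kafka. This is only an illustration under stated assumptions: the `LogEvent` fields, the `HourlyVisitorCount` class, and the use of a finite list instead of an unbounded stream are all hypothetical stand-ins, not the talk's code or Flink's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the LogEvent type named on the slide. */
final class LogEvent {
    final String country;
    final long timestampMillis;

    LogEvent(String country, long timestampMillis) {
        this.country = country;
        this.timestampMillis = timestampMillis;
    }
}

/** Toy equivalent of keyBy("country") + a one-hour tumbling window + count. */
final class HourlyVisitorCount {
    static final long WINDOW_MILLIS = 60L * 60L * 1000L;

    /** Returns window start (millis) -> country -> number of events in that window. */
    static Map<Long, Map<String, Long>> countPerWindow(List<LogEvent> events) {
        Map<Long, Map<String, Long>> result = new TreeMap<>();
        for (LogEvent e : events) {
            // Align the timestamp down to the start of its one-hour window.
            long windowStart = e.timestampMillis - Math.floorMod(e.timestampMillis, WINDOW_MILLIS);
            result.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(e.country, 1L, Long::sum);
        }
        return result;
    }
}
```

Calling `HourlyVisitorCount.countPerWindow(events)` groups events first by window and then by country. A stream processor does the same grouping incrementally and emits each window's counts once the window closes.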
  • 14. Counting hierarchy of needs (based on Maslow's hierarchy of needs): continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable
  • 20. Counting hierarchy of needs: continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable (1.1+)
  • 21. Rest of this talk: continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable
  • 23. Yahoo! Streaming Benchmark
      • Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
      • Focus on measuring end-to-end latency at low throughputs
      • First benchmark that was modeled after a real application
      • Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
  • 24. Benchmark task: counting!
      • Count ad impressions grouped by campaign
      • Compute aggregates over last 10 seconds
      • Make aggregates available for queries (Redis)
  • 25. Results (lower is better): Flink and Storm at sub-second latencies; Spark has a latency-throughput tradeoff (chart label: 170k events/sec)
  • 27. Handling high-volume streams
      • Scalability: how many events/sec can a system scale to, with infinite resources?
      • Scalability comes at a cost (systems add overhead to be scalable)
      • Efficiency: how many events/sec can a system scale to, with limited resources?
  • 28. Extending the Yahoo! benchmark
      • Yahoo! benchmark is a great starting point to understand engine behavior.
      • However, the benchmark stops at low write throughput and programs are not fault tolerant.
      • We extended the benchmark (1) to high volumes, and (2) to use Flink's built-in state.
      • http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  • 29. Results (higher is better): chart labels 500k events/sec, 3 million events/sec, 15 million events/sec. Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
  • 30. Fault tolerance and repeatability
  • 31. Stateful streaming applications
      • Most interesting stream applications are stateful
      • How to ensure that state is correct after failures?
  • 32. Fault tolerance definitions
      • At least once
          • May over-count
          • In some cases even under-count with different definition
      • Exactly once
          • Counts are the same after failure
      • End-to-end exactly once
          • Counts appear the same in an external sink (database, file system) after failure
  • 33. Fault tolerance in Flink
      • Flink guarantees exactly once
      • End-to-end exactly once supported with specific sources and sinks
          • E.g., Kafka → Flink → HDFS
      • Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
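
The difference between at-least-once and exactly-once counting on slides 32 and 33 can be made concrete with a toy, single-threaded simulation. This is a sketch under stated assumptions, not Flink's snapshot algorithm: the `SnapshotCounting` class and its methods are hypothetical, the "stream" is a replayable list, and a snapshot is simply the source offset stored together with the counts as of that offset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SnapshotCounting {
    /** A consistent snapshot: a source offset plus the counts as of exactly that offset. */
    static final class Snapshot {
        final int offset;
        final Map<String, Long> counts;

        Snapshot(int offset, Map<String, Long> counts) {
            this.offset = offset;
            this.counts = new HashMap<>(counts); // copy so later updates cannot leak in
        }
    }

    /** Counts the keys in log[from, to) on top of the given starting counts. */
    static Map<String, Long> process(List<String> log, int from, int to, Map<String, Long> start) {
        Map<String, Long> counts = new HashMap<>(start);
        for (int i = from; i < to; i++) {
            counts.merge(log.get(i), 1L, Long::sum);
        }
        return counts;
    }

    /** Crash at failAt, then restore BOTH offset and state from the snapshot taken at snapshotAt. */
    static Map<String, Long> exactlyOnce(List<String> log, int snapshotAt, int failAt) {
        Snapshot snap = new Snapshot(snapshotAt, process(log, 0, snapshotAt, new HashMap<>()));
        process(log, snapshotAt, failAt, snap.counts); // progress that is lost in the crash
        return process(log, snap.offset, log.size(), snap.counts);
    }

    /** Crash at failAt, rewind only the source offset but keep the state: replayed events count twice. */
    static Map<String, Long> atLeastOnce(List<String> log, int snapshotAt, int failAt) {
        Map<String, Long> stateAtFailure = process(log, 0, failAt, new HashMap<>());
        return process(log, snapshotAt, log.size(), stateAtFailure);
    }
}
```

With the log `a, b, a, c, a`, a snapshot after two events and a failure after four, `exactlyOnce` yields the same counts as a failure-free run, while `atLeastOnce` counts the two replayed events a second time.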
  • 34. Savepoints
      • Maintaining stateful applications in production is challenging
      • Flink savepoints: externally triggered, durable checkpoints
      • Easy code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, A/B tests
  • 36. Notions of time
      • 1977: Episode IV: A New Hope
      • 1980: Episode V: The Empire Strikes Back
      • 1983: Episode VI: Return of the Jedi
      • 1999: Episode I: The Phantom Menace
      • 2002: Episode II: Attack of the Clones
      • 2005: Episode III: Revenge of the Sith
      • 2015: Episode VII: The Force Awakens
      • The episode order is called event time; the release year is called processing time
  • 37. Out of order streams (figure: a first and a second burst of events, grouped by event time windows, by arrival time windows, and instant event-at-a-time)
  • 38. Why event time
      • Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
      • Benefits of event time
          • Accurate results for out of order data
          • Sessions and unaligned windows
          • Time travel (backstreaming)
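
To see why event time gives accurate results for out-of-order data, the same events can be windowed twice: once by when they happened and once by when they arrived. The sketch below is a hypothetical plain-Java illustration (`TimedEvent` and `WindowComparison` are invented names), and it deliberately ignores how a real stream processor decides that an event-time window is complete.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical event carrying both notions of time. */
final class TimedEvent {
    final long eventTime;   // when the event actually happened
    final long arrivalTime; // when it reached the stream processor

    TimedEvent(long eventTime, long arrivalTime) {
        this.eventTime = eventTime;
        this.arrivalTime = arrivalTime;
    }
}

final class WindowComparison {
    /** Window start -> count, assigning each event by its event time. */
    static Map<Long, Long> countByEventTime(List<TimedEvent> events, long windowSize) {
        Map<Long, Long> counts = new TreeMap<>();
        for (TimedEvent e : events) {
            counts.merge(windowStart(e.eventTime, windowSize), 1L, Long::sum);
        }
        return counts;
    }

    /** Window start -> count, assigning each event by its arrival time. */
    static Map<Long, Long> countByArrivalTime(List<TimedEvent> events, long windowSize) {
        Map<Long, Long> counts = new TreeMap<>();
        for (TimedEvent e : events) {
            counts.merge(windowStart(e.arrivalTime, windowSize), 1L, Long::sum);
        }
        return counts;
    }

    /** Aligns a timestamp down to the start of its tumbling window. */
    private static long windowStart(long timestamp, long windowSize) {
        return timestamp - Math.floorMod(timestamp, windowSize);
    }
}
```

If an event that happened at time 8 only arrives at time 12, windows of size 10 place it in the first window under event time but in the second window under arrival time, so the arrival-time counts no longer reflect when things happened.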
  • 39. What's coming up in Flink
  • 40. Evolution of streaming in Flink
      • Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
      • Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring
      • Flink 1.0 (Mar 2016): DataStream API stability, out of core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
  • 41. Upcoming features
      • SQL: ongoing work in collaboration with Apache Calcite
      • Dynamic scaling: adapt resources to stream volume, historical stream processing
      • Queryable state: ability to query the state inside the stream processor
      • Mesos support
      • More sources and sinks (e.g., Kinesis, Cassandra)
  • 44. Summary
      • Stream processing gaining momentum, the right paradigm for continuous data applications
      • Even seemingly simple applications can be complex at scale and in production: choice of framework crucial
      • Flink: unique combination of capabilities, performance, and robustness
  • 45. Flink in the wild (company logos not captured in this transcript)
      • 30 billion events daily
      • 2 billion events in 10 1Gb machines
      • Picked Flink for "Saiki" data integration & distribution platform
      • See talks by [logos] at [logo]
  • 46. Join the community!
      • Follow: @ApacheFlink, @dataArtisans
      • Read: flink.apache.org/blog, data-artisans.com/blog
      • Subscribe: (news | user | dev) @ flink.apache.org
  • 47. 2 meetups next week in Bay Area!
      • April 5, San Francisco
      • April 6, San Jose
      • What's new in Flink 1.0 & recent performance benchmarks with Flink
      • http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

Editor's notes

  1. 3 systems (batch), or 5 systems (streaming). Need to add a new system for millisecond alerts. What if I want to count every 5 minutes, not 1 hour? Just ignores out of order. What if I want to do sessions?