Introduction to Flink Streaming
Framework for modern streaming applications
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Stream abstraction vs streaming applications
● Stream as an abstraction
● Challenges with modern streaming applications
● Why not Spark streaming?
● Introduction to Flink
● Introduction to Flink streaming
● Flink Streaming API
● References
Use of stream in applications
● Streams are used both in big data and outside big data
to support two major use cases
○ Stream as abstraction layer
○ Stream as unbounded data to support real time
analysis
● The abstraction and real-time use cases have different needs and expectations of streams
● Different platforms use the word "stream" with different meanings
Stream as the abstraction
● A stream is a sequence of data elements made
available over time.
● A stream can be thought of as items on a conveyor belt
being processed one at a time rather than in large
batches.
● Streams can be unbounded (message queues) or bounded (files)
● Streams are becoming the new abstraction for building data pipelines.
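The bounded/unbounded distinction can be sketched in plain Python with generators. This is only an illustration of the abstraction, not Flink code; the function names `bounded_stream`, `unbounded_stream` and `to_upper` are made up for this sketch.

```python
import itertools
from typing import Iterable, Iterator


def bounded_stream(lines: Iterable[str]) -> Iterator[str]:
    """A bounded stream: it ends when the underlying 'file' is exhausted."""
    for line in lines:
        yield line


def unbounded_stream() -> Iterator[str]:
    """An unbounded stream: a stand-in for a message queue that never ends."""
    for n in itertools.count():
        yield f"message {n}"


def to_upper(stream: Iterable[str]) -> Iterator[str]:
    """One pipeline step, applied element by element as data becomes available."""
    for element in stream:
        yield element.upper()


# The same pipeline step works on both kinds of stream.
file_result = list(to_upper(bounded_stream(["first line", "second line"])))
# An unbounded stream can only ever be consumed partially; take three elements.
queue_result = list(itertools.islice(to_upper(unbounded_stream()), 3))
```

The point of the sketch is that `to_upper` does not need to know whether its input ever ends.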
Streams as abstraction outside big data
● Streams have been used as an abstraction outside big data for the last few years
● Some of them are
○ Reactive streams like akka-streams, akka-http
○ Java 8 streams
○ RxJava etc
● These uses of streams are not concerned with real-time analysis
Streams for real time analysis
● In this use case, a stream is viewed as unbounded data that is available with low latency, as soon as it arrives in the system
● The stream can be processed at run time by a non-stream abstraction
● So the focus in these scenarios is on modelling only the API as streams, not the implementation
● Ex : Spark Streaming
Stream abstraction in big data
● The stream is the new abstraction layer people are exploring in big data
● With the right implementation, streams can support both streaming and batch applications much more effectively than existing abstractions.
● Batch on streaming is a new way of looking at processing, rather than treating streaming as a special case of batch
● Batch on a streaming engine can be faster than dedicated batch processing
Frameworks with stream as abstraction
Apache Flink
● Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
● Flink provides
○ DataSet API - for bounded streams
○ DataStream API - for unbounded streams
● Flink embraces the stream as the abstraction to implement its dataflow.
Flink stack
Flink history
● The Stratosphere project started at the Technical University of Berlin in 2009
● Entered the Apache Incubator in March 2014
● Became a top-level project in Dec 2014
● Started as a stream engine for batch processing
● Streaming support was added a few versions ago
● data Artisans is a company founded by the core Flink team
Flink streaming
● Flink Streaming is an extension of the core Flink API for
high-throughput, low-latency data stream processing
● Supports many data sources like Flume, Twitter and ZeroMQ, as well as any user-defined data source
● Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API
● Sounds much like what Spark Streaming promises!
Streaming is not fast batch processing
● Most streaming frameworks focus too much on latency when they develop streaming extensions
● Both Storm and Spark Streaming view streaming as a low-latency batch processing system
● Though latency plays an important role in real-time applications, the needs and challenges go beyond it
● Addressing the complex needs of modern streaming systems requires a fresh view of streaming APIs
Streaming in Lambda architecture
● In the Lambda architecture, streaming is viewed as a limited, approximate, low-latency computing system compared to a batch system
● So we usually run a streaming system to get low-latency approximate results, and a batch system to get accurate results at high latency
● All these limitations of streaming stem from conventional thinking and implementations
● The new idea: why can't streaming itself be a low-latency, accurate system?
Google Dataflow
● In their Dataflow paper, Google articulated the first modern streaming framework for low-latency, exactly-once, accurate stream applications
● It describes a single system which can replace the need for separate streaming and batch processing systems
● This single-system approach is known as the Kappa architecture
● Modern stream frameworks embrace this over the Lambda architecture
● Google Dataflow is open sourced under the name Apache Beam
Google Dataflow and Flink streaming
● Flink adopted the Dataflow ideas for its streaming API
● The Flink streaming API went through a big overhaul in version 1.0 to embrace these ideas
● It was relatively easy to adopt the ideas, as both Google Dataflow and Flink use streaming as the abstraction
● Spark 2.0 may add some of these ideas in its Structured Streaming effort
Needs of modern real time applications
● Ability to handle out-of-order (late) events in unbounded data
● Ability to correlate events along different dimensions of time
● Ability to correlate events using custom, application-specific characteristics such as sessions
● Ability to do both micro-batch and event-at-a-time processing on the same framework
● Support for complex stream processing libraries
Mandatory wordcount
● Streams are represented using DataStream in Flink streaming
● DataStream supports both RDD-like and Dataset-like APIs for manipulation
● In this example,
○ Read from a socket to create a DataStream
○ Use map, keyBy and sum operations for aggregation
● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
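The map, keyBy and sum steps of the word count can be modelled in plain Python as below. This is a conceptual sketch only, not the Flink API and not the code from the linked repository; `streaming_word_count` is a hypothetical name, and a list of strings stands in for the socket source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def streaming_word_count(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit an updated (word, count) pair for every word as it arrives."""
    counts: Dict[str, int] = defaultdict(int)  # keyed state: one counter per word
    for line in lines:                         # each line is one incoming event
        for word in line.split():              # map: split the line into words
            counts[word] += 1                  # keyBy(word) followed by sum
            yield (word, counts[word])         # emit the running total downstream


# A list stands in for the socket source used in the real example.
updates = list(streaming_word_count(["hello flink", "hello stream"]))
```

Note that a result is emitted per event rather than per batch, and that the counters persist across events; this is the stateful, event-at-a-time behaviour the deck contrasts with mini batches.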
Flink streaming vs Spark streaming
Spark Streaming | Flink Streaming
Streams are represented using DStreams | Streams are represented using DataStreams
Stream is discretized to mini batch | Stream is not discretized
Supports RDD DSL | Supports Dataset-like DSL
By default stateless | By default stateful at operator level
Runs mini batch for each interval | Runs pipelined operators for each event that comes in
Near real time | Real time
Discretizing the stream
● By default, Flink does not need to discretize the stream to work
● But using the window API, we can create a discretized stream similar to Spark's
● In this case the state is discarded as soon as each batch is computed
● This way you can mimic Spark micro batches in Flink
● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
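A rough plain-Python model of this discretized word count follows. It assumes events arrive as (timestamp, line) pairs in timestamp order and uses fixed-size windows; `windowed_word_count` and the sample data are invented for illustration and are not the Flink window API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def windowed_word_count(
    events: Iterable[Tuple[int, str]], window_size: int
) -> List[Tuple[int, Dict[str, int]]]:
    """Count words per fixed-size window; events must be in timestamp order."""
    results: List[Tuple[int, Dict[str, int]]] = []
    current_window: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, line in events:
        window_start = timestamp - (timestamp % window_size)
        if current_window is not None and window_start != current_window:
            results.append((current_window, dict(counts)))  # emit the finished batch
            counts = Counter()                               # discard its state
        current_window = window_start
        counts.update(line.split())
    if current_window is not None:
        results.append((current_window, dict(counts)))       # flush the last window
    return results


batches = windowed_word_count(
    [(1, "hello flink"), (3, "hello"), (12, "hello spark")], window_size=10
)
```

Unlike the running count, "hello" starts again from zero in the second window, because the state does not survive the batch.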
Understanding dataflow of flink
● All programs in Flink, both batch and streaming, are represented using a dataflow
● This dataflow embodies the stream abstraction provided by the Flink runtime
● The dataflow treats all data as a stream and processes it using a long-running operator model
● This is quite different from Spark's RDD model
● The Flink UI allows us to inspect the dataflow of a given Flink program
Running in local mode
● bin/start-local.sh
● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
Dataflow for wordcount example
Operator fusing
● The Flink optimizer fuses (chains) operators for efficiency
● All the fused operators run in the same thread, which saves the serialization and deserialization cost between the operators
● For all fused operators, Flink generates a nested function which comprises all the code from the operators
● This is much more efficient than RDD-level optimization
● Spark's Dataset API is planning to support this functionality
● You can disable this with env.disableOperatorChaining()
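The idea of fusing operators into one nested function can be illustrated in plain Python. This is only an analogy for element-wise operators, under the assumption that each operator is a simple one-in, one-out function; it says nothing about how the Flink runtime actually builds its chains, and `fuse`, `run_unfused` and `run_fused` are hypothetical names.

```python
from functools import reduce
from typing import Callable, Iterable, List


def fuse(*operators: Callable) -> Callable:
    """Combine element-wise operators into a single nested function."""
    return reduce(lambda f, g: lambda x: g(f(x)), operators)


def run_unfused(data: Iterable, operators: List[Callable]) -> list:
    """One pass per operator, with an intermediate collection between operators."""
    result = list(data)
    for op in operators:
        result = [op(x) for x in result]  # hand-over point between two operators
    return result


def run_fused(data: Iterable, operators: List[Callable]) -> list:
    """A single pass: each element flows through all operators in one call."""
    fused = fuse(*operators)
    return [fused(x) for x in data]


ops = [str.strip, str.lower, len]
lines = ["  Hello ", "FLINK  "]
unfused_result = run_unfused(lines, ops)
fused_result = run_fused(lines, ops)
```

Both variants compute the same result; the fused one simply avoids the hand-over between operators, which in a distributed engine is where serialization and thread switches would occur.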
Dataflow without operator fusing
Flink streaming vs Spark streaming
Spark Streaming | Flink Streaming
Uses RDD distribution model for processing | Uses pipelined stream processing paradigm for processing
Parallelism is done at batch level | Parallelism is controlled at operator level
Uses RDD immutability for fault recovery | Uses asynchronous barriers for fault recovery
RDD-level optimization for stream optimization | Operator fusing for stream optimization
Window API
● Powerful API to track and do custom state analysis
● Types of windows
○ Time window
■ Tumbling window
■ Sliding window
○ Non time based window
■ Count window
● Ex : WindowExample.scala
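How the three window types group elements can be sketched in plain Python. Windows are modelled as half-open [start, end) intervals over integer timestamps; the function names are made up for this sketch and are not Flink API calls.

```python
from typing import List, Tuple


def tumbling_window(timestamp: int, size: int) -> Tuple[int, int]:
    """The single non-overlapping [start, end) window containing the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """Every [start, end) window of the given size, started every `slide` units,
    that contains the timestamp. Windows overlap when slide < size."""
    windows = []
    start = timestamp - (timestamp % slide)   # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def count_windows(elements: list, count: int) -> List[list]:
    """Groups of exactly `count` elements; an incomplete last group is not emitted."""
    full = len(elements) - (len(elements) % count)
    return [elements[i:i + count] for i in range(0, full, count)]
```

For example, with a size of 10 and a slide of 5, an element with timestamp 7 belongs to two sliding windows but to only one tumbling window of size 5.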
Anatomy of Window API
● The window API is made of three components
○ Window assigner
○ Trigger
○ Evictor
● These three components make up all of the window API in Flink
Window Assigner
● A function that determines which window a given element belongs to
● Responsible for creating windows and assigning elements to them
● Two types of window assigner
○ Time based window assigner
○ GlobalWindow assigner
● Users can write their own custom window assigner too
Trigger
● A trigger is a function responsible for determining when a given window is evaluated
● In a time based window, this function waits until the window's time has elapsed before triggering
● But in a non time based window, it can use custom logic to determine when to evaluate a given window
● In our example, the number of records in a given window determines whether to trigger or not.
● WindowAnatomy.scala
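The interplay of assigner and trigger can be modelled in a few lines of plain Python. This is a simplified sketch, not the code in WindowAnatomy.scala: the evictor is left out, firing always purges the window, and `run_windows`, `global_assigner` and `count_trigger` are hypothetical names.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def run_windows(
    elements: Iterable,
    assigner: Callable[[object], Hashable],
    trigger: Callable[[list], bool],
) -> List[Tuple[Hashable, list]]:
    """Assign each element to a window, then ask the trigger whether to fire."""
    windows: Dict[Hashable, list] = {}
    fired: List[Tuple[Hashable, list]] = []
    for element in elements:
        window = assigner(element)                 # window assigner picks the window
        contents = windows.setdefault(window, [])
        contents.append(element)
        if trigger(contents):                      # trigger decides when to evaluate
            fired.append((window, list(contents)))
            del windows[window]                    # fire and purge the window state
    return fired


def global_assigner(element: object) -> str:
    """Non time based assigner: every element lands in the same window."""
    return "global"


def count_trigger(limit: int) -> Callable[[list], bool]:
    """Fire once the window holds `limit` records."""
    return lambda contents: len(contents) >= limit


fired = run_windows(range(1, 8), global_assigner, count_trigger(3))
```

With seven input elements and a limit of three, two windows fire and the seventh element stays buffered, waiting for the trigger.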
Building custom session window
● We want to track the session of a user
● Each session is identified by a sessionID
● We get an event when the session starts
● We evaluate the session when we get the end-of-session event
● For this, we implement our own custom window trigger which tracks the end of the session
● Ex : SessionWindowExample.scala
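The session logic can be sketched in plain Python as follows. The event shape (a session ID plus a payload, with the payload "end" marking the end of a session) is an assumption made for this sketch, as is the choice to report the number of events per session; neither is taken from SessionWindowExample.scala.

```python
from typing import Dict, Iterable, List, Tuple

# (session_id, payload); the payload "end" is a hypothetical end-of-session marker.
Event = Tuple[str, str]


def evaluate_sessions(events: Iterable[Event]) -> List[Tuple[str, int]]:
    """Collect events per session and evaluate a session when its end event arrives."""
    open_sessions: Dict[str, List[str]] = {}
    results: List[Tuple[str, int]] = []
    for session_id, payload in events:
        if payload == "end":
            # Custom trigger: the end-of-session event fires the window.
            collected = open_sessions.pop(session_id, [])
            results.append((session_id, len(collected)))
        else:
            # Assign the event to the window of its session.
            open_sessions.setdefault(session_id, []).append(payload)
    return results


session_results = evaluate_sessions([
    ("s1", "start"), ("s2", "start"), ("s1", "click"), ("s1", "end"),
    ("s2", "click"), ("s2", "click"), ("s2", "end"),
])
```

Sessions interleave freely in the input; each one is evaluated independently, at the moment its own end event is seen.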
Concept of Time in Flink streaming
● Time plays an important role in a streaming application
● So the ability to express time in a flexible way is a very important feature of a modern streaming application
● Flink supports three kinds of time
○ Processing time
○ Event time
○ Ingestion time
● Event time is one of the important features of Flink, and it complements the custom window API
Understanding event time
● Time in Flink needs to address the following two questions
○ When did the event occur?
○ How much time has passed since the event?
● The first question is answered by assigning timestamps
● The second question is answered by the concept of watermarks
● Ex : EventTimeExample.scala
Watermarks in Event Time
● A watermark is a special signal which signifies the flow of time in Flink
● In the diagram, w(20) signifies that 20 units of time have passed at the source
● Watermarks allow Flink to support different time abstractions
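How timestamps and watermarks work together can be modelled in plain Python. This is a simplified sketch under several assumptions: records are tagged tuples, windows are fixed-size [start, end) intervals, a window fires once a watermark reaches its end, and events that arrive after their window has fired are not handled. Flink's real firing rule and late-data handling are more detailed; `event_time_windows` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple


def event_time_windows(stream: Iterable[tuple], size: int) -> List[Tuple[int, List[str]]]:
    """Group events by event time; fire a window when a watermark passes its end.

    Records are either ("event", timestamp, value) or ("watermark", timestamp).
    """
    open_windows: Dict[int, List[str]] = {}
    fired: List[Tuple[int, List[str]]] = []
    for record in stream:
        if record[0] == "event":
            _, timestamp, value = record
            start = timestamp - (timestamp % size)       # window chosen by event time
            open_windows.setdefault(start, []).append(value)
        else:
            _, watermark = record                        # "time has advanced to here"
            for start in sorted(open_windows):
                if start + size <= watermark:            # window is complete
                    fired.append((start, open_windows.pop(start)))
    return fired


# Events arrive out of order: timestamp 7 shows up after timestamp 12.
fired_windows = event_time_windows(
    [
        ("event", 12, "a"), ("event", 7, "b"), ("event", 15, "c"),
        ("watermark", 10),
        ("event", 21, "d"),
        ("watermark", 20),
    ],
    size=10,
)
```

The out-of-order event with timestamp 7 still lands in the window [0, 10), because assignment uses the event's own timestamp and the window only fires when w(10) arrives.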
References
● http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● http://blog.madhukaraphatak.com/categories/flink-streaming/
● https://www.youtube.com/watch?v=y7f6wksGM6c
● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
● https://www.youtube.com/watch?v=v_exWHj1vmo
● http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink

Weitere ähnliche Inhalte

Was ist angesagt?

Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberHostedbyConfluent
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkabandatamantra
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to datasetdatamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streamingdatamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsShashank L
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyHostedbyConfluent
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIdatamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0datamantra
 

Was ist angesagt? (20)

Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 

Andere mochten auch

Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learningdatamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPdatamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scaledatamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalystdatamantra
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark mldatamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scaladatamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientistsdatamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2datamantra
 
Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz
 Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz
Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namazColorado Theology University
 
TermiChack
TermiChackTermiChack
TermiChackBBSheshi
 
İL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihi
İL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihiİL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihi
İL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihiColorado Theology University
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentAnna Yen
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkJamie Grier
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Flink Forward
 
Automatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia KalavriAutomatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia KalavriFlink Forward
 
Jamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache FlinkJamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache FlinkFlink Forward
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestDataGyula Fóra
 

Andere mochten auch (17)

Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz
 Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz
Il üniversitesi islam fikhi_2.2.ef_'ali mükellefin_namaz
 
TermiChack
TermiChackTermiChack
TermiChack
 
İL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihi
İL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihiİL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihi
İL Üniversitesi - 1.15.habesistan hicreti asr i saadet-islam tarihi
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environment
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming Benchmark
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?
 
Automatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia KalavriAutomatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia Kalavri
 
Jamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache FlinkJamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache Flink
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 

Ähnlich wie Introduction to Flink Streaming

Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in SparkDigital Vidya
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Sparkdatamantra
 
Automation + dev ops summit hail hydrate! from stream to lake
Automation + dev ops summit   hail hydrate! from stream to lakeAutomation + dev ops summit   hail hydrate! from stream to lake
Automation + dev ops summit hail hydrate! from stream to lakeTimothy Spann
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
The Fn Project: A Quick Introduction (December 2017)
The Fn Project: A Quick Introduction (December 2017)The Fn Project: A Quick Introduction (December 2017)
The Fn Project: A Quick Introduction (December 2017)Oracle Developers
 
[WSO2Con Asia 2018] Architecting for Container-native Environments
[WSO2Con Asia 2018] Architecting for Container-native Environments[WSO2Con Asia 2018] Architecting for Container-native Environments
[WSO2Con Asia 2018] Architecting for Container-native EnvironmentsWSO2
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureTimothy Spann
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Deepu K Sasidharan
 
Devoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterDevoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterJulien Dubois
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceAnant Corporation
 
Delivering Cloud Native Batch Solutions - Dodd Pfeffer
Delivering Cloud Native Batch Solutions - Dodd PfefferDelivering Cloud Native Batch Solutions - Dodd Pfeffer
Delivering Cloud Native Batch Solutions - Dodd PfefferVMware Tanzu
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015aspyker
 
Getting_Started_with_Salesforce_Flow_for_Developers_(In-person_event)_.pptx
Getting_Started_with_Salesforce_Flow_for_Developers_(In-person_event)_.pptxGetting_Started_with_Salesforce_Flow_for_Developers_(In-person_event)_.pptx
Getting_Started_with_Salesforce_Flow_for_Developers_(In-person_event)_.pptxShams Pirzada
 

Ähnlich wie Introduction to Flink Streaming (20)

Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
A Tool For Big Data Analysis using Apache Spark
Introduction to Flink Streaming

  • 1. Introduction to Flink Streaming Framework for modern streaming applications https://github.com/phatak-dev/flink-examples
  • 2. ● Madhukara Phatak ● Big data consultant and trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Stream abstraction vs streaming applications ● Stream as an abstraction ● Challenges with modern streaming applications ● Why not Spark streaming? ● Introduction to Flink ● Introduction to Flink streaming ● Flink Streaming API ● References
  • 4. Use of stream in applications ● Streams are used both in big data and outside big data to support two major use cases ○ Stream as abstraction layer ○ Stream as unbounded data to support real time analysis ● Abstraction and real time analysis have different needs and expectations of streams ● Different platforms use the term stream with different meanings
  • 5. Stream as the abstraction ● A stream is a sequence of data elements made available over time. ● A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches. ● Streams can be unbounded (message queues) or bounded (files) ● Streams are becoming the new abstraction for building data pipelines.
  • 6. Streams as abstraction outside big data ● Streams have been used as an abstraction outside big data in the last few years ● Some examples are ○ Reactive streams like akka-streams, akka-http ○ Java 8 streams ○ RxJava etc ● These uses of streams are not concerned with real time analysis
  • 7. Streams for real time analysis ● In this use case, a stream is viewed as unbounded data that is available with low latency, as soon as it arrives in the system ● The stream can be processed at run time using a non-stream abstraction ● So the focus in these scenarios is only on modeling the API as streams, not the implementation ● Ex: Spark Streaming
  • 8. Stream abstraction in big data ● Stream is the new abstraction layer people are exploring in big data ● With the right implementation, streams can support both streaming and batch applications much more effectively than existing abstractions ● Batch on streaming is a new way of looking at processing, rather than treating streaming as a special case of batch ● Batch can be faster on streaming than on dedicated batch processing
  • 9. Frameworks with stream as abstraction
  • 10. Apache Flink ● Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. ● Flink provides ○ DataSet API - for bounded streams ○ DataStream API - for unbounded streams ● Flink embraces the stream as the abstraction to implement its dataflow.
  • 12. Flink history ● Stratosphere project started at the Technical University of Berlin in 2009 ● Entered Apache incubation in March 2014 ● Became a top level project in Dec 2014 ● Started as a stream engine for batch processing ● Added streaming support a few versions ago ● data Artisans is the company founded by the core Flink team
  • 13. Flink streaming ● Flink Streaming is an extension of the core Flink API for high-throughput, low-latency data stream processing ● Supports many data sources like Flume, Twitter, ZeroMQ and also any user defined data source ● Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API ● Sounds much like what Spark Streaming promises!
  • 14. Streaming is not fast batch processing ● Most streaming frameworks focus too much on latency when they develop streaming extensions ● Both Storm and Spark Streaming view streaming as a low latency batch processing system ● Though latency plays an important role in real time applications, the needs and challenges go beyond it ● Addressing the complex needs of modern streaming systems requires a fresh view of streaming APIs
  • 15. Streaming in Lambda architecture ● In lambda architecture, streaming is viewed as a limited, approximate, low latency computing system compared to a batch system ● So we usually run a streaming system to get low latency approximate results and a batch system to get accurate results at high latency ● All these limitations of streaming stem from conventional thinking and implementations ● The new idea: why not make streaming itself a low latency, accurate system?
  • 16. Google Dataflow ● In their Dataflow paper, Google articulated the first modern streaming framework for low latency, exactly once, accurate stream applications ● It describes a single system that can replace the need for separate streaming and batch processing systems ● Known as Kappa architecture ● Modern stream frameworks embrace this over lambda architecture ● Google Dataflow is open sourced under the name Apache Beam
  • 17. Google Dataflow and Flink streaming ● Flink adopted Dataflow ideas for its streaming API ● The Flink streaming API went through a big overhaul in version 1.0 to embrace these ideas ● It was relatively easy to adopt these ideas, as both Google Dataflow and Flink use streaming as the abstraction ● Spark 2.0 may add some of these ideas in its structured stream processing effort
  • 18. Needs of modern real time applications ● Ability to handle out-of-order events in unbounded data ● Ability to correlate events with different dimensions of time ● Ability to correlate events using custom, application-based characteristics like sessions ● Ability to do both micro-batch and event-at-a-time processing on the same framework ● Support for complex stream processing libraries
  • 19. Mandatory wordcount ● Streams are represented using DataStream in Flink streaming ● DataStream supports both RDD-like and Dataset-like APIs for manipulation ● In this example, ○ Read from a socket to create a DataStream ○ Use map, keyBy and sum operations for aggregation ● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
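
The per-event, stateful behaviour of the map, keyBy and sum steps above can be illustrated with a small plain-Scala model. This is only a sketch of the semantics and does not use the Flink API; the object name `RunningWordCount` and the method `process` are hypothetical names chosen for illustration.

```scala
// Plain-Scala model of a keyed running sum; NOT the Flink API.
object RunningWordCount {
  // Splits each incoming line into words and, for every word as it
  // arrives, emits the updated (word, count) pair. The mutable map
  // plays the role of per-key state kept by the sum operator.
  def process(lines: Seq[String]): Seq[(String, Int)] = {
    val state = scala.collection.mutable.Map.empty[String, Int]
    for {
      line <- lines
      word <- line.split("\\s+").toSeq if word.nonEmpty
    } yield {
      val updated = state.getOrElse(word, 0) + 1
      state(word) = updated
      (word, updated)
    }
  }
}
```

Note that every incoming word produces an output record with the count so far, rather than one result per batch.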
  • 20. Flink streaming vs Spark streaming
      Spark Streaming | Flink Streaming
      Streams are represented using DStreams | Streams are represented using DataStreams
      Stream is discretized into mini batches | Stream is not discretized
      Supports RDD DSL | Supports Dataset-like DSL
      Stateless by default | Stateful at operator level by default
      Runs a mini batch for each interval | Runs pipelined operators for each event that comes in
      Near real time | Real time
  • 21. Discretizing the stream ● By default, Flink does not need any discretization of the stream to work ● But using the window API, we can create a discretized stream similar to Spark's ● In this case, state is discarded as soon as each batch is computed ● This way you can mimic Spark micro batches in Flink ● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
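
To make the contrast with a running count concrete, here is a plain-Scala sketch of a windowed word count in which counts are computed per window and nothing is carried over to the next one. It is a model of the idea, not Flink code; `WindowedWordCount`, the `(timestamp, word)` event shape and the window size parameter are assumptions made for this illustration.

```scala
// Plain-Scala model of discretizing a stream into fixed windows; NOT the Flink API.
object WindowedWordCount {
  // events: (timestamp, word) pairs with non-negative timestamps.
  // Returns, per window start, the word counts inside that window only.
  def count(events: Seq[(Long, String)], size: Long): Map[Long, Map[String, Int]] =
    events
      .groupBy { case (ts, _) => ts - (ts % size) } // window start for each event
      .map { case (start, inWindow) =>
        start -> inWindow.groupBy(_._2).map { case (word, es) => word -> es.size }
      }
}
```

Each window starts counting from zero, which mirrors how a micro batch sees only its own data.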
  • 22. Understanding dataflow of Flink ● All programs in Flink, both batch and streaming, are represented using a dataflow ● This dataflow represents the stream abstraction provided by the Flink runtime ● The dataflow treats all data as a stream and processes it using a long running operator model ● This is quite different from the RDD model of Spark ● The Flink UI allows us to understand the dataflow of a given Flink program
  • 23. Running in local mode ● bin/start-local.sh ● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
  • 25. Operator fusing ● The Flink optimiser fuses operators for efficiency ● All the fused operators run in the same thread, which saves the serialization and deserialization cost between operators ● For all fused operators, Flink generates a nested function which comprises all the code from the operators ● This is much more efficient than RDD optimization ● Dataset is planning to support this functionality ● You can disable this with env.disableOperatorChaining()
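
As a rough analogy for the "nested function" idea above, three operators can be composed into a single function that runs in one call chain with no hand-off between stages. This is plain Scala function composition, not how the Flink runtime is implemented; the operator names `parse`, `double` and `format` are hypothetical.

```scala
// Analogy only: fusing operators modelled as function composition.
object OperatorChaining {
  // Three independent "operators".
  val parse: String => Int = _.trim.toInt
  val double: Int => Int = _ * 2
  val format: Int => String = n => s"value=$n"

  // The fused operator: one nested function, so an element passes
  // through all three steps in the same thread without being
  // serialized between them.
  val fused: String => String = parse andThen double andThen format
}
```

Calling `OperatorChaining.fused(" 21 ")` applies all three steps in sequence to the same element.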
  • 26. Dataflow without operator fusing
  • 27. Flink streaming vs Spark streaming
      Spark Streaming | Flink Streaming
      Uses RDD distribution model for processing | Uses pipelined stream processing paradigm for processing
      Parallelism is done at batch level | Parallelism is controlled at operator level
      Uses RDD immutability for fault recovery | Uses asynchronous barriers for fault recovery
      RDD level optimization for stream optimization | Operator fusing for stream optimization
  • 28. Window API ● Powerful API to track and do custom state analysis ● Types of windows ○ Time window ■ Tumbling window ■ Sliding window ○ Non time based window ■ Count window ● Ex : WindowExample.scala
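
The difference between tumbling and sliding time windows comes down to which windows a timestamp is assigned to. The following plain-Scala sketch shows one common way to compute that assignment for non-negative timestamps. It is an illustration rather than the Flink API; `WindowAssignment`, `TimeWindow` and the parameter names are assumptions of this sketch.

```scala
// Plain-Scala model of time window assignment; NOT the Flink API.
object WindowAssignment {
  // A window covering the half-open interval [start, end).
  final case class TimeWindow(start: Long, end: Long)

  // Tumbling: each timestamp belongs to exactly one window of length `size`.
  def tumbling(ts: Long, size: Long): TimeWindow = {
    val start = ts - (ts % size)
    TimeWindow(start, start + size)
  }

  // Sliding: windows of length `size` start every `slide` units, so a
  // timestamp can belong to several overlapping windows.
  def sliding(ts: Long, size: Long, slide: Long): List[TimeWindow] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > ts - size)
      .map(start => TimeWindow(start, start + size))
      .toList
  }
}
```

With a size of 10 and a slide of 5, timestamp 17 falls into two sliding windows but only one tumbling window.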
  • 29. Anatomy of Window API ● The window API is made of 3 different components ● The three components of a window are ○ Window assigner ○ Trigger ○ Evictor ● These three components make up the entire window API in Flink
  • 30. Window Assigner ● A function that determines which window a given element belongs to ● Responsible for creating windows and assigning elements to them ● Two types of window assigner ○ Time based window assigner ○ GlobalWindow assigner ● Users can also write their own custom window assigner
  • 31. Trigger ● A trigger is a function responsible for determining when a given window is evaluated ● In a time based window, this function waits until the window's time has elapsed before triggering ● In a non time based window, it can use custom logic to determine when to evaluate a given window ● In our example, the number of records in a given window determines whether to trigger ● WindowAnatomy.scala
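
A count-based trigger on a single global window can be modelled in a few lines of plain Scala. The sketch below is not WindowAnatomy.scala and does not use the Flink API; `CountTriggerModel`, `fire` and `maxCount` are hypothetical names, and purging the buffer after each firing is an assumption of this model.

```scala
// Plain-Scala model of a global window with a count trigger; NOT the Flink API.
object CountTriggerModel {
  // Returns the contents of the window each time the trigger fires.
  def fire[A](elements: Seq[A], maxCount: Int): List[List[A]] = {
    val buffer = scala.collection.mutable.ArrayBuffer.empty[A]
    val fired = scala.collection.mutable.ArrayBuffer.empty[List[A]]
    for (e <- elements) {
      buffer += e                    // assigner: every element goes to the one global window
      if (buffer.size >= maxCount) { // trigger: fire once enough records have arrived
        fired += buffer.toList
        buffer.clear()               // assumed here: window contents are purged after firing
      }
    }
    fired.toList
  }
}
```

Elements that arrive after the last firing stay buffered and produce no output until the count is reached again.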
  • 32. Building custom session window ● We want to track the session of a user ● Each session is identified by a sessionID ● We get an event when the session starts ● We evaluate the session when we get the end-of-session event ● For this, we implement our own custom window trigger which tracks the end of a session ● Ex: SessionWindowExample.scala
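
The session logic described above can be sketched in plain Scala as follows. This is a model of the idea, not the code in SessionWindowExample.scala and not the Flink API; the `Event` fields `value` and `isEnd`, and summing values as the session result, are assumptions made for illustration.

```scala
// Plain-Scala model of a session window closed by an end-of-session event; NOT the Flink API.
object SessionWindowModel {
  // Hypothetical event shape: a session id, a numeric payload and an end marker.
  final case class Event(sessionId: String, value: Int, isEnd: Boolean)

  // Returns (sessionId, sum of values) for every session whose end
  // event has arrived, in order of completion.
  def evaluate(events: Seq[Event]): List[(String, Int)] = {
    val open = scala.collection.mutable.Map.empty[String, Int]
    val closed = scala.collection.mutable.ArrayBuffer.empty[(String, Int)]
    for (e <- events) {
      val total = open.getOrElse(e.sessionId, 0) + e.value
      if (e.isEnd) {
        closed += ((e.sessionId, total)) // trigger: the end event evaluates the session
        open -= e.sessionId              // the session state is dropped afterwards
      } else {
        open(e.sessionId) = total        // keep accumulating while the session is open
      }
    }
    closed.toList
  }
}
```

Sessions that never receive an end event remain open and are not evaluated.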
  • 33. Concept of Time in Flink streaming ● Time plays an important role in a streaming application ● So the ability to express time in a flexible way is a very important feature of a modern streaming application ● Flink supports three kinds of time ○ Processing time ○ Event time ○ Ingestion time ● Event time is one of the important features of Flink, and it complements the custom window API
  • 34. Understanding event time ● Time in Flink needs to answer the following two questions ○ When did the event occur? ○ How much time has passed since the event? ● The first question is answered by assigning timestamps ● The second question is answered by understanding the concept of watermarks ● Ex: EventTimeExample.scala
  • 35. Watermarks in Event Time ● A watermark is a special signal which signifies the flow of time in Flink ● In the diagram on this slide, w(20) signifies that 20 units of time have passed at the source ● Watermarks allow Flink to support different time abstractions
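
One simple way to picture watermarks is as "the largest event timestamp seen so far, minus an allowed delay". The plain-Scala sketch below models that strategy. It is not the Flink API and not the code in EventTimeExample.scala; `WatermarkModel`, `maxDelay` and the firing rule are assumptions of this illustration.

```scala
// Plain-Scala model of watermark generation; NOT the Flink API.
object WatermarkModel {
  // Watermark after each event, assuming events arrive at most
  // `maxDelay` time units out of order (an assumption of this sketch).
  def watermarks(timestamps: Seq[Long], maxDelay: Long): Seq[Long] =
    timestamps
      .scanLeft(Long.MinValue)((seen, ts) => math.max(seen, ts)) // running maximum timestamp
      .tail                                                      // drop the initial seed value
      .map(_ - maxDelay)

  // In this model, a window ending at `windowEnd` (exclusive) can be
  // evaluated once the watermark has reached that end.
  def canFire(windowEnd: Long, watermark: Long): Boolean = watermark >= windowEnd
}
```

An out-of-order event (such as 15 arriving after 20) does not move the watermark backwards, so event time only ever advances.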
  • 36. References ● http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf ● http://blog.madhukaraphatak.com/categories/flink-streaming/ ● https://www.youtube.com/watch?v=y7f6wksGM6c ● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at ● https://www.youtube.com/watch?v=v_exWHj1vmo ● http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink