Google Cloud Dataflow
Two Worlds Become a Much Better One
Eric Schmidt, Product Manager
cloude@google.com
Goals
1. You leave here understanding the fundamentals of Cloud Dataflow and
possibly have drawn some comparisons to existing data processing models.
2. We have some fun.
The Cloud Big Data
Promise of the Cloud and Big Data
Optimized
The Cloud For Big Data
Promise of the Cloud and Big Data
Batch Streaming
Data Processing
And
Batch Streaming
Data Processing
Time to answer some questions
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
What was the average viewing time over the past
7 days, compared to the last year?
How many active viewers did I have in the last
minute?
How many sales were made in the last
hour due to advertising conversion?
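The three volume figures on this slide are consistent with each device reporting roughly once per minute. The sketch below reproduces them under that assumption, which the slide does not state, together with an assumed 30-day month and 12-month year. The class and method names are illustrative only.

```java
/** Back-of-the-envelope event volumes for a fleet of devices. */
final class EventRates {
    // Assumption (not stated on the slide): one event per device per minute.
    static final long EVENTS_PER_DEVICE_PER_DAY = 24L * 60L;
    // Assumption: 30-day months and 12 such months per year.
    static final long DAYS_PER_MONTH = 30L;
    static final long MONTHS_PER_YEAR = 12L;

    /** Events per second across the whole fleet. */
    static double perSecond(long devices) {
        return devices / 60.0;
    }

    /** Events per 30-day month across the whole fleet. */
    static long perMonth(long devices) {
        return devices * EVENTS_PER_DEVICE_PER_DAY * DAYS_PER_MONTH;
    }

    /** Events per 12-month year across the whole fleet. */
    static long perYear(long devices) {
        return perMonth(devices) * MONTHS_PER_YEAR;
    }

    public static void main(String[] args) {
        long devices = 1_000_000L;
        System.out.println(perSecond(devices) + " events/sec");
        System.out.println(perMonth(devices) + " events/month");
        System.out.println(perYear(devices) + " events/year");
    }
}
```

Under these assumptions the per-second rate is about 16,666.7 (shown on the slide as 16.6K), and the monthly and yearly totals come to 43.2 billion and 518.4 billion.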
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
The tension and polarity of Big Data
Accuracy Speed
Cost control Complexity
Time to answer
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and
creates lag
The reality of Big Data elasticity & business
… that also provides accuracy control & intelligent resource elasticity
… to reduce operational complexity
… while optimizing resources to reduce cost
What if you just had a simple knob for speed?
Cloud Dataflow
Cloud Dataflow is a
collection of SDKs for
building batch or
streaming parallelized
data processing pipelines.
Cloud Dataflow is a fully
managed service for
executing optimized
parallelized data processing
pipelines.
• Movement
• Filtering
• Enrichment
• Shaping
• Reduction
• Batch computation
• Continuous
computation
• Composition
• External
orchestration
• Simulation
Where might you use Cloud Dataflow?
ETL Analysis Orchestration
Benefits of Cloud Dataflow
❯ No Ops - truly elastic data processing for the cloud
• On demand resource allocation w/intelligent auto-scaling
• Automated worker lifetime management
• Automated work optimization
❯ Unified model - for batch & stream based processing
• Functional programming model
• Fine grained correctness primitives
❯ Open sourced SDK @ github
• Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK
• Python 2 in progress
• Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow
Release Timeline
• June 24, 2014: Early Access Preview at Google I/O
• Dec. 17, 2014: Alpha
• April 16, 2015: Beta - now open to everyone
• Next milestone GA
cloud.google.com/dataflow
Management
Mobile
Developer Tools
Compute
Networking
Big Data
Storage
Big Data on Google Cloud
Capture
Pub/Sub
Process
Dataflow
Store
Storage
SQL
Datastore
Analyze
BigQuery
Dataflow
Open Source Tools
Big Data on Google Cloud
BigQuery
Ingest data at 100,000
rows per second
Dataflow
Stream & batch
processing, unified and
simplified
Pub/Sub
Scalable, flexible, and
globally available
messaging
Fully Managed, No-Ops Services
Time to answer some questions
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
What was the average viewing time over the past
7 days, compared to the last year?
How many active viewers did I have in the last
minute?
How many sales were made in the last 30
minutes due to advertising conversion?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Dataflow BigQuery
How many active viewers did I have in the last
minute?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Dataflow
How many active viewers did I have in the last
minute?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Dataflow SDK
+
How many active viewers did I have in the last
minute?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Pub/Sub Cloud Dataflow BigQuery
How many active viewers did I have in the last
minute?
Let’s build something - Demo!
Create a globally available queue
Create a dataset for massive scale ingest and query execution
Submit job
• Globally redundant
• Low latency (sub sec.)
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
Cloud Pub/Sub
[Diagram: Publishers A, B, and C publish Messages 1, 2, and 3 to Topics A, B, and C in Cloud Pub/Sub. Subscriptions XA, XB, YC, and ZC connect the topics to Subscribers X, Y, and Z, which receive the messages.]
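The core idea in the diagram, that every subscription attached to a topic gets its own copy of each published message, can be sketched with a toy in-memory model. This is not the Cloud Pub/Sub client API: `ToyPubSub` and its methods are hypothetical, and routing Message 3 to Topic C is an assumption based on the diagram labels.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of topics and subscriptions (hypothetical, not the real client API). */
final class ToyPubSub {
    // Topic name -> names of the subscriptions attached to it.
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    // Subscription name -> messages waiting to be pulled.
    private final Map<String, List<String>> backlog = new HashMap<>();

    /** Attaches a new, empty subscription to a topic. */
    void createSubscription(String topic, String subscription) {
        subscriptionsByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        backlog.put(subscription, new ArrayList<>());
    }

    /** Copies the message into the backlog of every subscription on the topic. */
    void publish(String topic, String message) {
        for (String subscription : subscriptionsByTopic.getOrDefault(topic, new ArrayList<>())) {
            backlog.get(subscription).add(message);
        }
    }

    /** Returns and clears whatever is waiting on one existing subscription. */
    List<String> pull(String subscription) {
        List<String> waiting = backlog.get(subscription);
        List<String> delivered = new ArrayList<>(waiting);
        waiting.clear();
        return delivered;
    }

    public static void main(String[] args) {
        ToyPubSub pubSub = new ToyPubSub();
        pubSub.createSubscription("Topic C", "Subscription YC");
        pubSub.createSubscription("Topic C", "Subscription ZC");
        pubSub.publish("Topic C", "Message 3");
        // Both subscriptions hold their own copy of the message.
        System.out.println(pubSub.pull("Subscription YC"));
        System.out.println(pubSub.pull("Subscription ZC"));
    }
}
```

Keeping a separate backlog per subscription is what lets Subscribers Y and Z consume the same topic at different speeds without affecting each other.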
BigQuery
Google BigQuery
Fast ETL
Regex
JSON
Spreadsheets
BI Tools
Coworkers
Your Data
• Scales into Petabytes
• I/O of 1TB in 1 second
• 100,000 rows/sec Streaming API
• Simple data ingest from GCS or Hadoop
• Connect to R, Pandas, Hadoop, etc.
• Now available in our European data centers
• New row level security and data expiration
weekly monthly
Cloud Dataflow SDK Release Process
Pipeline{                          <- At once guarantee (modulo completeness thresholds)
  Who => Inputs                    <- GCS, Pub/Sub, BigQuery, w/Avro, XML, JSON,...
  What => Transforms               <- Aggregations, Filters, Joins, ...
  Where => Windows                 <- Time space Fixed, Sliding, Sessions, ...
  When => Watermarks + Triggers    <- Correctness
  To => Outputs                    <- GCS, Pub/Sub, BigQuery, ...
}
Cloud Dataflow SDK - Logical Model
Cloud Dataflow SDK
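// Create the pipeline. OptionsBuilder is a helper from the demo code (not shown here).
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));

// Read the raw dump from GCS as lines of text.
PCollection<String> rawData = p.begin().apply(TextIO.Read
    .from(OptionsBuilder.GCS_RAWDUMP_URI));

// Parse each line into a PlaybackEvent.
PCollection<PlaybackEvent> events = rawData.apply(
    new ParseTransform());

// Branch the same PCollection into three independent transforms.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());

p.run();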
Java 7 Implementation
Some Code
❯ A collection of data of type T in a
pipeline - a “hippie cousin” of an RDD
❯ May be either bounded or
unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing
PCollection
❯ Often contains key-value pairs using
KV
{Seahawks, NFC, Champions, Seattle, ...}
{...,
“NFC Champions #GreenBay”,
“Green Bay #superbowl!”,
...
“#GoHawks”,
...}
PCollections
Cloud Dataflow SDK
{Seahawks, NFC, Champions, Seattle, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Elements are processed in arbitrary ‘bundles’ e.g. “shards”
• startBundle(), processElement() - N times, finishBundle()
❯ Corresponds to both the Map and Reduce phases in Hadoop i.e. ParDo->GBK->ParDo
KeyBySessionId
ParDo (“Parallel Do”)
Cloud Dataflow SDK
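The bundle lifecycle described on this slide can be sketched in plain Java. This is a toy model, not the SDK: `ToyDoFn` is a hypothetical, much simplified stand-in for the SDK's DoFn, and the input lines are borrowed from the PCollections slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of ParDo's bundle lifecycle; ToyDoFn is a hypothetical stand-in for DoFn. */
final class ToyParDo {
    interface ToyDoFn<I, O> {
        default void startBundle() {}
        void processElement(I element, List<O> output);
        default void finishBundle() {}
    }

    /** Runs fn over each bundle: startBundle(), processElement() N times, finishBundle(). */
    static <I, O> List<O> parDo(List<List<I>> bundles, ToyDoFn<I, O> fn) {
        List<O> output = new ArrayList<>();
        for (List<I> bundle : bundles) {
            fn.startBundle();
            for (I element : bundle) {
                fn.processElement(element, output);
            }
            fn.finishBundle();
        }
        return output;
    }

    /** Example function: emits every hashtag found in a line of text. */
    static final ToyDoFn<String, String> EXTRACT_HASHTAGS = (line, output) -> {
        for (String word : line.split("\\s+")) {
            if (word.startsWith("#")) {
                output.add(word);
            }
        }
    };

    public static void main(String[] args) {
        // How elements are split into bundles is arbitrary; two bundles are used here.
        List<List<String>> bundles = Arrays.asList(
            Arrays.asList("NFC Champions #GreenBay", "Green Bay #superbowl!"),
            Arrays.asList("#GoHawks"));
        System.out.println(parDo(bundles, EXTRACT_HASHTAGS));
    }
}
```

Because each element is handled independently, the result does not depend on how the input is split into bundles, which is what lets a service distribute the bundles across workers.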
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
• Takes a PCollection of key-value
pairs and gathers up all values
with the same key
• Corresponds to the shuffle phase
in Hadoop
Cloud Dataflow SDK
GroupByKey
{KV<S, {Seahawks, Seattle, …}>,
KV<N, {NFC, …}>,
KV<C, {Champions, …}>}
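For a bounded input, the grouping shown above can be sketched in a few lines of plain Java. This is only an in-memory illustration of the semantics, with `Map.Entry` standing in for KV. The `ToyGroupByKey` name is hypothetical, and the real transform runs as a distributed shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory illustration of GroupByKey semantics on a bounded list of pairs. */
final class ToyGroupByKey {
    /** Gathers all values that share a key, preserving first-seen key order. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("S", "Seahawks"));
        pairs.add(new SimpleEntry<>("C", "Champions"));
        pairs.add(new SimpleEntry<>("S", "Seattle"));
        pairs.add(new SimpleEntry<>("N", "NFC"));
        System.out.println(groupByKey(pairs));
    }
}
```

The loop only finishes because the input list ends. On an unbounded PCollection there is no such end, which is the question the "Wait a minute…" slide raises and windows answer.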
❯ Logically divides up or groups the elements of a
PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on
an unbounded PCollection, but can also be used
for bounded PCollections
❯ Window.into() can be called at any point in the
pipeline and will be applied when needed
❯ Can be tied to arrival/processing time or custom
event time
❯ Watermarks + Triggers enable robust
correctness
Windows
Cloud Dataflow SDK
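To make window assignment concrete, here is a minimal sketch of how a timestamp could map to fixed and sliding windows. It is plain Java arithmetic under the assumption that windows are aligned to the epoch and are half-open, [start, start + size). The `ToyWindows` helpers are hypothetical and are not the SDK's windowing classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy window assignment; assumes epoch-aligned, half-open windows [start, start + size). */
final class ToyWindows {
    /** Start of the single fixed window that contains the timestamp. */
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Starts of every sliding window (newest first) that contains the timestamp. */
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long newestStart = timestampMillis - Math.floorMod(timestampMillis, periodMillis);
        // Walk backwards one period at a time while the window still covers the timestamp.
        for (long start = newestStart; start > timestampMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long eventTime = 125_000L; // 2 minutes 5 seconds after the epoch
        System.out.println(fixedWindowStart(eventTime, 60_000L));
        System.out.println(slidingWindowStarts(eventTime, 60_000L, 30_000L));
    }
}
```

With one-minute fixed windows each event lands in exactly one window, which is the shape needed for the "active viewers in the last minute" question. With sliding windows an event can belong to several overlapping windows.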
Nighttime Mid-Day Nighttime
Cloud Dataflow Service
Managing with correctness
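.apply(Window.<KV<String, PlaybackEvent>>into(
        // One-minute fixed windows.
        FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(
        AfterEach.inOrder(
            // First firing: when the watermark passes the end of the window.
            AfterWatermark.pastEndOfWindow(),
            // Then, for late data: fire 10 minutes (processing time)
            // after the first element of each new pane...
            Repeatedly.forever(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))
                // ...until the watermark is 2 days past the end of the window.
                .orFinally(AfterWatermark
                    .pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(2)))))
    // Each pane carries only the elements that arrived since the last firing.
    .discardingFiredPanes());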
[Diagram: Count as a composite of Pair With Ones, GroupByKey, and Sum Values]
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join,
Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse,
etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Cloud Dataflow SDK
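The Count composite named on this slide can be illustrated by chaining three small steps. The sketch below is plain Java over in-memory lists, assuming the order Pair With Ones, then GroupByKey, then Sum Values. `ToyCount` and its methods are hypothetical and only mirror the structure, not the SDK's implementation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy composite: Count built from three smaller steps. */
final class ToyCount {
    /** Step 1: turn each element into an (element, 1) pair. */
    static List<Map.Entry<String, Integer>> pairWithOnes(List<String> elements) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String element : elements) {
            pairs.add(new SimpleEntry<>(element, 1));
        }
        return pairs;
    }

    /** Step 2: gather all the ones that share a key. */
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Step 3: add up the values for each key. */
    static Map<String, Integer> sumValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            sums.put(entry.getKey(), sum);
        }
        return sums;
    }

    /** The composite: callers see one step, not three. */
    static Map<String, Integer> count(List<String> elements) {
        return sumValues(groupByKey(pairWithOnes(elements)));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("Seahawks", "NFC", "Seahawks")));
    }
}
```

Wrapping the three steps behind `count` is the code-reuse benefit the slide mentions: the subgraph can be reused and reasoned about under a single name.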
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Cloud Dataflow Service
Progress & Logs
● ParDo fusion
○ Producer Consumer
○ Sibling
○ Intelligent fusion boundaries
● Combiner lifting e.g. partial aggregations before reduction
● Reshard placement
...
Graph Optimization
Cloud Dataflow Service
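Two of the optimizations listed here can be illustrated on a single machine. The sketch below is a plain-Java analogy with hypothetical names, not the service's optimizer: producer-consumer fusion is modeled as function composition, and combiner lifting as summing within each shard before the final reduction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Single-machine analogy for ParDo fusion and combiner lifting. */
final class ToyOptimizer {
    /** Producer-consumer fusion: two per-element steps run as one, with no intermediate collection. */
    static <A, B, C> Function<A, C> fuse(Function<A, B> producer, Function<B, C> consumer) {
        return producer.andThen(consumer);
    }

    /** Unoptimized reduction: every raw value is brought to one place and summed. */
    static long sumAfterShuffle(List<List<Long>> shards) {
        long total = 0;
        for (List<Long> shard : shards) {
            for (long value : shard) {
                total += value;
            }
        }
        return total;
    }

    /** Combiner lifting, part 1: one partial sum per shard, computed where the data lives. */
    static List<Long> partialSums(List<List<Long>> shards) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> shard : shards) {
            long partial = 0;
            for (long value : shard) {
                partial += value;
            }
            partials.add(partial);
        }
        return partials;
    }

    /** Combiner lifting, part 2: only the partial sums reach the final reduction. */
    static long sumLifted(List<List<Long>> shards) {
        long total = 0;
        for (long partial : partialSums(shards)) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Long>> shards = Arrays.asList(
            Arrays.asList(1L, 2L), Arrays.asList(3L), Arrays.asList(4L, 5L));
        System.out.println(sumAfterShuffle(shards) == sumLifted(shards));
        Function<Integer, Integer> fused = fuse((Integer x) -> x + 1, (Integer x) -> x * 2);
        System.out.println(fused.apply(3));
    }
}
```

Lifting is safe here because addition is associative and commutative, so summing per shard first cannot change the total. It only reduces how much data has to move before the final reduction.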
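[Diagram: three graph optimizations. Consumer-producer fusion: C, D become C+D. Sibling fusion: C, D become C+D. Combiner lifting: A, GBK, +, B become A+, GBK, +, B. Legend: GBK = GroupByKey, + = CombineValues, other nodes = ParallelDo.]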
Deploy Schedule & Monitor Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
Worker Scaling
Cloud Dataflow Service
Decreased Clock Time
100 mins. vs. 65 mins.
Dynamic Work Rebalancing
Cloud Dataflow Service
Optimized
The Cloud For Big Data
Promise of the Cloud and Big Data
Optimizing Your Time To Answer
More time to dig
into your data
Programming
Resource
provisioning
Performance
tuning
Monitoring
Reliability
Deployment &
configuration
Handling
Growing Scale
Utilization
improvements
Typical Data Processing
Data Processing with Cloud Dataflow
Programming
Thank You!
cloud.google.com/dataflow
cloude@google.com
StackOverflow @ google+cloud+dataflow

More Related Content

What's hot

Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowLucas Arruda
 
Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...
Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...
Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...Amazon Web Services
 
Big data on google cloud
Big data on google cloudBig data on google cloud
Big data on google cloudTu Pham
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsLynn Langit
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
BigQuery walk through.pptx
BigQuery walk through.pptxBigQuery walk through.pptx
BigQuery walk through.pptxVikRam S
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composerBruce Kuo
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data FlowMark Kromer
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOAltinity Ltd
 
Getting started with BigQuery
Getting started with BigQueryGetting started with BigQuery
Getting started with BigQueryPradeep Bhadani
 
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...HostedbyConfluent
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 

What's hot (20)

Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
 
Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...
Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...
Deep Dive on Accelerating Content, APIs, and Applications with Amazon CloudFr...
 
Big data on google cloud
Big data on google cloudBig data on google cloud
Big data on google cloud
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
BigQuery walk through.pptx
BigQuery walk through.pptxBigQuery walk through.pptx
BigQuery walk through.pptx
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composer
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
 
Getting started with BigQuery
Getting started with BigQueryGetting started with BigQuery
Getting started with BigQuery
 
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 

Viewers also liked

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data AppTrieu Nguyen
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineMyung Ho Yun
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowC4Media
 
Stratio platform overview v4.1
Stratio platform overview v4.1Stratio platform overview v4.1
Stratio platform overview v4.1Stratio
 
[Strata] Sparkta
[Strata] Sparkta[Strata] Sparkta
[Strata] SparktaStratio
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine kiran palaka
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkFlink Forward
 
Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis PatternsMikio L. Braun
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Vicente Orjales
 
Présentation Google Dataflow
Présentation Google DataflowPrésentation Google Dataflow
Présentation Google DataflowGeoffrey Garnotel
 

Viewers also liked (14)

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
 
Stratio platform overview v4.1
Stratio platform overview v4.1Stratio platform overview v4.1
Stratio platform overview v4.1
 
[Strata] Sparkta
[Strata] Sparkta[Strata] Sparkta
[Strata] Sparkta
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
 
Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis Patterns
 
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushiGoogle Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushi
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Présentation Google Dataflow
Présentation Google DataflowPrésentation Google Dataflow
Présentation Google Dataflow
 

Similar to Google Cloud Dataflow Two Worlds Become a Much Better One

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT_MTL
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...HostedbyConfluent
 
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...HostedbyConfluent
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL DatabaseJames Serra
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalAvere Systems
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAlluxio, Inc.
 
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...Gary Arora
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...Deepak Chandramouli
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13DECK36
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBconfluent
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsVMware Tanzu
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Crate.io
 

Similar to Google Cloud Dataflow Two Worlds Become a Much Better One (20)

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
 
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You





Google Cloud Dataflow Two Worlds Become a Much Better One

  • 1. Google Cloud Dataflow Two Worlds Become a Much Better One Eric Schmidt, Product Manager cloude@google.com
  • 2. Goals 1. You leave here understanding the fundamentals of Cloud Dataflow and have possibly drawn some comparisons to existing data processing models. 2. We have some fun.
  • 3. The Cloud Big Data Promise of the Cloud and Big Data
  • 4. Optimized The Cloud For Big Data Promise of the Cloud and Big Data
  • 7. Time to answer some questions 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year What was the average viewing time over the past 7 days, compared to the last year? How many active viewers did I have in the last minute? How many sales were made in the last hour due to advertising conversion?
  • 8. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year
  • 9. The tension and polarity of Big Data Accuracy Speed Cost control Complexity Time to answer
  • 10. ❯ Time & life never stop ❯ Data rates & schema are not static ❯ Scaling models are not static ❯ Non-elastic compute is wasteful and creates lag The reality of Big Data elasticity & business
  • 11. What if you just had a simple knob for speed? … that also provides accuracy control & intelligent resource elasticity … to reduce operational complexity … while optimizing resources to reduce cost
  • 12. Cloud Dataflow Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 13. Where might you use Cloud Dataflow? ETL: • Movement • Filtering • Enrichment • Shaping Analysis: • Reduction • Batch computation • Continuous computation Orchestration: • Composition • External orchestration • Simulation
  • 14. Benefits of Cloud Dataflow ❯ No Ops - truly elastic data processing for the cloud • On demand resource allocation w/intelligent auto-scaling • Automated worker lifetime management • Automated work optimization ❯ Unified model - for batch & stream based processing • Functional programming model • Fine grained correctness primitives ❯ Open sourced SDK @ github • Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK • Python 2 in progress • Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl • Spark runner @ /cloudera/spark-dataflow • Flink runner @ /dataArtisans/flink-dataflow
  • 15. Release Timeline • June 24, 2014: Early Access Preview at Google I/O • Dec. 17, 2014: Alpha • Next milestone...
  • 16. Release Timeline • June 24, 2014: Early Access Preview at Google I/O • Dec. 17, 2014: Alpha • April 16, 2015: Beta - now open to everyone • Next milestone GA cloud.google.com/dataflow
  • 18. Big Data on Google Cloud Capture Pub/Sub Process Dataflow Store Storage SQL Datastore Analyze BigQuery Dataflow Open Source Tools
  • 19. Big Data on Google Cloud BigQuery Ingest data at 100,000 rows per second Dataflow Stream & batch processing, unified and simplified Pub/Sub Scalable, flexible, and globally available messaging Fully Managed, No-Ops Services
  • 20. Time to answer some questions 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year What was the average viewing time over the past 7 days, compared to the last year? How many active viewers did I have in the last minute? How many sales were made in the last 30 minutes due to advertising conversion?
  • 21. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Dataflow BigQuery How many active viewers did I have in the last minute?
  • 22. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Dataflow How many active viewers did I have in the last minute?
  • 23. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Dataflow SDK + How many active viewers did I have in the last minute?
  • 24. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Pub/Sub Cloud Dataflow BigQuery How many active viewers did I have in the last minute?
  • 25. Let’s build something - Demo! Create a globally available queue Create a dataset for massive scale ingest and query execution Submit job
  • 26. • Globally redundant • Low latency (sub sec.) • Batched read/write • Custom labels • Push & Pull • Auto expiration Cloud Pub/Sub Publisher A Publisher B Publisher C Message 1 Topic A Topic B Topic C Subscription XA Subscription XB Subscription YC Subscription ZC Cloud Pub/Sub Subscriber X Subscriber Y Message 2 Message 3 Subscriber Z Message 1 Message 2 Message 3 Message 3
  • 27. Big Query Google Big Query Fast ETL Regex JSON Spreadsheets BI Tools Coworkers Your Data • Scales into Petabytes • I/O of 1TB in 1 second • 100,000 rows/sec Streaming API • Simple data ingest from GCS or Hadoop • Connect to R, Pandas, Hadoop, etc. • Now available in our European data centers • New row level security and data expiration
  • 28. Cloud Dataflow Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 29. weekly monthly Cloud Dataflow SDK Release Process
  • 30. Cloud Dataflow SDK - Logical Model Pipeline { Who => Inputs (GCS, Pub/Sub, BigQuery, w/Avro, XML, JSON, ...) What => Transforms (Aggregations, Filters, Joins, ...) Where => Windows (Time space: Fixed, Sliding, Sessions, ...) When => Watermarks + Triggers (Correctness) To => Outputs (GCS, Pub/Sub, BigQuery, ...) } At once guarantee (modulo completeness thresholds)
  • 31. Pipeline p = Pipeline.create(OptionsBuilder.RunOnService(true, false)); PCollection<String> rawData = p.begin().apply(TextIO.Read.from(OptionsBuilder.GCS_RAWDUMP_URI)); PCollection<PlaybackEvent> events = rawData.apply(new ParseTransform()); events.apply(new ArchiveTransform()); events.apply(new SessionAnalysisTransform()); events.apply(new AssetTransform()); p.run(); Java 7 Implementation Some Code
  • 32. ❯ A collection of data of type T in a pipeline - a “hippie cousin” of an RDD ❯ May be either bounded or unbounded in size ❯ Created by using a PTransform to: • Build from a java.util.Collection • Read from a backing data store • Transform an existing PCollection ❯ Often contain key-value pairs using KV {Seahawks, NFC, Champions, Seattle, ...} {..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ... “#GoHawks”, ...} PCollections Cloud Dataflow SDK
  • 33. {Seahawks, NFC, Champions, Seattle, ...} {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} ❯ Processes each element of a PCollection independently using a user-provided DoFn ❯ Elements are processed in arbitrary ‘bundles’ e.g. “shards” • startBundle(), processElement() - N times, finishBundle() ❯ Corresponds to both the Map and Reduce phases in Hadoop i.e. ParDo->GBK->ParDo KeyBySessionId ParDo (“Parallel Do”) Cloud Dataflow SDK
  • 34. Wait a minute… How do you do a GroupByKey on an unbounded PCollection? {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} GroupByKey • Takes a PCollection of key-value pairs and gathers up all values with the same key • Corresponds to the shuffle phase in Hadoop Cloud Dataflow SDK GroupByKey {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
  • 35. ❯ Logically divide up or group the elements of a PCollection into finite windows • Fixed Windows: hourly, daily, … • Sliding Windows • Sessions ❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections ❯ Window.into() can be called at any point in the pipeline and will be applied when needed ❯ Can be tied to arrival/processing time or custom event time ❯ Watermarks + Triggers enable robust correctness Windows Cloud Dataflow SDK Nighttime Mid-Day Nighttime
  • 36. Cloud Dataflow Service Managing with correctness .apply(Window.<KV<String, PlaybackEvent>>into( FixedWindows.of(Duration.standardMinutes(1))) .triggering( AfterEach.inOrder( AfterWatermark.pastEndOfWindow(), Repeatedly.forever(AfterProcessingTime .pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(10))) .orFinally(AfterWatermark .pastEndOfWindow() .plusDelayOf(Duration.standardDays(2))))) .discardingFiredPanes());
  • 37. GroupByKey Pair With Ones Sum Values Count ❯ Define new PTransforms by building up subgraphs of existing transforms ❯ Some utilities are included in the SDK • Count, RemoveDuplicates, Join, Min, Max, Sum, ... ❯ You can define your own: • DoSomething, DoSomethingElse, etc. ❯ Why bother? • Code reuse • Better monitoring experience Composite PTransforms Cloud Dataflow SDK
  • 38. Cloud Dataflow Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 39. GCP Managed Service User Code & SDK Work Manager Deploy & Schedule Monitoring UI Job Manager Cloud Dataflow Service Progress & Logs
  • 40. ● ParDo fusion ○ Producer-consumer ○ Sibling ○ Intelligent fusion boundaries ● Combiner lifting e.g. partial aggregations before reduction ● Reshard placement ... Graph Optimization Cloud Dataflow Service [Diagram: consumer-producer fusion (C, D become C+D); sibling fusion (C, D become C+D); combiner lifting (A, GBK, +, B become A+, GBK, +, B). Legend: ParallelDo, GBK = GroupByKey, + = CombineValues]
  • 41. Deploy Schedule & Monitor Tear Down Worker Lifecycle Management Cloud Dataflow Service
  • 42. Worker Scaling Cloud Dataflow Service Decreased Clock Time
  • 43. 100 mins. 65 mins. vs. Dynamic Work Rebalancing Cloud Dataflow Service
  • 44. Optimized The Cloud For Big Data Promise of the Cloud and Big Data
  • 45. Optimizing Your Time To Answer Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling Growing Scale, Utilization improvements Data Processing with Cloud Dataflow: Programming, More time to dig into your data
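
The event-volume figures on the "Time to answer some questions" slide (1M devices, 16.6K events/sec, 43B events/month, 518B events/year) can be reproduced with a short calculation. The sketch below is plain Java, not part of the Dataflow SDK, and it rests on assumptions the slides do not state: each device emits one event per minute, a month is 30 days, and a year is twelve such months. The class and method names are hypothetical.

```java
class EventRates {
    // Assumption (not stated on the slides): one event per device per minute.
    static final long DEVICES = 1_000_000L;

    // 1M events per minute spread over 60 seconds.
    static double eventsPerSecond() {
        return DEVICES / 60.0;
    }

    // Assumption: a "month" is 30 days of 24 hours of 60 minutes.
    static long eventsPerMonth() {
        return DEVICES * 60L * 24L * 30L;
    }

    // Assumption: a "year" is twelve 30-day months.
    static long eventsPerYear() {
        return eventsPerMonth() * 12L;
    }

    public static void main(String[] args) {
        System.out.println("events/sec   ~ " + Math.round(eventsPerSecond()));
        System.out.println("events/month = " + eventsPerMonth());
        System.out.println("events/year  = " + eventsPerYear());
    }
}
```

Under these assumptions the monthly total is 43.2 billion and the yearly total is 518.4 billion, which is consistent with 518B being a per-year figure.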
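
The Windows slide says a GroupByKey on an unbounded PCollection needs the elements divided into finite windows first. A minimal sketch of that idea in plain Java follows, answering "how many active viewers did I have in the last minute?" with one-minute fixed windows keyed by event time. It does not use the Dataflow SDK: the `PlaybackEvent` fields, the class name, and the helper methods are all hypothetical, and the sketch sees every event at once, so it ignores watermarks, triggers, and late data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class FixedWindowSketch {
    // Hypothetical event type for illustration: a viewer id plus an event-time timestamp.
    static final class PlaybackEvent {
        final String viewerId;
        final long timestampMillis;

        PlaybackEvent(String viewerId, long timestampMillis) {
            this.viewerId = viewerId;
            this.timestampMillis = timestampMillis;
        }
    }

    static final long WINDOW_MILLIS = 60_000L; // one-minute fixed windows

    // Start of the fixed window that contains the given timestamp.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    // Groups events by window start, then counts the distinct viewers in each window.
    static Map<Long, Integer> activeViewersPerWindow(List<PlaybackEvent> events) {
        Map<Long, Set<String>> viewersByWindow = new TreeMap<>();
        for (PlaybackEvent e : events) {
            viewersByWindow
                .computeIfAbsent(windowStart(e.timestampMillis), k -> new HashSet<>())
                .add(e.viewerId);
        }
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, Set<String>> entry : viewersByWindow.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PlaybackEvent> events = Arrays.asList(
            new PlaybackEvent("a", 1_000L),
            new PlaybackEvent("b", 59_999L),
            new PlaybackEvent("a", 30_000L),
            new PlaybackEvent("c", 60_000L),
            new PlaybackEvent("a", 125_000L));
        System.out.println(activeViewersPerWindow(events)); // {0=2, 60000=1, 120000=1}
    }
}
```

Because each event lands in exactly one finite window, the grouping step always has a bounded set of values to gather, which is the property the slide relies on.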
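
The Graph Optimization slide mentions combiner lifting, described as partial aggregations before reduction. The sketch below illustrates that idea in plain Java for a per-key count. It is not how the Dataflow service is implemented; the class and method names are hypothetical, and "bundle" here simply means a list of keys processed together on one worker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombinerLiftingSketch {
    // Unlifted plan: every element crosses the shuffle, then is counted per key.
    static Map<String, Long> countAfterShuffle(List<List<String>> bundles) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> bundle : bundles) {
            for (String key : bundle) {
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Lifted step: pre-aggregate inside one bundle, before anything is shuffled.
    static Map<String, Long> partialCount(List<String> bundle) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : bundle) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Lifted plan: shuffle only the per-bundle partial counts, then merge them.
    static Map<String, Long> countWithLifting(List<List<String>> bundles) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> bundle : bundles) {
            partials.add(partialCount(bundle));
        }
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                counts.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> bundles = Arrays.asList(
            Arrays.asList("Seahawks", "Seattle", "Seahawks"),
            Arrays.asList("NFC", "Seahawks"));
        // Both plans produce the same per-key counts.
        System.out.println(countAfterShuffle(bundles).equals(countWithLifting(bundles)));
    }
}
```

The two plans agree because addition is associative and commutative; the lifted plan moves at most one record per key per bundle across the shuffle instead of one record per element.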