1. Google Cloud Dataflow
Two Worlds Become a Much Better One
Eric Schmidt, Product Manager
cloude@google.com
2. Goals
❯ You leave here understanding the fundamentals of Cloud Dataflow, having possibly drawn some comparisons to existing data processing models.
❯ We have some fun.
7. Time to answer some questions
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
❯ What was the average viewing time over the past 7 days, compared to the last year?
❯ How many active viewers did I have in the last minute?
❯ How many sales were made in the last hour due to advertising conversion?
9. The tension and polarity of Big Data
Speed vs. Accuracy
Cost control vs. Complexity
Time to answer
10. The reality of Big Data elasticity & business
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and creates lag
11. What if you just had a simple knob for speed…
… that also provides accuracy control & intelligent resource elasticity
… to reduce operational complexity
… while optimizing resources to reduce cost
12. Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
13. Where might you use Cloud Dataflow?
ETL: • Movement • Filtering • Enrichment • Shaping
Analysis: • Reduction • Batch computation • Continuous computation
Orchestration: • Composition • External orchestration • Simulation
14. Benefits of Cloud Dataflow
❯ No Ops - truly elastic data processing for the cloud
• On-demand resource allocation w/ intelligent auto-scaling
• Automated worker lifetime management
• Automated work optimization
❯ Unified model - for batch & stream based processing
• Functional programming model
• Fine-grained correctness primitives
❯ Open-sourced SDK @ github
• Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK
• Python 2 in progress
• Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow
16. Release Timeline
• June 24, 2014: Early Access Preview at Google I/O
• Dec. 17, 2014: Alpha
• April 16, 2015: Beta - now open to everyone
• Next milestone: GA
cloud.google.com/dataflow
18. Big Data on Google Cloud
Capture: Pub/Sub
Process: Dataflow
Store: Cloud Storage, Cloud SQL, Datastore
Analyze: BigQuery, Dataflow, Open Source Tools
19. Big Data on Google Cloud - Fully Managed, No-Ops Services
Pub/Sub: scalable, flexible, and globally available messaging
Dataflow: stream & batch processing, unified and simplified
BigQuery: ingest data at 100,000 rows per second
20. Time to answer some questions
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
❯ What was the average viewing time over the past 7 days, compared to the last year?
❯ How many active viewers did I have in the last minute?
❯ How many sales were made in the last 30 minutes due to advertising conversion?
21. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Dataflow → BigQuery
22. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Dataflow
23. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Dataflow SDK +
24. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Pub/Sub → Cloud Dataflow → BigQuery
25. Let’s build something - Demo!
❯ Create a globally available queue (Pub/Sub)
❯ Create a dataset for massive-scale ingest and query execution (BigQuery)
❯ Submit the Dataflow job
26. Cloud Pub/Sub
• Globally redundant
• Low latency (sub-second)
• Batched read/write
• Custom labels
• Push & pull delivery
• Auto expiration
[Diagram: Publishers A, B, C publish Messages 1–3 to Topics A, B, C; Subscriptions XA, XB, YC, ZC fan the messages out to Subscribers X, Y, Z]
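The fan-out semantics in the diagram - every subscription attached to a topic receives every message published to that topic - can be sketched in plain Java. This is a hypothetical in-memory model for illustration only, not the Pub/Sub API:

```java
import java.util.*;

// Hypothetical in-memory model of Pub/Sub topic/subscription fan-out.
public class PubSubDemo {
    private final Map<String, List<String>> topicToSubscriptions = new HashMap<>();
    private final Map<String, List<String>> subscriptionMessages = new HashMap<>();

    // Attach a subscription to a topic; it will receive all future messages.
    public void subscribe(String topic, String subscription) {
        topicToSubscriptions.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        subscriptionMessages.put(subscription, new ArrayList<>());
    }

    // Publishing delivers the message to every subscription on the topic.
    public void publish(String topic, String message) {
        for (String sub : topicToSubscriptions.getOrDefault(topic, List.of())) {
            subscriptionMessages.get(sub).add(message);
        }
    }

    public List<String> pull(String subscription) {
        return subscriptionMessages.get(subscription);
    }

    public static void main(String[] args) {
        PubSubDemo ps = new PubSubDemo();
        ps.subscribe("topicC", "subYC");
        ps.subscribe("topicC", "subZC");
        ps.publish("topicC", "Message 3");
        // Both subscriptions see the same message, as in the diagram.
        System.out.println(ps.pull("subYC"));
        System.out.println(ps.pull("subZC"));
    }
}
```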
27. BigQuery
[Diagram: your data flows through fast ETL (regex, JSON) into Google BigQuery, and out to spreadsheets, BI tools, and coworkers]
• Scales into petabytes
• I/O of 1TB in 1 second
• 100,000 rows/sec Streaming API
• Simple data ingest from GCS or Hadoop
• Connect to R, Pandas, Hadoop, etc.
• Now available in our European data centers
• New row-level security and data expiration
28. Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
30. Cloud Dataflow SDK - Logical Model
Pipeline {
  Who   => Inputs                 <- GCS, Pub/Sub, BigQuery, w/ Avro, XML, JSON, ...
  What  => Transforms             <- Aggregations, Filters, Joins, ...
  Where => Windows                <- Time space: Fixed, Sliding, Sessions, ...
  When  => Watermarks + Triggers  <- Correctness; at-once guarantee (modulo completeness thresholds)
  To    => Outputs                <- GCS, Pub/Sub, BigQuery, ...
}
31. Some Code - Java 7 Implementation

// Create a pipeline configured to run on the managed service.
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));

// Read the raw event dump from GCS.
PCollection<String> rawData = p.begin().apply(TextIO.Read
    .from(OptionsBuilder.GCS_RAWDUMP_URI));

// Parse raw strings into typed playback events.
PCollection<PlaybackEvent> events = rawData.apply(
    new ParseTransform());

// Fan the same events out to three downstream transforms.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());

p.run();
32. Cloud Dataflow SDK - PCollections
❯ A collection of data of type T in a pipeline - a “hippie cousin” of an RDD
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs using KV
{Seahawks, NFC, Champions, Seattle, ...}
{..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ..., “#GoHawks”, ...}
33. Cloud Dataflow SDK - ParDo (“Parallel Do”)
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Elements are processed in arbitrary ‘bundles’, e.g. “shards”
• startBundle(), processElement() x N, finishBundle()
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
KeyBySessionId:
{Seahawks, NFC, Champions, Seattle, ...}
→ {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
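The bundle lifecycle above can be sketched in plain Java. The DoFn interface here is a hypothetical stand-in for the SDK class, just to show the startBundle → processElement × N → finishBundle call pattern:

```java
import java.util.*;

// Hypothetical plain-Java sketch of ParDo's bundle lifecycle (not the real SDK).
public class ParDoDemo {
    interface DoFn<I, O> {
        void startBundle();
        O processElement(I input);
        void finishBundle();
    }

    // Run a DoFn over one "bundle" (shard) of elements:
    // startBundle(), then processElement() N times, then finishBundle().
    static <I, O> List<O> runBundle(DoFn<I, O> fn, List<I> bundle) {
        fn.startBundle();
        List<O> out = new ArrayList<>();
        for (I element : bundle) out.add(fn.processElement(element));
        fn.finishBundle();
        return out;
    }

    // Example DoFn in the spirit of KeyBySessionId: key each event by its first letter.
    public static List<String> keyByFirstLetter(List<String> events) {
        return runBundle(new DoFn<String, String>() {
            public void startBundle() { /* e.g. open a connection, once per bundle */ }
            public String processElement(String e) { return "KV<" + e.charAt(0) + ", " + e + ">"; }
            public void finishBundle() { /* e.g. flush buffered writes */ }
        }, events);
    }

    public static void main(String[] args) {
        System.out.println(keyByFirstLetter(List.of("Seahawks", "NFC")));
    }
}
```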
34. Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
Cloud Dataflow SDK - GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
→ GroupByKey →
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
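The semantics of GroupByKey - gather all values sharing a key into one group - can be sketched in plain Java. This is illustrative only; the real transform runs as a distributed shuffle across workers:

```java
import java.util.*;

// Plain-Java sketch of GroupByKey semantics (not the SDK's distributed shuffle).
public class GroupByKeyDemo {
    // Gather all values with the same key, preserving encounter order.
    public static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> kvs = List.of(
            Map.entry("S", "Seahawks"), Map.entry("C", "Champions"),
            Map.entry("S", "Seattle"), Map.entry("N", "NFC"));
        // S gathers both Seahawks and Seattle, as on the slide.
        System.out.println(groupByKey(kvs));
    }
}
```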
35. Cloud Dataflow SDK - Windows
❯ Logically divides up or groups the elements of a PCollection into finite windows
• Fixed windows: hourly, daily, …
• Sliding windows
• Sessions
❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections
❯ Window.into() can be called at any point in the pipeline and will be applied when needed
❯ Can be tied to arrival/processing time or custom event time
❯ Watermarks + triggers enable robust correctness
Nighttime | Mid-Day | Nighttime
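At its core, fixed-window assignment is just rounding each event timestamp down to a window boundary. A minimal plain-Java sketch (a hypothetical helper, not the SDK's FixedWindows class):

```java
// Hypothetical sketch of fixed-window assignment (not the SDK's FixedWindows).
public class FixedWindowDemo {
    // A fixed window is identified by its start time: the event timestamp
    // rounded down to the nearest window boundary.
    public static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        // Two events 10 seconds apart land in the same 1-minute window...
        System.out.println(windowStart(125_000L, oneMinute)); // 120000
        System.out.println(windowStart(135_000L, oneMinute)); // 120000
        // ...a later event starts a new window.
        System.out.println(windowStart(185_000L, oneMinute)); // 180000
    }
}
```

GroupByKey on an unbounded PCollection becomes tractable because grouping happens per (key, window) rather than over the infinite stream.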
36. Cloud Dataflow Service
Managing correctness

// Window into fixed 1-minute windows. Emit once when the watermark passes
// the end of the window; then re-emit for late data every 10 minutes of
// processing time, giving up 2 days past the watermark. Each fired pane
// carries only the new elements since the last firing.
.apply(Window.<KV<String, PlaybackEvent>>into(
        FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(
        AfterEach.inOrder(
            AfterWatermark.pastEndOfWindow(),
            Repeatedly.forever(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))
                .orFinally(AfterWatermark
                    .pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(2)))))
    .discardingFiredPanes());
37. Cloud Dataflow SDK - Composite PTransforms
❯ Define new PTransforms by building up subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own: DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
[Diagram: the Count composite expands to Pair With Ones → GroupByKey → Sum Values]
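The Count composite in the diagram is just three primitive steps chained together. A plain-Java sketch of that decomposition (illustrative only; the SDK's Count runs these as pipeline transforms):

```java
import java.util.*;

// Plain-Java sketch of the Count composite: Pair With Ones -> GroupByKey -> Sum Values.
public class CountDemo {
    public static <T> Map<T, Long> count(List<T> elements) {
        // Pair With Ones + GroupByKey: each element becomes KV<element, 1>,
        // and all the ones for the same element are gathered together.
        Map<T, List<Long>> grouped = new LinkedHashMap<>();
        for (T e : elements) grouped.computeIfAbsent(e, k -> new ArrayList<>()).add(1L);
        // Sum Values: reduce each group of ones to a total.
        Map<T, Long> counts = new LinkedHashMap<>();
        grouped.forEach((k, ones) -> counts.put(k, ones.stream().mapToLong(Long::longValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("a", "b", "a", "a")));
    }
}
```

Packaging the three steps as one named composite is what buys the code reuse and the cleaner monitoring graph mentioned above.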
38. Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
39. Cloud Dataflow Service
[Diagram: user code & SDK is submitted to the managed service on GCP; the Work Manager and Job Manager deploy & schedule workers, and surface progress & logs in the Monitoring UI]
40. Cloud Dataflow Service - Graph Optimization
● ParDo fusion
○ Producer-consumer
○ Sibling
○ Intelligent fusion boundaries
● Combiner lifting, e.g. partial aggregations before reduction
● Reshard placement
...
[Diagram: producer-consumer fusion collapses a chain C → D into one step C+D; sibling fusion collapses two ParallelDos C and D over the same input into C+D; combiner lifting rewrites A → GBK → + → B as A+ → GBK → + → B, combining partially before the shuffle. Legend: GBK = GroupByKey, + = CombineValues]
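Producer-consumer fusion can be sketched in plain Java as function composition: instead of materializing the output of step C and re-reading it in step D, the service runs each element through a single fused step C+D. A hypothetical illustration, not the service's optimizer:

```java
import java.util.function.Function;

// Hypothetical sketch of producer-consumer ParDo fusion via function composition.
public class FusionDemo {
    // Unfused: C runs over all elements, materializing an intermediate
    // collection that D then re-reads. Fused: each element flows through
    // both functions in one step, with no intermediate materialization.
    public static <A, B, C> Function<A, C> fuse(Function<A, B> producer, Function<B, C> consumer) {
        return producer.andThen(consumer);
    }

    public static void main(String[] args) {
        Function<Integer, Integer> c = x -> x * 2;      // step C
        Function<Integer, String> d = x -> "v" + x;     // step D
        Function<Integer, String> cPlusD = fuse(c, d);  // fused step C+D
        System.out.println(cPlusD.apply(21));
    }
}
```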
41. Cloud Dataflow Service - Worker Lifecycle Management
Deploy → Schedule & Monitor → Tear Down
45. Optimizing Your Time To Answer
Typical data processing: programming, resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, utilization improvements
Data processing with Cloud Dataflow: programming - plus more time to dig into your data