1. Google Cloud Dataflow
Two Worlds Become a Much Better One
Eric Schmidt, Product Manager
cloude@google.com
2. Goals
❯ You leave here understanding the fundamentals of Cloud Dataflow, having possibly drawn some comparisons to existing data processing models.
❯ We have some fun.
7. Time to answer some questions
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
❯ What was the average viewing time over the past 7 days, compared to the last year?
❯ How many active viewers did I have in the last minute?
❯ How many sales were made in the last hour due to advertising conversion?
9. The tension and polarity of Big Data
Speed vs. Accuracy
Cost control vs. Complexity
Time to answer
10. The reality of Big Data elasticity & business
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and creates lag
11. What if you just had a simple knob for speed…
… that also provides accuracy control & intelligent resource elasticity
… to reduce operational complexity
… while optimizing resources to reduce cost
12. Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
13. Where might you use Cloud Dataflow?
ETL: • Movement • Filtering • Enrichment • Shaping
Analysis: • Reduction • Batch computation • Continuous computation
Orchestration: • Composition • External orchestration • Simulation
14. Benefits of Cloud Dataflow
❯ No Ops - truly elastic data processing for the cloud
• On-demand resource allocation w/ intelligent auto-scaling
• Automated worker lifetime management
• Automated work optimization
❯ Unified model - for batch & stream based processing
• Functional programming model
• Fine-grained correctness primitives
❯ Open-sourced SDK @ github
• Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK
• Python 2 in progress
• Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow
16. Release Timeline
• June 24, 2014: Early Access Preview at Google I/O
• Dec. 17, 2014: Alpha
• April 16, 2015: Beta - now open to everyone
• Next milestone: GA
cloud.google.com/dataflow
18. Big Data on Google Cloud
Capture: Pub/Sub
Process: Dataflow
Store: Cloud Storage, Cloud SQL, Datastore
Analyze: BigQuery, Dataflow, Open Source Tools
19. Big Data on Google Cloud - Fully Managed, No-Ops Services
Pub/Sub: scalable, flexible, and globally available messaging
Dataflow: stream & batch processing, unified and simplified
BigQuery: ingest data at 100,000 rows per second
20. Time to answer some questions
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
❯ What was the average viewing time over the past 7 days, compared to the last year?
❯ How many active viewers did I have in the last minute?
❯ How many sales were made in the last 30 minutes due to advertising conversion?
21. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Dataflow → BigQuery
22. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Dataflow
23. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Dataflow SDK +
24. Let’s build something
1M devices · 16.6K events/sec · 43B events/month · 518B events/year
How many active viewers did I have in the last minute?
Cloud Pub/Sub → Cloud Dataflow → BigQuery
25. Let’s build something - Demo!
❯ Create a globally available queue (Pub/Sub)
❯ Create a dataset for massive-scale ingest and query execution (BigQuery)
❯ Submit the Dataflow job
26. Cloud Pub/Sub
• Globally redundant
• Low latency (sub-second)
• Batched read/write
• Custom labels
• Push & pull delivery
• Auto expiration
[Diagram: Publishers A, B, C publish Messages 1–3 to Topics A, B, C; Subscriptions XA, XB, YC, ZC fan the messages out to Subscribers X, Y, Z]
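The fan-out semantics in the diagram - every subscription attached to a topic receives every message published to that topic - can be sketched in plain Java. This is a hypothetical in-memory model for illustration only, not the Pub/Sub API:

```java
import java.util.*;

// Hypothetical in-memory model of Pub/Sub topic/subscription fan-out.
public class PubSubDemo {
    private final Map<String, List<String>> topicToSubscriptions = new HashMap<>();
    private final Map<String, List<String>> subscriptionMessages = new HashMap<>();

    // Attach a subscription to a topic; it will receive all future messages.
    public void subscribe(String topic, String subscription) {
        topicToSubscriptions.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        subscriptionMessages.put(subscription, new ArrayList<>());
    }

    // Publishing delivers the message to every subscription on the topic.
    public void publish(String topic, String message) {
        for (String sub : topicToSubscriptions.getOrDefault(topic, List.of())) {
            subscriptionMessages.get(sub).add(message);
        }
    }

    public List<String> pull(String subscription) {
        return subscriptionMessages.get(subscription);
    }

    public static void main(String[] args) {
        PubSubDemo ps = new PubSubDemo();
        ps.subscribe("topicC", "subYC");
        ps.subscribe("topicC", "subZC");
        ps.publish("topicC", "Message 3");
        // Both subscriptions see the same message, as in the diagram.
        System.out.println(ps.pull("subYC"));
        System.out.println(ps.pull("subZC"));
    }
}
```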
27. BigQuery
[Diagram: your data flows through fast ETL (regex, JSON) into Google BigQuery, and out to spreadsheets, BI tools, and coworkers]
• Scales into petabytes
• I/O of 1TB in 1 second
• 100,000 rows/sec Streaming API
• Simple data ingest from GCS or Hadoop
• Connect to R, Pandas, Hadoop, etc.
• Now available in our European data centers
• New row-level security and data expiration
28. Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
30. Cloud Dataflow SDK - Logical Model
Pipeline {
  Who   => Inputs                 <- GCS, Pub/Sub, BigQuery, w/ Avro, XML, JSON, ...
  What  => Transforms             <- Aggregations, Filters, Joins, ...
  Where => Windows                <- Time space: Fixed, Sliding, Sessions, ...
  When  => Watermarks + Triggers  <- Correctness; at-once guarantee (modulo completeness thresholds)
  To    => Outputs                <- GCS, Pub/Sub, BigQuery, ...
}
31. Some Code - Java 7 Implementation

// Create a pipeline configured to run on the managed service.
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));

// Read the raw event dump from GCS.
PCollection<String> rawData = p.begin().apply(TextIO.Read
    .from(OptionsBuilder.GCS_RAWDUMP_URI));

// Parse raw strings into typed playback events.
PCollection<PlaybackEvent> events = rawData.apply(
    new ParseTransform());

// Fan the same events out to three downstream transforms.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());

p.run();
32. Cloud Dataflow SDK - PCollections
❯ A collection of data of type T in a pipeline - a “hippie cousin” of an RDD
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs using KV
{Seahawks, NFC, Champions, Seattle, ...}
{..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ..., “#GoHawks”, ...}
33. Cloud Dataflow SDK - ParDo (“Parallel Do”)
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Elements are processed in arbitrary ‘bundles’, e.g. “shards”
• startBundle(), processElement() x N, finishBundle()
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
KeyBySessionId:
{Seahawks, NFC, Champions, Seattle, ...}
→ {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
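The bundle lifecycle above can be sketched in plain Java. The DoFn interface here is a hypothetical stand-in for the SDK class, just to show the startBundle → processElement × N → finishBundle call pattern:

```java
import java.util.*;

// Hypothetical plain-Java sketch of ParDo's bundle lifecycle (not the real SDK).
public class ParDoDemo {
    interface DoFn<I, O> {
        void startBundle();
        O processElement(I input);
        void finishBundle();
    }

    // Run a DoFn over one "bundle" (shard) of elements:
    // startBundle(), then processElement() N times, then finishBundle().
    static <I, O> List<O> runBundle(DoFn<I, O> fn, List<I> bundle) {
        fn.startBundle();
        List<O> out = new ArrayList<>();
        for (I element : bundle) out.add(fn.processElement(element));
        fn.finishBundle();
        return out;
    }

    // Example DoFn in the spirit of KeyBySessionId: key each event by its first letter.
    public static List<String> keyByFirstLetter(List<String> events) {
        return runBundle(new DoFn<String, String>() {
            public void startBundle() { /* e.g. open a connection, once per bundle */ }
            public String processElement(String e) { return "KV<" + e.charAt(0) + ", " + e + ">"; }
            public void finishBundle() { /* e.g. flush buffered writes */ }
        }, events);
    }

    public static void main(String[] args) {
        System.out.println(keyByFirstLetter(List.of("Seahawks", "NFC")));
    }
}
```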
34. Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
Cloud Dataflow SDK - GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
→ GroupByKey →
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
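The semantics of GroupByKey - gather all values sharing a key into one group - can be sketched in plain Java. This is illustrative only; the real transform runs as a distributed shuffle across workers:

```java
import java.util.*;

// Plain-Java sketch of GroupByKey semantics (not the SDK's distributed shuffle).
public class GroupByKeyDemo {
    // Gather all values with the same key, preserving encounter order.
    public static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> kvs = List.of(
            Map.entry("S", "Seahawks"), Map.entry("C", "Champions"),
            Map.entry("S", "Seattle"), Map.entry("N", "NFC"));
        // S gathers both Seahawks and Seattle, as on the slide.
        System.out.println(groupByKey(kvs));
    }
}
```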
35. Cloud Dataflow SDK - Windows
❯ Logically divides up or groups the elements of a PCollection into finite windows
• Fixed windows: hourly, daily, …
• Sliding windows
• Sessions
❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections
❯ Window.into() can be called at any point in the pipeline and will be applied when needed
❯ Can be tied to arrival/processing time or custom event time
❯ Watermarks + triggers enable robust correctness
Nighttime | Mid-Day | Nighttime
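At its core, fixed-window assignment is just rounding each event timestamp down to a window boundary. A minimal plain-Java sketch (a hypothetical helper, not the SDK's FixedWindows class):

```java
// Hypothetical sketch of fixed-window assignment (not the SDK's FixedWindows).
public class FixedWindowDemo {
    // A fixed window is identified by its start time: the event timestamp
    // rounded down to the nearest window boundary.
    public static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        // Two events 10 seconds apart land in the same 1-minute window...
        System.out.println(windowStart(125_000L, oneMinute)); // 120000
        System.out.println(windowStart(135_000L, oneMinute)); // 120000
        // ...a later event starts a new window.
        System.out.println(windowStart(185_000L, oneMinute)); // 180000
    }
}
```

GroupByKey on an unbounded PCollection becomes tractable because grouping happens per (key, window) rather than over the infinite stream.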
36. Cloud Dataflow Service
Managing correctness

// Window into fixed 1-minute windows. Emit once when the watermark passes
// the end of the window; then re-emit for late data every 10 minutes of
// processing time, giving up 2 days past the watermark. Each fired pane
// carries only the new elements since the last firing.
.apply(Window.<KV<String, PlaybackEvent>>into(
        FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(
        AfterEach.inOrder(
            AfterWatermark.pastEndOfWindow(),
            Repeatedly.forever(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))
                .orFinally(AfterWatermark
                    .pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(2)))))
    .discardingFiredPanes());
37. Cloud Dataflow SDK - Composite PTransforms
❯ Define new PTransforms by building up subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own: DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
[Diagram: the Count composite expands to Pair With Ones → GroupByKey → Sum Values]
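The Count composite in the diagram is just three primitive steps chained together. A plain-Java sketch of that decomposition (illustrative only; the SDK's Count runs these as pipeline transforms):

```java
import java.util.*;

// Plain-Java sketch of the Count composite: Pair With Ones -> GroupByKey -> Sum Values.
public class CountDemo {
    public static <T> Map<T, Long> count(List<T> elements) {
        // Pair With Ones + GroupByKey: each element becomes KV<element, 1>,
        // and all the ones for the same element are gathered together.
        Map<T, List<Long>> grouped = new LinkedHashMap<>();
        for (T e : elements) grouped.computeIfAbsent(e, k -> new ArrayList<>()).add(1L);
        // Sum Values: reduce each group of ones to a total.
        Map<T, Long> counts = new LinkedHashMap<>();
        grouped.forEach((k, ones) -> counts.put(k, ones.stream().mapToLong(Long::longValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("a", "b", "a", "a")));
    }
}
```

Packaging the three steps as one named composite is what buys the code reuse and the cleaner monitoring graph mentioned above.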
38. Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
39. Cloud Dataflow Service
[Diagram: user code & SDK is submitted to the managed service on GCP; the Work Manager and Job Manager deploy & schedule workers, and surface progress & logs in the Monitoring UI]
40. Cloud Dataflow Service - Graph Optimization
● ParDo fusion
○ Producer-consumer
○ Sibling
○ Intelligent fusion boundaries
● Combiner lifting, e.g. partial aggregations before reduction
● Reshard placement
...
[Diagram: producer-consumer fusion collapses a chain C → D into one step C+D; sibling fusion collapses two ParallelDos C and D over the same input into C+D; combiner lifting rewrites A → GBK → + → B as A+ → GBK → + → B, combining partially before the shuffle. Legend: GBK = GroupByKey, + = CombineValues]
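Producer-consumer fusion can be sketched in plain Java as function composition: instead of materializing the output of step C and re-reading it in step D, the service runs each element through a single fused step C+D. A hypothetical illustration, not the service's optimizer:

```java
import java.util.function.Function;

// Hypothetical sketch of producer-consumer ParDo fusion via function composition.
public class FusionDemo {
    // Unfused: C runs over all elements, materializing an intermediate
    // collection that D then re-reads. Fused: each element flows through
    // both functions in one step, with no intermediate materialization.
    public static <A, B, C> Function<A, C> fuse(Function<A, B> producer, Function<B, C> consumer) {
        return producer.andThen(consumer);
    }

    public static void main(String[] args) {
        Function<Integer, Integer> c = x -> x * 2;      // step C
        Function<Integer, String> d = x -> "v" + x;     // step D
        Function<Integer, String> cPlusD = fuse(c, d);  // fused step C+D
        System.out.println(cPlusD.apply(21));
    }
}
```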
41. Cloud Dataflow Service - Worker Lifecycle Management
Deploy → Schedule & Monitor → Tear Down
45. Optimizing Your Time To Answer
Typical data processing: programming, resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, utilization improvements
Data processing with Cloud Dataflow: programming - plus more time to dig into your data