At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
3. Google Cloud Platform
Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
4. Easy, fast, cost-effective
Fast: things take seconds to minutes, not hours or weeks.
Easy: be an expert with your data, not your data infrastructure.
Cost-effective: pay for exactly what you use.
5. Running Hadoop on Google Cloud
Every way of running Hadoop involves the same stack of responsibilities: creation, deployment, GCP connectivity, job submission, scaling, dev integration, monitoring/health, and custom code. The options differ in how much of that stack is Google-managed versus customer-managed:
On-Premise: the customer manages the entire stack.
Vendor Hadoop: a vendor distribution helps, but the stack remains largely customer-managed.
bdutil (free OSS toolkit): Google's open source scripts handle cluster creation and deployment on GCP; the rest of the stack, including manual scaling, stays with the customer.
Cloud Dataproc (managed Hadoop): Google manages the full stack; you manage only your custom code.
6. Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform products, spanning storage, operations, and data, as part of an integrated data platform.
7. Where Cloud Dataproc fits into GCP
Google Cloud Storage (HCFS/HDFS)
Google Bigtable (HBase)
Google BigQuery (analytics, data warehouse)
Google Cloud Dataflow (batch/stream processing)
Stackdriver Logging (logging operations)
Stackdriver Monitoring (monitoring)
8. Most time can be spent with data, not tooling
Less time is spent on Cloud Dataproc setup and customization, since creating, resizing, and destroying clusters is easily done; more time can be dedicated to being hands-on with data, examining it for actionable insights.
9. Lift and shift workloads to Cloud Dataproc
1. Copy data to GCS: copy your data to Google Cloud Storage (GCS) by installing the Cloud Storage connector or by copying it manually.
2. Update file prefix: update the file location prefix in your scripts from hdfs:// to gs:// to access your data in GCS.
3. Use Cloud Dataproc: create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done. (A sketch of the prefix change follows this list.)
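To make step 2 concrete, here is a minimal PySpark sketch of the prefix change; the bucket name and file layout are hypothetical, and Dataproc clusters ship with the Cloud Storage connector preinstalled.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lift-and-shift").getOrCreate()

# Before: the job read from the on-cluster HDFS (hypothetical path).
# df = spark.read.csv("hdfs:///data/trips/*.csv", header=True)

# After: the identical read against Google Cloud Storage; only the
# file prefix changes.
df = spark.read.csv("gs://my-bucket/data/trips/*.csv", header=True)

df.show(5)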
13. Cloud example - slow vs. fast
Traditional clusters: scaling can take hours, days, or weeks to perform. Cloud Dataproc: things take seconds to minutes, not hours or weeks.
[Charts: capacity needed vs. capacity used over time, showing the time needed to obtain new capacity]
14. Cloud example - hard vs. easy
Traditional clusters: you need experts to optimize utilization and deployment. Cloud Dataproc: be an expert with your data, not your data infrastructure.
[Charts: cluster utilization over time; a long-lived, often-inactive cluster vs. right-sized clusters 1 and 2]
15. Cloud example - costly vs. cost-effective
Traditional clusters: you (probably) pay for more capacity than you actually use. Cloud Dataproc: pay for exactly what you use.
[Charts: cost over time under each model]
16. Google Cloud Dataproc - under the hood
A Dataproc cluster is built on Google Cloud services: Compute Engine, Cloud Storage, and the Stackdriver tools.
17. Google Cloud Dataproc - under the hood
On top of those services, every Dataproc cluster runs a Cloud Dataproc agent that manages the cluster.
18. Google Cloud Dataproc - under the hood
Above the agent, Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster.
19. Google Cloud Dataproc - under the hood
Work reaches those components through Dataproc Jobs: Spark, PySpark, Spark SQL, MapReduce, Pig, and Hive.
20. Google Cloud Dataproc - under the hood
Putting it together: Google Cloud services at the bottom; the Cloud Dataproc agent and the Spark & Hadoop OSS components on the cluster; Dataproc Jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive) driving applications on the cluster; and GCP products consuming the data outputs. A sketch of submitting such a job follows.
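As an illustration of the Dataproc Jobs layer, here is a hedged sketch that submits a PySpark job using the google-cloud-dataproc Python client (a newer client than existed when this deck was presented); the project, region, cluster name, and file URI are placeholders.

from google.cloud import dataproc_v1

# Use the regional Dataproc endpoint (region is a placeholder).
region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A PySpark job aimed at an existing cluster; the driver file lives in GCS.
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/word_count.py"},
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
result = operation.result()  # blocks until the job finishes
print(result.status.state)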
26. Google Cloud Dataproc - demo overview
In this demo we are going to do a few things:
1. Create a cluster
2. Query a large set of data stored in Google Cloud Storage
3. Review the output of the queries
4. Delete the cluster
A minimal sketch of the create-and-delete steps appears below.
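This is a hedged sketch of the create-and-delete bookends of the demo, using the google-cloud-dataproc Python client; the project, region, cluster name, and machine shapes are placeholders, not the demo's actual configuration.

from google.cloud import dataproc_v1

project, region, name = "my-project", "us-central1", "demo-cluster"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# 1. Create a small cluster (sizes are illustrative).
cluster = {
    "project_id": project,
    "cluster_name": name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2-3. Submit queries against the data in Cloud Storage and review the
# output (see the Spark SQL sketch later in the deck).

# 4. Delete the cluster so you stop paying the moment you are done.
client.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": name}
).result()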
27. What just happened?
YARN cores: 1,600
YARN RAM: 4.7 TB
Spark & Hadoop: 100%
Clicks: 1
28. NYC taxi data
The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015. The original dataset is in CSV format, contains over 20 columns, and covers about 1.2 billion trips, roughly 270 gigabytes in total.
30. Demo queries
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;

SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count;

SELECT passenger_count, year(pickup_datetime), count(*)
FROM trips
GROUP BY passenger_count, year(pickup_datetime);

SELECT passenger_count, year(pickup_datetime) trip_year,
       round(trip_distance), count(*) trips
FROM trips
GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
ORDER BY trip_year, trips DESC;
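One way to run these queries on the cluster is through Spark SQL from PySpark; a minimal sketch, assuming the CSVs live at a hypothetical GCS path (the actual demo layout may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nyc-taxi").getOrCreate()

# Load the trip CSVs from Cloud Storage and expose them as a SQL table.
trips = spark.read.csv("gs://my-bucket/nyc-taxi/*.csv",
                       header=True, inferSchema=True)
trips.createOrReplaceTempView("trips")

# First demo query: trip counts by cab type.
spark.sql("""
    SELECT cab_type, count(*)
    FROM trips
    GROUP BY cab_type
""").show()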
31. Demo recap
Dataset: 270 GB
Trips: 1.2 B
Queries: 4
Apache ecosystem: 100%
33. If you’re processing data, you may also want to consider...
34. Google Cloud Dataflow & Apache Beam
The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines. Cloud Dataflow is a fully managed (no-ops), integrated service for executing optimized, parallelized data processing pipelines. A minimal Beam pipeline is sketched below.
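For flavor, here is a minimal Beam word-count pipeline in Python. It runs locally on the DirectRunner by default; pass --runner=DataflowRunner plus project and staging options to execute it on Cloud Dataflow. The GCS paths are hypothetical.

import apache_beam as beam

# Read lines, count words, write the results.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/counts")
    )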
36. Joining several threads into Beam
Google's internal data systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) informed Cloud Dataproc and Cloud Dataflow, and those threads now join in Apache Beam.
37. Google BigQuery
A fully managed analytics data warehouse: highly available, encrypted, and durable, with virtually unlimited resources where you only pay for what you use.
38. Google Cloud Bigtable
Google Cloud Bigtable offers companies a fast, fully managed, massively scalable NoSQL database service with an HBase-compliant API. Unlike comparable market offerings, Bigtable is the only fully managed database where organizations don't have to sacrifice speed, scale, or cost-efficiency when they build applications. Bigtable has been battle-tested at Google for more than 10 years as the database driving major applications including Google Analytics, Gmail, and YouTube.
40. Cloud Dataproc - get started today
1. Create a Google Cloud project
2. Open the Developers Console
3. Visit the Dataproc section
4. Create a cluster in 1 click (about 90 seconds)
41. If you only remember 3 things...
Cloud Dataproc is easy: it offers a number of tools for interacting with clusters and jobs, so you can be hands-on with your data.
Cloud Dataproc is fast: clusters start in under 90 seconds on average, so you spend less time and money waiting for your clusters.
Cloud Dataproc is cost-effective: pricing is just 1 cent per vCPU per hour with minute-by-minute billing (for example, a 100-vCPU cluster used for 30 minutes adds about $0.50 on top of the underlying VM cost).