At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
3. Google Cloud Platform
Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
4. Easy, fast, cost-effective
Fast: things take seconds to minutes, not hours or weeks.
Easy: be an expert with your data, not your data infrastructure.
Cost-effective: pay for exactly what you use.
5. Running Hadoop on Google Cloud
Every way of running Hadoop involves the same stack of responsibilities: creation, deployment, GCP connectivity, job submission, scaling, dev integration, monitoring/health, and custom code. The options differ in how much of that stack is Google-managed versus customer-managed:
On-Premise: the customer manages the entire stack.
Vendor Hadoop: a vendor distribution helps, but the stack remains largely customer-managed.
bdutil (free OSS toolkit): Google's open source scripts handle cluster creation and deployment on GCP; the rest of the stack, including manual scaling, stays with the customer.
Cloud Dataproc (managed Hadoop): Google manages the full stack; you manage only your custom code.
6. Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform products, spanning storage, operations, and data, as part of an integrated data platform.
7. Where Cloud Dataproc fits into GCP
Google Cloud Storage (HCFS/HDFS)
Google Bigtable (HBase)
Google BigQuery (analytics, data warehouse)
Google Cloud Dataflow (batch/stream processing)
Stackdriver Logging (logging operations)
Stackdriver Monitoring (monitoring)
8. Most time can be spent with data, not tooling
Less time is spent on Cloud Dataproc setup and customization, since creating, resizing, and destroying clusters is easily done; more time can be dedicated to being hands-on with data, examining it for actionable insights.
9. Lift and shift workloads to Cloud Dataproc
1. Copy data to GCS: copy your data to Google Cloud Storage (GCS) by installing the Cloud Storage connector or by copying it manually.
2. Update file prefix: update the file location prefix in your scripts from hdfs:// to gs:// to access your data in GCS.
3. Use Cloud Dataproc: create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done. (A sketch of the prefix change follows this list.)
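To make step 2 concrete, here is a minimal PySpark sketch of the prefix change; the bucket name and file layout are hypothetical, and Dataproc clusters ship with the Cloud Storage connector preinstalled.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lift-and-shift").getOrCreate()

# Before: the job read from the on-cluster HDFS (hypothetical path).
# df = spark.read.csv("hdfs:///data/trips/*.csv", header=True)

# After: the identical read against Google Cloud Storage; only the
# file prefix changes.
df = spark.read.csv("gs://my-bucket/data/trips/*.csv", header=True)

df.show(5)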
13. Cloud example - slow vs. fast
Traditional clusters: scaling can take hours, days, or weeks to perform. Cloud Dataproc: things take seconds to minutes, not hours or weeks.
[Charts: capacity needed vs. capacity used over time, showing the time needed to obtain new capacity]
14. Cloud example - hard vs. easy
Traditional clusters: you need experts to optimize utilization and deployment. Cloud Dataproc: be an expert with your data, not your data infrastructure.
[Charts: cluster utilization over time; a long-lived, often-inactive cluster vs. right-sized clusters 1 and 2]
15. Cloud example - costly vs. cost-effective
Traditional clusters: you (probably) pay for more capacity than you actually use. Cloud Dataproc: pay for exactly what you use.
[Charts: cost over time under each model]
16. Google Cloud Dataproc - under the hood
A Dataproc cluster is built on Google Cloud services: Compute Engine, Cloud Storage, and the Stackdriver tools.
17. Google Cloud Dataproc - under the hood
On top of those services, every Dataproc cluster runs a Cloud Dataproc agent that manages the cluster.
18. Google Cloud Dataproc - under the hood
Above the agent, Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster.
19. Google Cloud Dataproc - under the hood
Work reaches those components through Dataproc Jobs: Spark, PySpark, Spark SQL, MapReduce, Pig, and Hive.
20. Google Cloud Dataproc - under the hood
Putting it together: Google Cloud services at the bottom; the Cloud Dataproc agent and the Spark & Hadoop OSS components on the cluster; Dataproc Jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive) driving applications on the cluster; and GCP products consuming the data outputs. A sketch of submitting such a job follows.
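As an illustration of the Dataproc Jobs layer, here is a hedged sketch that submits a PySpark job using the google-cloud-dataproc Python client (a newer client than existed when this deck was presented); the project, region, cluster name, and file URI are placeholders.

from google.cloud import dataproc_v1

# Use the regional Dataproc endpoint (region is a placeholder).
region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A PySpark job aimed at an existing cluster; the driver file lives in GCS.
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/word_count.py"},
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
result = operation.result()  # blocks until the job finishes
print(result.status.state)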
26. Google Cloud Dataproc - demo overview
In this demo we are going to do a few things:
1. Create a cluster
2. Query a large set of data stored in Google Cloud Storage
3. Review the output of the queries
4. Delete the cluster
A minimal sketch of the create-and-delete steps appears below.
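This is a hedged sketch of the create-and-delete bookends of the demo, using the google-cloud-dataproc Python client; the project, region, cluster name, and machine shapes are placeholders, not the demo's actual configuration.

from google.cloud import dataproc_v1

project, region, name = "my-project", "us-central1", "demo-cluster"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# 1. Create a small cluster (sizes are illustrative).
cluster = {
    "project_id": project,
    "cluster_name": name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2-3. Submit queries against the data in Cloud Storage and review the
# output (see the Spark SQL sketch later in the deck).

# 4. Delete the cluster so you stop paying the moment you are done.
client.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": name}
).result()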
27. What just happened?
YARN cores: 1,600
YARN RAM: 4.7 TB
Spark & Hadoop: 100%
Clicks: 1
28. NYC taxi data
The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015. The original dataset is in CSV format, contains over 20 columns, and covers about 1.2 billion trips, roughly 270 gigabytes in total.
30. Demo queries
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;

SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count;

SELECT passenger_count, year(pickup_datetime), count(*)
FROM trips
GROUP BY passenger_count, year(pickup_datetime);

SELECT passenger_count, year(pickup_datetime) trip_year,
       round(trip_distance), count(*) trips
FROM trips
GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
ORDER BY trip_year, trips DESC;
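One way to run these queries on the cluster is through Spark SQL from PySpark; a minimal sketch, assuming the CSVs live at a hypothetical GCS path (the actual demo layout may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nyc-taxi").getOrCreate()

# Load the trip CSVs from Cloud Storage and expose them as a SQL table.
trips = spark.read.csv("gs://my-bucket/nyc-taxi/*.csv",
                       header=True, inferSchema=True)
trips.createOrReplaceTempView("trips")

# First demo query: trip counts by cab type.
spark.sql("""
    SELECT cab_type, count(*)
    FROM trips
    GROUP BY cab_type
""").show()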
31. Demo recap
Dataset: 270 GB
Trips: 1.2 B
Queries: 4
Apache ecosystem: 100%
33. If you’re processing data, you may also want to consider...
34. Google Cloud Dataflow & Apache Beam
The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines. Cloud Dataflow is a fully managed (no-ops), integrated service for executing optimized, parallelized data processing pipelines. A minimal Beam pipeline is sketched below.
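For flavor, here is a minimal Beam word-count pipeline in Python. It runs locally on the DirectRunner by default; pass --runner=DataflowRunner plus project and staging options to execute it on Cloud Dataflow. The GCS paths are hypothetical.

import apache_beam as beam

# Read lines, count words, write the results.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/counts")
    )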
36. Joining several threads into Beam
Google's internal data systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) informed Cloud Dataproc and Cloud Dataflow, and those threads now join in Apache Beam.
37. Google BigQuery
A fully managed analytics data warehouse: highly available, encrypted, and durable, with virtually unlimited resources where you only pay for what you use.
38. Google Cloud Bigtable
Google Cloud Bigtable offers companies a fast, fully managed, massively scalable NoSQL database service with an HBase-compliant API. Unlike comparable market offerings, Bigtable is the only fully managed database where organizations don't have to sacrifice speed, scale, or cost-efficiency when they build applications. Bigtable has been battle-tested at Google for more than 10 years as the database driving major applications including Google Analytics, Gmail, and YouTube.
40. Cloud Dataproc - get started today
1. Create a Google Cloud project
2. Open the Developers Console
3. Visit the Dataproc section
4. Create a cluster in 1 click (about 90 seconds)
41. If you only remember 3 things...
Cloud Dataproc is easy: it offers a number of tools for interacting with clusters and jobs, so you can be hands-on with your data.
Cloud Dataproc is fast: clusters start in under 90 seconds on average, so you spend less time and money waiting for your clusters.
Cloud Dataproc is cost-effective: pricing is just 1 cent per vCPU per hour with minute-by-minute billing (for example, a 100-vCPU cluster used for 30 minutes adds about $0.50 on top of the underlying VM cost).