SlideShare ist ein Scribd-Unternehmen logo
1 von 42
James Malone
Product Manager
More data. Zero headaches.
Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.
Google Cloud Platform 2
Cloud Dataproc features and benefits
Google Cloud Platform 3
Apache Spark and Apache Hadoop should be
fast, easy, and cost-effective.
Easy, fast, cost-effective
Fast
Things take seconds to minutes, not hours or weeks
Easy
Be an expert with your data, not your data infrastructure
Cost-effective
Pay for exactly what you use
Running Hadoop on Google Cloud
bdutil
Free OSS Toolkit
Dataproc
Managed Hadoop
Custom Code
Monitoring/Health
Dev Integration
Scaling
Job Submission
GCP Connectivity
Deployment
Creation
Custom Code
Monitoring/Health
Dev Integration
Manual Scaling
Job Submission
GCP Connectivity
Deployment
Creation
On
Premise
Custom Code
Monitoring/Health
Dev Integration
Scaling
Job Submission
GCP Connectivity
Deployment
Creation
Google Managed
Google Cloud Platform
Customer Managed
Vendor
Hadoop
Custom Code
Monitoring/Health
Dev Integration
Scaling
Job Submission
GCP Connectivity
Deployment
Creation
6
Cloud Dataproc - integrated
6
Cloud Dataproc is
natively integrated with
several Google Cloud
Platform products as
part of an integrated
data platform.
Storage
Operations
Data
7
Where Cloud Dataproc fits into GCP
7
Google Bigtable
(HBase)
Google BigQuery
(Analytics, Data warehouse)
Stackdriver Logging
(Logging Ops.)
Google Cloud Dataflow
(Batch/Stream Processing)
Google Cloud Storage
(HCFS/HDFS)
Stackdriver Monitoring
(Monitoring)
8
Most time can be spent with data, not tooling
More time can be
dedicated to examining
data for actionable insights
Less time is spent with
clusters since creating,
resizing, and destroying
clusters is easily done
Hands-on with data
Cloud Dataproc setup
and customization
9
Lift and shift workloads to Cloud Dataproc
Copy data to GCS
Copy your data to Google
Cloud Storage (GCS) by
installing the connector or
by copying manually.
Update file prefix
Update the file location
prefix in your scripts from
hdfs:// to gcs:// to
access your data in GCS.
Use Cloud Dataproc
Create a Cloud Dataproc
cluster and run your job on
the cluster against the data
you copied to GCS. Done.
1 32
Google Cloud Platform 10
How does Google Cloud Dataproc help me?
Traditional Spark and Hadoop clusters
Google Cloud Dataproc
Cloud example - slow vs. fast
Things take
seconds to minutes,
not hours or weeks
capacityneeded
t
Time needed to obtain new capacity
capacityused
t
Scaling can take
hours, days, or
weeks to perform
Traditional clusters Cloud Dataproc
Cloud example - hard vs. easy
Be an expert with
your data, not your
data infrastructure
Need experts to
optimize utilization
and deployment
Traditional clusters Cloud Dataproc
clusterutilization
Cluster
Inactive
t
clusterutilization
t
cluster 1 cluster 2
Cloud example - costly vs. cost-effective
Pay for exactly what
you use
You (probably) pay
for more capacity
than actually used
Traditional clusters Cloud Dataproc
Time
Cost
Time
Cost
Google Cloud Dataproc - under the hood
Google Cloud Services
Dataproc Cluster
Cloud Dataproc uses GCP - Compute Engine,
Cloud Storage, and Stackdriver tools
Google Cloud Dataproc - under the hood
Cloud Dataproc Agent
Google Cloud Services
Dataproc Cluster
Cloud Dataproc clusters have an agent
to manage the Cloud Dataproc cluster
Dataproc uses Compute Engine, Cloud
Storage, and Cloud Ops tools
Google Cloud Dataproc - under the hood
Spark & Hadoop OSS Spark, Hadoop, Hive, Pig, and other OSS
components execute on the cluster
Cloud Dataproc Agent
Google Cloud Services
Dataproc Cluster
Cloud Dataproc clusters have an agent
to manage the Cloud Dataproc cluster
Dataproc uses Compute Engine, Cloud
Storage, and Cloud Ops tools
Google Cloud Dataproc - under the hood
Spark
PySpark
Spark SQL
MapReduce
Pig
Hive
Spark & Hadoop OSS
Cloud Dataproc Agent
Google Cloud Services
Dataproc Cluster Dataproc Jobs
Google Cloud Dataproc - under the hood
Applications on
the cluster
Dataproc Jobs
GCP Products
Spark
PySpark
Spark SQL
MapReduce
Pig
Hive
Dataproc Cluster
Spark & Hadoop OSS
Cloud Dataproc Agent
Google Cloud Services
Dataproc Jobs FeaturesData Outputs
Google Cloud Platform 21
How can I use Cloud Dataproc?
Google Cloud Platform 22
Google Developers Console
https://console.developers.google.com/
Google Cloud Platform 23
Google Cloud SDK
https://cloud.google.com/sdk/
Google Cloud Platform 24
Cloud Dataproc REST API
https://cloud.google.com/dataproc/reference/rest/
Google Cloud Platform 25
Let’s see an example - Cloud Dataproc demo
Confidential & ProprietaryGoogle Cloud Platform 26
Google Cloud Dataproc - demo overview
In this demo we are going to do a few things:
Create a cluster
Query a large set of data stored in Google Cloud Storage
Review the output of the queries
Delete the cluster
Google Cloud Platform 27
YARN Cores
1,600
What just happened?
YARN RAM
4.7 TB
Spark & Hadoop
100%
Click
1
Google Cloud Platform 2828
The New York City Taxi & Limousine
Commission and Uber released a
dataset of trips from 2009-2015
Original dataset is in CSV format and
contains over 20 columns of data and
about 1.2 billion trips
The dataset is about ~270 gigabytes
NYC taxi data
28
Google Cloud Platform 29
CREATE EXTERNAL TABLE trips (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
...(44 other columns)...,
dropoff_puma STRING)
STORED AS orc
LOCATION 'gs://taxi-nyc-demo/trips/'
TBLPROPERTIES (
"orc.compress"="SNAPPY",
"orc.stripe.size"="536870912",
"orc.row.index.stride"="50000");
Google Cloud Platform 30
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;
SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count;
SELECT passenger_count, year(pickup_datetime), count(*)
FROM trips
GROUP BY passenger_count, year(pickup_datetime);
SELECT passenger_count, year(pickup_datetime) trip_year,
round(trip_distance), count(*) trips
FROM trips
GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
ORDER BY trip_year, trips DESC;
Google Cloud Platform 31
Dataset
270 GB
Demo recap
Trips
1.2 B
Queries
4
Apache ecosystem
100%
Google Cloud Platform 32
$12.85
(vs $77.58, $41.54)
Google Cloud Platform 33
If you’re processing data, you may also want to consider...
Google Cloud Dataflow & Apache Beam
The Cloud Dataflow SDK, based
on Apache Beam, is a collection
of SDKs for building streaming
data processing pipelines.
Cloud Dataflow is a fully managed
(no-ops) and integrated service for
executing optimized parallelized
data processing pipelines.
MapReduce
BigTable DremelColossus
FlumeMegastore
SpannerPubSub
Millwheel
Cloud
Dataflow
Cloud
Dataproc
Apache
Beam
Joining several threads into Beam
MapReduce
BigTable DremelColossus
FlumeMegastore
SpannerPubSub
Millwheel
Cloud
Dataflow
Cloud
Dataproc
Apache
Beam
Google BigQuery
Virtually unlimited resources, but you only pay for what you use
Fully-managed
Analytics Data Warehouse
Highly Available, Encrypted, Durable
Google Cloud Bigtable
Google Cloud Bigtable offers companies a fast, fully managed, infinitely
scalable NoSQL database service with a HBase-compliant API included.
Unlike comparable market offerings, Bigtable is the only fully-managed
database where organizations don’t have to sacrifice speed, scale or cost-
efficiency when they build applications.
Google Cloud Bigtable has been battle-tested at Google for 10 years as the
database driving all major applications including Google Analytics, Gmail and
YouTube.
Google Cloud Platform 39
Wrapping things up
Cloud Dataproc - get started today
Create a Google Cloud project
Visit Dataproc section
1
2
3
4
Open Developers Console
Create cluster in 1 click, 90 sec.
If you only remember 3 things...
Cloud Dataproc
is easy
Cloud Dataproc offers a
number of tools to easily
interact with clusters and
jobs so you can be hands-
on with your data.
Cloud Dataproc
is fast
Cloud Dataproc clusters
start in under 90 seconds
on average so you spend
less time and money
waiting for your clusters.
Cloud Dataproc
is cost effective
Cloud Dataproc is easy on
the pocketbook with a low
pricing of just 1c per vCPU
per hour and minute by
minute billing
Google Cloud Platform 42
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumClement Demonchy
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query BasicsIdo Green
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best PracticesMatillion
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdfChris Hoyean Song
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Introduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrepIntroduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrepPaweł Mitruś
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 

Was ist angesagt? (20)

From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debezium
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Bigquery 101
Bigquery 101Bigquery 101
Bigquery 101
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Gcp dataflow
Gcp dataflowGcp dataflow
Gcp dataflow
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Introduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrepIntroduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrep
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 

Ähnlich wie Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop

Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 
Eric Andersen Keynote
Eric Andersen KeynoteEric Andersen Keynote
Eric Andersen KeynoteData Con LA
 
Deep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big DataDeep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big DataTu Le Dinh
 
Big data on google cloud
Big data on google cloudBig data on google cloud
Big data on google cloudTu Pham
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
 
Bogdan botea, dmitry nefedkin no fiddle, efficient development on the googl...
Bogdan botea, dmitry nefedkin   no fiddle, efficient development on the googl...Bogdan botea, dmitry nefedkin   no fiddle, efficient development on the googl...
Bogdan botea, dmitry nefedkin no fiddle, efficient development on the googl...Codecamp Romania
 
Google Cloud Platform (GCP) At a Glance
Google Cloud Platform (GCP)  At a GlanceGoogle Cloud Platform (GCP)  At a Glance
Google Cloud Platform (GCP) At a GlanceCloud Analogy
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Ido Green
 
Google Cloud Data Platform - Why Google for Data Analysis?
Google Cloud Data Platform - Why Google for Data Analysis?Google Cloud Data Platform - Why Google for Data Analysis?
Google Cloud Data Platform - Why Google for Data Analysis?Andreas Raible
 
Google Cloud lightning talk @MHacks
Google Cloud lightning talk @MHacksGoogle Cloud lightning talk @MHacks
Google Cloud lightning talk @MHackswesley chun
 
QMeeting 2018 - Como integrar qlik e cloudera
QMeeting 2018 - Como integrar qlik e clouderaQMeeting 2018 - Como integrar qlik e cloudera
QMeeting 2018 - Como integrar qlik e clouderaRoberto Oliveira
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platformdhruv_chaudhari
 
Spark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolaSpark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolatsliwowicz
 
Slides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-CloudSlides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-CloudDATAVERSITY
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT_MTL
 
Talend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend
 
code lab live Google Cloud Endpoints [DevFest 2015 Bari]
code lab live Google Cloud Endpoints [DevFest 2015 Bari]code lab live Google Cloud Endpoints [DevFest 2015 Bari]
code lab live Google Cloud Endpoints [DevFest 2015 Bari]Nicola Policoro
 

Ähnlich wie Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop (20)

Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Eric Andersen Keynote
Eric Andersen KeynoteEric Andersen Keynote
Eric Andersen Keynote
 
Deep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big DataDeep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big Data
 
Big data on google cloud
Big data on google cloudBig data on google cloud
Big data on google cloud
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Bogdan botea, dmitry nefedkin no fiddle, efficient development on the googl...
Bogdan botea, dmitry nefedkin   no fiddle, efficient development on the googl...Bogdan botea, dmitry nefedkin   no fiddle, efficient development on the googl...
Bogdan botea, dmitry nefedkin no fiddle, efficient development on the googl...
 
Google Cloud Platform (GCP) At a Glance
Google Cloud Platform (GCP)  At a GlanceGoogle Cloud Platform (GCP)  At a Glance
Google Cloud Platform (GCP) At a Glance
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
 
Google Cloud Data Platform - Why Google for Data Analysis?
Google Cloud Data Platform - Why Google for Data Analysis?Google Cloud Data Platform - Why Google for Data Analysis?
Google Cloud Data Platform - Why Google for Data Analysis?
 
Google Cloud lightning talk @MHacks
Google Cloud lightning talk @MHacksGoogle Cloud lightning talk @MHacks
Google Cloud lightning talk @MHacks
 
QMeeting 2018 - Como integrar qlik e cloudera
QMeeting 2018 - Como integrar qlik e clouderaQMeeting 2018 - Como integrar qlik e cloudera
QMeeting 2018 - Como integrar qlik e cloudera
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Spark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolaSpark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboola
 
Slides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-CloudSlides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-Cloud
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
TIAD : Automate everything with Google Cloud
TIAD : Automate everything with Google CloudTIAD : Automate everything with Google Cloud
TIAD : Automate everything with Google Cloud
 
Talend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech Overview
 
code lab live Google Cloud Endpoints [DevFest 2015 Bari]
code lab live Google Cloud Endpoints [DevFest 2015 Bari]code lab live Google Cloud Endpoints [DevFest 2015 Bari]
code lab live Google Cloud Endpoints [DevFest 2015 Bari]
 

Mehr von huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watsonhuguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitchinghuguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoringhuguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startuphuguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapulthuguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysishuguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analyticshuguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Socialhuguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligencehuguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaminghuguk
 

Mehr von huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
 

Kürzlich hochgeladen

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 

Kürzlich hochgeladen (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop

  • 1. James Malone Product Manager More data. Zero headaches. Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.
  • 2. Google Cloud Platform 2 Cloud Dataproc features and benefits
  • 3. Google Cloud Platform 3 Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
  • 4. Easy, fast, cost-effective Fast Things take seconds to minutes, not hours or weeks Easy Be an expert with your data, not your data infrastructure Cost-effective Pay for exactly what you use
  • 5. Running Hadoop on Google Cloud bdutil Free OSS Toolkit Dataproc Managed Hadoop Custom Code Monitoring/Health Dev Integration Scaling Job Submission GCP Connectivity Deployment Creation Custom Code Monitoring/Health Dev Integration Manual Scaling Job Submission GCP Connectivity Deployment Creation On Premise Custom Code Monitoring/Health Dev Integration Scaling Job Submission GCP Connectivity Deployment Creation Google Managed Google Cloud Platform Customer Managed Vendor Hadoop Custom Code Monitoring/Health Dev Integration Scaling Job Submission GCP Connectivity Deployment Creation
  • 6. 6 Cloud Dataproc - integrated 6 Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform. Storage Operations Data
  • 7. 7 Where Cloud Dataproc fits into GCP 7 Google Bigtable (HBase) Google BigQuery (Analytics, Data warehouse) Stackdriver Logging (Logging Ops.) Google Cloud Dataflow (Batch/Stream Processing) Google Cloud Storage (HCFS/HDFS) Stackdriver Monitoring (Monitoring)
  • 8. 8 Most time can be spent with data, not tooling More time can be dedicated to examining data for actionable insights Less time is spent with clusters since creating, resizing, and destroying clusters is easily done Hands-on with data Cloud Dataproc setup and customization
  • 9. 9 Lift and shift workloads to Cloud Dataproc Copy data to GCS Copy your data to Google Cloud Storage (GCS) by installing the connector or by copying manually. Update file prefix Update the file location prefix in your scripts from hdfs:// to gcs:// to access your data in GCS. Use Cloud Dataproc Create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done. 1 32
  • 10. Google Cloud Platform 10 How does Google Cloud Dataproc help me?
  • 11. Traditional Spark and Hadoop clusters
  • 13. Cloud example - slow vs. fast Things take seconds to minutes, not hours or weeks capacityneeded t Time needed to obtain new capacity capacityused t Scaling can take hours, days, or weeks to perform Traditional clusters Cloud Dataproc
  • 14. Cloud example - hard vs. easy Be an expert with your data, not your data infrastructure Need experts to optimize utilization and deployment Traditional clusters Cloud Dataproc clusterutilization Cluster Inactive t clusterutilization t cluster 1 cluster 2
  • 15. Cloud example - costly vs. cost-effective Pay for exactly what you use You (probably) pay for more capacity than actually used Traditional clusters Cloud Dataproc Time Cost Time Cost
  • 16. Google Cloud Dataproc - under the hood Google Cloud Services Dataproc Cluster Cloud Dataproc uses GCP - Compute Engine, Cloud Storage, and Stackdriver tools
  • 17. Google Cloud Dataproc - under the hood Cloud Dataproc Agent Google Cloud Services Dataproc Cluster Cloud Dataproc clusters have an agent to manage the Cloud Dataproc cluster Dataproc uses Compute Engine, Cloud Storage, and Cloud Ops tools
  • 18. Google Cloud Dataproc - under the hood Spark & Hadoop OSS Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster Cloud Dataproc Agent Google Cloud Services Dataproc Cluster Cloud Dataproc clusters have an agent to manage the Cloud Dataproc cluster Dataproc uses Compute Engine, Cloud Storage, and Cloud Ops tools
  • 19. Google Cloud Dataproc - under the hood Spark PySpark Spark SQL MapReduce Pig Hive Spark & Hadoop OSS Cloud Dataproc Agent Google Cloud Services Dataproc Cluster Dataproc Jobs
  • 20. Google Cloud Dataproc - under the hood Applications on the cluster Dataproc Jobs GCP Products Spark PySpark Spark SQL MapReduce Pig Hive Dataproc Cluster Spark & Hadoop OSS Cloud Dataproc Agent Google Cloud Services Dataproc Jobs FeaturesData Outputs
  • 21. Google Cloud Platform 21 How can I use Cloud Dataproc?
  • 22. Google Cloud Platform 22 Google Developers Console https://console.developers.google.com/
  • 23. Google Cloud Platform 23 Google Cloud SDK https://cloud.google.com/sdk/
  • 24. Google Cloud Platform 24 Cloud Dataproc REST API https://cloud.google.com/dataproc/reference/rest/
  • 25. Google Cloud Platform 25 Let’s see an example - Cloud Dataproc demo
  • 26. Confidential & ProprietaryGoogle Cloud Platform 26 Google Cloud Dataproc - demo overview In this demo we are going to do a few things: Create a cluster Query a large set of data stored in Google Cloud Storage Review the output of the queries Delete the cluster
  • 27. Google Cloud Platform 27 YARN Cores 1,600 What just happened? YARN RAM 4.7 TB Spark & Hadoop 100% Click 1
  • 28. Google Cloud Platform 2828 The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015 Original dataset is in CSV format and contains over 20 columns of data and about 1.2 billion trips The dataset is about ~270 gigabytes NYC taxi data 28
  • 29. Google Cloud Platform 29 CREATE EXTERNAL TABLE trips ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, ...(44 other columns)..., dropoff_puma STRING) STORED AS orc LOCATION 'gs://taxi-nyc-demo/trips/' TBLPROPERTIES ( "orc.compress"="SNAPPY", "orc.stripe.size"="536870912", "orc.row.index.stride"="50000");
  • 30. Google Cloud Platform 30 SELECT cab_type, count(*) FROM trips GROUP BY cab_type; SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count; SELECT passenger_count, year(pickup_datetime), count(*) FROM trips GROUP BY passenger_count, year(pickup_datetime); SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips DESC;
  • 31. Google Cloud Platform 31 Dataset 270 GB Demo recap Trips 1.2 B Queries 4 Apache ecosystem 100%
  • 32. Google Cloud Platform 32 $12.85 (vs $77.58, $41.54)
  • 33. Google Cloud Platform 33 If you’re processing data, you may also want to consider...
  • 34. Google Cloud Dataflow & Apache Beam The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines. Cloud Dataflow is a fully managed (no-ops) and integrated service for executing optimized parallelized data processing pipelines.
  • 36. Joining several threads into Beam MapReduce BigTable DremelColossus FlumeMegastore SpannerPubSub Millwheel Cloud Dataflow Cloud Dataproc Apache Beam
  • 37. Google BigQuery Virtually unlimited resources, but you only pay for what you use Fully-managed Analytics Data Warehouse Highly Available, Encrypted, Durable
  • 38. Google Cloud Bigtable Google Cloud Bigtable offers companies a fast, fully managed, infinitely scalable NoSQL database service with a HBase-compliant API included. Unlike comparable market offerings, Bigtable is the only fully-managed database where organizations don’t have to sacrifice speed, scale or cost- efficiency when they build applications. Google Cloud Bigtable has been battle-tested at Google for 10 years as the database driving all major applications including Google Analytics, Gmail and YouTube.
  • 39. Google Cloud Platform 39 Wrapping things up
  • 40. Cloud Dataproc - get started today Create a Google Cloud project Visit Dataproc section 1 2 3 4 Open Developers Console Create cluster in 1 click, 90 sec.
  • 41. If you only remember 3 things... Cloud Dataproc is easy Cloud Dataproc offers a number of tools to easily interact with clusters and jobs so you can be hands- on with your data. Cloud Dataproc is fast Cloud Dataproc clusters start in under 90 seconds on average so you spend less time and money waiting for your clusters. Cloud Dataproc is cost effective Cloud Dataproc is easy on the pocketbook with a low pricing of just 1c per vCPU per hour and minute by minute billing
  • 42. Google Cloud Platform 42 Thank You