SlideShare ist ein Scribd-Unternehmen logo
1 von 33
1© Cloudera, Inc. All rights reserved.
Spark Operations
Kostas Sakellis
2© Cloudera, Inc. All rights reserved.
Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager
3© Cloudera, Inc. All rights reserved.
Building a proof of
concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
4© Cloudera, Inc. All rights reserved.
Example
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
5© Cloudera, Inc. All rights reserved.
Example
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
6© Cloudera, Inc. All rights reserved.
Example
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
7© Cloudera, Inc. All rights reserved.
Partitions
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
8© Cloudera, Inc. All rights reserved.
RDDs
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
…RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
9© Cloudera, Inc. All rights reserved.
RDDs
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
10© Cloudera, Inc. All rights reserved.
RDDs
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
11© Cloudera, Inc. All rights reserved.
…RDD …RDD
RDDs
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Collect
12© Cloudera, Inc. All rights reserved.
…RDD …RDD
RDD Lineage
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
sc.textFile(“hdfs://data/u.item”, 4)
.map(Movie(_))
.filter(_.month.equals(“Nov”))
.collect()
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Collect
Lineage
13© Cloudera, Inc. All rights reserved.
Task
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Collect
• A pipelined set of transformation on a single thread
14© Cloudera, Inc. All rights reserved.
Spark Architecture
15© Cloudera, Inc. All rights reserved.
Spark System Architecture
16© Cloudera, Inc. All rights reserved.
Deployments
• Spark supports pluggable Cluster Managers
• local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support
17© Cloudera, Inc. All rights reserved.
Standalone
Master
Worker
Client
Worker
Process
App
Master
Process
18© Cloudera, Inc. All rights reserved.
Standalone
• On cluster
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
• Submit job
spark-submit --master <master-spark-URL> …
19© Cloudera, Inc. All rights reserved.
Container
YARN Architecture
Resource
Manager
Node
Manager
Client
Node
Manager
Container
Process
App
Master
Container
Process
20© Cloudera, Inc. All rights reserved.
Container
Spark on YARN Architecture
Resource
Manager
Node
Manager
Client
Node
Manager
Container
Process
App
Master
Container
Process
21© Cloudera, Inc. All rights reserved.
Container
Spark on YARN Architecture
Resource
Manager
Node
Manager
Client
Node
Manager
Container
Process
App
Master
Container
Process
22© Cloudera, Inc. All rights reserved.
Spark on YARN
• Submit job
spark-submit --master yarn-client …
• Cluster mode
spark-submit --master yarn-cluster …
• Spark shell only works in client mode!
23© Cloudera, Inc. All rights reserved.
Customers often
have shared
infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
24© Cloudera, Inc. All rights reserved.
Multi-tenancy
• Cluster utilization is top metric
• Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
• Built in resource manager
25© Cloudera, Inc. All rights reserved.
Underutilized
Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
26© Cloudera, Inc. All rights reserved.
Dynamic Allocation
• Spark applications scale the number of executors based on load
• Removes need for: --num-executors
• Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
• Long ETL jobs with large shuffles
• shell applications: hive and spark shell
27© Cloudera, Inc. All rights reserved.
Dynamic Allocation Limitations
• Still required to specify cores
• --num-cores
• Memory
• --executor-memory
• Includes JVM overhead
• Need to do the math yourself
• Our customers still get it wrong!
28© Cloudera, Inc. All rights reserved.
The Future of Dynamic Allocation
• Only “task size” needed: --task-size
• Eliminates
• --num-cores
• --num-executors
• --executor-memory
• Leads to better cluster utilization
29© Cloudera, Inc. All rights reserved.
Security, now it’s
getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
30© Cloudera, Inc. All rights reserved.
Authentication
• Kerberos – the necessary evil
• Ubiquitous amongst other services
• YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens
31© Cloudera, Inc. All rights reserved.
Encryption
• Control plane
• File distribution
• Block Manager
• User UI / REST API
• Data-at-rest (shuffle files)
SPARK-6028 (Replace with netty)
Replace with netty
Spark 1.4
SPARK-2750 (SSL)
SPARK-5682
32© Cloudera, Inc. All rights reserved.
Authorization
• Enterprises have sensitive data
• Beyond HDFS file permissions
• Partial access to data
• Column level granularity
• Apache Sentry
• HDFS-Sentry synchronization plugin
• Record Service
• Column level security for Spark!
33© Cloudera, Inc. All rights reserved.
Thank you
We’re Hiring!

Weitere ähnliche Inhalte

Was ist angesagt?

How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...Cloudera, Inc.
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwordsSzehon Ho
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale DataCloudera, Inc.
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudCloudera, Inc.
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersDataWorks Summit
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certificationCloudera, Inc.
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 
Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7Cloudera, Inc.
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...Cloudera, Inc.
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudDataWorks Summit
 

Was ist angesagt? (20)

How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 

Andere mochten auch

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road AheadCloudera, Inc.
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...StampedeCon
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARNAdam Kawa
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudCloudera, Inc.
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation FrameworkMongoDB
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPHortonworks
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 

Andere mochten auch (19)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Are you Kudu-ing me?!
Are you Kudu-ing me?!Are you Kudu-ing me?!
Are you Kudu-ing me?!
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road Ahead
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Ähnlich wie Apache Spark Operations

Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionCloudera, Inc.
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseCloudera, Inc.
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017Cloudera Japan
 
Cloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera Japan
 
Enterprise machine learning on k8s lessons learned and the road ahead
Enterprise machine learning on k8s   lessons learned and the road aheadEnterprise machine learning on k8s   lessons learned and the road ahead
Enterprise machine learning on k8s lessons learned and the road aheadTimothy Chen
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Cloudera, Inc.
 
Kite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big DataKite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big Data_blue
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaGrant Henke
 
Container Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellContainer Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellOracle Developers
 
Querying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustlyQuerying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustlyAshish Singh
 
One Hadoop, Multiple Clouds
One Hadoop, Multiple CloudsOne Hadoop, Multiple Clouds
One Hadoop, Multiple CloudsCloudera, Inc.
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Stefan Lipp
 
Continuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and HelmContinuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and HelmDavid Currie
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC MeetupTimothy Spann
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2DataWorks Summit
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 

Ähnlich wie Apache Spark Operations (20)

Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to Production
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017
 
Cloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennight
 
Enterprise machine learning on k8s lessons learned and the road ahead
Enterprise machine learning on k8s   lessons learned and the road aheadEnterprise machine learning on k8s   lessons learned and the road ahead
Enterprise machine learning on k8s lessons learned and the road ahead
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Kite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big DataKite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big Data
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
Container Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellContainer Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey Boxell
 
Querying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustlyQuerying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustly
 
YARN
YARNYARN
YARN
 
One Hadoop, Multiple Clouds
One Hadoop, Multiple CloudsOne Hadoop, Multiple Clouds
One Hadoop, Multiple Clouds
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Continuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and HelmContinuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and Helm
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Kürzlich hochgeladen

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 

Kürzlich hochgeladen (20)

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 

Apache Spark Operations

  • 1. 1© Cloudera, Inc. All rights reserved. Spark Operations Kostas Sakellis
  • 2. 2© Cloudera, Inc. All rights reserved. Me • Software Engineer at Cloudera • Contributor to Apache Spark • Before that, contributed to Cloudera Manager
  • 3. 3© Cloudera, Inc. All rights reserved. Building a proof of concept! Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
  • 4. 4© Cloudera, Inc. All rights reserved. Example sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect()
  • 5. 5© Cloudera, Inc. All rights reserved. Example sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect()
  • 6. 6© Cloudera, Inc. All rights reserved. Example sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect()
  • 7. 7© Cloudera, Inc. All rights reserved. Partitions sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect() HDFS Partition 1 Partition 2 Partition 3 Partition 4
  • 8. 8© Cloudera, Inc. All rights reserved. RDDs sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect() …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4
  • 9. 9© Cloudera, Inc. All rights reserved. RDDs sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect() …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4
  • 10. 10© Cloudera, Inc. All rights reserved. RDDs sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect() …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4
  • 11. 11© Cloudera, Inc. All rights reserved. …RDD …RDD RDDs HDFS Partition 1 Partition 2 Partition 3 Partition 4 sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect() Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Collect
  • 12. 12© Cloudera, Inc. All rights reserved. …RDD …RDD RDD Lineage HDFS Partition 1 Partition 2 Partition 3 Partition 4 sc.textFile(“hdfs://data/u.item”, 4) .map(Movie(_)) .filter(_.month.equals(“Nov”)) .collect() Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Collect Lineage
  • 13. 13© Cloudera, Inc. All rights reserved. Task …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Collect • A pipelined set of transformation on a single thread
  • 14. 14© Cloudera, Inc. All rights reserved. Spark Architecture
  • 15. 15© Cloudera, Inc. All rights reserved. Spark System Architecture
  • 16. 16© Cloudera, Inc. All rights reserved. Deployments • Spark supports pluggable Cluster Managers • local, Standalone, YARN and Mesos • In early 2014, CDH 4.x with Spark 0.9 only supported Standalone • CDH 5.x includes Spark on YARN support
  • 17. 17© Cloudera, Inc. All rights reserved. Standalone Master Worker Client Worker Process App Master Process
  • 18. 18© Cloudera, Inc. All rights reserved. Standalone • On cluster ./sbin/start-master.sh ./sbin/start-slave.sh <master-spark-URL> • Submit job spark-submit --master <master-spark-URL> …
  • 19. 19© Cloudera, Inc. All rights reserved. Container YARN Architecture Resource Manager Node Manager Client Node Manager Container Process App Master Container Process
  • 20. 20© Cloudera, Inc. All rights reserved. Container Spark on YARN Architecture Resource Manager Node Manager Client Node Manager Container Process App Master Container Process
  • 21. 21© Cloudera, Inc. All rights reserved. Container Spark on YARN Architecture Resource Manager Node Manager Client Node Manager Container Process App Master Container Process
  • 22. 22© Cloudera, Inc. All rights reserved. Spark on YARN • Submit job spark-submit --master yarn-client … • Cluster mode spark-submit --master yarn-cluster … • Spark shell only works in client mode!
  • 23. 23© Cloudera, Inc. All rights reserved. Customers often have shared infrastructure Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
  • 24. 24© Cloudera, Inc. All rights reserved. Multi-tenancy • Cluster utilization is top metric • Target: 70-80% utilization • Mixed workloads from mixed customers • We recommend YARN • Built in resource manager
  • 25. 25© Cloudera, Inc. All rights reserved. Underutilized Clusters Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
  • 26. 26© Cloudera, Inc. All rights reserved. Dynamic Allocation • Spark applications scale the number of executors based on load • Removes need for: --num-executors • Idle executors get killed • First supported in CDH 5.4 • Ideal for: • Long ETL jobs with large shuffles • shell applications: hive and spark shell
  • 27. 27© Cloudera, Inc. All rights reserved. Dynamic Allocation Limitations • Still required to specify cores • --num-cores • Memory • --executor-memory • Includes JVM overhead • Need to do the math yourself • Our customers still get it wrong!
  • 28. 28© Cloudera, Inc. All rights reserved. The Future of Dynamic Allocation • Only “task size” needed: --task-size • Eliminates • --num-cores • --num-executors • --executor-memory • Leads to better cluster utilization
  • 29. 29© Cloudera, Inc. All rights reserved. Security, now it’s getting serious. Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
  • 30. 30© Cloudera, Inc. All rights reserved. Authentication • Kerberos – the necessary evil • Ubiquitous amongst other services • YARN, HDFS, Hive, HBase, etc. • Spark utilizes delegation tokens
  • 31. 31© Cloudera, Inc. All rights reserved. Encryption • Control plane • File distribution • Block Manager • User UI / REST API • Data-at-rest (shuffle files) SPARK-6028 (Replace with netty) Replace with netty Spark 1.4 SPARK-2750 (SSL) SPARK-5682
  • 32. 32© Cloudera, Inc. All rights reserved. Authorization • Enterprises have sensitive data • Beyond HDFS file permissions • Partial access to data • Column level granularity • Apache Sentry • HDFS-Sentry synchronization plugin • Record Service • Column level security for Spark!
  • 33. 33© Cloudera, Inc. All rights reserved. Thank you We’re Hiring!

Hinweis der Redaktion

  1. Lets talk about what we have seen as issues from our customers as issues as they try to get Spark into production.
  2. In scope - Focus on operational issues - Not on building the code itself Experience from our customer support tickets
  3. Spark makes building a proof of concept with a subset of data relatively easy. But then things go wrong Plug for my talk at Hadoop Summit
  4. Lets start with an example program in Spark.
  5. Lets start with an example program in Spark.
  6. The sum() call launches a job
  7. A chunk of data somewhere Could be on Hadoop File System (HDFS) Could be cached in Spark Defines the degree of parallelism
  8. Describes a way of generating input and output partitions Immutable – very important! RDDs can depend on other RDDs Most have single parent Joins have multiple parents Lineage over replication for fault tolerance https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  9. Describes a way of generating input and output partitions Immutable – very important! RDDs can depend on other RDDs Most have single parent Joins have multiple parents Lineage over replication for fault tolerance https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  10. Describes a way of generating input and output partitions Immutable – very important! RDDs can depend on other RDDs Most have single parent Joins have multiple parents Lineage over replication for fault tolerance https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  11. Describes a way of generating input and output partitions Immutable – very important! RDDs can depend on other RDDs Most have single parent Joins have multiple parents Lineage over replication for fault tolerance https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  12. Describes a way of generating input and output partitions Immutable – very important! RDDs can depend on other RDDs Most have single parent Joins have multiple parents Lineage over replication for fault tolerance https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  13. Lets review the general Spark architecture
  14. A driver Where the DAG scheduler lives Drives the show Single point of failure Executors Communicates with driver Runs the tasks created by the driver Think of this as a ThreadPoolExecutor in java Pluggable cluster managers YARN, Mesos, standalone
  15. In scope - Focus on operational issues - Not on building the code itself Experience from our customer support tickets
  16. Lets review the general Spark architecture
  17. In scope - Focus on operational issues - Not on building the code itself Experience from our customer support tickets
  18. Lets review the general Spark architecture
  19. Lets review the general Spark architecture
  20. Lets review the general Spark architecture
  21. In scope - Focus on operational issues - Not on building the code itself Experience from our customer support tickets
  22. Spark makes building a proof of concept with a subset of data relatively easy.
  23. Spark makes building a proof of concept with a subset of data relatively easy.
  24. Control plane File distribution Block Manager User UI / REST API Data-at-rest (shuffle files)