SPARK AND DEEP LEARNING FRAMEWORKS AT SCALE
Vartika Singh
2 © Cloudera, Inc. All rights reserved.
OBJECTIVE
• Enabling machine learning in the field
• Enablement and use case discovery
• Data and ML: what do we focus on?
• Typical data ingest architecture
• Extending Spark
• Deep Learning - how does it fit in?
• Hardware
DATA - MARKET PROPOSITION
Click stream - smart clicks, impressions, and conversions
Videos - fraud, navigation, ad placement
Medical data - tumor detection, patient mortality, anomaly identification
City data - planning, resource distribution
Wafer, oil and gas data - pipeline optimization, fault detection
?? ...
Ref: https://hbr.org/2017/05/whats-your-data-strategy
• Less than half of an organization’s structured data is actively used in making decisions
• Less than 1% of its unstructured data is analyzed or used at all
• More than 70% of employees have access to data they should not
• 80% of analysts’ time is spent simply discovering and preparing data
• Data breaches are common
• Rogue data sets propagate in silos
• Companies’ data technology often is not up to the demands put on it
Use case discovery
Model Serving
Hidden feedback loops
Undeclared consumer dependencies
Change in the external world
Ref: Hidden Technical Debt in Machine Learning ... - NIPS Proceedings
Is evolving Science
We are not very good at anticipating what the next emerging serious flaw will be.
What we’re missing is an engineering discipline with its principles of analysis and design.
Keep It Simple Stupid!
https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
Data
Processes
ML
● Deconstruct the problem.
● Democratize
● Paved Pathways
INTELLIGENT INFRASTRUCTURE!!!
CLOUDERA DATA SCIENCE WORKBENCH
OVERVIEW - PROJECTS
OVERVIEW - GPUS
OVERVIEW - WEBUIS
OVERVIEW - DISTRIBUTED COMPUTING WITH WORKERS
OTHER FEATURES
• Git
• S3/HDFS
EXPERIMENTS
• Create a snapshot of the model code, dependencies, and configuration necessary to train the model.
• Build and execute the training run in an isolated container.
• Track specified model metrics, performance, and model artifacts.
• Inspect, compare, or deploy prior models.
MODELS
• In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network - for example, each layer in the neural network may be assigned to a different machine.
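As a rough illustration of the layer-per-machine idea, here is a minimal sketch in plain Python. The `Worker` class, the two-layer network, and its weights are hypothetical stand-ins; in a real system each worker would be a separate process or host and the activations would travel over the network.

```python
# Toy model parallelism: each "machine" owns one layer of a two-layer network.
# The Worker class and all weights are hypothetical, for illustration only.

def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def relu(vector):
    return [max(0.0, v) for v in vector]

class Worker:
    """Stands in for one machine that stores and evaluates a single layer."""

    def __init__(self, weights, activation=None):
        self.weights = weights
        self.activation = activation

    def forward(self, inputs):
        out = matvec(self.weights, inputs)
        return self.activation(out) if self.activation else out

# Layer 1 lives on worker_a, layer 2 on worker_b.
worker_a = Worker([[1.0, -1.0], [0.5, 0.5]], activation=relu)
worker_b = Worker([[2.0, 1.0]])

def pipeline_forward(x):
    hidden = worker_a.forward(x)      # computed on machine A
    return worker_b.forward(hidden)   # activations are handed to machine B

result = pipeline_forward([3.0, 1.0])
print(result)
```

Only the activations cross the machine boundary here; neither worker ever sees the other's weights.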
• In data parallelism, different machines have a complete copy of the model; each machine simply gets a different portion of the data, and the results from each are combined, for example by averaging gradients or parameters.
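One common way to combine the per-machine results is to average the gradients before each update. The sketch below assumes that choice, a one-parameter model y = w * x, and equal-sized shards; none of these details come from the slides.

```python
# Toy data parallelism: every worker holds the same weight w for y = w * x,
# computes a gradient on its own shard, and the gradients are averaged.
# The model, data, and learning rate are hypothetical.

def gradient(w, shard):
    """Gradient of mean squared error for y = w * x over one shard of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    grads = [gradient(w, shard) for shard in shards]  # one gradient per worker
    avg_grad = sum(grads) / len(grads)                # the "combine" step
    return w - lr * avg_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]                            # equal-sized shards

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, lr=0.05)

print(w)  # approaches 2.0
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, so this behaves like single-machine training with a larger batch.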
SPARK AND JNI
• OpenCV
• Tesseract
• Common Implementations using JavaCPP
Ref: https://github.com/bytedeco/javacpp
SPARK/HPC WORKLOADS
Gene Sequencing / Assembling / Analysis
• Data parallelism and statistical methods lie at the core of all DNA sequencing workloads.
• Sequencing - Base calling
• Variant calling
• GATK - Can run on Spark
• Canu - Transform to PySpark workload using Python C extensions
• Analysis - HAIL
Ref: https://software.broadinstitute.org/gatk/
Ref: https://hail.is/
Ref: https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
HPC WORKLOADS
• Portions of the Hadoop ecosystem can open your grid to more users.
• PySpark allows a company that is using a legacy C++ grid to re-use its C++ library assets with little to no change. Python to C++ bindings result in minimal performance penalties.
• Cloudera Data Science Workbench (CDSW) allows data scientists to rapidly develop and visualize models with more involvement from the business.
• In infrastructures with direct-attached storage, Hadoop’s locality-based processing allows for fast, efficient movement of data between storage and compute.
• Deploying Hadoop on a portion or on all of your grid allows you to use the same tools on the grid that you would use on a cloud-based Hadoop cluster.
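The library re-use point can be sketched with Python's built-in `ctypes`. In this hypothetical example the system C math library stands in for a legacy native library that exposes a C ABI, and `process_partition` is an illustrative name. In a PySpark job, a function of this shape could be passed to `rdd.mapPartitions`; that step is assumed and not shown running here.

```python
import ctypes
import ctypes.util
import math

def load_c_sqrt():
    """Return sqrt from the C math library, or math.sqrt if it cannot be loaded.

    The C math library is only a stand-in for a legacy native library.
    """
    path = ctypes.util.find_library("m")
    if path is None:
        return math.sqrt
    try:
        libm = ctypes.CDLL(path)
    except OSError:
        return math.sqrt
    # Declare the C signature: double sqrt(double)
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double
    return libm.sqrt

def process_partition(rows):
    """Apply the native function to every value in one partition."""
    c_sqrt = load_c_sqrt()  # load the library once per partition, not per row
    for value in rows:
        yield c_sqrt(value)

print(list(process_partition([4.0, 9.0, 16.0])))
```

Loading the library inside the partition function means each executor process resolves it locally, which avoids trying to serialize a native handle from the driver.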
DEEP LEARNING IN BIG DATA
• A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we can observe.
• Deep learning solves this central problem via representation learning by introducing representations that are expressed in terms of other, simpler representations.
BIOINFORMATICS
• Protein Structure
• Gene Expression Regulation
• Protein Classification
• Anomaly Classification
• Segmentation
BIOINFORMATICS: THE NATURE OF DATA
• Complex and expensive data acquisition processes limit the size of bioinformatics datasets.
• Significantly unequal class distributions
• In clinical or disease-related cases, there is inevitably less data from treatment groups than from the normal (control) group.
• Visualization
• Multimodal Deep Learning
IOT
• A time series is a sequence of regular time-ordered observations
• Example: stock prices, weather readings, smartphone sensor data
• Challenges
• Large scale streaming data
• Heterogeneity
• Time and space correlation
• High-noise data
• Near-real-time (NRT) decisions on multimodal data
IOT DEVICES
• Network compression
• Convert to sparse network
• Not general enough
• Factors to consider
• Running time
• Energy consumption
• Architectural considerations
• Feed-forward layers (FFL) are much faster than convolutional layers in CNNs
• Activation functions (ReLU is more time-efficient than tanh, which is more time-efficient than sigmoid)
• CNNs use less storage than DNNs due to fewer stored parameters in convolutional layers
• Accelerators
• Tinymotes
• Fog Computing
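The storage point about convolutional layers can be checked with simple parameter arithmetic. The sketch below compares one 3x3 convolutional layer with a fully connected layer that maps the same input to the same number of output values; the 32x32 RGB input and 16 feature maps are hypothetical choices made for illustration.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a 2-D convolutional layer (filters are shared across positions)."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

# Hypothetical shapes: a 32x32 RGB image mapped to 16 feature maps of the same size.
image_h, image_w, channels, feature_maps = 32, 32, 3, 16

conv = conv2d_params(3, 3, channels, feature_maps)
dense = dense_params(image_h * image_w * channels, image_h * image_w * feature_maps)

print(conv)   # 448
print(dense)  # 50348032
```

The gap comes from weight sharing: the convolutional filter is stored once and reused at every spatial position.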
NLP
• Word Embeddings: GloVe, Word2Vec
• RNN -> LSTMs -> Attention Mechanism
• Applications
• Sentiment analysis
• Gene sequencing
• Natural language generation
DEEP LEARNING - THE HYPERPARAMETERS
• Architecture
• How many layers
• How many nodes/filters
• Which type
• Data
• Batch size
• Size of filters
• Number of steps the memory of cells will learn
• Training
• Regularization
• Learning rate
• Gradient expressions
• Init policy
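The hyperparameters listed above can be collected into a search space and explored, for instance with random search. The following is a minimal sketch under stated assumptions: the names, value ranges, and the `toy_evaluate` scoring function are all hypothetical placeholders, and a real run would train a network and return its validation loss.

```python
import random

# Hypothetical search space; names and ranges are illustrative, not recommendations.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8],
    "filters_per_layer": [16, 32, 64],
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "l2_regularization": [0.0, 1e-5, 1e-4],
    "init_policy": ["glorot_uniform", "he_normal"],
}

def sample_config(rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def random_search(evaluate, n_trials, seed=0):
    """Return the sampled configuration with the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_evaluate(config):
    """Placeholder for a validation loss; a real version would train a model."""
    return abs(config["learning_rate"] - 1e-3) + 0.001 * config["num_layers"]

best_config, best_score = random_search(toy_evaluate, n_trials=20)
print(best_config, best_score)
```

Because each trial is independent, the loop body is the natural unit to distribute across workers.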
TRANSFER LEARNING
• Deep neural networks trained on natural images exhibit a curious phenomenon in common:
• In the first layer they learn features similar to Gabor filters and color blobs.
• Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks.
• Initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
• The effectiveness of feature transfer is expected to decline as the base and target tasks become less similar.
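The simplest form of feature transfer keeps the transferred layers frozen and trains only a new output layer on the target task. The sketch below assumes that frozen-feature variant (no fine-tuning of the transferred weights); the "pretrained" weights, the toy target task, and the training settings are all hypothetical.

```python
import math

# Hypothetical "pretrained" first-layer weights standing in for transferred features.
FROZEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]

def frozen_features(x):
    """Transferred layer: ReLU(W x). These weights are never updated."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(head, bias, x):
    feats = frozen_features(x)
    return sigmoid(sum(h * f for h, f in zip(head, feats)) + bias)

def train_head(samples, epochs=200, lr=0.5):
    """Fit only the new output layer (the head) on the target task with plain SGD."""
    head, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in samples:
            feats = frozen_features(x)
            error = predict(head, bias, x) - label  # gradient of log loss w.r.t. the logit
            head = [h - lr * error * f for h, f in zip(head, feats)]
            bias -= lr * error
    return head, bias

# Toy target task: label is 1 when the first input exceeds the second.
target_task = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([3.0, 1.0], 1), ([1.0, 3.0], 0)]
head, bias = train_head(target_task)

print(predict(head, bias, [5.0, 1.0]), predict(head, bias, [1.0, 5.0]))
```

Fine-tuning would additionally update `FROZEN_WEIGHTS`, usually with a smaller learning rate; that step is left out here to keep the sketch short.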
SPARK DEEP LEARNING PIPELINES
• Transfer learning
• Distributed hyperparameter tuning
• Deploying models in SQL
DISTRIBUTED TRAINING - WHEN TO DO IT
• Distributed training isn’t free
• Setup time
• Continue to train your networks on a single machine until the training time becomes prohibitive
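A back-of-the-envelope estimate can make the trade-off concrete. The cost model below is an assumption made purely for illustration (a fixed setup time plus compute time divided by the number of workers, scaled by a parallel-efficiency factor); it is not taken from the slides, and the numbers are hypothetical.

```python
def distributed_hours(single_machine_hours, workers, setup_hours, efficiency):
    """Toy estimate: fixed setup cost plus compute time split across workers.

    efficiency (0..1] is an assumed factor for communication and synchronization overhead.
    """
    return setup_hours + single_machine_hours / (workers * efficiency)

def worth_distributing(single_machine_hours, workers, setup_hours, efficiency):
    estimate = distributed_hours(single_machine_hours, workers, setup_hours, efficiency)
    return estimate < single_machine_hours

# A one-hour job is not worth two hours of setup; a hundred-hour job is.
print(worth_distributing(1.0, workers=4, setup_hours=2.0, efficiency=0.8))
print(worth_distributing(100.0, workers=4, setup_hours=2.0, efficiency=0.8))
```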
OPERATIONAL IMPLICATIONS
• Model exploration using small data
• Computational limits
• Irreducible errors
• Predictable
DNN
• Neurons and synapses
• Compute the weighted sum for each layer
• Compute the gradient of the loss relative to the filter inputs
• Compute the gradient of the loss relative to the weights
Ref: M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep Learning for IoT Big Data and Streaming Analytics: A Survey,” arXiv preprint arXiv:1712.04301v1 [cs.NI], 2017.
DEEP LEARNING AT SCALE
• Backpropagation requires the intermediate outputs of the network to be preserved for the backward computation, so training has higher storage requirements than inference.
• Because gradients are used for hill-climbing, the precision requirement for training is generally higher than for inference.
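The storage point can be illustrated with a rough count of activation memory. The sketch assumes a plain multilayer network with hypothetical layer widths, 4-byte values, and an inference pass that only needs two adjacent layer outputs alive at a time; real frameworks differ in the details.

```python
def training_activation_bytes(layer_widths, batch_size, bytes_per_value=4):
    """Backpropagation keeps every layer's output for the whole batch."""
    return sum(layer_widths) * batch_size * bytes_per_value

def inference_peak_bytes(layer_widths, batch_size, bytes_per_value=4):
    """Inference can drop a layer's output once the next layer has consumed it,
    so the peak is the largest pair of adjacent layer outputs (an assumption)."""
    pairs = zip(layer_widths, layer_widths[1:])
    return max(a + b for a, b in pairs) * batch_size * bytes_per_value

# Hypothetical network: values per example at the input and after each of four layers.
widths = [1024, 4096, 4096, 4096, 10]

training = training_activation_bytes(widths, batch_size=64)
inference = inference_peak_bytes(widths, batch_size=64)

print(training)   # 3410432
print(inference)  # 2097152
```

The training figure grows with depth, while the inference peak in this model depends only on the widest pair of adjacent layers.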
DEEP LEARNING AT SCALE
• A significant amount of effort has been put into developing deep learning systems that can scale to very large models and large training sets
• Large models in the literature are now top performers in supervised visual recognition tasks
• Can even learn to detect objects when trained from unlabeled images alone
• The very largest of these systems are able to train neural networks with over 1 billion trainable parameters
HARDWARE FOR DNN
• Intel Knights Landing CPU features special vector instructions for deep learning
• Nvidia Pascal GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single-precision core for faster deep learning computation
• Systems have also been built specifically for DNN processing, such as Nvidia DGX-1 and Facebook’s Big Basin custom DNN server
• DNN inference has also been demonstrated on various embedded System-on-Chips (SoCs) such as Nvidia Tegra and Samsung Exynos, as well as on FPGAs
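To get a feel for what FP16 arithmetic trades away, the sketch below rounds values to IEEE 754 half precision using the `e` format of Python's `struct` module. The weight and update values are hypothetical, chosen so that the update is smaller than the spacing between representable FP16 values near 1.0.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

weight = 1.0
update = 1e-4  # hypothetical small gradient step

# The sum is computed in double precision, then rounded back to FP16 storage.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
fp64_result = weight + update

print(fp16_result)  # 1.0, the update is lost
print(fp64_result)  # 1.0001
```

This is one reason FP16 speedups are easier to exploit for inference than for accumulating many small weight updates during training.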
GPU SUPPORT IN YARN
• As of now, only Nvidia GPUs are supported by YARN
• YARN node managers have to be pre-installed with Nvidia drivers.
• When Docker is used as the container runtime, nvidia-docker 1.0 needs to be installed (the nvidia-docker version currently supported by YARN).
• https://issues.apache.org/jira/browse/YARN-3926
• https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, Efficient Processing of Deep Neural Networks: A Tutorial and Survey
ACCELERATORS FOR TEMPORAL ARCHITECTURES
• The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, which can lead to either inefficiency in storage or a complex memory access pattern
• There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) that optimize for matrix multiplications
• The matrix multiplications on these platforms can be further sped up by applying computational transforms to the data to reduce the number of multiplications
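The redundancy mentioned in the first bullet is easy to see in a small example. The sketch below unrolls every patch of a 4x4 input into a row of a matrix (often called im2col) so that the convolution becomes a matrix-vector product; the input, the 2x2 kernel, and the function names are hypothetical, and the operation is the unflipped sliding-window form used in DNN layers.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (list of rows) into one matrix row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows

def conv_as_matmul(image, kernel):
    """Compute the CONV layer output as (patch matrix) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    return [sum(p * q for p, q in zip(patch, flat_kernel)) for patch in im2col(image, k)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]

patches = im2col(image, 2)
outputs = conv_as_matmul(image, kernel)

print(len(patches) * len(patches[0]))  # 36 stored values for a 16-value input
print(outputs)                         # [7, 9, 11, 15, 17, 19, 23, 25, 27]
```

The 16 input values are expanded to 36 because overlapping patches repeat the same pixels, which is the storage inefficiency the bullet describes.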
ACCELERATORS FOR SPATIAL ARCHITECTURES
• For DNNs, the bottleneck for processing is in the memory access
• Accelerators, such as spatial architectures, provide an opportunity to reduce the energy cost of data movement by introducing several levels of local memory hierarchy with different energy cost
• The multiple levels of memory hierarchy help to improve energy efficiency by providing low-cost data accesses
1) How do you collect your data?
2) Where do your data scientists play?
3) Let’s talk to the business
THANK YOU

Delivering improved patient outcomes through advanced analytics 6.26.18
 

Kürzlich hochgeladen

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Spark and Deep Learning Frameworks at Scale 7.19.18

  • 1. SPARK AND DEEP LEARNING FRAMEWORKS AT SCALE Vartika Singh
  • 3. OBJECTIVE
      • Enabling Machine Learning in the field
      • Enablement and use case discovery
      • Data and ML: what do we focus on?
      • Typical data ingest architecture
      • Extending Spark
      • Deep Learning - how does it fit in?
      • Hardware
  • 5. DATA - MARKET PROPOSITION
      • Click Stream: Smart clicks, impressions and conversions
      • Videos: Fraud, navigation, ad placement
      • Medical Data: Tumor detection, patient mortality, anomaly identification
      • City data: Planning, resource distribution
      • Wafer, Oil and gas data: Pipeline optimization, fault detection
      • ??: ...
  • 7. Ref: https://hbr.org/2017/05/whats-your-data-strategy
      • Less than half of an organization’s structured data is actively used in making decisions
      • Less than 1% of its unstructured data is analyzed or used at all
      • More than 70% of employees have access to data they should not
      • 80% of analysts’ time is spent simply discovering and preparing data
      • Data breaches are common
      • Rogue data sets propagate in silos
      • Companies’ data technology often is not up to the demands put on it
  • 9. Use case discovery
      • Model Serving
      • Hidden feedback loops
      • Undeclared consumer dependencies
      • Change in the external world
      Ref: Hidden Technical Debt in Machine Learning ... - NIPS Proceedings
  • 11. Is evolving Science
      • We are not very good at anticipating what the next emerging serious flaw will be.
      • What we’re missing is an engineering discipline with its principles of analysis and design.
      • Keep It Simple Stupid!
      Ref: https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
  • 13. Data, Processes, ML
      • Deconstruct the problem.
      • Democratize
      • Paved Pathways
  • 14. INTELLIGENT INFRASTRUCTURE!!!
  • 15. CLOUDERA DATA SCIENCE WORKBENCH
  • 16. OVERVIEW - PROJECTS
  • 17. OVERVIEW - GPUS
  • 18. OVERVIEW - WEBUIS
  • 19. OVERVIEW - DISTRIBUTED COMPUTING WITH WORKERS
  • 20. OTHER FEATURES
      • Git
      • S3/HDFS
  • 21. EXPERIMENTS
      • Create a snapshot of the model code, dependencies, and configuration necessary to train the model.
      • Build and execute the training run in an isolated container.
      • Track specified model metrics, performance, and model artifacts.
      • Inspect, compare, or deploy prior models.
  • 22. MODELS
  • 23. In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network - for example, each layer in the neural network may be assigned to a different machine.
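
The layer-per-machine idea can be sketched in plain Python. This is a toy illustration under stated assumptions, not how any particular framework does it: `Worker`, `run_pipeline`, and the weights are hypothetical names and values, and the hand-off between workers stands in for what would be a network transfer on a real cluster.

```python
# Toy sketch of model parallelism: each "machine" owns exactly one layer.
# Worker, run_pipeline, and all weights below are hypothetical.

class Worker:
    """Stands in for one machine holding the weights of a single dense layer."""

    def __init__(self, weights, bias):
        self.weights = weights  # one row of weights per output unit
        self.bias = bias

    def forward(self, inputs):
        # Weighted sum per output unit, followed by a ReLU activation.
        outputs = []
        for row, b in zip(self.weights, self.bias):
            z = sum(w * x for w, x in zip(row, inputs)) + b
            outputs.append(max(0.0, z))
        return outputs


def run_pipeline(workers, inputs):
    # Activations are passed from one machine to the next in layer order.
    activations = inputs
    for worker in workers:
        activations = worker.forward(activations)
    return activations


layer1 = Worker(weights=[[1.0, -1.0], [0.5, 0.5]], bias=[0.0, 0.0])
layer2 = Worker(weights=[[1.0, 1.0]], bias=[0.5])
result = run_pipeline([layer1, layer2], [2.0, 1.0])
print(result)
```

Note that each worker only ever sees its own weights; the cost of this layout is that every forward pass has to move activations between machines.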
  • 24. In data parallelism, different machines have a complete copy of the model; each machine simply gets a different portion of the data, and results from each are somehow combined.
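
One common way to "combine" the results is to average the gradients computed on each shard before updating the shared weights. The sketch below assumes that choice (synchronous gradient averaging), a one-parameter model y = w * x, and equally sized shards; the function names and data are hypothetical.

```python
# Toy sketch of data parallelism with synchronous gradient averaging.
# All names and values are hypothetical; the model is y = w * x.

def gradient(w, shard):
    # Gradient of the mean squared error on one shard of (x, y) pairs.
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, learning_rate):
    # Every worker holds the same w and computes a gradient on its own shard.
    local_grads = [gradient(w, shard) for shard in shards]
    # Combine step: average the local gradients. With equally sized shards
    # this equals the gradient over the full data set.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - learning_rate * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]  # one shard per worker

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, learning_rate=0.05)
print(w)
```

Other combination schemes exist (for example, averaging parameters instead of gradients, or asynchronous updates); this sketch only shows the simplest one.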
  • 26. SPARK AND JNI
      • OpenCV
      • Tesseract
      • Common Implementations using JavaCPP
      Ref: https://github.com/bytedeco/javacpp
  • 27. SPARK/HPC WORKLOADS: Gene Sequencing/ Assembling/ Analysis
      • Data parallelism and statistical methods lie at the core of all DNA sequencing workloads.
      • Sequencing - Base calling
      • Variant calling
      • GATK - Can run on Spark
      • Canu - Transform to PySpark workload using Python C extensions
      • Analysis - HAIL
      Ref: https://software.broadinstitute.org/gatk/
      Ref: https://hail.is/
      Ref: https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
  • 28. HPC WORKLOADS
      • Portions of the Hadoop ecosystem can open your grid to more users.
      • PySpark allows a company that is using a legacy C++ grid to reuse its C++ library assets with little to no change. Python to C++ bindings result in minimal performance penalties.
      • Cloudera Data Science Workbench (CDSW) allows Data Scientists to rapidly develop and visualize models with more involvement from the business.
      • In infrastructures with direct-attached storage, Hadoop’s locality-based processing allows for fast, efficient movement of data between storage and compute.
      • Deploying Hadoop on a portion or on all of your grid allows you to use the same tools on the grid that you would use on a cloud-based Hadoop cluster.
  • 29. DEEP LEARNING IN BIG DATA
      • A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we can observe.
      • Deep learning solves this central problem via representation learning by introducing representations that are expressed in terms of other, simpler representations.
  • 30. BIOINFORMATICS
      • Protein Structure
      • Gene Expression Regulation
      • Protein Classification
      • Anomaly Classification
      • Segmentation
  • 31. BIOINFORMATICS: THE NATURE OF DATA
      • Complex and expensive data acquisition processes limit the size of bioinformatics datasets.
      • Significantly unequal class distributions
      • In clinical or disease-related cases, there is inevitably less data from treatment groups than from the normal (control) group.
      • Visualization
      • Multimodal Deep Learning
  • 32. IOT
      • A time series is a sequence of regular time-ordered observations
      • Example: stock prices, weather readings, smartphone sensor data
      • Challenges
          • Large scale streaming data
          • Heterogeneity
          • Time and space correlation
          • High noise data
          • NRT decision on multimodal data
  • 33. IOT DEVICES
      • Network compression
          • Convert to sparse network
          • Not general enough
      • Factors to consider
          • Running time
          • Energy consumption
      • Architectural considerations
          • FFL are much faster than convolution layers in CNN
          • Activation functions (ReLU is more time-efficient than tanh, which is more time-efficient than sigmoid)
          • CNNs use less storage than DNNs due to fewer stored parameters in convolutional layers
      • Accelerators
      • Tinymotes
      • Fog Computing
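
The slide does not say how a network is converted to a sparse one. As one illustrative assumption, the sketch below uses magnitude pruning: the smallest-magnitude weights are zeroed and only the surviving entries are stored. `prune_by_magnitude`, `to_sparse`, and the weight values are hypothetical.

```python
# Toy sketch of network compression by magnitude pruning (an assumed
# technique; the slide only says "convert to sparse network").

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]
    # Weights tied with the threshold are pruned as well.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def to_sparse(weights):
    # Coordinate format: keep only non-zero entries as (row, column, value).
    return [(i, j, w)
            for i, row in enumerate(weights)
            for j, w in enumerate(row)
            if w != 0.0]


dense = [[0.9, -0.05, 0.3],
         [0.01, -0.7, 0.02]]
pruned = prune_by_magnitude(dense, sparsity=0.5)
sparse = to_sparse(pruned)
print(sparse)
```

On a device with limited storage, only the `sparse` triples would need to be kept; whether accuracy survives the pruning depends on the network, which is consistent with the slide's caveat that the approach is not general enough.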
  • 34. NLP
      • Word Embeddings: GloVe, Word2Vec
      • RNN -> LSTMs -> Attention Mechanism
      • Applications
          • Sentiment analysis
          • Gene sequencing
          • Natural language generation
  • 35. DEEP LEARNING - THE HYPERPARAMETERS
      • Architecture
          • How many layers
          • How many nodes/filters
          • Which type
      • Data
          • Batch size
          • Size of filters
          • Number of steps the memory of cells will learn
      • Training
          • Regularization
          • Learning rate
          • Gradient expressions
          • Init policy
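
To give a feel for how quickly these choices multiply, here is a small sketch that enumerates a grid over a few of the hyperparameters listed above. The parameter names and candidate values are hypothetical, and grid search is only one of several possible search strategies.

```python
# Toy sketch: enumerate a hyperparameter grid with the standard library.
# The search space below is hypothetical.
from itertools import product

search_space = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128],
    "batch_size": [32, 128],
    "learning_rate": [1e-3, 1e-2],
    "l2_regularization": [0.0, 1e-4],
}


def grid(space):
    # Yield one configuration dict per combination of candidate values.
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(search_space))
print(len(configs), "configurations; first:", configs[0])
```

Even two candidate values for each of five hyperparameters already means 32 training runs, which is why distributing the search across a cluster is attractive.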
  • 36. TRANSFER LEARNING
  • 37. TRANSFER LEARNING
      • Deep neural networks trained on natural images exhibit a curious phenomenon in common:
          • In the first layer they learn features similar to Gabor filters and color blobs.
          • Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks.
      • Initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
      • The effectiveness of feature transfer is expected to decline as the base and target tasks become less similar.
  • 38. SPARK DEEP LEARNING PIPELINES
      • Transfer learning
      • Distributed hyperparameter tuning
      • Deploying models in SQL
  • 39. DISTRIBUTED TRAINING - WHEN TO DO IT
      • Distributed training isn’t free
      • Setup time
      • Continue to train your networks on a single machine, until the training time becomes prohibitive
  • 40. OPERATIONAL IMPLICATIONS
      • Model exploration using small data
      • Computational limits
      • Irreducible errors
      • Predictable
  • 41. DNN
      • Neurons and Synapses
      • Compute the weighted sum for each layer
      • Compute the gradient of the loss relative to the filter inputs
      • Compute the gradient of the loss relative to the weights
      Ref: M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep Learning for IoT Big Data and Streaming Analytics: A Survey,” arXiv preprint arXiv:1712.04301v1 [cs.NI], 2017.
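
The three computations on this slide can be written out for a single dense layer. This is a minimal sketch without bias or activation; the function names, the weights, and the choice of loss (0.5 times the sum of squared outputs) are assumptions made for illustration.

```python
# Toy sketch of one dense layer: forward weighted sum, then the gradients
# of the loss with respect to the weights and to the layer inputs.

def forward(weights, inputs):
    # Weighted sum per output unit: z[j] = sum_i W[j][i] * x[i]
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def backward(weights, inputs, grad_out):
    # grad_out[j] is dLoss/dz[j], handed back from the layer above.
    # dLoss/dW[j][i] = grad_out[j] * x[i]
    grad_weights = [[g * x for x in inputs] for g in grad_out]
    # dLoss/dx[i] = sum_j W[j][i] * grad_out[j]
    grad_inputs = [
        sum(weights[j][i] * grad_out[j] for j in range(len(weights)))
        for i in range(len(inputs))
    ]
    return grad_weights, grad_inputs


W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, -1.0]

z = forward(W, x)
# Assumed loss: 0.5 * sum(z[j] ** 2), so dLoss/dz[j] = z[j].
grad_z = z
grad_W, grad_x = backward(W, x, grad_z)
print(z, grad_W, grad_x)
```

The gradient with respect to the inputs is what gets passed to the previous layer, while the gradient with respect to the weights drives the update of this layer.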
  • 42. DEEP LEARNING AT SCALE
      • Backpropagation requires intermediate outputs of the network to be preserved for the backward computation, so training has higher storage requirements than inference.
      • Because gradients are used for hill-climbing, the precision requirement for training is generally higher than for inference.
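
The precision point can be made concrete with the standard library, which can round a value to IEEE 754 half precision through the `struct` format character `e`. The weight and update values below are hypothetical; the sketch only shows that a small gradient step can vanish entirely when the arithmetic is rounded to 16 bits.

```python
# Toy sketch: a small weight update that is lost in half precision.
import struct


def to_fp16(value):
    # Round a Python float to IEEE 754 half precision and back.
    return struct.unpack("<e", struct.pack("<e", value))[0]


weight = 1.0
update = 1e-4  # a small gradient step (hypothetical value)

# Half precision has 10 fraction bits, so values near 1.0 are spaced
# about 0.001 apart and the update falls below half that spacing.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
float64_result = weight + update  # Python floats are 64-bit

print(fp16_result, float64_result)
```

Inference only needs the final weights to be accurate enough to rank outputs, whereas training accumulates many such small steps, which is one reason training is more sensitive to reduced precision.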
  • 43. DEEP LEARNING AT SCALE
      • A significant amount of effort has been put into developing deep learning systems that can scale to very large models and large training sets
      • Large models in the literature are now top performers in supervised visual recognition tasks
      • Can even learn to detect objects when trained from unlabeled images alone
      • The very largest of these systems are able to train neural networks with over 1 billion trainable parameters
  • 44. HARDWARE FOR DNN
      • Intel Knights Landing CPU features special vector instructions for deep learning
      • Nvidia PASCAL GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single precision core for faster deep learning computation
      • Systems have also been built specifically for DNN processing such as Nvidia DGX-1 and Facebook’s Big Basin custom DNN server
      • DNN inference has also been demonstrated on various embedded System-on-Chips (SoC) such as Nvidia Tegra and Samsung Exynos as well as FPGAs
  • 45. GPU SUPPORT IN YARN
      • As of now, only Nvidia GPUs are supported by YARN
      • YARN node managers have to be pre-installed with Nvidia drivers.
      • When Docker is used as the container runtime, nvidia-docker 1.0 needs to be installed (the nvidia-docker version currently supported by YARN).
      • https://issues.apache.org/jira/browse/YARN-3926
      • https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
  • 46. Ref: Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, Efficient Processing of Deep Neural Networks: A Tutorial and Survey
  • 48. ACCELERATORS FOR TEMPORAL ARCHITECTURES
      • The downside for using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, which can lead to either inefficiency in storage, or a complex memory access pattern
      • There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) that optimize for matrix multiplications
      • The matrix multiplications on these platforms can be further sped up by applying computational transforms to the data to reduce the number of multiplications
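
The redundancy mentioned in the first bullet shows up as soon as a convolution is unrolled into a matrix product. The sketch below assumes a single-channel image, stride 1, and no padding; `im2col` and `conv_as_matmul` are hypothetical helper names, and the operation is the cross-correlation that DNN frameworks usually call convolution.

```python
# Toy sketch: a CONV layer expressed as a matrix-vector product, showing
# that the unrolled input matrix duplicates pixels.

def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows


def conv_as_matmul(image, kernel):
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, k)
    # One dot product per output position: patches (matrix) times kernel (vector).
    return [sum(p * q for p, q in zip(patch, flat_kernel)) for patch in patches]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

patches = im2col(image, 2)
output = conv_as_matmul(image, kernel)
unrolled_size = sum(len(p) for p in patches)
original_size = len(image) * len(image[0])
print(output, unrolled_size, original_size)
```

Here 9 input values become 16 entries in the unrolled matrix because overlapping patches repeat pixels; for realistic feature maps and filter sizes the blow-up is larger, which is the storage cost the slide refers to.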
  • 49. ACCELERATORS FOR SPATIAL ARCHITECTURES
      • For DNNs, the bottleneck for processing is in the memory access
      • Accelerators, such as spatial architectures, provide an opportunity to reduce the energy cost of data movement by introducing several levels of local memory hierarchy with different energy cost
      • The multiple levels of memory hierarchy help to improve energy efficiency by providing low-cost data accesses
  • 50. Closing questions
      1) How do you collect your data?
      2) Where do your data scientists play?
      3) Let’s talk to the business