MACHINE LEARNING 
PIPELINES 
Evan R. Sparks 
Graduate Student, AMPLab 
With: Shivaram Venkataraman, Tomer Kaftan, Gylfi Gudmundsson, 
Michael Franklin, Benjamin Recht, and others!
WHAT IS MACHINE 
LEARNING?
Model 
“Machine learning is a scientific discipline that deals 
with the construction and study of algorithms that can 
learn from data. Such algorithms operate by building 
a model based on inputs and using that to make 
predictions or decisions, rather than following only 
explicitly programmed instructions.” 
–Wikipedia 
Data
ML PROBLEMS 
• Real data often not ∈ R^d 
• Real data not well-behaved 
according to my algorithm. 
• Features need to be 
engineered. 
• Transformations need to be 
applied. 
• Hyperparameters need to be 
tuned. 
SVM Input: 
Real Data:
SYSTEMS PROBLEMS 
• Datasets are huge. 
• Distributed computing is 
hard. 
• Mapping common ML 
techniques to a distributed 
setting may be untenable.
WHAT IS MLBASE? 
• Distributed Machine 
Learning - Made Easy! 
• Spark-based platform to 
simplify the development 
and usage of large scale 
machine learning.
A STANDARD MACHINE LEARNING PIPELINE 
[Diagram: Data → Train Classifier → Model] 
Right?
A STANDARD MACHINE LEARNING PIPELINE 
[Diagram labels: Data, Feature Extraction, Train Linear Classifier, Model, Test Data, Predictions] 
That’s more like it!
A REAL PIPELINE FOR 
IMAGE CLASSIFICATION 
Inspired by Coates & Ng, 2012 
[Pipeline diagram. Training-side node labels: Data, Image Parser, Normalizer, Convolver, Symmetric Rectifier, Pooler, Patch Extractor, Patch Whitener, Patch Selector, Feature Extractor, Label Extractor, Linear Solver, Model. Test-side node labels: Test Data, Feature Extractor, Label Extractor, Model, Error Computer, Test Error.]
A SIMPLE EXAMPLE 
• Load up some images. 
• Featurize. 
• Apply a transformation. 
• Fit a linear model. 
• Evaluate on test data. 
Replicates Fast Food Features Pipeline - Le et al., 2012
PIPELINES API 
• A pipeline is made of nodes 
which have an expected 
input and output type. 
• Nodes fit together in a 
sensible way. 
• Pipelines are just nodes. 
• Nodes should be things that 
we know how to scale.
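The node idea above can be sketched in a few lines of Python. This is only an illustration of typed nodes that compose into larger nodes, not the actual pipelines API; the `Node` class, its `then` method, and the toy `parse` and `scale` nodes are all hypothetical names.

```python
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


class Node(Generic[A, B]):
    """A pipeline stage that maps an input of type A to an output of type B."""

    def __init__(self, fn: Callable[[A], B]) -> None:
        self.fn = fn

    def __call__(self, x: A) -> B:
        return self.fn(x)

    def then(self, nxt: "Node[B, C]") -> "Node[A, C]":
        # Chaining two nodes yields another node, so a pipeline is just a node.
        return Node(lambda x: nxt(self(x)))


# Hypothetical toy nodes: parse a string of pixel values, then rescale them.
parse = Node(lambda s: [int(t) for t in s.split()])
scale = Node(lambda px: [p / 255.0 for p in px])

pipeline = parse.then(scale)
result = pipeline("0 51 255")
```

Because `then` returns a `Node`, the combined pipeline can itself be chained into a longer one, which is the sense in which "pipelines are just nodes".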
WHAT’S IN THE TOOLBOX? 
Nodes 
Images - Patches, Gabor Filters, HoG, Contrast 
Normalization 
Text - n-grams, lemmatization, TF-IDF, POS, NER 
General Purpose - ZCA Whitening, FFT, Scaling, 
Random Signs, Linear Rectifier, Windowing, Pooling, 
Sampling, QR Decomposition 
Statistics - Borda Voting, Linear Mapping, Matrix 
Multiply 
ML - Linear Solvers, TSQR, Cholesky Solver, MLlib 
Speech and more - coming soon! 
Pipelines 
Example pipelines across domains CIFAR, MNIST, 
ImageNet, ACL Argument Extraction, TIMIT. 
Stay Tuned! 
Hyper Parameter Tuning Libraries 
GraphX MLlib ml-matrix Featurizers Stats 
Spark 
Utils 
Pipelines 
MLI
A REAL PIPELINE FOR 
IMAGE CLASSIFICATION 
Inspired by Coates & Ng, 2012 
[Diagram: the same image classification pipeline as on the earlier slide] 
YOU’RE GOING TO BUILD THIS!!
BEAR WITH 
ME 
Photo: Andy Rouse, (c) Smithsonian Institute
COMPUTER VISION CRASH 
COURSE
SVM Model
FEATURE EXTRACTION 
[Diagram: training pipeline. Node labels: Data, Image Parser, Normalizer, Convolver, Symmetric Rectifier, Pooler, Patch Extractor, Patch Whitener, Patch Selector, Feature Extractor, Label Extractor, Linear Solver, Model.]
NORMALIZATION 
• Moves pixels from [0, 255] to 
[-1.0,1.0]. 
• Why? Math! 
• -1 * -1 = 1, 1 * 1 = 1 
• If I overlay two pixels on each 
other and they’re similar values, 
their product will be close to 1 
- otherwise, it will be close to 0 
or -1. 
• Necessary for whitening. 
[Figure: pixel range 0 to 255 mapped to -1 to +1]
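A minimal sketch of the normalization step, assuming 8-bit pixels and a plain linear rescaling (the function name `normalize` is made up for illustration). It also shows the product intuition from the slide: two similar extreme pixels multiply to 1, two opposite ones to -1.

```python
def normalize(pixels):
    """Map 8-bit pixel values in [0, 255] linearly onto [-1.0, 1.0]."""
    return [p / 127.5 - 1.0 for p in pixels]


dark, bright = normalize([0, 255])

# Similar pixels give a product near 1; opposite pixels give a product near -1.
same_product = dark * dark
opposite_product = dark * bright
```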
PATCH EXTRACTION 
• Image patches become our 
“visual vocabulary” 
• Intuition from text classification. 
• If I’m trying to classify a 
document as “sports” - I’d look 
for words like “football”, 
“batter”, etc. 
• For images - classifying pictures as 
“face” - I’m looking for things that 
look like eyes, ears, noses, etc. 
Visual Vocabulary
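As a rough sketch of what a patch extractor does, the snippet below slides a square window over a small image stored as a list of rows. The function name and the `stride` parameter are assumptions for illustration, not the toolbox's actual interface.

```python
def extract_patches(image, size, stride=1):
    """Return every size x size patch of a 2D image given as a list of rows."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            # Slice `size` rows, then `size` columns out of each of those rows.
            patches.append([row[c:c + size] for row in image[r:r + size]])
    return patches


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
patches = extract_patches(image, 2)
```

A 3x3 image yields four overlapping 2x2 patches; a selection of such patches from real images is what plays the role of the "visual vocabulary".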
CONVOLUTION 
• A convolution filter applies a weighted 
average to sliding patches of data. 
• Can be used for lots of things - finding 
edges, blurring, etc. 
• Normalized Input: 
• Image, Ear Filter 
• Output: 
• New image - close to 1 for areas 
that look like the ear filter. 
• Apply many of these simultaneously.
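The weighted average over sliding patches can be sketched as below. This is an illustrative version only: it does not flip the filter (strictly speaking a cross-correlation, which is the common convention in vision code), it only covers positions where the filter fits entirely inside the image, and the names are hypothetical.

```python
def convolve(image, filt):
    """Slide `filt` over `image` and return the mean elementwise product at each position."""
    fh, fw = len(filt), len(filt[0])
    out = []
    for r in range(len(image) - fh + 1):
        out_row = []
        for c in range(len(image[0]) - fw + 1):
            total = sum(image[r + i][c + j] * filt[i][j]
                        for i in range(fh) for j in range(fw))
            # Dividing by the filter area keeps responses in [-1, 1] for normalized input.
            out_row.append(total / (fh * fw))
        out.append(out_row)
    return out


# Normalized toy image and a filter that matches its left-hand 2x2 block exactly.
image = [[1, -1, 1],
         [-1, 1, -1]]
filt = [[1, -1],
        [-1, 1]]
response = convolve(image, filt)
```

The response is 1 where the image looks exactly like the filter and -1 where it looks like its inverse, which matches the "close to 1 for areas that look like the filter" description.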
LINEAR RECTIFICATION 
• For each feature, x, given 
some a (=0.25): 
• x_new = max(x - a, 0) 
• What does it do? 
• Removes a bunch of 
noise.
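The rectification formula above translates directly into code. The sketch below uses the slide's a = 0.25 as the default; the function name is made up for illustration.

```python
def rectify(features, a=0.25):
    """Apply x_new = max(x - a, 0) to every feature value."""
    return [max(x - a, 0.0) for x in features]


# Weak or negative responses are zeroed out; strong ones are shifted down by a.
rectified = rectify([-0.5, 0.1, 0.25, 1.0])
```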
POOLING 
• convolve(image, k filters) => k 
filtered images. 
• Lots of info - super granular. 
• Pooling lets us break the (filtered) 
images into regions and sum. 
• Think of the “sum” as how much 
an image quadrant is activated. 
• Image summarized into 4*k 
numbers. 
[Figure: example pooled quadrant sums: 0.5, 8 (top row); 0, 2 (bottom row)]
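A small sketch of sum pooling over quadrants, assuming each filtered image is a list of rows with even dimensions. The function name and the toy values are illustrative; the values were chosen so the quadrant sums come out as 0.5, 8, 0 and 2.

```python
def pool_quadrants(image):
    """Sum a 2D filtered image over its four quadrants (TL, TR, BL, BR)."""
    rows, cols = len(image), len(image[0])
    mid_r, mid_c = rows // 2, cols // 2

    def region_sum(r0, r1, c0, c1):
        return sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))

    return [
        region_sum(0, mid_r, 0, mid_c), region_sum(0, mid_r, mid_c, cols),
        region_sum(mid_r, rows, 0, mid_c), region_sum(mid_r, rows, mid_c, cols),
    ]


filtered = [[0.5, 0, 3, 1],
            [0, 0, 2, 2],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]
pooled = pool_quadrants(filtered)

# With k filtered images, concatenating the quadrant sums gives 4*k features.
k_filtered = [filtered, filtered]
features = [v for img in k_filtered for v in pool_quadrants(img)]
```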
LINEAR CLASSIFICATION
Data: A Labels: b Model: x 
Hypothesis: 
Ax = b + error 
Find the x, which minimizes the error = |Ax - b| 
WHY LINEAR CLASSIFIERS? 
They’re simple. They’re fast. They’re well studied. They scale. 
With the right features, they do a good job!
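To make "find the x which minimizes |Ax - b|" concrete, here is a toy least-squares solver for a single label column using the normal equations A^T A x = A^T b. It is a sketch only: it assumes A^T A is invertible and small enough to hold in memory, and it is not how a distributed solver would be written. With several classes, the same solve would be repeated for each column of b.

```python
def solve_least_squares(A, b):
    """Minimize |Ax - b| for one label column via the normal equations."""
    n = len(A[0])
    # Build the augmented system [A^T A | A^T b].
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * y for row, y in zip(A, b))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Three examples, two features; the labels are exactly 1*f1 + 2*f2.
A = [[1, 0],
     [0, 1],
     [1, 1]]
b = [1, 2, 3]
x = solve_least_squares(A, b)
```

Here the recovered weights are approximately 1 and 2, since the toy labels were generated without error.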
BACK TO OUR PROBLEM 
• What is A in our problem? 
• #images x #features (4f) 
• What about x? 
• #features x #classes 
• For f < 10000, pretty easy to 
solve! 
• Bigger - we have to get 
creative. 
[Figure: A (10m x 100k) times x (100k x 1k) = b (10m x 1k)]
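A back-of-the-envelope calculation shows why the larger case needs creativity. The sketch below reads the figure labels as 10 million images, 100 thousand features and 1 thousand classes, and assumes dense storage at 8 bytes per value; both are assumptions made for illustration.

```python
BYTES_PER_VALUE = 8  # assumption: dense 64-bit floating point storage


def dense_bytes(rows, cols):
    """Memory needed to store a dense rows x cols matrix."""
    return rows * cols * BYTES_PER_VALUE


n_images, n_features, n_classes = 10_000_000, 100_000, 1_000

a_bytes = dense_bytes(n_images, n_features)   # data matrix A
x_bytes = dense_bytes(n_features, n_classes)  # model x
b_bytes = dense_bytes(n_images, n_classes)    # labels b

a_terabytes = a_bytes / 10**12
```

Under these assumptions A alone takes about 8 TB, far more than a single machine's memory, while the model x is under 1 GB.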
TODAY’S EXERCISE 
• Build 3 image classification pipelines - simple, 
intermediate, advanced. 
• Qualitatively (with your eyes) and quantitatively 
(with statistics) compare their effectiveness.
ML PIPELINES 
• Reusable, general purpose components. 
• Built with distributed data in mind from day 1. 
• Used together, they give a complex system composed 
of well-understood parts. 
GO BEARS

More Related Content

What's hot

Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep LearningBelatrix Software
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningﺁﺻﻒ ﻋﻠﯽ ﻣﯿﺮ
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureRouyun Pan
 
Deep Learning - Overview of my work II
Deep Learning - Overview of my work IIDeep Learning - Overview of my work II
Deep Learning - Overview of my work IIMohamed Loey
 
Using Large Language Models in 10 Lines of Code
Using Large Language Models in 10 Lines of CodeUsing Large Language Models in 10 Lines of Code
Using Large Language Models in 10 Lines of CodeGautier Marti
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep LearningJulien SIMON
 
Convolutional neural network
Convolutional neural networkConvolutional neural network
Convolutional neural networkMojammilHusain
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...David Talby
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoderJun Lang
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
Introduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowIntroduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowSri Ambati
 

What's hot (20)

Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep Learning
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learning
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & Future
 
Deep Learning - Overview of my work II
Deep Learning - Overview of my work IIDeep Learning - Overview of my work II
Deep Learning - Overview of my work II
 
Using Large Language Models in 10 Lines of Code
Using Large Language Models in 10 Lines of CodeUsing Large Language Models in 10 Lines of Code
Using Large Language Models in 10 Lines of Code
 
Deep learning
Deep learningDeep learning
Deep learning
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep Learning
 
Convolutional neural network
Convolutional neural networkConvolutional neural network
Convolutional neural network
 
Deep learning
Deep learningDeep learning
Deep learning
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoder
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
Introduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowIntroduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlow
 

Viewers also liked

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersJean-Paul Azar
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineRobert Dempsey
 
Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...PyData
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CaptureJeff Klukas
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learnJeff Klukas
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningStepan Pushkarev
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonSimon Frid
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataRobert Dempsey
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Viewers also liked (19)

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced Producers
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
 
Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data Capture
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learning
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar to Machine Learning Pipelines

Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxIvo Andreev
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Sparking Science up with Research Recommendations
Sparking Science up with Research RecommendationsSparking Science up with Research Recommendations
Sparking Science up with Research RecommendationsMaya Hristakeva
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareTigerGraph
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 
Trinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloudTrinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloudAnima Anandkumar
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
30thSep2014
30thSep201430thSep2014
30thSep2014Mia liu
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowDatabricks
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用CHENHuiMei
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
machine learning workflow with data input.pptx
machine learning workflow with data input.pptxmachine learning workflow with data input.pptx
machine learning workflow with data input.pptxjasontseng19
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017Manish Pandey
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptxDr.Shweta
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Automated Testing of Autonomous Driving Assistance Systems
Automated Testing of Autonomous Driving Assistance SystemsAutomated Testing of Autonomous Driving Assistance Systems
Automated Testing of Autonomous Driving Assistance SystemsLionel Briand
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceSridhara R
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureIvo Andreev
 

Similar to Machine Learning Pipelines (20)

Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
Sparking Science up with Research Recommendations
Sparking Science up with Research RecommendationsSparking Science up with Research Recommendations
Sparking Science up with Research Recommendations
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA Hardware
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Trinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloudTrinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloud
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
30thSep2014
30thSep201430thSep2014
30thSep2014
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflow
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
machine learning workflow with data input.pptx
machine learning workflow with data input.pptxmachine learning workflow with data input.pptx
machine learning workflow with data input.pptx
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Automated Testing of Autonomous Driving Assistance Systems
Automated Testing of Autonomous Driving Assistance SystemsAutomated Testing of Autonomous Driving Assistance Systems
Automated Testing of Autonomous Driving Assistance Systems
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
專題報告
專題報告專題報告
專題報告
 

More from jeykottalam

AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Concurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine LearningConcurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine Learningjeykottalam
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scalejeykottalam
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stackjeykottalam
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Communityjeykottalam
 

More from jeykottalam (7)

AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Concurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine LearningConcurrency Control for Parallel Machine Learning
Concurrency Control for Parallel Machine Learning
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stack
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 


Machine Learning Pipelines

  • 1. MACHINE LEARNING PIPELINES Evan R. Sparks Graduate Student, AMPLab With: Shivaram Venkataraman, Tomer Kaftan, Gylfi Gudmundsson, Michael Franklin, Benjamin Recht, and others!
  • 2. WHAT IS MACHINE LEARNING?
  • 3. Model “Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that to make predictions or decisions, rather than following only explicitly programmed instructions.” –Wikipedia Data
  • 4. ML PROBLEMS • Real data often not ∈ R^d • Real data not well-behaved according to my algorithm. • Features need to be engineered. • Transformations need to be applied. • Hyperparameters need to be tuned. SVM Input: Real Data:
  • 5. SYSTEMS PROBLEMS • Datasets are huge. • Distributed computing is hard. • Mapping common ML techniques to distributed setting may be untenable.
  • 6. WHAT IS MLBASE? • Distributed Machine Learning - Made Easy! • Spark-based platform to simplify the development and usage of large scale machine learning.
  • 7. A STANDARD MACHINE LEARNING PIPELINE: Data, Train Classifier, Model. Right?
  • 8. A STANDARD MACHINE LEARNING PIPELINE: Data, Feature Extraction, Train Linear Classifier, Model, Test Data, Predictions. That’s more like it!
  • 9. Data Image Parser Normalizer Convolver A REAL PIPELINE FOR IMAGE CLASSIFICATION Inspired by Coates & Ng, 2012 Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Test Feature Data Model Extractor Label Extractor Test Error Error Computer Pooler
  • 10. A SIMPLE EXAMPLE • Load up some images. • Featurize. • Apply a transformation. • Fit a linear model. • Evaluate on test data. Replicates Fast Food Features Pipeline - Le et al., 2012
  • 11. PIPELINES API • A pipeline is made of nodes which have an expected input and output type. • Nodes fit together in a sensible way. • Pipelines are just nodes. • Nodes should be things that we know how to scale.
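The node idea on slide 11 can be sketched in a few lines of Python. This is an illustrative sketch only, not the MLbase API: the `Node` class, its `then` method, and the `parse` and `scale` stages are hypothetical names invented for the example. It shows typed stages fitting together, and that chaining two nodes gives back a node.

```python
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


class Node(Generic[A, B]):
    """Hypothetical pipeline stage mapping an input of type A to an output of type B."""

    def __init__(self, fn: Callable[[A], B]) -> None:
        self.fn = fn

    def __call__(self, value: A) -> B:
        return self.fn(value)

    def then(self, nxt: "Node[B, C]") -> "Node[A, C]":
        # Chaining two nodes yields another Node, so a pipeline is itself a node.
        return Node(lambda value: nxt(self(value)))


# Hypothetical stages: parse a string of pixel values, then rescale to [0, 1].
parse = Node(lambda text: [int(tok) for tok in text.split()])
scale = Node(lambda pixels: [p / 255.0 for p in pixels])

pipeline = parse.then(scale)
result = pipeline("0 51 255")
print(result)
```

Because `then` returns a `Node`, the combined `pipeline` can be chained again with further stages in exactly the same way.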
  • 12. WHAT’S IN THE TOOLBOX? Nodes Images - Patches, Gabor Filters, HoG, Contrast Normalization Text - n-grams, lemmatization, TF-IDF, POS, NER General Purpose - ZCA Whitening, FFT, Scaling, Random Signs, Linear Rectifier, Windowing, Pooling, Sampling, QR Decomposition Statistics - Borda Voting, Linear Mapping, Matrix Multiply ML - Linear Solvers, TSQR, Cholesky Solver, MLlib Speech and more - coming soon! Pipelines Example pipelines across domains CIFAR, MNIST, ImageNet, ACL Argument Extraction, TIMIT. Stay Tuned! Hyper Parameter Tuning Libraries GraphX MLlib ml-matrix Featurizers Stats Spark Utils Pipelines MLI
  • 13. Data Image Parser Normalizer Convolver A REAL PIPELINE FOR IMAGE CLASSIFICATION Inspired by Coates & Ng, 2012 Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Test Feature Data Model Extractor Label Extractor Test Error Error Computer Pooler YOU’RE GOING TO BUILD THIS!!
  • 14. BEAR WITH ME Photo: Andy Rouse, (c) Smithsonian Institute
  • 17. FEATURE EXTRACTION Data Image Parser Normalizer Convolver Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Model Pooler
  • 18. FEATURE EXTRACTION Data Image Parser Normalizer Convolver Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Model Pooler
  • 19. NORMALIZATION • Moves pixels from [0, 255] to [-1.0,1.0]. • Why? Math! • -1*-1 = 1, 1*1 =1 • If I overlay two pixels on each other and they’re similar values, their product will be close to 1 - otherwise, it will be close to 0 or -1. • Necessary for whitening. 0 255 -1 +1
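A minimal sketch of the normalization step, assuming pixels arrive as plain 8-bit integers in a Python list. The function name `normalize` is made up for illustration. It also checks the "overlay" intuition from the slide: similar extreme values multiply to something near 1, opposite ones to something near -1.

```python
def normalize(pixels):
    """Map 8-bit intensities in [0, 255] linearly onto [-1.0, 1.0]."""
    # 127.5 is half of 255, so 0 maps to -1.0 and 255 maps to +1.0.
    return [p / 127.5 - 1.0 for p in pixels]


row = normalize([0, 255, 250, 5])

# Two bright pixels overlaid: product close to +1.
similar = row[1] * row[2]
# A bright pixel overlaid on a dark one: product close to -1.
different = row[1] * row[3]
print(row, similar, different)
```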
  • 20. FEATURE EXTRACTION Data Image Parser Normalizer Convolver Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Model Pooler
  • 21. PATCH EXTRACTION • Image patches become our “visual vocabulary” • Intuition from text classification. • If I’m trying to classify a document as “sports” - I’d look for words like “football”, “batter”, etc. • For images - classifying pictures as “face” - I’m looking for things that look like eyes, ears, noses, etc. Visual Vocabulary
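Patch extraction can be sketched as a sliding window over a 2D image. This assumes the image is a list of rows of numbers; the name `extract_patches` and the `size` and `stride` parameters are hypothetical, chosen for the example rather than taken from the talk.

```python
def extract_patches(image, size, stride=1):
    """Return every size x size patch of a 2D list, scanning row by row."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            # Slice `size` rows, then `size` columns out of each of those rows.
            patches.append([row[c:c + size] for row in image[r:r + size]])
    return patches


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
patches = extract_patches(image, size=2)
print(len(patches), patches[0], patches[-1])
```

A 3 x 3 image yields four overlapping 2 x 2 patches; a real pipeline would then select a subset of such patches to serve as the visual vocabulary.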
  • 22. FEATURE EXTRACTION Data Image Parser Normalizer Convolver Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Model Pooler
  • 23. CONVOLUTION • A convolution filter applies a weighted average to sliding patches of data. • Can be used for lots of things - finding edges, blurring, etc. • Normalized Input: • Image, Ear Filter • Output: • New image - close to 1 for areas that look like the ear filter. • Apply many of these simultaneously.
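The sliding weighted average can be sketched in plain Python. This is an assumption-laden illustration: `filter_response` is a hypothetical name, the filter is applied without flipping it (strictly a cross-correlation, which is the usual convention in feature extraction), and the result is divided by the filter size so that a perfect match on a [-1, 1] image scores exactly 1.

```python
def filter_response(image, kernel):
    """Slide `kernel` over `image` and return the averaged product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        out_row = []
        for c in range(len(image[0]) - kw + 1):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            # Average over the patch so responses stay within [-1, 1].
            out_row.append(total / (kh * kw))
        out.append(out_row)
    return out


# Hypothetical normalized image and filter, both with values in [-1, 1].
image = [
    [1, -1, 1],
    [-1, 1, -1],
    [1, -1, 1],
]
kernel = [
    [1, -1],
    [-1, 1],
]
response = filter_response(image, kernel)
print(response)
```

Positions where the patch matches the filter come out at 1.0, and positions where it is the exact opposite come out at -1.0.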
  • 24. FEATURE EXTRACTION Data Image Parser Normalizer Convolver Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Model Pooler
  • 25. LINEAR RECTIFICATION • For each feature, x, given some a (=0.25): • x_new = max(x - a, 0) • What does it do? • Removes a bunch of noise.
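The rectification formula on slide 25 translates directly into code. A minimal sketch, assuming features are a flat list of floats; the function name `rectify` is made up for the example, and the default threshold is the a = 0.25 from the slide.

```python
def rectify(features, a=0.25):
    """Apply x_new = max(x - a, 0) to each feature."""
    # Anything at or below the threshold `a` is treated as noise and zeroed.
    return [max(x - a, 0.0) for x in features]


rectified = rectify([1.0, 0.5, 0.2, -0.75])
print(rectified)
```

Weak and negative responses collapse to zero, while strong responses survive shifted down by `a`.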
  • 26. FEATURE EXTRACTION Data Image Parser Normalizer Convolver Linear Solver Feature Extractor Symmetric Rectifier Patch Extractor Patch Whitener Patch Selector Label Extractor Model Pooler
  • 27. POOLING • convolve(image, k filters) => k filtered images. • Lots of info - super granular. • Pooling lets us break the (filtered) images into regions and sum. • Think of the “sum” as how much an image quadrant is activated. • Image summarized into 4*k numbers. 0.5 8 0 2
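Sum pooling over quadrants can be sketched as follows. The names `pool_quadrants` and `pool_all` are hypothetical, and the sketch assumes each filtered image is a 2D list split at its midpoint into four regions, which is one simple reading of the slide.

```python
def pool_quadrants(image):
    """Sum a 2D list over its four quadrants: top-left, top-right, bottom-left, bottom-right."""
    rows, cols = len(image), len(image[0])
    mid_r, mid_c = rows // 2, cols // 2

    def region_sum(r0, r1, c0, c1):
        return sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))

    return [
        region_sum(0, mid_r, 0, mid_c),
        region_sum(0, mid_r, mid_c, cols),
        region_sum(mid_r, rows, 0, mid_c),
        region_sum(mid_r, rows, mid_c, cols),
    ]


def pool_all(filtered_images):
    """Concatenate the quadrant sums of k filtered images into 4*k features."""
    features = []
    for img in filtered_images:
        features.extend(pool_quadrants(img))
    return features


# Two hypothetical filtered images (k = 2), so we expect 4 * 2 = 8 numbers.
filtered = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 2], [0, 0, 2, 0]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]
features = pool_all(filtered)
print(features)
```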
  • 29. WHY LINEAR CLASSIFIERS? Data: A Labels: b Model: x Hypothesis: Ax = b + error. Find the x that minimizes the error = |Ax - b|. They’re simple. They’re fast. They’re well studied. They scale. With the right features, they do a good job!
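One standard way to find the x that minimizes |Ax - b| is to solve the normal equations (A^T A) x = A^T b. The sketch below does this in plain Python for tiny dense matrices. It is not the solver used in the talk: the helper names are made up, it assumes A has full column rank, and at scale a distributed solver would take its place.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(m, rhs):
    """Solve m @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        # Pick the largest remaining pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def least_squares(a, b):
    """Return x minimizing |a x - b| via the normal equations (assumes full column rank)."""
    at = transpose(a)
    return solve(matmul(at, a), matmul(at, b))


# Hypothetical toy problem: 3 examples, 2 features, 2 classes.
A = [[1, 0], [0, 1], [1, 1]]
b = [[1, 0], [0, 1], [1, 1]]
x = least_squares(A, b)
print(x)
```

Here b was built so that the identity matrix fits it exactly, so the recovered x should be the 2 x 2 identity.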
  • 30. BACK TO OUR PROBLEM • What is A in our problem? • #images x #features (4f) • What about x? • #features x #classes • For f < 10000, pretty easy to solve! • Bigger - we have to get creative. (10m x 100k) x (100k x 1k) = (10m x 1k)
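A rough back-of-the-envelope calculation shows why the bigger case needs creativity. The sizes below are my reading of the figures on slide 30 (10m images, 100k features, 1k classes), and dense storage in 8-byte doubles is an assumption made for the arithmetic, not something stated in the talk.

```python
BYTES_PER_DOUBLE = 8  # assumption: dense matrices of 64-bit floats


def dense_bytes(rows, cols):
    """Memory needed to hold a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


# Assumed sizes, read off the slide's figure labels.
n_images, n_features, n_classes = 10_000_000, 100_000, 1_000

a_bytes = dense_bytes(n_images, n_features)       # data matrix A
x_bytes = dense_bytes(n_features, n_classes)      # model x
gram_bytes = dense_bytes(n_features, n_features)  # A^T A, if formed explicitly

for name, size in [("A", a_bytes), ("x", x_bytes), ("A^T A", gram_bytes)]:
    print(f"{name}: {size / 1e9:,.1f} GB")
```

Under these assumptions A alone is about 8 TB, far beyond a single machine's memory, while the model x is under 1 GB.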
  • 31. TODAY’S EXERCISE • Build 3 image classification pipelines - simple, intermediate, advanced. • Qualitatively (with your eyes) and quantitatively (with statistics) compare their effectiveness.
  • 32. ML PIPELINES • Reusable, general purpose components. • Built with distributed data in mind from day 1. • Used together: give a complex system comprised of well-understood parts. GO BEARS