3. Model
“Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that to make predictions or decisions, rather than following only explicitly programmed instructions.”
–Wikipedia
4. ML PROBLEMS
• Real data is often not ∈ R^d.
• Real data is not well-behaved according to my algorithm.
• Features need to be engineered.
• Transformations need to be applied.
• Hyperparameters need to be tuned.
(Figure: idealized SVM input vs. messy real data.)
5. SYSTEMS PROBLEMS
• Datasets are huge.
• Distributed computing is hard.
• Mapping common ML techniques to a distributed setting may be untenable.
6. WHAT IS MLBASE?
• Distributed Machine Learning - Made Easy!
• A Spark-based platform to simplify the development and usage of large-scale machine learning.
8. A STANDARD MACHINE LEARNING PIPELINE
That’s more like it!
(Figure: Data → Feature Extraction → Train Linear Classifier → Model; Test Data → Model → Predictions.)
9. A REAL PIPELINE FOR IMAGE CLASSIFICATION
Inspired by Coates & Ng, 2012
(Figure: Data → Image Parser → Normalizer → Convolver → Pooler → Linear Solver → Model, with a Feature Extractor built from Patch Extractor, Patch Selector, Patch Whitener, and Symmetric Rectifier, plus a Label Extractor; Test Data → Feature Extractor → Model → Error Computer → Test Error.)
10. A SIMPLE EXAMPLE
• Load up some images.
• Featurize.
• Apply a transformation.
• Fit a linear model.
• Evaluate on test data.
Replicates the Fast Food Features pipeline - Le et al., 2012.
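The five steps above can be sketched end to end. This is a minimal NumPy stand-in, not the actual Spark-based Pipelines API: the featurizer, transformation, and data here are all toy placeholders, and the "evaluation" is just training error on the same data.

```python
import numpy as np

# Toy data: a tiny batch of fake 8x8 grayscale "images" with binary labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 8, 8)).astype(np.float64)
labels = rng.integers(0, 2, size=10).astype(np.float64)

def featurize(imgs):
    # Toy featurizer: flatten each image into a vector.
    return imgs.reshape(len(imgs), -1)

def transform(feats):
    # Toy transformation: rescale pixels from [0, 255] to [-1, 1].
    return feats / 127.5 - 1.0

def fit_linear(A, b):
    # Fit a linear model x by least squares, minimizing |Ax - b|.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

A = transform(featurize(images))       # load, featurize, transform
x = fit_linear(A, labels)              # fit a linear model
preds = (A @ x > 0.5).astype(float)    # evaluate
train_error = float(np.mean(preds != labels))
```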
11. PIPELINES API
• A pipeline is made of nodes which have an expected input and output type.
• Nodes fit together in a sensible way.
• Pipelines are just nodes.
• Nodes should be things that we know how to scale.
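The idea of typed, composable nodes can be sketched in a few lines. The names here (`Node`, `then`) are illustrative, not the actual Pipelines API: a node declares its input and output types, two nodes compose only when the types line up, and the composition is itself just another node.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    in_type: type
    out_type: type
    fn: Callable

    def then(self, other: "Node") -> "Node":
        # Nodes fit together only when output type matches input type.
        assert self.out_type == other.in_type, "nodes do not fit together"
        # A pipeline is just a node: the composition of two nodes.
        return Node(self.in_type, other.out_type,
                    lambda x: other.fn(self.fn(x)))

    def apply(self, x):
        return self.fn(x)

# Two toy nodes: parse a comma-separated string, then sum the values.
parse = Node(str, list, lambda s: [float(t) for t in s.split(",")])
total = Node(list, float, sum)
pipeline = parse.then(total)   # pipeline: str -> float
```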
12. WHAT’S IN THE TOOLBOX?
Nodes
Images - Patches, Gabor Filters, HoG, Contrast Normalization
Text - n-grams, lemmatization, TF-IDF, POS, NER
General Purpose - ZCA Whitening, FFT, Scaling, Random Signs, Linear Rectifier, Windowing, Pooling, Sampling, QR Decomposition
Statistics - Borda Voting, Linear Mapping, Matrix Multiply
ML - Linear Solvers, TSQR, Cholesky Solver, MLlib
Speech and more - coming soon!
Pipelines
Example pipelines across domains: CIFAR, MNIST, ImageNet, ACL Argument Extraction, TIMIT.
Stay tuned!
(Stack: Pipelines, MLI, and hyperparameter tuning libraries (GraphX, MLlib, ml-matrix, Featurizers, Stats, Utils) built on Spark.)
13. A REAL PIPELINE FOR IMAGE CLASSIFICATION
Inspired by Coates & Ng, 2012
(Figure: the image classification pipeline from slide 9.)
YOU’RE GOING TO BUILD THIS!!
14. BEAR WITH ME
Photo: Andy Rouse, (c) Smithsonian Institute
17. FEATURE EXTRACTION
(Figure: the image classification pipeline from slide 9.)
18. FEATURE EXTRACTION
(Figure: the image classification pipeline from slide 9.)
19. NORMALIZATION
• Moves pixels from [0, 255] to [-1.0, 1.0].
• Why? Math! -1 * -1 = 1 and 1 * 1 = 1.
• If I overlay two pixels on each other and they have similar values, their product will be close to 1; otherwise, it will be close to 0 or -1.
• Necessary for whitening.
(Figure: the [0, 255] pixel range mapped onto [-1, +1].)
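The normalization step is a one-line affine rescale. A minimal NumPy sketch (the real node operates on distributed image data):

```python
import numpy as np

def normalize(pixels: np.ndarray) -> np.ndarray:
    # Map pixel values from [0, 255] onto [-1.0, 1.0], so that
    # overlaying two similar pixels gives a product near 1
    # (-1 * -1 = 1 and 1 * 1 = 1).
    return pixels / 127.5 - 1.0

img = np.array([0.0, 64.0, 128.0, 255.0])
out = normalize(img)   # endpoints land exactly on -1 and +1
```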
20. FEATURE EXTRACTION
(Figure: the image classification pipeline from slide 9.)
21. PATCH EXTRACTION
• Image patches become our “visual vocabulary”.
• The intuition comes from text classification.
• If I’m trying to classify a document as “sports”, I’d look for words like “football”, “batter”, etc.
• For images, when classifying pictures as “face”, I’m looking for things that look like eyes, ears, noses, etc.
(Figure: a visual vocabulary of example image patches.)
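Patch extraction itself is just a sliding window over each image; a subset of the collected patches then serves as the visual vocabulary. A minimal NumPy sketch of the windowing (the sampling/selection step is omitted):

```python
import numpy as np

def extract_patches(img: np.ndarray, k: int) -> np.ndarray:
    # Slide a k x k window over the image and collect every patch.
    h, w = img.shape
    return np.array([img[i:i + k, j:j + k]
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

img = np.arange(25, dtype=np.float64).reshape(5, 5)
patches = extract_patches(img, 3)   # (5-3+1)^2 = 9 patches of 3x3
```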
22. FEATURE EXTRACTION
(Figure: the image classification pipeline from slide 9.)
23. CONVOLUTION
• A convolution filter applies a weighted average to sliding patches of data.
• Can be used for lots of things: finding edges, blurring, etc.
• Normalized input: an image and an “ear” filter.
• Output: a new image that is close to 1 in areas that look like the ear filter.
• Apply many of these simultaneously.
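The convolution step can be sketched directly: slide the filter over the image and take a weighted sum at each position. A naive NumPy loop (real implementations batch many filters at once and use FFTs or matrix multiplies):

```python
import numpy as np

def convolve2d(img: np.ndarray, filt: np.ndarray) -> np.ndarray:
    # Weighted sum of the filter against each sliding patch of the image.
    k = filt.shape[0]
    h, w = img.shape
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * filt)
    return out

img = np.zeros((4, 4))
img[1:3, 1:3] = 1.0            # a bright 2x2 square
filt = np.full((2, 2), 0.25)   # 2x2 averaging filter
response = convolve2d(img, filt)  # peaks where the image matches the filter
```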
24. FEATURE EXTRACTION
(Figure: the image classification pipeline from slide 9.)
25. LINEAR RECTIFICATION
• For each feature x, given some a (= 0.25): x_new = max(x - a, 0).
• What does it do? Removes a bunch of noise.
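The rectifier is a one-liner in NumPy; a minimal sketch using the slide's threshold a = 0.25:

```python
import numpy as np

def rectify(x: np.ndarray, a: float = 0.25) -> np.ndarray:
    # x_new = max(x - a, 0): subtract the threshold and clip at zero,
    # zeroing out small, noisy responses.
    return np.maximum(x - a, 0.0)

feats = np.array([-0.5, 0.1, 0.25, 0.9])
out = rectify(feats)   # only the strong response (0.9) survives
```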
26. FEATURE EXTRACTION
(Figure: the image classification pipeline from slide 9.)
27. POOLING
• convolve(image, k filters) => k filtered images.
• That is lots of info, and super granular.
• Pooling lets us break the (filtered) images into regions and sum each one.
• Think of the “sum” as how much an image quadrant is activated.
• The image is summarized into 4*k numbers.
(Figure: a filtered image pooled into a 2x2 grid of quadrant sums, e.g. 0.5, 8, 0, 2.)
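Quadrant pooling for a single filtered image can be sketched as four region sums; applied to k filtered images it yields the 4*k summary numbers:

```python
import numpy as np

def pool_quadrants(img: np.ndarray) -> np.ndarray:
    # Break the filtered image into four quadrants and sum each,
    # summarizing the image as 4 numbers per filter.
    h, w = img.shape
    return np.array([img[:h // 2, :w // 2].sum(), img[:h // 2, w // 2:].sum(),
                     img[h // 2:, :w // 2].sum(), img[h // 2:, w // 2:].sum()])

filtered = np.ones((4, 4))          # a toy filtered image
pooled = pool_quadrants(filtered)   # each 2x2 quadrant sums four ones
```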
29. WHY LINEAR CLASSIFIERS?
Data: A. Labels: b. Model: x.
Hypothesis: Ax = b + error.
Find the x which minimizes the error |Ax - b|.
They’re simple. They’re fast. They’re well studied. They scale.
With the right features, they do a good job!
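Finding the x that minimizes |Ax - b| is a least-squares problem; a minimal NumPy sketch on synthetic, noiseless data (the deck's distributed solvers, e.g. TSQR or Cholesky, solve the same problem at scale):

```python
import numpy as np

# Synthetic data: A (100 examples x 5 features) and labels b = A x_true.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
x_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
b = A @ x_true   # noiseless, so the least-squares fit recovers x_true

# Find the x minimizing |Ax - b|.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
```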
30. BACK TO OUR PROBLEM
• What is A in our problem? #images x #features (4f).
• What about x? #features x #classes.
• For f < 10000, pretty easy to solve!
• Bigger - we have to get creative.
(Figure: A is 10m x 100k, x is 100k x 1k, so Ax is 10m x 1k.)
31. TODAY’S EXERCISE
• Build 3 image classification pipelines: simple, intermediate, advanced.
• Qualitatively (with your eyes) and quantitatively (with statistics) compare their effectiveness.
32. ML PIPELINES
• Reusable, general-purpose components.
• Built with distributed data in mind from day 1.
• Used together, they give a complex system comprised of well-understood parts.
GO BEARS