3. What is deep learning?
Wikipedia says:
“Deep learning is a branch of machine learning based on a set of
algorithms that attempt to model high-level abstractions in data by
using multiple processing layers, with complex structures or otherwise,
composed of multiple non-linear transformations.”
[Figure labels: Machine Learning, High-level abstraction, Network]
4. Is it brand new?
Neural Nets McCulloch & Pitts 1943
Perceptron Rosenblatt 1958
RNN Grossberg 1973
CNN Fukushima 1979
RBM Hinton 1999
DBN Hinton 2006
D-AE Vincent 2008
AlexNet Krizhevsky 2012
GoogLeNet Szegedy 2015
5. Deep architectures
Feed-Forward: multilayer neural nets, convolutional nets
Feed-Back: Stacked Sparse Coding, Deconvolutional Nets
Bi-Directional: Deep Boltzmann Machines, Stacked Auto-Encoders
Recurrent: Recurrent Nets, Long Short-Term Memory
7. CNN
CNNs are basically layers of convolutions followed by
subsampling and fully connected layers.
Intuitively speaking, convolution and subsampling
layers work as feature extraction layers, while a fully
connected layer uses the extracted features to classify
which category the current input belongs to.
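The pipeline above (convolution, subsampling, fully connected classifier) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a trained network: the single hand-written edge filter, the ReLU non-linearity, the 2x2 max pooling, and the classifier weights are all hypothetical choices made for the example.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Subsample: max over non-overlapping size x size windows."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def fully_connected(features, weights, biases):
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def cnn_forward(image, kernel, fc_weights, fc_biases):
    # Feature extraction: convolution -> non-linearity -> subsampling.
    features = max_pool(relu(conv2d(image, kernel)))
    flat = [v for row in features for v in row]
    # Classification: one score per category, pick the largest.
    scores = fully_connected(flat, fc_weights, fc_biases)
    return scores.index(max(scores)), scores

# Toy input: a 6x6 image with a vertical edge, and a vertical-edge filter.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
kernel = [[-1, 0, 1] for _ in range(3)]
fc_weights = [[0.1] * 4, [-0.1] * 4]   # hypothetical 2-class classifier
fc_biases = [0.0, 0.0]
label, scores = cnn_forward(image, kernel, fc_weights, fc_biases)
```

In a real CNN the filters and classifier weights are learned, and each layer has many filters rather than one.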
19. Gradient descent?
There are three variants of gradient descent.
They differ in how much data we use to compute the
gradient.
We make a trade-off between the accuracy of the
gradient estimate and the computing time.
20. Batch gradient descent
In batch gradient descent, we use the entire
training dataset to compute the gradient.
21. Stochastic gradient descent
In stochastic gradient descent (SGD), the
gradient is computed from each training
sample, one by one.
22. Mini-batch gradient descent
In mini-batch gradient descent, we take the
best of both worlds: the gradient is computed
from a small batch of training samples.
Common mini-batch sizes range between 50
and 256 (but can vary).
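The three variants differ only in the batch size, which a short sketch can make concrete. The example below is an illustration under assumptions: a one-parameter least-squares model on noise-free toy data, with a hypothetical helper `gradient_descent` whose `batch_size` argument selects the variant.

```python
import random

def gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over the batch, w.r.t. w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def gradient_descent(data, batch_size, lr=0.05, epochs=50, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        # One parameter update per batch of `batch_size` samples.
        for start in range(0, len(samples), batch_size):
            w -= lr * gradient(w, samples[start:start + batch_size])
    return w

# Toy data generated from y = 3x (no noise).
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

w_batch = gradient_descent(data, batch_size=len(data))  # batch: 1 update per epoch
w_sgd = gradient_descent(data, batch_size=1)            # SGD: 1 update per sample
w_mini = gradient_descent(data, batch_size=4)           # mini-batch: in between
```

All three runs approach w = 3 on this toy problem; they differ in how many updates they make per pass over the data and in how noisy each update is.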
23. Challenges
Choosing a proper learning rate is cumbersome.
Learning rate schedule
Avoiding getting trapped in suboptimal local
minima
26. Adagrad
It adapts the learning rate to the parameters,
performing larger updates for infrequent and
smaller updates for frequent parameters.
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i}$$
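A minimal sketch of the Adagrad update may help here. It assumes G is kept as a per-parameter running sum of squared gradients (the diagonal entries G_{t,ii}); the function name `adagrad_step`, the learning rate, and the toy gradient pattern are illustrative choices.

```python
import math

def adagrad_step(theta, grad, G, lr=0.5, eps=1e-8):
    """One Adagrad update. G[i] is the sum of squared gradients seen so far for theta[i]."""
    new_G = [G_i + g * g for G_i, g in zip(G, grad)]
    new_theta = [th - lr / math.sqrt(G_i + eps) * g
                 for th, G_i, g in zip(theta, new_G, grad)]
    return new_theta, new_G

# Parameter 0 gets a gradient at every step (frequent);
# parameter 1 gets a gradient only at step 10 (infrequent).
theta, G = [0.0, 0.0], [0.0, 0.0]
for t in range(1, 11):
    grad = [1.0, 1.0 if t == 10 else 0.0]
    new_theta, G = adagrad_step(theta, grad, G)
    last_step = [abs(n - o) for n, o in zip(new_theta, theta)]
    theta = new_theta
```

At step 10 both parameters see the same gradient, yet the infrequent parameter moves by about `lr` while the frequent one moves by about `lr / sqrt(10)`, because its accumulated G is ten times larger.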
27. Adadelta
Adadelta is an extension of Adagrad that seeks
to counteract its monotonically decreasing learning
rate.
It restricts the window of accumulated past
gradients to some fixed size 𝑤.
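$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$
$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma)\, \Delta\theta_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$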
No learning rate!
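The Adadelta update can be sketched for a single scalar parameter as follows. This is a sketch under assumptions: the numerator uses the previous accumulator of squared updates (the current update is not yet known when it is computed), and the values of `gamma` and `eps` and the toy objective are illustrative.

```python
import math

def adadelta_step(theta, grad, Eg2, Edx2, gamma=0.9, eps=1e-6):
    """One Adadelta update for a scalar parameter; note there is no learning rate."""
    # Decaying average of squared gradients.
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    # Update scaled by RMS of past updates over RMS of gradients.
    delta = -math.sqrt(Edx2 + eps) / math.sqrt(Eg2 + eps) * grad
    # Decaying average of squared updates.
    Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2
    return theta + delta, Eg2, Edx2

def f(theta):
    return (theta - 2.0) ** 2

theta, Eg2, Edx2 = 0.0, 0.0, 0.0
for _ in range(100):
    grad = 2.0 * (theta - 2.0)   # derivative of f
    theta, Eg2, Edx2 = adadelta_step(theta, grad, Eg2, Edx2)
```

The early steps are tiny because the update accumulator starts at zero and only `eps` drives the numerator; the step size then grows as updates accumulate.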
78. Weakly Supervised Object Localization
Supervised learning of localization usually requires bounding-box annotations.
What if localization is possible from image labels alone, without bounding-box
annotations?
Today’s seminar: Learning Deep Features for Discriminative
Localization
Zhou et al., arXiv:1512.04150v1 (2015), CVPR 2016
80. Class activation map (CAM)
• Identify important image regions by projecting
the weights of the output layer back onto the convolutional
feature maps.
• CAMs can be generated for each class in a single image.
• The highlighted regions differ by category within a given image.
• e.g., palace, dome, church …
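The projection described above can be sketched as a weighted sum of the last convolutional feature maps, using the output-layer weights of one class. This is a toy illustration under assumptions: two hand-made 2x2 feature maps and hypothetical class weights, with global average pooling feeding the output layer.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM_c(r, col) = sum_k w_k^c * f_k(r, col)."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wk * fmap[r][c] for wk, fmap in zip(class_weights, feature_maps))
             for c in range(w)]
            for r in range(h)]

def gap(fmap):
    """Global average pooling of one feature map."""
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))

def class_score(feature_maps, class_weights):
    """Class score: weighted sum of the pooled feature maps."""
    return sum(wk * gap(fmap) for wk, fmap in zip(class_weights, feature_maps))

# Two toy feature maps that fire in different corners of the image.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
# Hypothetical output-layer weights for two classes.
weights = {"dome": [1.0, 0.0], "church": [0.0, 1.0]}

cam_dome = class_activation_map(feature_maps, weights["dome"])
cam_church = class_activation_map(feature_maps, weights["church"])
```

Because pooling and the weighted sum are both linear, the class score equals the spatial average of that class's CAM, which is why the map shows how much each region contributes to the score.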
81. Results
• CAM on top 5 predictions on an image
• CAM for one object class in images
82. GAP vs. GMP
• Oquab et al., CVPR 2015:
"Is object localization for free? Weakly-supervised learning with convolutional neural
networks."
• Uses global max pooling (GMP)
• Intuitive difference between GMP and GAP?
• The GAP loss encourages the network to identify the full extent of an object.
• The GMP loss encourages it to identify just one discriminative part.
• GAP: the average of a map is maximized by finding all discriminative
parts of the object
• if the activations are all low, the output of that map decreases.
• GMP: low scores for all image regions except the most
discriminative part do not affect the score, because only
the maximum is taken.
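A small numeric comparison illustrates the difference between the two pooling operations. The two 3x3 activation maps below are made-up values: one fires on a single discriminative part, the other also fires over the rest of the object.

```python
def global_average_pool(fmap):
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))

def global_max_pool(fmap):
    return max(max(row) for row in fmap)

# Only the most discriminative part is activated.
part_only = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
# The same peak, plus activation over the full extent of the object.
full_extent = [
    [0.8, 0.8, 0.8],
    [0.8, 1.0, 0.8],
    [0.8, 0.8, 0.8],
]

gmp_part, gmp_full = global_max_pool(part_only), global_max_pool(full_extent)
gap_part, gap_full = global_average_pool(part_only), global_average_pool(full_extent)
```

GMP gives both maps the same score, so nothing rewards covering the whole object; GAP scores the full-extent map much higher.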
83. GAP & GMP
• GAP (upper) vs GMP (lower)
• GAP outperforms GMP
• GAP highlights more complete
object regions and less
background noise.
• Loss for average pooling
benefits when the network
identifies all discriminative
regions of an object
85. Concept localization
Concept localization in weakly
labeled images
• Positive set: images with the short phrase in their text caption
• Negative set: randomly selected images
• The model captures the concept, even though phrases are
much more abstract than object names.
Weakly supervised text detector
• Positive set: 350 Google Street View
images that contain text.
• Negative set: outdoor scene images in
SUN dataset
• Text highlighted without bounding box
annotations.
157. LSTM comes in!
Long Short Term Memory
This is just a standard RNN.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
158. LSTM comes in!
This is just a standard RNN.
This is the LSTM!
159. Overall Architecture
(Cell) state
Hidden State
Forget Gate
Input Gate
Output Gate
Next (Cell) State
Next Hidden State
Input
Output
Output = Hidden state
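The components labelled above (forget, input, and output gates; cell state; hidden state) fit together in a single time step, sketched here in plain Python. This follows the standard LSTM formulation from the cited blog post; the parameter names (`Wf`, `Wi`, `Wo`, `Wg` and matching biases) and the helper `zero_params` are hypothetical names for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    z = h_prev + x  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]    # forget gate
    i = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]    # input gate
    o = [sigmoid(a) for a in affine(p["Wo"], z, p["bo"])]    # output gate
    g = [math.tanh(a) for a in affine(p["Wg"], z, p["bg"])]  # candidate cell values
    # Next cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Next hidden state, which is also the output of the step.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def zero_params(hidden, inputs):
    """All-zero parameters, used only to make the example easy to check by hand."""
    p = {}
    for gate in "fiog":
        p["W" + gate] = [[0.0] * (hidden + inputs) for _ in range(hidden)]
        p["b" + gate] = [0.0] * hidden
    return p

params = zero_params(hidden=2, inputs=1)
h, c = lstm_step(x=[1.0], h_prev=[0.0, 0.0], c_prev=[1.0, -2.0], p=params)
```

With all-zero parameters every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved; trained weights make the gates depend on the input and the previous hidden state.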
162. VQA: Dataset and Problem definition
VQA dataset - Example
Q: How many dogs are seen?
Q: What animal is this?
Q: What color is the car?
Q: What is the mustache made of?
Q: Is this vegetarian pizza?
163. Solving VQA
Approach
[Malinowski et al., 2015] [Ren et al., 2015] [Andres et al., 2015]
[Ma et al., 2015] [Jiang et al., 2015]
Various methods have been proposed
164. DPPnet
Motivation
Common pipeline of using deep learning for vision:
CNN trained on ImageNet
Switch the final layer and fine-tune for the new task
Observation:
In VQA, the task is determined by a question
166. DPPnet
Parameter Explosion
[Figure: Question Feature (M: dimension of hidden state) → fc-layer → Predicted Parameter (N) → Dynamic Parameter Layer (Q×P)]
Number of parameters for the fc-layer (R):
N = Q×P, R = Q×P×M
For example: Q = 1000, P = 1000, M = 500
R = 500,000,000
1.86 GB for a single layer
Number of parameters for VGG19: 144,000,000
167. DPPnet
Parameter Explosion
Number of parameters for the fc-layer (R):
Before: N = Q×P, so R = Q×P×M
Solution: N < Q×P, so R = N×M
We can control N
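The parameter counts on these slides can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte (float32) weights and ignores biases, which matches the 1.86 GB figure; the reduced value `N_small = 40_000` is a hypothetical choice used only to show the effect of controlling N.

```python
def fc_params(inputs, outputs):
    """Number of weights in a fully connected layer (biases ignored)."""
    return inputs * outputs

Q, P, M = 1000, 1000, 500

# Predicting every weight of the dynamic parameter layer: N = Q * P.
N_full = Q * P
R_full = fc_params(M, N_full)
size_gib = R_full * 4 / 2**30          # assuming float32 weights

# Predicting only N < Q * P candidate weights instead.
N_small = 40_000                       # hypothetical choice of N
R_small = fc_params(M, N_small)

print(f"R = {R_full:,} weights, about {size_gib:.2f} GB")
print(f"With N = {N_small:,}: R = {R_small:,} weights")
```

With these assumed values the fc-layer shrinks by a factor of Q×P / N = 25.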
168. DPPnet
Weight Sharing with Hashing Trick
Weights of the Dynamic Parameter Layer are picked from the candidate weights by hashing [Chen et al., 2015]
Question Feature → fc-layer → Candidate Weights
Candidate Weights: 0.1 1.2 -0.7 0.3 -0.2
Hashing → Dynamic Parameter Layer:
0.1 0.1 -0.2 -0.7
1.2 -0.2 0.1 -0.7
-0.7 1.2 0.3 -0.2
0.3 0.3 0.1 1.2
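The weight-sharing idea can be sketched as a lookup: each position of the Q×P weight matrix is hashed to an index into the short list of candidate weights. This is an illustration only. The use of `zlib.crc32` and the key format are stand-ins chosen for determinism, not the hash function used by the authors, so the resulting matrix will generally not match the one shown on the slide.

```python
import zlib

def hashed_weight_matrix(candidates, Q, P, seed=0):
    """Build a Q x P matrix whose entries are all drawn from `candidates` by hashing (q, p)."""
    def bucket(q, p):
        key = f"{seed}:{q}:{p}".encode()
        return zlib.crc32(key) % len(candidates)
    return [[candidates[bucket(q, p)] for p in range(P)] for q in range(Q)]

# The five candidate weights from the slide; in the model these are predicted
# from the question feature.
candidates = [0.1, 1.2, -0.7, 0.3, -0.2]
W = hashed_weight_matrix(candidates, Q=4, P=4)

for row in W:
    print(row)
```

The hash is fixed, so the same position always maps to the same candidate; only the candidate values change from question to question.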
250. Visual texture synthesis
Which one do you think is real?
The right one is real.
The goal of texture synthesis is to produce (arbitrarily many)
new samples from an example texture.
262. Reconstruction from feature map
[Figure: input a ($X^a$) with feature maps $F_a^1$, $F_a^2$, $F_a^3$; input b ($X^b$) with feature maps $F_b^1$, $F_b^2$, $F_b^3$; axis label: number of filters]
Let’s make these features similar
by changing the input image!
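Making the features similar by changing the input means running gradient descent on the input while the network stays fixed. The sketch below shows this on a toy problem and makes strong assumptions: a single fixed layer F(x) = ReLU(W x) stands in for the CNN feature maps, the gradient is written by hand, and `W`, the inputs, and the step size are made-up values.

```python
def features(W, x):
    """Toy feature extractor: F(x) = ReLU(W x)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]

def loss(W, x, target):
    """Half squared distance between the features of x and the target features."""
    return 0.5 * sum((f - t) ** 2 for f, t in zip(features(W, x), target))

def grad_wrt_input(W, x, target):
    f = features(W, x)
    # dL/dx_j = sum_k (F_k - target_k) * 1[F_k > 0] * W[k][j]
    return [sum((fk - tk) * (1.0 if fk > 0 else 0.0) * row[j]
                for fk, tk, row in zip(f, target, W))
            for j in range(len(x))]

def reconstruct(W, target, x0, lr=0.1, steps=200):
    x = list(x0)
    for _ in range(steps):
        g = grad_wrt_input(W, x, target)
        x = [xj - lr * gj for xj, gj in zip(x, g)]  # update the input, not the weights
    return x

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # fixed "network"
a = [1.0, 2.0]                              # input a
F_a = features(W, a)                        # its feature map
b = [0.5, 0.5]                              # input b, the starting point

loss_before = loss(W, b, F_a)
b_new = reconstruct(W, F_a, b)
loss_after = loss(W, b_new, F_a)
```

In the actual methods the feature extractor is a pretrained CNN and the gradient with respect to the image is obtained by backpropagation; the loop structure is the same.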
266. How?
Style Image
Content Image
Mixed Image
Neural Art
Texture Synthesis Using Convolutional Neural Networks
Understanding Deep Image Representations by Inverting Them