Optic Flow Estimation by Deep Learning outlines several key concepts in optical flow estimation including:
- Optical flow is the apparent motion of brightness patterns in images. Estimating optical flow involves making assumptions like brightness constancy and spatial coherence.
- Classical algorithms like Lucas-Kanade and Horn-Schunck use techniques like regularization, coarse-to-fine processing, and descriptor matching to address challenges like the aperture problem, large displacements, and occlusions.
- Recent deep learning approaches like FlowNet, DeepFlow, and EpicFlow use convolutional neural networks to directly learn optical flow, achieving state-of-the-art performance on benchmarks. These approaches combine descriptor matching, variational optimization,
1. Optic Flow Estimation by Deep Learning
YU HUANG
SUNNYVALE, CALIFORNIA
YU.HUANG07@GMAIL.COM
2. Outline
• Optic Flow
• Brightness Constancy Constraints
• Aperture Problem
• Regularization and Smoothness Constraints
• Lucas-Kanade algorithm
• Focus of Expansion (FOE)
• Discrete Optimization for Optical Flow
• Large Displacement Optical Flow: Descriptor Matching
• EpicFlow: Edge-Preserving Interpolation of
Correspondences for Optical Flow
• Optical Flow with Piecewise Parametric Model
• SPM-BP: Sped-up PatchMatch Belief Propagation
• Coarse-to-Fine PatchMatch for Large Optical Flow
• Flow Fields: Correspondence Fields for Optical Flow
• Full Flow: Optic Flow Estimate By Global Optimization
over Regular Grids
• DeepFlow: Large displ. optical flow with deep matching
• FlowNet: Learning Optical Flow with ConvNets
• Deep Discrete Flow
• Optical Flow Estimation using a Spatial Pyramid Network
• A Large Dataset to Train ConvNets for Disparity, Optical
Flow, and Scene Flow Estimation
• Optical Flow via Direct Cost Volume Processing by CNN
• Appendix A: A Database and Evaluation for Optical Flow
• Appendix B: Secret of Optic Flow Estimation
• Appendix C: Deep Learning and optimization theory
4. Optic Flow
•Definition: optical flow is the
apparent motion of brightness
patterns in the image
•Ideally, optical flow would be the
same as the motion field
•Have to be careful: apparent
motion can be caused by lighting
changes without any actual motion
• Think of a uniform rotating sphere
under fixed lighting vs. a stationary
sphere under moving illumination
…
5. Estimating optical flow
• Given two subsequent frames, estimate the apparent motion field between them.
• Key assumptions
• Brightness constancy: projection of the same point looks the same in every frame
• Small motion: points do not move very far
• Spatial coherence: points move like their neighbors
I(x,y,t–1) I(x,y,t)
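6. Brightness Constancy Equation:
$I(x, y, t-1) = I(x + u(x,y),\; y + v(x,y),\; t)$
Can be written as (first-order Taylor expansion):
$I(x, y, t-1) \approx I(x, y, t) + I_x\,u(x,y) + I_y\,v(x,y)$
Brightness Constancy Constraint
So, $I_x u + I_y v + I_t = 0$, where $I_t = I(x,y,t) - I(x,y,t-1)$
[Figure: frames I(x,y,t–1) and I(x,y,t)]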
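7. The Aperture Problem
$I_x u + I_y v + I_t = 0$
1 equation in 2 unknowns
$u = \frac{dx}{dt}, \quad v = \frac{dy}{dt}, \quad I_x = \frac{\partial I}{\partial x}, \quad I_y = \frac{\partial I}{\partial y}, \quad I_t = \frac{\partial I}{\partial t}$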
11. Regularization & Smoothness Constraints
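Additional smoothness constraint:
$e_s = \iint \left( (u_x^2 + u_y^2) + (v_x^2 + v_y^2) \right) dx\,dy$
in addition to the constraint-equation term:
$e_c = \iint (I_x u + I_y v + I_t)^2 \, dx\,dy$
Minimize $e_s + \lambda e_c$, where $\lambda$ weights the constraint term.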
Temporal aliasing causes ambiguities in optical flow because
images can have many pixels with the same intensity.
i.e., how do we know which ‘correspondence’ is correct?
nearest match is correct (no aliasing)
nearest match is incorrect (aliasing)
actual shift
estimated shift
To overcome aliasing: coarse-to-fine strategy.
Horn & Schunck algorithm
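The minimization of the smoothness term plus the constraint-equation term can be sketched as a simple fixed-point iteration. This is a minimal illustration, not the deck's implementation: the function names, the 4-neighbor average (the original paper uses a weighted 8-neighbor average), and the smoothness weight `alpha` are assumptions.

```python
import numpy as np

def neighbor_average(f):
    """Mean of the four direct neighbors, replicating values at the border."""
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Jacobi-style Horn-Schunck updates from precomputed derivatives.

    alpha**2 is an assumed relative weight on the smoothness term.
    """
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)
    for _ in range(n_iters):
        u_bar = neighbor_average(u)
        v_bar = neighbor_average(v)
        # Constraint-equation residual at the smoothed flow, normalized.
        r = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * r
        v = v_bar - Iy * r
    return u, v
```

A larger `alpha` favors smoother flow; in practice the derivatives would come from filtered image pairs, and the iteration would sit inside a coarse-to-fine loop.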
12. Lucas-Kanade algorithm
Problem: we have more equations than unknowns
Solution: solve a least-squares problem
• The minimum least-squares solution is given by the solution (in d) of: $(A^T A)\, d = A^T b$
• The summations are over all pixels in the K x K window
• This technique was first proposed by Lucas & Kanade (1981)
13. Lucas-Kanade algorithm
◦ Optimal (u, v) satisfies Lucas-Kanade equation
When is This Solvable?
• $A^T A$ should be invertible
• $A^T A$ should not be too small due to noise
– eigenvalues $\lambda_1$ and $\lambda_2$ of $A^T A$ should not be too small
• $A^T A$ should be well-conditioned
– $\lambda_1 / \lambda_2$ should not be too large ($\lambda_1$ = larger eigenvalue)
The system is solvable when there is no aperture problem
What are the potential causes of errors in this procedure?
◦ Suppose $A^T A$ is easily invertible
◦ Suppose there is not much noise in the image
Errors then arise when the assumptions are violated:
• Brightness constancy is not satisfied
• The motion is not small
• A point does not move like its neighbors
– window size is too large
– what is the ideal window size?
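The solvability conditions above can be checked directly before solving the least-squares system for one window. The sketch below is illustrative only: the function name and the thresholds `min_eig` and `max_cond` are assumptions, and the derivatives are taken as given.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It, min_eig=1e-3, max_cond=1e3):
    """Solve (A^T A) d = A^T b for one K x K window, or return None.

    Ix, Iy, It are the spatial and temporal derivatives inside the window.
    min_eig and max_cond are hypothetical thresholds for illustration.
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # one row per pixel
    b = -It.ravel()
    ATA = A.T @ A
    lam2, lam1 = np.linalg.eigvalsh(ATA)  # ascending order: lam2 <= lam1
    # Reject windows that are too flat (small eigenvalues) or edge-like
    # (poorly conditioned), i.e. where the aperture problem applies.
    if lam2 < min_eig or lam1 / lam2 > max_cond:
        return None
    return np.linalg.solve(ATA, A.T @ b)
```

A window containing only a straight edge yields one near-zero eigenvalue and is rejected, which is the aperture problem expressed numerically.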
14. Lucas-Kanade algorithm
Iterative Refinement in Lucas-Kanade
Estimate velocity at each pixel by solving Lucas-Kanade equations
Warp H towards I using the estimated flow field
use image warping techniques
Repeat until convergence
Some Implementation Issues:
Warping is not easy (ensure that errors in warping are smaller than the estimate
refinement)
Warp one image, take derivatives of the other so you don’t need to re-compute
the gradient after each iteration.
Often useful to low-pass filter the images before motion estimation (for better
derivative estimation, and linear approximations to image intensity)
15. Focus of Expansion (FOE)
• Motion of object = - (Motion of Sensor)
• For a given translatory motion and gaze direction, the world seems to
flow out of one point (FOE).
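[Figure: scene point $(x_0, y_0, z_0)$ projected to image point $(x', y')$, focal length $f = 1$]
After time t, the scene point moves to:
$(x, y, z) = (x_0 + ut,\; y_0 + vt,\; z_0 + wt)$
$(x', y') = \left\{ \frac{x_0 + ut}{z_0 + wt},\; \frac{y_0 + vt}{z_0 + wt} \right\}$
• As t varies the image point moves along a
straight line in the image
• Focus of Expansion: backtrack time, or $t \to -\infty$:
$(x', y') = \left\{ \frac{u}{w},\; \frac{v}{w} \right\}$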
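As a small numerical check of the focus-of-expansion argument, the sketch below projects a translating scene point with focal length 1 and compares its image position far back in time with (u/w, v/w). The function names and the sample values are illustrative assumptions.

```python
import numpy as np

def project(point, velocity, t):
    """Image position at time t of a scene point translating at constant velocity."""
    x0, y0, z0 = point
    u, v, w = velocity
    return np.array([(x0 + u * t) / (z0 + w * t), (y0 + v * t) / (z0 + w * t)])

def focus_of_expansion(velocity):
    """Limit of the image position as t goes to minus infinity (requires w != 0)."""
    u, v, w = velocity
    return np.array([u / w, v / w])

point = (3.0, -1.0, 10.0)      # hypothetical scene point
velocity = (1.0, 2.0, 4.0)     # hypothetical translation (u, v, w)
foe = focus_of_expansion(velocity)
far_past = project(point, velocity, -1e9)
```

The image points for any two times are collinear with the FOE, which is why the flow field appears to radiate from a single point.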
17. Discrete Optimization for Optical Flow
Large-displacement optical flow from a discrete point of view: sub-pixel flow from pixel-accurate flow;
Formulate optical flow estimation as discrete inference in a CRF, followed by sub-pixel refinement.
Three different strategies reduce computation and memory demands by several orders of magnitude.
The combination of the three strategies makes it possible to estimate large-displacement optical flow.
Diverse Flow Proposals:
Efficient search structure, 300 nearest neighbors, 200 proposals from neighboring pixels
Block Coordinate Descent
Alternating optimization of image rows and columns, sub-problems solved optimally via DP
Truncated Pairwise Potential
Efficient Dynamic Programming
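The row and column sub-problems mentioned above are chains, so each can be solved exactly with dynamic programming. The following is a minimal sketch of the naive (unaccelerated) DP on one chain with a truncated L1 pairwise term; the function name, the weight `lam`, and the truncation `tau` are assumptions, and the conditioning on neighboring rows is omitted.

```python
import numpy as np

def chain_map(unary, proposals, lam=1.0, tau=2.0):
    """Exact MAP labeling of a chain by dynamic programming.

    unary:     (N, L) data cost of each of L flow proposals at N pixels
    proposals: (N, L, 2) the flow vector behind each proposal
    Pairwise cost: lam * min(||f_a - f_b||_1, tau) between chain neighbors.
    """
    n, num_labels = unary.shape
    cost = unary[0].astype(float)
    back = np.zeros((n, num_labels), dtype=int)
    for i in range(1, n):
        diff = np.abs(proposals[i - 1][:, None, :] - proposals[i][None, :, :]).sum(-1)
        total = cost[:, None] + lam * np.minimum(diff, tau)  # (prev, cur)
        back[i] = total.argmin(axis=0)
        cost = total.min(axis=0) + unary[i]
    # Backtrack from the best final label.
    labels = np.empty(n, dtype=int)
    labels[-1] = cost.argmin()
    for i in range(n - 1, 0, -1):
        labels[i - 1] = back[i, labels[i]]
    return labels, cost.min()
```

This version costs O(N L^2) per chain. The truncation is what the efficient variant exploits: pairs whose distance exceeds `tau` all share one constant cost, so only the non-truncated pairs need explicit evaluation.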
18. Discrete Optimization for Optical Flow
Strategies for Efficient Discrete Optical Flow. Left: a large set of diverse flow proposals per pixel by combining
NN in feature space from a set of grid cells with winner-takes-all solutions from neighboring pixels. Middle:
apply block coordinate descent, iteratively optimizing all image rows and columns conditioned on neighboring
blocks via dynamic programming. Right: Taking advantage of robust penalties, reduce pairwise computation
costs by pre-computing the set of non-truncated neighboring flow proposals for each flow vector.
19. Discrete Optimization for Optical Flow
Robust data term based on DAISY descriptors d
Similar flow vectors f are encouraged by
weighted by the edge strength
Naive Dynamic Programming
Efficient Dynamic Programming
20. Large Displacement Optical Flow: Descriptor
Matching in Variational Motion Estimation
Integrating rich descriptors into variational optical flow setting vs coarse-to-fine warping schemes;
Estimate a dense optical flow field with the same high accuracy as from variational optical flow;
VARIATIONAL MODEL
21. SPM-BP: Sped-up PatchMatch Belief
Propagation for Continuous MRFs
Integrating key ideas from PatchMatch of
effective particle propagation and
resampling, PatchMatch belief propagation
(PMBP) has been demonstrated to have
good performance in addressing continuous
labeling problems and runs orders of
magnitude faster than Particle BP (PBP).
Sped-up PMBP (SPM-BP): unifying efficient
filter-based cost aggregation and message
passing with PatchMatch-based particle
generation in a highly effective way.
Two-layer graph structure used in SPM-BP: (b)(c) A superpixel-level
graph generates new particle proposals to be tested on the pixel-
level graph. (d) For the reference superpixel, the EAF is applied to
obtain the data cost. (e) The message passing algo. proceeds in the
inner loop, while outgoing messages on the boundary are fixed.
22. Efficient Coarse-to-Fine PatchMatch for
Large Displacement Optical Flow
CPM (Coarse-to-fine PatchMatch) blends an efficient random search strategy with the coarse-to-fine
scheme for the optical flow problem, with respect to the nearest neighbor field (NNF).
Propagation with a constrained random search radius between adjacent levels of the hierarchical architecture.
Construct the pyramids. On each level, the initial matching correspondences are propagated with random search
a fixed number of times, and the results of each level are used as an initialization of the next lower level.
23. Efficient Coarse-to-Fine PatchMatch for
Large Displacement Optical Flow
A forward-backward consistency check is performed
to detect occlusions and remove outliers, on
multiple levels of the pyramid.
Only matches on the two finest levels are validated.
The backward flow is interpolated linearly from the
matching correspondences, the error threshold of
the consistency check is set equal to the grid spacing, and
the coarser matches are all upscaled to the finest
resolution before the consistency check.
Matches with displacements larger than 400 pixels are also removed.
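A forward-backward check of this kind can be sketched on dense flow fields as follows. This is a simplified illustration rather than the CPM implementation: it looks up the backward flow at the nearest pixel instead of interpolating linearly, and the function name and default thresholds are assumptions.

```python
import numpy as np

def fb_consistent(flow_fw, flow_bw, threshold=1.0, max_disp=400.0):
    """Boolean mask of pixels whose forward flow passes the consistency check.

    flow_fw, flow_bw: (H, W, 2) arrays holding (dx, dy) per pixel.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-pixel landing position in the second image.
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    bw = flow_bw[np.clip(yt, 0, h - 1), np.clip(xt, 0, w - 1)]
    # A consistent pair satisfies flow_fw + flow_bw ~ 0.
    err = np.linalg.norm(flow_fw + bw, axis=-1)
    small = np.linalg.norm(flow_fw, axis=-1) <= max_disp
    return inside & (err <= threshold) & small
```

Pixels that fail the check are typically occluded or mismatched and can be dropped before interpolation.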
24. EpicFlow: Edge-Preserving Interpolation of
Correspondences for Optical Flow
Optical flow estimation at large displacements with significant occlusions.
Edge-Preserving Interpolation of Correspondences (EpicFlow) is fast and robust.
2 steps: i) dense matching by edge-preserving interpolation from a sparse set of
matches; ii) variational energy minimization initialized with the dense matches.
The sparse-to-dense interpolation relies on an appropriate choice of the distance,
namely an edge-aware geodesic distance.
Handle occlusions and motion boundaries – two issues for optical flow computation.
Approximation for geodesic distance to allow fast computation w/o performance loss.
Variational energy minimization on dense matches to obtain the final flow estimation.
25. EpicFlow: Edge-Preserving Interpolation of
Correspondences for Optical Flow
Overview of EpicFlow. Given two images, compute matches using DeepFlow and the edges of the
first image using SED (Structured Edge Detector). Combine them to interpolate matches and obtain a
dense correspondence field, as initialization of a one-level energy minimization framework.
26. EpicFlow: Edge-Preserving Interpolation of
Correspondences for Optical Flow
(a-b) two consecutive frames; (c) contour response C from SED; (d) match positions from DeepMatching; (e-f)
geodesic distance from a pixel to all others. (g-h) 100 nearest matches using geodesic distance from the pixel.
27. Dense, Accurate Optical Flow Estimation with
Piecewise Parametric Model
Fit a flow field piecewise to a variety of parametric models, where the domain of each
piece (i.e., each piece’s shape, position and size) is determined adaptively, while at the
same time maintaining a global inter-piece flow continuity constraint.
A multi-model fitting scheme via energy minimization, taking into account both the
piecewise constant model assumption and the flow field continuity constraint, which makes it
possible to handle both homogeneous motions and complex motions effectively.
Potts model term MDL term
Data term
Flow continuity(inter-piece compatibility) term
29. Flow Fields: Dense Correspondence Fields for
Accurate Large Displacement Optical Flow Estimation
A dense correspondence field approach much better suited for optical flow estimation than
approximate nearest neighbor fields.
Do not require explicit regularization, smoothing or a new data term, but a data based search strategy
that finds most inliers and enhancements for outlier filtering.
The pipeline of the Flow Field approach. For the basic approach, only consider the full resolution.
30. Flow Fields: Dense Correspondence Fields for
Accurate Large Displacement Optical Flow Estimation
Illustration of the hierarchical Flow Field approach. Flow offsets
saved in pixels are propagated in all arrow directions.
31. Full Flow: Optical Flow Estimation By Global
Optimization over Regular Grids
A global optimization approach to optical flow
estimation which optimizes a classical optical flow
objective over the full space of mappings between
discrete grids.
The regular structure of the space of mappings
enables optimizations that reduce the
computational complexity of the algorithm’s inner
loop and support efficient matching.
The approach treats the objective (data term and
regularization term) as a Markov random field and
uses discrete optimization techniques.
Optical flow over regular grids. Each pixel p in
I1 is spatially connected to its four neighbors
in I1 and temporally connected to (2ς + 1)^2
pixels in I2. Let f: Ω → [−ς, ς]^2 be a flow field.
32. Full Flow: Optical Flow Estimation By Global
Optimization over Regular Grids
This objective is a discrete Markov random field with
a two-dimensional label space;
To optimize the model, use TRW-S, which optimizes
the dual of a natural linear programming relaxation of
the problem;
To reduce wall-clock time, implement a parallelized
TRW-S solver;
Occlusion handling by FW-BW consistency checking;
Use EpicFlow interpolation scheme as postprocessing.
33. DeepFlow: Large displacement optical flow
with deep matching
DeepFlow blends a matching algorithm with a variational approach for optical flow.
A descriptor matching algorithm, tailored to the optical flow problem, boosts
performance on fast motions.
The matching algorithm builds upon a multi-stage architecture with 6 layers, interleaving
convolutions and max-pooling, a construction akin to deep convolutional nets.
Using dense sampling, it efficiently retrieves quasi-dense correspondences and enjoys a
built-in smoothing effect on descriptor matches, a valuable asset for integration into an energy
minimization framework for optical flow estimation.
DeepFlow efficiently handles large displacements occurring in realistic videos, and shows
competitive performance on optical flow benchmarks.
35. FlowNet: Learning Optical Flow with
Convolutional Networks
A generic architecture and another one including a layer that correlates feature vectors at
different image locations: FlowNetSimple and FlowNetCorr, being trained end-to-end.
A simple choice is to stack both input images together and feed them through a rather generic
network, allowing the network to decide itself how to process the image pair to extract the motion
information. This architecture, consisting only of convolutional layers, is called ‘FlowNetSimple’.
A straightforward alternative is to create two separate, yet identical processing streams for the two
images and to combine them at a later stage. A ‘correlation layer’ performs multiplicative patch
comparisons between the two feature maps; the architecture containing this layer is called ‘FlowNetCorr’.
Given two multi-channel feature maps f1, f2: R^2 → R^c, with w, h, and c being their width, height and
number of channels, the correlation layer lets the network compare each patch from f1 with each
patch from f2.
Refinement: The main ingredients are ‘upconvolutional’ layers, consisting of unpooling (extending
the feature maps, as opposed to pooling) and a convolution.
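The multiplicative comparison performed by the correlation layer can be illustrated with plain NumPy. The sketch below assumes 1x1 patches and a limited search range `max_disp`; the function name and the output layout (one channel per displacement) are choices made for this example, not the FlowNet code.

```python
import numpy as np

def correlation(f1, f2, max_disp=2):
    """Dot-product correlation between two feature maps.

    f1, f2: (H, W, C) feature maps. Returns (H, W, (2d+1)**2), where channel
    (dy + d) * (2d + 1) + (dx + d) holds <f1[y, x], f2[y + dy, x + dx]>.
    Positions outside f2 are treated as zero features.
    """
    h, w, _ = f1.shape
    d = max_disp
    f2p = np.pad(f2, ((d, d), (d, d), (0, 0)))
    out = np.empty((h, w, (2 * d + 1) ** 2))
    k = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = f2p[d + dy:d + dy + h, d + dx:d + dx + w]
            out[..., k] = (f1 * shifted).sum(axis=-1)
            k += 1
    return out
```

Limiting the displacement range keeps the output at (2d+1)^2 channels rather than comparing every location with every other location.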
37. FlowNet: Learning Optical Flow with
Convolutional Networks
Refinement of the coarse feature maps to the high resolution prediction
38. FlowNet 2.0
End-to-end learning of optical flow: a stacked architecture with warping of the 2nd image with
intermediate optical flow; small displacements by a sub-network specializing on small motions.
Evaluation of options when stacking two FlowNetS networks (Net1 and Net2)
40. Deep Discrete Flow
Investigate two types of networks: a local network with a small receptive field consisting of 3x3
convolutions followed by non-linearities, and a subsequent context network that aggregates information
over larger image regions using dilated convolutions;
Learning context-aware features for solving optical flow using discrete optimization;
Training a context network with a large receptive field size on top of a local network using dilated
convolutions on patches.
Feature matching by comparing each pixel in the ref image to every pixel in the target image;
The matching cost volume from the network's output forms the data term for discrete MAP inference
in a pairwise MRF.
Local Network: leverages 3x3 convolution kernels. The hyper-parameters of the network are the
number of layers and the number of feature maps in each layer as specified in evaluation.
Context Network: increases the size of the receptive field with only modest increase in complexity by
exploiting dilated convolutions, i.e. reading the input feature maps at locations with a spatial stride
larger than one, taking more contextual information into account.
41. Deep Discrete Flow
The input images are processed in forward order and backward order using local and context Siamese CNN, yielding per-
pixel descriptors. Then match points on a regular grid in the ref image to every pixel in the other image, yielding a large
tensor of forward matching costs (F1/F2) and backward matching costs (B1/B2). Matching costs are smoothed using
discrete MAP inference in a pairwise MRF. Finally, a forward-backward consistency check removes outliers and sub-pixel
accuracy is attained using the EpicFlow interpolator. Train the model in a piece-wise fashion via the loss functions.
42. Deep Discrete Flow
(a) Naive
(b) Fast
Dilated Convolution Implementations
The center of the patch is marked with a red * and each color corresponds to a convolution center for a specific
dilation factor, red for 4 dilations (shown in green), green for 2 dilations (shown in blue) and yellow for both.
43. Deep Discrete Flow
Fast Patch-based Training of Dilated Convolutional Networks. Left: A naive
implementation requires dilated convolution operations, which are computationally
less efficient than highly optimized cuDNN convolutions without dilations. Right:
The behavior of dilated convolutions can be replicated with regular convolutions by
first sub-sampling the feature map and then applying 1-dilated convolutions with
stride. Here, dilations denotes an array that specifies the dilation factor of the
dilated convolution in each convolutional layer.
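The equivalence between a dilated convolution and ordinary convolutions on sub-sampled inputs can be checked in one dimension. This is only an illustration of the identity, not the patch-based training code; both function names are hypothetical.

```python
import numpy as np

def dilated_conv1d(x, k, d):
    """Direct 'valid' correlation with kernel taps spaced d samples apart."""
    span = (len(k) - 1) * d
    return np.array([sum(k[j] * x[i + j * d] for j in range(len(k)))
                     for i in range(len(x) - span)])

def dilated_via_subsampling(x, k, d):
    """Same result from d ordinary (1-dilated) correlations on sub-sampled inputs."""
    span = (len(k) - 1) * d
    out = np.empty(len(x) - span)
    for phase in range(d):
        sub = x[phase::d]                      # every d-th sample
        res = np.correlate(sub, k, mode="valid")
        n = len(out[phase::d])
        out[phase::d] = res[:n]                # interleave the results back
    return out
```

Each phase of the sub-sampled signal sees the kernel as a dense filter, so highly optimized non-dilated routines can be reused.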
44. Optical Flow Estimation using a Spatial
Pyramid Network
Compute optical flow by combining a classical spatial-pyramid formulation with deep learning.
This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by
the current flow estimate and computing an update to the flow.
Train one deep network per level to compute the flow update.
The networks do not need to deal with large motions; these are handled by the pyramid.
Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters.
Since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped
images is appropriate.
The learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method
and how to improve it.
Trained using Adam optimization with β1 = 0.9 and β2 = 0.999. A batch size of 32 is used across all networks, with 4000
iterations per epoch. The learning rate is 1e-4 for the first 60 epochs and is then decreased to 1e-5 until convergence.
45. Optical Flow Estimation using a Spatial
Pyramid Network
Training network Gk requires trained models {G0,…,Gk-1} to
obtain the initial flow u(Vk-1). Obtain ground truth residual
flows v̂k as the difference between the downsampled ground truth
flow V̂k and u(Vk-1), then train the network Gk using the End Point Error
(EPE) loss.
Each level in the pyramid has a simplified task relative to the full optical flow
estimation problem; it only has to estimate a small-motion update to an existing
flow field. Consequently each network can be simple.
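The residual training target and the EPE loss for one pyramid level can be sketched as below. The names are hypothetical, and nearest-neighbor upsampling is assumed for brevity; the actual SPyNet upsampling operator may differ.

```python
import numpy as np

def upsample_flow(flow):
    """2x nearest-neighbor upsampling; vectors are doubled to match the finer grid."""
    return 2.0 * np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)

def residual_target(gt_flow_k, flow_prev):
    """Residual the level-k network should predict: ground truth minus upsampled estimate."""
    return gt_flow_k - upsample_flow(flow_prev)

def epe(pred, target):
    """Mean end-point error: average Euclidean distance between flow vectors."""
    return np.linalg.norm(pred - target, axis=-1).mean()
```

Because the coarser levels have already explained most of the motion, the residual is small, which is why each per-level network can stay simple.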
46. Optical Flow Estimation using a Spatial
Pyramid Network
Inference in a 3-Level Pyramid Network: The network G0 computes the residual flow v0 at the highest level of the
pyramid (smallest image) using the low resolution images {I^1_0, I^2_0}. At each pyramid level, the network Gk
computes a residual flow vk which propagates to each of the next lower levels of the pyramid in turn, to finally
obtain the flow V2 at the highest resolution.
47. A Large Dataset to Train ConvNets for Disparity,
Optical Flow, and Scene Flow Estimation
What is Scene Flow?
Scene flow describes the 3D motion of scene points, just like optical flow describes the 2D motion.
Disparity estimation: First only train early low res. losses, second enable higher res. and phase out low res.
losses, then repurpose the deeper layers when no longer constrained by directly attached losses;
Scene flow estimation: 1. Interleaving 3 pretrained networks (1x FlowNet and 2x DispNets); 2. Joint retraining
on optical flow, 2x disparity, and disparity change.
48. A Large Dataset to Train ConvNets for Disparity,
Optical Flow, and Scene Flow Estimation
Interleaving the weights of a FlowNet (green) and two DispNets (red and blue) to a SceneFlowNet. For
every layer, the filter masks are created by taking the weights of one network (left) and setting the
weights of the other networks to zero, respectively (middle). The outputs from each network are then
concatenated to yield one big network with three times the number of inputs and outputs (right).
49. A Large Dataset to Train ConvNets for Disparity,
Optical Flow, and Scene Flow Estimation
50. Accurate Optical Flow via Direct Cost Volume
Processing by CNN
Optical flow estimation operating on the full 4-d cost volume.
Share the structural benefits of leading stereo matching pipelines to yield high accuracy.
The full 4-d cost volume can be constructed in a fraction of a second due to its regularity.
Adapt semi-global matching to the 4-d setting, to a pipeline that achieves higher accuracy.
Learn a nonlinear feature embedding using a convolutional network.
Embed image patches into a compact and discriminative feature space that is robust to geometric and
radiometric distortions encountered in optical flow estimation.
Feature space embeddings as well as distances in this space can be computed extremely efficiently.
A small fully-convolutional network that embeds raw image patches into a compact Euclidean space.
SGM is a common stand-in for more costly MRF optimization in stereo processing pipelines, robust/in parallel.
EpicFlow uses locally-weighted affine models to synthesize a dense flow field from semidense matches.
51. Accurate Optical Flow via Direct Cost Volume
Processing by CNN
Qualitative results on three images from the KITTI 2015 training set.
53. Limitations of Yosemite
Only sequence used for quantitative evaluation
Limitations:
•Very simple and synthetic
•Small, rigid motion
•Minimal motion discontinuities/occlusions
[Figure: Yosemite sequence, Image 7 and Image 8, with ground-truth flow and flow color coding]
54. Limitations of Yosemite
Only sequence used for quantitative evaluation
Current challenges:
•Non-rigid motion
•Real sensor noise
•Complex natural scenes
•Motion discontinuities
Need more challenging and more realistic benchmarks
55. Realistic synthetic imagery
•Randomly generate scenes with “trees” and “rocks”
•Significant occlusions, motion, texture, and blur
•Rendered using Mental Ray and “lens shader” plugin
RockGrove
57. Dense flow with hidden texture
•Paint scene with textured fluorescent paint
•Take 2 images: One in visible light, one in UV light
•Move scene in very small steps using robot
•Generate ground-truth by tracking the UV images
[Figure: capture setup with visible and UV lights; cropped image]
58. Conclusions
•Difficulty: Data substantially more challenging than Yosemite
•Diversity: Substantial variation in difficulty across the various datasets
•Motion GT vs Interpolation: Best algorithms for one are not the best for the other
•Comparison with Stereo: Performance of existing flow algorithms appears weak
Szeliski
60. Classical Optical Flow Objective Function
u and v are the horizontal and vertical components of the optical flow field
to be estimated from images I1 and I2, λ is a regularization parameter, and
ρD and ρS are the data and spatial penalty functions.
The penalty functions: (1) the quadratic HS penalty, (2) the Charbonnier
penalty, and (3) the Lorentzian.
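For reference, the three penalty functions in their commonly used forms can be written as follows. The parameter names `eps` and `sigma` and their default values are assumptions for illustration.

```python
import numpy as np

def quadratic(x):
    """Quadratic (Horn-Schunck) penalty: x^2."""
    return x ** 2

def charbonnier(x, eps=1e-3):
    """Charbonnier penalty sqrt(x^2 + eps^2): a smooth, convex approximation of |x|."""
    return np.sqrt(x ** 2 + eps ** 2)

def lorentzian(x, sigma=1.0):
    """Lorentzian penalty log(1 + x^2 / (2 sigma^2)): non-convex, most robust of the three."""
    return np.log1p(x ** 2 / (2.0 * sigma ** 2))
```

For large residuals the quadratic grows fastest and the Lorentzian slowest, which is what makes the latter the most tolerant of outliers and also the hardest to optimize.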
61. Pre-processing
Optimize the regularization parameter λ for the training sequences;
Apply non-linear prefiltering of the images to reduce the influence of
illumination changes;
Use a standard brightness constancy model;
Gradient only imposes constancy of the gradient vector at each pixel (i.e. it
robustly penalizes Euclidean distance between image gradients);
Simple derivative constancy is as good as the more sophisticated texture
decomposition method.
62. Coarse-to-fine estimation and GNC
(Graduated Non-Convexity)
The GNC (graduated non-convexity) scheme: linearly combine a quadratic
objective with a robust objective in varying proportions, from fully quadratic
to fully robust;
The downsampling factor does not matter when using a convex penalty; a
standard factor of 0.5 is fine;
The GNC method is helpful even for the convex Charbonnier penalty
function due to the nonlinearity of the data term.
63. Interpolation method and derivatives
Bicubic interpolation is more accurate than bilinear;
Removing temporal averaging of the gradients, using central difference
filters, or using a 7-point derivative filter all reduce accuracy compared to
the baseline, but not significantly.
The spline-based interpolation scheme is consistently better;
Temporal averaging of the derivatives is probably worthwhile for a small
computational expense.
64. Penalty functions
The convex Charbonnier penalty performs better than the more robust, non-
convex Lorentzian on both the training and test sets.
One reason is that non-convex functions are more difficult to optimize, causing
the optimization scheme to find a poor local optimum.
The less-robust Charbonnier is preferable to the Lorentzian and a slightly non-
convex penalty function is better still.
65. Median filtering
The baseline 5 × 5 median filter is better than both MF 3×3 and MF 7×7 but the
difference is not significant.
When we perform 5× 5 median filtering twice (2× MF) or five times (5× MF) per
warping step, the results are worse.
Finally, removing the median filtering step (w/o MF) makes the computed flow
significantly less accurate with larger outliers.
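The median filtering step applied to the flow field after each warping iteration can be sketched with NumPy alone. The function name and the edge-replicating border handling are assumptions; the 5 × 5 default matches the baseline described above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_flow(flow, size=5):
    """Apply a size x size median filter to each component of an (H, W, 2) flow field."""
    r = size // 2
    padded = np.pad(flow, ((r, r), (r, r), (0, 0)), mode="edge")
    # Windows have shape (H, W, 2, size, size); take the median over each window.
    windows = sliding_window_view(padded, (size, size), axis=(0, 1))
    return np.median(windows, axis=(-2, -1))
```

An isolated outlier vector is replaced by the median of its neighborhood, which is the denoising effect credited above for the accuracy gain.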
67. Graphical Models
• Graphical Models: Powerful framework for representing dependency
structure between random variables.
• The joint probability distribution over a set of random variables.
• The graph contains a set of nodes (vertices) that represent random variables, and a set
of links (edges) that represent dependencies between those random variables.
• The joint distribution over all random variables decomposes into a product of
factors, where each factor depends on a subset of the variables.
• Two types of graphical models:
• Directed (Bayesian networks)
• Undirected (Markov random fields, Boltzmann machines)
• Hybrid graphical models that combine directed and undirected models, such as Deep
Belief Networks, Hierarchical-Deep Models.
68. Generative Model: MRF
Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi takes
value fi in a label set L.
Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it
satisfies Markov property.
◦ Generative model for joint probability p(x)
◦ allows no direct probabilistic interpretation
◦ define potential functions Ψ on maximal cliques A
◦ map joint assignment to non-negative real number
◦ requires normalization
An MRF is an undirected graphical model
69. A flow network G(V, E) is defined as a fully connected directed graph
where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0;
The max-flow problem is to find the flow of maximum value on a
flow network G;
A s-t cut or simply cut of a flow network G is a partition of V into S
and T = V-S, such that s in S and t in T;
A minimum cut of a flow network is a cut whose capacity is the
least over all the s-t cuts of the network;
Methods for max-flow or min-cut:
◦ Ford-Fulkerson method;
◦ "Push-Relabel" method.
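As a concrete instance of the Ford-Fulkerson method, the sketch below uses breadth-first search to find augmenting paths (the Edmonds-Karp variant) on a dense capacity matrix. The function name and the matrix representation are choices made for this example.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Return (max-flow value, source side S of a minimum s-t cut).

    capacity: square list of lists, capacity[a][b] >= 0.
    """
    n = len(capacity)
    residual = [row[:] for row in capacity]
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            a = queue.popleft()
            for b in range(n):
                if parent[b] == -1 and residual[a][b] > 0:
                    parent[b] = a
                    queue.append(b)
        if parent[t] == -1:
            # No path left: nodes still reachable from s form the cut set S.
            return total, {i for i in range(n) if parent[i] != -1}
        # Find the bottleneck capacity along the path, then augment.
        bottleneck = float("inf")
        b = t
        while b != s:
            bottleneck = min(bottleneck, residual[parent[b]][b])
            b = parent[b]
        b = t
        while b != s:
            a = parent[b]
            residual[a][b] -= bottleneck
            residual[b][a] += bottleneck
            b = a
        total += bottleneck
```

The capacity of the returned cut equals the flow value, which is the max-flow/min-cut relationship that graph-cut labeling methods rely on.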
70. Mostly labeling is solved as an energy minimization problem;
Two common energy models:
◦ Potts Interaction Energy Model;
◦ Linear Interaction Energy Model.
Graph G contains two kinds of vertices: p-vertices and i-vertices;
◦ all the edges in the neighborhood N, called n-links;
◦ edges between the p-vertices and the i-vertices called t-links.
In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex;
The minimum cost multi-way cut will minimize the energy function where the severed n-links would
correspond to the boundaries of the labeled vertices;
The approximation algorithms to find this multi-way cut:
◦ "alpha-expansion" algorithm;
◦ "alpha-beta swap" algorithm.
71. Deep Learning
Representation learning attempts to automatically learn good features or representations;
Deep learning algorithms attempt to learn multiple levels of representation of increasing
complexity/abstraction (intermediate and high level features);
Become effective via unsupervised pre-training + supervised fine tuning;
◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than
shallow networks.
Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);
Semi-supervised: structure of manifold assumption;
◦ labeled data is scarce and unlabeled data is abundant.
72. Why Deep Learning?
Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization
problem);
◦ Learn prior from unlabeled data;
Shallow models are not for learning high-level abstractions;
◦ Ensembles or forests do not learn features first;
◦ Graphical models could be deep net, but mostly not.
Unsupervised learning could be “local-learning”;
◦ Resemble boosting with each layer being like a weak learner
Learning is weak in directed graphical models with many hidden variables;
◦ Sparsity and regularizer.
Traditional unsupervised learning methods do not easily learn multiple levels of
representation.
◦ Layer-wise unsupervised learning is the solution.
Multi-task learning (transfer learning and self taught learning);
Other issues: scalability & parallelism with the burden from big data.
73. Multi Layer Neural Network
A neural network = running several logistic regressions at the same time;
◦ Neuron=logistic regression or…
Calculate error derivatives (gradients) to refine: back propagate the error derivative through model
(the chain rule)
◦ Online learning: stochastic/incremental gradient descent
◦ Batch learning: conjugate gradient descent
74. Problems in MLPs
Multi Layer Perceptrons (MLPs), a type of feed-forward neural network, were widely used for decades.
The gradient becomes progressively more diluted
◦ Below the top few layers, the correction signal is minimal
Training gets stuck in local minima
◦ Especially when starting far from ‘good’ regions (i.e., random initialization)
In usual settings, use only labeled data
◦ Almost all data is unlabeled!
◦ Instead the human brain can learn from unlabeled data.
75. Convolutional Neural Networks
CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized
neural input;
◦ local receptive fields(shifted window), shared weights (weight averaging) across the hidden units, and often, spatial
or temporal sub-sampling;
◦ Related to generative MRF/discriminative CRF:
◦ CNN=Field of Experts MRF=ML inference in CRF;
◦ Generate ‘patterns of patterns’ for pattern recognition.
Each layer combines (merge, smooth) patches from previous layers
◦ Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.
◦ Convolution filters: (translation invariance) unsupervised;
◦ Local contrast normalization: increase sparsity, improve optimization/invariance.
C layers: convolutions; S layers: pooling/sub-sampling.
76. Convolutional Neural Networks
Convolutional Networks are trainable multistage architectures composed of multiple stages;
Input and output of each stage are sets of arrays called feature maps;
At output, each feature map represents a particular feature extracted at all locations on input;
Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;
A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
◦ A fully connected layer: softmax transfer function for posterior distribution.
Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;
Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;
◦ In the rectified function, gi is a trainable gain parameter; it might be followed by a contrast normalization N;
Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;
Supervised training is performed using a form of SGD to minimize the prediction error;
◦ Gradients are computed with the back-propagation method.
Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.
* is discrete convolution operator
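The three-layer stage described above (filter bank, pointwise non-linearity, feature pooling) can be sketched in plain Python. This is a simplified illustration: the kernels are fixed by hand rather than trained, tanh is used as the non-linearity, pooling is 2x2 max pooling, and all names are assumptions made for this example.

```python
import math


def conv2d_valid(image, kernel):
    """Discrete 2-D convolution (kernel flipped), 'valid' region only."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            s = 0.0
            for u in range(kh):
                for v in range(kw):
                    s += image[i + u][j + v] * kernel[kh - 1 - u][kw - 1 - v]
            out[i][j] = s
    return out


def max_pool(fmap, size=2):
    """Non-overlapping max pooling: reduces resolution by `size`."""
    oh, ow = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[i * size + u][j * size + v]
                 for u in range(size) for v in range(size))
             for j in range(ow)]
            for i in range(oh)]


def convnet_stage(image, filter_bank):
    """Filter bank -> tanh non-linearity -> pooling, one map per filter."""
    maps = []
    for kernel in filter_bank:
        filtered = conv2d_valid(image, kernel)
        squashed = [[math.tanh(v) for v in row] for row in filtered]
        maps.append(max_pool(squashed))
    return maps


# 5x5 image with a vertical edge between columns 1 and 2
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
filter_bank = [
    [[1.0, -1.0], [1.0, -1.0]],      # horizontal-difference (edge) filter
    [[0.25, 0.25], [0.25, 0.25]],    # local averaging filter
]
feature_maps = convnet_stage(image, filter_bank)
print(feature_maps[0])
```

Each output feature map responds to one feature at all input locations; the edge filter's map is large only where the pooling window covers the edge.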
77. Belief Nets
A belief net is a directed acyclic graph composed of stochastic variables.
We can observe some of the variables and solve two problems:
◦ inference: Infer the states of the unobserved variables.
◦ learning: Adjust the interactions between variables to make the network more likely to generate the observed data.
Use nets composed of layers of stochastic variables with weighted connections.
[Figure: stochastic hidden causes connected to visible effects]
78. Boltzmann Machines
An energy-based model associates an energy with each configuration of the stochastic variables of interest (for
example, MRF, Nearest Neighbor);
◦ Learning means adjusting the shape of the energy function so that desired configurations have low energy;
A Boltzmann machine is a stochastic recurrent model with hidden variables;
◦ Markov chain Monte Carlo, i.e. MCMC sampling (appendix);
Restricted Boltzmann machine is a special case:
◦ Only one layer of hidden units;
◦ factorization of each layer’s neurons/units (no connections in the same layer);
Contrastive divergence: approximation of gradient (appendix).
[Equations: probability, energy function, learning rule]
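For reference, the commonly used textbook forms of these three quantities for a binary restricted Boltzmann machine are sketched below. The symbols (visible units v, hidden units h, biases a and b, weights w, learning rate epsilon) are assumptions of this sketch and may differ from the notation on the original slide.

```latex
% Energy of a joint configuration (v, h)
E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

% Probability of a configuration, with partition function Z
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})},
\qquad Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}

% Maximum-likelihood learning rule for the weights
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)
```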
79. Deep Belief Networks
A hybrid model: can be trained as generative or
discriminative model;
Deep architecture: multiple layers (learn features
layer by layer);
◦ Multi layer learning is difficult in sigmoid belief
networks.
◦ The top two layers have undirected connections and form an RBM;
◦ Lower layers get top-down directed connections
from the layers above;
Unsupervised or self-taught pre-learning provides
a good initialization;
◦ Greedy layer-wise unsupervised training, one RBM
at a time
Supervised fine-tuning
◦ Generative: wake-sleep algorithm (Up-down)
◦ Discriminative: back propagation (bottom-up)
80. Deep Boltzmann Machine
Learning internal representations that become increasingly complex;
High-level representations built from a large supply of unlabeled inputs;
Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann
machine (undirected graph);
Generative fine-tuning: different from the DBN
◦ Positive and negative phase (appendix)
Discriminative fine-tuning: the same as for the DBN
◦ Back propagation.
81. Denoising Auto-Encoder
Multilayer NNs with target output=input;
Reconstruction=decoder(encoder(input));
◦ Perturbs the input x to a corrupted version;
◦ Randomly sets some of the coordinates of input to zeros.
◦ Recover x from encoded perturbed data.
Learns a vector field towards higher probability regions;
Pre-trained with DBN or regularizer with perturbed training data;
Minimizes variational lower bound on a generative model;
◦ corresponds to regularized score matching on an RBM;
PCA=linear manifold=linear Auto Encoder;
Auto-encoder learns the salient variation like a nonlinear PCA.
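A minimal sketch of the denoising idea follows: corrupt the input with masking noise, encode and decode it, and train against the clean input. The network size, learning rate, drop probability, untied weights, and cross-entropy reconstruction loss are assumptions chosen for this toy example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, drop_prob, rng):
    """Masking noise: set each coordinate to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


class DenoisingAutoEncoder:
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        u = self.rng.uniform
        self.W1 = [[u(-0.1, 0.1) for _ in range(n_visible)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[u(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.b2 = [0.0] * n_visible

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [sigmoid(sum(w * hj for w, hj in zip(row, h)) + b)
                for row, b in zip(self.W2, self.b2)]

    def reconstruct(self, x):
        return self.decode(self.encode(x))

    def train_step(self, x, lr=0.5, drop_prob=0.25):
        x_tilde = corrupt(x, drop_prob, self.rng)  # perturbed input
        h = self.encode(x_tilde)
        y = self.decode(h)
        # Target is the CLEAN input x; with cross-entropy loss the output
        # delta is simply y - x.
        d_out = [yi - xi for yi, xi in zip(y, x)]
        d_hid = [hj * (1.0 - hj) * sum(self.W2[i][j] * d_out[i] for i in range(len(x)))
                 for j, hj in enumerate(h)]
        for i in range(len(x)):
            for j in range(len(h)):
                self.W2[i][j] -= lr * d_out[i] * h[j]
            self.b2[i] -= lr * d_out[i]
        for j in range(len(h)):
            for i in range(len(x)):
                self.W1[j][i] -= lr * d_hid[j] * x_tilde[i]
            self.b1[j] -= lr * d_hid[j]


def squared_error(model, data):
    return sum(sum((yi - xi) ** 2 for yi, xi in zip(model.reconstruct(x), x))
               for x in data) / len(data)


patterns = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
dae = DenoisingAutoEncoder(n_visible=6, n_hidden=2, seed=0)
error_before = squared_error(dae, patterns)
for _ in range(300):
    for x in patterns:
        dae.train_step(x)
error_after = squared_error(dae, patterns)
print(error_before, error_after)
```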
82. Stacked Denoising Auto-Encoder
Stack many (possibly sparse) auto-encoders in succession and train them using greedy layer-wise
unsupervised learning
◦ Drop the decode layer each time
◦ Performs better than stacking RBMs;
Supervised training on the last layer using final features;
(Optional) Supervised training on the entire network to fine-tune all weights of the neural net;
Empirically not quite as accurate as DBNs.
83. Belief Propagation (BP)
BP propagates information throughout a graphical model (e.g., a simplified Bayes net) via a series
of messages passed iteratively between neighboring nodes; it is likely to converge to a consensus that
determines the marginal probabilities of all the variables;
messages estimate the cost (or energy) of a configuration of a clique given all other cliques;
then the messages are combined to compute a belief (marginal or maximum probability);
Two types of BP methods:
◦ max-product;
◦ sum-product.
BP provides an exact solution when the graph has no loops!
It is equivalent to dynamic programming/Viterbi in these cases;
Loopy Belief Propagation: still provides an approximate (but often good) solution;
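To illustrate the loop-free case, here is a small sum-product sketch on a chain, compared against brute-force enumeration. The potentials are made-up numbers and the function names are assumptions of this example.

```python
from itertools import product


def chain_marginals_bp(unary, pairwise):
    """Sum-product BP on a chain x_0 - x_1 - ... - x_{n-1}."""
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # message reaching node i from the left
    bwd = [[1.0] * k for _ in range(n)]  # message reaching node i from the right
    for i in range(1, n):
        for xj in range(k):
            fwd[i][xj] = sum(fwd[i - 1][xi] * unary[i - 1][xi] * pairwise[xi][xj]
                             for xi in range(k))
    for i in range(n - 2, -1, -1):
        for xi in range(k):
            bwd[i][xi] = sum(bwd[i + 1][xj] * unary[i + 1][xj] * pairwise[xi][xj]
                             for xj in range(k))
    marginals = []
    for i in range(n):
        # Belief = local evidence times all incoming messages, then normalize
        belief = [fwd[i][x] * unary[i][x] * bwd[i][x] for x in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals


def chain_marginals_brute_force(unary, pairwise):
    """Exact marginals by summing over every joint configuration."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for config in product(range(k), repeat=n):
        p = 1.0
        for i, x in enumerate(config):
            p *= unary[i][x]
        for i in range(n - 1):
            p *= pairwise[config[i]][config[i + 1]]
        for i, x in enumerate(config):
            totals[i][x] += p
    return [[t / sum(row) for t in row] for row in totals]


unary = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]   # local evidence per node
pairwise = [[0.9, 0.1], [0.1, 0.9]]            # neighbors prefer equal states
bp = chain_marginals_bp(unary, pairwise)
exact = chain_marginals_brute_force(unary, pairwise)
print(bp)
```

Replacing the sums in the message updates with max gives the max-product variant.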
84. Generalized BP for pairwise MRFs
◦ Hidden variables xi and xj are connected through a compatibility function;
◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function;
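The joint probability of {x} is given by
p({x}) = (1/Z) ∏_(i,j) ψ_ij(x_i, x_j) ∏_i φ_i(x_i, y_i),
where ψ_ij is the compatibility function, φ_i is the local evidence function, and Z is a normalization constant;
Inference can be improved by taking into account higher-order interactions among the
variables;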
◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes;
◦ This is the intuition in Generalized Belief Propagation (GBP).
85. Stochastic Gradient Descent (SGD)
• The general class of estimators that arise as minimizers of sums are called M-
estimators;
• In maximum-likelihood estimation, one looks for stationary points of the likelihood function (or zeroes of
its derivative, the score function);
• Online gradient descent samples a subset of summand functions at every step;
• The true gradient is approximated by a gradient at a single example;
• Shuffling of training set at each pass.
• There is a compromise between two forms, often called "mini-batches", where the
true gradient is approximated by a sum over a small number of training examples.
• SGD converges almost surely to a global minimum when the objective function
is convex or pseudo-convex (given suitably decreasing learning rates), and otherwise
converges almost surely to a local minimum.
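The mini-batch compromise can be sketched on a small least-squares problem. The data (a noise-free line y = 2x + 1), batch size, and learning rate are assumptions picked so the example converges quickly.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle the training set at each pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 0.5*(w*x + b - y)^2 averaged over the mini-batch
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [-1.0 + 2.0 * i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w_hat, b_hat = minibatch_sgd(xs, ys)
print(w_hat, b_hat)
```

Setting `batch_size=1` recovers the single-example online form, and `batch_size=len(xs)` recovers full-batch gradient descent.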
87. Variable Learning Rate
Too large a learning rate
◦ causes oscillation in searching for the minimal point
Too small a learning rate
◦ causes slow convergence to the minimal point
Adaptive learning rate
◦ At the beginning, the learning rate can be large when the current point is far from the
optimal point;
◦ Gradually, the learning rate will decay as time goes by.
Should not be too large or too small:
◦ annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)
◦ 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a constant.
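The annealing schedule above is a one-liner; the sketch below evaluates it at a few steps. The values of 𝛼(0) and T are arbitrary assumptions for illustration.

```python
def annealed_learning_rate(alpha0, t, T):
    """Learning rate alpha(t) = alpha(0) / (1 + t/T)."""
    return alpha0 / (1.0 + t / T)


# Nearly constant while t << T, halved at t = T, then decaying towards zero
steps = (0, 10, 1000, 100000)
schedule = [annealed_learning_rate(0.1, t, T=1000) for t in steps]
for t, alpha in zip(steps, schedule):
    print(t, alpha)
```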
90. Dropout and Maxout for Overfitting
Dropout: set the output of each hidden neuron to zero with probability 0.5.
◦ Motivation: Combining many different models that share parameters succeeds in reducing test
errors by approximately averaging together the predictions, which resembles bagging.
◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not
participate in back propagation.
◦ So every time an input is presented, the NN samples a different architecture, but all these
architectures share weights.
◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence
of particular other units.
◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many
different random subsets of the other units.
◦ Without dropout, the network exhibits substantial overfitting.
◦ Dropout roughly doubles the number of iterations required to converge.
Maxout takes the maximum across multiple feature maps;
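A minimal sketch of both operations on plain lists follows. The test-time scaling by the keep probability comes from the original dropout formulation rather than from this slide, and the function names are assumptions.

```python
import random


def dropout_forward(activations, drop_prob=0.5, training=True, rng=random):
    """Zero each unit with probability drop_prob during training.

    At test time all units are kept and outputs are scaled by the keep
    probability, which approximates averaging the sampled architectures
    (an assumption taken from the original dropout formulation).
    """
    if training:
        mask = [0.0 if rng.random() < drop_prob else 1.0 for _ in activations]
    else:
        mask = [1.0 - drop_prob] * len(activations)
    return [a * m for a, m in zip(activations, mask)], mask


def dropout_backward(grad_output, mask):
    """Dropped units receive no gradient in back propagation."""
    return [g * m for g, m in zip(grad_output, mask)]


def maxout(feature_maps):
    """Element-wise maximum across several feature maps."""
    return [max(values) for values in zip(*feature_maps)]


train_out, train_mask = dropout_forward([1.0, 2.0, 3.0, 4.0], rng=random.Random(0))
test_out, _ = dropout_forward([1.0, 2.0], training=False)
pooled = maxout([[1.0, 5.0, 2.0], [3.0, 4.0, 0.0]])
print(train_out, test_out, pooled)
```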
91. Weight Decay for Overfitting
Weight decay, or L2 regularization, adds a penalty term to the error function, called the
regularization term (the negative log prior in the Bayesian justification);
◦ Weight decay works as a rescaling of the weights in the learning rule, while bias learning stays the same;
◦ Small weights are preferred, and large weights are allowed only if they improve the original cost function;
◦ A way of compromising between finding small weights and minimizing the original cost function;
In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
L1 regularization: weights that are not really useful shrink by a constant amount toward zero;
◦ Acts like a form of feature selection;
◦ Makes the input filters cleaner and easier to interpret;
L2 regularization penalizes large values strongly, while L1 regularization shrinks all nonzero weights at the same constant rate;
Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distr. is the
posterior distribution for weights & hyper-parameters;
Hybrid Monte Carlo: gradient and sampling.
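The difference between the L2 and L1 penalties shows up directly in the weight update. The sketch below assumes plain gradient descent on E + (λ/2)·Σw² for L2 and E + λ·Σ|w| for L1; the function names and numbers are illustrative.

```python
def sign(w):
    return (w > 0) - (w < 0)


def l2_step(weights, grads, lr, lam):
    """w <- w - lr*(dE/dw + lam*w) = (1 - lr*lam)*w - lr*dE/dw."""
    return [(1.0 - lr * lam) * w - lr * g for w, g in zip(weights, grads)]


def l1_step(weights, grads, lr, lam):
    """w <- w - lr*dE/dw - lr*lam*sign(w): constant shrinkage toward zero."""
    return [w - lr * g - lr * lam * sign(w) for w, g in zip(weights, grads)]


weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]  # isolate the effect of the penalty term
after_l2 = l2_step(weights, zero_grads, lr=0.1, lam=0.5)
after_l1 = l1_step(weights, zero_grads, lr=0.1, lam=0.5)
print(after_l2)  # each weight rescaled by the same factor
print(after_l1)  # each weight moved by the same amount
```

With zero data gradient, L2 shrinks the larger weight by twice as much as the smaller one, while L1 moves both by the same 0.05.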
92. Early Stopping for Overfitting
Steps in early stopping:
◦ Divide the available data into training and validation sets.
◦ Use a large number of hidden units.
◦ Use very small random initial values.
◦ Use a slow learning rate.
◦ Compute the validation error rate periodically during training.
◦ Stop training when the validation error rate "starts to go up".
Early stopping has several advantages:
◦ It is fast.
◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size.
◦ It requires only one major decision by the user: what proportion of validation cases to use.
Practical issues in early stopping:
◦ How many cases do you assign to the training and validation sets?
◦ Do you split the data into training and validation sets randomly or by some systematic algorithm?
◦ How do you tell when the validation error rate "starts to go up"?
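One common answer to the last question is a "patience" rule: stop once the validation error has not improved for a fixed number of checks. The sketch below assumes that rule; the callbacks and the synthetic validation curve are hypothetical stand-ins for a real training loop.

```python
def early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=5):
    """Train until the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # validation error "started to go up" and stayed up
    return best_epoch, best_error


# Synthetic stand-in for a model: validation error is lowest after 20 steps
state = {"step": 0}


def fake_train_one_epoch():
    state["step"] += 1


def fake_validation_error():
    return (state["step"] - 20) ** 2 / 100.0 + 1.0


best_epoch, best_error = early_stopping(fake_train_one_epoch, fake_validation_error)
print(best_epoch, best_error, state["step"])
```

In practice the weights at `best_epoch` would be saved and restored after stopping.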
93. MCMC Sampling for Optimization
Markov Chain: a stochastic process in which future states are independent of past states given the
present state.
◦ A Markov chain will typically converge to a stable distribution.
Markov Chain Monte Carlo (MCMC): sampling using ‘local’ information
◦ Devise a Markov chain whose stationary distribution is the target.
◦ Ergodic MC must be aperiodic, irreducible, and positive recurrent.
◦ Monte Carlo Integration to get quantities of interest.
Metropolis-Hastings method: sampling from a target distribution
◦ Create a Markov chain whose transition matrix does not depend on the normalization term.
◦ Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio).
◦ After a sufficient number of iterations, the chain will converge to the stationary distribution.
Gibbs sampling is a special case of M-H Sampling.
◦ The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distribution.
Hybrid Monte Carlo: gradient sub step for each Markov chain.
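A minimal random-walk Metropolis-Hastings sketch follows. It only needs the target density up to its normalization term; the standard-normal target, the Gaussian proposal, and all parameter values are assumptions for illustration.

```python
import math
import random


def metropolis_hastings(log_target, n_samples, step=1.0, x0=0.0, burn_in=1000, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for it in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step)
        # Accept ratio uses only a difference of log densities, so the
        # normalization constant of the target cancels out.
        log_ratio = log_target(proposal) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        if it >= burn_in:
            samples.append(x)
    return samples


def log_unnormalized_gaussian(x):
    return -0.5 * x * x  # standard normal without the 1/sqrt(2*pi) factor


samples = metropolis_hastings(log_unnormalized_gaussian, n_samples=20000)
# Monte Carlo integration of the quantities of interest
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)
```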
94. Mean Field for Optimization
Variational approximation modifies the optimization problem to be tractable, at the price of
an approximate solution;
Mean Field replaces M with a (simple) subset M(F), on which A* (μ) is a closed form (Note: F is
disconnected graph);
◦ Density becomes factorized product distribution in this sub-family.
◦ Objective: K-L divergence.
Mean field is a structured variational approximation approach:
◦ Coordinate ascent (deterministic);
Compared with stochastic approximation (sampling):
◦ Faster, but maybe not exact.
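As a concrete instance, naive mean field for a small Ising-type model can be written as deterministic coordinate ascent on the unit means. The ±1 spin parametrization, with p(s) ∝ exp(Σ h_i s_i + Σ_{i<j} J_ij s_i s_j), and the example numbers are assumptions of this sketch.

```python
import math


def mean_field_ising(h, J, sweeps=100):
    """Naive mean field: fully factorized q(s) = prod_i q_i(s_i), s_i in {-1, +1}.

    Each coordinate update sets mu_i = tanh(h_i + sum_j J_ij * mu_j),
    the fixed-point condition for minimizing KL(q || p).
    """
    n = len(h)
    mu = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            field = h[i] + sum(J[i][j] * mu[j] for j in range(n) if j != i)
            mu[i] = math.tanh(field)
    return mu


# Without couplings the factorized family contains the true distribution,
# so the mean-field means are exact: mu_i = tanh(h_i).
h = [0.5, -1.0, 0.0]
no_coupling = [[0.0] * 3 for _ in range(3)]
mu_independent = mean_field_ising(h, no_coupling)

# With couplings the result is only an approximation of the true means.
J = [[0.0, 0.3, 0.0], [0.3, 0.0, 0.3], [0.0, 0.3, 0.0]]
mu_coupled = mean_field_ising(h, J)
print(mu_independent, mu_coupled)
```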
95. Contrastive Divergence for RBMs
Contrastive divergence (CD) was first proposed for training products of experts (PoE), and is also a
quicker way to learn RBMs;
◦ Contrastive divergence as the new objective;
◦ Taking gradients and ignoring a term which is usually very small.
Steps:
◦ Start with a training vector on the visible units.
◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs
sampling);
CD learning is biased: it does not behave as exact gradient descent on the likelihood
Improved: Persistent CD explores more modes in the distribution
◦ Rather than from data samples, begin sampling from the model samples obtained at the last gradient
update.
◦ Still suffers from divergence of the likelihood due to missing modes.
Score matching: the score function does not depend on the normalization factor, so match it between the
model and the empirical density.
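The CD steps above, stopped after one alternation (CD-1), can be sketched for a small binary RBM. The layer sizes, learning rate, weight initialization, and toy patterns are assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class RBM:
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b))]

    def visible_probs(self, h):
        return [sigmoid(self.a[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.a))]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        # Start with a training vector on the visible units
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)               # update all hidden units in parallel
        v1 = self.sample(self.visible_probs(h0))  # then all visible units
        ph1 = self.hidden_probs(v1)
        # Data statistics minus one-step reconstruction statistics
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.b[j] += lr * (ph0[j] - ph1[j])

    def reconstruction_error(self, data):
        total = 0.0
        for v in data:
            recon = self.visible_probs(self.hidden_probs(v))
            total += sum((r - x) ** 2 for r, x in zip(recon, v))
        return total / len(data)


patterns = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
rbm = RBM(n_visible=6, n_hidden=3, seed=0)
error_before = rbm.reconstruction_error(patterns)
for _ in range(1000):
    for v in patterns:
        rbm.cd1_update(v)
error_after = rbm.reconstruction_error(patterns)
print(error_before, error_after)
```

Reconstruction error is only a convenient monitor here; it is not the quantity that CD optimizes.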
96. “Wake-Sleep” Algorithm for DBN
Pre-trained DBN is a generative model;
Do a stochastic bottom-up pass (wake phase)
◦ Get samples from factorial distribution (visible first, then generate hidden);
◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
Do a few iterations of sampling in the top level RBM
◦ Adjust the weights in the top-level RBM.
Do a stochastic top-down pass (sleep phase)
◦ Get visible and hidden samples generated by generative model using data coming from nowhere!
◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
◦ Any guarantee for improvement? No!
The “Wake-Sleep” algorithm tries to make the representation economical to describe (Shannon’s coding
theory).
97. Greedy Layer-Wise Training
Deep networks tend to have more local minima problems than shallow networks during
supervised training
Train first layer using unlabeled data
◦ Supervised or semi-supervised: use more unlabeled data.
Freeze the first layer parameters and train the second layer
Repeat this for as many layers as desired
◦ Build more robust features
Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)
Fine tune the full network with a supervised approach;
This avoids the problems of training a deep net in a purely supervised fashion.
◦ Each layer gets full learning
◦ Help with ineffective early layer learning
◦ Help with deep network local minima
98. Why Does Greedy Layer-Wise Training Work?
Take advantage of the unlabeled data;
Regularization Hypothesis
◦ Pre-training is “constraining” parameters to a region relevant to the unsupervised
dataset;
◦ Better generalization (representations that better describe unlabeled data are
more discriminative for labeled data) ;
Optimization Hypothesis
◦ Unsupervised training initializes lower level parameters near localities of better
minima than random initialization can.
Only need fine tuning in the supervised learning stage.
99. Two-Stage Pre-training in DBMs
Pre-training in one stage
◦ Positive phase: clamp observed, sample hidden, using variational approximation (mean-field)
◦ Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation:
MCMC)
Pre-training in two stages
◦ Approximate a posterior distribution over the states of hidden units (using a simpler directed deep model such as
DBNs or stacked DAE);
◦ Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding
posterior of hidden units.
◦ Options (CAST, contrastive divergence, stochastic approximation…).