Visual Detection, Recognition and Tracking with Deep Learning
Yu Huang
Sunnyvale, California
Outline
• Deep learning
  • Sparse coding
• Deep models
  • CNN/NIN/RNN;
  • DBN/DBM;
  • Stacked DAE;
• Optimization/Learning methods
  • SGD & BP;
  • AdaGrad/AdaDelta
  • Dropout/Maxout
  • Data Augmentation
  • MCMC/Mean Field/Contrastive Divergence
  • Wake-Sleep/Greedy layer-wise pre-training
• Model Compression
  • Dark knowledge/Distilling the knowledge.
• Visual recognition
  • Sparse coding
  • Hierarchical feature learning
  • LeNet/AlexNet/Mattnet (ZFNet);
  • VGG Net/GoogLeNet
  • PReLU/Batch normalization;
  • Rethink the Inception;
  • Deep Residual Learning;
• Generic object detection
  • Deep multi-box/OverFeat;
  • R-CNN/Fast R-CNN/SPP Net/Faster R-CNN;
  • DeepID-Net;
  • YOLO/YOLO9000;
  • DeepBox;
  • R-FCN;
  • LocNet;
  • MS-CNN;
• Pedestrian detection
• Pose estimation
• Face detection, landmark detection and face recognition
• Text detection and recognition
• Scene parsing/Semantic segmentation
  • Multiscale Feature Learning;
  • Simultaneous Detection and Segmentation;
  • FCN/DeepLab/ParseNet/SegNet/Mask R-CNN.
• Visual object tracking
• Appendix A: SoC implementation of CNN
Deep Learning
Representation learning attempts to automatically learn good features or representations;
Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (features);
Became effective via unsupervised pre-training + supervised fine-tuning; before that, deep networks trained with back-propagation alone (without unsupervised pre-training) were reported to perform worse than shallow networks.
Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);
Semi-supervised: structure of manifold assumption; labeled data is scarce and unlabeled data is abundant.
Learning Feature Hierarchy with DL
• Deep architectures can be more efficient in feature representation;
• Natural derivation/abstraction from low level structures to high level structures;
• Share the lower-level representations for multiple tasks (such as detection, recognition, segmentation).
Sparse Coding
Sparse coding (Olshausen & Field, 1996).
Originally developed to explain early visual processing in the brain
(edge detection).
Objective: Given a set of input data vectors learn a
dictionary of bases such that:
Each data vector is represented as a sparse linear combination of
bases.
Sparse: mostly zeros
Methods of Solving Sparse Coding
Greedy methods: projecting the residual on some atom;
Matching pursuit, orthogonal matching pursuit;
L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);
The residual is updated iteratively in the direction of the atom;
Gradient-based finding new search directions
Projected Gradient Descent
Coordinate Descent
Homotopy: a set of solutions indexed by a parameter (regularization)
LARS (Least Angle Regression)
First order/proximal methods: Generalized gradient descent
solving efficiently the proximal operator
soft-thresholding for L1-norm
Accelerated by the Nesterov optimal first-order method
Iterative reweighting schemes
L2-norm: Chartrand and Yin (2008)
L1-norm: Candès et al. (2008)
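The first-order/proximal route listed above can be sketched in a few lines. This is a minimal illustration of iterative soft-thresholding (ISTA) for the L1-penalized objective 0.5*||x - D a||^2 + lam*||a||_1, not the formulation of any particular paper cited here. The dictionary `D`, the signal `x`, the penalty `lam` and the step size are hypothetical toy values; the step is assumed to be at most 1/L, where L is the largest eigenvalue of D^T D.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, x, lam, step, n_iter=200):
    """Proximal gradient descent for 0.5*||x - D a||^2 + lam*||a||_1.

    D is a list of rows (len(x) rows, K columns); returns the code a.
    """
    n_rows, K = len(D), len(D[0])
    a = [0.0] * K
    for _ in range(n_iter):
        # residual r = D a - x
        r = [sum(D[i][k] * a[k] for k in range(K)) - x[i] for i in range(n_rows)]
        # gradient of the smooth term: D^T r
        g = [sum(D[i][k] * r[i] for i in range(n_rows)) for k in range(K)]
        # gradient step followed by soft-thresholding
        a = [soft_threshold(a[k] - step * g[k], step * lam) for k in range(K)]
    return a


# Toy dictionary with three atoms in 2-D (hypothetical values).
D = [[1.0, 0.0, 0.7071],
     [0.0, 1.0, 0.7071]]
x = [1.0, 0.0]
code = ista(D, x, lam=0.1, step=0.5)
```

With these toy values only the first atom should stay active, which is the "mostly zeros" behaviour the slide describes.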
Sparse Coding for Unsupervised Pre-training
• SC learns the optimal dictionary that can be used to reconstruct a set of training samples under
sparsity constraints on the feature vector;
• Predictive Sparse Decomposition(PSD):
• Train a simple regressor (or encoder) to approximate the sparse solution for all data in the training set;
• Modify by adding a penalty for prediction error: Approximate the sparse coding with an encoder;
• PSD for hierarchical feature training
• Phase 1: train the first layer;
• Phase 2: use encoder + absolute value as 1st feature extractor
• Phase 3: train the second layer;
• Phase 4: use encoder + absolute value as 2nd feature extractor
• Phase 5: train a supervised classifier on top layer;
• Phase 6: optionally train the whole network with supervised BP.
Convolutional Neural Networks
CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input;
Local receptive fields (shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling;
Related to generative MRF/discriminative CRF: CNN=Field of Experts MRF=ML inference in CRF;
Generate ‘patterns of patterns’ for pattern recognition.
Each layer combines (merge, smooth) patches from previous layers;
Pooling/Sampling (e.g., max or average) filter: compress and smooth the data.
Convolution filters: (translation invariance) unsupervised;
Local contrast normalization: increase sparsity, improve optimization/invariance.
C layers convolutions,
S layers pool/sample
Convolutional Neural Networks
Convolutional Networks are trainable multistage architectures composed of multiple stages;
Input and output of each stage are sets of arrays called feature maps;
At output, each feature map represents a particular feature extracted at all locations on input;
Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;
A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
A fully connected layer: softmax transfer function for posterior distribution.
Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;
Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;
In the rectified function, gi is a trainable gain parameter; it might be followed by a contrast normalization N;
Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;
Supervised training is performed using a form of SGD to minimize the prediction error;
Gradients are computed with the back-propagation method.
Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.
* is discrete convolution operator
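One stage as described above (filter bank, pointwise non-linearity, feature pooling) can be sketched as follows. This is a single-filter, single-channel toy in plain Python; the 5x5 image, the 1x2 difference kernel and the fixed `gain` are hypothetical, and contrast normalization is left out.

```python
import math


def conv2d_valid(image, kernel):
    """Discrete 2-D convolution (kernel flipped), keeping the 'valid' region only."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[kh - 1 - u][kw - 1 - v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]


def rectified_tanh(fmap, gain=1.0):
    """Pointwise abs(gain * tanh(.)) non-linearity."""
    return [[abs(gain * math.tanh(v)) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping size x size max pooling; gives a reduced-resolution map."""
    return [[max(fmap[i + u][j + v] for u in range(size) for v in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


def stage(image, kernel, gain=1.0):
    """Filter bank (one filter here) -> non-linearity -> feature pooling."""
    return max_pool(rectified_tanh(conv2d_valid(image, kernel), gain))


image = [[float((r * 5 + c) % 7) for c in range(5)] for r in range(5)]
kernel = [[1.0, -1.0]]  # horizontal difference filter (hypothetical)
features = stage(image, kernel)
```

A real ConvNet stage would hold many trainable kernels per input map and sum over input maps; the sketch only shows the order of the three operations.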
Deep Network-In-Network
Enhance model discriminability for local patches within the receptive
field;
Build micro neural networks with more complex structures to abstract
data within receptive field;
Design micro neural network with a multilayer perceptron;
Deep NIN can be implemented by stacking multiple layers of micro neural networks slid over the input like a CNN to get the feature maps;
Apply global average pooling over feature maps in the classification layer;
Global average pooling is easier to interpret and less prone to overfitting than fully connected layers;
The overall structure of Network In Network
Comparison of linear convolution layer and mlpconv layer
Tiled CNN
Use a regular “tiled” pattern of tied weights, with no adjacent hidden units sharing identical weights;
Only hidden units k steps away from each other are required to have tied weights;
Learn complex invariances (scale and rotation) by pooling over neighboring units;
A relatively small number of learned parameters (sparse), ease of learning and greater scalability;
Unsupervised pre-training: Topographic ICA as an efficient learning method;
TICA is a two-layered network with sqr and sqrt nonlinearities in the 1st and 2nd
layers respectively;
It can learn invariances even when trained only on unlabeled data;
Avoid approximate orthogonalization by using local receptive fields;
Trained by batch projected gradient descent.
Figures: CNN with local receptive fields and tied weights; partially untied local receptive field networks; TICA network architecture.
Figure labels: local orthogonalization; localize neurons that have identical receptive fields; projection step (locality, tied-weight and orthogonality constraints); optimization for sparsity at pooling units.
Multi-column Deep Neural Networks
MDNN: inspired by “Neocognitron”, small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth;
hundreds of maps per layer, many (6-10) layers of non-linear neurons stacked on top of each other, similar to the layers found between retina and visual cortex of macaque monkeys;
2-d layers of winner-take-all neurons with overlapping receptive fields whose weights are shared;
A simple max pooling technique decides the winner neurons by partitioning layers into quadratic regions of local inhibition, selecting the most active neuron of each region;
A smaller, down-sampled layer with lower resolution feeding the next layer in the hierarchy;
The top part of the hierarchy becomes a standard multi-layer perceptron (MLP), while receptive fields and winner-take-all regions of the DNN are (near-)minimal, e.g., only 2x2 or 3x3 neurons;
Only winner neurons are trained, and the training algorithm is fully online;
Several deep neural columns become experts on inputs or input preprocessed in different ways, and then those predictions are averaged;
(a) DNN architecture. (b) MCDNN architecture. The input image can be preprocessed by P0–Pn-1 blocks. An arbitrary number of columns can be trained on inputs preprocessed in different ways. The final predictions are obtained by averaging individual predictions of each DNN. (c) Training a DNN. The dataset is preprocessed before training, then, at the beginning of every epoch, the images are distorted (D block).
RNN: Recurrent Neural Network
A nonlinear dynamical system that maps sequences to sequences;
Parameterized with three weight matrices and three bias vectors;
RNNs are fundamentally difficult to train due to their nonlinear iterative nature;
The derivative of the loss function can be exponentially large with respect to the hidden activations;
RNN suffers also from the vanishing gradient problem.
Back Propagation Through Time (BPTT):
“Unfold” the recurrent network in time, by stacking identical copies of the RNN, and redirecting
connections within the network to obtain connections between subsequent copies;
It is hard to use where online adaptation is required, as the entire time series must be used.
Real-Time Recurrent Learning (RTRL) is a forward-pass only algorithm that computes the
derivatives of the RNN w.r.t. its parameters at each timestep;
Unlike BPTT, RTRL maintains the exact derivative of the loss so far at each timestep of the forward
pass, without a backward pass and the need to store the past hidden states;
However, the computational cost of RTRL is prohibitive, and it also requires more memory than BPTT.
Success in Application: Speech Recognition and Handwriting recognition.
LSTM: Long Short-Term Memory
An RNN structure that elegantly addresses the vanishing gradients problem using “memory units”;
These linear units have a self-connection of strength 1 and a pair of auxiliary “gating units” that control
the flow of information to and from the unit;
Let N be the number of memory units of the LSTM. At each timestep t, the LSTM maintains a set of
vectors as below, whose evolution is governed by the following equations:
Since the forward pass of the LSTM is relatively intricate, the equations for the correct derivatives of the
LSTM are highly complex, making them tedious to implement;
Note: Theano has LSTM module.
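One forward timestep of an LSTM can be sketched as below. This assumes the commonly used formulation with input, forget and output gates, which is not necessarily the exact set of equations on the original slide; holding the forget gate at 1 recovers the "self-connection of strength 1" described above. The parameter names, sizes and random initialization are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W x + U h + b for list-of-rows matrices W (N x D) and U (N x N)."""
    return [sum(w * xi for w, xi in zip(W[n], x)) +
            sum(u * hj for u, hj in zip(U[n], h)) + b[n]
            for n in range(len(b))]


def lstm_step(params, x, h, c):
    """One timestep: gates modulate what enters, stays in, and leaves the memory c."""
    i = [sigmoid(v) for v in affine(*params["input_gate"], x, h)]
    f = [sigmoid(v) for v in affine(*params["forget_gate"], x, h)]
    o = [sigmoid(v) for v in affine(*params["output_gate"], x, h)]
    g = [math.tanh(v) for v in affine(*params["candidate"], x, h)]
    # additive memory update: keep part of the old cell, add gated new content
    c_new = [fn * cn + i_n * gn for fn, cn, i_n, gn in zip(f, c, i, g)]
    h_new = [on * math.tanh(cn) for on, cn in zip(o, c_new)]
    return h_new, c_new


def init_params(D, N, rng):
    """Small random weights for the four blocks (hypothetical initialization)."""
    def block():
        W = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(N)]
        U = [[rng.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(N)]
        return (W, U, [0.0] * N)
    names = ("input_gate", "forget_gate", "output_gate", "candidate")
    return {name: block() for name in names}


rng = random.Random(0)
D, N = 3, 4  # input size and number of memory units
params = init_params(D, N, rng)
h, c = [0.0] * N, [0.0] * N
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(params, x, h, c)
```

The backward pass is omitted; as the slide notes, the derivatives are considerably more tedious than the forward equations.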
Left: RNN with one fully connected hidden layer;
Right: LSTM with memory blocks in hidden layer.
From Simple RNN to BPTT
LSTM: Long Short-Term Memory
Gated Recurrent Unit
GRU is a variation of RNN, adaptively capturing dependencies of different time scales with each recurrent unit;
GRU uses gating units as well to modulate the flow of information inside the unit, but without separate memory cells.
Different from LSTM:
GRU does not control the degree to which its state is exposed, but exposes its full content each time;
GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not
independently control the amount of the candidate activation being added (the control is tied via the update gate).
• Shared virtues with LSTM: the additive
component of their update from t to t + 1;
• Easy for each unit to remember the
existence of a specific feature in the input
stream for a long series of steps;
• Effectively creates shortcut paths that
bypass multiple temporal steps, which
allow the error to be back-propagated
easily without too quickly vanishing.
Belief Nets
A belief net is a directed acyclic graph composed of stochastic variables.
Can observe some of the variables and solve two problems:
inference: Infer the states of the unobserved variables.
learning: Adjust the interactions between variables to more likely generate the
observed data.
stochastic hidden cause
visible effect
Use nets composed of layers of stochastic variables with weighted connections.
Boltzmann Machines
An energy-based model associates an energy with each configuration of the stochastic variables of interest (for example, MRF, Nearest Neighbor);
Learning means adjusting the shape of the energy function so that desired configurations get low energy;
Boltzmann machine is a stochastic recurrent model with hidden variables;
Markov chain Monte Carlo (MCMC) sampling for gradient estimate;
Restricted Boltzmann machine is a special case:
Only one layer of hidden units;
factorization of each layer’s neurons/units (no connections in the same layer);
Contrastive divergence: approximation of gradient in RBMs.
probability
Energy Function
Learning rule
Deep Belief Networks
A hybrid model: can be trained as a generative or discriminative model;
Deep architecture: multiple layers (learn
features layer by layer);
Multi layer learning is difficult in sigmoid belief
networks.
Top two layers are undirected connections,
RBM;
Lower layers get top down directed
connections from layers above;
Unsupervised or self-taught pre-learning
provides a good initialization;
Greedy layer-wise training for RBM
Supervised fine-tuning
Generative: Up-down wake-sleep algorithm
Discriminative: bottom-up back propagation
Deep Boltzmann Machine
Learning internal representations that become increasingly complex;
High-level representations built from a large supply of unlabeled inputs;
Pre-training consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine (undirected graph);
Generative fine-tuning: different from DBN, two phases
Positive: observed, sample hidden, using variational approximation (mean-field);
Negative: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC).
Discriminative fine-tuning: the same as for DBN, back-propagation.
Denoising Auto-Encoder
Multilayer NNs with target output=input;
Reconstruction=decoder(encoder(input));
Perturbs the input x to a corrupted version;
Randomly sets some of the coordinates of input to zeros.
Recover x from encoded perturbed data.
Learns a vector field towards higher probability regions;
Pre-trained with DBN or regularizer with perturbed training data;
Minimizes variational lower bound on a generative model;
Corresponds to regularized score matching on an RBM;
PCA=linear manifold=linear Auto Encoder;
Auto-encoder learns the salient variation like a nonlinear PCA.
Stacked Denoising Auto-Encoder
Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise learning;
Drop the decode layer each time;
Supervised training on the last layer using final features;
Then supervised training on the entire network to fine-tune all weights;
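Stacked denoising auto-encoders are reported to perform as well as or better than stacked RBMs for unsupervised pre-training;
Plain (non-denoising) stacked auto-encoders are empirically not quite as accurate as DBNs.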
Stochastic Gradient Descent (SGD)
• The general class of estimators that arise as minimizers of sums are called M-estimators;
• Where are stationary points of the likelihood function (or zeroes of its derivative,
the score function)?
• Online gradient descent samples a subset of summand functions at every
step;
• The true gradient of the objective is approximated by a gradient at a single example;
• Shuffling of training set at each pass.
• There is a compromise between two forms, often called "mini-batches", where
the true gradient is approximated by a sum over a small number of training
examples.
• When the learning rate decreases at an appropriate rate, SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
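The mini-batch compromise and the per-pass shuffling described above can be sketched on a toy least-squares problem. The data (a noise-free line y = 2x + 1), the learning rate and the batch size are hypothetical choices for illustration, and a constant learning rate is used for brevity.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # shuffle the training set at each pass
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of 0.5 * mean((w*x + b - y)^2) over the mini-batch only
            gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b


xs = [i / 10.0 - 1.0 for i in range(21)]  # 21 points in [-1, 1]
ys = [2.0 * x + 1.0 for x in xs]          # noise-free line y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
```

Setting `batch_size=1` gives the purely online form, and `batch_size=len(xs)` gives ordinary batch gradient descent.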
Hinton’s RMSProp: another variation of SGD
Rprop: uses only the sign of the gradient; does not work with mini-batches;
The magnitude of the gradient can be very different for different weights and can change during learning;
Combine the idea of only using sign of gradient with the idea of adapting the step size separately for
each weight;
RMSProp: a mini-batch version of Rprop;
Rprop is equivalent to using the gradient but also dividing by the magnitude of the gradient;
RMSProp divides the learning rate for a weight by the square root of a running average of the squared gradients for that weight;
Other improvements:
Combination with momentum, Nesterov momentum, adaptive learning rate, …
LeCun’s “No more pesky learning rate”;
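The RMSProp rule described above can be sketched as a per-weight update. The decay of 0.9, the learning rate, the small `eps` and the badly scaled toy quadratic are all assumptions made for illustration; momentum and the other listed improvements are left out.

```python
import math


def rmsprop_update(w, grad, mean_sq, lr=0.01, decay=0.9, eps=1e-8):
    """Divide each weight's step by the RMS of its recent gradients."""
    mean_sq = [decay * m + (1.0 - decay) * g * g for m, g in zip(mean_sq, grad)]
    w = [wi - lr * g / (math.sqrt(m) + eps) for wi, g, m in zip(w, grad, mean_sq)]
    return w, mean_sq


def toy_gradient(w):
    """Gradient of 0.5 * (100 * w0^2 + 0.01 * w1^2): scales differ by 10^4."""
    return [100.0 * w[0], 0.01 * w[1]]


w = [1.0, 1.0]
mean_sq = [0.0, 0.0]
for _ in range(500):
    w, mean_sq = rmsprop_update(w, toy_gradient(w), mean_sq)
```

Because each gradient is normalized by its own running magnitude, both coordinates move at a similar pace even though their raw gradients differ by four orders of magnitude.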
Back Propagation
$E(f(x_0, w), y_0) = -\log f(x_0, w)_{y_0}$ (negative log of the network output for the true label $y_0$).
Loss function
Euclidean loss is used for regressing to real-valued labels in (-inf, inf);
Sigmoid cross-entropy loss is used for predicting K independent probability values in [0,1];
Softmax (normalized exponential) loss is used for predicting a single class out of K mutually exclusive classes;
Generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to
a K-dimensional vector of real values σ(z) in the range (0, 1).
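The predicted probability for the j'th class given a sample vector x, with z the vector of class scores computed from x, is $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$;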
Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or
outliers in the data without removing them from the dataset.
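The softmax loss above can be sketched directly from its definition. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here; it does not change the result. The example scores are hypothetical.

```python
import math


def softmax(z):
    """Map K real scores to K probabilities in (0, 1) that sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(z, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(z)[label])


scores = [2.0, 1.0, -1.0]  # hypothetical class scores for K = 3
probs = softmax(scores)
loss = softmax_cross_entropy(scores, label=0)
```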
Variable Learning Rate
Too large learning rate
causes oscillation in searching for the minimal point
Too small learning rate
too slow convergence to the minimal point
Adaptive learning rate
At the beginning, the learning rate can be large when the current point is
far from the optimal point;
Gradually, the learning rate will decay as time goes by.
Should not be too large or too small:
annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)
𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a constant.
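The annealing schedule above is a one-line function. The initial rate 0.1 and the time constant T = 100 are hypothetical values.

```python
def annealed_rate(t, alpha0=0.1, T=100.0):
    """alpha(t) = alpha(0) / (1 + t/T): nearly constant for t << T, then decaying."""
    return alpha0 / (1.0 + t / T)


schedule = [annealed_rate(t) for t in (0, 10, 100, 1000)]
```

At t = T the rate has halved, which makes T a convenient knob for how long the "almost constant" phase lasts.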
Variable Momentum
AdaGrad/AdaDelta
Weight Decay for Overfitting
Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification;
Weight decay works as rescaling weights in the learning rule, while bias learning stays the same;
Prefer to learn small weights, and large weights allowed if improving the original cost function;
A way of compromising btw finding small weights and minimizing the original cost function;
In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
L1 regularization: the weights not really useful shrink by a constant amount toward zero; Act like a form of feature selection;
Make the input filters cleaner and easier to interpret;
L2 regularization penalizes large values strongly, while L1 regularization penalizes all nonzero values in proportion to their magnitude;
Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights & hyper-parameters;
Hybrid Monte Carlo: gradient and sampling.
Early Stopping for Overfitting
Steps in early stopping:
Divide the available data into training and validation sets.
Use a large number of hidden units.
Use very small random initial values.
Use a slow learning rate.
Compute the validation error rate periodically during training.
Stop training when the validation error rate "starts to go up".
Early stopping has several advantages:
It is fast.
It can be applied successfully to networks in which the number of weights far exceeds the
sample size.
It requires only one major decision by the user: what proportion of validation cases to use.
Practical issues in early stopping:
How many cases do you assign to the training and validation sets?
Do you split the data into training and validation sets randomly or by some systematic
algorithm?
How do you tell when the validation error rate "starts to go up"?
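One common answer to the last question is a patience rule: stop once the validation error has not improved for a fixed number of checks. The sketch below assumes that rule and a hypothetical list of validation errors; it is one possible criterion, not the only one.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training is considered stopped once `patience` checks in a row
    fail to improve on the best validation error seen so far.
    """
    best_err, best_epoch, stop_epoch = float("inf"), -1, -1
    for epoch, err in enumerate(val_errors):
        stop_epoch = epoch
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch


# Hypothetical validation curve: improves, then starts to go up.
val_errors = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_errors)
```

In practice the weights from `best_epoch` are the ones kept, so a checkpoint has to be saved whenever the validation error improves.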
Dropout and Maxout for Overfitting
Dropout: set the output of each hidden neuron to zero with probability 0.5.
Motivation: combining many different models that share parameters reduces test error by approximately averaging their predictions, which resembles bagging.
The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation.
So every time an input is presented, the NN samples a different architecture, but all these architectures share weights.
This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units.
It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units.
Without dropout, the network exhibits substantial overfitting.
Dropout roughly doubles the number of iterations required to converge.
Maxout takes the maximum across multiple feature maps;
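Both operations above are short enough to sketch. The dropout version follows the description on this slide (zeroing at training time); scaling the activations by the keep probability at test time is the usual companion step and is stated here as an assumption, since the slide does not spell it out. The maxout sketch takes an elementwise maximum over a hypothetical list of feature vectors.

```python
import random


def dropout_train(activations, rng, p_drop=0.5):
    """Training time: zero each unit independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test time: keep all units, scaled by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]


def maxout(pieces):
    """Elementwise maximum across several feature vectors (or maps)."""
    return [max(values) for values in zip(*pieces)]


rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout_train(hidden, rng)
n_zero = sum(1 for a in dropped if a == 0.0)
```

Each call to `dropout_train` samples a different "thinned" architecture, while all of them share the same underlying weights.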
Data Augmentation for Overfitting
The easiest and most common method to reduce overfitting on
image data is to artificially enlarge the dataset using label-
preserving transformations;
Perturbing an image I by transformations that leave the underlying
class unchanged (e.g. cropping and flipping) in order to generate
additional examples of the class;
Two distinct forms of data augmentation:
image translations and horizontal reflections;
changing RGB intensities.
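The first form (translations via random crops, plus horizontal reflections) can be sketched on an image stored as a list of rows. The 4x4 toy image, the crop size and the 50% flip probability are hypothetical; the RGB-intensity form is not shown.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size patch at a random offset (a random translation)."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def horizontal_flip(image):
    """Mirror the image left to right; the class label is unchanged."""
    return [row[::-1] for row in image]


def augment(image, size, rng):
    """One label-preserving sample: random crop, then flip half of the time."""
    patch = random_crop(image, size, rng)
    return horizontal_flip(patch) if rng.random() < 0.5 else patch


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
samples = [augment(image, 3, rng) for _ in range(8)]
```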
MCMC Sampling for Optimization
Markov Chain: a stochastic process in which future states are independent of past states given the present state.
Markov chain will typically converge to a stable distribution.
Markov Chain Monte Carlo: sampling using ‘local’ information
Devise a Markov chain whose stationary distribution is the target.
Ergodic MC must be aperiodic, irreducible, and positive recurrent.
Monte Carlo Integration to get quantities of interest.
Metropolis-Hastings method: sampling from a target distribution
Create a Markov chain whose transition matrix does not depend on the normalization term.
Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio).
After a sufficient number of iterations, the chain will converge to the stationary distribution.
Gibbs sampling is a special case of M-H sampling.
The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distribution.
Hybrid Monte Carlo: gradient sub step for each Markov chain.
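The Metropolis-Hastings idea above can be sketched with a symmetric random-walk proposal. The target here is an unnormalized standard Gaussian, chosen only because its mean and variance are known; the proposal width, chain length and burn-in are hypothetical settings. Note that only a ratio of target values enters the acceptance test, so the normalization term is never needed.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalized log-density."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)
        # accept with probability min(1, target(proposal) / target(x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples


def log_unnormalized_gaussian(x):
    """log of exp(-x^2 / 2); the constant 1/sqrt(2*pi) is deliberately omitted."""
    return -0.5 * x * x


chain = metropolis_hastings(log_unnormalized_gaussian, 0.0, 20000)
samples = chain[1000:]  # discard burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance are Monte Carlo integration estimates of the corresponding quantities under the target.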
Mean Field for Optimization
Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution;
Mean Field replaces M with a (simple) subset M(F), on which A* (μ) is a closed form (Note: F is disconnected graph);
Density becomes factorized product distribution in this sub-family.
Objective: K-L divergence.
Mean field is a structured variational approximation approach:
Coordinate ascent (deterministic);
Compared with stochastic approximation (sampling):
Faster, but maybe not exact.
Contrastive Divergence
Contrastive divergence (CD) is a quicker way to learn RBMs;
Contrastive divergence as the new objective;
Taking gradients and ignoring a term which is usually very small.
Steps:
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling);
CD learning is biased: its update is not the exact gradient of the log-likelihood
Improved: Persistent CD explores more modes in the distribution
Rather than from data samples, begin sampling from the model samples obtained after the last gradient update.
Still suffers from divergence of likelihood due to missing the modes.
Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
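The CD steps listed above, with a single alternation (CD-1), can be sketched for a tiny binary RBM. Everything concrete here is an assumption for illustration: 4 visible and 2 hidden units, two toy training patterns, the learning rate, and the use of hidden probabilities rather than samples in the weight statistics.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]


def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step: data statistics minus one-step reconstruction statistics."""
    ph0 = hidden_probs(v0, W, c)                # hidden units given the data
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)   # reconstruct all visibles in parallel
    ph1 = hidden_probs(v1, W, c)                # hidden units given the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])


rng = random.Random(0)
n_vis, n_hid = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
b, c = [0.0] * n_vis, [0.0] * n_hid
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy patterns
for _ in range(500):
    for v in data:
        cd1_update(v, W, b, c, 0.1, rng)
```

Persistent CD would differ only in where the negative phase starts: from a chain state carried over between updates instead of from the training vector `v0`.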
“Wake-Sleep” Algorithm for DBN
Pre-trained DBN is a generative model;
Do a stochastic bottom-up pass (wake phase)
Get samples from factorial distribution (visible first, then generate hidden);
Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
Do a few iterations of sampling in the top level RBM
Adjust the weights in the top-level RBM.
Do a stochastic top-down pass (sleep phase)
Get visible and hidden samples generated by generative model using data coming from nowhere!
Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
Any guarantee for improvement? No!
The “Wake-Sleep” algorithm tries to make the representation economical to describe (Shannon’s coding theory).
Greedy Layer-Wise Training
Deep networks tend to have more local minima problems than shallow networks during supervised training
Train first layer using unlabeled data
Supervised or semi-supervised: use more unlabeled data.
Freeze the first layer parameters and train the second layer
Repeat this for as many layers as desired
Build more robust features
Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)
Fine tune the full network with a supervised approach;
Avoid problems to train a deep net in a supervised fashion.
Each layer gets full learning
Help with ineffective early layer learning
Help with deep network local minima
Redundancy of Parameterization in Deep Learning
• Fixed point implementation: memory footprint and vectorization in SIMD;
• 8 bit quantization for activations and weights, 32 bits for biases, except for the input layer (still floating point);
• Limited precision: 16 bits rounding for training by SGD with no degradation;
• Low precision storage: both training and testing in deep learning;
• Factorization of parameter matrix: low rank constraints;
• Feature prediction: dictionary construction and working like pooling;
• Columnar architecture for fully connected network and convolutional network.
Left: Columnar architecture in a fully connected network, with the path through one column highlighted. Each column corresponds to a different parameter index.
Right: Columnar architecture in a convolutional network. In this setting the parameters take linear combinations of the feature maps obtained by convolving the input with the dictionary.
Simplifying CNN for Fast Learning
• Fusion of convolution and max pooling layers;
• Separable convolution filters and fusion with the pooling layer;
DenseNet: A CNN Pyramid (in Caffe)
• Dense multiscale features from conv. layers of a CNN (similar to the HOG pyramid and R-CNN);
• Image pyramid input (25 scales) as stitching together with optional warping for each aspect ratio;
• Data centering to simplify mean subtraction;
• Aspect ratios at the running stage.
Acceleration of CNN by Utilization of Filter Redundancy
Visualization of monochromatic and bi-clustering approximation structures. (a) The monochromatic
approximation, used for the first layer. Input color channels are projected onto a set of intermediate color
channels. After this transformation, output features need only to look at one intermediate color channel. (b)
The bi-clustering approximation, used for higher convolution layers. Input and output features are clustered
into equal sized groups. The weight tensor corresponding to each pair of input and output clusters is then
approximated. (c) The weight tensors for each input-output pair in (b) are approximated by a sum of rank 1
tensors by applying SVD decomposition.
Acceleration of CNN by Utilization of Filter Redundancy
a) The original convolutional layer acting on a single-channel input, i.e. C=1.
b) The approximation to that layer using the method of Scheme 1: approximation of the filter set by a linear combination of a smaller basis set of M filters (rank-1 for separable ones).
c) The approximation to that layer using the method of Scheme 2: take into account both input and output redundancies by considering 3D filters throughout, i.e. each convolution layer is factored as a sequence of two regular convolution layers with rectangular (in spatial domain) filters, working on multiple channels simultaneously and shaped to match a separable filter approximation (also rank-1).
Model Compression
• Compress the function learned by a complex model into a much smaller, faster model that has comparable performance;
• Given enough data, a NN can approximate any function to arbitrary precision;
• Idea: Instead of training the NN on the original (small) training set, use an ensemble to label a large unlabeled dataset and then train the NN on this much larger ensemble-labeled data set, to yield a NN that predicts similarly to the ensemble and performs much better than a NN trained on the original training set;
• Three methods to generate pseudo data:
• RANDOM, generate data for each attribute independently from its marginal distribution;
• NBE, estimate the joint density of attributes using Naive Bayes and then generate samples from this joint distribution;
• MUNGE, a new procedure that samples from a non-parametric estimate of the joint density.
Do Deep Nets Really Need to be Deep?
• Shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models;
• In some cases the shallow neural nets can learn these deep functions using the same number of parameters as the original deep models;
• Complexity of a learned model, and the size of representation best used to learn that model, are different things;
• Model compression works best when the unlabeled set is much larger than the train set, and when the unlabeled samples do not fall on train points, where the teacher model is more likely to have overfit.
• Train the student model to mimic a more accurate ensemble of deep NN models (the teacher).
Dark Knowledge
• The ensemble implements a function from input to output;
• Forget the models in the ensemble and the way they are parameterized and focus on the function.
• After learning the ensemble, we have our hands on the function.
• Can we transfer the knowledge in the function into a single smaller model (Distillation)?
• Soft targets: A way to transfer the function
• If we have the ensemble, we can divide the averaged logits from the ensemble by a “temperature” to get a much softer distribution;
• Softened outputs reveal the dark knowledge in the ensemble.
• However, it works better to fit both the hard targets and the soft targets from the ensemble.
• Down-weight the cross entropy with the hard targets.
• Dropout is an efficient way to average many large NNs;
• Organize the labels into a tree, and predict all of the labels on the path to the root, instead of just predicting a label.
Distilling knowledge in a Neural Network
• Distilling the knowledge in an ensemble of models into a single model;
• Transfer from a cumbersome model to a small model, more suitable for deployment.
• An ensemble of one or more full models and many specialist models which learn to distinguish fine grained classes that the full models confuse;
• These specialist models can be trained in parallel.
• Use the class probabilities produced by the cumbersome model as “soft targets” for training the small model.
• When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets.
• When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.
• Distillation: raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets.
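The temperature softmax and the combined hard/soft objective described above can be sketched as follows. The logits, the temperature T = 5 and the hard-target weight of 0.1 are hypothetical. Multiplying the soft-target term by T^2 follows the usual recommendation for keeping the two gradient magnitudes comparable; the slides do not state it, so treat it as an assumption.

```python
import math


def softmax_t(logits, T=1.0):
    """Softmax of logits / T; a higher temperature T gives a softer distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, T=5.0, hard_weight=0.1):
    """Cross entropy with the teacher's soft targets plus a down-weighted hard-target term."""
    soft_targets = softmax_t(teacher_logits, T)
    student_soft = softmax_t(student_logits, T)
    soft_ce = -sum(p * math.log(q) for p, q in zip(soft_targets, student_soft))
    hard_ce = -math.log(softmax_t(student_logits, 1.0)[label])
    return (T * T) * soft_ce + hard_weight * hard_ce


teacher_logits = [10.0, 2.0, 1.0]  # hypothetical output of the cumbersome model
hard = softmax_t(teacher_logits, T=1.0)
soft = softmax_t(teacher_logits, T=5.0)
loss_match = distillation_loss(teacher_logits, teacher_logits, label=0)
loss_mismatch = distillation_loss([1.0, 2.0, 10.0], teacher_logits, label=0)
```

At T = 1 nearly all the probability sits on the top class; at T = 5 the relative sizes of the small probabilities become visible, which is the "dark knowledge" a student can learn from.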
FitNets by Thin Deep Nets: Extension of Distillation
• Allow training of a student that is deeper and thinner than the teacher, using both the outputs and the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student, thereby compressing wide and shallower (but still deep) networks;
• A hint is defined as the output of a teacher’s hidden layer, used to guide the student’s learning process;
• Choose a hidden layer of the FitNet as the guided layer in the student, to learn from the teacher’s hint layer;
• Train the FitNet in a stage-wise fashion: hints as a form of regularization, guided layer with a convolutional regressor, loss function based on prediction error of a teacher’s hint layer and regressor over guided layer.
• Compress each Conv layer by finding an appropriate low-rank approximation;
• Then fine-tune the upper layers until the prediction performance is restored;
• Elementary tensor decompositions based on SVD; Filter clustering to take
advantage of similarities between learned features. Speed-up by 2x.
Exploiting Linear Structure Within CNN
for Efficient Evaluation
(a) The monochromatic approximation, used for the first layer. (b) The biclustering
approximation, used for higher convolution layers. (c) The weight tensors for each
input-output pair in (b) are approximated by a sum of rank 1 tensors.
• Reparameterizing with Adaptive Fastfood transform;
• It reduces the storage and computational costs from O(nd) to O(n) and O(n
log d), respectively;
Deep Fried Convnets
The FCLs are replaced with the Adaptive Fastfood transform
• HashedNets exploit inherent
redundancy in NN for model
reductions;
• A hash function to randomly group
connection weights into hash
buckets, and all connections within
the same hash bucket share a single
parameter value;
• Tuned to adjust to the HashedNets
weight sharing architecture with BP;
• Shrink the storage requirements and
preserve the generalization performance.
Compressing NNs with the Hashing Trick
Random weight sharing under compression factor ¼.
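A small sketch of the random weight sharing described above. The hash used here (CRC32 from Python's standard library) is only a stand-in chosen for the example, and the function names are hypothetical; the point is that a virtual weight matrix is looked up from a short parameter vector, and that gradients of all connections in a bucket are summed during BP.

```python
import zlib
import numpy as np

def bucket_index(i, j, num_buckets, seed=0):
    """Map connection (i, j) to a bucket with a deterministic hash (CRC32 as a stand-in)."""
    key = f"{seed}:{i}:{j}".encode("utf-8")
    return zlib.crc32(key) % num_buckets

def hashed_weight_matrix(shared_params, n_out, n_in, seed=0):
    """Build the virtual n_out x n_in weight matrix from the shared parameters."""
    num_buckets = len(shared_params)
    idx = np.array([[bucket_index(i, j, num_buckets, seed) for j in range(n_in)]
                    for i in range(n_out)])
    return shared_params[idx], idx

def shared_gradient(grad_W, idx, num_buckets):
    """Gradient of a shared parameter = sum of gradients of the connections in its bucket."""
    return np.bincount(idx.ravel(), weights=grad_W.ravel(), minlength=num_buckets)

rng = np.random.default_rng(0)
n_out, n_in = 4, 8
shared = rng.standard_normal(n_out * n_in // 4)  # 8 parameters for 32 connections (factor 1/4)
W, idx = hashed_weight_matrix(shared, n_out, n_in)
grad_shared = shared_gradient(np.ones_like(W), idx, len(shared))
```

Only `shared` needs to be stored; the index matrix is recomputed from the hash function, so it costs no memory at deployment time.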
• A two-step approach for speeding up conv. layers within CNN based on tensor
decomposition and discriminative fine tuning;
• NLS to compute low-rank CP-decomposition (PARAFAC or CANDECOMP) of
the 4D conv. kernel tensor into a sum of a number of rank-one tensors;
• This decomposition is used to replace the original conv. layer with a sequence
of four conv. layers with small kernels;
• Finally, fine-tune on the training data using the standard BP process;
• 4x CPU speed-up for AlexNet on ImageNet, with only a 1% increase in top-5 error.
Speeding-up CNN Using Fine-tuned CPD
• Prune the network by learning only the important connections (9x, 13x);
• Quantize the weights to enforce weight sharing (from 32 bits to 5 bits per weight) and apply Huffman coding;
• Retrain to fine tune the remaining connections and the quantized centroids;
• AlexNet by 35x (from 240MB to 6.9MB); VGG-16 by 49x (from 552MB to 11.3MB).
Compressing Deep NN with Huffman Coding
• Tensor Train: a compact multilinear format (TT format or decomposition);
• TT-layer is fully connected layer with the weight matrix stored in the TT-format;
• TensorNet: NN with one or more TT-layers;
• # parameters reduced and expressive power of the layer is preserved;
• Up to 7x compression for the whole VGG network, with up to 200,000x for the FCL weight matrix;
Tensorizing Neural Networks
• Replacing the conventional linear projection in FCLs with the circulant projection;
The circulant structure reduces memory footprint and enables FFT;
• Gradient computation and optimization of the circulant projections can be
performed very efficiently.
Parameter Redundancy in Deep Networks
with Circulant Projections
• For the most storage demanding dense connected layers, vector quantization
(VQ) methods have a clear gain over existing matrix factorization methods;
• K-means clustering, product quantization (PQ);
• MattNet with ImageNet: 16-24x compression with only 1% loss in accuracy;
Compressing Deep CNN using VQ
• Low-rank Approximation of Responses: SVD or PCA;
• AlexNet 4× for ImageNet with top-5 error up only 0.9%;
Efficient and Accurate Approximations
of Nonlinear CNN
Illustration of the approximation.
(a) An original layer with complexity O(d k^2 c).
(b) An approximated layer with complexity
reduced to O(d' k^2 c) + O(d d').
k - the spatial size of the filter
c - the number of input channels of this layer
d - the number of filters
d’ - the rank of matrix W
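To make the complexity argument concrete, here is a hedged sketch that factorizes a filter matrix W of size d x (k*k*c) into two rank-d' factors and counts multiply-adds per output position. It applies a plain truncated SVD to the weights, which is a simplification assumed for this example (the slide's method approximates the layer responses); the function names are hypothetical.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Truncated SVD: W (d x k*k*c) ~= P (d x d') @ Q (d' x k*k*c)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :rank] * s[:rank]  # plays the role of a 1x1 layer with d filters
    Q = Vt[:rank, :]            # plays the role of d' filters of size k x k x c
    return P, Q

def costs(d, k, c, d_prime):
    """Multiply-adds per output position for the original and the approximated layer."""
    original = d * k * k * c
    approximated = d_prime * k * k * c + d * d_prime
    return original, approximated

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 27))  # exactly rank 3
P, Q = low_rank_factors(W, rank=3)
original, approximated = costs(d=256, k=3, c=128, d_prime=64)
```

With d = 256, k = 3, c = 128 and d' = 64 the count drops from 294,912 to 90,112 multiply-adds per position, roughly a 3.3x reduction.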
Sparse Coding for Visual Recognition
• Descriptor Layer: detect and locate features, extract
corresponding descriptors (e.g. SIFT);
• Code Layer: code the descriptors
• Vector Quantization (VQ): each code has only
one non-zero element;
• Soft-VQ: small group of elements can be non-
zero;
• SPM layer: pool codes across subregions and
average/normalize into a histogram.
[Lazebnik et al., CVPR 2005; Yang et al., CVPR 2009]
Sparse Coding for Visual Recognition
• Classifiers using these features need nonlinear kernels
• Lazebnik et al., CVPR 2005; Grauman, Darrell, JMLR 2007;
• High computational complexity
• Idea: modify the coding step to produce feature representations that linear classifiers can use effectively
• Sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010]
• Local Coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010]
• RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011]
• Other feature learning algorithms
Improving the coding step
Deep learning for visual recognition,
detection, localization
• Hand-crafted features:
• Needs expert knowledge
• Requires time-consuming hand-tuning
• (Arguably) one limiting factor of computer vision systems
• Key idea of feature learning:
• Learn statistical structure or correlation from unlabeled data
• The learned representations used as features in supervised and semi-supervised settings
• Hierarchical feature learning
• Deep architectures can be representationally efficient.
• Natural progression from the low level to the high level structures.
• Can share the lower-level representations for multiple tasks in computer vision.
Hierarchical Feature Learning
Feature Learning Architectures
[Figure: Pixels / Features -> Filtering with Dictionary (patch/tiled/convolutional) + Non-linearity -> Spatial/Feature pooling (Sum or Max) -> Normalization between feature responses (Local Contrast Normalization, Subtractive & Divisive; (Group) Sparsity; Max / Softmax) -> Features. Not an exact separation.]
‘Filtering’
Patch
Image as a set of patches
[Figure: Input (#patches) multiplied by Filters (#filters)]
‘Filtering’ Convolutional
Translation equivariance
Tied filter weights
(same at each position -> few parameters)
[Figure: Input -> Feature Map]
‘Filtering’
Tiled
Filters repeat every n
More filters than convolution for given # features
[Figure: Input -> Filters -> Feature maps]
‘Normalization’
Contrast normalization (across feature maps)
Local mean = 0, local std. = 1; “local” means a 7x7 Gaussian window
Equalizes the feature maps
[Figure: Feature Maps -> Feature Maps After Contrast Normalization]
‘Normalization’
Sparsity
Constrain L0 or L1
norm of features
Iterate with filtering
operation (ISTA
sparse coding)
[Figure: Input Patch -> Filters -> Features, comparing K-means and Sparse Coding]
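The ISTA sparse coding step mentioned above alternates a filtering (gradient) step with a soft-threshold that enforces the L1 constraint. The sketch below is a generic textbook version written for this transcript, with a random dictionary and illustrative values for lam and the iteration count; none of these settings come from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries with |v| <= t become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam=0.1, n_iter=200):
    """Minimize 0.5 * ||x - D z||^2 + lam * ||z||_1 over the code z."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant: squared spectral norm of D
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)                    # filtering step with the dictionary
        z = soft_threshold(z - grad / L, lam / L)   # sparsity step
    return z

def objective(x, D, z, lam):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
z_true = np.zeros(50)
z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]   # a sparse code used to synthesize the input
x = D @ z_true
z_hat = ista(x, D, lam=0.05, n_iter=500)
```

The soft-threshold is what creates competition between features: an atom stays active only if it explains enough of the residual to survive the shrinkage.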
‘Normalization’
Induces local competition
between features to explain input
“Explaining away” in graphical
models
Just like top-down models
But more local mechanism
Filtering alone cannot do this!
Example: Convolutional Sparse Coding
[Figure: Input Image -> Convolution with Filters, with an L1 penalty |.|_1 on each feature map -> Reconstructed Image; from Zeiler et al. [CVPR’10/ICCV’11]]
‘Pooling’
Spatial Pooling
Non-overlapping / overlapping regions
Sum or Max
Boureau et al. ICML’10 for theoretical analysis
[Figure: Max pooling vs. Sum pooling]
‘Pooling’
Spatial pooling
Invariance to small transformations
Larger receptive fields (see more of input)
Visualization technique from [Le et al. NIPS’10]:
Zeiler, Taylor, Fergus [ICCV 2011]
‘Pooling’
Chen, Zhu, Lin, Yuille, Zhang [NIPS 2007]
• Pooling across feature groups
• Additional form of inter-feature competition
• Gives AND/OR type behavior via (sum / max)
• Compositional models of Zhu, Yuille
[Zeiler et al., ‘11]
LeNet (LeNet-5) A layered model composed of convolution and subsampling operations followed by a
holistic representation and ultimately a classifier for handwritten digits;
Local receptive fields (5x5) with local connections;
Output via a RBF function, one for each class, with 84 inputs each;
Learning by Graph Transformer Networks (GTN);
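[Figure: LeNet as a layer stack; the FORWARD pass runs from the data layer to the softmax and the BACKWARD pass in reverse. Layer -> output size:]
Data Layer -> 1x28x28
Convolutional layer [5x5] -> 20x24x24
Pooling [2x2, stride 2] -> 20x12x12
Convolutional layer [5x5] -> 50x8x8
Pooling [2x2, stride 2] -> 50x4x4
Inner Product -> 500x1
ReLU -> 500x1
Inner Product -> 10x1
Soft Max -> 10x1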
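The feature-map sizes listed for this LeNet variant follow from the usual output-size rule (size - kernel) / stride + 1 for unpadded layers. A short sketch that traces them, assuming no padding and stride 1 for the convolutions (an assumption consistent with the listed sizes):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, output channels or None for pooling)
layers = [
    ("conv1", 5, 1, 20),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 50),
    ("pool2", 2, 2, None),
]

def trace_shapes(channels=1, size=28):
    """Return (name, channels, height, width) after each layer."""
    shapes = [("data", channels, size, size)]
    for name, kernel, stride, out_channels in layers:
        size = conv_out(size, kernel, stride)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes

shapes = trace_shapes()
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]  # input size of the first inner product
```

The last pooling output is flattened to 50 x 4 x 4 = 800 values before the 500-unit inner product layer.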
AlexNet A layered model composed of convolution and subsampling operations, followed by a
holistic representation and ultimately a landmark classifier;
Consists of 5 convolutional layers, some of which followed by max-
pooling layers, 3 fully-connected layers with a final 1000-way
softmax;
Fully-connected “FULL” layers: linear classifiers/matrix
multiplications;
ReLU are rectified-linear nonlinearities on layer output, can be
trained several times faster;
Local normalization scheme aids generalization;
Overlapping pooling slightly less prone to overfitting;
Data augmentation: artificially enlarge the dataset using label-
preserving transformations;
Dropout: setting to zero output of each hidden neuron with prob. 0.5;
Trained by SGD with batch size 128, momentum 0.9, weight decay
0.0005.
The network’s input is 150,528-dimensional, and the number of neurons in the network’s
remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.
MattNet Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;
Preprocessing: subtracting a per-pixel mean;
Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the
image and randomly flipped horizontally to provide more views of each example;
SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to prevent
overfitting;
65M parameters trained for 12 days on a single Nvidia GPU;
Visualization by layered DeconvNets: project the feature activations back to the input pixel space;
Reveal input stimuli exciting individual feature maps at any layer;
Observe evolution of features during training;
Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important;
DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve
structure;
Multiple such models were averaged together to further boost performance;
Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color
planes). Layer 1: convolution with 96 filters of size 7x7, stride of 2 in both x and y. The resulting
feature maps are (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), and
(iii) contrast normalized, giving 96 feature maps of 55x55. Similar operations are repeated in layers 2-5.
Layers 6-7: fully connected, input in vector form (6x6x256 = 9216 dimensions).
The final layer: a C-way softmax function, C - number of classes.
Top: A deconvnet layer (left)
attached to a convnet layer
(right). The deconvnet will
reconstruct approximate version
of convnet features from the
layer beneath.
Bottom: Unpooling operation in
the deconvnet, using switches
which record the location of the
local max in each pooling region
(colored zones) during pooling in
the convnet.
Oxford VGG Net: Very Deep CNN
Networks of increasing depth using an architecture with very small (3×3) convolution filters;
Spatial pooling is carried out by 5 max-pooling layers;
A stack of convolutional layers followed by three Fully-Connected (FC) layers;
All hidden layers are equipped with the rectification ReLU non-linearity;
No Local Response Normalisation!
Trained by optimising the multinomial logistic regression objective using SGD;
Regularised by weight decay and dropout regularisation for the first two fully-connected layers;
The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation accuracy stopped improving;
For random initialisation, sample the weights from a normal distribution;
Implementation derived from the publicly available C++ Caffe toolbox, modified to allow training and evaluation on multiple
GPUs installed in a single system, and on full-size (uncropped) images at multiple scales;
Combine the outputs of several models by averaging their soft-max class posteriors.
The depth of the configurations increases from the left (A) to the
right (E), as more layers are added (the added layers are shown in
bold). The convolutional layer parameters are denoted as
“conv<receptive field size> - <number of channels>”. The ReLU
activation function is not shown for brevity.
GoogleNet Questions:
Vanishing gradient?
Exploding gradient?
Tricky weight initialization?
Deep convolutional neural network architecture codenamed Inception;
Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components;
Judiciously applying dimension reduction and projections wherever the computational requirements would increase too much otherwise;
Increasing the depth and width of the network but keeping the computational budget constant;
Drawbacks: a bigger network typically means a larger number of parameters, which makes it more prone to overfitting, and a dramatically increased use of computational resources;
Solution: move from fully connected to sparsely connected architectures; analyze the correlation statistics of the activations of the last layer and cluster neurons with highly correlated outputs.
Based on the well known Hebbian principle: neurons that fire together, wire together;
Trained using the DistBelief: distributed machine learning system.
Inception module (with dimension reductions)
[Figure legend: Convolution, Pooling, Softmax, Other]
Problems with training deep architectures?
Network in a network in a network
9 Inception modules
PReLU Networks at MSR A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit;
PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk;
Allow negative activations in the ReLU function, scaled by a control parameter a that is learned adaptively;
Resolve diminishing gradient problem for very deep neural networks (> 13 layers) ;
Derive a robust initialization method better than “Xavier” (normalization) initialization;
Also use Spatial Pyramid Pooling (SPP) layer just before the fully connected layers;
Can train extremely deep rectified models and investigate deeper or wider network
architectures;
ReLU vs. PReLU Note: μ is momentum, ϵ is learning rate.
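A minimal sketch of the PReLU activation and the gradient used to learn its slope, written as f(y) = max(0, y) + a * min(0, y). The function names are chosen for this example; setting a = 0 recovers the ordinary ReLU.

```python
import numpy as np

def prelu(y, a):
    """Parametric ReLU: identity for y > 0, slope a for y <= 0."""
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

def prelu_grad_a(y):
    """Derivative of f(y) with respect to a: y where y <= 0, and 0 elsewhere."""
    return np.minimum(0.0, y)

y = np.array([-2.0, 0.0, 3.0])
out_prelu = prelu(y, a=0.25)
out_relu = prelu(y, a=0.0)
grad_a = prelu_grad_a(y)
```

Because only one extra parameter per channel (or per layer) is added, the extra cost and the added overfitting risk stay small.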
PReLU Networks at MSR Performance: 4.94% top-5 test error on the ImageNet 2012 Classification dataset;
ILSVRC 2014 winner (GoogLeNet, 6.66%);
Adopt the momentum method in BP training;
Mostly initialized by random weights from Gaussian distr.;
Investigate the variance of the FP responses in each layer;
Consider a sufficient condition in BP:
The gradient is not exponentially large/small.
Architectures of large models
PReLU Networks
Batch Normalization at Google Normalizing layer inputs for each mini-batch to handle saturating
nonlinearities and covariate shift;
Internal Covariate Shift (ICS): the change in the distribution of network
activations due to the change in network parameters during training;
Whitening to reduce ICS: linear transform to have zero means and unit
variances, and decorrelated;
Fix the means and variance of layer inputs (instead of whitening jointly
the features in both I/O);
Batch normalizing transform applied for activation over a mini-batch;
BN transform is differentiable transform introducing normalized
activations into the network;
Batch normalized networks
Unbiased variance estimate;
Moving average;
Batch normalized ConvNets
Effective mini-batch size;
Per feature, not per activation.
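The batch normalizing transform above can be sketched as follows for a fully connected layer, where statistics are taken per feature over the mini-batch axis. The function names, eps and the moving-average momentum are assumptions made for this example; for ConvNets the statistics would additionally be pooled over spatial positions, one pair per feature map.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)                     # biased variance within the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

def update_running_stats(running_mean, running_var, mu, var, batch_size, momentum=0.9):
    """Moving averages for inference, using the unbiased variance estimate."""
    unbiased_var = var * batch_size / (batch_size - 1)
    new_mean = momentum * running_mean + (1 - momentum) * mu
    new_var = momentum * running_var + (1 - momentum) * unbiased_var
    return new_mean, new_var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 4))  # mini-batch of 64 examples, 4 features
y, mu, var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
run_mean, run_var = update_running_stats(np.zeros(4), np.ones(4), mu, var, batch_size=64)
```

The learned gamma and beta let the network undo the normalization if that is what the optimum requires, so the transform does not reduce what the layer can represent.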
Batch Normalization at Google Reduce the dependence of gradients on the scale of the parameters or of the initial values;
Prevent small changes to the parameters from amplifying into larger and suboptimal changes in
activations and gradients;
Stabilize the parameter growth and make gradient propagation better behaved in BN training;
In some cases, eliminate the need of dropout as a regularizer;
In ImageNet Classification, remove local response normalization and reduce photometric
distortions;
Reach 4.9% in top-five validation error and 4.8% test error (human raters only 5.1%).
Accelerating BN network:
Enable larger learning rate and less care about initialization, which accelerates the training;
Reduce L2 weight regularization;
Accelerate the learning rate decay.
Batch Normalization at Google
Inception architecture
Neural Turing Machines A Neural Turing Machine (NTM) architecture contains two basic components:
a neural network controller and a memory bank;
During each update cycle, the controller network receives inputs from an external
environment and emits outputs in response;
It also reads from and writes to a memory matrix via a set of parallel read and write
heads.
The head weightings over memory locations arise by combining two addressing mechanisms with
complementary facilities;
“content-based addressing”: focuses attention on locations based on the similarity
between their current values and values emitted by the controller;
“location-based addressing”: the content of a variable is arbitrary, but the variable still
needs a recognizable name or address, so it is addressed by location, not by content;
Controller network: feed forward or recurrent.
Neural Turing Machines
Neural Turing Machine Architecture.
Flow Diagram of the Addressing Mechanism.
Highway Networks: Information Highway
Ease gradient-based training of very deep networks;
Allow unimpeded info. flow across several layers on information highways;
Use gating units to learn regulating the flow of info. through a network;
A highway network consists of multiple blocks such that the ith block computes
a block state Hi(x) and transform gate output Ti(x);
Highway networks with hundreds of layers can be trained directly using SGD
and with a variety of activation functions.
Block output: y = H(x) · T(x) + x · C(x), where T is the transform gate and C = 1 - T is the carry gate.
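A hedged sketch of one highway block with a transform gate T and carry gate C = 1 - T. It uses fully connected weights, a tanh block state and a sigmoid gate; these choices and all names are assumptions for the example, not details taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)); input and output have the same width."""
    H = np.tanh(x @ W_H + b_H)     # block state
    T = sigmoid(x @ W_T + b_T)     # transform gate in (0, 1)
    return H * T + x * (1.0 - T)   # carry gate C = 1 - T

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
W_H = 0.1 * rng.standard_normal((8, 8))
W_T = 0.1 * rng.standard_normal((8, 8))
b_H = np.zeros(8)

# A strongly negative gate bias closes the transform gate: the input is carried through.
y_carry = highway_layer(x, W_H, b_H, W_T, np.full(8, -50.0))
# A strongly positive gate bias opens it: the block behaves like a plain layer.
y_transform = highway_layer(x, W_H, b_H, W_T, np.full(8, 50.0))
```

Initializing the gate bias to a negative value biases the block toward carrying its input, which is one way information can flow unimpeded across many layers early in training.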
Deep Residual Learning for Image Recognition
Reformulate the layers as learning residual functions with reference to the layer inputs,
instead of learning unreferenced functions;
The desired underlying mapping as H(x), then let the stacked nonlinear layers fit another
mapping of F(x) = H(x) - x;
The formulation of F(x)+x can be realized by feed forward NN with “shortcut connections”
(such as “Highway Network” and “Inception”);
These residual networks are easier to optimize, and can gain accuracy from
considerably increased depth;
An ensemble of residual nets (up to 152 layers deep) achieves 3.57% top-5 error on the ImageNet test set;
224x224 crop, per-pixel mean subtracted, color augmentation, batch normalization;
SGD with a mini-batch size of 256, learning rate starting from 0.1 and divided by 10 when the error plateaus;
Weight decay of 0.0001 and a momentum of 0.9, no drop-out;
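The residual formulation F(x) + x can be illustrated with a two-layer block. This sketch uses fully connected layers in place of convolutions and omits batch normalization, both simplifications assumed for brevity; the names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Two stacked layers fit the residual F(x); the shortcut adds x back."""
    F = relu(x @ W1) @ W2
    return relu(F + x)

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal((5, 8)))  # non-negative input, as after a previous ReLU

# With all-zero weights the residual F is zero and the block is an identity mapping.
y_identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
y_random = residual_block(x, 0.1 * rng.standard_normal((8, 8)),
                          0.1 * rng.standard_normal((8, 8)))
```

If the identity mapping is close to optimal, the solver only has to push the weights of F toward zero, which is easier than fitting an identity with a stack of nonlinear layers.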
Deep Residual Learning for Image Recognition
Residual learning: a building block
Example network architectures for ImageNet
A deeper residual function F for ImageNet
Rethink Inception Architecture for Computer Vision
Scale up networks in ways that aim at utilizing the added computation efficiently by factorized convolutions and aggressive regularization;
Design principles in Inception: Avoid representational bottlenecks, especially early in the network;
Higher dimensional representations are easier to process locally within a network;
Spatial aggregation over lower dim embeddings w/o loss in representational power;
Balance the width and depth of the network.
Factorizing convolutions with large filter size: asymmetric convolutions;
Auxiliary classifiers: act as a regularizer, especially when batch normalized or combined with dropout;
Grid size reduction: two parallel stride 2 blocks (pooling and activation);
Model regularization via label smoothing: estimates the marginalized effect of label dropout;
Trained with TensorFlow: SGD with 50 replicas, batch size 32 for 100 epochs, learning rate of 0.045 decayed with an exponential rate of 0.94, and a decay of 0.9 (momentum or RMSProp).
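Label smoothing replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution over the K classes. A minimal sketch, with eps = 0.1 as an illustrative value and a hypothetical function name:

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """q'(k) = (1 - eps) * [k == y] + eps / K for each label y."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=4, eps=0.1)
```

The smoothed target keeps the model from becoming overconfident: the cross entropy can no longer be driven down by pushing the correct logit arbitrarily far above the others.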
Rethink Inception Architecture for Computer Vision
Inception modules after the factorization of the nxn
convolutions. In the proposed architecture, n = 7 is
chosen for the 17x17 grid.
Inception modules with expanded
filter bank outputs.
Inception modules where each
5x5 convolution is replaced by
two 3x3 convolutions.
Rethink Inception Architecture for Computer Vision
Auxiliary classifier on top of the last 17x17 layer.
Inception module that reduces the grid size while expanding the filter banks. It is both cheap and avoids
the representational bottleneck.
The outline of the proposed
network architecture
Deep Learning for Generic Object Detection • Predicting the bounding box of multiple objects by DL-
based regression (GoogleNet);
• Deep Multibox method;
• Overfeat: sliding window-based detection and
localization by deep NN;
• R-CNN: region-based proposal + CNN feature (cuda-convnet)-based
detection by SVM;
• SPP (spatial pyramid pooling) Net: extract window
wise features from region of feature maps;
• DeepID: selective search + R-CNN (Clarifai-fast);
• Deformation constraint pooling layer.
OverFeat: Integrated Framework with CNN
● Multi-scale and sliding window for classification, localization and detection with CNN;
● Classification: similar to AlexNet;
■ Use the same fixed input size approach with AlexNet, no contrast norm., non-overlapping
pooling;
■ SGD with decreasing learning rate, momentum, weight decay and dropout;
■ A feature extractor named “OverFeat” with two models: a fast one and an accurate one;
■ multi-view (4 corners + 1 center views + flip = 10 views);
■ fast and low memory footprint important to train bigger models;
● Localization: regression predicting coordinates of bounding boxes;
■ inputs: 256x5x5 (right after last pooling);
■ top-left, bottom-right, center, height/width (center not depend on scale);
■ fancier (similar to Yann’s face pose estimation);
● Detection: training with BG to avoid False Pos., trade-off btw pos./neg. accuracy.
OverFeat: Integrated Framework with CNN
(a): 20 pixel unpooled layer 5 feature map. (b): max
pooling over non-overlapping 3 pixel groups, using
offsets of Δ= {0, 1, 2} pixels (red, green, blue
respectively). (c): The resulting 6 pixel pooled maps for
different Δ. (d): 5 pixel classifier (layers 6,7) is applied in
sliding window fashion to pooled maps, yielding 2 pixel
by C maps for each Δ.(e): reshaped into 6 pixel by C
output maps.
Single (top) / multiple (bottom) output in detection.
Application example of the regression network.
R-CNN: Regions with CNN Features A framework for object detection with ConvNets;
One can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects;
Regions with CNN detection approach:
generates ~2000 category-independent regions for the input image,
extracts a fixed-length feature vector from each region using a CNN (built on cuda-convnet);
classifies each region with category-specific linear SVM.
R-CNN outperforms OverFeat on the ILSVRC2013 detection dataset, with a mAP of 31.4% vs. 24.3%.
Training: train feature extraction CNN on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL);
Pre-training: train on ImageNet
Replace last layer with FC layer to N+1 outputs (N classes + 1 “background”; VOC N=20, ILSVRC N=200 )
When labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
• Region detection: ~2000 regions;
• Each region is cropped and scaled to [227 x 227], then features are extracted with the ImageNet-trained CNN:
• 5 convolutional layers + 2 FC layers, giving 4096 features;
• SVM for 200 classes;
• Greedy non-maximum suppression for each class: rejects a region if it has an
intersection-over-union (IoU) overlap with a higher scoring selected region larger than a
learned threshold;
R-CNN: Regions with CNN Features
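Greedy non-maximum suppression, as described in the last bullet, can be sketched as below. Boxes are assumed to be in (x1, y1, x2, y2) format, and the fixed threshold of 0.3 is an illustrative stand-in for the learned threshold mentioned above.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Keep the best-scoring box, reject boxes overlapping it too much, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
```

In R-CNN this procedure is run independently for each class, so a region rejected for one class can still be kept for another.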
DeepID-Net: deformable CNNs for
Generic Object Detection
Bounding box proposal by selective search;
Bounding box rejection;
Pre-train a deep model: RCNN (Classification+Detection) with Clarifai-fast;
Pre-train on image-level annotation with 1000 classes;
Fine-tune on object-level annotation with 200 classes;
Gap: classification vs. detection, 1000 vs. 200;
A deformation constrained pooling layer: even for repeated patterns;
Modeling part detectors: different parts have different sizes;
Context modeling and model averaging;
Bounding box regression.
SPP-Net: Spatial Pyramid Pooling CNN Introduce a spatial pyramid pooling layer to replace the last pooling layer, placed on top of the last convolutional layer;
Adaptively sized pooling on shared conv feature maps;
Outperform Bag of Words in keeping spatial information;
Generate a fixed-length representation regardless of image size/scale;
Use multi-level spatial bins, robust to object deformations.
Training
• Size augmentation:
• ImageNet: 224x224 and 180x180
• Horizontal flipping
• Color altering
• Dropout on the last 2 FC layers
• Learning rate:
• Initialize as 0.01; divide by
10 when the error plateaus
ImageNet Detection
• Find ~2000 candidate windows (as in R-CNN)
• Extract the feature maps from the entire image only
once (possibly at multiple scales), as in OverFeat.
• Apply the spatial pyramid pooling on each candidate
window of the feature, which maps window to a fixed-
length representation
• Then 2 Fully Connected layers
• SVM
• ~170x faster than R-CNN
A network structure with a spatial pyramid pooling layer.
Pooling features from arbitrary windows on feature maps.
SPP-Net: Spatial Pyramid Pooling CNN
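The fixed-length property of spatial pyramid pooling can be sketched as follows: each pyramid level divides the feature map into n x n bins whose size adapts to the input, and the max of every bin is concatenated. The pyramid levels (1, 2, 4) and the function name are assumptions for this example.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """feature_map: (C, H, W). Max-pool into n x n bins per level and concatenate."""
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:
        h_edges = np.linspace(0, H, n + 1)
        w_edges = np.linspace(0, W, n + 1)
        for i in range(n):
            r0 = int(np.floor(h_edges[i]))
            r1 = int(np.ceil(h_edges[i + 1]))
            for j in range(n):
                c0 = int(np.floor(w_edges[j]))
                c1 = int(np.ceil(w_edges[j + 1]))
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)  # length C * sum(n * n for n in levels)

rng = np.random.default_rng(0)
fm_a = rng.standard_normal((8, 13, 17))  # feature map of one window
fm_b = rng.standard_normal((8, 6, 9))    # a window with a different size
vec_a = spatial_pyramid_pool(fm_a)
vec_b = spatial_pyramid_pool(fm_b)
```

Both windows yield a vector of length 8 x (1 + 4 + 16) = 168, which is why the fully connected layers can follow without cropping or warping the input.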
Fast R-CNN for Object Detection
• Simultaneously learns to classify object
proposals and refine their spatial
locations in a multi-task loss;
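• Initialization from a pre-trained network: the last max pooling
layer is replaced by an RoI pooling layer; the final FCL and
softmax are replaced by two sibling FCLs (softmax + bounding
box regressor); the input becomes images + RoIs;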
• Fine-tuning:
• Image centric sampling;
• Hierarchical mini-batch sampling;
• Jointly optimize softmax + BB regressor;
• Detection:
• Truncated SVD in FCL weight matrix.
An input image and multiple RoIs are input into a FCN.
Each RoI is pooled into a fixed-size feature map and
then mapped to a feature vector by FCLs. The network
has two output vectors per RoI: softmax probabilities and
per-class bounding-box regression offsets. The
architecture is trained end-to-end with a multi-task loss.
Training Region-based Object Detectors with
Online Hard Example Mining
Online hard example mining (OHEM) to train region-based ConvNet detectors.
Auto-selection of hard examples makes training more effective and efficient.
It eliminates several heuristics and hyperparameters in common use.
It yields consistent and significant boosts in detection performance on
benchmarks like PASCAL VOC 2007 and 2012.
Architecture of the Fast R-CNN approach
Training Region-based Object Detectors with
Online Hard Example Mining
Given an image and RoIs, the network computes a feature map. (a): a read-only RoI network runs the forward pass and the
hard RoI module uses the RoI losses to select B examples. (b): the hard examples are used by the RoI network for forward and backward passes.
Faster R-CNN with RPN for Object Detection
• Region Proposal Network (RPN): a FCN
• Share the conv. layers with detection network;
• Learning: image centric sampling with BP and
SGD as Fast R-CNN, initialized with ImageNet pre-
trained model, fine-tuned end-to-end for region
proposal task;
• Fast R-CNN:
• Learn using the proposals from RPN, also
initialized by an ImageNet pre-trained model;
• Joint RPN + Fast R-CNN:
• Use the detector network to initialize RPN with
fixed shared conv layers and fine tune RPN;
• Finally, fine tune the FCL of Fast R-CNN.
Region Proposal Network Encode conv. map positions and
output classified objectness score +
regressed bounds for k=9 region
proposals;
Reference boxes (k=9 anchors) from 3
scales x 3 aspect ratios: translation
invariant;
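A sketch of how k = 9 reference boxes can be generated and replicated over the conv feature map. The concrete scales (box sides of 128, 256 and 512 pixels), aspect ratios (0.5, 1, 2), stride of 16 and the function names are assumptions made for this example.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at the origin."""
    anchors = []
    for ratio in ratios:                 # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Place the same k anchors at every feature map position."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx.ravel(), cy.ravel(), cx.ravel(), cy.ravel()], axis=1)
    return (shifts[:, None, :] + anchors[None, :, :]).reshape(-1, 4)

anchors = make_anchors()
all_anchors = shift_anchors(anchors, feat_h=3, feat_w=4)
```

Because every position reuses the same set of boxes, shifting an object in the image shifts its matching anchor by the same amount, which is the translation invariance noted above.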
A Unified Multi-scale Deep CNN for Fast Object Detection
A unified deep neural network, denoted the multi-scale CNN (MS-
CNN) for fast multi-scale object detection.
Consists of a proposal sub-network and a detection sub-network.
In the proposal sub-network, detection is performed at multiple output
layers, so that receptive fields match objects of different scales.
These complementary scale-specific detectors are combined to
produce a strong multi-scale object detector.
The network is learned end-to-end, by optimizing a multi-task loss.
Feature upsampling by deconvolution as an alternative to input
upsampling, to reduce the memory and computation costs.
A Unified Multi-scale Deep CNN for Fast Object Detection
The cubes - output tensors. h × w - filter size, c - # classes, b - # bounding box coordinates.
A Unified Multi-scale Deep CNN for Fast Object Detection
Different strategies for multi-scale detection (the template size)
A Unified Multi-scale Deep CNN for Fast Object Detection
Object detection sub-network of the MS-CNN. “trunk CNN layers” are shared with proposal
sub-network. The green (blue) cubes represent object (context) region pooling. “class
probability” and “bounding box” are the outputs of the detection sub-network.
SSD: Single Shot MultiBox Detector
Discretize the output space of bounding boxes into a set of default boxes over
different aspect ratios and scales per feature map location;
At prediction time, generate scores for the presence of each object category in
each default box and produce adjustments to the box to better match the object
shape;
Combine predictions from multiple feature maps with different resolutions to
naturally handle objects of various sizes;
Eliminates proposal generation and subsequent pixel or feature resampling stage
and encapsulates all computation in a single network.
SSD: Single Shot MultiBox Detector
(a) only needs an input image and ground truth boxes for each object during
training. In a convolutional fashion, evaluate a small set of default boxes of
different aspect ratios at each location in several feature maps with different
scales. For each default box, predict both the shape offsets and the confidences
for all object categories. At training time, first match these default boxes to the
ground truth boxes. The model loss is a weighted sum between localization loss
(e.g. Smooth L1) and confidence loss (e.g. Softmax).
You Only Look Once (YOLO) for Object Detection
The YOLO Detection System
The system models detection as a regression problem
to a 7 × 7 × 24 tensor. This tensor encodes bounding boxes
and class probabilities for all objects in the image.
The network uses strided conv. layers to
downsample the feature space instead of
maxpooling layers. Pre-train the conv. layers
on the ImageNet classification task and then
double the resolution for detection.
Note: More localization errors, relatively low recall.
YOLO9000: Better, Faster, Stronger
Detect over 9000 object categories: http://pjreddie.com/yolo9000/;
YOLOv2 on VOC 2007: 76.8 mAP at 67 FPS; 78.6 mAP at 40 FPS;
Jointly train on object detection COCO and classification ImageNet;
Batch Normalization: 2% improvement in mAP;
High Resolution Classifier: full 448 × 448 resolution, almost 4% up in mAP;
Convolutional With Anchor Boxes: use anchor boxes to predict bounding boxes;
Dimension Clusters: k-means on the training set bounding boxes to
automatically find good priors to adjust the boxes appropriately;
Direct location prediction: predict location relative to location of the grid cell;
Fine-Grained Features: 13 × 13 map, plus a passthrough layer bringing in features from the 26 × 26 resolution;
Multi-Scale Training: every 10 batches, randomly choose a new image dimension size.
YOLO9000: Better, Faster, Stronger
Based on Googlenet architecture, faster than VGG-16;
Darknet-19: 19 convolutional layers and 5 maxpooling layers;
Training for classification: Darknet, data augmentation;
Training for detection: remove the last conv. layer, add on three 3 × 3 conv.
layers with 1024 filters each followed by a final 1 × 1 conv. layer;
Hierarchical classification: WordNet, -> WordTree, a model of visual concepts;
Dataset combination with WordTree: combine labels from ImageNet & COCO;
Joint classification and detection: use the COCO detection dataset and the top
9000 classes from the full ImageNet release;
YOLO9000: WordTree with 9418 classes.
DeepBox: Learning Objectness with CNN
DeepBox uses CNNs to rerank proposals from a bottom-up method;
A four-layer CNN architecture that is as good as much larger networks on the task
of evaluating objectness while being much faster;
DenseBox: Landmark Localization and Object Detection
Fully convolutional neural network (FCN);
Directly predicts bounding boxes and object class confidences through all
locations and scales of an image;
Improve accuracy with landmark localization during multi-task learning.
Pipeline: 1) An image pyramid is fed to the network. 2) After several layers of convolution and pooling,
the feature map is upsampled and convolution layers are applied to get the final output. 3) Convert output feature
map to bounding boxes, and apply non-maximum suppression to all bounding boxes over the threshold.
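Step 3 of the pipeline (thresholding followed by non-maximum suppression) can be sketched in NumPy as below. This is a generic greedy NMS, not DenseBox's own code; the `(x1, y1, x2, y2)` box layout and both threshold values are assumptions for the example.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Keep boxes above score_thresh, then greedily suppress overlaps."""
    idx = np.where(scores > score_thresh)[0]
    idx = idx[np.argsort(-scores[idx])]      # highest score first
    keep = []
    while idx.size > 0:
        best = idx[0]
        keep.append(int(best))
        rest = idx[1:]
        # drop every remaining box that overlaps the kept one too much
        idx = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```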
DenseBox: Landmark Localization and Object Detection
DenseBox
Densebox with landmark localization
R-FCN: Object Detection via Region-based
Fully Convolutional Networks
Position-sensitive score maps to handle conflict of translation-invariance in
image classification and translation-variance in object detection;
https://github.com/daijifeng001/r-fcn
Overall architecture of R-FCN. An
RPN proposes candidate RoIs, which
are then applied on the score maps. All
learnable weight layers are
convolutional and are computed
on the entire image; the per-RoI
computational cost is negligible.
R-FCN: Object Detection via Region-based
Fully Convolutional Networks
Key idea of R-FCN for object detection. k × k = 3 × 3 position-sensitive score maps generated
by an FCN. For each of the k × k bins in an RoI, pooling is only performed on one of the k² maps.
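A minimal sketch of this position-sensitive pooling for a single class follows. It assumes average pooling inside each bin, average voting across bins, and integer bin boundaries; the function name and the array layout are hypothetical.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling for one class.

    score_maps: (k*k, H, W) array, one map per relative position.
    roi: (x1, y1, x2, y2) in feature-map pixels, end-exclusive.
    Returns the k x k pooled bins and their average (the class vote).
    """
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, k + 1).astype(int)
    ys = np.linspace(y1, y2, k + 1).astype(int)
    bins = np.zeros((k, k))
    for i in range(k):                      # bin row
        for j in range(k):                  # bin column
            m = score_maps[i * k + j]       # only the map tied to bin (i, j)
            patch = m[ys[i]:max(ys[i + 1], ys[i] + 1),
                      xs[j]:max(xs[j + 1], xs[j] + 1)]
            bins[i, j] = patch.mean()
    return bins, bins.mean()
```

Because each bin reads a different map, shifting the RoI changes which activations vote, which is how translation variance is reintroduced.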
LocNet: Improving Localization Accuracy for Object Detection
Object localization aiming at boosting the localization accuracy.
The model, given a search region, aims at returning the bounding box of an object of interest inside this region.
Assign conditional prob. to each row and column of this region, where these prob. provide useful info. regarding loc. of boundaries of the object inside the search region and allow accurate inference of the object bounding box under a simple prob. framework.
A CNN architecture adapted for this task, called LocNet.
LocNet: Improving Localization Accuracy for Object Detection
Illustration of the basic work-flow of the localization
module. Left column: given a candidate
box B (yellow box), the model “looks” at a search region R
(red box), which is obtained by enlarging box B by a
constant factor, in order to localize the bounding box
of an object of interest. Right column: To localize a
bounding box the model assigns one or more
probabilities on each row and independently on each
column of region R. Those prob. can be either the
prob. of an element (row or column) to be one of the
four object borders (see top-right image), or the
probability for being on the inside of an object’s
bounding box (see bottom-right image). In either
case the predicted bounding box is drawn with blue
color.
LocNet: Improving Localization Accuracy for Object Detection
The posterior prob. that the loc. model yields given a region R. Left Image: the in-
out conditional prob. assigned on each row (py) and column (px) of R. Drawn with
blue curves on the right and the bottom side. Right Image: the conditional prob. pl,
pr, pt, and pb of each column or row to be the left (l), right (r), top (t) and bottom (b)
border of an object’s bounding box. They are drawn with blue and red curves on the
bottom and on the right side of the search region.
LocNet: Improving Localization Accuracy for Object Detection
Visualization of the LocNet network architecture
The processing starts by forwarding the entire image I through a sequence of convolutional layers that outputs
the activation maps A_I. Then, the region R is projected onto A_I and the activations that lie inside it are
cropped and pooled with a spatially adaptive max-pooling layer. The resulting fixed-size activation maps are
forwarded through two convolutional layers, each followed by ReLU non-linearities, that yield the
localization-aware activation maps A_R of region R. The network is split into two branches, X and Y, with
each dedicated to the predictions that correspond to the dimension (x or y respectively) that is
assigned to it. The resulting activations A^x_R and A^y_R efficiently encode the object location only across the
dimension that their branch handles. Finally, each of those aggregated features is fed into the final fully
connected layer followed by sigmoid units in order to output conditional prob. of its assigned dimension.
A Discriminative Deep Model for Pedestrian
Detection with Occlusion Handling
Joint Deep Learning for Pedestrian Detection
Learned filters at the second convolutional layer; part models
Joint Deep Learning for Pedestrian Detection
• Visibility Reasoning with Deep Belief Net
Deformation Layer
Multi-Stage Contextual Deep Learning for Pedestrian Detection
multi-stage contextual deep model
The proposed deep Learning Architecture.
Apply different filters Fi on the same feature
map f and obtain different score maps si.
Pedestrian Detection aided by Deep Learning Semantic Tasks
Jointly optimizes pedestrian detection with semantic tasks, including pedestrian attributes (‘carrying
backpack’) and scene attributes (‘road’, ‘tree’, and ‘horizontal’);
A multi-task objective function is designed to coordinate tasks and reduce discrepancies among datasets,
and a deep model, the task-assistant CNN (TA-CNN), learns high-level features from multiple tasks and
multiple data sources.
CNN-based Pose Classification
One convolutional neural net is trained on semantic part patches for each poselet and then the top-
level activations of all nets are concatenated to obtain a pose-normalized deep representation. The
final attributes are predicted by linear SVM classifier using the pose-normalized representations.
Given an image, a human body detector is used to find the bounding box around the human. Next, a
convolutional neural network (CNN) extracts shared features from the cropped image, and the shared
features are the inputs to the joint point regression tasks and the body-part detection tasks. The CNN,
regression, and detection tasks are learned simultaneously, resulting in a shared feature representation.
Heterogeneous Multi-task Learning for
Pose Estimation
A Multi-source Deep Model for Pose Estimation
How to generate multiple candidate locations: A
candidate is used as the input to a deep model
to determine whether the candidate is correct
and estimate body locations.
A multi-source deep model for constructing the
non-linear representation from three information
sources: mixture type, appearance score and
deformation.
Joint Training of a Convolutional Network and a
Graphical Model for Human Pose Estimation
A hybrid architecture that consists of a deep CNN and a MRF.
The architecture exploits structural domain constraints such as geometric relationships between body joint locations.
Joint training of these two model paradigms improves performance.
Multi-Resolution Sliding-Window With Overlapping Receptive Fields
Joint Training of a Convolutional Network and a
Graphical Model for Human Pose Estimation
Efficient Sliding Window Model with Single Receptive Field
Joint Training of a Convolutional Network and a
Graphical Model for Human Pose Estimation
Efficient Sliding Window Model with Single Receptive Field
Efficient Sliding Window Model with Overlapping Receptive Fields
Joint Training of a Convolutional Network and a
Graphical Model for Human Pose Estimation
Efficient Sliding Window Model with Single Receptive Field
Approximated Efficient Sliding Window Model with Overlapping Receptive Fields
Stacked Hourglass Networks for Human Pose Estimation
A convolutional network architecture for human pose estimation.
Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body.
Run repeated bottom-up, top-down processing in conjunction with intermediate supervision to improve the performance of the network.
A “stacked hourglass” network based on successive steps of pooling-and-upsampling.
A network consists of multiple stacked hourglass modules which allow for repeated bottom-up, top-down inference
The person’s orientation, arrangement of their limbs, and the relationships of adjacent joints are best recognized at different scales in the image.
The hourglass is a simple, minimal design with the capacity to capture all these features and bring them together to output pixel-wise predictions. Use a pipeline with skip layers to preserve spatial info. at each resolution.
It reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.
Stacking multiple hourglasses end-to-end, feeding the output of one as input into the next, which provides a mechanism for repeated bottom-up, top-down inference allowing for reevaluation of initial estimates and features across the whole image;
The key: prediction of intermediate heatmaps upon which to apply a loss.
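The intermediate supervision idea can be sketched as one loss term per hourglass, all compared against the same ground-truth heatmaps. The sketch below assumes a mean-squared-error loss and equal weighting of the stages; the function name is hypothetical.

```python
import numpy as np

def intermediate_supervision_loss(stage_heatmaps, target):
    """Sum an MSE heatmap loss over the predictions of every stacked stage.

    stage_heatmaps: list of (J, H, W) arrays, one per hourglass.
    target: (J, H, W) ground-truth heatmaps shared by all stages.
    """
    return sum(float(np.mean((h - target) ** 2)) for h in stage_heatmaps)
```

Each stage therefore receives its own gradient signal instead of relying only on the loss at the final output.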
Stacked Hourglass Networks for Human Pose Estimation
An illustration of a single “hourglass” module. Each box corresponds to a
residual module. # of features is consistent across the whole hourglass.
Residual Module in the network The intermediate supervision process
Stacked Hourglass Networks for Human Pose Estimation
The change in predictions
from an intermediate
stage (second hourglass)
(left) to final predictions
(eighth hourglass) (right).
Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields
Efficiently detect the 2D poses of multiple people.
A nonparametric representation, Part
Affinity Fields (PAFs), to learn to associate
body parts with individuals in the image.
The architecture encodes global context,
allowing a greedy bottom-up parsing step
that maintains high accuracy while
achieving real-time performance.
The architecture jointly learns part locations
and their association via two branches of
the same sequential prediction process.
Top: Multi-person pose estimation.
Bottom left: Part Affinity Fields (PAFs) to the
limb connecting right elbow and right wrist.
Bottom right: zoom-in view of predicted PAFs.
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Overall pipeline. The entire image as input for a two-branch CNN to jointly predict confidence maps for
body part detection in (b), and part affinity fields for parts association in (c). A set of bipartite matchings to
associate body parts candidates (d). Assemble them into full body poses for all people in the image (e).
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Architecture of the two-branch multi-stage CNN. Each
stage in the first branch predicts confidence maps, and
each stage in the second branch predicts PAFs. After each
stage, the predictions from the two branches, along with
the image features, are concatenated for next stage.
a set of detection confidence maps
a set of part affinity fields
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Confidence maps of the right wrist (top) and
PAFs (bottom) of right forearm across stages.
Although there is confusion between left and right body
parts and limbs in early stages, estimates are
increasingly refined through global inference in
later stages, as shown in the highlighted areas.
Part association strategies. (a) body part detection
candidates (red and blue dots) for two body part types
and all connection candidates (grey lines). (b)
connections using the midpoint (yellow dots) representation:
correct connections (black lines) and incorrect
connections (green lines) that also satisfy incidence
constraint. (c) Results using PAFs (yellow arrows). By
encoding position and orientation over the support of
the limb, PAFs eliminate false associations.
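One way to picture how a PAF scores a candidate connection is to average the field's projection onto the candidate limb direction at points sampled along the segment. The sketch below assumes a `(H, W, 2)` field layout, nearest-pixel sampling, and ten samples; the function name is hypothetical.

```python
import numpy as np

def paf_score(paf, p1, p2, n_samples=10):
    """Approximate the line integral of a PAF along the segment p1 -> p2.

    paf: (H, W, 2) array holding an (x, y) vector at each pixel.
    p1, p2: candidate part locations as (x, y).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    unit = d / norm                      # candidate limb direction
    total = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        x, y = np.rint(p1 + u * d).astype(int)   # nearest pixel on the segment
        total += float(paf[y, x] @ unit)         # alignment with the field
    return total / n_samples
```

A candidate that crosses the limb's support in the right direction scores near 1, while a midpoint-style false match that leaves the support scores near 0.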
Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields
Graph matching. (a) image with part detections (b) K-partite graph (c) Tree (d) A set of bipartite graphs
Multi-view Face Detection using CNNs: Deep Dense Face Detector
No face pose/landmark annotation, segmentation, bounding box regression;
Detect faces from different view angles and with occlusion to some extent;
Can improve based on better sampling strategies and data augmentation;
Sliding window-based detector with tuned AlexNet and data augmentation;
Size 227x227, 50k iterations, batch size 128 (32 pos + 96 neg);
Convert the fully connected layer to a convolutional layer by reshaping the layer parameters;
Run CNN to get the heat map of face classifiers for refined localization;
Improve the face localization using a bounding box regression module.
Comparison with R-CNN: the best R-CNN classifier is inferior;
Loss of recall (miss of selective search);
loss of localization (bounding box regression).
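The conversion of a fully connected layer into a convolutional one, mentioned in the list above, amounts to reshaping each weight row into a kernel and sliding it over a larger feature map. A minimal NumPy sketch with hypothetical names and a channels-first layout:

```python
import numpy as np

def fc_as_conv(feature_map, fc_weights, kh, kw):
    """Slide a fully connected layer over a larger feature map.

    feature_map: (C, H, W) array.
    fc_weights: (N, C*kh*kw) weights trained on (C, kh, kw) inputs.
    Returns an (N, H-kh+1, W-kw+1) heat map, one channel per output unit.
    """
    c, h, w = feature_map.shape
    kernels = fc_weights.reshape(-1, c, kh, kw)   # reshape only, no retraining
    out = np.zeros((kernels.shape[0], h - kh + 1, w - kw + 1))
    for y in range(out.shape[1]):
        for x in range(out.shape[2]):
            window = feature_map[:, y:y + kh, x:x + kw]
            out[:, y, x] = np.tensordot(kernels, window, axes=3)
    return out
```

Each output location equals what the original fully connected layer would return on that window, so one forward pass yields a dense heat map.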
Multi-view Face Detection using CNNs: Deep Dense Face Detector
A set of faces with different out-of-
plane rotations and occlusions. Heat map of face classifier
Facial Landmark Detecting by Deep Multi-task Learning
• Face attribute would help in detecting facial landmarks;
• Task-wise early stop in Multi-task learning;
• Training data: Multi-task Facial Landmark dataset;
• Testing data: AFLW (annotated facial landmarks in the wild);
• Points: extend 5 points to 68 points.
Face Alignment by Deep Regression
• Apply a global layer and multi-stage local layers;
• Sequential learning, joint learning.
(a) The network takes face
image as input and outputs
shape estimation. The global
layer estimates initial shape
and the rest local layers refine
the estimation iteratively. (b)
Inner structure of the global
layer; (c) Inner structure of
the tth local layer.
DeepFace for Face Verification at Facebook
• Fiducial points are extracted by a Support
Vector Regressor (SVR) trained to predict point
configurations from an image descriptor;
• 2D alignment: 2D similarity transformation;
• 3D alignment: Based on a generic 3D shape
model, register a 3D affine camera by back-
projecting the frontal face plane of the 2D-
aligned crop to the image plane of the 3D
shape.
Learn the Deep Face Representation: Face++
• Megvii Face Classification (MFC) database: 5 million labeled faces with 20000 people;
• 10-layer CNN for four cropped-face-region-based feature extractors, used in softmax-based
training and PCA + L2 norm verification;
• 99.50% accuracy on the LFW benchmark.
Learn the Deep Face Representation: Face++
• Pyramid CNN adopts a greedy-filter-and-down-sample operation, which enables the
training procedure to be very fast and computation efficient.
• Its structure can naturally incorporate feature sharing across multi-scale face
representations, increasing the discriminative ability of resulting representation.
DeepID: Deep Learning Face Representation
Deep hidden identity features (DeepID) for face verification and identification;
Features are taken from the last hidden layer neuron activations of deep CNN;
The proposed features are extracted from various face regions to form
complementary and over-complete representations;
Integrated with Joint Bayesian for face verification: 97.45% accuracy on LFW
dataset.
Structure of the neural network
used for face verification ConvNet structure used for DeepID
DeepID2: Deep Learning Face Representation
DeepID2: joint face verification and identification to reduce former-cared intra-
personal variations and enlarge the latter-cared inter-personal variations;
Deep CNN: in 55x47, 4 conv. + 3 max pooling, ReLU, 160-D feature vector;
Learning by SGD;
99.15% for verification rate on LFW dataset.
DeepID2+: Deep Learning Face Representation
DeepID2+ net and supervisory signals
• DeepID2+: increasing the dimension of
hidden representations and adding
supervision to early convolutional layer;
• Sparse neural activations, selective
neurons in higher layers and robustness
to occlusions;
• Get larger with 128 feature maps in each
conv. layer;
• Supervisory signals are only added to one
fully-connected layer from 3rd & 4th conv.
layers; the lower conv. layers can only get
supervision from higher layers;
• 99.47% and 93.2% for verification rates on
LFW and Youtube dataset.
DeepID3: Face Recognition with Very Deep Neural Network
• Apply stacked
convolution and
inception layers
proposed in VGG Net
and GoogLeNet to make
them suitable to face
recognition;
• An ensemble of
proposed two
architectures achieves
LFW face verification
accuracy 99.53% and
LFW rank-1 face
identification accuracy
96.0%, respectively.
FaceNet: A Unified Embedding for Face Recognition and Clustering
Learn a Euclidean 128-D embedding per image using a deep CNN;
L2 distances -> face similarity;
Face verification: thresholding the distance btw. two embeddings;
Face recognition: k-NN classification;
Face clustering: k-means or agglomerative clustering;
Triplet based loss function based on LMNN (Large Margin Nearest Neighbor);
Apply a new online negative exemplar mining strategy;
Apply hard positive mining strategy in face clustering.
Model Structure The Triplet Loss
FaceNet: A Unified Embedding for Face Recognition and Clustering
The Triplet Loss minimizes the distance btw an
anchor and a positive, both of which have the same
identity, and maximizes the distance between the
anchor and a negative of a different identity.
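L = \sum_i [ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha ]_+ , where f is the embedding, x_i^a, x_i^p and x_i^n are the anchor, positive and negative of triplet i, and \alpha is the margin.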
Instead of picking the hardest positive, use all
anchor-positive pairs in a mini-batch while still
selecting the hard negatives.
Selecting the hardest negatives
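A minimal NumPy sketch of the triplet loss described on this slide, using squared L2 distances between embeddings. The margin value and the function name are assumptions for the example, and the mining of hard negatives is left out.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (anchor-positive distance) minus (anchor-negative distance).

    anchor, positive, negative: (B, D) arrays of embeddings.
    margin: illustrative value, treated here as an assumption.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).sum()
```

Triplets whose negative is already farther than the positive by at least the margin contribute zero, which is why selecting hard negatives matters for training progress.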
FaceNet: A Unified Embedding for Face Recognition and Clustering
CNN: MattNet and GoogleNet (Inception);
Training: use SGD with standard BP and AdaGrad;
Performance: LFW (98.87%, 99.63% with alignment) and Youtube (95.12%).
MattNet GoogleNet
Note: at GTC’15, Andrew Ng announced that Baidu’s verification rate on the LFW dataset is 99.85%!
Text Detection and Recognition based on CNNs
The end-to-end text spotting pipeline. a) A combination of region proposal methods extracts many word
bounding box proposals. b) Proposals are filtered with a random forest classifier, reducing the number of false-
positive detections. c) A CNN is used to perform bounding box regression for refining the proposals. d) A CNN
performs text recognition on each of the refined proposals. e) Detections are merged based on proximity and
recognition results and assigned a score. f) Thresholding the detections results in the final text spotting result.
Text Detection and Recognition based on CNNs
A schematic of the CNNs used showing the dimensions of the feature maps at each stage for (a)
dictionary encoding, (b) character sequence encoding, and (c) bag-of-N-gram encoding. The same
five-layer, base CNN architecture is used for all three models.
Multi-digit Number Recognition from Street
Views using Deep CNNs
Unify the localization,
segmentation, and recognition steps
via the use of a deep CNN that
operates directly on the image pixels;
Train a probabilistic model of
sequences given images;
Extract a set of features H from the
image X using a CNN with a fully
connected final layer;
Six separate softmax classifiers are
then connected to this feature vector
H, i.e., each softmax classifier forms a
response by making an affine
transformation of H and normalizing
this response with the softmax
function.
Scene Parsing with Multiscale Feature Learning
• Compute a tree of segments from a graph of pixel dissimilarities.
• A set of dense feature vectors encodes regions of multiple sizes centered on each pixel.
• The feature extractor is a multi-scale trained convolutional network.
• The feature vectors associated with the segments covered by each node in the tree are
aggregated and fed to a classifier which produces an estimate of the distribution of
object categories contained in the segment.
• A subset of tree nodes that cover the image are then selected so as to maximize the average
“purity” of the class distributions, hence maximizing the overall likelihood that each segment will
contain a single object.
• The convolutional network feature extractor is trained end-to-end from raw pixels,
alleviating the need for engineered features. After training, the system is parameter free.
• From a pool of segmentation components, retrieve an optimal set of components that
best explain the scene, taken from a segmentation tree or any family of over-
segmentations.
Scene Parsing with Multiscale Feature Learning
(a) Method 1
(b) Method 2
Simultaneous Detection and Segmentation (SDS)
• Apply R-CNN to classify category-independent region proposals, and use category-
specific top-down figure ground predictions to refine the bottom up proposals;
• Proposal generation: category-independent bottom-up by MCG;
• Feature extraction for each region by R-CNN;
• Region classification by SVM and assign a score for each category to each candidate.
• Region refinement (non maximum suppression) on the scored candidates.
Learning Rich Features from RGB-D Images for Object Detection and Segmentation
• Geocentric embedding for depth images that encodes height above ground and angle
with gravity for each pixel in addition to the horizontal disparity.
• Better than using depth images for learning feature representations with CNN.
• A decision forest approach that classifies pixels as FG or BG using a family of unary
and binary tests that query shape and geocentric pose features.
Fully Convolutional Networks for Semantic Segmentation
• Build “fully convolutional” networks that take input of arbitrary size and produce
correspondingly-sized output with efficient inference and learning;
• Adapt the classification nets into fully convolutional network and transfer their learned
representations by fine tuning (after a supervised pre-training);
• Combine semantic info. from a deep coarse layer with appearance info. from a
shallow fine layer to produce accurate and detailed segmentation;
Transforming fully connected layers into convolution layers. Fully convolutional networks can efficiently learn to make
dense predictions for per pixel semantic segmentation.
Fully Convolutional Networks for Semantic Segmentation
• While a general deep net computes a general nonlinear function, a net with only layers of this form
computes a nonlinear filter, called a deep filter or fully convolutional network;
• Upsampling by deconvolution is more efficient and effective to yield dense predictions;
• Input shifting and output interlacing without interpolation (proposed by OverFeat);
• Changing only the filters and layer strides of a convnet can produce the same output as the “shift-and-stitch” trick.
• Define a FCN combining layers of the feature hierarchy
and refining the spatial precision of the output;
• Add links combining the final prediction layer with lower layers
with finer strides, turning into a DAG, not layer-wise, with edges
skipping ahead from lower layers to higher ones;
• Learn the skip net which improves the perf.
• Note: Decreasing the stride of pooling layers yields finer
predictions, but the following conv. layers then need larger
kernels, which raises the learning cost.
Fully Convolutional Networks for Semantic Segmentation
DAG nets learn to combine coarse, high layer information with fine, low layer information.
Solid line (FCN-32s): A single-stream net upsamples stride 32 predictions back to pixels.
Dashed line (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at
stride 16, lets the net predict finer details, while retaining high-level semantic information.
Dotted line (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision.
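The skip fusion behind FCN-16s and FCN-8s can be sketched as "upsample the coarse score map by 2, then add the score map predicted from the finer layer". The sketch below substitutes nearest-neighbour upsampling for the learned deconvolution, which is a simplifying assumption; the function names are hypothetical.

```python
import numpy as np

def upsample2x(score):
    """Nearest-neighbour 2x upsampling of a (C, H, W) score map.

    Stand-in for the learned (bilinear-initialized) deconvolution.
    """
    return score.repeat(2, axis=1).repeat(2, axis=2)

def fuse_skip(coarse_score, fine_score):
    """Upsample the coarse prediction and add the finer-stride prediction."""
    return upsample2x(coarse_score) + fine_score
```

Applying `fuse_skip` once on stride-32 and pool4 scores mirrors FCN-16s; applying it again with pool3 scores mirrors FCN-8s.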
Deep Jet (nonlinear local feature
hierarchy): make local predictions while
respecting global structure.
DeepLab: CNN + CRF for Semantic Segmentation
• Learning DCNNs for semantic image segmentation from either (1) weakly
annotated training data such as bounding boxes or image-level labels or (2) a
combination of few strongly labeled and many weakly labeled images, sourced
from one or multiple datasets;
• Use DCNN to predict the label distribution per pixel, followed by a
fully-connected (dense) CRF to smooth the predictions while preserving image edges;
• Expectation-Maximization (EM) methods for semantic image segmentation
model training under these weakly supervised and semi-supervised settings.
DeepLab: CNN + CRF for Semantic Segmentation
DeepLab model training using image-level labels
DeepLab: CNN + CRF for Semantic Segmentation
DeepLab model training from bounding boxes
DeepLab model training on a union of full (strong
labels) and image-level (weak labels) annotations
ParseNet: Looking Wider to See Better
• Global context to CNN for semantic segmentation;
• Using the average feature for a layer to augment the
features at each location;
• Learning normalization parameters for improvement.
• Early fusion: unpool (replicate) global feature to the
same size as of local feature map spatially and then
concatenate them, and use the combined feature to
learn the classifier;
• Late fusion: each feature is used to learn its own
classifier, followed by merging the two predictions
into a single classification score;
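Early fusion as described above can be sketched in NumPy: average the feature map over space, replicate the result to every location, normalize both parts, and concatenate. The plain L2 normalization here stands in for the learned normalization parameters, which is an assumption; the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each spatial location's channel vector to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=0, keepdims=True) + eps)

def early_fusion(local_feat):
    """Concatenate a (C, H, W) map with its unpooled global average."""
    c, h, w = local_feat.shape
    global_feat = local_feat.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    unpooled = np.broadcast_to(global_feat, (c, h, w))          # replicate
    return np.concatenate([l2_normalize(local_feat),
                           l2_normalize(unpooled)], axis=0)
```

Normalizing before concatenation keeps the global part from being swamped when its scale differs from that of the local features.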
SegNet: Deep Conv. Encoder-Decoder Architecture
A decoder upsamples its input using the transferred pool indices from its
encoder to produce a sparse feature map(s);
It then performs convolution with a trainable filter bank to densify the
feature map;
The final decoder output feature maps are fed to a soft-max classifier for
pixel-wise classification.
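The index-based upsampling can be sketched for a single 2D map with 2 × 2 pooling: the encoder remembers where each maximum came from, and the decoder writes the pooled value back to that position, leaving zeros elsewhere. Function names are hypothetical and even input dimensions are assumed.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling of an (H, W) map, returning values and argmax indices."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))   # 4 candidates per window
    return blocks.max(axis=2), blocks.argmax(axis=2)

def unpool_with_indices(pooled, idx):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    h, w = pooled.shape
    blocks = np.zeros((h, w, 4), dtype=pooled.dtype)
    np.put_along_axis(blocks, idx[..., None], pooled[..., None], axis=2)
    return (blocks.reshape(h, w, 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(2 * h, 2 * w))
```

The unpooled map is sparse; the trainable decoder filters that follow are what densify it.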
SegNet: Deep Conv. Encoder-Decoder Architecture
• SegNet uses the max pooling indices to upsample (without learning) the feature map(s) and
convolves with a trainable decoder filter bank.
• FCN upsamples by learning to deconvolve the input feature map and adds the corresponding
encoder feature map to produce the decoder output; this feature map is the output of the
max-pooling layer (includes sub-sampling) in the corresponding encoder.
• Note that there are no trainable decoder filters in FCN.
Mask R-CNN for Object Instance Segmentation
Detect objects in an image while generating a segmentation mask for each instance.
Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
Simple to train and adds a small overhead to Faster R-CNN, running at 5 fps.
Easy to generalize to other tasks, allowing to estimate poses in the framework.
Mask R-CNN for Object Instance Segmentation
Head Architecture: Left/Right for the heads for the ResNet C4 and FPN backbones, to which
a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote
either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial
dimension while deconv increases it). All convs are 3×3, except the output conv which is
1×1, deconvs are 2×2 with stride 2, and use ReLU in hidden layers. Left: ‘res5’ denotes
ResNet’s 5th stage, which for simplicity we altered so that the first conv operates on a 7×7
RoI with stride 1. Right: ‘×4’ denotes a stack of four consecutive convs.
Mask R-CNN for Object Instance Segmentation
Extended to human pose estimation: Model a keypoint’s location as a one-hot mask,
and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left
shoulder, right elbow).
Minimal domain knowledge for human pose is exploited by the system.
For each of the K keypoints of an instance, the training target is a one-hot m × m binary mask
where only a single pixel is labeled as FG.
During training, for each visible ground-truth keypoint, minimize the cross-entropy loss over an
m²-way softmax output.
As in instance segmentation, the K keypoints are treated independently.
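The keypoint loss can be sketched for one keypoint as a softmax over all m × m positions with the ground-truth pixel as the single correct class. The function name and the `(x, y)` argument convention are assumptions for the example.

```python
import numpy as np

def keypoint_loss(logits, kp_xy):
    """Cross-entropy over an m*m-way softmax for one keypoint.

    logits: (m, m) raw scores from the keypoint head.
    kp_xy: (x, y) ground-truth pixel inside the m x m grid.
    """
    m = logits.shape[0]
    flat = logits.reshape(-1)
    flat = flat - flat.max()                       # numerical stability
    log_prob = flat - np.log(np.exp(flat).sum())   # log-softmax over m*m cells
    x, y = kp_xy
    return -log_prob[y * m + x]
```

Summing this term over the visible keypoints of an instance gives a per-instance loss in which the K keypoints are handled independently.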
The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a
deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56.
Train the models using image scales randomly sampled from [640, 800] pixels;
inference is on a single scale of 800 pixels. 90k iterations, with a learning rate of 0.02 reduced by a factor of 10 at 60k and 80k iterations.
Bounding-box non-maximum suppression with a threshold of 0.5.
Tracking with Deep Neural Networks
Use a ConvNet with filter banks trained either supervised or unsupervised;
Each layer consists of three sequential operations as follows:
Spatial convolutions with kernels;
tanh non-linearity;
Pooling.
Images are transformed into the feature vectors at the output of ConvNet;
A RBFN is used to produce confidence map based on the distance btw the feature
vectors and a reference vector.
Learning Representation for Visual Tracking
Train a stacked Denoising Autoencoder (SDAE) to learn generic features
followed by knowledge transfer from offline training to online tracking;
More robust against variations;
Unsupervised feature learning by training an SDAE with auxiliary image;
Layer-by-layer pretraining is first applied and then the whole SDAE is fine-tuned;
Online tracking involves a classification NN constructed from the encoder part
as a feature extractor and an additional classification layer;
Tuned to adapt to appearance changes of the moving object;
An additional classification layer is added to the encoder part of the trained SDAE.
Some filters in the first layer of the learned SDAE
Learning Representation for Visual Tracking
denoising autoencoder stacked denoising autoencoder network for online tracking.
Visual Tracking with a Single CNN
A truncated structural loss function maintains as many training samples as possible and reduces risk
of tracking error by accommodating uncertainty of model output;
Enhance the ordinary SGD approach in training with a temporal selection mechanism, which
generates positive and negative samples within different time periods;
The architecture of CNN tracker with multiple image cues
The bottom row shows the three-stages operations on a frame: test, estimation and training.
In the training frames, the green bounding-boxes are the negative samples while the red ones
denote the positive samples. The dashed block covers the positive sample pool (red) and
negative sample pool (green). In each pool, the edges of the sample patches indicate their
sampling importance. The thicker the edge, the more likely the sample is to be selected for training.
Transferring Rich Feature Hierarchies for
Robust Visual Tracking
A major hurdle that hinders the application of CNN to visual tracking is the
lack of properly labeled training data.
An enormous amount of training data is required, but visual tracking typically
has only one labeled example in the first frame of each video.
Pre-training a CNN offline and then transferring the rich feature hierarchies
learned to online tracking.
The CNN is also fine-tuned during online tracking to adapt to the appearance
of the tracked target specified in the first video frame.
To fit the characteristics of object tracking, first pre-train the CNN to recognize
what is an object, and then propose to generate a probability map instead of
producing a simple class label.
Transferring Rich Feature Hierarchies for
Robust Visual Tracking
Architecture of the proposed structured output CNN
Pipeline of the proposed tracking algorithm
Robust Visual Tracking via Convolutional
Networks Without Training
Without offline training with a large amount of auxiliary data, simple two-layer
convolutional networks can be powerful enough to learn robust
representations for visual tracking.
In the 1st frame, extract a set of normalized patches from the target region as
fixed filters, which integrate a series of adaptive contextual filters surrounding
the target to define a set of feature maps in the subsequent frames.
These maps measure similarities between each filter and useful local intensity
patterns across the target, thereby encoding its local structural information.
Furthermore, all the maps together form a global representation, via which the
inner geometric layout of the target is also preserved.
A simple soft shrinkage method that suppresses noisy values below an
adaptive threshold is employed to de-noise the global representation.
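Soft shrinkage can be sketched in a few lines. The slide does not say how the adaptive threshold is chosen, so the median of the absolute values used below is an assumption, and the names `soft_shrink` and `denoise` are hypothetical.

```python
import numpy as np

def soft_shrink(x, theta):
    """Shrink every magnitude by theta and zero out values below theta."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def denoise(feature_maps):
    # Assumption: the adaptive threshold is the median absolute value.
    theta = np.median(np.abs(feature_maps))
    return soft_shrink(feature_maps, theta)

maps = np.array([[0.1, -0.2, 3.0],
                 [-4.0, 0.05, 0.3]])
sparse = denoise(maps)   # small responses become zero, large ones shrink
```

With a median threshold roughly half of the entries are zeroed, which is one way to obtain the sparse representation mentioned above.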
Visual Tracking with Fully Convolutional Networks
In-depth study of the properties of CNN features pre-trained offline on massive
image data for the ImageNet classification task.
Convolutional layers in different levels characterize the target from different
perspectives.
A top layer encodes more semantic features and serves as a category detector,
while a lower layer carries more discriminative information and can better
separate the target from distracters with similar appearance.
Both layers are jointly used with a switch mechanism during tracking.
For a tracking target, only a subset of neurons is relevant.
Feature map selection removes noisy and irrelevant feature maps, which
reduces computational redundancy and improves tracking accuracy.
Visual Tracking with Fully Convolutional Networks
Algorithm pipeline. (a) Input ROI region. (b) VGG network. (c) SNet. (d) GNet. (e) Tracking results
Robust Visual Tracking via Convolutional
Networks Without Training
Input samples are warped into canonical 32 × 32 images. The k-means algorithm extracts a set of normalized local patches from the
warped target region in the 1st frame, and a second set of normalized local patches is extracted from the contextual region
surrounding the target. These patches are applied as filters to convolve each normalized sample extracted from subsequent frames,
resulting in a set of feature maps.
Finally, the feature maps are de-noised by a soft shrinkage method, which results in a robust sparse representation.
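The filter-bank step in this caption can be sketched as follows. It is a simplified illustration: it covers only the target filters (the contextual filters and the shrinkage step are left out), and the patch size of 6, the 8 clusters and the 5 k-means iterations are assumptions, as are all function names.

```python
import numpy as np

def normalized_patches(img, k):
    """All k x k patches of img, each mean-subtracted and L2-normalized."""
    win = np.lib.stride_tricks.sliding_window_view(img, (k, k))
    p = win.reshape(-1, k * k).astype(float)
    p = p - p.mean(axis=1, keepdims=True)
    return p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-8)

def kmeans(x, n_clusters, n_iter, rng):
    """Plain Lloyd's k-means; returns the cluster centers."""
    centers = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(n_iter):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return centers

def feature_maps(sample, filters, k):
    """Correlate every normalized patch of a square sample with each filter."""
    p = normalized_patches(sample, k)              # (n_positions, k*k)
    side = sample.shape[0] - k + 1
    return (p @ filters.T).T.reshape(len(filters), side, side)

rng = np.random.default_rng(0)
target = rng.random((32, 32))                      # stand-in for the warped target
filters = kmeans(normalized_patches(target, 6), n_clusters=8, n_iter=5, rng=rng)
maps = feature_maps(target, filters, 6)            # one 27 x 27 map per filter
```

In a tracker the filters would be computed once from the first frame and then reused on every candidate sample of later frames.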
Hierarchical Convolutional Features for Visual Tracking
Features extracted from pre-trained deep CNNs
are used to improve tracking accuracy and robustness.
Outputs of the last conv. layers encode
semantic information, and such representations are
robust to appearance variations.
Hierarchies of conv. layers serve as a
nonlinear counterpart of an image
pyramid representation; these
multiple levels of abstraction are exploited for tracking.
Adaptively learn correlation filters on each
conv. layer to encode target appearance.
Hierarchically infer the maximum response
of each layer to locate targets.
Given an image, crop the search window centered at
the estimated position in the previous frame. Use the
3rd, 4th and 5th conv. layers as target representations.
Each layer indexed by i is then convolved with the
learned linear correlation filter w(i) to generate a
response map, whose location of the maximum value
indicates the estimated target position. Search the multi-level
response maps to infer target location.
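A minimal sketch of the per-layer correlation filter and the combined localization is given below. It makes several simplifying assumptions that are not in the slide: each layer is reduced to a single 2-D feature map, the filter is a ridge-regression filter solved in the Fourier domain, and the layers are fused by a weighted sum with hypothetical weights rather than a coarse-to-fine search.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired response: a Gaussian peaked at (0, 0) with circular wrap-around."""
    ys = np.minimum(np.arange(h), h - np.arange(h))
    xs = np.minimum(np.arange(w), w - np.arange(w))
    return np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2 * sigma ** 2))

def learn_filter(x, y, lam=1e-3):
    """Ridge-regression correlation filter, stored in the Fourier domain."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def response(W, z):
    """Response map of a learned filter W on a new feature map z."""
    return np.real(np.fft.ifft2(W * np.fft.fft2(z)))

def locate(filters, feats, weights):
    """Fuse the per-layer response maps and return the peak position."""
    total = sum(g * response(W, z) for W, z, g in zip(filters, feats, weights))
    return np.unravel_index(np.argmax(total), total.shape)

rng = np.random.default_rng(0)
layers = [rng.standard_normal((32, 32)) for _ in range(3)]  # stand-ins for conv3/4/5
y = gaussian_label(32, 32)
filters = [learn_filter(x, y) for x in layers]
# Simulate target motion by circularly shifting every feature map by (3, 5).
shifted = [np.roll(x, (3, 5), axis=(0, 1)) for x in layers]
pos = locate(filters, shifted, weights=[0.25, 0.5, 1.0])
```

The peak of the fused response sits at the displacement of the target relative to the training position, which is how the response map is read as a position estimate.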
Hierarchical Convolutional Features for Visual Tracking
Visualization of convolutional layers
Learning to Track 100FPS with Deep Regression Networks
Offline training of NNs that can track novel objects at test-time at 100 fps.
A simple feed-forward network with no online training required.
The tracker learns a generic relationship between object motion and appearance and
can be used to track novel objects that do not appear in the training set.
It improves when adding more videos to the offline training set.
Learning to Track 100FPS with Deep Regression Networks
The network architecture for tracking. The inputs to the network are a search region
from the current frame and a target from the previous frame. The network
learns to compare these crops to find the target object in the current image.
Visual Tracking with CNN based Object Proposals
Tracking-by-detection methods must cope with object appearance changes, size and
shape deformations, and partial and full occlusions, which make online adaptation of
classifiers and object models a substantial challenge.
An object proposal network that generates a small yet refined set of bounding
box candidates to mitigate the object model refitting problem by concentrating
on hard negatives when updating the classifier.
This improves the discriminative power, as hard negatives are likely to be due to
background and other distractions.
In each frame, applying the classifier only on the refined set of object-like
candidates would be sufficient to eliminate most of the false positives.
Incorporating object proposals makes the tracker robust against shape
deformations, which are handled naturally by the proposal stage.
Visual Tracking with CNN based Object Proposals
Region Proposal Network Tracker (RPNT): In a new frame t + 1, an offline VGG network is used to
generate a feature map, which is then fed to the region proposal network (RPN) to obtain candidate
bounding boxes. Region of interest (RoI) pooling layer extracts feature vectors with a fixed size for
the online structured support vector machine (SSVM) that serves as the classifier. The proposal with
the maximal response is assigned as the new location of the object.
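The selection step at the end of this pipeline can be sketched as follows. Note the swapped-in technique: a plain linear score `feats @ w` stands in for the online structured SVM, and the RPN and RoI pooling are replaced by hand-written proposals and feature vectors. The IoU threshold of 0.3 for hard negatives and all names are assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def track_step(proposals, feats, w, neg_iou=0.3):
    scores = feats @ w                          # classifier response per proposal
    best = int(np.argmax(scores))               # new target location
    overlap = iou(proposals[best], proposals)
    neg = np.flatnonzero(overlap < neg_iou)     # proposals far from the target
    hard = neg[np.argsort(-scores[neg])]        # highest-scoring negatives first
    return best, hard

proposals = np.array([[10, 10, 50, 50], [12, 12, 52, 52],
                      [100, 100, 140, 140], [200, 50, 240, 90]], dtype=float)
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.6, 0.2], [0.1, 0.9]])
w = np.array([1.0, 0.0])
best, hard = track_step(proposals, feats, w)
```

The ordered `hard` list is what a classifier update would concentrate on, since those proposals look object-like but do not overlap the chosen target.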
Learning Multi-Domain CNNs for Visual Tracking
Pre-trains a CNN using a large set of videos with tracking ground truths to
obtain a generic target representation.
Composed of shared layers and multiple branches of domain-specific layers,
where domains correspond to individual training sequences and each branch is
responsible for binary classification to identify target in each domain.
Train each domain in the network iteratively to obtain generic target
representations in the shared layers.
When tracking a target in a new sequence, construct a new network by
combining the shared layers in the pre-trained CNN with a new binary
classification layer, which is updated online.
Online tracking is performed by evaluating the candidate windows randomly
sampled around the previous target state.
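Random candidate sampling around the previous target state can be sketched as below. The Gaussian form and the parameter values (`trans_std`, `scale_std`, `scale_base`, 256 candidates) are assumptions for illustration, and `score_fn` is a hypothetical stand-in for the network's positive score.

```python
import numpy as np

def sample_candidates(prev_box, n, rng, trans_std=0.3, scale_std=0.5, scale_base=1.05):
    """Draw n candidate boxes (cx, cy, w, h) around the previous target state."""
    cx, cy, w, h = prev_box
    r = (w + h) / 2.0                                 # mean target size
    dx = rng.normal(0.0, trans_std * r, n)            # translation noise
    dy = rng.normal(0.0, trans_std * r, n)
    s = scale_base ** rng.normal(0.0, scale_std, n)   # multiplicative scale noise
    return np.stack([cx + dx, cy + dy, w * s, h * s], axis=1)

def pick_best(candidates, score_fn):
    """Return the candidate with the highest score."""
    scores = np.array([score_fn(c) for c in candidates])
    return candidates[np.argmax(scores)]

rng = np.random.default_rng(0)
prev = (100.0, 80.0, 40.0, 60.0)
cands = sample_candidates(prev, 256, rng)
# Placeholder score: closeness to a made-up true center at (104, 78).
best = pick_best(cands, lambda c: -np.hypot(c[0] - 104.0, c[1] - 78.0))
```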
Learning Multi-Domain CNNs for Visual Tracking
The architecture of Multi-Domain Network, which consists of shared layers and
K branches of domain-specific layers. Yellow and blue bounding boxes denote
the positive and negative samples in each domain, respectively.
Recurrently Target-Attending Tracking
Recurrently Target-attending Tracking (RTT) attempts to identify and exploit
those reliable parts which are beneficial for the overall tracking process.
To bypass occlusion and discover reliable components, multi-directional
Recurrent Neural Networks (RNNs) are employed in RTT to capture long-range
contextual cues by traversing a candidate spatial region from multiple directions.
The confidence maps produced by the RNNs are employed to adaptively
regularize the learning of discriminative correlation filters by suppressing cluttered
background noise while making full use of the information from reliable parts.
To solve the weighted correlation filters, an efficient closed-form
solution is derived, with a sharp reduction in computational complexity.
The proposed RTT compares favorably with correlation filter based methods.
Recurrently Target-Attending Tracking
RTT tracker. To identify and exploit those reliable components during tracking, a confidence map
is estimated by using multi-directional RNNs, and further used to regularize correlation filters.
Workflow of RNNs
⊙ - element-wise multiplication operation
Appendix A:
SoC Implementation of CNN
Large-Scale FPGA-based Convolutional Networks
a 2D grid of NPT Processing Tiles (PTs) that contain:
a bank of processing operators. An operator can be anything from a FIFO to an arithmetic operator,
or even a combination of arithmetic operators;
The operators are connected to local data lines;
a routing multiplexer (MUX). The MUX connects the local data lines to global data lines or to
neighboring tiles.
a Smart Direct Memory Access module (Smart DMA), that interfaces off-chip memory and
provides asynchronous data transfers, with priority management;
a set of Nglobal global data lines used to connect PTs to the Smart DMA, Nglobal << NPT;
a set of local data lines used to connect PTs with their 4 neighbors;
a Runtime Configuration Bus, used to reconfigure many aspects of the grid at runtime —
connections, operators, Smart DMA modes;
a controller that can reconfigure most of the computing grid and the Smart DMA at runtime.
Large-Scale FPGA-based Convolutional Networks
A set of runtime configurable processing tiles are connected on a 2D grid. They can exchange data with their 4
neighbors and with an off-chip memory via global lines.
Large-Scale FPGA-based Convolutional Networks
The grid is configured for a complex computation that involves several tiles: the 3 top tiles
perform a 3×3 convolution, the 3 intermediate tiles another 3 × 3 convolution, the bottom left
tile sums these two convolutions, and the bottom centre tile applies a function to the result.
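The computation mapped onto the grid in this figure can be written in software as below. This is a functional sketch of the dataflow only, not of the hardware; the applied function is not named in the caption, so `np.tanh` as the default is an assumption, and both function names are hypothetical.

```python
import numpy as np

def conv3x3(x, k):
    """Valid 3x3 sliding-window sum of products (no kernel flip)."""
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * x[i:i + h - 2, j:j + w - 2]
    return out

def grid_dataflow(x1, k1, x2, k2, f=np.tanh):
    a = conv3x3(x1, k1)     # 3 top tiles
    b = conv3x3(x2, k2)     # 3 intermediate tiles
    s = a + b               # bottom left tile: sum of the two convolutions
    return f(s)             # bottom centre tile: function applied to the result

x = np.ones((5, 5))
k = np.ones((3, 3))
raw = grid_dataflow(x, k, x, k, f=lambda v: v)   # identity to inspect the sum
out = grid_dataflow(x, k, x, k)                  # default non-linearity
```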
Large-Scale FPGA-based Convolutional Networks
Overview of the ConvNet Processor system. A grid of multiple full-custom
Processing Tiles tailored to ConvNet operations, and a fast
streaming memory interface (Smart DMA).
Mobile Coprocessor NN-X for Deep NN
Startup: TeraDeep (from e-lab of Purdue
University);
Its nn-X ("Neural Network next") is "a vision system
based on programmable-logic with embedded
mobile processor";
Coprocessor, composed of multiple accelerators
and interfaces to ARM cores, can run on mobile
phones, embedded systems and mobile computers;
Implemented with FPGA in the programmable logic
section of an ARM-based Xilinx Zynq SoC;
Combination of cloud-based machine-learning
algorithms, deep learning neural network software
running on embedded devices and mobile apps;
Note: Qualcomm’s Zeroth processor, as a Neural
Processing Unit (NPU), will be released.
The memory router is a crossbar switch. A local
router in each collection is directly connected to
routers in neighboring collections, constructing a
1-d torus-like data stream network. The group of
operators in each collection contains
convolution, max-pooling and non-linear
operators.
NN-X is composed of a coprocessor, a host processor and external
memory. The coprocessor has three main components: processing
elements called collections, a system bus called memory router and a
configuration bus to control flow of data. Collections perform DNN
operations: data routing, convolutions, pooling, non-linear programmable
functions.
Memory Centric Accelerator for CNN
Bottleneck: limited amount of external
memory bandwidth for data transfer.
The effects of the memory bottleneck
can be reduced by a flexible memory
hierarchy that supports the complex
data access patterns in CNN workload;
The efficiency of the on-chip memories
is maximized by a scheduler that uses
tiling to optimize for data locality;
This design flow ensures that on-chip
memory size is minimized, which
reduces area and energy usage.
Increasing accelerator utilization with
more external memory bandwidth is
bad for energy.
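The trade-off behind tiling can be illustrated with a small sketch. It computes a convolution tile by tile, copying each input tile (plus its halo) into a small buffer that stands in for on-chip memory, and counts the words fetched. The function names, tile size and word counting are assumptions, not the scheduler described above.

```python
import numpy as np

def conv2d_direct(x, k):
    """Valid 2-D sliding-window sum of products."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * x[i:i + oh, j:j + ow]
    return out

def conv2d_tiled(x, k, tile):
    """Same result, computed per output tile from a small local buffer."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    words_loaded = 0
    for r in range(0, oh, tile):
        for c in range(0, ow, tile):
            th, tw = min(tile, oh - r), min(tile, ow - c)
            buf = x[r:r + th + kh - 1, c:c + tw + kw - 1].copy()  # tile plus halo
            words_loaded += buf.size                               # external traffic
            out[r:r + th, c:c + tw] = conv2d_direct(buf, k)
    return out, words_loaded

rng = np.random.default_rng(0)
x = rng.random((20, 20))
k = rng.random((3, 3))
tiled, loaded_small = conv2d_tiled(x, k, tile=6)    # 8 x 8 buffer, 9 tiles
_, loaded_big = conv2d_tiled(x, k, tile=18)         # 20 x 20 buffer, 1 tile
```

Smaller tiles need a smaller buffer but re-fetch the halo rows and columns, so the schedule has to balance buffer size against external memory traffic.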
Memory Centric Accelerator for CNN
To instantiate the accelerator template with the
obtained schedules, an HW/SW integration flow is
used;
The top left part contains the scheduling design
space exploration, from which the optimal
schedules are used to select the template
parameters;
The parameter set and the hardware template are
manually converted into an HLS instantiation of the
accelerator;
In the left part of the design flow the selected
schedule is manually converted to control
software for the MicroBlaze host processor.
FPGA-based Accelerator Design for CNN
Matching between computation throughput and memory
bandwidth or logic resource;
Uniform loop unroll factors across different convolutional
layers;
Apply a roofline performance model to relate the system
performance to off-chip memory traffic and the peak
performance provided by the HW platform;
Loop tiling to fit a portion of the data on-chip, critical for data
reuse and parallelism;
Organization of PEs and buffer banks and interconnects
in between for data processing efficiency;
Optimization in computation and memory access:
Loop unrolling/pipeline;
Local memory promotion;
Loop transformations for data reuse;
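The roofline model mentioned above bounds attainable performance by the smaller of the computational roof and the product of off-chip bandwidth and the computation-to-communication ratio. A sketch of using it to rank candidate designs follows; the three designs and all numbers are made up for illustration.

```python
def attainable_gflops(compute_roof, bandwidth_gbs, ctc_ratio):
    """Roofline bound: limited by compute or by off-chip memory traffic.

    ctc_ratio is the computation-to-communication ratio in operations per byte.
    """
    return min(compute_roof, ctc_ratio * bandwidth_gbs)

# Hypothetical design points, e.g. different unroll and tiling factors.
designs = [
    {"name": "A", "roof": 100.0, "ctc": 10.0},   # high peak, little data reuse
    {"name": "B", "roof": 60.0, "ctc": 40.0},    # lower peak, much data reuse
    {"name": "C", "roof": 80.0, "ctc": 20.0},
]
bandwidth = 2.0  # GB/s, an assumed platform limit

perf = {d["name"]: attainable_gflops(d["roof"], bandwidth, d["ctc"]) for d in designs}
best = max(perf, key=perf.get)
```

Design A has the highest computational roof but is memory-bound, so the design with more data reuse ends up faster on this assumed platform.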
MSR’s FPGA of CNN for Large Scale DataCenter
Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V
FPGAs embedded into a half-rack of 48 machines;
One FPGA is placed into each server, accessible through PCIe, and wired
directly to other FPGAs with pairs of 10 Gb SAS cables.
A diagram of the 1 U, half-width server that hosts the FPGA board
Components of the Shell Architecture
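The neighbor relation in a 6x8 2-D torus can be sketched as below. Treating the 6 as rows and the 8 as columns is an assumption, as is the function name; the slide only gives the torus dimensions.

```python
def torus_neighbors(r, c, rows=6, cols=8):
    """Four direct neighbors of node (r, c) in a 2-D torus with wrap-around."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]

corner = torus_neighbors(0, 0)
# Count every link once by storing each pair in a canonical order.
links = {tuple(sorted([(r, c), n]))
         for r in range(6) for c in range(8)
         for n in torus_neighbors(r, c)}
```

Because of the wrap-around, corner and edge nodes have the same number of direct links as interior nodes.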
MSR’s FPGA of CNN for Large Scale DataCenter
Comparison of Image Classification Throughput and Power
[3] A. Putnam, et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, International Symposium on Computer Architecture, 2014.
[4] …
[5] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA'2015, 2015.
[6] http://caffe.berkeleyvision.org/performance_hardware.html
[7] …
[8] http://www.altera.com/literature/hb/arria-10/a10_overview.pdf.
AuvizDNN
Auviz Middleware IP is highly optimized for Xilinx Series 7 FPGAs, available as C functions for
HLS users, OpenCL for embedded developers, or IP Integrator blocks for RTL users;
FPGAs offer compelling performance at a power profile that is suitable for the data center;
AuvizDNN: an optimized library for creating deep learning algorithms on the FPGA, with
CNN already implemented.
Vectorization of Deep CNN at Lenovo Research
Vectorization is fundamental for parallelism in deep CNN;
Caffe, Overfeat, CudaConvnet, Theano;
Vectorizing Convolution: matrix-vector multiplication;
Vectorizing Pooling: vector accumulation guided by a predefined index map;
Vectorizing Fully Connected Layers: matrix-matrix multiplication;
Vectorization for Mini-batches: SGD.
Strategy to vectorize convolution
Strategy to vectorize pooling
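Both strategies can be sketched for a single-channel image. This is a generic im2col-style illustration rather than the implementation from the slides: convolution with one filter becomes a matrix-vector product over unrolled patches, and pooling becomes a gather through a precomputed index map followed by a reduction. All function names are hypothetical, and max pooling is an assumed choice of reduction.

```python
import numpy as np

def im2col(x, k):
    """Unroll every k x k patch of x into one row of a matrix."""
    win = np.lib.stride_tricks.sliding_window_view(x, (k, k))
    return win.reshape(-1, k * k)

def conv_as_matvec(x, kernel):
    """Valid convolution (no kernel flip) as a single matrix-vector product."""
    k = kernel.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (im2col(x, k) @ kernel.ravel()).reshape(oh, ow)

def pooling_index_map(h, w, p):
    """Row i lists the flat input indices that feed pooled output i."""
    idx = np.arange(h * w).reshape(h, w)
    blocks = idx.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, p * p)

def max_pool(x, p):
    """Non-overlapping p x p max pooling driven by the index map."""
    m = pooling_index_map(x.shape[0], x.shape[1], p)
    return x.ravel()[m].max(axis=1).reshape(x.shape[0] // p, x.shape[1] // p)

x = np.arange(16, dtype=float).reshape(4, 4)
conv_out = conv_as_matvec(x, np.ones((2, 2)))
pooled = max_pool(x, 2)
```

With several filters, the kernel vector becomes a matrix and the same unrolled patch matrix is reused for all of them.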