Visual Detection, Recognition and Tracking with Deep Learning

Yu Huang

Sunnyvale, California

[email protected]


Outline

• Deep learning
    • Sparse coding
• Deep models
    • CNN/NIN/RNN;
    • DBN/DBM;
    • Stacked DAE;
• Optimization/Learning methods
    • SGD & BP;
    • AdaGrad/AdaDelta
    • Dropout/Maxout
    • Data Augmentation
    • MCMC/Mean Field/Contrastive Divergence
    • Wake-Sleep/Greedy layer-wise pre-training
• Model Compression
    • Dark knowledge/Distilling the knowledge.
• Visual recognition
    • Sparse coding
    • Hierarchical feature learning
    • LeNet/AlexNet/Mattnet (ZFNet);
    • VGG Net/GoogLeNet
    • PReLU/Batch normalization;
    • Rethink the Inception;
    • Deep Residual Learning;
• Generic object detection
    • Deep multi-box/OverFeat;
    • R-CNN/Fast R-CNN/SPP Net/Faster R-CNN;
    • DeepID-Net;
    • YOLO/YOLO9000;
    • DeepBox;
    • R-FCN;
    • LocNet;
    • MS-CNN;
• Pedestrian detection
• Pose estimation
• Face detection, landmark detection and face recognition
• Text detection and recognition
• Scene parsing/Semantic segmentation
    • Multiscale Feature Learning;
    • Simultaneous Detection and Segmentation;
    • FCN/DeepLab/Parsenet/Segnet/Mask R-CNN.
• Visual object tracking
• Appendix A: SoC implementation of CNN

Deep Learning

Representation learning attempts to automatically learn good features or representations;

Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (features);

Became effective via unsupervised pre-training + supervised fine-tuning;

Before that, deep networks trained with back propagation alone (without unsupervised pre-training) were observed to perform worse than shallow networks.

Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);

Semi-supervised: structure of manifold assumption; labeled data is scarce and unlabeled data is abundant.

Learning Feature Hierarchy with DL

• Deep architectures can be more efficient in feature representation;

• Natural derivation/abstraction from low level structures to high level structures;

• Share the lower-level representations for multiple tasks (such as detection, recognition, segmentation).

Sparse Coding

Sparse coding (Olshausen & Field, 1996).

Originally developed to explain early visual processing in the brain

(edge detection).

Objective: Given a set of input data vectors learn a

dictionary of bases such that:

Each data vector is represented as a sparse linear combination of

bases.

Sparse: mostly zeros

Methods of Solving Sparse Coding

Greedy methods: projecting the residual on some atom;

Matching pursuit, orthogonal matching pursuit;

The residual is updated iteratively in the direction of the selected atom;

L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);

Gradient-based finding new search directions

Projected Gradient Descent

Coordinate Descent

Homotopy: a set of solutions indexed by a parameter (regularization)

LARS (Least Angle Regression)

First order/proximal methods: Generalized gradient descent

solving efficiently the proximal operator

soft-thresholding for L1-norm

Accelerated by the Nesterov optimal first-order method

Iterative reweighting schemes

L2-norm: Chartrand and Yin (2008)

L1-norm: Candès et al. (2008)
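To make the proximal route above concrete, here is a minimal sketch of iterative soft-thresholding (ISTA) for the L1-penalized sparse coding problem with a fixed dictionary. The function names, the random dictionary and the value of lam are illustrative assumptions, not part of the original slides, and the accelerated (Nesterov) variant is not shown.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam, n_iter=200):
    """Minimise 0.5 * ||x - D a||^2 + lam * ||a||_1 over the code a (D fixed)."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)             # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy data (assumed): 50 unit-norm atoms in 20 dimensions, 3 active atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = D @ a_true
a_hat = ista(D, x, lam=0.05)
```

With step size 1/L each iteration cannot increase the objective, which is why the plain scheme is a safe baseline before trying acceleration.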

Sparse Coding for Unsupervised Pre-training

• SC learns the optimal dictionary that can be used to reconstruct a set of training samples under

sparsity constraints on the feature vector;

• Predictive Sparse Decomposition(PSD):

• Train a simple regressor (or encoder) to approximate the sparse solution for all data in the training set;

• Modify by adding a penalty for prediction error: Approximate the sparse coding with an encoder;

• PSD for hierarchical feature training

• Phase 1: train the first layer;

• Phase 2: use encoder + absolute value as 1st feature extractor

• Phase 3: train the second layer;

• Phase 4: use encoder + absolute value as 2nd feature extractor

• Phase 5: train a supervised classifier on top layer;

• Phase 6: optionally train the whole network with supervised BP.

Convolutional Neural Networks

CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input;

Local receptive fields (shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling;

Related to generative MRF/discriminative CRF: CNN=Field of Experts MRF=ML inference in CRF;

Generate ‘patterns of patterns’ for pattern recognition.

Each layer combines (merge, smooth) patches from previous layers;

Pooling/Sampling (e.g., max or average) filter: compress and smooth the data.

Convolution filters: (translation invariance) unsupervised;

Local contrast normalization: increase sparsity, improve optimization/invariance.

C layers convolutions,

S layers pool/sample

Convolutional Neural Networks

Convolutional Networks are trainable architectures composed of multiple stages;

Input and output of each stage are sets of arrays called feature maps;

At output, each feature map represents a particular feature extracted at all locations on input;

Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;

A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;

A fully connected layer: softmax transfer function for posterior distribution.

Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;

Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;

In the rectified function, gi is a trainable gain parameter; it might be followed by a contrast normalization N;

Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;

Supervised training is performed using a form of SGD to minimize the prediction error;

Gradients are computed with the back-propagation method.

Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.

* is discrete convolution operator

Deep Network-In-Network

Enhance model discriminability for local patches within the receptive

field;

Build micro neural networks with more complex structures to abstract

data within receptive field;

Design micro neural network with a multilayer perceptron;

Deep NIN can be implemented by stacking multiple layers with a micro neural network slid over the input like a CNN to get the feature maps;

Apply global average pooling over feature maps in the classification layer;

Global average pooling is easier to interpret and less prone to overfitting than fully connected layers;

The overall structure of Network In Network

Comparison of linear convolution layer and mlpconv layer

Tiled CNN

Use a regular “tiled” pattern of tied weights, in which adjacent hidden units do not share identical weights;

Only hidden units k steps away from each other are constrained to have tied weights;

Learn complex invariances (scale and rotation) by pooling over neighboring units;

A relatively small number of learned parameters (sparse), ease of learning and greater scalability;

Unsupervised pre-training: Topographic ICA as an efficient learning method;

TICA is a two-layered network with sqr and sqrt nonlinearities in the 1st and 2nd

layers respectively;

It can learn invariances even when trained only on unlabeled data;

Avoid approximate orthogonalization by using local receptive fields;

Trained by batch projected gradient descent.

Left: CNN with local receptive fields and tied weights. Right: Partially untied local receptive field networks.

TICA network architecture

Local orthogonalization

Localize neurons that have identical receptive fields

Projection step

Locality, tied-weight and orthogonality constraints

Optimization for sparsity at pooling units

Multi-column Deep Neural Networks

MDNN: inspired by “Neocognitron”, small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth;

hundreds of maps per layer, many (6-10) layers of non-linear neurons stacked on top of each other, similar to the layers found between retina and visual cortex of macaque monkeys;

2-d layers of winner-take-all neurons with overlapping receptive fields whose weights are shared;

A simple max pooling technique decides the winner neurons by partitioning layers into quadratic regions of local inhibition, selecting the most active neuron of each region;

A smaller, down-sampled layer with lower resolution feeding the next layer in the hierarchy;

The top part of the hierarchy becomes a standard multi-layer perceptron (MLP), while receptive fields and winner-take-all regions of the DNN are (near-)minimal, e.g., only 2x2 or 3x3 neurons;

Only winner neurons are trained, and the training algorithm is fully online;

Several deep neural columns become experts on inputs or input preprocessed in different ways, and then those predictions are averaged;

(a) DNN architecture. (b) MCDNN architecture. The input image can be preprocessed by blocks P0 to Pn-1. An arbitrary number of columns can be trained on inputs preprocessed in different ways. The final predictions are obtained by averaging individual predictions of each DNN. (c) Training a DNN. The dataset is preprocessed before training, then, at the beginning of every epoch, the images are distorted (D block).

RNN: Recurrent Neural Network

A nonlinear dynamical system that maps sequences to sequences;

Parameterized with three weight matrices and three bias vectors;

RNNs are fundamentally difficult to train due to their nonlinear iterative nature;

The derivative of the loss function can be exponentially large with respect to the hidden activations;

RNN suffers also from the vanishing gradient problem.

Back Propagation Through Time (BPTT):

“Unfold” the recurrent network in time, by stacking identical copies of the RNN, and redirecting

connections within the network to obtain connections between subsequent copies;

BPTT is hard to use where online adaptation is required, because the entire time series must be used.

Real-Time Recurrent Learning (RTRL) is a forward-pass only algorithm that computes the

derivatives of the RNN w.r.t. its parameters at each timestep;

Unlike BPTT, RTRL maintains the exact derivative of the loss so far at each timestep of the forward

pass, without a backward pass and the need to store the past hidden states;

However, the computational cost of RTRL is prohibitive, and it also needs more memory than BPTT.

Success in Application: Speech Recognition and Handwriting recognition.

LSTM: Long Short-Term Memory

An RNN structure that elegantly addresses the vanishing gradients problem using “memory units”;

These linear units have a self-connection of strength 1 and a pair of auxiliary “gating units” that control

the flow of information to and from the unit;

Let N be the number of memory units of the LSTM. At each timestep t, the LSTM maintains a set of

vectors as below, whose evolution is governed by the following equations:

Since the forward pass of the LSTM is relatively intricate, the equations for the correct derivatives of the

LSTM are highly complex, making them tedious to implement;

Note: Theano has LSTM module.

Left: RNN with one fully connected hidden layer;

Right: LSTM with memory blocks in hidden layer.

From Simple RNN to BPTT

LSTM: Long Short-Term Memory

Gated Recurrent Unit

GRU is a variation of RNN, adaptively capturing dependencies of different time scales with each recurrent unit;

GRU also uses gating units to modulate the flow of information inside the unit, but without a separate memory cell.

GRU doesn’t control the degree to which its state is exposed, but exposes the whole state each time;

Different from LSTM:

GRU exposes its full content without control;

GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not

independently control the amount of the candidate activation being added (the control is tied via the update gate).

• Shared virtues with LSTM: the additive

component of their update from t to t + 1;

• Easy for each unit to remember the

existence of a specific feature in the input

stream for a long series of steps;

• Effectively creates shortcut paths that

bypass multiple temporal steps, which

allow the error to be back-propagated

easily without too quickly vanishing.
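The gating behaviour described above can be sketched as a single GRU step. This is a minimal illustration that assumes the common formulation with an update gate z and a reset gate r and omits bias terms; the parameter names in the dictionary p are hypothetical, and some implementations swap the roles of z and (1 - z).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step; p holds the (assumed) weight matrices, biases omitted."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)            # reset gate
    # The reset gate controls how much of the previous activation
    # enters the new candidate activation.
    h_cand = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))
    # Additive update: the amount of candidate added is tied to z,
    # and the whole state h is exposed (no output gate).
    return (1.0 - z) * h_prev + z * h_cand
```

The last line is the additive component shared with LSTM: when z is close to zero the previous state is copied forward almost unchanged, which gives the shortcut path across time steps.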

Belief Nets

A belief net is a directed acyclic graph composed of stochastic variables.

Can observe some of the variables and solve two problems:

inference: Infer the states of the unobserved variables.

learning: Adjust the interactions between variables to more likely generate the

observed data.

stochastic hidden cause

visible effect

Use nets composed of layers of stochastic variables with weighted connections.

Boltzmann Machines

An energy-based model associates an energy with each configuration of the stochastic variables of interest (for example, MRF, Nearest Neighbor);

Learning means adjusting the shape of the energy function so that desired configurations get low energy;

Boltzmann machine is a stochastic recurrent model with hidden variables;

Markov chain Monte Carlo (MCMC) sampling for gradient estimate;

Restricted Boltzmann machine is a special case:

Only one layer of hidden units;

factorization of each layer’s neurons/units (no connections in the same layer);

Contrastive divergence: approximation of gradient in RBMs.

[Equations: probability, energy function, learning rule]

Deep Belief Networks

A hybrid model: can be trained as a generative or discriminative model;

Deep architecture: multiple layers (learn

features layer by layer);

Multi layer learning is difficult in sigmoid belief

networks.

Top two layers are undirected connections,

RBM;

Lower layers get top down directed

connections from layers above;

Unsupervised or self-taught pre-learning

provides a good initialization;

Greedy layer-wise training for RBM

Supervised fine-tuning

Generative: Up-down wake-sleep algorithm

Discriminative: bottom-up back propagation

Deep Boltzmann Machine

Learning internal representations that become increasingly complex;

High-level representations built from a large supply of unlabeled inputs;

Pre-training consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine (undirected graph);

Generative fine-tuning: different from DBN, two phases

Positive: observed, sample hidden, using variational approximation (mean-field);

Negative: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC).

Discriminative fine-tuning: the same as for DBN, i.e. back propagation.

Denoising Auto-Encoder

Multilayer NNs with target output=input;

Reconstruction=decoder(encoder(input));

Perturbs the input x to a corrupted version;

Randomly sets some of the coordinates of input to zeros.

Recover x from encoded perturbed data.

Learns a vector field towards higher probability regions;

Pre-trained with DBN or regularizer with perturbed training data;

Minimizes variational lower bound on a generative model;

Corresponds to regularized score matching on an RBM;

PCA=linear manifold=linear Auto Encoder;

Auto-encoder learns the salient variation like a nonlinear PCA.

Stacked Denoising Auto-Encoder

Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise learning Drop the decode layer each time

Supervised training on the last layer using final features Then supervised training on the entire network to fine- tune all weights

Performs better than stacking RBMs for unsupervised pre-training.

Empirically not quite as accurate as DBNs.

Stochastic Gradient Descent (SGD)

• The general class of estimators that arise as minimizers of sums are called M-estimators;

• Where are stationary points of the likelihood function (or zeroes of its derivative, the score function)?

• Online gradient descent samples a subset of summand functions at every

step;

• The true gradient of the objective is approximated by a gradient at a single example;

• Shuffling of training set at each pass.

• There is a compromise between two forms, often called "mini-batches", where

the true gradient is approximated by a sum over a small number of training

examples.

• SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex (given suitably decreasing learning rates), and otherwise converges almost surely to a local minimum.
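As a small illustration of the mini-batch compromise, the sketch below runs SGD on a least-squares objective, reshuffling the training set at each pass. The data, the learning rate and the batch size are assumptions chosen only to make the example run; they are not taken from the slides.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=8, epochs=50, seed=0):
    """Mini-batch SGD for the mean squared error of a linear model."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle at each pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)   # mini-batch gradient estimate
            w -= lr * grad
    return w

# Noise-free toy problem (assumed) so the minimiser is known exactly.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true
w_hat = sgd_linear_regression(X, y)
```

Setting batch_size=1 recovers the single-example form, and batch_size=len(y) recovers full-batch gradient descent.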

Hinton’s RMSProp: another variation of SGD

Rprop: uses only the sign of the gradient; does not work with mini-batches;

The magnitude of the gradient can be very different for different weights and can change during learning;

Combine the idea of only using sign of gradient with the idea of adapting the step size separately for

each weight;

RMSProp: a mini-batch version of Rprop;

Rprop is equivalent to using gradient but also dividing by the size of gradient;

RMSProp divides the learning rate for a weight by a running average of the squared gradients for that weight;

Other improvements:

Combination with momentum, Nesterov momentum, adaptive learning rate, …

LeCun’s “No more pesky learning rate”;
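The RMSProp rule described above can be sketched in a few lines. The decay factor 0.9, the learning rate and the small constant eps are assumed defaults for illustration, and the quadratic test function is hypothetical.

```python
import numpy as np

def rmsprop_update(w, grad, mean_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp step: divide the step for each weight by the root of a
    running average of its squared gradients."""
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq

# Badly scaled quadratic (assumed): f(w) = 0.5 * (100 * w0^2 + w1^2).
w = np.array([1.0, 1.0])
mean_sq = np.zeros_like(w)
for _ in range(500):
    grad = np.array([100.0, 1.0]) * w
    w, mean_sq = rmsprop_update(w, grad, mean_sq)
```

On the first step both weights move by almost the same amount even though their gradients differ by a factor of 100, which is the per-weight step-size adaptation inherited from Rprop.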

Back Propagation

E (f(x0,w),y0) = -log (f(x0,w)- y0).

Loss function

Euclidean loss is used for regressing to real-valued labels in (-inf, inf);

Sigmoid cross-entropy loss is used for predicting K independent probability values in [0,1];

Softmax (normalized exponential) loss is predicting a single class of K mutually exclusive

classes;

Generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to

a K-dimensional vector of real values σ(z) in the range (0, 1).

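The predicted probability for the j'th class given a sample vector x is P(y = j | x) = σ(z)_j = exp(z_j) / Σ_k exp(z_k), k = 1..K, where z is the vector of class scores computed from x.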

Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or

outliers in the data without removing them from the dataset.
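A minimal sketch of the softmax function and the corresponding negative log-likelihood loss for one sample follows. Subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    """Map K real scores to K probabilities in (0, 1) that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # stability; result unchanged
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, label):
    """Softmax loss for one sample: negative log-probability of the true class."""
    return -np.log(softmax(z)[label])
```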

Variable Learning Rate

Too large a learning rate

causes oscillation in searching for the minimal point

Too small a learning rate

causes too slow convergence to the minimal point

Adaptive learning rate

At the beginning, the learning rate can be large when the current point is

far from the optimal point;

Gradually, the learning rate will decay as time goes by.

Should not be too large or too small:

annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)

𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a constant.
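The annealing schedule above is easy to check numerically. In this sketch alpha0 is the initial rate 𝛼(0) and T is the time constant from the formula; the sample values are assumptions.

```python
def annealed_rate(alpha0, t, T):
    """Learning rate alpha(t) = alpha(0) / (1 + t / T)."""
    return alpha0 / (1.0 + t / T)

# Nearly constant while t << T, halved at t = T, decaying toward zero afterwards.
schedule = [annealed_rate(0.1, t, T=100) for t in (0, 1, 100, 10000)]
```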

Variable Momentum

AdaGrad/AdaDelta

Weight Decay for Overfitting

Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification;

Weight decay works as rescaling weights in the learning rule, but bias learning stays the same;

Prefer to learn small weights, and large weights allowed if improving the original cost function;

A way of compromising btw finding small weights and minimizing the original cost function;

In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;

L1 regularization: weights that are not really useful shrink by a constant amount toward zero; it acts like a form of feature selection;

Make the input filters cleaner and easier to interpret;

L2 regularization penalizes large values strongly while L1 regularization ;

Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights & hyper-parameters;

Hybrid Monte Carlo: gradient and sampling.
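The difference between the two penalties in the weight-decay bullets above shows up directly in the update rule. The sketch below assumes a penalty of (lam/2)·||w||² for L2 and lam·||w||₁ for L1; the function names are illustrative.

```python
import numpy as np

def l2_step(w, grad, lr, lam):
    """Gradient step with L2 penalty: weights are rescaled by (1 - lr * lam)."""
    return (1.0 - lr * lam) * w - lr * grad

def l1_step(w, grad, lr, lam):
    """(Sub)gradient step with L1 penalty: each weight moves toward zero
    by the constant amount lr * lam, regardless of its size."""
    return w - lr * grad - lr * lam * np.sign(w)

w = np.array([2.0, -0.5])
zero_grad = np.zeros(2)
w_l2 = l2_step(w, zero_grad, lr=0.1, lam=0.5)   # proportional shrinkage
w_l1 = l1_step(w, zero_grad, lr=0.1, lam=0.5)   # constant shrinkage
```

With a zero data gradient, the L2 step shrinks the large weight more in absolute terms, while the L1 step subtracts the same amount from both, which is what drives small weights to zero.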

Early Stopping for Overfitting

Steps in early stopping:

Divide the available data into training and validation sets.

Use a large number of hidden units.

Use very small random initial values.

Use a slow learning rate.

Compute the validation error rate periodically during training.

Stop training when the validation error rate "starts to go up".

Early stopping has several advantages:

It is fast.

It can be applied successfully to networks in which the number of weights far exceeds the

sample size.

It requires only one major decision by the user: what proportion of validation cases to use.

Practical issues in early stopping:

How many cases do you assign to the training and validation sets?

Do you split the data into training and validation sets randomly or by some systematic

algorithm?

How do you tell when the validation error rate "starts to go up"?
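One common way to answer the last question is a patience rule: stop once the validation error has not improved for a fixed number of checks, and keep the best model seen so far. The sketch below is one such heuristic under that assumption; the patience value and the error sequence are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error), scanning validation errors in order and
    stopping once `patience` checks pass without a new best value."""
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                       # validation error judged to be going up
    return best_epoch, best_err

# Hypothetical validation curve: the late improvement at the end is never seen.
stop = early_stopping_epoch([0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.5], patience=3)
```

The example also shows the risk of the rule: a larger patience would have reached the later, lower error, at the cost of more training time.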

Dropout and Maxout for Overfitting

Dropout: set the output of each hidden neuron to zero with probability 0.5.

Motivation: combining many different models that share parameters reduces test errors by approximately averaging together the predictions, which resembles bagging.

The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation.

So every time an input is presented, the NN samples a different architecture, but all these architectures share weights.

This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units.

It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units.

Without dropout, the network exhibits substantial overfitting.

Dropout roughly doubles the number of iterations required to converge.

Maxout takes the maximum across multiple feature maps;
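A minimal sketch of both operations on a vector of hidden activations follows. The test-time scaling by the keep probability matches the original dropout formulation (many libraries use the equivalent "inverted" variant that scales during training instead); the group size k in maxout and the function names are assumptions.

```python
import numpy as np

def dropout_train(h, rng, p_drop=0.5):
    """Training: zero each hidden output independently with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask, mask

def dropout_test(h, p_drop=0.5):
    """Test: keep all units but scale by the keep probability, which
    approximates averaging over the sampled sub-networks."""
    return h * (1.0 - p_drop)

def maxout(z, k):
    """Maxout: take the maximum over consecutive groups of k linear features."""
    return z.reshape(z.shape[:-1] + (-1, k)).max(axis=-1)
```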

Data Augmentation for Overfitting

The easiest and most common method to reduce overfitting on

image data is to artificially enlarge the dataset using label-

preserving transformations;

Perturbing an image I by transformations that leave the underlying

class unchanged (e.g. cropping and flipping) in order to generate

additional examples of the class;

Two distinct forms of data augmentation:

image translations and horizontal reflections

changing RGB intensities
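The first form can be sketched as a random crop (a translation) followed by a random horizontal flip. The crop size and the array layout (height, width, channels) are assumptions; the RGB intensity change is not shown.

```python
import numpy as np

def random_crop_and_flip(img, crop, rng):
    """Label-preserving augmentation: random crop plus random horizontal flip."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)      # random vertical offset
    left = rng.integers(0, w - crop + 1)     # random horizontal offset
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]               # mirror left-right
    return patch
```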

MCMC Sampling for Optimization

Markov Chain: a stochastic process in which future states are independent of past states given the present state.

Markov chain will typically converge to a stable distribution.

Markov chain Monte Carlo: sampling using ‘local’ information

Devise a Markov chain whose stationary distribution is the target.

Ergodic MC must be aperiodic, irreducible, and positive recurrent.

Monte Carlo Integration to get quantities of interest.

Metropolis-Hastings method: sampling from a target distribution

Create a Markov chain whose transition matrix does not depend on the normalization term.

Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio).

After a sufficient number of iterations, the chain will converge to the stationary distribution.

Gibbs sampling is a special case of M-H Sampling.

The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distribution.

Hybrid Monte Carlo: gradient sub step for each Markov chain.
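The Metropolis-Hastings idea can be sketched with a symmetric Gaussian random-walk proposal, in which case the accept ratio depends only on the unnormalized target density. The target (a unit-variance Gaussian centred at 3), the step size and the sample count are assumptions for illustration.

```python
import numpy as np

def metropolis_hastings(log_p_unnorm, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-d unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()    # symmetric proposal
        # The normalization term cancels in this ratio.
        log_ratio = log_p_unnorm(proposal) - log_p_unnorm(x)
        if np.log(rng.random()) < log_ratio:           # accept with prob min(1, ratio)
            x = proposal
        samples[i] = x
    return samples

# Hypothetical target: Gaussian with mean 3 and unit variance, unnormalized.
samples = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2, x0=0.0, n_samples=20000)
```

Discarding an initial burn-in segment before computing averages reflects the point above that the chain only approaches the stationary distribution after enough iterations.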

Mean Field for Optimization

Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution;

Mean Field replaces M with a (simple) subset M(F), on which A* (μ) is a closed form (Note: F is disconnected graph);

Density becomes factorized product distribution in this sub-family.

Objective: K-L divergence.

Mean field is a structured variational approximation approach:

Coordinate ascent (deterministic);

Compared with stochastic approximation (sampling):

Faster, but maybe not exact.

Contrastive Divergence

Contrastive divergence (CD) is a quicker way to learn RBMs;

Contrastive divergence as the new objective;

Taking gradients and ignoring a term which is usually very small.

Steps:

Start with a training vector on the visible units.

Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling);

CD learning is biased: it does not behave as exact gradient descent on the likelihood

Improved: Persistent CD explores more modes in the distribution

Rather than from data samples, begin sampling from the model samples obtained at the last gradient update.

Still suffers from divergence of likelihood due to missing the modes.

Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
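The CD steps listed above can be sketched for a binary RBM with a single Gibbs step (CD-1). The variable names, the learning rate and the use of hidden probabilities rather than samples in the statistics are common choices assumed here for illustration, not details given in the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 update for a binary RBM on a batch v0 of shape (n, n_visible)."""
    # Start with training vectors on the visible units; update all hidden units in parallel.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Update all visible units in parallel, then the hidden units once more.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)
    n = v0.shape[0]
    # Data statistics minus one-step reconstruction statistics.
    W = W + lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```

Persistent CD would differ only in where the negative-phase chain starts: it would reuse v1 from the previous update instead of restarting from v0.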

“Wake-Sleep” Algorithm for DBN

Pre-trained DBN is a generative model;

Do a stochastic bottom-up pass (wake phase)

Get samples from factorial distribution (visible first, then generate hidden);

Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.

Do a few iterations of sampling in the top level RBM

Adjust the weights in the top-level RBM.

Do a stochastic top-down pass (sleep phase)

Get visible and hidden samples generated by generative model using data coming from nowhere!

Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Any guarantee for improvement? No!

The “Wake-Sleep” algorithm tries to make the representation economical to describe (Shannon’s coding theory).

Greedy Layer-Wise Training

Deep networks tend to have more local minima problems than shallow networks during supervised training

Train first layer using unlabeled data

Supervised or semi-supervised: use more unlabeled data.

Freeze the first layer parameters and train the second layer

Repeat this for as many layers as desired

Build more robust features

Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)

Fine tune the full network with a supervised approach;

Avoid problems to train a deep net in a supervised fashion.

Each layer gets full learning

Help with ineffective early layer learning

Help with deep network local minima

• Fixed point implementation: memory footprint and vectorization in SIMD;

• 8 bit quantization for activations and weight, 32 bits for biases, except for input layer (still floating point);

• Limited precision: 16 bits rounding for training by SGD with no degradation;

• Low precision storage: both training and testing in deep learning;

• Factorization of parameter matrix: low rank constraints;

• Feature prediction: dictionary construction and working like pooling;

• Columnar architecture for fully connected network and convolutional network.

Redundancy of Parameterization in Deep Learning

Left: Columnar architecture in a fully connected network, with the path through one column highlighted. Each column corresponds to a different parameter index.

Right: Columnar architecture in a convolutional network. In this setting the parameters take linear combinations of the feature maps obtained by convolving the input with the dictionary.

• Fusion of convolution and max pooling layers;

• Separable convolution filters and fusion with the pooling layer;

Simplifying CNN for Fast Learning

• Dense multiscale features from conv. layers of a CNN (similar to the HOG pyramid and R-CNN);

• Image pyramid input (25 scales) as stitching together with optional warping for each aspect ratio;

• Data centering to simplify mean subtraction;

• Aspect ratios at the running stage.

DenseNet: A CNN Pyramid (in Caffe)

Acceleration of CNN by Utilization of Filter Redundancy

Visualization of monochromatic and bi-clustering approximation structures. (a) The monochromatic

approximation, used for the first layer. Input color channels are projected onto a set of intermediate color

channels. After this transformation, output features need only to look at one intermediate color channel. (b)

The bi-clustering approximation, used for higher convolution layers. Input and output features are clustered

into equal sized groups. The weight tensor corresponding to each pair of input and output clusters is then

approximated. (c) The weight tensors for each input-output pair in (b) are approximated by a sum of rank 1

tensors by applying SVD decomposition.

Acceleration of CNN by Utilization of Filter Redundancy

a) The original convolutional layer acting on a single-channel input, i.e. C=1.

b) The approximation to that layer using the method of Scheme 1: approximation of the filter set by a linear combination of a smaller basis set of M filters (rank-1 for separable ones).

c) The approximation to that layer using the method of Scheme 2: take into account both input and output redundancies by considering 3D filters throughout, i.e. each convolution layer is factored as a sequence of two regular convolution layers with rectangular (in the spatial domain) filters, working on multiple channels simultaneously and shaped to match a separable filter approximation (also rank-1).

• Compress the function learned by a complex model into a much smaller faster model

that has comparable performance;

• Given enough data, a NN can approximate any function to arbitrary precision;

• Idea: Instead of training the NN on the original (small) training set, use an ensemble to

label a large unlabeled dataset and then train the NN on this much larger ensemble

labeled data set, to yield a NN that predicts similar to the ensemble and performs much

better than a NN which is trained on the original training set;

• Three methods to generate pseudo data:

• RANDOM, generate data for each attribute independently from its marginal distribution;

• NBE, estimate the joint density of attributes using the Naive Bayes and then generate samples

from this joint distribution;

• MUNGE, a new procedure that samples from a non-parametric estimate of the joint density.

Model Compression

• Shallow feed-forward nets can learn the complex functions previously learned by deep

nets and achieve accuracies previously only achievable with deep models;

• In some cases the shallow neural nets can learn these deep functions using the same

number of parameters as the original deep models;

• Complexity of a learned model, and the size of representation best used to learn that

model, are different things;

• Model compression works best when the unlabeled set is much larger than the train set, and when the unlabeled samples do not fall on train points, where the teacher model is more likely to have overfit.

• Train the student model to mimic a more accurate ensemble of deep NN models (the

teacher).

Do Deep Nets Really Need to be Deep?

• The ensemble implements a function from input to output;

• Forget the models in the ensemble and the way they are parameterized and focus on the function.

• After learning the ensemble, we have our hands on the function.

• Can we transfer the knowledge in the function into a single smaller model (Distillation)?

• Soft targets: A way to transfer the function

• If we have the ensemble, we can divide the averaged logits from the ensemble by a “temperature”

to get a much softer distribution;

• Softened outputs reveal the dark knowledge in the ensemble.

• However it works better to fit both the hard targets and the soft targets from the

ensemble.

• Down-weight the cross entropy with the hard targets.

• Dropout is an efficient way to average many large NNs;

• Organize the labels into a tree, and predict all of the labels on the path to the root,

instead of just predicting a label.

Dark Knowledge

• Distilling the knowledge in an ensemble of models into a single model;

• Transfer from a cumbersome model to a small model, more suitable for deployment.

• An ensemble of one or more full models and many specialist models which learn to

distinguish fine grained classes that the full models confuse;

• These specialist models can be trained in parallel.

• Use the class probabilities produced by the cumbersome model as “soft targets” for training the small model.

• When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or

geometric mean of their individual predictive distributions as the soft targets.

• When the soft targets have high entropy, they provide much more information per training case than

hard targets and much less variance in the gradient between training cases, so the small model can

often be trained on much less data than the original cumbersome model and using a much higher

learning rate.

• Distillation: raise the temperature of the final softmax until the cumbersome model produces a

suitably soft set of targets.

Distilling knowledge in a Neural Network

• Allow training of a student that is deeper and thinner than the teacher, using both the outputs and the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student; the student compresses a wider and shallower (but still deep) teacher network;

• A hint defined as output of a teacher’s hidden layer for guiding the student’s learning process;

• Choose a hidden layer of the FitNet as the guided layer in the student, to learn from the teacher’s hint

layer;

• Train the FitNet in a stage-wise fashion: hints as a form of regularization, guided layer with a

convolutional regressor, loss function based on prediction error of a teacher’s hint layer and regressor

over guided layer.

FitNets by Thin Deep Nets: Extension of Distillation

• Compress each Conv layer by finding an appropriate low-rank approximation;

• Then fine-tune the upper layers until the prediction performance is restored;

• Elementary tensor decompositions based on SVD; Filter clustering to take

advantage of similarities between learned features. Speed-up by 2x.

Exploiting Linear Structure Within CNN

for Efficient Evaluation

(a) The monochromatic approximation, used for the first layer. (b) The biclustering

approximation, used for higher convolution layers. (c) The weight tensors for each

input-output pair in (b) are approximated by a sum of rank 1 tensors.

• Reparameterizing with Adaptive Fastfood transform;

• It reduces the storage and computational costs from O(nd) to O(n) and O(n log d), respectively;

Deep Fried Convnets

The FCLs are replaced with the Adaptive Fastfood transform

• HashedNets exploit inherent

redundancy in NN for model

reductions;

• A hash function to randomly group

connection weights into hash

buckets, and all connections within

the same hash bucket share a single

parameter value;

• Tuned to adjust to the HashedNets

weight sharing architecture with BP;

• Shrink the storage requirements and

preserve the generalization performance.

Compressing NNs with the Hashing Trick

Random weight sharing under compression factor ¼.
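The hashed weight sharing described above can be sketched as follows. This is only an illustration under stated assumptions: `zlib.crc32` is a stand-in hash picked for this sketch (not necessarily the hash used in the original work), and the helper names and the 4x4 layer with 4 real parameters (compression factor 1/4) are hypothetical.

```python
import zlib

def bucket_index(layer, i, j, num_buckets):
    """Map connection (i, j) of a layer to a hash bucket (crc32 is a stand-in hash)."""
    key = f"{layer}:{i}:{j}".encode()
    return zlib.crc32(key) % num_buckets

def virtual_weight(params, layer, i, j):
    """Weight of connection (i, j): all connections in one bucket share one value."""
    return params[bucket_index(layer, i, j, len(params))]

def hashed_forward(x, params, n_out, layer=0):
    """Linear layer whose n_out x len(x) virtual weights live in len(params) real values."""
    return [sum(virtual_weight(params, layer, i, j) * x[j] for j in range(len(x)))
            for i in range(n_out)]

params = [0.1, -0.2, 0.3, 0.4]      # 4 real parameters for 16 virtual weights
print(hashed_forward([1.0, 2.0, 3.0, 4.0], params, n_out=4))
```

During back-propagation the gradient of a shared parameter would be the sum of the gradients of all connections hashed into its bucket; that step is not shown here.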

• A two-step approach for speeding up conv. layers within CNN based on tensor

decomposition and discriminative fine tuning;

• NLS to compute low-rank CP-decomposition (PARAFAC or CANDECOMP) of

the 4D conv. kernel tensor into a sum of a number of rank-one tensors;

• This decomposition is used to replace the original conv. layer with a sequence

of four conv. layers with small kernels;

• Finally, fine-tune on the training data using the standard BP process;

• 4x CPU speed-up for AlexNet on ImageNet, with only a 1% increase in top-5 error.

Speeding-up CNN Using Fine-tuned CPD

• Prune the network by learning only the important connections (9x, 13x);

• Quantize the weights to enforce weight sharing (32 to 5) and apply Huffman coding;

• Retrain to fine tune the remaining connections and the quantized centroids;

• AlexNet by 35x (from 240MB to 6.9MB); VGG-16 by 49x (from 552MB to 11.3MB).

Compressing Deep NN with Huffman Coding

• Tensor Train: a compact multilinear format (TT format or decomposition);

• TT-layer is fully connected layer with the weight matrix stored in the TT-format;

• TensorNet: NN with one or more TT-layers;

• # parameters reduced and expressive power of the layer is preserved;

• Up to 7x compression for the whole VGG network, with up to 200,000x for the weight matrix of a fully connected layer;

Tensorizing Neural Networks

• Replacing the conventional linear projection in FCLs with the circulant projection;

The circulant structure reduces memory footprint and enables FFT;

• Gradient computation and optimization of the circulant projections can be

performed very efficiently.

Parameter Redundancy in Deep Networks

with Circulant Projections

• For the most storage demanding dense connected layers, vector quantization

(VQ) methods have a clear gain over existing matrix factorization methods;

• K-means clustering, product quantization (PQ);

• MattNet with ImageNet: 16-24x compression with only 1% loss in accuracy;

Compressing Deep CNN using VQ

• Low-rank Approximation of Responses: SVD or PCA;

• AlexNet 4× for ImageNet with top-5 error up only 0.9%;

Efficient and Accurate Approximations

of Nonlinear CNN

Illustration of the approximation.

(a) An original layer with complexity O(dk^2c).

(b) An approximated layer with complexity reduced to O(d’k^2c) + O(dd’).

k - the spatial size of the filter

c - the number of input channels of this layer

d - the number of filters

d’ - the rank of matrix W
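A short arithmetic check of the low-rank idea, using the symbols k, c, d and d’ defined above. The sketch assumes the approximated layer consists of d’ filters of size k x k x c followed by d filters of size 1 x 1 x d’; the layer sizes are hypothetical values chosen for illustration.

```python
def conv_cost(d, k, c):
    """Multiply-adds per output position for d filters of size k x k x c."""
    return d * k * k * c

def low_rank_cost(d, k, c, d_prime):
    """Cost of d' k x k x c filters followed by d filters of size 1 x 1 x d'."""
    return d_prime * k * k * c + d * d_prime

# Hypothetical layer: 256 filters of 3x3 over 256 input channels, rank 64.
d, k, c, d_prime = 256, 3, 256, 64
original = conv_cost(d, k, c)
approx = low_rank_cost(d, k, c, d_prime)
print(original, approx, original / approx)
```

The saving grows as d’ shrinks relative to d, at the price of a larger approximation error that fine-tuning or a careful solver has to absorb.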

Sparse Coding for Visual Recognition

• Descriptor Layer: detect and locate features, extract

corresponding descriptors (e.g. SIFT);

• Code Layer: code the descriptors

• Vector Quantization (VQ): each code has only

one non-zero element;

• Soft-VQ: small group of elements can be non-

zero;

• SPM layer: pool codes across subregions and

average/normalize into a histogram.

[Lazebnik et al., CVPR 2005; Yang et al., CVPR 2009]

Sparse Coding for Visual Recognition

• Classifiers using these features need nonlinear kernels

• Lazebnik et al., CVPR 2005; Grauman, Darrell, JMLR 2007;

• High computational complexity

• Idea: modify the coding step to produce feature representations that linear classifiers can use effectively

• Sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010]

• Local Coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010]

• RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011]

• Other feature learning algorithms

Improving the coding step

Deep learning for visual recognition,

detection, localization

• Hand-crafted features:

• Needs expert knowledge

• Requires time-consuming hand-tuning

• (Arguably) one limiting factor of computer vision systems

• Key idea of feature learning:

• Learn statistical structure or correlation from unlabeled data

• The learned representations used as features in supervised and semi-supervised settings

• Hierarchical feature learning

• Deep architectures can be representationally efficient.

• Natural progression from the low level to the high level structures.

• Can share the lower-level representations for multiple tasks in computer vision.

Hierarchical Feature Learning

Feature Learning Architectures

[Figure: blocks between Pixels / Features (input) and Features (output): Filtering with Dictionary (patch/tiled/convolutional) + Non-linearity; Spatial/Feature pooling (Sum or Max); Normalization between feature responses (Local Contrast Normalization (Subtractive & Divisive), (Group) Sparsity, Max / Softmax). Not an exact separation.]

‘Filtering’

Patch

Image as a set of patches

Input

#patches

#filters

Filters

‘Filtering’ Convolutional

Translation equivariance

Tied filter weights

(same at each position, hence few parameters)

Input Feature Map

‘Filtering’

Tiled

Filters repeat every n

More filters than convolution for given # features

Input

Filters

Feature maps

‘Normalization’

Contrast normalize. (across feature

maps)

Local mean = 0, local std. = 1, “Local”

7x7 Gaussian

Equalizes the features maps

Feature Maps Feature Maps

After Contrast Normalization

Input Filters

‘Normalization’

Sparsity

Constrain L0 or L1

norm of features

Iterate with filtering

operation (ISTA

sparse coding)

Filters Features K-means Sparse Coding

Input

Patch

‘Normalization’

Induces local competition

between features to explain input

“Explaining away” in graphical

models

Just like top-down models

But more local mechanism

Filtering alone cannot do this!

Example: Convolutional Sparse Coding

Filters

Convolution

|.|1 |.|1 |.|1 |.|1

from Zeiler et al. [CVPR’10/ICCV’11]

Input

Image Reconstructed

Image

‘Pooling’

Spatial Pooling

Non-overlapping / overlapping regions

Sum or Max

Boureau et al. ICML’10 for theoretical analysis

Max

Sum

‘Pooling’

Spatial pooling

Invariance to small transformations

Larger receptive fields (see more of input)

Visualization technique from [Le et al. NIPS’10]:

Zeiler, Taylor, Fergus [ICCV 2011]

‘Pooling’

Chen, Zhu, Lin, Yuille, Zhang [NIPS 2007]

• Pooling across feature groups

• Additional form of inter-feature competition

• Gives AND/OR type behavior via (sum / max)

• Compositional models of Zhu, Yuille

[Zeiler et al., ‘11]

LeNet (LeNet-5) A layered model composed of convolution and subsampling operations followed by a

holistic representation and ultimately a classifier for handwritten digits;

Local receptive fields (5x5) with local connections;

Output via a RBF function, one for each class, with 84 inputs each;

Learning by Graph Transformer Networks (GTN);

[Figure: layer pipeline with output sizes; FORWARD runs top to bottom, BACKWARD bottom to top]

Data Layer: 1x28x28

Convolutional layer [5x5]: 20x24x24

Pooling [2x2, stride 2]: 20x12x12

Convolutional layer [5x5]: 50x8x8

Pooling [2x2, stride 2]: 50x4x4

Inner Product: 500x1

ReLU: 500x1

Inner Product: 10x1

Soft Max: 10x1
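The feature-map sizes listed for this network follow from the usual output-size formula for unpadded convolution and pooling. A minimal sketch, assuming no padding and stride-1 convolutions; the helper names are made up for this example.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1

def trace_shapes(channels, size, layers):
    """Return the (channels, height, width) shape after each layer."""
    shapes = [(channels, size, size)]
    for kind, param in layers:
        if kind == "conv":        # param = (number of filters, kernel size)
            channels, size = param[0], out_size(size, param[1])
        elif kind == "pool":      # param = (kernel size, stride)
            size = out_size(size, param[0], param[1])
        shapes.append((channels, size, size))
    return shapes

layers = [("conv", (20, 5)), ("pool", (2, 2)),
          ("conv", (50, 5)), ("pool", (2, 2))]
shapes = trace_shapes(1, 28, layers)
print(shapes)
```

The last 50x4x4 map holds 800 values, which is the input dimension of the first inner-product layer with 500 outputs.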

AlexNet A layered model composed of convol., subsample., followed by a

holistic representation and all-in-all a landmark classifier;

Consists of 5 convolutional layers, some of which followed by max-

pooling layers, 3 fully-connected layers with a final 1000-way

softmax;

Fully-connected “FULL” layers: linear classifiers/matrix

multiplications;

ReLU are rectified-linear nonlinearities on layer output, can be

trained several times faster;

Local normalization scheme aids generalization;

Overlapping pooling slightly less prone to overfitting;

Data augmentation: artificially enlarge the dataset using label-

preserving transformations;

Dropout: setting to zero output of each hidden neuron with prob. 0.5;

Trained by SGD with batch # 128, momentum 0.9, weight decay

0.0005.

The network’s input is 150,528-dimensional, and the number of neurons in the network’s

remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.

MattNet Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;

Preprocessing: subtracting a per-pixel mean;

Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the

image and randomly flipped horizontally to provide more views of each example;

SGD with mini-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent

overfitting;

65M parameters trained for 12 days on a single Nvidia GPU;

Visualization by layered DeconvNets: project the feature activations back to the input pixel space;

Reveal input stimuli exciting individual feature maps at any layer;

Observe evolution of features during training;

Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important;

DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve

structure;

Multiple such models were averaged together to further boost performance;

Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).

Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color

planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i)

via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized 55x55

feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216 dimensions).

The final layer: a C-way softmax function, C - number of classes.

Top: A deconvnet layer (left)

attached to a convnet layer

(right). The deconvnet will

reconstruct approximate version

of convnet features from the

layer beneath.

Bottom: Unpooling operation in

the deconvnet, using switches

which record the location of the

local max in each pooling region

(colored zones) during pooling in

the convnet.

Oxford VGG Net: Very Deep CNN

Networks of increasing depth using an architecture with very small (3×3) convolution filters;

Spatial pooling is carried out by 5 max-pooling layers;

A stack of convolutional layers followed by three Fully-Connected (FC) layers;

All hidden layers are equipped with the rectification ReLU non-linearity;

No Local Response Normalisation!

Trained by optimising the multinomial logistic regression objective using SGD;

Regularised by weight decay and dropout regularisation for the first two fully-connected layers;

The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation accuracy stopped improving;

For random initialisation, sample the weights from a normal distribution;

Derived from the publicly available C++ Caffe toolbox, allow training and evaluation on multiple

GPUs installed in a single system, and on full-size (uncropped) images at multiple scales;

Combine the outputs of several models by averaging their soft-max class posteriors.

The depth of the configurations increases from the left (A) to the

right (E), as more layers are added (the added layers are shown in

bold). The convolutional layer parameters are denoted as

“conv<receptive field size> - <number of channels>”. The ReLU

activation function is not shown for brevity.

GoogleNet Questions:

Vanishing gradient?

Exploding gradient?

Tricky weight initialization?

Deep convolutional neural network architecture codenamed Inception;

Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components;

Judiciously applying dimension reduction and projections wherever the computational requirements would increase too much otherwise;

Increasing the depth and width of the network but keeping the computational budget constant;

Drawbacks: Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting and the dramatically increased use of computational resources;

Solution: From fully connected to sparsely connected architectures, analyze the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.

Based on the well known Hebbian principle: neurons that fire together, wire together;

Trained using the DistBelief: distributed machine learning system.

Inception module (with dimension reductions)

Convolution

Pooling

Softmax

Other

Problems with training deep architectures?

Network in a network in a network

9 Inception modules

PReLU Networks at MSR A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit;

PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk;

Allow negative activations on the ReLU function with a control parameter a learned adaptively;

Resolve diminishing gradient problem for very deep neural networks (> 13 layers) ;

Derive a robust initialization method better than “Xavier” (normalization) initialization;

Also use Spatial Pyramid Pooling (SPP) layer just before the fully connected layers;

Can train extremely deep rectified models and investigate deeper or wider network

architectures;
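A scalar sketch of the PReLU activation and of how its slope a could be learned with momentum. The function names, the initial slope of 0.25 and the sign convention of the update (a step against the gradient) are assumptions of this example.

```python
def prelu(y, a):
    """PReLU: identity for positive inputs, slope a for negative inputs."""
    return y if y > 0 else a * y

def prelu_grad_a(y):
    """Derivative of prelu(y, a) with respect to a: 0 if y > 0, else y."""
    return 0.0 if y > 0 else y

def update_slope(a, velocity, upstream_grads, inputs, momentum=0.9, lr=0.01):
    """One momentum step on a, summing the gradient over all positions sharing a."""
    grad = sum(g * prelu_grad_a(y) for g, y in zip(upstream_grads, inputs))
    velocity = momentum * velocity + lr * grad
    return a - velocity, velocity   # descent direction assumed in this sketch

a, v = 0.25, 0.0                    # hypothetical initial slope and velocity
a, v = update_slope(a, v, upstream_grads=[1.0, 1.0], inputs=[-2.0, 3.0])
print(prelu(-2.0, a), a, v)
```

With a fixed at 0 the unit reduces to ReLU, which is why the extra cost is a single learned scalar per channel (or per layer when shared).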

ReLU vs. PReLU Note: μ is momentum, ϵ is learning rate.

http://www.erogol.com/wp-content/uploads/2015/02/PReLU_act.jpg

http://www.erogol.com/wp-content/uploads/2015/02/PReLU_bp.jpg

http://www.erogol.com/wp-content/uploads/2015/02/PReLU_bp2.jpg

http://www.erogol.com/wp-content/uploads/2015/02/PReLU_bp3.jpg

PReLU Networks at MSR Performance: 4.94% top-5 test error on the ImageNet 2012 Classification dataset;

ILSVRC 2014 winner (GoogLeNet, 6.66%);

Adopt the momentum method in BP training;

Mostly initialized by random weights from Gaussian distr.;

Investigate the variance of the FP responses in each layer;

Consider a sufficient condition in BP:

The gradient is not exponentially large/small.

http://www.erogol.com/wp-content/uploads/2015/02/PReLU2.jpg

Architectures of large models

PReLU Networks

Batch Normalization at Google Normalizing layer inputs for each mini-batch to handle saturating

nonlinearities and covariate shift;

Internal Covariate Shift (ICS): the change in the distribution of network

activations due to the change in network parameters during training;

Whitening to reduce ICS: linear transform to have zero means and unit

variances, and decorrelated;

Fix the means and variance of layer inputs (instead of whitening jointly

the features in both I/O);

Batch normalizing transform applied for activation over a mini-batch;

BN transform is differentiable transform introducing normalized

activations into the network;

Batch normalized networks

Unbiased variance estimate;

Moving average;

Batch normalized ConvNets

Effective mini-batch size;

Per feature, not per activation.
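The batch normalizing transform for a single feature can be sketched as below. It is a minimal illustration: `gamma`, `beta` and `eps` follow the usual naming for the learned scale, learned shift and stabilizing constant, and the helper names are assumptions of this example.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m      # biased mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat], mean, var

def unbiased_variance(var, m):
    """Unbiased estimate m / (m - 1) * var, used for the population statistics."""
    return var * m / (m - 1)

def batch_norm_inference(x, pop_mean, pop_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At inference, use fixed population statistics (e.g. moving averages)."""
    return gamma * (x - pop_mean) / math.sqrt(pop_var + eps) + beta

out, mean, var = batch_norm([1.0, 2.0, 3.0, 4.0])
print(out, mean, var)
```

For convolutional layers the same statistics would be shared over all spatial positions of a feature map, so the effective mini-batch size is the batch size times the map size.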

Batch Normalization at Google Reduce the dependence of gradients on the scale of the parameters or of the initial values;

Prevent small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients;

Stabilize the parameter growth and make gradient propagation better behaved in BN training;

In some cases, eliminate the need of dropout as a regularizer;

In ImageNet Classification, remove local response normalization and reduce photometric

distortions;

Reach 4.9% in top-five validation error and 4.8% test error (human raters only 5.1%).

Accelerating BN network:

Enable larger learning rate and less care about initialization, which accelerates the training;

Reduce L2 weight regularization;

Accelerate the learning rate decay.

Batch Normalization at Google

Inception architecture

Neural Turing Machines A Neural Turing Machine (NTM) architecture contains two basic components:

a neural network controller and a memory bank;

During each update cycle, the controller network receives inputs from an external

environment and emits outputs in response;

It also reads from and writes to a memory matrix via a set of parallel read and write heads.

These weightings arise by combining two addressing mechanisms with

complementary facilities;

“content-based addressing”: focuses attention on locations based on the similarity

between their current values and values emitted by the controller;

“location-based addressing”: the content of a variable is arbitrary, but the variable still

needs a recognizable name or addresses, by location, not by content;

Controller network: feed forward or recurrent.

Neural Turing Machines

Neural Turing Machine Architecture.

Flow Diagram of the Addressing Mechanism.

Highway Networks: Information Highway

Ease gradient-based training of very deep networks;

Allow unimpeded info. flow across several layers on information highways;

Use gating units to learn regulating the flow of info. through a network;

A highway network consists of multiple blocks such that the ith block computes

a block state Hi(x) and transform gate output Ti(x);

Highway networks with hundreds of layers can be trained directly using SGD

and with a variety of activation functions.

T: the transform gate; C: the carry gate, with C = 1 - T
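A scalar sketch of one highway unit with the carry gate tied to the transform gate as C = 1 - T. The tanh block state, the sigmoid gate and the parameter names are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_unit(x, w_h, b_h, w_t, b_t):
    """y = H(x) * T(x) + x * (1 - T(x)) for a single scalar unit."""
    h = math.tanh(w_h * x + b_h)      # block state H(x)
    t = sigmoid(w_t * x + b_t)        # transform gate T(x)
    return h * t + x * (1.0 - t)      # carry gate is 1 - T(x)

# A strongly negative gate bias makes the unit carry its input almost unchanged.
print(highway_unit(0.7, 1.0, 0.0, 0.0, -20.0))
print(highway_unit(0.7, 1.0, 0.0, 0.0, 20.0))
```

Initializing the gate bias to a negative value biases the network towards carrying information, which is one way very deep stacks can start training.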

Deep Residual Learning for Image Recognition

Reformulate the layers as learning residual functions with reference to the layer inputs,

instead of learning unreferenced functions;

The desired underlying mapping as H(x), then let the stacked nonlinear layers fit another

mapping of F(x) = H(x) - x;

The formulation of F(x)+x can be realized by feed forward NN with “shortcut connections”

(such as “Highway Network” and “Inception”);

These residual networks are easier to optimize, and can gain accuracy from

considerably increased depth;

An ensemble of 152 layers residual nets achieves 3.57% error on the ImageNet test set;

224x224 crop, per-pixel mean subtracted, color augmentation, batch normalization;

SGD with a mini-batch size of 256; the learning rate starts at 0.1 and is divided by 10 when the error plateaus;

Weight decay of 0.0001 and a momentum of 0.9, no drop-out;

Deep Residual Learning for Image Recognition

Residual learning: a

building block

Example network architectures for ImageNet

A deeper residual function F for

ImageNet

Rethink Inception Architecture for Computer Vision

Scale up networks in ways that aim at utilizing the added computation efficiently by factorized convolutions and aggressive regularization;

Design principles in Inception: Avoid representational bottlenecks, especially early in the network;

Higher dimensional representations are easier to process locally within a network;

Spatial aggregation over lower dim embeddings w/o loss in representational power;

Balance the width and depth of the network.

Factorizing convolutions with large filter size: asymmetric convolutions;

Auxiliary classifiers: act as regularizer, esp. batch normalized or dropout;

Grid size reduction: two parallel stride 2 blocks (pooling and activation) ;

Model regularization via label smoothing: marginalized effect of dropout;

Trained with TensorFlow: SGD with 50 replicas, batch size 32 for 100 epochs, learning rate of 0.045 decayed at an exponential rate of 0.94, RMSProp with a decay of 0.9.
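Label smoothing, mentioned in the list above, can be sketched as mixing the one-hot target with a uniform distribution over the K classes. The function name and the smoothing value of 0.1 are assumptions of this example.

```python
def smooth_labels(label, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot(label) + epsilon / num_classes."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == label else uniform
            for k in range(num_classes)]

q = smooth_labels(label=2, num_classes=4, epsilon=0.1)
print(q)
```

Training against these targets keeps the network from becoming fully confident in the labeled class, which acts as a regularizer.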


Inception modules after the factorization of the nxn convolutions. The proposed architecture chooses n = 7 for the 17x17 grid.

Inception modules with expanded filter bank outputs.

Inception modules where each 5x5 convolution is replaced by two 3x3 convolutions.


Auxiliary classifier on top of the last 17x17 layer.

Inception module that reduces the grid size while expanding the filter banks. It is both cheap and avoids the representational bottleneck.

The outline of the proposed

network architecture

Deep Learning for Generic Object Detection • Predicting the bounding box of multiple objects by DL-

based regression (GoogleNet);

• Deep Multibox method;

• Overfeat: sliding window-based detection and

localization by deep NN;

• R-CNN: region-based proposal + CNN feature (cuda-convnet)-based detection by SVM;

• SPP (spatial pyramid pooling) Net: extract window

wise features from region of feature maps;

• DeepID: selective search + R-CNN (Clarifai-fast);

• Deformation constraint pooling layer.

OverFeat: Integrated Framework with CNN

● Multi-scale and sliding window for classification, localization and detection with CNN;

● Classification: similar to AlexNet;

■ Use the same fixed input size approach with AlexNet, no contrast norm., non-overlapping

pooling;

■ SGD with decreasing learning rate, momentum, weight decay and dropout;

■ A feature extractor named “OverFeat” with two models: a fast one and an accurate one;

■ multi-view (4 corners + 1 center views + flip = 10 views);

■ fast and low memory footprint important to train bigger models;

● Localization: regression predicting coordinates of boundary boxes;

■ inputs: 256x5x5 (right after last pooling);

■ top-left, bottom-right, center, height/width (center not depend on scale);

■ fancier (similar to Yann’s face pose estimation);

● Detection: training with BG to avoid False Pos., trade-off btw pos./neg. accuracy.

OverFeat: Integrated Framework with CNN

(a): 20 pixel unpooled layer 5 feature map. (b): max

pooling over non-overlapping 3 pixel groups, using

offsets of Δ= {0, 1, 2} pixels (red, green, blue

respectively). (c): The resulting 6 pixel pooled maps for

different Δ. (d): 5 pixel classifier (layers 6,7) is applied in

sliding window fashion to pooled maps, yielding 2 pixel

by C maps for each Δ. (e): reshaped into 6 pixel by C

output maps.
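The offset pooling in steps (a) to (e) can be traced in one dimension. This is a sketch under assumptions: the 20-value feature row is synthetic, the 5-wide "classifier" is replaced by a plain sum, and the interleaving order of the final map is this example's choice.

```python
def max_pool_1d(values, size, offset):
    """Non-overlapping max pooling over groups of `size`, starting at `offset`."""
    return [max(values[start:start + size])
            for start in range(offset, len(values) - size + 1, size)]

def slide(values, width, fn):
    """Apply fn to every window of `width` consecutive values."""
    return [fn(values[i:i + width]) for i in range(len(values) - width + 1)]

def dense_outputs(feature, pool=3, window=5, fn=sum):
    """Pool at every offset, slide the classifier, then interleave the results."""
    per_offset = [slide(max_pool_1d(feature, pool, d), window, fn)
                  for d in range(pool)]
    interleaved = [per_offset[d][p]
                   for p in range(len(per_offset[0]))
                   for d in range(pool)]
    return per_offset, interleaved

feature = list(range(20))                    # stand-in for a 20 pixel feature row
per_offset, interleaved = dense_outputs(feature)
print([len(max_pool_1d(feature, 3, d)) for d in range(3)])   # pooled lengths
print(per_offset, interleaved)
```

Each offset yields 6 pooled values and 2 classifier outputs, so the three offsets together give a 6-value output map instead of 2.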

Single (top)/Multiple(bottom) output in detection Application example of regression

network

R-CNN: Regions with CNN Features A framework for object detection with ConvNets;

One can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects;

Regions with CNN detection approach:

generates ~2000 category-independent regions for the input image,

extracts a fixed-length feature vector from each region using a CNN (built on cuda-convnet);

classifies each region with category-specific linear SVM.

R-CNN outperforms OverFeat, with a mAP = 31.4% vs 24.3%.

Training: train feature extraction CNN on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL);

Pre-training: Train ImageNet

Replace last layer with FC layer to N+1 outputs (N classes + 1 “background”; VOC N=20, ILSVRC N=200 )

When labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

• Region detection -> 2000 regions;

• Regions cropped and scaled to [227 x 227] -> feature extraction with an ImageNet-trained CNN:

• 5 convolutional layers + 2 FC -> 4096 features;

• SVM for 200 classes;

• Greedy non-maximum suppression for each class: rejects a region if it has an

intersection-over-union (IoU) overlap with a higher scoring selected region larger than a

learned threshold;
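Greedy non-maximum suppression for one class, as described in the last bullet, can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the threshold of 0.3 is an arbitrary value for this example rather than a learned one.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.3):
    """Keep boxes in score order, rejecting any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The second box overlaps the first with an IoU of about 0.68 and is rejected, while the distant third box survives.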

R-CNN: Regions with CNN Features

DeepID-Net: deformable CNNs for

Generic Object Detection

Bounding box proposal by selective search;

Bounding box rejection;

Pre-train a deep model: RCNN (Classification+Detection) with Clarifai-fast;

Pre-train on image-level annotation with 1000 classes;

Fine-tune on object-level annotation with 200 classes;

Gap: classification vs. detection, 1000 vs. 200;

A deformation constrained pooling layer: even for repeated patterns;

Modeling part detectors: different parts have different sizes;

Context modeling and model averaging;

Bounding box regression.

DeepID-Net: deformable CNNs for

Generic Object Detection

SPP-Net: Spatial Pyramid Pooling CNN Introduce a spatial pyramid pooling layer to replace the last pooling layer, on top of the last convolutional layer;

Adaptively sized pooling on shared conv feature maps;

Outperform Bag of Words in keeping spatial information;

Generate a fixed-length represent. regardless of image size/scale;

Use multi-level spatial bins, robust to object deformations.
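The fixed-length property of spatial pyramid pooling can be sketched for a single feature map. The pyramid levels (1x1, 2x2, 4x4) and the floor/ceil bin boundaries are assumptions of this example; a real layer would repeat this for every channel and concatenate the results.

```python
import math

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D map into n x n bins per level and concatenate the results."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0, r1 = math.floor(i * h / n), math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = math.floor(j * w / n), math.ceil((j + 1) * w / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

small = [[r * 7 + c for c in range(7)] for r in range(5)]          # 5 x 7 map
large = [[(r * 9 + c) % 11 for c in range(9)] for r in range(13)]  # 13 x 9 map
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Both maps produce 1 + 4 + 16 = 21 values, so the fully connected layers that follow see the same input length whatever the image or window size.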

Training

• Size augmentation:

• ImageNet: 224x224 and 180x180

• Horizontal flipping

• Color altering

• Dropout with 2 last FC layers

• Learning rate:

• Initialize as 0.01; divide by

10 when error plateau

ImageNet Detection

• Find 2000 candidate windows (as in R-CNN);

• Extract the feature maps from the entire image only once (possibly at multiple scales) (as in OverFeat).

• Apply the spatial pyramid pooling on each candidate

window of the feature, which maps window to a fixed-

length representation

• Then 2 Fully Connected layers

• SVM

• ~170x faster than R-CNN

A network structure with a spatial pyramid pooling layer Pooling features from arbitrary windows on feature maps

SPP-Net: Spatial Pyramid Pooling CNN

Fast R-CNN for Object Detection

• Simultaneously learns to classify object

proposals and refine their spatial

locations in a multi-task loss;

• Pre-training: max pooling -> RoI; a final

FCL as softmax -> two FCLs as

softmax + bounding box regressor;

input as images + RoIs;

• Fine-tuning:

• Image centric sampling;

• Hierarchical mini-batch sampling;

• Joint optimize softmax + BB regressor;

• Detection:

• Truncated SVD in FCL weight matrix.

An input image and multiple RoIs are input into a FCN.

Each RoI is pooled into a fixed-size feature map and

then mapped to a feature vector by FCLs. The network

has two output vectors per RoI: softmax probabilities and

per-class bounding-box regression offsets. The

architecture is trained end-to-end with a multi-task loss.

Training Region-based Object Detectors with

Online Hard Example Mining

Online hard example mining (OHEM) to train region-based ConvNet detectors.

Auto-selection of hard examples makes training more effective and efficient.

It eliminates several heuristics and hyperparameters in common use.

It yields consistent and significant boosts in detection performance on

benchmarks like PASCAL VOC 2007 and 2012.

Architecture of the Fast R-CNN approach

Training Region-based Object Detectors with

Online Hard Example Mining

Given an image and RoIs, the network computes a feature map. (a): the RoI network runs fwd pass and the

Hard RoI uses RoI losses to select B examples. (b): hard examples used by RoI network for fwd& bwd passes.

Faster R-CNN with RPN for Object Detection

• Region Proposal Network (RPN): a FCN

• Share the conv. layers with detection network;

• Learning: image centric sampling with BP and

SGD as Fast R-CNN, initialized with ImageNet pre-

trained model, fine-tuned end-to-end for region

proposal task;

• Fast R-CNN:

• Learn using the proposals from RPN, also

initialized by a ImageNet pre-trained model;

• Joint RPN + Fast R-CNN:

• Use the detector network to initialize RPN with

fixed shared conv layers and fine tune RPN;

• Finally, fine tune the FCL of Fast R-CNN.

Region Proposal Network Encode conv. map positions and

output classified objectness score +

regressed bounds for k=9 region

proposals;

Reference boxes (k=9 anchors) with 3

scales/aspect ratios: translation

invariant;
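The k=9 reference boxes at one feature-map position can be generated from 3 scales and 3 aspect ratios. A minimal sketch: the scale and ratio values and the (x1, y1, x2, y2) box layout are assumptions of this example.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 boxes centered at (cx, cy); ratio = height / width, area = scale^2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

at_origin = make_anchors(0.0, 0.0)
shifted = make_anchors(16.0, 0.0)     # same boxes, translated with the position
print(len(at_origin), at_origin[0], shifted[0])
```

Because the same set of boxes is reused at every position, moving an object moves its matching anchor with it, which is the translation invariance noted above.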

A Unified Multi-scale Deep CNN for Fast Object Detection

A unified deep neural network, denoted the multi-scale CNN (MS-

CNN) for fast multi-scale object detection.

Consists of a proposal sub-network and a detection sub-network.

In the proposal sub-network, detection is performed at multiple output

layers, so that receptive fields match objects of different scales.

These complementary scale-specific detectors are combined to

produce a strong multi-scale object detector.

The network is learned end-to-end, by optimizing a multi-task loss.

Feature upsampling by deconvolution as an alternative to input

upsampling, to reduce the memory and computation costs.


The cubes - output tensors. h × w - filter size, c - # classes, b - # bounding box coordinates.


Different strategies for multi-scale detection (the template size)


Object detection sub-network of the MS-CNN. “trunk CNN layers” are shared with proposal

sub-network. The green (blue) cubes represent object (context) region pooling. “class

probability” and “bounding box” are the outputs of the detection sub-network.

SSD: Single Shot MultiBox Detector

Discretize the output space of bounding boxes into a set of default boxes over

different aspect ratios and scales per feature map location;

At prediction time, generate scores for the presence of each object category in

each default box and produce adjustments to the box to better match the object

shape;

Combine predictions from multiple feature maps with different resolutions to

naturally handle objects of various sizes;

Eliminates proposal generation and subsequent pixel or feature resampling stage

and encapsulates all computation in a single network.

SSD: Single Shot MultiBox Detector

(a) only needs an input image and ground truth boxes for each object during

training. In a convolutional fashion, evaluate a small set of default boxes of

different aspect ratios at each location in several feature maps with different

scales. For each default box, predict both the shape offsets and the confidences

for all object categories. At training time, first match these default boxes to the

ground truth boxes. The model loss is a weighted sum between localization loss

(e.g. Smooth L1) and confidence loss (e.g. Softmax).
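The weighted sum of localization and confidence loss for one matched default box can be sketched as below. The function names, the weight `alpha` and the four-offset box encoding are assumptions of this example; averaging over matched boxes and hard negative mining are omitted.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def softmax_loss(logits, label):
    """Negative log softmax probability of the true class (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def box_loss(pred_offsets, target_offsets, class_logits, label, alpha=1.0):
    """Confidence loss plus alpha times the localization loss for one box."""
    loc = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    conf = softmax_loss(class_logits, label)
    return conf + alpha * loc

print(box_loss([0.2, -0.1, 1.5, 0.0], [0.0, 0.0, 0.0, 0.0], [2.0, 0.5, -1.0], 0))
```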

You Only Look Once (YOLO) for Object Detection

The YOLO Detection System

The system models detection as a regression problem

to a 7 × 7 × 24 tensor. This tensor encodes bounding boxes

and class probabilities for all objects in the image.

The network uses strided conv. layers to

downsample the feature space instead of

maxpooling layers. Pre-train the conv. layers

on the ImageNet classification task and then

double the resolution for detection.

Note: More localization errors, relatively low recall.

YOLO9000: Better, Faster, Stronger

Detect over 9000 object categories: http://pjreddie.com/yolo9000/;

YOLOv2, 67 FPS, 76.8 mAP on VOC 2007; 40 FPS, 78.6 mAP;

Jointly train on object detection COCO and classification ImageNet;

Batch Normalization: 2% improvement in mAP;

High Resolution Classifier: full 448 × 448 resolution, almost 4% up in mAP;

Convolutional With Anchor Boxes: use anchor boxes to predict bound. boxes;

Dimension Clusters: k-means on the training set bounding boxes to

automatically find good priors to adjust the boxes appropriately;

Direct location prediction: predict location relative to location of the grid cell;

Fine-Grained Features: 13 × 13 map, pass through layer from 26 × 26 res.

Multi-Scale Training: every 10 batches, randomly choose a new image dimension size.
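The "Dimension Clusters" step can be sketched as k-means over box (width, height) pairs with an IoU-based distance, so that large boxes do not dominate the error as they would with Euclidean distance. This is an illustrative sketch only: the function names and the naive initialisation from the first k boxes are assumptions.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (w, h), assuming they share the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_priors(boxes, k, iters=20):
    """Cluster (w, h) pairs; distance is 1 - IoU, so the nearest centroid has the highest IoU."""
    centroids = list(boxes[:k])  # naive initialisation (assumption for the sketch)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda c: iou_wh(box, centroids[c]))
            clusters[best].append(box)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = (sum(w for w, _ in members) / len(members),
                                sum(h for _, h in members) / len(members))
    return centroids
```

The returned centroids would serve as the prior box shapes for the anchor-based predictor.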



Based on Googlenet architecture, faster than VGG-16;

Darknet-19: 19 convolutional layers and 5 maxpooling layers;

Training for classification: Darknet, data augmentation;

Training for detection: remove the last conv. layer, add on three 3 × 3 conv.

layers with 1024 filters each followed by a final 1 × 1 conv. layer;

Hierarchical classification: WordNet, -> WordTree, a model of visual concepts;

Dataset combination with WordTree: combine labels from ImageNet & COCO;

Joint classification and detection: use the COCO detection dataset and the top

9000 classes from the full ImageNet release;

YOLO9000: WordTree with 9418 classes.
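With a WordTree, each node predicts a probability conditioned on its parent, and the absolute probability of a label is the product of the conditionals along the path to the root. A minimal sketch follows; the tree, node names and numbers are hypothetical and chosen only to illustrate the product.

```python
def absolute_probability(node, parent, cond_prob):
    """Multiply conditional probabilities along the path from node up to the root."""
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

# Hypothetical three-level tree; names and values are illustrative only.
parent = {"physical object": None, "dog": "physical object", "terrier": "dog"}
cond_prob = {"physical object": 1.0, "dog": 0.8, "terrier": 0.5}
p_terrier = absolute_probability("terrier", parent, cond_prob)
```

Because a detection-only label such as "dog" stops at an inner node, labels from datasets with different granularity can be combined in one tree.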

DeepBox: Learning Objectness with CNN

DeepBox uses CNNs to rerank proposals from a bottom-up method;

A four-layer CNN architecture that is as good as much larger networks on the task

of evaluating objectness while being much faster;

DenseBox: Landmark Localization and Object Detection

Fully convolutional neural network (FCN);

Directly predicts bounding boxes and object class confidences through all

locations and scales of an image;

Improve accuracy with landmark localization during multi-task learning.

Pipeline:1) Image pyramid is fed to the network. 2) After several layers of convolution and pooling,

upsampling feature map back and apply convolution layers to get final output. 3) Convert output feature

map to bounding boxes, and apply non-maximum suppression to all bounding boxes over the threshold.
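Step 3 of the pipeline (thresholding followed by non-maximum suppression) can be sketched with the standard greedy algorithm. The function names and threshold parameters are assumptions for illustration; the slide does not give the values DenseBox uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, score_thresh, iou_thresh):
    """detections: list of (score, box). Returns kept detections, highest score first."""
    candidates = sorted((d for d in detections if d[0] >= score_thresh),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap too much with any better box.
        if all(iou(det[1], k[1]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept
```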

DenseBox: Landmark Localization and Object Detection

DenseBox

Densebox with landmark localization

R-FCN: Object Detection via Region-based

Fully Convolutional Networks

Position-sensitive score maps to handle conflict of translation-invariance in

image classification and translation-variance in object detection;

https://github.com/daijifeng001/r-fcn

Overall architecture of R-FCN. An RPN proposes candidate RoIs, which are then applied on the score maps. All learnable weight layers are convolutional and are computed on the entire image; the per-RoI computational cost is negligible.






Key idea of R-FCN for object detection. k × k = 3 × 3 position-sensitive score maps generated

by a FCN. For each of the k × k bins in an RoI, pooling is only performed on one of the k² maps.
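Position-sensitive RoI pooling for a single class can be sketched as below: bin (i, j) of the RoI reads only from score map i·k + j, and the k × k bin scores are averaged to vote for the class. This is a simplified illustration; the helper names, the integer bin boundaries and the assumption that the RoI lies fully inside the maps are all choices made for the sketch.

```python
def _bin_range(start, size, idx):
    """Integer pixel range covered by bin idx along one axis (at least one pixel)."""
    lo = int(start + idx * size)
    hi = max(int(start + (idx + 1) * size), lo + 1)
    return range(lo, hi)

def ps_roi_pool(score_maps, roi, k):
    """score_maps: k*k maps (lists of rows) for ONE class.
    roi: (x0, y0, x1, y1) in map pixels, end-exclusive, assumed inside the maps."""
    x0, y0, x1, y1 = roi
    bin_w, bin_h = (x1 - x0) / k, (y1 - y0) / k
    bin_scores = []
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            fmap = score_maps[i * k + j]  # each bin pools from its own map only
            vals = [fmap[y][x]
                    for y in _bin_range(y0, bin_h, i)
                    for x in _bin_range(x0, bin_w, j)]
            bin_scores.append(sum(vals) / len(vals))
    return sum(bin_scores) / len(bin_scores)  # average voting over the k*k bins
```

Since pooling and voting have no learnable weights, the per-RoI work is only this cheap averaging.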

LocNet: Improving Localization Accuracy for Object Detection

Object localization aiming at boosting the localization accuracy.

The model, given a search region, aims at returning the bounding box of an object of interest inside this region.

Assign conditional prob. to each row and column of this region, where these prob. provide useful info. regarding loc. of boundaries of the object inside the search region and allow accurate inference of the object bounding box under a simple prob. framework.

A CNN architecture adapted for this task, called LocNet.


Illustration of the basic work-flow of the localization

module. Left column: given a candidate

box B (yellow box), the model “looks” at a search region R

(red box), which is obtained by enlarging box B by a

constant factor, in order to localize the bounding box

of an object of interest. Right column: To localize a

bounding box the model assigns one or more

probabilities on each row and independently on each

column of region R. Those prob. can be either the

prob. of an element (row or column) to be one of the

four object borders (see top-right image), or the

probability for being on the inside of an object’s

bounding box (see bottom-right image). In either

case the predicted bounding box is drawn with blue

color.


The posterior prob. that the loc. model yields given a region R. Left image: the in-out conditional prob. assigned to each row (p_y) and column (p_x) of R, drawn with blue curves on the right and bottom sides. Right image: the conditional prob. p_l, p_r, p_t, and p_b of each column or row being the left (l), right (r), top (t) and bottom (b) border of an object’s bounding box, drawn with blue and red curves on the bottom and right sides of the search region.
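One way to turn the in-out probabilities into a box is to pick, separately for each axis, the interval that maximises the likelihood of "inside" elements within it and "outside" elements elsewhere. The sketch below does this by exhaustive search along one axis. It assumes the per-element probabilities are treated as independent; the function name and the clamping constant `eps` are illustrative, and the slides do not spell out the exact inference procedure.

```python
import math

def best_interval(p_in, eps=1e-6):
    """Return (start, end), inclusive, maximising
    sum(log p_i for i inside) + sum(log(1 - p_i) for i outside)."""
    clamped = [min(max(p, eps), 1 - eps) for p in p_in]
    log_in = [math.log(p) for p in clamped]
    log_out = [math.log(1 - p) for p in clamped]
    total_out = sum(log_out)
    best, best_score = None, -math.inf
    for s in range(len(clamped)):
        score = total_out  # start from "everything outside"
        for e in range(s, len(clamped)):
            score += log_in[e] - log_out[e]  # move element e inside
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

Running it once on the column probabilities and once on the row probabilities gives the left/right and top/bottom borders of the predicted box.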


Visualization of the LocNet network architecture

The processing starts by forwarding the entire image I through a seq. of convolutional layers that outputs the activation maps A_I. Then, the region R is projected onto A_I and the activations that lie inside it are cropped and pooled with a spatially adaptive max-pooling layer. The resulting fixed-size activation maps are forwarded through two convolutional layers, each followed by ReLU non-linearities, that yield the localization-aware activation maps A_R of region R. The network is then split into two branches, X and Y, each dedicated to the predictions for the dimension (x or y respectively) assigned to it. The resulting activations A_R^x and A_R^y efficiently encode the object location only along the dimension that their branch handles. Finally, each of those aggregated features is fed into a final fully connected layer followed by sigmoid units in order to output the conditional prob. of its assigned dimension.

A Discriminative Deep Model for Pedestrian

Detection with Occlusion Handling

Joint Deep Learning for Pedestrian Detection

Learned filters at the second convolutional layer; part models

Joint Deep Learning for Pedestrian Detection

• Visibility Reasoning with Deep Belief Net

Deformation Layer

Multi-Stage Contextual Deep Learning for Pedestrian Detection

multi-stage contextual deep model

The proposed deep Learning Architecture.

Apply different filters Fi on the same feature

map f and obtain different score maps si.

Pedestrian Detection aided by Deep Learning Semantic Tasks

Jointly optimizes pedestrian detection with semantic tasks, including pedestrian attributes ( ‘carrying

backpack’) and scene attributes ( ‘road’, ‘tree’, and ‘horizontal’);

A multi-task objective function is designed to coordinate tasks and reduce discrepancies among datasets

and a deep model, task-assistant CNN (TA-CNN), is to learn high-level features from multiple tasks and

multiple data sources.

Pedestrian Detection aided by Deep Learning Semantic Tasks

CNN-based Pose Classification

One convolutional neural net is trained on semantic part patches for each poselet and then the top-

level activations of all nets are concatenated to obtain a pose-normalized deep representation. The

final attributes are predicted by linear SVM classifier using the pose-normalized representations.

Given an image, a human body detector is used to find the bounding box around the human. Next, a

convolutional neural network (CNN) extracts shared features from the cropped image, and the shared

features are the inputs to the joint point regression tasks and the body-part detection tasks. The CNN,

regression, and detection tasks are learned simultaneously, resulting in a shared feature representation.

Heterogeneous Multi-task Learning for

Pose Estimation

A Multi-source Deep Model for Pose Estimation

How to generate multiple candidate locations: A

candidate is used as the input to a deep model

to determine whether the candidate is correct

and estimate body locations.

A multi-source deep model for constructing the

non-linear representation from three information

sources: mixture type, appearance score and

deformation.

Joint Training of a Convolutional Network and a

Graphical Model for Human Pose Estimation

A hybrid architecture that consists of a deep CNN and a MRF.

The architecture exploits structural domain constraints such as geometric relationships between body joint locations.

Joint training of these two model paradigms improves performance.

Multi-Resolution Sliding-Window With Overlapping Receptive Fields



Efficient Sliding Window Model with Single Receptive Field




Efficient Sliding Window Model with Overlapping Receptive Fields




Approximated Efficient Sliding Window Model with Overlapping Receptive Fields

Stacked Hourglass Networks for Human Pose Estimation

A convolutional network architecture for human pose estimation.

Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body.

Run repeated bottom-up, top-down processing in conjunction with intermediate supervision to improve the performance of the network.

A “stacked hourglass” network based on successive steps of pooling-and-upsampling.

A network consists of multiple stacked hourglass modules which allow for repeated bottom-up, top-down inference

The person’s orientation, arrangement of their limbs, and the relationships of adjacent joints are best recognized at different scales in the image.

The hourglass is a simple, minimal design with the capacity to capture all these features and bring them together to output pixel-wise predictions. Use a pipeline with skip layers to preserve spatial info. at each resolution.

It reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.

Stacking multiple hourglasses end-to-end, feeding the output of one as input into the next, which provides a mechanism for repeated bottom-up, top-down inference allowing for reevaluation of initial estimates and features across the whole image;

The key: prediction of intermediate heatmaps upon which to apply a loss.
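Intermediate supervision can be sketched as applying the same heatmap loss to the output of every hourglass and summing the results. The sketch assumes a mean squared error between predicted and ground-truth heatmaps; the function names are illustrative.

```python
def mse(pred, target):
    """Mean squared error between two 2-D heatmaps given as lists of rows."""
    diffs = [(p - t) ** 2
             for prow, trow in zip(pred, target)
             for p, t in zip(prow, trow)]
    return sum(diffs) / len(diffs)

def intermediate_supervision_loss(stage_heatmaps, target):
    """Sum the heatmap loss over the prediction of every stacked hourglass."""
    return sum(mse(h, target) for h in stage_heatmaps)
```

Every stage is pushed toward the same ground truth, so later hourglasses refine earlier estimates rather than starting over.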


An illustration of a single “hourglass” module. Each box corresponds to a

residual module. # of features is consistent across the whole hourglass.

Residual Module in the network The intermediate supervision process


The change in predictions

from an intermediate

stage (second hourglass)

(left) to final predictions

(eighth hourglass) (right).


Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields

Efficiently detect the 2D pose of multiple people.

A nonparametric representation, Part

Affinity Fields (PAFs), to learn to associate

body parts with individuals in the image.

The architecture encodes global context,

allowing a greedy bottom-up parsing step

that maintains high accuracy while

achieving real-time performance.

The architecture jointly learns part locations

and their association via two branches of

the same sequential prediction process.

Top: Multi-person pose estimation.

Bottom left: Part Affinity Fields (PAFs) corresponding to the

limb connecting right elbow and right wrist.

Bottom right: zoom-in view of predicted PAFs.

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Overall pipeline. The entire image as input for a two-branch CNN to jointly predict confidence maps for

body part detection in (b), and part affinity fields for parts association in (c). A set of bipartite matchings to

associate body parts candidates (d). Assemble them into full body poses for all people in the image (e).


Architecture of the two-branch multi-stage CNN. Each

stage in the first branch predicts confidence maps, and

each stage in the second branch predicts PAFs. After each

stage, the predictions from the two branches, along with

the image features, are concatenated for next stage.

a set of detection confidence maps

a set of part affinity fields


Confidence maps of the right wrist (top) and

PAFs (bottom) of right forearm across stages.

Although there is confusion between left and right body

parts and limbs in early stages, estimates are

increasingly refined through global inference in

later stages, as shown in the highlighted areas.

Part association strategies. (a) body part detection

candidates (red and blue dots) for two body part types

and all connection candidates (grey lines). (b)

connection results using the midpoint (yellow dots) representation:

correct connections (black lines) and incorrect

connections (green lines) that also satisfy incidence

constraint. (c) Results using PAFs (yellow arrows). By

encoding position and orientation over the support of

the limb, PAFs eliminate false associations.
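The association score between two part candidates can be sketched as the average alignment between the PAF and the unit vector of the candidate limb, sampled along the segment joining them. Here `paf` is a hypothetical callable returning the 2D field vector at a point, and the number of samples is an arbitrary choice for the example.

```python
import math

def paf_score(paf, p1, p2, samples=10):
    """paf(x, y) -> (vx, vy). Average dot product between the field and the
    unit direction of the segment p1 -> p2, sampled at evenly spaced points."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0 or samples < 2:
        return 0.0
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for s in range(samples):
        u = s / (samples - 1)  # interpolation parameter in [0, 1]
        vx, vy = paf(p1[0] + u * dx, p1[1] + u * dy)
        total += vx * ux + vy * uy
    return total / samples
```

A candidate pair scores high only when the field along the whole segment points in the limb direction, which is what rules out the false midpoint matches in panel (b).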

Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields

Graph matching. (a) image with part detections (b) K-partite graph (c) Tree (d) A set of bipartite graphs

Multi-view Face Detection using CNNs: Deep Dense Face Detector

No face pose/landmark annotation, segmentation, bounding box regression;

Detect faces from different view angles and with occlusion to some extent;

Can improve based on better sampling strategies and data augmentation;

Sliding window-based detector with tuned AlexNet and data augmentation;

Size 227x227, 50k iterations, batch size 128 (32 pos + 96 neg);

Convert the fully connected layer to a convolutional layer by reshaping the layer parameters;

Run CNN to get the heat map of face classifiers for refined localization;

Improve the face localization using a bounding box regression module.

Comparison with R-CNN: the best R-CNN classifier is inferior;

Loss of recall (miss of selective search);

loss of localization (bounding box regression).

Multi-view Face Detection using CNNs: Deep Dense Face Detector

A set of faces with different out-of-

plane rotations and occlusions. Heat map of face classifier

Facial Landmark Detecting by Deep Multi-task Learning

• Face attribute would help in detecting facial landmarks;

• Task-wise early stop in Multi-task learning;

• Training data: Multi-task Facial Landmark dataset;

• Testing data: AFLW (annotated facial landmarks in the wild);

• Points: extend 5 points to 68 points.

Face Alignment by Deep Regression

• Apply a global layer and multi-stage local layers;

• Sequential learning, joint learning.

(a) The network takes face

image as input and outputs

shape estimation. The global

layer estimates initial shape

and the rest local layers refine

the estimation iteratively. (b)

Inner structure of the global

layer; (c) Inner structure of

the tth local layer.

DeepFace for Face Verification at Facebook

• Fiducial points are extracted by a Support

Vector Regressor (SVR) trained to predict point

configurations from an image descriptor;

• 2D alignment: 2D similarity transformation;

• 3D alignment: Based on a generic 3D shape

model, register a 3D affine camera by back-

projecting the frontal face plane of the 2D-

aligned crop to the image plane of the 3D

shape.

Learn the Deep Face Representation: Face++

• Megvii Face Classification (MFC) database: 5 million labeled faces with 20000 people;

• 10-layer CNN serving as four cropped-face-region-based feature extractors, used in softmax-based training and PCA + L2 norm verification;

• 99.50% accuracy on the LFW benchmark.

Learn the Deep Face Representation: Face++

• Pyramid CNN adopts a greedy-filter-and-down-sample operation, which enables the

training procedure to be very fast and computation efficient.

• Its structure can naturally incorporate feature sharing across multi-scale face

representations, increasing the discriminative ability of resulting representation.

DeepID: Deep Learning Face Representation

Deep hidden identity features (DeepID) for face verification and identification;

Features are taken from the last hidden layer neuron activations of deep CNN;

The proposed features are extracted from various face regions to form

complementary and over-complete representations;

Integrated with Joint Bayesian for face verification: 97.45% accuracy on LFW

dataset.

Structure of the neural network

used for face verification ConvNet structure used for DeepID

DeepID2: Deep Learning Face Representation

DeepID2: joint face verification and identification to reduce former-cared intra-

personal variations and enlarge the latter-cared inter-personal variations;

Deep CNN: in 55x47, 4 conv. + 3 max pooling, ReLU, 160-D feature vector;

Learning by SGD;

99.15% for verification rate on LFW dataset.

DeepID2+: Deep Learning Face Representation

DeepID2+ net and supervisory signals

• DeepID2+: increasing the dimension of

hidden representations and adding

supervision to early convolutional layer;

• Sparse neural activations, selective

neurons in higher layers and robustness

to occlusions;

• Get larger with 128 feature maps in each

conv. layer;

• Supervisory signals are only added to one

fully-connected layer from 3rd & 4th conv.

layers; the lower conv. layers can only get

supervision from higher layers;

• 99.47% and 93.2% for verification rates on

LFW and Youtube dataset.

DeepID3: Face Recognition with Very Deep Neural Network

• Apply stacked

convolution and

inception layers

proposed in VGG Net

and GoogLeNet to make

them suitable to face

recognition;

• An ensemble of

proposed two

architectures achieves

LFW face verification

accuracy 99.53% and

LFW rank-1 face

identification accuracy

96.0%, respectively.

FaceNet: A Unified Embedding for Face Recognition and Clustering

Learn a Euclidean 128-D embedding per image using a deep CNN;

L2 distances -> face similarity;

Face verification: thresholding the distance btw. two embeddings;

Face recognition: k-NN classification;

Face clustering: k-means or agglomerative clustering;

Triplet based loss function based on LMNN (Large Margin Nearest Neighbor);

Apply a new online negative exemplar mining strategy;

Apply hard positive mining strategy in face clustering.

Model Structure The Triplet Loss


The Triplet Loss minimizes the distance btw an

anchor and a positive, both of which have the same

identity, and maximizes the distance between the

anchor and a negative of a different identity.

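$L = \sum_i \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+$, where $f$ is the embedding, $x_i^a$, $x_i^p$, $x_i^n$ are the anchor, positive and negative of the $i$-th triplet, and $\alpha$ is the margin.

(Figure labels: anchor, positive, negative)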

Instead of picking the hardest positive, use all

anchor-positive pairs in a mini-batch while still

selecting the hard negatives.
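The triplet loss and the selection of a hard negative within a mini-batch can be sketched as follows. The margin value and the function names are illustrative assumptions, and the sketch picks the single closest negative, which is a simplification of the online mining strategy.

```python
def sq_dist(a, b):
    """Squared L2 distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def hardest_negative(anchor, negatives):
    """The negative embedding closest to the anchor."""
    return min(negatives, key=lambda n: sq_dist(anchor, n))

def batch_loss(anchor, positives, negatives, margin=0.2):
    """Use all anchor-positive pairs, each paired with the hardest negative."""
    neg = hardest_negative(anchor, negatives)
    return sum(triplet_loss(anchor, p, neg, margin) for p in positives)
```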

Selecting the hardest negatives


CNN: MattNet and GoogleNet (Inception);

Training: use SGD with standard BP and AdaGrad;

Performance: LFW (98.87%, 99.63% with alignment) and Youtube (95.12%).

MattNet GoogleNet

Note: at GTC’15, Andrew Ng announced Baidu’s verification rate on the LFW dataset is 99.85%!

Text Detection and Recognition based on CNNs

The end-to-end text spotting pipeline. a) A combination of region proposal methods extracts many word

bounding box proposals. b) Proposals are filtered with a random forest classifier reducing number of false-

positive detections. c) A CNN is used to perform bounding box regression for refining the proposals. d) A CNN

performs text recognition on each of the refined proposals. e) Detections are merged based on proximity and

recognition results and assigned a score. f) Thresholding the detections results in the final text spotting result.

Text Detection and Recognition based on CNNs

A schematic of the CNNs used, showing the dimensions of the feature maps at each stage for (a)

dictionary encoding, (b) character sequence encoding, and (c) bag-of-N-gram encoding. The same

five-layer, base CNN architecture is used for all three models.

Multi-digit Number Recognition from Street Views using Deep CNNs

Unify the localization, segmentation, and recognition steps via the use of a deep CNN that operates directly on the image pixels;

Train a probabilistic model of

sequences given images;

Extract a set of features H from the

image X using a CNN with a fully

connected final layer;

Six separate softmax classifiers are

then connected to this feature vector

H, i.e., each softmax classifier forms a

response by making an affine

transformation of H and normalizing

this response with the softmax

function.
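The output stage can be sketched as several independent softmax heads that share the feature vector H: each head applies its own affine transformation and normalizes the result. The function names and the toy weights are assumptions for illustration; the slide does not state what each of the six classifiers predicts.

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def affine(W, b, h):
    """Compute W h + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def predict_heads(heads, h):
    """heads: list of (W, b) pairs, one per softmax classifier attached to h."""
    return [softmax(affine(W, b, h)) for W, b in heads]
```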

Scene Parsing with Multiscale Feature Learning

• Compute a tree of segments from a graph of pixel dissimilarities.

• A set of dense feature vectors encodes regions of multiple sizes centered on each pixel.

• The feature extractor is a multi-scale trained convolutional network.

• The feature vectors associated with the segments covered by each node in the tree are

aggregated and fed to a classifier which produces an estimate of the distribution of

object categories contained in the segment.

• A subset of tree nodes that cover the image are then selected so as to maximize the average

“purity” of the class distributions, hence maximizing the overall likelihood that each segment will

contain a single object.

• The convolutional network feature extractor is trained end-to-end from raw pixels,

alleviating the need for engineered features. After training, the system is parameter free.

• From a pool of segmentation components, retrieve an optimal set of components that

best explain the scene, taken from a segmentation tree or any family of over-

segmentations.


(a) Method 1

(b) Method 2

Simultaneous Detection and Segmentation (SDS)

• Apply R-CNN to classify category-independent region proposals, and use category-

specific top-down figure ground predictions to refine the bottom up proposals;

• Proposal generation: category-independent bottom-up by MCG;

• Feature extraction for each region by R-CNN;

• Region classification by SVM and assign a score for each category to each candidate.

• Region refinement (non maximum suppression) on the scored candidates.

Learning Rich Features from RGB-D Images for Object Detection and Segmentation

• Geocentric embedding for depth images that encodes height above ground and angle

with gravity for each pixel in addition to the horizontal disparity.

• Better than using depth images for learning feature representations with CNN.

• A decision forest approach that classifies pixels as FG or BG using a family of unary

and binary tests that query shape and geocentric pose features.

Fully Convolutional Networks for Semantic Segmentation

• Build “fully convolutional” networks that take input of arbitrary size and produce

correspondingly-sized output with efficient inference and learning;

• Adapt the classification nets into fully convolutional network and transfer their learned

representations by fine tuning (after a supervised pre-training);

• Combine semantic info. from a deep coarse layer with appearance info. from a

shallow fine layer to produce accurate and detailed segmentation;

Transforming fully connected layers into convolution layers. Fully convolutional networks can efficiently learn to make

dense predictions for per pixel semantic segmentation.


• While a general deep net computes a general nonlinear function, a net with only layers of this form

computes a nonlinear filter, called a deep filter or fully convolutional network;

• Upsampling by deconvolution is more efficient and effective to yield dense predictions;

• Input shifting and output interlacing without interpolation (proposed by OverFeat);

• Changing only the filters and layer strides of a convnet can produce the same output as the “shift-and-stitch” trick.

• Define a FCN combining layers of the feature hierarchy

and refining the spatial precision of the output;

• Add links combining the final prediction layer with lower layers

with finer strides, turning into a DAG, not layer-wise, with edges

skipping ahead from lower layers to higher ones;

• Learn the skip net which improves the perf.

• Note: Decreasing the stride of pooling layers can give finer predictions, but it is problematic because the conv. layers then need large kernels, which raises the learning cost.


DAG nets learn to combine coarse, high layer information with fine, low layer information.

Solid line (FCN-32s): A single-stream net upsamples stride 32 predictions back to pixels.

Dashed line (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at

stride 16, lets the net predict finer details, while retaining high-level semantic information.

Dotted line (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision.
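The skip fusion behind FCN-16s can be sketched as: upsample the coarse stride-32 scores by 2×, then add the stride-16 scores predicted from pool4. The sketch below uses nearest-neighbour upsampling as a stand-in for the learned deconvolution, which is a simplification, and the function names are illustrative.

```python
def upsample2x(score_map):
    """Nearest-neighbour 2x upsampling of a 2-D score map (stand-in for learned deconvolution)."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine):
    """Add the upsampled coarse scores to the finer-stride scores, element-wise."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(urow, frow)] for urow, frow in zip(up, fine)]
```

Applying `fuse` again with pool3 scores at stride 8 would give the FCN-8s variant.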

Deep jet (nonlinear local feature hierarchy): make local predictions while respecting global structure.

DeepLab: CNN + CRF for Semantic Segmentation

o Learning DCNNs for semantic image segmentation from either (1) weakly annotated training data such as bounding boxes or image-level labels or (2) a combination of few strongly labeled and many weakly labeled images, sourced from one or multiple datasets;

o Use DCNN to predict the label distribution per pixel, followed by a fully-connected (dense) CRF to smooth the predictions while preserving image edges;

o Expectation-Maximization (EM) methods for semantic image segmentation model training under these weakly supervised and semi-supervised settings.


DeepLab model training using image-level labels


DeepLab model training from bounding boxes

DeepLab model training on a union of full (strong

labels) and image-level (weak labels) annotations

ParseNet: Looking Wider to See Better

o Global context to CNN for semantic segmentation;

o Using the average feature for a layer to augment the features at each location;

o Learning normalization parameters for improvement.

o Early fusion: unpool (replicate) the global feature to the same spatial size as the local feature map, then concatenate them and use the combined feature to learn the classifier;

o Late fusion: each feature is used to learn its own classifier, followed by merging the two predictions into a single classification score;
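Early fusion can be sketched as: average the feature vectors over all locations, normalize, replicate the global vector to every location, and concatenate it with the local feature. The sketch applies plain L2 normalization and omits the learned per-channel scale; the function names are illustrative.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v)) + eps
    return [x / n for x in v]

def early_fusion(fmap):
    """fmap[y][x] is a channel vector. Returns a map whose vectors are
    [normalized local feature, normalized global average feature]."""
    cells = [c for row in fmap for c in row]
    channels = len(cells[0])
    # Global context: average of each channel over all spatial locations.
    g = [sum(c[k] for c in cells) / len(cells) for k in range(channels)]
    g = l2_normalize(g)
    # "Unpool" by replicating g at every location, then concatenate.
    return [[l2_normalize(c) + g for c in row] for row in fmap]
```

Normalizing both parts first matters because local and global features can differ greatly in scale, and the larger one would otherwise dominate the classifier.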

SegNet: Deep Conv. Encoder-Decoder Architecture

A decoder upsamples its input using the transferred pool indices from its

encoder to produce a sparse feature map(s);

It then performs convolution with a trainable filter bank to densify the

feature map;

The final decoder output feature maps are fed to a soft-max classifier for

pixel-wise classification.
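Upsampling with transferred pooling indices can be sketched as below: the encoder's 2 × 2 max pooling records where each maximum came from, and the decoder writes each value back to that position, leaving zeros elsewhere. The function names are illustrative, and the sketch assumes even map dimensions.

```python
def max_pool_2x2(m):
    """2x2 max pooling that also returns the (row, col) position of each maximum."""
    pooled, indices = [], []
    for y in range(0, len(m), 2):
        prow, irow = [], []
        for x in range(0, len(m[0]), 2):
            window = [(m[y + dy][x + dx], (y + dy, x + dx))
                      for dy in (0, 1) for dx in (0, 1)]
            val, pos = max(window, key=lambda t: t[0])
            prow.append(val)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def unpool(pooled, indices, height, width):
    """Place each pooled value at its recorded position; all other entries stay zero."""
    out = [[0.0] * width for _ in range(height)]
    for prow, irow in zip(pooled, indices):
        for val, (y, x) in zip(prow, irow):
            out[y][x] = val
    return out
```

The sparse output is what the trainable decoder filter bank then densifies.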

SegNet: Deep Conv. Encoder-Decoder Architecture

o SegNet uses the max pooling indices to upsample (without learning) the feature map(s) and convolves with a trainable decoder filter bank.

o FCN upsamples by learning to deconvolve the input feature map and adds the corresponding encoder feature map to produce the decoder output; this feature map is the output of the max-pooling layer (includes sub-sampling) in the corresponding encoder.

o Note that there are no trainable decoder filters in FCN.

Mask R-CNN for Object Instance Segmentation

Detect objects in an image while generating a segmentation mask for each instance.

Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

Simple to train and adds a small overhead to Faster R-CNN, running at 5 fps.

Easy to generalize to other tasks, allowing to estimate poses in the framework.


Head Architecture: Left/Right for the heads for the ResNet C4 and FPN backbones, to which

a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote

either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial

dimension while deconv increases it). All convs are 3×3, except the output conv which is

1×1, deconvs are 2×2 with stride 2, and use ReLU in hidden layers. Left: ‘res5’ denotes

ResNet’s 5th stage, which for simplicity we altered so that the first conv operates on a 7×7

RoI with stride 1. Right: ‘×4’ denotes a stack of four consecutive convs.


Extended to human pose estimation: Model a keypoint’s location as a one-hot mask,

and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left

shoulder, right elbow).

Minimal domain knowledge for human pose is exploited by the system. For each of the K keypoints of an instance, the training target is a one-hot m × m binary mask

where only a single pixel is labeled as FG.

During training, for each visible ground-truth keypoint, minimize the cross-entropy loss over an m²-way softmax output.

As in instance segmentation, the K keypoints are treated independently.
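The keypoint training target can be sketched as a softmax over all m × m positions with a single foreground pixel as the correct class. The function name and argument layout are illustrative assumptions.

```python
import math

def keypoint_loss(logits, keypoint_xy):
    """logits: m x m grid of scores (list of rows).
    keypoint_xy: (x, y) of the single foreground pixel.
    Returns the cross-entropy of a softmax over all m*m positions."""
    m = len(logits)
    flat = [v for row in logits for v in row]
    target = keypoint_xy[1] * m + keypoint_xy[0]  # row-major index of the one-hot pixel
    mx = max(flat)
    log_z = mx + math.log(sum(math.exp(v - mx) for v in flat))
    return log_z - flat[target]
```

Each of the K keypoint types would get its own grid and its own loss term.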

The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a

deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56.

Train the models using image scales randomly sampled from [640, 800] pixels;

inference is on a single scale of 800 pixels. 90k iterations, with a learning rate of 0.02 that is reduced by a factor of 10 at 60k and 80k iterations.

Bounding-box non-maximum suppression with a threshold of 0.5.

Tracking with Deep Neural Networks

Use a ConvNet with filter banks trained either supervised or unsupervised;

Each layer consists of three sequential operations as follows:

Spatial convolutions with kernels;

tanh non-linearity;

Pooling.

Images are transformed into the feature vectors at the output of ConvNet;

An RBFN is used to produce a confidence map based on the distance btw the feature

vectors and a reference vector.

Learning Representation for Visual Tracking

Train a stacked Denoising Autoencoder (SDAE) to learn generic features

followed by knowledge transfer from offline training to online tracking;

More robust against variations;

Unsupervised feature learning by training an SDAE with auxiliary image;

Layer-by-layer pretraining is first applied and then the whole SDAE is fine-tuned;

Online tracking involves a classification NN constructed from the encoder part

as a feature extractor and an additional classification layer;

Tuned to adapt to appearance changes of the moving object;

An additional classification layer is added to the encoder part of the trained SDAE.

Some filters in the first layer of the learned SDAE

Learning Representation for Visual Tracking

denoising autoencoder stacked denoising autoencoder network for online tracking.

Visual Tracking with a Single CNN

A truncated structural loss function maintains as many training samples as possible and reduces risk

of tracking error by accommodating uncertainty of model output;

Enhance the ordinary SGD approach in training with a temporal selection mechanism, which

generates positive and negative samples within different time periods;

The architecture of CNN tracker with multiple image cues

The bottom row shows the three-stages operations on a frame: test, estimation and training.

In the training frames, the green bounding-boxes are the negative samples while the red ones

denote the positive samples. The dashed block covers the positive sample pool (red) and

negative sample pool (green). In each pool, the edges of the sample patches indicate their

sampling importances. The thicker the edge, the more possible it will be selected for training.

Transferring Rich Feature Hierarchies for

Robust Visual Tracking

A major hurdle that hinders the application of CNN to visual tracking is the

lack of properly labeled training data.

An enormous amount of training data is required, but visual tracking typically

has only one labeled example in the first frame of each video.

Pre-training a CNN offline and then transferring the rich feature hierarchies

learned to online tracking.

The CNN is also fine-tuned during online tracking to adapt to the appearance

of the tracked target specified in the first video frame.

To fit the characteristics of object tracking, first pre-train the CNN to recognize

what is an object, and then propose to generate a probability map instead of

producing a simple class label.

Transferring Rich Feature Hierarchies for

Robust Visual Tracking

Architecture of the proposed structured output CNN

Pipeline of the proposed tracking algorithm

Robust Visual Tracking via Convolutional

Networks Without Training

Without offline training with a large amount of auxiliary data, simple two-layer

convolutional networks can be powerful enough to learn robust

representations for visual tracking.

In the 1st frame, extract a set of normalized patches from the target region as

fixed filters; these are combined with a series of adaptive contextual filters surrounding

the target to define a set of feature maps in the subsequent frames.

These maps measure similarities between each filter and useful local intensity

patterns across the target, thereby encoding its local structural information.

Furthermore, all the maps together form a global representation, via which the

inner geometric layout of the target is also preserved.

A simple soft shrinkage method that suppresses noisy values below an

adaptive threshold is employed to de-noise the global representation.
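
As a minimal sketch of the soft shrinkage step, assuming NumPy: values whose magnitude falls below the threshold become zero and the rest are pulled toward zero by the threshold. The feature-map values and the median-based threshold below are hypothetical and only stand in for the adaptive threshold mentioned above.

```python
import numpy as np

def soft_shrink(x, threshold):
    """Soft shrinkage: move every value toward zero by `threshold`
    and zero out anything whose magnitude is below it."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

# Hypothetical feature-map values. The threshold is the median magnitude,
# chosen only for illustration (an assumption, not the paper's rule).
feature_map = np.array([[0.05, -0.40],
                        [0.90, -0.02]])
threshold = np.median(np.abs(feature_map))
denoised = soft_shrink(feature_map, threshold)
```

Small responses are removed entirely, which is what makes the resulting representation sparse.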

Visual Tracking with Fully Convolutional Networks

An in-depth study of the properties of CNN features pre-trained offline on massive

image data for the ImageNet classification task.

Convolutional layers in different levels characterize the target from different

perspectives.

A top layer encodes more semantic features and serves as a category detector,

while a lower layer carries more discriminative information and can better

separate the target from distracters with similar appearance.

Both layers are jointly used with a switch mechanism during tracking.

For a tracking target, only a subset of neurons is relevant.

Feature map selection removes noisy and irrelevant feature maps, which

reduces computational redundancy and improves tracking accuracy.

Visual Tracking with Fully Convolutional Networks

Algorithm pipeline. (a) Input ROI region. (b) VGG network. (c) SNet. (d) GNet. (e) Tracking results

Robust Visual Tracking via Convolutional

Networks Without Training

Input samples are warped into canonical 32 × 32 images. The k-means algorithm is used to extract a set of normalized local patches from the

warped target region in the 1st frame, and a set of normalized local patches is extracted from the contextual region surrounding the

target. These patches are applied as filters to convolve each normalized sample extracted from subsequent frames, resulting in a set of feature maps.

Finally, the feature maps are de-noised by a soft shrinkage method, which results in a robust sparse representation.

Hierarchical Convolutional Features for Visual Tracking

Features extracted from pre-trained deep CNNs

are used to improve tracking accuracy and robustness.

Outputs of the last conv. layers encode

semantic information, and such representations are

robust to appearance variations.

Hierarchies of conv. layers are treated as a

nonlinear counterpart of an image

pyramid representation, exploiting these

multiple levels of abstraction for tracking.

Adaptively learn correlation filters on each

conv. layer to encode target appearance.

Hierarchically infer the maximum response

of each layer to locate targets.

Given an image, crop the search window centered at

the estimated position in the previous frame. Use the

3rd, 4th and 5th conv. layers as target representations.

Each layer indexed by i is then convolved with the

learned linear correlation filter w(i) to generate a

response map, whose location of the maximum value

indicates the estimated target position. Search multi-

level response maps to infer target location.
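
The localization step can be sketched as follows, assuming NumPy and several simplifications that are not taken from the slides: each layer is reduced to a single-channel map of the same size, the filter is the standard closed-form ridge-regression correlation filter in the Fourier domain, and the per-layer response maps are fused by a weighted sum with hypothetical weights rather than a true coarse-to-fine search.

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    """Desired response: a Gaussian peak at the target center."""
    rows, cols = np.indices(shape)
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def learn_filter(feature, label, lam=1e-3):
    """Closed-form correlation filter (ridge regression) in the Fourier domain."""
    F = np.fft.fft2(feature)
    G = np.fft.fft2(label)
    return G * np.conj(F) / (F * np.conj(F) + lam)

def response(filter_hat, feature):
    """Response map of one layer for a new search window."""
    return np.real(np.fft.ifft2(np.fft.fft2(feature) * filter_hat))

def locate(filters, features, weights):
    """Fuse the per-layer response maps and return the peak position."""
    total = sum(w * response(h, f) for h, f, w in zip(filters, features, weights))
    return np.unravel_index(np.argmax(total), total.shape)

# Random arrays stand in for the 3rd, 4th and 5th conv. layer features.
rng = np.random.default_rng(0)
center = (16, 16)
label = gaussian_label((32, 32), center)
train_feats = [rng.standard_normal((32, 32)) for _ in range(3)]
filters = [learn_filter(f, label) for f in train_feats]

# Simulate target motion by circularly shifting every feature map.
shift = (3, -2)
next_feats = [np.roll(f, shift, axis=(0, 1)) for f in train_feats]
position = locate(filters, next_feats, weights=(1.0, 0.5, 0.25))
```

The peak of the fused response moves by the same offset as the features, which is how the estimated target position is read out.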

Hierarchical Convolutional Features for Visual Tracking

Visualization of convolutional layers

Learning to Track at 100 FPS with Deep Regression Networks

Offline training of NNs that can track novel objects at test-time at 100 fps.

A simple feed-forward network with no online training required.

The tracker learns a generic relationship between object motion and appearance and

can be used to track novel objects that do not appear in the training set.

It improves when adding more videos to the offline training set.

Learning to Track at 100 FPS with Deep Regression Networks

The network architecture for tracking. The input to the network is a search region

from the current frame and a target from the previous frame. The network

learns to compare these crops to find the target object in the current image.

Visual Tracking with CNN based Object Proposals

Tracking-by-detection methods face object appearance changes, size and

shape deformations, and partial and full occlusions, which make online adaptation of

classifiers and object models a substantial challenge.

An object proposal network that generates a small yet refined set of bounding

box candidates to mitigate the object model refitting problem by concentrating

on hard negatives when updating the classifier.

This improves the discriminative power, as hard negatives are likely to be due to

background and other distractions.

In each frame, applying the classifier only on the refined set of object-like

candidates would be sufficient to eliminate most of the false positives.

Incorporating an object proposal makes the tracker robust against shape

deformations, as these are handled naturally by the proposal stage.

Visual Tracking with CNN based Object Proposals

Region Proposal Network Tracker (RPNT): In a new frame t + 1, an offline VGG network is used to

generate a feature map, which is then fed to the region proposal network (RPN) to obtain candidate

bounding boxes. Region of interest (RoI) pooling layer extracts feature vectors with a fixed size for

the online structured support vector machine (SSVM) that serves as the classifier. The proposal with

the maximal response is assigned as the new location of the object.

Learning Multi-Domain CNNs for Visual Tracking

Pre-trains a CNN using a large set of videos with tracking ground truths to

obtain a generic target representation.

Composed of shared layers and multiple branches of domain-specific layers,

where domains correspond to individual training sequences and each branch is

responsible for binary classification to identify target in each domain.

Train each domain in the network iteratively to obtain generic target

representations in the shared layers.

When tracking a target in a new sequence, construct a new network by

combining the shared layers in the pre-trained CNN with a new binary

classification layer, which is updated online.

Online tracking is performed by evaluating the candidate windows randomly

sampled around the previous target state.
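
A minimal sketch of this online tracking step, assuming NumPy: candidate boxes are drawn from a Gaussian around the previous state and the best-scoring one becomes the new state. The function names, the box layout `(cx, cy, w, h)`, the sampling parameters and the toy scoring function are all hypothetical; in the tracker the score would come from the network's binary classification layer.

```python
import numpy as np

def sample_candidates(prev_box, n=256, trans_std=0.3, scale_std=0.5,
                      scale_base=1.05, rng=None):
    """Draw n candidate boxes (cx, cy, w, h) around the previous target state."""
    rng = np.random.default_rng() if rng is None else rng
    cx, cy, w, h = prev_box
    mean_size = (w + h) / 2.0
    # Translation noise is proportional to the target size.
    dx = rng.normal(0.0, trans_std * mean_size, n)
    dy = rng.normal(0.0, trans_std * mean_size, n)
    # Scale noise is applied multiplicatively.
    scale = scale_base ** rng.normal(0.0, scale_std, n)
    return np.stack([cx + dx, cy + dy, w * scale, h * scale], axis=1)

def track_step(prev_box, score_fn, rng=None):
    """Score every candidate and keep the one with the highest score."""
    candidates = sample_candidates(prev_box, rng=rng)
    scores = score_fn(candidates)
    return candidates[np.argmax(scores)]

# Toy stand-in for the classifier: higher score closer to a known center.
true_center = np.array([105.0, 98.0])

def toy_score(boxes):
    return -np.linalg.norm(boxes[:, :2] - true_center, axis=1)

rng = np.random.default_rng(0)
new_box = track_step((100.0, 100.0, 40.0, 40.0), toy_score, rng=rng)
```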


The architecture of Multi-Domain Network, which consists of shared layers and

K branches of domain-specific layers. Yellow and blue bounding boxes denote

the positive and negative samples in each domain, respectively.

Recurrently Target-Attending Tracking

Recurrently Target-attending Tracking (RTT) attempts to identify and exploit

those reliable parts which are beneficial for the overall tracking process.

To bypass occlusion and discover reliable components, multi-directional

Recurrent Neural Networks (RNNs) are employed in RTT to capture long-range

contextual cues by traversing a candidate spatial region from multiple directions.

The produced confidence maps from the RNNs are employed to adaptively

regularize the learning of discriminative correlation filters by suppressing cluttered

background noise while making full use of the information from reliable parts.

To solve the weighted correlation filters, derive an efficient closed form

solution with a sharp reduction in computation complexity.

The proposed RTT is more competitive than other correlation filter based methods.

Recurrently Target-Attending Tracking

RTT tracker. To identify and exploit those reliable components during tracking, a confidence map

is estimated by using multi-directional RNNs, and further used to regularize correlation filters.

Workflow of RNNs

⊙ - element-wise multiplication operation

Appendix A:

SoC Implementation of CNN

Large-Scale FPGA-based Convolutional Networks

a 2D grid of N_PT Processing Tiles (PTs) that contain:

a bank of processing operators. An operator can be anything from a FIFO to an arithmetic operator,

or even a combination of arithmetic operators;

The operators are connected to local data lines;

a routing multiplexer (MUX). The MUX connects the local data lines to global data lines or to

neighboring tiles.

a Smart Direct Memory Access module (Smart DMA), that interfaces off-chip memory and

provides asynchronous data transfers, with priority management;

a set of N_global global data lines used to connect PTs to the Smart DMA, N_global << N_PT;

a set of local data lines used to connect PTs with their 4 neighbors;

a Runtime Configuration Bus, used to reconfigure many aspects of the grid at runtime —

connections, operators, Smart DMA modes;

a controller that can reconfigure most of the computing grid and the Smart DMA at runtime.


A set of runtime configurable processing tiles are connected on a 2D grid. They can exchange data with their 4

neighbors and with an off-chip memory via global lines.


The grid is configured for a complex computation that involves several tiles: the 3 top tiles

perform a 3×3 convolution, the 3 intermediate tiles another 3 × 3 convolution, the bottom left

tile sums these two convolutions, and the bottom centre tile applies a function to the result.
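
The computation mapped onto the grid can be written out as a short dataflow sketch, assuming NumPy. The function names, the use of two separate input planes, and the ReLU used as the final function are assumptions made for illustration; the hardware streams data through tiles rather than calling functions.

```python
import numpy as np

def conv3x3(image, kernel):
    """'Valid' 3x3 sliding-window filter (cross-correlation, as in ConvNets)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * image[i:i + h - 2, j:j + w - 2]
    return out

def grid_pipeline(a, b, k1, k2, fn=np.tanh):
    top = conv3x3(a, k1)       # 3 top tiles: first 3x3 convolution
    middle = conv3x3(b, k2)    # 3 intermediate tiles: second 3x3 convolution
    summed = top + middle      # bottom left tile: sum of the two convolutions
    return fn(summed)          # bottom centre tile: function applied to the result

a = np.ones((5, 5))
b = np.ones((5, 5))
k = np.ones((3, 3))
result = grid_pipeline(a, b, k, k, fn=lambda x: np.maximum(x, 0.0))
```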


Overview of the ConvNet Processor system. A grid of multiple full-

custom Processing Tiles tailored to ConvNet operations, and a fast

streaming memory interface (Smart DMA).

Mobile Coprocessor NN-X for Deep NN

Startup: TeraDeep (from e-lab of Purdue

University);

Its nn-X ("Neural Network next") is described as "a vision system

based on programmable-logic with embedded

mobile processor";

Coprocessor, composed of multiple accelerators

and interfaces to ARM cores, can run on mobile

phones, embedded systems and mobile computers;

Implemented with FPGA in the programmable logic

section of an ARM-based Xilinx Zynq SoC;

Combination of cloud-based machine-learning

algorithms, deep learning neural network software

running on embedded devices and mobile apps;

Note: Qualcomm’s Zeroth processor, as a Neural

Processing Unit (NPU), will be released.

The memory router is a crossbar switch. A local

router in each collection is directly connected to

routers in neighboring collections, constructing a

1-d torus-like data stream network. The group of

operators in each collection contains

convolution, max-pooling and non-linear

operators.

NN-X is composed of a coprocessor, a host processor and external

memory. The coprocessor has three main components: processing

elements called collections, a system bus called memory router and a

configuration bus to control flow of data. Collections perform DNN

operations: data routing, convolutions, pooling, non-linear programmable

functions.

Memory Centric Accelerator for CNN

Bottleneck: limited amount of external

memory bandwidth for data transfer.

The effects of the memory bottleneck

can be reduced by a flexible memory

hierarchy that supports the complex

data access patterns in CNN workload;

The efficiency of the on-chip memories

is maximized by a scheduler that uses

tiling to optimize for data locality;

This design flow ensures that on-chip

memory size is minimized, which

reduces area and energy usage.

Increasing accelerator utilization with

more external memory bandwidth is

bad for energy.
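
The trade-off between on-chip buffer size and external memory traffic can be illustrated with a small tiling sketch, assuming NumPy. The function names and the word-counting model are hypothetical: each output tile copies the input window it needs into a local buffer (standing in for on-chip memory), and every copied word counts as one external memory read.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Reference 'valid' sliding-window filter."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + oh, j:j + ow]
    return out

def conv2d_tiled(image, kernel, tile=8):
    """Same result computed tile by tile; also returns words read from
    'external' memory under the simple model described above."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    words_loaded = 0
    for r in range(0, oh, tile):
        for c in range(0, ow, tile):
            th, tw = min(tile, oh - r), min(tile, ow - c)
            # Input window for this output tile, including the kernel halo.
            buf = image[r:r + th + kh - 1, c:c + tw + kw - 1].copy()
            words_loaded += buf.size
            out[r:r + th, c:c + tw] = conv2d_valid(buf, kernel)
    return out, words_loaded

rng = np.random.default_rng(0)
image = rng.standard_normal((18, 18))
kernel = rng.standard_normal((3, 3))
out_big, traffic_big = conv2d_tiled(image, kernel, tile=8)      # 10x10 buffer
out_small, traffic_small = conv2d_tiled(image, kernel, tile=4)  # 6x6 buffer
```

Smaller tiles need a smaller buffer but re-read more halo data, so a scheduler has to balance buffer size against external traffic.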

Memory Centric Accelerator for CNN

To instantiate the accelerator template with the

obtained schedules, an HW/SW integration flow is

used;

The top left part contains the scheduling design

space exploration, from which the optimal

schedules are used to select the template

parameters;

The parameter set and the hardware template are

manually converted into a HLS instantiation of the

accelerator;

In the left part of the design flow the selected

schedule is manually converted to control

software for the MicroBlaze host processor.

FPGA-based Accelerator Design for CNN

Matching between computation throughput and memory

bandwidth or logic resource;

Uniform loop unroll factors across different convolutional

layers;

Apply a roofline performance model to relate the system

performance to off-chip memory traffic and the peak

performance provided by the HW platform;

Loop tiling to fit a portion of the data on-chip, critical for data

reuse and parallelism;

Organization of PEs and buffer banks and interconnects

in between for data processing efficiency;

Optimization in computation and memory access:

Loop unrolling/pipeline;

Local memory promotion;

Loop transformations for data reuse;
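
The roofline relation used to compare design points can be sketched in a few lines of Python: attainable performance is the smaller of the compute roof and the product of the computation-to-communication (CTC) ratio and the memory bandwidth. The function name and all numbers below are hypothetical and chosen only to show the two regimes.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, ctc_ratio):
    """Roofline model: min(compute roof, CTC ratio x memory bandwidth).

    ctc_ratio is in floating-point operations per byte of off-chip traffic.
    """
    return min(peak_gflops, ctc_ratio * bandwidth_gbs)

# Hypothetical platform: 100 GFLOP/s peak, 4 GB/s off-chip bandwidth.
low_reuse = attainable_gflops(100.0, 4.0, ctc_ratio=10.0)   # memory bound
high_reuse = attainable_gflops(100.0, 4.0, ctc_ratio=50.0)  # compute bound
```

Loop tiling and data reuse raise the CTC ratio, which moves a design from the bandwidth-limited slope toward the compute roof.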

FPGA-based Accelerator Design for CNN

MSR’s FPGA of CNN for Large Scale DataCenter

Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V

FPGAs embedded into a half-rack of 48 machines;

One FPGA is placed into each server, accessible through PCIe, and wired

directly to other FPGAs with pairs of 10 Gb SAS cables.

A diagram of the 1U, half-width server that hosts the FPGA board

Components of the Shell Architecture

MSR’s FPGA of CNN for Large Scale DataCenter

Comparison of Image Classification Throughput and Power

[3] A. Putnam, et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, International Symposium on Computer Architecture, 2014.

[4] …

[5] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA'2015, 2015.

[6] http://caffe.berkeleyvision.org/performance_hardware.html

[7] …

[8] http://www.altera.com/literature/hb/arria-10/a10_overview.pdf

AuvizDNN

Auviz Middleware IP is highly optimized for Xilinx 7 Series FPGAs, available as C functions for

HLS users, OpenCL for embedded developers, or IP Integrator blocks for RTL users;

FPGAs have a compelling performance-per-power profile that is suitable for the data center;

AuvizDNN: an optimized library for building deep learning algorithms on the FPGA, with

CNN already implemented.

Vectorization of Deep CNN at Lenovo Research

Vectorization is fundamental for parallelism in deep CNN;

Caffe, Overfeat, CudaConvnet, Theano;

Vectorizing Convolution: matrix-vector multiplication;

Vectorizing Pooling: vector accumulation guided by a predefined index map;

Vectorizing Fully Connected Layers: matrix-matrix multiplication;

Vectorization for Mini-batches: SGD.
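
One common way to vectorize convolution is the im2col rearrangement, sketched below with NumPy. The function names are hypothetical and the sketch handles a single-channel image with stride 1 and no padding: every sliding window is unrolled into a column so that the whole filter bank is applied with one matrix product.

```python
import numpy as np

def im2col(image, kh, kw):
    """Unroll every kh x kw window of a 2D image into one column."""
    h, w = image.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    row = 0
    for i in range(kh):
        for j in range(kw):
            cols[row] = image[i:i + oh, j:j + ow].reshape(-1)
            row += 1
    return cols, (oh, ow)

def conv_as_matmul(image, kernels):
    """Apply a bank of filters, shape (n_filters, kh, kw), as one matrix product."""
    n, kh, kw = kernels.shape
    cols, (oh, ow) = im2col(image, kh, kw)
    out = kernels.reshape(n, kh * kw) @ cols
    return out.reshape(n, oh, ow)

image = np.arange(16, dtype=float).reshape(4, 4)
kernels = np.ones((1, 2, 2))
out = conv_as_matmul(image, kernels)
```

With a single filter the product is a vector times a matrix; stacking several filters as rows turns it into a matrix-matrix product that maps well onto optimized linear algebra routines.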

Strategy to vectorize convolution

Strategy to vectorize pooling
