Learning Feature Hierarchies for Object Recognition
Koray Kavukcuoglu
Computer Science Department, Courant Institute of Mathematical Sciences
New York University
Marc’Aurelio Ranzato, Kevin Jarrett, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Arthur Szlam,
Rob Fergus and Yann LeCun
Overview
• Feature Extractors
• Unsupervised Feature Learning
• Sparse Coding
• Learning Invariance
• Convolutional Sparse Coding
• Hierarchical Object Recognition
Object Recognition
• Feature Extraction
• Gabor, SIFT, HoG, Color, combinations ...
• Classification
• PMK-SVM, Linear, ...
• Grauman’05, Lazebnik’06, Serre’05, Mutch’06, ...
Object Recognition
• It would be better to learn everything
• adaptive to different domains
• Learn feature extractor and classifier together
Feature Extractor Classifier
Feature Extraction
• Can be based on unsupervised learning
• Should be efficient to extract features
• Overcomplete sparse representations are easily separable
Sparse Coding
• The dictionary D is given; search for the optimal code z
• Reconstruction + Sparsity:
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
(x: input, D: dictionary, z: code)
• A mapping f : x → z
• For every input x, inference takes too much time
• Mallat’93, Chen’98, Beck’09, Li’09
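The inference problem above can be solved with iterative shrinkage-thresholding (ISTA, in the spirit of Beck’09). A minimal numpy sketch, with a random toy dictionary (sizes and λ are illustrative, not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: proximal operator of t * |.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam, n_iter=200):
    """Minimize 0.5*||x - D z||^2 + lam*||z||_1 over the code z."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)           # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Toy example: 16-dim patch, 2x-overcomplete dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)             # unit-norm columns
x = 2.0 * D[:, 3] + 0.01 * rng.standard_normal(16)
z = ista(x, D, lam=0.1)
print(np.count_nonzero(z))                  # only a few coefficients survive
```

The soft-threshold step is what makes most coefficients exactly zero, which is also why inference requires many iterations per input — the motivation for a trained predictor later in the talk.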
Sparse Modeling
• Olshausen and Field’97, Aharon’06, Lee’07, Ranzato’07, Kavukcuoglu’08, Zeiler’10, ...
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
• Learn the dictionary D from data
• D has to be bounded to avoid trivial solutions
• Online or batch algorithms for updating the dictionary
• Learn a mapping $f_D : x \to z$
Sparse Modeling
• Per-sample energy:
$E(x, z, D) = \min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
• Loss:
$L(x, D) = \frac{1}{|X|} \sum_{x \in X} E(x, z, D)$
• For each sample:
1. Do inference: minimize E(x, z, D) w.r.t. z (sparse coding)
2. Update parameters keeping z fixed: $D \leftarrow D - \eta \frac{\partial E}{\partial D}$
3. Project the columns of D onto the unit sphere
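The three-step alternating procedure above can be sketched as follows (step sizes, iteration counts, and the ISTA-style inner solver are my choices, not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def infer(x, D, lam, n_iter=50):
    """Step 1: sparse coding -- minimize E(x, z, D) w.r.t. z (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - D.T @ (D @ z - x) / L, lam / L)
    return z

def train_dictionary(X, n_atoms, lam=0.1, eta=0.01, n_epochs=2, seed=0):
    """Online dictionary learning: infer, gradient step on D, project."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_epochs):
        for x in X:
            z = infer(x, D, lam)                  # 1. inference
            D -= eta * np.outer(D @ z - x, z)     # 2. gradient step, z fixed
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # 3. unit sphere
    return D
```

Step 3 (column normalization) is what keeps D bounded and rules out the trivial solution of scaling D up while scaling z down.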
Sparse Modeling
[Figure: evolution of the code coefficients over inference iterations, converging after ~40 iterations]
• The inference process suppresses all coefficients except a few
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
Sparse Modeling
• Problems
1. Inference takes a long time
➡ Train a predictor function
2. Sparse coding is unstable
➡ Complex cell model
3. Patch-based modeling produces redundant features
➡ Convolutional sparse modeling
Predictive Sparse Decomposition
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i| + \alpha \|z - F_e(x; K)\|_2^2$
For each sample from the data:
1. Fix K and D, minimize to get the optimal z
2. Using the optimal value of z, update D and K
3. Scale the elements of D to unit norm
Encoder forms:
$\tilde{z} = g \cdot \tanh(k^T x)$
$\tilde{z} = sh_\lambda(k^T x)$
$\tilde{z} = sh_\lambda(k^T x + S\, sh_\lambda(k^T x))$ (Learned ISTA, Gregor’10)
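The simplest encoder form above, $\tilde{z} = g \cdot \tanh(k^T x)$, can be trained by regressing precomputed optimal codes, which is step 2 of PSD with D held fixed. A minimal SGD sketch (learning rate, step count, and initialization scale are my assumptions):

```python
import numpy as np

def encoder(x, k, g):
    """PSD predictor F_e(x; K) = g * tanh(k^T x), one gain per code unit."""
    return g * np.tanh(k @ x)

def train_encoder(X, Z, n_steps=5000, lr=0.05, seed=0):
    """SGD on ||z - F_e(x)||^2 against precomputed sparse codes Z."""
    rng = np.random.default_rng(seed)
    n_in, n_code = X.shape[1], Z.shape[1]
    k = 0.1 * rng.standard_normal((n_code, n_in))
    g = np.ones(n_code)
    for step in range(n_steps):
        x, z = X[step % len(X)], Z[step % len(X)]
        pre = k @ x
        err = g * np.tanh(pre) - z              # prediction error
        dg = err * np.tanh(pre)                  # gradient w.r.t. gains
        dk = np.outer(err * g * (1 - np.tanh(pre) ** 2), x)  # w.r.t. filters
        g -= lr * dg
        k -= lr * dk
    return k, g
```

At test time only the cheap feed-forward pass `encoder(x, k, g)` is needed, which is the whole point: inference cost drops from an iterative optimization to a single matrix-vector product and nonlinearity.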
Predictive Sparse Decomposition
Encoder (k) Decoder (D)
• 12x12 image patches
• 256 dictionary elements
Predictive Sparse Decomposition
Encoder (k) Decoder (D)
• 28x28 MNIST digit images
• 200 dictionary elements
• Strokes for digit parts
Good Representation?
• Performance on MNIST using 28x28 filters
• Compare representations from different methods
• PSD : worse reconstruction than other models, but better recognition
• Ranzato’07, Kavukcuoglu’08
Recognition
• Filter bank + Non-linearity + Pooling
• 64 filters
• Pinto’08
• Non-linearity: rectification + local contrast normalization; Pooling: max / average
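The local contrast normalization stage can be sketched as subtractive then divisive normalization over a Gaussian neighborhood (in the style of Jarrett’09; the window size, σ, and the mean-based divisive floor are my assumptions):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, k):
    """Separable Gaussian blur (zero padding at the borders)."""
    tmp = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 0, tmp, k, mode='same')

def local_contrast_normalize(img, sigma=2.0, radius=4, eps=1e-6):
    """Subtract the local mean, then divide by the local std deviation."""
    k = gaussian_kernel(sigma, radius)
    v = img - blur(img, k)               # subtractive normalization
    local_std = np.sqrt(blur(v ** 2, k)) # local standard deviation
    c = np.mean(local_std)               # divisive floor: one common choice
    return v / np.maximum(local_std, max(c, eps))
```

The floor `c` prevents amplifying noise in low-contrast regions: only areas whose local deviation exceeds the image average get fully normalized.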
Recognition - C101
• Optimal (Feature Sign, Lee’07) vs. PSD features
• PSD features perform slightly better
• Naturally optimal point of sparsity
• After 64 features, not much gain
• PSD features are an order of magnitude faster
(In)Stability of Sparse Coding
• 16x16 input patch
• 1024 dictionary elements (4x overcomplete)
• 3-pixel shift
[Figure: sparse codes of the original vs. the 3-pixel-shifted patch; the active coefficients differ substantially]
Learning Invariance
• Group sparsity: idea proposed by Hyvärinen & Hoyer (2001) in the context of square ICA
• $w_j$: Gaussian weighting window
• Learning algorithm is the same as PSD
• A feed-forward regressor $F_e(x; K)$ followed by a pooling function produces invariant representations efficiently
• Ability to learn the necessary transformations from data
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_{i=1}^{K} \sqrt{\sum_{j \in P_i} w_j z_j^2} + \alpha \|z - F_e(x; K)\|_2^2$
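The group-sparsity term above is an L2 norm within each pool and an L1 sum across pools. A toy computation (the 1-D non-overlapping pool layout and flat window are simplifications; the slides use overlapping 2-D pools with a Gaussian window):

```python
import numpy as np

def group_sparsity(z, pools, w):
    """sum_i sqrt(sum_{j in P_i} w_j * z_j^2):
    L2 within each pool P_i (weighted by window w), L1 across pools."""
    return sum(np.sqrt(np.sum(w * z[P] ** 2)) for P in pools)

# Toy layout: four non-overlapping pools of 4 units over a 16-unit code
pools = [np.arange(i, i + 4) for i in range(0, 16, 4)]
w = np.ones(4)                    # flat window here; Gaussian in the slides
z_within = np.zeros(16); z_within[:4] = 0.5             # inside one pool
z_across = np.zeros(16); z_across[[0, 4, 8, 12]] = 0.5  # spread across pools
print(group_sparsity(z_within, pools, w))  # 1.0 -- cheaper
print(group_sparsity(z_across, pools, w))  # 2.0 -- penalized more
```

Equal-energy activity costs less when it stays inside one pool, which is exactly what drives the basis functions within a pool to become similar.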
Learning Invariance
• Sparsity across pools rather than units
• Drives the basis functions in a pool to be similar
• Overlapping pools ensure smooth representation manifolds
• Pool size = 1 ⇔ regular PSD
• Kavukcuoglu’09
[Figure: (a) Gaussian window $w_j$ over a pool $P_1$; (b) overlapping neighborhoods $P_i, \ldots, P_K$ on the map of z]
Topographic Maps
• Circular boundary conditions in both directions
• 6x6 pools with stride 2 in both dimensions
How invariant?
• Left: normalized MSE between representations of original and horizontally shifted 16x16 patches (0° rotation)
• Right: same, after a 25° rotation
• IPSD is more invariant
[Figure: normalized MSE vs. horizontal shift (0-16 px) for SIFT (non-rot.-inv.), SIFT, our alg. (non-inv.), and our alg. (inv.)]
Good for Recognition?

Caltech 101 (Accuracy)
Linear SVM   IPSD (24x24)                  50.9%
Linear SVM   SIFT (not rot. inv.) (24x24)  51.2%
Linear SVM   SIFT (rot. inv.) (24x24)      45.2%
PMK-SVM      IPSD (34x34)                  59.6%
PMK-SVM      IPSD (56x56)                  62.6%
PMK-SVM      IPSD (120x120)                65.6%

MNIST (Error Rate)
Linear SVM   IPSD (5x5)                  1.0%
Linear SVM   SIFT (not rot. inv.) (5x5)  1.5%
Multi-Stage Object Recognition
• Each stage contains a filter bank, non-linearity and pooling

           Filterbank  Tanh  Abs  LCN  Pooling
Conv Net   Learned     ✔     ✘    ✘    Average
HMAX       Gabor       ✘     ✘    ✘    Max

• Jarrett’09
Multi-Stage Object Recognition
• Building block of a multi-stage architecture: Filter Bank $F_e(x;K)$ → Non-linearities → Pooling
• Stage 1: x → z1 (unsupervised pre-training)
• Stage 2: z1 → z2 (unsupervised pre-training)
• Supervised refinement of the whole hierarchy
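A single stage of the pipeline (filter bank → tanh → abs rectification → average pooling) can be sketched in numpy; the filter count, sizes, and pooling window are hypothetical, and the LCN step is omitted for brevity:

```python
import numpy as np

def valid_conv2d(img, kernel):
    """2-D 'valid' cross-correlation, implemented directly."""
    kh, kw = kernel.shape
    H, W = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def stage(img, filters, pool=2):
    """One stage: filter bank -> tanh -> abs (Rabs) -> average pooling (Pa)."""
    maps = []
    for f in filters:
        m = np.abs(np.tanh(valid_conv2d(img, f)))
        H, W = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
        m = m[:H, :W].reshape(H // pool, pool, W // pool, pool).mean(axis=(1, 3))
        maps.append(m)
    return np.stack(maps)

rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 5, 5))   # 4 random 5x5 filters (toy values)
out = stage(rng.standard_normal((16, 16)), filters)
print(out.shape)  # (4, 6, 6)
```

Stacking two such stages, then training a classifier on the final maps, gives the two-stage architectures compared in the tables that follow.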
Multi-Stage Object Recognition

              R     U     R+    U+    RR    UU    R+R+  U+U+
Pa            12.1  13.4  14.3  14.5  8.8   9.1   29.7  32.0
N-Pa          15.3  17.2  16.3  18.5  19.6  23.1  31.0  34.0
N-Pm          32.1  44.0  38.0  44.3  37.6  56.0  60.0  61.0
Rabs-Pa       31.7  44.3  47.0  50.0  33.7  46.7  59.5  60.5
Rabs-N-Pa     53.3  52.2  54.8  54.2  62.9  63.7  64.7  65.5
C-Rabs-N-Pa   53.3  57.1  54.8  57.6  62.9  65.3  64.7  66.3

Legend: U = Unsupervised, R = Random, + = Supervised Fine Tuning; Pa = Average Pooling, Pm = Max Pooling, N = Local Contrast Normalization, Rabs = Absolute Value Rectification, C = Convolutional Unsupervised

• 2 stage > 1 stage
Multi-Stage Object Recognition
• (Same table as above) Abs > No Abs
Multi-Stage Object Recognition
• (Same table as above) LCN > No LCN
Multi-Stage Object Recognition
• (Same table as above) Even random filters work!
Optimal Stimuli
• Optimize the input to maximize the output of one unit after abs + LCN + average pooling
• Random feature extractors respond to oriented gratings too
[Figure: optimal stimuli for PSD vs. random filters]
Random Filter Performance
NORB Dataset:
1. 96x96 grayscale images
2. 5 classes (human, car, truck, airplane, animal)
3. Almost 5000 training samples per class
[Figure: error rate vs. number of training samples per class (20-4860), comparing FCSG-PA (R+R+) with FCSG-Rabs-N-PA under RR, UU, R+R+ and U+U+ training]
Redundancy in Feature Extraction
• Patch-based learning has to model the same structure at every location
• This produces highly redundant features
Convolutional PSD
• Convolutional training yields a more diverse set of features
• Filters → Convolve → Feature maps
$x \in \mathbb{R}^{w \times h}, \quad D \in \mathbb{R}^{K \times s \times s}, \quad z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
$\frac{1}{2}\|x - \sum_k D_k * z_k\|_2^2 + \lambda |z|_1 + \alpha \|z - F_e(x)\|_2^2$
• Kavukcuoglu’10
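The convolutional energy sums per-filter convolutions $D_k * z_k$; since each code map has size $(w-s+1) \times (h-s+1)$, a 'full' convolution with an $s \times s$ filter recovers a $w \times h$ image. A toy numpy sketch (filter and image sizes are illustrative, and the encoder term is omitted):

```python
import numpy as np

def conv2d_full(z, d):
    """2-D 'full' convolution: output size (zh + s - 1, zw + s - 1)."""
    s = d.shape[0]
    zp = np.pad(z, s - 1)
    df = d[::-1, ::-1]                      # flip kernel for true convolution
    H, W = zp.shape[0] - s + 1, zp.shape[1] - s + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(zp[i:i + s, j:j + s] * df)
    return out

def conv_energy(x, D, Z, lam):
    """0.5*||x - sum_k D_k * z_k||^2 + lam*|Z|_1."""
    recon = sum(conv2d_full(Z[k], D[k]) for k in range(len(D)))
    return 0.5 * np.sum((x - recon) ** 2) + lam * np.sum(np.abs(Z))

# Toy check: x is built exactly from a one-hot code, so only the
# sparsity term contributes to the energy.
rng = np.random.default_rng(0)
D = rng.standard_normal((2, 3, 3))          # K=2 filters, s=3
Z = np.zeros((2, 6, 6)); Z[0, 2, 2] = 1.0   # code maps of size (w-s+1)
x = sum(conv2d_full(Z[k], D[k]) for k in range(2))   # x is 8x8
print(conv_energy(x, D, Z, lam=0.1))  # 0.1
```

Because every code unit sees the whole image through convolution, the dictionary no longer needs shifted copies of the same filter, which is exactly the redundancy reduction measured on the next slide.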
Convolutional PSD
• Measuring the redundancy in the dictionary
• Cumulative histogram of the angle between all pairs of dictionary elements: $\arccos(\max(|D_i D_j^T|))$
[Figure: number of dictionary-element pairs with angle below a threshold (degrees, log count); convolutional training has far fewer near-parallel pairs than patch-based training]
Convolutional PSD
$x \in \mathbb{R}^{w \times h}, \quad D \in \mathbb{R}^{K \times s \times s}, \quad z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
$\frac{1}{2}\|x - \sum_k D_k * z_k\|_2^2 + \lambda |z|_1 + \alpha \|z - F_e(x)\|_2^2$
• Convolutional sparse coding models large images rather than small image patches
• Each inference iteration reduces redundancy in the feature representation
[Figure: input x, dictionary D, code z and reconstruction at iteration 1, iteration 2, and convergence]
Convolutional PSD - Better Encoders
• To predict convolutional sparse representations, simple encoders are inadequate
• A better encoder uses a shrinkage operator with a learned suppression matrix S to approximate sparse codes (Gregor’10):
$\tilde{z} = sh_\lambda(k^T x) \;\rightarrow\; \tilde{z} = sh_\lambda(k^T x + S\, sh_\lambda(k^T x))$
• Encoder training:
• 2nd-order information is important for fast convergence
• Smooth shrinkage is important for preserving derivatives:
$sh_{\beta,b}(s) = \mathrm{sign}(s)\left(\frac{1}{\beta}\log(\exp(\beta b) + \exp(\beta |s|) - 1) - b\right)$
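The smooth shrinkage function can be written directly from the formula above (the sign/absolute-value handling follows Gregor’10 and is my reconstruction of the garbled slide; β and b values are illustrative):

```python
import numpy as np

def smooth_shrink(s, beta=5.0, b=0.5):
    """Smooth shrinkage sh_{beta,b}(s):
    sign(s) * ((1/beta) * log(exp(beta*b) + exp(beta*|s|) - 1) - b).
    Passes through 0, is differentiable everywhere, and approaches hard
    soft-thresholding at b as beta grows. (Overflows for beta*|s| > ~700.)"""
    return np.sign(s) * (
        np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) / beta - b
    )

print(smooth_shrink(0.0))                       # 0.0 -- small inputs suppressed
print(round(smooth_shrink(5.0, beta=50.0), 3))  # 4.5 -- large inputs pass, shifted by b
```

Unlike the hard soft-threshold, this function has a nonzero derivative near the threshold, so gradient signals still flow through nearly suppressed units during encoder training.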
Convolutional Training
• Inference and training are an order of magnitude more costly
• Efficient inference algorithms are crucial (ISTA, FISTA, CD)
• 64 filters = 64-times-overcomplete representation
• Proper handling of border effects is important
• Test time is the same as for the patch-based model
Convolutional PSD
• Recognition performance on C101
• Low-level convolutional feature learning improves recognition

                Patch Based  Convolutional
1 Stage Unsup      52.2%        57.1%
1 Stage Unsup+     54.2%        57.6%
2 Stage Unsup      63.7%        65.5%
2 Stage Unsup+     65.3%        66.3%
Pedestrian Detection on INRIA
[Figure: miss rate vs. false positives per image. Methods: Shapelet-orig (90.5%), PoseInvSvm (68.6%), VJ-OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm-V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm-V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%)]
• Purely supervised training: 14.8% miss rate
• Unsupervised pre-training with Conv PSD + supervised refinement: 11.5%
• Close to the state of the art and improving quickly
Questions?