Learning Feature Hierarchies for Object Recognition
Koray Kavukcuoglu
Computer Science Department, Courant Institute of Mathematical Sciences
New York University
Marc’Aurelio Ranzato, Kevin Jarrett, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Arthur Szlam,
Rob Fergus and Yann LeCun
Overview
• Feature Extractors
• Unsupervised Feature Learning
• Sparse Coding
• Learning Invariance
• Convolutional Sparse Coding
• Hierarchical Object Recognition
Object Recognition
• Feature Extraction
• Gabor, SIFT, HoG, Color, combinations ...
• Classification
• PMK-SVM, Linear, ...
• Grauman’05, Lazebnik’06, Serre’05, Mutch’06, ...
Object Recognition
• It would be better to learn everything
• adaptive to different domains
• Learn feature extractor and classifier together
Feature Extractor Classifier
Feature Extraction
• Can be based on unsupervised learning
• Should be efficient to extract features
• Overcomplete sparse representations are easily separable
Sparse Coding
• The dictionary D is given; search for the optimal code z
• Reconstruction + Sparsity:
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
(x: input, D: dictionary, z: code)
• A mapping f : x → z
• For every input x, inference takes too much time
• Mallat’93, Chen’98, Beck’09, Li’09
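The inference problem above can be solved with iterative shrinkage-thresholding (ISTA, in the spirit of Beck’09). A minimal numpy sketch, with a random toy dictionary (sizes and λ are illustrative, not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: proximal operator of t * |.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam, n_iter=200):
    """Minimize 0.5*||x - D z||^2 + lam*||z||_1 over the code z."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)           # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Toy example: 16-dim patch, 2x-overcomplete dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)             # unit-norm columns
x = 2.0 * D[:, 3] + 0.01 * rng.standard_normal(16)
z = ista(x, D, lam=0.1)
print(np.count_nonzero(z))                  # only a few coefficients survive
```

The soft-threshold step is what makes most coefficients exactly zero, which is also why inference requires many iterations per input — the motivation for a trained predictor later in the talk.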
Sparse Modeling
• Olshausen and Field’97, Aharon’06, Lee’07, Ranzato’07, Kavukcuoglu’08, Zeiler’10, ...
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
• Learn the dictionary D from data
• D has to be bounded to avoid trivial solutions
• Online or batch algorithms for updating the dictionary
• Learn a mapping $f_D : x \to z$
Sparse Modeling
• Per-sample energy:
$E(x, z, D) = \min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
• Loss:
$L(x, D) = \frac{1}{|X|} \sum_{x \in X} E(x, z, D)$
• For each sample:
1. Do inference: minimize E(x, z, D) w.r.t. z (sparse coding)
2. Update parameters keeping z fixed: $D \leftarrow D - \eta \frac{\partial E}{\partial D}$
3. Project the columns of D onto the unit sphere
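The three-step alternating procedure above can be sketched as follows (step sizes, iteration counts, and the ISTA-style inner solver are my choices, not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def infer(x, D, lam, n_iter=50):
    """Step 1: sparse coding -- minimize E(x, z, D) w.r.t. z (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - D.T @ (D @ z - x) / L, lam / L)
    return z

def train_dictionary(X, n_atoms, lam=0.1, eta=0.01, n_epochs=2, seed=0):
    """Online dictionary learning: infer, gradient step on D, project."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_epochs):
        for x in X:
            z = infer(x, D, lam)                  # 1. inference
            D -= eta * np.outer(D @ z - x, z)     # 2. gradient step, z fixed
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # 3. unit sphere
    return D
```

Step 3 (column normalization) is what keeps D bounded and rules out the trivial solution of scaling D up while scaling z down.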
Sparse Modeling
[Figure: evolution of the code coefficients over inference iterations, converging after ~40 iterations]
• The inference process suppresses all coefficients except a few
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
Sparse Modeling
• Problems
1. Inference takes a long time
➡ Train a predictor function
2. Sparse coding is unstable
➡ Complex cell model
3. Patch-based modeling produces redundant features
➡ Convolutional sparse modeling
Predictive Sparse Decomposition
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i| + \alpha \|z - F_e(x; K)\|_2^2$
For each sample from the data:
1. Fix K and D, minimize to get the optimal z
2. Using the optimal value of z, update D and K
3. Scale the elements of D to unit norm
Encoder forms:
$\tilde{z} = g \cdot \tanh(k^T x)$
$\tilde{z} = sh_\lambda(k^T x)$
$\tilde{z} = sh_\lambda(k^T x + S\, sh_\lambda(k^T x))$ (Learned ISTA, Gregor’10)
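The simplest encoder form above, $\tilde{z} = g \cdot \tanh(k^T x)$, can be trained by regressing precomputed optimal codes, which is step 2 of PSD with D held fixed. A minimal SGD sketch (learning rate, step count, and initialization scale are my assumptions):

```python
import numpy as np

def encoder(x, k, g):
    """PSD predictor F_e(x; K) = g * tanh(k^T x), one gain per code unit."""
    return g * np.tanh(k @ x)

def train_encoder(X, Z, n_steps=5000, lr=0.05, seed=0):
    """SGD on ||z - F_e(x)||^2 against precomputed sparse codes Z."""
    rng = np.random.default_rng(seed)
    n_in, n_code = X.shape[1], Z.shape[1]
    k = 0.1 * rng.standard_normal((n_code, n_in))
    g = np.ones(n_code)
    for step in range(n_steps):
        x, z = X[step % len(X)], Z[step % len(X)]
        pre = k @ x
        err = g * np.tanh(pre) - z              # prediction error
        dg = err * np.tanh(pre)                  # gradient w.r.t. gains
        dk = np.outer(err * g * (1 - np.tanh(pre) ** 2), x)  # w.r.t. filters
        g -= lr * dg
        k -= lr * dk
    return k, g
```

At test time only the cheap feed-forward pass `encoder(x, k, g)` is needed, which is the whole point: inference cost drops from an iterative optimization to a single matrix-vector product and nonlinearity.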
Predictive Sparse Decomposition
Encoder (k) Decoder (D)
• 12x12 image patches
• 256 dictionary elements
Predictive Sparse Decomposition
Encoder (k) Decoder (D)
• 28x28 MNIST digit images
• 200 dictionary elements
• Strokes for digit parts
Good Representation?
• Performance on MNIST using 28x28 filters
• Compare representations from different methods
• PSD : worse reconstruction than other models, but better recognition
• Ranzato’07, Kavukcuoglu’08
Recognition
• Filter bank + Non-linearity + Pooling
• 64 filters
• Pinto’08
• Non-linearity: rectification + local contrast normalization; Pooling: max / average
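The local contrast normalization stage can be sketched as subtractive then divisive normalization over a Gaussian neighborhood (in the style of Jarrett’09; the window size, σ, and the mean-based divisive floor are my assumptions):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, k):
    """Separable Gaussian blur (zero padding at the borders)."""
    tmp = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 0, tmp, k, mode='same')

def local_contrast_normalize(img, sigma=2.0, radius=4, eps=1e-6):
    """Subtract the local mean, then divide by the local std deviation."""
    k = gaussian_kernel(sigma, radius)
    v = img - blur(img, k)               # subtractive normalization
    local_std = np.sqrt(blur(v ** 2, k)) # local standard deviation
    c = np.mean(local_std)               # divisive floor: one common choice
    return v / np.maximum(local_std, max(c, eps))
```

The floor `c` prevents amplifying noise in low-contrast regions: only areas whose local deviation exceeds the image average get fully normalized.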
Recognition - C101
• Optimal (Feature Sign, Lee’07) vs. PSD features
• PSD features perform slightly better
• Naturally optimal point of sparsity
• After 64 features, not much gain
• PSD features are an order of magnitude faster
(In)Stability of Sparse Coding
• 16x16 input patch
• 1024 dictionary elements (4x overcomplete)
• 3-pixel shift
[Figure: sparse codes of the original vs. the 3-pixel-shifted patch; the active coefficients differ substantially]
Learning Invariance
• Group sparsity: idea proposed by Hyvärinen & Hoyer (2001) in the context of square ICA
• $w_j$: Gaussian weighting window
• Learning algorithm is the same as PSD
• A feed-forward regressor $F_e(x; K)$ followed by a pooling function produces invariant representations efficiently
• Ability to learn the necessary transformations from data
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_{i=1}^{K} \sqrt{\sum_{j \in P_i} w_j z_j^2} + \alpha \|z - F_e(x; K)\|_2^2$
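The group-sparsity term above is an L2 norm within each pool and an L1 sum across pools. A toy computation (the 1-D non-overlapping pool layout and flat window are simplifications; the slides use overlapping 2-D pools with a Gaussian window):

```python
import numpy as np

def group_sparsity(z, pools, w):
    """sum_i sqrt(sum_{j in P_i} w_j * z_j^2):
    L2 within each pool P_i (weighted by window w), L1 across pools."""
    return sum(np.sqrt(np.sum(w * z[P] ** 2)) for P in pools)

# Toy layout: four non-overlapping pools of 4 units over a 16-unit code
pools = [np.arange(i, i + 4) for i in range(0, 16, 4)]
w = np.ones(4)                    # flat window here; Gaussian in the slides
z_within = np.zeros(16); z_within[:4] = 0.5             # inside one pool
z_across = np.zeros(16); z_across[[0, 4, 8, 12]] = 0.5  # spread across pools
print(group_sparsity(z_within, pools, w))  # 1.0 -- cheaper
print(group_sparsity(z_across, pools, w))  # 2.0 -- penalized more
```

Equal-energy activity costs less when it stays inside one pool, which is exactly what drives the basis functions within a pool to become similar.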
Learning Invariance
• Sparsity across pools rather than units
• Drives the basis functions in a pool to be similar
• Overlapping pools ensure smooth representation manifolds
• Pool size = 1 ⇔ regular PSD
• Kavukcuoglu’09
[Figure: (a) Gaussian window $w_j$ over a pool $P_1$; (b) overlapping neighborhoods $P_i, \ldots, P_K$ on the map of z]
Topographic Maps
• Circular boundary conditions in both directions
• 6x6 pools with stride 2 in both dimensions
How invariant?
• Left: normalized MSE between representations of original and horizontally shifted 16x16 patches (0° rotation)
• Right: same, after a 25° rotation
• IPSD is more invariant
[Figure: normalized MSE vs. horizontal shift (0-16 px) for SIFT (non-rot.-inv.), SIFT, our alg. (non-inv.), and our alg. (inv.)]
Good for Recognition?

Caltech 101 (Accuracy)
Linear SVM   IPSD (24x24)                  50.9%
Linear SVM   SIFT (not rot. inv.) (24x24)  51.2%
Linear SVM   SIFT (rot. inv.) (24x24)      45.2%
PMK-SVM      IPSD (34x34)                  59.6%
PMK-SVM      IPSD (56x56)                  62.6%
PMK-SVM      IPSD (120x120)                65.6%

MNIST (Error Rate)
Linear SVM   IPSD (5x5)                  1.0%
Linear SVM   SIFT (not rot. inv.) (5x5)  1.5%
Multi-Stage Object Recognition
• Each stage contains a filter bank, non-linearity and pooling

           Filterbank  Tanh  Abs  LCN  Pooling
Conv Net   Learned     ✔     ✘    ✘    Average
HMAX       Gabor       ✘     ✘    ✘    Max

• Jarrett’09
Multi-Stage Object Recognition
• Building block of a multi-stage architecture: Filter Bank $F_e(x;K)$ → Non-linearities → Pooling
• Stage 1: x → z1 (unsupervised pre-training)
• Stage 2: z1 → z2 (unsupervised pre-training)
• Supervised refinement of the whole hierarchy
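A single stage of the pipeline (filter bank → tanh → abs rectification → average pooling) can be sketched in numpy; the filter count, sizes, and pooling window are hypothetical, and the LCN step is omitted for brevity:

```python
import numpy as np

def valid_conv2d(img, kernel):
    """2-D 'valid' cross-correlation, implemented directly."""
    kh, kw = kernel.shape
    H, W = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def stage(img, filters, pool=2):
    """One stage: filter bank -> tanh -> abs (Rabs) -> average pooling (Pa)."""
    maps = []
    for f in filters:
        m = np.abs(np.tanh(valid_conv2d(img, f)))
        H, W = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
        m = m[:H, :W].reshape(H // pool, pool, W // pool, pool).mean(axis=(1, 3))
        maps.append(m)
    return np.stack(maps)

rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 5, 5))   # 4 random 5x5 filters (toy values)
out = stage(rng.standard_normal((16, 16)), filters)
print(out.shape)  # (4, 6, 6)
```

Stacking two such stages, then training a classifier on the final maps, gives the two-stage architectures compared in the tables that follow.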
Multi-Stage Object Recognition

              R     U     R+    U+    RR    UU    R+R+  U+U+
Pa            12.1  13.4  14.3  14.5  8.8   9.1   29.7  32.0
N-Pa          15.3  17.2  16.3  18.5  19.6  23.1  31.0  34.0
N-Pm          32.1  44.0  38.0  44.3  37.6  56.0  60.0  61.0
Rabs-Pa       31.7  44.3  47.0  50.0  33.7  46.7  59.5  60.5
Rabs-N-Pa     53.3  52.2  54.8  54.2  62.9  63.7  64.7  65.5
C-Rabs-N-Pa   53.3  57.1  54.8  57.6  62.9  65.3  64.7  66.3

Legend: U = Unsupervised, R = Random, + = Supervised Fine Tuning; Pa = Average Pooling, Pm = Max Pooling, N = Local Contrast Normalization, Rabs = Absolute Value Rectification, C = Convolutional Unsupervised

• 2 stage > 1 stage
Multi-Stage Object Recognition
• (Same table as above) Abs > No Abs
Multi-Stage Object Recognition
• (Same table as above) LCN > No LCN
Multi-Stage Object Recognition
• (Same table as above) Even random filters work!
Optimal Stimuli
• Optimize the input to maximize the output of one unit after abs + LCN + average pooling
• Random feature extractors respond to oriented gratings too
[Figure: optimal stimuli for PSD vs. random filters]
Random Filter Performance
NORB Dataset:
1. 96x96 grayscale images
2. 5 classes (human, car, truck, airplane, animal)
3. Almost 5000 training samples per class
[Figure: error rate vs. number of training samples per class (20-4860), comparing FCSG-PA (R+R+) with FCSG-Rabs-N-PA under RR, UU, R+R+ and U+U+ training]
Redundancy in Feature Extraction
• Patch-based learning has to model the same structure at every location
• This produces highly redundant features
Convolutional PSD
• Convolutional training yields a more diverse set of features
• Filters → Convolve → Feature maps
$x \in \mathbb{R}^{w \times h}, \quad D \in \mathbb{R}^{K \times s \times s}, \quad z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
$\frac{1}{2}\|x - \sum_k D_k * z_k\|_2^2 + \lambda |z|_1 + \alpha \|z - F_e(x)\|_2^2$
• Kavukcuoglu’10
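The convolutional energy sums per-filter convolutions $D_k * z_k$; since each code map has size $(w-s+1) \times (h-s+1)$, a 'full' convolution with an $s \times s$ filter recovers a $w \times h$ image. A toy numpy sketch (filter and image sizes are illustrative, and the encoder term is omitted):

```python
import numpy as np

def conv2d_full(z, d):
    """2-D 'full' convolution: output size (zh + s - 1, zw + s - 1)."""
    s = d.shape[0]
    zp = np.pad(z, s - 1)
    df = d[::-1, ::-1]                      # flip kernel for true convolution
    H, W = zp.shape[0] - s + 1, zp.shape[1] - s + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(zp[i:i + s, j:j + s] * df)
    return out

def conv_energy(x, D, Z, lam):
    """0.5*||x - sum_k D_k * z_k||^2 + lam*|Z|_1."""
    recon = sum(conv2d_full(Z[k], D[k]) for k in range(len(D)))
    return 0.5 * np.sum((x - recon) ** 2) + lam * np.sum(np.abs(Z))

# Toy check: x is built exactly from a one-hot code, so only the
# sparsity term contributes to the energy.
rng = np.random.default_rng(0)
D = rng.standard_normal((2, 3, 3))          # K=2 filters, s=3
Z = np.zeros((2, 6, 6)); Z[0, 2, 2] = 1.0   # code maps of size (w-s+1)
x = sum(conv2d_full(Z[k], D[k]) for k in range(2))   # x is 8x8
print(conv_energy(x, D, Z, lam=0.1))  # 0.1
```

Because every code unit sees the whole image through convolution, the dictionary no longer needs shifted copies of the same filter, which is exactly the redundancy reduction measured on the next slide.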
Convolutional PSD
• Measuring the redundancy in the dictionary
• Cumulative histogram of the angle between all pairs of dictionary elements: $\arccos(\max(|D_i D_j^T|))$
[Figure: number of dictionary-element pairs with angle below a threshold (degrees, log count); convolutional training has far fewer near-parallel pairs than patch-based training]
Convolutional PSD
$x \in \mathbb{R}^{w \times h}, \quad D \in \mathbb{R}^{K \times s \times s}, \quad z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
$\frac{1}{2}\|x - \sum_k D_k * z_k\|_2^2 + \lambda |z|_1 + \alpha \|z - F_e(x)\|_2^2$
• Convolutional sparse coding models large images rather than small image patches
• Each inference iteration reduces redundancy in the feature representation
[Figure: input x, dictionary D, code z and reconstruction at iteration 1, iteration 2, and convergence]
Convolutional PSD - Better Encoders
• To predict convolutional sparse representations, simple encoders are inadequate
• A better encoder uses a shrinkage operator with a learned suppression matrix S to approximate sparse codes (Gregor’10):
$\tilde{z} = sh_\lambda(k^T x) \;\rightarrow\; \tilde{z} = sh_\lambda(k^T x + S\, sh_\lambda(k^T x))$
• Encoder training:
• 2nd-order information is important for fast convergence
• Smooth shrinkage is important for preserving derivatives:
$sh_{\beta,b}(s) = \mathrm{sign}(s)\left(\frac{1}{\beta}\log(\exp(\beta b) + \exp(\beta |s|) - 1) - b\right)$
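The smooth shrinkage function can be written directly from the formula above (the sign/absolute-value handling follows Gregor’10 and is my reconstruction of the garbled slide; β and b values are illustrative):

```python
import numpy as np

def smooth_shrink(s, beta=5.0, b=0.5):
    """Smooth shrinkage sh_{beta,b}(s):
    sign(s) * ((1/beta) * log(exp(beta*b) + exp(beta*|s|) - 1) - b).
    Passes through 0, is differentiable everywhere, and approaches hard
    soft-thresholding at b as beta grows. (Overflows for beta*|s| > ~700.)"""
    return np.sign(s) * (
        np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) / beta - b
    )

print(smooth_shrink(0.0))                       # 0.0 -- small inputs suppressed
print(round(smooth_shrink(5.0, beta=50.0), 3))  # 4.5 -- large inputs pass, shifted by b
```

Unlike the hard soft-threshold, this function has a nonzero derivative near the threshold, so gradient signals still flow through nearly suppressed units during encoder training.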
Convolutional Training
• Inference and training are an order of magnitude more costly
• Efficient inference algorithms are crucial (ISTA, FISTA, CD)
• 64 filters = 64-times-overcomplete representation
• Proper handling of border effects is important
• Test time is the same as for the patch-based model
Convolutional PSD
• Recognition performance on C101
• Low-level convolutional feature learning improves recognition

                Patch Based  Convolutional
1 Stage Unsup      52.2%        57.1%
1 Stage Unsup+     54.2%        57.6%
2 Stage Unsup      63.7%        65.5%
2 Stage Unsup+     65.3%        66.3%
Pedestrian Detection on INRIA
[Figure: miss rate vs. false positives per image. Methods: Shapelet-orig (90.5%), PoseInvSvm (68.6%), VJ-OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm-V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm-V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%)]
• Purely supervised training: 14.8% miss rate
• Unsupervised pre-training with Conv PSD + supervised refinement: 11.5%
• Close to the state of the art and improving quickly
Questions?