A Bayesian Approach to Recognition

Moshe Blank, Ita Lifshitz

Reverend Thomas Bayes, 1702-1761

Agenda

Bayesian decision theory
  Maximum Likelihood
  Bayesian Estimation

Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning

Bayesian Decision Theory

We are given a training set T of samples of class c.

Given a query image x, we want to know the probability that it belongs to the class, p(x|T).

We know that the class has some fixed distribution with unknown parameters θ, that is, the form of p(x|θ) is known.

Bayes' rule tells us:

p(x|T) = ∫p(x,θ|T)dθ = ∫p(x|θ)p(θ|T)dθ

What can we do about p(θ|T)?

Maximum Likelihood Estimation

What can we do about p(θ|T)?

Choose the parameter value θML that makes the training data most probable:

θML = arg max_θ P(T|θ)

p(θ|T) = δ(θ - θML)

∫p(x|θ)p(θ|T)dθ = p(x|θML)

ML Illustration

Assume that the points of T are drawn from some normal distribution with known variance and unknown mean
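A minimal sketch of this illustration, with made-up sample values and variance: for a normal distribution with known variance, the likelihood of T is maximized by the sample mean, so θML is simply the average of the training points. The function names are illustrative only.

```python
import math

def log_likelihood(mean, samples, variance):
    """Log of P(T | mean) for i.i.d. normal samples with known variance."""
    n = len(samples)
    sq_err = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - sq_err / (2 * variance)

def ml_mean(samples):
    """Closed-form ML estimate of the mean: the sample average."""
    return sum(samples) / len(samples)

# Hypothetical training set T and known variance (illustrative values only).
T = [2.1, 1.9, 2.4, 2.0, 1.6]
variance = 0.25

theta_ml = ml_mean(T)

# Every other candidate mean should give a lower likelihood than theta_ml.
candidates = [theta_ml - 0.5, theta_ml - 0.1, theta_ml + 0.1, theta_ml + 0.5]
is_best = all(
    log_likelihood(theta_ml, T, variance) > log_likelihood(c, T, variance)
    for c in candidates
)
print(theta_ml, is_best)
```

The known variance only rescales the log-likelihood here; it does not move the location of the maximum.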

Bayesian Estimation

The Bayesian Estimation approach considers θ as a random variable.

Before we observe the training data, the parameters are described by a prior p(θ) which is typically very broad.

Once the data is observed, we can make use of Bayes’ formula to find the posterior p(θ|T). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior.

Bayesian Estimation

Unlike ML, Bayesian estimation does not choose a specific value for θ, but instead performs a weighted average over all possible values of θ.
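To make the weighted average concrete, here is a hedged sketch for the simplest case: a normal likelihood with known variance and a normal prior on the mean (a conjugate choice assumed for this example, not taken from the slides). In that case the integral ∫p(x|θ)p(θ|T)dθ has a closed form, a normal whose variance is the data variance plus the posterior variance of θ. All numbers are illustrative.

```python
import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def posterior_over_mean(samples, variance, prior_mean, prior_variance):
    """Posterior p(theta | T) for a normal prior and known data variance."""
    n = len(samples)
    post_variance = 1.0 / (1.0 / prior_variance + n / variance)
    post_mean = post_variance * (prior_mean / prior_variance + sum(samples) / variance)
    return post_mean, post_variance

def bayes_predictive(x, samples, variance, prior_mean, prior_variance):
    """p(x | T): average of p(x | theta) weighted by the posterior p(theta | T)."""
    post_mean, post_variance = posterior_over_mean(samples, variance, prior_mean, prior_variance)
    return normal_pdf(x, post_mean, variance + post_variance)

def ml_predictive(x, samples, variance):
    """p(x | theta_ML): plug in the single best parameter value."""
    return normal_pdf(x, sum(samples) / len(samples), variance)

# Hypothetical tiny training set and a broad prior (illustrative values only).
T = [2.1, 1.9]
variance = 0.25
prior_mean, prior_variance = 0.0, 100.0

post_mean, post_variance = posterior_over_mean(T, variance, prior_mean, prior_variance)
print(post_mean, post_variance)
print(ml_predictive(4.0, T, variance), bayes_predictive(4.0, T, variance, prior_mean, prior_variance))
```

With only two samples the Bayesian predictive is wider than the ML plug-in, because it keeps the remaining uncertainty about θ instead of discarding it.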

Why is it more accurate than ML?

Maximal Likelihood vs Bayesian

ML and Bayesian estimations are asymptotically equivalent and “consistent”.

ML is typically computationally easier.

ML is often easier to interpret: it returns the single best model (parameter) whereas Bayesian gives a weighted average of models.

But for finite training data (and given a reliable prior) Bayesian is more accurate (uses more of the information).

Bayesian with “flat” prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

Agenda

Bayesian decision theory
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning

Objective

Given an image, decide whether or not it contains an object of a specific class.

Main Issues

Representation
Learning
Recognition

Approaches to Recognition

Photometric properties – filter subspaces, neural networks, principal component analysis…

Geometric constraints between low level object features – alignment, geometric invariance, geometric hashing…

Object Model


Fischler & Elschlager, 1973
Yuille, ‘91
Brunelli & Poggio, ‘93
Lades, v.d. Malsburg et al. ‘93
Cootes, Lanitis, Taylor et al. ‘95
Amit & Geman, ‘95, ‘99
Perona et al. ‘95, ‘96, ‘98, ‘00, ‘02

Model: constellation of Parts


Perona’s Approach

Objects are represented as a probabilistic constellation of rigid parts (features).

The variability within a class is represented by a joint probability density function on the shape of the constellation and the appearance of the parts.

Agenda

Bayesian decision theory
Recognition
  Simple probabilistic model
    Model parameterization
    Feature Selection
    Learning
  Mixture model
  More advanced probabilistic model
  “One-Shot” Learning

Weber, Welling, Perona - 2000

Unsupervised Learning of Models for Recognition

Towards Automatic Discovery of Object Categories

Unsupervised Learning

Learn to recognize an object class given a set of class and background pictures, without preprocessing (labeling, segmentation, alignment).

Model Description

Each object is constructed of F parts, each of a certain type.

Relations between the part locations define the shape of the object.

Image Model

Image is transformed into a collection of parts

Objects are modeled as sub collections

Model Parameterization

Given an image we detect potential object parts, to obtain the following observable:

Hypothesis

When presented with an un-segmented and unlabeled image, we do not know which parts correspond to the foreground.

Assuming the image contains the object, we use a vector of indices h to indicate which of the observables correspond to a foreground point (i.e. a real part of the object).

We call h a hypothesis since it is a guess about the structure of the object. h = (h1, …, hT) is not observable.

Additional Hidden Variables

We denote by xm the locations of the unobserved object parts.

b = sign(h) – binary vector indicating which parts were detected

n = number of background parts detected of each type

Probabilistic Model

We can now define a generative probabilistic model for the object class using the probability density function:

Model Details

Since n, b are determined by Xo, h, we have:

By Bayes' rule:

Model Details

Full table of joint probabilities (for small F) or F independent detection rate probabilities for large F

Model Details

Poisson distribution with parameter Mt for the number of background detections of features of type t
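As a small sketch of this term, assuming the background counts of different feature types are independent (an assumption made for this example), the probability of a count vector n is a product of Poisson probabilities with means Mt. The function names and numbers are illustrative.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the expected count is `mean`."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

def background_count_probability(counts, means):
    """p(n) for a vector of background counts, one entry per feature type."""
    p = 1.0
    for n_t, m_t in zip(counts, means):
        p *= poisson_pmf(n_t, m_t)
    return p

# Hypothetical model with two feature types (illustrative values only).
M = [2.0, 0.5]   # average number of background detections per type
n = [0, 1]       # background detections counted in one image

print(background_count_probability(n, M))
```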

Model Details

Uniform probability over all hypotheses consistent with n and b

Model Details

The shape term separates the coordinates of all foreground detections from the coordinates of all background detections.

Sample object classes

Invariance to Translation, Rotation and Scale

There is no use in modeling the shape of the object in terms of absolute pixel positions of the features.

We apply a transformation on the features’ coordinates to make the shape invariant to translation, rotation and scale.

But the feature detector must be invariant to the transformations as well!
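One common way to obtain such invariance, sketched here as an assumption rather than as the exact transformation used in the talk, is to express every feature location relative to two reference features: the first is moved to (0, 0) and the second to (1, 0). Treating points as complex numbers keeps the code short. The coordinates below are made up.

```python
import cmath
import math

def to_shape_space(points):
    """Map points so that points[0] lands on (0, 0) and points[1] on (1, 0)."""
    z = [complex(x, y) for x, y in points]
    origin = z[0]
    unit = z[1] - z[0]  # must be non-zero: the two reference points differ
    normalized = [(p - origin) / unit for p in z]
    return [(w.real, w.imag) for w in normalized]

def similarity_transform(points, scale, angle, shift):
    """Rotate by `angle` (radians), scale, then translate by `shift`."""
    a = scale * cmath.exp(1j * angle)
    b = complex(*shift)
    moved = [a * complex(x, y) + b for x, y in points]
    return [(w.real, w.imag) for w in moved]

# Hypothetical feature locations in pixels (illustrative values only).
features = [(10.0, 20.0), (30.0, 20.0), (20.0, 40.0), (25.0, 5.0)]
moved = similarity_transform(features, scale=2.5, angle=math.radians(30), shift=(100.0, -40.0))

shape_a = to_shape_space(features)
shape_b = to_shape_space(moved)
print(shape_a)
print(shape_b)
```

Both calls print the same shape coordinates up to rounding, even though the second constellation was rotated, scaled and shifted.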

Automatic Part Selection

Find points of interest in all training images

Apply Vector Quantization and clustering to get 100 total candidate patterns.
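A rough sketch of the clustering step, using plain k-means as a stand-in for the vector quantization (the slides do not say which variant was used). Each vector would be a flattened image patch around a point of interest; the tiny 2-D vectors and the naive initialization below are purely illustrative.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(vector, centers):
    """Index of the center closest to `vector`."""
    return min(range(len(centers)), key=lambda j: squared_distance(vector, centers[j]))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns k cluster centers (the candidate patterns)."""
    centers = [list(v) for v in vectors[:k]]  # naive init: first k vectors
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centers)].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Hypothetical "patches", flattened to 2 numbers each (illustrative values only).
patches = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
centers = sorted(kmeans(patches, k=2))
print(centers)
```

In the setting of the slides, k would be 100 and each center would be one candidate part pattern.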


Points of interest patterns

Method Scheme

Part Selection → Model Learning → Test


Find subset of candidate parts of (small) size F to be used in the model that gives the best performance in the learning phase.

[Figure values: 57%, 87%, 51%]

Learning

Goal: Find θ = {μ, Σ, p(b), M} which best explains the observed (training) data

μ, Σ – expectation and covariance parameters of the joint Gaussian modeling the shape of the foreground

b – random variable denoting whether each of the parts of the model is detected or not

M – average number of background detections for each of the parts

Learning

Goal: Find θ = {μ, Σ, p(b), M} which best explains the observed (training) data,

i.e. maximize the likelihood

arg max_θ p(Xo | θ)

Done using the EM method

Expectation Maximization (EM)

EM is an iterative optimization method to estimate some unknown parameters θ, given measurement data, but not given some “hidden” variables J.

We want to maximize the posterior probability of the parameters θ given the data U, marginalizing over J:

θ* = arg max_θ Σ_J P(θ, J | U)


Choose an initial parameter guess θ0.

E-Step: estimate the unobserved (hidden) data using the observed data and the current guess of the parameters θk.

M-Step: compute the maximum likelihood estimate of the parameters θk+1 using the estimated data.

Alternate between estimating the unknowns θ and the hidden variables J.

EM algorithm converges to a local maximum
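The E-step / M-step alternation can be sketched on a much simpler problem than the constellation model: a mixture of two 1-D Gaussians with a fixed, known variance, where the hidden variable J is the component each point came from. This toy setup, the data and the starting values are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_gaussians(data, mean_a, mean_b, variance=1.0, iterations=50):
    """EM for a two-component 1-D Gaussian mixture with fixed variance."""
    weight_a = 0.5
    for _ in range(iterations):
        # E-step: probability that each point belongs to component a,
        # computed with the current parameter guess.
        resp_a = []
        for x in data:
            pa = weight_a * normal_pdf(x, mean_a, variance)
            pb = (1.0 - weight_a) * normal_pdf(x, mean_b, variance)
            resp_a.append(pa / (pa + pb))
        # M-step: re-estimate the parameters from the softly assigned data.
        total_a = sum(resp_a)
        total_b = len(data) - total_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mean_b = sum((1.0 - r) * x for r, x in zip(resp_a, data)) / total_b
        weight_a = total_a / len(data)
    return mean_a, mean_b, weight_a

# Hypothetical observations from two groups (illustrative values only).
data = [-2.2, -1.9, -2.0, -1.8, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
mean_a, mean_b, weight_a = em_two_gaussians(data, mean_a=-1.0, mean_b=1.0)
print(mean_a, mean_b, weight_a)
```

A different starting guess can lead to a different answer, which is the local-maximum behavior noted above.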

Method Scheme

Part Selection → Model Learning → Test

Recognition

Using the maximum a posteriori approach we consider the ratio

R =

where h0 is the null hypothesis – which explains all parts as background noise.

We accept the image as belonging to the class if R is above a certain threshold.

Database

Two classes – faces and cars
100 training images for each class
100 test images for each class
Images vary in scale, location of the object, lighting conditions
Images have cluttered background
No manual preprocessing

Learning Results

Model Performance

Average training and testing errors measured as 1-Area(ROC)

Suggests a 4-part model for faces and a 5-part model for cars as optimal.
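For reference, the 1-Area(ROC) error can be computed directly from detector scores. The sketch below assumes each test image yields one score (for instance the ratio R) and uses the pairwise-comparison form of the ROC area; the scores are invented for the example.

```python
def roc_area(positive_scores, negative_scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive image scores higher (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical scores for class images and background images.
class_scores = [3.0, 2.0, 0.5]
background_scores = [1.0, 0.2, 0.1]

error = 1.0 - roc_area(class_scores, background_scores)
print(error)
```

An error of 0 means every class image outscores every background image; 0.5 corresponds to chance.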

Multiple use of parts

Part ‘C’ has high variance along the vertical direction – can be detected in several locations – bumper, license plate or roof.

Part Labels:

Recognition Results

Average success rate (at even False Positive and False Negative ratios):
• Faces: 93.5%
• Cars: 86.5%

Agenda


Mixture Model

A Gaussian model works well for homogeneous classes, but real-life objects can be far from homogeneous.

Can we extend our approach to multi-modal classes?

Mixture Model

An object is modeled using Ω different components, each is a probabilistic model:

Each component “sees the whole picture”. Components are trained together.
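A toy sketch of the idea, with 1-D Gaussians standing in for the much richer component models (an assumption for illustration): the class density is a weighted sum over components, and the component that responds most strongly to an input is the one with the largest weighted density. Names and numbers are hypothetical.

```python
import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def mixture_density(x, components):
    """Sum over components of p(component) * p(x | component)."""
    return sum(w * normal_pdf(x, mean, var) for w, mean, var in components)

def most_responsive_component(x, components):
    """Index of the component with the largest weighted density at x."""
    return max(
        range(len(components)),
        key=lambda i: components[i][0] * normal_pdf(x, components[i][1], components[i][2]),
    )

# Hypothetical two-component class: (weight, mean, variance) per component.
components = [(0.5, -2.0, 1.0), (0.5, 3.0, 1.0)]

print(mixture_density(-2.0, components))
print(most_responsive_component(-1.5, components), most_responsive_component(2.0, components))
```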

Database

Faces with different viewing angles – 0°, 15°, …, 90°

Cars – rear view and side view

Tree leaves – of several types

Tuning of the mixture components

Each training image was assigned to the component which responds to it the most, i.e. one that maximizes .

Results

Misclassification error at even false positive and false negative rate for training and test sets

Zero false alarm detection rate (ZFA-DR).

Separately trained components

Two components trained independently on two subclasses of the cars class.

When merged into a mixture model with p(ω) = 0.5, they gave worse results than a two-component model trained on both subclasses simultaneously.

Agenda

Bayesian decision theory
Recognition
  Simple probabilistic model
  Mixture model
  More advanced probabilistic model
    Feature Selection
    Model parameterization
    Results
  “One-Shot” Learning

Fergus, Perona, Zisserman

Object Class Recognition By Scale Invariant Learning - Proc. of the IEEE Conf on Computer Vision and Pattern Recognition - 2003

Object Class Recognition By Scale Invariant Learning

Extended version of the previous model (by Weber et al.)

New feature detector
Probabilistic model for appearance instead of feature types

Feature Detection

Kadir-Brady feature detector

Detects salient regions over different scales and locations

Choose N most salient regions

Each feature contains scale and location information

Notation

X – Shape: locations of the features
A – Appearance: representations of the features
S – Scale: vector of feature scales
h – Hypothesis: which part is represented by which observed feature

Feature Appearance

Feature contents are rescaled to an 11x11 pixel patch
Normalization
Reduce data dimension from 121 to 15 dimensions using PCA
Result is the appearance vector for the part

[Figure: 11x11 patch → Normalize → Projection onto PCA basis → coefficients c1, c2, …, c15]

Recognition

Assume we have learned the model parameters θ. Given an image, we extract X, S, A and can make a Bayesian decision:

We apply a threshold to the likelihood ratio R to decide whether the input image belongs to the class.

Recognition

The term p(X, S, A | θ) can be factored into:

Each of the terms has a closed (computable) form given the model parameters θ

Part appearance pdf

Foreground model: Gaussian
Clutter model: Gaussian

Shape pdf

Foreground model: Gaussian
Clutter model: Uniform

Relative Scale pdf

Foreground model: Gaussian in log(scale)
Clutter model: Uniform in log(scale)


Detection Probability pdf


Foreground model: probability of detection for each part (figure example: 0.8, 0.75, 0.9)
Clutter model: Poisson distribution on the number of detections

Learning

Want to estimate model parameters:

Using the EM method, find the θ that will best explain the training set images, i.e. maximize the likelihood:

Sample Model

Confusion Table

How good is a model for object class A at distinguishing images of class B from background images?

Comparison of Results

Average performance of the models at ROC equal error rates:

Scale invariant learning:

Agenda


Fei-Fei, Fergus, Perona

A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories - Proc. ICCV. 2003

Small Training Set

Humans can learn a new category using very few training examples.

A rule of thumb in computer learning tells us that the number of training examples should be 5-10 times the number of model parameters.

Can computers do better?

Prior knowledge about objects

Incorporating prior knowledge

Bayesian methods allow us to use “prior” information p(θ) about the nature of objects. Given new observations, we can update our knowledge into a “posterior” p(θ|x).

Bayesian Decision

∫P(test | θ, object) p(θ | object, train) dθ

Until now we used the ML approach: approximating p(θ | object, train) by a delta function centered at θML = arg max_θ p(train | θ).

This will not work for a small training set.

Maximum Likelihood vs. Bayesian Learning

Maximum Likelihood

Bayesian Learning

Experimental setup

Learn three object categories using the ML approach

Estimate the prior hyper-parameters

Use VBEM (variational Bayesian EM) to learn a new object category from a few images

Prior Hyper-Parameters

Performance Results – Motorbikes

1 training image 5 training images

Performance Results – Motorbikes

Performance Results – Face Model

1 training image 5 training images

Performance Results – Face Model

Results Comparison

Algorithm                                  # training images   Learning speed   Error rate
Burl et al., Weber et al., Fergus et al.   200 ~ 400           Hours            5.6 - 10 %
Bayesian One-Shot                          1 ~ 5               < 1 min          8 - 15 %

References

Object Class Recognition By Scale Invariant Learning – Fergus, Perona, Zisserman - 2003

A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories - Fei-Fei, Fergus, Perona - 2003

Towards Automatic Discovery of Object Categories – Weber, Welling, Perona – 2000

Unsupervised Learning of Models for Recognition – Weber, Welling, Perona – 2000

Recognition of Planar Object Classes – Burl, Perona – 1996

Pattern Classification and Scene Analysis – Duda, Hart – 1973
