
Computer vision: models, learning and inference

Chapter 7

Modeling complex densities

Please send errata to s.prince@cs.ucl.ac.uk

Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Models for machine vision


Face Detection


Type 3: Pr(x|w) - Generative

How to model Pr(x|w)?

– Choose an appropriate form for Pr(x)

– Make parameters a function of w

– Function takes parameters θ that define its shape

Learning algorithm: learn parameters θ from training data x,w

Inference algorithm: Define prior Pr(w) and then compute Pr(w|x) using Bayes’ rule


Classification Model

Or writing in terms of class conditional density functions

Parameters μ0, Σ0 are learned just from the data where w=0

Similarly, parameters μ1, Σ1 are learned just from the data where w=1

Inference algorithm: Define prior Pr(w) and then compute Pr(w|x) using Bayes’ rule

Experiment

1000 faces, 1000 non-faces

60×60×3 images = 10800×1 vectors

Equal priors Pr(w=1) = Pr(w=0) = 0.5

75% performance on test set. Not very good!
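As a concrete illustration of this pipeline, here is a minimal sketch (not the book's code), assuming numpy and two hypothetical arrays faces and non_faces of shape (1000, 10800) holding the vectorised training images; the diagonal-covariance choice matches the results slide below.

```python
import numpy as np

def fit_diag_gaussian(X):
    """Maximum-likelihood fit of a diagonal-covariance Gaussian to the rows of X."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-6            # small floor keeps all variances positive
    return mu, var

def diag_gauss_logpdf(x, mu, var):
    """log Norm(x | mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def posterior_face(x, params_face, params_nonface, prior_face=0.5):
    """Pr(w=1 | x) computed with Bayes' rule (equal priors by default)."""
    a = diag_gauss_logpdf(x, *params_face) + np.log(prior_face)
    b = diag_gauss_logpdf(x, *params_nonface) + np.log(1.0 - prior_face)
    return np.exp(a - np.logaddexp(a, b))   # normalise in the log domain

# params_face = fit_diag_gaussian(faces)          # faces: (1000, 10800)
# params_nonface = fit_diag_gaussian(non_faces)   # non_faces: (1000, 10800)
# pr = posterior_face(x_test, params_face, params_nonface)
```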


Results (diagonal covariance)


The plan...


Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Hidden (or latent) Variables

Key idea: represent density Pr(x) as marginalization of joint density with another variable h that we do not see


The density will also depend on some parameters θ:
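The marginalization referred to here has the standard form

```latex
Pr(x) = \int Pr(x, h)\, dh,
\qquad
Pr(x \mid \theta) = \int Pr(x, h \mid \theta)\, dh .
```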

Hidden (or latent) Variables


Expectation Maximization

An algorithm specialized to fitting pdfs which are the marginalization of a joint distribution

Defines a lower bound on log likelihood and increases bound iteratively


Lower bound

Lower bound is a function of parameters θ and a set of probability distributions qi(hi)

Expectation Maximization (EM) algorithm alternates E-steps and M-Steps

E-Step – Maximize bound w.r.t. distributions qi(hi)

M-Step – Maximize bound w.r.t. parameters θ


Lower bound


E-Step & M-Step

E-Step – Maximize bound w.r.t. distributions qi(hi)

M-Step – Maximize bound w.r.t. parameters θ


E-Step & M-Step


Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Mixture of Gaussians (MoG)


ML Learning

Q. Can we learn this model with maximum likelihood?

A. Yes, but a brute-force approach is tricky

• If you take the derivative and set it to zero, you can't solve
  – the log of the sum causes problems

• Have to enforce constraints on the parameters
  – covariances must be positive definite
  – weights must sum to one


MoG as a marginalization

Define a hidden variable h and then write

Then we can recover the density by marginalizing Pr(x,h)
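The equations being referred to are the standard MoG density and its hidden-variable construction (Cat denotes the categorical distribution with weights λ):

```latex
Pr(x \mid \theta) = \sum_{k=1}^{K} \lambda_k\, \text{Norm}_{x}\!\left[\mu_k, \Sigma_k\right],
\qquad
Pr(x \mid h, \theta) = \text{Norm}_{x}\!\left[\mu_h, \Sigma_h\right],\quad
Pr(h \mid \theta) = \text{Cat}_{h}\!\left[\boldsymbol{\lambda}\right],
\qquad
Pr(x \mid \theta) = \sum_{h=1}^{K} Pr(x, h \mid \theta).
```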


MoG as a marginalization


MoG as a marginalization

Define a hidden variable h and then write

Note:

• This gives us a method to generate data from a MoG: first sample Pr(h), then sample Pr(x|h) (see the sketch below)

• The hidden variable h has a clear interpretation – it tells you which Gaussian created data point x
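A minimal ancestral-sampling sketch (assuming numpy; names are illustrative): draw the hidden label h from the categorical prior, then draw x from the selected Gaussian.

```python
import numpy as np

def sample_mog(lam, mus, Sigmas, n, rng=None):
    """Sample n points from a MoG: h ~ Cat(lam), then x ~ Norm(mu_h, Sigma_h)."""
    rng = np.random.default_rng() if rng is None else rng
    h = rng.choice(len(lam), size=n, p=lam)        # which Gaussian generates each point
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in h])
    return x, h
```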


GOAL: to learn parameters from training data

Expectation Maximization for MoG

E-Step – Maximize bound w.r.t. distributions qi(hi)

M-Step – Maximize bound w.r.t. parameters θ


E-Step

We’ll call this the responsibility of the kth Gaussian for the ith data point

Repeat this procedure for every datapoint!
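Concretely, the responsibility is r[i,k] = λk Norm(xi | μk, Σk) / Σj λj Norm(xi | μj, Σj). A vectorised sketch (assuming numpy/scipy; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, lam, mus, Sigmas):
    """Responsibilities r[i, k] = Pr(h_i = k | x_i) for every data point."""
    K = len(lam)
    log_r = np.stack([np.log(lam[k])
                      + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
                      for k in range(K)], axis=1)          # shape (I, K)
    log_r -= log_r.max(axis=1, keepdims=True)              # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```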


M-Step

Take derivative, equate to zero and solve (Lagrange multipliers for λ)


M-Step

Update means, covariances and weights according to the responsibilities of the datapoints
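A matching M-step sketch (same assumptions as the E-step sketch above): each component's weight, mean and covariance become responsibility-weighted statistics of the data.

```python
import numpy as np

def m_step(X, r):
    """Closed-form MoG updates from the responsibility matrix r of shape (I, K)."""
    Nk = r.sum(axis=0)                          # effective number of points per component
    lam = Nk / Nk.sum()                         # weights sum to one by construction
    mus = (r.T @ X) / Nk[:, None]               # responsibility-weighted means
    Sigmas = []
    for k in range(r.shape[1]):
        d = X - mus[k]
        Sigmas.append((r[:, k, None] * d).T @ d / Nk[k])   # weighted covariances
    return lam, mus, Sigmas
```

Alternating e_step and m_step until the log likelihood stops improving gives the iteration sketched on the next slides.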


Iterate until no further improvement

E-Step M-Step


Different flavours...

Diagonal covariance

Same covariance

Full covariance


Local Minima

L = 98.76 L = 96.97 L = 94.35

Start from three random positions


Means of face/non-face model

Classification 84% (9% improvement!)


The plan...


Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Student t-distributions


Student t-distributions motivation

(Figure panels: normal distribution; normal distribution w/ one extra datapoint; t-distribution)

The normal distribution is not very robust – a single outlier can completely throw it off because the tails fall off so fast...


Student t-distributions

Univariate student t-distribution

Multivariate student t-distribution
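For reference, the standard univariate and multivariate Student t-densities (for data of dimension D) are

```latex
\text{Stud}_{x}\!\left[\mu, \sigma^2, \nu\right]
  = \frac{\Gamma\!\left(\tfrac{\nu+1}{2}\right)}
         {\sqrt{\nu\pi\sigma^2}\;\Gamma\!\left(\tfrac{\nu}{2}\right)}
    \left[1 + \frac{(x-\mu)^2}{\nu\sigma^2}\right]^{-\tfrac{\nu+1}{2}}

\text{Stud}_{x}\!\left[\mu, \Sigma, \nu\right]
  = \frac{\Gamma\!\left(\tfrac{\nu+D}{2}\right)}
         {(\nu\pi)^{D/2}\,|\Sigma|^{1/2}\,\Gamma\!\left(\tfrac{\nu}{2}\right)}
    \left[1 + \frac{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}{\nu}\right]^{-\tfrac{\nu+D}{2}}
```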


t-distribution as a marginalization

Define hidden variable h

Can be expressed as a marginalization
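The marginalization referred to has the standard form: a Gaussian whose covariance is scaled by the hidden variable h, averaged over a Gamma prior on h.

```latex
Pr(x \mid h) = \text{Norm}_{x}\!\left[\mu, \tfrac{\Sigma}{h}\right],\qquad
Pr(h) = \text{Gam}_{h}\!\left[\tfrac{\nu}{2}, \tfrac{\nu}{2}\right],\qquad
Pr(x) = \int \text{Norm}_{x}\!\left[\mu, \tfrac{\Sigma}{h}\right]
             \text{Gam}_{h}\!\left[\tfrac{\nu}{2}, \tfrac{\nu}{2}\right] dh
      = \text{Stud}_{x}\!\left[\mu, \Sigma, \nu\right].
```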


Gamma distribution
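For reference, the Gamma density with shape α and rate β is

```latex
\text{Gam}_{h}\!\left[\alpha, \beta\right]
  = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, h^{\alpha-1}\exp\!\left[-\beta h\right],
  \qquad h > 0 .
```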


t-distribution as a marginalization

Define hidden variable h

Things to note:

• Again this provides a method to sample from the t-distribution (see the sketch below)

• Variable h has a clear interpretation:
  – each datum is drawn from a Gaussian with mean μ
  – the covariance depends inversely on h

• Can think of this as an infinite mixture (sum becomes integral) of Gaussians w/ the same mean but different variances
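The sampling recipe as a minimal sketch (assuming numpy; names are illustrative): draw the hidden scaling h from the Gamma prior, then draw x from a Gaussian with covariance Σ/h.

```python
import numpy as np

def sample_t(mu, Sigma, nu, n, rng=None):
    """Sample n points from a multivariate t via its Gamma-Gaussian construction."""
    rng = np.random.default_rng() if rng is None else rng
    h = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)   # Gam(nu/2, nu/2): rate -> scale
    return np.stack([rng.multivariate_normal(mu, Sigma / hi) for hi in h])
```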


t-distribution


GOAL: to learn parameters from training data

EM for t-distributions

E-Step – Maximize bound w.r.t. distributions qi(hi)

M-Step – Maximize bound w.r.t. parameters θ


E-Step

Extract expectations


M-Step

Where...


Updates

No closed-form solution for ν. Must optimize the bound – since it is only one number, we can optimize by just evaluating on a fine grid of values and picking the best
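A sketch of that grid search (assuming numpy/scipy; names are illustrative). For brevity it scans the t log likelihood directly rather than the EM bound, which is a simplification of the procedure described above.

```python
import numpy as np
from scipy.special import gammaln

def t_log_likelihood(X, mu, Sigma, nu):
    """Multivariate Student-t log likelihood of the rows of X."""
    D = X.shape[1]
    d = X - mu
    maha = np.einsum('id,de,ie->i', d, np.linalg.inv(Sigma), d)   # (x-mu)^T Sigma^-1 (x-mu)
    logdet = np.linalg.slogdet(Sigma)[1]
    c = (gammaln((nu + D) / 2.0) - gammaln(nu / 2.0)
         - 0.5 * (D * np.log(nu * np.pi) + logdet))
    return np.sum(c - 0.5 * (nu + D) * np.log1p(maha / nu))

def update_nu(X, mu, Sigma, grid=np.arange(1.0, 200.0, 0.5)):
    """No closed form for nu: evaluate on a fine grid and keep the best value."""
    scores = [t_log_likelihood(X, mu, Sigma, nu) for nu in grid]
    return grid[int(np.argmax(scores))]
```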


EM algorithm for t-distributions


The plan...


Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Factor Analysis

Compromise between
– Full covariance matrix (desirable but many parameters)
– Diagonal covariance matrix (no modelling of covariance)

Models full covariance in a subspace Φ

Mops everything else up with diagonal component Σ


Data Density

Consider modelling the distribution of 100×100×3 RGB images (Dx = 30000 dimensional vectors)

Gaussian w/ spherical covariance: 1 covariance parameter

Gaussian w/ diagonal covariance: Dx covariance parameters

Full Gaussian: ~Dx² covariance parameters

PPCA: ~Dx·Dh covariance parameters

Factor analysis: ~Dx·(Dh+1) covariance parameters (full covariance in subspace + diagonal)

Subspaces


Factor Analysis

Compromise between
– Full covariance matrix (desirable but many parameters)
– Diagonal covariance matrix (no modelling of covariance)

Models full covariance in a subspace Φ

Mops everything else up with diagonal component Σ


Factor Analysis as a Marginalization

Let's define

Then it can be shown (not obvious) that



Sampling from factor analyzer

Factor analysis vs. MoG


E-Step

• Compute posterior over hidden variables

E-Step

• Compute expectations needed for M-Step

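A minimal sketch of this E-step (assuming numpy, a diagonal Σ stored as a vector Sigma_diag, and illustrative names): the posterior over each hidden variable is Gaussian, and the two expectations below are all the M-step needs.

```python
import numpy as np

def fa_e_step(X, mu, Phi, Sigma_diag):
    """Return E[h_i] and E[h_i h_i^T] under the Gaussian posterior Pr(h_i | x_i)."""
    d = X - mu                                        # (I, D) centred data
    PhiT_Sinv = Phi.T / Sigma_diag                    # Phi^T Sigma^{-1} (Sigma diagonal)
    post_cov = np.linalg.inv(PhiT_Sinv @ Phi + np.eye(Phi.shape[1]))
    E_h = d @ (PhiT_Sinv.T @ post_cov)                # (I, K) posterior means
    E_hh = post_cov[None] + E_h[:, :, None] * E_h[:, None, :]   # (I, K, K) second moments
    return E_h, E_hh
```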

M-Step

• Optimize parameters w.r.t. EM bound


where....

M-Step

• Optimize parameters w.r.t. EM bound


where...

giving expressions:
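A matching M-step sketch (same assumptions as the E-step sketch above); μ itself is simply the mean of the training data.

```python
import numpy as np

def fa_m_step(X, mu, E_h, E_hh):
    """Update the factors Phi and the diagonal noise covariance from E-step expectations."""
    d = X - mu                                                   # (I, D)
    Phi = np.linalg.solve(E_hh.sum(axis=0), E_h.T @ d).T         # [sum_i d_i E[h_i]^T][sum_i E[h_i h_i^T]]^{-1}
    Sigma_diag = np.mean(d * d - (E_h @ Phi.T) * d, axis=0)      # keep only the diagonal
    return Phi, Sigma_diag
```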

Face model


Sampling from 10 parameter model

To generate:

• Choose factor loadings hi from a standard normal distribution
• Multiply by the factors Φ
• Add the mean μ
• (should also add a random noise component εi with diagonal covariance Σ – see the sketch below)
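The generation recipe above as a minimal sketch (assuming numpy; names are illustrative):

```python
import numpy as np

def sample_factor_analyzer(mu, Phi, Sigma_diag, n, rng=None):
    """Draw h ~ Norm(0, I), form mu + Phi h, then add diagonal Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    D, K = Phi.shape
    h = rng.standard_normal((n, K))                              # factor loadings
    eps = rng.standard_normal((n, D)) * np.sqrt(Sigma_diag)      # noise with diagonal covariance
    return mu + h @ Phi.T + eps
```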


Combining Models

The three models can be combined in various ways:


• Mixture of t-distributions = mixture of Gaussians + t-distributions
• Robust subspace models = t-distributions + factor analysis
• Mixture of factor analyzers = mixture of Gaussians + factor analysis

Or combine all models to create mixture of robust subspace models

Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Expectation Maximization

Problem: Optimize cost functions of the form

Solution: Expectation Maximization (EM) algorithm (Dempster, Laird and Rubin 1977)

Key idea: Define lower bound on log-likelihood and increase at each iteration

Continuous case

Discrete case
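The two cost functions referred to take the standard form

```latex
\hat{\theta} = \underset{\theta}{\arg\max}\;
  \sum_{i=1}^{I} \log\!\left[\int Pr(x_i, h_i \mid \theta)\, dh_i\right]
  \quad\text{(continuous } h\text{)},\qquad
\hat{\theta} = \underset{\theta}{\arg\max}\;
  \sum_{i=1}^{I} \log\!\left[\sum_{h_i} Pr(x_i, h_i \mid \theta)\right]
  \quad\text{(discrete } h\text{)}.
```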


Expectation Maximization

Defines a lower bound on log likelihood and increases bound iteratively


Lower bound

Lower bound is a function of parameters θ and a set of probability distributions {qi(hi)}
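The bound referred to is

```latex
\mathcal{B}\!\left[\{q_i(h_i)\}, \theta\right]
  = \sum_{i=1}^{I} \int q_i(h_i)\,
      \log\!\left[\frac{Pr(x_i, h_i \mid \theta)}{q_i(h_i)}\right] dh_i
  \;\le\; \sum_{i=1}^{I} \log\!\left[\int Pr(x_i, h_i \mid \theta)\, dh_i\right].
```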


E-Step & M-Step

E-Step – Maximize bound w.r.t. distributions {qi(hi)}

M-Step – Maximize bound w.r.t. parameters θ


E-Step: Update {qi(hi)} so that the bound equals the log likelihood for this θ


M-Step: Update θ to the maximum


Missing Parts of Argument

1. Show that this is a lower bound

2. Show that the E-Step update is correct

3. Show that the M-Step update is correct


Jensen’s Inequality

or…

similarly
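The statements being referred to: because the logarithm is concave,

```latex
\log\!\left[\mathrm{E}[y]\right] \ge \mathrm{E}\!\left[\log[y]\right],
\qquad\text{i.e.}\qquad
\log\!\left[\int Pr(y)\, y\; dy\right] \ge \int Pr(y)\,\log[y]\; dy,
```

and similarly for the discrete case with sums in place of integrals.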


Lower Bound for EM Algorithm

Where we have used Jensen’s inequality
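The derivation referred to, applying Jensen's inequality to the log likelihood:

```latex
\sum_{i=1}^{I} \log\!\left[\int Pr(x_i, h_i \mid \theta)\, dh_i\right]
  = \sum_{i=1}^{I} \log\!\left[\int q_i(h_i)\,
      \frac{Pr(x_i, h_i \mid \theta)}{q_i(h_i)}\, dh_i\right]
  \ge \sum_{i=1}^{I} \int q_i(h_i)\,
      \log\!\left[\frac{Pr(x_i, h_i \mid \theta)}{q_i(h_i)}\right] dh_i
  = \mathcal{B}\!\left[\{q_i(h_i)\}, \theta\right].
```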


E-Step – Optimize bound w.r.t. {qi(hi)}

(One term of the bound is constant w.r.t. qi(hi) – only the remaining term matters)
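The decomposition referred to has the standard form

```latex
\mathcal{B}\!\left[\{q_i(h_i)\}, \theta\right]
  = \sum_{i=1}^{I} \log\!\left[Pr(x_i \mid \theta)\right]
    - \sum_{i=1}^{I} D_{KL}\!\left[q_i(h_i)\,\big\|\,Pr(h_i \mid x_i, \theta)\right],
```

where the first term does not depend on qi(hi), so the E-Step only has to minimize the Kullback-Leibler divergences.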


E-Step

Kullback-Leibler divergence – a distance between probability distributions. We are maximizing the negative distance (i.e. minimizing the distance)


Use this relation

log[y] ≤ y − 1


Kullback Leibler Divergence

So the cost function must be non-negative

In other words, the best we can do is choose qi(hi) so that this is zero
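The missing derivation follows from the relation log[y] ≤ y − 1 above:

```latex
-D_{KL}\!\left[q(h)\,\big\|\,Pr(h \mid x)\right]
  = \int q(h)\,\log\!\left[\frac{Pr(h \mid x)}{q(h)}\right] dh
  \le \int q(h)\!\left[\frac{Pr(h \mid x)}{q(h)} - 1\right] dh
  = \int Pr(h \mid x)\, dh - \int q(h)\, dh = 0,
```

so the divergence is non-negative, with equality exactly when q(h) = Pr(h|x).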


E-Step

So the cost function must be non-negative.

The best we can do is choose qi(hi) so that this is zero.

How can we do this? Easy – choose posterior Pr(h|x)


M-Step – Optimize bound w.r.t. θ

In the M-Step we optimize the expected joint log likelihood with respect to the parameters θ (expectation taken w.r.t. the distributions from the E-Step)
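Equivalently, dropping the terms that do not depend on θ, the update has the standard form

```latex
\hat{\theta}^{[t+1]}
  = \underset{\theta}{\arg\max}\;
    \sum_{i=1}^{I} \int q_i(h_i)\,\log\!\left[Pr(x_i, h_i \mid \theta)\right] dh_i .
```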


E-Step & M-Step

E-Step – Maximize bound w.r.t. distributions qi(hi)

M-Step – Maximize bound w.r.t. parameters θ


Structure


• Densities for classification

• Models with hidden variables

– Mixture of Gaussians

– t-distributions

– Factor analysis

• EM algorithm in detail

• Applications

Face Detection


(not a very realistic model – face detection is really done using discriminative methods)

Object recognition


(Figure panels: normal distributions vs. t-distributions)

Model as J independent patches

Segmentation


Face Recognition


Pose Regression

Training data

Frontal faces x
Non-frontal faces w

Learning

Fit factor analysis model to joint distribution of x,w

Inference

Given new x, infer w (conditional distribution of a normal)


Conditional Distributions

If

then
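The standard conditional-Gaussian result being used here is:

```latex
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
  \sim \text{Norm}\!\left[
    \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
    \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
  \right]
\;\Longrightarrow\;
Pr(x_1 \mid x_2)
  = \text{Norm}_{x_1}\!\left[
      \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;
      \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
    \right].
```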


Modeling transformations
