
Computer Vision: Models, Learning and Inference – Markov Random Fields, Part 3

Oren Freifeld and Ron Shapira-Weber

Computer Science, Ben-Gurion University

Mar 11, 2019

www.cs.bgu.ac.il/~cv192/


Image Restoration as a Generic Example

Image Restoration

Consider an $n \times n$ image

Pixels define a regular 2D lattice

$x = (x_s)_{s \in S}$, a latent ("clean"/"uncorrupted") image

$y = (y_s)_{s \in S}$, an observed ("corrupted-by-noise") image

Problem: Recover x from y.


Bayesian Image Restoration

The Bayesian approach is based on

$$p(x|y) = \frac{p(x,y)}{p(y)} = \frac{p(y|x)\,p(x)}{p(y)} \propto p(y|x)\,p(x)$$

p(x|y): posterior

p(y|x): likelihood

p(y): evidence

p(x): prior


Consider Binary Latent Images with an MRF prior

$x_s \in \{-1, 1\}$ (binary, with $-1$ instead of $0$); i.e., $x \in \{-1, 1\}^{|S|}$

Prior model:

$$p(x) \propto \prod_{c \in C} H_c(x_c) \tag{1}$$

Observation/likelihood model 1 ($\mathbb{R}$-valued, additive):

$$\mathbb{R} \ni y_s = x_s + n_s, \quad s \in S, \qquad n_s \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$$

Observation/likelihood model 2 (binary, multiplicative):

$$\{-1,1\} \ni y_s = x_s n_s, \quad s \in S, \qquad n_s \in \{-1,1\}, \quad \frac{n_s+1}{2} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta)$$

In both cases, the overall likelihood is

$$p(y|x) = \prod_{s \in S} p(y_s|x) = \prod_{s \in S} p(y_s|x_s)$$

The first "=": the $y_s$'s are conditionally independent given $x$. The second "=" is stronger: $y_s \perp\!\!\!\perp (x_{S \setminus s}, y_{S \setminus s}) \mid x_s$ for every $s \in S$.
iid = independent and identically distributed
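
To make the two observation models concrete, here is a minimal NumPy sketch (illustrative only; the toy image and the `sigma` and `theta` values are assumptions, not values from the slides) that corrupts a $\pm 1$ image under each model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
# A toy binary latent image x in {-1, 1}^(n x n): a centered square.
x = -np.ones((n, n), dtype=int)
x[16:48, 16:48] = 1

# Model 1 (real-valued, additive): y_s = x_s + n_s, with n_s ~ N(0, sigma^2) iid.
sigma = 1.0
y_additive = x + sigma * rng.standard_normal((n, n))

# Model 2 (binary, multiplicative): y_s = x_s * n_s with n_s in {-1, 1},
# where (n_s + 1)/2 ~ Bernoulli(theta) iid, i.e., n_s = +1 with probability theta.
theta = 0.8
flips = np.where(rng.random((n, n)) < theta, 1, -1)
y_flip = x * flips
```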


Posterior

Case 1:

$$p(x|y) = \frac{\overbrace{\left[\prod_{s \in S} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_s - x_s)^2}{2\sigma^2}\right)\right]}^{\text{likelihood}} \prod_{c \in C} H_c(x_c)}{\sum_{x' \in \{-1,1\}^{|S|}} \left[\prod_{s \in S} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_s - x'_s)^2}{2\sigma^2}\right)\right] \prod_{c \in C} H_c(x'_c)}$$

$$\propto \left[\prod_{s \in S} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_s - x_s)^2}{2\sigma^2}\right)\right] \prod_{c \in C} H_c(x_c) \propto \prod_{c \in C} F_c(x_c)$$

(singletons – w.r.t. GA – absorbed into the $H_c$'s; the resulting clique functions are renamed $F_c$, where $F_c(x_c) = F_c(x_c, y)$ also depends on $y$)


Posterior

Case 2:

$$p(x|y) \propto \overbrace{(1-\theta)^{|\{s \in S : y_s = -x_s\}|}\, \theta^{|\{s \in S : y_s = x_s\}|}}^{\text{likelihood}} \prod_{c \in C} H_c(x_c) = \prod_{c \in C} F_c(x_c)$$

(singletons – w.r.t. GA – absorbed into the $H_c$'s; the resulting clique functions are renamed $F_c$, where $F_c(x_c) = F_c(x_c, y)$ also depends on $y$)


Digression: What if $y_s$ Depends on More than $x_s$?

E.g., $y$ is obtained from $x$ by convolving it with a $3 \times 3$ filter. We then get products of $p(y_s \mid 3\times 3 \text{ neighborhood of } s \text{ in } x)$.
$\Rightarrow$ In $p(x|y)$, each $3 \times 3$ block is a clique, with (partial) overlaps between nearby cliques.


What We Want

Sample from $p(x|y)$, compute posterior expectations (e.g., $E(x|y)$ or $E(g(x)|y)$ for some function $g$), and compute the posterior mode, $\arg\max_x p(x|y)$.
In fact, if we can sample somehow, we can use this for the other tasks as well; for example (and regardless of MRFs):

$$\underbrace{(x^i_s)_{s \in S}}_{x^i} \overset{\text{iid}}{\sim} \underbrace{p\big((x_s)_{s \in S} \mid y\big)}_{p(x|y)}, \quad i = 1, 2, \ldots, N
\;\Rightarrow\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N} g\big((x^i_s)_{s \in S}\big)}_{\frac{1}{N}\sum_{i=1}^{N} g(x^i)}
\xrightarrow[N \to \infty]{\text{LLN}}
\underbrace{E\big(g((X_s)_{s \in S}) \mid Y = y\big)}_{E(g(X) \mid Y = y)}$$

($x^i$ is a sample of an entire image)

Remark: usually $E(X \mid Y = y)$ is in $[-1,1]^{|S|}$ (as opposed to $\{-1,1\}^{|S|}$).
LLN: Law of Large Numbers
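
As a sketch of this Monte Carlo idea: assuming we already have some routine that returns one draw from $p(x|y)$ (here called `sample_posterior`, a hypothetical placeholder), the posterior expectation of any $g$ is just an empirical average:

```python
import numpy as np

def posterior_expectation(sample_posterior, g, N=1000):
    """Estimate E(g(X) | Y=y) by averaging g over N posterior samples.

    sample_posterior: callable returning one sample x^i (an entire image).
    g: callable mapping an image to a scalar or an array.
    """
    total = None
    for _ in range(N):
        gx = np.asarray(g(sample_posterior()), dtype=float)
        total = gx if total is None else total + gx
    return total / N

# Example: g = identity gives the posterior mean image E(X | Y=y),
# whose entries typically lie in [-1, 1] rather than {-1, 1}.
# posterior_mean = posterior_expectation(sample_posterior, lambda x: x)
```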


Ising Model

The Ising Model Prior

$$p(x) = \frac{1}{Z} \exp\!\left(\beta \sum_{s \sim t} x_s x_t\right)$$

($s \sim t$ means $s$ is a neighbor of $t$)

Samples from $p(x)$ with three different values of $\beta$, where, from left to right, $\beta_1 < \beta_2 < \beta_3$.

Figure from Winkler's monograph, "Image Analysis: Random Fields and Markov Chain Monte Carlo Methods"


Example

If $\beta = \frac{1}{1.8}$ and $y_s = x_s + n_s$ where $n_s \overset{\text{iid}}{\sim} \mathcal{N}(0, 4)$, $s \in S$, then

$$p(x|y) \propto \exp\!\left(\frac{1}{1.8}\left(\sum_{s \sim t} x_s x_t\right) - \frac{1}{8}\sum_{s}(y_s - x_s)^2\right)$$
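
For this specific example, the unnormalized posterior log-probability is easy to evaluate directly; a small sketch (assuming a 4-connected lattice, with each neighbor pair counted once):

```python
import numpy as np

def log_posterior_unnorm(x, y, beta=1 / 1.8, sigma2=4.0):
    """log p(x|y) up to an additive constant, for x in {-1, 1}^(n x n).

    Pairwise term: beta * sum over 4-connected neighbor pairs of x_s * x_t.
    Data term: -(1 / (2 * sigma2)) * sum_s (y_s - x_s)^2.
    """
    pairwise = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    data = np.sum((y - x) ** 2)
    return beta * pairwise - data / (2.0 * sigma2)
```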


Left: the true $x$, where $x \sim p(x)$ (using the Ising model). Right: $y$, the noisy image.


Left: $x \sim p(x)$ (using the Ising model). Right: a posterior sample, $x \sim p(x|y)$.


Left: $x \sim p(x)$ (using the Ising model). Right: the maximum-a-posteriori (MAP) solution, $\arg\max_x p(x|y)$.


Interpretations of the Ising Model

WLOG, let β = 1.

$$\frac{\exp\!\left(\sum_{s \sim t} x_s x_t\right)}{\sum_{x'} \exp\!\left(\sum_{s \sim t} x'_s x'_t\right)} = \frac{\exp\!\left(\sum_{s \sim t} x_s x_t + c\right)}{\sum_{x'} \exp\!\left(\sum_{s \sim t} x'_s x'_t + c\right)} \quad \forall c \in \mathbb{R} \tag{2}$$

$$\Rightarrow\; p(x) \propto \exp\!\left(\sum_{s \sim t} x_s x_t + c\right) \quad \forall c \in \mathbb{R} \tag{3}$$

Take $c$ = #{neighboring pairs} $\Rightarrow \sum_{s \sim t} x_s x_t + c$ = #like-neighbor pairs $-$ #unlike-neighbor pairs $+$ #{neighboring pairs} $= 2 \times$ #like-neighbor pairs
$\Rightarrow p(x) \propto \exp(2 \times \text{\#like-neighbor pairs})$

Take $c = -$#{neighboring pairs} $\Rightarrow$ #like-neighbor pairs $-$ #unlike-neighbor pairs $-$ #{neighboring pairs} $= -2 \times$ #unlike-neighbor pairs $= -2 \times$ boundary length
$\Rightarrow p(x) \propto \exp(-2 \times \text{boundary length})$


Introduction to MCMC and Gibbs Sampling: Gibbs Sampling

When Sampling from the Conditionals is Easy (Regardless of MRFs)

Assume the following setting:

$X$, $Y$ and $Z$: 3 random vectors (of possibly different dimensions) with a joint pdf/pmf $p(x, y, z)$

We want to sample $(x, y, z) \sim p(x, y, z)$ – but it's hard or we don't know how.

We can relatively easily sample from the conditionals:

$$p(x \mid y, z), \qquad p(y \mid x, z), \qquad p(z \mid x, y)$$


Gibbs Sampling [Geman and Geman, 1984]

The algorithm:
1. $(x^{[0]}, y^{[0]}, z^{[0]}) \sim g(x, y, z)$, using some easy-to-sample-from $g(x, y, z)$ of the same support as $p(x, y, z)$ (e.g., take $g(x, y, z) = g(x)g(y)g(z)$).
2. Iterate:
$$x^{[i]} \sim p(x \mid y^{[i-1]}, z^{[i-1]}), \qquad y^{[i]} \sim p(y \mid x^{[i]}, z^{[i-1]}), \qquad z^{[i]} \sim p(z \mid x^{[i]}, y^{[i]})$$

$p(x^{[n]}, y^{[n]}, z^{[n]}) \xrightarrow{n \to \infty} p(x, y, z)$ (in some sense)

The resulting sequence is clearly a Markov Chain:
$$p(x^{[i]}, y^{[i]}, z^{[i]} \mid x^{[0:(i-1)]}, y^{[0:(i-1)]}, z^{[0:(i-1)]}) = p(x^{[i]}, y^{[i]}, z^{[i]} \mid x^{[i-1]}, y^{[i-1]}, z^{[i-1]})$$

Gibbs sampling is a particular case of MCMC.
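
A minimal sketch of the scheme above on a toy example (not tied to any model from the slides): the joint pmf of three binary variables is stored as a 2×2×2 table, so each conditional is obtained by normalizing a slice of the table, and the empirical distribution of the Gibbs iterates approaches the joint:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()  # a toy joint pmf p(x, y, z) over {0, 1}^3

def gibbs_sweep(state):
    x, y, z = state
    cond = p[:, y, z] / p[:, y, z].sum()   # p(x | y, z)
    x = rng.choice(2, p=cond)
    cond = p[x, :, z] / p[x, :, z].sum()   # p(y | x, z)
    y = rng.choice(2, p=cond)
    cond = p[x, y, :] / p[x, y, :].sum()   # p(z | x, y)
    z = rng.choice(2, p=cond)
    return (x, y, z)

state = (0, 0, 0)            # an arbitrary initialization
counts = np.zeros((2, 2, 2))
for t in range(20000):
    state = gibbs_sweep(state)
    if t >= 1000:            # discard a burn-in period
        counts[state] += 1
print(np.round(counts / counts.sum(), 3))  # should be close to p
print(np.round(p, 3))
```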


Gibbs Sampling [Geman and Geman, 1984]

It is OK even if we know $p$ only up to a multiplicative constant:

We know $g(x, y, z)$, a non-normalized nonnegative function, where
$$p(x, y, z) = \frac{1}{Z}\, g(x, y, z), \qquad Z = \int g(x, y, z)\, dx\, dy\, dz,$$
but we can't (or don't want to) compute the integral.

$$p(x \mid y, z) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x, y, z)}{\int p(x', y, z)\, dx'} = \frac{\frac{1}{Z}\, g(x, y, z)}{\int \frac{1}{Z}\, g(x', y, z)\, dx'} = \frac{g(x, y, z)}{\int g(x', y, z)\, dx'}$$

or, in the discrete case:

$$p(x \mid y, z) = \frac{g(x, y, z)}{\sum_{x'} g(x', y, z)}$$


Gibbs Sampling [Geman and Geman, 1984]

More generally: we want $(x_1, \ldots, x_n) \sim p(x_1, \ldots, x_n)$. Gibbs sampling with $n$ RVs:

1. $(x^{[0]}_1, x^{[0]}_2, \ldots, x^{[0]}_n) \sim g(x_1, \ldots, x_n)$, using some easy-to-sample-from $g(x_1, \ldots, x_n)$ of the same support as $p(x_1, \ldots, x_n)$.
2. Iterate:
   Update $x_1$ by sampling $x_1$ given the rest
   Update $x_2$ by sampling $x_2$ given the rest
   ...
   Update $x_n$ by sampling $x_n$ given the rest

One such iteration over all $n$ variables is called a sweep.


Motivation

Assume:

$p$ is "local" (an MRF with a modest neighborhood size)

$p(y|x)$ is "local"

$\Rightarrow p(x|y)$ is an MRF with a "local" neighborhood structure.
But: in an $n \times n$ lattice structure (e.g., images), the maximal boundary is of order $n$ – so we can't use Dynamic Programming. So we try MCMC.

Gibbs sampling [Geman and Geman, 1984] is a particular case of MCMC methods.

In principle, Gibbs sampling is applicable to general distributions, not just to Gibbs distributions (i.e., MRFs), but it is particularly easy to do Gibbs sampling in Gibbs distributions.


Gibbs Sampling in MRFs [Geman and Geman, 1984]

Pick a (possibly random) site-visitation scheme, and then, when visiting $s$, sample
$$x_s \sim p(x_s \mid x_{S \setminus s}) \overset{\text{MRF}}{=} p(x_s \mid x_{\eta_s})$$

$p(x_s \mid x_{\eta_s})$, which involves only the local neighborhood, is usually easy to sample from.

It is OK if we know $p$ only up to a multiplicative constant (which is often the case with MRFs):
$$p(x) \propto \prod_{c \in C} F_c(x_c)$$

Asymptotically great, but it takes a long time if the graph is too large/complicated.

Often, particularly for images, it can be massively parallelized (but even then this can take a long time).


Gibbs Sampling in the Ising Model: Sampling from p(x)

$$p(x) \propto \exp\!\left(\beta \sum_{s \sim t} x_s x_t\right)$$

$$p(x_s \mid x_{S \setminus s}) = \frac{p(x_s, x_{S \setminus s})}{p(x_{S \setminus s})} = \frac{p(x)}{p(x_{S \setminus s})} = \frac{\frac{1}{Z}\exp\!\left(\beta \sum_{s \sim t} x_s x_t\right)}{\sum_{x'_s} \frac{1}{Z}\exp\!\left(\beta \sum_{s \sim t} x'_s x_t\right)}$$

$$= \frac{\exp\!\left(\beta \sum_{t \in \eta_s} x_s x_t\right)}{\sum_{x'_s} \exp\!\left(\beta \sum_{t \in \eta_s} x'_s x_t\right)} = \frac{\exp\!\left(\beta \sum_{t \in \eta_s} x_s x_t\right)}{\exp\!\left(-\beta \sum_{t \in \eta_s} x_t\right) + \exp\!\left(\beta \sum_{t \in \eta_s} x_t\right)}$$

$\Rightarrow$

$$p(x_s = 1 \mid x_{S \setminus s}) = \frac{\exp\!\left(\beta \sum_{t \in \eta_s} x_t\right)}{\exp\!\left(-\beta \sum_{t \in \eta_s} x_t\right) + \exp\!\left(\beta \sum_{t \in \eta_s} x_t\right)} \propto \exp\!\left(\beta \sum_{t \in \eta_s} x_t\right)$$

$$p(x_s = -1 \mid x_{S \setminus s}) = \frac{\exp\!\left(-\beta \sum_{t \in \eta_s} x_t\right)}{\exp\!\left(-\beta \sum_{t \in \eta_s} x_t\right) + \exp\!\left(\beta \sum_{t \in \eta_s} x_t\right)} \propto \exp\!\left(-\beta \sum_{t \in \eta_s} x_t\right)$$
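
Below is a minimal sketch of one Gibbs sweep over the Ising prior, assuming a 4-connected lattice and a raster-scan visitation scheme (one possible implementation, not code from the slides); since $p(x_s = 1 \mid x_{\eta_s}) \propto \exp(\beta \sum_{t \in \eta_s} x_t)$, each site update is a biased coin flip:

```python
import numpy as np

def gibbs_sweep_ising_prior(x, beta, rng):
    """One raster-scan Gibbs sweep for p(x) proportional to exp(beta * sum_{s~t} x_s x_t).

    x: array in {-1, 1}^(n x m), updated in place; 4-connected neighborhood.
    """
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            nbr_sum = 0
            if i > 0:
                nbr_sum += x[i - 1, j]
            if i < n - 1:
                nbr_sum += x[i + 1, j]
            if j > 0:
                nbr_sum += x[i, j - 1]
            if j < m - 1:
                nbr_sum += x[i, j + 1]
            # p(x_s = +1 | neighbors) = exp(b*S) / (exp(b*S) + exp(-b*S)) = 1 / (1 + exp(-2*b*S))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * nbr_sum))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(64, 64))
for _ in range(100):   # many sweeps -> approximately a sample from p(x)
    gibbs_sweep_ising_prior(x, beta=0.45, rng=rng)
```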


Gibbs Sampling in the Ising Model: Sampling from p(x|y) (iid Additive Gaussian Noise)

$$p(x|y) \propto \exp\!\left(\beta\left(\sum_{s \sim t} x_s x_t\right) - \frac{1}{2\sigma^2}\sum_{s}(y_s - x_s)^2\right)$$

$$p(x_s \mid x_{S \setminus s}, y) = \frac{p(x_s, x_{S \setminus s} \mid y)}{p(x_{S \setminus s} \mid y)} = \frac{p(x|y)}{p(x_{S \setminus s} \mid y)} = \frac{\exp\!\left(\beta\left(\sum_{s \sim t} x_s x_t\right) - \frac{1}{2\sigma^2}\sum_{s}(y_s - x_s)^2\right)}{\sum_{x'_s}\exp\!\left(\beta\left(\sum_{s \sim t} x'_s x_t\right) - \frac{1}{2\sigma^2}\sum_{s}(y_s - x'_s)^2\right)}$$

$$\propto \exp\!\left(\beta\left(\sum_{t \in \eta_s} x_s x_t\right) - \frac{1}{2\sigma^2}(y_s - x_s)^2\right)$$

$$p(x_s = 1 \mid x_{S \setminus s}, y) \propto \exp\!\left(\beta\left(\sum_{t \in \eta_s} x_t\right) - \frac{1}{2\sigma^2}(y_s - 1)^2\right)$$

$$p(x_s = -1 \mid x_{S \setminus s}, y) \propto \exp\!\left(-\beta\left(\sum_{t \in \eta_s} x_t\right) - \frac{1}{2\sigma^2}(y_s + 1)^2\right)$$

Nothing more than flipping a biased coin.
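
A sketch of the corresponding single-site update for this posterior (same 4-connected-lattice and raster-scan assumptions as before); it just normalizes the two unnormalized probabilities above and flips the biased coin:

```python
import numpy as np

def gibbs_sweep_ising_posterior_gaussian(x, y, beta, sigma2, rng):
    """One Gibbs sweep for p(x|y) propto exp(beta*sum_{s~t} x_s x_t - sum_s (y_s - x_s)^2 / (2*sigma2))."""
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            nbr_sum = 0
            if i > 0:
                nbr_sum += x[i - 1, j]
            if i < n - 1:
                nbr_sum += x[i + 1, j]
            if j > 0:
                nbr_sum += x[i, j - 1]
            if j < m - 1:
                nbr_sum += x[i, j + 1]
            # Unnormalized log-probabilities of the two states x_s = +1 and x_s = -1.
            log_p_plus = beta * nbr_sum - (y[i, j] - 1.0) ** 2 / (2.0 * sigma2)
            log_p_minus = -beta * nbr_sum - (y[i, j] + 1.0) ** 2 / (2.0 * sigma2)
            p_plus = 1.0 / (1.0 + np.exp(log_p_minus - log_p_plus))
            x[i, j] = 1 if rng.random() < p_plus else -1   # flip a biased coin
    return x
```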


Gibbs Sampling in the Ising Model: Sampling from p(x|y) (Multiplicative Flips, AKA Binary Symmetric Channel)

$$p(x|y) \propto \exp\!\left(\beta \sum_{s \sim t} x_s x_t\right) \theta^{|\{s \in S : y_s = x_s\}|} (1 - \theta)^{|\{s \in S : y_s = -x_s\}|}$$

$$p(x_s \mid x_{S \setminus s}, y) = \frac{p(x_s, x_{S \setminus s} \mid y)}{p(x_{S \setminus s} \mid y)} = \frac{p(x|y)}{p(x_{S \setminus s} \mid y)} \propto \exp\!\left(\beta \sum_{t \in \eta_s} x_s x_t\right) \theta^{\mathbf{1}_{y_s = x_s}} (1 - \theta)^{\mathbf{1}_{y_s = -x_s}}$$

$$p(x_s = 1 \mid x_{S \setminus s}, y) \propto \exp\!\left(\beta \sum_{t \in \eta_s} x_t\right) \theta^{\mathbf{1}_{y_s = 1}} (1 - \theta)^{\mathbf{1}_{y_s = -1}}$$

$$p(x_s = -1 \mid x_{S \setminus s}, y) \propto \exp\!\left(-\beta \sum_{t \in \eta_s} x_t\right) \theta^{\mathbf{1}_{y_s = -1}} (1 - \theta)^{\mathbf{1}_{y_s = 1}}$$

Again, this is just flipping a biased coin.
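
Relative to the Gaussian-noise sweep sketched earlier, the only change is the local data term: the quadratic penalty is replaced by $\log\theta$ or $\log(1-\theta)$ depending on whether the proposed $x_s$ agrees with $y_s$. A sketch of just that piece (hypothetical helper; same lattice assumptions):

```python
import numpy as np

def local_log_likelihood_bsc(xs, ys, theta):
    """log p(y_s | x_s) for the binary symmetric channel, with x_s, y_s in {-1, 1}."""
    return np.log(theta) if ys == xs else np.log(1.0 - theta)

# Inside the sweep, replace the Gaussian data terms with:
#   log_p_plus  =  beta * nbr_sum + local_log_likelihood_bsc(+1, y[i, j], theta)
#   log_p_minus = -beta * nbr_sum + local_log_likelihood_bsc(-1, y[i, j], theta)
```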


Introduction to MCMC and Gibbs Sampling: MCMC

Markov Chain Monte Carlo (MCMC) – Regardless of MRFs

Let $p$ be the pdf/pmf of interest, known as the target distribution.

Find a Markov Chain $X(0), X(1), \ldots$ with $X(t) \in R^{|S|}$ (where $R$ denotes the finite set of possible values at each site) such that:

$\Pr(X(t) = x) \to p(x)$ as $t \to \infty$ ("sampling")

or

$\Pr(X(t) \in M) \to 1$ as $t \to \infty$, where $M = \{x : p(x) = \max_{x'} p(x')\}$ ("annealing")

or

$$\frac{1}{T}\sum_{t=1}^{T} H(X(t)) \xrightarrow{T \to \infty} E_p H(X)$$

for some function of interest $H$ ("ergodicity")


Stochastic matrix

Definition

A square matrix $P$ of nonnegative values is called a stochastic matrix if each of its rows sums to 1.

Remark: It is called doubly stochastic if both $P$ and $P^T$ are stochastic matrices.

Remark: a stochastic vector is a vector of nonnegative values whose sum is 1.

This terminology can be confusing – we often use "random" and "stochastic" interchangeably, but here neither a stochastic matrix nor a stochastic vector is "random".


Markov Chain Monte Carlo (MCMC) – Regardless of MRFs

For simplicity, assume a finite state space (but more generally, MCMC is also applicable in the countable or continuous settings).
The MC is said to be stationary if the transition probability
$$P_{x,y} \triangleq \Pr(X(t) = y \mid X(t-1) = x)$$
is independent of $t$, $\forall x, y \in R^{|S|}$.
The transition probability matrix is
$$P = \{P_{x,y}\} \in \mathbb{R}^{|R^{|S|}| \times |R^{|S|}|} \quad \text{(this might be huge)}$$
$P$ is a stochastic matrix.
Consider pmf's over $R^{|S|}$ as row vectors of length $|R^{|S|}|$. Such a vector has nonnegative entries summing to 1.
Let $\pi(t)$ be the pmf of $X(t)$ $\Rightarrow$ $\pi(t+1) = \pi(t)P$ is the pmf of $X(t+1)$:
$$\pi_i(t+1) = \sum_{j=1}^{|R^{|S|}|} \pi_j(t)\, P_{j,i} = \sum_{j=1}^{|R^{|S|}|} \Pr(X(t+1) = i, X(t) = j)$$
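
For intuition (nothing to do with images), here is a tiny illustrative example of propagating a pmf through a stochastic matrix, $\pi(t+1) = \pi(t)P$, until it stops changing; the matrix values are arbitrary toy numbers:

```python
import numpy as np

# A small 3-state transition matrix; each row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
assert np.allclose(P.sum(axis=1), 1.0)

pi = np.array([1.0, 0.0, 0.0])   # pmf of X(0), as a row vector
for t in range(200):
    pi = pi @ P                  # pi(t+1) = pi(t) P
print(pi)                                   # approximately the equilibrium distribution
print(np.linalg.matrix_power(P, 200)[0])    # the rows of P^t approach the same pi
```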


Markov Chain Monte Carlo (MCMC) – Regardless of MRFs

$$\pi(t+2) = \pi(t+1)P = \pi(t)PP = \pi(t)P^2, \quad \text{and more generally} \quad \pi(t+n) = \pi(t)P^n$$

If $\pi P = \pi$, i.e., $\pi$ is a left eigenvector of $P$ with eigenvalue 1, then $\pi$ is called an equilibrium distribution of the MC.

If $\pi(t) \xrightarrow{t \to \infty} \pi$, then $\pi$ is an equilibrium distribution, called the stationary distribution of the MC.

If, in addition, $\lim_{t \to \infty} P^t$ exists, then $P^t \xrightarrow{t \to \infty} \begin{bmatrix}\pi \\ \pi \\ \vdots \\ \pi\end{bmatrix}$ (i.e., identical rows).

This is consistent with the fact that, regardless of what $\pi(0)$ is, $\pi(t) = \pi(0)P^t \xrightarrow{t \to \infty} \pi$.

Not every MC has an equilibrium distribution.

MCMC methods are based on finding an MC whose equilibrium distribution is the target distribution.


Definition

Let $P = \{P_{x,y}\}$ be a transition probability matrix. Then $R^{|S|}$ is called connected under $\{P_{x,y}\}$ if, $\forall x, y \in R^{|S|}$, there exists a sequence $(x(0), x(1), \ldots, x(t))$ such that

1. $x(0) = x$, $x(t) = y$;

2. $P_{x(n+1), x(n)} > 0$ for $n = 0, 1, \ldots, t-1$.

In MC terminology, this means there is exactly one communication class.


Theorem

If $(X(t))_{t \geq 0}$ is an MC on $R^{|S|}$ and if

1. $R^{|S|}$ is connected under $\{P_{x,y}\}$
2. $P_{x,x} > 0$ for some $x \in R^{|S|}$ (or, more generally, in terms of MC terminology, $P_{x,y}$ is aperiodic)

Then:

1. $\exists$ a unique $\tilde{p}$ on $R^{|S|}$ such that $\sum_{x} \tilde{p}(x) P_{x,y} = \tilde{p}(y)$ $\forall y \in R^{|S|}$ ($\tilde{p}$ = "equilibrium distribution")

2. $\Pr(X(t) = x \mid X(0) = x_0) \xrightarrow{t \to \infty} \tilde{p}(x)$, independently of $x_0$ (the initial state)

3. $\frac{1}{T}\sum_{t=1}^{T} H(X(t)) \xrightarrow{T \to \infty} \sum_{x} \tilde{p}(x) H(x) = E_{\tilde{p}} H(X)$, for $H : R^{|S|} \to \mathbb{R}$

But what we want is to sample from $p$, not $\tilde{p}$.
The trick: find $P_{x,y}$ such that $p = \tilde{p}$.


Fact (Detailed Balance)

If for some $\pi$, a probability on $R^{|S|}$,
$$\pi(x) P_{x,y} = \pi(y) P_{y,x} \quad \forall x, y \in R^{|S|},$$
then $\sum_{x} \pi(x) P_{x,y} = \pi(y)$, i.e., $\pi$ is an equilibrium distribution.

Proof.

$$\sum_{x} \pi(x) P_{x,y} = \sum_{x} \pi(y) P_{y,x} = \pi(y) \sum_{x} \Pr(X(t+1) = x \mid X(t) = y) = \pi(y)$$

Remark

Note that $\pi(x) P_{x,y} = \pi(y) P_{y,x}$ is just a short way of writing that
$$\Pr(X(t) = x, X(t+1) = y) = \Pr(X(t) = y, X(t+1) = x)$$
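
A numerical sanity check of the fact, under assumed toy values: the chain below is built with a Metropolis-style acceptance rule (not covered in these slides, but a standard way to obtain a kernel satisfying detailed balance for a given $\pi$), and the two assertions verify detailed balance and $\pi P = \pi$:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])                 # target distribution (toy values)
K = len(pi)
Q = (np.ones((K, K)) - np.eye(K)) / (K - 1)    # symmetric proposal

# Accept a proposed move x -> y with probability min(1, pi(y) / pi(x)).
P = np.zeros((K, K))
for x in range(K):
    for y in range(K):
        if x != y:
            P[x, y] = Q[x, y] * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()                 # otherwise, stay put

# Detailed balance: pi(x) P[x, y] == pi(y) P[y, x] for all x, y ...
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
# ... which implies pi is an equilibrium distribution: pi P == pi.
assert np.allclose(pi @ P, pi)
```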


Putting the theorem and the fact together: if $P_{x,y}$ satisfies the conditions of the theorem and $\pi = p$ satisfies detailed balance, then $p$ is the unique equilibrium probability of the chain, and conclusions (2) and (3) of the theorem hold with $\tilde{p} = p$.


Detailed balance is also called reversibility since it implies
$$\Pr(X(t) = y \mid X(t+1) = x) = \underbrace{\Pr(X(t+1) = y \mid X(t) = x)}_{P_{x,y} \text{ (by definition)}};$$
i.e., the backward dynamics are the same as the forward dynamics.

Proof.

$$\Pr(X(t) = y \mid X(t+1) = x) = \frac{\Pr(X(t) = y, X(t+1) = x)}{\underbrace{\Pr(X(t+1) = x)}_{\pi(x)}} = \frac{\Pr(X(t+1) = x \mid X(t) = y)\overbrace{\Pr(X(t) = y)}^{\pi(y)}}{\pi(x)} = \frac{P_{y,x}\,\pi(y)}{\pi(x)} \overset{\text{DB}}{=} P_{x,y}$$


Gibbs-Sampling Example

Suppose the visitation schedule is random, governed by $q_v$, a probability on $S$. Given $X(t) = x \in R^{|S|}$:

Choose a site $v$ by sampling $v \sim q_v$.

Change $x_v$ to a sample from $p(x_v \mid x_{\eta_v})$.

Checking the conditions of the theorem:

We can get from any state $x$ to any state $y$ by changing one site at a time, so the state space is connected.

$P_{x,x} > 0$ for some $x \in R^{|S|}$? Yes. In fact, it holds for every $x$.

Gibbs sampling also satisfies detailed balance (see next slide).


Gibbs Sampling satisfies Detailed Balance.

$P_{x,y} = \Pr(X(t+1) = y \mid X(t) = x) = \sum_{v \in S} \Pr(X(t+1) = y \mid X(t) = x, V = v)\, q_v$, where $\Pr(X(t+1) = y \mid X(t) = x, V = v)$ is $p(y_v \mid x_{S \setminus v})$ if $y_{S \setminus v} = x_{S \setminus v}$ and 0 otherwise. Thus:

$$P_{x,y} = \sum_{v \in S} \mathbf{1}_{y_{S \setminus v} = x_{S \setminus v}}\, q_v\, p(y_v \mid x_{S \setminus v})$$

$$\Rightarrow\; p(x) P_{x,y} = \sum_{v \in S} \mathbf{1}_{y_{S \setminus v} = x_{S \setminus v}}\, q_v\, p(y_v \mid x_{S \setminus v}) \underbrace{p(x_v \mid x_{S \setminus v})\, p(x_{S \setminus v})}_{p(x)}$$

Similarly:

$$p(y) P_{y,x} = \sum_{v \in S} \mathbf{1}_{x_{S \setminus v} = y_{S \setminus v}}\, q_v\, p(x_v \mid y_{S \setminus v}) \underbrace{p(y_v \mid y_{S \setminus v})\, p(y_{S \setminus v})}_{p(y)}$$

$$= \sum_{v \in S} \mathbf{1}_{x_{S \setminus v} = y_{S \setminus v}}\, q_v\, p(x_v \mid x_{S \setminus v})\, p(y_v \mid x_{S \setminus v})\, p(x_{S \setminus v})$$

$$= \sum_{v \in S} \mathbf{1}_{x_{S \setminus v} = y_{S \setminus v}}\, q_v \underbrace{p(y_v \mid x_{S \setminus v})\, p(x_v \mid x_{S \setminus v})}_{\text{for scalars: } ab = ba}\, p(x_{S \setminus v})$$

$$\Rightarrow\; p(y) P_{y,x} = p(x) P_{x,y},$$

i.e., detailed balance is achieved.


Remarks on Gibbs Sampling

If the visitation schedule is deterministic, we still have $\Pr(X(t) = x \mid X(0) = x(0)) \xrightarrow{t \to \infty} p(x)$, provided that every site is visited infinitely often.

It is important to observe that we don't need $Z$.

Recall: in MRFs, $p(x) = \frac{1}{Z}\exp\!\left(-\sum_{c \in C} E_c(x_c)\right)$ (similarly if working with $p(x|y)$). Write

$$p(x) = p_T(x)\big|_{T=1} \triangleq \frac{1}{Z_T}\exp\!\left(-\frac{1}{T}\sum_{c \in C} E_c(x_c)\right)\Bigg|_{T=1}$$

Fact (Simulated Annealing)

$p_T(x) \xrightarrow{T \downarrow 0}$ a probability concentrated on $\{x : x = \arg\max_{x'} p(x')\}$ ... as long as $T$ doesn't decay "too fast".

Similar results hold if working with $p(x|y)$ instead of $p(x)$.
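
As a sketch of how the temperature enters in practice: run the same single-site updates with the site-wise log-probabilities divided by $T$ while $T$ is slowly lowered. The geometric cooling schedule below is a common practical shortcut chosen for brevity; the classical guarantee requires a much slower decay. All numeric values here are illustrative assumptions.

```python
import numpy as np

def annealed_sweep(x, y, beta, sigma2, T, rng):
    """One Gibbs sweep at temperature T for the Ising posterior with Gaussian noise;
    site-wise probabilities are proportional to exp((energy terms) / T)."""
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            nbr_sum = 0
            if i > 0:
                nbr_sum += x[i - 1, j]
            if i < n - 1:
                nbr_sum += x[i + 1, j]
            if j > 0:
                nbr_sum += x[i, j - 1]
            if j < m - 1:
                nbr_sum += x[i, j + 1]
            lp_plus = (beta * nbr_sum - (y[i, j] - 1.0) ** 2 / (2.0 * sigma2)) / T
            lp_minus = (-beta * nbr_sum - (y[i, j] + 1.0) ** 2 / (2.0 * sigma2)) / T
            p_plus = 1.0 / (1.0 + np.exp(lp_minus - lp_plus))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

def anneal(y, beta=1 / 1.8, sigma2=4.0, n_sweeps=300, rng=None):
    """Approximate argmax_x p(x|y) by slowly lowering T."""
    rng = rng or np.random.default_rng(0)
    x = np.where(y > 0, 1, -1)               # a reasonable initialization
    T = 4.0
    for _ in range(n_sweeps):
        annealed_sweep(x, y, beta, sigma2, T, rng)
        T = max(0.98 * T, 0.05)              # cool down, but keep T bounded away from 0
    return x
```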


Generalizations

Modeling: Beyond the Ising Model

$x_s \in \{0, 1, \ldots, m\}$. The Potts model is

$$p(x) \propto \exp\!\left(-\gamma \sum_{s \sim t} \mathbf{1}_{x_s \neq x_t}\right), \quad \gamma > 0$$

A modification in case the ordering of the labels matters:

$$p(x) \propto \exp\!\left(-\gamma \sum_{s \sim t} (x_s - x_t)^2\right), \quad \gamma > 0$$

(can also replace the $\ell_2$ loss with a robust error function)

Gaussian MRF: $x \sim \mathcal{N}(\mu, Q^{-1})$ where $Q$ is SPD and sparse.

More complicated models and/or higher-order cliques (see examples in the presentation "MRFs, part 4")
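
The single-site Gibbs machinery from the earlier slides extends directly to the Potts prior; a sketch of the conditional update for one site (labels $0, \ldots, m$; a 4-connected lattice is assumed here):

```python
import numpy as np

def sample_potts_site(x, i, j, gamma, m, rng):
    """Sample x[i, j] from p(x_s | x_{eta_s}) propto exp(-gamma * #{neighbors with a different label})."""
    n_rows, n_cols = x.shape
    neighbors = []
    if i > 0:
        neighbors.append(x[i - 1, j])
    if i < n_rows - 1:
        neighbors.append(x[i + 1, j])
    if j > 0:
        neighbors.append(x[i, j - 1])
    if j < n_cols - 1:
        neighbors.append(x[i, j + 1])
    labels = np.arange(m + 1)
    # Unnormalized log-probabilities: -gamma times the number of disagreeing neighbors.
    log_p = np.array([-gamma * sum(k != t for t in neighbors) for k in labels], dtype=float)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    x[i, j] = rng.choice(labels, p=p)
    return x[i, j]
```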


Inference: Beyond Dynamic Programming and Gibbs Sampling

There are many other approaches for searching for the argmax. For example:

Graph-cut methods

Belief propagation and loopy belief propagation

Mean-field approximations

. . .


Applications: Beyond Image Restoration

Inpainting (y has missing pixels)

Image segmentation

Spatial coherence in various problems such as optical-flow estimation, depth estimation, stereo, etc.

Pose estimation

Texture synthesis

Interactions between superpixels (irregular grid)

HMM-like models are useful in CV applications such as tracking

Many other applications


Learning MRFs

MRFs often have parameters (e.g., $\beta$ in the Ising model, or $Q$ in a GMRF) that can/should be learned; we won't get into this too much, but we will touch upon some examples in the next presentation.

We can also learn the structure of the graph (i.e., the existence, or lack thereof, of edges in the graph); this is called structural inference.


If you want to learn more about MRFs and their CV Applications

Books (see course website for details)

Prince

Szeliski

Markov Random Fields for Vision and Image Processing; editors: Blake and Rother (nice mix of tutorials and applications)

Winkler's Image Analysis, Random Fields and Markov Chain Monte Carlo Methods (mathematically advanced, focused on MCMC)


Version Log

25/3/2019, ver 1.02. S28: Fixed $R^S$ to $R^{|S|}$. Split S28 into two slides. S35: Clarified when we move back from the general case to MRFs; moved the caveat into the Fact.

18/3/2019, ver 1.01. S35 (now S36): γ → −γ, stated γ > 0

11/03/2019, ver 1.00.
