On the Bethe approximation
Adrian Weller
Department of Statistics at Oxford University
September 12, 2014
Joint work with Tony Jebara
1 / 46
Outline
1 Background on inference in graphical models
  - Examples
  - What are the problems and why are they interesting?
  - Belief propagation (BP) as sum-product message passing
  - Variational perspective on inference
2 Bethe approximation
  - Link to BP
  - Other methods to minimize the Bethe free energy
  - New approach: discretize to obtain an ε-approx global optimum
    - How discretize s.t. ε-approx guaranteed?
    - How search efficiently over the discretized space?
  - If time: Understanding the two aspects of the Bethe approximation (entropy and polytope), and new work on clamping...

Questions (anytime!)
Background
Focus on undirected probabilistic graphical models, also called Markov random fields (MRFs): a compact, powerful way to model dependencies among variables.
Many applications, including:
Systems biology (protein folding)
Social network analysis (friends, politics, terrorism)
Combinatorial problems (counting independent sets)
Computer vision (image denoising, depth perception)
Error-correcting codes (turbo codes, 3G/4G phones, satellite communication)
Example: image denoising
Inference is combining prior beliefs with observed evidence to form a prediction.

[Figure: image denoising example −→ MAP inference]
Notation
Focus on MRFs which are discrete and finite:
- n variables V = {X_1, ..., X_n} and (log) potential functions ψ_c over subsets/factors c of V, c ∈ C ⊆ P(V), which give higher score to sub-configurations with higher compatibility
- Write x = (x_1, ..., x_n) ∈ X for one particular complete configuration, and x_c for a configuration of the variables in c
- ψ_c maps each setting x_c → ψ_c(x_c) ∈ R [lookup table]

p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) ) = e^{−E(x)} / Z,   E = −∑_{c∈C} ψ_c(x_c),

where the partition function Z = ∑_{x∈X} exp( ∑_{c∈C} ψ_c(x_c) ) is the normalizing constant that ensures probabilities sum to 1; E is the energy (negative score), cf. physics.
Inference: 3 key problems
p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) )

- MAP inference: identify a configuration of variables with maximum probability, x* ∈ argmax_{x∈X} ∑_{c∈C} ψ_c(x_c)
- Marginal inference: compute the probability distribution of a subset of variables x_c:
  p(x_c) = ∑_{x∈X : X_c=x_c} p(x) = ∑_{x∈X : X_c=x_c} exp( ∑_{c∈C} ψ_c(x_c) ) / ∑_{x∈X} exp( ∑_{c∈C} ψ_c(x_c) )
- Evaluate the partition function, Z = ∑_{x∈X} exp( ∑_{c∈C} ψ_c(x_c) )

Great interest in finding classes of problems and approaches such that exact or approximate inference is tractable.
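All three problems can be answered exactly by brute-force enumeration when n is tiny (time is exponential in n). A minimal sketch on a hypothetical 3-variable binary model (all names and tables are illustrative, not from the talk):

```python
import itertools
import math

# Hypothetical tiny MRF: 3 binary variables, two factors given as lookup
# tables mapping sub-configurations x_c to log-potentials psi_c(x_c).
psi = {
    (0,): {(0,): 0.0, (1,): 1.0},          # singleton factor on X0
    (1, 2): {(0, 0): 0.5, (0, 1): -0.5,    # pairwise factor on (X1, X2)
             (1, 0): -0.5, (1, 1): 0.5},
}

def score(x):
    """sum_c psi_c(x_c) for a complete configuration x."""
    return sum(table[tuple(x[i] for i in c)] for c, table in psi.items())

configs = list(itertools.product((0, 1), repeat=3))

# 1) Partition function: Z = sum_x exp(score(x))
Z = sum(math.exp(score(x)) for x in configs)

# 2) MAP inference: a configuration maximizing the score (= probability)
x_map = max(configs, key=score)

# 3) Marginal inference: p(X0 = 1), summing over consistent configurations
p_x0 = sum(math.exp(score(x)) for x in configs if x[0] == 1) / Z

assert x_map[0] == 1                 # the singleton factor favors X0 = 1
# X0 is independent of (X1, X2) here, so p(X0 = 1) = e / (1 + e)
assert abs(p_x0 - math.e / (1 + math.e)) < 1e-12
```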
Remark: conditioning on observed variables
p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) )

Suppose V is split into observed variables Y = y and unobserved variables X_U, so x = (x_u, y), x_u ∈ X_U.

p(x_u | y) = p(x_u, y) / p(y) = p(x_u, y) / ∑_{x'_u ∈ X_U} p(x'_u, y)

- This is just a new, smaller MRF with modified potentials on the variable set X_U
- New partition function to normalize the new distribution
- Hence the MRF framework is rich enough to handle conditioning
- When we discuss MRFs, they might or might not have been based on conditioning on variables
Belief propagation (BP) for inference
Marginal inference via sum-product message passing.

Send messages from variable v ∈ V to factor c ∈ C:

m_{v→c}(x_v) = ∏_{c*∈C(v)\{c}} m_{c*→v}(x_v)

Send messages from factor c to variable v:

m_{c→v}(x_v) = ∑_{x'_c : x'_v = x_v} φ_c(x'_c) ∏_{v*∈V(c)\{v}} m_{v*→c}(x'_{v*}),   where φ_c(x'_c) = exp( ψ_c(x'_c) )

- For MAP inference, use max-product: switch ∑_{x'_c} → max_{x'_c}
- For acyclic models, converges to exact marginals efficiently (2 passes: collect leaves to root, then distribute)
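The two message updates can be checked on a tiny tree, where leaf-to-root messages already give the exact marginal. A sketch on a hypothetical chain X0 - X1 - X2 (potentials and names are illustrative):

```python
import itertools
import math

# Hypothetical chain MRF X0 - X1 - X2: two pairwise log-potentials and one
# singleton log-potential on X1; phi_c = exp(psi_c).
psi01 = {(a, b): 0.8 if a == b else -0.8 for a in (0, 1) for b in (0, 1)}
psi12 = {(a, b): -0.3 if a == b else 0.3 for a in (0, 1) for b in (0, 1)}
psi1 = {0: 0.0, 1: 0.5}
phi01 = {k: math.exp(v) for k, v in psi01.items()}
phi12 = {k: math.exp(v) for k, v in psi12.items()}

# Leaf variables X0, X2 send all-ones messages to their factors, so the
# factor-to-variable messages into X1 are:
m01_to_1 = {x1: sum(phi01[(x0, x1)] for x0 in (0, 1)) for x1 in (0, 1)}
m12_to_1 = {x1: sum(phi12[(x1, x2)] for x2 in (0, 1)) for x1 in (0, 1)}
m1_to_1 = {x1: math.exp(psi1[x1]) for x1 in (0, 1)}  # singleton factor

# Belief at X1: normalized product of all incoming factor messages
b = {x1: m01_to_1[x1] * m12_to_1[x1] * m1_to_1[x1] for x1 in (0, 1)}
Zb = sum(b.values())
belief = {x1: v / Zb for x1, v in b.items()}

# Brute-force marginal for comparison: exact agreement on this tree
w = {x: math.exp(psi01[(x[0], x[1])] + psi12[(x[1], x[2])] + psi1[x[1]])
     for x in itertools.product((0, 1), repeat=3)}
Ztot = sum(w.values())
exact = {v: sum(s for x, s in w.items() if x[1] == v) / Ztot for v in (0, 1)}

assert all(abs(belief[v] - exact[v]) < 1e-12 for v in (0, 1))
```

Because the pairwise potentials here are symmetric in x1, the belief reduces to the singleton bias alone, p(X1 = 1) = e^0.5 / (1 + e^0.5).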
What about cyclic (loopy) models?
Can triangulate and run junction tree
Exact solution but takes time exponential in treewidth
Or... just run loopy belief propagation (LBP) and hope
Often produces strikingly good results, but may not converge at all
Extensive literature on trying to understand LBP
Inference: a variational perspective
Recall p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) ) = e^{−E(x)} / Z

KL-divergence between some distribution q(x) and p(x) is given by
D(q||p) = ∑_x q(x) log( q(x)/p(x) ) ≥ 0, with equality iff q = p.

Have

0 ≤ D(q||p) = ∑_x q(x) log q(x) − ∑_x q(x) log p(x)
            = −S(q) − ∑_x q(x) [−E(x) − log Z]
            = E_q(E(x)) − S(q) + log Z,

where S(q) is the standard Shannon entropy of q.
Inference: a variational perspective
0 ≤ D(q||p) = E_q(E(x)) − S(q) + log Z, equality iff q = p

Hence E_q(E(x)) − S(q) ≥ −log Z.

- This function of the distribution q is called the (Gibbs) free energy F_G(q) = E_q(E(x)) − S(q)
- Minimizing it over all valid distributions q yields −log Z
- And the argmin is attained exactly when q = p, the true distribution
- Hence we can think of inference as optimization
- But still intractable in general...
END OF PART I
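The variational identity above can be verified numerically on a toy distribution: F_G(q) ≥ −log Z for every q, with equality at q = p. A sketch (the 4-state energies are hypothetical):

```python
import math
import random

# Hypothetical toy distribution over 4 states, defined via energies E(x)
E = [0.3, -1.2, 0.7, 0.1]
Z = sum(math.exp(-e) for e in E)
p = [math.exp(-e) / Z for e in E]

def gibbs_free_energy(q):
    """F_G(q) = E_q(E) - S(q), with the convention 0 log 0 = 0."""
    avg_E = sum(qi * e for qi, e in zip(q, E))
    S = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return avg_E - S

# At q = p the free energy attains its minimum, -log Z
assert abs(gibbs_free_energy(p) + math.log(Z)) < 1e-12

# Any other distribution gives a value at least as large
random.seed(0)
for _ in range(100):
    r = [random.random() for _ in E]
    q = [ri / sum(r) for ri in r]
    assert gibbs_free_energy(q) >= -math.log(Z) - 1e-12
```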
Part II: Bethe approximation
- Seek to approximate the partition function Z
- Also interested in approximate marginal inference (medical diagnosis, power networks)

The Bethe approximation: what and why?
- Introduced by Hans Bethe in the 1930s to study phase transitions in statistical physics. Wikipedia: "Bethe left Germany in 1933, moving to England after receiving an offer as lecturer... He moved in with his friend Rudolf Peierls... This meant that Bethe had someone to speak to in German, and did not have to eat English food."
- Found fresh application in machine learning
- Direct connections to variational inference and belief propagation [YFW01]
Recall variational approach
−log Z = min_{q∈M} F_G(q) = min_{q∈M} E_q(E) − S(q)

- M is the marginal polytope, which comprises all globally valid probability distributions over all the variables, i.e. the convex hull of all 2^n configurations (for binary variables)
- F_G is the Gibbs free energy, with optimum at the true distribution

The Bethe approximation has 2 aspects, both pairwise approximations:
1 Relax the marginal polytope M to the local polytope L, which enforces only pairwise consistency, hence pseudo-marginals
2 Use the Bethe entropy S_B = ∑_{i∈V} S_i + ∑_{(i,j)∈E} (S_ij − S_i − S_j)

Obtain the Bethe partition function Z_B at the global optimum:

−log Z_B = min_{q∈L} F(q) = min_{q∈L} E_q(E) − S_B(q)
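The objective being minimized can be written out directly for binary pairwise models (anticipating the minimal (θ, W) parameterization introduced on a later slide). A sketch with hypothetical helper names:

```python
import math

def H(dist):
    """Shannon entropy of a distribution given as a list; 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def bethe_free_energy(q, xi, theta, W, edges):
    """F(q) = E_q(E) - S_B(q) for a binary pairwise model in the minimal
    parameterization E(x) = -sum_i theta_i x_i - sum_(i,j) W_ij x_i x_j.
    q[i] is the pseudo-marginal p(X_i = 1); xi[(i, j)] is the pairwise
    pseudo-marginal p(X_i = 1, X_j = 1)."""
    energy = -sum(theta[i] * q[i] for i in range(len(q)))
    energy -= sum(W[e] * xi[e] for e in edges)
    Si = [H([qi, 1 - qi]) for qi in q]
    # S_B = sum_i S_i + sum_(i,j) (S_ij - S_i - S_j)
    SB = sum(Si)
    for (i, j) in edges:
        mu = [1 + xi[(i, j)] - q[i] - q[j],   # p(0, 0)
              q[j] - xi[(i, j)],              # p(0, 1)
              q[i] - xi[(i, j)],              # p(1, 0)
              xi[(i, j)]]                     # p(1, 1)
        SB += H(mu) - Si[i] - Si[j]
    return energy - SB

# Sanity check on a single variable with theta = 0 and no edges: F is
# minimized at q = 1/2, where -F recovers the exact log Z = log 2.
f_half = bethe_free_energy([0.5], {}, [0.0], {}, [])
assert abs(-f_half - math.log(2)) < 1e-12
assert bethe_free_energy([0.3], {}, [0.0], {}, []) > f_half
```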
Connection to LBP
Obtain the Bethe partition function Z_B at the global optimum:

−log Z_B = min_{q∈L} F = min_{q∈L} E_q(E) − S_B(q)

[Figure: marginal polytope (global consistency) contained in the local polytope (local consistency)]

- F is called the Bethe free energy (it approximates the true free energy)
- In a seminal paper, [YFW01] showed that fixed points of LBP correspond to stationary points of the Bethe free energy F
- Refined by [Hes02]: stable fixed points correspond to local minima of F (the converse is not true in general)
Other methods to minimize Bethe free energy F
- LBP may be viewed as an algorithm that tries to minimize F
  - But it may not converge, or may converge only to a local minimum
- This spurred much effort to find convergent algorithms, such as:
  - Gradient methods [WT01]
  - Double loop methods, e.g. CCCP [Yui02] or [HAK03]
  - But still only to a local optimum, with no time guarantee
- For binary pairwise models:
  - Recent algorithm guaranteed to converge in polynomial time to an approximately stationary point of F [Shi12], with restrictions on topology
  - Our algorithm is guaranteed to return an ε-approximation to the global optimum [WJ14]
  - To our knowledge, no previously known methods were guaranteed to return or approximate the global optimum
Binary pairwise MRFs
Main focus now on MRFs which are binary, i.e. all X_i ∈ {0,1}, and pairwise, i.e. all potentials are over ≤ 2 variables:
- n variables V = {X_1, ..., X_n}, singleton potentials ψ_i(x_i)
- x = (x_1, ..., x_n) ∈ {0,1}^n is one particular configuration
- m edges (i,j) ∈ E ⊆ V × V, pairwise potentials ψ_ij(x_i, x_j)

p(x) = (1/Z) exp( ∑_{i∈V} ψ_i(x_i) + ∑_{(i,j)∈E} ψ_ij(x_i, x_j) )

Can always reparameterize to a minimal representation {θ_i : i ∈ V}, {W_ij : (i,j) ∈ E} s.t. same distribution:

p(x) = (1/Z') exp( ∑_{i∈V} θ_i x_i + ∑_{(i,j)∈E} W_ij x_i x_j )
Binary pairwise MRFs: simple example
p(x) = (1/Z) exp( ∑_{i∈V} ψ_i(x_i) + ∑_{(i,j)∈E} ψ_ij(x_i, x_j) )

Can always reparameterize to a minimal representation {θ_i : i ∈ V}, {W_ij : (i,j) ∈ E} s.t. same distribution:

p(x) = (1/Z') exp( ∑_{i∈V} θ_i x_i + ∑_{(i,j)∈E} W_ij x_i x_j )

Example: two variables X_1, X_2 joined by one edge, with local θ_1 = 4, local θ_2 = −5, edge W_12 = 3.
(W_ij > 0: attractive; W_ij < 0: repulsive)

Original potentials ψ_1(x_1), ψ_12(x_1, x_2), ψ_2(x_2):

x_1 | ψ_1(x_1)      x_1\x_2 |  0 |  1      x_2      |  0 |  1
 0  |    2             0    |  1 | −3      ψ_2(x_2) | −1 | −2
 1  |    4             1    |  3 |  2
Bethe pseudo-marginals in the local polytope
−log Z_B = min_{q∈L} F = min_{q∈L} E_q(E) − S_B(q)

Must identify the q ∈ L that minimizes F.

q is defined by singleton pseudo-marginals q_i = p(X_i = 1) ∀i ∈ V and pairwise pseudo-marginals μ_ij ∀(i,j) ∈ E. The local polytope constraints imply

μ_ij = [ p(X_i=0, X_j=0)   p(X_i=0, X_j=1)
         p(X_i=1, X_j=0)   p(X_i=1, X_j=1) ]
     = [ 1 + ξ_ij − q_i − q_j   q_j − ξ_ij
         q_i − ξ_ij             ξ_ij ]

with the constraint that all terms ≥ 0 ⇒ ξ_ij ∈ [max(0, q_i + q_j − 1), min(q_i, q_j)].

[WT01] showed:
- When minimizing F, one can solve explicitly for ξ_ij(q_i, q_j, W_ij)
- Here W_ij is the associativity of the edge (as earlier)
- Hence it is sufficient to search over (q_1, ..., q_n) ∈ [0,1]^n, but how?
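[WT01] derive the optimal ξ_ij in closed form; as an illustration (not their formula), the same optimum can be recovered numerically for fixed q_i, q_j, W_ij, since only the −W_ij ξ_ij − S_ij part of F depends on ξ_ij and it is strictly convex in ξ_ij:

```python
import math

def edge_term(xi, qi, qj, W):
    """xi-dependent part of F for one edge: -W*xi - S(mu_ij), where mu_ij
    is the 2x2 edge pseudo-marginal determined by (qi, qj, xi)."""
    mu = [1 + xi - qi - qj, qj - xi, qi - xi, xi]
    S = -sum(m * math.log(m) for m in mu if m > 0)
    return -W * xi - S

def optimal_xi(qi, qj, W, tol=1e-9):
    """Golden-section search for the minimizing xi on the feasible
    interval [max(0, qi + qj - 1), min(qi, qj)]; edge_term is strictly
    convex in xi (a linear term plus a negative entropy of an affine map)."""
    lo, hi = max(0.0, qi + qj - 1), min(qi, qj)
    g = (math.sqrt(5) - 1) / 2
    a, b = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if edge_term(a, qi, qj, W) < edge_term(b, qi, qj, W):
            hi, b = b, a
            a = hi - g * (hi - lo)
        else:
            lo, a = a, b
            b = lo + g * (hi - lo)
    return 0.5 * (lo + hi)

# With W = 0 the entropy is maximized at independence: xi = qi * qj
assert abs(optimal_xi(0.4, 0.7, 0.0) - 0.28) < 1e-6
# An attractive edge pulls xi above qi*qj; a repulsive edge pushes it below
assert optimal_xi(0.4, 0.7, 2.0) > 0.28
assert optimal_xi(0.4, 0.7, -2.0) < 0.28
```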
Our approach: a mesh over Bethe pseudo-marginals
We discretize the space (q_1, ..., q_n) ∈ [0,1]^n with a provably sufficient mesh M(ε), fine enough s.t. the optimum discretized point q* has F(q*) ≤ min_{q∈L} F(q) + ε.

[Figure: 3D scatter plot of mesh points in the unit cube, axes q_1, q_2, q_3]
Key ideas to approximate log ZB to within ε
Discretize to construct a provably sufficient mesh M(ε):
- How to guarantee F(q*) ≤ min_{q∈L} F(q) + ε?
- How to search the large discrete mesh efficiently?

Developed two approaches:
- curvMesh bounds curvature [WJ13]
- gradMesh bounds gradients; typically much better (orders of magnitude) [WJ14]

If the original model is attractive, i.e. W_ij > 0 ∀(i,j) ∈ E (submodular cost functions), then we show the discretized multi-label problem is submodular [WJ13, KKL12]. Hence it can be solved via graph cuts [SF06] in O(N³), where N = ∑_{i∈V} N_i with N_i points in dimension i [cf. ∏_{i∈V} N_i configurations overall].

Obtain an FPTAS with gradMesh: N = O( nmW/ε ).
To compare, for curvMesh: N = O( ε^{−1/2} n^{7/4} Δ^{3/4} exp[ (1/2)( W(1 + Δ/2) + T ) ] ).
Bounding the locations of stationary points
For general edge types (associative or repulsive), let
W_i = ∑_{j∈N(i): W_ij>0} W_ij,   V_i = −∑_{j∈N(i): W_ij<0} W_ij.

Theorem (WJ13)
At any stationary point of the Bethe free energy, σ(θ_i − V_i) ≤ q_i ≤ σ(θ_i + W_i).

- Developed an algorithm (Bethe bound propagation, BBP) that iteratively improves these bounds
- [MK07] already had a similar algorithm that finds ranges of possible beliefs in LBP; a bit slower but typically better
- Use this to preprocess the model to yield a smaller orthotope:
  - reduces the search space directly
  - for curvMesh, lowers the max curvature, hence allows a coarser mesh
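The theorem's initial bounds are cheap to compute as a preprocessing step. A sketch of just that first step (full BBP then iterates to tighten them; all names hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def initial_bethe_bounds(theta, W):
    """Initial bounds from Theorem (WJ13): at any stationary point,
    sigmoid(theta_i - V_i) <= q_i <= sigmoid(theta_i + W_i), where W_i
    sums the attractive and V_i the (negated) repulsive edge weights
    incident to i. Only the first step, before any BBP iteration."""
    lower, upper = [], []
    for i in range(len(theta)):
        Wi = sum(w for e, w in W.items() if i in e and w > 0)
        Vi = -sum(w for e, w in W.items() if i in e and w < 0)
        lower.append(sigmoid(theta[i] - Vi))
        upper.append(sigmoid(theta[i] + Wi))
    return lower, upper

# Hypothetical 3-node chain: attractive edge (0,1), repulsive edge (1,2)
theta = [1.0, 0.0, -1.0]
W = {(0, 1): 2.0, (1, 2): -1.5}
lo, hi = initial_bethe_bounds(theta, W)
assert all(l <= h for l, h in zip(lo, hi))
# Node 0 has no repulsive neighbors, so its lower bound is sigmoid(theta_0)
assert abs(lo[0] - sigmoid(1.0)) < 1e-12
# With no edges at all, the bounds pin q_i = sigmoid(theta_i) exactly
l1, h1 = initial_bethe_bounds([0.5], {})
assert abs(l1[0] - h1[0]) < 1e-12
```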
Bethe free energy landscape (stylized)
[Figure: stylized Bethe free energy landscape; the red dot shows the global optimum, and we might return the green dot]
Curvature: all terms of the Hessian H_ij = ∂²F / ∂q_i ∂q_j

H_ii = −(d_i − 1)/(q_i(1 − q_i)) + ∑_{j∈N(i)} q_j(1 − q_j)/T_ij ≥ 1/(q_i(1 − q_i)),

H_ij = (q_i q_j − ξ_ij)/T_ij if (i,j) ∈ E;   H_ij = 0 if (i,j) ∉ E, i ≠ j,

where d_i is the degree of X_i in the model, and

T_ij = q_i q_j(1 − q_i)(1 − q_j) − (ξ_ij − q_i q_j)² ≥ 0, with equality iff q_i or q_j ∈ {0,1}.

- Leads to a bound on the max second derivative in any direction (curvMesh)
- The q_i q_j − ξ_ij term is negative for an attractive edge, hence we obtain the submodularity result
gradMesh: analyze first derivatives of F
∂F/∂q_i = −θ_i + log [ ( (1 − q_i)^{d_i − 1} ∏_{j∈N(i)} (q_i − ξ_ij) ) / ( q_i^{d_i − 1} ∏_{j∈N(i)} (1 + ξ_ij − q_i − q_j) ) ]   [WT01]

Theorem (WJ14)
−θ_i + log( q_i / (1 − q_i) ) − W_i ≤ ∂F/∂q_i ≤ −θ_i + log( q_i / (1 − q_i) ) + V_i

- Upper and lower bounds are separated by a constant, and both are monotonically increasing in q_i
- Within our search space, this allows us to bound |∂F/∂q_i| ≤ D_i := V_i + W_i = ∑_{j∈N(i)} |W_ij|
gradMesh: search over purple region
[Figure: upper and lower bounds f_i^U, f_i^L on the partial derivative ∂F/∂q_i, plotted against the pseudo-marginal q_i. The shaded area, between the q_i where f_i^U = 0 and the q_i where f_i^L = 0, shows where the partial derivative can be 0: the Bethe box [A_i, 1 − B_i]. Also marked: D_i = V_i + W_i − log L_i − log U_i. Parameters used in this example: θ_i = 1, V_i = 2, W_i = 3, L_i = 1.8, U_i = 2.9.]
gradMesh: complexity
In the search space, |∂F/∂q_i| ≤ D_i := V_i + W_i = ∑_{j∈N(i)} |W_ij|

- We can apportion the ε error among the n variables
- Simple method: each gets ε/n
- Need gradient_i · step_i ≈ ε/n
- Hence the number of mesh points in dimension i is
  N_i ≈ 1/step_i ≈ (n/ε) · gradient_i = O( (n/ε) ∑_{j∈N(i)} |W_ij| )
- Hence N = ∑_i N_i = O( (n/ε) m W )
- Various tricks in the paper show how to improve performance
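The sizing itself is simple arithmetic once the gradient bounds D_i are known; a sketch (function name hypothetical):

```python
import math

def gradmesh_sizes(W, n, eps):
    """Number of mesh points per dimension for a gradMesh-style scheme:
    each dimension i gets error budget eps/n, so N_i ~ (n/eps) * D_i,
    where D_i = sum over neighbors j of |W_ij|."""
    D = [0.0] * n
    for (i, j), w in W.items():
        D[i] += abs(w)
        D[j] += abs(w)
    N_i = [math.ceil((n / eps) * d) for d in D]
    return N_i, sum(N_i)

# Hypothetical 4-cycle with unit-magnitude edge weights, eps = 0.5
W = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): -1.0}
N_i, N = gradmesh_sizes(W, n=4, eps=0.5)
assert N_i == [16, 16, 16, 16]   # D_i = 2, n/eps = 8
assert N == 64
```

Note that N is the sum over dimensions (the quantity entering the O(N³) graph-cut bound for attractive models), not the product ∏ N_i that naive exhaustive search over the mesh would visit.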
Comparison of methods: left ε = 1, right ε = 0.1; (when fixed, W = 5, n = 10)
[Figure: four log-scale panels plotting mesh size N (from 10^0 to 10^20) for curvMeshOrig, curvMeshNew, and gradMesh; the left pair uses ε = 1 and the right pair ε = 0.1, varying n (5 to 20) and W (0 to 10)]
Example where LBP fails to converge, gradMesh works well
- Power network of transformers
- X_i ∈ {stable, fail}
- Attractive edges between transformers
- Would like to rank by marginal probability of failure p(X_i)

[Figure: power network graph of 55 transformer nodes]
Recap
The Bethe approximation is often strikingly accurate. New results:
- Novel formulation of the Hessian of the Bethe free energy F
- Bounds on derivatives and locations of optima
- First method guaranteed to return an ε-approx global optimum log Z_B, allowing its accuracy to be tested rigorously
- Provides a benchmark against which to judge other heuristics (LBP, HAK etc.)
- Useful in practice for small problems
- FPTAS for attractive models; this was an open theoretical question
- Further improvements in new work...
Understanding the Bethe approximation
Joint work with Kui Tang and David Sontag.

Goal: separate and evaluate the two aspects of the Bethe approximation:
1 Relax the marginal polytope M to the local polytope L, which enforces only pairwise consistency, hence pseudo-marginals
2 Use the Bethe entropy S_B = ∑_{i∈V} S_i + ∑_{(i,j)∈E} (S_ij − S_i − S_j)

- Consider marginal, cycle and local polytopes
- Compare against the tree-reweighted approximation (TRW):
  - same polytopes
  - concave upper-bounding entropy
- Analytic and experimental results
Illustration of polytopes
[Figure: nested polytopes, from tightest to loosest: marginal polytope (global consistency), cycle polytope (cycle consistency), local polytope (local consistency)]
Questions addressed include
- Does tightening the relaxation of the marginal polytope always improve the Bethe approximation for log Z?
  No (though empirically it is usually very helpful for general models)
- In attractive models, when local potentials are low and couplings high, why does the Bethe approximation perform poorly for marginals?
  The Bethe entropy
- In general models, for low couplings, the Bethe approximation performs much better than TRW, yet as coupling increases, this advantage disappears. How does this vary if we tighten the relaxation of the marginal polytope?
  Mixed; see Experiments
Tightening the polytope relaxation - does it always help?
No.
- Consider a symmetric nonhomogeneous cycle on A, B, C; vary W_BC, with θ_A = θ_B = θ_C = 0
- W_AB = W_AC = 10, strongly attractive

[Figure: log Z against BC edge weight (−10 to 10), comparing true, Bethe, and Bethe+cycle]

Lemma: ∂ log Z_B / ∂W_BC = μ_BC(0,0) + μ_BC(1,1), and all singleton marginals are 1/2.

For a weakly attractive edge BC, the cycle polytope improves the pairwise marginal (similar slopes near 0) but worsens the partition function (gap between the curves near 0).
Threshold result for attractive models due to SB entropy
Lemma: For a symmetric homogeneous d-regular MRF,q = (12 , . . . ,
12) is a stationary point of F but not a minimum
for W > 2 log dd−2 (uses earlier Hessian result)
Recall∑
i di = 2m (handshake lemma), henceSB = mSij + (n − 2m)Si . For large W , all probability masspulled onto main diagonal, hence Sij ≈ Si . For m > n, toavoid negative SB , each entropy term → 0 by tending to
pairwise
(1 00 0
)or symmetrically
(0 00 1
).
[Figure: three panels plotting the Bethe free energy E − S_B against q for K_5, at W = 1, W = 1.38, and W = 1.75]
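The lemma can be checked numerically along the symmetric line q_i = q. For K_5, d = 4, so the threshold reads 2 log(4/2) = 2 log 2 ≈ 1.386, matching the W = 1.38 panel. A sketch under two stated assumptions: the threshold formula 2 log(d/(d−2)) is reconstructed from the garbled source, and "symmetric" is taken to mean θ_i = −dW/2 (the choice that makes the model invariant under flipping all bits); ξ is minimized by grid search:

```python
import math

def H2(p):
    """Binary entropy with 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in (p, 1 - p) if x > 0)

def edge_min(q, W, steps=1500):
    """min over xi in [max(0, 2q - 1), q] of -W*xi - S_ij, for a symmetric
    edge whose two singleton pseudo-marginals both equal q (grid search)."""
    lo, hi = max(0.0, 2 * q - 1), q
    best = float("inf")
    for k in range(steps + 1):
        xi = lo + (hi - lo) * k / steps
        mu = [1 + xi - 2 * q, q - xi, q - xi, xi]
        S = -sum(m * math.log(m) for m in mu if m > 0)
        best = min(best, -W * xi - S)
    return best

def F_symmetric(q, W, n=5, m=10):
    """Bethe free energy of the symmetric homogeneous K5 along the line
    q_i = q. With theta_i = -d*W/2 the energy becomes m*W*(q - xi), and
    S_B = m*S_ij + (n - 2m)*S_i."""
    return m * (W * q + edge_min(q, W)) + (2 * m - n) * H2(q)

d = 4                                    # K5 is 4-regular
threshold = 2 * math.log(d / (d - 2))    # ~1.386, the W = 1.38 panel
assert abs(threshold - 2 * math.log(2)) < 1e-12

qs = [0.05 * k for k in range(1, 20)]
# Below threshold (W = 1), q = 1/2 wins along the symmetric line...
f_half = F_symmetric(0.5, 1.0)
assert all(f_half <= F_symmetric(q, 1.0) + 1e-6 for q in qs)
# ...above threshold (W = 1.75), some other q does strictly better
f_half = F_symmetric(0.5, 1.75)
assert any(F_symmetric(q, 1.75) < f_half - 1e-6 for q in qs)
```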
Also a polytope effect for frustrated cycles
A frustrated cycle has an odd number of repulsive edges; this pulls singleton marginals the other way, toward 1/2.

- Seen a Bethe entropy effect for attractive cycles
- Also a polytope effect for frustrated cycles
- Recall the optimum energy on the local polytope for a symmetric frustrated cycle is at (1/2, ..., 1/2)
- C_5 topology, θ_i ∼ [0, T_max], all edges W

[Figure: two panels plotting the average singleton marginal against edge weight W (−10 to 10), comparing true, Bethe, and Bethe+cycle]
Experiments: general models, θ_i ∼ [−2, 2] (attractive and repulsive edges), K_10 topology

[Figure: four panels against maximum coupling strength y (2 to 32), comparing Bethe+local, Bethe+cycle, Bethe+marg, TRW+local, TRW+cycle, TRW+marg: log partition error; log partition error with local removed; singleton marginals, average ℓ1 error; pairwise marginals, average ℓ1 error]
Conclusions for general models
- Big gains from the cycle polytope (suggests Frank-Wolfe)
- Not much additional gain from the marginal polytope (computationally harder)
- Bethe performs remarkably well:
  - Better than TRW for log Z and pairwise marginals
  - Less clear on singleton marginals: TRW better for very strong coupling
- Still much to learn about why Bethe performs so well...
Summary
- The Bethe approximation is remarkably effective for approximate inference
- Novel results on the Hessian of the Bethe free energy
- First algorithm for ε-approx of the global optimum log Z_B; FPTAS for attractive models
- Contributions to understanding the Bethe approximation (polytope and entropy)
- Where feasible, tightening to the cycle polytope can be very helpful
- Additional results in new work (e.g. clamping)...

Thank you!
Attractive example: max score and value, with argmax
[Figure: top panel, optimal Score (C) and Value (−F) against q_i, i = 3/4; bottom panel, argmax singleton values against q_i]
References
F. Korc̆, V. Kolmogorov, and C. Lampert. Approximating marginals usingdiscrete energy minimization. Technical report, IST Austria, 2012.
J. Mooij and H. Kappen. Sufficient conditions for convergence of thesum-product algorithm. IEEE Transactions on Information Theory, 2007.
D. Schlesinger and B. Flach. Transforming an arbitrary minsum problem into a binary one. Technical report, Dresden University of Technology, 2006.
A. Weller and T. Jebara. Approximating the Bethe partition function. InUAI, 2014.
A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Betheapproximation: When and how can it go wrong? In UAI, 2014.
A. Weller and T. Jebara. Bethe bounds and approximating the globaloptimum. In AISTATS, 2013.
M. Welling and Y. Teh. Belief optimization for binary networks: A stablealternative to loopy belief propagation. In UAI, 2001.
J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In IJCAI, Distinguished Lecture Track, 2001.
Extra Slides with Supplementary Material
Supplementary Material (if time or questions)
Cycle polytope
- A relaxation of the marginal polytope
- Inherits all constraints of the local polytope, hence at least as tight
- In addition, enforces consistency around any cycle

Cycle inequalities [B93]: for all cycles C and every subset of edges F ⊆ C with |F| odd:

∑_{(i,j)∈F} ( μ_ij(0,0) + μ_ij(1,1) ) + ∑_{(i,j)∈C\F} ( μ_ij(1,0) + μ_ij(0,1) ) ≥ 1.

- Cycle polytope = marginal polytope for symmetric planar MRFs [B93]
- Cycle polytope = TRI for binary pairwise models [S10]
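A cycle inequality is mechanical to check for given edge pseudo-marginals. A sketch (data layout hypothetical), showing that the locally consistent "all edges perfectly disagree" point on a 3-cycle is cut off:

```python
def violates_cycle_inequality(mu, cycle_edges, F):
    """Check one [B93] cycle inequality: the sum over F of the agreeing
    mass mu(0,0) + mu(1,1), plus the sum over C\\F of the disagreeing
    mass mu(0,1) + mu(1,0), must be >= 1 for any odd-sized F within C."""
    total = 0.0
    for e in cycle_edges:
        m = mu[e]  # 2x2 table indexed m[a][b]
        if e in F:
            total += m[0][0] + m[1][1]   # edge "agrees"
        else:
            total += m[0][1] + m[1][0]   # edge "disagrees"
    return total < 1.0

agree = [[0.5, 0.0], [0.0, 0.5]]
disagree = [[0.0, 0.5], [0.5, 0.0]]
cycle = [(0, 1), (1, 2), (2, 0)]

# "All three edges perfectly disagree" is consistent with q_i = 1/2 on
# every edge (so it lies in the local polytope), yet an odd cycle cannot
# be 2-colored: the inequality with F = all 3 edges (|F| odd) cuts it off.
mu_bad = {e: disagree for e in cycle}
assert violates_cycle_inequality(mu_bad, cycle, F=set(cycle))

# "All edges perfectly agree" satisfies the inequality for odd F
mu_good = {e: agree for e in cycle}
assert not violates_cycle_inequality(mu_good, cycle, F={(0, 1)})
```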
Threshold for attractive models: ξ_ij(q_i, q_j, W_ij)

[Figure: for K_5, the Bethe free energy E − S_B against q at W = 1, W = 1.38, and W = 1.75; below, the Bethe entropy S_B and the energy E against q at W = 1 and at W = 4.5]
Experiments: Attractive models θi ∼ [−0.1, 0.1]
[Figure: three panels against maximum coupling strength y (0.4 to 16), comparing Bethe+local, Bethe+cycle, Bethe+marg, TRW+local, TRW+cycle, TRW+marg: log partition error; singleton marginals, average ℓ1 error; pairwise marginals, average ℓ1 error (small scale)]

For this distribution of models, the polytope appears to make no difference, though recall we showed theoretically that it can.
Clamping variables: Attractive binary pairwise models
- Z_B = optimal Bethe partition function for the original model
- Clamp variable X_i and form the new approximation Z_B^(i) = Z_B|_{X_i=0} + Z_B|_{X_i=1}

Theorem (WJ14 NIPS)
For an attractive binary pairwise model and any variable X_i, Z_B ≤ Z_B^(i).

Corollary
For an attractive binary pairwise model, Z_B ≤ Z.

⇒ clamping only improves the estimate of the partition function.
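For the exact partition function, clamping is lossless: Z = Z|_{X_i=0} + Z|_{X_i=1}. The theorem says the Bethe analogue can only increase the estimate in attractive models. The exact identity itself is easy to verify by enumeration (model hypothetical):

```python
import itertools
import math

def exact_Z(theta, W, n, clamp=None):
    """Exact partition function of a binary pairwise model in minimal
    parameterization; clamp=(i, value) optionally restricts X_i."""
    Z = 0.0
    for x in itertools.product((0, 1), repeat=n):
        if clamp is not None and x[clamp[0]] != clamp[1]:
            continue
        s = sum(theta[i] * x[i] for i in range(n))
        s += sum(w * x[i] * x[j] for (i, j), w in W.items())
        Z += math.exp(s)
    return Z

# Hypothetical attractive 3-variable chain model
theta = [0.5, -0.2, 0.1]
W = {(0, 1): 1.0, (1, 2): 2.0}
Z = exact_Z(theta, W, 3)

# Clamping identity for the exact Z: lossless for every variable
for i in range(3):
    Zi = exact_Z(theta, W, 3, clamp=(i, 0)) + exact_Z(theta, W, 3, clamp=(i, 1))
    assert abs(Z - Zi) < 1e-9
```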
Clamping variables: stronger result
For any i ∈ V and x ∈ [0,1], let log Z_Bi(x) = max_{q∈[0,1]^n : q_i = x} −F(q).

- Observe log Z_Bi(0) = log Z_B|_{X_i=0}, log Z_Bi(1) = log Z_B|_{X_i=1}, and log Z_B = max_{q_i∈[0,1]} log Z_Bi(q_i)
- Recall S_i(x) = −x log x − (1 − x) log(1 − x), the singleton entropy

Lemma: To prove the clamping result, it is sufficient that
log Z_Bi(q_i) ≤ q_i log Z_Bi(1) + (1 − q_i) log Z_Bi(0) + S_i(q_i).

Theorem (WJ14 NIPS)
For an attractive binary pairwise model, log Z_Bi(q_i) − S_i(q_i) is convex.

Uses earlier results on the Hessian.