On the Bethe approximation
Adrian Weller
Department of Statistics at Oxford University
September 12, 2014
Joint work with Tony Jebara
1 / 46
Outline
1 Background on inference in graphical models
  - Examples
  - What are the problems and why are they interesting?
  - Belief propagation (BP) as sum-product message passing
  - Variational perspective on inference
2 Bethe approximation
  - Link to BP
  - Other methods to minimize the Bethe free energy
  - New approach: discretize to obtain an ε-approx global optimum
    - How discretize s.t. ε-approx guaranteed?
    - How search efficiently over the discretized space?
  - If time: Understanding the two aspects of the Bethe approximation (entropy and polytope), and new work on clamping...

Questions (anytime!)
Background
Focus on undirected probabilistic graphical models, also called Markov random fields (MRFs): a compact, powerful way to model dependencies among variables.
Many applications, including:
Systems biology (protein folding)
Social network analysis (friends, politics, terrorism)
Combinatorial problems (counting independent sets)
Computer vision (image denoising, depth perception)
Error-correcting codes (turbo codes, 3G/4G phones, satellite communication)
Example: image denoising
Inference is combining prior beliefs with observed evidence to form a prediction.

[Figure: image denoising example −→ MAP inference]
Notation
Focus on MRFs which are discrete and finite:
- n variables V = {X_1, ..., X_n} and (log) potential functions ψ_c over subsets/factors c of V, c ∈ C ⊆ P(V), which give higher score to sub-configurations with higher compatibility
- Write x = (x_1, ..., x_n) ∈ X for one particular complete configuration, and x_c for a configuration of the variables in c
- ψ_c maps each setting x_c → ψ_c(x_c) ∈ R [lookup table]

p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) ) = e^{−E(x)} / Z,   E = −∑_{c∈C} ψ_c(x_c),

where the partition function Z = ∑_{x∈X} exp( ∑_{c∈C} ψ_c(x_c) ) is the normalizing constant that ensures probabilities sum to 1; E is the energy (negative score), cf. physics.
Inference: 3 key problems
p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) )

- MAP inference: identify a configuration of variables with maximum probability, x* ∈ argmax_{x∈X} ∑_{c∈C} ψ_c(x_c)
- Marginal inference: compute the probability distribution of a subset of variables x_c:
  p(x_c) = ∑_{x∈X : X_c=x_c} p(x) = ∑_{x∈X : X_c=x_c} exp( ∑_{c∈C} ψ_c(x_c) ) / ∑_{x∈X} exp( ∑_{c∈C} ψ_c(x_c) )
- Evaluate the partition function, Z = ∑_{x∈X} exp( ∑_{c∈C} ψ_c(x_c) )

Great interest in finding classes of problems and approaches such that exact or approximate inference is tractable.
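All three problems can be answered exactly by brute-force enumeration when n is tiny (time is exponential in n). A minimal sketch on a hypothetical 3-variable binary model (all names and tables are illustrative, not from the talk):

```python
import itertools
import math

# Hypothetical tiny MRF: 3 binary variables, two factors given as lookup
# tables mapping sub-configurations x_c to log-potentials psi_c(x_c).
psi = {
    (0,): {(0,): 0.0, (1,): 1.0},          # singleton factor on X0
    (1, 2): {(0, 0): 0.5, (0, 1): -0.5,    # pairwise factor on (X1, X2)
             (1, 0): -0.5, (1, 1): 0.5},
}

def score(x):
    """sum_c psi_c(x_c) for a complete configuration x."""
    return sum(table[tuple(x[i] for i in c)] for c, table in psi.items())

configs = list(itertools.product((0, 1), repeat=3))

# 1) Partition function: Z = sum_x exp(score(x))
Z = sum(math.exp(score(x)) for x in configs)

# 2) MAP inference: a configuration maximizing the score (= probability)
x_map = max(configs, key=score)

# 3) Marginal inference: p(X0 = 1), summing over consistent configurations
p_x0 = sum(math.exp(score(x)) for x in configs if x[0] == 1) / Z

assert x_map[0] == 1                 # the singleton factor favors X0 = 1
# X0 is independent of (X1, X2) here, so p(X0 = 1) = e / (1 + e)
assert abs(p_x0 - math.e / (1 + math.e)) < 1e-12
```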
Remark: conditioning on observed variables
p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) )

Suppose V is split into observed variables Y = y and unobserved variables X_U, so x = (x_u, y), x_u ∈ X_U.

p(x_u | y) = p(x_u, y) / p(y) = p(x_u, y) / ∑_{x'_u ∈ X_U} p(x'_u, y)

- This is just a new, smaller MRF with modified potentials on the variable set X_U
- New partition function to normalize the new distribution
- Hence the MRF framework is rich enough to handle conditioning
- When we discuss MRFs, they might or might not have been based on conditioning on variables
Belief propagation (BP) for inference
Marginal inference via sum-product message passing.

Send messages from variable v ∈ V to factor c ∈ C:

m_{v→c}(x_v) = ∏_{c*∈C(v)\{c}} m_{c*→v}(x_v)

Send messages from factor c to variable v:

m_{c→v}(x_v) = ∑_{x'_c : x'_v = x_v} φ_c(x'_c) ∏_{v*∈V(c)\{v}} m_{v*→c}(x'_{v*}),   where φ_c(x'_c) = exp( ψ_c(x'_c) )

- For MAP inference, use max-product: switch ∑_{x'_c} → max_{x'_c}
- For acyclic models, converges to exact marginals efficiently (2 passes: collect leaves to root, then distribute)
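The two message updates can be checked on a tiny tree, where leaf-to-root messages already give the exact marginal. A sketch on a hypothetical chain X0 - X1 - X2 (potentials and names are illustrative):

```python
import itertools
import math

# Hypothetical chain MRF X0 - X1 - X2: two pairwise log-potentials and one
# singleton log-potential on X1; phi_c = exp(psi_c).
psi01 = {(a, b): 0.8 if a == b else -0.8 for a in (0, 1) for b in (0, 1)}
psi12 = {(a, b): -0.3 if a == b else 0.3 for a in (0, 1) for b in (0, 1)}
psi1 = {0: 0.0, 1: 0.5}
phi01 = {k: math.exp(v) for k, v in psi01.items()}
phi12 = {k: math.exp(v) for k, v in psi12.items()}

# Leaf variables X0, X2 send all-ones messages to their factors, so the
# factor-to-variable messages into X1 are:
m01_to_1 = {x1: sum(phi01[(x0, x1)] for x0 in (0, 1)) for x1 in (0, 1)}
m12_to_1 = {x1: sum(phi12[(x1, x2)] for x2 in (0, 1)) for x1 in (0, 1)}
m1_to_1 = {x1: math.exp(psi1[x1]) for x1 in (0, 1)}  # singleton factor

# Belief at X1: normalized product of all incoming factor messages
b = {x1: m01_to_1[x1] * m12_to_1[x1] * m1_to_1[x1] for x1 in (0, 1)}
Zb = sum(b.values())
belief = {x1: v / Zb for x1, v in b.items()}

# Brute-force marginal for comparison: exact agreement on this tree
w = {x: math.exp(psi01[(x[0], x[1])] + psi12[(x[1], x[2])] + psi1[x[1]])
     for x in itertools.product((0, 1), repeat=3)}
Ztot = sum(w.values())
exact = {v: sum(s for x, s in w.items() if x[1] == v) / Ztot for v in (0, 1)}

assert all(abs(belief[v] - exact[v]) < 1e-12 for v in (0, 1))
```

Because the pairwise potentials here are symmetric in x1, the belief reduces to the singleton bias alone, p(X1 = 1) = e^0.5 / (1 + e^0.5).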
What about cyclic (loopy) models?
Can triangulate and run junction tree
Exact solution but takes time exponential in treewidth
Or... just run loopy belief propagation (LBP) and hope
Often produces strikingly good results, but may not converge at all
Extensive literature on trying to understand LBP
Inference: a variational perspective
Recall p(x) = (1/Z) exp( ∑_{c∈C} ψ_c(x_c) ) = e^{−E(x)} / Z

KL-divergence between some distribution q(x) and p(x) is given by
D(q||p) = ∑_x q(x) log( q(x)/p(x) ) ≥ 0, with equality iff q = p.

Have

0 ≤ D(q||p) = ∑_x q(x) log q(x) − ∑_x q(x) log p(x)
            = −S(q) − ∑_x q(x) [−E(x) − log Z]
            = E_q(E(x)) − S(q) + log Z,

where S(q) is the standard Shannon entropy of q.
Inference: a variational perspective
0 ≤ D(q||p) = E_q(E(x)) − S(q) + log Z, equality iff q = p

Hence E_q(E(x)) − S(q) ≥ −log Z.

- This function of the distribution q is called the (Gibbs) free energy F_G(q) = E_q(E(x)) − S(q)
- Minimizing it over all valid distributions q yields −log Z
- And the argmin is attained exactly when q = p, the true distribution
- Hence we can think of inference as optimization
- But still intractable in general...
END OF PART I
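The variational identity above can be verified numerically on a toy distribution: F_G(q) ≥ −log Z for every q, with equality at q = p. A sketch (the 4-state energies are hypothetical):

```python
import math
import random

# Hypothetical toy distribution over 4 states, defined via energies E(x)
E = [0.3, -1.2, 0.7, 0.1]
Z = sum(math.exp(-e) for e in E)
p = [math.exp(-e) / Z for e in E]

def gibbs_free_energy(q):
    """F_G(q) = E_q(E) - S(q), with the convention 0 log 0 = 0."""
    avg_E = sum(qi * e for qi, e in zip(q, E))
    S = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return avg_E - S

# At q = p the free energy attains its minimum, -log Z
assert abs(gibbs_free_energy(p) + math.log(Z)) < 1e-12

# Any other distribution gives a value at least as large
random.seed(0)
for _ in range(100):
    r = [random.random() for _ in E]
    q = [ri / sum(r) for ri in r]
    assert gibbs_free_energy(q) >= -math.log(Z) - 1e-12
```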
Part II: Bethe approximation
- Seek to approximate the partition function Z
- Also interested in approximate marginal inference (medical diagnosis, power networks)

The Bethe approximation: what and why?
- Introduced by Hans Bethe in the 1930s to study phase transitions in statistical physics. Wikipedia: "Bethe left Germany in 1933, moving to England after receiving an offer as lecturer... He moved in with his friend Rudolf Peierls... This meant that Bethe had someone to speak to in German, and did not have to eat English food."
- Found fresh application in machine learning
- Direct connections to variational inference and belief propagation [YFW01]
Recall variational approach
−log Z = min_{q∈M} F_G(q) = min_{q∈M} E_q(E) − S(q)

- M is the marginal polytope, which comprises all globally valid probability distributions over all the variables, i.e. the convex hull of all 2^n configurations (for binary variables)
- F_G is the Gibbs free energy, with optimum at the true distribution

The Bethe approximation has 2 aspects, both pairwise approximations:
1 Relax the marginal polytope M to the local polytope L, which enforces only pairwise consistency, hence pseudo-marginals
2 Use the Bethe entropy S_B = ∑_{i∈V} S_i + ∑_{(i,j)∈E} (S_ij − S_i − S_j)

Obtain the Bethe partition function Z_B at the global optimum:

−log Z_B = min_{q∈L} F(q) = min_{q∈L} E_q(E) − S_B(q)
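The objective being minimized can be written out directly for binary pairwise models (anticipating the minimal (θ, W) parameterization introduced on a later slide). A sketch with hypothetical helper names:

```python
import math

def H(dist):
    """Shannon entropy of a distribution given as a list; 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def bethe_free_energy(q, xi, theta, W, edges):
    """F(q) = E_q(E) - S_B(q) for a binary pairwise model in the minimal
    parameterization E(x) = -sum_i theta_i x_i - sum_(i,j) W_ij x_i x_j.
    q[i] is the pseudo-marginal p(X_i = 1); xi[(i, j)] is the pairwise
    pseudo-marginal p(X_i = 1, X_j = 1)."""
    energy = -sum(theta[i] * q[i] for i in range(len(q)))
    energy -= sum(W[e] * xi[e] for e in edges)
    Si = [H([qi, 1 - qi]) for qi in q]
    # S_B = sum_i S_i + sum_(i,j) (S_ij - S_i - S_j)
    SB = sum(Si)
    for (i, j) in edges:
        mu = [1 + xi[(i, j)] - q[i] - q[j],   # p(0, 0)
              q[j] - xi[(i, j)],              # p(0, 1)
              q[i] - xi[(i, j)],              # p(1, 0)
              xi[(i, j)]]                     # p(1, 1)
        SB += H(mu) - Si[i] - Si[j]
    return energy - SB

# Sanity check on a single variable with theta = 0 and no edges: F is
# minimized at q = 1/2, where -F recovers the exact log Z = log 2.
f_half = bethe_free_energy([0.5], {}, [0.0], {}, [])
assert abs(-f_half - math.log(2)) < 1e-12
assert bethe_free_energy([0.3], {}, [0.0], {}, []) > f_half
```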
Connection to LBP
Obtain the Bethe partition function Z_B at the global optimum:

−log Z_B = min_{q∈L} F = min_{q∈L} E_q(E) − S_B(q)

[Figure: marginal polytope (global consistency) contained in the local polytope (local consistency)]

- F is called the Bethe free energy (it approximates the true free energy)
- In a seminal paper, [YFW01] showed that fixed points of LBP correspond to stationary points of the Bethe free energy F
- Refined by [Hes02]: stable fixed points correspond to local minima of F (the converse is not true in general)
Other methods to minimize Bethe free energy F
- LBP may be viewed as an algorithm that tries to minimize F
  - But it may not converge, or may converge only to a local minimum
- This spurred much effort to find convergent algorithms, such as:
  - Gradient methods [WT01]
  - Double loop methods, e.g. CCCP [Yui02] or [HAK03]
  - But still only to a local optimum, with no time guarantee
- For binary pairwise models:
  - Recent algorithm guaranteed to converge in polynomial time to an approximately stationary point of F [Shi12], with restrictions on topology
  - Our algorithm is guaranteed to return an ε-approximation to the global optimum [WJ14]
  - To our knowledge, no previously known methods were guaranteed to return or approximate the global optimum
Binary pairwise MRFs
Main focus now on MRFs which are binary, i.e. all X_i ∈ {0,1}, and pairwise, i.e. all potentials are over ≤ 2 variables:
- n variables V = {X_1, ..., X_n}, singleton potentials ψ_i(x_i)
- x = (x_1, ..., x_n) ∈ {0,1}^n is one particular configuration
- m edges (i,j) ∈ E ⊆ V × V, pairwise potentials ψ_ij(x_i, x_j)

p(x) = (1/Z) exp( ∑_{i∈V} ψ_i(x_i) + ∑_{(i,j)∈E} ψ_ij(x_i, x_j) )

Can always reparameterize to a minimal representation {θ_i : i ∈ V}, {W_ij : (i,j) ∈ E} s.t. same distribution:

p(x) = (1/Z') exp( ∑_{i∈V} θ_i x_i + ∑_{(i,j)∈E} W_ij x_i x_j )
Binary pairwise MRFs: simple example
p(x) = (1/Z) exp( ∑_{i∈V} ψ_i(x_i) + ∑_{(i,j)∈E} ψ_ij(x_i, x_j) )

Can always reparameterize to a minimal representation {θ_i : i ∈ V}, {W_ij : (i,j) ∈ E} s.t. same distribution:

p(x) = (1/Z') exp( ∑_{i∈V} θ_i x_i + ∑_{(i,j)∈E} W_ij x_i x_j )

Example: two variables X_1, X_2 joined by one edge, with local θ_1 = 4, local θ_2 = −5, edge W_12 = 3.
(W_ij > 0: attractive; W_ij < 0: repulsive)

Original potentials ψ_1(x_1), ψ_12(x_1, x_2), ψ_2(x_2):

x_1 | ψ_1(x_1)      x_1\x_2 |  0 |  1      x_2      |  0 |  1
 0  |    2             0    |  1 | −3      ψ_2(x_2) | −1 | −2
 1  |    4             1    |  3 |  2
Bethe pseudo-marginals in the local polytope
−log Z_B = min_{q∈L} F = min_{q∈L} E_q(E) − S_B(q)

Must identify the q ∈ L that minimizes F.

q is defined by singleton pseudo-marginals q_i = p(X_i = 1) ∀i ∈ V and pairwise pseudo-marginals μ_ij ∀(i,j) ∈ E. The local polytope constraints imply

μ_ij = [ p(X_i=0, X_j=0)   p(X_i=0, X_j=1)
         p(X_i=1, X_j=0)   p(X_i=1, X_j=1) ]
     = [ 1 + ξ_ij − q_i − q_j   q_j − ξ_ij
         q_i − ξ_ij             ξ_ij ]

with the constraint that all terms ≥ 0 ⇒ ξ_ij ∈ [max(0, q_i + q_j − 1), min(q_i, q_j)].

[WT01] showed:
- When minimizing F, one can solve explicitly for ξ_ij(q_i, q_j, W_ij)
- Here W_ij is the associativity of the edge (as earlier)
- Hence it is sufficient to search over (q_1, ..., q_n) ∈ [0,1]^n, but how?
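[WT01] derive the optimal ξ_ij in closed form; as an illustration (not their formula), the same optimum can be recovered numerically for fixed q_i, q_j, W_ij, since only the −W_ij ξ_ij − S_ij part of F depends on ξ_ij and it is strictly convex in ξ_ij:

```python
import math

def edge_term(xi, qi, qj, W):
    """xi-dependent part of F for one edge: -W*xi - S(mu_ij), where mu_ij
    is the 2x2 edge pseudo-marginal determined by (qi, qj, xi)."""
    mu = [1 + xi - qi - qj, qj - xi, qi - xi, xi]
    S = -sum(m * math.log(m) for m in mu if m > 0)
    return -W * xi - S

def optimal_xi(qi, qj, W, tol=1e-9):
    """Golden-section search for the minimizing xi on the feasible
    interval [max(0, qi + qj - 1), min(qi, qj)]; edge_term is strictly
    convex in xi (a linear term plus a negative entropy of an affine map)."""
    lo, hi = max(0.0, qi + qj - 1), min(qi, qj)
    g = (math.sqrt(5) - 1) / 2
    a, b = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if edge_term(a, qi, qj, W) < edge_term(b, qi, qj, W):
            hi, b = b, a
            a = hi - g * (hi - lo)
        else:
            lo, a = a, b
            b = lo + g * (hi - lo)
    return 0.5 * (lo + hi)

# With W = 0 the entropy is maximized at independence: xi = qi * qj
assert abs(optimal_xi(0.4, 0.7, 0.0) - 0.28) < 1e-6
# An attractive edge pulls xi above qi*qj; a repulsive edge pushes it below
assert optimal_xi(0.4, 0.7, 2.0) > 0.28
assert optimal_xi(0.4, 0.7, -2.0) < 0.28
```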
Our approach: a mesh over Bethe pseudo-marginals
We discretize the space (q_1, ..., q_n) ∈ [0,1]^n with a provably sufficient mesh M(ε), fine enough s.t. the optimum discretized point q* has F(q*) ≤ min_{q∈L} F(q) + ε.

[Figure: 3D scatter plot of mesh points in the unit cube, axes q_1, q_2, q_3]
Key ideas to approximate log ZB to within ε
Discretize to construct a provably sufficient mesh M(ε):
- How to guarantee F(q*) ≤ min_{q∈L} F(q) + ε?
- How to search the large discrete mesh efficiently?

Developed two approaches:
- curvMesh bounds curvature [WJ13]
- gradMesh bounds gradients; typically much better (orders of magnitude) [WJ14]

If the original model is attractive, i.e. W_ij > 0 ∀(i,j) ∈ E (submodular cost functions), then we show the discretized multi-label problem is submodular [WJ13, KKL12]. Hence it can be solved via graph cuts [SF06] in O(N³), where N = ∑_{i∈V} N_i with N_i points in dimension i [cf. ∏_{i∈V} N_i configurations overall].

Obtain an FPTAS with gradMesh: N = O( nmW/ε ).
To compare, for curvMesh: N = O( ε^{−1/2} n^{7/4} Δ^{3/4} exp[ (1/2)( W(1 + Δ/2) + T ) ] ).
Bounding the locations of stationary points
For general edge types (associative or repulsive), let
W_i = ∑_{j∈N(i): W_ij>0} W_ij,   V_i = −∑_{j∈N(i): W_ij<0} W_ij.

Theorem (WJ13)
At any stationary point of the Bethe free energy, σ(θ_i − V_i) ≤ q_i ≤ σ(θ_i + W_i).

- Developed an algorithm (Bethe bound propagation, BBP) that iteratively improves these bounds
- [MK07] already had a similar algorithm that finds ranges of possible beliefs in LBP; a bit slower but typically better
- Use this to preprocess the model to yield a smaller orthotope:
  - reduces the search space directly
  - for curvMesh, lowers the max curvature, hence allows a coarser mesh
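The theorem's initial bounds are cheap to compute as a preprocessing step. A sketch of just that first step (full BBP then iterates to tighten them; all names hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def initial_bethe_bounds(theta, W):
    """Initial bounds from Theorem (WJ13): at any stationary point,
    sigmoid(theta_i - V_i) <= q_i <= sigmoid(theta_i + W_i), where W_i
    sums the attractive and V_i the (negated) repulsive edge weights
    incident to i. Only the first step, before any BBP iteration."""
    lower, upper = [], []
    for i in range(len(theta)):
        Wi = sum(w for e, w in W.items() if i in e and w > 0)
        Vi = -sum(w for e, w in W.items() if i in e and w < 0)
        lower.append(sigmoid(theta[i] - Vi))
        upper.append(sigmoid(theta[i] + Wi))
    return lower, upper

# Hypothetical 3-node chain: attractive edge (0,1), repulsive edge (1,2)
theta = [1.0, 0.0, -1.0]
W = {(0, 1): 2.0, (1, 2): -1.5}
lo, hi = initial_bethe_bounds(theta, W)
assert all(l <= h for l, h in zip(lo, hi))
# Node 0 has no repulsive neighbors, so its lower bound is sigmoid(theta_0)
assert abs(lo[0] - sigmoid(1.0)) < 1e-12
# With no edges at all, the bounds pin q_i = sigmoid(theta_i) exactly
l1, h1 = initial_bethe_bounds([0.5], {})
assert abs(l1[0] - h1[0]) < 1e-12
```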
Bethe free energy landscape (stylized)
[Figure: stylized Bethe free energy landscape; the red dot shows the global optimum, and we might return the green dot]
Curvature: all terms of the Hessian H_ij = ∂²F / ∂q_i ∂q_j

H_ii = −(d_i − 1)/(q_i(1 − q_i)) + ∑_{j∈N(i)} q_j(1 − q_j)/T_ij ≥ 1/(q_i(1 − q_i)),

H_ij = (q_i q_j − ξ_ij)/T_ij if (i,j) ∈ E;   H_ij = 0 if (i,j) ∉ E, i ≠ j,

where d_i is the degree of X_i in the model, and

T_ij = q_i q_j(1 − q_i)(1 − q_j) − (ξ_ij − q_i q_j)² ≥ 0, with equality iff q_i or q_j ∈ {0,1}.

- Leads to a bound on the max second derivative in any direction (curvMesh)
- The q_i q_j − ξ_ij term is negative for an attractive edge, hence we obtain the submodularity result
gradMesh: analyze first derivatives of F
∂F/∂q_i = −θ_i + log [ ( (1 − q_i)^{d_i − 1} ∏_{j∈N(i)} (q_i − ξ_ij) ) / ( q_i^{d_i − 1} ∏_{j∈N(i)} (1 + ξ_ij − q_i − q_j) ) ]   [WT01]

Theorem (WJ14)
−θ_i + log( q_i / (1 − q_i) ) − W_i ≤ ∂F/∂q_i ≤ −θ_i + log( q_i / (1 − q_i) ) + V_i

- Upper and lower bounds are separated by a constant, and both are monotonically increasing in q_i
- Within our search space, this allows us to bound |∂F/∂q_i| ≤ D_i := V_i + W_i = ∑_{j∈N(i)} |W_ij|
gradMesh: search over purple region
[Figure: upper and lower bounds f_i^U, f_i^L on the partial derivative ∂F/∂q_i, plotted against the pseudo-marginal q_i. The shaded area, between the q_i where f_i^U = 0 and the q_i where f_i^L = 0, shows where the partial derivative can be 0: the Bethe box [A_i, 1 − B_i]. Also marked: D_i = V_i + W_i − log L_i − log U_i. Parameters used in this example: θ_i = 1, V_i = 2, W_i = 3, L_i = 1.8, U_i = 2.9.]
gradMesh: complexity
In the search space, |∂F/∂q_i| ≤ D_i := V_i + W_i = ∑_{j∈N(i)} |W_ij|

- We can apportion the ε error among the n variables
- Simple method: each gets ε/n
- Need gradient_i · step_i ≈ ε/n
- Hence the number of mesh points in dimension i is
  N_i ≈ 1/step_i ≈ (n/ε) · gradient_i = O( (n/ε) ∑_{j∈N(i)} |W_ij| )
- Hence N = ∑_i N_i = O( (n/ε) m W )
- Various tricks in the paper show how to improve performance
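The sizing itself is simple arithmetic once the gradient bounds D_i are known; a sketch (function name hypothetical):

```python
import math

def gradmesh_sizes(W, n, eps):
    """Number of mesh points per dimension for a gradMesh-style scheme:
    each dimension i gets error budget eps/n, so N_i ~ (n/eps) * D_i,
    where D_i = sum over neighbors j of |W_ij|."""
    D = [0.0] * n
    for (i, j), w in W.items():
        D[i] += abs(w)
        D[j] += abs(w)
    N_i = [math.ceil((n / eps) * d) for d in D]
    return N_i, sum(N_i)

# Hypothetical 4-cycle with unit-magnitude edge weights, eps = 0.5
W = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): -1.0}
N_i, N = gradmesh_sizes(W, n=4, eps=0.5)
assert N_i == [16, 16, 16, 16]   # D_i = 2, n/eps = 8
assert N == 64
```

Note that N is the sum over dimensions (the quantity entering the O(N³) graph-cut bound for attractive models), not the product ∏ N_i that naive exhaustive search over the mesh would visit.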
Comparison of methods: left ε = 1, right ε = 0.1; (when fixed, W = 5, n = 10)
[Figure: four log-scale panels plotting mesh size N (from 10^0 to 10^20) for curvMeshOrig, curvMeshNew, and gradMesh; the left pair uses ε = 1 and the right pair ε = 0.1, varying n (5 to 20) and W (0 to 10)]
Example where LBP fails to converge, gradMesh works well
- Power network of transformers
- X_i ∈ {stable, fail}
- Attractive edges between transformers
- Would like to rank by marginal probability of failure p(X_i)

[Figure: power network graph of 55 transformer nodes]
Recap
The Bethe approximation is often strikingly accurate. New results:
- Novel formulation of the Hessian of the Bethe free energy F
- Bounds on derivatives and locations of optima
- First method guaranteed to return an ε-approx global optimum log Z_B, allowing its accuracy to be tested rigorously
- Provides a benchmark against which to judge other heuristics (LBP, HAK etc.)
- Useful in practice for small problems
- FPTAS for attractive models; this was an open theoretical question
- Further improvements in new work...
Understanding the Bethe approximation
Joint work with Kui Tang and David Sontag.

Goal: separate and evaluate the two aspects of the Bethe approximation:
1 Relax the marginal polytope M to the local polytope L, which enforces only pairwise consistency, hence pseudo-marginals
2 Use the Bethe entropy S_B = ∑_{i∈V} S_i + ∑_{(i,j)∈E} (S_ij − S_i − S_j)

- Consider marginal, cycle and local polytopes
- Compare against the tree-reweighted approximation (TRW):
  - same polytopes
  - concave upper-bounding entropy
- Analytic and experimental results
Illustration of polytopes
[Figure: nested polytopes, from tightest to loosest: marginal polytope (global consistency), cycle polytope (cycle consistency), local polytope (local consistency)]
Questions addressed include
- Does tightening the relaxation of the marginal polytope always improve the Bethe approximation for log Z?
  No (though empirically it is usually very helpful for general models)
- In attractive models, when local potentials are low and couplings high, why does the Bethe approximation perform poorly for marginals?
  The Bethe entropy
- In general models, for low couplings, the Bethe approximation performs much better than TRW, yet as coupling increases, this advantage disappears. How does this vary if we tighten the relaxation of the marginal polytope?
  Mixed; see Experiments
Tightening the polytope relaxation - does it always help?
No.
- Consider a symmetric nonhomogeneous cycle on A, B, C; vary W_BC, with θ_A = θ_B = θ_C = 0
- W_AB = W_AC = 10, strongly attractive

[Figure: log Z against BC edge weight (−10 to 10), comparing true, Bethe, and Bethe+cycle]

Lemma: ∂ log Z_B / ∂W_BC = μ_BC(0,0) + μ_BC(1,1), and all singleton marginals are 1/2.

For a weakly attractive edge BC, the cycle polytope improves the pairwise marginal (similar slopes near 0) but worsens the partition function (gap between the curves near 0).
Threshold result for attractive models due to SB entropy
Lemma: For a symmetric homogeneous d-regular MRF,q = (12 , . . . ,
12) is a stationary point of F but not a minimum
for W > 2 log dd−2 (uses earlier Hessian result)
Recall∑
i di = 2m (handshake lemma), henceSB = mSij + (n − 2m)Si . For large W , all probability masspulled onto main diagonal, hence Sij ≈ Si . For m > n, toavoid negative SB , each entropy term → 0 by tending to
pairwise
(1 00 0
)or symmetrically
(0 00 1
).
[Figure: three panels plotting the Bethe free energy E − S_B against q for K_5, at W = 1, W = 1.38, and W = 1.75]
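The lemma can be checked numerically along the symmetric line q_i = q. For K_5, d = 4, so the threshold reads 2 log(4/2) = 2 log 2 ≈ 1.386, matching the W = 1.38 panel. A sketch under two stated assumptions: the threshold formula 2 log(d/(d−2)) is reconstructed from the garbled source, and "symmetric" is taken to mean θ_i = −dW/2 (the choice that makes the model invariant under flipping all bits); ξ is minimized by grid search:

```python
import math

def H2(p):
    """Binary entropy with 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in (p, 1 - p) if x > 0)

def edge_min(q, W, steps=1500):
    """min over xi in [max(0, 2q - 1), q] of -W*xi - S_ij, for a symmetric
    edge whose two singleton pseudo-marginals both equal q (grid search)."""
    lo, hi = max(0.0, 2 * q - 1), q
    best = float("inf")
    for k in range(steps + 1):
        xi = lo + (hi - lo) * k / steps
        mu = [1 + xi - 2 * q, q - xi, q - xi, xi]
        S = -sum(m * math.log(m) for m in mu if m > 0)
        best = min(best, -W * xi - S)
    return best

def F_symmetric(q, W, n=5, m=10):
    """Bethe free energy of the symmetric homogeneous K5 along the line
    q_i = q. With theta_i = -d*W/2 the energy becomes m*W*(q - xi), and
    S_B = m*S_ij + (n - 2m)*S_i."""
    return m * (W * q + edge_min(q, W)) + (2 * m - n) * H2(q)

d = 4                                    # K5 is 4-regular
threshold = 2 * math.log(d / (d - 2))    # ~1.386, the W = 1.38 panel
assert abs(threshold - 2 * math.log(2)) < 1e-12

qs = [0.05 * k for k in range(1, 20)]
# Below threshold (W = 1), q = 1/2 wins along the symmetric line...
f_half = F_symmetric(0.5, 1.0)
assert all(f_half <= F_symmetric(q, 1.0) + 1e-6 for q in qs)
# ...above threshold (W = 1.75), some other q does strictly better
f_half = F_symmetric(0.5, 1.75)
assert any(F_symmetric(q, 1.75) < f_half - 1e-6 for q in qs)
```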
Also a polytope effect for frustrated cycles
A frustrated cycle has an odd number of repulsive edges; this pulls singleton marginals the other way, toward 1/2.

- Seen a Bethe entropy effect for attractive cycles
- Also a polytope effect for frustrated cycles
- Recall the optimum energy on the local polytope for a symmetric frustrated cycle is at (1/2, ..., 1/2)
- C_5 topology, θ_i ∼ [0, T_max], all edges W

[Figure: two panels plotting the average singleton marginal against edge weight W (−10 to 10), comparing true, Bethe, and Bethe+cycle]
Experiments: general models, θ_i ∼ [−2, 2] (attractive and repulsive edges), K_10 topology

[Figure: four panels against maximum coupling strength y (2 to 32), comparing Bethe+local, Bethe+cycle, Bethe+marg, TRW+local, TRW+cycle, TRW+marg: log partition error; log partition error with local removed; singleton marginals, average ℓ1 error; pairwise marginals, average ℓ1 error]
Conclusions for general models
- Big gains from the cycle polytope (suggests Frank-Wolfe)
- Not much additional gain from the marginal polytope (computationally harder)
- Bethe performs remarkably well:
  - Better than TRW for log Z and pairwise marginals
  - Less clear on singleton marginals: TRW better for very strong coupling
- Still much to learn about why Bethe performs so well...
Summary
- The Bethe approximation is remarkably effective for approximate inference
- Novel results on the Hessian of the Bethe free energy
- First algorithm for ε-approx of the global optimum log Z_B; FPTAS for attractive models
- Contributions to understanding the Bethe approximation (polytope and entropy)
- Where feasible, tightening to the cycle polytope can be very helpful
- Additional results in new work (e.g. clamping)...

Thank you!
Attractive example: max score and value, with argmax
[Figure: top panel, optimal Score (C) and Value (−F) against q_i, i = 3/4; bottom panel, argmax singleton values against q_i]
References
F. Korc̆, V. Kolmogorov, and C. Lampert. Approximating marginals usingdiscrete energy minimization. Technical report, IST Austria, 2012.
J. Mooij and H. Kappen. Sufficient conditions for convergence of thesum-product algorithm. IEEE Transactions on Information Theory, 2007.
D. Schlesinger and B. Flach. Transforming an arbitrary minsum problem into a binary one. Technical report, Dresden University of Technology, 2006.
A. Weller and T. Jebara. Approximating the Bethe partition function. InUAI, 2014.
A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Betheapproximation: When and how can it go wrong? In UAI, 2014.
A. Weller and T. Jebara. Bethe bounds and approximating the globaloptimum. In AISTATS, 2013.
M. Welling and Y. Teh. Belief optimization for binary networks: A stablealternative to loopy belief propagation. In UAI, 2001.
J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In IJCAI, Distinguished Lecture Track, 2001.
Extra Slides with Supplementary Material
Supplementary Material (if time or questions)
Cycle polytope
- A relaxation of the marginal polytope
- Inherits all constraints of the local polytope, hence at least as tight
- In addition, enforces consistency around any cycle

Cycle inequalities [B93]: for all cycles C and every subset of edges F ⊆ C with |F| odd:

∑_{(i,j)∈F} ( μ_ij(0,0) + μ_ij(1,1) ) + ∑_{(i,j)∈C\F} ( μ_ij(1,0) + μ_ij(0,1) ) ≥ 1.

- Cycle polytope = marginal polytope for symmetric planar MRFs [B93]
- Cycle polytope = TRI for binary pairwise models [S10]
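A cycle inequality is mechanical to check for given edge pseudo-marginals. A sketch (data layout hypothetical), showing that the locally consistent "all edges perfectly disagree" point on a 3-cycle is cut off:

```python
def violates_cycle_inequality(mu, cycle_edges, F):
    """Check one [B93] cycle inequality: the sum over F of the agreeing
    mass mu(0,0) + mu(1,1), plus the sum over C\\F of the disagreeing
    mass mu(0,1) + mu(1,0), must be >= 1 for any odd-sized F within C."""
    total = 0.0
    for e in cycle_edges:
        m = mu[e]  # 2x2 table indexed m[a][b]
        if e in F:
            total += m[0][0] + m[1][1]   # edge "agrees"
        else:
            total += m[0][1] + m[1][0]   # edge "disagrees"
    return total < 1.0

agree = [[0.5, 0.0], [0.0, 0.5]]
disagree = [[0.0, 0.5], [0.5, 0.0]]
cycle = [(0, 1), (1, 2), (2, 0)]

# "All three edges perfectly disagree" is consistent with q_i = 1/2 on
# every edge (so it lies in the local polytope), yet an odd cycle cannot
# be 2-colored: the inequality with F = all 3 edges (|F| odd) cuts it off.
mu_bad = {e: disagree for e in cycle}
assert violates_cycle_inequality(mu_bad, cycle, F=set(cycle))

# "All edges perfectly agree" satisfies the inequality for odd F
mu_good = {e: agree for e in cycle}
assert not violates_cycle_inequality(mu_good, cycle, F={(0, 1)})
```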
Threshold for attractive models: ξ_ij(q_i, q_j, W_ij)

[Figure: for K_5, the Bethe free energy E − S_B against q at W = 1, W = 1.38, and W = 1.75; below, the Bethe entropy S_B and the energy E against q at W = 1 and at W = 4.5]
Experiments: Attractive models θi ∼ [−0.1, 0.1]
[Figure: three panels against maximum coupling strength y (0.4 to 16), comparing Bethe+local, Bethe+cycle, Bethe+marg, TRW+local, TRW+cycle, TRW+marg: log partition error; singleton marginals, average ℓ1 error; pairwise marginals, average ℓ1 error (small scale)]

For this distribution of models, the polytope appears to make no difference, though recall we showed theoretically that it can.
Clamping variables: Attractive binary pairwise models
- Z_B = optimal Bethe partition function for the original model
- Clamp variable X_i and form the new approximation Z_B^(i) = Z_B|_{X_i=0} + Z_B|_{X_i=1}

Theorem (WJ14 NIPS)
For an attractive binary pairwise model and any variable X_i, Z_B ≤ Z_B^(i).

Corollary
For an attractive binary pairwise model, Z_B ≤ Z.

⇒ clamping only improves the estimate of the partition function.
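For the exact partition function, clamping is lossless: Z = Z|_{X_i=0} + Z|_{X_i=1}. The theorem says the Bethe analogue can only increase the estimate in attractive models. The exact identity itself is easy to verify by enumeration (model hypothetical):

```python
import itertools
import math

def exact_Z(theta, W, n, clamp=None):
    """Exact partition function of a binary pairwise model in minimal
    parameterization; clamp=(i, value) optionally restricts X_i."""
    Z = 0.0
    for x in itertools.product((0, 1), repeat=n):
        if clamp is not None and x[clamp[0]] != clamp[1]:
            continue
        s = sum(theta[i] * x[i] for i in range(n))
        s += sum(w * x[i] * x[j] for (i, j), w in W.items())
        Z += math.exp(s)
    return Z

# Hypothetical attractive 3-variable chain model
theta = [0.5, -0.2, 0.1]
W = {(0, 1): 1.0, (1, 2): 2.0}
Z = exact_Z(theta, W, 3)

# Clamping identity for the exact Z: lossless for every variable
for i in range(3):
    Zi = exact_Z(theta, W, 3, clamp=(i, 0)) + exact_Z(theta, W, 3, clamp=(i, 1))
    assert abs(Z - Zi) < 1e-9
```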
Clamping variables: stronger result
For any i ∈ V and x ∈ [0,1], let log Z_Bi(x) = max_{q∈[0,1]^n : q_i = x} −F(q).

- Observe log Z_Bi(0) = log Z_B|_{X_i=0}, log Z_Bi(1) = log Z_B|_{X_i=1}, and log Z_B = max_{q_i∈[0,1]} log Z_Bi(q_i)
- Recall S_i(x) = −x log x − (1 − x) log(1 − x), the singleton entropy

Lemma: To prove the clamping result, it is sufficient that
log Z_Bi(q_i) ≤ q_i log Z_Bi(1) + (1 − q_i) log Z_Bi(0) + S_i(q_i).

Theorem (WJ14 NIPS)
For an attractive binary pairwise model, log Z_Bi(q_i) − S_i(q_i) is convex.

Uses earlier results on the Hessian.