All of Graphical Models


    Xiaojin Zhu

    Department of Computer Sciences, University of Wisconsin-Madison, USA

    Tutorial at ICMLA 2011


    The Whole Tutorial in One Slide

    Given GM = joint distribution p(x1, . . . , xn)

    Do inference = p(XQ | XE), where in general XQ, XE ⊆ {x1, . . . , xn}

    If p(x1, . . . , xn) is not given, estimate it from data


    Outline

    Life without Graphical Models

    Representation
        Directed Graphical Models (Bayesian Networks)
        Undirected Graphical Models (Markov Random Fields)

    Inference

    Exact Inference
    Markov Chain Monte Carlo
    Variational Inference

    Loopy Belief Propagation

    Mean Field Algorithm

    Exponential Family
    Maximizing Problems

    Parameter Learning

    Structure Learning


    Life without Graphical Models

    . . . is fine mathematically:

    The universe is reduced to a set of random variables x1, . . . , xn
    e.g., x1, . . . , xn−1 can be the discrete or continuous features
    e.g., xn ≡ y can be the discrete class label

    The joint p(x1, . . . , xn) completely describes how the universe works

    Machine learning: estimate p(x1, . . . , xn) from training data X(1), . . . , X(N), where X(i) = (x(i)1, . . . , x(i)n)

    Prediction: y = argmax_xn p(xn | x1, . . . , xn−1), a.k.a. inference, by the definition of conditional probability

    p(xn | x1, . . . , xn−1) = p(x1, . . . , xn−1, xn) / Σ_v p(x1, . . . , xn−1, xn = v)


    Conclusion

    Life without graphical models is just fine. So why are we still here?


    Life can be Better for Computer Scientists

    Given GM = joint distribution p(x1, . . . , xn)
        exponential naive storage (2^n for binary r.v.)
        hard to interpret (conditional independence)

    Do inference = p(XQ | XE), where in general XQ, XE ⊆ {x1, . . . , xn}
        Often can't do it computationally

    If p(x1, . . . , xn) is not given, estimate it from data

    Can't do it either


    Acknowledgments Before We Start

    Much of this tutorial is based on

    Koller & Friedman, Probabilistic Graphical Models. MIT 2009

    Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. FTML 2008

    Bishop, Pattern Recognition and Machine Learning. Springer 2006.


    Graphical-Model-Nots

    Graphical models are the study of probabilistic models

    Just because there is a graph with nodes and edges doesn't mean it's a GM

    These are not graphical models:

    neural network, decision tree, network flow, HMM template


    Bayesian Network

    A directed graph has nodes X = (x1, . . . , xn), some of them connected by directed edges xi → xj

    A cycle is a directed path x1 → . . . → xk where x1 = xk
    A directed acyclic graph (DAG) contains no cycles

    A Bayesian network on the DAG is a family of distributions satisfying

    { p | p(X) = ∏_i p(xi | Pa(xi)) }

    where Pa(xi) is the set of parents of xi.

    p(xi | Pa(xi)) is the conditional probability distribution (CPD) at xi

    By specifying the CPDs for all i, we specify a particular distribution p(X)


    Example: Alarm

    Binary variables

    P(A | B, E) = 0.95

    P(A | B, ~E) = 0.94

    P(A | ~B, E) = 0.29

    P(A | ~B, ~E) = 0.001

    P(J | A) = 0.9

    P(J | ~A) = 0.05

    P(M | A) = 0.7

    P(M | ~A) = 0.01

    [BN diagram: B → A ← E, A → J, A → M;  P(B) = 0.001, P(E) = 0.002]

    P(B, ¬E, A, J, ¬M)
      = P(B) P(¬E) P(A | B, ¬E) P(J | A) P(¬M | A)
      = 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7)
      ≈ 0.000253
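    As a quick check of this factorization, here is a small Python sketch (not part of the original slides; the dictionary encoding of the CPTs is an illustrative choice) that evaluates the same joint probability from the CPDs above.

    # Evaluate P(B, ~E, A, J, ~M) for the alarm network by multiplying CPDs.
    P_B, P_E = 0.001, 0.002
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
    P_J = {1: 0.9, 0: 0.05}    # P(J=1 | A)
    P_M = {1: 0.7, 0: 0.01}    # P(M=1 | A)

    def bernoulli(p, value):
        """Probability that a Bernoulli(p) variable takes `value` (0 or 1)."""
        return p if value == 1 else 1.0 - p

    def joint(b, e, a, j, m):
        """P(B=b, E=e, A=a, J=j, M=m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
        return (bernoulli(P_B, b) * bernoulli(P_E, e)
                * bernoulli(P_A[(b, e)], a)
                * bernoulli(P_J[a], j) * bernoulli(P_M[a], m))

    print(joint(b=1, e=0, a=1, j=1, m=0))  # ~0.000253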


    Example: Naive Bayes

    [Diagram: class node y with children x1, . . . , xd (left); equivalent plate notation (right)]

    p(y, x1, . . . , xd) = p(y) ∏_{i=1}^d p(xi | y)

    Used extensively in natural language processing. Plate representation on the right.
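    The same kind of factorized evaluation gives a tiny naive Bayes classifier; the sketch below is illustrative, with made-up CPD numbers.

    # Naive Bayes: p(y, x1..xd) = p(y) * prod_i p(xi | y); classify by argmax_y.
    p_y = {0: 0.6, 1: 0.4}                       # class prior (toy numbers)
    p_x_given_y = [                              # p(xi = 1 | y) for d = 3 features
        {0: 0.2, 1: 0.7},
        {0: 0.5, 1: 0.9},
        {0: 0.1, 1: 0.3},
    ]

    def joint(y, x):
        prob = p_y[y]
        for i, xi in enumerate(x):
            pi = p_x_given_y[i][y]
            prob *= pi if xi == 1 else 1.0 - pi
        return prob

    x = (1, 0, 1)
    posterior = {y: joint(y, x) for y in p_y}
    Z = sum(posterior.values())
    print({y: p / Z for y, p in posterior.items()})  # p(y | x) by normalization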


    No Causality Whatsoever

    P(A)=a

    P(B|A)=b

    P(B|~A)=c

    [Two BNs: A → B (left) and B → A (right)]

    P(B) = ab + (1−a)c

    P(A|B) = ab / (ab + (1−a)c)

    P(A|~B) = a(1−b) / (1 − ab − (1−a)c)

    The two BNs are equivalent in all respects

    Bayesian networks imply no causality at all

    They only encode the joint probability distribution (hence correlation)

    However, people tend to design BNs based on causal relations


    Example: Latent Dirichlet Allocation (LDA)

    [Plate diagram: β → φ (one per topic, plate T); α → θ (one per document, plate D); θ → z → w for each word position (plate Nd); φ → w]

    A generative model for p(φ, θ, z, w | α, β):
    For each topic t
        φt ∼ Dirichlet(β)
    For each document d
        θd ∼ Dirichlet(α)
        For each word position in d
            topic z ∼ Multinomial(θd)
            word w ∼ Multinomial(φz)

    Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)

    Some Topics by LDA on the Wish Corpus

    p(word | topic)

    [Figure: word lists for three example topics whose top words include troops, election, and love]


    Conditional Independence

    Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or P(A | B) = P(A) (the two are equivalent)

    Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or P(A | B, C) = P(A | C) (the two are equivalent)

    This extends to groups of r.v.s

    Conditional independence in a BN is precisely specified by d-separation (directed separation)


    d-Separation Case 1: Tail-to-Tail

    [Diagrams: A ← C → B, with C unobserved (left) and C observed (right)]

    A, B in general dependent

    A, B conditionally independent given C

    C is a tail-to-tail node, blocks the undirected path A-B


    d-Separation Case 2: Head-to-Tail

    [Diagrams: A → C → B, with C unobserved (left) and C observed (right)]

    A, B in general dependent
    A, B conditionally independent given C

    C is a head-to-tail node, blocks the path A-B


    d-Separation Case 3: Head-to-Head

    [Diagrams: A → C ← B, with C unobserved (left) and C observed (right)]

    A, B in general independent

    A, B conditionally dependent given C, or any of C's descendants

    C is a head-to-head node; given C (or a descendant), it unblocks the path A-B


    d-Separation

    Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked.
    A path is blocked if it includes a node x such that either
        the path is head-to-tail or tail-to-tail at x and x ∈ C, or
        the path is head-to-head at x, and neither x nor any of its descendants is in C.


    d-Separation Example 1

    The path from A to B is not blocked by either E or F, so A, B are dependent given C

    [Diagram: nodes A, B, C, E, F with the A-B path through E and F; C observed]


    d-Separation Example 2

    The path from A to B is blocked both at E and F, so A, B are conditionally independent given F

    [Diagram: nodes A, B, C, E, F with the A-B path through E and F; F observed]


    Markov Random Fields


    The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive

    A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)

    Define a nonnegative potential function ψC : X_C → R+

    An undirected graphical model (aka Markov Random Field) on the graph is a family of distributions satisfying

    { p | p(X) = (1/Z) ∏_C ψC(XC) }

    where Z = ∫ ∏_C ψC(XC) dX is the partition function

    Example: A Tiny Markov Random Field


    [Diagram: two nodes x1, x2 joined by an edge; the single clique is C = {x1, x2}]

    x1, x2 ∈ {−1, 1}
    A single clique potential ψC(x1, x2) = e^{a x1 x2}

    p(x1, x2) = (1/Z) e^{a x1 x2}

    Z = e^a + e^{−a} + e^{−a} + e^a = 2e^a + 2e^{−a}

    p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})
    p(1, −1) = p(−1, 1) = e^{−a} / (2e^a + 2e^{−a})

    When the parameter a > 0, homogeneous (same-sign) configurations are favored
    When the parameter a < 0, inhomogeneous (opposite-sign) configurations are favored
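    A short Python sketch (illustrative, not from the tutorial) that tabulates Z and the four probabilities for a given value of a:

    import math

    # Tiny MRF on x1, x2 in {-1, +1} with a single clique potential exp(a * x1 * x2).
    def mrf_table(a):
        vals = [-1, 1]
        unnorm = {(x1, x2): math.exp(a * x1 * x2) for x1 in vals for x2 in vals}
        Z = sum(unnorm.values())                  # Z = 2 e^a + 2 e^-a
        return {xy: w / Z for xy, w in unnorm.items()}

    print(mrf_table(1.0))   # a > 0: (1,1) and (-1,-1) are favored
    print(mrf_table(-1.0))  # a < 0: (1,-1) and (-1,1) are favored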


    A log-linear model:

    Real-valued feature functions f1(X), . . . , fk(X)

    Real-valued weights w1, . . . , wk

    p(X) = (1/Z) exp( Σ_{i=1}^k wi fi(X) )

    Example: The Ising Model


    [Diagram: grid of nodes; node parameter θs at node xs, edge parameter θst on edge (s, t)]

    This is an undirected model with xs ∈ {0, 1}.

    p(x) = (1/Z) exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt )

    fs(X) = xs, fst(X) = xs xt;  ws = θs, wst = θst


    Example: Gaussian Random Field


    p(X) ∼ N(μ, Σ):  p(X) = 1 / ((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2) (X − μ)^⊤ Σ^{−1} (X − μ) )

    Multivariate Gaussian

    The n × n covariance matrix Σ is positive semi-definite

    Let Ω = Σ^{−1} be the precision matrix

    xi, xj are conditionally independent given all other variables, if and only if Ωij = 0

    When Ωij ≠ 0, there is an edge between xi, xj
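    The sketch below (a hand-picked tridiagonal precision matrix, chosen for illustration) shows how the zero pattern of the precision matrix gives the graph, even though the covariance itself is dense:

    import numpy as np

    # Chain-structured Gaussian MRF: tridiagonal precision matrix Omega = Sigma^{-1}.
    Omega = np.array([[ 2.0, -1.0,  0.0,  0.0],
                      [-1.0,  2.0, -1.0,  0.0],
                      [ 0.0, -1.0,  2.0, -1.0],
                      [ 0.0,  0.0, -1.0,  2.0]])
    Sigma = np.linalg.inv(Omega)

    # Edges are exactly the nonzero off-diagonal entries of Omega:
    edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
             if abs(Omega[i, j]) > 1e-12]
    print(edges)               # [(0, 1), (1, 2), (2, 3)] -> a chain x1 - x2 - x3 - x4
    print(np.round(Sigma, 3))  # the covariance is dense: marginal dependence everywhere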

    Conditional Independence in Markov Random Fields


    Two groups of variables A, B are conditionally independent given another group C if, after removing C and all edges involving C, A and B become disconnected

    [Diagram: node groups A and B separated by group C]

    Factor Graph


    For both directed and undirected graphical models

    Bipartite: edges between a variable node and a factor node
    Factors represent computation

    [Diagrams: an undirected clique {A, B, C} with potential ψ(A, B, C) and its factor graph with a single factor f = ψ(A, B, C); a directed model over A, B, C and its factor graph with factor f = P(A)P(B)P(C|A, B)]


    Inference by Enumeration


    Let X = (XQ, XE, XO) for query, evidence, and other variables.

    Infer P(XQ | XE)

    By definition

    P(XQ | XE) = P(XQ, XE) / P(XE) = Σ_{XO} P(XQ, XE, XO) / Σ_{XQ,XO} P(XQ, XE, XO)

    Summing an exponential number of terms: with k variables in XO each taking r values, there are r^k terms
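    A brute-force sketch of this enumeration on the alarm network from earlier (illustrative Python; the CPDs are the ones on the Alarm slide):

    from itertools import product

    # Brute-force P(B = 1 | E = 1, M = 1) on the alarm network by enumeration.
    P_B, P_E = 0.001, 0.002
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
    P_J, P_M = {1: 0.9, 0: 0.05}, {1: 0.7, 0: 0.01}
    ber = lambda p, v: p if v == 1 else 1.0 - p

    def joint(b, e, a, j, m):
        return (ber(P_B, b) * ber(P_E, e) * ber(P_A[(b, e)], a)
                * ber(P_J[a], j) * ber(P_M[a], m))

    # Sum the joint over the "other" variables (A, J), and also over the query B.
    num = sum(joint(1, 1, a, j, 1) for a, j in product([0, 1], repeat=2))
    den = sum(joint(b, 1, a, j, 1) for b, a, j in product([0, 1], repeat=3))
    print(num / den)   # P(B = 1 | E = 1, M = 1)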

    Details of the summing problem


    There are a bunch of other variables x1, . . . , xk
    We sum over the r values each variable can take: Σ_{xi=v1}^{vr}
    This is exponential (r^k): Σ_{x1} . . . Σ_{xk}
    We want Σ_{x1} . . . Σ_{xk} p(X)

    For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^m fj(X(j))

    Each factor fj operates on X(j) ⊆ X

    Eliminating a Variable


    Rearrange factors Σ_{x1 . . . xk} f1 · · · fl f⁺_{l+1} · · · f⁺_m by whether x1 ∈ X(j)

    Obviously equivalent: Σ_{x2 . . . xk} f1 · · · fl ( Σ_{x1} f⁺_{l+1} · · · f⁺_m )

    Introduce a new factor f_{m+1} = Σ_{x1} f⁺_{l+1} · · · f⁺_m

    f_{m+1} contains the union of the variables in f⁺_{l+1} · · · f⁺_m except x1

    In fact, x1 disappears altogether in Σ_{x2 . . . xk} f1 · · · fl f_{m+1}

    Dynamic programming: compute f_{m+1} once, use it thereafter

    Hope: f_{m+1} contains very few variables

    Recursively eliminate other variables in turn

    Example: Chain Graph


    Chain graph: A → B → C → D

    Binary variables

    Say we want P(D) = Σ_{A,B,C} P(A)P(B|A)P(C|B)P(D|C)

    Let f1(A) = P(A). Note f1 is an array of size two: (P(A=0), P(A=1))

    f2(A, B) is a table of size four: P(B=0|A=0), P(B=0|A=1), P(B=1|A=0), P(B=1|A=1)

    Σ_{A,B,C} f1(A)f2(A,B)f3(B,C)f4(C,D) = Σ_{B,C} f3(B,C)f4(C,D) ( Σ_A f1(A)f2(A,B) )


    f1(A)f2(A, B) is an array of size four (match A values):
        P(A=0)P(B=0|A=0), P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0), P(A=1)P(B=1|A=1)

    f5(B) ≡ Σ_A f1(A)f2(A, B) is an array of size two:
        P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1)
        P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)

    For this example, f5(B) happens to be P(B)

    Σ_{B,C} f3(B,C)f4(C,D)f5(B) = Σ_C f4(C,D) ( Σ_B f3(B,C)f5(B) ), and so on

    In the end, f7(D) = (P(D=0), P(D=1))
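    A minimal Python sketch of this chain elimination; the CPT numbers here are hypothetical and only the structure matters:

    from itertools import product

    # Variable elimination on the chain A -> B -> C -> D with binary variables.
    # Hypothetical CPTs: f1(A) = P(A), f2(A,B) = P(B|A), f3(B,C) = P(C|B), f4(C,D) = P(D|C).
    f1 = {(a,): 0.6 if a == 0 else 0.4 for a in (0, 1)}
    f2 = {(a, b): [[0.7, 0.3], [0.2, 0.8]][a][b] for a, b in product((0, 1), repeat=2)}
    f3 = {(b, c): [[0.9, 0.1], [0.4, 0.6]][b][c] for b, c in product((0, 1), repeat=2)}
    f4 = {(c, d): [[0.5, 0.5], [0.1, 0.9]][c][d] for c, d in product((0, 1), repeat=2)}

    def eliminate_first(f_prev, f_edge):
        """Sum out the first variable: g(y) = sum_x f_prev(x) * f_edge(x, y)."""
        return {(y,): sum(f_prev[(x,)] * f_edge[(x, y)] for x in (0, 1)) for y in (0, 1)}

    f5 = eliminate_first(f1, f2)   # f5(B) = sum_A f1(A) f2(A,B)  (here equals P(B))
    f6 = eliminate_first(f5, f3)   # f6(C) = sum_B f5(B) f3(B,C)
    f7 = eliminate_first(f6, f4)   # f7(D) = P(D)
    print(f7)                      # {(0,): P(D=0), (1,): P(D=1)}; sums to 1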


    For the chain A → B → C → D:

    Computation for P(D) by elimination: 12 multiplications, 6 additions

    Enumeration: 48 multiplications, 14 additions

    Saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.

    Saving depends more critically on the graph structure (tree width); can be intractable

    Handling Evidence


    For evidence variables XE, simply plug in their value e

    Eliminate variables XO = X \ (XE ∪ XQ). The final factor will be the joint f(XQ) = P(XQ, XE = e)

    Normalize to answer the query:

    P(XQ | XE = e) = f(XQ) / Σ_{XQ} f(XQ)

    Summary: Exact Inference


    Enumeration

    Variable elimination

    Not covered: junction tree (aka clique tree)

    Exact, but intractable for large graphs


    Inference by Monte Carlo


    Consider the inference problem p(XQ = cQ | XE), where XQ, XE ⊆ {x1 . . . xn}

    p(XQ = cQ | XE) = ∫ 1(xQ = cQ) p(xQ | XE) dxQ

    If we can draw samples x(1)_Q, . . . , x(m)_Q ∼ p(xQ | XE), an unbiased estimator is

    p(XQ = cQ | XE) ≈ (1/m) Σ_{i=1}^m 1(x(i)_Q = cQ)

    The variance of the estimator decreases as V/m

    Inference reduces to sampling from p(xQ | XE)

    Forward Sampling Example


    P(A | B, E) = 0.95

    P(A | B, ~E) = 0.94

    P(A | ~B, E) = 0.29

    P(A | ~B, ~E) = 0.001

    P(J | A) = 0.9

    P(J | ~A) = 0.05

    P(M | A) = 0.7

    P(M | ~A) = 0.01

    [BN diagram: B → A ← E, A → J, A → M;  P(B) = 0.001, P(E) = 0.002]

    To generate a sample X = (B, E, A, J, M):
    1. Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 set B = 1, else B = 0
    2. Sample E ∼ Ber(0.002) in the same way
    3. Sample A ∼ Ber(P(A | B, E)), then J ∼ Ber(P(J | A)) and M ∼ Ber(P(M | A))


    Say the inference task is P(B = 1 | E = 1, M = 1)

    Throw away all samples except those with (E = 1, M = 1)

    p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1(B(i) = 1)

    where m is the number of surviving samples

    Can be highly inefficient (note P(E = 1) is tiny)

    Does not work for Markov Random Fields
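    A sketch of forward sampling plus rejection for this query (illustrative Python using the alarm CPDs; note how few samples survive the evidence E = 1, M = 1):

    import random

    # Forward (ancestral) sampling of the alarm network, then rejection to
    # estimate P(B = 1 | E = 1, M = 1).
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

    def sample_once(rng):
        b = int(rng.random() < 0.001)
        e = int(rng.random() < 0.002)
        a = int(rng.random() < P_A[(b, e)])
        j = int(rng.random() < (0.9 if a else 0.05))
        m = int(rng.random() < (0.7 if a else 0.01))
        return b, e, a, j, m

    rng = random.Random(0)
    kept, hits = 0, 0
    for _ in range(1_000_000):
        b, e, a, j, m = sample_once(rng)
        if e == 1 and m == 1:        # rejection: keep only samples matching evidence
            kept += 1
            hits += b
    print(kept, hits / max(kept, 1))  # few survivors, since P(E = 1) is tiny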

    Gibbs Sampler Example: P(B = 1 | E = 1, M = 1)


    The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(xQ | XE)

    Works for both directed and undirected graphical models

    Initialization: fix the evidence; randomly set other variables, e.g. X(0) = (B=0, E=1, A=0, J=0, M=1)

    [Alarm network diagram and CPTs as before]

    Gibbs Update


    For each non-evidence variable xi, fixing all other nodes X−i, resample its value xi ∼ P(xi | X−i)

    This is equivalent to xi ∼ P(xi | MarkovBlanket(xi))
    For a Bayesian network, MarkovBlanket(xi) includes xi's parents, spouses, and children

    P(xi | MarkovBlanket(xi)) ∝ P(xi | Pa(xi)) ∏_{y∈C(xi)} P(y | Pa(y))

    where Pa(x) are the parents of x, and C(x) the children of x. For many graphical models the Markov blanket is small. For example,

    B ∼ P(B | E=1, A=0) ∝ P(B) P(A=0 | B, E=1)

    [Alarm network diagram and CPTs as before]


    Say we sampled B = 1. Then X(1) = (B=1, E=1, A=0, J=0, M=1)

    Starting from X(1), sample A ∼ P(A | B=1, E=1, J=0, M=1) to get X(2)

    Move on to J, then repeat B, A, J, B, A, J, . . .

    Keep all later samples. P(B=1 | E=1, M=1) is the fraction of samples with B = 1.

    [Alarm network diagram and CPTs as before]

    Gibbs Example 2: The Ising Model


    [Diagram: grid node xs with neighbors A, B, C, D]

    This is an undirected model with xs ∈ {0, 1}.

    p(x) = (1/Z) exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt )


    The Markov blanket of xs is {A, B, C, D}

    In general for undirected graphical models

    p(xs | x−s) = p(xs | xN(s))

    where N(s) is the set of neighbors of s.

    The Gibbs update is

    p(xs = 1 | xN(s)) = 1 / ( exp( −(θs + Σ_{t∈N(s)} θst xt) ) + 1 )
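    A Python sketch of this Gibbs update on a small grid Ising model (the grid size and parameter values are hand-picked for illustration):

    import math, random

    # Gibbs sampling for a small Ising model on an n x n grid, x_s in {0, 1}.
    # Hand-picked parameters: theta_s = 0, theta_st = 1 on every edge.
    n, theta_s, theta_st = 5, 0.0, 1.0
    rng = random.Random(0)
    x = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < n and 0 <= j + dj < n:
                yield i + di, j + dj

    def gibbs_sweep():
        for i in range(n):
            for j in range(n):
                field = theta_s + theta_st * sum(x[a][b] for a, b in neighbors(i, j))
                p1 = 1.0 / (1.0 + math.exp(-field))   # p(x_s = 1 | Markov blanket)
                x[i][j] = int(rng.random() < p1)

    for sweep in range(1000):
        gibbs_sweep()
    print(sum(map(sum, x)) / (n * n))   # fraction of 1s after 1000 sweeps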

    Gibbs Sampling as a Markov Chain


    A Markov chain is defined by a transition matrix T(X′ | X)

    Certain Markov chains have a stationary distribution π such that π = Tπ

    The Gibbs sampler is such a Markov chain, with Ti((X−i, x′i) | (X−i, xi)) = p(x′i | X−i), and stationary distribution p(xQ | XE)

    But it takes time for the chain to reach its stationary distribution (to mix); it can be difficult to assert mixing

    In practice, burn-in: discard X(0), . . . , X(T)

    Use all of X(T+1), . . . for inference (they are correlated). Do not thin.

    Collapsed Gibbs Sampling


    In general, Ep[f(X)] ≈ (1/m) Σ_{i=1}^m f(X(i)) if X(i) ∼ p

    Sometimes X = (Y, Z) where Z has closed-form operations

    If so,

    Ep[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y(i))}[f(Y(i), Z)]

    if Y(i) ∼ p(Y)

    No need to sample Z: it is collapsed. Collapsed Gibbs sampler: Ti((Y−i, y′i) | (Y−i, yi)) = p(y′i | Y−i)

    Note p(yi | Y−i) = ∫ p(yi, Z | Y−i) dZ

    Example: Collapsed Gibbs Sampling for LDA


    [LDA plate diagram as before]

    Collapse θ, φ. Gibbs update:

    P(zi = j | z−i, w) ∝ [ (n(wi)−i,j + β) / (n(·)−i,j + Wβ) ] × [ (n(di)−i,j + α) / (n(di)−i,· + Tα) ]

    n(wi)−i,j : number of times word wi has been assigned to topic j, excluding the current position

    n(di)−i,j : number of times a word from document di has been assigned to topic j, excluding the current position

    n(·)−i,j : number of times any word has been assigned to topic j, excluding the current position

    n(di)−i,· : length of document di, excluding the current position
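    A compact Python sketch of collapsed Gibbs for LDA on a toy corpus (illustrative: the corpus, W, T, α, β are made up; the document-length denominator is dropped since it is constant in j):

    import random

    # Collapsed Gibbs sampling for LDA on a toy corpus: phi and theta are
    # integrated out, only the topic assignments z are sampled.
    docs = [[0, 0, 1, 2], [1, 1, 2, 3], [3, 3, 4, 4]]   # word ids per document
    W, T, alpha, beta = 5, 2, 0.1, 0.01
    rng = random.Random(0)

    z = [[rng.randrange(T) for _ in d] for d in docs]            # topic assignments
    n_wt = [[0] * T for _ in range(W)]                           # word-topic counts
    n_dt = [[0] * T for d in docs]                               # doc-topic counts
    n_t = [0] * T                                                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]; n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

    for sweep in range(200):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                                      # remove current assignment
                n_wt[w][t] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                # unnormalized P(z_i = j | z_-i, w); doc-length denominator omitted
                weights = [(n_wt[w][j] + beta) / (n_t[j] + W * beta)
                           * (n_dt[d][j] + alpha) for j in range(T)]
                t = rng.choices(range(T), weights=weights)[0]    # sample new topic
                z[d][i] = t
                n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

    print(z)   # topic assignment of every word position after 200 sweeps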


    The Sum-Product Algorithm


    Also known as belief propagation (BP)

    Exact if the graph is a tree; otherwise known as loopy BP, approximate

    The algorithm involves passing messages on the factor graph

    Alternative view: variational approximation (more later)

    Example: A Simple HMM


    The Hidden Markov Model template (not a graphical model)

    [HMM template: two states with initial probabilities π1 = π2 = 1/2; transitions P(z′=1|z=1) = 1/4, P(z′=2|z=1) = 3/4, P(z′=1|z=2) = 1/2, P(z′=2|z=2) = 1/2; emissions P(x|z=1) = (1/2, 1/4, 1/4) and P(x|z=2) = (1/4, 1/2, 1/4) over x ∈ {R, G, B}]

    Observing x1 = R, x2 = G, the directed graphical model unrolls to z1 → z2 with observed children x1 = R and x2 = G

    Factor graph: f1 - z1 - f2 - z2, with f1(z1) = P(z1)P(x1|z1) and f2(z1, z2) = P(z2|z1)P(x2|z2)

    Messages


    A message is a vector of length K, where K is the number of values x takes. There are two types of messages:

    1. μ_{f→x}: message from a factor node f to a variable node x; μ_{f→x}(i) is the ith element, i = 1 . . . K.

    2. μ_{x→f}: message from a variable node x to a factor node f

    Leaf Messages


    Assume a tree factor graph. Pick an arbitrary root, say z2

    Start messages at the leaves. If a leaf is a factor node f, μ_{f→x}(x) = f(x)

    If a leaf is a variable node x, μ_{x→f}(x) = 1

    In the example factor graph f1 - z1 - f2 - z2:
    μ_{f1→z1}(z1 = 1) = P(z1 = 1)P(R | z1 = 1) = 1/2 × 1/2 = 1/4
    μ_{f1→z1}(z1 = 2) = P(z1 = 2)P(R | z1 = 2) = 1/2 × 1/4 = 1/8


    Message from Variable to Factor

    A node (factor or variable) can send out a message once all other incoming messages have arrived.

    Let x be in factor fs. Then

    μ_{x→fs}(x) = ∏_{f∈ne(x)\fs} μ_{f→x}(x)

    where ne(x)\fs are the factors connected to x excluding fs.

    In the example: μ_{z1→f2}(z1 = 1) = 1/4, μ_{z1→f2}(z1 = 2) = 1/8


    Message from Factor to Variable


    Let x be in factor fs. Let the other variables in fs be x1:M.

    μ_{fs→x}(x) = Σ_{x1} . . . Σ_{xM} fs(x, x1, . . . , xM) ∏_{m=1}^M μ_{xm→fs}(xm)

    In the example:

    μ_{f2→z2}(s) = Σ_{s′=1}^2 μ_{z1→f2}(s′) f2(z1 = s′, z2 = s)
                 = 1/4 × P(z2 = s | z1 = 1) P(x2 = G | z2 = s) + 1/8 × P(z2 = s | z1 = 2) P(x2 = G | z2 = s)

    We get μ_{f2→z2}(z2 = 1) = 1/32, μ_{f2→z2}(z2 = 2) = 1/8

    Up to Root, Back Down


    The message has reached the root, pass it back down

    μ_{z2→f2}(z2 = 1) = 1, μ_{z2→f2}(z2 = 2) = 1


    Keep Passing Down


    μ_{f2→z1}(s) = Σ_{s′=1}^2 μ_{z2→f2}(s′) f2(z1 = s, z2 = s′)
                 = 1 × P(z2 = 1 | z1 = s) P(x2 = G | z2 = 1) + 1 × P(z2 = 2 | z1 = s) P(x2 = G | z2 = 2)

    We get μ_{f2→z1}(z1 = 1) = 7/16, μ_{f2→z1}(z1 = 2) = 3/8


    From Messages to Marginals


    Once a variable receives all incoming messages, we compute its marginal as

    p(x) ∝ ∏_{f∈ne(x)} μ_{f→x}(x)

    In this example

    P(z1 | x1, x2) ∝ μ_{f1→z1} μ_{f2→z1} = (1/4 × 7/16, 1/8 × 3/8) = (7/64, 3/64) ∝ (0.7, 0.3)

    P(z2 | x1, x2) ∝ μ_{f2→z2} = (1/32, 1/8) ∝ (0.2, 0.8)

    One can also compute the marginal of the set of variables xs involved in a factor fs:

    p(xs) ∝ fs(xs) ∏_{x∈ne(fs)} μ_{x→fs}(x)
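    The numbers above can be reproduced in a few lines; this sketch hard-codes the two-node factor graph rather than implementing general BP:

    # Sum-product on the two-node factor graph f1 - z1 - f2 - z2 (x1 = R, x2 = G).
    pi = [0.5, 0.5]
    trans = [[0.25, 0.75], [0.5, 0.5]]           # trans[z1-1][z2-1] = P(z2 | z1)
    emit = {"R": [0.5, 0.25], "G": [0.25, 0.5]}  # emit[x][z-1] = P(x | z)

    f1 = [pi[z1] * emit["R"][z1] for z1 in range(2)]                 # mu_{f1->z1}
    z1_to_f2 = f1                                                    # mu_{z1->f2}
    f2_to_z2 = [sum(z1_to_f2[z1] * trans[z1][z2] * emit["G"][z2]
                    for z1 in range(2)) for z2 in range(2)]          # (1/32, 1/8)
    f2_to_z1 = [sum(1.0 * trans[z1][z2] * emit["G"][z2]
                    for z2 in range(2)) for z1 in range(2)]          # (7/16, 3/8)

    post_z1 = [a * b for a, b in zip(f1, f2_to_z1)]
    post_z2 = f2_to_z2
    print([p / sum(post_z1) for p in post_z1])   # [0.7, 0.3]
    print([p / sum(post_z2) for p in post_z2])   # [0.2, 0.8]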

    Handling Evidence


    Observing x = v:

    we can absorb it into the factor (as we did); or

    set messages μ_{x→f}(x) = 0 for all x ≠ v

    Observing XE, multiplying the incoming messages to x ∉ XE gives the joint (not p(x | XE)):

    p(x, XE) ∝ ∏_{f∈ne(x)} μ_{f→x}(x)

    The conditional is easily obtained by normalization:

    p(x | XE) = p(x, XE) / Σ_{x′} p(x′, XE)


    Example: The Ising Model


    [Diagram: grid of nodes; node parameter θs at xs, edge parameter θst on edge (s, t)]

    The random variables x take values in {0, 1}.

    p(x) = (1/Z) exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt )

    The Conditional



    Markovian: the conditional distribution for xs is

    p(xs | x−s) = p(xs | xN(s))

    where N(s) is the set of neighbors of s.

    This reduces to

    p(xs = 1 | xN(s)) = 1 / ( exp( −(θs + Σ_{t∈N(s)} θst xt) ) + 1 )

    Gibbs sampling would draw xs like this.

    The Mean Field Algorithm for Ising Model


    p(xs = 1 | xN(s)) = 1 / ( exp( −(θs + Σ_{t∈N(s)} θst xt) ) + 1 )

    Instead of Gibbs sampling, let μs be the estimated marginal p(xs = 1):

    μs ← 1 / ( exp( −(θs + Σ_{t∈N(s)} θst μt) ) + 1 )

    The μs are updated iteratively

    The mean field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).
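    A Python sketch of these mean field updates on a small grid Ising model (hand-picked parameters, same neighborhood structure as the Gibbs sketch earlier):

    import math

    # Mean field for a small Ising grid: iterate the update for mu_s until it stabilizes.
    n, theta_s, theta_st = 5, 0.0, 1.0
    mu = [[0.5] * n for _ in range(n)]          # initial marginals

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < n and 0 <= j + dj < n:
                yield i + di, j + dj

    for iteration in range(100):                # coordinate ascent sweeps
        for i in range(n):
            for j in range(n):
                field = theta_s + theta_st * sum(mu[a][b] for a, b in neighbors(i, j))
                mu[i][j] = 1.0 / (1.0 + math.exp(-field))

    print([round(m, 3) for m in mu[0]])         # estimated p(x_s = 1) for the first row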


    Exponential Family


    Let φ(X) = (φ1(X), . . . , φd(X)) be d sufficient statistics, where φi : X → R

    Note X is all the nodes in a graphical model

    φi(X) is sometimes called a feature function

    Let θ = (θ1, . . . , θd) ∈ R^d be the canonical parameters.

    The exponential family is a family of probability densities:

    p_θ(x) = exp( ⟨θ, φ(x)⟩ − A(θ) )


    p_θ(x) = exp( ⟨θ, φ(x)⟩ − A(θ) )

    The key is the inner product ⟨θ, φ(x)⟩ between the parameters θ and the sufficient statistics φ. A is the log partition function,

    A(θ) = log ∫ exp( ⟨θ, φ(x)⟩ ) ν(dx)

    i.e., A = log Z


    Exponential Family Example 1: Bernoulli


    p(x) = μ^x (1 − μ)^{1−x} for x ∈ {0, 1} and μ ∈ (0, 1). Does not look like an exponential family!

    Can be rewritten as p(x) = exp( x log μ + (1 − x) log(1 − μ) )

    Now in exponential family form with φ1(x) = x, φ2(x) = 1 − x, θ1 = log μ, θ2 = log(1 − μ), and A(θ) = 0.

    Overcomplete: the coefficient vector (1, 1) gives 1·φ1(x) + 1·φ2(x) = 1 for all x


    p(x) = exp( x log μ + (1 − x) log(1 − μ) )

    Can be further rewritten as p(x) = exp( xθ − log(1 + exp(θ)) )

    Minimal exponential family with φ(x) = x, θ = log( μ / (1 − μ) ), A(θ) = log(1 + exp(θ)).

    Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family, but not all (e.g., the Laplace distribution).
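    A short numerical check of this minimal form (μ = 0.3 is an arbitrary test value):

    import math

    mu = 0.3
    theta = math.log(mu / (1 - mu))            # canonical parameter
    A = math.log(1 + math.exp(theta))          # log partition function
    for x in (0, 1):
        p_exp_family = math.exp(x * theta - A)
        p_bernoulli = mu ** x * (1 - mu) ** (1 - x)
        print(x, p_exp_family, p_bernoulli)    # the two forms agree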

    Exponential Family Example 2: Ising Model



    p_θ(x) = exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt − A(θ) )

    Binary random variables xs ∈ {0, 1}

    d = |V| + |E| sufficient statistics: φ(x) = (. . . , xs, . . . , xs xt, . . .)

    This is a regular (Θ = R^d), minimal exponential family.

    Exponential Family Example 3: Potts Model


    Similar to the Ising model but generalizing xs ∈ {0, . . . , r − 1}.

    Indicator functions: fsj(x) = 1 if xs = j and 0 otherwise, and fstjk(x) = 1 if xs = j ∧ xt = k, and 0 otherwise.

    p_θ(x) = exp( Σ_{s,j} θsj fsj(x) + Σ_{s,t,j,k} θstjk fstjk(x) − A(θ) )

    d = r|V| + r^2|E|

    Regular but overcomplete, because Σ_{j=0}^{r−1} fsj(x) = 1 for any s ∈ V and all x.

    The Potts model is a special case where the parameters are tied: all θstkk share one value, and all θstjk with j ≠ k share another.


    Mean Parameters


    Let p be any density (not necessarily in exponential family).

    Given sufficient statistics φ, the mean parameters μ = (μ1, . . . , μd) are

    μi = Ep[φi(x)] = ∫ φi(x) p(x) dx

    The set of mean parameters

    M = { μ ∈ R^d | ∃p s.t. Ep[φ(x)] = μ }

    If μ(1), μ(2) ∈ M, there must exist p(1), p(2)

    A convex combination of p(1), p(2) leads to another mean parameter in M

    Therefore M is convex

    Example: The First Two Moments


    Let φ1(x) = x, φ2(x) = x^2

    For any p (not necessarily Gaussian) on x, the mean parameters are μ = (μ1, μ2) = (E(x), E(x^2)).

    Note V(x) = E(x^2) − E^2(x) = μ2 − μ1^2 ≥ 0 for any p

    M is not R^2, but rather the subset μ1 ∈ R, μ2 ≥ μ1^2.

    The Marginal Polytope


    The marginal polytope is defined for discrete x

    Recall M = { μ ∈ R^d | μ = Σ_x φ(x) p(x) for some p }

    p can be a point mass function on a particular x.

    In fact any p is a convex combination of such point mass functions.

    M = conv{ φ(x), ∀x } is a convex hull, called the marginal polytope.


    Conjugate Duality

    The conjugate dual function A* to A is defined as

    A*(μ) = sup_{θ∈Θ} ⟨θ, μ⟩ − A(θ)

    Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.

    For any μ in the interior of M, let θ(μ) satisfy E_{θ(μ)}[φ(x)] = ∇A(θ(μ)) = μ.

    Then A*(μ) = −H(p_{θ(μ)}), the negative entropy.

    The dual of the dual gives back A:

    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ − A*(μ)

    For all θ ∈ Θ, the supremum is attained uniquely at the μ ∈ M° given by the moment matching conditions μ = E_θ[φ(x)].

    Example: Conjugate Dual for Bernoulli


    Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Θ = R.

    By definition

    A*(μ) = sup_{θ∈R} θμ − log(1 + exp(θ))

    Taking the derivative and solving gives θ = log( μ / (1 − μ) ); substituting back,

    A*(μ) = μ log μ + (1 − μ) log(1 − μ)

    i.e., the negative entropy.


    The Difficulties with Variational Representation


    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ − A*(μ)

    Difficult to solve even though it is a convex problem

    Two issues:
        Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)
        The dual function A*(μ) usually does not admit an explicit form.

    Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.

    Next, we cast the mean field and sum-product algorithms as variational approximations.


    The Geometry of M(F)

    Let M(F) be the mean parameters of the fully factorized sub-family F. In general, M(F) ⊆ M


    Recall M is the convex hull of the extreme points {φ(x)}. It turns out the extreme points {φ(x)} are in M(F). Example:

        The tiny Ising model x1, x2 ∈ {0, 1} with φ = (x1, x2, x1x2)

        The point mass distribution p(x = (0, 1)) = 1 is realized as a limit of the series p_θ(x) = exp( θ1 x1 + θ2 x2 − A(θ) ) where θ1 → −∞ and θ2 → ∞.

        This series is in F because θ12 = 0. Hence the extreme point φ(x) = (0, 1, 0) is in M(F).


    Because the extreme points of M are in M(F), if M(F) were convex, we would have M = M(F).

    But in general M(F) is a proper subset of M

    Therefore, M(F) is a nonconvex inner set of M

    [Figure: M(F) drawn as a nonconvex inner subset of M, with extreme points φ(x) on the boundary of M]


    Example: Mean Field for Ising Model

    The mean parameters for the Ising model are the node and edge marginals: μs = p(xs = 1), μst = p(xs = 1, xt = 1)


    Fully factorized M(F) means no edges: μst = μs μt

    For M(F), the dual function A*(μ) has the simple form

    A*(μ) = −Σ_{s∈V} H(μs) = Σ_{s∈V} ( μs log μs + (1 − μs) log(1 − μs) )

    Thus the mean field problem is

    L(θ) = sup_{μ∈M(F)} ⟨θ, μ⟩ − Σ_{s∈V} ( μs log μs + (1 − μs) log(1 − μs) )
         = max_{(μ1...μm)∈[0,1]^m} Σ_{s∈V} θs μs + Σ_{(s,t)∈E} θst μs μt + Σ_{s∈V} H(μs)


    L(θ) = max_{(μ1...μm)∈[0,1]^m} Σ_{s∈V} θs μs + Σ_{(s,t)∈E} θst μs μt + Σ_{s∈V} H(μs)

    Bilinear in μ, not jointly concave

    But concave in a single dimension μs, fixing the others.

    Iterative coordinate-wise maximization: fix μt for t ≠ s and optimize μs.

    Setting the partial derivative w.r.t. μs to 0 yields:

    μs = 1 / ( 1 + exp( −(θs + Σ_{(s,t)∈E} θst μt) ) )

    as we've seen before.

    Caution: mean field converges to a local maximum depending on the initialization of μ1 . . . μm.

    The Sum-Product Algorithm as Variational Approximation


    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ − A*(μ)

    The sum-product algorithm makes two approximations:
        it relaxes M to an outer set L
        it replaces the dual A* with an approximation.

    A(θ) ≈ sup_{τ∈L} ⟨θ, τ⟩ − Ã*(τ), where Ã* is the approximate dual

    The Outer Relaxation

    For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals: μsj = p(xs = j), μstjk = p(xs = j, xt = k).

    The marginal polytope is M = { μ | ∃p with marginals μ }. Now consider τ ∈ R^d_+ satisfying node normalization and edge-node marginal consistency conditions:

        Σ_{j=0}^{r−1} τsj = 1   ∀s ∈ V

        Σ_{k=0}^{r−1} τstjk = τsj   ∀s, t ∈ V, j = 0 . . . r − 1

        Σ_{j=0}^{r−1} τstjk = τtk   ∀s, t ∈ V, k = 0 . . . r − 1

    Define L = { τ satisfying the above conditions }.


    Approximating A*


    Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:

    H_Bethe(τ) = −Σ_{s∈V} Σ_{j=0}^{r−1} τsj log τsj − Σ_{(s,t)∈E} Σ_{j,k} τstjk log( τstjk / (τsj τtk) )

    Note H_Bethe is not a true entropy. The second approximation in sum-product is to replace −A*(μ) with H_Bethe(τ).


    Summary: Variational Inference


    The sum-product algorithm (loopy belief propagation)

    The mean field method

    Not covered: Expectation Propagation

    Efficient computation. But often unknown bias in solution.


    Maximizing Problems

    Recall the HMM example


    There are two senses of best states z1:N given x1:N:

    1. So far we computed the marginal p(zn | x1:N)
       We can define best as z*n = argmax_k p(zn = k | x1:N)
       However z*1:N as a whole may not be the best
       In fact z*1:N can even have zero probability!

    2. An alternative is to find

       z*1:N = argmax_{z1:N} p(z1:N | x1:N)

       which finds the most likely state configuration as a whole
       The max-sum algorithm solves this; it generalizes the Viterbi algorithm for HMMs

    Intermediate: The Max-Product Algorithm

    Simple modification to the sum-product algorithm: replace Σ with max in the factor-to-variable messages.

    μ_{fs→x}(x) = max_{x1} . . . max_{xM} fs(x, x1, . . . , xM) ∏_{m=1}^M μ_{xm→fs}(xm)

    μ_{xm→fs}(xm) = ∏_{f∈ne(xm)\fs} μ_{f→xm}(xm)

    μ_{x_leaf→f}(x) = 1

    μ_{f_leaf→x}(x) = f(x)

    As in sum-product, pick an arbitrary variable node x as the root

    Pass messages up from the leaves until they reach the root

    Unlike sum-product, do not pass messages back from the root to the leaves

    At the root, multiply the incoming messages

    p_max = max_x ∏_{f∈ne(x)} μ_{f→x}(x)

    This is the probability of the most likely state configuration


    To identify the configuration itself, keep back pointers: when creating the message

    μ_{fs→x}(x) = max_{x1} . . . max_{xM} fs(x, x1, . . . , xM) ∏_{m=1}^M μ_{xm→fs}(xm)

    for each x value, we separately create M pointers back to the values of x1, . . . , xM that achieve the maximum.

    At the root, backtrack the pointers.


    Message from leaf f1:
    μ_{f1→z1}(z1 = 1) = P(z1 = 1)P(R | z1 = 1) = 1/2 × 1/2 = 1/4
    μ_{f1→z1}(z1 = 2) = P(z1 = 2)P(R | z1 = 2) = 1/2 × 1/4 = 1/8

    The second message: μ_{z1→f2}(z1 = 1) = 1/4, μ_{z1→f2}(z1 = 2) = 1/8


    μ_{f2→z2}(z2 = 1)
      = max_{z1} f2(z1, z2) μ_{z1→f2}(z1)
      = max_{z1} P(z2 = 1 | z1) P(x2 = G | z2 = 1) μ_{z1→f2}(z1)
      = max(1/4 × 1/4 × 1/4, 1/2 × 1/4 × 1/8) = 1/64

    Back pointer for z2 = 1: either z1 = 1 or z1 = 2


    The other element of the same message:

    μ_{f2→z2}(z2 = 2)
      = max_{z1} f2(z1, z2) μ_{z1→f2}(z1)
      = max_{z1} P(z2 = 2 | z1) P(x2 = G | z2 = 2) μ_{z1→f2}(z1)
      = max(3/4 × 1/2 × 1/4, 1/2 × 1/2 × 1/8) = 3/32

    Back pointer for z2 = 2: z1 = 1


    μ_{f2→z2} = (1/64 [back pointer z1 = 1 or 2], 3/32 [back pointer z1 = 1])

    At the root z2: max_{s=1,2} μ_{f2→z2}(s) = 3/32, attained at z2 = 2, with back pointer z1 = 1

    z*1:2 = argmax_{z1:2} p(z1:2 | x1:2) = (1, 2)

    In this example, sum-product and max-product produce the same best sequence; in general they differ.
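    The same answer falls out of a standard max-product (Viterbi-style) forward pass with back pointers; a sketch for this two-step HMM:

    # Max-product / Viterbi for the two-step HMM with observations x = (R, G).
    pi = [0.5, 0.5]
    trans = [[0.25, 0.75], [0.5, 0.5]]           # trans[z1-1][z2-1] = P(z2 | z1)
    emit = {"R": [0.5, 0.25], "G": [0.25, 0.5]}  # emit[x][z-1] = P(x | z)
    obs = ["R", "G"]

    # Forward pass: best score ending in each state, plus back pointers.
    score = [pi[z] * emit[obs[0]][z] for z in range(2)]
    backptr = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for z2 in range(2):
            cands = [score[z1] * trans[z1][z2] * emit[x][z2] for z1 in range(2)]
            best = max(range(2), key=lambda z1: cands[z1])
            new_score.append(cands[best]); ptr.append(best)
        score, backptr = new_score, backptr + [ptr]

    # Backtrack from the best final state.
    z = [max(range(2), key=lambda s: score[s])]
    for ptr in reversed(backptr):
        z.insert(0, ptr[z[0]])
    print(max(score), [s + 1 for s in z])   # 3/32 = 0.09375 and states (1, 2)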

    From Max-Product to Max-Sum

    The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow.

    μ_{fs→x}(x) = max_{x1...xM} [ log fs(x, x1, . . . , xM) + Σ_{m=1}^M μ_{xm→fs}(xm) ]

    μ_{xm→fs}(xm) = Σ_{f∈ne(xm)\fs} μ_{f→xm}(xm)

    μ_{x_leaf→f}(x) = 0

    μ_{f_leaf→x}(x) = log f(x)

    When at the root,

    log p_max = max_x Σ_{f∈ne(x)} μ_{f→x}(x)

    The back pointers are the same.


    Parameter Learning


    Assume the graph structure is given

    Learning in the exponential family: estimate θ from iid data x1 . . . xn.

    Principle: maximum likelihood

    Distinguish two cases:
        fully observed data: all dimensions of x are observed
        partially observed data: some dimensions of x are unobserved.


    Partially Observed Data

    Each item is (x, z), where x is observed and z is unobserved


    Full data (x1, z1) . . . (xn, zn), but we only observe x1 . . . xn

    The incomplete log likelihood ℓ(θ) = (1/n) Σ_{i=1}^n log p(xi), where p(xi) = ∫ p(xi, z) dz

    Can be written as ℓ(θ) = (1/n) Σ_{i=1}^n Axi(θ) − A(θ)

    New log partition function of p(z | xi), one per item:

    Axi(θ) = log ∫ exp( ⟨θ, φ(xi, z)⟩ ) dz

    Expectation-Maximization (EM) algorithm: lower-bound Axi

    EM as Variational Lower Bound

    Mean parameters realizable by any distribution on z while holding xi fixed:

    M_{xi} = { μ ∈ R^d | μ = Ep[φ(xi, z)] for some p }

    The variational definition: Axi(θ) = sup_{μ∈M_{xi}} ⟨θ, μ⟩ − A*_{xi}(μ)

    Trivial variational lower bound: Axi(θ) ≥ ⟨θ, μi⟩ − A*_{xi}(μi), for any μi ∈ M_{xi}

    Lower bound L on the incomplete log likelihood:

    ℓ(θ) = (1/n) Σ_{i=1}^n Axi(θ) − A(θ)
         ≥ (1/n) Σ_{i=1}^n [ ⟨θ, μi⟩ − A*_{xi}(μi) ] − A(θ)
         ≡ L(μ1, . . . , μn, θ)


    Exact EM: The M-Step

    In the M-step, maximize θ holding the μ's fixed:

    argmax_θ L(μ1, . . . , μn, θ) = argmax_θ ⟨θ, μ̄⟩ − A(θ), where μ̄ = (1/n) Σ_{i=1}^n μi

    The solution θ(μ̄) satisfies E_{θ(μ̄)}[φ(x)] = μ̄

    This is a standard fully observed maximum likelihood problem, hence the name M-step

    Variational EM

    For loopy graphs the E-step is often intractable.

    Can't maximize max_{μi∈M_{xi}} ⟨θ, μi⟩ − A*_{xi}(μi)

    Improve but not necessarily maximize: generalized EM

    The mean field method maximizes max_{μi∈M_{xi}(F)} ⟨θ, μi⟩ − A*_{xi}(μi) up to a local maximum

    Recall M_{xi}(F) is an inner approximation to M_{xi}

    The mean field E-step leads to generalized EM

    The sum-product algorithm does not lead to generalized EM


    Score-Based Structure Learning


    Let M̄ be the set of all allowed candidate features. Let M ⊆ M̄ be a log-linear model structure

    P(X | M, θ) = (1/Z) exp( Σ_{i∈M} θi fi(X) )

    A score for the model M can be max_θ ln P(Data | M, θ)

    The score is always better for a larger M; needs regularization

    M and θ are treated separately

    Structure Learning for Gaussian Random Fields

    Consider a p-dimensional multivariate Gaussian N(μ, Σ)

    The graphical model has p nodes x1, . . . , xp

    The edge between xi, xj is absent if and only if Ωij = 0, where Ω = Σ^{−1}

    Equivalently, xi, xj are conditionally independent given the other variables

    [Diagram: four nodes x1, x2, x3, x4; edges follow the nonzero pattern of Ω]


    For centered data, minimize a regularized negative log likelihood instead:

    −log |Ω| + (1/n) Σ_{i=1}^n X(i)^⊤ Ω X(i) + λ Σ_{i≠j} |Ωij|

    Known as the graphical lasso (glasso)
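    In practice one can fit this with an off-the-shelf solver; the sketch below assumes scikit-learn's GraphicalLasso estimator (class name, alpha penalty parameter, and precision_ attribute as in current scikit-learn releases) and simulated chain-structured data:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    # Fit a sparse precision matrix (glasso) to centered data simulated from a
    # chain-structured Gaussian.
    rng = np.random.default_rng(0)
    Omega_true = np.array([[ 2., -1.,  0.,  0.],
                           [-1.,  2., -1.,  0.],
                           [ 0., -1.,  2., -1.],
                           [ 0.,  0., -1.,  2.]])
    X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Omega_true), size=2000)

    model = GraphicalLasso(alpha=0.05).fit(X)   # alpha = l1 penalty weight (lambda)
    Omega_hat = model.precision_
    edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
             if abs(Omega_hat[i, j]) > 1e-3]
    print(edges)    # ideally recovers the chain edges (0,1), (1,2), (2,3)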
