All of Graphical Models


    Xiaojin Zhu

    Department of Computer Sciences, University of Wisconsin-Madison, USA

    Tutorial at ICMLA 2011


    The Whole Tutorial in One Slide

    Given GM = joint distribution p(x1, . . . , xn)

    Do inference = p(XQ | XE), where in general XQ, XE ⊆ {x1, . . . , xn}

    If p(x1, . . . , xn) is not given, estimate it from data


    Outline

    Life without Graphical Models

    Representation
        Directed Graphical Models (Bayesian Networks)
        Undirected Graphical Models (Markov Random Fields)

    Inference

    Exact Inference
    Markov Chain Monte Carlo
    Variational Inference

    Loopy Belief Propagation

    Mean Field Algorithm

    Exponential Family
    Maximizing Problems

    Parameter Learning

    Structure Learning


    Life without Graphical Models

    . . . is fine mathematically:

    The universe is reduced to a set of random variables x1, . . . , xn
    e.g., x1, . . . , xn−1 can be the discrete or continuous features
    e.g., xn ≡ y can be the discrete class label

    The joint p(x1, . . . , xn) completely describes how the universe works

    Machine learning: estimate p(x1, . . . , xn) from training data X(1), . . . , X(N), where X(i) = (x(i)1, . . . , x(i)n)

    Prediction: y = argmax_xn p(xn | x1, . . . , xn−1), a.k.a. inference, by the definition of conditional probability

    p(xn | x1, . . . , xn−1) = p(x1, . . . , xn−1, xn) / Σ_v p(x1, . . . , xn−1, xn = v)


    Conclusion

    Life without graphical models is just fine. So why are we still here?


    Life can be Better for Computer Scientists

    Given GM = joint distribution p(x1, . . . , xn)
        exponential naive storage (2^n for binary r.v.)
        hard to interpret (conditional independence)

    Do inference = p(XQ | XE), where in general XQ, XE ⊆ {x1, . . . , xn}
        Often can't do it computationally

    If p(x1, . . . , xn) is not given, estimate it from data

    Can't do it either


    Acknowledgments Before We Start

    Much of this tutorial is based on

    Koller & Friedman, Probabilistic Graphical Models. MIT 2009

    Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. FTML 2008

    Bishop, Pattern Recognition and Machine Learning. Springer 2006.


    Graphical-Model-Nots

    Graphical models are the study of probabilistic models

    Just because there is a graph with nodes and edges doesn't mean it's a GM

    These are not graphical models:

    neural network, decision tree, network flow, HMM template


    Bayesian Network

    A directed graph has nodes X = (x1, . . . , xn), some of them connected by directed edges xi → xj

    A cycle is a directed path x1 → . . . → xk where x1 = xk
    A directed acyclic graph (DAG) contains no cycles

    A Bayesian network on the DAG is a family of distributions satisfying

    { p | p(X) = ∏_i p(xi | Pa(xi)) }

    where Pa(xi) is the set of parents of xi.

    p(xi | Pa(xi)) is the conditional probability distribution (CPD) at xi

    By specifying the CPDs for all i, we specify a particular distribution p(X)


    Example: Alarm

    Binary variables

    P(A | B, E) = 0.95

    P(A | B, ~E) = 0.94

    P(A | ~B, E) = 0.29

    P(A | ~B, ~E) = 0.001

    P(J | A) = 0.9

    P(J | ~A) = 0.05

    P(M | A) = 0.7

    P(M | ~A) = 0.01

    [BN diagram: B → A ← E, A → J, A → M;  P(B) = 0.001, P(E) = 0.002]

    P(B, ¬E, A, J, ¬M)
      = P(B) P(¬E) P(A | B, ¬E) P(J | A) P(¬M | A)
      = 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7)
      ≈ 0.000253
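    As a quick check of this factorization, here is a small Python sketch (not part of the original slides; the dictionary encoding of the CPTs is an illustrative choice) that evaluates the same joint probability from the CPDs above.

    # Evaluate P(B, ~E, A, J, ~M) for the alarm network by multiplying CPDs.
    P_B, P_E = 0.001, 0.002
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
    P_J = {1: 0.9, 0: 0.05}    # P(J=1 | A)
    P_M = {1: 0.7, 0: 0.01}    # P(M=1 | A)

    def bernoulli(p, value):
        """Probability that a Bernoulli(p) variable takes `value` (0 or 1)."""
        return p if value == 1 else 1.0 - p

    def joint(b, e, a, j, m):
        """P(B=b, E=e, A=a, J=j, M=m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
        return (bernoulli(P_B, b) * bernoulli(P_E, e)
                * bernoulli(P_A[(b, e)], a)
                * bernoulli(P_J[a], j) * bernoulli(P_M[a], m))

    print(joint(b=1, e=0, a=1, j=1, m=0))  # ~0.000253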


    Example: Naive Bayes

    [Diagram: class node y with children x1, . . . , xd (left); equivalent plate notation (right)]

    p(y, x1, . . . , xd) = p(y) ∏_{i=1}^d p(xi | y)

    Used extensively in natural language processing. Plate representation on the right.
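    The same kind of factorized evaluation gives a tiny naive Bayes classifier; the sketch below is illustrative, with made-up CPD numbers.

    # Naive Bayes: p(y, x1..xd) = p(y) * prod_i p(xi | y); classify by argmax_y.
    p_y = {0: 0.6, 1: 0.4}                       # class prior (toy numbers)
    p_x_given_y = [                              # p(xi = 1 | y) for d = 3 features
        {0: 0.2, 1: 0.7},
        {0: 0.5, 1: 0.9},
        {0: 0.1, 1: 0.3},
    ]

    def joint(y, x):
        prob = p_y[y]
        for i, xi in enumerate(x):
            pi = p_x_given_y[i][y]
            prob *= pi if xi == 1 else 1.0 - pi
        return prob

    x = (1, 0, 1)
    posterior = {y: joint(y, x) for y in p_y}
    Z = sum(posterior.values())
    print({y: p / Z for y, p in posterior.items()})  # p(y | x) by normalization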


    No Causality Whatsoever

    P(A)=a

    P(B|A)=b

    P(B|~A)=c

    [Two BNs: A → B (left) and B → A (right)]

    P(B) = ab + (1−a)c

    P(A|B) = ab / (ab + (1−a)c)

    P(A|~B) = a(1−b) / (1 − ab − (1−a)c)

    The two BNs are equivalent in all respects

    Bayesian networks imply no causality at all

    They only encode the joint probability distribution (hence correlation)

    However, people tend to design BNs based on causal relations


    Example: Latent Dirichlet Allocation (LDA)

    [Plate diagram: β → φ (one per topic, plate T); α → θ (one per document, plate D); θ → z → w for each word position (plate Nd); φ → w]

    A generative model for p(φ, θ, z, w | α, β):
    For each topic t
        φt ∼ Dirichlet(β)
    For each document d
        θd ∼ Dirichlet(α)
        For each word position in d
            topic z ∼ Multinomial(θd)
            word w ∼ Multinomial(φz)

    Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)

    Some Topics by LDA on the Wish Corpus

    p(word | topic)

    [Figure: word lists for three example topics whose top words include troops, election, and love]


    Conditional Independence

    Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or P(A | B) = P(A) (the two are equivalent)

    Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or P(A | B, C) = P(A | C) (the two are equivalent)

    This extends to groups of r.v.s

    Conditional independence in a BN is precisely specified by d-separation (directed separation)


    d-Separation Case 1: Tail-to-Tail

    [Diagrams: A ← C → B, with C unobserved (left) and C observed (right)]

    A, B in general dependent

    A, B conditionally independent given C

    C is a tail-to-tail node, blocks the undirected path A-B


    d-Separation Case 2: Head-to-Tail

    [Diagrams: A → C → B, with C unobserved (left) and C observed (right)]

    A, B in general dependent
    A, B conditionally independent given C

    C is a head-to-tail node, blocks the path A-B


    d-Separation Case 3: Head-to-Head

    [Diagrams: A → C ← B, with C unobserved (left) and C observed (right)]

    A, B in general independent

    A, B conditionally dependent given C, or any of C's descendants

    C is a head-to-head node; given C (or a descendant), it unblocks the path A-B


    d-Separation

    Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked.
    A path is blocked if it includes a node x such that either
        the path is head-to-tail or tail-to-tail at x and x ∈ C, or
        the path is head-to-head at x, and neither x nor any of its descendants is in C.


    d-Separation Example 1

    The path from A to B is not blocked by either E or F, so A, B are dependent given C

    [Diagram: nodes A, B, C, E, F with the A-B path through E and F; C observed]


    d-Separation Example 2

    The path from A to B is blocked both at E and F, so A, B are conditionally independent given F

    [Diagram: nodes A, B, C, E, F with the A-B path through E and F; F observed]


    Markov Random Fields


    The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive

    A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)

    Define a nonnegative potential function ψC : X_C → R+

    An undirected graphical model (aka Markov Random Field) on the graph is a family of distributions satisfying

    { p | p(X) = (1/Z) ∏_C ψC(XC) }

    where Z = ∫ ∏_C ψC(XC) dX is the partition function

    Example: A Tiny Markov Random Field


    [Diagram: two nodes x1, x2 joined by an edge; the single clique is C = {x1, x2}]

    x1, x2 ∈ {−1, 1}
    A single clique potential ψC(x1, x2) = e^{a x1 x2}

    p(x1, x2) = (1/Z) e^{a x1 x2}

    Z = e^a + e^{−a} + e^{−a} + e^a = 2e^a + 2e^{−a}

    p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})
    p(1, −1) = p(−1, 1) = e^{−a} / (2e^a + 2e^{−a})

    When the parameter a > 0, homogeneous (same-sign) configurations are favored
    When the parameter a < 0, inhomogeneous (opposite-sign) configurations are favored
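    A short Python sketch (illustrative, not from the tutorial) that tabulates Z and the four probabilities for a given value of a:

    import math

    # Tiny MRF on x1, x2 in {-1, +1} with a single clique potential exp(a * x1 * x2).
    def mrf_table(a):
        vals = [-1, 1]
        unnorm = {(x1, x2): math.exp(a * x1 * x2) for x1 in vals for x2 in vals}
        Z = sum(unnorm.values())                  # Z = 2 e^a + 2 e^-a
        return {xy: w / Z for xy, w in unnorm.items()}

    print(mrf_table(1.0))   # a > 0: (1,1) and (-1,-1) are favored
    print(mrf_table(-1.0))  # a < 0: (1,-1) and (-1,1) are favored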


    A log-linear model:

    Real-valued feature functions f1(X), . . . , fk(X)

    Real-valued weights w1, . . . , wk

    p(X) = (1/Z) exp( Σ_{i=1}^k wi fi(X) )

    Example: The Ising Model


    [Diagram: grid of nodes; node parameter θs at node xs, edge parameter θst on edge (s, t)]

    This is an undirected model with xs ∈ {0, 1}.

    p(x) = (1/Z) exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt )

    fs(X) = xs, fst(X) = xs xt;  ws = θs, wst = θst


    Example: Gaussian Random Field


    p(X) ∼ N(μ, Σ):  p(X) = 1 / ((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2) (X − μ)^⊤ Σ^{−1} (X − μ) )

    Multivariate Gaussian

    The n × n covariance matrix Σ is positive semi-definite

    Let Ω = Σ^{−1} be the precision matrix

    xi, xj are conditionally independent given all other variables, if and only if Ωij = 0

    When Ωij ≠ 0, there is an edge between xi, xj
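    The sketch below (a hand-picked tridiagonal precision matrix, chosen for illustration) shows how the zero pattern of the precision matrix gives the graph, even though the covariance itself is dense:

    import numpy as np

    # Chain-structured Gaussian MRF: tridiagonal precision matrix Omega = Sigma^{-1}.
    Omega = np.array([[ 2.0, -1.0,  0.0,  0.0],
                      [-1.0,  2.0, -1.0,  0.0],
                      [ 0.0, -1.0,  2.0, -1.0],
                      [ 0.0,  0.0, -1.0,  2.0]])
    Sigma = np.linalg.inv(Omega)

    # Edges are exactly the nonzero off-diagonal entries of Omega:
    edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
             if abs(Omega[i, j]) > 1e-12]
    print(edges)               # [(0, 1), (1, 2), (2, 3)] -> a chain x1 - x2 - x3 - x4
    print(np.round(Sigma, 3))  # the covariance is dense: marginal dependence everywhere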

    Conditional Independence in Markov Random Fields


    Two groups of variables A, B are conditionally independent given another group C if, after removing C and all edges involving C, A and B become disconnected

    [Diagram: node groups A and B separated by group C]

    Factor Graph


    For both directed and undirected graphical models

    Bipartite: edges between a variable node and a factor node
    Factors represent computation

    [Diagrams: an undirected clique {A, B, C} with potential ψ(A, B, C) and its factor graph with a single factor f = ψ(A, B, C); a directed model over A, B, C and its factor graph with factor f = P(A)P(B)P(C|A, B)]


    Inference by Enumeration


    Let X = (XQ, XE, XO) for query, evidence, and other variables.

    Infer P(XQ | XE)

    By definition

    P(XQ | XE) = P(XQ, XE) / P(XE) = Σ_{XO} P(XQ, XE, XO) / Σ_{XQ,XO} P(XQ, XE, XO)

    Summing an exponential number of terms: with k variables in XO each taking r values, there are r^k terms
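    A brute-force sketch of this enumeration on the alarm network from earlier (illustrative Python; the CPDs are the ones on the Alarm slide):

    from itertools import product

    # Brute-force P(B = 1 | E = 1, M = 1) on the alarm network by enumeration.
    P_B, P_E = 0.001, 0.002
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
    P_J, P_M = {1: 0.9, 0: 0.05}, {1: 0.7, 0: 0.01}
    ber = lambda p, v: p if v == 1 else 1.0 - p

    def joint(b, e, a, j, m):
        return (ber(P_B, b) * ber(P_E, e) * ber(P_A[(b, e)], a)
                * ber(P_J[a], j) * ber(P_M[a], m))

    # Sum the joint over the "other" variables (A, J), and also over the query B.
    num = sum(joint(1, 1, a, j, 1) for a, j in product([0, 1], repeat=2))
    den = sum(joint(b, 1, a, j, 1) for b, a, j in product([0, 1], repeat=3))
    print(num / den)   # P(B = 1 | E = 1, M = 1)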

    Details of the summing problem


    There are a bunch of other variables x1, . . . , xk
    We sum over the r values each variable can take: Σ_{xi=v1}^{vr}
    This is exponential (r^k): Σ_{x1} . . . Σ_{xk}
    We want Σ_{x1} . . . Σ_{xk} p(X)

    For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^m fj(X(j))

    Each factor fj operates on X(j) ⊆ X

    Eliminating a Variable


    Rearrange factors Σ_{x1 . . . xk} f1 · · · fl f⁺_{l+1} · · · f⁺_m by whether x1 ∈ X(j)

    Obviously equivalent: Σ_{x2 . . . xk} f1 · · · fl ( Σ_{x1} f⁺_{l+1} · · · f⁺_m )

    Introduce a new factor f_{m+1} = Σ_{x1} f⁺_{l+1} · · · f⁺_m

    f_{m+1} contains the union of the variables in f⁺_{l+1} · · · f⁺_m except x1

    In fact, x1 disappears altogether in Σ_{x2 . . . xk} f1 · · · fl f_{m+1}

    Dynamic programming: compute f_{m+1} once, use it thereafter

    Hope: f_{m+1} contains very few variables

    Recursively eliminate other variables in turn

    Example: Chain Graph


    Chain graph: A → B → C → D

    Binary variables

    Say we want P(D) = Σ_{A,B,C} P(A)P(B|A)P(C|B)P(D|C)

    Let f1(A) = P(A). Note f1 is an array of size two: (P(A=0), P(A=1))

    f2(A, B) is a table of size four: P(B=0|A=0), P(B=0|A=1), P(B=1|A=0), P(B=1|A=1)

    Σ_{A,B,C} f1(A)f2(A,B)f3(B,C)f4(C,D) = Σ_{B,C} f3(B,C)f4(C,D) ( Σ_A f1(A)f2(A,B) )


    f1(A)f2(A, B) is an array of size four (match A values):
        P(A=0)P(B=0|A=0), P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0), P(A=1)P(B=1|A=1)

    f5(B) ≡ Σ_A f1(A)f2(A, B) is an array of size two:
        P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1)
        P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)

    For this example, f5(B) happens to be P(B)

    Σ_{B,C} f3(B,C)f4(C,D)f5(B) = Σ_C f4(C,D) ( Σ_B f3(B,C)f5(B) ), and so on

    In the end, f7(D) = (P(D=0), P(D=1))
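    A minimal Python sketch of this chain elimination; the CPT numbers here are hypothetical and only the structure matters:

    from itertools import product

    # Variable elimination on the chain A -> B -> C -> D with binary variables.
    # Hypothetical CPTs: f1(A) = P(A), f2(A,B) = P(B|A), f3(B,C) = P(C|B), f4(C,D) = P(D|C).
    f1 = {(a,): 0.6 if a == 0 else 0.4 for a in (0, 1)}
    f2 = {(a, b): [[0.7, 0.3], [0.2, 0.8]][a][b] for a, b in product((0, 1), repeat=2)}
    f3 = {(b, c): [[0.9, 0.1], [0.4, 0.6]][b][c] for b, c in product((0, 1), repeat=2)}
    f4 = {(c, d): [[0.5, 0.5], [0.1, 0.9]][c][d] for c, d in product((0, 1), repeat=2)}

    def eliminate_first(f_prev, f_edge):
        """Sum out the first variable: g(y) = sum_x f_prev(x) * f_edge(x, y)."""
        return {(y,): sum(f_prev[(x,)] * f_edge[(x, y)] for x in (0, 1)) for y in (0, 1)}

    f5 = eliminate_first(f1, f2)   # f5(B) = sum_A f1(A) f2(A,B)  (here equals P(B))
    f6 = eliminate_first(f5, f3)   # f6(C) = sum_B f5(B) f3(B,C)
    f7 = eliminate_first(f6, f4)   # f7(D) = P(D)
    print(f7)                      # {(0,): P(D=0), (1,): P(D=1)}; sums to 1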


    For the chain A → B → C → D:

    Computation for P(D) by elimination: 12 multiplications, 6 additions

    Enumeration: 48 multiplications, 14 additions

    Saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.

    Saving depends more critically on the graph structure (tree width); can be intractable

    Handling Evidence


    For evidence variables XE, simply plug in their value e

    Eliminate variables XO = X \ (XE ∪ XQ). The final factor will be the joint f(XQ) = P(XQ, XE = e)

    Normalize to answer the query:

    P(XQ | XE = e) = f(XQ) / Σ_{XQ} f(XQ)

    Summary: Exact Inference


    Enumeration

    Variable elimination

    Not covered: junction tree (aka clique tree)

    Exact, but intractable for large graphs


    Inference by Monte Carlo


    Consider the inference problem p(XQ = cQ | XE), where XQ, XE ⊆ {x1 . . . xn}

    p(XQ = cQ | XE) = ∫ 1(xQ = cQ) p(xQ | XE) dxQ

    If we can draw samples x(1)_Q, . . . , x(m)_Q ∼ p(xQ | XE), an unbiased estimator is

    p(XQ = cQ | XE) ≈ (1/m) Σ_{i=1}^m 1(x(i)_Q = cQ)

    The variance of the estimator decreases as V/m

    Inference reduces to sampling from p(xQ | XE)

    Forward Sampling Example


    P(A | B, E) = 0.95

    P(A | B, ~E) = 0.94

    P(A | ~B, E) = 0.29

    P(A | ~B, ~E) = 0.001

    P(J | A) = 0.9

    P(J | ~A) = 0.05

    P(M | A) = 0.7

    P(M | ~A) = 0.01

    [BN diagram: B → A ← E, A → J, A → M;  P(B) = 0.001, P(E) = 0.002]

    To generate a sample X = (B, E, A, J, M):
    1. Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 set B = 1, else B = 0
    2. Sample E ∼ Ber(0.002) in the same way
    3. Sample A ∼ Ber(P(A | B, E)), then J ∼ Ber(P(J | A)) and M ∼ Ber(P(M | A))


    Say the inference task is P(B = 1 | E = 1, M = 1)

    Throw away all samples except those with (E = 1, M = 1)

    p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1(B(i) = 1)

    where m is the number of surviving samples

    Can be highly inefficient (note P(E = 1) is tiny)

    Does not work for Markov Random Fields
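    A sketch of forward sampling plus rejection for this query (illustrative Python using the alarm CPDs; note how few samples survive the evidence E = 1, M = 1):

    import random

    # Forward (ancestral) sampling of the alarm network, then rejection to
    # estimate P(B = 1 | E = 1, M = 1).
    P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

    def sample_once(rng):
        b = int(rng.random() < 0.001)
        e = int(rng.random() < 0.002)
        a = int(rng.random() < P_A[(b, e)])
        j = int(rng.random() < (0.9 if a else 0.05))
        m = int(rng.random() < (0.7 if a else 0.01))
        return b, e, a, j, m

    rng = random.Random(0)
    kept, hits = 0, 0
    for _ in range(1_000_000):
        b, e, a, j, m = sample_once(rng)
        if e == 1 and m == 1:        # rejection: keep only samples matching evidence
            kept += 1
            hits += b
    print(kept, hits / max(kept, 1))  # few survivors, since P(E = 1) is tiny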

    Gibbs Sampler Example: P(B = 1 | E = 1, M = 1)


    The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(xQ | XE)

    Works for both directed and undirected graphical models

    Initialization: fix the evidence; randomly set other variables, e.g. X(0) = (B=0, E=1, A=0, J=0, M=1)

    [Alarm network diagram and CPTs as before]

    Gibbs Update


    For each non-evidence variable xi, fixing all other nodes X−i, resample its value xi ∼ P(xi | X−i)

    This is equivalent to xi ∼ P(xi | MarkovBlanket(xi))
    For a Bayesian network, MarkovBlanket(xi) includes xi's parents, spouses, and children

    P(xi | MarkovBlanket(xi)) ∝ P(xi | Pa(xi)) ∏_{y∈C(xi)} P(y | Pa(y))

    where Pa(x) are the parents of x, and C(x) the children of x. For many graphical models the Markov blanket is small. For example,

    B ∼ P(B | E=1, A=0) ∝ P(B) P(A=0 | B, E=1)

    [Alarm network diagram and CPTs as before]


    Say we sampled B = 1. Then X(1) = (B=1, E=1, A=0, J=0, M=1)

    Starting from X(1), sample A ∼ P(A | B=1, E=1, J=0, M=1) to get X(2)

    Move on to J, then repeat B, A, J, B, A, J, . . .

    Keep all later samples. P(B=1 | E=1, M=1) is the fraction of samples with B = 1.

    [Alarm network diagram and CPTs as before]

    Gibbs Example 2: The Ising Model


    [Diagram: grid node xs with neighbors A, B, C, D]

    This is an undirected model with xs ∈ {0, 1}.

    p(x) = (1/Z) exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt )


    The Markov blanket of xs is {A, B, C, D}

    In general for undirected graphical models

    p(xs | x−s) = p(xs | xN(s))

    where N(s) is the set of neighbors of s.

    The Gibbs update is

    p(xs = 1 | xN(s)) = 1 / ( exp( −(θs + Σ_{t∈N(s)} θst xt) ) + 1 )
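    A Python sketch of this Gibbs update on a small grid Ising model (the grid size and parameter values are hand-picked for illustration):

    import math, random

    # Gibbs sampling for a small Ising model on an n x n grid, x_s in {0, 1}.
    # Hand-picked parameters: theta_s = 0, theta_st = 1 on every edge.
    n, theta_s, theta_st = 5, 0.0, 1.0
    rng = random.Random(0)
    x = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < n and 0 <= j + dj < n:
                yield i + di, j + dj

    def gibbs_sweep():
        for i in range(n):
            for j in range(n):
                field = theta_s + theta_st * sum(x[a][b] for a, b in neighbors(i, j))
                p1 = 1.0 / (1.0 + math.exp(-field))   # p(x_s = 1 | Markov blanket)
                x[i][j] = int(rng.random() < p1)

    for sweep in range(1000):
        gibbs_sweep()
    print(sum(map(sum, x)) / (n * n))   # fraction of 1s after 1000 sweeps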

    Gibbs Sampling as a Markov Chain


    A Markov chain is defined by a transition matrix T(X′ | X)

    Certain Markov chains have a stationary distribution π such that π = Tπ

    The Gibbs sampler is such a Markov chain, with Ti((X−i, x′i) | (X−i, xi)) = p(x′i | X−i), and stationary distribution p(xQ | XE)

    But it takes time for the chain to reach its stationary distribution (to mix); it can be difficult to assert mixing

    In practice, burn-in: discard X(0), . . . , X(T)

    Use all of X(T+1), . . . for inference (they are correlated). Do not thin.

    Collapsed Gibbs Sampling


    In general, Ep[f(X)] ≈ (1/m) Σ_{i=1}^m f(X(i)) if X(i) ∼ p

    Sometimes X = (Y, Z) where Z has closed-form operations

    If so,

    Ep[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y(i))}[f(Y(i), Z)]

    if Y(i) ∼ p(Y)

    No need to sample Z: it is collapsed. Collapsed Gibbs sampler: Ti((Y−i, y′i) | (Y−i, yi)) = p(y′i | Y−i)

    Note p(yi | Y−i) = ∫ p(yi, Z | Y−i) dZ

    Example: Collapsed Gibbs Sampling for LDA


    [LDA plate diagram as before]

    Collapse θ, φ. Gibbs update:

    P(zi = j | z−i, w) ∝ [ (n(wi)−i,j + β) / (n(·)−i,j + Wβ) ] × [ (n(di)−i,j + α) / (n(di)−i,· + Tα) ]

    n(wi)−i,j : number of times word wi has been assigned to topic j, excluding the current position

    n(di)−i,j : number of times a word from document di has been assigned to topic j, excluding the current position

    n(·)−i,j : number of times any word has been assigned to topic j, excluding the current position

    n(di)−i,· : length of document di, excluding the current position
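    A compact Python sketch of collapsed Gibbs for LDA on a toy corpus (illustrative: the corpus, W, T, α, β are made up; the document-length denominator is dropped since it is constant in j):

    import random

    # Collapsed Gibbs sampling for LDA on a toy corpus: phi and theta are
    # integrated out, only the topic assignments z are sampled.
    docs = [[0, 0, 1, 2], [1, 1, 2, 3], [3, 3, 4, 4]]   # word ids per document
    W, T, alpha, beta = 5, 2, 0.1, 0.01
    rng = random.Random(0)

    z = [[rng.randrange(T) for _ in d] for d in docs]            # topic assignments
    n_wt = [[0] * T for _ in range(W)]                           # word-topic counts
    n_dt = [[0] * T for d in docs]                               # doc-topic counts
    n_t = [0] * T                                                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]; n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

    for sweep in range(200):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                                      # remove current assignment
                n_wt[w][t] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                # unnormalized P(z_i = j | z_-i, w); doc-length denominator omitted
                weights = [(n_wt[w][j] + beta) / (n_t[j] + W * beta)
                           * (n_dt[d][j] + alpha) for j in range(T)]
                t = rng.choices(range(T), weights=weights)[0]    # sample new topic
                z[d][i] = t
                n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

    print(z)   # topic assignment of every word position after 200 sweeps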


    The Sum-Product Algorithm


    Also known as belief propagation (BP)

    Exact if the graph is a tree; otherwise known as loopy BP, approximate

    The algorithm involves passing messages on the factor graph

    Alternative view: variational approximation (more later)

    Example: A Simple HMM


    The Hidden Markov Model template (not a graphical model)

    [HMM template: two states with initial probabilities π1 = π2 = 1/2; transitions P(z′=1|z=1) = 1/4, P(z′=2|z=1) = 3/4, P(z′=1|z=2) = 1/2, P(z′=2|z=2) = 1/2; emissions P(x|z=1) = (1/2, 1/4, 1/4) and P(x|z=2) = (1/4, 1/2, 1/4) over x ∈ {R, G, B}]

    Observing x1 = R, x2 = G, the directed graphical model unrolls to z1 → z2 with observed children x1 = R and x2 = G

    Factor graph: f1 - z1 - f2 - z2, with f1(z1) = P(z1)P(x1|z1) and f2(z1, z2) = P(z2|z1)P(x2|z2)

    Messages


    A message is a vector of length K, where K is the number of values x takes. There are two types of messages:

    1. μ_{f→x}: message from a factor node f to a variable node x; μ_{f→x}(i) is the ith element, i = 1 . . . K.

    2. μ_{x→f}: message from a variable node x to a factor node f

    Leaf Messages


    Assume a tree factor graph. Pick an arbitrary root, say z2

    Start messages at the leaves. If a leaf is a factor node f, μ_{f→x}(x) = f(x)

    If a leaf is a variable node x, μ_{x→f}(x) = 1

    In the example factor graph f1 - z1 - f2 - z2:
    μ_{f1→z1}(z1 = 1) = P(z1 = 1)P(R | z1 = 1) = 1/2 × 1/2 = 1/4
    μ_{f1→z1}(z1 = 2) = P(z1 = 2)P(R | z1 = 2) = 1/2 × 1/4 = 1/8


    Message from Variable to Factor

    A node (factor or variable) can send out a message once all other incoming messages have arrived.

    Let x be in factor fs. Then

    μ_{x→fs}(x) = ∏_{f∈ne(x)\fs} μ_{f→x}(x)

    where ne(x)\fs are the factors connected to x excluding fs.

    In the example: μ_{z1→f2}(z1 = 1) = 1/4, μ_{z1→f2}(z1 = 2) = 1/8


    Message from Factor to Variable


    Let x be in factor fs. Let the other variables in fs be x1:M.

    μ_{fs→x}(x) = Σ_{x1} . . . Σ_{xM} fs(x, x1, . . . , xM) ∏_{m=1}^M μ_{xm→fs}(xm)

    In the example:

    μ_{f2→z2}(s) = Σ_{s′=1}^2 μ_{z1→f2}(s′) f2(z1 = s′, z2 = s)
                 = 1/4 × P(z2 = s | z1 = 1) P(x2 = G | z2 = s) + 1/8 × P(z2 = s | z1 = 2) P(x2 = G | z2 = s)

    We get μ_{f2→z2}(z2 = 1) = 1/32, μ_{f2→z2}(z2 = 2) = 1/8

    Up to Root, Back Down


    The message has reached the root, pass it back down

    μ_{z2→f2}(z2 = 1) = 1, μ_{z2→f2}(z2 = 2) = 1


    Keep Passing Down


    μ_{f2→z1}(s) = Σ_{s′=1}^2 μ_{z2→f2}(s′) f2(z1 = s, z2 = s′)
                 = 1 × P(z2 = 1 | z1 = s) P(x2 = G | z2 = 1) + 1 × P(z2 = 2 | z1 = s) P(x2 = G | z2 = 2)

    We get μ_{f2→z1}(z1 = 1) = 7/16, μ_{f2→z1}(z1 = 2) = 3/8


    From Messages to Marginals


    Once a variable receives all incoming messages, we compute its marginal as

    p(x) ∝ ∏_{f∈ne(x)} μ_{f→x}(x)

    In this example

    P(z1 | x1, x2) ∝ μ_{f1→z1} μ_{f2→z1} = (1/4 × 7/16, 1/8 × 3/8) = (7/64, 3/64) ∝ (0.7, 0.3)

    P(z2 | x1, x2) ∝ μ_{f2→z2} = (1/32, 1/8) ∝ (0.2, 0.8)

    One can also compute the marginal of the set of variables xs involved in a factor fs:

    p(xs) ∝ fs(xs) ∏_{x∈ne(fs)} μ_{x→fs}(x)
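    The numbers above can be reproduced in a few lines; this sketch hard-codes the two-node factor graph rather than implementing general BP:

    # Sum-product on the two-node factor graph f1 - z1 - f2 - z2 (x1 = R, x2 = G).
    pi = [0.5, 0.5]
    trans = [[0.25, 0.75], [0.5, 0.5]]           # trans[z1-1][z2-1] = P(z2 | z1)
    emit = {"R": [0.5, 0.25], "G": [0.25, 0.5]}  # emit[x][z-1] = P(x | z)

    f1 = [pi[z1] * emit["R"][z1] for z1 in range(2)]                 # mu_{f1->z1}
    z1_to_f2 = f1                                                    # mu_{z1->f2}
    f2_to_z2 = [sum(z1_to_f2[z1] * trans[z1][z2] * emit["G"][z2]
                    for z1 in range(2)) for z2 in range(2)]          # (1/32, 1/8)
    f2_to_z1 = [sum(1.0 * trans[z1][z2] * emit["G"][z2]
                    for z2 in range(2)) for z1 in range(2)]          # (7/16, 3/8)

    post_z1 = [a * b for a, b in zip(f1, f2_to_z1)]
    post_z2 = f2_to_z2
    print([p / sum(post_z1) for p in post_z1])   # [0.7, 0.3]
    print([p / sum(post_z2) for p in post_z2])   # [0.2, 0.8]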

    Handling Evidence


    Observing x = v:

    we can absorb it into the factor (as we did); or

    set messages μ_{x→f}(x) = 0 for all x ≠ v

    Observing XE, multiplying the incoming messages to x ∉ XE gives the joint (not p(x | XE)):

    p(x, XE) ∝ ∏_{f∈ne(x)} μ_{f→x}(x)

    The conditional is easily obtained by normalization:

    p(x | XE) = p(x, XE) / Σ_{x′} p(x′, XE)


    Example: The Ising Model


    [Diagram: grid of nodes; node parameter θs at xs, edge parameter θst on edge (s, t)]

    The random variables x take values in {0, 1}.

    p(x) = (1/Z) exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt )

    The Conditional



    Markovian: the conditional distribution for xs is

    p(xs | x−s) = p(xs | xN(s))

    where N(s) is the set of neighbors of s.

    This reduces to

    p(xs = 1 | xN(s)) = 1 / ( exp( −(θs + Σ_{t∈N(s)} θst xt) ) + 1 )

    Gibbs sampling would draw xs like this.

    The Mean Field Algorithm for Ising Model


    p(xs = 1 | xN(s)) = 1 / ( exp( −(θs + Σ_{t∈N(s)} θst xt) ) + 1 )

    Instead of Gibbs sampling, let μs be the estimated marginal p(xs = 1):

    μs ← 1 / ( exp( −(θs + Σ_{t∈N(s)} θst μt) ) + 1 )

    The μs are updated iteratively

    The mean field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).
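    A Python sketch of these mean field updates on a small grid Ising model (hand-picked parameters, same neighborhood structure as the Gibbs sketch earlier):

    import math

    # Mean field for a small Ising grid: iterate the update for mu_s until it stabilizes.
    n, theta_s, theta_st = 5, 0.0, 1.0
    mu = [[0.5] * n for _ in range(n)]          # initial marginals

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < n and 0 <= j + dj < n:
                yield i + di, j + dj

    for iteration in range(100):                # coordinate ascent sweeps
        for i in range(n):
            for j in range(n):
                field = theta_s + theta_st * sum(mu[a][b] for a, b in neighbors(i, j))
                mu[i][j] = 1.0 / (1.0 + math.exp(-field))

    print([round(m, 3) for m in mu[0]])         # estimated p(x_s = 1) for the first row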


    Exponential Family


    Let φ(X) = (φ1(X), . . . , φd(X)) be d sufficient statistics, where φi : X → R

    Note X is all the nodes in a graphical model

    φi(X) is sometimes called a feature function

    Let θ = (θ1, . . . , θd) ∈ R^d be the canonical parameters.

    The exponential family is a family of probability densities:

    p_θ(x) = exp( ⟨θ, φ(x)⟩ − A(θ) )


    p_θ(x) = exp( ⟨θ, φ(x)⟩ − A(θ) )

    The key is the inner product ⟨θ, φ(x)⟩ between the parameters θ and the sufficient statistics φ. A is the log partition function,

    A(θ) = log ∫ exp( ⟨θ, φ(x)⟩ ) ν(dx)

    i.e., A = log Z


    Exponential Family Example 1: Bernoulli


    p(x) = μ^x (1 − μ)^{1−x} for x ∈ {0, 1} and μ ∈ (0, 1). Does not look like an exponential family!

    Can be rewritten as p(x) = exp( x log μ + (1 − x) log(1 − μ) )

    Now in exponential family form with φ1(x) = x, φ2(x) = 1 − x, θ1 = log μ, θ2 = log(1 − μ), and A(θ) = 0.

    Overcomplete: the coefficient vector (1, 1) gives 1·φ1(x) + 1·φ2(x) = 1 for all x


    p(x) = exp( x log μ + (1 − x) log(1 − μ) )

    Can be further rewritten as p(x) = exp( xθ − log(1 + exp(θ)) )

    Minimal exponential family with φ(x) = x, θ = log( μ / (1 − μ) ), A(θ) = log(1 + exp(θ)).

    Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family, but not all (e.g., the Laplace distribution).
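    A short numerical check of this minimal form (μ = 0.3 is an arbitrary test value):

    import math

    mu = 0.3
    theta = math.log(mu / (1 - mu))            # canonical parameter
    A = math.log(1 + math.exp(theta))          # log partition function
    for x in (0, 1):
        p_exp_family = math.exp(x * theta - A)
        p_bernoulli = mu ** x * (1 - mu) ** (1 - x)
        print(x, p_exp_family, p_bernoulli)    # the two forms agree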

    Exponential Family Example 2: Ising Model



    p_θ(x) = exp( Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt − A(θ) )

    Binary random variables xs ∈ {0, 1}

    d = |V| + |E| sufficient statistics: φ(x) = (. . . , xs, . . . , xs xt, . . .)

    This is a regular (Θ = R^d), minimal exponential family.

    Exponential Family Example 3: Potts Model


    Similar to the Ising model but generalizing xs ∈ {0, . . . , r − 1}.

    Indicator functions: fsj(x) = 1 if xs = j and 0 otherwise, and fstjk(x) = 1 if xs = j ∧ xt = k, and 0 otherwise.

    p_θ(x) = exp( Σ_{s,j} θsj fsj(x) + Σ_{s,t,j,k} θstjk fstjk(x) − A(θ) )

    d = r|V| + r^2|E|

    Regular but overcomplete, because Σ_{j=0}^{r−1} fsj(x) = 1 for any s ∈ V and all x.

    The Potts model is a special case where the parameters are tied: all θstkk share one value, and all θstjk with j ≠ k share another.


    Mean Parameters


    Let p be any density (not necessarily in exponential family).

    Given sufficient statistics φ, the mean parameters μ = (μ1, . . . , μd) are

    μi = Ep[φi(x)] = ∫ φi(x) p(x) dx

    The set of mean parameters

    M = { μ ∈ R^d | ∃p s.t. Ep[φ(x)] = μ }

    If μ(1), μ(2) ∈ M, there must exist p(1), p(2)

    A convex combination of p(1), p(2) leads to another mean parameter in M

    Therefore M is convex

    Example: The First Two Moments


    Let φ1(x) = x, φ2(x) = x^2

    For any p (not necessarily Gaussian) on x, the mean parameters are μ = (μ1, μ2) = (E(x), E(x^2)).

    Note V(x) = E(x^2) − E^2(x) = μ2 − μ1^2 ≥ 0 for any p

    M is not R^2, but rather the subset μ1 ∈ R, μ2 ≥ μ1^2.

    The Marginal Polytope


    The marginal polytope is defined for discrete x

    Recall M = { μ ∈ R^d | μ = Σ_x φ(x) p(x) for some p }

    p can be a point mass function on a particular x.

    In fact any p is a convex combination of such point mass functions.

    M = conv{ φ(x), ∀x } is a convex hull, called the marginal polytope.


    Conjugate Duality

    The conjugate dual function A* to A is defined as

    A*(μ) = sup_{θ∈Θ} ⟨θ, μ⟩ − A(θ)

    Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.

    For any μ in the interior of M, let θ(μ) satisfy E_{θ(μ)}[φ(x)] = ∇A(θ(μ)) = μ.

    Then A*(μ) = −H(p_{θ(μ)}), the negative entropy.

    The dual of the dual gives back A:

    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ − A*(μ)

    For all θ ∈ Θ, the supremum is attained uniquely at the μ ∈ M° given by the moment matching conditions μ = E_θ[φ(x)].

    Example: Conjugate Dual for Bernoulli


    Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Θ = R.

    By definition

    A*(μ) = sup_{θ∈R} θμ − log(1 + exp(θ))

    Taking the derivative and solving gives θ = log( μ / (1 − μ) ); substituting back,

    A*(μ) = μ log μ + (1 − μ) log(1 − μ)

    i.e., the negative entropy.


    The Difficulties with Variational Representation


    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ − A*(μ)

    Difficult to solve even though it is a convex problem

    Two issues:
        Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)
        The dual function A*(μ) usually does not admit an explicit form.

    Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.

    Next, we cast the mean field and sum-product algorithms as variational approximations.


    The Geometry of M(F)

    Let M(F) be the mean parameters of the fully factorized sub-family F. In general, M(F) ⊆ M


    Recall M is the convex hull of the extreme points {φ(x)}. It turns out the extreme points {φ(x)} are in M(F). Example:

        The tiny Ising model x1, x2 ∈ {0, 1} with φ = (x1, x2, x1x2)

        The point mass distribution p(x = (0, 1)) = 1 is realized as a limit of the series p_θ(x) = exp( θ1 x1 + θ2 x2 − A(θ) ) where θ1 → −∞ and θ2 → ∞.

        This series is in F because θ12 = 0. Hence the extreme point φ(x) = (0, 1, 0) is in M(F).


    Because the extreme points of M are in M(F), if M(F) were convex, we would have M = M(F).

    But in general M(F) is a proper subset of M

    Therefore, M(F) is a nonconvex inner set of M

    [Figure: M(F) drawn as a nonconvex inner subset of M, with extreme points φ(x) on the boundary of M]


    Example: Mean Field for Ising Model

    The mean parameters for the Ising model are the node and edge marginals: μs = p(xs = 1), μst = p(xs = 1, xt = 1)


    Fully factorized M(F) means no edges: μst = μs μt

    For M(F), the dual function A*(μ) has the simple form

    A*(μ) = −Σ_{s∈V} H(μs) = Σ_{s∈V} ( μs log μs + (1 − μs) log(1 − μs) )

    Thus the mean field problem is

    L(θ) = sup_{μ∈M(F)} ⟨θ, μ⟩ − Σ_{s∈V} ( μs log μs + (1 − μs) log(1 − μs) )
         = max_{(μ1...μm)∈[0,1]^m} Σ_{s∈V} θs μs + Σ_{(s,t)∈E} θst μs μt + Σ_{s∈V} H(μs)


    L(θ) = max_{(μ1...μm)∈[0,1]^m} Σ_{s∈V} θs μs + Σ_{(s,t)∈E} θst μs μt + Σ_{s∈V} H(μs)

    Bilinear in μ, not jointly concave

    But concave in a single dimension μs, fixing the others.

    Iterative coordinate-wise maximization: fix μt for t ≠ s and optimize μs.

    Setting the partial derivative w.r.t. μs to 0 yields:

    μs = 1 / ( 1 + exp( −(θs + Σ_{(s,t)∈E} θst μt) ) )

    as we've seen before.

    Caution: mean field converges to a local maximum depending on the initialization of μ1 . . . μm.

    The Sum-Product Algorithm as Variational Approximation


    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ − A*(μ)

    The sum-product algorithm makes two approximations:
        it relaxes M to an outer set L
        it replaces the dual A* with an approximation.

    A(θ) ≈ sup_{τ∈L} ⟨θ, τ⟩ − Ã*(τ), where Ã* is the approximate dual

    The Outer Relaxation

    For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals: μsj = p(xs = j), μstjk = p(xs = j, xt = k).

    The marginal polytope is M = { μ | ∃p with marginals μ }. Now consider τ ∈ R^d_+ satisfying node normalization and edge-node marginal consistency conditions:

        Σ_{j=0}^{r−1} τsj = 1   ∀s ∈ V

        Σ_{k=0}^{r−1} τstjk = τsj   ∀s, t ∈ V, j = 0 . . . r − 1

        Σ_{j=0}^{r−1} τstjk = τtk   ∀s, t ∈ V, k = 0 . . . r − 1

    Define L = { τ satisfying the above conditions }.


    Approximating A*


    Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:

    H_Bethe(τ) = −Σ_{s∈V} Σ_{j=0}^{r−1} τsj log τsj − Σ_{(s,t)∈E} Σ_{j,k} τstjk log( τstjk / (τsj τtk) )

    Note H_Bethe is not a true entropy. The second approximation in sum-product is to replace −A*(μ) with H_Bethe(τ).


    Summary: Variational Inference


    The sum-product algorithm (loopy belief propagation)

    The mean field method

    Not covered: Expectation Propagation

    Efficient computation. But often unknown bias in solution.


    Maximizing Problems

    Recall the HMM example


    There are two senses of best states z1:N given x1:N:

    1. So far we computed the marginal p(zn | x1:N)
       We can define best as z*n = argmax_k p(zn = k | x1:N)
       However z*1:N as a whole may not be the best
       In fact z*1:N can even have zero probability!

    2. An alternative is to find

       z*1:N = argmax_{z1:N} p(z1:N | x1:N)

       which finds the most likely state configuration as a whole
       The max-sum algorithm solves this; it generalizes the Viterbi algorithm for HMMs

    Intermediate: The Max-Product Algorithm

    Simple modification to the sum-product algorithm: replace Σ with max in the factor-to-variable messages.

    μ_{fs→x}(x) = max_{x1} . . . max_{xM} fs(x, x1, . . . , xM) ∏_{m=1}^M μ_{xm→fs}(xm)

    μ_{xm→fs}(xm) = ∏_{f∈ne(xm)\fs} μ_{f→xm}(xm)

    μ_{x_leaf→f}(x) = 1

    μ_{f_leaf→x}(x) = f(x)

    As in sum-product, pick an arbitrary variable node x as the root

    Pass messages up from the leaves until they reach the root

    Unlike sum-product, do not pass messages back from the root to the leaves

    At the root, multiply the incoming messages

    p_max = max_x ∏_{f∈ne(x)} μ_{f→x}(x)

    This is the probability of the most likely state configuration


    To identify the configuration itself, keep back pointers: when creating the message

    μ_{fs→x}(x) = max_{x1} . . . max_{xM} fs(x, x1, . . . , xM) ∏_{m=1}^M μ_{xm→fs}(xm)

    for each x value, we separately create M pointers back to the values of x1, . . . , xM that achieve the maximum.

    At the root, backtrack the pointers.


    Message from leaf f1:
    μ_{f1→z1}(z1 = 1) = P(z1 = 1)P(R | z1 = 1) = 1/2 × 1/2 = 1/4
    μ_{f1→z1}(z1 = 2) = P(z1 = 2)P(R | z1 = 2) = 1/2 × 1/4 = 1/8

    The second message: μ_{z1→f2}(z1 = 1) = 1/4, μ_{z1→f2}(z1 = 2) = 1/8


    μ_{f2→z2}(z2 = 1)
      = max_{z1} f2(z1, z2) μ_{z1→f2}(z1)
      = max_{z1} P(z2 = 1 | z1) P(x2 = G | z2 = 1) μ_{z1→f2}(z1)
      = max(1/4 × 1/4 × 1/4, 1/2 × 1/4 × 1/8) = 1/64

    Back pointer for z2 = 1: either z1 = 1 or z1 = 2


    The other element of the same message:

    μ_{f2→z2}(z2 = 2)
      = max_{z1} f2(z1, z2) μ_{z1→f2}(z1)
      = max_{z1} P(z2 = 2 | z1) P(x2 = G | z2 = 2) μ_{z1→f2}(z1)
      = max(3/4 × 1/2 × 1/4, 1/2 × 1/2 × 1/8) = 3/32

    Back pointer for z2 = 2: z1 = 1


    μ_{f2→z2} = (1/64 [back pointer z1 = 1 or 2], 3/32 [back pointer z1 = 1])

    At the root z2: max_{s=1,2} μ_{f2→z2}(s) = 3/32, attained at z2 = 2, with back pointer z1 = 1

    z*1:2 = argmax_{z1:2} p(z1:2 | x1:2) = (1, 2)

    In this example, sum-product and max-product produce the same best sequence; in general they differ.
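    The same answer falls out of a standard max-product (Viterbi-style) forward pass with back pointers; a sketch for this two-step HMM:

    # Max-product / Viterbi for the two-step HMM with observations x = (R, G).
    pi = [0.5, 0.5]
    trans = [[0.25, 0.75], [0.5, 0.5]]           # trans[z1-1][z2-1] = P(z2 | z1)
    emit = {"R": [0.5, 0.25], "G": [0.25, 0.5]}  # emit[x][z-1] = P(x | z)
    obs = ["R", "G"]

    # Forward pass: best score ending in each state, plus back pointers.
    score = [pi[z] * emit[obs[0]][z] for z in range(2)]
    backptr = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for z2 in range(2):
            cands = [score[z1] * trans[z1][z2] * emit[x][z2] for z1 in range(2)]
            best = max(range(2), key=lambda z1: cands[z1])
            new_score.append(cands[best]); ptr.append(best)
        score, backptr = new_score, backptr + [ptr]

    # Backtrack from the best final state.
    z = [max(range(2), key=lambda s: score[s])]
    for ptr in reversed(backptr):
        z.insert(0, ptr[z[0]])
    print(max(score), [s + 1 for s in z])   # 3/32 = 0.09375 and states (1, 2)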

    From Max-Product to Max-Sum

    The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow.

    μ_{fs→x}(x) = max_{x1...xM} [ log fs(x, x1, . . . , xM) + Σ_{m=1}^M μ_{xm→fs}(xm) ]

    μ_{xm→fs}(xm) = Σ_{f∈ne(xm)\fs} μ_{f→xm}(xm)

    μ_{x_leaf→f}(x) = 0

    μ_{f_leaf→x}(x) = log f(x)

    When at the root,

    log p_max = max_x Σ_{f∈ne(x)} μ_{f→x}(x)

    The back pointers are the same.


    Parameter Learning


    Assume the graph structure is given

    Learning in the exponential family: estimate θ from iid data x1 . . . xn.

    Principle: maximum likelihood

    Distinguish two cases:
        fully observed data: all dimensions of x are observed
        partially observed data: some dimensions of x are unobserved.


    Partially Observed Data

    Each item is (x, z), where x is observed and z is unobserved


    Full data (x1, z1) . . . (xn, zn), but we only observe x1 . . . xn

    The incomplete log likelihood ℓ(θ) = (1/n) Σ_{i=1}^n log p(xi), where p(xi) = ∫ p(xi, z) dz

    Can be written as ℓ(θ) = (1/n) Σ_{i=1}^n Axi(θ) − A(θ)

    New log partition function of p(z | xi), one per item:

    Axi(θ) = log ∫ exp( ⟨θ, φ(xi, z)⟩ ) dz

    Expectation-Maximization (EM) algorithm: lower-bound Axi

    EM as Variational Lower Bound

    Mean parameters realizable by any distribution on z while holding xi fixed:

    M_{xi} = { μ ∈ R^d | μ = Ep[φ(xi, z)] for some p }

    The variational definition: Axi(θ) = sup_{μ∈M_{xi}} ⟨θ, μ⟩ − A*_{xi}(μ)

    Trivial variational lower bound: Axi(θ) ≥ ⟨θ, μi⟩ − A*_{xi}(μi), for any μi ∈ M_{xi}

    Lower bound L on the incomplete log likelihood:

    ℓ(θ) = (1/n) Σ_{i=1}^n Axi(θ) − A(θ)
         ≥ (1/n) Σ_{i=1}^n [ ⟨θ, μi⟩ − A*_{xi}(μi) ] − A(θ)
         ≡ L(μ1, . . . , μn, θ)


    Exact EM: The M-Step

    In the M-step, maximize θ holding the μ's fixed:

    argmax_θ L(μ1, . . . , μn, θ) = argmax_θ ⟨θ, μ̄⟩ − A(θ), where μ̄ = (1/n) Σ_{i=1}^n μi

    The solution θ(μ̄) satisfies E_{θ(μ̄)}[φ(x)] = μ̄

    This is a standard fully observed maximum likelihood problem, hence the name M-step

    Variational EM

    For loopy graphs the E-step is often intractable.

    Can't maximize max_{μi∈M_{xi}} ⟨θ, μi⟩ − A*_{xi}(μi)

    Improve but not necessarily maximize: generalized EM

    The mean field method maximizes max_{μi∈M_{xi}(F)} ⟨θ, μi⟩ − A*_{xi}(μi) up to a local maximum

    Recall M_{xi}(F) is an inner approximation to M_{xi}

    The mean field E-step leads to generalized EM

    The sum-product algorithm does not lead to generalized EM


    Score-Based Structure Learning


    Let M̄ be the set of all allowed candidate features. Let M ⊆ M̄ be a log-linear model structure

    P(X | M, θ) = (1/Z) exp( Σ_{i∈M} θi fi(X) )

    A score for the model M can be max_θ ln P(Data | M, θ)

    The score is always better for a larger M; needs regularization

    M and θ are treated separately

    Structure Learning for Gaussian Random Fields

    Consider a p-dimensional multivariate Gaussian N(μ, Σ)

    The graphical model has p nodes x1, . . . , xp

    The edge between xi, xj is absent if and only if Ωij = 0, where Ω = Σ^{−1}

    Equivalently, xi, xj are conditionally independent given the other variables

    [Diagram: four nodes x1, x2, x3, x4; edges follow the nonzero pattern of Ω]


    For centered data, minimize a regularized negative log likelihood instead:

    −log |Ω| + (1/n) Σ_{i=1}^n X(i)^⊤ Ω X(i) + λ Σ_{i≠j} |Ωij|

    Known as the graphical lasso (glasso)
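    In practice one can fit this with an off-the-shelf solver; the sketch below assumes scikit-learn's GraphicalLasso estimator (class name, alpha penalty parameter, and precision_ attribute as in current scikit-learn releases) and simulated chain-structured data:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    # Fit a sparse precision matrix (glasso) to centered data simulated from a
    # chain-structured Gaussian.
    rng = np.random.default_rng(0)
    Omega_true = np.array([[ 2., -1.,  0.,  0.],
                           [-1.,  2., -1.,  0.],
                           [ 0., -1.,  2., -1.],
                           [ 0.,  0., -1.,  2.]])
    X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Omega_true), size=2000)

    model = GraphicalLasso(alpha=0.05).fit(X)   # alpha = l1 penalty weight (lambda)
    Omega_hat = model.precision_
    edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
             if abs(Omega_hat[i, j]) > 1e-3]
    print(edges)    # ideally recovers the chain edges (0,1), (1,2), (2,3)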
