Introduction to Graphical Models
Brookes Vision Lab Reading Group
Graphical Models
• To build a complex system using simpler parts.
• System should be consistent
• Parts are combined using probability
• Undirected – Markov random fields
• Directed – Bayesian Networks
Overview
• Representation
• Inference
• Linear Gaussian Models
• Approximate inference
• Learning
Causality : Sprinkler “causes” wet grass
Representation
Conditional Independence
• Independent of ancestors given parents
• P(C,S,R,W) = P(C) P(S|C) P(R|C,S) P(W|C,S,R)
  = P(C) P(S|C) P(R|C) P(W|S,R)
• Space required for n binary nodes
– O(2^n) without factorization
– O(n 2^k) with factorization, k = maximum fan-in
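The space saving can be checked by counting free parameters. The sketch below is illustrative only; the function names are made up here, and it counts one free number per parent configuration for each binary node.

```python
def full_joint_params(n):
    """Free parameters in an unfactored joint over n binary variables."""
    return 2 ** n - 1

def factored_params(fan_in):
    """Free parameters when node i has fan_in[i] binary parents.

    Each node needs one number, Pr(node = 1 | parents), per parent setting.
    """
    return sum(2 ** k for k in fan_in)

# Sprinkler network: C has no parents, S and R have one (C), W has two (S, R).
sprinkler_fan_in = [0, 1, 1, 2]
print(full_joint_params(4), factored_params(sprinkler_fan_in))

# A larger network with 30 nodes and fan-in at most 2.
n, k = 30, 2
print(full_joint_params(n), "vs at most", n * 2 ** k)
```

For the four-node sprinkler network this gives 15 parameters for the full table against 9 for the factored form; the gap grows exponentially with n.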
Inference
• Pr(S=1|W=1) = Pr(S=1,W=1)/Pr(W=1)
= 0.2781/0.6471
= 0.430
• Pr(R=1|W=1) = Pr(R=1,W=1)/Pr(W=1)
= 0.4581/0.6471
= 0.708
Explaining Away
• S and R “compete” to explain W=1
• S and R are conditionally dependent
• Pr(S=1|R=1,W=1) = 0.1945
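The numbers above can be reproduced by brute-force enumeration of the joint. The conditional probability tables below are an assumption: they do not appear in the slides, but they are the values commonly used with this sprinkler example and they give the quoted results. The helper names are hypothetical.

```python
from itertools import product

# Assumed CPTs (not given in the slides).
P_C = {0: 0.5, 1: 0.5}                    # Pr(C=c)
P_S_given_C = {0: 0.5, 1: 0.1}            # Pr(S=1 | C=c)
P_R_given_C = {0: 0.2, 1: 0.8}            # Pr(R=1 | C=c)
P_W_given_SR = {(0, 0): 0.0, (1, 0): 0.9,
                (0, 1): 0.9, (1, 1): 0.99}  # Pr(W=1 | S=s, R=r)

def bern(p_one, value):
    """Probability of `value` for a binary variable with Pr(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    return (P_C[c] * bern(P_S_given_C[c], s)
            * bern(P_R_given_C[c], r) * bern(P_W_given_SR[(s, r)], w))

def prob(**fixed):
    """Marginal probability of the partial assignment in `fixed`."""
    total = 0.0
    for c, s, r, w in product((0, 1), repeat=4):
        assignment = {"c": c, "s": s, "r": r, "w": w}
        if all(assignment[name] == v for name, v in fixed.items()):
            total += joint(c, s, r, w)
    return total

p_w = prob(w=1)
p_s_given_w = prob(s=1, w=1) / p_w
p_r_given_w = prob(r=1, w=1) / p_w
p_s_given_rw = prob(s=1, r=1, w=1) / prob(r=1, w=1)
print(p_w, p_s_given_w, p_r_given_w, p_s_given_rw)
```

Knowing that it rained lowers the posterior for the sprinkler from about 0.430 to about 0.1945, which is the explaining-away effect.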
Inference
• Variable elimination
• Choosing optimal ordering – NP hard
• Greedy methods work well
• Computing several marginals
• Dynamic programming avoids redundant computation
• Sound familiar ??
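As a worked illustration of variable elimination, the marginal Pr(W=w) in the sprinkler network can be computed by pushing each sum inside the product as far as it will go. The intermediate factor names T_1 and T_2 are introduced here for illustration.

```latex
\begin{align*}
\Pr(W=w) &= \sum_c \sum_s \sum_r P(c)\,P(s \mid c)\,P(r \mid c)\,P(w \mid s, r) \\
         &= \sum_c P(c) \sum_s P(s \mid c) \sum_r P(r \mid c)\,P(w \mid s, r) \\
         &= \sum_c P(c) \sum_s P(s \mid c)\,T_1(c, w, s) \\
         &= \sum_c P(c)\,T_2(c, w)
\end{align*}
where
\begin{align*}
T_1(c, w, s) &= \sum_r P(r \mid c)\,P(w \mid s, r), \\
T_2(c, w)    &= \sum_s P(s \mid c)\,T_1(c, w, s).
\end{align*}
```

Each intermediate factor involves only a few variables, so the cost depends on the size of the largest factor created, which is what the elimination ordering controls.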
Bayes Balls for Conditional Independence
A Unifying (Re)View
Linear Gaussian Model (LGM)
• Basic Model
• Continuous-State LGM: FA, SPCA, PCA, LDS
• Discrete-State LGM: Mixture of Gaussians, VQ, HMM
Basic Model
● State of a system is a k-vector x (unobserved)
● Output of a system is a p-vector y (observed)
● Often k << p
● Basic model
– x_{t+1} = A x_t + w
– y_t = C x_t + v
● A is the k x k transition matrix
● C is a p x k observation matrix
● w ~ N(0, Q)
● v ~ N(0, R)
● Noise processes are essential
– Zero mean w.l.o.g.
Degeneracy in Basic Model
• Structure in Q can be moved to A and C
• W.l.o.g. Q = I
• R cannot be restricted as y_t are observed
• Components of x can be reordered arbitrarily.
• Ordering is based on norms of columns of C.
• x_1 ~ N(µ1, Q1)
• A and C are assumed to have rank k.
• Q, R, Q1 are assumed to be full rank.
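The generative model described above (initial state, linear dynamics, linear observation, Gaussian noise) can be sampled directly. This is a minimal sketch using NumPy; the function name and the example matrices are assumptions made for illustration.

```python
import numpy as np

def simulate_lgm(A, C, Q, R, mu1, Q1, T, seed=0):
    """Sample hidden states x (T, k) and outputs y (T, p) from the basic model."""
    rng = np.random.default_rng(seed)
    k, p = A.shape[0], C.shape[0]
    x = np.zeros((T, k))
    y = np.zeros((T, p))
    x[0] = rng.multivariate_normal(mu1, Q1)            # x_1 ~ N(mu1, Q1)
    for t in range(T):
        v = rng.multivariate_normal(np.zeros(p), R)    # observation noise
        y[t] = C @ x[t] + v                            # y_t = C x_t + v
        if t + 1 < T:
            w = rng.multivariate_normal(np.zeros(k), Q)  # state noise
            x[t + 1] = A @ x[t] + w                    # x_{t+1} = A x_t + w
    return x, y

# Example with k = 2 hidden dimensions and p = 3 observed dimensions.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x, y = simulate_lgm(A, C, Q=np.eye(2), R=0.1 * np.eye(3),
                    mu1=np.zeros(2), Q1=np.eye(2), T=50)
```

Setting Q = I in the example follows the w.l.o.g. choice listed above.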
Probability Computation
• P( x_{t+1} | x_t ) = N(A x_t, Q ; x_{t+1})
• P( y_t | x_t ) = N(C x_t, R ; y_t)
• P({x_1,..,x_T}, {y_1,..,y_T}) = P(x_1) Π_{t=1}^{T-1} P(x_{t+1}|x_t) Π_{t=1}^{T} P(y_t|x_t)
• Negative log probability
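Written out, the negative log probability follows from taking -2 log of the joint factorization with Gaussian densities for the initial state, the T-1 transitions, and the T observations. The expression below is a reconstruction from the model definitions, not a copy of the original slide.

```latex
\begin{align*}
-2 \log P(\{x_1,\dots,x_T\},\{y_1,\dots,y_T\})
  &= \sum_{t=1}^{T} (y_t - C x_t)^{\top} R^{-1} (y_t - C x_t) + T \log |R| \\
  &\quad + \sum_{t=1}^{T-1} (x_{t+1} - A x_t)^{\top} Q^{-1} (x_{t+1} - A x_t) + (T-1) \log |Q| \\
  &\quad + (x_1 - \mu_1)^{\top} Q_1^{-1} (x_1 - \mu_1) + \log |Q_1| \\
  &\quad + T (p + k) \log 2\pi
\end{align*}
```

The cost is a sum of quadratic forms in x and y, which is why inference in this model stays within the Gaussian family.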
Inference
● Given model parameters {A, C, Q, R, µ1, Q1}
● Given observations y
● What can be inferred about hidden states x?
● Total likelihood
● Filtering: P(x(t) | {y(1), ... , y(t)})
● Smoothing: P(x(t) | {y(1), ... , y(T)})
● Partial smoothing: P(x(t) | {y(1), ... , y(t+t')})
● Partial prediction: P(x(t) | {y(1), ... , y(t-t')})
● Intermediate values of recursive methods for computing total likelihood.
Learning
• Unknown parameters {A, C, Q, R, µ1, Q1}
• Given observations y
• Log-likelihood
F(Q,θ) – free energy (here Q is a distribution over the hidden states, not the noise covariance)
EM algorithm
• Alternate between maximizing F(Q,θ) w.r.t. Q and θ.
• F = L at the beginning of M-step
• E-step does not change θ
• Therefore, likelihood does not decrease.
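The argument can be written out as follows. This is the standard free-energy view of EM; X denotes all hidden states, Y all observations, Q a distribution over X, and the iteration index i is notation introduced here.

```latex
\begin{align*}
L(\theta) = \log P(Y \mid \theta)
  &= \log \int P(X, Y \mid \theta)\, dX \\
  &\ge \int Q(X) \log \frac{P(X, Y \mid \theta)}{Q(X)}\, dX
   \;=\; F(Q, \theta) \\[4pt]
\text{E-step:}\quad Q_{i+1} &= \arg\max_{Q} F(Q, \theta_i)
   = P(X \mid Y, \theta_i), \quad\text{so } F(Q_{i+1}, \theta_i) = L(\theta_i) \\
\text{M-step:}\quad \theta_{i+1} &= \arg\max_{\theta} F(Q_{i+1}, \theta)
\end{align*}
```

Chaining the two steps gives L(θ_{i+1}) ≥ F(Q_{i+1}, θ_{i+1}) ≥ F(Q_{i+1}, θ_i) = L(θ_i). For discrete hidden states the integrals become sums.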
Continuous-State LGM
Static Data Modeling
● No temporal dependence
● Factor analysis
● SPCA
● PCA
Time-series Modeling
● Time ordering of data crucial
● LDS (Kalman filter models)
Static Data Modelling
• A = 0
• x = w
• y = C x + v
• x_1 ~ N(0, Q)
• y ~ N(0, CQC'+R)
• Degeneracy in model
• Learning: EM
– R restricted
• Inference
Factor Analysis
• Restrict R to be diagonal.
• Q = I
• x – factors
• C – factor loading matrix
• R – uniqueness
• Learning – EM, quasi-Newton optimization
• Inference
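For the inference step, the posterior over factors is Gaussian. With Q = I the standard result is P(x | y) = N(βy, I - βC) where β = C'(CC' + R)^{-1}. The sketch below assumes NumPy; the function name and example values are made up for illustration.

```python
import numpy as np

def fa_posterior(y, C, R):
    """Posterior mean and covariance of the factors x given one observation y.

    Assumes x ~ N(0, I) and y = C x + v with v ~ N(0, R).
    """
    k = C.shape[1]
    S = C @ C.T + R                   # marginal covariance of y
    beta = np.linalg.solve(S, C).T    # equals C' S^{-1} because S is symmetric
    mean = beta @ y
    cov = np.eye(k) - beta @ C
    return mean, cov

# Example: p = 3 observed dimensions, k = 2 factors, diagonal R.
C = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
R = np.diag([0.1, 0.2, 0.3])
y = np.array([1.0, -1.0, 0.5])
mean, cov = fa_posterior(y, C, R)
```

The posterior covariance does not depend on y, so it can be computed once and reused for every data point.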
SPCA
• R = εI
• ε – global noise level
• Columns of C span the principal subspace.
• Learning – EM algorithm
• Inference
PCA
• R = lim_{ε→0} εI
• Learning
– Diagonalize sample covariance of data
– Leading k eigenvalues and eigenvectors define C
– EM determines leading eigenvectors without diagonalization
• Inference
– Noise becomes infinitesimal
– Posterior collapses to a single point
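The diagonalization route to PCA can be sketched in a few lines of NumPy. The function names and the synthetic data are assumptions for illustration; the columns of C are taken to be orthonormal eigenvectors, so the point estimate of x is a plain projection.

```python
import numpy as np

def pca_loading(Y, k):
    """Leading k eigenvectors and eigenvalues of the sample covariance of Y (N, p)."""
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / Yc.shape[0]             # sample covariance, (p, p)
    evals, evecs = np.linalg.eigh(S)        # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]     # indices of the k largest
    return evecs[:, order], evals[order]

def pca_infer(y, C, mean):
    """Point estimate of x: projection of the centered y onto the columns of C."""
    return C.T @ (y - mean)

# Synthetic data with most variance along the first axis.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
C, evals = pca_loading(Y, k=2)
x_hat = pca_infer(Y[0], C, Y.mean(axis=0))
```

Because the noise is infinitesimal, the inference step returns a single point rather than a distribution.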
Linear Dynamical Systems
• Inference – Kalman filter
• Smoothing – RTS recursions
• Learning – EM algorithm
– C known – Shumway and Stoffer, 1982
– All unknown – Ghahramani and Hinton, 1995
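A minimal sketch of the Kalman filter for the basic model follows, alternating a measurement update with a time update. It assumes NumPy and known parameters; the function name is made up here, and smoothing (the RTS backward pass) is not included.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, mu1, Q1):
    """Filtered means (T, k) and covariances (T, k, k) of P(x_t | y_1..y_t)."""
    T, k = y.shape[0], A.shape[0]
    means = np.zeros((T, k))
    covs = np.zeros((T, k, k))
    x_pred, V_pred = mu1, Q1                      # prior on x_1
    for t in range(T):
        # Measurement update: condition the prediction on y_t.
        S = C @ V_pred @ C.T + R                  # innovation covariance
        K = np.linalg.solve(S, C @ V_pred).T      # gain V_pred C' S^{-1}
        x_filt = x_pred + K @ (y[t] - C @ x_pred)
        V_filt = V_pred - K @ C @ V_pred
        means[t], covs[t] = x_filt, V_filt
        # Time update: propagate through x_{t+1} = A x_t + w.
        x_pred = A @ x_filt
        V_pred = A @ V_filt @ A.T + Q
    return means, covs

# One-dimensional example that can be checked by hand.
one = np.array([[1.0]])
y = np.array([[2.0], [2.0]])
means, covs = kalman_filter(y, A=one, C=one, Q=one, R=one,
                            mu1=np.array([0.0]), Q1=one)
```

In the one-dimensional example the first gain is 0.5, so the first filtered mean lies halfway between the prior mean 0 and the observation 2.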
Discrete-State LGM
• x_{t+1} = WTA[A x_t + w]
• y_t = C x_t + v
• x_1 = WTA[N(µ1, Q1)]
• WTA[·] – winner-take-all: returns the unit vector e_j for the largest coordinate j
Discrete-State LGM
Static Data Modeling
● Mixture of Gaussians
● VQ
Time-series Modeling
● HMM
Static Data Modelling
• A = 0
• x = WTA[w]
• w ~ N(µ, Q)
• y = C x + v
• π_j = P(x = e_j)
• Nonzero µ for nonuniform π_j
• y | x = e_j ~ N(C_j, R)
• C_j – jth column of C
Mixture of Gaussians
• Mixing coefficients of cluster π_j
• Mean – columns C_j
• Variance – R
• Learning: EM (corresponds to ML competitive learning)
• Inference
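Inference in the mixture model amounts to computing the posterior over clusters, P(x = e_j | y) ∝ π_j N(y; C_j, R). The sketch below assumes NumPy and a covariance R shared by all clusters, as in the model above; the function name is made up for illustration.

```python
import numpy as np

def responsibilities(y, C, R, pi):
    """Posterior P(x = e_j | y) for one observation y (p,) and means C (p, m)."""
    diff = y[:, None] - C                        # column j holds y - C_j
    maha = (diff * np.linalg.solve(R, diff)).sum(axis=0)  # squared Mahalanobis distances
    log_post = np.log(pi) - 0.5 * maha           # the shared |R| term cancels
    log_post -= log_post.max()                   # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Two clusters placed symmetrically about the observation.
C = np.array([[-1.0, 1.0], [0.0, 0.0]])
R = np.eye(2)
y = np.zeros(2)
equal = responsibilities(y, C, R, pi=np.array([0.5, 0.5]))
skewed = responsibilities(y, C, R, pi=np.array([0.25, 0.75]))
```

When the observation is equidistant from both means, the posterior simply reproduces the mixing coefficients.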
Vector Quantization
• Observation noise becomes infinitesimal
• Inference problem solved by 1NN rule
• Euclidean distance when R is a multiple of the identity
• Mahalanobis distance in the unscaled R otherwise
• Posterior collapses to closest cluster
• Learning with EM = batch version of k-means
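The nearest-neighbour inference rule and one batch k-means update can be sketched as below. This assumes NumPy and plain Euclidean distance (R proportional to the identity); the function names and the toy data are made up for illustration.

```python
import numpy as np

def nearest_codeword(Y, C):
    """1-NN rule: for each row of Y (N, p), index of the closest column of C (p, m)."""
    d2 = ((Y[:, None, :] - C.T[None, :, :]) ** 2).sum(axis=2)  # (N, m) squared distances
    return d2.argmin(axis=1)

def kmeans_step(Y, C):
    """One batch update: hard-assign points, then move each codeword to its cluster mean."""
    z = nearest_codeword(Y, C)
    C_new = C.copy()
    for j in range(C.shape[1]):
        if np.any(z == j):                 # leave empty clusters where they are
            C_new[:, j] = Y[z == j].mean(axis=0)
    return C_new, z

# Toy data: two well-separated groups of points.
Y = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
C0 = np.array([[1.0, 9.0], [1.0, 9.0]])    # columns are the initial codewords
C1, z = kmeans_step(Y, C0)
```

The hard assignment is the collapsed posterior: each point gets responsibility 1 for its closest cluster and 0 for the rest.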
Time-series modelling
HMM
• Transition matrix T
• T_{i,j} = P(x_{t+1} = e_j | x_t = e_i)
• For every T, there exist A and Q
• Filtering : forward recursions
• Smoothing: forward-backward algorithm
• Learning: EM (called Baum-Welch reestimation)
• MAP state sequences - Viterbi
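The forward recursion for filtering can be sketched as follows. This assumes NumPy; the function and argument names are made up here, the transition matrix is called `trans` to avoid clashing with the sequence length, and the per-state likelihoods `lik[t, j] = P(y_t | x_t = e_j)` are supplied by the caller (for this model they would be Gaussian densities N(y_t; C_j, R)).

```python
import numpy as np

def forward_filter(pi, trans, lik):
    """Filtered state posteriors P(x_t | y_1..y_t) and the total log-likelihood.

    pi:    (m,)   initial state probabilities
    trans: (m, m) trans[i, j] = P(x_{t+1} = e_j | x_t = e_i)
    lik:   (n, m) lik[t, j] = P(y_t | x_t = e_j)
    """
    n, m = lik.shape
    alpha = np.zeros((n, m))
    loglik = 0.0
    pred = pi                              # predictive distribution over x_1
    for t in range(n):
        unnorm = pred * lik[t]             # weight the prediction by the evidence
        norm = unnorm.sum()                # equals P(y_t | y_1..y_{t-1})
        alpha[t] = unnorm / norm
        loglik += np.log(norm)
        pred = alpha[t] @ trans            # one-step-ahead prediction
    return alpha, loglik

# Two-state example with the same evidence at both time steps.
pi = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
lik = np.array([[0.8, 0.2], [0.8, 0.2]])
alpha, loglik = forward_filter(pi, trans, lik)
```

Normalising at every step keeps the recursion numerically stable, and the normalisers multiply to the total likelihood, matching the earlier remark that filtering falls out of the likelihood computation.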