
School of Computer Science

State Space Models

Probabilistic Graphical Models (10-708)

Lecture 13, part II, Nov 5th, 2007

Eric Xing

[Figure: example gene-regulatory network (Receptors A/B, Kinases C/D/E, TF F, Genes G/H) and its graphical-model representation over nodes X1-X8]

Reading: J-Chap. 15, K&F chapter 19.1-19.3


A road map to more complex dynamic models

[Figure: a road map of latent-variable models, organized by the type of latent variable X and observation Y]

    X discrete, Y discrete:      Mixture model (e.g., mixture of multinomials)
    X discrete, Y continuous:    Mixture model (e.g., mixture of Gaussians)
    X continuous, Y continuous:  Factor analysis

    Chain-structured (sequential) versions:
        HMM (for discrete sequential data, e.g., text)
        HMM (for continuous sequential data, e.g., speech signal)
        State space model

    With richer hidden structure: Factorial HMM, Switching SSM


State space models (SSM):

A sequential FA or a continuous state HMM

[Graphical model: a chain of hidden states, one per time step, each emitting one observation]

The linear-Gaussian case:

    x_t = A x_{t-1} + G w_t ,    w_t ~ N(0, Q)
    y_t = C x_t + v_t ,          v_t ~ N(0, R)
    x_0 ~ N(0, Σ_0)

This is a linear dynamic system.

In general,

    x_t = f(x_{t-1}) + G w_t
    y_t = g(x_t) + v_t

where f is an (arbitrary) dynamic model, and g is an (arbitrary) observation model.
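As a concrete illustration of the linear-Gaussian case above (not part of the original slides), here is a minimal numpy sketch that samples a trajectory from such a linear dynamic system; the helper name simulate_lds and the dimensions are assumptions made for the example, and G is folded into the process-noise covariance Q.

```python
import numpy as np

def simulate_lds(A, C, Q, R, Sigma0, T, rng=None):
    """Sample x_{1:T}, y_{1:T} from x_t = A x_{t-1} + w_t, y_t = C x_t + v_t."""
    rng = np.random.default_rng() if rng is None else rng
    nx, ny = A.shape[0], C.shape[0]
    xs, ys = np.zeros((T, nx)), np.zeros((T, ny))
    x = rng.multivariate_normal(np.zeros(nx), Sigma0)          # x_0 ~ N(0, Sigma0)
    for t in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(nx), Q)   # process noise w_t
        ys[t] = C @ x + rng.multivariate_normal(np.zeros(ny), R)  # observation noise v_t
        xs[t] = x
    return xs, ys
```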


LDS for 2D tracking

Dynamics: new position = old position + Δ × velocity + noise (constant velocity model, Gaussian noise)

Observation: project out first two components (we observe Cartesian position of object - linear!)

With state (position, velocity) = (x_{t,1}, x_{t,2}, ẋ_{t,1}, ẋ_{t,2}):

    \begin{pmatrix} x_{t,1} \\ x_{t,2} \\ \dot{x}_{t,1} \\ \dot{x}_{t,2} \end{pmatrix}
    =
    \begin{pmatrix} 1 & 0 & \Delta & 0 \\ 0 & 1 & 0 & \Delta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} x_{t-1,1} \\ x_{t-1,2} \\ \dot{x}_{t-1,1} \\ \dot{x}_{t-1,2} \end{pmatrix}
    + \text{noise}

    \begin{pmatrix} y_{t,1} \\ y_{t,2} \end{pmatrix}
    =
    \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}
    \begin{pmatrix} x_{t,1} \\ x_{t,2} \\ \dot{x}_{t,1} \\ \dot{x}_{t,2} \end{pmatrix}
    + \text{noise}
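To make the constant-velocity model concrete, here is a small sketch (assuming the hypothetical simulate_lds helper above, and made-up values for Δ and the noise covariances) that builds A and C exactly as in the equations:

```python
import numpy as np

dt = 0.1  # time step Δ (assumed value for the example)

# Constant-velocity dynamics: position += Δ * velocity, velocity unchanged.
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)

# Observation: project out the two position components.
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

Q = 0.01 * np.eye(4)   # process-noise covariance (assumed)
R = 0.25 * np.eye(2)   # observation-noise covariance (assumed)

xs, ys = simulate_lds(A, C, Q, R, Sigma0=np.eye(4), T=50)
```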


The inference problem 1

Filtering: given y_1, ..., y_t, estimate x_t, i.e., compute p(X_t | y_{1:t}).

The Kalman filter is a way to perform exact online inference (sequential Bayesian updating) in an LDS. It is the Gaussian analog of the forward algorithm for HMMs:

    α_t^i ≡ p(X_t = i | y_{1:t}) ∝ p(y_t | X_t = i) Σ_j p(X_t = i | X_{t-1} = j) α_{t-1}^j

[Graphical model: chain X_1, ..., X_t with observations Y_1, ..., Y_t and forward messages α_0, α_1, α_2, ..., α_t]
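For reference, here is the discrete forward recursion that the Kalman filter generalizes, as a small numpy sketch; the array layout (rows of the transition matrix index the previous state) is an assumption of this example.

```python
import numpy as np

def forward_filter(pi0, trans, lik, obs):
    """Discrete forward algorithm: alpha_t(i) ∝ p(y_t | X_t=i) * Σ_j p(X_t=i | X_{t-1}=j) * alpha_{t-1}(j).

    pi0[j]: initial state distribution; trans[j, i] = p(X_t=i | X_{t-1}=j);
    lik[i, y] = p(y | X_t=i); obs: sequence of observed symbols.
    """
    alpha = np.asarray(pi0, dtype=float)
    alphas = []
    for y in obs:
        alpha = lik[:, y] * (trans.T @ alpha)   # one predict-and-condition step
        alpha /= alpha.sum()                    # normalize: p(X_t = i | y_{1:t})
        alphas.append(alpha.copy())
    return np.array(alphas)
```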


The inference problem 2

Smoothing: given y_1, ..., y_T, estimate x_t (t < T).

The Rauch-Tung-Striebel smoother is a way to perform exact off-line inference in an LDS. It is the Gaussian analog of the forward-backward (alpha-gamma) algorithm:

    γ_t^i ≡ p(X_t = i | y_{1:T}) ∝ α_t^i Σ_j γ_{t+1}^j P(X_{t+1}^j | X_t^i)

[Graphical model: chain X_1, ..., X_t with observations Y_1, ..., Y_t; forward messages α_0, ..., α_t and smoothed messages γ_0, ..., γ_t]


2D tracking

[Figure: filtered 2D trajectory, plotted in the (X1, X2) plane]


Kalman filtering in the brain?


Kalman filtering derivation

Since all CPDs are linear Gaussian, the system defines a large multivariate Gaussian.
Hence all marginals are Gaussian.
Hence we can represent the belief state p(X_t | y_{1:t}) as a Gaussian with

    mean        x̂_{t|t} ≡ E(X_t | y_1, ..., y_t)
    covariance  P_{t|t} ≡ Cov(X_t | y_1, ..., y_t)

Hence, instead of doing marginalization for message passing, we will directly estimate the means and covariances of the required marginals.
It is common to work with the inverse covariance (precision) matrix P_{t|t}^{-1}; this is called information form.


Kalman filtering derivation

Kalman filtering is a recursive procedure to update the belief state:

Predict step: compute p(Xt+1|y1:t) from prior belief p(Xt|y1:t) and dynamical model p(Xt+1|Xt) --- time update

Update step: compute new belief p(Xt+1|y1:t+1) from prediction p(Xt+1|y1:t), observation yt+1 and observation model p(yt+1|Xt+1) --- measurement update

[Figure: the chain X_1, ..., X_t, X_{t+1} with observations Y_1, ..., Y_t, Y_{t+1}; the predict step extends the belief to X_{t+1}, the update step then absorbs Y_{t+1}]


Predict step

Dynamical model:

    x_{t+1} = A x_t + G w_t ,    w_t ~ N(0, Q)

One-step-ahead prediction of the state:

    x̂_{t+1|t} = A x̂_{t|t} ,    P_{t+1|t} = A P_{t|t} A^T + G Q G^T

Observation model:

    y_t = C x_t + v_t ,          v_t ~ N(0, R)

One-step-ahead prediction of the observation:

    E(y_{t+1} | y_{1:t}) = C x̂_{t+1|t} ,    Cov(y_{t+1} | y_{1:t}) = C P_{t+1|t} C^T + R

[Figure: chain X_1, ..., X_t, X_{t+1} with observations Y_1, ..., Y_t; the prediction extends the belief to X_{t+1} before Y_{t+1} is seen]


Update step

Summarizing results from the previous slide, we have p(X_{t+1}, Y_{t+1} | y_{1:t}) ~ N( m_{t+1}, V_{t+1} ), where

    m_{t+1} = ( x̂_{t+1|t} , C x̂_{t+1|t} ) ,    V_{t+1} = \begin{pmatrix} P_{t+1|t} & P_{t+1|t} C^T \\ C P_{t+1|t} & C P_{t+1|t} C^T + R \end{pmatrix}

Remember the formulas for conditional Gaussian distributions:

    p(x_1, x_2) = N( (x_1, x_2) | (μ_1, μ_2), \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix} )

    marginal:     p(x_2) = N( x_2 | m_2, V_2 ) ,    m_2 = μ_2 ,    V_2 = Σ_{22}

    conditional:  p(x_1 | x_2) = N( x_1 | m_{1|2}, V_{1|2} )
                  m_{1|2} = μ_1 + Σ_{12} Σ_{22}^{-1} ( x_2 - μ_2 )
                  V_{1|2} = Σ_{11} - Σ_{12} Σ_{22}^{-1} Σ_{21}
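As an aside (not on the slides), the conditional-Gaussian formulas above are easy to check numerically; here is a small sketch, with a made-up block split, of conditioning a joint Gaussian:

```python
import numpy as np

def condition_gaussian(mu, Sigma, x2, idx1, idx2):
    """p(x1 | x2) for a joint Gaussian N(mu, Sigma) split into blocks idx1, idx2.

    Returns the conditional mean m_{1|2} and covariance V_{1|2}.
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    gain = S12 @ np.linalg.inv(S22)   # Σ12 Σ22^{-1}
    m = mu1 + gain @ (x2 - mu2)       # μ1 + Σ12 Σ22^{-1} (x2 - μ2)
    V = S11 - gain @ S12.T            # Σ11 - Σ12 Σ22^{-1} Σ21
    return m, V

# Toy usage with an arbitrary positive-definite covariance:
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
m, V = condition_gaussian(mu, Sigma, x2=np.array([1.5]), idx1=[0, 1], idx2=[2])
```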


Kalman Filter

Measurement updates:

    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} ( y_{t+1} - C x̂_{t+1|t} )
    P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}

where K_{t+1} is the Kalman gain matrix:

    K_{t+1} = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1}
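Putting the predict and measurement-update equations together, here is a minimal numpy sketch of one Kalman-filter step; the function name kalman_step and its interface are mine, not from the slides.

```python
import numpy as np

def kalman_step(x_hat, P, y_new, A, C, Q, R, G=None):
    """One predict + measurement-update step of the Kalman filter.

    x_hat, P: posterior mean/covariance at time t (x̂_{t|t}, P_{t|t}).
    Returns x̂_{t+1|t+1}, P_{t+1|t+1}.
    """
    G = np.eye(P.shape[0]) if G is None else G

    # Predict (time update): p(X_{t+1} | y_{1:t})
    x_pred = A @ x_hat                      # x̂_{t+1|t} = A x̂_{t|t}
    P_pred = A @ P @ A.T + G @ Q @ G.T      # P_{t+1|t} = A P_{t|t} A^T + G Q G^T

    # Update (measurement update): absorb y_{t+1}
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain K_{t+1}
    innovation = y_new - C @ x_pred         # y_{t+1} - C x̂_{t+1|t}
    x_new = x_pred + K @ innovation
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```

Running this over a sequence of observations (for instance the 2D-tracking ys sampled earlier) and storing the intermediate predictions also provides the quantities the RTS smoother needs later.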


Example of KF in 1D

Consider noisy observations of a 1D particle doing a random walk:

    x_t = x_{t-1} + w ,   w ~ N(0, σ_x)
    z_t = x_t + v ,       v ~ N(0, σ_z)

KF equations (with P_{t|t} = σ_t):

    x̂_{t+1|t} = A x̂_{t|t} = x̂_{t|t}
    P_{t+1|t} = A P_{t|t} A^T + G Q G^T = σ_t + σ_x

    K_{t+1} = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1} = (σ_t + σ_x) / (σ_t + σ_x + σ_z)

    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} ( z_{t+1} - C x̂_{t+1|t} )
                = [ (σ_t + σ_x) z_{t+1} + σ_z x̂_{t|t} ] / (σ_t + σ_x + σ_z)

    P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t} = (σ_t + σ_x) σ_z / (σ_t + σ_x + σ_z)
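A scalar implementation of these equations makes the example directly runnable; this sketch assumes made-up variances and the naming used above (σ_x process noise, σ_z observation noise).

```python
import numpy as np

def kf_1d_random_walk(z, sigma_x, sigma_z, x0=0.0, p0=1.0):
    """Scalar Kalman filter for the 1D random-walk model above.

    z: sequence of noisy observations; sigma_x, sigma_z: process/observation variances.
    Returns arrays of filtered means and variances.
    """
    x_hat, p = x0, p0
    means, variances = [], []
    for zt in z:
        p_pred = p + sigma_x                       # P_{t+1|t} = σ_t + σ_x
        K = p_pred / (p_pred + sigma_z)            # Kalman gain
        x_hat = x_hat + K * (zt - x_hat)           # convex combination of prior and observation
        p = p_pred * sigma_z / (p_pred + sigma_z)  # posterior variance
        means.append(x_hat)
        variances.append(p)
    return np.array(means), np.array(variances)
```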


KF intuition

The KF update of the mean is

    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} ( z_{t+1} - C x̂_{t+1|t} )
                = [ (σ_t + σ_x) z_{t+1} + σ_z x̂_{t|t} ] / (σ_t + σ_x + σ_z)

The term ( z_{t+1} - C x̂_{t+1|t} ) is called the innovation.

The new belief is a convex combination of updates from the prior and the observation, weighted by the Kalman gain matrix:

    K_{t+1} = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1} = (σ_t + σ_x) / (σ_t + σ_x + σ_z)

If the observation is unreliable, σ_z (i.e., R) is large, so K_{t+1} is small, and we pay more attention to the prediction.
If the old prior is unreliable (large σ_t) or the process is very unpredictable (large σ_x), we pay more attention to the observation.


KF, RLS and LMS

The KF update of the mean is

    x̂_{t+1|t+1} = A x̂_{t|t} + K_{t+1} ( y_{t+1} - C x̂_{t+1|t} )

Consider the special case where the hidden state is a constant, x_t = θ, but the "observation matrix" C is a time-varying vector, C_t = x_t^T.
Hence the observation model at each time slice,

    y_t = x_t^T θ + v_t ,

is a linear regression.

We can estimate θ recursively using the Kalman filter:

    θ̂_{t+1} = θ̂_t + ( P_{t+1} R^{-1} ) ( y_{t+1} - x_{t+1}^T θ̂_t ) x_{t+1}

This is called the recursive least squares (RLS) algorithm.

We can approximate P_{t+1} R^{-1} by a scalar constant η_t; this is called the least mean squares (LMS) algorithm.
We can adapt η_t online using stochastic approximation theory.
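Here is a small sketch of both special cases under the usual regression setup; the function names and the hyperparameter values (initial covariance p0, step size eta) are illustrative assumptions rather than the slides' notation.

```python
import numpy as np

def rls(xs, ys, r=1.0, p0=10.0):
    """Recursive least squares: a Kalman filter with constant state θ and C_t = x_t^T."""
    d = xs.shape[1]
    theta, P = np.zeros(d), p0 * np.eye(d)
    for x, y in zip(xs, ys):
        K = P @ x / (x @ P @ x + r)          # Kalman gain for scalar observation noise r
        theta = theta + K * (y - x @ theta)  # innovation-weighted update
        P = P - np.outer(K, x @ P)           # posterior covariance
    return theta

def lms(xs, ys, eta=0.01):
    """Least mean squares: replace the gain by a fixed scalar step size eta."""
    theta = np.zeros(xs.shape[1])
    for x, y in zip(xs, ys):
        theta = theta + eta * (y - x @ theta) * x
    return theta
```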


Complexity of one KF step

Let X_t ∈ R^{N_x} and Y_t ∈ R^{N_y}.

Computing P_{t+1|t} = A P_{t|t} A^T + G Q G^T takes O(N_x^3) time, assuming dense P and dense A.

Computing K_{t+1} = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1} takes O(N_y^3) time.

So the overall time is, in general, O( max{ N_x^3, N_y^3 } ).


The inference problem 2 (recap)

Smoothing: given y_1, ..., y_T, estimate x_t (t < T).

The Rauch-Tung-Striebel smoother is a way to perform exact off-line inference in an LDS. It is the Gaussian analog of the forward-backward (alpha-gamma) algorithm:

    γ_t^i ≡ p(X_t = i | y_{1:T}) ∝ α_t^i Σ_j γ_{t+1}^j P(X_{t+1}^j | X_t^i)

[Graphical model: chain X_1, ..., X_t with observations Y_1, ..., Y_t; forward messages α_0, ..., α_t and smoothed messages γ_0, ..., γ_t]


RTS smoother derivation

Smoothing: given y_1, ..., y_T, estimate P(x_t | y_{1:T}) (t < T).

Step 1: joint distribution of x_t and x_{t+1} conditioned on y_{1:t}.

Use x_{t+1} = A x_t + G w_t ,  w_t ~ N(0, Q).

[Figure: the two-slice model X_t → X_{t+1} with observations Y_t, Y_{t+1}]


RTS smoother derivation

Following the results from the previous slide, we need the joint p(X_t, X_{t+1} | y_{1:t}) ~ N(m, V), where

    m = ( x̂_{t|t} , x̂_{t+1|t} ) ,    V = \begin{pmatrix} P_{t|t} & P_{t|t} A^T \\ A P_{t|t} & P_{t+1|t} \end{pmatrix}

All the quantities here are available after a forward KF pass.

Remember the formulas for conditional Gaussian distributions:

    p(x_1, x_2) = N( (x_1, x_2) | (μ_1, μ_2), \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix} )

    marginal:     p(x_2) = N( x_2 | m_2, V_2 ) ,    m_2 = μ_2 ,    V_2 = Σ_{22}

    conditional:  p(x_1 | x_2) = N( x_1 | m_{1|2}, V_{1|2} )
                  m_{1|2} = μ_1 + Σ_{12} Σ_{22}^{-1} ( x_2 - μ_2 )
                  V_{1|2} = Σ_{11} - Σ_{12} Σ_{22}^{-1} Σ_{21}

Applying them with x_1 = X_t and x_2 = X_{t+1} gives the smoother gain

    L_t = P_{t|t} A^T P_{t+1|t}^{-1}


RTS smoother derivation

Step 2: compute x̂_{t|T} ≡ E[x_t | y_{1:T}] using the results above.

Use the tower property: E[X | Z] = E[ E[X | Y, Z] | Z ].

[Figure: the two-slice model X_t → X_{t+1} with observations Y_t, Y_{t+1}]


RTS derivation

Repeat the same process for the variance (refer to Jordan, Chapter 15).

The RTS smoother results:

    x̂_{t|T} = x̂_{t|t} + L_t ( x̂_{t+1|T} - x̂_{t+1|t} )
    P_{t|T} = P_{t|t} + L_t ( P_{t+1|T} - P_{t+1|t} ) L_t^T
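Assuming the forward pass stored the filtered and one-step-ahead predicted quantities (e.g., from the kalman_step sketch above), the backward RTS pass can be sketched as follows; the function name and data layout are assumptions of this example.

```python
import numpy as np

def rts_smoother(filtered_means, filtered_covs, pred_means, pred_covs, A):
    """Backward RTS pass given the forward KF quantities.

    filtered_means[t], filtered_covs[t]: x̂_{t|t}, P_{t|t}
    pred_means[t], pred_covs[t]:         x̂_{t+1|t}, P_{t+1|t} (prediction made at time t)
    Returns smoothed means x̂_{t|T} and covariances P_{t|T}.
    """
    T = len(filtered_means)
    sm_means, sm_covs = [None] * T, [None] * T
    sm_means[-1], sm_covs[-1] = filtered_means[-1], filtered_covs[-1]
    for t in range(T - 2, -1, -1):
        L = filtered_covs[t] @ A.T @ np.linalg.inv(pred_covs[t])  # L_t = P_{t|t} A^T P_{t+1|t}^{-1}
        sm_means[t] = filtered_means[t] + L @ (sm_means[t + 1] - pred_means[t])
        sm_covs[t] = filtered_covs[t] + L @ (sm_covs[t + 1] - pred_covs[t]) @ L.T
    return sm_means, sm_covs
```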


Learning SSMs

Complete log likelihood:

    l_c(θ; D) = Σ_n log p(x_n, y_n)
              = Σ_n log p(x_{n,1}) + Σ_n Σ_t log p(x_{n,t} | x_{n,t-1}) + Σ_n Σ_t log p(y_{n,t} | x_{n,t})
              = f_1( x_1 x_1^T ; Σ_0 ) + f_2( { x_t x_{t-1}^T, x_t x_t^T, ∀t } ; A, G, Q ) + f_3( { x_t x_t^T, ∀t } ; C, R )

EM

E-step: compute the expected sufficient statistics ⟨X_t⟩, ⟨X_t X_{t-1}^T⟩, ⟨X_t X_t^T⟩ given y_1, ..., y_T.
These quantities can be inferred via the KF and RTS passes, e.g.,

    ⟨X_t X_t^T⟩ ≡ var(X_t) + ⟨X_t⟩⟨X_t⟩^T = P_{t|T} + x̂_{t|T} x̂_{t|T}^T

M-step: MLE using the expected complete log likelihood

    ⟨l_c(θ; D)⟩ = f_1( ⟨X_1 X_1^T⟩ ; Σ_0 ) + f_2( { ⟨X_t X_{t-1}^T⟩, ⟨X_t X_t^T⟩, ∀t } ; A, G, Q ) + f_3( { ⟨X_t X_t^T⟩, ∀t } ; C, R )

cf. the M-step in factor analysis.


Nonlinear systems

In robotics and other problems, the motion model and the observation model are often nonlinear:

    x_t = f(x_{t-1}) + w_t ,    y_t = g(x_t) + v_t

An optimal closed-form solution to the filtering problem is no longer possible.
The nonlinear functions f and g are sometimes represented by neural networks (multi-layer perceptrons or radial basis function networks).
The parameters of f and g may be learned offline using EM, where we do gradient descent (back-propagation) in the M-step, cf. learning an MRF/CRF with hidden nodes.
Or we may learn the parameters online by adding them to the state space: x_t' = (x_t, θ). This makes the problem even more nonlinear.


Extended Kalman Filter (EKF)

The basic idea of the EKF is to linearize f and g using a first-order Taylor expansion, and then apply the standard KF:

    x_t ≈ f(x̂_{t-1|t-1}) + A_{x̂_{t-1|t-1}} ( x_{t-1} - x̂_{t-1|t-1} ) + w_t
    y_t ≈ g(x̂_{t|t-1}) + C_{x̂_{t|t-1}} ( x_t - x̂_{t|t-1} ) + v_t

i.e., we approximate a stationary nonlinear system with a non-stationary linear system,

where x̂_{t|t-1} = f(x̂_{t-1|t-1}) ,   A_x̂ ≝ ∂f/∂x |_{x = x̂} ,   and C_x̂ ≝ ∂g/∂x |_{x = x̂}.

The noise covariance (Q and R) is not changed, i.e., the additional error due to linearization is not modeled.
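As an illustration only (not the slides' derivation), here is a minimal EKF step that forms the Jacobians A and C numerically and then reuses the standard KF equations; the names and the finite-difference Jacobian are assumptions of this sketch.

```python
import numpy as np

def jacobian(fn, x, eps=1e-6):
    """Numerical Jacobian of fn at x (forward finite differences)."""
    fx = np.atleast_1d(fn(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(fn(x + dx)) - fx) / eps
    return J

def ekf_step(x_hat, P, y_new, f, g, Q, R):
    """One EKF step: linearize f and g around the current estimates, then apply the KF equations."""
    A = jacobian(f, x_hat)                  # A = ∂f/∂x at x̂_{t-1|t-1}
    x_pred = f(x_hat)                       # x̂_{t|t-1} = f(x̂_{t-1|t-1})
    P_pred = A @ P @ A.T + Q

    C = jacobian(g, x_pred)                 # C = ∂g/∂x at x̂_{t|t-1}
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (np.atleast_1d(y_new) - np.atleast_1d(g(x_pred)))
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```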


School of Computer Science

Complex Graphical Models

Probabilistic Graphical Models (10-708)

Lecture 14, Nov 5th, 2007

Eric Xing

[Figure: example gene-regulatory network (Receptors A/B, Kinases C/D/E, TF F, Genes G/H) and its graphical-model representation over nodes X1-X8]

Reading: K&F chapter 20.1 - 20.3


A road map to more complex dynamic models

[Figure: a road map of latent-variable models, organized by the type of latent variable X and observation Y]

    X discrete, Y discrete:      Mixture model (e.g., mixture of multinomials)
    X discrete, Y continuous:    Mixture model (e.g., mixture of Gaussians)
    X continuous, Y continuous:  Factor analysis

    Chain-structured (sequential) versions:
        HMM (for discrete sequential data, e.g., text)
        HMM (for continuous sequential data, e.g., speech signal)
        State space model

    With richer hidden structure: Factorial HMM, Switching SSM


NLP and Data Mining

We want:
  Semantic-based search: infer topics and categorize documents
  Multimedia inference
  Automatic translation
  Predict how topics evolve
  ...

[Figure: research topics evolving over time, 1900-2000]


The Vector Space Model

Represent each document by a high-dimensional vector in the space of words.

[Figure: a document mapped to its word-count vector]


Latent Semantic Indexing

Decompose the term-document matrix with an SVD:

    X (m × n)  =  T (m × k)  Λ (k × k)  D^T (k × n)

where rows of X index terms and columns index documents, so each document vector is

    w⃗_d = Σ_{k=1}^{K} λ_k d_k T⃗_k

LSA does not define a properly normalized probability distribution of observed and latent entities

Does not support probabilistic reasoning under uncertainty and data fusion
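A toy sketch of the vector space model and LSI (the corpus, vocabulary handling, and the rank k are all made up for illustration): build the term-document count matrix X, then take a truncated SVD.

```python
import numpy as np

docs = ["money bank loan bank", "river stream bank", "loan money money"]
vocab = sorted({w for d in docs for w in d.split()})

X = np.zeros((len(vocab), len(docs)))          # term-by-document counts
for j, d in enumerate(docs):
    for w in d.split():
        X[vocab.index(w), j] += 1

T, lam, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = T[:, :k] @ np.diag(lam[:k]) @ Dt[:k, :]  # rank-k LSI approximation of X
```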


Latent Semantic Structure

[Figure: latent structure ℓ generating the observed words w]

    Distribution over words:      P(w) = Σ_ℓ P(w, ℓ)
    Inferring latent structure:   P(ℓ | w) = P(w | ℓ) P(ℓ) / P(w)
    Prediction:                   P(w_{n+1} | w) = ...


Admixture Models

Objects are bags of elements.
Mixtures are distributions over elements.
Objects have a mixing vector θ, representing each mixture's contribution.
An object is generated as follows:
  Pick a mixture component from θ
  Pick an element from that component

[Figure: example documents as bags of words (money, bank, loan, river, stream, ...), each word tagged with the topic (1 or 2) that generated it, together with per-document mixing vectors such as (0.1, 0.1, 0.5, ...), (0.1, 0.5, 0.1, ...), (0.5, 0.1, 0.1, ...)]


Topic Models = Admixture Models

Generating a document:

[Plate diagram: prior → θ → z → w, with topic parameters β; N_d words per document, N documents, K topics]

    - Draw θ from the prior
    - For each word n:
        - Draw z_n from multinomial(θ)
        - Draw w_n | z_n, {β_{1:K}} from multinomial(β_{z_n})

Which prior to use?
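A minimal sketch of this generative process (using a Dirichlet prior as one concrete choice; the function name, dimensions, and parameter values are assumptions made for illustration):

```python
import numpy as np

def generate_document(beta, alpha, n_words, rng=None):
    """Sample one document from the admixture (topic-model) generative process above.

    beta: (K, V) array of topic-word distributions; alpha: Dirichlet prior parameter;
    n_words: document length.
    """
    rng = np.random.default_rng() if rng is None else rng
    K, V = beta.shape
    theta = rng.dirichlet(alpha * np.ones(K))        # mixing vector θ ~ prior
    z = rng.choice(K, size=n_words, p=theta)         # topic assignment z_n ~ Mult(θ)
    words = [rng.choice(V, p=beta[zn]) for zn in z]  # word w_n ~ Mult(β_{z_n})
    return theta, z, np.array(words)

# Toy topics over a 5-word vocabulary (rows sum to 1):
beta = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],
                 [0.05, 0.30, 0.05, 0.30, 0.30]])
theta, z, words = generate_document(beta, alpha=0.5, n_words=20)
```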


Choice of Prior

Dirichlet (LDA) (Blei et al. 2003)
  Conjugate prior means efficient inference
  Can only capture variations in each topic's intensity independently

Logistic Normal (CTM = LoNTAM) (Blei & Lafferty 2005; Ahmed & Xing 2006)
  Captures the intuition that some topics are highly correlated and can rise in intensity together (contrast the two priors in the sampling sketch below)
  Not a conjugate prior, which implies hard inference
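To see the difference, here is a small sampling sketch (the covariance values and topic count are made-up assumptions, and the K-dimensional softmax construction is one simple version of the logistic normal):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

# Dirichlet prior: topic proportions vary independently (up to the sum-to-one constraint).
theta_dir = rng.dirichlet(0.5 * np.ones(K), size=1000)

# Logistic-normal prior: a full covariance Σ lets topic intensities co-vary.
mu = np.zeros(K)
Sigma = 0.5 * np.eye(K) + 0.4            # assumed covariance with positive correlations
eta = rng.multivariate_normal(mu, Sigma, size=1000)
theta_ln = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)  # softmax onto the simplex

print(np.corrcoef(theta_ln[:, 0], theta_ln[:, 1])[0, 1])  # correlated topic proportions
```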


Logistic Normal Vs. Dirichlet

[Figure: example Dirichlet densities on the simplex]


Logistic Normal Vs. Dirichlet

[Figure: example logistic-normal densities on the simplex]


Mixed Membership Model (M3)

Mixture versus admixture:

[Figure: a Bayesian mixture model (left) vs. a Bayesian admixture model, i.e., mixed membership model (right)]


Latent Dirichlet Allocation: M3 in text mining

A document is a bag of words, each generated from a randomly selected topic.


Population admixture: M3 in genetics

The genetic material of each modern individual is inherited from multiple ancestral populations; each DNA locus may have a different genetic origin …

Ancestral labels may have (e.g., Markovian) dependencies


Inference in Mixed Membership Models

Mixture versus admixture

Inference is very hard in M3: all hidden variables are coupled and the posterior does not factorize!

    p(D) = Σ_{ {z_{m,n}} } ∫ ⋯ ∫ ( Π_{m,n} p(x_{m,n} | φ_{z_{m,n}}) p(z_{m,n} | π_m) ) p(π_1 | α) ⋯ p(π_N | α) p(φ | G) dπ_1 ⋯ dπ_N dφ

and likewise the conditional for a single document's mixing vector,

    p(D | π_n) ∝ Σ_{ {z_{m,n}} } ∫ ( Π_{m,n} p(x_{m,n} | φ_{z_{m,n}}) p(z_{m,n} | π_m) ) ( Π_{i≠n} p(π_i | α) ) p(φ | G) dπ_{-n} dφ

still couples the sum over all assignments {z_{m,n}} with the integrals over π and φ.


Approaches to inference

Exact inference algorithms:
  The elimination algorithm
  The junction tree algorithms

Approximate inference techniques:
  Monte Carlo algorithms: stochastic simulation / sampling methods, Markov chain Monte Carlo methods
  Variational algorithms: belief propagation, variational inference