Pattern Recognition and Machine Learning - Chapter 13: Sequential Data
Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Dec. 9, 2011
Idea
Origin of Markov Models
• IID data is not always a realistic assumption. For sequential data, future observations (prediction) depend on some recent observations; we model this with DAGs in which inference is done by the sum-product algorithm.
• State space (Markov) models introduce latent variables:
  – Discrete latent variables: Hidden Markov Model
  – Gaussian latent variables: Linear Dynamical Systems
• The order of a Markov chain specifies the data dependence:
  – 1st order: the current observation depends only on the previous observation (see the factorization below).
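As a concrete form of the first-order dependence (the standard factorization, written here to make the bullet above precise):

```latex
p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})
```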
Why Markov Models
• The latent variables z_n form a Markov chain, and each z_n generates its observation x_n.
  – As the order grows, the number of parameters grows; the state space model organizes this by moving the Markov structure into the latent variables.
  – z_{n-1} and z_{n+1} are now conditionally independent given z_n (d-separated).
[Diagram: state space model]
Terminologies
For understanding Markov Models
Terminologies
• Markov property: a stochastic process in which the probability of a transition depends only on the present state, not on the manner in which the current state was reached.
• A transition diagram shows the same variable moving between its different states.
Terminologies (cont.)
• Θ-notation: f(n) = Θ(g(n)) means f is bounded above and below by g asymptotically [Big_O_notation, Wikipedia, Dec. 2011].
• (Review) z_{n+1} and z_{n-1} are d-separated given z_n: once we condition on z_n, every path from z_{n-1} to z_{n+1} is blocked, hence they are conditionally independent.
Markov Models
Formula and motivation
Hidden Markov Models (HMM)
• z_n is a discrete multinomial variable.
• Transition probability matrix A, with A_{jk} = p(z_n = k | z_{n-1} = j):
  – Each row sums to 1.
  – P(staying in the present state) is non-zero.
  – Counting off-diagonal entries, A has K(K-1) free parameters.
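A minimal sketch of such a matrix (hypothetical values, assuming K = 3 states; numpy used only for illustration):

```python
import numpy as np

# Hypothetical 3-state transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j).
A = np.array([
    [0.7, 0.2, 0.1],   # from state 0
    [0.3, 0.5, 0.2],   # from state 1
    [0.1, 0.3, 0.6],   # from state 2
])

assert np.allclose(A.sum(axis=1), 1.0)   # each row sums to 1
# K*K entries minus K row-sum constraints = K*(K-1) free parameters.
```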
Hidden Markov Models (cont.)
• Emission probability p(x_n | z_n, φ), with parameters φ governing the distribution.
  – Homogeneous model: all latent variables share the same transition parameters A.
  – Sampling data (ancestral sampling) simply follows the transitions and draws an observation from the emission distribution at each step; see the sketch below.
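A minimal ancestral-sampling sketch under the same assumptions (pi and A as above; B is a hypothetical discrete emission matrix with B[k, v] = p(x_n = v | z_n = k)):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(pi, A, B, n):
    """Ancestral sampling: follow transitions, draw an observation at each step."""
    states, observations = [], []
    z = rng.choice(len(pi), p=pi)                         # z_1 ~ p(z_1)
    for _ in range(n):
        states.append(z)
        observations.append(rng.choice(B.shape[1], p=B[z]))  # x_k ~ p(x_k | z_k)
        z = rng.choice(A.shape[1], p=A[z])                    # z_{k+1} ~ p(z_{k+1} | z_k)
    return states, observations
```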
HMM: Expectation Maximization for maximum likelihood
• Likelihood function: marginalize over the latent variables, p(X | θ) = Σ_Z p(X, Z | θ).
• Start with initial model parameters θ_old:
  – Evaluate the posterior over the latent variables, p(Z | X, θ_old) (E step).
  – Define Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ) and maximize it over θ (M step).
• Iterating the two steps results in a monotonically increasing likelihood.
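For reference, the E-step quantities these slides allude to (the standard Baum-Welch posteriors, as defined in Bishop §13.2.1):

```latex
\gamma(z_n) = p(z_n \mid X, \theta^{\text{old}}), \qquad
\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{\text{old}})
```

Both are obtained from the forward-backward algorithm on the following slides.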
HMM: forward-backward algorithm
• Two-stage message passing in a tree, applied to the HMM, to find the marginals p(node) efficiently.
  – Here the marginals are p(z_k | X), k = 1, ..., n.
  – Assume p(x_k | z_k), p(z_k | z_{k-1}), and p(z_1) are known.
  – Notation: X = (x_1, ..., x_n), x_{i:j} = (x_i, x_{i+1}, ..., x_j).
  – Goal: compute p(z_k | X).
  – Forward part: compute p(z_k, x_{1:k}) for every k = 1, ..., n.
  – Backward part: compute p(x_{k+1:n} | z_k) for every k = 1, ..., n-1.
HMM: forward-backward algorithm (cont.)
• p(z_k | X) ∝ p(z_k, X) = p(x_{k+1:n} | z_k, x_{1:k}) p(z_k, x_{1:k})
  – where x_{k+1:n} and x_{1:k} are d-separated given z_k,
  – so p(z_k | X) ∝ p(z_k, X) = p(x_{k+1:n} | z_k) p(z_k, x_{1:k}).
• Now we can:
  – run the EM (Baum-Welch) algorithm to estimate parameter values,
  – sample from the posterior of z given X, and find the most likely z with the Viterbi algorithm.
HMM forward-backward algorithm: Forward part
• α_k(z_k) = p(z_k, x_{1:k}) = p(x_k | z_k) Σ_{z_{k-1}} p(z_k | z_{k-1}) α_{k-1}(z_{k-1}), for k = 2, ..., n
  – emission prob.: p(x_k | z_k); transition prob.: p(z_k | z_{k-1}); recursive part: α_{k-1}(z_{k-1})
HMM forward-backward algorithm: Forward part (cont.)
• Initialization: α_1(z_1) = p(z_1, x_1) = p(z_1) p(x_1 | z_1)
• If each z has m states, the computational complexity is
  – Θ(m) for each value of z_k, for one k,
  – Θ(m²) for each k,
  – Θ(nm²) in total.
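A minimal sketch of the forward pass under these assumptions (pi, A, B as in the earlier hypothetical snippets; x is a sequence of observation indices; in practice each α_k is rescaled to avoid numerical underflow):

```python
import numpy as np

def forward(pi, A, B, x):
    """Forward pass: row k holds alpha_{k+1}(z) in the slides' 1-based notation."""
    n, m = len(x), len(pi)
    alpha = np.zeros((n, m))
    alpha[0] = pi * B[:, x[0]]                # alpha_1(z_1) = p(z_1) p(x_1 | z_1)
    for k in range(1, n):
        # emission prob. * sum over z_{k-1} of (recursive part * transition prob.)
        alpha[k] = B[:, x[k]] * (alpha[k - 1] @ A)
    return alpha
```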
HMM forward-backward algorithm: Backward part
• Compute p(x_{k+1:n} | z_k) for all z_k and all k = 1, ..., n-1:
  – p(x_{k+1:n} | z_k) = Σ_{z_{k+1}} p(x_{k+1:n}, z_{k+1} | z_k)
    = Σ_{z_{k+1}} p(x_{k+2:n} | z_{k+1}, z_k, x_{k+1}) p(x_{k+1} | z_{k+1}, z_k) p(z_{k+1} | z_k)
  – Hmm, this looks like a recursive function: label p(x_{k+1:n} | z_k) as β_k(z_k). Noting that
    • z_k, x_{k+1} and x_{k+2:n} are d-separated given z_{k+1}, and
    • z_k and x_{k+1} are d-separated given z_{k+1},
  – we get β_k(z_k) = Σ_{z_{k+1}} β_{k+1}(z_{k+1}) p(x_{k+1} | z_{k+1}) p(z_{k+1} | z_k), for k = 1, ..., n-1
    (emission prob.: p(x_{k+1} | z_{k+1}); transition prob.: p(z_{k+1} | z_k); recursive part: β_{k+1}(z_{k+1}))
HMM forward-backward algorithm: Backward part (cont.)
• Initialization: β_n(z_n) = 1 for all z_n
• If each z has m states, the computational complexity is the same as the forward part: Θ(nm²) in total.
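A matching sketch of the backward pass, plus the combination p(z_k | X) ∝ α_k(z_k) β_k(z_k) from the earlier slide (same hypothetical pi, A, B, x; forward is the function sketched above):

```python
import numpy as np

def backward(A, B, x, m):
    """Backward pass: row k holds beta_{k+1}(z) in the slides' 1-based notation."""
    n = len(x)
    beta = np.ones((n, m))                    # beta_n(z_n) = 1 for all z_n
    for k in range(n - 2, -1, -1):
        # sum over z_{k+1} of (transition prob. * emission prob. * recursive part)
        beta[k] = A @ (B[:, x[k + 1]] * beta[k + 1])
    return beta

def posterior_marginals(pi, A, B, x):
    alpha = forward(pi, A, B, x)              # forward sketch from above
    beta = backward(A, B, x, len(pi))
    gamma = alpha * beta                      # p(z_k, X) = alpha_k(z_k) beta_k(z_k)
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize to p(z_k | X)
```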
HMM: Viterbi algorithm
• Max-sum algorithm for the HMM, to find the most probable sequence of hidden states for a given observation sequence x_{1:n}.
  – Example: transforming handwriting images into text.
  – Assume p(x_k | z_k), p(z_k | z_{k-1}), and p(z_1) are known.
  – Goal: compute z* = argmax_z p(z | x), where x = x_{1:n}, z = z_{1:n}.
  – Lemma: if f(a) ≥ 0 ∀a and g(a,b) ≥ 0 ∀a,b, then
    • max_{a,b} f(a) g(a,b) = max_a [ f(a) max_b g(a,b) ]
  – max_z p(z | x) ∝ max_z p(z, x)
HMM: Viterbi algorithm (cont.)
• μ_k(z_k) = max_{z_{1:k-1}} p(z_{1:k}, x_{1:k})
  = max_{z_{1:k-1}} p(x_k | z_k) p(z_k | z_{k-1})   ..... f(a) part
                     × p(z_{1:k-1}, x_{1:k-1})      ..... g(a,b) part
• Hmm, this looks like a recursive function if we can make the max appear in front of p(z_{1:k-1}, x_{1:k-1}). Apply the lemma with a = z_{k-1}, b = z_{1:k-2}:
  = max_{z_{k-1}} [ p(x_k | z_k) p(z_k | z_{k-1}) max_{z_{1:k-2}} p(z_{1:k-1}, x_{1:k-1}) ]
  = max_{z_{k-1}} [ p(x_k | z_k) p(z_k | z_{k-1}) μ_{k-1}(z_{k-1}) ], for k = 2, ..., n
HMM: Viterbi algorithm (finish up)
• μ_k(z_k) = max_{z_{k-1}} p(x_k | z_k) p(z_k | z_{k-1}) μ_{k-1}(z_{k-1}), for k = 2, ..., n
  μ_1(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1)
• The same method gives max_{z_n} μ_n(z_n) = max_z p(x, z).
• This yields the max value; to recover the max sequence, compute the recursion bottom-up while remembering which z_{k-1} achieved each maximum (μ_k(z_k) looks at all paths through μ_{k-1}(z_{k-1})), then backtrack, as in the sketch below.
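A minimal sketch of this recursion with backtracking (same hypothetical pi, A, B, x; log probabilities are common in practice to avoid underflow, but the direct form mirrors the slides):

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state sequence z* = argmax_z p(z, x)."""
    n, m = len(x), len(pi)
    mu = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)        # remembers the maximizing z_{k-1}
    mu[0] = pi * B[:, x[0]]                   # mu_1(z_1) = p(z_1) p(x_1 | z_1)
    for k in range(1, n):
        scores = mu[k - 1][:, None] * A       # mu_{k-1}(z_{k-1}) p(z_k | z_{k-1})
        back[k] = scores.argmax(axis=0)
        mu[k] = B[:, x[k]] * scores.max(axis=0)
    # Backtrack from the best final state.
    z = [int(mu[-1].argmax())]
    for k in range(n - 1, 0, -1):
        z.append(int(back[k][z[-1]]))
    return z[::-1], mu[-1].max()              # (z*, max_z p(z, x))
```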
Additional Information
• Excerpts of equations and diagrams from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 605-646.
• Excerpts of equations from mathematicalmonk (YouTube, Google Inc.), videos ML 14.6 and 14.7, July 2011.