Cdc2011 Slides


Transcript of Cdc2011 Slides

  • Slide 1/18

    Recursive Learning Algorithm for Model Reduction of Hidden Markov Models

    Kun Deng

    Joint work with: P. G. Mehta, S. P. Meyn, and M. Vidyasagar

    Department of Mechanical Science and Engineering, Coordinated Science Laboratory

    University of Illinois at Urbana-Champaign

    50th IEEE Conference on Decision and Control, Orlando, FL, Dec 14, 2011


  • Slide 3/18

    Introduction

    Motivation

    The Hidden Markov Model (HMM) is a valuable modeling framework in applied science and engineering.

    Pattern recognition: speech, handwriting, gesture recognition,...

    Bioinformatics: gene prediction, alignment of bio-sequences,...

    Others: POMDP, Gaussian mixture model, state space model,...

    The underlying state space of an HMM might be very large.

    This increases the complexity of inference (filtering, smoothing, prediction), parameter estimation, and optimal policy design.

  • Slide 4/18

    Introduction

    Aggregating the state space of Markov chains

    Nearly completely decomposable Markov chain

    Singular perturbation theory

    Simon and Ando 1961, Kokotovic 1981, Yin and Zhang 1998

    Markov spectral theory

    Wentzell and Freidlin 1972, Dellnitz and Junge 1997, Deuflhard 2000, Huisinga, Meyn, and Schütte 2004

    [1] Deng, Mehta, and Meyn, Optimal K-L Aggregation via Spectral Theory of Markov Chains, TAC 2011.

  • Slide 5/18

    Introduction

    Recent approaches to model reduction of HMMs

    Objective:

    Reduce an HMM via aggregation of the state space.

    Metric:

    Kullback-Leibler divergence rate between the laws of the joint state-observation processes of two HMMs:

    R(\psi \,\|\, \bar\psi) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \frac{P(X_0^n, Y_0^n)}{\bar{P}(X_0^n, Y_0^n)} \right].

    Approaches:

    Recursive learning through the simulation trajectory of the HMM.

    Pros & Cons:

    We have an explicit formula for the K-L divergence rate.

    We can't use it to compare the observation processes of two HMMs.

    [1] Deng, Mehta, and Meyn, Aggregation-based model reduction of a Hidden Markov Model, CDC 2010.
    [2] Vidyasagar, K-L divergence rate between probability distributions on sets of diff. cardinalities, CDC 2010.

  • Slide 6/18

    Introduction

    New approaches for model reduction of HMMs (this talk)

    Metric: Kullback-Leibler divergence rate between the laws of the observation processes of two HMMs:

    R(\psi \,\|\, \bar\psi) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \frac{P(Y_0^n)}{\bar{P}(Y_0^n)} \right].

    Main ideas:

    Aggregate the state space and preserve the observation space.

    Employ the optimal representation for the aggregated HMM.

    Simulate the original HMM and evaluate the nonlinear filter only for the aggregated HMM.

    Approach the optimal partition using the recursive learning algorithm.

    [1] Xie, Ugrinovskii, and Petersen, Probabilistic distances between finite-state HMMs, TAC 2005.

  • Slide 7/18

    Preliminaries

    Notation for the Hidden Markov Model \psi = (\pi, A, C)

    A random process \{X_n, Y_n\}_{n \ge 0} is modeled as an HMM:

    The unobserved state process \{X_n\}_{n \ge 0} is a finite Markov chain on N = \{1, \ldots, N\}:

    A_{ij} := P(X_{n+1} = j \mid X_n = i), \quad i, j \in N.

    The observation process \{Y_n\}_{n \ge 0} is conditionally independent given the state process \{X_n\}_{n \ge 0}:

    C_{ir} := P(Y_n = r \mid X_n = i), \quad i \in N, \; r \in O.

    The initial distribution of the state process is given:

    \pi_i := P(X_0 = i), \quad i \in N.
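
    As a concrete illustration of this notation, here is a minimal sketch in Python/NumPy (the function and variable names are illustrative, not from the talk) that stores an HMM as the triple (pi, A, C) and samples a joint trajectory {x_n, y_n}:

    import numpy as np

    def sample_hmm(pi, A, C, n, rng=None):
        """Sample a length-(n+1) trajectory (x_0..x_n, y_0..y_n) from the HMM (pi, A, C)."""
        rng = np.random.default_rng() if rng is None else rng
        N, R = len(pi), C.shape[1]
        x = np.empty(n + 1, dtype=int)
        y = np.empty(n + 1, dtype=int)
        x[0] = rng.choice(N, p=pi)                  # X_0 ~ pi
        y[0] = rng.choice(R, p=C[x[0]])             # Y_0 ~ C[x_0, :]
        for k in range(n):
            x[k + 1] = rng.choice(N, p=A[x[k]])     # X_{k+1} ~ A[x_k, :]
            y[k + 1] = rng.choice(R, p=C[x[k + 1]])
        return x, y

    # Example: the 4-state, 2-observation HMM used later in the talk.
    A = np.array([[0.500, 0.200, 0.225, 0.075],
                  [0.200, 0.500, 0.135, 0.165],
                  [0.030, 0.270, 0.500, 0.200],
                  [0.150, 0.165, 0.185, 0.500]])
    C = np.array([[0.15, 0.85],
                  [0.05, 0.95],
                  [0.89, 0.11],
                  [0.88, 0.12]])
    pi0 = np.full(4, 0.25)   # assumed uniform initial distribution (not specified on the slides)
    x, y = sample_hmm(pi0, A, C, n=1000)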

  • Slide 8/18

    Preliminaries

    Ergodicity & Nondegeneracy Assumptions

    Ergodicity Assumption: All underlying Markov chains are assumed to be irreducible and aperiodic.

    There exists a unique invariant distribution \pi of the state process:

    \pi A = \pi.

    Nondegeneracy Assumption: The observation matrix C is strictly positive, i.e., C_{ir} > 0 for all i and r.

    The state process \{X_n\}_{n \ge 0} can be statistically inferred from any simulation trajectory of the observation process \{Y_n\}_{n \ge 0}.

    Recall:

    A_{ij} := P(X_{n+1} = j \mid X_n = i), \quad C_{ir} := P(Y_n = r \mid X_n = i), \quad i, j \in N, \; r \in O.
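
    The invariant distribution \pi can be computed numerically from A; a minimal sketch (illustrative NumPy code, not from the slides) solves \pi A = \pi together with the normalization \sum_i \pi_i = 1:

    import numpy as np

    def stationary_distribution(A):
        """Solve pi A = pi, sum(pi) = 1 for an irreducible, aperiodic chain."""
        N = A.shape[0]
        # Stack (A^T - I) pi^T = 0 with the constraint 1^T pi^T = 1, solve by least squares.
        M = np.vstack([A.T - np.eye(N), np.ones(N)])
        b = np.zeros(N + 1)
        b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(M, b, rcond=None)
        return pi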

  • Slide 9/18

    Preliminaries

    Kullback-Leibler divergence rate for HMMs

    Under the above assumptions, we have asymptotic convergence in the P-a.s. sense.

    Shannon-McMillan-Breiman theorem (1950s):

    \lim_{n \to \infty} \frac{1}{n} \log P(y_0^n) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log P(Y_0^n) \right] = H(\psi, \psi).

    Baum and Petrie (1966):

    \lim_{n \to \infty} \frac{1}{n} \log \bar{P}(y_0^n) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \bar{P}(Y_0^n) \right] = H(\psi, \bar\psi).

    For two HMMs \psi and \bar\psi defined on the same observation space:

    R(\psi \,\|\, \bar\psi) := \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \frac{P(Y_0^n)}{\bar{P}(Y_0^n)} \right] = H(\psi, \psi) - H(\psi, \bar\psi).

    [1] Breiman, The individual ergodic theorem of information theory, AMS 1957.
    [2] Baum and Petrie, Statistical inference for probabilistic functions of finite state Markov chains, AMS 1966.

  • Slide 10/18

    Preliminaries

    Nonlinear filter recursion

    Prediction filter

    p_k(i) = P(X_k = i \mid Y_{k-1}, \ldots, Y_0), \quad i \in N.

    Nonlinear filter recursion

    p_{k+1} = \frac{A^T B(Y_k)\, p_k}{b^T(Y_k)\, p_k}, \quad \text{with } b(r) = [C_{1r}, \ldots, C_{Nr}]^T, \; B(r) = \mathrm{diag}(b(r)), \; r \in O.

    Chain rule for computing the log-likelihood rate function

    \frac{1}{n} \log P(Y_0^n) = \frac{1}{n} \sum_{k=0}^{n} \log P(Y_k \mid Y_0^{k-1}), \quad \text{with } P(Y_k \mid Y_0^{k-1}) = b^T(Y_k)\, p_k.
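
    A minimal sketch of this filter recursion in NumPy (illustrative names, not from the talk); it propagates the prediction filter p_k and accumulates the log-likelihood via the chain rule above:

    import numpy as np

    def hmm_log_likelihood_rate(pi, A, C, y):
        """Run the prediction filter p_{k+1} = A^T B(y_k) p_k / (b(y_k)^T p_k)
        and return (1/n) log P(y_0^n) via the chain rule."""
        p = pi.copy()                    # p_0 = pi
        loglik = 0.0
        for yk in y:
            b = C[:, yk]                 # b(y_k) = [C_{1,y_k}, ..., C_{N,y_k}]^T
            norm = b @ p                 # P(Y_k | Y_0^{k-1}) = b(y_k)^T p_k
            loglik += np.log(norm)
            p = A.T @ (b * p) / norm     # B(y_k) p_k = b(y_k) * p_k since B is diagonal
        return loglik / len(y)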

  • Slide 11/18

    Model Reduction

    Optimal representation of the aggregated HMM

    Aggregate the state space N = \{1, \ldots, N\} into M = \{1, \ldots, M\} via a partition function \phi: N \to M, e.g.,

    \phi(1) = 1, \; \phi(2) = \phi(3) = 2, \qquad \phi^{-1}(1) = \{1\}, \; \phi^{-1}(2) = \{2, 3\}.

    Optimal representation of the aggregated HMM \bar\psi(\phi) = (\bar\pi, \bar{A}, \bar{C}):

    \bar\pi_k(\phi) = P(X_0 \in \phi^{-1}(k)), \quad k \in M.

    \bar{A}_{kl}(\phi) = P(X_{n+1} \in \phi^{-1}(l) \mid X_n \in \phi^{-1}(k)), \quad k, l \in M.

    \bar{C}_{kr}(\phi) = P(Y_n = r \mid X_n \in \phi^{-1}(k)), \quad k \in M, \; r \in O.

    Jump to backup

  • Slide 12/18

    Maximum Likelihood Estimation

    For any fixed \phi, let \bar\psi(\phi) denote the optimal aggregated HMM:

    R(\psi \,\|\, \bar\psi(\phi)) = H(\psi, \psi) - H(\psi, \bar\psi(\phi)).

    Optimal partition problem (since H(\psi, \psi) does not depend on \phi, minimizing the divergence rate is equivalent to maximizing H(\psi, \bar\psi(\phi))):

    \arg\min_{\phi} R(\psi \,\|\, \bar\psi(\phi)) \;\Longleftrightarrow\; \arg\max_{\phi} H(\psi, \bar\psi(\phi)).

    Maximum Likelihood Estimation problem

    \phi_n^{*} \in \arg\max_{\phi} \frac{1}{n} \log \bar{P}^{(\phi)}(y_0^n).

    Asymptotic convergence: \phi_n^{*} \to \phi^{*} as n \to \infty.

  • Slide 13/18

    Approaches to the MLE problem \max_{\phi} \frac{1}{n} \log \bar{P}^{(\phi)}(y_0^n)

    Exact approach: Hypothesis Testing

    Need to evaluate |\Phi| = M^N filter recursions on the reduced space M (one for each candidate partition \phi).

    Approximate approach: Recursive Stochastic-Gradient Learning

    Randomization and parametrization with \theta \in \mathbb{R}^K and K \ll M^N:

    g_k(\theta) := \omega(X_k; \theta) \, \log \bar{P}^{(\phi)}(Y_k \mid Y_0^{k-1}).

    Parameterized MLE problem

    \theta_n^{*} \in \arg\max_{\theta} \frac{1}{n} \sum_{k=0}^{n} g_k(\theta).

    Stochastic-gradient algorithm

    \theta_{n+1} = \theta_n + \varepsilon_n \, \nabla_{\theta} \left( \frac{1}{n} \sum_{k=0}^{n} g_k(\theta) \right) \bigg|_{\theta = \theta_n}.

    Jump to backup
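
    To illustrate the flavor of this learning step, here is a sketch that is not the talk's exact update: it assumes the sigmoid bi-partition parameterization from the backup slide, replaces the randomized hard partition by a soft-aggregation relaxation (each state belongs to group 1 with weight w1[i] and to group 2 with weight 1 - w1[i]; with hard 0/1 weights this reduces to the backup-slide formulas), and ascends the aggregated log-likelihood rate with a finite-difference gradient. All names and the constants K, step, and eps are illustrative choices.

    import numpy as np

    def soft_aggregate(pi, A, C, w1):
        """Soft two-group aggregation: w1[i] is the weight of state i on group 1."""
        W = np.column_stack([w1, 1.0 - w1])             # N x 2 membership weights
        mass = W.T @ pi                                  # aggregated pi_bar
        A_bar = (W.T * pi) @ A @ W / mass[:, None]       # aggregated A_bar (row-stochastic)
        C_bar = (W.T * pi) @ C / mass[:, None]           # aggregated C_bar (row-stochastic)
        return mass, A_bar, C_bar

    def log_likelihood_rate(pi, A, C, y):
        """(1/n) log P(y_0^n) via the prediction-filter chain rule."""
        p, ll = pi.copy(), 0.0
        for yk in y:
            b = C[:, yk]
            norm = b @ p
            ll += np.log(norm)
            p = A.T @ (b * p) / norm
        return ll / len(y)

    def recursive_learning(pi, A, C, y, n_iter=200, step=0.5, K=5.0, eps=1e-3, seed=0):
        """Gradient ascent on theta of the aggregated-HMM log-likelihood rate,
        using central finite differences (a stand-in for the stochastic-gradient update)."""
        rng = np.random.default_rng(seed)
        theta = 0.1 * rng.standard_normal(len(pi))
        def objective(th):
            w1 = 1.0 / (1.0 + np.exp(K * th))            # P(phi(i) = 1), as on the backup slide
            return log_likelihood_rate(*soft_aggregate(pi, A, C, w1), y)
        for _ in range(n_iter):
            grad = np.zeros_like(theta)
            for i in range(len(theta)):
                d = np.zeros_like(theta)
                d[i] = eps
                grad[i] = (objective(theta + d) - objective(theta - d)) / (2 * eps)
            theta = theta + step * grad
        return theta, 1.0 / (1.0 + np.exp(K * theta))    # theta_n and P(phi(i) = 1)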

  • Slide 14/18

    Bi-partition of a simple HMM (Hypothesis Testing)

    The parameters of the original HMM

    A = \begin{bmatrix}
    0.500 & 0.200 & 0.225 & 0.075 \\
    0.200 & 0.500 & 0.135 & 0.165 \\
    0.030 & 0.270 & 0.500 & 0.200 \\
    0.150 & 0.165 & 0.185 & 0.500
    \end{bmatrix},
    \qquad
    C = \begin{bmatrix}
    0.15 & 0.85 \\
    0.05 & 0.95 \\
    0.89 & 0.11 \\
    0.88 & 0.12
    \end{bmatrix}.

    Hypothesis testing for the optimal bi-partition:

    \phi^{*} = [1, 1, 2, 2].

    Log-likelihood rate functions: \ell_n = \frac{1}{n} \log P(y_0^n), \quad \ell_n(\phi) = \frac{1}{n} \log \bar{P}^{(\phi)}(y_0^n).
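
    A minimal end-to-end sketch of the hypothesis-testing approach for this example (illustrative NumPy code, not the talk's implementation): simulate y_0^n from the original HMM, build the aggregated HMM \bar\psi(\phi) from the backup-slide formulas for every non-trivial bi-partition \phi, and pick the one with the largest log-likelihood rate \ell_n(\phi).

    import numpy as np
    from itertools import product

    def stationary(A):
        M = np.vstack([A.T - np.eye(len(A)), np.ones(len(A))])
        return np.linalg.lstsq(M, np.r_[np.zeros(len(A)), 1.0], rcond=None)[0]

    def aggregate(pi, A, C, phi, M=2):
        """Aggregated HMM (pi_bar, A_bar, C_bar) for partition phi (backup-slide formulas)."""
        W = np.eye(M)[phi]                               # N x M hard membership matrix
        mass = W.T @ pi
        return mass, (W.T * pi) @ A @ W / mass[:, None], (W.T * pi) @ C / mass[:, None]

    def loglik_rate(pi, A, C, y):
        p, ll = pi.copy(), 0.0
        for yk in y:
            b = C[:, yk]
            norm = b @ p
            ll += np.log(norm)
            p = A.T @ (b * p) / norm
        return ll / len(y)

    A = np.array([[0.500, 0.200, 0.225, 0.075],
                  [0.200, 0.500, 0.135, 0.165],
                  [0.030, 0.270, 0.500, 0.200],
                  [0.150, 0.165, 0.185, 0.500]])
    C = np.array([[0.15, 0.85], [0.05, 0.95], [0.89, 0.11], [0.88, 0.12]])
    pi = stationary(A)
    pi = np.clip(pi, 0.0, None)
    pi /= pi.sum()                                       # guard against tiny numerical noise

    # Simulate y_0^n from the original HMM (X_0 ~ pi assumed).
    rng = np.random.default_rng(0)
    n = 2000
    x = [rng.choice(4, p=pi)]
    for _ in range(n):
        x.append(rng.choice(4, p=A[x[-1]]))
    y = np.array([rng.choice(2, p=C[xi]) for xi in x])

    # Hypothesis testing: evaluate l_n(phi) for every non-trivial bi-partition.
    best = max((p for p in product([0, 1], repeat=4) if 0 < sum(p) < 4),
               key=lambda p: loglik_rate(*aggregate(pi, A, C, np.array(p)), y))
    print("optimal bi-partition:", [g + 1 for g in best])  # expect {1,2} | {3,4}, i.e. phi* = [1,1,2,2] up to relabeling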

  • Slide 15/18

    Bi-partition of a simple HMM (Recursive Learning)

    Learning through the simulation trajectory \{x_n, y_n\}_{n \ge 0} of the HMM.

    After n = 1000 iterations

    The probabilities of all states being in the first group

    \omega(i; \theta_n)\big|_{\phi = [1,1,1,1]} = [0.9948, \; 0.9984, \; 0.0034, \; 0.0032].

    The optimal partition function is obtained nearly deterministically

    \phi^{*} = [1, 1, 2, 2].

  • Slide 16/18

    Conclusions and future directions

    Conclusions

    K-L divergence rate between the laws of the observation processes

    Optimal representation for the aggregated HMM

    Hypothesis testing and recursive learning algorithm

    Future directions

    Convergence rate and performance analysis

    Applications to bioinformatics

    Thank You!

  • Slide 17/18

    Backup: Optimal representation

    Optimal representation of the aggregated HMM \bar\psi(\phi) = (\bar\pi, \bar{A}, \bar{C}):

    \bar\pi_k(\phi) = \sum_{i \in \phi^{-1}(k)} \pi_i, \quad k \in M.

    \bar{A}_{kl}(\phi) = \frac{\sum_{i \in \phi^{-1}(k)} \pi_i \sum_{j \in \phi^{-1}(l)} A_{ij}}{\sum_{i \in \phi^{-1}(k)} \pi_i}, \quad k, l \in M.

    \bar{C}_{kr}(\phi) = \frac{\sum_{i \in \phi^{-1}(k)} \pi_i\, C_{ir}}{\sum_{i \in \phi^{-1}(k)} \pi_i}, \quad k \in M, \; r \in O.

    Jump back to origin

    [1] Deng, Mehta, and Meyn, Aggregation-based model reduction of a Hidden Markov Model, CDC 2010.
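
    These formulas translate directly into code; a minimal sketch (illustrative NumPy code; \pi is taken to be the invariant distribution of A, and states and groups are 0-indexed):

    import numpy as np

    def aggregate_hmm(pi, A, C, phi, M):
        """Aggregated HMM (pi_bar, A_bar, C_bar) for a partition phi: {0..N-1} -> {0..M-1}."""
        pi_bar = np.zeros(M)
        A_bar = np.zeros((M, M))
        C_bar = np.zeros((M, C.shape[1]))
        for k in range(M):
            idx = np.flatnonzero(phi == k)              # phi^{-1}(k)
            mass = pi[idx].sum()                        # sum over i in phi^{-1}(k) of pi_i
            pi_bar[k] = mass
            for l in range(M):
                jdx = np.flatnonzero(phi == l)          # phi^{-1}(l)
                A_bar[k, l] = (pi[idx, None] * A[np.ix_(idx, jdx)]).sum() / mass
            C_bar[k] = (pi[idx, None] * C[idx]).sum(axis=0) / mass
        return pi_bar, A_bar, C_bar

    # Example: phi = [0, 0, 1, 1] groups states {1, 2} and {3, 4} of a 4-state HMM.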

  • Slide 18/18

    Backup: Bi-partition randomized policy

    Parameter vector \theta = [\theta_1, \ldots, \theta_N]^T.

    Probability of group assignments

    P(\phi(i) = 1) = \frac{1}{1 + \exp(K\theta_i)}, \qquad P(\phi(i) = 2) = \frac{\exp(K\theta_i)}{1 + \exp(K\theta_i)}.

    Randomized and parameterized partition

    \omega(i; \theta) = \frac{1}{1 + \exp(K\theta_i)}\, \mathbb{1}\{\phi(i) = 1\} + \frac{\exp(K\theta_i)}{1 + \exp(K\theta_i)}\, \mathbb{1}\{\phi(i) = 2\}.

    Jump back to origin
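
    A minimal sketch of this randomized bi-partition policy (illustrative Python/NumPy; K is the gain constant from the slide, but its value is not given, so the default below is an arbitrary choice):

    import numpy as np

    def group_probabilities(theta, K=5.0):
        """P(phi(i) = 1) and P(phi(i) = 2) for each state i, as defined above."""
        p1 = 1.0 / (1.0 + np.exp(K * theta))
        return p1, 1.0 - p1

    def sample_partition(theta, K=5.0, rng=None):
        """Draw a random bi-partition phi from the policy omega(.; theta)."""
        rng = np.random.default_rng() if rng is None else rng
        p1, _ = group_probabilities(theta, K)
        return np.where(rng.random(len(theta)) < p1, 1, 2)   # group labels in {1, 2}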
