Cdc2011 Slides


Transcript of Cdc2011 Slides

  • Slide 1/18

    Recursive Learning Algorithm for Model Reduction of Hidden Markov Models

    Kun Deng

    Joint work with: P. G. Mehta, S. P. Meyn, and M. Vidyasagar

    Department of Mechanical Science and Engineering, Coordinated Science Laboratory

    University of Illinois at Urbana-Champaign

    50th IEEE Conference on Decision and Control, Orlando, FL, Dec 14, 2011


  • Slide 3/18

    Introduction

    Motivation

    The Hidden Markov Model (HMM) is a valuable modeling framework in applied science and engineering.

    Pattern recognition: speech, handwriting, gesture recognition,...

    Bioinformatics: gene prediction, alignment of bio-sequences,...

    Others: POMDP, Gaussian mixture model, state space model,...

    The underlying state space of an HMM might be very large.

    This increases the complexity of inference (filtering, smoothing, prediction), parameter estimation, and optimal policy design.

  • Slide 4/18

    Introduction

    Aggregating the state space of Markov chains

    Nearly completely decomposable Markov chain

    Singular perturbation theory

    Simon and Ando 1961, Kokotovic 1981, Yin and Zhang 1998

    Markov spectral theory

    Wentzell and Freidlin 1972, Dellnitz and Junge 1997, Deuflhard 2000, Huisinga, Meyn, and Schütte 2004

    [1] Deng, Mehta, and Meyn, Optimal K-L Aggregation via Spectral Theory of Markov Chains, TAC 2011.

  • Slide 5/18

    Introduction

    Recent approaches to model reduction of HMMs

    Objective:

    Reduce an HMM via aggregation of the state space.

    Metric:

    Kullback-Leibler divergence rate between the laws of the joint state-observation processes of two HMMs:

    R(\psi \,\|\, \bar\psi) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \frac{P(X_0^n, Y_0^n)}{\bar{P}(X_0^n, Y_0^n)} \right].

    Approaches:

    Recursive learning through the simulation trajectory of the HMM.

    Pros & Cons:

    We have an explicit formula for the K-L divergence rate.

    We can't use it to compare the observation processes of two HMMs.

    [1] Deng, Mehta, and Meyn, Aggregation-based model reduction of a Hidden Markov Model, CDC 2010.
    [2] Vidyasagar, K-L divergence rate between probability distributions on sets of diff. cardinalities, CDC 2010.

  • Slide 6/18

    Introduction

    New approaches for model reduction of HMMs (this talk)

    Metric: Kullback-Leibler divergence rate between the laws of the observation processes of two HMMs:

    R(\psi \,\|\, \bar\psi) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \frac{P(Y_0^n)}{\bar{P}(Y_0^n)} \right].

    Main ideas:

    Aggregate the state space and preserve the observation space.

    Employ the optimal representation for the aggregated HMM.

    Simulate the original HMM and evaluate the nonlinear filter only for the aggregated HMM.

    Approach the optimal partition using the recursive learning algorithm.

    [1] Xie, Ugrinovskii, and Petersen, Probabilistic distances between finite-state HMMs, TAC 2005.

  • Slide 7/18

    Preliminaries

    Notation for the Hidden Markov Model \psi = (\pi, A, C)

    A random process \{X_n, Y_n\}_{n \ge 0} is modeled as an HMM:

    The unobserved state process \{X_n\}_{n \ge 0} is a finite Markov chain on N = \{1, \ldots, N\}:

    A_{ij} := P(X_{n+1} = j \mid X_n = i), \quad i, j \in N.

    The observation process \{Y_n\}_{n \ge 0} is conditionally independent given the state process \{X_n\}_{n \ge 0}:

    C_{ir} := P(Y_n = r \mid X_n = i), \quad i \in N, \; r \in O.

    The initial distribution of the state process is given:

    \pi_i := P(X_0 = i), \quad i \in N.
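
    As a concrete illustration of this notation, here is a minimal sketch in Python/NumPy (the function and variable names are illustrative, not from the talk) that stores an HMM as the triple (pi, A, C) and samples a joint trajectory {x_n, y_n}:

    import numpy as np

    def sample_hmm(pi, A, C, n, rng=None):
        """Sample a length-(n+1) trajectory (x_0..x_n, y_0..y_n) from the HMM (pi, A, C)."""
        rng = np.random.default_rng() if rng is None else rng
        N, R = len(pi), C.shape[1]
        x = np.empty(n + 1, dtype=int)
        y = np.empty(n + 1, dtype=int)
        x[0] = rng.choice(N, p=pi)                  # X_0 ~ pi
        y[0] = rng.choice(R, p=C[x[0]])             # Y_0 ~ C[x_0, :]
        for k in range(n):
            x[k + 1] = rng.choice(N, p=A[x[k]])     # X_{k+1} ~ A[x_k, :]
            y[k + 1] = rng.choice(R, p=C[x[k + 1]])
        return x, y

    # Example: the 4-state, 2-observation HMM used later in the talk.
    A = np.array([[0.500, 0.200, 0.225, 0.075],
                  [0.200, 0.500, 0.135, 0.165],
                  [0.030, 0.270, 0.500, 0.200],
                  [0.150, 0.165, 0.185, 0.500]])
    C = np.array([[0.15, 0.85],
                  [0.05, 0.95],
                  [0.89, 0.11],
                  [0.88, 0.12]])
    pi0 = np.full(4, 0.25)   # assumed uniform initial distribution (not specified on the slides)
    x, y = sample_hmm(pi0, A, C, n=1000)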

  • Slide 8/18

    Preliminaries

    Ergodicity & Nondegeneracy Assumptions

    Ergodicity Assumption: All underlying Markov chains are assumed to be irreducible and aperiodic.

    There exists a unique invariant distribution \pi of the state process:

    \pi A = \pi.

    Nondegeneracy Assumption: The observation matrix C is strictly positive, i.e., C_{ir} > 0 for all i and r.

    The state process \{X_n\}_{n \ge 0} can be statistically inferred from any simulation trajectory of the observation process \{Y_n\}_{n \ge 0}.

    Recall:

    A_{ij} := P(X_{n+1} = j \mid X_n = i), \quad C_{ir} := P(Y_n = r \mid X_n = i), \quad i, j \in N, \; r \in O.
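
    The invariant distribution \pi can be computed numerically from A; a minimal sketch (illustrative NumPy code, not from the slides) solves \pi A = \pi together with the normalization \sum_i \pi_i = 1:

    import numpy as np

    def stationary_distribution(A):
        """Solve pi A = pi, sum(pi) = 1 for an irreducible, aperiodic chain."""
        N = A.shape[0]
        # Stack (A^T - I) pi^T = 0 with the constraint 1^T pi^T = 1, solve by least squares.
        M = np.vstack([A.T - np.eye(N), np.ones(N)])
        b = np.zeros(N + 1)
        b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(M, b, rcond=None)
        return pi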

  • Slide 9/18

    Preliminaries

    Kullback-Leibler divergence rate for HMMs

    Under the above assumptions, we have asymptotic convergence in the P-a.s. sense.

    Shannon-McMillan-Breiman theorem (1950s):

    \lim_{n \to \infty} \frac{1}{n} \log P(y_0^n) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log P(Y_0^n) \right] = H(\psi, \psi).

    Baum and Petrie (1966):

    \lim_{n \to \infty} \frac{1}{n} \log \bar{P}(y_0^n) = \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \bar{P}(Y_0^n) \right] = H(\psi, \bar\psi).

    For two HMMs \psi and \bar\psi defined on the same observation space:

    R(\psi \,\|\, \bar\psi) := \lim_{n \to \infty} \frac{1}{n} \, E\left[ \log \frac{P(Y_0^n)}{\bar{P}(Y_0^n)} \right] = H(\psi, \psi) - H(\psi, \bar\psi).

    [1] Breiman, The individual ergodic theorem of information theory, AMS 1957.
    [2] Baum and Petrie, Statistical inference for probabilistic functions of finite state Markov chains, AMS 1966.

  • Slide 10/18

    Preliminaries

    Nonlinear filter recursion

    Prediction filter

    p_k(i) = P(X_k = i \mid Y_{k-1}, \ldots, Y_0), \quad i \in N.

    Nonlinear filter recursion

    p_{k+1} = \frac{A^T B(Y_k)\, p_k}{b^T(Y_k)\, p_k}, \quad \text{with } b(r) = [C_{1r}, \ldots, C_{Nr}]^T, \; B(r) = \mathrm{diag}(b(r)), \; r \in O.

    Chain rule for computing the log-likelihood rate function

    \frac{1}{n} \log P(Y_0^n) = \frac{1}{n} \sum_{k=0}^{n} \log P(Y_k \mid Y_0^{k-1}), \quad \text{with } P(Y_k \mid Y_0^{k-1}) = b^T(Y_k)\, p_k.
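
    A minimal sketch of this filter recursion in NumPy (illustrative names, not from the talk); it propagates the prediction filter p_k and accumulates the log-likelihood via the chain rule above:

    import numpy as np

    def hmm_log_likelihood_rate(pi, A, C, y):
        """Run the prediction filter p_{k+1} = A^T B(y_k) p_k / (b(y_k)^T p_k)
        and return (1/n) log P(y_0^n) via the chain rule."""
        p = pi.copy()                    # p_0 = pi
        loglik = 0.0
        for yk in y:
            b = C[:, yk]                 # b(y_k) = [C_{1,y_k}, ..., C_{N,y_k}]^T
            norm = b @ p                 # P(Y_k | Y_0^{k-1}) = b(y_k)^T p_k
            loglik += np.log(norm)
            p = A.T @ (b * p) / norm     # B(y_k) p_k = b(y_k) * p_k since B is diagonal
        return loglik / len(y)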

  • Slide 11/18

    Model Reduction

    Optimal representation of the aggregated HMM

    Aggregate the state space N = \{1, \ldots, N\} into M = \{1, \ldots, M\} via a partition function \phi: N \to M, e.g.,

    \phi(1) = 1, \; \phi(2) = \phi(3) = 2, \qquad \phi^{-1}(1) = \{1\}, \; \phi^{-1}(2) = \{2, 3\}.

    Optimal representation of the aggregated HMM \bar\psi(\phi) = (\bar\pi, \bar{A}, \bar{C}):

    \bar\pi_k(\phi) = P(X_0 \in \phi^{-1}(k)), \quad k \in M.

    \bar{A}_{kl}(\phi) = P(X_{n+1} \in \phi^{-1}(l) \mid X_n \in \phi^{-1}(k)), \quad k, l \in M.

    \bar{C}_{kr}(\phi) = P(Y_n = r \mid X_n \in \phi^{-1}(k)), \quad k \in M, \; r \in O.

    Jump to backup

  • Slide 12/18

    Maximum Likelihood Estimation

    For any fixed \phi, let \bar\psi(\phi) denote the optimal aggregated HMM:

    R(\psi \,\|\, \bar\psi(\phi)) = H(\psi, \psi) - H(\psi, \bar\psi(\phi)).

    Optimal partition problem (since H(\psi, \psi) does not depend on \phi, minimizing the divergence rate is equivalent to maximizing H(\psi, \bar\psi(\phi))):

    \arg\min_{\phi} R(\psi \,\|\, \bar\psi(\phi)) \;\Longleftrightarrow\; \arg\max_{\phi} H(\psi, \bar\psi(\phi)).

    Maximum Likelihood Estimation problem

    \phi_n^{*} \in \arg\max_{\phi} \frac{1}{n} \log \bar{P}^{(\phi)}(y_0^n).

    Asymptotic convergence: \phi_n^{*} \to \phi^{*} as n \to \infty.

  • Slide 13/18

    Approaches to the MLE problem \max_{\phi} \frac{1}{n} \log \bar{P}^{(\phi)}(y_0^n)

    Exact approach: Hypothesis Testing

    Need to evaluate |\Phi| = M^N filter recursions on the reduced space M (one for each candidate partition \phi).

    Approximate approach: Recursive Stochastic-Gradient Learning

    Randomization and parametrization with \theta \in \mathbb{R}^K and K \ll M^N:

    g_k(\theta) := \omega(X_k; \theta) \, \log \bar{P}^{(\phi)}(Y_k \mid Y_0^{k-1}).

    Parameterized MLE problem

    \theta_n^{*} \in \arg\max_{\theta} \frac{1}{n} \sum_{k=0}^{n} g_k(\theta).

    Stochastic-gradient algorithm

    \theta_{n+1} = \theta_n + \varepsilon_n \, \nabla_{\theta} \left( \frac{1}{n} \sum_{k=0}^{n} g_k(\theta) \right) \bigg|_{\theta = \theta_n}.

    Jump to backup
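
    To illustrate the flavor of this learning step, here is a sketch that is not the talk's exact update: it assumes the sigmoid bi-partition parameterization from the backup slide, replaces the randomized hard partition by a soft-aggregation relaxation (each state belongs to group 1 with weight w1[i] and to group 2 with weight 1 - w1[i]; with hard 0/1 weights this reduces to the backup-slide formulas), and ascends the aggregated log-likelihood rate with a finite-difference gradient. All names and the constants K, step, and eps are illustrative choices.

    import numpy as np

    def soft_aggregate(pi, A, C, w1):
        """Soft two-group aggregation: w1[i] is the weight of state i on group 1."""
        W = np.column_stack([w1, 1.0 - w1])             # N x 2 membership weights
        mass = W.T @ pi                                  # aggregated pi_bar
        A_bar = (W.T * pi) @ A @ W / mass[:, None]       # aggregated A_bar (row-stochastic)
        C_bar = (W.T * pi) @ C / mass[:, None]           # aggregated C_bar (row-stochastic)
        return mass, A_bar, C_bar

    def log_likelihood_rate(pi, A, C, y):
        """(1/n) log P(y_0^n) via the prediction-filter chain rule."""
        p, ll = pi.copy(), 0.0
        for yk in y:
            b = C[:, yk]
            norm = b @ p
            ll += np.log(norm)
            p = A.T @ (b * p) / norm
        return ll / len(y)

    def recursive_learning(pi, A, C, y, n_iter=200, step=0.5, K=5.0, eps=1e-3, seed=0):
        """Gradient ascent on theta of the aggregated-HMM log-likelihood rate,
        using central finite differences (a stand-in for the stochastic-gradient update)."""
        rng = np.random.default_rng(seed)
        theta = 0.1 * rng.standard_normal(len(pi))
        def objective(th):
            w1 = 1.0 / (1.0 + np.exp(K * th))            # P(phi(i) = 1), as on the backup slide
            return log_likelihood_rate(*soft_aggregate(pi, A, C, w1), y)
        for _ in range(n_iter):
            grad = np.zeros_like(theta)
            for i in range(len(theta)):
                d = np.zeros_like(theta)
                d[i] = eps
                grad[i] = (objective(theta + d) - objective(theta - d)) / (2 * eps)
            theta = theta + step * grad
        return theta, 1.0 / (1.0 + np.exp(K * theta))    # theta_n and P(phi(i) = 1)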

  • Slide 14/18

    Bi-partition of a simple HMM (Hypothesis Testing)

    The parameters of the original HMM

    A = \begin{bmatrix}
    0.500 & 0.200 & 0.225 & 0.075 \\
    0.200 & 0.500 & 0.135 & 0.165 \\
    0.030 & 0.270 & 0.500 & 0.200 \\
    0.150 & 0.165 & 0.185 & 0.500
    \end{bmatrix},
    \qquad
    C = \begin{bmatrix}
    0.15 & 0.85 \\
    0.05 & 0.95 \\
    0.89 & 0.11 \\
    0.88 & 0.12
    \end{bmatrix}.

    Hypothesis testing for the optimal bi-partition:

    \phi^{*} = [1, 1, 2, 2].

    Log-likelihood rate functions: \ell_n = \frac{1}{n} \log P(y_0^n), \quad \ell_n(\phi) = \frac{1}{n} \log \bar{P}^{(\phi)}(y_0^n).
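
    A minimal end-to-end sketch of the hypothesis-testing approach for this example (illustrative NumPy code, not the talk's implementation): simulate y_0^n from the original HMM, build the aggregated HMM \bar\psi(\phi) from the backup-slide formulas for every non-trivial bi-partition \phi, and pick the one with the largest log-likelihood rate \ell_n(\phi).

    import numpy as np
    from itertools import product

    def stationary(A):
        M = np.vstack([A.T - np.eye(len(A)), np.ones(len(A))])
        return np.linalg.lstsq(M, np.r_[np.zeros(len(A)), 1.0], rcond=None)[0]

    def aggregate(pi, A, C, phi, M=2):
        """Aggregated HMM (pi_bar, A_bar, C_bar) for partition phi (backup-slide formulas)."""
        W = np.eye(M)[phi]                               # N x M hard membership matrix
        mass = W.T @ pi
        return mass, (W.T * pi) @ A @ W / mass[:, None], (W.T * pi) @ C / mass[:, None]

    def loglik_rate(pi, A, C, y):
        p, ll = pi.copy(), 0.0
        for yk in y:
            b = C[:, yk]
            norm = b @ p
            ll += np.log(norm)
            p = A.T @ (b * p) / norm
        return ll / len(y)

    A = np.array([[0.500, 0.200, 0.225, 0.075],
                  [0.200, 0.500, 0.135, 0.165],
                  [0.030, 0.270, 0.500, 0.200],
                  [0.150, 0.165, 0.185, 0.500]])
    C = np.array([[0.15, 0.85], [0.05, 0.95], [0.89, 0.11], [0.88, 0.12]])
    pi = stationary(A)
    pi = np.clip(pi, 0.0, None)
    pi /= pi.sum()                                       # guard against tiny numerical noise

    # Simulate y_0^n from the original HMM (X_0 ~ pi assumed).
    rng = np.random.default_rng(0)
    n = 2000
    x = [rng.choice(4, p=pi)]
    for _ in range(n):
        x.append(rng.choice(4, p=A[x[-1]]))
    y = np.array([rng.choice(2, p=C[xi]) for xi in x])

    # Hypothesis testing: evaluate l_n(phi) for every non-trivial bi-partition.
    best = max((p for p in product([0, 1], repeat=4) if 0 < sum(p) < 4),
               key=lambda p: loglik_rate(*aggregate(pi, A, C, np.array(p)), y))
    print("optimal bi-partition:", [g + 1 for g in best])  # expect {1,2} | {3,4}, i.e. phi* = [1,1,2,2] up to relabeling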

  • Slide 15/18

    Bi-partition of a simple HMM (Recursive Learning)

    Learning through the simulation trajectory \{x_n, y_n\}_{n \ge 0} of the HMM.

    After n = 1000 iterations

    The probabilities of all states being in the first group

    \omega(i; \theta_n)\big|_{\phi = [1,1,1,1]} = [0.9948, \; 0.9984, \; 0.0034, \; 0.0032].

    The optimal partition function is obtained nearly deterministically

    \phi^{*} = [1, 1, 2, 2].

  • Slide 16/18

    Conclusions and future directions

    Conclusions

    K-L divergence rate between the laws of the observation processes

    Optimal representation for the aggregated HMM

    Hypothesis testing and recursive learning algorithm

    Future directions

    Convergence rate and performance analysis

    Applications to bioinformatics

    Thank You!

  • Slide 17/18

    Backup: Optimal representation

    Optimal representation of the aggregated HMM \bar\psi(\phi) = (\bar\pi, \bar{A}, \bar{C}):

    \bar\pi_k(\phi) = \sum_{i \in \phi^{-1}(k)} \pi_i, \quad k \in M.

    \bar{A}_{kl}(\phi) = \frac{\sum_{i \in \phi^{-1}(k)} \pi_i \sum_{j \in \phi^{-1}(l)} A_{ij}}{\sum_{i \in \phi^{-1}(k)} \pi_i}, \quad k, l \in M.

    \bar{C}_{kr}(\phi) = \frac{\sum_{i \in \phi^{-1}(k)} \pi_i\, C_{ir}}{\sum_{i \in \phi^{-1}(k)} \pi_i}, \quad k \in M, \; r \in O.

    Jump back to origin

    [1] Deng, Mehta, and Meyn, Aggregation-based model reduction of a Hidden Markov Model, CDC 2010.
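
    These formulas translate directly into code; a minimal sketch (illustrative NumPy code; \pi is taken to be the invariant distribution of A, and states and groups are 0-indexed):

    import numpy as np

    def aggregate_hmm(pi, A, C, phi, M):
        """Aggregated HMM (pi_bar, A_bar, C_bar) for a partition phi: {0..N-1} -> {0..M-1}."""
        pi_bar = np.zeros(M)
        A_bar = np.zeros((M, M))
        C_bar = np.zeros((M, C.shape[1]))
        for k in range(M):
            idx = np.flatnonzero(phi == k)              # phi^{-1}(k)
            mass = pi[idx].sum()                        # sum over i in phi^{-1}(k) of pi_i
            pi_bar[k] = mass
            for l in range(M):
                jdx = np.flatnonzero(phi == l)          # phi^{-1}(l)
                A_bar[k, l] = (pi[idx, None] * A[np.ix_(idx, jdx)]).sum() / mass
            C_bar[k] = (pi[idx, None] * C[idx]).sum(axis=0) / mass
        return pi_bar, A_bar, C_bar

    # Example: phi = [0, 0, 1, 1] groups states {1, 2} and {3, 4} of a 4-state HMM.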

  • Slide 18/18

    Backup: Bi-partition randomized policy

    Parameter vector \theta = [\theta_1, \ldots, \theta_N]^T.

    Probability of group assignments

    P(\phi(i) = 1) = \frac{1}{1 + \exp(K\theta_i)}, \qquad P(\phi(i) = 2) = \frac{\exp(K\theta_i)}{1 + \exp(K\theta_i)}.

    Randomized and parameterized partition

    \omega(i; \theta) = \frac{1}{1 + \exp(K\theta_i)}\, \mathbb{1}\{\phi(i) = 1\} + \frac{\exp(K\theta_i)}{1 + \exp(K\theta_i)}\, \mathbb{1}\{\phi(i) = 2\}.

    Jump back to origin
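
    A minimal sketch of this randomized bi-partition policy (illustrative Python/NumPy; K is the gain constant from the slide, but its value is not given, so the default below is an arbitrary choice):

    import numpy as np

    def group_probabilities(theta, K=5.0):
        """P(phi(i) = 1) and P(phi(i) = 2) for each state i, as defined above."""
        p1 = 1.0 / (1.0 + np.exp(K * theta))
        return p1, 1.0 - p1

    def sample_partition(theta, K=5.0, rng=None):
        """Draw a random bi-partition phi from the policy omega(.; theta)."""
        rng = np.random.default_rng() if rng is None else rng
        p1, _ = group_probabilities(theta, K)
        return np.where(rng.random(len(theta)) < p1, 1, 2)   # group labels in {1, 2}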
