Cdc2011 Slides
Transcript of Cdc2011 Slides
-
8/3/2019 Cdc2011 Slides
1/18
Recursive Learning Algorithm for Model Reduction of Hidden Markov Models
Kun Deng
Joint work with: P. G. Mehta, S. P. Meyn, and M. Vidyasagar
Department of Mechanical Science and Engineering
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
50th IEEE Conference on Decision and Control
Orlando, FL, Dec 14, 2011
-
Introduction
Motivation
Hidden Markov Model (HMM) forms a valuable modeling framework in applied science and engineering.
Pattern recognition: speech, handwriting, gesture recognition,...
Bioinformatics: gene prediction, alignment of bio-sequences,...
Others: POMDP, Gaussian mixture model, state space model,...
The underlying state space of an HMM might be very large.
Increased complexity in inference (filtering, smoothing, prediction), parameter estimation, and optimal policy design.
-
Introduction
Aggregating the state space of Markov chains
Nearly completely decomposable Markov chain
Singular perturbation theory
Simon and Ando 1961, Kokotovic 1981, Yin and Zhang 1998
Markov spectral theory
Wentzell and Freidlin 1972, Dellnitz and Junge 1997, Deuflhard 2000, Huisinga, Meyn, and Schütte 2004
[1] Deng, Mehta, and Meyn, Optimal K-L Aggregation via Spectral Theory of Markov Chains, TAC 2011.
-
Introduction
Recent approaches to model reduction of HMMs
Objective:
Reduce an HMM via aggregation of its state space.
Metric:
Kullback-Leibler divergence rate between the laws of the joint state-observation processes of two HMMs:

R(\theta \,\|\, \bar{\theta}) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{E}\!\left[ \log \frac{P_{\theta}(X_0^n, Y_0^n)}{P_{\bar{\theta}}(X_0^n, Y_0^n)} \right].
Approaches:
Recursive learning through the simulation trajectory of the HMM.
Pros & Cons:
We have the explicit formula for expressing the K-L divergence rate.
We can't use it to compare the observation processes of two HMMs.
[1] Deng, Mehta, and Meyn, Aggregation-based model reduction of a Hidden Markov Model, CDC 2010.
[2] Vidyasagar, K-L divergence rate between probability distributions on sets of different cardinalities, CDC 2010.
-
Introduction
New approach for model reduction of HMMs (this talk)
Metric:
Kullback-Leibler divergence rate between the laws of the observation processes of two HMMs:

R(\theta \,\|\, \bar{\theta}) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{E}\!\left[ \log \frac{P_{\theta}(Y_0^n)}{P_{\bar{\theta}}(Y_0^n)} \right].

Main ideas:
Aggregate the state space and preserve the observation space.
Employ the optimal representation for the aggregated HMM.
Simulate the original HMM and evaluate the nonlinear filter only forthe aggregated HMM.
Approach the optimal partition using the recursive learning algorithm.
[1] Xie, Ugrinovskii, and Petersen, Probabilistic distances between finite-state HMMs, TAC 2005.
-
Preliminaries
Notation for a Hidden Markov Model \theta = (\pi, A, C)
A random process \{X_n, Y_n\}_{n\ge 0} is modeled as an HMM:
The unobserved state process \{X_n\}_{n\ge 0} is a finite Markov chain.

A_{ij} := P(X_{n+1} = j \mid X_n = i), \quad i, j \in \mathcal{N}.

The observations \{Y_n\}_{n\ge 0} are conditionally independent given the state process \{X_n\}_{n\ge 0}.

C_{ir} := P(Y_n = r \mid X_n = i), \quad i \in \mathcal{N},\ r \in \mathcal{O}.

The initial distribution of the state process is given:

\pi_i := P(X_0 = i), \quad i \in \mathcal{N}.
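To make the generative mechanism above concrete, here is a minimal sampling sketch in Python. The function name, 0-based state/observation labels, and the toy matrices `pi`, `A`, `C` are illustrative assumptions, not parameters from the talk:

```python
import numpy as np

def sample_hmm(pi, A, C, n, rng=None):
    """Sample a length-n trajectory {(x_k, y_k)} from an HMM theta = (pi, A, C)."""
    rng = np.random.default_rng(rng)
    N, R = C.shape
    x = np.empty(n, dtype=int)
    y = np.empty(n, dtype=int)
    x[0] = rng.choice(N, p=pi)            # X_0 ~ pi
    y[0] = rng.choice(R, p=C[x[0]])       # Y_0 ~ C[x_0, :]
    for k in range(1, n):
        x[k] = rng.choice(N, p=A[x[k - 1]])   # A_ij = P(X_{k+1} = j | X_k = i)
        y[k] = rng.choice(R, p=C[x[k]])       # C_ir = P(Y_k = r | X_k = i)
    return x, y

# Toy 2-state, 2-symbol example (illustrative values only)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
C = np.array([[0.8, 0.2], [0.3, 0.7]])
x, y = sample_hmm(pi, A, C, n=10, rng=0)
```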
-
Preliminaries
Ergodicity & Nondegeneracy Assumptions
Ergodicity Assumption: All underlying Markov chains are assumed to be irreducible and aperiodic.
There exists a unique invariant distribution \pi of the state process:

\pi A = \pi.

Nondegeneracy Assumption: The observation matrix C is strictly positive, i.e., C_{ir} > 0 for all i and r.
The state process \{X_n\}_{n\ge 0} can be statistically inferred from any simulation trajectory of the observation process \{Y_n\}_{n\ge 0}.
Recall:

A_{ij} := P(X_{n+1} = j \mid X_n = i), \quad C_{ir} := P(Y_n = r \mid X_n = i), \quad i, j \in \mathcal{N},\ r \in \mathcal{O}.
-
Preliminaries
Kullback-Leibler divergence rate for HMMs
Under the two assumptions, we have asymptotic convergence in the P_\theta-a.s. sense.
Shannon-McMillan-Breiman theorem (1950s):

\lim_{n\to\infty} \frac{1}{n} \log P_{\theta}(y_0^n) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{E}\!\left[ \log P_{\theta}(Y_0^n) \right] =: H(\theta, \theta).

Baum and Petrie (1966):

\lim_{n\to\infty} \frac{1}{n} \log P_{\bar{\theta}}(y_0^n) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{E}\!\left[ \log P_{\bar{\theta}}(Y_0^n) \right] =: H(\theta, \bar{\theta}).

For two HMMs \theta and \bar{\theta} defined on the same observation space:

R(\theta \,\|\, \bar{\theta}) := \lim_{n\to\infty} \frac{1}{n}\, \mathrm{E}\!\left[ \log \frac{P_{\theta}(Y_0^n)}{P_{\bar{\theta}}(Y_0^n)} \right] = H(\theta, \theta) - H(\theta, \bar{\theta}).

[1] Breiman, The individual ergodic theorem of information theory, AMS 1957.
[2] Baum and Petrie, Statistical inference for probabilistic functions of finite state Markov chains, AMS 1966.
-
Preliminaries
Nonlinear filter recursion
Prediction filter

p_k(i) = P(X_k = i \mid Y_{k-1}, \ldots, Y_0), \quad i \in \mathcal{N}.

Nonlinear filter recursion

p_{k+1} = \frac{A^T B(Y_k)\, p_k}{b^T(Y_k)\, p_k}

with

b(r) = [C_{1r}, \ldots, C_{Nr}]^T, \quad B(r) = \mathrm{diag}(b(r)), \quad r \in \mathcal{O}.

Chain rule for computing the log-likelihood rate function

\frac{1}{n} \log P_{\theta}(Y_0^n) = \frac{1}{n} \sum_{k=0}^{n} \log P_{\theta}(Y_k \mid Y_0^{k-1})

with

P_{\theta}(Y_k \mid Y_0^{k-1}) = b^T(Y_k)\, p_k.
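The filter recursion and the chain rule translate directly into a few lines of NumPy. This is a sketch with my own function and variable names, assuming 0-based observation symbols:

```python
import numpy as np

def filter_loglik(pi, A, C, y):
    """Run the prediction-filter recursion and return (1/n) log P_theta(y_0^n).

    p_{k+1} = A^T B(y_k) p_k / (b(y_k)^T p_k), with b(r) the r-th column of C.
    """
    p = pi.copy()                 # p_0 = pi (prediction filter at time 0)
    loglik = 0.0
    for yk in y:
        b = C[:, yk]              # b(r) = [C_1r, ..., C_Nr]^T
        s = b @ p                 # P(Y_k = y_k | Y_0^{k-1}) = b^T(y_k) p_k
        loglik += np.log(s)
        p = A.T @ (b * p) / s     # A^T B(y_k) p_k / (b^T(y_k) p_k)
    return loglik / len(y)

# Toy usage on a 2-state, 2-symbol HMM (illustrative values only)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
C = np.array([[0.8, 0.2], [0.3, 0.7]])
ln = filter_loglik(pi, A, C, [0, 1, 0])   # (1/n) log P_theta(y_0^n)
```

Each step costs O(N^2), so the whole likelihood evaluation is linear in the trajectory length.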
-
Model Reduction
Optimal representation of the aggregated HMM
Aggregate the state space \mathcal{N} = \{1, \ldots, N\} into \mathcal{M} = \{1, \ldots, M\}

\phi(1) = 1, \quad \phi(2) = \phi(3) = 2, \quad \phi^{-1}(1) = \{1\}, \quad \phi^{-1}(2) = \{2, 3\}.

Optimal representation of the aggregated HMM \bar{\theta} = (\bar{\pi}, \bar{A}, \bar{C})

\bar{\pi}_k(\phi) = P_{\theta}(X_0 \in \phi^{-1}(k)), \quad k \in \mathcal{M}.

\bar{A}_{kl}(\phi) = P_{\theta}(X_{n+1} \in \phi^{-1}(l) \mid X_n \in \phi^{-1}(k)), \quad k, l \in \mathcal{M}.

\bar{C}_{kr}(\phi) = P_{\theta}(Y_n = r \mid X_n \in \phi^{-1}(k)), \quad k \in \mathcal{M},\ r \in \mathcal{O}.
-
Maximum Likelihood Estimation
For any fixed \phi, let \bar{\theta}^*(\phi) denote the optimal aggregated HMM:

R(\theta \,\|\, \bar{\theta}^*(\phi)) = H(\theta, \theta) - H(\theta, \bar{\theta}^*(\phi)).

Optimal partition problem

\phi^* \in \arg\min_{\phi}\, R(\theta \,\|\, \bar{\theta}^*(\phi)) = \arg\max_{\phi}\, H(\theta, \bar{\theta}^*(\phi)).

Maximum Likelihood Estimation problem

\phi_n \in \arg\max_{\phi}\, \frac{1}{n} \log P_{\bar{\theta}^*(\phi)}(y_0^n).

Asymptotic convergence: \phi_n \to \phi^* as n \to \infty.
-
Approaches to the MLE problem \max_{\phi} \frac{1}{n} \log P_{\bar{\theta}^*(\phi)}(y_0^n)

Exact approach: Hypothesis Testing
Need to evaluate |\Phi| = M^N filter recursions on the reduced space \mathcal{M}.

Approximate approach: Recursive Stochastic-Gradient Learning
Randomization and parametrization with \chi \in \mathbb{R}^K and K \ll M^N:

g_k(\chi) := \phi(X_k; \chi) \log P_{\bar{\theta}(\chi)}(Y_k \mid Y_0^{k-1}).

Parameterized MLE problem

\chi_n \in \arg\max_{\chi}\, \frac{1}{n} \sum_{k=0}^{n} g_k(\chi).

Stochastic-gradient algorithm

\chi_{n+1} = \chi_n + \gamma_n\, \nabla_{\chi}\!\left[ \frac{1}{n} \sum_{k=0}^{n} g_k(\chi) \right]_{\chi = \chi_n}.
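The gradient-ascent update can be sketched generically. Here `g` is a hypothetical stand-in for the averaged objective (1/n) Σ g_k(χ), its gradient is estimated by central finite differences, and the step-size rule, tolerance, and all names are my own assumptions, not the paper's algorithm:

```python
import numpy as np

def sg_ascent(g, chi0, n_iter, gamma=0.1, eps=1e-5):
    """Gradient ascent chi <- chi + gamma_n * grad g(chi), with the gradient
    estimated by central finite differences of the scalar objective g."""
    chi = np.asarray(chi0, dtype=float).copy()
    for n in range(1, n_iter + 1):
        gamma_n = gamma / np.sqrt(n)      # diminishing step size (one common choice)
        grad = np.empty_like(chi)
        for i in range(chi.size):
            e = np.zeros_like(chi)
            e[i] = eps
            grad[i] = (g(chi + e) - g(chi - e)) / (2 * eps)
        chi = chi + gamma_n * grad
    return chi

# Toy usage: maximize a concave surrogate g(chi) = -||chi - 1||^2
chi_hat = sg_ascent(lambda c: -np.sum((c - 1.0) ** 2), chi0=[0.0, 0.0], n_iter=200)
```

In the talk's setting, `g` would be the randomized, filter-based log-likelihood surrogate; the toy quadratic only illustrates the update mechanics.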
Simulation
-
Bi-partition of a simple HMM (Hypothesis Testing)
The parameters of the original HMM

A = \begin{bmatrix} 0.500 & 0.200 & 0.225 & 0.075 \\ 0.200 & 0.500 & 0.135 & 0.165 \\ 0.030 & 0.270 & 0.500 & 0.200 \\ 0.150 & 0.165 & 0.185 & 0.500 \end{bmatrix}, \quad C = \begin{bmatrix} 0.15 & 0.85 \\ 0.05 & 0.95 \\ 0.89 & 0.11 \\ 0.88 & 0.12 \end{bmatrix}.

Hypothesis testing for the optimal bi-partition

\phi^* = [1, 1, 2, 2]

Log-likelihood rate functions: l_n = \frac{1}{n} \log P_{\theta}(y_0^n), \quad l_n(\phi) = \frac{1}{n} \log P_{\bar{\theta}^*(\phi)}(y_0^n).
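The hypothesis test on this example can be sketched end to end: simulate the four-state chain, build the optimal aggregated HMM for each candidate bi-partition (per the backup-slide formulas), and compare filter-based log-likelihood rates. A rough sketch with 0-based group labels (the slide uses 1/2); the trajectory length, seed, and function names are my own choices:

```python
import itertools
import numpy as np

A = np.array([[0.500, 0.200, 0.225, 0.075],
              [0.200, 0.500, 0.135, 0.165],
              [0.030, 0.270, 0.500, 0.200],
              [0.150, 0.165, 0.185, 0.500]])
C = np.array([[0.15, 0.85],
              [0.05, 0.95],
              [0.89, 0.11],
              [0.88, 0.12]])

def stationary(A):
    """Invariant distribution: left eigenvector of A for eigenvalue 1."""
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def aggregate(pi, A, C, phi, M=2):
    """Optimal aggregated HMM (pi_bar, A_bar, C_bar) for partition phi: N -> M."""
    groups = [np.flatnonzero(phi == k) for k in range(M)]
    pi_b = np.array([pi[g].sum() for g in groups])
    A_b = np.array([[pi[g] @ A[np.ix_(g, h)].sum(axis=1) / pi[g].sum()
                     for h in groups] for g in groups])
    C_b = np.array([pi[g] @ C[g] / pi[g].sum() for g in groups])
    return pi_b, A_b, C_b

def loglik_rate(pi, A, C, y):
    """(1/n) log P(y_0^n) via the nonlinear filter recursion."""
    p, ll = pi.copy(), 0.0
    for yk in y:
        b = C[:, yk]
        s = b @ p
        ll += np.log(s)
        p = A.T @ (b * p) / s
    return ll / len(y)

rng = np.random.default_rng(0)
pi = stationary(A)
# Simulate y_0^n from the original four-state HMM
n, x = 2000, rng.choice(4, p=pi)
y = np.empty(n, dtype=int)
for k in range(n):
    y[k] = rng.choice(2, p=C[x])
    x = rng.choice(4, p=A[x])
# Hypothesis testing over all non-trivial bi-partitions
best = max((phi for phi in itertools.product([0, 1], repeat=4)
            if 0 < sum(phi) < 4),
           key=lambda phi: loglik_rate(*aggregate(pi, A, C, np.array(phi)), y))
```

With emission rows this well separated, the test should group states {1, 2} against {3, 4}, matching the slide's φ* = [1, 1, 2, 2].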
-
Bi-partition of a simple HMM (Recursive Learning)
Learning through the simulation trajectory \{x_n, y_n\}_{n\ge 0} of the HMM
After n = 1000 iterations
The probabilities of all states being in the first group:

P(\phi(i) = 1; \chi_n),\ i = 1, \ldots, 4: \quad [0.9948,\ 0.9984,\ 0.0034,\ 0.0032].

The optimal partition function is obtained nearly deterministically:

\phi^* = [1, 1, 2, 2].
-
Conclusions and future directions
Conclusions
K-L divergence rate between the laws of the observation processes
Optimal representation for the aggregated HMM
Hypothesis testing and recursive learning algorithm
Future directions
Convergence rate and performance analysis
Applications to bioinformatics
Thank You!
Backup
-
Optimal representation
Optimal representation of the aggregated HMM \bar{\theta} = (\bar{\pi}, \bar{A}, \bar{C})

\bar{\pi}_k(\phi) = \sum_{i \in \phi^{-1}(k)} \pi_i, \quad k \in \mathcal{M}.

\bar{A}_{kl}(\phi) = \frac{\sum_{i \in \phi^{-1}(k)} \pi_i \sum_{j \in \phi^{-1}(l)} A_{ij}}{\sum_{i \in \phi^{-1}(k)} \pi_i}, \quad k, l \in \mathcal{M}.

\bar{C}_{kr}(\phi) = \frac{\sum_{i \in \phi^{-1}(k)} \pi_i C_{ir}}{\sum_{i \in \phi^{-1}(k)} \pi_i}, \quad k \in \mathcal{M},\ r \in \mathcal{O}.
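These three formulas implement directly in NumPy. A sketch with 0-based group labels; the toy matrices and the partition below mirror the earlier example φ(1) = 1, φ(2) = φ(3) = 2, and `pi` here is any probability vector over states (the slides use the invariant distribution):

```python
import numpy as np

def aggregate_hmm(pi, A, C, phi, M):
    """Compute (pi_bar, A_bar, C_bar) from the displayed formulas, phi: N -> M."""
    inv = [np.flatnonzero(phi == k) for k in range(M)]    # phi^{-1}(k)
    pi_b = np.array([pi[I].sum() for I in inv])
    A_b = np.array([[pi[I] @ A[np.ix_(I, J)].sum(axis=1) / pi[I].sum()
                     for J in inv] for I in inv])
    C_b = np.array([pi[I] @ C[I] / pi[I].sum() for I in inv])
    return pi_b, A_b, C_b

# Toy 3-state example for the partition phi(1)=1, phi(2)=phi(3)=2 (0-based below)
pi = np.array([0.2, 0.3, 0.5])
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
C = np.array([[0.7, 0.3],
              [0.4, 0.6],
              [0.1, 0.9]])
phi = np.array([0, 1, 1])
pi_b, A_b, C_b = aggregate_hmm(pi, A, C, phi, M=2)
# (pi_b, A_b, C_b) is again a valid HMM: rows of A_b and C_b sum to 1
```

Because the groups partition the state space, row-stochasticity of A and C carries over to the aggregated matrices automatically.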
[1] Deng, Mehta, and Meyn, Aggregation-based model reduction of a Hidden Markov Model, CDC 2010.
-
Bi-partition randomized policy
Parameter vector \chi = [\chi_1, \ldots, \chi_N]^T.

Probability of group assignments

P(\phi(i) = 1) = \frac{1}{1 + \exp(-K\chi_i)}, \quad P(\phi(i) = 2) = \frac{\exp(-K\chi_i)}{1 + \exp(-K\chi_i)}.

Randomized and parameterized partition

\phi(i; \chi) = \frac{1}{1 + \exp(-K\chi_i)}\, 1\!\mathrm{l}_{\{\phi(i)=1\}} + \frac{\exp(-K\chi_i)}{1 + \exp(-K\chi_i)}\, 1\!\mathrm{l}_{\{\phi(i)=2\}}.
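The randomized policy is a per-state sigmoid. A sketch; the sign convention exp(-K·χ_i) is an assumption recovered from the garbled slide, and the sample values of `chi` and `K` are illustrative only:

```python
import numpy as np

def bipartition_probs(chi, K=1.0):
    """P(phi(i) = 1) and P(phi(i) = 2) under the randomized bi-partition policy.

    Sign convention exp(-K * chi_i) is an assumption; K > 0 sharpens assignments,
    so large positive chi_i pushes state i toward group 1 and large negative
    chi_i toward group 2.
    """
    p1 = 1.0 / (1.0 + np.exp(-K * chi))   # P(phi(i) = 1)
    return p1, 1.0 - p1                   # P(phi(i) = 2) = exp(-K chi_i) / (1 + exp(-K chi_i))

chi = np.array([2.0, 1.5, -1.8, -2.1])   # illustrative learned parameters
p1, p2 = bipartition_probs(chi, K=3.0)
```

As K grows (or as learning drives |χ_i| up), the soft assignments approach the deterministic partition reported on the simulation slide.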