Markov decision processes © Vikram Krishnamurthy 2013
Part 4: Markov Decision Processes
Aim: This part covers discrete-time Markov decision
processes whose state is completely observed. The key
idea is stochastic dynamic programming, which we
apply to solve fully observed Markov decision
processes (MDPs). Later we will tackle Partially
Observed Markov Decision Processes (POMDPs).
Issues such as general state spaces and measurability are
omitted. Instead we focus on structural aspects of
stochastic dynamic programming.
History
• Classical control (Freq. domain): Root locus,
Bode diagrams, stability, PID, 1950s.
• State space theory (Kalman, 1960s).
Modern control (time domain): state variable
feedback, observability, controllability
• Optimal and Stochastic Control 1960s – 1990s
– Dynamic Programming (Bellman)
– LQ and Markov Decision Processes (1960s)
– Partially observed Stochastic Control =
Filtering + control
– Stochastic Adaptive Control (1980s & 1990s)
– Robust stochastic control: H∞ control (1990s)
– Scheduling control of computer networks,
manufacturing systems (1990s).
– Neurodynamic programming (reinforcement
learning) 1990s.
Applications
• Control in Telecom and Sensor Networks:
Admission, Access and Power control – wireless
networks, computer networks.
• Sensor Scheduling and Optimal search.
• Robotic navigation and Intelligent Control.
• Process scheduling and manufacturing
• Aeronautics: Auto-pilots, missile guidance
systems, satellite navigation systems
1 Fully Observed MDP
1. Discrete-time dynamic system: the state {x_k} ∈ X evolves for time k = 0, 1, . . . , N as
   x_{k+1} = A_k(x_k, u_k, w_k),   x_0 ∼ π_0(·)
Observations: y_k = x_k
Control: u_k ∈ U_k(x_k). Process noise: w_k iid.
2. Policy class: consider admissible policies
   π = {µ_0, . . . , µ_{N−1}}
where u_k = µ_k(x_k) ∈ U_k(x_k).
3. Cost function: additive cost function
   J^π(x_0) = E{ c_N(x_N) + Σ_{k=0}^{N−1} c_k(x_k, µ_k(x_k)) }    (1)
c_N denotes the terminal cost.
Aim: Compute the optimal policy
   J*(x_0) = min_{π∈Π} J^π(x_0)
where Π is the set of admissible policies.
J*(x_0) is called the optimal cost (value) function.
Terminology
Finite Horizon: N finite.
Fully observed: yk = xk
Partially observed: yk = Ck(xk, vk) (next part)
Infinite horizon:
1. Average cost:
   J^π(x_0) = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} c(x_k, µ(x_k)) }
2. Discounted cost: ρ ∈ (0, 1)
   J^π(x_0) = E{ Σ_{k=0}^{∞} ρ^k c(x_k, µ(x_k)) }
Remarks: 1. Average cost problems need more
technical conditions and are somewhat harder than
discounted cost problems.
2. Also called stochastic dynamic optimization or
sequential decision making problems.
2 Application Examples
2.1 Finite state Markov Decision
Processes (MDP)
x_k is an S-state Markov chain.
Transition probabilities: P_ij(u) = P(x_{k+1} = j | x_k = i, u_k = u),
i, j ∈ {1, . . . , S}.
Cost function as in (1).
Numerous applications in OR, EE, Gambling theory.
Benchmark Example: Machine (or Sensor)
Replacement
State: xk ∈ {0, 1} – machine state
xk = 0 operational; xk = 1 failed.
Control: uk ∈ {0, 1}.
uk = 0 keep machine; uk = 1 replace by new one
Transition probability matrices: let θ = P(x_{k+1} = 1 | x_k = 0).
   P(0) = [1−θ  θ; 0  1],   P(1) = [1  0; 1  0]
Cost: minimize E{ Σ_{k=0}^{N−1} c(x_k, u_k) }
where c(0, 0) = 0, c(1, 0) = C, c(x, 1) = R.
2.2 Other fully observed problems
1. Linear Quadratic (LQ) control. Fully observed problem:
   x_{k+1} = A_k x_k + B_k u_k + w_k
   J^π(x_0) = E{ x_N′ Q_N x_N + Σ_{k=0}^{N−1} (u_k′ R_k u_k + x_k′ Q_k x_k) }
with Q_k ≥ 0 and R_k > 0.
w_k is zero mean, finite variance white noise (not
necessarily Gaussian).
For detailed analysis and design examples, see Anderson
and Moore.
1. Linear control methods have explicit solutions.
2. May be applied to nonlinear systems operating on a
small-signal basis (linearization).
3. Selection of Q and R involves engineering judgment.
4. The partially observed problem – LQG control – is
widely used.
2. Optimal Stopping Problems (termination control)
Asset Selling problem: You want to sell an asset.
Offers w0, w1, . . . , wN−1 are iid.
If offer k accepted: invest wk at fixed interest rate r.
If you reject an offer, wait till the next offer. Rejected
offers cannot be renewed.
Offer N − 1 must be accepted if all other offers were rejected.
Aim: What is optimal policy for accepting and rejecting?
Formulation: w_k: offer value – real valued (say).
Control space: accept (sell) u1, reject (don't sell) u2.
State space: reals + T (termination state).
   x_{k+1} = A_k(x_k, u_k, w_k), k = 1, . . . , N − 1
           = T    if x_k = T, or x_k ≠ T and u_k = u1
           = w_k  otherwise
Reward: E{ c_N(x_N) + Σ_{k=0}^{N−1} c_k(x_k, u_k, w_k) }
   c_N(x_N) = x_N if x_N ≠ T, 0 otherwise
   c_k(x_k, u_k, w_k) = (1 + r)^{N−k} x_k if x_k ≠ T and u_k = u1, 0 otherwise
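The accept/reject thresholds can be computed by a short backward recursion. A minimal sketch, assuming r = 0 and offers w_k ~ Uniform(0,1) (both modelling choices are illustrative, not from the lecture); in that case the optimal rule is a threshold: with k offers remaining, accept w if w ≥ α_k, where α_1 = 0 (the last offer must be accepted) and α_{k+1} = E[max(w, α_k)] = (1 + α_k²)/2 for Uniform(0,1):

```python
# Asset-selling thresholds, assuming r = 0 and w ~ Uniform(0,1).
# alpha_{k+1} = E[max(w, alpha_k)] = (1 + alpha_k^2)/2 for U(0,1);
# alpha_1 = 0 since the final offer must be accepted.

def selling_thresholds(n_offers):
    """Return [alpha_1, ..., alpha_n]: threshold with k offers left."""
    alphas = [0.0]                      # one offer left: accept anything
    for _ in range(n_offers - 1):
        a = alphas[-1]
        alphas.append((1 + a * a) / 2)  # E[max(w, a)] for w ~ U(0,1)
    return alphas

thresholds = selling_thresholds(4)
# Thresholds rise as more offers remain: you can afford to be picky early.
```

The monotone increase of the thresholds (0, 0.5, 0.625, …) reflects the option value of waiting.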
Rational thief problem: each night k, a thief can choose to retire
with earnings x_k, or rob a house and bring in w_k.
The thief is caught with probability p; if caught, he loses all his
money.
Assume the w_k are iid with mean w̄. Compute the optimal policy
over N nights.
3. Scheduling problems: job scheduling on a single
processor: N jobs to be done in sequential order.
Job i requires a random time T_i; the T_i are iid.
If job i is completed at time t, the reward is ρ^t R_i, 0 < ρ < 1.
Find the schedule that maximizes the total reward.
See Bertsekas or Ross or Puterman for a wealth of
examples. Journals such as IEEE Auto Control, Machine
Learning, Annals of Operations Research.
3 Dynamic Programming (DP)

3.1 Principle of Optimality

(Bellman 1962). An optimal policy has the property
that whatever the initial state and initial decision are,
the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first
selection.
[Diagram: shortest path from node 1 to node 2 passing through intermediate points a, b, c, d.]
If the shortest distance from 1 to 2 is via a, b, c, d, then the
shortest distance from b to 2 is via c, d.
DP is an algorithm which uses the principle of
optimality to determine the optimal policy.
DP is widely used in control, OR, and discrete optimization.
DP yields a functional equation of the form:
   J_k(x) = min_u [ c_k(x, u) + ∫_ℝ J_{k+1}(A_k(x, u, w)) p(w) dw ]
Figure 1: Deterministic shortest path problem (nodes 1–5; the edge costs c_ij are listed below).
3.2 DP for Shortest Path Problem

Compute the shortest distance and path from node 1 to 5.
Let cij = distance from node i to j.
c12 = 3, c13 = 2, c23 = 1, c24 = 2,
c34 = 2, c35 = 7, c45 = 4.
Define:
Ji = min dist from node i to 5.
u∗i = next point in min path from i to 5.
DP yields:
   J_5 = 0
   J_4 = c_45 + J_5 = 4;  u*_4 = 5
   J_3 = min_{u∈{4,5}} [c_3u + J*_u] = min{c_34 + J*_4, c_35 + J*_5} = min{2 + 4, 7 + 0} = 6;  u*_3 = 4
   J_2 = min_{u∈{3,4}} [c_2u + J*_u] = min{c_23 + J*_3, c_24 + J*_4} = min{1 + 6, 2 + 4} = 6;  u*_2 = 4
   J_1 = min{c_12 + J*_2, c_13 + J*_3} = min{3 + 6, 2 + 6} = 8;  u*_1 = 3
Backtracking: the shortest path from 1 to 5 is 1 → 3 → 4 → 5.
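The backward recursion above can be sketched in a few lines of Python (the dictionary-based graph encoding is an illustrative choice, not from the slide):

```python
# Backward DP for the shortest-path example: nodes 1..5 with the
# edge costs c_ij listed on the slide.
cost = {(1, 2): 3, (1, 3): 2, (2, 3): 1, (2, 4): 2,
        (3, 4): 2, (3, 5): 7, (4, 5): 4}

J = {5: 0.0}          # J_i = shortest distance from node i to 5
succ = {}             # succ[i] = next node on an optimal path from i
for i in [4, 3, 2, 1]:                       # work backwards from node 5
    u, Ji = min(((j, c + J[j]) for (a, j), c in cost.items() if a == i),
                key=lambda t: t[1])
    J[i], succ[i] = Ji, u

# Forward pass (backtracking) recovers the optimal path 1 -> 3 -> 4 -> 5.
path, node = [1], 1
while node != 5:
    node = succ[node]
    path.append(node)
```

Running this reproduces J_1 = 8 and the path found by hand above.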
Remarks on DP
1. We have worked backwards from node 5 to node 1.
This is called backward DP.
Forward DP (FDP) yields an identical result.
2. Minimum path problems arise in optimal network
flow design, critical path analysis, and the Viterbi algorithm.
3. It is important to note that the cost (distance)
between nodes is path independent. If it is path
dependent, DP does not yield the optimal solution.
4. Backtracking is a characteristic of DP.
"Life can only be understood going backwards; but it must
be lived going forwards" (Kierkegaard)
5. We will focus on solving stochastic control
problems via DP (SDP).
6. Proof of the principle of optimality is straightforward
(proof by contradiction).
4 Stochastic Dynamic
Programming
4.1 Principle of Optimality
Stochastic control version: consider the fully observed
problem with cost function
   J^π(x_0) = E{ c_N(x_N) + Σ_{k=0}^{N−1} c_k(x_k, µ_k(x_k)) | x_0 }
Let π* = {µ*_0, µ*_1, . . . , µ*_{N−1}} be the optimal policy.
Suppose the state at time i is x_i when using π*.
Consider the subproblem of minimizing
   E{ c_N(x_N) + Σ_{k=i}^{N−1} c_k(x_k, µ_k(x_k)) | x_i }
Then {µ*_i, µ*_{i+1}, . . . , µ*_{N−1}} is optimal for this
subproblem.
4.2 Solution via Stochastic Dynamic
Programming
Recall fully observed system is
xk+1 = Ak(xk, uk, wk), k = 0, 1, . . . , N − 1
Aim: Determine the optimal policy to minimize J^π(x_0):
   J^π(x_0) = E{ c_N(x_N) + Σ_{k=0}^{N−1} c_k(x_k, µ_k(x_k)) | x_0 }
Outline: The solution consists of two stages:
1. Backwards SDP to determine optimal policy –
this is data independent (offline). Creates a
lookup table – for state xk it gives optimal uk.
2. Forward implementation of controller: Given xk
pick optimal uk – table lookup.
k      opt control if x_k = e1    opt control if x_k = e2
1      u*_{1,1}                   u*_{2,1}
2      u*_{1,2}                   u*_{2,2}
…      …                          …
N−1    u*_{1,N−1}                 u*_{2,N−1}
SDP solution
Given
   x_{k+1} = A_k(x_k, u_k, w_k), k = 0, 1, . . . , N − 1
   J^π(x_0) = E{ c_N(x_N) + Σ_{k=0}^{N−1} c_k(x_k, µ_k(x_k)) | x_0 }
For every initial state x_0, the optimal cost is
J*(x_0) = inf_π J^π(x_0) = J_0(x_0).
Here J_0(x_0) is given by the last step of the backward
DP algorithm, which runs for k = N, N − 1, . . . , 0:
   J_N(x_N) = c_N(x_N)   (terminal cost)
   J_k(x) = min_{u∈U_k(x)} [ c_k(x, u) + E{ J_{k+1}(x_{k+1}) | x_k = x } ]
          = min_{u∈U_k(x)} [ c_k(x, u) + ∫_ℝ J_{k+1}(A_k(x, u, w)) p(w) dw ]
   µ*_k(x) = argmin_{u∈U_k(x)} [ c_k(x, u) + ∫_ℝ J_{k+1}(A_k(x, u, w)) p(w) dw ]
The function
   J_k(x) ≝ min_{µ_k,…,µ_{N−1}} E{ Σ_{t=k}^{N} c_t(x_t, u_t) | x_k = x }
is called the value-to-go (cost-to-go) function.
Remarks on SDP
Mathematical rigor: if w_k ∈ D_k where D_k is
countable, then the above results are rigorous:
   J_k(x) = min_{u∈U_k(x)} [ c_k(x, u) + Σ_{w∈D_k} J_{k+1}(A_k(x, u, w)) p(w) ]
Otherwise, the results are "informal" in that we have
not specified the function spaces to which u, x and w
belong. For the general case, measurable selection
theorems are required:
(i) need c_k(x, u) to be a measurable function;
(ii) for min to replace inf, need compactness and
lower semi-continuity.
General conditions (Hernandez-Lerma & Lasserre):
(a) U_k(x) compact, c_k(x, ·) lower semi-continuous on
U_k(x) for all x ∈ X.
(b) ∫_ℝ J_{k+1}(A_k(x, u, w)) p(w) dw is lower semi-continuous on
U_k(x) for every x ∈ X and every continuous bounded
function J_{k+1} on X. (Does not work for LQ!)
(a) can be replaced by inf-compactness of c_k(x, u):
for x ∈ X, r ∈ ℝ, the set {u ∈ U_k(x) | c_k(x, u) ≤ r} is compact.
5 Finite State Markov Decision
Processes (MDP)
Markov chain x_k ∈ {1, 2, . . . , S} with
   P(x_{k+1} = j | x_k = i, u_k = u) = P_ij(u),  i, j ∈ {1, . . . , S}.
Aim: Minimize
   J^π(x_0) = E{ c_N(x_N) + Σ_{k=0}^{N−1} c_k(x_k, µ_k(x_k)) }
DP yields: for i = 1, 2, . . . , S and k = N − 1, N − 2, . . . , 1, 0:
   J_N(i) = c_N(i)
   J_k(i) = min_{u_k} [ c_k(i, u_k) + E{ J_{k+1}(x_{k+1}) | x_k = i } ]
          = min_{u_k} [ c_k(i, u_k) + Σ_{j=1}^{S} J_{k+1}(j) P(x_{k+1} = j | x_k = i, u_k) ]
          = min_{u_k} [ c_k(i, u_k) + Σ_{j=1}^{S} J_{k+1}(j) P_ij(u_k) ]
In matrix–vector notation:
   J_k = min_u [ c(u) + P(u) J_{k+1} ]
Lookup Table
Dynamic programming creates a lookup table:
   J_k(i) = min_{u_k} [ c_k(i, u_k) + Σ_{j=1}^{S} J_{k+1}(j) P_ij(u_k) ]
   u*_{i,k} = µ*_k(x_k = i) ≝ argmin_{u_k} [ c_k(i, u_k) + Σ_{j=1}^{S} J_{k+1}(j) P_ij(u_k) ]
Thus we have a lookup table:
k      opt control if x_k = 1    opt control if x_k = 2
1      u*_{1,1}                  u*_{2,1}
2      u*_{1,2}                  u*_{2,2}
3      u*_{1,3}                  u*_{2,3}
…      …                         …
N−2    u*_{1,N−2}                u*_{2,N−2}
N−1    u*_{1,N−1}                u*_{2,N−1}
Remarks: (i) Requires O(N S) memory.
(ii) If N = ∞ (infinite horizon) and
c_k(x_k, u_k) = ρ^k c(x_k, u_k), then the entries u*_{i,k} in the
lookup table converge to values independent of k: a
steady-state (stationary) policy, requiring only O(S) memory.
Controller Implementation:
1. Set k = 0. Initial condition: x_0 = i.
2. Select the optimal control u*_{x_k,k} = µ*_k(x_k) from the DP
lookup table. Instantaneous cost = c_k(x_k, u*_{x_k,k}).
3. The Markov chain evolves randomly according to
P(u*_{x_k,k}), generating the new state x_{k+1}.
4. If k = N − 1, stop.
Else, set k = k + 1 and go to step 2.
[Block diagram: the lookup table maps the state x_k to the control u*_{x_k,k}, which drives the Markov chain with transition probabilities P(u*_{x_k,k}).]
Design Example: Machine
Replacement
Recall that x_k ∈ {0, 1} and u_k ∈ {0, 1}.
   P(0) = [1−θ  θ; 0  1],   c(0) = [0, c]′
   P(1) = [1  0; 1  0],     c(1) = [R, R]′
Cost: minimize E{ Σ_{k=0}^{N−1} c(x_k, u_k) }
DP Solution for the Control Policy:
   [J_N(1), J_N(2)]′ = [0, 0]′
For k = N − 1, N − 2, . . . , 1, 0:
   [J_k(1), J_k(2)]′ = min_{u∈{0,1}} [ c(u) + P(u) J_{k+1} ]
   = min{ [(1−θ) J_{k+1}(1) + θ J_{k+1}(2),  c + J_{k+1}(2)]′ ,  [R + J_{k+1}(1),  R + J_{k+1}(1)]′ }
where the min is taken elementwise (per state) over the two actions.
Design Example: Continued
e.g. if N = 4, θ = 0.1, c = 4, R = 3:

k         0       1       2       3       4
J_k(1)    0.8430  0.5700  0.3000  0       0
J_k(2)    3.5700  3.3000  3.0000  3.0000  0
u*_{1,k}  0       0       0       0
u*_{2,k}  1       1       1       1

e.g. if N = 4, θ = 0.1, c = 4, R = 6:

k         0       1       2       3       4
J_k(1)    1.4130  0.8700  0.3000  0       0
J_k(2)    6.8700  6.3000  6.0000  3.0000  0
u*_{1,k}  0       0       0       0
u*_{2,k}  1       1       1       0
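The first table can be reproduced by iterating the vector recursion J_k = min_u [c(u) + P(u) J_{k+1}] backwards from J_4 = 0; a minimal sketch:

```python
# Machine replacement, N = 4, theta = 0.1, c = 4, R = 3.
# J = [J_k(1), J_k(2)]: cost-to-go in the operational / failed state.
theta, c, R, N = 0.1, 4.0, 3.0, 4

J = [0.0, 0.0]                        # terminal: J_4 = 0
table = {N: J}
for k in range(N - 1, -1, -1):
    keep = [(1 - theta) * J[0] + theta * J[1],  # u = 0, operational
            c + J[1]]                           # u = 0, failed
    replace = [R + J[0], R + J[0]]              # u = 1: restart as new
    J = [min(a, b) for a, b in zip(keep, replace)]
    table[k] = J
# table[0] is approximately [0.843, 3.57], matching the k = 0 column.
```

The same loop with R = 6 lets one explore how a dearer replacement shifts the policy, as in the second table.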
[Figure: sample path over time k = 1, . . . , 100 for the machine repair example. Top panel: machine state (0 = operational, 1 = failed). Bottom panel: control input (0 = leave as is, 1 = repair). It is not profitable to repair after time 90.]
Remarks:
1. Note DP minimizes the expected cost. The actual cost of a
sample path is a random variable which may be lower
than the expected value.
2. In the machine repair example, there is a time threshold
after which it is no longer profitable to repair the
machine.
6 Perspective
1. DP is a widely used optimization algorithm in
stochastic control, combinatorial optimization, and
operations research.
We have looked at backward DP; forward DP is
similar and yields an identical result,
e.g. the Viterbi algorithm is a shortest path algorithm.
2. DP for fully observed stochastic control.
LQ and MDP have explicit solutions.
Most other problems do not.
3. We considered additive cost functions.
Risk sensitive control considers exponential cost
functions; see Elliott et al.
4. Why feedback control is essential (next slide)
Things not covered:
1. Engineering LQ control – Anderson & Moore
2. Detailed mathematics – Bertsekas & Shreve
3. Numerical approximations for solving DP.
4. Properties of value-to-go function – Ross
7 Why Feedback Control is
essential
1. Open loop systems are a special case of closed loop
systems, for both deterministic and stochastic systems.
2. In deterministic systems, for every closed loop
system there is an equivalent open loop system
(example: the linear case).
3. For stochastic systems, closed loop (feedback) and
open loop systems are not equivalent.
We show that for a stochastic system:
(i) no open loop system has the same properties as a
feedback system;
(ii) feedback always achieves a better cost than open
loop in optimal control.
Feedback is essential in
stochastic systems
Consider the stochastic system x_{k+1} = x_k + u_k + w_k,
where x_0 and w_k are white noise with variance σ².
Closed loop: suppose the feedback is u_k = −x_k.
Then x_{k+1} = w_k.
Open loop: x_{k+1} = x_k + u_k + w_k
         = x_0 + Σ_{m=0}^{k} u_m + Σ_{m=0}^{k} w_m
So the mean is E{x_k} = Σ_{m=0}^{k−1} u_m and the
variance is E{x_0²} + Σ_{m=0}^{k−1} E{w_m²} = (k + 1)σ².
No open loop system can produce x_k = w_{k−1}.
Cost: suppose J = E{ Σ_{k=0}^{N} x_k² }.
Closed loop: J = E{ x_0² + Σ_{k=1}^{N} w_{k−1}² } = (N + 1)σ²
Open loop: J = Σ_{k=0}^{N} E{ (x_0 + Σ_{m=0}^{k−1} u_m + Σ_{m=0}^{k−1} w_m)² }
           ≥ Σ_{k=0}^{N} (k + 1)σ² = ½(N + 1)(N + 2)σ²
Feedback is superior to any open loop control
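The two costs can be compared directly; a small sketch of the arithmetic above (σ² = 1 and N = 10 are arbitrary illustrative values):

```python
# Comparing J = E{ sum_{k=0}^N x_k^2 } for x_{k+1} = x_k + u_k + w_k,
# with x_0, w_k white noise of variance sigma2.
sigma2, N = 1.0, 10

# Closed loop u_k = -x_k: x_0 has variance sigma2, x_k = w_{k-1} after.
J_closed = (N + 1) * sigma2

# Open loop with the best (zero-mean) deterministic inputs:
# Var(x_k) = (k + 1) * sigma2 accumulates, so
J_open = sum((k + 1) * sigma2 for k in range(N + 1))

# J_open = (N+1)(N+2)/2 * sigma2 grows quadratically in the horizon,
# while J_closed = (N+1) * sigma2 grows only linearly.
```

Even for this short horizon the open loop cost (66) is six times the closed loop cost (11).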
8 Infinite horizon results
   x_{k+1} = A(x_k, u_k, w_k), k = 0, 1, . . .
   J^π(x_0) = lim_{N→∞} E{ Σ_{k=0}^{N−1} ρ^k c(x_k, µ_k(x_k)) }
0 < ρ < 1 is the discount factor.
Admissible policies: π = {µ_0, µ_1, . . .} where
u_k = µ_k(x_k). The optimal cost is
   J*(x) = min_{π∈Π} J^π(x)
Define the class of stationary policies: π = {µ, µ, . . .}.
To simplify notation, call J^π as J_µ.
We require J^π(x_0) to be finite. Examples include:
(i) Stochastic shortest path: ρ = 1, cost-free termination
state, termination is inevitable.
(ii) Discounted problems with bounded cost |c(x, u)| ≤ M.
We will only consider discounted problems. Then
   J^π(x_0) = E{ Σ_{k=0}^{∞} ρ^k c(x_k, µ_k(x_k)) }
8.1 DP for finite horizon version
Consider minimizing the cost
   E{ ρ^N J(x_N) + Σ_{k=0}^{N−1} ρ^k c(x_k, u_k) }
The DP recursion yields, for k = 0, . . . , N − 1:
   J_k(x) = min_u E{ ρ^k c(x, u) + J_{k+1}(A(x, u, w)) }
initialized by J_N(x) = ρ^N J(x). Define
   V_k(x) = J_{N−k}(x) / ρ^{N−k}
Then the DP can be written for k = 0, 1, . . . , N − 1 as
   V_{k+1}(x) = min_u E{ c(x, u) + ρ V_k(A(x, u, w)) }
8.2 Main Result
lim_{k→∞} V_k(x) = V*(x), where V* is the optimal value
function for the infinite horizon problem.
Bellman's equation holds:
   V*(x) = min_u E{ c(x, u) + ρ V*(A(x, u, w)) }
Define the operator
   (T V)(x) ≝ min_u E{ c(x, u) + ρ V(A(x, u, w)) }
for any V(x). Then Bellman's equation is:
   T V* = V*
T is monotonic: suppose V and V′ are such that V(x) ≤ V′(x)
for all x. Then
   (T V)(x) ≤ (T V′)(x) ∀x
Define also, for any stationary policy µ,
   (T_µ V)(x) = E{ c(x, µ(x)) + ρ V(A(x, µ(x), w)) }
Result (Bertsekas Vol. 2, p. 12): for every stationary
policy µ, the associated cost satisfies
   V_µ = T_µ V_µ
This result means: for any stationary policy µ, the policy
cost V_µ can be computed by solving V_µ = T_µ V_µ. In the finite
state MDP case, V_µ can be computed exactly since
T_µ V_µ = c(µ) + ρ P(µ) V_µ. Thus
   V_µ = c(µ) + ρ P(µ) V_µ  ⟹  [I − ρ P(µ)] V_µ = c(µ)
Note: Bellman's equation is a functional equation. It can
rarely be solved explicitly.
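For a finite state MDP the linear system [I − ρ P(µ)] V_µ = c(µ) can be solved directly; a minimal sketch for S = 2 (the transition/cost numbers and the Cramer's-rule solve are illustrative choices, not from the lecture):

```python
# Policy evaluation: solve [I - rho P(mu)] V_mu = c(mu) for a 2-state MDP.
rho = 0.9
P_mu = [[0.9, 0.1],     # transition matrix under stationary policy mu
        [0.0, 1.0]]
c_mu = [0.0, 4.0]       # stage cost under mu

# A = I - rho * P(mu); solve A V = c by Cramer's rule (2x2 case).
A = [[1 - rho * P_mu[0][0], -rho * P_mu[0][1]],
     [-rho * P_mu[1][0], 1 - rho * P_mu[1][1]]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
V_mu = [(A[1][1] * c_mu[0] - A[0][1] * c_mu[1]) / det,
        (A[0][0] * c_mu[1] - A[1][0] * c_mu[0]) / det]

# Sanity check: V_mu is a fixed point, V_mu(i) = c(i) + rho * P_i V_mu.
```

Here state 2 is absorbing with stage cost 4, so V_µ(2) = 4/(1 − ρ) = 40, which the solve recovers.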
8.3 Infinite Horizon MDPs: Numerical Methods
Bellman's equation (V* = T V*) for a finite state MDP is
   V*(i) = min_u [ c(i, u) + ρ Σ_{j=1}^{S} P_ij(u) V*(j) ]
In vector notation,
   V* = min_u [ c(u) + ρ P(u) V* ]
where V* and c(u) are S-dimensional vectors.
Recall the optimal policy µ* for the MDP allocates u*_k = µ*(x_k),
i.e. we need to construct the one-row lookup table
k        opt control if x_k = 1    opt control if x_k = 2
any k    u*_1                      u*_2
1. Linear programming: since lim_{N→∞} T^N V = V* for all
V, we have, using monotonicity of T:
   V ≤ T V  ⟹  V ≤ V* = T V*
Thus V* is the largest V that satisfies V ≤ T V:
   max λ′1  s.t.  λ ≤ c(u) + ρ P(u) λ  for all u
an LP with S|U| constraints. In queuing problems (that
satisfy "conservation laws") these constraints form a polymatroid.
2. Value iteration: successive approximation method for solving
V* = T V*, i.e. a finite horizon approximation.
Initialize V_0; then iterate
   V_{k+1} = T V_k, i.e. V_{k+1} = min_u [ c(u) + ρ P(u) V_k ]
A contraction mapping argument proves convergence.
One can show ‖V_k − V*‖_∞ ≤ (2ρ / (1 − ρ)) ‖V_k − V_{k−1}‖_∞.
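Value iteration can be sketched in a few lines; the 2-state transition and cost numbers below are illustrative (a discounted variant of the machine-replacement-style model), not from the lecture:

```python
# Value iteration V_{k+1} = T V_k for a 2-state discounted MDP,
# stopped when the sup-norm change is small.
rho = 0.9
P = {0: [[0.9, 0.1], [0.0, 1.0]],    # action 0: keep
     1: [[1.0, 0.0], [1.0, 0.0]]}    # action 1: replace
c = {0: [0.0, 4.0], 1: [3.0, 3.0]}

V = [0.0, 0.0]
for _ in range(1000):
    V_new = [min(c[u][i] + rho * sum(P[u][i][j] * V[j] for j in range(2))
                 for u in P)
             for i in range(2)]
    converged = max(abs(a - b) for a, b in zip(V_new, V)) < 1e-10
    V = V_new
    if converged:
        break
# V now (approximately) satisfies Bellman's equation V = T V.
```

The iterates converge geometrically at rate ρ, consistent with the error bound above.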
3. Policy iteration: for any stationary policy µ, recall
   T_µ V = c(µ) + ρ P(µ) V
for any V. The cost function corresponding to µ, i.e.
V_µ, satisfies
   T_µ V_µ = V_µ
This means that for any stationary policy µ we can solve
for V_µ:
   V_µ = c(µ) + ρ P(µ) V_µ  ⟹  (I − ρ P(µ)) V_µ = c(µ)
Policy iteration algorithm: initialize µ_0 arbitrarily.
Iterations: (i) Policy evaluation: V_{µ_k} is the solution of the linear
equation
   [I − ρ P(µ_k)] V_{µ_k} = c(µ_k)
(ii) Policy improvement: µ_{k+1} = argmin_u [ c(u) + ρ P(u) V_{µ_k} ]
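The evaluate/improve loop can be sketched for a 2-state model (the numbers are illustrative, in the spirit of the machine replacement example; the 2x2 evaluation uses Cramer's rule for simplicity):

```python
# Policy iteration: alternate exact policy evaluation and improvement.
rho = 0.9
P = {0: [[0.9, 0.1], [0.0, 1.0]], 1: [[1.0, 0.0], [1.0, 0.0]]}
c = {0: [0.0, 4.0], 1: [3.0, 3.0]}

def evaluate(mu):
    """Solve V = c(mu) + rho P(mu) V exactly for S = 2."""
    A = [[(1.0 if i == j else 0.0) - rho * P[mu[i]][i][j]
          for j in range(2)] for i in range(2)]
    b = [c[mu[0]][0], c[mu[1]][1]]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

mu = [0, 0]                                  # arbitrary initial policy
while True:
    V = evaluate(mu)                         # (i) policy evaluation
    mu_new = [min(range(2), key=lambda u: c[u][i] +
                  rho * sum(P[u][i][j] * V[j] for j in range(2)))
              for i in range(2)]             # (ii) policy improvement
    if mu_new == mu:                         # policy stable => optimal
        break
    mu = mu_new
```

For this model the loop stabilizes after two improvements at µ* = (keep when operational, replace when failed).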
Structural Results
When is the optimal policy monotone in the state?
Two concepts: submodularity and stochastic orders.
Submodular: φ(x, u) is submodular in (x, u) if
   φ(x, u + 1) − φ(x, u) ≥ φ(x + 1, u + 1) − φ(x + 1, u)
Examples: the following are submodular in (x, u):
(i) φ(x, u) = −xu;
(ii) φ(x) or φ(u) is trivially submodular;
(iii) max(x, u);
(iv) the sum of submodular functions is submodular.
Theorem [Topkis]: Consider φ : X × U → ℝ. If φ(x, u) is
submodular, then u*(x) = argmin_u φ(x, u) ↑ x.
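Both the submodularity inequality and the monotone-argmin conclusion of Topkis' theorem can be checked numerically; a small sketch for φ(x, u) = −xu on a 4 × 4 grid (the grid size is arbitrary):

```python
# Verify submodularity of phi(x, u) = -x*u and the resulting
# monotonicity of u*(x) = argmin_u phi(x, u).
X, U = range(4), range(4)

def phi(x, u):
    return -x * u

# Submodularity: phi(x, u+1) - phi(x, u) >= phi(x+1, u+1) - phi(x+1, u).
submodular = all(
    phi(x, u + 1) - phi(x, u) >= phi(x + 1, u + 1) - phi(x + 1, u)
    for x in X[:-1] for u in U[:-1])

# Topkis: the minimizer is then increasing in x (ties broken low).
u_star = [min(U, key=lambda u: phi(x, u)) for x in X]
```

Here the differences are −x versus −(x+1), so the inequality holds, and u*(x) is nondecreasing in x as the theorem predicts.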
First order stochastic dominance: π_1 first order
stochastically dominates π_2, denoted π_1 ≥_s π_2 (or π_2 ≤_s π_1), if
   Σ_{i=j}^{X} π_1(i) ≥ Σ_{i=j}^{X} π_2(i)  for j = 1, . . . , X.
Example: π_1 = [0.3, 0.2, 0.5]′ and π_2 = [0.2, 0.4, 0.4]′ are
not orderable.
Theorem: let V denote the set of all X-dimensional
vectors v with nondecreasing components, i.e.
v_1 ≤ v_2 ≤ · · · ≤ v_X. Then π_1 ≥_s π_2 iff v′π_1 ≥ v′π_2 for all v ∈ V.
Monotone Policies
(A1) c(x, u, k) ↓ x.
(A2) P_x(u) ≤_s P_{x+1}(u).
(A3) c(x, u, k) is submodular in (x, u), that is:
   c(x, u + 1, k) − c(x, u, k) ↓ x
(A4) P(u) is tail supermodular:
   Σ_{j≥l} (P_{xj}(u + 1) − P_{xj}(u)) is increasing in x.
Theorem: assume a finite horizon Markov decision
process satisfies conditions (A1), (A2), (A3) and (A4).
Then µ*_k(x) ↑ x.
The same proof applies for infinite horizon discounted cost
and average cost.
Proof sketch: define
   Q_k(i, u) ≝ c(i, u, k) + J′_{k+1} P_i(u)
   J_k(i) = min_{u∈U} Q_k(i, u),   µ*_k(i) = argmin_{u∈U} Q_k(i, u)
where J_{k+1} = [J_{k+1}(1), . . . , J_{k+1}(X)]′.
Step 1. Assuming (A1) and (A2), Q_k(i, u) ↓ i. Therefore
J_k(i) ↓ i.
Step 2. Assuming (A3) and (A4), Q_k(i, u) is submodular.
Therefore µ*_k(i) = argmin_{u∈U} Q_k(i, u) ↑ i.
Step 1: use mathematical induction.
Q_N(i, u) = c(i, N) ↓ i by (A1). Suppose Q_{k+1}(j, u) ↓ j; then
J_{k+1}(j) = min_u Q_{k+1}(j, u) ↓ j. Next, P_i(u) ≤_s P_{i+1}(u) by
(A2), so J′_{k+1} P_i(u) ≥ J′_{k+1} P_{i+1}(u).
Finally, since c(i, u, k) ↓ i by (A1),
   c(i, u, k) + J′_{k+1} P_i(u) ≥ c(i + 1, u, k) + J′_{k+1} P_{i+1}(u).
Step 2: consider Q_k(i, u) = c(i, u, k) + J′_{k+1} P_i(u). By
(A3), c(i, u, k) is submodular. Applying (A4), since the
elements of J_{k+1} are decreasing, J′_{k+1} P_i(u) is submodular.
How Does the Optimal Cost Depend on the Transition Matrix?
Consider two MDPs with identical costs but different
transition matrices P and P̄.
(A1) c(x, u, k) ↓ x.
(A2) P_x(u) ≤_s P_{x+1}(u).
(A5) P_x(u) ≥_s P̄_x(u) ∀x.
Theorem [Müller 1997]: the optimal cost incurred by
policy µ*(x; P) is smaller than that incurred by µ*(x; P̄).
Proof: define
   Q_k(i, u) = c(i, u, k) + J′_{k+1} P_i(u)
   Q̄_k(i, u) = c(i, u, k) + J̄′_{k+1} P̄_i(u)
The proof is by induction. Clearly
J_N(i) = J̄_N(i) = c(i, N) for all i ∈ X.
Suppose J_{k+1}(i) ≤ J̄_{k+1}(i) for all i ∈ X. Therefore
J′_{k+1} P_i(u) ≤ J̄′_{k+1} P_i(u). By (A1), (A2), J̄_{k+1}(i) is
decreasing in i. By (A5), P_i ≥_s P̄_i. Therefore
J̄′_{k+1} P_i ≤ J̄′_{k+1} P̄_i. So
c(i, u, k) + J′_{k+1} P_i(u) ≤ c(i, u, k) + J̄′_{k+1} P̄_i(u), or
equivalently, Q_k(i, u) ≤ Q̄_k(i, u).
Neuro-Dynamic Programming Methods
The next two methods are simulation based. That is,
although the parameters are unknown, the system can be
simulated or observed under any choice of actions.
They form the core of reinforcement learning, or
neuro-dynamic programming. The key idea in them is the
Robbins–Monro stochastic approximation algorithm.
Result: Robbins–Monro algorithm.
Aim: solve the algebraic equation X = E{H(X)}, where
H is a noisy function; that is, we can measure samples
Y_n = H(X_n).
Algorithm: X_{n+1} = X_n + γ_n (Y_n − X_n)
The key idea behind stochastic approximation is to replace E{H(X)}
by the sample Y_n = H(X_n).
Remarks: the implicit assumption is that E{H(X)}
cannot be computed in closed form – this is true when the
density function is unknown. The step size is typically γ_n = 1/n.
Stochastic approximations are widely used in adaptive
signal processing – e.g. adaptive filtering algorithms such
as the LMS and RLS algorithms. The recursive EM algorithm
covered earlier is another example.
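The Robbins–Monro recursion is a one-liner; a minimal sketch where H(X) = 3 + noise (an illustrative choice, so the root is X = 3 and with γ_n = 1/n the iterates reduce to a running average of the samples):

```python
# Robbins-Monro: solve X = E{H(X)} from noisy samples Y_n = H(X_n),
# using X_{n+1} = X_n + gamma_n (Y_n - X_n) with gamma_n = 1/n.
import random

random.seed(0)
X = 0.0
for n in range(1, 10001):
    Y = 3.0 + random.gauss(0.0, 1.0)     # noisy sample of H(X_n)
    X = X + (Y - X) / n                  # stochastic approximation step
# X is now close to the root E{H(X)} = 3.
```

With a state-dependent H the same recursion applies unchanged; that is exactly how Q-learning below is obtained.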
4. Q-learning: simulation based. Define the Q-factor
   Q(i, u) = c(i, u) + ρ Σ_{j=1}^{S} P_ij(u) V*(j)
From Bellman's equation this yields
   Q(i, u) = c(i, u) + ρ Σ_{j=1}^{S} P_ij(u) min_{u′} Q(j, u′)
The trick above expresses Q as E{min(·)}:
   Q(i, u) = c(i, u) + ρ E{ min_{u′} Q(x_{k+1}, u′) | x_k = i, u_k = u }
Hence it can be solved via the Robbins–Monro algorithm:
   Q_{k+1}(i, u) = Q_k(i, u) + γ ( c(i, u) + ρ min_{u′} Q_k(j, u′) − Q_k(i, u) )
Note: j is generated from (i, u) via simulation, j ∼ P_ij(u).
Remarks: (i) The above recursion does not require
knowledge of P(u).
(ii) Q-learning is merely a stochastic approximation algorithm!
(iii) NDP is widely used in artificial intelligence, where it
is called reinforcement learning.
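The recursion can be sketched on a small 2-state discounted MDP (the model numbers are illustrative, and a small constant step size replaces γ_n = 1/n for simplicity); note the update only ever touches simulated transitions j ∼ P_ij(u), never P itself:

```python
# Q-learning on a 2-state, 2-action discounted MDP.
import random

random.seed(1)
rho = 0.9
P = {0: [[0.9, 0.1], [0.0, 1.0]], 1: [[1.0, 0.0], [1.0, 0.0]]}
c = {0: [0.0, 4.0], 1: [3.0, 3.0]}

Q = {(i, u): 0.0 for i in range(2) for u in range(2)}
gamma = 0.01                                          # constant step size
for _ in range(100_000):
    i, u = random.randrange(2), random.randrange(2)   # explore all pairs
    j = random.choices(range(2), weights=P[u][i])[0]  # simulate j ~ P_ij(u)
    target = c[u][i] + rho * min(Q[(j, 0)], Q[(j, 1)])
    Q[(i, u)] += gamma * (target - Q[(i, u)])         # Robbins-Monro step

greedy = [min(range(2), key=lambda u: Q[(i, u)]) for i in range(2)]
```

The greedy policy extracted from Q matches the one found by policy iteration on the same model, without P(u) ever being used by the learner.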
5. Temporal difference methods: These can be used to
compute by simulation the cost of a policy (details
omitted).
Summary and Extensions
Stochastic dynamic programming (SDP) involves solving
a functional equation. This yields a (possibly infinite
dimensional) lookup table.
Two types of problems were considered: (i) finite
horizon; (ii) infinite horizon – steady state controller.
For infinite horizon finite state MDPs there are several
numerical algorithms, e.g. policy iteration, value
iteration, linear programming, and neuro-dynamic
programming.
We have not covered continuous-time finite state MDPs. These
arise in the control of queuing systems – e.g. telecoms. By a
process called "uniformization", a continuous-time MDP can be
converted to an equivalent discrete-time MDP.
A generalization of continuous-time MDPs are semi-Markov
decision processes. These are widely studied in discrete
event systems.
Finally, MDPs with constraints can also be considered.
Often the optimal policy is then "randomized".