Dynamic Programming and Stochastic Control
Dr. Alex Leong
Department of Electrical Engineering (EIM-E), Paderborn University, Germany
Dr. Alex Leong ([email protected]) DP and Stochastic Control Paderborn University 1 / 158
Outline
1 Introduction
Introduction
What is dynamic programming (DP)?
Method for solving multi-stage decision problems (sequential decision making).
There is often some randomness to what happens in future.
Optimize set of decisions to achieve a good overall outcome.
Richard Bellman popularized DP in the 1950s
Examples
1) Inventory control
A store sells a product, e.g. ice cream.
Order supplies once a week.
Sales during the week are “random”.
How much supply should the store get to maximize expected profit over summer?
  - Order too little: can't meet demand.
  - Order too much: storage/refrigeration cost.
Examples
2) Parts replacement e.g. bus engine.
At the start of each month, decide whether the engine on a bus should be replaced, to maximize expected profit.
If replace, profit = earnings - replacement cost - maintenance.
If don’t replace, profit = earnings - maintenance.
Earnings will decrease if engine breaks down.
P(Breakdown) is age dependent.
Examples
3) Formula 1 engines, replace or not?
20 races, 4 engines (in 2017)
Decide whether to replace the engine at the start of each race, to maximize the chance of winning the championship.
Examples

4) Queueing (see Figure 1)
Packets arrive at queues 1 and 2.
If both queues transmit at same time, have collision.
If collision, retransmit at next time with a certain probability.
Choose retransmission probabilities to maximize throughput.
Figure 1: Queueing
Examples
5) LQR (Linear Quadratic Regulator)

Linear system: x_{k+1} = A x_k + B u_k (deterministic problem)
Assume knowledge of xk at time k (Perfect state info)
Choose sequence of u_k to

\min_{u_0, u_1, \ldots, u_{N-1}} \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N

N = number of stages = horizon. N finite → finite horizon.
Examples
6) x_{k+1} = A x_k + B u_k + w_k

w_k = random noise.
Assume x_k known (perfect state info).
Choose sequence of u_k to

\min_{u_0, u_1, \ldots, u_{N-1}} E\left[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right]
Examples
7) LQG (Linear Quadratic Gaussian) Control
x_{k+1} = A x_k + B u_k + w_k
y_k = C x_k + v_k

v_k, w_k Gaussian noise.
Case of imperfect state info.
Based on measurements y_k, choose u_k to

\min_{u_0, u_1, \ldots, u_{N-1}} E\left[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right]
Examples
8) Infinite horizon
\min_{u_0, u_1, \ldots} \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right]
Note: Here we divide by N, otherwise summation often blows up.
Examples

9) Shortest paths (see Figure 2)
Find shortest path from A to stage D (Deterministic Problem).
Can solve using the Viterbi algorithm (1967)
Can be regarded as a special case of (forward) DP.
Applications:
  - decoding of convolutional codes (communications)
  - channel equalization (communications)
  - estimation of hidden Markov models (signal processing)
Figure 2: Shortest paths problem
Outline
2 The Dynamic Programming Principle and Dynamic Programming Algorithm
Basic Structure of Dynamic Programming Problem
Dynamic Programming Principle of Optimality
Dynamic Programming Algorithm
Shortest Path Problems
Basic structure of stochastic DP problem
Two ingredients: a discrete-time system and a cost function.
1. Discrete time system
x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, \ldots, N-1 (or k = 1, 2, \ldots, N)
k is the time index.
x_k is the state at time k; it summarizes past information that is relevant for future optimization.
u_k is the control/decision/action at time k; it lies in a set U_k(x_k) which may depend on k and x_k.
w_k is a random disturbance (noise), with a probability distribution P(· | k, x_k, u_k) which may depend on k, x_k, u_k.
Basic structure of stochastic DP problem
x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, \ldots, N-1
N is horizon, or number of times control is applied.
f_k is the function that describes how the system evolves over time.
Examples:
  - f_k = A x_k + B u_k + w_k (linear system)
  - f_k = x_k u_k + w_k (non-linear)
  - f_k = cos x_k + w_k sin u_k (non-linear)
Basic structure of stochastic DP problem
2. Cost function, which is additive over time:

E\left[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) \right]
Expectation is used because of the random w_k.
g_k is the function that represents the cost at time k.
Examples:
  - g_k = x_k + u_k
  - g_k = x_k^2 + C u_k^2, where C is a constant.
g_N(x_N) is the terminal cost.
Basic structure of stochastic DP problem
Objective: Minimize the cost function over the controls
u_0 = \mu_0(x_0), u_1 = \mu_1(x_1), \ldots, u_{N-1} = \mu_{N-1}(x_{N-1})

Choice of u_k depends on x_k.
Optimization is over policies: rules/functions \mu_k for generating u_k for every possible value of x_k.
Expected cost of policy \pi = (\mu_0, \mu_1, \ldots, \mu_{N-1}) starting at x_0 is

J_\pi(x_0) = E\left[ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right]

Optimal policy: \pi^* = \arg\min_\pi J_\pi(x_0)
Optimal cost starting at x_0: J^*(x_0) = \min_\pi J_\pi(x_0)
Examples
1) Inventory example
x_k = amount of stock at time k.
u_k = stock ordered at time k.
w_k = demand at time k, with some probability distribution, e.g. uniform.
System: x_{k+1} = x_k + u_k - w_k (= f_k(x_k, u_k, w_k))
x_k can be negative with this model.
Alternative model: x_{k+1} = max(0, x_k + u_k - w_k).
Cost function at time k: g_k(x_k, u_k, w_k) = r(x_k) + C u_k
r(x_k) is the penalty for holding excess stock.
C is the cost per item.
Examples
1) Inventory example (cont.)
Terminal cost: R(x_N) is the penalty for having excess stock at the end.
Cost function: E\left[ \sum_{k=0}^{N-1} (r(x_k) + C u_k) + R(x_N) \right]
Amount u_k to order can depend on the inventory level x_k.
Can have constraints on u_k, e.g. x_k + u_k ≤ max. storage.
Optimization over policies: find the rule which tells you how much to order for every possible stock level x_k.
Examples

2) Example 6 of previous section
System

x_{k+1} = \underbrace{A x_k + B u_k + w_k}_{f_k}

Cost function

E\left[ \sum_{k=0}^{N-1} \underbrace{(x_k^T Q x_k + u_k^T R u_k)}_{g_k} + \underbrace{x_N^T Q x_N}_{g_N(x_N)} \right]
Objective: Determine u_k = \mu_k(x_k), k = 0, 1, \ldots, N-1, to minimize the cost function.
Solution turns out to be u_k^* = L_k x_k for some matrices L_k. (Derived in a later lecture.)
Examples
3) Shortest paths (see Figure 3)
Figure 3: Shortest path problem
x_k = which node we're in at stage k.
u_k = which path we take to get to stage k+1.
w_k = zero.
Cost function = sum of the values along the paths we choose.
Open loop vs. Closed loop
Open loop: Controls (u0, u1, . . . , uN−1) chosen at beginning (time 0).
Closed loop: Policy (\mu_0, \mu_1, \ldots, \mu_{N-1}) chosen, where at time k, \mu_k(x_k) = u_k can depend on x_k.
Can adapt to conditions.
e.g. Inventory problem. If current stock level:
  - x_k high → order less.
  - x_k low → order more.
Closed loop is always at least as good as open loop.
For deterministic problems, open loop is as good as closed loop:
  - can predict exactly the future states given the initial state and sequence of controls.
For stochastic problems, generally should use closed loop.
D.P. Principle of Optimality

Intuition
Figure 4: Shortest path problem
Consider the shortest path problem in Figure 4.
Shortest path from A to F shown in red: A→C→D→F
Shortest path from C to F: C→D→F.
  - Subpath of shortest path from A→F.
Shortest path from D to F: D→F.
  - Subpath of shortest path from A→F.
D.P. Principle of Optimality
Observation
Shortest path from A to F contains the shortest paths from intermediate nodes to F.
Why?
Suppose there is a shorter path from C to F which is not C→D→F.
Then we can construct a new path A→C→…→F (new shortest path) which is shorter than A→C→D→F
⇒ contradicts A→C→D→F being the shortest.
D.P. Principle of Optimality
Formal statement:
Basic problem

\min_\pi E\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}

Let \pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} be the optimal policy. Consider the "tail subproblem"

\min_{\mu_i, \mu_{i+1}, \ldots, \mu_{N-1}} E\left\{ \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\},

where we are at state x_i at time i and we wish to minimize the "cost to go" from time i to time N.
The D.P. principle of optimality then says that \{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\} is optimal for the tail subproblem.
D.P. Principle of Optimality
"Proof": If \{\mu_i, \ldots, \mu_{N-1}\} is a better policy for the tail subproblem, then \{\mu_0^*, \mu_1^*, \ldots, \mu_{i-1}^*, \mu_i, \ldots, \mu_{N-1}\} would be a better policy for the original problem
⇒ contradiction of \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} being optimal.
How can we make use of the D.P. principle?
Idea: Construct an optimal policy in stages.
  - Solve the tail subproblem involving the last stage, to obtain \mu_{N-1}^*.
  - Solve the tail subproblem involving the last two stages, making use of \mu_{N-1}^*, to obtain \mu_{N-2}^*.
  - Solve the tail subproblem involving the last three stages, making use of \mu_{N-2}^*, \mu_{N-1}^*, to obtain \mu_{N-3}^*.
  - ...
  - Solve the tail subproblem involving the last N stages, making use of \mu_1^*, \ldots, \mu_{N-1}^*, to obtain \mu_0^*.
D.P. Algorithm
Basic problem:

\min_\pi E\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}

D.P. Algorithm: For each possible x_k, compute:

J_N(x_N) = g_N(x_N),
J_k(x_k) = \min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \},
for k = N-1, N-2, \ldots, 1, 0

Theorem:
1. Optimal cost J^*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm.
2. Let \mu_k^*(\cdot) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. \mu_k^*(x_k) = u_k^*. Then \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} is the optimal policy for the basic problem.
Proof: See later
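For finite state, control, and disturbance spaces, the algorithm above can be written out directly. The helper below is a hypothetical sketch (not from the lecture); the caller supplies the horizon N, the state list, a function returning the admissible control set U_k(x_k), the disturbance distribution as a list of (w, probability) pairs (assumed i.i.d. here for simplicity), the system function f, the stage cost g, and the terminal cost gN.

```python
def solve_dp(N, states, controls, dist, f, g, gN):
    """Backward D.P.: returns cost-to-go tables J[k][x] and a policy mu[k][x].

    dist: list of (w, prob) pairs; f(x, u, w): next state;
    g(k, x, u, w): stage cost; gN(x): terminal cost.
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)                    # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):         # k = N-1, ..., 1, 0
        for x in states:
            best_u, best_c = None, float("inf")
            for u in controls(k, x):       # U_k(x_k): admissible controls
                c = sum(p * (g(k, x, u, w) + J[k + 1][f(x, u, w)])
                        for w, p in dist)  # expectation over w_k
                if c < best_c:
                    best_u, best_c = u, c
            J[k][x], mu[k][x] = best_c, best_u
    return J, mu
```

The inventory example later in this section is exactly this recursion with a particular choice of f, g, and demand distribution.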
D.P. Algorithm
Comments:
D.P. algorithm needs to be run for all possible states xk .
Solves all tail subproblems (don't know which subproblem you need at the start).
Can be computationally expensive if the number of states/controls is large.
Often done on computer.
Suboptimal methods can reduce complexity.
Inventory Example
xk = level of stock at time k.
uk = amount ordered at time k .
wk = demand at time k .
x_{k+1} = max(0, x_k + u_k - w_k) = f_k(x_k, u_k, w_k); excess demand is lost.
Storage constraint: x_k + u_k ≤ 2

Cost at time k = \underbrace{\text{purchasing cost}}_{\text{cost per item} = 1\text{ euro}} + \underbrace{\text{storage cost}}_{(x_k + u_k - w_k)^2}
= u_k + (x_k + u_k - w_k)^2 = g_k(x_k, u_k, w_k)
Terminal cost gN(xN) = 0.
Probability distribution of wk :
P(w_k = 0) = 0.1, P(w_k = 1) = 0.7, P(w_k = 2) = 0.2
Inventory Example
Problem: Find the optimal policy for horizon N = 3, i.e.
\min_{(\mu_0, \mu_1, \mu_2)} E\left\{ \sum_{k=0}^{2} g_k(x_k, \mu_k(x_k), w_k) \right\}

Apply D.P. algorithm:
J_3(x_3) = g_3(x_3) = 0
J_k(x_k) = \min_{u_k \in U_k} E\{ u_k + (x_k + u_k - w_k)^2 + J_{k+1}(max(0, x_k + u_k - w_k)) \}, k = 2, 1, 0

Question: What values can x_k take?
Inventory Example
Period 2:
Compute J_2(x_2) for all possible values of x_2.

J_2(0) = \min_{u_2 \in \{0,1,2\}} E\{ u_2 + (0 + u_2 - w_2)^2 + \underbrace{J_3(x_3)}_{= 0 \text{ for all } x_3} \}
       = \min_{u_2 \in \{0,1,2\}} u_2 + E\{ (u_2 - w_2)^2 \}
       = \min_{u_2 \in \{0,1,2\}} u_2 + (u_2 - 0)^2 (0.1) + (u_2 - 1)^2 (0.7) + (u_2 - 2)^2 (0.2)

If u_2 = 0: 0.7 × 1 + 0.2 × 4 = 1.5
If u_2 = 1: 1 + 0.1 × 1 + 0.7 × 0 + 0.2 × 1 = 1.3
If u_2 = 2: 2 + 0.1 × 4 + 0.7 × 1 + 0.2 × 0 = 3.1
⇒ J_2(0) = 1.3 and \mu_2^*(0) = 1
Inventory Example
J_2(1) = \min_{u_2 \in \{0,1\}} u_2 + (1 + u_2)^2 (0.1) + (1 + u_2 - 1)^2 (0.7) + (1 + u_2 - 2)^2 (0.2)

If u_2 = 0: 0.3 (check this!)
If u_2 = 1: 2.1
⇒ J_2(1) = 0.3 and \mu_2^*(1) = 0

J_2(2) = \min_{u_2 \in \{0\}} E\{ u_2 + (2 + u_2 - w_2)^2 \} = \cdots = 1.1
⇒ J_2(2) = 1.1 and \mu_2^*(2) = 0.
Inventory Example
Period 1:
Compute J_1(x_1) for all possible values of x_1.

J_1(0) = \min_{u_1 \in \{0,1,2\}} E\{ u_1 + (u_1 - w_1)^2 + J_2(max(0, 0 + u_1 - w_1)) \}
       = \min_{u_1 \in \{0,1,2\}} u_1 + (u_1^2 + J_2(max(0, u_1)))(0.1) + ((u_1 - 1)^2 + J_2(max(0, u_1 - 1)))(0.7) + ((u_1 - 2)^2 + J_2(max(0, u_1 - 2)))(0.2)

u_1 = 0: (0 + J_2(0))(0.1) + (1 + J_2(0))(0.7) + (4 + J_2(0))(0.2) = 2.8
u_1 = 1: 1 + (1 + J_2(1))(0.1) + (0 + J_2(0))(0.7) + (1 + J_2(0))(0.2) = 2.5
u_1 = 2: 2 + (4 + J_2(2))(0.1) + (1 + J_2(1))(0.7) + (0 + J_2(0))(0.2) = 3.68
⇒ J_1(0) = 2.5 and \mu_1^*(0) = 1

(Here the J_2 values are taken from the previous stage.)
Inventory Example
J_1(1) = \min_{u_1 \in \{0,1\}} E\{ u_1 + (1 + u_1 - w_1)^2 + J_2(max(0, 1 + u_1 - w_1)) \}

u_1 = 0: 1.5 (check!)
u_1 = 1: 2.68
⇒ J_1(1) = 1.5 and \mu_1^*(1) = 0

J_1(2) = 1.68, \mu_1^*(2) = 0 (check!)

Period 0:
Compute J_0(x_0) for all possible x_0 (tutorial problem).
Solution: J_0(0) = 3.7, J_0(1) = 2.7, J_0(2) = 2.818
\mu_0^*(0) = 1, \mu_0^*(1) = 0, \mu_0^*(2) = 0
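The whole three-period computation can be checked mechanically. The sketch below implements the backward recursion for exactly the problem data above (demand distribution, storage constraint x_k + u_k ≤ 2, zero terminal cost):

```python
# Backward D.P. for the inventory example: N = 3, states x in {0, 1, 2},
# storage constraint x + u <= 2, demand P(w=0)=0.1, P(w=1)=0.7, P(w=2)=0.2.
DEMAND = [(0, 0.1), (1, 0.7), (2, 0.2)]
N = 3
J = {N: {x: 0.0 for x in range(3)}}        # terminal cost g_3(x_3) = 0
mu = {}
for k in range(N - 1, -1, -1):             # k = 2, 1, 0
    J[k], mu[k] = {}, {}
    for x in range(3):
        costs = {}
        for u in range(3 - x):             # admissible orders: x + u <= 2
            costs[u] = sum(p * (u + (x + u - w) ** 2
                                + J[k + 1][max(0, x + u - w)])
                           for w, p in DEMAND)
        mu[k][x] = min(costs, key=costs.get)
        J[k][x] = costs[mu[k][x]]
```

Running this reproduces the cost-to-go values and policy derived above, stage by stage.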
Scheduling Example
Example: Scheduling problem (deterministic problem)
Four operations need to be performed: A, B, C, D.
B has to occur after A, D has to occur after C.
Costs: c_{AB} = 2, c_{AC} = 3, c_{AD} = 4, c_{BC} = 3, c_{BD} = 1, c_{CA} = 4, c_{CB} = 4, c_{CD} = 6, c_{DA} = 3, c_{DB} = 3.
Startup costs: S_A = 5, S_C = 3.
What is the optimal order?
Scheduling Example

Minimum cost to go shown in red.

Figure 5: Scheduling Problem
Scheduling Example

Use D.P. algorithm.
Let state = set of operations already performed; see Figure 5 ("Scheduling").
No terminal costs for this problem.
Tail subproblems of length 1.
Easy, only one choice at each state, e.g. if state = ACD, the next operation has to be B.
Tail subproblems of length 2.
State AB , only one choice, next operation is C.
State AC , if next operation is B: cost = 4 + 1 = 5.
State AC , if next operation is D: cost = 6 + 3 = 9. ⇒ Choose B.
State CA , if next operation is B: cost = 2 + 1 = 3.
State CA , if next operation is D: cost = 4 + 3 = 7. ⇒ Choose B.
State CD , only one choice, next operation is A.
Scheduling Example
Tail subproblems of length 3.
State A , if next operation is B: cost = 2 + 9 = 11.
State A , if next operation is C: cost = 3 + 5 = 8.⇒ Choose C
State C , if next operation is A: cost = 4 + 3 = 7.
State C , if next operation is D: cost = 6 + 5 = 11. ⇒ Choose A.
Original problem of length 4.
If start with A: cost = 5 + 8 = 13
If start with C: cost = 3 + 7 = 10 ⇒ Choose C
Therefore, the optimal sequence = CABD , and the optimal cost = 10.
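Since the problem is small and deterministic, the D.P. answer can be cross-checked by brute-force enumeration of the feasible orders (startup cost of the first operation plus the transition costs, subject to B after A and D after C):

```python
from itertools import permutations

# Brute-force check of the scheduling example.
trans = {'AB': 2, 'AC': 3, 'AD': 4, 'BC': 3, 'BD': 1,
         'CA': 4, 'CB': 4, 'CD': 6, 'DA': 3, 'DB': 3}
startup = {'A': 5, 'C': 3}

def cost(order):
    c = startup[order[0]]                  # precedence forces A or C first
    for i, j in zip(order, order[1:]):     # consecutive operation pairs
        c += trans[i + j]
    return c

feasible = [''.join(p) for p in permutations('ABCD')
            if p.index('A') < p.index('B')      # B after A
            and p.index('C') < p.index('D')]    # D after C
best = min(feasible, key=cost)
```

Enumeration confirms CABD with cost 10, agreeing with the staged D.P. computation above.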
Proof that D.P. Algorithm gives Optimal Solution

Basic problem:

\min_\pi E\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}

D.P. Algorithm: For each possible x_k, compute:

J_N(x_N) = g_N(x_N),
J_k(x_k) = \min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \},
for k = N-1, N-2, \ldots, 1, 0

Theorem:
1. Optimal cost J^*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm.
2. Let \mu_k^*(\cdot) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. \mu_k^*(x_k) = u_k^*. Then \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} is the optimal policy for the basic problem.
Proof that D.P. Algorithm gives Optimal Solution
Notation:

Given policy \pi = (\mu_0, \mu_1, \ldots, \mu_{N-1}),
let \pi_k = (\mu_k, \mu_{k+1}, \ldots, \mu_{N-1}) = "tail policy"
and J_k^*(x_k) = \min_{\pi_k} E\{ \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \} be the optimal cost for the tail subproblem.
Let J_k(x_k) = quantity computed by the D.P. algorithm.
Want to show that J_k^*(x_k) = J_k(x_k), for all x_k, k.
Proof is by mathematical induction.

Initial step (k = N):
By definition of J_k^*(x_k), J_N^*(x_N) = g_N(x_N).
By definition of the D.P. algorithm, J_N(x_N) = g_N(x_N).
⇒ J_N^*(x_N) = J_N(x_N)
Proof that D.P. Algorithm gives Optimal Solution
Induction step:

Assume J_l^*(x_l) = J_l(x_l) for l = N, N-1, \ldots, k+1.
Want to show that J_k^*(x_k) = J_k(x_k).
From the definition of J_k^*(x_k),

J_k^*(x_k) = \min_{\pi_k} E\left\{ \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}

= \min_{(\mu_k, \pi_{k+1})} E\left\{ g_k(x_k, \mu_k(x_k), w_k) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}

= \min_{\mu_k} E\left\{ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi_{k+1}} E\left[ \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right] \right\}

by the D.P. principle (optimize the tail subproblem, then \mu_k)
Proof that D.P. Algorithm gives Optimal Solution
= \min_{\mu_k} E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*(f_k(x_k, \mu_k(x_k), w_k)) \}   by definition of J_{k+1}^*(x_{k+1})

= \min_{\mu_k} E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k(x_k), w_k)) \}   by the induction hypothesis

= \min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \}   using the fact that \min_\mu F(x, \mu(x)) = \min_{u \in U(x)} F(x, u)

= J_k(x_k)   from the D.P. algorithm equations

So J_k^*(x_k) = J_k(x_k), and \mu_k^*(x_k) = u_k^* is the optimal policy.
By induction, this is true for k = N, N-1, \ldots, 1, 0.
In particular, J^*(x_0) = J_0^*(x_0) = J_0(x_0) is the optimal cost.
Shortest Paths in a Trellis
Figure 6: Shortest paths in a trellis (initial state s, stages 0 to N, artificial terminal state t)
Find shortest path from a node in Stage 1 to a node in Stage N
states → nodes
controls → arcs
a_{ij}^k: cost of transition from state i at stage k to state j at stage k+1.
a_{it}^N: terminal cost of state i.
Cost function = length of path from s to t.
Shortest Paths in a Trellis

D.P. Algorithm:

J_N(i) = a_{it}^N
J_k(i) = \min_j [ a_{ij}^k + J_{k+1}(j) ], k = N-1, \ldots, 1, 0
Optimal cost = J0(s) = length of shortest path from s to t.
Example: Find shortest path from stage 1 to stage 3 in Figure 7.
Figure 7: Shortest paths example (shortest path shown in red)
Shortest Paths in a Trellis
Redraw as a trellis with an initial and a terminal node, see Figure 8.
Figure 8: Redrawn shortest paths example (initial node s, terminal node t, two states per stage)
Here N = 3. Call the top node state 1 and the bottom node state 2.
Stage N:
J_3(1) = 0
J_3(2) = 0
Shortest Paths in a Trellis
Stage 2:
J_2(1) = min{ a_{11}^2 + J_3(1), a_{12}^2 + J_3(2) } = min{ 100 + 0, 200 + 0 } = 100
J_2(2) = min{ a_{21}^2 + J_3(1), a_{22}^2 + J_3(2) } = min{ 350 + 0, 400 + 0 } = 350

Stage 1:
J_1(1) = min{ a_{11}^1 + J_2(1), a_{12}^1 + J_2(2) } = min{ 300 + 100, 400 + 350 } = 400
J_1(2) = min{ a_{21}^1 + J_2(1), a_{22}^1 + J_2(2) } = min{ 150 + 100, 50 + 350 } = 250

Stage 0:
J_0(s) = min{ 0 + J_1(1), 0 + J_1(2) } = 250
Shortest path to original problem shown in red in Figure 7.
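The stage-by-stage computation above can be reproduced directly:

```python
# Backward D.P. on the two-state trellis of Figures 7-8 (N = 3):
# a[k][(i, j)] = cost from state i at stage k to state j at stage k+1.
a = {1: {(1, 1): 300, (1, 2): 400, (2, 1): 150, (2, 2): 50},
     2: {(1, 1): 100, (1, 2): 200, (2, 1): 350, (2, 2): 400}}
J = {3: {1: 0, 2: 0}}                      # zero terminal costs at stage N
for k in (2, 1):                           # k = N-1, ..., 1
    J[k] = {i: min(a[k][(i, j)] + J[k + 1][j] for j in (1, 2))
            for i in (1, 2)}
J0_s = min(J[1][1], J[1][2])               # stage-0 arcs from s have cost 0
```

This yields the same cost-to-go values (100, 350, 400, 250) and optimal cost 250 as the hand computation.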
Forward D.P. Algorithm
Observe that the optimal path s→t is also the optimal path t→s if the directions of the arcs are reversed.
⇒ The shortest path algorithm can be run forwards in time (see Bertsekas for equations).
Figure 9 shows the result of forward D.P. on the shortest paths example.
Forward D.P. is useful in real-time applications, where data arrives just before you need to make a decision.
Viterbi algorithm uses this idea
Shortest paths is a deterministic problem, so forward D.P. works.
For stochastic problems, there is no such concept of forward D.P.
  - Impossible to guarantee that any given state can be reached.
Forward D.P. Algorithm
Figure 9: Forward D.P. on shortest paths example
Viterbi Algorithm Applications
Estimation of hidden Markov models (HMMs)
  - x_k = Markov chain
  - state transitions in x_k not observed (hidden)
  - observe z_k; r(z, i, j) = probability we observe z given a transition in the Markov chain x_k from state i to j
  - Estimation problem: Given Z^N = \{z_1, z_2, \ldots, z_N\}, find the sequence X^N = \{x_0, x_1, \ldots, x_N\}, over all possible \{x_0, x_1, \ldots, x_N\}, that maximizes P(X^N | Z^N).

Note that P(X^N | Z^N) = \frac{P(X^N, Z^N)}{P(Z^N)}, and P(Z^N) is "constant" given Z^N. So

\max_{\{x_0, \ldots, x_N\}} P(X^N | Z^N) \longleftrightarrow \max_{\{x_0, \ldots, x_N\}} P(X^N, Z^N) \longleftrightarrow \max_{\{x_0, \ldots, x_N\}} \ln P(X^N, Z^N)
Viterbi Algorithm Applications
  - After some calculations (see Bertsekas), one can show that the problem is equivalent to:

\min_{\{x_0, \ldots, x_N\}} \; -\ln(\pi_{x_0}) - \sum_{k=1}^{N} \ln\left( \pi_{x_{k-1} x_k} \, r(z_k, x_{k-1}, x_k) \right)

where \pi_{x_0} = probability of the initial state, \pi_{x_{k-1} x_k} = transition probabilities of the Markov chain, and -\ln \pi_{x_0} and -\ln(\pi_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k)) can be regarded as lengths of the different stages
⇒ shortest path problem through a trellis
Decoding of convolutional codes
Channel equalization in presence of ISI (Inter-symbol interference)
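As a sketch of this shortest-path view: the two-state Markov chain, the observation model r, and all probabilities below are illustrative assumptions (not from the lecture). Arc lengths are the negative logs -ln(\pi_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k)), and cost-to-arrive values are propagated forwards in time, as in the Viterbi algorithm.

```python
import math

pi0 = {0: 0.6, 1: 0.4}                                    # initial distribution
P = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}  # transition probs

def r(z, i, j):
    """P(observe z | transition i -> j); assumed to depend only on j here."""
    return 0.9 if z == j else 0.1

def viterbi(obs):
    # D[x] = shortest "length" (negative log probability) of a path ending in x
    D = {x: -math.log(pi0[x]) for x in (0, 1)}
    back = []                                             # backpointers per stage
    for z in obs:
        prev, new = {}, {}
        for j in (0, 1):
            i = min((0, 1), key=lambda s: D[s] - math.log(P[(s, j)] * r(z, s, j)))
            prev[j] = i
            new[j] = D[i] - math.log(P[(i, j)] * r(z, i, j))
        back.append(prev)
        D = new
    x = min(D, key=D.get)                                 # best final state
    seq = [x]
    for prev in reversed(back):                           # backtrack x_N ... x_0
        x = prev[x]
        seq.append(x)
    return seq[::-1]                                      # x_0, x_1, ..., x_N
```

The shortest path through the trellis is the maximum a posteriori state sequence.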
General Shortest Path Problems
No trellis structure.
e.g. Find the shortest path from each node to node 5 in Figure 10.
Figure 10: General shortest path problem
Graph with N+1 nodes \{1, 2, \ldots, N, t\}.
a_{ij} = cost of moving from node i to node j.
Find the shortest path from each node i to node t.
General Shortest Path Problems
Assume some a_{ij}'s can be negative, but cycles have non-negative length.
  - Then the shortest path will not involve more than N arcs.
Reformulate as a trellis-type shortest path problem with N arcs, by allowing arcs from node i to itself with cost a_{ii} = 0.
D.P. algorithm:

J_{N-1}(i) = a_{it}
J_k(i) = \min_j \{ a_{ij} + J_{k+1}(j) \}, k = N-2, \ldots, 1, 0

This algorithm is essentially the Bellman-Ford algorithm.
Other algorithms have also been invented, e.g. Dijkstra's algorithm, which can be used when all a_{ij}'s are positive.
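A minimal sketch of this trellis reformulation; the graph is supplied by the caller as a dict of arc costs, with missing arcs treated as infinitely long and zero-cost self-loops added implicitly.

```python
# D.P. form of the Bellman-Ford recursion: zero-cost self-loops a_ii = 0
# turn the graph into a trellis with at most N arcs per path.
INF = float("inf")

def shortest_paths(a, nodes, t, N):
    """a[(i, j)] = arc cost; returns J_0(i), the shortest distance i -> t."""
    cost = lambda i, j: 0 if i == j else a.get((i, j), INF)
    J = {i: cost(i, t) for i in nodes}     # J_{N-1}(i) = a_it
    for _ in range(N - 1):                 # k = N-2, ..., 1, 0
        J = {i: min(cost(i, j) + J[j] for j in nodes) for i in nodes}
    return J
```

The small graph used below to exercise it is an illustrative assumption, not the graph of Figure 10.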
Outline
3 Problems with Perfect State Information
Linear Quadratic Control
Optimal Stopping Problems
Problems with Perfect State Information
Will study some problems where analytical solutions can be obtained:
Linear quadratic control
Optimal stopping problems
+ others in Chapter 4 of Bertsekas
Linear Quadratic Control

(Linear) System:

x_{k+1} = A x_k + B u_k + w_k, k = 0, 1, \ldots, N-1

(Quadratic) Cost function:

E\left\{ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right\}
Problem: Determine optimal policy to minimize cost function
x_k, u_k, w_k are column vectors.
A, B, Q, R are matrices.
w_k are independent and zero mean.
Q is positive semi-definite.
R is positive definite.
Linear Quadratic Control
Definition:
A symmetric matrix M is positive semi-definite if x^T M x ≥ 0 for all vectors x.
M is positive definite if x^T M x > 0 for all x ≠ 0.

One characterization:
M is positive semi-definite ⇔ all eigenvalues of M are ≥ 0.
M is positive definite ⇔ all eigenvalues of M are > 0.

D.P. algorithm applied to this problem gives:

J_N(x_N) = x_N^T Q x_N
J_k(x_k) = \min_{u_k} E\{ x_k^T Q x_k + u_k^T R u_k + J_{k+1}(A x_k + B u_k + w_k) \}, k = N-1, \ldots, 1, 0.
Linear Quadratic Control
It turns out that the minimization can be done analytically.

J_{N-1}(x_{N-1}) = \min_{u_{N-1}} E\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) \}

= \min_{u_{N-1}} E\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + x_{N-1}^T A^T Q A x_{N-1} + x_{N-1}^T A^T Q B u_{N-1} + x_{N-1}^T A^T Q w_{N-1} + u_{N-1}^T B^T Q A x_{N-1} + u_{N-1}^T B^T Q B u_{N-1} + u_{N-1}^T B^T Q w_{N-1} + w_{N-1}^T Q A x_{N-1} + w_{N-1}^T Q B u_{N-1} + w_{N-1}^T Q w_{N-1} \}

= x_{N-1}^T (A^T Q A + Q) x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \} + \min_{u_{N-1}} \{ u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} \}

(the cross terms involving w_{N-1} vanish since E[w_{N-1}] = 0)
Linear Quadratic Control

Digression
Problem: \min_x f(x)
How to solve?
For unconstrained scalar problems, differentiate and set the derivative equal to 0.
e.g. \min_x (x - 2)^2: \frac{d}{dx}(x - 2)^2 = 2(x - 2) = 0 ⇒ x^* = 2.

Similarly, differentiate u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} with respect to the vector u_{N-1} and set equal to zero.

Note that

\frac{\partial (u^T A u)}{\partial u} = 2 A u, \qquad \frac{\partial (a^T u)}{\partial u} = a,

where a and u are column vectors, and A is a symmetric matrix.

Using the above formulas, we obtain 2(R + B^T Q B) u_{N-1} + 2 B^T Q A x_{N-1} = 0
⇒ u_{N-1}^* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1}
Linear Quadratic Control
Substituting u_{N-1}^* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1} back into the expression for J_{N-1}(x_{N-1}), we obtain

J_{N-1}(x_{N-1}) = x_{N-1}^T (A^T Q A + Q) x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}
+ x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} (R + B^T Q B)(R + B^T Q B)^{-1} B^T Q A x_{N-1}
- 2 x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1}

= x_{N-1}^T (A^T Q A + Q) x_{N-1} - x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}

= x_{N-1}^T (A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A) x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}

= x_{N-1}^T K_{N-1} x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}

with K_{N-1} = A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A
Linear Quadratic Control
Continuing on, one can show that

u_{N-2}^* = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A x_{N-2},

and more generally (tutorial problem) that

\mu_k^*(x_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A x_k

where

K_N = Q,
K_k = A^T K_{k+1} A + Q - A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
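For the scalar case (A, B, Q, R all scalars) the recursion is easy to state in code; this is a simplified sketch in which transposes and matrix inverses reduce to ordinary arithmetic.

```python
# Scalar finite-horizon LQR Riccati recursion: returns K_0..K_N and the
# feedback gains L_0..L_{N-1} in mu*_k(x) = L_k x. Scalar sketch only;
# the matrix case replaces products and division with matrix operations.
def riccati_gains(A, B, Q, R, N):
    K = Q                                      # K_N = Q
    Ks, Ls = [K], []
    for _ in range(N):                         # k = N-1, ..., 1, 0
        L = -(B * K * A) / (B * K * B + R)     # gain using K_{k+1}
        K = A * K * A + Q - (A * K * B) * (B * K * A) / (B * K * B + R)
        Ls.append(L)
        Ks.append(K)
    return Ks[::-1], Ls[::-1]                  # ordered K_0..K_N, L_0..L_{N-1}
```

For example, with A = B = Q = R = 1 the recursion becomes K ← (2K + 1)/(K + 1), whose fixed point is the positive root of K² = K + 1, i.e. (1 + √5)/2; the corresponding closed-loop factor A + B·L then has magnitude below 1.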
Certainty Equivalence
Certainty Equivalence: The optimal policy is the same as solving the problem for the deterministic system:

x_{k+1} = A x_k + B u_k + E[w_k],

where w_k is replaced by its expected value E[w_k] = 0, i.e. the standard LQR problem.
Asymptotic Behaviour
Definition:
A pair of matrices (A, B), where A is n × n and B is n × m, is controllable if the n × nm matrix

[ B  AB  A^2 B  \ldots  A^{n-1} B ]

has full rank (all rows linearly independent).
A pair (A, C), where A is n × n and C is m × n, is observable if (A^T, C^T) is controllable.
Asymptotic Behaviour
Theorem
If (A, B) is controllable and Q can be written as Q = C^T C, where (A, C) is observable, then:

1. K_k → K as k → -∞, with K satisfying the algebraic Riccati equation

K = A^T K A + Q - A^T K B (B^T K B + R)^{-1} B^T K A

2. The steady-state controller

\mu^*(x_k) = L x_k,

where L = -(B^T K B + R)^{-1} B^T K A, stabilizes the system, i.e. the eigenvalues of A + BL have magnitude < 1.

Proof: See Bertsekas.

Note: If u_k = L x_k, then x_{k+1} = A x_k + B u_k + w_k = (A + BL) x_k + w_k.
x_k stays "bounded" when the eigenvalues of A + BL have magnitude < 1.
Other Variations

x_{k+1} = A_k x_k + B_k u_k + w_k

A_k, B_k random, unknown, independent.
Optimal policy:

\mu_k^*(x_k) = -(R + E\{B_k^T K_{k+1} B_k\})^{-1} E\{B_k^T K_{k+1} A_k\} x_k,

where

K_N = Q,
K_k = E\{A_k^T K_{k+1} A_k\} + Q - E\{A_k^T K_{k+1} B_k\} (E\{B_k^T K_{k+1} B_k\} + R)^{-1} E\{B_k^T K_{k+1} A_k\}

  - may not have certainty equivalence
  - may not have a steady-state solution

x_{k+1} = A x_k + B_k u_k + w_k

B_k is random, independent, and is only revealed to us at time k.
Motivation: wireless channels.
Similar to Leong, Dey, Anand, "Optimal LQG control over continuous fading channels", Proc. IFAC World Congress, 2011.
Optimal Stopping Problems
At each state, there is a "stop" control that stops the system, i.e. moves to and stays in a stop state.
Pure stopping problem: the only other control is "continue".
For pure stopping problems, the policy is characterized by a partition of the set of states into:
  - stop region
  - continue region,
which may depend on time.
Example (Asset selling)
A person has an asset for sale, e.g. a house.
At each time k = 0, 1, \ldots, N-1, the person receives a random offer w_k for the asset.
Assume wk ’s are independent.
Either accept w_k at time k+1, and invest the money at interest rate r, or reject w_k and wait for the offer w_{k+1}.
Must accept the last offer w_{N-1} at time N if every previous offer was rejected.
Find policy that maximizes (expected) revenue at the N-th period.
Example (Asset selling)
States: If x_k = T: asset already sold (T = stop state).
If x_k = w_{k-1}: offer currently under consideration.
Controls: {accept, reject}
The system evolves as:

x_{k+1} = f_k(x_k, w_k, u_k) =
  T,   if 1) x_k = T, or 2) x_k ≠ T and u_k = accept;
  w_k, otherwise.
Example (Asset selling)
Rewards at time k:

g_N(x_N) =
  x_N, if x_N ≠ T;
  0,   otherwise.

g_k(x_k, u_k, w_k) =
  (1 + r)^{N-k} x_k, if x_k ≠ T and u_k = accept;
  0,                 otherwise.

  - (For compound interest over n years, final amount = (1 + r)^n × initial amount.)
  - Note: From the way the rewards are defined, g_k is non-zero for only one k ∈ \{0, 1, \ldots, N-1\}.
Example (Asset selling)
Expected total reward

= E\left[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) \right]

D.P. algorithm (for reward maximization):

J_N(x_N) = g_N(x_N) =
  x_N, if x_N ≠ T;
  0,   otherwise.

J_k(x_k) = \max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]
Example (Asset selling)

If x_k = T, then g_k(x_k, u_k, w_k) = 0 and J_{k+1}(x_{k+1}) = 0, by the property that g_k is non-zero for only one k and the reward is incurred prior to time k.
If x_k ≠ T, then

E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ] =
  (1 + r)^{N-k} x_k,   if u_k = accept;
  0 + E[J_{k+1}(w_k)], if u_k = reject.

So

J_k(x_k) = \max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ] =
  \max((1 + r)^{N-k} x_k, E[J_{k+1}(w_k)]), if x_k ≠ T;
  0,                                        if x_k = T,

and the optimal policy is of the form:

u_k = accept if (1 + r)^{N-k} x_k > E[J_{k+1}(w_k)],

or

u_k =
  accept, if x_k > \frac{E[J_{k+1}(w_k)]}{(1 + r)^{N-k}};
  reject, otherwise.
Example (Asset selling)
Let
  αk = E[Jk+1(wk)] / (1 + r)^(N−k)
Can show (see Bertsekas) that αk ≥ αk+1 for all k if the wk are i.i.d.
  Intuition: an offer acceptable at time k should also be acceptable at time k + 1. See Figure 11.
Figure 11: Asset selling. The thresholds α1 ≥ α2 ≥ ... ≥ αN−1 are plotted against k; accept above the threshold curve, reject below.
Example (Asset selling)
Can also show that if the wk are i.i.d. and N → ∞, then the optimal policy "converges" to the stationary policy:
  uk = accept, if xk > α; reject, if xk ≤ α,
where α is a constant.
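The backward recursion for Jk and the thresholds αk can be evaluated numerically. A minimal Python sketch, assuming a discrete i.i.d. offer distribution; the offer values, probabilities, horizon N and interest rate r below are illustrative:

```python
import numpy as np

def asset_selling_thresholds(offers, probs, N, r):
    """Backward recursion for the asset-selling problem.

    offers, probs: a discrete distribution for the i.i.d. offers wk.
    Returns alpha[k] = E[J_{k+1}(w_k)] / (1 + r)^(N - k); the optimal
    policy accepts at time k iff the current offer exceeds alpha[k].
    """
    offers = np.asarray(offers, dtype=float)
    probs = np.asarray(probs, dtype=float)
    J = offers.copy()                      # J_N(x) = x for x != T
    alpha = np.zeros(N)
    for k in range(N - 1, -1, -1):
        EJ = probs @ J                     # E[J_{k+1}(w_k)]
        alpha[k] = EJ / (1 + r) ** (N - k)
        # J_k(x) = max((1 + r)^(N - k) x, E[J_{k+1}(w_k)])
        J = np.maximum((1 + r) ** (N - k) * offers, EJ)
    return alpha

alpha = asset_selling_thresholds([1.0, 2.0, 3.0], [1/3, 1/3, 1/3], N=5, r=0.05)
print(alpha)   # non-increasing in k, matching alpha_k >= alpha_{k+1}
```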
General Stopping Problems
Pure stopping problem: stop or continue are the only possible controls.
General stopping problem: stop, or choose a control uk from U(xk) (where U(xk) has more than one element).
Consider the time-invariant case: f(xk, uk, wk), g(xk, uk, wk) don't depend on k, and wk is i.i.d.
Stop at time k with cost t(xk).
Must stop by the last stage.
D.P. algorithm:
  JN(xN) = t(xN),
  Jk(xk) = min[ t(xk), min_{uk∈U(xk)} E{g(xk, uk, wk) + Jk+1(f(xk, uk, wk))} ]
Optimal to stop when
  t(xk) ≤ min_{uk∈U(xk)} E{g(xk, uk, wk) + Jk+1(f(xk, uk, wk))}
General Stopping Problems
Stopping set at time k (the set of states where we stop):
  Tk = {x | t(x) ≤ min_{u∈U(x)} E[g(x, u, w) + Jk+1(f(x, u, w))]}
Note that JN−1(x) ≤ JN(x) for all x, since JN(x) = t(x) and
  JN−1(x) = min[ t(x), min_{u∈U(x)} E[g(x, u, w) + JN(f(x, u, w))] ] ≤ t(x) = JN(x)
Can show that Jk(x) ≤ Jk+1(x) (monotonicity principle: tutorial problem).
Then we have:
  T0 ⊆ T1 ⊆ T2 ⊆ ... ⊆ Tk ⊆ Tk+1 ⊆ ... ⊆ TN−1
i.e. the set of states in which we stop grows with time.
Special Case
If f(x, u, w) ∈ TN−1 for all x ∈ TN−1, u ∈ U(x), w (i.e. the set TN−1 is absorbing), then
  T0 = T1 = T2 = · · · = TN−1.
Proof: see Bertsekas.
This simplifies the optimal policy, which is then called the one-step lookahead policy.
Special Case
E.g. asset selling with past offers retained.
Same situation as before, except that previously rejected offers can be accepted at a later time.
State evolves as
  xk+1 = max(xk, wk)
(instead of xk+1 = wk before).
Can show (see Bertsekas) that TN−1 = {x | x ≥ α} for some constant α.
This set is absorbing, since the best offer received so far cannot decrease over time.
⇒ the optimal policy at every time k is to accept if the best offer so far exceeds α.
We have a constant threshold α even for finite horizon N.
Outline
4 Problems with Imperfect State Information
  Reformulation as Perfect State Information Problem
  Linear Quadratic Control with Noisy Measurements
  Sufficient Statistics
Problems with Imperfect State Information
State xk is not known to the controller.
Instead we have "noisy" observations zk of the form:
  z0 = h0(x0, v0),
  zk = hk(xk, uk−1, vk), k = 1, 2, ..., N − 1,
where vk is "observation noise", with a probability distribution
  Pv(· | x0, ..., xk, u0, ..., uk−1, w0, ..., wk−1, v0, ..., vk−1)
which can depend on states, controls and disturbances.
Examples:
  hk(xk, uk−1, vk) = xk + vk,
  hk(xk, uk−1, vk) = sin xk + uk−1 vk
Problems with Imperfect State Information
Initial state x0 is random with distribution Px0.
uk ∈ Uk, where Uk does not depend on the (unknown) xk.
Information vector, i.e. the information available to the controller at time k, defined as
  I0 = z0,
  Ik = (z0, ..., zk, u0, ..., uk−1), k = 1, 2, ..., N − 1
Policies π = (µ0, ..., µN−1), where now µk(Ik) ∈ Uk (before: µk(xk)).
Basic Problem with Imperfect State Information
Find π that minimizes the cost function
  Jπ = E{ ∑_{k=0}^{N−1} gk(xk, µk(Ik), wk) + gN(xN) }
s.t. the system equation
  xk+1 = fk(xk, µk(Ik), wk)
and the measurement equation
  zk = hk(xk, µk−1(Ik−1), vk)
Question: How to solve this problem?
Reformulation as Perfect State Information Problem
Idea: Define a new system where the state is Ik. Then we have a D.P. algorithm etc.
By definition
  Ik+1 = (z0, ..., zk, zk+1, u0, ..., uk−1, uk), and since Ik = (z0, ..., zk, u0, ..., uk−1),
  ⇒ Ik+1 = (Ik, uk, zk+1).
Reformulation as Perfect State Information Problem
Regard
  Ik+1 = (Ik, uk, zk+1)
as a dynamical system with state Ik, control uk and disturbance zk+1.
Next note that E[gk(xk, uk, wk)] = E[ E[gk(xk, uk, wk) | Ik, uk] ] (recall that E[X] = E[E[X|Y]]).
Define gk(Ik, uk) = E[gk(xk, uk, wk) | Ik, uk] as the cost per stage of the new system, and gN(IN) = E[gN(xN) | IN] as the terminal cost.
The cost function becomes
  E{ ∑_{k=0}^{N−1} gk(xk, µk(Ik), wk) + gN(xN) } = E{ ∑_{k=0}^{N−1} gk(Ik, µk(Ik)) + gN(IN) }
Reformulation as Perfect State Information Problem
D.P. algorithm for the reformulated perfect state information problem:
  JN(IN) = gN(IN) = E[gN(xN) | IN]
  Jk(Ik) = min_{uk∈Uk} E{ gk(Ik, uk) + Jk+1(Ik, uk, zk+1) }
         = min_{uk∈Uk} E{ gk(xk, uk, wk) + Jk+1(Ik, uk, zk+1) | Ik }, k = N − 1, ..., 0
Optimal cost J* = E{J0(z0)}.
Linear Quadratic Control with Noisy Measurements
System
  xk+1 = A xk + B uk + wk
Cost function
  E[ ∑_{k=0}^{N−1} (xk^T Q xk + uk^T R uk) + xN^T Q xN ]
with gk(xk, uk, wk) = xk^T Q xk + uk^T R uk and gN(xN) = xN^T Q xN.
Observations
  zk = C xk + vk
wk are independent, zero mean.
From the D.P. algorithm:
  JN(IN) = E[xN^T Q xN | IN],
Linear Quadratic Control with Noisy Measurements
JN−1(IN−1) = min_{uN−1} E{ xN−1^T Q xN−1 + uN−1^T R uN−1 + E[(A xN−1 + B uN−1 + wN−1)^T Q (A xN−1 + B uN−1 + wN−1) | IN] | IN−1 }
 = min_{uN−1} E{ xN−1^T Q xN−1 + uN−1^T R uN−1 + (A xN−1 + B uN−1 + wN−1)^T Q (A xN−1 + B uN−1 + wN−1) | IN−1 }
(using the tower property E(E(X|Y)|Z) = E(X|Z) when Y contains "more information" than Z)
 = ... (expand, simplify and use E(wN−1 | IN−1) = 0)
 = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1 | IN−1]
   + min_{uN−1} { uN−1^T (B^T Q B + R) uN−1 + 2 E[xN−1 | IN−1]^T A^T Q B uN−1 }
Differentiate with respect to uN−1 and set equal to zero:
  2 (B^T Q B + R) uN−1 + 2 B^T Q A E[xN−1 | IN−1] = 0
  ⇒ u*N−1 = −(B^T Q B + R)^{−1} B^T Q A E[xN−1 | IN−1]
Linear Quadratic Control with Noisy Measurements
Substituting the expression for u*N−1 back in:
JN−1(IN−1) = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1]
 + E[xN−1|IN−1]^T A^T Q B (B^T Q B + R)^{−1} (B^T Q B + R) (B^T Q B + R)^{−1} B^T Q A E[xN−1|IN−1]
 − 2 E[xN−1|IN−1]^T A^T Q B (B^T Q B + R)^{−1} B^T Q A E[xN−1|IN−1]
 = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1]
 − E[xN−1|IN−1]^T A^T Q B (B^T Q B + R)^{−1} B^T Q A E[xN−1|IN−1]
 = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1]
 + E[(xN−1 − E[xN−1|IN−1])^T A^T Q B (B^T Q B + R)^{−1} B^T Q A (xN−1 − E[xN−1|IN−1]) | IN−1]
 − E[xN−1^T PN−1 xN−1 | IN−1], where PN−1 = A^T Q B (B^T Q B + R)^{−1} B^T Q A.
Linear Quadratic Control with Noisy Measurements
We have
  JN−1(IN−1) = E[xN−1^T KN−1 xN−1 | IN−1] + E[wN−1^T Q wN−1]
   + E[(xN−1 − E[xN−1|IN−1])^T PN−1 (xN−1 − E[xN−1|IN−1]) | IN−1]
where
  PN−1 = A^T Q B (B^T Q B + R)^{−1} B^T Q A
  KN−1 = A^T Q A + Q − PN−1.
Linear Quadratic Control with Noisy Measurements
For period N − 2,
JN−2(IN−2) = min_{uN−2} E{ xN−2^T Q xN−2 + uN−2^T R uN−2 + JN−1(IN−1) | IN−2 }
 = E{xN−2^T Q xN−2 | IN−2} + min_{uN−2} [ uN−2^T R uN−2 + E{xN−1^T KN−1 xN−1 | IN−2} ]
 + E[(xN−1 − E[xN−1|IN−1])^T PN−1 (xN−1 − E[xN−1|IN−1]) | IN−2] + E(wN−1^T Q wN−1)
Then can obtain
  u*N−2 = −(B^T KN−1 B + R)^{−1} B^T KN−1 A E[xN−2|IN−2]
Note that in the above the term
  E[(xN−1 − E[xN−1|IN−1])^T PN−1 (xN−1 − E[xN−1|IN−1]) | IN−2]
can be taken outside the minimization (see Bertsekas for proof).
  Intuition: the estimation error xk − E[xk|Ik] can't be influenced by the choice of control.
Linear Quadratic Control with Noisy Measurements
Continuing on, the general solution is:
  µ*k(Ik) = u*k = −(B^T Kk+1 B + R)^{−1} B^T Kk+1 A E[xk|Ik] = Lk E[xk|Ik]
where
  KN = Q
  Pk = A^T Kk+1 B (B^T Kk+1 B + R)^{−1} B^T Kk+1 A
  Kk = A^T Kk+1 A + Q − Pk
Comparison with the perfect state information case:
  The gain matrix Lk is the same.
  xk is replaced by E[xk|Ik].
How to compute E[xk|Ik]?
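The backward recursion for Kk and the gains Lk translates directly into code; a minimal sketch, where the system matrices and horizon below are illustrative:

```python
import numpy as np

def lq_gains(A, B, Q, R, N):
    """Backward Riccati recursion K_N = Q, K_k = A'K_{k+1}A + Q - P_k.

    Returns the gains L_k with u_k = L_k E[x_k | I_k] (the same L_k as
    in the perfect state information case).
    """
    K = Q.copy()                               # K_N = Q
    L = [None] * N
    for k in range(N - 1, -1, -1):
        S = B.T @ K @ B + R
        L[k] = -np.linalg.solve(S, B.T @ K @ A)
        P = A.T @ K @ B @ np.linalg.solve(S, B.T @ K @ A)   # P_k
        K = A.T @ K @ A + Q - P                             # K_k
    return L

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # illustrative double integrator
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
L = lq_gains(A, B, Q, R, N=10)
print(L[0])
```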
Linear Quadratic Control with Noisy Measurements
Summary so far:
System
  xk+1 = A xk + B uk + wk
  zk = C xk + vk
Problem
  min E[ ∑_{k=0}^{N−1} (xk^T Q xk + uk^T R uk) + xN^T Q xN ]
Optimal solution:
  µ*k(Ik) = −(B^T Kk+1 B + R)^{−1} B^T Kk+1 A E[xk|Ik] = Lk E[xk|Ik]
where Ik = (z0, ..., zk, u0, ..., uk−1)
Linear Quadratic Control with Noisy Measurements
The optimal controller can be decomposed into two parts:
1) An estimator, which computes E[xk|Ik].
2) An actuator, which multiplies E[xk|Ik] by Lk. Lk is the same gain matrix as in the perfect state information case; we only replace xk with E[xk|Ik].
The estimator and actuator can be designed separately.
  This is known as the separation principle/theorem.
LQG Control
Remaining problem: How do we compute E[xk |Ik ]?
Very difficult problem in general (subject called non-linear filtering).
When system is linear and wk , vk are Gaussian, E[xk |Ik ] can becomputed analytically.
I Procedure/algorithm is known as the Kalman Filter (ref: Anderson andMoore, “Optimal Filtering”), and the overall controller is called theLQG (linear quadratic Gaussian) controller
Kalman Filter
System:
xk+1 = Axk + Buk + wk
zk = Cxk + vk
wk ∼ N(0, Σw) i.i.d., Σw = E[wk wk^T]
vk ∼ N(0, Σv) i.i.d., Σv = E[vk vk^T]
Define state estimates
  xk|k = E[xk|Ik]
  xk+1|k = E[xk+1|Ik]
and estimation error covariance matrices
  Σk|k = E[(xk − xk|k)(xk − xk|k)^T | Ik]
  Σk+1|k = E[(xk+1 − xk+1|k)(xk+1 − xk+1|k)^T | Ik]
Kalman Filter
Then xk|k, xk+1|k, Σk|k, Σk+1|k can be computed recursively using the Kalman filter equations:
  xk|k = xk|k−1 + Σk|k−1 C^T (C Σk|k−1 C^T + Σv)^{−1} (zk − C xk|k−1)
  xk+1|k = A xk|k + B uk
  Σk|k = Σk|k−1 − Σk|k−1 C^T (C Σk|k−1 C^T + Σv)^{−1} C Σk|k−1
  Σk+1|k = A Σk|k A^T + Σw, k = 0, 1, ..., N − 1
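The four Kalman filter equations above can be sketched as a single update function; a minimal version, with an illustrative scalar usage example:

```python
import numpy as np

def kalman_step(x_pred, Sigma_pred, z, u, A, B, C, Sigma_w, Sigma_v):
    """One step of the Kalman filter equations above.

    Inputs are x_{k|k-1} and Sigma_{k|k-1}; returns
    (x_{k|k}, Sigma_{k|k}, x_{k+1|k}, Sigma_{k+1|k}).
    """
    S = C @ Sigma_pred @ C.T + Sigma_v
    G = Sigma_pred @ C.T @ np.linalg.inv(S)       # the "Kalman gain"
    x_filt = x_pred + G @ (z - C @ x_pred)        # x_{k|k}
    Sigma_filt = Sigma_pred - G @ C @ Sigma_pred  # Sigma_{k|k}
    x_next = A @ x_filt + B @ u                   # x_{k+1|k}
    Sigma_next = A @ Sigma_filt @ A.T + Sigma_w   # Sigma_{k+1|k}
    return x_filt, Sigma_filt, x_next, Sigma_next

# illustrative scalar example
A = B = C = Sigma_w = Sigma_v = np.array([[1.0]])
x_f, S_f, x_n, S_n = kalman_step(np.array([0.0]), np.array([[10.0]]),
                                 z=np.array([2.0]), u=np.array([0.0]),
                                 A=A, B=B, C=C, Sigma_w=Sigma_w, Sigma_v=Sigma_v)
print(x_f, S_f)   # the measurement shrinks the covariance: 10 -> 10/11
```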
Proof: see Bertsekas, or Anderson and Moore.
Beware: Many people who work in Kalman filtering like to use Q for Σw, R for Σv, and Kk for the "Kalman gain" Σk|k−1 C^T (C Σk|k−1 C^T + Σv)^{−1}, but here Q, R, Kk have been used for different things. People also use Pk+1|k for Σk+1|k, Pk|k for Σk|k, etc.
Kalman Filter Properties
In general the mean squared error
  E[(xk − x̂k)^T (xk − x̂k) | Ik]
is minimized when x̂k = E[xk|Ik].
The Kalman filter equations compute E[xk|Ik] when the noises are Gaussian, and the (optimal) estimates are linear functions of the measurements zk.
Even when the noises are not Gaussian, the xk|k computed by the Kalman filter equations gives the best linear estimate of xk.
  Useful suboptimal solution when the noises are non-Gaussian.
Kalman Filter Properties
Recall that if the pair (A, B) is controllable and (A, Q^{1/2}) is observable, the optimal controller has a steady state solution.
Similarly, if (A, C) is observable and (A, Σw^{1/2}) is controllable, then Σk|k−1 converges to a steady state value Σ as k → ∞, where Σ satisfies the algebraic Riccati equation
  Σ = A Σ A^T − A Σ C^T (C Σ C^T + Σv)^{−1} C Σ A^T + Σw
So we have a steady state estimator:
  xk|k = xk|k−1 + Σ C^T (C Σ C^T + Σv)^{−1} (zk − C xk|k−1)
  xk+1|k = A xk|k + B uk
Sufficient Statistics
Information vector Ik = (z0, ..., zk, u0, ..., uk−1).
The dimension of Ik increases with time k, which is inconvenient for large k.
Sufficient statistic: a function Sk(Ik) which summarizes all the essential content of Ik for computing the optimal control, i.e. µ*k(Ik) = µ(Sk(Ik)) for some function µ.
Sk(Ik) is preferably of smaller dimension than Ik.
Examples of Sufficient Statistics
1) Ik itself.
2) The conditional state distribution / belief state P_{xk|Ik}, assuming that the distribution of vk depends only on xk−1, uk−1, wk−1.
If the number of states is finite then P_{xk|Ik} is a vector, e.g. if the states are 1, 2, ..., n, then
  P_{xk|Ik} = ( P(xk = 1|Ik), P(xk = 2|Ik), ..., P(xk = n|Ik) )^T
The dimension of this vector is n, which doesn't grow with k.
3) Special case: E[xk|Ik] is a sufficient statistic for the LQG problem (though not a sufficient statistic in general).
Conditional State Distribution
The conditional state distribution P_{xk|Ik} can be generated recursively, as
  P_{xk+1|Ik+1} = Φk(P_{xk|Ik}, uk, zk+1)
for some function Φk(·, ·, ·).
Then the D.P. algorithm can be written as
  Jk(P_{xk|Ik}) = min_{uk∈Uk} E[gk(xk, uk, wk) + Jk+1(Φk(P_{xk|Ik}, uk, zk+1)) | Ik].
A general formula for Φk(·, ·, ·) can be derived, but it is quite complicated (see Bertsekas). We will derive some examples from first principles.
Example 1: Search Problem
At each period, decide whether to search a site that may contain a treasure.
If the treasure is present and we search, we find it with probability β and take it.
States: {treasure present, treasure not present}
Controls: {search, no search}
Regard each search result as an (imperfect) observation of the state.
Let pk = probability that the treasure is present at the start of time k.
  If we don't search, pk+1 = pk.
  If we search and find the treasure, pk+1 = 0.
Example 1
If we search and don't find the treasure,
  pk+1 = P(treasure present at k | don't find at k)
       = P(treasure present at k ∩ don't find at k) / P(don't find at k)
       = pk(1 − β) / (pk(1 − β) + (1 − pk)),
with (1 − pk) corresponding to "treasure not present & don't find".
Thus
  pk+1 = pk, if we don't search at time k;
         0, if we search and find the treasure;
         pk(1 − β) / (pk(1 − β) + (1 − pk)), if we search and don't find the treasure,
which is the Φk(pk, uk, zk+1) function.
Example 1
Now let the treasure be worth V, let each search cost C, and suppose that once we decide not to search we can't search again at future times.
The D.P. algorithm gives:
  Jk(pk) = max over {no search, search} of
    [ 0, −C + pkβV + (1 − pkβ) Jk+1( pk(1−β) / (pk(1−β) + 1 − pk) ) + pkβ Jk+1(0) ]
  = max[ 0, −C + pkβV + (1 − pkβ) Jk+1( pk(1−β) / (pk(1−β) + 1 − pk) ) ]
(where pkβ Jk+1(0) = 0 since the treasure has already been found)
Can show that Jk(pk) = 0 for all pk ≤ C/(βV), and that it is optimal to search iff the expected reward pkβV ≥ cost of search C. (Tutorial problem)
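The recursion for Jk(pk) can be evaluated numerically on a grid of beliefs; a minimal sketch, assuming the terminal condition JN = 0, β < 1, and illustrative values of β, V, C:

```python
import numpy as np

def search_values(beta, V, C, N, grid=1001):
    """Backward DP for the treasure-search problem on a grid of beliefs p,
    assuming the terminal condition J_N = 0 (and beta < 1)."""
    p = np.linspace(0.0, 1.0, grid)
    # belief after an unsuccessful search
    p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)
    J = np.zeros(grid)                        # J_N = 0
    for _ in range(N):
        J_next = np.interp(p_next, p, J)      # J_{k+1} at the updated belief
        J = np.maximum(0.0, -C + p * beta * V + (1 - p * beta) * J_next)
    return p, J

p, J0 = search_values(beta=0.5, V=10.0, C=1.0, N=20)
print(J0[p <= 0.2].max())    # J_0(p) = 0 for p <= C/(beta V) = 0.2, prints 0.0
```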
Example 2: Research Paper*
A process {Pe,k} evolves in the following way, for k = 1, ..., N:
  Pe,k+1 = P, if νk+1 γe,k+1 = 1;
           A Pe,k A^T + Q, if νk+1 γe,k+1 = 0,
where P, A, Q are some matrices.
{γe,k} is an i.i.d. Bernoulli process with
  P(γe,k = 1) = λe, P(γe,k = 0) = 1 − λe, ∀k
νk ∈ {0, 1}. {Pe,k} is not observed at all (no observation zk).
*Leong, Quevedo, Dolz, Dey, "On Remote State Estimation in the Presence of an Eavesdropper", Proc. IFAC World Congress, 2017.
Example 2
Regard Pe,k as the state at time k, and νk+1 as the control. Assume Pe,0 = P.
Then Pe,k ∈ {P, A P A^T + Q, A(A P A^T + Q)A^T + Q, ...} = {P, f(P), f^2(P), ..., f^N(P)}, where
  f(P) = A P A^T + Q
The conditional state distribution is
  ( P(Pe,k = P|ν0, ..., νk), P(Pe,k = f(P)|ν0, ..., νk), ..., P(Pe,k = f^N(P)|ν0, ..., νk) )^T
Example 2
When νk+1 = 0, Pe,k+1 = f(Pe,k) with probability 1. So
  ( P(Pe,k+1 = P|ν0, ..., νk+1), P(Pe,k+1 = f(P)|ν0, ..., νk+1), ..., P(Pe,k+1 = f^N(P)|ν0, ..., νk+1) )^T
  = ( 0, P(Pe,k = P|ν0, ..., νk), ..., P(Pe,k = f^{N−1}(P)|ν0, ..., νk) )^T
which gives the Φk(P_{Pe,k|Ik}, νk+1, zk+1) function when νk+1 = 0.
Example 2
When νk+1 = 1, Pe,k+1 = P with probability λe, and Pe,k+1 = f(Pe,k) with probability 1 − λe. So
  ( P(Pe,k+1 = P|ν0, ..., νk+1), P(Pe,k+1 = f(P)|ν0, ..., νk+1), ..., P(Pe,k+1 = f^N(P)|ν0, ..., νk+1) )^T
  = ( λe, (1 − λe) P(Pe,k = P|ν0, ..., νk), ..., (1 − λe) P(Pe,k = f^{N−1}(P)|ν0, ..., νk) )^T
which gives the Φk(P_{Pe,k|Ik}, νk+1, zk+1) function when νk+1 = 1.
Outline
5 Suboptimal Methods / Approximate Dynamic Programming
  Certainty Equivalent Control
  Rollout Algorithms
  Model Predictive Control
Suboptimal Methods
Why do we need/want suboptimal methods?
In D.P. we need to compute
  Jk(xk) = min_{uk} E[gk(xk, uk, wk) + Jk+1(xk+1)]
for all states xk.
1) In many problems, this minimization can't be done analytically.
  Have to test each uk.
  When the number of possible xk, uk or wk is large, the amount of computation required can be substantial.
Suboptimal Methods
2) In some problems, xk, uk or wk are continuous valued.
  Have to discretize their ranges to convert to a discrete problem, see Fig. 12.
  Using more points gives a better approximation, but requires more computation.
  The situation is worse in higher dimensions: the "curse of dimensionality".
Figure 12: Discretization of the range of xk = [−1, 1] into a finite set of points.
Suboptimal Methods
3) In problems with imperfect state information, the conditional state distribution P_{xk|Ik} is of the form
  P_{xk|Ik} = ( P(xk = 1|Ik), P(xk = 2|Ik), ..., P(xk = n|Ik) )^T
Its range is [0, 1]^n (continuous).
Solving imperfect state information problems exactly is intractable except in special or very simple cases.
4) Real time constraints: data may not be available until shortly before, or data may change as the system is being controlled.
Suboptimal Methods
Will discuss a few methods for suboptimal solutions
Certainty Equivalent Control (CEC)
Rollout Algorithms
Model Predictive Control (MPC)
Many other methods in Vol. I Ch.6 and Vol. II of Bertsekas.
Certainty Equivalent Control
Idea
  Replace a stochastic problem with a deterministic one.
  At each time k, fix the future uncertain quantities to some "typical" values, e.g. replace wk with E[wk].
Procedure (Online Version)
At each time k:
(1) Fix wi, i ≥ k, to some w̄i. Solve the deterministic problem
    min_{uk, uk+1, ..., uN−1} [ ∑_{i=k}^{N−1} gi(xi, ui, w̄i) + gN(xN) ],
    assuming xi+1 = fi(xi, ui, w̄i), i = k, k + 1, ..., N − 1, ui ∈ Ui(xi).
(2) Use the first control in the optimal control sequence {ūk, ūk+1, ..., ūN−1} found, i.e. µk(xk) = ūk.
Certainty Equivalent Control
Equivalent Procedure (Offline Version)
(1) Fix wk to some w̄k for k = 0, 1, ..., N − 1. Solve the deterministic problem
    min_{µ0, µ1, ..., µN−1} [ ∑_{k=0}^{N−1} gk(xk, µk(xk), w̄k) + gN(xN) ],
    assuming xk+1 = fk(xk, µk(xk), w̄k), k = 0, 1, ..., N − 1, µk(xk) ∈ Uk(xk).
(2) Let {µd0, µd1, ..., µdN−1} be the solution to the problem above. At each time k, apply µk(xk) = µdk(xk).
Certainty Equivalent Control
Comments:
  N problems have to be solved in the online version, one in the offline version.
  The online and offline versions give the same controller if the data is not changing. Use the online version if the data is changing.
  For problems with imperfect state information, also replace xk by an estimate x̄k(Ik) (e.g. x̄k(Ik) = E[xk|Ik]).
  Certainty equivalent control often performs well in practice.
  For the linear quadratic control problem, the certainty equivalent controller is equivalent to the optimal controller.
  Can fix some disturbances while leaving others stochastic, e.g. for imperfect state information problems, replace xk by x̄k(Ik) while leaving wk stochastic.
Rollout Algorithms
One-step lookahead policy, with the optimal cost to go approximated by the cost to go of some base policy.
"Rollout" was coined by Gerald Tesauro in 1996, in the context of rolling dice in a backgammon-playing computer program.
  A given backgammon position is evaluated by "rolling out" many games starting from that position, and taking the average.
The rollout policy has a cost improvement property.
It often produces substantial improvement over the base policy.
Rollout Algorithms
One-step lookahead policy
  At each k and xk we use the control µk(xk) that solves the problem
    min_{uk∈Uk} E{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) }
  where J̃N = gN, and J̃k+1 is an approximation to the true cost to go Jk+1.
Rollout policy
  The case where the approximation J̃k is the cost to go of some heuristic base policy.
Example: Quiz problem
N questions given.
Question i is answered correctly with probability pi, with reward vi if correct.
The quiz terminates at the first incorrect answer.
Choose the order of questions to maximize the total expected reward.
Index policy: answer questions in decreasing order of pi vi / (1 − pi).
  The index policy is optimal when there are no other constraints (Ch. 4.5 Bertsekas).
Now assume there is a limit (< N) on the maximum number of questions to be answered.
  Then the index policy is in general not optimal.
Example: Quiz problem
Rollout algorithm: use the index policy as the base policy.
  At a state denoting the subset of questions already answered, compute the expected reward R(j) for each possible next question j, assuming the order of the remaining questions follows the index policy.
  Answer the question with maximum R(j).
R(j) can be computed analytically, since given an order of questions (i1, i2, ..., iM), with M ≤ N, the expected reward is
  pi1(vi1 + pi2(vi2 + pi3(... + piM viM) ...))
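The nested expected reward formula and the index policy can be sketched as follows; the probabilities and rewards below are illustrative:

```python
def expected_reward(order, p, v):
    """p_{i1}(v_{i1} + p_{i2}(v_{i2} + ... + p_{iM} v_{iM})), evaluated
    from the innermost term outward."""
    total = 0.0
    for i in reversed(order):
        total = p[i] * (v[i] + total)
    return total

# illustrative data: three questions
p = [0.9, 0.5, 0.8]
v = [1.0, 10.0, 2.0]
# index policy: answer in decreasing order of p_i v_i / (1 - p_i)
order = sorted(range(len(p)), key=lambda i: p[i] * v[i] / (1 - p[i]), reverse=True)
print(order, expected_reward(order, p, v))   # [1, 0, 2] 6.17
```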
Example: Travelling Salesman Problem
N cities.
Assume the graph is complete.
Find the minimum cost tour that visits each city exactly once and returns to the starting city.
An important and difficult problem in combinatorial optimization.
Figure 13: Travelling Salesman Problem
Example: Travelling Salesman Problem
Nearest neighbour heuristic:
  Start from an arbitrary city.
  The next city visited is the one with minimum distance from the current city (that has not been previously visited).
Rollout algorithm: use the nearest neighbour heuristic as the base policy.
  For each city not yet visited, assume the nearest neighbour heuristic is run afterwards, and compute the cost of the resulting tour.
  Choose the next city as the one that gives the best tour.
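The steps above can be sketched in a few lines; a minimal version of the nearest neighbour heuristic and the rollout algorithm built on top of it, with an illustrative distance matrix (not the one from the figures):

```python
def tour_length(tour, D):
    return sum(D[tour[i]][tour[i + 1]] for i in range(len(tour) - 1))

def nearest_neighbour(D, start, partial):
    """Complete a partial tour greedily, then return to the start city."""
    tour = list(partial)
    while len(tour) < len(D):
        cur = tour[-1]
        tour.append(min((c for c in range(len(D)) if c not in tour),
                        key=lambda c: D[cur][c]))
    return tour + [start]

def rollout_tsp(D, start=0):
    """Rollout: choose each next city by completing the tour with the
    nearest neighbour heuristic and keeping the best completion."""
    partial = [start]
    while len(partial) < len(D):
        partial.append(min((c for c in range(len(D)) if c not in partial),
                           key=lambda c: tour_length(
                               nearest_neighbour(D, start, partial + [c]), D)))
    return partial + [start]

# illustrative symmetric distance matrix on 5 cities
D = [[0, 2, 9, 10, 7],
     [2, 0, 6, 4, 3],
     [9, 6, 0, 8, 5],
     [10, 4, 8, 0, 6],
     [7, 3, 5, 6, 0]]
print(tour_length(nearest_neighbour(D, 0, [0]), D))   # 28
print(tour_length(rollout_tsp(D), D))                 # 26
```

Here rollout improves on the plain nearest neighbour tour (26 vs 28), illustrating the cost improvement property.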
Example: Travelling Salesman Problem
Consider the travelling salesman problem for the graph shown below.
Let a be the node with which we start and end the tour. An optimaltour can be shown to be abcdea, with length 375.
Figure 14: Travelling Salesman Problem
Nearest neighbour (N.N.) heuristic gives tour aedbca with length 550.
Example: Travelling Salesman Problem
Rollout algorithm with nearest neighbour heuristic as base policy:
1st stage (each partial tour is completed by the nearest neighbour (N.N.) heuristic):
  ab + cdea (N.N.): length = 375
  ac + bdea (N.N.): length = 550
  ad + ebca (N.N.): length = 625
  ae + dbca (N.N.): length = 550
So the next node should be b.
2nd stage:
  abc + dea (N.N.): length = 375
  abd + eca (N.N.): length = 650
  abe + dca (N.N.): length = 675
So the next node is c.
Example: Travelling Salesman Problem
Rollout algorithm with nearest neighbour heuristic as base policy:
3rd stage:
  abcd + ea (N.N.): length = 375
  abce + da (N.N.): length = 425
So the next node is d.
4th stage:
  abcdea = tour computed by rollout, with length 375.
Cost Improvement Property of Rollout Algorithm
Theorem:
Let J̄k(xk) be the cost to go of the rollout policy. Let Jk(xk) be the cost to go of the base policy. Then
  J̄k(xk) ≤ Jk(xk), ∀xk, k
Proof: use induction.
Initial step:
By definition
  J̄N(xN) = JN(xN) = gN(xN), ∀xN
Cost Improvement Property of Rollout Algorithm
Induction step:
Assume J̄l(xl) ≤ Jl(xl), ∀xl, l = N − 1, N − 2, ..., k + 1.
Want to show J̄k(xk) ≤ Jk(xk).
Let µ̄k(xk) be the control applied by the rollout policy, and µk(xk) the control applied by the base policy. Then
  J̄k(xk) = E[gk(xk, µ̄k(xk), wk) + J̄k+1(fk(xk, µ̄k(xk), wk))]   (definition of the cost to go J̄k)
          ≤ E[gk(xk, µ̄k(xk), wk) + Jk+1(fk(xk, µ̄k(xk), wk))]   (induction hypothesis)
          ≤ E[gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk))]    (µ̄k(xk) minimizes this expression, by definition of the rollout policy)
          = Jk(xk)   (definition of the cost to go Jk)
By induction, J̄k(xk) ≤ Jk(xk), ∀k, xk.
Difficulties In Using Rollout
For stochastic problems, cost to go Jk of base policy may still bedifficult to evaluate analytically.
Need to approximate Jk using e.g. Monte Carlo simulations, orcertainty equivalence.
Model Predictive Control (MPC)
Originated and widely used in process control industries.
Concepts have since been applied to many areas.
Idea:
Compute a set of m control signals which optimizes the objective over a finite horizon m, using a model that predicts the system outputs at future times.
The first element of this set is applied to the system.
Repeat the process at the next time step, in a receding horizon manner.
Model Predictive Control (MPC)
Figure 15: Model Predictive Control. A: computed set of control signals at time 0 (u*_0 is applied). B: computed set of control signals at time 1 (u*_1 is applied). C: computed set of control signals at time 2 (u*_2 is applied).
Model Predictive Control (MPC)
Well suited to systems with control/state constraints, nonlinear systems, etc.
Corresponds to an m-step lookahead policy with cost to go approximation equal to zero.
As m increases, performance "usually" improves (see Bertsekas for counter-examples).
A larger m requires a higher amount of computation.
For nonlinear systems, the computation of the control signals often needs to be done numerically.
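For the unconstrained linear quadratic case, one MPC step reduces to an m-step Riccati recursion whose first gain is applied; a minimal sketch of the receding horizon loop, with illustrative system matrices and horizon:

```python
import numpy as np

def mpc_control(x, A, B, Q, R, m):
    """One MPC step: solve the m-step unconstrained LQ problem by a
    backward Riccati recursion and return only the first control."""
    K = Q.copy()
    L0 = None
    for _ in range(m):
        S = B.T @ K @ B + R
        L0 = -np.linalg.solve(S, B.T @ K @ A)   # first-stage gain so far
        K = Q + A.T @ K @ (A + B @ L0)          # Riccati update
    return L0 @ x

# receding horizon simulation on an illustrative double integrator
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
x = np.array([1.0, 1.0])
for _ in range(50):
    x = A @ x + B @ mpc_control(x, A, B, Q, R, m=10)
print(np.linalg.norm(x))   # the receding horizon loop drives the state towards zero
```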
Outline
6 Infinite Horizon Problems
  Discounted Cost Problems
  Average Cost Problems
Infinite Horizon Problems
Infinite number of stages.
Assume the system is stationary, i.e. f(·, ·, ·), g(·, ·, ·) and the distribution of wk don't depend on time k.
"Different" algorithms are needed.
Optimal policies often have a simple stationary form that does not depend on time.
But the analysis is more difficult than for finite horizon problems (won't be covered in this course).
Types of Infinite Horizon Problems
1 Total cost problems
  min_{µk} lim_{N→∞} E[ ∑_{k=0}^{N−1} g(xk, µk(xk), wk) ]
  Not commonly used, because the cost function often goes to infinity.
  A special type, called the stochastic shortest path problem, with a cost-free termination state, is studied in Bertsekas.
2 Discounted cost problems
  min_{µk} lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) ]
  where γ ∈ (0, 1).
  γ is called the discount factor.
  Cost incurred at earlier times is more important than at later times.
Types of Infinite Horizon Problems
3 Average cost problems
  min_{µk} lim_{N→∞} (1/N) E[ ∑_{k=0}^{N−1} g(xk, µk(xk), wk) ]
  Cost incurred in the future is more important than at the beginning (any finite initial segment does not affect the average).
  The cost function is usually finite, in contrast to total cost problems.
  The optimal average cost is usually independent of the initial state.
Discounted Cost Problems
min_{µk} lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) ]
Assume
  the number of states is finite, taking values 1, 2, ..., n;
  the number of possible controls is finite.
Notation
Given a policy π = {µ0, µ1, ...}, the cost of the policy starting at state i is
  Jπ(i) = lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) | x0 = i ]
If the policy is stationary, i.e. π = {µ, µ, ...}, write Jµ(i) instead of Jπ(i).
The optimal cost starting at state i is
  J*(i) = min_π lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) | x0 = i ]
An optimal policy π* satisfies Jπ*(i) = J*(i), ∀i.
Theorem
For the discounted cost problem we have:
(a) The value iteration algorithm
  Jk+1(i) = min_{u∈U(i)} E[g(i, u, w) + γ Jk(f(i, u, w))], i = 1, 2, ..., n
converges as k → ∞ to the optimal costs J*(i), i = 1, 2, ..., n, starting from arbitrary J0(i), i = 1, 2, ..., n.
(b) The optimal costs J*(i), i = 1, 2, ..., n, satisfy the Bellman equation
  J*(i) = min_{u∈U(i)} E[g(i, u, w) + γ J*(f(i, u, w))], i = 1, 2, ..., n
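Part (a) can be sketched directly for a finite MDP, with transition probability matrices replacing the expectation over w; a minimal version, assuming every control is admissible in every state (the toy transition probabilities and costs below are illustrative):

```python
import numpy as np

def value_iteration(P, g, gamma, tol=1e-10):
    """Value iteration for a finite discounted MDP.

    P[u, i, j] = transition probability i -> j under control u,
    g[u, i]    = expected stage cost in state i under control u.
    """
    J = np.zeros(P.shape[1])
    while True:
        J_new = (g + gamma * P @ J).min(axis=0)   # Bellman update
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new

# illustrative 2-state, 2-control MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
g = np.array([[1.0, 2.0],
              [1.5, 0.5]])
J = value_iteration(P, g, gamma=0.9)
print(J)   # (approximate) fixed point of the Bellman equation
```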
Theorem
(c) Given a stationary policy µ, the cost Jµ(i), i = 1, ..., n, satisfies
  Jµ(i) = E[g(i, µ(i), w) + γ Jµ(f(i, µ(i), w))], i = 1, ..., n
Starting from arbitrary J0(i), i = 1, ..., n, the iteration
  Jk+1(i) = E[g(i, µ(i), w) + γ Jk(f(i, µ(i), w))]
converges to Jµ(i), i = 1, ..., n.
(d) A stationary policy µ is optimal iff for every state i, µ(i) attains the minimum in the Bellman equation.
(e) The policy iteration algorithm
  µ^{k+1}(i) = argmin_{u∈U(i)} E[g(i, u, w) + γ Jµ^k(f(i, u, w))], i = 1, 2, ..., n
generates an improving sequence of policies and terminates (in finite time) with an optimal policy.
Proof: see Bertsekas.
Comments
Parts (a) and (e) provide algorithms for solving discounted costproblems (like the D.P. algorithm for finite horizon problems). Valueiteration (part (a)) requires less computation at every iteration, whilepolicy iteration (part (e)) is guaranteed to terminate in finite time.
In part (c),

    Jµ(i) = E[g(i, µ(i), w) + γ Jµ(f(i, µ(i), w))],   i = 1, ..., n

is a system of n linear equations, with which one can solve for Jµ(i), i = 1, ..., n.
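For instance, writing the expectation with a transition matrix P_mu and expected stage-cost vector g_mu (both made up here), the system becomes (I − γ P_mu) Jµ = g_mu and can be solved directly; for a 2-state sketch, Cramer's rule suffices:

```python
# Evaluating a fixed stationary policy by solving the n linear equations (sketch).
# P_mu[i][j] = P(j | i, mu(i)) and g_mu[i] = expected stage cost (made-up numbers).
GAMMA = 0.9
P_mu = [[0.8, 0.2],
        [0.3, 0.7]]
g_mu = [2.0, 0.5]

# Part (c) reads J_mu = g_mu + GAMMA * P_mu J_mu, i.e. (I - GAMMA*P_mu) J_mu = g_mu.
# Solve the 2x2 system by Cramer's rule.
a = 1 - GAMMA * P_mu[0][0]; b = -GAMMA * P_mu[0][1]
c = -GAMMA * P_mu[1][0];    d = 1 - GAMMA * P_mu[1][1]
det = a * d - b * c
J_mu = [(g_mu[0] * d - b * g_mu[1]) / det,
        (a * g_mu[1] - c * g_mu[0]) / det]
```

For larger n one would use a general linear solver instead of the hand-written 2×2 elimination.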
Policy Iteration
Starting with a stationary policy µ0, generate a sequence µ1, µ2, ... of stationary policies.
Given µk, perform the policy evaluation step, to compute Jµk(i), i = 1, 2, ..., n, using

    Jµk(i) = E[g(i, µk(i), w) + γ Jµk(f(i, µk(i), w))],   i = 1, ..., n
Given Jµk(·), perform the policy improvement step

    µ_{k+1}(i) = argmin_{u∈U(i)} E[g(i, u, w) + γ Jµk(f(i, u, w))]

with Jµk(f(i, u, w)) the "cost to go of the old policy" (cf. rollout algorithm).
Terminate when Jµk (i) = Jµk+1(i),∀i .
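The loop above can be sketched as follows; the 2-state MDP (P, g) is made up, and policy evaluation here uses the fixed-point iteration from part (c) rather than solving the linear system:

```python
# Policy iteration for a discounted-cost MDP (sketch; the 2-state data is made up).
GAMMA = 0.9
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[u][i][j] under control u = 0
     [[0.5, 0.5], [0.1, 0.9]]]   # under control u = 1
g = [[2.0, 0.5], [1.0, 3.0]]     # g[u][i]
n, m = 2, 2

def evaluate(mu, sweeps=2000):
    # Policy evaluation: iterate J <- g_mu + GAMMA * P_mu J (converges by part (c)).
    J = [0.0] * n
    for _ in range(sweeps):
        J = [g[mu[i]][i] + GAMMA * sum(P[mu[i]][i][j] * J[j] for j in range(n))
             for i in range(n)]
    return J

def improve(J):
    # Policy improvement: minimizing control w.r.t. the cost to go of the old policy.
    return [min(range(m),
                key=lambda u, i=i: g[u][i] + GAMMA * sum(P[u][i][j] * J[j]
                                                         for j in range(n)))
            for i in range(n)]

mu = [0] * n
while True:
    J_mu = evaluate(mu)
    mu_next = improve(J_mu)
    if mu_next == mu:        # terminate when the policy stops changing
        break
    mu = mu_next
```

At termination the evaluated cost satisfies the Bellman equation, so the final µ is optimal, per part (d).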
Example
A manufacturer at each time period:
Receives an order with probability p, no order with probability 1− p.
May process all unfilled orders at cost K > 0, or process no orders.
Cost per unfilled order at each time period is C > 0.
max. no. unfilled orders is n.
Find a processing policy that minimizes the discounted cost, with discount factor γ.
Example
Let state = no. unfilled orders at the start of each period (∈ {0, 1, ..., n}).

Bellman equation: For states i = 0, 1, ..., n − 1, we can either process orders or not, so the Bellman equation is

    J*(i) = min{ K + γp J*(1) + γ(1−p) J*(0),  Ci + γp J*(i+1) + γ(1−p) J*(i) }

where the first term is the cost of processing (pay K; the next state is 1 if a new order is received, 0 if not) and the second term is the cost of not processing (pay Ci; the next state is i + 1 if a new order is received, i if not).

For state i = n, all orders must be processed, so the Bellman equation is

    J*(n) = K + γp J*(1) + γ(1−p) J*(0)
Can show that the optimal policy is a threshold policy: process orders iff i ≥ m*, where m* is a threshold (see Bertsekas).
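This Bellman equation can be checked numerically by value iteration; the parameter values below (K, C, p, γ, n) are made up for illustration, and the computed policy indeed comes out as a threshold rule:

```python
# Value iteration for the order-processing example (illustrative parameters).
K, C, p, gamma, n = 5.0, 1.0, 0.5, 0.9, 10

J = [0.0] * (n + 1)
for _ in range(2000):
    Jn = [0.0] * (n + 1)
    for i in range(n + 1):
        # Process all orders: pay K, backlog resets, a new order arrives w.p. p.
        process = K + gamma * (p * J[1] + (1 - p) * J[0])
        if i < n:
            # Don't process: pay C per unfilled order, backlog grows w.p. p.
            dont = C * i + gamma * (p * J[i + 1] + (1 - p) * J[i])
            Jn[i] = min(process, dont)
        else:
            Jn[i] = process          # at i = n, all orders must be processed
    J = Jn

# Recover the optimal policy: process iff the processing term attains the min.
policy = [K + gamma * (p * J[1] + (1 - p) * J[0])
          <= C * i + gamma * (p * J[i + 1] + (1 - p) * J[i])
          for i in range(n)] + [True]
m_star = policy.index(True)          # threshold state m*
```

With these numbers the policy is monotone in i, i.e. process iff i ≥ m*, matching the threshold structure.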
Average Cost Problems
    min_{µk} lim_{N→∞} (1/N) E[ Σ_{k=0}^{N−1} g(x_k, µ_k(x_k), w_k) ]
In most problems of this type, the average cost per stage of a policy is independent of the initial state.

Expresses costs incurred in the long run; costs incurred in the early stages do not matter.

Analysis is harder than for discounted cost problems (won't cover here).

Assume
  - the number of states is finite, taking values 1, 2, ..., n.
  - the number of possible controls is finite.

Also assume that there is some state t such that, for all initial states and policies, t is visited infinitely often with probability 1.
Theorem
For the average cost problem, we have:
(a) The optimal average cost per stage λ* is the same for all initial states, and there exists a vector h* = (h*(1), h*(2), ..., h*(n)) satisfying the Bellman equation:

    λ* + h*(i) = min_{u∈U(i)} E[g(i, u, w) + h*(f(i, u, w))],   i = 1, ..., n

(h* is unique if we fix h*(t) = 0.) If µ(i) attains the minimum in the Bellman equation for all i, then the stationary policy µ is optimal.
(b) If λ and h satisfy the Bellman equation, then λ is the optimal average cost per stage for each initial state.
Theorem
(c) Given a stationary policy µ with average cost per stage λµ, there exists a vector hµ = (hµ(1), ..., hµ(n)) such that

    λµ + hµ(i) = E[g(i, µ(i), w) + hµ(f(i, µ(i), w))],   i = 1, ..., n

(hµ is unique if we fix hµ(t) = 0.)
Proof: See Bertsekas
Comment: h is also called the differential cost vector.
Example
A manufacturer at each time period:
Receives an order with probability p, no order with probability 1− p.
May process all unfilled orders at cost K > 0, or process no orders.
Cost per unfilled order at each time period is C > 0.
max. no. unfilled orders is n.
Find processing policy that minimizes the average cost.
Example
State = no. unfilled orders at the start of each period.
State 0 is the special state t here (it will be visited infinitely often).
Bellman equation: For states 0, 1, ..., n − 1, the Bellman equation is

    λ* + h*(i) = min{ K + p h*(1) + (1−p) h*(0),  Ci + p h*(i+1) + (1−p) h*(i) }
For state n, the Bellman equation is

    λ* + h*(n) = K + p h*(1) + (1−p) h*(0)
Optimal policy: Process orders if

    K + p h*(1) + (1−p) h*(0) ≤ Ci + p h*(i+1) + (1−p) h*(i)
Can again show that a threshold policy is optimal, where the value of the threshold may differ from the value of the threshold in the discounted cost problem.
Algorithms For Average Cost Problems
Value iteration:
Starting from any J0, compute

    J_{k+1}(i) = min_{u∈U(i)} E[g(i, u, w) + J_k(f(i, u, w))],   i = 1, ..., n

We have

    lim_{k→∞} J_k(i)/k = λ*,   ∀i
Drawbacks of value iteration:

Often components of Jk will diverge to ∞ or −∞, so calculating lim_{k→∞} J_k(i)/k may be tricky.

Doesn't compute a differential cost vector h*.
Algorithms For Average Cost Problems
Relative value iteration:
Subtract a constant (dependent on k) from all components of Jk, so that the difference hk is bounded, e.g.

    h_k(i) = J_k(i) − J_k(s),   i = 1, ..., n

where s is some fixed state. The relative value iteration algorithm is then:

    h_{k+1}(i) = min_{u∈U(i)} E[g(i, u, w) + h_k(f(i, u, w))] − min_{u∈U(s)} E[g(s, u, w) + h_k(f(s, u, w))],   i = 1, ..., n
Can show that hk → h∗ as k →∞
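A sketch of relative value iteration on a small average-cost MDP; the 2-state transition probabilities P[u][i][j] and stage costs g[u][i] below are made up, and s is the fixed reference state:

```python
# Relative value iteration for an average-cost MDP (sketch; data is made up).
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[u][i][j] under control u = 0
     [[0.5, 0.5], [0.1, 0.9]]]   # under control u = 1
g = [[2.0, 0.5], [1.0, 3.0]]     # g[u][i]
n, m, s = 2, 2, 0                # s = fixed reference state

def T(h, i):
    # (Th)(i) = min_u E[g(i, u, w) + h(f(i, u, w))]
    return min(g[u][i] + sum(P[u][i][j] * h[j] for j in range(n))
               for u in range(m))

h = [0.0] * n
for _ in range(2000):
    Ts = T(h, s)
    h = [T(h, i) - Ts for i in range(n)]   # keeps h(s) = 0 and h bounded

lam = T(h, s)    # at convergence lam + h(i) = (Th)(i), so lam = lambda*
```

Subtracting T applied at the reference state pins h(s) = 0, so the iterates stay bounded even though plain value iteration would diverge linearly in k.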
Algorithms For Average Cost Problems
Policy iteration:
Given µk, perform the policy evaluation step to compute λk and hk, using the equations:

    λk + h_k(i) = E[g(i, µk(i), w) + h_k(f(i, µk(i), w))],   i = 1, ..., n
    h_k(t) = 0 for some state t which is visited infinitely often.
Given λk and hk, perform the policy improvement step:

    µ_{k+1}(i) = argmin_{u∈U(i)} E[g(i, u, w) + h_k(f(i, u, w))],   i = 1, ..., n
Terminate when λk+1 = λk and hk+1(i) = hk(i), i = 1, . . . , n.
Policy iteration can be shown to terminate in finite time.
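The evaluation step solves the n equations above together with h(t) = 0. For a made-up 2-state problem the elimination can be done by hand, giving this policy iteration sketch (same hypothetical P and g as before):

```python
# Policy iteration for a 2-state average-cost MDP (sketch; data is made up).
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[u][i][j] under control u = 0
     [[0.5, 0.5], [0.1, 0.9]]]   # under control u = 1
g = [[2.0, 0.5], [1.0, 3.0]]     # g[u][i]
n, m, t = 2, 2, 0                # reference state t with h(t) = 0

def evaluate(mu):
    # Solve lam + h(i) = g(i, mu(i)) + sum_j P(j|i,mu(i)) h(j), with h(0) = 0.
    # For n = 2 the unknowns are lam and h(1); eliminating by hand:
    g0, g1 = g[mu[0]][0], g[mu[1]][1]
    p01, p11 = P[mu[0]][0][1], P[mu[1]][1][1]
    h1 = (g1 - g0) / (1 + p01 - p11)
    return g0 + p01 * h1, [0.0, h1]          # (lam, h)

def improve(h):
    return [min(range(m),
                key=lambda u, i=i: g[u][i] + sum(P[u][i][j] * h[j]
                                                 for j in range(n)))
            for i in range(n)]

mu = [0, 0]
for _ in range(10):              # finitely many policies, so this suffices
    lam, h = evaluate(mu)
    mu_next = improve(h)
    if mu_next == mu:            # lam and h now satisfy Bellman's equation
        break
    mu = mu_next
```

At termination λk and hk satisfy the average-cost Bellman equation, so by part (a) of the theorem the final policy is optimal.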
Outline
7 Introduction to Reinforcement Learning
Introduction to Reinforcement Learning
As in the finite horizon case, want to consider suboptimal methods for solving infinite horizon problems.

Also studied in machine learning as reinforcement learning.

Many different methods, e.g. Q-learning, TD/SARSA(λ), REINFORCE, ...
  - References: Sutton & Barto, "Reinforcement Learning"; Bertsekas Vol. II
Introduction to Reinforcement Learning
Slight change of notation:

State xk → state sk

Control uk → action ak

Cost function g(·, ·, ·) → reward function g(·, ·, ·)

Cost minimization

    min_{µk} lim_{N→∞} E[ Σ_{k=0}^{N−1} γ^k g(x_k, µ_k(x_k), w_k) ]

→ reward maximization

    max_{ak} lim_{N→∞} E[ Σ_{k=0}^{N−1} γ^k g(s_k, a_k(s_k), w_k) ]
Q-Learning for Discounted Problems
Bellman equation

    J*(s) = max_a E[g(s, a, w) + γ J*(f(s, a, w))],   ∀s

J*(s) is the optimal expected future reward when in state s.

Introduce now the Q-Bellman equation

    Q*(s, a) = E[g(s, a, w) + γ max_{a′} Q*(s′, a′)],   ∀(s, a)

where s′ := f(s, a, w).
  - The Q-factor Q(s, a) is the expected future reward when in state s and taking action a
  - Q* are the optimal Q-factors
Q-Learning for Discounted Problems
Can also solve the Q-Bellman equation using value iteration or policy iteration.

Given Q*(s, a), the optimal policy can be computed as

    a*(s) = argmax_a Q*(s, a)
Using Q*(s, a) gives the same policy as using J*(s), though it requires more storage.

However, one advantage is that Q*(s, a) can be found approximately using e.g. Q-learning.
Q-Learning for Discounted Problems
Q-learning algorithm. Repeat:
Generate (sk, ak) using any probabilistic mechanism such that all state-action pairs (s, a) are chosen infinitely often.

Given (sk, ak), update Q(sk, ak) as:

    Q_{k+1}(sk, ak) = Q_k(sk, ak) + αk ( r + γ max_{a′} Q_k(s′, a′) − Q_k(sk, ak) )

where r = g(sk, ak, wk) is the sampled reward, s′ = f(sk, ak, wk) is the sampled next state when the current state is sk and action ak is applied, and {αk} is a sequence converging to 0.

Leave all other Q-factors unchanged.
Q-Learning for Discounted Problems
Q-learning algorithm converges to the optimal Q-factors Q*(s, a) provided all pairs (s, a) are chosen infinitely often, and the sequence {αk} satisfies

    αk > 0,   Σ_{k=0}^∞ αk = ∞,   Σ_{k=0}^∞ αk² < ∞

  - e.g. αk = 1/k satisfies this condition
In Sutton & Barto, {(sk, ak)} is generated according to:

    s_{k+1} := s′

    a_{k+1} = random a                          w.p. ε
              argmax_a Q_{k+1}(s_{k+1}, a)      w.p. 1 − ε

for some ε > 0.
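Putting the update rule and the ε-greedy generation together, here is a Q-learning sketch on a small reward-maximization MDP; the 2-state model (next-state distributions P[s][a] and rewards g[s][a]) and all parameter values are made up for illustration:

```python
import random

# Q-learning with epsilon-greedy exploration (sketch; the MDP data is made up).
random.seed(0)
P = [[[0.8, 0.2], [0.2, 0.8]],   # state 0: next-state dist. for actions 0, 1
     [[0.7, 0.3], [0.1, 0.9]]]   # state 1: next-state dist. for actions 0, 1
g = [[2.0, 0.0], [0.0, 3.0]]     # g[s][a] = sampled reward (deterministic here)
GAMMA, EPS = 0.5, 0.1
n, m = 2, 2

Q = [[0.0] * m for _ in range(n)]
visits = [[0] * m for _ in range(n)]
s = 0
for _ in range(300_000):
    # epsilon-greedy selection: ensures all (s, a) are chosen infinitely often
    if random.random() < EPS:
        a = random.randrange(m)
    else:
        a = max(range(m), key=lambda b: Q[s][b])
    s2 = random.choices(range(n), weights=P[s][a])[0]   # sampled next state
    r = g[s][a]                                         # sampled reward
    visits[s][a] += 1
    alpha = 1.0 / visits[s][a]        # alpha_k -> 0, satisfies the sum conditions
    Q[s][a] += alpha * (r + GAMMA * max(Q[s2]) - Q[s][a])
    s = s2
```

After enough iterations the learned Q-factors approach the solution of the Q-Bellman equation, and the greedy policy argmax_a Q(s, a) matches the optimal one.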
Function Approximation and Deep Reinforcement Learning
For large problems:
  - Too many (state, action) pairs to store in memory
  - Too slow to learn the value of each Q*(s, a) individually

Function approximation:
  - Regard Q*(s, a) as a function of s and a. Approximate Q*(s, a) by another function Q(s, a, θ) parameterized by a set of weights θ
  - Learn the weights θ instead of the entire set of values Q*(s, a)

Deep reinforcement learning: when the weights θ are learnt using a deep neural network, see https://deepmind.com/blog/deep-reinforcement-learning

Spectacular recent advances in AI using deep reinforcement learning, e.g. AlphaGo, AlphaZero.
Deep Reinforcement Learning - Further Reading
Overview
  - https://deepmind.com/blog/deep-reinforcement-learning
  - http://www0.cs.ucl.ac.uk/staff/d.silver/web/Talks.html

Deep Q-Network (DQN) algorithm
  - https://arxiv.org/pdf/1312.5602.pdf
  - https://keon.io/deep-q-learning

More Advanced
  - https://arxiv.org/pdf/1509.02971.pdf
  - https://medium.com/tensorflow/deep-reinforcement-learning-playing-cartpole-through-asynchronous-advantage-actor-critic-a3c-7eab2eea5296