Dynamic Programming and Stochastic Control
Dr. Alex Leong
Department of Electrical Engineering (EIM-E), Paderborn University, Germany
Dr. Alex Leong ([email protected]) DP and Stochastic Control Paderborn University 1 / 158
Outline
1 Introduction
Introduction
What is dynamic programming (DP)?
Method for solving multi-stage decision problems (sequential decision making).
There is often some randomness to what happens in future.
Optimize set of decisions to achieve a good overall outcome.
Richard Bellman popularized DP in the 1950s
Examples
1) Inventory control
A store sells a product, e.g. ice cream.
Order supplies once a week.
Sales during the week are “random”.
How much supply should the store get to maximize expected profit over summer?
  - Order too little: can't meet demand.
  - Order too much: storage/refrigeration cost.
Examples
2) Parts replacement e.g. bus engine.
At the start of each month, decide whether the engine on a bus should be replaced, to maximize expected profit.
If replace, profit = earnings - replacement cost - maintenance.
If don’t replace, profit = earnings - maintenance.
Earnings will decrease if engine breaks down.
P(Breakdown) is age dependent.
Examples
3) Formula 1 engines, replace or not?
20 races, 4 engines (in 2017)
Decide whether to replace the engine at the start of each race, to maximize the chance of winning the championship.
Examples

4) Queueing (see Figure 1)
Packets arrive at queues 1 and 2.
If both queues transmit at same time, have collision.
If collision, retransmit at next time with a certain probability.
Choose retransmission probabilities to maximize throughput.
Figure 1: Queueing
Examples
5) LQR (Linear Quadratic Regulator)

Linear system: x_{k+1} = A x_k + B u_k (deterministic problem)
Assume knowledge of xk at time k (Perfect state info)
Choose sequence of u_k to

\min_{u_0, u_1, \ldots, u_{N-1}} \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N

N = number of stages = horizon. N finite → finite horizon.
Examples
6) x_{k+1} = A x_k + B u_k + w_k

w_k = random noise.
Assume x_k known (perfect state info).
Choose sequence of u_k to

\min_{u_0, u_1, \ldots, u_{N-1}} E\left[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right]
Examples
7) LQG (Linear Quadratic Gaussian) Control
x_{k+1} = A x_k + B u_k + w_k
y_k = C x_k + v_k

v_k, w_k Gaussian noise.
Case of imperfect state info.
Based on measurements y_k, choose u_k to

\min_{u_0, u_1, \ldots, u_{N-1}} E\left[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right]
Examples
8) Infinite horizon
\min_{u_0, u_1, \ldots} \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right]
Note: Here we divide by N, otherwise summation often blows up.
Examples

9) Shortest paths (see Figure 2)
Find shortest path from A to stage D (Deterministic Problem).
Can solve using the Viterbi algorithm (1967)
Can be regarded as a special case of (forward) DP.
Applications:
  - decoding of convolutional codes (communications)
  - channel equalization (communications)
  - estimation of hidden Markov models (signal processing)
Figure 2: Shortest paths problem
Outline
2 The Dynamic Programming Principle and Dynamic Programming Algorithm
Basic Structure of Dynamic Programming Problem
Dynamic Programming Principle of Optimality
Dynamic Programming Algorithm
Shortest Path Problems
Basic structure of stochastic DP problem
Two ingredients: a discrete-time system and a cost function.
1. Discrete time system
x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, \ldots, N-1 (or k = 1, 2, \ldots, N)
k is the time index.
x_k is the state at time k; it summarizes past information that is relevant for future optimization.
u_k is the control/decision/action at time k; it lies in a set U_k(x_k) which may depend on k and x_k.
w_k is a random disturbance (noise), with a probability distribution P(· | k, x_k, u_k) which may depend on k, x_k, u_k.
Basic structure of stochastic DP problem
x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, \ldots, N-1
N is horizon, or number of times control is applied.
f_k is the function that describes how the system evolves over time.
Examples:
  - f_k = A x_k + B u_k + w_k (linear system)
  - f_k = x_k u_k + w_k (non-linear)
  - f_k = cos x_k + w_k sin u_k (non-linear)
Basic structure of stochastic DP problem
2. Cost function, which is additive over time:

E\left[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) \right]
Expectation is used because of the random w_k.
g_k is the function that represents the cost at time k.
Examples:
  - g_k = x_k + u_k
  - g_k = x_k^2 + C u_k^2, where C is a constant.
g_N(x_N) is the terminal cost.
Basic structure of stochastic DP problem
Objective: Minimize the cost function over the controls
u_0 = \mu_0(x_0), u_1 = \mu_1(x_1), \ldots, u_{N-1} = \mu_{N-1}(x_{N-1})

Choice of u_k depends on x_k.
Optimization is over policies: rules/functions \mu_k for generating u_k for every possible value of x_k.
Expected cost of policy \pi = (\mu_0, \mu_1, \ldots, \mu_{N-1}) starting at x_0 is

J_\pi(x_0) = E\left[ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right]

Optimal policy: \pi^* = \arg\min_\pi J_\pi(x_0)
Optimal cost starting at x_0: J^*(x_0) = \min_\pi J_\pi(x_0)
Examples
1) Inventory example
x_k = amount of stock at time k.
u_k = stock ordered at time k.
w_k = demand at time k, with some probability distribution, e.g. uniform.
System: x_{k+1} = x_k + u_k - w_k (= f_k(x_k, u_k, w_k))
x_k can be negative with this model.
Alternative model: x_{k+1} = max(0, x_k + u_k - w_k).
Cost function at time k: g_k(x_k, u_k, w_k) = r(x_k) + C u_k
r(x_k) is the penalty for holding excess stock.
C is the cost per item.
Examples
1) Inventory example (cont.)
Terminal cost: R(x_N) is the penalty for having excess stock at the end.
Cost function: E\left[ \sum_{k=0}^{N-1} (r(x_k) + C u_k) + R(x_N) \right]
Amount u_k to order can depend on the inventory level x_k.
Can have constraints on u_k, e.g. x_k + u_k ≤ max. storage.
Optimization over policies: find the rule which tells you how much to order for every possible stock level x_k.
Examples

2) Example 6 of previous section
System

x_{k+1} = \underbrace{A x_k + B u_k + w_k}_{f_k}

Cost function

E\left[ \sum_{k=0}^{N-1} \underbrace{(x_k^T Q x_k + u_k^T R u_k)}_{g_k} + \underbrace{x_N^T Q x_N}_{g_N(x_N)} \right]
Objective: Determine u_k = \mu_k(x_k), k = 0, 1, \ldots, N-1, to minimize the cost function.
Solution turns out to be u_k^* = L_k x_k for some matrices L_k. (Derived in a later lecture.)
Examples
3) Shortest paths (see Figure 3)
Figure 3: Shortest path problem
x_k = which node we're in at stage k.
u_k = which path we take to get to stage k+1.
w_k = zero.
Cost function = sum of the values along the paths we choose.
Open loop vs. Closed loop
Open loop: Controls (u0, u1, . . . , uN−1) chosen at beginning (time 0).
Closed loop: Policy (\mu_0, \mu_1, \ldots, \mu_{N-1}) chosen, where at time k, \mu_k(x_k) = u_k can depend on x_k.
Can adapt to conditions.
e.g. Inventory problem. If current stock level:
  - x_k high → order less.
  - x_k low → order more.
Closed loop is always at least as good as open loop.
For deterministic problems, open loop is as good as closed loop:
  - can predict exactly the future states given the initial state and sequence of controls.
For stochastic problems, generally should use closed loop.
D.P. Principle of Optimality

Intuition
Figure 4: Shortest path problem
Consider the shortest path problem in Figure 4.
Shortest path from A to F shown in red: A→C→D→F
Shortest path from C to F: C→D→F.
  - Subpath of shortest path from A→F.
Shortest path from D to F: D→F.
  - Subpath of shortest path from A→F.
D.P. Principle of Optimality
Observation
Shortest path from A to F contains the shortest paths from intermediate nodes to F.
Why?
Suppose there is a shorter path from C to F which is not C→D→F.
Then we can construct a new path A→C→…→F (new shortest path) which is shorter than A→C→D→F
⇒ contradicts A→C→D→F being the shortest.
D.P. Principle of Optimality
Formal statement:
Basic problem

\min_\pi E\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}

Let \pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} be the optimal policy. Consider the "tail subproblem"

\min_{\mu_i, \mu_{i+1}, \ldots, \mu_{N-1}} E\left\{ \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\},

where we are at state x_i at time i and we wish to minimize the "cost to go" from time i to time N.
The D.P. principle of optimality then says that \{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\} is optimal for the tail subproblem.
D.P. Principle of Optimality
"Proof": If \{\mu_i, \ldots, \mu_{N-1}\} is a better policy for the tail subproblem, then \{\mu_0^*, \mu_1^*, \ldots, \mu_{i-1}^*, \mu_i, \ldots, \mu_{N-1}\} would be a better policy for the original problem
⇒ contradiction of \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} being optimal.
How can we make use of the D.P. principle?
Idea: Construct an optimal policy in stages.
  - Solve the tail subproblem involving the last stage, to obtain \mu_{N-1}^*.
  - Solve the tail subproblem involving the last two stages, making use of \mu_{N-1}^*, to obtain \mu_{N-2}^*.
  - Solve the tail subproblem involving the last three stages, making use of \mu_{N-2}^*, \mu_{N-1}^*, to obtain \mu_{N-3}^*.
  - ...
  - Solve the tail subproblem involving the last N stages, making use of \mu_1^*, \ldots, \mu_{N-1}^*, to obtain \mu_0^*.
D.P. Algorithm
Basic problem:

\min_\pi E\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}

D.P. Algorithm: For each possible x_k, compute:

J_N(x_N) = g_N(x_N),
J_k(x_k) = \min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \},
for k = N-1, N-2, \ldots, 1, 0

Theorem:
1. Optimal cost J^*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm.
2. Let \mu_k^*(\cdot) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. \mu_k^*(x_k) = u_k^*. Then \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} is the optimal policy for the basic problem.
Proof: See later
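For finite state, control, and disturbance spaces, the algorithm above can be written out directly. The helper below is a hypothetical sketch (not from the lecture); the caller supplies the horizon N, the state list, a function returning the admissible control set U_k(x_k), the disturbance distribution as a list of (w, probability) pairs (assumed i.i.d. here for simplicity), the system function f, the stage cost g, and the terminal cost gN.

```python
def solve_dp(N, states, controls, dist, f, g, gN):
    """Backward D.P.: returns cost-to-go tables J[k][x] and a policy mu[k][x].

    dist: list of (w, prob) pairs; f(x, u, w): next state;
    g(k, x, u, w): stage cost; gN(x): terminal cost.
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)                    # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):         # k = N-1, ..., 1, 0
        for x in states:
            best_u, best_c = None, float("inf")
            for u in controls(k, x):       # U_k(x_k): admissible controls
                c = sum(p * (g(k, x, u, w) + J[k + 1][f(x, u, w)])
                        for w, p in dist)  # expectation over w_k
                if c < best_c:
                    best_u, best_c = u, c
            J[k][x], mu[k][x] = best_c, best_u
    return J, mu
```

The inventory example later in this section is exactly this recursion with a particular choice of f, g, and demand distribution.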
D.P. Algorithm
Comments:
D.P. algorithm needs to be run for all possible states xk .
Solves all tail subproblems (don't know which subproblem you need at the start).
Can be computationally expensive if the number of states/controls is large.
Often done on computer.
Suboptimal methods can reduce complexity.
Inventory Example
xk = level of stock at time k.
uk = amount ordered at time k .
wk = demand at time k .
x_{k+1} = max(0, x_k + u_k - w_k) = f_k(x_k, u_k, w_k); excess demand is lost.
Storage constraint: x_k + u_k ≤ 2

Cost at time k = \underbrace{\text{purchasing cost}}_{\text{cost per item} = 1\text{ euro}} + \underbrace{\text{storage cost}}_{(x_k + u_k - w_k)^2}
= u_k + (x_k + u_k - w_k)^2 = g_k(x_k, u_k, w_k)
Terminal cost gN(xN) = 0.
Probability distribution of wk :
P(w_k = 0) = 0.1, P(w_k = 1) = 0.7, P(w_k = 2) = 0.2
Inventory Example
Problem: Find the optimal policy for horizon N = 3, i.e.
\min_{(\mu_0, \mu_1, \mu_2)} E\left\{ \sum_{k=0}^{2} g_k(x_k, \mu_k(x_k), w_k) \right\}

Apply D.P. algorithm:
J_3(x_3) = g_3(x_3) = 0
J_k(x_k) = \min_{u_k \in U_k} E\{ u_k + (x_k + u_k - w_k)^2 + J_{k+1}(max(0, x_k + u_k - w_k)) \}, k = 2, 1, 0

Question: What values can x_k take?
Inventory Example
Period 2:
Compute J_2(x_2) for all possible values of x_2.

J_2(0) = \min_{u_2 \in \{0,1,2\}} E\{ u_2 + (0 + u_2 - w_2)^2 + \underbrace{J_3(x_3)}_{= 0 \text{ for all } x_3} \}
       = \min_{u_2 \in \{0,1,2\}} u_2 + E\{ (u_2 - w_2)^2 \}
       = \min_{u_2 \in \{0,1,2\}} u_2 + (u_2 - 0)^2 (0.1) + (u_2 - 1)^2 (0.7) + (u_2 - 2)^2 (0.2)

If u_2 = 0: 0.7 × 1 + 0.2 × 4 = 1.5
If u_2 = 1: 1 + 0.1 × 1 + 0.7 × 0 + 0.2 × 1 = 1.3
If u_2 = 2: 2 + 0.1 × 4 + 0.7 × 1 + 0.2 × 0 = 3.1
⇒ J_2(0) = 1.3 and \mu_2^*(0) = 1
Inventory Example
J_2(1) = \min_{u_2 \in \{0,1\}} u_2 + (1 + u_2)^2 (0.1) + (1 + u_2 - 1)^2 (0.7) + (1 + u_2 - 2)^2 (0.2)

If u_2 = 0: 0.3 (check this!)
If u_2 = 1: 2.1
⇒ J_2(1) = 0.3 and \mu_2^*(1) = 0

J_2(2) = \min_{u_2 \in \{0\}} E\{ u_2 + (2 + u_2 - w_2)^2 \} = \cdots = 1.1
⇒ J_2(2) = 1.1 and \mu_2^*(2) = 0.
Inventory Example
Period 1:
Compute J_1(x_1) for all possible values of x_1.

J_1(0) = \min_{u_1 \in \{0,1,2\}} E\{ u_1 + (u_1 - w_1)^2 + J_2(max(0, 0 + u_1 - w_1)) \}
       = \min_{u_1 \in \{0,1,2\}} u_1 + (u_1^2 + J_2(max(0, u_1)))(0.1) + ((u_1 - 1)^2 + J_2(max(0, u_1 - 1)))(0.7) + ((u_1 - 2)^2 + J_2(max(0, u_1 - 2)))(0.2)

u_1 = 0: (0 + J_2(0))(0.1) + (1 + J_2(0))(0.7) + (4 + J_2(0))(0.2) = 2.8
u_1 = 1: 1 + (1 + J_2(1))(0.1) + (0 + J_2(0))(0.7) + (1 + J_2(0))(0.2) = 2.5
u_1 = 2: 2 + (4 + J_2(2))(0.1) + (1 + J_2(1))(0.7) + (0 + J_2(0))(0.2) = 3.68
⇒ J_1(0) = 2.5 and \mu_1^*(0) = 1

(Here the J_2 values are taken from the previous stage.)
Inventory Example
J_1(1) = \min_{u_1 \in \{0,1\}} E\{ u_1 + (1 + u_1 - w_1)^2 + J_2(max(0, 1 + u_1 - w_1)) \}

u_1 = 0: 1.5 (check!)
u_1 = 1: 2.68
⇒ J_1(1) = 1.5 and \mu_1^*(1) = 0

J_1(2) = 1.68, \mu_1^*(2) = 0 (check!)

Period 0:
Compute J_0(x_0) for all possible x_0 (tutorial problem).
Solution: J_0(0) = 3.7, J_0(1) = 2.7, J_0(2) = 2.818
\mu_0^*(0) = 1, \mu_0^*(1) = 0, \mu_0^*(2) = 0
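The whole three-period computation can be checked mechanically. The sketch below implements the backward recursion for exactly the problem data above (demand distribution, storage constraint x_k + u_k ≤ 2, zero terminal cost):

```python
# Backward D.P. for the inventory example: N = 3, states x in {0, 1, 2},
# storage constraint x + u <= 2, demand P(w=0)=0.1, P(w=1)=0.7, P(w=2)=0.2.
DEMAND = [(0, 0.1), (1, 0.7), (2, 0.2)]
N = 3
J = {N: {x: 0.0 for x in range(3)}}        # terminal cost g_3(x_3) = 0
mu = {}
for k in range(N - 1, -1, -1):             # k = 2, 1, 0
    J[k], mu[k] = {}, {}
    for x in range(3):
        costs = {}
        for u in range(3 - x):             # admissible orders: x + u <= 2
            costs[u] = sum(p * (u + (x + u - w) ** 2
                                + J[k + 1][max(0, x + u - w)])
                           for w, p in DEMAND)
        mu[k][x] = min(costs, key=costs.get)
        J[k][x] = costs[mu[k][x]]
```

Running this reproduces the cost-to-go values and policy derived above, stage by stage.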
Scheduling Example
Example: Scheduling problem (deterministic problem)
Four operations need to be performed: A, B, C, D.
B has to occur after A, D has to occur after C.
Costs: c_{AB} = 2, c_{AC} = 3, c_{AD} = 4, c_{BC} = 3, c_{BD} = 1, c_{CA} = 4, c_{CB} = 4, c_{CD} = 6, c_{DA} = 3, c_{DB} = 3.
Startup costs: S_A = 5, S_C = 3.
What is the optimal order?
Scheduling Example

Minimum cost to go shown in red.

Figure 5: Scheduling Problem
Scheduling Example

Use D.P. algorithm.
Let state = set of operations already performed; see Figure 5 ("Scheduling").
No terminal costs for this problem.
Tail subproblems of length 1.
Easy, only one choice at each state, e.g. if state = ACD, the next operation has to be B.
Tail subproblems of length 2.
State AB , only one choice, next operation is C.
State AC , if next operation is B: cost = 4 + 1 = 5.
State AC , if next operation is D: cost = 6 + 3 = 9. ⇒ Choose B.
State CA , if next operation is B: cost = 2 + 1 = 3.
State CA , if next operation is D: cost = 4 + 3 = 7. ⇒ Choose B.
State CD , only one choice, next operation is A.
Scheduling Example
Tail subproblems of length 3.
State A , if next operation is B: cost = 2 + 9 = 11.
State A , if next operation is C: cost = 3 + 5 = 8.⇒ Choose C
State C , if next operation is A: cost = 4 + 3 = 7.
State C , if next operation is D: cost = 6 + 5 = 11. ⇒ Choose A.
Original problem of length 4.
If start with A: cost = 5 + 8 = 13
If start with C: cost = 3 + 7 = 10 ⇒ Choose C
Therefore, the optimal sequence = CABD , and the optimal cost = 10.
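Since the problem is small and deterministic, the D.P. answer can be cross-checked by brute-force enumeration of the feasible orders (startup cost of the first operation plus the transition costs, subject to B after A and D after C):

```python
from itertools import permutations

# Brute-force check of the scheduling example.
trans = {'AB': 2, 'AC': 3, 'AD': 4, 'BC': 3, 'BD': 1,
         'CA': 4, 'CB': 4, 'CD': 6, 'DA': 3, 'DB': 3}
startup = {'A': 5, 'C': 3}

def cost(order):
    c = startup[order[0]]                  # precedence forces A or C first
    for i, j in zip(order, order[1:]):     # consecutive operation pairs
        c += trans[i + j]
    return c

feasible = [''.join(p) for p in permutations('ABCD')
            if p.index('A') < p.index('B')      # B after A
            and p.index('C') < p.index('D')]    # D after C
best = min(feasible, key=cost)
```

Enumeration confirms CABD with cost 10, agreeing with the staged D.P. computation above.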
Proof that D.P. Algorithm gives Optimal Solution

Basic problem:

\min_\pi E\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}

D.P. Algorithm: For each possible x_k, compute:

J_N(x_N) = g_N(x_N),
J_k(x_k) = \min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \},
for k = N-1, N-2, \ldots, 1, 0

Theorem:
1. Optimal cost J^*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm.
2. Let \mu_k^*(\cdot) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. \mu_k^*(x_k) = u_k^*. Then \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} is the optimal policy for the basic problem.
Proof that D.P. Algorithm gives Optimal Solution
Notation:

Given policy \pi = (\mu_0, \mu_1, \ldots, \mu_{N-1}),
let \pi_k = (\mu_k, \mu_{k+1}, \ldots, \mu_{N-1}) = "tail policy"
and J_k^*(x_k) = \min_{\pi_k} E\{ \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \} be the optimal cost for the tail subproblem.
Let J_k(x_k) = quantity computed by the D.P. algorithm.
Want to show that J_k^*(x_k) = J_k(x_k), for all x_k, k.
Proof is by mathematical induction.

Initial step (k = N):
By definition of J_k^*(x_k), J_N^*(x_N) = g_N(x_N).
By definition of the D.P. algorithm, J_N(x_N) = g_N(x_N).
⇒ J_N^*(x_N) = J_N(x_N)
Proof that D.P. Algorithm gives Optimal Solution
Induction step:

Assume J_l^*(x_l) = J_l(x_l) for l = N, N-1, \ldots, k+1.
Want to show that J_k^*(x_k) = J_k(x_k).
From the definition of J_k^*(x_k),

J_k^*(x_k) = \min_{\pi_k} E\left\{ \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}

= \min_{(\mu_k, \pi_{k+1})} E\left\{ g_k(x_k, \mu_k(x_k), w_k) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}

= \min_{\mu_k} E\left\{ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi_{k+1}} E\left[ \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right] \right\}

by the D.P. principle (optimize the tail subproblem, then \mu_k)
Proof that D.P. Algorithm gives Optimal Solution
= \min_{\mu_k} E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*(f_k(x_k, \mu_k(x_k), w_k)) \}   by definition of J_{k+1}^*(x_{k+1})

= \min_{\mu_k} E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k(x_k), w_k)) \}   by the induction hypothesis

= \min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \}   using the fact that \min_\mu F(x, \mu(x)) = \min_{u \in U(x)} F(x, u)

= J_k(x_k)   from the D.P. algorithm equations

So J_k^*(x_k) = J_k(x_k), and \mu_k^*(x_k) = u_k^* is the optimal policy.
By induction, this is true for k = N, N-1, \ldots, 1, 0.
In particular, J^*(x_0) = J_0^*(x_0) = J_0(x_0) is the optimal cost.
Shortest Paths in a Trellis
Figure 6: Shortest paths in a trellis (initial state s, stages 0 to N, artificial terminal state t)
Find shortest path from a node in Stage 1 to a node in Stage N
states → nodes
controls → arcs
a_{ij}^k: cost of transition from state i at stage k to state j at stage k+1.
a_{it}^N: terminal cost of state i.
Cost function = length of path from s to t.
Shortest Paths in a Trellis

D.P. Algorithm:

J_N(i) = a_{it}^N
J_k(i) = \min_j [ a_{ij}^k + J_{k+1}(j) ], k = N-1, \ldots, 1, 0
Optimal cost = J0(s) = length of shortest path from s to t.
Example: Find shortest path from stage 1 to stage 3 in Figure 7.
Figure 7: Shortest paths example (shortest path shown in red)
Shortest Paths in a Trellis
Redraw as a trellis with an initial and a terminal node, see Figure 8.
Figure 8: Redrawn shortest paths example (initial node s, terminal node t, two states per stage)
Here N = 3. Call the top node state 1 and the bottom node state 2.
Stage N:
J_3(1) = 0
J_3(2) = 0
Shortest Paths in a Trellis
Stage 2:
J_2(1) = min{ a_{11}^2 + J_3(1), a_{12}^2 + J_3(2) } = min{ 100 + 0, 200 + 0 } = 100
J_2(2) = min{ a_{21}^2 + J_3(1), a_{22}^2 + J_3(2) } = min{ 350 + 0, 400 + 0 } = 350

Stage 1:
J_1(1) = min{ a_{11}^1 + J_2(1), a_{12}^1 + J_2(2) } = min{ 300 + 100, 400 + 350 } = 400
J_1(2) = min{ a_{21}^1 + J_2(1), a_{22}^1 + J_2(2) } = min{ 150 + 100, 50 + 350 } = 250

Stage 0:
J_0(s) = min{ 0 + J_1(1), 0 + J_1(2) } = 250
Shortest path to original problem shown in red in Figure 7.
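The stage-by-stage computation above can be reproduced directly:

```python
# Backward D.P. on the two-state trellis of Figures 7-8 (N = 3):
# a[k][(i, j)] = cost from state i at stage k to state j at stage k+1.
a = {1: {(1, 1): 300, (1, 2): 400, (2, 1): 150, (2, 2): 50},
     2: {(1, 1): 100, (1, 2): 200, (2, 1): 350, (2, 2): 400}}
J = {3: {1: 0, 2: 0}}                      # zero terminal costs at stage N
for k in (2, 1):                           # k = N-1, ..., 1
    J[k] = {i: min(a[k][(i, j)] + J[k + 1][j] for j in (1, 2))
            for i in (1, 2)}
J0_s = min(J[1][1], J[1][2])               # stage-0 arcs from s have cost 0
```

This yields the same cost-to-go values (100, 350, 400, 250) and optimal cost 250 as the hand computation.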
Forward D.P. Algorithm
Observe that the optimal path s→t is also the optimal path t→s if the directions of the arcs are reversed.
⇒ The shortest path algorithm can be run forwards in time (see Bertsekas for equations).
Figure 9 shows the result of forward D.P. on the shortest paths example.
Forward D.P. is useful in real-time applications, where data arrives just before you need to make a decision.
Viterbi algorithm uses this idea
Shortest paths is a deterministic problem, so forward D.P. works.
For stochastic problems, there is no such concept of forward D.P.
  - Impossible to guarantee that any given state can be reached.
Forward D.P. Algorithm
Figure 9: Forward D.P. on shortest paths example
Viterbi Algorithm Applications
Estimation of hidden Markov models (HMMs)
  - x_k = Markov chain
  - state transitions in x_k not observed (hidden)
  - observe z_k; r(z, i, j) = probability we observe z given a transition in the Markov chain x_k from state i to j
  - Estimation problem: Given Z^N = \{z_1, z_2, \ldots, z_N\}, find the sequence X^N = \{x_0, x_1, \ldots, x_N\}, over all possible \{x_0, x_1, \ldots, x_N\}, that maximizes P(X^N | Z^N).

Note that P(X^N | Z^N) = \frac{P(X^N, Z^N)}{P(Z^N)}, and P(Z^N) is "constant" given Z^N. So

\max_{\{x_0, \ldots, x_N\}} P(X^N | Z^N) \longleftrightarrow \max_{\{x_0, \ldots, x_N\}} P(X^N, Z^N) \longleftrightarrow \max_{\{x_0, \ldots, x_N\}} \ln P(X^N, Z^N)
Viterbi Algorithm Applications
  - After some calculations (see Bertsekas), one can show that the problem is equivalent to:

\min_{\{x_0, \ldots, x_N\}} \; -\ln(\pi_{x_0}) - \sum_{k=1}^{N} \ln\left( \pi_{x_{k-1} x_k} \, r(z_k, x_{k-1}, x_k) \right)

where \pi_{x_0} = probability of the initial state, \pi_{x_{k-1} x_k} = transition probabilities of the Markov chain, and -\ln \pi_{x_0} and -\ln(\pi_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k)) can be regarded as lengths of the different stages
⇒ shortest path problem through a trellis
Decoding of convolutional codes
Channel equalization in presence of ISI (Inter-symbol interference)
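As a sketch of this shortest-path view: the two-state Markov chain, the observation model r, and all probabilities below are illustrative assumptions (not from the lecture). Arc lengths are the negative logs -ln(\pi_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k)), and cost-to-arrive values are propagated forwards in time, as in the Viterbi algorithm.

```python
import math

pi0 = {0: 0.6, 1: 0.4}                                    # initial distribution
P = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}  # transition probs

def r(z, i, j):
    """P(observe z | transition i -> j); assumed to depend only on j here."""
    return 0.9 if z == j else 0.1

def viterbi(obs):
    # D[x] = shortest "length" (negative log probability) of a path ending in x
    D = {x: -math.log(pi0[x]) for x in (0, 1)}
    back = []                                             # backpointers per stage
    for z in obs:
        prev, new = {}, {}
        for j in (0, 1):
            i = min((0, 1), key=lambda s: D[s] - math.log(P[(s, j)] * r(z, s, j)))
            prev[j] = i
            new[j] = D[i] - math.log(P[(i, j)] * r(z, i, j))
        back.append(prev)
        D = new
    x = min(D, key=D.get)                                 # best final state
    seq = [x]
    for prev in reversed(back):                           # backtrack x_N ... x_0
        x = prev[x]
        seq.append(x)
    return seq[::-1]                                      # x_0, x_1, ..., x_N
```

The shortest path through the trellis is the maximum a posteriori state sequence.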
General Shortest Path Problems
No trellis structure.
e.g. Find the shortest path from each node to node 5 in Figure 10.
Figure 10: General shortest path problem
Graph with N+1 nodes \{1, 2, \ldots, N, t\}.
a_{ij} = cost of moving from node i to node j.
Find the shortest path from each node i to node t.
General Shortest Path Problems
Assume some a_{ij}'s can be negative, but cycles have non-negative length.
  - Then the shortest path will not involve more than N arcs.
Reformulate as a trellis-type shortest path problem with N arcs, by allowing arcs from node i to itself with cost a_{ii} = 0.
D.P. algorithm:

J_{N-1}(i) = a_{it}
J_k(i) = \min_j \{ a_{ij} + J_{k+1}(j) \}, k = N-2, \ldots, 1, 0

This algorithm is essentially the Bellman-Ford algorithm.
Other algorithms have also been invented, e.g. Dijkstra's algorithm, which can be used when all a_{ij}'s are positive.
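A minimal sketch of this trellis reformulation; the graph is supplied by the caller as a dict of arc costs, with missing arcs treated as infinitely long and zero-cost self-loops added implicitly.

```python
# D.P. form of the Bellman-Ford recursion: zero-cost self-loops a_ii = 0
# turn the graph into a trellis with at most N arcs per path.
INF = float("inf")

def shortest_paths(a, nodes, t, N):
    """a[(i, j)] = arc cost; returns J_0(i), the shortest distance i -> t."""
    cost = lambda i, j: 0 if i == j else a.get((i, j), INF)
    J = {i: cost(i, t) for i in nodes}     # J_{N-1}(i) = a_it
    for _ in range(N - 1):                 # k = N-2, ..., 1, 0
        J = {i: min(cost(i, j) + J[j] for j in nodes) for i in nodes}
    return J
```

The small graph used below to exercise it is an illustrative assumption, not the graph of Figure 10.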
Outline
3 Problems with Perfect State Information
Linear Quadratic Control
Optimal Stopping Problems
Problems with Perfect State Information
Will study some problems where analytical solutions can be obtained:
Linear quadratic control
Optimal stopping problems
+ others in Chapter 4 of Bertsekas
Linear Quadratic Control

(Linear) System:

x_{k+1} = A x_k + B u_k + w_k, k = 0, 1, \ldots, N-1

(Quadratic) Cost function:

E\left\{ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N \right\}
Problem: Determine optimal policy to minimize cost function
x_k, u_k, w_k are column vectors.
A, B, Q, R are matrices.
w_k are independent and zero mean.
Q is positive semi-definite.
R is positive definite.
Linear Quadratic Control
Definition:
A symmetric matrix M is positive semi-definite if x^T M x ≥ 0 for all vectors x.
M is positive definite if x^T M x > 0 for all x ≠ 0.

One characterization:
M is positive semi-definite ⇔ all eigenvalues of M are ≥ 0.
M is positive definite ⇔ all eigenvalues of M are > 0.

D.P. algorithm applied to this problem gives:

J_N(x_N) = x_N^T Q x_N
J_k(x_k) = \min_{u_k} E\{ x_k^T Q x_k + u_k^T R u_k + J_{k+1}(A x_k + B u_k + w_k) \}, k = N-1, \ldots, 1, 0.
Linear Quadratic Control
It turns out that the minimization can be done analytically.

J_{N-1}(x_{N-1}) = \min_{u_{N-1}} E\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) \}

= \min_{u_{N-1}} E\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + x_{N-1}^T A^T Q A x_{N-1} + x_{N-1}^T A^T Q B u_{N-1} + x_{N-1}^T A^T Q w_{N-1} + u_{N-1}^T B^T Q A x_{N-1} + u_{N-1}^T B^T Q B u_{N-1} + u_{N-1}^T B^T Q w_{N-1} + w_{N-1}^T Q A x_{N-1} + w_{N-1}^T Q B u_{N-1} + w_{N-1}^T Q w_{N-1} \}

= x_{N-1}^T (A^T Q A + Q) x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \} + \min_{u_{N-1}} \{ u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} \}

(the cross terms involving w_{N-1} vanish since E[w_{N-1}] = 0)
Linear Quadratic Control

Digression
Problem: \min_x f(x)
How to solve?
For unconstrained scalar problems, differentiate and set the derivative equal to 0.
e.g. \min_x (x - 2)^2: \frac{d}{dx}(x - 2)^2 = 2(x - 2) = 0 ⇒ x^* = 2.

Similarly, differentiate u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} with respect to the vector u_{N-1} and set equal to zero.

Note that

\frac{\partial (u^T A u)}{\partial u} = 2 A u, \qquad \frac{\partial (a^T u)}{\partial u} = a,

where a and u are column vectors, and A is a symmetric matrix.

Using the above formulas, we obtain 2(R + B^T Q B) u_{N-1} + 2 B^T Q A x_{N-1} = 0
⇒ u_{N-1}^* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1}
Linear Quadratic Control
Substituting u_{N-1}^* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1} back into the expression for J_{N-1}(x_{N-1}), we obtain

J_{N-1}(x_{N-1}) = x_{N-1}^T (A^T Q A + Q) x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}
+ x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} (R + B^T Q B)(R + B^T Q B)^{-1} B^T Q A x_{N-1}
- 2 x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1}

= x_{N-1}^T (A^T Q A + Q) x_{N-1} - x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}

= x_{N-1}^T (A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A) x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}

= x_{N-1}^T K_{N-1} x_{N-1} + E\{ w_{N-1}^T Q w_{N-1} \}

with K_{N-1} = A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A
Linear Quadratic Control
Continuing on, one can show that

u_{N-2}^* = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A x_{N-2},

and more generally (tutorial problem) that

\mu_k^*(x_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A x_k

where

K_N = Q,
K_k = A^T K_{k+1} A + Q - A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
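For the scalar case (A, B, Q, R all scalars) the recursion is easy to state in code; this is a simplified sketch in which transposes and matrix inverses reduce to ordinary arithmetic.

```python
# Scalar finite-horizon LQR Riccati recursion: returns K_0..K_N and the
# feedback gains L_0..L_{N-1} in mu*_k(x) = L_k x. Scalar sketch only;
# the matrix case replaces products and division with matrix operations.
def riccati_gains(A, B, Q, R, N):
    K = Q                                      # K_N = Q
    Ks, Ls = [K], []
    for _ in range(N):                         # k = N-1, ..., 1, 0
        L = -(B * K * A) / (B * K * B + R)     # gain using K_{k+1}
        K = A * K * A + Q - (A * K * B) * (B * K * A) / (B * K * B + R)
        Ls.append(L)
        Ks.append(K)
    return Ks[::-1], Ls[::-1]                  # ordered K_0..K_N, L_0..L_{N-1}
```

For example, with A = B = Q = R = 1 the recursion becomes K ← (2K + 1)/(K + 1), whose fixed point is the positive root of K² = K + 1, i.e. (1 + √5)/2; the corresponding closed-loop factor A + B·L then has magnitude below 1.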
Certainty Equivalence
Certainty Equivalence: The optimal policy is the same as solving the problem for the deterministic system:

x_{k+1} = A x_k + B u_k + E[w_k],

where w_k is replaced by its expected value E[w_k] = 0, i.e. the standard LQR problem.
Asymptotic Behaviour
Definition:
A pair of matrices (A, B), where A is n × n and B is n × m, is controllable if the n × nm matrix

[ B  AB  A^2 B  \ldots  A^{n-1} B ]

has full rank (all rows linearly independent).
A pair (A, C), where A is n × n and C is m × n, is observable if (A^T, C^T) is controllable.
Asymptotic Behaviour
Theorem
If (A, B) is controllable and Q can be written as Q = C^T C, where (A, C) is observable, then:

1. K_k → K as k → -∞, with K satisfying the algebraic Riccati equation

K = A^T K A + Q - A^T K B (B^T K B + R)^{-1} B^T K A

2. The steady-state controller

\mu^*(x_k) = L x_k,

where L = -(B^T K B + R)^{-1} B^T K A, stabilizes the system, i.e. the eigenvalues of A + BL have magnitude < 1.

Proof: See Bertsekas.

Note: If u_k = L x_k, then x_{k+1} = A x_k + B u_k + w_k = (A + BL) x_k + w_k.
x_k stays "bounded" when the eigenvalues of A + BL have magnitude < 1.
Other Variations

x_{k+1} = A_k x_k + B_k u_k + w_k

A_k, B_k random, unknown, independent.
Optimal policy:

\mu_k^*(x_k) = -(R + E\{B_k^T K_{k+1} B_k\})^{-1} E\{B_k^T K_{k+1} A_k\} x_k,

where

K_N = Q,
K_k = E\{A_k^T K_{k+1} A_k\} + Q - E\{A_k^T K_{k+1} B_k\} (E\{B_k^T K_{k+1} B_k\} + R)^{-1} E\{B_k^T K_{k+1} A_k\}

  - may not have certainty equivalence
  - may not have a steady-state solution

x_{k+1} = A x_k + B_k u_k + w_k

B_k is random, independent, and is only revealed to us at time k.
Motivation: wireless channels.
Similar to Leong, Dey, Anand, "Optimal LQG control over continuous fading channels", Proc. IFAC World Congress, 2011.
Optimal Stopping Problems
At each state, there is a "stop" control that stops the system, i.e. moves to and stays in a stop state.
Pure stopping problem: the only other control is "continue".
For pure stopping problems, the policy is characterized by a partition of the set of states into:
  - stop region
  - continue region,
which may depend on time.
Example (Asset selling)
A person has an asset for sale, e.g. a house.
At each time k = 0, 1, \ldots, N-1, the person receives a random offer w_k for the asset.
Assume wk ’s are independent.
Either accept w_k at time k+1, and invest the money at interest rate r, or reject w_k and wait for the offer w_{k+1}.
Must accept the last offer w_{N-1} at time N if every previous offer was rejected.
Find policy that maximizes (expected) revenue at the N-th period.
Example (Asset selling)
States: If x_k = T: asset already sold (T = stop state).
If x_k = w_{k-1}: offer currently under consideration.
Controls: {accept, reject}
The system evolves as:

x_{k+1} = f_k(x_k, w_k, u_k) =
  T,   if 1) x_k = T, or 2) x_k ≠ T and u_k = accept;
  w_k, otherwise.
Example (Asset selling)
Rewards at time k:

g_N(x_N) =
  x_N, if x_N ≠ T;
  0,   otherwise.

g_k(x_k, u_k, w_k) =
  (1 + r)^{N-k} x_k, if x_k ≠ T and u_k = accept;
  0,                 otherwise.

  - (For compound interest over n years, final amount = (1 + r)^n × initial amount.)
  - Note: From the way the rewards are defined, g_k is non-zero for only one k ∈ \{0, 1, \ldots, N-1\}.
Example (Asset selling)
Expected total reward

= E\left[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) \right]

D.P. algorithm (for reward maximization):

J_N(x_N) = g_N(x_N) =
  x_N, if x_N ≠ T;
  0,   otherwise.

J_k(x_k) = \max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]
Example (Asset selling)

If x_k = T, then g_k(x_k, u_k, w_k) = 0 and J_{k+1}(x_{k+1}) = 0, by the property that g_k is non-zero for only one k and the reward is incurred prior to time k.
If x_k ≠ T, then

E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ] =
  (1 + r)^{N-k} x_k,   if u_k = accept;
  0 + E[J_{k+1}(w_k)], if u_k = reject.

So

J_k(x_k) = \max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ] =
  \max((1 + r)^{N-k} x_k, E[J_{k+1}(w_k)]), if x_k ≠ T;
  0,                                        if x_k = T,

and the optimal policy is of the form:

u_k = accept if (1 + r)^{N-k} x_k > E[J_{k+1}(w_k)],

or

u_k =
  accept, if x_k > \frac{E[J_{k+1}(w_k)]}{(1 + r)^{N-k}};
  reject, otherwise.
Example (Asset selling)
Let
  αk = E[Jk+1(wk)] / (1 + r)^(N−k)
Can show (see Bertsekas) that αk ≥ αk+1 for all k if the wk are i.i.d.
  Intuition: an offer acceptable at time k should also be acceptable at time k + 1. See Figure 11.
Figure 11: Asset selling. The thresholds α1 ≥ α2 ≥ ... ≥ αN−1 are plotted against k; accept above the threshold curve, reject below.
Example (Asset selling)
Can also show that if the wk are i.i.d. and N → ∞, then the optimal policy "converges" to the stationary policy:
  uk = accept, if xk > α; reject, if xk ≤ α,
where α is a constant.
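The backward recursion for Jk and the thresholds αk can be evaluated numerically. A minimal Python sketch, assuming a discrete i.i.d. offer distribution; the offer values, probabilities, horizon N and interest rate r below are illustrative:

```python
import numpy as np

def asset_selling_thresholds(offers, probs, N, r):
    """Backward recursion for the asset-selling problem.

    offers, probs: a discrete distribution for the i.i.d. offers wk.
    Returns alpha[k] = E[J_{k+1}(w_k)] / (1 + r)^(N - k); the optimal
    policy accepts at time k iff the current offer exceeds alpha[k].
    """
    offers = np.asarray(offers, dtype=float)
    probs = np.asarray(probs, dtype=float)
    J = offers.copy()                      # J_N(x) = x for x != T
    alpha = np.zeros(N)
    for k in range(N - 1, -1, -1):
        EJ = probs @ J                     # E[J_{k+1}(w_k)]
        alpha[k] = EJ / (1 + r) ** (N - k)
        # J_k(x) = max((1 + r)^(N - k) x, E[J_{k+1}(w_k)])
        J = np.maximum((1 + r) ** (N - k) * offers, EJ)
    return alpha

alpha = asset_selling_thresholds([1.0, 2.0, 3.0], [1/3, 1/3, 1/3], N=5, r=0.05)
print(alpha)   # non-increasing in k, matching alpha_k >= alpha_{k+1}
```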
General Stopping Problems
Pure stopping problem: stop or continue are the only possible controls.
General stopping problem: stop, or choose a control uk from U(xk) (where U(xk) has more than one element).
Consider the time-invariant case: f(xk, uk, wk), g(xk, uk, wk) don't depend on k, and wk is i.i.d.
Stop at time k with cost t(xk).
Must stop by the last stage.
D.P. algorithm:
  JN(xN) = t(xN),
  Jk(xk) = min[ t(xk), min_{uk∈U(xk)} E{g(xk, uk, wk) + Jk+1(f(xk, uk, wk))} ]
Optimal to stop when
  t(xk) ≤ min_{uk∈U(xk)} E{g(xk, uk, wk) + Jk+1(f(xk, uk, wk))}
General Stopping Problems
Stopping set at time k (the set of states where we stop):
  Tk = {x | t(x) ≤ min_{u∈U(x)} E[g(x, u, w) + Jk+1(f(x, u, w))]}
Note that JN−1(x) ≤ JN(x) for all x, since JN(x) = t(x) and
  JN−1(x) = min[ t(x), min_{u∈U(x)} E[g(x, u, w) + JN(f(x, u, w))] ] ≤ t(x) = JN(x)
Can show that Jk(x) ≤ Jk+1(x) (monotonicity principle: tutorial problem).
Then we have:
  T0 ⊆ T1 ⊆ T2 ⊆ ... ⊆ Tk ⊆ Tk+1 ⊆ ... ⊆ TN−1
i.e. the set of states in which we stop grows with time.
Special Case
If f(x, u, w) ∈ TN−1 for all x ∈ TN−1, u ∈ U(x), w (i.e. the set TN−1 is absorbing), then
  T0 = T1 = T2 = · · · = TN−1.
Proof: see Bertsekas.
This simplifies the optimal policy, which is then called the one-step lookahead policy.
Special Case
E.g. asset selling with past offers retained.
Same situation as before, except that previously rejected offers can be accepted at a later time.
State evolves as
  xk+1 = max(xk, wk)
(instead of xk+1 = wk before).
Can show (see Bertsekas) that TN−1 = {x | x ≥ α} for some constant α.
This set is absorbing, since the best offer received so far cannot decrease over time.
⇒ the optimal policy at every time k is to accept if the best offer so far exceeds α.
We have a constant threshold α even for finite horizon N.
Outline
4 Problems with Imperfect State Information
  Reformulation as Perfect State Information Problem
  Linear Quadratic Control with Noisy Measurements
  Sufficient Statistics
Problems with Imperfect State Information
State xk is not known to the controller.
Instead we have "noisy" observations zk of the form:
  z0 = h0(x0, v0),
  zk = hk(xk, uk−1, vk), k = 1, 2, ..., N − 1,
where vk is "observation noise", with a probability distribution
  Pv(· | x0, ..., xk, u0, ..., uk−1, w0, ..., wk−1, v0, ..., vk−1)
which can depend on states, controls and disturbances.
Examples:
  hk(xk, uk−1, vk) = xk + vk,
  hk(xk, uk−1, vk) = sin xk + uk−1 vk
Problems with Imperfect State Information
Initial state x0 is random with distribution Px0.
uk ∈ Uk, where Uk does not depend on the (unknown) xk.
Information vector, i.e. the information available to the controller at time k, defined as
  I0 = z0,
  Ik = (z0, ..., zk, u0, ..., uk−1), k = 1, 2, ..., N − 1
Policies π = (µ0, ..., µN−1), where now µk(Ik) ∈ Uk (before: µk(xk)).
Basic Problem with Imperfect State Information
Find π that minimizes the cost function
  Jπ = E{ ∑_{k=0}^{N−1} gk(xk, µk(Ik), wk) + gN(xN) }
s.t. the system equation
  xk+1 = fk(xk, µk(Ik), wk)
and the measurement equation
  zk = hk(xk, µk−1(Ik−1), vk)
Question: How to solve this problem?
Reformulation as Perfect State Information Problem
Idea: Define a new system where the state is Ik. Then we have a D.P. algorithm etc.
By definition
  Ik+1 = (z0, ..., zk, zk+1, u0, ..., uk−1, uk), and since Ik = (z0, ..., zk, u0, ..., uk−1),
  ⇒ Ik+1 = (Ik, uk, zk+1).
Reformulation as Perfect State Information Problem
Regard
  Ik+1 = (Ik, uk, zk+1)
as a dynamical system with state Ik, control uk and disturbance zk+1.
Next note that E[gk(xk, uk, wk)] = E[ E[gk(xk, uk, wk) | Ik, uk] ] (recall that E[X] = E[E[X|Y]]).
Define gk(Ik, uk) = E[gk(xk, uk, wk) | Ik, uk] as the cost per stage of the new system, and gN(IN) = E[gN(xN) | IN] as the terminal cost.
The cost function becomes
  E{ ∑_{k=0}^{N−1} gk(xk, µk(Ik), wk) + gN(xN) } = E{ ∑_{k=0}^{N−1} gk(Ik, µk(Ik)) + gN(IN) }
Reformulation as Perfect State Information Problem
D.P. algorithm for the reformulated perfect state information problem:
  JN(IN) = gN(IN) = E[gN(xN) | IN]
  Jk(Ik) = min_{uk∈Uk} E{ gk(Ik, uk) + Jk+1(Ik, uk, zk+1) }
         = min_{uk∈Uk} E{ gk(xk, uk, wk) + Jk+1(Ik, uk, zk+1) | Ik }, k = N − 1, ..., 0
Optimal cost J* = E{J0(z0)}.
Linear Quadratic Control with Noisy Measurements
System
  xk+1 = A xk + B uk + wk
Cost function
  E[ ∑_{k=0}^{N−1} (xk^T Q xk + uk^T R uk) + xN^T Q xN ]
with gk(xk, uk, wk) = xk^T Q xk + uk^T R uk and gN(xN) = xN^T Q xN.
Observations
  zk = C xk + vk
wk are independent, zero mean.
From the D.P. algorithm:
  JN(IN) = E[xN^T Q xN | IN],
Linear Quadratic Control with Noisy Measurements
JN−1(IN−1) = min_{uN−1} E{ xN−1^T Q xN−1 + uN−1^T R uN−1 + E[(A xN−1 + B uN−1 + wN−1)^T Q (A xN−1 + B uN−1 + wN−1) | IN] | IN−1 }
 = min_{uN−1} E{ xN−1^T Q xN−1 + uN−1^T R uN−1 + (A xN−1 + B uN−1 + wN−1)^T Q (A xN−1 + B uN−1 + wN−1) | IN−1 }
(using the tower property E(E(X|Y)|Z) = E(X|Z) when Y contains "more information" than Z)
 = ... (expand, simplify and use E(wN−1 | IN−1) = 0)
 = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1 | IN−1]
   + min_{uN−1} { uN−1^T (B^T Q B + R) uN−1 + 2 E[xN−1 | IN−1]^T A^T Q B uN−1 }
Differentiate with respect to uN−1 and set equal to zero:
  2 (B^T Q B + R) uN−1 + 2 B^T Q A E[xN−1 | IN−1] = 0
  ⇒ u*N−1 = −(B^T Q B + R)^{−1} B^T Q A E[xN−1 | IN−1]
Linear Quadratic Control with Noisy Measurements
Substituting the expression for u*N−1 back in:
JN−1(IN−1) = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1]
 + E[xN−1|IN−1]^T A^T Q B (B^T Q B + R)^{−1} (B^T Q B + R) (B^T Q B + R)^{−1} B^T Q A E[xN−1|IN−1]
 − 2 E[xN−1|IN−1]^T A^T Q B (B^T Q B + R)^{−1} B^T Q A E[xN−1|IN−1]
 = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1]
 − E[xN−1|IN−1]^T A^T Q B (B^T Q B + R)^{−1} B^T Q A E[xN−1|IN−1]
 = E[xN−1^T (A^T Q A + Q) xN−1 | IN−1] + E[wN−1^T Q wN−1]
 + E[(xN−1 − E[xN−1|IN−1])^T A^T Q B (B^T Q B + R)^{−1} B^T Q A (xN−1 − E[xN−1|IN−1]) | IN−1]
 − E[xN−1^T PN−1 xN−1 | IN−1], where PN−1 = A^T Q B (B^T Q B + R)^{−1} B^T Q A.
Linear Quadratic Control with Noisy Measurements
We have
  JN−1(IN−1) = E[xN−1^T KN−1 xN−1 | IN−1] + E[wN−1^T Q wN−1]
   + E[(xN−1 − E[xN−1|IN−1])^T PN−1 (xN−1 − E[xN−1|IN−1]) | IN−1]
where
  PN−1 = A^T Q B (B^T Q B + R)^{−1} B^T Q A
  KN−1 = A^T Q A + Q − PN−1.
Linear Quadratic Control with Noisy Measurements
For period N − 2,
JN−2(IN−2) = min_{uN−2} E{ xN−2^T Q xN−2 + uN−2^T R uN−2 + JN−1(IN−1) | IN−2 }
 = E{xN−2^T Q xN−2 | IN−2} + min_{uN−2} [ uN−2^T R uN−2 + E{xN−1^T KN−1 xN−1 | IN−2} ]
 + E[(xN−1 − E[xN−1|IN−1])^T PN−1 (xN−1 − E[xN−1|IN−1]) | IN−2] + E(wN−1^T Q wN−1)
Then can obtain
  u*N−2 = −(B^T KN−1 B + R)^{−1} B^T KN−1 A E[xN−2|IN−2]
Note that in the above the term
  E[(xN−1 − E[xN−1|IN−1])^T PN−1 (xN−1 − E[xN−1|IN−1]) | IN−2]
can be taken outside the minimization (see Bertsekas for proof).
  Intuition: the estimation error xk − E[xk|Ik] can't be influenced by the choice of control.
Linear Quadratic Control with Noisy Measurements
Continuing on, the general solution is:
  µ*k(Ik) = u*k = −(B^T Kk+1 B + R)^{−1} B^T Kk+1 A E[xk|Ik] = Lk E[xk|Ik]
where
  KN = Q
  Pk = A^T Kk+1 B (B^T Kk+1 B + R)^{−1} B^T Kk+1 A
  Kk = A^T Kk+1 A + Q − Pk
Comparison with the perfect state information case:
  The gain matrix Lk is the same.
  xk is replaced by E[xk|Ik].
How to compute E[xk|Ik]?
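The backward recursion for Kk and the gains Lk translates directly into code; a minimal sketch, where the system matrices and horizon below are illustrative:

```python
import numpy as np

def lq_gains(A, B, Q, R, N):
    """Backward Riccati recursion K_N = Q, K_k = A'K_{k+1}A + Q - P_k.

    Returns the gains L_k with u_k = L_k E[x_k | I_k] (the same L_k as
    in the perfect state information case).
    """
    K = Q.copy()                               # K_N = Q
    L = [None] * N
    for k in range(N - 1, -1, -1):
        S = B.T @ K @ B + R
        L[k] = -np.linalg.solve(S, B.T @ K @ A)
        P = A.T @ K @ B @ np.linalg.solve(S, B.T @ K @ A)   # P_k
        K = A.T @ K @ A + Q - P                             # K_k
    return L

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # illustrative double integrator
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
L = lq_gains(A, B, Q, R, N=10)
print(L[0])
```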
Linear Quadratic Control with Noisy Measurements
Summary so far:
System
  xk+1 = A xk + B uk + wk
  zk = C xk + vk
Problem
  min E[ ∑_{k=0}^{N−1} (xk^T Q xk + uk^T R uk) + xN^T Q xN ]
Optimal solution:
  µ*k(Ik) = −(B^T Kk+1 B + R)^{−1} B^T Kk+1 A E[xk|Ik] = Lk E[xk|Ik]
where Ik = (z0, ..., zk, u0, ..., uk−1)
Linear Quadratic Control with Noisy Measurements
The optimal controller can be decomposed into two parts:
1) An estimator, which computes E[xk|Ik].
2) An actuator, which multiplies E[xk|Ik] by Lk. Lk is the same gain matrix as in the perfect state information case; we only replace xk with E[xk|Ik].
The estimator and actuator can be designed separately.
  This is known as the separation principle/theorem.
LQG Control
Remaining problem: How do we compute E[xk |Ik ]?
Very difficult problem in general (subject called non-linear filtering).
When system is linear and wk , vk are Gaussian, E[xk |Ik ] can becomputed analytically.
I Procedure/algorithm is known as the Kalman Filter (ref: Anderson andMoore, “Optimal Filtering”), and the overall controller is called theLQG (linear quadratic Gaussian) controller
Kalman Filter
System:
xk+1 = Axk + Buk + wk
zk = Cxk + vk
wk ∼ N(0, Σw) i.i.d., Σw = E[wk wk^T]
vk ∼ N(0, Σv) i.i.d., Σv = E[vk vk^T]
Define state estimates
  xk|k = E[xk|Ik]
  xk+1|k = E[xk+1|Ik]
and estimation error covariance matrices
  Σk|k = E[(xk − xk|k)(xk − xk|k)^T | Ik]
  Σk+1|k = E[(xk+1 − xk+1|k)(xk+1 − xk+1|k)^T | Ik]
Kalman Filter
Then xk|k, xk+1|k, Σk|k, Σk+1|k can be computed recursively using the Kalman filter equations:
  xk|k = xk|k−1 + Σk|k−1 C^T (C Σk|k−1 C^T + Σv)^{−1} (zk − C xk|k−1)
  xk+1|k = A xk|k + B uk
  Σk|k = Σk|k−1 − Σk|k−1 C^T (C Σk|k−1 C^T + Σv)^{−1} C Σk|k−1
  Σk+1|k = A Σk|k A^T + Σw, k = 0, 1, ..., N − 1
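The four Kalman filter equations above can be sketched as a single update function; a minimal version, with an illustrative scalar usage example:

```python
import numpy as np

def kalman_step(x_pred, Sigma_pred, z, u, A, B, C, Sigma_w, Sigma_v):
    """One step of the Kalman filter equations above.

    Inputs are x_{k|k-1} and Sigma_{k|k-1}; returns
    (x_{k|k}, Sigma_{k|k}, x_{k+1|k}, Sigma_{k+1|k}).
    """
    S = C @ Sigma_pred @ C.T + Sigma_v
    G = Sigma_pred @ C.T @ np.linalg.inv(S)       # the "Kalman gain"
    x_filt = x_pred + G @ (z - C @ x_pred)        # x_{k|k}
    Sigma_filt = Sigma_pred - G @ C @ Sigma_pred  # Sigma_{k|k}
    x_next = A @ x_filt + B @ u                   # x_{k+1|k}
    Sigma_next = A @ Sigma_filt @ A.T + Sigma_w   # Sigma_{k+1|k}
    return x_filt, Sigma_filt, x_next, Sigma_next

# illustrative scalar example
A = B = C = Sigma_w = Sigma_v = np.array([[1.0]])
x_f, S_f, x_n, S_n = kalman_step(np.array([0.0]), np.array([[10.0]]),
                                 z=np.array([2.0]), u=np.array([0.0]),
                                 A=A, B=B, C=C, Sigma_w=Sigma_w, Sigma_v=Sigma_v)
print(x_f, S_f)   # the measurement shrinks the covariance: 10 -> 10/11
```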
Proof: see Bertsekas, or Anderson and Moore.
Beware: Many people who work in Kalman filtering like to use Q for Σw, R for Σv, and Kk for the "Kalman gain" Σk|k−1 C^T (C Σk|k−1 C^T + Σv)^{−1}, but here Q, R, Kk have been used for different things. People also use Pk+1|k for Σk+1|k, Pk|k for Σk|k, etc.
Kalman Filter Properties
In general the mean squared error
  E[(xk − x̂k)^T (xk − x̂k) | Ik]
is minimized when x̂k = E[xk|Ik].
The Kalman filter equations compute E[xk|Ik] when the noises are Gaussian, and the (optimal) estimates are linear functions of the measurements zk.
Even when the noises are not Gaussian, the xk|k computed by the Kalman filter equations gives the best linear estimate of xk.
  Useful suboptimal solution when the noises are non-Gaussian.
Kalman Filter Properties
Recall that if the pair (A, B) is controllable and (A, Q^{1/2}) is observable, the optimal controller has a steady state solution.
Similarly, if (A, C) is observable and (A, Σw^{1/2}) is controllable, then Σk|k−1 converges to a steady state value Σ as k → ∞, where Σ satisfies the algebraic Riccati equation
  Σ = A Σ A^T − A Σ C^T (C Σ C^T + Σv)^{−1} C Σ A^T + Σw
So we have a steady state estimator:
  xk|k = xk|k−1 + Σ C^T (C Σ C^T + Σv)^{−1} (zk − C xk|k−1)
  xk+1|k = A xk|k + B uk
Sufficient Statistics
Information vector Ik = (z0, ..., zk, u0, ..., uk−1).
The dimension of Ik increases with time k, which is inconvenient for large k.
Sufficient statistic: a function Sk(Ik) which summarizes all the essential content of Ik for computing the optimal control, i.e. µ*k(Ik) = µ(Sk(Ik)) for some function µ.
Sk(Ik) is preferably of smaller dimension than Ik.
Examples of Sufficient Statistics
1) Ik itself.
2) The conditional state distribution / belief state P_{xk|Ik}, assuming that the distribution of vk depends only on xk−1, uk−1, wk−1.
If the number of states is finite then P_{xk|Ik} is a vector, e.g. if the states are 1, 2, ..., n, then
  P_{xk|Ik} = ( P(xk = 1|Ik), P(xk = 2|Ik), ..., P(xk = n|Ik) )^T
The dimension of this vector is n, which doesn't grow with k.
3) Special case: E[xk|Ik] is a sufficient statistic for the LQG problem (though not a sufficient statistic in general).
Conditional State Distribution
The conditional state distribution P_{xk|Ik} can be generated recursively, as
  P_{xk+1|Ik+1} = Φk(P_{xk|Ik}, uk, zk+1)
for some function Φk(·, ·, ·).
Then the D.P. algorithm can be written as
  Jk(P_{xk|Ik}) = min_{uk∈Uk} E[gk(xk, uk, wk) + Jk+1(Φk(P_{xk|Ik}, uk, zk+1)) | Ik].
A general formula for Φk(·, ·, ·) can be derived, but it is quite complicated (see Bertsekas). We will derive some examples from first principles.
Example 1: Search Problem
At each period, decide whether to search a site that may contain a treasure.
If the treasure is present and we search, we find it with probability β and take it.
States: {treasure present, treasure not present}
Controls: {search, no search}
Regard each search result as an (imperfect) observation of the state.
Let pk = probability that the treasure is present at the start of time k.
  If we don't search, pk+1 = pk.
  If we search and find the treasure, pk+1 = 0.
Example 1
If we search and don't find the treasure,
  pk+1 = P(treasure present at k | don't find at k)
       = P(treasure present at k ∩ don't find at k) / P(don't find at k)
       = pk(1 − β) / (pk(1 − β) + (1 − pk)),
with (1 − pk) corresponding to "treasure not present & don't find".
Thus
  pk+1 = pk, if we don't search at time k;
         0, if we search and find the treasure;
         pk(1 − β) / (pk(1 − β) + (1 − pk)), if we search and don't find the treasure,
which is the Φk(pk, uk, zk+1) function.
Example 1
Now let the treasure be worth V, let each search cost C, and suppose that once we decide not to search we can't search again at future times.
The D.P. algorithm gives:
  Jk(pk) = max over {no search, search} of
    [ 0, −C + pkβV + (1 − pkβ) Jk+1( pk(1−β) / (pk(1−β) + 1 − pk) ) + pkβ Jk+1(0) ]
  = max[ 0, −C + pkβV + (1 − pkβ) Jk+1( pk(1−β) / (pk(1−β) + 1 − pk) ) ]
(where pkβ Jk+1(0) = 0 since the treasure has already been found)
Can show that Jk(pk) = 0 for all pk ≤ C/(βV), and that it is optimal to search iff the expected reward pkβV ≥ cost of search C. (Tutorial problem)
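The recursion for Jk(pk) can be evaluated numerically on a grid of beliefs; a minimal sketch, assuming the terminal condition JN = 0, β < 1, and illustrative values of β, V, C:

```python
import numpy as np

def search_values(beta, V, C, N, grid=1001):
    """Backward DP for the treasure-search problem on a grid of beliefs p,
    assuming the terminal condition J_N = 0 (and beta < 1)."""
    p = np.linspace(0.0, 1.0, grid)
    # belief after an unsuccessful search
    p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)
    J = np.zeros(grid)                        # J_N = 0
    for _ in range(N):
        J_next = np.interp(p_next, p, J)      # J_{k+1} at the updated belief
        J = np.maximum(0.0, -C + p * beta * V + (1 - p * beta) * J_next)
    return p, J

p, J0 = search_values(beta=0.5, V=10.0, C=1.0, N=20)
print(J0[p <= 0.2].max())    # J_0(p) = 0 for p <= C/(beta V) = 0.2, prints 0.0
```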
Example 2: Research Paper*
A process {Pe,k} evolves in the following way, for k = 1, ..., N:
  Pe,k+1 = P, if νk+1 γe,k+1 = 1;
           A Pe,k A^T + Q, if νk+1 γe,k+1 = 0,
where P, A, Q are some matrices.
{γe,k} is an i.i.d. Bernoulli process with
  P(γe,k = 1) = λe, P(γe,k = 0) = 1 − λe, ∀k
νk ∈ {0, 1}. {Pe,k} is not observed at all (no observation zk).
*Leong, Quevedo, Dolz, Dey, "On Remote State Estimation in the Presence of an Eavesdropper", Proc. IFAC World Congress, 2017.
Example 2
Regard Pe,k as the state at time k, and νk+1 as the control. Assume Pe,0 = P.
Then Pe,k ∈ {P, A P A^T + Q, A(A P A^T + Q)A^T + Q, ...} = {P, f(P), f^2(P), ..., f^N(P)}, where
  f(P) = A P A^T + Q
The conditional state distribution is
  ( P(Pe,k = P|ν0, ..., νk), P(Pe,k = f(P)|ν0, ..., νk), ..., P(Pe,k = f^N(P)|ν0, ..., νk) )^T
Example 2
When νk+1 = 0, Pe,k+1 = f(Pe,k) with probability 1. So
  ( P(Pe,k+1 = P|ν0, ..., νk+1), P(Pe,k+1 = f(P)|ν0, ..., νk+1), ..., P(Pe,k+1 = f^N(P)|ν0, ..., νk+1) )^T
  = ( 0, P(Pe,k = P|ν0, ..., νk), ..., P(Pe,k = f^{N−1}(P)|ν0, ..., νk) )^T
which gives the Φk(P_{Pe,k|Ik}, νk+1, zk+1) function when νk+1 = 0.
Example 2
When νk+1 = 1, Pe,k+1 = P with probability λe, and Pe,k+1 = f(Pe,k) with probability 1 − λe. So
  ( P(Pe,k+1 = P|ν0, ..., νk+1), P(Pe,k+1 = f(P)|ν0, ..., νk+1), ..., P(Pe,k+1 = f^N(P)|ν0, ..., νk+1) )^T
  = ( λe, (1 − λe) P(Pe,k = P|ν0, ..., νk), ..., (1 − λe) P(Pe,k = f^{N−1}(P)|ν0, ..., νk) )^T
which gives the Φk(P_{Pe,k|Ik}, νk+1, zk+1) function when νk+1 = 1.
Outline
5 Suboptimal Methods / Approximate Dynamic Programming
  Certainty Equivalent Control
  Rollout Algorithms
  Model Predictive Control
Suboptimal Methods
Why do we need/want suboptimal methods?
In D.P. we need to compute
  Jk(xk) = min_{uk} E[gk(xk, uk, wk) + Jk+1(xk+1)]
for all states xk.
1) In many problems, this minimization can't be done analytically.
  Have to test each uk.
  When the number of possible xk, uk or wk is large, the amount of computation required can be substantial.
Suboptimal Methods
2) In some problems, xk, uk or wk are continuous valued.
  Have to discretize their ranges to convert to a discrete problem, see Fig. 12.
  Using more points gives a better approximation, but requires more computation.
  The situation is worse in higher dimensions: the "curse of dimensionality".
Figure 12: Discretization of the range of xk = [−1, 1] into a finite set of points.
Suboptimal Methods
3) In problems with imperfect state information, the conditional state distribution P_{xk|Ik} is of the form
  P_{xk|Ik} = ( P(xk = 1|Ik), P(xk = 2|Ik), ..., P(xk = n|Ik) )^T
Its range is [0, 1]^n (continuous).
Solving imperfect state information problems exactly is intractable except in special or very simple cases.
4) Real time constraints: data may not be available until shortly before, or data may change as the system is being controlled.
Suboptimal Methods
Will discuss a few methods for suboptimal solutions
Certainty Equivalent Control (CEC)
Rollout Algorithms
Model Predictive Control (MPC)
Many other methods in Vol. I Ch.6 and Vol. II of Bertsekas.
Certainty Equivalent Control
Idea
  Replace a stochastic problem with a deterministic one.
  At each time k, fix the future uncertain quantities to some "typical" values, e.g. replace wk with E[wk].
Procedure (Online Version)
At each time k:
(1) Fix wi, i ≥ k, to some w̄i. Solve the deterministic problem
    min_{uk, uk+1, ..., uN−1} [ ∑_{i=k}^{N−1} gi(xi, ui, w̄i) + gN(xN) ],
    assuming xi+1 = fi(xi, ui, w̄i), i = k, k + 1, ..., N − 1, ui ∈ Ui(xi).
(2) Use the first control in the optimal control sequence {ūk, ūk+1, ..., ūN−1} found, i.e. µk(xk) = ūk.
Certainty Equivalent Control
Equivalent Procedure (Offline Version)
(1) Fix wk to some w̄k for k = 0, 1, ..., N − 1. Solve the deterministic problem
    min_{µ0, µ1, ..., µN−1} [ ∑_{k=0}^{N−1} gk(xk, µk(xk), w̄k) + gN(xN) ],
    assuming xk+1 = fk(xk, µk(xk), w̄k), k = 0, 1, ..., N − 1, µk(xk) ∈ Uk(xk).
(2) Let {µd0, µd1, ..., µdN−1} be the solution to the problem above. At each time k, apply µk(xk) = µdk(xk).
Certainty Equivalent Control
Comments:
  N problems have to be solved in the online version, one in the offline version.
  The online and offline versions give the same controller if the data is not changing. Use the online version if the data is changing.
  For problems with imperfect state information, also replace xk by an estimate x̄k(Ik) (e.g. x̄k(Ik) = E[xk|Ik]).
  Certainty equivalent control often performs well in practice.
  For the linear quadratic control problem, the certainty equivalent controller is equivalent to the optimal controller.
  Can fix some disturbances while leaving others stochastic, e.g. for imperfect state information problems, replace xk by x̄k(Ik) while leaving wk stochastic.
Rollout Algorithms
One-step lookahead policy, with the optimal cost to go approximated by the cost to go of some base policy.
"Rollout" was coined by Gerald Tesauro in 1996, in the context of rolling dice in a backgammon-playing computer program.
  A given backgammon position is evaluated by "rolling out" many games starting from that position, and taking the average.
The rollout policy has a cost improvement property.
It often produces substantial improvement over the base policy.
Rollout Algorithms
One-step lookahead policy
  At each k and xk we use the control µk(xk) that solves the problem
    min_{uk∈Uk} E{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) }
  where J̃N = gN, and J̃k+1 is an approximation to the true cost to go Jk+1.
Rollout policy
  The case where the approximation J̃k is the cost to go of some heuristic base policy.
Example: Quiz problem
N questions given.
Question i is answered correctly with probability pi, with reward vi if correct.
The quiz terminates at the first incorrect answer.
Choose the order of questions to maximize the total expected reward.
Index policy: answer questions in decreasing order of pi vi / (1 − pi).
  The index policy is optimal when there are no other constraints (Ch. 4.5 Bertsekas).
Now assume there is a limit (< N) on the maximum number of questions to be answered.
  Then the index policy is in general not optimal.
Example: Quiz problem
Rollout algorithm: use the index policy as the base policy.
  At a state denoting the subset of questions already answered, compute the expected reward R(j) for each possible next question j, assuming the order of the remaining questions follows the index policy.
  Answer the question with maximum R(j).
R(j) can be computed analytically, since given an order of questions (i1, i2, ..., iM), with M ≤ N, the expected reward is
  pi1(vi1 + pi2(vi2 + pi3(... + piM viM) ...))
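The nested expected reward formula and the index policy can be sketched as follows; the probabilities and rewards below are illustrative:

```python
def expected_reward(order, p, v):
    """p_{i1}(v_{i1} + p_{i2}(v_{i2} + ... + p_{iM} v_{iM})), evaluated
    from the innermost term outward."""
    total = 0.0
    for i in reversed(order):
        total = p[i] * (v[i] + total)
    return total

# illustrative data: three questions
p = [0.9, 0.5, 0.8]
v = [1.0, 10.0, 2.0]
# index policy: answer in decreasing order of p_i v_i / (1 - p_i)
order = sorted(range(len(p)), key=lambda i: p[i] * v[i] / (1 - p[i]), reverse=True)
print(order, expected_reward(order, p, v))   # [1, 0, 2] 6.17
```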
Example: Travelling Salesman Problem
N cities.
Assume the graph is complete.
Find the minimum cost tour that visits each city exactly once and returns to the starting city.
An important and difficult problem in combinatorial optimization.
Figure 13: Travelling Salesman Problem
Example: Travelling Salesman Problem
Nearest neighbour heuristic:
  Start from an arbitrary city.
  The next city visited is the one with minimum distance from the current city (that has not been previously visited).
Rollout algorithm: use the nearest neighbour heuristic as the base policy.
  For each city not yet visited, assume the nearest neighbour heuristic is run afterwards, and compute the cost of the resulting tour.
  Choose the next city as the one that gives the best tour.
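The steps above can be sketched in a few lines; a minimal version of the nearest neighbour heuristic and the rollout algorithm built on top of it, with an illustrative distance matrix (not the one from the figures):

```python
def tour_length(tour, D):
    return sum(D[tour[i]][tour[i + 1]] for i in range(len(tour) - 1))

def nearest_neighbour(D, start, partial):
    """Complete a partial tour greedily, then return to the start city."""
    tour = list(partial)
    while len(tour) < len(D):
        cur = tour[-1]
        tour.append(min((c for c in range(len(D)) if c not in tour),
                        key=lambda c: D[cur][c]))
    return tour + [start]

def rollout_tsp(D, start=0):
    """Rollout: choose each next city by completing the tour with the
    nearest neighbour heuristic and keeping the best completion."""
    partial = [start]
    while len(partial) < len(D):
        partial.append(min((c for c in range(len(D)) if c not in partial),
                           key=lambda c: tour_length(
                               nearest_neighbour(D, start, partial + [c]), D)))
    return partial + [start]

# illustrative symmetric distance matrix on 5 cities
D = [[0, 2, 9, 10, 7],
     [2, 0, 6, 4, 3],
     [9, 6, 0, 8, 5],
     [10, 4, 8, 0, 6],
     [7, 3, 5, 6, 0]]
print(tour_length(nearest_neighbour(D, 0, [0]), D))   # 28
print(tour_length(rollout_tsp(D), D))                 # 26
```

Here rollout improves on the plain nearest neighbour tour (26 vs 28), illustrating the cost improvement property.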
Example: Travelling Salesman Problem
Consider the travelling salesman problem for the graph shown below.
Let a be the node with which we start and end the tour. An optimaltour can be shown to be abcdea, with length 375.
Figure 14: Travelling Salesman Problem
Nearest neighbour (N.N.) heuristic gives tour aedbca with length 550.
Example: Travelling Salesman Problem
Rollout algorithm with nearest neighbour heuristic as base policy:
1st stage (each partial tour is completed by the nearest neighbour (N.N.) heuristic):
  ab + cdea (N.N.): length = 375
  ac + bdea (N.N.): length = 550
  ad + ebca (N.N.): length = 625
  ae + dbca (N.N.): length = 550
So the next node should be b.
2nd stage:
  abc + dea (N.N.): length = 375
  abd + eca (N.N.): length = 650
  abe + dca (N.N.): length = 675
So the next node is c.
Example: Travelling Salesman Problem
Rollout algorithm with nearest neighbour heuristic as base policy:
3rd stage:
  abcd + ea (N.N.): length = 375
  abce + da (N.N.): length = 425
So the next node is d.
4th stage:
  abcdea = tour computed by rollout, with length 375.
Cost Improvement Property of Rollout Algorithm
Theorem:
Let J̄k(xk) be the cost to go of the rollout policy. Let Jk(xk) be the cost to go of the base policy. Then
  J̄k(xk) ≤ Jk(xk), ∀xk, k
Proof: use induction.
Initial step:
By definition
  J̄N(xN) = JN(xN) = gN(xN), ∀xN
Cost Improvement Property of Rollout Algorithm
Induction step:
Assume J̄l(xl) ≤ Jl(xl), ∀xl, l = N − 1, N − 2, ..., k + 1.
Want to show J̄k(xk) ≤ Jk(xk).
Let µ̄k(xk) be the control applied by the rollout policy, and µk(xk) the control applied by the base policy. Then
  J̄k(xk) = E[gk(xk, µ̄k(xk), wk) + J̄k+1(fk(xk, µ̄k(xk), wk))]   (definition of the cost to go J̄k)
          ≤ E[gk(xk, µ̄k(xk), wk) + Jk+1(fk(xk, µ̄k(xk), wk))]   (induction hypothesis)
          ≤ E[gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk))]    (µ̄k(xk) minimizes this expression, by definition of the rollout policy)
          = Jk(xk)   (definition of the cost to go Jk)
By induction, J̄k(xk) ≤ Jk(xk), ∀k, xk.
Difficulties In Using Rollout
For stochastic problems, cost to go Jk of base policy may still bedifficult to evaluate analytically.
Need to approximate Jk using e.g. Monte Carlo simulations, orcertainty equivalence.
Model Predictive Control (MPC)
Originated and widely used in process control industries.
Concepts have since been applied to many areas.
Idea:
Compute a set of m control signals which optimizes the objective over a finite horizon m, using a model that predicts the system outputs at future times.
The first element of this set is applied to the system.
Repeat the process at the next time step, in a receding horizon manner.
Model Predictive Control (MPC)
Figure 15: Model Predictive Control. A: computed set of control signals at time 0 (u*_0 is applied). B: computed set of control signals at time 1 (u*_1 is applied). C: computed set of control signals at time 2 (u*_2 is applied).
Model Predictive Control (MPC)
Well suited to systems with control/state constraints, nonlinear systems, etc.
Corresponds to an m-step lookahead policy with cost to go approximation equal to zero.
As m increases, performance "usually" improves (see Bertsekas for counter-examples).
A larger m requires a higher amount of computation.
For nonlinear systems, the computation of the control signals often needs to be done numerically.
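For the unconstrained linear quadratic case, one MPC step reduces to an m-step Riccati recursion whose first gain is applied; a minimal sketch of the receding horizon loop, with illustrative system matrices and horizon:

```python
import numpy as np

def mpc_control(x, A, B, Q, R, m):
    """One MPC step: solve the m-step unconstrained LQ problem by a
    backward Riccati recursion and return only the first control."""
    K = Q.copy()
    L0 = None
    for _ in range(m):
        S = B.T @ K @ B + R
        L0 = -np.linalg.solve(S, B.T @ K @ A)   # first-stage gain so far
        K = Q + A.T @ K @ (A + B @ L0)          # Riccati update
    return L0 @ x

# receding horizon simulation on an illustrative double integrator
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
x = np.array([1.0, 1.0])
for _ in range(50):
    x = A @ x + B @ mpc_control(x, A, B, Q, R, m=10)
print(np.linalg.norm(x))   # the receding horizon loop drives the state towards zero
```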
Outline
6 Infinite Horizon Problems
  Discounted Cost Problems
  Average Cost Problems
Infinite Horizon Problems
Infinite number of stages.
Assume the system is stationary, i.e. f(·, ·, ·), g(·, ·, ·) and the distribution of wk don't depend on time k.
"Different" algorithms are needed.
Optimal policies often have a simple stationary form that does not depend on time.
But the analysis is more difficult than for finite horizon problems (won't be covered in this course).
Types of Infinite Horizon Problems
1 Total cost problems
  min_{µk} lim_{N→∞} E[ ∑_{k=0}^{N−1} g(xk, µk(xk), wk) ]
  Not commonly used, because the cost function often goes to infinity.
  A special type, called the stochastic shortest path problem, with a cost-free termination state, is studied in Bertsekas.
2 Discounted cost problems
  min_{µk} lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) ]
  where γ ∈ (0, 1).
  γ is called the discount factor.
  Cost incurred at earlier times is more important than at later times.
Types of Infinite Horizon Problems
3 Average cost problems
  min_{µk} lim_{N→∞} (1/N) E[ ∑_{k=0}^{N−1} g(xk, µk(xk), wk) ]
  Cost incurred in the future is more important than at the beginning (any finite initial segment does not affect the average).
  The cost function is usually finite, in contrast to total cost problems.
  The optimal average cost is usually independent of the initial state.
Discounted Cost Problems
min_{µk} lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) ]
Assume
  the number of states is finite, taking values 1, 2, ..., n;
  the number of possible controls is finite.
Notation
Given a policy π = {µ0, µ1, ...}, the cost of the policy starting at state i is
  Jπ(i) = lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) | x0 = i ]
If the policy is stationary, i.e. π = {µ, µ, ...}, write Jµ(i) instead of Jπ(i).
The optimal cost starting at state i is
  J*(i) = min_π lim_{N→∞} E[ ∑_{k=0}^{N−1} γ^k g(xk, µk(xk), wk) | x0 = i ]
An optimal policy π* satisfies Jπ*(i) = J*(i), ∀i.
Theorem
For the discounted cost problem we have:
(a) The value iteration algorithm
  Jk+1(i) = min_{u∈U(i)} E[g(i, u, w) + γ Jk(f(i, u, w))], i = 1, 2, ..., n
converges as k → ∞ to the optimal costs J*(i), i = 1, 2, ..., n, starting from arbitrary J0(i), i = 1, 2, ..., n.
(b) The optimal costs J*(i), i = 1, 2, ..., n, satisfy the Bellman equation
  J*(i) = min_{u∈U(i)} E[g(i, u, w) + γ J*(f(i, u, w))], i = 1, 2, ..., n
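Part (a) can be sketched directly for a finite MDP, with transition probability matrices replacing the expectation over w; a minimal version, assuming every control is admissible in every state (the toy transition probabilities and costs below are illustrative):

```python
import numpy as np

def value_iteration(P, g, gamma, tol=1e-10):
    """Value iteration for a finite discounted MDP.

    P[u, i, j] = transition probability i -> j under control u,
    g[u, i]    = expected stage cost in state i under control u.
    """
    J = np.zeros(P.shape[1])
    while True:
        J_new = (g + gamma * P @ J).min(axis=0)   # Bellman update
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new

# illustrative 2-state, 2-control MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
g = np.array([[1.0, 2.0],
              [1.5, 0.5]])
J = value_iteration(P, g, gamma=0.9)
print(J)   # (approximate) fixed point of the Bellman equation
```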
Theorem
(c) Given a stationary policy µ, the cost Jµ(i), i = 1, ..., n, satisfies
  Jµ(i) = E[g(i, µ(i), w) + γ Jµ(f(i, µ(i), w))], i = 1, ..., n
Starting from arbitrary J0(i), i = 1, ..., n, the iteration
  Jk+1(i) = E[g(i, µ(i), w) + γ Jk(f(i, µ(i), w))]
converges to Jµ(i), i = 1, ..., n.
(d) A stationary policy µ is optimal iff for every state i, µ(i) attains the minimum in the Bellman equation.
(e) The policy iteration algorithm
  µ^{k+1}(i) = argmin_{u∈U(i)} E[g(i, u, w) + γ Jµ^k(f(i, u, w))], i = 1, 2, ..., n
generates an improving sequence of policies and terminates (in finite time) with an optimal policy.
Proof: see Bertsekas.
Comments
Parts (a) and (e) provide algorithms for solving discounted costproblems (like the D.P. algorithm for finite horizon problems). Valueiteration (part (a)) requires less computation at every iteration, whilepolicy iteration (part (e)) is guaranteed to terminate in finite time.
In part (c),

    Jµ(i) = E[g(i, µ(i), w) + γ Jµ(f(i, µ(i), w))],   i = 1, ..., n

is a system of n linear equations, with which one can solve for Jµ(i), i = 1, ..., n.
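For instance, writing the expectation with a transition matrix P_mu and expected stage-cost vector g_mu (both made up here), the system becomes (I − γ P_mu) Jµ = g_mu and can be solved directly; for a 2-state sketch, Cramer's rule suffices:

```python
# Evaluating a fixed stationary policy by solving the n linear equations (sketch).
# P_mu[i][j] = P(j | i, mu(i)) and g_mu[i] = expected stage cost (made-up numbers).
GAMMA = 0.9
P_mu = [[0.8, 0.2],
        [0.3, 0.7]]
g_mu = [2.0, 0.5]

# Part (c) reads J_mu = g_mu + GAMMA * P_mu J_mu, i.e. (I - GAMMA*P_mu) J_mu = g_mu.
# Solve the 2x2 system by Cramer's rule.
a = 1 - GAMMA * P_mu[0][0]; b = -GAMMA * P_mu[0][1]
c = -GAMMA * P_mu[1][0];    d = 1 - GAMMA * P_mu[1][1]
det = a * d - b * c
J_mu = [(g_mu[0] * d - b * g_mu[1]) / det,
        (a * g_mu[1] - c * g_mu[0]) / det]
```

For larger n one would use a general linear solver instead of the hand-written 2×2 elimination.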
Policy Iteration
Starting with a stationary policy µ0, generate a sequence µ1, µ2, ... of stationary policies.
Given µk, perform the policy evaluation step, to compute Jµk(i), i = 1, 2, ..., n, using

    Jµk(i) = E[g(i, µk(i), w) + γ Jµk(f(i, µk(i), w))],   i = 1, ..., n
Given Jµk(·), perform the policy improvement step

    µ_{k+1}(i) = argmin_{u∈U(i)} E[g(i, u, w) + γ Jµk(f(i, u, w))]

with Jµk(f(i, u, w)) the "cost to go of the old policy" (cf. rollout algorithm).
Terminate when Jµk (i) = Jµk+1(i),∀i .
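The loop above can be sketched as follows; the 2-state MDP (P, g) is made up, and policy evaluation here uses the fixed-point iteration from part (c) rather than solving the linear system:

```python
# Policy iteration for a discounted-cost MDP (sketch; the 2-state data is made up).
GAMMA = 0.9
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[u][i][j] under control u = 0
     [[0.5, 0.5], [0.1, 0.9]]]   # under control u = 1
g = [[2.0, 0.5], [1.0, 3.0]]     # g[u][i]
n, m = 2, 2

def evaluate(mu, sweeps=2000):
    # Policy evaluation: iterate J <- g_mu + GAMMA * P_mu J (converges by part (c)).
    J = [0.0] * n
    for _ in range(sweeps):
        J = [g[mu[i]][i] + GAMMA * sum(P[mu[i]][i][j] * J[j] for j in range(n))
             for i in range(n)]
    return J

def improve(J):
    # Policy improvement: minimizing control w.r.t. the cost to go of the old policy.
    return [min(range(m),
                key=lambda u, i=i: g[u][i] + GAMMA * sum(P[u][i][j] * J[j]
                                                         for j in range(n)))
            for i in range(n)]

mu = [0] * n
while True:
    J_mu = evaluate(mu)
    mu_next = improve(J_mu)
    if mu_next == mu:        # terminate when the policy stops changing
        break
    mu = mu_next
```

At termination the evaluated cost satisfies the Bellman equation, so the final µ is optimal, per part (d).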
Example
A manufacturer at each time period:
Receives an order with probability p, no order with probability 1− p.
May process all unfilled orders at cost K > 0, or process no orders.
Cost per unfilled order at each time period is C > 0.
max. no. unfilled orders is n.
Find a processing policy that minimizes the discounted cost, with discount factor γ.
Example
Let state = no. unfilled orders at the start of each period (∈ {0, 1, ..., n}).

Bellman equation: For states i = 0, 1, ..., n − 1, we can either process orders or not, so the Bellman equation is

    J*(i) = min{ K + γp J*(1) + γ(1−p) J*(0),  Ci + γp J*(i+1) + γ(1−p) J*(i) }

where the first term is the cost of processing (pay K; the next state is 1 if a new order is received, 0 if not) and the second term is the cost of not processing (pay Ci; the next state is i + 1 if a new order is received, i if not).

For state i = n, all orders must be processed, so the Bellman equation is

    J*(n) = K + γp J*(1) + γ(1−p) J*(0)
Can show that the optimal policy is a threshold policy: process orders iff i ≥ m*, where m* is a threshold (see Bertsekas).
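This Bellman equation can be checked numerically by value iteration; the parameter values below (K, C, p, γ, n) are made up for illustration, and the computed policy indeed comes out as a threshold rule:

```python
# Value iteration for the order-processing example (illustrative parameters).
K, C, p, gamma, n = 5.0, 1.0, 0.5, 0.9, 10

J = [0.0] * (n + 1)
for _ in range(2000):
    Jn = [0.0] * (n + 1)
    for i in range(n + 1):
        # Process all orders: pay K, backlog resets, a new order arrives w.p. p.
        process = K + gamma * (p * J[1] + (1 - p) * J[0])
        if i < n:
            # Don't process: pay C per unfilled order, backlog grows w.p. p.
            dont = C * i + gamma * (p * J[i + 1] + (1 - p) * J[i])
            Jn[i] = min(process, dont)
        else:
            Jn[i] = process          # at i = n, all orders must be processed
    J = Jn

# Recover the optimal policy: process iff the processing term attains the min.
policy = [K + gamma * (p * J[1] + (1 - p) * J[0])
          <= C * i + gamma * (p * J[i + 1] + (1 - p) * J[i])
          for i in range(n)] + [True]
m_star = policy.index(True)          # threshold state m*
```

With these numbers the policy is monotone in i, i.e. process iff i ≥ m*, matching the threshold structure.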
Average Cost Problems
    min_{µk} lim_{N→∞} (1/N) E[ Σ_{k=0}^{N−1} g(x_k, µ_k(x_k), w_k) ]
In most problems of this type, the average cost per stage of a policy is independent of the initial state.

Expresses costs incurred in the long run; costs incurred in the early stages do not matter.

Analysis is harder than for discounted cost problems (won't cover here).

Assume
  - the number of states is finite, taking values 1, 2, ..., n.
  - the number of possible controls is finite.

Also assume that there is some state t such that, for all initial states and policies, t is visited infinitely often with probability 1.
Theorem
For the average cost problem, we have:
(a) The optimal average cost per stage λ* is the same for all initial states, and there exists a vector h* = (h*(1), h*(2), ..., h*(n)) satisfying the Bellman equation:

    λ* + h*(i) = min_{u∈U(i)} E[g(i, u, w) + h*(f(i, u, w))],   i = 1, ..., n

(h* is unique if we fix h*(t) = 0.) If µ(i) attains the minimum in the Bellman equation for all i, then the stationary policy µ is optimal.
(b) If λ and h satisfy the Bellman equation, then λ is the optimal average cost per stage for each initial state.
Theorem
(c) Given a stationary policy µ with average cost per stage λµ, there exists a vector hµ = (hµ(1), ..., hµ(n)) such that

    λµ + hµ(i) = E[g(i, µ(i), w) + hµ(f(i, µ(i), w))],   i = 1, ..., n

(hµ is unique if we fix hµ(t) = 0.)
Proof: See Bertsekas
Comment: h is also called the differential cost vector.
Example
A manufacturer at each time period:
Receives an order with probability p, no order with probability 1− p.
May process all unfilled orders at cost K > 0, or process no orders.
Cost per unfilled order at each time period is C > 0.
max. no. unfilled orders is n.
Find processing policy that minimizes the average cost.
Example
State = no. unfilled orders at the start of each period.
State 0 is the special state t here (it will be visited infinitely often).
Bellman equation: For states 0, 1, ..., n − 1, the Bellman equation is

    λ* + h*(i) = min{ K + p h*(1) + (1−p) h*(0),  Ci + p h*(i+1) + (1−p) h*(i) }
For state n, the Bellman equation is

    λ* + h*(n) = K + p h*(1) + (1−p) h*(0)
Optimal policy: Process orders if

    K + p h*(1) + (1−p) h*(0) ≤ Ci + p h*(i+1) + (1−p) h*(i)
Can again show that a threshold policy is optimal, where the value of the threshold may differ from the value of the threshold in the discounted cost problem.
Algorithms For Average Cost Problems
Value iteration:
Starting from any J0, compute

    J_{k+1}(i) = min_{u∈U(i)} E[g(i, u, w) + J_k(f(i, u, w))],   i = 1, ..., n

We have

    lim_{k→∞} J_k(i)/k = λ*,   ∀i
Drawbacks of value iteration:

Often components of Jk will diverge to ∞ or −∞, so calculating lim_{k→∞} J_k(i)/k may be tricky.

Doesn't compute a differential cost vector h*.
Algorithms For Average Cost Problems
Relative value iteration:
Subtract a constant (dependent on k) from all components of Jk, so that the difference hk is bounded, e.g.

    h_k(i) = J_k(i) − J_k(s),   i = 1, ..., n

where s is some fixed state. The relative value iteration algorithm is then:

    h_{k+1}(i) = min_{u∈U(i)} E[g(i, u, w) + h_k(f(i, u, w))] − min_{u∈U(s)} E[g(s, u, w) + h_k(f(s, u, w))],   i = 1, ..., n
Can show that hk → h∗ as k →∞
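A sketch of relative value iteration on a small average-cost MDP; the 2-state transition probabilities P[u][i][j] and stage costs g[u][i] below are made up, and s is the fixed reference state:

```python
# Relative value iteration for an average-cost MDP (sketch; data is made up).
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[u][i][j] under control u = 0
     [[0.5, 0.5], [0.1, 0.9]]]   # under control u = 1
g = [[2.0, 0.5], [1.0, 3.0]]     # g[u][i]
n, m, s = 2, 2, 0                # s = fixed reference state

def T(h, i):
    # (Th)(i) = min_u E[g(i, u, w) + h(f(i, u, w))]
    return min(g[u][i] + sum(P[u][i][j] * h[j] for j in range(n))
               for u in range(m))

h = [0.0] * n
for _ in range(2000):
    Ts = T(h, s)
    h = [T(h, i) - Ts for i in range(n)]   # keeps h(s) = 0 and h bounded

lam = T(h, s)    # at convergence lam + h(i) = (Th)(i), so lam = lambda*
```

Subtracting T applied at the reference state pins h(s) = 0, so the iterates stay bounded even though plain value iteration would diverge linearly in k.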
Algorithms For Average Cost Problems
Policy iteration:
Given µk, perform the policy evaluation step to compute λk and hk, using the equations:

    λk + h_k(i) = E[g(i, µk(i), w) + h_k(f(i, µk(i), w))],   i = 1, ..., n
    h_k(t) = 0 for some state t which is visited infinitely often.
Given λk and hk, perform the policy improvement step:

    µ_{k+1}(i) = argmin_{u∈U(i)} E[g(i, u, w) + h_k(f(i, u, w))],   i = 1, ..., n
Terminate when λk+1 = λk and hk+1(i) = hk(i), i = 1, . . . , n.
Policy iteration can be shown to terminate in finite time.
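The evaluation step solves the n equations above together with h(t) = 0. For a made-up 2-state problem the elimination can be done by hand, giving this policy iteration sketch (same hypothetical P and g as before):

```python
# Policy iteration for a 2-state average-cost MDP (sketch; data is made up).
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[u][i][j] under control u = 0
     [[0.5, 0.5], [0.1, 0.9]]]   # under control u = 1
g = [[2.0, 0.5], [1.0, 3.0]]     # g[u][i]
n, m, t = 2, 2, 0                # reference state t with h(t) = 0

def evaluate(mu):
    # Solve lam + h(i) = g(i, mu(i)) + sum_j P(j|i,mu(i)) h(j), with h(0) = 0.
    # For n = 2 the unknowns are lam and h(1); eliminating by hand:
    g0, g1 = g[mu[0]][0], g[mu[1]][1]
    p01, p11 = P[mu[0]][0][1], P[mu[1]][1][1]
    h1 = (g1 - g0) / (1 + p01 - p11)
    return g0 + p01 * h1, [0.0, h1]          # (lam, h)

def improve(h):
    return [min(range(m),
                key=lambda u, i=i: g[u][i] + sum(P[u][i][j] * h[j]
                                                 for j in range(n)))
            for i in range(n)]

mu = [0, 0]
for _ in range(10):              # finitely many policies, so this suffices
    lam, h = evaluate(mu)
    mu_next = improve(h)
    if mu_next == mu:            # lam and h now satisfy Bellman's equation
        break
    mu = mu_next
```

At termination λk and hk satisfy the average-cost Bellman equation, so by part (a) of the theorem the final policy is optimal.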
Outline
7 Introduction to Reinforcement Learning
Introduction to Reinforcement Learning
As in the finite horizon case, want to consider suboptimal methods for solving infinite horizon problems.

Also studied in machine learning as reinforcement learning.

Many different methods, e.g. Q-learning, TD/SARSA(λ), REINFORCE, ...
  - References: Sutton & Barto, "Reinforcement Learning"; Bertsekas Vol. II
Introduction to Reinforcement Learning
Slight change of notation:

State xk → state sk

Control uk → action ak

Cost function g(·, ·, ·) → reward function g(·, ·, ·)

Cost minimization

    min_{µk} lim_{N→∞} E[ Σ_{k=0}^{N−1} γ^k g(x_k, µ_k(x_k), w_k) ]

→ reward maximization

    max_{ak} lim_{N→∞} E[ Σ_{k=0}^{N−1} γ^k g(s_k, a_k(s_k), w_k) ]
Q-Learning for Discounted Problems
Bellman equation

    J*(s) = max_a E[g(s, a, w) + γ J*(f(s, a, w))],   ∀s

J*(s) is the optimal expected future reward when in state s.

Introduce now the Q-Bellman equation

    Q*(s, a) = E[g(s, a, w) + γ max_{a′} Q*(s′, a′)],   ∀(s, a)

where s′ := f(s, a, w).
  - The Q-factor Q(s, a) is the expected future reward when in state s and taking action a
  - Q* are the optimal Q-factors
Q-Learning for Discounted Problems
Can also solve the Q-Bellman equation using value iteration or policy iteration.

Given Q*(s, a), the optimal policy can be computed as

    a*(s) = argmax_a Q*(s, a)
Using Q*(s, a) gives the same policy as using J*(s), though it requires more storage.

However, one advantage is that Q*(s, a) can be found approximately using e.g. Q-learning.
Q-Learning for Discounted Problems
Q-learning algorithm. Repeat:
Generate (sk, ak) using any probabilistic mechanism such that all state-action pairs (s, a) are chosen infinitely often.

Given (sk, ak), update Q(sk, ak) as:

    Q_{k+1}(sk, ak) = Q_k(sk, ak) + αk ( r + γ max_{a′} Q_k(s′, a′) − Q_k(sk, ak) )

where r = g(sk, ak, wk) is the sampled reward, s′ = f(sk, ak, wk) is the sampled next state when the current state is sk and action ak is applied, and {αk} is a sequence converging to 0.

Leave all other Q-factors unchanged.
Q-Learning for Discounted Problems
Q-learning algorithm converges to the optimal Q-factors Q*(s, a) provided all pairs (s, a) are chosen infinitely often, and the sequence {αk} satisfies

    αk > 0,   Σ_{k=0}^∞ αk = ∞,   Σ_{k=0}^∞ αk² < ∞

  - e.g. αk = 1/k satisfies this condition
In Sutton & Barto, {(sk, ak)} is generated according to:

    s_{k+1} := s′

    a_{k+1} = random a                          w.p. ε
              argmax_a Q_{k+1}(s_{k+1}, a)      w.p. 1 − ε

for some ε > 0.
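Putting the update rule and the ε-greedy generation together, here is a Q-learning sketch on a small reward-maximization MDP; the 2-state model (next-state distributions P[s][a] and rewards g[s][a]) and all parameter values are made up for illustration:

```python
import random

# Q-learning with epsilon-greedy exploration (sketch; the MDP data is made up).
random.seed(0)
P = [[[0.8, 0.2], [0.2, 0.8]],   # state 0: next-state dist. for actions 0, 1
     [[0.7, 0.3], [0.1, 0.9]]]   # state 1: next-state dist. for actions 0, 1
g = [[2.0, 0.0], [0.0, 3.0]]     # g[s][a] = sampled reward (deterministic here)
GAMMA, EPS = 0.5, 0.1
n, m = 2, 2

Q = [[0.0] * m for _ in range(n)]
visits = [[0] * m for _ in range(n)]
s = 0
for _ in range(300_000):
    # epsilon-greedy selection: ensures all (s, a) are chosen infinitely often
    if random.random() < EPS:
        a = random.randrange(m)
    else:
        a = max(range(m), key=lambda b: Q[s][b])
    s2 = random.choices(range(n), weights=P[s][a])[0]   # sampled next state
    r = g[s][a]                                         # sampled reward
    visits[s][a] += 1
    alpha = 1.0 / visits[s][a]        # alpha_k -> 0, satisfies the sum conditions
    Q[s][a] += alpha * (r + GAMMA * max(Q[s2]) - Q[s][a])
    s = s2
```

After enough iterations the learned Q-factors approach the solution of the Q-Bellman equation, and the greedy policy argmax_a Q(s, a) matches the optimal one.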
Function Approximation and Deep Reinforcement Learning
For large problems:
  - Too many (state, action) pairs to store in memory
  - Too slow to learn the value of each Q*(s, a) individually

Function approximation:
  - Regard Q*(s, a) as a function of s and a. Approximate Q*(s, a) by another function Q(s, a, θ) parameterized by a set of weights θ
  - Learn the weights θ instead of the entire set of values Q*(s, a)

Deep reinforcement learning: when the weights θ are learnt using a deep neural network, see https://deepmind.com/blog/deep-reinforcement-learning

Spectacular recent advances in AI using deep reinforcement learning, e.g. AlphaGo, AlphaZero.
Deep Reinforcement Learning - Further Reading
Overview
  - https://deepmind.com/blog/deep-reinforcement-learning
  - http://www0.cs.ucl.ac.uk/staff/d.silver/web/Talks.html

Deep Q-Network (DQN) algorithm
  - https://arxiv.org/pdf/1312.5602.pdf
  - https://keon.io/deep-q-learning

More Advanced
  - https://arxiv.org/pdf/1509.02971.pdf
  - https://medium.com/tensorflow/deep-reinforcement-learning-playing-cartpole-through-asynchronous-advantage-actor-critic-a3c-7eab2eea5296