Topics in Reinforcement Learning: Lessons from AlphaZero ...



Page 1: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Topics in Reinforcement Learning: Lessons from AlphaZero for

(Sub)Optimal Control and Discrete Optimization

Arizona State University, Course CSE 691, Spring 2022

Links to Class Notes, Videolectures, and Slides at http://web.mit.edu/dimitrib/www/RLbook.html

Dimitri P. Bertsekas, [email protected]

Lecture 2: Stochastic Finite and Infinite Horizon DP

Bertsekas Reinforcement Learning 1 / 29

Page 2: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Outline

1 Finite Horizon Deterministic Problem - Approximation in Value Space

2 Stochastic DP Algorithm

3 Linear Quadratic Problems - An Important Favorable Special Case

4 Infinite Horizon - An Overview of Theory and Algorithms

Bertsekas Reinforcement Learning 2 / 29

Page 3: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Review - Finite Horizon Deterministic Problem

[Figure: finite horizon deterministic problem. At stage k, applying control uk at state xk incurs the cost gk(xk, uk) and leads to the next state xk+1; the trajectory runs from the initial state x0 through stages 1, . . . , N to the terminal state xN.]

System
$$x_{k+1} = f_k(x_k, u_k), \qquad k = 0, 1, \ldots, N-1$$
where $x_k$: state, $u_k$: control chosen from some set $U_k(x_k)$

Arbitrary state and control spaces

Cost function:
$$g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)$$

For given initial state $x_0$, minimize over control sequences $\{u_0, \ldots, u_{N-1}\}$
$$J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)$$

Optimal cost function
$$J^*(x_0) = \min_{\substack{u_k \in U_k(x_k) \\ k=0,\ldots,N-1}} J(x_0; u_0, \ldots, u_{N-1})$$

Bertsekas Reinforcement Learning 4 / 29
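As a concrete illustration of this formulation, here is a minimal Python sketch of a hypothetical toy instance (integer states, controls in {-1, 0, 1}, quadratic stage and terminal costs; the names f, g, gN, U, total_cost are illustrative, not from the slides). It evaluates the cost J(x0; u0, . . . , uN-1) of a given control sequence by forward simulation of xk+1 = fk(xk, uk).

```python
# Hypothetical toy finite horizon deterministic problem (illustration only).
N = 4                                    # horizon

def f(k, x, u):                          # system equation: x_{k+1} = f_k(x_k, u_k)
    return x + u

def g(k, x, u):                          # stage cost g_k(x_k, u_k)
    return x**2 + u**2

def gN(x):                               # terminal cost g_N(x_N)
    return 10 * x**2

def U(k, x):                             # control constraint set U_k(x_k)
    return (-1, 0, 1)

def total_cost(x0, controls):
    """Evaluate J(x0; u_0, ..., u_{N-1}) by simulating the system forward."""
    x, J = x0, 0.0
    for k, u in enumerate(controls):
        J += g(k, x, u)
        x = f(k, x, u)
    return J + gN(x)

print(total_cost(x0=3, controls=[-1, -1, -1, 0]))   # cost of one particular control sequence
```

Minimizing total_cost over all control sequences is exactly the DP problem; the DP algorithm on the next slide organizes that minimization stage by stage.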

Page 4: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Review - DP Algorithm for Deterministic Problems

Go backward to compute the optimal costs $J_k^*(x_k)$ of the $x_k$-tail subproblems (off-line training - involves lots of computation)

Start with
$$J_N^*(x_N) = g_N(x_N), \quad \text{for all } x_N,$$
and for $k = 0, \ldots, N-1$, let
$$J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \Big[g_k(x_k, u_k) + J_{k+1}^*\big(f_k(x_k, u_k)\big)\Big], \quad \text{for all } x_k.$$
Then the optimal cost $J^*(x_0)$ is obtained at the last step: $J_0^*(x_0) = J^*(x_0)$.

Go forward to construct the optimal control sequence $\{u_0^*, \ldots, u_{N-1}^*\}$ (on-line play)

Start with
$$u_0^* \in \arg\min_{u_0 \in U_0(x_0)} \Big[g_0(x_0, u_0) + J_1^*\big(f_0(x_0, u_0)\big)\Big], \qquad x_1^* = f_0(x_0, u_0^*).$$
Sequentially, going forward, for $k = 1, 2, \ldots, N-1$, set
$$u_k^* \in \arg\min_{u_k \in U_k(x_k^*)} \Big[g_k(x_k^*, u_k) + J_{k+1}^*\big(f_k(x_k^*, u_k)\big)\Big], \qquad x_{k+1}^* = f_k(x_k^*, u_k^*).$$

Bertsekas Reinforcement Learning 5 / 29
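The backward and forward passes above can be tabulated directly when the state space is finite. Below is a minimal Python sketch on a hypothetical toy instance (integer states in a bounded range, controls in {-1, 0, 1}, quadratic costs; all problem data are assumptions for illustration).

```python
# Exact DP on a hypothetical toy finite horizon deterministic problem.
N = 4
STATES = range(-10, 11)                       # finite state space used for tabulation

def f(k, x, u):  return x + u                 # x_{k+1} = f_k(x_k, u_k)
def g(k, x, u):  return x**2 + u**2           # stage cost g_k(x_k, u_k)
def gN(x):       return 10 * x**2             # terminal cost g_N(x_N)
def U(k, x):     return [u for u in (-1, 0, 1) if f(k, x, u) in STATES]

# Backward pass (off-line training):
#   J*_N(x) = g_N(x),  J*_k(x) = min_{u in U_k(x)} [ g_k(x,u) + J*_{k+1}(f_k(x,u)) ].
J = [None] * (N + 1)
J[N] = {x: gN(x) for x in STATES}
for k in range(N - 1, -1, -1):
    J[k] = {x: min(g(k, x, u) + J[k + 1][f(k, x, u)] for u in U(k, x))
            for x in STATES}

# Forward pass (on-line play): pick u*_k attaining the minimum at the current state.
def optimal_sequence(x0):
    x, seq = x0, []
    for k in range(N):
        u = min(U(k, x), key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
        seq.append(u)
        x = f(k, x, u)
    return seq

x0 = 3
print("J*(x0) =", J[0][x0], "  optimal controls:", optimal_sequence(x0))
```

The backward pass is the computation-heavy off-line part; the forward pass only needs the stored tables J[k] and the system equation.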

Page 5: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Q-Factors for Deterministic Problems

An alternative (and equivalent) form of the DP algorithm

Generates the optimal Q-factors, defined for all $(x_k, u_k)$ and $k$ by
$$Q_k^*(x_k, u_k) = g_k(x_k, u_k) + J_{k+1}^*\big(f_k(x_k, u_k)\big)$$

The optimal cost function $J_k^*$ can be recovered from the optimal Q-factor $Q_k^*$:
$$J_k^*(x_k) = \min_{u_k \in U_k(x_k)} Q_k^*(x_k, u_k)$$

The DP algorithm can be written in terms of Q-factors:
$$Q_k^*(x_k, u_k) = g_k(x_k, u_k) + \min_{u_{k+1} \in U_{k+1}(f_k(x_k, u_k))} Q_{k+1}^*\big(f_k(x_k, u_k), u_{k+1}\big)$$

Exact and approximate forms of this and other related algorithms form an important class of RL methods known as Q-learning.

Bertsekas Reinforcement Learning 6 / 29
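The same kind of toy instance can be solved in the Q-factor form of the recursion: the Q-factor tables are built backward, and J*_k is recovered by minimizing over the control (a sketch under the same illustrative assumptions as before).

```python
# Q-factor form of the DP algorithm on a hypothetical toy instance.
N = 4
STATES = range(-10, 11)

def f(k, x, u):  return x + u
def g(k, x, u):  return x**2 + u**2
def gN(x):       return 10 * x**2
def U(k, x):     return [u for u in (-1, 0, 1) if f(k, x, u) in STATES]

# Q*_{N-1}(x,u) = g_{N-1}(x,u) + g_N(f_{N-1}(x,u));  for k < N-1,
# Q*_k(x,u) = g_k(x,u) + min_{u'} Q*_{k+1}(f_k(x,u), u').
Q = [None] * N
Q[N - 1] = {(x, u): g(N - 1, x, u) + gN(f(N - 1, x, u))
            for x in STATES for u in U(N - 1, x)}
for k in range(N - 2, -1, -1):
    Q[k] = {(x, u): g(k, x, u)
                    + min(Q[k + 1][f(k, x, u), up] for up in U(k + 1, f(k, x, u)))
            for x in STATES for u in U(k, x)}

# Recover the optimal cost: J*_k(x) = min_u Q*_k(x, u).
J0 = {x: min(Q[0][x, u] for u in U(0, x)) for x in STATES}
print("J*_0(3) =", J0[3])
```

One appeal of the Q-factor form: once the tables Q*_k are available, on-line play is just an arg min of Q*_k(x_k, u) over u, with no need to evaluate f_k or g_k at decision time.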

Page 6: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Approximation in Value Space

We replace $J_k^*$ with an approximation $\tilde J_k$ during on-line play

Start with
$$u_0 \in \arg\min_{u_0 \in U_0(x_0)} \Big[g_0(x_0, u_0) + \tilde J_1\big(f_0(x_0, u_0)\big)\Big]$$
Set $x_1 = f_0(x_0, u_0)$

Sequentially, going forward, for $k = 1, 2, \ldots, N-1$, set
$$u_k \in \arg\min_{u_k \in U_k(x_k)} \Big[g_k(x_k, u_k) + \tilde J_{k+1}\big(f_k(x_k, u_k)\big)\Big], \qquad x_{k+1} = f_k(x_k, u_k)$$

How do we compute $\tilde J_{k+1}(x_{k+1})$? This is one of the principal issues in RL

Off-line problem approximation: Use as $\tilde J_{k+1}$ the optimal cost function of a simpler problem, computed off-line by exact DP

On-line approximate optimization, e.g., solve on-line a shorter horizon problem by multistep lookahead minimization and simple terminal cost (often done in MPC)

Parametric cost approximation: Obtain $\tilde J_{k+1}(x_{k+1})$ from a parametric class of functions $\tilde J(x_{k+1}, r)$, where $r$ is a parameter, e.g., training using data and a NN.

Rollout with a heuristic: We will focus on this for the moment.

Bertsekas Reinforcement Learning 7 / 29
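A minimal sketch of one-step lookahead with a cost approximation $\tilde J$, again on a hypothetical toy instance; here $\tilde J$ is a handcrafted quadratic guess standing in for any of the schemes listed above (problem approximation, parametric approximation, rollout, etc.).

```python
# One-step lookahead with a cost approximation J~ (hypothetical illustration).
N = 4

def f(k, x, u):  return x + u            # x_{k+1} = f_k(x_k, u_k)
def g(k, x, u):  return x**2 + u**2      # stage cost g_k(x_k, u_k)
def gN(x):       return 10 * x**2        # terminal cost g_N(x_N)
def U(k, x):     return (-1, 0, 1)

def J_tilde(k, x):
    # Crude stand-in for J*_k: a handcrafted quadratic guess.  In practice this
    # could come from problem approximation, multistep lookahead, a trained
    # parametric architecture, or rollout with a heuristic.
    return 2.0 * x**2 if k < N else gN(x)

def lookahead_control(k, x):
    """Return the u_k minimizing g_k(x,u) + J~_{k+1}(f_k(x,u))."""
    return min(U(k, x), key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))

# On-line play: generate a trajectory with the one-step lookahead policy.
x = 3
for k in range(N):
    u = lookahead_control(k, x)
    print(f"k={k}  x={x}  chose u={u}")
    x = f(k, x, u)
```

Swapping in a different approximation changes only the single function J_tilde; the on-line lookahead loop stays the same.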

Page 7: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Rollout for Finite-State Deterministic Problems

[Figure: rollout for a finite-state deterministic problem. From the current state xk of the trajectory x0, x1, . . . , xk, each candidate control uk, u′k, u′′k leads to a next state xk+1, x′k+1, x′′k+1; the base heuristic is run from each next state to the end of the horizon (terminal states xN, x′N, x′′N), producing a heuristic Q-factor for each control.]

Sec. 1.2 Deterministic Dynamic Programming 25


Figure 1.2.9 Schematic illustration of rollout with one-step lookahead for a deterministic problem. At state $x_k$, for every pair $(x_k, u_k)$, $u_k \in U_k(x_k)$, the base heuristic generates an approximate Q-factor
$$\tilde Q_k(x_k, u_k) = g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big),$$
and selects the control $\tilde\mu_k(x_k)$ with minimal Q-factor.

Starting from the next state $x_{k+1} = f_k(x_k, u_k)$, the base heuristic generates a sequence of controls $\{u_{k+1}, \ldots, u_{N-1}\}$ and the corresponding cost
$$H_{k+1}(x_{k+1}) = g_{k+1}(x_{k+1}, u_{k+1}) + \cdots + g_{N-1}(x_{N-1}, u_{N-1}) + g_N(x_N).$$

The rollout algorithm then applies the control that minimizes over $u_k \in U_k(x_k)$ the tail cost expression for stages $k$ to $N$:
$$g_k(x_k, u_k) + H_{k+1}(x_{k+1}).$$

Equivalently, and more succinctly, the rollout algorithm applies at state $x_k$ the control $\tilde\mu_k(x_k)$ given by the minimization
$$\tilde\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} \tilde Q_k(x_k, u_k), \tag{1.14}$$
where $\tilde Q_k(x_k, u_k)$ is the approximate Q-factor defined by
$$\tilde Q_k(x_k, u_k) = g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big); \tag{1.15}$$
see Fig. 1.2.9. Rollout defines a suboptimal policy $\tilde\pi = \{\tilde\mu_0, \ldots, \tilde\mu_{N-1}\}$, referred to as the rollout policy, where for each $x_k$ and $k$, $\tilde\mu_k(x_k)$ is the control produced by the Q-factor minimization (1.14).

Note that the rollout algorithm requires running the base heuristic a number of times that is bounded by $Nn$, where $n$ is an upper bound on the number of control choices available at each state. Thus if $n$ is small relative to $N$, it requires computation equal to a small multiple of $N$ times the computation time for a single application of the base heuristic. Similarly, if $n$ is bounded by a polynomial in $N$, the ratio of the rollout algorithm's computation time to the base heuristic's computation time is a polynomial in $N$.

Cost approximation by running a heuristic from states of interest

We generate a single system trajectory $\{x_0, x_1, \ldots, x_N\}$ by on-line play

Upon reaching $x_k$, we compute, for all $u_k \in U_k(x_k)$, the corresponding next states $x_{k+1} = f_k(x_k, u_k)$

From each of the next states $x_{k+1}$ we run the heuristic and compute the heuristic cost $H_{k+1}(x_{k+1})$

We apply the $u_k$ that minimizes over $u_k \in U_k(x_k)$ the (heuristic) Q-factor
$$g_k(x_k, u_k) + H_{k+1}(x_{k+1})$$

We generate the next state $x_{k+1} = f_k(x_k, u_k)$ and repeat

Bertsekas Reinforcement Learning 8 / 29
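A minimal sketch of deterministic rollout on a hypothetical toy instance: the base heuristic is a simple greedy rule (pick the control with the smallest immediate stage cost), H_k is the cost of running it to the end of the horizon, and the rollout control minimizes the heuristic Q-factor. All problem data and the choice of heuristic are illustrative assumptions.

```python
# Rollout with a base heuristic on a hypothetical toy deterministic problem.
N = 6

def f(k, x, u):  return x + u
def g(k, x, u):  return x**2 + u**2
def gN(x):       return 10 * x**2
def U(k, x):     return (-1, 0, 1)

def heuristic_cost(k, x):
    """H_k(x): cost of running the greedy base heuristic from x at stage k."""
    H = 0.0
    for j in range(k, N):
        u = min(U(j, x), key=lambda u: g(j, x, u))   # greedy one-stage choice
        H += g(j, x, u)
        x = f(j, x, u)
    return H + gN(x)

def rollout_control(k, x):
    """Minimize the heuristic Q-factor g_k(x,u) + H_{k+1}(f_k(x,u)) over u in U_k(x)."""
    return min(U(k, x), key=lambda u: g(k, x, u) + heuristic_cost(k + 1, f(k, x, u)))

# On-line play with the rollout policy.  For a sequentially consistent base
# heuristic (this greedy rule is one), deterministic rollout is guaranteed to
# do at least as well as the base heuristic (cost improvement property).
x, total = 3, 0.0
for k in range(N):
    u = rollout_control(k, x)
    total += g(k, x, u)
    x = f(k, x, u)
total += gN(x)
print("rollout cost:", total, "  base heuristic cost:", heuristic_cost(0, 3))
```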

Page 8: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Traveling Salesman Example

[Figure: traveling salesman example with four cities A, B, C, D and a matrix of intercity travel costs. Nodes correspond to partial tours (A; AB, AC, AD; ABC, ABD, ACB, ACD, ADB, ADC) and to complete tours (ABCD, ABDC, ACBD, ACDB, ADBC, ADCB); rollout completes each partial tour using the nearest neighbor heuristic.]

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1
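
The labels above come from the label correcting shortest path figure: node j's label is updated when di + aij improves both the current label dj and the best path length UPPER found so far. A minimal sketch of such a method, under the assumption that the graph is given as a dictionary of outgoing arcs with nonnegative lengths (graph, origin, destination are illustrative names, not from the slides):

import math
from collections import deque

def label_correcting(graph, origin, destination):
    # graph[i] = list of (j, a_ij) arcs with nonnegative lengths a_ij
    d = {i: math.inf for i in graph}           # labels d_i
    d[origin] = 0.0
    upper = math.inf                           # best path length to the destination so far
    open_list = deque([origin])                # OPEN list (INSERT / REMOVE)
    while open_list:
        i = open_list.popleft()                # REMOVE a node from OPEN
        for j, a_ij in graph.get(i, []):
            # The two tests from the figure: is d_i + a_ij < d_j and < UPPER?
            if d[i] + a_ij < d[j] and d[i] + a_ij < upper:
                d[j] = d[i] + a_ij             # YES: set d_j = d_i + a_ij
                if j == destination:
                    upper = d[j]               # tighten the upper bound
                else:
                    open_list.append(j)        # INSERT j into OPEN
    return upper

# Small example: shortest path from 's' to 't'
g = {'s': [('a', 1), ('b', 4)], 'a': [('b', 1), ('t', 5)], 'b': [('t', 1)], 't': []}
print(label_correcting(g, 's', 't'))           # -> 3.0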

[Figure text: a stochastic shortest path example. Recoverable content: initial state s, terminal state t (destination), states 1 and 2, and the Bellman equations J(1) = min{c, a + J(2)}, J(2) = b + J(1); a control example with Prob. u, Prob. 1 − u, Cost 1, Cost 1 − √u; p-stable policies π ∈ Pp,x0 with J(xk) → 0, the set Wp+ = {J ∈ J | J+ ≤ J}, and convergence of VI from within Wp+; proper and improper policies µ; fixed-point iteration labels f(x; θk), f(x; θk+1), xk+1 = F(xk), x∗ = F(x∗), Fµk(x), Fµk+1(x).]
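
The example above is governed by the pair J(1) = min{c, a + J(2)} and J(2) = b + J(1). A minimal value iteration sketch on these two equations, with illustrative values of a, b, c (not from the slides); whether and where the iterates converge depends on those values:

def value_iteration(a, b, c, iters=50):
    # Bellman equations of the two-state example:
    #   J(1) = min{ c, a + J(2) }   (stop at cost c, or move to state 2 at cost a)
    #   J(2) = b + J(1)
    J1, J2 = 0.0, 0.0
    for _ in range(iters):
        J1, J2 = min(c, a + J2), b + J1   # one (Jacobi) VI sweep
    return J1, J2

# Illustrative values with a + b > 0, so the cycling (improper) policy has infinite cost
print(value_iteration(a=1.0, b=1.0, c=3.0))   # -> (3.0, 4.0)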

[Figure text: feature-based cost approximation, aggregation, and rollout. Recoverable content: a feature map F(i) = (F1(i), ..., Fs(i)) and a feature-based parametric architecture J̃µ(F(i), r) ≈ Jµ(i) with weight vector r = (r1, ..., rs); if J̃µ(F(i), r) = Σℓ=1..s Fℓ(i) rℓ it is a linear feature-based architecture; aggregation (original states, aggregate states I1, ..., Iq formed by a neural network or other scheme, possibly with "handcrafted" features, aggregation and disaggregation probabilities, aggregate costs r∗ℓ, cost functions J0(i), J1(j), and generation of an "improved" policy µ̃ by "solving" the aggregate problem); rollout with a base heuristic and a corrected J̃; AlphaZero-style position "value" and move "probabilities" (the same algorithm learned multiple games, Go, Shogi, and plays much better than all chess programs); discounted transition costs αk g(i, u, j) with transition probabilities pij(u) on a controlled Markov chain; the sets Wp, Wp′, and W+ = {J | J ≥ J+, J(t) = 0}, with VI converging to Jp from within Wp and to J+ from within W+ for cost g(xk, uk) ≥ 0.]
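
The labels above describe a linear feature-based architecture, J̃µ(F(i), r) = Σℓ Fℓ(i) rℓ. A minimal sketch of fitting the weight vector r by least squares to sampled costs of a policy µ; the feature vectors and cost samples below are illustrative placeholders, not data from the slides:

import numpy as np

def fit_linear_cost_approximation(features, sampled_costs):
    # features[m]      = F(i_m) = (F_1(i_m), ..., F_s(i_m)) for a sampled state i_m
    # sampled_costs[m] ~ J_mu(i_m), e.g. obtained by simulating policy mu from i_m
    Phi = np.asarray(features, dtype=float)            # M x s feature matrix
    beta = np.asarray(sampled_costs, dtype=float)      # M cost samples
    r, *_ = np.linalg.lstsq(Phi, beta, rcond=None)     # least-squares weight vector r
    return r

def J_tilde(F_i, r):
    # Linear feature-based architecture: J_tilde(i, r) = sum_l F_l(i) * r_l
    return float(np.dot(F_i, r))

# Illustrative data: 2 features, costs roughly 2*F1 + 0.5*F2
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
costs = [2.1, 0.4, 2.6, 4.4]
r = fit_linear_cost_approximation(feats, costs)
print(r, J_tilde([1.5, 1.0], r))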

[Figure text: approximation in value space and rollout. Recoverable content: cost-to-go approximation and expected value approximation; base policy µ with cost Jµ = TµJµ and rollout policy µ̃ obtained through Tµ̃Jµ = TJµ (policy improvement, i.e., a Newton step: the linearized Bellman equation at Jµ yields the rollout policy µ̃); the one-step lookahead / Q-factor minimization min u∈U(x) Σy=1..n pxy(u)( g(x, u, y) + αJ̃(y) ) with approximation of E{·} and approximate minimization; multiagent Q-factor minimization with controls u1k, u2k, states x1k, x2k, possible states xk+1, ..., xk+m+1, and constraint set X; truncated rollout with a base policy, an m-step value network, and a policy network; rollout with a base off-line obtained policy; a linear quadratic example with Jµ1(x)/Jµ0(x) = K1/K0 and F(K) = αrK/(r + αb²K) + 1; a protein folding example (current partial folding, moving obstacle, complete folding, expert); changing system, cost, and constraint parameters.]
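
The minimization that recurs in the labels above is min over u ∈ U(x) of Σy pxy(u)( g(x, u, y) + αJ̃(y) ). A minimal sketch of the resulting one-step lookahead control selection (rollout corresponds to J̃ = Jµ for a base policy µ), assuming the transition probabilities, stage costs, and cost approximation J̃ are available as arrays; all names below are illustrative:

import numpy as np

def one_step_lookahead_policy(x, U, p, g, J_tilde, alpha):
    # p[u][x][y]: transition probability p_xy(u);  g[u][x][y]: stage cost g(x, u, y)
    # J_tilde[y]: cost-to-go approximation;        alpha: discount factor
    def q_factor(u):
        return sum(p[u][x][y] * (g[u][x][y] + alpha * J_tilde[y])
                   for y in range(len(J_tilde)))
    return min(U, key=q_factor)    # control attaining the one-step lookahead minimum

# Illustrative 2-state, 2-control example
p = np.array([[[0.9, 0.1], [0.2, 0.8]],      # control 0
              [[0.5, 0.5], [0.5, 0.5]]])     # control 1
g = np.ones((2, 2, 2))                       # unit stage costs
J_tilde = np.array([0.0, 10.0])              # approximation favors state 0
print(one_step_lookahead_policy(x=0, U=[0, 1], p=p, g=g, J_tilde=J_tilde, alpha=0.9))  # -> 0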

[Figure text: rollout for deterministic and discrete optimization. Recoverable content: partial solutions (u0, ..., uk), candidate extensions Sk(u0, ..., uk, uk+1) with uk+1 ∈ Uk+1, and complete solutions (u0, ..., uk, uk+1, ..., uN−1); a base heuristic / expert that ranks complete solutions (compares complete foldings); trajectories T0, T1, T2, the current trajectory Tk and next trajectory Tk+1, with base heuristic costs Hk(xk) and Hk+1(xk+1), and heuristic from xk with x̃0 = x0; the monotonicity property under sequential improvement: Cost of Tk ≥ Cost of Tk+1, i.e., Cost of R0 ≥ · · · ≥ Cost of Rk ≥ Cost of Rk+1 ≥ · · · ≥ Cost of RN.]
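
The chain Cost of R0 ≥ · · · ≥ Cost of RN above is the sequential improvement property of rollout with a base heuristic. A minimal sketch of deterministic rollout: at each step the control is chosen by minimizing the stage cost plus the base heuristic's cost Hk+1(xk+1) from the next state. The toy system, costs, and heuristic below are illustrative assumptions, not from the slides:

def rollout(x0, N, controls, f, g, base_heuristic):
    # base_heuristic(k, x) = H_k(x): cost of completing the trajectory from (k, x)
    x, total, plan = x0, 0.0, []
    for k in range(N):
        # Rollout step: minimize stage cost plus base heuristic cost from x_{k+1}
        u_k = min(controls(k, x),
                  key=lambda u: g(k, x, u) + base_heuristic(k + 1, f(k, x, u)))
        total += g(k, x, u_k)
        plan.append(u_k)
        x = f(k, x, u_k)
    return total, plan

def make_base_heuristic(N, f, g):
    # Simple base heuristic: from (k, x), repeatedly step toward 0 until stage N
    def H(k, x):
        cost = 0.0
        while k < N:
            u = 1 if x < 0 else -1
            cost += g(k, x, u)
            x = f(k, x, u)
            k += 1
        return cost
    return H

# Illustrative scalar example
N = 3
f = lambda k, x, u: x + u
g = lambda k, x, u: abs(x + u)
H = make_base_heuristic(N, f, g)
print(rollout(2, N, lambda k, x: (-1, +1), f, g, H))   # -> (2.0, [-1, -1, -1])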

[Figure text: the linear quadratic case and the Newton step interpretation. Recoverable content: the Bellman equation on the space of quadratic functions J(x) = Kx²; the Riccati equation K = F(K) and value iteration Kk+1 = F(Kk), i.e., Jk+1(x) = Kk+1x² = F(Kk)x²; the one-step lookahead policy µ̃ corresponding to a Newton iterate starting from K (tangent line of F at K, with the tangent line of an unstable policy shown, slope = 1 reference line); region of stability, which is also the region of convergence of Newton's method applied to the Riccati equation, vs. instability region; stable policies, an unstable policy, and the optimal policy; the region where sequential improvement holds (TJ ≤ J); cost of the rollout policy µ̃ vs. cost of the base policy µ; value space approximation J̃, Jµ = TµJµ, TµJ̃; match win probability 1, 0, pw (sudden death); incidental cost labels 28, 27, 13.]
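
The labels above refer to value iteration on the space of quadratic functions, Kk+1 = F(Kk), whose fixed point solves the Riccati equation K = F(K). A minimal sketch using the scalar form F(K) = αrK/(r + αb²K) + 1 suggested by the slide (unit state-cost and unit system coefficient assumed; the numerical values below are illustrative):

def riccati_operator(K, alpha, b, r):
    # F(K) = alpha*r*K / (r + alpha*b**2*K) + 1, the Bellman (Riccati) operator
    # acting on quadratic functions J(x) = K*x**2
    return alpha * r * K / (r + alpha * b * b * K) + 1.0

def riccati_value_iteration(K0, alpha, b, r, iters=100):
    K = K0
    for _ in range(iters):
        K = riccati_operator(K, alpha, b, r)     # K_{k+1} = F(K_k)
    return K

# Illustrative parameters; the iterates approach the fixed point K* = F(K*)
print(riccati_value_iteration(K0=0.0, alpha=0.9, b=2.0, r=1.0))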

[Figure text: the parking example and partial observation (belief state) diagram. Recoverable content: parking spaces with costs c(0), ..., c(k), c(k + 1), ..., c(N − 1) and garage cost c(N); stages 1, 2, 3, ..., N − 1, N, indices k and k + 1; heuristic cost of the "future"; a system xk+1 = fk(xk, uk, wk) with observations; belief state pk, controller µk, control uk = µk(pk), and Q-factors at the current state xk; state trajectory x0, x1, ..., xk, ..., xN with branching next states x′k+1, x″k+1 (and terminal states x′N, x″N) under alternative controls u′k, u″k; initial state x0, terminal state t, Length = 1; the two-state example J(1) = min{c, a + J(2)}, J(2) = b + J(1) and the p-stable policy sets Wp+ repeated from earlier slides.]
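
The labels above describe a partially observed problem in which the controller acts on the belief state pk via uk = µk(pk). A minimal sketch of the Bayes update producing the next belief from the current belief, the applied control, and the new observation; the transition and observation models below are illustrative assumptions, not from the slides:

import numpy as np

def belief_update(p, u, z, P, O):
    # p[x]        : current belief p_k over the hidden states
    # P[u][x][x'] : transition probabilities under control u
    # O[x'][z]    : probability of observing z in next state x'
    predicted = p @ P[u]                 # predict: sum_x p_k(x) P(x' | x, u)
    unnormalized = predicted * O[:, z]   # correct with the observation likelihood
    return unnormalized / unnormalized.sum()

# Illustrative 2-state example with a single control u = 0
P = np.array([[[0.8, 0.2], [0.3, 0.7]]])
O = np.array([[0.9, 0.1], [0.2, 0.8]])
p0 = np.array([0.5, 0.5])
print(belief_update(p0, u=0, z=1, P=P, O=O))     # next belief p_{k+1}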

[Figure text: approximation in value space. Recoverable content: a traveling salesman example (initial city, current partial tour, next cities, nearest neighbor heuristic); the tail DP equation Jk+1(xk+1) = min uk+1∈Uk+1(xk+1) E{ gk+1(xk+1, uk+1, wk+1) + Jk+2( fk+1(xk+1, uk+1, wk+1) ) }; the one-step lookahead minimization min uk E{ gk(xk, uk, wk) + J̃k+1(xk+1) } with Approximate Min, Approximate E{·}, and Approximate Cost-to-Go J̃k+1 (Simplify E{·}, certainty equivalence, Monte Carlo tree search); two-step and ℓ-step lookahead viewed as a lookahead tree / shortest path problem over states xk, xk+1, xk+2; truncated rollout with terminal cost approximation J̃; cost function approximation J̃k+ℓ via parametric approximation, neural nets, or discretization; rollout and model predictive control; tail subproblem at time k with terminal cost gN(xN) and optimal control sequence {u∗0, ..., u∗k, ..., u∗N−1}; permanent trajectory Pk and tentative trajectory Tk; the projected equation Φr = Π( Tµ^(λ)(Φr) ) on the subspace M = {Φr | r ∈ ℜm}, with µ(i) ∈ arg min u∈U(i) Q̃µ(i, u, r) and the comparison of good vs. poor approximation (σ(ξ) = ln(1 + eξ) vs. max{0, ξ}).]
∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution

1

Initial City Current Partial Tour Next Cities Nearest Neighbor

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Complete Tours Current Partial Tour Next Cities Next States

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Start End Plus Terminal Cost Approximation S1 S2 S3 S` Sm�1 Sm

Generate Improved Policy µ Next Partial Tour

Generate “Improved” Policy µ by µ(i) 2 arg minu2U(i) Qµ(i, u, r)

State i y(i) Ay(i) + b �1(i, v) �m(i, v) �2(i, v) J(i, v) = r0�(i, v)

Deterministic Transition xk+1 = fk(xk, uk)

Cost Vector r⇤ J1 = Corrected V Enlarged State Space Cost J0 Cost J1 Cost r⇤

Representative States Controls u are associated with states i

N Stages jN�1 jN i 2 Ix j 2 Iy

Sx1 Sx` Sxm x1 x` xm r⇤x1r⇤x`

r⇤xmFootprint Sets J(i) J(j) =

Py2A �jyr⇤y

minu2U(i)

nX

j=1

pij(u)�g(i, u, j) + ↵J(j)

�i = x Ix

⇡/4 Sample State xsk Sample Control us

k Sample Next State xsk+1 Sample Transition Cost gs

k Simulator

Representative States x (Coarse Grid) Critic Actor Approximate PI Aggregate Problem

pxy(u) =

nX

j=1

pxj(u)�jy g(x, u) =

nX

j=1

pxj(u)g(x, u, j)

Range of Weighted Projections J⇤(i) Original States to States (Fine Grid) Original State Space

dxi = 0 for i /2 Ix �jy = 1 for j 2 Iy �jy = 0 or 1 for all j and y Each j connects to a single x

x pxj1(u) pxj2(u) pxj3(u) �j1y1 �j1y2 �j1y3 �jy with Aggregation Probabilities �jy = 0 or 1

Relate to Rm r⇤m�1 r⇤m x0k+1

Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)

Random Transition xk+1 = fk(xk, uk, wk) Random Cost gk(xk, uk, wk) Representative Features

1

Start End Plus Terminal Cost Approximation S1 S2 S3 S` Sm�1 Sm

Generate Improved Policy µ Next Partial Tours

Generate “Improved” Policy µ by µ(i) 2 arg minu2U(i) Qµ(i, u, r)

State i y(i) Ay(i) + b �1(i, v) �m(i, v) �2(i, v) J(i, v) = r0�(i, v)

Deterministic Transition xk+1 = fk(xk, uk)

Cost Vector r⇤ J1 = Corrected V Enlarged State Space Cost J0 Cost J1 Cost r⇤

Representative States Controls u are associated with states i

N Stages jN�1 jN i 2 Ix j 2 Iy

Sx1 Sx` Sxm x1 x` xm r⇤x1r⇤x`

r⇤xmFootprint Sets J(i) J(j) =

Py2A �jyr⇤y

minu2U(i)

nX

j=1

pij(u)�g(i, u, j) + ↵J(j)

�i = x Ix

⇡/4 Sample State xsk Sample Control us

k Sample Next State xsk+1 Sample Transition Cost gs

k Simulator

Representative States x (Coarse Grid) Critic Actor Approximate PI Aggregate Problem

pxy(u) =

nX

j=1

pxj(u)�jy g(x, u) =

nX

j=1

pxj(u)g(x, u, j)

Range of Weighted Projections J⇤(i) Original States to States (Fine Grid) Original State Space

dxi = 0 for i /2 Ix �jy = 1 for j 2 Iy �jy = 0 or 1 for all j and y Each j connects to a single x

x pxj1(u) pxj2(u) pxj3(u) �j1y1 �j1y2 �j1y3 �jy with Aggregation Probabilities �jy = 0 or 1

Relate to Rm r⇤m�1 r⇤m x0k+1

Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)

Random Transition xk+1 = fk(xk, uk, wk) Random Cost gk(xk, uk, wk) Representative Features

1

Bertsekas Reinforcement Learning 9 / 29

Page 9: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Stochastic DP Problems - Perfect State Observation (We Know xk )


System xk+1 = fk(xk, uk, wk) with random “disturbance” wk (e.g., physical noise, market uncertainties, demand for inventory, unpredictable breakdowns, etc.)

Cost function: E{ gN(xN) + ∑_{k=0}^{N−1} gk(xk, uk, wk) }

Policies π = {µ0, . . . , µN−1}, where µk is a “closed-loop control law” or “feedback policy”, i.e., a function of xk: a “lookup table” for the control uk = µk(xk) to apply at xk.

An important point: Using feedback (i.e., choosing controls with knowledge of the state) is beneficial in view of the stochastic nature of the problem.

For given initial state x0, minimize over all π = {µ0, . . . , µN−1} the cost

Jπ(x0) = E{ gN(xN) + ∑_{k=0}^{N−1} gk(xk, µk(xk), wk) }

Optimal cost function: J∗(x0) = minπ Jπ(x0). Optimal policy π∗: Jπ∗(x0) = J∗(x0)
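Since Jπ(x0) is an expectation over the disturbances w0, . . . , wN−1, it can be estimated by simulating the closed-loop system. Below is a minimal Monte Carlo sketch; the callables f, g, gN, mu, and sample_w (system, stage cost, terminal cost, policy, disturbance model) are illustrative names, not from the slides:

import numpy as np

def policy_cost(f, g, gN, mu, x0, N, sample_w, num_runs=1000, rng=None):
    """Monte Carlo estimate of J_pi(x0) = E{ gN(xN) + sum_k gk(xk, mu_k(xk), wk) }."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu(k, x)                 # feedback: control depends on the observed state
            w = sample_w(k, rng)         # sample the random disturbance w_k
            cost += g(k, x, u, w)        # accumulate the stage cost
            x = f(k, x, u, w)            # move to the next state
        total += cost + gN(x)            # add the terminal cost
    return total / num_runs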

Bertsekas Reinforcement Learning 11 / 29

Page 10: Topics in Reinforcement Learning:Lessons from AlphaZero ...

The Stochastic DP Algorithm

Produces the optimal costs J∗k(xk) of the tail subproblems that start at xk

Start with J∗N(xN) = gN(xN), and going backward for k = N − 1, . . . , 0, let

J∗k(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + J∗k+1(fk(xk, uk, wk)) },   for all xk.

The optimal cost J∗(x0) is obtained at the last step: J∗0(x0) = J∗(x0).

The optimal policy component µ∗k can be constructed simultaneously with J∗k, and consists of the minimizing u∗k = µ∗k(xk) above.

Alternative on-line implementation of the optimal policy, given J∗1, . . . , J∗N−1

Sequentially, going forward, for k = 0, 1, . . . , N − 1, observe xk and apply

u∗k ∈ arg min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + J∗k+1(fk(xk, uk, wk)) }.

Issues: Need to know J∗k+1, compute the expectation for each uk, and minimize over all uk

Approximation in value space: Use an approximation J̃k in place of J∗k; approximate E{·} and the minimization over uk.
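For problems with finitely many states, controls, and disturbance values, the backward recursion above can be carried out exactly. A minimal sketch, assuming illustrative callables f, g, gN and a finite disturbance distribution given by w_vals/w_probs:

import numpy as np

def stochastic_dp(states, controls, w_vals, w_probs, f, g, gN, N):
    """Exact DP: J[k][x] = J*_k(x) and mu[k][x] = mu*_k(x) for a finite problem."""
    J = {N: {x: gN(x) for x in states}}          # J*_N(x_N) = g_N(x_N)
    mu = {}
    for k in range(N - 1, -1, -1):               # go backward: k = N-1, ..., 0
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_q = None, np.inf
            for u in controls(k, x):             # u_k in U_k(x_k)
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in zip(w_vals, w_probs))     # E_{w_k}{...}
                if q < best_q:
                    best_u, best_q = u, q
            J[k][x], mu[k][x] = best_q, best_u
    return J, mu

The same Q-factor computation, applied on-line at the current xk only, gives the forward implementation above.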

Bertsekas Reinforcement Learning 12 / 29

Page 11: Topics in Reinforcement Learning:Lessons from AlphaZero ...

A Very Favorable Case: Linear-Quadratic Problems

An example of a linear-quadratic problem
Keep the car velocity constant (like oversimplified cruise control): xk+1 = xk + buk + wk

Here xk = vk − v is the deviation of the vehicle's velocity vk at time k from the desired level v, and b is given

uk is unconstrained; wk has 0 mean and variance σ²

Cost over N stages: q xN² + ∑_{k=0}^{N−1} (q xk² + r uk²), where q > 0 and r > 0 are given

Consider a more general problem where the system is xk+1 = axk + buk + wk

The DP algorithm starts with J∗N(xN) = q xN², and generates J∗k according to

J∗k(xk) = min_{uk} E_{wk}{ q xk² + r uk² + J∗k+1(a xk + b uk + wk) },   k = 0, . . . , N − 1

The DP algorithm can be carried out in closed form to yield J∗k(xk) = Kk xk² + const and µ∗k(xk) = Lk xk, where Kk and Lk can be explicitly computed

The solution does not depend on the distribution of wk as long as it has 0 mean: Certainty Equivalence (a common approximation idea for other problems)

Bertsekas Reinforcement Learning 14 / 29

Page 12: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Derivation - DP Algorithm starting from Terminal Cost J∗N(x) = qx²

J∗N−1(xN−1) = min_{uN−1} E{ q xN−1² + r uN−1² + J∗N(a xN−1 + b uN−1 + wN−1) }

            = min_{uN−1} E{ q xN−1² + r uN−1² + q (a xN−1 + b uN−1 + wN−1)² }

            = min_{uN−1} [ q xN−1² + r uN−1² + q (a xN−1 + b uN−1)² + 2q E{wN−1}(a xN−1 + b uN−1) + q E{wN−1²} ]

            = q xN−1² + min_{uN−1} [ r uN−1² + q (a xN−1 + b uN−1)² ] + q σ²,

using E{wN−1} = 0 and E{wN−1²} = σ².

Minimize by setting the derivative to zero: 0 = 2 r uN−1 + 2 q b (a xN−1 + b uN−1), to obtain

µ∗N−1(xN−1) = LN−1 xN−1   with   LN−1 = − abq / (r + b²q),

and by substitution, J∗N−1(xN−1) = PN−1 xN−1² + q σ², where PN−1 = a²rq / (r + b²q) + q

Similarly, going backward, we obtain for all k:

J∗k(xk) = Pk xk² + σ² ∑_{m=k}^{N−1} Pm+1,   µ∗k(xk) = Lk xk,   Pk = a²r Pk+1 / (r + b² Pk+1) + q,   Lk = − ab Pk+1 / (r + b² Pk+1)
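The backward recursion for Pk and Lk is easy to implement directly. A minimal sketch for the scalar problem above (a, b, q, r, σ² as defined on the preceding slides):

def lq_riccati(a, b, q, r, sigma2, N):
    """Backward recursion: P_N = q,
    P_k = a^2 r P_{k+1} / (r + b^2 P_{k+1}) + q,   L_k = -a b P_{k+1} / (r + b^2 P_{k+1}).
    Also returns the constant terms sigma^2 * sum_{m=k}^{N-1} P_{m+1}."""
    P = [0.0] * (N + 1)
    L = [0.0] * N
    P[N] = q
    for k in range(N - 1, -1, -1):
        denom = r + b * b * P[k + 1]
        P[k] = a * a * r * P[k + 1] / denom + q
        L[k] = -a * b * P[k + 1] / denom
    const = [sigma2 * sum(P[m + 1] for m in range(k, N)) for k in range(N)]
    return P, L, const

Then J∗k(xk) = P[k]·xk² + const[k] and µ∗k(xk) = L[k]·xk; note that the gains L[k] do not depend on σ², consistent with certainty equivalence.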

Bertsekas Reinforcement Learning 15 / 29

Page 13: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Linear-Quadratic Problems in General

Observations and generalizations
The solution does not depend on the distribution of wk, only on its mean (which is 0), i.e., we have certainty equivalence

Generalization to multidimensional problems, nonzero mean disturbances, etc

Generalization to infinite horizon

Generalization to problems where the state is observed partially through linear measurements: the optimal policy involves an extended form of certainty equivalence

Lk E{xk | measurements}

where E{xk | measurements} is provided by an estimator (e.g., a Kalman filter); a minimal sketch is given after the list below. Linear systems and quadratic cost are a starting point for other lines of investigation and approximation:

Problems with safety/state constraints [Model Predictive Control (MPC)]
Problems with control constraints (MPC)
Unknown or changing system parameters (adaptive control)
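A minimal sketch of this extended certainty equivalence for the scalar system, under an assumed (hypothetical, not from the slides) linear measurement model zk = xk + vk with noise variance sigma_v2: a scalar Kalman filter produces E{xk | measurements}, and the control applies the LQ gain Lk to that estimate.

def ce_controller_step(a, b, L_k, x_hat, P, u_prev, z, sigma_w2, sigma_v2):
    """One step of certainty-equivalent control with a scalar Kalman filter.
    Assumed measurement model: z_k = x_k + v_k, v_k zero mean with variance sigma_v2."""
    # Prediction using the previous control
    x_pred = a * x_hat + b * u_prev
    P_pred = a * a * P + sigma_w2
    # Measurement update
    K = P_pred / (P_pred + sigma_v2)
    x_hat = x_pred + K * (z - x_pred)
    P = (1.0 - K) * P_pred
    # Extended certainty equivalence: apply the LQ gain to the state estimate
    u = L_k * x_hat
    return u, x_hat, P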

Bertsekas Reinforcement Learning 16 / 29

Page 14: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Approximation in Value Space - The Three Approximations

ℓ-step lookahead minimization at state xk:

min_{uk, µk+1, . . . , µk+ℓ−1} E{ gk(xk, uk, wk) + ∑_{m=k+1}^{k+ℓ−1} gm(xm, µm(xm), wm) + J̃k+ℓ(xk+ℓ) }


The three approximations:
Approximate minimization (simplified DP minimization)
Approximate E{·}: replace E{·} with nominal values (certainty equivalent control), or use limited simulation (Monte Carlo tree search)
Approximate cost-to-go J̃k+ℓ (could be approximate): simple choices, parametric approximation, problem approximation, rollout

One-step lookahead (ℓ = 1) special case:

min_{uk} E{ gk(xk, uk, wk) + J̃k+1(xk+1) }
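A minimal sketch of the one-step lookahead case with all three approximations visible: the minimization is a search over a (possibly restricted) control set, the expectation is replaced by a sample average, and J̃k+1 is any cost-to-go approximation. The callables controls, f, g, J_tilde, sample_w are illustrative assumptions:

import numpy as np

def one_step_lookahead(k, x, controls, f, g, J_tilde, sample_w, num_samples=20, rng=None):
    """Approximately solve  min_{u in U_k(x)} E{ g_k(x,u,w) + J_tilde_{k+1}(f_k(x,u,w)) }."""
    rng = np.random.default_rng() if rng is None else rng
    best_u, best_q = None, np.inf
    for u in controls(k, x):                                   # approximate minimization: finite search
        ws = [sample_w(k, rng) for _ in range(num_samples)]    # approximate E{.}: sample average
        q = float(np.mean([g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w)) for w in ws]))
        if q < best_q:
            best_u, best_q = u, q
    return best_u                                              # the lookahead (suboptimal) control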


Important variants: Use multistep lookahead, replace E{·} by limited simulation (e.g., a “certainty equivalent” of wk), multiagent rollout (for multicomponent control problems)

An example: Truncated rollout with a base policy and terminal cost approximation (however obtained, e.g., by off-line training)
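A minimal sketch of truncated rollout along these lines: the Q-factor of each candidate control is estimated by simulating the base policy for m steps and adding a terminal cost approximation at the state reached. The callables base_policy, f, g, J_hat, sample_w are illustrative assumptions:

import numpy as np

def truncated_rollout_q(k, x, u, base_policy, f, g, J_hat, m, sample_w, num_sims=10, rng=None):
    """Sampled Q-factor of (x, u): apply u, run the base policy for m steps,
    then add the terminal cost approximation J_hat at the state reached."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_sims):
        w = sample_w(k, rng)
        cost, y = g(k, x, u, w), f(k, x, u, w)
        for j in range(k + 1, k + 1 + m):          # truncated horizon of m base-policy steps
            uj = base_policy(j, y)
            wj = sample_w(j, rng)
            cost += g(j, y, uj, wj)
            y = f(j, y, uj, wj)
        total += cost + J_hat(k + 1 + m, y)        # terminal cost approximation
    return total / num_sims

def rollout_control(k, x, controls, **kwargs):
    """Rollout policy: choose the control with the smallest sampled Q-factor."""
    return min(controls(k, x), key=lambda u: truncated_rollout_q(k, x, u, **kwargs))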


Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

Multiagent Q-factor minimization xk Possible States xk+1 xk+m+1

Termination State Constraint Set X X = X X Multiagent

Current Partial Folding

Current Partial Folding Moving Obstacle

Complete Folding Corresponding to Open

Expert

Rollout with Base Policy m-Step

Approximation of E{·}: Approximate minimization:

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

x1k, u1

k u2k x2

k dk τ

Q-factor approximation

u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)

)∈ C

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u1 x2 u2 x3

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u0 x1 u1 x1

High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice

Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5

Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial

Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

Multiagent Q-factor minimization xk Possible States xk+1 xk+m+1

Termination State Constraint Set X X = X X Multiagent

Current Partial Folding

Current Partial Folding Moving Obstacle

Complete Folding Corresponding to Open

Expert

Rollout with Base Policy m-Step

Approximation of E{·}: Approximate minimization:

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

x1k, u1

k u2k x2

k dk τ

Q-factor approximation

u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)

)∈ C

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u1 x2 u2 x3

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u0 x1 u1 x1

High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice

Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5

Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial

Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

Sec. 1.3 Stochastic Dynamic Programming

Similar to the deterministic case, Q-learning involves the calculation of either the optimal Q-factors (1.16) or approximations Qk(xk, uk). The approximate Q-factors may be obtained using approximation in value space schemes, and can be used to obtain approximately optimal policies through the Q-factor minimization

µk(xk) ∈ arg min_{uk∈Uk(xk)} Qk(xk, uk).   (1.17)

In Chapter 4, we will discuss the use of neural networks in such approximations.

Cost Versus Q-factor Approximations - Robustness and On-Line Replanning

We have seen that it is possible to implement approximation in value space by using cost function approximations [cf. Eq. (1.15)] or by using Q-factor approximations [cf. Eq. (1.17)], so the question arises which one to use in a given practical situation. One important consideration is the facility of obtaining suitable cost or Q-factor approximations. This depends largely on the problem and also on the availability of data on which the approximations can be based. However, there are some other major considerations.

In particular, the cost function approximation scheme

µk(xk) ∈ arg min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) },

has an important disadvantage: the expected value above needs to be computed on-line for all uk ∈ Uk(xk), and this may involve substantial computation. On the other hand it also has an important advantage in situations where the system function fk, the cost per stage gk, or the control constraint set Uk(xk) can change as the system is operating. We will discuss in more detail how this situation can arise in practice later in this chapter. Assuming that the new values of fk, gk, or Uk(xk) become known to the controller, on-line replanning may be used, as discussed earlier for deterministic problems. This may improve substantially the robustness of the approximation in value space scheme.

By comparison, the Q-factor function approximation scheme (1.17) does not allow for on-line replanning. On the other hand, for problems where there is no need for on-line replanning, the Q-factor approximation scheme does not require the on-line computation of expected values and may allow for a much faster on-line computation of the minimizing control µk(xk) via Eq. (1.17).
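To make the contrast concrete, the following is a minimal sketch (not from the text; the finite transition model P, stage cost g, and approximation tables J_next and Q are illustrative assumptions). The first routine computes the expectation over successor states on-line, so an updated model can be substituted at run time; the second only looks up stored Q-factors.

# Sketch: one-step lookahead with a cost approximation (expectation computed
# on-line, hence amenable to on-line replanning) vs. a stored Q-factor table.
def control_from_cost_approx(x, U, P, g, J_next):
    # P[(x, u)] is a dict {y: p_xy(u)}, g(x, u, y) is the stage cost,
    # J_next[y] approximates the cost-to-go from successor state y.
    return min(U, key=lambda u: sum(p * (g(x, u, y) + J_next[y])
                                    for y, p in P[(x, u)].items()))

def control_from_q_approx(x, U, Q):
    # Q[(x, u)] is a precomputed approximate Q-factor; fast, but any change in
    # the model after Q was computed is not reflected here.
    return min(U, key=lambda u: Q[(x, u)])

If the model changes on-line, only the first routine benefits: one simply passes in the new P and g.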

1.3.2 Infinite Horizon Problems - An Overview

We will now provide an outline of infinite horizon stochastic DP with an emphasis on its aspects that relate to our RL/approximation methods.

Bertsekas Reinforcement Learning 17 / 29

Page 15: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Let’s Take a 15-min Working Break: Catch your Breath, Collect your Questions, and Consider the Following Challenge Puzzle

A chess match puzzle

A chess player plays a two-game match against a chess computer program.

A win counts for 1, a draw counts for 1/2, and a loss counts for 0, for both player and computer.

“Sudden death" games are played if the score is tied at 1-1 after the two games.The chess player can choose to play each game in one of two possible styles:

Bold play (wins with probability pw < 1/2 and loses with probability 1 − pw), or
Timid play (draws with probability pd < 1 and loses with probability 1 − pd).

The style for the 2nd game is chosen after seeing the outcome of the 1st game.

Note that the player plays worse than the computer (on average), regardless of the chosen style of play, and must play bold in at least one game to have any chance to win.

Speculate on the optimal policy of the player.

Is it possible for the player to have a better than 50-50 chance to win the match, even though the computer is the better player?

Bertsekas Reinforcement Learning 18 / 29

Page 16: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Answer: Depending on pw and pd , Player’s Win Prob. May be > 1/2

[Figure: DP tree for the two-game chess match: in each game the player chooses Timid Play (draw w.p. pd, loss w.p. 1 − pd) or Bold Play (win w.p. pw, loss w.p. 1 − pw), and the score evolves through 0 − 0, 1 − 0, 0 − 1, 2 − 0, 1.5 − 0.5, 1 − 1, 0.5 − 1.5, 0 − 2, with sudden death at 1 − 1. A companion plot shows the match win probability of the optimal style-choice policy as a function of pw.]

The optimal policy: Play bold in the 1st game. Then play bold again if the 1st game is lost, and timid if the 1st game is won (see the full DP solution in DPB’s DP textbook, Vol. I, Chapter 1; available from Google Books).

Example: For pw = 0.45 and pd = 0.9, the optimal style-of-play policy gives a match win probability of roughly 0.53 (a simple DP calculation that you can try; a sketch follows below).

Intuition: The player can use feedback, while the computer cannot.
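For reference, here is one way to carry out that DP calculation as a short Python sketch. It assumes (an assumption, since the slide does not spell out the sudden-death rule in detail) that a 1-1 tie is resolved by bold play, giving the player a win probability of pw from the tied score.

# DP over (games left in regulation, score lead of the player).
def match_win_prob(pw, pd):
    def V(games_left, lead):
        if games_left == 0:
            if lead > 0: return 1.0        # match already won
            if lead < 0: return 0.0        # match already lost
            return pw                      # 1-1 tie: sudden death, play bold
        # Timid: a draw keeps the lead, a loss decreases it; Bold: win/loss moves it by 1.
        timid = pd * V(games_left - 1, lead) + (1 - pd) * V(games_left - 1, lead - 1)
        bold  = pw * V(games_left - 1, lead + 1) + (1 - pw) * V(games_left - 1, lead - 1)
        return max(timid, bold)            # choose the better style for this game
    return V(2, 0)

print(match_win_prob(0.45, 0.9))           # about 0.537, i.e., better than 50-50

Tracing which of the two terms attains the max reproduces the policy stated above: bold in game 1, then timid after a win and bold after a loss.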

Bertsekas Reinforcement Learning 19 / 29

Page 17: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Infinite Horizon Problems

[Figure: stationary system xk+1 = f(xk, uk, wk) with discounted stage cost α^k g(xk, uk, wk), a cost-free termination state, and an infinite horizon.]

Infinite number of stages, and stationary system and cost

System xk+1 = f(xk, uk, wk) with state, control, and random disturbance.

Policies π = {µ0, µ1, . . .} with µk(x) ∈ U(x) for all x and k.

Cost of stage k: α^k g(xk, µk(xk), wk).

Cost of a policy π = {µ0, µ1, . . .}: The limit as N → ∞ of the N-stage costs

Jπ(x0) = lim_{N→∞} E_{wk}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

0 < α ≤ 1 is the discount factor. If α < 1 the problem is called discounted.

Optimal cost function J∗(x0) = minπ Jπ(x0).

Problems with α = 1 typically include a special cost-free termination state t. The objective is to reach (or approach) t at minimum expected cost.
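As a rough illustration of this definition (a sketch under assumptions; f, g, and mu below are generic placeholders, not objects defined in the slides), Jπ(x0) for α < 1 can be estimated by simulating truncated N-stage discounted costs and averaging over independent runs:

import random

def estimate_policy_cost(x0, mu, f, g, alpha=0.9, N=200, n_runs=1000,
                         sample_w=random.random):
    total = 0.0
    for _ in range(n_runs):
        x, cost, discount = x0, 0.0, 1.0
        for _ in range(N):
            w = sample_w()                       # disturbance w_k
            cost += discount * g(x, mu(x), w)    # alpha^k g(x_k, mu(x_k), w_k)
            x = f(x, mu(x), w)                   # x_{k+1} = f(x_k, mu(x_k), w_k)
            discount *= alpha
        total += cost
    return total / n_runs                        # Monte Carlo estimate of J_pi(x0)

If the stage cost is bounded by G, the truncation error is at most α^N G/(1 − α), so a moderate N suffices when α < 1.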

Bertsekas Reinforcement Learning 21 / 29

Page 18: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Infinite Horizon Problems - The Three Theorems

Intuition: N-stage opt. costs → infinite horizon opt. cost

Apply DP, and let VN−k(x) be the optimal cost-to-go starting at x with k stages to go:

VN−k(x) = min_{u∈U(x)} E_w{ α^{N−k} g(x, u, w) + VN−k+1( f(x, u, w) ) },   VN(x) ≡ 0

Define Jk(x) = VN−k(x)/α^{N−k}, i.e., reverse the time index and divide by α^{N−k}:

Jk(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJk−1( f(x, u, w) ) },   J0(x) ≡ 0   (DP)

JN(x) is equal to V0(x), the N-stage optimal cost starting from x

So for any k, Jk(x) is the k-stage optimal cost starting from x. Intuitively:

J∗(x) = lim_{k→∞} Jk(x),   for all x

J∗ satisfies Bellman’s equation: Take the limit in Eq. (DP) (?)

J∗(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ∗( f(x, u, w) ) },   for all x

Optimality condition: Let µ∗(x) attain the min in the Bellman equation for all x

The policy {µ∗, µ∗, . . .} is optimal. (This type of policy is called stationary.)
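As a small numerical illustration of these facts (the two-state, two-control discounted model below is made up for the sketch, not taken from the slides), iterating the recursion (DP) drives Jk toward a fixed point of Bellman's equation, and the minimizing controls define a stationary policy:

import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[x, u, y]: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0],                    # g[x, u]: expected stage cost
              [0.5, 3.0]])

J = np.zeros(2)                              # J0 = 0, as in (DP)
for _ in range(200):
    Q = g + alpha * P @ J                    # Q[x, u] = g(x, u) + alpha * E{ J(next state) }
    J = Q.min(axis=1)                        # Jk(x) = min_u Q[x, u]

mu_star = Q.argmin(axis=1)                   # stationary policy attaining the min
print(J, mu_star)                            # J is (numerically) the fixed point J*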

Bertsekas Reinforcement Learning 22 / 29

Page 19: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Infinite Horizon Problems - Algorithms

Value iteration (VI): Generates finite horizon opt. cost function sequence {Jk}

Jk(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJk−1( f(x, u, w) ) },   J0 is “arbitrary"   (??)

Policy Iteration (PI): Generates sequences of policies {µk} and their cost functions {Jµk}; µ0 is “arbitrary"

The typical iteration starts with a policy µ and generates a new policy µ̃ in two steps:

Policy evaluation step, which computes the cost function Jµ of the current (base) policy µ

Policy improvement step, which computes the improved (rollout) policy µ̃ using the one-step lookahead minimization

µ̃(x) ∈ arg min_{u∈U(x)} E_w{ g(x, u, w) + αJµ( f(x, u, w) ) }

There are several options for policy evaluation to compute Jµ

Solve Bellman’s equation for µ [Jµ(x) = E{g(x, µ(x), w) + αJµ(f(x, µ(x), w))}] by using VI or another method (it is linear in Jµ)

Use simulation (on-line Monte-Carlo, Temporal Difference (TD) methods)
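A minimal sketch of exact PI on a finite discounted model (same hypothetical array format as the VI sketch above: P[x, u, y] transition probabilities, g[x, u] expected stage costs). Policy evaluation solves the linear Bellman equation for µ, and policy improvement is the one-step lookahead minimization:

import numpy as np

def policy_iteration(P, g, alpha, max_iters=50):
    n, m = g.shape                           # n states, m controls
    mu = np.zeros(n, dtype=int)              # mu0 is "arbitrary"
    for _ in range(max_iters):
        # Policy evaluation: solve J_mu = g_mu + alpha * P_mu J_mu (linear in J_mu)
        P_mu = P[np.arange(n), mu]           # P_mu[x, y] = p_xy(mu(x))
        g_mu = g[np.arange(n), mu]
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: one-step lookahead with respect to J_mu
        mu_new = (g + alpha * P @ J_mu).argmin(axis=1)
        if np.array_equal(mu_new, mu):
            break                            # no further improvement possible
        mu = mu_new
    return mu, J_mu

On finite discounted problems this loop typically terminates after a handful of iterations, consistent with the Newton's-method view noted later in the lecture.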

Bertsekas Reinforcement Learning 23 / 29

Page 20: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Exact and Approximate Policy Iteration

[Figure: the exact and approximate PI cycle: policy evaluation of the base policy µ followed by one-step lookahead policy improvement yielding the rollout policy µ̃ (“AlphaZero has discovered a new way to play!”). In the approximate version, Jµ is used instead of J∗ in the Bellman equation, a value network approximates Jµ, and a policy network (a data-trained classifier on state-control pairs) represents u = µ̃(x, r) at the current state x.]

Important facts (to be discussed later):

PI yields in the limit an optimal policy (?)

PI is faster than VI; can be viewed as Newton’s method for solving Bellman’s Eq.

PI can be implemented approximately, with a value and (perhaps) a policy network

[Figure: approximate PI with truncated rollout: a policy network trained from state-control pairs supplies the base policy, a value network trained from cost data supplies the terminal cost approximation beyond the truncation, and a state space partition may carry a local value network and a local policy network for each set.]

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

x1k, u1

k u2k x2

k dk τ

Q-factor approximation

u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)

)∈ C

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u1 x2 u2 x3

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u0 x1 u1 x1

High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice

Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5

Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial

Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

Bertsekas Reinforcement Learning 24 / 29

Page 21: Topics in Reinforcement Learning:Lessons from AlphaZero ...

A More Abstract Notational View

Bellman's equation, VI, and PI can be written using Bellman operators

Recall Bellman's equation:

J*(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J*( f(x, u, w) ) },  for all x

It can be written as a fixed point equation: J*(x) = (TJ*)(x), where T is the Bellman operator that transforms a function J(·) into a function (TJ)(·):

(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J( f(x, u, w) ) },  for all x

Shorthand theory using Bellman operators:

VI is the fixed point iteration J_{k+1} = T J_k

There is a Bellman operator T_µ for any policy µ and a corresponding Bellman equation J_µ(x) = (T_µ J_µ)(x) = E{ g(x, µ(x), w) + α J_µ( f(x, µ(x), w) ) }

PI is written compactly as J_{µ^k} = T_{µ^k} J_{µ^k} (policy evaluation) and T_{µ^{k+1}} J_{µ^k} = T J_{µ^k} (policy improvement)

The abstract view is very useful for theoretical analysis, intuition, and visualization
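As an illustration of the operator notation (a minimal sketch, not from the class notes), the Bellman operator T and VI as the fixed point iteration J_{k+1} = T J_k can be coded for a small finite-state problem; the arrays p[x, u, y] = p_xy(u), g[x, u, y] = g(x, u, y), and the example numbers below are assumptions made only for this sketch.

```python
# Minimal sketch (assumed problem data, not from the class notes):
# p[x, u, y] = p_xy(u), g[x, u, y] = g(x, u, y), discount factor alpha.
import numpy as np

def bellman_operator(J, p, g, alpha):
    """(TJ)(x) = min_u sum_y p_xy(u) * (g(x, u, y) + alpha * J(y))."""
    Q = np.einsum('xuy,xuy->xu', p, g + alpha * J[None, None, :])
    return Q.min(axis=1)

def value_iteration(p, g, alpha, tol=1e-10, max_iter=100_000):
    """VI as the fixed point iteration J_{k+1} = T J_k, started from J = 0."""
    J = np.zeros(p.shape[0])
    for _ in range(max_iter):
        J_next = bellman_operator(J, p, g, alpha)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J

# Example with 2 states and 2 controls (arbitrary illustrative numbers):
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.6, 0.4]]])
g = np.ones((2, 2, 2))                     # g(x, u, y) = 1 for all transitions
J_star = value_iteration(p, g, alpha=0.9)  # here J*(x) = 1/(1 - 0.9) = 10 for all x
```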

Bertsekas Reinforcement Learning 25 / 29

Page 22: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Deterministic Linear Quadratic Problem - Infinite Horizon, Undiscounted

Linear system x_{k+1} = a x_k + b u_k; quadratic cost per stage g(x, u) = q x² + r u²

Bellman equation: J(x) = min_u { q x² + r u² + J(a x + b u) }

Finite horizon results (quadratic optimal cost, linear optimal policy) suggest:

J*(x) = K* x², where K* is some positive scalar

The optimal policy has the form µ*(x) = L* x, where L* is some scalar

To characterize K* and L*, we plug J(x) = K x² into the Bellman equation:

K x² = min_u { q x² + r u² + K (a x + b u)² } = · · · = F(K) x²,

where F(K) = a² r K / (r + b² K) + q, with the minimizing u being equal to −( a b K / (r + b² K) ) x

Thus the Bellman equation is solved by J*(x) = K* x², with K* being a solution of the Riccati equation

K* = F(K*) = a² r K* / (r + b² K*) + q

and the optimal policy is linear:

µ*(x) = L* x, with L* = −a b K* / (r + b² K*)

Bertsekas Reinforcement Learning 26 / 29
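As a quick numerical check of the derivation above (an illustrative sketch; the scalars a, b, q, r below are arbitrary example values, not taken from the lecture), the Riccati equation K = F(K) can be solved by the fixed point iteration K_{k+1} = F(K_k), and L* then follows from the formula for the minimizing u.

```python
# Illustrative sketch: solve K = F(K) by fixed point iteration (VI) and
# recover the optimal linear policy gain L*. The numbers a, b, q, r are
# arbitrary example values, not from the lecture.
def riccati_F(K, a, b, q, r):
    """F(K) = a^2 r K / (r + b^2 K) + q (deterministic, undiscounted case)."""
    return a**2 * r * K / (r + b**2 * K) + q

def solve_riccati(a, b, q, r, K0=0.0, tol=1e-12, max_iter=10_000):
    K = K0
    for _ in range(max_iter):
        K_next = riccati_F(K, a, b, q, r)
        if abs(K_next - K) < tol:
            break
        K = K_next
    return K

a, b, q, r = 2.0, 1.0, 1.0, 1.0               # note: a = 2 is open-loop unstable
K_star = solve_riccati(a, b, q, r)            # optimal cost coefficient K* (= 2 + sqrt(5) here)
L_star = -a * b * K_star / (r + b**2 * K_star)  # optimal policy mu*(x) = L* x
print(K_star, L_star, abs(a + b * L_star) < 1)  # closed-loop gain a + b L* is stable
```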

Page 23: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Graphical Solution of Riccati Equation

[Figure: graphical solution of the Riccati equation K = F(K), where F(K) = a²rK/(r + b²K) + q — the graph of F is plotted against the 45° line, their intersection is the fixed point K*, and VI generates the iterates K_{k+1} = F(K_k); lines F_L(K) = (a + bL)²K + q + rL² correspond to linear policies µ(x) = Lx, and the stability region and instability region of K values are marked.]
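To connect this picture with the Newton-step labels in the figure, here is an illustrative sketch (same notation as the previous slide; the numerical values are assumptions, not lecture code) of one Newton step for K = F(K): linearize F at the current K via the greedy gain L, i.e., take the tangent line F_L, and solve the linearized equation K = F_L(K) exactly.

```python
# Illustrative sketch: one Newton step for the Riccati equation K = F(K),
# implemented as "linearize at the greedy policy and solve exactly".
# The example values of a, b, q, r are assumptions for illustration.
def newton_step(K, a, b, q, r):
    L = -a * b * K / (r + b**2 * K)         # greedy (one-step lookahead) gain at K
    # Tangent line F_L(K') = (a + b L)^2 K' + q + r L^2; solve K' = F_L(K')
    slope = (a + b * L)**2
    return (q + r * L**2) / (1.0 - slope)   # requires |a + b L| < 1 (stable greedy policy)

a, b, q, r = 2.0, 1.0, 1.0, 1.0
K = 2.0                                     # any K > 1 here makes the greedy policy stable
for _ in range(5):                          # Newton converges in very few steps
    K = newton_step(K, a, b, q, r)
print(K)                                    # approx. the fixed point K* of K = F(K)
```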

Bertsekas Reinforcement Learning 27 / 29

Page 24: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Visualization of VI

[Figure: visualization of VI on the space of quadratic functions J(x) = Kx² — VI is the iteration J_{k+1}(x) = K_{k+1}x² with K_{k+1} = F(K_k), shown as a staircase along the graph of the Riccati function F toward its fixed point K* (the solution of K = F(K)); tangent lines F_L(K) = (a + bL)²K + q + rL² correspond to one-step lookahead policies, the region of stability (which is also the region of convergence of Newton's method for the Riccati equation) is marked, and the costs of a base policy and of the corresponding rollout policy are indicated.]
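As a companion to this figure (an assumed reconstruction of the kind of plot shown on the slide, not the lecture's own figure code; uses matplotlib and the same example scalars as before), the following sketch generates the graph of F, the 45° line, and the staircase of VI iterates K_{k+1} = F(K_k).

```python
# Illustrative sketch of the VI "staircase" for the Riccati equation.
# The scalars a, b, q, r and the plotting range are assumptions.
import numpy as np
import matplotlib.pyplot as plt

a, b, q, r = 2.0, 1.0, 1.0, 1.0
F = lambda K: a**2 * r * K / (r + b**2 * K) + q

# VI iterates K_{k+1} = F(K_k), started from K_0 = 0
K_iters = [0.0]
for _ in range(8):
    K_iters.append(F(K_iters[-1]))

K_grid = np.linspace(0.0, 8.0, 400)
plt.plot(K_grid, F(K_grid), label='F(K)')
plt.plot(K_grid, K_grid, '--', label='45-degree line')

# Staircase: (K_k, K_k) -> (K_k, K_{k+1}) -> (K_{k+1}, K_{k+1}) -> ...
for K_prev, K_next in zip(K_iters[:-1], K_iters[1:]):
    plt.plot([K_prev, K_prev, K_next], [K_prev, K_next, K_next], color='gray')

plt.xlabel('K')
plt.ylabel('F(K)')
plt.legend()
plt.show()
```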

Bertsekas Reinforcement Learning 28 / 29

Page 25: Topics in Reinforcement Learning:Lessons from AlphaZero ...

About the Next Lecture

Linear quadratic problems and Newton step interpretations

Approximation in value space as a Newton step for solving the Riccati equation

Rollout as a Newton step starting from the cost of the base policy

Policy Iteration as repeated Newton steps

Problem formulations and reformulations

How do we formulate DP models for practical problems?

Problems involving a terminal state (stochastic shortest path problems)

Problem reformulation by state augmentation (dealing with delays, correlations, forecasts, etc.)

Problems involving imperfect state observation (POMDP)

Multiagent problems - Nonclassical information patterns

Systems with unknown or changing parameters - Adaptive control

PLEASE READ SECTIONS 1.5 and 1.6 OF THE CLASS NOTES (AMAP)

1ST HOMEWORK (DUE IN ONE WEEK): Exercise 1.1 of the Class Notes

Bertsekas Reinforcement Learning 29 / 29