Topics in Reinforcement Learning: Lessons from AlphaZero ...



Page 1: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Topics in Reinforcement Learning: Lessons from AlphaZero for

(Sub)Optimal Control and Discrete Optimization

Arizona State University, Course CSE 691, Spring 2022

Links to Class Notes, Videolectures, and Slides at http://web.mit.edu/dimitrib/www/RLbook.html

Dimitri P. Bertsekas, [email protected]

Lecture 2: Stochastic Finite and Infinite Horizon DP

Bertsekas Reinforcement Learning 1 / 29

Page 2: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Outline

1 Finite Horizon Deterministic Problem - Approximation in Value Space

2 Stochastic DP Algorithm

3 Linear Quadratic Problems - An Important Favorable Special Case

4 Infinite Horizon - An Overview of Theory and Algorithms

Bertsekas Reinforcement Learning 2 / 29

Page 3: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Review - Finite Horizon Deterministic Problem

[Figure: finite horizon deterministic problem. At stage k, applying control uk at state xk incurs the cost gk(xk, uk) and leads to the next state xk+1; the trajectory runs from the initial state x0 through stages 1, . . . , N to the terminal state xN.]

System
$$x_{k+1} = f_k(x_k, u_k), \qquad k = 0, 1, \ldots, N-1$$
where $x_k$: state, $u_k$: control chosen from some set $U_k(x_k)$

Arbitrary state and control spaces

Cost function:
$$g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)$$

For given initial state $x_0$, minimize over control sequences $\{u_0, \ldots, u_{N-1}\}$
$$J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)$$

Optimal cost function
$$J^*(x_0) = \min_{\substack{u_k \in U_k(x_k) \\ k=0,\ldots,N-1}} J(x_0; u_0, \ldots, u_{N-1})$$

Bertsekas Reinforcement Learning 4 / 29
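As a concrete illustration of this formulation, here is a minimal Python sketch of a hypothetical toy instance (integer states, controls in {-1, 0, 1}, quadratic stage and terminal costs; the names f, g, gN, U, total_cost are illustrative, not from the slides). It evaluates the cost J(x0; u0, . . . , uN-1) of a given control sequence by forward simulation of xk+1 = fk(xk, uk).

```python
# Hypothetical toy finite horizon deterministic problem (illustration only).
N = 4                                    # horizon

def f(k, x, u):                          # system equation: x_{k+1} = f_k(x_k, u_k)
    return x + u

def g(k, x, u):                          # stage cost g_k(x_k, u_k)
    return x**2 + u**2

def gN(x):                               # terminal cost g_N(x_N)
    return 10 * x**2

def U(k, x):                             # control constraint set U_k(x_k)
    return (-1, 0, 1)

def total_cost(x0, controls):
    """Evaluate J(x0; u_0, ..., u_{N-1}) by simulating the system forward."""
    x, J = x0, 0.0
    for k, u in enumerate(controls):
        J += g(k, x, u)
        x = f(k, x, u)
    return J + gN(x)

print(total_cost(x0=3, controls=[-1, -1, -1, 0]))   # cost of one particular control sequence
```

Minimizing total_cost over all control sequences is exactly the DP problem; the DP algorithm on the next slide organizes that minimization stage by stage.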

Page 4: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Review - DP Algorithm for Deterministic Problems

Go backward to compute the optimal costs $J_k^*(x_k)$ of the $x_k$-tail subproblems (off-line training - involves lots of computation)

Start with
$$J_N^*(x_N) = g_N(x_N), \quad \text{for all } x_N,$$
and for $k = 0, \ldots, N-1$, let
$$J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \Big[g_k(x_k, u_k) + J_{k+1}^*\big(f_k(x_k, u_k)\big)\Big], \quad \text{for all } x_k.$$
Then the optimal cost $J^*(x_0)$ is obtained at the last step: $J_0^*(x_0) = J^*(x_0)$.

Go forward to construct the optimal control sequence $\{u_0^*, \ldots, u_{N-1}^*\}$ (on-line play)

Start with
$$u_0^* \in \arg\min_{u_0 \in U_0(x_0)} \Big[g_0(x_0, u_0) + J_1^*\big(f_0(x_0, u_0)\big)\Big], \qquad x_1^* = f_0(x_0, u_0^*).$$
Sequentially, going forward, for $k = 1, 2, \ldots, N-1$, set
$$u_k^* \in \arg\min_{u_k \in U_k(x_k^*)} \Big[g_k(x_k^*, u_k) + J_{k+1}^*\big(f_k(x_k^*, u_k)\big)\Big], \qquad x_{k+1}^* = f_k(x_k^*, u_k^*).$$

Bertsekas Reinforcement Learning 5 / 29
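The backward and forward passes above can be tabulated directly when the state space is finite. Below is a minimal Python sketch on a hypothetical toy instance (integer states in a bounded range, controls in {-1, 0, 1}, quadratic costs; all problem data are assumptions for illustration).

```python
# Exact DP on a hypothetical toy finite horizon deterministic problem.
N = 4
STATES = range(-10, 11)                       # finite state space used for tabulation

def f(k, x, u):  return x + u                 # x_{k+1} = f_k(x_k, u_k)
def g(k, x, u):  return x**2 + u**2           # stage cost g_k(x_k, u_k)
def gN(x):       return 10 * x**2             # terminal cost g_N(x_N)
def U(k, x):     return [u for u in (-1, 0, 1) if f(k, x, u) in STATES]

# Backward pass (off-line training):
#   J*_N(x) = g_N(x),  J*_k(x) = min_{u in U_k(x)} [ g_k(x,u) + J*_{k+1}(f_k(x,u)) ].
J = [None] * (N + 1)
J[N] = {x: gN(x) for x in STATES}
for k in range(N - 1, -1, -1):
    J[k] = {x: min(g(k, x, u) + J[k + 1][f(k, x, u)] for u in U(k, x))
            for x in STATES}

# Forward pass (on-line play): pick u*_k attaining the minimum at the current state.
def optimal_sequence(x0):
    x, seq = x0, []
    for k in range(N):
        u = min(U(k, x), key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
        seq.append(u)
        x = f(k, x, u)
    return seq

x0 = 3
print("J*(x0) =", J[0][x0], "  optimal controls:", optimal_sequence(x0))
```

The backward pass is the computation-heavy off-line part; the forward pass only needs the stored tables J[k] and the system equation.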

Page 5: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Q-Factors for Deterministic Problems

An alternative (and equivalent) form of the DP algorithm

Generates the optimal Q-factors, defined for all $(x_k, u_k)$ and $k$ by
$$Q_k^*(x_k, u_k) = g_k(x_k, u_k) + J_{k+1}^*\big(f_k(x_k, u_k)\big)$$

The optimal cost function $J_k^*$ can be recovered from the optimal Q-factor $Q_k^*$:
$$J_k^*(x_k) = \min_{u_k \in U_k(x_k)} Q_k^*(x_k, u_k)$$

The DP algorithm can be written in terms of Q-factors:
$$Q_k^*(x_k, u_k) = g_k(x_k, u_k) + \min_{u_{k+1} \in U_{k+1}(f_k(x_k, u_k))} Q_{k+1}^*\big(f_k(x_k, u_k), u_{k+1}\big)$$

Exact and approximate forms of this and other related algorithms form an important class of RL methods known as Q-learning.

Bertsekas Reinforcement Learning 6 / 29
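The same kind of toy instance can be solved in the Q-factor form of the recursion: the Q-factor tables are built backward, and J*_k is recovered by minimizing over the control (a sketch under the same illustrative assumptions as before).

```python
# Q-factor form of the DP algorithm on a hypothetical toy instance.
N = 4
STATES = range(-10, 11)

def f(k, x, u):  return x + u
def g(k, x, u):  return x**2 + u**2
def gN(x):       return 10 * x**2
def U(k, x):     return [u for u in (-1, 0, 1) if f(k, x, u) in STATES]

# Q*_{N-1}(x,u) = g_{N-1}(x,u) + g_N(f_{N-1}(x,u));  for k < N-1,
# Q*_k(x,u) = g_k(x,u) + min_{u'} Q*_{k+1}(f_k(x,u), u').
Q = [None] * N
Q[N - 1] = {(x, u): g(N - 1, x, u) + gN(f(N - 1, x, u))
            for x in STATES for u in U(N - 1, x)}
for k in range(N - 2, -1, -1):
    Q[k] = {(x, u): g(k, x, u)
                    + min(Q[k + 1][f(k, x, u), up] for up in U(k + 1, f(k, x, u)))
            for x in STATES for u in U(k, x)}

# Recover the optimal cost: J*_k(x) = min_u Q*_k(x, u).
J0 = {x: min(Q[0][x, u] for u in U(0, x)) for x in STATES}
print("J*_0(3) =", J0[3])
```

One appeal of the Q-factor form: once the tables Q*_k are available, on-line play is just an arg min of Q*_k(x_k, u) over u, with no need to evaluate f_k or g_k at decision time.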

Page 6: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Approximation in Value Space

We replace $J_k^*$ with an approximation $\tilde J_k$ during on-line play

Start with
$$u_0 \in \arg\min_{u_0 \in U_0(x_0)} \Big[g_0(x_0, u_0) + \tilde J_1\big(f_0(x_0, u_0)\big)\Big]$$
Set $x_1 = f_0(x_0, u_0)$

Sequentially, going forward, for $k = 1, 2, \ldots, N-1$, set
$$u_k \in \arg\min_{u_k \in U_k(x_k)} \Big[g_k(x_k, u_k) + \tilde J_{k+1}\big(f_k(x_k, u_k)\big)\Big], \qquad x_{k+1} = f_k(x_k, u_k)$$

How do we compute $\tilde J_{k+1}(x_{k+1})$? This is one of the principal issues in RL

Off-line problem approximation: Use as $\tilde J_{k+1}$ the optimal cost function of a simpler problem, computed off-line by exact DP

On-line approximate optimization, e.g., solve on-line a shorter horizon problem by multistep lookahead minimization and simple terminal cost (often done in MPC)

Parametric cost approximation: Obtain $\tilde J_{k+1}(x_{k+1})$ from a parametric class of functions $\tilde J(x_{k+1}, r)$, where $r$ is a parameter, e.g., training using data and a NN.

Rollout with a heuristic: We will focus on this for the moment.

Bertsekas Reinforcement Learning 7 / 29
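A minimal sketch of one-step lookahead with a cost approximation $\tilde J$, again on a hypothetical toy instance; here $\tilde J$ is a handcrafted quadratic guess standing in for any of the schemes listed above (problem approximation, parametric approximation, rollout, etc.).

```python
# One-step lookahead with a cost approximation J~ (hypothetical illustration).
N = 4

def f(k, x, u):  return x + u            # x_{k+1} = f_k(x_k, u_k)
def g(k, x, u):  return x**2 + u**2      # stage cost g_k(x_k, u_k)
def gN(x):       return 10 * x**2        # terminal cost g_N(x_N)
def U(k, x):     return (-1, 0, 1)

def J_tilde(k, x):
    # Crude stand-in for J*_k: a handcrafted quadratic guess.  In practice this
    # could come from problem approximation, multistep lookahead, a trained
    # parametric architecture, or rollout with a heuristic.
    return 2.0 * x**2 if k < N else gN(x)

def lookahead_control(k, x):
    """Return the u_k minimizing g_k(x,u) + J~_{k+1}(f_k(x,u))."""
    return min(U(k, x), key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))

# On-line play: generate a trajectory with the one-step lookahead policy.
x = 3
for k in range(N):
    u = lookahead_control(k, x)
    print(f"k={k}  x={x}  chose u={u}")
    x = f(k, x, u)
```

Swapping in a different approximation changes only the single function J_tilde; the on-line lookahead loop stays the same.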

Page 7: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Rollout for Finite-State Deterministic Problems

[Figure: rollout for a finite-state deterministic problem. From the current state xk of the trajectory x0, x1, . . . , xk, each candidate control uk, u′k, u′′k leads to a next state xk+1, x′k+1, x′′k+1; the base heuristic is run from each next state to the end of the horizon (terminal states xN, x′N, x′′N), producing a heuristic Q-factor for each control.]

Sec. 1.2 Deterministic Dynamic Programming 25


Figure 1.2.9 Schematic illustration of rollout with one-step lookahead for a deterministic problem. At state $x_k$, for every pair $(x_k, u_k)$, $u_k \in U_k(x_k)$, the base heuristic generates an approximate Q-factor
$$\tilde Q_k(x_k, u_k) = g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big),$$
and selects the control $\tilde\mu_k(x_k)$ with minimal Q-factor.

Starting from the next state $x_{k+1} = f_k(x_k, u_k)$, the base heuristic generates a sequence of controls $\{u_{k+1}, \ldots, u_{N-1}\}$ and the corresponding cost
$$H_{k+1}(x_{k+1}) = g_{k+1}(x_{k+1}, u_{k+1}) + \cdots + g_{N-1}(x_{N-1}, u_{N-1}) + g_N(x_N).$$

The rollout algorithm then applies the control that minimizes over $u_k \in U_k(x_k)$ the tail cost expression for stages $k$ to $N$:
$$g_k(x_k, u_k) + H_{k+1}(x_{k+1}).$$

Equivalently, and more succinctly, the rollout algorithm applies at state $x_k$ the control $\tilde\mu_k(x_k)$ given by the minimization
$$\tilde\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} \tilde Q_k(x_k, u_k), \tag{1.14}$$
where $\tilde Q_k(x_k, u_k)$ is the approximate Q-factor defined by
$$\tilde Q_k(x_k, u_k) = g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big); \tag{1.15}$$
see Fig. 1.2.9. Rollout defines a suboptimal policy $\tilde\pi = \{\tilde\mu_0, \ldots, \tilde\mu_{N-1}\}$, referred to as the rollout policy, where for each $x_k$ and $k$, $\tilde\mu_k(x_k)$ is the control produced by the Q-factor minimization (1.14).

Note that the rollout algorithm requires running the base heuristic a number of times that is bounded by $Nn$, where $n$ is an upper bound on the number of control choices available at each state. Thus if $n$ is small relative to $N$, it requires computation equal to a small multiple of $N$ times the computation time for a single application of the base heuristic. Similarly, if $n$ is bounded by a polynomial in $N$, the ratio of the rollout algorithm's computation time to the base heuristic's computation time is a polynomial in $N$.

Cost approximation by running a heuristic from states of interest

We generate a single system trajectory $\{x_0, x_1, \ldots, x_N\}$ by on-line play

Upon reaching $x_k$, we compute, for all $u_k \in U_k(x_k)$, the corresponding next states $x_{k+1} = f_k(x_k, u_k)$

From each of the next states $x_{k+1}$ we run the heuristic and compute the heuristic cost $H_{k+1}(x_{k+1})$

We apply the $u_k$ that minimizes over $u_k \in U_k(x_k)$ the (heuristic) Q-factor
$$g_k(x_k, u_k) + H_{k+1}(x_{k+1})$$

We generate the next state $x_{k+1} = f_k(x_k, u_k)$ and repeat

Bertsekas Reinforcement Learning 8 / 29
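A minimal sketch of deterministic rollout on a hypothetical toy instance: the base heuristic is a simple greedy rule (pick the control with the smallest immediate stage cost), H_k is the cost of running it to the end of the horizon, and the rollout control minimizes the heuristic Q-factor. All problem data and the choice of heuristic are illustrative assumptions.

```python
# Rollout with a base heuristic on a hypothetical toy deterministic problem.
N = 6

def f(k, x, u):  return x + u
def g(k, x, u):  return x**2 + u**2
def gN(x):       return 10 * x**2
def U(k, x):     return (-1, 0, 1)

def heuristic_cost(k, x):
    """H_k(x): cost of running the greedy base heuristic from x at stage k."""
    H = 0.0
    for j in range(k, N):
        u = min(U(j, x), key=lambda u: g(j, x, u))   # greedy one-stage choice
        H += g(j, x, u)
        x = f(j, x, u)
    return H + gN(x)

def rollout_control(k, x):
    """Minimize the heuristic Q-factor g_k(x,u) + H_{k+1}(f_k(x,u)) over u in U_k(x)."""
    return min(U(k, x), key=lambda u: g(k, x, u) + heuristic_cost(k + 1, f(k, x, u)))

# On-line play with the rollout policy.  For a sequentially consistent base
# heuristic (this greedy rule is one), deterministic rollout is guaranteed to
# do at least as well as the base heuristic (cost improvement property).
x, total = 3, 0.0
for k in range(N):
    u = rollout_control(k, x)
    total += g(k, x, u)
    x = f(k, x, u)
total += gN(x)
print("rollout cost:", total, "  base heuristic cost:", heuristic_cost(0, 3))
```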

Page 8: Topics in Reinforcement Learning: Lessons from AlphaZero ...

Traveling Salesman Example

[Figure: traveling salesman example with four cities A, B, C, D and a matrix of intercity travel costs. Nodes correspond to partial tours (A; AB, AC, AD; ABC, ABD, ACB, ACD, ADB, ADC) and to complete tours (ABCD, ABDC, ACBD, ACDB, ADBC, ADCB); rollout completes each partial tour using the nearest neighbor heuristic.]

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20

A AB AC AD ABC ABD ACB ACD ADB ADC

ABCD ABDC ACBD ACDB ADBC ADCB

Origin Node s Artificial Terminal Node t

(Is the path s ! i ! j

have a chance to be part of a shorter s ! j path?)

(Does the path s ! i ! j shorter than the current s ! j path?)

Length = 0 Dead-End Position Solution

Starting Position w1, w2, . . . xN t m xk xk+1 5.5 i j

YES Set dj = di + aij Is di + aij < dj? Is di + aij < UPPER?

Open List INSERT REMOVE

w1, w2, . . . xN t m xk xk+1 5.5

Origin Node s Artificial Terminal Destination Node t Encoder NoisyChannel Received Sequence

Decoder Decoded Sequence

s x0 x1 x2 xN�1 xN t m xk xk+1 5.5

w1, w2, . . . y1, y2, . . . x1, x2, . . . z1, z2, . . . w1, w2, . . . xN t m xk

xk+1 5.5

Old State xk�1 New State xk Encoder Noisy Channel ReceivedSequence

Decoder Decoded Sequence

1
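
The labels above come from the label correcting shortest path figure: node j's label is updated when di + aij improves both the current label dj and the best path length UPPER found so far. A minimal sketch of such a method, under the assumption that the graph is given as a dictionary of outgoing arcs with nonnegative lengths (graph, origin, destination are illustrative names, not from the slides):

import math
from collections import deque

def label_correcting(graph, origin, destination):
    # graph[i] = list of (j, a_ij) arcs with nonnegative lengths a_ij
    d = {i: math.inf for i in graph}           # labels d_i
    d[origin] = 0.0
    upper = math.inf                           # best path length to the destination so far
    open_list = deque([origin])                # OPEN list (INSERT / REMOVE)
    while open_list:
        i = open_list.popleft()                # REMOVE a node from OPEN
        for j, a_ij in graph.get(i, []):
            # The two tests from the figure: is d_i + a_ij < d_j and < UPPER?
            if d[i] + a_ij < d[j] and d[i] + a_ij < upper:
                d[j] = d[i] + a_ij             # YES: set d_j = d_i + a_ij
                if j == destination:
                    upper = d[j]               # tighten the upper bound
                else:
                    open_list.append(j)        # INSERT j into OPEN
    return upper

# Small example: shortest path from 's' to 't'
g = {'s': [('a', 1), ('b', 4)], 'a': [('b', 1), ('t', 5)], 'b': [('t', 1)], 't': []}
print(label_correcting(g, 's', 't'))           # -> 3.0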

[Figure text: a stochastic shortest path example. Recoverable content: initial state s, terminal state t (destination), states 1 and 2, and the Bellman equations J(1) = min{c, a + J(2)}, J(2) = b + J(1); a control example with Prob. u, Prob. 1 − u, Cost 1, Cost 1 − √u; p-stable policies π ∈ Pp,x0 with J(xk) → 0, the set Wp+ = {J ∈ J | J+ ≤ J}, and convergence of VI from within Wp+; proper and improper policies µ; fixed-point iteration labels f(x; θk), f(x; θk+1), xk+1 = F(xk), x∗ = F(x∗), Fµk(x), Fµk+1(x).]
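
The example above is governed by the pair J(1) = min{c, a + J(2)} and J(2) = b + J(1). A minimal value iteration sketch on these two equations, with illustrative values of a, b, c (not from the slides); whether and where the iterates converge depends on those values:

def value_iteration(a, b, c, iters=50):
    # Bellman equations of the two-state example:
    #   J(1) = min{ c, a + J(2) }   (stop at cost c, or move to state 2 at cost a)
    #   J(2) = b + J(1)
    J1, J2 = 0.0, 0.0
    for _ in range(iters):
        J1, J2 = min(c, a + J2), b + J1   # one (Jacobi) VI sweep
    return J1, J2

# Illustrative values with a + b > 0, so the cycling (improper) policy has infinite cost
print(value_iteration(a=1.0, b=1.0, c=3.0))   # -> (3.0, 4.0)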

[Figure text: feature-based cost approximation, aggregation, and rollout. Recoverable content: a feature map F(i) = (F1(i), ..., Fs(i)) and a feature-based parametric architecture J̃µ(F(i), r) ≈ Jµ(i) with weight vector r = (r1, ..., rs); if J̃µ(F(i), r) = Σℓ=1..s Fℓ(i) rℓ it is a linear feature-based architecture; aggregation (original states, aggregate states I1, ..., Iq formed by a neural network or other scheme, possibly with "handcrafted" features, aggregation and disaggregation probabilities, aggregate costs r∗ℓ, cost functions J0(i), J1(j), and generation of an "improved" policy µ̃ by "solving" the aggregate problem); rollout with a base heuristic and a corrected J̃; AlphaZero-style position "value" and move "probabilities" (the same algorithm learned multiple games, Go, Shogi, and plays much better than all chess programs); discounted transition costs αk g(i, u, j) with transition probabilities pij(u) on a controlled Markov chain; the sets Wp, Wp′, and W+ = {J | J ≥ J+, J(t) = 0}, with VI converging to Jp from within Wp and to J+ from within W+ for cost g(xk, uk) ≥ 0.]
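
The labels above describe a linear feature-based architecture, J̃µ(F(i), r) = Σℓ Fℓ(i) rℓ. A minimal sketch of fitting the weight vector r by least squares to sampled costs of a policy µ; the feature vectors and cost samples below are illustrative placeholders, not data from the slides:

import numpy as np

def fit_linear_cost_approximation(features, sampled_costs):
    # features[m]      = F(i_m) = (F_1(i_m), ..., F_s(i_m)) for a sampled state i_m
    # sampled_costs[m] ~ J_mu(i_m), e.g. obtained by simulating policy mu from i_m
    Phi = np.asarray(features, dtype=float)            # M x s feature matrix
    beta = np.asarray(sampled_costs, dtype=float)      # M cost samples
    r, *_ = np.linalg.lstsq(Phi, beta, rcond=None)     # least-squares weight vector r
    return r

def J_tilde(F_i, r):
    # Linear feature-based architecture: J_tilde(i, r) = sum_l F_l(i) * r_l
    return float(np.dot(F_i, r))

# Illustrative data: 2 features, costs roughly 2*F1 + 0.5*F2
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
costs = [2.1, 0.4, 2.6, 4.4]
r = fit_linear_cost_approximation(feats, costs)
print(r, J_tilde([1.5, 1.0], r))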

[Figure text: approximation in value space and rollout. Recoverable content: cost-to-go approximation and expected value approximation; base policy µ with cost Jµ = TµJµ and rollout policy µ̃ obtained through Tµ̃Jµ = TJµ (policy improvement, i.e., a Newton step: the linearized Bellman equation at Jµ yields the rollout policy µ̃); the one-step lookahead / Q-factor minimization min u∈U(x) Σy=1..n pxy(u)( g(x, u, y) + αJ̃(y) ) with approximation of E{·} and approximate minimization; multiagent Q-factor minimization with controls u1k, u2k, states x1k, x2k, possible states xk+1, ..., xk+m+1, and constraint set X; truncated rollout with a base policy, an m-step value network, and a policy network; rollout with a base off-line obtained policy; a linear quadratic example with Jµ1(x)/Jµ0(x) = K1/K0 and F(K) = αrK/(r + αb²K) + 1; a protein folding example (current partial folding, moving obstacle, complete folding, expert); changing system, cost, and constraint parameters.]
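
The minimization that recurs in the labels above is min over u ∈ U(x) of Σy pxy(u)( g(x, u, y) + αJ̃(y) ). A minimal sketch of the resulting one-step lookahead control selection (rollout corresponds to J̃ = Jµ for a base policy µ), assuming the transition probabilities, stage costs, and cost approximation J̃ are available as arrays; all names below are illustrative:

import numpy as np

def one_step_lookahead_policy(x, U, p, g, J_tilde, alpha):
    # p[u][x][y]: transition probability p_xy(u);  g[u][x][y]: stage cost g(x, u, y)
    # J_tilde[y]: cost-to-go approximation;        alpha: discount factor
    def q_factor(u):
        return sum(p[u][x][y] * (g[u][x][y] + alpha * J_tilde[y])
                   for y in range(len(J_tilde)))
    return min(U, key=q_factor)    # control attaining the one-step lookahead minimum

# Illustrative 2-state, 2-control example
p = np.array([[[0.9, 0.1], [0.2, 0.8]],      # control 0
              [[0.5, 0.5], [0.5, 0.5]]])     # control 1
g = np.ones((2, 2, 2))                       # unit stage costs
J_tilde = np.array([0.0, 10.0])              # approximation favors state 0
print(one_step_lookahead_policy(x=0, U=[0, 1], p=p, g=g, J_tilde=J_tilde, alpha=0.9))  # -> 0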

[Figure text: rollout for deterministic and discrete optimization. Recoverable content: partial solutions (u0, ..., uk), candidate extensions Sk(u0, ..., uk, uk+1) with uk+1 ∈ Uk+1, and complete solutions (u0, ..., uk, uk+1, ..., uN−1); a base heuristic / expert that ranks complete solutions (compares complete foldings); trajectories T0, T1, T2, the current trajectory Tk and next trajectory Tk+1, with base heuristic costs Hk(xk) and Hk+1(xk+1), and heuristic from xk with x̃0 = x0; the monotonicity property under sequential improvement: Cost of Tk ≥ Cost of Tk+1, i.e., Cost of R0 ≥ · · · ≥ Cost of Rk ≥ Cost of Rk+1 ≥ · · · ≥ Cost of RN.]
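
The chain Cost of R0 ≥ · · · ≥ Cost of RN above is the sequential improvement property of rollout with a base heuristic. A minimal sketch of deterministic rollout: at each step the control is chosen by minimizing the stage cost plus the base heuristic's cost Hk+1(xk+1) from the next state. The toy system, costs, and heuristic below are illustrative assumptions, not from the slides:

def rollout(x0, N, controls, f, g, base_heuristic):
    # base_heuristic(k, x) = H_k(x): cost of completing the trajectory from (k, x)
    x, total, plan = x0, 0.0, []
    for k in range(N):
        # Rollout step: minimize stage cost plus base heuristic cost from x_{k+1}
        u_k = min(controls(k, x),
                  key=lambda u: g(k, x, u) + base_heuristic(k + 1, f(k, x, u)))
        total += g(k, x, u_k)
        plan.append(u_k)
        x = f(k, x, u_k)
    return total, plan

def make_base_heuristic(N, f, g):
    # Simple base heuristic: from (k, x), repeatedly step toward 0 until stage N
    def H(k, x):
        cost = 0.0
        while k < N:
            u = 1 if x < 0 else -1
            cost += g(k, x, u)
            x = f(k, x, u)
            k += 1
        return cost
    return H

# Illustrative scalar example
N = 3
f = lambda k, x, u: x + u
g = lambda k, x, u: abs(x + u)
H = make_base_heuristic(N, f, g)
print(rollout(2, N, lambda k, x: (-1, +1), f, g, H))   # -> (2.0, [-1, -1, -1])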

[Figure text: the linear quadratic case and the Newton step interpretation. Recoverable content: the Bellman equation on the space of quadratic functions J(x) = Kx²; the Riccati equation K = F(K) and value iteration Kk+1 = F(Kk), i.e., Jk+1(x) = Kk+1x² = F(Kk)x²; the one-step lookahead policy µ̃ corresponding to a Newton iterate starting from K (tangent line of F at K, with the tangent line of an unstable policy shown, slope = 1 reference line); region of stability, which is also the region of convergence of Newton's method applied to the Riccati equation, vs. instability region; stable policies, an unstable policy, and the optimal policy; the region where sequential improvement holds (TJ ≤ J); cost of the rollout policy µ̃ vs. cost of the base policy µ; value space approximation J̃, Jµ = TµJµ, TµJ̃; match win probability 1, 0, pw (sudden death); incidental cost labels 28, 27, 13.]
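
The labels above refer to value iteration on the space of quadratic functions, Kk+1 = F(Kk), whose fixed point solves the Riccati equation K = F(K). A minimal sketch using the scalar form F(K) = αrK/(r + αb²K) + 1 suggested by the slide (unit state-cost and unit system coefficient assumed; the numerical values below are illustrative):

def riccati_operator(K, alpha, b, r):
    # F(K) = alpha*r*K / (r + alpha*b**2*K) + 1, the Bellman (Riccati) operator
    # acting on quadratic functions J(x) = K*x**2
    return alpha * r * K / (r + alpha * b * b * K) + 1.0

def riccati_value_iteration(K0, alpha, b, r, iters=100):
    K = K0
    for _ in range(iters):
        K = riccati_operator(K, alpha, b, r)     # K_{k+1} = F(K_k)
    return K

# Illustrative parameters; the iterates approach the fixed point K* = F(K*)
print(riccati_value_iteration(K0=0.0, alpha=0.9, b=2.0, r=1.0))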

[Figure text: the parking example and partial observation (belief state) diagram. Recoverable content: parking spaces with costs c(0), ..., c(k), c(k + 1), ..., c(N − 1) and garage cost c(N); stages 1, 2, 3, ..., N − 1, N, indices k and k + 1; heuristic cost of the "future"; a system xk+1 = fk(xk, uk, wk) with observations; belief state pk, controller µk, control uk = µk(pk), and Q-factors at the current state xk; state trajectory x0, x1, ..., xk, ..., xN with branching next states x′k+1, x″k+1 (and terminal states x′N, x″N) under alternative controls u′k, u″k; initial state x0, terminal state t, Length = 1; the two-state example J(1) = min{c, a + J(2)}, J(2) = b + J(1) and the p-stable policy sets Wp+ repeated from earlier slides.]
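
The labels above describe a partially observed problem in which the controller acts on the belief state pk via uk = µk(pk). A minimal sketch of the Bayes update producing the next belief from the current belief, the applied control, and the new observation; the transition and observation models below are illustrative assumptions, not from the slides:

import numpy as np

def belief_update(p, u, z, P, O):
    # p[x]        : current belief p_k over the hidden states
    # P[u][x][x'] : transition probabilities under control u
    # O[x'][z]    : probability of observing z in next state x'
    predicted = p @ P[u]                 # predict: sum_x p_k(x) P(x' | x, u)
    unnormalized = predicted * O[:, z]   # correct with the observation likelihood
    return unnormalized / unnormalized.sum()

# Illustrative 2-state example with a single control u = 0
P = np.array([[[0.8, 0.2], [0.3, 0.7]]])
O = np.array([[0.9, 0.1], [0.2, 0.8]])
p0 = np.array([0.5, 0.5])
print(belief_update(p0, u=0, z=1, P=P, O=O))     # next belief p_{k+1}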

[Figure text: approximation in value space. Recoverable content: a traveling salesman example (initial city, current partial tour, next cities, nearest neighbor heuristic); the tail DP equation Jk+1(xk+1) = min uk+1∈Uk+1(xk+1) E{ gk+1(xk+1, uk+1, wk+1) + Jk+2( fk+1(xk+1, uk+1, wk+1) ) }; the one-step lookahead minimization min uk E{ gk(xk, uk, wk) + J̃k+1(xk+1) } with Approximate Min, Approximate E{·}, and Approximate Cost-to-Go J̃k+1 (Simplify E{·}, certainty equivalence, Monte Carlo tree search); two-step and ℓ-step lookahead viewed as a lookahead tree / shortest path problem over states xk, xk+1, xk+2; truncated rollout with terminal cost approximation J̃; cost function approximation J̃k+ℓ via parametric approximation, neural nets, or discretization; rollout and model predictive control; tail subproblem at time k with terminal cost gN(xN) and optimal control sequence {u∗0, ..., u∗k, ..., u∗N−1}; permanent trajectory Pk and tentative trajectory Tk; the projected equation Φr = Π( Tµ^(λ)(Φr) ) on the subspace M = {Φr | r ∈ ℜm}, with µ(i) ∈ arg min u∈U(i) Q̃µ(i, u, r) and the comparison of good vs. poor approximation (σ(ξ) = ln(1 + eξ) vs. max{0, ξ}).]
∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution

1

Initial City Current Partial Tour Next Cities Nearest Neighbor

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Initial City Current Partial Tour Next Cities

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Complete Tours Current Partial Tour Next Cities Next States

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E{

gk+1(xk+1, uk+1, wk+1)

+Jk+2

(fk+1(xk+1, uk+1, wk+1)

)},

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E{gk(xk, uk, wk)+Jk+1(xk+1)

}

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π(T

(λ)µ (Φr)

)Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

∑nj=1 pij(u)

(g(i, u, j) + J(j)

)Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Start End Plus Terminal Cost Approximation S1 S2 S3 S` Sm�1 Sm

Generate Improved Policy µ Next Partial Tour

Generate “Improved” Policy µ by µ(i) 2 arg minu2U(i) Qµ(i, u, r)

State i y(i) Ay(i) + b �1(i, v) �m(i, v) �2(i, v) J(i, v) = r0�(i, v)

Deterministic Transition xk+1 = fk(xk, uk)

Cost Vector r⇤ J1 = Corrected V Enlarged State Space Cost J0 Cost J1 Cost r⇤

Representative States Controls u are associated with states i

N Stages jN�1 jN i 2 Ix j 2 Iy

Sx1 Sx` Sxm x1 x` xm r⇤x1r⇤x`

r⇤xmFootprint Sets J(i) J(j) =

Py2A �jyr⇤y

minu2U(i)

nX

j=1

pij(u)�g(i, u, j) + ↵J(j)

�i = x Ix

⇡/4 Sample State xsk Sample Control us

k Sample Next State xsk+1 Sample Transition Cost gs

k Simulator

Representative States x (Coarse Grid) Critic Actor Approximate PI Aggregate Problem

pxy(u) =

nX

j=1

pxj(u)�jy g(x, u) =

nX

j=1

pxj(u)g(x, u, j)

Range of Weighted Projections J⇤(i) Original States to States (Fine Grid) Original State Space

dxi = 0 for i /2 Ix �jy = 1 for j 2 Iy �jy = 0 or 1 for all j and y Each j connects to a single x

x pxj1(u) pxj2(u) pxj3(u) �j1y1 �j1y2 �j1y3 �jy with Aggregation Probabilities �jy = 0 or 1

Relate to Rm r⇤m�1 r⇤m x0k+1

Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)

Random Transition xk+1 = fk(xk, uk, wk) Random Cost gk(xk, uk, wk) Representative Features

1

Start End Plus Terminal Cost Approximation S1 S2 S3 S` Sm�1 Sm

Generate Improved Policy µ Next Partial Tours

Generate “Improved” Policy µ by µ(i) 2 arg minu2U(i) Qµ(i, u, r)

State i y(i) Ay(i) + b �1(i, v) �m(i, v) �2(i, v) J(i, v) = r0�(i, v)

Deterministic Transition xk+1 = fk(xk, uk)

Cost Vector r⇤ J1 = Corrected V Enlarged State Space Cost J0 Cost J1 Cost r⇤

Representative States Controls u are associated with states i

N Stages jN�1 jN i 2 Ix j 2 Iy

Sx1 Sx` Sxm x1 x` xm r⇤x1r⇤x`

r⇤xmFootprint Sets J(i) J(j) =

Py2A �jyr⇤y

minu2U(i)

nX

j=1

pij(u)�g(i, u, j) + ↵J(j)

�i = x Ix

⇡/4 Sample State xsk Sample Control us

k Sample Next State xsk+1 Sample Transition Cost gs

k Simulator

Representative States x (Coarse Grid) Critic Actor Approximate PI Aggregate Problem

pxy(u) =

nX

j=1

pxj(u)�jy g(x, u) =

nX

j=1

pxj(u)g(x, u, j)

Range of Weighted Projections J⇤(i) Original States to States (Fine Grid) Original State Space

dxi = 0 for i /2 Ix �jy = 1 for j 2 Iy �jy = 0 or 1 for all j and y Each j connects to a single x

x pxj1(u) pxj2(u) pxj3(u) �j1y1 �j1y2 �j1y3 �jy with Aggregation Probabilities �jy = 0 or 1

Relate to Rm r⇤m�1 r⇤m x0k+1

Policy Q-Factor Evaluation Evaluate Q-Factor Qµ of Current policy µ Width (✏ + 2↵�)/(1 � ↵)

Random Transition xk+1 = fk(xk, uk, wk) Random Cost gk(xk, uk, wk) Representative Features

1

Bertsekas Reinforcement Learning 9 / 29

Page 9: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Stochastic DP Problems - Perfect State Observation (We Know xk )


System xk+1 = fk(xk, uk, wk) with random “disturbance” wk (e.g., physical noise, market uncertainties, demand for inventory, unpredictable breakdowns, etc.)

Cost function: E{ gN(xN) + ∑_{k=0}^{N−1} gk(xk, uk, wk) }

Policies π = {µ0, . . . , µN−1}, where µk is a “closed-loop control law” or “feedback policy”, i.e., a function of xk: a “lookup table” for the control uk = µk(xk) to apply at xk.

An important point: Using feedback (i.e., choosing controls with knowledge of the state) is beneficial in view of the stochastic nature of the problem.

For given initial state x0, minimize over all π = {µ0, . . . , µN−1} the cost

Jπ(x0) = E{ gN(xN) + ∑_{k=0}^{N−1} gk(xk, µk(xk), wk) }

Optimal cost function: J∗(x0) = minπ Jπ(x0). Optimal policy π∗: Jπ∗(x0) = J∗(x0)
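Since Jπ(x0) is an expectation over the disturbances w0, . . . , wN−1, it can be estimated by simulating the closed-loop system. Below is a minimal Monte Carlo sketch; the callables f, g, gN, mu, and sample_w (system, stage cost, terminal cost, policy, disturbance model) are illustrative names, not from the slides:

import numpy as np

def policy_cost(f, g, gN, mu, x0, N, sample_w, num_runs=1000, rng=None):
    """Monte Carlo estimate of J_pi(x0) = E{ gN(xN) + sum_k gk(xk, mu_k(xk), wk) }."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu(k, x)                 # feedback: control depends on the observed state
            w = sample_w(k, rng)         # sample the random disturbance w_k
            cost += g(k, x, u, w)        # accumulate the stage cost
            x = f(k, x, u, w)            # move to the next state
        total += cost + gN(x)            # add the terminal cost
    return total / num_runs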

Bertsekas Reinforcement Learning 11 / 29

Page 10: Topics in Reinforcement Learning:Lessons from AlphaZero ...

The Stochastic DP Algorithm

Produces the optimal costs J∗k(xk) of the tail subproblems that start at xk

Start with J∗N(xN) = gN(xN), and going backward for k = N − 1, . . . , 0, let

J∗k(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + J∗k+1(fk(xk, uk, wk)) },   for all xk.

The optimal cost J∗(x0) is obtained at the last step: J∗0(x0) = J∗(x0).

The optimal policy component µ∗k can be constructed simultaneously with J∗k, and consists of the minimizing u∗k = µ∗k(xk) above.

Alternative on-line implementation of the optimal policy, given J∗1, . . . , J∗N−1

Sequentially, going forward, for k = 0, 1, . . . , N − 1, observe xk and apply

u∗k ∈ arg min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + J∗k+1(fk(xk, uk, wk)) }.

Issues: Need to know J∗k+1, compute the expectation for each uk, and minimize over all uk

Approximation in value space: Use an approximation J̃k in place of J∗k; approximate E{·} and the minimization over uk.
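For problems with finitely many states, controls, and disturbance values, the backward recursion above can be carried out exactly. A minimal sketch, assuming illustrative callables f, g, gN and a finite disturbance distribution given by w_vals/w_probs:

import numpy as np

def stochastic_dp(states, controls, w_vals, w_probs, f, g, gN, N):
    """Exact DP: J[k][x] = J*_k(x) and mu[k][x] = mu*_k(x) for a finite problem."""
    J = {N: {x: gN(x) for x in states}}          # J*_N(x_N) = g_N(x_N)
    mu = {}
    for k in range(N - 1, -1, -1):               # go backward: k = N-1, ..., 0
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_q = None, np.inf
            for u in controls(k, x):             # u_k in U_k(x_k)
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in zip(w_vals, w_probs))     # E_{w_k}{...}
                if q < best_q:
                    best_u, best_q = u, q
            J[k][x], mu[k][x] = best_q, best_u
    return J, mu

The same Q-factor computation, applied on-line at the current xk only, gives the forward implementation above.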

Bertsekas Reinforcement Learning 12 / 29

Page 11: Topics in Reinforcement Learning:Lessons from AlphaZero ...

A Very Favorable Case: Linear-Quadratic Problems

An example of a linear-quadratic problem
Keep the car velocity constant (like oversimplified cruise control): xk+1 = xk + buk + wk

Here xk = vk − v is the deviation of the vehicle's velocity vk at time k from the desired level v, and b is given

uk is unconstrained; wk has 0 mean and variance σ²

Cost over N stages: q xN² + ∑_{k=0}^{N−1} (q xk² + r uk²), where q > 0 and r > 0 are given

Consider a more general problem where the system is xk+1 = axk + buk + wk

The DP algorithm starts with J∗N(xN) = q xN², and generates J∗k according to

J∗k(xk) = min_{uk} E_{wk}{ q xk² + r uk² + J∗k+1(a xk + b uk + wk) },   k = 0, . . . , N − 1

The DP algorithm can be carried out in closed form to yield J∗k(xk) = Kk xk² + const and µ∗k(xk) = Lk xk, where Kk and Lk can be explicitly computed

The solution does not depend on the distribution of wk as long as it has 0 mean: Certainty Equivalence (a common approximation idea for other problems)

Bertsekas Reinforcement Learning 14 / 29

Page 12: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Derivation - DP Algorithm starting from Terminal Cost J∗N(x) = qx²

J∗N−1(xN−1) = min_{uN−1} E{ q xN−1² + r uN−1² + J∗N(a xN−1 + b uN−1 + wN−1) }

            = min_{uN−1} E{ q xN−1² + r uN−1² + q (a xN−1 + b uN−1 + wN−1)² }

            = min_{uN−1} [ q xN−1² + r uN−1² + q (a xN−1 + b uN−1)² + 2q E{wN−1}(a xN−1 + b uN−1) + q E{wN−1²} ]

            = q xN−1² + min_{uN−1} [ r uN−1² + q (a xN−1 + b uN−1)² ] + q σ²,

using E{wN−1} = 0 and E{wN−1²} = σ².

Minimize by setting the derivative to zero: 0 = 2 r uN−1 + 2 q b (a xN−1 + b uN−1), to obtain

µ∗N−1(xN−1) = LN−1 xN−1   with   LN−1 = − abq / (r + b²q),

and by substitution, J∗N−1(xN−1) = PN−1 xN−1² + q σ², where PN−1 = a²rq / (r + b²q) + q

Similarly, going backward, we obtain for all k:

J∗k(xk) = Pk xk² + σ² ∑_{m=k}^{N−1} Pm+1,   µ∗k(xk) = Lk xk,   Pk = a²r Pk+1 / (r + b² Pk+1) + q,   Lk = − ab Pk+1 / (r + b² Pk+1)
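The backward recursion for Pk and Lk is easy to implement directly. A minimal sketch for the scalar problem above (a, b, q, r, σ² as defined on the preceding slides):

def lq_riccati(a, b, q, r, sigma2, N):
    """Backward recursion: P_N = q,
    P_k = a^2 r P_{k+1} / (r + b^2 P_{k+1}) + q,   L_k = -a b P_{k+1} / (r + b^2 P_{k+1}).
    Also returns the constant terms sigma^2 * sum_{m=k}^{N-1} P_{m+1}."""
    P = [0.0] * (N + 1)
    L = [0.0] * N
    P[N] = q
    for k in range(N - 1, -1, -1):
        denom = r + b * b * P[k + 1]
        P[k] = a * a * r * P[k + 1] / denom + q
        L[k] = -a * b * P[k + 1] / denom
    const = [sigma2 * sum(P[m + 1] for m in range(k, N)) for k in range(N)]
    return P, L, const

Then J∗k(xk) = P[k]·xk² + const[k] and µ∗k(xk) = L[k]·xk; note that the gains L[k] do not depend on σ², consistent with certainty equivalence.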

Bertsekas Reinforcement Learning 15 / 29

Page 13: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Linear-Quadratic Problems in General

Observations and generalizations
The solution does not depend on the distribution of wk, only on its mean (which is 0), i.e., we have certainty equivalence

Generalization to multidimensional problems, nonzero mean disturbances, etc

Generalization to infinite horizon

Generalization to problems where the state is observed partially through linear measurements: the optimal policy involves an extended form of certainty equivalence

Lk E{xk | measurements}

where E{xk | measurements} is provided by an estimator (e.g., a Kalman filter); a minimal sketch is given after the list below. Linear systems and quadratic cost are a starting point for other lines of investigation and approximation:

Problems with safety/state constraints [Model Predictive Control (MPC)]
Problems with control constraints (MPC)
Unknown or changing system parameters (adaptive control)
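A minimal sketch of this extended certainty equivalence for the scalar system, under an assumed (hypothetical, not from the slides) linear measurement model zk = xk + vk with noise variance sigma_v2: a scalar Kalman filter produces E{xk | measurements}, and the control applies the LQ gain Lk to that estimate.

def ce_controller_step(a, b, L_k, x_hat, P, u_prev, z, sigma_w2, sigma_v2):
    """One step of certainty-equivalent control with a scalar Kalman filter.
    Assumed measurement model: z_k = x_k + v_k, v_k zero mean with variance sigma_v2."""
    # Prediction using the previous control
    x_pred = a * x_hat + b * u_prev
    P_pred = a * a * P + sigma_w2
    # Measurement update
    K = P_pred / (P_pred + sigma_v2)
    x_hat = x_pred + K * (z - x_pred)
    P = (1.0 - K) * P_pred
    # Extended certainty equivalence: apply the LQ gain to the state estimate
    u = L_k * x_hat
    return u, x_hat, P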

Bertsekas Reinforcement Learning 16 / 29

Page 14: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Approximation in Value Space - The Three Approximations

ℓ-step lookahead minimization at state xk:

min_{uk, µk+1, . . . , µk+ℓ−1} E{ gk(xk, uk, wk) + ∑_{m=k+1}^{k+ℓ−1} gm(xm, µm(xm), wm) + J̃k+ℓ(xk+ℓ) }


The three approximations:
Approximate minimization (simplified DP minimization)
Approximate E{·}: replace E{·} with nominal values (certainty equivalent control), or use limited simulation (Monte Carlo tree search)
Approximate cost-to-go J̃k+ℓ (could be approximate): simple choices, parametric approximation, problem approximation, rollout

One-step lookahead (ℓ = 1) special case:

min_{uk} E{ gk(xk, uk, wk) + J̃k+1(xk+1) }
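A minimal sketch of the one-step lookahead case with all three approximations visible: the minimization is a search over a (possibly restricted) control set, the expectation is replaced by a sample average, and J̃k+1 is any cost-to-go approximation. The callables controls, f, g, J_tilde, sample_w are illustrative assumptions:

import numpy as np

def one_step_lookahead(k, x, controls, f, g, J_tilde, sample_w, num_samples=20, rng=None):
    """Approximately solve  min_{u in U_k(x)} E{ g_k(x,u,w) + J_tilde_{k+1}(f_k(x,u,w)) }."""
    rng = np.random.default_rng() if rng is None else rng
    best_u, best_q = None, np.inf
    for u in controls(k, x):                                   # approximate minimization: finite search
        ws = [sample_w(k, rng) for _ in range(num_samples)]    # approximate E{.}: sample average
        q = float(np.mean([g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w)) for w in ws]))
        if q < best_q:
            best_u, best_q = u, q
    return best_u                                              # the lookahead (suboptimal) control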


Important variants: Use multistep lookahead, replace E{·} by limited simulation (e.g., a “certainty equivalent” of wk), multiagent rollout (for multicomponent control problems)

An example: Truncated rollout with a base policy and terminal cost approximation (however obtained, e.g., by off-line training)
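A minimal sketch of truncated rollout along these lines: the Q-factor of each candidate control is estimated by simulating the base policy for m steps and adding a terminal cost approximation at the state reached. The callables base_policy, f, g, J_hat, sample_w are illustrative assumptions:

import numpy as np

def truncated_rollout_q(k, x, u, base_policy, f, g, J_hat, m, sample_w, num_sims=10, rng=None):
    """Sampled Q-factor of (x, u): apply u, run the base policy for m steps,
    then add the terminal cost approximation J_hat at the state reached."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_sims):
        w = sample_w(k, rng)
        cost, y = g(k, x, u, w), f(k, x, u, w)
        for j in range(k + 1, k + 1 + m):          # truncated horizon of m base-policy steps
            uj = base_policy(j, y)
            wj = sample_w(j, rng)
            cost += g(j, y, uj, wj)
            y = f(j, y, uj, wj)
        total += cost + J_hat(k + 1 + m, y)        # terminal cost approximation
    return total / num_sims

def rollout_control(k, x, controls, **kwargs):
    """Rollout policy: choose the control with the smallest sampled Q-factor."""
    return min(controls(k, x), key=lambda u: truncated_rollout_q(k, x, u, **kwargs))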


Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

Multiagent Q-factor minimization xk Possible States xk+1 xk+m+1

Termination State Constraint Set X X = X X Multiagent

Current Partial Folding

Current Partial Folding Moving Obstacle

Complete Folding Corresponding to Open

Expert

Rollout with Base Policy m-Step

Approximation of E{·}: Approximate minimization:

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

x1k, u1

k u2k x2

k dk τ

Q-factor approximation

u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)

)∈ C

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u1 x2 u2 x3

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u0 x1 u1 x1

High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice

Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5

Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial

Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

Multiagent Q-factor minimization xk Possible States xk+1 xk+m+1

Termination State Constraint Set X X = X X Multiagent

Current Partial Folding

Current Partial Folding Moving Obstacle

Complete Folding Corresponding to Open

Expert

Rollout with Base Policy m-Step

Approximation of E{·}: Approximate minimization:

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

x1k, u1

k u2k x2

k dk τ

Q-factor approximation

u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)

)∈ C

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u1 x2 u2 x3

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u0 x1 u1 x1

High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice

Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5

Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial

Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

Sec. 1.3 Stochastic Dynamic Programming

Similar to the deterministic case, Q-learning involves the calculation of either the optimal Q-factors (1.16) or approximations Qk(xk, uk). The approximate Q-factors may be obtained using approximation in value space schemes, and can be used to obtain approximately optimal policies through the Q-factor minimization

µk(xk) ∈ arg min_{uk∈Uk(xk)} Qk(xk, uk).   (1.17)

In Chapter 4, we will discuss the use of neural networks in such approximations.

Cost Versus Q-factor Approximations - Robustness and On-Line Replanning

We have seen that it is possible to implement approximation in value space by using cost function approximations [cf. Eq. (1.15)] or by using Q-factor approximations [cf. Eq. (1.17)], so the question arises which one to use in a given practical situation. One important consideration is the facility of obtaining suitable cost or Q-factor approximations. This depends largely on the problem and also on the availability of data on which the approximations can be based. However, there are some other major considerations.

In particular, the cost function approximation scheme

µk(xk) ∈ arg min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) },

has an important disadvantage: the expected value above needs to be computed on-line for all uk ∈ Uk(xk), and this may involve substantial computation. On the other hand it also has an important advantage in situations where the system function fk, the cost per stage gk, or the control constraint set Uk(xk) can change as the system is operating. We will discuss in more detail how this situation can arise in practice later in this chapter. Assuming that the new values of fk, gk, or Uk(xk) become known to the controller, on-line replanning may be used, as discussed earlier for deterministic problems. This may improve substantially the robustness of the approximation in value space scheme.

By comparison, the Q-factor function approximation scheme (1.17) does not allow for on-line replanning. On the other hand, for problems where there is no need for on-line replanning, the Q-factor approximation scheme does not require the on-line computation of expected values and may allow for a much faster on-line computation of the minimizing control µk(xk) via Eq. (1.17).
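To make the contrast concrete, the following is a minimal sketch (not from the text; the finite transition model P, stage cost g, and approximation tables J_next and Q are illustrative assumptions). The first routine computes the expectation over successor states on-line, so an updated model can be substituted at run time; the second only looks up stored Q-factors.

# Sketch: one-step lookahead with a cost approximation (expectation computed
# on-line, hence amenable to on-line replanning) vs. a stored Q-factor table.
def control_from_cost_approx(x, U, P, g, J_next):
    # P[(x, u)] is a dict {y: p_xy(u)}, g(x, u, y) is the stage cost,
    # J_next[y] approximates the cost-to-go from successor state y.
    return min(U, key=lambda u: sum(p * (g(x, u, y) + J_next[y])
                                    for y, p in P[(x, u)].items()))

def control_from_q_approx(x, U, Q):
    # Q[(x, u)] is a precomputed approximate Q-factor; fast, but any change in
    # the model after Q was computed is not reflected here.
    return min(U, key=lambda u: Q[(x, u)])

If the model changes on-line, only the first routine benefits: one simply passes in the new P and g.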

1.3.2 Infinite Horizon Problems - An Overview

We will now provide an outline of infinite horizon stochastic DP with an emphasis on its aspects that relate to our RL/approximation methods.

Bertsekas Reinforcement Learning 17 / 29

Page 15: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Let’s Take a 15-min Working Break: Catch your Breath, Collect your Questions, and Consider the Following Challenge Puzzle

A chess match puzzle

A chess player plays a two-game match against a chess computer program.

A win counts for 1, a draw counts for 1/2, and a loss counts for 0, for both player and computer.

“Sudden death" games are played if the score is tied at 1-1 after the two games.The chess player can choose to play each game in one of two possible styles:

Bold play (wins with probability pw < 1/2 and loses with probability 1 − pw), or
Timid play (draws with probability pd < 1 and loses with probability 1 − pd).

The style for the 2nd game is chosen after seeing the outcome of the 1st game.

Note that the player plays worse than the computer (on average), regardless of the chosen style of play, and must play bold in at least one game to have any chance to win.

Speculate on the optimal policy of the player.

Is it possible for the player to have a better than 50-50 chance to win the match, even though the computer is the better player?

Bertsekas Reinforcement Learning 18 / 29

Page 16: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Answer: Depending on pw and pd , Player’s Win Prob. May be > 1/2

[Figure: DP tree for the two-game chess match: in each game the player chooses Timid Play (draw w.p. pd, loss w.p. 1 − pd) or Bold Play (win w.p. pw, loss w.p. 1 − pw), and the score evolves through 0 − 0, 1 − 0, 0 − 1, 2 − 0, 1.5 − 0.5, 1 − 1, 0.5 − 1.5, 0 − 2, with sudden death at 1 − 1. A companion plot shows the match win probability of the optimal style-choice policy as a function of pw.]

The optimal policy: Play bold in the 1st game. Then play bold again if the 1st game is lost, and timid if the 1st game is won (see the full DP solution in DPB’s DP textbook, Vol. I, Chapter 1; available from Google Books).

Example: For pw = 0.45 and pd = 0.9, the optimal style-of-play policy gives a match win probability of roughly 0.53 (a simple DP calculation that you can try; a sketch follows below).

Intuition: The player can use feedback, while the computer cannot.
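For reference, here is one way to carry out that DP calculation as a short Python sketch. It assumes (an assumption, since the slide does not spell out the sudden-death rule in detail) that a 1-1 tie is resolved by bold play, giving the player a win probability of pw from the tied score.

# DP over (games left in regulation, score lead of the player).
def match_win_prob(pw, pd):
    def V(games_left, lead):
        if games_left == 0:
            if lead > 0: return 1.0        # match already won
            if lead < 0: return 0.0        # match already lost
            return pw                      # 1-1 tie: sudden death, play bold
        # Timid: a draw keeps the lead, a loss decreases it; Bold: win/loss moves it by 1.
        timid = pd * V(games_left - 1, lead) + (1 - pd) * V(games_left - 1, lead - 1)
        bold  = pw * V(games_left - 1, lead + 1) + (1 - pw) * V(games_left - 1, lead - 1)
        return max(timid, bold)            # choose the better style for this game
    return V(2, 0)

print(match_win_prob(0.45, 0.9))           # about 0.537, i.e., better than 50-50

Tracing which of the two terms attains the max reproduces the policy stated above: bold in game 1, then timid after a win and bold after a loss.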

Bertsekas Reinforcement Learning 19 / 29

Page 17: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Infinite Horizon Problems

[Figure: stationary system xk+1 = f(xk, uk, wk) with discounted stage cost α^k g(xk, uk, wk), a cost-free termination state, and an infinite horizon.]

Infinite number of stages, and stationary system and cost

System xk+1 = f(xk, uk, wk) with state, control, and random disturbance.

Policies π = {µ0, µ1, . . .} with µk(x) ∈ U(x) for all x and k.

Cost of stage k: α^k g(xk, µk(xk), wk).

Cost of a policy π = {µ0, µ1, . . .}: The limit as N → ∞ of the N-stage costs

Jπ(x0) = lim_{N→∞} E_{wk}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

0 < α ≤ 1 is the discount factor. If α < 1 the problem is called discounted.

Optimal cost function J∗(x0) = minπ Jπ(x0).

Problems with α = 1 typically include a special cost-free termination state t. The objective is to reach (or approach) t at minimum expected cost.
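As a rough illustration of this definition (a sketch under assumptions; f, g, and mu below are generic placeholders, not objects defined in the slides), Jπ(x0) for α < 1 can be estimated by simulating truncated N-stage discounted costs and averaging over independent runs:

import random

def estimate_policy_cost(x0, mu, f, g, alpha=0.9, N=200, n_runs=1000,
                         sample_w=random.random):
    total = 0.0
    for _ in range(n_runs):
        x, cost, discount = x0, 0.0, 1.0
        for _ in range(N):
            w = sample_w()                       # disturbance w_k
            cost += discount * g(x, mu(x), w)    # alpha^k g(x_k, mu(x_k), w_k)
            x = f(x, mu(x), w)                   # x_{k+1} = f(x_k, mu(x_k), w_k)
            discount *= alpha
        total += cost
    return total / n_runs                        # Monte Carlo estimate of J_pi(x0)

If the stage cost is bounded by G, the truncation error is at most α^N G/(1 − α), so a moderate N suffices when α < 1.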

Bertsekas Reinforcement Learning 21 / 29

Page 18: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Infinite Horizon Problems - The Three Theorems

Intuition: N-stage opt. costs → infinite horizon opt. cost

Apply DP, and let VN−k(x) be the optimal cost-to-go starting at x with k stages to go:

VN−k(x) = min_{u∈U(x)} E_w{ α^{N−k} g(x, u, w) + VN−k+1( f(x, u, w) ) },   VN(x) ≡ 0

Define Jk(x) = VN−k(x)/α^{N−k}, i.e., reverse the time index and divide by α^{N−k}:

Jk(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJk−1( f(x, u, w) ) },   J0(x) ≡ 0   (DP)

JN(x) is equal to V0(x), the N-stage optimal cost starting from x

So for any k, Jk(x) is the k-stage optimal cost starting from x. Intuitively:

J∗(x) = lim_{k→∞} Jk(x),   for all x

J∗ satisfies Bellman’s equation: Take the limit in Eq. (DP) (?)

J∗(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ∗( f(x, u, w) ) },   for all x

Optimality condition: Let µ∗(x) attain the min in the Bellman equation for all x

The policy {µ∗, µ∗, . . .} is optimal. (This type of policy is called stationary.)
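As a small numerical illustration of these facts (the two-state, two-control discounted model below is made up for the sketch, not taken from the slides), iterating the recursion (DP) drives Jk toward a fixed point of Bellman's equation, and the minimizing controls define a stationary policy:

import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[x, u, y]: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0],                    # g[x, u]: expected stage cost
              [0.5, 3.0]])

J = np.zeros(2)                              # J0 = 0, as in (DP)
for _ in range(200):
    Q = g + alpha * P @ J                    # Q[x, u] = g(x, u) + alpha * E{ J(next state) }
    J = Q.min(axis=1)                        # Jk(x) = min_u Q[x, u]

mu_star = Q.argmin(axis=1)                   # stationary policy attaining the min
print(J, mu_star)                            # J is (numerically) the fixed point J*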

Bertsekas Reinforcement Learning 22 / 29

Page 19: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Infinite Horizon Problems - Algorithms

Value iteration (VI): Generates finite horizon opt. cost function sequence {Jk}

Jk(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJk−1( f(x, u, w) ) },   J0 is “arbitrary"   (??)

Policy Iteration (PI): Generates sequences of policies {µk} and their cost functions {Jµk}; µ0 is “arbitrary"

The typical iteration starts with a policy µ and generates a new policy µ̃ in two steps:

Policy evaluation step, which computes the cost function Jµ of the current (base) policy µ

Policy improvement step, which computes the improved (rollout) policy µ̃ using the one-step lookahead minimization

µ̃(x) ∈ arg min_{u∈U(x)} E_w{ g(x, u, w) + αJµ( f(x, u, w) ) }

There are several options for policy evaluation to compute Jµ

Solve Bellman’s equation for µ [Jµ(x) = E{g(x, µ(x), w) + αJµ(f(x, µ(x), w))}] by using VI or another method (it is linear in Jµ)

Use simulation (on-line Monte-Carlo, Temporal Difference (TD) methods)
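A minimal sketch of exact PI on a finite discounted model (same hypothetical array format as the VI sketch above: P[x, u, y] transition probabilities, g[x, u] expected stage costs). Policy evaluation solves the linear Bellman equation for µ, and policy improvement is the one-step lookahead minimization:

import numpy as np

def policy_iteration(P, g, alpha, max_iters=50):
    n, m = g.shape                           # n states, m controls
    mu = np.zeros(n, dtype=int)              # mu0 is "arbitrary"
    for _ in range(max_iters):
        # Policy evaluation: solve J_mu = g_mu + alpha * P_mu J_mu (linear in J_mu)
        P_mu = P[np.arange(n), mu]           # P_mu[x, y] = p_xy(mu(x))
        g_mu = g[np.arange(n), mu]
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: one-step lookahead with respect to J_mu
        mu_new = (g + alpha * P @ J_mu).argmin(axis=1)
        if np.array_equal(mu_new, mu):
            break                            # no further improvement possible
        mu = mu_new
    return mu, J_mu

On finite discounted problems this loop typically terminates after a handful of iterations, consistent with the Newton's-method view noted later in the lecture.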

Bertsekas Reinforcement Learning 23 / 29

Page 20: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Exact and Approximate Policy Iteration

[Figure: the exact and approximate PI cycle: policy evaluation of the base policy µ followed by one-step lookahead policy improvement yielding the rollout policy µ̃ (“AlphaZero has discovered a new way to play!”). In the approximate version, Jµ is used instead of J∗ in the Bellman equation, a value network approximates Jµ, and a policy network (a data-trained classifier on state-control pairs) represents u = µ̃(x, r) at the current state x.]

Important facts (to be discussed later):

PI yields in the limit an optimal policy (?)

PI is faster than VI; can be viewed as Newton’s method for solving Bellman’s Eq.

PI can be implemented approximately, with a value and (perhaps) a policy network

[Figure: approximate PI with truncated rollout: a policy network trained from state-control pairs supplies the base policy, a value network trained from cost data supplies the terminal cost approximation beyond the truncation, and a state space partition may carry a local value network and a local policy network for each set.]

minu∈U(x)

n∑

y=1

pxy(u)(g(x, u, y) + αJ(y)

)

x1k, u1

k u2k x2

k dk τ

Q-factor approximation

u1 u1 10 11 12 R(yk+1) Tk(yk, uk) =(yk, uk, R(yk+1)

)∈ C

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u1 x2 u2 x3

x0 u∗0 x∗

1 u∗1 x∗

2 u∗2 x∗

3 u0 x1 u1 x1

High Cost Transition Chosen by Heuristic at x∗1 Rollout Choice

Capacity=1 Optimal Solution 2.4.2, 2.4.3 2.4.5

Permanent Trajectory Tentative Trajectory Optimal Trajectory Cho-sen by Base Heuristic at x0 Initial

Base Policy Rollout Policy Approximation in Value Space n n − 1n − 2

One-Step or Multistep Lookahead for stages Possible Terminal Cost

Approximation in Policy Space Heuristic Cost Approximation for

for Stages Beyond Truncation yk Feature States yk+1 Cost gk(xk, uk)

Approximate Q-Factor Q(x, u) At x Approximation J

minu∈U(x)

Ew

{g(x, u, w) + αJ

(f(x, u, w)

)}

Truncated Rollout Policy µ m Steps

1

Bertsekas Reinforcement Learning 24 / 29

Page 21: Topics in Reinforcement Learning:Lessons from AlphaZero ...

A More Abstract Notational View

Bellman's equation, VI, and PI can be written using Bellman operators

Recall Bellman's equation:

J*(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J*( f(x, u, w) ) },  for all x

It can be written as a fixed point equation: J*(x) = (TJ*)(x), where T is the Bellman operator that transforms a function J(·) into a function (TJ)(·):

(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + α J( f(x, u, w) ) },  for all x

Shorthand theory using Bellman operators:

VI is the fixed point iteration J_{k+1} = T J_k

There is a Bellman operator T_µ for any policy µ and a corresponding Bellman equation J_µ(x) = (T_µ J_µ)(x) = E{ g(x, µ(x), w) + α J_µ( f(x, µ(x), w) ) }

PI is written compactly as J_{µ^k} = T_{µ^k} J_{µ^k} (policy evaluation) and T_{µ^{k+1}} J_{µ^k} = T J_{µ^k} (policy improvement)

The abstract view is very useful for theoretical analysis, intuition, and visualization
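As an illustration of the operator notation (a minimal sketch, not from the class notes), the Bellman operator T and VI as the fixed point iteration J_{k+1} = T J_k can be coded for a small finite-state problem; the arrays p[x, u, y] = p_xy(u), g[x, u, y] = g(x, u, y), and the example numbers below are assumptions made only for this sketch.

```python
# Minimal sketch (assumed problem data, not from the class notes):
# p[x, u, y] = p_xy(u), g[x, u, y] = g(x, u, y), discount factor alpha.
import numpy as np

def bellman_operator(J, p, g, alpha):
    """(TJ)(x) = min_u sum_y p_xy(u) * (g(x, u, y) + alpha * J(y))."""
    Q = np.einsum('xuy,xuy->xu', p, g + alpha * J[None, None, :])
    return Q.min(axis=1)

def value_iteration(p, g, alpha, tol=1e-10, max_iter=100_000):
    """VI as the fixed point iteration J_{k+1} = T J_k, started from J = 0."""
    J = np.zeros(p.shape[0])
    for _ in range(max_iter):
        J_next = bellman_operator(J, p, g, alpha)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J

# Example with 2 states and 2 controls (arbitrary illustrative numbers):
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.6, 0.4]]])
g = np.ones((2, 2, 2))                     # g(x, u, y) = 1 for all transitions
J_star = value_iteration(p, g, alpha=0.9)  # here J*(x) = 1/(1 - 0.9) = 10 for all x
```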

Bertsekas Reinforcement Learning 25 / 29

Page 22: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Deterministic Linear Quadratic Problem - Infinite Horizon, Undiscounted

Linear system x_{k+1} = a x_k + b u_k; quadratic cost per stage g(x, u) = q x² + r u²

Bellman equation: J(x) = min_u { q x² + r u² + J(a x + b u) }

Finite horizon results (quadratic optimal cost, linear optimal policy) suggest:

J*(x) = K* x², where K* is some positive scalar

The optimal policy has the form µ*(x) = L* x, where L* is some scalar

To characterize K* and L*, we plug J(x) = K x² into the Bellman equation:

K x² = min_u { q x² + r u² + K (a x + b u)² } = · · · = F(K) x²,

where F(K) = a² r K / (r + b² K) + q, with the minimizing u being equal to −( a b K / (r + b² K) ) x

Thus the Bellman equation is solved by J*(x) = K* x², with K* being a solution of the Riccati equation

K* = F(K*) = a² r K* / (r + b² K*) + q

and the optimal policy is linear:

µ*(x) = L* x, with L* = −a b K* / (r + b² K*)

Bertsekas Reinforcement Learning 26 / 29
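As a quick numerical check of the derivation above (an illustrative sketch; the scalars a, b, q, r below are arbitrary example values, not taken from the lecture), the Riccati equation K = F(K) can be solved by the fixed point iteration K_{k+1} = F(K_k), and L* then follows from the formula for the minimizing u.

```python
# Illustrative sketch: solve K = F(K) by fixed point iteration (VI) and
# recover the optimal linear policy gain L*. The numbers a, b, q, r are
# arbitrary example values, not from the lecture.
def riccati_F(K, a, b, q, r):
    """F(K) = a^2 r K / (r + b^2 K) + q (deterministic, undiscounted case)."""
    return a**2 * r * K / (r + b**2 * K) + q

def solve_riccati(a, b, q, r, K0=0.0, tol=1e-12, max_iter=10_000):
    K = K0
    for _ in range(max_iter):
        K_next = riccati_F(K, a, b, q, r)
        if abs(K_next - K) < tol:
            break
        K = K_next
    return K

a, b, q, r = 2.0, 1.0, 1.0, 1.0               # note: a = 2 is open-loop unstable
K_star = solve_riccati(a, b, q, r)            # optimal cost coefficient K* (= 2 + sqrt(5) here)
L_star = -a * b * K_star / (r + b**2 * K_star)  # optimal policy mu*(x) = L* x
print(K_star, L_star, abs(a + b * L_star) < 1)  # closed-loop gain a + b L* is stable
```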

Page 23: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Graphical Solution of Riccati Equation

[Figure: graphical solution of the Riccati equation K = F(K), where F(K) = a²rK/(r + b²K) + q — the graph of F is plotted against the 45° line, their intersection is the fixed point K*, and VI generates the iterates K_{k+1} = F(K_k); lines F_L(K) = (a + bL)²K + q + rL² correspond to linear policies µ(x) = Lx, and the stability region and instability region of K values are marked.]
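To connect this picture with the Newton-step labels in the figure, here is an illustrative sketch (same notation as the previous slide; the numerical values are assumptions, not lecture code) of one Newton step for K = F(K): linearize F at the current K via the greedy gain L, i.e., take the tangent line F_L, and solve the linearized equation K = F_L(K) exactly.

```python
# Illustrative sketch: one Newton step for the Riccati equation K = F(K),
# implemented as "linearize at the greedy policy and solve exactly".
# The example values of a, b, q, r are assumptions for illustration.
def newton_step(K, a, b, q, r):
    L = -a * b * K / (r + b**2 * K)         # greedy (one-step lookahead) gain at K
    # Tangent line F_L(K') = (a + b L)^2 K' + q + r L^2; solve K' = F_L(K')
    slope = (a + b * L)**2
    return (q + r * L**2) / (1.0 - slope)   # requires |a + b L| < 1 (stable greedy policy)

a, b, q, r = 2.0, 1.0, 1.0, 1.0
K = 2.0                                     # any K > 1 here makes the greedy policy stable
for _ in range(5):                          # Newton converges in very few steps
    K = newton_step(K, a, b, q, r)
print(K)                                    # approx. the fixed point K* of K = F(K)
```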

Bertsekas Reinforcement Learning 27 / 29

Page 24: Topics in Reinforcement Learning:Lessons from AlphaZero ...

Visualization of VI

[Figure: visualization of VI on the space of quadratic functions J(x) = Kx² — VI is the iteration J_{k+1}(x) = K_{k+1}x² with K_{k+1} = F(K_k), shown as a staircase along the graph of the Riccati function F toward its fixed point K* (the solution of K = F(K)); tangent lines F_L(K) = (a + bL)²K + q + rL² correspond to one-step lookahead policies, the region of stability (which is also the region of convergence of Newton's method for the Riccati equation) is marked, and the costs of a base policy and of the corresponding rollout policy are indicated.]
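As a companion to this figure (an assumed reconstruction of the kind of plot shown on the slide, not the lecture's own figure code; uses matplotlib and the same example scalars as before), the following sketch generates the graph of F, the 45° line, and the staircase of VI iterates K_{k+1} = F(K_k).

```python
# Illustrative sketch of the VI "staircase" for the Riccati equation.
# The scalars a, b, q, r and the plotting range are assumptions.
import numpy as np
import matplotlib.pyplot as plt

a, b, q, r = 2.0, 1.0, 1.0, 1.0
F = lambda K: a**2 * r * K / (r + b**2 * K) + q

# VI iterates K_{k+1} = F(K_k), started from K_0 = 0
K_iters = [0.0]
for _ in range(8):
    K_iters.append(F(K_iters[-1]))

K_grid = np.linspace(0.0, 8.0, 400)
plt.plot(K_grid, F(K_grid), label='F(K)')
plt.plot(K_grid, K_grid, '--', label='45-degree line')

# Staircase: (K_k, K_k) -> (K_k, K_{k+1}) -> (K_{k+1}, K_{k+1}) -> ...
for K_prev, K_next in zip(K_iters[:-1], K_iters[1:]):
    plt.plot([K_prev, K_prev, K_next], [K_prev, K_next, K_next], color='gray')

plt.xlabel('K')
plt.ylabel('F(K)')
plt.legend()
plt.show()
```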

Bertsekas Reinforcement Learning 28 / 29

Page 25: Topics in Reinforcement Learning:Lessons from AlphaZero ...

About the Next Lecture

Linear quadratic problems and Newton step interpretations

Approximation in value space as a Newton step for solving the Riccati equation

Rollout as a Newton step starting from the cost of the base policy

Policy Iteration as repeated Newton steps

Problem formulations and reformulations

How do we formulate DP models for practical problems?

Problems involving a terminal state (stochastic shortest path problems)

Problem reformulation by state augmentation (dealing with delays, correlations, forecasts, etc.)

Problems involving imperfect state observation (POMDP)

Multiagent problems - Nonclassical information patterns

Systems with unknown or changing parameters - Adaptive control

PLEASE READ SECTIONS 1.5 and 1.6 OF THE CLASS NOTES (AMAP)

1ST HOMEWORK (DUE IN ONE WEEK): Exercise 1.1 of the Class Notes

Bertsekas Reinforcement Learning 29 / 29