Transcript of Lecture 7: Review on Tabular RL (ierg6130/slides/lecture7.pdf, 2020-02-18)

Page 1

Lecture 7: Review on Tabular RL

Bolei Zhou

The Chinese University of Hong Kong

[email protected]

February 18, 2020

Page 2

Announcement

1 We continue with ZOOM online teaching
  1 20-30 minutes per session
  2 Try to open your webcam
  3 Take notes

2 HW1 and the project proposal are due by the end of Feb. 19; please email them to [email protected]

Page 3

Today’s Content: Review Session

1 RL basics

2 MDP prediction and control
  1 Policy evaluation
  2 Control: policy iteration and value iteration

3 Model-free prediction and control

Page 4

Reinforcement Learning

1 A computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment. (Sutton and Barto)

Page 5

Difference between Reinforcement Learning and Supervised Learning

1 Sequential data as input (not i.i.d.)

2 Trial-and-error exploration (balance between exploration and exploitation)

3 There is no supervisor, only a reward signal, which is also delayed

4 Agent's actions affect the subsequent data it receives (the agent's actions change the environment)

Page 6

Major Components of RL

1 An RL agent may include one or more of these components
  1 Policy: the agent's behavior function
  2 Value function: how good is each state or action
  3 Model: agent's state representation of the environment

Page 7

Define the Model of the Environment

1 Markov Processes

2 Markov Reward Processes (MRPs)

3 Markov Decision Processes (MDPs)

Page 8

Markov Process

1 P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)

2 State transition matrix P(to | from):

      | P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1) |
  P = | P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2) |
      |    ...         ...      ...     ...     |
      | P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) |
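As a concrete illustration (a hypothetical 3-state chain, not from the slides), here is a minimal Python/NumPy sketch of such a transition matrix, stored as P[from, to] so that each row sums to 1:

```python
import numpy as np

# Hypothetical 3-state Markov chain; P[i, j] = P(s_j | s_i), rows sum to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.3, 0.7],
])

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

# Distribution over states after 10 steps, starting from state 0.
mu0 = np.array([1.0, 0.0, 0.0])
mu10 = mu0 @ np.linalg.matrix_power(P, 10)
print(mu10)
```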

Page 9

Markov Reward Process (MRP)

1 A Markov Reward Process is a Markov chain + reward

2 Definition of a Markov Reward Process (MRP)
  1 S is a (finite) set of states (s ∈ S)
  2 P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
  3 R is a reward function, R(s_t = s) = E[r_t | s_t = s]
  4 Discount factor γ ∈ [0, 1]

3 If the number of states is finite, R can be represented as a vector

Page 10

MRP Example

Reward: +5 in s_1, +10 in s_7, and 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]

Page 11

Return Function and Value Function

1 Return: Discounted sum of rewards from time step t to horizon

  G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T−t−1} R_T

2 Value function V_t(s): expected return from time t in state s

  V_t(s) = E[G_t | s_t = s]
         = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T−t−1} R_T | s_t = s]
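To make the return concrete, here is a minimal sketch (hypothetical reward sequence and γ, not from the slides) that computes G_t from a list of rewards R_{t+1}, ..., R_T by accumulating backwards:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... + gamma^(T-t-1)*R_T,
    where `rewards` holds [R_{t+1}, R_{t+2}, ..., R_T]."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: G = R + gamma * G_next
        g = r + gamma * g
    return g

# Example: rewards received after time t, with gamma = 0.5
print(discounted_return([0, 0, 5], gamma=0.5))   # 0 + 0.5*0 + 0.25*5 = 1.25
```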

Page 12

Computing the Value of a Markov Reward Process

1 Value function: expected return from starting in state s

V(s) = E[G_t | s_t = s] = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | s_t = s]

2 MRP value function satisfies the following Bellman equation:

V(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')

where R(s) is the immediate reward and γ Σ_{s'∈S} P(s'|s) V(s') is the discounted sum of future rewards.

Page 13

Iterative Algorithm for Computing Value of a MRP

Algorithm 1: Iterative algorithm to calculate the MRP value function

1: for all states s ∈ S: V'(s) ← 0, V(s) ← ∞
2: while ||V − V'|| > ε do
3:   V ← V'
4:   for all states s ∈ S: V'(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')
5: end while
6: return V'(s) for all s ∈ S
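A minimal Python sketch of Algorithm 1, assuming the reward vector R = [5, 0, 0, 0, 0, 0, 10] from the earlier example and a hypothetical row-stochastic transition matrix P (the uniform matrix here is only a placeholder):

```python
import numpy as np

def mrp_value(P, R, gamma, eps=1e-6):
    """Iteratively compute V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s')."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V       # Bellman backup for every state at once
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

R = np.array([5.0, 0, 0, 0, 0, 0, 10.0])
P = np.full((7, 7), 1.0 / 7)            # hypothetical uniform transitions
print(mrp_value(P, R, gamma=0.9))
```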

Page 14

Markov Decision Process (MDP)

1 A Markov Decision Process describes the framework of reinforcement learning. An MDP is a tuple (S, A, P, R, γ):
  1 S is a finite set of states, A is a finite set of actions
  2 P^a is the dynamics/transition model for each action, P(s_{t+1} = s' | s_t = s, a_t = a)
  3 R is a reward function, R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
  4 Discount factor γ ∈ [0, 1]

Page 15

Return Function and Value Function

1 Definition of horizon
  1 Maximum number of time steps in each episode
  2 Can be infinite; otherwise the process is called a finite Markov (reward) process

2 Definition of return
  1 Discounted sum of rewards from time step t to the horizon

  G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T−t−1} R_T

3 Definition of the state value function V_t(s) for an MRP
  1 Expected return from time t in state s

  V_t(s) = E[G_t | s_t = s]
         = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T−t−1} R_T | s_t = s]

  2 Present value of future rewards

Page 16

Value function for MDP

1 The state-value function v_π(s) of an MDP is the expected return starting from state s and following policy π

  v_π(s) = E_π[G_t | s_t = s]   (1)

2 The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π

  q_π(s, a) = E_π[G_t | s_t = s, A_t = a]   (2)

3 The relation between v_π(s) and q_π(s, a):

  v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)   (3)

Page 17

Comparison of Markov Process and MDP

Page 18

Bellman Expectation Equation

1 The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state,

  v_π(s) = E_π[R_{t+1} + γ v_π(s_{t+1}) | s_t = s]   (4)

2 The action-value function can similarly be decomposed,

  q_π(s, a) = E_π[R_{t+1} + γ q_π(s_{t+1}, A_{t+1}) | s_t = s, A_t = a]   (5)

Page 19

Backup Diagram for V^π

v_π(s) = Σ_{a∈A} π(a|s) (R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s'))   (6)

Page 20

Backup Diagram for Q^π

q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q_π(s', a')   (7)

Page 21

Decision Making in MDP

1 Prediction: evaluate a given policy
  1 Input: MDP (S, A, P, R, γ) and policy π
  2 Output: value function v_π

2 Control: search for the optimal policy
  1 Input: MDP (S, A, P, R, γ)
  2 Output: optimal value function v_* and optimal policy π_*

3 Prediction and control can be solved by dynamic programming.

Page 22

Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy. Bootstrapping: estimate on top of estimates.

v_{k+1}(s) = Σ_{a∈A} π(a|s) (R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s'))   (8)

1 Synchronous backup algorithm:
  1 At each iteration k+1, update v_{k+1}(s) from v_k(s') for all states s ∈ S, where s' is a successor state of s

  v_{k+1}(s) ← Σ_{a∈A} π(a|s) (R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s'))   (9)

2 Convergence: v_1 → v_2 → ... → v_π
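A minimal sketch of this synchronous policy-evaluation loop, assuming a tabular MDP stored as hypothetical NumPy arrays P[s, a, s'], R[s, a] and a stochastic policy pi[s, a] (these names are illustrative, not from the slides):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, eps=1e-6):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) (R(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s'))."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # q[s, a] = R(s,a) + gamma * E[v_k(s')]
        v_new = np.sum(pi * q, axis=1)   # expectation over the policy's actions
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new
```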

Page 23

MDP Control

1 Compute the optimal policy

π_*(s) = argmax_π v_π(s)   (10)

2 Policy iteration and value iteration

Page 24

Improve a Policy through Policy Iteration

1 Iterate through the two steps:
  1 Evaluate the policy π (computing v_π given the current π)
  2 Improve the policy by acting greedily with respect to v_π

  π' = greedy(v_π)   (11)

Page 25

Policy Improvement

1 Compute the state-action value of policy π_i:

  q_{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_{π_i}(s')   (12)

2 Compute the new greedy policy π_{i+1} for all s ∈ S following

  π_{i+1}(s) = argmax_a q_{π_i}(s, a)   (13)

3 We can prove monotonic improvement of the policy.
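A minimal sketch of the improvement step under the same hypothetical P[s, a, s'], R[s, a] arrays, given the value function v of the current policy; alternating this with the policy-evaluation sketch above until the policy stops changing is one way to realize policy iteration:

```python
import numpy as np

def greedy_improvement(P, R, v, gamma):
    """Compute q_pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) v_pi(s')
    and return the greedy (deterministic) policy pi_{i+1}(s) = argmax_a q_pi(s,a)."""
    q = R + gamma * P @ v
    return np.argmax(q, axis=1)     # one greedy action index per state
```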

Page 26

Monotonic Improvement and Bellman optimality equation

1 If improvements stop,

  q_π(s, π'(s)) = max_{a∈A} q_π(s, a) = q_π(s, π(s)) = v_π(s)

2 Thus the Bellman optimality equation has been satisfied:

  v_π(s) = max_{a∈A} q_π(s, a)

3 Then we have the following Bellman optimality backup:

  v_*(s) = max_a (R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_*(s'))

Page 27

Value iteration: turning the Bellman optimality equation into an update rule

1 Suppose we know the solution to the subproblem v_*(s'), which is optimal.

2 Then the optimal v_*(s) can be found by iterating the following Bellman optimality backup rule,

  v(s) ← max_{a∈A} (R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s'))

3 The idea of value iteration is to apply these updates iteratively.
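A minimal sketch of value iteration with the same hypothetical P[s, a, s'], R[s, a] representation, applying the Bellman optimality backup until the values stop changing and then reading off the greedy policy:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Iterate v(s) <- max_a (R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s'))."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = np.max(R + gamma * P @ v, axis=1)          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < eps:
            break
        v = v_new
    policy = np.argmax(R + gamma * P @ v_new, axis=1)      # greedy policy w.r.t. final values
    return v_new, policy
```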

Page 28

Summary for Prediction and Control in MDP

Table: Dynamic Programming Algorithms

Problem    | Bellman Equation             | Algorithm
-----------|------------------------------|----------------------------
Prediction | Bellman expectation equation | Iterative policy evaluation
Control    | Bellman expectation equation | Policy iteration
Control    | Bellman optimality equation  | Value iteration

Page 29

Model-free RL

1 Policy evaluation, policy iteration, and value iteration for solving a known MDP

2 Model-free prediction: Estimate value function of an unknown MDP

3 Model-free control: Optimize value function of an unknown MDP

Page 30

Model-free RL: prediction and optimal control by interaction

1 No more oracle or shortcut: the transition dynamics and reward function are no longer known

2 Trajectories/episodes are collected through the interaction of the agent and the environment

3 Each trajectory/episode contains {S_1, A_1, R_1, S_2, A_2, R_2, ..., S_T, A_T, R_T}

Page 31

Model-free prediction: policy evaluation without access to the model

1 Estimating the expected return of a particular policy when we don't have access to the MDP model
  1 Monte Carlo policy evaluation
  2 Temporal Difference (TD) learning

Page 32

Monte-Carlo Policy Evaluation

1 To evaluate the value v(s) of state s
  1 Every time step t that state s is visited in an episode,
  2 increment the counter N(s) ← N(s) + 1,
  3 increment the total return S(s) ← S(s) + G_t,
  4 and estimate the value by the mean return v(s) = S(s)/N(s)

2 By the law of large numbers, v(s) → v_π(s) as N(s) → ∞

Page 33

Incremental MC Updates

1 Collect one episode (S_1, A_1, R_1, ..., S_T)

2 For each state S_t with computed return G_t:

  N(S_t) ← N(S_t) + 1
  v(S_t) ← v(S_t) + (1/N(S_t)) (G_t − v(S_t))

3 Or use a running mean (old episodes are forgotten), which is good for non-stationary problems:

  v(S_t) ← v(S_t) + α (G_t − v(S_t))
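A minimal sketch of incremental every-visit Monte Carlo evaluation, assuming each episode is given as a list of (state, reward) pairs collected under the policy (a hypothetical interface, not from the slides):

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma):
    """Every-visit incremental MC: v(S_t) <- v(S_t) + (1/N(S_t)) * (G_t - v(S_t))."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:                 # episode = [(s_t, r_{t+1}), ...]
        G = 0.0
        for s, r in reversed(episode):       # compute returns backwards
            G = r + gamma * G
            N[s] += 1
            V[s] += (G - V[s]) / N[s]
    return V
```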

Page 34

Difference between DP and MC for policy evaluation

1 DP computes v_i by bootstrapping the rest of the expected return with the value estimate v_{i−1}

2 Iteration on the Bellman expectation backup:

  v_i(s) ← Σ_{a∈A} π(a|s) (R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_{i−1}(s'))

Page 35

Difference between DP and MC for policy evaluation

1 MC updates the empirical mean return with one sampled episode

v(S_t) ← v(S_t) + α (G_{i,t} − v(S_t))

Page 36

Advantages of MC over DP

1 MC works when the environment is unknown

2 Working with sample episodes has a huge advantage, even when one has complete knowledge of the environment's dynamics, for example, when the transition probabilities are complex to compute

3 The cost of estimating a single state's value is independent of the total number of states, so you can sample episodes starting from the states of interest and then average the returns

Page 37

Temporal-Difference (TD) Learning

1 TD methods learn directly from episodes of experience

2 TD is model-free: no knowledge of MDP transitions/rewards

3 TD learns from incomplete episodes, by bootstrapping

Page 38

Temporal-Difference (TD) Learning

1 Objective: learn v_π online from experience under policy π

2 Simplest TD algorithm: TD(0)
  1 Update v(S_t) toward the estimated return R_{t+1} + γ v(S_{t+1})

  v(S_t) ← v(S_t) + α (R_{t+1} + γ v(S_{t+1}) − v(S_t))

3 R_{t+1} + γ v(S_{t+1}) is called the TD target

4 δ_t = R_{t+1} + γ v(S_{t+1}) − v(S_t) is called the TD error

5 Comparison: incremental Monte Carlo
  1 Update v(S_t) toward the actual return G_t given an episode i

  v(S_t) ← v(S_t) + α (G_{i,t} − v(S_t))
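A minimal sketch of tabular TD(0) prediction, assuming a Gym-style environment whose step() returns (next_state, reward, done, info) and a policy given as a function of the state (both hypothetical interfaces, not from the slides):

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma, alpha, n_episodes):
    """Tabular TD(0): v(S_t) <- v(S_t) + alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # no bootstrap past terminal
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```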

Page 39

Advantages of TD over MC

Page 40

n-step TD

1 n-step TD methods generalize both one-step TD and MC.

2 We can shift from one to the other smoothly as needed to meet the demands of a particular task.

Page 41

n-step TD prediction

1 Consider the following n-step returns for n = 1, 2, ..., ∞:

  n = 1 (TD):   G_t^{(1)} = R_{t+1} + γ v(S_{t+1})
  n = 2:        G_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 v(S_{t+2})
  ...
  n = ∞ (MC):   G_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step return is defined as

  G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})

3 n-step TD: v(S_t) ← v(S_t) + α (G_t^{(n)} − v(S_t))
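A minimal sketch of computing G_t^{(n)} from stored rewards and the current value table (hypothetical names; if the episode ends within n steps, it falls back to the plain MC return):

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^{(n)} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n}).

    `rewards[k]` is R_{k+1} (the reward after leaving the state at time k),
    `values[k]` is the current estimate of v(S_k), and T = len(rewards)."""
    T = len(rewards)
    G = 0.0
    for k in range(t, min(t + n, T)):
        G += (gamma ** (k - t)) * rewards[k]
    if t + n < T:                       # bootstrap only if S_{t+n} is nonterminal
        G += (gamma ** n) * values[t + n]
    return G
```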

Page 42

Bootstrapping and Sampling for DP, MC, and TD

1 Bootstrapping: the update involves an estimate
  1 MC does not bootstrap
  2 DP bootstraps
  3 TD bootstraps

2 Sampling: the update samples an expectation
  1 MC samples
  2 DP does not sample
  3 TD samples

Page 43

Unified View: Dynamic Programming Backup

v(S_t) ← E_π[R_{t+1} + γ v(S_{t+1})]

Page 44

Unified View: Monte-Carlo Backup

v(S_t) ← v(S_t) + α (G_t − v(S_t))

Page 45

Unified View: Temporal-Difference Backup

TD(0): v(S_t) ← v(S_t) + α (R_{t+1} + γ v(S_{t+1}) − v(S_t))

Page 46

Unified View of Reinforcement Learning

Page 47

Control: Optimize the policy for a MDP

1 Model-based control: optimize the value function with known MDP

2 Model-free control: optimize the value function with unknown MDP

3 Many model-free RL examples: Go, robot locomotion, patient treatment, helicopter control, Atari, StarCraft

Page 48

Policy Iteration

1 Iterate through the two steps:
  1 Evaluate the policy π (computing v_π given the current π)
  2 Improve the policy by acting greedily with respect to v_π

  π' = greedy(v_π)   (14)

Page 49

Policy Iteration for a Known MDP

1 Compute the state-action value of policy π_i:

  q_{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_{π_i}(s')

2 Compute the new policy π_{i+1} for all s ∈ S following

  π_{i+1}(s) = argmax_a q_{π_i}(s, a)   (15)

3 Problem: what do we do if neither R(s, a) nor P(s'|s, a) is known/available?

Page 50

Generalized Policy Iteration

Monte Carlo version of policy iteration

1 Policy evaluation: Monte Carlo policy evaluation, Q = q_π

2 Policy improvement: greedy policy improvement?

  π(s) = argmax_a q(s, a)

Page 51

ε-Greedy Exploration

1 Trade-off between exploration and exploitation (we will talk about this in a later lecture)

2 ε-greedy exploration: ensuring continual exploration
  1 All actions are tried with non-zero probability
  2 With probability 1 − ε choose the greedy action
  3 With probability ε choose an action at random

  π(a|s) = ε/|A| + 1 − ε   if a = a* = argmax_{a'∈A} Q(s, a')
           ε/|A|           otherwise
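A minimal sketch of ε-greedy action selection over a tabular Q stored as a dictionary keyed by (state, action) with missing entries treated as 0 (hypothetical names); note that the greedy action can also be drawn in the random branch, which matches the ε/|A| + 1 − ε probability above:

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # greedy action: argmax_a Q(state, a); missing entries default to 0
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))
```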

Page 52

Monte Carlo with ε-Greedy Exploration

Algorithm 2: Monte Carlo control with ε-greedy exploration

1: Initialize Q(S, A) = 0, N(S, A) = 0, ε = 1, k = 1
2: π_k = ε-greedy(Q)
3: loop
4:   Sample the k-th episode (S_1, A_1, R_2, ..., S_T) ∼ π_k
5:   for each state S_t and action A_t in the episode do
6:     N(S_t, A_t) ← N(S_t, A_t) + 1
7:     Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t)) (G_t − Q(S_t, A_t))
8:   end for
9:   k ← k + 1, ε ← 1/k
10:  π_k = ε-greedy(Q)
11: end loop
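A minimal sketch of the loop in Algorithm 2, assuming the same hypothetical Gym-style environment interface as before (step() returning (next_state, reward, done, info)); the small epsilon_greedy helper is repeated here so the sketch is self-contained:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps):
    # same helper as in the earlier sketch
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))

def mc_control(env, n_actions, gamma, n_episodes):
    """GLIE Monte Carlo control: epsilon-greedy episodes, incremental Q updates, eps = 1/k."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k
        # sample one episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, n_actions, eps)
            s_next, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # incremental every-visit MC update toward the return G_t
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```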

Page 53

TD for Prediction and Control

1 We can also use TD instead of MC in our control loop
  1 Apply TD to Q(S, A)
  2 Use ε-greedy policy improvement
  3 Update every time step rather than at the end of an episode

Page 54

Recall: TD Prediction

1 An episode consists of an alternating sequence of states and state-action pairs:

2 TD(0) method for estimating the value function V(S):

  A_t ← action given by π for S_t
  Take action A_t, observe R_{t+1} and S_{t+1}
  V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

3 How about estimating the action-value function Q(S, A)?

Page 55

Sarsa: On-Policy TD Control

1 An episode consists of an alternating sequence of states and state-action pairs:

2 Follow the ε-greedy policy for one step, then bootstrap the action-value function:

  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

3 The update is done after every transition from a nonterminal state S_t

4 The TD target is R_{t+1} + γ Q(S_{t+1}, A_{t+1})
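A minimal sketch of tabular Sarsa, again assuming the hypothetical Gym-style environment interface and repeating the small epsilon_greedy helper so the sketch stands alone:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps):
    # same helper as in the earlier sketches
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))

def sarsa(env, n_actions, gamma, alpha, epsilon, n_episodes):
    """On-policy TD control: Q(S,A) <- Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            if done:
                target = r                                   # no bootstrap at terminal state
            else:
                a_next = epsilon_greedy(Q, s_next, n_actions, epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next
    return Q
```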

Page 56

Sarsa algorithm

Page 57

n-step Sarsa

1 Consider the following n-step Q-returns for n = 1, 2, ..., ∞:

  n = 1 (Sarsa):  q_t^{(1)} = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
  n = 2:          q_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 Q(S_{t+2}, A_{t+2})
  ...
  n = ∞ (MC):     q_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step Q-return is defined as

  q_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

3 n-step Sarsa updates Q(S_t, A_t) toward the n-step Q-return:

  Q(S_t, A_t) ← Q(S_t, A_t) + α (q_t^{(n)} − Q(S_t, A_t))

Page 58

Tomorrow

1 Lecture on on-policy learning and off-policy learning
