Lecture 7: Review on Tabular RL
Bolei Zhou
The Chinese University of Hong Kong
February 18, 2020
Bolei Zhou (CUHK) IERG6130 Reinforcement Learning February 18, 2020 1 / 58
Announcement
1 We continue with Zoom online teaching
  1 20-30 mins per session
  2 Try to open your webcam
  3 Take notes
2 HW1 and the project proposal are due by the end of Feb. 19; please email [email protected]
Today’s Content: Review Session
1 RL basics
2 MDP prediction and control
  1 Policy evaluation
  2 Control: policy iteration and value iteration
3 Model-free prediction and control
Reinforcement Learning
1 A computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment. - Sutton and Barto
Difference between Reinforcement Learning and Supervised Learning
1 Sequential data as input (not i.i.d.)
2 Trial-and-error exploration (balance between exploration and exploitation)
3 There is no supervisor, only a reward signal, which is also delayed
4 The agent's actions affect the subsequent data it receives (the agent's actions change the environment)
Major Components of RL
1 An RL agent may include one or more of these components
  1 Policy: the agent's behavior function
  2 Value function: how good each state or action is
  3 Model: the agent's representation of the environment
Define the model of environment
1 Markov Processes
2 Markov Reward Processes (MRPs)
3 Markov Decision Processes (MDPs)
Markov Process
1 P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
2 State transition matrix P(to | from):

  P = [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1)
        P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2)
        ...         ...         ...  ...
        P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ]
Markov Reward Process (MRP)
1 A Markov reward process is a Markov chain plus a reward function
2 Definition of a Markov Reward Process (MRP)
  1 S is a (finite) set of states (s ∈ S)
  2 P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
  3 R is a reward function, R(s_t = s) = E[r_t | s_t = s]
  4 Discount factor γ ∈ [0, 1]
3 If there is a finite number of states, R can be represented as a vector
MRP Example
Reward: +5 in s_1, +10 in s_7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]
Return Function and Value Function
1 Return: discounted sum of rewards from time step t to the horizon

  G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T-t-1} R_T

2 Value function V_t(s): expected return from t in state s

  V_t(s) = E[G_t | s_t = s]
         = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T-t-1} R_T | s_t = s]
Computing the Value of a Markov Reward Process
1 Value function: expected return from starting in state s

  V(s) = E[G_t | s_t = s] = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | s_t = s]

2 The MRP value function satisfies the following Bellman equation:

  V(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')

  where R(s) is the immediate reward and the second term is the discounted sum of future rewards.
Iterative Algorithm for Computing Value of a MRP
Algorithm 1 Iterative algorithm to calculate the MRP value function

1: for all states s ∈ S: V'(s) ← 0, V(s) ← ∞
2: while ||V − V'|| > ε do
3:   V ← V'
4:   for all states s ∈ S: V'(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')
5: end while
6: return V'(s) for all s ∈ S
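Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the lecture: the helper name `mrp_value` and the array layout (`P` as the N×N transition matrix, `R` as the length-N reward vector) are my own assumptions.

```python
import numpy as np

def mrp_value(P, R, gamma, eps=1e-6):
    """Iteratively compute the MRP value function via the Bellman backup."""
    P, R = np.asarray(P), np.asarray(R)
    V = np.zeros(len(R))
    while True:
        # V'(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s')
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new
```

For small state spaces the fixed point can also be obtained directly as V = (I − γP)^{-1} R, which is a handy check on the iteration.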
Markov Decision Process (MDP)
1 A Markov Decision Process describes the framework of reinforcement learning. An MDP is a tuple (S, A, P, R, γ):
  1 S is a finite set of states; A is a finite set of actions
  2 P^a is the dynamics/transition model for each action: P(s_{t+1} = s' | s_t = s, a_t = a)
  3 R is a reward function, R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
  4 Discount factor γ ∈ [0, 1]
Return Function and Value Function
1 Definition of horizon
  1 Maximum number of time steps in each episode
  2 Can be infinite; otherwise the process is called a finite Markov (reward) process
2 Definition of return
  1 Discounted sum of rewards from time step t to the horizon

    G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T-t-1} R_T

3 Definition of the state value function V_t(s) for an MRP
  1 Expected return from t in state s

    V_t(s) = E[G_t | s_t = s]
           = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T-t-1} R_T | s_t = s]

  2 Present value of future rewards
Value function for MDP
1 The state-value function v_π(s) of an MDP is the expected return starting from state s and following policy π:

  v_π(s) = E_π[G_t | s_t = s]   (1)

2 The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π:

  q_π(s, a) = E_π[G_t | s_t = s, a_t = a]   (2)

3 The two are related by

  v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)   (3)
Comparison of Markov Process and MDP
Bellman Expectation Equation
1 The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:

  v_π(s) = E_π[R_{t+1} + γ v_π(s_{t+1}) | s_t = s]   (4)

2 The action-value function can be similarly decomposed:

  q_π(s, a) = E_π[R_{t+1} + γ q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]   (5)
Backup Diagram for V π
v_π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s') )   (6)
Backup Diagram for Qπ
q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q_π(s', a')   (7)
Decision Making in MDP
1 Prediction: evaluate a given policy
  1 Input: MDP (S, A, P, R, γ) and policy π
  2 Output: value function v_π
2 Control: search for the optimal policy
  1 Input: MDP (S, A, P, R, γ)
  2 Output: optimal value function v* and optimal policy π*
3 Both prediction and control can be solved by dynamic programming.
Policy evaluation: Iteration on Bellman expectation backup
Bellman expectation backup for a particular policy.
Bootstrapping: estimate on top of estimates.

  v_{k+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') )   (8)

1 Synchronous backup algorithm:
  1 At each iteration k+1, update v_{k+1}(s) from v_k(s') for all states s ∈ S, where s' is a successor state of s:

    v_{k+1}(s) ← Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') )   (9)

2 Convergence: v_1 → v_2 → ... → v_π
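The synchronous backup can be sketched in NumPy. This is a minimal sketch under my own conventions (not from the lecture): `P` is an (A, S, S) transition tensor, `R` an (S, A) reward matrix, and `pi` an (S, A) matrix of action probabilities.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, eps=1e-6):
    """Synchronous Bellman expectation backup for a fixed policy."""
    S = R.shape[0]
    v = np.zeros(S)
    while True:
        # q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s')
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        # v_{k+1}(s) = sum_a pi(a|s) q(s,a)
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new
```

With a single action this reduces exactly to the MRP value iteration of Algorithm 1.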
MDP Control
1 Compute the optimal policy

  π*(s) = arg max_π v_π(s)   (10)

2 Two approaches: policy iteration and value iteration
Improve a Policy through Policy Iteration
1 Iterate between two steps:
  1 Evaluate the policy π (computing v_π given the current π)
  2 Improve the policy by acting greedily with respect to v_π:

    π' = greedy(v_π)   (11)
Policy Improvement
1 Compute the state-action value of policy π_i:

  q_{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_{π_i}(s')   (12)

2 Compute a new greedy policy π_{i+1} for all s ∈ S:

  π_{i+1}(s) = arg max_a q_{π_i}(s, a)   (13)

3 We can prove that this yields a monotonic improvement in the policy.
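The evaluate-then-improve loop can be sketched as follows. This is a minimal sketch under assumptions of my own (not from the lecture): `P` is an (A, S, S) transition tensor, `R` an (S, A) reward matrix, and the policy is kept deterministic so evaluation can be done by a direct linear solve.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until stable."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)            # deterministic policy: state -> action
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly
        P_pi = P[pi, np.arange(S), :]      # (S, S) transition rows under pi
        R_pi = R[np.arange(S), pi]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. q_{pi_i}(s, a)
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):     # improvement stopped
            return pi, v
        pi = pi_new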
Monotonic Improvement and Bellman optimality equation
1 If improvements stop, then

  q_π(s, π'(s)) = max_{a∈A} q_π(s, a) = q_π(s, π(s)) = v_π(s)

2 Thus the Bellman optimality equation is satisfied:

  v_π(s) = max_{a∈A} q_π(s, a)

3 This gives the following Bellman optimality backup:

  v*(s) = max_a ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v*(s') )
Value iteration by turning the Bellman optimality equation into an update rule

1 Suppose we know the solution to the subproblem v*(s'), which is optimal.
2 Then the optimal v*(s) can be found by iterating over the following Bellman optimality backup rule:

  v(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s') )

3 The idea of value iteration is to apply this update iteratively.
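The Bellman optimality backup as an update rule can be sketched in NumPy (a minimal sketch; the (A, S, S) / (S, A) array layout and the helper name `value_iteration` are my own assumptions, matching no particular library):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Apply the Bellman optimality backup until the value function converges."""
    S, A = R.shape
    v = np.zeros(S)
    while True:
        # v(s) <- max_a (R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s'))
        v_new = np.max(R + gamma * np.einsum('ast,t->sa', P, v), axis=1)
        if np.max(np.abs(v_new - v)) < eps:
            # Recover a greedy policy from the converged values
            pi = np.argmax(R + gamma * np.einsum('ast,t->sa', P, v_new), axis=1)
            return v_new, pi
        v = v_new
```

Unlike policy iteration, no explicit policy is maintained during the loop; a greedy policy is extracted once the values converge.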
Summary for Prediction and Control in MDP
Table: Dynamic Programming Algorithms
Problem     | Bellman equation              | Algorithm
------------|-------------------------------|----------------------------
Prediction  | Bellman expectation equation  | Iterative policy evaluation
Control     | Bellman expectation equation  | Policy iteration
Control     | Bellman optimality equation   | Value iteration
Model-free RL
1 Policy evaluation, policy iteration, and value iteration solve a known MDP
2 Model-free prediction: Estimate value function of an unknown MDP
3 Model-free control: Optimize value function of an unknown MDP
Model-free RL: prediction and optimal control by interaction

1 No more oracle or shortcut of known transition dynamics and reward function
2 Trajectories/episodes are collected through the interaction of the agent with the environment
3 Each trajectory/episode contains {S_1, A_1, R_1, S_2, A_2, R_2, ..., S_T, A_T, R_T}
Model-free prediction: policy evaluation without access to the model

1 Estimate the expected return of a particular policy when we don't have access to the MDP model
  1 Monte-Carlo policy evaluation
  2 Temporal-Difference (TD) learning
Monte-Carlo Policy Evaluation
1 To evaluate the state value v(s):
  1 every time step t that state s is visited in an episode,
  2 increment the counter N(s) ← N(s) + 1
  3 increment the total return S(s) ← S(s) + G_t
  4 estimate the value by the mean return v(s) = S(s)/N(s)
2 By the law of large numbers, v(s) → v_π(s) as N(s) → ∞
Incremental MC Updates
1 Collect one episode (S_1, A_1, R_1, ..., S_T)
2 For each state S_t with computed return G_t:

  N(S_t) ← N(S_t) + 1
  v(S_t) ← v(S_t) + (1/N(S_t)) (G_t − v(S_t))

3 Or use a running mean (old episodes are forgotten), which is good for non-stationary problems:

  v(S_t) ← v(S_t) + α (G_t − v(S_t))
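Both update rules fit in one short sketch. This is a minimal sketch with hypothetical names of my own: `sample_episode()` stands in for whatever generates a list of (state, reward) pairs under the policy being evaluated; passing `alpha` switches from the 1/N incremental mean to the constant-step running mean.

```python
from collections import defaultdict

def mc_evaluate(sample_episode, gamma, num_episodes=1000, alpha=None):
    """Every-visit Monte-Carlo prediction from sampled episodes."""
    V = defaultdict(float)
    N = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()          # [(state, reward), ...]
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each t
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            step = alpha if alpha is not None else 1.0 / N[state]
            V[state] += step * (G - V[state])   # incremental / running mean
    return V
```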
Difference between DP and MC for policy evaluation
1 DP computes v_i by bootstrapping the rest of the expected return with the value estimate v_{i−1}
2 Iteration on the Bellman expectation backup:

  v_i(s) ← Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_{i−1}(s') )
Difference between DP and MC for policy evaluation
1 MC updates the empirical mean return with one sampled episode:

  v(S_t) ← v(S_t) + α (G_{i,t} − v(S_t))
Advantages of MC over DP
1 MC works when the environment is unknown
2 Working with sample episodes has a huge advantage even when one has complete knowledge of the environment's dynamics, for example when the transition probability is complex to compute
3 The cost of estimating a single state's value is independent of the total number of states, so you can sample episodes starting from the states of interest and then average the returns
Temporal-Difference (TD) Learning
1 TD methods learn directly from episodes of experience
2 TD is model-free: no knowledge of MDP transitions/rewards
3 TD learns from incomplete episodes, by bootstrapping
Temporal-Difference (TD) Learning
1 Objective: learn v_π online from experience under policy π
2 Simplest TD algorithm: TD(0)
  1 Update v(S_t) toward the estimated return R_{t+1} + γ v(S_{t+1}):

    v(S_t) ← v(S_t) + α (R_{t+1} + γ v(S_{t+1}) − v(S_t))

3 R_{t+1} + γ v(S_{t+1}) is called the TD target
4 δ_t = R_{t+1} + γ v(S_{t+1}) − v(S_t) is called the TD error
5 Comparison with incremental Monte-Carlo:
  1 Update v(S_t) toward the actual return G_t given episode i:

    v(S_t) ← v(S_t) + α (G_{i,t} − v(S_t))
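TD(0) can be sketched as an online loop. This is a minimal sketch with a hypothetical interface of my own: `env_step(s)` stands in for taking one transition under the evaluated policy and returns `(reward, next_state, done)`.

```python
from collections import defaultdict

def td0_evaluate(env_step, start_state, gamma, alpha=0.1, num_episodes=500):
    """TD(0) prediction: update toward the bootstrapped TD target each step."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = start_state
        done = False
        while not done:
            r, s_next, done = env_step(s)
            target = r + (0.0 if done else gamma * V[s_next])  # TD target
            V[s] += alpha * (target - V[s])                    # TD error update
            s = s_next
    return V
```

Note the update happens at every step, unlike MC, which must wait for the full return G_t at the end of the episode.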
Advantages of TD over MC
n-step TD
1 n-step TD methods generalize both one-step TD and MC.
2 We can shift from one to the other smoothly as needed to meet the demands of a particular task.
n-step TD prediction
1 Consider the following n-step returns for n = 1, 2, ..., ∞:

  n = 1 (TD):  G_t^{(1)} = R_{t+1} + γ v(S_{t+1})
  n = 2:       G_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 v(S_{t+2})
  ...
  n = ∞ (MC):  G_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step return is defined as

  G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})

3 n-step TD: v(S_t) ← v(S_t) + α (G_t^{(n)} − v(S_t))
Bootstrapping and Sampling for DP, MC, and TD
1 Bootstrapping: the update involves an estimate
  1 MC does not bootstrap
  2 DP bootstraps
  3 TD bootstraps
2 Sampling: the update samples an expectation
  1 MC samples
  2 DP does not sample
  3 TD samples
Unified View: Dynamic Programming Backup
v(S_t) ← E_π[R_{t+1} + γ v(S_{t+1})]
Unified View: Monte-Carlo Backup
v(S_t) ← v(S_t) + α (G_t − v(S_t))
Unified View: Temporal-Difference Backup
TD(0): v(S_t) ← v(S_t) + α (R_{t+1} + γ v(S_{t+1}) − v(S_t))
Unified View of Reinforcement Learning
Control: Optimize the policy for an MDP

1 Model-based control: optimize the value function with a known MDP
2 Model-free control: optimize the value function with an unknown MDP
3 Many model-free RL examples: Go, robot locomotion, patient treatment, helicopter control, Atari, StarCraft
Policy Iteration
1 Iterate between two steps:
  1 Evaluate the policy π (computing v_π given the current π)
  2 Improve the policy by acting greedily with respect to v_π:

    π' = greedy(v_π)   (14)
Policy Iteration for a Known MDP
1 Compute the state-action value of policy π_i:

  q_{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_{π_i}(s')

2 Compute a new policy π_{i+1} for all s ∈ S:

  π_{i+1}(s) = arg max_a q_{π_i}(s, a)   (15)

3 Problem: what can we do if neither R(s, a) nor P(s'|s, a) is known/available?
Generalized Policy Iteration
Monte-Carlo version of policy iteration:

1 Policy evaluation: Monte-Carlo policy evaluation, Q = q_π
2 Policy improvement: greedy policy improvement?

  π(s) = arg max_a q(s, a)
ε-Greedy Exploration
1 Trade-off between exploration and exploitation (we will talk about this in a later lecture)
2 ε-greedy exploration ensures continual exploration:
  1 All actions are tried with non-zero probability
  2 With probability 1 − ε choose the greedy action
  3 With probability ε choose an action at random

  π(a|s) = { ε/|A| + 1 − ε   if a = arg max_{a∈A} Q(s, a)
             ε/|A|           otherwise
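The distribution above amounts to a two-branch sampler. A minimal sketch (the helper name `epsilon_greedy` is mine; `Q` is assumed to be a dict keyed by (state, action) pairs):

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    """Sample an action: uniform with prob eps, greedy with prob 1 - eps.
    Each action gets eps/|A| from the uniform branch, and the greedy
    action gets an extra 1 - eps, matching the stated pi(a|s)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```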
Monte Carlo with ε-Greedy Exploration
Algorithm 2 Monte-Carlo control with ε-greedy exploration

1: Initialize Q(s, a) = 0, N(s, a) = 0 for all s, a; ε = 1, k = 1
2: π_k = ε-greedy(Q)
3: loop
4:   Sample the k-th episode (S_1, A_1, R_2, ..., S_T) ∼ π_k
5:   for each state S_t and action A_t in the episode do
6:     N(S_t, A_t) ← N(S_t, A_t) + 1
7:     Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t)) (G_t − Q(S_t, A_t))
8:   end for
9:   k ← k + 1, ε ← 1/k
10:  π_k = ε-greedy(Q)
11: end loop
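Algorithm 2 can be sketched as follows. This is a minimal sketch with hypothetical interfaces of my own: `sample_episode(policy)` generates one episode as (state, action, reward) triples, and `eps_schedule` defaults to the ε = 1/k schedule from the slide but can be replaced.

```python
import random
from collections import defaultdict

def mc_control(sample_episode, actions, gamma, num_episodes=2000,
               eps_schedule=lambda k: 1.0 / k):
    """Monte-Carlo control with eps-greedy exploration."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, num_episodes + 1):
        eps = eps_schedule(k)                  # decaying exploration
        def policy(s):
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])
        episode = sample_episode(policy)       # [(state, action, reward), ...]
        G = 0.0
        for s, a, r in reversed(episode):      # accumulate return backwards
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # 1/N incremental mean
    return Q
```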
TD for Prediction and Control
1 We can also use TD instead of MC in our control loop:
  1 Apply TD to Q(s, a)
  2 Use ε-greedy policy improvement
  3 Update at every time step rather than at the end of an episode
Recall: TD Prediction
1 An episode consists of an alternating sequence of states and state-action pairs
2 TD(0) update for estimating the value function V(s):

  A_t ← action given by π for S_t
  Take action A_t, observe R_{t+1} and S_{t+1}
  V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

3 How about estimating the action value function Q(s, a)?
Sarsa: On-Policy TD Control
1 An episode consists of an alternating sequence of states and state-action pairs
2 Act ε-greedily for one step, then bootstrap the action value function:

  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

3 The update is done after every transition from a nonterminal state S_t
4 The TD target is R_{t+1} + γ Q(S_{t+1}, A_{t+1}); the TD error is δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)
Sarsa algorithm
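Tabular Sarsa can be sketched as follows. This is a minimal sketch with a hypothetical interface of my own: `env_step(s, a)` stands in for one environment transition and returns `(reward, next_state, done)`.

```python
import random
from collections import defaultdict

def sarsa(env_step, start_state, actions, gamma, alpha=0.1, eps=0.1,
          num_episodes=500):
    """On-policy TD control: update Q toward R + gamma * Q(S', A')."""
    Q = defaultdict(float)
    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])
    for _ in range(num_episodes):
        s = start_state
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            a_next = None if done else eps_greedy(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Sarsa update
            s, a = s_next, a_next
    return Q
```

The name comes from the quintuple used in each update: (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}); both the behavior and the target use the same ε-greedy policy, which is what makes it on-policy.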
n-step Sarsa
1 Consider the following n-step Q-returns for n = 1, 2, ..., ∞:

  n = 1 (Sarsa):  q_t^{(1)} = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
  n = 2:          q_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 Q(S_{t+2}, A_{t+2})
  ...
  n = ∞ (MC):     q_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step Q-return is defined as

  q_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

3 n-step Sarsa updates Q(S_t, A_t) toward the n-step Q-return:

  Q(S_t, A_t) ← Q(S_t, A_t) + α (q_t^{(n)} − Q(S_t, A_t))
Tomorrow
1 Lecture on on-policy learning and off-policy learning