Transcript of Lecture 3: Markov Decision Processes (ierg6130/slides/lecture3.pdf)

Page 1:

Lecture 3: Markov Decision Processes

Bolei Zhou

The Chinese University of Hong Kong

[email protected]

January 14, 2020

Page 2:

This week’s Plan

1 Last Time
  1 Key elements of an RL agent: model, value, policy

2 This Week: Decision Making in MDP
  1 Markov decision processes (MDP)
  2 Policy evaluation in MDP
  3 Control in MDP: policy iteration and value iteration

Page 3:

Markov Decision Process (MDP)

1 A Markov Decision Process can model many real-world problems. It formally describes the framework of reinforcement learning.

2 Under an MDP, the environment is fully observable.
  1 Optimal control primarily deals with continuous MDPs
  2 Partially observable problems can be converted into MDPs

Page 4:

Define the model of the environment

Markov Processes

Markov Reward Processes (MRPs)

Markov Decision Processes (MDPs)

Page 5:

Markov Property

1 The history of states: h_t = {s_1, s_2, s_3, ..., s_t}

2 State s_t is Markov if and only if:

  p(s_{t+1} | s_t) = p(s_{t+1} | h_t)                  (1)
  p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)        (2)

3 "The future is independent of the past given the present"

Page 6:

Markov Process/Markov Chain

1 P is the dynamics/transition model that specifies p(s_{t+1} = s' | s_t = s)

2 State transition matrix (the entry in row i, column j is P(s_j | s_i), the probability of moving from s_i to s_j):

  P = [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1)
        P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2)
           ...         ...      ...     ...
        P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ]

3 Note that there are no rewards and no actions.
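
As an illustration, here is a minimal Python sketch of a Markov chain represented by a transition matrix, and of sampling an episode from it. The chain has 7 states to match the examples that follow, but the probabilities below are made up for illustration, not taken from the lecture's figure.

    import numpy as np

    # Hypothetical 7-state Markov chain (states s1..s7); each row of P sums to 1.
    P = np.array([
        [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9],
    ])

    def sample_episode(P, start, length, seed=0):
        """Sample a state trajectory of the given length from the chain."""
        rng = np.random.default_rng(seed)
        states = [start]
        for _ in range(length - 1):
            states.append(int(rng.choice(len(P), p=P[states[-1]])))
        return states

    print(sample_episode(P, start=2, length=5))   # an episode starting from s3 (index 2)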

Page 7:

Example of MP

1 Sample episodes starting from s3:
  1 s3, s4, s5, s6, s6
  2 s3, s2, s3, s2, s1
  3 s3, s4, s4, s5, s5

Page 8:

Markov Reward Process (MRP)

1 A Markov Reward Process is a Markov chain + rewards.

2 Definition of a Markov Reward Process (MRP):
  1 S is a (finite) set of states (s ∈ S)
  2 P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
  3 R is a reward function, R(s_t = s) = E[r_t | s_t = s]
  4 Discount factor γ ∈ [0, 1]

3 If there is a finite number of states, R can be represented as a vector.

Page 9:

Example of MRP

Reward: +5 in s1, +10 in s7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]

Page 10:

Return function and Value function

1 Definition of Horizon
  1 Maximum number of time steps in each episode
  2 Can be infinite; otherwise the process is called a finite Markov (reward) process

2 Definition of Return
  1 Discounted sum of rewards from time step t to the horizon:

    G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T−t−1} R_T

3 Definition of the state value function V_t(s) for an MRP
  1 Expected return starting from state s at time t:

    V_t(s) = E[G_t | s_t = s]
           = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T−t−1} R_T | s_t = s]

  2 Present value of future rewards

Page 11:

Why Discount Factor γ

1 Mathematically convenient to discount rewards

2 Avoids infinite returns in cyclic Markov processes

3 Uncertainty about the future may not be fully represented

4 If the reward is financial, immediate rewards may earn more interest than delayed rewards

5 Animal/human behaviour shows a preference for immediate reward

6 It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
  1 γ = 0: only care about the immediate reward
  2 γ = 1: future rewards are weighted equally with the immediate reward

Page 12:

Example of MRP

1 Reward: +5 in s1, +10 in s7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]

2 Sample returns for 4-step episodes with γ = 1/2:
  1 Return for s4, s5, s6, s7: 0 + (1/2)×0 + (1/4)×0 + (1/8)×10 = 1.25
  2 Return for s4, s3, s2, s1: 0 + (1/2)×0 + (1/4)×0 + (1/8)×5 = 0.625
  3 Return for s4, s5, s6, s6: 0

3 How to compute the value function?
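
A minimal sketch of the return computation above. The rewards R and γ = 1/2 come from the slide; following the worked example, the reward of the starting state is included at weight γ^0. The helper name is my own.

    GAMMA = 0.5
    R = [5, 0, 0, 0, 0, 0, 10]  # rewards of states s1..s7

    def episode_return(states, R, gamma):
        """Discounted sum of rewards along an episode: sum_k gamma^k * R(state_k)."""
        return sum(gamma ** k * R[s] for k, s in enumerate(states))

    # States are given as 0-based indices, so s4 -> 3, s7 -> 6, etc.
    print(episode_return([3, 4, 5, 6], R, GAMMA))  # s4,s5,s6,s7 -> 1.25
    print(episode_return([3, 2, 1, 0], R, GAMMA))  # s4,s3,s2,s1 -> 0.625
    print(episode_return([3, 4, 5, 5], R, GAMMA))  # s4,s5,s6,s6 -> 0.0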

Page 13:

Computing the Value of a Markov Reward Process

1 Value function: the expected return starting from state s

  V(s) = E[G_t | s_t = s] = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | s_t = s]

2 The MRP value function satisfies the following Bellman equation:

  V(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')

  where R(s) is the immediate reward and the second term is the discounted sum of future rewards.

3 Practice: derive the Bellman equation for V(s)

Page 14:

Matrix Form of Bellman Equation for MRP

Therefore, we can express the value function in matrix form:

  [ V(s_1) ]   [ R(s_1) ]       [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1) ] [ V(s_1) ]
  [ V(s_2) ] = [ R(s_2) ] + γ   [ P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2) ] [ V(s_2) ]
  [  ...   ]   [  ...   ]       [    ...         ...      ...     ...     ] [  ...   ]
  [ V(s_N) ]   [ R(s_N) ]       [ P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ] [ V(s_N) ]

  V = R + γPV

1 Analytic solution for the value of an MRP: V = (I − γP)^{-1} R
  1 The matrix inverse takes O(N^3) time for N states
  2 Only feasible for small MRPs (see the numpy sketch below)
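
A minimal numpy sketch of the analytic solution. The reward vector is the one from the example; the transition matrix here is a random row-stochastic matrix purely for illustration.

    import numpy as np

    gamma = 0.5
    R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)  # rewards for s1..s7 (from the slide)

    # Any row-stochastic transition matrix works here; this one is made up.
    rng = np.random.default_rng(0)
    P = rng.random((7, 7))
    P /= P.sum(axis=1, keepdims=True)  # normalise each row to sum to 1

    # V = (I - gamma P)^{-1} R. Solving the linear system avoids forming the
    # inverse explicitly, but the cost is still O(N^3) in the number of states.
    V = np.linalg.solve(np.eye(7) - gamma * P, R)
    print(V)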

Page 15:

Iterative Algorithm for Computing Value of a MRP

1 Iterative methods for large MRPs:
  1 Dynamic programming
  2 Monte-Carlo evaluation
  3 Temporal-Difference learning

Page 16:

Iterative Algorithm for Computing Value of a MRP

Algorithm 1 Monte Carlo simulation to calculate MRP value function

1: i ← 0, G_t ← 0
2: while i ≠ N do
3:     generate an episode starting from state s at time t
4:     using the generated episode, compute the return g = Σ_{k=t}^{H−1} γ^{k−t} r_k
5:     G_t ← G_t + g, i ← i + 1
6: end while
7: V_t(s) ← G_t / N
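
A runnable Python sketch of Algorithm 1, assuming the MRP is given as a transition matrix P and a state-reward vector R (the function and parameter names are my own):

    import numpy as np

    def mc_value(P, R, s, gamma, num_episodes=1000, horizon=20, seed=0):
        """Monte Carlo estimate of V(s): average the discounted return over sampled episodes."""
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(num_episodes):
            state, g, discount = s, 0.0, 1.0
            for _ in range(horizon):
                g += discount * R[state]          # accumulate discounted reward
                discount *= gamma
                state = rng.choice(len(P), p=P[state])   # step the chain
            total += g
        return total / num_episodes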

Page 17:

Iterative Algorithm for Computing Value of a MRP

Algorithm 2 Iterative algorithm to calculate MRP value function

1: for all states s ∈ S: V'(s) ← 0, V(s) ← ∞
2: while ||V − V'|| > ε do
3:     V ← V'
4:     for all states s ∈ S: V'(s) ← R(s) + γ Σ_{s'∈S} P(s'|s) V(s')
5: end while
6: return V'(s) for all s ∈ S
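
A corresponding Python sketch of Algorithm 2 (same assumed P and R representation; eps is the convergence threshold):

    import numpy as np

    def iterative_value(P, R, gamma, eps=1e-6):
        """Iterate the Bellman backup V <- R + gamma * P V until convergence."""
        V = np.zeros(len(R))
        while True:
            V_new = R + gamma * P @ V
            if np.max(np.abs(V_new - V)) < eps:
                return V_new
            V = V_new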

Page 18:

Markov Decision Process (MDP)

1 A Markov Decision Process is a Markov Reward Process with decisions.

2 Definition of MDP:
  1 S is a finite set of states
  2 A is a finite set of actions
  3 P^a is the dynamics/transition model for each action: P(s_{t+1} = s' | s_t = s, a_t = a)
  4 R is a reward function, R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
  5 Discount factor γ ∈ [0, 1]

3 An MDP is a tuple (S, A, P, R, γ)

Page 19:

Policy in MDP

1 Policy specifies what action to take in each state

2 Given a state, the policy specifies a distribution over actions

3 Policy: π(a|s) = P(a_t = a | s_t = s)

4 Policies are stationary (time-independent): a_t ∼ π(a|s) for any t > 0

Page 20:

Policy in MDP

1 Given an MDP (S, A, P, R, γ) and a policy π

2 The state sequence s_1, s_2, ... is a Markov process (S, P^π)

3 The state and reward sequence s_1, R_2, s_2, R_3, ... is a Markov reward process (S, P^π, R^π, γ), where

  P^π(s'|s) = Σ_{a∈A} π(a|s) P(s'|s, a)

  R^π(s)    = Σ_{a∈A} π(a|s) R(s, a)
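
A small sketch of this policy-induced MRP in Python. The array layout is my own convention, not the lecture's: P[a, s, s'] = P(s'|s, a), R[s, a] = R(s, a), pi[s, a] = π(a|s).

    import numpy as np

    def mdp_to_mrp(P, R, pi):
        """Collapse an MDP plus a policy pi into the induced MRP quantities (P_pi, R_pi)."""
        # P_pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
        P_pi = np.einsum('sa,ast->st', pi, P)
        # R_pi(s) = sum_a pi(a|s) R(s, a)
        R_pi = (pi * R).sum(axis=1)
        return P_pi, R_pi

The value of the policy can then be computed with the MRP tools above, e.g. V = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi).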

Page 21:

Comparison of MP/MRP and MDP

Page 22:

Value function for MDP

1 The state-value function v^π(s) of an MDP is the expected return starting from state s and following policy π:

  v^π(s) = E_π[G_t | s_t = s]                    (3)

2 The action-value function q^π(s, a) is the expected return starting from state s, taking action a, and then following policy π:

  q^π(s, a) = E_π[G_t | s_t = s, a_t = a]        (4)

3 The relation between v^π(s) and q^π(s, a):

  v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)              (5)

Page 23:

Bellman Expectation Equation

1 The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:

  v^π(s) = E_π[R_{t+1} + γ v^π(s_{t+1}) | s_t = s]                        (6)

2 The action-value function can similarly be decomposed:

  q^π(s, a) = E_π[R_{t+1} + γ q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]   (7)

Page 24:

Bellman Expectation Equation for v^π and q^π

  v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)                                              (8)

  q^π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v^π(s')                            (9)

Thus

  v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v^π(s') )            (10)

  q^π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q^π(s', a')      (11)
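
A short sketch of equations (8) and (9), assuming the same array layout as in the earlier mdp_to_mrp sketch (my convention, not the lecture's):

    import numpy as np

    def q_from_v(P, R, v, gamma):
        """Equation (9): q_pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v_pi(s')."""
        # P[a, s, s']: for each (s, a), dot the next-state distribution with v.
        return R + gamma * np.einsum('ast,t->sa', P, v)

    def v_from_q(pi, q):
        """Equation (8): v_pi(s) = sum_a pi(a|s) q_pi(s,a)."""
        return (pi * q).sum(axis=1)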

Page 25:

Backup Diagram for v^π

  v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v^π(s') )            (12)

Page 26:

Backup Diagram for q^π

  q^π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q^π(s', a')      (13)

Page 27:

Policy Evaluation

1 Evaluate the value of each state under a given policy: compute v^π(s)

2 Also called prediction

Page 28:

Example: Mars Rover

Page 29:

Example: Policy Evaluation

1 Two actions: Left and Right

2 For all actions, reward: +5 in s1, +10 in s7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]

3 Let's take the deterministic policy π(s) = Left and γ = 0 for any state s; then what is the value of the policy?

4 V^π = [5, 0, 0, 0, 0, 0, 10]

5 Iterative update: v^π_k(s) = r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) v^π_{k−1}(s')

Page 30:

Example: Policy Evaluation

1 R = [5, 0, 0, 0, 0, 0, 10]

2 Practice 1: With the deterministic policy π(s) = Left and γ = 0.5 for any state s, what are the state values under the policy?

3 Practice 2: With the stochastic policy P(π(s) = Left) = 0.5, P(π(s) = Right) = 0.5 and γ = 0.5 for any state s, what are the state values under the policy?

4 Follow the iteration: v^π_k(s) = r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) v^π_{k−1}(s') (see the numerical sketch below)
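
The transition dynamics live in the slide's figure, so the sketch below assumes a 7-state chain where Left/Right move deterministically one state and the end states stay put; under that assumption it answers Practice 1, and Practice 2 only changes the mixing of the two action matrices.

    import numpy as np

    n, gamma = 7, 0.5
    R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)

    # Assumed dynamics (the real ones are in the lecture's figure): "Left" moves
    # deterministically one state down the chain, and s1 self-loops.
    P_left = np.zeros((n, n))
    for s in range(n):
        P_left[s, max(s - 1, 0)] = 1.0

    # Practice 1: deterministic policy "always Left" induces the MRP (P_left, R).
    V = np.linalg.solve(np.eye(n) - gamma * P_left, R)
    print(np.round(V, 3))

For Practice 2, average P_left with the analogous P_right with weight 0.5 each before solving.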

Page 31:

Decision Making in Markov Decision Process

1 Prediction (evaluate a given policy):
  1 Input: MDP <S, A, P, R, γ> and policy π, or MRP <S, P^π, R^π, γ>
  2 Output: value function v^π

2 Control (search for the optimal policy):
  1 Input: MDP <S, A, P, R, γ>
  2 Output: optimal value function v* and optimal policy π*

3 Prediction and control can be solved by dynamic programming.

Page 32:

Dynamic Programming

Dynamic programming is a very general solution method for problems which have two properties:

1 Optimal substructure
  1 The principle of optimality applies
  2 The optimal solution can be decomposed into subproblems

2 Overlapping subproblems
  1 Subproblems recur many times
  2 Solutions can be cached and reused

Markov decision processes satisfy both properties

1 Bellman equation gives recursive decomposition

2 Value function stores and reuses solutions

Page 33:

Policy evaluation on MDP

1 Problem: evaluate a given policy π for an MDP

2 Output: the value function under the policy, v^π

3 Solution: iterate on the Bellman expectation backup

4 Synchronous backup algorithm:
  1 At each iteration k+1, update v_{k+1}(s) from v_k(s') for all states s ∈ S, where s' is a successor state of s:

    v_{k+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') )      (14)

5 Convergence: v_1 → v_2 → ... → v^π (see the sketch below)
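
A minimal synchronous policy-evaluation sketch, assuming the array layout used in the earlier MDP snippets (P[a, s, s'], R[s, a], pi[s, a]; that layout is my choice, not the lecture's):

    import numpy as np

    def policy_evaluation(P, R, pi, gamma, eps=1e-8):
        """Iterate v_{k+1}(s) = sum_a pi(a|s) (R(s,a) + gamma sum_s' P(s'|s,a) v_k(s'))."""
        v = np.zeros(R.shape[0])
        while True:
            q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_k(s,a) under v_k
            v_new = (pi * q).sum(axis=1)                    # expectation over pi(a|s)
            if np.max(np.abs(v_new - v)) < eps:
                return v_new
            v = v_new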

Page 34:

Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy

  v_{k+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') )        (15)

Or, in the form of the induced MRP <S, P^π, R^π, γ>:

  v_{k+1}(s) = R^π(s) + γ Σ_{s'∈S} P^π(s'|s) v_k(s')                             (16)

Page 35:

Evaluating a Random Policy in the Small Gridworld

Example 4.1 in the Sutton RL textbook.

1 Undiscounted episodic MDP (γ = 1)

2 Nonterminal states 1, ..., 14

3 Two terminal states (two shaded squares)

4 Actions that would lead out of the grid leave the state unchanged, e.g. P(7|7, right) = 1

5 Reward is −1 on every transition until a terminal state is reached

6 Transitions are deterministic given the action, e.g. P(6|5, right) = 1

7 Uniform random policy: π(l|·) = π(r|·) = π(u|·) = π(d|·) = 0.25
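
A sketch of this evaluation for the 4×4 gridworld of Example 4.1 (the grid layout and indexing follow Sutton's example; the code itself is mine):

    import numpy as np

    # 4x4 grid; cells 0 and 15 are terminal, states 1..14 are nonterminal.
    N, GAMMA = 4, 1.0
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def step(s, a):
        """Deterministic transition; moves that leave the grid keep the state unchanged."""
        if s in (0, 15):                 # terminal states absorb with reward 0
            return s, 0.0
        r, c = divmod(s, N)
        r2 = min(max(r + a[0], 0), N - 1)
        c2 = min(max(c + a[1], 0), N - 1)
        return r2 * N + c2, -1.0         # reward is -1 on every transition

    v = np.zeros(N * N)
    for _ in range(1000):                # synchronous Bellman expectation backups
        v_new = np.zeros_like(v)
        for s in range(N * N):
            for a in ACTIONS:            # uniform random policy, pi(a|s) = 0.25
                s2, r = step(s, a)
                v_new[s] += 0.25 * (r + GAMMA * v[s2])
        v = v_new

    print(v.reshape(N, N).round(1))      # should converge to the v_pi values in the textbook figure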

Page 36:

Evaluating a Random Policy in the Small Gridworld

1 Iteratively evaluate the random policy

Page 37:

Practice: Gridworld

Textbook Example 3.5: Gridworld

Page 38:

Optimal Value Function

1 The optimal state-value function v*(s) is the maximum value function over all policies:

  v*(s) = max_π v^π(s)

2 The optimal policy:

  π*(s) = arg max_π v^π(s)

3 An MDP is "solved" when we know the optimal value function

4 There exists a unique optimal value function, but there could be multiple optimal policies (e.g. when two actions have the same optimal value)

Page 39:

Finding Optimal Policy

1 An optimal policy can be found by maximizing over q*(s, a):

  π*(a|s) = 1, if a = arg max_{a∈A} q*(s, a)
            0, otherwise

2 There is always a deterministic optimal policy for any MDP

3 If we know q∗(s, a), we immediately have the optimal policy
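
A small sketch of extracting this greedy policy from a known q* table (assumed to be an array of shape (|S|, |A|)):

    import numpy as np

    def greedy_policy(q_star):
        """pi*(a|s) = 1 for the argmax action, 0 otherwise (ties broken by the first index)."""
        pi = np.zeros_like(q_star)
        pi[np.arange(q_star.shape[0]), q_star.argmax(axis=1)] = 1.0
        return pi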
