Markov Decision Process II
Transcript of Markov Decision Process II
![Page 1: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/1.jpg)
Announcements
Assignments:
P3: Optimization; Due today, 10 pm
HW7 (online) 10/22 Tue, 10 pm
Recitations canceled on: October 18 (Mid-Semester Break) and October 25 (Day for Community Engagement)
We will provide a recitation worksheet (reference for midterm/final)
Piazza post for In-class Questions
![Page 2: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/2.jpg)
AI: Representation and Problem Solving
Markov Decision Processes II
Instructors: Fei Fang & Pat Virtue
Slide credits: CMU AI and http://ai.berkeley.edu
![Page 3: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/3.jpg)
Learning Objectives
• Write Bellman equations for the state-value and Q-value under the optimal policy and under a given policy
• Describe and implement the value iteration algorithm (via the Bellman update) for solving MDPs
• Describe and implement the policy iteration algorithm (via policy evaluation and policy improvement) for solving MDPs
• Understand convergence of value iteration and policy iteration
• Understand the concepts of exploration, exploitation, and regret
![Page 4: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/4.jpg)
MDP Notation
Standard expectimax: V(s) = max_a Σ_{s'} P(s'|s,a) V(s')
Bellman equations: V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
Value iteration: V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s
Q-iteration: Q_{k+1}(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q_k(s',a')], ∀s,a
Policy extraction: π_V(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], ∀s
Policy evaluation: V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')], ∀s
Policy improvement: π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_old}(s')], ∀s
![Page 5: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/5.jpg)
Example: Grid World
Goal: maximize sum of (discounted) rewards
![Page 6: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/6.jpg)
MDP Quantities
Markov decision processes:
- States S
- Actions A
- Transitions P(s'|s,a) (or T(s,a,s'))
- Rewards R(s,a,s') (and discount γ)
- Start state s0

MDP quantities:
- Policy = map of states to actions
- Utility = sum of (discounted) rewards
- (State) Value = expected utility starting from a state (max node)
- Q-Value = expected utility starting from a state-action pair, i.e., q-state (chance node)
![Page 7: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/7.jpg)
MDP Optimal Quantities
The optimal policy: π*(s) = optimal action from state s
The (true) value (or utility) of a state s: V*(s) = expected utility starting in s and acting optimally
The (true) value (or utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
[Demo: gridworld values (L9D1)]
Solve MDP: Find π*, V* and/or Q*
V*(s) < +∞ if γ < 1 and R(s,a,s') < ∞
![Page 8: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/8.jpg)
Piazza Poll 1
Which ones are true about the optimal policy π*(s), true values V*(s), and true Q-values Q*(s,a)?
A: π*(s) = argmax_a V*(s') where s' = argmax_{s''} V*(s,a,s'')
B: π*(s) = argmax_a Q*(s,a)
C: π*(s) = max_a Q*(s,a)
![Page 9: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/9.jpg)
Piazza Poll 1
π*(s) = argmax_a Σ_{s'} P(s'|s,a) (R(s,a,s') + γ V*(s')) ≠ argmax_a V*(s')
π*(s) = argmax_a Q*(s,a)
V*(s,a,s''): represents V*(s'') where s'' is reachable through (s,a). Not a standard notation.
![Page 10: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/10.jpg)
Computing Optimal Policy from Values
![Page 11: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/11.jpg)
Computing Optimal Policy from Values
Let's imagine we have the optimal values V*(s)
How should we act?
We need to do a mini-expectimax (one step)
Sometimes this is called policy extraction, since it gets the policy implied by the values
![Page 12: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/12.jpg)
Computing Optimal Policy from Q-Values
Let's imagine we have the optimal q-values:
How should we act?
Completely trivial to decide!
Important lesson: actions are easier to select from q-values than values!
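The point above can be sketched in a few lines: given a q-table, acting greedily is a per-state argmax, with no model (transitions or rewards) needed. The states, actions, and q-values below are made up for illustration.

```python
# Hypothetical q-table: (state, action) -> q-value (numbers invented).
Q = {
    ('s1', 'left'): 0.2, ('s1', 'right'): 0.8,
    ('s2', 'left'): 0.5, ('s2', 'right'): 0.1,
}

def policy_from_q(Q, state, actions):
    # Greedy action selection: pick the action with the highest q-value.
    return max(actions, key=lambda a: Q[(state, a)])

actions = ['left', 'right']
```

Extracting a policy from V*(s) instead would require a one-step lookahead through the transition model, which is why q-values are the more convenient representation.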
![Page 13: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/13.jpg)
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
![Page 14: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/14.jpg)
Definition of βoptimal utilityβ leads to Bellman Equations, which characterize the relationship amongst optimal utility values
Necessary and sufficient conditions for optimality
Solution is unique
The Bellman Equations
Expectimax-like computation
![Page 15: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/15.jpg)
Expectimax-like computation with one-step lookahead and a "perfect" heuristic at leaf nodes
![Page 16: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/16.jpg)
Solving Expectimax
![Page 17: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/17.jpg)
Solving MDP
Limited Lookahead
![Page 18: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/18.jpg)
Solving MDP
Limited Lookahead
![Page 19: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/19.jpg)
Value Iteration
![Page 20: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/20.jpg)
Demo Value Iteration
[Demo: value iteration (L8D6)]
![Page 21: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/21.jpg)
Value Iteration
Start with V0(s) = 0: no time steps left means an expected reward sum of zero
Given the vector of V_k(s) values, apply the Bellman update once (do one ply of expectimax with R and γ from each state):
V_{k+1}(s) ← max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')]
Repeat until convergence
Will this process converge?
Yes!
![Page 22: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/22.jpg)
Piazza Poll 2
What is the complexity of each iteration in Value Iteration?
S -- set of states; A -- set of actions
I: O(|S||A|)
II: O(|S|²|A|)
III: O(|S||A|²)
IV: O(|S|²|A|²)
V: O(|S|²)
![Page 23: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/23.jpg)
Piazza Poll 2
What is the complexity of each iteration in Value Iteration?
S -- set of states; A -- set of actions
I: O(|S||A|)
II: O(|S|²|A|)
III: O(|S||A|²)
IV: O(|S|²|A|²)
V: O(|S|²)
Answer: II, O(|S|²|A|): for each of the |S| states we try each of the |A| actions, and each action sums over all |S| successor states.
![Page 24: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/24.jpg)
Value Iteration
function VALUE-ITERATION(MDP=(S,A,T,R,γ), threshold) returns a state value function
  for s in S: V_0(s) ← 0
  k ← 0
  repeat
    δ ← 0
    for s in S:
      V_{k+1}(s) ← −∞
      for a in A:
        v ← 0
        for s' in S:
          v ← v + T(s,a,s') (R(s,a,s') + γ V_k(s'))
        V_{k+1}(s) ← max{V_{k+1}(s), v}
      δ ← max{δ, |V_{k+1}(s) − V_k(s)|}
    k ← k + 1
  until δ < threshold
  return V_{k−1}

Do we really need to store the value of V_k for each k?
Does V_{k+1}(s) ≥ V_k(s) always hold?
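The pseudocode above can be sketched as runnable Python. The two-state MDP at the bottom (states 0/1, actions 'stay'/'go', and the rewards) is invented purely for illustration; only the loop structure follows the slide.

```python
# Minimal value iteration sketch: repeat the Bellman update until the
# largest per-state change falls below a threshold.
def value_iteration(S, A, T, R, gamma, threshold=1e-6):
    V = {s: 0.0 for s in S}          # V_0(s) = 0 for all states
    while True:
        delta = 0.0
        V_new = {}
        for s in S:
            # max over actions of the one-step lookahead value
            V_new[s] = max(
                sum(T(s, a, sp) * (R(s, a, sp) + gamma * V[sp]) for sp in S)
                for a in A
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new                     # only two V vectors ever stored
        if delta < threshold:
            return V

# Toy deterministic 2-state MDP (made up): 'stay' keeps the current state
# with reward 0; 'go' moves to state 1, paying 1 only when leaving state 0.
S = [0, 1]
A = ['stay', 'go']
def T(s, a, sp):
    if a == 'stay':
        return 1.0 if sp == s else 0.0
    return 1.0 if sp == 1 else 0.0
def R(s, a, sp):
    return 1.0 if (s == 0 and a == 'go') else 0.0

V = value_iteration(S, A, T, R, gamma=0.5)
```

Note the code keeps only the previous and current value vectors rather than all V_k, anticipating the first question above.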
![Page 25: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/25.jpg)
Value Iteration
Do we really need to store the value of V_k for each k?
No. Just keep two: V_last and V_current.
Does V_{k+1}(s) ≥ V_k(s) always hold?
No. If T(s,a,s') = 1 and R(s,a,s') < 0, then V_1(s) = R(s,a,s') < 0 = V_0(s)
![Page 26: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/26.jpg)
Bellman Equation vs Value Iteration vs Bellman Update
Bellman equations characterize the optimal values:
Value iteration computes them by applying Bellman update repeatedly
Value iteration is a method for solving the Bellman equation
The V_k vectors are also interpretable as time-limited values
Value iteration finds the fixed point of the function
V(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')]
![Page 27: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/27.jpg)
Value Iteration Convergence
How do we know the V_k vectors are going to converge?
Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
Case 2: If γ < 1 and |R(s,a,s')| ≤ R_max < ∞
Intuition: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results (with R and γ) in nearly identical search trees
The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
|R(s,a,s')| ≤ R_max
|V_1(s) − V_0(s)| = |V_1(s) − 0| ≤ R_max
|V_{k+1}(s) − V_k(s)| ≤ γ^k R_max
So V_{k+1}(s) − V_k(s) → 0 as k → ∞
If we initialized V_0(s) differently, what would happen?
![Page 28: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/28.jpg)
Value Iteration Convergence
Each value at the last layer of the V_{k+1} tree is at most R_max in magnitude
But everything is discounted by γ^k that far out
So V_k and V_{k+1} are at most γ^k R_max apart
So as k increases, the values converge
If we initialized V_0(s) differently, what would happen?
Still converges to V*(s) as long as V_0(s) < +∞, but it may be slower
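The claim above can be checked numerically: on a made-up two-state MDP (all states, actions, and rewards below are invented for illustration), Bellman updates started from V_0 = 0 and from a deliberately bad V_0 = 100 land on the same fixed point when γ < 1.

```python
# One Bellman update over all states: V_{k+1}(s) = max_a sum_s' T*(R + gamma*V_k).
def bellman_update(V, S, A, T, R, gamma):
    return {s: max(sum(T(s, a, sp) * (R(s, a, sp) + gamma * V[sp]) for sp in S)
                   for a in A)
            for s in S}

# Toy 2-state MDP: 'stay' keeps the state (reward 0); 'go' moves to state 1,
# paying 1 only from state 0.
S = [0, 1]
A = ['stay', 'go']
T = lambda s, a, sp: 1.0 if (sp == (s if a == 'stay' else 1)) else 0.0
R = lambda s, a, sp: 1.0 if (s == 0 and a == 'go') else 0.0

Va = {s: 0.0 for s in S}      # standard initialization
Vb = {s: 100.0 for s in S}    # deliberately bad initialization
for _ in range(100):
    Va = bellman_update(Va, S, A, T, R, 0.9)
    Vb = bellman_update(Vb, S, A, T, R, 0.9)
```

The gap between the two runs shrinks by a factor of γ = 0.9 each iteration, so after 100 updates the 100-unit initialization error has decayed to roughly 100·0.9^100, which is negligible.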
![Page 29: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/29.jpg)
Other ways to solve Bellman Equation?
Treat V*(s) as variables
Solve Bellman Equation through Linear Programming
![Page 30: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/30.jpg)
min_{V*} Σ_s V*(s)
s.t. V*(s) ≥ Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')], ∀s,a
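The LP above can be sketched with `scipy.optimize.linprog` (assumed available); the 2-state, 2-action MDP numbers below are made up for illustration. Each (s,a) pair contributes one constraint, rearranged into the `A_ub @ x <= b_ub` form linprog expects.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# Toy model (invented): P[a][s][s'] transition probs, R[a][s] expected reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0 ('stay'): identity
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1 ('go'): always to state 1
R = np.array([[0.0, 0.0],                  # 'stay' pays nothing
              [1.0, 0.0]])                 # 'go' pays 1 only from state 0

nS = 2
c = np.ones(nS)                  # objective: minimize sum_s V(s)
A_ub, b_ub = [], []
for a in range(2):
    for s in range(nS):
        # V(s) - gamma * sum_s' P V(s') >= R  becomes  -(...) <= -R
        A_ub.append(-(np.eye(nS)[s] - gamma * P[a][s]))
        b_ub.append(-R[a][s])
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nS)   # values may be unbounded in sign
V_star = res.x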
![Page 31: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/31.jpg)
Policy Iteration for Solving MDPs
![Page 32: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/32.jpg)
Policy Evaluation
![Page 33: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/33.jpg)
Fixed Policies
Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), then the tree would be simpler: only one action per state
… though the tree's value would depend on which policy we fixed
Left tree: do the optimal action. Right tree: do what π says to do.
![Page 34: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/34.jpg)
Utilities for a Fixed Policy
Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
Define the utility of a state s under a fixed policy π:
V^π(s) = expected total discounted rewards starting in s and following π
Recursive relation (one-step look-ahead / Bellman equation):
V^π(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π(s')]
![Page 35: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/35.jpg)
Compare
![Page 36: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/36.jpg)
MDP Quantities
A policy π: map of states to actions
The optimal policy π*: π*(s) = optimal action from state s
Value function of a policy V^π(s): expected utility starting in s and acting according to π
Optimal value function V*: V*(s) = V^{π*}(s)
Q function of a policy Q^π(s,a): expected utility starting out having taken action a from state s and (thereafter) acting according to π
Optimal Q function Q*: Q*(s,a) = Q^{π*}(s,a)
s is a state
(s, a) is a state-action pair
(s,a,s') is a transition
Solve MDP: Find π*, V* and/or Q*
![Page 37: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/37.jpg)
Example: Policy Evaluation
Always Go Right Always Go Forward
![Page 38: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/38.jpg)
![Page 39: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/39.jpg)
Policy Evaluation
How do we calculate the V's for a fixed policy π?
Idea 1: Turn the recursive Bellman equation into an update (like value iteration):
V^π_{k+1}(s) ← Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')]
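Idea 1 can be sketched as code: the same toy two-state MDP used throughout (states, actions, and rewards invented for illustration), with a fixed policy and no max over actions.

```python
# Iterative policy evaluation: repeatedly apply the fixed-policy Bellman
# update; the action in each state is dictated by pi, so no max is taken.
def policy_evaluation(S, pi, T, R, gamma, iters=200):
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: sum(T(s, pi[s], sp) * (R(s, pi[s], sp) + gamma * V[sp])
                    for sp in S)
             for s in S}
    return V

# Toy 2-state MDP (made up): 'stay' keeps the state; 'go' moves to state 1,
# paying 1 only from state 0. The fixed policy always chooses 'go'.
S = [0, 1]
pi = {0: 'go', 1: 'go'}
T = lambda s, a, sp: 1.0 if (sp == (s if a == 'stay' else 1)) else 0.0
R = lambda s, a, sp: 1.0 if (s == 0 and a == 'go') else 0.0
V_pi = policy_evaluation(S, pi, T, R, gamma=0.5)
```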
![Page 40: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/40.jpg)
Piazza Poll 3
What is the complexity of each iteration in Policy Evaluation?
S -- set of states; A -- set of actions
I: O(|S||A|)
II: O(|S|²|A|)
III: O(|S||A|²)
IV: O(|S|²|A|²)
V: O(|S|²)
![Page 41: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/41.jpg)
Piazza Poll 3
What is the complexity of each iteration in Policy Evaluation?
S -- set of states; A -- set of actions
I: O(|S||A|)
II: O(|S|²|A|)
III: O(|S||A|²)
IV: O(|S|²|A|²)
V: O(|S|²)
Answer: V, O(|S|²): the action in each state is fixed by π, so there is no max over actions; each state just sums over successor states.
![Page 42: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/42.jpg)
Policy Evaluation
Idea 2: The Bellman equation w.r.t. a given policy π defines a linear system; solve it with your favorite linear-system solver
Treat V^π(s) as variables
How many variables?
How many constraints?
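Idea 2 as a sketch: with the policy fixed, there are |S| variables and |S| linear equations, so V^π comes out of a single |S|×|S| solve. The transition matrix and reward vector below are made-up toy numbers for one fixed policy.

```python
import numpy as np

gamma = 0.5
# Under the fixed policy (invented toy numbers): P_pi[s][s'] is the
# transition matrix and r_pi[s] the expected one-step reward.
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

# The fixed-policy Bellman equation V = r_pi + gamma * P_pi V rearranges to
# (I - gamma * P_pi) V = r_pi: one linear equation per state.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

Unlike Idea 1, this gives V^π exactly (up to floating point) in one step, at O(|S|³) cost for the solve.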
![Page 43: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/43.jpg)
Policy Iteration
![Page 44: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/44.jpg)
Problems with Value Iteration
Value iteration repeats the Bellman updates:
Problem 1: It's slow: O(|S|²|A|) per iteration
Problem 2: The "max" at each state rarely changes
Problem 3: The policy often converges long before the values
[Demo: value iteration (L9D2)]
![Page 45: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/45.jpg)
k=0
Noise = 0.2, Discount = 0.9, Living reward = 0
![Page 46: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/46.jpg)
k=1
![Page 47: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/47.jpg)
k=2
![Page 48: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/48.jpg)
k=3
![Page 49: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/49.jpg)
k=4
![Page 50: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/50.jpg)
k=5
![Page 51: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/51.jpg)
k=6
![Page 52: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/52.jpg)
k=7
![Page 53: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/53.jpg)
k=8
![Page 54: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/54.jpg)
k=9
![Page 55: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/55.jpg)
k=10
![Page 56: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/56.jpg)
k=11
![Page 57: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/57.jpg)
k=12
![Page 58: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/58.jpg)
k=100
![Page 59: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/59.jpg)
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (may not be optimal!) until convergence
Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
Repeat steps until the policy converges
This is policy iteration
It's still optimal!
Can converge (much) faster under some conditions
![Page 60: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/60.jpg)
Policy Iteration
Policy Evaluation: For the fixed current policy π_i, find values w.r.t. that policy. Iterate until values converge:
V^{π_i}_{k+1}(s) ← Σ_{s'} P(s'|s,π_i(s)) [R(s,π_i(s),s') + γ V^{π_i}_k(s')]
Policy Improvement: For fixed values, get a better policy with one-step look-ahead:
π_{i+1}(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_i}(s')]
Similar to how you derive the optimal policy π* given the optimal value V*
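The two steps above can be sketched as one loop. The two-state MDP at the bottom (states, actions, rewards) is invented for illustration; policy evaluation is done iteratively here, though a linear solve would also work.

```python
# Policy iteration sketch: evaluate the current policy, improve it greedily,
# and stop when the policy no longer changes.
def policy_iteration(S, A, T, R, gamma):
    pi = {s: A[0] for s in S}                 # arbitrary initial policy
    while True:
        # Policy evaluation (iterative; fixed action per state, no max)
        V = {s: 0.0 for s in S}
        for _ in range(200):
            V = {s: sum(T(s, pi[s], sp) * (R(s, pi[s], sp) + gamma * V[sp])
                        for sp in S) for s in S}
        # Policy improvement: greedy one-step lookahead on the evaluated V
        new_pi = {s: max(A, key=lambda a: sum(T(s, a, sp) *
                                              (R(s, a, sp) + gamma * V[sp])
                                              for sp in S))
                  for s in S}
        if new_pi == pi:                      # policy stable: done, optimal
            return pi, V
        pi = new_pi

# Toy 2-state MDP (made up): 'stay' keeps the state; 'go' moves to state 1,
# paying 1 only from state 0.
S = [0, 1]
A = ['stay', 'go']
T = lambda s, a, sp: 1.0 if (sp == (s if a == 'stay' else 1)) else 0.0
R = lambda s, a, sp: 1.0 if (s == 0 and a == 'go') else 0.0
pi_star, V_star = policy_iteration(S, A, T, R, gamma=0.5)
```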
![Page 61: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/61.jpg)
Piazza Poll 4
True/False: V^{π_{i+1}}(s) ≥ V^{π_i}(s), ∀s
![Page 62: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/62.jpg)
Piazza Poll 4
True/False: V^{π_{i+1}}(s) ≥ V^{π_i}(s), ∀s
True.
V^{π_i}(s) = Σ_{s'} T(s,π_i(s),s') [R(s,π_i(s),s') + γ V^{π_i}(s')]
If I take the first step according to π_{i+1} and then follow π_i, I get an expected utility of
V_1(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V^{π_i}(s')]
which is ≥ V^{π_i}(s). What if I take two steps according to π_{i+1}?
![Page 63: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/63.jpg)
Comparison
Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don't track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
The new policy will be better (or we're done)
(Both are dynamic programs for solving MDPs)
![Page 64: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/64.jpg)
Summary: MDP Algorithms
So you want to…
Turn values into a policy: use one-step lookahead
Compute optimal values: use value iteration or policy iteration
Compute values for a particular policy: use policy evaluation
These all look the same!
They basically are: they are all variations of Bellman updates
They all use one-step lookahead expectimax fragments
They differ only in whether we plug in a fixed policy or max over actions
![Page 65: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/65.jpg)
MDP Notation
Standard expectimax: V(s) = max_a Σ_{s'} P(s'|s,a) V(s')
Bellman equations: V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
Value iteration: V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s
Q-iteration: Q_{k+1}(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q_k(s',a')], ∀s,a
Policy extraction: π_V(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], ∀s
Policy evaluation: V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')], ∀s
Policy improvement: π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_old}(s')], ∀s
![Page 69: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/69.jpg)
Double Bandits
![Page 70: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/70.jpg)
Double-Bandit MDP
Actions: Blue, Red
States: Win, Lose
States W (Win) and L (Lose). Blue: from either state, pays $1 with probability 1.0. Red: from either state, pays $2 with probability 0.75 and $0 with probability 0.25.
No discount
100 time steps
Both states have the same value
Actually a simple MDP where the current state does not impact transition or reward: P(s'|s,a) = P(s'|a) and R(s,a,s') = R(a,s')
![Page 71: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/71.jpg)
Offline Planning
Solving MDPs is offline planning
You determine all quantities through computation
You need to know the details of the MDP
You need to know the details of the MDP
You do not actually play the game!
Value: Play Red = 150, Play Blue = 100
No discount
100 time steps
Both states have the same value
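The 150 and 100 above are plain expected-value arithmetic over 100 undiscounted steps, using the payoff probabilities from the bandit diagram:

```python
# Expected value of each arm per step, times the horizon (no discount).
steps = 100
blue_per_step = 1.0                    # blue always pays $1
red_per_step = 0.75 * 2 + 0.25 * 0    # red pays $2 w.p. 0.75, else $0

value_red = steps * red_per_step
value_blue = steps * blue_per_step
```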
![Page 72: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/72.jpg)
Let's Play!
$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
![Page 73: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/73.jpg)
Online Planning
Rules changed! Red's win chance is different.
Same two states W and L; Blue still pays $1 with probability 1.0, but Red's payoff probabilities for $2 and $0 are now unknown (?? in place of 0.75/0.25).
![Page 74: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/74.jpg)
Let's Play!
$0 $0 $0 $2 $0
$2 $0 $0 $0 $0
![Page 75: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/75.jpg)
What Just Happened?
That wasn't planning, it was learning!
Specifically, reinforcement learning
There was an MDP, but you couldn't solve it with just computation
You needed to actually act to figure it out
Important ideas in reinforcement learning that came up
Exploration: you have to try unknown actions to get information
Exploitation: eventually, you have to use what you know
Regret: even if you learn intelligently, you make mistakes
Sampling: because of chance, you have to try things repeatedly
Difficulty: learning can be much harder than solving a known MDP
![Page 76: Markov Decision Process II](https://reader033.fdocuments.net/reader033/viewer/2022042902/6269a02ce8e0ee3c4e289838/html5/thumbnails/76.jpg)
Next Time: Reinforcement Learning!