![Page 1: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/1.jpg)
Utility Theory & MDPs
Tamara Berg
CS 590-133 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer
![Page 2: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/2.jpg)
Announcements
• An edited version of HW2 was released on the class webpage today
– Due date is extended to Feb 25 (but make sure to start before the exam!)
• As always, you can work in pairs and submit 1 written/coding solution (pairs don’t need to be the same as HW1)
![Page 3: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/3.jpg)
Review from last class
![Page 4: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/4.jpg)
![Page 5: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/5.jpg)
A more abstract game tree
[Figure: a two-ply game tree with terminal utilities (for MAX); backed-up node values shown: 3, 2, 2 and 3]
![Page 6: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/6.jpg)
A more abstract game tree
• Minimax value of a node: the utility (for MAX) of being in the corresponding state, assuming perfect play on both sides
• Minimax strategy: Choose the move that gives the best worst-case payoff
[Figure: the two-ply game tree; backed-up node values shown: 3, 2, 2 and 3]
![Page 7: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/7.jpg)
Computing the minimax value of a node
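• Minimax value, defined recursively:

$$
\text{Minimax}(node) =
\begin{cases}
\text{Utility}(node) & \text{if node is terminal} \\
\max_{action} \text{Minimax}(\text{Succ}(node, action)) & \text{if player = MAX} \\
\min_{action} \text{Minimax}(\text{Succ}(node, action)) & \text{if player = MIN}
\end{cases}
$$

[Figure: the two-ply game tree; backed-up node values shown: 3, 2, 2 and 3]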
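The recursive definition above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the nested-list tree encoding and the leaf values are assumptions, chosen so that the backed-up values agree with the 3, 2, 2 and 3 shown on the slide.

```python
def minimax(node, maximizing):
    """Return the minimax value of `node` from MAX's point of view.

    A node is either a number (a terminal utility for MAX) or a list of
    child nodes. Players alternate at each level of the tree.
    """
    if not isinstance(node, list):      # terminal node: Utility(node)
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Hypothetical two-ply tree: MAX moves at the root, MIN replies.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

min_values = [minimax(subtree, maximizing=False) for subtree in tree]
root_value = minimax(tree, maximizing=True)
print(min_values, root_value)
```

MAX picks the first move here because it has the best worst-case payoff.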
![Page 8: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/8.jpg)
Alpha-beta pruning
• It is possible to compute the exact minimax decision without expanding every node in the game tree
![Page 9: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/9.jpg)
Alpha-beta pruning
[Figure: animation step on the game tree; node values shown so far: 3, 3]
![Page 10: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/10.jpg)
Alpha-beta pruning
[Figure: animation step on the game tree; node values shown so far: 3, 3, 2]
![Page 11: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/11.jpg)
Alpha-beta pruning
[Figure: animation step on the game tree; node values shown so far: 3, 3, 2, 14]
![Page 12: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/12.jpg)
Alpha-beta pruning
[Figure: animation step on the game tree; node values shown so far: 3, 3, 2, 5]
![Page 13: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/13.jpg)
Alpha-beta pruning
[Figure: final animation step on the game tree; node values shown: 3, 3, 2, 2]
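A minimal sketch of alpha-beta pruning, assuming the same nested-list tree encoding as a plain minimax search. The leaf values and the `visited` list (used only to count evaluated leaves) are illustrative assumptions, not taken from the slides.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of `node`, skipping branches that cannot change the result.

    alpha: best value MAX can already guarantee on the path to this node.
    beta:  best value MIN can already guarantee on the path to this node.
    visited: optional list that collects every leaf actually evaluated.
    """
    if visited is None:
        visited = []
    if not isinstance(node, list):          # terminal node
        visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:               # MIN will never allow this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:                   # MAX already has a better option
            break
    return value


# Hypothetical two-ply tree: MAX moves at the root, MIN replies.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
leaves_seen = []
value = alphabeta(tree, maximizing=True, visited=leaves_seen)
print(value, leaves_seen)
```

On this tree the search returns the same value as full minimax while evaluating 7 of the 9 leaves: once the second MIN node sees a 2, it cannot beat the 3 already guaranteed to MAX, so its remaining children are skipped.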
![Page 14: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/14.jpg)
Games of chance
• How to incorporate dice throwing into the game tree?
![Page 15: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/15.jpg)
![Page 16: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/16.jpg)
Games of chance
![Page 17: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/17.jpg)
![Page 18: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/18.jpg)
![Page 19: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/19.jpg)
![Page 20: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/20.jpg)
Why MEU?
![Page 21: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/21.jpg)
![Page 22: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/22.jpg)
![Page 23: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/23.jpg)
![Page 24: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/24.jpg)
![Page 25: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/25.jpg)
![Page 26: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/26.jpg)
Human Utilities
• How much do people value their lives?
– How much would you pay to avoid a risk, e.g. Russian roulette with a million-barreled revolver (1 micromort)?
– Driving in a car for 230 miles incurs a risk of 1 micromort.
![Page 27: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/27.jpg)
Measuring Utilities
Worst possible catastrophe
Best possible prize
![Page 28: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/28.jpg)
![Page 29: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/29.jpg)
![Page 30: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/30.jpg)
![Page 31: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/31.jpg)
Markov Decision Processes
Stochastic, sequential environments (Chapter 17)
Image credit: P. Abbeel and D. Klein
![Page 32: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/32.jpg)
Markov Decision Processes
• Components:
– States s, beginning with initial state s0
– Actions a
  • Each state s has actions A(s) available from it
– Transition model P(s’ | s, a)
  • Markov assumption: the probability of going to s’ from s depends only on s and a, and not on any other past actions or states
– Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
– The “solution” to an MDP
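One way to hold these components in code is sketched below. The class name `MDP`, its field names, and the two-state example are all hypothetical; the slides do not prescribe a representation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    states: List[State]
    initial_state: State                                         # s0
    actions: Dict[State, List[Action]]                           # A(s)
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # P(s' | s, a)
    reward: Dict[State, float]                                   # R(s)

    def is_well_formed(self) -> bool:
        """Check that every P(. | s, a) sums to 1."""
        for s in self.states:
            for a in self.actions[s]:
                dist = self.transition[(s, a)]
                if abs(sum(dist.values()) - 1.0) > 1e-9:
                    return False
        return True


# Hypothetical two-state example (not from the slides).
toy = MDP(
    states=["start", "goal"],
    initial_state="start",
    actions={"start": ["go", "wait"], "goal": []},   # terminal state: no actions
    transition={
        ("start", "go"): {"goal": 0.5, "start": 0.5},
        ("start", "wait"): {"start": 1.0},
    },
    reward={"start": -1.0, "goal": 10.0},
)

# A policy maps each non-terminal state to one of its available actions.
policy: Dict[State, Action] = {"start": "go"}
print(toy.is_well_formed(), policy)
```

Because of the Markov assumption, the transition table is keyed only by the current state and action; no history needs to be stored.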
![Page 33: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/33.jpg)
![Page 34: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/34.jpg)
![Page 35: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/35.jpg)
Overview
• First, we will look at how to “solve” MDPs, or find the optimal policy when the transition model and the reward function are known
• Second, we will consider reinforcement learning, where we don’t know the rules of the environment or the consequences of our actions
![Page 36: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/36.jpg)
Game show
• A series of questions with increasing level of difficulty and increasing payoff
• Decision: at each step, take your earnings and quit, or go for the next question
– If you answer wrong, you lose everything
[Figure: decision chain Q1 ($100 question) → Q2 ($1,000 question) → Q3 ($10,000 question) → Q4 ($50,000 question). An incorrect answer at any question pays $0. Quitting pays $100 at Q2, $1,100 at Q3 and $11,100 at Q4. A correct answer at Q4 pays $61,100.]
![Page 37: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/37.jpg)
Game show
• Consider $50,000 question
– Probability of guessing correctly: 1/10
– Quit or go for the question?
• What is the expected payoff for continuing?
0.1 * 61,100 + 0.9 * 0 = 6,110
• What is the optimal decision?
[Figure: the Q1 to Q4 decision chain, annotated with the probability of a correct answer: 9/10 at Q1, 3/4 at Q2, 1/2 at Q3, 1/10 at Q4. Quit payoffs: $100 at Q2, $1,100 at Q3, $11,100 at Q4. A correct answer at Q4 pays $61,100; an incorrect answer anywhere pays $0.]
![Page 38: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/38.jpg)
Game show
• What should we do in Q3?
– Payoff for quitting: $1,100
– Payoff for continuing: 0.5 * $11,100 = $5,550
• What about Q2?
– $100 for quitting vs. $4,162 for continuing
• What about Q1?
[Figure: the Q1 to Q4 decision chain with correct-answer probabilities 9/10, 3/4, 1/2, 1/10, annotated with the utility of each state: U(Q1) = $3,746, U(Q2) = $4,162, U(Q3) = $5,550, U(Q4) = $11,100]
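The utilities on this slide can be reproduced by working backwards from the last question. The sketch below is one way to do that; the function name `solve_game_show` is made up for illustration, and the $0 payoff for quitting before Q1 is an assumption (the figure shows no quit option there).

```python
# (probability of a correct answer, payoff for quitting at this question)
QUESTIONS = [
    (9 / 10, 0),        # Q1: assumed $0 if you walk away before answering
    (3 / 4, 100),       # Q2
    (1 / 2, 1_100),     # Q3
    (1 / 10, 11_100),   # Q4
]
FINAL_PRIZE = 61_100    # payoff for answering Q4 correctly


def solve_game_show(questions, final_prize):
    """Backward induction: best decision and expected utility at each question."""
    utilities, decisions = [], []
    value_after_correct = final_prize
    for p_correct, quit_payoff in reversed(questions):
        # An incorrect answer pays $0, so only the "correct" branch contributes.
        continue_payoff = p_correct * value_after_correct
        if quit_payoff >= continue_payoff:
            decisions.append("quit")
            value_after_correct = quit_payoff
        else:
            decisions.append("continue")
            value_after_correct = continue_payoff
        utilities.append(value_after_correct)
    utilities.reverse()
    decisions.reverse()
    return utilities, decisions


utilities, decisions = solve_game_show(QUESTIONS, FINAL_PRIZE)
for i, (u, d) in enumerate(zip(utilities, decisions), start=1):
    print(f"Q{i}: U = ${u:,.2f} ({d})")
```

The optimal plan is to keep answering through Q3 and quit at Q4, where the sure $11,100 beats the expected $6,110 from guessing.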
![Page 39: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/39.jpg)
Grid world
R(s) = -0.04 for every non-terminal state
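Transition model: the agent moves in the intended direction with probability 0.8, and to either side (at right angles to the intended direction) with probability 0.1 each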
Source: P. Abbeel and D. Klein
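A sketch of this transition model as a function returning P(s’ | s, a). The slide text gives only the 0.8 / 0.1 / 0.1 split; the 4x3 grid with one blocked cell and the rule that a blocked move leaves the agent in place are assumptions made here for illustration.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDEWAYS = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}

# Assumed layout: columns 1..4, rows 1..3, with one blocked cell.
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}


def is_free(cell):
    x, y = cell
    return 1 <= x <= WIDTH and 1 <= y <= HEIGHT and cell not in BLOCKED


def transition(state, action):
    """Return P(s' | s, a) as a dict mapping next states to probabilities."""
    side_a, side_b = SIDEWAYS[action]
    outcomes = {}
    for direction, prob in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if not is_free(nxt):        # assumption: bumping into a wall means staying put
            nxt = state
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes


corner = transition((1, 1), "up")
print(corner)
```

From the bottom-left corner, "up" succeeds with probability 0.8, slips right with probability 0.1, and the leftward slip hits the boundary so the agent stays where it is.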
![Page 40: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/40.jpg)
Goal: Policy
Source: P. Abbeel and D. Klein
![Page 41: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/41.jpg)
Grid world
R(s) = -0.04 for every non-terminal state
Transition model:
![Page 42: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/42.jpg)
Grid world
Optimal policy when R(s) = -0.04 for every non-terminal state
![Page 43: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/43.jpg)
Grid world
• Optimal policies for other values of R(s):
![Page 44: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/44.jpg)
Solving MDPs
• MDP components:
– States s
– Actions a
– Transition model P(s’ | s, a)
– Reward function R(s)
• The solution:
– Policy π(s): mapping from states to actions
– How to find the optimal policy?
![Page 45: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/45.jpg)
![Page 46: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/46.jpg)
![Page 47: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/47.jpg)
Maximizing expected utility
• The optimal policy should maximize the expected utility over all possible state sequences produced by following that policy:

$$
\sum_{\substack{\text{state sequences} \\ \text{starting from } s_0}} P(\text{sequence})\, U(\text{sequence})
$$

• How to define the utility of a state sequence?
– Sum of rewards of individual states
– Problem: infinite state sequences
![Page 48: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/48.jpg)
Utilities of state sequences
• Normally, we would define the utility of a state sequence as the sum of the rewards of the individual states
• Problem: infinite state sequences
• Solution: discount the individual state rewards by a factor γ between 0 and 1:

$$
U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots
= \sum_{t=0}^{\infty} \gamma^t R(s_t) \le \frac{R_{\max}}{1 - \gamma} \qquad (0 < \gamma < 1)
$$

– Sooner rewards count more than later rewards
– Makes sure the total utility stays bounded
– Helps algorithms converge
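A small numerical check of the discounted sum and its bound. The function names and the reward sequences are made up for illustration.

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * R(s_t) over a finite prefix of a state sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def utility_bound(r_max, gamma):
    """Upper bound R_max / (1 - gamma) on any discounted sum, for 0 < gamma < 1."""
    if not 0 < gamma < 1:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return r_max / (1 - gamma)


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
short = discounted_utility([1.0, 1.0, 1.0], gamma=0.5)

# A long run of maximal rewards approaches, but never exceeds, the bound.
long_run = discounted_utility([1.0] * 1000, gamma=0.9)
bound = utility_bound(r_max=1.0, gamma=0.9)
print(short, long_run, bound)
```

Even with a reward of 1 at every one of 1,000 steps, the discounted total stays at about 10, which is why discounting keeps utilities of infinite sequences finite.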
![Page 49: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/49.jpg)
Utilities of states
• Expected utility obtained by policy π starting in state s:

$$
U^{\pi}(s) = \sum_{\substack{\text{state sequences} \\ \text{starting from } s}} P(\text{sequence})\, U(\text{sequence})
$$

• The “true” utility of a state, denoted U(s), is the expected sum of discounted rewards if the agent executes an optimal policy starting in state s
![Page 50: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/50.jpg)
Finding the utilities of states
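[Figure: a max node for state s branches on actions a to chance nodes; each chance node branches with probability P(s’ | s, a) to a successor state s’ with utility U(s’)]

• What is the expected utility of taking action a in state s?

$$
\sum_{s'} P(s' \mid s, a)\, U(s')
$$

• How do we choose the optimal action?

$$
\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
$$

• What is the recursive expression for U(s) in terms of the utilities of its successor states?

$$
U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
$$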
![Page 51: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/51.jpg)
The Bellman equation
• Recursive relationship between the utilities of successive states:

$$
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
$$

– Receive reward R(s)
– Choose optimal action a
– End up in state s’ with probability P(s’ | s, a)
– Get utility U(s’) (discounted by γ)
![Page 52: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/52.jpg)
The Bellman equation
• Recursive relationship between the utilities of successive states:

$$
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
$$

• For N states, we get N equations in N unknowns
– Solving them solves the MDP
– We can solve them algebraically (the max operator makes them nonlinear, so in practice this is done iteratively)
– Two methods: value iteration and policy iteration
![Page 53: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/53.jpg)
Method 1: Value iteration
• Start out with every U(s) = 0
• Iterate until convergence
– During the ith iteration, update the utility of each state according to this rule:

$$
U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')
$$

• In the limit of infinitely many iterations, guaranteed to find the correct utility values
– In practice, don’t need an infinite number of iterations…
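A minimal sketch of value iteration on a hypothetical two-state MDP. The states, rewards, probabilities, γ = 0.9 and the stopping tolerance are all assumptions chosen so the fixed point can be checked by hand; none of them come from the slides.

```python
GAMMA = 0.9
REWARD = {"start": -1.0, "goal": 10.0}                 # R(s)
ACTIONS = {"start": ["go", "wait"], "goal": []}        # A(s); "goal" is terminal
P = {                                                  # P(s' | s, a)
    ("start", "go"): {"goal": 0.5, "start": 0.5},
    ("start", "wait"): {"start": 1.0},
}


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat the Bellman update until no utility changes by more than `tol`."""
    U = {s: 0.0 for s in REWARD}                       # start with every U(s) = 0
    for _ in range(max_iters):
        U_next = {}
        for s in REWARD:
            # Expected utility of each available action under the current estimate.
            q = [sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in ACTIONS[s]]
            # A terminal state has no actions, so its utility is just R(s).
            U_next[s] = REWARD[s] + GAMMA * max(q, default=0.0)
        delta = max(abs(U_next[s] - U[s]) for s in REWARD)
        U = U_next
        if delta < tol:
            break
    return U


U = value_iteration()
print(U)
```

For this toy problem the Bellman equation for "start" under "go" reads U = -1 + 0.9 (0.5 * 10 + 0.5 U), which gives U = 70/11 ≈ 6.36, and "wait" is worse there, so the iteration should settle on that value.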
![Page 54: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/54.jpg)
Value iteration
• What effect does the update have?
$$
U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')
$$
Value iteration demo
![Page 55: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/55.jpg)
Values vs Policy
• Basic idea: approximations get refined towards optimal values
• Policy may converge long before values do
![Page 56: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/56.jpg)
Method 2: Policy iteration
• Start with some initial policy π0 and alternate between the following steps:
– Policy evaluation: calculate U^{πi}(s) for every state s
– Policy improvement: calculate a new policy πi+1 based on the updated utilities

$$
\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s')
$$
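The two alternating steps can be sketched as follows, again on a hypothetical two-state MDP (all numbers, names and the deliberately poor initial policy are assumptions). Policy evaluation here uses repeated updates rather than an exact solve.

```python
GAMMA = 0.9
REWARD = {"start": -1.0, "goal": 10.0}                 # R(s)
ACTIONS = {"start": ["go", "wait"], "goal": []}        # A(s); "goal" is terminal
P = {                                                  # P(s' | s, a)
    ("start", "go"): {"goal": 0.5, "start": 0.5},
    ("start", "wait"): {"start": 1.0},
}


def expected_utility(s, a, U):
    return sum(p * U[s2] for s2, p in P[(s, a)].items())


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the fixed-policy update until it stabilizes."""
    U = {s: 0.0 for s in REWARD}
    while True:
        U_next = {}
        for s in REWARD:
            a = policy.get(s)                          # None for terminal states
            future = expected_utility(s, a, U) if a is not None else 0.0
            U_next[s] = REWARD[s] + GAMMA * future
        if max(abs(U_next[s] - U[s]) for s in REWARD) < tol:
            return U_next
        U = U_next


def policy_iteration(initial_policy):
    policy = dict(initial_policy)
    while True:
        U = evaluate(policy)
        # Policy improvement: pick the action with the highest expected utility.
        improved = {
            s: max(ACTIONS[s], key=lambda a: expected_utility(s, a, U))
            for s in ACTIONS if ACTIONS[s]
        }
        if improved == policy:                         # no change: policy is stable
            return policy, U
        policy = improved


policy, U = policy_iteration({"start": "wait"})
print(policy, U)
```

Starting from "wait", one improvement step switches to "go", and the next evaluation confirms that no further change helps.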
![Page 57: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/57.jpg)
Policy evaluation
• Given a fixed policy π, calculate U^π(s) for every state s
• The Bellman equation for the optimal policy:

$$
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
$$

– How does it need to change if our policy is fixed?

$$
U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')
$$

– Can solve a linear system to get all the utilities!
– Alternatively, can apply the following update:

$$
U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')
$$
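With the policy fixed, the max disappears and the equations become linear: (I − γ P_π) U = R. The sketch below solves that system exactly with a small Gaussian elimination written in plain Python. The two-state MDP and all names are hypothetical, and a numerical library could replace the hand-written solver.

```python
GAMMA = 0.9
STATES = ["start", "goal"]
REWARD = {"start": -1.0, "goal": 10.0}                 # R(s)
P = {                                                  # P(s' | s, a)
    ("start", "go"): {"goal": 0.5, "start": 0.5},
    ("start", "wait"): {"start": 1.0},
}


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_exactly(policy):
    """Return U^pi by solving (I - gamma * P_pi) U = R."""
    index = {s: i for i, s in enumerate(STATES)}
    A = [[1.0 if i == j else 0.0 for j in range(len(STATES))]
         for i in range(len(STATES))]
    for s in STATES:
        a = policy.get(s)                              # terminal states have no action
        if a is not None:
            for s2, p in P[(s, a)].items():
                A[index[s]][index[s2]] -= GAMMA * p
    b = [REWARD[s] for s in STATES]
    return dict(zip(STATES, solve_linear_system(A, b)))


U_go = evaluate_exactly({"start": "go"})
U_wait = evaluate_exactly({"start": "wait"})
print(U_go, U_wait)
```

Always waiting costs 1 per step forever, so its utility is −1 / (1 − 0.9) = −10, while going yields 70/11 ≈ 6.36; the exact solve needs no convergence threshold.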
![Page 58: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/58.jpg)
![Page 59: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/59.jpg)
![Page 60: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/60.jpg)
Looking ahead
![Page 61: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/61.jpg)
Reinforcement Learning
• Components:
– States s, beginning with initial state s0
– Actions a
  • Each state s has actions A(s) available from it
– Transition model P(s’ | s, a)
– Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
– The “solution”
• New twist: don’t know Transition model or Reward function ahead of time!
– Have to actually try actions and states out to learn
![Page 62: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/62.jpg)
![Page 63: Utility Theory & MDPs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.](https://reader035.fdocuments.net/reader035/viewer/2022062516/56649e0f5503460f94af9d33/html5/thumbnails/63.jpg)