MDPs (cont) & Reinforcement Learning
Tamara Berg
CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer
Announcements
• HW2 online: CSPs and Games
  – Due Oct 8, 11:59pm (start now!)
• Mid-term exam next Wed, Sept 30
  – Held during regular class time.
  – Closed book. You may bring a calculator.
  – Written questions (no coding).
  – Ric will lead an in-class mid-term review/exercise session on Sept 28.
Exam topics

1) Intro to AI, agents and environments
   Turing test
   Rationality
   Expected utility maximization
   PEAS
   Environment characteristics: fully vs. partially observable, deterministic vs. stochastic, episodic vs. sequential, static vs. dynamic, discrete vs. continuous, single-agent vs. multi-agent, known vs. unknown

2) Search
   Search problem formulation: initial state, actions, transition model, goal state, path cost
   State space graph
   Search tree
   Frontier, explored set
   Evaluation of search strategies: completeness, optimality, time complexity, space complexity
   Uninformed search strategies: breadth-first search, uniform cost search, depth-first search, iterative deepening search
   Informed search strategies: greedy best-first, A*, weighted A*
   Heuristics: admissibility, dominance
Exam topics
3) Constraint satisfaction problems
   Backtracking search
   Heuristics: most constrained/most constraining variable, least constraining value
   Forward checking, constraint propagation, arc consistency
   Taking advantage of structure: connected components, tree-structured CSPs
   Local search
   Formulating photo ordering as a CSP

4) Games
   Zero-sum games
   Game tree
   Minimax/Expectimax/Expectiminimax search
   Alpha-beta pruning
   Evaluation function
   Quiescence search
   Horizon effect
   Stochastic elements in games
Exam topics
5) Markov decision processes
   Markov assumption, transition model, policy
   Bellman equation
   Value iteration
   Policy iteration

6) Reinforcement learning
   Model-based vs. model-free approaches
   Passive vs. active
   Exploration vs. exploitation
   Direct estimation
   TD learning
   TD Q-learning
   Applications to backgammon, quadruped locomotion, helicopter flying
Stochastic, sequential environments
Image credit: P. Abbeel and D. Klein
Markov Decision Processes
Markov Decision Processes

• Components:
  – States s, beginning with initial state s0
  – Actions a
    • Each state s has actions A(s) available from it
  – Transition model P(s' | s, a)
    • Markov assumption: the probability of going to s' from s depends only on s and a, not on any other past actions or states
  – Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
  – The “solution” to an MDP
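As a concrete illustration of these components, here is one way they might be written down in Python. The two-state example, the dictionary layout, and every name below are illustrative assumptions, not anything from the slides.

```python
# A sketch of the MDP components above as plain Python dictionaries.
# The two-state world and all names here are illustrative assumptions.

states = ['s0', 's1', 'win', 'lose']          # 'win' and 'lose' are terminal

# A(s): actions available from each state; terminal states have none.
actions = {'s0': ['left', 'right'], 's1': ['left', 'right'],
           'win': [], 'lose': []}

# Transition model P(s' | s, a): transition[s][a] maps s' -> probability.
transition = {
    's0': {'left':  {'s0': 0.2, 's1': 0.8},
           'right': {'s0': 0.2, 'lose': 0.8}},
    's1': {'left':  {'s0': 0.8, 's1': 0.2},
           'right': {'s1': 0.2, 'win': 0.8}},
}

# Reward function R(s): a small living cost plus terminal payoffs.
reward = {'s0': -0.04, 's1': -0.04, 'win': 1.0, 'lose': -1.0}

# A policy maps each non-terminal state to an action: the "solution".
policy = {'s0': 'left', 's1': 'right'}
```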
Overview
• First, we will look at how to “solve” MDPs, i.e., find the optimal policy when the transition model and the reward function are known
• Second, we will consider reinforcement learning, where we don’t know the rules of the environment or the consequences of our actions
Grid world
R(s) = -0.04 for every non-terminal state
Transition model: 0.8 chance of moving in the intended direction, 0.1 chance each of moving in the two perpendicular directions.
Source: P. Abbeel and D. Klein
Goal: Policy
Source: P. Abbeel and D. Klein
Grid world
Optimal policy when R(s) = -0.04 for every non-terminal state
Grid world

• Optimal policies for other values of R(s):
Solving MDPs

• MDP components:
  – States s
  – Actions a
  – Transition model P(s' | s, a)
  – Reward function R(s)
• The solution:
  – Policy π(s): mapping from states to actions
  – How to find the optimal policy?
Maximizing expected utility

• The optimal policy should maximize the expected utility over all possible state sequences produced by following that policy:

$$\sum_{\substack{\text{state sequences} \\ \text{starting from } s_0}} P(\text{sequence})\, U(\text{sequence})$$

• How to define the utility of a state sequence?
  – Sum of rewards of individual states
  – Problem: infinite state sequences
Utilities of state sequences

• Normally, we would define the utility of a state sequence as the sum of the rewards of the individual states
• Problem: infinite state sequences
• Solution: discount the individual state rewards by a factor γ between 0 and 1:

$$U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(s_t) \le \frac{R_{\max}}{1 - \gamma} \qquad (0 \le \gamma < 1)$$

  – Sooner rewards count more than later rewards
  – Makes sure the total utility stays bounded
  – Helps algorithms converge
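As a quick worked check of this bound, with numbers chosen purely for illustration: taking $R_{\max} = 1$ and $\gamma = 0.9$ gives

$$\sum_{t=0}^{\infty} 0.9^t \cdot 1 = \frac{1}{1 - 0.9} = 10,$$

so even an infinite state sequence accumulates at most 10 units of utility.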
Utilities of states

• Expected utility obtained by policy π starting in state s:

$$U^{\pi}(s) = \sum_{\substack{\text{state sequences} \\ \text{starting from } s}} P(\text{sequence})\, U(\text{sequence})$$

• The “true” utility (value) of a state, U(s), is the expected sum of discounted rewards if the agent executes an optimal policy starting in state s
Finding the utilities of states

[Figure: one-step lookahead tree, with a max node at state s branching over actions a, chance nodes branching over outcomes s' with probabilities P(s' | s, a), and leaves valued U(s')]

• What is the expected utility of taking action a in state s?

$$\sum_{s'} P(s' \mid s, a)\, U(s')$$

• How do we choose the optimal action?

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

• What is the recursive expression for U(s) in terms of the utilities of its successor states?

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$$
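The first two expressions translate directly into code. Here is a minimal sketch under the dictionary representation assumed earlier (the function names are illustrative):

```python
# Expected utility of taking action a in state s:
# the sum over s' of P(s' | s, a) * U(s').
def expected_utility(s, a, U, transition):
    return sum(p * U[s2] for s2, p in transition[s][a].items())

# Optimal action in state s: argmax over a in A(s) of the expected utility.
def optimal_action(s, U, transition, actions):
    return max(actions[s], key=lambda a: expected_utility(s, a, U, transition))
```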
The Bellman equation

• Recursive relationship between the utilities of successive states:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

Reading the right-hand side: receive reward R(s), choose the optimal action a, end up in s' with probability P(s' | s, a), and get utility U(s') (discounted by γ).
The Bellman equation

• Recursive relationship between the utilities of successive states:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

• For N states, we get N equations in N unknowns
  – Solving them solves the MDP
  – Two methods: value iteration and policy iteration
Method 1: Value iteration

• Start out with every U(s) = 0
• Iterate until convergence
  – During the i-th iteration, update the utility of all states (simultaneously) according to this rule:

$$U_{i+1}(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$

• In the limit of infinitely many iterations, guaranteed to find the correct utility values
  – In practice, don't need an infinite number of iterations…
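Here is a minimal runnable sketch of this update rule, assuming the same illustrative dictionary representation as the earlier snippets; treating terminal states as having no actions and utility equal to their reward is also an assumption of this sketch.

```python
# Value iteration: apply the Bellman update to every state simultaneously
# (all updates read the previous iteration's U), until convergence.
def value_iteration(states, actions, transition, reward, gamma=0.9, eps=1e-6):
    U = {s: 0.0 for s in states}              # start with every U(s) = 0
    while True:
        delta = 0.0
        U_next = {}
        for s in states:
            if not actions[s]:                # terminal: utility is its reward
                U_next[s] = reward[s]
            else:                             # Bellman update for state s
                U_next[s] = reward[s] + gamma * max(
                    sum(p * U[s2] for s2, p in transition[s][a].items())
                    for a in actions[s])
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < eps:                       # largest change below threshold
            return U
```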
Value iteration

• Run value iteration on the following grid world:

R(s) = -0.25 for all non-terminal states

Transition model: 0.7 chance of going in the desired direction, 0.1 chance of going in any of the other 3 directions. If the agent moves into a wall, it stays put.
• Initially, U(s) = 0 for every state:

[Figure: the grid world with every cell's utility initialized to 0]
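To try the exercise in code, reusing the value_iteration sketch above: the grid size, terminal placement, and discount below are hypothetical stand-ins for the figure's actual layout, while the -0.25 reward, the 0.7/0.1 transition model, and the stay-put wall behavior follow the slide.

```python
# Hypothetical grid for the exercise: 3x3 with one +1 terminal (layout assumed).
moves = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
rows, cols = 3, 3
terminals = {(0, 2): 1.0}                     # assumed terminal cell and payoff

states = [(r, c) for r in range(rows) for c in range(cols)]
actions = {s: ([] if s in terminals else list(moves)) for s in states}
reward = {s: terminals.get(s, -0.25) for s in states}

def step(s, d):
    """One step in direction d; moving into a wall leaves the agent in place."""
    r, c = s[0] + moves[d][0], s[1] + moves[d][1]
    return (r, c) if 0 <= r < rows and 0 <= c < cols else s

# 0.7 chance of the desired direction, 0.1 for each of the other three.
transition = {s: {a: {} for a in actions[s]} for s in states}
for s in states:
    for a in actions[s]:
        for d in moves:
            s2 = step(s, d)
            p = 0.7 if d == a else 0.1
            transition[s][a][s2] = transition[s][a].get(s2, 0.0) + p

U = value_iteration(states, actions, transition, reward, gamma=0.95)  # assumed discount
for s in sorted(U):
    print(s, round(U[s], 3))
```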
Value iteration
Value iteration demo