Reinforcement Learning
HUT Spatial Intelligence course, August/September 2004
Bram Bakker
Computer Science, University of Amsterdam
bram@science.uva.nl
Overview day 1 (Monday 13-16)
Basic concepts
Formalized model
Value functions
Learning value functions
In-class assignment & discussion
Overview day 2 (Tuesday 9-12)
Learning value functions more efficiently
Generalization
Case studies
In-class assignment & discussion
Overview day 3 (Thursday 13-16)
Models and planning
Multi-agent reinforcement learning
Other advanced RL issues
Presentation of home assignments & discussion
Machine Learning

What is it?
A subfield of Artificial Intelligence: making computers learn tasks rather than directly programming them.

Why is it interesting?
Some tasks are very difficult to program, or difficult to optimize, so learning might be better.

Relevance for geoinformatics/spatial intelligence:
Geoinformatics deals with many such tasks: transport optimization, water management, etc.
Classes of Machine Learning techniques

Supervised learning: works by instructing the learning system what output to give for each input.
Unsupervised learning: clusters inputs based on similarity (e.g. Kohonen self-organizing maps).
Reinforcement learning: works by letting the learning system learn autonomously what is good and bad.
Some well-known Machine Learning techniques

Neural networks: work in a way analogous to brains; can be used with supervised, unsupervised, and reinforcement learning, and with genetic algorithms.
Genetic algorithms: work in a way analogous to evolution.
Ant Colony Optimization: works in a way analogous to ant colonies.
What is Reinforcement Learning?
Learning from interaction
Goal-oriented learning
Learning about, from, and while interacting with an external environment
Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
Some Notable RL Applications
TD-Gammon (Tesauro): world's best backgammon program
Elevator control (Crites & Barto): high-performance elevator controller
Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls
Traffic light control (Wiering et al.; Choy et al.): high-performance control of traffic lights to optimize traffic flow
Water systems control (Bhattacharya et al.): high-performance control of water levels of regional water systems
Relationships to other fields
(Diagram: Reinforcement Learning (RL) at the center, connected to psychology, artificial intelligence (planning methods), control theory and operations research, artificial neural networks, and neuroscience.)
Recommended literature
Sutton & Barto (1998). Reinforcement learning: an introduction. MIT Press.
Kaelbling, Littman, & Moore (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, vol. 4, pp. 237-285.
Complete Agent
Temporally situated
Continual learning and planning
Agent affects the environment
Environment is stochastic and uncertain

(Diagram: the agent sends actions to the environment; the environment returns a state and a reward.)
Supervised Learning
Inputs → Supervised Learning System → Outputs

Training info: desired (target) outputs
Error = (target output − actual output)
Reinforcement Learning (RL)
Inputs → RL System → Outputs (“actions”)

Training info: evaluations (“rewards” / “penalties”)
Objective: get as much reward as possible
Key Features of RL
Learner is not told which actions to take
Trial-and-error search
Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
The need to explore and exploit
Considers the whole problem of a goal-directed agent interacting with an uncertain environment
What is attractive about RL?
Online, “autonomous” learning without a need for preprogrammed behavior or instruction
Learning to satisfy long-term goals
Applicable to many tasks
Some RL History
(Timeline figure: three historical threads converging on modern RL.)

Trial-and-error learning: Thorndike (1911), Minsky, Klopf, Barto et al., Holland
Temporal-difference learning: secondary reinforcement, Samuel, Witten, Sutton
Optimal control, value functions: Hamilton (physics, 1800s), Shannon, Bellman/Howard (OR), Werbos, Watkins
Elements of RL
Policy: what to do; maps states to actions
Reward: what is good
Value: what is good because it predicts reward; reflects total, long-term reward
Model: what follows what; maps states and actions to new states and rewards
An Extended Example: Tic-Tac-Toe
(Figure: a tic-tac-toe game tree, alternating x's moves and o's moves.)

Assume an imperfect opponent: he/she sometimes makes mistakes.
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state, where V(s) is the estimated probability of winning:

State                               V(s)
typical state                       0.5 (a guess)
state with three x's in a row       1 (win)
state with three o's in a row       0 (loss)
full board, no winner               0 (draw)

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states: just pick the next state with the highest estimated probability of winning, the largest V(s); a greedy move. But 10% of the time pick a move at random; an exploratory move.
RL Learning Rule for Tic-Tac-Toe
(Figure: a sequence of game states, with greedy moves and an occasional “exploratory” move.)

Let s be the state before our greedy move and s′ the state after our greedy move. We increment each V(s) toward V(s′), a backup:

V(s) ← V(s) + α [ V(s′) − V(s) ]

where the step-size parameter α is a small positive fraction, e.g. α = 0.1.
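The backup above can be sketched in a few lines; the dictionary table and the 0.5 default for unseen states are illustrative assumptions, not the original implementation:

```python
# Tabular backup for the tic-tac-toe player: move V(s) a fraction
# alpha toward V(s'), the estimated value of the state after the move.
def td_update(V, s, s_next, alpha=0.1):
    V.setdefault(s, 0.5)        # unseen states start at 0.5 (a guess)
    V.setdefault(s_next, 0.5)
    V[s] = V[s] + alpha * (V[s_next] - V[s])
    return V

V = {"win": 1.0}                # terminal winning states have value 1
td_update(V, "some_state", "win")   # V["some_state"]: 0.5 -> 0.55
```

Repeated over many games, these backups propagate the win probabilities from terminal states back to earlier positions.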
How can we improve this T.T.T. player?
Take advantage of symmetries? (representation/generalization)
Do we need “random” moves? Why? Do we always need a full 10%?
Can we learn from “random” moves?
Can we learn offline? Pre-training from self-play? Using learned models of the opponent?
. . .
How is Tic-Tac-Toe easy?
Small number of states and actions
Small number of steps until reward
. . .
RL Formalized
Agent and environment interact at discrete time steps t = 0, 1, 2, ...

Agent observes state at step t: s_t ∈ S,
produces action at step t: a_t ∈ A(s_t),
gets resulting reward r_{t+1} and resulting next state s_{t+1}.

The interaction forms a trajectory:
... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3} ...
The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities;
π_t(s, a) = probability that a_t = a when s_t = s.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent’s goal is to get as much reward as it can over the long run.
Getting the Degree of Abstraction Right
Time steps need not refer to fixed intervals of real time.
Actions can be low-level (e.g., voltages to motors), high-level (e.g., accept a job offer), “mental” (e.g., a shift in focus of attention), etc.
States can be low-level “sensations”, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being “surprised” or “lost”).
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.
Goals and Rewards
Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, thus outside the agent.
The agent must be able to measure success: explicitly, and frequently during its lifespan.
Returns
What do we want to maximize?

Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ...
In general, we want to maximize the expected return E[R_t], for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

R_t = r_{t+1} + r_{t+2} + ... + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not break naturally into episodes.

Discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},

where 0 ≤ γ ≤ 1 is the discount rate.

γ close to 0: shortsighted; γ close to 1: farsighted.
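The discounted return over a finite reward sequence can be computed directly; this small helper (name and interface are ours, for illustration) accumulates the sum right-to-left:

```python
# R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
# computed Horner-style: R = r + gamma * (return of the remaining tail).
def discounted_return(rewards, gamma):
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```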
An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
reward = +1 for each step before failure;
return = number of steps before failure.

As a continuing task with discounted return:
reward = −1 upon failure, 0 otherwise;
return = −γ^k, for k steps before failure.

In either case, return is maximized by avoiding failure for as long as possible.
Another Example
Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at the top of the hill;
return = −(number of steps before reaching the top of the hill).

Return is maximized by minimizing the number of steps taken to reach the top of the hill.
A Unified Notation
Think of each episode as ending in an absorbing state that always produces a reward of zero. We can then cover both the episodic and the continuing case by writing

R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.
The Markov Property
A state should retain all “essential” information, i.e., it should have the Markov Property:

Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t } = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0 }

for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0.
Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP.

To define a finite MDP, you need to give:
the state and action sets;
the one-step “dynamics”, defined by the state-transition probabilities

P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a } for all s, s′ ∈ S, a ∈ A(s),

and the expected rewards

R^a_{ss′} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ ] for all s, s′ ∈ S, a ∈ A(s).
Value Functions
State-value function for policy π: the value of a state is the expected return starting from that state; it depends on the agent's policy:

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]

Action-value function for policy π: the value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:

Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
Bellman Equation for a Policy
The basic idea:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}

So:

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]

Or, without the expectation operator:

V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
Gridworld
Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = −1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.

(Figure: the state-value function for the equiprobable random policy; γ = 0.9.)
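Values like these can be computed by turning the Bellman equation into an iterative update (iterative policy evaluation). The sketch below assumes a hypothetical model format, P[s][a] = [(prob, next_state, reward), ...], and a policy pi[s][a] giving action probabilities; neither is the slide's notation:

```python
# Iterative policy evaluation: sweep the Bellman equation for V^pi
# until the largest change in any state's value falls below theta.
def policy_evaluation(P, pi, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V
```

For γ < 1 this sweep is a contraction, so it converges to V^π from any initial guess.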
Optimal Value Functions

For finite MDPs, policies can be partially ordered:
π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S.

There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all by π*.

Optimal policies share the same optimal state-value function:
V*(s) = max_π V^π(s) for all s ∈ S.

Optimal policies also share the same optimal action-value function:
Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s).

This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:

V*(s) = max_{a∈A(s)} Q*(s, a)
      = max_{a∈A(s)} E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
      = max_{a∈A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

V* is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation for Q*
Q*(s, a) = E[ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a ]
         = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

Q* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld:
What About Optimal Action-Value Functions?
Given Q*, the agent does not even have to do a one-step-ahead search:

π*(s) = arg max_{a∈A(s)} Q*(s, a)
Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman Optimality Equation exactly requires:
accurate knowledge of the environment dynamics;
enough space and time to do the computation;
the Markov Property.

How much space and time do we need? Polynomial in the number of states (via dynamic programming methods), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
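One such dynamic-programming method is value iteration, which repeatedly applies the Bellman optimality equation as an update rule. This is a sketch under the same assumed model format P[s][a] = [(prob, next_state, reward), ...], not a definitive implementation:

```python
# Value iteration: V(s) <- max_a sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V(s')].
# Returns the approximate V* and the policy that is greedy w.r.t. it.
def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    greedy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, greedy
```

The cost per sweep is polynomial in the number of states, which is exactly why huge state spaces force approximation.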
Temporal Difference (TD) Learning
Basic idea: transform the Bellman Equation into an update rule, using two consecutive timesteps
Policy evaluation: learn an approximation to the value function of the current policy.
Policy improvement: act greedily with respect to the intermediate, learned value function.
Repeating this over and over again leads to approximations of the optimal value function.
Q-Learning: TD-learning of action values
One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
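The update can be sketched with a dictionary keyed by (state, action) pairs (an assumed representation; unvisited pairs default to 0):

```python
# One-step Q-learning: nudge Q(s,a) toward the target
# r + gamma * max_a' Q(s', a').
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * best_next - q)
    return Q
```

Note that the max is taken over all actions regardless of which action the agent actually takes next, which is what makes Q-learning off-policy.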
Exploration/Exploitation revisited
Suppose you form action-value estimates Q_t(s, a) ≈ Q*(s, a).

The greedy action at t is:

a_t* = arg max_a Q_t(s, a)

Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration.

You can't exploit all the time; you can't explore all the time. You can never stop exploring, but you should always reduce exploring.
ε-Greedy Action Selection

Greedy action selection:

a_t = a_t* = arg max_a Q_t(s, a)

ε-greedy:

a_t = a_t* with probability 1 − ε,
a_t = a random action with probability ε

. . . the simplest way to try to balance exploration and exploitation.
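A minimal sketch of ε-greedy selection, assuming the action values for the current state are stored in a dict (an illustrative representation):

```python
import random

# With probability epsilon pick uniformly at random (explore);
# otherwise pick the action with the highest estimated value (exploit).
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```

Annealing ε toward a small value over time is one common way to "always reduce exploring".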
Softmax Action Selection
Softmax action selection methods grade action probabilities by estimated values.

The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

e^{Q_t(s,a)/τ} / Σ_{b=1}^{n} e^{Q_t(s,b)/τ},

where τ is the “computational temperature”.
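The Boltzmann rule can be sketched as follows (dict of action values assumed; a production version would guard against overflow in exp for very small τ):

```python
import math
import random

# Gibbs/Boltzmann action selection: P(a) proportional to exp(Q(a)/tau).
# High tau -> nearly uniform choice; low tau -> nearly greedy choice.
def softmax_action(q_values, tau=1.0):
    actions = list(q_values)
    prefs = [math.exp(q_values[a] / tau) for a in actions]
    return random.choices(actions, weights=prefs, k=1)[0]
```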
Pole balancing learned using RL
Improving the basic TD learning scheme
Can we learn more efficiently?
Can we update multiple values at the same timestep?
Can we look ahead further in time, rather than just use the value at the next timestep?

Yes! All these can be done simultaneously with one extension: eligibility traces.
N-step TD Prediction
Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction

Monte Carlo:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T−t−1} r_T

TD (use V to estimate the remaining return); 1-step return:

R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})

n-step TD; 2-step return:

R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})

n-step return:

R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
Learning with N-step Backups
Backup (on-line or off-line):

ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]
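The n-step target R_t^{(n)} combines n real rewards with a bootstrap from the current value estimate; a sketch with assumed inputs:

```python
# rewards = [r_{t+1}, ..., r_{t+n}]; v_bootstrap = V_t(s_{t+n}).
# Accumulate right-to-left so each reward gets the right discount power.
def n_step_return(rewards, v_bootstrap, gamma):
    R = v_bootstrap
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# 2-step example, gamma = 0.5: 1 + 0.5*2 + 0.25*4 = 3.0
n_step_return([1.0, 2.0], v_bootstrap=4.0, gamma=0.5)
```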
Random Walk Example
How does 2-step TD work here? How about 3-step TD?
Forward View of TD(λ)

TD(λ) is a method for averaging all n-step backups: the n-step return is weighted by λ^{n−1}, normalized by (1 − λ).

λ-return:

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}

Backup using the λ-return:

ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
λ-return Weighting Function
Relation to TD(0) and MC
The λ-return can be rewritten as:

R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t

where the first term covers the n-step returns until termination and the second term the complete return after termination.

If λ = 1, you get MC:

R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t

If λ = 0, you get TD(0):

R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}
Forward View of TD(λ) II
Look forward from each state to determine update from future states and rewards:
λ-return on the Random Walk

Same random walk as before, but now with 19 states. Why do you think intermediate values of λ are best?
Backward View of TD(λ)

The forward view was for theory; the backward view is for mechanism.

New variable called the eligibility trace, e_t(s). On each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace):

e_t(s) = γλ e_{t−1}(s)         if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1     if s = s_t
Backward View
Shout δ_t backwards over time:

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)

The strength of your voice decreases with temporal distance by γλ
![Page 62: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/62.jpg)
Relation of Backwards View to MC & TD(0)
Using update rule:

ΔV_t(s) = α δ_t e_t(s)

As before, if you set λ to 0, you get TD(0). If you set λ to 1, you get MC, but in a better way:
Can apply TD(1) to continuing tasks
Works incrementally and on-line (instead of waiting until the end of the episode)
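To make the mechanism concrete, here is a minimal tabular TD(λ) sketch (backward view, accumulating traces). The task, a small random walk with a reward of 1 on the right exit, and all parameter values are illustrative choices, not from the course:

```python
import random

def td_lambda(n_states=5, episodes=500, alpha=0.1, lam=0.8, gamma=1.0, seed=0):
    """Tabular TD(lambda), backward view, with accumulating traces.

    Random walk: states 0..n_states-1, start in the middle, move left/right
    at random; exiting on the right gives reward 1, on the left reward 0.
    """
    rng = random.Random(seed)
    V = [0.5] * n_states
    for _ in range(episodes):
        e = [0.0] * n_states              # eligibility traces, reset per episode
        s = n_states // 2
        while True:
            s2 = s + rng.choice((-1, 1))
            terminal = s2 < 0 or s2 >= n_states
            r = 1.0 if s2 >= n_states else 0.0
            delta = r + (0.0 if terminal else gamma * V[s2]) - V[s]
            e[s] += 1.0                   # accumulate trace for current state
            for i in range(n_states):     # "shout" delta back, decayed by gamma*lambda
                V[i] += alpha * delta * e[i]
                e[i] *= gamma * lam
            if terminal:
                break
            s = s2
    return V
```

Every state with a nonzero trace receives a share of each TD error, so credit propagates back over many states per step instead of one.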
![Page 63: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/63.jpg)
Forward View = Backward View
The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view. Sutton & Barto's book shows (the algebra is in the book):

Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

Backward updates (left) equal forward updates (right).
![Page 64: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/64.jpg)
Q(λ)-learning

Zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice:

e_t(s,a) = 1 + γλ e_{t−1}(s,a)   if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
e_t(s,a) = 0                     if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
e_t(s,a) = γλ e_{t−1}(s,a)       otherwise

Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)
![Page 65: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/65.jpg)
Q(λ)-learning

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
  Initialize s, a
  Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
    a* ← argmax_b Q(s′, b)  (if a′ ties for the max, then a* ← a′)
    δ ← r + γ Q(s′, a*) − Q(s, a)
    e(s, a) ← e(s, a) + 1
    For all s, a:
      Q(s, a) ← Q(s, a) + α δ e(s, a)
      If a′ = a*, then e(s, a) ← γλ e(s, a)
      else e(s, a) ← 0
    s ← s′; a ← a′
  Until s is terminal
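The algorithm above can be sketched in runnable form. The environment here is a made-up chain task (states 0..n−1, action 1 moves right toward a goal worth reward 1, action 0 moves left), chosen only to exercise the trace-cutting rule; parameters are illustrative:

```python
import random

def q_lambda_chain(n=5, episodes=300, alpha=0.2, gamma=0.9, lam=0.8,
                   eps=0.1, seed=1):
    """Watkins's Q(lambda) on a chain MDP; traces are zeroed after
    exploratory (non-greedy) actions, as in the algorithm above."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n)]       # Q[state][action]
    for _ in range(episodes):
        e = [[0.0, 0.0] for _ in range(n)]   # eligibility traces
        s = 0
        a = rng.randrange(2) if rng.random() < eps else \
            (1 if Q[s][1] >= Q[s][0] else 0)
        while True:
            s2 = max(0, s - 1) if a == 0 else s + 1
            done = s2 == n - 1
            r = 1.0 if done else 0.0
            # next action (eps-greedy) and the greedy action a*
            a_star = 1 if Q[s2][1] >= Q[s2][0] else 0
            a2 = rng.randrange(2) if rng.random() < eps else a_star
            delta = r - Q[s][a] if done else \
                r + gamma * Q[s2][a_star] - Q[s][a]
            e[s][a] += 1.0
            greedy = (not done) and a2 == a_star
            for i in range(n):
                for j in range(2):
                    Q[i][j] += alpha * delta * e[i][j]
                    # decay traces if greedy, cut them after exploration
                    e[i][j] = gamma * lam * e[i][j] if greedy else 0.0
            if done:
                break
            s, a = s2, a2
    return Q
```
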
![Page 66: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/66.jpg)
Q(λ) Gridworld Example
With one trial, the agent has much more information about how to get to the goal
not necessarily in the best way, but this can considerably accelerate learning
![Page 67: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/67.jpg)
Conclusions TD(λ)/Q(λ) methods

Can significantly speed learning
Robust against unreliable value estimates (e.g. caused by Markov violations)
Do have a cost in computation
![Page 68: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/68.jpg)
Generalization and Function Approximation
Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part
Overview of function approximation (FA) methods and how they can be adapted to RL
![Page 69: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/69.jpg)
Generalization
Figure: a lookup table stores one value V per state s_1, s_2, s_3, …, s_N; a generalizing function approximator trained on some states ("train here") also changes its estimates for neighboring, untrained states.
![Page 70: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/70.jpg)
So with function approximation a single value update affects a larger region of the state space
![Page 71: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/71.jpg)
Value Prediction with FA
Before, value functions were stored in lookup tables.

Now, the value-function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.

e.g., θ_t could be the vector of connection weights of a neural network.
![Page 72: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/72.jpg)
Adapt Supervised Learning Algorithms
Supervised Learning SystemInputs Outputs
Training Info = desired (target) outputs
Error = (target output – actual output)
Training example = {input, target output}
![Page 73: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/73.jpg)
Backups as Training Examples
e.g., the TD(0) backup:

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

As a training example:
input: description of s_t;  target output: r_{t+1} + γ V(s_{t+1})
![Page 74: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/74.jpg)
Any FA Method?
In principle, yes: artificial neural networks, decision trees, multivariate regression methods, etc.
But RL has some special requirements: usually want to learn while interacting; ability to handle nonstationarity; other?
![Page 75: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/75.jpg)
Gradient Descent Methods
θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T

Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.

Assume, for now, training examples of this form:
{description of s_t, V^π(s_t)}
![Page 76: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/76.jpg)
Performance Measures for Gradient Descent
Many are applicable but… a common and simple one is the mean-squared error
(MSE) over a distribution P :
MSE(θ_t) = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²
![Page 77: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/77.jpg)
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:

∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T

Iteratively move down the gradient:

θ_{t+1} = θ_t − α ∇_θ f(θ_t)
![Page 78: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/78.jpg)
Control with FA
Learning state-action values. Training examples of the form:
{description of (s_t, a_t), v_t}

The general gradient-descent rule:

θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)

Gradient-descent Q(λ) (backward view):

θ_{t+1} = θ_t + α δ_t e_t
where
δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)
e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)
Linear Gradient Descent Q(λ)
![Page 80: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/80.jpg)
Mountain-Car Task
![Page 81: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/81.jpg)
Mountain-Car Results
![Page 82: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/82.jpg)
Summary
Generalization makes learning possible in those cases where there are too many states for a lookup table
Adapting supervised-learning function approximation methods
Gradient-descent methods
![Page 83: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/83.jpg)
Case Studies
Illustrate the promise of RL Illustrate the difficulties, such as long learning times,
finding good state representations
![Page 84: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/84.jpg)
TD Gammon
Tesauro 1992, 1994, 1995, ...
Objective is to advance all pieces to points 19-24
30 pieces, 24 locations implies enormous number of configurations
Effective branching factor of 400
![Page 85: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/85.jpg)
A Few Details
Reward: 0 at all times except those in which the game is won, when it is 1
Episodic (game = episode), undiscounted; gradient-descent TD(λ) with a multi-layer neural network
weights initialized to small random numbers backpropagation of TD error four input units for each point; unary encoding of
number of white pieces, plus other features Learning during self-play
![Page 86: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/86.jpg)
Multi-layer Neural Network
![Page 87: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/87.jpg)
Summary of TD-Gammon Results
![Page 88: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/88.jpg)
The Acrobot
Spong 1994Sutton 1996
![Page 89: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/89.jpg)
Acrobot Learning Curves for Q(λ)
![Page 90: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/90.jpg)
Typical Acrobot Learned Behavior
![Page 91: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/91.jpg)
Elevator Dispatching
Crites and Barto 1996
![Page 92: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/92.jpg)
State Space
• 18 hall call buttons: 2^18 combinations

• positions and directions of cars: 18^4 (rounding to nearest floor)

• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6^4

• 40 car buttons: 2^40

• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.

• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states
![Page 93: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/93.jpg)
Control Strategies
• Zoning: divide building into zones; park in zone when idle. Robust in heavy traffic.
• Search-based methods: greedy or non-greedy. Receding Horizon control.
• Rule-based methods: expert systems/fuzzy logic; from human “experts”
• Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
• Adaptive/Learning methods: NNs for prediction, parameter space search using simulation, DP on simplified model, non-sequential RL
![Page 94: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/94.jpg)
Performance Criteria
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Minimize:
![Page 95: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/95.jpg)
Average Squared Wait Time
Instantaneous cost:

r_τ = Σ_p ( wait_p(τ) )²

Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

Σ_{t=0}^∞ γ^t r_t   becomes   ∫_0^∞ e^{−βτ} r_τ dτ
Algorithm
Repeat forever :
1. In state x at time tx , car c must decide to STOP or CONTINUE
2. It selects an action using Boltzmann distribution
(with decreasing temperature) based on current Q values
3. The next decision by car c is required in state y at time ty
4. Implement the gradient-descent version of the following backup using backprop:

Q(x,a) ← Q(x,a) + α [ ∫_{t_x}^{t_y} e^{−β(τ−t_x)} r_τ dτ + e^{−β(t_y−t_x)} max_{a′} Q(y, a′) − Q(x,a) ]

5. x ← y, t_x ← t_y
![Page 97: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/97.jpg)
Neural Networks
• 9 binary: state of each hall down button
• 9 real: elapsed time of hall down button if pushed
• 16 binary: one on at a time: position and direction of car making decision
• 10 real: location/direction of other cars
• 1 binary: at highest floor with waiting passenger?
• 1 binary: at floor with longest waiting passenger?
• 1 bias unit (constant 1)

47 inputs, 20 sigmoid hidden units, 1 or 2 output units
![Page 98: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/98.jpg)
Elevator Results
![Page 99: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/99.jpg)
Dynamic Channel Allocation
Details in:Singh and Bertsekas 1997
![Page 100: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/100.jpg)
Helicopter flying
Difficult nonlinear control problem Also difficult for humans Approach: learn in simulation, then transfer to real
helicopter Uses function approximator for generalization Bagnell, Ng, and Schneider (2001, 2003, …)
![Page 101: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/101.jpg)
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 101
In-class assignment
Think again of your own RL problem, with states, actions, and rewards
This time think especially about how uncertainty may play a role, and about how generalization may be important
Discussion
![Page 102: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/102.jpg)
Homework assignment
Due Thursday 13-16 Think again of your own RL problem, with states, actions,
and rewards Do a web search on your RL problem or related work What is there already, and what, roughly, have they done
to solve the RL problem? Present briefly in class
![Page 103: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/103.jpg)
Overview day 3
Summary of what we’ve learnt about RL so far Models and planning Multi-agent RL Presentation of homework assignments and discussion
![Page 104: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/104.jpg)
RL summary
Objective: maximize the total amount of (discounted) reward Approach: estimate a value function (defined over state
space) which represents this total amount of reward Learn this value function incrementally by doing updates
based on values of consecutive states (temporal differences).
After the optimal value function has been learnt, optimal behavior can be obtained by taking the action that has, or leads to, the highest value
Use function approximation techniques for generalization if state space becomes too large for tables
One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
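The one-step update above fits in a few lines of code. This sketch uses a hypothetical chain task (states 0..n−1, action 1 moves right toward a goal worth reward 1); the environment and parameters are made up for illustration:

```python
import random

def q_learning(n=5, episodes=400, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """One-step tabular Q-learning on a toy chain MDP.
    Action 0 = left, action 1 = right; entering state n-1 ends the episode
    with reward 1, all other transitions give reward 0."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            if rng.random() < eps:                     # eps-greedy exploration
                a = rng.randrange(2)
            else:
                a = 1 if Q[s][1] >= Q[s][0] else 0
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n - 1 else 0.0
            # one-step Q-learning backup, as in the summary formula
            target = r + (0.0 if s2 == n - 1 else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```
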
![Page 105: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/105.jpg)
RL weaknesses
Still “art” involved in defining good state (and action) representations
Long learning times
![Page 106: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/106.jpg)
Planning and Learning
Use of environment models Integration of planning and learning methods
![Page 107: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/107.jpg)
Models
Model: anything the agent can use to predict how the environment will respond to its actions
Models can be used to produce simulated experience
![Page 108: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/108.jpg)
Planning
Planning: any computational process that uses a model to create or improve a policy
![Page 109: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/109.jpg)
Learning, Planning, and Acting
Two uses of real experience: model learning: to improve
the model direct RL: to directly
improve the value function and policy
Improving value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
![Page 110: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/110.jpg)
Direct vs. Indirect RL
Indirect methods: make fuller use of
experience: get better policy with fewer environment interactions
Direct methods: simpler, and not affected by bad models
But they are very closely related and can be usefully combined:
planning, acting, model learning, and direct RL can occur simultaneously
and in parallel
![Page 111: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/111.jpg)
The Dyna Architecture (Sutton 1990)
![Page 112: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/112.jpg)
The Dyna-Q Algorithm
direct RL
model learning
planning
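These three ingredients (direct RL, model learning, planning) can be sketched together in a minimal tabular Dyna-Q. The one-dimensional corridor "maze" and all parameters are illustrative assumptions, not the example from the slides:

```python
import random

def dyna_q(episodes=30, n=6, n_planning=10, alpha=0.5, gamma=0.95,
           eps=0.1, seed=0):
    """Tabular Dyna-Q on a corridor: states 0..n-1, goal at n-1 (reward 1).
    After each real step, do n_planning simulated backups drawn from a
    learned deterministic model."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n) for a in (0, 1)}
    model = {}                            # (s, a) -> (r, s2)
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 1 if Q[(s, 1)] >= Q[(s, 0)] else 0
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n - 1 else 0.0
            # direct RL: one-step Q-learning on the real transition
            q_next = 0.0 if s2 == n - 1 else max(Q[(s2, b)] for b in (0, 1))
            Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])
            # model learning: remember what the environment did
            model[(s, a)] = (r, s2)
            # planning: replay random remembered transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                pq = 0.0 if ps2 == n - 1 else max(Q[(ps2, b)] for b in (0, 1))
                Q[(ps, pa)] += alpha * (pr + gamma * pq - Q[(ps, pa)])
            s = s2
    return Q
```

The planning loop reuses each real transition many times, which is why Dyna-Q typically needs far fewer environment interactions than direct RL alone.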
![Page 113: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/113.jpg)
Dyna-Q on a Simple Maze
reward = 0 on every step until the goal is reached, when it is 1
![Page 114: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/114.jpg)
Dyna-Q Snapshots: Midway in 2nd Episode
![Page 115: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/115.jpg)
Using Dyna-Q for real-time robot learning
Before learning After learning (approx. 15 minutes)
![Page 116: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/116.jpg)
Multi-agent RL
So far we considered only single-agent RL. But many domains have multiple agents!
Group of industrial robots working on a single car, robot soccer, traffic
Can we extend the methods of single-agent RL to multi-agent RL?
![Page 117: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/117.jpg)
Dimensions of multi-agent RL
Is the objective to maximize individual rewards or to maximize global rewards? Competition vs. cooperation
Do the agents share information? Shared state representation? Communication?
Homogeneous or heterogeneous agents? Do some agents have special capabilities?
![Page 118: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/118.jpg)
Competition
Like multiple single-agent cases simultaneously Related to game theory
Nash equilibria etc. Research goals
study how to optimize individual rewards in the face of competition
study group dynamics
![Page 119: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/119.jpg)
Cooperation
More different from single-agent case than competition How can we make the individual agents work together? Are rewards shared among the agents?
should all agents be punished for individual mistakes?
![Page 120: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/120.jpg)
Robot soccer example: cooperation
Riedmiller group in Karlsruhe Robots must play together to beat other groups of robots in
Robocup tournaments Riedmiller group uses reinforcement learning techniques
to do this
![Page 121: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/121.jpg)
Opposite approaches to cooperative case
Consider the multi-agent system as a collection of individual reinforcement learners Design individual reward functions such that
cooperation “emerges” They may become “selfish”, or may not cooperate in a
desirable way Consider the whole multi-agent system as one big MDP
with a large action vector State-action space may become very large, but perhaps
possible with advanced function approximation
![Page 122: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/122.jpg)
Interesting intermediate approach
Let agents learn mostly individually Assign (or learn!) a limited number of states where agents
must coordinate, and at those points consider those agents as a larger single agent
This can be represented and computed efficiently using coordination graphs
Guestrin & Koller (2003), Kok & Vlassis (2004)
![Page 123: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/123.jpg)
Robocup simulation league
Kok & Vlassis (2002-2004)
![Page 124: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/124.jpg)
Advanced Generalization Issues
Generalization over states: tables, linear methods, nonlinear methods
Generalization over actions
Proving convergence with generalization methods
![Page 125: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/125.jpg)
Non-Markov case
Try to do the best you can with non-Markov states Partially Observable MDPs (POMDPs)
– Bayesian approach: belief states
– construct state from sequence of observations
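The Bayesian belief-state update mentioned above is just a Bayes filter. Here is a generic sketch; the dictionary-based representation of the transition model T and observation model O is an assumption made for this example:

```python
def belief_update(b, a, o, T, O):
    """POMDP belief update: b'(s') proportional to O(s',a,o) * sum_s T(s,a,s') b(s).

    b: dict state -> probability (the current belief)
    T[(s, a)]: dict next_state -> probability (transition model)
    O[(s2, a)]: dict observation -> probability (observation model)
    """
    new_b = {}
    reachable = {s2 for s in b for s2 in T[(s, a)]}
    for s2 in reachable:
        pred = sum(b[s] * T[(s, a)].get(s2, 0.0) for s in b)   # prediction step
        new_b[s2] = O[(s2, a)].get(o, 0.0) * pred              # correction step
    z = sum(new_b.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / z for s2, p in new_b.items()}              # normalize
```
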
![Page 126: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/126.jpg)
Other issues
Model-free vs. model-based Value functions vs. directly searching for good policies
(e.g. using genetic algorithms)
Hierarchical methods
Incorporating prior knowledge: advice and hints, trainers and teachers, shaping, Lyapunov functions, etc.
![Page 127: Reinforcement Learning HUT Spatial Intelligence course August/September 2004 Bram Bakker Computer Science, University of Amsterdam bram@science.uva.nl.](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649ddf5503460f94ad8d22/html5/thumbnails/127.jpg)
The end!