Reinforcement Learning
description
Transcript of Reinforcement Learning
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 1
Reinforcement Learning • Control learning
• Control polices that choose optimal actions
• Q learning
• Convergence
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 2
Control LearningConsider learning to choose actions, e.g.,• Robot learning to dock on battery charger• Learning to choose actions to optimize factory
output• Learning to play BackgammonNote several problem characteristics• Delayed reward• Opportunity for active exploration• Possibility that state only partially observable• Possible need to learn multiple tasks with same
sensors/effectors
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 3
One Example: TD-GammonTesauro, 1995
Learn to play BackgammonImmediate reward• +100 if win• -100 if lose• 0 for all other states
Trained by playing 1.5 million games against itselfNow approximately equal to best human player
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 4
Reinforcement Learning Problem
Environment
Agent
state action reward
s0 r0
a0 s1 r1
a1 s2 r2
a2 ...
Goal: learn to choose actions that maximize r0 + r1 + 2r2 + …, where 0 < 1
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 5
Markov Decision ProcessAssume• finite set of states S• set of actions A• at each discrete time, agent observes state st S and
choose action at A
• then receives immediate reward rt
• and state changes to st+1
• Markov assumption: st+1 = (st, at) and rt = r(st, at)– i.e., rt and st+1 depend only on current state and action– functions and r may be nondeterministic– functions and r no necessarily known to agent
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 6
Agent’s Learning TaskExecute action in environment, observe results, and• learn action policy : S A that maximizes
E[rt + rt+1 + 2rt+2 + …]from any starting state in S
• here 0 < 1 is the discount factor for future rewards
Note something new:• target function is : S A • but we have no training examples of form <s,a>• training examples are of form <<s,a>,r>
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 7
Value FunctionTo begin, consider deterministic worlds …For each possible policy the agent might adopt, we
can define an evaluation function over states
where rt,rt+1,… are generated by following policy starting at state s
Restated, the task is to learn the optimal policy *
0
22
1π
γ
... γ γ )(V
iit
i
ttt
r
rrrs
)(),(Vargmaxπ* π
πss
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 8
G
100
100
0 00
0
00 0
0
0
0
0
r(s,a) (immediate reward) values
G
100
100
90
0
Q(s,a) values
909072
72
81
81
81
8181
G
100
100
90
0
V*(s) values
90
81
G
One optimal policy
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 9
What to LearnWe might try to have agent learn the evaluation
function V* (which we write as V*)We could then do a lookahead search to choose best
action from any state s because
A problem:• This works well if agent knows a : S A S, and
r : S A • But when it doesn’t, we can’t choose actions this way
(s,a))V*(r(s,a)(s) δ γargmaxπ*a
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 10
Q FunctionDefine new function very similar to V*
If agent learns Q, it can choose optimal action even without knowing d!
Q is the evaluation function the agent will learn
),(argmaxπ*
δ γargmaxπ*
a
a
asQ(s)
(s,a))V*(r(s,a)(s)
(s,a))V*(r(s,a)asQ δ γ),(
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 11
Training Rule to Learn QNote Q and V* closely related:
Which allows us to write Q recursively as
Let denote learner’s current approximation to Q. Consider training rule
where s' is the state resulting from applying action a in state s
),(max*a
asQ(s)V
)(max γ δ γ)(
1 a,sQ),ar(s)),a(sV*(),ar(s,asQ
tatt
tttttt
)(ˆmax γ)(ˆ a,sQrs,aQa
Q̂
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 12
Q Learning for Deterministic WorldsFor each s,a initialize table entry Observe current state sDo forever:• Select an action a and execute it• Receive immediate reward r• Observe the new state s'• Update the table entry for as follows:
• s s'
)(ˆmax γ)(ˆ a,sQrs,aQa
0)(ˆ s,aQ
)(ˆ s,aQ
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 13
Updating100
initial state: s1
72
63
81R 100
next state: s2
90
63
81R
aright
),(),(ˆ 0 ),,(
and),(ˆ),(ˆ ),,(
thennegative,-non rewards if notice90 }100,81,63max{9.00
)(Q̂maxγ),(ˆ
1
2a1
asQasQnas
asQasQnas
a,srasQ
n
nn
right
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 14
Convergence converges to Q. Consider case of deterministic world
where each <s,a> visited infinitely often.
Proof: define a full interval to be an interval during which
each <s,a> is visited. During each full interval the largest
error in table is reduced by factor of
Let be table after n updates, and n be the maximum error
in ; that is
Q̂
Q̂
nQ̂
nQ̂
),(),(ˆmax,
asQasQnasn
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 15
Convergence (cont)For any table entry updated on iteration n+1, the
error in the revised estimate is ),(ˆ asQn
),(ˆ asQn
(a)f(a)f(a)f(a)f
asQasQ
)a,sQ()a,s(Q
)a,sQ()a,s(Q
)a,sQ()a,s(Q
))a,sQ((r))a,s(Q(rasQasQ
aaa
nn
nas
na
ana
anan
2121
1
,
1
-max max- max
fact that general used weNote
γ ),(),(ˆ
ˆmaxγ
ˆmaxγ
maxˆmaxγ
maxγˆmaxγ),(),(ˆ
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 16
Nondeterministic CaseWhat if reward and next state are non-deterministic?We redefine V,Q by taking expected values
]δ*γ),([E),(
γE
...]γγE[
0i
22
1π
(s,a))(VasrasQ
r
rrr(s)V
i it
ttt
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 17
Nondeterministic CaseQ learning generalizes to nondeterministic worldsAlter training rule to
where
Can still prove converge of to Q [Watkins and Dayan, 1992]
Q̂
)],(ˆmax[α),(ˆ)α1(),(ˆ11 asQrasQasQ nannnn
),(11α
asvisitsnn
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 18
Temporal Difference LearningQ learning: reduce discrepancy between successive
Q estimatesOne step time difference:
Why not two steps?
Or n?
Blend all of these:
),(ˆmaxγ),( 1)1( asQrasQ tattt
),(ˆmaxγγ),( 22
1)2( asQrrasQ tatttt
),(ˆmaxγγ...γ),( n1
1-n1
)( asQrrrasQ ntantttttn
...),(λ),(λ),()λ1(),( (3)2(2)(1)λ tttttttt asQasQasQasQ
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 19
Temporal Difference Learning
Equivalent expression:
TD() algorithm uses above training rule• Sometimes converges faster than Q learning• converges for learning V* for any 0 1
(Dayan, 1992)• Tesauro’s TD-Gammon uses this algorithm
...),(λ),(λ),()λ1(),( (3)2(2)(1)λ tttttttt asQasQasQasQ
),(λ),(ˆmax)λ1(γ),( 11λλ
ttttattt asQasQrasQ
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 20
Subtleties and Ongoing Research• Replace table with neural network or other
generalizer
• Handle case where state only partially observable
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use d : S A S, d approximation to
• Relationship to dynamic programming
Q̂
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 21
RL Summary• Reinforcement learning (RL)
– control learning– delayed reward– possible that the state is only partially observable– possible that the relationship between states/actions
unknown• Temporal Difference Learning
– learn discrepancies between successive estimates– used in TD-Gammon
• V(s) - state value function– needs known reward/state transition functions
CS 5751 Machine Learning
Chapter 13 Reinforcement Learning 22
RL Summary• Q(s,a) - state/action value function
– related to V– does not need reward/state trans functions– training rule– related to dynamic programming– measure actual reward received for action and future
value using current Q function– deterministic - replace existing estimate– nondeterministic - move table estimate towards
measure estimate– convergence - can be shown in both cases