Reinforcement Learning


Transcript of Reinforcement Learning

Page 1: Reinforcement Learning

CS 5751 Machine Learning, Chapter 13: Reinforcement Learning

Reinforcement Learning

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Page 2: Reinforcement Learning


Control Learning

Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

Page 3: Reinforcement Learning


One Example: TD-Gammon (Tesauro, 1995)

Learn to play Backgammon. Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states

Trained by playing 1.5 million games against itself. Now approximately equal to the best human player.

Page 4: Reinforcement Learning


Reinforcement Learning Problem

[Diagram: the agent and environment interact in a loop; at each step the agent observes state and reward and emits an action, producing the sequence s_0, r_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, ...]

Goal: learn to choose actions that maximize

$$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots, \quad \text{where } 0 \le \gamma < 1$$

Page 5: Reinforcement Learning


Markov Decision Process

Assume:
• finite set of states S
• set of actions A
• at each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  – i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent

(A small code sketch of such an environment follows.)
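To make this concrete, here is a minimal Python sketch of a deterministic MDP, modeled on the 3x2 grid world with absorbing goal G that appears on a later slide. The names (delta, reward, GOAL, and so on) are my own; the slides give no code.

```python
# A minimal deterministic MDP: states are cells of a 3x2 grid, G (top-right)
# is absorbing, and the reward is 100 for any action entering G, 0 otherwise.
GAMMA = 0.9                          # discount factor used in the slides' example
COLS, ROWS = 3, 2
GOAL = (2, 0)                        # top-right cell
STATES = [(c, r) for r in range(ROWS) for c in range(COLS)]
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def delta(state, action):
    """State-transition function delta(s, a): deterministic grid moves."""
    if state == GOAL:                # absorbing goal state
        return state
    dc, dr = ACTIONS[action]
    c, r = state[0] + dc, state[1] + dr
    if 0 <= c < COLS and 0 <= r < ROWS:
        return (c, r)
    return state                     # bumping into a wall leaves the state unchanged

def reward(state, action):
    """Immediate reward r(s, a): 100 for entering G, 0 otherwise."""
    return 100 if state != GOAL and delta(state, action) == GOAL else 0
```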

Page 6: Reinforcement Learning


Agent's Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes

$$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$$

from any starting state in S
• here $0 \le \gamma < 1$ is the discount factor for future rewards

Note something new:
• target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩
• training examples are of the form ⟨⟨s, a⟩, r⟩

Page 7: Reinforcement Learning


Value Function

To begin, consider deterministic worlds. For each possible policy π the agent might adopt, we can define an evaluation function over states:

$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $r_t, r_{t+1}, \dots$ are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*:

$$\pi^* \equiv \operatorname*{argmax}_{\pi} V^{\pi}(s), \ (\forall s)$$
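In a deterministic world, $V^{\pi}(s)$ is simply the discounted sum of the single reward sequence generated by following π from s. A tiny helper (hypothetical name, with γ = 0.9 as in the grid-world example) makes the definition concrete:

```python
def discounted_return(rewards, gamma=0.9):
    """r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one observed rollout."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A policy that reaches the goal (reward 100) on its third action:
print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0
```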

Page 8: Reinforcement Learning


[Figure: a 3x2 grid world with an absorbing goal state G in the top-right cell, shown in four panels. Panel 1: r(s,a) (immediate reward) values, 100 for each action entering G and 0 for all others. Panel 2: the resulting Q(s,a) values (100, 90, 81, 72). Panel 3: the V*(s) values (100, 90, 81). Panel 4: one optimal policy, arrows leading toward G.]
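To see where these values come from (the slides' example uses γ = 0.9, so the following worked chain is implied rather than shown): the state whose best action enters G has

$$V^* = 100 + \gamma \cdot 0 = 100$$

since G is absorbing and yields no further reward; the state one move further away has $V^* = 0 + \gamma \cdot 100 = 90$, and the next one $V^* = 0 + \gamma \cdot 90 = 81$. Each Q(s,a) value is then $r(s,a) + \gamma V^*(\delta(s,a))$, e.g. $0 + 0.9 \cdot 100 = 90$.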

Page 9: Reinforcement Learning


What to Learn

We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as V*). We could then do a lookahead search to choose the best action from any state s because

$$\pi^*(s) = \operatorname*{argmax}_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$

A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℜ
• But when it doesn't, it can't choose actions this way
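As a sketch of that lookahead, assuming the delta and reward functions from the earlier grid-world sketch and a dictionary V_star of V* values (all names are illustrative, not from the slides):

```python
def lookahead_policy(s, V_star, delta, reward, actions, gamma=0.9):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ].
    Usable only when delta and reward are known to the agent."""
    return max(actions, key=lambda a: reward(s, a) + gamma * V_star[delta(s, a)])
```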

Page 10: Reinforcement Learning


Q Function

Define a new function very similar to V*:

$$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$$

If the agent learns Q, it can choose the optimal action even without knowing δ!

$$\pi^*(s) = \operatorname*{argmax}_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right] = \operatorname*{argmax}_{a} Q(s,a)$$

Q is the evaluation function the agent will learn.
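By contrast, once a Q table has been learned, action selection needs neither δ nor r. A one-line sketch (the dictionary keyed by (state, action) pairs is an assumed layout):

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a), read directly from the learned table."""
    return max(actions, key=lambda a: Q[(s, a)])
```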

Page 11: Reinforcement Learning


Training Rule to Learn Q

Note Q and V* are closely related:

$$V^*(s) = \max_{a'} Q(s,a')$$

which allows us to write Q recursively as

$$Q(s_t,a_t) = r(s_t,a_t) + \gamma V^*(\delta(s_t,a_t)) = r(s_t,a_t) + \gamma \max_{a'} Q(s_{t+1},a')$$

Let $\hat{Q}$ denote the learner's current approximation to Q. Consider the training rule

$$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$$

where s' is the state resulting from applying action a in state s.

Page 12: Reinforcement Learning


Q Learning for Deterministic Worlds

For each s, a initialize the table entry $\hat{Q}(s,a) \leftarrow 0$.
Observe the current state s.
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for $\hat{Q}(s,a)$ as follows:

  $$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$$

• s ← s'

(A runnable sketch of this loop follows.)
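A runnable sketch of the loop on the 3x2 grid world, reusing the delta, reward, STATES, ACTIONS, GOAL, and GAMMA definitions from the MDP sketch above. Random action selection stands in for whatever exploration strategy the agent uses; this is an illustration, not the slides' code.

```python
import random

def q_learning(episodes=500, gamma=GAMMA):
    # For each (s, a), initialize the table entry Q_hat(s, a) to 0
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = random.choice([st for st in STATES if st != GOAL])   # observe current state
        while s != GOAL:
            a = random.choice(list(ACTIONS))        # select an action and execute it
            r, s_next = reward(s, a), delta(s, a)   # receive reward r, observe new state s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next                              # s <- s'
    return Q

Q = q_learning()
# The state just left of G should converge to max_a Q_hat(s, a) = 100
print(max(Q[((1, 0), a)] for a in ACTIONS))
```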

Page 13: Reinforcement Learning


Updating Q̂

[Figure: the agent moves right from initial state s1 to next state s2; Q̂(s1, a_right) is initially 72, and the Q̂ values on the arcs leaving s2 are 63, 81, and 100.]

$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$

Notice that if rewards are non-negative, then

$$(\forall s, a, n)\ \ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$$

and

$$(\forall s, a, n)\ \ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$$

Page 14: Reinforcement Learning


Convergence

$\hat{Q}$ converges to Q. Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of γ.

Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is

$$\Delta_n = \max_{s,a} \left| \hat{Q}_n(s,a) - Q(s,a) \right|$$

Page 15: Reinforcement Learning


Convergence (cont)

For any table entry $\hat{Q}_n(s,a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is

$$
\begin{aligned}
\left| \hat{Q}_{n+1}(s,a) - Q(s,a) \right|
&= \left| \left( r + \gamma \max_{a'} \hat{Q}_n(s',a') \right) - \left( r + \gamma \max_{a'} Q(s',a') \right) \right| \\
&= \gamma \left| \max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a') \right| \\
&\le \gamma \max_{a'} \left| \hat{Q}_n(s',a') - Q(s',a') \right| \\
&\le \gamma \max_{s'',a'} \left| \hat{Q}_n(s'',a') - Q(s'',a') \right| \\
\left| \hat{Q}_{n+1}(s,a) - Q(s,a) \right| &\le \gamma\, \Delta_n
\end{aligned}
$$

Note we used the general fact that

$$\left| \max_a f_1(a) - \max_a f_2(a) \right| \le \max_a \left| f_1(a) - f_2(a) \right|$$
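To see why this general fact holds (a short justification in the slide's notation): for every a, $f_1(a) \le f_2(a) + |f_1(a) - f_2(a)| \le \max_{a'} f_2(a') + \max_{a'} |f_1(a') - f_2(a')|$, hence $\max_a f_1(a) - \max_a f_2(a) \le \max_a |f_1(a) - f_2(a)|$; exchanging $f_1$ and $f_2$ gives the bound on the absolute value.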

Page 16: Reinforcement Learning


Nondeterministic Case

What if reward and next state are non-deterministic? We redefine V and Q by taking expected values:

$$V^{\pi}(s) \equiv E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \right] \equiv E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$

$$Q(s,a) \equiv E\left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$

Page 17: Reinforcement Learning


Nondeterministic Case

Q learning generalizes to nondeterministic worlds. Alter the training rule to

$$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s,a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \right]$$

where

$$\alpha_n = \frac{1}{1 + visits_n(s,a)}$$

Can still prove convergence of $\hat{Q}$ to Q [Watkins and Dayan, 1992].
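A sketch of this update as a function, with Q and visits as dictionaries keyed by (state, action); the names and table layout are illustrative assumptions:

```python
def nondeterministic_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) * Q_{n-1}(s,a) + alpha_n * [r + gamma * max_a' Q_{n-1}(s',a')]
    with alpha_n = 1 / (1 + visits_n(s,a)), so later updates change the estimate less."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```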

Page 18: Reinforcement Learning


Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

$$Q^{(1)}(s_t,a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1},a)$$

Why not two steps?

$$Q^{(2)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2},a)$$

Or n?

$$Q^{(n)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n},a)$$

Blend all of these:

$$Q^{\lambda}(s_t,a_t) \equiv (1-\lambda)\left[ Q^{(1)}(s_t,a_t) + \lambda Q^{(2)}(s_t,a_t) + \lambda^2 Q^{(3)}(s_t,a_t) + \dots \right]$$

(A small computation sketch follows.)
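A forward-view sketch of these targets for one recorded trajectory. Assumptions of mine, not from the slides: rewards[k] holds r_k, q_next_max[k] holds max_a Q̂(s_k, a) with the entry for a terminal state set to 0 (so q_next_max has one more entry than rewards), and the leftover λ-weight is placed on the longest available n-step target, a common episodic convention.

```python
def n_step_target(rewards, q_next_max, gamma, t, n):
    """Q^(n)(s_t,a_t) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1}
                        + gamma^n max_a Q_hat(s_{t+n}, a)."""
    target = sum((gamma ** k) * rewards[t + k] for k in range(n))
    return target + (gamma ** n) * q_next_max[t + n]

def lambda_return(rewards, q_next_max, gamma, lam, t):
    """Blend (1 - lam) * sum_n lam^(n-1) Q^(n), truncated at the end of the trajectory."""
    N = len(rewards) - t                     # longest n-step target we can form
    total, weight_left = 0.0, 1.0
    for n in range(1, N):
        w = (1.0 - lam) * lam ** (n - 1)
        total += w * n_step_target(rewards, q_next_max, gamma, t, n)
        weight_left -= w
    # the remaining weight lam^(N-1) goes to the longest target
    return total + weight_left * n_step_target(rewards, q_next_max, gamma, t, N)
```

With lam = 0 this reduces to the one-step target Q^(1); with lam = 1 it becomes the full sampled return for the recorded trajectory.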

Page 19: Reinforcement Learning


Temporal Difference Learning (cont)

Equivalent expression:

$$Q^{\lambda}(s_t,a_t) = r_t + \gamma \left[ (1-\lambda) \max_{a} \hat{Q}(s_{t+1},a) + \lambda\, Q^{\lambda}(s_{t+1},a_{t+1}) \right]$$

The TD(λ) algorithm uses the above training rule:
• sometimes converges faster than Q learning
• converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm

Page 20: Reinforcement Learning


Subtleties and Ongoing Research

• Replace the Q̂ table with a neural network or other generalizer
• Handle the case where the state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂ : S × A → S, an approximation to δ
• Relationship to dynamic programming

Page 21: Reinforcement Learning


RL Summary

• Reinforcement learning (RL)
  – control learning
  – delayed reward
  – possible that the state is only partially observable
  – possible that the relationship between states/actions is unknown
• Temporal Difference Learning
  – learn from discrepancies between successive estimates
  – used in TD-Gammon
• V(s) - state value function
  – needs known reward and state-transition functions

Page 22: Reinforcement Learning


RL Summary

• Q(s,a) - state/action value function
  – related to V
  – does not need reward or state-transition functions
  – training rule
  – related to dynamic programming
  – measure the actual reward received for the action plus the future value using the current Q function
  – deterministic - replace the existing estimate
  – nondeterministic - move the table estimate towards the measured estimate
  – convergence - can be shown in both cases