Reinforcement Learning


Transcript of Reinforcement Learning

Page 1: Reinforcement Learning

CS 5751 Machine Learning, Chapter 13: Reinforcement Learning

Reinforcement Learning

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Page 2: Reinforcement Learning


Control Learning

Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

Page 3: Reinforcement Learning


One Example: TD-Gammon (Tesauro, 1995)

Learn to play Backgammon. Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states

Trained by playing 1.5 million games against itself. Now approximately equal to the best human player.

Page 4: Reinforcement Learning


Reinforcement Learning Problem

[Diagram: the agent and environment interact in a loop; at each step the agent observes state and reward and emits an action, producing the sequence s_0, r_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, ...]

Goal: learn to choose actions that maximize

$$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots, \quad \text{where } 0 \le \gamma < 1$$

Page 5: Reinforcement Learning


Markov Decision Process

Assume:
• finite set of states S
• set of actions A
• at each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  – i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent

(A small code sketch of such an environment follows.)
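To make this concrete, here is a minimal Python sketch of a deterministic MDP, modeled on the 3x2 grid world with absorbing goal G that appears on a later slide. The names (delta, reward, GOAL, and so on) are my own; the slides give no code.

```python
# A minimal deterministic MDP: states are cells of a 3x2 grid, G (top-right)
# is absorbing, and the reward is 100 for any action entering G, 0 otherwise.
GAMMA = 0.9                          # discount factor used in the slides' example
COLS, ROWS = 3, 2
GOAL = (2, 0)                        # top-right cell
STATES = [(c, r) for r in range(ROWS) for c in range(COLS)]
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def delta(state, action):
    """State-transition function delta(s, a): deterministic grid moves."""
    if state == GOAL:                # absorbing goal state
        return state
    dc, dr = ACTIONS[action]
    c, r = state[0] + dc, state[1] + dr
    if 0 <= c < COLS and 0 <= r < ROWS:
        return (c, r)
    return state                     # bumping into a wall leaves the state unchanged

def reward(state, action):
    """Immediate reward r(s, a): 100 for entering G, 0 otherwise."""
    return 100 if state != GOAL and delta(state, action) == GOAL else 0
```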

Page 6: Reinforcement Learning


Agent's Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes

$$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$$

from any starting state in S
• here $0 \le \gamma < 1$ is the discount factor for future rewards

Note something new:
• target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩
• training examples are of the form ⟨⟨s, a⟩, r⟩

Page 7: Reinforcement Learning


Value Function

To begin, consider deterministic worlds. For each possible policy π the agent might adopt, we can define an evaluation function over states:

$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $r_t, r_{t+1}, \dots$ are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*:

$$\pi^* \equiv \operatorname*{argmax}_{\pi} V^{\pi}(s), \ (\forall s)$$
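In a deterministic world, $V^{\pi}(s)$ is simply the discounted sum of the single reward sequence generated by following π from s. A tiny helper (hypothetical name, with γ = 0.9 as in the grid-world example) makes the definition concrete:

```python
def discounted_return(rewards, gamma=0.9):
    """r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one observed rollout."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A policy that reaches the goal (reward 100) on its third action:
print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0
```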

Page 8: Reinforcement Learning


[Figure: a 3x2 grid world with an absorbing goal state G in the top-right cell, shown in four panels. Panel 1: r(s,a) (immediate reward) values, 100 for each action entering G and 0 for all others. Panel 2: the resulting Q(s,a) values (100, 90, 81, 72). Panel 3: the V*(s) values (100, 90, 81). Panel 4: one optimal policy, arrows leading toward G.]
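To see where these values come from (the slides' example uses γ = 0.9, so the following worked chain is implied rather than shown): the state whose best action enters G has

$$V^* = 100 + \gamma \cdot 0 = 100$$

since G is absorbing and yields no further reward; the state one move further away has $V^* = 0 + \gamma \cdot 100 = 90$, and the next one $V^* = 0 + \gamma \cdot 90 = 81$. Each Q(s,a) value is then $r(s,a) + \gamma V^*(\delta(s,a))$, e.g. $0 + 0.9 \cdot 100 = 90$.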

Page 9: Reinforcement Learning


What to Learn

We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as V*). We could then do a lookahead search to choose the best action from any state s because

$$\pi^*(s) = \operatorname*{argmax}_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$

A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℜ
• But when it doesn't, it can't choose actions this way
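As a sketch of that lookahead, assuming the delta and reward functions from the earlier grid-world sketch and a dictionary V_star of V* values (all names are illustrative, not from the slides):

```python
def lookahead_policy(s, V_star, delta, reward, actions, gamma=0.9):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ].
    Usable only when delta and reward are known to the agent."""
    return max(actions, key=lambda a: reward(s, a) + gamma * V_star[delta(s, a)])
```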

Page 10: Reinforcement Learning


Q Function

Define a new function very similar to V*:

$$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$$

If the agent learns Q, it can choose the optimal action even without knowing δ!

$$\pi^*(s) = \operatorname*{argmax}_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right] = \operatorname*{argmax}_{a} Q(s,a)$$

Q is the evaluation function the agent will learn.
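By contrast, once a Q table has been learned, action selection needs neither δ nor r. A one-line sketch (the dictionary keyed by (state, action) pairs is an assumed layout):

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a), read directly from the learned table."""
    return max(actions, key=lambda a: Q[(s, a)])
```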

Page 11: Reinforcement Learning


Training Rule to Learn Q

Note Q and V* are closely related:

$$V^*(s) = \max_{a'} Q(s,a')$$

which allows us to write Q recursively as

$$Q(s_t,a_t) = r(s_t,a_t) + \gamma V^*(\delta(s_t,a_t)) = r(s_t,a_t) + \gamma \max_{a'} Q(s_{t+1},a')$$

Let $\hat{Q}$ denote the learner's current approximation to Q. Consider the training rule

$$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$$

where s' is the state resulting from applying action a in state s.

Page 12: Reinforcement Learning


Q Learning for Deterministic Worlds

For each s, a initialize the table entry $\hat{Q}(s,a) \leftarrow 0$.
Observe the current state s.
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for $\hat{Q}(s,a)$ as follows:

  $$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$$

• s ← s'

(A runnable sketch of this loop follows.)
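A runnable sketch of the loop on the 3x2 grid world, reusing the delta, reward, STATES, ACTIONS, GOAL, and GAMMA definitions from the MDP sketch above. Random action selection stands in for whatever exploration strategy the agent uses; this is an illustration, not the slides' code.

```python
import random

def q_learning(episodes=500, gamma=GAMMA):
    # For each (s, a), initialize the table entry Q_hat(s, a) to 0
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = random.choice([st for st in STATES if st != GOAL])   # observe current state
        while s != GOAL:
            a = random.choice(list(ACTIONS))        # select an action and execute it
            r, s_next = reward(s, a), delta(s, a)   # receive reward r, observe new state s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next                              # s <- s'
    return Q

Q = q_learning()
# The state just left of G should converge to max_a Q_hat(s, a) = 100
print(max(Q[((1, 0), a)] for a in ACTIONS))
```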

Page 13: Reinforcement Learning


Updating Q̂

[Figure: the agent moves right from initial state s1 to next state s2; Q̂(s1, a_right) is initially 72, and the Q̂ values on the arcs leaving s2 are 63, 81, and 100.]

$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$

Notice that if rewards are non-negative, then

$$(\forall s, a, n)\ \ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$$

and

$$(\forall s, a, n)\ \ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$$

Page 14: Reinforcement Learning


Convergence

$\hat{Q}$ converges to Q. Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of γ.

Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is

$$\Delta_n = \max_{s,a} \left| \hat{Q}_n(s,a) - Q(s,a) \right|$$

Page 15: Reinforcement Learning


Convergence (cont)

For any table entry $\hat{Q}_n(s,a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is

$$
\begin{aligned}
\left| \hat{Q}_{n+1}(s,a) - Q(s,a) \right|
&= \left| \left( r + \gamma \max_{a'} \hat{Q}_n(s',a') \right) - \left( r + \gamma \max_{a'} Q(s',a') \right) \right| \\
&= \gamma \left| \max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a') \right| \\
&\le \gamma \max_{a'} \left| \hat{Q}_n(s',a') - Q(s',a') \right| \\
&\le \gamma \max_{s'',a'} \left| \hat{Q}_n(s'',a') - Q(s'',a') \right| \\
\left| \hat{Q}_{n+1}(s,a) - Q(s,a) \right| &\le \gamma\, \Delta_n
\end{aligned}
$$

Note we used the general fact that

$$\left| \max_a f_1(a) - \max_a f_2(a) \right| \le \max_a \left| f_1(a) - f_2(a) \right|$$
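To see why this general fact holds (a short justification in the slide's notation): for every a, $f_1(a) \le f_2(a) + |f_1(a) - f_2(a)| \le \max_{a'} f_2(a') + \max_{a'} |f_1(a') - f_2(a')|$, hence $\max_a f_1(a) - \max_a f_2(a) \le \max_a |f_1(a) - f_2(a)|$; exchanging $f_1$ and $f_2$ gives the bound on the absolute value.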

Page 16: Reinforcement Learning


Nondeterministic Case

What if reward and next state are non-deterministic? We redefine V and Q by taking expected values:

$$V^{\pi}(s) \equiv E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \right] \equiv E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$

$$Q(s,a) \equiv E\left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$

Page 17: Reinforcement Learning


Nondeterministic Case

Q learning generalizes to nondeterministic worlds. Alter the training rule to

$$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s,a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \right]$$

where

$$\alpha_n = \frac{1}{1 + visits_n(s,a)}$$

Can still prove convergence of $\hat{Q}$ to Q [Watkins and Dayan, 1992].
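A sketch of this update as a function, with Q and visits as dictionaries keyed by (state, action); the names and table layout are illustrative assumptions:

```python
def nondeterministic_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) * Q_{n-1}(s,a) + alpha_n * [r + gamma * max_a' Q_{n-1}(s',a')]
    with alpha_n = 1 / (1 + visits_n(s,a)), so later updates change the estimate less."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```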

Page 18: Reinforcement Learning


Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

$$Q^{(1)}(s_t,a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1},a)$$

Why not two steps?

$$Q^{(2)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2},a)$$

Or n?

$$Q^{(n)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n},a)$$

Blend all of these:

$$Q^{\lambda}(s_t,a_t) \equiv (1-\lambda)\left[ Q^{(1)}(s_t,a_t) + \lambda Q^{(2)}(s_t,a_t) + \lambda^2 Q^{(3)}(s_t,a_t) + \dots \right]$$

(A small computation sketch follows.)
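A forward-view sketch of these targets for one recorded trajectory. Assumptions of mine, not from the slides: rewards[k] holds r_k, q_next_max[k] holds max_a Q̂(s_k, a) with the entry for a terminal state set to 0 (so q_next_max has one more entry than rewards), and the leftover λ-weight is placed on the longest available n-step target, a common episodic convention.

```python
def n_step_target(rewards, q_next_max, gamma, t, n):
    """Q^(n)(s_t,a_t) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1}
                        + gamma^n max_a Q_hat(s_{t+n}, a)."""
    target = sum((gamma ** k) * rewards[t + k] for k in range(n))
    return target + (gamma ** n) * q_next_max[t + n]

def lambda_return(rewards, q_next_max, gamma, lam, t):
    """Blend (1 - lam) * sum_n lam^(n-1) Q^(n), truncated at the end of the trajectory."""
    N = len(rewards) - t                     # longest n-step target we can form
    total, weight_left = 0.0, 1.0
    for n in range(1, N):
        w = (1.0 - lam) * lam ** (n - 1)
        total += w * n_step_target(rewards, q_next_max, gamma, t, n)
        weight_left -= w
    # the remaining weight lam^(N-1) goes to the longest target
    return total + weight_left * n_step_target(rewards, q_next_max, gamma, t, N)
```

With lam = 0 this reduces to the one-step target Q^(1); with lam = 1 it becomes the full sampled return for the recorded trajectory.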

Page 19: Reinforcement Learning


Temporal Difference Learning (cont)

Equivalent expression:

$$Q^{\lambda}(s_t,a_t) = r_t + \gamma \left[ (1-\lambda) \max_{a} \hat{Q}(s_{t+1},a) + \lambda\, Q^{\lambda}(s_{t+1},a_{t+1}) \right]$$

The TD(λ) algorithm uses the above training rule:
• sometimes converges faster than Q learning
• converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm

Page 20: Reinforcement Learning


Subtleties and Ongoing Research

• Replace the Q̂ table with a neural network or other generalizer
• Handle the case where the state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂ : S × A → S, an approximation to δ
• Relationship to dynamic programming

Page 21: Reinforcement Learning


RL Summary

• Reinforcement learning (RL)
  – control learning
  – delayed reward
  – possible that the state is only partially observable
  – possible that the relationship between states/actions is unknown
• Temporal Difference Learning
  – learn from discrepancies between successive estimates
  – used in TD-Gammon
• V(s) - state value function
  – needs known reward and state-transition functions

Page 22: Reinforcement Learning


RL Summary

• Q(s,a) - state/action value function
  – related to V
  – does not need reward or state-transition functions
  – training rule
  – related to dynamic programming
  – measure the actual reward received for the action plus the future value using the current Q function
  – deterministic - replace the existing estimate
  – nondeterministic - move the table estimate towards the measured estimate
  – convergence - can be shown in both cases