Task
Setup from Lenz et al. 2014
Grasp the green cup.
Output: Sequence of controller actions
Supervised Learning
Setup from Lenz et al. 2014
Grasp the green cup.
Expert Demonstrations
Problem? The policy is trained on expert demonstrations, but at test time it visits states the expert never demonstrated: the training and test distributions differ, and there is no exploration.
Exploring the environment
What is reinforcement learning?
“Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment”
- R. Sutton and A. Barto
Interaction with the environment
(Setup from Lenz et al. 2014)
[Figure: the agent takes an action; the environment returns a scalar reward and a new state.]
Episodic vs Non-Episodic
In an episodic task the interaction ends after a finite number of steps (e.g., one grasping attempt); in a non-episodic (continuing) task it goes on indefinitely.

Rollout
(Setup from Lenz et al. 2014)
A rollout is a trajectory of states, actions, and rewards:
$$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$$
Setup
A single transition $s_t \xrightarrow{a_t,\, r_t} s_{t+1}$: taking action $a_t$ in state $s_t$ yields a scalar reward $r_t$ (e.g., 1$) and the next state $s_{t+1}$.
Policy
A policy $\pi(s, a)$ gives the probability of taking action $a$ in state $s$, e.g., $\pi(s, a) = 0.9$.
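As a concrete sketch, a stochastic tabular policy can be stored as a table of action probabilities per state; the state and action names below are made up for illustration:

```python
import random

# A tabular stochastic policy pi(s, a): for each state, a map from actions
# to probabilities. State/action names are illustrative only.
policy = {
    "near_cup": {"grasp": 0.9, "reach": 0.1},
    "far_away": {"grasp": 0.1, "reach": 0.9},
}

def sample_action(pi, state):
    """Draw an action a ~ pi(state, .)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```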
Objective?
Objective
(Setup from Lenz et al. 2014)
Given a rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$, maximize the expected reward:
$$\mathbb{E}\left[\sum_{t=1}^{n} r_t\right]$$
Problem?
Discounted Reward
Maximize expected reward:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} r_{t+1}\right]$$
Problem? For non-episodic tasks this sum can be unbounded.
Fix: discount future reward:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right], \qquad \gamma \in [0, 1)$$
Discounted Reward
Maximize the discounted expected reward:
$$\mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$$
The discounted sum is bounded: if $r \le M$ and $\gamma \in [0, 1)$, then
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1 - \gamma}$$
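A quick numeric check of this bound (a sketch; the reward sequence is made up, with every reward at its maximum $M = 1$):

```python
# Discounted return sum_t gamma^t * r_{t+1} with rewards bounded by M = 1.
gamma, M = 0.9, 1.0
rewards = [M] * 1000  # worst case: maximal reward at every step

discounted = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted, "<=", M / (1 - gamma))  # 9.9999... <= 10.0
```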
Need for discounting
• To keep the problem well-formed
• Evidence that humans discount future reward
Markov Decision Process
An MDP is a tuple $(S, A, P, R, \gamma)$ where
• $S$ is a set of finite or infinite states
• $A$ is a set of finite or infinite actions
• For the transition $s \xrightarrow{a} s'$:
  • $P^{a}_{s,s'} \in P$ is the transition probability
  • $R^{a}_{s,s'} \in R$ is the reward for the transition
• $\gamma \in [0, 1]$ is the discount factor
Markov assumption: the transition probability and reward depend only on the current state $s$ and action $a$, not on earlier history.
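A sketch of one possible representation of a finite MDP as dense arrays (the sizes and random entries are purely illustrative):

```python
import numpy as np

# (S, A, P, R, gamma): P[s, a, s'] is the transition probability and
# R[s, a, s'] the reward for the transition s -a-> s'.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] is a distribution over s'
R = rng.random((n_states, n_actions, n_states))
gamma = 0.9
```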
MDP Example
[Figure: example MDP from Sutton and Barto 1998]
Summary
Maximize the discounted expected reward:
$$\mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$$
An MDP is a tuple $(S, A, P, R, \gamma)$.
The agent controls the policy $\pi(s, a)$.
What we learned
Reinforcement Learning: agent-reward-environment loop, no supervision, exploration
MDP, policy
Value functions
Expected reward from following a policy.
State value function:
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$$
State-action value function:
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, a_1 = a, \pi\right]$$
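These expectations can be estimated by averaging sampled returns. A minimal sketch, assuming the dense-array MDP from the earlier sketch and a policy stored as rows `pi[s]` of action probabilities; truncating the infinite sum at a horizon $H$ costs at most $\gamma^H M/(1-\gamma)$ when rewards are bounded by $M$:

```python
import numpy as np

def mc_value_estimate(P, R, pi, gamma, s0, n_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(s0): average truncated discounted returns."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):                      # truncate the infinite sum
            a = rng.choice(len(pi[s]), p=pi[s])       # a  ~ pi(s, .)
            s_next = rng.choice(n_states, p=P[s, a])  # s' ~ P^a_{s, .}
            ret += disc * R[s, a, s_next]
            disc *= gamma
            s = s_next
        total += ret
    return total / n_rollouts
```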
State Value function
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$$
[Figure: tree of trajectories from $s_1$; edges $s_1 \to s_2 \to s_3 \cdots$ carry probabilities $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$, $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}, \ldots$ and rewards $R^{a_1}_{s_1,s_2}, R^{a_2}_{s_2,s_3}, \ldots$]
State Value function
Write a trajectory as $t = \langle s_1, a_1, s_2, a_2 \cdots \rangle = \langle s_1, a_1, s_2 \rangle : t'$, i.e., the first transition followed by the tail $t'$, and note $P(s_1, a_1, s_2) = \pi(s_1, a_1) P^{a_1}_{s_1,s_2}$. Then
$$V^{\pi}(s_1) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{t} (r_1 + \gamma r_2 \cdots)\, p(t)$$
$$= \sum_{a_1, s_2} \sum_{t'} P(s_1, a_1, s_2)\, P(t' \mid s_1, a_1, s_2)\left\{R^{a_1}_{s_1,s_2} + \gamma (r_2 \cdots)\right\}$$
$$= \sum_{a_1, s_2} P(s_1, a_1, s_2)\Big\{R^{a_1}_{s_1,s_2} + \gamma \underbrace{\sum_{t'} P(t' \mid s_1, a_1, s_2)(r_2 \cdots)}_{V^{\pi}(s_2)}\Big\}$$
$$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2}\left\{R^{a_1}_{s_1,s_2} + \gamma V^{\pi}(s_2)\right\}$$
Bellman Self-Consistency Eqn
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Similarly,
$$Q^{\pi}(s, a) = \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
$$Q^{\pi}(s, a) = \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a')\right\}$$
Bellman Self-Consistency Eqn
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Given $N$ states, we have $N$ linear equations in $N$ variables. Solve the above system.
Does it have a unique solution? Yes, it does. Exercise: Prove it.
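Because the self-consistency equation is linear in $V^{\pi}$, the $N \times N$ system can be solved directly. A sketch, assuming the dense-array MDP and a policy matrix `pi[s, a]` as in the earlier sketches:

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for V^pi, where P_pi and r_pi
    are the transition matrix and expected reward under policy pi."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # P_pi[s,s'] = sum_a pi(s,a) P^a_{s,s'}
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)  # expected one-step reward
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

The direct solve costs $O(N^3)$ for $N$ states, one motivation for the iterative methods that follow.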
Optimal Policy
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if:
$$V^{\pi_1}(s) \ge V^{\pi_2}(s)$$
How to define a globally optimal policy? $\pi^*$ is an optimal policy if:
$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \forall \pi$$
Does it always exist? Yes, it always does.
Existence of Optimal Policy
Leader policy for every state $s$: $\pi_s = \arg\max_{\pi} V^{\pi}(s)$. Define $\pi^*(s, a) = \pi_s(s, a) \;\; \forall s, a$.
To show $\pi^*$ is optimal, or equivalently $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$, expand both values at $s$, where $\pi^*$ agrees with $\pi_s$:
$$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi^*}(s')\right\}$$
$$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi_s}(s')\right\}$$
Subtracting, the reward terms cancel:
$$\Delta(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\left\{V^{\pi^*}(s') - V^{\pi_s}(s')\right\} \ge \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\, \Delta(s')$$
using $V^{\pi_s}(s') \le V^{\pi_{s'}}(s')$ (the leader at $s'$ is at least as good there). The right-hand side is $\gamma$ times a convex combination of $\Delta(s')$ values, hence
$$\Delta(s) \ge \gamma \min_{s'} \Delta(s') \;\Rightarrow\; \min_{s} \Delta(s) \ge \gamma \min_{s'} \Delta(s')$$
Since $\gamma \in [0, 1)$, this forces $\min_{s} \Delta(s) \ge 0$. Hence proved.
Bellman’s Optimality Condition
Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$.
$$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a) \le \max_{a} Q^{\pi^*}(s, a)$$
Claim: $V^{\pi^*}(s) = \max_{a} Q^{\pi^*}(s, a)$.
Suppose, for contradiction, that $V^{\pi^*}(s) < \max_{a} Q^{\pi^*}(s, a)$ for some $s$, and define $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$.
Bellman’s Optimality Condition
For the deterministic policy $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$:
$$V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s)), \qquad V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$$
$$\Delta(s) = V^{\pi'}(s) - V^{\pi^*}(s) = Q^{\pi'}(s, \pi'(s)) - \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$$
$$\ge Q^{\pi'}(s, \pi'(s)) - Q^{\pi^*}(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'}\, \Delta(s')$$
By the same minimum argument as before, $\Delta(s) \ge 0$ for all $s$, and $\Delta(s') > 0$ for some $s'$ (wherever $V^{\pi^*} < \max_a Q^{\pi^*}$ is strict), so $\pi^*$ would not be optimal: a contradiction.
Bellman’s Optimality Condition
$$V^*(s) = \max_{a} Q^*(s, a)$$
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
Similarly,
$$Q^*(s, a) = \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma \max_{a'} Q^*(s', a')\right\}$$
Optimal policy from Q value
Given $Q^*(s, a)$, an optimal policy is given by:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
Corollary: every MDP has a deterministic optimal policy.
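A sketch of reading off this policy from a Q table stored as a (states × actions) array (an illustrative representation, not from the slides):

```python
import numpy as np

def greedy_policy(Q_star):
    """Deterministic optimal policy: pi*(s) = argmax_a Q*(s, a)."""
    return np.argmax(Q_star, axis=1)

def v_from_q(Q_star):
    """V*(s) = max_a Q*(s, a), per Bellman's optimality condition."""
    return Q_star.max(axis=1)
```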
Summary
Bellman’s self-consistency equation:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Bellman’s optimality condition:
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
An optimal policy $\pi^*$ exists such that:
$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \forall \pi$$
What we learned
Reinforcement Learning: agent-reward-environment loop, no supervision, exploration
MDP, policy
Consistency equation, optimal policy, optimality condition
Solving an MDP
To solve an MDP is to find an optimal policy.
Bellman’s optimality condition:
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
Iteratively solve the above equation.
Bellman Backup Operator
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
Define the operator $T : \mathcal{V} \to \mathcal{V}$ by
$$(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V(s')\right\}$$
Dynamic Programming Solution
Initialize $V^0$ randomly.
do
  $V^{t+1} = TV^t$, i.e., $V^{t+1}(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^t(s')\right\}$
until $\|V^{t+1} - V^t\|_\infty \le \epsilon$
return $V^{t+1}$
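A runnable sketch of this loop under the dense-array MDP representation used in the earlier sketches:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Repeat V <- TV until the sup-norm change is at most eps."""
    n_states = P.shape[0]
    V = np.zeros(n_states)  # any V^0 works; convergence holds for all of them
    while True:
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                   # (TV)(s) = max_a Q[s, a]
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, np.argmax(Q, axis=1)  # approximate V* and greedy policy
        V = V_new
```

The termination test matches the $\|V^{t+1} - V^t\|_\infty \le \epsilon$ condition above; the loop converges because $T$ is a $\gamma$-contraction, as shown next.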
Convergence
Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_k|\}$ for $x \in \mathbb{R}^k$.
Proof: Using $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$,
$$|(TV_1)(s) - (TV_2)(s)| = \left|\max_{a} \sum_{s'} P^{a}_{s,s'}\{R^{a}_{s,s'} + \gamma V_1(s')\} - \max_{a} \sum_{s'} P^{a}_{s,s'}\{R^{a}_{s,s'} + \gamma V_2(s')\}\right|$$
$$\le \max_{a} \gamma \left|\sum_{s'} P^{a}_{s,s'}\big(V_1(s') - V_2(s')\big)\right| \le \max_{a} \gamma \sum_{s'} P^{a}_{s,s'} \left|V_1(s') - V_2(s')\right|$$
$$\le \max_{a} \gamma \max_{s'} |V_1(s') - V_2(s')| = \gamma \|V_1 - V_2\|_\infty$$
$$\Rightarrow \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
So $T$ is a $\gamma$-contraction in the sup norm.
Optimal $V^*$ is a fixed point
$$V^* = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\} = TV^*$$
$V^*$ is a fixed point of $T$.
Optimal $V^*$ is the fixed point
$V^*$ is a fixed point of $T$: $V^* = TV^*$.
Theorem: $V^*$ is the only fixed point of $T$.
Proof: Suppose $TV_1 = V_1$ and $TV_2 = V_2$. Then
$$\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$.
Dynamic Programming Solution
Initialize $V^0$ randomly; repeat $V^{t+1} = TV^t$ until $\|V^{t+1} - V^t\|_\infty \le \epsilon$; return $V^{t+1}$.
Theorem: the algorithm converges for all $V^0$.
Proof:
$$\|V^{t+1} - V^*\|_\infty = \|TV^t - TV^*\|_\infty \le \gamma \|V^t - V^*\|_\infty \;\Rightarrow\; \|V^t - V^*\|_\infty \le \gamma^t \|V^0 - V^*\|_\infty$$
$$\lim_{t \to \infty} \|V^t - V^*\|_\infty \le \lim_{t \to \infty} \gamma^t \|V^0 - V^*\|_\infty = 0$$
Problem?
Summary
Iteratively solving the optimality condition:
$$V^{t+1}(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^t(s')\right\}$$
Bellman backup operator:
$$(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V(s')\right\}$$
Convergence of the iterative solution.
What we learned
Reinforcement Learning: agent-reward-environment loop, no supervision, exploration
MDP, policy
Consistency equation, optimal policy, optimality condition
Bellman backup operator, iterative solution
In the next tutorial
• Value and Policy Iteration
• Monte Carlo Solution
• SARSA and Q-Learning
• Policy Gradient Methods
• Learning to search OR Atari game paper?
https://dipendramisra.wordpress.com/