Task
Setup from Lenz et al. 2014
Grasp the green cup.
Output: Sequence of controller actions
Supervised Learning
Setup from Lenz et al. 2014
Grasp the green cup.
Expert Demonstrations
Problem? The policy is trained on expert demonstrations, but at test time it visits states the expert never demonstrated: the training and test distributions differ, and there is no exploration.
Exploring the environment
What is reinforcement learning?
“Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment”
- R. Sutton and A. Barto
Interaction with the environment
(Setup from Lenz et al. 2014)
[Figure: the agent takes an action; the environment returns a scalar reward and a new state.]
Episodic vs Non-Episodic
In an episodic task the interaction ends after a finite number of steps (e.g., one grasping attempt); in a non-episodic (continuing) task it goes on indefinitely.

Rollout
(Setup from Lenz et al. 2014)
A rollout is a trajectory of states, actions, and rewards:
$$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$$
Setup
A single transition $s_t \xrightarrow{a_t,\, r_t} s_{t+1}$: taking action $a_t$ in state $s_t$ yields a scalar reward $r_t$ (e.g., 1$) and the next state $s_{t+1}$.
Policy
A policy $\pi(s, a)$ gives the probability of taking action $a$ in state $s$, e.g., $\pi(s, a) = 0.9$.
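As a concrete sketch, a stochastic tabular policy can be stored as a table of action probabilities per state; the state and action names below are made up for illustration:

```python
import random

# A tabular stochastic policy pi(s, a): for each state, a map from actions
# to probabilities. State/action names are illustrative only.
policy = {
    "near_cup": {"grasp": 0.9, "reach": 0.1},
    "far_away": {"grasp": 0.1, "reach": 0.9},
}

def sample_action(pi, state):
    """Draw an action a ~ pi(state, .)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```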
Objective?
Objective
(Setup from Lenz et al. 2014)
Given a rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$, maximize the expected reward:
$$\mathbb{E}\left[\sum_{t=1}^{n} r_t\right]$$
Problem?
Discounted Reward
Maximize expected reward:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} r_{t+1}\right]$$
Problem? For non-episodic tasks this sum can be unbounded.
Fix: discount future reward:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right], \qquad \gamma \in [0, 1)$$
Discounted Reward
Maximize the discounted expected reward:
$$\mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$$
The discounted sum is bounded: if $r \le M$ and $\gamma \in [0, 1)$, then
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1 - \gamma}$$
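A quick numeric check of this bound (a sketch; the reward sequence is made up, with every reward at its maximum $M = 1$):

```python
# Discounted return sum_t gamma^t * r_{t+1} with rewards bounded by M = 1.
gamma, M = 0.9, 1.0
rewards = [M] * 1000  # worst case: maximal reward at every step

discounted = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted, "<=", M / (1 - gamma))  # 9.9999... <= 10.0
```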
Need for discounting
• To keep the problem well-formed
• Evidence that humans discount future reward
Markov Decision Process
An MDP is a tuple $(S, A, P, R, \gamma)$ where
• $S$ is a set of finite or infinite states
• $A$ is a set of finite or infinite actions
• For the transition $s \xrightarrow{a} s'$:
  • $P^{a}_{s,s'} \in P$ is the transition probability
  • $R^{a}_{s,s'} \in R$ is the reward for the transition
• $\gamma \in [0, 1]$ is the discount factor
Markov assumption: the transition probability and reward depend only on the current state $s$ and action $a$, not on earlier history.
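A sketch of one possible representation of a finite MDP as dense arrays (the sizes and random entries are purely illustrative):

```python
import numpy as np

# (S, A, P, R, gamma): P[s, a, s'] is the transition probability and
# R[s, a, s'] the reward for the transition s -a-> s'.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] is a distribution over s'
R = rng.random((n_states, n_actions, n_states))
gamma = 0.9
```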
MDP Example
[Figure: example MDP from Sutton and Barto 1998]
Summary
Maximize the discounted expected reward:
$$\mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$$
An MDP is a tuple $(S, A, P, R, \gamma)$.
The agent controls the policy $\pi(s, a)$.
What we learned
Reinforcement Learning: agent-reward-environment loop, no supervision, exploration
MDP, policy
Value functions
Expected reward from following a policy.
State value function:
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$$
State-action value function:
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, a_1 = a, \pi\right]$$
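These expectations can be estimated by averaging sampled returns. A minimal sketch, assuming the dense-array MDP from the earlier sketch and a policy stored as rows `pi[s]` of action probabilities; truncating the infinite sum at a horizon $H$ costs at most $\gamma^H M/(1-\gamma)$ when rewards are bounded by $M$:

```python
import numpy as np

def mc_value_estimate(P, R, pi, gamma, s0, n_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(s0): average truncated discounted returns."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):                      # truncate the infinite sum
            a = rng.choice(len(pi[s]), p=pi[s])       # a  ~ pi(s, .)
            s_next = rng.choice(n_states, p=P[s, a])  # s' ~ P^a_{s, .}
            ret += disc * R[s, a, s_next]
            disc *= gamma
            s = s_next
        total += ret
    return total / n_rollouts
```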
State Value function
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$$
[Figure: tree of trajectories from $s_1$; edges $s_1 \to s_2 \to s_3 \cdots$ carry probabilities $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$, $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}, \ldots$ and rewards $R^{a_1}_{s_1,s_2}, R^{a_2}_{s_2,s_3}, \ldots$]
State Value function
Write a trajectory as $t = \langle s_1, a_1, s_2, a_2 \cdots \rangle = \langle s_1, a_1, s_2 \rangle : t'$, i.e., the first transition followed by the tail $t'$, and note $P(s_1, a_1, s_2) = \pi(s_1, a_1) P^{a_1}_{s_1,s_2}$. Then
$$V^{\pi}(s_1) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{t} (r_1 + \gamma r_2 \cdots)\, p(t)$$
$$= \sum_{a_1, s_2} \sum_{t'} P(s_1, a_1, s_2)\, P(t' \mid s_1, a_1, s_2)\left\{R^{a_1}_{s_1,s_2} + \gamma (r_2 \cdots)\right\}$$
$$= \sum_{a_1, s_2} P(s_1, a_1, s_2)\Big\{R^{a_1}_{s_1,s_2} + \gamma \underbrace{\sum_{t'} P(t' \mid s_1, a_1, s_2)(r_2 \cdots)}_{V^{\pi}(s_2)}\Big\}$$
$$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2}\left\{R^{a_1}_{s_1,s_2} + \gamma V^{\pi}(s_2)\right\}$$
Bellman Self-Consistency Eqn
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Similarly,
$$Q^{\pi}(s, a) = \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
$$Q^{\pi}(s, a) = \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a')\right\}$$
Bellman Self-Consistency Eqn
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Given $N$ states, we have $N$ linear equations in $N$ variables. Solve the above system.
Does it have a unique solution? Yes, it does. Exercise: Prove it.
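Because the self-consistency equation is linear in $V^{\pi}$, the $N \times N$ system can be solved directly. A sketch, assuming the dense-array MDP and a policy matrix `pi[s, a]` as in the earlier sketches:

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for V^pi, where P_pi and r_pi
    are the transition matrix and expected reward under policy pi."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # P_pi[s,s'] = sum_a pi(s,a) P^a_{s,s'}
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)  # expected one-step reward
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

The direct solve costs $O(N^3)$ for $N$ states, one motivation for the iterative methods that follow.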
Optimal Policy
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if:
$$V^{\pi_1}(s) \ge V^{\pi_2}(s)$$
How to define a globally optimal policy? $\pi^*$ is an optimal policy if:
$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \forall \pi$$
Does it always exist? Yes, it always does.
Existence of Optimal Policy
Leader policy for every state $s$: $\pi_s = \arg\max_{\pi} V^{\pi}(s)$. Define $\pi^*(s, a) = \pi_s(s, a) \;\; \forall s, a$.
To show $\pi^*$ is optimal, or equivalently $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$, expand both values at $s$, where $\pi^*$ agrees with $\pi_s$:
$$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi^*}(s')\right\}$$
$$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi_s}(s')\right\}$$
Subtracting, the reward terms cancel:
$$\Delta(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\left\{V^{\pi^*}(s') - V^{\pi_s}(s')\right\} \ge \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^{a}_{s,s'}\, \Delta(s')$$
using $V^{\pi_s}(s') \le V^{\pi_{s'}}(s')$ (the leader at $s'$ is at least as good there). The right-hand side is $\gamma$ times a convex combination of $\Delta(s')$ values, hence
$$\Delta(s) \ge \gamma \min_{s'} \Delta(s') \;\Rightarrow\; \min_{s} \Delta(s) \ge \gamma \min_{s'} \Delta(s')$$
Since $\gamma \in [0, 1)$, this forces $\min_{s} \Delta(s) \ge 0$. Hence proved.
Bellman’s Optimality Condition
Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$.
$$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a) \le \max_{a} Q^{\pi^*}(s, a)$$
Claim: $V^{\pi^*}(s) = \max_{a} Q^{\pi^*}(s, a)$.
Suppose, for contradiction, that $V^{\pi^*}(s) < \max_{a} Q^{\pi^*}(s, a)$ for some $s$, and define $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$.
Bellman’s Optimality Condition
For the deterministic policy $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$:
$$V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s)), \qquad V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$$
$$\Delta(s) = V^{\pi'}(s) - V^{\pi^*}(s) = Q^{\pi'}(s, \pi'(s)) - \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$$
$$\ge Q^{\pi'}(s, \pi'(s)) - Q^{\pi^*}(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'}\, \Delta(s')$$
By the same minimum argument as before, $\Delta(s) \ge 0$ for all $s$, and $\Delta(s') > 0$ for some $s'$ (wherever $V^{\pi^*} < \max_a Q^{\pi^*}$ is strict), so $\pi^*$ would not be optimal: a contradiction.
Bellman’s Optimality Condition
$$V^*(s) = \max_{a} Q^*(s, a)$$
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
Similarly,
$$Q^*(s, a) = \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma \max_{a'} Q^*(s', a')\right\}$$
Optimal policy from Q value
Given $Q^*(s, a)$, an optimal policy is given by:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
Corollary: every MDP has a deterministic optimal policy.
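A sketch of reading off this policy from a Q table stored as a (states × actions) array (an illustrative representation, not from the slides):

```python
import numpy as np

def greedy_policy(Q_star):
    """Deterministic optimal policy: pi*(s) = argmax_a Q*(s, a)."""
    return np.argmax(Q_star, axis=1)

def v_from_q(Q_star):
    """V*(s) = max_a Q*(s, a), per Bellman's optimality condition."""
    return Q_star.max(axis=1)
```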
Summary
Bellman’s self-consistency equation:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^{\pi}(s')\right\}$$
Bellman’s optimality condition:
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
An optimal policy $\pi^*$ exists such that:
$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \forall \pi$$
What we learned
Reinforcement Learning: agent-reward-environment loop, no supervision, exploration
MDP, policy
Consistency equation, optimal policy, optimality condition
Solving an MDP
To solve an MDP is to find an optimal policy.
Bellman’s optimality condition:
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
Iteratively solve the above equation.
Bellman Backup Operator
$$V^*(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\}$$
Define the operator $T : \mathcal{V} \to \mathcal{V}$ by
$$(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V(s')\right\}$$
Dynamic Programming Solution
Initialize $V^0$ randomly.
do
  $V^{t+1} = TV^t$, i.e., $V^{t+1}(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^t(s')\right\}$
until $\|V^{t+1} - V^t\|_\infty \le \epsilon$
return $V^{t+1}$
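A runnable sketch of this loop under the dense-array MDP representation used in the earlier sketches:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Repeat V <- TV until the sup-norm change is at most eps."""
    n_states = P.shape[0]
    V = np.zeros(n_states)  # any V^0 works; convergence holds for all of them
    while True:
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                   # (TV)(s) = max_a Q[s, a]
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, np.argmax(Q, axis=1)  # approximate V* and greedy policy
        V = V_new
```

The termination test matches the $\|V^{t+1} - V^t\|_\infty \le \epsilon$ condition above; the loop converges because $T$ is a $\gamma$-contraction, as shown next.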
Convergence
Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_k|\}$ for $x \in \mathbb{R}^k$.
Proof: Using $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$,
$$|(TV_1)(s) - (TV_2)(s)| = \left|\max_{a} \sum_{s'} P^{a}_{s,s'}\{R^{a}_{s,s'} + \gamma V_1(s')\} - \max_{a} \sum_{s'} P^{a}_{s,s'}\{R^{a}_{s,s'} + \gamma V_2(s')\}\right|$$
$$\le \max_{a} \gamma \left|\sum_{s'} P^{a}_{s,s'}\big(V_1(s') - V_2(s')\big)\right| \le \max_{a} \gamma \sum_{s'} P^{a}_{s,s'} \left|V_1(s') - V_2(s')\right|$$
$$\le \max_{a} \gamma \max_{s'} |V_1(s') - V_2(s')| = \gamma \|V_1 - V_2\|_\infty$$
$$\Rightarrow \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
So $T$ is a $\gamma$-contraction in the sup norm.
Optimal $V^*$ is a fixed point
$$V^* = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^*(s')\right\} = TV^*$$
$V^*$ is a fixed point of $T$.
Optimal $V^*$ is the fixed point
$V^*$ is a fixed point of $T$: $V^* = TV^*$.
Theorem: $V^*$ is the only fixed point of $T$.
Proof: Suppose $TV_1 = V_1$ and $TV_2 = V_2$. Then
$$\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$.
Dynamic Programming Solution
Initialize $V^0$ randomly; repeat $V^{t+1} = TV^t$ until $\|V^{t+1} - V^t\|_\infty \le \epsilon$; return $V^{t+1}$.
Theorem: the algorithm converges for all $V^0$.
Proof:
$$\|V^{t+1} - V^*\|_\infty = \|TV^t - TV^*\|_\infty \le \gamma \|V^t - V^*\|_\infty \;\Rightarrow\; \|V^t - V^*\|_\infty \le \gamma^t \|V^0 - V^*\|_\infty$$
$$\lim_{t \to \infty} \|V^t - V^*\|_\infty \le \lim_{t \to \infty} \gamma^t \|V^0 - V^*\|_\infty = 0$$
Problem?
Summary
Iteratively solving the optimality condition:
$$V^{t+1}(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V^t(s')\right\}$$
Bellman backup operator:
$$(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'}\left\{R^{a}_{s,s'} + \gamma V(s')\right\}$$
Convergence of the iterative solution.
What we learned
Reinforcement Learning: agent-reward-environment loop, no supervision, exploration
MDP, policy
Consistency equation, optimal policy, optimality condition
Bellman backup operator, iterative solution
In the next tutorial
• Value and Policy Iteration
• Monte Carlo Solution
• SARSA and Q-Learning
• Policy Gradient Methods
• Learning to search OR Atari game paper?
https://dipendramisra.wordpress.com/