The Mathematical Foundations of Policy Gradient Methods
Sham M. Kakade
University of Washington & Microsoft Research
Reinforcement (interactive) learning (RL):
Setting: Markov decision processes
$S$ states; start with $s_0 \sim d_0$.
$A$ actions; dynamics model $P(s'|s,a)$; reward function $r(s)$; discount factor $\gamma$.
Sutton, Barto ’18
Stochastic policy $\pi: s_t \mapsto a_t$
Standard objective: find $\pi$ which maximizes:
$$V^{\pi}(s_0) = \mathbb{E}\left[r(s_0) + \gamma\, r(s_1) + \gamma^2 r(s_2) + \cdots\right]$$
where the distribution of $s_t, a_t$ is induced by $\pi$.
Markov Decision Processes: a framework for RL
• A policy: $\pi$: States → Actions
• We execute $\pi$ to obtain a trajectory: $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
• Total $\gamma$-discounted reward:
$$V^{\pi}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0, \pi\right]$$
Goal: Find a policy that maximizes our value, $V^{\pi}(s_0)$.
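To make this objective concrete, here is a minimal Python sketch that estimates $V^{\pi}(s_0)$ by averaging truncated discounted returns over sampled trajectories. It assumes a hypothetical Gym-style environment object `env` with `reset()` and `step(action)`, and a stochastic `policy(state)` function; these names are illustrative, not part of the slides.

```python
import numpy as np

def estimate_value(env, policy, gamma=0.99, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of V^pi(s0): average discounted return over rollouts."""
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):            # truncate at an effective horizon ~ 1/(1 - gamma)
            action = policy(state)          # sample a ~ pi(. | s)
            state, reward, done, _ = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```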
Challenges in RL
1. Exploration (the environment may be unknown)
2. Credit assignment problem (due to delayed rewards)
3. Large state/action spaces: hand state: joint angles/velocities; cube state: configuration; actions: forces applied to actuators
Dexterous Robotic Hand Manipulation, OpenAI, Oct 15, 2019
Values, State-Action Values, and Advantages
• Expectations are with respect to trajectories sampled under $\pi$.
• There are $S$ states and $A$ actions.
• The effective "horizon" is $1/(1-\gamma)$ time steps.
$$V^{\pi}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0, \pi\right]$$
$$Q^{\pi}(s_0, a_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0, a_0, \pi\right]$$
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
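Two standard identities, stated here for reference (they follow directly from the definitions above and are used repeatedly below):
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s,a)\right], \qquad Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V^{\pi}(s')\right], \qquad \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi}(s,a)\right] = 0.$$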
The “Tabular” Dynamic Programming approach
• Table: ‘bookkeeping’ for dynamic programming (with known rewards/dynamics)
1. Estimate the state-action value $Q^{\pi}(s, a)$ for every entry in the table.
2. Update the policy $\pi$ and go to step 1.
• Generalization: how can we deal with this effectively infinite table, using sampling / supervised learning? (A minimal sketch of the tabular case follows the table below.)
| State $s$: (joint angles, …, cube config, …) | Action $a$: (forces at joints) | $Q^{\pi}(s,a)$: state-action value ("one-step look-ahead value" using $\pi$) |
|---|---|---|
| (31°, 12°, …, 8134, …) | (1.2 Newton, 0.1 Newton, …) | 8 units of reward |
| ⋮ | ⋮ | ⋮ |
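As a concrete reference point, here is a minimal sketch of the tabular approach with known rewards and dynamics: evaluate the current policy, build the Q-table, update the policy greedily, and repeat. The array shapes and function name are illustrative assumptions, not notation from the slides.

```python
import numpy as np

def policy_iteration(P, r, gamma, num_iters=100):
    """Tabular policy iteration.
    P: transition probabilities, shape [S, A, S]; r: rewards, shape [S, A]."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # deterministic policy: state -> action
    for _ in range(num_iters):
        # Step 1: evaluate the current policy by solving (I - gamma * P_pi) V = r_pi.
        P_pi = P[np.arange(S), pi]              # [S, S]
        r_pi = r[np.arange(S), pi]              # [S]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Step 2: compute the Q-table and update the policy greedily, then repeat.
        Q = r + gamma * P @ V                   # [S, A]
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, Q
```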
This Tutorial: Mathematical Foundations of Policy Gradient Methods
§ Part I: Basics
A. Derivation and Estimation
B. Preconditioning and the Natural Policy Gradient
§ Part II: Convergence and Approximation
A. Convergence: this is a non-convex problem!
B. Approximation: how to think about the role of deep learning (generalization)?
Part-1: Basics
State-Action Visitation Measures
• This helps to clean up notation!
• "Occupancy frequency" of being in state $s$ after following $\pi$ starting from $s_0$:
$$d^{\pi}_{s_0}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid s_0, \pi)$$
• $d^{\pi}_{s_0}$ is a probability distribution (the discounted chance of visiting $s$).
• With this notation:
$$V^{\pi}(s_0) = \frac{1}{1-\gamma}\; \mathbb{E}_{s \sim d^{\pi}_{s_0},\, a \sim \pi(\cdot|s)}\left[r(s, a)\right]$$
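A one-line check of this identity, swapping the sum over time with the sum over states:
$$V^{\pi}(s_0) = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{E}\left[r(s_t,a_t) \mid s_0, \pi\right]
= \sum_{s} \left(\sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid s_0, \pi)\right) \mathbb{E}_{a \sim \pi(\cdot|s)}\left[r(s,a)\right]
= \frac{1}{1-\gamma} \sum_{s} d^{\pi}_{s_0}(s)\, \mathbb{E}_{a \sim \pi(\cdot|s)}\left[r(s,a)\right].$$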
Direct Policy Optimization over Stochastic Policies
• $\pi_\theta(a \mid s)$ is the probability of action $a$ given $s$, parameterized by $\theta$:
$$\pi_\theta(a \mid s) \propto \exp\big(f_\theta(s, a)\big)$$
• Softmax policy class: $f_\theta(s,a) = \theta_{s,a}$
• Linear policy class: $f_\theta(s,a) = \theta \cdot \phi(s,a)$, where $\phi(s,a) \in \mathbb{R}^d$
• Neural policy class: $f_\theta(s,a)$ is a neural network
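A minimal numpy sketch of the softmax construction for the linear class (the feature map `phi` and the sampling helper are illustrative placeholders; for the tabular class, `phi(s, a)` would be a one-hot indicator so that `theta @ phi(s, a) = theta[s, a]`):

```python
import numpy as np

def softmax_policy(theta, phi, state, num_actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s, a)) -- the linear policy class."""
    logits = np.array([theta @ phi(state, a) for a in range(num_actions)])
    logits -= logits.max()                   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(theta, phi, state, num_actions):
    """Draw a ~ pi_theta(. | s)."""
    probs = softmax_policy(theta, phi, state, num_actions)
    return np.random.choice(num_actions, p=probs)
```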
In practice, policy gradient methods rule…
• Why do we like them?
• They easily deal with large state/action spaces (through the neural net parameterization)
• We can estimate the gradient using only simulation of our current policy $\pi_\theta$ (the expectation is under the states and actions visited under $\pi_\theta$).
• They directly optimize the cost function of interest!
They are the most effective method for obtaining state-of-the-art results.
$$\theta \leftarrow \theta + \eta\, \nabla V^{\theta}(s_0)$$
Two (equal) expressions for the policy gradient!
$$\nabla V^{\theta}(s_0) = \frac{1}{1-\gamma}\; \mathbb{E}_{s \sim d^{\theta},\, a \sim \pi_\theta}\left[Q^{\theta}(s, a)\, \nabla \log \pi_\theta(a \mid s)\right]$$
$$\nabla V^{\theta}(s_0) = \frac{1}{1-\gamma}\; \mathbb{E}_{s \sim d^{\theta},\, a \sim \pi_\theta}\left[A^{\theta}(s, a)\, \nabla \log \pi_\theta(a \mid s)\right]$$
(some shorthand notation above: $V^{\theta}, Q^{\theta}, A^{\theta}, d^{\theta}$ abbreviate $V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}, d^{\pi_\theta}_{s_0}$)
• Where do these expressions come from?
• How do we compute them?
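The two expressions agree because subtracting the baseline $V^{\theta}(s)$ does not change the expectation: for any state $s$,
$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla \log \pi_\theta(a|s)\right] = \sum_a \pi_\theta(a|s)\, \frac{\nabla \pi_\theta(a|s)}{\pi_\theta(a|s)} = \nabla \sum_a \pi_\theta(a|s) = \nabla\, 1 = 0,$$
so $\mathbb{E}\left[V^{\theta}(s)\, \nabla \log \pi_\theta(a|s)\right] = 0$ and $Q^{\theta}$ may be replaced by $A^{\theta} = Q^{\theta} - V^{\theta}$.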
Example: an important special case!
• Remember the softmax policy class (a "tabular" parameterization):
$$\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$$
• Complete class with $S \cdot A$ parameters: one parameter per state-action pair, so it contains the optimal policy.
• Expression for the softmax class:
$$\frac{\partial V^{\theta}(s_0)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\; d^{\theta}_{s_0}(s)\, \pi_\theta(a \mid s)\, A^{\theta}(s, a)$$
• Intuition: increase $\theta_{s,a}$ if this 'weighted' advantage is large.
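A sketch of where this comes from, obtained by plugging the softmax score function into the advantage form of the policy gradient:
$$\frac{\partial \log \pi_\theta(a' \mid s')}{\partial \theta_{s,a}} = \mathbf{1}[s' = s]\left(\mathbf{1}[a' = a] - \pi_\theta(a \mid s)\right),$$
so
$$\frac{\partial V^{\theta}(s_0)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\; d^{\theta}_{s_0}(s) \sum_{a'} \pi_\theta(a' \mid s)\, A^{\theta}(s,a') \left(\mathbf{1}[a'=a] - \pi_\theta(a \mid s)\right) = \frac{1}{1-\gamma}\; d^{\theta}_{s_0}(s)\, \pi_\theta(a \mid s)\, A^{\theta}(s,a),$$
using $\sum_{a'} \pi_\theta(a' \mid s)\, A^{\theta}(s,a') = 0$.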
Part-1A: Derivations and Estimation
General Derivation
$$\begin{aligned}
\nabla V^{\pi_\theta}(s_0) &= \nabla \sum_{a_0} \pi_\theta(a_0|s_0)\, Q^{\pi_\theta}(s_0, a_0) \\
&= \sum_{a_0} \Big(\nabla \pi_\theta(a_0|s_0)\Big) Q^{\pi_\theta}(s_0, a_0) + \sum_{a_0} \pi_\theta(a_0|s_0)\, \nabla Q^{\pi_\theta}(s_0, a_0) \\
&= \sum_{a_0} \pi_\theta(a_0|s_0) \Big(\nabla \log \pi_\theta(a_0|s_0)\Big) Q^{\pi_\theta}(s_0, a_0) \\
&\qquad + \sum_{a_0} \pi_\theta(a_0|s_0)\, \nabla \Big( r(s_0, a_0) + \gamma \sum_{s_1} P(s_1|s_0, a_0)\, V^{\pi_\theta}(s_1) \Big) \\
&= \sum_{a_0} \pi_\theta(a_0|s_0) \Big(\nabla \log \pi_\theta(a_0|s_0)\Big) Q^{\pi_\theta}(s_0, a_0) + \gamma \sum_{a_0, s_1} \pi_\theta(a_0|s_0)\, P(s_1|s_0, a_0)\, \nabla V^{\pi_\theta}(s_1) \\
&= \mathbb{E}\left[Q^{\pi_\theta}(s_0, a_0)\, \nabla \log \pi_\theta(a_0|s_0)\right] + \gamma\, \mathbb{E}\left[\nabla V^{\pi_\theta}(s_1)\right].
\end{aligned}$$
(Product rule on the first step, the log-derivative trick, and the Bellman equation for $Q^{\pi_\theta}$; unrolling this recursion over time gives $\nabla V^{\pi_\theta}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, Q^{\pi_\theta}(s_t, a_t)\, \nabla \log \pi_\theta(a_t \mid s_t)\right]$, which matches the earlier expression after rewriting the discounted sum via the visitation measure.)
SL vs RL: How do we obtain gradients?
• In supervised learning, how do we compute the gradient of our loss, $\nabla L(\theta)$?
$$\theta \leftarrow \theta - \eta\, \nabla L(\theta)$$
• Hint: can we compute our loss?
• In reinforcement learning, how do we compute the policy gradient $\nabla V^{\theta}(s_0)$?
$$\theta \leftarrow \theta + \eta\, \nabla V^{\theta}(s_0)$$
$$\nabla V^{\theta}(s_0) = \frac{1}{1-\gamma}\; \mathbb{E}_{s \sim d^{\theta},\, a \sim \pi_\theta}\left[Q^{\theta}(s, a)\, \nabla \log \pi_\theta(a \mid s)\right]$$
(Even evaluating this expectation is tricky: we do not know $Q^{\theta}$ and must estimate everything from sampled trajectories.)
Monte Carlo Estimation
• Sample a trajectory: execute $\pi_\theta$ to obtain $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ and form
$$\widehat{Q}(s_t, a_t) = \sum_{t'=0}^{\infty} \gamma^{t'}\, r(s_{t'+t}, a_{t'+t}), \qquad
\widehat{\nabla V^{\theta}} = \sum_{t=0}^{\infty} \gamma^{t}\, \widehat{Q}(s_t, a_t)\, \nabla \log \pi_\theta(a_t \mid s_t)$$
• Lemma [Glynn '90, Williams '92]: This gives an unbiased estimate of the gradient:
$$\mathbb{E}\left[\widehat{\nabla V^{\theta}}\right] = \nabla V^{\theta}(s_0)$$
This is the "likelihood ratio" method.
Exercise: verify the lemma. (In practice the infinite sums are truncated at an effective horizon of roughly $1/(1-\gamma)$.)
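A minimal sketch of this estimator for the linear softmax class above (single trajectory, truncated horizon). It reuses the hypothetical `env`, `phi`, `softmax_policy`, and `sample_action` placeholders from the earlier sketches:

```python
import numpy as np

def grad_log_softmax(theta, phi, state, action, num_actions):
    """grad_theta log pi_theta(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')] for the linear class."""
    probs = softmax_policy(theta, phi, state, num_actions)
    feats = np.array([phi(state, a) for a in range(num_actions)])
    return feats[action] - probs @ feats

def reinforce_gradient(env, theta, phi, num_actions, gamma=0.99, horizon=1000):
    """Likelihood-ratio (Monte Carlo) estimate of the policy gradient from one trajectory."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(horizon):
        action = sample_action(theta, phi, state, num_actions)
        next_state, reward, done, _ = env.step(action)
        states.append(state); actions.append(action); rewards.append(reward)
        state = next_state
        if done:
            break
    grad = np.zeros_like(theta)
    q_hat = 0.0
    # Q-hat(s_t, a_t) is the discounted reward-to-go; accumulate it working backwards.
    for t in reversed(range(len(rewards))):
        q_hat = rewards[t] + gamma * q_hat
        grad += (gamma ** t) * q_hat * grad_log_softmax(theta, phi, states[t], actions[t], num_actions)
    return grad
```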
Back to the softmax policy class…
$$\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$$
• Expression for the softmax class:
$$\frac{\partial V^{\theta}(s_0)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\; d^{\theta}_{s_0}(s)\, \pi_\theta(a \mid s)\, A^{\theta}(s, a)$$
• What might make gradient estimation difficult here? (Hint: when does gradient descent "effectively" stop?)
(Gradient descent effectively stops when the gradient is small. For the softmax class the gradient in coordinate $(s,a)$ is proportional to $\pi_\theta(a \mid s)$, so it can be tiny even when $A^{\theta}(s,a) \gg 0$, e.g. when the policy rarely visits $s$ or rarely tries $a$.)
Part-1B: Preconditioning and the Natural Policy Gradient
A closer look at Natural Policy Gradient (NPG)
• Practice: (almost) all methods are gradient based, usually variants of the Natural Policy Gradient [K. '01], TRPO [Schulman '15], or PPO [Schulman '17].
• NPG warps the distance metric (using the Fisher information metric) to stretch out the corners, so the update moves 'more' near the boundaries. The update is:
$$F(\theta) = \mathbb{E}_{s \sim d^{\theta},\, a \sim \pi_\theta}\left[\nabla \log \pi_\theta(a \mid s)\, \big(\nabla \log \pi_\theta(a \mid s)\big)^{\top}\right]$$
$$\theta \leftarrow \theta + \eta\, F(\theta)^{-1}\, \nabla V^{\theta}(s_0)$$
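A minimal sketch of one NPG step for the linear softmax class, estimating both the Fisher matrix and the gradient from sampled state-action pairs. The interface (`samples`, `advantages`) is an illustrative assumption, the `grad_log_softmax` helper is reused from the earlier sketch, and a pseudo-inverse handles a possibly singular Fisher matrix:

```python
import numpy as np

def npg_update(theta, phi, samples, advantages, num_actions, eta=0.1):
    """One natural policy gradient step.
    samples: list of (state, action) pairs drawn from d^theta under pi_theta;
    advantages: estimated A^theta(s, a) for each pair.
    (The 1/(1 - gamma) factor is absorbed into the step size eta.)"""
    d = theta.shape[0]
    fisher = np.zeros((d, d))
    grad = np.zeros(d)
    for (s, a), adv in zip(samples, advantages):
        g = grad_log_softmax(theta, phi, s, a, num_actions)
        fisher += np.outer(g, g)
        grad += adv * g
    fisher /= len(samples)
    grad /= len(samples)
    return theta + eta * np.linalg.pinv(fisher) @ grad
```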
TRPO (Trust Region Policy Optimization)
• TRPO [Schulman '15] (related: PPO [Schulman '17]): move while staying "close" in KL to the previous policy:
$$\theta_{t+1} = \arg\max_{\theta}\; V^{\theta}(s_0) \quad \text{s.t.} \quad \mathbb{E}_{s \sim d^{\theta_t}}\left[\mathrm{KL}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\theta_t}(\cdot \mid s)\big)\right] \le \delta$$
• NPG = TRPO: they are first-order equivalent (and have the same practical behavior).
NPG intuition. But first…
• NPG as preconditioning:
$$\theta \leftarrow \theta + \eta\, F(\theta)^{-1}\, \nabla V^{\theta}(s_0)$$
OR
$$\theta \leftarrow \theta + \frac{\eta}{1-\gamma}\; \mathbb{E}\left[\nabla \log \pi_\theta(a|s)\, \big(\nabla \log \pi_\theta(a|s)\big)^{\top}\right]^{-1} \mathbb{E}\left[\nabla \log \pi_\theta(a|s)\, A^{\theta}(s, a)\right]$$
• What does the following problem remind you of?
$$\mathbb{E}\left[x\, x^{\top}\right]^{-1} \mathbb{E}\left[x\, y\right]$$
• What is NPG trying to approximate?
Equivalent Update Rule (for the softmax)
• Take the best linear fit of $Q^{\theta}$ using the "policy space" features $\nabla \log \pi_\theta(a|s)$: this gives
$$w^{*}_{s,a} = A^{\theta}(s, a)$$
• Using the NPG update rule:
$$\theta_{s,a} \leftarrow \theta_{s,a} + \frac{\eta}{1-\gamma}\, A^{\theta}(s, a)$$
• And so an equivalent update rule to NPG is:
$$\pi^{\text{new}}(a \mid s) \propto \pi_\theta(a \mid s)\, \exp\!\left(\frac{\eta}{1-\gamma}\, A^{\theta}(s, a)\right)$$
• What algorithm does this remind you of?
Questions: convergence? General case/approximation?
(Answer: this is a "soft" policy iteration; as $\eta \to \infty$, the next policy concentrates on the greedy action, recovering the exact policy iteration update.)
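A minimal sketch of this equivalent update for the tabular softmax class, assuming the advantages $A^{\theta}(s,a)$ have already been estimated (e.g. by Monte Carlo rollouts as above); the array shapes are illustrative:

```python
import numpy as np

def soft_policy_iteration_step(pi, advantages, eta, gamma):
    """Multiplicative-weights ("soft" policy iteration) update, equivalent to NPG
    for the tabular softmax class.
    pi: current policy, shape [S, A]; advantages: A^pi(s, a), shape [S, A]."""
    new_pi = pi * np.exp(eta / (1.0 - gamma) * advantages)
    return new_pi / new_pi.sum(axis=1, keepdims=True)   # renormalize per state
```

As `eta` grows, the update concentrates each row on `argmax_a A(s, a)`, i.e. the ordinary policy iteration step.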
But does gradient descent even work in RL??
Supervised Learning vs. Reinforcement Learning
What about approximation?
Stay tuned!!
Part-2: Convergence and Approximation
The Optimization Landscape
Supervised Learning:
• Gradient descent tends to 'just work' in practice and is not sensitive to initialization.
• Saddle points are not a problem…
Reinforcement Learning:
• Local search depends on initialization in many real problems, due to "very" flat regions.
• Gradients can be exponentially small in the "horizon".
RL and the vanishing gradient problem
Reinforcement Learning:
• A random initialization has "very" flat regions in real problems (a lack of 'exploration').
• Lemma [Agarwal, Lee, K., Mahajan 2019]: With random initialization, all $k$-th higher-order gradients are $2^{-\Omega(H)}$ in magnitude for up to $k < H/\ln H$ orders, where $H = 1/(1-\gamma)$.
• This is a landscape/optimization issue (and also a statistical issue if we use a random initialization).
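For intuition, here is a standard style of example consistent with this lemma (an illustration, not a statement from the slides): in a length-$H$ chain where reward is reached only by taking one particular action $H$ times in a row, a uniformly random initial policy reaches the reward with probability about
$$\Pr(\text{reach reward}) \approx (1/A)^{H},$$
and every component of the policy gradient is weighted by this visitation probability, so the gradient (and its higher-order analogues) is exponentially small in $H$.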
Prior work: The Explore/Exploit Tradeoff
Thrun ’92
Random search does not find the reward quickly.
(theory) Balancing the explore/exploit tradeoff:
[Kearns & Singh '02]: E$^3$ is a near-optimal algorithm.
Sample complexity: [K. '03; Azar et al. '17]
Model free: [Strehl et al. '06; Dann & Brunskill '15; Szita & Szepesvari '10; Lattimore et al. '14; Jin et al. '18]
Part 2: Understanding the convergence properties of the (NPG) policy gradient methods!
§ A: Convergence: let's look at the tabular/"softmax" case
§ B: Approximation: "linear" policies and neural nets
NPG: back to the “soft” policy iteration interpretation
• Remember the softmax policy class
$$\pi_\theta(a \mid s) \propto \exp(\theta_{s,a}) \qquad \text{(has } S \cdot A \text{ parameters)}$$
• At iteration $t$, the NPG update rule
$$\theta^{(t+1)} \leftarrow \theta^{(t)} + \eta\, F(\theta^{(t)})^{-1}\, \nabla V^{(t)}(s_0)$$
is equivalent to a "soft" (exact) policy iteration update rule:
$$\pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s)\, \exp\!\left(\frac{\eta}{1-\gamma}\, A^{(t)}(s, a)\right)$$
• What happens for this non-convex update rule?