The Mathematical Foundations of Policy Gradient Methods
Sham M. Kakade
University of Washington & Microsoft Research
Reinforcement (interactive) learning (RL)
Setting: Markov decision processes [Sutton & Barto ’18]
• S states; start with s₀ ∼ d₀
• A actions; dynamics model P(s′|s, a); reward function r(s); discount factor γ
• Stochastic policy π: s_t → a_t
• Standard objective: find π which maximizes
  V^π(s₀) = E[ r(s₀) + γ r(s₁) + γ² r(s₂) + … ]
  where the distribution of s_t, a_t is induced by π.
Markov Decision Processes: a framework for RL
• A policy: π: States → Actions
• We execute π to obtain a trajectory: s₀, a₀, r₀, s₁, a₁, r₁, …
• Total γ-discounted reward:
  V^π(s₀) = E[ Σ_{t=0}^∞ γ^t r_t │ s₀, π ]
• Goal: find a policy that maximizes our value, V^π(s₀).
Challenges in RL
1. Exploration (the environment may be unknown)
2. Credit assignment problem (due to delayed rewards)
3. Large state/action spaces: hand state: joint angles/velocities; cube state: configuration; actions: forces applied to actuators
(Dexterous Robotic Hand Manipulation, OpenAI, Oct 15, 2019)
Values, State-Action Values, and Advantages
• Expectations are with respect to trajectories sampled under π
• Have S states and A actions
• Effective “horizon” is 1/(1 − γ) time steps
  V^π(s₀) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) │ s₀, π ]
  Q^π(s₀, a₀) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) │ s₀, a₀, π ]
  A^π(s, a) = Q^π(s, a) − V^π(s)
The “Tabular” Dynamic Programming approach
• Table: ‘bookkeeping’ for dynamic programming (with known rewards/dynamics)
  1. Estimate the state-action value Q^π(s, a) for every entry in the table.
  2. Update the policy π, and go to step 1.
• Generalization: how can we deal with this infinite table, using sampling/supervised learning? (A minimal sketch of the tabular loop follows the table below.)

State s: (joint angles, …, cube config, …)  |  Action a: (forces at joints)  |  Q^π(s, a): state-action value, the “one-step look-ahead value” using π
(31°, 12°, …, 8134, …)                      |  (1.2 Newton, 0.1 Newton, …)   |  8 units of reward
⋮                                           |  ⋮                             |  ⋮
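To make the two-step loop concrete, here is a minimal sketch for a small tabular MDP with known dynamics; the array layout (P[s, a, s′], r[s, a]) and the function names are illustrative assumptions, not code from the talk.

    import numpy as np

    def policy_q_values(P, r, pi, gamma):
        """Evaluate Q^pi exactly by solving the linear Bellman equations."""
        S, A, _ = P.shape
        r_pi = np.einsum("sa,sa->s", pi, r)          # expected reward under pi
        P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        return r + gamma * P @ V                     # Q(s,a) = r(s,a) + gamma E[V(s')]

    def policy_iteration(P, r, gamma, iters=100):
        S, A, _ = P.shape
        pi = np.ones((S, A)) / A                     # start from the uniform policy
        for _ in range(iters):
            Q = policy_q_values(P, r, pi, gamma)     # step 1: fill in the table
            greedy = Q.argmax(axis=1)                # step 2: update the policy
            new_pi = np.zeros_like(pi)
            new_pi[np.arange(S), greedy] = 1.0
            if np.allclose(new_pi, pi):
                break
            pi = new_pi
        return pi, Q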
This Tutorial: Mathematical Foundations of Policy Gradient Methods
§ Part I: Basics
  A. Derivation and Estimation
  B. Preconditioning and the Natural Policy Gradient
§ Part II: Convergence and Approximation
  A. Convergence: this is a non-convex problem!
  B. Approximation: how to think about the role of deep learning?
Part-1: Basics
State-Action Visitation Measures
• This helps to clean up notation!
• “Occupancy frequency” of being in state s (and taking action a ∼ π(⋅|s)) after following π starting from s₀:
  d^π_{s₀}(s) = (1 − γ) E[ Σ_{t=0}^∞ γ^t I(s_t = s) │ s₀, π ]
• d^π_{s₀} is a probability distribution.
• With this notation:
  V^π(s₀) = (1/(1 − γ)) E_{s∼d^π_{s₀}, a∼π}[ r(s, a) ]
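For a small tabular MDP with known dynamics, the visitation measure has a closed form, d^π_{d₀} = (1 − γ) d₀ᵀ (I − γ P_π)^{-1}; a minimal sketch (illustrative names, same array layout as above):

    import numpy as np

    def visitation_measure(P, pi, d0, gamma):
        """Discounted state visitation distribution d^pi_{d0} (sums to 1)."""
        S, A, _ = P.shape
        P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
        return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)

    # Sanity check of the identity on the slide:
    # V^pi(d0) == 1/(1-gamma) * sum_s d(s) * sum_a pi(a|s) * r(s,a)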
Direct Policy Optimization over Stochastic Policies
• π_θ(a|s) is the probability of action a given s, parameterized by θ:
  π_θ(a|s) ∝ exp(f_θ(s, a))
• Softmax policy class: f_θ(s, a) = θ_{s,a}
• Linear policy class: f_θ(s, a) = θ ⋅ φ(s, a), where φ(s, a) ∈ R^d
• Neural policy class: f_θ(s, a) is a neural network
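As a concrete (purely illustrative) sketch of the first two parameterizations, for discrete actions and a user-supplied feature map phi:

    import numpy as np

    def softmax(z):
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    # Tabular softmax class: one parameter theta[s, a] per state-action pair.
    def tabular_policy(theta, s):
        return softmax(theta[s])                     # pi_theta(.|s) proportional to exp(theta[s, a])

    # Linear class: f_theta(s, a) = theta . phi(s, a), with phi(s, a) in R^d.
    def linear_policy(theta, phi, s, num_actions):
        logits = np.array([theta @ phi(s, a) for a in range(num_actions)])
        return softmax(logits)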
In practice, policy gradient methods rule…
• Why do we like them?
• They easily deal with large state/action spaces (through the neural-net parameterization).
• We can estimate the gradient using only simulation of our current policy π_θ (the expectation is over the states and actions visited under π_θ).
• They directly optimize the cost function of interest!
• They are the most effective method for obtaining state-of-the-art results.
  θ ← θ + η ∇V^θ(s₀)
Two (equal) expressions for the policy gradient
  ∇V^θ(s₀) = (1/(1 − γ)) E_{s∼d^θ, a∼π_θ}[ Q^θ(s, a) ∇log π_θ(a|s) ]
  ∇V^θ(s₀) = (1/(1 − γ)) E_{s∼d^θ, a∼π_θ}[ A^θ(s, a) ∇log π_θ(a|s) ]
(shorthand notation above: d^θ = d^{π_θ}_{s₀}, Q^θ = Q^{π_θ}, A^θ = A^{π_θ}, V^θ = V^{π_θ})
• Where do these expressions come from?
• How do we compute them?
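Why the two expressions agree (a short check added here, not verbatim from the slides): for any baseline b(s) that does not depend on the action,
  E_{a∼π_θ(⋅|s)}[ b(s) ∇log π_θ(a|s) ] = b(s) Σ_a ∇π_θ(a|s) = b(s) ∇( Σ_a π_θ(a|s) ) = b(s) ∇1 = 0,
so replacing Q^θ(s, a) by A^θ(s, a) = Q^θ(s, a) − V^θ(s) inside the expectation changes nothing.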
Example: an important special case
• Remember the softmax policy class (a “tabular” parameterization):
  π_θ(a|s) ∝ exp(θ_{s,a})
• Complete class with S·A parameters: one parameter per state-action pair, so it contains the optimal policy.
• Expression for the softmax class:
  ∂V^θ(s₀) / ∂θ_{s,a} = (1/(1 − γ)) d^θ_{s₀}(s) π_θ(a|s) A^θ(s, a)
• Intuition: increase θ_{s,a} if the ‘weighted’ advantage is large.
Part-1A: Derivations and Estimation
General Derivation
∇V^{π_θ}(s₀)
 = ∇ Σ_{a₀} π_θ(a₀|s₀) Q^{π_θ}(s₀, a₀)
 = Σ_{a₀} ( ∇π_θ(a₀|s₀) ) Q^{π_θ}(s₀, a₀) + Σ_{a₀} π_θ(a₀|s₀) ∇Q^{π_θ}(s₀, a₀)
 = Σ_{a₀} π_θ(a₀|s₀) ( ∇log π_θ(a₀|s₀) ) Q^{π_θ}(s₀, a₀)
   + Σ_{a₀} π_θ(a₀|s₀) ∇( r(s₀, a₀) + γ Σ_{s₁} P(s₁|s₀, a₀) V^{π_θ}(s₁) )
 = Σ_{a₀} π_θ(a₀|s₀) ( ∇log π_θ(a₀|s₀) ) Q^{π_θ}(s₀, a₀) + γ Σ_{a₀, s₁} π_θ(a₀|s₀) P(s₁|s₀, a₀) ∇V^{π_θ}(s₁)
 = E[ Q^{π_θ}(s₀, a₀) ∇log π_θ(a₀|s₀) ] + γ E[ ∇V^{π_θ}(s₁) ].
Unrolling this recursion over time gives the expressions on the previous slide.
SL vs RL: How do we obtain gradients?
• In supervised learning, how do we compute the gradient of our loss, ∇L(θ)?
  θ ← θ − η ∇L(θ)
  Hint: can we compute our loss?
• In reinforcement learning, how do we compute the policy gradient ∇V^θ(s₀)?
  θ ← θ + η ∇V^θ(s₀)
  ∇V^θ(s₀) = (1/(1 − γ)) E_{s,a}[ Q^θ(s, a) ∇log π_θ(a|s) ]
Monte Carlo Estimation
• Sample a trajectory: execute π_θ and obtain s₀, a₀, r₀, s₁, a₁, r₁, …
• Form the estimates:
  Q̂(s_t, a_t) = Σ_{t′=0}^∞ γ^{t′} r(s_{t′+t}, a_{t′+t})
  ∇̂V^θ = Σ_{t=0}^∞ γ^t Q̂(s_t, a_t) ∇log π_θ(a_t|s_t)
• Lemma [Glynn ’90, Williams ’92]: this gives an unbiased estimate of the gradient:
  E[ ∇̂V^θ ] = ∇V^θ(s₀)
  This is the “likelihood ratio” method. (A sampled-gradient sketch follows below.)
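A minimal sketch of this estimator for the tabular softmax class, truncating the infinite sums at a finite horizon; the Gym-style env API (reset()/step(a) returning (next_state, reward, done)) and all names are illustrative assumptions:

    import numpy as np

    def softmax(z):
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    def reinforce_gradient(env, theta, gamma, T=1000):
        """One-trajectory likelihood-ratio estimate of grad V^theta(s0), tabular softmax."""
        states, actions, rewards = [], [], []
        s = env.reset()
        for _ in range(T):                                   # truncate the infinite sum at T
            probs = softmax(theta[s])
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            if done:
                break
            s = s_next
        grad = np.zeros_like(theta)
        for t, (s, a) in enumerate(zip(states, actions)):
            # Q-hat(s_t, a_t): discounted return from time t onward
            q_hat = sum(gamma ** k * rewards[t + k] for k in range(len(rewards) - t))
            # grad log pi_theta(a|s) for the softmax class is e_a - pi_theta(.|s)
            score = -softmax(theta[s])
            score[a] += 1.0
            grad[s] += gamma ** t * q_hat * score
        return grad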
Back to the softmax policy class…
  π_θ(a|s) ∝ exp(θ_{s,a})
• Expression for the softmax class:
  ∂V^θ(s₀) / ∂θ_{s,a} = (1/(1 − γ)) d^θ_{s₀}(s) π_θ(a|s) A^θ(s, a)
• What might be making gradient estimation difficult here?
  (hint: when does gradient descent effectively stop?)
Part-1B: Preconditioning and the Natural Policy Gradient
A closer look at the Natural Policy Gradient (NPG)
• Practice: (almost) all methods are gradient based, usually variants of: Natural Policy Gradient [K. ’01]; TRPO [Schulman ’15]; PPO [Schulman ’17].
• NPG warps the distance metric (using the Fisher information metric) to stretch the corners out, so that we move ‘more’ near the boundaries. The update is:
  F(θ) = E_{s∼d^θ, a∼π_θ}[ ∇log π_θ(a|s) ∇log π_θ(a|s)ᵀ ]
  θ ← θ + η F(θ)^{-1} ∇V^θ(s₀)
(a minimal sampled sketch of this update follows below)
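A minimal sampled sketch of the preconditioned step, assuming we already have sampled score vectors ∇log π_θ(a|s) and a stochastic estimate of ∇V^θ(s₀); the damping term and all names are illustrative assumptions:

    import numpy as np

    def npg_update(theta, scores, grad_v, lr=0.1, damping=1e-3):
        """scores: (N, d) sampled grad-log-pi vectors (s ~ d^theta, a ~ pi_theta).
           grad_v: (d,) stochastic estimate of grad V^theta(s0)."""
        d = scores.shape[1]
        fisher = scores.T @ scores / scores.shape[0]          # empirical E[score score^T]
        natural_grad = np.linalg.solve(fisher + damping * np.eye(d), grad_v)
        return theta + lr * natural_grad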
TRPO (Trust Region Policy Optimization)
• TRPO [Schulman ’15] (related: PPO [Schulman ’17]): improve the objective while staying “close” in KL to the previous policy:
  θ_{t+1} = argmax_θ V^θ(s₀)
  s.t.  E_{s∼d^{θ_t}}[ KL( π_θ(⋅|s) ‖ π_{θ_t}(⋅|s) ) ] ≤ δ
• NPG = TRPO: they are first-order equivalent (and have the same practical behavior).
NPG intuition. But first…
• NPG as preconditioning:
  θ ← θ + η F(θ)^{-1} ∇V^θ(s₀)
  or, equivalently,
  θ ← θ + (η/(1 − γ)) ( E[ ∇log π_θ(a|s) ∇log π_θ(a|s)ᵀ ] )^{-1} E[ ∇log π_θ(a|s) A^θ(s, a) ]
• What does the following problem remind you of?
  ( E[ X Xᵀ ] )^{-1} E[ X Y ]
• What is NPG trying to approximate?
Equivalent Update Rule (for the softmax)
• Take the best linear fit of Q^θ in the “policy-space” features: this gives
  W*_{s,a} = A^θ(s, a)
• Using the NPG update rule:
  θ_{s,a} ← θ_{s,a} + (η/(1 − γ)) A^θ(s, a)
• And so an equivalent update rule to NPG is:
  π_θ(a|s) ← π_θ(a|s) exp( (η/(1 − γ)) A^θ(s, a) ) / Z
• What algorithm does this remind you of? (a minimal sketch of this multiplicative update follows below)
• Questions: convergence? General case/approximation?
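A minimal sketch of the exponentiated (multiplicative-weights style) update for a tabular policy, assuming the advantages A^θ(s, a) have already been computed; names are illustrative:

    import numpy as np

    def soft_policy_iteration_step(pi, advantages, eta, gamma):
        """pi, advantages: (S, A) arrays. Returns the NPG / soft policy iteration update."""
        new_pi = pi * np.exp(eta / (1.0 - gamma) * advantages)
        return new_pi / new_pi.sum(axis=1, keepdims=True)     # per-state normalizer Z(s)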
But does gradient descent even work in RL??
(figure: the optimization landscape in Supervised Learning vs. Reinforcement Learning)
What about approximation?
Stay tuned!!
Part-2: Convergence and Approximation
The Optimization Landscape
Supervised Learning:
• Gradient descent tends to ‘just work’ in practice and is not sensitive to initialization
• Saddle points are not a problem…
Reinforcement Learning:
• Local search depends on initialization in many real problems, due to “very” flat regions
• Gradients can be exponentially small in the “horizon”
RL and the vanishing gradient problem
Reinforcement Learning:
• A random initialization has “very” flat regions in real problems (lack of ‘exploration’)
• Lemma [Agarwal, Lee, K., Mahajan 2019]: with random initialization, all k-th order gradients are exponentially small in the horizon (2^{−Ω(H)} in magnitude) for up to k < H / ln H orders, where H = 1/(1 − γ).
• This is a landscape/optimization issue (and also a statistical issue if we use a random initialization).
Prior work: The Explore/Exploit Tradeoff
[Thrun ’92] Random search does not find the reward quickly.
(theory) Balancing the explore/exploit tradeoff:
• [Kearns & Singh ’02]: E3 is a near-optimal algorithm.
• Sample complexity: [K. ’03, Azar ’17]
• Model free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvari ’10; Lattimore et al. ’14; Jin et al. ’18]
Part 2: understanding the convergence properties of the (NPG) policy gradient method!
§ A: Convergence: let’s look at the tabular/“softmax” case
§ B: Approximation: “linear” policies and neural nets
NPG: back to the “soft” policy iteration interpretation
• Remember the softmax policy class
  π_θ(a|s) ∝ exp(θ_{s,a})
  which has S ⋅ A parameters.
• At iteration t, the NPG update rule
  θ_{t+1} ← θ_t + η F(θ_t)^{-1} ∇V^{(t)}(s₀)
  is equivalent to a “soft” (exact) policy iteration update rule:
  π^{(t+1)}(a|s) ← π^{(t)}(a|s) exp( (η/(1 − γ)) A^{(t)}(s, a) ) / Z_t(s)
• What happens with this non-convex update rule?
Part-2A: Global Convergence
Provable Global Convergence of NPG
Theorem [Agarwal, Lee, K., Mahajan 2019]: for the softmax policy class, with η = (1 − γ)² log A, after T iterations we have
  V^{(T)}(s₀) ≥ V*(s₀) − 2 / ( (1 − γ)² T )
• Dimension-free iteration complexity! (No dependence on S, A.)
• Also a “FAST RATE”!
• Even though the problem is non-convex, a mirror descent analysis applies; analysis idea from [Even-Dar, K., Mansour 2009].
• What about approximate/sampled gradients and large state spaces?
Notes: Potentials and Progress?
But first, the “Performance Difference Lemma”
• Lemma [K. ’02]: a characterization of the performance gap between any two policies π, π′:
  V^π(s₀) − V^{π′}(s₀) = E_{a₀, s₁, a₁, … ∼ π}[ Σ_{t=0}^∞ γ^t A^{π′}(s_t, a_t) │ s₀ ]
                      = (1/(1 − γ)) E_{s∼d^π_{s₀}, a∼π}[ A^{π′}(s, a) ]
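A short telescoping derivation (a sketch added for completeness, not verbatim from the slides):
  V^π(s₀) − V^{π′}(s₀) = E_{τ∼π}[ Σ_{t≥0} γ^t r(s_t, a_t) ] − V^{π′}(s₀)
                      = E_{τ∼π}[ Σ_{t≥0} γ^t ( r(s_t, a_t) + γ V^{π′}(s_{t+1}) − V^{π′}(s_t) ) ],
since the added value terms telescope to −V^{π′}(s₀). Taking the expectation over s_{t+1} ∼ P(⋅|s_t, a_t) turns r(s_t, a_t) + γ V^{π′}(s_{t+1}) into Q^{π′}(s_t, a_t), so each summand becomes γ^t A^{π′}(s_t, a_t); rewriting the discounted sum with the visitation measure d^π_{s₀} gives the second form.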
Mirror Descent Gives a Proof! (even though it is non-convex)
  E_{s∼d*}[ KL(π*_s ‖ π^{(t)}_s) − KL(π*_s ‖ π^{(t+1)}_s) ]
   = E_{s∼d*} Σ_a π*(a|s) log( π^{(t+1)}(a|s) / π^{(t)}(a|s) )
   = E_{s∼d*}[ Σ_a π*(a|s) (η/(1 − γ)) A^{(t)}(s, a) − log Z_t(s) ]
   = η ( V^{π*}(s₀) − V^{(t)}(s₀) ) − E_{s∼d*} log Z_t(s)
(here d* is the visitation measure of an optimal policy π* started from s₀; the last step uses the Performance Difference Lemma)
Notes: are we making progress?
Re-arranging:
  V^{π*}(s₀) − V^{(t)}(s₀) = (1/η) E_{s∼d*}[ KL(π*_s ‖ π^{(t)}_s) − KL(π*_s ‖ π^{(t+1)}_s) + log Z_t(s) ]
Understanding progress:
  V^{π*}(s₀) − V^{(T−1)}(s₀)
   ≤ (1/T) Σ_{t=0}^{T−1} ( V^{π*}(s₀) − V^{(t)}(s₀) )
   ≤ (1/(ηT)) E_{s∼d*}[ KL(π*_s ‖ π^{(0)}_s) − KL(π*_s ‖ π^{(T)}_s) ] + (1/(ηT)) Σ_{t=0}^{T−1} E_{s∼d*} log Z_t(s)
   ≤ log|A| / (ηT) + (1/(ηT)) Σ_{t=0}^{T−1} E_{s∼d*} log Z_t(s)
(the first inequality uses that the iterates’ values are non-decreasing; the last uses KL(π*_s ‖ π^{(0)}_s) ≤ log|A|, e.g. for the uniform initialization)
A slow rate proof sketch…
The key lemma for the fast rate…
  E_{s∼μ} log Z_t(s) ≤ … ≤ (η/(1 − γ)) E_{s∼μ}[ V^{(t+1)}(s) − V^{(t)}(s) ]
(intermediate steps omitted here; see “Some details for the fast rate!” at the end)
The fast rate proof!
  V^{π*}(s₀) − V^{(T−1)}(s₀)
   ≤ log|A| / (ηT) + (1/(ηT)) Σ_{t=0}^{T−1} E_{s∼d*} log Z_t(s)
   ≤ log|A| / (ηT) + (1/((1 − γ)T)) Σ_{t=0}^{T−1} ( V^{(t+1)}(d*) − V^{(t)}(d*) )
   = log|A| / (ηT) + ( V^{(T)}(d*) − V^{(0)}(d*) ) / ((1 − γ)T)
   ≤ log|A| / (ηT) + 1 / ((1 − γ)² T).
Part-2B: Approximation (and statistics)
Remember our policy classes:
• π_θ(a|s) is the probability of action a given s, parameterized by θ:
  π_θ(a|s) ∝ exp(f_θ(s, a))
• Softmax policy class: f_θ(s, a) = θ_{s,a}
• Linear policy class: f_θ(s, a) = θ ⋅ φ(s, a), where φ(s, a) ∈ R^d
• Neural policy class: f_θ(s, a) is a neural network
OpenAI: dexterous hand manipulation not far off?
• Trained with “domain randomization”
• Basically: the start-state measure s₀ ∼ μ was diverse.
  max_{θ∈Θ} E_{s∼μ}[ V^θ(s) ]
Prior work: The Explore/Exploit Tradeoff
[Thrun ’92] Random search does not find the reward quickly.
(theory) Balancing the explore/exploit tradeoff:
• [Kearns & Singh ’02]: E3 is a near-optimal algorithm.
• Sample complexity: [K. ’03, Azar ’17]
• Model free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvari ’10; Lattimore et al. ’14; Jin et al. ’18]
Policy search algorithms: exploration and start-state measures
• Idea: reweighting by a diverse start-state distribution μ helps handle the “vanishing gradient” problem.
• There is a sense in which this reweighting is related to a “condition number”.
• Related theory:
  • [K. & Langford ’02], [K. ’03]: Conservative Policy Iteration (CPI) has the strongest provable guarantees, in terms of μ along with the error of a ‘supervised learning’ black box.
  • Other ‘reductions to SL’: [Bagnell et al. ’04], [Scherrer & Geist ’14], [Geist et al. ’19], etc.
  • Helpful for imitation learning: [Ross et al. ’11]; [Ross & Bagnell ’14]; [Sun et al. ’17]
NPG for the linear policy class
• Now:
  π_θ(a|s) ∝ exp(θ ⋅ φ_{s,a})
• Take the best linear fit in the “policy-space” features:
  W* = argmin_W E_{s₀∼μ} E_{s,a∼d^θ_{s₀}}[ ( W ⋅ φ_{s,a} − A^θ(s, a) )² ]
• μ is our start-state distribution, hopefully with “coverage”.
• Define Â^θ(s, a) = W* ⋅ φ_{s,a}; the NPG update is then equivalent to:
  π_θ(a|s) ← π_θ(a|s) exp( (η/(1 − γ)) Â^θ(s, a) ) / Z
• This is like a soft “approximate” policy iteration step.
Sample-Based NPG, linear case
• Sample trajectories: at iteration t, draw a start state s₀ ∼ μ, then follow π_t.
• Now do regression on this sampled data:
  Ŵ_t = argmin_W Ê_{s,a}[ ( W ⋅ φ_{s,a} − A^{(t)}(s, a) )² ]
• Define Â_t(s, a) = Ŵ_t ⋅ φ_{s,a}
• And so an equivalent update rule to NPG is:
  π_{t+1}(a|s) ← π_t(a|s) exp( (η/(1 − γ)) Â_t(s, a) ) / Z
(a minimal sketch of one such iteration follows below)
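A minimal sketch of one such iteration, assuming we already have sampled features φ_{s,a} paired with Monte Carlo advantage estimates; the ridge term and all names are illustrative assumptions:

    import numpy as np

    def npg_linear_iteration(theta, phis, adv_hats, eta, gamma, reg=1e-6):
        """phis: (N, d) sampled features; adv_hats: (N,) advantage estimates.
           Least-squares fit of W_hat, then the NPG step on the linear class:
           theta <- theta + eta/(1-gamma) * W_hat, which is exactly the
           multiplicative update pi <- pi * exp(eta/(1-gamma) * W_hat . phi) / Z."""
        d = phis.shape[1]
        W_hat = np.linalg.solve(phis.T @ phis + reg * np.eye(d), phis.T @ adv_hats)
        return theta + eta / (1.0 - gamma) * W_hat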
Guarantees: NPG for linear policy classes
Theorem [Agarwal, Lee, K., Mahajan 2019]: A = # actions, H = horizon = 1/(1 − γ). After T iterations, for all s₀, the NPG algorithm satisfies (up to constant factors)
  V^{π_T}(s₀) ≥ V*(s₀) − H √( 2 log A / T ) − √( H³ κ ε )
• (realizability) Suppose that A^π(s, a) is a linear function in φ_{s,a}.
• (supervised learning error) Suppose we have bounded regression error, say due to sampling:
  Ê[ ( A^{(t)}(s, a) − Ŵ_t ⋅ φ_{s,a} )² ] ≤ ε
• (relative condition number) With respect to the optimal state-action measure d* (starting from s₀):
  κ = max_x  E_{s,a∼d*}[ (φ_{s,a} ⋅ x)² ] / E_{s,a∼ν}[ (φ_{s,a} ⋅ x)² ]
  where ν is the state-action measure used for the regression (induced by starting from μ).
(a small numerical sketch of κ follows below)
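A small numerical sketch of this κ (illustrative; it equals the largest generalized eigenvalue of the two feature second-moment matrices):

    import numpy as np

    def relative_condition_number(phis_dstar, phis_nu, reg=1e-8):
        """phis_dstar: (N, d) features sampled from d*; phis_nu: (M, d) features from nu.
           kappa = max_x (x^T Sigma_dstar x) / (x^T Sigma_nu x)."""
        d = phis_dstar.shape[1]
        sigma_star = phis_dstar.T @ phis_dstar / len(phis_dstar)
        sigma_nu = phis_nu.T @ phis_nu / len(phis_nu) + reg * np.eye(d)
        # largest eigenvalue of Sigma_nu^{-1} Sigma_dstar (a generalized eigenvalue problem)
        return np.max(np.real(np.linalg.eigvals(np.linalg.solve(sigma_nu, sigma_star))))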
Sample-Based NPG, neural case
• Now:
  π_θ(a|s) ∝ exp(f_θ(s, a))
• Sampling: at iteration t, sample s₀ ∼ μ and follow π_t.
• Supervised learning/regression:
  Ŵ_t = argmin_W Ê_{s,a}[ ( W ⋅ ∇f_{θ_t}(s, a) − A^{(t)}(s, a) )² ]
• Define Â_t(s, a) = Ŵ_t ⋅ ∇f_{θ_t}(s, a)
• The NPG update is:
  π_{t+1}(a|s) ← π_t(a|s) exp( (η/(1 − γ)) Â_t(s, a) ) / Z
Guarantees: NPG for neural policy classes
Theorem [Agarwal, Lee, K., Mahajan 2019]: A = # actions, H = horizon. After T iterations, for all s₀, the NPG algorithm satisfies (up to constant factors)
  V^{π_T}(s₀) ≥ V*(s₀) − H √( 2 log A / T ) − √( H³ κ ε )
• (realizability) Suppose that A^π(s, a) is a linear function in ∇f_θ(s, a).
• (supervised learning error) Suppose we have bounded regression error, say due to sampling:
  E_{s,a∼ν}[ ( A^{(t)}(s, a) − Ŵ_t ⋅ ∇f_θ(s, a) )² ] ≤ ε
• (relative condition number) With respect to the optimal state-action measure d* (starting from s₀):
  κ = max_x  E_{s,a∼d*}[ (∇f_θ(s, a) ⋅ x)² ] / E_{s,a∼ν}[ (∇f_θ(s, a) ⋅ x)² ]
• Related: NTK + TRPO analysis [Liu et al. ’19]
Thank you!
• Today: mathematical foundations of policy gradient methods.
• With “coverage”, policy gradients have the strongest theoretical guarantees and are practically effective!
• New directions / not discussed:
  • design of good exploratory distributions μ
  • relations to transfer learning and “distribution shift”
• RL is a very relevant area, both now and in the future! With some basics, please participate…
Some details for the fast rate!
  V^{(t+1)}(μ) − V^{(t)}(μ)
   = (1/(1 − γ)) E_{s∼d^{(t+1)}_μ} Σ_a π^{(t+1)}(a|s) A^{(t)}(s, a)
   = (1/η) E_{s∼d^{(t+1)}_μ} Σ_a π^{(t+1)}(a|s) log( π^{(t+1)}(a|s) Z_t(s) / π^{(t)}(a|s) )
   = (1/η) E_{s∼d^{(t+1)}_μ} KL( π^{(t+1)}_s ‖ π^{(t)}_s ) + (1/η) E_{s∼d^{(t+1)}_μ} log Z_t(s)
   ≥ (1/η) E_{s∼d^{(t+1)}_μ} log Z_t(s)
   ≥ ((1 − γ)/η) E_{s∼μ} log Z_t(s).