Reinforcement Learning China Summer School
Model-based Reinforcement Learning
Prof. Weinan Zhang
John Hopcroft Center, Shanghai Jiao Tong University
July 30, 2020
Overall Pathway of DRL
• Deep reinforcement learning gets appealing success
  • Atari, AlphaGo, DOTA 2, AlphaStar
• But DRL has very low data efficiency
  • Trial-and-error learning for deep networks
• A recent popular direction is model-based RL
  • Build a model p(s', r | s, a)
  • Train the policy based on the model
  • So that the data efficiency can be improved

[Figure: a perspective of deep RL. Slide from Sergey Levine. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-1.pdf]
Interaction between Agent and Environment
• Real environment p(s', r | s, a)
  • State dynamics p(s' | s, a)
  • Reward function r(s, a)
• The agent's policy π(a | s) interacts with the real environment and collects experience data {(s, a, r, s')}
Interaction between Agent and Env. Model
• Real environment p(s', r | s, a)
  • State dynamics p(s' | s, a)
  • Reward function r(s, a)
• Environment model
  • Learned state dynamics p(s' | s, a)
  • Learned reward function r(s, a)
• The policy π(a | s) can also interact with the environment model to collect simulated data {(s, a, r, s')}
Model-free RL vs. Model-based RL (1)
• Model-based RL
  • On-policy learning once the model is learned
  • May not need further real interaction data once the model is learned (batch RL)
  • Usually shows higher sample efficiency than MFRL
  • Suffers from model compounding error
• Model-free RL
  • The best asymptotic performance
  • Highly suitable for DL architectures with big data
  • Off-policy methods still show instabilities
  • Very low sample efficiency & requires a huge amount of training data
Model-based RL: Blackbox and Whitebox
Model as a Blackbox
• Seamless to policy training algorithms
• The simulation data efficiency may still be low
• E.g., Dyna-Q, MPC, MBPO

Model as a Whitebox
• Offers both data and gradient guidance for value and policy
• High data efficiency
• E.g., MAAC, SVG, PILCO
• Blackbox: the policy only receives samples (s, a, r, s') from the model
• Whitebox: the model additionally provides gradients through the predicted transition,
  ∂V(s')/∂s' · ∂s'/∂a · ∂a/∂θ,  where s' ∼ f_φ(s, a)
Content
1. Introduction to MBRL from Dyna
2. Shooting methods: RS, PETS, POPLIN
3. Theoretic bounds and methods: SLBO, MBPO & BMPO
4. Backpropagation through paths: SVG and MAAC
Model-based RL
• Learn an environment model p(s', r | s, a), or p(s' | s, a) with r(s, a) known, from real trajectory data {(s, a, r, s')} collected in the real environment
• Use roll-out trajectory data generated in the virtual environment, together with the real data, to learn Q(s, a) and π(a | s)

Annotations based on Rich Sutton's figure.
Model-free RL vs. Model-based RL (2)
• Model-based RL
  • Can efficiently learn the model by supervised learning methods
  • Can reason about model uncertainty
  • Empirically needs fewer interactions with the real environment
  • First learns a model, then constructs a value function: two sources of approximation error
• Model-free RL
  • No need to learn the model
  • Learns the value function (and/or policy) from real experience (on-policy or off-policy), which is costly or even dangerous to collect and may be biased or out-of-date
Q-Planning
• Random-sample one-step tabular Q-planning
  • First, learn a model p(s', r | s, a) from experience data
  • Then perform one-step sampling with the model to learn the Q function (a minimal sketch follows below)
• Here model learning and reinforcement learning are separate
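As a concrete illustration, here is a minimal Python sketch of random-sample one-step tabular Q-planning. The tabular `model` dictionary, the `num_actions` argument, and all hyperparameters are illustrative assumptions rather than part of the original algorithm statement.

```python
# A minimal sketch of random-sample one-step tabular Q-planning, assuming a
# learned tabular model `model[(s, a)] -> (r, s_next)` built from experience data.
import random
from collections import defaultdict

def q_planning(model, num_actions, steps=10000, alpha=0.1, gamma=0.99):
    Q = defaultdict(float)
    sa_pairs = list(model.keys())          # previously observed (s, a) pairs
    for _ in range(steps):
        s, a = random.choice(sa_pairs)     # 1. sample a state-action pair at random
        r, s_next = model[(s, a)]          # 2. query the learned model
        # 3. one-step Q-learning backup on the simulated transition
        max_q_next = max(Q[(s_next, b)] for b in range(num_actions))
        Q[(s, a)] += alpha * (r + gamma * max_q_next - Q[(s, a)])
    return Q
```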
Dyna
• Dyna learns an environment model p(s', r | s, a) (or p(s' | s, a) with r(s, a) known) from real trajectory data {(s, a, r, s')} collected in the real environment
• Q(s, a) and π(a | s) are then updated from both the real data and roll-out trajectory data generated in the virtual environment

Annotations based on Rich Sutton's figure.
Dyna-Q
Sutton, Richard S. "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming." Machine Learning Proceedings 1990. Morgan Kaufmann, 1990. 216-224.
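A minimal tabular Dyna-Q sketch following the structure above: a direct RL update from each real transition, model learning, and n extra planning updates on simulated transitions. The `env.reset()`/`env.step()` interface, the deterministic tabular model, and all hyperparameters are illustrative assumptions.

```python
# A minimal sketch of tabular Dyna-Q, assuming a deterministic tabular environment
# where env.reset() returns a state and env.step(a) returns (s_next, reward, done).
import random
from collections import defaultdict

def dyna_q(env, num_actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s')
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection in the real environment
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # (a) direct RL update from the real transition
            target = r + gamma * max(Q[(s_next, b)] for b in range(num_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: memorize the observed transition
            model[(s, a)] = (r, s_next)
            # (c) planning: n extra Q-learning updates on simulated transitions
            for _ in range(n_planning):
                ps, pa = random.choice(list(model.keys()))
                pr, ps_next = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in range(num_actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```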
Dyna-Q on a Simple Maze
Key Questions of Model-based RL
• Does the model really help improve the data efficiency?
• The model is inevitably inaccurate to some extent. When should we trust the model?
• How do we properly leverage the model to better train our policy?
Content
1. Introduction to MBRL from Dyna
2. Shooting methods: RS, PETS, POPLIN
3. Theoretic bounds and methods: SLBO, MBPO & BMPO
4. Backpropagation through paths: SVG and MAAC
Shooting Methods
The model can also be used to help decision making when interacting with the environment. For the current state s_0:
• Given an action sequence of length T: [a_0, a_1, a_2, ..., a_T]
• we can sample a trajectory from the model: [s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T]
• and then choose the action sequence with the highest estimated return (a minimal sketch follows below):

  Q(s, a) = Σ_{t=0}^{T} γ^t r_t,    π(s) = argmax_a Q(s, a)
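A minimal sketch of the shooting procedure just described: sample candidate action sequences, roll each out through the learned model, and keep the sequence with the highest estimated return. The `model(s, a) -> (s_next, r)` interface and the action bounds are illustrative assumptions.

```python
# A minimal sketch of shooting-based planning with randomly sampled action sequences.
import numpy as np

def evaluate_sequence(model, s0, actions, gamma=0.99):
    s, ret = s0, 0.0
    for t, a in enumerate(actions):
        s, r = model(s, a)                 # simulated one-step transition
        ret += (gamma ** t) * r
    return ret

def random_shooting(model, s0, horizon=20, num_candidates=500,
                    action_dim=1, action_low=-1.0, action_high=1.0):
    best_seq, best_ret = None, -np.inf
    for _ in range(num_candidates):
        seq = np.random.uniform(action_low, action_high, size=(horizon, action_dim))
        ret = evaluate_sequence(model, s0, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq                         # execute best_seq[0], then replan
```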
Random Shooting (RS)
• The action sequences are randomly sampled
• Pros:
  • implementation simplicity
  • lower computational burden (no gradients)
  • no requirement to specify the task horizon in advance
• Cons: high variance; may not sample the high-reward actions
• A refinement: the Cross Entropy Method (CEM)
  • CEM samples actions from a distribution closer to the previous action samples that yield high reward (a minimal sketch follows after the figure below)

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
[Figure: learning curves with shaded regions showing the 5th percentile, median, and 95th percentile performance]
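A minimal sketch of the CEM refinement, reusing `evaluate_sequence` from the shooting sketch above: a Gaussian over action sequences is repeatedly refit to the elite (highest-return) samples. All hyperparameters are illustrative.

```python
# A minimal sketch of Cross Entropy Method (CEM) planning over action sequences.
import numpy as np

def cem_planning(model, s0, horizon=20, num_candidates=500, num_elites=50,
                 iterations=5, action_dim=1):
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # sample candidate sequences from the current Gaussian
        seqs = mean + std * np.random.randn(num_candidates, horizon, action_dim)
        returns = np.array([evaluate_sequence(model, s0, seq) for seq in seqs])
        elites = seqs[np.argsort(returns)[-num_elites:]]     # best sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean                                              # planned action sequence
```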
PETS: Probabilistic Ensembles with Trajectory Sampling
• The probabilistic ensemble (PE) dynamics model is shown as an ensemble of two bootstraps
• Bootstrap disagreement far from data captures epistemic uncertainty: our subjective uncertainty due to a lack of data (model variance)
• Each member is a probabilistic (Gaussian) neural network that captures aleatoric uncertainty (stochastic environment); a minimal sketch of one member follows below

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
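A minimal sketch of a single probabilistic ensemble member: a network predicting a diagonal Gaussian over the next state, trained by negative log-likelihood. The architecture sizes and clamping range are illustrative assumptions; PETS trains several such networks on bootstrapped data.

```python
# A minimal sketch of one Gaussian dynamics network and its NLL training loss.
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)       # predicted mean of s'
        self.log_std = nn.Linear(hidden, state_dim)    # predicted log std of s'

    def forward(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

def nll_loss(model, s, a, s_next):
    # Gaussian negative log-likelihood of the observed next state
    mean, log_std = model(s, a)
    var = (2 * log_std).exp()
    return (((s_next - mean) ** 2) / (2 * var) + log_std).sum(-1).mean()
```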
PETS: Probabilistic Ensembles with Trajectory Sampling
• The trajectory sampling (TS) propagation technique uses the dynamics model to re-sample each particle (with its associated bootstrap) according to its probabilistic prediction at each point in time, up until horizon T

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
PETS: Probabilistic Ensembles with Trajectory Sampling
• Planning: at each time step, the MPC algorithm computes an optimal action sequence by sampling multiple sequences, applies the first action in the sequence, and repeats until the task horizon (a minimal control-loop sketch follows below)

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
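A minimal sketch of the MPC control loop: replan an action sequence at every step (e.g., with the CEM planner sketched above) but execute only the first action. The `env` interface is an illustrative assumption.

```python
# A minimal sketch of the MPC loop used by shooting-based planners such as PETS.
def mpc_control(env, planner, model, task_horizon=1000):
    s, total_reward = env.reset(), 0.0
    for _ in range(task_horizon):
        action_sequence = planner(model, s)     # re-plan from the current state
        a = action_sequence[0]                  # execute only the first action
        s, r, done = env.step(a)
        total_reward += r
        if done:
            break
    return total_reward
```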
PETS Algorithm
Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
POPLIN: Model-based Policy Planning
• PETS uses CEM to maintain a distribution of actions yielding high reward at each time step
• Can we choose the action sequence better?
• POPLIN: maintain a policy to sample actions given the current simulated state
Wang, Tingwu, and Jimmy Ba. "Exploring Model-based Planning with Policy Networks." arXiv preprint arXiv:1906.08649 (2019).
POPLIN: Model-based Policy Planning
• POPLIN: maintain a policy to sample actions given the current simulated state
• The planning search can be performed at two levels:
  • Action level (POPLIN-A)
  • Policy parameter level (POPLIN-P)

Wang, Tingwu, and Jimmy Ba. "Exploring Model-based Planning with Policy Networks." arXiv preprint arXiv:1906.08649 (2019).
POPLIN Policy Distillation Schemes
• With action sequences ranked & selected by total reward, the policy can be distilled via:
  • Behavior cloning (BC) for POPLIN-A and POPLIN-P (a minimal sketch follows below)
  • Generative adversarial network training (GAN) for POPLIN-P
  • Setting the parameter average (AVG) for POPLIN-P
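A minimal sketch of the behavior-cloning (BC) distillation idea: regress a policy network onto the actions chosen by the planner in the states it visited. The data format and the training loop are illustrative assumptions, not the exact procedure of the paper.

```python
# A minimal sketch of distilling planned actions into a policy network by behavior cloning.
import torch
import torch.nn as nn

def bc_distill(policy, planned_states, planned_actions, epochs=50, lr=1e-3):
    """policy: nn.Module mapping state -> action.
    planned_states / planned_actions: tensors of states visited during planning
    and the (elite) actions the planner chose in those states."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(planned_states)
        loss = ((pred - planned_actions) ** 2).mean()   # mean-squared BC loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```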
POPLIN Experiments
[Learning curves on Cheetah, Ant, Swimmer, and Acrobot]
POPLIN Experiments: POPLIN vs. PETS
Each planned candidate action trajectory is transformed with PCA into a 2D scatter plot. Red areas have high reward; blue points are the actions taken.
Content
1. Introduction to MBRL from Dyna
2. Shooting methods: RS, PETS, POPLIN
3. Theoretic bounds and methods: SLBO, MBPO & BMPO
4. Backpropagation through paths: SVG and MAAC
Any Theoretic Bound for MBRL?
• A set of practical requirements for the theoretical analysis, (R1)-(R3), relating three quantities:
  • the discrepancy between the model and the real environment when playing π
  • the value in the model
  • the value in the real environment
• The example given for one of the requirements is noted as a very strong assumption

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.
Meta Algorithm of MBRL
• Remarks
  1. Optimize the lower bound directly (with TRPO)
  2. Jointly optimizing M and π encourages the algorithm to choose the most optimistic model among those that can be used to accurately estimate the value function
  3. Fixing M and only optimizing π is just model-free RL
Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.
Convergence Theorem & Proof
Proof sketch: [equation steps shown on the slide; one term equals 0 by (R1), and the remaining step follows by (R2)]
SLBO: Stochastic Lower Bound Optimization
• Model optimization loss (sg: stop gradient)
• Model & policy optimization target
• A minimal sketch of the stop-gradient multi-step model loss follows below

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.
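A minimal sketch of a multi-step model-prediction loss with stop-gradient (sg) on the recursively predicted input states, in the spirit of the SLBO model objective; the exact loss in the paper differs in details. The `model(s, a) -> s'` interface and batch format are illustrative assumptions.

```python
# A minimal sketch of an H-step model loss with stop-gradient on the fed-back predictions.
import torch

def multistep_model_loss(model, states, actions, H):
    """states: (H+1, batch, state_dim), actions: (H, batch, action_dim)."""
    s_hat = states[0]
    loss = 0.0
    for i in range(H):
        # feed the prediction back in, but stop the gradient through it (sg)
        s_hat = model(s_hat.detach(), actions[i])
        loss = loss + ((s_hat - states[i + 1]) ** 2).sum(-1).mean()
    return loss / H
```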
SLBO Experiments
Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.
All environments use a maximum horizon of 500 steps.
Problems of SLBO
• All environments have a restricted 500-step horizon
• The model is used to roll out whole trajectories from the start state
• In practice, we may not be able to roll out such a long horizon due to the compounding error of the model
  • Particularly when the model is inaccurate
• We need to know whether and to what extent the model is inaccurate
  • And decide how to adaptively perform roll-outs
Bound based on Model & Policy Error
• The discrepancy between the policy value in the real environment and in the environment model is linearly proportional to the model error ε_m and the policy shift ε_π:

  ε_m = max_t E_{s ∼ π_D,t} [ D_TV( p(s', r | s, a) ‖ p_θ(s', r | s, a) ) ]   (estimated on data sampled by the old policy π_D)
  ε_π = max_s D_TV( π ‖ π_D )

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.
Bound based on Model & Policy Error
• Branched rollout: begin a rollout from a state under the previous policy's state distribution d_{π_D}(s) and run k steps according to π under the learned model p_θ
• Dyna can be viewed as a special case of branched rollout with k = 1
• This lower bound is maximized when k = 0, corresponding to not using the model at all

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.
Bound based on Model & Policy Error
Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.
• With a new (first-order) approximation of the model error under the current policy:

  ε_{m'} = max_t E_{s ∼ π_t} [ D_TV( p(s', r | s, a) ‖ p_θ(s', r | s, a) ) ]

• the bound is rewritten accordingly, where the optimal k > 0 if ε_{m'} is sufficiently small
Empirical Analysis
• The optimal k > 0 when ε_{m'} is sufficiently small
MBPO Algorithm
• Remarks
  • Branch out from states on the real trajectories (instead of from s_0)
  • The branched rollout length k depends on the model & policy errors
  • Soft Actor-Critic is used to update the policy (a minimal branched-rollout sketch follows below)
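A minimal sketch of branched rollouts: start short k-step model rollouts from states stored in the real-data buffer and push the simulated transitions into a model buffer consumed by an off-policy learner such as SAC. The buffer format and the `model(s, a) -> (s_next, r, done)` interface are illustrative assumptions.

```python
# A minimal sketch of MBPO-style branched model rollouts.
import random

def branched_rollouts(model, policy, env_buffer, model_buffer,
                      num_rollouts=400, k=1):
    for _ in range(num_rollouts):
        s = random.choice(env_buffer)["s"]          # branch from a real state
        for _ in range(k):                          # k-step model rollout
            a = policy(s)
            s_next, r, done = model(s, a)           # simulated transition
            model_buffer.append({"s": s, "a": a, "r": r,
                                 "s_next": s_next, "done": done})
            if done:
                break
            s = s_next
    return model_buffer
```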
MBPO Experiments (1000-step horizon)
One Step Further based on MBPO
• Idea: bidirectional models for branched rollouts to reduce compounding error
• Paper: Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
• https://arxiv.org/abs/2007.01995
Bidirectional Model
Key finding: when generating trajectories of the same length, the compounding error of the bidirectional model is less than that of the forward model.
[Diagram: forward model vs. bidirectional model rollouts]
Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
Dynamics Model Learning
• An ensemble of bootstrapped probabilistic networks is used to parameterize both the forward model (predicting s' from s and a) and the backward model (predicting s from a and s')
• Each probabilistic neural network outputs a Gaussian distribution with diagonal covariance, parameterized by a mean and covariance, and is trained via maximum likelihood over the N real transition data (the loss functions are given in the paper; a minimal sketch follows below)

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
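A minimal sketch of training the forward and backward dynamics models by maximum likelihood, reusing `GaussianDynamics` and `nll_loss` from the PETS sketch above. The backward model here consumes (s', a) and predicts s; the batch format and training loop are illustrative assumptions.

```python
# A minimal sketch of jointly fitting forward and backward Gaussian dynamics models.
import torch

def train_bidirectional(forward_model, backward_model, batch, lr=1e-3, steps=100):
    """batch: dict with tensors 's', 'a', 's_next' of shape (N, dim)."""
    opt = torch.optim.Adam(
        list(forward_model.parameters()) + list(backward_model.parameters()), lr=lr)
    for _ in range(steps):
        # forward model: (s, a) -> s';   backward model: (s', a) -> s
        loss = nll_loss(forward_model, batch["s"], batch["a"], batch["s_next"]) \
             + nll_loss(backward_model, batch["s_next"], batch["a"], batch["s"])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return forward_model, backward_model
```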
Backward Policy
• In the forward model rollout, actions are selected by the current policy π
• To sample trajectories backwards, we need to learn a backward policy π̃ that takes actions given the next state
• The backward policy can be trained by either maximum likelihood estimation (a minimal sketch follows below):

  L_MLE(φ') = − Σ_{t=0}^{N} log π̃_{φ'}(a_t | s_{t+1})

• or by conditional GAN training (GAIL):

  min_{π̃} max_D V(D, π̃) = E_{(a, s') ∼ π} [ log D(a, s') ] + E_{s' ∼ π} [ log(1 − D(π̃(·|s'), s')) ]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
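A minimal sketch of training the backward policy by maximum likelihood, i.e., maximizing log π̃(a_t | s_{t+1}) over real transitions. The `backward_policy` returning a torch distribution object is an illustrative assumption; the GAIL-style alternative is not shown.

```python
# A minimal sketch of MLE training for the backward policy.
import torch

def train_backward_policy(backward_policy, batch, lr=1e-3, steps=100):
    """batch: dict with tensors 'a' and 's_next' from real transitions."""
    opt = torch.optim.Adam(backward_policy.parameters(), lr=lr)
    for _ in range(steps):
        dist = backward_policy(batch["s_next"])            # e.g. a Gaussian over actions
        loss = -dist.log_prob(batch["a"]).sum(-1).mean()   # negative log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return backward_policy
```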
Other Components of BMPO
• State Sampling Strategy: instead of randomly chosen states from the environment replay buffer, we sample high-value states to begin rollouts, according to a Boltzmann distribution (a minimal sampling sketch follows below):

  p(s) ∝ e^{β V(s)}

• Thus, the agent can learn to reach high-value states through backward rollouts, and also learn to act better after these states through forward rollouts
• MPC is incorporated to refine the action taken in the real environment
  • At each time step, candidate action sequences are generated from the current policy and the corresponding trajectories are simulated by the learned model
  • Then the first action of the sequence that yields the highest accumulated reward is selected:

  a_t = argmax_{a_t^{1:N} ∼ π} [ Σ_{t'=t}^{t+H−1} γ^{t'−t} r(s_{t'}, a_{t'}) + γ^H V(s_{t+H}) ]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
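A minimal sketch of the Boltzmann state-sampling strategy p(s) ∝ e^{βV(s)}: start states for rollouts are drawn from the replay buffer with probability weighted by their estimated value. The `value_fn` and the buffer format are illustrative assumptions.

```python
# A minimal sketch of value-weighted (Boltzmann) sampling of rollout start states.
import numpy as np

def sample_start_states(env_buffer_states, value_fn, beta=0.5, num_samples=32):
    values = np.array([value_fn(s) for s in env_buffer_states])
    logits = beta * values
    probs = np.exp(logits - logits.max())       # numerically stabilized softmax
    probs /= probs.sum()
    idx = np.random.choice(len(env_buffer_states), size=num_samples, p=probs)
    return [env_buffer_states[i] for i in idx]
```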
Overall Algorithm of BMPO
1. When interacting with the environment, the agent uses the model to perform MPC-based action selection
2. Data is stored in D_env (in the figure, the value of a state increases from light red to dark red)
3. High-value states are then sampled from D_env to perform bidirectional model rollouts, which are stored in D_model
4. A model-free method (Soft Actor-Critic) is used to train the policy on the data from D_model
Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
Theoretical Analysis
• Bidirectional model: [discrepancy bound shown on the slide]
• Forward model: [discrepancy bound shown on the slide]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
Comparison with State-of-the-Arts
Compared with previous state-of-the-art baselines, BMPO (blue) learns faster and achieves better asymptotic performance than the model-based algorithms that use only the forward model.
Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
Model Compounding Error
Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
  Error_for = (1/(2h)) Σ_{i=1}^{2h} || ŝ_i − s_i ||²_2

  Error_bi = (1/(2h)) Σ_{i=1}^{h} ( || ŝ_{h+i} − s_{h+i} ||²_2 + || ŝ_{h−i} − s_{h−i} ||²_2 )
Content
1. Introduction to MBRL from Dyna
2. Shooting methods: RS, PETS, POPLIN
3. Theoretic bounds and methods: SLBO, MBPO & BMPO
4. Backpropagation through paths: SVG and MAAC
Deterministic Policy Gradient (REVIEW)
• A critic module for state-action value estimation:

  Q_w(s, a) ≈ Q^π(s, a),    L(w) = E_{s∼ρ^π, a∼π_θ} [ (Q_w(s, a) − Q^π(s, a))² ]

• With the differentiable critic, the deterministic continuous-action actor can be updated by the deterministic policy gradient theorem (an application of the chain rule, derived on-policy):

  J(π_θ) = E_{s∼ρ^π} [ Q^π(s, a) ]
  ∇_θ J(π_θ) = E_{s∼ρ^π} [ ∇_θ π_θ(s) ∇_a Q^π(s, a) |_{a=π_θ(s)} ]

D. Silver et al. Deterministic Policy Gradient Algorithms. ICML 2014.
From Deterministic to Stochastic
• For a deterministic environment (i.e., deterministic reward and transition) and a deterministic policy (i.e., a function mapping state to action)
Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
From Deterministic to Stochastic
• Math: reparameterization of distributions
  • Conditional Gaussian distribution
  • Consider conditional densities whose samples are generated by a deterministic function of an input noise variable, so that derivatives can be passed through the sampling step (a minimal sketch follows below)
Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
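A minimal sketch of the reparameterization trick for a conditional Gaussian policy: a = μ_θ(s) + σ_θ · ε with ε ∼ N(0, I), so that derivatives flow through the sampled action into θ. The network shapes are illustrative assumptions.

```python
# A minimal sketch of a reparameterized Gaussian policy.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        mu = self.mu(self.body(s))
        eps = torch.randn_like(mu)                  # input noise variable
        return mu + self.log_std.exp() * eps        # differentiable sample of the action
```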
From Deterministic to Stochastic
• Deterministic version shown for reference
• Stochastic environment, policy, value, and gradients
Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
SVG Algorithm and Experiments
Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
Model-Augmented Actor Critic
• The objective function can be directly built as a stochastic computation graph (with stochastic and deterministic variables)
• H makes the trade-off between the accuracy of the model and of the learned Q function (a minimal sketch of the H-step objective follows below)
Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
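A minimal sketch of an H-step pathwise objective in the spirit of MAAC: unroll the learned differentiable model for H steps with a reparameterized policy, add a terminal critic value, and backpropagate through the whole path; H trades off model error against critic error. The `model`, `policy`, `q_fn`, and `reward_fn` modules are illustrative assumptions, not the exact objective of the paper.

```python
# A minimal sketch of an H-step model-augmented pathwise policy objective.
import torch

def pathwise_objective(model, policy, q_fn, reward_fn, s0, H=5, gamma=0.99):
    s, total = s0, 0.0
    for t in range(H):
        a = policy(s)                      # reparameterized (differentiable) action
        total = total + (gamma ** t) * reward_fn(s, a)
        s = model(s, a)                    # differentiable model prediction of s'
    a_H = policy(s)
    total = total + (gamma ** H) * q_fn(s, a_H)   # terminal value from the critic
    return -total.mean()                   # negate: minimize to ascend the objective
```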
Theoretic Bounds
A large H, i.e., a long rollout, brings a large gradient error due to the model error.
Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
Theoretic Bounds
• [Bound shown on the slide, annotated with:] the one-step gradient approximation error, the dynamics error, the value function error, and the resulting policy value lower bound
Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
MAAC Algorithm
• Model learning: an ensemble of probabilistic models
• Policy Optimization
• Q-function Learning
Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
Experiments
Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
References of Backpropagation through Paths
• PILCO: Deisenroth, Marc, and Carl E. Rasmussen. "PILCO: A model-based and data-efficient approach to policy search." ICML 2011.
• SVG: Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
• MAAC: Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
Summary of Model-based RL

Model as a Blackbox
• Sample experience data and train the policy in the manner of model-free RL
• Seamless to policy training algorithms
• Easy for rollout planning
• The simulation data efficiency may still be low
• E.g., Dyna-Q, MPC, MBPO

Model as a Whitebox
• The environment model, including transition dynamics and reward function, is a differentiable stochastic mapping
• Offers both data and gradient guidance for value and policy
• High data efficiency
• E.g., MAAC, SVG, PILCO
Future Directions of MBRL
• Environment model learning
  • Learning objectives
  • Environment imitation
  • Complex simulation
  • MBRL with the true model in simulation
• Better understanding of bounds for model usage
  • When do we have tighter bounds
  • Rollout length and when to roll out
• Multi-agent MBRL
Thank You! Questions?
Weinan Zhang
Associate Professor
APEX Data & Knowledge Management Lab
John Hopcroft Center for Computer Science
Shanghai Jiao Tong University
http://wnzhang.net