Distributed RL
Richard Liaw, Eric Liang
(Transcript of lecture slides: rail.eecs.berkeley.edu/deeprlcourse-fa18/static/slides/lec-21.pdf)
Common Computational Patterns for RL
[Diagram: RL alternates between simulation (collecting experience) and batch optimization of the policy.]
How can we better utilize our computational resources to accelerate RL progress?
History of large-scale distributed RL
• 2013: DQN. "Playing Atari with Deep Reinforcement Learning" (Mnih 2013)
• 2015: GORILA. "Massively Parallel Methods for Deep Reinforcement Learning" (Nair 2015)
• 2016: A3C. "Asynchronous Methods for Deep Reinforcement Learning" (Mnih 2016)
• 2018: Ape-X. "Distributed Prioritized Experience Replay" (Horgan 2018)
• 2018: IMPALA. "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" (Espeholt 2018)
• Next: ?
2013/2015: DQN
for i in range(T):
    s, a, s_next, r = evaluate()           # act in the environment
    replay.store((s, a, s_next, r))        # add the transition to the replay buffer
    minibatch = replay.sample()            # sample a batch of past transitions
    q_network.update(minibatch)            # one SGD step on the Q-network
    if should_update_target():
        target_net.sync_with(q_network)    # periodically copy weights into the target network
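For concreteness, the replay object this loop assumes can be as simple as a bounded store with uniform sampling. A minimal sketch matching the method names above (our names, not any particular implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted

    def store(self, transition):
        # transition = (s, a, s_next, r)
        self.storage.append(transition)

    def sample(self, batch_size=32):
        # Callers should wait until at least batch_size transitions are stored.
        return random.sample(self.storage, batch_size)
```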
2015: General Reinforcement Learning Architecture (GORILA)
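The architecture diagram on this slide does not survive transcription. In outline, GORILA (Nair 2015) splits DQN into independently scalable processes: actors that generate experience, a (distributed) replay memory, learners that compute gradients, and a parameter server that applies them. A pseudocode sketch of the two loops, with placeholder helpers of our own (step, td_gradient):

```python
# Actor process: act with a periodically synced Q-network, write to replay.
def actor_loop(env, q_network, replay, param_server):
    while True:
        q_network.set_weights(param_server.get_weights())  # may be slightly stale
        s, a, s_next, r = step(env, q_network)             # placeholder env step
        replay.store((s, a, s_next, r))

# Learner process: sample from replay, ship gradients to the parameter server.
def learner_loop(q_network, target_net, replay, param_server):
    while True:
        q_network.set_weights(param_server.get_weights())
        batch = replay.sample()
        grad = td_gradient(q_network, target_net, batch)   # DQN loss gradient
        param_server.apply_gradient(grad)                  # asynchronous SGD update
```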
GORILA Performance
2016: Asynchronous Advantage Actor Critic (A3C)
Workers send gradients back to the master.

# Each worker:
while True:
    sync_weights_from_master()                  # pull the latest policy weights
    samples = [collect_sample_from_env() for _ in range(5)]  # short rollout
    grad = compute_grad(samples)                # local actor-critic gradient
    async_send_grad_to_master(grad)             # master applies it asynchronously

Each worker has a different exploration policy -> more diverse samples!
A3C Performance
Changes from GORILA:
1. Faster updates
2. Removes the replay buffer
3. Moves to actor-critic (from Q-learning)
Distributed Prioritized Experience Replay (Ape-X)
A3C doesn’t scale very well…
Ape-X:
1. Distributed DQN/DDPG
2. Reintroduces replay
3. Distributed prioritization: unlike prioritized DQN, initial priorities are not set to "max TD"; each actor computes actual TD errors for its new transitions (see the sketch below)
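The prioritization change is easiest to see in the actor loop: actors score transitions with real TD errors before shipping them, so data enters the shared replay with meaningful priorities instead of a uniform "max" placeholder. A pseudocode outline (step, max_q, q, and the server interfaces are placeholders of ours):

```python
# Outline of an Ape-X actor process.
def apex_actor_loop(env, q_network, replay_server, learner, gamma=0.99):
    local_buffer = []
    while True:
        s, a, s_next, r = step(env, q_network)  # e.g., epsilon-greedy action
        # Initial priority = actual TD error, not a "max TD" placeholder.
        td_error = abs(r + gamma * q_network.max_q(s_next) - q_network.q(s, a))
        local_buffer.append(((s, a, s_next, r), td_error))
        if len(local_buffer) >= 50:
            replay_server.add(local_buffer)     # experiences + initial priorities
            local_buffer = []
            q_network.set_weights(learner.get_weights())  # periodic sync
```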
Ape-X Performance
Importance Weighted Actor-Learner Architectures (IMPALA)
Motivated by progress in distributed deep learning!
How to correct for Policy Lag? Importance Sampling!
Given an actor-critic model:
1. Apply importance sampling to the policy gradient
2. Apply importance sampling to the critic update
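Concretely, the V-trace targets from the IMPALA paper (Espeholt 2018) make both corrections precise. For a trajectory collected under a behaviour policy $\mu$ while the learner holds policy $\pi$:

$$
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big)\, \delta_t V,
\qquad
\delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
$$

with truncated importance ratios $\rho_t = \min\!\big(\bar\rho,\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\big)$ and $c_i = \min\!\big(\bar c,\ \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\big)$. The critic regresses $V(x_s)$ toward $v_s$, and the policy gradient is weighted by $\rho_s$ using $r_s + \gamma v_{s+1} - V(x_s)$ as the advantage estimate.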
IMPALA Performance
Other interesting distributed architectures
AlphaZero
Each model trained on 64 GPUs and 19 parameter servers!
Evolution Strategies
RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18)
Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica
RL research scales with compute
[Figures courtesy OpenAI and NVIDIA.]
How do we leverage this hardware? Scalable abstractions for RL?
[Figure: (a) Supervised Learning vs. (b) Reinforcement Learning.]
Systems for RL today
• Many implementations (7000+ repos on GitHub!): how general are they, and do they scale?
  – PPO: multiprocessing, MPI
  – AlphaZero: custom systems
  – Evolution Strategies: Redis
  – IMPALA: Distributed TensorFlow
  – A3C: shared memory, multiprocessing, TF
• Huge variety of algorithms, and of distributed systems used to implement them, but little reuse of components
Challenges to reuse
1. Wide range of physical execution strategies for one "algorithm"
[Diagram: the strategy space spans single-node vs. cluster, CPU vs. GPU, synchronous vs. asynchronous, sending gradients vs. sending experiences, and MPI vs. multiprocessing vs. parameter server.]
Challenges to reuse
2. Tight coupling with deep learning frameworks
   – Different parallelism paradigms: Distributed TensorFlow vs. TensorFlow + MPI?
Challenges to reuse
3. Large variety of algorithms with different structures
We need abstractions for RL
Good abstractions decompose RL algorithms into reusable components.
Goals:
– Code reuse across deep learning frameworks
– Scalable execution of algorithms
– Easily compare and reproduce algorithms
Structure of RL computations
[Diagram: agent-environment loop. The agent's policy (state → action) sends action a_{i+1} to the environment, which returns the next state s_i (observation) and reward r_i.]
Structure of RL computations
[Diagram: the agent is split into policy evaluation (state → action), which interacts with the environment to produce trajectories X: s0, (s1, r1), …, (sn, rn), and policy improvement (e.g., SGD), which consumes trajectories and sends the updated policy back to evaluation.]
Many RL loop decompositions
Async DQN (Mnih et al., 2016): replicated actor-learners send gradients to a parameter server; each actor-learner loops:
    X <- rollout(); dθ <- grad(L, X); sync(dθ)
Ape-X DQN (Horgan et al., 2018): actors feed a central replay buffer and periodically pull weights from the learner:
    actors:  θ <- sync(); rollout()
    learner: X <- replay(); apply(grad(L, X))
Common components
Across both decompositions the per-algorithm pieces are identical; only the surrounding distributed architecture differs:
• Policy πθ(ot)
• Trajectory postprocessor ρθ(X)
• Loss L(θ,X)
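These three pieces are what an algorithm author actually has to write; everything else is execution strategy. As an illustrative interface (a sketch of the idea, not RLlib's actual class):

```python
class PolicyGraphSketch:
    """Per-algorithm components shared across distributed architectures."""

    def compute_action(self, obs):
        """Policy pi_theta(o_t): map an observation to an action."""
        raise NotImplementedError

    def postprocess_trajectory(self, batch):
        """rho_theta(X): e.g., compute n-step returns or GAE advantages."""
        raise NotImplementedError

    def loss(self, batch):
        """L(theta, X): differentiable scalar any optimizer can minimize."""
        raise NotImplementedError
```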
Structural differences
Async DQN (Mnih et al., 2016):
● Asynchronous optimization
● Replicated workers
● Single machine

Ape-X DQN (Horgan et al., 2018):
● Central learner
● Data queues between components
● Large replay buffers
● Scales to clusters

+ Population-Based Training (Jaderberg et al., 2017):
● Nested parallel computations
● Control decisions based on intermediate results

...and this is just one family!
➝ No existing system can effectively meet all the varied demands of RL workloads.
Requirements for a new system
Goal: capture a broad range of RL workloads with high performance and substantial code reuse.
1. Support stateful computations
   – e.g., simulators, neural nets, replay buffers
   – big data frameworks (e.g., Spark) are typically stateless
2. Support asynchrony
   – difficult to express in MPI, especially nested parallelism
3. Allow easy composition of (distributed) components
Ray System Substrate
• RLlib builds on Ray to provide higher-level RL abstractions
• Hierarchical parallel task model with stateful workers
  – flexible enough to capture a broad range of RL workloads (vs. specialized systems)
[Diagram: the hierarchical task model covers the whole execution-strategy space: single-node or cluster, CPU or GPU, synchronous or asynchronous, sending gradients or experiences, MPI/multiprocessing/parameter-server.]
Hierarchical Parallel Task Model
1. Create Python class instances in the cluster (stateful workers)
2. Schedule short-running tasks onto workers
   – Challenge: high performance (1e6+ tasks/s, ~200us task overhead)
[Diagram of a Ray cluster: a top-level worker (Python process) directs sub-workers to "collect experiences", "run K steps of training", or "allreduce your gradients"; sub-sub-worker processes "do model-based rollouts"; workers exchange weight shards through the Ray object store.]
Unifying system enables RL abstractions
Built on the hierarchical task model:
• Policy Graph abstraction: {πθ, ρθ, L(θ,X)}
  – Examples: {Q-func, n-step, Q-loss}; {LSTM, adv. calc, PG loss}
• Policy Optimizer abstraction: SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...
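A policy optimizer encapsulates one distributed execution strategy behind a step() call. For example, a synchronous-sampling optimizer fits in a few lines against the worker and policy interfaces (an illustrative outline, not RLlib's actual class):

```python
class SyncSamplesOptimizerSketch:
    """Gather a batch from every rollout worker, update, broadcast weights."""

    def __init__(self, local_policy, remote_workers):
        self.local_policy = local_policy
        self.remote_workers = remote_workers

    def step(self):
        # Synchronously collect one sample batch per rollout worker.
        batches = [worker.sample() for worker in self.remote_workers]
        train_batch = [item for batch in batches for item in batch]
        self.local_policy.update(train_batch)      # SGD on the loss L(theta, X)
        weights = self.local_policy.get_weights()
        for worker in self.remote_workers:
            worker.set_weights(weights)            # keep workers in sync
```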
RLlib Abstractions in Action
[Diagram: policy graphs paired with policy optimizers reproduce a family tree of algorithms.]
• Policy optimizers: SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...
• {Q-func, n-step, Q-loss}: DQN (2015) → Async DQN (2016) → Ape-X (2018)
• {LSTM, adv. calc, PG loss}: Policy Gradient (2000) → +actor-critic loss, GAE → A2C/A3C (2016); +clipped obj. → PPO (2017), PPO (GPU-optimized); +V-trace → IMPALA (2018)
RLlib Reference Algorithms
• High-throughput architectures
  – Distributed Prioritized Experience Replay (Ape-X)
  – Importance Weighted Actor-Learner Architecture (IMPALA)
• Gradient-based
  – Advantage Actor-Critic (A2C, A3C)
  – Deep Deterministic Policy Gradients (DDPG)
  – Deep Q-Networks (DQN, Rainbow)
  – Policy Gradients
  – Proximal Policy Optimization (PPO)
• Derivative-free
  – Augmented Random Search (ARS)
  – Evolution Strategies
• Community contributions
RLlib Reference Algorithms
[Benchmark figure: 1 GPU + 64 vCPUs (large single machine).]
Scale your algorithms with RLlib
• More than a "collection of algorithms": RLlib's abstractions let you easily implement and scale new algorithms (multi-agent, novel losses, architectures, etc.)
Code example: training PPO
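The original slide shows a code screenshot that did not survive transcription. A minimal equivalent with Ray Tune driving RLlib's PPO (a sketch; exact config keys and entry points have shifted across Ray versions):

```python
import ray
from ray import tune

ray.init()

# Train PPO on CartPole until the mean episode reward reaches 150.
tune.run(
    "PPO",
    stop={"episode_reward_mean": 150},
    config={
        "env": "CartPole-v1",
        "num_workers": 4,   # parallel rollout workers
    },
)
```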
Code example: multi-agent RL
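This slide's screenshot is also lost. The shape of RLlib's multi-agent config is roughly the following (a sketch; the env id is a placeholder, and how policies are declared and the policy_mapping_fn signature vary across RLlib versions):

```python
from ray import tune

# Two policies trained side by side; a mapping function routes each
# agent id in the environment to one of the policies.
tune.run(
    "PPO",
    config={
        "env": "my_multiagent_env",   # placeholder multi-agent env id
        "multiagent": {
            "policies": {"policy_a", "policy_b"},
            "policy_mapping_fn": (
                lambda agent_id, *args, **kwargs:
                    "policy_a" if str(agent_id).endswith("0") else "policy_b"
            ),
        },
    },
)
```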
Code example: hyperparam tuning
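Hyperparameter tuning composes the same training call with a search space; e.g., with Tune's grid search (a sketch under the same version caveats):

```python
from ray import tune

# One PPO trial per grid point; Tune schedules trials across the cluster.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",
        "lr": tune.grid_search([1e-3, 1e-4, 1e-5]),
        "train_batch_size": tune.grid_search([2000, 4000]),
    },
)
```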
RLlib is open source and available at http://rllib.io. Thanks!

Summary: Ray and RLlib address the challenges of providing scalable abstractions for reinforcement learning.
Ray distributed execution engine
• Ray provides task-parallel and actor APIs built on dynamic task graphs
• These APIs are used to build distributed applications, libraries, and systems
[Diagram of the stack: applications (numerical computation, third-party simulators, ...) sit on the Ray programming model (task parallelism, actors), which runs on the Ray execution model (dynamic task graphs).]
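The task-parallel half of the programming model in one runnable snippet (standard Ray API): remote functions return futures, and passing futures into other tasks is what builds the dynamic task graph.

```python
import ray

ray.init()

@ray.remote
def simulate(seed):
    # Stand-in for an episode rollout.
    return [seed, seed + 1]

@ray.remote
def aggregate(*rollouts):
    # Receives the resolved results of all simulate tasks.
    return sum(len(r) for r in rollouts)

futures = [simulate.remote(i) for i in range(4)]
total = aggregate.remote(*futures)   # depends on every simulate task
print(ray.get(total))                # 8
```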
Ray distributed scheduler
• Faster than Python multiprocessing on a single node
• Competitive with MPI in many workloads
[Architecture diagram: a driver and workers on each node share a per-node object store and local scheduler; local schedulers escalate to replicated global schedulers.]