Hierarchical Reinforcement Learning with Unlimited Recursive Subroutine Calls
2019-09-17 ICANN 2019
Humans can set suitable subgoals recursively to achieve certain tasks
[Figure: recursive goal decomposition. To reach the goal state "holding the object" (take an object on a high shelf), the agent sets the subgoal "the ladder is set up" (set up a ladder); to reach that subgoal, it sets the sub-subgoal "in the store room" (go to the store room, then carry the ladder). Each level starts from the initial state.]
The depth of this recursion is apparently unlimited.
Hierarchical reinforcement learning architecture RGoal
• In RGoal, an agent's subgoal settings are similar to subroutine calls in programming languages.
• Each subroutine can execute primitive actions or recursively call other subroutines.
• The timing for calling another subroutine is learned by using a standard reinforcement learning method.
• Unlimited recursive subroutine calls accelerate learning because they increase the opportunity for the reuse of subroutines in multitask settings.
Previous work: MAXQ [Dietterich 2000]
• Multi-layered hierarchical reinforcement learning architecture with a fixed number of layers.
• Accelerates learning speed
– Subtask sharing, temporal abstraction, state abstraction

RGoal simplifies the MAXQ architecture and extends its features.
RGoal architecture
[Figure: RGoal agent-environment loop. The agent, with a sensor and an actuator, receives the state $s$ and reward $r$ from the environment and emits an action $a$. Internally, it holds a subgoal $g$ and a stack. Augmented state: $\tilde{s} = (s, g)$. Augmented action: $\tilde{a}$, a primitive action $a$ or a subroutine call $C_m$. Reward function: $r(\tilde{s}, \tilde{a})$. Action value function: $Q_G(\tilde{s}, \tilde{a}) = Q(s, g, \tilde{a}) + V_G(g)$.]
The agent has a subgoal and a stack as its internal states.
Augmented state-action space
The base spaces are $\mathcal{S} = \{(0,0), (0,1), \dots\}$ and $\mathcal{A} = \{\mathit{Up}, \mathit{Down}, \mathit{Right}, \mathit{Left}, \dots\}$. They are augmented to

$\tilde{\mathcal{S}} = \mathcal{S} \times \mathcal{M}, \quad \tilde{\mathcal{A}} = \mathcal{A} \cup \mathcal{C}_{\mathcal{M}}, \quad \mathcal{M} = \{m_1, m_2, \dots\} \subseteq \mathcal{S}, \quad \mathcal{C}_{\mathcal{M}} = \{C_{m_1}, C_{m_2}, \dots\}$

The augmented state is a pair of state and goal, $\tilde{s} = (s, g)$. The augmented action $\tilde{a}$ is a primitive action $a$ or a subroutine call $C_m$. RGoal solves the Markov Decision Process (MDP) in the augmented state-action space. Unlike previous hierarchical RL methods, the caller-callee relation between subroutines is not predefined but is learned within the framework of RL. (A subroutine call is just a state transition in the augmented state-action space.)

[Figure: a grid world with primitive actions Up/Down/Left/Right is replicated for each subgoal layer $g = m_1, m_2, m_3$; subroutine calls such as $C_{m_2}$ and $C_{m_3}$ move the augmented state $\tilde{s}$ between layers.]
[Levy and Shimkin 2011]
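To make this concrete, here is a minimal sketch of the augmented state and the stack semantics of a subroutine call, as we read them from these slides; the class names, the grid-move helper, and the exact push/pop placement are illustrative assumptions, not the authors' code:

```python
# Sketch of RGoal's augmented state-action space (illustrative).
from dataclasses import dataclass, field
from typing import List, Tuple, Union

State = Tuple[int, int]             # grid cell (row, col)

@dataclass
class Call:                         # subroutine call C_m toward landmark m
    m: State

AugmentedAction = Union[str, Call]  # a primitive action name or a call C_m

def move(s: State, a: str) -> State:
    # Primitive grid transitions (walls omitted for brevity).
    dr, dc = {"Up": (-1, 0), "Down": (1, 0), "Right": (0, 1), "Left": (0, -1)}[a]
    return (s[0] + dr, s[1] + dc)

@dataclass
class Agent:
    state: State
    subgoal: State                               # g: current subgoal
    stack: List[State] = field(default_factory=list)

    def augmented_state(self) -> Tuple[State, State]:
        return (self.state, self.subgoal)        # s~ = (s, g)

    def step(self, a: AugmentedAction) -> None:
        if isinstance(a, Call):
            # A call is just a transition in the augmented space: push the
            # current subgoal and head for m first, so the planned path
            # s -> g -> ... becomes s -> m -> g -> ...
            self.stack.append(self.subgoal)
            self.subgoal = a.m
        else:
            self.state = move(self.state, a)     # ordinary environment step
        if self.state == self.subgoal and self.stack:
            # Subgoal reached: return to the caller by popping the stack.
            self.subgoal = self.stack.pop()
```

Because a call only rewrites the internal pair (subgoal, stack), no special machinery beyond the ordinary MDP transition is needed, which is what allows the call timing itself to be learned by standard RL.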
Value function decomposition
• The action value function $Q_G$ is decomposed into two parts:

$Q_G^{\pi}((s,g),a) = \underbrace{Q^{\pi}(s,g,a)}_{\text{before subgoal}} + \underbrace{V_G^{\pi}(g)}_{\text{after subgoal}}$

where $V_G^{\pi}(g) = \sum_{a} \pi((g,G),a)\, Q^{\pi}(g,G,a)$, $G$ is the original global goal, and $g$ is the current subgoal.

• $Q(s,g,a)$ can be shared between tasks because it does not depend on the original goal $G$.
– Accelerates learning speed
[Singh 1992][Dietterich 2000]
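As a minimal sketch of why this enables sharing (the table shapes and the greedy form of $V$ are our assumptions for illustration; the slides leave $\pi$ abstract), a single table $Q[s, g, a]$ serves every task, and the task-specific part $V_G(g)$ is just another lookup in the same table:

```python
import numpy as np

n_states, n_actions = 25, 4
# One shared table Q[s, g, a]: value of taking a at s while heading for
# subgoal g, counted only up to reaching g. It never mentions the global
# goal G, so it can be reused across tasks.
Q = np.zeros((n_states, n_states, n_actions))

def V(G: int, g: int) -> float:
    # V_G(g): value of the remaining path from subgoal g to global goal G.
    # Under a greedy policy the expectation over pi reduces to a max
    # (an illustrative assumption).
    return float(Q[g, G].max())

def Q_aug(s: int, g: int, a: int, G: int) -> float:
    # Q_G((s, g), a) = Q(s, g, a) + V_G(g): before-subgoal + after-subgoal.
    return float(Q[s, g, a] + V(G, g))
```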
Update rule of Q(s,g,a) is derived from the standard Sarsa algorithm
When the subgoal is $g$ and the stack contents are $g_1, g_2, \dots, g_n$, the agent moves through the path $s \to g \to g_1 \to g_2 \to \dots \to g_n$. If a subroutine $g'$ is called, the path changes to $s \to g' \to g \to g_1 \to g_2 \to \dots \to g_n$. Therefore, the equation below holds:

$Q_{g_n}(\tilde{s}', \tilde{a}') - Q_{g_n}(\tilde{s}, \tilde{a}) = \big( Q(s',g',\tilde{a}') + V_g(g') + V_{g_1}(g) + V_{g_2}(g_1) + \dots + V_{g_n}(g_{n-1}) \big) - \big( Q(s,g,\tilde{a}) + V_{g_1}(g) + V_{g_2}(g_1) + \dots + V_{g_n}(g_{n-1}) \big) = Q(s',g',\tilde{a}') - Q(s,g,\tilde{a}) + V_g(g')$

This equation also holds when $\tilde{a}$ is not a subroutine call but a primitive action. Therefore, the standard Sarsa update in the augmented space,

$Q(\tilde{s}, \tilde{a}) \leftarrow Q(\tilde{s}, \tilde{a}) + \alpha \big( r + Q(\tilde{s}', \tilde{a}') - Q(\tilde{s}, \tilde{a}) \big),$

becomes an update on the shared table, which can be executed efficiently:

$Q(s,g,a) \leftarrow Q(s,g,a) + \alpha \big( r + Q(s',g',a') - Q(s,g,a) + V_g(g') \big)$
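A minimal sketch of this update on the shared table (the learning-rate value and the greedy approximation of $V_g(g')$ are illustrative assumptions):

```python
import numpy as np

ALPHA = 0.1  # learning rate (illustrative value)

def rgoal_update(Q, s, g, a, r, s2, g2, a2):
    # Sarsa-derived update: the stack terms V_g1(g) + ... + V_gn(g_{n-1})
    # cancel in the augmented TD error, leaving only V_g(g').
    # V_g(g') is nonzero only when a subroutine call changed the subgoal;
    # here it is approximated greedily as max_a Q(g', g, a).
    v = float(Q[g2, g].max()) if g2 != g else 0.0
    Q[s, g, a] += ALPHA * (r + Q[s2, g2, a2] - Q[s, g, a] + v)
```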
Maze task
Twenty landmarks are placed on the map, and landmarks may become subgoals. For each episode, the start S and the goal G are randomly selected from the landmark set. We focus on the speed of convergence to suboptimal solutions rather than exact solutions. (m: landmark)
■■■■■■■■■■■■■■■■■■■■■■■■■
■・・m・・・・・・・・・・・・・・■・・・・m■
■・・・・・■・・m・・・・・・・m■・・・・・■
■・・・・・■■■■■・・・・■■■■・・・・・■
■・・・・m・・・・■・・・・・■・・・・・・・■
■・・・・・■・・・■m・・・m■・・・・・・・■
■・・・・・■・・・■■■・・・・・m・・・・・■
■・・・・・■・m・・・■・・・・・■・・・・m■
■・・・m■■・・・・・■・・・・・■・・・・・■
■・・・・・■・・・・・■m■・・・■・・・・・■
■・・・・・■・・・■・■・■・・・・・・・・・■
■・・・・・・・・・・m■・・・・・・・■・・・■
■■■■■・・・・・・・■■■■・・・・■m・・■
■・・m■・・・・・・・・・・■・・・・■■■■■
■・・・■・・m・・・m・・・■m・・・・・・・■
■・・・・・・・・・・■・・・■■■・・・・・・■
■・・・・・・・・・・■・・・・・・・・・・・・■
■・・・■・・・m・・■・・・・・・・m・・・・■
■■■■■■■■■■■■■■■■■■■■■■■■■
Experiment 1. Relationship between the upper limit S of the stack depth and RGoal performance
[Figure: learning curves of episodes/step (y-axis, 0 to 0.08) versus training steps (x-axis, ×100,000 steps) for stack-depth limits S = 0, 1, 2, 3, 4, 100. S = 0 corresponds to no subroutine calls (plain Sarsa); S = 1 corresponds to hierarchical RL without recursive calls.]

A greater upper limit results in faster convergence because it increases the opportunity for the reuse of subroutines.
"Thought-mode" combines learned simple tasks to solve unknown complicated tasks
• Suppose that the optimal routes between all neighboring pairs of landmarks have already been learned.
• An approximate solution for the optimal route between distant landmarks can be obtained by connecting neighboring landmarks.
• Such solutions can be found without taking any actions within the environment [Singh 1992].
– Found by simulations within the agent's brain (see the sketch below)
• A kind of model-based RL
• A kind of deductive reasoning
[Figure: routes learned between neighboring landmarks m1, m2, m3, m4 are chained to form an approximate route between distant landmarks.]
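A minimal sketch of thought-mode under this description (the `simulate_step` model interface, the greedy simulated policy, and the step limit are illustrative assumptions; the slides only state that episodes are simulated inside the agent before acting):

```python
import numpy as np

def thought_mode(Q, simulate_step, start, goal, T, alpha=0.1, max_steps=500):
    """Run T imagined episodes before acting in the real environment.
    simulate_step(s, g, a) -> (r, s2, g2) is the agent's internal model
    of the augmented-space transition (an assumed interface)."""
    for _ in range(T):
        s, g = start, goal
        a = int(np.argmax(Q[s, g]))
        for _ in range(max_steps):
            r, s2, g2 = simulate_step(s, g, a)             # imagined transition
            a2 = int(np.argmax(Q[s2, g2]))
            v = float(Q[g2, g].max()) if g2 != g else 0.0  # V_g(g') term
            Q[s, g, a] += alpha * (r + Q[s2, g2, a2] - Q[s, g, a] + v)
            if s2 == goal:                                 # imagined goal reached
                break
            s, g, a = s2, g2, a2
```

Because the same update rule is applied to imagined and real transitions, routes between distant landmarks can be composed from learned neighboring routes without acting in the environment.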
Before evaluation of Thought-mode, the agent learns simple tasks
In the pre-training phase, only pairs of start and goal within a Euclidean distance of eight are selected. In the evaluation phase, arbitrary pairs are selected.
m : landmark
(The maze map is the same as the one shown in the maze task above.)
Experiment 2. Relationship between the length P of the pre-training phase and RGoal performance
[Figure: learning curves of episodes/step (y-axis, 0 to 0.14) versus training steps (x-axis, ×100,000 steps) for pre-training lengths P = 0, 1000, 2000.]
A greater value of P results in faster convergence. If an agent learns simple tasks first, learning difficult tasks becomes faster because the learned simple tasks can be reused as subroutines.
Experiment 3. Relationship between thought-mode length T and RGoal performance
T is the number of simulations executed prior to the actual execution of each episode. Here, we only plot the change in score after the pre-training phase.
[Figure: learning curves after the pre-training phase, episodes/step (y-axis, 0 to 0.08) versus training steps (x-axis, ×1,000 steps, from step 2001 to 2097), for thought-mode lengths T = 0, 1, 10, 100.]
If thought-mode length is sufficiently long, approximate solutions are obtained in almost zero-shot time.
Pseudocode of RGoal
This algorithm only uses simple data structures and a simple single loop. Because of its simplicity, we consider RGoal to be a promising first step toward a computational model of the planning mechanism of the human brain.
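The pseudocode itself appears only as an image in the source. Below is a hedged reconstruction of the single main loop as we understand it from the preceding slides; the epsilon-greedy selection, the zero reward for internal call transitions, the `env` interface, and the masking used to enforce the stack-depth limit `S` are our assumptions, not the authors' published listing:

```python
import random
import numpy as np

def rgoal_episode(Q, env, landmarks, start, goal, S, alpha=0.1, eps=0.1):
    """One RGoal episode as a single loop. Q has shape
    (n_states, n_states, n_prim + len(landmarks)); action indices past
    n_prim encode the subroutine calls C_m (assumed encoding)."""
    n_prim = env.n_actions
    s, g, stack = start, goal, []

    def select(s, g):
        # eps-greedy over primitives plus calls; calls are masked out once
        # the stack has reached the depth limit S.
        n = n_prim + (len(landmarks) if len(stack) < S else 0)
        if random.random() < eps:
            return random.randrange(n)
        return int(np.argmax(Q[s, g, :n]))

    a = select(s, g)
    while not (s == goal and not stack):
        if a < n_prim:                          # primitive action
            s2, r = env.step(s, a)
            g2 = g
        else:                                   # subroutine call C_m
            s2, r = s, 0.0                      # internal transition (assumed cost 0)
            stack.append(g)
            g2 = landmarks[a - n_prim]
        if s2 == g2 and stack:                  # subgoal reached: pop the caller
            g2 = stack.pop()
        a2 = select(s2, g2)
        v = float(Q[g2, g].max()) if g2 != g else 0.0   # V_g(g') term
        Q[s, g, a] += alpha * (r + Q[s2, g2, a2] - Q[s, g, a] + v)
        s, g, a = s2, g2, a2
```

The only data structures are the shared Q table, the subgoal, and the stack, which matches the claim of a simple single loop.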
Conclusion
• We proposed a novel hierarchical RL architecture that allows unlimited recursive subroutine calls.
• We integrated several important ideas proposed before into a single simple architecture:
– Augmented state-action space [Levy and Shimkin 2011]
– Value function decomposition [Singh 1992][Dietterich 2000]
– Model-based RL [Sutton 1990]
• A novel mechanism called Thought-mode combines learned simple tasks to solve unknown complicated tasks rapidly, sometimes in zero-shot time.
• Because the theoretical framework and the architecture of RGoal are simple, they are easy to extend.