Hierarchical Reinforcement Learning (Part II)

Hierarchical Reinforcement Learning (Part II) Mayank Mittal

Transcript of Hierarchical Reinforcement Learning (Part II)

Page 1: Learning (Part II) Hierarchical Reinforcement

Hierarchical Reinforcement Learning (Part II)

Mayank Mittal

Page 2: Learning (Part II) Hierarchical Reinforcement

What are humans good at?

Page 3: Learning (Part II) Hierarchical Reinforcement

Let’s go and have lunch!

Page 4: Learning (Part II) Hierarchical Reinforcement

Let’s go and have lunch!

1. Exit ETZ building  2. Cross the street  3. Eat at mensa

Page 5: Learning (Part II) Hierarchical Reinforcement

Let’s go and have lunch!

1. Exit ETZ building

➔ Open door
➔ Walk to the lift
➔ Press button
➔ Wait for lift
➔ …..

2. Cross the street

➔ Find shortest route
➔ Walk safely
➔ Follow traffic rules
➔ …..

3. Eat at mensa

➔ Open door
➔ Wait in a queue
➔ Take food
➔ …..

Page 6: Learning (Part II) Hierarchical Reinforcement

What are humans good at?

Temporal abstraction

Page 7: Learning (Part II) Hierarchical Reinforcement

Let’s go and have lunch!

1. Exit ETZ building

➔ Open door
➔ Walk to the lift
➔ Press button
➔ Wait for lift
➔ …..

2. Cross the street

➔ Find shortest route
➔ Walk safely
➔ Follow traffic rules
➔ …..

3. Eat at mensa

➔ Open door
➔ Wait in a queue
➔ Take food
➔ …..

Page 8: Learning (Part II) Hierarchical Reinforcement

What are humans good at?

Transfer/Reusability of Skills

Temporal abstraction

Page 9: Learning (Part II) Hierarchical Reinforcement

Let’s go and have lunch!

1. Exit ETZ building

➔ Open door
➔ Walk to the lift
➔ Press button
➔ Wait for lift
➔ …..

2. Cross the street

➔ Find shortest route
➔ Walk safely
➔ Follow traffic rules
➔ …..

3. Eat at mensa

➔ Open door
➔ Wait in a queue
➔ Take food
➔ …..

How to represent these different goals?

Page 10: Learning (Part II) Hierarchical Reinforcement

What are humans good at?

Powerful/meaningful state abstraction

Transfer/Reusability of Skills

Temporal abstraction

Page 11: Learning (Part II) Hierarchical Reinforcement

What are humans good at?

Can a learning-based agent do the same?

Powerful/meaningful state abstraction

Transfer/Reusability of Skills

Temporal abstraction

Page 12: Learning (Part II) Hierarchical Reinforcement

Promise of Hierarchical RL

Structured exploration

Transfer learning

Long-term credit assignment (and

memory)

Page 13: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

Environment

Agent

Manager

Worker(s)

Page 14: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 15: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 16: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Page 17: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Level of Abstraction

Temporal Resolution

Dayan, Peter and Geoffrey E. Hinton. “Feudal Reinforcement Learning.” NIPS (1992).

Page 18: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Page 19: Learning (Part II) Hierarchical Reinforcement

Detour: Dilated RNN

▪ Able to preserve memories over longer periods

For more details: Chang, S. et al. (2017). Dilated Recurrent Neural Networks, NIPS.
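A minimal Python sketch of the dilation idea (my own illustration, not the exact dilated LSTM cell from the paper): keep several recurrent groups and update only one of them per time step, so each group sees every r-th input and can carry information over a longer horizon.

import numpy as np

# Sketch only: r separate recurrent groups; at step t, only group (t mod r) is
# updated, so each group effectively operates on a coarser time scale.
class DilatedRNNSketch:
    def __init__(self, input_dim, hidden_dim, dilation=10, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        self.W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.h = np.zeros((dilation, hidden_dim))   # one hidden state per group
        self.dilation = dilation
        self.t = 0

    def step(self, x):
        i = self.t % self.dilation                  # group active at this step
        self.h[i] = np.tanh(self.W_in @ x + self.W_h @ self.h[i])
        self.t += 1
        return self.h.mean(axis=0)                  # pool over all groups

rnn = DilatedRNNSketch(input_dim=4, hidden_dim=8)
output = rnn.step(np.ones(4))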

Page 20: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Worker

Agent

Page 21: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Page 22: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Page 23: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Page 24: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Worker

Manager

Page 25: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Absolute Goal

(-3, 1)

(3, 9)

c: Manager’s Horizon

Page 26: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Directional Goal

Page 27: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Directional Goal

Idea: A single sub-goal (direction) can be reused in many different locations in state space

Page 28: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

▪ Intrinsic reward

Page 29: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

▪ Intrinsic reward
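Written out from the FuN paper, the Worker's intrinsic reward measures how well the last c state transitions followed the directions the Manager asked for (d_cos denotes cosine similarity):

r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}(s_t - s_{t-i}, \; g_{t-i})

The Worker then maximises a weighted sum of this intrinsic reward and the environment reward.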

Page 30: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Worker

Manager

Page 31: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Page 32: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

▪ Action: Stochastic Policy!

Page 33: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Worker

Agent

Page 34: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Worker

Agent

Why not do end-to-end learning?

Page 35: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Worker

Agent

Manager & Worker: Separate Actor-Critic

No gradient

Transition Policy Gradient

Policy Gradient

Page 36: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Qualitative Analysis

Page 37: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Ablative Analysis

Page 38: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Ablative Analysis

Page 39: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Comparison

Page 40: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Action Repeat Transfer

Page 41: Learning (Part II) Hierarchical Reinforcement

Experiences

FeUdal Networks (FUN)

On-Policy Learning

Learning

Wastage!

Page 42: Learning (Part II) Hierarchical Reinforcement

Experiences

Can we do better?

Off-Policy Learning

Learning

Replay Buffer

Reuse!
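As a reminder of what the replay buffer provides, a minimal generic sketch (not specific to any of the three papers): transitions are stored once and sampled many times for learning.

import random
from collections import deque

# Generic replay buffer: store transitions once, sample mini-batches repeatedly.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        return list(zip(*batch))   # (states, actions, rewards, next_states, dones)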

Page 43: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

Unstable Learning

Page 44: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

Unstable Learning: To-Be-Disclosed

Page 45: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 46: Learning (Part II) Hierarchical Reinforcement

Manager

Worker

Data-Efficient HRL (HIRO)

Page 47: Learning (Part II) Hierarchical Reinforcement

Input Goal Action

Raw Observation Space

Data-Efficient HRL (HIRO)

Page 48: Learning (Part II) Hierarchical Reinforcement

Manager

Worker

Rollout sequence

Data-Efficient HRL (HIRO)

Page 49: Learning (Part II) Hierarchical Reinforcement

Manager

Worker

Rollout sequence

Data-Efficient HRL (HIRO)

Page 50: Learning (Part II) Hierarchical Reinforcement

Manager

Worker

Rollout sequence

Data-Efficient HRL (HIRO)

Page 51: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

c: Manager’s Horizon

Page 52: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Page 53: Learning (Part II) Hierarchical Reinforcement

▪ Intrinsic reward

Data-Efficient HRL (HIRO)
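From the HIRO paper: goals are expressed as desired offsets in the raw observation space, rolled forward by a fixed goal-transition function h, and the Worker's intrinsic reward is the negative distance to the current goal:

h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}

r^I(s_t, g_t, a_t, s_{t+1}) = -\lVert s_t + g_t - s_{t+1} \rVert_2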

Page 54: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 55: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 56: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 57: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 58: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

Unstable Learning: To-Be-Disclosed

Page 59: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

Unstable Learning: Manager’s past experience might become useless

Page 60: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

t = 12 yrs

Goal: “wear a shirt”

Page 61: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

Same goal induces different behavior

t = 22 yrs

Goal: “wear a shirt”

Page 62: Learning (Part II) Hierarchical Reinforcement

Can we do better?

Off-Policy Learning

Goal relabelling required!

t = 22 yrs

Goal: “wear a dress”

Goal: “wear a shirt”

Page 63: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO) Off-Policy Correction for Manager

where

Page 64: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO) Off-Policy Correction for Manager

where

...
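In words: when a Manager transition is replayed, its stored goal is relabelled to the goal that best explains the Worker actions that were actually observed. Restating the objective from the HIRO paper (π_lo is the Worker policy, c the Manager's horizon, h the fixed goal transition):

\tilde{g}_t = \arg\max_{g} \sum_{i=t}^{t+c-1} \log \pi_{lo}(a_i \mid s_i, g_i), \quad g_{i+1} = h(s_i, g_i, s_{i+1})

With a deterministic Worker μ_lo and a Gaussian action model, this reduces to minimising \sum_i \lVert a_i - \mu_{lo}(s_i, g_i) \rVert_2^2 over a small set of candidate goals.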

Page 65: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Environment

Manager

Worker(s)

Agent

Replay Buffer

Replay Buffer

Page 66: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Ant Push

Page 67: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Qualitative Analysis

Page 68: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Ablative Analysis

Plots: Performance vs. Experience Samples (in millions)

Page 69: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Comparison

Page 70: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO)

Comparison

Plot: Performance vs. Experience Samples (in millions)

Page 71: Learning (Part II) Hierarchical Reinforcement

Can we do better?

What is missing?

Structured exploration

Page 72: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)

Meta-Learning Shared Hierarchies (ICLR 2018)

Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)

Page 73: Learning (Part II) Hierarchical Reinforcement

Meta-Learning Shared Hierarchies (MLSH)

Page 74: Learning (Part II) Hierarchical Reinforcement

Taken after every N steps

Meta-Learning Shared Hierarchies (MLSH)
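A minimal sketch of that control flow (the function names and the Gym-style environment interface are assumptions): the master policy picks a sub-policy index only once every N steps, and the selected sub-policy emits primitive actions at every step in between.

def mlsh_rollout(env, master_policy, sub_policies, N=10, horizon=200):
    obs = env.reset()
    total_reward, active = 0.0, None
    for t in range(horizon):
        if t % N == 0:                       # master acts on the slow timescale
            active = master_policy(obs)      # index into the shared sub-policies
        action = sub_policies[active](obs)   # selected sub-policy acts every step
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward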

Page 75: Learning (Part II) Hierarchical Reinforcement

Computer Vision practice:
▪ Train on ImageNet
▪ Fine-tune on actual task

Slide Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 76: Learning (Part II) Hierarchical Reinforcement

Computer Vision practice:
▪ Train on ImageNet
▪ Fine-tune on actual task

How to generalize this to behavior learning?

Slide Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 77: Learning (Part II) Hierarchical Reinforcement

Environment A

Environment B

...

Meta-RL Algorithm

"Fast" RL Agent

Image Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 78: Learning (Part II) Hierarchical Reinforcement

Environment A

Environment B

...

Meta-RL Algorithm

"Fast" RL Agent

Environment F

Testing environments

Image Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)

Meta-Learning Shared Hierarchies (MLSH)

Page 79: Learning (Part II) Hierarchical Reinforcement

GOAL: Find sub-policies that enable fast learning of master policy

Meta-Learning Shared Hierarchies (MLSH)

Page 80: Learning (Part II) Hierarchical Reinforcement

GOAL: Find sub-policies that enable fast learning of master policy

Meta-Learning Shared Hierarchies (MLSH)

Page 81: Learning (Part II) Hierarchical Reinforcement

Meta-Learning Shared Hierarchies (MLSH)

Page 82: Learning (Part II) Hierarchical Reinforcement

Meta-Learning Shared Hierarchies (MLSH)

Page 83: Learning (Part II) Hierarchical Reinforcement

Meta-Learning Shared Hierarchies (MLSH)

Page 84: Learning (Part II) Hierarchical Reinforcement

Ant Two-walks

Meta-Learning Shared Hierarchies (MLSH)

Page 85: Learning (Part II) Hierarchical Reinforcement

Ant Obstacle Course

Meta-Learning Shared Hierarchies (MLSH)

Page 86: Learning (Part II) Hierarchical Reinforcement

Movement Bandits

Meta-Learning Shared Hierarchies (MLSH)

Page 87: Learning (Part II) Hierarchical Reinforcement

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 88: Learning (Part II) Hierarchical Reinforcement

Ablative Analysis

Meta-Learning Shared Hierarchies (MLSH)

Page 89: Learning (Part II) Hierarchical Reinforcement

Ablative Analysis

Meta-Learning Shared Hierarchies (MLSH)

Page 90: Learning (Part II) Hierarchical Reinforcement

Four Rooms

Meta-Learning Shared Hierarchies (MLSH)

Page 91: Learning (Part II) Hierarchical Reinforcement

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 92: Learning (Part II) Hierarchical Reinforcement

Summary

FUN
● Directional goals
● Dilated RNN
● Transition Policy Gradient

MLSH
● Generalization in the RL algorithm
● Inspired by the “Options” framework

HIRO
● Absolute goals in observation space
● Data-efficient
● Off-policy label correction

Page 93: Learning (Part II) Hierarchical Reinforcement

Discussion

▪ How to decide temporal resolution (i.e. c, N)?

▪ Do we need discrete # of sub-policies?

▪ Future prospects of HRL? More hierarchies?

Page 94: Learning (Part II) Hierarchical Reinforcement

Thank you for your attention!

Page 95: Learning (Part II) Hierarchical Reinforcement

Any Questions?

Page 96: Learning (Part II) Hierarchical Reinforcement

Let’s go and have lunch!

Page 97: Learning (Part II) Hierarchical Reinforcement

References

▪ Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). FeUdal Networks for Hierarchical Reinforcement Learning. ICML.

▪ Nachum, O., Gu, S., Lee, H., & Levine, S. (2018). Data-Efficient Hierarchical Reinforcement Learning. NeurIPS.

▪ Frans, K., Ho, J., Chen, X., Abbeel, P., & Schulman, J. (2018). Meta Learning Shared Hierarchies. ICLR.

Page 98: Learning (Part II) Hierarchical Reinforcement

Appendix

Page 99: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

Environment

Manager

Worker(s)

Agent

Page 100: Learning (Part II) Hierarchical Reinforcement

Hierarchical RL

Image Credits: Levy, A. et al. (2019). Learning Multi-Level Hierarchies with Hindsight, ICLR

Page 101: Learning (Part II) Hierarchical Reinforcement

Detour: A2C

Image Credits: Sergey Levine (2018), CS 294-112 (Lecture 6)

Page 102: Learning (Part II) Hierarchical Reinforcement

Advantage Function:

Update Rule:

FeUdal Networks (FUN)

Worker

Policy Gradient
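Written out from the FuN paper, the Worker's advantage mixes the extrinsic return with the intrinsic return (weighted by α), and the update is a standard policy gradient on that advantage:

A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)

\nabla_\theta \pi_t = A^D_t \, \nabla_\theta \log \pi(a_t \mid x_t; \theta)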

Page 103: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Manager

Advantage Function:

Update Rule:

Transition Policy Gradient

Page 104: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Transition Policy Gradient

Assumptions:

● Worker will eventually learn to follow the goal directions
● Direction in state-space follows a von Mises-Fisher distribution
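Under these assumptions the Manager can be trained directly on how the state actually changed over its horizon c, rather than through the Worker's individual actions. The resulting transition policy gradient from the FuN paper:

A^M_t = R_t - V^M_t(x_t; \theta)

\nabla_\theta g_t = A^M_t \, \nabla_\theta d_{\cos}(s_{t+c} - s_t, \; g_t(\theta))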

Page 105: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Sub-goals learnt by the Manager

Page 106: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Memory Task: Non-Match

Page 107: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Memory Task: T-Maze

Page 108: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Memory Task: Water-Maze

Page 109: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Comparison

Page 110: Learning (Part II) Hierarchical Reinforcement

FeUdal Networks (FUN)

Comparison

Page 111: Learning (Part II) Hierarchical Reinforcement

Network Structure: TD3

Data-Efficient HRL (HIRO)

Manager: Actor-Critic with 2-layer MLP each (goal output has the dimension of the raw observation space)

Worker: Actor-Critic with 2-layer MLP each (action output has the dimension of the action space)

For more details: Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML.
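A small sketch of those actors in PyTorch (the hidden size and the placeholder dimensions are assumptions; the TD3 twin critics, target networks, and exploration noise are omitted):

import torch
import torch.nn as nn

# 2-layer MLP actor used for both levels; outputs are squashed with tanh.
def mlp_actor(in_dim, out_dim, hidden=300):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim), nn.Tanh(),
    )

obs_dim, goal_dim, act_dim = 32, 32, 8                  # placeholder dimensions
manager_actor = mlp_actor(obs_dim, goal_dim)            # goal in raw observation space
worker_actor = mlp_actor(obs_dim + goal_dim, act_dim)   # action, conditioned on goal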

Page 112: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO) Off-Policy Correction for Manager

where

...

Page 113: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO) Off-Policy Correction for Manager

Approximately solved by generating candidate goals

Page 114: Learning (Part II) Hierarchical Reinforcement

Data-Efficient HRL (HIRO) Off-Policy Correction for Manager

Approximately solved by generating candidate goals (see the sketch after this list):

● Original goal given:

● Absolute goal based on transition observed:

● Randomly sampled candidates:
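A Python sketch of that procedure (the Worker policy, the stored sub-trajectory, and the noise scale σ are assumptions): the candidates are the original goal, the achieved difference s_{t+c} − s_t, and several Gaussian-perturbed variants (eight in the paper); each is scored by how well it explains the stored actions, rolling the goal forward with the fixed transition h.

import numpy as np

def relabel_goal(worker_mu, states, actions, original_goal, num_random=8, sigma=0.5):
    # Candidate set: original goal, the "achieved" goal s_{t+c} - s_t, and a few
    # Gaussian perturbations around the achieved goal.
    achieved = states[-1] - states[0]
    candidates = [original_goal, achieved]
    candidates += [achieved + sigma * np.random.randn(*achieved.shape)
                   for _ in range(num_random)]

    def log_likelihood(g):
        # Surrogate log-probability of the stored actions under candidate goal g:
        # negative squared error to the Worker's deterministic action, with the goal
        # rolled forward by the fixed transition h(s, g, s') = s + g - s'.
        total, goal = 0.0, g
        for s, s_next, a in zip(states[:-1], states[1:], actions):
            total -= np.sum((a - worker_mu(s, goal)) ** 2)
            goal = s + goal - s_next
        return total

    return max(candidates, key=log_likelihood)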

Page 115: Learning (Part II) Hierarchical Reinforcement

Training

Data-Efficient HRL (HIRO)

Page 117: Learning (Part II) Hierarchical Reinforcement

Network Structure: PPO

Meta-Learning Shared Hierarchies (MLSH)

Manager: 2-layer MLP with 64 hidden units (output over the number of sub-policies)

Each sub-policy: 2-layer MLP with 64 hidden units (output over the dimension of the action space)

For more details: Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347

Page 118: Learning (Part II) Hierarchical Reinforcement

Training

Meta-Learning Shared Hierarchies (MLSH)

Page 119: Learning (Part II) Hierarchical Reinforcement

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 120: Learning (Part II) Hierarchical Reinforcement

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 121: Learning (Part II) Hierarchical Reinforcement

Comparison

Meta-Learning Shared Hierarchies (MLSH)

Page 122: Learning (Part II) Hierarchical Reinforcement

▪ Useful when input data is sequential (such as in speech recognition, language modelling)

Recurrent Neural Network

For more details: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 123: Learning (Part II) Hierarchical Reinforcement

Stochastic NN for HRL (SNN4HRL)

For more details: Florensa, C. et al. (2017). Stochastic Neural Networks for Hierarchical Reinforcement Learning. ICLR.

Aims to learn useful skills during pre-training and then leverage them for learning faster in future tasks

Page 124: Learning (Part II) Hierarchical Reinforcement

Variational Information Maximizing Exploration (VIME)

For more details: Houthooft, R. et al. (2016). VIME: Variational Information Maximizing Exploration, NIPS.

Exploration based on maximizing information gain about the agent's belief of the environment