
Transcript of "SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning"

Page 1:

SDRL: Interpretable and Data-efficient Deep Reinforcement Learning

Leveraging Symbolic Planning

Bo Liu

Auburn University, Auburn, AL, USA

Page 2:

Collaborators

Daoming Lyu, Auburn University, Auburn, AL, USA

Fangkai Yang, NVIDIA Corporation, Redmond, WA, USA

Steven Gustafson, Maana Inc., Bellevue, WA, USA

Page 3:

Sequential Decision-Making

Sequential decision-making (SDM) concerns an agent making a sequence of actions based on its behavior in the environment.

Deep reinforcement learning (DRL) achieves tremendous success on sequential decision-making problems using deep neural networks (Mnih et al., 2015).

Page 4:

Challenge: Montezuma’s Revenge

The avatar climbs down the ladder, jumps over a rotating skull, picks up the key (+100), then goes back and uses the key to open the right door (+300).

Vanilla DQN achieves a score of 0 (Mnih et al., 2015).

Page 5:

Challenge: Montezuma’s Revenge

Problem: long-horizon sequential actions, sparse and delayed reward.

Poor data efficiency.
Lack of interpretability.

Page 6:

Our Solution

Solution: task decomposition

Symbolic planning: subtask scheduling (high-level plan).

DRL: subtask learning (low-level control).

Meta-learner: subtask evaluation.

Goal

Symbolic planning drives learning, improving task-level interpretability.

DRL learns feasible subtasks, improving data-efficiency.

Page 7:

Background: Symbolic Planning with Action Language

Action language (Gelfond & Lifschitz, 1998): a formal, declarative, logic-based language that describes dynamic domains.

Dynamic domains can be represented as a transition system.

Page 8:

Action Language BC

Action Language BC (Lee et al., 2013) is a language that describes the transition system using a set of causal laws.

Dynamic laws describe transitions between states:

move(x, y1, y2) causes on(x, y2) if on(x, y1).

Static laws describe the values of fluents inside a state:

intower(x, y2) if intower(x, y1), on(y1, y2).
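To make the effect of these two example laws concrete, here is a heavily simplified Python sketch (my own toy encoding, not from the paper; it ignores inertia, defaults, and the rest of BC) that applies the dynamic law to a state and then closes the result under the static law:

```python
# Toy illustration of the two example causal laws above (hypothetical encoding;
# full BC semantics also covers inertia, defaults, and constraints).

def close_under_static_law(state):
    """Static law: intower(x, y2) if intower(x, y1), on(y1, y2)."""
    state = set(state)
    changed = True
    while changed:
        changed = False
        for fact in list(state):
            if fact[0] != "intower":
                continue
            _, x, y1 = fact
            for other in list(state):
                if other[0] == "on" and other[1] == y1:
                    derived = ("intower", x, other[2])
                    if derived not in state:
                        state.add(derived)
                        changed = True
    return state

def apply_move(state, x, y1, y2):
    """Dynamic law: move(x, y1, y2) causes on(x, y2) if on(x, y1)."""
    if ("on", x, y1) not in state:
        return set(state)                       # precondition fails: no effect
    next_state = set(state) - {("on", x, y1)}   # naive frame handling, for the demo only
    next_state.add(("on", x, y2))
    return close_under_static_law(next_state)

# Example: a is on b, b is on the table, and a is in the tower on b;
# the static law derives that a is also in the tower on the table.
s0 = close_under_static_law({("on", "a", "b"), ("on", "b", "table"), ("intower", "a", "b")})
s1 = apply_move(s0, "a", "b", "table")          # dynamic law: now ("on", "a", "table") holds
```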

Page 9:

Background: Reinforcement Learning

Reinforcement learning is defined on a Markov Decision Process (S, A, P^a_{ss′}, r, γ). To achieve optimal behavior, a policy π : S × A ↦ [0, 1] is learned.

An option is defined on the tuple (I, π, β), which enables decision-making to have a hierarchical structure:

the initiation set I ⊆ S,
the policy π : S × A ↦ [0, 1],
the probabilistic termination condition β : S ↦ [0, 1].
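As a minimal sketch (the names and types are mine, not the paper's), an option can be represented as a container of exactly these three pieces:

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any   # sensory state s in S
Action = Any

@dataclass
class Option:
    """An option (I, pi, beta) for hierarchical decision-making."""
    in_initiation_set: Callable[[State], bool]   # membership test for I, a subset of S
    policy: Callable[[State, Action], float]     # pi(s, a): probability of taking a in s
    termination: Callable[[State], float]        # beta(s): probability of terminating in s
```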

Page 10:

SDRL: Symbolic Deep Reinforcement Learning

Symbolic Planner: orchestrates the sequence of subtasks using a high-level symbolic plan.

Controller: uses DRL approaches to learn the subpolicy for each subtask with intrinsic rewards.

Meta-Controller: measures the learning performance of subtasks and updates the intrinsic goal to enable reward-driven plan improvement.
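Roughly, the three components interact in a loop like the hypothetical sketch below (the class and method names are placeholders of mine, not the authors' code):

```python
def sdrl_loop(planner, controller, meta_controller, env, num_episodes):
    """Hypothetical outline of SDRL training: plan, learn subtasks, evaluate, replan."""
    intrinsic_goal = None                          # constraint on plan quality (see later slides)
    for _ in range(num_episodes):
        plan = planner.plan(intrinsic_goal)        # high-level sequence of subtasks
        for subtask in plan:
            # Controller: run/update the DRL subpolicy for this subtask,
            # rewarded by the intrinsic (pseudo) reward.
            trajectory = controller.execute(subtask, env)
            # Meta-controller: score the subtask using extrinsic reward
            # and the controller's learning performance.
            meta_controller.evaluate(subtask, trajectory)
        # Tighten the intrinsic goal so the next plan must be at least as good,
        # driving exploration of new subtasks and exploitation of rewarding ones.
        intrinsic_goal = meta_controller.updated_goal(plan)
```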

Page 11:

Symbolic Planner

Page 12:

Symbolic Planner: Planning with Intrinsic Goal

Intrinsic goal: a linear constraint on plan quality, quality ≥ quality(Π_t), where Π_t is the plan at episode t.

Plan quality: a utility function

quality(Π_t) = Σ_{⟨s_{i−1}, g_{i−1}, s_i⟩ ∈ Π_t} ρ_{g_{i−1}}(s_{i−1})

where ρ_{g_i} is the gain reward for subtask g_i.
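A minimal sketch of this utility, assuming the plan is stored as a list of ⟨s_{i−1}, g_{i−1}, s_i⟩ triples and the gain rewards ρ live in a lookup table (both representations are my assumptions):

```python
def plan_quality(plan, gain_reward):
    """quality(plan) = sum of rho_g(s) over the symbolic transitions <s, g, s'> in the plan.

    plan:        iterable of (s_prev, g, s_next) triples
    gain_reward: dict mapping (g, s_prev) -> rho_g(s_prev)
    """
    return sum(gain_reward.get((g, s_prev), 0.0) for (s_prev, g, _s_next) in plan)
```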

Symbolic planner: generates a new plan that

explores new subtasks,
exploits more rewarding subtasks.

Page 13:

From Symbolic Transition to Subtask

Assumption: given the set S̃ of symbolic states and the set S of sensory inputs, we assume there is an oracle for symbol grounding, F : S̃ × S ↦ {t, f}. Given F and a pair of symbolic states s̃, s̃′ ∈ S̃:

the initiation set I = {s ∈ S : F(s̃, s) = t},
π : S ↦ A is the subpolicy for the corresponding subtask,
β is the termination condition such that β(s′) = 1 if F(s̃′, s′) = t for s′ ∈ S, and β(s′) = 0 otherwise.
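Under the oracle assumption above, the option for one symbolic transition could be assembled roughly as follows (F, the symbolic states, and the DRL subpolicy are stand-ins; this is a sketch, not the paper's implementation):

```python
def option_for_transition(F, sym_state, sym_state_next, subpolicy):
    """Build (I, pi, beta) for the symbolic transition sym_state -> sym_state_next.

    F(symbolic_state, sensory_state) -> bool is the symbol-grounding oracle.
    subpolicy(sensory_state) -> action is the DRL subpolicy learned for this subtask.
    """
    def in_initiation_set(s):            # I = {s in S : F(sym_state, s) = t}
        return F(sym_state, s)

    def beta(s_next):                    # terminate once the target symbolic state is grounded
        return 1.0 if F(sym_state_next, s_next) else 0.0

    return in_initiation_set, subpolicy, beta
```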

Page 14:

Controller

Page 15:

Controllers: DRL with Intrinsic Reward

Intrinsic reward: a pseudo-reward crafted by the human designer.

Given a subtask defined on (I, π, β), the intrinsic reward is

r_i(s′) = φ if β(s′) = 1, and r otherwise,

where φ is a positive constant encouraging achieving subtasks and r is the reward from the environment at state s′.
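A direct transcription of this pseudo-reward (φ is just a tunable positive constant here; the function is a sketch, not the authors' code):

```python
def intrinsic_reward(beta, s_next, env_reward, phi=1.0):
    """r_i(s') = phi if the subtask terminates at s' (beta(s') = 1), else the environment reward r."""
    return phi if beta(s_next) == 1.0 else env_reward
```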

Page 16:

Meta-Controller

Page 17:

Meta-Controller: Evaluation with Extrinsic Reward

Extrinsic reward: r_e(s, g) = f(ε), where ε can measure the competence of the learned subpolicy for each subtask.

For example, let ε be the success ratio; then f can be defined as

f(ε) = −ψ if ε < threshold, and r(s, g) if ε ≥ threshold,

where ψ is a positive constant to punish selecting unlearnable subtasks, and r(s, g) is the cumulative environmental reward obtained by following subtask g.
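With ε taken to be the empirical success ratio of a subtask, the case split might be coded as below (the threshold, ψ, and the cumulative-reward estimate are placeholders of mine):

```python
def extrinsic_reward(success_ratio, cumulative_reward, threshold=0.9, psi=1.0):
    """f(eps) = -psi if eps < threshold, else r(s, g), the cumulative environmental
    reward collected while following subtask g."""
    return -psi if success_ratio < threshold else cumulative_reward
```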

Page 18:

Experimental Results I.

Page 19:

Experimental Results II.

Baseline: Kulkarni et al., Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, NIPS 2016.

Page 20:

Conclusion

We present an SDRL framework that features:

High-level symbolic planning based on an intrinsic goal.
Low-level policy control with DRL.
Subtask learning evaluation by a meta-learner.

This is the first work on integrating symbolic planning with DRL that achieves both task-level interpretability and data-efficiency for decision-making.

Future work.
