
Transcript
Page 1

Reward Machines: Structuring reward function specifications

and reducing sample complexity in reinforcement learning

Sheila A. McIlraith, Department of Computer Science

University of Toronto

MSR Reinforcement Learning Day 2019, New York, NY

October 3, 2019

Page 2

Acknowledgements

Rodrigo Toro Icarte

Page 3

Acknowledgements

Toryn Klassen, Richard Valenzano, Rodrigo Toro Icarte

Alberto Camacho, Ethan Waldie, Margarita Castro

Page 4

LANGUAGE

Page 5

LANGUAGE

Humans have evolved languages over thousands of years to provide useful abstractions for understanding and interacting with each other and with the physical world.

The claim advanced by some is that language influences what we think, what we perceive, how we focus our attention, and what we remember.

While psychologists continue to debate how (and whether) language shapes the way we think, there is some agreement that the alphabet and structure of a language can have a significant impact on learning and reasoning.

Page 6

LANGUAGE

We use language to capture our understanding of the world around us, to communicate high-level goals, intentions, and objectives, and to support coordination with others.

We also use language to teach – to transfer knowledge.

Importantly, language can provide us with useful and purposeful abstractions that can help us to generalize and transfer knowledge to new situations.

Can exploiting the alphabet and structure of language help RL agents learn and think?

Page 7

Photo: Javier Pierin (Getty Images)

How do we advise, instruct, task, … and impart knowledge to our RL agents?

Page 8

Goals and Preferences

• Run the dishwasher when it’s full or when dishes are needed for the next meal.

• Make sure the bath temperature is between 38 and 43 degrees Celsius immediately before letting someone enter the bathtub.

• Do not vacuum while someone in the house is sleeping.

Page 9

Goals and Preferences

• When getting ice cream, please always open the freezer, take out the ice cream, serve yourself, put the ice cream back in the freezer, and close the freezer door.

Page 10

Linear Temporal Logic (LTL): a compelling logic for expressing temporal properties of traces.

LTL in a Nutshell

Syntax
Logical connectives: ∧, ∨, ¬
LTL basic operators:
  next: ◯φ
  weak next: ⊙φ
  until: φ U ψ
Other LTL operators:
  eventually: ◇φ ≝ true U φ
  always: □φ ≝ ¬◇¬φ
  release: φ R ψ ≝ ¬(¬φ U ¬ψ)

Example: Eventually hold the key, and then have the door open.
  ◇(hold(key) ∧ ◯ open(door))

Finite and Infinite interpretations
The truth of an LTL formula is interpreted over state traces:
  LTL: infinite traces
  LTLf: finite traces¹

Properties
• Interpreted over finite or infinite traces.
• Can be transformed into automata.

¹ cf. Bacchus et al. (1996); De Giacomo et al. (2013, 2015). Slide from Camacho et al., Bridging the Gap Between LTL Synthesis and Automated Planning.

Page 11

LTL in a Nutshell (same slide as Page 10).

Remember this!
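Since LTLf truth is just a recursive check over a finite trace, a small evaluator makes the semantics concrete. This is a minimal sketch of my own (nothing from the talk), with formulas as nested tuples and a trace as a list of sets of true propositions:

def holds(formula, trace, i=0):
    """Does an LTLf formula hold at position i of a finite trace?"""
    op = formula[0]
    if op == "prop":                  # atomic proposition
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "next":                  # strong next: requires a successor step
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "until":                 # formula[1] U formula[2]
        return any(holds(formula[2], trace, k) and
                   all(holds(formula[1], trace, j) for j in range(i, k))
                   for k in range(i, len(trace)))
    if op == "eventually":            # eventually phi, i.e. true U phi
        return any(holds(formula[1], trace, k) for k in range(i, len(trace)))
    raise ValueError("unknown operator: %s" % op)

# "Eventually hold the key, and then have the door open":
key_then_door = ("eventually",
                 ("and", ("prop", "hold(key)"),
                         ("next", ("prop", "open(door)"))))
print(holds(key_then_door, [set(), {"hold(key)"}, {"open(door)"}]))  # True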

Page 12

Goals and Preferences

• Do not vacuum while someone is sleeping:
  always[¬(vacuum ∧ sleeping)]

Page 13

Goals and Preferences

• Do not vacuum while someone is sleeping:
  always[¬(vacuum ∧ sleeping)]
• When getting an ice cream for someone …
  always[get(ice-cream) -> eventually[open(freezer) ∧
    next[remove(ice-cream,freezer) ∧ next[serve(ice-cream) ∧
    next[replace(ice-cream,freezer) ∧ next[close(freezer)]]]]]]
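The evaluator sketched under Page 11 can check such specifications over a logged finite trace; always is derived from eventually exactly as in the slide's definitions (again my own illustration, with hypothetical proposition names):

def always(phi):                      # always phi == not eventually not phi
    return ("not", ("eventually", ("not", phi)))

no_vacuum = always(("not", ("and", ("prop", "vacuum"), ("prop", "sleeping"))))

print(holds(no_vacuum, [{"vacuum"}, {"sleeping"}, set()]))      # True
print(holds(no_vacuum, [{"vacuum"}, {"vacuum", "sleeping"}]))   # False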

Page 14

How do we communicate this to our RL agent?

Page 15

MOTIVATION

Page 16

Challenges to RL

• Reward Specification: It’s hard to define reward functions for complex tasks.

• Sample Efficiency: RL agents might require billions of interactions with the environment to learn good policies.

Page 17

Reinforcement Learning

[Diagram: the Agent sends an Action to the Environment, which contains the Transition Function and the Reward Function; the Environment returns a State and a Reward to the Agent.]

Page 18

Running Example

B * * C
* o *
A * * D

Symbol       Meaning
(icon)       Agent
*            Furniture
(icon)       Coffee Machine
(icon)       Mail Room
o            Office
A, B, C, D   Marked Locations

Task: Visit A, B, C, and D, in order.

Page 19

Toy Problem Disclaimer

B * * C
* o *
A * * D

[Legend as on Page 18.]

These “toy problems” challenge state-of-the-art RL techniques.

Task: Visit A, B, C, and D, in order.

Page 20

Running Example

B * * C
* o *
A * * D

count = 0  # global variable

def get_reward(state):
    global count
    if count == 0 and state.at("A"):
        count = 1
    if count == 1 and state.at("B"):
        count = 2
    if count == 2 and state.at("C"):
        count = 3
    if count == 3 and state.at("D"):
        count = 0
        return 1
    return 0

Observation: Someone always has to program the reward function… even when the environment is the real world!

Task: Visit A, B, C, and D, in order.

Page 21

Running Example

B * * C
* o *
A * * D

[Pages 21-30 repeat this slide to animate the loop: at each step the environment passes the current state to the reward function above (the reward function is part of the environment), which returns reward 0, or 1 once A, B, C, and D have been visited in order.]

Task: Visit A, B, C, and D, in order.

Page 31

Simple Idea:
- Give the agent access to the reward function
- Exploit reward function structure in learning

Remember this!

Page 32

Running Example

B * * C
* o *
A * * D

[Same reward-function code as on Page 20.]

The agent can exploit structure in the reward function.

Page 33

Decoupling Transition and Reward Functions

[Diagram: the Agent sends an Action to the Environment (Transition Function + Reward Function); the Environment returns a State and a Reward.]

Page 34

Decoupling Transition and Reward Functions

[Diagram: the Reward Function now sits outside the Environment. The Environment (Transition Function) returns the State; the separate Reward Function produces the Reward for the Agent.]

Page 35

The Rest of the Talk

▶ Reward Machines (RM)

§ Exploiting RM Structure in Learning

§ Experiments

§ Creating Reward Machines

§ Recap

Page 36

REWARD MACHINES

Page 37

Define a Reward Function using a Reward Machine

Encode the reward function in an automaton-like structure, using a vocabulary.

[Left: the reward-function code from Page 20. Right: a reward machine with states u0, u1, u2, u3; self-loops ⟨¬A, 0⟩, ⟨¬B, 0⟩, ⟨¬C, 0⟩, ⟨¬D, 0⟩; transitions u0 →⟨A, 0⟩→ u1 →⟨B, 0⟩→ u2 →⟨C, 0⟩→ u3 →⟨D, 1⟩→ u0.]

P = {coffee, mail, o, *, A, B, C, D}

Page 38

Reward Function Vocabulary

Vocabulary can comprise human-interpretable events/properties realized via detectors over the environment state, or it can (conceivably) be learned.

B * * C
* o *
A * * D

[Legend as on Page 18.]

Page 39

Reward Machine

[Pages 39-44 build up the definition over the same diagram: states u0, u1, u2, u3; self-loops ⟨¬A, 0⟩, ⟨¬B, 0⟩, ⟨¬C, 0⟩, ⟨¬D, 0⟩; transitions u0 →⟨A, 0⟩→ u1 →⟨B, 0⟩→ u2 →⟨C, 0⟩→ u3 →⟨D, 1⟩→ u0.]

A Reward Machine has:
• a finite set of states U
• an initial state u0 ∈ U
• a set of transitions labelled by:
  § a logical condition (guard)
  § a reward function (or constant)

Conditions are over properties of the current state:
P = {coffee, mail, o, *, A, B, C, D}

A Reward Machine is a Mealy machine over the input alphabet Σ = 2^P, whose output alphabet is a set of Markovian reward functions.
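To make the definition concrete, here is a minimal Python sketch of my own (not the authors' implementation) of a reward machine as a guarded transition table, with the proposition detectors left abstract:

class RewardMachine:
    # delta maps (state, proposition) -> (next state, reward).
    # Propositions with no entry leave the machine in place (self-loop, reward 0).
    def __init__(self, initial_state, transitions):
        self.u0 = initial_state
        self.u = initial_state
        self.delta = transitions

    def reset(self):
        self.u = self.u0

    def simulate(self, u, true_props):
        """Successor state and reward from state u, given the set of
        propositions true in the current environment state."""
        for p in true_props:
            if (u, p) in self.delta:
                return self.delta[(u, p)]
        return u, 0

    def step(self, true_props):
        """Advance the machine and return the emitted reward."""
        self.u, reward = self.simulate(self.u, true_props)
        return reward

# The "visit A, B, C, and D, in order" machine above:
visit_abcd = RewardMachine("u0", {
    ("u0", "A"): ("u1", 0),
    ("u1", "B"): ("u2", 0),
    ("u2", "C"): ("u3", 0),
    ("u3", "D"): ("u0", 1),
})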

Page 45

Reward Machines in Action

B * * C
* o *
A * * D

[Same reward-machine diagram. Pages 45-88 animate its use: at every step, the propositions true in the current environment state are fed to the machine; the machine takes the matching transition and outputs the reward on its label: 0 while the agent works through A, B, and C in order, and 1 on reaching D, after which the machine returns to u0.]

Page 89

Other Reward Machines

Task: Deliver coffee to the office, while avoiding furniture.

[Reward machine: from u0, ⟨coffee ∧ ¬*, 0⟩ leads to u1, ⟨¬coffee ∧ ¬*, 0⟩ self-loops, and ⟨*, 0⟩ leads to a failure state; from u1, ⟨o ∧ ¬*, 1⟩ leads to a terminal state, ⟨¬o ∧ ¬*, 0⟩ self-loops, and ⟨*, 0⟩ leads to the failure state; the terminal and failure states self-loop on ⟨true, 0⟩. Repeated on Pages 90-91.]

Page 92

Other Reward Machines

Task: Deliver coffee and mail to the office.

[Reward machine: u0 self-loops on ⟨¬coffee ∧ ¬mail, 0⟩; picking up the coffee first moves to a state that waits for the mail, and picking up the mail first moves to a state that waits for the coffee; once both are held, the machine self-loops on ⟨¬o, 0⟩ and moves on ⟨o, 1⟩ to a terminal state that self-loops on ⟨true, 0⟩. Repeated on Pages 93-94.]

Page 95

The Rest of the Talk

• Reward Machines (RM)

▶ Exploiting RM Structure in Learning

• Experiments

• Creating Reward Machines

• Recap

Page 96

EXPLOITING RM STRUCTURE IN LEARNING

Page 97

Methods for Exploiting RM Structure

Baselines based on existing methods:
1. Q-learning over an equivalent MDP (Q-learning)
2. Hierarchical RL based on options (HRL)
3. HRL with RM-based pruning (HRL-RM)

Our approaches:
4. Q-learning for Reward Machines (QRM)
5. QRM + Reward Shaping for Reward Machines (QRM + RS)

Page 98

1. Q-Learning Baseline

A Reward Machine may define a non-Markovian reward function.

B * * C
* o *
A * * D

[Same reward-machine diagram. Pages 98-103 animate the problem: the learner sees only the environment state, yet the same state can yield reward 0 or reward 1 depending on which RM state the run is in.]

Page 104

1. Q-Learning Baseline

Solution: Include the RM state as part of the agent's state representation, and use standard Q-learning on the resulting MDP.

[Same reward-machine diagram and grid.]
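A minimal sketch of this baseline (my own, assuming a small environment interface: env.reset(), env.step(a) returning the next state and a done flag, env.actions, and env.true_props(s) as the proposition detectors, plus the RewardMachine class sketched earlier): tabular Q-learning over (environment state, RM state) pairs.

import random
from collections import defaultdict

def q_learning_with_rm_state(env, rm, episodes=1000,
                             alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                    # keyed by ((s, u), a)
    for _ in range(episodes):
        s = env.reset()
        rm.reset()
        done = False
        while not done:
            u = rm.u                          # RM state joins the agent state
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[((s, u), b)])
            s2, done = env.step(a)            # environment: transitions only
            r = rm.step(env.true_props(s2))   # reward machine: rewards only
            best = 0.0 if done else max(Q[((s2, rm.u), b)] for b in env.actions)
            Q[((s, u), a)] += alpha * (r + gamma * best - Q[((s, u), a)])
            s = s2
    return Q

Pairing the environment state with the RM state restores the Markov property, so standard Q-learning applies unchanged.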

Page 105

2. Option-Based Hierarchical RL (HRL)

Learn one option policy for each proposition mentioned in the RM.

[Same reward-machine diagram.]

• The RM refers to A, B, C, and D
• Learn policies πA, πB, πC, and πD
• Optimize πp to satisfy p optimally

Page 106

2. Option-Based Hierarchical RL (HRL)

Simultaneously learn when to use each option policy.

[Same reward-machine diagram and grid, with a Meta-Controller selecting among πA, πB, πC, πD.]

Page 107

3. HRL with RM-Based Pruning (HRL-RM)

Prune irrelevant options using the current RM state, as in the sketch below.

[Same reward-machine diagram and grid, with the Meta-Controller over πA, πB, πC, πD. Repeated on Page 108.]
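Given the RewardMachine sketch from earlier, the pruning itself is one line (my own illustration): only propositions labelling an outgoing edge of the current RM state correspond to options worth considering now.

def available_options(rm):
    # Options the meta-controller may pick: propositions that can
    # actually move the machine out of its current state.
    return {p for (u, p) in rm.delta if u == rm.u}

# e.g. with visit_abcd in state "u2", only the option for "C" survives.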

Page 109

HRL Methods Can Find Suboptimal Policies

B * * C
* o *
A * * D

[Reward machine for getting the coffee and then reaching the office: u0 waits for the coffee, u1 then waits for o, and u2 is terminal.]

Optimal solution (γ < 1): 13 total steps (10 steps, then 3 steps).

HRL learns two options (1. getting the coffee, 2. getting to “o”); composing them takes 18 steps.

HRL approaches find “locally” optimal solutions.

Page 112

4. Q-Learning for Reward Machines (QRM)

[Same reward-machine diagram.]

Page 113

4. Q-Learning for Reward Machines (QRM)

QRM (our approach):
1. Learn one policy (q-value function) per state in the Reward Machine: q_u0, q_u1, q_u2, q_u3.
2. Select actions using the policy of the current RM state.

[Same reward-machine diagram, with one q-value function attached to each state. Pages 114-118 repeat this slide.]

Page 119

4. Q-Learning for Reward Machines (QRM)

QRM (our approach):
1. Learn one policy (q-value function) per state in the Reward Machine.
2. Select actions using the policy of the current RM state.
3. Reuse experience to update all q-value functions on every transition via off-policy reinforcement learning.

[Same reward-machine diagram.]

Remember this!
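A minimal sketch of one QRM learning step (my own, reusing the RewardMachine class and environment interface assumed in the earlier sketches): the observed transition (s, a, s′) is replayed through every RM state, so all q-value functions learn off-policy from the same experience.

def qrm_step(env, rm, rm_states, Q, s, a, alpha=0.1, gamma=0.9):
    """Q[u] is a defaultdict(float) keyed by (state, action), one per RM state."""
    s2, done = env.step(a)
    props = env.true_props(s2)
    for u in rm_states:
        # Counterfactual update: as if the machine had been in state u.
        u2, r = rm.simulate(u, props)
        best = 0.0 if done else max(Q[u2][(s2, b)] for b in env.actions)
        Q[u][(s, a)] += alpha * (r + gamma * best - Q[u][(s, a)])
    rm.step(props)                    # now advance the machine's actual state
    return s2, done

Actions themselves are still chosen (ε-greedily) from Q[rm.u], the q-value function of the machine's actual current state.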

Page 120

QRM In Action

B * * C
* o *
A * * D

[Same reward-machine diagram, with q-value functions q_u0, …, q_u3. Pages 120-132 animate one transition (s, a, s′): select the action according to the current RM state, then update each q-value function as if the RM were in the corresponding state, e.g.

  q_u1(s, a) ← 0 + γ · max_a′ q_u1(s′, a′)
  q_u3(s, a) ← 1 + γ · max_a′ q_u0(s′, a′)

where the reward and the successor q-function come from the RM transition taken out of that state.]

Page 133

Recall: Methods for Exploiting RM Structure

Baselines based on existing methods:
1. Q-learning over an equivalent MDP (Q-learning)
2. Hierarchical RL based on options (HRL)
3. HRL with RM-based pruning (HRL-RM)

Our approaches:
4. Q-learning for Reward Machines (QRM)
5. QRM + Reward Shaping for Reward Machines (QRM + RS)

Page 134

5. QRM + Reward Shaping (QRM + RS)

Reward Shaping Intuition: Some reward functions are easier to learn policies for than others, even if those functions have the same optimal policy.

Given any MDP and potential function Φ over its states, changing the reward function of the MDP to

  r′(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s)

will not change the set of optimal policies. Thus, if we find a Φ that also allows us to learn optimal policies more quickly, we are guaranteed that the policies found are still optimal with respect to the original reward function.

[Ng, Harada, Russell, 1999]

Page 135

5. QRM + Reward Shaping (QRM + RS)

QRM + RS (our approach):
1. Treat the RM itself as an MDP and perform value iteration over it, yielding a potential Φ(u) per RM state.
2. Apply QRM to the shaped RM.
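A minimal sketch of step 1 as I read it (not the authors' code): run value iteration directly on the machine's own transition graph and use the resulting values as the potentials Φ(u) for the shaping rule on Page 134.

def rm_potentials(rm_states, rm_edges, gamma=0.9, sweeps=100):
    """rm_edges: (u, u_next, reward) triples, including self-loops."""
    phi = {u: 0.0 for u in rm_states}
    for _ in range(sweeps):
        for u in rm_states:
            phi[u] = max(r + gamma * phi[v]
                         for (src, v, r) in rm_edges if src == u)
    return phi

# Shaped reward for an RM transition u -> v that emitted r:
#   r_shaped = r + gamma * phi[v] - phi[u]

Because the potential depends only on the RM state, the shaped rewards remain Markovian over the cross-product and the set of optimal policies is unchanged.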

Page 136

Optimality of QRM and QRM + RS

Theorem: QRM converges to the optimal policy in the limit, as does QRM + RS.

B * * C
* o *
A * * D

[Same reward-machine diagram.]

Page 137

The Rest of the Talk

• Reward Machines (RM)

• Exploiting RM Structure in Learning

▶ Experiments

• Creating Reward Machines

• Concluding Remarks

Page 138

EXPERIMENTS

McIlraith MSR RL Day 2019

Page 139: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Test Domains

• Two domains with a discrete action and state space
  § Office domain (4 tasks)
  § Craft domain (10 tasks)

• One domain with a continuous state space
  § Water World domain (10 tasks)

McIlraith MSR RL Day 2019

Page 140: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Tests in Discrete Domains

Tested all five approaches:

1. Q-learning over an equivalent MDP (Q-learning)
2. Hierarchical RL based on options (HRL)
3. HRL with RM-based pruning (HRL-RM)
4. Q-learning for Reward Machines (QRM)
5. QRM + Reward Shaping (QRM + RS)

Method       Optimality?   Decomposition?
Q-learning   Yes           No
HRL          No            Yes
HRL-RM       No            Yes
QRM          Yes           Yes
QRM + RS     Yes           Yes

McIlraith MSR RL Day 2019

Page 141: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Office World Experiments

4 tasks, 30 independent trials per task

[Plot: Office World, normalized discounted reward vs. number of training steps (0 to 50,000), comparing Q-learning, HRL, HRL-RM, and QRM; the office grid world is shown inset.]

McIlraith MSR RL Day 2019


Page 143: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Minecraft World Experiments

10 tasks over 10 random maps, 3 independent trials per combination. Tasks from Andreas et al. (ICML 2017).

[Plot: Minecraft World, normalized discounted reward vs. number of training steps (0 to 1·10⁶), comparing Q-learning, HRL, HRL-RM, and QRM.]

McIlraith MSR RL Day 2019


Page 145: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Function Approximation with QRM

From tabular QRM to Deep QRM:
• Replace Q-learning with Double DQN (DDQN) and prioritized experience replay (a rough sketch follows)
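The talk does not fix a framework, so here is a rough PyTorch sketch of the idea under the same toy RM encoding as before (network sizes and names are my assumptions, and the prioritized replay buffer is elided): one online/target network pair per RM state, with Double-DQN targets in which the RM supplies the next RM state and reward.

import torch
import torch.nn as nn

rm = {0: {"A": (1, 0.0)}, 1: {"B": (2, 0.0)},
      2: {"C": (3, 0.0)}, 3: {"D": (0, 1.0)}}   # toy RM, as above

def make_net(obs_dim=52, n_actions=4):          # hypothetical sizes
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

online = {u: make_net() for u in rm}            # one DQN per RM state
target = {u: make_net() for u in rm}

def dqrm_loss(u, s, a, s2, event, gamma=0.9):
    """Double-DQN loss for q_u; the RM provides (u2, r) for the batch."""
    u2, r = rm[u].get(event, (u, 0.0))
    with torch.no_grad():
        a2 = online[u2](s2).argmax(dim=1, keepdim=True)          # select with online net
        y = r + gamma * target[u2](s2).gather(1, a2).squeeze(1)  # evaluate with target net
    q_sa = online[u](s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_sa, y)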

Method       Optimality?   Decomposition?
Q-learning   Yes           No
HRL          No            Yes
HRL-RM       No            Yes
QRM          Yes           Yes
QRM + RS     Yes           Yes

McIlraith MSR RL Day 2019

Page 146: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Water World Experiments

10 tasks over 10 random maps, 3 independent trials per combination

[Plot: Water World, normalized discounted reward vs. number of training steps (0 to 2·10⁶), comparing DDQN, DHRL, DHRL-RM, and DQRM.]

McIlraith MSR RL Day 2019


Page 148: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

QRM + Reward Shaping (QRM + RS)

[Plots: normalized reward vs. training steps for Water World (0 to 2·10⁶), Minecraft World (0 to 6·10⁵), and Office World (0 to 20,000), comparing Q-learning, QRM, and QRM + RS.]

Discount factor γ of 0.9 and exploration constant ε of 0.1

McIlraith MSR RL Day 2019

Page 149: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

The Rest of the Talk

• Reward Machines (RM)

• Exploiting RM Structure in Learning

• Experiments

▶ Creating Reward Machines

• Recap

McIlraith MSR RL Day 2019

Page 150: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

CREATING REWARD MACHINES

McIlraith MSR RL Day 2019

Page 151: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Creating Reward Machines

Where do Reward Machines come from?

1. Specify RM
  ⎯ Directly
  ⎯ Via automatic translation from specifications in various languages

2. Generate RM from high-level goal specifications

3. Learn RM

McIlraith MSR RL Day 2019

Page 152: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

1. Reward Specification: one size does not fit all

You do not need to specify Reward Machines directly. Reward Machines are a form of Mealy machine. Specify reward-worthy behavior in any formal language that is translatable to finite-state automata.

[Figure: the Chomsky hierarchy: finite-state automata ⊂ push-down automata ⊂ linear-bounded automata ⊂ Turing machines (Noam Chomsky pictured).]
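As a toy illustration of such a translation (the function name and dict encoding are mine, not the authors' tool chain): the task "observe A, then B, then C, then D" is a regular language, so its chain-shaped DFA doubles as a reward machine once the accepting transition is given reward 1.

def sequence_to_rm(events, reward=1.0):
    """Build the chain DFA for 'see these events in order' and attach
    rewards to its transitions; the result is a reward machine."""
    rm = {}
    for u, e in enumerate(events):
        last = (u == len(events) - 1)
        rm[u] = {e: (0 if last else u + 1, reward if last else 0.0)}
    return rm

print(sequence_to_rm(["A", "B", "C", "D"]))
# {0: {'A': (1, 0.0)}, 1: {'B': (2, 0.0)}, 2: {'C': (3, 0.0)}, 3: {'D': (0, 1.0)}}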

McIlraith MSR RL Day 2019


Page 154: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

1. Construct Reward Machine from Formal Languages

[Diagram: specifications in LTL dialects (LTLf, PLTL, …), regular expressions, Golog, LDL dialects (LDLf), and LTL-RE are translated to a DFA, then to an RM; the RM in turn feeds QRM, reward shaping, and future RM-based algorithms.]

Reward Machines serve as a lingua franca and provide a normal-form representation of the reward function that supports reward-function-tailored learning.

Remember this!

[Camacho, Toro Icarte, Klassen, Valenzano, McIlraith, IJCAI 2019] McIlraith MSR RL Day 2019

Page 155: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

2. Generate RM using a Symbolic Planner

• Employ an explicit high-level model to describe abstract actions (options)
• Employ symbolic planning to generate RMs corresponding to high-level partial-order plans
• Use these abstract solutions to guide an RL agent

[Illanes, Yan, Toro Icarte, M., RLDM19]

McIlraith MSR RL Day 2019


Page 157: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

3. Learn RMs for Partially-Observable RL

Problem: Find a policy that maximizes the external reward given by a partially observable environment

Assumptions: Agent has a set of high-level binary classifiers/event detectors (e.g., button-pushed, cookies, etc.)

Key Insight: Learn an RM such that its internal state can be effectively used as external memory by the agent to solve the task.

Approach: Discrete Optimization via Tabu Search

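The "external memory" point can be made concrete with a small sketch (env, policy, and detectors are hypothetical names, and the RM reuses the toy dict encoding from the sketches above): the agent conditions its actions on the pair (observation, RM state), and advances the RM with the outputs of its event detectors.

def run_episode(env, policy, rm, detectors, u0=0):
    """Act on (observation, RM state); the learned RM is the agent's memory."""
    obs, u, done, total = env.reset(), u0, False, 0.0
    while not done:
        a = policy(obs, u)               # policy conditions on the RM state u
        obs, reward, done = env.step(a)  # external reward from the environment
        event = detectors(obs)           # high-level event, e.g. "button-pushed"
        u, _ = rm[u].get(event, (u, 0.0))  # advance the RM: u is the memory
        total += reward
    return total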

These “toy problems” cannot be solved by A3C, PPO, and ACER with LSTMs.

McIlraith MSR RL Day 2019

Page 158: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

3. Learn Reward Machines (LRM)

The learned RM gives a more human-interpretable picture of what the agent is trying to do.

[Figure: a learned reward machine with states u0–u3; edges are labeled ⟨event, reward⟩, with ⟨o/w, 0⟩ self-loops ("otherwise") and reward 1 only on the transitions that complete the task.]

[Toro Icarte, Waldie, Klassen, Valenzano, Castro, McIlraith, NeurIPS 2019] McIlraith MSR RL Day 2019

Page 159: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

3. Learn Reward Machines (LRM)

[Toro Icarte, Waldie, Klassen, Valenzano, Castro, McIlraith, NeurIPS 2019]

Good Results!

McIlraith MSR RL Day 2019

Page 160: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

RECAP

McIlraith MSR RL Day 2019

Page 161: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Can exploiting the alphabet and structure of language help RL agents learn and think?

McIlraith MSR RL Day 2019

Page 162: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

[Figure: the office grid world with labeled locations A, B, C, D, and o.]

count = 0  # global variable

def get_reward(state):
    global count  # needed for the reassignments below
    if count == 0 and state.at("A"):
        count = 1
    if count == 1 and state.at("B"):
        count = 2
    if count == 2 and state.at("C"):
        count = 3
    if count == 3 and state.at("D"):
        count = 0
        return 1
    return 0

[Diagram: the reward function sits inside the environment, mapping each state to a scalar reward.]

Key Insight: Reveal Reward Function to the Agent
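Revealed as a structure rather than a black box, the same reward function is just the four-state machine from before; a minimal sketch (my own encoding) of what the agent now gets to see:

class RewardMachine:
    """The A -> B -> C -> D task as an explicit, agent-visible structure."""
    def __init__(self):
        self.delta = {0: {"A": (1, 0.0)}, 1: {"B": (2, 0.0)},
                      2: {"C": (3, 0.0)}, 3: {"D": (0, 1.0)}}
        self.u = 0  # current RM state, visible to the agent

    def step(self, event):
        self.u, reward = self.delta[self.u].get(event, (self.u, 0.0))
        return reward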

McIlraith MSR RL Day 2019


Page 164: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Contributions

• Reward Machines (RMs): An automata-based structure that can be used to define reward functions.

• QRM: An RL algorithm that exploits an RM’s structure

[Toro Icarte, Klassen, Valenzano, McIlraith, ICML 2018]

• QRM+RS: Automated RM-based reward shaping

• Translation to RM from other languages: RMs as a normal-form representation for reward functions

[Camacho, Toro Icarte, Klassen, Valenzano, McIlraith, IJCAI 2019]

• LRM: learning RMs from experience in partially observable environments

[Toro Icarte, Waldie, Klassen, Valenzano, Castro, McIlraith, NeurIPS 2019]

McIlraith MSR RL Day 2019

Page 165: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

QRM outperforms HRL and standard Q-learning in two domains

[Plots: normalized discounted reward vs. number of training steps for Office World (0 to 50,000 steps) and Minecraft World (0 to 1·10⁶ steps), comparing Q-learning, HRL, HRL-RM, and QRM.]

Great Results in Discrete Domains

McIlraith MSR RL Day 2019

Page 166: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

… and is also effective when combined with deep learning

[Plot: Water World, normalized discounted reward vs. number of training steps (0 to 2·10⁶), comparing DDQN, DHRL, DHRL-RM, and DQRM.]

…and in Continuous Domains

McIlraith MSR RL Day 2019

Page 167: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

We can construct RMs from a diversity of formal languages …

[Diagram: specifications in LTL dialects (LTLf, PLTL, …), regular expressions, Golog, LDL dialects (LDLf), and LTL-RE are translated to a DFA, then to an RM; the RM in turn feeds QRM, reward shaping, and future RM-based algorithms.]

McIlraith MSR RL Day 2019

Page 168: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

…and they can be learned in partially observable environments to solve hard problems

[Figure: a learned reward machine with states u0–u3; edges are labeled ⟨event, reward⟩, with ⟨o/w, 0⟩ self-loops ("otherwise") and reward 1 only on the transitions that complete the task.]

McIlraith MSR RL Day 2019

Page 169: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning
Toro Icarte, Klassen, Valenzano, McIlraith
ICML 2018
Code: https://bitbucket.org/RToroIcarte/qrm

Teaching Multiple Tasks to an RL Agent using LTL
Toro Icarte, Klassen, Valenzano, McIlraith
AAMAS 2018 & NeurIPS 2018 Workshop (Learning by Instructions)
Code: https://bitbucket.org/RToroIcarte/lpopl

LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning
Camacho, Toro Icarte, Klassen, Valenzano, McIlraith
IJCAI 2019

Learning Reward Machines for Partially Observable Reinforcement Learning
Toro Icarte, Waldie, Klassen, Valenzano, Castro, McIlraith
NeurIPS 2019

Play with the code, read the papers, …

McIlraith MSR RL Day 2019

Page 170: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Other related work

Advice-Based Exploration in Model-Based Reinforcement Learning
Toro Icarte, Klassen, Valenzano, McIlraith
Canadian AI 2018
Linear temporal logic (LTL) formulas and a heuristic were used to guide exploration during reinforcement learning.

Non-Markovian Rewards Expressed in LTL: Guiding Search Via Reward Shaping (Extended Version)
Camacho, Chen, Sanner, McIlraith
Extended Abstract: SoCS 2017, RLDM 2017. Full Paper: First Workshop on Goal Specifications for Reinforcement Learning, collocated with ICML/IJCAI/AAMAS, 2018.
Linear temporal logic (LTL) formulas are used to express non-Markovian reward in fully specified MDPs. LTL is translated to automata and reward shaping is used over the automata to help solve the MDP.

McIlraith MSR RL Day 2019

Page 171: Structuring reward function specifications and reducing ... · Structuring reward function specifications and reducing sample complexity in reinforcement learning Sheila A. McIlraith

Acknowledgements

Toryn Klassen Richard ValenzanoRodrigo Toro Icarte

Alberto Camacho Ethan Waldie Margarita Castro McIlraith MSR RL Day 2019