
Reinforcement Learning

Abstraction and Hierarchy in RL

Subramanian Ramamoorthy, School of Informatics

16 March, 2012


On the Organization of Behaviour

Consider day-to-day tasks such as opening a bottle or picking up your coat from the stand

– How would you tell someone how to perform this?

– How do you actually get the job done? Using what?


An old idea: sequential behavior cannot be understood as a chain of stimulus–response associations… behavior displays hierarchical structure, comprising nested subroutines.

- Karl Lashley (1951)


On the Complexity of RL


• RL agents must learn by exploring the environment, trying out different actions in different states

• Complexity grows quickly with the size of the problem; how do we get around this?

– Balance exploration/exploitation

– Abstractions


Instrumental Hierarchy


[Botvinick '08]


In Computational Terms, Why Hierarchy?

• Knowledge transfer/injection

• Biases exploration

• Faster solutions (even if model known)


A Natural Concept: Reward Shaping

• The robots’ objective is to collectively find pucks and bring them home.

• Represent the 12-dim environment by state variables:

– have-puck?

– at-home?

– near-intruder?

• What should the immediate reward function be?


[Mataric 1994]


Reward Shaping

• If a reward is given only when a robot drops a puck at home, learning will be extremely difficult.

– The delay between the action and the reward is large.

• Solution: Reward shaping (intermediate rewards).

– Add rewards/penalties for achieving subgoals/errors:

• subgoal: grasped-puck

• subgoal: dropped-puck-at-home

• error: dropped-puck-away-from-home

• Add progress estimators:

– Intruder-avoiding progress function

– Homing progress function

• Adding intermediate rewards will potentially allow RL to handle more complex problems.
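As a concrete illustration, a shaped reward along these lines might look like the sketch below. The event names follow the slide, but the magnitudes and the progress-estimator weighting are made up for illustration; they are not Mataric's actual values.

```python
def shaped_reward(event, progress_delta):
    """Illustrative shaped reward for the puck-collection task.

    `event` is a discrete subgoal/error signal; `progress_delta` is the
    change in a progress estimator (e.g., reduction in distance-to-home)
    since the last step. All magnitudes here are invented.
    """
    subgoal_rewards = {
        "grasped-puck": 1.0,                  # subgoal
        "dropped-puck-at-home": 10.0,         # subgoal (main goal)
        "dropped-puck-away-from-home": -3.0,  # error
    }
    r = subgoal_rewards.get(event, 0.0)
    # Dense progress-estimator term, scaled down so it cannot
    # dominate the subgoal rewards.
    r += 0.1 * progress_delta
    return r
```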


Reward Shaping Results

Percentage of policy learnt after 15 minutes (results shown as a bar chart in the original slides).


Why Not Hierarchy?

• Some cool ideas and algorithms, but

• No killer apps or wide acceptance, yet.

• Good idea that needs more refinement:

– More user friendliness

– More rigor in:

• Problem specification

• Measures of progress

– Improvement = Flat – (Hierarchical + Hierarchy)

– What units?


Early Idea: Hierarchical Algorithms - Gating Mechanisms

Hierarchical Learning

• Learning the gating function

• Learning the individual behaviours

• Learning both*

*Can be a multi-level hierarchy.

(Diagram: g is a gate; bi is a behaviour.)
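A minimal sketch of the gating idea, assuming a discrete gate that picks one behaviour per step; the class, the trivial behaviours, and the random gate are all illustrative, not a specific published algorithm.

```python
import random

class GatedPolicy:
    """Gate g selects which behaviour b_i controls the agent each step.

    Both the gate and the behaviours could themselves be learned
    (e.g., each as a Q-learner); here they are plain callables.
    """
    def __init__(self, gate, behaviours):
        self.gate = gate              # state -> index of a behaviour
        self.behaviours = behaviours  # list of state -> action policies

    def act(self, state):
        i = self.gate(state)          # gating function picks a behaviour
        return self.behaviours[i](state)

# Illustrative usage: a random gate over two trivial behaviours.
policy = GatedPolicy(
    gate=lambda s: random.randrange(2),
    behaviours=[lambda s: "left", lambda s: "right"],
)
print(policy.act(state=0))
```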


Temporal Abstraction

• What’s the issue?

– Want “macro” actions (multiple time steps)

– Advantages:

• Avoid dealing with (exploring/computing values for) less desirable states

• Reuse experience across problems/regions

• What’s not obvious

– Dealing with the Markov assumption

– Getting the math right (e.g., stability and convergence)


State Transitions → Macro Transitions

• F plays the role of generalized transition function

• More general:

– Need not be a probability

– Coefficient for state value in terms of another

– May be:

• P (special case)

• Arbitrary SMDP (discount varies w/state, etc.)

• Discounted probability of following a policy/running program

$$T:\quad V_{i+1}(s) := \max_a \sum_{s'} F(s' \mid s, a)\,\big[\,R(s,a,s') + V_i(s')\,\big]$$


So What?

• Modified Bellman operator:

$$T:\quad V_{i+1}(s) := \max_a \sum_{s'} F(s' \mid s, a)\,\big[\,R(s,a,s') + V_i(s')\,\big]$$

• T is also a contraction in max norm

• Free goodies!

– Optimality (Hierarchical Optimality)

– Convergence & stability
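A small sketch of value iteration under such a generalized F, on a made-up three-state problem. Here F is the "P (special case)" above, a discounted transition probability, and all numbers are invented for illustration.

```python
import numpy as np

# Tiny synthetic problem: 3 states, 2 actions (all numbers illustrative).
# F[a, s, t] is the generalized coefficient on V(t); here it is the
# special case gamma * P(t | s, a), i.e., a discounted probability.
gamma = 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
F = gamma * P
R = np.random.default_rng(0).uniform(0, 1, size=P.shape)  # R[a, s, t]

V = np.zeros(3)
for _ in range(500):
    # (T V)(s) = max_a sum_t F(t | s, a) * [ R(s, a, t) + V(t) ]
    Q = np.einsum("ast,ast->sa", F, R + V)   # V broadcasts over t
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:    # contraction => convergence
        break
    V = V_new
print(V)
```

Because T is a contraction in max norm, the loop converges regardless of the initial V.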


Using Temporal Abstraction

• Accelerate convergence (usually)

• Avoid uninteresting states

– Improve exploration in RL

– Avoid computing all values for MDPs

• Can finesse partial observability (a little)

• Simplify state space with “funnel” states


Idea of Funneling

• Proposed by Forestier & Varaiya 78

• Define “supervisor” MDP over boundary states

• Selects policies at boundaries to:

– Push the system back into nominal states

– Keep it there

(Diagram: a nominal region surrounded by boundary states; the control-theory version of a maze world!)


Option: move until end of hallway

Start: any state in the hallway.

Execute: the policy shown in the figure.

Terminate: when s is the end of the hallway.


Options [Sutton, Precup, Singh '99]

• An option is a behaviour defined in terms of:

o = ⟨ Io, πo, βo ⟩

• Io: set of states in which o can be initiated.

• πo(s): policy (S → A*) followed while o is executing.

• βo(s): probability that o terminates in s.

*Can be a policy over lower level options.
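A minimal sketch of this triple in code. The field names and the env.step(state, action) hook are assumptions for illustration; they are not from the paper.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """o = (I_o, pi_o, beta_o); states and actions stay abstract here."""
    initiation: Set[Any]                 # I_o: states where o may start
    policy: Callable[[Any], Any]         # pi_o: state -> action (or option)
    termination: Callable[[Any], float]  # beta_o(s): P(o terminates in s)

def run_option(env, state, o, gamma=0.9):
    """Execute o to termination; return (discounted reward, s', duration).

    Assumes env.step(state, action) -> (next_state, reward).
    """
    assert state in o.initiation
    total, discount, k = 0.0, 1.0, 0
    while True:
        state, r = env.step(state, o.policy(state))
        total += discount * r
        discount *= gamma
        k += 1
        if random.random() < o.termination(state):
            return total, state, k
```

The returned triple (discounted reward, landing state, duration) is exactly what SMDP-style backups over options consume.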


Rooms Example


Options Define a Semi-MDP


MDP + Options = SMDP


Why does SMDP Connection Help?


Value Functions for Options
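For a policy over options μ, the option-value function takes the standard form from the cited Sutton, Precup & Singh paper:

$$Q^{\mu}(s,o) = E\Big[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k}\, V^{\mu}(s_{t+k}) \;\Big|\; o \text{ initiated in } s_t = s \,\Big]$$

where k is the (random) number of steps o runs before terminating.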


Motivations for Options Framework

• Add temporally extended activities to the choices available to the RL agent, without precluding planning and learning at the finer-grained MDP level

• Optimal policies over primitives are not compromised due to addition of options

• However, if an option is useful, learning will quickly find this out – prevent prolonged and useless ‘flailing about’

PS: If all options are 1-step, you recover the core MDP


Rooms Example – One Room


Colour = option-specific value


Rooms Example


Hierarchical Abstract Machines (HAM)

• Policies of a core MDP (M) are defined as programs which execute based on their own state and the current state of the core MDP

• HAM policy ≈ collection of FSMs {Hi}

• Four types of states of Hi: action, call, choice, stop

– Action: generate an action based on the state of M and the currently executing Hi

– Call: switch machines, setting the callee's local state using a stochastic function

– Choice: non-deterministically pick an Hi

– Stop: return control to the calling machine


State Transition Structure in HAM


HAM H: Initial machine + closure of all machine states in all machines reachable from possible initial states of the initial machine


HAM + MDP = SMDP

• Composition of HAM and core MDP defines an SMDP

• Only actions of the composite machine = choice points

– These actions only affect HAM part

• As an SMDP, things run autonomously between choice points

– What are the rewards?

• Moreover, the optimal policy of the composite machine depends only on the choice points!

• There is a reduced SMDP, reduce(H ∘ M), whose states are just the choice points

• We can apply SMDP Q-learning to keep updates manageable
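Concretely, the SMDP Q-learning update at choice points (cf. Parr & Russell's HAMQ-style learning) can be written as follows, where [s, m] is a (core-state, machine-state) choice point, r_c is the discounted reward accumulated since the previous choice point, and τ the elapsed time:

$$Q([s,m],a) \leftarrow Q([s,m],a) + \alpha\Big[\, r_c + \gamma^{\tau} \max_{a'} Q([s',m'],a') - Q([s,m],a) \,\Big]$$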


Why HAM?


Goal/State Abstraction

• Why are these together?

– Abstract goals typically imply abstract states

• Makes sense for classical planning

– Classical planning uses state sets

– Implicit in use of state variables

– What about factored MDPs?

• Does this make sense for RL?

– No goals

– Markov property issues


Feudal RL [Dayan & Hinton '95]

• Lords dictate subgoals to serfs

• Subgoals = reward functions?

• Demonstrated on a navigation task

• Markov property problem

– Stability?

– Optimality?


Feudal RL


MAXQ [Dietterich '98]

• Included temporal abstraction

• Introduced “safe” abstraction

• Handled subgoals/tasks elegantly

– Subtasks w/repeated structure can appear in multiple copies throughout the state space

– Subtasks can be isolated w/o violating Markov

– Separated subtask reward from completion reward

• Example taxi/logistics domain

– Subtasks move between locations

– High-level tasks pick up/drop off assets


MAXQ – Key Notions

• Hierarchy:

– Root level: the overall task of solving the MDP

– Child nodes: tasks necessary for completing the parent node

– Two types of nodes: Max-Nodes, Q-Nodes

• Max-Nodes

– With no child Q-Nodes: primitive tasks

– With child Q-Nodes: subtasks

– Each Max-Node: a separate MDP

– Context independent:

• Learns a policy independent of context

• Value function shared by all parent Q-Nodes


MAXQ – Key Notions

• Q-Nodes

– Actions required for completing the parent task

– 1 parent Max-Node, 1 child Max-Node

– Context dependent:

• Learn the value of completing its subtask and then completing its parent task

• Value containers


Usage of Hierarchical Policy

A “call graph":

– “Push a Max-node unto the stack"

– Max-node executes until it enters a terminal state

– Each Max-node converges to locally optimal policy


A MAXQ Decomposition


Task Hierarchy: MAXQ Decomposition

Root

– Fetch: Navigate(loc), Take

– Deliver: Navigate(loc), Give

– Take: Extend-arm, Grab

– Give: Extend-arm, Release

– Navigate(loc): Move-east, Move-west, Move-south, Move-north

Children of a task are unordered.


MAXQ Decomposition

• Augment the state s by adding the subtask i: [s, i].

• Define C([s,i], j) as the reward received in i after subtask j finishes.

• Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))

– V([s,Navigate(prr)]): reward received while navigating

– C([s,Fetch], Navigate(prr)): reward received after navigation

• Express V in terms of C

• Learn C, instead of directly learning Q

→ Recursive optimality
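A sketch of the resulting recursion, with plain dictionaries standing in for the learned V and C tables; the lookup structures and the toy hierarchy are illustrative, not Dietterich's implementation.

```python
def V(node, s, children, C, primitive_reward):
    """MAXQ value recursion: V(i, s), with Q(i, s, j) = V(j, s) + C(i, s, j).

    `children`, `C`, and `primitive_reward` are plain lookup tables
    (illustrative stand-ins for the learned quantities).
    """
    if node not in children:                    # primitive action
        return primitive_reward[(node, s)]
    return max(                                 # composite subtask
        V(j, s, children, C, primitive_reward) + C[(node, s, j)]
        for j in children[node]
    )

# Illustrative usage on a two-level toy hierarchy.
children = {"Root": ["go", "stay"]}
C = {("Root", 0, "go"): 0.5, ("Root", 0, "stay"): 0.0}
primitive_reward = {("go", 0): 1.0, ("stay", 0): 0.0}
print(V("Root", 0, children, C, primitive_reward))   # prints 1.5
```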


Dividing Up Value


A-LISP (Andre & Russell 02)

• Combined and extended ideas from:

– HAMs

– MAXQ

– Function approximation

• Allowed partially specified LISP programs

• Very powerful when the stars aligned

– Halting

– “Safe” abstraction

– Function approximation


Why Isn’t Everybody Doing It?

• Totally “safe” state abstraction is:

– Rare

– Hard to guarantee w/o domain knowledge

• “Safe” function approximation hard too

• Developing hierarchies is hard (like threading a needle in some cases)

• Bad choices can make things worse

• Mistakes not always obvious at first


More Advanced: Concurrency

• Define the notion of a multi-option: only a subset of the state variables is influenced by each option

• Maintain coherence:

• Can do Q-learning:


Frontier Problem: Automatic Hierarchy Discovery

• Hard in contexts such as classical planning

• Within a single problem:

– Battle is lost if all states considered (polynomial speedup at best)

– If fewer states considered, when to stop?

• Across problems

– Considering all states OK for few problems?

– Generalize to other problems in class

• How to measure progress?


Promising Ideas

• Idea: Bottlenecks are interesting…maybe

• Exploit

– Connectivity (Andre 98, McGovern 01)

– Ease of changing state variables (Hengst 02)

• Issues

– Noise

– Less work than learning a model?

– Relationship between hierarchy and model?


Frontier Problem: Representation

• Model, hierarchy, value function should all be integrated in some meaningful way

• “Safe” state abstraction is a kind of factorization

• Need approximately safe state abstraction

• Factored models w/approximation?


A common theme…


Semi-Markov Decision Processes

• A generalization of MDPs:

The amount of time between one decision and the next is a random variable (either real or integer valued)

• Treat the system as remaining in each state for a random waiting time

– after which, transition to next state is instantaneous

• So, an SMDP is defined in terms of:

P(s', τ | s, a): transition probability or dynamics (next state s' after waiting time τ)

R(s,a): reward; the amount expected to accumulate during the waiting time in a particular state, under a particular action


Semi-Markov Decision Processes

Bellman equations:

Could modify Q-learning accordingly.
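In standard form (e.g., as in Bradtke & Duff), the SMDP Bellman equation and the corresponding Q-learning update are:

$$V^*(s) = \max_a \Big[\, R(s,a) + \sum_{s'} \sum_{\tau} \gamma^{\tau}\, P(s',\tau \mid s,a)\, V^*(s') \,\Big]$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[\, r + \gamma^{\tau} \max_{a'} Q(s',a') - Q(s,a) \,\Big]$$

where τ is the (random) waiting time between one decision and the next.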


Continuous version of SMDP

• e.g., Bradtke & Duff, NIPS 1994

– Countable state set, finite action set

• Dynamics:

• Define value in terms of minimum of an infinite-horizon cost:
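A standard way to write this (following Bradtke & Duff, with β > 0 the continuous-time discount rate and c the instantaneous cost) is:

$$V^*(s) = \min_{\pi}\; E\left[\, \int_0^{\infty} e^{-\beta t}\, c\big(s_t, \pi(s_t)\big)\, dt \;\middle|\; s_0 = s \right]$$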


Continuous SMDP

• For a fixed policy π, the value and reward are:

• We don’t know exactly when next state transition occurs, so expected discount factor:
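Since the transition time τ is random, the effective discount applied between decisions is its expectation; assuming a transition-time distribution F(τ | s, a), this is:

$$E\big[\, e^{-\beta\tau} \mid s,a \,\big] = \int_0^{\infty} e^{-\beta\tau}\, dF(\tau \mid s,a)$$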


Optimal Value for Continuous SMDP


To better understand this, consider an example: passengers arriving at an office elevator.


Learning for Continuous SMDP


Elevator Dispatching

Crites and Barto 1996


Semi-Markov Q-Learning

This is a continuous-time problem, but decisions happen in discrete jumps (we'll write the expressions in slightly simpler form than in the earlier slides). The discrete-time return

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}$$

becomes

$$R_t = \int_{0}^{\infty} e^{-\beta\tau}\, r_{t+\tau}\, d\tau$$

Suppose the system takes action a from state s at time $t_1$, and the next decision is needed at time $t_2$ in state $s'$:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[\, \int_{t_1}^{t_2} e^{-\beta(\tau - t_1)}\, r_\tau\, d\tau \;+\; e^{-\beta(t_2 - t_1)} \max_{a'} Q(s',a') \;-\; Q(s,a) \,\right]$$


Passenger Arrival Patterns

Up-peak and Down-peak traffic

• Not equivalent: down-peak handling capacity is much greater than up-peak handling capacity, so up-peak capacity is the limiting factor.

• Up-peak is easiest to analyse: once everyone is on board at the lobby, the rest of the trip is determined. The only decision is when to open and close the doors at the lobby. The optimal policy for the pure case is: close the doors when a threshold number of passengers are on board; the threshold depends on traffic intensity.

• More policies to consider for two-way and down-peak traffic.

• We focus on down-peak traffic pattern.


Control Strategies

• Zoning: divide building into zones; park in zone when idle. Robust in heavy traffic.

• Search-based methods: greedy or non-greedy. Receding Horizon control.

• Rule-based methods: expert systems/fuzzy logic; from human “experts”

• Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)

• Adaptive/Learning methods: NNs for prediction, parameter space search using simulation, DP on simplified model, non-sequential RL


The Elevator Model (Lewis, 1991)

Parameters:

• Floor Time (time to move one floor at max speed): 1.45 secs.

• Stop Time (time to decelerate, open and close doors, and accelerate again): 7.19 secs.

• Turn Time (time needed by a stopped car to change directions): 1 sec.

• Load Time (the time for one passenger to enter or exit a car): a random variable with range from 0.6 to 6.0 secs, mean of 1 sec.

• Car Capacity: 20 passengers

Discrete Event System: continuous time, asynchronous elevator operation

Traffic Profile:

• Poisson arrivals with rates changing every 5 minutes; down-peak


State Space

• 18 hall call buttons: 2^18 combinations

• Positions and directions of cars: 18^4 (rounding to nearest floor)

• Motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6

• 40 car buttons: 2^40

• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.

• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states.


Actions

• When moving (halfway between floors):

– stop at next floor

– continue past next floor

• When stopped at a floor:

– go up

– go down

• Asynchronous


Constraints

• A car cannot pass a floor if a passenger wants to get off there (standard)

• A car cannot change direction until it has serviced all onboard passengers traveling in the current direction (standard)

• Don't stop at a floor if another car is already stopping, or is stopped, there (special)

• Don't stop at a floor unless someone wants to get off there (heuristic)

• Given a choice, always move up (heuristic)

With these constraints, the only real decisions are Stop and Continue.


Performance Criteria

Minimize:

• Average wait time

• Average system time (wait + travel time)

• % waiting > T seconds (e.g., T = 60)

• Average squared wait time (to encourage fast and fair service)


Average Squared Wait Time

Instantaneous cost: the squared wait times of all currently waiting passengers,

$$r_\tau = \sum_{p} \big(\mathrm{wait}_p(\tau)\big)^2$$

Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

$$\sum_{t=0}^{\infty} \gamma^{t}\, r_t \quad\text{becomes}\quad \int_{0}^{\infty} e^{-\beta\tau}\, r_\tau\, d\tau$$


Algorithm

Repeat forever:

1. In state x at time t_x, car c must decide to STOP or CONTINUE

2. It selects an action using a Boltzmann distribution (with decreasing temperature) based on current Q values

3. The next decision by car c is required in state y at time t_y

4. It implements the gradient-descent version of the following backup using backprop:

$$Q(x,a) \leftarrow Q(x,a) + \alpha\left[\, \int_{t_x}^{t_y} e^{-\beta(\tau - t_x)}\, r_\tau\, d\tau \;+\; e^{-\beta(t_y - t_x)} \max_{a'} Q(y,a') \;-\; Q(x,a) \,\right]$$

5. x ← y, t_x ← t_y
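A tabular sketch of this loop; the real system used a neural network trained by backprop, and the env.reset/env.step hooks here are assumptions that stand in for the elevator simulator.

```python
import math
import random
from collections import defaultdict

def smdp_q_learning(env, steps, beta=0.01, alpha=0.1, temp=1.0, decay=0.999):
    """Tabular sketch of the continuous-time Q-learning loop above.

    Assumed hooks: env.reset() -> (state, time); env.step(action) ->
    (integrated_discounted_reward, next_state, next_time), where the
    reward term is the integral of e^{-beta*(tau - t_x)} * r_tau over
    the interval between decisions.
    """
    Q = defaultdict(float)
    actions = ("STOP", "CONTINUE")
    x, tx = env.reset()
    for _ in range(steps):
        # Boltzmann action selection with decreasing temperature.
        prefs = [math.exp(Q[(x, a)] / temp) for a in actions]
        a = random.choices(actions, weights=prefs)[0]
        temp = max(temp * decay, 1e-3)
        r_int, y, ty = env.step(a)
        # Continuous-time backup: discount the bootstrap by e^{-beta*(ty-tx)}.
        target = r_int + math.exp(-beta * (ty - tx)) * max(
            Q[(y, b)] for b in actions)
        Q[(x, a)] += alpha * (target - Q[(x, a)])
        x, tx = y, ty
    return Q
```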


Computing Rewards

Must calculate the reward integral accumulated between decisions,

$$\int_{t_x}^{t_y} e^{-\beta(\tau - t_x)}\, r_\tau\, d\tau$$

• “Omniscient Rewards”: the simulator knows how long each passenger has been waiting.

• “On-Line Rewards”: Assumes only arrival time of first passenger in each queue is known (elapsed hall button time); estimate arrival times


Neural Networks

Inputs:

• 9 binary: state of each hall down button

• 9 real: elapsed time of each hall down button, if pushed

• 16 binary: one on at a time: position and direction of the car making the decision

• 10 real: locations/directions of other cars ("footprint")

• 1 binary: at highest floor with a waiting passenger?

• 1 binary: at floor with longest-waiting passenger?

• 1 bias unit

Architecture: 47 inputs, 20 sigmoid hidden units, 1 or 2 output units.


Elevator Results
