Reinforcement Learning
Abstraction and Hierarchy in RL
Subramanian Ramamoorthy School of Informatics
16 March, 2012
On the Organization of Behaviour
Consider day to day tasks such as opening a bottle or picking up your coat from the stand
– How would you tell someone how to perform this?
– How do you actually get the job done? Using what?
16/03/2012 Reinforcement Learning
An old idea: sequential behavior cannot be understood as a chain of stimulus–response associations… behavior displays hierarchical structure, comprising nested subroutines.
- Karl Lashley (1951)
On the Complexity of RL
• RL agents must learn by exploring the environment, trying out different actions in different states
• Complexity grows quickly; how to get around it?
– Balance exploration/exploitation
– Abstractions
Instrumental Hierarchy
[Botvinick '08]
In Computational Terms, Why Hierarchy?
• Knowledge transfer/injection
• Biases exploration
• Faster solutions (even if model known)
A Natural Concept: Reward Shaping
• The robots’ objective is to collectively find pucks and bring them home.
• Represent the 12-dim environment by state variables: – have-puck?
– at-home?
– near-intruder?
• What should the immediate reward function be?
[Mataric 1994]
Reward Shaping
• If a reward is given only when a robot drops a puck at home, learning will be extremely difficult.
– The delay between the action and the reward is large.
• Solution: reward shaping (intermediate rewards).
– Add rewards/penalties for achieving sub-goals/errors:
• subgoal: grasped-puck
• subgoal: dropped-puck-at-home
• error: dropped-puck-away-from-home
• Add progress estimators:
– Intruder-avoiding progress function
– Homing progress function
• Adding intermediate rewards will potentially allow RL to handle more complex problems.
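As a sketch, a shaped reward for the puck-collection task might look like the following; the predicate names and reward magnitudes are illustrative assumptions, not Mataric's actual values:

```python
# Hypothetical sketch of a shaped reward for the puck-collection task.
# Predicate names (grasped_puck, at_home, ...) and magnitudes are
# illustrative, not the values used in Mataric 1994.

def shaped_reward(state, prev_state):
    """Immediate reward = sparse task reward + subgoal bonuses + progress terms."""
    r = 0.0
    # Sparse task reward: puck dropped at home.
    if state["dropped_puck"] and state["at_home"]:
        r += 10.0
    # Subgoal bonus: grasped-puck.
    if state["grasped_puck"] and not prev_state["grasped_puck"]:
        r += 1.0
    # Error penalty: dropped-puck-away-from-home.
    if state["dropped_puck"] and not state["at_home"]:
        r -= 1.0
    # Homing progress estimator: reward movement toward home while carrying.
    if state["grasped_puck"]:
        r += 0.1 * (prev_state["home_dist"] - state["home_dist"])
    # Intruder-avoiding progress estimator.
    if state["near_intruder"]:
        r += 0.1 * (state["intruder_dist"] - prev_state["intruder_dist"])
    return r
```

The progress terms turn long delayed-reward gaps into a dense signal, which is exactly what makes the shaped version learnable in minutes.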
Reward Shaping Results
Percentage of policy learnt after 15 minutes:
Why Not Hierarchy?
• Some cool ideas and algorithms, but
• No killer apps or wide acceptance, yet.
• A good idea that needs more refinement:
– More user friendliness
– More rigour in:
• Problem specification
• Measures of progress
– Improvement = Flat - (Hierarchical + cost of building the Hierarchy)
– In what units?
Early Idea: Hierarchical Algorithms - Gating Mechanisms
Hierarchical Learning:
• Learning the gating function
• Learning the individual behaviours
• Learning both
g is a gate*; bi is a behaviour
*Can be a multi-level hierarchy.
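The gated architecture can be sketched as follows; here only the gate is learned over fixed behaviours (the first of the three learning settings above), and all names are illustrative:

```python
# Minimal sketch of the gating architecture: a gate g picks which
# behaviour b_i controls the agent in the current state. The gate is a
# learned table over (state, behaviour-index); the behaviours are fixed
# policies. Names and update rule are illustrative.
import random
from collections import defaultdict

class GatedPolicy:
    def __init__(self, behaviours, epsilon=0.1):
        self.behaviours = behaviours      # list of policies: state -> action
        self.q = defaultdict(float)       # gate values Q[(state, i)]
        self.epsilon = epsilon

    def select_behaviour(self, state):
        # Epsilon-greedy choice over behaviour indices.
        if random.random() < self.epsilon:
            return random.randrange(len(self.behaviours))
        return max(range(len(self.behaviours)),
                   key=lambda i: self.q[(state, i)])

    def act(self, state):
        i = self.select_behaviour(state)
        return i, self.behaviours[i](state)

    def update_gate(self, state, i, reward, alpha=0.5):
        # Learning the gating function only (behaviours stay fixed).
        self.q[(state, i)] += alpha * (reward - self.q[(state, i)])
```

Learning the behaviours themselves, or both levels at once, would add analogous updates inside each b_i.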
Temporal Abstraction
• What’s the issue?
– Want “macro” actions (multiple time steps)
– Advantages:
• Avoid dealing with (exploring/computing values for) less desirable states
• Reuse experience across problems/regions
• What’s not obvious
– Dealing with the Markov assumption
– Getting the math right (e.g., stability and convergence)
State Transitions → Macro Transitions
• F plays the role of generalized transition function
• More general:
– Need not be a probability
– Coefficient for state value in terms of another
– May be: • P (special case)
• Arbitrary SMDP (discount varies w/state, etc.)
• Discounted probability of following a policy/running program
T : V_{i+1}(s) = max_a [ R(s,a) + Σ_{s'} F(s'|s,a) V_i(s') ]
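A minimal numerical sketch of value iteration with this generalized backup, assuming a toy two-state model where F already folds the discount into the transition coefficient:

```python
# Sketch of value iteration with the generalized backup
#   V_{i+1}(s) = max_a [ R(s,a) + sum_{s'} F(s'|s,a) V_i(s') ],
# where F is a *discounted* transition coefficient (it need not sum
# to 1). The two-state model below is purely illustrative.

def backup(V, states, actions, R, F):
    return {s: max(R[(s, a)] + sum(F[(s, a, s2)] * V[s2] for s2 in states)
                   for a in actions)
            for s in states}

states, actions = ["s0", "s1"], ["a"]
R = {("s0", "a"): 1.0, ("s1", "a"): 0.0}
# F includes the discount (here gamma = 0.9 times a deterministic swap).
F = {("s0", "a", "s0"): 0.0, ("s0", "a", "s1"): 0.9,
     ("s1", "a", "s0"): 0.9, ("s1", "a", "s1"): 0.0}

V = {s: 0.0 for s in states}
for _ in range(200):      # T is a max-norm contraction, so this converges
    V = backup(V, states, actions, R, F)
```

Because F's coefficients are bounded below 1, the iteration contracts just as ordinary discounted value iteration does.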
So What?
• Modified Bellman operator:
T : V_{i+1}(s) = max_a [ R(s,a) + Σ_{s'} F(s'|s,a) V_i(s') ]
• T is also a contraction in max norm
• Free goodies!
– Optimality (hierarchical optimality)
– Convergence & stability
Using Temporal Abstraction
• Accelerate convergence (usually)
• Avoid uninteresting states
– Improve exploration in RL
– Avoid computing all values for MDPs
• Can finesse partial observability (a little)
• Simplify state space with “funnel” states
Idea of Funneling
• Proposed by Forestier & Varaiya 78
• Define “supervisor” MDP over boundary states
• Selects policies at boundaries to – Push system back into nominal states
– Keep it there
(Figure: a nominal region surrounded by boundary states; a control-theory version of the maze world.)
Option: move until end of hallway
• Start: any state in the hallway.
• Execute: the policy as shown.
• Terminate: when s is the end of the hallway.
Options [Sutton, Precup & Singh '99]
• An option is a behaviour defined in terms of:
o = ⟨ I_o, π_o, β_o ⟩
• I_o : set of states in which o can be initiated.
• π_o : policy (S → A)* followed while o is executing.
• β_o(s) : probability that o terminates in s.
*Can be a policy over lower-level options.
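A minimal sketch of the option triple in code, with an illustrative one-dimensional hallway option; state and action types are assumptions:

```python
# Sketch of the options formalism o = <I_o, pi_o, beta_o>
# (Sutton, Precup & Singh '99). States are integers here purely
# for illustration.
import random

class Option:
    def __init__(self, initiation_set, policy, termination_prob):
        self.I = initiation_set        # set of states where o may start
        self.pi = policy               # state -> action (or lower-level option)
        self.beta = termination_prob   # state -> probability of terminating

    def can_start(self, s):
        return s in self.I

    def step(self, s):
        """Return the action chosen in s and whether the option terminates."""
        return self.pi(s), random.random() < self.beta(s)

# Example: "move right until the hallway state" in a one-room corridor.
hallway = 5
move_right = Option(
    initiation_set=set(range(hallway)),
    policy=lambda s: +1,                                  # always step right
    termination_prob=lambda s: 1.0 if s == hallway else 0.0,
)
```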
Rooms Example
Options Define a Semi-MDP
MDP + Options = SMDP
Why does SMDP Connection Help?
Value Functions for Options
Motivations for Options Framework
• Add temporally extended activities to the choices available to the RL agent, without precluding planning and learning at the finer-grained MDP level
• Optimal policies over primitives are not compromised by the addition of options
• However, if an option is useful, learning will quickly find this out; this prevents prolonged and useless 'flailing about'
PS: If all options are 1-step, you recover the core MDP
Rooms Example – One Room
Colour = option specific value
Rooms Example
Hierarchical Abstract Machines (HAM)
• Policies of a core MDP (M) are defined as programs which execute based on their own state and the current state of the core MDP
• HAM policy ≈ collection of FSMs {Hi}
• Four types of states in each Hi: action, call, choice, stop
– Action: generate an action based on the state of M and the currently executing Hi
– Call: switch machines to some Hi, setting its local state using a stochastic function
– Choice: non-deterministically pick an Hi
– Stop: return control to the calling machine
State Transition Structure in HAM
HAM H: Initial machine + closure of all machine states in all machines reachable from possible initial states of the initial machine
HAM + MDP = SMDP
• Composition of HAM and core MDP defines an SMDP
• Only actions of the composite machine = choice points
– These actions only affect HAM part
• As an SMDP, things run autonomously between choice points
– What are the rewards?
• Moreover, optimal policy of the composite machine only depend on the choice points!
• There is a reduce(H o M) whose states are just choice points
• We can apply SMDP Q-learning to keep updates manageable
Why HAM?
Goal/State Abstraction
• Why are these together?
– Abstract goals typically imply abstract states
• Makes sense for classical planning
– Classical planning uses state sets
– Implicit in the use of state variables
– What about factored MDPs?
• Does this make sense for RL?
– No goals
– Markov property issues
Feudal RL [Dayan & Hinton '95]
• Lords dictate subgoals to serfs
• Subgoals = reward functions?
• Demonstrated on a navigation task
• Markov property problem
– Stability?
– Optimality?
Feudal RL
MAXQ [Dietterich '98]
• Included temporal abstraction
• Introduced "safe" abstraction
• Handled subgoals/tasks elegantly
– Subtasks with repeated structure can appear in multiple copies throughout the state space
– Subtasks can be isolated without violating the Markov property
– Separated subtask reward from completion reward
• Example taxi/logistics domain
– Subtasks move between locations
– High-level tasks pick up/drop off assets
MAXQ – Key Notions
• Hierarchy:
– Root level: the overall task of solving the MDP
– Child nodes: tasks necessary for completing the parent node
– Two types of nodes: Max-nodes, Q-nodes
• Max-nodes
– With no child Q-nodes: primitive tasks
– With child Q-nodes: subtasks
– Each Max-node: a separate MDP
– Context independent:
• Learns a policy independent of context
• Value function shared by all parent Q-nodes
MAXQ – Key Notions
• Q-nodes
– Actions required for completing the parent task
– 1 parent Max-node, 1 child Max-node
– Context dependent:
• Learn the value of completing the subtask and then completing the parent task
• Act as value containers
Usage of Hierarchical Policy
A "call graph":
– Push a Max-node onto the stack
– The Max-node executes until it enters a terminal state
– Each Max-node converges to a locally optimal policy
A MAXQ Decomposition
Task Hierarchy: MAXQ Decomposition
(Task graph from the slide, top level to bottom: Root; Deliver and Fetch; Take, Give, and Navigate(loc); Extend-arm, Grab, Release; Move-e, Move-w, Move-s, Move-n. Children of a task are unordered.)
MAXQ Decomposition
• Augment the state s by adding the subtask i : [s,i].
• Define C([s,i],j) as the reward received in i after j finishes.
• Q([s,Fetch],Navigate(prr)) =
V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr))
• Express V in terms of C
• Learn C, instead of directly learning Q
(In the equation above, the V term is the reward received while navigating; the C term is the reward received after navigation.)
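The decomposition can be sketched as follows; the task names mirror the slide, but the stored numbers are illustrative toy values:

```python
# Sketch of the MAXQ value decomposition Q([s,i], j) = V([s,j]) + C([s,i], j):
# the value of doing subtask j inside task i splits into the reward earned
# *inside* j (V) plus the completion value C earned afterwards. For
# composite tasks one learns C rather than Q directly.

def maxq_q(V, C, s, i, j):
    return V[(s, j)] + C[((s, i), j)]

def maxq_v(V, C, s, i, children):
    """V of a composite task = max over child subtasks of the decomposed Q."""
    return max(maxq_q(V, C, s, i, j) for j in children[i])

# Toy numbers for Q([s, Fetch], Navigate(prr)).
V = {("s", "Navigate(prr)"): 3.0, ("s", "Take"): 1.0}
C = {(("s", "Fetch"), "Navigate(prr)"): 2.0, (("s", "Fetch"), "Take"): 0.5}
children = {"Fetch": ["Navigate(prr)", "Take"]}
```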
Dividing Up Value (recursive optimality)
A-LISP [Andre & Russell '02]
• Combined and extended ideas from:
– HAMs
– MAXQ
– Function approximation
• Allowed partially specified LISP programs
• Very powerful when the stars aligned
– Halting
– “Safe” abstraction
– Function approximation
Why Isn’t Everybody Doing It?
• Totally “safe” state abstraction is:
– Rare
– Hard to guarantee w/o domain knowledge
• “Safe” function approximation hard too
• Developing hierarchies is hard (like threading a needle in some cases)
• Bad choices can make things worse
• Mistakes not always obvious at first
More Advanced: Concurrency
• Define the notion of multi-option: only subset of state variables influenced by an option
• Maintain coherence:
• Can do Q-learning:
Frontier Problem: Automatic Hierarchy Discovery
• Hard in contexts such as classical planning
• Within a single problem:
– Battle is lost if all states considered (polynomial speedup at best)
– If fewer states considered, when to stop?
• Across problems
– Considering all states OK for few problems?
– Generalize to other problems in class
• How to measure progress?
Promising Ideas
• Idea: Bottlenecks are interesting…maybe
• Exploit
– Connectivity (Andre 98, McGovern 01)
– Ease of changing state variables (Hengst 02)
• Issues
– Noise
– Less work than learning a model?
– Relationship between hierarchy and model?
Frontier Problem: Representation
• Model, hierarchy, value function should all be integrated in some meaningful way
• “Safe” state abstraction is a kind of factorization
• Need approximately safe state abstraction
• Factored models w/approximation?
A common theme…
Semi-Markov Decision Processes
• A generalization of MDPs: the amount of time between one decision and the next is a random variable (real- or integer-valued)
• Treat the system as remaining in each state for a random waiting time
– after which, the transition to the next state is instantaneous
• So an SMDP is defined in terms of:
– P(s', τ | s, a): transition probability/dynamics over the next state and the waiting time τ
– R(s,a): reward, the amount expected to accumulate during the waiting time in a particular state under a particular action
Semi-Markov Decision Processes
Bellman equations:
One could modify Q-learning accordingly.
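In the standard discrete-time form, these read roughly as follows (a sketch of the usual SMDP formulas; the slide's own notation may differ):

```latex
% Bellman optimality equation for a discrete-time SMDP, where the
% waiting time \tau between decisions is a random variable:
V^*(s) = \max_a \Big[ R(s,a)
       + \sum_{s'} \sum_{\tau} \gamma^{\tau}\, P(s',\tau \mid s,a)\, V^*(s') \Big]

% Corresponding Q-learning update after observing a transition
% (s, a, r, s') with waiting time \tau:
Q(s,a) \leftarrow Q(s,a)
       + \alpha \Big[ r + \gamma^{\tau} \max_{a'} Q(s',a') - Q(s,a) \Big]
```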
Continuous version of SMDP
• e.g., Bradtke & Duff, NIPS 1994
– Countable state set, finite action set
• Dynamics:
• Define value in terms of minimum of an infinite-horizon cost:
Continuous SMDP
• For a fixed policy π, value and reward are:
• We don't know exactly when the next state transition occurs, so use the expected discount factor:
Optimal Value for Continuous SMDP
To better understand this, consider an example: passengers arriving at an office elevator.
Learning for Continuous SMDP
Elevator Dispatching
Crites and Barto 1996
Semi-Markov Q-Learning
(Continuous-time problem, but decisions occur in discrete jumps; expressions written in slightly simpler form than on the earlier slides.)
The return
R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
becomes
R_t = ∫_0^∞ e^{-βτ} r_{t+τ} dτ
Suppose the system takes action a from state s at time t1, and the next decision is needed at time t2 in state s':
Q(s,a) ← Q(s,a) + α [ ∫_{t1}^{t2} e^{-β(τ-t1)} r_τ dτ + e^{-β(t2-t1)} max_{a'} Q(s',a') - Q(s,a) ]
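A numerical sketch of this backup, approximating the reward integral from sampled piecewise-constant rewards; the data layout and parameter values are illustrative assumptions:

```python
# Sketch of the semi-Markov Q-learning backup: action a taken at time t1
# in state s, next decision at time t2 in state s2. The integral of
# discounted reward is approximated from (time, reward, duration) samples.
# beta is the continuous-time discount rate; values are illustrative.
import math

def smdp_q_update(Q, s, a, s2, t1, t2, reward_samples, beta=0.01, alpha=0.5):
    # Approximate the integral of e^{-beta (t - t1)} r(t) dt as a sum over
    # piecewise-constant reward samples [(t, r, dt), ...].
    discounted_r = sum(math.exp(-beta * (t - t1)) * r * dt
                       for t, r, dt in reward_samples)
    # Discount the bootstrap value by the elapsed waiting time.
    bootstrap = math.exp(-beta * (t2 - t1)) * max(Q[s2].values())
    Q[s][a] += alpha * (discounted_r + bootstrap - Q[s][a])
    return Q[s][a]
```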
Passenger Arrival Patterns
Up-peak and Down-peak traffic
• Not equivalent: down-peak handling capacity is much greater than up-peak handling capacity, so up-peak capacity is the limiting factor.
• Up-peak is easiest to analyse: once everyone is on board at the lobby, the rest of the trip is determined. The only decision is when to open and close the doors at the lobby. The optimal policy for the pure case is: close the doors when a threshold number of passengers are on board; the threshold depends on traffic intensity.
• More policies to consider for two-way and down-peak traffic.
• We focus on down-peak traffic pattern.
Control Strategies
• Zoning: divide building into zones; park in zone when idle. Robust in heavy traffic.
• Search-based methods: greedy or non-greedy. Receding Horizon control.
• Rule-based methods: expert systems/fuzzy logic; from human “experts”
• Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
• Adaptive/Learning methods: NNs for prediction, parameter space search using simulation, DP on simplified model, non-sequential RL
The Elevator Model (Lewis, 1991)
Parameters:
• Floor Time (time to move one floor at max speed): 1.45 secs.
• Stop Time (time to decelerate, open and close doors, and accelerate again): 7.19 secs.
• TurnTime (time needed by a stopped car to change directions): 1 sec.
• Load Time (the time for one passenger to enter or exit a car): a random variable with range from 0.6 to 6.0 secs, mean of 1 sec.
• Car Capacity: 20 passengers
Discrete Event System: continuous time, asynchronous elevator operation
Traffic Profile:
• Poisson arrivals with rates changing every 5 minutes; down-peak
State Space
• 18 hall call buttons: 2^18 combinations
• positions and directions of cars: 18^4 (rounding to the nearest floor)
• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
• 40 car buttons: 2^40 combinations
• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving the elapsed time since the hall buttons were pushed; we discretize these.
• Set of passengers riding each car and their destinations: observable only through the car buttons
Conservatively, about 10^22 states.
Actions
• When moving (halfway between floors):
– stop at next floor
– continue past next floor
• When stopped at a floor:
– go up
– go down
• Asynchronous
Constraints
• A car cannot pass a floor if a passenger wants to get off there
• A car cannot change direction until it has serviced all onboard passengers traveling in the current direction
• Don’t stop at a floor if another car is already stopping, or is stopped, there
• Don’t stop at a floor unless someone wants to get off there
• Given a choice, always move up
(The slide marks these constraints as standard, special, or heuristic. With them in place, the remaining decisions reduce to Stop and Continue.)
Performance Criteria
Minimize one of:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Average Squared Wait Time
Instantaneous cost: r_τ = Σ_p (wait_p(τ))^2
Define the return as an integral rather than a sum (Bradtke and Duff, 1994):
R_t = Σ_{k=0}^∞ γ^k r_{t+k} becomes R_t = ∫_0^∞ e^{-βτ} r_{t+τ} dτ
Algorithm
Repeat forever:
1. In state x at time t_x, car c must decide to STOP or CONTINUE
2. It selects an action using a Boltzmann distribution (with decreasing temperature) based on current Q-values
3. The next decision by car c is required in state y at time t_y
4. Implement the gradient-descent version of the following backup using backprop:
Q(x,a) ← Q(x,a) + α [ ∫_{t_x}^{t_y} e^{-β(τ-t_x)} r_τ dτ + e^{-β(t_y-t_x)} max_{a'} Q(y,a') - Q(x,a) ]
5. x ← y, t_x ← t_y
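Step 2's Boltzmann selection can be sketched as follows; the stabilisation trick and fallback are implementation choices, not details from the slide:

```python
# Sketch of Boltzmann (softmax) action selection with a temperature that
# can be annealed over training: high T explores, low T exploits.
import math, random

def boltzmann_select(q_values, temperature):
    """q_values: dict action -> Q. Samples an action with prob ~ e^{Q/T}."""
    m = max(q_values.values())            # subtract max for numerical stability
    weights = {a: math.exp((q - m) / temperature)
               for a, q in q_values.items()}
    r = random.random() * sum(weights.values())
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a
    return a                              # fallback for floating-point slack
```

Annealing the temperature toward zero makes the choice increasingly greedy as the Q estimates improve.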
Computing Rewards
Must calculate ∫_{t_x}^{t_y} e^{-β(τ-t_x)} r_τ dτ
• "Omniscient rewards": the simulator knows how long each passenger has been waiting.
• "On-line rewards": assumes only the arrival time of the first passenger in each queue is known (elapsed hall-button time); estimate the other arrival times.
Neural Networks
Inputs:
• 9 binary: state of each hall down button
• 9 real: elapsed time of each hall down button, if pushed
• 16 binary: one on at a time: position and direction of the car making the decision
• 10 real: location/direction of the other cars: "footprint"
• 1 binary: at the highest floor with a waiting passenger?
• 1 binary: at the floor with the longest-waiting passenger?
• 1 bias unit
47 inputs, 20 sigmoid hidden units, 1 or 2 output units
Elevator Results