Reinforcement Learning
Abstraction and Hierarchy in RL
Subramanian Ramamoorthy School of Informatics
16 March, 2012
On the Organization of Behaviour
Consider day to day tasks such as opening a bottle or picking up your coat from the stand
– How would you tell someone how to perform this?
– How do you actually get the job done? Using what?
16/03/2012 Reinforcement Learning
An old idea: sequential behavior cannot be understood as a chain of stimulus–response associations… behavior displays hierarchical structure, comprising nested subroutines.
- Karl Lashley (1951)
On the Complexity of RL
• RL agents must learn by exploring the environment, trying out different actions in different states
• Complexity grows quickly; how to get around it?
– Balance exploration/exploitation
– Abstractions
Instrumental Hierarchy
[Botvinick '08]
In Computational Terms, Why Hierarchy?
• Knowledge transfer/injection
• Biases exploration
• Faster solutions (even if model known)
A Natural Concept: Reward Shaping
• The robots’ objective is to collectively find pucks and bring them home.
• Represent the 12-dim environment by state variables: – have-puck?
– at-home?
– near-intruder?
• What should the immediate reward function be?
[Mataric 1994]
Reward Shaping
• If a reward is given only when a robot drops a puck at home, learning will be extremely difficult.
– The delay between the action and the reward is large.
• Solution: reward shaping (intermediate rewards).
– Add rewards/penalties for achieving sub-goals/errors:
• subgoal: grasped-puck
• subgoal: dropped-puck-at-home
• error: dropped-puck-away-from-home
• Add progress estimators:
– Intruder-avoiding progress function
– Homing progress function
• Adding intermediate rewards will potentially allow RL to handle more complex problems.
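As a sketch, a shaped reward for the puck-collection task might look like the following; the predicate names and reward magnitudes are illustrative assumptions, not Mataric's actual values:

```python
# Hypothetical sketch of a shaped reward for the puck-collection task.
# Predicate names (grasped_puck, at_home, ...) and magnitudes are
# illustrative, not the values used in Mataric 1994.

def shaped_reward(state, prev_state):
    """Immediate reward = sparse task reward + subgoal bonuses + progress terms."""
    r = 0.0
    # Sparse task reward: puck dropped at home.
    if state["dropped_puck"] and state["at_home"]:
        r += 10.0
    # Subgoal bonus: grasped-puck.
    if state["grasped_puck"] and not prev_state["grasped_puck"]:
        r += 1.0
    # Error penalty: dropped-puck-away-from-home.
    if state["dropped_puck"] and not state["at_home"]:
        r -= 1.0
    # Homing progress estimator: reward movement toward home while carrying.
    if state["grasped_puck"]:
        r += 0.1 * (prev_state["home_dist"] - state["home_dist"])
    # Intruder-avoiding progress estimator.
    if state["near_intruder"]:
        r += 0.1 * (state["intruder_dist"] - prev_state["intruder_dist"])
    return r
```

The progress terms turn long delayed-reward gaps into a dense signal, which is exactly what makes the shaped version learnable in minutes.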
Reward Shaping Results
Percentage of policy learnt after 15 minutes:
Why Not Hierarchy?
• Some cool ideas and algorithms, but
• No killer apps or wide acceptance, yet.
• A good idea that needs more refinement:
– More user friendliness
– More rigour in:
• Problem specification
• Measures of progress
– Improvement = Flat - (Hierarchical + cost of building the Hierarchy)
– In what units?
Early Idea: Hierarchical Algorithms - Gating Mechanisms
Hierarchical Learning:
• Learning the gating function
• Learning the individual behaviours
• Learning both
g is a gate*; bi is a behaviour
*Can be a multi-level hierarchy.
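The gated architecture can be sketched as follows; here only the gate is learned over fixed behaviours (the first of the three learning settings above), and all names are illustrative:

```python
# Minimal sketch of the gating architecture: a gate g picks which
# behaviour b_i controls the agent in the current state. The gate is a
# learned table over (state, behaviour-index); the behaviours are fixed
# policies. Names and update rule are illustrative.
import random
from collections import defaultdict

class GatedPolicy:
    def __init__(self, behaviours, epsilon=0.1):
        self.behaviours = behaviours      # list of policies: state -> action
        self.q = defaultdict(float)       # gate values Q[(state, i)]
        self.epsilon = epsilon

    def select_behaviour(self, state):
        # Epsilon-greedy choice over behaviour indices.
        if random.random() < self.epsilon:
            return random.randrange(len(self.behaviours))
        return max(range(len(self.behaviours)),
                   key=lambda i: self.q[(state, i)])

    def act(self, state):
        i = self.select_behaviour(state)
        return i, self.behaviours[i](state)

    def update_gate(self, state, i, reward, alpha=0.5):
        # Learning the gating function only (behaviours stay fixed).
        self.q[(state, i)] += alpha * (reward - self.q[(state, i)])
```

Learning the behaviours themselves, or both levels at once, would add analogous updates inside each b_i.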
Temporal Abstraction
• What’s the issue?
– Want “macro” actions (multiple time steps)
– Advantages:
• Avoid dealing with (exploring/computing values for) less desirable states
• Reuse experience across problems/regions
• What’s not obvious
– Dealing with the Markov assumption
– Getting the math right (e.g., stability and convergence)
State Transitions → Macro Transitions
• F plays the role of generalized transition function
• More general:
– Need not be a probability
– Coefficient for state value in terms of another
– May be: • P (special case)
• Arbitrary SMDP (discount varies w/state, etc.)
• Discounted probability of following a policy/running program
T : V_{i+1}(s) = max_a [ R(s,a) + Σ_{s'} F(s'|s,a) V_i(s') ]
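A minimal numerical sketch of value iteration with this generalized backup, assuming a toy two-state model where F already folds the discount into the transition coefficient:

```python
# Sketch of value iteration with the generalized backup
#   V_{i+1}(s) = max_a [ R(s,a) + sum_{s'} F(s'|s,a) V_i(s') ],
# where F is a *discounted* transition coefficient (it need not sum
# to 1). The two-state model below is purely illustrative.

def backup(V, states, actions, R, F):
    return {s: max(R[(s, a)] + sum(F[(s, a, s2)] * V[s2] for s2 in states)
                   for a in actions)
            for s in states}

states, actions = ["s0", "s1"], ["a"]
R = {("s0", "a"): 1.0, ("s1", "a"): 0.0}
# F includes the discount (here gamma = 0.9 times a deterministic swap).
F = {("s0", "a", "s0"): 0.0, ("s0", "a", "s1"): 0.9,
     ("s1", "a", "s0"): 0.9, ("s1", "a", "s1"): 0.0}

V = {s: 0.0 for s in states}
for _ in range(200):      # T is a max-norm contraction, so this converges
    V = backup(V, states, actions, R, F)
```

Because F's coefficients are bounded below 1, the iteration contracts just as ordinary discounted value iteration does.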
So What?
• Modified Bellman operator:
T : V_{i+1}(s) = max_a [ R(s,a) + Σ_{s'} F(s'|s,a) V_i(s') ]
• T is also a contraction in max norm
• Free goodies!
– Optimality (hierarchical optimality)
– Convergence & stability
Using Temporal Abstraction
• Accelerate convergence (usually)
• Avoid uninteresting states
– Improve exploration in RL
– Avoid computing all values for MDPs
• Can finesse partial observability (a little)
• Simplify state space with “funnel” states
Idea of Funneling
• Proposed by Forestier & Varaiya 78
• Define “supervisor” MDP over boundary states
• Selects policies at boundaries to – Push system back into nominal states
– Keep it there
(Figure: a nominal region surrounded by boundary states; a control-theory version of the maze world.)
Option: move until end of hallway
• Start: any state in the hallway.
• Execute: the policy as shown.
• Terminate: when s is the end of the hallway.
Options [Sutton, Precup & Singh '99]
• An option is a behaviour defined in terms of:
o = ⟨ I_o, π_o, β_o ⟩
• I_o : set of states in which o can be initiated.
• π_o : policy (S → A)* followed while o is executing.
• β_o(s) : probability that o terminates in s.
*Can be a policy over lower-level options.
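A minimal sketch of the option triple in code, with an illustrative one-dimensional hallway option; state and action types are assumptions:

```python
# Sketch of the options formalism o = <I_o, pi_o, beta_o>
# (Sutton, Precup & Singh '99). States are integers here purely
# for illustration.
import random

class Option:
    def __init__(self, initiation_set, policy, termination_prob):
        self.I = initiation_set        # set of states where o may start
        self.pi = policy               # state -> action (or lower-level option)
        self.beta = termination_prob   # state -> probability of terminating

    def can_start(self, s):
        return s in self.I

    def step(self, s):
        """Return the action chosen in s and whether the option terminates."""
        return self.pi(s), random.random() < self.beta(s)

# Example: "move right until the hallway state" in a one-room corridor.
hallway = 5
move_right = Option(
    initiation_set=set(range(hallway)),
    policy=lambda s: +1,                                  # always step right
    termination_prob=lambda s: 1.0 if s == hallway else 0.0,
)
```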
Rooms Example
Options Define a Semi-MDP
MDP + Options = SMDP
Why does SMDP Connection Help?
Value Functions for Options
Motivations for Options Framework
• Add temporally extended activities to the choices available to the RL agent, without precluding planning and learning at the finer-grained MDP level
• Optimal policies over primitives are not compromised by the addition of options
• However, if an option is useful, learning will quickly find this out; this prevents prolonged and useless 'flailing about'
PS: If all options are 1-step, you recover the core MDP
Rooms Example – One Room
Colour = option specific value
Rooms Example
Hierarchical Abstract Machines (HAM)
• Policies of a core MDP (M) are defined as programs which execute based on their own state and the current state of the core MDP
• HAM policy ≈ collection of FSMs {Hi}
• Four types of states in each Hi: action, call, choice, stop
– Action: generate an action based on the state of M and the currently executing Hi
– Call: switch machines to some Hi, setting its local state using a stochastic function
– Choice: non-deterministically pick an Hi
– Stop: return control to the calling machine
State Transition Structure in HAM
HAM H: Initial machine + closure of all machine states in all machines reachable from possible initial states of the initial machine
HAM + MDP = SMDP
• Composition of HAM and core MDP defines an SMDP
• Only actions of the composite machine = choice points
– These actions only affect HAM part
• As an SMDP, things run autonomously between choice points
– What are the rewards?
• Moreover, optimal policy of the composite machine only depend on the choice points!
• There is a reduce(H o M) whose states are just choice points
• We can apply SMDP Q-learning to keep updates manageable
Why HAM?
Goal/State Abstraction
• Why are these together?
– Abstract goals typically imply abstract states
• Makes sense for classical planning
– Classical planning uses state sets
– Implicit in the use of state variables
– What about factored MDPs?
• Does this make sense for RL?
– No goals
– Markov property issues
Feudal RL [Dayan & Hinton '95]
• Lords dictate subgoals to serfs
• Subgoals = reward functions?
• Demonstrated on a navigation task
• Markov property problem
– Stability?
– Optimality?
Feudal RL
MAXQ [Dietterich '98]
• Included temporal abstraction
• Introduced "safe" abstraction
• Handled subgoals/tasks elegantly
– Subtasks with repeated structure can appear in multiple copies throughout the state space
– Subtasks can be isolated without violating the Markov property
– Separated subtask reward from completion reward
• Example taxi/logistics domain
– Subtasks move between locations
– High-level tasks pick up/drop off assets
MAXQ – Key Notions
• Hierarchy:
– Root level: the overall task of solving the MDP
– Child nodes: tasks necessary for completing the parent node
– Two types of nodes: Max-nodes, Q-nodes
• Max-nodes
– With no child Q-nodes: primitive tasks
– With child Q-nodes: subtasks
– Each Max-node: a separate MDP
– Context independent:
• Learns a policy independent of context
• Value function shared by all parent Q-nodes
MAXQ – Key Notions
• Q-nodes
– Actions required for completing the parent task
– 1 parent Max-node, 1 child Max-node
– Context dependent:
• Learn the value of completing the subtask and then completing the parent task
• Act as value containers
Usage of Hierarchical Policy
A "call graph":
– Push a Max-node onto the stack
– The Max-node executes until it enters a terminal state
– Each Max-node converges to a locally optimal policy
A MAXQ Decomposition
Task Hierarchy: MAXQ Decomposition
(Task graph from the slide, top level to bottom: Root; Deliver and Fetch; Take, Give, and Navigate(loc); Extend-arm, Grab, Release; Move-e, Move-w, Move-s, Move-n. Children of a task are unordered.)
MAXQ Decomposition
• Augment the state s by adding the subtask i : [s,i].
• Define C([s,i],j) as the reward received in i after j finishes.
• Q([s,Fetch],Navigate(prr)) =
V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr))
• Express V in terms of C
• Learn C, instead of directly learning Q
(In the equation above, the V term is the reward received while navigating; the C term is the reward received after navigation.)
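The decomposition can be sketched as follows; the task names mirror the slide, but the stored numbers are illustrative toy values:

```python
# Sketch of the MAXQ value decomposition Q([s,i], j) = V([s,j]) + C([s,i], j):
# the value of doing subtask j inside task i splits into the reward earned
# *inside* j (V) plus the completion value C earned afterwards. For
# composite tasks one learns C rather than Q directly.

def maxq_q(V, C, s, i, j):
    return V[(s, j)] + C[((s, i), j)]

def maxq_v(V, C, s, i, children):
    """V of a composite task = max over child subtasks of the decomposed Q."""
    return max(maxq_q(V, C, s, i, j) for j in children[i])

# Toy numbers for Q([s, Fetch], Navigate(prr)).
V = {("s", "Navigate(prr)"): 3.0, ("s", "Take"): 1.0}
C = {(("s", "Fetch"), "Navigate(prr)"): 2.0, (("s", "Fetch"), "Take"): 0.5}
children = {"Fetch": ["Navigate(prr)", "Take"]}
```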
Dividing Up Value (recursive optimality)
A-LISP [Andre & Russell '02]
• Combined and extended ideas from:
– HAMs
– MAXQ
– Function approximation
• Allowed partially specified LISP programs
• Very powerful when the stars aligned
– Halting
– “Safe” abstraction
– Function approximation
Why Isn’t Everybody Doing It?
• Totally “safe” state abstraction is:
– Rare
– Hard to guarantee w/o domain knowledge
• “Safe” function approximation hard too
• Developing hierarchies is hard (like threading a needle in some cases)
• Bad choices can make things worse
• Mistakes not always obvious at first
More Advanced: Concurrency
• Define the notion of multi-option: only subset of state variables influenced by an option
• Maintain coherence:
• Can do Q-learning:
Frontier Problem: Automatic Hierarchy Discovery
• Hard in contexts such as classical planning
• Within a single problem:
– Battle is lost if all states considered (polynomial speedup at best)
– If fewer states considered, when to stop?
• Across problems
– Considering all states OK for few problems?
– Generalize to other problems in class
• How to measure progress?
Promising Ideas
• Idea: Bottlenecks are interesting…maybe
• Exploit
– Connectivity (Andre 98, McGovern 01)
– Ease of changing state variables (Hengst 02)
• Issues
– Noise
– Less work than learning a model?
– Relationship between hierarchy and model?
Frontier Problem: Representation
• Model, hierarchy, value function should all be integrated in some meaningful way
• “Safe” state abstraction is a kind of factorization
• Need approximately safe state abstraction
• Factored models w/approximation?
A common theme…
Semi-Markov Decision Processes
• A generalization of MDPs: the amount of time between one decision and the next is a random variable (real- or integer-valued)
• Treat the system as remaining in each state for a random waiting time
– after which, the transition to the next state is instantaneous
• So an SMDP is defined in terms of:
– P(s', τ | s, a): transition probability/dynamics over the next state and the waiting time τ
– R(s,a): reward, the amount expected to accumulate during the waiting time in a particular state under a particular action
Semi-Markov Decision Processes
Bellman equations:
One could modify Q-learning accordingly.
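In the standard discrete-time form, these read roughly as follows (a sketch of the usual SMDP formulas; the slide's own notation may differ):

```latex
% Bellman optimality equation for a discrete-time SMDP, where the
% waiting time \tau between decisions is a random variable:
V^*(s) = \max_a \Big[ R(s,a)
       + \sum_{s'} \sum_{\tau} \gamma^{\tau}\, P(s',\tau \mid s,a)\, V^*(s') \Big]

% Corresponding Q-learning update after observing a transition
% (s, a, r, s') with waiting time \tau:
Q(s,a) \leftarrow Q(s,a)
       + \alpha \Big[ r + \gamma^{\tau} \max_{a'} Q(s',a') - Q(s,a) \Big]
```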
Continuous version of SMDP
• e.g., Bradtke & Duff, NIPS 1994
– Countable state set, finite action set
• Dynamics:
• Define value in terms of minimum of an infinite-horizon cost:
Continuous SMDP
• For a fixed policy π, value and reward are:
• We don't know exactly when the next state transition occurs, so use the expected discount factor:
Optimal Value for Continuous SMDP
To better understand this, consider an example: passengers arriving at an office elevator.
Learning for Continuous SMDP
Elevator Dispatching
Crites and Barto 1996
Semi-Markov Q-Learning
(Continuous-time problem, but decisions occur in discrete jumps; expressions written in slightly simpler form than on the earlier slides.)
The return
R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
becomes
R_t = ∫_0^∞ e^{-βτ} r_{t+τ} dτ
Suppose the system takes action a from state s at time t1, and the next decision is needed at time t2 in state s':
Q(s,a) ← Q(s,a) + α [ ∫_{t1}^{t2} e^{-β(τ-t1)} r_τ dτ + e^{-β(t2-t1)} max_{a'} Q(s',a') - Q(s,a) ]
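A numerical sketch of this backup, approximating the reward integral from sampled piecewise-constant rewards; the data layout and parameter values are illustrative assumptions:

```python
# Sketch of the semi-Markov Q-learning backup: action a taken at time t1
# in state s, next decision at time t2 in state s2. The integral of
# discounted reward is approximated from (time, reward, duration) samples.
# beta is the continuous-time discount rate; values are illustrative.
import math

def smdp_q_update(Q, s, a, s2, t1, t2, reward_samples, beta=0.01, alpha=0.5):
    # Approximate the integral of e^{-beta (t - t1)} r(t) dt as a sum over
    # piecewise-constant reward samples [(t, r, dt), ...].
    discounted_r = sum(math.exp(-beta * (t - t1)) * r * dt
                       for t, r, dt in reward_samples)
    # Discount the bootstrap value by the elapsed waiting time.
    bootstrap = math.exp(-beta * (t2 - t1)) * max(Q[s2].values())
    Q[s][a] += alpha * (discounted_r + bootstrap - Q[s][a])
    return Q[s][a]
```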
Passenger Arrival Patterns
Up-peak and Down-peak traffic
• Not equivalent: down-peak handling capacity is much greater than up-peak handling capacity, so up-peak capacity is the limiting factor.
• Up-peak is easiest to analyse: once everyone is on board at the lobby, the rest of the trip is determined. The only decision is when to open and close the doors at the lobby. The optimal policy for the pure case is: close the doors when a threshold number of passengers are on board; the threshold depends on traffic intensity.
• More policies to consider for two-way and down-peak traffic.
• We focus on down-peak traffic pattern.
Control Strategies
• Zoning: divide building into zones; park in zone when idle. Robust in heavy traffic.
• Search-based methods: greedy or non-greedy. Receding Horizon control.
• Rule-based methods: expert systems/fuzzy logic; from human “experts”
• Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
• Adaptive/Learning methods: NNs for prediction, parameter space search using simulation, DP on simplified model, non-sequential RL
The Elevator Model (Lewis, 1991)
Parameters:
• Floor Time (time to move one floor at max speed): 1.45 secs.
• Stop Time (time to decelerate, open and close doors, and accelerate again): 7.19 secs.
• TurnTime (time needed by a stopped car to change directions): 1 sec.
• Load Time (the time for one passenger to enter or exit a car): a random variable with range from 0.6 to 6.0 secs, mean of 1 sec.
• Car Capacity: 20 passengers
Discrete Event System: continuous time, asynchronous elevator operation
Traffic Profile:
• Poisson arrivals with rates changing every 5 minutes; down-peak
State Space
• 18 hall call buttons: 2^18 combinations
• positions and directions of cars: 18^4 (rounding to the nearest floor)
• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
• 40 car buttons: 2^40 combinations
• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving the elapsed time since the hall buttons were pushed; we discretize these.
• Set of passengers riding each car and their destinations: observable only through the car buttons
Conservatively, about 10^22 states.
Actions
• When moving (halfway between floors):
– stop at next floor
– continue past next floor
• When stopped at a floor:
– go up
– go down
• Asynchronous
Constraints
• A car cannot pass a floor if a passenger wants to get off there
• A car cannot change direction until it has serviced all onboard passengers traveling in the current direction
• Don’t stop at a floor if another car is already stopping, or is stopped, there
• Don’t stop at a floor unless someone wants to get off there
• Given a choice, always move up
(The slide marks these constraints as standard, special, or heuristic. With them in place, the remaining decisions reduce to Stop and Continue.)
Performance Criteria
Minimize one of:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Average Squared Wait Time
Instantaneous cost: r_τ = Σ_p (wait_p(τ))^2
Define the return as an integral rather than a sum (Bradtke and Duff, 1994):
R_t = Σ_{k=0}^∞ γ^k r_{t+k} becomes R_t = ∫_0^∞ e^{-βτ} r_{t+τ} dτ
Algorithm
Repeat forever:
1. In state x at time t_x, car c must decide to STOP or CONTINUE
2. It selects an action using a Boltzmann distribution (with decreasing temperature) based on current Q-values
3. The next decision by car c is required in state y at time t_y
4. Implement the gradient-descent version of the following backup using backprop:
Q(x,a) ← Q(x,a) + α [ ∫_{t_x}^{t_y} e^{-β(τ-t_x)} r_τ dτ + e^{-β(t_y-t_x)} max_{a'} Q(y,a') - Q(x,a) ]
5. x ← y, t_x ← t_y
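Step 2's Boltzmann selection can be sketched as follows; the stabilisation trick and fallback are implementation choices, not details from the slide:

```python
# Sketch of Boltzmann (softmax) action selection with a temperature that
# can be annealed over training: high T explores, low T exploits.
import math, random

def boltzmann_select(q_values, temperature):
    """q_values: dict action -> Q. Samples an action with prob ~ e^{Q/T}."""
    m = max(q_values.values())            # subtract max for numerical stability
    weights = {a: math.exp((q - m) / temperature)
               for a, q in q_values.items()}
    r = random.random() * sum(weights.values())
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a
    return a                              # fallback for floating-point slack
```

Annealing the temperature toward zero makes the choice increasingly greedy as the Q estimates improve.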
Computing Rewards
Must calculate ∫_{t_x}^{t_y} e^{-β(τ-t_x)} r_τ dτ
• "Omniscient rewards": the simulator knows how long each passenger has been waiting.
• "On-line rewards": assumes only the arrival time of the first passenger in each queue is known (elapsed hall-button time); estimate the other arrival times.
Neural Networks
Inputs:
• 9 binary: state of each hall down button
• 9 real: elapsed time of each hall down button, if pushed
• 16 binary: one on at a time: position and direction of the car making the decision
• 10 real: location/direction of the other cars: "footprint"
• 1 binary: at the highest floor with a waiting passenger?
• 1 binary: at the floor with the longest-waiting passenger?
• 1 bias unit
47 inputs, 20 sigmoid hidden units, 1 or 2 output units
Elevator Results