Selective attention in RL
B. Ravindran
Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, Andy Barto
Features, features, everywhere!
• We inhabit a world rich in sensory input
• Focus on features relevant to task at hand
• Two questions:
  1. How do you characterize relevance?
  2. How do you identify relevant features?
Outline
• Characterization of relevance
  – MDP homomorphisms
  – Factored representations
  – Relativized options
• Identifying features
  – Option schemas
  – Deictic options
Markov Decision Processes
• An MDP, M, is the tuple M = ⟨S, A, Ψ, P, R⟩:
  – S : set of states
  – A : set of actions
  – Ψ ⊆ S × A : set of admissible state-action pairs
  – P : Ψ × S → [0, 1] : probability of transition
  – R : Ψ → ℝ : expected reward
• Policy π : Ψ → [0, 1]
• Maximize total expected reward – optimal policy
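The tuple above can be made concrete; here is a minimal sketch (a toy two-state chain with illustrative state and action names, not from the talk) in which value iteration recovers an optimal policy:

```python
# Minimal sketch of M = (S, A, Psi, P, R); all names are illustrative.
S = ["s0", "s1"]                      # states
A = ["left", "right"]                 # actions
Psi = {(s, a) for s in S for a in A}  # admissible state-action pairs
# P[(s, a)][t] = transition probability; R[(s, a)] = expected reward
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 1.0}

def value_iteration(gamma=0.9, iters=100):
    """Maximize total expected (discounted) reward; return a greedy policy."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: max(R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items())
                    for a in A if (s, a) in Psi)
             for s in S}
    # The optimal policy is greedy with respect to the converged values
    return {s: max(A, key=lambda a: R[(s, a)] +
                   gamma * sum(p * V[t] for t, p in P[(s, a)].items()))
            for s in S}

print(value_iteration())  # → {'s0': 'right', 's1': 'right'}
```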
Notion of Equivalence
(A, E) ≡ (B, N)    (A, W) ≡ (B, S)
(A, N) ≡ (B, E)    (A, S) ≡ (B, W)
[Figure: gridworld with two equivalent regions A and B; compass directions N, E, S, W]
M = ⟨S, A, Ψ, P, R⟩    M′ = ⟨S′, A′, Ψ′, P′, R′⟩
Find reduced models that preserve some aspects of the original model
MDP Homomorphism
[Figure: commutative diagram – h maps (s, a) to (s′, a′), aggregating transition probabilities P into P′ and rewards R into R′]
Given MDPs M = ⟨S, A, Ψ, P, R⟩ and M′ = ⟨S′, A′, Ψ′, P′, R′⟩, a homomorphism is a surjection h : Ψ → Ψ′ defined by h((s, a)) = (f(s), g_s(a)), where f : S → S′ and g_s : A_s → A′_f(s), for all s ∈ S, are surjections such that for all s, s′ ∈ S and a ∈ A_s:
(1) P′(f(s), g_s(a), f(s′)) = Σ_{t ∈ [s′]_f} P(s, a, t)
(2) R′(f(s), g_s(a)) = R(s, a)
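The two conditions can be checked mechanically on finite MDPs; here is a hypothetical helper (not from the talk) that verifies the reward condition and the block-transition condition for a candidate pair (f, g):

```python
from collections import defaultdict

# Sketch: check the two homomorphism conditions for candidate maps f and g.
# Assumes every (s, a) pair is admissible; all names are illustrative.
def is_homomorphism(S, A, P, R, Pp, Rp, f, g, tol=1e-9):
    for s in S:
        for a in A:
            sa = (f(s), g(s, a))
            # Condition (2): rewards must agree under h
            if abs(Rp[sa] - R[(s, a)]) > tol:
                return False
            # Condition (1): P' must equal the summed probability of each block [s']_f
            block = defaultdict(float)
            for t, p in P[(s, a)].items():
                block[f(t)] += p
            for tp, p in block.items():
                if abs(Pp[sa].get(tp, 0.0) - p) > tol:
                    return False
    return True
```

For instance, collapsing two mutually symmetric states into one abstract state passes the check only if their rewards and aggregated transition probabilities match.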
Example
(A, E) ≡ (B, N)    (A, W) ≡ (B, S)
(A, N) ≡ (B, E)    (A, S) ≡ (B, W)
[Figure: gridworld example; compass directions N, E, S, W]
M = ⟨S, A, Ψ, P, R⟩    M′ = ⟨S′, A′, Ψ′, P′, R′⟩
h((A, E)) = h((B, N)) = ({A, B}, E)
State dependent action recoding
Some Theoretical Results
• Optimal value equivalence [generalizing those of Dean and Givan, 1997]:
  If h(s, a) = (s′, a′) then Q*(s, a) = Q*(s′, a′).
  Corollary: If h(s1, a1) = h(s2, a2) then Q*(s1, a1) = Q*(s2, a2).
• Theorem: If M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M.
• Solve the homomorphic image and lift the policy to the original MDP.
More results
• Polynomial-time algorithm to find reduced images (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04)
• Approximate homomorphisms (Ravindran and Barto ’04)
  – Bounds for the loss in the optimal value function
• Soft homomorphisms (Sorg and Singh ’09)
  – Fuzzy notions of equivalence between two MDPs
  – Efficient algorithm for finding them
• Transfer learning (Soni, Singh et al. ’06), partial observability (Wolfe ’10), etc.
Still more results
• Symmetries are special cases of homomorphisms (Matthur and Ravindran ’08)
  – Finding symmetries is GI-complete
  – Harder than finding general reductions
• Efficient algorithms for constructing the reduced image (Matthur and Ravindran ’07)
  – Factored MDPs
  – Polynomial in the size of the smaller MDP
Attention?
• How to use this concept for modeling attention?
• Combine with hierarchical RL
  – Look at sub-task specific relevance
  – Structured homomorphisms
  – Deixis (δεῖξις – to point)
Factored MDPs
• State and action spaces defined as products of features/variables.
• Factor transition probabilities.
• Exploit structure to define simple transformations.
[Figure: 2-slice temporal Bayes net over variables x1, x2, x3 and reward r, with P(x1′ | x1, x2), P(x2′ | x1, x2), P(x3′ | x2, x3), P(r | x1, x2)]
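The factorization in the 2-slice Bayes net above can be sketched as a product of per-variable conditionals; the CPT values below are illustrative stand-ins, only the dependence structure is from the slide:

```python
# Sketch of a factored transition probability: P(x' | x) is the product of
# each variable's conditional given its parents in the previous slice.
def factored_transition_prob(x, x_next, cpts, parents):
    """Multiply P(x_i' | parents(x_i)) over all variables i."""
    p = 1.0
    for var, cpt in cpts.items():
        parent_vals = tuple(x[j] for j in parents[var])
        p *= cpt[parent_vals][x_next[var]]
    return p

# Dependence structure from the slide: x1' and x2' depend on (x1, x2),
# while x3' depends on (x2, x3)
parents = {"x1": ("x1", "x2"), "x2": ("x1", "x2"), "x3": ("x2", "x3")}
```

A full transition table over three variables would need exponentially many entries; the factored form stores only the small per-variable tables.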
Using Factored Representations
• Represent symmetry information in terms of features.
  – E.g., the NE–SW symmetry can be represented as
    ((x, y), N) ≡ ((y, x), E) and ((x, y), W) ≡ ((y, x), S)
• Simple forms of transformations
  – Projections
  – Permutations
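In code, this factored symmetry is just a feature permutation plus an action recoding; a minimal sketch using the gridworld action names:

```python
# Sketch of the NE–SW symmetry: swap the (x, y) features and exchange the
# corresponding compass actions (a permutation plus state-dependent recoding).
ACTION_MAP = {"N": "E", "E": "N", "S": "W", "W": "S"}

def ne_sw_symmetry(state, action):
    x, y = state
    return (y, x), ACTION_MAP[action]
```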
Hierarchical Reinforcement Learning
Options framework
Options (Sutton, Precup, & Singh, 1999): a generalization of actions to include temporally-extended courses of action
An option is a triple o = ⟨I_o, π_o, β_o⟩ where:
• I_o ⊆ S is the set of states in which the option may be started
• π_o : Ψ → [0, 1] is the (stochastic) policy followed during o
• β_o : S → [0, 1] is the probability of terminating in each state

Example: robot docking
• π : pre-defined controller
• β : terminate when docked or charger not visible
• I : all states in which charger is in sight
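The option triple maps directly onto a small data structure; a sketch of the robot-docking example, with illustrative state and condition names:

```python
from dataclasses import dataclass
from typing import Callable, Set

# Sketch of an option <I, pi, beta>; states are strings here for brevity.
@dataclass
class Option:
    initiation: Set[str]                 # I: states where the option may start
    policy: Callable[[str], str]         # pi: policy followed during the option
    termination: Callable[[str], float]  # beta: probability of terminating

docking = Option(
    initiation={"charger_visible"},
    policy=lambda s: "approach_charger",  # pre-defined controller
    termination=lambda s: 1.0 if s in ("docked", "charger_lost") else 0.0,
)
```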
Sub-goal Options
• Gather all the red objects
• Five options – one for each room
• Sub-goal options
• Implicitly represent option policy
• Option MDPs related to each other
Relativized Options
• Relativized options (Ravindran and Barto ’03)
  – Spatial abstraction – MDP homomorphisms
  – Temporal abstraction – options framework
• Abstract representation of a related family of sub-tasks
  – Each sub-task can be derived by applying well-defined transformations
Relativized Options (Cont.)
Relativized option: O = ⟨h, M_O, I, β⟩
• h : option homomorphism
• M_O : option MDP (image of h)
• I ⊆ S : initiation set
• β : S_O → [0, 1] : termination criterion
[Figure: agent–environment loop – environment percepts are reduced to option-MDP states, and the option's action is issued among the top-level actions]
• Single relativized option – get-object-exit-room
• Especially useful when learning option policy
  – Speed up
  – Knowledge transfer
• Terminology: Iba ’89
• Related to parameterized sub-tasks (Dietterich ’00; Andre and Russell ’01, ’02)
Rooms World Task
Option Schema
• Finding the right transformation?
  – Given a set of candidate transformations
• Option MDP and policy can be viewed as a policy schema (Schmidt ’75)
  – Template of a policy
  – Acquire schema in a prototypical setting
  – Learn bindings of sensory inputs and actions to schema
Problem Formulation
• Given:
  – ⟨M_O, I, β⟩ of a relativized option
  – H, a family of transformations
• Identify the option homomorphism h
• Formulate as a parameter estimation problem
  – One parameter for each sub-task, takes values from H
  – Samples: ⟨s1, a1⟩, ⟨s2, a2⟩, …
  – Bayesian learning
Algorithm
• Assume uniform prior: p_0(h, s)
• Experience: ⟨s_n, a_n, s_{n+1}⟩
• Update posteriors:
  p_n(h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · p_{n−1}(h, s) / (normalizing factor)
  where P(s_n, a_n, s_{n+1} | h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1}))
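The update above can be sketched as one posterior step over the candidate transformations; everything here (the dictionary representation of H, the toy likelihood in the test) is illustrative, not the talk's implementation:

```python
# Sketch of the Bayesian update: reweight each candidate transformation
# h = (f, g) by the likelihood the option MDP assigns to the observed
# transition, then renormalize.
def update_posterior(p, H, P_O, transition):
    s_n, a_n, s_next = transition
    unnorm = {h: P_O(f(s_n), g(s_n, a_n), f(s_next)) * p[h]
              for h, (f, g) in H.items()}
    Z = sum(unnorm.values()) or 1.0   # normalizing factor
    return {h: w / Z for h, w in unnorm.items()}
```

Starting from a uniform prior, repeated updates concentrate mass on the transformation whose image best predicts the agent's experience.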
Complex Game World
• Symmetric option MDP
• One delayer
• 40 transformations
  – 8 spatial transformations combined with 5 projections
• Parameters of option MDP different from the rooms world task
Results – Speed of Convergence
• Learning the policy is more difficult than learning the correct transformation!
Results – Transformation Weights in Room 4
• Transformation 12 eventually converges to 1
Results – Transformation Weights in Room 2
• Weights oscillate a lot
• Some transformation dominates eventually
  – Changes from one run to another
Deictic Representation
• Making attention more explicit
• Sense world via pointers – selective attention
• Actions defined with respect to pointers
• Agre ’88
  – Game domain: Pengo
• Pointers can be arbitrarily complex
  – ice-cube-next-to-me
  – robot-chasing-me
Move block to top of block .
Deixis and Abstraction
• Deictic pointers project states and actions onto some abstract state-action space
• Consistent representation (Whitehead and Ballard ’91)
  – States with the same abstract representation have the same optimal value
  – Lion algorithm; works with deterministic systems
• Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew ’07)
  – Factored MDPs
  – Restrict transformations available
    • Only projections
  – Homomorphism conditions ensure consistent representation
Deictic Option Schema
• Deictic option schema ⟨K, D, O⟩:
  – O – a relativized option
  – K – a set of deictic pointers
  – D – a collection of sets of possible projections, one for each pointer
• Finding the correct transformation for a given state gives a consistent representation
• Use a factored version of a parameter estimation algorithm
Classes of Pointers
• Independent pointers
• Mutually dependent pointers
• Dependent pointers
[Figure: DBN fragments over pointer variables x1, x2, x3 illustrating the three dependence classes]
Problem Formulation
• Given:
  – ⟨K, D, O⟩ of a relativized option
• Identify the right pointer configuration for each sub-task
• Formulate as a parameter estimation problem
  – One parameter for each set of connected pointers per sub-task
  – Takes values from the corresponding projection sets in D
  – Samples: ⟨s1, a1⟩, ⟨s2, a2⟩, …
  – Heuristic modification of Bayesian learning
Heuristic Update Rule
• Use a heuristic update rule:
  w_n^l(h, s) = P_O^l(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · w_{n−1}^l(h, s) / (normalizing factor)
  where the likelihood is floored: P_O^l(s, a, s′) ← max(ν, P_O^l(s, a, s′)), and ν is a small positive constant
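As a sketch, the heuristic rule for one pointer component l is the same multiplicative update as before, but with the likelihood floored at ν so that a projection is not driven to zero weight by a single poor prediction; the names and the toy likelihood in the test are illustrative:

```python
# Sketch of the heuristic update for one pointer component l:
# floor the per-component likelihood at a small nu before reweighting.
def heuristic_update(w, projections, P_O_l, transition, nu=1e-3):
    s_n, a_n, s_next = transition
    unnorm = {h: max(nu, P_O_l(f(s_n), g(s_n, a_n), f(s_next))) * w[h]
              for h, (f, g) in projections.items()}
    Z = sum(unnorm.values())           # normalizing factor
    return {h: v / Z for h, v in unnorm.items()}
```

The floor ν trades strict Bayesian consistency for robustness: a wrong projection still loses weight rapidly, but a briefly unlucky one can recover.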
Game Domain
• 2 deictic pointers: delayer and retriever
• 2 fixed pointers: where-am-I and have-diamond
• 8 possible values each for delayer and retriever
  – 64 transformations
Experimental Setup
• Composite agent
  – Uses 64 transformations and a single-component weight vector
• Deictic agent
  – Uses a 2-component weight vector
  – 8 transformations per component
• Hierarchical SMDP Q-learning
Experimental Results – Speed of Convergence
Experimental Results – Timing
• Execution time
  – Composite: 14 hours 29 minutes
  – Factored: 4 hours 16 minutes
Experimental Results – Composite Weights
Mean: 2006    Std. Dev.: 1673
Experimental Results – Delayer Weights
Mean: 52    Std. Dev.: 28.21
Experimental Results – Retriever Weights
Mean: 3045    Std. Dev.: 2332.42
Robosoccer
• Hard to learn a policy for the entire game
• Look at simpler problems
  – Keepaway
  – Half-field offence
• Learn policy in a relative frame of reference
• Keep changing the bindings
Summary
• Richer representations needed for RL
  – Deictic representations
• Deictic option schemas
  – Combine ideas of hierarchical RL and MDP homomorphisms
  – Capture aspects of deictic representation
Future
• Build an integrated cognitive architecture that uses both bottom-up and top-down information
  – In perception
  – In decision making
• Combine aspects of supervised, unsupervised, and reinforcement learning, and planning approaches