Transcript of: Selective attention in RL. B. Ravindran, joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, Andy Barto.

Page 1:

Selective attention in RL

B. Ravindran
Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, Andy Barto


Page 3:

Features, features, everywhere!

• We inhabit a world rich in sensory input

• Focus on features relevant to task at hand

• Two questions:
  1. How do you characterize relevance?
  2. How do you identify relevant features?

Page 4:

Outline

• Characterization of relevance
  – MDP homomorphisms
  – Factored representations
  – Relativized options

• Identifying features
  – Option schemas
  – Deictic options

Page 5:

Markov Decision Processes

• An MDP, M, is the tuple M = ⟨S, A, Ψ, P, R⟩:
  – S: set of states
  – A: set of actions
  – Ψ ⊆ S × A: set of admissible state-action pairs
  – P: Ψ × S → [0, 1]: probability of transition
  – R: Ψ → ℝ: expected reward
• Policy: π: Ψ → [0, 1]
• Maximize total expected reward
  – π*: optimal policy
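To make the tuple concrete, here is a minimal Python sketch of a finite MDP container; the class and field names are illustrative, not from the talk:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Finite MDP M = <S, A, Psi, P, R>, mirroring the slide's definition."""
    states: set       # S: set of states
    actions: set      # A: set of actions
    admissible: set   # Psi (subset of S x A): admissible state-action pairs
    P: dict           # P[(s, a, t)]: probability of reaching t from s under a
    R: dict           # R[(s, a)]: expected reward for taking a in s

    def next_state_dist(self, s, a):
        """Distribution over next states for an admissible (s, a) pair."""
        assert (s, a) in self.admissible
        return {t: self.P.get((s, a, t), 0.0) for t in self.states}
```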

Page 6:

Notion of Equivalence

(A, E) ≡ (B, N)
(A, W) ≡ (B, S)
(A, N) ≡ (B, E)
(A, S) ≡ (B, W)

(Figure: a gridworld with two equivalent states A and B and compass directions N, E, S, W.)

M = ⟨S, A, Ψ, P, R⟩      M′ = ⟨S′, A′, Ψ′, P′, R′⟩

Find reduced models that preserve some aspects of the original model

Page 7:

MDP Homomorphism

(Figure: commutative diagram. h maps a state-action pair (s, a) of M to (s′, a′) of M′; the transition probabilities P aggregate to P′, and the rewards R map to R′.)

Given MDPs M = ⟨S, A, Ψ, P, R⟩ and M′ = ⟨S′, A′, Ψ′, P′, R′⟩, an MDP homomorphism is a surjection h: Ψ → Ψ′, defined by h((s, a)) = (f(s), g_s(a)), where f: S → S′ and g_s: A_s → A′_{f(s)}, for all s ∈ S, are surjections such that, for all s, s′ ∈ S and a ∈ A_s:

  (1) P′(f(s), g_s(a), f(s′)) = Σ_{t ∈ [s′]_f} P(s, a, t)
  (2) R′(f(s), g_s(a)) = R(s, a)
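Read operationally, conditions (1) and (2) can be checked directly on a finite MDP. A small sketch, assuming the MDP container above and candidate maps f and g supplied as dictionaries:

```python
from collections import defaultdict

def is_homomorphism(M, M_img, f, g, tol=1e-9):
    """Check conditions (1) and (2) for candidate maps f: S -> S' and
    g[s]: A_s -> A'_{f(s)} between finite MDPs M and M_img (a sketch)."""
    for (s, a) in M.admissible:
        image_pair = (f[s], g[s][a])
        # Condition (2): rewards must agree.
        if abs(M_img.R[image_pair] - M.R[(s, a)]) > tol:
            return False
        # Condition (1): total probability into each block [s']_f must
        # equal the image MDP's transition probability.
        block_prob = defaultdict(float)
        for t in M.states:
            block_prob[f[t]] += M.P.get((s, a, t), 0.0)
        for s_img in M_img.states:
            if abs(block_prob[s_img] - M_img.P.get(image_pair + (s_img,), 0.0)) > tol:
                return False
    return True
```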

Page 8:

Example

(A, E) ≡ (B, N)
(A, W) ≡ (B, S)
(A, N) ≡ (B, E)
(A, S) ≡ (B, W)

(Figure: the same two-room gridworld with compass directions N, E, S, W.)

M = ⟨S, A, Ψ, P, R⟩      M′ = ⟨S′, A′, Ψ′, P′, R′⟩

h((A, E)) = h((B, N)) = ({A, B}, E)

State-dependent action recoding

Page 9:

Some Theoretical Results

• Optimal value equivalence:
  If h(s, a) = (s′, a′), then Q*(s, a) = Q*(s′, a′).
  Corollary: If h(s1, a1) = h(s2, a2), then Q*(s1, a1) = Q*(s2, a2).
  [generalizing results of Dean and Givan, 1997]

• Theorem: If M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M.

• Solve the homomorphic image and lift the policy to the original MDP.
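The lifting step admits a short sketch: act in the original MDP with any action whose image matches the image policy's choice. The function and argument names here are illustrative:

```python
def lift_policy(pi_image, f, g):
    """Lift a deterministic policy pi_image on the homomorphic image M'
    back to the original MDP: in state s, pick any admissible action a
    whose image g[s][a] is the action pi_image selects at f(s)."""
    def pi(s, admissible_actions):
        a_image = pi_image(f[s])
        for a in admissible_actions:
            if g[s][a] == a_image:
                return a
        raise ValueError("no admissible action maps to the image action")
    return pi
```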

Page 10:

More results

• Polynomial-time algorithm to find reduced images (Dean and Givan '97; Lee and Yannakakis '92; Ravindran '04)

• Approximate homomorphisms (Ravindran and Barto '04)
  – Bounds on the loss in the optimal value function

• Soft homomorphisms (Sorg and Singh '09)
  – Fuzzy notions of equivalence between two MDPs
  – Efficient algorithm for finding them

• Transfer learning (Soni and Singh '06), partial observability (Wolfe '10), etc.

Page 11:

Still more results

• Symmetries are special cases of homomorphisms (Matthur and Ravindran '08)
  – Finding symmetries is GI-complete
  – Harder than finding general reductions

• Efficient algorithms for constructing the reduced image (Matthur and Ravindran '07)
  – Factored MDPs
  – Polynomial in the size of the smaller MDP

Page 12:

Attention?

• How to use this concept for modeling attention?

• Combine with hierarchical RL
  – Look at sub-task-specific relevance
  – Structured homomorphisms
  – Deixis (Greek δεῖξις: to point)

Page 13:

Factored MDPs

• State and action spaces defined as products of features/variables.
• Factor the transition probabilities.
• Exploit the structure to define simple transformations.

(Figure: a 2-slice temporal Bayes net over features x1, x2, x3 and reward r, with factors P(x1′ | x1, x2), P(x2′ | x1, x2), P(x3′ | x2, x3), and P(r | x1, x2).)
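A minimal sketch of sampling from such a factored transition model; the parent structure mirrors the figure, while the function and variable names are illustrative:

```python
import random

# Parents of each next-slice variable, as in the 2-slice DBN above:
# x1' and x2' depend on (x1, x2); x3' depends on (x2, x3).
PARENTS = {"x1": ("x1", "x2"), "x2": ("x1", "x2"), "x3": ("x2", "x3")}

def sample_next_state(state, cpts):
    """Sample each next-slice feature from its own conditional table.
    `state` maps feature name -> value; `cpts[v]` maps a tuple of parent
    values -> {next_value: probability}."""
    next_state = {}
    for v, parents in PARENTS.items():
        dist = cpts[v][tuple(state[p] for p in parents)]
        values, probs = zip(*dist.items())
        next_state[v] = random.choices(values, weights=probs)[0]
    return next_state
```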

Page 14:

Using Factored Representations

• Represent symmetry information in terms of features.
  – E.g., the NE-SW symmetry can be represented as
    (x, y)-N ≡ (y, x)-E  and  (x, y)-W ≡ (y, x)-S

• Simple forms of transformations
  – Projections
  – Permutations (a sketch of one follows below)
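For instance, the NE-SW symmetry above is just a feature permutation plus an action relabeling. A minimal sketch (the dictionaries and names are illustrative):

```python
# NE-SW symmetry: swap the x and y features and relabel actions,
# so (x, y)-N maps to (y, x)-E and (x, y)-W maps to (y, x)-S.
ACTION_MAP = {"N": "E", "E": "N", "W": "S", "S": "W"}

def ne_sw_transform(state, action):
    """Apply the symmetry to a factored state (x, y) and an action label."""
    x, y = state
    return (y, x), ACTION_MAP[action]
```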

Page 15:

Hierarchical Reinforcement Learning

Options framework

Options (Sutton, Precup, and Singh, 1999): a generalization of actions to include temporally extended courses of action.

An option is a triple o = ⟨I, π, β⟩ where:
  • I ⊆ S is the set of states in which the option may be started
  • π: Ψ → [0, 1] is the (stochastic) policy followed during the option
  • β: S → [0, 1] is the probability of terminating in each state

Example: robot docking
  π: pre-defined controller
  β: terminate when docked or charger not visible
  I: all states in which the charger is in sight
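A minimal Python sketch of the option triple; the class and field names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta> (Sutton, Precup, and Singh, 1999)."""
    initiation: Set          # I: states in which the option may be started
    policy: Callable         # pi: maps state -> action while the option runs
    termination: Callable    # beta: maps state -> probability of terminating

    def can_start(self, s):
        return s in self.initiation

# Robot-docking example from the slide (helper names hypothetical):
# docking = Option(initiation=charger_visible_states,
#                  policy=docking_controller,
#                  termination=lambda s: 1.0 if docked(s)
#                                        or not charger_visible(s) else 0.0)
```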

Page 16:

Sub-goal Options

• Gather all the red objects
• Five options, one for each room
• Sub-goal options
• Implicitly represent the option policy
• Option MDPs related to each other

Page 17:

Relativized Options

• Relativized options (Ravindran and Barto '03)
  – Spatial abstraction: MDP homomorphisms
  – Temporal abstraction: options framework

• Abstract representation of a related family of sub-tasks
  – Each sub-task can be derived by applying well-defined transformations

Page 18:

Relativized Options (cont.)

Relativized option: O = ⟨h, M_O, I, β⟩
  – h: option homomorphism
  – M_O: option MDP (the image of h)
  – I ⊆ S: initiation set
  – β: S_O → [0, 1]: termination criterion

(Figure: block diagram. The environment percept is reduced to the option MDP's state via h; the option MDP's action is mapped back to top-level actions.)

Page 19:

Rooms World Task

• Single relativized option: get-object-exit-room
• Especially useful when learning the option policy
  – Speed-up
  – Knowledge transfer
• Terminology: Iba '89
• Related to parameterized sub-tasks (Dietterich '00; Andre and Russell '01, '02)

Page 20:

Option Schema

• Finding the right transformation?
  – Given a set of candidate transformations

• The option MDP and policy can be viewed as a policy schema (Schmidt '75)
  – Template of a policy
  – Acquire the schema in a prototypical setting
  – Learn bindings of sensory inputs and actions to the schema

Page 21:

Problem Formulation

• Given:
  – ⟨M_O, I, β⟩ of a relativized option
  – H, a family of transformations

• Identify the option homomorphism h

• Formulate as a parameter estimation problem
  – One parameter for each sub-task, taking values from H
  – Samples: ⟨s1, a1⟩, ⟨s2, a2⟩, …
  – Bayesian learning

Page 22:

Algorithm

• Assume a uniform prior p_0(h, s)
• Experience: ⟨s_n, a_n, s_{n+1}⟩
• Update posteriors:

  p_n(h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · p_{n-1}(h, s) / Z

  where Z is a normalizing factor and the sample likelihood is
  P(s_n, a_n, s_{n+1} | h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1}))
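A minimal sketch of this posterior update over a set of candidate transformations, assuming each candidate h supplies maps (f, g) and P_O is the option MDP's transition model (all names illustrative):

```python
def update_posterior(posterior, transforms, P_O, s, a, s_next):
    """One Bayesian step: reweight each candidate transformation h by the
    option-MDP likelihood of the observed transition (s, a, s_next)."""
    weights = {}
    for h, (f, g) in transforms.items():
        weights[h] = P_O(f[s], g[s][a], f[s_next]) * posterior[h]
    Z = sum(weights.values())  # normalizing factor
    return {h: w / Z for h, w in weights.items()}
```

One caveat this makes visible: a single zero-likelihood sample drives a candidate's weight to zero permanently, which is one motivation for the floored likelihood in the heuristic rule on Page 32.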

Page 23:

Complex Game World

• Symmetric option MDP
• One delayer
• 40 transformations
  – 8 spatial transformations combined with 5 projections
• Parameters of the option MDP differ from the rooms world task

Page 24:

Results: Speed of Convergence

• Learning the policy is more difficult than learning the correct transformation!

Page 25:

Results: Transformation Weights in Room 4

• The weight of transformation 12 eventually converges to 1

Page 26:

Results: Transformation Weights in Room 2

• Weights oscillate a lot
• Some transformation eventually dominates
  – Which one changes from one run to another

Page 27:


Deictic Representation

• Making attention more explicit
• Sense the world via pointers: selective attention
• Actions defined with respect to pointers
• Agre '88
  – Game domain: Pengo
• Pointers can be arbitrarily complex
  – ice-cube-next-to-me
  – robot-chasing-me

“Move block ⟨pointer⟩ to top of block ⟨pointer⟩.”

Page 28:

Deixis and Abstraction

• Deictic pointers project states and actions onto some abstract state-action space
• Consistent representation (Whitehead and Ballard '91)
  – States with the same abstract representation have the same optimal value
  – Lion algorithm; works with deterministic systems
• Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew '07)
  – Factored MDPs
  – Restrict the transformations available
    • Only projections
  – Homomorphism conditions ensure a consistent representation

Page 29:

Deictic Option Schema

• Deictic option schema: ⟨K, D, O⟩
  – O: a relativized option
  – K: a set of deictic pointers
  – D: a collection of sets of possible projections, one for each pointer

• Finding the correct transformation for a given state gives a consistent representation
• Use a factored version of the parameter estimation algorithm

Page 30:

Classes of Pointers

• Independent pointers
• Mutually dependent pointers
• Dependent pointers

(Figure: 2-slice DBN fragments over pointer features x1, x2, x3 and next-slice values x1′, x2′, x3′, one per class of dependence.)

Page 31:

Problem Formulation

• Given:
  – ⟨K, D, O⟩ of a relativized option

• Identify the right pointer configuration for each sub-task

• Formulate as a parameter estimation problem
  – One parameter for each set of connected pointers per sub-task
  – Takes values from the possible projections in D
  – Samples: ⟨s1, a1⟩, ⟨s2, a2⟩, …
  – Heuristic modification of Bayesian learning

Page 32:

Heuristic Update Rule

• Use a heuristic update rule, with one weight vector per component l:

  w_n^l(h, s) = P̃_O^l(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · w_{n-1}^l(h, s) / Z

  where Z is a normalizing factor,

  P̃_O^l(s, a, s′) = max(ν, P_O^l(s, a, s′)),

  and ν is a small positive constant
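Reusing the Bayesian step above, the clamped update can be sketched as follows (nu and the per-component likelihood function are illustrative):

```python
def heuristic_update(weights, transforms, P_O_l, s, a, s_next, nu=1e-3):
    """Per-component variant of the Bayesian step: floor the likelihood at
    nu so a single zero-probability sample cannot permanently eliminate h."""
    new = {}
    for h, (f, g) in transforms.items():
        new[h] = max(nu, P_O_l(f[s], g[s][a], f[s_next])) * weights[h]
    Z = sum(new.values())  # normalizing factor
    return {h: w / Z for h, w in new.items()}
```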

Page 33:

Game Domain

• 2 deictic pointers: delayer and retriever
• 2 fixed pointers: where-am-I and have-diamond
• 8 possible values for delayer and retriever
  – 64 transformations

Page 34:

Experimental Setup

• Composite agent
  – Uses 64 transformations and a single-component weight vector
• Deictic agent
  – Uses a 2-component weight vector
  – 8 transformations per component
• Hierarchical SMDP Q-learning

Page 35:

Experimental Results – Speed of Convergence

Page 36:

Experimental Results – Timing

• Execution time
  – Composite: 14 hours and 29 minutes
  – Factored: 4 hours and 16 minutes

Page 37:

Experimental Results – Composite Weights

Mean: 2006, Std. Dev.: 1673

Page 38:

Experimental Results – Delayer Weights

Mean: 52, Std. Dev.: 28.21

Page 39:

Experimental Results – Retriever Weights

Mean: 3045, Std. Dev.: 2332.42

Page 40:

Robosoccer

• Hard to learn a policy for the entire game
• Look at simpler problems
  – Keepaway
  – Half-field offence
• Learn the policy in a relative frame of reference
• Keep changing the bindings

Page 49:

Summary

• Richer representations needed for RL
  – Deictic representations

• Deictic option schemas
  – Combine ideas from hierarchical RL and MDP homomorphisms
  – Capture aspects of deictic representation

Page 50:

Future

• Build an integrated cognitive architecture that uses both bottom-up and top-down information
  – In perception
  – In decision making

• Combine aspects of supervised, unsupervised, and reinforcement learning, and of planning approaches