Reinforcement Learning in POMDPs Without Resets
IJCAI, August 2005
E. Even-Dar, S. M. Kakade & Y. Mansour
Presented by Lihan He
Machine Learning Reading Group
Duke University
07/29/2005
During reinforcement learning in a POMDP, we usually reset the agent
to the same situation at the beginning of each attempt. This guarantees
that the agent starts from the same point, so that the comparison of
rewards across attempts is fair.
This paper gives an approach of approximate reset for the situation
where the agent cannot be exactly reset. The authors prove
that by applying this approximate reset, or homing strategy, the agent
moves toward a reset within a given tolerance, in the sense of expectation.
Outline
POMDP, policy, and horizon length
Reinforcement learning
Homing strategies
Two algorithms of reinforcement learning with homing:
(1) model-free; (2) model-based
Conclusion
POMDP
POMDP = HMM + controllable actions.
A POMDP model is defined by the tuple < S, A, T, R, Ω, O >: states, actions, transition probabilities, rewards, observations, and the observation function.
An example: Hallway2 – navigation problem.
[Figure: map of the Hallway2 navigation environment, a grid of rooms with a goal location.]
89 states: 4 orientations in each of 22 rooms, plus a goal state.
17 observations: all combinations of walls, plus a 'star' at the goal.
5 actions: stay in place, move forward, turn right, turn left, turn around.
The state is hidden: the agent cannot determine its current state from the current observation alone (wall / no wall in the 4 directions).
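Because the state is hidden, the agent maintains a belief state b(s) and updates it by Bayes' rule after each action and observation. A minimal sketch of this update (numpy; the 3-state toy model and all numbers below are invented for illustration, not taken from Hallway2):

```python
import numpy as np

# Toy POMDP: 3 states, one action shown, 2 observations (hypothetical numbers).
T = np.array([[0.8, 0.2, 0.0],   # T[s, s'] = P(s' | s, a) for a fixed action a
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
O = np.array([[0.9, 0.1],        # O[s', o] = P(o | s', a)
              [0.5, 0.5],
              [0.1, 0.9]])

def belief_update(b, o):
    """Bayes filter: predict through T, then correct by the likelihood of o."""
    b_pred = b @ T                 # P(s' | b, a)
    b_new = b_pred * O[:, o]       # unnormalized posterior
    return b_new / b_new.sum()

b = np.array([1/3, 1/3, 1/3])      # uniform initial belief
b = belief_update(b, o=1)
print(b)                           # posterior still sums to 1
```

Observing o=1 (likely in state 2) shifts the belief mass toward state 2, which is exactly how the agent localizes itself despite the hidden state.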
POMDP policy
A policy π is a mapping from belief states b to actions a; it tells the agent which action to take given its current belief state.
T-horizon optimal policy: the algorithm looks only T steps ahead to maximize the expected reward value V.
T = 1: consider only the immediate reward. T = ∞: consider all discounted future rewards.
V_T^*(b) = max_{a∈A} [ R(b,a) + Σ_{k=2}^{T} E(reward at the k-th step ahead) ]

where the immediate reward R(b,a) is also a function of the action a.

[Figure: V_T^* as a function of the horizon length T; V_T^* increases toward V^* as T → ∞.]
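The convergence of V_T^* to V^* in the figure can be reproduced numerically. A small sketch using finite-horizon value iteration on a fully observed toy model (a 2-state, 2-action MDP with invented numbers; exact POMDP value iteration over beliefs would need alpha-vectors and is omitted here):

```python
import numpy as np

# Finite-horizon value iteration on a tiny 2-state, 2-action MDP
# (hypothetical numbers) to show V_T^* approaching V^* as T grows.
gamma = 0.9
T = np.array([  # T[a, s, s']
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
R = np.array([  # R[s, a]
    [0.0, 1.0],
    [2.0, 0.0],
])

def v_horizon(T_steps):
    """Optimal T-step value function via backward induction."""
    V = np.zeros(2)
    for _ in range(T_steps):
        Q = R + gamma * np.einsum('ast,t->sa', T, V)  # Q[s, a]
        V = Q.max(axis=1)
    return V

vals = [v_horizon(t)[0] for t in (1, 5, 50)]
print(vals)  # non-decreasing, converging toward the infinite-horizon value
```

With nonnegative rewards the sequence V_T^* is monotone in T, matching the curve on the slide.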
Reinforcement Learning
How can the agent obtain an optimal policy if it does not know the model parameters (state transition probabilities T(s,a,s') and the observation function O(a,s',o)), or even the structure of the model?
Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
Model-based: first estimate the model parameters by exploring the environment, then derive a policy from this model. During exploration, the agent continuously improves both the model and the policy.
Model-free: discard the model entirely and find a policy directly by exploring the environment. The algorithm typically searches the space of behaviors for the best performance.
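As a concrete (if simplified) model-free example, tabular Q-learning learns action values purely from trial-and-error interaction, never estimating T or R. The sketch below uses a fully observed 2-state toy environment with invented dynamics; in a POMDP the learner would have to condition on observation histories instead of states:

```python
import random

# Minimal model-free learning: tabular Q-learning on a 2-state, 2-action
# toy problem (hypothetical dynamics hidden inside step()).
random.seed(0)

def step(s, a):
    # Hidden environment: in state 0, action 1 usually reaches the
    # rewarding state 1; in state 1, action 0 pays 2 and returns to 0.
    if s == 0:
        return (1, 1.0) if (a == 1 and random.random() < 0.8) else (0, 0.0)
    return (0, 2.0) if a == 0 else (1, 0.0)

Q = [[0.0, 0.0], [0.0, 0.0]]
alpha, gamma, eps = 0.1, 0.9, 0.1
s = 0
for _ in range(20000):
    # epsilon-greedy action selection
    a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
    s2, r = step(s, a)
    # temporal-difference update toward r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    s = s2

print(Q)  # greedy policy: action 1 in state 0, action 0 in state 1
```

The learner discovers the good behavior (take action 1 in state 0, action 0 in state 1) without ever representing the transition probabilities, which is the defining trait of the model-free approach.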
Reinforcement Learning : reset
To compare performance fairly during the trial-and-error process, the agent usually resets itself to the same initial situation (belief state) before each trial.
[Figure: Hallway2 map; every trial starts from the same position.]
This is usually done by “offline” simulation.
Reinforcement Learning : without reset
Assume a realistic situation in which an agent starts in an unknown environment and must follow one continuous and uninterrupted chain of experience with no access to ‘resets’ or ‘offline’ simulation.
The paper presents an algorithm built on an approximate reset strategy, or homing strategy:
A homing strategy exists in every POMDP.
By performing the homing strategy, the agent approximately resets its current belief state to within ε of the (unknown) initial belief state.
The algorithm balances exploration and exploitation.
A homing strategy is a sequence of actions that achieves the approximate reset. While homing, the agent is neither exploring nor exploiting.
Homing Strategies
Definition 1: H is an (ε, k)-approximate reset (or homing) strategy if for
every two belief states b1 and b2, we have ||H_E(b1) − H_E(b2)||_1 ≤ ε, where
H(b) is the (random) belief state reached from b after the k homing actions
of H, and H_E(b) = E[H(b)] is its expectation.
This definition states that H approximately resets the belief, but the approximation quality ε could be poor.
The next lemma amplifies the accuracy.
Homing Strategies
Lemma 1 (accuracy amplification): Suppose that H is an (ε, k)-approximate
reset; then H^ℓ is an (ε^ℓ, kℓ)-approximate reset, where H^ℓ consecutively
implements H ℓ times:

b1 →H →H … →H H^ℓ(b1),  b2 →H →H … →H H^ℓ(b2),  ||H_E^ℓ(b1) − H_E^ℓ(b2)||_1 ≤ ε^ℓ
Lemma 2 (existence of a homing strategy): For every POMDP, the random
walk strategy (including the 'stay' action) constitutes an (ε, k)-approximate
reset for some k ≥ 1 and 0 < ε < 1/2.
Assumption: the POMDP is connected, i.e., for any states s, s', there exists a
strategy that reaches s' with positive probability starting from s. The random walk must include the 'stay' action to avoid being trapped in a loop.
By these two lemmas, for any POMDP we can at least use the random walk to achieve an approximate reset of any desired accuracy.
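Both lemmas can be checked numerically when the model is known: the expected belief after k random-walk steps is b·M^k, where M is the action-averaged transition matrix, so the L1 gap between any two starting beliefs shrinks with k, and repeating the homing block amplifies the accuracy. A sketch (numpy; the 3-state, 2-action matrices are hypothetical):

```python
import numpy as np

# Expected-belief operator of a uniform random walk: M is the average of the
# per-action transition matrices (hypothetical 3-state, 2-action POMDP).
Ta = np.array([[0.7, 0.3, 0.0],   # a 'stay'-like, sluggish action
               [0.3, 0.4, 0.3],
               [0.0, 0.3, 0.7]])
Tb = np.array([[0.2, 0.5, 0.3],   # a more mixing action
               [0.5, 0.0, 0.5],
               [0.3, 0.5, 0.2]])
M = 0.5 * (Ta + Tb)

def homing_gap(b1, b2, k):
    """L1 distance between expected beliefs after k random-walk steps."""
    Mk = np.linalg.matrix_power(M, k)
    return np.abs(b1 @ Mk - b2 @ Mk).sum()

b1 = np.array([1.0, 0.0, 0.0])    # two maximally different starting beliefs
b2 = np.array([0.0, 0.0, 1.0])
eps_k = homing_gap(b1, b2, 5)     # one homing block: an (eps, k)-reset
eps_2k = homing_gap(b1, b2, 10)   # two blocks: accuracy is amplified (Lemma 1)
print(eps_k, eps_2k)
```

Running the block twice drives the gap down multiplicatively, which is the content of the amplification lemma.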
Reinforcement Learning with Homing
Algorithm 1: model-free algorithm
Input: H, a (1/2, K_H)-approximate reset strategy, e.g., the random walk

for t = 1 to ∞ do
    // Exploration in phase t
    foreach policy π in Π_t do
        for i = 1 to k1_t do
            Run π for t steps;
            Repeatedly run H log(1/ε_t) times;   // homing
        end
        Let v_π be the average return of π over these k1_t trials;
    end
    Let π̂_t* = argmax_π v_π;
    // Exploitation in phase t
    for i = 1 to k2_t do
        Run π̂_t* for t steps;
        Repeatedly run H log(1/ε_t) times;       // homing
    end
end

where k1_t = O((t²/ε_t²) log|Π_t|) and k2_t = O((1/ε_t)([current time] + T·[time in the (t-1)-th exploration phase])).

Notation: t = horizon length; k1_t = number of exploration trials; k2_t = number of exploitation trials; π̂_t* = estimated optimal policy; Π_t = set of all possible t-step policies.
Reinforcement Learning with Homing
Reason for the choice of k1_t (number of exploration trials) and k2_t (number of exploitation trials):
run enough trials to guarantee convergence of the estimated average reward.
Algorithm 1: model-free algorithm
There is no relationship between the t-th and the (t+1)-th iteration.
Very inefficient, since it tests all possible policies.
Practically impossible to implement.
Definition: A history h is a sequence of actions, rewards and observations of some finite length, i.e., h={(a1,r1,o1), …, (at,rt,ot)}.
What is a policy in this model-free POMDP?
A policy is defined as a mapping from histories to actions.
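This makes the inefficiency concrete: a deterministic t-step policy must choose an action at every node of a depth-t tree that branches on the observation received, so the number of candidate policies grows doubly exponentially in t. A quick count using the Hallway2 sizes |A| = 5 and |O| = 17 (the node-count formula assumes branching on observations only, which is one common way to count):

```python
# How many deterministic t-step policies exist? A policy tree assigns an
# action to each node, and the tree has (|O|**t - 1)//(|O| - 1) nodes.
A, O = 5, 17   # action and observation counts from the Hallway2 example

def num_policies(t):
    nodes = (O**t - 1) // (O - 1)    # nodes in the depth-t observation tree
    return A**nodes

for t in (1, 2, 3):
    print(t, num_policies(t))
# t = 1 gives 5 policies; t = 2 already gives 5**18; t = 3 is astronomical.
```

Even at horizon 3 there are far more policies than could ever be enumerated, which is why Algorithm 1 is only of theoretical interest.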
Approximate reset: since H is a (1/2, K_H)-approximate reset, running it log₂(1/ε_t) times gives
||H_E^{log(1/ε_t)}(b1) − H_E^{log(1/ε_t)}(b2)||_1 ≤ (1/2)^{log₂(1/ε_t)} = ε_t.
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm
Input: H, a (1/2, K_H)-approximate reset strategy, e.g., the random walk; let L = |A|·|O|

for t = 1 to ∞ do
    // Exploration in phase t
    for k1_t times do
        Run RANDOM WALK for t+1 steps;
        Repeatedly run H log(Lt/ε_t) times;    // homing
    end
    // Model update in phase t
    foreach h ∈ H_t, a ∈ A, o ∈ O do
        P̂(o | b0, h, a) := 0;
        if P̂((h,a) | b0) ≥ ε_t / L^t then
            P̂(o | b0, h, a) := P̂((h,a,o) | b0) / P̂((h,a) | b0);
    end
    Compute π̂_t* using P̂(o | b0, h, a);
    // Exploitation in phase t
    for k2_t times do
        Run π̂_t* for t steps;
        Repeatedly run H log(Lt/ε_t) times;    // homing
    end
end

where k1_t = O((t² L^{4t} / ε_t²) log|H_t|) and k2_t = O((1/ε_t)([current time] + T·[time in the (t-1)-th exploration phase])).

Notation: H_t = set of all possible histories.
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm
(h, a, o) is the history h followed by (a, o).
This POMDP is equivalent to an MDP whose states are the histories, so we can compute the policy π̂_t* from P̂(o | b0, h, a).
Again, there is no relationship between the t-th and the (t+1)-th iteration.
Instead of trying all policies as in Algorithm 1, Algorithm 2 uses the sparse model parameters P̂(o | b0, h, a) to compute the policy.
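The estimates P̂(o | b0, h, a) can be built by simple counting over the random-walk trajectories collected after each approximate reset. A sketch with hypothetical trajectory data (the threshold parameter stands in for the ε_t/L^t cutoff in the algorithm):

```python
from collections import Counter

# Estimate P(o | b0, h, a) by counting over trajectories that all start from
# (approximately) the same belief b0. Each trajectory is a list of (a, o)
# pairs; the data below is invented for illustration.
trajectories = [
    [('left', 'wall'), ('fwd', 'open')],
    [('left', 'wall'), ('fwd', 'wall')],
    [('left', 'open'), ('fwd', 'open')],
    [('left', 'wall'), ('fwd', 'open')],
]

prefix_counts = Counter()
for traj in trajectories:
    for i in range(len(traj)):
        h_a = (tuple(traj[:i]), traj[i][0])    # (history prefix, next action)
        prefix_counts[(h_a, traj[i][1])] += 1  # count the observation too
        prefix_counts[h_a] += 1

def p_hat(o, h, a, threshold=0.0):
    """P-hat(o | b0, h, a); 0 when the prefix is seen too rarely (the cutoff)."""
    denom = prefix_counts[(h, a)]
    if denom == 0 or denom / len(trajectories) < threshold:
        return 0.0
    return prefix_counts[((h, a), o)] / denom

h = (('left', 'wall'),)
print(p_hat('open', h, 'fwd'))  # 2 of the 3 matching trajectories saw 'open'
```

The cutoff matters because ratios of tiny counts are unreliable; the algorithm zeroes out rarely visited histories for exactly that reason.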
Conclusion
The authors give an approach of approximate reset for the lifelong-learning setting, in which the agent is never allowed to be reset.
A model-free algorithm and a model-based algorithm are suggested.
The model-free algorithm is inefficient.
Reference
Eyal Even-Dar, Sham M. Kakade, Yishay Mansour, "Reinforcement Learning in POMDPs without Resets", 19th IJCAI, July 2005.
Mance E. Harmon, Stephanie S. Harmon, "Reinforcement Learning: A Tutorial".
Website about reinforcement learning: http://www-anw.cs.umass.edu/rlr/