Reinforcement Learning in POMDPs Without Resets
IJCAI, August 2005
E. Even-Dar, S. M. Kakade & Y. Mansour
Presented by Lihan He
Machine Learning Reading Group
Duke University
07/29/2005
During reinforcement learning in a POMDP, we usually reset the agent
to the same situation at the beginning of each attempt. This guarantees
that the agent starts from the same point, so that the comparison of
rewards across attempts is fair.
This paper gives an approach of approximate reset for the situation
where the agent cannot be exactly reset. The authors prove
that by applying this approximate reset, or homing strategy, the agent
moves toward a reset within a given tolerance, in the sense of expectation.
Outline
POMDP, policy, and horizon length
Reinforcement learning
Homing strategies
Two algorithms of reinforcement learning with homing:
(1) model-free; (2) model-based
Conclusion
POMDP
POMDP = HMM + controllable actions.
A POMDP model is defined by the tuple < S, A, T, R, Ω, O >: states, actions, transition probabilities, rewards, observations, and the observation function.
An example: Hallway2 – navigation problem.
[Figure: map of the Hallway2 navigation environment, a grid of rooms with a goal location.]
89 states: 4 orientations in each of 22 rooms, plus a goal state.
17 observations: all combinations of walls, plus a 'star' at the goal.
5 actions: stay in place, move forward, turn right, turn left, turn around.
The state is hidden: the agent cannot determine its current state from the current observation alone (wall / no wall in the 4 directions).
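Because the state is hidden, the agent maintains a belief state b(s) and updates it by Bayes' rule after each action and observation. A minimal sketch of this update (numpy; the 3-state toy model and all numbers below are invented for illustration, not taken from Hallway2):

```python
import numpy as np

# Toy POMDP: 3 states, one action shown, 2 observations (hypothetical numbers).
T = np.array([[0.8, 0.2, 0.0],   # T[s, s'] = P(s' | s, a) for a fixed action a
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
O = np.array([[0.9, 0.1],        # O[s', o] = P(o | s', a)
              [0.5, 0.5],
              [0.1, 0.9]])

def belief_update(b, o):
    """Bayes filter: predict through T, then correct by the likelihood of o."""
    b_pred = b @ T                 # P(s' | b, a)
    b_new = b_pred * O[:, o]       # unnormalized posterior
    return b_new / b_new.sum()

b = np.array([1/3, 1/3, 1/3])      # uniform initial belief
b = belief_update(b, o=1)
print(b)                           # posterior still sums to 1
```

Observing o=1 (likely in state 2) shifts the belief mass toward state 2, which is exactly how the agent localizes itself despite the hidden state.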
POMDP policy
A policy π is a mapping from belief states b to actions a; it tells the agent which action to take given its current belief state.
T-horizon optimal policy: the algorithm looks only T steps ahead to maximize the expected reward value V.
T = 1: consider only the immediate reward. T = ∞: consider all discounted future rewards.
V_T^*(b) = max_{a∈A} [ R(b,a) + Σ_{k=2}^{T} E(reward at the k-th step ahead) ]

where the immediate reward R(b,a) is also a function of the action a.

[Figure: V_T^* as a function of the horizon length T; V_T^* increases toward V^* as T → ∞.]
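The convergence of V_T^* to V^* in the figure can be reproduced numerically. A small sketch using finite-horizon value iteration on a fully observed toy model (a 2-state, 2-action MDP with invented numbers; exact POMDP value iteration over beliefs would need alpha-vectors and is omitted here):

```python
import numpy as np

# Finite-horizon value iteration on a tiny 2-state, 2-action MDP
# (hypothetical numbers) to show V_T^* approaching V^* as T grows.
gamma = 0.9
T = np.array([  # T[a, s, s']
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
R = np.array([  # R[s, a]
    [0.0, 1.0],
    [2.0, 0.0],
])

def v_horizon(T_steps):
    """Optimal T-step value function via backward induction."""
    V = np.zeros(2)
    for _ in range(T_steps):
        Q = R + gamma * np.einsum('ast,t->sa', T, V)  # Q[s, a]
        V = Q.max(axis=1)
    return V

vals = [v_horizon(t)[0] for t in (1, 5, 50)]
print(vals)  # non-decreasing, converging toward the infinite-horizon value
```

With nonnegative rewards the sequence V_T^* is monotone in T, matching the curve on the slide.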
Reinforcement Learning
How can the agent obtain an optimal policy if it does not know the model parameters (state transition probabilities T(s,a,s') and the observation function O(a,s',o)), or even the structure of the model?
Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
Model-based: first estimate the model parameters by exploring the environment, then derive a policy from this model. During exploration, the agent continuously improves both the model and the policy.
Model-free: discard the model entirely and find a policy directly by exploring the environment. The algorithm typically searches the space of behaviors for the best performance.
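As a concrete (if simplified) model-free example, tabular Q-learning learns action values purely from trial-and-error interaction, never estimating T or R. The sketch below uses a fully observed 2-state toy environment with invented dynamics; in a POMDP the learner would have to condition on observation histories instead of states:

```python
import random

# Minimal model-free learning: tabular Q-learning on a 2-state, 2-action
# toy problem (hypothetical dynamics hidden inside step()).
random.seed(0)

def step(s, a):
    # Hidden environment: in state 0, action 1 usually reaches the
    # rewarding state 1; in state 1, action 0 pays 2 and returns to 0.
    if s == 0:
        return (1, 1.0) if (a == 1 and random.random() < 0.8) else (0, 0.0)
    return (0, 2.0) if a == 0 else (1, 0.0)

Q = [[0.0, 0.0], [0.0, 0.0]]
alpha, gamma, eps = 0.1, 0.9, 0.1
s = 0
for _ in range(20000):
    # epsilon-greedy action selection
    a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
    s2, r = step(s, a)
    # temporal-difference update toward r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    s = s2

print(Q)  # greedy policy: action 1 in state 0, action 0 in state 1
```

The learner discovers the good behavior (take action 1 in state 0, action 0 in state 1) without ever representing the transition probabilities, which is the defining trait of the model-free approach.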
Reinforcement Learning : reset
To compare performance fairly during the trial-and-error process, the agent usually resets itself to the same initial situation (belief state) before each trial.
[Figure: Hallway2 map; every trial starts from the same position.]
This is usually done by “offline” simulation.
Reinforcement Learning : without reset
Assume a realistic situation in which an agent starts in an unknown environment and must follow one continuous and uninterrupted chain of experience with no access to ‘resets’ or ‘offline’ simulation.
The paper presents an algorithm built on an approximate reset strategy, or homing strategy:
A homing strategy exists in every POMDP.
By performing the homing strategy, the agent approximately resets its current belief state to within ε of the (unknown) initial belief state.
The algorithm balances exploration and exploitation.
A homing strategy is a sequence of actions that achieves the approximate reset. While homing, the agent is neither exploring nor exploiting.
Homing Strategies
Definition 1: H is an (ε, k)-approximate reset (or homing) strategy if for
every two belief states b1 and b2, we have ||H_E(b1) − H_E(b2)||_1 ≤ ε, where
H(b) is the (random) belief state reached from b after the k homing actions
of H, and H_E(b) = E[H(b)] is its expectation.
This definition states that H approximately resets the belief, but the approximation quality ε could be poor.
The next lemma amplifies the accuracy.
Homing Strategies
Lemma 1 (accuracy amplification): Suppose that H is an (ε, k)-approximate
reset; then H^ℓ is an (ε^ℓ, kℓ)-approximate reset, where H^ℓ consecutively
implements H ℓ times:

b1 →H →H … →H H^ℓ(b1),  b2 →H →H … →H H^ℓ(b2),  ||H_E^ℓ(b1) − H_E^ℓ(b2)||_1 ≤ ε^ℓ
Lemma 2 (existence of a homing strategy): For every POMDP, the random
walk strategy (including the 'stay' action) constitutes an (ε, k)-approximate
reset for some k ≥ 1 and 0 < ε < 1/2.
Assumption: the POMDP is connected, i.e., for any states s, s', there exists a
strategy that reaches s' with positive probability starting from s. The random walk must include the 'stay' action to avoid being trapped in a loop.
By these two lemmas, for any POMDP we can at least use the random walk to achieve an approximate reset of any desired accuracy.
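Both lemmas can be checked numerically when the model is known: the expected belief after k random-walk steps is b·M^k, where M is the action-averaged transition matrix, so the L1 gap between any two starting beliefs shrinks with k, and repeating the homing block amplifies the accuracy. A sketch (numpy; the 3-state, 2-action matrices are hypothetical):

```python
import numpy as np

# Expected-belief operator of a uniform random walk: M is the average of the
# per-action transition matrices (hypothetical 3-state, 2-action POMDP).
Ta = np.array([[0.7, 0.3, 0.0],   # a 'stay'-like, sluggish action
               [0.3, 0.4, 0.3],
               [0.0, 0.3, 0.7]])
Tb = np.array([[0.2, 0.5, 0.3],   # a more mixing action
               [0.5, 0.0, 0.5],
               [0.3, 0.5, 0.2]])
M = 0.5 * (Ta + Tb)

def homing_gap(b1, b2, k):
    """L1 distance between expected beliefs after k random-walk steps."""
    Mk = np.linalg.matrix_power(M, k)
    return np.abs(b1 @ Mk - b2 @ Mk).sum()

b1 = np.array([1.0, 0.0, 0.0])    # two maximally different starting beliefs
b2 = np.array([0.0, 0.0, 1.0])
eps_k = homing_gap(b1, b2, 5)     # one homing block: an (eps, k)-reset
eps_2k = homing_gap(b1, b2, 10)   # two blocks: accuracy is amplified (Lemma 1)
print(eps_k, eps_2k)
```

Running the block twice drives the gap down multiplicatively, which is the content of the amplification lemma.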
Reinforcement Learning with Homing
Algorithm 1: model-free algorithm
Input: H, a (1/2, K_H)-approximate reset strategy, e.g., the random walk

for t = 1 to ∞ do
    // Exploration in phase t
    foreach policy π in Π_t do
        for i = 1 to k1_t do
            Run π for t steps;
            Repeatedly run H log(1/ε_t) times;   // homing
        end
        Let v_π be the average return of π over these k1_t trials;
    end
    Let π̂_t* = argmax_π v_π;
    // Exploitation in phase t
    for i = 1 to k2_t do
        Run π̂_t* for t steps;
        Repeatedly run H log(1/ε_t) times;       // homing
    end
end

where k1_t = O((t²/ε_t²) log|Π_t|) and k2_t = O((1/ε_t)([current time] + T·[time in the (t-1)-th exploration phase])).

Notation: t = horizon length; k1_t = number of exploration trials; k2_t = number of exploitation trials; π̂_t* = estimated optimal policy; Π_t = set of all possible t-step policies.
Reinforcement Learning with Homing
Reason for the choice of k1_t (number of exploration trials) and k2_t (number of exploitation trials):
run enough trials to guarantee convergence of the estimated average reward.
Algorithm 1: model-free algorithm
There is no relationship between the t-th and the (t+1)-th iteration.
Very inefficient, since it tests all possible policies.
Practically impossible to implement.
Definition: A history h is a sequence of actions, rewards and observations of some finite length, i.e., h={(a1,r1,o1), …, (at,rt,ot)}.
What is a policy in this model-free POMDP?
A policy is defined as a mapping from histories to actions.
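This makes the inefficiency concrete: a deterministic t-step policy must choose an action at every node of a depth-t tree that branches on the observation received, so the number of candidate policies grows doubly exponentially in t. A quick count using the Hallway2 sizes |A| = 5 and |O| = 17 (the node-count formula assumes branching on observations only, which is one common way to count):

```python
# How many deterministic t-step policies exist? A policy tree assigns an
# action to each node, and the tree has (|O|**t - 1)//(|O| - 1) nodes.
A, O = 5, 17   # action and observation counts from the Hallway2 example

def num_policies(t):
    nodes = (O**t - 1) // (O - 1)    # nodes in the depth-t observation tree
    return A**nodes

for t in (1, 2, 3):
    print(t, num_policies(t))
# t = 1 gives 5 policies; t = 2 already gives 5**18; t = 3 is astronomical.
```

Even at horizon 3 there are far more policies than could ever be enumerated, which is why Algorithm 1 is only of theoretical interest.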
Approximate reset: since H is a (1/2, K_H)-approximate reset, running it log₂(1/ε_t) times gives
||H_E^{log(1/ε_t)}(b1) − H_E^{log(1/ε_t)}(b2)||_1 ≤ (1/2)^{log₂(1/ε_t)} = ε_t.
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm
Input: H, a (1/2, K_H)-approximate reset strategy, e.g., the random walk; let L = |A|·|O|

for t = 1 to ∞ do
    // Exploration in phase t
    for k1_t times do
        Run RANDOM WALK for t+1 steps;
        Repeatedly run H log(Lt/ε_t) times;    // homing
    end
    // Model update in phase t
    foreach h ∈ H_t, a ∈ A, o ∈ O do
        P̂(o | b0, h, a) := 0;
        if P̂((h,a) | b0) ≥ ε_t / L^t then
            P̂(o | b0, h, a) := P̂((h,a,o) | b0) / P̂((h,a) | b0);
    end
    Compute π̂_t* using P̂(o | b0, h, a);
    // Exploitation in phase t
    for k2_t times do
        Run π̂_t* for t steps;
        Repeatedly run H log(Lt/ε_t) times;    // homing
    end
end

where k1_t = O((t² L^{4t} / ε_t²) log|H_t|) and k2_t = O((1/ε_t)([current time] + T·[time in the (t-1)-th exploration phase])).

Notation: H_t = set of all possible histories.
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm
(h, a, o) is the history h followed by (a, o).
This POMDP is equivalent to an MDP whose states are the histories, so we can compute the policy π̂_t* from P̂(o | b0, h, a).
Again, there is no relationship between the t-th and the (t+1)-th iteration.
Instead of trying all policies as in Algorithm 1, Algorithm 2 uses the sparse model parameters P̂(o | b0, h, a) to compute the policy.
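The estimates P̂(o | b0, h, a) can be built by simple counting over the random-walk trajectories collected after each approximate reset. A sketch with hypothetical trajectory data (the threshold parameter stands in for the ε_t/L^t cutoff in the algorithm):

```python
from collections import Counter

# Estimate P(o | b0, h, a) by counting over trajectories that all start from
# (approximately) the same belief b0. Each trajectory is a list of (a, o)
# pairs; the data below is invented for illustration.
trajectories = [
    [('left', 'wall'), ('fwd', 'open')],
    [('left', 'wall'), ('fwd', 'wall')],
    [('left', 'open'), ('fwd', 'open')],
    [('left', 'wall'), ('fwd', 'open')],
]

prefix_counts = Counter()
for traj in trajectories:
    for i in range(len(traj)):
        h_a = (tuple(traj[:i]), traj[i][0])    # (history prefix, next action)
        prefix_counts[(h_a, traj[i][1])] += 1  # count the observation too
        prefix_counts[h_a] += 1

def p_hat(o, h, a, threshold=0.0):
    """P-hat(o | b0, h, a); 0 when the prefix is seen too rarely (the cutoff)."""
    denom = prefix_counts[(h, a)]
    if denom == 0 or denom / len(trajectories) < threshold:
        return 0.0
    return prefix_counts[((h, a), o)] / denom

h = (('left', 'wall'),)
print(p_hat('open', h, 'fwd'))  # 2 of the 3 matching trajectories saw 'open'
```

The cutoff matters because ratios of tiny counts are unreliable; the algorithm zeroes out rarely visited histories for exactly that reason.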
Conclusion
The authors give an approach of approximate reset for the lifelong-learning setting, in which the agent is never allowed to be reset.
A model-free algorithm and a model-based algorithm are suggested.
The model-free algorithm is inefficient.
Reference
Eyal Even-Dar, Sham M. Kakade, Yishay Mansour, "Reinforcement Learning in POMDPs without Resets", 19th IJCAI, July 2005.
Mance E. Harmon, Stephanie S. Harmon, "Reinforcement Learning: A Tutorial".
Website about reinforcement learning: http://www-anw.cs.umass.edu/rlr/