Optimal Tuning of Continual Online Exploration in Reinforcement Learning Youssef Achbany, Francois...

Optimal Tuning of Continual Online Exploration in Reinforcement Learning

Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens

Information Systems Research Unit (ISYS), Université de Louvain

Belgium


Outline

Introduction
Mathematical concepts
Modelling exploration by entropy
Optimal policy
Preliminary experiments
Conclusion and further work


Introduction

One of the challenges of reinforcement learning is to manage the tradeoff between exploration and exploitation.

Exploitation aims to capitalize on already well-established solutions.

Exploration aims to continually try new ways of solving the problem. It is relevant when the environment is changing.


Introduction: simple routing problem

The goal is to reach a destination node (13) from an initial node (1) while minimizing the total cost.

For each node there is a set of admissible actions, each with an associated weight (cost). We define a probability distribution on the set of admissible actions.

[Figure: the routing network, nodes 1 to 14, with a cost attached to each edge.]

Mathematical concepts

We have a set of states, $S = \{1, 2, \ldots, n\}$; $s_t = k$ means that the system is in state $k$ at time $t$.

In each state $s = k$, we have a set of admissible control actions, $U(k)$, so that $u(k) \in U(k)$ is a control action available at state $k$.



When we choose action $u(s_t)$ at state $s_t$, a bounded cost $C(u(s_t) \mid s_t) < \infty$ is incurred and the system jumps to state $s_{t+1} = f(u(s_t) \mid s_t)$, where $f$ is a function.

We suppose the network of states does not contain any negative cycle.
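The definitions above can be sketched as a small data structure. This is only an illustration: the 4-state graph, the action names and the helper functions are hypothetical and are not the 14-node network of the slides.

```python
# Hypothetical 4-state routing problem (NOT the 14-node network of the slides).
# GRAPH[k] maps each admissible action u in U(k) to (cost C(u|k), next state f(u|k)).
GRAPH = {
    1: {"to2": (1.0, 2), "to3": (3.0, 3)},
    2: {"to3": (1.0, 3), "to4": (3.0, 4)},
    3: {"to4": (1.0, 4)},
    4: {},  # destination state: nothing left to do
}


def admissible_actions(k):
    """Return U(k), the actions available in state k."""
    return list(GRAPH[k])


def step(k, u):
    """Apply action u in state k; return (cost incurred, next state)."""
    return GRAPH[k][u]


def run(start, actions):
    """Follow a fixed sequence of actions and accumulate the total cost."""
    state, total = start, 0.0
    for u in actions:
        cost, state = step(state, u)
        total += cost
    return total, state
```

For example, `run(1, ["to2", "to3", "to4"])` follows the path 1, 2, 3, 4 and sums the three edge costs.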



For each state $s$, we define a probability distribution on the set of admissible actions, $P(u(s) \mid s)$, meaning that the choice of action is randomized. This introduces exploration, not only exploitation. This is the main contribution of our work.



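For instance, if there are three admissible actions in state $s = k$, the probability distribution $P(u(k) \mid s = k)$ involves three values.

[Figure: state $k$ with three outgoing actions $u_{k1}$, $u_{k2}$, $u_{k3}$, chosen with probabilities $P(u_{k1} \mid k)$, $P(u_{k2} \mid k)$, $P(u_{k3} \mid k)$.]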
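A randomized choice among three actions can be sketched as follows. The probability values and the action names are made up for illustration; only the idea of drawing an action from $P(u(k) \mid s = k)$ comes from the slide.

```python
import random

# Hypothetical distribution P(u|k) over three admissible actions in state k.
P_k = {"u_k1": 0.6, "u_k2": 0.3, "u_k3": 0.1}


def choose_action(distribution, rng):
    """Draw one action at random according to its probability."""
    actions = list(distribution)
    weights = [distribution[u] for u in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [choose_action(P_k, rng) for _ in range(10_000)]
# Empirical frequencies should be close to the probabilities in P_k.
frequencies = {u: draws.count(u) / len(draws) for u in P_k}
```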



The policy is defined as the set of all probability distributions for all states

[Figure: the routing network, nodes 1 to 14, with a cost attached to each edge.]


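Mathematical concepts

The goal is to reach a destination state, $s = d$, from an initial state, $s_0 = k_0$, while minimizing the total expected cost

$V(k_0) = E\left[ \sum_{t \ge 0} C(u(s_t) \mid s_t) \,\middle|\, s_0 = k_0 \right]$

where the sum runs until the destination is reached. The expectation is taken over the policy, that is, over all the random variables $u(k)$ associated with the states.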
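As a rough sketch, the expected cost of a given randomized policy can be computed by repeatedly averaging "direct cost plus expected cost of the next state" over the actions. The graph, the policy values and the function name below are hypothetical.

```python
# Hypothetical problem: GRAPH[k][u] = (cost C(u|k), next state f(u|k)); state 4 is d.
GRAPH = {
    1: {"to2": (1.0, 2), "to3": (3.0, 3)},
    2: {"to3": (1.0, 3), "to4": (3.0, 4)},
    3: {"to4": (1.0, 4)},
    4: {},
}
# Hypothetical randomized policy: POLICY[k][u] = P(u|k).
POLICY = {
    1: {"to2": 0.5, "to3": 0.5},
    2: {"to3": 0.5, "to4": 0.5},
    3: {"to4": 1.0},
}


def expected_cost(graph, policy, destination, sweeps=100):
    """Iterate V(k) = sum_u P(u|k) * (C(u|k) + V(f(u|k))) with V(d) = 0."""
    V = {k: 0.0 for k in graph}
    for _ in range(sweeps):
        for k in graph:
            if k == destination:
                continue
            V[k] = sum(
                p * (graph[k][u][0] + V[graph[k][u][1]])
                for u, p in policy[k].items()
            )
    return V


V = expected_cost(GRAPH, POLICY, destination=4)
```

On this small acyclic graph the values settle after a few sweeps; the fixed number of sweeps is a simplification, not a convergence test.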


Mathematical concepts

In other words, we have to determine the best policy, that is, the probability distributions that minimize $V(k_0)$.

This is standard, except that we introduce randomized choice.


Mathematical concepts

We now introduce a way to control exploration: the degree of exploration, $E_k$, defined for each state $k$ as the entropy of the probability distribution of actions in that state.


Modelling exploration by entropy

The degree of exploration, $E_k$, is defined as the entropy at state $k$:

$E_k = -\sum_{i \in U(k)} P(i \mid k) \log P(i \mid k)$

The minimum is 0 (no exploration). The maximum is $\log(n_k)$, where $n_k$ is the number of admissible actions in state $k$ (full exploration).


Modelling exploration by entropy

The exploration rate is defined as the ratio $E_k / \log(n_k)$, where $n_k$ is the number of admissible actions in state $k$, and takes its value between 0 (no exploration) and 1 (full exploration).
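A minimal sketch of both quantities, assuming natural logarithms and the usual convention that a zero probability contributes nothing to the entropy. The function names are illustrative.

```python
import math


def entropy(probabilities):
    """Entropy of a distribution over the admissible actions of one state."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def exploration_rate(probabilities):
    """Entropy divided by its maximum log(n_k); taken as 0 when n_k < 2."""
    n_k = len(probabilities)
    if n_k < 2:
        return 0.0
    return entropy(probabilities) / math.log(n_k)


deterministic = [1.0, 0.0, 0.0]   # always the same action: no exploration
uniform = [1 / 3, 1 / 3, 1 / 3]   # all actions equally likely: full exploration
mixed = [0.6, 0.3, 0.1]           # somewhere in between
```

Here `exploration_rate(deterministic)` is 0, `exploration_rate(uniform)` is 1 up to rounding, and `exploration_rate(mixed)` lies strictly between the two.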


Modelling exploration by entropy

The goal now is to determine the optimal policy under exploration constraints, that is, to seek the policy $\pi^*$, among the admissible policies, for which the expected cost, $V(k_0)$, is minimal while guaranteeing a given degree of exploration (entropy) in each state $k$.


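Modelling exploration by entropy

In other words,

$\pi^* = \arg\min_{\pi} V(k_0)$ subject to $-\sum_{i \in U(k)} P(i \mid k) \log P(i \mid k) = E_k$ for every state $k$

where the $E_k$ are fixed by the user or designer. They control the degree of exploration at each node $k$.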


Modelling exploration by entropy

Thus, we route the agents as fast as possible while exploring the network.

[Figure: the routing network, nodes 1 to 14, with a cost attached to each edge.]


Optimal policy

Here are the necessary optimality conditions (for a local minimum), very similar to Bellman's equations.

$V^*(k)$ is the optimal expected cost from state $k$.

$P(i \mid k)$ is the probability of choosing action $i$ in state $k$; it satisfies the entropy constraint through the parameter $\theta_k$.


Optimal policy

These conditions lead to the following updating rules. Convergence has been proved in a stationary environment.
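The updating rules themselves are not reproduced in this transcript. The sketch below is therefore an assumption, not the authors' stated rule: it supposes that the action probabilities take a Boltzmann (softmax) form with one parameter per state, written `theta` here, which is consistent with the two limiting cases on the later slides (a large value gives the shortest path, zero gives uniform choice). The graph, the function names and the bisection search for `theta` are all hypothetical.

```python
import math

# Hypothetical problem: GRAPH[k][u] = (cost C(u|k), next state f(u|k)); state 4 is d.
GRAPH = {
    1: {"to2": (1.0, 2), "to3": (3.0, 3)},
    2: {"to3": (1.0, 3), "to4": (3.0, 4)},
    3: {"to4": (1.0, 4)},
    4: {},
}


def boltzmann(costs, theta):
    """ASSUMED form: P(i|k) proportional to exp(-theta * cost_i)."""
    low = min(costs)  # shift by the minimum for numerical stability
    weights = [math.exp(-theta * (c - low)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def theta_for_entropy(costs, target, iterations=100):
    """Bisection: the entropy of the Boltzmann distribution shrinks as theta grows."""
    lo, hi = 0.0, 1.0
    while entropy(boltzmann(costs, hi)) > target and hi < 1e6:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if entropy(boltzmann(costs, mid)) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tune(graph, destination, rate, sweeps=50):
    """Alternate policy and value updates under a fixed exploration rate per state."""
    V = {k: 0.0 for k in graph}
    policy = {}
    for _ in range(sweeps):
        for k, actions in graph.items():
            if k == destination:
                continue
            names = list(actions)
            # Direct cost plus expected cost from the state that is reached.
            totals = [actions[u][0] + V[actions[u][1]] for u in names]
            target = rate * math.log(len(names))  # prescribed entropy E_k
            probs = boltzmann(totals, theta_for_entropy(totals, target))
            policy[k] = dict(zip(names, probs))
            V[k] = sum(p * c for p, c in zip(probs, totals))
    return V, policy


V, policy = tune(GRAPH, destination=4, rate=0.3)
```

With a 30% exploration rate the cheaper action in each state keeps most of the probability, and the expected cost from state 1 ends up between the shortest-path cost and the cost of a uniformly random walk. If two actions have exactly the same total cost, no `theta` can lower the entropy and the search simply stops at its upper limit.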


Optimal policy

This updating rule has a nice interpretation: route the agents preferably (with probability $P(i \mid k)$) to the state from which the expected cost, including the direct cost of reaching that state, is minimal.

[Figure: the routing network, nodes 1 to 14, with a cost attached to each edge.]


Optimal policy

If $\theta_k$ is large (zero entropy: no exploration), we obtain

$V^*(k) = \min_{i \in U(k)} \left[ C(i \mid k) + V^*(f(i \mid k)) \right]$

which is the common value iteration algorithm, or Bellman's equation, for finding the shortest path.
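For reference, the no-exploration limit can be sketched as plain value iteration on a small graph. The 4-state graph and the function names are hypothetical.

```python
# Hypothetical problem: GRAPH[k][u] = (cost C(u|k), next state f(u|k)); state 4 is d.
GRAPH = {
    1: {"to2": (1.0, 2), "to3": (3.0, 3)},
    2: {"to3": (1.0, 3), "to4": (3.0, 4)},
    3: {"to4": (1.0, 4)},
    4: {},
}


def value_iteration(graph, destination, sweeps=50):
    """Iterate V(k) = min over actions of (direct cost + V(next state)), V(d) = 0."""
    V = {k: 0.0 for k in graph}
    for _ in range(sweeps):
        for k, actions in graph.items():
            if k != destination:
                V[k] = min(cost + V[nxt] for cost, nxt in actions.values())
    return V


def greedy_policy(graph, V, destination):
    """In each state, pick the action with the smallest cost-to-go."""
    return {
        k: min(actions, key=lambda u: actions[u][0] + V[actions[u][1]])
        for k, actions in graph.items()
        if k != destination
    }


V = value_iteration(GRAPH, destination=4)
best = greedy_policy(GRAPH, V, destination=4)
```

On this graph the shortest path from state 1 goes through states 2 and 3, so `best` selects `"to2"`, `"to3"` and `"to4"` in turn.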


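Optimal policy

If $\theta_k$ is zero (maximum entropy: full exploration), we perform a blind exploration, without taking the costs into consideration when choosing actions, and we estimate the "average first passage time". Each action is chosen with probability $P(i \mid k) = 1/n_k$, so that

$V^*(k) = \frac{1}{n_k} \sum_{i \in U(k)} \left[ C(i \mid k) + V^*(f(i \mid k)) \right]$

where $n_k$ is the number of admissible actions in state $k$.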
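The full-exploration limit can be sketched in the same way, with every admissible action equally likely. The graph, the function name and the `unit_costs` switch are hypothetical; with unit costs the value counts the average number of steps needed to reach the destination.

```python
# Hypothetical problem: GRAPH[k][u] = (cost C(u|k), next state f(u|k)); state 4 is d.
GRAPH = {
    1: {"to2": (1.0, 2), "to3": (3.0, 3)},
    2: {"to3": (1.0, 3), "to4": (3.0, 4)},
    3: {"to4": (1.0, 4)},
    4: {},
}


def blind_exploration(graph, destination, unit_costs=False, sweeps=50):
    """Expected cost when each of the n_k actions is chosen with probability 1/n_k."""
    V = {k: 0.0 for k in graph}
    for _ in range(sweeps):
        for k, actions in graph.items():
            if k == destination:
                continue
            n_k = len(actions)
            V[k] = sum(
                (1.0 if unit_costs else cost) + V[nxt]
                for cost, nxt in actions.values()
            ) / n_k
    return V


expected = blind_exploration(GRAPH, destination=4)
passage_time = blind_exploration(GRAPH, destination=4, unit_costs=True)
```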


Advantages of our algorithm

Our strategy could be interesting if the environment is changing and there is a need for continuous exploration.

Indeed, if no exploration is performed, the agent will not notice changes unless they occur on the shortest path, so the policy will not be adjusted.

In other words, we propose an optimal exploration/exploitation trade-off.


Preliminary experiments

Simple network routing: dynamic, uncertain

[Figure: the routing network, nodes 1 to 14, with a cost attached to each edge.]


Preliminary experiments

Exploration rate of 0% for all nodes (no exploration)

[Figure: traffic through the 14-node network. Shading: white, very low traffic; light gray, low traffic; gray, medium traffic; dark gray, high traffic; black, very high traffic.]


Preliminary experiments

Entropy rate of 30% for all nodes

[Figure: traffic through the 14-node network.]

[Figure: traffic through the 14-node network (untitled slide).]



Preliminary experiments

Other experimental simulations are provided in: Tuning continual exploration in reinforcement learning (technical report submitted for publication).

http://www.isys.ucl.ac.be/staff/francois/Articles/Achbany2005a.pdf


Conclusion

In this work, we presented a model integrating both exploration and exploitation in a common framework.

The exploration rate is controlled by the entropy of the choice probability distribution defined on the states of the system.

When no exploration is performed (zero entropy on each node), the model reduces to the common value iteration algorithm computing the minimum cost policy.

On the other hand, when full exploration is performed (maximum entropy on each node), the model reduces to a "blind" exploration, without considering the costs.


Further work

This model has been extended to stochastic shortest-path problems, discounted problems, acyclic graphs, and edit distances between strings. Further work also includes developing links with Q-learning.


Thank you!
