Multiple timescales for multiagent learning

David Leslie and E. J. Collins

University of Bristol

David Leslie is supported by CASE Research Studentship 00317214 from the UK Engineering and Physical Sciences Research Council in cooperation with BAE SYSTEMS.

NIPS 2002 workshop on multiagent learning

Introduction

Learning in iterated normal form games.

Simple environment.

Theoretical properties of multiagent Q-learning.


Notation

$N$ players.

Player $i$ plays mixed strategy $\pi^i$.

Opponent mixed strategy $\pi^{-i}$.

Expected reward for playing $a$ is $r^i(a, \pi^{-i})$.

$r^i(a, \pi^{-i}_n)$ estimated by $Q^i_n(a)$.


Mixed strategies

Mixed equilibria necessary.

Mixed strategies from $Q$ values.

Boltzmann smoothing with fixed temperature parameter $\tau$.
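A minimal sketch of how Boltzmann smoothing could turn Q values into a mixed strategy. The function name `boltzmann` and its arguments are illustrative and do not come from the slides; the formula is the standard softmax with a fixed temperature.

```python
import math


def boltzmann(q_values, temperature):
    """Map a list of Q values to a mixed strategy (a probability vector)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the largest Q value before exponentiating; this leaves the
    # resulting probabilities unchanged but avoids overflow.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the temperature is fixed and positive, every action keeps a strictly positive probability, and the strategy changes continuously with the Q values.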


Fixed temperatures

Nash distribution approximates Nash equilibrium.

No discontinuities.

True convergence to mixed strategies.


Q-learning

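Standard Q-learning, except for division by $\pi^i_n(a)$:

$$Q^i_{n+1}(a) = Q^i_n(a) + \lambda^i_n \, I_{\{a^i_n = a\}} \, \frac{R^i_n - Q^i_n(a)}{\pi^i_n(a)}$$

$I$ is the indicator function, $R^i_n$ is the reward.

Learning parameters satisfy

$$\sum_n \lambda^i_n = \infty, \qquad \sum_n (\lambda^i_n)^2 < \infty.$$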
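The update on this slide can be sketched for a single player as follows. The name `q_update` and its argument names are hypothetical; the sketch assumes the update moves only the Q value of the action actually played (the indicator) and divides the step by the probability with which that action was chosen.

```python
def q_update(q_values, action, reward, strategy, learning_rate):
    """Return the Q values after one play of `action` with observed `reward`.

    `strategy` is the mixed strategy the action was sampled from, so
    `strategy[action]` is the probability of the played action.
    """
    updated = list(q_values)
    # The indicator means only the played action is updated; dividing by its
    # probability compensates for rarely played actions being updated rarely.
    step = learning_rate * (reward - q_values[action]) / strategy[action]
    updated[action] = q_values[action] + step
    return updated
```

For instance, a learning rate of the form 1 / (n + 1) at step n would satisfy both summability conditions; that particular schedule is an assumption here, not something the slides specify.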


Three player pennies

Player 1: 1 point if choice matches player 2.

Player 2: 1 point if choice matches player 3.

Player 3: 1 point if choice is opposite to player 1.
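The reward rules of this game can be written as a small function. This is a sketch under two assumptions: the three rules belong to players 1, 2 and 3 in that order, and a player who does not score receives 0 (the slide only says when a point is earned). Actions are encoded as 0 and 1, and the name `pennies_rewards` is illustrative.

```python
def pennies_rewards(a1, a2, a3):
    """Return the rewards (r1, r2, r3) for one round of three player pennies."""
    r1 = 1 if a1 == a2 else 0  # player 1 scores by matching player 2
    r2 = 1 if a2 == a3 else 0  # player 2 scores by matching player 3
    r3 = 1 if a3 != a1 else 0  # player 3 scores by differing from player 1
    return (r1, r2, r3)
```

No pure action profile lets all three players score at once: if player 1 matches player 2 and player 2 matches player 3, then player 3 matches player 1 and earns nothing.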


A plot of Q values


Stochastic approximation

Relate to an ODE.

$\lambda^i_n = \lambda^j_n$ for all $i, j$ implies $Q$ values track

$$\frac{d}{dt} Q^i_t(a) = r^i(a, \pi^{-i}_t) - Q^i_t(a)$$

Deterministic, continuous time system.
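To make the ODE concrete, here is a rough Euler-integration sketch for three player pennies. It assumes Boltzmann strategies with a fixed temperature, actions encoded as 0 and 1, a reward of 0 when a player does not score, and the reading that player 1 matches player 2, player 2 matches player 3, and player 3 opposes player 1. All function names are illustrative.

```python
import math


def boltzmann(q_values, temperature):
    """Boltzmann (softmax) strategy for one player's Q values."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rewards(strategies):
    """Expected reward of each action for each player, given all strategies."""
    p1, p2, p3 = strategies
    return [
        [p2[0], p2[1]],  # player 1 scores when player 2 plays the same action
        [p3[0], p3[1]],  # player 2 scores when player 3 plays the same action
        [p1[1], p1[0]],  # player 3 scores when player 1 plays the other action
    ]


def euler_step(q_values, temperature, dt):
    """One Euler step of dQ/dt = r(a, opponents' strategies) - Q(a)."""
    strategies = [boltzmann(q, temperature) for q in q_values]
    rewards = expected_rewards(strategies)
    return [
        [q[a] + dt * (rewards[i][a] - q[a]) for a in range(2)]
        for i, q in enumerate(q_values)
    ]
```

Starting with every Q value at 0.5 gives uniform strategies and expected rewards of 0.5, so that point is left unchanged by `euler_step`; iterating from other starting points gives a crude picture of how the Q values move under the ODE.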


Analysis of the example

Unique fixed point.

Small temperatures make fixed point unstable - a periodic orbit is stable.

Explains cycling of $Q$ values.


Multiple timescales - I

Generalise stochastic approximation.

$\lambda^i_n / \lambda^j_n \to 0$ for $i < j$.

The quicker $\lambda^i_n \to 0$, the slower the corresponding process adapts.
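One way to obtain learning rates on different timescales is to use polynomial decay with a different exponent per player. The sketch below is an assumed example, not the schedule used in the talk: exponents in (0.5, 1] keep the sum of the rates infinite and the sum of their squares finite, and a larger exponent gives a slower player.

```python
def learning_rate(n, exponent):
    """Hypothetical schedule: rate (n + 1) ** (-exponent) at step n >= 0."""
    return (n + 1) ** (-exponent)


SLOW_EXPONENT = 1.0  # slower player: its rate decays more quickly
FAST_EXPONENT = 0.6  # faster player: its rate decays more slowly


def rate_ratio(n):
    """Ratio of the slow player's rate to the fast player's rate at step n."""
    return learning_rate(n, SLOW_EXPONENT) / learning_rate(n, FAST_EXPONENT)
```

With these exponents the ratio equals (n + 1) ** (-0.4), which decreases to zero as n grows, so the slow player's steps become negligible relative to the fast player's.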


Multiple timescales - II

Fast processes can fully adapt to slow processes.

Slow processes see fast processes as having completely converged.

Will work if the fast processes converge to a unique value for each fixed value of the slow processes.


Multiple-timescales Q-learning assumption

Assume that for fixed $(Q^1, \dots, Q^j)$ the values of $(Q^{j+1}_n, \dots, Q^N_n)$ will converge to a unique value, resulting in joint best response $B^j(Q^1, \dots, Q^j)$.

For example, this holds for two-player games and for cyclic games.


Convergence of multiple-timescales Q-learning

Behaviour determined by the ODE

$$\frac{d}{dt} Q^1_t(a) = r^1(a, B^1(Q^1_t)) - Q^1_t(a)$$

Can prove convergence if player 1 has only two actions.

Hence process converges for three player pennies.


Another plot of Q values


Conclusion

Theoretical study of multiagent learning.

Fixed temperature parameter to achieve mixed equilibria from $Q$ values.

Multiple timescales assist convergence and enable theoretical study.


Future work

Investigate when the convergence assumption must hold.

Experiments with multiple-timescales learning in Markov games.

Theoretical results for multiple-timescales learning in Markov games.
