
Transcript
Page 1:

Reinforcement Learning

LU 13 - Learning in Multi Agent Environments

Dr. Martin Lauer
AG Maschinelles Lernen und Natürlichsprachliche Systeme

Albert-Ludwigs-Universität Freiburg

[email protected]


Page 2:

Learning in multi agent environments

Up to now: single agent acting in a stochastic environment

Now: several agents acting in the same environment.

- all of them are trying to maximize their reward / minimize their costs

- all of them are learning/adapting their policy

Examples:

- two persons working together to assemble IKEA furniture

- a group of students learning together for an exam

- the two players of tic-tac-toe

- 11 players of a soccer team playing against the 11 players of the opposing team

- the ministers in a government


Page 3:

Types of multi agent environments

There are different types of multi agent environments:

- agents that share the same goal; all of them benefit in the same way from reaching the goal

- agents that have adversarial goals

- agents that share some goals but not all


Page 4:

Modeling multi agent environments

How can we model multiple agents for reinforcement learning?

A Multi agent Markov decision process (MAMDP) is defined by:

- a set of discrete points in time T

- a set of states S

- a number of agents k ∈ ℕ

- a set of actions A_j for each agent j

- transition probabilities p_ss'(a_1, ..., a_k)

- a reward function r_j : S × A_1 × ... × A_k → ℝ for each agent (or a cost function for each agent)

- a discount factor α ∈ [0, 1]

MAMDPs are also called Markov games.
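To make the definition concrete, here is a minimal Python sketch (not part of the slides; all field names are illustrative) of how a small, finite MAMDP could be stored:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[int, ...]  # one action index per agent

@dataclass
class MAMDP:
    """Minimal container for a multi agent MDP (Markov game)."""
    n_states: int                # |S|, states indexed 0 .. n_states-1
    n_agents: int                # k
    n_actions: List[int]         # |A_j| for each agent j
    # transition probabilities p_ss'(a_1, ..., a_k): (s, joint action) -> distribution over s'
    transitions: Dict[Tuple[int, JointAction], List[float]]
    # one reward function per agent: r_j(s, a_1, ..., a_k)
    rewards: List[Callable[[int, JointAction], float]]
    alpha: float                 # discount factor in [0, 1]

# tiny illustrative example: 1 state, 2 agents with 2 actions each, fully cooperative reward
coop = MAMDP(
    n_states=1,
    n_agents=2,
    n_actions=[2, 2],
    transitions={(0, (a1, a2)): [1.0] for a1 in range(2) for a2 in range(2)},
    rewards=[lambda s, a: float(a[0] == a[1])] * 2,   # both agents receive the same reward
    alpha=0.9,
)
```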


Page 5:

Modeling multi agent environments

Remarks:

- we assume that all agents select their actions simultaneously

- the transitions and rewards depend on the actions of all agents

- games where the players move alternately can be modeled as well. How? Discuss this using the tic-tac-toe example


Page 6:

Fully cooperative MAMDPs

In fully cooperative MAMDPs all agents share the same reward function, i.e. r_1 = r_2 = ... = r_k. What can we conclude about the optimality of policies?

Assume a fully cooperative MAMDP (T, S, k, A_1, ..., A_k, p_ss'(a_1, ..., a_k), r_1, ..., r_k, α).

If we define A = A_1 × ... × A_k we can create an equivalent single agent MDP (T, S, A, p_ss'(~a), r_1, α). This MDP imitates the MAMDP assuming central control of all agents instead of distributed decisions.

Once we have found an optimal policy Π : S → A for central control, we can distribute it to the agents and generate agent-individual policies π_j : S → A_j by

Π(s) = (π1(s), . . . , πk(s))

[Figure: decentralized decisions in a MAMDP (each agent j decides about its own action a_j) vs. central decisions in an MDP (a central controller decides about the vector of actions and advises the agents which actions to execute).]


Page 7:

Learning in fully cooperative MAMDPs

From the theory of MDPs we can conclude that for a "central control MAMDP"

- there is an optimal central policy Π*

- Π* is greedy w.r.t. Q* : S × A → ℝ

- Q* can be calculated, e.g. with value iteration or Q-learning

Since we can always derive agent-individual policies from Π*, we get

- there are optimal agent-individual policies π*_j : S → A_j

- the application of the π*_j maximizes the accumulated reward for each agent

- we can obtain the π*_j as follows (see the sketch below):
  1. learn Q*, e.g. with Q-learning
  2. derive Π* greedily from Q*
  3. calculate the agent-individual policies π*_j by "distributing" Π*
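A minimal sketch of these three steps for a small, finite MAMDP, assuming a simulator env_step(s, joint_action) -> (next_state, shared_reward) is available (function and parameter names are illustrative; γ denotes the learning rate and α the discount factor, as on the slides):

```python
import itertools
import random
import numpy as np

def central_q_learning(env_step, n_states, action_sets, alpha=0.9, gamma=0.1,
                       episodes=5000, horizon=20):
    """Tabular Q-learning over joint actions (central-control view of a
    fully cooperative MAMDP). env_step(s, joint_action) -> (next_state, reward)."""
    joint_actions = list(itertools.product(*[range(n) for n in action_sets]))
    Q = np.zeros((n_states, len(joint_actions)))
    for _ in range(episodes):
        s = random.randrange(n_states)
        for _ in range(horizon):
            # epsilon-greedy choice of a joint action (step 1: learn Q*)
            i = (random.randrange(len(joint_actions)) if random.random() < 0.2
                 else int(Q[s].argmax()))
            s2, r = env_step(s, joint_actions[i])
            Q[s, i] += gamma * (r + alpha * Q[s2].max() - Q[s, i])
            s = s2
    # step 2: central greedy policy, step 3: distribute it to the agents
    central = {s: joint_actions[int(Q[s].argmax())] for s in range(n_states)}
    individual = [{s: central[s][j] for s in range(n_states)}
                  for j in range(len(action_sets))]
    return Q, central, individual
```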


Page 8:

Learning in fully cooperative MAMDPs

In the previous approach, learning is assumed to be done by a central controller.

However, as long as each agent is informed about the actions selected by its teammates, learning the Q-function can also be done by each agent individually.

Q-learning for “joint action learners”

1. for each agent j, learn a Q-function Q_j with Q-learning

2. determine the optimal policy π_j by greedy evaluation of Q_j

Greedy evaluation for agent j:

π_j(s) = arg max_{a_j ∈ A_j} ( max_{a_1 ∈ A_1, ..., a_{j-1} ∈ A_{j-1}, a_{j+1} ∈ A_{j+1}, ..., a_k ∈ A_k} Q_j(s, a_1, ..., a_k) )
       = arg max_{a_j ∈ A_j} q_j(s, a_j)

where the inner maximum defines q_j(s, a_j).
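For two agents and a tabular Q_j, the inner maximum is just a max over the other agent's action axis. A minimal numpy sketch (illustrative), using the first table of the following slide as test data:

```python
import numpy as np

def greedy_individual_action(Q_j, s, agent):
    """Greedy evaluation for a joint action learner with two agents.

    Q_j has shape (n_states, |A_1|, |A_2|). For agent 0, q_j(s, a_1) is the
    maximum over a_2 (axis 1); for agent 1, q_j(s, a_2) is the maximum over
    a_1 (axis 0). Returned indices are 0-based.
    """
    q_j = Q_j[s].max(axis=1 if agent == 0 else 0)
    return int(q_j.argmax()), q_j

# first table of the next slide: here the greedy individual actions coincide
Q = np.array([[[5., 6., 2.],
               [2., 1., 4.],
               [7., 4., 2.]]])
print(greedy_individual_action(Q, 0, agent=0))  # (2, array([6., 4., 7.])) -> a_1 = 3
print(greedy_individual_action(Q, 0, agent=1))  # (0, array([7., 6., 4.])) -> a_2 = 1
```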


Page 9:

Learning in fully cooperative MAMDPs

Example: a fully cooperative two-player Markov game

Q(a1, a2)   a2 = 1   a2 = 2   a2 = 3 | q1(a1)
a1 = 1         5        6        2   |   6
a1 = 2         2        1        4   |   4
a1 = 3         7        4        2   |   7
q2(a2)         7        6        4   |

Optimal policy found: the greedy individual actions a1 = 3 and a2 = 1 together form the unique optimal joint action.

Q(a1, a2)   a2 = 1   a2 = 2   a2 = 3 | q1(a1)
a1 = 1         5        6        2   |   6
a1 = 2         2        7        4   |   7
a1 = 3         7        4        2   |   7
q2(a2)         7        7        4   |

Individual actions that are promising: a1 ∈ {2, 3}, a2 ∈ {1, 2}. However, not all joint actions formed from them are optimal: the optimal joint actions are among the promising ones, but not all promising joint actions are optimal.

⇒ Every optimal joint policy is greedy w.r.t. the agent-individual q_j functions. However, not all greedy agent-individual policies form an optimal joint policy.


Page 10:

Q-learning in fully cooperative MAMDPs

Learning a Q-function is not sufficient to obtain an optimal joint policy.

Idea: learn the policy explicitly instead of implicitly. Memorize the latest actions that contributed to optimal behavior.

Every agent learns

- a Q-function Q_j : S × A → ℝ with A = A_1 × ... × A_k (the critic)

- an agent-individual policy π_j : S → A_j (the actor)

After observing a transition from s_t to s_{t+1} with joint action (a_{1,t}, ..., a_{k,t}) and reward r_t, it updates

Q_{j,t+1}(s_t, a_{1,t}, ..., a_{k,t}) = (1 − γ) · Q_{j,t}(s_t, a_{1,t}, ..., a_{k,t}) + γ · ( r_t + α · max_{~a ∈ A} Q_{j,t}(s_{t+1}, ~a) )

π_{j,t+1}(s_t) = a_{j,t}          if (a_{1,t}, ..., a_{k,t}) ∈ arg max_{~a ∈ A} Q_{j,t+1}(s_t, ~a)
π_{j,t+1}(s_t) = π_{j,t}(s_t)     otherwise
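One update step of this actor-critic scheme could look like the following sketch (tabular; names are illustrative; γ is the learning rate and α the discount factor as above):

```python
import numpy as np

def distributed_update(Q_j, pi_j, s, joint_a, r, s_next, j, gamma=0.1, alpha=0.9):
    """One actor-critic step of the scheme above for agent j (tabular).

    Q_j: critic over joint actions, shape (n_states, |A_1|, ..., |A_k|)
    pi_j: actor, integer array of shape (n_states,) with agent j's own action
    joint_a: observed joint action (a_{1,t}, ..., a_{k,t}) as a tuple
    """
    idx = (s,) + tuple(joint_a)
    Q_j[idx] = (1 - gamma) * Q_j[idx] + gamma * (r + alpha * Q_j[s_next].max())
    # actor update: memorize the own action if the observed joint action is greedy
    if np.isclose(Q_j[idx], Q_j[s].max()):
        pi_j[s] = joint_a[j]
    return Q_j, pi_j
```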


Page 11:

Q-learning in fully cooperative MAMDPs

Lemma:
If the prerequisites for Q-learning are met and the initial Q-functions Q_{j,0} are equal for all agents j, then the algorithm described on the previous slide is guaranteed to converge, i.e.

Q_{j,t}(s, ~a) → Q*(s, ~a)   as t → ∞

and from a certain t_0 on the joint policies Π_t(s) = (π_{1,t}(s), ..., π_{k,t}(s)) are optimal.

Proof sketch:
The convergence of the Q-functions follows directly from the convergence proof of standard Q-learning. Furthermore, since all agents perform the same update and the initial Q_{j,0} are equal, the Q_{j,t} functions of all agents are equal at any time.

The policies of all agents are updated synchronously whenever a joint action is observed that is greedy w.r.t. the current Q-function. Hence, the joint policy Π_t is always greedy w.r.t. the current Q-function. Since the Q-functions converge to Q*, the joint policy Π_t is optimal from a certain point in time on.


Page 12:

Q-learning in fully cooperative MAMDPs

The previous approach is known as "joint action learners" because all agents learn a Q-function over joint actions ~a.

Question: is it possible to replace the Q_j function by the q_j function that considers only agent-individual actions ("independent learners")?

In general, no approaches are known for independent learners that are guaranteed to converge to the optimal joint policy.

- Q-learning can be used for independent learners. It converges, but not necessarily to the optimal joint policy.

- for deterministic MAMDPs there is a variant of Q-learning that is guaranteed to converge to the optimal joint policy (Lauer & Riedmiller, 2000)


Page 13:

Zero sum Markov games

Different from cooperative MAMDPs are zero sum Markov games:

- 2 agents

- the reward function of one is the negative reward of the other, i.e. r_2(s, a1, a2) = −r_1(s, a1, a2). The gain of one player is the loss of the other; the sum of rewards is zero.

Examples:

- many board games (chess, checkers, go, tic-tac-toe, ...)

- competitions in sports (often, but not all)

[Figure: agent 1 acts as the maximizer, agent 2 as the minimizer.]


Page 14:

Zero sum Markov games

What does "optimal policy" mean in zero sum games?

Different cases

(a) our opponent plays a fixed, stationary policy (either deterministic or randomized). We can interpret our opponent as part of the environment. Hence, we can use the standard definition of optimality for MDPs and apply standard learning algorithms for MDPs (Q-learning, value iteration, TD(λ)).

(b) our opponent varies its policy and potentially tries to respond optimally to our policy → a new definition of optimality is needed


Page 15:

Zero sum Markov games

Consider the traditional game rock-paper-scissors:

- the a priori success probability of all three actions is 1/3

- however, if we always select the same action, our opponent can easily beat us in repeated play

- in contrast, if we select each action randomly with probability 1/3, our opponent cannot predict our behavior and we achieve the maximal expected reward

⇒ zero sum games require randomized policies (also called stochastic policies, probabilistic policies, or mixed policies)

A randomized policy is a mapping π : S × A → ℝ that assigns a probability to each action that can be selected in state s. It must satisfy

π(s, a) ≥ 0 for all s ∈ S, a ∈ A
∑_{a ∈ A} π(s, a) = 1 for all s ∈ S
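A randomized policy can be stored as one probability vector per state. A tiny sketch for the single-state rock-paper-scissors game (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# single state, three actions: 0 = rock, 1 = paper, 2 = scissors
pi = np.array([[1/3, 1/3, 1/3]])      # pi[s, a] >= 0 and each row sums to 1

def sample_action(pi, s):
    """Draw an action according to the randomized policy pi(s, .)."""
    return rng.choice(len(pi[s]), p=pi[s])

print([sample_action(pi, 0) for _ in range(10)])  # unpredictable action sequence
```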


Page 16:

Zero sum Markov games

Like for MDPs we can introduce Q-functions for zero sum games. Let π1, π2 be agent-individual policies. Then, for j ∈ {1, 2},

Q_j^{π1,π2}(s, a1, a2) = E[ ∑_{t=0}^{∞} α^t r_{j,t} | s_0 = s, ~a_0 = (a1, a2), ~a_t = (π1(s_t), π2(s_t)) for all t ≥ 1 ]

i.e. Q_j^{π1,π2}(s, a1, a2) models the expected discounted sum of rewards of agent j in the zero sum game if a trajectory starts in s, the agents apply actions a1, a2 first, and follow their policies π1, π2 thereafter.

In a zero sum game, Q_1^{π1,π2}(s, a1, a2) = −Q_2^{π1,π2}(s, a1, a2).

From Q_j^{π1,π2} we can obtain the expected sum of rewards function V_j^{π1,π2} as

V_j^{π1,π2}(s) = ∑_{a1 ∈ A1} ∑_{a2 ∈ A2} π1(s, a1) · π2(s, a2) · Q_j^{π1,π2}(s, a1, a2)
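For a single state the double sum is the bilinear form π1(s,·)ᵀ Q_j(s,·,·) π2(s,·). A small numpy sketch (illustrative values, rock-paper-scissors rewards for agent 1):

```python
import numpy as np

def value_from_q(Q_j, pi1, pi2, s):
    """V_j(s) = sum_{a1, a2} pi1(s, a1) * pi2(s, a2) * Q_j(s, a1, a2).

    Q_j: shape (n_states, |A_1|, |A_2|); pi1, pi2: shape (n_states, |A_i|).
    """
    return float(pi1[s] @ Q_j[s] @ pi2[s])

# rock-paper-scissors rewards for agent 1 (win = +1, loss = -1, draw = 0)
Q = np.array([[[ 0., -1.,  1.],
               [ 1.,  0., -1.],
               [-1.,  1.,  0.]]])
uniform = np.array([[1/3, 1/3, 1/3]])
print(value_from_q(Q, uniform, uniform, 0))   # 0.0 for uniform play by both agents
```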


Page 17:

Zero sum Markov games

V_j^{π1,π2}(s) = ∑_{a1 ∈ A1} ∑_{a2 ∈ A2} π1(s, a1) · π2(s, a2) · Q_j^{π1,π2}(s, a1, a2)

What could agent 1 do to maximize its expected reward in state s?
→ select a1 that maximizes ∑_{a2 ∈ A2} π2(s, a2) · Q_1^{π1,π2}(s, a1, a2)

What could agent 2 do to maximize its expected reward in state s?
→ select a2 that minimizes ∑_{a1 ∈ A1} π1(s, a1) · Q_1^{π1,π2}(s, a1, a2)

⇒ if both agents optimize their policies in parallel, the best response of each agent can be achieved by selecting

a1 ∈ arg max_{a1 ∈ A1} ( min_{a2 ∈ A2} Q_1^{π1,π2}(s, a1, a2) )
a2 ∈ arg min_{a2 ∈ A2} ( max_{a1 ∈ A1} Q_1^{π1,π2}(s, a1, a2) )

Due to this minimax/maximin criterion, an optimal joint action is a saddle point of the Q_j function. Neither agent can increase its reward on its own, but deviating from the minimax solution will decrease its reward. For zero sum games, the minimax criterion is the analogue of the greedy policy evaluation criterion for MDPs.
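For a tabular Q_1 the maximin and minimax actions over pure actions can be read off directly. A minimal sketch (illustrative), checked against the first example table of slide 19:

```python
import numpy as np

def maximin_minimax(Q1, s):
    """Pure-action maximin choice for agent 1 and minimax choice for agent 2.

    Q1 has shape (n_states, |A_1|, |A_2|); agent 1 maximizes, agent 2 minimizes.
    """
    a1 = int(Q1[s].min(axis=1).argmax())   # arg max_a1 min_a2 Q1(s, a1, a2)
    a2 = int(Q1[s].max(axis=0).argmin())   # arg min_a2 max_a1 Q1(s, a1, a2)
    return a1, a2

# first example table of slide 19
Q = np.array([[[3., 4., 1.],
               [7., 4., 1.],
               [2., 3., 5.]]])
print(maximin_minimax(Q, 0))   # (2, 1): agent 1 plays a1 = 3, agent 2 plays a2 = 2
```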


Page 18:

Q-learning in zero sum Markov games

An optimal policy would meet the minimax criterion for all states, i.e.

π1(s, a1) > 0  ⇔  min_{a2 ∈ A2} Q_1^{π1,π2}(s, a1, a2) = max_{a'1 ∈ A1} min_{a2 ∈ A2} Q_1^{π1,π2}(s, a'1, a2)
π2(s, a2) > 0  ⇔  max_{a1 ∈ A1} Q_1^{π1,π2}(s, a1, a2) = min_{a'2 ∈ A2} max_{a1 ∈ A1} Q_1^{π1,π2}(s, a1, a'2)

It is possible to derive a Bellman-equation-like relationship

Q_1(s, a1, a2) = r_1(s, a1, a2) + α ∑_{s' ∈ S} p_ss'(a1, a2) · max_{a'1 ∈ A1} min_{a'2 ∈ A2} Q_1(s', a'1, a'2)

From this we can derive a Q-learning update rule for zero sum games (Littman, 1994; Littman, 2001):

Q_{1,t+1}(s_t, a_{1,t}, a_{2,t}) = (1 − γ) · Q_{1,t}(s_t, a_{1,t}, a_{2,t}) + γ · ( r_{1,t} + α · max_{a'1 ∈ A1} min_{a'2 ∈ A2} Q_{1,t}(s_{t+1}, a'1, a'2) )

Convergence can be guaranteed under the usual conditions.
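One update step of this rule for a tabular Q_1 might look as follows (a sketch using the pure-action max-min backup shown on this slide; Littman's minimax-Q actually uses the value of the mixed-strategy matrix game in the backup):

```python
import numpy as np

def zero_sum_q_update(Q1, s, a1, a2, r1, s_next, gamma=0.1, alpha=0.9):
    """Q-learning update for zero sum games with a pure-action max-min backup.

    Q1 has shape (n_states, |A_1|, |A_2|); gamma is the learning rate,
    alpha the discount factor (notation as on the slides).
    """
    backup = Q1[s_next].min(axis=1).max()   # max_{a1'} min_{a2'} Q1(s', a1', a2')
    Q1[s, a1, a2] = (1 - gamma) * Q1[s, a1, a2] + gamma * (r1 + alpha * backup)
    return Q1
```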


Page 19:

Zero sum Markov games

Example: zero sum game with a single state

Q1(a1, a2)   a2 = 1   a2 = 2   a2 = 3 | min(row)
a1 = 1          3        4        1   |    1
a1 = 2          7        4        1   |    1
a1 = 3          2        3        5   |    2
max(col)        7        4        5   |

Q1(a1, a2)   a2 = 1   a2 = 2   a2 = 3   a2 = 4 | min(row)
a1 = 1          5        7        6        1   |    1
a1 = 2          3        1        4        1   |    1
a1 = 3          6        2        3        5   |    2
a1 = 4          2        3        2        9   |    2
max(col)        6        7        6        9   |

Observations:

- the minimax criterion behaves defensively

- the minimax solution is not unique

- actions can appear equally good although they are not (rows 3 and 4 of the second table)

- better policies exist if we know more about the behavior of the opponent


Page 20:

Markov games - the general case

In general, each agent might have its own reward function r_j. What does "optimal" mean in this case?

Definition:
A joint policy is called Pareto optimal if there is no other policy that is better for at least one agent and no worse for the other agents.

Pareto optimal policies are not necessarily unique.

- in fully cooperative MAMDPs all optimal policies are also Pareto optimal

- in zero sum games, all policies are Pareto optimal


Page 21:

Markov games - the general case

Example: prisoners' dilemma
one state, two agents, each with two actions a and b

[Figure: scatter plot of the reward pairs (r1, r2) of the four joint actions (a,a), (a,b), (b,a), (b,b).]

Pareto optimal joint policies:

- deterministic policies (a, b), (a, a), (b, a)

- randomized policies that mix (a, b) with (a, a) or (b, a) with (a, a)

Which action would agent 1 prefer?

- if agent 2 chooses a, action b is better for agent 1

- if agent 2 chooses b, action b is better for agent 1

Hence, both agents prefer b, which leads to the non-optimal joint policy (b, b). In the joint policy (b, b) neither agent can increase its reward on its own; the sketch below checks this dominance argument.
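With hypothetical reward values that reproduce the usual prisoners' dilemma structure (the slide only gives the qualitative picture), the dominance argument can be checked directly:

```python
import numpy as np

# hypothetical rewards r1(a1, a2) with actions 0 = a, 1 = b; r2 is the transpose
r1 = np.array([[3.0, 0.0],    # (a,a) -> 3, (a,b) -> 0
               [4.0, 1.0]])   # (b,a) -> 4, (b,b) -> 1
r2 = r1.T

for a2 in range(2):
    better = "b" if r1[1, a2] > r1[0, a2] else "a"
    print(f"if agent 2 plays {'ab'[a2]}, agent 1 prefers {better}")
# -> b in both cases: (b, b) is the resulting joint policy, although (a, a)
#    would give both agents a higher reward (Pareto optimal but not reached)
```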


Page 22:

Nash equilibria

Definition:
A Nash equilibrium is a joint policy in which no agent can increase its reward by unilaterally changing its agent-individual policy.

- Nash equilibria might not be unique

- Nash equilibria are often not Pareto optimal (e.g. the prisoners' dilemma)

- this shows that individual optimization of rewards does not necessarily result in an optimal joint policy

Example: cold war

Example: behavior of individuals in a team

Example: bicycle race

How can we overcome being trapped in suboptimal Nash equilibria?

→ trust and cooperation
→ known as the bargaining problem, e.g. Nash's bargaining solution
→ tit for tat (especially for the iterated prisoners' dilemma)


Page 23:

Q-learning in the general case

There is an algorithm, Nash-Q, that learns to reach a Nash equilibrium in general two-player games (Hu and Wellman, 1998).

Sketch:

- initialize the Q-functions of each agent with 1

- in every step, calculate the Nash equilibrium for the present Q-functions and apply this Nash policy

- update the two Q-functions in the usual way

Convergence:
The algorithm is guaranteed to converge to a Nash equilibrium if

- all state-joint-action pairs are visited an infinite number of times

- the learning rate decreases in the usual way

- in every step there is a unique Nash equilibrium

- both agents act perfectly synchronously


Page 24:

Multi agent learning in practice

- Nash-Q converges only under restrictive conditions → not applicable in practice

- Nash equilibria are often suboptimal; the opponent does not play a perfect response

- the required information is often not accessible (e.g. the state is not fully known, especially the internal state of other agents; the actions of other players may not be observable)

Fictitious play (Brown, 1951):
Assume the other agents follow fixed policies. Adapt your own policy to the observed behavior of the others, i.e. apply reinforcement learning (e.g. Q-learning) to learn an optimal response to the (fixed) policies of the other agents.

Self play:
Make all agents perform fictitious play in parallel. There are no convergence guarantees, but it yields good results in some applications (see the sketch below).
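A minimal sketch in the spirit of self play: two agents repeatedly play an illustrative single-state cooperative game, and each one independently Q-learns a response to the behavior of the other (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative fully cooperative matrix game (single state): both agents get r[a1, a2]
r = np.array([[5.0, 6.0, 2.0],
              [2.0, 1.0, 4.0],
              [7.0, 4.0, 2.0]])

q = [np.zeros(3), np.zeros(3)]          # each agent's value estimate of its own actions
gamma, eps = 0.1, 0.2                   # learning rate and exploration rate

for _ in range(5000):
    # each agent responds greedily to what it has observed so far (self play)
    a = [int(rng.integers(3)) if rng.random() < eps else int(q[j].argmax())
         for j in range(2)]
    reward = r[a[0], a[1]]
    for j in range(2):                  # independent Q-learning update per agent
        q[j][a[j]] += gamma * (reward - q[j][a[j]])

print(int(q[0].argmax()) + 1, int(q[1].argmax()) + 1)  # jointly preferred actions (1-indexed)
```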


Page 25:

Examples

Examples from robot soccer simulation league (Merke, Gabel, Riedmiller, et al.)

Learning to dribble the ball. Two agents (player, opponent). Fictitious play. The opponent moves directly to the ball. → video

Learning to score against a goalkeeper. Three agents (two attackers, goalkeeper). Fictitious play. The attackers use Q-learning/TD(λ) to optimize their behavior; the goalkeeper follows a fixed policy (move to the ball). → video

Extensions to larger attack and defense strategies have been trained as well (3 vs. 4, 7 vs. 8).


Page 26:

References

- Hu & Wellman, 1998: J. Hu and M. P. Wellman, Multiagent reinforcement learning: theoretical framework and an algorithm. In: Proceedings 15th Int'l Conf. on Machine Learning, pp. 242-250, 1998

- Hu & Wellman, 2003: J. Hu and M. P. Wellman, Nash Q-learning for general-sum stochastic games. In: Journal of Machine Learning Research, vol. 4, pp. 1039-1069, 2003

- Lauer & Riedmiller, 2000: M. Lauer and M. Riedmiller, An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: Proceedings 17th Int'l Conf. on Machine Learning, pp. 535-542, 2000

- Littman, 1994: M. L. Littman, Markov games as a framework for multi-agent reinforcement learning. In: Proceedings 11th Int'l Conf. on Machine Learning, pp. 157-163, 1994

- Littman, 2001: M. L. Littman, Value-function reinforcement learning in Markov games. In: Journal of Cognitive Systems Research, vol. 2, pp. 55-66, 2001
