# Reinforcement Learning: Learning in Multi Agent Environments

Date posted: 23-Sep-2020


Reinforcement Learning

LU 13 - Learning in Multi Agent Environments

Dr. Martin Lauer AG Maschinelles Lernen und Natürlichsprachliche Systeme

Albert-Ludwigs-Universität Freiburg

Prof. Dr. Martin Riedmiller, Dr. Martin Lauer Machine Learning Lab, University of Freiburg Reinforcement Learning (1)

Learning in multi agent environments

Up to now: single agent acting in a stochastic environment

Now: several agents acting in the same environment.

- all of them are trying to maximize their reward/minimize their costs
- all of them are learning/adapting their policy

Examples:

- two people working together to assemble IKEA furniture
- a group of students studying together for an exam
- the two players of tic-tac-toe
- the 11 players of a football team playing against the 11 players of the opposing team
- the ministers in a government


Types of multi agent environments

There are different types of multi agent environments

- agents which share the same goal; all of them benefit in the same way from reaching it
- agents which have opposing (adversarial) goals
- agents which share some goals but not all


Modeling multi agent environments

How can we model multiple agents for reinforcement learning?

A multi agent Markov decision process (MAMDP) is defined by:

- a set of discrete points in time $T$
- a set of states $S$
- a number of agents $k \in \mathbb{N}$
- a set of actions $A_j$ for each agent $j$
- transition probabilities $p_{ss'}(a_1, \dots, a_k)$
- a reward function for each agent, $r_j : S \times A_1 \times \dots \times A_k \to \mathbb{R}$ (or a cost function for each agent)
- a discount factor $\alpha \in [0, 1]$

MAMDPs are also called Markov games.
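As a rough illustration, the components of the definition can be collected in a small container type. This is only a sketch: the class name `MarkovGame`, its field names, and the toy two-agent game are assumptions made for the example and do not come from the lecture. The set of time points $T$ is left implicit as the step counter of whatever loop uses the object.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

State = int
JointAction = Tuple[int, ...]


@dataclass
class MarkovGame:
    """Container for the components of a MAMDP (hypothetical names)."""
    states: List[State]                                             # S
    action_sets: List[List[int]]                                    # A_1, ..., A_k
    transition: Callable[[State, JointAction], Dict[State, float]]  # p_ss'(a_1..a_k)
    rewards: List[Callable[[State, JointAction], float]]            # r_1, ..., r_k
    discount: float                                                 # alpha in [0, 1]

    @property
    def num_agents(self) -> int:
        return len(self.action_sets)

    def joint_actions(self) -> List[JointAction]:
        """All elements of A_1 x ... x A_k."""
        return list(product(*self.action_sets))


# Toy game (assumption): both agents are rewarded when their actions match.
def toy_transition(s: State, a: JointAction) -> Dict[State, float]:
    # Matching actions randomize the next state, otherwise the state is kept.
    return {0: 0.5, 1: 0.5} if a[0] == a[1] else {s: 1.0}


def match_reward(s: State, a: JointAction) -> float:
    return 1.0 if a[0] == a[1] else 0.0


game = MarkovGame(
    states=[0, 1],
    action_sets=[[0, 1], [0, 1]],
    transition=toy_transition,
    rewards=[match_reward, match_reward],  # identical rewards for both agents
    discount=0.9,
)
```

Note that the transition and every reward function take the whole joint action as input, which is the essential difference from a single agent MDP.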


Modeling multi agent environments

Remarks:

- we assume that all agents select their actions simultaneously
- the transitions and rewards depend on the actions of all agents
- games where the players make their moves alternately can be modeled as well. How? Discuss this using the tic-tac-toe example.
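One possible answer to the tic-tac-toe question, offered as a sketch rather than as the lecture's intended solution: store whose turn it is in the state and let the transition ignore the action of the agent who is not to move. The state encoding, the function name `step`, and the handling of illegal moves below are all assumptions.

```python
from typing import Tuple

Board = Tuple[int, ...]    # 9 cells: 0 = empty, 1 or 2 = mark of agent 0 or 1
State = Tuple[Board, int]  # (board, index of the agent to move: 0 or 1)


def step(state: State, joint_action: Tuple[int, int]) -> State:
    """Simultaneous-move transition that emulates alternating moves.

    Both agents submit a cell index, but only the mover's choice is used.
    """
    board, mover = state
    cell = joint_action[mover]  # the other agent's action is ignored
    if board[cell] != 0:
        return state            # illegal move: state unchanged (assumption)
    new_board = board[:cell] + (mover + 1,) + board[cell + 1:]
    return (new_board, 1 - mover)


start: State = ((0,) * 9, 0)
after_first = step(start, (4, 0))         # agent 0 marks the center
after_second = step(after_first, (0, 8))  # agent 1 marks the last cell
```

Because the waiting agent's action has no effect on transitions, the simultaneous-move model reproduces the turn-based game.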


Fully cooperative MAMDPs

In fully cooperative MAMDPs all agents share the same reward function, i.e. $r_1 = r_2 = \dots = r_k$. What can we conclude about the optimality of policies?

Assume a fully cooperative MAMDP $(T, S, k, A_1, \dots, A_k, p_{ss'}(a_1, \dots, a_k), r_1, \dots, r_k, \alpha)$.

If we define $A = A_1 \times \dots \times A_k$, we can create an equivalent single agent MDP $(T, S, A, p_{ss'}(\vec a), r_1, \alpha)$. This MDP imitates the MAMDP, assuming central control of all agents instead of distributed decisions.

Once we have found an optimal policy $\Pi : S \to A$ for central control, we can distribute it to the agents and generate agent-individual policies $\pi_j : S \to A_j$ by

$$\Pi(s) = (\pi_1(s), \dots, \pi_k(s))$$

[Figure: decentralized decisions in a MAMDP (agents 1 to 4 each decide about their own action a1, ..., a4) compared with central decisions in an MDP (a central control decides about the vector of actions (a1, a2, a3, a4) and advises the agents to play certain actions; agents 1 to 4 execute them).]
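The construction of the joint action set and the distribution of a central policy can be sketched in a few lines. The helper names `joint_action_space` and `distribute`, and the example policy, are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

JointAction = Tuple[int, ...]


def joint_action_space(action_sets: Sequence[Sequence[int]]) -> List[JointAction]:
    """A = A_1 x ... x A_k, the action set of the central-control MDP."""
    return list(product(*action_sets))


def distribute(central_policy: Dict[int, JointAction],
               num_agents: int) -> List[Dict[int, int]]:
    """Split Pi : S -> A into pi_j : S -> A_j by taking the j-th component."""
    return [{s: a[j] for s, a in central_policy.items()}
            for j in range(num_agents)]


A = joint_action_space([[0, 1], [0, 1], [0, 1, 2]])  # three agents
central = {0: (1, 0, 2), 1: (0, 1, 1)}               # example policy (made up)
individual = distribute(central, num_agents=3)

# Recombining the individual policies gives back the central policy.
recombined = {s: tuple(pi[s] for pi in individual) for s in central}
```

The size of $A$ is the product of the individual action set sizes, so the central-control MDP grows exponentially in the number of agents.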


Learning in fully cooperative MAMDPs

From the theory of MDPs we can conclude that for a “central control MAMDP”

- there is an optimal central policy $\Pi^*$
- $\Pi^*$ is greedy w.r.t. $Q^* : S \times A \to \mathbb{R}$
- $Q^*$ can be calculated, e.g. with value iteration or Q-learning

Since we can always derive agent-individual policies from $\Pi^*$ we get

- there are optimal agent-individual policies $\pi_j^* : S \to A_j$
- the application of $\pi_j^*$ maximizes the accumulated reward for each agent
- we can obtain $\pi_j^*$ as follows:
  1. learn $Q^*$, e.g. with Q-learning
  2. derive $\Pi^*$ greedily from $Q^*$
  3. calculate the agent-individual policies $\pi_j^*$ by “distributing” $\Pi^*$
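The three steps can be tried out on a tiny example. The sketch below uses value iteration on Q instead of Q-learning so that the result is deterministic, and it assumes deterministic transitions; the two-state game `toy_step` and all function names are invented for the illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

State = int
JointAction = Tuple[int, ...]
Step = Callable[[State, JointAction], Tuple[float, State]]


def q_value_iteration(states: List[State], joint_actions: List[JointAction],
                      step: Step, alpha: float, sweeps: int = 500
                      ) -> Dict[Tuple[State, JointAction], float]:
    """Step 1: approximate Q* of the central-control MDP.

    Assumes deterministic transitions: step(s, a) -> (reward, next_state).
    """
    q = {(s, a): 0.0 for s in states for a in joint_actions}
    for _ in range(sweeps):
        new_q = {}
        for s in states:
            for a in joint_actions:
                reward, s_next = step(s, a)
                new_q[(s, a)] = reward + alpha * max(
                    q[(s_next, b)] for b in joint_actions)
        q = new_q
    return q


def toy_step(s: State, a: JointAction) -> Tuple[float, State]:
    # State 0: only (1, 1) leads to state 1; everything else pays 1 and stays.
    if s == 0:
        return (0.0, 1) if a == (1, 1) else (1.0, 0)
    # State 1: only (0, 0) pays 10 and returns to state 0.
    return (10.0, 0) if a == (0, 0) else (0.0, 1)


states = [0, 1]
joint_actions = list(product([0, 1], [0, 1]))
q_star = q_value_iteration(states, joint_actions, toy_step, alpha=0.9)

# Step 2: the central policy is greedy w.r.t. Q*.
central = {s: max(joint_actions, key=lambda a: q_star[(s, a)]) for s in states}

# Step 3: distribute the central policy to the two agents.
individual = [{s: central[s][j] for s in states} for j in range(2)]
```

In this toy game the optimal behavior gives up the immediate reward of 1 in state 0 in order to reach the reward of 10 in state 1, and both agents have to pick matching actions for that to work.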


Learning in fully cooperative MAMDPs

In the previous approach, learning is assumed to be done by a central controller.

However, as long as each agent is informed about the actions selected by its teammates, learning the Q-function can also be done by each agent individually.

Q-learning for “joint action learners”

1. for each agent $j$, learn the Q-function with Q-learning → $Q_j$
2. determine the optimal policy $\pi_j$ by greedy evaluation of $Q_j$

Greedy evaluation for agent $j$:

$$\pi_j(s) = \arg\max_{a_j \in A_j} \underbrace{\left( \max_{a_1 \in A_1, \dots, a_{j-1} \in A_{j-1}, a_{j+1} \in A_{j+1}, \dots, a_k \in A_k} Q_j(s, a_1, \dots, a_k) \right)}_{=: q_j(s, a_j)} = \arg\max_{a_j \in A_j} q_j(s, a_j)$$
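The greedy evaluation amounts to maximizing the joint Q-function over the actions of all other agents. A minimal sketch for a single fixed state follows; the function names `project` and `greedy_actions`, the 0-based agent index, and the Q-values of the three-agent example are assumptions.

```python
from typing import Dict, List, Tuple

JointAction = Tuple[int, ...]


def project(q_joint: Dict[JointAction, float], j: int) -> Dict[int, float]:
    """q_j(s, a_j): maximum of Q_j(s, .) over the actions of all other agents.

    q_joint holds the values Q_j(s, a) for one fixed state s; j is 0-based.
    """
    q_j: Dict[int, float] = {}
    for a, value in q_joint.items():
        q_j[a[j]] = max(q_j.get(a[j], float("-inf")), value)
    return q_j


def greedy_actions(q_j: Dict[int, float]) -> List[int]:
    """All actions a_j that maximize q_j (there may be several)."""
    best = max(q_j.values())
    return sorted(a for a, v in q_j.items() if v == best)


# Made-up Q-values for three agents with two actions each.
q_s = {
    (0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 0, (0, 1, 1): 3,
    (1, 0, 0): 5, (1, 0, 1): 1, (1, 1, 0): 2, (1, 1, 1): 4,
}
choices = [greedy_actions(project(q_s, j)) for j in range(3)]
```

In this example each agent has a unique greedy action and together they form the joint action $(1, 0, 0)$ with the maximal value 5. Uniqueness is what makes the individual choices fit together here.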


Learning in fully cooperative MAMDPs

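Example: a fully cooperative two-player Markov game

The margins show $q_1(a_1) = \max_{a_2} Q(a_1, a_2)$ (row maxima) and $q_2(a_2) = \max_{a_1} Q(a_1, a_2)$ (column maxima).

| $Q(a_1, a_2)$ | $a_2 = 1$ | $a_2 = 2$ | $a_2 = 3$ | $q_1(a_1)$ |
|---|---|---|---|---|
| $a_1 = 1$ | 5 | 6 | 2 | 6 |
| $a_1 = 2$ | 2 | 1 | 4 | 4 |
| $a_1 = 3$ | 7 | 4 | 2 | 7 |
| $q_2(a_2)$ | 7 | 6 | 4 | |

Greedy evaluation selects $a_1 = 3$ and $a_2 = 1$, and the joint action $(3, 1)$ is optimal: optimal policy found.

| $Q(a_1, a_2)$ | $a_2 = 1$ | $a_2 = 2$ | $a_2 = 3$ | $q_1(a_1)$ |
|---|---|---|---|---|
| $a_1 = 1$ | 5 | 6 | 2 | 6 |
| $a_1 = 2$ | 2 | 7 | 4 | 7 |
| $a_1 = 3$ | 7 | 4 | 2 | 7 |
| $q_2(a_2)$ | 7 | 7 | 4 | |

The individual actions that look promising are $a_1 \in \{2, 3\}$ and $a_2 \in \{1, 2\}$. The optimal joint actions $(2, 2)$ and $(3, 1)$ are among the promising combinations, but not all promising combinations are optimal: $(2, 1)$ yields only 2 and $(3, 2)$ only 4.

⇒ Every optimal joint policy is greedy w.r.t. the agent-individual $q_j$ functions. However, not all greedy agent-individual policies form an optimal joint policy.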
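The coordination problem in the second table can be checked mechanically. The following sketch recomputes the row and column maxima and lists the combinations of promising individual actions that are not optimal; the variable names are chosen for the example only.

```python
from itertools import product

ACTIONS = (1, 2, 3)

# Q(a1, a2) from the second table of the example.
Q = {
    (1, 1): 5, (1, 2): 6, (1, 3): 2,
    (2, 1): 2, (2, 2): 7, (2, 3): 4,
    (3, 1): 7, (3, 2): 4, (3, 3): 2,
}

# Agent-individual functions: maximize over the other agent's action.
q1 = {a1: max(Q[(a1, a2)] for a2 in ACTIONS) for a1 in ACTIONS}
q2 = {a2: max(Q[(a1, a2)] for a1 in ACTIONS) for a2 in ACTIONS}

promising1 = [a for a in ACTIONS if q1[a] == max(q1.values())]
promising2 = [a for a in ACTIONS if q2[a] == max(q2.values())]

optimal = [a for a, v in Q.items() if v == max(Q.values())]

# Joint actions that are greedy for each agent individually but not optimal.
miscoordinated = [a for a in product(promising1, promising2)
                  if a not in optimal]
```

If agent 1 breaks its tie in favor of action 2 while agent 2 picks action 1, the team ends up with a value of 2 although both agents acted greedily on their own $q_j$.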


Q-learning in fully cooperative MAMDPs

Learning a Q-function is not sufficient to obtain an optimal joint policy.

Idea: learn the policy explicitly instead of implicitly. Memorize the latest actions that contributed to optimal behavior

Every agent learns

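- a Q-function $Q_j : S \times A \to \mathbb{R}$ (the critic)
- an agent-individual policy $\pi_j : S \to A_j$ (the actor)

After observing a transition from $s_t$ to $s_{t+1}$ with joint action $(a_{1,t}, \dots, a_{k,t})$ and reward $r_t$, it updates

$$Q_{j,t+1}(s_t, a_{1,t}, \dots, a_{k,t}) = (1 - \gamma) \cdot Q_{j,t}(s_t, a_{1,t}, \dots, a_{k,t}) + \gamma \cdot \left( r_t + \alpha \max_{\vec a \in A} Q_{j,t}(s_{t+1}, \vec a) \right)$$

$$\pi_{j,t+1}(s_t) = \begin{cases} a_{j,t} & \text{if } (a_{1,t}, \dots, a_{k,t}) \in \arg\max_{\vec a \in A} Q_{j,t+1}(s_t, \vec a) \\ \pi_{j,t}(s_t) & \text{otherwise} \end{cases}$$

Here $\gamma$ is the learning rate and $\alpha$ is the discount factor.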
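The two update rules can be sketched as a small class. Note the notation: the slides use $\gamma$ for the learning rate and $\alpha$ for the discount factor, and the code follows that. The class name, the zero initialization of the Q-function, the arbitrary initial policy, and the hand-picked transitions are assumptions for the illustration; exploration is not modeled.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

JointAction = Tuple[int, ...]


class JointActionLearner:
    """Agent j with a critic Q_j over joint actions and an explicit actor pi_j."""

    def __init__(self, j: int, states: Sequence[int],
                 action_sets: Sequence[Sequence[int]],
                 gamma: float, alpha: float) -> None:
        self.j = j
        self.gamma = gamma  # learning rate (slide notation)
        self.alpha = alpha  # discount factor (slide notation)
        self.joint_actions: List[JointAction] = list(product(*action_sets))
        # Equal initial Q-functions for all agents (here: all zeros).
        self.q: Dict[Tuple[int, JointAction], float] = {
            (s, a): 0.0 for s in states for a in self.joint_actions}
        # Arbitrary initial policy (assumption): first action everywhere.
        self.policy: Dict[int, int] = {s: action_sets[j][0] for s in states}

    def update(self, s: int, joint_action: JointAction,
               reward: float, s_next: int) -> None:
        # Critic: standard Q-learning update on the observed joint action.
        target = reward + self.alpha * max(
            self.q[(s_next, b)] for b in self.joint_actions)
        self.q[(s, joint_action)] = (
            (1 - self.gamma) * self.q[(s, joint_action)] + self.gamma * target)
        # Actor: memorize the own action if the joint action is now greedy.
        best = max(self.q[(s, b)] for b in self.joint_actions)
        if self.q[(s, joint_action)] == best:
            self.policy[s] = joint_action[self.j]


# Two agents in a single-state game; every agent sees the same transitions.
agents = [JointActionLearner(j, [0], [[0, 1], [0, 1]], gamma=0.5, alpha=0.9)
          for j in range(2)]
for joint_action, reward in [((0, 0), 0.0), ((1, 1), 1.0), ((0, 1), 0.0)]:
    for agent in agents:
        agent.update(0, joint_action, reward, 0)

joint_policy = tuple(agent.policy[0] for agent in agents)
```

Since both agents start from the same Q-function and see the same transitions, they change their policies at the same time steps, so the memorized actions always belong to one and the same joint action.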


Q-learning in fully cooperative MAMDPs

Lemma: If the prerequisites for Q-learning are met and the initial Q-functions $Q_{j,0}$ are equal for all agents $j$, then the algorithm described on the previous slide is guaranteed to converge, i.e.

$$Q_{j,t}(s, \vec a) \xrightarrow{t \to \infty} Q^*(s, \vec a)$$

and from a certain $t_0$ on the joint policies $\Pi_t(s) = (\pi_{1,t}(s), \dots, \pi_{k,t}(s))$ are optimal.

Proof sketch: The convergence of the Q-functions follows directly from the convergence proof of standard Q-learning. Furthermore, since all agents perform the same update and the initial $Q_{j,0}$ are equal, the functions $Q_{j,t}$ of all agents are equal at any time.

The policies of all agents are updated synchronously whenever a joint action is observed that is greedy w.r.t. the current Q-function. Hence, the joint policy Πt is always greedy w.r.t. the current Q-function. Since the Q-functions converge to Q∗ the joint policy Πt is optimal from a certain point in time on.


Q-learning in fully cooperative MAMDPs

The previous approach is known as “joint action learners” because all agents learn a Q-function for joint actions $\vec a$.

Question: is it possible to replace the $Q_j$ function by the $q_j$ function that considers only agent-individual actions (“independent learners”)?

In general, no approaches are known for independent learners that are guaranteed to converge to the optimal joint policy.

- Q-learning is used for independent learners. It converges, but not necessarily to the optimal joint policy.
- for deterministic MDPs there is a variant of Q-learning that is guaranteed to converge to the optimal joint policy (Lauer & Riedmiller, 2000)
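The slides do not spell out the variant for deterministic MDPs. The sketch below shows one optimistic update scheme of that general kind, reduced to a single-state game with deterministic, non-negative rewards: each agent keeps only a table over its own actions, never lowers a value, and changes its policy only when its best value strictly improves. The class name, the zero initialization, the restriction to one state, and the fixed exploration order are assumptions; this should not be read as a faithful reproduction of the cited algorithm.

```python
from itertools import product
from typing import Dict, Sequence


class OptimisticIndependentLearner:
    """Independent learner for a single-state deterministic cooperative game."""

    def __init__(self, actions: Sequence[int]) -> None:
        # Zero initialization assumes non-negative rewards.
        self.q: Dict[int, float] = {a: 0.0 for a in actions}
        self.policy: int = actions[0]

    def update(self, own_action: int, reward: float) -> None:
        old_best = max(self.q.values())
        # Optimistic update: values never decrease.
        self.q[own_action] = max(self.q[own_action], reward)
        # Change the policy only if the best value strictly improved.
        if max(self.q.values()) > old_best:
            self.policy = own_action


ACTIONS = (1, 2, 3)
# Team reward table with two optimal joint actions, (2, 2) and (3, 1).
REWARD = {
    (1, 1): 5, (1, 2): 6, (1, 3): 2,
    (2, 1): 2, (2, 2): 7, (2, 3): 4,
    (3, 1): 7, (3, 2): 4, (3, 3): 2,
}

agent1 = OptimisticIndependentLearner(ACTIONS)
agent2 = OptimisticIndependentLearner(ACTIONS)
for a1, a2 in product(ACTIONS, ACTIONS):  # every joint action tried once
    agent1.update(a1, REWARD[(a1, a2)])   # each agent sees only its own action
    agent2.update(a2, REWARD[(a1, a2)])

joint_policy = (agent1.policy, agent2.policy)
```

In this run both agents change their policies at exactly the same steps, namely whenever a new best team reward is observed, so they settle on the first optimal joint action they encounter and avoid mixing components of different optimal joint actions.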


Zero sum Markov games

Zero sum Markov games differ from cooperative MAMDPs.

- 2 agents
- the reward function of one is the negative reward of the other, i.e. r2(s, a1, a2) = −r