
  • Reinforcement Learning

    LU 13 - Learning in Multi Agent Environments

    Dr. Martin Lauer AG Maschinelles Lernen und Natürlichsprachliche Systeme

    Albert-Ludwigs-Universität Freiburg

    [email protected]

    Prof. Dr. Martin Riedmiller, Dr. Martin Lauer, Machine Learning Lab, University of Freiburg

  • Learning in multi agent environments

    Up to now: single agent acting in a stochastic environment

    Now: several agents acting in the same environment.

    • all of them are trying to maximize their reward/minimize their costs

    • all of them are learning/adapting their policy

    Examples:

    • two persons working together to assemble IKEA furniture

    • a group of students learning together for an exam

    • the two players of tic-tac-toe

    • the 11 players of a soccer team playing against the 11 players of the opposing team

    • the ministers in a government


  • Types of multi agent environments

    There are different types of multi agent environments:

    • agents which share the same goal; all of them benefit in the same way from reaching it

    • agents which have adversarial goals

    • agents which share some goals but not all


  • Modeling multi agent environments

    How can we model multiple agents for reinforcement learning?

    A Multi agent Markov decision process (MAMDP) is defined by:

    • a set of discrete points in time T

    • a set of states S

    • a number of agents k ∈ N

    • a set of actions A_j for each agent j

    • transition probabilities p_{ss'}(a_1, ..., a_k)

    • a reward function r_j : S × A_1 × ··· × A_k → R for each agent (or a cost function for each agent)

    • a discount factor α ∈ [0, 1]

    MAMDPs are also called Markov games.
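
    For concreteness, the ingredients of this definition can be collected in a small data structure. The following Python sketch is only an illustration under the assumption of finite state and action sets; all names (MAMDP, transition, rewards, ...) are made up and not part of the slides.

      from dataclasses import dataclass
      from typing import Callable, Sequence, Tuple

      # Minimal container for a finite MAMDP / Markov game (illustrative names only).
      @dataclass
      class MAMDP:
          states: Sequence[int]                                       # S
          actions: Sequence[Sequence[int]]                            # A_1, ..., A_k (one action set per agent)
          transition: Callable[[int, Tuple[int, ...], int], float]    # p_{ss'}(a_1, ..., a_k)
          rewards: Sequence[Callable[[int, Tuple[int, ...]], float]]  # r_j(s, a_1, ..., a_k), one per agent
          alpha: float = 0.9                                          # discount factor α

          @property
          def num_agents(self) -> int:                                # k
              return len(self.actions)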


  • Modeling multi agent environments

    Remarks:

    • we assume that all agents select their actions simultaneously

    • the transition probabilities and rewards depend on the actions of all agents

    • games where the players make their moves alternately can be modeled as well. How? Discuss this using the tic-tac-toe example


  • Fully cooperative MAMDPs

    In fully cooperative MAMDPs all agents share the same reward function, i.e. r_1 = r_2 = ··· = r_k. What can we conclude about the optimality of policies?

    Assume a fully cooperative MAMDP (T, S, k, A_1, ..., A_k, p_{ss'}(a_1, ..., a_k), r_1, ..., r_k, α).

    If we define A = A_1 × ··· × A_k, we can create an equivalent single agent MDP (T, S, A, p_{ss'}(a⃗), r_1, α). This MDP imitates the MAMDP assuming central control of all agents instead of distributed decisions.

    Once we have found an optimal policy Π : S → A for central control, we can distribute it to the agents and generate agent-individual policies π_j : S → A_j by

    Π(s) = (π_1(s), ..., π_k(s))

    [Figure: decentralized decisions in a MAMDP (each agent j decides about its own action a_j) vs. central decisions in an MDP (a central control decides about the vector of actions and advises the agents, which execute them)]
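
    A minimal sketch of the distribution step Π(s) = (π_1(s), ..., π_k(s)): a central policy over joint actions is split into one policy per agent. The function name and the toy numbers are invented for illustration.

      from typing import Dict, List, Tuple

      # Turn a central policy Pi (state -> joint action) into one policy per agent
      # (state -> own action).  Purely illustrative; names are not from the slides.
      def distribute_policy(Pi: Dict[int, Tuple[int, ...]], k: int) -> List[Dict[int, int]]:
          return [{s: joint[j] for s, joint in Pi.items()} for j in range(k)]

      # Toy example with two agents and two states: Pi(0) = (1, 0), Pi(1) = (0, 2)
      Pi = {0: (1, 0), 1: (0, 2)}
      pi_1, pi_2 = distribute_policy(Pi, k=2)
      assert pi_1 == {0: 1, 1: 0} and pi_2 == {0: 0, 1: 2}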


  • Learning in fully cooperative MAMDPs

    From the theory of MDPs we can conclude that for a “central control MAMDP”

    • there is an optimal central policy Π∗

    • Π∗ is greedy w.r.t. Q∗ : S × A → R

    • Q∗ can be calculated, e.g. with value iteration or Q-learning

    Since we can always derive agent-individual policies from Π∗, we get:

    • there are optimal agent-individual policies π∗_j : S → A_j

    • the application of the π∗_j maximizes the accumulated reward for each agent

    • we can obtain the π∗_j as follows (see the sketch below):

      1. learn Q∗, e.g. with Q-learning

      2. derive Π∗ greedily from Q∗

      3. calculate agent-individual policies π∗_j by “distributing” Π∗
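
    The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: the environment interface (env.reset(), env.step(joint_action) returning next state, shared reward and a done flag), the function name and the hyperparameters are all assumptions, and the state and action sets are assumed to be finite.

      import random
      from collections import defaultdict
      from itertools import product

      # Sketch: treat the fully cooperative MAMDP as a single MDP over joint actions,
      # learn Q* with tabular Q-learning, derive Pi* greedily, then distribute it.
      def learn_and_distribute(env, action_sets, states, alpha=0.9, lr=0.1, eps=0.1, episodes=1000):
          joint_actions = list(product(*action_sets))          # A = A_1 x ... x A_k
          Q = defaultdict(float)                               # step 1: Q*(s, joint action)

          for _ in range(episodes):
              s, done = env.reset(), False
              while not done:
                  if random.random() < eps:                    # epsilon-greedy exploration
                      a = random.choice(joint_actions)
                  else:
                      a = max(joint_actions, key=lambda ja: Q[(s, ja)])
                  s_next, r, done = env.step(a)                # r is the shared reward
                  target = r + alpha * max(Q[(s_next, ja)] for ja in joint_actions)
                  Q[(s, a)] += lr * (target - Q[(s, a)])       # standard Q-learning update
                  s = s_next

          # step 2: central greedy policy; step 3: distribute it to the agents
          Pi = {s: max(joint_actions, key=lambda ja: Q[(s, ja)]) for s in states}
          pis = [{s: Pi[s][j] for s in states} for j in range(len(action_sets))]
          return Q, Pi, pis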


  • Learning in fully cooperative MAMDPs

    In the previous approach, learning is assumed to be done by a central controller.

    However, as long as each agent is informed about the actions selected by its teammates, learning the Q-function can also be done by each agent individually.

    Q-learning for “joint action learners”

    1. for each agent j, learn a Q-function with Q-learning → Q_j

    2. determine the optimal policy π_j by greedy evaluation of Q_j

    Greedy evaluation for agent j:

    \pi_j(s) = \arg\max_{a_j \in A_j} \underbrace{\Big( \max_{a_1 \in A_1, \ldots, a_{j-1} \in A_{j-1},\, a_{j+1} \in A_{j+1}, \ldots, a_k \in A_k} Q_j(s, a_1, \ldots, a_k) \Big)}_{=:\, q_j(s, a_j)} = \arg\max_{a_j \in A_j} q_j(s, a_j)
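
    The greedy evaluation can be written down directly: for a fixed own action a_j, maximize Q_j over all combinations of the other agents' actions, then pick the best own action. The following Python sketch assumes Q_j is stored as a dictionary over (state, joint action) pairs; the function name is illustrative.

      from itertools import product

      # Greedy evaluation for joint action learner j (sketch): marginalize Q_j over
      # the other agents' actions to get q_j(s, a_j), then pick the best own action.
      def greedy_own_action(Q_j, s, action_sets, j):
          other_sets = [A for i, A in enumerate(action_sets) if i != j]

          def q_j(a_j):
              best = float("-inf")
              for others in product(*other_sets):
                  joint = list(others)
                  joint.insert(j, a_j)            # put the own action back at position j
                  best = max(best, Q_j[(s, tuple(joint))])
              return best

          return max(action_sets[j], key=q_j)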


  • Learning in fully cooperative MAMDPs

    Example: a fully cooperative two-player Markov game (Q-values written as a matrix over the joint actions (a1, a2))

    Q(a1, a2)   a2 = 1   a2 = 2   a2 = 3   q1(a1)
    a1 = 1         5        6        2        6
    a1 = 2         2        1        4        4
    a1 = 3         7        4        2        7
    q2(a2)         7        6        4

    The greedy individual actions are unique (a1 = 3, a2 = 1) and together they form the optimal joint action: optimal policy found.

    Q(a1, a2)   a2 = 1   a2 = 2   a2 = 3   q1(a1)
    a1 = 1         5        6        2        6
    a1 = 2         2        7        4        7
    a1 = 3         7        4        2        7
    q2(a2)         7        7        4

    Individual actions that are promising: a1 ∈ {2, 3}, a2 ∈ {1, 2}. The optimal joint actions (2, 2) and (3, 1) are among the promising combinations, but not all promising joint actions are optimal (e.g. (2, 1) yields only 2 and (3, 2) only 4).

    ⇒ Every optimal joint policy is greedy w.r.t. the agent-individual q_j functions. However, not all greedy agent-individual policies form an optimal joint policy.
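
    A small Python check of the second table illustrates this coordination problem: the marginal values q_1 and q_2 identify the promising individual actions, but not which promising combination is actually optimal.

      # Q[(a1, a2)] for the second table above
      Q = {
          (1, 1): 5, (1, 2): 6, (1, 3): 2,
          (2, 1): 2, (2, 2): 7, (2, 3): 4,
          (3, 1): 7, (3, 2): 4, (3, 3): 2,
      }
      A = [1, 2, 3]

      q1 = {a1: max(Q[(a1, a2)] for a2 in A) for a1 in A}       # row maxima
      q2 = {a2: max(Q[(a1, a2)] for a1 in A) for a2 in A}       # column maxima

      best = max(Q.values())
      promising_a1 = [a for a in A if q1[a] == best]             # -> [2, 3]
      promising_a2 = [a for a in A if q2[a] == best]             # -> [1, 2]
      optimal_joint = [ja for ja, v in Q.items() if v == best]   # -> [(2, 2), (3, 1)]
      # e.g. the promising combination (2, 1) only yields Q = 2, far from optimal
      print(promising_a1, promising_a2, optimal_joint, Q[(2, 1)])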


  • Q-learning in fully cooperative MAMDPs

    Learning a Q-function is not sufficient to obtain an optimal joint policy.

    Idea: learn the policy explicitly instead of implicitly, and memorize the latest actions that contributed to optimal behavior.

    Every agent learns

    • a Q-function Q_j : S × A → R (the critic)

    • an agent-individual policy π_j : S → A_j (the actor)

    After observing a transition from s_t to s_{t+1} with joint action (a_{1,t}, ..., a_{k,t}) and reward r_t, it updates (with learning rate γ)

    Q_{j,t+1}(s_t, a_{1,t}, \ldots, a_{k,t}) = (1 - \gamma) \cdot Q_{j,t}(s_t, a_{1,t}, \ldots, a_{k,t}) + \gamma \cdot \big( r_t + \alpha \max_{\vec{a} \in A} Q_{j,t}(s_{t+1}, \vec{a}) \big)

    \pi_{j,t+1}(s_t) = \begin{cases} a_{j,t} & \text{if } (a_{1,t}, \ldots, a_{k,t}) \in \arg\max_{\vec{a} \in A} Q_{j,t+1}(s_t, \vec{a}) \\ \pi_{j,t}(s_t) & \text{otherwise} \end{cases}
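
    A compact Python sketch of this critic/actor scheme for one agent j, with α as discount factor and γ as learning rate as on the slide. The class and method names are invented for illustration, and the Q-function is assumed to be tabular (a dictionary over (state, joint action)).

      from collections import defaultdict

      # One joint action learner j with a memorized own policy (illustrative names).
      class JointActionLearner:
          def __init__(self, joint_actions, own_index, alpha=0.9, gamma=0.1):
              self.Q = defaultdict(float)           # critic: Q_j(s, joint action)
              self.pi = {}                          # actor: pi_j(s) -> own action
              self.joint_actions = joint_actions
              self.j = own_index
              self.alpha, self.gamma = alpha, gamma # discount factor, learning rate

          def update(self, s, joint_a, r, s_next):
              # critic: (1 - gamma) * old value + gamma * (r + alpha * max future value)
              future = max(self.Q[(s_next, a)] for a in self.joint_actions)
              self.Q[(s, joint_a)] = (1 - self.gamma) * self.Q[(s, joint_a)] \
                                     + self.gamma * (r + self.alpha * future)
              # actor: memorize the own action only if the observed joint action
              # is greedy w.r.t. the updated Q-function
              if self.Q[(s, joint_a)] == max(self.Q[(s, a)] for a in self.joint_actions):
                  self.pi[s] = joint_a[self.j]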


  • Q-learning in fully cooperative MAMDPs

    Lemma: If the prerequisites for Q-learning are met and the initial Q-functions Q_{j,0} are equal for all agents j, then the algorithm described on the previous slide is guaranteed to converge, i.e.

    Q_{j,t}(s, \vec{a}) \xrightarrow{t \to \infty} Q^*(s, \vec{a})

    and from a certain t_0 on, the joint policies Π_t(s) = (π_{1,t}(s), ..., π_{k,t}(s)) are optimal.

    Proof sketch: The convergence of the Q-functions follows directly from the convergence proof of standard Q-learning. Furthermore, since all agents perform the same update and the initial Q_{j,0} are equal, the Q_{j,t} functions of all agents are equal at all times.

    The policies of all agents are updated synchronously whenever a joint action is observed that is greedy w.r.t. the current Q-function. Hence, the joint policy Π_t is always greedy w.r.t. the current Q-function. Since the Q-functions converge to Q∗, the joint policy Π_t is optimal from a certain point in time on.


  • Q-learning in fully cooperative MAMDPs

    The previous approach is known as “joint action learners” because all agents learn a Q-function over joint actions a⃗.

    Question: is it possible to replace the Q_j function by the q_j function, which considers only agent-individual actions (“independent learner”)?

    In general, no approaches are known for independent learners that are guaranteed to converge to the optimal joint policy.

    • Q-learning is used for independent learners (a plain variant is sketched below). It converges, but not necessarily to the optimal joint policy.

    • for deterministic MAMDPs there is a variant of Q-learning that is guaranteed to converge to the optimal joint policy (Lauer & Riedmiller, 2000)
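
    For contrast, here is a minimal sketch of a plain independent learner: it keeps a q_j-table over its own actions only and treats its teammates as part of the environment; the resulting non-stationarity is one reason why guarantees are hard to obtain. All names and hyperparameters are illustrative.

      import random
      from collections import defaultdict

      # Independent learner j: standard tabular Q-learning over the OWN actions only.
      class IndependentLearner:
          def __init__(self, own_actions, alpha=0.9, lr=0.1, eps=0.1):
              self.q = defaultdict(float)           # q_j(s, a_j)
              self.own_actions = own_actions
              self.alpha, self.lr, self.eps = alpha, lr, eps

          def act(self, s):
              if random.random() < self.eps:        # epsilon-greedy exploration
                  return random.choice(self.own_actions)
              return max(self.own_actions, key=lambda a: self.q[(s, a)])

          def update(self, s, a_j, r, s_next):
              # the teammates' influence only shows up indirectly through r and s_next
              target = r + self.alpha * max(self.q[(s_next, a)] for a in self.own_actions)
              self.q[(s, a_j)] += self.lr * (target - self.q[(s, a_j)])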


  • Zero sum Markov games

    Different from cooperative MAMDPs are zero sum Markov games:

    • 2 agents

    • the reward function of one agent is the negative reward of the other, i.e. r_2(s, a_1, a_2) = −r_1(s, a_1, a_2)