
ROMA: Multi-Agent Reinforcement Learning with Emergent Roles

Tonghan Wang 1 Heng Dong 1 Victor Lesser 2 Chongjie Zhang 1

Abstract

The role concept provides a useful tool to design and understand complex multi-agent systems, which allows agents with a similar role to share similar behaviors. However, existing role-based methods use prior domain knowledge and predefine role structures and behaviors. In contrast, multi-agent reinforcement learning (MARL) provides flexibility and adaptability, but less efficiency in complex tasks. In this paper, we synergize these two paradigms and propose a role-oriented MARL framework (ROMA). In this framework, roles are emergent, and agents with similar roles tend to share their learning to be specialized on certain sub-tasks. To this end, we construct a stochastic role embedding space by introducing two novel regularizers and conditioning individual policies on roles. Experiments show that our method can learn dynamic, versatile, identifiable, and specialized roles, which help our method push forward the state of the art on the StarCraft II micromanagement benchmark. Demonstrative videos are available at https://sites.google.com/view/romarl/.

1. Introduction

Many real-world systems can be modeled as multi-agent systems (MAS), such as autonomous vehicle teams (Cao et al., 2012), intelligent warehouse systems (Nowé et al., 2012), and sensor networks (Zhang & Lesser, 2011). Cooperative multi-agent reinforcement learning (MARL) provides a promising approach to developing these systems, which allows agents to deal with uncertainty and adapt to the dynamics of an environment. In recent years, cooperative MARL has achieved prominent progress and many deep methods have been proposed (Foerster et al., 2018; Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Vinyals et al., 2019; Wang et al., 2020b; Baker et al., 2020).

1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. 2 University of Massachusetts Amherst, Amherst, MA, US. Correspondence to: Tonghan Wang <[email protected]>.

Preprint. Under review.


Figure 1. Visualization of our learned role representations at a timestep. The blue agent has the maximum health, while the red ones are dead. The corresponding policy is that agent 6 moves towards enemies to take on more firepower, so that more seriously injured agents are protected. Roles can change adaptively and will aggregate according to responsibilities that are compatible with individual characteristics, such as location, agent type, health, etc.


In order to achieve scalability, these deep MARL methods adopt a simple mechanism that all agents share and learn a decentralized value or policy network. However, such simple sharing is often not effective for many complex multi-agent tasks. For example, in Adam Smith's Pin Factory, workers must complete up to eighteen different tasks to create one pin (Smith, 1937). In this case, it is a heavy burden for a single shared policy to represent and learn all required skills. On the other hand, it is also not necessary for each agent to use a distinct policy network, which leads to high learning complexity, because some agents often perform similar sub-tasks from time to time. The question is how we can give full play to agents' specialization and dynamic sharing for improving learning efficiency.

A natural concept that comes to mind is role. A role is a comprehensive pattern of behavior, often specialized for some tasks. Agents with similar roles will show similar behaviors, and thus can share their experiences to improve performance. Role theory has been widely studied in economics, sociology, and organization theory. Researchers have also introduced the concept of role into MAS (Becht et al., 1999; Stone & Veloso, 1999; Depke et al., 2001; Ferber et al., 2003; Odell et al., 2004; Bonjean et al., 2014; Lhaksmana et al., 2018). In these role-based frameworks, the complexity of agent design is reduced via task decomposition by defining roles, which are associated with responsibilities made up of a set of sub-tasks, so that the policy



search space is effectively decomposed (Zhu & Zhou, 2008). However, these works exploit prior domain knowledge to decompose tasks and predefine the responsibilities of each role, which prevents role-based MAS from being dynamic and adaptive to uncertain environments.

To leverage the benefits of both role-based and learning methods, in this paper, we propose a role-oriented multi-agent reinforcement learning framework (ROMA). This framework implicitly introduces the role concept into MARL, which serves as an intermediary to enable agents with similar responsibilities to share their learning. We achieve this by ensuring that agents with similar roles have both similar policies and responsibilities. To establish the connection between roles and decentralized policies, ROMA conditions agents' policies on individual roles, which are parameterized in a stochastic embedding space and are determined by agents' local observations. To associate roles with responsibilities, we introduce two regularizers to enable roles to be identifiable by behaviors and specialized in certain sub-tasks. We show how well-formed role representations can be learned via optimizing tractable variational estimations of the proposed regularizers. In this way, our method synergizes role-based and learning methods while avoiding their individual shortcomings: we provide a flexible and general-purpose mechanism that promotes the emergence and specialization of roles, which in turn provides an adaptive learning sharing mechanism for efficient policy learning.

We test our method on StarCraft II1 micromanagement environments (Vinyals et al., 2017; Samvelyan et al., 2019). Results show that our method significantly pushes forward the state of the art of MARL algorithms, by virtue of the adaptive policy sharing among agents with similar roles. Visualization of the role representations in both homogeneous and heterogeneous agent teams demonstrates that the learned roles can adapt automatically in dynamic environments, and that agents with similar responsibilities have similar roles. In addition, the emergence and evolution process of roles is shown, highlighting the connection between role-driven sub-task specialization and the improvement of team efficiency in our framework. These results provide a new perspective in understanding and promoting the emergence of cooperation among agents.

2. Background

In our work, we consider a fully cooperative multi-agent task that can be modelled by a Dec-POMDP (Oliehoek et al., 2016) G = ⟨I, S, A, P, R, Ω, O, n, γ⟩, where A is the finite action set, I ≡ {1, 2, ..., n} is the finite set of agents, γ ∈ [0, 1) is the discount factor, and s ∈ S is the true state of the

1 StarCraft II is a trademark of Blizzard Entertainment™.


Figure 2. Schematics of our approach. The role encoder generates a role embedding distribution, from which a role is sampled and serves as an input to the hyper-net. The hyper-net generates the parameters of the local utility network. Local utilities are fed into a mixing network to get an estimation of the global action value. During the interactions with other agents and the environment, individual trajectories are collected and fed into the trajectory encoder to get posterior estimations of the role distributions. The framework can be trained in an end-to-end fashion.

environment. We consider partially observable settings, and agent i only has access to an observation oi ∈ Ω drawn according to the observation function O(s, i). Each agent has a history τi ∈ T ≡ (Ω × A)*. At each timestep, each agent i selects an action ai ∈ A, forming a joint action a ∈ A^n, leading to the next state s′ according to the transition function P(s′|s, a) and a shared reward r = R(s, a) for each agent. The joint policy π induces a joint action-value function:

$$Q^{\pi}_{tot}(s, \mathbf{a}) = \mathbb{E}_{s_{0:\infty}, \mathbf{a}_{0:\infty}}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, \mathbf{a}_0 = \mathbf{a}, \pi\right].$$

To effectively learn policies for agents, the paradigm of centralized training with decentralized execution (CTDE) (Foerster et al., 2016; 2018; Wang et al., 2020a) has recently attracted attention from deep MARL to deal with non-stationarity. One promising way to exploit the CTDE paradigm is the value function decomposition method (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2020b), which learns a decentralized utility function for each agent and uses a mixing network to combine these local utilities into a global action value. To achieve learning scalability, existing CTDE methods typically learn a shared local value or policy network for agents. However, this simple sharing mechanism is often not sufficient for learning complex tasks, where agents may require diverse responsibilities or skills to achieve goals. In this paper, we aim to develop a novel MARL framework to address this challenge that achieves efficient shared learning while allowing agents to learn sufficiently diverse skills.
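To make the value-decomposition idea concrete, here is a minimal, illustrative sketch (not ROMA's implementation, which uses the monotonic QMIX mixer): per-agent utilities for the chosen actions are combined by a toy additive mixer into a single Q_tot used for the centralized TD loss. All names and shapes here are assumptions.

```python
# Minimal sketch of CTDE value decomposition (VDN-style additive mixing).
# ROMA itself uses the monotonic QMIX mixer; this toy mixer only illustrates
# how per-agent utilities are combined into a global action value for the TD loss.
import torch
import torch.nn as nn

class AdditiveMixer(nn.Module):
    def forward(self, agent_qs):                      # agent_qs: [batch, n_agents]
        return agent_qs.sum(dim=1, keepdim=True)      # Q_tot: [batch, 1]

n_agents, batch = 3, 4
chosen_qs = torch.randn(batch, n_agents)              # Q_i(tau_i, a_i) of the chosen actions
q_tot = AdditiveMixer()(chosen_qs)
print(q_tot.shape)                                    # torch.Size([4, 1])
```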

3. Method

In this section, we will present a novel role-oriented MARL framework (ROMA) that introduces the role concept into MARL and enables adaptive shared learning among agents.


ROMA adopts the CTDE paradigm. As shown in Figure 2, it learns local Q-value functions for agents, which are fed into a mixing network to compute a global TD loss for centralized training. During execution, the mixing network will be removed and each agent will act based on its local policy derived from its value function. Agents' value functions or policies depend on their roles, each of which is responsible for performing similar, automatically identified sub-tasks. To enable efficient and effective shared learning among agents with similar behaviors, ROMA will automatically learn roles that are:

i) Identifiable: The role of an agent can be identified by its behavior patterns;

ii) Specialized: Agents with similar roles are expected to specialize in similar responsibilities;

iii) Dynamic: An agent's role can automatically adapt to the dynamics of the environment;

iv) Versatile: The emergent roles are diverse enough to efficiently fulfill a task.

Formally, each agent i has a local utility function (or an individual policy), whose parameters θi are conditioned on its role ρi. To learn roles with the desired properties, we encode roles in a stochastic embedding space, and the role of agent i, ρi, is drawn from a multivariate Gaussian distribution N(µρi, σρi). To enable the dynamic property, ROMA conditions an agent's role on its local observations, and uses a trainable neural network f to learn the parameters of the Gaussian distribution of the role:

$$(\mu_{\rho_i}, \sigma_{\rho_i}) = f(o_i; \theta_\rho), \qquad \rho_i \sim \mathcal{N}(\mu_{\rho_i}, \sigma_{\rho_i}), \tag{1}$$

where θρ are the parameters of f. The sampled role ρi is then fed into a hyper-network g(ρi; θh) parameterized by θh to generate the parameters of the individual policy, θi. We call f the role encoder and g the role decoder. In the next two sub-sections, we will describe two regularizers for learning identifiable, versatile, and specialized roles.
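The following is a hedged sketch of Eq. 1 and the role decoder described above, written in PyTorch. The layer sizes, the softplus parameterization of the standard deviation, and the choice of a single linear utility head generated by the hyper-network are illustrative assumptions, not the authors' released architecture.

```python
# A minimal sketch (not the authors' code) of Eq. 1: a role encoder mapping the
# local observation to a Gaussian over roles, and a hyper-network ("role decoder")
# that turns a sampled role into the parameters of a small per-agent utility head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoleEncoder(nn.Module):
    def __init__(self, obs_dim, role_dim=3):
        super().__init__()
        self.mu = nn.Linear(obs_dim, role_dim)
        self.log_sigma = nn.Linear(obs_dim, role_dim)

    def forward(self, obs):
        mu = self.mu(obs)
        sigma = F.softplus(self.log_sigma(obs)) + 1e-6     # keep std positive
        dist = torch.distributions.Normal(mu, sigma)
        return dist.rsample(), dist                        # reparameterized sample

class RoleDecoder(nn.Module):
    """Hyper-network g(rho; theta_h): role -> weights of a linear utility head."""
    def __init__(self, role_dim, feat_dim, n_actions):
        super().__init__()
        self.w = nn.Linear(role_dim, feat_dim * n_actions)
        self.b = nn.Linear(role_dim, n_actions)
        self.feat_dim, self.n_actions = feat_dim, n_actions

    def forward(self, role, features):                     # features: agent's hidden state
        W = self.w(role).view(-1, self.feat_dim, self.n_actions)
        b = self.b(role)
        return torch.bmm(features.unsqueeze(1), W).squeeze(1) + b   # Q_i(tau_i, .)

obs = torch.randn(5, 24)                  # batch of 5 agents' observations (sizes are assumptions)
feats = torch.randn(5, 32)                # e.g. GRU hidden states of the utility net
role, role_dist = RoleEncoder(24)(obs)
q_values = RoleDecoder(3, 32, 6)(role, feats)
print(q_values.shape)                     # torch.Size([5, 6])
```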

3.1. Identifiable and Versatile Roles

Introducing a latent role embedding and conditioning individual policies on this embedding does not automatically generate versatile and identifiable roles. To ensure these two properties, we can maximize an entropy objective H(ρi|oi) so that roles are diverse given the local observation oi, and, meanwhile, minimize the conditional entropy H(ρi|τi, oi) of role ρi given an experience trajectory τi to make a behavior identifiable by a role. Therefore, we propose a regularizer for ROMA to maximize H(ρi|oi) − H(ρi|τi, oi), which mathematically equals I(τi; ρi|oi), the conditional mutual information between the individual trajectory and role given the current observation.

However, estimating and maximizing mutual information is often intractable. Drawing inspiration from the literature on variational inference (Wainwright et al., 2008; Alemi et al., 2017), we introduce a variational posterior estimator to derive a tractable lower bound for the mutual information at each time step t:

$$\begin{aligned}
I(\rho_i^t; \tau_i^{t-1} \mid o_i^t)
&= \mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log \frac{p(\rho_i^t \mid \tau_i^{t-1}, o_i^t)}{p(\rho_i^t \mid o_i^t)}\right] \\
&= \mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log \frac{q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)}{p(\rho_i^t \mid o_i^t)}\right] + \mathbb{E}_{\tau_i^{t-1}, o_i^t}\left[D_{\mathrm{KL}}\!\left(p(\rho_i^t \mid \tau_i^{t-1}, o_i^t) \,\|\, q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right)\right] \\
&\geq \mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log \frac{q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)}{p(\rho_i^t \mid o_i^t)}\right],
\end{aligned} \tag{2}$$

where $\tau_i^{t-1} = (o_i^0, a_i^0, \cdots, o_i^{t-1}, a_i^{t-1})$, $q_\xi$ is the variational estimator parameterised with ξ, and the last inequality holds because of the non-negativity of the KL divergence. The distribution $q_\xi$ can be arbitrary, and we use a GRU (Cho et al., 2014) to encode an agent's history of observations and actions, which is called a trajectory encoder. The lower bound in Eq. 2 can be further rewritten as a loss function to be minimized:

$$\mathcal{L}_I(\theta_\rho, \xi) = \mathbb{E}_{(\tau_i^{t-1}, o_i^t) \sim \mathcal{D}}\left[ CE\!\left[p(\rho_i^t \mid o_i^t) \,\|\, q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right] - H(\rho_i^t \mid o_i^t) \right], \tag{3}$$

where $\mathcal{D}$ is a replay buffer, and $H(\cdot)$ and $CE[\cdot\|\cdot]$ are the entropy and cross entropy operators, respectively. The detailed derivation can be found in Appendix A.1.
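Below is a sketch of how L_I in Eq. 3 could be computed for the Gaussian role distributions used here. Since CE[p‖q] − H(p) = D_KL(p‖q), the loss reduces to a KL term between the role encoder's p(ρ|o) and a GRU-based posterior q_ξ(ρ|τ, o); the GRU size, the way history and current observation are combined, and all shapes are assumptions.

```python
# Sketch (illustrative, not the released implementation) of the identifiability
# regularizer L_I in Eq. 3 for Gaussian role distributions: since
# CE[p||q] - H(p) = D_KL(p||q), the loss reduces to a KL term between the role
# encoder's p(rho|o) and a GRU-based posterior q_xi(rho|tau, o).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    def __init__(self, in_dim, obs_dim, hidden=64, role_dim=3):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)    # encodes the (o, a) history
        self.head = nn.Linear(hidden + obs_dim, 2 * role_dim)
        self.role_dim = role_dim

    def forward(self, history, obs):                           # history: [B, T, in_dim]
        _, h = self.gru(history)                               # h: [1, B, hidden]
        out = self.head(torch.cat([h.squeeze(0), obs], dim=-1))
        mu = out[:, :self.role_dim]
        sigma = F.softplus(out[:, self.role_dim:]) + 1e-6
        return torch.distributions.Normal(mu, sigma)

def identifiability_loss(p_role, q_role):
    """L_I = E[ CE[p||q] - H(p) ] = E[ KL(p||q) ], summed over role dimensions."""
    return torch.distributions.kl_divergence(p_role, q_role).sum(-1).mean()

obs = torch.randn(5, 24)
history = torch.randn(5, 7, 24 + 6)                            # 7 past (obs, one-hot action) pairs
p_role = torch.distributions.Normal(torch.zeros(5, 3), torch.ones(5, 3))  # stand-in for p(rho|o)
q_role = TrajectoryEncoder(24 + 6, 24)(history, obs)
print(identifiability_loss(p_role, q_role))
```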

3.2. Specialized Roles

The formulation so far does not promote sub-task specialization, which is a key component to decompose the task and improve efficiency in multi-agent systems. Minimizing the loss LI enables roles to contain enough information about long-term behaviors, but does not explicitly ensure that agents with similar behaviors have similar role embeddings.

Intuitively, for any two agents in a population with clear sub-task specialization, either they have similar roles or they have quite different responsibilities, which can be characterized by their behavior trajectories. However, which agents should have similar roles is not known during the process of role emergence, nor is the similarity between behaviors straightforward to define.

To resolve this problem, we define another regularizer. To encourage two agents i and j to have similar roles and similar behaviors, we can maximize I(ρi; τj), the mutual information between the role of agent i and the trajectory of agent j. Directly optimizing these objectives will result in all agents having the same role, and, correspondingly, the same policy, which will limit system performance. To


settle this issue, we introduce a dissimilarity model dφ : T × T → R, which is a trainable neural network taking two trajectories as input, and seek to maximize I(ρi; τj) + dφ(τi, τj) while minimizing the number of non-zero elements in the matrix Dφ = (dij), where dij = dφ(τi, τj) is the estimated dissimilarity between the trajectories of agents i and j. This formulation ensures that the dissimilarity d is high only when the mutual information I is low, so that the set of learned roles is compact but diverse enough for the identified sub-tasks in order to efficiently solve the given task. Formally, the following optimization problem constrains the role embedding learning and encourages sub-task specialization:

$$\underset{\theta_\rho, \xi, \phi}{\text{minimize}} \quad \|D_\phi^t\|_{2,0} \tag{4}$$
$$\text{subject to} \quad I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}) > U, \quad \forall i \neq j,$$

where U is a hyper-parameter which controls the compactness of the role representation. Relaxing the matrix norm $\|\cdot\|_{2,0}$ with the Frobenius norm, we get the following objective to minimize:

$$\|D_\phi^t\|_F - \sum_{i \neq j} \min\left\{ I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\}. \tag{5}$$

However, as estimating and optimizing the mutual information term is intractable, we use the variational posterior estimator introduced in Sec. 3.1 to construct an upper bound, serving as the second regularizer of ROMA:

$$\mathcal{L}_D(\theta_\rho, \phi, \xi) = \mathbb{E}_{(\boldsymbol{\tau}^{t-1}, \mathbf{o}^t) \sim \mathcal{D},\, \boldsymbol{\rho}^t \sim p(\boldsymbol{\rho}^t \mid \mathbf{o}^t)}\left[ \|D_\phi^t\|_F - \sum_{i \neq j} \min\left\{ q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\} \right], \tag{6}$$

where $\mathcal{D}$ is the replay buffer, $\boldsymbol{\tau}^{t-1}$ is the joint trajectory, $\mathbf{o}^t$ is the joint observation, and $\boldsymbol{\rho}^t = \langle \rho_1^t, \rho_2^t, \cdots, \rho_n^t \rangle$. A detailed derivation can be found in Appendix A.2.
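Here is a sketch of the specialization regularizer L_D in Eq. 6, with a small trainable dissimilarity network d_φ over pairs of trajectory embeddings. Following the derivation in Appendix A.2 (the T2 term), the mutual-information term is replaced by the log-density of the variational posterior q_ξ. The pairwise double loop, the embedding dimensions, and U = 2.0 are illustrative assumptions; a practical implementation would batch these computations.

```python
# Sketch (assumptions marked; not the authors' code) of the specialization
# regularizer L_D in Eq. 6. d_phi is a trainable dissimilarity net over pairs of
# trajectory embeddings; the mutual-information term is replaced by the
# log-density of the variational posterior q_xi (the T2 term in Appendix A.2).
import torch
import torch.nn as nn

class Dissimilarity(nn.Module):
    def __init__(self, traj_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * traj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, tau_i, tau_j):                       # trajectory embeddings
        return self.net(torch.cat([tau_i, tau_j], dim=-1)).squeeze(-1)

def specialization_loss(roles, q_dists, tau_emb, d_phi, U=2.0):
    """roles: [n, role_dim]; q_dists[j]: posterior q_xi(.|tau_j, o_j); tau_emb: [n, traj_dim]."""
    n = roles.shape[0]
    d_entries, penalties = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_ij = d_phi(tau_emb[i], tau_emb[j])
            log_q = q_dists[j].log_prob(roles[i]).sum()    # log q_xi(rho_i | tau_j, o_j)
            d_entries.append(d_ij)
            penalties.append(torch.minimum(log_q + d_ij, torch.tensor(U)))
    d_entries = torch.stack(d_entries)
    frob = torch.sqrt((d_entries ** 2).sum())              # Frobenius norm of the off-diagonal D_phi
    return frob - torch.stack(penalties).sum()

n, role_dim, traj_dim = 4, 3, 16
roles = torch.randn(n, role_dim)
tau_emb = torch.randn(n, traj_dim)
q_dists = [torch.distributions.Normal(torch.zeros(role_dim), torch.ones(role_dim)) for _ in range(n)]
print(specialization_loss(roles, q_dists, tau_emb, Dissimilarity(traj_dim)))
```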

3.3. Overall Optimization Objective

We have introduced optimization objectives for learning roles to be versatile, identifiable, and specialized. Apart from these regularizers, all the parameters in the framework are updated by gradients induced by the standard TD loss of reinforcement learning. As shown in Figure 2, to compute the global TD loss, individual utilities are fed into a mixing network whose output is the estimation of the global action-value Qtot. In this paper, our ROMA implementation uses the mixing network introduced by QMIX (Rashid et al., 2018) (see Appendix D) for its monotonic approximation, but it can be easily replaced by other mixing methods. The parameters of the mixing network are conditioned on the global state s and are generated by a hyper-net parameterized by θm. Therefore, the final learning objective of ROMA is:

$$\mathcal{L}(\theta) = \mathcal{L}_{TD}(\theta) + \lambda_I \mathcal{L}_I(\theta_\rho, \xi) + \lambda_D \mathcal{L}_D(\theta_\rho, \xi, \phi), \tag{7}$$

where $\theta = (\theta_\rho, \xi, \phi, \theta_h, \theta_m)$, $\lambda_I$ and $\lambda_D$ are scaling factors, and $\mathcal{L}_{TD}(\theta) = [r + \gamma \max_{\mathbf{a}'} Q_{tot}(s', \mathbf{a}'; \theta^-) - Q_{tot}(s, \mathbf{a}; \theta)]^2$ ($\theta^-$ are the parameters of a periodically updated target network). In our centralized training with decentralized execution paradigm, only the role encoder, the role decoder, and the individual policies are used during execution.
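Putting the pieces together, a sketch of the overall objective in Eq. 7: the one-step TD loss plus the two regularizers, weighted by λ_I and λ_D (the values 10^-4 and 10^-2 reported in Sec. 5). The function signature and tensor shapes are assumptions, not the authors' training loop.

```python
# Sketch of the overall objective in Eq. 7, assuming the quantities computed in
# the earlier sketches; trained end-to-end from a replay buffer with a
# periodically updated target network providing q_tot_target_max.
import torch

def total_loss(q_tot, q_tot_target_max, reward, gamma, l_i, l_d,
               lambda_i=1e-4, lambda_d=1e-2):
    # One-step TD target: r + gamma * max_a' Q_tot(s', a'; theta^-)
    td_target = reward + gamma * q_tot_target_max
    l_td = ((td_target.detach() - q_tot) ** 2).mean()
    return l_td + lambda_i * l_i + lambda_d * l_d

loss = total_loss(q_tot=torch.randn(8, 1), q_tot_target_max=torch.randn(8, 1),
                  reward=torch.randn(8, 1), gamma=0.99,
                  l_i=torch.tensor(0.3), l_d=torch.tensor(1.2))
print(loss)
```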

4. Related Works

The emergence of roles has been documented in many natural systems, such as bees (Jeanson et al., 2005), ants (Gordon, 1996), and humans (Butler, 2012). In these systems, the role is closely related to the division of labor and is crucial to the improvement of labor efficiency. Many multi-agent systems are inspired by these natural systems. They decompose the task, make agents with the same role specialize in certain sub-tasks, and thus reduce the design complexity (Wooldridge et al., 2000; Omicini, 2000; Padgham & Winikoff, 2002; Pavon & Gomez-Sanz, 2003; Cossentino et al., 2005; Zhu & Zhou, 2008; Spanoudakis & Moraitis, 2010; DeLoach & Garcia-Ojeda, 2010; Bonjean et al., 2014). These methodologies are designed for tasks with clear structure, such as software engineering (Bresciani et al., 2004). Therefore, they tend to use predefined roles and the associated responsibilities (Lhaksmana et al., 2018). In contrast, we focus on how to implicitly introduce the concept of roles into general multi-agent sequential decision making under dynamic and uncertain environments.

Deep multi-agent reinforcement learning has witnessed vigorous progress in recent years. COMA (Foerster et al., 2018), MADDPG (Lowe et al., 2017), PR2 (Wen et al., 2019), and MAAC (Iqbal & Sha, 2019) explore multi-agent policy gradients. Another line of research focuses on value-based multi-agent RL, and value-function factorization is the most popular method. VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019) have progressively enlarged the family of functions that can be represented by the mixing network. NDQ (Wang et al., 2020b) proposes nearly decomposable value functions to address the miscoordination problem in learning fully decentralized value functions. Emergence is a topic of increasing interest in deep MARL. Works on the emergence of communication (Foerster et al., 2016; Lazaridou et al., 2017; Das et al., 2017; Mordatch & Abbeel, 2018; Wang et al., 2020b), the emergence of fairness (Jiang & Lu, 2019), and the emergence of tool usage (Baker et al., 2020) provide a deep learning perspective on understanding both natural and artificial multi-agent systems.

To learn diverse and identifiable roles, we propose to optimize the mutual information between individual roles and trajectories. A recent work studying multi-agent exploration, MAVEN (Mahajan et al., 2019), uses a similar objective. Different from ROMA, MAVEN aims at committed exploration.


[Figure 3 panels] (a) t=1, roles are characterized by agents' positions. (b) t=8, roles are characterized by agents' remaining hit points. (c) t=19, roles are characterized by agents' remaining hit points and positions.

Figure 3. Dynamic role adaptation during an episode (means of the role distributions, µρi, are shown, without using any dimensionality reduction techniques). The role encoder learns to focus on different parts of observations according to the automatically discovered demands of the task. The role-induced strategy helps (a) quickly form the offensive arc when t=1; (b) protect injured agents when t=8; (c) protect dying agents and alternate fire when t=19.

This difference in high-level purpose leads to many technical distinctions. First, MAVEN optimizes the mutual information between the joint trajectory and a latent variable conditioned on a Gaussian or uniform random variable to encourage diverse joint trajectories. Second, apart from the mutual information objective, we propose a novel regularizer to learn specialized roles, while MAVEN adopts a hierarchical structure and encourages the latent variable to help get more environmental rewards. We empirically compare ROMA with MAVEN in Sec. 5. More related works are discussed in Appendix D.

5. Experiments

Our experiments aim to answer the following questions: (1) Can the learned roles automatically adapt in dynamic environments? (Sec. 5.1.) (2) Can our method promote sub-task specialization? That is, do agents with similar responsibilities have similar role embedding representations, while agents with different responsibilities have role embedding representations far from each other? (Sec. 5.1, 5.3.) (3) Can such sub-task specialization improve the performance of multi-agent reinforcement learning algorithms? (Sec. 5.2.) (4) How do roles evolve during training, and how do they influence team performance? (Sec. 5.4.) Videos of our experiments are available online2.

Baselines. We compare our method with the various baselines shown in Table 1. In particular, we carry out the following

2https://sites.google.com/view/romarl/

Table 1. Baseline algorithms.

Related works:
  IQL        Independent Q-learning
  COMA       Foerster et al. (2018)
  QMIX       Rashid et al. (2018)
  QTRAN      Son et al. (2019)
  MAVEN      Mahajan et al. (2019)

Ablations:
  ROMA-RAW   Without LI and LD
  QMIX-NPS   QMIX without parameter sharing among agents
  QMIX-LAR   QMIX with a similar number of parameters to ROMA

ablation studies: (i) ROMA-RAW has the same framework as ROMA but does not include LI and LD in the optimization objective. This ablation is designed to highlight the contribution of the proposed regularizers. (ii) QMIX-NPS: the same as QMIX (Rashid et al., 2018), but agents do not share parameters. Our method achieves adaptive learning sharing, and comparison against QMIX (parameters are shared among agents) and QMIX-NPS tests whether this flexibility can improve learning efficiency. (iii) QMIX-LAR: QMIX with a similar number of parameters to our framework, which can test whether the superiority of our method comes from the increase in the number of parameters.

We carry out a grid search over the loss coefficients λI and λD, and fix them at $10^{-4}$ and $10^{-2}$, respectively, across all


[Figure 4 panels: MMM2, 10m_vs_11m, 5m_vs_6m, 27m_vs_30m, 8m_vs_9m, 6z4b; x-axis: T (mil); y-axis: Test Win %; legend: MAVEN, QMIX, COMA, IQL, QTRAN, ROMA (Ours).]

Figure 4. Comparison of our method against baseline algorithms. Results for more maps can be found in Appendix C.1.

[Figure 5 panels: MMM2, 8m_vs_9m, 10m_vs_11m; x-axis: T (mil); y-axis: Test Win %; legend: ROMA (Ours), ROMA-RAW, QMIX-NPS, QMIX-LAR.]

Figure 5. Comparison of our method against ablations.

the experiments. The dimensionality of the latent role space is set to 3, so we did not use any dimensionality reduction techniques when visualizing the role embedding representations. Other hyperparameters are also fixed in our experiments and are listed in Appendix B.1. For ROMA, we use elementary network structures (fully-connected networks or GRUs) for the role encoder, role decoder, and trajectory encoder. The details of the architecture of our method and baselines can be found in Appendix B.
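For reference, the hyperparameters stated in this excerpt, collected in an illustrative config dict (anything not stated above, e.g. optimizer settings, is omitted):

```python
# Illustrative summary of the hyperparameters reported in the text above.
roma_config = {
    "lambda_I": 1e-4,   # weight of the identifiability regularizer L_I
    "lambda_D": 1e-2,   # weight of the specialization regularizer L_D
    "role_dim": 3,      # latent role space, plotted without dimensionality reduction
}
```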

5.1. Dynamic Roles

Answering the first and second questions, we show snapshots of an episode played by ROMA agents on the StarCraft II micromanagement benchmark (SMAC) map 10m_vs_11m, where 10 Marines face 11 enemy Marines. As shown in Fig. 3 (the role representations at t=27 are presented in Fig. 1), although observations contain much information, such as positions, health points, shield points, and states of ally and enemy units, the role encoder learns to focus on different parts of the observations according to the dynamically changing situation. At the beginning (t=1), agents need to form a concave arc to maximize the number of agents whose shooting range covers the front line of enemies. ROMA learns to allocate roles according to agents' relative positions so that agents can form the offensive formation quickly using specialized policies. In the middle of the battle, one important tactic is to protect the injured ranged units.

Our method learns this maneuver, and roles cluster according to the remaining health points (t=8, 19, 27). The healthiest agents have role representations far from those of other agents. Such representations result in differentiated strategies: the healthiest agents move forward to take on more firepower while other agents move backward, firing from a distance. In the meantime, some roles also cluster according to positions (agents 3 and 8 when t=19). The corresponding behavior is that agents with different roles fire alternately to share the firepower. We can also observe that the role representations of dead agents aggregate together, representing a special group with an increasing number of agents during the battle.

These results demonstrate that our method learns dynamic roles, and that roles cluster clearly corresponding to automatically detected sub-tasks, in line with the implicit constraints of the proposed optimization objectives.

5.2. Performance on StarCraft II

To test whether these roles and the corresponding sub-task specialization can improve learning efficiency, we test our method on the StarCraft II micromanagement (SMAC) benchmark (Samvelyan et al., 2019). This benchmark consists of various maps which have been classified as easy, hard, and super hard. We compare ROMA with the algorithms


[Figure 6 panels] (a) Strategy: sacrificing Zealots 9 and 7 to minimize Banelings' splash damage. (b) Strategy: forming an offensive concave arc quickly. (c) Strategy: green Zerglings hide away and Banelings kill most enemies by explosion.

Figure 6. Learned roles for 6s4z_vs_10b30z, 27m_vs_30m, and 6z4b (means of the role distributions, µρi, are shown, without using any dimensionality reduction techniques), and the related, automatically discovered responsibilities.

shown in Table 1 and present results for three hard maps, 5m_vs_6m, 8m_vs_9m, and 10m_vs_11m, and two super hard maps, MMM2 and 27m_vs_30m. Although the SMAC benchmark is challenging, it is not specially designed to test performance in tasks with many agents. We thus introduce three new SMAC maps to test the scalability of our method, which are described in detail in Appendix C.

For evaluation, all experiments in this section are carried out with 5 different random seeds, and results are shown with a 95% confidence interval. Among these maps, four maps, MMM2, 6s4z_vs_10b30z, 6z4b, and 10z5b_vs_2z3s, feature heterogeneous agents, and the others have homogeneous agents. Fig. 4 shows that our method yields substantially better results than all the alternative approaches on both homogeneous and heterogeneous maps (additional plots can be found in Appendix C.1). MAVEN overcomes the negative effects of QMIX's monotonicity constraint on exploration. However, it performs less satisfactorily than QMIX on most maps. We believe this is because agents start engaging in the battle immediately after spawning in SMAC maps, and exploration is not the critical factor affecting performance.

Ablations. We carry out ablation studies, comparing with the ablations shown in Table 1, and present results on three maps, MMM2 (heterogeneous), 8m_vs_9m, and 10m_vs_11m (homogeneous), in Fig. 5. The superiority of our method over ROMA-RAW highlights the contribution of the proposed regularizers: ROMA-RAW performs even worse than QMIX on two of the three maps. Comparison between QMIX-NPS and QMIX demonstrates that parameter sharing can, as documented (Foerster et al., 2018; Rashid et al., 2018), speed up training. As discussed in the introduction, both of these two

paradigms may not achieve the best possible performance. In contrast, our method provides a dynamic learning sharing mechanism: agents committed to a certain responsibility have similar policies. The comparison of the performance of ROMA, QMIX, and QMIX-NPS shows that such sub-task specialization can indeed improve team performance. Moreover, the comparison of ROMA against QMIX-LAR shows that the superiority of our method does not depend on the larger number of parameters.

The performance gap between ROMA and the ablations is more significant on maps with more than ten agents. This observation supports the discussion in previous sections: the emergence of roles is more likely to improve labor efficiency in larger populations.

5.3. Role Embedding Representations

To explain the superiority of ROMA, we present the learned role embedding representations for three maps in Fig. 6. Roles are representative of automatically discovered sub-tasks in the learned winning strategy. On the map 6s4z_vs_10b30z, ROMA learns to sacrifice Zealots 9 and 7 to kill all the enemy Banelings. Specifically, Zealots 9 and 7 move to the frontier one by one to minimize the splash damage, while other agents stay away and wait until all Banelings explode. Fig. 6(a) shows the role embedding representations while performing the first sub-task, where agent 9 is sacrificed. We can see that the role of Zealot 9 is quite different from those of other agents. Correspondingly, the strategy at this time is that agent 9 moves rightward while the other agents keep still. Detailed analysis for the other two maps can be found in Appendix C.2.


Figure 7. Role emergence and evolution on the map MMM2 (role representations at time step 1 are shown) during training (means of the role distributions, µρi, are shown, without using any dimensionality reduction techniques). The emergence and specialization of roles is closely connected to the improvement of team performance. Agents in MMM2 are heterogeneous, and we show the role evolution process in a homogeneous team in Appendix C.3.

5.4. Emergence and Evolution of Roles

We have shown the learned role representations and the performance of our method, but the relationship between roles and performance remains unclear. To make up for this shortcoming, we visualize the emergence and evolution of roles during the training process on the maps MMM2 (heterogeneous) and 10m_vs_11m (homogeneous). We discuss the results on MMM2 here and defer the analysis for 10m_vs_11m to Appendix C.3.

In MMM2, 1 Medivac, 2 Marauders, and 7 Marines face a stronger enemy team consisting of 1 Medivac, 3 Marauders, and 8 Marines. Among the three involved unit types, the Medivac is the most special one in that it can heal injured units. In Fig. 7, we show one of the learning curves of ROMA (red) and the role representations at the first environment step at three different stages. When the training begins (T=0), roles are random, and the agents are exploring the environment to learn the basic dynamics and structure of the task. By T=6M, ROMA has learned that the responsibilities of the Medivac are different from those of the Marines and Marauders. The role, and correspondingly the policy, of the Medivac becomes quite different (Fig. 7, middle). Such differentiation in behaviors enables agents to start winning the game. Gradually, ROMA learns that Marines and Marauders have dissimilar characteristics and should take different sub-tasks, indicated by the differentiation of their role representations (Fig. 7, right). This further specialization facilitates the performance increase between 6M and 10M. After T=10M, the responsibilities of roles are clear, and, as a result, the win rate gradually converges (Fig. 4, top left).

For comparison, ROMA without LI and LD cannot win even once on this challenging task (ROMA-RAW in Fig. 5, left). These results demonstrate that the gradually specialized roles are indispensable to the improvement of team performance.

In summary, our experiments demonstrate that ROMA can learn dynamic, identifiable, versatile, and specialized roles that effectively decompose the task. Drawing support from these emergent roles, our method significantly pushes forward the state of the art of multi-agent reinforcement learning algorithms.

6. Closing Remarks

We have introduced the concept of roles into deep multi-agent reinforcement learning by capturing emergent roles and encouraging them to specialize on a set of automatically detected sub-tasks. This deep role-oriented multi-agent learning framework provides another perspective to explain and promote cooperation within agent teams, and implicitly draws a connection to the division of labor, which has been practiced in many natural systems for a long time.

To the best of our knowledge, this paper makes a first attempt at learning roles via deep reinforcement learning. The gargantuan task of understanding the emergence of roles, the division of labor, and interactions between more complex roles in hierarchical organizations still lies ahead. We believe that these topics are basic and indispensable in building effective, flexible, and general-purpose multi-agent systems, and that this paper can help tackle these challenges.


References

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., and Mordatch, I. Emergent tool use from multi-agent autocurricula. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.

Bargiacchi, E., Verstraeten, T., Roijers, D., Nowé, A., and Hasselt, H. Learning to coordinate with coordination graphs in repeated single-stage multi-agent decision problems. In International Conference on Machine Learning, pp. 491–499, 2018.

Becht, M., Gurzki, T., Klarmann, J., and Muscholl, M. ROPE: Role oriented programming environment for multiagent systems. In Proceedings Fourth IFCIS International Conference on Cooperative Information Systems (CoopIS 99), pp. 325–333. IEEE, 1999.

Bonjean, N., Mefteh, W., Gleizes, M., Maurel, C., and Migeon, F. ADELFE 2.0: Handbook on agent-oriented design processes, M. Cossentino, V. Hilaire, A. Molesini, and V. Seidita (eds.), 2014.

Bresciani, P., Perini, A., Giorgini, P., Giunchiglia, F., and Mylopoulos, J. Tropos: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems, 8(3):203–236, 2004.

Butler, E. The condensed wealth of nations. Centre for Independent Studies, 2012.

Cao, Y., Yu, W., Ren, W., and Chen, G. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2012.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, 2014.

Cossentino, M., Gaglio, S., Sabatucci, L., and Seidita, V. The PASSI and agile PASSI MAS meta-models compared with a unifying proposal. In International Central and Eastern European Conference on Multi-Agent Systems, pp. 183–192. Springer, 2005.

Das, A., Kottur, S., Moura, J. M., Lee, S., and Batra, D. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2951–2960, 2017.

Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., and Pineau, J. TarMAC: Targeted multi-agent communication. In International Conference on Machine Learning, pp. 1538–1546, 2019.

DeLoach, S. A. and Garcia-Ojeda, J. C. O-MaSE: A customisable approach to designing and building complex, adaptive multi-agent systems. International Journal of Agent-Oriented Software Engineering, 4(3):244–280, 2010.

Depke, R., Heckel, R., and Kuster, J. M. Roles in agent-oriented modeling. International Journal of Software Engineering and Knowledge Engineering, 11(03):281–302, 2001.

Ferber, J., Gutknecht, O., and Michel, F. From agents to organizations: An organizational view of multi-agent systems. In International Workshop on Agent-Oriented Software Engineering, pp. 214–230. Springer, 2003.

Foerster, J., Assael, I. A., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.

Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H., Kohli, P., and Whiteson, S. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 1146–1155. JMLR.org, 2017.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Gordon, D. M. The organization of work in social insect colonies. Nature, 380(6570):121–124, 1996.

Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., and Edwards, H. Evaluating generalization in multiagent systems using agent-interaction graphs. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1944–1946. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Guestrin, C., Koller, D., and Parr, R. Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems, pp. 1523–1530, 2002.

Hoshen, Y. VAIN: Attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, pp. 2701–2711, 2017.

Iqbal, S. and Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 2961–2970, 2019.

Jeanson, R., Kukuk, P. F., and Fewell, J. H. Emergence of division of labour in halictine bees: Contributions of social interactions and behavioural variance. Animal Behaviour, 70(5):1183–1193, 2005.

Jiang, J. and Lu, Z. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pp. 7254–7264, 2018.

Jiang, J. and Lu, Z. Learning fairness in multi-agent systems. In Advances in Neural Information Processing Systems, pp. 13854–13865, 2019.

Kim, D., Moon, S., Hostallero, D., Kang, W. J., Lee, T., Son, K., and Yi, Y. Learning to schedule communication in multi-agent reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688–2697, 2018.

Lazaridou, A., Peysakhovich, A., and Baroni, M. Multi-agent cooperation and the emergence of (natural) language. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Lhaksmana, K. M., Murakami, Y., and Ishida, T. Role-based modeling for designing agent behavior in self-organizing multi-agent systems. International Journal of Software Engineering and Knowledge Engineering, 28(01):79–96, 2018.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.

Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611–7622, 2019.

Mordatch, I. and Abbeel, P. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Nguyen, D. T., Kumar, A., and Lau, H. C. Credit assignment for collective multiagent RL with global rewards. In Advances in Neural Information Processing Systems, pp. 8102–8113, 2018.

Nowé, A., Vrancx, P., and De Hauwere, Y.-M. Game theory and multi-agent reinforcement learning. In Reinforcement Learning, pp. 441–470. Springer, 2012.

Odell, J., Nodine, M., and Levy, R. A metamodel for agents, roles, and groups. In International Workshop on Agent-Oriented Software Engineering, pp. 78–92. Springer, 2004.

Oliehoek, F. A., Amato, C., et al. A Concise Introduction to Decentralized POMDPs, volume 1. Springer, 2016.

Omicini, A. SODA: Societies and infrastructures in the analysis and design of agent-based systems. In International Workshop on Agent-Oriented Software Engineering, pp. 185–193. Springer, 2000.

Padgham, L. and Winikoff, M. Prometheus: A methodology for developing intelligent agents. In International Workshop on Agent-Oriented Software Engineering, pp. 174–185. Springer, 2002.

Pavon, J. and Gomez-Sanz, J. Agent oriented software engineering with INGENIAS. In International Central and Eastern European Conference on Multi-Agent Systems, pp. 394–403. Springer, 2003.

Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

Rashid, T., Samvelyan, M., Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4292–4301, 2018.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The StarCraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019.

Singh, A., Jain, T., and Sukhbaatar, S. Learning when to communicate at scale in multiagent cooperative and competitive tasks. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

Smith, A. The Wealth of Nations [1776], 1937.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896, 2019.

Spanoudakis, N. and Moraitis, P. Using ASEME methodology for model-driven agent systems development. In International Workshop on Agent-Oriented Software Engineering, pp. 106–127. Springer, 2010.

Stone, P. and Veloso, M. Task decomposition, dynamic role assignment, and low-bandwidth communication for real-time strategic teamwork. Artificial Intelligence, 110(2):241–273, 1999.

Sukhbaatar, S., Fergus, R., et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, 1993.

Usunier, N., Synnaeve, G., Lin, Z., and Chintala, S. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Kuttler, H., Agapiou, J., Schrittwieser, J., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Wang, T., Wang, J., Yi, W., and Zhang, C. Influence-based multi-agent exploration. In Proceedings of the International Conference on Learning Representations (ICLR), 2020a.

Wang, T., Wang, J., Zheng, C., and Zhang, C. Learning nearly decomposable value functions with communication minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2020b.

Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. Probabilistic recursive reasoning for multi-agent reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

Wooldridge, M., Jennings, N. R., and Kinny, D. The Gaia methodology for agent-oriented analysis and design. Autonomous Agents and Multi-Agent Systems, 3(3):285–312, 2000.

Yang, Z., Zhao, J., Dhingra, B., He, K., Cohen, W. W., Salakhutdinov, R. R., and LeCun, Y. GLoMo: Unsupervised learning of transferable relational graphs. In Advances in Neural Information Processing Systems, pp. 8950–8961, 2018.

Zhang, C. and Lesser, V. Coordinated multi-agent reinforcement learning in networked distributed POMDPs. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

Zhu, H. and Zhou, M. Role-based multi-agent systems. In Personalized Information Retrieval and Access: Concepts, Methods and Practices, pp. 254–285. IGI Global, 2008.


A. Mathematical Derivation

A.1. Identifiable and Versatile Roles

For learning identifiable and versatile roles, we propose to maximize the conditional mutual information between roles and local observation-action histories given the current observations. In Sec. 3.1 of the paper, we introduce a posterior estimator and derive a tractable lower bound of the mutual information term:

$$I(\rho_i^t; \tau_i^{t-1} \mid o_i^t) \geq \mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log \frac{q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)}{p(\rho_i^t \mid o_i^t)}\right]. \tag{8}$$

Then it follows that:

$$\begin{aligned}
\mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log \frac{q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)}{p(\rho_i^t \mid o_i^t)}\right]
&= \mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right] - \mathbb{E}_{\rho_i^t, o_i^t}\left[\log p(\rho_i^t \mid o_i^t)\right] \\
&= \mathbb{E}_{\rho_i^t, \tau_i^{t-1}, o_i^t}\left[\log q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right] + \mathbb{E}_{o_i^t}\left[H(\rho_i^t \mid o_i^t)\right] \\
&= \mathbb{E}_{\tau_i^{t-1}, o_i^t}\left[\int p(\rho_i^t \mid \tau_i^{t-1}, o_i^t) \log q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\, d\rho_i^t\right] + \mathbb{E}_{o_i^t}\left[H(\rho_i^t \mid o_i^t)\right].
\end{aligned} \tag{9}$$

The role encoder is conditioned on the local observations, so given the observations, the distributions of roles, p(ρti), are independent of the local histories. Thus, we have

$$I(\rho_i^t; \tau_i^{t-1} \mid o_i^t) \geq -\mathbb{E}_{\tau_i^{t-1}, o_i^t}\left[ CE\!\left[p(\rho_i^t \mid o_i^t) \,\|\, q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right] \right] + \mathbb{E}_{o_i^t}\left[H(\rho_i^t \mid o_i^t)\right]. \tag{10}$$

In practice, we use a replay buffer D and minimize

$$\mathcal{L}_I(\theta_\rho, \xi) = \mathbb{E}_{(\tau_i^{t-1}, o_i^t) \sim \mathcal{D}}\left[ CE\!\left[p(\rho_i^t \mid o_i^t) \,\|\, q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right] - H(\rho_i^t \mid o_i^t) \right]. \tag{11}$$
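As a quick, illustrative numeric check of the identity behind this bound, the snippet below verifies for diagonal Gaussians that CE[p‖q] − H(p) = D_KL(p‖q) ≥ 0, so minimizing L_I maximizes the variational lower bound on the mutual information. The distributions here are arbitrary examples.

```python
# Numeric sanity check (illustrative) of the identity used above for Gaussian role
# distributions: CE[p||q] = H(p) + KL(p||q), hence CE[p||q] - H(p) = KL(p||q) >= 0.
import torch
from torch.distributions import Normal, kl_divergence

p = Normal(torch.tensor([0.0, 1.0]), torch.tensor([0.5, 1.5]))
q = Normal(torch.tensor([0.2, 0.8]), torch.tensor([0.7, 1.0]))

# Monte-Carlo cross entropy: CE[p||q] = -E_{x~p}[log q(x)]
x = p.sample((100000,))
ce = -q.log_prob(x).mean(0)
print(ce - p.entropy())          # approximately equals the analytic KL below
print(kl_divergence(p, q))
```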

A.2. Specialized Roles

Conditioning roles on local observations enables roles to be dynamic, and optimizing LI enables roles to be identifiable and versatile, but these formulations do not explicitly encourage specialized roles. To make up for this shortcoming, we propose a role differentiation objective in Sec. 3.2 of the paper, where a mutual information maximization objective is involved (maximizing $I(\rho_i^t; \tau_j^{t-1} \mid o_j^t)$). Here, we derive a variational lower bound of this mutual information objective to render it feasible to optimize.

$$\begin{aligned}
I(\rho_i^t; \tau_j^{t-1} \mid o_j^t)
&= \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log \frac{p(\rho_i^t, \tau_j^{t-1} \mid o_j^t)}{p(\rho_i^t \mid o_j^t)\, p(\tau_j^{t-1} \mid o_j^t)}\right] \\
&= \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log \frac{p(\rho_i^t \mid \tau_j^{t-1}, o_j^t)}{p(\rho_i^t \mid o_j^t)}\right] \\
&= \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log p(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right] + \mathbb{E}_{o_j^t}\left[H(\rho_i^t \mid o_j^t)\right] \\
&\geq \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log p(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right].
\end{aligned} \tag{12}$$

We clip the variances of the role distributions at a small value (0.1) to ensure that the entropies of the role distributions are always non-negative, so that the last inequality holds. Then, it follows that:

$$\begin{aligned}
\mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log p(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right]
&= \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right] + \mathbb{E}_{\tau_j^{t-1}, o_j^t}\left[D_{\mathrm{KL}}\!\left[p(\rho_i^t \mid \tau_j^{t-1}, o_j^t) \,\|\, q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right]\right] \\
&\geq \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\left[\log q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right],
\end{aligned} \tag{13}$$

where $q_\xi$ is the trajectory encoder introduced in Sec. A.1, and the KL divergence term can be left out when deriving the lower bound because it is non-negative. Therefore, we have:

\[
I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) \;\ge\; \mathbb{E}_{\rho_i^t, \tau_j^{t-1}, o_j^t}\!\left[\log q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t)\right]. \tag{14}
\]


Recall that, in order to learn specialized roles, we propose to minimize:

\[
\left\| D_\phi^t \right\|_F - \sum_{i \ne j} \min\!\left\{ I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\}, \tag{15}
\]

where $D_\phi^t = (d_{ij}^t)$, and $d_{ij}^t = d_\phi(\tau_i^{t-1}, \tau_j^{t-1})$ is the estimated dissimilarity between the trajectories of agents $i$ and $j$. For the term $\min\{ I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \}$, we have:

\[
\begin{aligned}
&\min\!\left\{ I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\} \\
&\quad= \min\!\left\{ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}\!\left[\log \frac{p(\rho_i^t, \tau_j^{t-1} \mid o_j^t)}{p(\rho_i^t \mid o_j^t)\, p(\tau_j^{t-1} \mid o_j^t)} + d_\phi(\tau_i^{t-1}, \tau_j^{t-1})\right],\ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[U] \right\},
\end{aligned} \tag{16}
\]

where $\tau^{t-1}$ is the joint trajectory, $o^t$ is the joint observation, and $\rho^t = \langle \rho_1^t, \rho_2^t, \cdots, \rho_n^t \rangle$. We denote

\[
\begin{aligned}
T_1 &\equiv \log \frac{p(\rho_i^t, \tau_j^{t-1} \mid o_j^t)}{p(\rho_i^t \mid o_j^t)\, p(\tau_j^{t-1} \mid o_j^t)}, \\
T_2 &\equiv \log q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t).
\end{aligned} \tag{17}
\]

Because

\[
T_2 \ge \min\{T_2, U\}, \qquad U \ge \min\{T_2, U\}, \tag{18}
\]

it follows that:

\[
\mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[T_2] \ge \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[\min\{T_2, U\}], \qquad \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[U] \ge \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[\min\{T_2, U\}]. \tag{19}
\]

So that

\[
\min\!\left\{ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[T_1],\ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[U] \right\}
\ \overset{\text{Eq. 14}}{\ge}\
\min\!\left\{ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[T_2],\ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[U] \right\}
\ \overset{\text{Eq. 19}}{\ge}\
\mathbb{E}_{\tau^{t-1}, o^t, \rho^t}[\min\{T_2, U\}], \tag{20}
\]

which means that Eq. 15 satisfies:

\[
\begin{aligned}
&\left\| D_\phi^t \right\|_F - \sum_{i \ne j} \min\!\left\{ I(\rho_i^t; \tau_j^{t-1} \mid o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\} \\
&\quad= \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}\!\left[ \left\| D_\phi^t \right\|_F \right] - \sum_{i \ne j} \min\!\left\{ \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}\!\left[ T_1 + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}) \right],\ U \right\} \\
&\quad\overset{\text{Eq. 20}}{\le} \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}\!\left[ \left\| D_\phi^t \right\|_F \right] - \sum_{i \ne j} \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}\!\left[ \min\!\left\{ T_2 + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\} \right] \\
&\quad= \mathbb{E}_{\tau^{t-1}, o^t, \rho^t}\!\left[ \left\| D_\phi^t \right\|_F - \sum_{i \ne j} \min\!\left\{ T_2 + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\} \right]. 
\end{aligned} \tag{21}
\]

We minimize this upper bound to optimize Eq. 15. In practice, we use a replay buffer, and minimize:

\[
\mathcal{L}_D(\theta_\rho, \phi, \xi) = \mathbb{E}_{(\tau^{t-1}, o^t) \sim \mathcal{D},\, \rho^t \sim p(\rho^t \mid o^t)}\!\left[ \left\| D_\phi^t \right\|_F - \sum_{i \ne j} \min\!\left\{ q_\xi(\rho_i^t \mid \tau_j^{t-1}, o_j^t) + d_\phi(\tau_i^{t-1}, \tau_j^{t-1}),\ U \right\} \right], \tag{22}
\]

where $\mathcal{D}$ is the replay buffer, $\tau^{t-1}$ is the joint trajectory, $o^t$ is the joint observation, and $\rho^t = \langle \rho_1^t, \rho_2^t, \cdots, \rho_n^t \rangle$.
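For illustration only, the following PyTorch sketch evaluates Eq. 22 for one batch, given the pairwise estimator values and dissimilarities; the tensor layout and the value of $U$ are our assumptions, not the authors' implementation.

```python
import torch

def role_diff_loss(q_val, d_phi, U=1.0):
    """Sketch of Eq. 22 (illustrative; shapes and U are assumptions).

    q_val[b, i, j]: q_xi(rho_i^t | tau_j^{t-1}, o_j^t) for every agent pair (i, j).
    d_phi[b, i, j]: estimated trajectory dissimilarity d_phi(tau_i^{t-1}, tau_j^{t-1}).
    """
    n = d_phi.shape[-1]
    off_diag = 1.0 - torch.eye(n, device=d_phi.device)   # keep only i != j terms
    frob = torch.linalg.matrix_norm(d_phi)                # ||D_phi^t||_F per batch element
    clipped = torch.clamp(q_val + d_phi, max=U)           # min{q_xi + d_phi, U}
    return (frob - (clipped * off_diag).sum(dim=(-2, -1))).mean()
```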

B. Architecture, Hyperparameters, and Infrastructure

B.1. ROMA

In this paper, we base our algorithm on QMIX (Rashid et al., 2018), whose framework is shown in Fig. 12 and described in Appendix D. In ROMA, each agent has a neural network to approximate its local utility.


Figure 8. Additional results on the SMAC benchmark: test win rate (%) against training steps (millions) on 6s4z_vs_10b30z, 10z5b_vs_2s3z, and 2s3z, comparing ROMA (ours) with MAVEN, QMIX, COMA, IQL, and QTRAN.

Figure 9. (Reproduced from Fig. 6 in the paper, for quick reference.) Learned roles for 6s4z vs 10b30z, 27m vs 30m, and 6z4b (means of the role distributions, $\mu_{\rho_i}$, are shown, without using any dimensionality reduction techniques), and the related, automatically discovered responsibilities. (a) Strategy: sacrificing Zealots 9 and 7 to minimize the Banelings' splash damage. (b) Strategy: quickly forming an offensive concave arc. (c) Strategy: the green Zerglings hide away while the Banelings kill most enemies by exploding.

The local utility network consists of three layers: a fully-connected layer, followed by a 64-dimensional GRU, followed by another fully-connected layer that outputs an estimated value for each action. The local utilities are fed into a mixing network that estimates the global action value. The mixing network has a 32-dimensional hidden layer with ReLU activation. Parameters of the mixing network are generated by a hyper-net conditioned on the global state. This hyper-net has a fully-connected hidden layer of 32 dimensions. These settings are the same as in QMIX.
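For concreteness, here is a minimal PyTorch sketch of the per-agent utility network just described (fully-connected layer, 64-dimensional GRU, fully-connected output layer). Class and variable names are illustrative; in ROMA the output layer's parameters are actually generated by the role decoder (see the next sketch).

```python
import torch.nn as nn
import torch.nn.functional as F

class LocalUtilityNet(nn.Module):
    """Sketch of the per-agent utility network: fc -> 64-dim GRU -> fc."""
    def __init__(self, input_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        # In ROMA, this output layer's parameters come from the role decoder;
        # here it is kept as a plain layer for brevity.
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h_prev):
        x = F.relu(self.fc1(obs))
        h = self.rnn(x, h_prev)   # hidden state, also consumed by the trajectory encoder
        return self.fc2(h), h     # one utility estimate per action, new hidden state
```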

We use very simple network structures for the components related to role embedding learning, i.e., the role encoder, the role decoder, and the trajectory encoder. The multivariate Gaussian distributions from which the individual roles are drawn have their means and variances generated by the role encoder, which is a fully-connected network with a 12-dimensional hidden layer and ReLU activation. The parameters of the second fully-connected layer of each local utility approximator are generated by the role decoder, whose inputs are the individual roles, which are 3-dimensional in all experiments. The role decoder is also a fully-connected network with a 12-dimensional hidden layer and ReLU activation. For the trajectory encoder, we again use a fully-connected network with a 12-dimensional hidden layer and ReLU activation. The inputs of the trajectory encoder are the hidden states of the GRUs in the local utility networks after the last time step.
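Below is a hedged sketch of the role encoder and role decoder as described above: the encoder maps the local observation to the mean and variance of a 3-dimensional Gaussian, and the decoder is a small hyper-net that generates the parameters of the utility network's second fully-connected layer from a sampled role. Anything not stated above (e.g., the default action count, class and variable names) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RoleEncoder(nn.Module):
    """Sketch: maps o_i^t to the mean and std of a 3-dimensional Gaussian role."""
    def __init__(self, obs_dim, role_dim=3, hidden=12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, role_dim)
        self.log_var = nn.Linear(hidden, role_dim)

    def forward(self, obs):
        h = self.net(obs)
        var = self.log_var(h).exp().clamp(min=0.1)   # variance clipping (Appendix A.2)
        return self.mu(h), var.sqrt()                # sample with Normal(mu, std).rsample()

class RoleDecoder(nn.Module):
    """Sketch: a hyper-net turning a sampled role into the weight and bias of the
    second fully-connected layer of the local utility network."""
    def __init__(self, role_dim=3, hidden=12, rnn_dim=64, n_actions=10):
        super().__init__()
        self.w = nn.Sequential(nn.Linear(role_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, rnn_dim * n_actions))
        self.b = nn.Sequential(nn.Linear(role_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_actions))
        self.rnn_dim, self.n_actions = rnn_dim, n_actions

    def forward(self, role, h):
        # h: GRU hidden state of the local utility network, shape (batch, rnn_dim)
        W = self.w(role).view(-1, self.rnn_dim, self.n_actions)
        b = self.b(role).view(-1, 1, self.n_actions)
        return (torch.bmm(h.unsqueeze(1), W) + b).squeeze(1)   # local Q-values
```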

For all experiments, we set λI = 10−4, λD = 10−2, and the discount factor γ = 0.99. Optimization is conducted using RMSprop with a learning rate of 5 × 10−4, α = 0.99, and no momentum or weight decay. For exploration, we use ε-greedy with ε annealed linearly from 1.0 to 0.05 over 50k time steps and kept constant for the rest of training. We run 8 parallel environments to collect samples. Batches of 32 episodes are sampled from the replay buffer, and the whole framework is trained end-to-end on fully unrolled episodes. All experiments on StarCraft II use the default reward and observation settings of the SMAC benchmark.
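For reference, the hyperparameters listed above are collected in one place below; the key names are ours and do not correspond to any released configuration file.

```python
# Hyperparameters as reported above (key names are illustrative).
CONFIG = {
    "lambda_I": 1e-4,           # weight of L_I
    "lambda_D": 1e-2,           # weight of L_D
    "gamma": 0.99,              # discount factor
    "optimizer": "RMSprop",
    "lr": 5e-4,
    "rmsprop_alpha": 0.99,      # no momentum, no weight decay
    "epsilon_start": 1.0,       # epsilon-greedy exploration
    "epsilon_finish": 0.05,
    "epsilon_anneal_steps": 50_000,
    "n_parallel_envs": 8,
    "batch_size_episodes": 32,  # fully unrolled episodes per update
    "role_dim": 3,
}
```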

Experiments are carried out on an NVIDIA RTX 2080 Ti GPU.


Figure 10. The process of role emergence and evolution on the map 10m vs 11m.

B.2. Baselines and Ablations

We compare ROMA with various baselines and ablations, which are listed in Table 1 of the paper. For COMA (Foerster et al., 2018), QMIX (Rashid et al., 2018), and MAVEN (Mahajan et al., 2019), we use the code provided by the authors, in which the hyper-parameters have been fine-tuned on the SMAC benchmark. QMIX-NPS uses the same architecture as QMIX; the only difference is that QMIX-NPS does not share parameters among agents. Compared to QMIX, QMIX-LAR adds two more fully-connected layers of 80 and 25 dimensions after the GRU layer of each agent's local utility function, so that it has approximately the same number of parameters as ROMA.

C. Additional Experimental Results

We benchmark our method on the StarCraft II unit micromanagement tasks. To test the scalability of the proposed approach, we introduce three maps. The 6z4b map features symmetric teams, each consisting of 4 Banelings and 6 Zerglings. On the map 6s4z vs 10b30z, 6 Stalkers and 4 Zealots learn to defeat 10 Banelings and 30 Zerglings. And 10z5b vs 2z3s features asymmetric teams consisting of 10 Zerglings & 5 Banelings and 2 Zealots & 3 Stalkers, respectively.

C.1. Performance Comparison against Baselines

Fig. 8 presents the performance of ROMA against various baselines on three maps. Performance comparisons on the other maps are shown in Fig. 4 of the paper. We can see that the advantage of ROMA is more significant on maps with more agents, such as 10z5b vs 2z3s, MMM2, 27m vs 30m, and 10m vs 11m.

C.2. Role Embedding Representations

Fig. 9 shows various roles learned by ROMA. Roles are closely related to the sub-tasks in the learned winning strategy.

For the map 27m vs 30m, the winning strategy is to form an offensive concave arc before engaging in the battle. Fig. 9(b) illustrates the role embedding representations at the first time step, when the agents are going to set up the attack formation. We can see that the roles aggregate according to the relative positions of the agents. Such role differentiation leads to different moving strategies, so that agents can quickly form the arc without collisions.

Similar role-behavior relationships can be seen in all tasks. We present another example on the task 6z4b. In the winning strategy learned by ROMA, Zerglings 4 & 5 and the Banelings kill most of the enemies, taking advantage of the splash damage of the Banelings, while Zerglings 6-9 hide away, wait until the explosions are over, and then kill the remaining enemies. Fig. 9(c) shows the role embedding representations before the explosion. We can see clear clusters closely corresponding to the automatically detected sub-tasks at this time step.


Figure 12. The framework of QMIX, reproduced from the original paper (Rashid et al., 2018). (a) The architecture of the mixing network(blue), whose weights and biases are generated by a hyper-net (red) conditioned on the global state. (b) The overall QMIX structure. (c)Local utility network structure.

Supported by these results, we can conclude that ROMA can automatically decompose the task and learn versatile roles,each of which is specialized in a certain sub-task.

C.3. Additional Results for Role Evolution

Figure 11. Screenshot of 10m vs 11m, t=1.

In Fig. 7 of the paper, we show how roles emerge and evolve on the map MMM2, where the involved agents are heterogeneous. In this section, we discuss the case of homogeneous agent teams. To this end, we visualize the emergence and evolution of roles on the map 10m vs 11m, which features 10 ally Marines facing 11 enemy Marines. In Fig. 10, we show the roles at the first time step of the battle (a screenshot can be found in Fig. 11) at four different stages during training. At this moment, agents need to form an offensive concave arc quickly. We can see that ROMA gradually learns to allocate roles according to the relative positions of the agents. Such roles and the corresponding differentiation in the individual policies help agents form the offensive arc more efficiently. Since setting up an attack formation is critical for winning the game, a connection between the specialization of the roles at the first time step and the improvement of the win rate can be observed.

D. Related Works

Multi-agent reinforcement learning holds the promise of solving many real-world problems and has been making vigorous progress recently. To avoid an otherwise exponentially large state-action space, factorizing MDPs for multi-agent systems was proposed (Guestrin et al., 2002). Coordination graphs (Bargiacchi et al., 2018; Yang et al., 2018; Grover et al., 2018; Kipf et al., 2018) and explicit communication (Sukhbaatar et al., 2016; Hoshen, 2017; Jiang & Lu, 2018; Singh et al., 2019; Das et al., 2019; Kim et al., 2019) have been studied to model the dependence between the decision-making processes of agents. Training decentralized policies faces two challenges: non-stationarity (Tan, 1993) and credit assignment (Foerster et al., 2018; Nguyen et al., 2018). To resolve these problems, Sunehag et al. (2018) propose a value decomposition method called VDN. VDN learns a global action-value function, which is factored as the sum of each agent's local Q-value. QMIX (Rashid et al., 2018) extends VDN by representing the global value function as a learnable, state-conditioned, and monotonic combination of the local Q-values. In this paper, we use the mixing network of QMIX. The framework of QMIX is shown in Fig. 12.
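To make the contrast concrete, here is a minimal PyTorch sketch of the monotonic, state-conditioned mixing described above: hyper-net outputs are passed through an absolute value so the global value is monotonic in every local Q-value, and VDN corresponds to the degenerate case of an unweighted sum. Layer sizes follow Appendix B.1; everything else is an illustrative assumption rather than the exact QMIX implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Sketch of QMIX-style mixing: the global Q is a state-conditioned,
    monotonic combination of local Q-values."""
    def __init__(self, n_agents, state_dim, embed=32, hyper_hidden=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Sequential(nn.Linear(state_dim, hyper_hidden), nn.ReLU(),
                                      nn.Linear(hyper_hidden, n_agents * embed))
        self.hyper_w2 = nn.Sequential(nn.Linear(state_dim, hyper_hidden), nn.ReLU(),
                                      nn.Linear(hyper_hidden, embed))
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, local_qs, state):
        # local_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed)
        hidden = F.elu(torch.bmm(local_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        # Non-negative weights keep d(Q_tot)/d(Q_i) >= 0 for every agent i.
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)   # (batch,)
```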

The StarCraft II unit micromanagement task is considered one of the most challenging cooperative multi-agent testbeds due to its high degree of control complexity and environmental stochasticity. Usunier et al. (2017) and Peng et al. (2017) study this problem from a centralized perspective. In order to facilitate decentralized control, we test our method on the SMAC benchmark (Samvelyan et al., 2019), the same setting as in (Foerster et al., 2017; 2018; Rashid et al., 2018; Mahajan et al., 2019).