
Maintaining Cooperation in Homogeneous Multi-agent System

Jianye Hao and Ho-fung Leung
Department of Computer Science and Engineering
The Chinese University of Hong Kong
{jyhao,lhf}@cse.cuhk.edu.hk

Abstract—During multi-agent interactions, robust strategies are needed to help the agents coordinate their actions to achieve efficient outcomes. A large body of previous work focuses on designing strategies towards the goal of Nash equilibrium, which can be extremely inefficient in many situations such as the prisoner's dilemma game. A number of improved algorithms based on Q-learning have been developed recently for agents to achieve mutual cooperation in games like the prisoner's dilemma. However, almost all of them involve only two agents playing the same game repeatedly, which may not reflect practical multi-agent interaction situations. In practical multi-agent environments, each agent may interact with multiple agents at the same time and may only be allowed to interact with a specific set of agents, which is determined by the system's underlying topology. In this paper, we propose a learning framework that takes into consideration the underlying interaction topology of the agents. We show that the system can maintain a certain level of cooperation even though the agents are individually rational. To better understand this phenomenon, we also develop a mathematical model to analyze the dynamics resulting from the learning framework. The theoretical results of the mathematical model are shown to successfully predict the transition point and the expected behaviors of the system compared with the simulation results.

Index Terms—Prisoner's Dilemma Game, Q-learning

I. INTRODUCTION

In a multi-agent environment, multiple agents usually interact with each other aiming at achieving joint goals or maximizing their individual benefits only. One important type of multi-agent interaction situation is the so-called social dilemma, in which individual interest is in conflict with group welfare. One well-known representative example of social dilemmas is the Prisoner's Dilemma Game (PDG) (see Fig. 1), which has been studied extensively in various disciplines such as social science, behavioral economics and game theory.

Fig. 1: A typical payoff matrix for the prisoner's dilemma game

Considering the open and dynamic characteristics of multi-agent environments, it is typically difficult to assign the actions of the agents in advance. Therefore, many multi-agent reinforcement learning algorithms (e.g., minimax-Q learning [8], Nash-Q learning [7], Correlated-Q learning [4]) have been proposed to help the agents learn to coordinate with other agents in the system. Most of them aim at converging to pure-strategy or mixed-strategy Nash equilibria, which is not the desired solution in social dilemma situations like the prisoner's dilemma game. Accordingly, other algorithms and improved algorithms based on Q-learning [16] [11] [12] [2] [9] have been developed recently for agents to achieve mutual cooperation in games like the Prisoner's Dilemma. However, almost all of them involve only two agents playing the same game iteratively, which may not be practical for modeling real-world multi-agent interaction situations. In real multi-agent environments, each agent may interact with multiple agents at the same time and each agent may only be allowed to interact with certain agents (e.g., its neighbors), which can be determined by the system's underlying topology [13][14] or certain tag mechanisms [5][10].

Taking the above issues into consideration, in this paper we consider the problem of multi-agent interaction modeled as the prisoner's dilemma game, taking the agents' interaction topology into account. The multi-agent interaction framework we consider is as follows. There exists a population of N agents in the system, and each agent is only allowed to interact with its neighbors (i.e., to play the prisoner's dilemma game with them), as determined by the interaction topology. At each time step, exactly N agents are picked out randomly to play the prisoner's dilemma game with their neighbors sequentially. Each agent employs a pre-determined, individually rational learning strategy, Q-learning, to choose its actions. The overall reward each agent receives is the sum of the rewards it obtains by playing the prisoner's dilemma game with its neighbors. Surprisingly, simulation results show that the system can maintain the percentage of cooperating agents (those choosing C) above a certain threshold if each agent's neighborhood size is small.

To have a better understanding of this phenomenon, we also develop a mathematical model to analyze the dynamics resulting from the learning framework we study. Different models for analyzing the dynamics of multi-agent Q-learning involving two agents have been proposed [3][17], according to the different exploration mechanisms adopted. In this work, we present a mathematical model of the expected behaviors of the agents in the system (i.e., the average percentage of cooperating agents at each time step). The theoretical results are shown to predict the transition point and the expected behaviors of the system compared with the simulation results, which demonstrates the effectiveness of the proposed mathematical model.

The remainder of the paper is organized as follows. In Section II, we give an overview of related work on various algorithms and strategies for achieving cooperation in the prisoner's dilemma game. In Sections III and IV, we introduce the proposed learning framework and the mathematical model, respectively. In Section V, we conduct various simulations and discuss the effects of different parameters on the system's performance in terms of the percentage of cooperation achieved. In the last section, we conclude the paper and present some possible future work.

II. RELATED WORK

A. Strategies to achieve cooperation in Prisoner’s Dilemma

The Prisoner's Dilemma game has received widespread attention in the literature of both game theory and multi-agent learning research. Various learning strategies and mechanisms (e.g., using tags) have been proposed to achieve cooperation in the iterated Prisoner's Dilemma [1][16][11][12][15][6]. Due to space constraints, here we only briefly describe the related work focusing on designing learning strategies for the purpose of achieving cooperation.

In [16], Stimpson et al. propose a satisficing strategy for the prisoner's dilemma game. In this strategy, each agent i is assigned an aspiration level α_i(0) at the beginning of the game (t = 0). After receiving its payoff π_i(t) at time step t, the agent updates its aspiration level and chooses its action according to the following rules: if π_i(t) ≥ α_i(t), then α_i(t + 1) = α_i(t) and agent i chooses the same action as in the previous time step; if π_i(t) < α_i(t), then agent i updates the value of α_i based on Equation 1

α_i(t + 1) = λ α_i(t) + (1 − λ) π_i(t)    (1)

where πœ† is the learning rate, and also the other action ofagent 𝑖 is chosen at time step 𝑑 + 1. The authors performvast simulations by selecting different values of parametersfrom uniform distributions and show that the agents convergeto mutual cooperation most of the time. Besides, the effects ofdifferent factors (i.e., initial aspirations, payoff matrix, learningrate and initial actions) on cooperation are analyzed as well.

In [11], Moriyama investigates how to maintain mutual cooperation (C, C) when Q-learning agents reach the outcome (C, C) occasionally due to stochastic exploration. The author derives two related theorems to provide guidance on maintaining mutual cooperation between the agents. The first theorem states how many times the mutual cooperation outcome (C, C) needs to be reached in order to maintain it forever (i.e., to make the Q-value of cooperation larger than that of defection); the second deals with how much additional reward is needed to make the Q-value of cooperation larger than that of defection if mutual cooperation is reached only once.

Following this work, the author also proposes another Q-learning algorithm, called learning rate adjusting Q-learning (LAR-Q) [12], to achieve cooperation among the agents. The key idea of this approach is that the agents adjust their learning rate according to the outcome of the game. By following the specific rules for updating the learning rate, the following two desired properties can be guaranteed: 1) Q(C) will become larger than Q(D) once the outcome (C, C) is achieved; 2) once Q(C) has become larger than Q(D), Q(C) will always remain larger than Q(D) no matter which outcome is achieved later. Note that the outcome (C, C) is achieved due to stochastic exploration, and no control is enforced on it.

B. Theoretical Models on Multi-agent Q-Learning

Tuyls et al. [17] analyze the dynamics of multiple Q-learning agents with Boltzmann exploration. They represent the dynamics of the probabilities for each action of the agents using a set of differential equations. They first construct the continuous-time limit of the Boltzmann exploration equation, and then change the Q-function update rule from its discrete form to its continuous form. By combining the two, the dynamics of the probability of each action of each agent is expressed as a differential equation that depends only on the probabilities of each action of the agents and the game structure.

Gomes and Kowalczyk [3] propose a theoretical framework to model the average behaviors of the agents when each agent adopts a Q-learning strategy with ε-greedy exploration. They model the trend changes of the Q-functions using a system of difference equations. Based on this set of difference equations, the average behaviors of the agents over time can be obtained given the initial values of the Q-functions. They evaluate their model through two sets of game experiments: the prisoner's dilemma game and a game with only a mixed-strategy Nash equilibrium. Experimental results demonstrate the applicability of their model by showing that the agents' behaviors obtained from the model are consistent with simulation results.

III. LEARNING FRAMEWORK

The learning framework we consider in this paper is as follows. There exists a population of N agents in the system, and each agent is only allowed to interact with its neighbors. The agent interaction scenario we consider here is modeled as the prisoner's dilemma game. At each time step, exactly N agents are picked out randomly to play the prisoner's dilemma game with their neighbors sequentially. In other words, each agent in the population has on average one chance to interact with its neighbors during each time step. The reward each agent receives at each time step is the sum of the rewards it obtains by playing the prisoner's dilemma game with each of its neighbors, and is 0 if it is not chosen at all. We assume that a pre-determined learning algorithm is adopted (it can be considered an intrinsic property of the agents) to help the agents make their decisions; this will be introduced in detail later. We assume that the payoff matrix of the game is unknown to the agents but that the agents interact with each other under perfect monitoring. That is, each agent can observe its neighbors' actions and rewards at the current time step. The protocol of the interactions among the population of N agents is shown in Algorithm 1.

Algorithm 1 Interaction Protocol

for a certain number of time steps T do
    for i = 1 to N do
        pick an agent i randomly from the population.
        ask agent i and its neighbors A to select an action based on their own Q-functions with the ε-greedy exploration mechanism.
        agent i and its neighbors A play the prisoner's dilemma games with their own neighbors respectively and obtain the rewards.
        agent i performs the Q-function update (details shown in Algorithm 2) based on agent i's and its neighbors' joint information (i.e., their actions and rewards).
    end for
end for
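
Read as code, the protocol looks roughly like the Python sketch below. This is a non-authoritative rendering under our own assumptions: pd_payoff uses placeholder payoff values standing in for Fig. 1, select_action and update_q stand for the ε-greedy choice and Algorithm 2, each agent is assumed to carry a current action attribute initialized before the run, and neighbors-of-neighbors are assumed to contribute their most recently chosen actions.

import random

def pd_payoff(my_action, other_action, R=3, S=0, T=5, P=1):
    # One pairwise prisoner's dilemma payoff (R, S, T, P are placeholder values;
    # the paper's actual matrix is given in Fig. 1).
    return {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}[(my_action, other_action)]

def run_protocol(agents, neighbors, steps):
    # Sketch of Algorithm 1: at each time step, N agents are drawn at random;
    # each drawn agent and its neighbors choose actions, play the PD game with
    # their own neighbors, and the drawn agent updates its Q-function.
    N = len(agents)
    for t in range(steps):
        for _ in range(N):
            i = random.randrange(N)                     # pick agent i at random
            group = [i] + list(neighbors[i])
            for j in group:                             # epsilon-greedy selection
                agents[j].action = agents[j].select_action()
            rewards = {j: sum(pd_payoff(agents[j].action, agents[k].action)
                              for k in neighbors[j])
                       for j in group}                  # sum of pairwise PD payoffs
            actions = {j: agents[j].action for j in group}
            agents[i].update_q(actions, rewards, i)     # Algorithm 2 update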

In this paper, we focus on the following topology to model the interaction environment of the agents: the one-dimensional lattice network with connections between all neighboring vertex pairs.1 Agents are represented by the vertices of the network and the links indicate the connections between agents, i.e., there is a link between two agents if they are neighbors of each other. Different network structures can be obtained by using different values of the neighborhood size n. For example, when n = 2, the network becomes a ring-like structure, in which each agent is connected only with its direct left and right neighbors; when n = N − 1, the network becomes fully connected, in which each agent has connections with all other agents in the population. Fig. 2 shows an example of a one-dimensional lattice network with n = 4, in which each agent is associated with four neighbor agents.
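
For concreteness, such a neighborhood structure can be generated with a small helper like the one below (our own illustrative function, not from the paper; it assumes an even neighborhood size n, with n/2 neighbors on each side, as in the n = 2 and n = 4 cases above).

def lattice_neighbors(N, n):
    # Neighbors on a one-dimensional lattice (ring) of N agents, with each
    # agent linked to its n nearest agents; n = 2 gives the ring structure.
    half = n // 2
    return {i: [(i + d) % N for d in range(-half, half + 1) if d != 0]
            for i in range(N)}

For example, lattice_neighbors(100, 2) builds the ring structure (the n = 2 case) for a hypothetical population of 100 agents.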

The basis of the learning algorithm used here is the traditional Q-learning algorithm [18], and there are two reasons for this choice. First, in an open and dynamic multi-agent system, it is highly possible that the agents are created by different individuals or groups, which means that we have little control over the learning algorithm implemented on the agents. Since the Q-learning algorithm is well suited to repeated games and is widely used in multi-agent systems, it is reasonable to adopt it as the learning algorithm for the agents in the system. Secondly, since we are interested in investigating the effect of the topology in which the agents are located on the learning outcome of the system, using the Q-learning algorithm helps to eliminate the noise of the learning algorithm itself,2 and thus we can focus on investigating how to utilize the underlying topology to better influence the agents' behaviors towards the desired goal of mutual cooperation.

1Note that there are other topologies (e.g., the small-world network) available for modeling the agents' interaction environment as well; these will be considered in future work.

Fig. 2: One-dimensional lattice network with n = 4

Given that there is only one state in the repeated prisoner's dilemma game, the Q-value update function of the Q-learning algorithm can be simplified as follows:

𝑄𝑑+1𝑖 (π‘Ž) = 𝑄𝑑

𝑖(π‘Ž) + 𝛼(π‘Ÿπ‘–(π‘Ž)βˆ’ 𝑄𝑑𝑖(π‘Ž)) (2)

where 𝑄𝑑𝑖(π‘Ž) is agent 𝑖’s Q-function on action π‘Ž at time step 𝑑

and π‘Ÿπ‘–(π‘Ž) is the reward agent 𝑖 receives by choosing action π‘Ž.The action selection mechanism adopted here is the πœ–-

greedy exploration. Following this mechanism, each action ischosen randomly with probability πœ– and the action whose Q-value is the highest at the current time step is chosen withprobability 1-πœ–. In other words, the probability of choosingaction π‘Ž 𝑃 𝑑(π‘Ž) for agent 𝑖 is expressed as following:

𝑃 𝑑𝑖 (π‘Ž) =

{(1βˆ’ πœ–) + (πœ–/2) if 𝑄𝑑

𝑖(π‘Ž) 𝑖𝑠 π‘‘β„Žπ‘’ β„Žπ‘–π‘”β„Žπ‘’π‘ π‘‘πœ–/2 if π‘œπ‘‘β„Žπ‘’π‘Ÿπ‘€π‘–π‘ π‘’

(3)
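
Equations 2 and 3 translate directly into code; a minimal sketch (Q is a per-agent dictionary over the two actions, and the names are ours):

import random

def q_update(Q, a, reward, alpha):
    # Equation 2: single-state Q-learning update for action a.
    Q[a] = Q[a] + alpha * (reward - Q[a])

def select_action(Q, epsilon):
    # Equation 3: explore uniformly with probability epsilon, otherwise pick
    # the action with the highest Q-value; the greedy action therefore has
    # total probability (1 - epsilon) + epsilon/2.
    if random.random() < epsilon:
        return random.choice(list(Q))
    return max(Q, key=Q.get)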

Since the agents are located in a network with a specified topology, we modify the Q-function updating rule by allowing each agent to imitate its neighbors' actions. Suppose that agent i is currently interacting with its neighbors by taking action a1. If agent i's reward is higher than all its neighbors' rewards, then agent i simply updates its Q-value for action a1 using its own reward; if one of its neighbors (e.g., agent j) receives a higher reward than agent i and agent j's action (e.g., action a2) is different from agent i's action a1, then agent i updates its Q-value for action a2 using the reward agent j obtains. This way of updating the Q-values can be intuitively understood as follows. Since all the agents are homogeneous in terms of the underlying network topology they are located in, when agent j receives a higher reward than agent i with a different action choice, it is highly likely that agent i would also receive a higher reward by choosing the same action a2 as agent j instead of the original action a1. Thus it is reasonable for agent i to imitate agent j's action (i.e., to update its Q-function by pretending that it chose action a2) in the above situation. The overall rule for updating the Q-functions of agent i is shown in Algorithm 2.

2When two agents using the Q-learning algorithm play the prisoner's dilemma game, the outcome will most likely converge to mutual defection. In this article, a similar algorithm is adopted, and we aim to investigate the effect of the additional information from the underlying network structure on inducing the agents towards cooperation.

Algorithm 2 Agent i's Update Strategy for Q-function

Find the agent j who obtains the highest reward among all the neighbors of agent i (if there exist multiple neighbors sharing the highest reward, sort them in order of their relative positions to agent i).
if agent j's reward r_j is higher than agent i's reward r_i then
    if agent j's action a_j is different from agent i's action a_i then
        agent i updates its Q-function assuming that action a_j is taken and reward r_j is received.
    else
        agent i updates its Q-function with action a_i and reward r_i.
    end if
else
    agent i updates its Q-function with action a_i and reward r_i.
end if
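
A compact Python sketch of Algorithm 2 follows (our own rendering; ties are broken arbitrarily here, whereas the paper breaks them by relative position to agent i):

def imitation_update(i, Q_i, actions, rewards, neighbors_i, alpha):
    # Q_i: agent i's Q-values, e.g. {'C': 0.0, 'D': 0.0};
    # actions / rewards: observed actions and rewards of agent i and its neighbors.
    j = max(neighbors_i, key=lambda k: rewards[k])      # best-performing neighbor
    if rewards[j] > rewards[i] and actions[j] != actions[i]:
        a, r = actions[j], rewards[j]                   # imitate: pretend i chose a_j and got r_j
    else:
        a, r = actions[i], rewards[i]                   # otherwise use i's own action and reward
    Q_i[a] += alpha * (r - Q_i[a])                      # Equation 2 update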

IV. A MATHEMATICAL MODEL OF THE LEARNING FRAMEWORK

In order to better understand the learning framework and the agents' behaviors within it, we propose a mathematical model to analyze the dynamics of the agents in terms of their average behavior, i.e., the average percentage of agents choosing cooperation (action C) at each time step.

A. Model of Multi-Agent Learning Involving Two Players

First let us consider the multi-agent learning problem under the Q-learning algorithm shown in Equation 2. If there are only two agents (agent i and agent j) in the system, they interact with each other at each time step (i.e., they choose their actions simultaneously and update their Q-functions after receiving the rewards). In this case, we can obtain the following continuous-time differential equation to model agent i's dynamic behavior [3] [17].

d𝑄𝑖(π‘Ž)

d𝑑= 𝛼(π‘Ÿπ‘–(π‘Ž)βˆ’ 𝑄𝑖(π‘Ž)) (4)

where 𝑄𝑖(π‘Ž) is the Q-function of agent 𝑖 for action π‘Ž and π‘Ÿπ‘–(π‘Ž)is the reward agent 𝑖 receives.

If agent i's opponent, agent j, plays the game according to a stationary strategy (pure or mixed), it is easy to solve the above differential equation, and the trend of agent i's Q-value for each action over time is given by Equation 5 [3].

𝑄𝑖(π‘Ž) = πΆπ‘’βˆ’π›Όπ‘‘ + 𝐸[π‘Ÿπ‘–(π‘Ž)] (5)

where E[r_i(a)] is the expected reward of agent i for playing action a.
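
For completeness, Equation 5 is the standard solution of the linear first-order ODE in Equation 4 once r_i(a) is replaced by its expectation under the opponent's stationary strategy; a short derivation (our own, with C the integration constant fixed by the initial Q-value):

\frac{\mathrm{d}Q_i(a)}{\mathrm{d}t} + \alpha Q_i(a) = \alpha\,\mathbb{E}[r_i(a)]
\;\Rightarrow\; \frac{\mathrm{d}}{\mathrm{d}t}\!\left(e^{\alpha t} Q_i(a)\right) = \alpha\, e^{\alpha t}\,\mathbb{E}[r_i(a)]
\;\Rightarrow\; Q_i(a) = C e^{-\alpha t} + \mathbb{E}[r_i(a)], \qquad C = Q_i(a)\big|_{t=0} - \mathbb{E}[r_i(a)].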

If agent i's opponent, agent j, is also learning instead of adopting a fixed strategy, it is impossible to solve Equation 4 in the same way as above. However, we can still easily trace the path of each agent's Q-value for each action once their initial values are given.

B. Model of the Multi-Agent Learning Framework

According to the interaction protocol in Algorithm 1, at each time step each agent has on average one opportunity to be picked out to interact with its neighbors and perform a Q-function update afterwards. The reward it receives is defined as the sum of the payoffs obtained by interacting with each of its neighbors. The only key difference from the plain Q-learning update in Equation 2 is that here the update strategy for the Q-function (Algorithm 2) also takes the neighbors' information (i.e., their actions and rewards at the current time step) into consideration.

Based on the above analysis, we extend the two-agent model of Section IV-A to our general case involving N agents. To achieve this, two important aspects have to be considered. Firstly, the update speed of each action's Q-value is different, for two main reasons. The first is that the probability of each action being chosen differs because of the difference in their Q-values. The second is that, even if the Q-values of the actions are the same, the probability that each action is updated differs, as determined by the update strategy. Thus we have to take the update speed into account when analyzing the dynamics of the multi-agent learning process. Secondly, since all the agents are learning concurrently, the reward each agent receives depends on the actions of the other agents (its neighbors) as well. Since we aim at analyzing the average trend of each agent's Q-function, the rewards used to update the Q-functions are taken as the expected rewards each agent can receive by interacting with its neighbors at each time step.

For a population of N agents, the difference equation modeling the expected dynamics of agent i's Q-value for action a can be expressed as follows:3

𝑄𝑑+1𝑖 (π‘Ž) = 𝑄𝑑

𝑖(π‘Ž) + 𝑃 π‘Ÿπ‘œπ‘π‘‘π‘–(π‘Ž)𝛼(π‘Ÿπ‘‘π‘–(π‘Ž)βˆ’ 𝑄𝑑

𝑖(π‘Ž)) (6)

where Prob_i^t(a) is the probability that agent i's Q-value for action a is updated at time step t, and r_i^t(a) is the expected reward agent i receives when it updates its Q-value for action a at time step t.

Since both Prob_i^t(a) and r_i^t(a) change with time t, we cannot obtain a closed-form solution for the average value of Q_i^t(a). However, given its initial value Q_i^0(a), we can still trace its expected value at each time step t if the values of Prob_i^t(a) and r_i^t(a) at each time step t are available.

3Note that the proposed model can be regarded as a generalization of the model proposed by Gomes and Kowalczyk [3]. Their model targets the average behaviors of Q-learning agents with ε-greedy exploration, so the meanings of Prob_i^t(a) and r_i^t(a) differ from the definitions here. However, if the proposed model is applied to their learning situation, the same values of Prob_i^t(a) and r_i^t(a), and thus the same average behaviors of the agents, are obtained as with their model. Their model, however, cannot be applied directly to the learning situation considered here.

According to the update rule for the Q-function, the value of Prob_i^t(a) for agent i can be calculated as the sum of the probabilities of all the cases in which agent i updates its Q-value for action a, since these cases are mutually exclusive. The formula for calculating Prob_i^t(a) is shown below:

𝑃 π‘Ÿπ‘œπ‘π‘‘π‘–(π‘Ž) =

π‘›βˆ‘π‘˜=1

𝑝𝑑𝑖(π‘˜) (7)

where 𝑝𝑑𝑖(π‘˜) represents the probability that the π‘˜π‘‘β„Ž case occurs.The probability 𝑝𝑑𝑖(π‘˜) can be obtained by simply multiply-

ing the corresponding probabilities for each agents involvedto choose their corresponding actions in π‘˜π‘‘β„Ž case. Supposeeach agent is located in a ring-structured network and playingthe prisoner dilemma game shown in Fig. 1. Since the networkstructure is symmetric, we only analyze one of the agent (agent𝑖) and the analysis is similar for all the other agents. Based onthe update rule for Q-function in Algorithm 2, without mucheffort we can know that there are two cases when agent 𝑖updates its Q-function on action 𝐢: 1)agent 𝑖 and its neighborson both sides choose action 𝐢 simultaneously; 2)agent 𝑖, itsleft side neighbor and the agent on the left of its left side agentchoose action 𝐢, and also both its right side neighbor and theagent on the right side of its right side neighbor choose action𝐷. Thus the expression for calculating the value of 𝑃 π‘Ÿπ‘œπ‘π‘‘π‘–(𝐢)at time step 𝑑 can be represented as following:

𝑃 π‘Ÿπ‘œπ‘π‘‘π‘–(𝐢) = 𝑃 𝑑𝑖 (𝐢)𝑃 𝑑

π‘–βˆ’1(𝐢)𝑃 𝑑𝑖+1(𝐢)+

𝑃 𝑑𝑖 (𝐢)𝑃 𝑑

π‘–βˆ’2(𝐢)𝑃 π‘‘π‘–βˆ’1(𝐢)𝑃 𝑑

𝑖+1(𝐷)𝑃 𝑑𝑖+2(𝐷) (8)

where P_i^t(C) is the average probability that agent i chooses action C at time step t, whose value can easily be obtained using Equation 3. The subscript i − 1 in P_{i−1}^t(C) indicates that this agent is the left neighbor of agent i, and so on.

Here π‘Ÿπ‘‘π‘–(π‘Ž) is defined as the expected reward agent 𝑖 canreceive at time step 𝑑 when its Q-function on action π‘Ž isupdated, and its value depends on the actions of its neighborsand the neighbors of its neighbors as well. The equation forcalculating π‘Ÿπ‘‘π‘–(π‘Ž) can be expressed as following:

π‘Ÿπ‘‘π‘–(π‘Ž) =

π‘›βˆ‘π‘˜=1

𝑝𝑑𝑖(π‘˜) βˆ— π‘Ÿπ‘‘π‘–(π‘˜) (9)

where 𝑝𝑑𝑖(π‘˜) is the probability for case π‘˜ that agent 𝑖 updateits Q-function on action π‘Ž at time step 𝑑, and π‘Ÿπ‘‘π‘–(π‘˜) is thecorresponding reward agent 𝑖 receives in this case. Takingthe same example in previous analysis for illustration, we can

easily get the expression of π‘Ÿπ‘‘π‘–(π‘Ž) by multiplying the rewardagent 𝑖 receives in each case with its corresponding probabilityand adding them together:

π‘Ÿπ‘‘π‘–(π‘Ž) = 𝑃 𝑑𝑖 (𝐢)𝑃 𝑑

π‘–βˆ’1(𝐢)𝑃 𝑑𝑖+1(𝐢)π‘Ÿπ‘‘π‘–(1)+

𝑃 𝑑𝑖 (𝐢)𝑃 𝑑

π‘–βˆ’2(𝐢)𝑃 π‘‘π‘–βˆ’1(𝐢)𝑃 𝑑

𝑖+1(𝐷)𝑃 𝑑𝑖+2(𝐷)π‘Ÿπ‘‘π‘–(2) (10)

For a system of N agents, the Q-value for each action of each agent can be expressed in the form of Equation 6. Since we know how to calculate the values of Prob_i^t(a) and r_i^t(a) for each time step t, we can obtain the expected value of each agent's Q-value for each action at each time step from Equation 6, given that the initial values of the Q-functions are available. Finally, according to Equation 3, we can derive the expected behaviors of the agents (i.e., the average percentage of agents choosing C at each time step) from the expected values of the Q-functions of each agent at each time step.
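
Putting Equations 3 and 6–10 together, the expected trajectory can be traced numerically. The sketch below assumes prob_update and expected_reward are supplied externally (e.g., wrappers around routines like prob_update_C above and an analogous expected-reward computation built from Equations 9–10 and the payoff matrix); the names and structure are ours.

def epsilon_greedy_probs(Q, epsilon):
    # Equation 3: per-action choice probabilities for one agent.
    best = max(Q, key=Q.get)
    return {a: (1 - epsilon + epsilon / 2) if a == best else epsilon / 2 for a in Q}

def trace_model(Q0, steps, alpha, epsilon, prob_update, expected_reward):
    # Iterate Equation 6 for every agent and action, and record the expected
    # fraction of cooperators (the average of P_i^t(C) over agents) at each step.
    Q = [dict(q) for q in Q0]                 # one Q-dictionary per agent
    history = []
    for t in range(steps):
        P = [epsilon_greedy_probs(q, epsilon) for q in Q]    # Equation 3
        history.append(sum(p['C'] for p in P) / len(P))      # average cooperation
        Q = [{a: q[a] + prob_update(P, i, a) * alpha * (expected_reward(P, i, a) - q[a])
              for a in q}                                    # Equation 6
             for i, q in enumerate(Q)]
    return history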

V. SIMULATION AND EVALUATION

A. Simulation

In this section, we perform simulations under the learning framework described in Section III and illustrate the average behaviors of the agents (the average percentage of agents choosing C at each time step). By comparing with the plain multi-agent Q-learning case, we show that the topology-based learning framework proposed here has a tremendous influence on inducing the agents' average behaviors towards cooperation. Different parameter settings (the topology's degree of connectivity n, the learning rate α and the exploration rate ε) can have great effects on the average behaviors of the agents, but we omit the detailed analysis here due to space limitation.

Fig. 3 shows the comparison between the results under the learning framework and the results using the traditional Q-learning approach. We can see that the average percentage of cooperators under the traditional Q-learning approach converges to approximately the minimum value 0.1, while the proposed learning framework promotes the percentage of cooperators to a much higher level and keeps it there. This stability of the agents' average behaviors is very desirable, especially when the system is open and highly dynamic. The reason is that under the learning framework the agents have access to the information (i.e., actions and rewards) of their neighbor agents, and thus have an incentive to imitate other agents' behaviors if those agents receive higher rewards by choosing different actions. At first glance, it might seem that this would result in more defection, since the agents who receive higher rewards in the prisoner's dilemma game are usually defectors, so defecting behavior would be imitated more often. On the contrary, allowing imitation helps to induce the agents' behaviors towards cooperation.

B. Application of the Model in Prisoner’s Dilemma

Based on extensive simulations varying the values of different parameters, we find that the influence of the topology on shifting the agents' behaviors towards cooperation is maximized when the degree of connectivity is n = 2. Due to space limitation, we only analyze the theoretical behavior of the agents in this case using the mathematical model developed above. We compare the theoretical average behaviors of the agents with the simulation results, and show that the proposed mathematical model can effectively predict the average behaviors of the agents within the learning framework described in Section III.



[Fig. 3 plot: average percentage of cooperators (y-axis, 0 to 0.7) versus time (x-axis, 0 to 5000), comparing the learning framework with the traditional Q-learning approach]

Fig. 3: Comparison between the results under the learning framework and the results using the traditional Q-learning approach, with ε = 0.2 and n = 2


Figs. 4(a) and 4(b) show the average behaviors of the agents in the system (i.e., the average percentage of cooperators) obtained from both the simulation and the theoretical model, with the learning rate α set to 0.2 and 0.4 respectively. On the whole, the theoretical model can roughly predict the dynamics of the average behaviors of the agents: the average percentage of cooperators drops to the minimum value (0.1), gradually increases afterwards, and finally converges to a constant value. Remarkably, the theoretical model can accurately predict the transition time at which the average percentage of cooperators begins to converge to this constant.

There are also some discrepancies between the simulation results and the theoretical results. The simulation curve ascends continuously from the very beginning until it converges. The theoretical curve, however, stays at the constant minimum value (0.1) for a certain period before it starts to increase; in other words, there is some latency in the theoretical results. At the very beginning, the average percentage of cooperators drops to the minimum value 0.1, and Q_i(D) > Q_i(C) for each agent i. From the theoretical model in Equation 6, we can see that the increase of Q_i(a) at each time step is very small due to the factors Prob_i^t(a) and r_i^t(a), so it takes more time for Q_i(C) to grow until Q_i(C) > Q_i(D). That is why the average percentage of cooperators stays at its minimum value for a certain period in the theoretical results.

VI. CONCLUSION AND FURTHER WORK

In this paper, we propose a learning framework that takes into consideration the underlying interaction topology of the agents. We show that under this framework the system can maintain a certain level of cooperation even though the agents are individually rational. To better understand the results, we also develop a mathematical model to analyze the dynamics resulting from the learning framework. The theoretical results of the mathematical model can successfully predict the transition point and the expected behaviors of the system compared with the simulation results. As future work, we are going to consider other topologies such as the small-world network, which can better represent the interaction patterns of people in the real world.

[Fig. 4 plots: average percentage of cooperators (y-axis, 0.05 to 0.4) versus time (x-axis, 0 to 5000); each panel compares the simulation result with the theoretical result]

Fig. 4: Comparison between simulation results and theoretical results: (a) α = 0.2, (b) α = 0.4


REFERENCES

[1] R. Axelrod. The Evolution of Cooperation. Basic Books, 1984.

[2] A. Bonarini, A. Lazaric, E. M. Cote, and M. Restelli. Improving cooperation among self-interested reinforcement learning agents. In ECML Workshop on Reinforcement Learning in Non-stationary Environments, 2005.

[3] E. R. Gomes and R. Kowalczyk. Dynamic analysis of multiagent Q-learning with epsilon-greedy exploration. In ICML'09, pages 41–48. ACM, 2009.

[4] A. Greenwald and K. Hall. Correlated Q-learning. In ICML'03, pages 242–249, 2003.

[5] J. Y. Hao and H. F. Leung. Learning to achieve social rationality using tag mechanism in repeated interactions. In ICTAI'11, pages 148–155, 2011.

[6] J. Y. Hao and H. F. Leung. Learning to achieve socially optimal solutions in general-sum games. In PRICAI'12, 2012.

[7] J. Hu and M. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In ICML'98, pages 242–250, 1998.

[8] M. Littman. Markov games as a framework for multi-agent reinforcement learning. In ICML'94, pages 322–328, 1994.

[9] J. M. Liu, H. Jing, and Y. Y. Tang. Multi-agent oriented constraint satisfaction. Artificial Intelligence, 136, 2002.

[10] M. Matlock and S. Sen. Effective tag mechanisms for evolving cooperation. In AAMAS'09, pages 489–496. ACM, 2009.

[11] K. Moriyama. Utility based Q-learning to maintain cooperation in prisoner's dilemma game. In IAT'07, pages 146–152, 2007.

[12] K. Moriyama. Learning-rate adjusting Q-learning for prisoner's dilemma games. In WI-IAT'08, pages 322–325, 2008.

[13] P. Mukherjee, S. Sen, and S. Airiau. Emergence of norms with biased interactions in heterogeneous agent societies. In WI-IAT'07, pages 512–515, 2007.

[14] S. Sen and S. Airiau. Emergence of norms through social learning. In IJCAI'07, pages 1507–1512, 2007.

[15] K. Sigmund and M. A. Nowak. The alternating prisoner's dilemma. Journal of Theoretical Biology, 38, 1994.

[16] J. L. Stimpson, M. A. Goodrich, and L. C. Walters. Satisficing and learning cooperation in the prisoner's dilemma. In IJCAI'01, pages 535–540. Morgan Kaufmann Publishers Inc., 2001.

[17] K. Tuyls, K. Verbeeck, and T. Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In AAMAS'03, pages 693–700. ACM, 2003.

[18] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3–4), 1992.
