Maintaining Cooperation in Homogeneous Multi-agent System
Jianye Hao and Ho-fung Leung
Department of Computer Science and Engineering
The Chinese University of Hong Kong
{jyhao,lhf}@cse.cuhk.edu.hk
Abstract: During multi-agent interactions, robust strategies are needed to help agents coordinate their actions to achieve efficient outcomes. A large body of previous work focuses on designing strategies toward the goal of Nash equilibrium, which can be extremely inefficient in situations such as the prisoner's dilemma game. A number of improved algorithms based on Q-learning have been developed recently for agents to achieve mutual cooperation in such games, but almost all of them involve only two agents playing the same game repeatedly, which may not reflect practical multi-agent interaction scenarios. In practical multi-agent environments, each agent may interact with multiple agents at the same time, and each agent may only be allowed to interact with a specific set of agents, determined by the system's underlying topology. In this paper, we propose a learning framework that takes into consideration the underlying interaction topology of the agents. We show that the system can maintain a certain level of cooperation even though the agents are individually rational. To better understand this phenomenon, we also develop a mathematical model to analyze the dynamics resulting from the learning framework. The theoretical results of the mathematical model successfully predict the transition point and the expected behaviors of the system when compared with simulation results.
Index Terms: Prisoner's Dilemma Game, Q-learning
I. INTRODUCTION
In a multi-agent environment, multiple agents interact with each other, aiming to achieve joint goals or to maximize their individual benefits. One important type of multi-agent interaction is the so-called social dilemma, in which individual interest is in conflict with group welfare. A well-known representative example of social dilemmas is the Prisoner's Dilemma Game (PDG) (see Fig. 1), which has been studied extensively in disciplines such as social science, behavioral economics, and game theory.
Fig. 1: a typical payoff matrix for the prisoner's dilemma game
Considering the open and dynamic nature of multi-agent environments, it is typically difficult to assign the agents' actions in advance. Therefore, many multi-agent reinforcement learning algorithms (e.g., minimax-Q learning [8], Nash-Q learning [7], and correlated-Q learning [4]) have been proposed to help the agents learn to coordinate with the other agents in the system. Most of them aim at converging to pure-strategy or mixed-strategy Nash equilibria, which is not the solution we expect to achieve in social dilemma situations like the prisoner's dilemma game. Accordingly, other algorithms, many based on Q-learning [16][11][12][2][9], have been developed for agents to achieve mutual cooperation in games like the Prisoner's Dilemma. However, almost all of them involve only two agents playing the same game iteratively, which may not be practical for modeling real-world multi-agent interactions. In real multi-agent environments, each agent may interact with multiple agents at the same time, and each agent may only be allowed to interact with certain agents (e.g., its neighbors), as determined by the system's underlying topology [13][14] or certain tag mechanisms [5][10].
Taking the above issues into consideration, in this paper we study multi-agent interaction modeled as the prisoner's dilemma game, taking the agents' interaction topology into account. The multi-agent interaction framework we consider is as follows. There exists a population of N agents in the system, and each agent is only allowed to interact with its neighbors (i.e., to play the prisoner's dilemma game with them), as controlled by the interaction topology. At each time step, exactly N agents are picked at random to play the prisoner's dilemma game with their neighbors sequentially. Each agent employs a pre-determined, individually rational learning strategy, Q-learning, to choose its actions. The overall reward each agent receives is the sum of the rewards it obtains by playing the prisoner's dilemma game with its neighbors. Surprisingly, simulation results show that the system can maintain the percentage of cooperating agents (those choosing C) above a certain threshold if each agent's neighborhood size is small.
To better understand this phenomenon, we also develop a mathematical model to analyze the dynamics resulting from the above learning framework. Different models for analyzing the dynamics of multi-agent Q-learning with two agents have been proposed [3][17], depending on the exploration mechanism adopted. In this work, we present a mathematical model of the expected behaviors of the agents in the system (i.e., the average percentage of cooperating agents at each time step). The theoretical results are shown to predict the transition point and the expected behaviors of the system when compared with the simulation results, which demonstrates the effectiveness of the proposed model.

2012 IEEE International Conference on Systems, Man, and Cybernetics, October 14-17, 2012, COEX, Seoul, Korea
978-1-4673-1714-6/12/$31.00 ©2012 IEEE
The remainder of the paper is organized as follows. Section II gives an overview of related work on algorithms and strategies for achieving cooperation in the prisoner's dilemma game. Section III introduces the learning framework we propose, and Section IV the mathematical model. In Section V, we conduct various simulations and discuss the effects of different parameters on the system's performance in terms of the percentage of cooperation achieved. In the last section, we conclude the paper and present possible further work.
II. RELATED WORK
A. Strategies to achieve cooperation in Prisoner's Dilemma
The Prisoner's Dilemma game has received widespread attention in the literature of both game theory and multi-agent learning. Various learning strategies and mechanisms (e.g., using tags) have been proposed to achieve cooperation in the iterated Prisoner's Dilemma [1][16][11][12][15][6]. Due to space constraints, we only briefly describe the related work on designing learning strategies for the purpose of achieving cooperation.
In [16], Stimpson et al. propose a satisficing strategy for the prisoner's dilemma game. Each agent i is assigned an aspiration level α_i(0) at the beginning of the game (t = 0). After receiving its payoff R_i(t) at time step t, the agent updates its aspiration level and chooses its next action according to the following rules: if R_i(t) ≥ α_i(t), then α_i(t + 1) = α_i(t) and agent i repeats the action it took at the previous time step; if R_i(t) < α_i(t), then agent i updates α_i according to Equation 1,

α_i(t + 1) = λ α_i(t) + (1 − λ) R_i(t)    (1)

where λ is the learning rate, and agent i switches to the other action at time step t + 1. The authors perform extensive simulations, drawing parameter values from uniform distributions, and show that the agents converge to mutual cooperation most of the time. The effects of different factors (initial aspirations, payoff matrix, learning rate, and initial actions) on cooperation are analyzed as well.
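The satisficing rule above can be sketched in a few lines. This is a hedged sketch, not the authors' implementation: the function name and the 0/1 action encoding (0 = defect, 1 = cooperate) are our own assumptions.

```python
def satisficing_step(action, payoff, aspiration, rate):
    """One step of the satisficing rule: keep the action if the payoff
    meets the aspiration; otherwise lower the aspiration toward the
    payoff (Equation 1) and switch to the other action.
    Actions are encoded 0 = defect, 1 = cooperate (our assumption)."""
    if payoff >= aspiration:
        return action, aspiration          # satisfied: keep both
    new_aspiration = rate * aspiration + (1 - rate) * payoff
    return 1 - action, new_aspiration      # dissatisfied: flip action
```

For example, an agent with aspiration 2.0 that earns payoff 3.0 keeps its action and aspiration, while one that earns 0.0 lowers its aspiration and switches.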
In [11], Moriyama investigates how to maintain mutual cooperation (C, C) when Q-learning agents reach the outcome (C, C) occasionally due to stochastic exploration. The author derives two theorems that provide guidance on maintaining mutual cooperation between the agents. The first states how many times the mutual cooperation outcome (C, C) must be reached in order to maintain it forever (i.e., to make the Q-value of cooperation larger than that of defection); the second gives how much additional reward is needed to make the Q-value of cooperation larger than that of defection if mutual cooperation is reached only once.
Following this work, the author proposes another Q-learning algorithm, learning-rate adjusting Q-learning (LAR-Q) [12], to achieve cooperation among the agents. The key idea is that the agents adjust their learning rates according to the outcome of the game. By following specific rules to update the learning rate, two desired properties are guaranteed: 1) Q(C) becomes larger than Q(D) once the outcome (C, C) is achieved; 2) once Q(C) has become larger than Q(D), it remains larger no matter which outcome is achieved later. Note that the outcome (C, C) is reached through stochastic exploration; no control is enforced on it.
B. Theoretical Models on Multi-agent Q-Learning
Tuyls et al. [17] analyze the dynamics of multiple Q-learning agents with Boltzmann exploration. They represent the dynamics of the probability of each action of each agent by a set of differential equations: they first construct the continuous-time limit of the Boltzmann exploration equation, and then change the Q-function update rule from its discrete form to a continuous form. Combining the two, the dynamics of the probability of each action of each agent is expressed as a differential equation that depends only on the action probabilities of the agents and the game structure.
Gomes and Kowalczyk [3] propose a theoretical framework to model the average behaviors of the agents when each agent adopts a Q-learning strategy with ε-greedy exploration. They model the trend of the Q-values using a system of difference equations, from which the average behaviors of the agents over time can be obtained given the initial Q-values. They evaluate their model on two games: the prisoner's dilemma game and a game with only a mixed-strategy Nash equilibrium. Experimental results demonstrate the applicability of their model by showing that the agents' behaviors obtained from the model are consistent with simulation results.
III. LEARNING FRAMEWORK
The learning framework we consider in this paper is as follows. There exists a population of N agents in the system, and each agent is only allowed to interact with its neighbors. The interaction scenario is modeled as the prisoner's dilemma game. At each time step, exactly N agents are picked at random to play the prisoner's dilemma game with their neighbors sequentially; in other words, each agent in the population has, on average, one chance per time step to interact with its neighbors. The reward each agent receives at each time step is the sum of the rewards it obtains by playing the prisoner's dilemma game with each of its neighbors, and is 0 if it is not chosen at all. We assume that a pre-determined learning algorithm is adopted (it can be considered an intrinsic property of the agents) to help the agents make their decisions; this is introduced in detail later. We assume that the payoff matrix of the game is unknown to the agents, but that the agents interact with perfect monitoring; that is, each agent can observe its neighbors' actions and rewards at the current time step. The protocol of the interactions among the population of N agents is shown in Algorithm 1.
Algorithm 1 Interaction Protocol

for a certain number of time steps T do
    for k = 1 to N do
        pick an agent i at random from the population.
        ask agent i and its neighbors A to select actions based on their own Q-functions with the ε-greedy exploration mechanism.
        agent i and its neighbors A play the prisoner's dilemma games with their own neighbors respectively and obtain the rewards.
        agent i performs a Q-function update (details in Algorithm 2) based on the joint information (actions and rewards) of agent i and its neighbors.
    end for
end for
In this paper we focus on the following topology to model the interaction environment of the agents: a one-dimensional lattice network with connections between all neighboring vertex pairs.1 Agents are represented by the vertices of the network, and the links indicate connections between agents, i.e., there is a link between two agents if they are neighbors of each other. Different network structures are obtained with different values of the neighborhood size k. For example, when k = 2, the network becomes a ring, in which each agent is connected only to its direct left and right neighbors; when k = N − 1, the network becomes fully connected, in which each agent is connected to all other agents in the population. Fig. 2 shows an example of a one-dimensional lattice network with k = 4, in which each agent has four neighbors.
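The neighbor relation on such a lattice can be sketched as follows. This is an illustrative sketch, not code from the paper; the function name is ours, and we assume k is even (k/2 neighbors on each side), as in the k = 2 and k = 4 examples.

```python
def ring_neighbors(i, N, k):
    """Neighbors of agent i on a one-dimensional lattice of N agents,
    with k/2 neighbors on each side (k assumed even). k = 2 gives a
    ring; larger even k adds the next-nearest agents on both sides."""
    half = k // 2
    left = [(i - d) % N for d in range(1, half + 1)]   # wrap around
    right = [(i + d) % N for d in range(1, half + 1)]
    return left + right
```

For instance, with N = 10 and k = 2, agent 0's neighbors are agents 9 and 1.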
The learning algorithm used here builds on the traditional Q-learning algorithm [18], for two reasons. First, in an open and dynamic multi-agent system, the agents are likely created by different individuals or groups, which means we have little control over the learning algorithms implemented on them. Since Q-learning is well suited to repeated games and widely used in multi-agent systems, it is reasonable to adopt it as the agents' learning algorithm. Second, since we are interested in the effect of the topology in which the agents are located on the learning outcome of the system, using Q-learning helps to eliminate noise from the learning algorithm itself,2 letting us focus on how the underlying topology can be used to steer the agents' behaviors toward the desired goal of mutual cooperation.

1 Note that other topologies (e.g., small-world networks) are also available for modeling the agents' interaction environment; these are left as future work.

Fig. 2: a one-dimensional lattice network with k = 4
Given that there exists only one state in the repeated prisoner's dilemma game, the Q-value update rule of the Q-learning algorithm can be simplified as follows:

Q_i^{t+1}(a) = Q_i^t(a) + α (r_i(a) − Q_i^t(a))    (2)

where Q_i^t(a) is agent i's Q-value for action a at time step t, α is the learning rate, and r_i(a) is the reward agent i receives by choosing action a.

The action selection mechanism adopted here is ε-greedy exploration. Under this mechanism, each action is chosen uniformly at random with probability ε, and the action with the highest current Q-value is chosen with probability 1 − ε. With two actions, the probability π_i^t(a) that agent i chooses action a is:

π_i^t(a) = (1 − ε) + ε/2,  if Q_i^t(a) is the highest;
           ε/2,            otherwise.    (3)
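The ε-greedy rule of Equation 3 can be sketched as follows; the function name is our own, and ties are resolved toward the first maximal action for simplicity.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action uniformly at random with probability epsilon,
    otherwise pick the action with the highest Q-value. With two
    actions, the greedy one ends up chosen with overall probability
    (1 - epsilon) + epsilon/2, the other with epsilon/2 (Eq. 3)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```

With epsilon = 0 the rule is purely greedy, which is useful for testing.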
Since the agents are located in a network with a specified topology, we modify the Q-function update rule by allowing each agent to imitate its neighbors' actions. Suppose agent i is currently interacting with its neighbors using action a1. If agent i's reward is higher than all of its neighbors' rewards, agent i simply updates its Q-value for a1 using its own reward. If one of its neighbors (say agent j) receives a higher reward than agent i, and agent j's action (say a2) differs from agent i's action a1, then agent i updates its Q-value for a2 using the reward agent j obtained. The intuition is as follows. Since all the agents are homogeneous with respect to the underlying network topology, when agent j receives a higher reward than agent i with a different action choice, it is likely that agent i would also receive a higher reward by choosing action a2 instead of its original action a1. It is therefore reasonable for agent i to imitate agent j's action (i.e., to update its Q-function as if it had chosen a2) in this situation. The overall rule for updating agent i's Q-functions is shown in Algorithm 2.

2 When two agents using the Q-learning algorithm play the prisoner's dilemma game, the outcome most likely converges to mutual defection. In this paper a similar algorithm is adopted, and we aim at investigating the effect of the additional information from the underlying network structure on inducing the agents toward cooperation.
Algorithm 2 Agent i's Update Strategy for Q-function

Find the agent j who obtains the highest reward among all the neighbors of agent i (if multiple neighbors share the highest reward, sort them in order of their relative positions to agent i).
if agent j's reward r_j is higher than agent i's reward r_i then
    if agent j's action a_j is different from agent i's action a_i then
        agent i updates its Q-function as if action a_j were taken, using reward r_j.
    else
        agent i updates its Q-function with action a_i and reward r_i.
    end if
else
    agent i updates its Q-function with action a_i and reward r_i.
end if
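One interaction round combining the reward accumulation of Algorithm 1 with the imitation-based update of Algorithm 2 can be sketched as below. This is a hedged sketch: the payoff matrix values (R = 3, S = 0, T = 5, P = 1) and all names are illustrative assumptions, not taken from the paper's experiments; actions are encoded 0 = D, 1 = C.

```python
# Assumed prisoner's dilemma payoffs: (row player's, column player's).
PAYOFF = {(1, 1): (3, 3), (1, 0): (0, 5), (0, 1): (5, 0), (0, 0): (1, 1)}

def total_reward(i, actions, neighbors):
    """Sum of PD payoffs agent i earns against each of its neighbors."""
    return sum(PAYOFF[(actions[i], actions[j])][0] for j in neighbors[i])

def imitation_update(i, actions, rewards, neighbors, Q, alpha):
    """Algorithm 2: if the best-earning neighbor j beats agent i with a
    different action, update Q on j's action with j's reward; otherwise
    update Q on i's own action with i's own reward (Equation 2)."""
    j = max(neighbors[i], key=lambda n: rewards[n])
    if rewards[j] > rewards[i] and actions[j] != actions[i]:
        a, r = actions[j], rewards[j]   # imitate the better neighbor
    else:
        a, r = actions[i], rewards[i]   # reinforce own experience
    Q[i][a] += alpha * (r - Q[i][a])
```

On a three-agent ring where one defector faces two cooperators, the defector earns the most, so a cooperating neighbor updates its Q-value for defection with the defector's reward.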
IV. A MATHEMATICAL MODEL OF THE LEARNING FRAMEWORK
To better understand the learning framework and the agents' behaviors within it, we propose a mathematical model to analyze the dynamics of the agents in terms of their average behaviors, i.e., the average percentage of agents choosing cooperation (action C) at each time step.
A. Model of Multi-Agent Learning Involving Two Players
First, consider multi-agent learning with the Q-learning rule in Equation 2. If there are only two agents (agents i and j) in the system, they interact with each other at each time step (i.e., choose their actions simultaneously and update their Q-functions after receiving the rewards). In this case, agent i's dynamic behavior can be modeled by the following continuous-time differential equation [3][17]:

dQ_i(a)/dt = α (r_i(a) − Q_i(a))    (4)

where Q_i(a) is agent i's Q-value for action a and r_i(a) is the reward agent i receives.

If agent i's opponent j plays a stationary strategy (pure or mixed), the differential equation is easy to solve, giving the trend of agent i's Q-value for each action over time [3]:

Q_i(a) = C e^{−αt} + E[r_i(a)]    (5)

where E[r_i(a)] is the expected reward of agent i when playing action a and C is a constant determined by the initial condition.
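Equation 5 can be checked numerically: iterating the discrete update of Equation 2 against a stationary opponent follows the geometric analogue (Q_0 − E[r])(1 − α)^t + E[r] of the continuous exponential solution. A hedged sketch, with names and parameter values of our own choosing:

```python
def q_trace(q0, expected_reward, alpha, steps):
    """Closed-form expected Q-value path against a stationary opponent:
    the discrete analogue of Equation 5,
    Q_t = (q0 - E[r]) * (1 - alpha)**t + E[r]."""
    return [(q0 - expected_reward) * (1 - alpha) ** t + expected_reward
            for t in range(steps + 1)]

def q_iterate(q0, expected_reward, alpha, steps):
    """Directly iterate the update rule of Equation 2 on expected rewards."""
    q, path = q0, [q0]
    for _ in range(steps):
        q += alpha * (expected_reward - q)
        path.append(q)
    return path
```

Both paths coincide and converge to E[r], mirroring the C e^{−αt} decay of the continuous solution.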
If agent i's opponent j is learning as well rather than following a fixed strategy, Equation 4 cannot be solved in the same way. However, we can still trace the path of each agent's Q-values for each action once their initial values are given.
B. Model of the Multi-Agent Learning Framework
According to the interaction protocol in Algorithm 1, at each time step each agent has, on average, one opportunity to be picked to interact with its neighbors and to perform a Q-function update afterwards. The reward it receives is defined as the sum of the payoffs it obtains by interacting with each of its neighbors. The key difference from plain Q-learning (Equation 2) is that the update strategy for the Q-function (Algorithm 2) also takes the neighbors' information (their actions and rewards at the current time step) into consideration.
Based on the above analysis, we extend the two-agent model of Section IV-A to the general case of N agents. Two important aspects have to be considered. First, the update speed differs across the actions' Q-values, for two reasons: the probability that each action is chosen differs because the Q-values differ, and the probability that each action's Q-value is updated differs because of the update strategy, even when the Q-values are equal. We therefore have to take the update speed into account when analyzing the dynamics of the multi-agent learning process. Second, since all the agents learn concurrently, the reward each agent receives also depends on the actions of other agents (its neighbors). Because we aim at analyzing the average trend of each agent's Q-values, the rewards used to update the Q-functions are taken as the expected rewards each agent receives by interacting with its neighbors.
For a population of N agents, the difference equation modeling the expected dynamics of agent i's Q-value for action a can be expressed as follows:3

Q_i^{t+1}(a) = Q_i^t(a) + prob_i^t(a) α (r_i^t(a) − Q_i^t(a))    (6)

3 Note that the proposed model can be regarded as a generalization of the model proposed by Gomes and Kowalczyk [3]. Their model targets the average behaviors of Q-learning agents with ε-greedy exploration, so the meanings of prob_i^t(a) and r_i^t(a) differ from the definitions here. However, applying the proposed model to their learning situation yields the same values of prob_i^t(a) and r_i^t(a), and hence the same average behaviors as their model. Their model, in contrast, cannot be applied directly to the learning situation considered here.
where prob_i^t(a) is the probability that agent i's Q-value for action a is updated at time step t, and r_i^t(a) is the expected reward agent i receives when it updates its Q-value for action a at time step t.
Since both prob_i^t(a) and r_i^t(a) change with time t, we cannot obtain a closed-form solution for the average value of Q_i^t(a). However, given its initial value Q_i^0(a), we can still trace its expected value at each time step t if the values of prob_i^t(a) and r_i^t(a) are available for each t.
According to the update rule for the Q-function, prob_i^t(a) can be calculated as the sum of the probabilities of all the cases in which agent i updates its Q-value for action a, since these cases are mutually exclusive:

prob_i^t(a) = Σ_{j=1}^{n} p_j^t(a)    (7)

where p_j^t(a) is the probability that the j-th case occurs, obtained by multiplying the probabilities of the corresponding actions of every agent involved in that case.

Suppose each agent is located in a ring-structured network and plays the prisoner's dilemma game in Fig. 1. Since the network structure is symmetric, we analyze only one agent (agent i); the analysis is the same for all other agents. From the update rule in Algorithm 2, there are two cases in which agent i updates its Q-value for action C: 1) agent i and its neighbors on both sides choose action C simultaneously; 2) agent i, its left neighbor, and the agent to the left of its left neighbor choose action C, while its right neighbor and the agent to the right of its right neighbor both choose action D. The value of prob_i^t(C) at time step t can therefore be written as:
prob_i^t(C) = π_i^t(C) π_{i−1}^t(C) π_{i+1}^t(C) + π_i^t(C) π_{i−2}^t(C) π_{i−1}^t(C) π_{i+1}^t(D) π_{i+2}^t(D)    (8)

where π_i^t(C) is the average probability that agent i chooses action C at time step t, obtained directly from Equation 3. The subscript i − 1 in π_{i−1}^t(C) denotes the left neighbor of agent i, and so on.
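The two-case sum of Equation 8 can be sketched as follows; the function name is ours, and pi[j] stands for π_j^t(C), so 1 − pi[j] is the probability of choosing D.

```python
def update_prob_C(pi, i, N):
    """Probability that agent i's Q-value on C is updated on a ring
    (Equation 8). pi[j] is agent j's probability of choosing C; the
    ring indices wrap around via the modulo operator."""
    l1, l2 = pi[(i - 1) % N], pi[(i - 2) % N]   # left neighbors
    r1, r2 = pi[(i + 1) % N], pi[(i + 2) % N]   # right neighbors
    case1 = pi[i] * l1 * r1                      # i and both sides play C
    case2 = pi[i] * l2 * l1 * (1 - r1) * (1 - r2)  # left C's, right D's
    return case1 + case2
```

When every agent cooperates with probability 1, only case 1 contributes and the update probability is 1.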
Here r_i^t(a) is defined as the expected reward agent i receives at time step t when its Q-value for action a is updated; its value depends on the actions of its neighbors and of its neighbors' neighbors. It can be calculated as:

r_i^t(a) = Σ_{j=1}^{n} p_j^t(a) · R_j^t(a)    (9)

where p_j^t(a) is the probability of case j in which agent i updates its Q-value for action a at time step t, and R_j^t(a) is the corresponding reward agent i receives in that case. Continuing the previous example, r_i^t(C) is obtained by multiplying the reward agent i receives in each case by the corresponding probability and summing:

r_i^t(C) = π_i^t(C) π_{i−1}^t(C) π_{i+1}^t(C) R_i^t(1) + π_i^t(C) π_{i−2}^t(C) π_{i−1}^t(C) π_{i+1}^t(D) π_{i+2}^t(D) R_i^t(2)    (10)

where R_i^t(1) and R_i^t(2) are the rewards agent i receives in cases 1 and 2, respectively.
For a system of N agents, the Q-value for each action of each agent can be expressed in the form of Equation 6. Since we know how to calculate prob_i^t(a) and r_i^t(a) for each time step t, we can obtain the expected value of each agent's Q-value for each action at each time step from Equation 6, given the initial values of the Q-functions. Finally, according to Equation 3, we can derive the expected behaviors of the agents (i.e., the average percentage of agents choosing C at each time step) from the expected Q-values of each agent at each time step.
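Putting Equations 3, 6, 8, and 10 together, one step of the expected-dynamics model for Q(C) on a ring can be sketched as below. This is a hedged sketch under our own assumptions: the payoff matrix (R = 3, S = 0, T = 5, P = 1) is illustrative, Equations 8 and 10 are taken literally, ties in Equation 3 are broken toward D, and the analogous update of Q(D) is omitted for brevity.

```python
def pi_C(qC, qD, eps):
    """Probability of cooperating under epsilon-greedy (Eq. 3);
    ties broken toward D for simplicity."""
    return (1 - eps) + eps / 2 if qC > qD else eps / 2

# Agent i's reward in the two update cases of Eq. 8 (assumed payoffs).
R_C1 = 3 + 3  # case 1: both neighbors play C
R_C2 = 3 + 0  # case 2: left neighbor plays C, right neighbor plays D

def model_step(Q, eps, alpha, N):
    """One step of the difference equation (6) for Q(C) on a ring.
    Q[i] = [Q(D), Q(C)]. Returns the new Q table and the average
    probability of cooperation across the population."""
    p = [pi_C(Q[i][1], Q[i][0], eps) for i in range(N)]
    newQ = [list(q) for q in Q]
    for i in range(N):
        c1 = p[i] * p[(i - 1) % N] * p[(i + 1) % N]
        c2 = (p[i] * p[(i - 2) % N] * p[(i - 1) % N]
              * (1 - p[(i + 1) % N]) * (1 - p[(i + 2) % N]))
        prob_C = c1 + c2                    # Eq. 8
        r_C = c1 * R_C1 + c2 * R_C2         # Eq. 10, taken literally
        newQ[i][1] += prob_C * alpha * (r_C - Q[i][1])  # Eq. 6
    return newQ, sum(p) / N
```

Iterating model_step from given initial Q-values traces the expected percentage of cooperators over time, which is exactly the quantity compared against simulation in Section V.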
V. SIMULATION AND EVALUATION
A. Simulation
In this section, we run simulations under the learning framework described in Section III and report the average behaviors of the agents (the average percentage of agents choosing C at each time step). By comparison with the plain multi-agent Q-learning case, we show that the topology-based learning framework proposed here has a strong influence in steering the agents' average behaviors toward cooperation. Different parameter settings (the topology's degree of connectivity k, the learning rate α, and the exploration rate ε) can greatly affect the average behaviors of the agents, but we omit the detailed analysis due to space limitations.
Fig. 3 compares the results under the learning framework with those of the traditional Q-learning approach. The average percentage of cooperators under traditional Q-learning converges to approximately the minimum value of 0.1, while the proposed learning framework raises the percentage of cooperators to a much higher level and keeps it there. This stability of the agents' average behaviors is very desirable, especially when the system is open and highly dynamic. The reason is that under the learning framework the agents have access to the information (actions and rewards) of their neighbors, and thus have an incentive to imitate other agents' behaviors when those agents receive higher rewards by choosing different actions. At first thought, this would seem to produce more defection, since the agents receiving higher rewards in the prisoner's dilemma game are usually defectors, whose behavior would then be imitated. On the contrary, allowing imitation helps to steer the agents' behaviors toward cooperation.
Fig. 3: Comparison between the results under the learning framework and the results using the traditional Q-learning approach, with ε = 0.2 and k = 2 (x-axis: time; y-axis: average percentage of cooperators)

B. Application of the Model in Prisoner's Dilemma

Based on extensive simulations varying the different parameters, we know that the influence of the topology on shifting the agents' behaviors toward cooperation is maximized when the degree of connectivity k = 2. Due to space limitations, we analyze the theoretical behavior of the agents only in this case, using the mathematical model developed above. We compare the theoretical average behaviors of the agents with the simulation results, and show that the proposed mathematical model can effectively predict the average behaviors of the agents within the learning framework described in Section III.
Fig. 4(a) and Fig. 4(b) show the average behaviors of the agents in the system (i.e., the average percentage of cooperators) obtained from both simulation and the theoretical model, with the learning rate α set to 0.2 and 0.4, respectively. On the whole, the theoretical model roughly predicts the dynamics of the average behaviors: the average percentage of cooperators drops to the minimum value (0.1), gradually increases afterwards, and finally converges to a constant value. Remarkably, the theoretical model accurately predicts the transition time at which the average percentage of cooperators begins to converge to a constant.
There are also some discrepancies between the simulation results and the theoretical results. The simulation curve ascends continuously from the very beginning until it converges, whereas the theoretical curve stays at the constant minimum value (0.1) for some time before it starts to increase; in other words, the theoretical results exhibit a lag. At the very beginning, the average percentage of cooperators drops to the minimum value 0.1, and Q_i(D) > Q_i(C) for each agent i. From the theoretical model in Equation 6, the per-step increase of Q_i(C) is very small because of the factors prob_i^t(a) and r_i^t(a), so more time is needed for Q_i(C) to exceed Q_i(D). This is why the average percentage of cooperators sticks to its minimum value for a period in the theoretical results.
Fig. 4: Comparison between simulation results and theoretical results with (a) α = 0.2 and (b) α = 0.4 (x-axis: time; y-axis: average percentage of cooperators)

VI. CONCLUSION AND FURTHER WORK

In this paper, we propose a learning framework that takes into consideration the underlying interaction topology of the agents. We show that under this framework, the system can maintain a certain level of cooperation even though the agents are individually rational. To better understand the results, we also develop a mathematical model to analyze the dynamics resulting from the learning framework; its theoretical results successfully predict the transition point and the expected behaviors of the system when compared with the simulation results. As future work, we plan to consider other topologies, such as small-world networks, which can better represent the interaction patterns of people in the real world.
REFERENCES
[1] R. Axelrod. The Evolution of Cooperation. Basic Books, 1984.
[2] A. Bonarini, A. Lazaric, E. M. Cote, and M. Restelli. Improving cooperation among self-interested reinforcement learning agents. In ECML Workshop on Reinforcement Learning in Non-Stationary Environments, 2005.
[3] E. R. Gomes and R. Kowalczyk. Dynamic analysis of multiagent Q-learning with epsilon-greedy exploration. In ICML'09, pages 41-48. ACM, 2009.
[4] A. Greenwald and K. Hall. Correlated Q-learning. In ICML'03, pages 242-249, 2003.
[5] J. Y. Hao and H. F. Leung. Learning to achieve social rationality using tag mechanism in repeated interactions. In ICTAI'11, pages 148-155, 2011.
[6] J. Y. Hao and H. F. Leung. Learning to achieve socially optimal solutions in general-sum games. In PRICAI'12, 2012.
[7] J. Hu and M. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In ICML'98, pages 242-250, 1998.
[8] M. Littman. Markov games as a framework for multi-agent reinforcement learning. In ICML'94, pages 322-328, 1994.
[9] J. M. Liu, H. Jing, and Y. Y. Tang. Multi-agent oriented constraint satisfaction. Artificial Intelligence, 136, 2002.
[10] M. Matlock and S. Sen. Effective tag mechanisms for evolving cooperation. In AAMAS'09, pages 489-496. ACM, 2009.
[11] K. Moriyama. Utility based Q-learning to maintain cooperation in prisoner's dilemma game. In IAT'07, pages 146-152, 2007.
[12] K. Moriyama. Learning-rate adjusting Q-learning for prisoner's dilemma games. In WI-IAT'08, pages 322-325, 2008.
[13] P. Mukherjee, S. Sen, and S. Airiau. Emergence of norms with biased interactions in heterogeneous agent societies. In WI-IAT'07, pages 512-515, 2007.
[14] S. Sen and S. Airiau. Emergence of norms through social learning. In IJCAI'07, pages 1507-1512, 2007.
[15] K. Sigmund and M. A. Nowak. The alternating prisoner's dilemma. Journal of Theoretical Biology, 38, 1994.
[16] J. L. Stimpson, M. A. Goodrich, and L. C. Walters. Satisficing and learning cooperation in the prisoner's dilemma. In IJCAI'01, pages 535-540. Morgan Kaufmann Publishers Inc., 2001.
[17] K. Tuyls, K. Verbeeck, and T. Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In AAMAS'03, pages 693-700. ACM, 2003.
[18] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4), 1992.