
EXTORTION ON THE FRONT LINES

AN EXTORTIONATE APPROACH TO THE LIVE AND LET LIVE SYSTEM OF TRENCH WARFARE

CHRIS HUGHES

Abstract. Despite the abundance of literature on the Prisoner's Dilemma, the recent discovery of zero-determinant (ZD) extortion strategies - which allow one player to enforce an opponent's unfair payoff - provides a new methodology for success in this game. Although ZD-strategies are the focus of current research, there are surprisingly few 'real-world' examples of the uses of such strategies. In addition, many existing cases only consider frequently used conventional values. This project elucidates the Prisoner's Dilemma, particularly detailing the derivation and applications of zero-determinant extortion strategies; demonstrating their robustness when considering only two competing players. By constructing a game inspired by Axelrod's "Live and Let Live" trench warfare analysis, which obtains non-conventional payoffs through Lanchester modelling, I simulate how an extortioner can triumph against an opponent unwitting of ZD-strategies, who adaptively changes his strategy following a given optimisation algorithm. The dominance of ZD-strategies is further exhibited through examples, confirming that, provided the opponent remains ignorant of the extortion, a player utilising a ZD-strategy is assured the maximum payoff under the desired enforced relation - regardless of the adaptation of the other player. The use of ZD-strategies against a sentient opponent is also discussed, where I provide a new interpretation of the parameter φ as the strategy's intensity; proposing how it may be favourable to modify this parameter before resorting to reducing the extortion.

This project provides the essential foundations required to explore current research in this area, presented at an undergraduate level.

Acknowledgements

I would like to thank my supervisor Dr. Gustav W. Delius for his enthusiasm, guidance and dedication; providing excellent supervision over the course of this project. I would also like to thank Rebecca Nelson for inspiring and motivating me to work to the best of my ability.


Contents

Acknowledgements
List of Figures
1. Introduction
1.1. The significance of this work.
1.2. Outline
2. The Prisoner's Dilemma
2.1. The Prisoner's Dilemma game
2.2. The Dilemma
2.3. The iterated Prisoner's Dilemma
2.4. Memory, mixed strategies and the stochastic iterated Prisoner's Dilemma
3. Extortion Strategies
3.1. Formulating a Markov matrix
3.2. Determining a linear relationship between payoffs
3.3. Attempting to set one's own payoff
3.4. Unilaterally setting the score of the opposing player
3.5. Extorting an opponent
3.6. A Theory of Mind
4. Live and Let Live
4.1. Background
4.2. Application of the Prisoner's Dilemma game
5. Lanchester's models of warfare
5.1. Lanchester's Square Law
5.2. Lanchester's Linear Law
5.3. Examples of Lanchester models suitable for trench warfare.
5.4. Criticisms and Limitations of Lanchester's Models
5.5. Notation
6. Application of Lanchester Modelling to Live and Let Live
6.1. Artillery model
6.2. Calculation Results
7. Extortion on the front lines
7.1. The Premise
7.2. Extorting an adapting player
7.3. Application to examples
7.4. The existence of desirable adapting paths
7.5. Convergence to the stationary state and realism of the model
8. The effects of modifying ZD strategy parameters
8.1. The roles of χ and φ
8.2. Extorting a sentient opponent
9. Conclusion
9.1. Recent related work
9.2. Potential for future research
Appendix A. Optimisation Methods for an adapting player
References

List of Figures

1 Payoff matrix for Prisoner's Dilemma game. Note that the row player's payoffs are listed first.


2 Payoff matrix for Artillery model game

3 Payoff matrix for Artillery model game

4 Payoffs attained from the artillery model using various different parameters

5 Previously calculated payoff matrix for Artillery model game

6 A graph displaying how the changes made to player 2's strategy lead to the maximum possible scores for both players

7 A graph displaying how the changes made to player 2's initially uncooperative strategy still lead to the maximum possible scores for both players

8 The adapting paths taken by player 2's strategy in four different instances, each arriving at the maximum score

9 The weightings used to influence the adapting paths taken by player 2's strategy in four different instances, as seen in Fig. 8

10 Expected payoff per round for player 2 after n iterations of the game

11 Average payoff per round for player 2 after n iterations of the game

12 Graph displaying how the probability of player 1's cooperation after each outcome changes in relation to an increasing extortion factor - with φ set at its maximum value

13 Graph displaying how the probability of player 1's cooperation after each outcome changes in relation to an increasing extortion factor - with φ set at half of its maximum value

14 Graph displaying how the probability of player 1's cooperation after each outcome changes in relation to an increasing extortion factor - with φ set at a tenth of its maximum value

15 Graph displaying how increasing the value of φ affects the adapting path of player 2


1. Introduction

Since its introduction in 1950 [15], the Prisoner's Dilemma has been widely and thoroughly researched. Despite the seemingly contrived nature of the game, the Prisoner's Dilemma, and its associated iterated form, have been extensively employed as a framework for modelling situations in which selfish individuals attempt to maximise their scores by balancing cooperation with competition [4]. Thus, this game has been abstracted to numerous examples of both human and animal interactions [16], and has been frequently applied across many diverse fields outside of mathematics; such as economics [11], political science [39], biology [34], psychology [7], and strategic decision making [17]. After such a rich exploration of this concept, researchers were suitably shocked when, in 2012, Press and Dyson discovered the existence of a new class of extortion strategies [36], the success of which appeared to contradict the previously accepted conclusions of employing simple and generous tactics to achieve success in the game [4].

In this project, I intend to explore the derivation of these zero-determinant extortion strategies and how they can be successfully applied when considering two players competing only against each other. In order to do this, building on ideas proposed by Robert Axelrod [4], I shall demonstrate how the Prisoner's Dilemma scenario can be applied to trench warfare in the First World War; constructing a model inspired by historical sources and obtaining the outcomes of this game through the means of Lanchester combat modelling.

1.1. The significance of this work.

• Having been so widely studied, the extensive body of literature associated with the Prisoner's Dilemma can appear very inaccessible to those unfamiliar with the subject. In particular, many of the concepts within this field are so well established that recent publications simply assume familiarity - providing little or no background or review. Thus, this project provides a clear and concise introduction to the Prisoner's Dilemma game between two competing players, from its inception through to current research; in a way that can be understood by an undergraduate mathematician. Whilst it is impossible to detail everything in such a review, I discuss the key developments made to the field, and place them in a wider context, in a way that will provide the reader with the necessary knowledge to explore recent publications in this area.

• Since the discovery of zero-determinant extortion strategies, recent research has moved away from the study of only two competing players [1] [19], and as a result, there are very few examples of how these strategies can be applied to such in practice. In addition, many of the existing examples only consider the conventionally used values associated with the Prisoner's Dilemma. In this project, I explicitly demonstrate how to formulate and successfully apply a zero-determinant strategy in an iterated Prisoner's Dilemma game with non-conventional payoffs; using the context of trench warfare in the First World War. The reason for this choice of context is simply to demonstrate to the reader how the discussed theory may be applied to a real world scenario. Hence, while all historical content associated with this interpretation is fact, the models considered in this project are utilised primarily as a tool to aid the reader's understanding, and are not intended to provide a completely realistic depiction of past events.

• Despite recent research, very little attention has been devoted to the parameter φ associated with a zero-determinant strategy; other than using it to ensure a strategy's feasibility. Here, I provide a new interpretation of this parameter as the intensity of the strategy, and explore how it can be adjusted to influence the perceptions of an opponent.

1.2. Outline. In this project, I shall assume the reader has knowledge, or at least familiarity, of fundamental Linear Algebra, Probability and Stochastic Processes - particularly the Markov process and stationary distributions.

After a very brief background of Game Theory, Section 2 contains a review of the development of the Prisoner's Dilemma game. Beginning with the premise for the single shot version, I detail how this was adapted into an iterated form and comment on the results of Robert Axelrod's computer tournaments, before discussing the stochastic iterated version of the game and defining the memory of a player.

In Section 3, I provide a detailed derivation of zero-determinant extortion strategies, before exploring how they can be applied to fix the score of an unsuspecting opponent. In addition, I discuss the consequences if both players are witting of such strategies and attempt to use them simultaneously.

Section 4 presents a historical background of World War 1 trench warfare, using sources from [3], before demonstrating how this fulfils the conditions of a Prisoner's Dilemma, using arguments from [4].

Section 5 provides a deviation from the Prisoner's Dilemma to introduce Lanchester's equations for combat modelling. It is then demonstrated how these models can be adapted to more complex scenarios, to represent situations in the First World War, along with a brief discussion of why Lanchester models are suitable for our purposes.

In Section 6, I present the Lanchester model that shall be used to explore zero-determinant extortion strategies, and detail how it was inspired by the historical accounts in Section 4. I then provide an example of how to use this model, in practice, to calculate the payoffs for our Prisoner's Dilemma game, and demonstrate how it fulfils the necessary conditions in multiple cases.

In Section 7, I demonstrate how zero-determinant extortion strategies can be used to extort an adapting player. I then define an adapting path, and explore how, when facing an extortion strategy, there exist adapting paths such that both players receive their maximum possible long term scores. Limitations of the model are also discussed, with particular focus on the speed of convergence to the game's stationary state.

In Section 8, I move away from the context of Live and Let Live, and explore how adjusting the parameters of a zero-determinant extortion strategy may influence the behaviour of a sentient opponent.

2. The Prisoner’s Dilemma

Game Theory background. Game theory can be described as "the study of mathematical models of conflict and cooperation between intelligent rational decision-makers" [29]. Although examples of solutions to two player card games had been discussed as early as 1713 [28], it was not until von Neumann and Morgenstern's 1944 work on the application of game theory to the economy [31] that game theory was recognised as a formal mathematical discipline. This provided a comprehensive exploration of finite two-player zero-sum games along with the framework for determining a general strategic game. Despite the 1950s and 1960s being considered the field's golden age, during which John Nash proposed his now famous Nash equilibrium [30], game theory is still a highly active field [8].

For the purposes of this project, a game is defined [40] to be any situation in which:


• There are at least two players. A player may represent a single being or collective of individuals. Companies, units and biological species are all examples of what may be defined as a player.

• Each player has a set of possible strategies, a specification of how to choose their next move in any given situation. The strategies chosen by each player determine the outcome of the game.

• Each player has a numerical payoff associated with each possible outcome, to represent the value of the outcome to the different players.

A player is said to be rational if his actions are motivated by maximising his own payoff.

2.1. The Prisoner's Dilemma game. The Prisoner's Dilemma is one of the most famous and widely studied games in all of Mathematics. The premise for the Prisoner's Dilemma was first introduced in 1950 by the American mathematicians Merrill M. Flood and Melvin Dresher, as a model of cooperation and conflict. This was subsequently formalised and given its prison sentence interpretation, along with its current name, by Albert W. Tucker later in 1950; he presented it in a similar manner as the following.

2.1.1. The premise of the game. Two criminals are arrested and imprisoned, each held separately with no means of communication. Due to a lack of evidence, the prosecutors are only able to sentence the men for a minor offence, unable to convict the pair for the principal charge without a confession from one of the men [45].

To elicit this confession, each prisoner is simultaneously offered the same deal. Each man is given the opportunity either to defect, betraying the other by testifying that the other committed the crime, or to cooperate with his accomplice by remaining silent.

• If both men choose to defect, they will both be convicted of the principal offence and each serve 3 years in prison.

• If one man defects while the other cooperates, the cooperating prisoner will serve an extended sentence of 5 years, while the defecting man walks free.

• If both choose to cooperate with each other, no evidence is gained by the prosecutors, so both men are only convicted of the lesser charge, serving 1 year each.

This is summarised in Fig. 1.

                              Column Player
                   Cooperate                    Defect
 Row    Cooperate  R = -1, R = -1               S = -5, T = 0
Player             (reward for mutual           (sucker's payoff, and
                   cooperation)                 temptation to defect)
        Defect     T = 0, S = -5                P = -3, P = -3
                   (temptation to defect,       (punishment for mutual
                   and sucker's payoff)         defection)

Figure 1. Payoff matrix for Prisoner's Dilemma game. Note that the row player's payoffs are listed first.

The payoffs in a Prisoner’s Dilemma game must be strictly ordered such that

T > R > P > S. (1)

The payoff relationship R > P implies that mutual cooperation is superior to mutual defection, while the payoff relationships T > R and P > S imply that defection is the most effective strategy for both agents.


2.2. The Dilemma. Now that the game is established, I shall introduce the concept of a Nash Equilibrium in order to understand the reasoning of a rational player in a Prisoner's Dilemma.

A Nash Equilibrium is a situation in which no player can do better by changing his strategy, assuming the strategies of his opponents remain fixed [30].

It is important to notice that in the Prisoner's Dilemma, despite mutual cooperation leading to the best joint payoff, it is not a Nash equilibrium; each player can do better by unilaterally defecting whilst the other cooperates.

Theorem 2.1. The Prisoner's Dilemma has exactly one Nash Equilibrium, the outcome of mutual defection.

Proof. As we assume both players are rational, if one player chooses to cooperate, the other player's best option is to defect as, by (1), T > R. Likewise, if the player chooses to defect then the other player's best option is still to defect, as P > S by (1). Hence, regardless of the other player's choice, defection always ensures the maximal payoff; therefore mutual defection is the only Nash Equilibrium. □

From this we see that a strictly rational player - who believes his opponent to also be strictly rational - will always choose defection and, applying this reasoning to both players, the encounter will end with a mutual defection. As a result of individual rationality, both players have ended up in a worse position than if they had cooperated, hence the dilemma.

What has made this game so interesting to study is that, although the best outcome for both players is through mutual cooperation, both believe it is in their best interests to defect. To an observer this seems bizarre; essentially two players resign themselves to a worse position in an attempt to limit their opponent to the same or worse position as themselves, and as a result are punished by their own selfishness. Despite this knowledge, in reality people continue to act in the same way; an example of this is during an arms race [4].

2.3. The iterated Prisoner's Dilemma. Although interesting to examine, there is only a limited amount of knowledge that can be gained from studying a single occurrence of the Prisoner's Dilemma. Regardless of the sense in which we wish to apply this template, potentially to such fields as evolutionary biology [27] or business and economics [11], it is generally unrealistic to assume that players will cease interacting after a single exchange. Thus, we gain a far greater insight through a string of repeated interactions. When the Prisoner's Dilemma game is iterated, another condition is often established:

The payoffs in an iterated Prisoner’s Dilemma must satisfy

2R > T + S (2)

This ensures that the players cannot escape the dilemma by taking turns exploiting each other.

In the iterated Prisoner's Dilemma, we move away from the notion of the payoffs representing jail time and instead usually envision them as the number of points awarded to each player at the end of an iteration. Therefore, it is conventional to make the payoffs positive numbers; the values of the payoffs traditionally used in most literature are (T, R, P, S) = (5, 3, 1, 0) for both players.

In addition, it is usually assumed in iterated games that all players have perfect information; that is, at every stage of the game, each player knows all moves made so far. We also assume that every player has knowledge of the moves available to the other players in the future (Cooperate or Defect in this case).


Another important factor which must be considered when iterating the Prisoner's Dilemma is the duration, or total number of rounds, of the game. In a finite game, in which the number of turns is known to the players, players are able to form a plan for the entire game; which results in players having no incentive to cooperate. The reason for this is that, if a player is aware that the next choice is in the final round, he is essentially placed in the 'one-shot' version of the Prisoner's Dilemma. As a result of this, we can expect a pair of rational players to defect, satisfying the Nash equilibrium of this game. Using backwards induction, we end up in a situation where players will never be willing to cooperate as they are aware their opponent is going to defect.

One way to overcome this is to introduce a probability w that the game will end at the end of the current round. By doing this, we place the players in a situation in which they are unsure if they will meet again, thus eliminating any endgame plan.

As an aside, it is interesting to note that humans do not tend to rigorously use backwards induction in experimental game theory, as our instincts are not naturally equipped to plan for games with a well-defined number of turns [4].

2.3.1. Axelrod’s computer tournaments. In the 1980s, Robert Axelrod popularised theidea of the Prisoner’s Dilemma game in an iterated form when he organised competitivecomputer tournaments, in which players were invited to submit strategies of their choiceto compete in a round-robin type environment. These strategies were submitted as aprogram that would chose to cooperate or defect on a turn by turn basis, using thehistory of the interaction to make its choice if required. However, it is important to notethat strategies can not include any type of metagame analysis [22], such as “make thesame move that other player will make”, promises, commitments or enforceable threats.The tournament environment measured the performance of the strategies by playing everystrategy against each other entry, also pairing it with itself and a strategy which choseits next move at random.

Axelrod held two tournaments. The first only accepted entries from professional game theorists, while the second round was much more widely accessible, with a range of different entrants. Despite a selection of sophisticated strategies being submitted, and the entrants of the second tournament being provided with a detailed analysis of the results of the first, both were won by the same simple strategy - Tit for Tat. Tit for Tat (hereafter TFT) unconditionally cooperates in the first round, then simply proceeds to repeat the previous move of its opponent.

The results of these tournaments yielded several surprising insights and were published in [5][6]. These were later expanded on in The Evolution of Cooperation [4], in which Axelrod presented detailed analysis and conclusions.

2.3.2. Conclusions from the computer tournaments. A strategy is said to be nice if it is not the first to defect, and a strategy is said to be forgiving if it is willing to cooperate after a defection from an opponent.

Despite both tournaments being won by TFT, the conclusions drawn from each were quite different. In the first round, the most successful strategies on the whole were those that were 'nice' and 'forgiving', due to the high scores gained by these strategies when competing with each other. Possibly as a result of Axelrod informing the entrants of this fact, the second tournament saw many strategies that were too generous with their willingness to cooperate, leaving themselves open to extortion. This yielded a different conclusion: if an opponent is too forgiving, a player should attempt to exploit them for maximum gains. In light of this, Axelrod provided advice on how to create a successful strategy, which can be summarised as follows [4]:


• While it pays to be forgiving, a strategy should reciprocate cooperation and defection, in order to avoid exploitation.

• A strategy should not be too complex; it is better to employ a simple strategy with clear intentions, to encourage cooperation.

• Do not be the first to defect, in order to gain more points by avoiding unnecessary conflict.

Axelrod also stated, somewhat controversially, that probabilistic strategies are typically too complex, therefore they may seem random to the opposing player; leading them to assume the strategy is unresponsive to sustained cooperation and destroying incentive to cooperate.

While most of Axelrod's conclusions are generally accepted, some doubt has been expressed as to the efficacy of TFT [32], and the effectiveness of the criterion used to determine success in the tournaments [38].

2.3.3. Application to an environment of only two competing players. By adopting a round-robin approach, Axelrod moved away from the scenario of an interaction between only two players, instead exploring how a strategy performed against a population of different strategies. While this provided some fascinating results, much of Axelrod's advice on effective choice in a Prisoner's Dilemma is not applicable to an environment in which only two players are competing. For example, TFT never scored higher than its opponent in the computer tournament, but always performed well. As a result, it averaged higher than any other strategy, thus winning the tournament. With only two players, TFT is beaten if the other player simply chooses to always defect (ALLD); and also by most exploitative strategies. However, consider the following modification to TFT: instead of cooperating on the first turn, the strategy defects. This ensures that, playing against ALLD, the player will be guaranteed at least an equal payoff, and performs significantly better against more aggressive strategies. This variant is known as Suspicious Tit for Tat (sTFT). Interestingly however, sTFT does not fare well competing against a population. To understand this, we consider its performance against TFT. Although sTFT will receive a higher payoff in the first round, the initial defection forces the pair into an endless alternating sequence of unilateral defections, with both losing out on a large number of points. The initial defection also completely destroys any chance of cooperation against non-forgiving strategies. Therefore, it is clear to observe that, when choosing a strategy to guarantee success when only considering two competing players, this must be approached differently than when performing against a population.

2.3.4. Commonly used strategies. Whilst there are far too many strategies that have been discussed to provide a comprehensive list, here I give a brief overview of those that commonly appear in the literature:

Always Cooperate: (ALLC) the player unconditionally cooperates every turn

Always Defect: (ALLD) the player unconditionally defects every turn

Friedman: (GRIM) cooperates on first turn and continues to cooperate until the opponent defects, then switches to play ALLD

Tit for Two Tats: (TFTT) a more forgiving variant of TFT, which only defects after two consecutive defections from the opponent

Pavlov: cooperates if and only if both strategies made the same move in the previous round

2.4. Memory, mixed strategies and the stochastic iterated Prisoner's Dilemma. A pure (or deterministic) strategy is a strategy that does not include any randomness.


That is, the strategy determines the player's next move, and should therefore specify all responses to any possible choice made by the other player.

A mixed strategy is an assignment of a probability to each of a player's possible moves. This allows a player to select his next move at random.

Remark. As probabilities are continuous, there are infinitely many mixed strategies available to a player.

2.4.1. The stochastic iterated Prisoner's Dilemma. The Prisoner's Dilemma was developed further in the 1990s with the introduction of the stochastic iterated game [33], by researchers such as Nowak and Sigmund. In the stochastic version of this game, all strategies are represented by a set of cooperation probabilities and, provided both players use finite-memory strategies, the game can be modelled by a Markov chain. For the rest of this work, we assume that all games are represented as a stochastic iterated Prisoner's Dilemma.

2.4.2. Memory. When facing an opponent for the first time, each player has very little insight into how to choose their next move. While some strategies determine the player's next move independently of an opponent's behaviour, such as ALLD, many strategies utilise the results from previous interactions; which are acquired as the game progresses. Informally, this is described as the memory of a player [36].

Intuitively, most would assume that a player with longer memory, and hence more knowledge of past outcomes, would have a distinct advantage over a more forgetful player. After all, if we were to observe a repeated game in which both players had no memory, thus forgetting every move made at the end of each iteration, the players are essentially playing a series of the 'one-shot' version of the game. Hence, we would expect to observe mutual defection, the Nash Equilibrium, in every iteration. However, while having a memory of the previous turn is useful, in the instance in which players are competing in an indefinitely repeated game with the same allowed moves and payoff matrices, we can prove that this is not the case.

In order to do this, first I define a player’s memory.

The memory of a player is the set containing the moves made by the players in all previous interactions; which is available to be used by the player's strategy when choosing their next move. That is, in the nth round of a game, a player's memory contains all tuples of moves which have been played up until round n. We regard a memory-m strategy as a strategy requiring knowledge of m previous rounds in order to choose its next move, where m ∈ N. A memory-m player is defined to be a player using a memory-m strategy.

For our purposes, we can envision a memory-n player as only being able to remember the previous n turns, and thus the player has no knowledge of any additional interactions that have previously taken place.

Here we shall consider a player's long term expected payoff, which can be thought of as the average payoff gained per round of the game; after the game has been played for a sufficiently high number of rounds.

Theorem 2.2. [36] In a two player game, in which the set of moves available to each player consists of two options, let a memory-m player X play against a memory-n player Y, where m > n. For any memory-m strategy, there exists an equivalent memory-n strategy such that X would receive the same long term expected payoff under both strategies.


Proof. [36] Let X and Y be random variables that take values x and y. We consider these values as the moves made by the players in any given turn, such that x is the move of player X and y is the move of player Y. We now label the history of past events in such a way that H = [H0, H1], where the most recent history H0 is known to both players. The older history H1 is only available to player X, the player with the greater memory. As the scores of the players are only dependent on (x, y), we consider the expectation of the joint probability (X, Y) with respect to H:

\begin{aligned}
\langle P(X = x, Y = y \mid H_0, H_1) \rangle_{H_0, H_1}
&= \sum_{h_0, h_1} P(x, y \mid h_0, h_1)\, P(h_0, h_1) \\
&= \sum_{h_0, h_1} P(y \mid h_0)\, P(x \mid h_0, h_1)\, P(h_0, h_1) \\
&= \sum_{h_0} P(y \mid h_0) \left[ \sum_{h_1} P(x \mid h_0, h_1)\, P(h_1 \mid h_0)\, P(h_0) \right] \\
&= \sum_{h_0} P(y \mid h_0) \left[ \sum_{h_1} P(x, h_1 \mid h_0)\, P(h_0) \right] \\
&= \sum_{h_0} P(x \mid h_0)\, P(y \mid h_0)\, P(h_0) \\
&= \langle P(X = x, Y = y \mid H_0) \rangle_{H_0}
\end{aligned}

By using standard properties of the expectation and conditional probability, we have redefined the game in a form that is only conditioned on H0. In this game, player X plays the marginalised strategy:

P(x \mid h_0) \equiv \sum_{h_1} P(x, h_1 \mid h_0). (3)

Thus, by averaging over H1 - all outcomes remembered by X but not by Y - we have found a shorter memory strategy dependent only on H0 that yields the same expected payoff as the original strategy of player X. □

Remark. As X’s payoff gained from a memory-m strategy is equal to the payoff of somememory-n strategy, this shows that X gained no advantage through utilising a longermemory. That is, if we consider the case in which X plays a memory-m strategy againstY’s memory-1 strategy, by averaging over the resulting probability distribution of allsequences of m outcomes, we can obtain an alternate memory-1 strategy for X, yieldingthe same long term average score when played against Y [41].

This result is remarkable, yet with a little reasoning, we can gain an intuitive understanding of how this aids our analysis. By Theorem 2.2 we know that regardless of the length of the memory employed by player X, from the perspective of his opponent, this is equivalent to playing some alternate strategy in which the memory matches that being employed by Y. This means that, while X may insist upon deciding his next move as a result of analysing a long sequence of past encounters, the outcome will be the same as the case in which he played the corresponding strategy of shorter memory. As these strategies are equivalent from the perspective of Y, it is entirely undetectable to Y if X is using a longer memory strategy.

It is important to note that, as X does not have knowledge of the outcomes to average over before the game begins, he cannot explicitly use the shorter memory strategy. However, this is irrelevant as we are aware of the existence of a shorter-memory strategy that would have generated the same gameplay. Thus, in our analysis, we can obtain the same results by considering only the case in which both players are using strategies of memory equal to that of the most forgetful player.

In light of this, we will only consider memory-1 strategies, as going beyond this quickly becomes complex even to simulate, and will provide no greater insight than in the simpler case.

2.4.3. Expressing a strategy as a vector of probabilities. As demonstrated in Theorem 2.2, we know that if one player is using a memory-1 strategy, as there is no advantage to the other player employing a longer memory, we can analyse this game as though both players were using memory-1 strategies. We refer to this as a memory-1 game.

Once again considering players X and Y, with respective moves x and y, we can consider that the four possible outcomes from each round of an iterated Prisoner's Dilemma can be represented as xy ∈ {CC, CD, DC, DD}; from the perspective of player X. If we label the outcomes of the previous move from 1 to 4, then we can represent player X's memory-one strategy as the tuple p = (p1, p2, p3, p4) = (pCC, pCD, pDC, pDD), containing the probability of the player's cooperation in the current round, given the previous outcome. In summary, we have:

p1 = P(Xn+1 = C | Xn = C, Yn = C) = pCC
p2 = P(Xn+1 = C | Xn = C, Yn = D) = pCD
p3 = P(Xn+1 = C | Xn = D, Yn = C) = pDC
p4 = P(Xn+1 = C | Xn = D, Yn = D) = pDD. (4)

Similarly, we can express player Y's strategy, from his own viewpoint, as q = (q1, q2, q3, q4) = (qCC, qCD, qDC, qDD). In this case, we have:

q1 = P(Yn+1 = C | Xn = C, Yn = C) = qCC
q2 = P(Yn+1 = C | Xn = D, Yn = C) = qCD
q3 = P(Yn+1 = C | Xn = C, Yn = D) = qDC
q4 = P(Yn+1 = C | Xn = D, Yn = D) = qDD, (5)

which correspond to the outcomes yx ∈ {CC,CD,DC,DD}.

Remark. It is important to notice that the probabilities p2 and q2 correspond to different outcomes, representing the different viewpoints of the players. If we were to represent player Y's strategy from the viewpoint of player X, that is, corresponding to the outcomes xy ∈ {CC, CD, DC, DD}, we would notate the tuple as q = (q1, q3, q2, q4) in order to be consistent with p.

Example 2.3. We use this notation to express several commonly used memory-1 strategies as follows:

ALLC = (1, 1, 1, 1)

ALLD = (0, 0, 0, 0)

TFT = (1, 0, 1, 0)

Pavlov = (1, 0, 0, 1)

Random = (1/2, 1/2, 1/2, 1/2).
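To make this notation concrete, here is a brief Monte Carlo sketch of my own (in Python, with the conventional payoffs (T, R, P, S) = (5, 3, 1, 0); the function and variable names are illustrative, not taken from the literature) that simulates a stochastic iterated game between two memory-1 strategies expressed as such tuples:

```python
import random

# Payoffs to player X for each outcome xy; the game is symmetric, so
# player Y's payoff for outcome xy equals player X's payoff for yx.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
INDEX = {('C', 'C'): 0, ('C', 'D'): 1, ('D', 'C'): 2, ('D', 'D'): 3}

def play(p, q, rounds=100_000):
    """Average payoff per round for players X and Y.

    p and q follow the conventions of (4) and (5): each entry is the
    probability of cooperating after the outcomes CC, CD, DC, DD, seen
    from that player's own viewpoint.
    """
    x, y = 'C', 'C'                       # both open with cooperation
    total_x = total_y = 0
    for _ in range(rounds):
        total_x += PAYOFF[(x, y)]
        total_y += PAYOFF[(y, x)]
        px = p[INDEX[(x, y)]]             # X conditions on outcome xy
        qy = q[INDEX[(y, x)]]             # Y conditions on outcome yx
        x = 'C' if random.random() < px else 'D'
        y = 'C' if random.random() < qy else 'D'
    return total_x / rounds, total_y / rounds

tft = (1, 0, 1, 0)
sloppy_tft = (0.9, 0.1, 0.9, 0.1)         # a TFT that errs 10% of the time
print(play(tft, sloppy_tft))
```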


3. Extortion Strategies

The developments made to the iterated stochastic Prisoner's Dilemma by researchers such as Boerlijst and Nowak [9] - which explored how the use of simple 'equaliser' strategies could ensure that both players received the same payoff - enabled William Press and Freeman Dyson to revolutionise the field in 2012, with the introduction of zero-determinant extortion strategies [36]. Contrary to Axelrod's claims of simplicity and fairness [4], Press and Dyson demonstrated that, if playing against an unwitting opponent, there exists a class of strategies which allow one player to enforce an extortionate linear relationship between the scores of himself and his opponent. It is this class of strategies that shall be our primary focus in this project. Although recent research in this area has been primarily focused on the evolutionary stability of zero-determinant strategies, with some excellent examples being Adami and Hintze [1] and Hilbe [19], most of this work only concerns competing populations of strategies, and is therefore largely inapplicable to an environment of only two competing players.

Thus, in this section, we return to the scenario described by Press and Dyson, of two competing players. Here, I shall present and discuss the motivation behind, and the derivation of, zero-determinant strategies.

3.1. Formulating a Markov matrix. In this section, we recall the notation introduced in 2.4.3, such that we represent the memory-1 strategies of players 1 and 2 by p = (p1, p2, p3, p4) and q = (q1, q2, q3, q4) respectively. As we regard a player utilising a memory-1 strategy as only able to remember the outcome of the previous round, we can describe the stochastic iterated Prisoner's Dilemma as a four-state Markov chain with state space {CC, CD, DC, DD}. As both players make their move simultaneously, we are able to calculate the 16 transition probabilities and formulate the Markov matrix for the game as in [33]. For example, suppose both players cooperate in round 1. We see that the probability in round 2 that both players cooperate again is p1q1, that X cooperates and Y defects is p1(1 − q1), that X defects and Y cooperates is (1 − p1)q1, and that both defect is (1 − p1)(1 − q1). We can calculate the remaining probabilities in the same way.

Since this matrix is fully determined by p and q, we denote this as M(p, q):

M(\mathbf{p}, \mathbf{q}) =
\begin{pmatrix}
p_1 q_1 & p_1(1 - q_1) & (1 - p_1) q_1 & (1 - p_1)(1 - q_1) \\
p_2 q_3 & p_2(1 - q_3) & (1 - p_2) q_3 & (1 - p_2)(1 - q_3) \\
p_3 q_2 & p_3(1 - q_2) & (1 - p_3) q_2 & (1 - p_3)(1 - q_2) \\
p_4 q_4 & p_4(1 - q_4) & (1 - p_4) q_4 & (1 - p_4)(1 - q_4)
\end{pmatrix}. (6)

In transition matrix form, we are able to calculate the states in future rounds analytically through repeated multiplication of the transition matrix. As all elements of M (6) are products of probabilities, all entries of this matrix are non-negative, and as this is representing the transition probabilities of a Markov chain, all rows sum to one. As we repeatedly multiply this matrix by itself, the chain will converge to a stationary state. However, it is not always the case that the stationary distribution of (6) is unique. In fact, in order to have a unique stationary distribution, a Markov chain with a finite state space must have exactly one closed communicating class. As a result, we shall always have one eigenvalue of this matrix equal to one, and the others of modulus less than one. The eigenvector associated to the unit eigenvalue contains the stationary probabilities of the states, and is known as the stationary vector; as its entries do not vary with time. The stationary vector v, here taken as a column vector, satisfies the following eigenvalue equation:

vᵀM = vᵀ (7)

Put simply, the stationary distribution is the normalised left eigenvector of the transition matrix.

More information on Markov matrices can be found in [35]. In the following sections, I shall assume some familiarity with the long term behaviour of Markov chains.
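As an illustration (a sketch of my own in Python, assuming NumPy is available; none of these helper names come from the literature), the matrix (6) and its stationary vector can be computed as follows:

```python
import numpy as np

def markov_matrix(p, q):
    """The 4x4 transition matrix M(p, q) of equation (6).

    p and q are memory-1 strategies (p1..p4 and q1..q4), each listing a
    player's cooperation probability after the outcomes CC, CD, DC, DD
    from that player's own viewpoint.
    """
    p1, p2, p3, p4 = p
    q1, q2, q3, q4 = q
    return np.array([
        [p1*q1, p1*(1 - q1), (1 - p1)*q1, (1 - p1)*(1 - q1)],
        [p2*q3, p2*(1 - q3), (1 - p2)*q3, (1 - p2)*(1 - q3)],
        [p3*q2, p3*(1 - q2), (1 - p3)*q2, (1 - p3)*(1 - q2)],
        [p4*q4, p4*(1 - q4), (1 - p4)*q4, (1 - p4)*(1 - q4)],
    ])

def stationary_vector(M):
    """The normalised left eigenvector of M for the unit eigenvalue."""
    w, V = np.linalg.eig(M.T)                    # left eigenvectors of M
    v = np.real(V[:, np.argmin(np.abs(w - 1))])  # eigenvalue closest to 1
    return v / v.sum()

# Two strictly mixed strategies, so the chain is ergodic and the
# stationary distribution over {CC, CD, DC, DD} is unique.
v = stationary_vector(markov_matrix((0.8, 0.5, 0.5, 0.2),
                                    (0.7, 0.2, 0.6, 0.1)))
print(v)
```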

3.1.1. On the uniqueness of stationary distributions. Following [36], the derivation in this project applies only to the case in which there exists a unique stationary distribution for (6). Thus, when calculating the long term scores of the players, we are able to disregard their initial moves; which is not the case when multiple stationary distributions exist. While I do not detail all of the conditions under which the following analysis will not hold, an in-depth exploration can be found in [14]. However, the majority of problems can be avoided if, for player 1 using the strategy

p̃ = (p1 − 1, p2 − 1, p3, p4)

and player 2 using the strategy

q̃ = (q1 − 1, q3, q2 − 1, q4),

none of the following conditions hold:

(1) There are two distinct states i and j for which p̃i = q̃i = p̃j = q̃j = 0.
(2) Either p̃ = 0 or q̃ = 0.
(3) Both p̃ = q̃ = 1.

The reason for these conditions will become apparent in 3.2 and 3.3.

3.2. Determining a linear relationship between payoffs. The observations made in this section were first made in [36]. However, in the original paper, many of the intermediate steps are not included. Here, I detail steps within the derivation that are not present in [36].

As M (6) has a unit eigenvalue, if we consider M′ = M − I (where I is the 4×4 identity matrix) we obtain a singular matrix, with determinant equal to zero:

M'(\mathbf{p}, \mathbf{q}) =
\begin{pmatrix}
p_1 q_1 - 1 & p_1(1 - q_1) & (1 - p_1) q_1 & (1 - p_1)(1 - q_1) \\
p_2 q_3 & p_2(1 - q_3) - 1 & (1 - p_2) q_3 & (1 - p_2)(1 - q_3) \\
p_3 q_2 & p_3(1 - q_2) & (1 - p_3) q_2 - 1 & (1 - p_3)(1 - q_2) \\
p_4 q_4 & p_4(1 - q_4) & (1 - p_4) q_4 & (1 - p_4)(1 - q_4) - 1
\end{pmatrix}. (8)

As we have taken M ′ = M − I, we see that the eigenvalue equation (7) becomes:

vᵀM′ = 0. (9)

If we now apply Cramer's Rule, a formula for the determinant, to the matrix M′, we have:

Adj(M ′)M ′ = det(M ′)I = 0 =⇒ Adj(M ′)M ′ = 0 (10)

where Adj(M′) is the transpose of the matrix of cofactors, known as the adjugate matrix in Linear Algebra. Recall that the cofactor matrix C is given by the matrix composed of the determinants of the matrix minors. That is, the element Cij is given by the determinant of the 3×3 matrix seen if row i and column j are excluded; if the sum of i and j is even, this element is positive, otherwise it is negative.

In [36], Press and Dyson make the key observation that, since equations (9) and (10) are equal to zero, the rows of Adj(M′) must be proportional to v. Although not mentioned in [36], this is correct as long as the left kernel of M′ is one dimensional, which we see is true as M′ has rank 3. Hence, if we take the dot product of v with any vector f, this is proportional to the dot product of any row of Adj(M′) with f.


Following [36] and considering the fourth row, we can observe that the components of v are, in fact, the determinants of the 3×3 matrices formed from the first three columns of M′, leaving out each one of the four rows in turn. Thus, if we take the dot product of the stationary vector and the fourth row of Adj(M′), we have:

v · f = C14f1 + C24f2 + C34f3 + C44f4. (11)

Notice that, by the definition of Adj(M′), we can express this as the product of f with the fourth column of C. Using basic Linear Algebra, we know that the determinant of any matrix can be expressed, with appropriate signs, as the elements of any row or column multiplied by the determinants of corresponding minor matrices. Thus, if we expand along the fourth column, by observing that the determinants of the minor matrices are equal, with signs, to the elements of the cofactor matrix seen in (11), we can express v · f as the determinant of M′, with the fourth column replaced by f as in (12).

v \cdot f = \det
\begin{pmatrix}
p_1 q_1 - 1 & p_1(1 - q_1) & (1 - p_1) q_1 & f_1 \\
p_2 q_3 & p_2(1 - q_3) - 1 & (1 - p_2) q_3 & f_2 \\
p_3 q_2 & p_3(1 - q_2) & (1 - p_3) q_2 - 1 & f_3 \\
p_4 q_4 & p_4(1 - q_4) & (1 - p_4) q_4 & f_4
\end{pmatrix}. (12)

In order to simplify (12), we can manipulate this matrix further by using the fact that the determinant of any matrix remains unchanged if a multiple of a column is added to another column. Hence, by adding the first column to the second column we obtain:

\begin{vmatrix}
p_1 q_1 - 1 & p_1 - 1 & (1 - p_1) q_1 & f_1 \\
p_2 q_3 & p_2 - 1 & (1 - p_2) q_3 & f_2 \\
p_3 q_2 & p_3 & (1 - p_3) q_2 - 1 & f_3 \\
p_4 q_4 & p_4 & (1 - p_4) q_4 & f_4
\end{vmatrix}

and by adding the first column to the third column we obtain a formula for the product of any four dimensional vector f with the stationary vector of the Markov matrix v:

v \cdot f = \det
\begin{pmatrix}
p_1 q_1 - 1 & p_1 - 1 & q_1 - 1 & f_1 \\
p_2 q_3 & p_2 - 1 & q_3 & f_2 \\
p_3 q_2 & p_3 & q_2 - 1 & f_3 \\
p_4 q_4 & p_4 & q_4 & f_4
\end{pmatrix} \equiv D(\mathbf{p}, \mathbf{q}, \mathbf{f}). (13)

The form of the matrix in (13) allows us to make a remarkable observation. If we consider the second column, which we shall denote:

p̃ = (p1 − 1, p2 − 1, p3, p4) (14)

we see that this is completely determined by the strategy of player 1, and thus solely under his control. Similarly, the third column:

q̃ = (q1 − 1, q3, q2 − 1, q4) (15)

depends entirely on the strategy of player 2. The significance of this result is that, if correctly chosen, one of the players can select his strategy so as to ensure that D (13) is equal to zero.
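As a quick numerical sanity check (again a sketch of my own, reusing markov_matrix and stationary_vector from the snippet in 3.1), one can verify that the determinant formula (13) and the stationary dot product agree once the unknown normalisation is divided out:

```python
import numpy as np

def D(p, q, f):
    """Evaluate the determinant formula (13)."""
    p1, p2, p3, p4 = p
    q1, q2, q3, q4 = q
    return np.linalg.det(np.array([
        [p1*q1 - 1, p1 - 1, q1 - 1, f[0]],
        [p2*q3,     p2 - 1, q3,     f[1]],
        [p3*q2,     p3,     q2 - 1, f[2]],
        [p4*q4,     p4,     q4,     f[3]],
    ]))

rng = np.random.default_rng(0)
p, q, f = rng.random(4), rng.random(4), rng.random(4)
one = np.ones(4)
v = stationary_vector(markov_matrix(tuple(p), tuple(q)))

# Both ratios eliminate the scale of v, so the two printed values agree:
print(np.dot(v, f) / np.dot(v, one))
print(D(p, q, f) / D(p, q, one))
```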

If we recall from 2.4.3 that we described the state space of the Prisoner's Dilemma as xy ∈ {CC, CD, DC, DD}, denoting the payoff matrix in vector form for each of the players, from the perspective of player 1 we have:

SX = (R, S, T, P)
SY = (R, T, S, P). (16)


We can therefore calculate the payoff of each player in the stationary state, that is, the average payoff of each player in each iteration, as:

P_X = \frac{v \cdot S_X}{v \cdot \mathbf{1}} = \frac{D(\mathbf{p}, \mathbf{q}, S_X)}{D(\mathbf{p}, \mathbf{q}, \mathbf{1})}, \qquad
P_Y = \frac{v \cdot S_Y}{v \cdot \mathbf{1}} = \frac{D(\mathbf{p}, \mathbf{q}, S_Y)}{D(\mathbf{p}, \mathbf{q}, \mathbf{1})} (17)

where 1 is simply the vector with all components equal to 1.

Note that the denominators are necessary as v had previously not been normalised, such that:

\sum_i v_i = 1 = v \cdot \mathbf{1},

which is required for a stationary probability vector.

In addition, we can observe from (17) that the expected payoffs of the players are linearly dependent on their payoff vectors (16). Therefore, we can deduce that the same is true for any linear combination of the players' scores; allowing us to write the following formula for arbitrary constants α, β, γ ∈ R:

\alpha P_X + \beta P_Y + \gamma = \frac{D(\mathbf{p}, \mathbf{q}, \alpha S_X + \beta S_Y + \gamma \mathbf{1})}{D(\mathbf{p}, \mathbf{q}, \mathbf{1})}. (18)

If we combine the observations made when examining (14) and (15) with the relation (18), I am able to present the following proposition:

Proposition 3.1. [26] If player 1 is able to select a strategy, for some values α, β, γ ∈ R, which satisfies

p̃ = (p1 − 1, p2 − 1, p3, p4) = αSX + βSY + γ1,

or if player 2 can select a strategy which satisfies

q̃ = (q1 − 1, q3, q2 − 1, q4) = αSX + βSY + γ1

then regardless of the strategy played by the opposing player, the following linear relation will be enforced between the expected payoffs of the players:

αPX + βPY + γ = 0. (19)

Proof. In order to prove this proposition, we use the fact that if a matrix has two identical or proportional columns, then the determinant of the matrix is equal to zero.

Suppose that player 1 chooses a strategy satisfying:

p̃ = (p1 − 1, p2 − 1, p3, p4) = αSX + βSY + γ1

As the second column of D(p, q, αSX + βSY + γ1) is completely determined by player 1, as a result of this choice of strategy, the second and fourth columns of this matrix are identical. Thus we have, regardless of the values in the other two columns:

D(p,q, αSX + βSY + γ1) = 0.

which, by (18), implies that:

\alpha P_X + \beta P_Y + \gamma = \frac{D(\mathbf{p}, \mathbf{q}, \alpha S_X + \beta S_Y + \gamma \mathbf{1})}{D(\mathbf{p}, \mathbf{q}, \mathbf{1})} = 0.

As this result follows from considering only the columns influenced by player 1, we see that this relation can be enforced by player 1, regardless of player 2's strategy. We can use a similar argument to show that player 2 is also able to enforce this relation by adopting a strategy which satisfies:

q̃ = (q1 − 1, q3, q2 − 1, q4) = αSX + βSY + γ1. □

Remark. It is important to notice that, in order to unilaterally set a linear relationship between the payoffs of the two players, it is only necessary to alter p̃ or q̃ to impose this relation.

A strategy satisfying the conditions described in Proposition 3.1 is known as a Zero-Determinant strategy.
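Proposition 3.1 translates directly into code. The following sketch (my own illustration, with the conventional payoffs (T, R, P, S) = (5, 3, 1, 0) and the helpers from 3.1) builds player 1's zero-determinant strategy from a feasible choice of α, β, γ and confirms that the relation (19) holds against arbitrary opponents:

```python
import numpy as np

T, R, P, S = 5, 3, 1, 0
SX = np.array([R, S, T, P])        # player 1's payoff vector, as in (16)
SY = np.array([R, T, S, P])        # player 2's payoff vector, as in (16)

def zd_strategy(alpha, beta, gamma):
    """Player 1's strategy p satisfying p_tilde = alpha*SX + beta*SY + gamma*1."""
    pt = alpha*SX + beta*SY + gamma
    p = np.array([pt[0] + 1, pt[1] + 1, pt[2], pt[3]])
    if not np.all((0 <= p) & (p <= 1)):
        raise ValueError("this choice of alpha, beta, gamma is infeasible")
    return tuple(p)

# One feasible choice (in fact an extortionate one; compare Section 3.5)
alpha, beta, gamma = 0.05, -0.15, 0.1
p = zd_strategy(alpha, beta, gamma)

rng = np.random.default_rng(3)
for _ in range(3):
    q = tuple(rng.random(4))                     # an arbitrary opponent
    v = stationary_vector(markov_matrix(p, q))
    PX, PY = v @ SX, v @ SY
    print(alpha*PX + beta*PY + gamma)            # ~0 for every opponent
```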

3.2.1. A note on the dependence of a Markov equilibrium. At this point, it is important to note that all of the work in this section is done on the assumption that the Markov equilibrium is attained. However, while it may seem plausible that, through erratic behaviour such as changing his strategy every turn, a player may be able to prevent the game from reaching a stationary state, it is proved in Appendix B of [36] that there is no way in which a player can "usefully keep the game out of Markov equilibrium". More specifically, Press and Dyson demonstrate that, if player 1 employs a zero-determinant strategy, the ability to fix player 2's score does not depend on player 2 using a fixed strategy. As player 1's zero-determinant strategy will be independent of any fixed opposing strategy, as long as the game is played for a sufficiently high number of rounds, the relationships formed by the zero-determinant strategy will remain enforceable. Thus, we need not concern ourselves with the strategy played by the non-zero-determinant player, as long as we can assume the number of iterations in the game is sufficiently high.

3.3. Attempting to set one's own payoff. Upon gaining the knowledge that it is possible to establish a linear relation between the payoffs of oneself and an opponent, it is natural to assume that a player would attempt to use this to set his own payoff. Here we examine what would happen in that case from the perspective of player 1. Recalling the linear relation from (19), as we are interested in the payoff of player 1, we take β = 0. Thus, we have:

αPX + γ = 0.

In a similar manner to the process followed in the proof of Proposition 3.1, we require solutions that satisfy:

p̃ = (p1 − 1, p2 − 1, p3, p4) = αSX + γ1

which is equivalent to the following equations:

p1 − 1 = αR + γ [1]

p2 − 1 = αS + γ [2]

p3 = αT + γ [3]

p4 = αP + γ. [4]

Subtracting [4] from [1] and rearranging for α, then substituting back into [1] and rearranging for γ, we obtain:

\alpha = \frac{p_1 - p_4 - 1}{R - P}, \qquad \gamma = \frac{R p_4 + P(1 - p_1)}{R - P}

which allows us to eliminate these parameters. Now, if we solve this system of equations for p2 and p3, in terms of p1 and p4, we have:

p_2 = \frac{(1 + p_4)(R - S) - p_1(P - S)}{R - P} \ge 1

p_3 = \frac{-(1 - p_1)(T - P) - p_4(T - R)}{R - P} \le 0. (20)

If we recall the conditions for the payoffs of a Prisoner's Dilemma (1), and the fact that probabilities p1, p4 lie between 0 and 1, it is clear from the expressions above that we have p2 ≥ 1 and p3 ≤ 0. Therefore, the only feasible strategy for player 1, expressed as in 2.4.3, is:

p = (1, 1, 0, 0).

This strategy simply repeats the player's opening move, then proceeds to play this move with probability 1 for the rest of the game; leaving player 1 effectively playing ALLC or ALLD after the first move. Therefore, it is obvious that this approach does not allow player 1 any control over his long term payoff, as his resulting score is then completely dependent on the strategy of his opponent.

Remark. It is important to note that we are unable to use Proposition 3.1 directly in this case. The reason for this is that, if this strategy is used by player 1, as all elements of the second column of our transformed matrix (8) are equal to zero, the determinant formula (13) will equal zero for any choice of vector f. As a result, the denominators of the equations for the long term payoffs of the players (17) are equal to zero under this strategy. Hence, the conditions of Proposition 3.1 are not satisfied under this strategy.

3.4. Unilaterally setting the score of the opposing player. While we have seen that there is no advantage in a player trying to set his own payoff, suppose now he attempts to fix the long term payoff of his opponent. If we set α = 0 in the relation (19), we have:

βPY + γ = 0.

Thus, for player 1 to manipulate his opponent’s score, we consider strategies that satisfy:

p̃ = (p1 − 1, p2 − 1, p3, p4) = βSY + γ1.

If we manipulate the corresponding equations similarly to 3.3, we obtain the following:

p_2 = \frac{p_1(T - P) - (1 + p_4)(T - R)}{R - P}

p_3 = \frac{(1 - p_1)(P - S) + p_4(R - S)}{R - P}. (21)

Thus, we have obtained the strategy:

p = \left( p_1,\; \frac{p_1(T - P) - (1 + p_4)(T - R)}{R - P},\; \frac{(1 - p_1)(P - S) + p_4(R - S)}{R - P},\; p_4 \right). (22)

Once again recalling (1), we see from (22) that, unlike the previous case, there exist feasible solutions when p1 is close to (but ≤) 1 and p4 is close to (but ≥) 0. Thus we will have p2 close to (but ≤) 1 and p3 close to (but ≥) 0. As we are considering probabilities, the strategy (22) is only feasible if p1, p2, p3, p4 ∈ [0, 1]. Therefore, if we solve the simultaneous inequalities 0 ≤ pi ≤ 1 for i = 1, 2, 3, 4, we find that p1 and p4 must be chosen such that:

p_1 \in \left[ \max\left\{ \frac{T - R}{T - P},\; 1 - \frac{R - P}{P - S} \right\},\; 1 \right]

p_4 \in \left[ 0,\; \min\left\{ \frac{T - P}{T - R}\, p_1 - 1,\; 1 - \frac{(2 - p_1)(P - S)}{R - S} \right\} \right].

We are also able to derive an expression for the long term payoff of player 2, in terms of p1 and p4. To do this, we substitute the values from (21) into (13), which can then be used to calculate player 2's payoff from (17). Thus, we obtain:

P_Y = \frac{(1 - p_1) P + p_4 R}{(1 - p_1) + p_4}. (23)

Remark. It is important to notice, when attempting to fix an opponent's score, that we must have either p1 ≠ 1 or p4 ≠ 0 in order to calculate PY. If not, that is, if p1 = 1 and p4 = 0, we obtain the same strategy as in 3.3.

As we are aware that p1 and p4 are bounded by 0 and 1, we are able to calculate the upper and lower bounds for (23), through substitution of the boundaries of p1 and p4. If we do this, we attain:

P ≤ PY ≤ R.

This result is remarkable. We have found that, regardless of the strategy employed by player 2, player 1 can choose his strategy in such a way that he can fix his opponent's long term payoff at some value between the 'Punishment' and 'Reward' payoffs. While the boundaries of player 2's payoff are not entirely unexpected, what is more astonishing is that player 1 can ensure the payoff of his opponent without having to react to player 2's strategy in any way. That is, one player can force a fixed score upon the other simply by playing a fixed strategy; which is independent of the strategy of his opponent.
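A short numerical illustration (my own sketch, with the conventional payoffs and the helpers from 3.1) shows the score-fixing strategy in action: player 1 fixes p1 and p4, derives p2 and p3 from (21), and player 2's long term payoff then matches (23) regardless of the strategy q he plays:

```python
import numpy as np

T, R, P, S = 5, 3, 1, 0
p1, p4 = 0.9, 0.05            # free parameters, within the bounds above

# Player 1's score-fixing strategy (22), with p2 and p3 from (21)
p2 = (p1*(T - P) - (1 + p4)*(T - R)) / (R - P)
p3 = ((1 - p1)*(P - S) + p4*(R - S)) / (R - P)
p = (p1, p2, p3, p4)

predicted = ((1 - p1)*P + p4*R) / ((1 - p1) + p4)    # equation (23)

SY = np.array([R, T, S, P])
rng = np.random.default_rng(1)
for _ in range(3):
    q = tuple(rng.random(4))                         # arbitrary opponent
    v = stationary_vector(markov_matrix(p, q))
    print(v @ SY, predicted)                         # the two values agree
```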

3.5. Extorting an opponent. Now we shall consider the case in which player 1 enforces a relation resulting in himself gaining a greater payoff than the mutual punishment value P, while simultaneously assigning a lesser payoff to his opponent. If we, once again, recall (19):

αPX + βPY + γ = 0. (24)

Notice that, if we set:

α = φ
β = −φχ
γ = −φ(P − χP) = −φP(1 − χ)

it is possible to rewrite (19) in the following form:

φ [(PX − P )− χ(PY − P )] = 0 (25)

where χ is known as the extortion factor and φ is a non-zero parameter to ensure the feasibility of the strategy.

Therefore, in order for player 1 to enforce this relation, by Proposition 3.1 he must select a strategy satisfying:

p̃ = φ[(SX − P1) − χ(SY − P1)] (26)

or equivalently:

p_1 - 1 = \phi[(R - P) - \chi(R - P)]
p_2 - 1 = \phi[(S - P) - \chi(T - P)]
p_3 = \phi[(T - P) - \chi(S - P)]
p_4 = \phi[(P - P) - \chi(P - P)].

Hence, we can rearrange this system to obtain expressions for pi as follows:

p_1 = 1 - \phi(\chi - 1)(R - P)
p_2 = 1 - \phi[(P - S) + \chi(T - P)]
p_3 = \phi(P - S)\left( \chi + \frac{T - P}{P - S} \right)
p_4 = 0. (27)

Remark. This system of equations (27) is similar in its purpose to equation (12) in paper [36]. However, the system presented in [36] will only satisfy the relation (26) if the payoffs of the Prisoner's Dilemma game satisfy P − S = 1; as when using the conventional values. This is a result of the additional (P − S) term in the expression for p1 and the missing (P − S) term in the expression for p3 in [36]. While this error carries over to (13) in [36], later results are unaffected due to the cancellation of these terms. I suspect this oversight was a result of considering only the conventional values; thus it is necessary to use the system in the form (27) in order to ensure correct results when simulating games that do not use the conventional payoffs.

For payoffs that satisfy the conditions of the Prisoner’s Dilemma game, provided φ is sufficiently small, we are able to find feasible strategies in this case. Through some simple manipulation of (27), as the probabilities take values between 0 and 1, we can calculate that, in order to produce feasible extortionate strategies, we must have:

0 < φ ≤ min{ 1/(χ(T − P) + (P − S)), 1/((T − P) + χ(P − S)) }. (28)

Although the case φ = 0 is formally allowed, this would result in the same situation seen in 3.3, yielding only the strategy (1, 1, 0, 0); it is therefore of little interest to us.
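The construction of a feasible extortion strategy from (27) and (28) is mechanical, as the following sketch shows; it is an illustration of mine (run here on the conventional payoffs with χ = 3), not code taken from the project:

import numpy as np

def phi_max(T, R, P, S, chi):
    # Upper bound on phi from (28).
    return min(1/(chi*(T - P) + (P - S)), 1/((T - P) + chi*(P - S)))

def extortion_strategy(T, R, P, S, chi, phi):
    # System (27); note p3 is written here as phi*((T-P) + chi*(P-S)),
    # which equals (P-S)*phi*(chi + (T-P)/(P-S)).
    p = np.array([1 - phi*(chi - 1)*(R - P),
                  1 - phi*((P - S) + chi*(T - P)),
                  phi*((T - P) + chi*(P - S)),
                  0.0])
    if not ((p >= 0) & (p <= 1)).all():
        raise ValueError("infeasible: reduce phi")
    return p

T, R, P, S = 5, 3, 1, 0
chi = 3
phi = phi_max(T, R, P, S, chi)        # 1/13 for these payoffs
print(phi, extortion_strategy(T, R, P, S, chi, phi))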

When a player utilises a strategy of the form (27), known as an extortion strategy, despite demanding a larger share of the total points from his opponent, the long term payoffs of the players are still subject to the linear relation (24). Assuming feasibility and recalling (26), we can rewrite this relation as

(PX − P ) = χ(PY − P ), (29)

from which we can clearly observe that the payoff of the extorting player is maximised only when his opponent receives his own maximal payoff. Considering the payoff of player 2, his maximal payoff will be attained when he uses the strategy

q = (1, 1, 1, 1)

that is, unconditionally cooperates. Thus, if player 2 decides to freely cooperate in order to maximise his own score, he is also unknowingly maximising the score of his opponent - assuming the player is unwitting of the extortion.

Using results from 3.2, we are able to derive expressions for the maximal long term payoffs of each player, that is, when player 2 cooperates unconditionally. Recalling (13),


and setting:

p̃ = φ [(SX − P1)− χ(SY − P1)]

q = (1, 1, 1, 1)

such that:

p = ( 1 − φ(χ − 1)(R − P), 1 − φ [(P − S) + χ(T − P)], (P − S)φ (χ + (T − P)/(P − S)), 0 )

we can calculate:

D(p,q,SX) = −Tφ(χ− 1)(R− P )−Rφ [(T − P ) + χ(P − S)]

D(p,q,SY ) = −Sφ(χ− 1)(R− P )−Rφ [(T − P ) + χ(P − S)]

D(p,q,1) = −φ(χ− 1)(R− P )− φ [(T − P ) + χ(P − S)]

and thus, from (10), we find that:

PX = D(p,q,SX)/D(p,q,1) = ( P(T − R) + χ [R(P − S) + T(R − P)] ) / ( (T − R) + χ(R − S) )

PY = D(p,q,SY)/D(p,q,1) = ( R(T − S) + (χ − 1)P(R − S) ) / ( (T − R) + χ(R − S) ). (30)
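A quick numerical check of (30) - again my own illustration, using the conventional payoffs - confirms that these maximal payoffs satisfy the enforced relation (29):

T, R, P, S = 5, 3, 1, 0
chi = 3
denom = (T - R) + chi*(R - S)
PX = (P*(T - R) + chi*(R*(P - S) + T*(R - P))) / denom   # 41/11
PY = (R*(T - S) + (chi - 1)*P*(R - S)) / denom           # 21/11
print(PX, PY)
print(abs((PX - P) - chi*(PY - P)) < 1e-12)              # relation (29) holds: True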

3.5.1. A special case. As we have seen in 3.5, when χ > 1, player 1 demands an extortionate share from his opponent. This motivates us to question the outcome if player 1 chooses χ = 1 which, by (24), is the case in which player 1 has selected his strategy to guarantee equal payoffs to both players. Setting χ = 1 in (27) gives p = (1, 1 − φ(T − S), φ(T − S), 0); so, upon calculating the probabilities for player 1 using φ = 1/(T − S) (which is 1/5 for the conventional values), we obtain the strategy:

p = (1, 0, 1, 0)

which we recognise from 2.4.3 as Tit For Tat. Hence, we can conclude that TFT is the most ‘fair’ of the extortion strategies.

3.6. A Theory of Mind. After this analysis, a logical question one may ask is ‘how does this work against a non-cooperative opponent?’ The answer is simple - it doesn’t. In fact, the only way to prevent score fixing or extortion in this scenario is to unconditionally defect. As this will result in suboptimal payoffs for both players, we can deduce that the only reason a rational player would act in this way against an opponent who is not unconditionally non-cooperative is as an attempt to force that opponent to change his strategy; implying the player has some knowledge of his opponent’s motives. This scenario, in which a player deliberately harms his own score in order to minimise the payoffs of both parties, is described in [36] as the player having a “theory of mind”.

3.6.1. Trying to extort an extortioner. Another natural question that may be asked is: instead of one player harming his own score in order to stop an extortionate opponent, what would happen if both players were to employ zero-determinant extortion strategies? While either player is free to employ any strategy they may choose, it seems inherently obvious that it is impossible for both players to simultaneously extort their opponent out of a larger share. To explore this, we shall consider the following scenario. Suppose that players 1 and 2 select extortionate strategies, with χ1, χ2 ≥ 1, that satisfy:

p̃ = φ1 [(SX − P1)− χ1(SY − P1)]

q̃ = φ2 [(SY − P1)− χ2(SX − P1)] .


Using (27), we can state these strategies explicitly as:

p = ( 1 − φ1(χ1 − 1)(R − P), 1 − φ1 [(P − S) + χ1(T − P)], (P − S)φ1 (χ1 + (T − P)/(P − S)), 0 )

q = ( 1 − φ2(χ2 − 1)(R − P), 1 − φ2 [(P − S) + χ2(T − P)], (P − S)φ2 (χ2 + (T − P)/(P − S)), 0 ).

Thus, with the players employing these strategies, the payoffs for both players must satisfy:

(PX − P) = χ1(PY − P)

(PY − P) = χ2(PX − P)

Substituting one equation into the other gives (PX − P) = χ1χ2(PX − P). Hence, provided χ1χ2 ≠ 1 - that is, unless χ1 = χ2 = 1 - this system has one solution:

PX = PY = P

Thus, when both players attempt to extort their opponent, the result is the same as if both players played the strategy ALLD; offering no advantage to either player. However, if χ1 = χ2 = 1, this system simply reduces to:

PX = PY

which is consistent with our expectations of two players playing TFT.

Hereafter in this project, I shall assume that only one player is witting of ZD strategies, and there is no situation in which the other player develops a theory of mind or any other form of awareness of extortion. In addition, we also assume that the extorted player will attempt to optimise his strategy, in response to his opponent’s extortion strategy, rather than using a fixed strategy throughout.

4. Live and Let Live

Here, I introduce an example of how the Prisoner’s Dilemma game can be applied to a real-life scenario - specifically to trench warfare in the First World War. This will provide the context for many of the examples we shall consider in future sections of this project.

Many of the historical interpretations used to form the basis of this section are taken from letters, memoirs and diary entries of the soldiers who fought in the First World War. As such, many of these sources have not been independently published, making accurate referencing difficult. However, British sociologist Tony Ashworth published a complete study of this period, in which he gathered a wide range of primary sources from every one of the fifty-seven British divisions and, to a lesser extent, French and German troops. Thus, while I have attempted to state the origin of any historical material where possible, a complete collection of these records can be found in [3].

4.1. Background. Several months into the First World War, a situation emerged in the trenches on the Western Front which has not been observed in any conflict before or since - cooperation between opposing forces. This phenomenon has been widely documented and has since been referred to as the “Live and Let Live” system [3].

[I] was astonished to observe German soldiers walking about within rifle range behind their own line. Our men appeared to take no notice. These people evidently did not know there was a war on. Both sides apparently believed in the policy of “Live and Let Live.”

A British staff officer on a tour of the trenches (Dugdale 1932) [3]


Despite the best efforts of the commanding officers, outside of the conflicts which required the men to leave the trenches, many soldiers actively rebelled against the required level of aggressiveness demanded by their superiors; finding various ways to discourage unrelenting combat.

Several soldiers were court-martialled and sometimes even whole battalions punished as a result of making direct truces with the enemy, such as raising a flag over areas regarded as out of bounds by the snipers on both sides and declaring that from 8 to 9 A.M. was used for private business.

(Morgan 1916) [3]

Due to the immobility of trench warfare, the same units faced each other for extended periods of time, and as a result, the soldiers appreciated the similarity in their opponents’ situation. Many soldiers’ accounts, such as Hay 1916, remark that there were many situations in which it would have been “easily possible” using “heavy artillery” to cause extreme casualties to the opposing side, yet both sides actively avoided doing so out of fear of retaliation of the same kind. In addition, due to socialisation between units, any new troops joining the battalion were already familiar with the nature of mutual restraint and the benefits associated with it (Gillon n.d.) [3]. This ensured that the system went on uninterrupted even after receiving reinforcements.

However, despite this system of mutual cooperation, defections did occur. The crews who operated the artillery, in particular, were less vulnerable to potential enemy retaliation, and thus had a much lower stake in the system of mutual restraint. As a result, artillery teams were regularly encouraged by members of the infantry not to antagonise the enemy, in order to protect those on the front lines (Sulzbach 1973) [3]. Thus, artillery played a key role in the Live and Let Live system, maintaining passiveness when unprovoked, but providing instant retaliation in the case of enemy defection.

4.2. Application of the Prisoner’s Dilemma game. While no one is entirely sure as to the true cause of this, political scientist Robert Axelrod proposed that, due to the stationary nature of WW1 trench warfare, the two sides could be thought of as players in an iterated Prisoner’s Dilemma [4]. He justified this in the following way.

• We regard the two players to be small battalions on either side. Typically consisting of around 1000 men, such a battalion thus occupies a large enough sector on the front lines to be held directly accountable for any aggressive action originating from that territory, yet is still small enough to assert a firm control over the behaviour of individual soldiers.
• At any time, the players have to choose between shooting to kill, or shooting with a view to avoid causing excess damage.
• Both sides believe it is important to weaken the enemy, as this will promote survival in the case that a major battle is ordered in their sector, but both sides ultimately care about their own survival.

From this, we can establish the conditions required for a Prisoner’s Dilemma. In the short term, it is better to weaken the enemy immediately, regardless of whether the enemy is shooting back or not. This establishes that a unilateral defection by one side is even better than mutual cooperation (T > R), yet a mutual defection is better than a unilateral restraint (P > S). However, in the case that both sides defect, the mutual punishment implies that both battalions would suffer for little or no relative gain, thus this is not as favourable as mutual restraint (R > P). Moreover, both sides would prefer mutual restraint to the random alternation of serious hostilities, making 2R > T + S.


Hence, we have satisfied the required conditions (1) and (2) between two small battalions in a given immobile sector.

Now that we have identified this situation as an iterated Prisoner’s Dilemma, in order to simulate it we require a method of calculating the payoffs of the players. To do this, we shall first explore Lanchester combat modelling.

5. Lanchester’s models of warfare

In 1916, Frederick W. Lanchester presented a system of differential equations that can be used to model two forms of warfare: ancient and modern [24]. Traditionally, ancient warfare is characterised by the Lanchester linear law and modern warfare by the Lanchester square law; the reason for this was summarised by Taylor [43]:

In “ancient times”, warfare was essentially a sequence of one-on-one duels so that the casualty-exchange ratio during any period of battle did not depend on the combatants’ force levels. But under “modern conditions”, however, the firepower of weapons widely separated in firing location can be concentrated on surviving targets so that each side’s casualty rate is proportional to the number of enemy firers and the casualty-exchange ratio consequently depends inversely on the force ratio.

Before examining the application of these laws to our current situation, I provide a brief overview of the models. In this section I shall define the following:

• x(t): number of men alive in army x at time t; initial size of army x: x(0) = X
• y(t): number of men alive in army y at time t; initial size of army y: y(0) = Y
• a: the combat effectiveness of army y
• b: the combat effectiveness of army x

Remark. The non-alphabetical correspondence between a, b and x, y is a consequence of re-imagining the meaning of a and b [25]. Traditionally, these parameters were presented as attrition-rate coefficients [43], rather than as the combat effectiveness of the opposing force. Instead of modifying the equations to suit this re-evaluation, I have left them in the form most commonly found in the literature.

5.1. Lanchester’s Square Law. The Lanchester square law depicts the rate of casualties suffered by each side as depending only on the size and military prowess of the opposing force. The primary factor determining the amount of attrition is the size of the force committed to battle, therefore it is almost always advantageous to concentrate your forces.

We can express the rate of attrition of each army as a function of the number of enemy units and their effectiveness:

dx(t)/dt = −ay(t), dy(t)/dt = −bx(t) (31)

and these equations can be solved explicitly [43] to obtain the following results:

x(t) = X cosh(√(ab) t) − Y √(a/b) sinh(√(ab) t)

y(t) = Y cosh(√(ab) t) − X √(b/a) sinh(√(ab) t).

In order to understand the reason for the name ‘Square Law’, we examine the following manipulation of (31), as in [13], and consider the victory condition to be when one side


has been annihilated. Dividing the two equations we obtain:

(dx(t)/dt) / (dy(t)/dt) = dx(t)/dy(t) = ay(t)/bx(t)

from which, after rearranging and integrating from time t = 0 to t, we can obtain the general solution:

bx(t)² − ay(t)² = bX² − aY².

Therefore, from the victory condition, for army x to win, we require that at time t = T we have y(T) = 0, x(T) > 0. Rewriting the equation above for t = T and solving for x(T) we see:

x(T)² = X² − (a/b)Y² > 0.

Solving this, we see that for x to win, the relative effectiveness of the troops must exceed the square of the force ratio:

b/a > (Y/X)²,

with a stalemate if

b/a = (Y/X)².

From this we conclude that any modification to the size of the force will affect the armies’ potential quadratically, hence following a square law. We also notice that if the size of an army is doubled, the attrition rate experienced by their opponent would be increased by a factor of four, while if its effectiveness were doubled, the rate of attrition would only double. The square law therefore indicates that the outcome of combat is more sensitive to changes in numbers than to changes in weapons effectiveness.

5.2. Lanchester’s Linear Law. Although the linear law has traditionally been used to model ancient warfare, a more modern interpretation is that it represents unaimed, area fire. This is when the attacker does not aim at each target individually, instead firing indirectly into the enemy occupied region, as in the case of artillery fire.

In this situation, attrition depends not only on the weapon proficiency of the attackers and the number of attackers firing into the region, but also on the concentration of forces in the targeted area. From this we can see that there is no advantage in concentrating your forces when using the linear law.

We can express the rate of attrition of each army as:

dx(t)/dt = −[ay(t)]x(t), dy(t)/dt = −[bx(t)]y(t) (32)

and these equations can be solved explicitly [43] to obtain:

x(t) = X (bX − aY) / (bX − aY exp(−(bX − aY)t)) for bX ≠ aY; x(t) = X/(1 + bXt) for bX = aY

y(t) = Y exp(−(bX − aY)t) (bX − aY) / (bX − aY exp(−(bX − aY)t)) for bX ≠ aY; y(t) = Y/(1 + aY t) for bX = aY.

If we manipulate (32) in the same way as (31), we find that the general solution takes the following form:


bx(t) − ay(t) = bX − aY.

If we once again consider the case in which one side is annihilated, such that at time t = T we have y(T) = 0, x(T) > 0, we obtain:

x(T) = X − (a/b)Y > 0

with army x achieving victory if:

b/a > Y/X

and stalemate if:

b/a = Y/X.

From this we can observe that an army’s attempt at victory is affected linearly by scaling its troop size. In this case, the impact of the force size on combat outcome is significantly less than when dealing with the square law.

However, despite this interpretation as an unaimed fire model, it is unfair to suggest that the uses of either the square or linear laws are entirely fixed in their applications. As an example, Bracken demonstrated [10] that the linear law provides a more accurate model of the World War 2 Ardennes campaign, which is considered an example of modern warfare.

5.3. Examples of Lanchester models suitable for trench warfare. In order to illustrate how different models can be used to simulate various situations, I shall present several ideas which could be explored in the context of trench warfare. As this is not directly relevant to the material in this project, I include them only to provide an illustration of how such a simple system can be adapted to suit more complex scenarios.

5.3.1. Indiscipline on the front lines. One approach would be to consider the behaviour and fighting spirit of the units. Ashworth [3] describes many cases of how soldiers would actively avoid conflict, risking punishment and court martial, by firing with no intention to kill or sometimes avoiding conflict at all. To account for this, I consider a unit’s willingness to fight as a subcomponent of its effectiveness.

In a similar manner to Darilek [13], I introduce a variant of the Lanchester square law where we can introduce a probability P(d) as a subcomponent of the units’ effectiveness, such that a = P(k|d)P(d). In this model we define:

• k as the unit’s effectiveness when engaging the enemy
• P(d) as the probability that the unit decides to actively engage the enemy

such that:

dx(t)/dt = −[P(k1|d1)P(d1)]y(t), dy(t)/dt = −[P(k2|d2)P(d2)]x(t) (33)

The value of P(d) can be made to depend on many factors. In a stochastic time simulation, it could be that P(d) changes with time, or it could be dependent on the result of the last engagement.

Remark. It can be noted that, in order to ensure both sides of the equations (33) are dimensionally correct, the parameters a and b should be rates and not probabilities. However, this is often overlooked in practice - as by Darilek in [13].


5.3.2. Effectiveness of retaliation model. In this case, we consider how a unit would respond to an unprovoked attack. If we envision this in an environment similar to the Prisoner’s Dilemma, the moves available to each side are as follows:

• Cooperate (C) - unit does not actively engage the enemy, but will retaliate if attacked
• Defect (D) - unit decides to attack

In order to model this, I introduce a variant of the square law with parameters {s : 0 ≤ s ≤ 1} and {k : 0 < k ≤ 1} as subcomponents of the units’ effectiveness, such that a = k1s1 and b = k2s2. In this model we define:

dx(t)/dt = −[k1s1]y(t), dy(t)/dt = −[k2s2]x(t)

where

• k is the unit’s effectiveness when engaging the enemy
• s is the proficiency of the unit’s response.

An example of how to select parameter s is as follows:

• If a player has defected, s = 1
• If a player cooperates, s is proportional to the number of men remaining at time t = 0 (i.e. a large number of men ⇒ a high s value)

This is because an attacking unit does not have to try and respond to an ongoing attack, and a greater number of men will be able to mount a better response than fewer men.

Alternatively, we could construct this as in 5.3.1 where:

• k as the unit’s effectiveness when engaging the enemy
• P(d) as the probability that the unit is able to pick out a target while under fire.

The purpose of including these models is to demonstrate that, while Lanchester modelling is a very simple system, it can be made increasingly complex by expressing the combat efficiency parameters as a product of subcomponents. The criteria for setting these subcomponents can be adapted to suit the situation.

5.4. Criticisms and Limitations of Lanchester’s Models. Lanchester’s models of warfare have often been criticised (a comprehensive treatment is given in [43]), most often for reasons relating to the number of assumptions required and unrealistic omissions, such as the absence of force movement, reinforcement or withdrawal. Thus, despite the convenience of the simple Lanchester system, there are serious doubts as to whether Lanchester modelling provides an accurate representation of reality. However, in this project we are not attempting to fit our results to real historical data, nor are we trying to recreate a specific conflict; instead we are more interested in trench warfare as a concept. Thus, I feel that Lanchester models are an excellent choice in this case, primarily due to the simplicity of the system when efficiently analysing a combat scenario. In addition, as opposed to arbitrarily selecting values, there is a certain novelty associated with obtaining our payoff matrices through a valid form of combat modelling; an approach which has previously, though infrequently, been associated with game theory [44].


5.5. Notation. In order to reference Lanchester’s laws more clearly in future sections, I introduce the following notation:

LS(a, b) := ( dx(t)/dt = −ay(t), dy(t)/dt = −bx(t) )

LL(a, b) := ( dx(t)/dt = −[ay(t)]x(t), dy(t)/dt = −[bx(t)]y(t) )
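Both families of solutions are easy to evaluate directly. The sketch below - my own illustration, not code from the project - implements the closed forms from 5.1 and 5.2 under the LS/LL notation just introduced; the printed survivor counts reappear in the artillery model of Section 6:

import numpy as np

def LS(a, b, X, Y, t):
    # Square law survivors (aimed fire), closed form from 5.1.
    r = np.sqrt(a*b)
    return (X*np.cosh(r*t) - Y*np.sqrt(a/b)*np.sinh(r*t),
            Y*np.cosh(r*t) - X*np.sqrt(b/a)*np.sinh(r*t))

def LL(a, b, X, Y, t):
    # Linear law survivors (unaimed, area fire), closed form from 5.2.
    d = b*X - a*Y
    if d == 0:
        return X/(1 + b*X*t), Y/(1 + a*Y*t)
    e = np.exp(-d*t)
    return X*d/(b*X - a*Y*e), Y*e*d/(b*X - a*Y*e)

print(LS(0.5, 0.5, 1000, 1000, 1))    # evenly matched aimed fire: ~(607, 607)
print(LL(0.5, 0.85, 1000, 1000, 1))   # x fires with a surprise advantage: ~(412, 0)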

6. Application of Lanchester Modelling to Live and Let Live

“The goal was to encourage the artillery to respect the infantry’s desire to let sleeping dogs lie. A new forward observer for the artillery was often greeted by the infantry with the request, ‘I hope you are not going to start trouble.’ The best answer was, ‘Not unless you want’.” [3]

In [4], Axelrod’s justification of why the “Live and Let Live” system satisfies the conditions of an iterated Prisoner’s Dilemma, which was examined in 4.2, is accomplished almost entirely through observation and reasoning. In this section, I investigate how, under certain conditions, it is possible to construct a model of trench warfare such that the outcomes satisfy conditions (1) and (2); thus numerically satisfying the conditions of a Prisoner’s Dilemma game. To do this, I shall examine a very simple example of how Lanchester models can be used to calculate the entries of a 2×2 payoff matrix.

6.1. Artillery model. In this case, I draw inspiration from some of the accounts provided by Tony Ashworth [3] which suggest that, as a measure of cooperation, both sides would refrain from using artillery during conflict, as a conscious measure of reducing their opponent’s casualties. Due to the sustained nature of trench warfare on the Western Front, many troops became disillusioned with the fighting, despite the pressures from high command. Therefore, we consider the payoff from the view of the individual soldiers that make up the battalion, whose aim was to be seen to be attacking the enemy while losing as few men as possible.

6.1.1. Assumptions.

• The combat efficiencies of the troops on both sides are evenly matched: a, b = 0.5.
• The size of each battalion is approximately equal. The average WW1 battalion size was 1000 men, so X, Y ∈ [900, 1100].
• The unprovoked use of artillery on an unsuspecting opponent increases the aggressor’s combat efficiency by 70%.
• Time is measured in discrete steps.
• Combat lasts for exactly 1 day, starting at t = 0 and ending at t = 1.
• In the eyes of the soldiers, survival is twice as important as the number of enemy killed, such that the payoff is calculated: P = 2 · (number of survivors) + (number of enemy killed).

6.1.2. Payoffs. If we consider the payoff at time t = T, we can formalise this calculation as follows:

x payoff: Px(T) = 2 · x(T) + (Y − y(T))
y payoff: Py(T) = 2 · y(T) + (X − x(T))


6.1.3. The Game. In this game, the moves available to each side are:

• Cooperate (C) - unit decides not to involve artillery in the fighting
• Defect (D) - unit uses artillery to ensure high casualties on the opposing side

If we recall the results from Section 5, we regard artillery fire as unaimed; therefore the attrition rate of an army subject to artillery fire is as in (32). We regard any army not experiencing artillery fire as only being subject to the fire of aimed small arms, as in (31). In the case where one side deploys artillery and the other does not, we employ the linear law once again, but with a greater combat efficiency for the defecting army. In this case we assume that, even if a side has initially chosen not to use artillery fire, upon receiving artillery fire from their opponent this decision is reversed and they respond in kind. The greater combat efficiency represents the advantage of the element of surprise to the defecting army. We can summarise this for army x in Fig. 2.

                              army y
                 Cooperate                          Defect

army x Cooperate LS(0.5, 0.5)                       LL(0.85, 0.5)
                 Neither side deploys artillery     y deploys artillery on unsuspecting x

       Defect    LL(0.5, 0.85)                      LL(0.5, 0.5)
                 x deploys artillery on             Both sides deploy artillery
                 unsuspecting y

Figure 2. Payoff matrix for Artillery model game

Under these conditions it seems reasonable to assume that if an army is experiencing artillery fire, it will receive a lower payoff, due to the increased rate of attrition. Therefore, from the view of x, we expect the following relations to hold:

Px(LL(0.5, 0.85)) > Px(LS(0.5, 0.5)) > Px(LL(0.5, 0.5)) > Px(LL(0.85, 0.5))

Py(LL(0.85, 0.5)) > Py(LS(0.5, 0.5)) > Py(LL(0.5, 0.5)) > Py(LL(0.5, 0.85)) (34)

where we interpret Px(LS(a, b)) to be the payoff of army x at t = 1, obtained by solving the corresponding system of differential equations for x and y.

Comparing this with (1), we notice that provided (34) does hold, this game satisfies the conditions of a ‘one-shot’ Prisoner’s Dilemma. However, in order to iterate this game, we also require:

2Px(LS(0.5, 0.5)) > Px(LL(0.5, 0.85)) + Px(LL(0.85, 0.5))

2Py(LS(0.5, 0.5)) > Py(LL(0.5, 0.85)) + Py(LL(0.85, 0.5)) (35)

as stated in (2). In order to verify that (34) and (35) are satisfied, I shall obtain the payoffs for the game using different parameters and examine the results. The payoffs are calculated in the following way.

Example 6.1. In this case we consider the situation in which both battalions are of equal size. We are interested in finding the payoff matrix at time t = 1.

• X = 1000
• Y = 1000
• a = 0.5
• b = 0.5


Case 1 - x Defects, y Cooperates. The differential equations are given by:

LL(0.5, 0.85) := ( dx(t)/dt = −[0.5y(t)]x(t), dy(t)/dt = −[0.85x(t)]y(t) ).

Using the results given in 5.2 we can calculate the number of survivors at time t as:

x(t) = 1000 (0.85·1000 − 0.5·1000) / (0.85·1000 − 0.5·1000 exp(−(0.85·1000 − 0.5·1000)t))

y(t) = 1000 exp(−(850 − 500)t) · (0.85·1000 − 0.5·1000) / (0.85·1000 − 0.5·1000 exp(−(0.85·1000 − 0.5·1000)t)).

Thus at t = 1:

x(1) = 411.76 ≈ 412

y(1) = 4 × 10⁻¹⁵⁰ ≈ 0

From this, we can obtain the payoffs for x and y at t = 1:

Px(1) = 2x(1) + (Y − y(1)) = 2 · 412 + (1000− 0) = 1824

Py(1) = 2y(1) + (X − x(1)) = 2 · 0 + (1000− 412) = 588

Case 2 - x Cooperates, y Cooperates. Similarly, we have:

LS(0.5, 0.5) := ( dx(t)/dt = −0.5y(t), dy(t)/dt = −0.5x(t) )

and from 5.1:

x(1) = 1000 cosh(√0.25) − 1000 sinh(√0.25) = 607

y(1) = 1000 cosh(√0.25) − 1000 sinh(√0.25) = 607.

Thus,

Px(1) = 2 · 607 + (1000− 607) = 1607

Py(1) = 2 · 607 + (1000− 607) = 1607

Case 3 - x Defects, y Defects.

LL(0.5, 0.5) := ( dx(t)/dt = −[0.5y(t)]x(t), dy(t)/dt = −[0.5x(t)]y(t) )

As, in this case, we have bX = 0.5 · 1000 = 500 = aY, from 5.2:

x(1) = 1000/(1 + 0.5·1000) ≈ 2

y(1) = 1000/(1 + 0.5·1000) ≈ 2.

Px(1) = 2 · 2 + (1000− 2) = 1002

Py(1) = 2 · 2 + (1000− 2) = 1002

Case 4 - x Cooperates, y Defects.

LL(0.85, 0.5) := ( dx(t)/dt = −[0.85y(t)]x(t), dy(t)/dt = −[0.5x(t)]y(t) )

from 5.2,


x(1) = 1000 (0.5·1000 − 0.85·1000) / (0.5·1000 − 0.85·1000 exp(−(0.5·1000 − 0.85·1000))) ≈ 0

y(1) = 1000 exp(−(500 − 850)) · (0.5·1000 − 0.85·1000) / (0.5·1000 − 0.85·1000 exp(−(0.5·1000 − 0.85·1000))) ≈ 412.

Px(1) = 2x(1) + (Y − y(1)) = 2 · 0 + (1000− 412) = 588

Py(1) = 2y(1) + (X − x(1)) = 2 · 412 + (1000− 0) = 1824

Using these results, we can formulate the payoff matrix Fig. 3.

                              army y
                 Cooperate                    Defect

army x Cooperate (1607, 1607)                 (588, 1824)
                 (xReward, yReward)           (xSuckers, yTemptation)

       Defect    (1824, 588)                  (1002, 1002)
                 (xTemptation, ySuckers)      (xPunishment, yPunishment)

Figure 3. Payoff matrix for Artillery model game

From this we can easily verify that (34) and (35) are satisfied and hence, for these parameters, confirm that this example satisfies the conditions of an iterated Prisoner’s Dilemma.
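The four cases above can be automated. The following self-contained sketch - my own reconstruction of the calculation under the assumptions of 6.1.1, not the project’s original code - reproduces the entries of Fig. 3 and checks conditions (34) and (35):

import numpy as np

def LS(a, b, X=1000, Y=1000, t=1.0):
    r = np.sqrt(a*b)
    return (X*np.cosh(r*t) - Y*np.sqrt(a/b)*np.sinh(r*t),
            Y*np.cosh(r*t) - X*np.sqrt(b/a)*np.sinh(r*t))

def LL(a, b, X=1000, Y=1000, t=1.0):
    d = b*X - a*Y
    if d == 0:
        return X/(1 + b*X*t), Y/(1 + a*Y*t)
    e = np.exp(-d*t)
    return X*d/(b*X - a*Y*e), Y*e*d/(b*X - a*Y*e)

def payoffs(survivors, X=1000, Y=1000):
    # Payoff rule from 6.1.2: survival counts double.
    x1, y1 = (max(s, 0.0) for s in survivors)
    return 2*x1 + (Y - y1), 2*y1 + (X - x1)

R_x, R_y = payoffs(LS(0.5, 0.5))       # mutual cooperation
T_x, S_y = payoffs(LL(0.5, 0.85))      # x defects on unsuspecting y
S_x, T_y = payoffs(LL(0.85, 0.5))      # y defects on unsuspecting x
P_x, P_y = payoffs(LL(0.5, 0.5))       # mutual defection
print([round(v) for v in (T_x, R_x, P_x, S_x)])      # [1824, 1607, 1002, 588]
print(T_x > R_x > P_x > S_x and 2*R_x > T_x + S_x)   # (34) and (35): True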

6.2. Calculation Results. In the previous example, we saw that when the initial battalion sizes are equal, the system satisfies the conditions needed for an iterated Prisoner’s Dilemma game. Although I do not provide an analytic proof that this holds for all values stated in 6.1.1, due to the nature of the calculation process, if the conditions (34) and (35) are met at all combinations of the extreme values, we can confidently assert that these relations hold for the whole region. Notice especially the symmetric nature of the calculation process: when the initial conditions are equal, this results in equal payoffs. Likewise, if we switch the initial battalion sizes of x and y, this is equivalent to simply switching the payoffs. This symmetry greatly decreases the number of simulations needed when checking combinations of the extreme values. The results of these calculations are provided in Fig. 4, in which we can clearly see that conditions (34) and (35) are satisfied in all cases.

Hereafter in this project, for simplicity, we shall only consider the cases in which the initial sizes of the two armies are equal.

7. Extortion on the front lines

In this section, we shall consider an extended example in the context of ‘Live and Let Live’ - to examine in greater detail how extortion strategies may be applied in practice.


X            1000  1100   900  1000  1100   900
Y            1000   900  1100   950  1100   900
xTemptation  1824  2042  1606  1832  2006  1642
xReward      1607  2000  1213  1666  1767  1446
xPunishment  1002  1300   900  1050  1102   902
xSuckers      588   647   529   588   647   529
yTemptation  1824  1606  2042  1724  2006  1642
yReward      1607  1213  2000  1467  1767  1446
yPunishment  1002   900  1300   950  1102   902
ySuckers      588   529   647   559   647   529

Figure 4. Payoffs attained from the artillery model using various different parameters

7.1. The Premise. Suppose that we consider two battalions x and y of equal size, from opposing armies, that are currently engaging in trench warfare. As described in 4, we can assume that under certain conditions, each battalion is willing to cooperate with the other to some degree, in an attempt to reduce the amount of artillery fire the battalion will receive. At the end of each day, the men killed in the fighting are replaced by reinforcements; thus, at the start of the next day, the battalion size will be the same as in the initial case. We analyse this scenario for an indefinite amount of time.

In order to model this, we can visualise the battalions as two players in a Prisoner’s Dilemma, using the artillery model game described in 6.1, where each player’s next move is decided by some authority within the battalion. For simplicity, we consider the case in which the initial sizes of the battalions are 1000 men; thus we use the payoff matrix Fig. 5, which was calculated in 6.1.3, for this game.

                              army y
                 Cooperate                    Defect

army x Cooperate (1607, 1607)                 (588, 1824)
                 (xReward, yReward)           (xSuckers, yTemptation)

       Defect    (1824, 588)                  (1002, 1002)
                 (xTemptation, ySuckers)      (xPunishment, yPunishment)

Figure 5. Previously calculated payoff matrix for the Artillery model game

In addition, again for simplicity, we assume that the score considered by each player is the long term expected payoff per round, and that each side may only utilise memory-1 strategies. As shown in 2.4.2, we can assume the latter without loss of generality to some extent. We shall express the strategies used by the players as in 2.4.3.

7.1.1. The Extortion. Suppose now that battalion x employed a team of mathematicians, who analysed this situation and suggested that it may be possible to unilaterally encourage their opponent y towards an increased level of cooperation, done in such a way as to result in x consistently receiving a greater payoff than y - thus, in the context of attrition warfare, gradually tipping the balance in their favour. The method proposed to accomplish this is for x to determine his next move by utilising a fixed zero-determinant extortion strategy, in an attempt to manipulate y - who would be unwitting of any such extortion and only monitoring his own payoff.


It is this scenario that we shall consider in this section. Here, we shall regard player 1 as the extortioner, battalion x, and player 2 as the adapting player, battalion y.

7.2. Extorting an adapting player. It is a reasonable assumption that, under normal circumstances, the commanding authority of a battalion would adjust their tactics in an attempt to increase the score gained by the battalion.

We shall consider a player to be adaptive if he adjusts his strategy according to some optimisation scheme, which may be known only to him, in an attempt to maximise his long term payoff. The player does not otherwise explicitly consider his own strategy, or the score of his opponent.

Remark. In [36], such a player is known as “evolutionary”. However, this encourages a comparison with evolutionary game theory, in which an evolutionary player already has a specific and different meaning. Therefore, I follow [12] and shall refer to a player of this type as an adapting player in this work.

Possible optimisation methods are briefly examined in Appendix A. For our purposes, we shall use the gradient descent method, as described in A.0.2, to model how an adapting player may alter his strategy. Thus, we must first express the long term payoff as some function of q.

We can accomplish this by recalling, from (10), that we can express the long term payoff as:

PY = (v · SY)/(v · 1) = D(p,q,SY)/D(p,q,1).

Therefore, by using (13), we can determine a function, in terms of the probabilities contained in the strategies of both players, to calculate the expected payoff per round under a given strategy. However, as the probabilities used by the extorting player, and the adapting player’s payoff vector, remain fixed throughout the game, we can essentially view this as a function depending only on the probabilities used by the adaptive player. Let us express this function as:

fY(q) = D(p,q,SY)/D(p,q,1). (36)

Thus, we can apply the gradient descent method in the following way:

q(i+1) = q(i) + t · ∇fY (q(i)) (37)

where we denote the strategy of the adapting player after i steps as q(i). In the following examples, the step size t is fixed at t = 0.0001.

Remark. Although it could be argued that, in reality, an adaptive player would not have an in-depth knowledge of the strategy being played by an opponent, it is important to recall that we are only using this method as a model, in order to produce an approximation of how the player may adjust his strategy. In this model, all steps lead to an improvement in the long term scores. This is because, in reality, we can assume that if a step was taken that caused the player’s payoff to decrease, that step would simply be undone and a different step would be taken. Hence, we can ignore the case in which there exist intermediate steps that do not increase the long term payoffs. Thus, it is entirely plausible that a player may make very similar adjustments to those we shall consider, done without explicitly knowing the probabilities being used by their opponent.
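To make the scheme concrete, the sketch below - my own implementation of the model just described, not the project’s original code - computes fY(q) from the stationary vector of the Markov matrix (equivalent to the determinant form (36)) and performs the ascent (37) with a forward-difference estimate of the gradient. The strategies and payoffs match the example that follows in 7.3:

import numpy as np

S_Y = np.array([1607.0, 1824.0, 588.0, 1002.0])   # player 2's payoff vector (Fig. 5)
p = np.array([167/288, 0.0, 43/60, 0.0])          # player 1's extortion strategy (38)

def M(q):
    # Markov matrix over states CC, CD, DC, DD, seen from player 1's side.
    q1, q2, q3, q4 = q
    p1, p2, p3, p4 = p
    return np.array([[p1*q1, p1*(1-q1), (1-p1)*q1, (1-p1)*(1-q1)],
                     [p2*q3, p2*(1-q3), (1-p2)*q3, (1-p2)*(1-q3)],
                     [p3*q2, p3*(1-q2), (1-p3)*q2, (1-p3)*(1-q2)],
                     [p4*q4, p4*(1-q4), (1-p4)*q4, (1-p4)*(1-q4)]])

def f_Y(q):
    # Long-term payoff (36) via the stationary left eigenvector.
    w, v = np.linalg.eig(M(q).T)
    stat = np.real(v[:, np.argmin(np.abs(w - 1))])
    return (stat / stat.sum()) @ S_Y

def grad(q, h=1e-6):
    # Forward differences; a component already at 1 cannot increase further.
    base, g = f_Y(q), np.zeros(4)
    for i in range(4):
        dq = q.copy()
        dq[i] = min(dq[i] + h, 1.0)
        g[i] = (f_Y(dq) - base) / h
    return g

q = np.array([1.0, 0.0, 1.0, 0.0])   # player 2 starts on TFT (Example 7.1)
for _ in range(1500):
    q = np.clip(q + 1e-4 * grad(q), 0.0, 1.0)
print(q.round(4), round(f_Y(q), 2))  # climbs towards f_Y near 1230.4, as in (40)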


7.3. Application to examples.

Example 7.1. In the first situation we shall explore, I take inspiration from Axelrod’s discussion on Live and Let Live in [4] and consider the scenario that, prior to the extortion, both players can be viewed as using TFT. As the analysis in this example begins at the start of player 1’s extortion, we examine how an extortion strategy will perform against an adaptive player using the strategy TFT.

7.3.1. The strategy of the extorting player. Here, if we arbitrarily set χ = 3 and φ at its maximum, φ = 1/2880, we see that player 1 attempts to enforce the relation:

(PX − 1002) = 3(PY − 1002).

Using (27), we can calculate player 1’s extortion strategy as:

p = (167/288, 0, 43/60, 0). (38)

7.3.2. The strategy of the adapting player. As we have calculated (38), using (13), we can obtain:

D(p,q, f) = det
| (167/288)q1 − 1   −121/288   q1 − 1   f1 |
| 0                 −1         q3       f2 |
| (43/60)q2         43/60      q2 − 1   f3 |
| 0                 0          q4       f4 |   (39)

and can therefore obtain fY(q) by (36). I do not explicitly give this here due to space constraints. As, at the start of this game, player 2 is using TFT, we have:

q(0) = (1, 0, 1, 0).

7.3.3. The Simulation. Although simulations of this type should almost certainly be done using a computer, for clarity I shall detail the first step of this process here. Using (36), we can calculate the expected score per round if player 2 continued to play TFT as:

fY((1, 0, 1, 0)) = D((167/288, 0, 43/60, 0), (1, 0, 1, 0), (1607, 1824, 588, 1002)) / D((167/288, 0, 43/60, 0), (1, 0, 1, 0), (1, 1, 1, 1)) = 1002.

In order to try and improve on this score, player 2 will take a small step in the direction of an increasing gradient. We calculate the gradient as

∇fY(q(i)) = ( ∂fY(q)/∂q1, ∂fY(q)/∂q2, ∂fY(q)/∂q3, ∂fY(q)/∂q4 ) evaluated at q = q(i).

While I omit the calculation due to space constraints, the initial gradient can be calculated as

∇fY((1, 0, 1, 0)) = (9.095×10⁻¹³, 0, 4.547×10⁻¹³, 618.0) ≈ (0, 0, 0, 618).


Therefore, from (37) we have:

q(1) = (1, 0, 1, 0) + 0.0001 · (0, 0, 0, 618) = (1, 0, 1, 0.0618).

If we now calculate the expected payoff per round under this new strategy, we find that:

fY((1, 0, 1, 0.0618)) = 1029.7877.

Thus, we see that fY(q(1)) > fY(q(0)); the player would notice this improvement to his strategy and now attempt to improve upon this new strategy. We can repeat this process until the player has reached his maximum score.

7.3.4. Simulation results. If this simulation is continued, the average score per round of players 1 and 2 can be plotted as in Fig. 6, with the final scores:

PX = 1687.1985
PY = 1230.3995 (40)

which are attained when player 2’s strategy is:

q = (1, 1, 1, 0.50515) (41)

and we can thus verify that the relation enforced in 7.3.1 is satisfied.

[Figure: expected payoff per round (1000 to 1800) against optimisation steps, showing the series P1score, P1Max, P2score and P2Max.]

Figure 6. A graph displaying how the changes made to player 2’s strategy lead to the maximum possible scores for both players


If we recall (30), we are able to calculate the maximum expected payoff for each player. We can calculate these values as:

PX = ( 217434 + 3 [665298 + 1103520] ) / ( 217 + 3057 ) = 1687.1985

PY = ( 1986252 + 2 · 1021038 ) / ( 217 + 3057 ) = 1230.3995.

From this we can see, by comparison with (40), that in this case both players attained their maximum expected scores.

Remark. Notice that, while we stated in 3.5 that the players’ scores are maximised when player 2 unconditionally cooperates after every possible outcome (ALLC), the strategy (41) is in this case in fact equivalent to this. This can be seen by observing the entries q1 = 1 and q2 = 1. Therefore, we interpret this as player 2 cooperating after both a cooperation and a defection from player 1, and hence the game will never reach the other states in the chain.

Returning to the context of trench warfare, we can interpret this in the following way. Through player 1’s calculated use of artillery strikes, player 2 believes himself to be in a situation in which he sees no benefit from defection. This is a result of player 1 being completely uncooperative after any form of defection from his opponent, as we can see from 7.3.2. However, by adopting this fixed mindset, player 1 essentially does not consider how these actions may provoke further retaliation from his opponent, and thus in no way tries to protect his troops from future artillery fire. Hence, player 1 is predominantly considering his long term position, with little or no regard towards the expenditure of resources. Alternatively, player 2, who may have a limited supply of artillery shells or a more conservative number of reinforcements, is only attempting to improve his position at the current moment, and thus has his actions guided by his opponent, potentially without realising this. As a result, while player 2’s final position is better than when he started, it is inferior to that of player 1.

Example 7.2. In 7.1, we saw how the use of an extortion strategy encouraged cooperation from an opponent who already cooperated under certain conditions. Now we examine the case in which player 2 begins as a fully non-cooperative player. Here, suppose player 1 uses the same strategy as calculated in 7.1, but let the initial strategy of player 2 be:

q(0) = (0, 0, 0, 0).

If we run a similar simulation, the results can be plotted as in Fig. 7. From this simulation, we obtain once again:

PX = 1687.1985
PY = 1230.3995 (42)

but this time attained when player 2’s final strategy is:

q = (1, 1, 0.45565, 1). (43)

This result is remarkable: while it took considerably longer, both players once again received their maximum possible scores. Thus, under the conditions of the artillery model game, we have demonstrated that it is possible to manipulate an adaptive opponent, even if initially non-cooperative, into a position equivalent to unconditional cooperation, through the application of a zero-determinant extortion strategy.


[Figure: expected payoff per round (1000 to 1800) against optimisation steps (1 to ~231), showing the series P1score, P1Max, P2score and P2Max.]

Figure 7. A graph displaying how the changes made to player 2’s initially uncooperative strategy still lead to the maximum possible scores for both players

From this, it seems clear that player 1, witting of zero-determinant strategies, can view this Prisoner’s Dilemma game very differently to, and with a clear advantage over, player 2. Although we have seen two cases in which player 1 is able to manipulate his opponent in such a way that both players receive their maximum possible expected payoffs, this raises another question - is this always the case?

The answer to this question, unsurprisingly, depends on the manner in which player 2 chooses to adapt his strategy. While it is in player 2’s best interests to cooperate unconditionally, this may not necessarily be known to the player. Thus, it seems feasible to imagine that a scenario may exist in which the local improvements made by player 2 to his strategy result in the strategy becoming stuck at a local optimum, as the directions of these improvements are by no means unique. If this were the case, the scores received by the players would be somewhat smaller than if player 2 were unconditionally cooperative. However, somewhat surprisingly, we are able to demonstrate that, for an adapting player similar to the one considered in this section, this is not the case.

7.4. The existence of desirable adapting paths. In [36], Press and Dyson suggest, with no analytic proof, that based on numerical evidence it seems likely that there exist adapting paths that lead to globally maximum scores whenever player 1 employs an extortion strategy against player 2. It was later rigorously proved in [12] that, in all cases, all possible adapting paths that may be taken by player 2 result in both players attaining their maximum scores. However, as seen in 7.1 and 7.2, player 2’s strategy may not necessarily end up as ALLC. Interestingly, this result holds even in some degenerate cases, which are not considered in the analysis of [36] nor in this project.


Before I can state this result, I must first formally define an adapting path.

Definition 7.3. [12] An adapting path for player 2 is a smooth map λ : [0, τ] → [0, 1]⁴, for some τ ∈ ℝ, such that:

(1) fY(λ(t1)) < fY(λ(t2)) whenever 0 ≤ t1 < t2 ≤ τ;

(2) there is no smooth map λ̃ : [0, τ̃] → [0, 1]⁴ such that λ̃ satisfies (1) and λ([0, τ]) = λ̃([0, τ̃′]) for some τ̃′ < τ̃.

Here, λ takes time as its input and outputs a mixed strategy of player 2. Condition (1) ensures that player 2 improves his own utility by changing his strategy; condition (2) means that player 2 does not stop at a locally sub-optimal strategy. The fact that τ is finite reflects the requirement that the speed of player 2’s adaptation does not tend to 0 as time tends to infinity.

Theorem 7.4. [12] Let p be the zero-determinant extortion strategy used by player 1 in an iterated Prisoner’s Dilemma game with payoffs that satisfy:

2R > T + S > 2P.

Every adapting path for the strategy of player 2 leads to some strategy q such that:

(q1, q2) = (1, 1) if p1 < 1
q1 = 1 if p1 = 1

This strategy is de facto equivalent to unconditional cooperation and thus maximises the stationary payoff scores PX and PY among all possible strategies for player 2 (for the given strategy p of player 1). Furthermore, there always exists an adapting path for player 2.

Proof. See [12]. □

This is a very powerful result: it demonstrates a definite robustness associated with zero-determinant extortion strategies in a two-player environment. Provided his opponent does not develop a theory of mind, player 1 is safe in the knowledge that, over enough time, he will receive his desired score - regardless of the adapting path taken by player 2.

7.4.1. A modification to the optimisation algorithm. In order to observe this result working in practice, we recall the gradient descent method as applied in (37); however, we now introduce a ‘weight’ matrix W:

W =
| w1 0  0  0  |
| 0  w2 0  0  |
| 0  0  w3 0  |
| 0  0  0  w4 |

with entries such that 0 < wi ≤ 1 for i = 1, 2, 3, 4.

Thus, implementing this, we have:

q(i+1) = q(i) + tW∇fY (q(i)). (44)
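A minimal sketch of the update (44) follows; it is an illustration of mine, where grad_f stands for any estimate of ∇fY, such as the forward-difference version sketched in 7.2, and the weights shown are those of path 1 in Fig. 9:

import numpy as np

W = np.diag([0.6, 0.3, 0.1, 0.8])   # weightings of path 1 in Fig. 9
t = 1e-4                            # step size, as in (37)

def weighted_step(q, grad_f):
    # Update (44): each gradient component is scaled by its weight before the step.
    return np.clip(q + t * (W @ grad_f(q)), 0.0, 1.0)

# toy illustration with a gradient pointing towards full cooperation:
q = weighted_step(np.zeros(4), lambda q: np.ones(4) - q)
print(q)   # each component moves in proportion to its weight

Since the weights rescale the gradient components without reversing them, any such path still increases fY at every step.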

Hence, we are able to influence the direction in which player 2 adapts as, by decreasing wi, we are effectively increasing the difficulty of moving in that direction. For example, it seems reasonable that a player would be more reluctant to become cooperative after a


unilateral defection from his opponent, thus we could decrease the value of w2 to account for this.

To illustrate this, we once again consider the case in which player 2 is initially uncooperative. Fig. 8 displays four adapting paths of the strategy of player 2, and the corresponding paths of player 1, with differently chosen weightings. The values of these weightings are given in Fig. 9. From this, we can clearly see that each path leads to the maximum score in each case, in line with our expectations from Thm. 7.4.

[Figure: expected payoff per round (1000 to 1800) against optimisation steps (1 to ~2201), showing the series P1Max, P1Score-1 to P1Score-4, P2Max and P2Score-1 to P2Score-4.]

Figure 8. The adapting paths taken by player 2’s strategy in four different instances, each arriving at the maximum score

Path  w1    w2   w3    w4
1     0.6   0.3  0.1   0.8
2     0.8   0.1  0.1   0.5
3     0.2   0.1  0.01  0.4
4     0.07  0.6  0.3   0.2

Figure 9. The weightings used to influence the adapting paths taken by player 2’s strategy in four different instances, as seen in Fig. 8

Hence, in the context of the model we are considering, there is a definite advantage in battalion x employing a zero-determinant strategy to extort battalion y. Not only will this result in the maximum long term payoff for battalion x, provided that battalion y


does not form a theory of mind, we have also demonstrated that the use of an extortion strategy provides a very robust way of doing so.

Remark. As a consequence of 7.4, we can also consider the case in which battalion x does not wish to extort his opponent, but perhaps wishes to promote mutual cooperation with a selfish opponent. This could be achieved through the use of the ‘fair’ extortion strategy, where χ = 1, as discussed in 3.5.1. In this situation, x would be enforcing the maximum total score of the two players, in which both players receive the ‘Reward’ outcome - equivalent to unconditional cooperation from both players. As the result is true in all cases, this would be achieved even when battalion y does not consider the total score and only adapts selfishly.

7.5. Convergence to the stationary state and realism of the model. One factor that has been overlooked by the applications in this section is the time taken for the game to converge to its stationary state - specifically, how many iterations of this game would be required before the average payoffs of the players are sufficiently close to their long-term expected payoffs. As in 3.2, here I shall assume the reader’s familiarity with Markov chains.

If we recall the scenario introduced in Example 7.1, after the first modification to player 2’s strategy, the strategies employed by the players are:

p = (167/288, 0, 43/60, 0), q(1) = (1, 0, 1, 0.0618). (45)

To investigate the convergence speed of this interaction, we consider the Markov matrix

M(p,q(1)) =
| 167/288  0      121/288  0      |
| 0        0      1        0      |
| 0        43/60  0        17/60  |
| 0        0      0.0618   0.9382 |   (46)

As (46) has one closed communicating class, it will converge to a unique stationary distribution, which can be estimated through repeated multiplication of this matrix [37]. If we calculate M(p,q(1))ⁿ, as n becomes large, each column of this matrix converges to a constant - the corresponding component of the stationary vector [46]. Thus, from the number of multiplications required, we are able to observe how quickly the average payoff of the players approaches the long-term expected payoffs.
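The following sketch - my own, mirroring the procedure just described - forms the powers of (46) and reads off the implied expected payoff of player 2 for increasing n:

import numpy as np

M = np.array([[167/288, 0.0,   121/288, 0.0   ],
              [0.0,     0.0,   1.0,     0.0   ],
              [0.0,     43/60, 0.0,     17/60 ],
              [0.0,     0.0,   0.0618,  0.9382]])
S_Y = np.array([1607.0, 1824.0, 588.0, 1002.0])

for n in (3, 5, 15, 30, 200):
    v = np.linalg.matrix_power(M, n).mean(axis=0)   # average each column
    print(n, round(v @ S_Y, 4))                     # approaches 1029.7877 as n grows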

If we consider n = 200, we see a definite convergence in the columns of this matrix:

| 4.62×10⁻⁴⁸  0.113732270793568  0.158696191804976  0.727571537401459 |
| 0           0.113732270793572  0.158696191804971  0.727571537401460 |
| 0           0.113732270793562  0.158696191804983  0.727571537401458 |
| 0           0.113732270793567  0.158696191804977  0.727571537401459 |

Hence, we can approximate the stationary vector in this case as

v ≈ (0, 0.11373227079, 0.15869619180, 0.72757153740).


From this, we can see that the expected score of player 2 after 200 iterations can be calculated as

v · SY = (0, 0.11373227079, 0.15869619180, 0.72757153740) · (1607, 1824, 588, 1002) = 1029.7877

which is consistent with the value of the expected long term payoff calculated in 7.1. However, in the context of our example, in which an iteration represents one day, it seems ludicrous that player 2 would wait 200 days before modifying his strategy. The expected score of player 2 for different values of n can be seen in Fig. 10. The convergence of this matrix does not become obvious until approximately n = 30; in the cases where the convergence of the column entries is unclear, the average of the column is used as the component of the stationary distribution.

n    Expected Payoff
3    1023.1527
5    1026.2110
15   1026.6631
30   1030.4387

Figure 10. Expected payoff per round for player 2 after n iterations of the game

From Fig. 10, we can see that it would take around 30 iterations before the expected payoff comes reasonably close to our calculated long term payoff. In the context of the example, this is clearly far too much time to expect player 2 to wait before modifying his strategy.

In addition, in the extortion strategy used in this example, φ was set at its maximum value. As it was noted in [20] that φ determines how quickly the payoffs converge to the long term values, if the extorting player chooses a lower value of φ, an even greater number of iterations will be required - making the model even more impractical.

7.5.1. An alternate approach. A different approach for estimating the expected payoff would be to consider the average score over the number of rounds previously played. By the law of large numbers for Markov chains, as the number of iterations increases, the average of the observed payoffs will also converge to the long-term expected payoff determined by the stationary vector [23]. To investigate how many iterations are required before the average payoff per round approaches the long term payoff, we can simulate an iterated Prisoner’s Dilemma with the players using strategies (45). The results of such a simulation are recorded in Fig. 11, from which we can see that, using this method, an even greater number of iterations is required before player 2’s payoff becomes reasonably close to our calculated value.
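The round-by-round simulation behind Fig. 11 can be sketched as follows - my own illustration, sampling both players’ moves from the strategies (45); the opening state (mutual cooperation on round one) is an assumption:

import numpy as np

rng = np.random.default_rng(1)
p = np.array([167/288, 0.0, 43/60, 0.0])          # player 1, the extortion strategy (38)
q = np.array([1.0, 0.0, 1.0, 0.0618])             # player 2, strategy q(1)
S_Y = np.array([1607.0, 1824.0, 588.0, 1002.0])   # player 2's payoffs: CC, CD, DC, DD
swap = (0, 2, 1, 3)                               # player 2 sees CD and DC interchanged

state, total = 0, 0.0                             # assumed mutually cooperative opening
for n in range(1, 10001):
    m1 = rng.random() < p[state]                  # True = cooperate
    m2 = rng.random() < q[swap[state]]
    state = (0 if m1 else 2) + (0 if m2 else 1)
    total += S_Y[state]
    if n in (5, 10, 100, 1000, 10000):
        print(n, round(total/n, 1))               # drifts slowly towards ~1029.8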

7.5.2. Realism of the model. After considering the time taken for the payoffs to converge to the long term payoffs, it is doubtful that player 2 would ever maintain his strategy long enough for his payoff to converge - making the game considered in this section a poor model of real-world trench warfare.

Thus, at this point it is important to recall that this model was never intended to provide a realistic depiction of events.


n        Average Payoff
5        1002.0
10       1042.2
20       1083.0
40       1011.9
80       1032.3
100      1030.2
200      1028.3
1000     1028.6
10,000   1030.6

Figure 11. Average payoff per round for player 2 after n iterations of the game

In this application, we are more interested in trench warfare as a concept, providing us with a context through which we explored how the theoretical work on zero-determinant strategies by researchers such as Press and Dyson [36], as well as Chen [12], can be applied to a concrete example fulfilling the conditions stated in [36]. For a more realistic model of the situation, one could simulate an interaction in which the long-term expected payoff is not known to player 2, and the changes made to his strategy are based on his achieved payoff. Such a model, however, has little relevance to the study of zero-determinant extortion strategies and the ideas presented in work such as [36], [12] and [26].

8. The effects of modifying ZD strategy parameters

Although in Section 3 we introduced the extortion factor χ and the parameter φ, it is not always explicitly clear how these parameters affect the adaptation of an opponent's score. Here, we move away from all context and explore the effects of changing these parameters, along with how they may affect the behaviour of a 'real-world' opponent.

8.1. The roles of χ and φ. Intuitively, from the relation (29), we see that as χ increases, the extorting player becomes more demanding as he seeks to attain a greater share of the overall score; thus we can deduce that, in some way, he must become less cooperative with his opponent. As mentioned earlier, from previous work [20], we are aware that φ influences the speed of convergence to the long-term values, from which we can expect the length of the game to be affected as this parameter is modified. Yet, in the case in which we only examine the value of φ and the system (27), it is difficult to precisely determine the impact φ has on the strategy, outside of ensuring feasibility, especially in relation to an increasing χ.

In order to visualise the roles of these parameters more clearly, I plot, for each outcome, how the probability of player 1's cooperation changes with respect to an increasing extortion factor. The case in which φ is set at its maximum is displayed in Fig. 12. The cases in which we set φ at half and at a tenth of its maximum are displayed in Fig. 13 and Fig. 14 respectively. In all of these cases, we use the payoff matrix of Fig. 5, which was calculated in 6.1.3 and used throughout the previous sections.
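For readers wishing to recreate these plots, the sketch below computes the cooperation probabilities as functions of χ and φ. It assumes the standard Press-Dyson parametrisation of an extortion strategy enforcing relation (29) - which I take system (27) to follow - together with the payoffs R = 1607, S = 588, T = 1824, P = 1002 read from Fig. 5; the bound on φ comes from requiring 0 ≤ p_i ≤ 1, which for these payoffs binds on p2.

```python
import numpy as np

R, S, T, P = 1607, 588, 1824, 1002   # payoffs read from Fig. 5 (see 6.1.3)

def extortion_strategy(chi, phi):
    """Press-Dyson extortion strategy enforcing (s_X - P) = chi * (s_Y - P)."""
    return np.array([1 - phi * (chi - 1) * (R - P),
                     1 - phi * ((P - S) + chi * (T - P)),
                     phi * ((T - P) + chi * (P - S)),
                     0.0])

def phi_max(chi):
    """Largest phi keeping every p_i in [0, 1]; here the binding constraint is p2."""
    return 1.0 / ((P - S) + chi * (T - P))

# Cooperation probabilities for phi at its maximum, half, and a tenth of it.
for frac in (1.0, 0.5, 0.1):
    for chi in np.linspace(1, 10, 10):
        print(frac, chi, extortion_strategy(chi, frac * phi_max(chi)))
```

Note that with φ written as a fraction of its maximum, p2 equals one minus that fraction regardless of χ, which anticipates the behaviour observed in the figures.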


Figure 12. Graph displaying how the probability of player 1's cooperation after each outcome (curves p1, p2, p3, p4) changes in relation to an increasing extortion factor (χ from 1 to 10) - with φ set at its maximum value

From this, we see that as we increase χ, player 1 becomes less willing to cooperate after the CC and DC outcomes, causing a decrease in p1 and p3. In addition, as we increase φ towards its upper limit, we see an increase in the rate at which p1 declines. We also notice that increasing the value of φ causes an increase in the value of p3. Interestingly, p2, the probability of cooperating after a CD outcome, is unaffected by the value of χ - decreasing as φ tends to its maximum, and reaching zero when φ is set at its maximum value.

Thus, one interpretation of φ is as the intensity of the extortion strategy. Hence, if φ is set low, player 1 will readily cooperate after the outcomes favourable to his opponent (CC, CD), as well as after DC with a low probability, which has the effect of encouraging his opponent to become more cooperative over a long period of time. However, as the value of φ increases, the extortive player is still willing to cooperate, but in a more selfish manner, predominantly after the outcomes most favourable to himself (CC, DC), and thus forcefully guides his opponent's adaptation over a shorter period.

Example 8.1. Consider the scenario in which an adapting player starts with a fully non-cooperative strategy. If an extorting opponent sets χ = 3, we can plot the long-term scores of the players in the cases in which:

(1) φ is set at its upper limit
(2) φ is set at half of its upper limit
(3) φ is set at a tenth of its upper limit


Figure 13. Graph displaying how the probability of player 1's cooperation after each outcome changes in relation to an increasing extortion factor - with φ set at half of its maximum value (axes and curves as in Fig. 12)

Figure 14. Graph displaying how the probability of player 1's cooperation after each outcome changes in relation to an increasing extortion factor - with φ set at a tenth of its maximum value (axes and curves as in Fig. 12)

These cases are displayed in Fig. 15, from which we can clearly see that the players reach their maximum expected scores more quickly as φ is increased.


Figure 15. Graph displaying how increasing the value of φ affects the adapting path of player 2 - expected payoff per round (1000 to 1800) for both players over the adaptation steps (1 to 251), with curves for φ at its maximum, at half, and at a tenth of its maximum

8.2. Extorting a sentient opponent. In this project, we have only considered the scenario in which, when facing an extorting opponent, it is impossible for an adapting player to develop a theory of mind. Here, we shall briefly discuss what may happen if this is not the case and, in light of the observations made in 8.1, how this can possibly be prevented.

If player 1, the extorter, has knowledge to suggest that his opponent is incapable of forming a theory of mind, and thus will blindly adapt to any extortion strategy, it is in the interests of player 1 to set χ reasonably high and φ at its maximum; hence obtaining a score much greater than that of his opponent in the shortest possible time.

However, it seems reasonable to suggest that if high values of χ and φ are employed against a sentient player, even one unwitting of ZD strategies, this may have a negative effect on player 2's adaptation. If we recall Axelrod's observations from 2.3.2, he noted that greedy probabilistic strategies were often mistaken as being largely uncooperative, thus encouraging their opponents to employ a strategy of ALLD in an attempt to minimise potential losses. Therefore, if the parameters of the ZD strategy are chosen too greedily, it is more likely to prevent an opponent's cooperation than to encourage favourable adaptation. Regardless of whether player 2 ceases cooperation based on a suspicion of foul play from his opponent, or has incorrectly deduced that he is facing an uncooperative player, the result is the same. By resigning himself to the 'Punishment' outcome, player 2 is damaging his own payoff to prevent being extorted by his opponent; thus he has developed a theory of mind.

In [36], Press and Dyson suggest that, to discourage an adapting player from developing a theory of mind, player 1 may wish to reduce the value of χ. While this may work, as it would cause an increase in the probabilities of cooperation, I argue that, if there are no


constraints on time, a better option would be to reduce the value of φ. By reducing χ, player 1 is adjusting the relation (29) and is thus settling for a lower long-term payoff. However, adjusting φ, from our observations in 8.1, also results in an increased probability of cooperation, without any modification to the long-term payoff; in this case, the extortion simply occurs over a longer period of time. Hence, by decreasing the intensity of his strategy, player 1 is effectively encouraging his opponent to take smaller steps in his adaptation, which is closer to a natural evolution, and thus player 1's extortion will appear less apparent to his opponent.

Alternatively, in the case in which player 2 has developed a theory of mind, it may be advantageous for player 1 to resist making any changes to his strategy at all. In this case, the 'dilemma' aspect of the game has been removed, and the players find themselves in an ultimatum game [17], in which player 2 can accept the unfair proposal or reject it - with rejection resigning both players to a low payoff. Whilst player 2 may reject this and switch to an uncooperative strategy, provided that he has not misidentified his opponent as wholly uncooperative, it is likely that he has adopted this strategy in an attempt to incite change in player 1's strategy. If player 1 remains stubborn in his approach, upon realising this, player 2 may see that there is no advantage in his defection and reluctantly accept the proposal, choosing to increase his own score despite also increasing that of his opponent.

9. Conclusion

In this project, I have presented an exposition of the iterated Prisoner's Dilemma game, with a particular emphasis on the application of zero-determinant extortion strategies. In order to demonstrate to the reader how this game has real-world applications, I introduced the Live and Let Live system, accentuating Axelrod's analysis of this situation through amalgamation with a modern interpretation of Lanchester's models of combat; thus creating a model suitable for simulation. Through a series of examples, I demonstrated the robustness of zero-determinant extortion strategies against an unwitting adaptive opponent, and used a result from [12] to assert that all adapting paths will result in both players achieving their maximum scores. Finally, I provided a new interpretation of the parameter φ as the strategy's intensity, and discussed how, in order to prevent an opponent from developing a theory of mind, it may be favourable to modify this parameter before resorting to reducing χ.

At this point it is important to note that the original situation considered in [36] by Press and Dyson is quite different from those considered in most subsequent works, examples being [1], [2], [19], [21] and [42]. That is, Press and Dyson discuss a situation in which there are only two players, one using a fixed zero-determinant strategy to extort his adaptive opponent. Contrastingly, most other studies have focused on the evolutionary aspects of zero-determinant strategies, in which they consider populations of players, all of which can change their strategies over time. In addition, the analysis carried out by Press and Dyson [36] only compares the performance of the zero-determinant extortion strategy against the non-zero-determinant strategy of an adapting opponent. In a number of other studies, the performance of zero-determinant strategies against other zero-determinant strategies is also considered, as well as the performance of non-ZD strategies against other non-ZD strategies. In this project, I opted to return to the original situation as examined in [36]. As a result of this, while my conclusions are consistent with other studies that consider this form of the game [12] [26] [36], they may appear different, or even opposite, to those presented in recent work.


9.1. Recent related work.

9.1.1. Generous zero-determinant strategies. Contrary to extortion strategies, in which one player attempts to elicit a greater score than the other, in [2] Akin proposed a new class of good strategies, which encourage the outcome of mutual cooperation by rewarding a cooperating player. Akin demonstrates that, if player 1 utilises a good strategy, it is impossible for player 2 to receive an expected payoff greater than or equal to the 'Reward' payoff for himself while simultaneously assigning a lesser expected payoff to player 1. Thus, player 1's choice of a good strategy ensures that cooperation is the best response for player 2, therefore encouraging the game into a state of mutual cooperation. It is also shown that there are regions of overlap between good strategies and zero-determinant strategies, known as generous zero-determinant strategies, in which one player may enforce unilateral control but uses it to promote mutual cooperation.

9.1.2. Zero-determinant strategies in a competing population. Whilst we have seen that zero-determinant strategies fare extremely well in an environment of only two competing players, Adami and Hintze noted that this is not the case when competing within a well-mixed population [1]. In an evolutionary population game, competing individuals learn and adopt the most successful strategies. Thus, while extortion strategies receive high payoffs at the start of the game, as these strategies are adopted by more players, the payoffs of the extortioners decrease. This is due to their poor performance against each other, as seen in 3.6.1. Therefore, when extortioners dominate the population, as the average payoff received by each player will be the 'Punishment' payoff, the population will be susceptible to invasion from small clusters of generous strategies, such as TFT, which receive a greater payoff when facing each other.

Based on this, Hilbe, Nowak, and Sigmund [21] concluded that, whilst extortion strategies do not form a stable population, they do in fact act as a catalyst for the emergence of cooperative strategies. Furthermore, Stewart and Plotkin [42] demonstrated that generous zero-determinant strategies are particularly successful in large populations.

9.2. Potential for future research. Since their discovery, zero-determinant strategies have been the focus of the majority of research related to the Prisoner's Dilemma, as well as within Game Theory itself, and this looks set to continue for the foreseeable future. A particular area of interest to me is the application of zero-determinant strategies to games outside of the Prisoner's Dilemma but similar in structure; that is, whether the advantageous properties of zero-determinant strategies remain intact, and whether any new constraints are required as a result of different payoff structures. Furthermore, initial research in this area [18] suggests that it is possible to generalise zero-determinant strategies to multi-player, multi-action iterated games, yielding another surprising result: every game may have at most one 'master player', who can control the expected payoff of the others. This inspires even more questions, such as how to optimally define and apply a multi-action zero-determinant strategy, and an exploration of the exact properties of such strategies is required.

Returning to the Prisoner's Dilemma, I would also be interested in pursuing research into the adaptive dynamics of zero-determinant strategies within populations. While extortion strategies have previously been studied in this context, there has been little investigation into the change in evolutionary stability of a zero-determinant strategy as it evolves away from extortion and instead approaches generosity. From a purely competitive outlook, is it possible to select an extortion strategy in such a way that encourages evolution into generosity when facing a population of extortionate opponents?


Evidently, there is still much to explore within the field of zero-determinant strategies.

Appendix A. Optimisation Methods for an adapting player

From 7.1.1, it becomes apparent that, in order to simulate a game involving an adaptive player, we must explore how an adapting player may attempt to increase his score.

A.0.1. Trial and error. Perhaps the most basic way for an adaptive player to attempt to increase his payoff would be through the means of trial and error. While this is of course very inefficient, it may be the only option if a player is completely unaware of the factors contributing to his payoff. We can summarise this in the following (informal) algorithm, sketched in code after the list:

(1) Choose a probability q1, q2, q3, q4 to increase.
(2) Increase the chosen probability by some amount; if this is impossible, go to (1).
(3) At the end of the next iteration, compare the current score to the score in the previous round. If the new score is less, undo the change made in (2). If not, go to (1).
(4) End when it is impossible to make more changes.
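A minimal sketch of a single pass of this procedure is given below, assuming the player can observe only his most recent per-round score; the increment delta, and the bookkeeping of the previous change, are placeholder choices of my own.

```python
import random

def trial_and_error_step(q, score, last_score, last_change, delta=0.05):
    """One pass of the informal algorithm above. `q` is the strategy,
    `score`/`last_score` are the payoffs after and before the last change,
    and `last_change` records the (index, amount) of that change."""
    # Step (3): if the previous change lowered the score, undo it.
    if last_change is not None and score < last_score:
        i, amount = last_change
        q[i] -= amount
    # Steps (1)-(2): pick a probability that can still be increased.
    candidates = [i for i in range(4) if q[i] + delta <= 1.0]
    if not candidates:
        return q, None          # step (4): no further changes are possible
    i = random.choice(candidates)
    q[i] += delta
    return q, (i, delta)
```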

As we can see, this is a very laborious and inefficient method, which is often difficult to simulate, as long trial-and-error runs can be taxing on a computer's resources. As we are only interested in the cases in which an adapting player alters his strategy to successfully increase his score, we can disregard the failed attempts and obtain similar results by considering a more elegant method in our analysis.

A.0.2. The Gradient Descent Method. A different option we may consider is the gradient descent method - an algorithm which can be used to find a local extremum of a function, applied here in its ascent form since the adapting player seeks a local maximum of his payoff. This is achieved by evaluating the gradient of the function at an initial starting point and moving in the direction of increase given by the gradient, repeating this process until the algorithm converges at a point where the gradient is zero. This is a first-order algorithm, as only the first derivatives of the function are required in this process.

Thus, we can present this method in the following way:

x_{k+1} = x_k + t · ∇f(x_k)     (47)

where the parameter t > 0 is the size of the step taken in the direction of the gradient. The optimal value of t is dependent on the function used and should be chosen such that f(x_{k+1}) ≥ f(x_k).

One of the most important decisions when using this method is the choice of the step size t. The optimal way to do this is by an exact line search, which computes the step size as

t = argmax_{s ≥ 0} f(x_k + s∇f(x_k)).

However, in practice it is not possible to perform this optimisation exactly, and the results yielded from approximations to an exact line search are not usually a significant improvement over other, simpler methods. Therefore, I shall consider a small fixed step size when implementing this method.
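As a concrete illustration of the fixed-step variant, the sketch below performs gradient ascent on player 2's long-term payoff against a fixed opponent strategy, estimating the gradient by central finite differences since the payoff is only available numerically here. The opponent strategy p, the starting strategy, the step size and the iteration counts are placeholders rather than the values used in the main text.

```python
import numpy as np

S_Y = np.array([1607, 1824, 588, 1002])   # player 2's payoffs (CC, CD, DC, DD)
p = np.array([0.6, 0.0, 0.4, 0.0])        # fixed opponent strategy (placeholder)

def long_term_payoff(q):
    """Player 2's expected payoff per round against the fixed strategy p,
    via the matrix-power estimate of the stationary vector (cf. 7.5)."""
    M = np.array([[pi * qi, pi * (1 - qi), (1 - pi) * qi, (1 - pi) * (1 - qi)]
                  for pi, qi in zip(p, q)])
    return np.linalg.matrix_power(M, 500).mean(axis=0) @ S_Y

def gradient_ascent(q, t=1e-6, steps=200, h=1e-5):
    """Fixed-step iteration x_{k+1} = x_k + t * grad f(x_k), as in (47),
    with the gradient estimated by central finite differences."""
    for _ in range(steps):
        grad = np.zeros(4)
        for i in range(4):
            e = np.zeros(4); e[i] = h
            grad[i] = (long_term_payoff(np.clip(q + e, 0, 1))
                       - long_term_payoff(np.clip(q - e, 0, 1))) / (2 * h)
        q = np.clip(q + t * grad, 0, 1)   # keep q a valid probability vector
    return q

q = gradient_ascent(np.array([0.1, 0.2, 0.1, 0.3]))
print(q, long_term_payoff(q))
```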

References

[1] Adami C.; Hintze A. Evolutionary instability of zero-determinant strategies demonstrates that winning is not everything. Nature Communications 4. 2013.
[2] Akin E. Stable Cooperative Solutions for the Iterated Prisoner's Dilemma. arXiv preprint arXiv:1211.0969 [Internet]. 2012 [cited 2015 Nov 18]; Available from: http://arxiv.org/abs/1211.0969
[3] Ashworth T. Trench Warfare 1914-1918: The Live and Let Live System. 1980.
[4] Axelrod R. The Evolution of Cooperation. New York: Basic Books; 1984.
[5] Axelrod R. Effective choice in the Prisoner's Dilemma (1980a). Journal of Conflict Resolution 24. 1980. pp. 3-25.
[6] Axelrod R. More effective choice in the Prisoner's Dilemma (1980b). Journal of Conflict Resolution 24. 1980. pp. 379-403.
[7] Batson C.D.; Moran T. Empathy-induced altruism in a prisoner's dilemma. Eur J Soc Psychol. 1999 Nov 1;29(7):909-24.
[8] Bacci G.; Lasaulce S.; Saad W.; Sanguinetti L. Game Theory for Networks: A tutorial on game-theoretic tools for emerging signal processing applications. Signal Processing Magazine, IEEE. 2016 Jan;33(1):94-119.
[9] Boerlijst M.C.; Nowak M.; Sigmund K. Equal Pay for All Prisoners. The American Mathematical Monthly Vol. 104, No. 4; 1997. pp. 303-305.
[10] Bracken J. Lanchester Models of the Ardennes Campaign. Naval Research Logistics. 1995.
[11] Carilli A.M.; Dempster G.M. Expectations in Austrian Business Cycle Theory: An Application of the Prisoner's Dilemma. The Review of Austrian Economics Vol. 14, No. 4; 2001. pp. 319-330.
[12] Chen J.; Zinger A. The Robustness of Zero-Determinant Strategies in Iterated Prisoner's Dilemma Games. Journal of Theoretical Biology Vol. 357; 2014. pp. 46-54.
[13] Darilek R. Knowledge Enhanced Lanchester. In: Darilek et al. Measures of Effectiveness for the Information-Age Army. RAND Corporation. 2001.
[14] Earnest M.J. Extortion and Evolution in the Iterated Prisoner's Dilemma. 2013 [cited 2016 Feb 17]; Available from: http://scholarship.claremont.edu/hmc_theses/51/
[15] Flood M.M. Some Experimental Games. Management Science 5 (1). 1958. INFORMS: 5-26. http://www.jstor.org/stable/2626968
[16] Fehr E.; Fischbacher U. The nature of human altruism. Nature. 2003 Oct 23;425(6960):785-91.
[17] Guth W.; Schmittberger R.; Schwarze B. An experimental analysis of ultimatum bargaining. Journal of Economic Behaviour and Organisation 3; 1982. pp. 367-388.
[18] He X.; Dai H.; Ning P.; Dutta R. Zero-determinant Strategies for Multi-player Multi-action Iterated Games. IEEE Signal Processing Letters. 2016 Mar;23(3):311-5.
[19] Hilbe et al. Evolutionary performance of zero-determinant strategies in multiplayer games. Journal of Theoretical Biology 374. 2015. pp. 115-124.
[20] Hilbe C.; Röhl T.; Milinski M. Extortion subdues human players but is finally punished in the prisoner's dilemma. Nature Communications. 2014 May 29;5:3976.
[21] Hilbe C.; Sigmund K.; Nowak M. Evolution of extortion in Iterated Prisoner's Dilemma games. Proceedings of the National Academy of Sciences Vol. 110, No. 17; 2013. pp. 6913-6918.
[22] Howard N. Paradoxes of Rationality: Theory of Metagames and Political Behaviour. MIT Press, Cambridge; 1971.
[23] Kemeny J.; Snell L. Finite Markov Chains. Undergraduate Texts in Mathematics. Springer. 1976.
[24] Lanchester F. Aircraft in Warfare: The Dawn of the Fourth Arm. Constable, London. 1916.
[25] Lepingwell J.W. The laws of combat? Lanchester reexamined. International Security. 1987; pp. 89-134.
[26] Li S. Strategies in the Stochastic Iterated Prisoner's Dilemma. [cited 2016 Jan 13]; Available from: http://math.uchicago.edu/~may/REU2014/REUPapers/Li,Siwei.pdf
[27] Maynard Smith J. Evolution and the Theory of Games. Cambridge University Press. 1982.
[28] de Montmort P.R. Essai d'analyse sur les jeux de hasard. 2nd ed. Quillau, Paris; 1713.
[29] Myerson R.B. Game Theory: Analysis of Conflict. Harvard University Press. 1991.
[30] Nash J.F. Equilibrium Points in N-Person Games. Proceedings of the National Academy of Sciences of the United States of America; 1950.
[31] von Neumann J.; Morgenstern O. Theory of Games and Economic Behavior. Princeton University Press. 1944.
[32] Nowak M.; Sigmund K. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner's Dilemma game. Nature Vol. 364, No. 6432; 1993. pp. 56-58.
[33] Nowak M.; Sigmund K. The evolution of stochastic strategies in the prisoner's dilemma. Acta Applicandae Mathematicae 20. 1990. pp. 247-265.
[34] Nowak M.; Sigmund K. Evolutionary Dynamics of Biological Games. Science. 2004 Feb 6;303(5659):793-9.
[35] Pinsky M.; Karlin S. An Introduction to Stochastic Modeling. Fourth Edition. Academic Press. 2010.
[36] Press W.; Dyson F. Iterated Prisoner's Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences 109. 2012.
[37] Radev D. et al. Steady-state solutions of Markov chains. In: Proceedings of the 7th Balkan Conference on Operational Research [Internet]. 2005 [cited 2016 Apr 20]. Available from: https://www.researchgate.net/profile/Dimitar_Radev3/publication/241138992_STEADY-STATE_SOLUTIONS_OF_MARKOV_CHAINS/links/0c960533ccfd75cedc000000.pdf
[38] Rapoport et al. Is Tit-for-Tat the Answer? On the Conclusions Drawn from Axelrod's Tournaments. PLOS ONE Vol. 10, No. 7; 2015.
[39] Shapley L.; Shubik M. A Method for Evaluating the Distribution of Power in a Committee System. American Political Science Review 48. 1954.
[40] Straffin P. Game Theory and Strategy. 7th ed. The Mathematical Association of America; 1993.
[41] Stewart A.; Plotkin J. Extortion and cooperation in the Prisoner's Dilemma. Proceedings of the National Academy of Sciences Vol. 109, No. 26; 2012.
[42] Stewart A.J.; Plotkin J.B. From extortion to generosity, evolution in the Iterated Prisoner's Dilemma. Proceedings of the National Academy of Sciences. 2013 Sep 17;110(38):15348-53.
[43] Taylor J.G. Lanchester Models of Warfare. 1983. Vol. 1.
[44] Tolk A. Engineering Principles of Combat Modeling and Distributed Simulation. Wiley-Blackwell; 2012.
[45] Tucker A.W. The Mathematics of Tucker: A Sampler. The Two-Year College Mathematics Journal Vol. 14, No. 3; 1983. pp. 228-232.
[46] Zabek M. Transition Matrices and Markov Chains. [cited 2016 Apr 21]; Available from: http://www2.kenyon.edu/Depts/Math/hartlaub/Math224%20Fall2008/Markov-Sample1.pdf