
Co-evolution: time, changing environments, agents

Static search space

solutions are determined by the optimization process only, by doing variation and selection

evaluation determines fitness of each solution

BUT…

Dynamic search space

what happens if the state of the environment changes between actions of the optimization process, AND/OR other processes are concurrently trying to change the state as well?

Agents: to act in changing environments, the search is now for an agent strategy to act successfully in a changing environment

example:

agent is defined with an algorithm based on a set of parameters

‘learning’ means searching for parameter settings to make the algorithm successful in the environment

agent algorithm takes current state of environment as input and outputs an action to change the environment

Optimizing an agent

procedural optimization: recall 7.1.3 finite state machines (recognize a string) and 7.1.4 symbolic expressions (control the operation of a cart), which are functions

The environment of activity

parameter space, e.g. a game: nodes are game states, edges are legal moves that transform one state to another; examples: tic-tac-toe, rock-scissors-paper

Agent in the environment

procedure to input current state and determine an action to alter the state

evaluated according to its success at achieving goals in the environment:

external goal - put the environment in a desirable state (win the game)

internal goal - maintain internal state (survive)

Models of agent implementation

1. look-up table: state / action

2. deterministic algorithm

hard-coded intelligence; ideal if a perfect strategy is known

otherwise, parameterized algorithm: 'tune' the parameters to optimize performance

3. neural network

Optimizing agent procedure

initialize agent parameters

evaluate agent success in environment

repeat

modify parameters

evaluate modified agent success

if improved success, retain modified parameters
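This loop is essentially a simple hill climb over the parameter space. A minimal Python sketch, assuming a hypothetical evaluate() function that scores an agent's success in its environment (the placeholder below just rewards parameters near zero):

import random

def evaluate(params):
    # placeholder fitness: stands in for running the agent in its environment
    return -sum(p * p for p in params)

def optimize_agent(n_params=4, iterations=1000, step=0.1):
    params = [random.uniform(-1, 1) for _ in range(n_params)]  # initialize agent parameters
    best = evaluate(params)                                    # evaluate agent success
    for _ in range(iterations):                                # repeat
        trial = [p + random.gauss(0, step) for p in params]    #   modify parameters
        score = evaluate(trial)                                #   evaluate modified agent
        if score > best:                                       #   if improved, retain them
            params, best = trial, score
    return params, best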

Environment features

1. other agents?

2. state changes independent of agent(s)?

3. randomness?

4. discrete or continuous?

5. sequential or parallel?

Environment features - game examples

1. other agents?

solitaire, two-person, multi-player

Environment features - game examples

2. state changes independent of agent(s)?

YES

simulators (weather, resources)

NO

board games

Environment features - game examples

3. randomness?

NO

chess, sudoku(!)

YES

dice, card games

Environment features - game examples

4. discrete or continuous?

video games, board games

Environment features - game examples

5. sequential or parallel? turn-taking, simultaneous

which sports? marathon, javelin, hockey, tennis

Models of agent optimization

evolving agent vs. environment

evolving agent vs. skilled player

co-evolving agents

competing as equals e.g., playing a symmetric game

competing as non-equals (predator - prey) e.g., playing an asymmetric game

co-operative

Example: Game Theory

invented by John von Neumann

models concurrent* interaction between agents

each player has a set of choices

players concurrently make a choice

each player receives a payoff based on the outcome of play: the vector of choices

e.g. paper-scissors-rock: 2 players with 3 choices each, 3 outcomes: win, lose, draw

*there are sequential Game Theory models also

Payoff matrix

tabulates payoffs for all outcomes e.g. paper-scissors-rock

P1\P2 paper scissors rock

paper (draw,draw) (lose,win) (win,lose)

scissors (win,lose) (draw,draw) (lose,win)

rock (lose,win) (win,lose) (draw,draw)

Two-player two-choice ordinal games (2 x 2 games)

four outcomes

each player has four ordered payoffs

example game:

P1\P2 L R

U 1,2 3,1

D 2,4 4,3

2 x 2 games

144 distinct games based on the relation of payoffs to outcomes: (4! x 4!) / (2 x 2) = 576 / 4 = 144

model many real-world encounters

an environment for studying agent strategies

how do players decide what choice to make?

P1\P2 L R

U 1,2 3,1

D 2,4 4,3

2 x 2 games

player strategy: minimax - minimize losses

pick the row / column with the largest minimum; assumes nothing about the other player

P1\P2 L R

U 1,2 3,1

D 2,4 4,3
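As an illustration (not from the slides), a short Python sketch of the minimax rule applied to the row player of the example game above:

def minimax_choice(payoffs):
    # payoffs[choice][opponent_choice] -> this player's ordinal payoff
    # pick the choice whose worst-case payoff is largest
    return max(range(len(payoffs)), key=lambda c: min(payoffs[c]))

# P1's payoffs in the example game: U -> (1, 3), D -> (2, 4)
print(minimax_choice([[1, 3], [2, 4]]))  # 1, i.e. D: its worst payoff (2) beats U's worst (1)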

2 x 2 games

player strategies: dominant strategy

the choice preferred whatever the opponent does; does not determine a choice in every game

P1\P2 L R

U 1,3 4,2

D 3,4 2,1

P1\P2 L R

U 1,2 4,3

D 3,4 2,1

P1\P2 L R

U 2,1 1,2

D 4,3 3,4
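A small sketch (my own illustration) of checking whether the row player has a dominant strategy, i.e. a row strictly better than the alternative against every column:

def dominant_row(payoffs):
    # payoffs[row][col] -> row player's payoff; return dominant row index, or None
    rows = range(len(payoffs))
    cols = range(len(payoffs[0]))
    for r in rows:
        if all(all(payoffs[r][c] > payoffs[o][c] for c in cols)
               for o in rows if o != r):
            return r
    return None

# first game above: P1's payoffs are U -> (1, 4), D -> (3, 2): no dominant row
print(dominant_row([[1, 4], [3, 2]]))  # None
# third game above: P1's payoffs are U -> (2, 1), D -> (4, 3): D dominates
print(dominant_row([[2, 1], [4, 3]]))  # 1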

2 x 2 games

some games are dilemmas

“prisoner’s dilemma”

C = ‘cooperate’, D = ‘defect’

P1\P2 C D

C 3,3 1,4

D 4,1 2,2

2 x 2 games - playing strategies

what happens if players use different strategies?

how do/should players decide in difficult games?

modeling players and evaluating in games

P1\P2 C D

C 3,3 1,4

D 4,1 2,2

iterated games

can players analyze games after playing and determine a better strategy for next encounter?

iterated play

e.g., prisoner’s dilemma:

can players learn to trust each other and get better payoffs?

P1\P2 C D

C 3,3 1,4

D 4,1 2,2

iterated prisoner’s dilemma

player strategy for choice as a function of previous choices and outcomes

P1\P2 C D

C 3,3 1,4

D 4,1 2,2

game 1 2 … i-1 i i+1 …

P1 C C … D D ?

P2 D C … C D ?

P1: choice_{i+1,P1} = f(choice_{i,P1}, choice_{i,P2})

iterated prisoner’s dilemma

evaluating strategies based on payoffs in iterated play

P1\P2 C D

C 3,3 1,4

D 4,1 2,2

P1: choice_{n+1,P1} = f(choice_{n,P1}, choice_{n,P2})

game       1  2  …  i-1  i  i+1  …  n-1  n    Σ

P1 payoff  1  3  …   4   2   3   …   4   3   298

P1 choice  C  C  …   D   D   C   …   D   C

P2 choice  D  C  …   C   D   C   …   C   C

P2 payoff  4  3  …   1   2   3   …   1   3   343
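A minimal Python sketch of this kind of evaluation, using the payoff matrix above and two illustrative memory-one strategies (tit-for-tat and always-defect); the strategy signature is my own assumption:

# payoffs from the matrix above, keyed by (P1 choice, P2 choice)
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (1, 4),
          ('D', 'C'): (4, 1), ('D', 'D'): (2, 2)}

def tit_for_tat(my_last, their_last):
    return 'C' if their_last is None else their_last

def always_defect(my_last, their_last):
    return 'D'

def iterated_play(s1, s2, n=100):
    total1 = total2 = 0
    last1 = last2 = None
    for _ in range(n):
        c1, c2 = s1(last1, last2), s2(last2, last1)  # each sees (own last, opponent's last)
        p1, p2 = PAYOFF[(c1, c2)]
        total1, total2 = total1 + p1, total2 + p2
        last1, last2 = c1, c2
    return total1, total2

print(iterated_play(tit_for_tat, always_defect))  # (199, 202) over 100 games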

tournament play

set of K strategies for a game, e.g., iterated prisoner's dilemma

evaluate strategies by performance against all other strategies

round-robin tournament

example of tournament play

P1 P2 … P(K-1) PK Σ

P1 308 298 … 340 278 7140

P2 343 302 … 297 254 6989

… … … … … … …

P(K-1) 280 288 … 301 289 6860

PK 312 322 … 297 277 7045

tournament play

P1 P2 … P(K-1) PK Σ

P1 308 298 … 340 278 7140

P2 343 302 … 297 254 6989

… … … … … … …

P(K-1) 280 288 … 301 289 6860

PK 312 322 … 297 277 7045

order players by total payoff

change ‘weights’ (proportion of population) based on success

repeat tournament until weights stabilize
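The slides do not fix an update rule; one simple sketch (an assumption of mine) makes each strategy's new weight proportional to its old weight times its tournament payoff, repeated until the weights stop changing:

def update_weights(weights, payoffs):
    # weights and payoffs are dicts keyed by player; returns the new, normalized weights
    fitness = {p: weights[p] * payoffs[p] for p in weights}
    total = sum(fitness.values())
    return {p: f / total for p, f in fitness.items()}

weights = {'P1': 0.25, 'P2': 0.25, 'P(K-1)': 0.25, 'PK': 0.25}
payoffs = {'P1': 7140, 'P2': 6989, 'P(K-1)': 6860, 'PK': 7045}
print(update_weights(weights, payoffs))  # higher-scoring players gain population share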

tournament survivors

final weights of players define who is successful

typical distribution:

1 dominant player

a few minor players

most players eliminated

variations on fitness evaluation

spatially distributed agents

random players on an n x n grid

each player plays 4 iterated games against neighbours and computes total payoff

each player compares total payoff with 4 neighbours; replaced by neighbour with best payoff
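One possible reading of this scheme as Python (the wrap-around at the grid edges and the 'adopt the best of self and neighbours' rule are my assumptions):

def neighbours(i, j, n):
    # the 4 neighbours on an n x n grid, wrapping at the edges
    return [((i - 1) % n, j), ((i + 1) % n, j), (i, (j - 1) % n), (i, (j + 1) % n)]

def spatial_step(grid, play):
    # play(s1, s2) -> total payoff to s1 from an iterated game against s2
    n = len(grid)
    score = [[sum(play(grid[i][j], grid[a][b]) for a, b in neighbours(i, j, n))
              for j in range(n)] for i in range(n)]
    new_grid = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cells = [(i, j)] + neighbours(i, j, n)
            bi, bj = max(cells, key=lambda c: score[c[0]][c[1]])  # best of self + 4 neighbours
            new_grid[i][j] = grid[bi][bj]
    return new_grid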

spatial distribution

defining agents, e.g., prisoner’s dilemma

algorithms from experts Axelrod 1978

exhaustive variations on a model e.g. react to previous choices of both players:

need choices for:

first move

after (C,C)   // (self, opponent)

after (C,D)

after (D,C)

after (D,D)

5-bit representation of strategy: 32 possible strategies, 32 players in the tournament
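A sketch of the 5-bit encoding in Python; the bit ordering (first move, then the four (self, opponent) histories) and the convention that 1 means 'C' are my assumptions:

MOVES = {1: 'C', 0: 'D'}

def decode(bits):
    # bits: a 5-character string of 0/1 -> mapping from situation to move
    first, cc, cd, dc, dd = (MOVES[int(b)] for b in bits)
    return {'first': first, ('C', 'C'): cc, ('C', 'D'): cd,
            ('D', 'C'): dc, ('D', 'D'): dd}

tit_for_tat = decode('11010')  # first: C, (C,C): C, (C,D): D, (D,C): C, (D,D): D
all_players = [decode(format(i, '05b')) for i in range(32)]  # the 32 tournament players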

tit-for-tat

first: C   (C,C): C   (C,D): D   (D,C): C   (D,D): D


tit-for-tat: interpreting the strategy

start by trusting; repeat the opponent's last behaviour

cooperate after cooperation; punish defection

don’t hold a grudge

BUT… Axelrod's payoffs became the standard

tit-for-tat

first: C   (C,C): C   (C,D): D   (D,C): C   (D,D): D


P1\P2 C D

C 3,3 0,5

D 5,0 1,1

problems with the research

if payoffs are changed but the prisoner’s dilemma pattern remains,

tournaments can produce different winners

--> relative payoff does matter

P1\P2 C D

C 3,3 1,4

D 4,1 2,2

P1\P2 C D

C 3,3 0,5

D 5,0 1,1

game       1  2  …  i-1  i  i+1  …  n-1  n    Σ

P1 payoff  1  3  …   4   2   3   …   4   3   298

P1 choice  C  C  …   D   D   C   …   D   C

P2 choice  D  C  …   C   D   C   …   C   C

P2 payoff  4  3  …   1   2   3   …   1   3   343

an agent is a function of more than the previous strategy

choice_{P1,n} is influenced by (number of variables):

previous choices - how many? 2, 4, 6, ..?

payoffs - of both players: 4, 8?

individual's goals: 1

same or different measures? maximum payoff, maximum difference, altruism, equality, social utility

other factors? ?

agent research (Robinson & Goforth): choice_{P1,n} influenced by

previous choices - how many?

payoffs - of both players

repeated tournaments, sampling the space of payoffs, demonstrated that payoffs matter

example of discovery: an extra level of trust beyond tit-for-tat

effect of players' goals on outcomes and payoffs and classic strategies

3 x 3 games: ~3 billion games -> grid computing

general model

the environment is a payoff space where encounters take place

multiple agents interact in the space

repeated encounters where agents get payoffs

agents 'display' their own strategies and agents 'learn' the strategies of others

goals of research: reproduce real-world behaviour by simulating strategies; develop effective strategies

Developing effective agent strategies

co-evolution, bootstrapping

symmetric competition: games, symmetric game theory

asymmetric competition: asymmetric game theory

cooperation: common goal, social goal

Co-evolution in symmetric competition

Fogel's checkers player: evolving player strategies

search space of strategies

fitness is determined by game play

variation by random change to parameters

selection by ranking in play against others

initial strategies randomly generated

Checkers game representation

32 element vector for the playable squares on the board (one colour)

Domain of each element: {-K,-1,0,1,K} e.g., start state of game:

{-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1}
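A one-line Python sketch of this representation (the value of K shown is just a placeholder; K is itself an evolved parameter):

K = 1.4  # placeholder king value; evolved along with the other parameters
start_board = [-1.0] * 12 + [0.0] * 8 + [1.0] * 12  # 32 playable squares, as above
assert len(start_board) == 32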

Game play algorithm

board.initialize()
while (not board.gameOver())
    player1.makeMove(board)
    if (board.gameOver()) break
    player2.makeMove(board)
end while
board.assignPoints(player1, player2)

makeMove algorithm

makeMove(board) {
    boardList = board.getNextBoards(this)
    bestBoardVal = -infinity     // below the [-1, 1] range of evaluate()
    bestBoard = null
    forAll (tempBoard in boardList)
        boardVal = evaluate(tempBoard)
        if (boardVal > bestBoardVal)
            bestBoardVal = boardVal
            bestBoard = tempBoard
    board = bestBoard            // commit the best successor board as the new state
}

evaluate algorithm

evaluate(board) {
    a function of the 32 elements of the board
    its parameters are the variables of the search space, including K
    output is in the range [-1, 1], used for comparing boards
}

what to put in the evaluation?

Chellapilla and Fogel’s evaluation

no knowledge of checkers rules: the board generates legal moves and updates the state

no 'look-ahead' to future moves

implicitly a pattern recognition system (a neural network)

BUT only relational patterns predefined - see diagram

Neural network

[diagram: input layer, hidden layer(s), output layer]

each hidden node is a weighted sum of the inputs, e.g. h1 = w11·i1 + w12·i2 + w13·i3

each output is a weighted sum of the hidden-layer activations g, e.g. o1 = w31·g1 + w32·g2 + w33·g3

Neural network agent

32 inputs - board state

1st hidden layer: 91 nodes, one per subsquare of the board: 3x3 squares (36), 4x4 (25), 5x5 (16), 6x6 (9), 7x7 (4), 8x8 (1)

2nd hidden layer: 40 nodes

3rd hidden layer: 10 nodes

1 output - evaluation of the board, in [-1, 1]

parameters - the weights of the neural network, plus K (value of king)
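A quick check (my own illustration) that the 91 first-layer nodes match the subsquare count of an 8 x 8 board, since there are (9 - k)^2 distinct k x k subsquares:

counts = {k: (9 - k) ** 2 for k in range(3, 9)}
print(counts)                 # {3: 36, 4: 25, 5: 16, 6: 9, 7: 4, 8: 1}
print(sum(counts.values()))   # 91 -> size of the first hidden layer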

Genetic algorithm

population: 15 player agents

one child agent from each parent; variation by random changes in K and the weights

fitness based on score in a tournament among the 30 agents (win: 1, loss: -2, draw: 0)

agents with the best 15 scores selected for the next generation

after 840 generations, an effective checkers player in competition against humans
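A sketch of the (15 + 15) loop described above, not Chellapilla and Fogel's actual code; mutate() and play_tournament() are assumed helpers:

def evolve(parents, generations=840):
    for _ in range(generations):
        children = [mutate(p) for p in parents]       # one child per parent: vary K and weights
        population = parents + children               # 30 agents
        scores = play_tournament(population)          # list of totals; win: 1, loss: -2, draw: 0
        order = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
        parents = [population[i] for i in order[:15]]  # keep the 15 best scorers
    return parents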