Reinforcement Learning Presentation

Markov Games as a Framework for Multi-agent Reinforcement Learning

Mike L. Littman

Jinzhong Niu

March 30, 2004

Overview

An MDP can describe only single-agent environments.

A new mathematical framework is needed to support multi-agent reinforcement learning: Markov games.

A single step in this direction is explored: 2-player zero-sum Markov games.

Definitions

Markov Decision Process (MDP)

Is the 2-Player Markov Game (2P-MG) Capable Enough?

Yes. It generalizes:

MDPs (when |O|=1): the opponent has a constant behavior, which may be viewed as part of the environment.

Matrix games (when |S|=1): the environment holds no information, and rewards are decided entirely by the actions.

However, it precludes cooperation.

Matrix Games

Example – “rock, paper, scissors”
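The payoff matrix from this slide did not survive in the transcript. As a minimal sketch, the standard rock-paper-scissors payoffs can be written as a matrix and used to evaluate a policy against each opponent action. The names `PAYOFF` and `expected_rewards` are illustrative assumptions, not taken from the slides.

```python
# Rock-paper-scissors as a matrix game. All names here are illustrative.
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[a][o]: reward to the agent when it plays a and the opponent plays o.
# Win = +1, loss = -1, tie = 0 (the usual convention, assumed here).
PAYOFF = [
    [0, -1, 1],   # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]


def expected_rewards(policy):
    """Expected reward of a (possibly mixed) policy against each opponent action."""
    n = len(ACTIONS)
    return [sum(policy[a] * PAYOFF[a][o] for a in range(n)) for o in range(n)]


always_rock = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(expected_rewards(always_rock))  # rock does badly if the opponent plays paper
print(expected_rewards(uniform))      # the uniform policy is not exploitable
```

Because the game is zero-sum, the opponent's reward is the negative of each entry.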

What exactly does ‘optimality’ mean?

MDP: A stationary, deterministic, and undominated optimal policy always exists.

MG: The performance of a policy depends on the opponent’s policy, so policies cannot be evaluated without context.

New definition of ‘optimality’ from game theory: an optimal policy is one whose worst-case performance is at least as good as that of any other policy.

At least one optimal policy exists, but it may or may not be deterministic, because the agent is uncertain of its opponent’s move.

Finding Optimal Policy - Matrix Games

An optimal agent’s minimum expected reward should be as large as possible.

Let V denote this minimum value, then consider how to maximize it.
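The formula from this slide is missing from the transcript. One common way to write the idea down is as a linear program over the agent's mixed policy. The symbols below are assumptions for illustration: \pi_a is the probability of playing action a, R_{a,o} is the agent's reward when it plays a and the opponent plays o, and A and O are the two action sets.

```latex
\begin{aligned}
\text{maximize}\quad & V \\
\text{subject to}\quad & \sum_{a \in A} \pi_a R_{a,o} \ge V
    \quad \text{for every opponent action } o \in O, \\
& \sum_{a \in A} \pi_a = 1, \qquad \pi_a \ge 0 \quad \text{for every } a \in A.
\end{aligned}
```

For rock, paper, scissors with payoffs +1, 0, and -1, the solution is the uniform policy \pi = (1/3, 1/3, 1/3) with V = 0.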

Finding Optimal Policy – 2P-MG

Value of a state

Quality of an s-a-o (state, action, opponent action) triple
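The equations for these two quantities are not in the transcript. A sketch of how they are usually written for a 2-player zero-sum Markov game follows. The symbols R (reward function), T (transition function), \gamma (discount factor), and PD(A) (the set of probability distributions over the agent's actions A) are assumed notation, not recovered from the slide.

```latex
\begin{aligned}
V(s) &= \max_{\pi \in PD(A)} \; \min_{o \in O} \; \sum_{a \in A} Q(s, a, o)\, \pi_a \\
Q(s, a, o) &= R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s')\, V(s')
\end{aligned}
```

With |O| = 1 the min disappears and the first line reduces to the familiar MDP form V(s) = \max_a Q(s, a).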

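[Figure: diagram relating V(s) to the Q(s,a,o) values. Labels: agent choices (s,a1), (s,a2), (s,a3); opponent actions o1, o2, o3; leaf values Q(s,a1,o1), Q(s,a2,o2), Q(s,a3,o3); intermediate value V(s,o2); min over the opponent's actions.]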

Learning Optimal Policies

Q-learning

minimax-Q learning
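The update rules for the two methods are missing from the transcript. The sketch below shows one plausible tabular form of each, assuming a fixed learning rate `ALPHA`, a discount factor `GAMMA`, and tables initialized to zero (all assumed values). The matrix-game solver is passed in as `solve_game`, a hypothetical callable that returns a policy and a value. The stand-in `solve_single_column` is exact only when the opponent has a single action, which is the |O|=1 case where a Markov game reduces to an MDP.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def q_learning_update(Q, V, s, a, r, s_next, actions):
    """Single-agent Q-learning: V(s) is a max over the agent's own actions."""
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * V[s_next])
    V[s] = max(Q[(s, b)] for b in actions)


def minimax_q_update(Q, V, s, a, o, r, s_next, actions, opp_actions, solve_game):
    """Minimax-Q: Q is indexed by (s, a, o); V(s) is the value of the
    matrix game whose rows are agent actions and columns opponent actions."""
    Q[(s, a, o)] = (1 - ALPHA) * Q[(s, a, o)] + ALPHA * (r + GAMMA * V[s_next])
    matrix = [[Q[(s, b, p)] for p in opp_actions] for b in actions]
    policy, V[s] = solve_game(matrix)
    return policy


def solve_single_column(matrix):
    """Stand-in solver, exact ONLY for |O| = 1, where max-min is a plain max."""
    assert all(len(row) == 1 for row in matrix)
    best = max(range(len(matrix)), key=lambda b: matrix[b][0])
    policy = [1.0 if b == best else 0.0 for b in range(len(matrix))]
    return policy, matrix[best][0]


# With a single opponent action, both rules produce the same value.
actions = [0, 1]
Q1, V1 = defaultdict(float), defaultdict(float)
Q2, V2 = defaultdict(float), defaultdict(float)
q_learning_update(Q1, V1, "s", 1, 1.0, "t", actions)
policy = minimax_q_update(Q2, V2, "s", 1, 0, 1.0, "t", actions, [0], solve_single_column)
print(V1["s"], V2["s"], policy)
```

For a real 2-player game, `solve_game` would need to solve for the best mixed policy, for example with a linear programming routine.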

Minimax-Q Algorithm

Experiment - Problem

Soccer

Experiment - Training

Four agents, each trained for 10^6 steps:

minimax-Q learning

vs. random opponent - MR

vs. itself - MM

Q-learning

vs. random opponent - QR

vs. itself - QQ

Experiment - Testing

Test 1: QR > MR?

Test 2: QR << QQ?

Test 3: QR, QQ – 100% loser?

Contributions

A solution to 2-player zero-sum Markov games: a modified Q-learning method in which minimax replaces max.

Minimax can also be used in single-agent environments to avoid risky behavior.

Future work

Possible performance improvements to the minimax-Q learning method

Linear programming incurs a large computational cost.

Iterative methods may yield approximate minimax solutions much faster, and an approximate solution may be good enough.
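The slides do not say which iterative method is meant. As one illustration only, fictitious play is a classic iterative scheme for zero-sum matrix games: each player repeatedly best-responds to the other's observed action frequencies, and the frequencies approach a minimax solution. The function name, iteration count, and initial counts below are assumptions.

```python
def fictitious_play(payoff, iterations=20000):
    """Approximate the agent's minimax policy for a zero-sum matrix game.

    payoff[a][o] is the agent's reward. Returns the empirical policy and the
    reward it guarantees against the worst-case opponent action.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = 1  # arbitrary initial play (assumed)
    col_counts[0] = 1

    for _ in range(iterations):
        # Agent best-responds to the opponent's action counts so far.
        row_scores = [sum(payoff[a][o] * col_counts[o] for o in range(n_cols))
                      for a in range(n_rows)]
        best_row = max(range(n_rows), key=row_scores.__getitem__)
        # Opponent best-responds (minimizing) to the agent's action counts.
        col_scores = [sum(payoff[a][o] * row_counts[a] for a in range(n_rows))
                      for o in range(n_cols)]
        best_col = min(range(n_cols), key=col_scores.__getitem__)
        row_counts[best_row] += 1
        col_counts[best_col] += 1

    total = sum(row_counts)
    policy = [c / total for c in row_counts]
    worst_case = min(sum(policy[a] * payoff[a][o] for a in range(n_rows))
                     for o in range(n_cols))
    return policy, worst_case


# Rock-paper-scissors: the result should be near uniform with value near 0.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
policy, value = fictitious_play(RPS)
print(policy, value)
```

Each iteration costs only a few multiplications, but convergence is slow, so the result is an approximation rather than the exact linear programming solution.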