Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to...

Multi-Agent Reinforcement Learning: An Overview. Marcello Restelli, November 12, 2014

Page 1:

Multi-Agent Reinforcement Learning: An Overview

Marcello Restelli

November 12, 2014

Page 2:

Outline

1 Introduction to Multi-Agent Reinforcement Learning
   Reinforcement Learning
   MARL vs RL
   MARL vs Game Theory
2 MARL algorithms
   Best-Response Learning
   Equilibrium Learners
   Team Games
   Zero-sum Games
   General-sum Games


Page 5:

Game Theory in Computer Science

[Diagram: the intersection of Game Theory and Computer Science]

Computing Solution Concepts

Compact Game Representations

Mechanism Design

Multi-agent Learning

Page 6:

Some Naming Conventions

Player = Agent
Payoff = Reward
Value = Utility
Matrix = Strategic form = Normal form
Strategy = Policy
Pure strategy = Deterministic policy
Mixed strategy = Stochastic policy



Page 17:

What is Multi-Agent Learning?

A difficult question... we will try to answer it in these slides.

It involves:
multiple agents
self-interest
concurrent learning

It is strictly related to: Game Theory, Reinforcement Learning, and Multi-agent Systems.

Shoham et al. (2002-2007): "If multi-agent is the answer, what is the question?"
Stone (2007): "Multi-agent learning is not the answer, it is the question!"


Page 27:

Which Applications?

Distributed vehicle regulation
Air traffic control
Network management and routing
Electricity distribution management
Supply chains
Job scheduling
Computer games

Page 28:

Multi-agent Learning and RL

We are interested in learning in situations where multiple decision makers repeatedly interact.
Among the different machine learning paradigms, reinforcement learning is the best suited to approach such problems.
We will mainly focus on multi-agent RL, even if other (game-theoretic) learning approaches will be mentioned:
Fictitious play
No-regret learning
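One of these game-theoretic approaches can be made concrete in a few lines. Below is a minimal sketch of fictitious play on matching pennies; the function names and the payoff matrices are our own illustration, not taken from the slides. Each player repeatedly best-responds to the empirical frequency of the opponent's past actions.

```python
def best_response(payoff, opp_counts):
    """Index of the action maximizing expected payoff against the
    opponent's empirical action frequencies."""
    total = sum(opp_counts)
    expected = [sum(row[j] * opp_counts[j] / total for j in range(len(opp_counts)))
                for row in payoff]
    return max(range(len(expected)), key=expected.__getitem__)

def fictitious_play(payoff_a, payoff_b, n_rounds=20000):
    """Fictitious play: each player best-responds to the opponent's
    empirical action frequencies observed so far."""
    # Rows of payoff_a are A's actions; transpose B's payoffs so rows are B's actions.
    payoff_b_t = [list(col) for col in zip(*payoff_b)]
    counts_a = [1] * len(payoff_a)    # times A played each action (with a unit prior)
    counts_b = [1] * len(payoff_b_t)  # times B played each action (with a unit prior)
    for _ in range(n_rounds):
        a = best_response(payoff_a, counts_b)
        b = best_response(payoff_b_t, counts_a)
        counts_a[a] += 1
        counts_b[b] += 1
    return ([c / sum(counts_a) for c in counts_a],
            [c / sum(counts_b) for c in counts_b])

# Matching pennies: zero-sum game whose only equilibrium is mixed (1/2, 1/2)
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
freq_a, freq_b = fictitious_play(A, B)
```

In two-player zero-sum games such as this one, the empirical frequencies of fictitious play are known to converge to a mixed equilibrium (Robinson, 1951), so `freq_a` and `freq_b` both approach (0.5, 0.5).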


Page 33:

Outline

1 Introduction to Multi-Agent Reinforcement Learning
   Reinforcement Learning
   MARL vs RL
   MARL vs Game Theory
2 MARL algorithms
   Best-Response Learning
   Equilibrium Learners
   Team Games
   Zero-sum Games
   General-sum Games


Page 39:

History of RL

Psychology, trial and error:
Pavlov (1903): classical conditioning
Thorndike (1905): law of effect
Minsky (1961): credit-assignment problem

Optimal Control:
Bellman (1957): Dynamic Programming
Howard (1960): Policy Iteration

Reinforcement Learning:
Samuel (1956): Checkers
Sutton & Barto (1984): Temporal Difference
Watkins (1989): Q-learning
Tesauro (1992): TD-Gammon
Littman (1994): minimax-Q

Page 40:

The Agent-Environment Interface

The agent interacts with the environment at discrete time steps t = 0, 1, 2, ...
Full observability: the agent directly observes the environment state.
Formally, this setting is a Markov Decision Process (MDP).
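This interaction protocol can be sketched as a simple loop. The toy environment and random policy below are hypothetical placeholders, invented only to show the interface: at each step the agent observes the state, picks an action, and receives the next state and a reward.

```python
import random

class CoinFlipEnv:
    """Toy fully observable environment: the state is the last coin
    outcome, and guessing the next outcome correctly pays 1."""
    def __init__(self):
        self.state = "heads"

    def step(self, action):
        outcome = random.choice(["heads", "tails"])
        reward = 1.0 if action == "guess_" + outcome else 0.0
        self.state = outcome  # fully observable: the agent sees the new state
        return self.state, reward

def random_policy(state):
    # A placeholder policy that ignores the state
    return random.choice(["guess_heads", "guess_tails"])

env = CoinFlipEnv()
state = env.state
total_reward = 0.0
for t in range(1000):                  # discrete time steps t = 0, 1, 2, ...
    action = random_policy(state)      # agent chooses an action from the state
    state, reward = env.step(action)   # environment returns next state, reward
    total_reward += reward
```

Since the random policy guesses correctly half the time, `total_reward` comes out near 500 over 1000 steps.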


Page 43:

Markov Decision Processes

An MDP is formalized as a 4-tuple ⟨S, A, P, R⟩:
S: set of states (what the agent knows; complete observability)
A: set of actions (what the agent can do; it may depend on the state)
P: state transition model, P : S × A × S → [0, 1]
R: reward function, R : S × A × S → ℝ
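The tuple can be written down directly in code. The sketch below encodes a tiny two-state MDP as plain dictionaries; the states, actions, and every numeric value are invented for illustration, and a sanity check verifies that each P(· | s, a) is a proper probability distribution.

```python
# A tiny illustrative MDP <S, A, P, R> (all names and values are made up).
S = ["cool", "hot"]
A = ["work", "rest"]

# P[(s, a)] maps each next state to its probability: S x A x S -> [0, 1]
P = {
    ("cool", "work"): {"cool": 0.7, "hot": 0.3},
    ("cool", "rest"): {"cool": 1.0, "hot": 0.0},
    ("hot",  "work"): {"cool": 0.1, "hot": 0.9},
    ("hot",  "rest"): {"cool": 0.6, "hot": 0.4},
}

# R[(s, a, s')] is the reward for the transition: S x A x S -> R.
# Here: working earns 2, and ending up "hot" costs 1.
R = {(s, a, s2): (2.0 if a == "work" else 0.0) - (1.0 if s2 == "hot" else 0.0)
     for s in S for a in A for s2 in S}

# Sanity check: each transition distribution P(.|s, a) must sum to 1
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Writing the model out this way makes the types on the slide concrete: P assigns a probability to every (s, a, s') triple, while R assigns it a real-valued reward.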


Page 50: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Markov Decision Processes

An MDP is formalized as a 4-tuple: 〈S,A,P,R〉S: set of states

What the agent knows (complete observability)A: set of actions

What the agent can do (it may depend on state)P: state transition model

S × A× S → [0,1]

R: reward functionS × A× S → R

Page 51: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Markov Decision Processes

An MDP is formalized as a 4-tuple: 〈S,A,P,R〉S: set of states

What the agent knows (complete observability)A: set of actions

What the agent can do (it may depend on state)P: state transition model

S × A× S → [0,1]

R: reward functionS × A× S → R


Markov Assumption

Let s_t be a random variable for the state at time t:

P(s_t | a_{t−1}, s_{t−1}, …, a_0, s_0) = P(s_t | a_{t−1}, s_{t−1})

The Markov property is a special kind of conditional independence: the future is independent of the past given the current state.
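The assumption is visible in code: a trajectory sampler only ever needs the current state and action, never the history. A small sketch (the transition model and the fixed action are hypothetical):

```python
import random

# Sampling a trajectory where the next state depends only on the
# current state and action: the Markov assumption in action.
P = {
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "go"): {"s0": 0.9, "s1": 0.1},
}

def step(s, a, rng):
    # The earlier history never enters this call: P(s_t | a_{t-1}, s_{t-1}).
    dist = P[(s, a)]
    states, probs = zip(*dist.items())
    return rng.choices(states, weights=probs)[0]

rng = random.Random(0)
s = "s0"
trajectory = [s]
for _ in range(5):
    s = step(s, "go", rng)
    trajectory.append(s)
print(trajectory)
```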



The Goal: a Policy

The goal is to find a policy that maximizes some cumulative function of the rewards.

What is a policy? A mapping from states to distributions over actions:

deterministic vs stochastic
stationary vs non-stationary

Cost criteria:

finite horizon
infinite horizon: average reward, or discounted reward

R_t = Σ_{k=0}^{∞} γ^k r_{t+k}
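For a finite reward sequence, the discounted return can be computed with the backward recursion G_t = r_t + γ G_{t+1}; a minimal sketch:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k} for a finite reward
# sequence (a truncation of the infinite sum on the slide).
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```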



Value Functions

MDP + stationary policy ⇒ Markov chain.

Given a policy π, it is possible to define the utility of each state (policy evaluation).

Value function (Bellman equation):

V^π(s) = Σ_{a∈A} π(a|s) Σ_{s′∈S} P(s′|s,a) (R(s,a,s′) + γ V^π(s′))

For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.

Action-value function (Bellman equation):

Q^π(s,a) = Σ_{s′∈S} P(s′|s,a) (R(s,a,s′) + γ V^π(s′))
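The Bellman equation for V^π can be solved by iterating it as a fixed-point update (iterative policy evaluation). A sketch on a hypothetical two-state MDP with a uniform-random policy:

```python
# Iterative policy evaluation: apply the Bellman equation for V^pi
# as a fixed-point update until the values stop changing.
# The two-state MDP and the uniform-random policy are hypothetical.
S = ["s0", "s1"]
A = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

def R(s, a, s2):
    return 1.0 if (s, a, s2) == ("s0", "go", "s1") else 0.0

def policy(a, s):
    # pi(a|s): uniform over actions in every state
    return 1.0 / len(A)

def evaluate(gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v = sum(policy(a, s)
                    * sum(p * (R(s, a, s2) + gamma * V[s2])
                          for s2, p in P[(s, a)].items())
                    for a in A)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = evaluate()
print(V)
```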



Optimal Value Functions

Optimal Bellman equations (Bellman, 1957):

V∗(s) = max_a Σ_{s′∈S} P(s′|s,a) (R(s,a,s′) + γ V∗(s′))

Q∗(s,a) = Σ_{s′∈S} P(s′|s,a) (R(s,a,s′) + γ max_{a′} Q∗(s′,a′))

For each MDP there is at least one deterministic optimal policy.
All optimal policies have the same value function V∗.
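Turning the optimality equation into an algorithm gives value iteration: apply the max-backup until convergence, then read off a greedy policy. A sketch on a hypothetical two-state MDP:

```python
# Value iteration: the Bellman optimality equation as a fixed-point
# update, followed by greedy policy extraction. The MDP is hypothetical.
S = ["s0", "s1"]
A = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

def R(s, a, s2):
    return 1.0 if (s, a, s2) == ("s0", "go", "s1") else 0.0

def value_iteration(gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v = max(sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in A)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # Greedy policy extraction: pi*(s) = argmax_a Q*(s, a)
    pi = {s: max(A, key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                      for s2, p in P[(s, a)].items()))
          for s in S}
    return V, pi

V, pi = value_iteration()
```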



Solving an MDP

Policy search:
brute force is infeasible (there are |A|^|S| deterministic policies)
policy gradient and stochastic optimization approaches

Dynamic Programming (DP):
Value Iteration
Policy Iteration

Linear Programming:
LP worst-case convergence guarantees are better than those of DP methods
LP methods become impractical at a much smaller number of states than DP methods do
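Of the DP methods listed above, policy iteration alternates evaluation of the current policy with greedy improvement, and stops when the policy is stable. A sketch on a hypothetical two-state MDP:

```python
# Policy iteration: evaluate the current deterministic policy, then
# improve it greedily; stop when the policy no longer changes.
# The two-state MDP below is hypothetical.
S = ["s0", "s1"]
A = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

def R(s, a, s2):
    return 1.0 if (s, a, s2) == ("s0", "go", "s1") else 0.0

def q(s, a, V, gamma):
    # One-step lookahead: Q(s, a) under the current value estimate V
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P[(s, a)].items())

def policy_iteration(gamma=0.9, tol=1e-9):
    pi = {s: "stay" for s in S}          # arbitrary initial policy
    while True:
        # Policy evaluation (iterative, to tolerance)
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v = q(s, pi[s], V, gamma)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to V
        new_pi = {s: max(A, key=lambda a: q(s, a, V, gamma)) for s in S}
        if new_pi == pi:
            return V, pi
        pi = new_pi

V, pi = policy_iteration()
```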



Dynamic Programming

Dynamic Programming (DP) is a collection of algorithms for solving problems that exhibit optimal substructure.
When the transition model and the reward function are known, (offline) DP algorithms can be used to solve MDPs.

Requires complete knowledge of the model
Computationally expensive

RL algorithms have been derived from DP algorithms.



RL vs DP

RL methods are used when the transition model or the reward function is unknown.
Through repeated interactions, the agent estimates the utility of each state.

Two approaches:
Model-based
Model-free (e.g., Q-learning)
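A model-free sketch of tabular Q-learning on a hypothetical environment: the agent only observes sampled transitions (s, a, r, s′) and never queries the model itself.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration. The environment
# dynamics below are hypothetical and hidden from the learner: the agent
# only sees sampled (s, a, r, s') transitions, never P or R.
def env_step(s, a):
    # Hidden dynamics: "go" flips the state; reaching s1 from s0 pays 1.
    s2 = ("s1" if s == "s0" else "s0") if a == "go" else s
    r = 1.0 if (s, a, s2) == ("s0", "go", "s1") else 0.0
    return s2, r

S, A = ["s0", "s1"], ["stay", "go"]
gamma, alpha, eps = 0.9, 0.1, 0.1
Q = {(s, a): 0.0 for s in S for a in A}
rng = random.Random(0)

s = "s0"
for _ in range(20000):
    # epsilon-greedy action selection
    if rng.random() < eps:
        a = rng.choice(A)
    else:
        a = max(A, key=lambda a_: Q[(s, a_)])
    s2, r = env_step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    target = r + gamma * max(Q[(s2, a2)] for a2 in A)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2

greedy = {s_: max(A, key=lambda a_: Q[(s_, a_)]) for s_ in S}
```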


Page 94: Q-learning (Watkins, '89)

Q-learning (Watkins, '89)

Q-learning is the most popular RL algorithm:

  Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α (r_t + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t))

or, equivalently,

  Q_{t+1}(s_t, a_t) = (1 − α) Q_t(s_t, a_t) + α (r_t + γ max_a Q_t(s_{t+1}, a))

Off-policy TD algorithm
Simple to implement
If all state-action pairs are tried infinitely often and the learning rate decreases appropriately, Q-learning converges to the optimal solution
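The update above can be sketched as tabular Q-learning on a toy two-state chain MDP (the MDP, the state/action names, and the hyperparameters below are illustrative assumptions, not from the slides):

```python
import random
from collections import defaultdict

# Tabular Q-learning, written in the (1 - alpha)/alpha form of the slide.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Toy chain MDP: from state 0, 'go' leads to state 1 (reward 0);
# state 1 is absorbing and pays reward 1 on every step.
actions = ['stay', 'go']

def step(s, a):
    if s == 0:
        return (1, 0.0) if a == 'go' else (0, 0.0)
    return (1, 1.0)

random.seed(0)
Q = defaultdict(float)
for episode in range(500):
    s = 0
    for _ in range(20):
        a = random.choice(actions)  # purely random exploration suffices here
        s_next, r = step(s, a)
        q_update(Q, s, a, r, s_next, actions)
        s = s_next

# Values approach the fixed point: Q(1, .) -> 1 / (1 - gamma) = 10,
# Q(0, 'go') -> gamma * 10 = 9, Q(0, 'stay') -> gamma * 9 = 8.1.
```

Pure exploration still works here because Q-learning is off-policy: the max in the target evaluates the greedy policy regardless of the behavior policy.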

Page 100: Advanced Topics in RL

Advanced Topics in RL

High-dimensional problems
Continuous MDPs
Partially observable MDPs
Multi-objective MDPs
Inverse RL
Transfer of knowledge
Exploration vs exploitation
Multi-agent learning

Page 101: Exploration vs Exploitation

Exploration vs Exploitation

To accumulate high rewards, an agent needs to exploit actions that have been tried in the past and are known to be effective...
... but it also has to explore new actions in order to improve.
The dilemma is that both exploration and exploitation are necessary.
Many techniques have been studied:

  ε-greedy
  Boltzmann (softmax): π(s, a) = e^{Q(s,a)/T} / Σ_{a'∈A} e^{Q(s,a')/T}
  More efficient techniques (multi-armed bandits)
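The two exploration rules above can be sketched as action-selection functions (the dictionary-of-Q-values interface is an assumption made for illustration):

```python
import math
import random

# Q maps actions to estimated values; both selectors are stateless sketches.

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def boltzmann(Q, T=1.0):
    """Sample from pi(a) = exp(Q(a)/T) / sum_a' exp(Q(a')/T)."""
    actions = list(Q)
    weights = [math.exp(Q[a] / T) for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {'left': 0.0, 'right': 1.0}
# The temperature T controls the trade-off: T -> 0 recovers greedy
# selection, while T -> infinity approaches the uniform distribution.
```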

Page 108: Outline

Outline

1 Introduction to Multi-Agent Reinforcement Learning
    Reinforcement Learning
    MARL vs RL
    MARL vs Game Theory

2 MARL algorithms
    Best-Response Learning
    Equilibrium Learners
      Team Games
      Zero-sum Games
      General-sum Games

Page 109: How can RL be extended to MAS?

How can RL be extended to MAS?

RL research is mainly focused on single-agent learning.
We need to extend the MDP framework in order to consider other agents with possibly different reward functions.
So we will need to resort to game-theoretic concepts.

Page 112: Multi-agent vs Single-agent

Multi-agent vs Single-agent

When do we have a multi-agent learning (MAL) problem?
  When there are multiple concurrent learners
  More precisely, when some agents' policies depend on other agents' past actions

MAL is much more difficult than single-agent learning. Why?
  Problem dimensions typically grow with the number of agents
  Non-stationarity
  "Optimal" policies can be stochastic
  Learning cannot be separated from teaching
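The non-stationarity point can be made concrete: when two learners adapt to each other, neither faces a fixed environment. A minimal sketch with two independent stateless ε-greedy Q-learners playing repeated Matching Pennies (the setup and hyperparameters are illustrative assumptions, not from the slides):

```python
import random

# Two independent stateless Q-learners play repeated Matching Pennies.
# From learner 1's perspective the "environment" contains learner 2,
# whose policy keeps changing, so the greedy action never settles.

ACTIONS = ['heads', 'tails']

def select(Q, eps=0.2):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(Q, key=Q.get)

random.seed(0)
Q1 = {a: 0.0 for a in ACTIONS}
Q2 = {a: 0.0 for a in ACTIONS}
alpha = 0.1
greedy_history = []
for t in range(4000):
    a1, a2 = select(Q1), select(Q2)
    r1 = 1.0 if a1 == a2 else -1.0      # the matcher (player 1) wins on a match
    Q1[a1] += alpha * (r1 - Q1[a1])     # stateless Q-learning updates
    Q2[a2] += alpha * (-r1 - Q2[a2])    # zero-sum: player 2 receives -r1
    greedy_history.append(max(Q1, key=Q1.get))

# The learners chase each other in a cycle, so player 1's greedy action
# flips many times -- single-agent convergence arguments no longer apply.
flips = sum(1 for x, y in zip(greedy_history, greedy_history[1:]) if x != y)
```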

Page 120: Multi-agent vs Single-agent

Multi-agent vs Single-agent

What is the goal? What do the agents have to learn?
It actually depends on the learning strategy adopted by the other agents:
  Best response
  Equilibrium

No learning procedure is optimal against all possible opponent behaviors:
  Self-play
  Targeted optimality

Desirable properties for learning strategies:
  Safety
  Rationality
  Universal consistency / no-regret
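The no-regret property compares the learner's cumulative payoff to the best fixed action in hindsight. A small sketch of computing external regret (the function name and the example round data are assumptions for illustration):

```python
# External regret after T rounds: the payoff of the best fixed action in
# hindsight minus the payoff the learner actually collected. A strategy is
# universally consistent (no-regret) if this gap grows sublinearly in T.

def external_regret(actions, reward_fn, rewards):
    """reward_fn(a, t): what fixed action a would have earned at round t;
    rewards[t]: what the learner actually earned at round t."""
    T = len(rewards)
    best_fixed = max(sum(reward_fn(a, t) for t in range(T)) for a in actions)
    return best_fixed - sum(rewards)

# Example: the opponent plays 'heads' every round while our learner alternates.
opponent = ['heads'] * 10

def payoff(a, t):
    return 1.0 if a == opponent[t] else -1.0

played = ['heads', 'tails'] * 5
rewards = [payoff(played[t], t) for t in range(10)]
# Best fixed action ('heads') earns 10; alternating earned 0, so regret is 10.
```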

Page 131: Outline

Outline

1 Introduction to Multi-Agent Reinforcement Learning
    Reinforcement Learning
    MARL vs RL
    MARL vs Game Theory

2 MARL algorithms
    Best-Response Learning
    Equilibrium Learners
      Team Games
      Zero-sum Games
      General-sum Games

Page 132: Matrix Games

Matrix Games

A matrix (or strategic) game is a tuple ⟨n, A, R⟩:
  n: number of players
  A: joint action space; A_i is the set of actions of player i
  R: vector of reward functions; R_i is the reward function of player i

Matrix games are one-shot games.
Learning requires repeated interactions:
  Repeated games
  Stochastic games
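The tuple ⟨n, A, R⟩ can be written down directly as data. A minimal sketch using the standard Prisoner's Dilemma payoffs (the game instance is a textbook example, not taken from the slides):

```python
# A two-player matrix game <n, A, R> as plain data structures.
n = 2
A = [['C', 'D'], ['C', 'D']]                 # A[i]: action set of player i
R = {                                        # R[(a1, a2)] = (r1, r2)
    ('C', 'C'): (-1, -1), ('C', 'D'): (-3, 0),
    ('D', 'C'): (0, -3),  ('D', 'D'): (-2, -2),
}

def best_response(i, opponent_action):
    """Player i's pure best response to a fixed opponent action."""
    def payoff(a):
        joint = (a, opponent_action) if i == 0 else (opponent_action, a)
        return R[joint][i]
    return max(A[i], key=payoff)

# In the Prisoner's Dilemma, 'D' is a best response to every opponent
# action, so ('D', 'D') is the unique equilibrium of the one-shot game.
```

Best-response learners estimate such responses from repeated play instead of reading them off a known R.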


A Special Case: Repeated Games

In repeated games, the same one-shot game (called the stage game) is repeatedly played
E.g., the iterated prisoner's dilemma
Infinitely vs finitely repeated games; really, an extensive-form game
Subgame-perfect (SP) equilibria: one SP equilibrium is to repeatedly play some Nash equilibrium of the stage game (a stationary strategy)
Are other equilibria possible?
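A minimal sketch of repeated interaction on a stage game, assuming prisoner's-dilemma payoffs: tit-for-tat (a non-stationary strategy that conditions on the opponent's last move) playing against always-defect. The payoff table and strategy names are illustrative.

```python
# Iterated prisoner's dilemma: the same stage game played for `rounds` stages.
# R[(a1, a2)] = (reward of player 1, reward of player 2).
R = {("C", "C"): (-1, -1), ("C", "D"): (-3, 0),
     ("D", "C"): (0, -3), ("D", "D"): (-2, -2)}

def play(rounds):
    total1 = total2 = 0
    last_opponent = "C"            # tit-for-tat starts by cooperating
    for _ in range(rounds):
        a1 = last_opponent         # tit-for-tat copies the opponent's last move
        a2 = "D"                   # the opponent always defects
        r1, r2 = R[(a1, a2)]
        total1, total2 = total1 + r1, total2 + r2
        last_opponent = a2
    return total1, total2

print(play(5))  # (-11, -8): one exploited round, then mutual defection
```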


Folk Theorem

The theorem characterizes attainable payoffs, not strategy profiles. Informally:

"In an infinitely repeated game, the set of average rewards attainable in equilibrium are precisely those attainable under mixed strategies in the single-stage game, with the constraint on the mixed strategies that each player's payoff is at least the amount he would receive if the other players adopted minimax strategies against him"
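The minimax level in the theorem's constraint can be sketched for a small game. Restricting the opponent to pure strategies for brevity (with mixed strategies the guaranteed level can differ), player 1's security level in an illustrative 2x2 game is:

```python
# Pure-strategy minimax (security) level of player 1 in a 2x2 game.
# R1[a1][a2] = player 1's reward under joint action (a1, a2); values illustrative.
R1 = [[3, 0],
      [1, 2]]

# For each of player 1's actions the opponent minimizes his reward;
# player 1 then picks the action with the best such guarantee.
minimax_level = max(min(row) for row in R1)
print(minimax_level)  # 1
```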


MDP + Matrix = Stochastic Games (SGs)

A stochastic (or Markov) game is a tuple 〈n, S, A, P, R〉:
n: number of players
S: set of states
A: joint action space, A1 × · · · × An
P: state transition model
R: vector of reward functions, one for each agent

An SG extends an MDP to multiple agents; SGs with one state are called repeated games
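The tuple 〈n, S, A, P, R〉 can be sketched directly as a data structure; the field layout and the tiny one-state game below are illustrative, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class StochasticGame:
    n: int        # number of players
    S: list       # set of states
    A: list       # A[i] = action set of player i
    P: dict       # P[(s, joint_action)] = {next_state: probability}
    R: list       # R[i][(s, joint_action)] = reward of player i

# A one-state SG: per the slide, this degenerates to a repeated game.
game = StochasticGame(
    n=2,
    S=["s0"],
    A=[["a", "b"], ["a", "b"]],
    P={("s0", ("a", "a")): {"s0": 1.0}},
    R=[{("s0", ("a", "a")): 1.0}, {("s0", ("a", "a")): 1.0}],
)
print(game.n, len(game.S))  # 2 1
```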


Strategies in SG

Let ht = (s0, a0, s1, a1, . . . , st−1, at−1, st) denote a history of t stages of a stochastic game. The space of possible strategies is huge, but there are interesting restrictions:

Behavioral strategy: returns the probability of playing an action given a history ht
Markov strategy: a behavioral strategy in which the distribution over actions depends only on the current state
Stationary strategy: a time-independent Markov strategy
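The three restrictions can be contrasted with sketch functions over a history ht; the state names and decision rules below are purely illustrative.

```python
# Histories are lists alternating states and actions, ending in a state:
# ["s0", "a0", "s1", "a1", ..., "st"].

def behavioral(history):
    """Behavioral: may condition on the entire history h_t."""
    return "a" if len(history) % 4 == 1 else "b"

def markov(history, t):
    """Markov: conditions only on the current state (may still depend on t)."""
    state = history[-1]
    return "a" if state == "s0" and t < 10 else "b"

def stationary(history):
    """Stationary: time-independent Markov; only the current state matters."""
    state = history[-1]
    return "a" if state == "s0" else "b"

h = ["s0", "a", "s1"]          # h_1 = (s0, a0, s1)
print(stationary(h))           # b  (the current state is s1)
```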


Equilibria in SG

Markov-perfect equilibrium: a profile of Markov strategies that yields a Nash equilibrium in every proper subgame
Every n-player, general-sum, discounted-reward stochastic game has a Markov-perfect equilibrium


Stochastic Games: Example

[Grid-world figure] Reward: +100 when the goal is reached, −1 on collision, 0 otherwise

Some solutions: [shown graphically on the slide]


Summary: Problems

States \ Agents    Single          Multiple
Single             Optimization    Matrix Game
Multiple           MDP             Stochastic Game


Summary: Learning

States \ Agents    Single                   Multiple
Single             Multi-Armed Bandit       Learning in Repeated Games
Multiple           Reinforcement Learning   Multi-Agent Learning

Page 177: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

SummaryLearning

States

Agents

Single

Multiple

Single Multiple

Multi-ArmedBandit

ReinforcementLearning

Page 178: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

SummaryLearning

States

Agents

Single

Multiple

Single Multiple

Multi-ArmedBandit

ReinforcementLearning

Learning inRepeated Games

Page 179: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

SummaryLearning

States

Agents

Single

Multiple

Single Multiple

Multi-ArmedBandit

ReinforcementLearning

Learning inRepeated Games

Multi-AgentLearning

Learning vs Game Theory

Game theory predicts which strategies rational players will play. Unfortunately, in multi-agent learning, many agents are not able to behave rationally:

The problem is unknown
Real-time constraints
Humans

In some problems, a non-equilibrium strategy is appropriate if one expects the others to play non-equilibrium strategies.

Why Learning?

If the game is known, the agent wants to learn the strategies employed by the other agents.
If the game is unknown, the agent also wants to learn the structure of the game:

Unknown payoffs
Unknown transition probabilities

Observability: do the agents see each other's actions, and/or each other's payoffs?
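One classical way to learn the strategies employed by the other agents when the game itself is known is fictitious play: keep empirical counts of the opponent's past actions and best-respond to the resulting mixed strategy. A minimal sketch under illustrative assumptions (the 2x2 payoff matrix and the observed action sequence are made up for the example):

```python
# Illustrative 2x2 game: the row player's payoffs (a coordination game).
PAYOFF = [[2.0, 0.0],
          [0.0, 1.0]]

def best_response(opponent_counts):
    """Best response to the opponent's empirical mixed strategy."""
    total = sum(opponent_counts)
    belief = [c / total for c in opponent_counts]
    expected = [sum(PAYOFF[a][b] * belief[b] for b in range(2))
                for a in range(2)]
    return max(range(2), key=lambda a: expected[a])

# Fictitious play: count the opponent's past actions and best-respond.
counts = [1, 1]                 # smoothed counts of the opponent's actions
for observed in [0, 0, 0, 1, 0, 0]:
    counts[observed] += 1
print(best_response(counts))    # the belief favours action 0, so play action 0
```

With unknown payoffs, the same scheme can be combined with estimated rewards, which is essentially what joint-action learners do.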

Desired Properties in MAL

Rationality: play a best response against stationary opponents
Convergence: play a Nash equilibrium in self-play
Safety: perform no worse than the minimax strategy
Targeted optimality: play an approximate best response against memory-bounded opponents
Cooperate and compromise: an agent must offer and accept compromises

Many algorithms have been proposed that exhibit some of these properties (WoLF, GIGA-WoLF, AWESOME, M-Qubed, ...)
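The safety property can be made concrete: in a two-player zero-sum matrix game, the minimax (maximin) strategy guarantees a payoff floor whatever the opponent does. A rough sketch that approximates the maximin mixed strategy of a 2-action row player by grid search, using matching pennies as the illustrative game:

```python
# Matching pennies: the row player's payoffs in a zero-sum game.
PAYOFF = [[1.0, -1.0],
          [-1.0, 1.0]]

def worst_case_value(p):
    """Payoff guaranteed by mixing row actions with probabilities (p, 1-p):
    the opponent picks the column that hurts us most."""
    return min(p * PAYOFF[0][c] + (1 - p) * PAYOFF[1][c] for c in range(2))

def maximin(steps=1000):
    """Grid-search approximation of the minimax (safety) strategy."""
    grid = [i / steps for i in range(steps + 1)]
    best_p = max(grid, key=worst_case_value)
    return best_p, worst_case_value(best_p)

p, v = maximin()
print(round(p, 2), round(v, 2))  # 0.5 0.0 -- mix uniformly, guarantee value 0
```

For larger games, the same maximin problem is solved exactly as a linear program, which is what Minimax-Q does at every state.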

Outline

1 Introduction to Multi-Agent Reinforcement Learning
   Reinforcement Learning
   MARL vs RL
   MARL vs Game Theory

2 MARL algorithms
   Best-Response Learning
   Equilibrium Learners
      Team Games
      Zero-sum Games
      General-sum Games

Taxonomy of MARL Algorithms: Task Type

Fully cooperative
   Static: JAL, FMQ
   Dynamic: Team-Q, Distributed-Q, OAL

Fully competitive
   Minimax-Q

Mixed
   Static: Fictitious Play, MetaStrategy, IGA, WoLF-IGA, GIGA, GIGA-WoLF, AWESOME, Hyper-Q
   Dynamic: Single-agent RL, Nash-Q, CE-Q, Asymmetric-Q, NSCP, WoLF-PHC, PD-WoLF, EXORL
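To give a flavour of the gradient-based family in the mixed category (IGA and its WoLF/GIGA variants), here is a minimal sketch of gradient ascent on mixed strategies in a 2x2 game. The payoff matrices, step size, and starting point are illustrative, and the closed-form gradients follow from differentiating the bilinear expected payoff:

```python
def row_gradient(q, R):
    """d/dp of the row player's expected payoff, p = P(row plays action 0)."""
    return q * (R[0][0] - R[1][0]) + (1 - q) * (R[0][1] - R[1][1])

def col_gradient(p, C):
    """d/dq of the column player's expected payoff, q = P(col plays action 0)."""
    return p * (C[0][0] - C[0][1]) + (1 - p) * (C[1][0] - C[1][1])

def clip(x):
    """Project a probability back onto [0, 1]."""
    return max(0.0, min(1.0, x))

# Illustrative coordination game: both players are paid for matching.
R = [[2.0, 0.0], [0.0, 1.0]]
C = [[2.0, 0.0], [0.0, 1.0]]

p, q, eta = 0.6, 0.6, 0.05      # initial mixed strategies and step size
for _ in range(500):
    # Simultaneous update: both gradients use the old (p, q).
    p, q = clip(p + eta * row_gradient(q, R)), clip(q + eta * col_gradient(p, C))
print(round(p, 2), round(q, 2))  # converges to (1.0, 1.0): both play action 0
```

In games with cyclic dynamics the plain gradient update need not converge, which is what the WoLF ("Win or Learn Fast") variable step size is designed to fix.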

Taxonomy of MARL Algorithms: Field of Origin

[Diagram: algorithms arranged between their fields of origin — Temporal Difference RL, Game Theory, and Direct Policy Search — including single-agent RL, JAL, Distributed-Q, EXORL, Hyper-Q, FMQ, CE-Q, Nash-Q, Team-Q, minimax-Q, NSCP, Asymmetric-Q, OAL, Fictitious Play, AWESOME, MetaStrategy, WoLF-PHC, PD-WoLF, IGA, WoLF-IGA, GIGA, GIGA-WoLF]

Equilibrium or not?

Why focus on equilibria?
   An equilibrium identifies conditions under which learning can or should stop
   It is easier to play an equilibrium than to keep computing

Why not focus on equilibria?
   A Nash equilibrium strategy has no prescriptive force
   There may be multiple equilibria
   Using an oracle to uniquely identify an equilibrium is "cheating"
   The opponent may not wish to play an equilibrium
   Computing a Nash equilibrium for a large game can be intractable
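While equilibrium computation is intractable for large games, small games are easy: in a 2x2 game with a fully mixed equilibrium, each player mixes so as to make the opponent indifferent between its two actions. A sketch under the stated assumption (the indifference denominators are nonzero, i.e. the equilibrium really is fully mixed):

```python
def mixed_equilibrium(R, C):
    """Fully mixed Nash equilibrium of a 2x2 game via indifference conditions.
    R and C are the row and column players' payoff matrices; assumes the
    denominators below are nonzero (no pure-strategy shortcut)."""
    # q = P(column plays 0) that makes the row player indifferent:
    q = (R[1][1] - R[0][1]) / (R[0][0] - R[1][0] - R[0][1] + R[1][1])
    # p = P(row plays 0) that makes the column player indifferent:
    p = (C[1][1] - C[1][0]) / (C[0][0] - C[1][0] - C[0][1] + C[1][1])
    return p, q

# Matching pennies: zero-sum, unique mixed equilibrium at (1/2, 1/2).
R = [[1, -1], [-1, 1]]
C = [[-1, 1], [1, -1]]
print(mixed_equilibrium(R, C))  # (0.5, 0.5)
```

Beyond 2x2, finding a Nash equilibrium requires solving a linear complementarity problem, and the general problem is PPAD-complete.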

Outline

1 Introduction to Multi-Agent Reinforcement Learning
   Reinforcement Learning
   MARL vs RL
   MARL vs Game Theory

2 MARL algorithms
   Best-Response Learning
   Equilibrium Learners
      Team Games
      Zero-sum Games
      General-sum Games

Independent Learners

Typical conditions for Independent Learning (IL):
   An agent is unaware of the existence of other agents
   It cannot identify the other agents' actions, or has no reason to believe that the other agents are acting strategically

Independent learners try to learn best responses.

Advantages
   Straightforward application of single-agent techniques
   Scales with the number of agents

Disadvantages
   Convergence guarantees from the single-agent setting are lost
   No explicit means for coordination

[Figure: traffic]
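The independent-learner idea can be sketched with two agents that each run plain single-agent Q-learning, treating the other agent as part of the environment. The repeated 2x2 coordination game, reward table, and hyperparameters here are illustrative; note that the nonstationarity the agents induce in each other is exactly why the single-agent convergence guarantees are lost:

```python
import random

# Two independent Q-learners in a repeated 2x2 coordination game.
REWARD = [[1.0, 0.0],
          [0.0, 1.0]]           # both agents are paid for matching actions

def choose(q, eps):
    """Epsilon-greedy action selection over the agent's own Q-values."""
    if random.random() < eps:
        return random.randrange(2)
    return max(range(2), key=lambda a: q[a])

random.seed(0)
q1, q2 = [0.0, 0.0], [0.0, 0.0]
alpha, eps = 0.1, 0.1
for _ in range(2000):
    a1, a2 = choose(q1, eps), choose(q2, eps)
    r = REWARD[a1][a2]
    # Single-state repeated game: no bootstrapped next-state term.
    q1[a1] += alpha * (r - q1[a1])
    q2[a2] += alpha * (r - q2[a2])

greedy1 = max(range(2), key=lambda a: q1[a])
greedy2 = max(range(2), key=lambda a: q2[a])
print(greedy1, greedy2)  # in this game the agents typically lock onto a match
```

Which of the two matching equilibria they converge to depends on the random seed, which is the coordination problem the slide alludes to.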


Independent Reinforcement Learners

Q-learning [Watkins '92]
Learning Automata [Narendra '74, Wheeler '86]
WoLF-PHC [Bowling '01]
FAQ-learning [Kaisers '10]
RESQ-learning [Hennes '10]


Joint Action Learners

A joint action learner (JAL) is an agent that learns Q-values for joint actions
To estimate the opponents' actions, empirical distributions can be used (as in fictitious play):

f_i(a_-i) = ∏_{j≠i} φ_j(a_j)

The expected value of an individual action is the sum of joint Q-values, weighted by the estimated probability of the associated complementary joint action profiles:

EV(a_i) = ∑_{a_-i ∈ A_-i} Q(a_i ∪ a_-i) f_i(a_-i)
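The EV computation above can be sketched for the two-player case, where the empirical model reduces to the opponent's action frequencies. The Q-table and counts here are hypothetical values chosen for illustration.

```python
# Illustrative sketch of a joint action learner's EV computation for two
# players: Q-values are kept over JOINT actions, and each own action is
# valued by weighting joint Q-values with the empirical opponent model.

# Hypothetical joint Q-table, indexed Q[own_action][opponent_action]
Q = [[10.0, 0.0],
     [0.0, 10.0]]

# Hypothetical opponent counts: action 0 observed 30 times, action 1 ten times
opp_counts = [30, 10]

def expected_values(Q, opp_counts):
    total = sum(opp_counts)
    freqs = [c / total for c in opp_counts]          # empirical model f_i
    # EV(a_i) = sum over opponent actions a_j of Q(a_i, a_j) * f_i(a_j)
    return [sum(q * f for q, f in zip(row, freqs)) for row in Q]

ev = expected_values(Q, opp_counts)
print(ev)  # [7.5, 2.5] -> the best response to this model is own action 0
```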


Outline

1 Introduction to Multi-Agent Reinforcement Learning
  Reinforcement Learning
  MARL vs RL
  MARL vs Game Theory

2 MARL algorithms
  Best-Response Learning
  Equilibrium Learners
    Team Games
    Zero-sum Games
    General-sum Games


Team Games

Team games are fully cooperative games in which all the agents share the same reward function
If learning is centralized, it is actually single-agent learning with multiple actuators
Multi-agent learning arises in distributed problems


Coordination Equilibria

In a coordination equilibrium all the agents achieve their maximum possible payoff.
If π_1, π_2, ..., π_n are in coordination equilibrium, we have that

∑_{a_1,...,a_n} π_1(s, a_1) · ... · π_n(s, a_n) Q_i(s, a_1, ..., a_n) = max_{a_1,...,a_n} Q_i(s, a_1, ..., a_n)

for all 1 ≤ i ≤ n and all states s
If a game has a coordination equilibrium, then it has a deterministic coordination equilibrium
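A deterministic coordination equilibrium can be found by brute force in a small two-player game: it is a joint action at which every agent simultaneously attains its maximum possible payoff. The function and example payoffs below are an illustrative sketch, not from the slides.

```python
# Sketch: enumerate deterministic coordination equilibria of a two-player
# game. payoffs[i][a][b] is agent i's payoff under joint action (a, b);
# a coordination equilibrium gives EVERY agent its global maximum at once.
def coordination_equilibria(payoffs):
    maxima = [max(max(row) for row in p) for p in payoffs]  # per-agent best payoff
    n_a, n_b = len(payoffs[0]), len(payoffs[0][0])
    return [(a, b) for a in range(n_a) for b in range(n_b)
            if all(p[a][b] == m for p, m in zip(payoffs, maxima))]

# In a team game (identical payoffs) every payoff-maximizing joint action
# is a coordination equilibrium -- here both diagonal joint actions.
team = [[10, 0], [0, 10]]
print(coordination_equilibria([team, team]))  # [(0, 0), (1, 1)]
```

In general-sum games the returned list may be empty, since no joint action need maximize all agents' payoffs at once.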


Independent vs Joint Action Learners
Example: Coordination game

     a0  a1
b0   10   0
b1    0  10

The agents use Boltzmann exploration
Both are able to converge to one of the optimal strategies
JALs can distinguish Q-values of different joint actions
The difference in performance is small due to the exploration strategy
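The Boltzmann (softmax) exploration used by the agents in these examples can be written in a few lines; the temperature value below is an arbitrary illustrative choice.

```python
import math

# Boltzmann (softmax) exploration: action probabilities proportional to
# exp(Q/tau). The temperature tau controls the trade-off -- high tau gives
# near-uniform exploration, low tau gives near-greedy exploitation.
def boltzmann_probs(q_values, tau):
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]  # shift for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

# With Q-values 10 (coordinated) and 0 (miscoordinated), a moderate
# temperature already makes picking the bad action rare.
probs = boltzmann_probs([10.0, 0.0], tau=2.0)
print(probs)
```

As the temperature is annealed toward zero over learning, the distribution concentrates on the greedy action, which is what lets both ILs and JALs lock onto one of the optimal joint strategies.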


Independent vs Joint Action Learners
Example: Penalty game

     a0  a1  a2
b0   10   0   k
b1    0   2   0
b2    k   0  10

Considering k < 0, the game has 3 pure equilibria
Suppose penalty k = −100
Both ILs and JALs will converge to the self-confirming equilibrium ⟨a1, b1⟩
The magnitude of the penalty k influences the probability of convergence to the optimal joint strategy
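The claim about the three pure equilibria can be checked by brute force. This is an illustrative enumeration over the identical-payoff matrix, not code from the lecture.

```python
# Sketch: brute-force enumeration of pure-strategy equilibria in an
# identical-payoff matrix game. A cell is an equilibrium when neither
# player can improve by deviating unilaterally.
def pure_equilibria(payoff):
    rows, cols = len(payoff), len(payoff[0])
    return [(r, c) for r in range(rows) for c in range(cols)
            if payoff[r][c] == max(payoff[r2][c] for r2 in range(rows))    # row player cannot improve
            and payoff[r][c] == max(payoff[r][c2] for c2 in range(cols))]  # column player cannot improve

k = -100
penalty = [[10, 0, k],   # rows indexed by b0..b2, columns by a0..a2
           [0, 2, 0],
           [k, 0, 10]]
print(pure_equilibria(penalty))  # [(0, 0), (1, 1), (2, 2)]
```

The two payoff-10 equilibria are optimal, but each exposes the agents to the −100 penalty under miscoordination, which is why learning tends to the safe payoff-2 equilibrium ⟨a1, b1⟩.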


Independent vs Joint Action Learners
Example: Climbing game

      a0   a1   a2
b0    11  -30    0
b1   -30    7    6
b2     0    0    5

Agents start playing ⟨a2, b2⟩
Agents converge to ⟨a1, b1⟩
Convergence to pure equilibria is almost sure


Independent vs Joint Action Learners
Sufficient conditions

The learning rate α decreases over time such that ∑_t α_t is divergent and ∑_t α_t² is convergent

Each agent samples each of its actions infinitely often

The probability P^i_t(a) of agent i choosing action a is nonzero

Agents become full exploiters with probability one eventually:

lim_{t→∞} P^i_t(X_t) = 0

where X_t is a random variable denoting the event that (f_i, g_i) prescribe a sub-optimal action

Let E_t be a random variable denoting the probability of a (deterministic) equilibrium strategy profile being played at time t. Then for both ILs and JALs, for any δ, ε > 0, there is a T(δ, ε) such that

Pr(|E_t − 1| < ε) > 1 − δ

for all t > T(δ, ε).
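The first condition is the classical Robbins-Monro requirement; a schedule that satisfies it is α_t = 1/t, whose partial sums can be checked numerically. This is an illustrative sketch of the condition, not part of the lecture material.

```python
# Sketch: alpha_t = 1/t satisfies the stated learning-rate conditions --
# the partial sums of alpha_t grow without bound (~log T), while the
# partial sums of alpha_t^2 approach a finite limit (pi^2 / 6).
def partial_sums(T):
    s, s2 = 0.0, 0.0
    for t in range(1, T + 1):
        a = 1.0 / t
        s, s2 = s + a, s2 + a * a
    return s, s2

s3, s2_3 = partial_sums(1_000)
s6, s2_6 = partial_sums(1_000_000)
print(s6 - s3)    # the plain sum keeps growing over three extra decades
print(s2_6 - s2_3)  # the squared sum barely moves: its tail is under 0.001
```

A constant learning rate would violate the second condition (its squared sum diverges), which is one reason convergence guarantees require decaying α_t.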


Independent vs Joint Action Learners
Myopic Heuristics

Neither ILs nor JALs ensure convergence to an optimal equilibrium
No hope with ILs, but JALs with a different exploration strategy...
Myopic heuristics:

Optimistic Boltzmann (OB): for agent i and action a_i ∈ A_i, let MaxQ(a_i) = max_{Π_-i} Q(Π_-i, a_i). Choose actions with Boltzmann exploration (another exploitative strategy would suffice) using MaxQ(a_i) as the value of a_i
Weighted OB (WOB): explore using Boltzmann with factors MaxQ(a_i) · Pr_i(optimal match Π_-i for a_i)
Combined: let C(a_i) = ρ MaxQ(a_i) + (1 − ρ) EV(a_i), for some 0 ≤ ρ ≤ 1. Choose actions using Boltzmann exploration with C(a_i) as the value of a_i
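The MaxQ, EV, and combined C valuations can be computed side by side on the penalty game. The opponent model `opp` and the weight ρ below are hypothetical values chosen for illustration.

```python
# Sketch of the myopic action valuations on the penalty game (k = -100),
# from one agent's point of view. `opp` is a HYPOTHETICAL empirical
# opponent distribution and rho = 0.5 an arbitrary optimism weight.
Q = [[10, 0, -100],  # Q[own_action][opponent_action]
     [0, 2, 0],
     [-100, 0, 10]]
opp = [0.2, 0.6, 0.2]  # assumed opponent model f_i

max_q = [max(row) for row in Q]                                  # MaxQ(a_i): optimistic value
ev = [sum(q * p for q, p in zip(row, opp)) for row in Q]         # EV(a_i): model-weighted value
rho = 0.5
combined = [rho * m + (1 - rho) * e for m, e in zip(max_q, ev)]  # C(a_i)

print(max_q)  # pure optimism still values the two risky actions at 10
print(combined.index(max(combined)))  # the combined rule favours safe action 1
```

The point of the combination is visible in the numbers: optimism alone (MaxQ) keeps chasing the payoff-10 actions despite the −100 penalty risk, while blending in EV lets the expected penalty pull the valuation back toward the safe action.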

Page 255: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Independent vs Joint Action LearnersMyopic Heuristics

Neither ILs nor JALs ensure convergence to an optimalequilibriumNo hope with ILs, but JALs with a different explorationstrategy...Myopic heuristics

Optimistic Boltzmann (OB): For agent i , actionai ∈ Ai , let maxQ(ai ) = maxΠ−i Q(Π−i ,ai ). Chooseactions with Boltzmann exploration (another expolitivestrategy would suffice) using MaxQ(ai ) as the value ofaiWeighted OB (WOB): Explore using Boltzmann usingfactors MaxQ(ai ) · Pri (optimalmatchΠ−i forai )Combined: Let C(ai ) = ρMaxQ(ai ) + (1− ρ)EV (ai ), forsome 0 ≤ ρ ≤ 1. Choose actions using Boltzmannexploration with C(ai ) as value of ai

Page 256: Multi-Agent Reinforcement Learning - Game Theory Polimi · 2019-12-03 · Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Independent vs Joint Action LearnersMyopic Heuristics

Neither ILs nor JALs ensure convergence to an optimalequilibriumNo hope with ILs, but JALs with a different explorationstrategy...Myopic heuristics

Optimistic Boltzmann (OB): For agent i , actionai ∈ Ai , let maxQ(ai ) = maxΠ−i Q(Π−i ,ai ). Chooseactions with Boltzmann exploration (another expolitivestrategy would suffice) using MaxQ(ai ) as the value ofaiWeighted OB (WOB): Explore using Boltzmann usingfactors MaxQ(ai ) · Pri (optimalmatchΠ−i forai )Combined: Let C(ai ) = ρMaxQ(ai ) + (1− ρ)EV (ai ), forsome 0 ≤ ρ ≤ 1. Choose actions using Boltzmannexploration with C(ai ) as value of ai


Independent vs Joint Action Learners. Myopic heuristics: penalty game


Distributed Q-learning [Lauer and Riedmiller, ’01]

Applies to deterministic cooperative SGs with non-negative reward functions. Update rule:

Q0(s, a) = 0
Qk+1(s, a) = max( Qk(s, a), R(s, a) + γ max_{a′∈A} Qk(s′, a′) )

This optimistic algorithm learns distributed Q-tables, provided that all state-action pairs occur infinitely often.
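A minimal sketch of the optimistic update, assuming a tabular `defaultdict` representation and a single-state episodic view of the climbing game (names and the terminal-state convention are mine, not from the slides):

```python
from collections import defaultdict

GAMMA = 0.9

def distributed_q_update(Q, s, a_i, r, s_next, my_actions):
    """Optimistic update: Q(s, a_i) never decreases, so with deterministic
    transitions it tracks the best joint outcome compatible with a_i.
    (The convergence guarantee assumes non-negative rewards; the climbing
    game below violates that, but optimism still masks the penalties here.)"""
    target = r + GAMMA * max(Q[(s_next, a)] for a in my_actions)
    Q[(s, a_i)] = max(Q[(s, a_i)], target)

# One pass over the climbing game, treated as a single-state episodic task
Q = defaultdict(float)
payoff = {("b0", "a0"): 11, ("b0", "a1"): -30, ("b0", "a2"): 0,
          ("b1", "a0"): -30, ("b1", "a1"): 7, ("b1", "a2"): 6,
          ("b2", "a0"): 0, ("b2", "a1"): 0, ("b2", "a2"): 5}
for (b, _a), r in payoff.items():
    distributed_q_update(Q, "s0", b, r, "end", ["b0", "b1", "b2"])
```

After one sweep the local table holds, for each of agent 2's actions, the best payoff any joint action achieved with it.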


Distributed Q-learning. Example: climbing game

        a0    a1    a2
  b0    11   -30     0
  b1   -30     7     6
  b2     0     0     5

Distributed Q-tables

             a0   a1   a2
  Q1(s0,a)   11    7    6
  Q2(s0,a)   11    7    5
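In a single-state game the distributed Q-tables reduce to the column and row maxima of the payoff matrix, which can be checked directly:

```python
M = [[11, -30, 0],
     [-30, 7, 6],
     [0, 0, 5]]  # rows: actions b0..b2; columns: actions a0..a2

q1 = [max(col) for col in zip(*M)]   # agent 1: best outcome for each a
q2 = [max(row) for row in M]         # agent 2: best outcome for each b

assert q1 == [11, 7, 6]
assert q2 == [11, 7, 5]
```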


Distributed Q-learning. Example: penalty game

        a0    a1    a2
  b0    10     0     k
  b1     0     2     0
  b2     k     0    10

Distributed Q-tables

             a0   a1   a2
  Q1(s0,a)   10    2   10
  Q2(s0,a)   10    2   10

An additional coordination mechanism between agents is required: update the current policy only when the Q-value strictly improves.
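The coordination rule can be sketched as follows (my phrasing of Lauer and Riedmiller's idea): the greedy action switches only when the state value strictly improves, so ties between equally good joint actions never make the agents drift apart:

```python
from collections import defaultdict

def coordinated_update(Q, pi, s, a_i, target, my_actions):
    """Switch the greedy action only on a strict improvement of the
    state value; on a tie the current commitment is kept."""
    best_before = max(Q[(s, a)] for a in my_actions)
    Q[(s, a_i)] = max(Q[(s, a_i)], target)
    if Q[(s, a_i)] > best_before:
        pi[s] = a_i

# Penalty game: both (a0, b0) and (a2, b2) pay 10; agent 1 locks onto the
# first action that reached 10 and ignores the later tie.
Q, pi = defaultdict(float), {}
actions = ["a0", "a1", "a2"]
coordinated_update(Q, pi, "s0", "a0", 10, actions)
coordinated_update(Q, pi, "s0", "a2", 10, actions)
```

If both agents apply the same rule to the same observed history, they commit to matching halves of a single optimal joint action instead of the miscoordinated pair (a0, b2) with payoff k.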


Distributed Q-learning. Stochastic environments

Distributed Q-learning works well in deterministic cooperative environments, but its extension to stochastic environments is problematic. The main difficulty is that Q-values are affected by two kinds of uncertainty:

the behavior of the other agents
the influence of the stochastic environment

Distinguishing these two sources of uncertainty is a key issue in multi-agent learning.
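A tiny demo (my own, not from the slides) of why the optimistic max-update breaks down under stochasticity: with a reward that is 0 or 10 with equal probability, the update absorbs the luckiest outcome instead of the mean:

```python
import random

random.seed(0)
rewards = [random.choice([0, 10]) for _ in range(100)]  # E[r] = 5

q_opt = 0.0
for r in rewards:
    q_opt = max(q_opt, r)   # optimistic max-update absorbs the luckiest draw

mean = sum(rewards) / len(rewards)  # sample estimate of the true value, 5
```

The optimistic estimate sticks at 10 while the true value is 5: optimism cannot tell a cooperative teammate from a lucky transition.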


Team Q-learning [Littman, ’01]

Requires observing the actions of the other agents. Update rule:

Q1(s, a1, . . . , an) ← (1 − α) Q1(s, a1, . . . , an) + α ( r1 + γ max_{a′1,...,a′n} Q1(s′, a′1, . . . , a′n) )

It does not use an opponent model. In a team game, team Q-learning converges to the optimal Q-function with probability one. If the limit equilibrium is unique and the agent follows a GLIE policy, it also converges in behavior with probability one. The main problem is selecting an equilibrium when there are multiple coordination equilibria.
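A sketch of the update, assuming every agent maintains the same table over (state, joint action); the constants and names below are illustrative:

```python
from collections import defaultdict
from itertools import product

ALPHA, GAMMA = 0.1, 0.9

def team_q_update(Q, s, joint_a, r1, s_next, joint_actions):
    """Team Q-learning: all agents share the reward and observe the joint
    action, so the ordinary Q-learning target applies with the max taken
    over joint actions."""
    best_next = max(Q[(s_next, a)] for a in joint_actions)
    Q[(s, joint_a)] = ((1 - ALPHA) * Q[(s, joint_a)]
                       + ALPHA * (r1 + GAMMA * best_next))

Q = defaultdict(float)
joint_actions = list(product(["a0", "a1"], ["b0", "b1"]))
team_q_update(Q, "s", ("a0", "b0"), 1.0, "t", joint_actions)
```

Since every agent runs the same update on the same observations, all the local tables stay identical; the remaining difficulty is exactly the slide's last point, picking the same joint argmax when several are optimal.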


Zero-sum Games

Consider 2 players with payoff matrix M:
R1(i, j) = M(i, j)
R2(i, j) = −M(i, j)
Player 1 is the maximizer, player 2 is the minimizer. Examples: matching pennies, rock-paper-scissors.
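For instance, matching pennies in this notation (a small sketch; M is chosen so that player 1 wins on a match):

```python
# Matching pennies: M[i][j] is player 1's payoff (player 1 wins on a match).
M = [[+1, -1],
     [-1, +1]]

def r1(i, j):
    return M[i][j]

def r2(i, j):
    return -M[i][j]   # the two payoffs always sum to zero

assert all(r1(i, j) + r2(i, j) == 0 for i in range(2) for j in range(2))
```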


Minimax-Q [Littman, ’94]

In MDPs, a stationary, deterministic, and undominated optimal policy always exists. In MGs, the performance of a policy depends on the opponent's policy, so policies cannot be evaluated out of context. Game theory provides a new definition of optimality:

an optimal policy performs best in its worst case, compared with the others
at least one optimal policy exists, and it may or may not be deterministic, because the agent is uncertain about its opponent's move
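The worst-case criterion can be made concrete on matching pennies: any deterministic row policy guarantees only −1, while the mixed policy (1/2, 1/2) guarantees 0. A brute-force check over a grid of mixed strategies (an illustrative sketch, not an exact solver):

```python
def worst_case(p):
    """Worst-case expected payoff for the row player in matching pennies
    when she plays heads with probability p."""
    vs_heads = p * 1 + (1 - p) * (-1)   # opponent plays column 0
    vs_tails = p * (-1) + (1 - p) * 1   # opponent plays column 1
    return min(vs_heads, vs_tails)

# Search a grid of mixed strategies for the best worst case
best_p = max((k / 100 for k in range(101)), key=worst_case)
```

The best policy is stochastic: randomizing denies the opponent any exploitable pattern.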


Minimax-Q. Learning the optimal policy

Q-learning:

Q(s, a) ← (1 − α) Q(s, a) + α ( r + γ V(s′) )
V(s) = max_a Q(s, a)

Minimax-Q learning:

Q(s, a, o) ← (1 − α) Q(s, a, o) + α ( r_{s,a,o} + γ V(s′) )
π(s, ·) ← argmax_{π′(s,·)} min_{o′} ∑_{a′} π′(s, a′) · Q(s, a′, o′)
V(s) ← min_{o′} ∑_{a′} π(s, a′) · Q(s, a′, o′)
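A sketch of the minimax-Q update for a two-action game; in general π is computed by a linear program, but with two actions a grid search over p = π(s, 0) suffices for illustration (all names and constants below are assumptions):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
ACTIONS = OPP_ACTIONS = [0, 1]

def minimax_value(Q, s, grid=101):
    """V(s) = max over mixed policies of the worst-case expected Q-value.
    Exact solutions need a linear program; with two actions a grid search
    over p = pi(s, 0) is enough for a sketch."""
    def worst(p):
        probs = [p, 1 - p]
        return min(sum(probs[a] * Q[(s, a, o)] for a in ACTIONS)
                   for o in OPP_ACTIONS)
    best_p = max((k / (grid - 1) for k in range(grid)), key=worst)
    return [best_p, 1 - best_p], worst(best_p)

def minimax_q_update(Q, s, a, o, r, s_next):
    """Q(s,a,o) <- (1 - alpha) Q(s,a,o) + alpha (r + gamma V(s'))."""
    _, v_next = minimax_value(Q, s_next)
    Q[(s, a, o)] = (1 - ALPHA) * Q[(s, a, o)] + ALPHA * (r + GAMMA * v_next)

# Matching pennies planted at state "s0": the minimax policy is (1/2, 1/2)
Q = defaultdict(float)
for (a, o), r in {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}.items():
    Q[("s0", a, o)] = r
pi, v = minimax_value(Q, "s0")
```

The recovered policy mixes uniformly and the state value is 0, matching the game's minimax value.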


Minimax-Q. Considerations

In a two-player zero-sum multi-agent environment, an agent following minimax-Q learning converges to the optimal Q-function with probability one. Furthermore, if it follows a GLIE policy and the limit equilibrium is unique, it also converges in behavior with probability one. In zero-sum SGs, even when the limit equilibrium is not unique, it converges to a policy that achieves the optimal value regardless of its opponent (safety). Minimax-Q achieves the largest value possible in the absence of knowledge of the opponent's policy. Minimax-Q is quite slow to converge compared with Q-learning (but the latter learns only deterministic policies).

Can we extend this approach to general-sum SGs?

Yes and no:
- Nash-Q learning is such an extension
- However, it has much worse computational and theoretical properties

Nash-Q [Hu & Wellman, ’98-’03]

NashQ^i_t(s') = π^1(s') · ... · π^n(s') · Q^i_t(s')

Each agent needs to maintain the Q-functions of all the other agents.

Nash-Q: Complexity

Space requirements: n · |S| · |A|^n

- The algorithm's running time is dominated by the computation of Nash equilibria
- By contrast, the minimax operator can be computed in polynomial time (linear programming)
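To make the space bound concrete, a quick back-of-the-envelope check (the sizes are hypothetical, chosen only for illustration):

```python
# Hypothetical sizes, not from the slides: n agents, |S| states,
# |A| actions per agent.
n, S, A = 3, 100, 5
entries_per_table = S * A ** n      # |S| * |A|^n joint-action entries
total = n * entries_per_table       # each learner keeps one table per agent
print(total)                        # 3 * 100 * 5^3 = 37500 entries
```

The |A|^n factor is what makes Nash-Q's storage grow exponentially with the number of agents.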

Nash-Q: Convergence

Assumptions

- Every state and joint action is visited infinitely often
- Learning rates suitably decrease
- One of the following holds:
  - Every stage game (Q^1_t(s), ..., Q^n_t(s)), for all t and s, has a global optimal point, and the agents' payoffs at this equilibrium are used to update their Q-functions
  - Every stage game (Q^1_t(s), ..., Q^n_t(s)), for all t and s, has a saddle point, and the agents' payoffs at this equilibrium are used to update their Q-functions

Theorem
Under these assumptions, the sequence Q_t = (Q^1_t, ..., Q^n_t), updated by

Q^k_{t+1}(s, a^1, ..., a^n) = (1 - α_t) Q^k_t(s, a^1, ..., a^n) + α_t (r^k_t + γ π^1(s') · ... · π^n(s') Q^k_t(s'))   for k = 1, ..., n,

where (π^1(s'), ..., π^n(s')) is the appropriate type of Nash equilibrium solution for the stage game (Q^1_t(s'), ..., Q^n_t(s')), converges to the Nash Q-value Q* = (Q^1_*, ..., Q^n_*).
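The update in the theorem can be sketched for the two-agent case as follows (an illustrative tabular implementation; the equilibrium strategies `pi1`, `pi2` at the next state are assumed to come from an external Nash-equilibrium solver, which is omitted here):

```python
import numpy as np

def nash_q_update(Qk, s, a1, a2, r_k, s_next, pi1, pi2,
                  alpha=0.1, gamma=0.9):
    """One Nash-Q update for agent k in a two-agent stochastic game.
    Qk: array of shape (S, A1, A2), agent k's joint-action Q-table.
    pi1, pi2: equilibrium mixed strategies at s_next for the stage game
    (Q1[s_next], Q2[s_next]), computed elsewhere."""
    # Expected equilibrium value: pi1(s') . Qk[s'] . pi2(s')
    nash_value = pi1 @ Qk[s_next] @ pi2
    # Standard stochastic-approximation blend of old estimate and new target
    Qk[s, a1, a2] = (1 - alpha) * Qk[s, a1, a2] \
        + alpha * (r_k + gamma * nash_value)
    return Qk

# One update from a zero-initialized table with uniform next-state strategies
Q = np.zeros((2, 2, 2))
nash_q_update(Q, s=0, a1=0, a2=0, r_k=1.0, s_next=1,
              pi1=np.array([0.5, 0.5]), pi2=np.array([0.5, 0.5]))
```

Note that each agent must run this update for every agent's Q-table, which is where the n · |S| · |A|^n space cost comes from.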

Nash-Q: Convergence Result Analysis

- The third assumption is really strong
- It is unlikely that the stage games encountered during learning keep satisfying these assumptions
- The global optimum assumption implies full cooperation between agents
- The saddle point assumption implies no cooperation between agents
- Nonetheless, empirically the algorithm converges even when the assumptions are violated
- This suggests that the assumptions may be relaxed, at least for some classes of games

Friend-or-Foe [Littman, ’01]

Friend-or-Foe Q-learning (FFQ) aims at removing the requirements on the intermediate Q-values during learning. The idea is to let the algorithm know what kind of opponent to expect:

- friend: coordination equilibrium

  Nash_1(s, Q_1, Q_2) = max_{a_1 ∈ A_1, a_2 ∈ A_2} Q_1(s, a_1, a_2)

- foe: adversarial equilibrium

  Nash_1(s, Q_1, Q_2) = max_{π ∈ Π(A_1)} min_{a_2 ∈ A_2} Σ_{a_1 ∈ A_1} π(a_1) Q_1(s, a_1, a_2)

In FFQ the learner maintains only a Q-function for itself.
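The two operators can be sketched on a single stage game (an illustrative implementation; the foe case is the same maximin LP as in minimax-Q, here solved with SciPy's `linprog`):

```python
import numpy as np
from scipy.optimize import linprog

def friend_value(Q1):
    """Friend-Q operator: assume a coordination equilibrium, i.e. the
    other agent helps reach the best joint action."""
    return Q1.max()

def foe_value(Q1):
    """Foe-Q operator: maximin value over mixed policies, as in minimax-Q."""
    m, n = Q1.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # maximize the value v
    A_ub = np.hstack([-Q1.T, np.ones((n, 1))])    # v <= pi . Q1[:, j] for all j
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # pi sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[-1]

# Illustrative stage game for player 1
Q1 = np.array([[3.0, 0.0], [1.0, 2.0]])
f = friend_value(Q1)   # 3.0: value of the best joint action
a = foe_value(Q1)      # 1.5: guaranteed value against a hostile opponent
```

The gap between the two values shows how strongly the friend/foe label shapes the learned target.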

Friend-or-Foe Q-learning: Convergence results

- Friend-or-foe Q-learning converges
- In general, the values learned by FFQ will not converge to those of any Nash equilibrium
- There are some special cases (independently of the opponent's behavior):
  - Foe-Q learns values for a Nash equilibrium policy if the game has an adversarial equilibrium
  - Friend-Q learns values for a Nash equilibrium policy if the game has a coordination equilibrium
- Foe-Q learns a Q-function whose corresponding policy will achieve at least the learned values, regardless of the opponent's selected policy
