Smoothness and Learning Equilibria


Page 1: Smoothness and Learning Equilibria

Smoothness and Learning Equilibria – Price of Anarchy Seminar. Presenting: Aviv Kuvent. Professor: Michal Feldman.

Page 2: Smoothness and Learning Equilibria

Reminder
• PoA bound results for PNE, MNE.
• However:
  • PNE need not always exist.
  • MNE is hard to compute.
• Introduces 2 additional types of equilibrium:
  • Correlated Equilibrium (CE)
  • Coarse Correlated Equilibrium (CCE)

Page 3: Smoothness and Learning Equilibria

Reminder - CE
• Definition (CE):
• A distribution σ on the set of outcomes of a cost-minimization game is a correlated equilibrium if for every player $i$, strategy $s_i$, and every deviation $s_i'$,
  $\mathbb{E}_{s\sim\sigma}[C_i(s)\mid s_i] \le \mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})\mid s_i]$

Page 4: Smoothness and Learning Equilibria

Reminder - CCE
• Definition (CCE):
• A distribution σ on the set of outcomes of a cost-minimization game is a coarse correlated equilibrium if for every player $i$ and every deviation $s_i'$,
  $\mathbb{E}_{s\sim\sigma}[C_i(s)] \le \mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})]$

Page 5: Smoothness and Learning Equilibria

Reminder – Eq. Hierarchy: PNE ⊆ MNE ⊆ CE ⊆ CCE (each set of equilibria contains the one before it).

Page 6: Smoothness and Learning Equilibria

Reminder
• We saw that in smooth games, the PoA bounds proven for PNE/MNE also hold for CE/CCE.
• CE/CCE are easy to compute:
  • Linear Programming
  • No Regret Dynamics

Page 7: Smoothness and Learning Equilibria

The Game Model
• Single player
• Set of actions $A$, where $|A| = n$
• Adversary
• $T$ time steps

Page 8: Smoothness and Learning Equilibria

The Game Model
• At time $t$:
  • Player picks a distribution $p^t$ over his action set $A$.
  • Adversary picks a cost vector $c^t: A \to [0,1]$.
  • Action $a^t$ is chosen according to the distribution $p^t$, and the player incurs the cost $c^t(a^t)$.
• Player is aware of the values of the entire cost vector $c^t$ (Full Information Model).
• Partial Information Model:
  • Player only knows the value $c^t(a^t)$ of the incurred cost.

Page 9: Smoothness and Learning Equilibria

Regret
• The difference between the cost of our algorithm and the cost of a chosen benchmark.
• $B$ - the benchmark we choose.
• Time-averaged Regret: $\frac{1}{T}\left[\sum_{t=1}^{T} c^t(a^t) - B\right]$

Page 10: Smoothness and Learning Equilibria

No Regret Algorithm
• An algorithm has no regret if for every sequence of cost vectors, the expected time-averaged regret $\to 0$ as $T \to \infty$.
• Equivalently, we would like to show that the expected total cost is at most the benchmark plus $o(T)$.
• Goal: develop such an algorithm.
• Which benchmark to use?

Page 11: Smoothness and Learning Equilibria

Optimal Sequence
• Or "best action sequence in hindsight": $\sum_{t=1}^{T}\min_{a\in A} c^t(a)$
• Too strong to be a useful benchmark.
• Theorem:
  • For any algorithm $\mathcal{A}$ there exists a sequence of cost vectors such that the expected cost of $\mathcal{A}$ is at least $T\left(1-\frac{1}{n}\right)$ while the optimal sequence has cost 0, where $n$ is the size of the action set of the player.

Page 12: Smoothness and Learning Equilibria

Optimal Sequence
• Proof:
  • Build the following sequence: in time $t$, set $c^t(a) = 1$ for every action except the action minimizing $p^t(a)$, which gets cost 0.
  • ⇒ Expected cost of $\mathcal{A}$ in time $t$ is at least $1 - \frac{1}{n}$ (the zero-cost action has probability at most $\frac{1}{n}$).
  • The cost of $\mathcal{A}$ for $T$ time steps is at least $T\left(1-\frac{1}{n}\right)$.
  • An optimal sequence of cost 0 exists: at each step, play the single zero-cost action.

Page 13: Smoothness and Learning Equilibria

External Regret
• Benchmark – the best fixed action: $\min_{a\in A}\sum_{t=1}^{T} c^t(a)$
  • Playing the same action in all time steps.
• External Regret: $\frac{1}{T}\left[\sum_{t=1}^{T} c^t(a^t) - \min_{a\in A}\sum_{t=1}^{T} c^t(a)\right]$

Page 14: Smoothness and Learning Equilibria

The Greedy Algorithm
• First attempt to develop a no-external-regret algorithm.
• Notations:
  • The cumulative cost of action $a$ up to time $t$: $C^t(a) = \sum_{\tau=1}^{t} c^\tau(a)$
  • The cost of the best fixed action: $C^t_{\min} = \min_{a\in A} C^t(a)$
  • The cost of the greedy algorithm: $C^t_{greedy} = \sum_{\tau=1}^{t} c^\tau(a^\tau)$

Page 15: Smoothness and Learning Equilibria

The Greedy Algorithm
• For simplicity, assume that $c^t(a) \in \{0, 1\}$.
• The idea – choose the action which incurred minimal cost so far.
• Greedy Algorithm (a short code sketch follows below):
  • Initially: select $a^1$ to be some arbitrary action in the action set $A$.
  • At time $t$:
    • Let $C^{t-1}_{\min} = \min_{a\in A} C^{t-1}(a)$, $S^{t-1} = \{a \in A : C^{t-1}(a) = C^{t-1}_{\min}\}$.
    • Choose $a^t \in S^{t-1}$ (e.g. the lowest-indexed action in $S^{t-1}$, staying with $a^{t-1}$ if it is still a minimizer).
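A minimal Python sketch of this greedy rule, under the 0/1-cost assumption above; the function name, the cost_fn interface, and the lowest-index tie-breaking are illustrative choices, not taken from the deck.

def greedy_play(T, n, cost_fn):
    """Follow-the-leader sketch. cost_fn(t) returns the 0/1 cost vector c^t,
    revealed only after the action for step t has been committed."""
    cumulative = [0] * n          # C^{t-1}(a) for every action a
    action = 0                    # arbitrary initial action
    total_cost = 0
    for t in range(T):
        best = min(cumulative)
        if cumulative[action] != best:
            action = cumulative.index(best)   # lowest-index minimizer
        c = cost_fn(t)
        total_cost += c[action]
        cumulative = [cumulative[a] + c[a] for a in range(n)]
    return total_cost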

Page 16: Smoothness and Learning Equilibria

The Greedy Algorithm
• Theorem:
  • For any sequence of losses, the greedy algorithm has $C^T_{greedy} \le n\cdot C^T_{\min} + (n-1)$.
• Proof:
  • At each time where the greedy algorithm incurs a cost of 1 and $C_{\min}$ does not increase, at least one action is removed from the set $S$.
  • This can occur at most $n$ times before $C_{\min}$ increases by 1.
  • Therefore, the greedy algorithm incurs cost at most $n$ between successive increments in $C_{\min}$ ⇒ $C^T_{greedy} \le n\cdot C^T_{\min} + (n-1)$.
  • The additional $(n-1)$ is for the cost incurred in the steps since the last time $C_{\min}$ increased.

Page 17: Smoothness and Learning Equilibria

Deterministic Algorithms
• The greedy algorithm has a cost of at most a factor $n$ larger than that of the best fixed action – not a "no regret" algorithm.
• Theorem: For any deterministic algorithm $\mathcal{A}$, there exists a cost sequence for which $C^T_{\mathcal{A}} = T$ and $C^T_{\min} \le \frac{T}{n}$.
• Corollary: There does not exist a no-regret deterministic algorithm.

Page 18: Smoothness and Learning Equilibria

Deterministic Algorithms
• Proof:
  • Fix a deterministic algorithm $\mathcal{A}$.
  • $a^t$ - the action it selects at time $t$ (known in advance, since $\mathcal{A}$ is deterministic).
  • Generate the following cost sequence: $c^t(a^t) = 1$ and $c^t(a) = 0$ for every other action.
  • $\mathcal{A}$ incurs a cost of 1 in each time step ⇒ $C^T_{\mathcal{A}} = T$.
  • $\mathcal{A}$ selects some action at most $\frac{T}{n}$ times; that action is charged cost 1 only when it is selected, so it is the best fixed action with $C^T_{\min} \le \frac{T}{n}$
⇒ the time-averaged regret is at least $1 - \frac{1}{n}$, not dependent on $T$ ⇒ $\mathcal{A}$ is not a no-regret algorithm.

Page 19: Smoothness and Learning Equilibria

Lower Bound on Regret
• Theorem:
  • Even for $n = 2$ actions, no (randomized) algorithm has expected regret which vanishes faster than $\Theta\!\left(\frac{1}{\sqrt{T}}\right)$.
• Proof:
  • Let $\mathcal{A}$ be a randomized algorithm.
  • Adversary: on each time step $t$, chooses uniformly at random between the two cost vectors $(1, 0)$ and $(0, 1)$.
  • ⇒ The expected cost of $\mathcal{A}$ is exactly $\frac{T}{2}$, regardless of what it does.
  • Think of the adversary as flipping $T$ fair coins: expected number of heads $\frac{T}{2}$, standard deviation $= \frac{\sqrt{T}}{2}$.

Page 20: Smoothness and Learning Equilibria

Lower Bound on Regret
• Proof (cont):
  • ⇒ In expectation, the cheaper of the two actions has cumulative cost $\frac{T}{2} - \Theta(\sqrt{T})$, and the other action has cumulative cost $\frac{T}{2} + \Theta(\sqrt{T})$.
  • The cheaper action is the best fixed action ⇒ the expected regret is $\Theta(\sqrt{T})$
  • ⇒ the expected time-averaged regret is $\Theta\!\left(\frac{1}{\sqrt{T}}\right)$.
  • For the general case of $n$ actions, a similar argument gives a lower bound of $\Omega\!\left(\sqrt{\frac{\ln n}{T}}\right)$.

Page 21: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• The idea:
  • Keep a weight for each action.
  • Probability of playing an action depends on its weight.
  • The weights are updated over time; used to "punish" actions which incurred cost.
• Notations:
  • $w^t(a)$ – weight of action $a$ at time $t$
  • $W^t = \sum_{a\in A} w^t(a)$ – sum of the weights of the actions at time $t$
  • $c^t(a)$ – cost of action $a$ at time $t$
  • $\nu^t$ – expected cost of the MW algorithm at time $t$
  • $C_{MW} = \sum_{t=1}^{T}\nu^t$ – expected cost of the MW algorithm
  • $OPT = \min_{a\in A}\sum_{t=1}^{T} c^t(a)$ – cost of the best fixed action

Page 22: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• MW Algorithm (a short code sketch follows below):
• Initially: $w^1(a) = 1$ for each $a \in A$ (so $W^1 = n$).
• At time step $t$:
  • Play an action according to the distribution $p^t$, where $p^t(a) = \frac{w^t(a)}{W^t}$.
  • Given the cost vector $c^t$, decrease the weights: $w^{t+1}(a) = w^t(a)\cdot(1-\varepsilon)^{c^t(a)}$.
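A short Python sketch of these two steps, assuming costs in $[0,1]$ and the multiplicative $(1-\varepsilon)^{cost}$ update described above; the default $\varepsilon$ is the value chosen later in the analysis, and the function and variable names are mine.

import math
import random

def multiplicative_weights(T, n, cost_fn, eps=None):
    """Full-information MW sketch. cost_fn(t) returns the cost vector c^t in [0,1]^n,
    revealed after the action for step t is drawn."""
    if eps is None:
        eps = min(0.5, math.sqrt(math.log(n) / T))      # the choice from the analysis below
    w = [1.0] * n
    expected_cost = 0.0
    for t in range(T):
        W = sum(w)
        p = [wi / W for wi in w]                        # p^t(a) = w^t(a) / W^t
        action = random.choices(range(n), weights=p)[0]
        c = cost_fn(t)
        expected_cost += sum(p[a] * c[a] for a in range(n))
        w = [w[a] * (1 - eps) ** c[a] for a in range(n)]   # punish actions that incurred cost
    return expected_cost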

Page 23: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• What is $\varepsilon$?
• As $\varepsilon$ tends towards 0, the distribution tends towards a uniform distribution (exploration).
• As $\varepsilon$ tends towards 1, the distribution tends towards a distribution which places most of the mass on the action which incurred the minimal cost so far (exploitation).
• We set the specific value of $\varepsilon$ later; for now assume $\varepsilon \le \frac{1}{2}$.

Page 24: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Theorem:
  • MW is a no-regret algorithm, with expected (time-averaged) external regret of $O\!\left(\sqrt{\frac{\ln n}{T}}\right)$.
• Proof:
  • The weight of every action (and therefore their sum, $W^t$) can only decrease with $t$.
  • The idea of the proof - relate both $OPT$ and $C_{MW}$ to $W^{T+1}$, and from there get the bound.

Page 25: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• Relate $OPT$ to $W^{T+1}$:
  • Let $a^*$ be the best fixed action, assumed to have a low cost (if no such action exists, we trivially have low regret with respect to the fixed-action benchmark).
  • The weight of action $a^*$ at time $T+1$: $w^{T+1}(a^*) = \prod_{t=1}^{T}(1-\varepsilon)^{c^t(a^*)} = (1-\varepsilon)^{OPT}$
  • ⇒ $W^{T+1} \ge w^{T+1}(a^*) = (1-\varepsilon)^{OPT}$

Page 26: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• Relate $C_{MW}$ to $W^{T+1}$:
  • At time $t$:
    $W^{t+1} = \sum_{a\in A} w^t(a)(1-\varepsilon)^{c^t(a)} \le \sum_{a\in A} w^t(a)\left(1-\varepsilon\, c^t(a)\right) = W^t\left(1 - \varepsilon\,\nu^t\right)$
  • using $(1-\varepsilon)^x \le 1-\varepsilon x$ for $x\in[0,1]$, and $\nu^t = \sum_{a\in A} p^t(a)\,c^t(a)$.

Page 27: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• ⇒ $W^{T+1} \le W^1\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right) = n\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right)$
• So: $(1-\varepsilon)^{OPT} \le W^{T+1} \le n\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right)$

Page 28: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• We got: $(1-\varepsilon)^{OPT} \le n\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right)$
• Taking logarithms ⇒ $OPT\cdot\ln(1-\varepsilon) \le \ln n + \sum_{t=1}^{T}\ln\!\left(1-\varepsilon\,\nu^t\right)$
• Taylor expansion of $\ln(1-x)$: $\ln(1-x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \dots$
• So $\ln(1-x) \le -x$, and because $\varepsilon \le \frac{1}{2}$, $\ln(1-\varepsilon) \ge -\varepsilon - \varepsilon^2$.

Page 29: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• ⇒ $-OPT\left(\varepsilon + \varepsilon^2\right) \le OPT\cdot\ln(1-\varepsilon) \le \ln n - \varepsilon\sum_{t=1}^{T}\nu^t = \ln n - \varepsilon\, C_{MW}$
• ⇒ $C_{MW} \le OPT\,(1+\varepsilon) + \frac{\ln n}{\varepsilon}$

Page 30: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• We take $\varepsilon = \sqrt{\frac{\ln n}{T}}$, and get (using $OPT \le T$): $C_{MW} \le OPT + \varepsilon T + \frac{\ln n}{\varepsilon} = OPT + 2\sqrt{T\ln n}$ ⇒ the expected time-averaged regret is at most $2\sqrt{\frac{\ln n}{T}}$.
• This assumes that $T$ is known in advance. If it is not known, at time step $t$ take $\varepsilon_t = \sqrt{\frac{\ln n}{2^k}}$, where $2^k$ is the largest power of 2 smaller than $t$ (doubling trick).
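As a quick numeric sanity check (my own numbers, not from the deck): with $n = 10$ actions and $T = 10{,}000$ steps, $\varepsilon = \sqrt{\ln 10 / 10000} \approx 0.015$, and the guaranteed time-averaged regret is at most $2\sqrt{\ln 10 / 10000} \approx 0.03$ - within about 3% of the best fixed action per step.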

Page 31: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Up to the factor of 2, we have reached the lower bound for the external regret rate ($\Theta\!\left(\sqrt{\frac{\ln n}{T}}\right)$).
• To achieve a regret of at most $\rho$, we only need $T = \frac{4\ln n}{\rho^2}$ iterations of the MW algorithm.

Page 32: Smoothness and Learning Equilibria

No Regret Dynamics
• Moving from a single-player setting to a multi-player setting.
• In each time step $t = 1, \dots, T$ (a code sketch of these dynamics follows below):
  • Each player $i$ independently chooses a mixed strategy $p^t_i$ using some no-regret algorithm.
  • Each player receives a cost vector $c^t_i$, which gives the expected cost of each of the possible actions of player $i$, given the mixed strategies played by the other players:
    $c^t_i(a_i) = \mathbb{E}_{a_{-i}\sim p^t_{-i}}\!\left[C_i(a_i, a_{-i})\right]$
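A compact Python sketch of these dynamics for a finite cost-minimization game where every player runs the MW rule from earlier; the cost-tensor representation and the helper names are illustrative, and the brute-force expectation is only suitable for small games.

import math
import itertools
import numpy as np

def expected_cost_vector(cost_i, i, p):
    """c^t_i(a_i) = E_{a_-i ~ p_-i}[ C_i(a_i, a_-i) ], computed by brute force."""
    c = np.zeros(cost_i.shape[i])
    for profile in itertools.product(*[range(s) for s in cost_i.shape]):
        prob = 1.0
        for j in range(cost_i.ndim):
            if j != i:
                prob *= p[j][profile[j]]
        c[profile[i]] += prob * cost_i[profile]
    return c

def no_regret_dynamics(costs, T):
    """costs[i]: numpy array of shape (n_1, ..., n_k) holding player i's costs in [0,1].
    Every player runs MW; returns the sequence of mixed-strategy profiles (p^t_1, ..., p^t_k)."""
    k = len(costs)
    sizes = costs[0].shape
    eps = [min(0.5, math.sqrt(math.log(sizes[i]) / T)) for i in range(k)]
    w = [np.ones(sizes[i]) for i in range(k)]
    history = []
    for t in range(T):
        p = [w[i] / w[i].sum() for i in range(k)]       # each player's MW distribution
        history.append(p)
        for i in range(k):
            c_i = expected_cost_vector(costs[i], i, p)  # the cost vector handed to player i
            w[i] = w[i] * (1 - eps[i]) ** c_i           # MW update
    return history   # the uniform mixture over these profiles approximates a CCE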

Page 33: Smoothness and Learning Equilibria

No Regret Dynamics and CCE
• Theorem:
  • Let $\sigma^1, \dots, \sigma^T$ (where $\sigma^t = p^t_1 \times\dots\times p^t_k$) be the outcome distribution sequence generated by the no-regret dynamics after $T$ time steps.
  • Let $\sigma$ be the uniform mixture over the multi-set of the outcome distributions $\sigma^1, \dots, \sigma^T$.
  • $\sigma$ is an approximate Coarse Correlated Equilibrium (CCE).
• Corollary: smooth-game PoA bounds apply (approximately) to no-regret dynamics.

Page 34: Smoothness and Learning Equilibria

No Regret Dynamics and CCE
• Proof:
• Player $i$ has external regret $R_i$.
• From the definition of $\sigma$:
  • Expected cost of player $i$ when playing according to his no-regret algorithm:
    $\mathbb{E}_{s\sim\sigma}[C_i(s)] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s)]$
  • Expected cost of player $i$ when always playing a fixed strategy $s_i'$:
    $\mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s_i', s_{-i})]$

Page 35: Smoothness and Learning Equilibria

No Regret Dynamics and CCE
• Proof (cont):
• For any fixed action $s_i'$, the no-regret guarantee gives:
  $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s)] \le \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s_i', s_{-i})] + R_i$
• ⇒ $\mathbb{E}_{s\sim\sigma}[C_i(s)] \le \mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})] + R_i$
• All players play no-regret algorithms: $R_i \to 0$ as $T\to\infty$, and we get the definition of CCE (up to the vanishing term $R_i$).

Page 36: Smoothness and Learning Equilibria

Swap Regret
• Another benchmark.
• Define a switching function $\delta: A \to A$ on the actions.
• Swap Regret: $\frac{1}{T}\left[\sum_{t=1}^{T} c^t(a^t) - \min_{\delta}\sum_{t=1}^{T} c^t\!\left(\delta(a^t)\right)\right]$
• External regret is the special case of swap regret where we only look at the constant switching functions $\delta(a) = a'$.

Page 37: Smoothness and Learning Equilibria

No Swap Regret Algorithm
• Definition:
• An algorithm has no swap regret if for all cost vector sequences and all switching functions $\delta$, the expected (time-averaged) swap regret
  $\frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T} c^t(a^t) - \sum_{t=1}^{T} c^t\!\left(\delta(a^t)\right)\right] \to 0$
as $T \to \infty$.

Page 38: Smoothness and Learning Equilibria

Swap-External Reduction
• Black-box reduction from no swap regret to no external regret.
• (Corollary: there exist poly-time no-swap-regret algorithms.)
• Notations:
  • $n$ - number of actions
  • $M_1, \dots, M_n$ - instantiations of a no-external-regret algorithm, one per action.
  • Each algorithm $M_j$ receives a cost vector and returns a probability distribution $q^t_j$ over actions.

Page 39: Smoothness and Learning Equilibria

Swap-External Reduction
• Reduction:
• The main algorithm $M$, at time $t$:
  • Receive distributions $q^t_1, \dots, q^t_n$ over actions from algorithms $M_1, \dots, M_n$.
  • Compute and output a consensus distribution $p^t$.
  • Receive a cost vector $c^t$ from the adversary.
  • Give each algorithm $M_j$ the scaled cost vector $p^t(a_j)\cdot c^t$.
• Goal: use the no-external-regret guarantee of the algorithms $M_j$ to get a no-swap-regret guarantee for $M$.
• Need to show how to compute $p^t$ (a code sketch of this loop follows below).
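A Python sketch of the master loop described above, assuming each M_j exposes a get_distribution() / update(cost_vector) interface (my naming, not the deck's); the consensus distribution is delegated to a stationary_distribution helper, whose Markov-chain construction is explained on Pages 45-46 and sketched there.

import numpy as np

def swap_regret_master_step(algs, cost_fn, t):
    """One step of the master algorithm M. `algs` holds the n no-external-regret
    algorithms M_1..M_n, one per action."""
    n = len(algs)
    # 1. collect the distributions q^t_j proposed by the sub-algorithms
    Q = np.array([algs[j].get_distribution() for j in range(n)])   # row j is q^t_j
    # 2. consensus distribution p^t, satisfying p^t(a_i) = sum_j p^t(a_j) q^t_j(a_i)
    p = stationary_distribution(Q)
    # 3. the adversary reveals the cost vector c^t
    c = np.asarray(cost_fn(t))
    # 4. feed each M_j its scaled cost vector p^t(a_j) * c^t
    for j in range(n):
        algs[j].update(p[j] * c)
    return p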

Page 40: Smoothness and Learning Equilibria

Swap-External Reduction
• [Diagram: the master algorithm $M$ outputs $p^t$, receives $c^t$ from the adversary, feeds each $M_j$ the scaled cost vector $p^t(a_j)\cdot c^t$, and receives back the distributions $q^t_1, \dots, q^t_n$.]

Page 41: Smoothness and Learning Equilibria

Swap-External Reduction
• Theorem:
  • This reduction provides a no-swap-regret algorithm.
• Proof:
  • Time-averaged expected cost of $M$: $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t(a_i)$
  • Time-averaged expected cost with some switching function $\delta$: $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t\!\left(\delta(a_i)\right)$

Page 42: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Want to show: for all switching functions $\delta$,
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t(a_i) \le \frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t\!\left(\delta(a_i)\right) + R$, where $R \to 0$ when $T \to \infty$.
• Time-averaged expected cost of the no-external-regret algorithm $M_j$ (which is fed the scaled cost vectors $p^t(a_j)\cdot c^t$):
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} q^t_j(a_i)\,p^t(a_j)\,c^t(a_i)$
• $M_j$ is a no-regret algorithm, so for every fixed action $a$:
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} q^t_j(a_i)\,p^t(a_j)\,c^t(a_i) \le \frac{1}{T}\sum_{t=1}^{T} p^t(a_j)\,c^t(a) + R_j$

Page 43: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Fix $\delta$.
• Summing the inequality over $j$, taking $a = \delta(a_j)$ for each $j$:
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{j}\sum_{i} q^t_j(a_i)\,p^t(a_j)\,c^t(a_i) \le \frac{1}{T}\sum_{t=1}^{T}\sum_{j} p^t(a_j)\,c^t\!\left(\delta(a_j)\right) + \sum_{j} R_j$

Page 44: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• For each $j$, $R_j \to 0$ as $T \to \infty$, and $n$ is fixed ⇒ $\sum_{j} R_j \to 0$ as $T \to \infty$.
• If $p^t(a_i) = \sum_{j} p^t(a_j)\,q^t_j(a_i)$ for every $i$, then the left-hand side above is exactly the expected cost of $M$, and we've shown the required inequality with $R = \sum_j R_j \to 0$ when $T \to \infty$, as needed.
• Need to choose $p^t$ so that for every $i$: $p^t(a_i) = \sum_{j} p^t(a_j)\,q^t_j(a_i)$

Page 45: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Such a $p^t$ is a stationary distribution of a Markov chain.
• Build the following Markov chain:
  • The states - the action set $A$.
  • For every $i, j$, the transition probability from $a_j$ to $a_i$ is $q^t_j(a_i)$.

Page 46: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Every stationary distribution of this Markov chain satisfies $p^t(a_i) = \sum_{j} p^t(a_j)\,q^t_j(a_i)$, exactly the condition we need.
• This is a finite Markov chain.
• For such a chain there is at least one stationary distribution, and it can be computed in polynomial time (eigenvector computation, sketched below).
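One way to do that eigenvector computation in Python/NumPy (the helper name matches the sketch after Page 39; taking the left eigenvector of the transition matrix for the eigenvalue closest to 1 is one standard approach, not the only one).

import numpy as np

def stationary_distribution(Q):
    """Q[j, i] = q^t_j(a_i): transition probability from state a_j to state a_i.
    Returns p with p = p Q, i.e. a left eigenvector of Q for eigenvalue 1."""
    eigvals, eigvecs = np.linalg.eig(Q.T)        # right eigenvectors of Q^T = left eigenvectors of Q
    k = int(np.argmin(np.abs(eigvals - 1.0)))    # the eigenvalue (numerically) closest to 1
    p = np.abs(np.real(eigvecs[:, k]))           # such an eigenvector can be chosen nonnegative
    return p / p.sum()                           # normalize to a probability distribution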

Page 47: Smoothness and Learning Equilibria

Swap Regret Bound
• We have shown a no-swap-regret reduction, with swap regret $\sum_{j=1}^{n} R_j$.
• If we take MW as the algorithms $M_j$, we get a swap regret bound of $2n\sqrt{\frac{\ln n}{T}}$ (each $R_j$ is bounded by $2\sqrt{\frac{\ln n}{T}}$).
• We can improve this:
  • In MW, the actual bound we found was $C_{MW} \le OPT + \varepsilon\,OPT + \frac{\ln n}{\varepsilon}$, i.e. regret about $2\sqrt{OPT\cdot\ln n}$ for the best choice of $\varepsilon$, and we simply bounded $OPT$ by $T$.
  • If we take the original bound, we get: $R_j \le \frac{2\sqrt{OPT_j\,\ln n}}{T}$, where $OPT_j$ is the cost of the best fixed action for the scaled cost sequence fed to $M_j$.

Page 48: Smoothness and Learning Equilibria

Swap Regret Bound
• The cost vector we gave to $M_j$ at time $t$ was scaled by $p^t(a_j)$ ⇒ (using $c^t \le 1$) $\sum_{j=1}^{n} OPT_j \le \sum_{j=1}^{n}\sum_{t=1}^{T} p^t(a_j) = T$
• The square root function is concave ⇒ Jensen's inequality: $\frac{1}{n}\sum_{j}\sqrt{OPT_j} \le \sqrt{\frac{1}{n}\sum_{j} OPT_j}$
• ⇒ $\sum_{j}\sqrt{OPT_j} \le \sqrt{n\sum_{j} OPT_j} \le \sqrt{nT}$

Page 49: Smoothness and Learning Equilibria

Swap Regret Bound
• ⇒ $\sum_{j=1}^{n} R_j \le \frac{2\sqrt{\ln n}}{T}\sum_{j}\sqrt{OPT_j} \le \frac{2\sqrt{nT\ln n}}{T}$
• We get the swap regret bound of $O\!\left(\sqrt{\frac{n\ln n}{T}}\right)$

Page 50: Smoothness and Learning Equilibria

No Swap Regret Dynamics and CE
• Equivalent definition of CE using switching functions:
• A distribution σ on the set of outcomes of a cost-minimization game is a correlated equilibrium if for every player $i$ and every switching function $\delta: S_i \to S_i$,
  $\mathbb{E}_{s\sim\sigma}[C_i(s)] \le \mathbb{E}_{s\sim\sigma}\!\left[C_i\!\left(\delta(s_i), s_{-i}\right)\right]$

Page 51: Smoothness and Learning Equilibria

No Swap Regret Dynamics and CE
• Theorem:
  • Let $\sigma^1, \dots, \sigma^T$ be the outcome distribution sequence generated by the no-swap-regret dynamics after $T$ time steps (all players use no-swap-regret algorithms).
  • Let $\sigma$ be the uniform mixture over the multi-set of the outcome distributions $\sigma^1, \dots, \sigma^T$.
  • Then $\sigma$ is an approximate CE.
• Proof: similar to the proof that no-external-regret dynamics converge to a CCE.

Page 52: Smoothness and Learning Equilibria

Partial Information Model
• Game Model:
• At time $t$:
  • Player picks a distribution $p^t$ over his action set $A$.
  • Adversary picks a cost vector $c^t$.
  • Action $a^t$ is chosen according to the distribution $p^t$, and the player incurs the cost $c^t(a^t)$.
  • Player only knows the value $c^t(a^t)$.
• Lack of information leads to an exploration vs. exploitation trade-off.

Page 53: Smoothness and Learning Equilibria

Partial Information Model
• Multi-Armed Bandit Problem:
  • Single player
  • $n$ non-identical slot machines
  • $T$ time steps
  • Player needs to choose which slot machine to play in each time step in order to maximize his profit.
• This is a simple model of the trade-off between:
  • Exploration - trying each slot machine to find the best one.
  • Exploitation - repeatedly playing the slot machine believed to give the best payoff.

Page 54: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Goal: show the existence of a no-regret partial-information algorithm via a reduction to a full-information no-regret algorithm.
• For the reduction:
  • Assume $T$ (the number of time steps) is given as a parameter, and divide it into $K$ groups of consecutive time steps - $B_1, \dots, B_K$.
  • Assume $K$ divides $T$ (and each group has $\frac{T}{K} \ge n$ steps).
  • $M$ - the full-information no-external-regret algorithm we use, with time-averaged regret $R(K)$ after $K$ steps.

Page 55: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Reduction (a code sketch follows below):
• The main algorithm $M'$, at time $t$:
  • If $t - 1$ was the last time step in the previous group $B_{k-1}$:
    • Give $M$ a cost vector $\hat{c}^{\,k-1}$, where $\hat{c}^{\,k-1}(a)$ is the estimated cost of action $a$ in group $B_{k-1}$.
    • Get from $M$ a distribution $q^k$ over actions.
  • In each group $B_k$, select $n$ time steps in the group uniformly at random, one exploration step $t_a$ for each action $a$.
  • If $t = t_a$ for some action $a$: return the distribution that plays $a$ with probability 1 (exploration phase).
  • Else: return the distribution $q^k$ (exploitation phase).
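A Python sketch of this block reduction, assuming the full-information algorithm exposes a get_distribution() / update(cost_vector) interface and that every group has at least $n$ steps; the names, the interface, and passing $K$ explicitly are my choices, not the deck's.

import random

def bandit_from_full_info(T, n, bandit_cost_fn, full_info_alg, K):
    """bandit_cost_fn(t, action) reveals only the cost of the action actually played.
    T is split into K groups (K divides T, T/K >= n); one random exploration step per
    action per group estimates the group's cost vector, which is fed to full_info_alg."""
    group = T // K
    total_cost = 0.0
    for k in range(K):
        q = full_info_alg.get_distribution()          # exploitation distribution q^k
        explore = random.sample(range(group), n)      # one exploration slot per action
        slot_of = {explore[a]: a for a in range(n)}
        c_hat = [0.0] * n                             # estimated cost vector for this group
        for s in range(group):
            t = k * group + s
            if s in slot_of:
                a = slot_of[s]                        # exploration phase: play a with prob. 1
            else:
                a = random.choices(range(n), weights=q)[0]   # exploitation phase: play q^k
            cost = bandit_cost_fn(t, a)
            total_cost += cost
            if s in slot_of:
                c_hat[a] = cost                       # the single observed sample for action a
        full_info_alg.update(c_hat)                   # one full-information step per group
    return total_cost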

Page 56: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• We want to show $M'$ is a no-regret algorithm, i.e. its time-averaged regret $R'(T) \to 0$ as $T \to \infty$.
• Proof:
  • Let $c^t(a)$ - the cost of action $a$ at time $t$.
  • The estimated cost of action $a$ for group $B_k$ is $\hat{c}^{\,k}(a) = c^{t_a}(a)$; since $t_a$ is uniform over the group,
    $\mathbb{E}\!\left[\hat{c}^{\,k}(a)\right] = \frac{K}{T}\sum_{t\in B_k} c^t(a)$
  • $M'$ will pass to $M$ a total of $K$ such cost vectors $\hat{c}^{\,1}, \dots, \hat{c}^{\,K}$.
  • The output of $M$ at the end of group $B_k$ is as if $M$ played a full-information game with $K$ time steps, and returned at step $k$ the probability distribution $q^k$.

Page 57: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
  • $t_a$ is the time step in group $B_k$ where the probability of action $a$ is 1 (and all other actions have probability 0); there are $n$ such exploration steps per group, each of cost at most 1.
  • In the remaining (exploitation) steps of $B_k$, $M'$ plays $q^k$, so its expected cost in the group is at most $n + \sum_{t\in B_k}\sum_{a} q^k(a)\,c^t(a)$.

Page 58: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
  • Taking expectations over the random exploration steps and summing over the groups:
    $\mathbb{E}\!\left[C_{M'}\right] \le nK + \sum_{k=1}^{K}\frac{T}{K}\sum_{a} q^k(a)\,\mathbb{E}\!\left[\hat{c}^{\,k}(a)\right] = nK + \frac{T}{K}\,\mathbb{E}\!\left[C_M\right]$
  where $C_M$ is the cost of $M$ in the simulated $K$-step game.

Page 59: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
• What we have so far: $\mathbb{E}\!\left[C_{M'}\right] \le \frac{T}{K}\,\mathbb{E}\!\left[C_M\right] + nK$
• Want to find an upper bound for $\mathbb{E}\!\left[C_M\right]$.
• Assume $M$ is a no-external-regret algorithm with time-averaged regret $R(K)$ over $K$ steps.
• Cost of the best fixed action in the simulated game: $\min_{a}\sum_{k=1}^{K}\hat{c}^{\,k}(a)$, whose expectation is at most $\frac{K}{T}\,OPT$.
• Time-averaged expected cost of $M$: at most $\frac{1}{K}\cdot\frac{K}{T}\,OPT + R(K)$.

Page 60: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
⇒ $\mathbb{E}\!\left[C_M\right] \le \frac{K}{T}\,OPT + K\cdot R(K)$
• So we have: $\mathbb{E}\!\left[C_{M'}\right] \le OPT + T\cdot R(K) + nK$
• We have shown a no-external-regret algorithm with $R(K) = 2\sqrt{\frac{\ln n}{K}}$ (the MW algorithm). So we can get:
  $\mathbb{E}\!\left[C_{M'}\right] \le OPT + 2T\sqrt{\frac{\ln n}{K}} + nK$

Page 61: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
• We would like to choose the size of each group (equivalently, the number of groups $K$) to minimize the regret:
  $K = \left(\frac{T\sqrt{\ln n}}{n}\right)^{2/3}$
⇒ $\mathbb{E}\!\left[C_{M'}\right] - OPT = O\!\left(T^{2/3}\,n^{1/3}(\ln n)^{1/3}\right)$
⇒ Time averaged: $O\!\left(\left(\frac{n\ln n}{T}\right)^{1/3}\right)$

Page 62: Smoothness and Learning Equilibria

EXP3 Algorithm
• A partial-information no-regret algorithm.
• Based on the Exponential Weights algorithm.
• Exponential Weights:
  • A full-information no-external-regret algorithm.
  • Similar to Multiplicative Weights, except at time $t$ we update the weights using an exponential factor: $w^{t+1}(a) = w^t(a)\cdot e^{-\eta\, c^t(a)}$.
• We will show the algorithm for profit maximization (instead of cost minimization), so the update below uses $e^{+\eta\cdot\text{profit}}$.
• Notation:
  • $W^t$ - the sum of the weights at time $t$

Page 63: Smoothness and Learning Equilibria

EXP3 Algorithm
• The Algorithm (a code sketch follows below):
• Initialize $w^1(a) = 1$ for every action $a$.
• At time $t$:
  • $p^t(a) = (1-\gamma)\frac{w^t(a)}{W^t} + \frac{\gamma}{n}$ (use a small correction $\gamma$ to ensure $p^t(a)$ is never too close to 0).
  • Choose an action $a^t$ according to the distribution $p^t$.
  • Receive the profit value $\pi^t(a^t)$.
  • Define: $\hat{\pi}^t(a) = \frac{\pi^t(a^t)}{p^t(a^t)}$ if $a = a^t$, and $\hat{\pi}^t(a) = 0$ otherwise (an unbiased estimate of $\pi^t(a)$).
  • Update the weights: $w^{t+1}(a) = w^t(a)\cdot e^{\eta\,\hat{\pi}^t(a)}$.
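A Python sketch of these steps for profit maximization; the exploration parameter $\gamma$ and learning rate $\eta$ below are illustrative defaults rather than the tuned values from the original EXP3 analysis, and the profit_fn interface is mine.

import math
import random

def exp3(T, n, profit_fn, gamma=0.05, eta=0.05):
    """Bandit EXP3 sketch (profit maximization). profit_fn(t, action) in [0,1] is the
    only feedback the player receives."""
    w = [1.0] * n
    total_profit = 0.0
    for t in range(T):
        W = sum(w)
        p = [(1 - gamma) * wi / W + gamma / n for wi in w]   # keep p^t(a) away from 0
        a = random.choices(range(n), weights=p)[0]
        profit = profit_fn(t, a)
        total_profit += profit
        pi_hat = profit / p[a]              # unbiased estimate of pi^t(a); 0 for all other actions
        w[a] *= math.exp(eta * pi_hat)      # only the played action's weight changes
    return total_profit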

Page 64: Smoothness and Learning Equilibria

EXP3 Algorithm
• Bound on regret (no proof):
  • Similar to the bound shown for the reduction, the time-averaged external regret of this algorithm also tends to 0 as $T \to \infty$.

Page 65: Smoothness and Learning Equilibria

Minimax Theorem
• Reminder:
  • 2-player zero-sum game.
  • Represented by a payoff matrix $A$.
  • $A_{ij}$ - the payoff of the row player in the outcome $(i, j)$.
  • $x$ - mixed strategy of the row player (a distribution over rows).
  • $y$ - mixed strategy of the column player (a distribution over columns).
  • Expected payoff of the row player: $x^\top A\,y = \sum_{i,j} x_i\,A_{ij}\,y_j$
  • Expected payoff of the column player - the negative of this.

Page 66: Smoothness and Learning Equilibria

Minimax Theorem
• Reminder:
• MNE:
  • A pair $(\hat{x}, \hat{y})$, where for all distributions $x$ over rows: $x^\top A\,\hat{y} \le \hat{x}^\top A\,\hat{y}$, and for all distributions $y$ over columns: $\hat{x}^\top A\,y \ge \hat{x}^\top A\,\hat{y}$.
• Theorem (Minimax):
  • For every 2-player zero-sum game $A$, $\max_{x}\min_{y} x^\top A\,y = \min_{y}\max_{x} x^\top A\,y$.

Page 67: Smoothness and Learning Equilibria

Minimax Theorem
• We can deduce this theorem from the existence of no-external-regret algorithms.
• Until now we discussed cost-minimization games, but the same no-regret algorithms, with some minor modifications, apply to payoff maximization.
• Proof:
• The easy direction, $\max_{x}\min_{y} x^\top A\,y \le \min_{y}\max_{x} x^\top A\,y$: whatever strategy the row player commits to, the column player can respond optimally, so moving second never hurts.

Page 68: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• The other direction, $\max_{x}\min_{y} x^\top A\,y \ge \min_{y}\max_{x} x^\top A\,y$:
  • Have the two players play the game for enough time steps $T$, so that both have expected external regret of at most $\epsilon$.
  • Let $p^1, \dots, p^T$ be the mixed strategies played by the row player as advised by the no-external-regret algorithm he uses.
  • Let $q^1, \dots, q^T$ be the mixed strategies played by the column player as advised by the no-external-regret algorithm he uses.

Page 69: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• At time $t$, the input to the no-regret algorithms will be:
  • The payoff vector $A\,q^t$ for the row player (to maximize).
  • The cost vector $A^\top p^t$ for the column player (to minimize).
• This is the expected payoff of each strategy on day $t$ given the mixed strategy played by the other player.

Page 70: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• The time-averaged mixed strategy of the row player: $\hat{x} = \frac{1}{T}\sum_{t=1}^{T} p^t$
• The time-averaged mixed strategy of the column player: $\hat{y} = \frac{1}{T}\sum_{t=1}^{T} q^t$
• The time-averaged expected payoff of the row player: $v = \frac{1}{T}\sum_{t=1}^{T} (p^t)^\top A\,q^t$

Page 71: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• Because the row player uses a no-external-regret algorithm, for every fixed action corresponding to row $i$:
  $v \ge \frac{1}{T}\sum_{t=1}^{T} e_i^\top A\,q^t - \epsilon = e_i^\top A\,\hat{y} - \epsilon$
• This means that the row player could have gained at most an additional $\epsilon$ from playing any fixed action in all time steps.
• Since every arbitrary row mixed strategy $x$ is just a distribution over the $e_i$'s, we get:
  $v \ge x^\top A\,\hat{y} - \epsilon$ for every $x$, and in particular $v \ge \max_{x} x^\top A\,\hat{y} - \epsilon$.

Page 72: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• For the column player, a similar argument applies.
• The column player attempts to minimize the value and has at most $\epsilon$ external regret.
• So, playing a fixed column action in all time steps could have decreased the value by at most $\epsilon$:
  $v \le \hat{x}^\top A\,y + \epsilon$ for every column mixed strategy $y$, and in particular $v \le \min_{y}\hat{x}^\top A\,y + \epsilon$.

Page 73: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• So:
  $\min_{y}\max_{x} x^\top A\,y \le \max_{x} x^\top A\,\hat{y}$  [instantiating $y$ to be $\hat{y}$]
  $\le v + \epsilon$  [row player's no-regret guarantee]
  $\le \min_{y}\hat{x}^\top A\,y + 2\epsilon$  [column player's no-regret guarantee]
  $\le \max_{x}\min_{y} x^\top A\,y + 2\epsilon$  [the max over $x$ is at least the value at $\hat{x}$]
• We got: $\min_{y}\max_{x} x^\top A\,y \le \max_{x}\min_{y} x^\top A\,y + 2\epsilon$.
• No regret - as $T \to \infty$, $\epsilon \to 0$, which gives the other direction.
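A small Python experiment matching this proof: both players run the exponential-weights variant mentioned above on a payoff matrix A (the row player maximizes, the column player minimizes), and the time-averaged payoff approaches the value of the game; the matrix and parameter choices are mine.

import numpy as np

def minimax_via_no_regret(A, T=20000, eta=None):
    """Row player maximizes x^T A y, column player minimizes it; both use exponential weights."""
    m, n = A.shape
    if eta is None:
        eta = np.sqrt(np.log(max(m, n)) / T)
    wx, wy = np.ones(m), np.ones(n)
    avg_payoff = 0.0
    for t in range(T):
        x = wx / wx.sum()
        y = wy / wy.sum()
        avg_payoff += (x @ A @ y) / T
        wx *= np.exp(eta * (A @ y))         # row player: reward rows that do well against y
        wy *= np.exp(-eta * (A.T @ x))      # column player: penalize columns that pay the row player
    return avg_payoff

# Matching pennies: the value of the game is 0, and the returned average is close to 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(minimax_via_no_regret(A))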

Page 74: Smoothness and Learning Equilibria

CE as LP Problem
• Payoff matrix of the 2x2 game:

            S21                          S22
  S11   A1(S11,S21), A2(S11,S21)     A1(S11,S22), A2(S11,S22)
  S12   A1(S12,S21), A2(S12,S21)     A1(S12,S22), A2(S12,S22)

• Correlated Distribution:

            S21    S22
  S11       p1     p2
  S12       p3     p4

Page 75: Smoothness and Learning Equilibria

CE as LP Problem
• Payoff of player 1:
  • Player 1 is advised to play $S_{11}$ with probability $p_1 + p_2$.
  • Given that player 1 is advised to play $S_{11}$, player 2 is advised to play $S_{21}$ with probability $\frac{p_1}{p_1 + p_2}$ (and $S_{22}$ with probability $\frac{p_2}{p_1 + p_2}$).
  • The expected payoff of player 1 if he is advised to play $S_{11}$ and follows the advice:
    $\frac{p_1}{p_1 + p_2}A_1(S_{11}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{11}, S_{22})$
  • The expected payoff of player 1 if he decides to ignore the advice (and play $S_{12}$ instead):
    $\frac{p_1}{p_1 + p_2}A_1(S_{12}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{12}, S_{22})$

Page 76: Smoothness and Learning Equilibria

CE as LP Problem
• CE requires (for player 1 when advised $S_{11}$):
  $\frac{p_1}{p_1 + p_2}A_1(S_{11}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{11}, S_{22}) \ge \frac{p_1}{p_1 + p_2}A_1(S_{12}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{12}, S_{22})$
• From this we get the linear constraint:
  $p_1 A_1(S_{11}, S_{21}) + p_2 A_1(S_{11}, S_{22}) \ge p_1 A_1(S_{12}, S_{21}) + p_2 A_1(S_{12}, S_{22})$
• Similar constraints are obtained for player 1 when advised $S_{12}$, and for player 2 when advised $S_{21}$ or $S_{22}$.

Page 77: Smoothness and Learning Equilibria

CE as LP Problem
• LP Problem (a code sketch follows below):
• Maximize a linear objective over the distribution $(p_1, p_2, p_3, p_4)$ - e.g. a chosen player's expected payoff (Libertarian equilibrium).
• Under the conditions:
  • The incentive constraints above, for both players and both advised strategies.
  • $p_1 + p_2 + p_3 + p_4 = 1$, $p_k \ge 0$.
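A sketch of this LP in Python using scipy.optimize.linprog, for the 2x2 game and the variable ordering (p1, ..., p4) from the table on Page 74. The objective here maximizes player 1's expected payoff, which is my reading of the "libertarian" objective; any other linear objective (e.g. the sum of payoffs) drops in the same way.

import numpy as np
from scipy.optimize import linprog

def correlated_eq_2x2(A1, A2):
    """A1[i, j], A2[i, j]: payoffs of players 1 and 2 when player 1 plays row i (S11/S12)
    and player 2 plays column j (S21/S22). Variables: p1=(S11,S21), p2=(S11,S22),
    p3=(S12,S21), p4=(S12,S22)."""
    # incentive constraints, written as "deviation payoff - advised payoff <= 0"
    G = [
        [A1[1, 0] - A1[0, 0], A1[1, 1] - A1[0, 1], 0, 0],   # player 1 advised S11
        [0, 0, A1[0, 0] - A1[1, 0], A1[0, 1] - A1[1, 1]],   # player 1 advised S12
        [A2[0, 1] - A2[0, 0], 0, A2[1, 1] - A2[1, 0], 0],   # player 2 advised S21
        [0, A2[0, 0] - A2[0, 1], 0, A2[1, 0] - A2[1, 1]],   # player 2 advised S22
    ]
    # maximize player 1's expected payoff (linprog minimizes, hence the sign flip)
    obj = [-A1[0, 0], -A1[0, 1], -A1[1, 0], -A1[1, 1]]
    res = linprog(obj, A_ub=G, b_ub=[0, 0, 0, 0],
                  A_eq=[[1, 1, 1, 1]], b_eq=[1], bounds=[(0, 1)] * 4)
    return res.x   # the CE distribution (p1, p2, p3, p4)

# Example: the game of chicken; a CE can put weight on the off-diagonal outcomes.
A1 = np.array([[0.0, 7.0], [2.0, 6.0]])
A2 = np.array([[0.0, 2.0], [7.0, 6.0]])
print(correlated_eq_2x2(A1, A2))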