Smoothness and Learning Equilibria


Page 1: Smoothness and Learning Equilibria

Smoothness and Learning Equilibria – Price of Anarchy Seminar. Presenting: Aviv Kuvent. Professor: Michal Feldman.

Page 2: Smoothness and Learning Equilibria

Reminder
• PoA bound results for PNE, MNE.
• However:
  • PNE need not always exist.
  • MNE is hard to compute.
• Introduces 2 additional types of equilibrium:
  • Correlated Equilibrium (CE)
  • Coarse Correlated Equilibrium (CCE)

Page 3: Smoothness and Learning Equilibria

Reminder - CE
• Definition (CE):
• A distribution σ on the set of outcomes of a cost-minimization game is a correlated equilibrium if for every player $i$, strategy $s_i$, and every deviation $s_i'$,
  $\mathbb{E}_{s\sim\sigma}[C_i(s)\mid s_i] \le \mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})\mid s_i]$

Page 4: Smoothness and Learning Equilibria

Reminder - CCE
• Definition (CCE):
• A distribution σ on the set of outcomes of a cost-minimization game is a coarse correlated equilibrium if for every player $i$ and every deviation $s_i'$,
  $\mathbb{E}_{s\sim\sigma}[C_i(s)] \le \mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})]$

Page 5: Smoothness and Learning Equilibria

Reminder – Eq. Hierarchy: PNE ⊆ MNE ⊆ CE ⊆ CCE (each set of equilibria contains the one before it).

Page 6: Smoothness and Learning Equilibria

Reminder
• We saw that in smooth games, the PoA bounds proven for PNE/MNE also hold for CE/CCE.
• CE/CCE are easy to compute:
  • Linear Programming
  • No Regret Dynamics

Page 7: Smoothness and Learning Equilibria

The Game Model
• Single player
• Set of actions $A$, where $|A| = n$
• Adversary
• $T$ time steps

Page 8: Smoothness and Learning Equilibria

The Game Model
• At time $t$:
  • Player picks a distribution $p^t$ over his action set $A$.
  • Adversary picks a cost vector $c^t: A \to [0,1]$.
  • Action $a^t$ is chosen according to the distribution $p^t$, and the player incurs the cost $c^t(a^t)$.
• Player is aware of the values of the entire cost vector $c^t$ (Full Information Model).
• Partial Information Model:
  • Player only knows the value $c^t(a^t)$ of the incurred cost.

Page 9: Smoothness and Learning Equilibria

Regret
• The difference between the cost of our algorithm and the cost of a chosen benchmark.
• $B$ - the benchmark we choose.
• Time-averaged Regret: $\frac{1}{T}\left[\sum_{t=1}^{T} c^t(a^t) - B\right]$

Page 10: Smoothness and Learning Equilibria

No Regret Algorithm
• An algorithm has no regret if for every sequence of cost vectors, the expected time-averaged regret $\to 0$ as $T \to \infty$.
• Equivalently, we would like to show that the expected total cost is at most the benchmark plus $o(T)$.
• Goal: develop such an algorithm.
• Which benchmark to use?

Page 11: Smoothness and Learning Equilibria

Optimal Sequence
• Or "best action sequence in hindsight": $\sum_{t=1}^{T}\min_{a\in A} c^t(a)$
• Too strong to be a useful benchmark.
• Theorem:
  • For any algorithm $\mathcal{A}$ there exists a sequence of cost vectors such that the expected cost of $\mathcal{A}$ is at least $T\left(1-\frac{1}{n}\right)$ while the optimal sequence has cost 0, where $n$ is the size of the action set of the player.

Page 12: Smoothness and Learning Equilibria

Optimal Sequence
• Proof:
  • Build the following sequence: in time $t$, set $c^t(a) = 1$ for every action except the action minimizing $p^t(a)$, which gets cost 0.
  • ⇒ Expected cost of $\mathcal{A}$ in time $t$ is at least $1 - \frac{1}{n}$ (the zero-cost action has probability at most $\frac{1}{n}$).
  • The cost of $\mathcal{A}$ for $T$ time steps is at least $T\left(1-\frac{1}{n}\right)$.
  • An optimal sequence of cost 0 exists: at each step, play the single zero-cost action.

Page 13: Smoothness and Learning Equilibria

External Regret
• Benchmark – the best fixed action: $\min_{a\in A}\sum_{t=1}^{T} c^t(a)$
  • Playing the same action in all time steps.
• External Regret: $\frac{1}{T}\left[\sum_{t=1}^{T} c^t(a^t) - \min_{a\in A}\sum_{t=1}^{T} c^t(a)\right]$

Page 14: Smoothness and Learning Equilibria

The Greedy Algorithm
• First attempt to develop a no-external-regret algorithm.
• Notations:
  • The cumulative cost of action $a$ up to time $t$: $C^t(a) = \sum_{\tau=1}^{t} c^\tau(a)$
  • The cost of the best fixed action: $C^t_{\min} = \min_{a\in A} C^t(a)$
  • The cost of the greedy algorithm: $C^t_{greedy} = \sum_{\tau=1}^{t} c^\tau(a^\tau)$

Page 15: Smoothness and Learning Equilibria

The Greedy Algorithm
• For simplicity, assume that $c^t(a) \in \{0, 1\}$.
• The idea – choose the action which incurred minimal cost so far.
• Greedy Algorithm (a short code sketch follows below):
  • Initially: select $a^1$ to be some arbitrary action in the action set $A$.
  • At time $t$:
    • Let $C^{t-1}_{\min} = \min_{a\in A} C^{t-1}(a)$, $S^{t-1} = \{a \in A : C^{t-1}(a) = C^{t-1}_{\min}\}$.
    • Choose $a^t \in S^{t-1}$ (e.g. the lowest-indexed action in $S^{t-1}$, staying with $a^{t-1}$ if it is still a minimizer).
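A minimal Python sketch of this greedy rule, under the 0/1-cost assumption above; the function name, the cost_fn interface, and the lowest-index tie-breaking are illustrative choices, not taken from the deck.

def greedy_play(T, n, cost_fn):
    """Follow-the-leader sketch. cost_fn(t) returns the 0/1 cost vector c^t,
    revealed only after the action for step t has been committed."""
    cumulative = [0] * n          # C^{t-1}(a) for every action a
    action = 0                    # arbitrary initial action
    total_cost = 0
    for t in range(T):
        best = min(cumulative)
        if cumulative[action] != best:
            action = cumulative.index(best)   # lowest-index minimizer
        c = cost_fn(t)
        total_cost += c[action]
        cumulative = [cumulative[a] + c[a] for a in range(n)]
    return total_cost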

Page 16: Smoothness and Learning Equilibria

The Greedy Algorithm
• Theorem:
  • For any sequence of losses, the greedy algorithm has $C^T_{greedy} \le n\cdot C^T_{\min} + (n-1)$.
• Proof:
  • At each time where the greedy algorithm incurs a cost of 1 and $C_{\min}$ does not increase, at least one action is removed from the set $S$.
  • This can occur at most $n$ times before $C_{\min}$ increases by 1.
  • Therefore, the greedy algorithm incurs cost at most $n$ between successive increments in $C_{\min}$ ⇒ $C^T_{greedy} \le n\cdot C^T_{\min} + (n-1)$.
  • The additional $(n-1)$ is for the cost incurred in the steps since the last time $C_{\min}$ increased.

Page 17: Smoothness and Learning Equilibria

Deterministic Algorithms
• The greedy algorithm has a cost of at most a factor $n$ larger than that of the best fixed action – not a "no regret" algorithm.
• Theorem: For any deterministic algorithm $\mathcal{A}$, there exists a cost sequence for which $C^T_{\mathcal{A}} = T$ and $C^T_{\min} \le \frac{T}{n}$.
• Corollary: There does not exist a no-regret deterministic algorithm.

Page 18: Smoothness and Learning Equilibria

Deterministic Algorithms
• Proof:
  • Fix a deterministic algorithm $\mathcal{A}$.
  • $a^t$ - the action it selects at time $t$ (known in advance, since $\mathcal{A}$ is deterministic).
  • Generate the following cost sequence: $c^t(a^t) = 1$ and $c^t(a) = 0$ for every other action.
  • $\mathcal{A}$ incurs a cost of 1 in each time step ⇒ $C^T_{\mathcal{A}} = T$.
  • $\mathcal{A}$ selects some action at most $\frac{T}{n}$ times; that action is charged cost 1 only when it is selected, so it is the best fixed action with $C^T_{\min} \le \frac{T}{n}$
⇒ the time-averaged regret is at least $1 - \frac{1}{n}$, not dependent on $T$ ⇒ $\mathcal{A}$ is not a no-regret algorithm.

Page 19: Smoothness and Learning Equilibria

Lower Bound on Regret
• Theorem:
  • Even for $n = 2$ actions, no (randomized) algorithm has expected regret which vanishes faster than $\Theta\!\left(\frac{1}{\sqrt{T}}\right)$.
• Proof:
  • Let $\mathcal{A}$ be a randomized algorithm.
  • Adversary: on each time step $t$, chooses uniformly at random between the two cost vectors $(1, 0)$ and $(0, 1)$.
  • ⇒ The expected cost of $\mathcal{A}$ is exactly $\frac{T}{2}$, regardless of what it does.
  • Think of the adversary as flipping $T$ fair coins: expected number of heads $\frac{T}{2}$, standard deviation $= \frac{\sqrt{T}}{2}$.

Page 20: Smoothness and Learning Equilibria

Lower Bound on Regret
• Proof (cont):
  • ⇒ In expectation, the cheaper of the two actions has cumulative cost $\frac{T}{2} - \Theta(\sqrt{T})$, and the other action has cumulative cost $\frac{T}{2} + \Theta(\sqrt{T})$.
  • The cheaper action is the best fixed action ⇒ the expected regret is $\Theta(\sqrt{T})$
  • ⇒ the expected time-averaged regret is $\Theta\!\left(\frac{1}{\sqrt{T}}\right)$.
  • For the general case of $n$ actions, a similar argument gives a lower bound of $\Omega\!\left(\sqrt{\frac{\ln n}{T}}\right)$.

Page 21: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• The idea:
  • Keep a weight for each action.
  • Probability of playing an action depends on its weight.
  • The weights are updated over time; used to "punish" actions which incurred cost.
• Notations:
  • $w^t(a)$ – weight of action $a$ at time $t$
  • $W^t = \sum_{a\in A} w^t(a)$ – sum of the weights of the actions at time $t$
  • $c^t(a)$ – cost of action $a$ at time $t$
  • $\nu^t$ – expected cost of the MW algorithm at time $t$
  • $C_{MW} = \sum_{t=1}^{T}\nu^t$ – expected cost of the MW algorithm
  • $OPT = \min_{a\in A}\sum_{t=1}^{T} c^t(a)$ – cost of the best fixed action

Page 22: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• MW Algorithm (a short code sketch follows below):
• Initially: $w^1(a) = 1$ for each $a \in A$ (so $W^1 = n$).
• At time step $t$:
  • Play an action according to the distribution $p^t$, where $p^t(a) = \frac{w^t(a)}{W^t}$.
  • Given the cost vector $c^t$, decrease the weights: $w^{t+1}(a) = w^t(a)\cdot(1-\varepsilon)^{c^t(a)}$.
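A short Python sketch of these two steps, assuming costs in $[0,1]$ and the multiplicative $(1-\varepsilon)^{cost}$ update described above; the default $\varepsilon$ is the value chosen later in the analysis, and the function and variable names are mine.

import math
import random

def multiplicative_weights(T, n, cost_fn, eps=None):
    """Full-information MW sketch. cost_fn(t) returns the cost vector c^t in [0,1]^n,
    revealed after the action for step t is drawn."""
    if eps is None:
        eps = min(0.5, math.sqrt(math.log(n) / T))      # the choice from the analysis below
    w = [1.0] * n
    expected_cost = 0.0
    for t in range(T):
        W = sum(w)
        p = [wi / W for wi in w]                        # p^t(a) = w^t(a) / W^t
        action = random.choices(range(n), weights=p)[0]
        c = cost_fn(t)
        expected_cost += sum(p[a] * c[a] for a in range(n))
        w = [w[a] * (1 - eps) ** c[a] for a in range(n)]   # punish actions that incurred cost
    return expected_cost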

Page 23: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• What is $\varepsilon$?
• As $\varepsilon$ tends towards 0, the distribution tends towards a uniform distribution (exploration).
• As $\varepsilon$ tends towards 1, the distribution tends towards a distribution which places most of the mass on the action which incurred the minimal cost so far (exploitation).
• We set the specific value of $\varepsilon$ later; for now assume $\varepsilon \le \frac{1}{2}$.

Page 24: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Theorem:
  • MW is a no-regret algorithm, with expected (time-averaged) external regret of $O\!\left(\sqrt{\frac{\ln n}{T}}\right)$.
• Proof:
  • The weight of every action (and therefore their sum, $W^t$) can only decrease with $t$.
  • The idea of the proof - relate both $OPT$ and $C_{MW}$ to $W^{T+1}$, and from there get the bound.

Page 25: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• Relate $OPT$ to $W^{T+1}$:
  • Let $a^*$ be the best fixed action, assumed to have a low cost (if no such action exists, we trivially have low regret with respect to the fixed-action benchmark).
  • The weight of action $a^*$ at time $T+1$: $w^{T+1}(a^*) = \prod_{t=1}^{T}(1-\varepsilon)^{c^t(a^*)} = (1-\varepsilon)^{OPT}$
  • ⇒ $W^{T+1} \ge w^{T+1}(a^*) = (1-\varepsilon)^{OPT}$

Page 26: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• Relate $C_{MW}$ to $W^{T+1}$:
  • At time $t$:
    $W^{t+1} = \sum_{a\in A} w^t(a)(1-\varepsilon)^{c^t(a)} \le \sum_{a\in A} w^t(a)\left(1-\varepsilon\, c^t(a)\right) = W^t\left(1 - \varepsilon\,\nu^t\right)$
  • using $(1-\varepsilon)^x \le 1-\varepsilon x$ for $x\in[0,1]$, and $\nu^t = \sum_{a\in A} p^t(a)\,c^t(a)$.

Page 27: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• ⇒ $W^{T+1} \le W^1\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right) = n\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right)$
• So: $(1-\varepsilon)^{OPT} \le W^{T+1} \le n\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right)$

Page 28: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• We got: $(1-\varepsilon)^{OPT} \le n\prod_{t=1}^{T}\left(1-\varepsilon\,\nu^t\right)$
• Taking logarithms ⇒ $OPT\cdot\ln(1-\varepsilon) \le \ln n + \sum_{t=1}^{T}\ln\!\left(1-\varepsilon\,\nu^t\right)$
• Taylor expansion of $\ln(1-x)$: $\ln(1-x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \dots$
• So $\ln(1-x) \le -x$, and because $\varepsilon \le \frac{1}{2}$, $\ln(1-\varepsilon) \ge -\varepsilon - \varepsilon^2$.

Page 29: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• ⇒ $-OPT\left(\varepsilon + \varepsilon^2\right) \le OPT\cdot\ln(1-\varepsilon) \le \ln n - \varepsilon\sum_{t=1}^{T}\nu^t = \ln n - \varepsilon\, C_{MW}$
• ⇒ $C_{MW} \le OPT\,(1+\varepsilon) + \frac{\ln n}{\varepsilon}$

Page 30: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Proof (cont):
• We take $\varepsilon = \sqrt{\frac{\ln n}{T}}$, and get (using $OPT \le T$): $C_{MW} \le OPT + \varepsilon T + \frac{\ln n}{\varepsilon} = OPT + 2\sqrt{T\ln n}$ ⇒ the expected time-averaged regret is at most $2\sqrt{\frac{\ln n}{T}}$.
• This assumes that $T$ is known in advance. If it is not known, at time step $t$ take $\varepsilon_t = \sqrt{\frac{\ln n}{2^k}}$, where $2^k$ is the largest power of 2 smaller than $t$ (doubling trick).
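As a quick numeric sanity check (my own numbers, not from the deck): with $n = 10$ actions and $T = 10{,}000$ steps, $\varepsilon = \sqrt{\ln 10 / 10000} \approx 0.015$, and the guaranteed time-averaged regret is at most $2\sqrt{\ln 10 / 10000} \approx 0.03$ - within about 3% of the best fixed action per step.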

Page 31: Smoothness and Learning Equilibria

Multiplicative Weights (MW)
• Up to the factor of 2, we have reached the lower bound for the external regret rate ($\Theta\!\left(\sqrt{\frac{\ln n}{T}}\right)$).
• To achieve a regret of at most $\rho$, we only need $T = \frac{4\ln n}{\rho^2}$ iterations of the MW algorithm.

Page 32: Smoothness and Learning Equilibria

No Regret Dynamics
• Moving from a single-player setting to a multi-player setting.
• In each time step $t = 1, \dots, T$ (a code sketch of these dynamics follows below):
  • Each player $i$ independently chooses a mixed strategy $p^t_i$ using some no-regret algorithm.
  • Each player receives a cost vector $c^t_i$, which gives the expected cost of each of the possible actions of player $i$, given the mixed strategies played by the other players:
    $c^t_i(a_i) = \mathbb{E}_{a_{-i}\sim p^t_{-i}}\!\left[C_i(a_i, a_{-i})\right]$
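A compact Python sketch of these dynamics for a finite cost-minimization game where every player runs the MW rule from earlier; the cost-tensor representation and the helper names are illustrative, and the brute-force expectation is only suitable for small games.

import math
import itertools
import numpy as np

def expected_cost_vector(cost_i, i, p):
    """c^t_i(a_i) = E_{a_-i ~ p_-i}[ C_i(a_i, a_-i) ], computed by brute force."""
    c = np.zeros(cost_i.shape[i])
    for profile in itertools.product(*[range(s) for s in cost_i.shape]):
        prob = 1.0
        for j in range(cost_i.ndim):
            if j != i:
                prob *= p[j][profile[j]]
        c[profile[i]] += prob * cost_i[profile]
    return c

def no_regret_dynamics(costs, T):
    """costs[i]: numpy array of shape (n_1, ..., n_k) holding player i's costs in [0,1].
    Every player runs MW; returns the sequence of mixed-strategy profiles (p^t_1, ..., p^t_k)."""
    k = len(costs)
    sizes = costs[0].shape
    eps = [min(0.5, math.sqrt(math.log(sizes[i]) / T)) for i in range(k)]
    w = [np.ones(sizes[i]) for i in range(k)]
    history = []
    for t in range(T):
        p = [w[i] / w[i].sum() for i in range(k)]       # each player's MW distribution
        history.append(p)
        for i in range(k):
            c_i = expected_cost_vector(costs[i], i, p)  # the cost vector handed to player i
            w[i] = w[i] * (1 - eps[i]) ** c_i           # MW update
    return history   # the uniform mixture over these profiles approximates a CCE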

Page 33: Smoothness and Learning Equilibria

No Regret Dynamics and CCE
• Theorem:
  • Let $\sigma^1, \dots, \sigma^T$ (where $\sigma^t = p^t_1 \times\dots\times p^t_k$) be the outcome distribution sequence generated by the no-regret dynamics after $T$ time steps.
  • Let $\sigma$ be the uniform mixture over the multi-set of the outcome distributions $\sigma^1, \dots, \sigma^T$.
  • $\sigma$ is an approximate Coarse Correlated Equilibrium (CCE).
• Corollary: smooth-game PoA bounds apply (approximately) to no-regret dynamics.

Page 34: Smoothness and Learning Equilibria

No Regret Dynamics and CCE
• Proof:
• Player $i$ has external regret $R_i$.
• From the definition of $\sigma$:
  • Expected cost of player $i$ when playing according to his no-regret algorithm:
    $\mathbb{E}_{s\sim\sigma}[C_i(s)] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s)]$
  • Expected cost of player $i$ when always playing a fixed strategy $s_i'$:
    $\mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s_i', s_{-i})]$

Page 35: Smoothness and Learning Equilibria

No Regret Dynamics and CCE
• Proof (cont):
• For any fixed action $s_i'$, the no-regret guarantee gives:
  $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s)] \le \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s\sim\sigma^t}[C_i(s_i', s_{-i})] + R_i$
• ⇒ $\mathbb{E}_{s\sim\sigma}[C_i(s)] \le \mathbb{E}_{s\sim\sigma}[C_i(s_i', s_{-i})] + R_i$
• All players play no-regret algorithms: $R_i \to 0$ as $T\to\infty$, and we get the definition of CCE (up to the vanishing term $R_i$).

Page 36: Smoothness and Learning Equilibria

Swap Regret
• Another benchmark.
• Define a switching function $\delta: A \to A$ on the actions.
• Swap Regret: $\frac{1}{T}\left[\sum_{t=1}^{T} c^t(a^t) - \min_{\delta}\sum_{t=1}^{T} c^t\!\left(\delta(a^t)\right)\right]$
• External regret is the special case of swap regret where we only look at the constant switching functions $\delta(a) = a'$.

Page 37: Smoothness and Learning Equilibria

No Swap Regret Algorithm
• Definition:
• An algorithm has no swap regret if for all cost vector sequences and all switching functions $\delta$, the expected (time-averaged) swap regret
  $\frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T} c^t(a^t) - \sum_{t=1}^{T} c^t\!\left(\delta(a^t)\right)\right] \to 0$
as $T \to \infty$.

Page 38: Smoothness and Learning Equilibria

Swap-External Reduction
• Black-box reduction from no swap regret to no external regret.
• (Corollary: there exist poly-time no-swap-regret algorithms.)
• Notations:
  • $n$ - number of actions
  • $M_1, \dots, M_n$ - instantiations of a no-external-regret algorithm, one per action.
  • Each algorithm $M_j$ receives a cost vector and returns a probability distribution $q^t_j$ over actions.

Page 39: Smoothness and Learning Equilibria

Swap-External Reduction
• Reduction:
• The main algorithm $M$, at time $t$:
  • Receive distributions $q^t_1, \dots, q^t_n$ over actions from algorithms $M_1, \dots, M_n$.
  • Compute and output a consensus distribution $p^t$.
  • Receive a cost vector $c^t$ from the adversary.
  • Give each algorithm $M_j$ the scaled cost vector $p^t(a_j)\cdot c^t$.
• Goal: use the no-external-regret guarantee of the algorithms $M_j$ to get a no-swap-regret guarantee for $M$.
• Need to show how to compute $p^t$ (a code sketch of this loop follows below).
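A Python sketch of the master loop described above, assuming each M_j exposes a get_distribution() / update(cost_vector) interface (my naming, not the deck's); the consensus distribution is delegated to a stationary_distribution helper, whose Markov-chain construction is explained on Pages 45-46 and sketched there.

import numpy as np

def swap_regret_master_step(algs, cost_fn, t):
    """One step of the master algorithm M. `algs` holds the n no-external-regret
    algorithms M_1..M_n, one per action."""
    n = len(algs)
    # 1. collect the distributions q^t_j proposed by the sub-algorithms
    Q = np.array([algs[j].get_distribution() for j in range(n)])   # row j is q^t_j
    # 2. consensus distribution p^t, satisfying p^t(a_i) = sum_j p^t(a_j) q^t_j(a_i)
    p = stationary_distribution(Q)
    # 3. the adversary reveals the cost vector c^t
    c = np.asarray(cost_fn(t))
    # 4. feed each M_j its scaled cost vector p^t(a_j) * c^t
    for j in range(n):
        algs[j].update(p[j] * c)
    return p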

Page 40: Smoothness and Learning Equilibria

Swap-External Reduction
• [Diagram: the master algorithm $M$ outputs $p^t$, receives $c^t$ from the adversary, feeds each $M_j$ the scaled cost vector $p^t(a_j)\cdot c^t$, and receives back the distributions $q^t_1, \dots, q^t_n$.]

Page 41: Smoothness and Learning Equilibria

Swap-External Reduction
• Theorem:
  • This reduction provides a no-swap-regret algorithm.
• Proof:
  • Time-averaged expected cost of $M$: $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t(a_i)$
  • Time-averaged expected cost with some switching function $\delta$: $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t\!\left(\delta(a_i)\right)$

Page 42: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Want to show: for all switching functions $\delta$,
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t(a_i) \le \frac{1}{T}\sum_{t=1}^{T}\sum_{i} p^t(a_i)\,c^t\!\left(\delta(a_i)\right) + R$, where $R \to 0$ when $T \to \infty$.
• Time-averaged expected cost of the no-external-regret algorithm $M_j$ (which is fed the scaled cost vectors $p^t(a_j)\cdot c^t$):
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} q^t_j(a_i)\,p^t(a_j)\,c^t(a_i)$
• $M_j$ is a no-regret algorithm, so for every fixed action $a$:
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{i} q^t_j(a_i)\,p^t(a_j)\,c^t(a_i) \le \frac{1}{T}\sum_{t=1}^{T} p^t(a_j)\,c^t(a) + R_j$

Page 43: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Fix $\delta$.
• Summing the inequality over $j$, taking $a = \delta(a_j)$ for each $j$:
  $\frac{1}{T}\sum_{t=1}^{T}\sum_{j}\sum_{i} q^t_j(a_i)\,p^t(a_j)\,c^t(a_i) \le \frac{1}{T}\sum_{t=1}^{T}\sum_{j} p^t(a_j)\,c^t\!\left(\delta(a_j)\right) + \sum_{j} R_j$

Page 44: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• For each $j$, $R_j \to 0$ as $T \to \infty$, and $n$ is fixed ⇒ $\sum_{j} R_j \to 0$ as $T \to \infty$.
• If $p^t(a_i) = \sum_{j} p^t(a_j)\,q^t_j(a_i)$ for every $i$, then the left-hand side above is exactly the expected cost of $M$, and we've shown the required inequality with $R = \sum_j R_j \to 0$ when $T \to \infty$, as needed.
• Need to choose $p^t$ so that for every $i$: $p^t(a_i) = \sum_{j} p^t(a_j)\,q^t_j(a_i)$

Page 45: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Such a $p^t$ is a stationary distribution of a Markov chain.
• Build the following Markov chain:
  • The states - the action set $A$.
  • For every $i, j$, the transition probability from $a_j$ to $a_i$ is $q^t_j(a_i)$.

Page 46: Smoothness and Learning Equilibria

Swap-External Reduction
• Proof (cont):
• Every stationary distribution of this Markov chain satisfies $p^t(a_i) = \sum_{j} p^t(a_j)\,q^t_j(a_i)$, exactly the condition we need.
• This is a finite Markov chain.
• For such a chain there is at least one stationary distribution, and it can be computed in polynomial time (eigenvector computation, sketched below).
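One way to do that eigenvector computation in Python/NumPy (the helper name matches the sketch after Page 39; taking the left eigenvector of the transition matrix for the eigenvalue closest to 1 is one standard approach, not the only one).

import numpy as np

def stationary_distribution(Q):
    """Q[j, i] = q^t_j(a_i): transition probability from state a_j to state a_i.
    Returns p with p = p Q, i.e. a left eigenvector of Q for eigenvalue 1."""
    eigvals, eigvecs = np.linalg.eig(Q.T)        # right eigenvectors of Q^T = left eigenvectors of Q
    k = int(np.argmin(np.abs(eigvals - 1.0)))    # the eigenvalue (numerically) closest to 1
    p = np.abs(np.real(eigvecs[:, k]))           # such an eigenvector can be chosen nonnegative
    return p / p.sum()                           # normalize to a probability distribution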

Page 47: Smoothness and Learning Equilibria

Swap Regret Bound
• We have shown a no-swap-regret reduction, with swap regret $\sum_{j=1}^{n} R_j$.
• If we take MW as the algorithms $M_j$, we get a swap regret bound of $2n\sqrt{\frac{\ln n}{T}}$ (each $R_j$ is bounded by $2\sqrt{\frac{\ln n}{T}}$).
• We can improve this:
  • In MW, the actual bound we found was $C_{MW} \le OPT + \varepsilon\,OPT + \frac{\ln n}{\varepsilon}$, i.e. regret about $2\sqrt{OPT\cdot\ln n}$ for the best choice of $\varepsilon$, and we simply bounded $OPT$ by $T$.
  • If we take the original bound, we get: $R_j \le \frac{2\sqrt{OPT_j\,\ln n}}{T}$, where $OPT_j$ is the cost of the best fixed action for the scaled cost sequence fed to $M_j$.

Page 48: Smoothness and Learning Equilibria

Swap Regret Bound
• The cost vector we gave to $M_j$ at time $t$ was scaled by $p^t(a_j)$ ⇒ (using $c^t \le 1$) $\sum_{j=1}^{n} OPT_j \le \sum_{j=1}^{n}\sum_{t=1}^{T} p^t(a_j) = T$
• The square root function is concave ⇒ Jensen's inequality: $\frac{1}{n}\sum_{j}\sqrt{OPT_j} \le \sqrt{\frac{1}{n}\sum_{j} OPT_j}$
• ⇒ $\sum_{j}\sqrt{OPT_j} \le \sqrt{n\sum_{j} OPT_j} \le \sqrt{nT}$

Page 49: Smoothness and Learning Equilibria

Swap Regret Bound
• ⇒ $\sum_{j=1}^{n} R_j \le \frac{2\sqrt{\ln n}}{T}\sum_{j}\sqrt{OPT_j} \le \frac{2\sqrt{nT\ln n}}{T}$
• We get the swap regret bound of $O\!\left(\sqrt{\frac{n\ln n}{T}}\right)$

Page 50: Smoothness and Learning Equilibria

No Swap Regret Dynamics and CE
• Equivalent definition of CE using switching functions:
• A distribution σ on the set of outcomes of a cost-minimization game is a correlated equilibrium if for every player $i$ and every switching function $\delta: S_i \to S_i$,
  $\mathbb{E}_{s\sim\sigma}[C_i(s)] \le \mathbb{E}_{s\sim\sigma}\!\left[C_i\!\left(\delta(s_i), s_{-i}\right)\right]$

Page 51: Smoothness and Learning Equilibria

No Swap Regret Dynamics and CE
• Theorem:
  • Let $\sigma^1, \dots, \sigma^T$ be the outcome distribution sequence generated by the no-swap-regret dynamics after $T$ time steps (all players use no-swap-regret algorithms).
  • Let $\sigma$ be the uniform mixture over the multi-set of the outcome distributions $\sigma^1, \dots, \sigma^T$.
  • Then $\sigma$ is an approximate CE.
• Proof: similar to the proof that no-external-regret dynamics converge to a CCE.

Page 52: Smoothness and Learning Equilibria

Partial Information Model
• Game Model:
• At time $t$:
  • Player picks a distribution $p^t$ over his action set $A$.
  • Adversary picks a cost vector $c^t$.
  • Action $a^t$ is chosen according to the distribution $p^t$, and the player incurs the cost $c^t(a^t)$.
  • Player only knows the value $c^t(a^t)$.
• Lack of information leads to an exploration vs. exploitation trade-off.

Page 53: Smoothness and Learning Equilibria

Partial Information Model
• Multi-Armed Bandit Problem:
  • Single player
  • $n$ non-identical slot machines
  • $T$ time steps
  • Player needs to choose which slot machine to play in each time step in order to maximize his profit.
• This is a simple model of the trade-off between:
  • Exploration - trying each slot machine to find the best one.
  • Exploitation - repeatedly playing the slot machine believed to give the best payoff.

Page 54: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Goal: show the existence of a no-regret partial-information algorithm via a reduction to a full-information no-regret algorithm.
• For the reduction:
  • Assume $T$ (the number of time steps) is given as a parameter, and divide it into $K$ groups of consecutive time steps - $B_1, \dots, B_K$.
  • Assume $K$ divides $T$ (and each group has $\frac{T}{K} \ge n$ steps).
  • $M$ - the full-information no-external-regret algorithm we use, with time-averaged regret $R(K)$ after $K$ steps.

Page 55: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Reduction (a code sketch follows below):
• The main algorithm $M'$, at time $t$:
  • If $t - 1$ was the last time step in the previous group $B_{k-1}$:
    • Give $M$ a cost vector $\hat{c}^{\,k-1}$, where $\hat{c}^{\,k-1}(a)$ is the estimated cost of action $a$ in group $B_{k-1}$.
    • Get from $M$ a distribution $q^k$ over actions.
  • In each group $B_k$, select $n$ time steps in the group uniformly at random, one exploration step $t_a$ for each action $a$.
  • If $t = t_a$ for some action $a$: return the distribution that plays $a$ with probability 1 (exploration phase).
  • Else: return the distribution $q^k$ (exploitation phase).
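A Python sketch of this block reduction, assuming the full-information algorithm exposes a get_distribution() / update(cost_vector) interface and that every group has at least $n$ steps; the names, the interface, and passing $K$ explicitly are my choices, not the deck's.

import random

def bandit_from_full_info(T, n, bandit_cost_fn, full_info_alg, K):
    """bandit_cost_fn(t, action) reveals only the cost of the action actually played.
    T is split into K groups (K divides T, T/K >= n); one random exploration step per
    action per group estimates the group's cost vector, which is fed to full_info_alg."""
    group = T // K
    total_cost = 0.0
    for k in range(K):
        q = full_info_alg.get_distribution()          # exploitation distribution q^k
        explore = random.sample(range(group), n)      # one exploration slot per action
        slot_of = {explore[a]: a for a in range(n)}
        c_hat = [0.0] * n                             # estimated cost vector for this group
        for s in range(group):
            t = k * group + s
            if s in slot_of:
                a = slot_of[s]                        # exploration phase: play a with prob. 1
            else:
                a = random.choices(range(n), weights=q)[0]   # exploitation phase: play q^k
            cost = bandit_cost_fn(t, a)
            total_cost += cost
            if s in slot_of:
                c_hat[a] = cost                       # the single observed sample for action a
        full_info_alg.update(c_hat)                   # one full-information step per group
    return total_cost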

Page 56: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• We want to show $M'$ is a no-regret algorithm, i.e. its time-averaged regret $R'(T) \to 0$ as $T \to \infty$.
• Proof:
  • Let $c^t(a)$ - the cost of action $a$ at time $t$.
  • The estimated cost of action $a$ for group $B_k$ is $\hat{c}^{\,k}(a) = c^{t_a}(a)$; since $t_a$ is uniform over the group,
    $\mathbb{E}\!\left[\hat{c}^{\,k}(a)\right] = \frac{K}{T}\sum_{t\in B_k} c^t(a)$
  • $M'$ will pass to $M$ a total of $K$ such cost vectors $\hat{c}^{\,1}, \dots, \hat{c}^{\,K}$.
  • The output of $M$ at the end of group $B_k$ is as if $M$ played a full-information game with $K$ time steps, and returned at step $k$ the probability distribution $q^k$.

Page 57: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
  • $t_a$ is the time step in group $B_k$ where the probability of action $a$ is 1 (and all other actions have probability 0); there are $n$ such exploration steps per group, each of cost at most 1.
  • In the remaining (exploitation) steps of $B_k$, $M'$ plays $q^k$, so its expected cost in the group is at most $n + \sum_{t\in B_k}\sum_{a} q^k(a)\,c^t(a)$.

Page 58: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
  • Taking expectations over the random exploration steps and summing over the groups:
    $\mathbb{E}\!\left[C_{M'}\right] \le nK + \sum_{k=1}^{K}\frac{T}{K}\sum_{a} q^k(a)\,\mathbb{E}\!\left[\hat{c}^{\,k}(a)\right] = nK + \frac{T}{K}\,\mathbb{E}\!\left[C_M\right]$
  where $C_M$ is the cost of $M$ in the simulated $K$-step game.

Page 59: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
• What we have so far: $\mathbb{E}\!\left[C_{M'}\right] \le \frac{T}{K}\,\mathbb{E}\!\left[C_M\right] + nK$
• Want to find an upper bound for $\mathbb{E}\!\left[C_M\right]$.
• Assume $M$ is a no-external-regret algorithm with time-averaged regret $R(K)$ over $K$ steps.
• Cost of the best fixed action in the simulated game: $\min_{a}\sum_{k=1}^{K}\hat{c}^{\,k}(a)$, whose expectation is at most $\frac{K}{T}\,OPT$.
• Time-averaged expected cost of $M$: at most $\frac{1}{K}\cdot\frac{K}{T}\,OPT + R(K)$.

Page 60: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
⇒ $\mathbb{E}\!\left[C_M\right] \le \frac{K}{T}\,OPT + K\cdot R(K)$
• So we have: $\mathbb{E}\!\left[C_{M'}\right] \le OPT + T\cdot R(K) + nK$
• We have shown a no-external-regret algorithm with $R(K) = 2\sqrt{\frac{\ln n}{K}}$ (the MW algorithm). So we can get:
  $\mathbb{E}\!\left[C_{M'}\right] \le OPT + 2T\sqrt{\frac{\ln n}{K}} + nK$

Page 61: Smoothness and Learning Equilibria

Partial – Full Information Reduction
• Proof (cont):
• We would like to choose the size of each group (equivalently, the number of groups $K$) to minimize the regret:
  $K = \left(\frac{T\sqrt{\ln n}}{n}\right)^{2/3}$
⇒ $\mathbb{E}\!\left[C_{M'}\right] - OPT = O\!\left(T^{2/3}\,n^{1/3}(\ln n)^{1/3}\right)$
⇒ Time averaged: $O\!\left(\left(\frac{n\ln n}{T}\right)^{1/3}\right)$

Page 62: Smoothness and Learning Equilibria

EXP3 Algorithm
• A partial-information no-regret algorithm.
• Based on the Exponential Weights algorithm.
• Exponential Weights:
  • A full-information no-external-regret algorithm.
  • Similar to Multiplicative Weights, except at time $t$ we update the weights using an exponential factor: $w^{t+1}(a) = w^t(a)\cdot e^{-\eta\, c^t(a)}$.
• We will show the algorithm for profit maximization (instead of cost minimization), so the update below uses $e^{+\eta\cdot\text{profit}}$.
• Notation:
  • $W^t$ - the sum of the weights at time $t$

Page 63: Smoothness and Learning Equilibria

EXP3 Algorithm
• The Algorithm (a code sketch follows below):
• Initialize $w^1(a) = 1$ for every action $a$.
• At time $t$:
  • $p^t(a) = (1-\gamma)\frac{w^t(a)}{W^t} + \frac{\gamma}{n}$ (use a small correction $\gamma$ to ensure $p^t(a)$ is never too close to 0).
  • Choose an action $a^t$ according to the distribution $p^t$.
  • Receive the profit value $\pi^t(a^t)$.
  • Define: $\hat{\pi}^t(a) = \frac{\pi^t(a^t)}{p^t(a^t)}$ if $a = a^t$, and $\hat{\pi}^t(a) = 0$ otherwise (an unbiased estimate of $\pi^t(a)$).
  • Update the weights: $w^{t+1}(a) = w^t(a)\cdot e^{\eta\,\hat{\pi}^t(a)}$.
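A Python sketch of these steps for profit maximization; the exploration parameter $\gamma$ and learning rate $\eta$ below are illustrative defaults rather than the tuned values from the original EXP3 analysis, and the profit_fn interface is mine.

import math
import random

def exp3(T, n, profit_fn, gamma=0.05, eta=0.05):
    """Bandit EXP3 sketch (profit maximization). profit_fn(t, action) in [0,1] is the
    only feedback the player receives."""
    w = [1.0] * n
    total_profit = 0.0
    for t in range(T):
        W = sum(w)
        p = [(1 - gamma) * wi / W + gamma / n for wi in w]   # keep p^t(a) away from 0
        a = random.choices(range(n), weights=p)[0]
        profit = profit_fn(t, a)
        total_profit += profit
        pi_hat = profit / p[a]              # unbiased estimate of pi^t(a); 0 for all other actions
        w[a] *= math.exp(eta * pi_hat)      # only the played action's weight changes
    return total_profit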

Page 64: Smoothness and Learning Equilibria

EXP3 Algorithm
• Bound on regret (no proof):
  • Similar to the bound shown for the reduction, the time-averaged external regret of this algorithm also tends to 0 as $T \to \infty$.

Page 65: Smoothness and Learning Equilibria

Minimax Theorem
• Reminder:
  • 2-player zero-sum game.
  • Represented by a payoff matrix $A$.
  • $A_{ij}$ - the payoff of the row player in the outcome $(i, j)$.
  • $x$ - mixed strategy of the row player (a distribution over rows).
  • $y$ - mixed strategy of the column player (a distribution over columns).
  • Expected payoff of the row player: $x^\top A\,y = \sum_{i,j} x_i\,A_{ij}\,y_j$
  • Expected payoff of the column player - the negative of this.

Page 66: Smoothness and Learning Equilibria

Minimax Theorem
• Reminder:
• MNE:
  • A pair $(\hat{x}, \hat{y})$, where for all distributions $x$ over rows: $x^\top A\,\hat{y} \le \hat{x}^\top A\,\hat{y}$, and for all distributions $y$ over columns: $\hat{x}^\top A\,y \ge \hat{x}^\top A\,\hat{y}$.
• Theorem (Minimax):
  • For every 2-player zero-sum game $A$, $\max_{x}\min_{y} x^\top A\,y = \min_{y}\max_{x} x^\top A\,y$.

Page 67: Smoothness and Learning Equilibria

Minimax Theorem
• We can deduce this theorem from the existence of no-external-regret algorithms.
• Until now we discussed cost-minimization games, but the same no-regret algorithms, with some minor modifications, apply to payoff maximization.
• Proof:
• The easy direction, $\max_{x}\min_{y} x^\top A\,y \le \min_{y}\max_{x} x^\top A\,y$: whatever strategy the row player commits to, the column player can respond optimally, so moving second never hurts.

Page 68: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• The other direction, $\max_{x}\min_{y} x^\top A\,y \ge \min_{y}\max_{x} x^\top A\,y$:
  • Have the two players play the game for enough time steps $T$, so that both have expected external regret of at most $\epsilon$.
  • Let $p^1, \dots, p^T$ be the mixed strategies played by the row player as advised by the no-external-regret algorithm he uses.
  • Let $q^1, \dots, q^T$ be the mixed strategies played by the column player as advised by the no-external-regret algorithm he uses.

Page 69: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• At time $t$, the input to the no-regret algorithms will be:
  • The payoff vector $A\,q^t$ for the row player (to maximize).
  • The cost vector $A^\top p^t$ for the column player (to minimize).
• This is the expected payoff of each strategy on day $t$ given the mixed strategy played by the other player.

Page 70: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• The time-averaged mixed strategy of the row player: $\hat{x} = \frac{1}{T}\sum_{t=1}^{T} p^t$
• The time-averaged mixed strategy of the column player: $\hat{y} = \frac{1}{T}\sum_{t=1}^{T} q^t$
• The time-averaged expected payoff of the row player: $v = \frac{1}{T}\sum_{t=1}^{T} (p^t)^\top A\,q^t$

Page 71: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• Because the row player uses a no-external-regret algorithm, for every fixed action corresponding to row $i$:
  $v \ge \frac{1}{T}\sum_{t=1}^{T} e_i^\top A\,q^t - \epsilon = e_i^\top A\,\hat{y} - \epsilon$
• This means that the row player could have gained at most an additional $\epsilon$ from playing any fixed action in all time steps.
• Since every arbitrary row mixed strategy $x$ is just a distribution over the $e_i$'s, we get:
  $v \ge x^\top A\,\hat{y} - \epsilon$ for every $x$, and in particular $v \ge \max_{x} x^\top A\,\hat{y} - \epsilon$.

Page 72: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• For the column player, a similar argument applies.
• The column player attempts to minimize the value and has at most $\epsilon$ external regret.
• So, playing a fixed column action in all time steps could have decreased the value by at most $\epsilon$:
  $v \le \hat{x}^\top A\,y + \epsilon$ for every column mixed strategy $y$, and in particular $v \le \min_{y}\hat{x}^\top A\,y + \epsilon$.

Page 73: Smoothness and Learning Equilibria

Minimax Theorem
• Proof (cont):
• So:
  $\min_{y}\max_{x} x^\top A\,y \le \max_{x} x^\top A\,\hat{y}$  [instantiating $y$ to be $\hat{y}$]
  $\le v + \epsilon$  [row player's no-regret guarantee]
  $\le \min_{y}\hat{x}^\top A\,y + 2\epsilon$  [column player's no-regret guarantee]
  $\le \max_{x}\min_{y} x^\top A\,y + 2\epsilon$  [the max over $x$ is at least the value at $\hat{x}$]
• We got: $\min_{y}\max_{x} x^\top A\,y \le \max_{x}\min_{y} x^\top A\,y + 2\epsilon$.
• No regret - as $T \to \infty$, $\epsilon \to 0$, which gives the other direction.
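A small Python experiment matching this proof: both players run the exponential-weights variant mentioned above on a payoff matrix A (the row player maximizes, the column player minimizes), and the time-averaged payoff approaches the value of the game; the matrix and parameter choices are mine.

import numpy as np

def minimax_via_no_regret(A, T=20000, eta=None):
    """Row player maximizes x^T A y, column player minimizes it; both use exponential weights."""
    m, n = A.shape
    if eta is None:
        eta = np.sqrt(np.log(max(m, n)) / T)
    wx, wy = np.ones(m), np.ones(n)
    avg_payoff = 0.0
    for t in range(T):
        x = wx / wx.sum()
        y = wy / wy.sum()
        avg_payoff += (x @ A @ y) / T
        wx *= np.exp(eta * (A @ y))         # row player: reward rows that do well against y
        wy *= np.exp(-eta * (A.T @ x))      # column player: penalize columns that pay the row player
    return avg_payoff

# Matching pennies: the value of the game is 0, and the returned average is close to 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(minimax_via_no_regret(A))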

Page 74: Smoothness and Learning Equilibria

CE as LP Problem
• Payoff matrix of the 2x2 game:

            S21                          S22
  S11   A1(S11,S21), A2(S11,S21)     A1(S11,S22), A2(S11,S22)
  S12   A1(S12,S21), A2(S12,S21)     A1(S12,S22), A2(S12,S22)

• Correlated Distribution:

            S21    S22
  S11       p1     p2
  S12       p3     p4

Page 75: Smoothness and Learning Equilibria

CE as LP Problem
• Payoff of player 1:
  • Player 1 is advised to play $S_{11}$ with probability $p_1 + p_2$.
  • Given that player 1 is advised to play $S_{11}$, player 2 is advised to play $S_{21}$ with probability $\frac{p_1}{p_1 + p_2}$ (and $S_{22}$ with probability $\frac{p_2}{p_1 + p_2}$).
  • The expected payoff of player 1 if he is advised to play $S_{11}$ and follows the advice:
    $\frac{p_1}{p_1 + p_2}A_1(S_{11}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{11}, S_{22})$
  • The expected payoff of player 1 if he decides to ignore the advice (and play $S_{12}$ instead):
    $\frac{p_1}{p_1 + p_2}A_1(S_{12}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{12}, S_{22})$

Page 76: Smoothness and Learning Equilibria

CE as LP Problem
• CE requires (for player 1 when advised $S_{11}$):
  $\frac{p_1}{p_1 + p_2}A_1(S_{11}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{11}, S_{22}) \ge \frac{p_1}{p_1 + p_2}A_1(S_{12}, S_{21}) + \frac{p_2}{p_1 + p_2}A_1(S_{12}, S_{22})$
• From this we get the linear constraint:
  $p_1 A_1(S_{11}, S_{21}) + p_2 A_1(S_{11}, S_{22}) \ge p_1 A_1(S_{12}, S_{21}) + p_2 A_1(S_{12}, S_{22})$
• Similar constraints are obtained for player 1 when advised $S_{12}$, and for player 2 when advised $S_{21}$ or $S_{22}$.

Page 77: Smoothness and Learning Equilibria

CE as LP Problem
• LP Problem (a code sketch follows below):
• Maximize a linear objective over the distribution $(p_1, p_2, p_3, p_4)$ - e.g. a chosen player's expected payoff (Libertarian equilibrium).
• Under the conditions:
  • The incentive constraints above, for both players and both advised strategies.
  • $p_1 + p_2 + p_3 + p_4 = 1$, $p_k \ge 0$.
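A sketch of this LP in Python using scipy.optimize.linprog, for the 2x2 game and the variable ordering (p1, ..., p4) from the table on Page 74. The objective here maximizes player 1's expected payoff, which is my reading of the "libertarian" objective; any other linear objective (e.g. the sum of payoffs) drops in the same way.

import numpy as np
from scipy.optimize import linprog

def correlated_eq_2x2(A1, A2):
    """A1[i, j], A2[i, j]: payoffs of players 1 and 2 when player 1 plays row i (S11/S12)
    and player 2 plays column j (S21/S22). Variables: p1=(S11,S21), p2=(S11,S22),
    p3=(S12,S21), p4=(S12,S22)."""
    # incentive constraints, written as "deviation payoff - advised payoff <= 0"
    G = [
        [A1[1, 0] - A1[0, 0], A1[1, 1] - A1[0, 1], 0, 0],   # player 1 advised S11
        [0, 0, A1[0, 0] - A1[1, 0], A1[0, 1] - A1[1, 1]],   # player 1 advised S12
        [A2[0, 1] - A2[0, 0], 0, A2[1, 1] - A2[1, 0], 0],   # player 2 advised S21
        [0, A2[0, 0] - A2[0, 1], 0, A2[1, 0] - A2[1, 1]],   # player 2 advised S22
    ]
    # maximize player 1's expected payoff (linprog minimizes, hence the sign flip)
    obj = [-A1[0, 0], -A1[0, 1], -A1[1, 0], -A1[1, 1]]
    res = linprog(obj, A_ub=G, b_ub=[0, 0, 0, 0],
                  A_eq=[[1, 1, 1, 1]], b_eq=[1], bounds=[(0, 1)] * 4)
    return res.x   # the CE distribution (p1, p2, p3, p4)

# Example: the game of chicken; a CE can put weight on the off-diagonal outcomes.
A1 = np.array([[0.0, 7.0], [2.0, 6.0]])
A2 = np.array([[0.0, 2.0], [7.0, 6.0]])
print(correlated_eq_2x2(A1, A2))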