Page 1

Defensive Universal Learning

or:

Master Algorithms for Active Experts Problems based on Increasing Loss Values

Jan Poland and Marcus Hutter

IDSIA, Lugano, Switzerland

Page 2

What

Universal learning: an algorithm that performs asymptotically optimally on any online decision problem.

[Diagram: Agent ↔ Environment loop — the agent sends actions/decisions, the environment returns loss/reward.]
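The agent–environment loop in the diagram can be sketched as a minimal online decision protocol. The agent, environment, and loss values below are illustrative assumptions, not taken from the slides:

```python
def online_decision_loop(agent, environment, rounds):
    """Generic online decision protocol: in each round the agent picks an
    action, the environment returns a loss, and the agent observes it."""
    total_loss = 0.0
    for t in range(1, rounds + 1):
        action = agent.act(t)
        loss = environment.loss(action, t)
        agent.observe(action, loss)
        total_loss += loss
    return total_loss

class ConstantExpertAgent:
    """Toy agent that always plays the same action (a single fixed 'expert')."""
    def __init__(self, action):
        self.action = action
    def act(self, t):
        return self.action
    def observe(self, action, loss):
        pass  # a fixed expert ignores feedback

class ToyEnvironment:
    """Toy non-reactive environment: action 0 costs 0.2, action 1 costs 0.8."""
    def loss(self, action, t):
        return 0.2 if action == 0 else 0.8
```

Everything on the following slides plugs into this loop: the master algorithm plays the role of the agent and mixes between experts.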

Page 3

How

Prediction with Expert Advice

Bandit techniques

Infinite expert classes

Growing losses

Page 4

Prediction with Expert Advice

t = 1, 2, 3, 4, 5, …

[Table: example instantaneous losses (values 0, ½, 1) of Expert 1, Expert 2, Expert 3, …, and the Master at each time step t.]

instantaneous losses are bounded in [0,1]

Page 5

Prediction with Expert Advice

[Table: example losses of Expert 1 and Expert 2, alternating between 0, ½, and 1.]

Regret = E[loss(Master)] − loss(Best expert)

Do not follow the leader…
… but the perturbed leader: Regret = O(√(t log n))

[Hannan; Littlestone and Warmuth; Cesa-Bianchi; McMahan; Blum; etc.]

proven for adversarial environments!
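Follow the perturbed leader can be sketched as below. This is a generic textbook-style version (independent exponential perturbations, fixed learning rate η, full-information feedback), not necessarily the exact variant behind the bound on this slide:

```python
import random

def fpl_choose(cum_losses, eta, rng):
    """Follow the Perturbed Leader: pick the expert minimizing its
    cumulative loss minus an independent Exp(1) perturbation / eta."""
    perturbed = [L - rng.expovariate(1.0) / eta for L in cum_losses]
    return min(range(len(cum_losses)), key=perturbed.__getitem__)

def run_fpl(loss_rows, eta=1.0, seed=0):
    """loss_rows[t][i] = loss of expert i at time t (full information).
    Returns the master's cumulative loss."""
    rng = random.Random(seed)
    n = len(loss_rows[0])
    cum = [0.0] * n
    master_loss = 0.0
    for row in loss_rows:
        i = fpl_choose(cum, eta, rng)
        master_loss += row[i]
        for j in range(n):
            cum[j] += row[j]  # all experts' losses are observed
    return master_loss
```

Following the unperturbed leader fails on alternating loss sequences like the one above; the random perturbation keeps the master from being predictable to an adversary.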

Page 6

Learning Rate

The cumulative loss grows, so the learning rate η has to be scaled down (otherwise we end up following the leader):

dynamic learning rate η_t ∝ 1/√t

[Cesa-Bianchi et al.; Auer et al.; etc.]

The learning rate and the bounds can be significantly improved for small losses, but things get more complicated.
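A dynamic learning rate is easy to state concretely. Below is a self-contained exponential-weights (Hedge-style) weighting with η_t = c/√t; the constant c and the Hedge form are illustrative assumptions, not the slide's exact scheme:

```python
import math

def hedge_weights(cum_losses, t, c=1.0):
    """Exponential weights with dynamic learning rate eta_t = c / sqrt(t):
    the weight of expert i is proportional to exp(-eta_t * cum_loss_i)."""
    eta_t = c / math.sqrt(t)
    raw = [math.exp(-eta_t * L) for L in cum_losses]
    total = sum(raw)
    return [r / total for r in raw]
```

As t grows, η_t shrinks, so for fixed cumulative losses the weights move toward uniform; this is exactly the "scaling down" the slide refers to.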

Page 7

Priors for Expert Classes

Expert class of finite size n:

a uniform prior 1/n is common

uniform complexity log n

[Hutter and Poland, for dynamic learning rate]

Countably infinite expert class (n = ∞):

a uniform prior is impossible

the bounds are instead in log w_i⁻¹ = −log w_i

where i is the best expert in hindsight
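For a countably infinite class the prior just has to be summable. One common illustrative choice (an assumption here, not prescribed by the slide) is w_i = 1/(i(i+1)), which telescopes to a total mass of 1; the regret bound then pays −log w_i for the best expert i:

```python
import math

def prior(i):
    """A simple summable prior over experts i = 1, 2, 3, ...:
    w_i = 1 / (i * (i + 1)); the series telescopes and sums to 1."""
    return 1.0 / (i * (i + 1))

def complexity(i):
    """The regret bound depends on -log w_i, the 'complexity' of expert i."""
    return -math.log(prior(i))
```

Heavier-tailed priors charge a smaller complexity to large indices; the choice of prior is exactly what replaces the impossible uniform 1/∞.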

Page 8

Universal Expert Class

• Expert i = the i-th program on some universal Turing machine

• Prior complexity = the length of the program

• Interpret the output appropriately, depending on the problem

• This construction is common in Algorithmic Information Theory

Page 9

Bandit Case

t = 1, 2, 3, 4, 5, …

[Table: as on Page 4, but at each time step only the loss of the action actually played is observed; all other losses remain unknown ('?').]

Page 10

Bandit Case

• Explore with small probability γ_t, otherwise exploit

γ_t is the exploration rate, with γ_t → 0 as t → ∞

• Deal with estimated losses:

estimated loss = (observed loss of the action) / (probability of the action), and 0 for all actions not played

• unbiased estimate [Auer et al.; McMahan and Blum]
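The estimated-loss construction above is the standard importance-weighted estimator: it is unbiased because E[est_i] = p_i · (loss_i / p_i) = loss_i. A sketch, with a small Monte Carlo check of unbiasedness:

```python
import random

def estimated_losses(chosen, observed_loss, probs):
    """Importance-weighted loss estimates: the chosen action's observed
    loss is divided by its selection probability; all others get 0."""
    est = [0.0] * len(probs)
    est[chosen] = observed_loss / probs[chosen]
    return est

def average_estimates(true_losses, probs, trials, seed=0):
    """Empirically average the estimator over many random action draws;
    the averages should approach the true losses (unbiasedness)."""
    rng = random.Random(seed)
    sums = [0.0] * len(probs)
    for _ in range(trials):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        for i, e in enumerate(estimated_losses(a, true_losses[a], probs)):
            sums[i] += e
    return [s / trials for s in sums]
```

The price of unbiasedness is variance: dividing by a small probability makes the estimate large, which is one reason the bandit bounds degrade.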

Page 11

Bandit Case: Consequences

• Bounds are in w_i⁻¹ instead of −log w_i

• this (exponentially worse) bound is sharp in general

• the analysis gets harder for adaptive adversaries (with martingales…)

Page 12

Reactive Environments

Repeated game: Prisoner's Dilemma (entries are our losses; columns are the opponent's move)

              C     D
Cooperate (C) 0.3   1
Defect (D)    0     0.7

[de Farias and Megiddo, 2003]

Cooperate: loss = 0.3, 1, 1, 0.3, 0.3, …
Defect: loss = 0, 0.7, 0.7, 0, 0, …
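The point of the example can be checked directly: against a reactive opponent, always defecting is worse in the long run than always cooperating. A toy simulation against Tit for Tat using the loss matrix above (the simulation setup is illustrative, not from the slides):

```python
# Loss matrix from the slide: LOSS[my_move][opp_move], moves: 0 = C, 1 = D.
LOSS = [[0.3, 1.0],
        [0.0, 0.7]]

def play_vs_tit_for_tat(strategy, rounds):
    """Average per-round loss of a fixed strategy ('C' or 'D' every round)
    against Tit for Tat, which opens with C and then copies our last move."""
    my_move = 0 if strategy == "C" else 1
    opp_move = 0                      # Tit for Tat opens with cooperation
    total = 0.0
    for _ in range(rounds):
        total += LOSS[my_move][opp_move]
        opp_move = my_move            # Tit for Tat copies our move
    return total / rounds
```

Always cooperating settles at loss 0.3 per round, while always defecting is punished into roughly 0.7 per round, even though defecting dominates in each single round.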

Page 13

Prisoner's Dilemma

[de Farias and Megiddo, 2003]

• defecting is dominant

• but cooperating may still have the better long-term reward

• e.g. against "Tit for Tat"

• Expert Advice fails against Tit for Tat

• Tit for Tat is reactive

Cooperate: loss = 0.3, 1, 1, 0.3, 0.3, …
Defect: loss = 0, 0.7, 0.7, 0, 0, …

Page 14

Time Scale Change

Idea: Yield control to the selected expert for increasingly many time steps

[Diagram: the original time scale 1, 2, 3, 4, … is grouped into blocks that form a new master time scale 1, 2, 3, 4, …]

instantaneous losses may grow in time
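The time-scale change can be made concrete as a schedule mapping master steps to blocks of original steps. The linearly growing phase length below is an illustrative assumption; any schedule with growing blocks has the same effect:

```python
def master_schedule(num_phases):
    """Map master steps to blocks of original time steps: master step k
    spans k original steps, so phase lengths grow as 1, 2, 3, ..."""
    blocks, t = [], 1
    for k in range(1, num_phases + 1):
        blocks.append(list(range(t, t + k)))
        t += k
    return blocks
```

Because master step k aggregates the losses of k original steps, the master's instantaneous loss at step k lies in [0, k] rather than [0, 1]; this is exactly why the analysis must handle losses that grow in time.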

Page 15

Follow or Explore (FoE)

Need a master algorithm + analysis for

• losses in [0, B_t], where B_t grows

• countable expert classes

• dynamic learning rate

• dynamic exploration rate

• technical issue: dynamic confidence for almost-sure assertions

Algorithm FoE (Follow or Explore)

(details are in the paper)
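A schematic of the follow-or-explore pattern, combining the earlier ingredients. This is not the paper's exact FoE: the schedules γ_t = η_t = 1/√t, the prior bonus log(w_i)/η_t, and the restriction to finitely many experts are all illustrative assumptions:

```python
import math, random

def foe_step(t, cum_est_losses, prior_weights, rng):
    """One schematic follow-or-explore step over finitely many experts:
    with probability gamma_t explore uniformly at random, otherwise
    follow the expert with the best prior-weighted score."""
    n = len(cum_est_losses)
    gamma_t = min(1.0, 1.0 / math.sqrt(t))   # exploration rate -> 0
    eta_t = 1.0 / math.sqrt(t)               # learning rate -> 0
    if rng.random() < gamma_t:
        return rng.randrange(n), gamma_t     # explore
    # Follow: prior enters as a complexity bonus log(w_i) / eta_t.
    scores = [math.log(w) / eta_t - L
              for w, L in zip(prior_weights, cum_est_losses)]
    return max(range(n), key=scores.__getitem__), gamma_t
```

The actual algorithm additionally estimates losses as on Page 10, activates experts from the countable class gradually, and runs on the changed time scale of Page 14.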

Page 16

Main Result

Theorem: For any online decision problem, FoE performs in the limit as well as any computable strategy (expert). That is, FoE's average per-round regret converges to 0.

Moreover, FoE uses only finitely many experts at each time step and is thus computationally feasible.

Page 17

Universal Optimality

universally optimal = the average per-round regret tends to zero

equivalently: the cumulative regret grows more slowly than t (sublinearly)

Page 18

Universal Optimality

[Diagram: cumulative regret growth rates, from best to worst:]

finite regret

logarithmic regret

(square-)root regret

linear regret

Page 19

Conclusions

[Diagram: the Master interacts with an Adversary via actions/decisions and loss/reward.]

• FoE is universally optimal in the limit

• but maybe too defensive for practice?!

Thank you!