Low-regret Online Decision-making Via Bellman Inequalities

Post on 16-Oct-2020

Joint work with Sid Banerjee and Itai Gurvich

Relaxations and Regret Bounds for Online Problems

● Must make decisions upon request

● Uncertain process

● Statistical information available

● Goal: develop practical, near-optimal algorithms

Our Results

Meta-Theorem: For different resource allocation problems, we give a practical policy, based on re-solving an optimization program, with bounded regret.

The bound is independent of the horizon and capacities.

● Applications: Dynamic posted pricing, Online Knapsack, Network Revenue Management (Online Packing), Online Matching, Online Probing, Contextual Bandits

● Challenges: define a benchmark and use it to design an algorithm

Why Constant Regret?

Case Study: edge weighted online matching

● Algorithms are different

● Not worst case, but parametric

Problem 1: Online Knapsack

● Finite set of types:

● Known reward distribution and weight:

● Initial budget and horizon:

● Arrival process:

● Objective: collect as much reward as possible

Types of Benchmark

[Figure: rewards of the Algorithm, Optimal (DP), and Prophet benchmarks, with the regret marked. Annotation: number of arrivals of each type.]
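To make the benchmark comparison concrete, here is a minimal sketch of how regret against the Prophet (full hindsight) could be measured on one sample path of the online knapsack problem. It assumes a fixed reward and a positive integer weight per type, and uses a naive accept-if-it-fits baseline as the online algorithm; the names `hindsight_value` and `greedy_value` and all numbers are illustrative, not taken from the talk.

```python
import random


def hindsight_value(arrivals, rewards, weights, budget):
    """Prophet benchmark: best reward when the whole sequence is known.

    Standard 0/1 knapsack dynamic program over the realized arrivals
    (assumes integer weights and an integer budget).
    """
    best = [0.0] * (budget + 1)  # best[b] = max reward using at most b budget
    for j in arrivals:
        w, r = weights[j], rewards[j]
        # Iterate budgets downward so each arrival is accepted at most once.
        for b in range(budget, w - 1, -1):
            best[b] = max(best[b], best[b - w] + r)
    return best[budget]


def greedy_value(arrivals, rewards, weights, budget):
    """Naive online baseline: accept every arrival that still fits."""
    total = 0.0
    for j in arrivals:
        if weights[j] <= budget:
            budget -= weights[j]
            total += rewards[j]
    return total


# Hypothetical instance: two types, unit weights, i.i.d. arrivals.
rewards = {"low": 1.0, "high": 3.0}
weights = {"low": 1, "high": 1}
rng = random.Random(0)
arrivals = rng.choices(["low", "high"], weights=[0.7, 0.3], k=20)
budget = 5

regret = (hindsight_value(arrivals, rewards, weights, budget)
          - greedy_value(arrivals, rewards, weights, budget))
print("regret on this sample path:", regret)
```

The Prophet value is an upper bound on what any online policy collects on the same path, so the printed regret is never negative.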

Online Packing

Theorem: A natural policy achieves constant expected regret for online packing problems. The regret is independent of the horizon and the budget.

In particular, the regret depends only on

Generalizes to multiple resources and other arrival processes.

Similar results in a recent work for restricted cases [Bumpensanti & Wang]

Overview of the General Framework

Goal: Handle more general problems

Intuition

Given the additional information, Prophet wants to solve a DP

Bellman Loss (computational)

Information Loss (estimation)

Knapsack RABBI
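A minimal sketch of a re-solving policy for the online knapsack problem, in the spirit of the Meta-Theorem. At every arrival it re-solves a fluid program over the expected remaining arrivals and accepts the current request only if the fluid solution accepts at least half of that type. The fixed per-type rewards, the i.i.d. arrival probabilities `probs`, the 1/2 threshold, and the function names are all assumptions made for illustration; this is not claimed to be the exact rule presented in the talk.

```python
def fluid_solution(rewards, weights, expected, budget):
    """Solve max sum r_j x_j  s.t.  sum w_j x_j <= budget, 0 <= x_j <= expected[j].

    With a single resource this is a fractional knapsack, so filling types
    in decreasing reward-per-weight order is optimal (weights must be > 0).
    """
    x = {j: 0.0 for j in rewards}
    remaining = float(budget)
    order = sorted(rewards, key=lambda k: rewards[k] / weights[k], reverse=True)
    for j in order:
        x[j] = min(expected[j], remaining / weights[j])
        remaining = max(0.0, remaining - weights[j] * x[j])
    return x


def resolve_policy(arrivals, rewards, weights, probs, budget):
    """Run the re-solving heuristic on one arrival sequence; return total reward."""
    horizon = len(arrivals)
    total = 0.0
    for t, j in enumerate(arrivals):
        periods_left = horizon - t  # counts the current arrival as well
        expected = {k: probs[k] * periods_left for k in rewards}
        x = fluid_solution(rewards, weights, expected, budget)
        # Assumed rule: accept if the fluid plan accepts at least half of type j.
        if weights[j] <= budget and x[j] >= expected[j] / 2:
            budget -= weights[j]
            total += rewards[j]
    return total


# Hypothetical instance: budget for two of the four requests.
rewards = {"low": 1.0, "high": 3.0}
weights = {"low": 1, "high": 1}
probs = {"low": 0.5, "high": 0.5}
print(resolve_policy(["low", "low", "high", "high"], rewards, weights, probs, budget=2))
```

Because the program is re-solved with the current budget and remaining horizon, early low-reward requests are turned away while high-reward arrivals are still expected to exhaust the budget.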

Problem 2: Dynamic Posted Pricing

● Stream of T customers with i.i.d. rewards

● Each customer wants one of our identical items

● We can post any fare from the set

● Objective: collect as much reward as possible

Prophet solves:

?

Pricing RABBI

Fraction of customers that would buy when the fare is
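Given the purchase fraction at each fare, one optimization program that a re-solving pricing policy could use is the fluid (expected-value) relaxation sketched below. This is an assumed stand-in, not the program from the talk: `n_f` is the fractional number of customers shown fare `f`, `buy_fraction[f]` is the fraction that buys at `f`, and the two constraints are the number of items and the number of customers. With only two constraints an optimal solution uses at most two fares, so the sketch enumerates single fares and pairs of fares directly.

```python
from itertools import combinations


def fluid_pricing(fares, buy_fraction, customers, items):
    """Maximize sum_f f * q_f * n_f
    subject to sum_f q_f * n_f <= items, sum_f n_f <= customers, n_f >= 0,
    where q_f = buy_fraction[f]. Returns (revenue, plan) with plan = {fare: n_f}.
    """
    best_rev, best_plan = 0.0, {}

    # Single-fare candidates: post f until customers or items run out.
    for f in fares:
        q = buy_fraction[f]
        n = customers if q == 0 else min(customers, items / q)
        if f * q * n > best_rev:
            best_rev, best_plan = f * q * n, {f: n}

    # Two-fare candidates: both constraints hold with equality.
    for f, g in combinations(fares, 2):
        qf, qg = buy_fraction[f], buy_fraction[g]
        if qf == qg:
            continue  # the two equations cannot pin down a unique mix
        nf = (items - qg * customers) / (qf - qg)
        ng = customers - nf
        if nf < 0 or ng < 0:
            continue  # infeasible mix
        rev = f * qf * nf + g * qg * ng
        if rev > best_rev:
            best_rev, best_plan = rev, {f: nf, g: ng}

    return best_rev, best_plan


# Hypothetical numbers: everyone buys at fare 1, a quarter buy at fare 3.
revenue, plan = fluid_pricing([1, 3], {1: 1.0, 3: 0.25}, customers=10, items=4)
print(revenue, plan)
```

In this instance mixing the two fares beats either fare alone: the high fare is shown to most customers and the low fare clears the remaining items.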

Dynamic Posted Pricing

Theorem: A natural policy achieves constant expected regret for Dynamic Posted Pricing. The regret is independent of the horizon and the capacity.

In particular, the regret depends only on .

Fraction that buys at

The Algorithm is Practical

Bound via Bellman Inequalities

Definition: Given a filtration , is a relaxed value w.r.t. if

1) Initial Ordering:

2) Monotonicity:

Conclusions and Extensions

● Framework based on constructing tractable benchmarks

● Bellman Loss: computational

● Information Loss: estimation

● Applications: NRM, Probing, Contextual Bandits, AdWords, Dynamic Pricing, and other Resource Allocation Problems

Related Work

● Prophet: worst case distribution (competitive ratio) for maximum of iid [Hill & Kertz], best possible [Correa et al.], matroid constraints [Kleinberg & Weinberg]

● Constant regret in NRM: [Arlotto & Gurvich]

[Talluri & Van Ryzin], [Reiman & Wang], [Jasin & Kumar], [Bumpensanti & Wang]

● Online matching, resource allocation, AdWords: [Manshadi et al.], [Legrain & Jaillet]

● Probing: competitive ratio (linear regret) [Gupta & Nagarajan], [Singla], [Chugg & Maehara]

● Information Relaxation: [Balseiro & Brown], [Brown, Smith, & Sun]

● Approximate Dynamic Programming: [Powell]