
Bridging the gap between regret minimization and best arm identification, with application to A/B tests

Rémy Degenne (LPSM, Université Paris Diderot; CMLA, ENS Paris Saclay)
Thomas Nedelec (Criteo AI Lab; CMLA, ENS Paris Saclay)
Clément Calauzènes (Criteo AI Lab)
Vianney Perchet (CMLA, ENS Paris Saclay; Criteo AI Lab)

Abstract

State of the art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing the theoretical analysis of cost minimizing algorithms that are also $\delta$-PAC (with a proven guaranteed bound on the decision time), hence fulfilling at the same time regret minimization and best arm identification. This analysis sheds light on the common observation that ill-calibrated UCB-algorithms minimize regret while still identifying quickly the best arm.

We also extend these results to the non-iid case faced by many practitioners. This provides a technique to make cost versus decision time compromises when doing adaptive tests, with applications ranging from website A/B testing to clinical trials.

Introduction

With the growing use of personalization and machine learning techniques on user-facing systems, randomized experiments, or A/B tests, have become a standard tool to evaluate the performance of different versions of these systems. Two of the main drawbacks of such experiments are their possible length and their cost, as it commonly takes a few weeks or even months to ensure a statistically founded decision. Thus, a lot of attention has been paid to the complexity of the underlying statistical tests [18, 13, 24, 10, 23].

These approaches have indubitable practical interest, but are often limited to quite simple A/B frameworks because of their restrictive assumptions. The first restrictive one is the assumption that a good procedure should minimize the time


needed to take a statistically significant decision [13, 24]. In many situations, the main objective of practitioners is to minimize the overall cost of A/B testing without preventing them from taking the right decision given a certain time budget.

Another traditional assumption in the online learning literature is that outcomes arriving over time are independent [13, 24]. Numerous common scenarios do not satisfy this, and practitioners cannot benefit from the statistically efficient methods available in the iid setting. For instance, when an online retailer wants to A/B test two versions of its mobile application, it may not be able to consider the purchases as independent when customers are buying recurrently (or clicking repeatedly, in the CTR optimization problem).

To address the first limitation, we exhibit a well-tuned variant of the UCB algorithm [2] that is both able to take a $\delta$-PAC decision in finite time and reaches a low regret. This algorithm can be taken as a tool for practitioners to interpolate between the tasks of best arm identification and regret minimization. Even if this objective has been briefly mentioned and/or observed empirically [9], we provide the first theoretical analysis of such an algorithm.

To handle the second limitation, we extend the ideas developed in the iid case to a more complex setting that can handle units arriving through time and delivering rewards continuously during the test. We provide sample-efficient statistical decision rules and guarantees to take decisions in finite time in this setting, highlighting the clear trade-off between regret minimization and best arm identification, even in more complex settings.

Regret vs Best-Arm Identification in iid setting

Framework. We consider the classical multi-armed bandit problem with $K \ge 2$ arms or "populations". At each time step $t$, the agent chooses an arm $i \in [K] := \{1, \dots, K\}$ and observes a reward drawn from an unknown distribution $r^i_t$ with expectation $r^i$. We assume that each $r^i_t$ is $\sigma^2$-sub-Gaussian, where the variance (proxy) $\sigma^2$ is known. We denote by $\pi_n \in [K]$ the sequence of random variables indicating which arm to pull at time $n \in \mathbb{N}$.


Objective. We consider both natural objectives of bandit problems. The first one corresponds to regret minimization. It consists in minimizing the cumulative regret
$$R(T) = T\,\max\{r^i \,;\, i \in [K]\} - \mathbb{E}\left[\sum_{t=1}^{T} r^{\pi_t}_t\right].$$

In A/B testing, when $K = 2$, minimizing the regret is the same as minimizing the cost of testing a new technology or the impact of a clinical trial on patients.

The second objective, matching the problem of best arm identification with fixed confidence, is to design an algorithm, for a given confidence level $\delta$, that minimizes the worst-case number of samples $T_\delta$ needed for the algorithm to finish and to return the optimal arm with probability $1-\delta$. Using an algorithm for best arm identification in an A/B test gives a guarantee on the amount of time necessary before being able to take a statistically significant decision.

Intuitively, an algorithm that is optimal for regret minimization is sub-optimal for best arm identification because its exploration is too slow. The opposite is also true, since the exploration of optimal best arm identification algorithms is too aggressive for regret minimization.

We aim at studying a family of algorithms that interpolate between these two objectives. Informally, with $\delta \in [0,1]$, our objective is to design algorithms for which, with probability $1-\delta$, for all bandit problems $P$ in some class, the worst arm is discarded after $T_\delta$ stages and we have both
$$T_\delta \le f(P,\delta) \quad\text{and}\quad R(T_\delta) \le g(P,\delta)\,.$$
The values $f(P,\delta)$ and $g(P,\delta)$ characterize the performance of the algorithm and should be as small as possible.

Literature review

Regret minimization. This objective has been extensively studied in the bandit literature since the seminal paper of [22]. We mention two particular classes of well-known algorithms that we will use throughout the paper.

– The Upper Confidence Bound (UCB) algorithm, introduced in [12, 2], decides which arms to consider depending on the respective empirical means and an error term depending on the number of pulls of each arm. Its regret, in the case of two arms with Gaussian rewards, is equal to $2\log(T)/\Delta$, where $\Delta = |r^A - r^B|$ is the gap between the means of the two arms [2].

– The other class of algorithms we consider is known as Explore Then Commit (ETC) [20, 7]. It first goes through stages of pure exploration before exploiting the arm with the highest empirical mean; the algorithm consists in tuning the switching time between the stages. Its regret in the case of two arms with Gaussian rewards is of order $4\log(T)/\Delta$.

We recall that ETC is necessarily sub-optimal for regret minimization, as in the case of Gaussian rewards it incurs an additional multiplicative factor of 2 [7].

Best arm identification. The problem of best arm identification [15] can be cast in two main settings, depending on the constraint imposed on the system:

– fixed budget [1, 4], where a total number of samples $T \in \mathbb{N}$ is given and the goal is to minimize the error probability at time $T$;

– fixed confidence [15, 5], where the goal is to minimize the total number of stages used to return the best arm with probability $1-\delta$.

In the fixed confidence setting, there are two main ways to evaluate the sample complexity of the algorithm: the average sample complexity, studied in [15, 5, 13, 14], where the goal is to minimize the expected time of decision, and the worst-case sample complexity, studied in [6, 11, 9], where the objective is to have a quantity $T_\delta \in \mathbb{N}$ as low as possible such that, with probability $1-\delta$, the algorithm makes no mistake and the time of decision $\tau_d$ is below $T_\delta$. In the case of two arms, the optimal sampling strategy is to sample each arm uniformly and stop with a criterion similar to the one used in ETC [13].

A/B testing. Most of the statistical literature on A/B testing [13, 24] has focused on the objective of minimizing the time necessary to take a statistically significant decision and, to the best of our knowledge, there exists no work theoretically interpolating between the objectives of best arm identification and regret minimization.

1 Simultaneous Best-arm Identification and Regret Minimization

In this section, we construct a family of algorithms that minimizes regret while being $\delta$-PAC (with a proven guaranteed bound on the decision time), hence fulfilling at the same time regret minimization and best-arm identification. For the sake of clarity, we are going to assume that there are only $K = 2$ populations, as in A/B testing, denoted by A and B. The results for the general case $K > 2$ can actually be deduced almost immediately from those for $K = 2$.

Let us first recall that the algorithm with the lowest decision time [14], given that the variances of the arms are identical, is ETC, which pulls both arms uniformly and, after pulling each arm $n$ times, returns the arm (e.g., A) with the highest empirical average $\hat r^A_n$ if
$$\hat r^A_n - \hat r^B_n \ge \sqrt{\frac{4\sigma^2}{n}\log\left(\frac{\log_2(n)}{\delta}\right)}.$$
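For concreteness, here is a minimal Python sketch of this ETC rule for two Gaussian arms; the simulation scaffolding and the function name are ours, not part of the original algorithm description:

```python
import numpy as np

def etc_two_arms(mu_a, mu_b, sigma=1.0, delta=0.05, max_n=10**6, seed=None):
    """Sample both arms uniformly; stop when the empirical gap exceeds
    sqrt((4 * sigma^2 / n) * log(log2(n) / delta))."""
    rng = np.random.default_rng(seed)
    sum_a = sum_b = 0.0
    for n in range(1, max_n + 1):
        sum_a += rng.normal(mu_a, sigma)  # one fresh sample per arm and per stage
        sum_b += rng.normal(mu_b, sigma)
        if n >= 2:  # log2(n) must be positive for the threshold to be defined
            threshold = np.sqrt(4 * sigma**2 / n * np.log(np.log2(n) / delta))
            if abs(sum_a - sum_b) / n >= threshold:
                return ("A" if sum_a > sum_b else "B"), 2 * n  # decision, samples used
    return None, 2 * max_n  # no decision within the sample budget

# Example: two Gaussian arms with gap Delta = 1.
# arm, samples = etc_two_arms(0.0, 1.0, delta=0.05)
```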

In the statement of our theorems, the usual Landau notation $O_\delta(1)$ stands for any function whose limit, as $\delta$ goes to 0, is equal to 0. Similarly, for the sake of clarity we will use $\tilde\delta$, a quantity of order $\delta\log(\frac{1}{\delta})$, instead of $\delta$. Exact values of $\tilde\delta$, precise statements and proofs are mostly delayed to Appendix B.

The performance of ETC is now well understood.

Proposition 1 ([19]). With probability greater than $1-\tilde\delta$, ETC returns the best arm at a stage $\tau_d \le T_\delta$ where
$$T_\delta \le \frac{32\sigma^2}{\Delta^2}\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$
So the regret of ETC at the time of decision verifies
$$R(\tau_d) \le \frac{16\sigma^2}{\Delta}\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$

On the other hand, the seminal algorithm which is "optimal" for regret minimization is called UCB; it sequentially pulls the arm with the highest "score" (the sum of the empirical average plus some error term) while its decision rule is not satisfied. Although its regret is small, it is not guaranteed that the best arm will be identified in a short time. Here, UCB denotes the classical UCB algorithm, run with confidence parameter $\delta$. It can eliminate one arm when detected as sub-optimal (using the anytime bound used in ETC). $\tau_d$ is actually the number of samples generated before this elimination occurs. Maybe surprisingly, there is no guarantee that this number of samples is actually finite.

Proposition 2 ([2]). With probability at least $1-\tilde\delta$, the regret of UCB is bounded as
$$R(\tau_d) \le \frac{8\sigma^2}{\Delta}\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$
There is no guarantee that $\tau_d$ is uniformly bounded.

Our family of algorithms interpolates between UCB and ETC by introducing a single parameter $\alpha \in [1, +\infty]$ whose extreme values correspond respectively to focusing solely on regret minimization (i.e., our algorithm identifies with UCB) or best-arm identification (it identifies with ETC). To each value of $\alpha \in [1, +\infty]$ corresponds a different trade-off: indeed, the bigger $\alpha$, the bigger the regret but the smaller the decision time.

More precisely, we introduce and study a continuum of algorithms UCB$_\alpha$. For $n \in \mathbb{N}^*$, define $\varepsilon_n = \sqrt{\frac{2\sigma^2}{n}\log\left(\frac{3\log_2 n}{\delta}\right)}$. For $\alpha, \delta \in \mathbb{R}^+$, UCB$_\alpha$ allocates the user to the population with the highest score $\arg\max_i\, \hat r^i_{n_i} + \alpha\varepsilon_{n_i}$. It returns an arm $i \in \{A, B\}$ if it dominates the other arm $j$ in the sense that $\hat r^i_{n_i} - \varepsilon_{n_i} \ge \hat r^j_{n_j} + \varepsilon_{n_j}$.

By construction, the UCB$_\alpha$ algorithm will keep both indexes around the same level. But due to the factor $\alpha > 1$, the decision intervals of width $\varepsilon_{n_A}$ and $\varepsilon_{n_B}$ (without the factor $\alpha$, which is only used in the sampling policy, not in the decision rule) will eventually become disjoint. Thus, while behaving like a UCB-type algorithm to minimize regret, UCB$_\alpha$ can still return the identity of the best arm.

Algorithm 1 UCB$_\alpha$
Input: $\alpha$, $\delta$
1: repeat over $n$
2:   for each population $i \in \{A, B\}$ do
3:     $\varepsilon^i_n = \sqrt{\frac{2\sigma^2}{n_i}\log\left(\frac{3\log_2 n_i}{\delta}\right)}$
4:   end for
5:   Assign next user to population $i_n = \arg\max_{i\in\{A,B\}}\, \hat r^i_n + \alpha\varepsilon^i_n$
6:   $i^* = \arg\max_{i\in\{A,B\}}\, \hat r^i_n$
7: until $\hat r^{i^*}_n - \varepsilon^{i^*}_n > \hat r^j_n + \varepsilon^j_n$ for $j \ne i^*$
8: return $i^*$
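The following Python sketch mirrors Algorithm 1 for two Gaussian populations; the simulation loop, the initialization and the names are our own choices, while the index and decision rule are those stated above:

```python
import numpy as np

def ucb_alpha(mu, alpha=2.0, sigma=1.0, delta=0.05, max_t=10**6, seed=None):
    """Sample with the alpha-inflated index r_hat + alpha * eps; stop and
    return an arm once the non-inflated eps-intervals become disjoint."""
    rng = np.random.default_rng(seed)
    n = np.zeros(2)  # pulls per population
    s = np.zeros(2)  # reward sums per population
    for t in range(max_t):
        if n.min() < 2:
            i = int(n.argmin())  # initialize: pull each population twice
        else:
            r_hat = s / n
            eps = np.sqrt(2 * sigma**2 / n * np.log(3 * np.log2(n) / delta))
            best = int(r_hat.argmax())
            if r_hat[best] - eps[best] > r_hat[1 - best] + eps[1 - best]:
                return best, int(n.sum())  # decision rule satisfied
            i = int(np.argmax(r_hat + alpha * eps))  # sampling rule
        s[i] += rng.normal(mu[i], sigma)
        n[i] += 1
    return None, max_t

# Example: ucb_alpha(mu=[0.0, 1.0], alpha=2.0) returns (1, tau_d) with high probability.
```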

The following theorem makes precise this trade-off between time of decision and regret.

Theorem 3. With probability greater than $1-\tilde\delta$, UCB$_\alpha$ has a regret $R(\tau_d)$ at its time of decision satisfying
$$R(\tau_d) \le \left(\frac{8\sigma^2}{\Delta}\,c_\alpha + \Delta\right)\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right),$$
where the constant $c_\alpha \in [1, 7]$ is defined by
$$c_\alpha = \min\left\{\frac{(\alpha+1)^2}{4},\ \frac{4\alpha^2}{(\alpha-1)^2}\right\}, \quad\text{with } c_1 = 1 \text{ and } c_\infty = 4\,.$$
On that event, the time of decision satisfies $\tau_d \le T_\delta$ with
$$T_\delta \le \frac{\alpha^2+1}{(\alpha-1)^2}\left(\frac{16\sigma^2}{\Delta^2}\,c_\alpha + 1\right)\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$

When $\alpha$ goes to 1, the leading term of the regret goes to $8\sigma^2\log(\frac{1}{\delta})/\Delta$ but the time of decision becomes infinite. When $\alpha \to \infty$, the leading term becomes $32\sigma^2\log(\frac{1}{\delta})/\Delta$ and the time of decision is of order $64\sigma^2\log(\frac{1}{\delta})/\Delta^2$. The fact that the constant $c_\alpha$ is not monotonic in $\alpha$ is an artifact of the proof, i.e., a byproduct of two different analyses of the same problem (for either small or large values of $\alpha$). Indeed, Figure 1 actually indicates that the regret of UCB$_\alpha$ is monotonic with respect to $\alpha$, as expected.

ETC has a regret bounded by an expression of order $\frac{16\sigma^2}{\Delta}\log\frac{1}{\delta}$. This is twice as large as the regret of UCB$_\alpha$ for small $\alpha$. These are only one-sided bounds and do not allow to conclude which algorithm gets a lower regret, but experiments show that UCB$_\alpha$ for small $\alpha$ indeed presents an advantage in terms of regret versus ETC, at the cost of a higher decision time. See Figure 1.

When $\alpha$ goes to infinity, the exploration term is dominant for UCB$_\alpha$, and it will always pull the least pulled arm. As a consequence, it becomes a variant of ETC with a sub-optimal decision criterion. We denote it by ETC'. It allocates to the populations A and B uniformly and selects A if
$$\hat r^A_n - \hat r^B_n \ge \sqrt{\frac{8\sigma^2}{n}\log\left(\frac{3\log_2(n)}{\delta}\right)}.$$
Its confidence interval width is $\sqrt{2}$ times larger than the one of ETC because of the different concentration arguments used in designing the algorithms: ETC uses a concentration lemma on the difference $\hat r^A_n - \hat r^B_n$, while UCB$_\alpha$ (and its limit ETC') deals separately with arms A and B, since their numbers of samples might be different. Note that the difference between ETC and ETC' is a specificity of the two-armed bandit case, as the generalization of ETC to more than two arms [19] considers per-arm confidence intervals.

Figure 1: Comparison of ETC and UCB$_\alpha$. Left: regret at selection. Right: time of selection. The curves average 1000 experiments with two Gaussian arms of means 0 and 1 and variance 1.

On the other hand, UCB$_\alpha$ generalizes immediately when the number of populations is greater than 2. In that case, UCB$_\alpha$ samples arms as usual but successively eliminates them by looking at the anytime bound on their empirical mean (somewhat similarly to lil'UCB [9]). Then $\tau_d$ corresponds to the time where $K-1$ arms have been eliminated and the algorithm can output a candidate for the best arm. We assume that the index of the optimal population is 1 and we denote by $\Delta_k = r^1 - r^k$ the gap associated to the suboptimal population $k$. Then we can derive the following corollary from Theorem 3.

Corollary 4. With $K$ different arms, and with probability greater than $1-\tilde\delta$, UCB$_\alpha$ has a decision time smaller than
$$\left(\frac{(\alpha+1)^2}{(\alpha-1)^2}\left(\frac{8\sigma^2}{\Delta_{\min}^2}\,c_\alpha + 1\right) + \sum_{k=2}^{K}\frac{8\sigma^2}{\Delta_k^2}\,c_\alpha + K\right)\times\log\left(\frac{K}{\delta}\right)\left(1+O_\delta(1)\right)$$
and its regret $R(\tau_d)$ satisfies
$$R(\tau_d) \le \sum_{k=2}^{K}\left(\frac{8\sigma^2}{\Delta_k}\,c_\alpha + \Delta_k\right)\log\left(\frac{K}{\delta}\right)\left(1+O_\delta(1)\right).$$

With $K > 2$ populations, ETC would stop sampling population $k$ as soon as there exists another population $i$ such that
$$\hat r^i_n - \hat r^k_n \ge 4\sqrt{\frac{\sigma^2}{n}\log\left(\frac{3K\log_2(n)}{\delta}\right)}.$$
As a consequence,

its decision time will be upper-bounded by
$$\left(\frac{32\sigma^2}{\Delta_{\min}^2} + \sum_{k=2}^{K}\frac{32\sigma^2}{\Delta_k^2}\right)\log\left(\frac{K}{\delta}\right)\left(1+O_\delta(1)\right)$$
and its regret smaller than
$$\left(\sum_{k=2}^{K}\frac{32\sigma^2}{\Delta_k}\log\left(\frac{K}{\delta}\right)\right)\left(1+O_\delta(1)\right).$$

As a consequence, even if UCB$_\alpha$ interpolates between UCB and ETC in terms of regret, it actually outperforms ETC in terms of decision time when $\alpha \to \infty$.

1.1 How does inflating the exploration term lead to a finite decision time? A proof sketch

We consider an algorithm which pulls $\arg\max_{i\in\{A,B\}}\, \hat r^i_{n_i} + \alpha\varepsilon_{n_i}$ with $\alpha > 1$, stops if $|\hat r^A_{n_A} - \hat r^B_{n_B}| > \varepsilon_{n_A} + \varepsilon_{n_B}$, and returns the arm with the highest mean at that point.

Recall that $\varepsilon_n$ is a quantity close to $1/\sqrt{n}$, up to logarithmic terms in $n$ and multiplicative constants, hence $1/\varepsilon_n^2 \approx cn$ for some constant $c$. If we can prove that, as long as no decision is taken, $1/\varepsilon_{n_A}^2$ and $1/\varepsilon_{n_B}^2$ are bounded from above, then we obtain an upper bound on the time $t = n_A + n_B$ before a decision is taken.

Suppose that the best arm is A. Bounding $n_B$ is done through classical bandit arguments: since B is the worst arm, a UCB-type algorithm does not pull it often. The challenge is to show that our algorithm also controls $n_A$.

We can make use of a concentration result of the form: with probability $1-\tilde\delta$, $\hat r^A(n_A) + \varepsilon_{n_A} \ge r^A$ and $\hat r^B(n_B) - \varepsilon_{n_B} \le r^B$. The decision criterion ensures that if a decision is taken and these concentration inequalities hold, then the arm returned is the correct one. Indeed, under this concentration event, $\hat r^B(n_B) - \hat r^A(n_A) \le r^B - r^A + \varepsilon_{n_B} + \varepsilon_{n_A}$, which is strictly smaller than $\varepsilon_{n_B} + \varepsilon_{n_A}$. Hence B cannot be returned: if the algorithm stops, it is correct.

The algorithm will keep both indexes roughly equal, hence $\hat r^A_{n_A} + \alpha\varepsilon_{n_A} \approx \hat r^B_{n_B} + \alpha\varepsilon_{n_B}$, and the "pulling" confidence intervals with width $\alpha\varepsilon_n$ will never get disjoint. But as $\varepsilon_{n_A}$ and $\varepsilon_{n_B}$ get small, the smaller "decision" confidence intervals with width $\varepsilon_n$ will eventually separate, as seen in Figure 2.

More formally, if A is pulled and no decision was taken yet, then the index of A is big, $\hat r^A(n_A) + \alpha\varepsilon_{n_A} > \hat r^B(n_B) + \alpha\varepsilon_{n_B}$, but not so big that the algorithm stops, i.e., $\hat r^A(n_A) - \varepsilon_{n_A} \le \hat r^B(n_B) + \varepsilon_{n_B}$. By combining the two, we obtain the relation $n_A \propto \frac{1}{\varepsilon_{n_A}^2} \le \frac{(\alpha+1)^2}{(\alpha-1)^2}\frac{1}{\varepsilon_{n_B}^2} \propto \frac{(\alpha+1)^2}{(\alpha-1)^2}\, n_B$, where $\propto$ is to be read as the informal statement that the quantities are roughly proportional.
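Spelling out this combination (a back-of-the-envelope version, with constants elided):
$$\hat r^A(n_A) + \alpha\varepsilon_{n_A} > \hat r^B(n_B) + \alpha\varepsilon_{n_B} \quad\text{and}\quad \hat r^A(n_A) - \varepsilon_{n_A} \le \hat r^B(n_B) + \varepsilon_{n_B}$$
together give $(\alpha-1)\,\varepsilon_{n_B} < (\alpha+1)\,\varepsilon_{n_A}$; since $1/\varepsilon_n^2 \approx cn$, squaring and inverting this inequality yields $n_A \lesssim \frac{(\alpha+1)^2}{(\alpha-1)^2}\, n_B$.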

To sum up the idea of the proof: B will not be pulled much since it is the worst arm and we employ a UCB-type algorithm, and A will be pulled less than a factor depending on $\alpha$ times B. Thus, as long as no decision is taken, $t = n_A + n_B$ is bounded by a quantity $T_\delta$. We conclude that the decision time is smaller than $T_\delta$.

Figure 2: Confidence intervals used in UCB$_\alpha$. Top: before the time of decision, the indexes $\hat r(n) + \alpha\varepsilon_n$ are aligned, and the tighter $\varepsilon_n$ intervals also overlap. Bottom: at the time of selection, the tight $\varepsilon_n$ confidence intervals become disjoint.

2 Extensions to non-iid settings

When an internet platform wants to A/B test two versions of its website, purchases from customers that buy recurrently cannot be considered as independent. When a technical A/B test is run, it is also usual to split servers into two populations to A/B test the business impact of the latency of a new code version. In this case, observations from the same server are not independent. This setting is not restricted to online marketing. Clinical trials classically measure the survival time or quality of life of patients over time, and adaptive testing is a key challenge in this context; however, it is so far either heuristic [21] or under the restrictive assumption of observing the rewards before the next decision [3, 8]. More recently, [17] studied a similar setting of rewards arriving over time, however with a different objective, namely finding an adaptive way to stop the test for a patient, taking into account a cost of testing.

The setting of multi-armed bandits presented in the previous section has been applied to A/B tests [13, 23, 24] with independent rewards. The goal of this section is to show that we can extend the algorithms presented in the first section to a more complex setting, and to provide practitioners with a framework for interpolating between best arm identification and regret minimization in other settings than the iid case.

To model these aspects, we show that decisions of the multi-armed bandit algorithm can be taken at a unit level (e.g., users, patients, servers). Once allocated to a population A or B, a unit $u$ interacts with the system during the whole A/B test. As time increases, the system gathers more and more signal on a unit that arrived early in the A/B test. In order to be able to take a causal decision at the end of the A/B test, we will also assume that units already exposed to one treatment cannot be switched between populations: a unit stays in the same population during the whole A/B test. Intuitively, the system will estimate the performance of one technology by averaging its performance over the different units.

2.1 Notations

We need to differentiate the units (e.g., users) randomly assigned to populations A and B from their associated rewards. In the iid setting, the reward $r^A_u$ associated to a unit $u$ in population A was assumed to be observed instantly. Now, we assume that, for each unit $u$ seen so far, we observe a noisy version of this reward over time, and $r^A_u$ is only the unknown expectation of this process. Population-specific notation is symmetric; thus, for the sake of readability, we only detail notations for the control A and assume the corresponding ones for the treatment B.

More formally, we assume the units $u$ are i.i.d. samples from an unknown distribution, arriving in the test dynamically over time. To each unit $u$ is associated an unknown reward $r^A_u$, an i.i.d. sub-Gaussian r.v. with expected value $r^A$ and variance $\sigma_r^2$, as well as an arrival time $T_u$. $A(t)$ denotes the set of units in A, of cardinality $n^A_t$, at time $t$.

Figure 3: The unit allocation problem. At each stage $t \in \mathbb{N}$, a new unit is allocated to either population A or B, and a sample from every unit arrived before $t$ is observed.

Then, given a unit $u$ and starting at time $T_u$, we observe over time random outcomes $r^A_{u,t} = r^A_u + \varepsilon_{u,t}$, where $\varepsilon_{u,t}$ is a zero-mean sub-Gaussian variable with variance $\sigma_\varepsilon^2$. At time $t$, the unit $u$ has generated $t - T_u + 1$ outcomes $r^A_{u,T_u}, \dots, r^A_{u,t}$.

At each time step, the algorithm has access to all the rewards generated by all the users already present in the A/B test, and the precision on $r^A_u$ will increase with time as more samples $r^A_{u,t}$ are gathered.

In this setting, a natural estimator to consider is
$$\hat r^A_t := \frac{1}{n^A_t}\sum_{u\in A(t)}\frac{1}{t - T_u + 1}\sum_{s=T_u}^{t} r^A_{u,s}\,.$$

We call this estimator the Mean of means estimator. With this estimator, we can design algorithms that reach a trade-off between regret minimization and best arm identification and see how the regret depends on $\sigma_r^2$ and $\sigma_\varepsilon^2$. We could have considered other estimators, such as the Total Mean estimator defined as
$$\hat R^A_t := \frac{1}{\sum_{u\in A(t)} (t - T_u + 1)}\sum_{u\in A(t)}\sum_{s=T_u}^{t} r^A_{u,s}\,,$$

which is also an unbiased estimator of $r^A$. Unfortunately, this estimator puts more weight on older units than on more recent ones. In the case where all units have more or less the same noise, this is not really an issue (but this is more or less the only case where the total mean estimator behaves well). With an adaptive sampling algorithm (such as UCB), it could be the case that a new unit is allocated to a population after an exponentially long time. In that case, the weights put on different units can be of different orders of magnitude, preventing fast convergence of the estimator to the estimated mean.

A second motivation for choosing the Mean of means estimator is that it opens the door to generalizations to more complex models of the stochastic processes underlying the units' behavior. Indeed, it can be seen as an average, over the units, of the per-unit stochastic process expected values. Here, we assume the random outcomes $r^i_{u,t}$ are i.i.d. with expectation $r^i_u$ (technically, our results hold for martingale difference sequences). The technical part then shows how to combine concentration results on each of the units to derive bandit algorithms with good properties. With a different model of the per-user random outcomes (e.g., a mean-reverting or cyclic process), our proof techniques could be used as long as it is possible to construct an estimator of the expectation that concentrates well enough.
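As an illustration, here is a minimal Python sketch of the Mean of means estimator under the notation above; the data layout (one list of outcomes per unit) is a hypothetical choice of ours:

```python
import numpy as np

def mean_of_means(arrival_times, outcomes, t):
    """Mean-of-means estimator at time t for one population.
    arrival_times[u] is T_u; outcomes[u][s] is the outcome of unit u at
    time T_u + s, so unit u has t - T_u + 1 samples available at time t."""
    unit_means = [
        np.mean(outcomes[u][: t - T_u + 1])  # empirical mean of unit u
        for u, T_u in enumerate(arrival_times)
        if T_u <= t
    ]
    return float(np.mean(unit_means))  # average the per-unit means

# Example with 3 units arrived at times 0, 0 and 2, observed up to t = 3:
# mean_of_means([0, 0, 2], [[1.0, 0.5, 0.2, 0.3], [0.1] * 4, [0.4, 0.6]], t=3)
```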

Precise statements and proofs of results presented in this section can be found in Appendix C.

2.2 ETC

The ETC algorithm allocates units alternately to A and B, possibly choosing the first population at random. To simplify the analysis, we actually assume that 2 units arrive at each stage, one being allocated to each population, so that if ETC stops at stage $t \in \mathbb{N}$, then both populations have $n = t$ units, and the regret is $n\Delta$.

The stopping rule criterion of ETC is simply the following:
$$|\hat r^A_n - \hat r^B_n| > \sqrt{\frac{4\left(\sigma_r^2 + \frac{\sigma_\varepsilon^2\log(en)}{n}\right)}{n}\log\left(\frac{\pi^2 n^2}{3\delta}\right)}.$$
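A direct transcription of this threshold, as a sketch (the function name and arguments are ours; `sigma_r` and `sigma_eps` denote the variance proxies $\sigma_r$ and $\sigma_\varepsilon$):

```python
import numpy as np

def etc_stop_threshold(n, sigma_r, sigma_eps, delta):
    """Stopping threshold of ETC in the unit setting, when each of the two
    populations contains n units: stop once |r_hat_A - r_hat_B| exceeds it."""
    var = sigma_r**2 + sigma_eps**2 * np.log(np.e * n) / n
    return np.sqrt(4 * var / n * np.log(np.pi**2 * n**2 / (3 * delta)))
```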

Theorem 5. With probability at least $1-\delta$, the ETC algorithm outputs the best population and stops after having allocated a total of at most $\tau_d \le T_\delta$ units, where
$$T_\delta = \frac{32\sigma_r^2}{\Delta^2}\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right) + \frac{\sigma_\varepsilon^2}{\sigma_r^2}\log\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$
Its regret at $\tau_d$ is equal to $\frac{\Delta}{2}\tau_d$.

2.3 UCB-MM

The variant of UCB we consider is defined with respect to the following index of performance of population $i \in \{A, B\}$:
$$\hat r^i_t + \sqrt{\frac{2\left(\sigma_r^2 + \frac{\sigma_\varepsilon^2\log(e\,n^i_t)}{n^i_t}\right)}{n^i_t}\,\log\left(\frac{(4 n^i_t)^4}{2\delta}\max\left\{1,\ \frac{n^i_t\,\sigma_r^2}{\sigma_\varepsilon^2}\right\}\right)} \qquad (1)$$

Using this index, UCB-MM is described in Algorithm 2.


Algorithm 2 UCB-MM$_\alpha$
Input: $\alpha$, $\delta$
1: repeat over $t$
2:   for each population $i \in \{A, B\}$ do
3:     $\varepsilon^i_t = \sqrt{\frac{2\left(\sigma_r^2 + \sigma_\varepsilon^2\log(e\,n^i_t)/n^i_t\right)}{n^i_t}\log\left(\frac{(4 n^i_t)^4}{2\delta}\max\left\{1,\ \frac{n^i_t\sigma_r^2}{\sigma_\varepsilon^2}\right\}\right)}$
4:   end for
5:   Assign next user to population $i_t = \arg\max_{i\in\{A,B\}}\, \hat r^i_t + \alpha\varepsilon^i_t$
6:   $i^* = \arg\max_{i\in\{A,B\}}\, \hat r^i_t$
7: until $\hat r^{i^*}_t - \varepsilon^{i^*}_t > \hat r^j_t + \varepsilon^j_t$ for $j \ne i^*$
8: return $i^*$

We mention here that having random numbers of units with random numbers of occurrences prevents us from deriving the more or less standard concentration inequalities used for the other algorithms; it actually requires combining several types of inequalities, which explains the non-standard term in the $\sqrt{\log(\cdot)}$ part of the index.
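A small Python sketch of the index computation, following our reading of Equation (1) above (function and argument names are ours):

```python
import numpy as np

def ucb_mm_index(r_hat, n_t, sigma_r, sigma_eps, delta):
    """Index of one population for UCB-MM: the mean-of-means estimate plus
    an error term mixing the unit variance sigma_r^2 and the observation
    noise variance sigma_eps^2, as in Equation (1)."""
    var = sigma_r**2 + sigma_eps**2 * np.log(np.e * n_t) / n_t
    # Note: the index degenerates as sigma_eps -> 0, as discussed below.
    log_arg = (4 * n_t) ** 4 / (2 * delta) * max(1.0, n_t * sigma_r**2 / sigma_eps**2)
    return r_hat + np.sqrt(2 * var / n_t * np.log(log_arg))
```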

Theorem 6. With probability at least $1-\tilde\delta$, the regret of UCB-MM is bounded at stage $\tau_d$ as
$$R(\tau_d) \le \frac{8\sigma_r^2}{\Delta}\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right) + \frac{\sigma_\varepsilon^2}{\sigma_r^2}\,\Delta\,\log\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$
There is no guarantee that $\tau_d$ is uniformly bounded.

We emphasize here that the dependency of the total regret with respect to the noise variance $\sigma_\varepsilon^2$ is negligible compared to its dependency with respect to the variance of the unit performance $\sigma_r^2$. Indeed, the regret has a $\log\frac{1}{\delta}$ factor in front of $\sigma_r^2$ (a term which is unavoidable, even without extra noise, i.e., if $\sigma_\varepsilon^2 = 0$). On the other hand, the multiplicative factor of $\sigma_\varepsilon^2$ is only double logarithmic, in $\log\log\frac{1}{\delta}$. Moreover, the additional number of units required to find the best population is, asymptotically, independent of $\Delta$, the proximity measure of the two populations.

In the index definition of UCB-MM (Equation (1)), letting $\sigma_\varepsilon^2$ go to 0 gives a void bound (the index is basically always $+\infty$). This is an artefact of the proof, needed for having only an extra $\log\log(\cdot)$ term. It is also possible to use the following alternative error term for UCB-MM:
$$\sqrt{\frac{2\sigma_r^2}{n^i_t}\log\left(\frac{9\log_2(n^i_t)}{\delta}\right)} + \sqrt{\frac{2\sigma_\varepsilon^2\log(e\,n^i_t)}{(n^i_t)^2}\log\left(\frac{(4 n^i_t)^4}{\delta}\right)}.$$

This error term converges, as $\sigma_\varepsilon^2$ goes to 0, to the usual error term of UCB. Unfortunately, the regret dependency on $\sigma_\varepsilon^2$ deteriorates, as it scales with
$$\left(\frac{8\sigma_r^2}{\Delta}\log\frac{1}{\delta} + \frac{\sigma_\varepsilon^2}{\sigma_r^2}\,\Delta\sqrt{\log\frac{1}{\delta}}\right)\left(1+O_\delta(1)\right),$$
thus with a $\sqrt{\log(\cdot)}$ extra term instead of a $\log\log(\cdot)$ one.

2.3.1 Interpolating between regret minimization and best arm identification

As in the iid case, it is possible, with units, to define UCB-MM$_\alpha$ to interpolate between UCB-MM and ETC-MM, by multiplying the error term of UCB-MM by a factor $\alpha \in [1, +\infty]$.

Theorem 7. With probability at least $1-\tilde\delta$, the regret of UCB-MM$_\alpha$ at its time of decision $\tau_d$ is bounded as
$$R(\tau_d) \le \left(\frac{8\sigma_r^2}{\Delta}\,c_\alpha + \Delta\right)\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right) + \frac{\sigma_\varepsilon^2}{\sigma_r^2}\,\Delta\,\log\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$
Moreover, the time of decision is upper-bounded by
$$\tau_d \le \frac{\alpha^2+1}{(\alpha-1)^2}\left(\frac{16\sigma_r^2}{\Delta^2}\,c_\alpha + 1\right)\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right) + 2\,\frac{\sigma_\varepsilon^2}{\sigma_r^2}\,\log\log\left(\frac{1}{\delta}\right)\left(1+O_\delta(1)\right).$$

The proof of this result proceeds as in the iid case (see the sketch of proof in Section 1.1), assuming that we can provide a suitable concentration inequality to bound the deviation of the mean-of-means estimator. The main difference is the concentration arguments used. The analysis is adaptive in that it gives a bound on the decision time for any model, as long as we are able to provide concentration inequalities for the population means. The concentration inequality is used to obtain a bound on the number of allocations to the sub-optimal population; then the inflated confidence intervals mechanically provide a bound on the number of pulls of the best population with respect to the sub-optimal one.

As in the iid case, we have stated our results for $K = 2$ populations, but it is straightforward to generalize them to $K > 2$ different populations.

Practical remark. In practice, attributing users dynamically to populations can be hard to handle in production (for instance, the population of a user must be stored in order to interact with them when they come back to the platform). This is why we also provide in appendix another anytime bound, in a simpler setting where we assume that all the units are already present at the beginning of the test, thus allowing to allocate them to populations using a simple hash of their identifiers. Based on this bound, the practitioner can stop the test as early as possible once the decision is statistically significant. However, this bound cannot help trading off between regret minimization and best arm identification.

On the other hand, we assumed that the rewards and noise were sub-Gaussian random variables, with known variance (proxy) $\sigma_r^2$ and $\sigma_\varepsilon^2$. Our results and techniques can be generalized to the case where the random variables $r^i_u$ and $\varepsilon_{u,t}$ are bounded (say, in $[0, 1]$) with unknown variance. One just needs to use empirical Bernstein concentration inequalities [16].
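For reference, one standard form of the empirical Bernstein bound of [16], for $n$ i.i.d. variables bounded in $[0,1]$ with empirical mean $\hat\mu_n$ and empirical variance $\hat V_n$, reads: with probability at least $1-\delta$,
$$\mu \le \hat\mu_n + \sqrt{\frac{2\hat V_n\log(2/\delta)}{n}} + \frac{7\log(2/\delta)}{3(n-1)}\,,$$
and such a bound can replace the sub-Gaussian deviation terms used in the indexes above.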

3 Experiments

On a simple iid setting, we illustrate in Figure 1 the performance proved in Section 1. There are two Gaussian arms with the same variance 1 and means 0 and 1. The two graphs on the figure show the decision time and the regret at the time of decision of several algorithms, for a range of values of $\log(1/\delta)$. The algorithms shown are ETC, four instances of UCB$_\alpha$ for $\alpha \in \{1.5, 2, 4, 32\}$, and ETC', the variant of ETC to which UCB$_\alpha$ tends when $\alpha \to \infty$. We highlight a few conclusions from these plots, which are all in agreement with the theoretical results.

• The algorithm with the lowest decision time is ETC; the algorithm with the lowest regret is UCB$_\alpha$ with small $\alpha$.

• For $\alpha \ge 4$, UCB$_\alpha$ has lower regret and higher decision time than ETC', but is worse than ETC on both criteria.

• For $\alpha$ equal to 1.5 or 2, UCB$_\alpha$ has lower regret and higher decision time than ETC.

UCB$_\alpha$ is seen empirically to realize a trade-off between its two limiting algorithms UCB and ETC', and there is an interval of $\alpha$ in which UCB$_\alpha$ is a trade-off between UCB and ETC. The numerical relations between the bounds can also be verified: the regret of UCB$_{1.5}$ is almost half the regret of ETC, which is itself half the regret of ETC'.

We then show on Figure 4 empirical results in the unit setting presented in Section 2. At each time step, a user arrives and then generates a reward according to $r_u$ at every time step until the end of the game. In our simulation, $r_u$ is sampled from a normal distribution with mean 0 for population A (respectively mean 1 for population B) and variance $\sigma_r^2$ equal to 1. The noise $\varepsilon$ at each time step is also Gaussian with mean 0 and variance $\sigma_\varepsilon^2$ equal to 1. The data can be seen as a triangular matrix (cf. Figure 3). We compare the performance of UCB$_\alpha$ for different values of $\alpha$ and of ETC' in terms of regret and time of decision. We see that we are able to reproduce in the unit setting what we observed in the iid setting. We observe again a factor 4 between the regret of ETC' and UCB. With UCB$_\alpha$, we can realize a trade-off between ETC' and UCB, both in terms of regret and decision time.

4 Conclusion

We studied A/B tests in the fixed confidence setting from the two perspectives of regret minimization and best arm identification. We introduced a class of algorithms that optimizes both objectives at the same time, interpolating between optimal algorithms designed for each case.

Figure 4: First: regret in the unit setting. Second: decision time in the unit setting. The curves are averages of 1000 experiments with two Gaussian arms with means 0 and 1 and $\sigma_r = 1$, $\sigma_\varepsilon = 1$.

Our study also sheds light on an effect often seen in practice: the UCB algorithm not only minimizes regret but also identifies the best arm in finite time, as soon as its exploration term is slightly inflated. This inflation is nearly always present in practice. Indeed, for UCB to be a valid algorithm for a given noise and for the confidence interval not to be bigger than necessary, it would need to be run with a constant matching exactly the unknown sub-Gaussian norm of the noise.

We extended our study to a non-iid setting by deriving adapted concentration results for the mean-of-means estimator, again obtaining algorithms which interpolate between the two objectives. We would however like to warn practitioners against an intensive use of bandit algorithms for A/B tests. Even if data are collected through time, it can prove difficult to define a time of arrival for units that is not correlated with the data, so as to ensure that the iid assumption holds; bandit algorithms that neglect this aspect could behave quite poorly. Handling such a problem would require modeling the unit stochastic process conditionally on the time of arrival, which would be a use case for generalizing our results to more complex stochastic processes for unit modelling.


5 Acknowledgements

The fourth author has benefited from the support of the FMJH Program Gaspard Monge in optimization and operations research (supported in part by EDF), from the Labex LMH and from the CNRS through the PEPS program.

References

[1] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 2002.

[3] D. A. Berry and S. G. Eick. Adaptive assignment versus balanced randomization in clinical trials: A decision analysis. Statistics in Medicine, 14(3), 1995.

[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412, 2011.

[5] A. Carpentier and A. Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, 2016.

[6] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun), 2006.

[7] A. Garivier, T. Lattimore, and E. Kaufmann. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, 2016.

[8] F. Hu and W. F. Rosenberger. The Theory of Response-Adaptive Randomization in Clinical Trials. John Wiley & Sons, Inc., Apr 2006.

[9] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, 2014.

[10] R. Johari, P. Koomen, L. Pekelis, and D. Walsh. Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17. ACM, 2017.

[11] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, 2013.

[12] M. N. Katehakis and H. Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19), 1995.

[13] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of A/B testing. In Conference on Learning Theory, 2014.

[14] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1), 2016.

[15] S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun), 2004.

[16] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In Conference on Learning Theory, 2009.

[17] D. M. Negoescu, K. Bimpikis, M. L. Brandeau, and D. A. Iancu. Dynamic learning of patient response types: An application to treating chronic diseases. Management Science, 64(8), 2018.

[18] L. Pekelis, D. Walsh, and R. Johari. The new stats engine. Internet. Retrieved December 6, 2015.

[19] V. Perchet and P. Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 2013.

[20] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg. Batched bandit problems. The Annals of Statistics, 44(2), 2016.

[21] W. H. Press. Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research. Proceedings of the National Academy of Sciences, 106(52), Dec 2009.

[22] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 1933.

[23] F. Yang, A. Ramdas, K. G. Jamieson, and M. J. Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, 2017.

[24] S. Zhao, E. Zhou, A. Sabharwal, and S. Ermon. Adaptive concentration inequalities for sequential decision problems. In Advances in Neural Information Processing Systems, 2016.