
Bandit Online Learning with Unknown Delays

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis

University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA
{lixx5599,chen3827,georgios}@umn.edu

May 29, 2019

Abstract

This paper deals with bandit online learning problems involving feedback of unknown delay that can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function involved that become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. For such challenging setups, delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations are developed for MAB and BCO, respectively. Leveraging a unified analysis framework, it is established that the regrets of DEXP3 and DBGD are $\mathcal{O}\big(\sqrt{K\bar{d}(T+D)}\big)$ and $\mathcal{O}\big(\sqrt{K(T+D)}\big)$, respectively, where $\bar{d}$ is the maximum delay and $D$ denotes the delay accumulated over $T$ slots. Numerical tests using both synthetic and real data validate the performance of DEXP3 and DBGD.

1 Introduction

Sequential decision making emerges in various learning and optimization tasks, such as online advertisement, online routing, and portfolio management [5, 15]. Among popular methods for sequential decision making, multi-armed bandit (MAB) and bandit convex optimization (BCO) have widely-appreciated merits because with limited information they offer quantifiable performance guarantees. MAB and BCO can be viewed as a repeated game between a possibly randomized learner, and the possibly adversarial nature. In each round, the learner selects an action, and incurs the associated loss that is returned by the nature. In contrast to the full information setting, only the loss of the performed action rather than the gradient of the loss function (or even the loss function itself) is revealed to the learner. Popular approaches to bandit online learning estimate gradients using several point-wise evaluations of the loss function, and use them to run online gradient-type iterative solvers; see, e.g., [3] for MAB and [1, 13] for BCO.

Although widely applicable with solid performance guarantees, standard MAB and BCO frameworks do not account for the delayed feedback that is naturally present in various applications. For example, when carrying out machine learning tasks using distributed mobile devices (a setup referred to as federated learning) [21], delay comes from the time it takes to compute at the mobile devices and also to transmit over the wireless communication links; in online recommendations, the click-through rate could be aggregated and then periodically sent back [19]; in online routing over communication networks, the latency of each routing decision can be revealed only after the packet's arrival at its destination [4]; and in parallel computing by data centers, computations are carried out with outdated information because agents are not synchronized [2, 11, 20].

Challenges arise naturally when dealing with bandit online learning with unknown delays, simply because unknown delays prevent existing methods for non-stochastic MAB as well as BCO from constructing reliable gradient estimates. To address this limitation, our solution is a fine-grained biased gradient estimator for MAB and a




deterministic gradient estimator for BCO, where the standard unbiased loss estimator for non-stochastic MAB and the nearly unbiased one for BCO are no longer available. The resultant algorithms, that we abbreviate as DEXP3 and DBGD, are guaranteed to achieve $\mathcal{O}\big(\sqrt{K\bar{d}(T+D)}\big)$ and $\mathcal{O}\big(\sqrt{K(T+D)}\big)$ regret, respectively, over a $T$-slot time horizon with maximum delay $\bar{d}$ and overall delay $D$.

1.1 Related works

Delayed online learning can be categorized depending on whether the feedback information is full or bandit (meaning partial). We review prior works from these two aspects.

Delayed online learning. This class deals with delayed but fully revealed loss information, namely a fully known gradient or loss function. It is proved that an $\mathcal{O}(\sqrt{T+D})$ regret can be achieved over a $T$-slot time horizon with overall delay $D$. In particular, algorithms dealing with a fixed delay have been studied in [29]. To reduce the storage and computational burden of [29], an online gradient descent type algorithm for a fixed $d$-slot delay was developed in [18], where a lower bound of order $\sqrt{(d+1)T}$ was also provided. Adversarial delays have been tackled recently in [17, 25, 26]. However, the algorithms as well as the corresponding analyses in [17, 25, 26] are not applicable to the bandit online learning setting when the delays are unknown.

Delayed bandit online learning. Stochastic MAB with delays has been reported in [8, 9, 24, 28]; see also [16] for multi-instance generalizations introduced to handle adversarial delays in stochastic and non-stochastic MAB settings. For non-stochastic MAB, EXP3-based algorithms were developed to handle fixed delays in [7, 23]. Although not requiring memory for extra instances, the delay in [7] and [23] must be known. A recent work [6] considers a more general non-stochastic MAB setting, where the feedback is anonymous.1

1.2 Contributions

Our main contributions can be summarized as follows.

c1) Based on a novel biased gradient estimator, a delayed exploration-exploitation exponentially (DEXP3) weighted algorithm is developed for delayed non-stochastic MAB with unknown and adversarially chosen delays;

c2) Relying on a novel deterministic gradient estimator, a delayed bandit gradient descent (DBGD) algorithm is developed to handle the delayed BCO setting; and,

c3) A unifying analysis framework is developed to reveal that the regrets of DEXP3 and DBGD are $\mathcal{O}\big(\sqrt{K\bar{d}(T+D)}\big)$ and $\mathcal{O}\big(\sqrt{K(T+D)}\big)$, respectively, where $\bar{d}$ is the maximum delay and $D$ denotes the delay accumulated over $T$ slots. Numerical tests validate the efficiency of DEXP3 and DBGD.

Notational conventions. Bold lowercase letters denote column vectors; $\mathbb{E}[\,\cdot\,]$ represents expectation; $\mathbf{1}(\,\cdot\,)$ denotes the indicator function; $(\,\cdot\,)^\top$ stands for vector transposition; and $\|\mathbf{x}\|$ denotes the $\ell_2$-norm of a vector $\mathbf{x}$.

2 Problem statements

Before introducing the delayed bandit learning settings, we first revisit the standard non-stochastic MAB and BCO.

2.1 MAB and BCO

Non-stochastic MAB. Consider the MAB problem with a total of $K$ arms (a.k.a. actions) [3, 5]. At the beginning of slot $t$, without knowing the loss of any arm, the learner selects an arm $a_t$ following a distribution $\mathbf{p}_t \in \Delta_K$ over all arms, where the probability simplex is defined as $\Delta_K := \{\mathbf{p} \in \mathbb{R}^K : p(k) \geq 0, \forall k;\ \sum_{k=1}^K p(k) = 1\}$. The loss $l_t(a_t)$ incurred by the selection of $a_t$ is an entry of the $K \times 1$ loss vector $\mathbf{l}_t$, and it is observed by the learner. Along with previously observed losses $\{l_s(a_s)\}_{s=1}^{t}$, it then becomes possible to find $\mathbf{p}_{t+1}$; see also Fig. 1 (a1).

1. Anonymous feedback in MAB means that the learner observes the loss without knowing which arm it is associated with.



Figure 1: A single slot structure of: (a1) standard MAB; (a2) non-stochastic MAB with delayed feedback; (b1) standard BCO; (b2) BCO with delayed feedback, where $\mathbf{p}_{t+1}|\{l_s(a_s)\}_{s=1}^t$ means that $\mathbf{p}_{t+1}$ is updated based on previously observed losses $\{l_s(a_s)\}_{s=1}^t$.

The goal is to minimize the regret, which is the difference between the expected cumulative loss of the learner relative to the loss of the best fixed policy in hindsight, given by

$$\mathrm{Reg}_T^{\mathrm{MAB}} := \sum_{t=1}^{T}\mathbb{E}\big[\mathbf{p}_t^\top\mathbf{l}_t\big] - \sum_{t=1}^{T}(\mathbf{p}^*)^\top\mathbf{l}_t \tag{1}$$

where the expectation is taken w.r.t. the possible randomness of $\mathbf{p}_t$ induced by the selection of $\{a_s\}_{s=1}^{t-1}$, while the best fixed policy $\mathbf{p}^*$ is

$$\mathbf{p}^* := \arg\min_{\mathbf{p} \in \Delta_K}\sum_{t=1}^{T}\mathbf{p}^\top\mathbf{l}_t. \tag{2}$$

Specifically, if $\mathbf{p}^* = [0, \ldots, 1, \ldots, 0]^\top$, the regret is relative to the corresponding best fixed arm in hindsight.

BCO. Consider now the BCO setup with $M$-point feedback [1]. At the beginning of slot $t$, without knowing the loss, the learner selects $\mathbf{x}_t \in \mathcal{X}$, where $\mathcal{X} \subset \mathbb{R}^K$ is a compact and convex set. Being able to query the function values at another $M-1$ points $\{\mathbf{x}_{t,k} \in \mathcal{X}\}_{k=1}^{M-1}$ and with $\mathbf{x}_{t,0} := \mathbf{x}_t$, the loss values at $\{\mathbf{x}_{t,k}\}_{k=0}^{M-1}$, that is, $\{f_t(\mathbf{x}_{t,k})\}_{k=0}^{M-1}$, are observed instead of the function $f_t(\cdot)$. The learner leverages the revealed losses to decide the next action $\mathbf{x}_{t+1}$; see also Fig. 1 (b1). The learner's goal is to find a sequence $\big\{\{\mathbf{x}_{t,k}\}_{k=0}^{M-1}\big\}_{t=1}^{T}$ to minimize the regret relative to the best fixed action in hindsight, meaning2

$$\mathrm{Reg}_T^{\mathrm{BCO}} := \sum_{t=1}^{T}\mathbb{E}\big[f_t(\mathbf{x}_t)\big] - \sum_{t=1}^{T}f_t(\mathbf{x}^*) \tag{3}$$

where the expectation is taken over the sequence of random actions $\{\mathbf{x}_s\}_{s=1}^{t-1}$. The best fixed action $\mathbf{x}^*$ in hindsight is

$$\mathbf{x}^* := \arg\min_{\mathbf{x} \in \mathcal{X}}\sum_{t=1}^{T}f_t(\mathbf{x}). \tag{4}$$

2. This definition is slightly different from that in [1]. However, we will show in Section 5.2 that the regret bound is not affected.



In both MAB and BCO settings, an online algorithm is desirable when its regret is sublinear w.r.t. the time horizon $T$, i.e., $\mathrm{Reg}_T^{\mathrm{MAB}} = o(T)$ and $\mathrm{Reg}_T^{\mathrm{BCO}} = o(T)$ [5, 15].

2.2 Delayed MAB and BCO

MAB with unknown delays. In delayed MAB, the learner still chooses an arm $a_t \sim \mathbf{p}_t$ at the beginning of slot $t$. However, the loss $l_t(a_t)$ is observed after $d_t$ slots, namely, at the end of slot $t + d_t$, where the delay $d_t \geq 0$ can vary from slot to slot. In this paper, we assume that $\{d_t\}_{t=1}^T$ can be chosen adversarially by nature. Let $l_{s|t}(a_{s|t})$ denote the loss incurred by the selected arm $a_s$ in slot $s$ but observed at slot $t$; i.e., the learner receives the losses collected in $\mathcal{L}_t = \{l_{s|t}(a_{s|t}),\ s : s + d_s = t\}$ at the end of slot $t$. Note that it is possible to have $\mathcal{L}_t = \emptyset$ in certain slots. Moreover, the order of feedback can be arbitrary, meaning it is possible to have $t_1 + d_{t_1} \geq t_2 + d_{t_2}$ when $t_1 \leq t_2$. In contrast to [16] however, we consider the case where the delay $d_t$ is not accessible, i.e., the learner just observes the value of $l_{s|t}(a_{s|t})$, but not $s$. The learner's goal is to select $\{\mathbf{p}_t\}_{t=1}^T$ "on-the-fly" to minimize the regret defined in (1). Note that in the presence of delays, the information available to decide $\mathbf{p}_t$ is even less than in the standard MAB. Specifically, the available information for the learner to decide $\mathbf{p}_t$ is collected in the set $\mathcal{L}_{1:t-1} := \bigcup_{s=1}^{t-1}\mathcal{L}_s$; see also Fig. 1 (a2).

For simplicity, we assume that all feedback information is received by the end of slot $T$. This assumption does not lose generality since feedback arriving at the end of slot $T$ cannot aid the arm selection, hence the final performance of the learner will not be affected.
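To make the feedback protocol concrete, the following is a minimal Python sketch (ours; the class and method names are illustrative, not from the paper) of a delayed-feedback simulator: the learner receives (arm, loss) pairs without learning the originating slot $s$ or the delay $d_s$.

```python
from collections import defaultdict

class DelayedMABFeedback:
    """Sketch of the delayed MAB protocol: the loss of the arm played
    at slot s is revealed at the end of slot s + d_s; the learner sees
    (arm, loss) pairs but neither s nor d_s."""

    def __init__(self, losses, delays):
        self.losses = losses              # losses[t][k]: loss of arm k at slot t
        self.delays = delays              # delays[t]: delay d_t of slot t
        self.pending = defaultdict(list)  # arrival slot -> list of (arm, loss)

    def play(self, t, arm):
        # Schedule the pair for arrival at the end of slot t + d_t.
        self.pending[t + self.delays[t]].append((arm, self.losses[t][arm]))

    def feedback(self, t):
        # Return L_t (possibly empty); the feedback order is arbitrary.
        return self.pending.pop(t, [])
```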

BCO with unknown delays. For delayed BCO, the learner still chooses $\mathbf{x}_t$ to play while querying $\{\mathbf{x}_{t,k}\}_{k=1}^{M-1}$ at the beginning of slot $t$. However, the loss as well as the querying responses $\{f_t(\mathbf{x}_{t,k})\}_{k=0}^{M-1}$ are observed at the end of slot $t + d_t$. Similarly, let $f_{s|t}(\mathbf{x}_{s|t})$ denote the loss incurred in slot $s$ but observed at slot $t$, so the feedback set at the end of slot $t$ is $\mathcal{L}_t = \big\{\{f_{s|t}(\mathbf{x}_{s|t,k})\}_{k=0}^{M-1},\ s : s + d_s = t\big\}$. To find a desirable $\mathbf{x}_t$, the learner relies on the history $\mathcal{L}_{1:t-1} := \bigcup_{s=1}^{t-1}\mathcal{L}_s$, with the goal of minimizing the regret in (3).

Why do existing algorithms fail with unknown delays? The algorithms for standard (non-delayed) MAB, such as EXP3, cannot be applied to delayed MAB with unknown delays. Recall that in settings without delay, to deal with the partially observed $\mathbf{l}_t$, EXP3 relies on an importance sampling type of loss estimates given by [3]

$$\hat{l}_t(k) = \frac{l_t(a_t)\,\mathbf{1}(a_t = k)}{p_t(k)}, \quad k = 1, \ldots, K. \tag{5}$$

The denominator as well as the indicator function in (5) ensure unbiasedness of $\hat{l}_t(k)$. Leveraging the estimated loss, the distribution $\mathbf{p}_{t+1}$ is obtained by

$$p_{t+1}(k) = \frac{p_t(k)\exp\big(-\eta\hat{l}_t(k)\big)}{\sum_{j=1}^{K}p_t(j)\exp\big(-\eta\hat{l}_t(j)\big)}, \quad \forall k \tag{6}$$

where $\eta$ is the learning rate. Consider now that the loss $l_{s|t}(a_{s|t})$ with delay $d_s$ is observed at $t = s + d_s$. To recover the unbiased estimator $\hat{l}_{s|t}(k)$ in (5), $p_s(k)$ must be known. However, since $d_s$ is not revealed, even if the learner can store the previous probability distributions, it is not clear how to obtain the loss estimator.
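As a point of reference, here is a compact Python sketch (ours) of the standard non-delayed EXP3 step, combining the importance-weighted estimate (5) with the exponential update (6):

```python
import numpy as np

def exp3_update(p, arm, loss, eta):
    """One non-delayed EXP3 step: loss estimate (5), then update (6)."""
    l_hat = np.zeros_like(p)
    l_hat[arm] = loss / p[arm]      # unbiased since E[1(a_t = k)] = p_t(k)
    w = p * np.exp(-eta * l_hat)    # exponential reweighting
    return w / w.sum()              # normalize back to the simplex
```

With an unknown delay, the `p[arm]` in the estimate would have to be the probability $p_s(a_s)$ of the unknown originating slot $s$, which is exactly what is unavailable.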

Knowing the delay is also instrumental when it comes to the gradient estimator in BCO. For non-delayed single-point feedback BCO [13], i.e., $M = 1$, since only one value of the loss instead of the full gradient is observed per slot, the idea is to draw $\mathbf{u}_t$ uniformly from the surface of a unit ball in $\mathbb{R}^K$, and form the gradient estimate as

$$\mathbf{g}_t = \frac{K}{\delta}f_t(\mathbf{x}_t + \delta\mathbf{u}_t)\mathbf{u}_t \tag{7}$$

where $\delta$ is a small constant. The next action is obtained using a standard online (projected) gradient descent iteration leveraging the estimated gradient, that is,

$$\mathbf{x}_{t+1} = \Pi_{\mathcal{X}_\delta}\big[\mathbf{x}_t - \eta\mathbf{g}_t\big] \tag{8}$$



Algorithm 1 DEXP3
1: Initialize: $p_1(k) = 1/K, \forall k$.
2: for $t = 1, 2, \ldots, T$ do
3:   Select an arm $a_t \sim \mathbf{p}_t$.
4:   Observe the feedback collected in set $\mathcal{L}_t$.
5:   for $n = 1, 2, \ldots, |\mathcal{L}_t|$ do
6:     Estimate $\hat{\mathbf{l}}_{s_n|t}$ via (9) if $l_{s_n|t}(a_{s_n|t}) \in \mathcal{L}_t$.
7:     Update $\mathbf{p}_t^n$ via (11)-(13).
8:   end for
9:   Obtain $\mathbf{p}_{t+1}$ via (14).
10: end for

Figure 2: An example of DEXP3 with $\mathcal{L}_t$ including the losses incurred in slots $s_1$, $s_2$, and $s_3$, and $\mathcal{L}_{t+1} = \emptyset$.

where $\mathcal{X}_\delta$ is the shrunk feasible set that ensures $\mathbf{x}_t + \delta\mathbf{u}_t$ is feasible. While $\mathbf{g}_t$ serves as a nearly unbiased estimator of $\nabla f_t(\mathbf{x}_t)$, the unknown delay introduces a mismatch between the feedback $f_{s|t}(\mathbf{x}_{s|t} + \delta\mathbf{u}_{s|t})$ and $\mathbf{g}_{s|t}$. Specifically, given the feedback $f_{s|t}(\mathbf{x}_{s|t} + \delta\mathbf{u}_{s|t})$, since $d_s$ is unknown, the learner does not know $\mathbf{u}_{s|t}$, and hence cannot obtain $\mathbf{g}_{s|t}$ in (7). Similar arguments also hold for BCO with multi-point feedback.
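For concreteness, the following Python sketch (ours) of the one-point estimate (7) makes the obstruction visible: it presumes the perturbation $\mathbf{u}_t$ that generated the observed loss is known, which is precisely what unknown delays break.

```python
import numpy as np

def sample_unit_sphere(K, rng):
    # Draw u uniformly from the surface of the unit ball in R^K.
    u = rng.standard_normal(K)
    return u / np.linalg.norm(u)

def one_point_gradient(loss_value, u, K, delta):
    """Gradient estimate (7): g_t = (K/delta) * f_t(x_t + delta*u_t) * u_t.
    With delayed feedback of unknown delay, matching loss_value to the
    perturbation u that produced it is impossible."""
    return (K / delta) * loss_value * u
```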

Therefore, performing delayed bandit learning with unknown delays is challenging, and has not been explored.

3 DEXP3 for Delayed MAB

We start with the non-stochastic MAB setup, which is randomized in nature because an arm $a_t$ is chosen randomly per slot according to a $K \times 1$ probability mass vector $\mathbf{p}_t$. In this section, we show that for the MAB problem, so long as the (unknown) delay is bounded, based only on single-point feedback, the randomized algorithm that we term delayed EXP3 (DEXP3) can cope with unknown delays in MAB through a biased loss estimator, and is guaranteed to attain a desirable regret.

Recall that the feedback at slot $t$ includes losses incurred at slots $s_n$, $n = 1, 2, \ldots, |\mathcal{L}_t|$, where $\mathcal{L}_t := \{l_{s_n|t}(a_{s_n|t}) : \forall s_n = t - d_{s_n}\}$. Once $\mathcal{L}_t$ is revealed, the learner estimates $\mathbf{l}_{s_n|t}$ by scaling the observed loss according to $\mathbf{p}_t$ at the current slot. For each $l_{s_n|t}(a_{s_n|t}) \in \mathcal{L}_t$, the estimator of the loss vector $\mathbf{l}_{s_n|t}$ is

$$\hat{l}_{s_n|t}(k) = \frac{l_{s_n|t}(k)\,\mathbf{1}\big(a_{s_n|t} = k\big)}{p_t(k)}, \quad \forall k. \tag{9}$$

It is worth mentioning that the index $s_n$ in (9) is only used for the analysis; during the implementation, there is no need to know $s_n$. In contrast to EXP3 [3] and its variants for delayed MAB with known delays [7, 16], our estimator of $l_{s_n|t}(k)$ in (9) turns out to be biased, since $a_{s_n|t}$ is chosen according to $\mathbf{p}_{s_n}$ and not according to $\mathbf{p}_t$; that is,

$$\mathbb{E}_{a_{s_n|t}}\big[\hat{l}_{s_n|t}(k)\big] = l_{s_n|t}(k)\frac{p_{s_n}(k)}{p_t(k)} \neq l_{s_n|t}(k). \tag{10}$$

Since $\mathcal{L}_t$ may contain multiple rounds of feedback, leveraging each $\hat{\mathbf{l}}_{s_n|t}$, the learner must update $|\mathcal{L}_t|$ times to obtain $\mathbf{p}_{t+1}$. Intuitively, to upper bound the bias in (10), an upper bound on $p_{s_n}(k)/p_t(k)$ is required, which



amounts to a lower bound on $p_t(k)$. On the other hand however, the lower bound on $p_t(k)$ cannot be too large, to avoid incurring extra regret. Different from EXP3, our DEXP3 ensures a lower bound on $p_t(k)$ by introducing an intermediate weight vector $\tilde{\mathbf{w}}_t$ to evaluate the historical performance of each arm. Let $n$ denote the index of the inner-loop update at slot $t$, starting from $\mathbf{p}_t^0 := \mathbf{p}_t$. For each $l_{s_n|t}(a_{s_n|t}) \in \mathcal{L}_t$, the learner first updates $\tilde{\mathbf{w}}_t^n$ by using the estimated loss $\hat{\mathbf{l}}_{s_n|t}$ as

$$\tilde{w}_t^n(k) = p_t^{n-1}(k)\exp\Big(-\eta\min\big\{\delta_1, \hat{l}_{s_n|t}(k)\big\}\Big), \quad \forall k \tag{11}$$

where $\eta$ is the learning rate, and $\delta_1$ serves as an upper bound on $\hat{l}_{s_n|t}(k)$ to control the bias of $\hat{l}_{s_n|t}(k)$. However, to confine the extra regret incurred by introducing $\delta_1$, a carefully selected $\delta_1$ should ensure that the probability of having $\hat{l}_{s_n|t}(k)$ larger than $\delta_1$ is small enough. Then the learner finds $\mathbf{w}_t^n$ by a trimmed normalization as

$$w_t^n(k) = \max\left\{\frac{\tilde{w}_t^n(k)}{\sum_{j=1}^{K}\tilde{w}_t^n(j)}, \frac{\delta_2}{K}\right\}, \quad \forall k. \tag{12}$$

Update (12) ensures that $w_t^n(k)$ is lower bounded by $\delta_2/K$. Finally, the learner normalizes $\mathbf{w}_t^n$ to obtain $\mathbf{p}_t^n$ as

$$p_t^n(k) = \frac{w_t^n(k)}{\sum_{j=1}^{K}w_t^n(j)}, \quad \forall k. \tag{13}$$

It can be shown that $p_t^n(k)$ is lower bounded by $p_t^n(k) \geq \frac{\delta_2}{K(1+\delta_2)}$ [cf. (35) in the supplementary material]. After all the elements of $\mathcal{L}_t$ have been used, the learner finds $\mathbf{p}_{t+1}$ via

$$\mathbf{p}_{t+1} = \mathbf{p}_t^{|\mathcal{L}_t|}. \tag{14}$$

Furthermore, if $\mathcal{L}_t = \emptyset$, the learner directly reuses the previous distribution, i.e., $\mathbf{p}_{t+1} = \mathbf{p}_t$, and chooses an arm accordingly. In a nutshell, DEXP3 is summarized in Alg. 1. As will be shown in Sec. 5.2, if the delay $d_t$ is bounded by a constant $\bar{d}$, DEXP3 can guarantee a regret of $\mathcal{O}\big(\sqrt{K\bar{d}(T+D)}\big)$, where $D = \sum_{t=1}^{T}d_t$ is the overall delay.
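The per-slot logic of Alg. 1 compresses to a few lines; the following is a minimal Python sketch (ours, with illustrative names) of one DEXP3 slot following (9) and (11)-(14):

```python
import numpy as np

def dexp3_slot(p, feedback, eta, delta1, delta2):
    """One DEXP3 slot: one inner update (11)-(13) per (arm, loss)
    pair in L_t, returning p_{t+1} as in (14)."""
    K = len(p)
    for arm, loss in feedback:                                # n = 1, ..., |L_t|
        l_hat = np.zeros(K)
        l_hat[arm] = loss / p[arm]                            # biased estimator (9)
        w_tilde = p * np.exp(-eta * np.minimum(delta1, l_hat))  # (11)
        w = np.maximum(w_tilde / w_tilde.sum(), delta2 / K)     # (12)
        p = w / w.sum()                                         # (13)
    return p                          # (14); p is unchanged when L_t is empty
```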

Remark 1. The recent composite loss wrapper algorithm (abbreviated as CLW) in [6] can also be applied to the delayed MAB problem. However, CLW is designed for a more general setting with composite and anonymous feedback, and its efficiency drops when the previous action $a_{s|t}$ is known. The main differences between DEXP3 and CLW are: i) the loss estimators are different; and, ii) DEXP3 updates $\mathbf{p}_t$ in every slot, while CLW updates occur only every $\mathcal{O}(2\bar{d})$ slots (thus requiring a larger learning rate). As will be corroborated by simulations, DEXP3 outperforms CLW in the considered setting.

4 DBGD for Delayed BCO

In this section, we develop an algorithm that we term delayed bandit gradient descent (DBGD), based on a deterministic approximant of the loss obtained using $M = K+1$ rounds of feedback. DBGD enjoys an $\mathcal{O}(\sqrt{T+D})$ regret for BCO problems even when the delays are unknown. In practice, $(K+1)$-point feedback can be obtained i) when it is possible to evaluate the loss function easily; and ii) when the slot duration is long, meaning that the algorithm has enough time to query multiple points from the oracle [27].

The intuition behind our deterministic approximation originates from the definition of the gradient [1]. Consider for example $\mathbf{x} \in \mathbb{R}^2$, and the gradient $\nabla f(\mathbf{x}) = [\nabla_1, \nabla_2]^\top$, where

$$\nabla_1 = \lim_{\delta\to 0}\frac{f(\mathbf{x}+\delta\mathbf{e}_1)-f(\mathbf{x})}{\delta}; \quad \nabla_2 = \lim_{\delta\to 0}\frac{f(\mathbf{x}+\delta\mathbf{e}_2)-f(\mathbf{x})}{\delta}. \tag{15}$$



Algorithm 2 DBGD
1: Initialize: $\mathbf{x}_1 = \mathbf{0}$.
2: for $t = 1, 2, \ldots, T$ do
3:   Play $\mathbf{x}_t$; also query $\mathbf{x}_t + \delta\mathbf{e}_k$, $k = 1, \ldots, K$.
4:   Observe the feedback collected in set $\mathcal{L}_t$.
5:   if $\mathcal{L}_t = \emptyset$ then set $\mathbf{x}_{t+1} = \mathbf{x}_t$.
6:   else estimate the gradients $\mathbf{g}_{s_n|t}$ via (17) if $s_n + d_{s_n} = t$.
7:     Update $\mathbf{x}_{t+1}$ via (18).
8:   end if
9: end for

Figure 3: An example of DBGD with $\mathcal{L}_t$ including the losses incurred in slots $s_1$, $s_2$, and $s_3$, and $\mathcal{L}_{t+1} = \emptyset$.

Similarly, for a $K$-dimensional $\mathbf{x}$, if $K+1$ rounds of feedback are available, the gradient can be approximated as

$$\mathbf{g}_t = \frac{1}{\delta}\sum_{k=1}^{K}\big(f_t(\mathbf{x}_t + \delta\mathbf{e}_k) - f_t(\mathbf{x}_t)\big)\mathbf{e}_k \tag{16}$$

where $\mathbf{e}_k := [0, \ldots, 1, \ldots, 0]^\top$ denotes the unit vector with $k$-th entry equal to 1. Intuitively, a smaller $\delta$ improves the approximation accuracy. When $f_t$ is further assumed to be linear, $\mathbf{g}_t$ in (16) is unbiased. In this case, the gradient of $f_t$ can be recovered exactly, and the setup thus boils down to a delayed one with full information. However, if $f_t(\cdot)$ is generally convex, $\mathbf{g}_t$ in (16) is biased.

Leveraging the gradient estimate in (16), we are ready to introduce the DBGD algorithm. Per slot $t$, the learner plays $\mathbf{x}_t$ and also queries $f_t(\mathbf{x}_t + \delta\mathbf{e}_k)$, $\forall k \in \{1, \ldots, K\}$. However, to ensure that $\mathbf{x}_t + \delta\mathbf{e}_k$ is feasible, $\mathbf{x}_t$ should be confined to the set $\mathcal{X}_\delta = \{\mathbf{x} : \frac{\mathbf{x}}{1-\delta} \in \mathcal{X}\}$. Note that if $0 \leq \delta < 1$, $\mathcal{X}_\delta$ is still convex. Let $n = 1, 2, \ldots, |\mathcal{L}_t|$ index the inner-loop updates at slot $t$. At the end of slot $t$, the learner receives the observations $\mathcal{L}_t = \big\{\{f_{s_n|t}(\mathbf{x}_{s_n|t}), f_{s_n|t}(\mathbf{x}_{s_n|t} + \delta\mathbf{e}_k), k = 1, \ldots, K\},\ \forall s_n = t - d_{s_n}\big\}$. Per received feedback round, the learner approximates the gradient via (16); thus, for each $\{f_{s_n|t}(\mathbf{x}_{s_n|t}), f_{s_n|t}(\mathbf{x}_{s_n|t} + \delta\mathbf{e}_k), k = 1, \ldots, K\}$, we have

$$\mathbf{g}_{s_n|t} = \frac{1}{\delta}\sum_{k=1}^{K}\big(f_{s_n|t}(\mathbf{x}_{s_n|t} + \delta\mathbf{e}_k) - f_{s_n|t}(\mathbf{x}_{s_n|t})\big)\mathbf{e}_k. \tag{17}$$

With $\mathbf{g}_{s_n|t}$ and $\mathbf{x}_t^0 := \mathbf{x}_t$, the learner updates $|\mathcal{L}_t|$ times to obtain $\mathbf{x}_{t+1}$:

$$\mathbf{x}_t^n = \Pi_{\mathcal{X}_\delta}\big[\mathbf{x}_t^{n-1} - \eta\mathbf{g}_{s_n|t}\big], \quad n = 1, \ldots, |\mathcal{L}_t| \tag{18a}$$
$$\mathbf{x}_{t+1} = \mathbf{x}_t^{|\mathcal{L}_t|}. \tag{18b}$$

If no feedback is received at slot t, the learner simply sets xt+1 = xt. The DBGD is summarized in Algorithm 2.
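A minimal Python sketch (ours, with illustrative names) of one DBGD slot per (17)-(18), assuming a projection oracle onto the shrunk set $\mathcal{X}_\delta$:

```python
import numpy as np

def dbgd_slot(x, feedback, eta, delta, project):
    """One DBGD slot: a (K+1)-point gradient estimate (17) per received
    feedback round, followed by a projected step (18a); `project`
    maps its argument onto X_delta."""
    for f_center, f_perturbed in feedback:   # f_perturbed[k] = f_s(x_s + delta*e_k)
        g = (np.asarray(f_perturbed) - f_center) / delta   # (17)
        x = project(x - eta * g)                           # (18a)
    return x                   # (18b); x is unchanged when L_t is empty
```

For the unit-ball feasible set used later in Section 6, $\mathcal{X}_\delta = \{\|\mathbf{x}\| \leq 1-\delta\}$, so the projection simply rescales: `project = lambda z: z if np.linalg.norm(z) <= 1 - delta else z * (1 - delta) / np.linalg.norm(z)`.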

5 A Unified Framework for Regret Analysis

In this section, we show that both DEXP3 and DBGD can guarantee an $\mathcal{O}(\sqrt{T+D})$ regret. Our analysis considerably broadens that in [17], which was originally developed for delayed online learning with full-information feedback.

Figure 4: An example of the mapping from real slots (solid line) to virtual slots (dotted line). At the end of slot $t$, the feedback is $\mathcal{L}_t = \{l_{s_1|t}(a_{s_1|t}), l_{s_2|t}(a_{s_2|t})\}$. "v.s." stands for virtual slot.

5.1 Mapping from Real to Virtual Slots

Analyzing the recursion involving consecutive variables ($\mathbf{p}_t$ and $\mathbf{p}_{t+1}$ in DEXP3, or $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$ in DBGD) is challenging since, different from standard settings, the number of feedback rounds varies over slots. We will bypass this variable feedback using the notion of a "virtual slot."

Over the real time horizon, there are in total $T$ virtual slots, where the $\tau$-th virtual slot is associated with the $\tau$-th loss value fed back. Recall that the feedback received at the end of slot $t$ is $\mathcal{L}_t$. With the overall feedback received until the end of slot $t-1$ denoted by $L_{t-1} := \sum_{v=1}^{t-1}|\mathcal{L}_v|$, the virtual slot $\tau$ corresponding to the first feedback value received at slot $t$ is $\tau = L_{t-1} + 1$. In what follows, we will use MAB as an example to elaborate on this mapping, but the BCO setting can be argued likewise.

When multiple rounds of feedback are received over a real slot $t$, DEXP3 updates $\mathbf{p}_t$ $|\mathcal{L}_t|$ times to obtain $\mathbf{p}_{t+1}$; see (11)-(13). Using the notion of virtual slots, these $|\mathcal{L}_t|$ updates are performed over $|\mathcal{L}_t|$ consecutive virtual slots. Taking Fig. 4 as an example, when $\mathbf{p}_t^1$ is obtained by using an estimated loss $\hat{\mathbf{l}}_{s_1|t}$ [cf. (9)] and (11)-(13), this update is mapped to a virtual slot $\tau = L_{t-1} + 1$, where $\hat{\mathbf{l}}_\tau := \hat{\mathbf{l}}_{s_1|t}$ is adopted to obtain $\mathbf{p}_{\tau+1} := \mathbf{p}_t^1$. Similarly, when $\mathbf{p}_t^2$ is obtained using $\hat{\mathbf{l}}_{s_2|t}$, the virtual slot yields $\mathbf{p}_{\tau+2} := \mathbf{p}_t^2$ via $\hat{\mathbf{l}}_{\tau+1} := \hat{\mathbf{l}}_{s_2|t}$. That is to say, at real slot $t$, for $n = 1, \ldots, |\mathcal{L}_t|$, each update from $\mathbf{p}_t^{n-1}$ to $\mathbf{p}_t^n$ using the estimated loss $\hat{\mathbf{l}}_{s_n|t}$ is mapped to an "update" at the virtual slot $\tau + n - 1$, where $\hat{\mathbf{l}}_{\tau+n-1} := \hat{\mathbf{l}}_{s_n|t}$ is employed to obtain $\mathbf{p}_{\tau+n} := \mathbf{p}_t^n$ from $\mathbf{p}_{\tau+n-1} = \mathbf{p}_t^{n-1}$. According to this real-to-virtual slot mapping, we have $\mathbf{p}_{\tau+|\mathcal{L}_t|} = \mathbf{p}_{t+1}$; see also Fig. 4 for two examples. As we will show later, it is convenient to analyze the recursion between two consecutive $\mathbf{p}_\tau$ and $\mathbf{p}_{\tau+1}$, which is the key for the ensuing regret analysis.

With regard to DBGD, since multiple feedback rounds are possible per real slot $t$, we again map the $|\mathcal{L}_t|$ updates at a real slot [cf. (18)] to $|\mathcal{L}_t|$ virtual slots. The mapping is exactly the same as that in DEXP3; that is, per virtual slot $\tau$, $\mathbf{g}_\tau$ is used to obtain $\mathbf{x}_{\tau+1}$.

From this real-to-virtual mapping vantage point, DEXP3 and DBGD can be viewed as (inexact) EXP3 and BGD running on the virtual time horizon with only one feedback value per virtual slot. That is to say, instead of analyzing regret on the real time horizon, which can involve multiple feedback rounds per slot, we can alternatively turn to the virtual slots, where there is only one "update" per slot.
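The mapping itself is pure bookkeeping; a short Python sketch (ours, hypothetical names) makes it explicit:

```python
def to_virtual_slots(feedback_sets):
    """Flatten the per-slot feedback sets {L_t} into the virtual time
    line: virtual slot tau handles the tau-th feedback value overall."""
    virtual = []                  # virtual[tau - 1] = (real slot t, feedback item)
    for t, L_t in enumerate(feedback_sets, start=1):
        for item in L_t:          # |L_t| updates at real slot t become ...
            virtual.append((t, item))  # ... |L_t| consecutive virtual slots
    return virtual
```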



5.2 Regret Analysis of DEXP3

We now turn to the regret analysis of DEXP3. The analysis builds on the following assumptions.

Assumption 1. The losses satisfy $\max_{t,k} l_t(k) \leq 1$.

Assumption 2. The delay $d_t$ is bounded, i.e., $\max_t d_t \leq \bar{d}$.

Assumption 1 requires the loss function to be upper bounded, which is common in MAB; see also [3, 5, 15]. Assumption 2 asks for the delay to be bounded, which also appears in previous analyses of the delayed online learning setup [6, 7, 16]. It is also assumed that $\bar{d} > 0$, since otherwise DEXP3 automatically boils down to EXP3. Let us first consider the change in $p_\tau(k)$ after one "update" in the virtual slot.

Lemma 1. If the parameters are properly selected such that $1 - \delta_2 - \eta\delta_1 \geq 0$, the following inequality holds:

$$\frac{p_{\tau-1}(k)}{p_\tau(k)} \leq \frac{1}{1 - \delta_2 - \eta\delta_1}, \quad \forall k, \tau. \tag{19}$$

Proof. See Sec. B.1 of the supplementary document.

Lemma 2. If the parameters are properly selected such that $1 - \eta\delta_1 \geq 0$, the following inequality holds:

$$\frac{p_\tau(k)}{p_{\tau-1}(k)} \leq \max\left\{1 + \delta_2, \frac{1}{1 - \eta\delta_1}\right\}, \quad \forall k, \tau. \tag{20}$$

Proof. See Sec. B.2 of the supplementary document.

Lemmas 1 and 2 assert that both $p_{\tau-1}(k)/p_\tau(k)$ and $p_\tau(k)/p_{\tau-1}(k)$ are bounded deterministically, that is, regardless of the arm selection and the observed loss. These bounds are critical for deriving the regret.

To bound the regret, the final cornerstone is the “regret” in virtual slots, specified in the following lemma.

Lemma 3. For a given sequence $\{\hat{\mathbf{l}}_\tau\}_{\tau=1}^T$, the following relation holds:

$$\sum_{\tau=1}^{T}\big(\mathbf{p}_\tau - \mathbf{p}\big)^\top\min\big\{\hat{\mathbf{l}}_\tau, \delta_1\cdot\mathbf{1}\big\} \leq \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^{T}\sum_{k=1}^{K}p_\tau(k)\big[\hat{l}_\tau(k)\big]^2 \tag{21}$$

where $\mathbf{1}$ is a $K \times 1$ vector of all ones, and $\mathbf{p} \in \Delta_K$.

Proof. See Sec. B.3 of the supplementary document.

Leveraging Lemma 3, the regret of DEXP3 follows.

Theorem 1. Suppose Assumptions 1 and 2 hold, and define the overall delay $D := \sum_{t=1}^{T}d_t$. Choosing $\delta_2 = \frac{1}{T+D}$, $\eta = \mathcal{O}\Big(\sqrt{\frac{1+\ln K}{\bar{d}K(T+D)}}\Big)$, and $\delta_1 = \frac{1}{2\eta\bar{d}} - \frac{\delta_2}{\eta}$, DEXP3 guarantees that $\mathrm{Reg}_T^{\mathrm{MAB}}$ in (1) satisfies

$$\mathrm{Reg}_T^{\mathrm{MAB}} = \mathcal{O}\Big(\sqrt{(T+D)\bar{d}K(1+\ln K)}\Big). \tag{22}$$

Proof. See Sec. B.4 of the supplementary document.

Theorem 1 indicates that DEXP3 tends to perform well when $D$ is small, for example when $D = o(T)$. This can happen when the delay is sparse, that is, when $d_t = 0$ for most slots.
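To make the delay dependence explicit, the following back-of-the-envelope instance (ours, not in the paper) specializes (22) to the opposite extreme, where every slot experiences the maximum delay: if $d_t = \bar{d}$ for all $t$, then $D = \bar{d}T$, and (22) gives

$$\mathrm{Reg}_T^{\mathrm{MAB}} = \mathcal{O}\Big(\sqrt{(1+\bar{d})T\,\bar{d}K(1+\ln K)}\Big) = \mathcal{O}\Big(\bar{d}\sqrt{TK(1+\ln K)}\Big),$$

which remains sublinear in $T$ whenever $\bar{d} = o(\sqrt{T})$.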



5.3 Regret Analysis of DBGD

Our analysis builds on the following assumptions.

Assumption 3. For any t, the loss function ft is convex.

Assumption 4. For any t, ft is L-Lipschitz and β-smooth.

Assumption 5. The feasible set contains $\varepsilon\mathcal{B}$, where $\mathcal{B}$ is the unit ball and $\varepsilon > 0$ is a predefined parameter. The diameter of $\mathcal{X}$ is $R$; that is, $\max_{\mathbf{x},\mathbf{y}\in\mathcal{X}}\|\mathbf{x} - \mathbf{y}\| = R$.

Assumptions 3-5 are common in online learning [15]. Assumption 4 requires that $f_t(\cdot)$ is $L$-Lipschitz and $\beta$-smooth, which is needed to bound the bias of the estimator $\mathbf{g}_{s|t}$ [1]. Assumption 5 is also typical in BCO [1, 12, 13, 15]. In addition, the counterpart of Assumption 1 in BCO readily follows from Assumptions 4 and 5.

To start, the quality of the gradient estimator $\mathbf{g}_{s|t}$ is first evaluated. As stated, when $f_t(\cdot)$ is not linear, the estimator $\mathbf{g}_{s|t}$ is biased, but its bias is bounded.

Lemma 4. If Assumption 4 holds, then for every $f_{s|t}(\mathbf{x}_{s|t})$, the corresponding estimator (16) satisfies

$$\|\mathbf{g}_{s|t}\| \leq \sqrt{K}L, \quad \text{and} \quad \|\mathbf{g}_{s|t} - \nabla f_{s|t}(\mathbf{x}_{s|t})\| \leq \frac{\beta\delta}{2}\sqrt{K}. \tag{23}$$

Proof. See Sec. C.1 of the supplementary document.

Lemma 4 suggests that with $\delta$ small enough, the bias of $\mathbf{g}_{s|t}$ will not be too large. The following lemma then shows the relation among $\mathbf{g}_\tau$, $\mathbf{x}_\tau$, and $\mathbf{x}_{\tau+1}$ in a virtual time slot.

Lemma 5. Under Assumptions 4 and 5, the update in a virtual slot $\tau$ guarantees that for any $\mathbf{x} \in \mathcal{X}_\delta$, we have

$$\mathbf{g}_\tau^\top\big(\mathbf{x}_\tau - \mathbf{x}\big) \leq \frac{\eta}{2}KL^2 + \frac{\|\mathbf{x}_\tau - \mathbf{x}\|^2 - \|\mathbf{x}_{\tau+1} - \mathbf{x}\|^2}{2\eta}. \tag{24}$$

Proof. See Sec. C.2 of the supplementary document.

Lemma 5 is the counterpart of the gradient descent estimate in the non-delayed and full-information setting [15, Theorem 3.1], which demonstrates that DBGD is BGD running on the virtual slots. Finally, leveraging these results, the regret bound follows next.

Theorem 2. Suppose Assumptions 3-5 hold. Choosing $\delta = \mathcal{O}\big((T+D)^{-1}\big)$ and $\eta = \mathcal{O}\big((T+D)^{-1/2}\big)$, DBGD guarantees that the regret is bounded; that is,

$$\mathrm{Reg}_T^{\mathrm{BCO}} = \sum_{t=1}^{T}f_t(\mathbf{x}_t) - \sum_{t=1}^{T}f_t(\mathbf{x}^*) = \mathcal{O}\big(\sqrt{T+D}\big) \tag{25}$$

where $D := \sum_{t=1}^{T}d_t$ is the overall delay.

Proof. See Sec. C.3 of the supplementary document.

For the slightly different regret definition in [1], DBGD achieves the same regret bound.

Corollary 1. Upon defining $\mathbf{x}_{t,0} := \mathbf{x}_t$ and $\mathbf{x}_{t,k} := \mathbf{x}_t + \delta\mathbf{e}_k$, and choosing $\delta = \mathcal{O}\big((T+D)^{-1}\big)$ and $\eta = \mathcal{O}\big((T+D)^{-1/2}\big)$, DBGD also guarantees that

$$\frac{1}{K+1}\sum_{t=1}^{T}\sum_{k=0}^{K}f_t(\mathbf{x}_{t,k}) - \sum_{t=1}^{T}f_t(\mathbf{x}^*) = \mathcal{O}\big(\sqrt{T+D}\big). \tag{26}$$

Proof. See Sec. C.4 of the supplementary document.

The $\mathcal{O}(\sqrt{T+D})$ regret in Theorem 2 and Corollary 1 recovers the bound of delayed online learning in the full-information setup [17, 18, 25] with only bandit feedback.



Figure 5: (a) Regret of DEXP3 using synthetic data; (b) Regret of DEXP3 using real data. (Both panels plot the regret, normalized by $T$, versus the time slot for EXP3, BOLD, DEXP3, and CLW.)

Figure 6: (a) Regret of DBGD using synthetic data; (b) Regret of DBGD using real data. (Both panels plot the regret, normalized by $T$, versus the time slot for OGD, SOLID, (K+1)-BCO, DBGD, and CLW-BCO.)

6 Numerical tests

In this section, experiments are conducted to corroborate the validity of the novel DEXP3 and DBGD schemes. In the synthetic data tests, we consider $T = 2{,}000$ slots. Delays are generated periodically with the repeating pattern $1, 2, 1, 0, 3, 0, 2$, with the delays of the last few slots slightly modified to ensure that all feedback arrives by the end of slot $T = 2{,}000$, resulting in the overall delay $D = 2{,}569$.

DEXP3 synthetic tests. Consider $K = 5$ arms, and losses generated with a sudden change. Specifically, for $t \in [1, 500]$, we have $l_t(k) = 0.4k|\cos t|$ per arm $k$, while for the remaining slots, $l_t(k) = 0.2k|\sin(2t)|$. To benchmark the novel DEXP3, we use: i) the standard EXP3 for non-delayed MAB [3]; ii) BOLD for delayed MAB with known delay [16]; and, iii) CLW, designed for the more difficult setting in [6]. The instantaneous accumulated regret (normalized by $T$) versus the time slot is plotted in Fig. 5 (a). The gap between BOLD and EXP3 illustrates that even with a known delay, the learner suffers an extra regret. The small gap between DEXP3 and BOLD further demonstrates that the estimator bias [cf. (9)] causes a slightly larger regret, which is the price paid for the unknown delay. Compared with CLW, DEXP3 performs significantly better, since DEXP3 can leverage more information, namely the arm associated with each observed loss, than the anonymous feedback CLW is designed for.
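For reproducibility, here is a short Python sketch (ours) of the synthetic losses and periodic delays, as we read the description above:

```python
import numpy as np

def synthetic_mab_setup(T=2000, K=5):
    """Losses with a sudden change at t = 500, and delays following
    the periodic pattern 1, 2, 1, 0, 3, 0, 2."""
    t = np.arange(1, T + 1)
    k = np.arange(1, K + 1)[:, None]
    losses = np.where(t <= 500,
                      0.4 * k * np.abs(np.cos(t)),
                      0.2 * k * np.abs(np.sin(2 * t))).T   # shape (T, K)
    pattern = np.array([1, 2, 1, 0, 3, 0, 2])
    delays = pattern[(t - 1) % len(pattern)]
    # Clip the last few delays so all feedback arrives by slot T.
    delays = np.minimum(delays, T - t)
    return losses, delays
```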



DEXP3 real tests. We also tested DEXP3 using the Jester Online Joke Recommender System dataset [14], where $T = 24{,}983$ users rate $K = 100$ different jokes from 0 (not funny) to 1 (very funny). The goal is to recommend one joke per slot $t$ to amuse the users. The system performance is evaluated via $l_t = (1 - \text{the score of this joke})$. In this test, we assign a random score within the range $[0, 1]$ to the missing entries of the dataset. The delay is generated periodically as in the synthetic test, resulting in $D = 32{,}119$. Similar to the synthetic test, it can be observed in Fig. 5 (b) that DEXP3 incurs a slightly larger regret than BOLD due to the unknown delay, but outperforms the recently developed CLW.

DBGD synthetic tests. Consider $K = 5$, and the feasible set $\mathcal{X} \subset \mathbb{R}^5$ being the unit ball, i.e., $\mathcal{X} := \{\|\mathbf{x}\| \leq 1\}$. The loss function at slot $t$ is generated as $f_t(\mathbf{x}) = a_t\|\mathbf{x}\|^2 + \mathbf{b}_t^\top\mathbf{x}$, where $a_t = \cos(3t) + 3$, while $b_t(1) = 2\sin(2t) + 1$, $b_t(2) = \cos(2t) - 2$, $b_t(3) = \sin(2t)$, $b_t(4) = 2\sin(2t) - 2$, and $b_t(5) = 2$. To see the influence of the bandit feedback and the unknown delay, we consider the following benchmarks: i) the standard OGD [30] in the full-information and non-delayed setting; ii) the $(K+1)$-point feedback BCO [1] for non-delayed BCO; iii) SOLID for delayed full-information OCO [17]; and iv) CLW-BCO with the inner algorithm relying on $(K+1)$-point feedback BCO [6]. The instantaneous accumulated regret (normalized by $T$) versus the time slot is plotted in Fig. 6 (a). This test shows that DBGD performs almost as well as SOLID, and the gap between DBGD/SOLID and OGD/$(K+1)$-BCO is due to the delay. The regret of DBGD again significantly outperforms that of CLW-BCO, demonstrating the efficiency of DBGD.

DBGD real tests. To further illustrate the merits of DBGD, we conduct tests dealing with online regression applied to a yacht hydrodynamics dataset [10], which contains $T = 308$ data points with $K = 6$ features. Per slot $t$, the regressor $\mathbf{x}_t \in \mathbb{R}^6$ makes a prediction based on the feature $\mathbf{w}_t$ before its label $y_t$ is revealed. The loss function for slot $t$ is $f_t(\mathbf{x}_t) = \frac{1}{2}(y_t - \mathbf{x}_t^\top\mathbf{w}_t)^2$. The delay is generated periodically as before, and cumulatively it is $D = 394$. The instantaneous accumulated regret (normalized by $T$) versus the time slot is plotted in Fig. 6 (b). Again, DBGD outperforms CLW-BCO considerably. Comparing the regret performance of DBGD and SOLID, we can safely deduce the influence that the delay has on bandit feedback.

7 Conclusions

Bandit online learning with unknown delays, including non-stochastic MAB and BCO, was studied in this paper. Different from bandit online learning settings where the experienced delay is known, an unknown delay prevents forming the simple gradient estimates needed by the iterative algorithms. To address this issue, a biased loss estimator as well as a deterministic gradient estimator were developed for non-stochastic MAB and BCO, respectively. Leveraging the proposed estimators, the so-termed DEXP3 and DBGD algorithms were developed. The regrets of both DEXP3 and DBGD were established analytically. Numerical tests on synthetic and real datasets confirmed the performance gain of DEXP3 and DBGD relative to state-of-the-art approaches.



References

[1] A. Agarwal, O. Dekel, and L. Xiao, "Optimal algorithms for online convex optimization with multi-point bandit feedback," in Proc. Intl. Conf. on Learning Theory, 2010, pp. 28–40.

[2] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, 2011, pp. 873–881.

[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.

[4] B. Awerbuch and R. D. Kleinberg, "Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches," in Proc. ACM Symp. on Theory of Computing, Chicago, IL, Jun. 2004, pp. 45–53.

[5] S. Bubeck, N. Cesa-Bianchi et al., "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Found. and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

[6] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, "Nonstochastic bandits with composite anonymous feedback," in Proc. Conf. on Learning Theory, Stockholm, Sweden, 2018, pp. 750–773.

[7] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora, "Delay and cooperation in nonstochastic bandits," J. Machine Learning Res., vol. 49, pp. 605–622, 2016.

[8] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, 2011, pp. 2249–2257.

[9] T. Desautels, A. Krause, and J. W. Burdick, "Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization," J. Machine Learning Res., vol. 15, no. 1, pp. 3873–3923, 2014.

[10] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml

[11] J. Duchi, M. I. Jordan, and B. McMahan, "Estimation, optimization, and parallelism when data is sparse," in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, Nevada, 2013, pp. 2832–2840.

[12] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, "Optimal rates for zero-order convex optimization: The power of two function evaluations," IEEE Trans. Inform. Theory, vol. 61, no. 5, pp. 2788–2806, 2015.

[13] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, "Online convex optimization in the bandit setting: Gradient descent without a gradient," in Proc. of ACM-SIAM Symposium on Discrete Algorithms, Vancouver, Canada, 2005, pp. 385–394.

[14] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, "Eigentaste: A constant time collaborative filtering algorithm," Information Retrieval, vol. 4, no. 2, pp. 133–151, 2001. [Online]. Available: http://eigentaste.berkeley.edu/dataset/

[15] E. Hazan, "Introduction to online convex optimization," Found. and Trends® in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016.

[16] P. Joulani, A. Gyorgy, and C. Szepesvari, "Online learning under delayed feedback," in Proc. Intl. Conf. Machine Learning, Atlanta, 2013, pp. 1453–1461.

[17] P. Joulani, A. Gyorgy, and C. Szepesvari, "Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms," in Proc. of AAAI Conf. on Artificial Intelligence, vol. 16, Phoenix, Arizona, 2016, pp. 1744–1750.

[18] J. Langford, A. J. Smola, and M. Zinkevich, "Slow learners are fast," Proc. Advances in Neural Info. Process. Syst., pp. 2331–2339, 2009.

[19] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. of the 19th Intl. Conf. on World Wide Web. Raleigh, NC: ACM, 2010, pp. 661–670.

[20] B. McMahan and M. Streeter, "Delay-tolerant algorithms for asynchronous distributed online learning," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, 2014, pp. 2915–2923.

[21] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., "Communication-efficient learning of deep networks from decentralized data," in Proc. Intl. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, 2017, pp. 1273–1282.

[22] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.

[23] G. Neu, A. Antos, A. Gyorgy, and C. Szepesvari, "Online Markov decision processes under bandit feedback," in Proc. Advances in Neural Info. Process. Syst., Vancouver, Canada, 2010, pp. 1804–1812.

[24] C. Pike-Burke, S. Agrawal, C. Szepesvari, and S. Grunewalder, "Bandits with delayed anonymous feedback," arXiv preprint arXiv:1709.06853, 2017.

[25] K. Quanrud and D. Khashabi, "Online learning with adversarial delays," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, 2015, pp. 1270–1278.

[26] O. Shamir and L. Szlak, "Online learning with local permutations and delayed feedback," in Proc. Intl. Conf. Machine Learning, Sydney, Australia, 2017, pp. 3086–3094.

[27] T. S. Thune and Y. Seldin, "Adaptation to easy data in prediction with limited advice," arXiv preprint arXiv:1807.00636, 2018.

[28] C. Vernade, O. Cappe, and V. Perchet, "Stochastic bandit models for delayed conversions," in Proc. Conf. on Uncertainty in Artificial Intelligence, Sydney, Australia, 2017.

[29] M. J. Weinberger and E. Ordentlich, "On delayed prediction of individual sequences," IEEE Trans. Inform. Theory, vol. 48, no. 7, pp. 1959–1976, 2002.

[30] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proc. Intl. Conf. Machine Learning, Washington D.C., 2003, pp. 928–936.



Supplementary Document for “Bandit Online Learning with Unknown Delays”

A Real to virtual slot mapping

For the analysis, let $t(\tau)$ denote the real slot in which the real loss $\mathbf{l}_{t(\tau)}$ corresponding to $\hat{\mathbf{l}}_\tau$ was incurred, i.e., $\hat{\mathbf{l}}_\tau = \hat{\mathbf{l}}_{t(\tau)|t(\tau)+d_{t(\tau)}}$. Also define the auxiliary variable $s_\tau = \tau - 1 - L_{t(\tau)-1}$. See the example in Fig. 7 and Table 1.

Lemma 6. The following relations hold: i) $s_\tau \geq 0$, $\forall \tau$; ii) $\sum_{\tau=1}^{T}s_\tau = \sum_{t=1}^{T}d_t$; and, iii) if $\max_t d_t \leq \bar{d}$, we have $s_\tau \leq 2\bar{d}$, $\forall \tau$.

Proof. We first prove property i), $s_\tau \geq 0$, $\forall \tau$. Consider that at virtual slot $\tau$, the observed loss is $l_{t(\tau)}(a_{t(\tau)})$ with the corresponding $s_\tau = \tau - 1 - L_{t(\tau)-1}$. Suppose that $L_{t(\tau)-1} = m$, where $0 \leq m \leq t(\tau) - 1$ (by the definition of $L_{t(\tau)-1}$). The history $L_{t(\tau)-1} = m$ suggests that at the beginning of $t_1 = t(\tau)$, in total $m$ feedback values have been received. On the other hand, the loss $l_{t(\tau)}(a_{t(\tau)})$ is observed at the end of slot $t_2 = t(\tau) + d_{t(\tau)} \geq t_1$; thus at the beginning of $t_2$, there are at least $m$ observations. Hence we must have $\tau \geq m + 1$. Then by definition, $s_\tau \geq m + 1 - 1 - m = 0$.

Then, for property ii), $\sum_{\tau=1}^{T}s_\tau = \sum_{t=1}^{T}d_t$, the proof follows from the definition of $s_\tau$, i.e.,

$$\sum_{\tau=1}^{T}s_\tau = \sum_{\tau=1}^{T}\big(\tau - 1 - L_{t(\tau)-1}\big) = \sum_{t=1}^{T}(t-1) - \sum_{\tau=1}^{T}L_{t(\tau)-1} \stackrel{(a)}{=} \sum_{t=1}^{T}\big(t - 1 - L_{t-1}\big) \stackrel{(b)}{=} \sum_{t=1}^{T}d_t \tag{27}$$

where (a) is due to the fact that $\{t(\tau)\}_{\tau=1}^{T}$ is a permutation of $\{1, \ldots, T\}$; and (b) follows from the definition of $L_{t-1}$.

Finally, for property iii), notice that $L_{t(\tau)-1} \geq t(\tau) - 1 - \bar{d}$, which holds since at the beginning of $t = t(\tau)$, the losses of slots $t \leq t(\tau) - 1 - \bar{d}$ must have been received. Therefore, we have

$$s_\tau = \tau - 1 - L_{t(\tau)-1} \leq \tau - 1 - t(\tau) + 1 + \bar{d} \stackrel{(c)}{\leq} 2\bar{d} \tag{28}$$

where (c) follows since $l_{t(\tau)}(a_{t(\tau)})$ is observed at the end of $t = t(\tau) + d_{t(\tau)}$, and $L_{t(\tau)+d_{t(\tau)}-1}$ is at most $t(\tau) + d_{t(\tau)} - 2$ (since $l_{t(\tau)}(a_{t(\tau)})$ is not yet observed), so that $\tau$ is at most $t(\tau) + d_{t(\tau)}$, leading to $\tau - t(\tau) \leq d_{t(\tau)} \leq \bar{d}$.

Figure 7: An example of the mapping from real slots (solid line) to virtual slots (dotted line). The value of $t(\tau)$ is marked beside the corresponding yellow arrow; $T = 3$ with delays $d_1 = 2$, $d_2 = 0$, and $d_3 = 0$.

Table 1: The values of $t(\tau)$, $L_{t(\tau)-1}$, and $s_\tau$ in Fig. 7.

Virtual slot   | τ = 1 | τ = 2 | τ = 3
t(τ)           |   2   |   3   |   1
L_{t(τ)-1}     |   0   |   1   |   0
s_τ            |   0   |   0   |   2



B Proofs for DEXP3

Before diving into the proofs, we first establish some useful yet simple bounds on different parameters of DEXP3 (in virtual slots). In virtual slot $\tau$, the update is carried out the same as (11), (12), and (13), given by

$$\tilde{w}_{\tau+1}(k) = p_\tau(k)\exp\big[-\eta\min\{\delta_1, \hat{l}_\tau(k)\}\big], \quad \forall k, \tag{29}$$
$$w_{\tau+1}(k) = \max\left\{\frac{\tilde{w}_{\tau+1}(k)}{\sum_{j=1}^{K}\tilde{w}_{\tau+1}(j)}, \frac{\delta_2}{K}\right\}, \quad \forall k, \tag{30}$$
$$p_{\tau+1}(k) = \frac{w_{\tau+1}(k)}{\sum_{j=1}^{K}w_{\tau+1}(j)}, \quad \forall k. \tag{31}$$

Since $\hat{l}_\tau(k) \geq 0$, $\forall k, \tau$, we have

$$\sum_{j=1}^{K}\tilde{w}_\tau(j) \leq \sum_{j=1}^{K}p_{\tau-1}(j) = 1. \tag{32}$$

And $\sum_{k=1}^{K}w_\tau(k)$ is bounded by

$$\sum_{k=1}^{K}w_\tau(k) \geq \sum_{k=1}^{K}\frac{\tilde{w}_\tau(k)}{\sum_{j=1}^{K}\tilde{w}_\tau(j)} = 1; \tag{33}$$
$$\sum_{k=1}^{K}w_\tau(k) \leq \sum_{k=1}^{K}\frac{\tilde{w}_\tau(k)}{\sum_{j=1}^{K}\tilde{w}_\tau(j)} + \delta_2 = 1 + \delta_2. \tag{34}$$

Finally, $p_\tau(k)$ is bounded by

$$\frac{\delta_2}{K(1+\delta_2)} \leq \frac{w_\tau(k)}{1+\delta_2} \leq p_\tau(k) \leq w_\tau(k). \tag{35}$$

B.1 Proof of Lemma 1

Lemma 7. In consecutive virtual slots $\tau-1$ and $\tau$, the following inequality holds for any $k$:

$$p_{\tau-1}(k) - p_\tau(k) \leq p_{\tau-1}(k)\frac{\delta_2 + \eta\min\big\{\delta_1, \hat{l}_{\tau-1}(k)\big\}}{1+\delta_2}. \tag{36}$$

Proof. First, we have

$$p_\tau(k) \stackrel{(a)}{\geq} \frac{w_\tau(k)}{1+\delta_2} \geq \frac{\tilde{w}_\tau(k)}{\sum_{j=1}^{K}\tilde{w}_\tau(j)(1+\delta_2)} \stackrel{(b)}{\geq} \frac{\tilde{w}_\tau(k)}{1+\delta_2} = \frac{p_{\tau-1}(k)\exp\big[-\eta\min\{\delta_1, \hat{l}_{\tau-1}(k)\}\big]}{1+\delta_2} \tag{37}$$

where (a) is the result of (35); (b) is due to (32). Hence, we have

$$p_\tau(k) - p_{\tau-1}(k) \geq \frac{p_{\tau-1}(k)\exp\big[-\eta\min\{\delta_1, \hat{l}_{\tau-1}(k)\}\big]}{1+\delta_2} - p_{\tau-1}(k) \stackrel{(c)}{\geq} \frac{p_{\tau-1}(k)}{1+\delta_2}\big[1 - \eta\min\{\delta_1, \hat{l}_{\tau-1}(k)\}\big] - p_{\tau-1}(k) = -p_{\tau-1}(k)\frac{\delta_2 + \eta\min\{\delta_1, \hat{l}_{\tau-1}(k)\}}{1+\delta_2} \tag{38}$$

where (c) follows from $e^{-x} \geq 1 - x$; the proof is completed by multiplying both sides of (38) by $-1$.



From Lemma 7, we have

$$p_{\tau-1}(k) - p_\tau(k) \leq p_{\tau-1}(k)\frac{\delta_2 + \eta\min\big\{\delta_1, \hat{l}_{\tau-1}(k)\big\}}{1+\delta_2} \leq p_{\tau-1}(k)(\delta_2 + \eta\delta_1). \tag{39}$$

Hence, as long as $1 - \delta_2 - \eta\delta_1 \geq 0$, we can guarantee that (19) is satisfied.

B.2 Proof of Lemma 2

Lemma 8. The following inequality holds for any $\tau$ and any $k$:

$$p_\tau(k) - p_{\tau-1}(k) \leq p_\tau(k)\left[1 - I_\tau(k)\sum_{j=1}^{K}p_{\tau-1}(j)\Big(1 - \eta\min\big\{\delta_1, \hat{l}_{\tau-1}(j)\big\}\Big)\right] \tag{40}$$

where $I_\tau(k) := \mathbf{1}\big(w_\tau(k) > \frac{\delta_2}{K}\big)$.

Proof. We first show that

$$\tilde{w}_\tau(k) \geq p_\tau(k)I_\tau(k)\sum_{j=1}^{K}\tilde{w}_\tau(j). \tag{41}$$

It is easy to see that inequality (41) holds when $I_\tau(k) = 0$. When $I_\tau(k) = 1$, we have $w_\tau(k) = \tilde{w}_\tau(k)/\big(\sum_{j=1}^{K}\tilde{w}_\tau(j)\big)$. By (35), we have $p_\tau(k) \leq w_\tau(k) = \tilde{w}_\tau(k)/\big(\sum_{j=1}^{K}\tilde{w}_\tau(j)\big)$, from which (41) holds. Then we have

$$p_\tau(k) - p_{\tau-1}(k) \leq p_\tau(k) - \tilde{w}_\tau(k) \leq p_\tau(k) - p_\tau(k)I_\tau(k)\sum_{j=1}^{K}\tilde{w}_\tau(j) = p_\tau(k)\left[1 - I_\tau(k)\sum_{j=1}^{K}\tilde{w}_\tau(j)\right]$$
$$= p_\tau(k)\left\{1 - I_\tau(k)\sum_{j=1}^{K}p_{\tau-1}(j)\exp\big[-\eta\min\{\delta_1, \hat{l}_{\tau-1}(j)\}\big]\right\} \stackrel{(a)}{\leq} p_\tau(k)\left[1 - I_\tau(k)\sum_{j=1}^{K}p_{\tau-1}(j)\Big(1 - \eta\min\{\delta_1, \hat{l}_{\tau-1}(j)\}\Big)\right] \tag{42}$$

where in (a) we used $e^{-x} \geq 1 - x$.

The proof of Lemma 2 builds on Lemma 8. First consider the case $I_\tau(k) = 0$. In this case Lemma 8 becomes $p_\tau(k) - p_{\tau-1}(k) \leq p_\tau(k)$, which is trivial. On the other hand, since $I_\tau(k) = 0$, we have $w_\tau(k) = \frac{\delta_2}{K}$. Then leveraging (35), we have $p_\tau(k) \leq w_\tau(k) = \frac{\delta_2}{K}$. Plugging the lower bound on $p_{\tau-1}(k)$ from (35) in, we have

$$\frac{p_\tau(k)}{p_{\tau-1}(k)} \leq \frac{\delta_2}{K}\cdot\frac{1}{p_{\tau-1}(k)} \leq \frac{\delta_2}{K}\cdot\frac{K(1+\delta_2)}{\delta_2} = 1 + \delta_2. \tag{43}$$

Considering the case $I_\tau(k) = 1$, Lemma 8 becomes

$$p_\tau(k) - p_{\tau-1}(k) \leq p_\tau(k)\left[1 - \sum_{j=1}^{K}p_{\tau-1}(j)\Big(1 - \eta\min\big\{\delta_1, \hat{l}_{\tau-1}(j)\big\}\Big)\right] = \eta p_\tau(k)\sum_{j=1}^{K}p_{\tau-1}(j)\min\big\{\delta_1, \hat{l}_{\tau-1}(j)\big\} \leq \eta p_\tau(k)\delta_1. \tag{44}$$

Rearranging (44) and combining it with (43), we complete the proof.



B.3 Proof of Lemma 3

For conciseness, define $\mathbf{c}_\tau := \min\{\hat{\mathbf{l}}_\tau, \delta_1\cdot\mathbf{1}\}$, and correspondingly $c_\tau(k) := \min\{\hat{l}_\tau(k), \delta_1\}$. We further define $\tilde{W}_\tau := \sum_{k=1}^{K}\tilde{w}_\tau(k)$ and $W_\tau := \sum_{k=1}^{K}w_\tau(k)$. Leveraging these auxiliary variables, we have

$$\tilde{W}_{T+1} = \sum_{k=1}^{K}\tilde{w}_{T+1}(k) = \sum_{k=1}^{K}p_T(k)\exp\big[-\eta c_T(k)\big] = \sum_{k=1}^{K}\frac{w_T(k)}{W_T}\exp\big[-\eta c_T(k)\big] \geq \sum_{k=1}^{K}\frac{\tilde{w}_T(k)}{\tilde{W}_T}\cdot\frac{\exp\big[-\eta c_T(k)\big]}{W_T}$$
$$= \sum_{k=1}^{K}p_{T-1}(k)\frac{\exp\big[-\eta c_T(k) - \eta c_{T-1}(k)\big]}{\tilde{W}_T W_T} = \sum_{k=1}^{K}\frac{w_{T-1}(k)}{W_{T-1}}\cdot\frac{\exp\big[-\eta c_T(k) - \eta c_{T-1}(k)\big]}{\tilde{W}_T W_T} \geq \cdots \geq \sum_{k=1}^{K}\frac{\tilde{w}_1(k)\exp\big[-\eta\sum_{\tau=1}^{T}c_\tau(k)\big]}{\prod_{\tau=1}^{T}\big(\tilde{W}_\tau W_\tau\big)}. \tag{45}$$

Then, for any probability distribution $\mathbf{p} \in \Delta_K$, noticing the initialization $\tilde{w}_1(k) = 1, \forall k$, and hence $\tilde{W}_1 = K$, inequality (45) implies that

$$\sum_{k=1}^{K}p(k)\exp\Big[-\eta\sum_{\tau=1}^{T}c_\tau(k)\Big] \leq \sum_{k=1}^{K}\exp\Big[-\eta\sum_{\tau=1}^{T}c_\tau(k)\Big] \leq \tilde{W}_{T+1}\prod_{\tau=1}^{T}\big(\tilde{W}_\tau W_\tau\big) \stackrel{(a)}{\leq} K(1+\delta_2)^T\prod_{\tau=2}^{T+1}\tilde{W}_\tau \tag{46}$$

where in (a) we used the fact that $W_\tau \leq 1 + \delta_2$ [cf. (34)]. Then, using Jensen's inequality on $e^{-x}$, we have

$$\sum_{k=1}^{K}p(k)\exp\Big[-\eta\sum_{\tau=1}^{T}c_\tau(k)\Big] \geq \exp\Big[-\eta\sum_{k=1}^{K}\sum_{\tau=1}^{T}p(k)c_\tau(k)\Big]. \tag{47}$$

Plugging (47) into (46), we arrive at

$$\exp\Big[-\eta\sum_{k=1}^{K}\sum_{\tau=1}^{T}p(k)c_\tau(k)\Big] \leq K(1+\delta_2)^T\prod_{\tau=2}^{T+1}\tilde{W}_\tau. \tag{48}$$

On the other hand, $\tilde{W}_\tau$ can be upper bounded by

$$\tilde{W}_\tau = \sum_{k=1}^{K}\tilde{w}_\tau(k) = \sum_{k=1}^{K}p_{\tau-1}(k)\exp\big[-\eta c_{\tau-1}(k)\big] \stackrel{(b)}{\leq} \sum_{k=1}^{K}p_{\tau-1}(k)\Big(1 - \eta c_{\tau-1}(k) + \frac{\eta^2}{2}\big[c_{\tau-1}(k)\big]^2\Big) = 1 - \eta\sum_{k=1}^{K}p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^{K}p_{\tau-1}(k)\big[c_{\tau-1}(k)\big]^2 \tag{49}$$

where (b) follows from $e^{-x} \leq 1 - x + x^2/2$, $\forall x \geq 0$. Taking the logarithm of both sides of (49), we arrive at

$$\ln\tilde{W}_\tau \leq \ln\Big(1 - \eta\sum_{k=1}^{K}p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^{K}p_{\tau-1}(k)\big[c_{\tau-1}(k)\big]^2\Big) \stackrel{(c)}{\leq} -\eta\sum_{k=1}^{K}p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^{K}p_{\tau-1}(k)\big[c_{\tau-1}(k)\big]^2 \tag{50}$$



where (c) follows from $\ln(1+x) \leq x$. Then taking the logarithm of both sides of (48) and plugging (50) in, we arrive at

$$-\eta\sum_{k=1}^{K}\sum_{\tau=1}^{T}p(k)c_\tau(k) \leq T\ln(1+\delta_2) + \ln K - \eta\sum_{\tau=1}^{T}\sum_{k=1}^{K}p_\tau(k)c_\tau(k) + \frac{\eta^2}{2}\sum_{\tau=1}^{T}\sum_{k=1}^{K}p_\tau(k)\big[c_\tau(k)\big]^2. \tag{51}$$

Rearranging the terms of (51) and writing it compactly, we obtain

$$\sum_{\tau=1}^{T}\big(\mathbf{p}_\tau - \mathbf{p}\big)^\top\mathbf{c}_\tau \leq \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^{T}\sum_{k=1}^{K}p_\tau(k)\big[c_\tau(k)\big]^2 \leq \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^{T}\sum_{k=1}^{K}p_\tau(k)\big[\hat{l}_\tau(k)\big]^2. \tag{52}$$

B.4 Proof of Theorem 1

To begin with, the instantaneous regret can be written as

$$\mathbf{p}_t^\top\mathbf{l}_t - \mathbf{p}^\top\mathbf{l}_t = \sum_{k=1}^{K}p_t(k)l_t(k) - \sum_{k=1}^{K}p(k)l_t(k) \stackrel{(a)}{=} \sum_{k=1}^{K}p_t(k)\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbf{1}(a_t = k)}{p_t(k)}\Big] - \sum_{k=1}^{K}p(k)\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbf{1}(a_t = k)}{p_t(k)}\Big]$$
$$= \sum_{k=1}^{K}\big(p_t(k) - p(k)\big)\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbf{1}(a_t = k)}{p_{t+d_t}(k)}\cdot\frac{p_{t+d_t}(k)}{p_t(k)}\Big] \leq \max_k\frac{p_{t+d_t}(k)}{p_t(k)}\sum_{k=1}^{K}\big(p_t(k) - p(k)\big)\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbf{1}(a_t = k)}{p_{t+d_t}(k)}\Big] \stackrel{(b)}{=} \Big(\max_k\frac{p_{t+d_t}(k)}{p_t(k)}\Big)\mathbb{E}_{a_t}\big[\mathbf{p}_t^\top\hat{\mathbf{l}}_{t|t+d_t} - \mathbf{p}^\top\hat{\mathbf{l}}_{t|t+d_t}\big] \tag{53}$$

where (a) is due to $\mathbb{E}_{a_t}\big[\frac{l_t(k)\mathbf{1}(a_t=k)}{p_t(k)}\big] = l_t(k)$, and (b) follows from $\hat{l}_{t|t+d_t}(k) = \frac{l_t(k)\mathbf{1}(a_t=k)}{p_{t+d_t}(k)}$.

Then the overall regret over $T$ slots is given by

$$\mathrm{Reg}_T = \mathbb{E}\Big[\sum_{t=1}^{T}\mathbf{p}_t^\top\mathbf{l}_t\Big] - \sum_{t=1}^{T}\mathbf{p}^\top\mathbf{l}_t \leq \mathbb{E}\Big[\sum_{t=1}^{T}\Big(\max_k\frac{p_{t+d_t}(k)}{p_t(k)}\Big)\mathbb{E}_{a_t}\big[\mathbf{p}_t^\top\hat{\mathbf{l}}_{t|t+d_t} - \mathbf{p}^\top\hat{\mathbf{l}}_{t|t+d_t}\big]\Big]$$
$$\stackrel{(c)}{=} \mathbb{E}\Big[\sum_{\tau=1}^{T}\Big(\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big)\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{t(\tau)}^\top\hat{\mathbf{l}}_{t(\tau)|t(\tau)+d_{t(\tau)}} - \mathbf{p}^\top\hat{\mathbf{l}}_{t(\tau)|t(\tau)+d_{t(\tau)}}\big]\Big]$$
$$\stackrel{(d)}{=} \mathbb{E}\Big[\sum_{\tau=1}^{T}\Big(\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big)\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{t(\tau)}^\top\hat{\mathbf{l}}_\tau - \mathbf{p}^\top\hat{\mathbf{l}}_\tau\big]\Big] \stackrel{(e)}{=} \mathbb{E}\Big[\sum_{\tau=1}^{T}\Big(\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big)\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\hat{\mathbf{l}}_\tau - \mathbf{p}^\top\hat{\mathbf{l}}_\tau\big]\Big]$$
$$= \mathbb{E}\Big[\sum_{\tau=1}^{T}\Big(\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big)\Big(\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\hat{\mathbf{l}}_\tau - \mathbf{p}_\tau^\top\hat{\mathbf{l}}_\tau\big] + \mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_\tau^\top\hat{\mathbf{l}}_\tau - \mathbf{p}^\top\hat{\mathbf{l}}_\tau\big]\Big)\Big] \tag{54}$$

where (c) is due to the fact that $\{t(1), t(2), \ldots, t(T)\}$ is a permutation of $\{1, 2, \ldots, T\}$; (d) follows from $\hat{\mathbf{l}}_\tau = \hat{\mathbf{l}}_{t(\tau)|t(\tau)+d_{t(\tau)}}$; and (e) uses the facts $\mathbf{p}_t = \mathbf{p}_{L_{t-1}+1}$ and $\mathbf{p}_{t(\tau)} = \mathbf{p}_{L_{t(\tau)-1}+1} = \mathbf{p}_{\tau-s_\tau}$.

First note that between real slots $t(\tau)$ and $t(\tau) + d_{t(\tau)}$, at most $\bar{d} + d_{t(\tau)} \leq 2\bar{d}$ feedback values are received. Hence the corresponding virtual slots will not differ by more than $2\bar{d}$. Note also that the index of the virtual slot corresponding to $t(\tau)$ must be no larger than that of $t(\tau) + d_{t(\tau)}$. Hence, for all $\tau \in [1, T]$,

$$\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)} \leq \Big(\max_k\frac{p_{\tau+1}(k)}{p_\tau(k)}\Big)^{2\bar{d}} \stackrel{(f)}{\leq} \max\left\{(1+\delta_2)^{2\bar{d}}, \frac{1}{(1-\eta\delta_1)^{2\bar{d}}}\right\} \tag{55}$$

where (f) is the result of Lemma 2. Then, to bound the terms in the second pair of brackets in (54), again denote $\mathbf{c}_\tau := \min\{\hat{\mathbf{l}}_\tau, \delta_1\cdot\mathbf{1}\}$ and correspondingly $c_\tau(k) := \min\{\hat{l}_\tau(k), \delta_1\}$ for conciseness. Then we have

$$\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau = \mathbf{c}_\tau^\top(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau) \stackrel{(g)}{=} c_\tau(m)\sum_{j=0}^{s_\tau-1}\big(p_{\tau-s_\tau+j}(m) - p_{\tau-s_\tau+j+1}(m)\big)$$
$$\stackrel{(h)}{\leq} c_\tau(m)\sum_{j=0}^{s_\tau-1}p_{\tau-s_\tau+j}(m)\frac{\delta_2 + \eta c_{\tau-s_\tau+j}(m)}{1+\delta_2} \leq c_\tau(m)\sum_{j=0}^{s_\tau-1}\big(\eta p_{\tau-s_\tau+j}(m)c_{\tau-s_\tau+j}(m) + \delta_2\big) \leq \hat{l}_\tau(m)\sum_{j=0}^{s_\tau-1}\big(\eta p_{\tau-s_\tau+j}(m)\hat{l}_{\tau-s_\tau+j}(m) + \delta_2\big) \tag{56}$$

where (g) follows from the facts that $\hat{\mathbf{l}}_\tau$ has at most one non-zero entry (with index $m$) [cf. (9)] and $s_\tau \geq 0$ [cf. Lemma 6]; and (h) is the result of Lemma 7. Then notice that

$$\hat{l}_\tau(k)p_\tau(k) = \frac{l_{t(\tau)}(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}p_\tau(k) \stackrel{(i)}{\leq} \Big(\max_k\frac{p_\tau(k)}{p_{\tau+1}(k)}\Big)^{2\bar{d}} \leq \frac{1}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}} \tag{57}$$

where (i) uses the fact that between $t(\tau)$ and $t(\tau) + d_{t(\tau)}$ at most $2\bar{d}$ feedback values are received; further applying the result of Lemma 1, inequality (57) is obtained. Plugging (57) back into (56) and taking the expectation w.r.t. $a_{t(\tau)}$, we arrive at

where (i) uses the fact that between t(τ) and t(τ) + dt(τ) there is at most 2d feedback; then further applying theresult of Lemma 1, inequality (57) can be obtained. Plugging (57) back in to (56) and taking expectation w.r.t.at(τ), we arrive at

Eat(τ)[p>τ−sτ cτ − p>τ cτ

]≤(

ηsτ

(1− δ2 − ηδ1)2d+ δ2sτ

) K∑k=1

pt(τ)(k)lτ (k)

(j)

≤ K1

(1− δ2 − ηδ1)2d

(ηsτ

(1− δ2 − ηδ1)2d+ δ2sτ

)(58)

where (j) follows for a reason similar to (57). Then, noticing $\sum_{\tau=1}^{T}s_\tau = \sum_{t=1}^{T}d_t = D$, we have

$$\sum_{\tau=1}^{T}\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau\big] \leq \frac{KD}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}}\Big(\frac{\eta}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}} + \delta_2\Big). \tag{59}$$

Using an argument similar to (57), we can obtain

$$\mathbb{E}_{a_{t(\tau)}}\Big[p_\tau(k)\big[\hat{l}_\tau(k)\big]^2\Big] = p_\tau(k)\frac{l_{t(\tau)}^2(k)}{p_{t(\tau)+d_{t(\tau)}}^2(k)}p_{t(\tau)}(k) \leq \frac{1}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \tag{60}$$



Then, leveraging Lemma 3, we arrive at

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big]
\le \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K \mathbb{E}_{a_{t(\tau)}}\big[p_\tau(k)\,[l_\tau(k)]^2\big]
\le \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta K T}{2(1-\delta_2-\eta\delta_1)^{4\bar d}}. \tag{61}
\]

The last step is to show that introducing $\delta_1$ does not incur too much extra regret. Note that both $\mathbf{c}_\tau$ and $\mathbf{l}_\tau$ have only one non-zero entry, whose index is denoted by $m_\tau$. Notice that $l_\tau(m_\tau) > c_\tau(m_\tau)$ only when $l_\tau(m_\tau) = \frac{l_{t(\tau)}(m_\tau)}{p_{t(\tau)+d_{t(\tau)}}(m_\tau)} > \delta_1$, which is equivalent to $p_{t(\tau)+d_{t(\tau)}}(m_\tau) < l_{t(\tau)}(m_\tau)/\delta_1 \le 1/\delta_1$. Hence, we have

\[
\begin{aligned}
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{l}_\tau\big]
&= \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big] + \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top(\mathbf{l}_\tau - \mathbf{c}_\tau)\big] \\
&\stackrel{(h)}{\le} \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big] + \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[p_\tau(m_\tau)\big(l_\tau(m_\tau) - c_\tau(m_\tau)\big)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau) < 1/\delta_1\big)\big] \\
&\le \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big] + \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[p_\tau(m_\tau)\, l_\tau(m_\tau)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau) < 1/\delta_1\big)\big]
\end{aligned} \tag{62}
\]

where in (h), $m_\tau$ denotes the index of the single non-zero entry of $\mathbf{l}_\tau$, and the term involving $\mathbf{p}$ is dropped thanks to the indicator function. To proceed, notice that

\[
\begin{aligned}
\mathbb{E}_{a_{t(\tau)}}\big[l_\tau(m_\tau)\, p_\tau(m_\tau)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau) < 1/\delta_1\big)\big]
&= \sum_{k=1}^K p_{t(\tau)}(k)\, \frac{l_{t(\tau)}(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\, p_\tau(k)\, \mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k) < 1/\delta_1\big) \\
&\stackrel{(i)}{\le} \frac{\sum_{k=1}^K p_\tau(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k) < 1/\delta_1\big)}{(1-\delta_2-\eta\delta_1)^{2\bar d}} \\
&= \frac{\sum_{k=1}^K \frac{p_\tau(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\, p_{t(\tau)+d_{t(\tau)}}(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k) < 1/\delta_1\big)}{(1-\delta_2-\eta\delta_1)^{2\bar d}}
\stackrel{(j)}{\le} \frac{K}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar d}}
\end{aligned} \tag{63}
\]

where in (i) we used a similar argument as in (57); and in (j) we used the fact that $x\,\mathbb{1}(x < a) \le a$. Plugging (63) back into (62), we arrive at

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{l}_\tau\big]
\le \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big] + \frac{KT}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar d}}. \tag{64}
\]

Applying similar arguments as in (62) and (63), we can also show that

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau)^\top \mathbf{l}_\tau\big]
\le \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau)^\top \mathbf{c}_\tau\big] + \frac{KD}{\delta_1(1-\delta_2-\eta\delta_1)^{6\bar d}}. \tag{65}
\]

For the parameter selection, we have $T\ln(1+\delta_2) = T\ln\big(1+\frac{1}{T+D}\big) \le \ln e = 1$. Leveraging the inequality $e \le \big(1-\frac{1}{2x}\big)^{-2x} \le 4$ for all $x \in \mathbb{N}^+$, we have

\[
\frac{1}{(1-\eta\delta_1)^{2\bar d}} \le \frac{1}{(1-\delta_2-\eta\delta_1)^{2\bar d}} = O(1). \tag{66}
\]
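As a quick numerical confirmation of the inequality invoked above (an illustration only): $\big(1-\frac{1}{2x}\big)^{-2x}$ equals $4$ at $x=1$ and decreases monotonically toward $e$ as $x$ grows.

```python
import math

for x in [1, 2, 5, 10, 100, 10_000]:
    val = (1 - 1 / (2 * x)) ** (-2 * x)
    assert math.e <= val <= 4          # e <= (1 - 1/(2x))^{-2x} <= 4 for x in N+
    print(x, val)                      # 4.0, 3.16..., decreasing toward e = 2.718...
```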


From (66), the bound in (55) readily follows:

\[
\max_k \frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)} \le \max\Big\{(1+\delta_2)^{2\bar d},\ \frac{1}{(1-\eta\delta_1)^{2\bar d}}\Big\} = O(1). \tag{67}
\]

Then, for (59), we have

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top \mathbf{c}_\tau - \mathbf{p}_\tau^\top \mathbf{c}_\tau\big]
\le \frac{KD}{(1-\delta_2-\eta\delta_1)^{2\bar d}}\Big(\frac{\eta}{(1-\delta_2-\eta\delta_1)^{2\bar d}} + \delta_2\Big) = O(\eta K D + \delta_2 K D). \tag{68}
\]

For (61), we have

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big]
\le \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta K T}{2(1-\delta_2-\eta\delta_1)^{4\bar d}} = O\Big(\eta K T + \frac{1+\ln K}{\eta}\Big). \tag{69}
\]

Using (69) and the selection of $\delta_1$, we can bound (64) by

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{l}_\tau\big]
\le \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau - \mathbf{p})^\top \mathbf{c}_\tau\big] + \frac{KT}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar d}} = O\Big(\eta \bar d K T + \frac{1+\ln K}{\eta}\Big). \tag{70}
\]

Using (68) and the selection of $\delta_1$, we have

\[
\sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau)^\top \mathbf{l}_\tau\big]
\le \sum_{\tau=1}^T \mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau)^\top \mathbf{c}_\tau\big] + \frac{KD}{\delta_1(1-\delta_2-\eta\delta_1)^{6\bar d}} = O\big(\eta \bar d K D + \delta_2 K D\big). \tag{71}
\]

Plugging (67), (70), and (71) into (54), and choosing $\eta = \Theta\big(\sqrt{(1+\ln K)/(\bar d K(T+D))}\big)$ to balance the $O(\eta\bar d K(T+D))$ terms against the $O\big((1+\ln K)/\eta\big)$ term, the regret is bounded by

\[
\mathrm{Reg}_T = \sum_{t=1}^T \mathbb{E}\big[\mathbf{p}_t^\top \mathbf{l}_t\big] - \sum_{t=1}^T \mathbf{p}^{*\top} \mathbf{l}_t = O\Big(\sqrt{\bar d\,(T+D)\,K\,(1+\ln K)}\Big). \tag{72}
\]

C Proofs for DBGD

C.1 Proof of Lemma 4

Since $f_{s|t}(\cdot)$ is $L$-Lipschitz, we have $g_{s|t}(k) \le \frac{1}{\delta} L\|\delta \mathbf{e}_k\| = L$, and thus $\|\mathbf{g}_{s|t}\| \le \sqrt{K}L$. On the other hand, let $\nabla_{s|t} := \nabla f_{s|t}(\mathbf{x}_{s|t})$, with $\nabla_{s|t}(k)$ denoting the $k$-th entry of $\nabla_{s|t}$. Due to the $\beta$-smoothness of $f_{s|t}(\cdot)$, we have

\[
g_{s|t}(k) - \nabla_{s|t}(k) \le \frac{1}{\delta}\Big(\delta\, \nabla_{s|t}^\top \mathbf{e}_k + \frac{\beta}{2}\delta^2\Big) - \nabla_{s|t}(k) = \frac{\beta\delta}{2} \tag{73}
\]

suggesting that $\|\mathbf{g}_{s|t} - \nabla f_{s|t}(\mathbf{x}_{s|t})\| \le \frac{\beta\delta}{2}\sqrt{K}$.
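For illustration (not part of the proof), the following snippet instantiates a coordinate-wise estimate of the form $g(k) = [f(\mathbf{x}+\delta\mathbf{e}_k) - f(\mathbf{x})]/\delta$, which matches the bounds used above, on an assumed smooth quadratic loss, and checks the bias bound $\|\mathbf{g} - \nabla f(\mathbf{x})\| \le \frac{\beta\delta}{2}\sqrt{K}$ of (73).

```python
import numpy as np

rng = np.random.default_rng(2)
K, delta = 5, 1e-2
M = rng.standard_normal((K, K))
A = M @ M.T                                   # PSD, so f below is smooth and convex
beta = np.linalg.eigvalsh(A).max()            # smoothness constant of f

f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(K)
grad = A @ x                                  # exact gradient

# finite-difference estimate, one extra function query per coordinate
g = np.array([(f(x + delta * e) - f(x)) / delta for e in np.eye(K)])
err = np.linalg.norm(g - grad)
print(err, beta * delta / 2 * np.sqrt(K))     # err <= (beta*delta/2)*sqrt(K), cf. (73)
assert err <= beta * delta / 2 * np.sqrt(K) + 1e-12
```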

C.2 Proof of Lemma 5

Lemma 5 (restated). In virtual slots, it is guaranteed that

\[
\|\mathbf{x}_\tau - \mathbf{x}_{\tau-s_\tau}\| \le \eta s_\tau \sqrt{K} L \tag{74}
\]

and, for any $\mathbf{x} \in \mathcal{X}_\delta$,

\[
\eta\, \mathbf{g}_\tau^\top\big(\mathbf{x}_\tau - \mathbf{x}\big) \le \frac{\eta^2}{2} K L^2 + \frac{\|\mathbf{x}_\tau - \mathbf{x}\|^2 - \|\mathbf{x}_{\tau+1} - \mathbf{x}\|^2}{2}. \tag{75}
\]


Proof. The proof begins with

\[
\|\mathbf{x}_{\tau-s_\tau} - \mathbf{x}_\tau\| \le \sum_{j=0}^{s_\tau-1} \|\mathbf{x}_{\tau-s_\tau+j} - \mathbf{x}_{\tau-s_\tau+j+1}\| \stackrel{(a)}{\le} \eta s_\tau \sqrt{K} L \tag{76}
\]

where (a) uses the fact that $\|\mathbf{x}_\tau - \mathbf{x}_{\tau+1}\| = \big\|\mathbf{x}_\tau - \Pi_{\mathcal{X}_\delta}[\mathbf{x}_\tau - \eta\mathbf{g}_\tau]\big\| \le \eta\|\mathbf{g}_\tau\|$, together with $\|\mathbf{g}_\tau\| \le \sqrt{K}L$ from Lemma 4. The first inequality is thus proved. Then, notice that

\[
\begin{aligned}
\|\mathbf{x}_{\tau+1} - \mathbf{x}\|^2 - \|\mathbf{x}_\tau - \mathbf{x}\|^2
&= \big\|\Pi_{\mathcal{X}_\delta}[\mathbf{x}_\tau - \eta\mathbf{g}_\tau] - \mathbf{x}\big\|^2 - \|\mathbf{x}_\tau - \mathbf{x}\|^2 \\
&\stackrel{(b)}{\le} \|\mathbf{x}_\tau - \mathbf{x} - \eta\mathbf{g}_\tau\|^2 - \|\mathbf{x}_\tau - \mathbf{x}\|^2
= -2\eta\, \mathbf{g}_\tau^\top(\mathbf{x}_\tau - \mathbf{x}) + \eta^2\|\mathbf{g}_\tau\|^2
\end{aligned} \tag{77}
\]

where inequality (b) uses the non-expansiveness of the projection. Rearranging the terms of (77) completes the proof.
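Both ingredients of the proof, the step-size bound and the non-expansiveness of $\Pi_{\mathcal{X}_\delta}$, are easy to verify numerically. The sketch below takes the feasible set to be a Euclidean ball for concreteness (the paper's $\mathcal{X}_\delta$ is the shrunk feasible set) and checks inequality (77) on random instances.

```python
import numpy as np

rng = np.random.default_rng(3)
R, eta = 1.0, 0.1

def proj_ball(z, R=R):
    # Euclidean projection onto {x : ||x|| <= R}; non-expansive, as used in step (b)
    n = np.linalg.norm(z)
    return z if n <= R else z * (R / n)

for _ in range(1000):
    x = proj_ball(rng.standard_normal(4))        # current iterate, feasible
    g = rng.standard_normal(4)                   # arbitrary stand-in "gradient"
    x_star = proj_ball(rng.standard_normal(4))   # comparator in the set
    x_next = proj_ball(x - eta * g)              # projected gradient step
    lhs = np.linalg.norm(x_next - x_star) ** 2 - np.linalg.norm(x - x_star) ** 2
    rhs = -2 * eta * g @ (x - x_star) + eta**2 * (g @ g)
    assert lhs <= rhs + 1e-12                    # inequality (77)
```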

C.3 Proof of Theorem 2

Lemma 9. Let $h_t(\mathbf{x}) := f_t(\mathbf{x}) + \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top \mathbf{x}$, where $\mathbf{g}_t := \mathbf{g}_{t|t+d_t}$. Then $h_t(\mathbf{x})$ has the following properties: i) $h_t(\mathbf{x})$ is $\big(L + \frac{\beta\delta\sqrt{K}}{2}\big)$-Lipschitz; and ii) $h_t(\mathbf{x})$ is $\beta$-smooth and convex.

Proof. Starting with the first property, consider that

\[
\begin{aligned}
|h_t(\mathbf{x}) - h_t(\mathbf{y})| &= \big|f_t(\mathbf{x}) + \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top \mathbf{x} - f_t(\mathbf{y}) - \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top \mathbf{y}\big| \\
&\le |f_t(\mathbf{x}) - f_t(\mathbf{y})| + \big\|\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big\|\, \|\mathbf{x} - \mathbf{y}\|
\stackrel{(a)}{\le} \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big)\|\mathbf{x} - \mathbf{y}\|
\end{aligned} \tag{78}
\]

where in (a) we used the results in Lemma 4. For the second property, the convexity of $h_t(\mathbf{x})$ is obvious. Then, noticing that $\nabla h_t(\mathbf{x}) = \nabla f_t(\mathbf{x}) + \mathbf{g}_t - \nabla f_t(\mathbf{x}_t)$, we have

\[
\begin{aligned}
h_t(\mathbf{y}) - h_t(\mathbf{x}) &= f_t(\mathbf{y}) - f_t(\mathbf{x}) + \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top(\mathbf{y} - \mathbf{x}) \\
&\le \big(\nabla f_t(\mathbf{x})\big)^\top(\mathbf{y} - \mathbf{x}) + \frac{\beta}{2}\|\mathbf{y} - \mathbf{x}\|^2 + \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top(\mathbf{y} - \mathbf{x}) \\
&= \big(\nabla h_t(\mathbf{x})\big)^\top(\mathbf{y} - \mathbf{x}) + \frac{\beta}{2}\|\mathbf{y} - \mathbf{x}\|^2
\end{aligned} \tag{79}
\]

which implies that $h_t(\mathbf{x})$ is $\beta$-smooth.
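The point of the surrogate is that its gradient at the query point is exactly the estimate DBGD uses. A minimal sketch (with an assumed quadratic $f_t$ and an arbitrary perturbation standing in for the biased estimate $\mathbf{g}_t$) verifies the identity $\nabla h_t(\mathbf{x}_t) = \mathbf{g}_t$ used right below.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 4
A = np.diag(rng.uniform(0.5, 2.0, size=K))
f = lambda x: 0.5 * x @ A @ x                       # beta-smooth convex loss (assumed)
grad_f = lambda x: A @ x

x_t = rng.standard_normal(K)
g_t = grad_f(x_t) + 0.05 * rng.standard_normal(K)   # stand-in for the biased estimate

# surrogate h_t(x) = f_t(x) + (g_t - grad f_t(x_t))^T x, as in Lemma 9
h = lambda x: f(x) + (g_t - grad_f(x_t)) @ x
grad_h = lambda x: grad_f(x) + g_t - grad_f(x_t)

assert np.allclose(grad_h(x_t), g_t)                # key fact: grad h_t(x_t) = g_t
```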

We are now ready to prove Theorem 2. Let $h_t(\mathbf{x}) := f_t(\mathbf{x}) + \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top \mathbf{x}$, where $\mathbf{g}_t := \mathbf{g}_{t|t+d_t}$. Using the properties of $h_t(\mathbf{x})$ in Lemma 9 as well as the fact that $\nabla h_t(\mathbf{x}_t) = \mathbf{g}_t$, we have

\[
\begin{aligned}
\mathrm{Reg}_T &= \sum_{t=1}^T f_t(\mathbf{x}_t) - \sum_{t=1}^T f_t(\mathbf{x}^*) \\
&= \sum_{t=1}^T \Big(h_t(\mathbf{x}_t) - \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top \mathbf{x}_t\Big) - \sum_{t=1}^T \Big(h_t(\mathbf{x}^*) - \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top \mathbf{x}^*\Big) \\
&= \sum_{t=1}^T \big(h_t(\mathbf{x}_t) - h_t(\mathbf{x}^*)\big) + \sum_{t=1}^T \big(\mathbf{g}_t - \nabla f_t(\mathbf{x}_t)\big)^\top\big(\mathbf{x}^* - \mathbf{x}_t\big) \\
&\stackrel{(a)}{\le} \sum_{t=1}^T \big(h_t(\mathbf{x}_t) - h_t(\mathbf{x}_\delta)\big) + \sum_{t=1}^T \big(h_t(\mathbf{x}_\delta) - h_t(\mathbf{x}^*)\big) + \frac{RT\beta\delta\sqrt{K}}{2} \\
&\stackrel{(b)}{\le} \sum_{t=1}^T \big(h_t(\mathbf{x}_t) - h_t(\mathbf{x}_\delta)\big) + \delta R T\Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big) + \frac{RT\beta\delta\sqrt{K}}{2}
\end{aligned} \tag{80}
\]


where in (a), $\mathbf{x}_\delta := \Pi_{\mathcal{X}_\delta}(\mathbf{x}^*)$, and the inequality follows from the results in Lemma 4; (b) follows from the fact that $h_t(\cdot)$ is $\big(L + \frac{\beta\delta\sqrt{K}}{2}\big)$-Lipschitz, as well as $\|\mathbf{x}_\delta - \mathbf{x}^*\| \le \delta R$.

Hence, in virtual slots, it is as if the learner were optimizing $h_t(\mathbf{x}_t)$ with $\nabla h_t(\mathbf{x}_t)$ revealed. With the shorthand notation $h_\tau(\cdot) := h_{t(\tau)}(\cdot)$, we have (using arguments similar to those in the proof of Theorem 1)

\[
\begin{aligned}
\sum_{t=1}^T h_t(\mathbf{x}_t) - \sum_{t=1}^T h_t(\mathbf{x}_\delta)
&= \sum_{\tau=1}^T h_{t(\tau)}(\mathbf{x}_{t(\tau)}) - \sum_{\tau=1}^T h_{t(\tau)}(\mathbf{x}_\delta)
= \sum_{\tau=1}^T h_\tau(\mathbf{x}_{\tau-s_\tau}) - \sum_{\tau=1}^T h_\tau(\mathbf{x}_\delta) \\
&= \sum_{\tau=1}^T h_\tau(\mathbf{x}_{\tau-s_\tau}) - \sum_{\tau=1}^T h_\tau(\mathbf{x}_\tau) + \sum_{\tau=1}^T h_\tau(\mathbf{x}_\tau) - \sum_{\tau=1}^T h_\tau(\mathbf{x}_\delta).
\end{aligned} \tag{81}
\]

The first term on the RHS of (81) can be bounded as

\[
h_\tau(\mathbf{x}_{\tau-s_\tau}) - h_\tau(\mathbf{x}_\tau)
\le \big|h_\tau(\mathbf{x}_{\tau-s_\tau}) - h_\tau(\mathbf{x}_\tau)\big|
\stackrel{(c)}{\le} \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big)\|\mathbf{x}_{\tau-s_\tau} - \mathbf{x}_\tau\|
\stackrel{(d)}{\le} \eta s_\tau \sqrt{K} L \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big) \tag{82}
\]

where (c) follows from Lemma 9; and (d) is the result of Lemma 5. Hence, using $\sum_{\tau=1}^T s_\tau = D$ from Lemma 6, we obtain

\[
\sum_{\tau=1}^T h_\tau(\mathbf{x}_{\tau-s_\tau}) - \sum_{\tau=1}^T h_\tau(\mathbf{x}_\tau) \le \eta D \sqrt{K} L \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big). \tag{83}
\]

On the other hand, by the convexity of $h_\tau(\cdot)$, we have

\[
\begin{aligned}
h_\tau(\mathbf{x}_\tau) - h_\tau(\mathbf{x}_\delta)
&\le \big(\nabla h_\tau(\mathbf{x}_\tau)\big)^\top\big(\mathbf{x}_\tau - \mathbf{x}_\delta\big)
= \big[\nabla h_\tau(\mathbf{x}_\tau) - \mathbf{g}_\tau\big]^\top\big(\mathbf{x}_\tau - \mathbf{x}_\delta\big) + \mathbf{g}_\tau^\top\big(\mathbf{x}_\tau - \mathbf{x}_\delta\big) \\
&\stackrel{(e)}{\le} \beta\|\mathbf{x}_\tau - \mathbf{x}_{\tau-s_\tau}\|\, \|\mathbf{x}_\tau - \mathbf{x}_\delta\| + \mathbf{g}_\tau^\top\big(\mathbf{x}_\tau - \mathbf{x}_\delta\big)
\le \beta R \|\mathbf{x}_\tau - \mathbf{x}_{\tau-s_\tau}\| + \mathbf{g}_\tau^\top\big(\mathbf{x}_\tau - \mathbf{x}_\delta\big)
\end{aligned} \tag{84}
\]

where (e) holds because $h_\tau(\cdot)$ is $\beta$-smooth [cf. [22, Thm. 2.1.5]] and $\mathbf{g}_\tau = \nabla h_\tau(\mathbf{x}_{\tau-s_\tau})$. Summing over $\tau$ and leveraging the results in Lemma 5, we have

\[
\sum_{\tau=1}^T h_\tau(\mathbf{x}_\tau) - h_\tau(\mathbf{x}_\delta)
\le \sum_{\tau=1}^T \eta s_\tau \sqrt{K} L \beta R + \sum_{\tau=1}^T \frac{\eta}{2}\|\mathbf{g}_\tau\|^2 + \frac{R^2}{\eta}
\le \eta D \sqrt{K} L \beta R + \frac{\eta T}{2} K L^2 + \frac{R^2}{\eta}. \tag{85}
\]

Selecting $\delta = O\big(1/(T+D)\big)$, (83) implies

\[
\sum_{\tau=1}^T h_\tau(\mathbf{x}_{\tau-s_\tau}) - \sum_{\tau=1}^T h_\tau(\mathbf{x}_\tau) \le \eta D \sqrt{K} L \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big) = O\big(\eta\sqrt{K} D\big). \tag{86}
\]

Inequality (85) then becomes

\[
\sum_{\tau=1}^T h_\tau(\mathbf{x}_\tau) - h_\tau(\mathbf{x}_\delta)
\le \eta D \sqrt{K} L \beta R + \frac{\eta T}{2} K L^2 + \frac{R^2}{\eta} = O\Big(\eta K T + \eta\sqrt{K} D + \frac{1}{\eta}\Big). \tag{87}
\]

Plugging (81), (83), and (85) into (80), and choosing $\eta = O\big(1/\sqrt{K(T+D)}\big)$, the proof is complete.


C.4 Proof of Corollary 1

To prove Corollary 1, we will show that

\[
\frac{1}{K+1}\sum_{t=1}^T \sum_{k=0}^K f_t(\mathbf{x}_{t,k}) - \sum_{t=1}^T f_t(\mathbf{x}_t) = O(\sqrt{K}). \tag{88}
\]

Using the $\beta$-smoothness in Assumption 4, we have, for any $k \neq 0$,

\[
f_t(\mathbf{x}_{t,k}) - f_t(\mathbf{x}_t) \le \big(\nabla f_t(\mathbf{x}_t)\big)^\top(\mathbf{x}_{t,k} - \mathbf{x}_t) + \frac{\beta\delta^2}{2} \le \delta\|\nabla f_t(\mathbf{x}_t)\| + \frac{\beta\delta^2}{2}. \tag{89}
\]

Then, leveraging the result of Lemma 4, we have

\[
\begin{aligned}
\|\nabla f_t(\mathbf{x}_t)\| &= \|\nabla f_{t|t+d_t}(\mathbf{x}_{t|t+d_t})\| = \|\nabla f_{t|t+d_t}(\mathbf{x}_{t|t+d_t}) + \mathbf{g}_{t|t+d_t} - \mathbf{g}_{t|t+d_t}\| \\
&\le \|\mathbf{g}_{t|t+d_t}\| + \|\nabla f_{t|t+d_t}(\mathbf{x}_{t|t+d_t}) - \mathbf{g}_{t|t+d_t}\| \le \sqrt{K}L + \frac{\beta\delta\sqrt{K}}{2}.
\end{aligned} \tag{90}
\]

Plugging (90) back into (89), we have

\[
f_t(\mathbf{x}_{t,k}) - f_t(\mathbf{x}_t) \le \delta\sqrt{K}L + \frac{\beta\delta^2\sqrt{K}}{2} + \frac{\beta\delta^2}{2} \stackrel{(a)}{=} O\Big(\frac{\sqrt{K}}{T+D}\Big) \tag{91}
\]

where (a) follows from $\delta = O\big((T+D)^{-1}\big)$. Summing over $k$ and $t$ readily implies (88).
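As a numerical illustration of (89) (a sketch with an assumed quadratic loss, and exploration queries of the form $\mathbf{x}_{t,k} = \mathbf{x}_t + \delta\mathbf{e}_k$ as in the estimator of Lemma 4): each query incurs at most $\delta\|\nabla f_t(\mathbf{x}_t)\| + \beta\delta^2/2$ extra loss, which vanishes as $\delta \to 0$.

```python
import numpy as np

rng = np.random.default_rng(5)
K, delta = 5, 1e-2
A = np.diag(rng.uniform(0.5, 2.0, size=K))
beta = A.diagonal().max()                     # smoothness constant of the quadratic
f = lambda x: 0.5 * x @ A @ x
x_t = rng.standard_normal(K)
gnorm = np.linalg.norm(A @ x_t)               # ||grad f_t(x_t)||

for k in range(K):                            # exploration queries x_{t,k} = x_t + delta*e_k
    extra = f(x_t + delta * np.eye(K)[k]) - f(x_t)
    assert extra <= delta * gnorm + beta * delta**2 / 2 + 1e-12   # cf. (89)
print("per-query extra loss bounded as in (89)")
```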
