
Distributed Online Optimization over a Heterogeneous Network with Any-Batch Mirror Descent

Nima Eshraghi 1 Ben Liang 1

Abstract

In distributed online optimization over a computing network with heterogeneous nodes, slow nodes can adversely affect the progress of fast nodes, leading to a drastic slowdown of the overall convergence process. To address this issue, we consider a new algorithm termed Distributed Any-Batch Mirror Descent (DABMD), which is based on distributed Mirror Descent but uses a fixed per-round computing time to limit the waiting by fast nodes to receive information updates from slow nodes. DABMD is characterized by varying minibatch sizes across nodes. It is applicable to a broader range of problems compared with existing distributed online optimization methods, such as those based on dual averaging, and it accommodates time-varying network topology. We study two versions of DABMD, depending on whether the computing nodes average their primal variables via single or multiple consensus iterations. We show that both versions provide strong theoretical performance guarantees, by deriving upper bounds on their expected dynamic regret, which capture the variability in minibatch sizes. Our experimental results show substantial reduction in cost and acceleration in convergence compared with the best known alternative.

1. Introduction

In many practical applications, system parameters and cost functions vary over time. For example, in online machine learning, data samples arrive dynamically, so that the learning model is progressively updated. Therefore, online optimization has recently attracted significant attention as an important tool for modeling various classes of problems

1Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada. Correspondence to: Nima Eshraghi <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

with uncertainty, including cloud computing (Lin et al., 2013), networking (Shi et al., 2018), and machine learning (Shalev-Shwartz, 2012).

Furthermore, emerging computing and machine learning applications often demand huge amounts of data and computation, which exacerbates the need for distributed computation, to avoid overburdening a single computing node and to provide robustness to failures. Yet, a central challenge to distributed computation is how to handle the uncertainty involved in the processing time of heterogeneous computing nodes. In a large distributed computing network, the finishing times of different computing nodes can vary greatly across the network and over time, and they can be highly random due to the variation in available computing resources and the dynamics of the workload. In the design of these networks, it is difficult to capture such uncertainty in an offline setting, while online optimization techniques naturally accommodate unpredictable variations in the cost functions.

In this paper, we study the problem of online optimization over a distributed computing network with time-varying topology. The network consists of heterogeneous computing nodes, and the processing speed varies among the nodes and over time. At each time instant, there is a global cost function in the network, which is generally defined and is time-varying. An example of such a global cost is the prediction loss. Individual nodes communicate locally to collaboratively supplement their incomplete knowledge about the time-varying global cost. Our aim is to minimize the total accumulated cost over a finite time interval.

The above computing model can be viewed as an abstraction for several important practical cases. For instance, consider a parallel processing system with unknown processing times. The computing nodes may be parallel processors, virtual machine instances in a public cloud, or mobile edge servers. A number of factors contribute to not knowing the processing time on these processors. For example, an edge server may be shared among offloaded user tasks and network service tasks, so that only a fraction of the processing resource may be available to serve the user tasks. Another example is networked sensors and mobile devices in a machine learning application, where the goal is to fit a model to a large dataset. Different sensors and devices have differing


computing speed, and thus, slow nodes can cause significant delay that adversely affects the overall performance. Closely related to this case is the straggler problem in distributed learning networks (Tandon et al., 2017). A careful design is required to handle slow learners and stragglers.

Prior studies on distributed online optimization utilize a variety of solution methods, including gradient descent (Mateos-Nunez & Cortés, 2014) and mirror descent (Shahrampour & Jadbabaie, 2017). Mirror descent uses the Bregman divergence, which generalizes the Euclidean norm used in the projection step of gradient descent, thus acquiring applicability to a broader range of problems. In addition, the Bregman divergence is mildly dependent on the dimension of the decision variables (Beck & Teboulle, 2003). Thus, mirror descent is optimal among first-order methods when the decision variables' dimension is high (Duchi et al., 2010). In this work we focus on the mirror descent approach.

Most prior studies assume homogeneous computing nodes, so that the amount of data processed in each round is fixed (Hosseini et al., 2013; Mateos-Nunez & Cortés, 2014; Nedich et al., 2015; Shahrampour & Jadbabaie, 2017; Tsianos & Rabbat, 2016; Xiao, 2010). However, as explained above, heterogeneity is often unavoidable in modern computing networks. Therefore, in this work we instead consider an any-batch approach, fixing the per-round processing time and allowing each computing node to complete as much work as it can in each round. This forces all nodes to report their accomplished work when the fixed time interval expires, thus preventing slow computing nodes from holding up the progress of faster ones. However, fixing the processing time results in varying workload among the heterogeneous nodes, which complicates system design and performance analysis. As far as we are aware, this work is the first to consider this approach for mirror descent, with a theoretical analysis of its regret bound.

Contributions. In this paper, we consider the problem of distributed online optimization over a computing network, taking into account heterogeneity among computing nodes in terms of processing speed. We present an online optimization approach termed Distributed Any-Batch Mirror Descent (DABMD), which uses a fixed per-round processing time to limit the waiting time by fast nodes to receive information updates from slow nodes. The term any-batch refers to the fact that our approach allows varying batch sizes of data processed by computing nodes within each round. We study two versions of DABMD, depending on whether the computing nodes average their primal variables via single or multiple consensus iterations. Our main contribution is in analyzing the performance of DABMD in terms of regret, which measures the difference between the accumulated cost of the online algorithm and that of an optimal offline counterpart. In contrast to prior studies that focus on the

static regret (Ferdinand et al., 2019; Hosseini et al., 2013; Mateos-Nunez & Cortés, 2014; Nedich et al., 2015; Tsianos & Rabbat, 2016; Xiao, 2010), where the offline comparison target is static and hence highly suboptimal, we establish upper bounds on the dynamic regret, which provides a stronger performance guarantee. Furthermore, in contrast to earlier studies (Ferdinand et al., 2019; Hosseini et al., 2013; Nedich et al., 2015; Shahrampour & Jadbabaie, 2017; Tsianos & Rabbat, 2016; Xiao, 2010) where a fixed network topology is considered, we allow a time-varying network topology. Experimental results show a substantially accelerated convergence speed compared with the best known alternative from (Shahrampour & Jadbabaie, 2017).

2. Related Works

The present paper lies at the intersection of two important bodies of work: distributed computation and online optimization. Here we review the most relevant studies in these two areas.

2.1. Distributed Computation

The problem of distributed computation over a network has been extensively studied since the seminal work of (Tsitsiklis et al., 1986). Distributed optimization techniques can be categorized based on their computing network topology: either master-worker (Boyd et al., 2011) or fully distributed, the latter characterized by the lack of centralized access to information. Our main focus is on methods in the second category. These methods are particularly attractive since they are robust to communication and node failures and are simple to implement.

Of particular interest are distributed gradient-based methods. In (Nedic & Ozdaglar, 2009), the distributed subgradient method is presented and its convergence rate is analyzed for a time-varying network. In (Duchi et al., 2011), the dual averaging method for distributed optimization is proposed and the effect of network parameters on the convergence behavior is investigated. Distributed dual averaging under various scenarios of delayed subgradient information and quantized communication error is studied in (Tsianos & Rabbat, 2012; Wang et al., 2014; Yuan et al., 2012). All the aforementioned methods are first-order algorithms requiring only the calculation of subgradients and projections. This is especially important in applications such as large-scale learning, where first-order methods are preferable to higher-order approaches (Bottou & Bousquet, 2008). However, despite the simplicity of first-order methods, it is challenging to compute projections for some common objective functions. An example of such objective functions is the entropy-based loss with the unit simplex as the constraint set (Beck & Teboulle, 2003).

The mirror descent algorithm (Beck & Teboulle, 2003; Nemirovsky & Yudin, 1983) generalizes the projection step by using the Bregman divergence, consequently gaining applicability to a wide range of problems. Distributed versions of mirror descent under various settings are presented in (Li et al., 2016; Rabbat, 2015; Raginsky & Bouvrie, 2012; Shahrampour & Jadbabaie, 2017; Wang et al., 2018; Xi et al., 2014; Yuan et al., 2018). However, all of these works, except (Shahrampour & Jadbabaie, 2017), consider only time-invariant cost functions in an offline setting.

2.2. Online Optimization

Online optimization techniques take into account the dynamics in a system and allow the cost function to be time-varying (Shalev-Shwartz, 2012; Zinkevich, 2003). In (Mateos-Nunez & Cortés, 2014), distributed online optimization over a time-varying network topology is studied using the subgradients of local functions. The subgradient approach uses Euclidean norms in the projection step. In contrast, our work, based on mirror descent, considers a Bregman divergence function and thus is applicable to a broader range of problems.

For distributed online optimization, the dual averaging approach is studied in (Hosseini et al., 2013; Xiao, 2010). Extensions include approximate dual averaging (Tsianos & Rabbat, 2016) and a primal-dual approach (Nedich et al., 2015). However, these works assume homogeneous computing nodes. The authors in (Ferdinand et al., 2019) consider the straggler issue in learning networks and propose an any-batch dual averaging approach. However, the network topology is fixed over time, which limits its applicability to modern computing systems. Finally, all of these works analyze only the static regret, while we provide a bound on the dynamic regret, which takes into account the time-varying nature of the minimizer.

The work most closely related to ours is (Shahrampour & Jadbabaie, 2017), which studies distributed mirror descent for online optimization over a network with static topology and nodes that are homogeneous in terms of processing capability. Our work may be viewed as a generalization of (Shahrampour & Jadbabaie, 2017) in two directions. First, we account for the heterogeneity among the computing nodes, which motivates an any-batch design that is robust to slow computing nodes, preventing long waits by fast nodes for slow nodes to finish their work. Second, we allow a time-varying network topology. Thus, our approach is more suited to practical computing networks, e.g., mobile cloud systems and networked sensors, where heterogeneity and time-varying topology are unavoidable.

3. Network Model and Problem Formulation

3.1. Heterogeneous Time-Varying Computing Network

We consider a distributed computing network with heterogeneous nodes in a time-slotted setting. The communication topology is assumed to be time-varying, and it is modeled by an undirected graph $G_t = (V, E_t)$, where $V = \{1, \ldots, n\}$ and $E_t \subset V \times V$ denote the set of nodes and edges at time $t$, respectively. Each node $i \in V$ can generally represent a computing unit, e.g., a data center or a mobile device.

Computing nodes exchange information according to the topology graph $G_t$ at time $t$. Each computing node $i$ assigns a non-negative weight $P^t_{ij}$ to the information received from node $j$. To ensure that individual nodes sufficiently communicate with each other, we impose several standard assumptions on the topology graph $G_t$ and the weighting rule (Nedic & Ozdaglar, 2009):

(a) (Weight Rule) There exists a scalar $\eta$ with $0 < \eta < 1$ such that for all $i, j \in V$ and all $t \ge 0$, $P^t_{ij} \ge \eta$ if $(i, j) \in E_t$, and $P^t_{ij} = 0$ otherwise.

(b) (Doubly Stochastic) The weight matrix $P^t$ is doubly stochastic, i.e.,

$$\sum_{j=1}^{n} P^t_{ij} = \sum_{i=1}^{n} P^t_{ij} = 1, \quad \forall t \ge 0. \qquad (1)$$

(c) (Connectivity) The graph $(V, \bigcup_{t \ge k} E_t)$ is connected for all $k$. Furthermore, there exists an integer $B \ge 1$ such that for all $(i, j) \in V \times V$, we have

$$(i, j) \in E_t \cup E_{t+1} \cup \cdots \cup E_{t+B-1}, \quad \forall t \ge 0. \qquad (2)$$

This guarantees that every node $i$ receives the share of information processed by any other node $j$ at least once every $B$ consecutive time slots.
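Assumptions (a) and (b) leave the choice of weights open. For illustration, the Metropolis-Hastings rule (our example, not prescribed by the paper) is one standard construction that produces a symmetric, doubly stochastic weight matrix for any undirected graph; a minimal sketch:

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings weights: build a symmetric, doubly stochastic
    matrix P^t from a 0/1 adjacency matrix `adj` (zero diagonal)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                # Off-diagonal weight on each edge (i, j).
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        # Self-weight absorbs the remaining mass so each row sums to 1.
        P[i, i] = 1.0 - P[i].sum()
    return P

# Example: a 3-node path graph 0 -- 1 -- 2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
P = metropolis_weights(adj)
```

Symmetry makes column stochasticity follow from row stochasticity, so assumption (b) holds by construction, and every edge receives a weight bounded away from zero, as assumption (a) requires.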

3.2. Distributed Online Optimization

We consider the case where data samples arrive at the system over time. Each data sample $i$ is represented by $(\omega_i, z_i)$, where $\omega_i \in \Omega \subseteq \mathbb{R}^d$ and $z_i \in \mathbb{R}$. There is a cost function $f(x, (\omega, z))$ associated with every data sample, where $x \in \mathcal{X}$ denotes a decision variable taken from a compact and convex set $\mathcal{X}$. The cost function is generally defined and may represent, e.g., a measure of prediction accuracy to fit a learning model. For example, logistic regression is obtained by setting $f(x, (\omega, z)) = \log(1 + \exp(-z x^T \omega))$. For brevity, we use $f(x, \omega)$ instead of $f(x, (\omega, z))$ in the rest of the paper. The cost function $f(x, \omega)$ is convex in $x$, for all $\omega \in \mathbb{R}^d$. In addition, we assume $f(x, \omega)$ is Lipschitz continuous on $\mathcal{X}$ with constant $L$, i.e.,

$$|f(x_1, \omega) - f(x_2, \omega)| \le L \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathcal{X}. \qquad (3)$$

This implies that $f(x, \omega)$ has bounded gradients, i.e., $\|\nabla f(x, \omega)\|_* \le L$ for all $\omega \in \Omega$ and $x \in \mathcal{X}$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.
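As a concrete check of this condition, consider the logistic cost above: its gradient in $x$ is $-z\,\omega\,\sigma(-z x^T \omega)$, where $\sigma$ is the sigmoid, so its norm is at most $|z|\,\|\omega\|$, and bounded data yields a concrete Lipschitz constant $L$. A small sketch (function names are ours, for illustration):

```python
import numpy as np

def logistic_loss(x, omega, z):
    """f(x, (omega, z)) = log(1 + exp(-z * x^T omega))."""
    return np.log1p(np.exp(-z * (x @ omega)))

def logistic_grad(x, omega, z):
    """Gradient in x: -z * omega * sigmoid(-z * x^T omega).
    Its Euclidean norm is at most |z| * ||omega||, which bounds the
    Lipschitz constant L of f over any feasible set X."""
    s = 1.0 / (1.0 + np.exp(z * (x @ omega)))  # sigmoid(-z x^T omega)
    return -z * s * omega

x = np.array([0.5, -1.0])
omega = np.array([1.0, 2.0])
z = 1.0
g = logistic_grad(x, omega, z)
```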

In distributed online optimization, the computing nodes individually process the data samples to minimize some global cost function in a round-by-round manner. Each round $t$ corresponds to time slot $t$ and consists of computation and communication phases. Let $b_{i,t}$ be the local minibatch size, i.e., the number of samples that node $i \in V$ is able to process in round $t$. The value of $b_{i,t}$ depends on the processing speed of node $i$ within the computation time.

The local cost function is defined as the sum of the costs incurred for the locally computed samples. We denote the local cost function of node $i$ at round $t$ by $f_{i,t} : \mathcal{X} \to \mathbb{R}$, i.e.,

$$f_{i,t}(x) = \sum_{s=1}^{b_{i,t}} f(x, \omega^s_{i,t}), \qquad (4)$$

where the data samples $\omega^s_{i,t}$, $s = 1, \ldots, b_{i,t}$, are computed locally and collectively form the local minibatch of node $i$ at round $t$.

At each round $t$, after each node $i$ submits a local decision $x_{i,t}$, a minibatch of $b_{i,t}$ samples are processed, and the node suffers a corresponding cost of $f_{i,t}(x_{i,t}) = \sum_{s=1}^{b_{i,t}} f(x_{i,t}, \omega^s_{i,t})$. Thus, these cost functions are time varying. The random samples are assumed to be independent and identically distributed with an unknown and fixed probability distribution.

We are interested in an optimization problem with a global cost function, represented by $f_t : \mathcal{X} \to \mathbb{R}$ at round $t$. It is based on the local functions that are distributed over the entire network, i.e.,

$$f_t(x) = \sum_{i=1}^{n} f_{i,t}(x). \qquad (5)$$

For distributed minimization of the global cost, every computing node $i$ maintains a local estimate of the global minimizer $y_{i,t}$ as well as a local decision vector $x_{i,t}$. Individual nodes perform local averaging of the current decisions collected from their neighbors, using the weight matrix $P^t$. Local estimates $y_{i,t}$ are updated accordingly, which are then used together with the next minibatch $b_{i,t+1}$ to compute the local decision $x_{i,t+1}$ in the next round.

The goal of an online algorithm is to minimize the total accrued cost over a finite number of rounds $T$. The quality of decisions is measured in terms of a popular metric, called regret, which is the difference between the total cost incurred by the online algorithm and that of an optimal offline solution, where all cost functions are known in advance. A successful online algorithm closes the gap between the online decisions and the offline counterpart, when normalized by $T$; in other words, it sustains a regret sublinear in $T$.

Studies on online learning often focus on the static regret, where the offline solution is fixed over time, i.e., $x^* = \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$. Unfortunately, this offline comparison target is often highly suboptimal in most practical streaming settings; e.g., $x^*_t$ may correspond to video frames in a dynamic environment, which by nature are highly variable over time (Hall & Willett, 2015). Furthermore, in this paper we study a fully dynamic scenario, where both the cost functions and the network topology are time-varying. Therefore, to capture the full potential of dynamics in the network, we focus on the notion of dynamic regret (Zinkevich, 2003), which is defined by

$$\text{Reg}^d_T = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} f_t(x_{i,t}) - \sum_{t=1}^{T} f_t(x^*_t), \qquad (6)$$

where $x^*_t = \arg\min_{x \in \mathcal{X}} f_t(x)$ is a minimizer of the global cost at round $t$.

It is well known that in a dynamic setting, the problem may be intractable due to arbitrary fluctuations in the functions (Zhang et al., 2017). However, it is possible to upper bound the dynamic regret in terms of certain regularity of the sequence of minimizers. A popular quantity to represent such regularity is

$$A_T = \sum_{t=1}^{T} \|x^*_{t+1} - x^*_t\|, \qquad (7)$$

which represents the path length of the sequence of minimizers. Thus, $A_T$ serves as a complexity measure that assesses the hardness of an online problem. For instance, it is shown in (Shahrampour & Jadbabaie, 2017) that the dynamic regret of distributed online mirror descent for a convex function in a homogeneous network can be bounded by $O(\sqrt{T(A_T + 1)})$.
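The path length (7) is simple to evaluate once the per-round minimizers are known. A toy sketch, with a hypothetical slowly drifting minimizer sequence standing in for $x^*_t$:

```python
import numpy as np

# Toy drifting minimizers: x*_t follows a slow random walk (hypothetical
# comparator sequence; in practice x*_t comes from solving each round).
rng = np.random.default_rng(0)
T = 100
x_star = np.cumsum(0.01 * rng.standard_normal((T + 1, 2)), axis=0)

# Path length A_T = sum_t ||x*_{t+1} - x*_t||, as in equation (7).
A_T = sum(np.linalg.norm(x_star[t + 1] - x_star[t]) for t in range(T))
```

A slowly drifting comparator gives a small $A_T$ and hence a small dynamic regret bound; an erratic comparator inflates $A_T$, reflecting a genuinely harder online problem.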

For the analysis of the dynamic regret of DABMD, we face multiple challenges that need to be carefully considered. In particular, heterogeneity adds substantial challenge to the algorithm design. Since the computation time is fixed for all nodes, the minibatch size $b_{i,t}$ is a random variable. This is in contrast to traditional methods, where the minibatch size is fixed among the computing nodes and over time.

Before delving into the details of DABMD, we proceed by stating several standard definitions. Further discussion on some preliminary information is given in App. A.

Definition 1: A subdifferentiable function $r : \mathbb{R}^d \to \mathbb{R}$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$ if there exists a positive constant $\sigma$ such that

$$r(x) \ge r(y) + \langle \nabla r(y), x - y \rangle + \frac{\sigma}{2} \|x - y\|^2, \quad \forall x, y, \qquad (8)$$

where $\nabla r(\cdot)$ stands for the subgradient of the function $r(\cdot)$.
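Two standard regularizers illustrate Definition 1 (well-known facts from the mirror descent literature, not claims of this paper):

```latex
% Squared Euclidean regularizer: sigma = 1 with respect to the l2 norm,
% since the inequality in (8) holds with equality:
r(x) = \tfrac{1}{2}\|x\|_2^2
\;\Rightarrow\;
r(x) - r(y) - \langle \nabla r(y),\, x - y \rangle = \tfrac{1}{2}\|x - y\|_2^2 .

% Negative entropy: sigma = 1 with respect to the l1 norm on the
% probability simplex, by Pinsker's inequality:
r(x) = \sum_{i} x_i \log x_i .
```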

Definition 2: The Bregman divergence with respect to the function $r(\cdot)$ is defined as

$$D_r(x, y) = r(x) - r(y) - \langle \nabla r(y), x - y \rangle. \qquad (9)$$

Bregman divergences are a general class of distance-measuring functions, which contains the squared Euclidean distance and the Kullback-Leibler divergence as two special cases.
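Definition 2 can be checked directly for the two special cases just mentioned: the squared Euclidean regularizer recovers the squared Euclidean distance, and the negative entropy recovers the KL divergence on the simplex. A small sketch (helper names are ours):

```python
import numpy as np

def bregman(r, grad_r, x, y):
    """D_r(x, y) = r(x) - r(y) - <grad r(y), x - y>, as in equation (9)."""
    return r(x) - r(y) - grad_r(y) @ (x - y)

# Case 1: r(x) = 0.5 ||x||^2 yields the squared Euclidean distance.
sq = lambda v: 0.5 * (v @ v)
sq_grad = lambda v: v

# Case 2: negative entropy r(x) = sum_i x_i log x_i (on the simplex)
# yields the Kullback-Leibler divergence.
negent = lambda v: np.sum(v * np.log(v))
negent_grad = lambda v: np.log(v) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
d_euc = bregman(sq, sq_grad, x, y)          # 0.5 * ||x - y||^2
d_kl = bregman(negent, negent_grad, x, y)   # sum_i x_i log(x_i / y_i)
```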


4. Distributed Any-Batch Mirror Descent

Now we present DABMD, a distributed online optimization approach for evolving dynamic networks with heterogeneous computing nodes. We bound the performance of the proposed DABMD algorithm in terms of its expected dynamic regret.

4.1. Algorithm Description

DABMD uses mirror descent at its core as an optimization engine, powered by averaging consensus, enabling nodes to collaboratively approximate a globally optimal solution through local interactions. In contrast to traditional distributed methods, where a classical but impractical requirement is that each computing node processes an equal amount of workload in a given time, DABMD is designed to accommodate heterogeneity in a computing system. The computing time for all nodes is fixed, and thus, faster nodes no longer need to wait for slower ones to report their accomplished work. As a result, the performance of DABMD is no longer held back by the slowest nodes. This is particularly important in a large network consisting of many nodes with differing processing capabilities, where slow nodes hold up the network, often at a great cost to the overall system performance. Next, we describe DABMD in detail and establish a bound on its performance through regret analysis. Every round of DABMD is comprised of three major steps:

Any-Batch Computation: The computation time in each round is set to a fixed value for all nodes. Individual nodes compute the gradients of as many samples as they can within the fixed computation time. In particular, every computing node $i$ computes $b_{i,t}$ gradients of $f(x, \omega)$, evaluated at the local estimate of the global minimizer $x = y_{i,t}$. Resulting from heterogeneity, the computing nodes process varying workloads within each round. This implies that the batch size $b_{i,t}$ is a random variable, in contrast to classical approaches based on a fixed minibatch size over time and across the network. By the end of each round, each node computes the local minibatch gradient, i.e.,

$$g_{i,t} = \frac{1}{b_{i,t}} \sum_{s=1}^{b_{i,t}} \nabla_y f(y_{i,t}, \omega^s_{i,t}). \qquad (10)$$

Local Decision Update: After computing the local minibatch gradient, each computing node $i$ performs the following update on the local decision:

$$x_{i,t+1} = \arg\min_{x \in \mathcal{X}} \Big\{ \langle x, g_{i,t} \rangle + \frac{1}{\alpha_t} D_r(x, y_{i,t}) \Big\}, \qquad (11)$$

where $\alpha_t$ is a sequence of positive and non-increasing step sizes, and $D_r(\cdot, \cdot)$ is the Bregman divergence corresponding to the regularization function. Recall that $x_{i,t}$ and $y_{i,t}$ are respectively the local decision and the local estimate of the global minimizer at round $t$. Thus, (11) suggests that every node $i$ aims to stay close to the locally approximated global minimizer $y_{i,t}$, as measured by the Bregman divergence, while taking a step in a direction close to $g_{i,t}$ to reduce the local cost at the current round.

Consensus Averaging: Consensus averaging is a mechanism to coordinate and synchronize the decision variables at different computing nodes. After each node has processed its share of local information and updated its local decision, it sends a message to the neighboring nodes. At round $t$, after receiving the messages from all neighbors, computing node $i$ updates its estimate of the global minimizer by

$$y_{i,t} = \sum_{j=1}^{n} P^t_{ij} x_{j,t}. \qquad (12)$$

The objective of this step is to give each node an opportunity to obtain some information about the global cost. It allows nodes to cooperatively approximate the global cost function. In particular, as opposed to approaches built on distributed dual averaging, where subgradient information is exchanged, here each node updates its local primal variable according to the local minibatch gradients and then shares the primal information within its local neighborhood.
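The three steps above can be sketched in a few lines. The sketch below is illustrative only (names such as `dabmd_round` and `project_ball` are ours, not the paper's): it instantiates the Bregman divergence as the squared Euclidean distance, under which update (11) reduces to a projected gradient step, and it simulates the fixed time budget by handing each node whatever minibatch it happened to finish.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto X = {x : ||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def dabmd_round(y, P, samples, grad_f, alpha):
    """One DABMD round with the squared-Euclidean Bregman divergence.
    y[i]       : node i's current estimate of the global minimizer
    P          : doubly stochastic weight matrix for this round
    samples[i] : the b_{i,t} samples node i finished within the time budget
    """
    n = len(y)
    x_next = np.empty_like(y)
    for i in range(n):
        # Any-batch computation (10): average gradient over node i's minibatch.
        g = np.mean([grad_f(y[i], w) for w in samples[i]], axis=0)
        # Local decision update (11): with D_r the squared Euclidean distance,
        # the mirror step reduces to a projected gradient step from y[i].
        x_next[i] = project_ball(y[i] - alpha * g)
    # Consensus averaging (12): mix the neighbors' primal variables.
    return P @ x_next, x_next
```

With a different regularizer, e.g., negative entropy on the simplex, only the local decision update changes; the any-batch and consensus steps are unaffected.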

4.2. Theoretical Results

We begin analyzing the performance of DABMD with a convergence result on the local decision variables.

We define $\Phi(k, s) = P^s P^{s+1} \cdots P^k$ as the transition matrix that records the weight history from round $s$ to $k$. The convergence properties of this matrix are extensively analyzed in (Nedic & Ozdaglar, 2009). In this work, we use the following result:

$$\Big| \Phi(k, s)_{ij} - \frac{1}{n} \Big| \le \gamma \Gamma^{k-s}, \qquad (13)$$

where $\gamma = \frac{2(1 + \eta^{-B_0})}{1 - \eta^{B_0}}$, $\Gamma = (1 - \eta^{B_0})^{1/B_0}$, and $B_0 = (n - 1)B$ is a positive integer. We refer the reader to Proposition 1 in (Nedic & Ozdaglar, 2009) for further discussion and proof.
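The geometric decay in (13) is easy to observe numerically: multiply weight matrices round after round and track the maximum deviation of $\Phi(k, s)$ from the uniform matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^T$. A small sketch with a simple fixed ring topology (our choice, for illustration):

```python
import numpy as np

# Doubly stochastic weights for a ring of n nodes: each node mixes equally
# with its two neighbors and itself (a simple choice satisfying the weight
# and connectivity assumptions with B = 1).
n = 8
P = 0.5 * np.eye(n)
for i in range(n):
    P[i, (i - 1) % n] += 0.25
    P[i, (i + 1) % n] += 0.25

# Phi(k, 0) = P^(k+1); its entries approach 1/n geometrically, as in (13).
Phi = np.eye(n)
devs = []
for k in range(60):
    Phi = Phi @ P
    devs.append(np.abs(Phi - 1.0 / n).max())
```

The observed decay rate is governed by the second-largest eigenvalue modulus of $P$; the pair $(\gamma, \Gamma)$ in (13) is a worst-case bound on the same quantity for time-varying weight sequences.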

Now, we are ready to present an upper bound on the deviation of the local decisions from the exact average in the following lemma.

Lemma 1 Let $x_{i,t}$ be the sequence of local decisions generated by DABMD, with all initial local decisions set to zero vectors. Suppose every node uses a $\sigma$-strongly convex regularizer $r(\cdot)$ and a non-increasing step size sequence $\alpha_t$. Then, the following bound holds on $x_{i,t+1}$:

$$\|x_{i,t+1} - \bar{x}_{t+1}\| \le \frac{\alpha_0 L}{\sigma} \Big( 2 + \frac{\gamma}{1 - \Gamma} \Big),$$

where $\bar{x}_{t+1} = \frac{1}{n} \sum_{i=1}^{n} x_{i,t+1}$.

Lemma 1 is proved in App. C of the supplementary material.

We observe that the error bound above relates to the network graph through the parameters $\gamma$ and $\Gamma$. Smaller values of $\Gamma$ and $\gamma$ result in the local decisions moving closer to their average. In other words, a shorter communication interval $B$ and a larger minimum link weight $\eta$ would help individual nodes to collect the information of all other computing nodes across the network at a faster pace. Thus, each computing node can approximate $\bar{x}_{t+1}$ with a higher accuracy.

Remark 1. Similar to Lemma 1, the convergence of local decisions is investigated in Lemma 1 of (Shahrampour & Jadbabaie, 2017). However, that analysis is based on the assumption that an oracle provides a common knowledge matrix to all computing nodes, and it requires the weight matrix to be fixed. Hence, their work cannot be extended to the case of a time-varying network, such as in mobile cloud computing. In contrast, our analysis obviates the common knowledge requirement and is applicable to a general time-varying network, where in addition to the weight matrix, even the connections can be time-varying.

The succeeding theorem provides an upper bound on the expected dynamic regret, and it is followed by a corollary characterizing the regret bound for an optimized, fixed step size.

Theorem 2 Let $x_{i,t}$ be the sequence of local decisions generated by DABMD. Let $b_{\mathrm{avg}} = \mathbb{E}[b_{i,t}]$, $b_{\min} = \min_{i,t} b_{i,t}$, and $b_{\max} = \max_{i,t} b_{i,t}$ be the mean value, minimum, and maximum of the minibatch sizes across the nodes and over time. The expected dynamic regret satisfies
\[
\mathbb{E}\big[\mathrm{Reg}^d_T\big]
= \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\mathbb{E}\big[f_t(x_{i,t}) - f_t(x^*_t)\big]
\le \sum_{t=1}^{T}\frac{2 n b_{\mathrm{avg}} L^2 \alpha_0}{\sigma}\left(2+\frac{n\gamma}{1-\Gamma}\right)
+ \sum_{t=1}^{T}\frac{V_b L^2 n \alpha_t}{2\sigma b_{\min}}
+ \sum_{t=1}^{T}\frac{M n b_{\max}\,\mathbb{E}\big[\|x^*_{t+1}-x^*_t\|\big]}{\alpha_{T+1}}
+ \frac{2 n b_{\max} R^2}{\alpha_{T+1}}, \tag{14}
\]
where $R^2 = \sup_{x,y\in\mathcal{X}} D_r(x,y)$ and $V_b = \mathbb{E}[b_{i,t}^2]$.

The proof of Theorem 2 is given in App. D.

Remark 2. The regret bound obtained above depends on the heterogeneity in processing speed only through the parameter statistics $b_{\max}$, $b_{\min}$, $b_{\mathrm{avg}}$, and $V_b$. In other words, these parameters capture and summarize the effect of varying processing speed among the computing nodes. Moreover, the impact of a time-varying network graph with an imperfect averaging protocol is summarized in the first component. More specifically, it is through this term that the network connectivity interval $B$, the minimum link weight $\eta$, and the precision of local estimates of the global minimizer appear in the result. Furthermore, the impact of the dynamic offline comparator $x^*_t$ is reflected in the third term.

Corollary 3 Under the same condition stated in Theorem 2, with fixed step size
\[
\alpha \in O\left(\sqrt{b_{\max}(A_T+1)\Big/\left[T\left(\frac{2 n \gamma b_{\mathrm{avg}}}{1-\Gamma} + \frac{V_b}{b_{\min}}\right)\right]}\,\right),
\]
we have
\[
\mathbb{E}\big[\mathrm{Reg}^d_T\big] \le O\left(\sqrt{T (A_T+1)\, b_{\max}\left(\frac{2 n \gamma b_{\mathrm{avg}}}{1-\Gamma} + \frac{V_b}{b_{\min}}\right)}\,\right), \tag{15}
\]
where $A_T = \sum_{t=1}^{T}\mathbb{E}\big[\|x^*_{t+1}-x^*_t\|\big]$ denotes the expected value of the total variation of the dynamic minimizer.

This corollary directly follows from substituting the fixed step size $\alpha$ into the bound in (14).
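For intuition, the order of this step size can be recovered from a standard balancing argument (a sketch with constants absorbed, not the authors' full derivation). With a fixed $\alpha$, the bound in (14) has the form

```latex
\mathbb{E}\big[\mathrm{Reg}^d_T\big] \;\lesssim\;
\alpha T \underbrace{\left(\frac{2 n \gamma b_{\mathrm{avg}}}{1-\Gamma}
  + \frac{V_b}{b_{\min}}\right)}_{C_1}
\;+\; \frac{\overbrace{b_{\max}(A_T+1)}^{C_2}}{\alpha}.
```

An expression of the form $\alpha T C_1 + C_2/\alpha$ is minimized at $\alpha^* = \sqrt{C_2/(T C_1)}$, giving the value $2\sqrt{T C_1 C_2}$, which matches the rate in Corollary 3.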

Remark 3. We remark that our result recovers the regret rates previously derived for two special cases. When computing nodes are homogeneous, $b_{i,t}$ is no longer a random variable. Instead, the minibatch size is fixed across the network and over time. In this case, the terms involving the distribution characteristics disappear. Thus, we have a regret bound of $O(\sqrt{T(A_T+1)})$, which recovers the result of (Shahrampour & Jadbabaie, 2017), in which distributed online mirror descent was studied. In addition, if the offline minimizers are fixed, the problem is reduced to minimizing the static regret. In this case, the path length $A_T$ is equal to zero and we obtain a regret bound of $O(\sqrt{T})$, which recovers the result of (Tsianos & Rabbat, 2016) on distributed online dual averaging.

Remark 4. The regret bound scales with the expected path length $A_T$, which collects the mismatch errors between successive offline minimizers. When the minimizer sequence drifts slowly, the overall error is small and $A_T$ can be of a constant order. On the other hand, severe fluctuation in the objective function or network topology can result in large mismatch errors, which can cause $A_T$ to become linear in time. Such behavior is natural, since even for centralized online optimization the problem is generally intractable under worst-case system fluctuation (Zhang et al., 2017). Nevertheless, our goal is to treat $A_T$ as a complexity measure of the problem environment and derive the regret bound in terms of this quantity. We observe that if $A_T$ is sublinear, then the dynamic regret is also sublinear.


5. Multiple Consensus Iterations

In this section, we study a more general version of DABMD, which uses multiple consensus-averaging iterations. Here we remark that the analysis of the regret bound for this general case requires new techniques different from those used in the previous section. As a result, the obtained regret bound for $k$ consensus-averaging iterations is different from our previous result even for $k=1$, therefore necessitating presentation in a separate section.

Similar to the previous case, this version also contains three steps. However, in the previous case, computing nodes perform only a single consensus iteration. Thus, in estimating the global minimizer, each node only receives the messages of its immediate neighbors, which are one hop away, and is oblivious to the decisions of nodes that are farther away. This issue can be resolved via multiple consensus-averaging iterations. In particular, with $k$ iterations of consensus averaging, computing nodes receive the messages of nodes that are $k$ hops away. Therefore, they can more precisely approximate the global minimizer.
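To make the role of repeated averaging concrete, the following sketch (our own illustration, not the authors' code; the weight-matrix construction is an illustrative choice) runs $k$ consensus iterations with a doubly stochastic matrix and checks that every node's estimate approaches the exact network average:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10, 3  # number of nodes and decision dimension
# An illustrative doubly stochastic weight matrix: a convex combination of the
# identity, a random permutation matrix, and the uniform matrix. The uniform
# component makes every entry positive, so the second-largest eigenvalue
# modulus is strictly below one and the iterations contract toward the average.
perm = np.eye(n)[rng.permutation(n)]
P = 0.8 * (0.5 * np.eye(n) + 0.5 * perm) + 0.2 * np.full((n, n), 1.0 / n)

# Initial local decisions y^(0)_{i,t} = x_{i,t}, one row per node.
y = rng.normal(size=(n, d))
exact_average = y.mean(axis=0)

# k consensus iterations: y^(l) = P y^(l-1), hence y^(k) = P^k y^(0).
k = 50
for _ in range(k):
    y = P @ y

deviation = np.abs(y - exact_average).max()
print(deviation < 1e-3)  # True: every node is close to the exact average
```

With more iterations the contraction factor compounds, which is why $k$-hop averaging yields a more precise approximation of the global average than a single iteration.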

Let $y^{(0)}_{i,t} = x_{i,t}$ denote the initial message vector before beginning the consensus-averaging step. Let $y^{(l)}_{i,t}$ denote the vector output at computing node $i$ and round $t$, after $l$ consensus iterations using matrix $P_t$. At each iteration $l$, after receiving messages from all its neighbors, node $i$ updates its estimate of the global minimizer by
\[
y^{(l)}_{i,t} = \sum_{j=1}^{n} [P_t]_{ij}\, y^{(l-1)}_{j,t} = \sum_{j=1}^{n} [P_t^{\,l}]_{ij}\, y^{(0)}_{j,t}. \tag{16}
\]

As long as the graph $G_t$, corresponding to the weight matrix $P_t$, is connected and the second-largest eigenvalue of $P_t$ is strictly less than unity, the consensus iterations are guaranteed to converge to the exact average. However, for a finite number of iterations, each node will have an error in its estimate. The following lemma bounds the consensus error after $k$ iterations.

Lemma 4 Let $y^{(k)}_{i,t}$ represent the output of the consensus averaging after $k$ iterations with updates (16). Under the same condition stated in Lemma 1, the following bound holds:
\[
\left\|y^{(k)}_{i,t} - \bar{x}_t\right\| \le \frac{2\sqrt{n}}{\exp[k n^{-3}]}\,\frac{\alpha_0 L}{\sigma}\left(2 + \frac{n\gamma^{(k)}}{1-\Gamma^{(k)}}\right),
\]
where $\bar{x}_t = \frac{1}{n}\sum_i x_{i,t}$, $\gamma^{(k)} = 2\big(1+\eta_k^{-B_0}\big)/\big(1-\eta_k^{B_0}\big)$, $\Gamma^{(k)} = \big(1-\eta_k^{B_0}\big)^{1/B_0}$, and $\eta_k = \min_{i,j}\,[P_t^{\,k}]_{i,j}$.

Lemma 4 is proved in App. E of the supplementary material.

The following theorem establishes an upper bound on the expected dynamic regret, followed by a corollary characterizing the regret rate for a fixed step size.

Theorem 5 Let $x_{i,t}$ be the sequence of local decisions generated by DABMD with $k$ consensus-averaging iterations. The expected dynamic regret satisfies
\[
\mathbb{E}\big[\mathrm{Reg}^d_T\big]
\le \sum_{t=1}^{T}\frac{b_{\mathrm{avg}} L^2 \alpha_0 n}{\sigma}\left(2+\frac{n\gamma^{(k)}}{1-\Gamma^{(k)}}\right)\left(1+\frac{2\sqrt{n}}{\exp[k n^{-3}]}\right)
+ \sum_{t=1}^{T}\frac{V_b L^2 n \alpha_t}{2\sigma b_{\min}}
+ \sum_{t=1}^{T}\frac{M n b_{\max}\,\mathbb{E}\big[\|x^*_{t+1}-x^*_t\|\big]}{\alpha_t}
+ \frac{2 n R^2 b_{\max}}{\alpha_{T+1}}. \tag{17}
\]

The proof of Theorem 5 is given in App. F.

Corollary 6 Under the same condition stated in Theorem 5, with fixed step size
\[
\alpha \in O\left(\sqrt{\frac{b_{\max}(A_T+1)}{T\left(\dfrac{2 n\sqrt{n}\,\gamma^{(k)} b_{\mathrm{avg}}}{(1-\Gamma^{(k)})\exp[k n^{-3}]} + \dfrac{V_b}{b_{\min}}\right)}}\,\right),
\]
we have
\[
\mathbb{E}\big[\mathrm{Reg}^d_T\big] \le O\left(\sqrt{T (A_T+1)\, b_{\max}\left(\frac{2 n\sqrt{n}\,\gamma^{(k)} b_{\mathrm{avg}}}{(1-\Gamma^{(k)})\exp[k n^{-3}]} + \frac{V_b}{b_{\min}}\right)}\,\right). \tag{18}
\]

This corollary directly follows from substituting the fixed step size $\alpha$ into the bound in (17).

Remark 5. We remark that the regret bound of DABMD with multiple consensus iterations in (18) is of the same order as that of DABMD with a single consensus iteration in (15). The effect of multiple consensus iterations is mainly summarized in $\gamma^{(k)}$, $\Gamma^{(k)}$, and $\exp[k n^{-3}]$, where both $\gamma^{(k)}$ and $\Gamma^{(k)}$ are functions of the transition matrix, and $\exp[k n^{-3}]$ results from bounding the second-largest eigenvalue of a doubly stochastic matrix (Landau & Odlyzko, 1981).

6. Experiments

We investigate the performance of DABMD via numerical experiments on two different machine learning tasks using real-world datasets. We compare DABMD with the method proposed in (Shahrampour & Jadbabaie, 2017), which was termed Distributed Online Mirror Descent (DOMD). Due to the page limitation, here we focus on the case of a single consensus iteration. Additional experiments are presented in App. G.

6.1. Logistic Regression

In the first experiment, we consider multi-class classification with logistic regression. In this task, learner nodes observe a sequence of labeled examples $(\omega, z)$, where $\omega \in \mathbb{R}^d$, and


[Figure 1. Performance comparison between DOMD and DABMD on MNIST and YouTube transcoding datasets. Panels: (a) accumulated logistic cost of the MNIST dataset; (b) logistic cost of the MNIST dataset; (c) accumulated ridge cost of the YouTube transcoding dataset; (d) ridge cost of the YouTube transcoding dataset. Each panel plots DABMD (labeled by $b_{\mathrm{avg}}$) against DOMD (labeled by $b$) for several minibatch sizes.]

the label $z$, denoting the class of the data example, is drawn from a discrete space $\mathcal{Z} = \{1, 2, \ldots, c\}$. We use the well-known MNIST digits dataset, where every data sample $\omega$ is an image of size $28 \times 28$ pixels that can be represented by a 784-dimensional vector, i.e., $d = 784$. Each sample corresponds to one of the digits in $\{0, 1, \ldots, 9\}$, and hence there are $c = 10$ classes.

We consider a network consisting of $n = 10$ distributed nodes. The underlying network topology switches sequentially, in a round-robin manner, among several generated graphs with doubly stochastic weight matrices. In addition, to model the processing speed of the distributed nodes, similar to the studies on stragglers in (Dutta et al., 2018) and (Lee et al., 2017), we adopt a shifted exponential distribution. This captures a minimum computing time, denoted by $\epsilon$, to process a sample, while the remaining computing time is memoryless. More specifically, the computing time of each sample is modeled as a random variable with probability density function of the form $P(r) = \lambda \exp(-\lambda(r-\epsilon))$. We set $\lambda = 0.5$ and $\epsilon = 1$. For a fair comparison, we extend DOMD to have a fixed minibatch size close to the value of $b_{\mathrm{avg}}$, which is found empirically.
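The varying minibatch sizes induced by this timing model can be sketched as follows (our own illustration; the round length `round_time` is a hypothetical value, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

n_nodes, n_rounds = 10, 1000
lam, eps = 0.5, 1.0   # exponential rate lambda and minimum per-sample time epsilon
round_time = 60.0     # fixed per-round computing time (hypothetical value)

def minibatch_size(rng):
    """Count the samples a node completes before the round deadline expires."""
    elapsed, count = 0.0, 0
    while True:
        # Shifted exponential: the minimum time eps plus a memoryless remainder.
        elapsed += eps + rng.exponential(1.0 / lam)
        if elapsed > round_time:
            return count
        count += 1

# b[t, i] plays the role of b_{i,t}: it differs across nodes and over rounds.
b = np.array([[minibatch_size(rng) for _ in range(n_nodes)]
              for _ in range(n_rounds)])

# The statistics that appear in the regret bound of Theorem 2.
b_avg, b_min, b_max, V_b = b.mean(), b.min(), b.max(), (b ** 2).mean()
print(b_avg, b_min, b_max, V_b)
```

Since the mean per-sample time is $\epsilon + 1/\lambda = 3$, each node finishes roughly `round_time / 3` samples per round, with the spread across nodes and rounds producing the statistics $b_{\mathrm{avg}}$, $b_{\min}$, $b_{\max}$, and $V_b$.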

For logistic regression, the cost function associated with each data point is given by
\[
f(x, (\omega_i, z_i)) = \log\big(1 + \exp(-z_i x^T \omega_i)\big).
\]
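For concreteness, the sketch below (our own illustration on synthetic data, not the authors' implementation) evaluates this cost and its minibatch gradient, and applies the mirror-descent update in the special case of the Euclidean regularizer $r(x) = \tfrac{1}{2}\|x\|^2$, where the update on an unconstrained domain reduces to a gradient step:

```python
import numpy as np

def logistic_cost(x, W, z):
    """Average logistic cost log(1 + exp(-z_i x^T w_i)) over a minibatch."""
    margins = z * (W @ x)
    return np.mean(np.log1p(np.exp(-margins)))

def logistic_grad(x, W, z):
    """Minibatch gradient of the logistic cost above."""
    margins = z * (W @ x)
    coeff = -z / (1.0 + np.exp(margins))  # d/dm log(1+exp(-m)) = -1/(1+exp(m))
    return (coeff[:, None] * W).mean(axis=0)

def mirror_descent_step(x, grad, alpha):
    # With r(x) = ||x||^2 / 2, the mirror-descent update on an unconstrained
    # domain is exactly a gradient step; other regularizers change the map.
    return x - alpha * grad

rng = np.random.default_rng(2)
W = rng.normal(size=(32, 784))        # a minibatch of flattened 28x28 images (synthetic)
z = rng.choice([-1.0, 1.0], size=32)  # binary labels, for illustration only
x = np.zeros(784)
for _ in range(200):
    x = mirror_descent_step(x, logistic_grad(x, W, z), alpha=0.05)
print(logistic_cost(x, W, z) < np.log(2))  # True: cost drops from the log(2) start
```

The minibatch size (32 here) is exactly the quantity $b_{i,t}$ that DABMD lets vary per node and per round; the step size is an illustrative choice.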

The goal of the computing nodes is to classify streaming images online by tracking the unknown optimal parameter $x^*_t$.

In Fig. 1, we compare the performance of DABMD with DOMD for different minibatch sizes $b$. From Fig. 1(a) and Fig. 1(b), we see that DABMD saves up to 10% in accumulated cost after 1000 rounds, and it reduces the convergence time by up to 30% compared with DOMD.

6.2. Ridge Regression

In our next experiment, we consider the ridge regression problem. The cost function for each data sample is given by
\[
f(x, (\omega_i, z_i)) = (x^T \omega_i - z_i)^2.
\]

We use YouTube transcoding measurements from a publicly available dataset (Deneke et al., 2014). It contains $6\times 10^4$ data samples, representing the input and output characteristics of a video transcoding application. The input data space has dimension $d = 20$, corresponding to video transcoding attributes such as resolution, bit rate, and required memory. Using this dataset, the goal is to predict the transcoding time based on the video features.

Similar to (Champati & Liang, 2017), we consider a Pareto distribution to model the processing time required to finish a transcoding task. By adjusting the Pareto tail index, we can control how unevenly the processing times are distributed, and thus capture the heterogeneity of the network in terms of processing speed. We set the Pareto tail index and scale parameter to 1.6 and 1.0, respectively. The network topology is time-varying and is generated in the same way as in the previous experiment.

From Figs. 1(c) and 1(d), we observe that the performance advantage of DABMD is more pronounced when the computing time has a heavy-tailed distribution. This is because, with a heavy-tailed distribution, it is more likely to observe some relatively long computing times in DOMD, which cause significant delay and lead to inefficiency. DOMD performs particularly poorly when $b$ is large. In comparison, DABMD saves up to 30% in accumulated cost after 30 rounds, and reduces the convergence time by up to 20%.
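To illustrate why a heavy tail hurts a fixed-minibatch scheme, the following sketch (our own illustration with a hypothetical fixed round length; not the authors' experimental code) compares the average round duration of DOMD, which waits for the slowest node to finish its $b$ samples, against a fixed DABMD round time:

```python
import numpy as np

rng = np.random.default_rng(3)

n_nodes, n_rounds, b = 10, 1000, 35
tail_index, scale = 1.6, 1.0  # Pareto parameters from the experiment

# Per-sample processing times: classical Pareto with minimum value `scale`
# (numpy's pareto() samples the Lomax form, hence the 1 + ... shift).
times = scale * (1.0 + rng.pareto(tail_index, size=(n_rounds, n_nodes, b)))

# DOMD: every round lasts until the slowest node finishes its b samples.
domd_rounds = times.sum(axis=2).max(axis=1)

# DABMD: a fixed round time, set (hypothetically) so that an average node
# processes about b samples per round.
mean_sample_time = tail_index * scale / (tail_index - 1.0)  # = 8/3
dabmd_round = b * mean_sample_time

print(domd_rounds.mean() / dabmd_round)  # ratio > 1: stragglers slow DOMD down
```

Because the tail index 1.6 implies infinite variance, the maximum over nodes is occasionally very large, which is precisely the straggler effect that DABMD's fixed round time avoids.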

7. Conclusion

We have studied the problem of distributed online optimization over a heterogeneous, time-varying network. The proposed DABMD algorithm imposes a fixed per-round processing time interval for all computing nodes. Thus, a key feature of DABMD is that the optimization progress does not solely depend on the slowest computing node. We discussed two versions of DABMD, depending on whether the computing nodes average their primal variables via single or multiple consensus iterations. We guarantee the performance of DABMD by providing upper bounds on the expected dynamic regret. Experimental results on MNIST and YouTube transcoding datasets show substantial improvement in both the accumulated cost and the convergence speed compared with existing alternatives. We observe that the performance advantage of using a fixed round time is more significant when the sample processing time is drawn from a heavy-tailed distribution.


References

Bauschke, H. H. and Borwein, J. M. Joint and separate convexity of the Bregman distance. Studies in Computational Mathematics, 8:22–36, 2001.

Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In Proc. Advances in Neural Information Processing Systems, pp. 161–168, 2008.

Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Champati, J. P. and Liang, B. Single restart with time stamps for computational offloading in a semi-online setting. In Proc. IEEE Conference on Computer Communications (INFOCOM), 2017.

Deneke, T., Haile, H., Lafond, S., and Lilius, J. Video transcoding time prediction for proactive load balancing. In Proc. IEEE International Conference on Multimedia and Expo (ICME), 2014.

Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In Proc. Conference on Learning Theory (COLT), 2010.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.

Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In Proc. International Conference on Artificial Intelligence and Statistics, 2018.

Ferdinand, N., Al-Lawati, H., Draper, S., and Nokleby, M. Anytime minibatch: Exploiting stragglers in online distributed optimization. In Proc. International Conference on Learning Representations (ICLR), 2019.

Hall, E. C. and Willett, R. M. Online convex optimization in dynamic environments. IEEE Journal of Selected Topics in Signal Processing, 9(4):647–662, 2015.

Hosseini, S., Chapman, A., and Mesbahi, M. Online distributed optimization via dual averaging. In Proc. IEEE Conference on Decision and Control (CDC), pp. 1484–1489, 2013.

Landau, H. and Odlyzko, A. Bounds for eigenvalues of certain stochastic matrices. Linear Algebra and its Applications, 38:5–15, 1981.

Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., and Ramchandran, K. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, 2017.

Li, J., Chen, G., Dong, Z., and Wu, Z. Distributed mirror descent method for multi-agent optimization with delay. Neurocomputing, 177:643–650, February 2016.

Lin, M., Wierman, A., Andrew, L., and Thereska, E. Dynamic right-sizing for power-proportional data centers. IEEE/ACM Transactions on Networking, 21(5):1378–1391, 2013.

Mateos-Nunez, D. and Cortés, J. Distributed online convex optimization over jointly connected digraphs. IEEE Transactions on Network Science and Engineering, 1(1):23–37, 2014.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Nedich, A., Lee, S., and Raginsky, M. Decentralized online optimization with global objectives and local communication. In Proc. American Control Conference (ACC), 2015.

Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Rabbat, M. Multi-agent mirror descent for decentralized stochastic optimization. In Proc. IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015.

Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Proc. IEEE Conference on Decision and Control (CDC), 2012.

Shahrampour, S. and Jadbabaie, A. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 63(3):714–725, 2017.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Shi, M., Lin, X., Fahmy, S., and Shin, D.-H. Competitive online convex optimization with switching costs and ramp constraints. In Proc. IEEE Conference on Computer Communications (INFOCOM), 2018.

Tandon, R., Lei, Q., Dimakis, A. G., and Karampatziakis, N. Gradient coding: Avoiding stragglers in distributed learning. In Proc. International Conference on Machine Learning (ICML), 2017.

Tsianos, K. I. and Rabbat, M. G. Distributed dual averaging for convex optimization under communication delays. In Proc. American Control Conference (ACC), 2012.

Tsianos, K. I. and Rabbat, M. G. Efficient distributed online prediction and stochastic optimization with approximate distributed averaging. IEEE Transactions on Signal and Information Processing over Networks, 2(4):489–506, 2016.

Tsitsiklis, J., Bertsekas, D., and Athans, M. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

Wang, H., Liao, X., Huang, T., and Li, C. Cooperative distributed optimization in multiagent networks with delays. IEEE Transactions on Systems, Man, and Cybernetics, 45(2):363–369, 2014.

Wang, Y., Zhou, H., and Hong, Y. Distributed stochastic mirror descent algorithm over time-varying network. In Proc. IEEE Conference on Control and Automation (ICCA), 2018.

Xi, C., Wu, Q., and Khan, U. A. Distributed mirror descent over directed graphs. arXiv preprint arXiv:1412.5526, 2014.

Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

Yuan, D., Xu, S., Zhao, H., and Rong, L. Distributed dual averaging method for multi-agent optimization with quantized communication. Systems & Control Letters, 61(11):1053–1061, 2012.

Yuan, D., Hong, Y., Ho, D. W., and Jiang, G. Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica, 90:196–203, 2018.

Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. Improved dynamic regret for non-degenerate functions. In Proc. Advances in Neural Information Processing Systems, pp. 732–741, 2017.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proc. International Conference on Machine Learning (ICML), 2003.