Transcript of: Stochastic Proximal Gradient Consensus Over Time-Varying Multi-Agent Networks (slides: people.ece.umn.edu/~mhong/DYSPGC.pdf)

Stochastic Proximal Gradient Consensus Over Time-Varying Multi-Agent Networks

Mingyi Hong, joint work with Tsung-Hui Chang

IMSE and ECE Department, Iowa State University

Presented at INFORMS 2015

Mingyi Hong (Iowa State University) 1 / 37


Main Content

Setup: Optimization over a time-varying multi-agent network

Mingyi Hong (Iowa State University) 2 / 37


Main Results

An algorithm for a large class of convex problems with rate guarantees

Connections among a number of popular algorithms

Mingyi Hong (Iowa State University) 3 / 37


Outline

1 Review of Distributed Optimization

2 The Proposed Algorithm: The Proposed Algorithms / Distributed Implementation / Convergence Analysis

3 Connection to Existing Methods

4 Numerical Results

5 Concluding Remarks

Mingyi Hong (Iowa State University) 3 / 37


Review of Distributed Optimization

Basic Setup

Consider the following convex optimization problem

$$\min_{y \in \mathbb{R}^M} \; f(y) := \sum_{i=1}^{N} f_i(y) \qquad (P)$$

Each fi(y) is a convex and possibly nonsmooth function

A collection of N agents connected by a network:

1 Network defined by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$

2 $|\mathcal{V}| = N$ vertices and $|\mathcal{E}| = E$ edges

3 Each agent can only communicate with its immediate neighbors

Mingyi Hong (Iowa State University) 4 / 37


Review of Distributed Optimization

Basic Setup

Numerous applications in optimizing networked systems

1 Cloud computing [Foster et al 08]

2 Smart grid optimization [Gan et al 13] [Liu-Zhu 14][Kekatos 13]

3 Distributed learning [Mateos et al 10] [Boyd et al 11] [Bekkerman et al 12]

4 Communication and signal processing [Rabbat-Nowak 04] [Schizas et al 08][Giannakis et al 15]

5 Seismic Tomography [Zhao et al 15]

6 ...

Mingyi Hong (Iowa State University) 5 / 37


Review of Distributed Optimization

The Algorithms

Many algorithms are available for problem (P)

1 The distributed subgradient (DSG) based methods

2 The Alternating Direction Method of Multipliers (ADMM) based methods

3 The Distributed Dual Averaging based methods

4 ...

These algorithm families differ in the problems they can handle and in their convergence conditions

Mingyi Hong (Iowa State University) 6 / 37


Review of Distributed Optimization

The DSG Algorithm

Each agent i keeps a local copy of y, denoted as xi

Each agent i iteratively computes

$$x_i^{r+1} = \sum_{j=1}^{N} w_{ij}^r\, x_j^r - \gamma^r d_i^r, \qquad \forall\, i \in \mathcal{V}.$$

We use the following notation (a numerical sketch of the iteration follows the list):

1 $d_i^r \in \partial f_i(x_i^r)$: a subgradient of the local function $f_i$

2 $w_{ij}^r \geq 0$: the weight of link $e_{ij}$ at iteration $r$

3 $\gamma^r > 0$: a stepsize parameter
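To make the update concrete, here is a minimal numpy sketch of the DSG iteration; the scalar quadratic local costs, the ring-graph weight matrix, and the $1/\sqrt{r}$ stepsize are illustrative assumptions, not details from the slides.

```python
import numpy as np

# Illustrative setup: N agents, each with a scalar quadratic cost
# f_i(y) = 0.5 * a_i * (y - c_i)^2, so a (sub)gradient is d_i = a_i * (y - c_i).
rng = np.random.default_rng(0)
N = 8
a = rng.uniform(1.0, 2.0, size=N)
c = rng.uniform(-1.0, 1.0, size=N)

# Doubly stochastic mixing matrix for a ring graph (assumed for illustration).
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

x = np.zeros(N)                      # x[i] is agent i's local copy of y
for r in range(1, 501):
    d = a * (x - c)                  # local (sub)gradients d_i^r
    gamma = 1.0 / np.sqrt(r)         # diminishing stepsize gamma^r
    x = W @ x - gamma * d            # x_i^{r+1} = sum_j w_ij x_j^r - gamma^r d_i^r

y_star = np.sum(a * c) / np.sum(a)   # minimizer of sum_i f_i
print("consensus spread:", x.max() - x.min(), "error:", abs(x.mean() - y_star))
```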

Mingyi Hong (Iowa State University) 7 / 37


Review of Distributed Optimization

The DSG Algorithm (Cont.)

Compactly, the algorithm can be written in vector form

$$x^{r+1} = W x^r - \gamma^r d^r$$

1 $x^r$: the stacked vector of the agents' local variables

2 $d^r$: the stacked vector of subgradients

3 $W$: a row-stochastic weight matrix

Mingyi Hong (Iowa State University) 8 / 37


Review of Distributed Optimization

The DSG Algorithm (Cont.)

Convergence has been analyzed in many works [Nedic-Ozdaglar 09a][Nedic-Ozdaglar 09b]

The algorithm converges at a rate of $O(\ln(r)/\sqrt{r})$ [Chen 12]

It usually requires a diminishing stepsize

The algorithm has been generalized to problems with

1 constraints [Nedic-Ozdaglar-Parrilo 10]

2 quantized messages [Nedic et al 08]

3 directed graphs [Nedic-Olshevsky 15]

4 stochastic gradients [Ram et al 10]

5 ...

Accelerated versions with rates O(ln(r)/r) [Chen 12] [Jakovetic et al 14]

Mingyi Hong (Iowa State University) 9 / 37


Review of Distributed Optimization

The EXTRA Algorithm

Recently, [Shi et al 14] proposed the EXTRA algorithm

$$x^{r+1} = W x^r - \frac{1}{\beta} d^r + \frac{1}{\beta} d^{r-1} + x^r - \tilde{W} x^{r-1}$$

where $\tilde{W} = \frac{1}{2}(I + W)$; $f$ is assumed to be smooth; $W$ is symmetric

EXTRA is an error-corrected version of DSG (see the sketch below):

$$x^{r+1} = W x^r - \frac{1}{\beta} d^r + \sum_{t=1}^{r} (W - \tilde{W}) x^{t-1}$$

It is shown that

1 A constant stepsize $\beta$ can be used (with a computable lower bound)

2 The algorithm converges at an improved rate of $O(1/r)$
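A minimal numpy sketch of the EXTRA recursion for smooth local costs; the quadratic costs, the ring-graph symmetric weight matrix, and the constant $\beta$ are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Smooth local costs g_i(y) = 0.5 * a_i * (y - c_i)^2 (illustrative choice).
rng = np.random.default_rng(1)
N = 8
a = rng.uniform(1.0, 2.0, size=N)
c = rng.uniform(-1.0, 1.0, size=N)
grad = lambda x: a * (x - c)                      # stacked gradients, one entry per agent

# Symmetric doubly stochastic W on a ring (assumption); W_tilde = (I + W)/2.
W = np.zeros((N, N))
for i in range(N):
    W[i, i], W[i, (i - 1) % N], W[i, (i + 1) % N] = 0.5, 0.25, 0.25
W_tilde = 0.5 * (np.eye(N) + W)

beta = 4.0                                        # constant stepsize parameter (conservative guess)
x_prev = np.zeros(N)
x = W @ x_prev - (1.0 / beta) * grad(x_prev)      # first EXTRA step
for r in range(500):
    # x^{r+1} = W x^r - (1/beta) d^r + (1/beta) d^{r-1} + x^r - W_tilde x^{r-1}
    x_next = W @ x - (grad(x) - grad(x_prev)) / beta + x - W_tilde @ x_prev
    x_prev, x = x, x_next

y_star = np.sum(a * c) / np.sum(a)
print("consensus spread:", x.max() - x.min(), "error:", abs(x.mean() - y_star))
```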

Mingyi Hong (Iowa State University) 10 / 37


Review of Distributed Optimization

The ADMM Algorithm

The general ADMM solves the following two-block optimization problem

$$\min_{x, y} \; f(x) + g(y) \quad \text{s.t.} \quad Ax + By = c, \; x \in X, \; y \in Y$$

The augmented Lagrangian is

$$L(x, y; \lambda) = f(x) + g(y) + \langle \lambda, c - Ax - By \rangle + \frac{\rho}{2}\|c - Ax - By\|^2$$

The algorithm alternates three steps (a sketch on a concrete instance follows):

1 Minimize $L(x, y; \lambda)$ w.r.t. $x$

2 Minimize $L(x, y; \lambda)$ w.r.t. $y$

3 $\lambda \leftarrow \lambda + \rho(c - Ax - By)$
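A hedged sketch of these three steps on one concrete two-block instance, a LASSO splitting with $f(x) = \tfrac{1}{2}\|Dx - b\|^2$, $g(y) = \nu\|y\|_1$, and the constraint $x - y = 0$ (so $A = I$, $B = -I$, $c = 0$); the data, $\rho$, and $\nu$ are illustrative, not values from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1: elementwise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Illustrative two-block instance: f(x) = 0.5*||D x - b||^2, g(y) = nu*||y||_1,
# constraint x - y = 0 (A = I, B = -I, c = 0). Problem data is random.
rng = np.random.default_rng(0)
K, M, nu, rho = 60, 30, 0.1, 1.0
D = rng.standard_normal((K, M))
b = rng.standard_normal(K)

x = np.zeros(M); y = np.zeros(M); lam = np.zeros(M)
DtD, Dtb = D.T @ D, D.T @ b
chol = np.linalg.cholesky(DtD + rho * np.eye(M))   # factor once, reuse every iteration

for _ in range(200):
    # 1) minimize L(x, y; lam) over x  (ridge-type linear solve)
    rhs = Dtb + lam + rho * y
    x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
    # 2) minimize L(x, y; lam) over y  (soft-thresholding)
    y = soft_threshold(x - lam / rho, nu / rho)
    # 3) dual ascent: lam <- lam + rho*(c - A x - B y) = lam + rho*(y - x)
    lam = lam + rho * (y - x)

print("primal residual ||x - y||:", np.linalg.norm(x - y))
```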

Mingyi Hong (Iowa State University) 11 / 37


Review of Distributed Optimization

The ADMM for Network Consensus

For each link $e_{ij}$, introduce two link variables $z_{ij}, z_{ji}$

Reformulate problem (P) as [Schizas et al 08]

$$\min \; f(x) := \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t.} \;\; x_i = z_{ij}, \; x_j = z_{ij}, \; x_i = z_{ji}, \; x_j = z_{ji}, \quad \forall\, e_{ij} \in \mathcal{E}.$$

Mingyi Hong (Iowa State University) 12 / 37


Review of Distributed Optimization

The ADMM for Network Consensus (cont.)

The above problem is equivalent to

$$\min \; f(x) := \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t.} \;\; Ax + Bz = 0 \qquad (1)$$

where A, B are matrices related to network topology

Converges with O(1/r) rate [Wei-Ozdaglar 13]

When the objective is smooth and strongly convex, linear convergence has been shown in [Shi et al 14]

For a star network, convergence to stationary solutions of nonconvex problems (at rate $O(1/\sqrt{r})$) [H.-Luo-Razaviyayn 14]

Mingyi Hong (Iowa State University) 13 / 37


Review of Distributed Optimization

Comparison of ADMM and DSG

Table: Comparison of ADMM and DSG.

| | DSG | ADMM |
| Problem Type | general convex | smooth / smooth + simple nonsmooth |
| Stepsize | diminishing (a) | constant |
| Convergence Rate | $O(\ln(r)/\sqrt{r})$ | $O(1/r)$ |
| Network Topology | dynamic | static (b) |
| Subproblem | simple | difficult (c) |

(a) Except [Shi et al 14], which uses a constant stepsize

(b) Except [Wei-Ozdaglar 13], which handles random graphs

(c) Except [Chang-H.-Wang 14] [Ling et al 15], which use gradient-type subproblems

Mingyi Hong (Iowa State University) 14 / 37


Review of Distributed Optimization

Comparison of ADMM and DSG

Connections?

Mingyi Hong (Iowa State University) 15 / 37


Review of Distributed Optimization

Outline

1 Review of Distributed Optimization

2 The Proposed Algorithm: The Proposed Algorithms / Distributed Implementation / Convergence Analysis

3 Connection to Existing Methods

4 Numerical Results

5 Concluding Remarks

Mingyi Hong (Iowa State University) 15 / 37


The Proposed Algorithm

Setup

The proposed method is ADMM-based

We consider

$$\min \; f(y) := \sum_{i=1}^{N} f_i(y) = \sum_{i=1}^{N} \big( g_i(y) + h_i(y) \big), \qquad (Q)$$

1 Each $h_i$ is lower semicontinuous with an easy “prox” operator

$$\mathrm{prox}_{h_i}^{\beta}(u) := \arg\min_{y} \; h_i(y) + \frac{\beta}{2}\|y - u\|^2.$$

2 Each $g_i$ has a Lipschitz continuous gradient, i.e., for some $P_i > 0$,

$$\|\nabla g_i(y) - \nabla g_i(v)\| \leq P_i \|y - v\|, \qquad \forall\, y, v \in \mathrm{dom}(h), \; \forall\, i.$$
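For instance, when $h_i(y) = \nu\|y\|_1$ (as in the LASSO experiment later), the prox reduces to elementwise soft-thresholding; the sketch below illustrates that special case (the function name and test values are mine, not the slides').

```python
import numpy as np

def prox_l1(u, beta, nu):
    """prox_{h_i}^{beta}(u) with h_i(y) = nu*||y||_1:
    argmin_y nu*||y||_1 + (beta/2)*||y - u||^2  =  soft-threshold(u, nu/beta)."""
    t = nu / beta
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

u = np.array([1.5, -0.2, 0.05, -3.0])
print(prox_l1(u, beta=2.0, nu=0.5))   # threshold nu/beta = 0.25 -> [1.25, 0., 0., -2.75]
```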

Mingyi Hong (Iowa State University) 16 / 37


The Proposed Algorithm

Graph Structure

We consider both static and randomly time-varying graphs

For the random network we assume that

1 At a given iteration, $\mathcal{G}^r$ is a subgraph of a connected graph $\mathcal{G}$

2 Each link $e$ has a probability $p_e \in (0, 1]$ of being active

3 A node $i$ is active if an active link connects to it

4 The graph realizations are independent across iterations

Mingyi Hong (Iowa State University) 17 / 37


The Proposed Algorithm

Gradient Information

Each agent has access to a stochastic estimate $G_i(x_i, \xi_i)$ of its gradient such that

$$\mathbb{E}[G_i(x_i, \xi_i)] = \nabla g_i(x_i), \qquad \mathbb{E}\big[\|G_i(x_i, \xi_i) - \nabla g_i(x_i)\|^2\big] \leq \sigma^2, \quad \forall\, i$$

The analysis can be extended to the case where only a subgradient of the objective is available
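A minimal sketch of such an oracle for a least-squares local term $g_i(x) = \tfrac{1}{2}\|A_i x - b_i\|^2$; the additive Gaussian noise model and its scaling are illustrative assumptions chosen to satisfy the two conditions above.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, sigma = 20, 10, 0.3
A_i = rng.standard_normal((K, M))
b_i = rng.standard_normal(K)

def grad_exact(x):
    # Exact gradient of g_i(x) = 0.5*||A_i x - b_i||^2
    return A_i.T @ (A_i @ x - b_i)

def grad_oracle(x):
    # Unbiased stochastic estimate: exact gradient plus zero-mean noise whose
    # second moment equals sigma^2 (the 1/sqrt(M) scaling makes E||noise||^2 = sigma^2).
    noise = rng.standard_normal(M) * (sigma / np.sqrt(M))
    return grad_exact(x) + noise

x = np.zeros(M)
samples = np.stack([grad_oracle(x) for _ in range(5000)])
print("bias norm:", np.linalg.norm(samples.mean(axis=0) - grad_exact(x)))
print("mean squared deviation:", np.mean(np.sum((samples - grad_exact(x))**2, axis=1)))
```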

Mingyi Hong (Iowa State University) 18 / 37


The Proposed Algorithm

The Augmented Lagrangian

The problem we solve is still given by

$$\min \; f(x) := \sum_{i=1}^{N} \big( g_i(x_i) + h_i(x_i) \big), \quad \text{s.t.} \;\; Ax + Bz = 0$$

The augmented Lagrangian is

$$L_{\Gamma}(x, z, \lambda) = \sum_{i=1}^{N} \big( g_i(x_i) + h_i(x_i) \big) + \langle \lambda, Ax + Bz \rangle + \frac{1}{2}\|Ax + Bz\|_{\Gamma}^2.$$

A diagonal matrix $\Gamma$ is used as the penalty parameter (one $\rho_{ij}$ per edge):

$$\Gamma := \mathrm{diag}\big( \{\rho_{ij}\}_{e_{ij} \in \mathcal{E}} \big)$$

Mingyi Hong (Iowa State University) 19 / 37


The Proposed Algorithm The Proposed Algorithms

The DySPGC Algorithm

The proposed algorithm is named DySPGC (Dynamic Stochastic Proximal Gradient Consensus)

It optimizes $L_{\Gamma}(x, z, \lambda)$ using steps similar to those of ADMM

The x-step will be replaced by a proximal gradient step

Mingyi Hong (Iowa State University) 20 / 37


The Proposed Algorithm The Proposed Algorithms

The DySPGC: Static Graph + Exact Gradient

Algorithm 1. PGC Algorithm

At iteration 0, let $B^T \lambda^0 = 0$, $z^0 = \frac{1}{2} M_+^T x^0$.
At each iteration $r+1$, update the variable blocks by:

$$x^{r+1} = \arg\min_x \; \langle \nabla g(x^r), x - x^r \rangle + h(x) + \frac{1}{2}\left\| Ax + Bz^r + \Gamma^{-1}\lambda^r \right\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega}^2$$

$$z^{r+1} = \arg\min_z \; \frac{1}{2}\left\| Ax^{r+1} + Bz + \Gamma^{-1}\lambda^r \right\|_{\Gamma}^2$$

$$\lambda^{r+1} = \lambda^r + \Gamma\left( Ax^{r+1} + Bz^{r+1} \right)$$

Mingyi Hong (Iowa State University) 21 / 37



The Proposed Algorithm The Proposed Algorithms

The DySPGC: Static Graph + Stochastic Gradient

Algorithm 2. SPGC Algorithm

At iteration 0, let $B^T \lambda^0 = 0$, $z^0 = \frac{1}{2} M_+^T x^0$.
At each iteration $r+1$, update the variable blocks by:

$$x^{r+1} = \arg\min_x \; \left\langle G(x^r, \xi^{r+1}), x - x^r \right\rangle + h(x) + \frac{1}{2}\left\| Ax + Bz^r + \Gamma^{-1}\lambda^r \right\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega + \eta^{r+1} I_{MN}}^2$$

$$z^{r+1} = \arg\min_z \; \frac{1}{2}\left\| Ax^{r+1} + Bz + \Gamma^{-1}\lambda^r \right\|_{\Gamma}^2$$

$$\lambda^{r+1} = \lambda^r + \Gamma\left( Ax^{r+1} + Bz^{r+1} \right)$$

Mingyi Hong (Iowa State University) 22 / 37



The Proposed Algorithm The Proposed Algorithms

The DySPGC: Dynamic Graph + Stochastic Gradient

Algorithm 3. DySPGC Algorithm

At iteration 0, let $B^T \lambda^0 = 0$, $z^0 = \frac{1}{2} M_+^T x^0$.
At each iteration $r+1$, update the variable blocks by:

$$x^{r+1} = \arg\min_x \; \left\langle G^{r+1}(x^r, \xi^{r+1}), x - x^r \right\rangle + h^{r+1}(x) + \frac{1}{2}\left\| A^{r+1}x + B^{r+1}z^r + \Gamma^{-1}\lambda^r \right\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega^{r+1} + \eta^{r+1} I_{MN}}^2$$

$$x_i^{r+1} = x_i^r, \quad \text{if } i \notin \mathcal{V}^{r+1}$$

$$z^{r+1} = \arg\min_z \; \frac{1}{2}\left\| A^{r+1}x^{r+1} + B^{r+1}z + \Gamma^{-1}\lambda^r \right\|_{\Gamma}^2$$

$$z_{ij}^{r+1} = z_{ij}^r, \quad \text{if } e_{ij} \notin \mathcal{A}^{r+1}$$

$$\lambda^{r+1} = \lambda^r + \Gamma\left( A^{r+1}x^{r+1} + B^{r+1}z^{r+1} \right)$$

Mingyi Hong (Iowa State University) 23 / 37



The Proposed Algorithm Distributed Implementation

Distributed Implementation

The algorithms admit a distributed implementation

In particular, the PGC admits a single-variable characterization

Mingyi Hong (Iowa State University) 24 / 37


The Proposed Algorithm Distributed Implementation

Implementation of PGC

Define a stepsize parameter

$$\beta_i := \sum_{j \in \mathcal{N}_i} (\rho_{ij} + \rho_{ji}) + \omega_i, \quad \forall\, i.$$

($\omega_i$: proximal parameters; $\rho_{ij}$: penalty parameters for the constraints)

Define a stepsize matrix $\Upsilon := \mathrm{diag}([\beta_1, \cdots, \beta_N]) \succ 0$.

Define a weight matrix $W \in \mathbb{R}^{N \times N}$ (a row-stochastic matrix) by

$$W[i,j] = \begin{cases} \dfrac{\rho_{ji} + \rho_{ij}}{\sum_{\ell \in \mathcal{N}_i} (\rho_{\ell i} + \rho_{i \ell}) + \omega_i} = \dfrac{\rho_{ji} + \rho_{ij}}{\beta_i}, & \text{if } e_{ij} \in \mathcal{E}, \\[2mm] \dfrac{\omega_i}{\sum_{\ell \in \mathcal{N}_i} (\rho_{\ell i} + \rho_{i \ell}) + \omega_i} = \dfrac{\omega_i}{\beta_i}, & \text{if } i = j, \; i \in \mathcal{V}, \\[2mm] 0, & \text{otherwise.} \end{cases}$$
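The sketch below builds $\beta_i$, $\Upsilon$, and the row-stochastic $W$ from given penalty parameters $\rho_{ij}$ and proximal parameters $\omega_i$ on a small example graph; the graph and the parameter values are illustrative assumptions.

```python
import numpy as np

# Illustrative 4-node cycle given as an edge list; rho[(i, j)] and rho[(j, i)]
# are the (possibly different) penalty parameters attached to edge e_ij.
N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rho = {}
for (i, j) in edges:
    rho[(i, j)] = 1.0
    rho[(j, i)] = 1.0
omega = np.array([2.0, 2.0, 2.0, 2.0])       # proximal parameters omega_i

neighbors = {i: set() for i in range(N)}
for (i, j) in edges:
    neighbors[i].add(j)
    neighbors[j].add(i)

# beta_i = sum_{j in N_i} (rho_ij + rho_ji) + omega_i
beta = np.array([sum(rho[(i, j)] + rho[(j, i)] for j in neighbors[i]) + omega[i]
                 for i in range(N)])
Upsilon = np.diag(beta)

# Row-stochastic weight matrix W induced by (rho, omega)
W = np.zeros((N, N))
for i in range(N):
    for j in neighbors[i]:
        W[i, j] = (rho[(j, i)] + rho[(i, j)]) / beta[i]
    W[i, i] = omega[i] / beta[i]

print("row sums:", W.sum(axis=1))            # each row should sum to 1
```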

Mingyi Hong (Iowa State University) 25 / 37


The Proposed Algorithm Distributed Implementation

Implementation of PGC (cont.)

Implementation of PGC

Let $\zeta^r \in \partial h(x^r)$ be a subgradient of the nonsmooth function; then the PGC algorithm admits the following single-variable characterization

$$x^{r+1} - x^r + \Upsilon^{-1}(\zeta^{r+1} - \zeta^r) = \Upsilon^{-1}\big( -\nabla g(x^r) + \nabla g(x^{r-1}) \big) + W x^r - \frac{1}{2}(I_N + W) x^{r-1}.$$

In particular, for smooth problems,

$$x^{r+1} = W x^r - \Upsilon^{-1} \nabla g(x^r) + \Upsilon^{-1} \nabla g(x^{r-1}) + x^r - \frac{1}{2}(I_N + W) x^{r-1}.$$
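For the smooth case the last display can be run directly; below is a hedged numpy sketch with scalar quadratic local costs on a ring graph, with $W$ and $\beta_i$ built from $(\rho, \omega)$ as on the previous slide. The parameter values and the EXTRA-style first step are my assumptions, not prescriptions from the slides.

```python
import numpy as np

# Illustrative smooth local costs g_i(x) = 0.5 * a_i * (x - c_i)^2 on a ring graph.
rng = np.random.default_rng(0)
N = 8
a = rng.uniform(1.0, 2.0, size=N)
c = rng.uniform(-1.0, 1.0, size=N)
grad = lambda x: a * (x - c)

# Penalty/proximal parameters (illustrative); omega_i > P_i = a_i for safety.
rho = 1.0
omega = 2.0 * a
beta = 2 * (rho + rho) + omega         # two neighbors per node on a ring: beta_i = 4*rho + omega_i
Upsilon_inv = 1.0 / beta

# Row-stochastic W induced by (rho, omega), as defined on the previous slide.
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = 2 * rho / beta[i]
    W[i, (i + 1) % N] = 2 * rho / beta[i]
    W[i, i] = omega[i] / beta[i]
W_tilde = 0.5 * (np.eye(N) + W)

x_prev = np.zeros(N)
x = W @ x_prev - Upsilon_inv * grad(x_prev)       # first step (EXTRA-style initialization; assumed)
for r in range(1000):
    # x^{r+1} = W x^r - Y^{-1} grad(x^r) + Y^{-1} grad(x^{r-1}) + x^r - 0.5*(I + W) x^{r-1}
    x_next = W @ x - Upsilon_inv * (grad(x) - grad(x_prev)) + x - W_tilde @ x_prev
    x_prev, x = x, x_next

y_star = np.sum(a * c) / np.sum(a)
print("consensus spread:", x.max() - x.min(), "error:", abs(x.mean() - y_star))
```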

Mingyi Hong (Iowa State University) 26 / 37


The Proposed Algorithm Convergence Analysis

Convergence Analysis

We analyze the (rate of) convergence of the proposed methods. Define the matrix of Lipschitz constants

$$P = \mathrm{diag}([P_1, \cdots, P_N]).$$

The convergence rate is measured by [Gao et al 14, Ouyang et al 14]

$$\underbrace{|f(x^r) - f(x^*)|}_{\text{objective gap}} \quad \text{and} \quad \underbrace{\|Ax^r + Bz^r\|}_{\text{consensus gap}}$$

Mingyi Hong (Iowa State University) 27 / 37


The Proposed Algorithm Convergence Analysis

Convergence Analysis

Table: Main convergence results.

| Network Type | Gradient Type | Convergence Condition | Convergence Rate |
| Static | Exact | $\Upsilon W + \Upsilon \succ 2P$ | $O(1/r)$ |
| Static | Stochastic | $\Upsilon W + \Upsilon \succ 2P$ | $O(1/\sqrt{r})$ |
| Random | Exact | $\Omega \succ P$ | $O(1/r)$ |
| Random | Stochastic | $\Omega \succ P$ | $O(1/\sqrt{r})$ |

Note: for the exact gradient case, the stepsize $\beta$ can be halved if only convergence (and not a rate) is needed

Mingyi Hong (Iowa State University) 28 / 37


The Proposed Algorithm Convergence Analysis

Outline

1 Review of Distributed Optimization

2 The Proposed Algorithm: The Proposed Algorithms / Distributed Implementation / Convergence Analysis

3 Connection to Existing Methods

4 Numerical Results

5 Concluding Remarks

Mingyi Hong (Iowa State University) 28 / 37


Connection to Existing Methods

Comparison with Different Algorithms

| Algorithm | Connection with DySPGC | Special Setting |
| EXTRA [Shi 14] | Special case | Static, $h \equiv 0$, $W = W^T$, $G = \nabla g$ |
| DSG [Nedic 09] | Different x-step | Static, $g$ smooth, $G = \nabla g$ |
| IC-ADMM [Chang 14] | Special case | Static, $G = \nabla g$, $g$ composite |
| DLM [Ling 15] | Special case | Static, $G = \nabla g$, $h \equiv 0$, $\beta_{ij} = \beta$, $\rho_{ij} = \rho$ |
| PG-EXTRA [Shi 15] | Special case | Static, $W = W^T$, $G = \nabla g$ |

Mingyi Hong (Iowa State University) 29 / 37


Connection to Existing Methods

Comparison with Different Algorithms

Figure: Relationship among different algorithms

Mingyi Hong (Iowa State University) 30 / 37


Connection to Existing Methods

The EXTRA Related Algorithms

The EXTRA-related algorithms (for either the smooth or the nonsmooth case) [Shi et al 14, 15] are special cases of DySPGC, obtained under:

1 Symmetric weight matrix $W = W^T$

2 Exact gradient

3 Scalar stepsize

4 Static graph

Mingyi Hong (Iowa State University) 31 / 37


Connection to Existing Methods

The DSG Method

Replacing our x-update by (setting the dual variable $\lambda^r = 0$)

$$x^{r+1} = \arg\min_x \; \langle \nabla g(x^r), x - x^r \rangle + \langle 0, Ax + Bz^r \rangle + \frac{1}{2}\|Ax + Bz^r\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega}^2$$

Let $\beta_i = \beta_j = \beta$; then the PGC algorithm becomes

$$x^{r+1} = \tilde{W} x^r - \frac{1}{\beta} \nabla g(x^r), \qquad \text{with } \tilde{W} = \frac{1}{2}(I + W)$$

This is precisely the DSG iteration

Its convergence is not covered by our results

Mingyi Hong (Iowa State University) 32 / 37


Connection to Existing Methods

Outline

1 Review of Distributed Optimization

2 The Proposed Algorithm: The Proposed Algorithms / Distributed Implementation / Convergence Analysis

3 Connection to Existing Methods

4 Numerical Results

5 Concluding Remarks

Mingyi Hong (Iowa State University) 32 / 37


Numerical Results

Numerical Results

Some preliminary numerical results, obtained by solving a LASSO problem

$$\min_x \; \frac{1}{2} \sum_{i=1}^{N} \|A_i x - b_i\|^2 + \nu \|x\|_1, \qquad A_i \in \mathbb{R}^{K \times M}, \; b_i \in \mathbb{R}^{K}$$

The parameters: N = 16, M = 100, ν = 0.1, K = 200

Data matrix randomly generated

Static graphs, generated according to the method proposed in [Yildiz-Scaglione 08], with a radius parameter set to 0.4.
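A hedged sketch of this setup: random Gaussian data with the stated sizes, plus a centralized proximal-gradient (ISTA) baseline that distributed runs can be checked against; the data model and the baseline solver are assumptions, not details from the slides.

```python
import numpy as np

# Problem sizes as stated on the slide; the Gaussian data model is an assumption.
rng = np.random.default_rng(0)
N, M, K, nu = 16, 100, 200, 0.1
A = [rng.standard_normal((K, M)) / np.sqrt(K) for _ in range(N)]
b = [rng.standard_normal(K) for _ in range(N)]

def objective(x):
    return 0.5 * sum(np.linalg.norm(A[i] @ x - b[i])**2 for i in range(N)) + nu * np.linalg.norm(x, 1)

# Centralized ISTA baseline (reference solution for the distributed runs).
L = sum(np.linalg.norm(A[i], 2)**2 for i in range(N))      # upper bound on the Lipschitz constant
x = np.zeros(M)
for _ in range(500):
    grad = sum(A[i].T @ (A[i] @ x - b[i]) for i in range(N))
    v = x - grad / L
    x = np.sign(v) * np.maximum(np.abs(v) - nu / L, 0.0)    # prox of (nu/L)*||.||_1

print("baseline objective:", objective(x))
```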

Mingyi Hong (Iowa State University) 33 / 37


Numerical Results

Comparison between PG-EXTRA and PGC

Stepsize of PG-EXTRA chosen according to the conditions given in [Shi 14]

$W$ is the Metropolis constant-edge-weight matrix

PGC: $\omega_i = P_i/2$, $\rho_{ij} = 10^{-3}$

Figure: Comparison between PG-EXTRA and PGC

Mingyi Hong (Iowa State University) 34 / 37


Numerical Results

Comparison between DSG and Stochastic PGC

Stepsize of DSG chosen as a small constant; $\sigma^2 = 0.1$

$W$ is the Metropolis constant-edge-weight matrix

SPGC: $\omega_i = P_i$, $\rho_{ij} = 10^{-3}$

Figure: Comparison between DSG and SPGC

Mingyi Hong (Iowa State University) 35 / 37


Numerical Results

Outline

1 Review of Distributed Optimization

2 The Proposed Algorithm: The Proposed Algorithms / Distributed Implementation / Convergence Analysis

3 Connection to Existing Methods

4 Numerical Results

5 Concluding Remarks

Mingyi Hong (Iowa State University) 35 / 37


Concluding Remarks

Summary

We developed the DySPGC algorithm for multi-agent optimization

It can deal with

1 Stochastic gradient

2 Time-varying networks

3 Nonsmooth composite objective

Convergence rate guarantees for various scenarios

Mingyi Hong (Iowa State University) 36 / 37


Concluding Remarks

Future Work/Generalization

We identified the relation between DSG-type and ADMM-type methods

This allows for significant generalizations:

1 Acceleration [Ouyang et al 15]

2 Variance reduction for the local problem when $f_i$ is a finite sum

$$f_i(x_i) = \sum_{j=1}^{M} \ell_j(x_i)$$

3 Inexact x-subproblems (using, e.g., Conditional-Gradient)

4 Nonconvex problems [H.-Luo-Razaviyayn 14]

5 ...

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Thank You!

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Parameter Selection

It is easy to pick the various parameters in different scenarios

Case A: The weight matrix $W$ is given and symmetric

1 We must have $\beta_i = \beta_j = \beta$

2 For any fixed $\beta$, we can compute $(\Omega, \rho_{ij})$

3 Increase $\beta$ to satisfy the convergence condition

Case B: The user has the freedom to pick $(\rho_{ij}, \Omega)$

1 For any set of $(\rho_{ij}, \Omega)$, we can compute $W$ and $\beta_i$

2 Increase $\Omega$ to satisfy the convergence condition

In either case, the convergence condition can be verified by local agents

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Case 1: Exact Gradient with Static Graph

Convergence for PGC Algorithm

Suppose that problem (Q) has a nonempty set of optimal solutions $X^* \neq \emptyset$. Suppose $\mathcal{G}^r = \mathcal{G}$ for all $r$ and $\mathcal{G}$ is connected. Then PGC converges to a primal-dual optimal solution if

$$2\Omega + M_+ \Xi M_+^T = \Upsilon W + \Upsilon \succ P.$$

$M_+ \Xi M_+^T$ is a matrix related to the network topology

A sufficient condition is $\Omega \succ P$, i.e., $\omega_i > P_i$ for all $i \in \mathcal{V}$; this can be checked locally (see the sketch below).
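For the LASSO example, $P_i$ is the spectral norm of $A_i^T A_i$, so each agent can check the sufficient condition from its own data; a hedged sketch (the data and the choice of $\omega_i$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 200, 100
A_i = rng.standard_normal((K, M)) / np.sqrt(K)

# Lipschitz constant of grad g_i(x) = A_i^T (A_i x - b_i) is the spectral norm of A_i^T A_i.
P_i = np.linalg.norm(A_i.T @ A_i, 2)

omega_i = 1.1 * P_i          # illustrative local choice satisfying omega_i > P_i
print("P_i =", P_i, " omega_i =", omega_i, " condition holds:", omega_i > P_i)
```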

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Case 2: Stochastic Gradient with Static Graph

Convergence for SPGC Algorithm

Assume that $\mathrm{dom}(h)$ is a bounded set. Suppose that the following conditions hold

$$\eta^{r+1} = \sqrt{r+1}, \quad \forall\, r,$$

and the stepsize matrix satisfies

$$2\Omega + M_+ \Xi M_+^T = \Upsilon W + \Upsilon \succ 2P. \qquad (8)$$

Then at a given iteration $r$, we have

$$\mathbb{E}\left[ f(x^r) - f(x^*) \right] + \rho \|Ax^r + Bz^r\| \leq \frac{\sigma^2}{\sqrt{r}} + \frac{d_x^2}{2\sqrt{r}} + \frac{1}{2r}\left( d_z^2 + d_\lambda^2(\rho) + \max_i \omega_i\, d_x^2 \right)$$

where $d_\lambda(\rho) > 0$, $d_x > 0$, $d_z > 0$ are problem-dependent constants.

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Case 2: Stochastic Gradient with Static Graph (cont.)

Both the objective value and the constraint violation converge at rate $O(1/\sqrt{r})$

Easy to extend to the exact gradient case, with rate $O(1/r)$

Requires larger proximal parameter Ω than Case 1

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Case 3: Exact Gradient with Time-Varying Graph

Convergence for DySPGC Algorithm

Suppose that problem (Q) has a nonempty set of optimal solutions $X^* \neq \emptyset$, and that $G(x^r, \xi^{r+1}) = \nabla g(x^r)$ for all $r$. Suppose the graph is randomly generated. If we choose the stepsize so that

$$\Omega \succ \frac{1}{2} P,$$

then $(x^r, z^r, \lambda^r)$ converges with probability 1 to a primal-dual optimal solution.

1 The stepsize condition is more restrictive than in Case 1, but it does not depend on the graph

2 Convergence is in the sense of convergence with probability 1

Mingyi Hong (Iowa State University) 37 / 37


Concluding Remarks

Case 4: Stochastic Gradient with Time-Varying Graph

Convergence for DySPGC Algorithm

Suppose $w^t = (x^t, z^t, \lambda^t)$ is a sequence generated by DySPGC, and that

$$\eta^{r+1} = \sqrt{r+1}, \quad \forall\, r, \qquad \text{and} \qquad \Omega \succ P.$$

Then we have

$$\mathbb{E}\left[ f(x^r) - f(x^*) + \rho \|Ax^r + Bz^r\| \right] \leq \frac{\sigma^2}{\sqrt{r}} + \frac{d_x^2}{2\sqrt{r}} + \frac{1}{2r}\left( 2 d_J + d_z^2 + d_\lambda^2(\rho) + \max_i \omega_i\, d_x^2 \right)$$

where $d_\lambda(\rho)$, $d_J$, $d_x$, $d_z$ are positive constants.

Mingyi Hong (Iowa State University) 37 / 37