Stochastic Proximal Gradient Consensus Over Time-Varying Multi-Agent Networks
Mingyi Hong, joint work with Tsung-Hui Chang
IMSE and ECE Department, Iowa State University
Presented at INFORMS 2015
Main Content
Setup: Optimization over a time-varying multi-agent network
Main Results
An algorithm for a large class of convex problems with rate guarantees
Connections among a number of popular algorithms
Outline
1 Review of Distributed Optimization
2 The Proposed Algorithm
  The Proposed Algorithms
  Distributed Implementation
  Convergence Analysis
3 Connection to Existing Methods
4 Numerical Results
5 Concluding Remarks
Review of Distributed Optimization
Basic Setup
Consider the following convex optimization problem
min_{y ∈ R^M} f(y) := ∑_{i=1}^N f_i(y),   (P)
Each fi(y) is a convex and possibly nonsmooth function
A collection of N agents connected by a network:
1 Network defined by an undirected graph G = (V, E)
2 |V| = N vertices and |E | = E edges.
3 Each agent can only communicate with its immediate neighbors
Review of Distributed Optimization
Basic Setup
Numerous applications in optimizing networked systems
1 Cloud computing [Foster et al 08]
2 Smart grid optimization [Gan et al 13] [Liu-Zhu 14][Kekatos 13]
3 Distributed learning [Mateos et al 10] [Boyd et al 11] [Bekkerman et al 12]
4 Communication and signal processing [Rabbat-Nowak 04] [Schizas et al 08][Giannakis et al 15]
5 Seismic Tomography [Zhao et al 15]
6 ...
Review of Distributed Optimization
The Algorithms
A lot of algorithms are available for problem (P)
1 The distributed subgradient (DSG) based methods
2 The Alternating Direction Method of Multiplier (ADMM) based methods
3 The Distributed Dual Averaging based methods
4 ...
Algorithm families differ in the problems they apply to and in their convergence conditions
Review of Distributed Optimization
The DSG Algorithm
Each agent i keeps a local copy of y, denoted x_i
Each agent i iteratively computes
x_i^{r+1} = ∑_{j=1}^N w_{ij}^r x_j^r − γ^r d_i^r,   ∀ i ∈ V.
We use the following notation:
1 d_i^r ∈ ∂f_i(x_i^r): a subgradient of the local function f_i
2 w_{ij}^r ≥ 0: the weight of link e_ij at iteration r
3 γ^r > 0: a stepsize parameter
Review of Distributed Optimization
The DSG Algorithm (Cont.)
Compactly, the algorithm can be written in vector form
x^{r+1} = W x^r − γ^r d^r
1 x^r: the vector stacking the agents' local variables x_i^r
2 d^r: the vector stacking the local subgradients d_i^r
3 W: a row-stochastic weight matrix
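To make the recursion concrete, here is a minimal numerical sketch of the DSG iteration on a toy consensus problem; the quadratic local objectives f_i(y) = ½(y − a_i)², the ring graph, its weight matrix, and the 1/√r stepsize are all illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Toy setup (assumed): N agents on a ring, each with f_i(y) = 0.5 * (y - a_i)^2,
# so the minimizer of sum_i f_i is the average of the a_i.
N = 5
rng = np.random.default_rng(0)
a = rng.normal(size=N)

# A simple symmetric row-stochastic weight matrix for the ring graph.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

x = np.zeros(N)                  # x[i] is agent i's local copy of y
for r in range(1, 5001):
    d = x - a                    # (sub)gradient of f_i evaluated at x[i]
    gamma = 1.0 / np.sqrt(r)     # diminishing stepsize, as the slides require
    x = W @ x - gamma * d        # DSG step: average with neighbors, then descend

print(np.max(np.abs(x - a.mean())))   # small consensus + optimality error
```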
Review of Distributed Optimization
The DSG Algorithm (Cont.)
Convergence has been analyzed in many works [Nedic-Ozdaglar09a][Nedic-Ozdaglar 09b]
The algorithm converges at a rate of O(ln(r)/√r) [Chen 12]
Usually a diminishing stepsize is required
The algorithm has been generalized to problems with
1 constraints [Nedic-Ozdaglar-Parrilo 10]
2 quantized messages [Nedic et al 08]
3 directed graphs [Nedic-Olshevsky 15]
4 stochastic gradients [Ram et al 10]
5 ...
Accelerated versions with rates O(ln(r)/r) [Chen 12] [Jakovetic et al 14]
Review of Distributed Optimization
The EXTRA Algorithm
Recently, [Shi et al 14] proposed the EXTRA algorithm
x^{r+1} = W x^r − (1/β) d^r + (1/β) d^{r−1} + x^r − W̃ x^{r−1}
where W̃ = ½(I + W); f is assumed to be smooth; W is symmetric
EXTRA is an error-corrected version of DSG:
x^{r+1} = W x^r − (1/β) d^r + ∑_{t=1}^{r} (W − W̃) x^{t−1}
It is shown that
1 A constant stepsize β can be used (with a computable lower bound)
2 The algorithm converges at an (improved) rate of O(1/r)
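The "error-corrected" form can be checked numerically: with the assumed first step x^1 = W x^0 − (1/β) d^0, the two-term EXTRA recursion and the summation form above produce the same iterates. A minimal sketch with toy quadratic objectives and an assumed ring weight matrix:

```python
import numpy as np

N, beta = 5, 2.0
rng = np.random.default_rng(1)
a = rng.normal(size=N)
grad = lambda x: x - a            # gradients of toy objectives f_i(y) = 0.5*(y - a_i)^2

W = np.zeros((N, N))              # symmetric row-stochastic ring weights (assumed)
for i in range(N):
    W[i, i], W[i, (i - 1) % N], W[i, (i + 1) % N] = 0.5, 0.25, 0.25
W_t = 0.5 * (np.eye(N) + W)       # W tilde

xs = [rng.normal(size=N)]                              # x^0
xs.append(W @ xs[0] - grad(xs[0]) / beta)              # x^1 (assumed first step)
for r in range(1, 10):                                 # two-term EXTRA recursion
    xs.append(W @ xs[r] - grad(xs[r]) / beta + grad(xs[r - 1]) / beta
              + xs[r] - W_t @ xs[r - 1])

r = 9                                                  # summation ("error-corrected DSG") form
corr = sum((W - W_t) @ xs[t - 1] for t in range(1, r + 1))
print(np.allclose(xs[r + 1], W @ xs[r] - grad(xs[r]) / beta + corr))   # True
```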
Review of Distributed Optimization
The ADMM Algorithm
The general ADMM solves the following two-block optimization problem
min_{x,y} f(x) + g(y)
s.t. Ax + By = c, x ∈ X, y ∈ Y
The augmented Lagrangian
L(x, y; λ) = f(x) + g(y) + ⟨λ, c − Ax − By⟩ + (ρ/2)‖c − Ax − By‖²
The algorithm
1 Minimize L(x, y; λ) w.r.t. x
2 Minimize L(x, y; λ) w.r.t. y
3 λ ← λ + ρ(c − Ax − By)
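To make the three steps concrete, here is a minimal sketch of ADMM on a tiny assumed instance, min ½(x − a)² + ½(y − b)² subject to x − y = 0 (so A = 1, B = −1, c = 0); the closed-form minimizers below are specific to this toy problem, not a general implementation.

```python
a, b = 1.0, 3.0          # problem data (assumed); the consensus solution is (a + b) / 2
rho = 1.0                # penalty parameter
x, y, lam = 0.0, 0.0, 0.0

for _ in range(200):
    # Step 1: minimize L(x, y; lam) over x (closed form for this quadratic)
    x = (a + lam + rho * y) / (1.0 + rho)
    # Step 2: minimize L(x, y; lam) over y
    y = (b - lam + rho * x) / (1.0 + rho)
    # Step 3: dual update lam <- lam + rho * (c - A x - B y) = lam + rho * (y - x)
    lam = lam + rho * (y - x)

print(x, y)              # both approach (a + b) / 2 = 2.0
```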
Review of Distributed Optimization
The ADMM for Network Consensus
For each link e_ij introduce two link variables z_ij, z_ji
Reformulate problem (P) as [Schizas et al 08]
min f(x) := ∑_{i=1}^N f_i(x_i),
s.t. x_i = z_ij, x_j = z_ij,
     x_i = z_ji, x_j = z_ji,   ∀ e_ij ∈ E.
Review of Distributed Optimization
The ADMM for Network Consensus (cont.)
The above problem is equivalent to
min f(x) := ∑_{i=1}^N f_i(x_i),
s.t. Ax + Bz = 0   (1)
where A, B are matrices related to the network topology
Converges with an O(1/r) rate [Wei-Ozdaglar 13]
When the objective is smooth and strongly convex, linear convergence has been shown in [Shi et al 14]
For a star network, convergence to stationary solutions of nonconvex problems (with rate O(1/√r)) [H.-Luo-Razaviyayn 14]
Review of Distributed Optimization
Comparison of ADMM and DSG
Table: Comparison of ADMM and DSG.
                      DSG                    ADMM
Problem Type          general convex         smooth / smooth + simple nonsmooth
Stepsize              diminishing (a)        constant
Convergence Rate      O(ln(r)/√r)            O(1/r)
Network Topology      dynamic                static (b)
Subproblem            simple                 difficult (c)

(a) Except [Shi et al 14], which uses a constant stepsize
(b) Except [Wei-Ozdaglar 13], random graph
(c) Except [Chang-H.-Wang 14] [Ling et al 15], gradient-type subproblem
Review of Distributed Optimization
Comparison of ADMM and DSG
Connections?
Review of Distributed Optimization
Outline
1 Review of Distributed Optimization
2 The Proposed Algorithm
  The Proposed Algorithms
  Distributed Implementation
  Convergence Analysis
3 Connection to Existing Methods
4 Numerical Results
5 Concluding Remarks
The Proposed Algorithm
Setup
The proposed method is ADMM based
We consider
min f(y) := ∑_{i=1}^N f_i(y) = ∑_{i=1}^N (g_i(y) + h_i(y)),   (Q)
1 Each h_i is lower semicontinuous with an easy "prox" operator
  prox_{h_i}^β(u) := arg min_y h_i(y) + (β/2)‖y − u‖².
2 Each g_i has a Lipschitz continuous gradient, i.e., for some P_i > 0
  ‖∇g_i(y) − ∇g_i(v)‖ ≤ P_i ‖y − v‖, ∀ y, v ∈ dom(h), ∀ i.
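For instance, when h_i is a scaled ℓ1 norm, h_i(y) = ν‖y‖_1 (as in the LASSO experiment later in the talk), the prox operator is coordinate-wise soft-thresholding; a minimal sketch, with ν and β as illustrative inputs:

```python
import numpy as np

def prox_l1(u, nu, beta):
    """prox of h(y) = nu * ||y||_1 with parameter beta:
    argmin_y  nu * ||y||_1 + (beta / 2) * ||y - u||^2,
    i.e. soft-thresholding each coordinate at level nu / beta."""
    return np.sign(u) * np.maximum(np.abs(u) - nu / beta, 0.0)

u = np.array([1.5, -0.2, 0.05, -3.0])
print(prox_l1(u, nu=0.1, beta=1.0))   # small entries are set exactly to zero
```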
The Proposed Algorithm
Graph Structure
Both static and randomly time-varying graphs are considered
For a random network we assume that
1 At a given iteration, G^r is a subgraph of a connected graph G
2 Each link e has a probability p_e ∈ (0, 1] of being active
3 A node i is active if an active link connects to it
4 The graph realization at each iteration is independent
The Proposed Algorithm
Gradient Information
Each agent has access to an estimate G_i(x_i, ξ_i) of the gradient such that
E[G_i(x_i, ξ_i)] = ∇g_i(x_i)
E[‖G_i(x_i, ξ_i) − ∇g_i(x_i)‖²] ≤ σ², ∀ i
Can be extended to the case where only a subgradient of the objective is available
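As a concrete example of such an estimate, for a least-squares term g_i(x) = ½‖A_i x − b_i‖² one can sample rows of A_i; the single-row uniform sampling below is an illustrative construction that is unbiased by design (its variance is bounded on any bounded set of iterates).

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 50, 10
A_i = rng.normal(size=(K, M))
b_i = rng.normal(size=K)

def full_grad(x):
    """Exact gradient of g_i(x) = 0.5 * ||A_i x - b_i||^2."""
    return A_i.T @ (A_i @ x - b_i)

def stoch_grad(x):
    """Unbiased estimate: pick one row uniformly at random and rescale by K."""
    k = rng.integers(K)
    return K * A_i[k] * (A_i[k] @ x - b_i[k])

x = rng.normal(size=M)
samples = np.array([stoch_grad(x) for _ in range(20000)])
rel_err = np.linalg.norm(samples.mean(axis=0) - full_grad(x)) / np.linalg.norm(full_grad(x))
print(rel_err)   # small: the Monte Carlo average matches the exact gradient
```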
The Proposed Algorithm
The Augmented Lagrangian
The problem we solve is still given by
min f(x) := ∑_{i=1}^N g_i(x_i) + h_i(x_i),
s.t. Ax + Bz = 0
The augmented Lagrangian
L_Γ(x, z, λ) = ∑_{i=1}^N (g_i(x_i) + h_i(x_i)) + ⟨λ, Ax + Bz⟩ + ½‖Ax + Bz‖²_Γ.
A diagonal matrix Γ is used as the penalty parameter (one ρ_ij per edge)
Γ := diag({ρ_ij}_{e_ij ∈ E})
The Proposed Algorithm The Proposed Algorithms
The DySPGC Algorithm
The proposed algorithm is named DySPGC (Dynamic Stochastic Proximal Gradient Consensus)
It optimizes LΓ(x, z, λ) using similar steps as ADMM
The x-step will be replaced by a proximal gradient step
The Proposed Algorithm The Proposed Algorithms
The DySPGC: Static Graph + Exact Gradient
Algorithm 1. PGC Algorithm
At iteration 0, let B^T λ^0 = 0, z^0 = ½ M_+^T x^0.
At each iteration r + 1, update the variable blocks by:
x^{r+1} = arg min_x ⟨∇g(x^r), x − x^r⟩ + h(x) + ½‖Ax + Bz^r + Γ^{−1}λ^r‖²_Γ + ½‖x − x^r‖²_Ω
z^{r+1} = arg min_z ½‖Ax^{r+1} + Bz + Γ^{−1}λ^r‖²_Γ
λ^{r+1} = λ^r + Γ(Ax^{r+1} + Bz^{r+1})
The Proposed Algorithm The Proposed Algorithms
The DySPGC: Static Graph + Stochastic Gradient
Algorithm 2. SPGC Algorithm
At iteration 0, let B^T λ^0 = 0, z^0 = ½ M_+^T x^0.
At each iteration r + 1, update the variable blocks by:
x^{r+1} = arg min_x ⟨G(x^r, ξ^{r+1}), x − x^r⟩ + h(x) + ½‖Ax + Bz^r + Γ^{−1}λ^r‖²_Γ + ½‖x − x^r‖²_{Ω + η^{r+1} I_{MN}}
z^{r+1} = arg min_z ½‖Ax^{r+1} + Bz + Γ^{−1}λ^r‖²_Γ
λ^{r+1} = λ^r + Γ(Ax^{r+1} + Bz^{r+1})
The Proposed Algorithm The Proposed Algorithms
The DySPGC: Dynamic Graph + Stochastic Gradient
Algorithm 3. DySPGC Algorithm
At iteration 0, let B^T λ^0 = 0, z^0 = ½ M_+^T x^0.
At each iteration r + 1, update the variable blocks by:
x^{r+1} = arg min_x ⟨G^{r+1}(x^r, ξ^{r+1}), x − x^r⟩ + h^{r+1}(x) + ½‖A^{r+1}x + B^{r+1}z^r + Γ^{−1}λ^r‖²_Γ + ½‖x − x^r‖²_{Ω^{r+1} + η^{r+1} I_{MN}}
x_i^{r+1} = x_i^r, if i ∉ V^{r+1}
z^{r+1} = arg min_z ½‖A^{r+1}x^{r+1} + B^{r+1}z + Γ^{−1}λ^r‖²_Γ
z_ij^{r+1} = z_ij^r, if e_ij ∉ A^{r+1}
λ^{r+1} = λ^r + Γ(A^{r+1}x^{r+1} + B^{r+1}z^{r+1})
The Proposed Algorithm Distributed Implementation
Distributed Implementation
The algorithms admit distributed implementation
In particular, the PGC admits a single-variable characterization
The Proposed Algorithm Distributed Implementation
Implementation of PGC
Define a stepsize parameter as
β_i := ∑_{j∈N_i} (ρ_ij + ρ_ji) + ω_i, ∀ i.
(ω_i: proximal parameters; ρ_ij: penalty parameters for the constraints)
Define a stepsize matrix Υ := diag([β_1, · · · , β_N]) ≻ 0.
Define a weight matrix W ∈ R^{N×N} (a row-stochastic matrix) by
W[i, j] = (ρ_ji + ρ_ij) / (∑_{ℓ∈N_i}(ρ_ℓi + ρ_iℓ) + ω_i) = (ρ_ji + ρ_ij)/β_i,   if e_ij ∈ E,
W[i, i] = ω_i / (∑_{ℓ∈N_i}(ρ_ℓi + ρ_iℓ) + ω_i) = ω_i/β_i,   ∀ i ∈ V,
W[i, j] = 0,   otherwise.
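A minimal sketch of this construction on a small assumed graph, verifying that the resulting W is row stochastic; the 4-node cycle and the values of ρ_ij and ω_i are illustrative.

```python
import numpy as np

N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                  # assumed 4-node cycle
rho = {(i, j): 0.5 for (i, j) in edges}                    # one penalty per directed pair
rho.update({(j, i): 0.5 for (i, j) in edges})
omega = np.array([1.0, 2.0, 1.5, 1.0])                     # proximal parameters (assumed)

neighbors = [[j for j in range(N) if (i, j) in rho] for i in range(N)]

# Stepsize parameters: beta_i = sum_{j in N_i} (rho_ij + rho_ji) + omega_i
beta = np.array([sum(rho[(i, j)] + rho[(j, i)] for j in neighbors[i]) + omega[i]
                 for i in range(N)])
Upsilon = np.diag(beta)                                    # the diagonal stepsize matrix

# Row-stochastic weight matrix W
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = omega[i] / beta[i]
    for j in neighbors[i]:
        W[i, j] = (rho[(j, i)] + rho[(i, j)]) / beta[i]

print(W.sum(axis=1))   # every row sums to 1, so W is row stochastic
```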
The Proposed Algorithm Distributed Implementation
Implementation of PGC (cont.)
Implementation of PGC
Let ζ^r ∈ ∂h(x^r) be a subgradient of the nonsmooth function; then the PGC algorithm admits the following single-variable characterization
x^{r+1} − x^r + Υ^{−1}(ζ^{r+1} − ζ^r) = Υ^{−1}(−∇g(x^r) + ∇g(x^{r−1})) + W x^r − ½(I_N + W) x^{r−1}.
In particular, for smooth problems
x^{r+1} = W x^r − Υ^{−1}∇g(x^r) + Υ^{−1}∇g(x^{r−1}) + x^r − ½(I_N + W) x^{r−1}.
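A minimal sketch of the smooth-case recursion on a toy quadratic consensus problem, with W and Υ built exactly as on the previous slide from assumed parameters (ring graph, ρ_ij = 0.5, ω_i = 2.5 > P_i = 1); the choice of first iterate, x^1 = W x^0 − Υ^{−1}∇g(x^0), is an assumption made for the sketch.

```python
import numpy as np

# Toy problem (assumed): g_i(y) = 0.5 * (y - a_i)^2 on a ring of N agents.
N = 5
rng = np.random.default_rng(2)
a = rng.normal(size=N)
grad = lambda x: x - a                       # stacked gradients; P_i = 1 for all i

rho, omega = 0.5, 2.5                        # assumed penalty / proximal parameters
beta = 2 * (rho + rho) + omega               # two neighbors per node on the ring
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = omega / beta
    W[i, (i - 1) % N] = (rho + rho) / beta
    W[i, (i + 1) % N] = (rho + rho) / beta
Ups_inv = np.eye(N) / beta                   # Upsilon^{-1} (here beta_i is the same for all i)
half_IW = 0.5 * (np.eye(N) + W)

x_prev = np.zeros(N)
x = W @ x_prev - Ups_inv @ grad(x_prev)      # assumed first iterate
for r in range(2000):
    x_next = (W @ x - Ups_inv @ grad(x) + Ups_inv @ grad(x_prev)
              + x - half_IW @ x_prev)
    x_prev, x = x, x_next

print(np.max(np.abs(x - a.mean())))          # agents agree on the average of the a_i
```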
The Proposed Algorithm Convergence Analysis
Convergence Analysis
We analyze the (rate of) convergence of the proposed methods
Let us define a matrix of Lipschitz constants
P = diag([P_1, · · · , P_N]).
Measure the convergence rate by [Gao et al 14, Ouyang et al 14]
|f(x^r) − f(x^*)| (objective gap), and ‖Ax^r + Bz^r‖ (consensus gap)
The Proposed Algorithm Convergence Analysis
Convergence Analysis
Table: Main Convergence Results.
Network Type   Gradient Type   Convergence Condition   Convergence Rate
Static         Exact           ΥW + Υ ≻ 2P             O(1/r)
Static         Stochastic      ΥW + Υ ≻ 2P             O(1/√r)
Random         Exact           Ω ≻ P                   O(1/r)
Random         Stochastic      Ω ≻ P                   O(1/√r)
Note: For the exact gradient case, stepsize β can be halved if onlyconvergence is needed
The Proposed Algorithm Convergence Analysis
Outline
1 Review of Distributed Optimization
2 The Proposed Algorithm
  The Proposed Algorithms
  Distributed Implementation
  Convergence Analysis
3 Connection to Existing Methods
4 Numerical Results
5 Concluding Remarks
Connection to Existing Methods
Comparison with Different Algorithms
Algorithm            Connection with DySPGC   Special Setting
EXTRA [Shi 14]       Special case             Static, h ≡ 0, W = W^T, G = ∇g
DSG [Nedic 09]       Different x-step         Static, g smooth, G = ∇g
IC-ADMM [Chang 14]   Special case             Static, G = ∇g, g composite
DLM [Ling 15]        Special case             Static, G = ∇g, h ≡ 0, β_ij = β, ρ_ij = ρ
PG-EXTRA [Shi 15]    Special case             Static, W = W^T, G = ∇g
Connection to Existing Methods
Comparison with Different Algorithms
Figure: Relationship among different algorithms
Connection to Existing Methods
The EXTRA Related Algorithms
The EXTRA-related algorithms (for either the smooth or the nonsmooth case) [Shi et al 14, 15] are special cases of DySPGC with
1 Symmetric weight matrix W = WT
2 Exact gradient
3 Scalar stepsize
4 Static graph
Connection to Existing Methods
The DSG Method
Replacing our x-update by (setting the dual variable λ^r = 0)
x^{r+1} = arg min ⟨∇g(x^r), x − x^r⟩ + ⟨0, Ax + Bz^r⟩ + ½‖Ax + Bz^r‖²_Γ + ½‖x − x^r‖²_Ω
and letting β_i = β_j = β, the PGC algorithm becomes
x^{r+1} = −(1/β)∇g(x^r) + W̃ x^r,   with W̃ = ½(I + W)
This is precisely the DSG iteration
Its convergence is not covered by our results
Connection to Existing Methods
Outline
1 Review of Distributed Optimization
2 The Proposed Algorithm
  The Proposed Algorithms
  Distributed Implementation
  Convergence Analysis
3 Connection to Existing Methods
4 Numerical Results
5 Concluding Remarks
Numerical Results
Numerical Results
Some preliminary numerical results, obtained by solving a LASSO problem
min_x ½ ∑_{i=1}^N ‖A_i x − b_i‖² + ν‖x‖_1
where A_i ∈ R^{K×M}, b_i ∈ R^K
The parameters: N = 16, M = 100, ν = 0.1, K = 200
Data matrices randomly generated
Static graphs, generated according to the method proposed in [Yildiz-Scaglione 08], with the radius parameter set to 0.4.
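A minimal sketch of this experimental setup; the Gaussian data generation, the sparse ground truth, and the use of λ_max(A_i^T A_i) as the Lipschitz constant P_i are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

N, M, K, nu = 16, 100, 200, 0.1                   # parameters from the slide
rng = np.random.default_rng(0)

x_true = np.zeros(M)                              # assumed sparse ground truth
x_true[rng.choice(M, size=10, replace=False)] = rng.normal(size=10)
A = [rng.normal(size=(K, M)) for _ in range(N)]   # random data matrices
b = [Ai @ x_true + 0.1 * rng.normal(size=K) for Ai in A]

# Per-agent smooth part g_i, its gradient, and a Lipschitz constant P_i
g      = lambda i, x: 0.5 * np.sum((A[i] @ x - b[i]) ** 2)
grad_g = lambda i, x: A[i].T @ (A[i] @ x - b[i])
P = [np.linalg.norm(Ai.T @ Ai, 2) for Ai in A]    # largest eigenvalue of A_i^T A_i

def lasso_obj(x):
    """Centralized objective, useful as a reference when plotting progress."""
    return sum(g(i, x) for i in range(N)) + nu * np.sum(np.abs(x))

print(lasso_obj(np.zeros(M)), max(P))
```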
Numerical Results
Comparison between PG-EXTRA and PGC
Stepsize of PG-EXTRA chosen according to the conditions given in [Shi 14]
W is the Metropolis constant-edge-weight matrix
PGC: ω_i = P_i/2, ρ_ij = 10^{−3}
Figure: Comparison between PG-EXTRA and PGC
Numerical Results
Comparison between DSG and Stochastic PGC
Stepsize of DSG chosen as a small constant
σ² = 0.1
W is the Metropolis constant-edge-weight matrix
SPGC: ω_i = P_i, ρ_ij = 10^{−3}
Figure: Comparison between DSG and SPGC
Numerical Results
Outline
1 Review of Distributed Optimization
2 The Proposed Algorithm
  The Proposed Algorithms
  Distributed Implementation
  Convergence Analysis
3 Connection to Existing Methods
4 Numerical Results
5 Concluding Remarks
Concluding Remarks
Summary
Developed the DySPGC algorithm for multi-agent optimization
It can deal with
1 Stochastic gradient
2 Time-varying networks
3 Nonsmooth composite objective
Convergence rate guarantee for various scenarios
Concluding Remarks
Future Work/Generalization
Identified the relation between DSG-type and ADMM-type methods
Allows for significant generalization
1 Acceleration [Ouyang et al 15]
2 Variance reduction for the local problem when f_i is a finite sum
  f_i(x_i) = ∑_{j=1}^M ℓ_j(x_i)
3 Inexact x-subproblems (using, e.g., Conditional-Gradient)
4 Nonconvex problems [H.-Luo-Razaviyayn 14]
5 ...
Concluding Remarks
Thank You!
Concluding Remarks
Parameter Selection
It is easy to pick the various parameters in the different scenarios
Case A: The weight matrix W is given and symmetric
1 We must have β_i = β_j = β;
2 For any fixed β, we can compute (Ω, ρ_ij)
3 Increase β to satisfy the convergence condition
Case B: The user has the freedom to pick (ρ_ij, Ω)
1 For any set of (ρ_ij, Ω), we can compute W and β_i
2 Increase Ω to satisfy the convergence condition
In either case, the convergence condition can be verified by local agents
Concluding Remarks
Case 1: Exact Gradient with Static Graph
Convergence for PGC Algorithm
Suppose that problem (Q) has a nonempty set of optimal solutions X^* ≠ ∅. Suppose G^r = G for all r and G is connected. Then the PGC converges to a primal-dual optimal solution if
2Ω + M_+ Ξ M_+^T = ΥW + Υ ≻ P.
M_+ Ξ M_+^T is a matrix related to the network topology
A sufficient condition is Ω ≻ P, or ω_i > P_i for all i ∈ V; it can be checked locally.
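The last bullet can be illustrated directly: each agent only needs its own ω_i and its own Lipschitz constant P_i to verify the sufficient condition, so no global information is required. A tiny sketch with assumed values:

```python
import numpy as np

P     = np.array([0.8, 1.2, 0.5, 2.0])   # local Lipschitz constants (assumed)
omega = np.array([1.0, 1.5, 1.0, 2.5])   # chosen proximal parameters (assumed)
print(omega > P)                         # each entry is one agent's local check of omega_i > P_i
```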
Concluding Remarks
Case 2: Stochastic Gradient with Static Graph
Convergence for SPGC Algorithm
Assume that dom(h) is a bounded set. Suppose that the following conditions hold
η^{r+1} = √(r + 1), ∀ r,
and the stepsize matrix satisfies
2Ω + M_+ Ξ M_+^T = ΥW + Υ ≻ 2P.   (8)
Then at a given iteration r, we have
E[f(x^r) − f(x^*)] + ρ‖Ax^r + Bz^r‖ ≤ σ²/√r + d_x²/(2√r) + (1/(2r)) (d_z² + d_λ²(ρ) + max_i ω_i d_x²)
where d_λ(ρ) > 0, d_x > 0, d_z > 0 are some problem-dependent constants.
Concluding Remarks
Case 2: Stochastic Gradient with Static Graph (cont.)
Both the objective value and the constraint violation converge at a rate of O(1/√r)
Easy to extend to the exact gradient case, with rate O(1/r)
Requires larger proximal parameter Ω than Case 1
Concluding Remarks
Case 3: Exact Gradient with Time-Varying Graph
Convergence for DySPGC Algorithm
Suppose that problem (Q) has a nonempty set of optimal solutions X^* ≠ ∅, and G(x^r, ξ^{r+1}) = ∇g(x^r) for all r. Suppose the graph is randomly generated. If we choose the stepsize
Ω ≻ ½ P
then (x^r, z^r, λ^r) converges w.p.1 to a primal-dual solution.
1 The stepsize condition is more restrictive than in Case 1 (it does not depend on the graph)
2 Convergence is in the sense of with probability 1
Concluding Remarks
Case 4: Stochastic Gradient with Time-Varying Graph
Convergence for DySPGC Algorithm
Suppose w^t = (x^t, z^t, λ^t) is a sequence generated by DySPGC, and that
η^{r+1} = √(r + 1), ∀ r, and Ω ≻ P.
Then we have
E[f(x^r) − f(x^*) + ρ‖Ax^r + Bz^r‖] ≤ σ²/√r + d_x²/(2√r) + (1/(2r)) (2d_J + d_z² + d_λ²(ρ) + max_i ω_i d_x²)
where d_λ(ρ), d_J, d_x, d_z are some positive constants.