Linear Convergent Decentralized Optimization with Compression
Xiaorui Liu (http://cse.msu.edu/~xiaorui/)
Joint work with Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan
Data Science and Engineering Lab, Department of Computer Science and Engineering
Michigan State University
ICLR 2021, May 6th
Introduction
Problem
$$x^* := \arg\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right]$$
$f_i(\cdot)$ is the local objective at agent $i$.
[Figure: centralized vs. decentralized communication topologies.]
Introduction
Matrix notations
$$\mathbf{X}^k = \begin{bmatrix} (x_1^k)^\top \\ \vdots \\ (x_n^k)^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad \nabla F(\mathbf{X}^k) = \begin{bmatrix} (\nabla f_1(x_1^k))^\top \\ \vdots \\ (\nabla f_n(x_n^k))^\top \end{bmatrix} \in \mathbb{R}^{n \times d}.$$
The symmetric mixing matrix $W \in \mathbb{R}^{n \times n}$ encodes the communication network:
$W\mathbf{X} = \mathbf{X}$ iff $x_1 = x_2 = \cdots = x_n$,
$$-1 < \lambda_n(W) \le \lambda_{n-1}(W) \le \cdots \le \lambda_2(W) < \lambda_1(W) = 1.$$
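As a concrete construction (our own sketch, not from the slides), Metropolis-Hastings weights turn any connected undirected graph into such a $W$; here on a 5-agent ring:

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric mixing matrix from an adjacency matrix: for each edge
    (i, j), W[i, j] = 1 / (1 + max(deg_i, deg_j)); the diagonal absorbs
    the remaining mass so every row sums to 1."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# 5-agent ring: each agent communicates with its two neighbors only.
n = 5
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
print(np.sort(np.linalg.eigvalsh(W)))   # -1 < ... < lambda_1(W) = 1
```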
Introduction
Communication Compression for decentralized optimization
- DCD-SGD, ECD-SGD [TGZ+18]
- QDGD, QuanTimed-DSGD [RMHP19, RTM+19]
- DeepSqueeze [TLQ+19]
- CHOCO-SGD [KSJ19]
- ...
These reduce to DGD-type algorithms, which suffer from a convergence bias:
$$\mathbf{X}^* \neq W\mathbf{X}^* - \eta \nabla F(\mathbf{X}^*),$$
so their convergence degrades on heterogeneous data (see the sketch below).
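To make the bias concrete, a minimal sketch (our own illustration, assuming two agents with heterogeneous quadratics $f_i(x) = (x - a_i)^2/2$): the DGD fixed point depends on $\eta$ and violates consensus, even though the global minimizer is simply $\bar{a}$.

```python
import numpy as np

# Two agents, heterogeneous quadratics f_i(x) = (x - a_i)^2 / 2, so the
# global minimizer is x* = mean(a) = 5.  DGD: x^{k+1} = W x^k - eta * grad.
a = np.array([0.0, 10.0])
W = np.array([[0.5, 0.5], [0.5, 0.5]])
eta = 0.1

x = np.zeros(2)
for _ in range(10_000):
    x = W @ x - eta * (x - a)

print("DGD fixed point:", x)                 # approx [4.545, 5.455]
print("bias:", np.abs(x - a.mean()).max())   # approx 0.455, not 0
```

Shrinking $\eta$ shrinks the bias but slows convergence; removing the bias entirely is what a primal-dual correction makes possible.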
LEAD is the first primal-dual decentralized optimization algorithm with compression, and it attains linear convergence.
Algorithm: LEAD
NIDS [LSY19] / D2 [TLY+18] (stochastic version of NIDS)
$$\mathbf{X}^{k+1} = \frac{I + W}{2}\left(2\mathbf{X}^k - \mathbf{X}^{k-1} - \eta \nabla F(\mathbf{X}^k; \xi^k) + \eta \nabla F(\mathbf{X}^{k-1}; \xi^{k-1})\right)$$
A two-step reformulation [LY19]:
$$\mathbf{D}^{k+1} = \mathbf{D}^k + \frac{I - W}{2\eta}\left(\mathbf{X}^k - \eta \nabla F(\mathbf{X}^k) - \eta \mathbf{D}^k\right),$$
$$\mathbf{X}^{k+1} = \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k) - \eta \mathbf{D}^{k+1}.$$
Concise and conceptual form of LEAD:
$$\begin{aligned}
\mathbf{Y}^k &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^k \\
\hat{\mathbf{Y}}^k &= \text{CompressionProcedure}(\mathbf{Y}^k) \\
\mathbf{D}^{k+1} &= \mathbf{D}^k + \frac{\gamma}{2\eta}(I - W)\hat{\mathbf{Y}}^k \\
\mathbf{X}^{k+1} &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^{k+1}
\end{aligned}$$
Algorithm: LEAD
LEAD
$$\begin{aligned}
\mathbf{Y}^k &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^k \\
\hat{\mathbf{Y}}^k &= \text{CompressionProcedure}(\mathbf{Y}^k) \\
\mathbf{D}^{k+1} &= \mathbf{D}^k + \frac{\gamma}{2\eta}(I - W)\hat{\mathbf{Y}}^k = \mathbf{D}^k + \frac{\gamma}{2\eta}(\hat{\mathbf{Y}}^k - \hat{\mathbf{Y}}_w^k) \\
\mathbf{X}^{k+1} &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^{k+1}
\end{aligned}$$
Compression Procedure
$$\begin{aligned}
\mathbf{Q}^k &= \text{Compress}(\mathbf{Y}^k - \mathbf{H}^k) && \triangleright\ \text{compression} \\
\hat{\mathbf{Y}}^k &= \mathbf{H}^k + \mathbf{Q}^k \\
\hat{\mathbf{Y}}_w^k &= \mathbf{H}_w^k + W\mathbf{Q}^k && \triangleright\ \text{communication} \\
\mathbf{H}^{k+1} &= (1 - \alpha)\mathbf{H}^k + \alpha \hat{\mathbf{Y}}^k \\
\mathbf{H}_w^{k+1} &= (1 - \alpha)\mathbf{H}_w^k + \alpha \hat{\mathbf{Y}}_w^k
\end{aligned}$$
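Putting the two boxes together, here is a minimal NumPy sketch of one LEAD iteration under our reading of the slides (deterministic gradients for simplicity; all function and variable names are ours, and the smoke test uses an illustrative fully connected graph):

```python
import numpy as np

def lead_step(X, D, H, Hw, grad, W, eta, gamma, alpha, compress):
    """One LEAD iteration.

    X, D:  (n, d) primal iterates and dual variables, one row per agent.
    H, Hw: compression states; Hw tracks W @ H without extra communication.
    grad:  maps X to the (n, d) array whose row i is grad f_i(x_i).
    """
    G = grad(X)                     # the same gradient is reused below
    Y = X - eta * G - eta * D
    # --- compression procedure ---
    Q = compress(Y - H)             # compress the difference, not Y itself
    Y_hat = H + Q
    Yw_hat = Hw + W @ Q             # the only communication round
    H = (1 - alpha) * H + alpha * Y_hat
    Hw = (1 - alpha) * Hw + alpha * Yw_hat
    # --- dual and primal updates ---
    D = D + gamma / (2 * eta) * (Y_hat - Yw_hat)
    X = X - eta * G - eta * D
    return X, D, H, Hw

# Smoke test: exact compression, f_i(x) = ||x - a_i||^2 / 2, so the
# global minimizer is the average of the a_i; every row should reach it.
n, d = 5, 3
a = np.random.default_rng(0).normal(size=(n, d))
W = np.full((n, n), 1.0 / n)        # fully connected graph for this test
X, D, H, Hw = (np.zeros((n, d)) for _ in range(4))
for _ in range(500):
    X, D, H, Hw = lead_step(X, D, H, Hw, grad=lambda Z: Z - a, W=W,
                            eta=0.3, gamma=1.0, alpha=0.5,
                            compress=lambda v: v)
print(np.allclose(X, a.mean(axis=0), atol=1e-8))    # True
```

Only the compressed $\mathbf{Q}^k$ crosses the network: each agent reconstructs $\hat{\mathbf{Y}}_w^k$ from its local state $\mathbf{H}_w^k$, so full-precision vectors are never transmitted.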
Algorithm: LEAD
How does LEAD work?
Gradient Correction
$$\mathbf{X}^{k+1} = \mathbf{X}^k - \eta\left(\nabla F(\mathbf{X}^k; \xi^k) + \mathbf{D}^{k+1}\right), \qquad \nabla F(\mathbf{X}^k; \xi^k) + \mathbf{D}^{k+1} \to 0$$
Difference Compression
$$\mathbf{Q}^k = \text{Compress}(\mathbf{Y}^k - \mathbf{H}^k)$$
$$\mathbf{Y}^k \to \mathbf{X}^*,\ \mathbf{H}^k \to \mathbf{X}^* \ \Rightarrow\ \mathbf{Y}^k - \mathbf{H}^k \to 0 \ \Rightarrow\ \|\mathbf{Q}^k - (\mathbf{Y}^k - \mathbf{H}^k)\| \to 0$$
Implicit Error Compensation
Let $\mathbf{E}^k = \hat{\mathbf{Y}}^k - \mathbf{Y}^k$ denote the compression error. Then
$$\mathbf{D}^{k+1} = \mathbf{D}^k + \frac{\gamma}{2\eta}(\hat{\mathbf{Y}}^k - \hat{\mathbf{Y}}_w^k) = \mathbf{D}^k + \frac{\gamma}{2\eta}(I - W)\mathbf{Y}^k + \frac{\gamma}{2\eta}(\mathbf{E}^k - W\mathbf{E}^k),$$
so the compression error is absorbed into the dual variable and compensated in subsequent updates instead of accumulating.
Assumption
Compression: $\mathbb{E}[Q(\mathbf{x})] = \mathbf{x}$ and $\mathbb{E}\|\mathbf{x} - Q(\mathbf{x})\|_2^2 \le C\|\mathbf{x}\|_2^2$ for some $C \ge 0$.
$f_i(\cdot)$ is $\mu$-strongly convex and $L$-smooth:
$$f_i(\mathbf{x}) \ge f_i(\mathbf{y}) + \langle \nabla f_i(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2,$$
$$f_i(\mathbf{x}) \le f_i(\mathbf{y}) + \langle \nabla f_i(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|^2.$$
Gradient: $\mathbb{E}_\xi[\nabla f_i(\mathbf{x}; \xi)] = \nabla f_i(\mathbf{x})$ and $\mathbb{E}_\xi\|\nabla f_i(\mathbf{x}; \xi) - \nabla f_i(\mathbf{x})\|_2^2 \le \sigma^2$.
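As an example, QSGD-style stochastic quantization is one operator that satisfies the compression assumption; below is a minimal sketch (our own illustration, not code from the paper) together with an empirical check of unbiasedness and of the relative-variance constant $C$.

```python
import numpy as np

def quantize(x, s=2, rng=None):
    """Unbiased stochastic quantization (QSGD-style): scale x by its
    l2 norm, randomly round each coordinate to one of s uniform levels
    so that E[Q(x)] = x, with E||Q(x) - x||^2 <= C ||x||^2."""
    rng = rng if rng is not None else np.random.default_rng()
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x
    level = np.abs(x) / norm * s            # each entry lies in [0, s]
    low = np.floor(level)
    round_up = rng.random(x.shape) < (level - low)
    return np.sign(x) * (low + round_up) * (norm / s)

# Empirical check of both parts of the assumption.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
samples = np.stack([quantize(x, rng=rng) for _ in range(20_000)])
print(np.abs(samples.mean(axis=0) - x).max())       # close to 0: unbiased
print(samples.var(axis=0).sum() / (x ** 2).sum())   # empirical C, finite
```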
Theory
$$\kappa_f = \frac{L}{\mu}, \qquad \kappa_g = \frac{\lambda_{\max}(I - W)}{\lambda_{\min}^{+}(I - W)},$$
where $\lambda_{\min}^{+}$ denotes the smallest nonzero eigenvalue.
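For intuition, $\kappa_g$ can be read off the spectrum of $I - W$; a small sketch (our own, reusing the 5-agent ring from the earlier example):

```python
import numpy as np

# kappa_g = lambda_max(I - W) / lambda_min^+(I - W), where lambda_min^+
# is the smallest nonzero eigenvalue.  W is the 5-agent ring from above
# (Metropolis weights on a 2-regular graph: all nonzeros equal 1/3).
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0

eigs = np.linalg.eigvalsh(np.eye(n) - W)      # ascending; eigs[0] ~ 0
kappa_g = eigs[-1] / eigs[eigs > 1e-12].min()
print(kappa_g)                                # ~2.62 for this ring
```

Sparser or larger graphs have smaller spectral gaps, hence larger $\kappa_g$ and slower decentralized convergence.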
Theorem (Complexity bounds when $\sigma = 0$)
LEAD converges to an $\varepsilon$-accurate solution with iteration complexity
$$O\left(\big((1 + C)(\kappa_f + \kappa_g) + C\,\kappa_f \kappa_g\big) \log \frac{1}{\varepsilon}\right).$$
When $C = 0$ (i.e., no compression) or $C \le \frac{\kappa_f + \kappa_g}{\kappa_f \kappa_g + \kappa_f + \kappa_g}$, the iteration complexity is
$$O\left((\kappa_f + \kappa_g) \log \frac{1}{\varepsilon}\right).$$
This recovers the convergence rate of NIDS [LSY19].
Theory
Theorem (Complexity bounds when $\sigma = 0$)
With $C = 0$ (or $C \le \frac{\kappa_f + \kappa_g}{\kappa_f \kappa_g + \kappa_f + \kappa_g}$) and a fully connected communication graph (i.e., $W = \frac{\mathbf{1}\mathbf{1}^\top}{n}$), the iteration complexity is
$$O\left(\kappa_f \log \frac{1}{\varepsilon}\right).$$
This recovers the convergence rate of gradient descent [Nes13].
Theorem (Error bound when $\sigma > 0$)
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left\|x_i^k - x^*\right\|^2 \lesssim O\left(\frac{1}{k}\right)$$
Experiment
[Figure: $\|\mathbf{X}^k - \mathbf{X}^*\|_F^2$ versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]
Linear regression ($\sigma = 0$).
Experiment
[Figure: consensus error (left) and compression error (right) versus epochs; the consensus-error panel compares DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits), while the compression-error panel covers the compressed methods only.]
Linear regression ($\sigma = 0$).
Experiment
[Figure: loss $f(\mathbf{X}^k)$ versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]
Logistic regression ($\sigma > 0$).
Experiment
[Figure: loss $f(\mathbf{X}^k)$ versus bits transmitted on homogeneous data (left) and heterogeneous data (right); QDGD*, DeepSqueeze*, and CHOCO-SGD* diverge on heterogeneous data.]
Stochastic optimization for deep learning ($*$ indicates divergence).
Conclusion
LEAD is the first primal-dual decentralized optimization algorithm with compression, and it attains linear convergence for strongly convex and smooth objectives.
LEAD supports unbiased compression of arbitrary precision.
LEAD works well for nonconvex problems such as training deep neural networks.
LEAD is robust to parameter settings and needs little effort for parameter tuning.
Please check our paper and poster for more details.
References
[KSJ19] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 3479–3487.
[LSY19] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing 67 (2019), no. 17, 4494–4506.
[LY19] Yao Li and Ming Yan. On linear convergence of two decentralized algorithms. arXiv preprint arXiv:1906.07225 (2019).
[Nes13] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87, Springer Science & Business Media, 2013.
[RMHP19] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing 67 (2019), no. 19, 4934–4947.
[RTM+19] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. Advances in Neural Information Processing Systems, 2019, pp. 8388–8399.
[TGZ+18] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.
[TLQ+19] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. DeepSqueeze: Decentralization meets error-compensated compression. CoRR abs/1907.07346 (2019).
[TLY+18] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D²: Decentralized training over decentralized data. Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 4848–4856.