Linear Convergent Decentralized Optimization with Compression
Xiaorui Liu (http://cse.msu.edu/~xiaorui/)
Joint work with Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan
Data Science and Engineering Lab, Department of Computer Science and Engineering
Michigan State University
ICLR 2021, May 6th
Introduction
Problem
$$x^* := \arg\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right]$$
$f_i(\cdot)$ is the local objective at agent $i$.
[Figure: centralized vs. decentralized communication topologies.]
Introduction
Matrix notations
$$\mathbf{X}^k = \begin{bmatrix} (x_1^k)^\top \\ \vdots \\ (x_n^k)^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad \nabla F(\mathbf{X}^k) = \begin{bmatrix} (\nabla f_1(x_1^k))^\top \\ \vdots \\ (\nabla f_n(x_n^k))^\top \end{bmatrix} \in \mathbb{R}^{n \times d}.$$
The symmetric mixing matrix $W \in \mathbb{R}^{n \times n}$ encodes the communication network:
$W\mathbf{X} = \mathbf{X}$ iff $x_1 = x_2 = \cdots = x_n$,
$$-1 < \lambda_n(W) \le \lambda_{n-1}(W) \le \cdots \le \lambda_2(W) < \lambda_1(W) = 1.$$
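As a concrete construction (our own sketch, not from the slides), Metropolis-Hastings weights turn any connected undirected graph into such a $W$; here on a 5-agent ring:

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric mixing matrix from an adjacency matrix: for each edge
    (i, j), W[i, j] = 1 / (1 + max(deg_i, deg_j)); the diagonal absorbs
    the remaining mass so every row sums to 1."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# 5-agent ring: each agent communicates with its two neighbors only.
n = 5
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
print(np.sort(np.linalg.eigvalsh(W)))   # -1 < ... < lambda_1(W) = 1
```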
Introduction
Communication Compression for decentralized optimization
- DCD-SGD, ECD-SGD [TGZ+18]
- QDGD, QuanTimed-DSGD [RMHP19, RTM+19]
- DeepSqueeze [TLQ+19]
- CHOCO-SGD [KSJ19]
- ...
These reduce to DGD-type algorithms, which suffer from a convergence bias:
$$\mathbf{X}^* \neq W\mathbf{X}^* - \eta \nabla F(\mathbf{X}^*),$$
so their convergence degrades on heterogeneous data (see the sketch below).
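To make the bias concrete, a minimal sketch (our own illustration, assuming two agents with heterogeneous quadratics $f_i(x) = (x - a_i)^2/2$): the DGD fixed point depends on $\eta$ and violates consensus, even though the global minimizer is simply $\bar{a}$.

```python
import numpy as np

# Two agents, heterogeneous quadratics f_i(x) = (x - a_i)^2 / 2, so the
# global minimizer is x* = mean(a) = 5.  DGD: x^{k+1} = W x^k - eta * grad.
a = np.array([0.0, 10.0])
W = np.array([[0.5, 0.5], [0.5, 0.5]])
eta = 0.1

x = np.zeros(2)
for _ in range(10_000):
    x = W @ x - eta * (x - a)

print("DGD fixed point:", x)                 # approx [4.545, 5.455]
print("bias:", np.abs(x - a.mean()).max())   # approx 0.455, not 0
```

Shrinking $\eta$ shrinks the bias but slows convergence; removing the bias entirely is what a primal-dual correction makes possible.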
LEAD is the first primal-dual decentralized optimization algorithm with compression, and it attains linear convergence.
Algorithm: LEAD
NIDS [LSY19] / D2 [TLY+18] (stochastic version of NIDS)
$$\mathbf{X}^{k+1} = \frac{I + W}{2}\left(2\mathbf{X}^k - \mathbf{X}^{k-1} - \eta \nabla F(\mathbf{X}^k; \xi^k) + \eta \nabla F(\mathbf{X}^{k-1}; \xi^{k-1})\right)$$
A two-step reformulation [LY19]:
$$\mathbf{D}^{k+1} = \mathbf{D}^k + \frac{I - W}{2\eta}\left(\mathbf{X}^k - \eta \nabla F(\mathbf{X}^k) - \eta \mathbf{D}^k\right),$$
$$\mathbf{X}^{k+1} = \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k) - \eta \mathbf{D}^{k+1}.$$
Concise and conceptual form of LEAD:
$$\begin{aligned}
\mathbf{Y}^k &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^k \\
\hat{\mathbf{Y}}^k &= \text{CompressionProcedure}(\mathbf{Y}^k) \\
\mathbf{D}^{k+1} &= \mathbf{D}^k + \frac{\gamma}{2\eta}(I - W)\hat{\mathbf{Y}}^k \\
\mathbf{X}^{k+1} &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^{k+1}
\end{aligned}$$
Algorithm: LEAD
LEAD
$$\begin{aligned}
\mathbf{Y}^k &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^k \\
\hat{\mathbf{Y}}^k &= \text{CompressionProcedure}(\mathbf{Y}^k) \\
\mathbf{D}^{k+1} &= \mathbf{D}^k + \frac{\gamma}{2\eta}(I - W)\hat{\mathbf{Y}}^k = \mathbf{D}^k + \frac{\gamma}{2\eta}(\hat{\mathbf{Y}}^k - \hat{\mathbf{Y}}_w^k) \\
\mathbf{X}^{k+1} &= \mathbf{X}^k - \eta \nabla F(\mathbf{X}^k; \xi^k) - \eta \mathbf{D}^{k+1}
\end{aligned}$$
Compression Procedure
$$\begin{aligned}
\mathbf{Q}^k &= \text{Compress}(\mathbf{Y}^k - \mathbf{H}^k) && \triangleright\ \text{compression} \\
\hat{\mathbf{Y}}^k &= \mathbf{H}^k + \mathbf{Q}^k \\
\hat{\mathbf{Y}}_w^k &= \mathbf{H}_w^k + W\mathbf{Q}^k && \triangleright\ \text{communication} \\
\mathbf{H}^{k+1} &= (1 - \alpha)\mathbf{H}^k + \alpha \hat{\mathbf{Y}}^k \\
\mathbf{H}_w^{k+1} &= (1 - \alpha)\mathbf{H}_w^k + \alpha \hat{\mathbf{Y}}_w^k
\end{aligned}$$
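Putting the two boxes together, here is a minimal NumPy sketch of one LEAD iteration under our reading of the slides (deterministic gradients for simplicity; all function and variable names are ours, and the smoke test uses an illustrative fully connected graph):

```python
import numpy as np

def lead_step(X, D, H, Hw, grad, W, eta, gamma, alpha, compress):
    """One LEAD iteration.

    X, D:  (n, d) primal iterates and dual variables, one row per agent.
    H, Hw: compression states; Hw tracks W @ H without extra communication.
    grad:  maps X to the (n, d) array whose row i is grad f_i(x_i).
    """
    G = grad(X)                     # the same gradient is reused below
    Y = X - eta * G - eta * D
    # --- compression procedure ---
    Q = compress(Y - H)             # compress the difference, not Y itself
    Y_hat = H + Q
    Yw_hat = Hw + W @ Q             # the only communication round
    H = (1 - alpha) * H + alpha * Y_hat
    Hw = (1 - alpha) * Hw + alpha * Yw_hat
    # --- dual and primal updates ---
    D = D + gamma / (2 * eta) * (Y_hat - Yw_hat)
    X = X - eta * G - eta * D
    return X, D, H, Hw

# Smoke test: exact compression, f_i(x) = ||x - a_i||^2 / 2, so the
# global minimizer is the average of the a_i; every row should reach it.
n, d = 5, 3
a = np.random.default_rng(0).normal(size=(n, d))
W = np.full((n, n), 1.0 / n)        # fully connected graph for this test
X, D, H, Hw = (np.zeros((n, d)) for _ in range(4))
for _ in range(500):
    X, D, H, Hw = lead_step(X, D, H, Hw, grad=lambda Z: Z - a, W=W,
                            eta=0.3, gamma=1.0, alpha=0.5,
                            compress=lambda v: v)
print(np.allclose(X, a.mean(axis=0), atol=1e-8))    # True
```

Only the compressed $\mathbf{Q}^k$ crosses the network: each agent reconstructs $\hat{\mathbf{Y}}_w^k$ from its local state $\mathbf{H}_w^k$, so full-precision vectors are never transmitted.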
Algorithm: LEAD
How does LEAD work?
Gradient Correction
$$\mathbf{X}^{k+1} = \mathbf{X}^k - \eta\left(\nabla F(\mathbf{X}^k; \xi^k) + \mathbf{D}^{k+1}\right), \qquad \nabla F(\mathbf{X}^k; \xi^k) + \mathbf{D}^{k+1} \to 0$$
Difference Compression
$$\mathbf{Q}^k = \text{Compress}(\mathbf{Y}^k - \mathbf{H}^k)$$
$$\mathbf{Y}^k \to \mathbf{X}^*,\ \mathbf{H}^k \to \mathbf{X}^* \ \Rightarrow\ \mathbf{Y}^k - \mathbf{H}^k \to 0 \ \Rightarrow\ \|\mathbf{Q}^k - (\mathbf{Y}^k - \mathbf{H}^k)\| \to 0$$
Implicit Error Compensation
Let $\mathbf{E}^k = \hat{\mathbf{Y}}^k - \mathbf{Y}^k$ denote the compression error. Then
$$\mathbf{D}^{k+1} = \mathbf{D}^k + \frac{\gamma}{2\eta}(\hat{\mathbf{Y}}^k - \hat{\mathbf{Y}}_w^k) = \mathbf{D}^k + \frac{\gamma}{2\eta}(I - W)\mathbf{Y}^k + \frac{\gamma}{2\eta}(\mathbf{E}^k - W\mathbf{E}^k),$$
so the compression error is absorbed into the dual variable and compensated in subsequent updates instead of accumulating.
Assumption
Compression: $\mathbb{E}[Q(\mathbf{x})] = \mathbf{x}$ and $\mathbb{E}\|\mathbf{x} - Q(\mathbf{x})\|_2^2 \le C\|\mathbf{x}\|_2^2$ for some $C \ge 0$.
$f_i(\cdot)$ is $\mu$-strongly convex and $L$-smooth:
$$f_i(\mathbf{x}) \ge f_i(\mathbf{y}) + \langle \nabla f_i(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2,$$
$$f_i(\mathbf{x}) \le f_i(\mathbf{y}) + \langle \nabla f_i(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|^2.$$
Gradient: $\mathbb{E}_\xi[\nabla f_i(\mathbf{x}; \xi)] = \nabla f_i(\mathbf{x})$ and $\mathbb{E}_\xi\|\nabla f_i(\mathbf{x}; \xi) - \nabla f_i(\mathbf{x})\|_2^2 \le \sigma^2$.
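As an example, QSGD-style stochastic quantization is one operator that satisfies the compression assumption; below is a minimal sketch (our own illustration, not code from the paper) together with an empirical check of unbiasedness and of the relative-variance constant $C$.

```python
import numpy as np

def quantize(x, s=2, rng=None):
    """Unbiased stochastic quantization (QSGD-style): scale x by its
    l2 norm, randomly round each coordinate to one of s uniform levels
    so that E[Q(x)] = x, with E||Q(x) - x||^2 <= C ||x||^2."""
    rng = rng if rng is not None else np.random.default_rng()
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x
    level = np.abs(x) / norm * s            # each entry lies in [0, s]
    low = np.floor(level)
    round_up = rng.random(x.shape) < (level - low)
    return np.sign(x) * (low + round_up) * (norm / s)

# Empirical check of both parts of the assumption.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
samples = np.stack([quantize(x, rng=rng) for _ in range(20_000)])
print(np.abs(samples.mean(axis=0) - x).max())       # close to 0: unbiased
print(samples.var(axis=0).sum() / (x ** 2).sum())   # empirical C, finite
```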
Theory
$$\kappa_f = \frac{L}{\mu}, \qquad \kappa_g = \frac{\lambda_{\max}(I - W)}{\lambda_{\min}^{+}(I - W)},$$
where $\lambda_{\min}^{+}$ denotes the smallest nonzero eigenvalue.
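For intuition, $\kappa_g$ can be read off the spectrum of $I - W$; a small sketch (our own, reusing the 5-agent ring from the earlier example):

```python
import numpy as np

# kappa_g = lambda_max(I - W) / lambda_min^+(I - W), where lambda_min^+
# is the smallest nonzero eigenvalue.  W is the 5-agent ring from above
# (Metropolis weights on a 2-regular graph: all nonzeros equal 1/3).
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0

eigs = np.linalg.eigvalsh(np.eye(n) - W)      # ascending; eigs[0] ~ 0
kappa_g = eigs[-1] / eigs[eigs > 1e-12].min()
print(kappa_g)                                # ~2.62 for this ring
```

Sparser or larger graphs have smaller spectral gaps, hence larger $\kappa_g$ and slower decentralized convergence.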
Theorem (Complexity bounds when $\sigma = 0$)
LEAD converges to an $\varepsilon$-accurate solution with iteration complexity
$$O\left(\big((1 + C)(\kappa_f + \kappa_g) + C\,\kappa_f \kappa_g\big) \log \frac{1}{\varepsilon}\right).$$
When $C = 0$ (i.e., no compression) or $C \le \frac{\kappa_f + \kappa_g}{\kappa_f \kappa_g + \kappa_f + \kappa_g}$, the iteration complexity is
$$O\left((\kappa_f + \kappa_g) \log \frac{1}{\varepsilon}\right).$$
This recovers the convergence rate of NIDS [LSY19].
Theory
Theorem (Complexity bounds when $\sigma = 0$)
With $C = 0$ (or $C \le \frac{\kappa_f + \kappa_g}{\kappa_f \kappa_g + \kappa_f + \kappa_g}$) and a fully connected communication graph (i.e., $W = \frac{\mathbf{1}\mathbf{1}^\top}{n}$), the iteration complexity is
$$O\left(\kappa_f \log \frac{1}{\varepsilon}\right).$$
This recovers the convergence rate of gradient descent [Nes13].
Theorem (Error bound when $\sigma > 0$)
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left\|x_i^k - x^*\right\|^2 \lesssim O\left(\frac{1}{k}\right)$$
Experiment
[Figure: $\|\mathbf{X}^k - \mathbf{X}^*\|_F^2$ versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]
Linear regression ($\sigma = 0$).
Experiment
[Figure: consensus error (left) and compression error (right) versus epochs; the consensus-error panel compares DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits), while the compression-error panel covers the compressed methods only.]
Linear regression ($\sigma = 0$).
Experiment
[Figure: loss $f(\mathbf{X}^k)$ versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]
Logistic regression ($\sigma > 0$).
Experiment
[Figure: loss $f(\mathbf{X}^k)$ versus bits transmitted on homogeneous data (left) and heterogeneous data (right); QDGD*, DeepSqueeze*, and CHOCO-SGD* diverge on heterogeneous data.]
Stochastic optimization for deep learning ($*$ indicates divergence).
Conclusion
LEAD is the first primal-dual decentralized optimization algorithm with compression, and it attains linear convergence for strongly convex and smooth objectives.
LEAD supports unbiased compression of arbitrary precision.
LEAD works well for nonconvex problems such as training deep neural networks.
LEAD is robust to parameter settings and needs little effort for parameter tuning.
Please check our paper and poster for more details.
References
[KSJ19] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 3479–3487.
[LSY19] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing 67 (2019), no. 17, 4494–4506.
[LY19] Yao Li and Ming Yan. On linear convergence of two decentralized algorithms. arXiv preprint arXiv:1906.07225 (2019).
[Nes13] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87, Springer Science & Business Media, 2013.
[RMHP19] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing 67 (2019), no. 19, 4934–4947.
[RTM+19] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. Advances in Neural Information Processing Systems, 2019, pp. 8388–8399.
[TGZ+18] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.
[TLQ+19] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. DeepSqueeze: Decentralization meets error-compensated compression. CoRR abs/1907.07346 (2019).
[TLY+18] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D²: Decentralized training over decentralized data. Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 4848–4856.