Balancing Computation and Communication in Distributed Optimization (slide transcript)
Sections: Introduction and Motivation · Distributed Gradient Descent Variant · Communication Computation Decoupled DGD Variants · Conclusions & Future Work
Balancing Computation and Communication in Distributed Optimization
Ermin Wei
Department of Electrical Engineering and Computer Science, Northwestern University
DIMACS, Rutgers University
Aug 23, 2017
Balancing Computation & Communication 1/38
Collaborators
Albert S. Berahas, Raghu Bollapragada, Nitish Shirish Keskar
Overview
1 Introduction and Motivation
2 Distributed Gradient Descent Variant
3 Communication Computation Decoupled DGD Variants
4 Conclusions & Future Work
Problem Formulation
min_{x ∈ R^p} f(x) = Σ_{i=1}^n f_i(x)
Applications: Sensor Networks, Robotic Teams, Machine Learning.
- Parameter estimation in sensor networks (Communication)
- Multi-agent cooperative control and coordination (Battery)
- Large-scale computation (Computation)
Algorithm Evaluation
Typical numerical results (measured in iterations, time, or communication rounds).
[Figure: error |F − F*|/F* vs. number of iterations, comparing "Existing A", "Existing B", and "Our method".]
Evaluation framework should reflect features of different applications.
Problem Formulation – Distributed Setting

Consensus optimization problem:

min_{x_i ∈ R^p} f(x) = Σ_{i=1}^n f_i(x_i)
s.t. x_i = x_j, ∀ i, j ∈ N_i

- Each node i has a local copy x_i of the parameter vector.
- At optimality, consensus is achieved among all the nodes in the network.
Consensus Optimization Problem
min_{x_i ∈ R^p} f(x) = Σ_{i=1}^n f_i(x_i)
s.t. x_i = x_j, ∀ i, j ∈ N_i
- x is the concatenation of all the local x_i's
- W is a doubly stochastic matrix that defines the connections in the network
x = [x_1; x_2; …; x_n] ∈ R^{np},   W = [w_{ij}] ∈ R^{n×n},   Z = W ⊗ I_p ∈ R^{np×np}
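These definitions are easy to instantiate in numpy. The 3-node mixing matrix below is a hand-picked hypothetical example (symmetric and doubly stochastic for a path graph), not one from the talk:

```python
import numpy as np

# Hypothetical 3-node example: a symmetric, doubly stochastic
# mixing matrix W for the path graph 1-2-3.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])

p = 2                        # dimension of each local x_i
Z = np.kron(W, np.eye(p))    # Z = W ⊗ I_p acts on the stacked x ∈ R^{np}

# Rows and columns of W each sum to 1 (doubly stochastic), so Z fixes
# any stacked vector whose blocks are all equal (a consensus vector).
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
print(Z.shape)  # (6, 6)
```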
[Figure: example network of 9 nodes; node i holds local variable x_i and local function f_i.]
Consensus Optimization Problem (matrix form)

min_{x_i ∈ R^p} f(x) = Σ_{i=1}^n f_i(x_i)
s.t. Z x = x

The pairwise constraints x_i = x_j can be written compactly as Zx = x, where x ∈ R^{np} is the concatenation of the local x_i's and Z = W ⊗ I_p ∈ R^{np×np}.
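A quick sanity check of the reformulation, using a hypothetical 3-node mixing matrix (not from the talk): on a connected graph the eigenvalue 1 of W is simple, so Zx = x holds exactly for consensus vectors and for nothing else.

```python
import numpy as np

# Hypothetical 3-node symmetric doubly stochastic W (connected graph).
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
Z = np.kron(W, np.eye(2))

x_consensus = np.tile([1.0, -2.0], 3)      # all blocks equal
x_disagree = np.arange(6, dtype=float)     # blocks differ

fixed = np.allclose(Z @ x_consensus, x_consensus)
not_fixed = not np.allclose(Z @ x_disagree, x_disagree)
print(fixed, not_fixed)  # True True
```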
Literature Review
1. Sublinearly converging methods: DGD [Tsitsiklis and Bertsekas, 1989; Nedic and Ozdaglar, 2009; Sundhar Ram et al., 2010; Tsianos and Rabbat, 2012; Yuan et al., 2015; Zeng and Yin, 2016], ...
2. Linearly and superlinearly converging methods: EXTRA [Shi et al., 2015], DIGing [Nedic et al., 2017], NEXT [Lorenzo and Scutari, 2015], Aug-DGM [Xu et al., 2015], NN-EXTRA [Mokhtari et al., 2016], [Qu and Li, 2017], DQN [Eisen et al., 2017], NN [Mokhtari et al., 2014, 2015], ...
3. Communication-efficient methods: [Chen and Ozdaglar, 2012], [Shamir et al., 2014], [Chow et al., 2016], [Lan et al., 2017], [Tsianos et al., 2012], [Zhang and Lin, 2015], ...
4. Asynchronous methods: [Tsitsiklis and Bertsekas, 1989], [Tsitsiklis et al., 1986], [Sundhar Ram et al., 2009], [Wei and Ozdaglar, 2013], [Mansoori and Wei, 2017], [Zhang and Kwok, 2014], [Wu et al., 2017], ...
Cost = (# Communications) × c_c + (# Computations) × c_g
Goal of the Project
- Develop an algorithmic framework for balancing computation and communication in distributed optimization that is independent of the underlying method
- Prove convergence for methods that use the framework
- Show that the framework can be applied to many consensus optimization problems (with different communication and computation costs)
- Illustrate empirically that methods utilizing the framework outperform their base algorithms on specific applications
This talk
First stage of the project:
- Multiple consensus steps in DGD (theoretically and in practice)
- Design a flexible first-order algorithm that decouples the two operations
- Investigate the method theoretically and empirically
- By-product: variants of DGD with exact convergence

Not in this talk (ongoing work):
- Multiple gradient steps
- Extending the framework to other algorithms (e.g., EXTRA, NN) or to asynchronous methods
Distributed Gradient Descent (DGD)
DGD [Tsitsiklis and Bertsekas, 1989; Nedic and Ozdaglar, 2009; Sundhar Ram et al., 2010; Tsianos and Rabbat, 2012; Yuan et al., 2015; Zeng and Yin, 2016]
x_{i,k+1} = Σ_{j ∈ N_i ∪ {i}} w_{ij} x_{j,k} − α ∇f_i(x_{i,k}),   ∀ i = 1, ..., n

or, in stacked form:

x_{k+1} = Z x_k − α ∇f(x_k)
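The update above can be sketched in a few lines of numpy. The quadratic functions and 3-node mixing matrix are hypothetical toy data for illustration, not the talk's implementation:

```python
import numpy as np

def dgd(W, grads, x0, alpha, iters):
    """Sketch of DGD: one mixing (consensus) step and one local gradient
    step per iteration, i.e. x_{k+1} = W x_k - alpha * grad f(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        mixed = W @ x                                   # consensus step
        g = np.stack([grads[i](x[i]) for i in range(len(grads))])
        x = mixed - alpha * g                           # local gradient step
    return x

# Toy problem (hypothetical): f_i(x) = 0.5 * (x - b_i)^2, so the
# global minimizer of sum_i f_i is mean(b) = 1.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
b = np.array([0.0, 1.0, 2.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
x = dgd(W, grads, x0=np.zeros((3, 1)), alpha=0.1, iters=500)
# With constant alpha the local copies settle near, but not exactly at, x* = 1,
# matching the linear-convergence-to-a-neighborhood behavior described below.
```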
x = [x_1; x_2; …; x_n] ∈ R^{np},   ∇f(x_k) = [∇f_1(x_{1,k}); ∇f_2(x_{2,k}); …; ∇f_n(x_{n,k})] ∈ R^{np},   Z = W ⊗ I_p ∈ R^{np×np}
- Diminishing α: sub-linear convergence to the solution
- Constant α: linear convergence to an O(α) neighborhood of the solution
DGD – Questions
DGD: x_{k+1} = Z x_k − α ∇f(x_k)
- Is it necessary to do one consensus step and one optimization (gradient) step?
- If not, what is the interpretation of methods that do more consensus/optimization steps?
- What convergence guarantees can be proven for such methods?
- How do these variants perform in practice?
DGD^t: x_{k+1} = Z^t x_k − α ∇f(x_k),   Z^t = W^t ⊗ I_p
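The DGD sketch extends to DGD^t by applying the mixing matrix t times per gradient evaluation. Again, the quadratic functions and mixing matrix are hypothetical toy data, not the authors' code:

```python
import numpy as np

def dgd_t(W, grads, x0, alpha, t, iters):
    """Sketch of DGD^t: t mixing rounds per gradient step,
    i.e. x_{k+1} = W^t x_k - alpha * grad f(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        mixed = x
        for _ in range(t):                 # t communication rounds
            mixed = W @ mixed
        g = np.stack([grads[i](x[i]) for i in range(len(grads))])
        x = mixed - alpha * g
    return x

# Toy quadratic setup (hypothetical): f_i(x) = 0.5 * (x - b_i)^2,
# so the minimizer of the sum is mean(b) = 1.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
b = np.array([0.0, 1.0, 2.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
x = dgd_t(W, grads, x0=np.zeros((3, 1)), alpha=0.1, t=3, iters=500)
# More consensus rounds per iteration shrink, but do not eliminate,
# the error neighborhood around x* = 1.
```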
DGD – Assumptions & Definitions
Assumptions
1. Each component function f_i is strongly convex with parameter μ_i > 0 and has L_i-Lipschitz continuous gradients (L_i > 0).
2. The mixing matrix W is symmetric and doubly stochastic with β < 1, where β is the second-largest eigenvalue of W.
Definitions

x̄_k = (1/n) Σ_{i=1}^n x_{i,k},   ∇f(x_k) = Σ_{i=1}^n ∇f_i(x_{i,k}),   ∇f(x̄_k) = Σ_{i=1}^n ∇f_i(x̄_k)
DGD – Theory – Bounded distance to minimum
Theorem (Bounded distance to minimum) [Yuan et al., 2015]
Suppose Assumptions 1 & 2 hold, and let the step length satisfy

α ≤ min{ (1 + λ_n(W)) / L_f ,  1 / (μ_f + L_f) }

where μ_f is the strong convexity parameter of f and L_f is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...

‖x̄_{k+1} − x*‖² ≤ c₁² ‖x̄_k − x*‖² + c₃² / (1 − β)²

with

c₁² = 1 − α c₂ + α δ − α² δ c₂,   c₂ = μ_f L_f / (μ_f + L_f),
c₃² = α³ (α + δ⁻¹) L² D²,   D = √( 2L ( Σ_{i=1}^n f_i(0) − f* ) ),

where x* = argmin_x f(x) and δ > 0.
DGDt – Theory – Bounded distance to minimum
Theorem (Bounded distance to minimum) [ASB, RB, NSK and EW, 2017]
Suppose Assumptions 1 & 2 hold, and let the step length satisfy

α ≤ min{ (1 + λ_n(W^t)) / L_f ,  1 / (μ_f + L_f) }

where μ_f is the strong convexity parameter of f and L_f is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...

‖x̄_{k+1} − x*‖² ≤ c₁² ‖x̄_k − x*‖² + c₃² / (1 − β^t)²

with

c₁² = 1 − α c₂ + α δ − α² δ c₂,   c₂ = μ_f L_f / (μ_f + L_f),
c₃² = α³ (α + δ⁻¹) L² D²,   D = √( 2L ( Σ_{i=1}^n f_i(0) − f* ) ),

where x* = argmin_x f(x) and δ > 0.
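The improvement in the neighborhood constant, 1/(1 − β)² vs. 1/(1 − β^t)², can be illustrated numerically. The 3-node W below is a hypothetical example; β is its second-largest eigenvalue, as in Assumption 2:

```python
import numpy as np

# Hypothetical 3-node symmetric doubly stochastic mixing matrix.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
beta = np.sort(np.linalg.eigvalsh(W))[::-1][1]   # second-largest eigenvalue

# The neighborhood constant 1/(1 - beta^t)^2 decreases toward 1 as t grows:
# extra communication tightens the neighborhood but never removes it.
factors = {t: 1.0 / (1.0 - beta ** t) ** 2 for t in (1, 2, 3)}
print(beta, factors)
```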
DGD – Theory – Comments

- Can show a similar error neighborhood for ‖x_{i,k} − x⋆‖.
- Theoretical results are similar in nature to those for DGD (constant α): linear convergence to a neighborhood of the solution.
- Improved neighborhood: O(1/(1−β)²) for DGD vs. O(1/(1−βᵗ)²) for DGDt.
- But increased communication cannot kill the neighborhood.
- Drawback: requires extra communication.
- Effectively, DGDt is DGD with a different underlying graph (different weights in W).
DGD – Theory – Comments

Effectively, DGDt is DGD with a different underlying graph (different weights in W). Two four-node examples:

Cycle graph (edges 1–2, 2–4, 4–3, 3–1):
\[
W = \begin{bmatrix} 1/2 & 1/4 & 1/4 & 0 \\ 1/4 & 1/2 & 0 & 1/4 \\ 1/4 & 0 & 1/2 & 1/4 \\ 0 & 1/4 & 1/4 & 1/2 \end{bmatrix}, \quad
W^2 = \begin{bmatrix} 3/8 & 1/4 & 1/4 & 1/8 \\ 1/4 & 3/8 & 1/8 & 1/4 \\ 1/4 & 1/8 & 3/8 & 1/4 \\ 1/8 & 1/4 & 1/4 & 3/8 \end{bmatrix}, \quad
W^{10} \approx \begin{bmatrix} 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix}
\]

Path graph (edges 1–2, 2–3, 3–4):
\[
W = \begin{bmatrix} 2/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 \\ 0 & 1/3 & 1/3 & 1/3 \\ 0 & 0 & 1/3 & 2/3 \end{bmatrix}, \quad
W^2 = \begin{bmatrix} 5/9 & 1/3 & 1/9 & 0 \\ 1/3 & 1/3 & 2/9 & 1/9 \\ 1/9 & 2/9 & 1/3 & 1/3 \\ 0 & 1/9 & 1/3 & 5/9 \end{bmatrix}, \quad
W^3 \approx \begin{bmatrix} 0.48 & 0.33 & 0.15 & 0.04 \\ 0.33 & 0.30 & 0.22 & 0.15 \\ 0.15 & 0.22 & 0.30 & 0.33 \\ 0.04 & 0.15 & 0.33 & 0.48 \end{bmatrix}
\]

Taking powers of W fills in the matrix: each extra consensus step makes the effective communication graph denser.
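The fill-in effect is easy to verify numerically. A quick sketch with NumPy, using the 4-node cycle matrix from the example above:

```python
import numpy as np

# Mixing matrix for the 4-node cycle (edges 1-2, 2-4, 4-3, 3-1).
W = np.array([[1/2, 1/4, 1/4, 0  ],
              [1/4, 1/2, 0,   1/4],
              [1/4, 0,   1/2, 1/4],
              [0,   1/4, 1/4, 1/2]])

W2 = np.linalg.matrix_power(W, 2)    # two consensus rounds
W10 = np.linalg.matrix_power(W, 10)  # ten consensus rounds

# One extra round already fills in the missing pairs (1-4 and 2-3);
# after ten rounds every entry is close to 1/4, i.e. the effective
# graph is essentially complete.
print(np.round(W2, 3))
print(np.round(W10, 3))
```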
DGD(t) – Numerical Results
Problem: Quadratic
\[
f(x) = \frac{1}{2}\sum_{i=1}^{n} x^T A_i x + b_i^T x
\]
where each node i = 1, ..., n has local data A_i ∈ R^{n_i×p} and b_i ∈ R^{n_i}.

Parameters: n = 10, p = 10, n_i = 10, κ = 10²

Methods: DGD (1,1), DGD (1,2), DGD (1,5), DGD (1,10)

Graph: 4-cyclic graph, with weights
\[
w_{ii} = \tfrac{1}{5}, \qquad
w_{ij} = \begin{cases} \tfrac{1}{5} & \text{if } j \in N_i \\ 0 & \text{otherwise} \end{cases}
\]
Show the effect of multiple consensus steps per gradient step
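That effect can be sketched in a few lines on toy scalar quadratics rather than the slides' 10-node instance (node i holds f_i(x) = (x − b_i)²/2, so x⋆ is the mean of the b_i; all names here are illustrative):

```python
import numpy as np

def dgd_t(W, b, alpha=0.05, t=1, iters=2000):
    """DGDt with t consensus steps per gradient step:
    x_{k+1} = W^t x_k - alpha * grad f(x_k),
    where node i holds the toy objective f_i(x) = (x - b_i)^2 / 2."""
    Wt = np.linalg.matrix_power(W, t)
    x = np.zeros_like(b)
    for _ in range(iters):
        x = Wt @ x - alpha * (x - b)  # grad f_i(x_i) = x_i - b_i
    return x

# Mixing matrix for a 4-node cycle.
W = np.array([[1/2, 1/4, 1/4, 0  ],
              [1/4, 1/2, 0,   1/4],
              [1/4, 0,   1/2, 1/4],
              [0,   1/4, 1/4, 1/2]])
b = np.array([1.0, 2.0, 3.0, 4.0])
xstar = b.mean()  # minimizer of sum_i f_i

err_t1 = np.max(np.abs(dgd_t(W, b, t=1) - xstar))
err_t10 = np.max(np.abs(dgd_t(W, b, t=10) - xstar))
# More consensus steps per gradient step shrink the error neighborhood,
# but with a constant step length neither variant drives it to zero.
print(err_t1, err_t10)
```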
DGD(t) – Numerical Results

[Figure: relative error vs. iterations (left) and vs. cost (right) for DGD (1,1), DGD (1,2), DGD (1,5), DGD (1,10). Quadratic. n = 10, p = 10, n_i = 10, κ = 10². Cost = #Communications × 1 + #Computations × 1.]
Overview

1 Introduction and Motivation
2 Distributed Gradient Descent Variant
3 Communication Computation Decoupled DGD Variants
4 Conclusions & Future Work
Operators

DGD:
\[
x_{k+1} = Z x_k - \alpha \nabla f(x_k)
\]

Operators:
\[
W[x] = Zx \quad \text{(communication)}, \qquad
T[x] = x - \alpha \nabla f(x) \quad \text{(computation)}
\]

Methods:
\[
\text{DGD: } (T - I + W)[x_k] = Z x_k - \alpha \nabla f(x_k)
\]
\[
\text{TW: } T[W[x_k]] = Z x_k - \alpha \nabla f(Z x_k)
\]
\[
\text{WT: } W[T[x_k]] = Z x_k - \alpha Z \nabla f(x_k)
\]

A special case of these algorithms appeared as CTA (Combine Then Adapt) and ATC (Adapt Then Combine) in [Sayed, 13] for quadratic problems.

1 What can be proven about the T[W[x]] and W[T[x]] variants of DGD?
2 How do these methods perform in practice?
3 What are the advantages and limitations of these methods?
TW, written sequentially (communicate, then compute):
\[
y_k = Z x_k, \qquad x_{k+1} = y_k - \alpha \nabla f(y_k)
\]
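The three operator compositions can be checked numerically on a toy stacked problem; a sketch (the quadratic f_i and all variable names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.array([[1/2, 1/4, 1/4, 0  ],   # mixing matrix on a 4-node cycle
              [1/4, 1/2, 0,   1/4],
              [1/4, 0,   1/2, 1/4],
              [0,   1/4, 1/4, 1/2]])
x = rng.standard_normal((4, 3))       # row i is node i's local iterate
B = rng.standard_normal((4, 3))
alpha = 0.1

def grad(v):
    # Stacked gradients of toy local objectives f_i(v_i) = ||v_i - B_i||^2 / 2.
    return v - B

def W_op(v): return Z @ v                  # communication step
def T_op(v): return v - alpha * grad(v)    # computation step

dgd = T_op(x) - x + W_op(x)  # (T - I + W)[x]
tw  = T_op(W_op(x))          # communicate, then compute
wt  = W_op(T_op(x))          # compute, then communicate

# The closed forms from the slide:
assert np.allclose(dgd, Z @ x - alpha * grad(x))
assert np.allclose(tw,  Z @ x - alpha * grad(Z @ x))
assert np.allclose(wt,  Z @ x - alpha * Z @ grad(x))
```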
TW – Methods, Assumptions & Definitions

Methods:
- TWt: t (predetermined) consensus steps for every gradient step
- TW+: increasing number of consensus steps

Assumptions:
- Assumptions 1 & 2, same as before

Definitions:
\[
\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{i,k}, \qquad
\nabla f(y_k) = \sum_{i=1}^{n} \nabla f_i(y_{i,k}), \qquad
\nabla f(\bar{x}_k) = \sum_{i=1}^{n} \nabla f_i(\bar{x}_k)
\]
TWt – Theory – Bounded distance to minimum

Theorem (Bounded distance to minimum) [ASB, RB, NSK and EW, 2017]

Suppose Assumptions 1 & 2 hold, and let the step length satisfy
\[
\alpha \le \min\left\{ \frac{1 + \lambda_n(W^t)}{L_f},\ \frac{1}{\mu_f + L_f} \right\},
\]
where μ_f is the strong convexity parameter of f and L_f is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...
\[
\|x_{i,k} - x^\star\| \le c_1^k \|x^\star\| + \frac{c_3}{\sqrt{1 - c_2}} \cdot \frac{1}{1 - \beta^t} + \beta^t \alpha D,
\]
where \(x^\star = \arg\min_x f(x)\) and δ > 0.
TW+ – Theory (increasing consensus)

- Can we increase the number of consensus steps and converge to the solution?
- Increase t(k) so that the error term O(β^{t(k)}) is killed.
- A similar idea appeared in [Chen and Ozdaglar, 2012] for nonsmooth problems.
- The resulting TW^{t(k)} algorithm converges exactly, as long as we keep increasing the number of consensus steps.
TW+ – Theory – Bounded distance to minimum

Theorem (Bounded distance to minimum) [ASB, RB, NSK and EW, 2017]

Suppose Assumptions 1 & 2 hold, t(k) = k, and let the step length satisfy
\[
\alpha \le \min\left\{ \frac{1}{L_f},\ \frac{1}{\mu_f + L_f} \right\},
\]
where μ_f is the strong convexity parameter of f and L_f is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...
\[
\|x_{i,k} - x^\star\| \le C \rho^k,
\]
where \(x^\star = \arg\min_x f(x)\), for some constants C and ρ.

With t(k) = k, to reach an ε-accurate solution we need O(log(1/ε)) gradient evaluations and O((log(1/ε))²) rounds of communication.
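A minimal sketch of TW with increasing consensus, t(k) = k, on toy scalar quadratics f_i(x) = (x − b_i)²/2 (illustrative, not the slides' experiment; the final consensus pass is added here only so that per-node error is measured at a post-communication iterate):

```python
import numpy as np

def tw_plus(W, b, alpha=0.3, iters=100):
    """TW with an increasing number of consensus steps, t(k) = k:
    y_k = W^{t(k)} x_k, then x_{k+1} = y_k - alpha * grad f(y_k),
    where node i holds f_i(x) = (x - b_i)^2 / 2."""
    x = np.zeros_like(b)
    total_comms = 0
    for k in range(1, iters + 1):
        y = np.linalg.matrix_power(W, k) @ x  # t(k) = k communication rounds
        total_comms += k
        x = y - alpha * (y - b)               # grad f_i(y_i) = y_i - b_i
    y = np.linalg.matrix_power(W, iters) @ x  # final consensus pass (measurement only)
    return y, total_comms

W = np.array([[1/2, 1/4, 1/4, 0  ],   # 4-node cycle mixing matrix
              [1/4, 1/2, 0,   1/4],
              [1/4, 0,   1/2, 1/4],
              [0,   1/4, 1/4, 1/2]])
b = np.array([1.0, 2.0, 3.0, 4.0])
y, comms = tw_plus(W, b)

# Exact convergence, but communication accumulates quadratically:
# sum_{k=1}^{K} k = K(K+1)/2 rounds for K gradient steps.
print(np.max(np.abs(y - b.mean())), comms)
```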
Numerical Experiments

Methods: DGD, TW (1,1,-), TW (10,1,-), TW (1,10,-), TW (1,1,k), TW (1,1,500), TW (1,1,1000)

Problem: Quadratic
\[
f(x) = \frac{1}{2}\sum_{i=1}^{n} x^T A_i x + b_i^T x
\]
where each node i = 1, ..., n has local data A_i ∈ R^{n_i×p} and b_i ∈ R^{n_i}.

Parameters: n = 10, p = 10, n_i = 10, κ = L/μ = 10⁴

Graph: 4-cyclic graph
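For readers who want to reproduce an instance of this setup: the slides do not specify how the A_i are drawn, so the construction below (square SPD blocks with a prescribed eigenvalue spread) is a hypothetical stand-in, together with the relative-error metric plotted in the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, kappa = 10, 10, 1e4

def make_block():
    # Random SPD block with eigenvalues spread over [1, kappa]; this is
    # an assumed construction, not necessarily the one used in the talk.
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    return Q @ np.diag(np.logspace(0, np.log10(kappa), p)) @ Q.T

A = [make_block() for _ in range(n)]
b = [rng.standard_normal(p) for _ in range(n)]

# With f(x) = (1/2) sum_i x^T A_i x + b_i^T x, the minimizer solves
# (sum_i A_i) x* = -sum_i b_i.
H, g = sum(A), sum(b)
xstar = np.linalg.solve(H, -g)

def rel_err(x):
    """Relative error, as plotted on the slides' y-axis."""
    return np.linalg.norm(x - xstar) / np.linalg.norm(xstar)

print(rel_err(np.zeros(p)))  # starting from x = 0, relative error is 1
```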
Numerical Experiments – Quadratic Problems

[Figure: relative error vs. iterations (left) and vs. cost (right) for DGD, TW (1,1,-), TW (10,1,-), TW (1,10,-), TW (1,1,k), TW (1,1,500), TW (1,1,1000). Quadratic. n = 10, p = 10, n_i = 10, κ = 10⁴. Cost = #Communications × 1 + #Computations × 1.]
Experiments – Quadratic Problems – Different Costs

[Figure: relative error vs. cost for DGD and the TW variants under three cost models. Quadratic. n = 10, p = 10, n_i = 10, κ = 10⁴. Left: cg = 10, cc = 1; Center: cg = 1, cc = 1; Right: cg = 1, cc = 10. Cost = #Communications × cc + #Computations × cg.]
Numerical Experiments – Logistic Regression

Problem: Logistic Regression – Binary Classification (Mushroom Dataset)

f(x) = (1 / (n · ni)) Σ_{i=1}^{n} Σ_{j=1}^{ni} log(1 + e^{−(b_i)_j xᵀ(A_i)_{j·}})

where A ∈ R^{n·ni×p} and b ∈ {−1, 1}^{n·ni}, and each node i = 1, ..., n has a portion of A and b: A_i ∈ R^{ni×p} and b_i ∈ R^{ni}.

Parameters: n = 10, p = 114, ni = 812, κ = 10⁴
Graph: 4-cyclic graph
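A direct way to sanity-check this objective is to evaluate it node by node and compare against the same loss on the stacked data; a minimal sketch with small random data (not the mushroom set):

```python
import numpy as np

def distributed_logistic_loss(x, A_blocks, b_blocks):
    """f(x) = (1/(n*ni)) * sum_i sum_j log(1 + exp(-(b_i)_j * x^T (A_i)_j)),
    accumulated block by block as each node would."""
    total, count = 0.0, 0
    for Ai, bi in zip(A_blocks, b_blocks):
        margins = bi * (Ai @ x)                  # (b_i)_j * x^T (A_i)_{j.}
        total += np.log1p(np.exp(-margins)).sum()
        count += bi.size
    return total / count

rng = np.random.default_rng(0)
n, ni, p = 4, 8, 5                               # 4 nodes, 8 samples each
A_blocks = [rng.normal(size=(ni, p)) for _ in range(n)]
b_blocks = [rng.choice([-1.0, 1.0], size=ni) for _ in range(n)]
x = rng.normal(size=p)

# Centralized reference on the stacked A and b.
A, b = np.vstack(A_blocks), np.concatenate(b_blocks)
central = np.log1p(np.exp(-b * (A @ x))).mean()

assert np.isclose(distributed_logistic_loss(x, A_blocks, b_blocks), central)
```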
Numerical Experiments – Logistic Regression

[Figure: relative error vs. iterations (left) and vs. cost (right) for DGD, TW (1,1,-), TW (10,1,-), TW (1,10,-), TW (1,1,k), TW (1,1,250), TW (1,1,500).]

Logistic Regression – mushroom. n = 10, p = 114, ni = 812.
Cost = #Communications × 1 + #Computations × 1
Overview

1 Introduction and Motivation
2 Distributed Gradient Descent Variant
3 Communication Computation Decoupled DGD Variants
4 Conclusions & Future Work
Final Remarks

Most distributed optimization algorithms do one communication and one computation per iteration.

Showed the effect (theoretically and empirically) of doing multiple consensus steps in DGD.

Proposed a variant of DGD, TW, that decouples the two operations (consensus and computation) and converges to the solution by performing multiple consensus steps:

DGD: x_{k+1} = Z x_k − α∇f(x_k)
TW: x_{k+1} = Z x_k − α∇f(Z x_k)

It is important to balance communication and computation in order to get the best performance in terms of cost; the right balance depends on the application (e.g., the cost of communication and the cost of computation).
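The one-line difference between the two updates is easy to see in code. A minimal NumPy toy (ring graph, scalar quadratics f_i(x) = a_i (x − b_i)²/2; all problem data here is illustrative, not the talk's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.uniform(1.0, 2.0, n)            # local curvatures
b = rng.uniform(-1.0, 1.0, n)           # local minimizers
x_star = (a * b).sum() / a.sum()        # minimizer of (1/n) sum_i f_i

# Symmetric doubly stochastic mixing matrix Z for the ring (weight 1/3 each).
Z = np.zeros((n, n))
for i in range(n):
    Z[i, i] = Z[i, (i - 1) % n] = Z[i, (i + 1) % n] = 1.0 / 3.0

grad = lambda x: a * (x - b)            # stacked local gradients
alpha = 0.01

x_dgd = np.zeros(n)
x_tw = np.zeros(n)
for _ in range(5000):
    x_dgd = Z @ x_dgd - alpha * grad(x_dgd)   # DGD: gradient at x_k
    z = Z @ x_tw                              # TW: communicate first ...
    x_tw = z - alpha * grad(z)                # ... then gradient at Z x_k

# Both settle in a small neighborhood of x_star (constant step => O(alpha) bias).
print(abs(x_dgd.mean() - x_star), abs(x_tw.mean() - x_star))
```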
Future Work

Apply the framework to other algorithms (exact, second-order, asynchronous, ...)
Construct a framework to do multiple gradient steps
Adapt the number of gradient and communication steps in an algorithmic way
Other considerations: memory access, partial blocks, quantization effects, dynamic environments
Backup Slides
DGD – Theory – Bounded gradients

Lemma (Bounded gradients) [Yuan et al., 2015]

Suppose Assumption 1 holds, and let the step size satisfy

α ≤ (1 + λ_n(W)) / L,

where λ_n(W) is the smallest eigenvalue of W and L = max_i L_i. Then, starting from x_{i,0} = 0 (i = 1, 2, ..., n), the sequence x_{i,k} generated by DGD converges. In addition, we also have

‖∇f(x_k)‖ ≤ D = √( 2L · (1/n) Σ_{i=1}^{n} (f_i(0) − f_i*) )    (1)

for all k = 1, 2, ..., where f_i* = f_i(x_i*) and x_i* = arg min_x f_i(x).
DGDt – Theory – Bounded gradients

Lemma (Bounded gradients) [ASB, RB, NSK and EW, 2017]

Suppose Assumption 1 holds, and let the step size satisfy

α ≤ (1 + λ_n(W^t)) / L,

where λ_n(W^t) is the smallest eigenvalue of W^t and L = max_i L_i. Then, starting from x_{i,0} = 0 (i = 1, 2, ..., n), the sequence x_{i,k} generated by DGDt converges. In addition, we also have

‖∇f(x_k)‖ ≤ D = √( 2L · (1/n) Σ_{i=1}^{n} (f_i(0) − f_i*) )    (1)

for all k = 1, 2, ..., where f_i* = f_i(x_i*) and x_i* = arg min_x f_i(x).
DGD – Theory – Bounded deviation from mean

Lemma (Bounded deviation from mean) [Yuan et al., 2015]

If (1) and Assumption 2 hold, then the total deviation from the mean is bounded, namely,

‖x_{i,k} − x̄_k‖ ≤ αD / (1 − β)

for all k and i. Moreover, if in addition Assumption 1 holds, then

‖∇f_i(x_{i,k}) − ∇f_i(x̄_k)‖ ≤ αD L_i / (1 − β)
‖∇f(x_k) − ∇f(x̄_k)‖ ≤ αD L / (1 − β)

for all k and i.
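The first inequality is cheap to verify numerically. A sketch on a toy ring-graph problem, under two stated assumptions: β is taken as the second-largest eigenvalue magnitude of W (the standard reading of Assumption 2), and D is taken as the largest stacked gradient norm observed along the run rather than the bound (1); the problem data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
a = rng.uniform(1.0, 2.0, n)            # f_i(x) = a_i (x - b_i)^2 / 2
b = rng.uniform(-1.0, 1.0, n)

# Ring-graph mixing matrix W (symmetric, doubly stochastic).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

beta = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]   # 2nd-largest |eigenvalue|

alpha = 0.05                            # satisfies alpha <= (1 + lambda_n(W)) / L here
x = np.zeros(n)                         # x_{i,0} = 0, as in the lemma
D = 0.0                                 # running max of ||grad f(x_k)|| (stacked)
max_dev = 0.0                           # running max of |x_{i,k} - xbar_k|
for _ in range(2000):
    g = a * (x - b)
    D = max(D, np.linalg.norm(g))
    x = W @ x - alpha * g               # DGD step
    max_dev = max(max_dev, np.max(np.abs(x - x.mean())))

print(max_dev, alpha * D / (1 - beta))  # observed deviation vs. the lemma's bound
```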
DGDt – Theory – Bounded deviation from mean

Lemma (Bounded deviation from mean) [ASB, RB, NSK and EW, 2017]

If (1) and Assumption 2 hold, then the total deviation from the mean is bounded, namely,

‖x_{i,k} − x̄_k‖ ≤ αD / (1 − β^t)

for all k and i. Moreover, if in addition Assumption 1 holds, then

‖∇f_i(x_{i,k}) − ∇f_i(x̄_k)‖ ≤ αD L_i / (1 − β^t)
‖∇f(x_k) − ∇f(x̄_k)‖ ≤ αD L / (1 − β^t)

for all k and i.
TWt – Theory – Bounded deviation from mean

Lemma (Bounded deviation from mean) [ASB, RB, NSK and EW, 2017]

If Assumption 2 holds, then the total deviation from the mean is bounded, namely,

‖y_{i,k} − x̄_k‖ ≤ β^t αD k(k + 1) / 2

for all k and i. Moreover, if in addition Assumption 1 holds, then

‖∇f_i(y_{i,k}) − ∇f_i(x̄_k)‖ ≤ β^t αD L_i k(k + 1) / 2
‖∇f(y_k) − ∇f(x̄_k)‖ ≤ β^t αD L k(k + 1) / 2

for all k and i.
DGDt Numerical Results

[Figure: relative error vs. iterations, number of gradient evaluations, number of communications (×10⁶), and cost (×10⁴) for DGD (1,1), DGD (1,2), DGD (1,5), DGD (1,10).]

Quadratic. n = 10, p = 10, ni = 10, κ = 10².
Experiments – Quadratic Problems

[Figure: relative error vs. iterations, number of gradient evaluations, number of communications (×10⁵), and cost (×10⁴) for DGD, TW (1,1,-), TW (10,1,-), TW (1,10,-), TW (1,1,k), TW (1,1,500), TW (1,1,1000).]

Quadratic. n = 10, d = 10, ni = 10, κ = 10⁴.
Experiments – Logistic Regression

[Figure: relative error vs. iterations, number of gradient evaluations, number of communications (×10⁵), and cost for DGD, TW (1,1,-), TW (10,1,-), TW (1,10,-), TW (1,1,k), TW (1,1,250), TW (1,1,500).]

Logistic Regression – mushroom. n = 10, d = 114, ni = 812.