Distributed nonsmooth optimization: variational analysis at work and flexible proximal algorithms
Jerome MALICK
CNRS, Laboratoire Jean Kuntzmann, Grenoble (France)
International School of Mathematics “Guido Stampacchia”
71st Workshop: Advances in nonsmooth analysis and optimization
June 2019 – Erice (Sicily)
Content at a glance
Today's talk: a mix of various notions
– variational analysis at work: set convergence, sensitivity analysis
– nonsmooth algorithms: proximal gradient, level bundle
– distributed optimization: parallel computing, distributed learning, synchronous vs. asynchronous, unreliable communications
Outline
1 Distributed optimization
   – Some main ideas in distributed optimization
   – Distributed/Federated learning
2 Asynchronous nonsmooth optimization methods
   – Asynchr. (proximal) gradient methods
   – Asynchr. level bundle methods
3 Communication-efficient proximal methods
   – Optimization with mirror-stratifiable regularizers
   – Variational analysis at work: sensitivity and identification
   – Sparse communications by adaptive dimension reduction
Distributed optimization
Distributed optimization: past and present
An old history... Seminal work of D. Bertsekas and co-authors
Two classical frameworks:
(1) network   (2) master/slaves
Recent challenges: new distributed systems
– Increase of available computing resources (data centers, cloud)
– Improvement of multicore infrastructures and networks
How to (efficiently) deploy our algorithms?
Distributed optimization
Example in master/slave framework
Basic example: minimize m smooth functions over M = m workers

$\min_{x\in\mathbb{R}^d} \sum_{i=1}^m f^i(x)$

[Figure: master with four workers holding $f^1, f^2, f^3, f^4$ (m = M = 4)]

Standard/Distributed gradient descent

$x_{k+1} = x_k - \gamma \sum_{i=1}^m \nabla f^i(x_k)$   (map-reduce)

Extensions: with prox-term, nonsmooth $f^i$, ...

When the number of functions m is greater than the number of machines M (m > M), the algorithm extends to incremental/stochastic (batch) gradient descent (reviewed in Georg's talk on Monday)

[Figure: shared memory with workers holding $f^{i_1}, f^{i_2}, f^{i_3}, f^{i_4}$ (m > M = 4)]
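To make the map-reduce step concrete, here is a minimal Python sketch of the distributed gradient descent above (an editorial illustration with assumed names, not code from the talk):

```python
import numpy as np

def distributed_gradient_descent(grads, x0, gamma, n_iters=100):
    """Synchronous (map-reduce) gradient descent.

    grads : list of callables, grads[i](x) returns the gradient of f^i at x
            (one callable per worker).
    At each iteration the master broadcasts x_k, every worker computes its
    local gradient, and the master sums them and takes a step.
    """
    x = x0.copy()
    for _ in range(n_iters):
        # "map": each worker i evaluates its local gradient at the current iterate
        local_grads = [g(x) for g in grads]
        # "reduce": the master aggregates and updates
        x = x - gamma * sum(local_grads)
    return x

# Toy usage: m = 4 quadratic workers f^i(x) = 0.5 * ||x - c_i||^2
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = [rng.normal(size=2) for _ in range(4)]
    grads = [lambda x, c=c: x - c for c in centers]
    x_star = distributed_gradient_descent(grads, np.zeros(2), gamma=0.2)
    print(x_star, np.mean(centers, axis=0))  # minimizer of the sum = mean of the centers
```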
Distributed optimization
Limitations of optimization algorithms in master/slave framework
The previous slide hides the complexity of tasks and the variety of possible computing systems...

Two key limitations of usual algorithms

(1) Synchronization: the master has to wait for all workers at each iteration...
Example with 3 workers/agents: (image: W. Yin)
[Figure: timelines of 3 agents comparing how an iteration is redefined; synchronous: a new iteration starts when all agents finish; asynchronous: a new iteration starts when any agent finishes]

(2) Communications: unreliable, costly, ...
(image: Google AI)
Distributed optimization
Examples & applications of distributed/parallel optimization
Rich activity: e.g. [Nedic et al '14], [Yin et al '15], [Richtarik et al '16], [Chouzenoux et al '17]... just to name a few

Various practical situations, all different

In the workshop: – Welington's talk (Wednesday)
– Mentioned in the talks of Russel, Silvia, and Antoine

Our community is more knowledgeable about the "data center" situation:
1 data is distributed among many machines ("big data")
2 the computation load is distributed ("big problems" or HPC)
Ex: large-scale stochastic programming problems of Welington's talk

→ let's discuss distributed (on-device) learning in more detail
Distributed optimization
Recalls on supervised learning: context
Data: n observations $(a_j, y_j)$, $j = 1, \dots, n$
e.g. $(a_j, y_j) \in \mathbb{R}^m \times \mathbb{R}$ (regression) or $(a_j, y_j) \in \mathbb{R}^m \times \{-1, +1\}$ (binary classification)
[Figure: five example images $a_1, \dots, a_5$ with labels $y_1 = 1$, $y_2 = -1$, $y_3 = 1$, $y_4 = -1$, $y_5 = 1$]

Supervised learning: a prediction function $h(a, x) \in \mathbb{R}$ parameterized by $x \in \mathbb{R}^d$
usually $x = \beta$ (in statistics), $x = \omega$ (in machine learning), or $x = \theta$ (in deep learning)
goal: find $x$ such that $h(a_j, x) \simeq y_j$ (and generalizes well on unseen data)

Standard prediction functions:
– Linear prediction: $h(a, x) = \langle a, x \rangle$ (or $h(a, x) = \langle \phi(a), x \rangle$ for kernel methods)
– Highly non-linear prediction by artificial neural networks: $h(a, x) = \langle x_m, \sigma(\langle x_{m-1}, \cdots \sigma(\langle x_1, a \rangle)\rangle)\rangle$
[Figure: neural network diagram with a single hidden layer (input layer L1, hidden layer L2, output layer L3); the hidden layer derives nonlinear transformations of linear combinations of the inputs, which are then used to model the output]
Distributed optimization
Recalls on supervised learning: learning is optimizing !
Learning = finding $x$, i.e. the best $x$ from the data = solving an optimization problem

Regularized empirical risk minimization (regularization avoids overfitting, imposes structure on $x$, or helps numerically):

$\min_{x\in\mathbb{R}^d} \ \frac{1}{n}\sum_{j=1}^n \ell\big(y_j, h(a_j, x)\big) + \lambda R(x)$   (data-fitting term + regularizer)

Ex: regularized logistic regression (to be used in the numerical experiments)

$\min_{x\in\mathbb{R}^d} \ \frac{1}{n}\sum_{j=1}^n \log\big(1 + \exp(-y_j\langle a_j, x\rangle)\big) + \frac{\lambda_2}{2}\|x\|_2^2 + \lambda_1\|x\|_1$

Distributed setting: machine $i$ holds a chunk of data $S_i$

$\frac{1}{n}\sum_{j=1}^n \ell\big(y_j, h(a_j, x)\big) = \sum_{i=1}^m \underbrace{\frac{1}{n}\sum_{j\in S_i} \ell\big(y_j, h(a_j, x)\big)}_{f^i(x)}$
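As a concrete sketch (illustrative only, with assumed names), the regularized logistic objective and its per-machine split into the $f^i$ can be written directly; the `chunks` below play the role of the sets $S_i$:

```python
import numpy as np

def logistic_loss(x, A, y, lam1, lam2):
    """l1+l2-regularized logistic objective: (1/n) sum_j log(1+exp(-y_j <a_j,x>)) + reg."""
    margins = -y * (A @ x)
    data_fit = np.mean(np.log1p(np.exp(margins)))
    return data_fit + 0.5 * lam2 * np.dot(x, x) + lam1 * np.sum(np.abs(x))

def local_term(x, A, y, chunk, n_total):
    """f^i(x): the contribution of machine i, which holds the rows indexed by `chunk`."""
    margins = -y[chunk] * (A[chunk] @ x)
    return np.sum(np.log1p(np.exp(margins))) / n_total

# The sum of the local terms recovers the global data-fitting term:
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 5)); y = rng.choice([-1.0, 1.0], size=40); x = rng.normal(size=5)
chunks = np.array_split(np.arange(40), 4)          # 4 machines, each holding a chunk S_i
total = sum(local_term(x, A, y, c, 40) for c in chunks)
assert np.isclose(total, np.mean(np.log1p(np.exp(-y * (A @ x)))))
print(logistic_loss(x, A, y, lam1=0.01, lam2=0.1))
```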
Distributed optimization
New machine learning frameworks?

Machine learning has led to major breakthroughs in various areas (such as natural language processing, computer vision and speech recognition...)
Much of this success has been based on collecting huge amounts of data (and requires a lot of energy...)

Collecting data has two major problems:
1 it is privacy-invasive
2 it requires a lot of storage

A solution: federated learning, which decouples the ability to do machine learning from the need to store the data
No data moves... but:
– unbalanced # of samples
– unstable communications...

Back to the initial question: how to (efficiently) deploy our algorithms?
Asynchronous nonsmooth optimization methods
Asynchronous Master/Slave framework
Algorithm = global communication scheme + local optimization method

[Figure: master/worker diagram; the master keeps $x_k = x_{k-1} + \Delta$ and exchanges with $M$ workers; worker $i$ runs an oracle on $f^i$ and sends back an update $\Delta$; at time $k$ the exchanging worker is $i = i(k)$]

[Figure: timelines from the viewpoint of the updating worker $i = i(k)$ and of another worker $j \ne i(k)$, showing the times $k - d_k^i$, $k - D_k^i$, $k - d_k^j$, $k - D_k^j$ of the last two exchanges]

– iteration = receive from a worker + master update + send
– time $k$ = number of iterations
– machine $i(k)$ = the machine updating at time $k$
– delay $d_k^i$ = time since the last exchange with $i$: $d_k^i = 0$ iff $i = i(k)$, and $d_k^i = d_{k-1}^i + 1$ otherwise
– second delay $D_k^i$ = time since the penultimate exchange with $i$
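A small sketch of the bookkeeping implied by these definitions (an illustration, not code from the talk): the master can recover $d_k^i$ and $D_k^i$ from the times of the last two exchanges with each worker.

```python
def delay_bookkeeping(update_sequence, n_workers):
    """Track the delays d_k^i (since the last exchange with worker i) and
    D_k^i (since the penultimate exchange), following the definitions above.

    update_sequence : list with update_sequence[k] = i(k), the worker heard at time k.
    """
    last = {i: None for i in range(n_workers)}      # time of last exchange with i
    penult = {i: None for i in range(n_workers)}    # time of penultimate exchange with i
    history = []
    for k, ik in enumerate(update_sequence):
        penult[ik], last[ik] = last[ik], k
        d = {i: k - last[i] if last[i] is not None else None for i in range(n_workers)}
        D = {i: k - penult[i] if penult[i] is not None else None for i in range(n_workers)}
        history.append((d, D))
    return history

# Example: 3 workers, worker 2 answers less often than the others.
print(delay_bookkeeping([0, 1, 0, 2, 1, 0], n_workers=3)[-1])
```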
Asynchronous nonsmooth optimization methods
An instantiation: asynchr. gradient
Averaged minimization of smooth functions

$\min_{x\in\mathbb{R}^d} \ \frac{1}{m}\sum_{i=1}^m f^i(x)$   with each $f^i$ $L$-smooth and $\mu$-strongly convex

Asynchr. averaged gradient [Vanli, Gurbuzbalaban, Ozdaglar '17]

$x_k = x_{k-1} - \frac{\gamma}{m}\sum_{i=1}^m \nabla f^i\big(x_{k-D_k^i}\big)$

Problem: the stepsize $\gamma$ depends on the delays between machines :(

Proposed algo: rather combine the iterates [Mishchenko, Iutzeler, M. '18]

$x_k = \frac{1}{m}\sum_{i=1}^m x_{k-D_k^i} - \frac{\gamma}{m}\sum_{i=1}^m \nabla f^i\big(x_{k-D_k^i}\big)$

Interest: the usual stepsize $\gamma \in (0, 2/(\mu + L)]$ can be taken, and gives linear convergence :)

Variational analysis at work? Not much yet
– Convergence proof follows from standard tools and the usual rationale
– Originality: a new "epoch"-based analysis to get delay-free results
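To fix ideas, here is a serial Python simulation of the delay-tolerant update above (a sketch with assumed names, not the authors' implementation): the master stores, for each worker, the last iterate that worker reported on and the corresponding gradient, and averages these possibly stale pairs.

```python
import numpy as np

def delay_tolerant_gradient(grads, x0, gamma, schedule):
    """Serial simulation of the delay-tolerant averaged-gradient update.

    grads    : list of m callables, grads[i](x) = gradient of f^i at x
    schedule : list of worker indices i(k), the worker whose answer arrives at time k
    The master keeps, for each worker i, the iterate x_{k-D_k^i} it last worked on and
    the corresponding gradient, and combines them:
        x_k = (1/m) sum_i x_{k-D_k^i} - (gamma/m) sum_i grad f^i(x_{k-D_k^i}).
    """
    m = len(grads)
    stored_x = [x0.copy() for _ in range(m)]      # delayed iterates x_{k-D_k^i}
    stored_g = [grads[i](x0) for i in range(m)]   # delayed gradients at those iterates
    x = x0.copy()
    for ik in schedule:
        # Worker i(k) reports: in this serial model it answers with the gradient at the
        # current master point; workers that report rarely keep contributing stale pairs.
        stored_x[ik] = x.copy()
        stored_g[ik] = grads[ik](x)
        x = np.mean(stored_x, axis=0) - (gamma / m) * np.sum(stored_g, axis=0)
    return x
```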
Asynchronous nonsmooth optimization methods
Numerical illustrations
Illustration on a toy problem: 2D quadratic functions on 5 workers, with one worker 10x slower

Illustration on standard regularized logistic regression: 100 machines in a cluster with 10% of the data on machine one and the rest evenly distributed

[Figure: suboptimality vs. wallclock time (s) on RCV1 (697641 × 47236) and URL (2396130 × 3231961), comparing DAve-PG, synchronous PG, and PIAG]

Conclusion: the stepsize matters! (recall Silvia's talk!)
Asynchronous nonsmooth optimization methods
The idea of the work with Welington in one slide/picture
Basic situation: minimize m convex functions over m machines/oracles

$\min_{x\in X} \sum_{i=1}^m f^i(x)$    where, at time $k$, machine $i = i(k)$ sends up $f^i(x_{k-d_k^i})$ and a subgradient $g \in \partial f^i(x_{k-d_k^i})$

Example: $d = 1$, $m = 2$ (oracle $f^1$ 3× faster than oracle $f^2$)

[Figure: cutting-plane models of $f^1$, $f^2$, and of the sum $f^1 + f^2$, built from the delayed oracle information]

The delayed linearizations give a lower bound [Hintermuller '01] [Oliveira Sagastizabal '15]... but no upper bound, since the values $f^1(x_{k_1})$ and $f^2(x_{k_2})$ are taken at different points $x_{k_1} \ne x_{k_2}$.

The proposed upper bound combines the target level $f_{\rm lev}$, the Lipschitz constants $L_i$, and the delayed oracle information:

$f_{\rm up} = f_{\rm lev} + \sum_{i=1}^m \Big( L_i \|x_{k+1} - x_{k-d_k^i}\| - \langle g^i_{k-d_k^i},\, x_{k+1} - x_{k-d_k^i} \rangle \Big)$
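For concreteness, a minimal sketch (under assumed data structures, not the authors' code) of how this upper bound can be evaluated from the delayed oracle information stored at the master:

```python
import numpy as np

def upper_bound(f_lev, x_next, delayed_info, lipschitz):
    """Evaluate f_up = f_lev + sum_i ( L_i ||x_next - x_i_old|| - <g_i, x_next - x_i_old> ).

    delayed_info : list of (x_i_old, g_i) pairs, where x_i_old is the (delayed) point
                   x_{k-d_k^i} last evaluated by oracle i and g_i a subgradient there.
    lipschitz    : list of the Lipschitz constants L_i of the functions f^i.
    """
    f_up = f_lev
    for (x_old, g), L in zip(delayed_info, lipschitz):
        diff = np.asarray(x_next) - np.asarray(x_old)
        f_up += L * np.linalg.norm(diff) - np.dot(g, diff)
    return f_up
```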
Asynchronous nonsmooth optimization methods
Direct convergence by basic variational analysis
Convergence of the usual level bundle method [Lemarechal Nesterov Nemirovski '95]
requires tight control of the iterates' moves to get a complexity analysis

Forget this: asynchronous iterates are out of control, so we can simplify the proof in this more complicated setting!

3-line proof: finite number of null steps (by contradiction)
– recall $x_{k+1} = \operatorname{Proj}_{X_k}(x)$
– nested (decreasing) sequence of compact sets $\cdots \subset X_{k+1} \subset X_k$
– $X_k \longrightarrow \bigcap_k X_k = X_\infty \ne \emptyset$ (Painlevé-Kuratowski set convergence)
– $x_{k+1} = \operatorname{Proj}_{X_k}(x) \longrightarrow \operatorname{Proj}_{X_\infty}(x)$ [Rockafellar-Wets book]
– $f_{\rm up} = f_{\rm lev} + \sum_{i=1}^m \big( L_i\|x_{k+1} - x_{k-d_k^i}\| - \langle g^i_{k-d_k^i}, x_{k+1} - x_{k-d_k^i}\rangle \big) \longrightarrow f_{\rm lev}$
– this leads to a contradiction :)

Variational analysis at work... a little... more to come... (Samir, be patient)
Communication-efficient proximal methods
When communication is the bottleneck...
Communicating may be bad:
– Heterogeneous frameworks: communications are unreliable
– High-dimensional problems: communications can also be costly

A solution to reduce communication is to reduce the dimension
How? By using the structure of the regularized problem

$\min_{x\in\mathbb{R}^d} \ \sum_{i=1}^m f^i(x) + \lambda R(x)$

Typically $x^\star$ belongs to a low-dimensional manifold
because the convex regularizer $R$ is usually highly structured...
Communication-efficient proximal methods
Mirror-stratifiable regularizers
Most of the regularizers used in machine learning or image processing
have a strong primal-dual structure: they are mirror-stratifiable [Fadili, M., Peyre '18]

Examples (with the associated unit ball and the low-dimensional manifold $M_x$ containing $x$):
– $R = \|\cdot\|_1$ (and $\|\cdot\|_\infty$ or other polyhedral gauges)
– nuclear norm (aka trace norm): $R(X) = \sum_i |\sigma_i(X)| = \|\sigma(X)\|_1$
– group-$\ell_1$: $R(x) = \sum_{b\in\mathcal{B}} \|x_b\|_2$ (e.g. $R(x) = |x_1| + \|x_{2,3}\|$)
[Figure: the unit balls of the three regularizers, each with a point $x$ and its manifold $M_x$]
Communication-efficient proximal methods
Recall on stratifications
Recall the talks of Aris (Monday morning) and of Hasnaa (Tuesday afternoon)

A stratification of a set $D \subset \mathbb{R}^N$ is a finite partition $\mathcal{M} = \{M_i\}_{i\in I}$, $D = \bigcup_{i\in I} M_i$,
with so-called "strata" (e.g. smooth/affine manifolds) which fit together nicely:

$M \cap \operatorname{cl}(M') \ne \emptyset \implies M \subset \operatorname{cl}(M')$

This relation induces a (partial) ordering $M \leqslant M'$

Example: $B_\infty$, the unit $\ell_\infty$-ball in $\mathbb{R}^2$, has a stratification with 9 (affine) strata, with e.g. $M_1 \leqslant M_2 \leqslant M_4$ and $M_1 \leqslant M_3 \leqslant M_4$
[Figure: the square $B_\infty$ with four of its strata $M_1, M_2, M_3, M_4$ labeled]
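To make the example concrete (a worked illustration added here, not on the original slide), the 9 affine strata of $B_\infty = [-1,1]^2$ are:
– the open interior $\{x : |x_1| < 1,\ |x_2| < 1\}$ (1 stratum of dimension 2)
– the four open edges $\{x_1 = \pm 1,\ |x_2| < 1\}$ and $\{|x_1| < 1,\ x_2 = \pm 1\}$ (dimension 1)
– the four vertices $(\pm 1, \pm 1)$ (dimension 0)
With the ordering above ($M \leqslant M'$ when $M \subset \operatorname{cl}(M')$), every vertex is smaller than its adjacent edges, and every edge is smaller than the interior, giving chains such as $M_1 \leqslant M_2 \leqslant M_4$.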
Communication-efficient proximal methods
Mirror-stratifiable function: formal definition
A convex function $R : \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ is mirror-stratifiable with respect to
– a (primal) stratification $\mathcal{M} = \{M_i\}_{i\in I}$ of $\operatorname{dom}(\partial R)$
– a (dual) stratification $\mathcal{M}^* = \{M_i^*\}_{i\in I}$ of $\operatorname{dom}(\partial R^*)$
if $J_R$ has two properties:

1. $J_R : \mathcal{M} \to \mathcal{M}^*$ is invertible with inverse $J_{R^*}$: $\mathcal{M}^* \ni M^* = J_R(M) \iff J_{R^*}(M^*) = M \in \mathcal{M}$

2. $J_R$ is decreasing for the order relation $\leqslant$ between strata: $M \leqslant M' \iff J_R(M) \geqslant J_R(M')$

with the transfer operator $J_R : \mathbb{R}^N \rightrightarrows \mathbb{R}^N$ [Daniilidis-Drusvyatskiy-Lewis '13] defined by $J_R(S) = \bigcup_{x\in S} \operatorname{ri}(\partial R(x))$
Communication-efficient proximal methods
Mirror-stratifiable function: simple example
$R = \iota_{B_\infty}$ and $R^* = \|\cdot\|_1$

$J_R(M_i) = \bigcup_{x\in M_i} \operatorname{ri}\,\partial R(x) = \bigcup_{x\in M_i} \operatorname{ri}\, N_{B_\infty}(x) = M_i^*$

$M_i = \bigcup_{x\in M_i^*} \operatorname{ri}\,\partial\|x\|_1 = \bigcup_{x\in M_i^*} \operatorname{ri}\,\partial R^*(x) = J_{R^*}(M_i^*)$

[Figure: the primal strata $M_1, \dots, M_4$ mapped by $J_R$ to the dual strata $M_1^*, \dots, M_4^*$, and back by $J_{R^*}$]
Communication-efficient proximal methods
Sensitivity under small variations
Parameterized composite optimization problem (smooth + nonsmooth):

$\min_{x\in\mathbb{R}^N} \ F(x, p) + R(x)$

Optimality condition for a primal-dual solution $(x^\star(p), u^\star(p))$:

$u^\star(p) = -\nabla F(x^\star(p), p) \in \partial R(x^\star(p))$

For $p \sim p_0$, can we localize $x^\star(p)$ with respect to $x^\star(p_0)$?

Theorem (Enlarged sensitivity). Under mild assumptions (unique minimizer $x^\star(p_0)$ at $p_0$ and objective uniformly level-bounded in $x$), if $R$ is mirror-stratifiable, then for $p \sim p_0$,

$M_{x^\star(p_0)} \leqslant M_{x^\star(p)} \leqslant J_{R^*}\big(M^*_{u^\star(p_0)}\big)$

In the non-degenerate case $u^\star(p_0) \in \operatorname{ri}\big(\partial R(x^\star(p_0))\big)$, we get $M_{x^\star(p_0)} = M_{x^\star(p)}$ $\big(= J_{R^*}(M^*_{u^\star(p_0)})\big)$:
we retrieve exactly the active strata ([Lewis '06] for partly-smooth functions)
Communication-efficient proximal methods
First sensitivity result illustrated
Simple projection problem and its dual:

$\min \ \tfrac{1}{2}\|x - p\|^2 \ \text{ s.t. } \ \|x\|_\infty \leqslant 1$    and    $\min_{u\in\mathbb{R}^N} \ \tfrac{1}{2}\|u - p\|^2 + \|u\|_1$

Non-degenerate case: $u^\star(p_0) = p_0 - x^\star(p_0) \in \operatorname{ri} N_{B_\infty}(x^\star(p_0))$
$\implies M_1 = M_{x^\star(p_0)} = M_{x^\star(p)}$ (in this case $x^\star(p) = x^\star(p_0)$)

General (degenerate) case: $u^\star(p_0) = p_0 - x^\star(p_0) \notin \operatorname{ri} N_{B_\infty}(x^\star(p_0))$
$\implies M_1 = M_{x^\star(p_0)} \leqslant M_{x^\star(p)} \leqslant J_{R^*}\big(M^*_{u^\star(p_0)}\big) = M_2$

[Figure: the square $B_\infty$ with $x^\star(p_0)$, $p_0$, $u^\star(p_0)$ and the strata $M_1$, $M_2$, $M_1^*$, $M_2^*$ in the two cases]
Communication-efficient proximal methods
Activity identification of proximal algorithms
Composite optimization problem (smooth + nonsmooth):

$\min_{x\in\mathbb{R}^N} \ f(x) + R(x)$

Optimality condition: $-\nabla f(x^\star) \in \partial R(x^\star)$

Proximal-gradient algorithm (aka forward-backward algorithm):

$x_{k+1} = \operatorname{prox}_{\gamma R}\big(x_k - \gamma\nabla f(x_k)\big)$

Do the iterates $x_k$ identify the low-complexity structure of $x^\star$?

Theorem (Enlarged activity identification). Under convergence assumptions, if $R$ is mirror-stratifiable, then for $k$ large,

$M_{x^\star} \leqslant M_{x_k} \leqslant J_{R^*}\big(M^*_{-\nabla f(x^\star)}\big)$

In the non-degenerate case $-\nabla f(x^\star) \in \operatorname{ri}\big(\partial R(x^\star)\big)$, we have exact identification $M_{x^\star} = M_{x_k}$ $\big(= J_{R^*}(M^*_{-\nabla f(x^\star)})\big)$ [Liang et al '15],
and we can bound the identification threshold in this case [Hare et al '19]
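As a small illustration of this identification behavior (a sketch on synthetic data, not the experiments of the talk), one can run the forward-backward iteration on a LASSO instance, as discussed on the next slides, and watch the support of the iterates stabilize:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_gradient_lasso(A, y, lam, gamma, n_iters=500):
    """Forward-backward iterations x_{k+1} = prox_{gamma*lam*||.||_1}(x_k - gamma*A^T(Ax_k - y)),
    returning the supports of the iterates to watch the identification of the active strata."""
    x = np.zeros(A.shape[1])
    supports = []
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - gamma * grad, gamma * lam)
        supports.append(np.flatnonzero(x).tolist())
    return x, supports

# Small random instance: after some iterations supp(x_k) typically stabilizes.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 100))
y = A @ (rng.normal(size=100) * (rng.random(100) < 0.1))   # sparse ground truth
gamma = 1.0 / np.linalg.norm(A, 2) ** 2
x, supports = prox_gradient_lasso(A, y, lam=0.1 * np.max(np.abs(A.T @ y)), gamma=gamma)
print(len(supports[-1]), supports[-1] == supports[-50])     # support size and stabilization check
```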
Communication-efficient proximal methods
For the case of sparse optimization
$R = \|\cdot\|_1$ promotes sparsity

$M_x = \{z \in \mathbb{R}^d : \operatorname{supp}(z) = \operatorname{supp}(x)\}$, with $\dim(M_x) = \#\operatorname{supp}(x) = \|x\|_0$
[Figure: a point $x_0$ with $\|x_0\|_0 = 1$ and its stratum $M_{x_0}$]

$\ell_1$-regularized least-squares (LASSO):

$\min_{x\in\mathbb{R}^d} \ \tfrac{1}{2}\|Ax - y\|^2 + \lambda\|x\|_1$

Illustration: plot of $\operatorname{supp}(x_k)$ for one instance with $d = 27$

The result reads: for all $k$ large enough,
$\operatorname{supp}(x^\star) \subseteq \operatorname{supp}(x_k) \subseteq \operatorname{supp}(y^\star_\varepsilon)$
where $y^\star_\varepsilon = \operatorname{prox}_{\gamma(1-\varepsilon)R}(u^\star - x^\star)$ for any $\varepsilon > 0$

Gap between the two extreme strata:
$\delta = \dim\big(J_{R^*}(M^*_{-\nabla f(x^\star)})\big) - \dim(M_{x^\star}) = \#\operatorname{supp}\big(A^\top(Ax^\star - y)\big) - \#\operatorname{supp}(x^\star)$
Communication-efficient proximal methods
Illustration of the identification of proximal-gradient algorithm
Generate many random problems (with d = 100 and n = 50) and solve them
Select those with $\#\operatorname{supp}(x^\star) = 10$ and $\delta = 0$ or $10$ (where $\delta = \dim\big(J_{R^*}(M^*_{A^\top(Ax^\star - y)})\big) - \dim(M_{x^\star})$)
Plot the evolution of $\#\operatorname{supp}(x_k)$ with $x_{k+1} = \operatorname{prox}_{\gamma\|\cdot\|_1}\big(x_k - \gamma A^\top(Ax_k - y)\big)$

$\delta$ quantifies the degeneracy of the problem and the identification behavior of the algorithm:
– $\delta = 0$: weak degeneracy → exact identification
– $\delta = 10$: strong degeneracy → enlarged identification
Communication-efficient proximal methods
Sparse communication
For $R = \|\cdot\|_1$, the iterates of the proximal gradient method eventually become sparse...

In our distributed optimization context:
– downward communications (the master sends $x_k = \operatorname{prox}_{\lambda\|\cdot\|_1}(x_{k-1} + \Delta)$ down to the workers) become (naturally) sparse :)
– upward communications (worker $i = i(k)$ sends its update $\Delta$ up to the master) stay dense... :(
[Figure: master/worker diagram with the sparse $x_k$ sent down and the dense $\Delta$ sent up: let's sparsify the upward updates]

Proposed solution: sparsify the updates [Grishchenko, Iutzeler, M. '19]
Take a selection $S$ of coordinates (at random or not...):

$(x^i_k)_{[j]} = \begin{cases} \big(x_{k-D_k^i} - \gamma\nabla f^i(x_{k-D_k^i})\big)_{[j]} & \text{if } i = i(k) \text{ and } j \in S \\ (x^i_{k-1})_{[j]} & \text{otherwise} \end{cases}$

Similar to block-coordinate methods (recall Silvia's talk yesterday)... but the iteration differs by the averaging
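Here is a minimal sketch of one such sparsified worker update, using the adaptive selection discussed on the next slide (mask of the current support plus a few random coordinates); all names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sparsified_update(x_received, grad_i, x_prev_contrib, x_master, gamma, n_extra, rng):
    """One sparsified worker update: only the selected coordinates are recomputed and sent.

    x_received     : the (possibly delayed) point x_{k-D_k^i} the worker received
    grad_i         : callable, gradient of the local function f^i
    x_prev_contrib : this worker's previous contribution x^i_{k-1}
    x_master       : current master point; its support defines the adaptive mask
    n_extra        : number of extra random coordinates added to the mask
    """
    d = x_received.size
    # Adaptive selection S: support of the (sparse) master point + a few random coordinates
    mask = np.flatnonzero(x_master)
    avail = np.setdiff1d(np.arange(d), mask)
    extra = (rng.choice(avail, size=min(n_extra, avail.size), replace=False)
             if avail.size else np.array([], dtype=int))
    S = np.concatenate([mask, extra])

    full_step = x_received - gamma * grad_i(x_received)
    new_contrib = x_prev_contrib.copy()
    new_contrib[S] = full_step[S]           # only the selected coordinates change
    delta = new_contrib - x_prev_contrib    # sparse update sent up to the master
    return new_contrib, delta
```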
Communication-efficient proximal methods
Choice of selection: random vs. adaptive

Random: take a fixed number of entries at random (e.g. 200)
Adaptive: take the mask of the support + some entries at random

$\ell_1$-regularized logistic regression: 10 machines in a cluster, madelon dataset (2000 × 500) evenly distributed
comparison of asynchr. prox-grad vs. 3 sparsified variants

[Figure: suboptimality vs. quantity of data exchanged, for the full update, 200 random coordinates, mask + 100 random coordinates, and mask + sizeof(mask) coordinates]

Tradeoff between sparsification (less communication) and identification (faster convergence)
Taking twice the size of the support works well :) ≃ Automatic Dimension Reduction
Conclusions
Take-home message
Distributed optimization is a hot topic
Work to adapt our algorithms to recent distributed computing systems
Practical deployment + theoretical work (to design/analyze algorithms)

Key ideas
– our algorithms are "flexible": independent from the computing system
– no assumptions on delays!
– we exploit the identification properties of proximal algorithms
to automatically reduce the dimension and the size of communications

Thanks!!
Talk based on joint work
Mirror-stratifiable functions
– J. Fadili, J. Malick, G. Peyre. Sensitivity Analysis for Mirror-Stratifiable Convex Functions. SIAM Journal on Optimization, 2018.

Asynchronous level bundle algorithms
– F. Iutzeler, J. Malick, W. de Oliveira. Asynchronous Level Bundle Methods. In revision in Mathematical Programming, 2019.

Flexible distributed proximal algorithms
– K. Mishchenko, F. Iutzeler, J. Malick. A Delay-Tolerant Proximal-Gradient Algorithm for Distributed Learning. ICML, 2018.
– M. Grishchenko, F. Iutzeler, J. Malick. Subspace Descent Methods with Identification-Adapted Sampling. Submitted to Mathematics of OR, 2019.