
Distributed nonsmooth optimization:variational analysis at work and flexible proximal algorithms

Jerome MALICK

CNRS, Laboratoire Jean Kuntzmann, Grenoble (France)

International School of Mathematics “Guido Stampacchia”

71st Workshop: Advances in nonsmooth analysis and optimization

June 2019 – Erice (Sicily)

Content at a glance

Today's talk: a mix of various notions

– variational analysis at work: set convergence, sensitivity analysis

– nonsmooth algorithms: proximal gradient, level bundle

– distributed optimization: parallel computing, distributed learning, synchronous vs. asynchronous, unreliable communications


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction


Distributed optimization

Distributed optimization: past and present

An old history... seminal work of D. Bertsekas and co-authors

Two classical frameworks: (1) network, (2) master/slaves

Recent challenges: new distributed systems
– increase in available computing resources (data centers, cloud)
– improvement of multicore infrastructure and networks

How to (efficiently) deploy our algorithms?


Distributed optimization

Example in master/slave framework

Basic example: minimize m smooth functions over M = m workers

    min_{x∈R^d} ∑_{i=1}^m f^i(x)        (workers hold f^1, f^2, f^3, f^4; m = M = 4)

Standard/distributed gradient descent (map-reduce):

    x_{k+1} = x_k − γ ∑_{i=1}^m ∇f^i(x_k)

Extensions: with a prox-term, nonsmooth f^i, ...

When m, the number of functions, is greater than M, the number of machines (m > M), the algorithm extends to incremental/stochastic (batch) gradient descent (reviewed in Georg's talk on Monday).

[Figure: M = 4 workers handling functions f^{i_1}, ..., f^{i_4} through a shared memory, for the case m > M = 4]
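To make the map-reduce step concrete, here is a minimal sketch (not from the talk; the quadratic local functions f^i below are made up for the example): each worker returns ∇f^i(x_k) at the broadcast point, and the master sums the gradients and takes the step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 4  # dimension and number of workers (m = M here)

# Hypothetical local smooth functions f^i(x) = 0.5 ||A_i x - b_i||^2
A = [rng.standard_normal((10, d)) for _ in range(M)]
b = [rng.standard_normal(10) for _ in range(M)]

def local_grad(i, x):
    """Worker i: gradient of its local function f^i at the broadcast point x."""
    return A[i].T @ (A[i] @ x - b[i])

# Stepsize from the smoothness constant of the sum (sum of the local constants)
L = sum(np.linalg.norm(Ai, 2) ** 2 for Ai in A)
gamma = 1.0 / L

x = np.zeros(d)
for k in range(500):
    grads = [local_grad(i, x) for i in range(M)]   # "map": one gradient per worker
    x = x - gamma * sum(grads)                     # "reduce": master aggregates and steps
print(x)
```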


Distributed optimization

Limitations of optimization algorithms in master/slave framework

The previous slide hides the complexity of the tasks and the variety of possible computing systems...

Two key limitations of usual algorithms:

(1) Synchronization: the master has to wait for all workers at each iteration.

[Figure: timelines of 3 workers/agents (image: W. Yin), comparing how an iteration is redefined – synchronous: a new iteration starts when all agents finish; asynchronous: a new iteration starts when any agent finishes]

(2) Communications: unreliable, costly, ... (image: Google AI)


Distributed optimization

Examples & applications of distributed/parallel optimization

Rich activity: e.g. [Nedic et al. '14], [Yin et al. '15], [Richtarik et al. '16], [Chouzenoux et al. '17]... just to name a few

The various practical situations differ.

In the workshop: – Welington's talk (Wednesday)
                 – mentioned in the talks of Russel, Silvia, and Antoine

Our community is more knowledgeable about the "data center" situation:

1 data is distributed among many machines ("big data")

2 the computation load is distributed ("big problems" or HPC)

Ex: the large-scale stochastic programming problems of Welington's talk

Let's discuss distributed (on-device) learning in more detail.


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Distributed optimization

Recalls on supervised learning: context

Data: n observations (a_j, y_j), j = 1, ..., n
e.g. (a_j, y_j) ∈ R^m × R (regression) or (a_j, y_j) ∈ R^m × {−1, +1} (binary classification)

[Figure: five example inputs a_1, ..., a_5 with labels y_1 = 1, y_2 = −1, y_3 = 1, y_4 = −1, y_5 = 1]

Supervised learning: a prediction function h(a, x) ∈ R parameterized by x ∈ R^d
(usually x = β in statistics, x = ω in machine learning, or x = θ in deep learning)

Goal: find x such that h(a_i, x) ≈ y_i (and generalizes well on unseen data)

Standard prediction functions:

– linear prediction: h(a, x) = ⟨a, x⟩ (or h(a, x) = ⟨φ(a), x⟩ for kernels)

– highly non-linear prediction by artificial neural networks:
      h(a, x) = ⟨x_m, σ(⟨x_{m−1}, ··· σ(⟨x_1, a⟩)⟩)⟩

[Figure: neural network diagram with a single hidden layer; the hidden layer derives nonlinear transformations of linear combinations of the inputs, which are then used to model the output]


Distributed optimization

Recalls on supervised learning: learning is optimizing!

Learning = finding x, i.e. the best x from the data = solving an optimization problem

Regularized empirical risk minimization
(regularization avoids overfitting, imposes structure on x, or helps numerically)

    min_{x∈R^d}  (1/n) ∑_{j=1}^n ℓ(y_j, h(a_j, x)) + λ R(x)

    data-fitting term + regularizer

Ex: regularized logistic regression (to be used in the numerical experiments)

    min_{x∈R^d}  (1/n) ∑_{j=1}^n log(1 + exp(−y_j ⟨a_j, x⟩)) + (λ_2/2) ‖x‖₂² + λ_1 ‖x‖₁

Distributed setting: machine i holds a chunk of the data S_i

    (1/n) ∑_{j=1}^n ℓ(y_j, h(a_j, x)) = ∑_{i=1}^m [ (1/n) ∑_{j∈S_i} ℓ(y_j, h(a_j, x)) ] = ∑_{i=1}^m f^i(x)
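To make the distributed splitting concrete, here is a small sketch (synthetic data and chunk sizes made up, not from the talk) of the ℓ₂+ℓ₁-regularized logistic objective above, written as a sum of per-machine terms f^i plus the regularizer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 200, 20, 4              # samples, features, machines
A = rng.standard_normal((n, d))   # rows a_j
y = rng.choice([-1.0, 1.0], n)    # labels y_j
lam1, lam2 = 0.01, 0.1            # l1 and l2 regularization weights

# Split the indices {0, ..., n-1} into m chunks S_i, one per machine
chunks = np.array_split(np.arange(n), m)

def f_i(x, S):
    """Local data-fitting term of machine i: (1/n) sum_{j in S_i} log(1 + exp(-y_j <a_j, x>))."""
    z = -y[S] * (A[S] @ x)
    return np.sum(np.logaddexp(0.0, z)) / n

def objective(x):
    """Full regularized empirical risk: sum_i f^i(x) + (lam2/2)||x||^2 + lam1 ||x||_1."""
    smooth = sum(f_i(x, S) for S in chunks)
    return smooth + 0.5 * lam2 * np.dot(x, x) + lam1 * np.sum(np.abs(x))

print(objective(np.zeros(d)))  # equals log(2) at x = 0
```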


Distributed optimization

New machine learning frameworks?

Machine learning has led to major breakthroughs in various areas (such as natural language processing, computer vision, and speech recognition...).

Much of this success has been based on collecting huge amounts of data (and requires a lot of energy...).

Collecting data has two major problems:
1 it is privacy-invasive
2 it requires a lot of storage

A solution: federated learning – decoupling the ability to do machine learning from the need to store the data.

No data moves... but:
– unbalanced number of samples per device
– unstable communications...

Back to the initial question: how to (efficiently) deploy our algorithms?


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction


Asynchronous nonsmooth optimization methods

Asynchronous master/slave framework

Algorithm = global communication scheme + local optimization method
(what is x, and what is i)

[Figure: the master updates x_k = x_{k−1} + ∆ and exchanges ∆ and x_k with workers 1, ..., M, each calling its own oracle f^i; timelines are shown from the viewpoint of the updating worker i = i(k) and of another worker j ≠ i(k)]

– iteration = receive from a worker + master update + send
– time k = number of iterations
– machine i(k) = the machine updating at time k
– delay d_k^i = time since the last exchange with machine i:
      d_k^i = 0 iff i = i(k), and d_k^i = d_{k−1}^i + 1 otherwise
– second delay D_k^i = time since the penultimate exchange with machine i

Asynchronous nonsmooth optimization methods

An instantiation: asynchronous gradient

Averaged minimization of smooth functions

    min_{x∈R^d}  (1/m) ∑_{i=1}^m f^i(x)    with each f^i L-smooth and µ-strongly convex

Asynchronous averaged gradient [Vanli, Gurbuzbalaban, Ozdaglar '17]

    x_k = x_{k−1} − (γ/m) ∑_{i=1}^m ∇f^i(x_{k−D_k^i})

Problem: the stepsize γ depends on the delays between machines.

Proposed algorithm: rather combine the iterates [Mishchenko, Iutzeler, M. '18]

    x_k = (1/m) ∑_{i=1}^m x_{k−D_k^i} − (γ/m) ∑_{i=1}^m ∇f^i(x_{k−D_k^i})

Interest: the usual stepsize γ ∈ (0, 2/(µ + L)] gives linear convergence.

Variational analysis at work? Not much yet:
– the convergence proof follows from standard tools and the usual rationale
– originality: a new "epoch"-based analysis to get delay-free results
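Below is a toy sequential simulation of the combined-iterates update above (a sketch under my own simplifications, not the authors' implementation): each worker i stores the last point it received, x_{k−D_k^i}, and its gradient there; at each step a randomly chosen worker refreshes its contribution, and the master averages the stored points and gradients. The quadratic local functions are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 4

# Hypothetical local functions f^i(x) = 0.5 x^T D_i x - c_i^T x (strongly convex and smooth)
D = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(m)]
c = [rng.standard_normal(d) for _ in range(m)]
mu, L = 0.5, 2.0
gamma = 2.0 / (mu + L)                     # the usual stepsize gamma in (0, 2/(mu+L)]

def grad_i(i, x):
    """Gradient of the local function f^i at x."""
    return D[i] @ x - c[i]

x = np.zeros(d)
points = [x.copy() for _ in range(m)]      # x_{k-D_k^i}: last point given to worker i
grads = [grad_i(i, x) for i in range(m)]   # gradient of f^i at that (possibly old) point

for k in range(400):
    i = rng.integers(m)                    # the worker answering at time k (random delays)
    points[i] = x.copy()                   # it works on the current master point...
    grads[i] = grad_i(i, points[i])        # ...and returns its fresh local gradient
    # master update: average of stored points minus gamma times average of stored gradients
    x = sum(points) / m - gamma * sum(grads) / m

print(x)
print(np.linalg.solve(sum(D), sum(c)))     # minimizer of (1/m) sum_i f^i, for comparison
```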


Asynchronous nonsmooth optimization methods

Numerical illustrations

Illustration on a toy problem: 2D quadratic functions on 5 workers, but with one worker 10x slower.

Illustration on standard regularized logistic regression: 100 machines in a cluster, with 10% of the data on machine one and the rest spread evenly.

[Figure: suboptimality vs. wallclock time (s) on RCV1 (697641 × 47236) and URL (2396130 × 3231961), comparing DAve-PG, synchronous PG, and PIAG]

Conclusion: the stepsize matters! (remember Silvia's talk!)


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Asynchronous nonsmooth optimization methods

The idea of the work with Welington in one slide/picture

Basic situation: minimize m convex functions over m machines/oracles

    min_{x∈X} ∑_{i=1}^m f^i(x)

At time k, machine i = i(k) sends up f^i(x_{k−d_k^i}) and g ∈ ∂f^i(x_{k−d_k^i}).

Example: d = 1, m = 2 (oracle f^1 is 3× faster than oracle f^2)

[Figure: cutting-plane models of f^1 and f^2 and of their sum f^1 + f^2, together with the level value f_lev]

The delayed linearizations give a lower bound on f^1 + f^2 [Hintermuller '01] [Oliveira, Sagastizabal '15], but no upper bound, since f^1(x_{k_1}) and f^2(x_{k_2}) are evaluated at different points x_{k_1} ≠ x_{k_2}. An upper bound can still be formed from the level value and the delayed oracle information:

    f_up = f_lev + ∑_{i=1}^m ( L_i ‖x_{k+1} − x_{k−d_k^i}‖ − ⟨ g^i_{k−d_k^i}, x_{k+1} − x_{k−d_k^i} ⟩ )
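A small sketch of how this upper bound can be evaluated from what the master has stored (my notation; the Lipschitz constants L_i and the toy numbers are made up): each machine contributes its last linearization point and subgradient, combined with the level value f_lev at the new point x_{k+1}.

```python
import numpy as np

def upper_bound(f_lev, x_next, delayed_points, delayed_subgrads, lipschitz):
    """f_up = f_lev + sum_i ( L_i ||x_{k+1} - x_{k-d_k^i}|| - <g^i, x_{k+1} - x_{k-d_k^i}> )."""
    f_up = f_lev
    for x_i, g_i, L_i in zip(delayed_points, delayed_subgrads, lipschitz):
        diff = x_next - x_i
        f_up += L_i * np.linalg.norm(diff) - np.dot(g_i, diff)
    return f_up

# Toy usage with m = 2 oracles in dimension d = 1 (made-up numbers)
x_next = np.array([0.3])
delayed_points = [np.array([0.5]), np.array([-0.2])]     # x_{k-d_k^1}, x_{k-d_k^2}
delayed_subgrads = [np.array([1.0]), np.array([-0.5])]   # g^1, g^2
print(upper_bound(f_lev=1.0, x_next=x_next,
                  delayed_points=delayed_points,
                  delayed_subgrads=delayed_subgrads,
                  lipschitz=[1.0, 1.0]))
```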


Asynchronous nonsmooth optimization methods

Direct convergence by basic variational analysis

The convergence of the usual level bundle method [Lemarechal, Nesterov, Nemirovski '95] requires tight control of the iterates' moves to get a complexity analysis.

Forget this: asynchronous iterates are out of control, so we can simplify the proof in this more complicated setting!

3-line proof: finite number of null steps (by contradiction)

– recall that x_{k+1} = Proj_{X_k}(x)
– the (compact) sets X_k are nested: X_k ⊃ X_{k+1} ⊃ ···
– X_k → ∩_k X_k = X_∞ ≠ ∅ (Painlevé–Kuratowski set convergence)
– x_{k+1} = Proj_{X_k}(x) → Proj_{X_∞}(x) [Rockafellar–Wets book]
– f_up = f_lev + ∑_{i=1}^m ( L_i ‖x_{k+1} − x_{k−d_k^i}‖ − ⟨ g^i_{k−d_k^i}, x_{k+1} − x_{k−d_k^i} ⟩ ) → f_lev
– this leads to a contradiction

Variational analysis at work... a little... more to come... (Samir, be patient)


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction


Communication-efficient proximal methods

When communication is the bottleneck...

Communicating may be bad:
– heterogeneous frameworks: communications are unreliable
– high-dimensional problems: communications can also be costly

A solution to reduce communication is to reduce dimension.

How? Using the structure of the regularized problem

    min_{x∈R^d} ∑_{i=1}^m f^i(x) + λ R(x)

Typically x* belongs to a low-dimensional manifold, because the convex R is usually highly structured...

Communication-efficient proximal methods

Mirror-stratifiable regularizers

Most of the regularizers used in machine learning or image processing have a strong primal-dual structure – they are mirror-stratifiable [Fadili, M., Peyre '18].

Examples (with the associated unit ball and the low-dimensional manifold M_x where x belongs):

– R = ‖·‖₁ (and ‖·‖_∞ or other polyhedral gauges)

– nuclear norm (aka trace norm): R(X) = ∑_i |σ_i(X)| = ‖σ(X)‖₁

– group-ℓ₁: R(x) = ∑_{b∈B} ‖x_b‖₂ (e.g. R(x) = |x_1| + ‖x_{2,3}‖)

[Figure: unit balls of the three regularizers, each with a point x and its manifold M_x]


Communication-efficient proximal methods

Recall on stratifications

Recall the talks of Aris (Monday morning) and of Hasnaa (Tuesday afternoon).

A stratification of a set D ⊂ R^N is a finite partition M = {M_i}_{i∈I},

    D = ⋃_{i∈I} M_i,

with so-called "strata" (e.g. smooth/affine manifolds) which fit nicely:

    M ∩ cl(M′) ≠ ∅  ⟹  M ⊂ cl(M′)

This relation induces a (partial) ordering M ≤ M′.

Example: B_∞, the unit ℓ_∞-ball in R², has a stratification with 9 (affine) strata, with for instance M_1 ≤ M_2 ≤ M_4 and M_1 ≤ M_3 ≤ M_4.

[Figure: the square B_∞ with four of its strata M_1, M_2, M_3, M_4 marked]


Communication-efficient proximal methods

Mirror-stratifiable function: formal definition

A convex function R : R^N → R ∪ {+∞} is mirror-stratifiable with respect to
– a (primal) stratification M = {M_i}_{i∈I} of dom(∂R)
– a (dual) stratification M* = {M*_i}_{i∈I} of dom(∂R*)
if J_R has two properties:

(1) J_R : M → M* is invertible with inverse J_{R*}:

    M* ∋ M* = J_R(M)  ⟺  J_{R*}(M*) = M ∈ M

(2) J_R is decreasing for the order relation ≤ between strata:

    M ≤ M′  ⟺  J_R(M) ≥ J_R(M′)

with the transfer operator J_R : R^N ⇒ R^N [Daniilidis-Drusvyatskiy-Lewis '13]

    J_R(S) = ⋃_{x∈S} ri(∂R(x))

Communication-efficient proximal methods

Mirror-stratifiable function: simple example

R = ι_{B_∞} and R* = ‖·‖₁:

    J_R(M_i) = ⋃_{x∈M_i} ri ∂R(x) = ⋃_{x∈M_i} ri N_{B_∞}(x) = M*_i

    J_{R*}(M*_i) = ⋃_{x∈M*_i} ri ∂R*(x) = ⋃_{x∈M*_i} ri ∂‖x‖₁ = M_i

[Figure: the strata M_1, ..., M_4 of B_∞ and the dual strata M*_1, ..., M*_4, exchanged by J_R and J_{R*}]

Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Communication-efficient proximal methods

Sensitivity under small variations

Parameterized composite optimization problem (smooth + nonsmooth)

    min_{x∈R^N} F(x, p) + R(x)

Optimality condition for a primal-dual solution (x*(p), u*(p)):

    u*(p) = −∇F(x*(p), p) ∈ ∂R(x*(p))

For p ∼ p_0, can we localize x*(p) with respect to x*(p_0)?

Theorem (Enlarged sensitivity)
Under mild assumptions (unique minimizer x*(p_0) at p_0 and objective uniformly level-bounded in x), if R is mirror-stratifiable, then for p ∼ p_0,

    M_{x*(p_0)} ≤ M_{x*(p)} ≤ J_{R*}(M*_{u*(p_0)})

In the non-degenerate case u*(p_0) ∈ ri(∂R(x*(p_0))), we have M_{x*(p_0)} = M_{x*(p)} (= J_{R*}(M*_{u*(p_0)})): we retrieve exactly the active strata ([Lewis '06] for partly-smooth functions).


Communication-efficient proximal methods

First sensitivity result illustrated

Simple projection problem, in primal and dual form:

    min { ½ ‖x − p‖² : ‖x‖_∞ ≤ 1 }        and        min_{u∈R^N} ½ ‖u − p‖² + ‖u‖₁

Non-degenerate case: u*(p_0) = p_0 − x*(p_0) ∈ ri N_{B_∞}(x*(p_0))
⟹ M_1 = M_{x*(p_0)} = M_{x*(p)} (in this case x*(p) = x*(p_0))

General case: u*(p_0) = p_0 − x*(p_0) ∉ ri N_{B_∞}(x*(p_0))
⟹ M_1 = M_{x*(p_0)} ≤ M_{x*(p)} ≤ J_{R*}(M*_{u*(p_0)}) = M_2

[Figure: the point p_0, its projection x*(p_0) onto B_∞ and the dual solution u*(p_0), in both cases, together with a nearby parameter p]


Communication-efficient proximal methods

Activity identification of proximal algorithms

Composite optimization problem (smooth + nonsmooth)

    min_{x∈R^N} f(x) + R(x)

Optimality condition: −∇f(x*) ∈ ∂R(x*)

Proximal-gradient algorithm (aka forward-backward algorithm)

    x_{k+1} = prox_{γR}(x_k − γ ∇f(x_k))

Do the iterates x_k identify the low complexity of x*?

Theorem (Enlarged activity identification)
Under convergence assumptions, if R is mirror-stratifiable, then for k large,

    M_{x*} ≤ M_{x_k} ≤ J_{R*}(M*_{−∇f(x*)})

In the non-degenerate case −∇f(x*) ∈ ri(∂R(x*)), we have exact identification M_{x*} = M_{x_k} (= J_{R*}(M*_{−∇f(x*)})) [Liang et al. '15], and we can bound the identification threshold in this case [Hare et al. '19].


Communication-efficient proximal methods

For the case of sparse optimization

R = ‖·‖₁ promotes sparsity:

    M_x = { z ∈ R^d : supp(z) = supp(x) },    dim(M_x) = #supp(x) = ‖x‖₀

[Figure: the stratum M_{x_0} of a point x_0 with ‖x_0‖₀ = 1]

ℓ₁-regularized least squares (LASSO)

    min_{x∈R^d} ½ ‖Ax − y‖² + λ ‖x‖₁

Illustration: plot of supp(x_k) for one instance with d = 27.

The result reads: for all k large enough,

    supp(x*) ⊆ supp(x_k) ⊆ supp(y*_ε)

where y*_ε = prox_{γ(1−ε)R}(u* − x*) for any ε > 0.

Gap between the two extreme strata:

    δ = dim(J_{R*}(M*_{−∇f(x*)})) − dim(M_{x*}) = #supp(A^⊤(Ax* − y)) − #supp(x*)

Communication-efficient proximal methods

Illustration of the identification of the proximal-gradient algorithm

Generate many random problems (with d = 100 and n = 50) and solve them.

Select those with #supp(x*) = 10 and with δ = 0 or δ = 10, where δ = dim(J_{R*}(M*_{A^⊤(Ax*−y)})) − dim(M_{x*}).

Plot the evolution of #supp(x_k) along the iterations x_{k+1} = prox_{γ‖·‖₁}(x_k − γ A^⊤(Ax_k − y)).

δ quantifies the degeneracy of the problem and the identification behavior of the algorithm:
– δ = 0: weak degeneracy → exact identification
– δ = 10: strong degeneracy → enlarged identification
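A self-contained sketch of this kind of experiment (random data generated here, not the instances of the talk): run the proximal-gradient iteration x_{k+1} = prox_{γ‖·‖₁}(x_k − γ A^⊤(Ax_k − y)) on a random LASSO problem and track #supp(x_k) along the iterations.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 100, 50, 0.1

A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, 10, replace=False)] = rng.standard_normal(10)
y = A @ x_true + 0.01 * rng.standard_normal(n)

gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # stepsize <= 1/L with L = ||A||^2

def soft_threshold(v, tau):
    """prox of tau * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(d)
for k in range(500):
    x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
    if k % 100 == 0:
        print(k, int(np.count_nonzero(x)))   # size of supp(x_k): identification in action
```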

Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Communication-efficient proximal methods

Sparse communication

For R = ‖·‖₁, the iterates of the proximal gradient method eventually become sparse...

In our distributed optimization context:
– downward communications (the master sends x_k = prox_{λ‖·‖₁}(x_{k−1} + ∆) to the workers) become (naturally) sparse
– upward communications (worker i = i(k) sends ∆) stay dense... so let's sparsify them.

Proposed solution: sparsify the updates [Grishchenko, Iutzeler, M. '19]

Take a selection S of coordinates (at random or not...):

    (x_k^i)_[j] = ( x_{k−D_k^i} − γ ∇f^i(x_{k−D_k^i}) )_[j]   if i = i(k) and j ∈ S
    (x_k^i)_[j] = (x_{k−1}^i)_[j]                              otherwise

Similar to block-coordinate methods (recall Silvia's talk yesterday)... but the iteration is different because of the averaging.
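A sketch of this selection rule (my notation and helper names, made up for illustration), in the adaptive variant discussed next: the updating worker recomputes only the coordinates in S, taken as the support of the received (sparse) master point plus a few random coordinates, and keeps its other coordinates unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50

def sparsified_worker_update(x_local_prev, x_received, grad, gamma, n_extra=5):
    """Update only the coordinates in S = supp(x_received) + a few random ones; keep the rest."""
    mask = x_received != 0                       # coordinates of the (sparse) master point
    extra = rng.choice(d, size=n_extra, replace=False)
    mask[extra] = True                           # the selection S
    x_new = x_local_prev.copy()
    x_new[mask] = (x_received - gamma * grad)[mask]
    return x_new

# Toy usage with a hypothetical sparse master point and a dummy local gradient
x_received = np.zeros(d)
x_received[[1, 7, 20]] = rng.standard_normal(3)
x_local_prev = rng.standard_normal(d)
grad = rng.standard_normal(d)
x_new = sparsified_worker_update(x_local_prev, x_received, grad, gamma=0.1)
print(np.count_nonzero(x_new - x_local_prev))    # at most |supp| + n_extra coordinates changed
```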


Communication-efficient proximal methods

Choice of selection: random vs. adaptive

Random: take a fixed number of entries at random (e.g. 200).
Adaptive: take the mask of the support + some entries at random.

ℓ₁-regularized logistic regression: 10 machines in a cluster, madelon dataset (2000 × 500) evenly distributed; comparison of the asynchronous prox-grad vs. 3 sparsified variants.

[Figure: suboptimality vs. quantity of data exchanged, for the full update, 200 random coordinates, mask + 100 random coordinates, and mask + sizeof(mask) random coordinates]

There is a tradeoff between sparsification (less communication) and identification (faster convergence).

Taking twice the size of the support works well ≃ automatic dimension reduction.


Conclusions

Take-home message

Distributed optimization is a hot topic: work to adapt our algorithms to recent distributed computing systems, with practical deployment + theoretical work (to design/analyze algorithms).

Key ideas
– our algorithms are "flexible": independent from the computing systems, with no assumptions on the delays!
– we exploit the identification of proximal algorithms to automatically reduce the dimension and the size of communications

Thanks!!


Talk based on joint work

Mirror-stratifiable functions
– J. Fadili, J. Malick, G. Peyre, "Sensitivity Analysis for Mirror-Stratifiable Convex Functions", SIAM Journal on Optimization, 2018.

Asynchronous level bundle algorithms
– F. Iutzeler, J. Malick, W. de Oliveira, "Asynchronous level bundle methods", in revision in Mathematical Programming, 2019.

Flexible distributed proximal algorithms
– K. Mishchenko, F. Iutzeler, J. Malick, "A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning", ICML, 2018.
– M. Grishchenko, F. Iutzeler, J. Malick, "Subspace Descent Methods with Identification-Adapted Sampling", submitted to Mathematics of OR, 2019.