
Distributed nonsmooth optimization:variational analysis at work and flexible proximal algorithms

Jerome MALICK

CNRS, Laboratoire Jean Kuntzmann, Grenoble (France)

International School of Mathematics “Guido Stampacchia”

71st Workshop: Advances in nonsmooth analysis and optimization

June 2019 – Erice (Sicily)

Content at a glance

Today's talk: a mix of various notions

– variational analysis at work: set convergence, sensitivity analysis

– nonsmooth algorithms: proximal gradient, level bundle

– distributed optimization: parallel computing, distributed learning, synchronous vs. asynchronous, unreliable communications


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction


Distributed optimization

Distributed optimization: past and present

An old history... seminal work of D. Bertsekas and co-authors

Two classical frameworks: (1) network, (2) master/slaves

Recent challenges: new distributed systems
– increase in available computing resources (data centers, cloud)
– improvement of multicore infrastructure and networks

How to (efficiently) deploy our algorithms?


Distributed optimization

Example in master/slave framework

Basic example: minimize m smooth functions over M = m workers

    min_{x∈R^d} ∑_{i=1}^m f^i(x)        (workers hold f^1, f^2, f^3, f^4; m = M = 4)

Standard/distributed gradient descent (map-reduce):

    x_{k+1} = x_k − γ ∑_{i=1}^m ∇f^i(x_k)

Extensions: with a prox-term, nonsmooth f^i, ...

When m, the number of functions, is greater than M, the number of machines (m > M), the algorithm extends to incremental/stochastic (batch) gradient descent (reviewed in Georg's talk on Monday).

[Figure: M = 4 workers handling functions f^{i_1}, ..., f^{i_4} through a shared memory, for the case m > M = 4]
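To make the map-reduce step concrete, here is a minimal sketch (not from the talk; the quadratic local functions f^i below are made up for the example): each worker returns ∇f^i(x_k) at the broadcast point, and the master sums the gradients and takes the step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 4  # dimension and number of workers (m = M here)

# Hypothetical local smooth functions f^i(x) = 0.5 ||A_i x - b_i||^2
A = [rng.standard_normal((10, d)) for _ in range(M)]
b = [rng.standard_normal(10) for _ in range(M)]

def local_grad(i, x):
    """Worker i: gradient of its local function f^i at the broadcast point x."""
    return A[i].T @ (A[i] @ x - b[i])

# Stepsize from the smoothness constant of the sum (sum of the local constants)
L = sum(np.linalg.norm(Ai, 2) ** 2 for Ai in A)
gamma = 1.0 / L

x = np.zeros(d)
for k in range(500):
    grads = [local_grad(i, x) for i in range(M)]   # "map": one gradient per worker
    x = x - gamma * sum(grads)                     # "reduce": master aggregates and steps
print(x)
```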


Distributed optimization

Limitations of optimization algorithms in master/slave framework

The previous slide hides the complexity of the tasks and the variety of possible computing systems...

Two key limitations of usual algorithms:

(1) Synchronization: the master has to wait for all workers at each iteration.

[Figure: timelines of 3 workers/agents (image: W. Yin), comparing how an iteration is redefined – synchronous: a new iteration starts when all agents finish; asynchronous: a new iteration starts when any agent finishes]

(2) Communications: unreliable, costly, ... (image: Google AI)


Distributed optimization

Examples & applications of distributed/parallel optimization

Rich activity: e.g. [Nedic et al. '14], [Yin et al. '15], [Richtarik et al. '16], [Chouzenoux et al. '17]... just to name a few

The various practical situations differ.

In the workshop: – Welington's talk (Wednesday)
                 – mentioned in the talks of Russel, Silvia, and Antoine

Our community is more knowledgeable about the "data center" situation:

1 data is distributed among many machines ("big data")

2 the computation load is distributed ("big problems" or HPC)

Ex: the large-scale stochastic programming problems of Welington's talk

Let's discuss distributed (on-device) learning in more detail.


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Distributed optimization

Recalls on supervised learning: context

Data: n observations (a_j, y_j), j = 1, ..., n
e.g. (a_j, y_j) ∈ R^m × R (regression) or (a_j, y_j) ∈ R^m × {−1, +1} (binary classification)

[Figure: five example inputs a_1, ..., a_5 with labels y_1 = 1, y_2 = −1, y_3 = 1, y_4 = −1, y_5 = 1]

Supervised learning: a prediction function h(a, x) ∈ R parameterized by x ∈ R^d
(usually x = β in statistics, x = ω in machine learning, or x = θ in deep learning)

Goal: find x such that h(a_i, x) ≈ y_i (and generalizes well on unseen data)

Standard prediction functions:

– linear prediction: h(a, x) = ⟨a, x⟩ (or h(a, x) = ⟨φ(a), x⟩ for kernels)

– highly non-linear prediction by artificial neural networks:
      h(a, x) = ⟨x_m, σ(⟨x_{m−1}, ··· σ(⟨x_1, a⟩)⟩)⟩

[Figure: neural network diagram with a single hidden layer; the hidden layer derives nonlinear transformations of linear combinations of the inputs, which are then used to model the output]


Distributed optimization

Recalls on supervised learning: learning is optimizing!

Learning = finding x, i.e. the best x from the data = solving an optimization problem

Regularized empirical risk minimization
(regularization avoids overfitting, imposes structure on x, or helps numerically)

    min_{x∈R^d}  (1/n) ∑_{j=1}^n ℓ(y_j, h(a_j, x)) + λ R(x)

    data-fitting term + regularizer

Ex: regularized logistic regression (to be used in the numerical experiments)

    min_{x∈R^d}  (1/n) ∑_{j=1}^n log(1 + exp(−y_j ⟨a_j, x⟩)) + (λ_2/2) ‖x‖₂² + λ_1 ‖x‖₁

Distributed setting: machine i holds a chunk of the data S_i

    (1/n) ∑_{j=1}^n ℓ(y_j, h(a_j, x)) = ∑_{i=1}^m [ (1/n) ∑_{j∈S_i} ℓ(y_j, h(a_j, x)) ] = ∑_{i=1}^m f^i(x)
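To make the distributed splitting concrete, here is a small sketch (synthetic data and chunk sizes made up, not from the talk) of the ℓ₂+ℓ₁-regularized logistic objective above, written as a sum of per-machine terms f^i plus the regularizer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 200, 20, 4              # samples, features, machines
A = rng.standard_normal((n, d))   # rows a_j
y = rng.choice([-1.0, 1.0], n)    # labels y_j
lam1, lam2 = 0.01, 0.1            # l1 and l2 regularization weights

# Split the indices {0, ..., n-1} into m chunks S_i, one per machine
chunks = np.array_split(np.arange(n), m)

def f_i(x, S):
    """Local data-fitting term of machine i: (1/n) sum_{j in S_i} log(1 + exp(-y_j <a_j, x>))."""
    z = -y[S] * (A[S] @ x)
    return np.sum(np.logaddexp(0.0, z)) / n

def objective(x):
    """Full regularized empirical risk: sum_i f^i(x) + (lam2/2)||x||^2 + lam1 ||x||_1."""
    smooth = sum(f_i(x, S) for S in chunks)
    return smooth + 0.5 * lam2 * np.dot(x, x) + lam1 * np.sum(np.abs(x))

print(objective(np.zeros(d)))  # equals log(2) at x = 0
```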


Distributed optimization

New machine learning frameworks?

Machine learning has led to major breakthroughs in various areas (such as natural language processing, computer vision, and speech recognition...).

Much of this success has been based on collecting huge amounts of data (and requires a lot of energy...).

Collecting data has two major problems:
1 it is privacy-invasive
2 it requires a lot of storage

A solution: federated learning – decoupling the ability to do machine learning from the need to store the data.

No data moves... but:
– unbalanced number of samples per device
– unstable communications...

Back to the initial question: how to (efficiently) deploy our algorithms?


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction


Asynchronous nonsmooth optimization methods

Asynchronous master/slave framework

Algorithm = global communication scheme + local optimization method
(what is x, and what is i)

[Figure: the master updates x_k = x_{k−1} + ∆ and exchanges ∆ and x_k with workers 1, ..., M, each calling its own oracle f^i; timelines are shown from the viewpoint of the updating worker i = i(k) and of another worker j ≠ i(k)]

– iteration = receive from a worker + master update + send
– time k = number of iterations
– machine i(k) = the machine updating at time k
– delay d_k^i = time since the last exchange with machine i:
      d_k^i = 0 iff i = i(k), and d_k^i = d_{k−1}^i + 1 otherwise
– second delay D_k^i = time since the penultimate exchange with machine i

Asynchronous nonsmooth optimization methods

An instantiation: asynchronous gradient

Averaged minimization of smooth functions

    min_{x∈R^d}  (1/m) ∑_{i=1}^m f^i(x)    with each f^i L-smooth and µ-strongly convex

Asynchronous averaged gradient [Vanli, Gurbuzbalaban, Ozdaglar '17]

    x_k = x_{k−1} − (γ/m) ∑_{i=1}^m ∇f^i(x_{k−D_k^i})

Problem: the stepsize γ depends on the delays between machines.

Proposed algorithm: rather combine the iterates [Mishchenko, Iutzeler, M. '18]

    x_k = (1/m) ∑_{i=1}^m x_{k−D_k^i} − (γ/m) ∑_{i=1}^m ∇f^i(x_{k−D_k^i})

Interest: the usual stepsize γ ∈ (0, 2/(µ + L)] gives linear convergence.

Variational analysis at work? Not much yet:
– the convergence proof follows from standard tools and the usual rationale
– originality: a new "epoch"-based analysis to get delay-free results
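Below is a toy sequential simulation of the combined-iterates update above (a sketch under my own simplifications, not the authors' implementation): each worker i stores the last point it received, x_{k−D_k^i}, and its gradient there; at each step a randomly chosen worker refreshes its contribution, and the master averages the stored points and gradients. The quadratic local functions are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 4

# Hypothetical local functions f^i(x) = 0.5 x^T D_i x - c_i^T x (strongly convex and smooth)
D = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(m)]
c = [rng.standard_normal(d) for _ in range(m)]
mu, L = 0.5, 2.0
gamma = 2.0 / (mu + L)                     # the usual stepsize gamma in (0, 2/(mu+L)]

def grad_i(i, x):
    """Gradient of the local function f^i at x."""
    return D[i] @ x - c[i]

x = np.zeros(d)
points = [x.copy() for _ in range(m)]      # x_{k-D_k^i}: last point given to worker i
grads = [grad_i(i, x) for i in range(m)]   # gradient of f^i at that (possibly old) point

for k in range(400):
    i = rng.integers(m)                    # the worker answering at time k (random delays)
    points[i] = x.copy()                   # it works on the current master point...
    grads[i] = grad_i(i, points[i])        # ...and returns its fresh local gradient
    # master update: average of stored points minus gamma times average of stored gradients
    x = sum(points) / m - gamma * sum(grads) / m

print(x)
print(np.linalg.solve(sum(D), sum(c)))     # minimizer of (1/m) sum_i f^i, for comparison
```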


Asynchronous nonsmooth optimization methods

Numerical illustrations

Illustration on a toy problem: 2D quadratic functions on 5 workers, but with one worker 10x slower.

Illustration on standard regularized logistic regression: 100 machines in a cluster, with 10% of the data on machine one and the rest spread evenly.

[Figure: suboptimality vs. wallclock time (s) on RCV1 (697641 × 47236) and URL (2396130 × 3231961), comparing DAve-PG, synchronous PG, and PIAG]

Conclusion: the stepsize matters! (remember Silvia's talk!)


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Asynchronous nonsmooth optimization methods

The idea of the work with Welington in one slide/picture

Basic situation: minimize m convex functions over m machines/oracles

    min_{x∈X} ∑_{i=1}^m f^i(x)

At time k, machine i = i(k) sends up f^i(x_{k−d_k^i}) and g ∈ ∂f^i(x_{k−d_k^i}).

Example: d = 1, m = 2 (oracle f^1 is 3× faster than oracle f^2)

[Figure: cutting-plane models of f^1 and f^2 and of their sum f^1 + f^2, together with the level value f_lev]

The delayed linearizations give a lower bound on f^1 + f^2 [Hintermuller '01] [Oliveira, Sagastizabal '15], but no upper bound, since f^1(x_{k_1}) and f^2(x_{k_2}) are evaluated at different points x_{k_1} ≠ x_{k_2}. An upper bound can still be formed from the level value and the delayed oracle information:

    f_up = f_lev + ∑_{i=1}^m ( L_i ‖x_{k+1} − x_{k−d_k^i}‖ − ⟨ g^i_{k−d_k^i}, x_{k+1} − x_{k−d_k^i} ⟩ )
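A small sketch of how this upper bound can be evaluated from what the master has stored (my notation; the Lipschitz constants L_i and the toy numbers are made up): each machine contributes its last linearization point and subgradient, combined with the level value f_lev at the new point x_{k+1}.

```python
import numpy as np

def upper_bound(f_lev, x_next, delayed_points, delayed_subgrads, lipschitz):
    """f_up = f_lev + sum_i ( L_i ||x_{k+1} - x_{k-d_k^i}|| - <g^i, x_{k+1} - x_{k-d_k^i}> )."""
    f_up = f_lev
    for x_i, g_i, L_i in zip(delayed_points, delayed_subgrads, lipschitz):
        diff = x_next - x_i
        f_up += L_i * np.linalg.norm(diff) - np.dot(g_i, diff)
    return f_up

# Toy usage with m = 2 oracles in dimension d = 1 (made-up numbers)
x_next = np.array([0.3])
delayed_points = [np.array([0.5]), np.array([-0.2])]     # x_{k-d_k^1}, x_{k-d_k^2}
delayed_subgrads = [np.array([1.0]), np.array([-0.5])]   # g^1, g^2
print(upper_bound(f_lev=1.0, x_next=x_next,
                  delayed_points=delayed_points,
                  delayed_subgrads=delayed_subgrads,
                  lipschitz=[1.0, 1.0]))
```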


Asynchronous nonsmooth optimization methods

Direct convergence by basic variational analysis

The convergence of the usual level bundle method [Lemarechal, Nesterov, Nemirovski '95] requires tight control of the iterates' moves to get a complexity analysis.

Forget this: asynchronous iterates are out of control, so we can simplify the proof in this more complicated setting!

3-line proof: finite number of null steps (by contradiction)

– recall that x_{k+1} = Proj_{X_k}(x)
– the (compact) sets X_k are nested: X_k ⊃ X_{k+1} ⊃ ···
– X_k → ∩_k X_k = X_∞ ≠ ∅ (Painlevé–Kuratowski set convergence)
– x_{k+1} = Proj_{X_k}(x) → Proj_{X_∞}(x) [Rockafellar–Wets book]
– f_up = f_lev + ∑_{i=1}^m ( L_i ‖x_{k+1} − x_{k−d_k^i}‖ − ⟨ g^i_{k−d_k^i}, x_{k+1} − x_{k−d_k^i} ⟩ ) → f_lev
– this leads to a contradiction

Variational analysis at work... a little... more to come... (Samir, be patient)


Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction


Communication-efficient proximal methods

When communication is the bottleneck...

Communicating may be bad:
– heterogeneous frameworks: communications are unreliable
– high-dimensional problems: communications can also be costly

A solution to reduce communication is to reduce dimension.

How? Using the structure of the regularized problem

    min_{x∈R^d} ∑_{i=1}^m f^i(x) + λ R(x)

Typically x* belongs to a low-dimensional manifold, because the convex R is usually highly structured...

Communication-efficient proximal methods

Mirror-stratifiable regularizers

Most of the regularizers used in machine learning or image processing have a strong primal-dual structure – they are mirror-stratifiable [Fadili, M., Peyre '18].

Examples (with the associated unit ball and the low-dimensional manifold M_x where x belongs):

– R = ‖·‖₁ (and ‖·‖_∞ or other polyhedral gauges)

– nuclear norm (aka trace norm): R(X) = ∑_i |σ_i(X)| = ‖σ(X)‖₁

– group-ℓ₁: R(x) = ∑_{b∈B} ‖x_b‖₂ (e.g. R(x) = |x_1| + ‖x_{2,3}‖)

[Figure: unit balls of the three regularizers, each with a point x and its manifold M_x]


Communication-efficient proximal methods

Recall on stratifications

Recall the talks of Aris (Monday morning) and of Hasnaa (Tuesday afternoon).

A stratification of a set D ⊂ R^N is a finite partition M = {M_i}_{i∈I},

    D = ⋃_{i∈I} M_i,

with so-called "strata" (e.g. smooth/affine manifolds) which fit nicely:

    M ∩ cl(M′) ≠ ∅  ⟹  M ⊂ cl(M′)

This relation induces a (partial) ordering M ≤ M′.

Example: B_∞, the unit ℓ_∞-ball in R², has a stratification with 9 (affine) strata, with for instance M_1 ≤ M_2 ≤ M_4 and M_1 ≤ M_3 ≤ M_4.

[Figure: the square B_∞ with four of its strata M_1, M_2, M_3, M_4 marked]


Communication-efficient proximal methods

Mirror-stratifiable function: formal definition

A convex function R : R^N → R ∪ {+∞} is mirror-stratifiable with respect to
– a (primal) stratification M = {M_i}_{i∈I} of dom(∂R)
– a (dual) stratification M* = {M*_i}_{i∈I} of dom(∂R*)
if J_R has two properties:

(1) J_R : M → M* is invertible with inverse J_{R*}:

    M* ∋ M* = J_R(M)  ⟺  J_{R*}(M*) = M ∈ M

(2) J_R is decreasing for the order relation ≤ between strata:

    M ≤ M′  ⟺  J_R(M) ≥ J_R(M′)

with the transfer operator J_R : R^N ⇒ R^N [Daniilidis-Drusvyatskiy-Lewis '13]

    J_R(S) = ⋃_{x∈S} ri(∂R(x))

Communication-efficient proximal methods

Mirror-stratifiable function: simple example

R = ι_{B_∞} and R* = ‖·‖₁:

    J_R(M_i) = ⋃_{x∈M_i} ri ∂R(x) = ⋃_{x∈M_i} ri N_{B_∞}(x) = M*_i

    J_{R*}(M*_i) = ⋃_{x∈M*_i} ri ∂R*(x) = ⋃_{x∈M*_i} ri ∂‖x‖₁ = M_i

[Figure: the strata M_1, ..., M_4 of B_∞ and the dual strata M*_1, ..., M*_4, exchanged by J_R and J_{R*}]

Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Communication-efficient proximal methods

Sensitivity under small variations

Parameterized composite optimization problem (smooth + nonsmooth)

    min_{x∈R^N} F(x, p) + R(x)

Optimality condition for a primal-dual solution (x*(p), u*(p)):

    u*(p) = −∇F(x*(p), p) ∈ ∂R(x*(p))

For p ∼ p_0, can we localize x*(p) with respect to x*(p_0)?

Theorem (Enlarged sensitivity)
Under mild assumptions (unique minimizer x*(p_0) at p_0 and objective uniformly level-bounded in x), if R is mirror-stratifiable, then for p ∼ p_0,

    M_{x*(p_0)} ≤ M_{x*(p)} ≤ J_{R*}(M*_{u*(p_0)})

In the non-degenerate case u*(p_0) ∈ ri(∂R(x*(p_0))), we have M_{x*(p_0)} = M_{x*(p)} (= J_{R*}(M*_{u*(p_0)})): we retrieve exactly the active strata ([Lewis '06] for partly-smooth functions).


Communication-efficient proximal methods

First sensitivity result illustrated

Simple projection problem, in primal and dual form:

    min { ½ ‖x − p‖² : ‖x‖_∞ ≤ 1 }        and        min_{u∈R^N} ½ ‖u − p‖² + ‖u‖₁

Non-degenerate case: u*(p_0) = p_0 − x*(p_0) ∈ ri N_{B_∞}(x*(p_0))
⟹ M_1 = M_{x*(p_0)} = M_{x*(p)} (in this case x*(p) = x*(p_0))

General case: u*(p_0) = p_0 − x*(p_0) ∉ ri N_{B_∞}(x*(p_0))
⟹ M_1 = M_{x*(p_0)} ≤ M_{x*(p)} ≤ J_{R*}(M*_{u*(p_0)}) = M_2

[Figure: the point p_0, its projection x*(p_0) onto B_∞ and the dual solution u*(p_0), in both cases, together with a nearby parameter p]


Communication-efficient proximal methods

Activity identification of proximal algorithms

Composite optimization problem (smooth + nonsmooth)

    min_{x∈R^N} f(x) + R(x)

Optimality condition: −∇f(x*) ∈ ∂R(x*)

Proximal-gradient algorithm (aka forward-backward algorithm)

    x_{k+1} = prox_{γR}(x_k − γ ∇f(x_k))

Do the iterates x_k identify the low complexity of x*?

Theorem (Enlarged activity identification)
Under convergence assumptions, if R is mirror-stratifiable, then for k large,

    M_{x*} ≤ M_{x_k} ≤ J_{R*}(M*_{−∇f(x*)})

In the non-degenerate case −∇f(x*) ∈ ri(∂R(x*)), we have exact identification M_{x*} = M_{x_k} (= J_{R*}(M*_{−∇f(x*)})) [Liang et al. '15], and we can bound the identification threshold in this case [Hare et al. '19].


Communication-efficient proximal methods

For the case of sparse optimization

R = ‖·‖₁ promotes sparsity:

    M_x = { z ∈ R^d : supp(z) = supp(x) },    dim(M_x) = #supp(x) = ‖x‖₀

[Figure: the stratum M_{x_0} of a point x_0 with ‖x_0‖₀ = 1]

ℓ₁-regularized least squares (LASSO)

    min_{x∈R^d} ½ ‖Ax − y‖² + λ ‖x‖₁

Illustration: plot of supp(x_k) for one instance with d = 27.

The result reads: for all k large enough,

    supp(x*) ⊆ supp(x_k) ⊆ supp(y*_ε)

where y*_ε = prox_{γ(1−ε)R}(u* − x*) for any ε > 0.

Gap between the two extreme strata:

    δ = dim(J_{R*}(M*_{−∇f(x*)})) − dim(M_{x*}) = #supp(A^⊤(Ax* − y)) − #supp(x*)

Communication-efficient proximal methods

Illustration of the identification of the proximal-gradient algorithm

Generate many random problems (with d = 100 and n = 50) and solve them.

Select those with #supp(x*) = 10 and with δ = 0 or δ = 10, where δ = dim(J_{R*}(M*_{A^⊤(Ax*−y)})) − dim(M_{x*}).

Plot the evolution of #supp(x_k) along the iterations x_{k+1} = prox_{γ‖·‖₁}(x_k − γ A^⊤(Ax_k − y)).

δ quantifies the degeneracy of the problem and the identification behavior of the algorithm:
– δ = 0: weak degeneracy → exact identification
– δ = 10: strong degeneracy → enlarged identification
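A self-contained sketch of this kind of experiment (random data generated here, not the instances of the talk): run the proximal-gradient iteration x_{k+1} = prox_{γ‖·‖₁}(x_k − γ A^⊤(Ax_k − y)) on a random LASSO problem and track #supp(x_k) along the iterations.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 100, 50, 0.1

A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, 10, replace=False)] = rng.standard_normal(10)
y = A @ x_true + 0.01 * rng.standard_normal(n)

gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # stepsize <= 1/L with L = ||A||^2

def soft_threshold(v, tau):
    """prox of tau * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(d)
for k in range(500):
    x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
    if k % 100 == 0:
        print(k, int(np.count_nonzero(x)))   # size of supp(x_k): identification in action
```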

Outline

1 Distributed optimization: some main ideas in distributed optimization; distributed/federated learning

2 Asynchronous nonsmooth optimization methods: asynchronous (proximal) gradient methods; asynchronous level bundle methods

3 Communication-efficient proximal methods: optimization with mirror-stratifiable regularizers; variational analysis at work: sensitivity and identification; sparse communications by adaptive dimension reduction

Communication-efficient proximal methods

Sparse communication

For R = ‖·‖₁, the iterates of the proximal gradient method eventually become sparse...

In our distributed optimization context:
– downward communications (the master sends x_k = prox_{λ‖·‖₁}(x_{k−1} + ∆) to the workers) become (naturally) sparse
– upward communications (worker i = i(k) sends ∆) stay dense... so let's sparsify them.

Proposed solution: sparsify the updates [Grishchenko, Iutzeler, M. '19]

Take a selection S of coordinates (at random or not...):

    (x_k^i)_[j] = ( x_{k−D_k^i} − γ ∇f^i(x_{k−D_k^i}) )_[j]   if i = i(k) and j ∈ S
    (x_k^i)_[j] = (x_{k−1}^i)_[j]                              otherwise

Similar to block-coordinate methods (recall Silvia's talk yesterday)... but the iteration is different because of the averaging.
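A sketch of this selection rule (my notation and helper names, made up for illustration), in the adaptive variant discussed next: the updating worker recomputes only the coordinates in S, taken as the support of the received (sparse) master point plus a few random coordinates, and keeps its other coordinates unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50

def sparsified_worker_update(x_local_prev, x_received, grad, gamma, n_extra=5):
    """Update only the coordinates in S = supp(x_received) + a few random ones; keep the rest."""
    mask = x_received != 0                       # coordinates of the (sparse) master point
    extra = rng.choice(d, size=n_extra, replace=False)
    mask[extra] = True                           # the selection S
    x_new = x_local_prev.copy()
    x_new[mask] = (x_received - gamma * grad)[mask]
    return x_new

# Toy usage with a hypothetical sparse master point and a dummy local gradient
x_received = np.zeros(d)
x_received[[1, 7, 20]] = rng.standard_normal(3)
x_local_prev = rng.standard_normal(d)
grad = rng.standard_normal(d)
x_new = sparsified_worker_update(x_local_prev, x_received, grad, gamma=0.1)
print(np.count_nonzero(x_new - x_local_prev))    # at most |supp| + n_extra coordinates changed
```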


Communication-efficient proximal methods

Choice of selection: random vs. adaptive

Random: take a fixed number of entries at random (e.g. 200).
Adaptive: take the mask of the support + some entries at random.

ℓ₁-regularized logistic regression: 10 machines in a cluster, madelon dataset (2000 × 500) evenly distributed; comparison of the asynchronous prox-grad vs. 3 sparsified variants.

[Figure: suboptimality vs. quantity of data exchanged, for the full update, 200 random coordinates, mask + 100 random coordinates, and mask + sizeof(mask) random coordinates]

There is a tradeoff between sparsification (less communication) and identification (faster convergence).

Taking twice the size of the support works well ≃ automatic dimension reduction.


Conclusions

Take-home message

Distributed optimization is a hot topic: work to adapt our algorithms to recent distributed computing systems, with practical deployment + theoretical work (to design/analyze algorithms).

Key ideas
– our algorithms are "flexible": independent from the computing systems, with no assumptions on the delays!
– we exploit the identification of proximal algorithms to automatically reduce the dimension and the size of communications

Thanks!!


Talk based on joint work

Mirror-stratifiable functions
– J. Fadili, J. Malick, G. Peyre, "Sensitivity Analysis for Mirror-Stratifiable Convex Functions", SIAM Journal on Optimization, 2018.

Asynchronous level bundle algorithms
– F. Iutzeler, J. Malick, W. de Oliveira, "Asynchronous level bundle methods", in revision in Mathematical Programming, 2019.

Flexible distributed proximal algorithms
– K. Mishchenko, F. Iutzeler, J. Malick, "A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning", ICML, 2018.
– M. Grishchenko, F. Iutzeler, J. Malick, "Subspace Descent Methods with Identification-Adapted Sampling", submitted to Mathematics of OR, 2019.