
Generalized/Global Abs-Linear Learning (GALL)

Andreas Griewank and Angel Rojas

Humboldt University (Berlin) and Yachay Tech (Imbabura)

14.12.19, NeurIPS Vancouver

A. Griewank, A. Rojas (HU/Yachay Tech) GALL 14.12.19, NeurIPS Vancouver 1 / 25


Outline

1 From Heavy to Savvy Ball search trajectory

2 Results in convex, homogeneous and prox-linear case

3 Successive Piecewise Linearization

4 Mixed Binary Linear Optimization

5 Generalized Abs-Linear Learning

6 Summary, Conclusions and Outlook


Folklore and Common Expectations in ML

1 Nonsmoothness can be ignored except for step size choice.

2 Stochastic (mini-batch) sampling hides all the problems.

3 Higher dimensions make local minimizers less likely.

4 Difficulty is getting away from saddle points, not minimizers.

5 Precise location of the (almost) global minimizer is unimportant.

6 Network architecture and stepsize selection can be tweaked.

7 Convergence proofs hold only under "unrealistic assumptions".


Generalized Gradient Concepts

Notational Zoo (Subspecies in Euclidean and Lipschitzian Habitat):

Fréchet derivative: ∇ϕ(x) ≡ ∂ϕ(x)/∂x : D → R^n ∪ {∅}

Limiting gradient: ∂_L ϕ(x) ≡ lim_{x̃→x} ∇ϕ(x̃) : D ⇒ R^n

Clarke gradient: ∂ϕ(x) ≡ conv(∂_L ϕ(x)) : D ⇒ R^n

Bouligand derivative: ϕ′(x; Δx) ≡ lim_{t↘0} [ϕ(x + tΔx) − ϕ(x)]/t : D × R^n → R, i.e. D → PLh(R^n)

Piecewise linearization (PL): Δϕ(x; Δx) : D × R^n → R, i.e. D → PL(R^n)

Moriarty effect due to Rademacher (C^{0,1} = W^{1,∞}): almost everywhere all concepts reduce to the Fréchet derivative, except PL!!
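As a concrete illustration of the zoo (our own toy example, not from the slides), take ϕ(x) = |x|: at the kink the limiting gradients are {−1, +1}, the Clarke gradient is their convex hull [−1, +1], and the one-sided Bouligand derivative ϕ′(0; Δx) = |Δx| is positive in both directions, so no Fréchet derivative exists there. A minimal numeric sketch:

```python
def phi(x):
    # The simplest nonsmooth Lipschitz example: phi(x) = |x|.
    return abs(x)

def bouligand(f, x, dx, t=1e-8):
    # One-sided (Bouligand) directional derivative, approximated with small t > 0.
    return (f(x + t * dx) - f(x)) / t

# At the kink x = 0 the Bouligand derivative phi'(0; dx) = |dx| is +1 in
# BOTH directions, unlike a linear gradient, which would flip its sign.
d_plus = bouligand(phi, 0.0, +1.0)
d_minus = bouligand(phi, 0.0, -1.0)
```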


Lurking in the background: Prof. Moriarty


Filippov solutions of generalized steepest descent inclusion

The convexity and outer semi-continuity of the sets ∂ϕ(x(t)) imply that

−ẋ(t) ∈ ∂ϕ(x(t)) from x(0) = x0 ∈ R^n

has (at least) one absolutely continuous Filippov solution trajectory x(t).

Heavy ball (Polyak, 1964)

−ẍ(t) ∈ ∂ϕ(x(t)) from x(0) = x0, −ẋ(0) ∈ ∂ϕ(x0).

Picks up speed/momentum going downhill and slows down going uphill.

Savvy ball (Griewank, 1981)

d/dt [ −ẋ(t) / (ϕ(x(t)) − c)^e ] ∈ e ∂ϕ(x(t)) / (ϕ(x(t)) − c)^{e+1} = ∂ [ −1 / (ϕ(x(t)) − c)^e ].

Can be rewritten as a first-order system of a differential equation and an inclusion satisfying the Filippov conditions, so an absolutely continuous (x(t), ẋ(t)) exists.


Integrated Form

v(t) = ẋ(t) / [ϕ(x(t)) − c]^e ∈ ẋ0 / [ϕ(x0) − c]^e − e ∫_0^t ∂ϕ(x(τ)) / [ϕ(x(τ)) − c]^{e+1} dτ.

Second-order Form

ẍ(t) ∈ −[ I − ẋ(t) ẋ(t)ᵀ / ‖ẋ(t)‖² ] [ e ∂ϕ(x(t)) ] / [ϕ(x(t)) − c] with ‖ẋ(0)‖ = 1.

Idea: adjustment of the current search direction ẋ(t) towards a negative gradient direction −∂ϕ(x(t)).

The closer the current function value ϕ(x(t)) is to the target level c, the more rapidly the direction is adjusted.

If ϕ is convex, the level set {x : ϕ(x) ≤ c} is nonempty, and e ≤ 1, then the trajectory reaches the level set.

On degree-(1/e) homogeneous objectives, local minimizers below c are accepted and local minimizers above the target level are passed by.
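The projection in the second-order form makes ẍ(t) orthogonal to ẋ(t), which is why the unit speed ‖ẋ(t)‖ = 1 is preserved along the trajectory. A small numeric sanity check (the random stand-ins for a gradient of ϕ and for the gap ϕ(x) − c are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
xdot = rng.normal(size=3)
xdot /= np.linalg.norm(xdot)          # ||xdot|| = 1, as required at t = 0
g = rng.normal(size=3)                # stand-in for a gradient of phi
gap = 2.0                             # stand-in for phi(x) - c > 0
e = 0.5

# Second-order form: xddot is -e*g/gap projected orthogonal to xdot.
P = np.eye(3) - np.outer(xdot, xdot) / np.dot(xdot, xdot)
xddot = -P @ (e * g) / gap

# Because xddot is orthogonal to xdot, d/dt ||xdot||^2 = 2 xdot.xddot = 0:
# the trajectory is traversed at constant unit speed.
orthogonal = abs(np.dot(xdot, xddot))
```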


From JOTA Vol. 34, No. 1, May 1981 (scanned excerpt of the original paper):

Fig. 1. Search trajectories with target c = 0 and sensitivity e ∈ {0.4, 0.5, 0.67} on the objective function f = (x1² + x2²)/200 + 1 − cos x1 cos(x2/√2). Initial point (40, −35). Global minimum at origin marked by +.

[...] gradient and explore the objective function more thoroughly. Simultaneously, the stepsize, which equals the length of the dashes, becomes smaller to ensure an accurate integration.

The behavior of the individual trajectories confirms in principle the results of Theorem 5.1(i) applied to the quadratic term u, with d being equal to 2. The combination

e = 1/2 = 1/d and c = 0 = f*

seems optimal, even though the corresponding trajectory converges to the global solution x* only from the initial point (40, −35), but not from (35, −30). In the latter case, as shown in Fig. 2, the trajectory is distracted


Fig. 2. Search trajectories with sensitivity e = 0.5 and target c ∈ {−0.4, 0, 0.4} on the objective function f = (x1² + x2²)/200 + 1 − cos x1 cos(x2/√2). Initial point (35, −30). Global minimum at origin marked by +.

from x* by a sequence of suboptimal minima and eventually diverges toward infinity. Trajectories with sensitivities larger than 0.5, like the one with e = 0.67 in Fig. 1, usually lack the penetration to reach x* and wander around endlessly, as they cannot escape the attraction of the quadratic term u. On the other hand, trajectories with sensitivities less than 0.5, like the one with e = 0.4 in Fig. 1, are likely to pass the global solution x* at some distance before diverging toward infinity. The same is true of trajectories with appropriate sensitivity e = 1/2 but an unattainable target, as we can see from the case c = −0.4 in Fig. 2. Trajectories whose target is attainable are likely to achieve their goal, like the one with c = 0.4 in Fig. 2, which attains its target close to the suboptimal minimum

x̂1 ≈ −π(1, √2)ᵀ,

with value

f(x̂1) ≈ 0.15

after passing through the neighborhood of two unacceptable minima

x̂2 ≈ −2π(1, √2)ᵀ and x̂3 ≈ −π(3, √2)ᵀ.


Closed-form solution on a prox-linear function

Lemma (A.G. 1977 & A.R. 2019). For ϕ(x) = b + gᵀx + (q/2)‖x‖², the arc equation

ẍ(t) = −[ I − ẋ(t) ẋ(t)ᵀ ] ∇ϕ(x(t)) / [ϕ(x(t)) − c]

yields momentum-like

x(t) = x0 + (sin(ωt)/ω) ẋ0 + ((1 − cos(ωt))/ω²) ẍ0 ≈ x0 + t ẋ0 − t² g / (2(ϕ0 − c))

and

ϕ(x(t)) = ϕ0 + [(g + q x0)ᵀ ẋ0] sin(ωt)/ω + [q − ω²(ϕ0 − c)] (1 − cos(ωt))/ω²

where

ẍ0 = −[ I − ẋ0 ẋ0ᵀ ] (g + q x0) / (ϕ0 − c) and ω = ‖ẍ0‖.
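The closed-form arc is easy to evaluate and to sanity-check: x(0) = x0 and, by differentiating, ẋ(0) = ẋ0. The sketch below (with illustrative data of our own choosing, and ẍ0 taken orthogonal to ẋ0 as the lemma's projection guarantees) confirms this by finite differences:

```python
import numpy as np

def trajectory(t, x0, xdot0, xddot0):
    # Closed-form arc on a prox-linear objective with omega = ||xddot0||:
    # x(t) = x0 + sin(wt)/w * xdot0 + (1 - cos(wt))/w^2 * xddot0.
    w = np.linalg.norm(xddot0)
    return x0 + (np.sin(w * t) / w) * xdot0 + ((1 - np.cos(w * t)) / w**2) * xddot0

# Illustrative data (our choice, not from the slides); xddot0 is
# orthogonal to xdot0, as produced by the projection in the lemma.
x0 = np.array([1.0, 0.0])
xdot0 = np.array([0.0, 1.0])
xddot0 = np.array([0.5, 0.0])

h = 1e-6
vel0 = (trajectory(h, x0, xdot0, xddot0) - trajectory(0.0, x0, xdot0, xddot0)) / h
```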


Piecewise-Linearization Approach

1 Every function ϕ(x) that is abs-normal, i.e. evaluated by a sequence of smooth elemental functions and piecewise linear elements like abs, min, max, can be approximated near a reference point x̊ by a piecewise-linear function Δϕ(x̊; Δx) s.t.

|ϕ(x̊ + Δx) − ϕ(x̊) − Δϕ(x̊; Δx)| ≤ (q/2)‖Δx‖²

2 The function y = Δϕ(x̊; x − x̊) can be represented in abs-linear form

z = d + Zx + Mz + L|z|
y = μ + aᵀx + bᵀz + cᵀ|z|

where M and L are strictly lower triangular matrices s.t. z = z(x).

3 [d, Z, M, L, μ, a, b, c] can be generated automatically by Algorithmic Piecewise Differentiation, which allows the computational handling of Δϕ in and between the polyhedra

Pσ = closure{x ∈ R^n : sgn(z(x)) = σ} for σ ∈ {−1, +1}^s
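Because M and L are strictly lower triangular, z_i depends only on z_1, ..., z_{i-1}, so the abs-linear form can be evaluated by a single forward sweep. A minimal sketch (the encoding of max(x1, x2) via one switching variable is our own tiny example):

```python
import numpy as np

def eval_abs_linear(x, d, Z, M, L, mu, a, b, c):
    """Evaluate an abs-linear form
        z = d + Z x + M z + L|z|,   y = mu + a.x + b.z + c.|z|.
    Strict lower triangularity of M and L resolves the recurrence row by row.
    """
    s = len(d)
    z = np.zeros(s)
    for i in range(s):
        z[i] = d[i] + Z[i] @ x + M[i] @ z + L[i] @ np.abs(z)
    y = mu + a @ x + b @ z + c @ np.abs(z)
    return z, y

# Tiny example: encode y = max(x1, x2) = x1 + (z + |z|)/2 with the single
# switching variable z = x2 - x1.
x = np.array([0.3, 0.7])
z, y = eval_abs_linear(
    x,
    d=np.zeros(1), Z=np.array([[-1.0, 1.0]]),
    M=np.zeros((1, 1)), L=np.zeros((1, 1)),
    mu=0.0, a=np.array([1.0, 0.0]),
    b=np.array([0.5]), c=np.array([0.5]),
)
```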


Figure (a): Tangent mode linearization, showing the graphs of F(x) and its piecewise linearization F♦ near the reference point.


SALMIN defined by the iteration

x_{k+1} = x_k + arglocmin_{Δx} { Δϕ(x_k; Δx) + (q_k/2)‖Δx‖² }   (1)

where q_k > 0 is adjusted such that eventually q_k ≥ q in the region of interest. It has cluster points x∗ that are first order minimal (FOM), i.e.

Δϕ(x∗; Δx) ≥ 0 for Δx ≈ 0.

Drawback: requires computation and factorization of active Jacobians.

Coordinate Global Descent (CGD)

f(w; x) is PL w.r.t. x, but ϕ(w) is only multi-piecewise linear w.r.t. w, i.e.

ϕ(x + t e_j) ≡ ϕ(x) + Δϕ(x; t e_j) for t ∈ R.

Along any such coordinate bi-direction we can perform a global univariate minimization efficiently. Cluster points x∗ of such alternating coordinate searches seem not even to be Clarke stationary, i.e. 0 ∈ ∂ϕ(x∗) may fail.
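The global univariate minimization exploits that a continuous univariate PL function attains its global minimum at a kink, unless an unbounded end piece slopes downward. A hedged sketch (the function `global_min_univariate_pl` and the sample kink data are our own illustration, not the authors' implementation):

```python
import numpy as np

def global_min_univariate_pl(breakpoints, values, slope_left, slope_right):
    """Global minimum of a continuous univariate piecewise linear function,
    given its kink locations/values and the slopes of the two unbounded end
    pieces. On a bounded piece the minimum sits at an endpoint, so only the
    kinks (and divergence to -inf at the ends) need checking.
    """
    if slope_left > 0 or slope_right < 0:
        return None, -np.inf          # unbounded below along this line
    i = int(np.argmin(values))
    return breakpoints[i], values[i]

# Example: kinks of a nonconvex PL function along one coordinate direction.
t, v = global_min_univariate_pl(
    breakpoints=[-2.0, 0.0, 1.0, 3.0],
    values=[4.0, 1.0, 2.0, 0.5],
    slope_left=-1.5,                  # left end piece rises toward -infinity
    slope_right=2.0,                  # right end piece rises toward +infinity
)
```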


Figure 1: Decimal digits gained by 4 methods on a single-layer regression problem.


SALGO-SAVVY algorithm

1 Form the piecewise linearization Δϕ of the objective ϕ at the current iterate x̊, estimate the proximal coefficient q, and set x0 = x̊.

2 Select the initial tangent ẋ0 and σ = sgn(z(x0)).

3 Compute and follow the circular segment x(t) in Pσ.

4 Determine the minimal t∗ where ϕ(x(t∗)) = c or x∗ = x(t∗) lies on the boundary of Pσ with some neighboring Pσ̃.

5 If ϕ(x∗) ≤ c then lower c and go to step (2), // restart inner loop
xor go to step (1) with x̊ = x∗ and adjusted q, // continue outer loop
xor terminate the optimization if the user is "happy" or resources are exceeded.

6 Else, set x0 = x∗, ẋ0 = ẋ(t∗), σ = σ̃ and continue with step (3).

Many other heuristic strategies for retargeting and restarting are possible!


Savvy Ball Path

Figure 2: The reached value is 0.591576, whereas the target level 0.519984 is unreachable.


SAVVY on MNIST, n = 784, m = 10, d = 60000

The resulting accuracy of the one-layer model with smooth-max activation and cross-entropy loss on the test set of 10000 images is the "optimal" 92%.


Mixed Binary Linear Optimization

Consider a piecewise linear optimization problem in abs-linear form

min a^T x + b^T z + c^T Σz  s.t.  z = Zx + Mz + LΣz  and  Σz ≥ 0,

where σ ∈ {−1, 1}^s and Σ = diag(σ) are the binary variables. This MIBLOP can be reformulated as a MILOP, provided ‖z‖∞ ≤ γ, yielding

min_{x, z, h, σ} ( a^T x + b^T z + c^T h + (q/2)‖x‖² )   (2)

s.t. z = Zx + Mz + Lh,

−h ≤ z ≤ h and h + γ(σ − e) ≤ z ≤ −h + γ(σ + e).

Quote by Fischetti and Jo (2018)

"Deep Neural Networks as 0-1 Mixed Integer Linear Programs: A Feasibility Study": PL models are unfortunately not suited for training.
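The role of the bound γ in (2) can be seen elementwise: with h standing in for |z| and σ = sgn(z), the pair −h ≤ z ≤ h forces h ≥ |z|, while the σ-coupled pair forces h ≤ |z| on the side selected by σ, so together h = |z| exactly. A small numeric check (the sample z and γ are our own choices):

```python
import numpy as np

gamma = 10.0                          # assumed bound with ||z||_inf <= gamma
z = np.array([3.0, -7.5, 0.0])
h = np.abs(z)                         # the linearized |z|
sigma = np.where(z >= 0, 1.0, -1.0)   # binary sign variables
e = np.ones_like(z)

# All four inequality blocks of the MILOP reformulation hold at this point.
ok = bool(
    np.all(-h <= z) and np.all(z <= h)
    and np.all(h + gamma * (sigma - e) <= z)
    and np.all(z <= -h + gamma * (sigma + e))
)
```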


Prediction by PL functions in ANF

For x ∈ R^n ↦ y ∈ R^m:

Continuous PL function ⇐⇒ hinged NN ⇐⇒ abs-linear form.

Number of layers ℓ ≥ ν = switching depth = nilpotency of (I − M)^{-1} L.

z = c + Zx + Mz + L|z| ∈ R^s
y = b + Jx + Nz ∈ R^m

1 where M, L ∈ R^{s×s} are strictly lower triangular to yield z = z(x).

2 ≡ NN if M and L are block bidiagonal; other sparsity patterns are possible.

3 Note that max(u, w) = u + (z + |z|)/2 with z = w − u.

4 ALFs with ν ≤ ν̄ form an infinite-dimensional linear subspace of C^{0,1}(R^n).

5 ALFs can be successively abs-linearized with respect to w = [c, Z, M, L, b, J, N] for learning = fitting.
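Item 3 is the mechanism that turns max/hinge activations into abs switching variables; the identity is easy to verify directly (the helper name `hinged_max` is ours):

```python
def hinged_max(u, w):
    # max(u, w) = u + (z + |z|)/2 with z = w - u: one abs switching
    # variable replaces the max, which is how hinged NNs map to ALFs.
    z = w - u
    return u + (z + abs(z)) / 2

vals = [hinged_max(u, w) for u, w in [(1.0, 3.0), (3.0, 1.0), (-2.0, -2.0)]]
```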


Structured Piecewise linearization (PL) w.r.t. weight vector

Given a reference point w̄ = [c̄, Z̄, M̄, L̄, b̄, J̄, N̄] we have the Taylor-like expansion

z̃ = z̄ + Δz(w̄; w − w̄) for fixed x,

where z̃ can be calculated directly from the abs-linear form

z̃ = [c + Zx + ΔM z̄ + ΔL |z̄|] + M̄ z̃ + L̄ |z̃|

with ΔM = M − M̄ and ΔL = L − L̄. The discrepancy is bounded by

‖z − z̃‖∞ ≤ (q/2) ‖[ΔM, ΔL]‖²_F.

An explicit upper bound on q can be given but seems too conservative.

Reverse mode AD ≡ back propagation yields, at a cost of about 2 OPS(z̃), the gradient

∂(z̃ᵀz̃)/∂[c, Z, ΔM, ΔL, b, J, N].


Objective for successive linearizations for model sizes 3, 4, 5

Figure: empirical risk (roughly 0.11 to 0.16) versus number of models (1 to 5 successive linearizations) for model sizes s = 3, 4, 5.


Simplex Iterations by Gurobi

Regression on Griewank in 2 dimensions, 50 training data, 8 testing data over 5 successive piecewise linearizations.

 s  #w  var.        1         2         3         4         5
 3  21   471   303810    353703   1716277    581060    681025
 4  31   631  1129639    263007   1015447   1339147   1068608
 5  43   793  1153345  22793377  22895320  21241422  16513124

For s = 5 there were 250 equality and 1000 inequality constraints, both linear.

Conclusion:

Nice try – but !!!

Question:

Are we overlooking any structure that could/should be exploited?


Potential contributions

1 SALMIN generates cluster points that are first order minimal.

2 Analytically savvy ball reaches target level in convex case.

3 Savvy ball can climb away from undesirable local minimizers.

4 Successive PL allows exact integration of the Savvy Ball and application of Mixed Binary Linear Optimization (Gurobi).

5 Though costly, MIBLOP may provide reference solutions.

6 Stepsize chosen automatically via kinks and angle bound.

7 Abs-Linear-Learning generalizes hinged Neural Nets.


Improvements and Developments

1 Refine targeting and restarting strategy for SB.

2 Matrix based implementation for HPC with GPU.

3 Exploitation of low-rank updates in polyhedral transition.

4 Mini-batch version in stochastic gradient fashion.

5 Check global optimality of MIBLOP cluster points.

6 Piecewise linearize the "loss" function (e.g. sparsemax).

7 Adaptively enforce sparsity in Abs-Linear-Learning.


Thank you very much for your attention!!
