Generalized/Global Abs-Linear Learning (GALL)
Andreas Griewank and Angel Rojas
Humboldt University (Berlin) and Yachay Tech (Imbabura)
14.12.19, NeurIPS Vancouver
A. Griewank, A. Rojas (HU/Yachay Tech) GALL 14.12.19, NeurIPS Vancouver 1 / 25
Outline
1 From Heavy to Savvy Ball search trajectory
2 Results in convex, homogeneous and prox-linear case
3 Successive Piecewise Linearization
4 Mixed Binary Linear Optimization
5 Generalized Abs-Linear Learning
6 Summary, Conclusions and Outlook
Folklore and Common Expectations in ML
1 Nonsmoothness can be ignored except for step size choice.
2 Stochastic (mini-batch) sampling hides all the problems.
3 Higher dimensions make local minimizers less likely.
4 Difficulty is getting away from saddle points not minimizers.
5 Precise location of (almost) global minimizer unimportant.
6 Network architecture and stepsize selection can be tweaked.
7 Convergence proofs hold only under "unrealistic assumptions".
Generalized Gradient Concepts
Notational Zoo (Subspecies in the Euclidean and Lipschitzian Habitat):

Fréchet Derivative: ∇ϕ(x) ≡ ∂ϕ(x)/∂x : D → Rⁿ ∪ ∅
Limiting Gradient: ∂_Lϕ(x) ≡ lim_{x̃→x} ∇ϕ(x̃) : D ⇒ Rⁿ
Clarke Gradient: ∂ϕ(x) ≡ conv(∂_Lϕ(x)) : D ⇒ Rⁿ
Bouligand Derivative: ϕ′(x; ∆x) ≡ lim_{t↘0} [ϕ(x + t∆x) − ϕ(x)]/t : D × Rⁿ → R, i.e. D → PLh(Rⁿ)
Piecewise Linearization (PL): ∆ϕ(x; ∆x) : D × Rⁿ → R, i.e. D → PL(Rⁿ)

Moriarty Effect due to Rademacher (C^{0,1} = W^{1,∞}):
Almost everywhere all concepts reduce to the Fréchet derivative, except PL!
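The Moriarty effect can be seen in a tiny numerical experiment (our illustration, not from the slides): away from the kink of ϕ(x) = |x| every concept above collapses to the Fréchet derivative ±1, while exactly at the kink a naive central difference returns 0, the midpoint of the Clarke interval [−1, 1], and misses the limiting gradients {−1, +1} entirely.

```python
def central_diff(phi, x, eps=1e-8):
    """Naive numerical gradient: it hides kinks instead of reporting them."""
    return (phi(x + eps) - phi(x - eps)) / (2 * eps)

# phi(x) = |x| is Frechet-differentiable everywhere except x = 0
print(central_diff(abs, 0.3))   # +1.0: the Frechet derivative
print(central_diff(abs, -0.3))  # -1.0
print(central_diff(abs, 0.0))   # 0.0: neither limiting gradient in {-1, +1},
                                # but the midpoint of the Clarke interval [-1, 1]
```

This is one reason finite differencing (or sampling) gives no hint that a kink is present, whereas the PL model ∆ϕ retains it.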
Lurking in the background: Prof. Moriarty
Filippov solutions of generalized steepest descent inclusion
The convexity and outer semi-continuity of the subsets ∂ϕ(x(t)) imply that

−ẋ(t) ∈ ∂ϕ(x(t)) from x(0) = x₀ ∈ Rⁿ

has (at least) one absolutely continuous Filippov solution trajectory x(t).
Heavy ball (Polyak, 1964)
−ẍ(t) ∈ ∂ϕ(x(t)) from x(0) = x₀, −ẋ(0) ∈ ∂ϕ(x₀).
Picks up speed/momentum going downhill and slows down going uphill.
Savvy ball (Griewank, 1981)
d/dt [ −ẋ(t)/(ϕ(x(t)) − c)^e ] ∈ e ∂ϕ(x(t))/(ϕ(x(t)) − c)^(e+1) = ∂[ −1/(ϕ(x(t)) − c)^e ].

Can be rewritten as a first order system of a differential equation and an inclusion satisfying the Filippov conditions ⟹ an absolutely continuous (x(t), ẋ(t)) exists.
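As a sketch of that first order reformulation (our illustration on the assumed test problem ϕ(x) = ‖x‖²/2 with target c = 0 and sensitivity e = ½, not the authors' code), one can integrate the pair (x, v) with v = ẋ/(ϕ(x) − c)^e by forward Euler:

```python
import numpy as np

def savvy_ball(x0, phi, grad, c=0.0, e=0.5, h=1e-3, steps=20000, tol=1e-5):
    """Forward Euler on the first order savvy ball system
       x' = v * (phi(x) - c)^e,  v' = -e * grad(x) / (phi(x) - c)^(e+1)."""
    d0 = -grad(x0)
    d0 = d0 / np.linalg.norm(d0)       # unit initial tangent, ||x'(0)|| = 1
    x = np.asarray(x0, dtype=float)
    v = d0 / (phi(x) - c) ** e         # transformed velocity
    for _ in range(steps):
        gap = phi(x) - c
        if gap <= tol:                 # target level set (numerically) reached
            break
        x = x + h * v * gap ** e
        v = v - h * e * grad(x) / gap ** (e + 1)
    return x

phi = lambda x: 0.5 * np.dot(x, x)     # homogeneous of degree d = 2
grad = lambda x: x
x_end = savvy_ball(np.array([3.0, -2.0]), phi, grad)
```

Since ϕ is homogeneous of degree d = 2 and e = 1/d, the trajectory approaches the origin at asymptotically constant speed; the tolerance check stops the loop before the factor (ϕ(x) − c)^(e+1) makes the discretization stiff.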
Integrated Form
v(t) = ẋ(t)/[ϕ(x(t)) − c]^e ∈ ẋ(0)/[ϕ(x₀) − c]^e − e ∫₀ᵗ ∂ϕ(x(τ))/[ϕ(x(τ)) − c]^(e+1) dτ.
Second order Form
ẍ(t) ∈ −[I − ẋ(t)ẋ(t)ᵀ/‖ẋ(t)‖²] e ∂ϕ(x(t))/[ϕ(x(t)) − c] with ‖ẋ(0)‖ = 1.

Idea: adjustment of the current search direction ẋ(t) towards a negative gradient direction −∂ϕ(x(t)).
The closer the current function value ϕ(x(t)) is to the target level c, the more rapidly the direction is adjusted.
If ϕ is convex, the target is attainable (ϕ(x) ≤ c for some x) and e ≤ 1, the trajectory reaches the level set.
On degree 1/e homogeneous objectives, local minimizers below c are accepted and local minimizers above the target level are passed by.
JOTA, Vol. 34, No. 1, May 1981, p. 33:

Fig. 1. Search trajectories with target c = 0 and sensitivity e ∈ {0.4, 0.5, 0.67} on the objective function f = (x₁² + x₂²)/200 + 1 − cos x₁ cos(x₂/√2). Initial point (40, −35). Global minimum at origin marked by +.

"... gradient and explore the objective function more thoroughly. Simultaneously, the stepsize, which equals the length of the dashes, becomes smaller to ensure an accurate integration.
The behavior of the individual trajectories confirms in principle the results of Theorem 5.1(i) applied to the quadratic term u, with d being equal to 2. The combination e = ½ = 1/d and c = 0 = f* seems optimal, even though the corresponding trajectory converges to the global solution x* only from the initial point (40, −35), but not from (35, −30). In the latter case, as shown in Fig. 2, the trajectory is distracted
JOTA, Vol. 34, No. 1, May 1981, p. 34:

Fig. 2. Search trajectories with sensitivity e = 0.5 and target c ∈ {−0.4, 0, 0.4} on the objective function f = (x₁² + x₂²)/200 + 1 − cos x₁ cos(x₂/√2). Initial point (35, −30). Global minimum at origin marked by +.

from x* by a sequence of suboptimal minima and eventually diverges toward infinity. Trajectories with sensitivities larger than 0.5, like the one with e = 0.67 in Fig. 1, usually lack the penetration to reach x* and wander around endlessly, as they cannot escape the attraction of the quadratic term u. On the other hand, trajectories with sensitivities less than 0.5, like the one with e = 0.4 in Fig. 1, are likely to pass the global solution x* at some distance before diverging toward infinity. The same is true of trajectories with the appropriate sensitivity e = ½ but an unattainable target, as we can see from the case c = −0.4 in Fig. 2. Trajectories whose target is attainable are likely to achieve their goal, like the one with c = 0.4 in Fig. 2, which attains its target close to the suboptimal minimum

x̂₁ ≈ −π(1, √2)ᵀ,

with value f(x̂₁) ≈ 0.15, after passing through the neighborhood of two unacceptable minima

x̂₂ ≈ −2π(1, √2)ᵀ and x̂₃ ≈ −π(3, √2)ᵀ."
Closed form solution on prox-linear function
Lemma (A.G. 1977 & A.R. 2019). For ϕ(x) = b + gᵀx + (q/2)‖x‖², the savvy ball equation

ẍ(t) = −[I − ẋ(t)ẋ(t)ᵀ] ∇ϕ(x(t))/[ϕ(x(t)) − c]

yields momentum like

x(t) = x₀ + (sin(ωt)/ω) ẋ₀ + ((1 − cos(ωt))/ω²) ẍ₀ ≈ x₀ + t ẋ₀ − t² g/(2(ϕ₀ − c))

and

ϕ(x(t)) = ϕ₀ + [(g + qx₀)ᵀẋ₀] sin(ωt)/ω + [q − ω²(ϕ₀ − c)] (1 − cos(ωt))/ω²

where

ẍ₀ = −[I − ẋ₀ẋ₀ᵀ](g + qx₀)/(ϕ₀ − c) and ω = ‖ẍ₀‖.
Piecewise-Linearization Approach
1 Every function ϕ(x) that is abs-normal, i.e. evaluated by a sequence of smooth elemental functions and piecewise linear elements like abs, min, max, can be approximated near a reference point x̊ by a piecewise-linear function ∆ϕ(x̊; ∆x) s.t.

|ϕ(x̊ + ∆x) − ϕ(x̊) − ∆ϕ(x̊; ∆x)| ≤ (q/2)‖∆x‖²

2 The function y = ∆ϕ(x̊; x − x̊) can be represented in Abs-Linear form

z = d + Zx + Mz + L|z|
y = µ + aᵀx + bᵀz + cᵀ|z|

where M and L are strictly lower triangular matrices s.t. z = z(x).

3 [d, Z, M, L, µ, a, b, c] can be generated automatically by Algorithmic Piecewise Differentiation, which allows the computational handling of ∆ϕ in and between the polyhedra

Pσ = closure{x ∈ Rⁿ : sgn(z(x)) = σ} for σ ∈ {−1, +1}^s
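Because M and L are strictly lower triangular, z = z(x) follows from a single forward sweep. A minimal sketch (hypothetical data; the example uses the identity max(u, w) = u + (z + |z|)/2 with the single switch z = w − u):

```python
import numpy as np

def eval_abs_linear(x, d, Z, M, L, mu, a, b, c):
    """One forward sweep through z = d + Z x + M z + L|z|,
       then y = mu + a^T x + b^T z + c^T |z|."""
    s = len(d)
    z = np.zeros(s)
    for i in range(s):          # row i of M, L touches only z_j with j < i
        z[i] = d[i] + Z[i] @ x + M[i] @ z + L[i] @ np.abs(z)
    y = mu + a @ x + b @ z + c @ np.abs(z)
    return y, np.sign(z)        # value and signature sigma = sgn(z(x))

# y = max(x1, x2) = x1 + (z + |z|)/2 with the single switch z = x2 - x1
d = np.array([0.0]);  Z = np.array([[-1.0, 1.0]])
M = np.zeros((1, 1)); L = np.zeros((1, 1))
mu = 0.0
a = np.array([1.0, 0.0]); b = np.array([0.5]); c = np.array([0.5])

y, sigma = eval_abs_linear(np.array([3.0, 5.0]), d, Z, M, L, mu, a, b, c)
# y = 5.0 = max(3, 5), sigma = [+1], i.e. x lies in the polyhedron P_{+1}
```

The returned signature σ identifies the polyhedron Pσ containing x, which is exactly what the polyhedral handling of ∆ϕ needs.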
(Figure: the function F and its tangent mode piecewise linearization F♦ at the reference point x̊. (a) Tangent mode linearization.)
SALMIN defined by iteration
x_{k+1} = arglocmin_{∆x} { ∆ϕ(xₖ; ∆x) + (qₖ/2)‖∆x‖² } (1)

where qₖ > 0 is adjusted such that eventually qₖ ≥ q in the region of interest. Has cluster points x∗ that are first order minimal (FOM), i.e.

∆ϕ(x∗; ∆x) ≥ 0 for ∆x ≈ 0.
Drawback: Requires computation and factorization of active Jacobians.
Coordinate Global Descent CGD
f(w; x) is PL w.r.t. x but ϕ(w) is only multi-piecewise linear w.r.t. w, i.e.

ϕ(x + t eⱼ) ≡ ϕ(x) + ∆ϕ(x; t eⱼ) for t ∈ R.

Along any such coordinate bi-direction we can perform a global univariate minimization efficiently. Cluster points x∗ of such alternating coordinate searches seem not even to be Clarke stationary, i.e. 0 ∈ ∂ϕ(x∗) may fail.
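The mechanism can be sketched as follows (our illustrative objective ϕ(x) = Σᵢ |Aᵢx − bᵢ|, not the talk's test problem): along any coordinate direction eⱼ the restriction is a convex PL function, so its global univariate minimizer lies at one of finitely many kinks, which can simply be enumerated.

```python
import numpy as np

def coordinate_global_descent(A, b, x, sweeps=10):
    """Exact global line search along each coordinate of
       phi(x) = sum_i |A[i] @ x - b[i]|, a convex PL objective."""
    for _ in range(sweeps):
        for j in range(A.shape[1]):
            r = A @ x - b                       # current residuals
            col = A[:, j]
            active = col != 0
            if not active.any():
                continue
            kinks = -r[active] / col[active]    # kink locations along e_j
            # the global univariate minimum of a convex PL function is at a kink
            vals = [np.abs(r + t * col).sum() for t in kinks]
            x[j] += kinks[int(np.argmin(vals))]
    return x

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])
x = coordinate_global_descent(A, b, np.zeros(2))
# on this instance the cluster point attains the global minimum value 1,
# but in general such cluster points need not be Clarke stationary
```

Each inner step is a global minimization of one univariate PL function, yet the alternating scheme as a whole carries no stationarity guarantee, matching the caveat above.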
Figure 1: Decimal digits gained by 4 methods on a single layer regression problem.
SALGO-SAVVY algorithm
1 Form the piecewise linearization ∆ϕ of the objective ϕ at the current iterate x̊, estimate the proximal coefficient q, and set x₀ = x̊.
2 Select the initial tangent ẋ₀ and σ = sgn(z(x₀)).
3 Compute and follow the circular segment x(t) in Pσ.
4 Determine the minimal t∗ where ϕ(x(t∗)) = c or x∗ = x(t∗) lies on the boundary of Pσ with some neighboring Pσ̃.
5 If ϕ(x∗) ≤ c then lower c and goto step (2), // restart inner loop
xor goto step (1) with x̊ = x∗ and adjusted q, // continue outer loop
xor terminate the optimization if the user is "happy" or resources are exceeded.
6 Else, set x₀ = x∗, ẋ₀ = ẋ(t∗), σ = σ̃ and continue with step (3).
Many other heuristic strategies for retargeting and restarting are possible!
Savvy Ball Path
Figure 2: Reached value 0.591576, whereas the target level 0.519984 was unreachable.
SAVVY on MNIST, n = 784,m = 10, d = 60000
Resulting accuracy of a one layer model with smooth-max activation and cross entropy loss on the test set of 10000 images is the "optimal" 92%.
Mixed Binary Linear Optimization
Consider a piecewise linear optimization problem in Abs-Linear-Form
Min aᵀx + bᵀz + cᵀΣz s.t. z = Zx + Mz + LΣz and Σz ≥ 0

where σ ∈ {−1, 1}^s and Σ = diag(σ) are binary variables. This (MIBLOP) can be reformulated as a (MILOP), provided ‖z‖∞ ≤ γ, yielding

min_{x,z,h,σ} ( aᵀx + bᵀz + cᵀh + (q/2)‖x‖² )

s.t. z = Zx + Mz + Lh, (2)

−h ≤ z ≤ h and h + γ(σ − e) ≤ z ≤ −h + γ(σ + e),

where e denotes the vector of ones.
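The sign coupling in the last constraint pair works componentwise: for σᵢ = +1 the lower bound becomes hᵢ ≤ zᵢ, which together with zᵢ ≤ hᵢ forces hᵢ = zᵢ ≥ 0, while for σᵢ = −1 the upper bound becomes zᵢ ≤ −hᵢ, forcing hᵢ = −zᵢ. A quick sketch checking this logic (our illustration, with assumed values):

```python
import numpy as np

def feasible(z, h, sigma, gamma):
    """Check the MILOP linearization constraints
       -h <= z <= h  and  h + gamma*(sigma - e) <= z <= -h + gamma*(sigma + e)."""
    e = np.ones_like(z)
    return (np.all(-h <= z) and np.all(z <= h)
            and np.all(h + gamma * (sigma - e) <= z)
            and np.all(z <= -h + gamma * (sigma + e)))

z = np.array([2.0, -3.0, 0.5])
gamma = 10.0                                           # valid since ||z||_inf <= gamma
assert feasible(z, np.abs(z), np.sign(z), gamma)       # h = |z|, sigma = sgn(z): feasible
assert not feasible(z, np.abs(z), -np.sign(z), gamma)  # flipped signs are cut off
```

So h plays the role of |z| and σ the role of sgn(z), which removes the nonsmooth Σz from the model at the cost of s binary variables.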
Quote by Fischetti and Jo (2018)
"Deep Neural Networks as 0-1 Mixed Integer Linear Programs: A Feasibility Study": PL models are unfortunately not suited for training.
Prediction by PL functions in ANF
For x ∈ Rⁿ → y ∈ Rᵐ:

Continuous PL function ⟺ Hinged NN ⟺ Abs-Linear Form.

Number of Layers ℓ ≥ ν = Switching Depth = nilpotency degree of (I − M)⁻¹L.

z = c + Zx + Mz + L|z| ∈ Rˢ
y = b + Jx + Nz ∈ Rᵐ

1 where M, L ∈ Rˢˣˢ are strictly lower triangular to yield z = z(x).
2 ≡ NN if M ≡ L are block bidiagonal; other sparsity patterns are possible.
3 note that max(u, w) = u + (z + |z|)/2 with z = w − u.
4 ALFs with ν ≤ ν̄ form an infinite dimensional linear subspace of C^{0,1}(Rⁿ).
5 ALFs can be successively abs-linearized with respect to w = [c, Z, M, L, b, J, N] for learning = fitting.
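As a concrete instance of items 1-3, a single ReLU unit y = v · relu(wᵀx + b₁) fits this form with two switching variables, z₁ = wᵀx + b₁ and z₂ = |z₁|, since relu(t) = (t + |t|)/2 (a toy network of our own construction, not from the slides):

```python
import numpy as np

def eval_anf(x, c, Z, M, L, b, J, N):
    """Evaluate z = c + Z x + M z + L|z|, y = b + J x + N z by a forward sweep."""
    s = len(c)
    z = np.zeros(s)
    for i in range(s):                 # M, L strictly lower triangular
        z[i] = c[i] + Z[i] @ x + M[i] @ z + L[i] @ np.abs(z)
    return b + J @ x + N @ z

# one ReLU unit y = v * relu(w^T x + b1), encoded with z1 = w^T x + b1, z2 = |z1|
w = np.array([2.0, -1.0]); b1 = 0.5; v = 3.0
c = np.array([b1, 0.0])
Z = np.array([[2.0, -1.0], [0.0, 0.0]])
M = np.zeros((2, 2))
L = np.array([[0.0, 0.0], [1.0, 0.0]])   # z2 = |z1|
b = np.array([0.0]); J = np.zeros((1, 2))
N = np.array([[v / 2, v / 2]])           # y = v * (z1 + |z1|) / 2

y = eval_anf(np.array([1.0, 1.0]), c, Z, M, L, b, J, N)
# direct check: relu(2 - 1 + 0.5) = 1.5, so y = 3 * 1.5 = 4.5
```

Here M = 0, so (I − M)⁻¹L = L with L² = 0: switching depth ν = 2, matching a network with one hidden layer of hinges.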
Structured Piecewise linearization (PL) w.r.t. weight vector
Given a reference point ẘ = [c̊, Z̊, M̊, L̊, b̊, J̊, N̊] we have the Taylor-like expansion

z̃ = z̊ + ∆z(ẘ; w − ẘ) for x fixed,

where z̃ can be calculated directly from the Abs-Linear Form

z̃ = [c + Zx + ∆M z̊ + ∆L|z̊|] + M̊ z̃ + L̊|z̃|

with ∆M = M − M̊ and ∆L = L − L̊. The discrepancy is bounded by

‖z − z̃‖∞ ≤ (q/2) ‖[∆M, ∆L]‖²_F.

An explicit upper bound on q can be given but seems too conservative.

Reverse Mode AD ≡ Back Propagation yields, at a cost of about 2 OPS(z), the adjoints

[c̄, Z̄, M̄, L̄, b̄, J̄, N̄] ≡ ∂(z̄ᵀz)/∂[c, Z, ∆M, ∆L, b, J, N].
Objective for successive linearizations for model sizes s = 3, 4, 5

(Figure: empirical risk, ranging from 0.11 to 0.16, plotted over the number of models for sizes s = 3, 4, 5.)
Simplex Iterations by Gurobi
Regression on the Griewank function in 2 dimensions, 50 training data, 8 testing data, over 5 successive piecewise linearizations.

s  #w  var.        1         2         3         4         5
3  21  471    303810    353703   1716277    581060    681025
4  31  631   1129639    263007   1015447   1339147   1068608
5  43  793   1153345  22793377  22895320  21241422  16513124

For s = 5 there were 250 equality and 1000 inequality constraints, both linear.
Conclusion:
Nice try – but !!!
Question:
Are we overlooking any structure that could/should be exploited?
Potential contributions
1 SALMIN generates cluster points that are first order minimal.
2 Analytically savvy ball reaches target level in convex case.
3 Savvy ball can climb away from undesirable local minimizers.
4 Successive PL allows exact integration of Savvy Ball andapplication of Mixed Binary Linear Optimization (Gurobi).
5 Though costly, MIBLOP may provide reference solutions.
6 Stepsize chosen automatically via kinks and angle bound.
7 Abs-Linear-Learning generalizes hinged Neural Nets.
Improvements and Developments
1 Refine targeting and restarting strategy for SB.
2 Matrix based implementation for HPC with GPU.
3 Exploitation of low-rank updates in polyhedral transition.
4 Mini-batch version in stochastic gradient fashion.
5 Check global optimality of MIBLOP cluster points.
6 Piecewise linearize ”loss”-function (e.g. sparsemax).
7 Adaptively enforce sparsity in Abs-Linear-Learning.
Thank you very much for your attention!