Abstract 1. Introduction - arXiv · Implicit differentiation of Lasso-type models for...

Implicit differentiation of Lasso-type models for hyperparameter optimization

Quentin Bertrand * 1 Quentin Klopfenstein * 2 Mathieu Blondel 3 Samuel Vaiter 2 Alexandre Gramfort 1

Joseph Salmon 4

Abstract

Setting regularization parameters for Lasso-typeestimators is notoriously difficult, though cru-cial in practice. The most popular hyperparam-eter optimization approach is grid-search usingheld-out validation data. Grid-search however re-quires to choose a predefined grid for each pa-rameter, which scales exponentially in the num-ber of parameters. Another approach is to casthyperparameter optimization as a bi-level opti-mization problem, one can solve by gradient de-scent. The key challenge for these methods isthe estimation of the gradient w.r.t. the hyperpa-rameters. Computing this gradient via forwardor backward automatic differentiation is possibleyet usually suffers from high memory consump-tion. Alternatively implicit differentiation typi-cally involves solving a linear system which canbe prohibitive and numerically unstable in highdimension. In addition, implicit differentiationusually assumes smooth loss functions, whichis not the case for Lasso-type problems. Thiswork introduces an efficient implicit differentia-tion algorithm, without matrix inversion, tailoredfor Lasso-type problems. Our approach scales tohigh-dimensional data by leveraging the sparsityof the solutions. Experiments demonstrate thatthe proposed method outperforms a large num-ber of standard methods to optimize the error onheld-out data, or the Stein Unbiased Risk Esti-mator (SURE).

*Equal contribution 1Université Paris-Saclay, Inria, CEA,Palaiseau, France 2Institut Mathématique de Bourgogne,Université de Bourgogne, Dijon, France 3Google Research,Brain team, Paris, France 4Université de Montpellier,CNRS, Montpellier, France. Correspondence to: QuentinBertrand <[email protected]>, Quentin Klopfenstein<[email protected]>.

1. IntroductionIn many statistical applications, the number of parame-ters p is much larger than the number of observations n.In such scenarios, a popular approach to tackle linear re-gression problems is to consider convex `1-type penalties,used in Lasso (Tibshirani, 1996), Group-Lasso (Yuan andLin, 2006), Elastic-Net (Zou and Hastie, 2005) or adap-tive Lasso (Zou, 2006). These Lasso-type estimators relyon regularization hyperparameters, trading data fidelityagainst sparsity. Unfortunately, setting these hyperparame-ters is hard in practice: estimators based on `1-type penal-ties are indeed more sensitive to the choice of hyperparam-eters than `2 regularized estimators.

To control for overfitting, it is customary to use differentdatasets for model training (i.e., computing the regressioncoefficients) and hyperparameter selection (i.e., choosingthe best regularization parameters). A metric, e.g., hold-out loss, is optimized on a validation dataset (Stone andRamer, 1965). Alternatively one can rely on a statisticalcriteria that penalizes complex models such as AIC/BIC(Liu et al., 2011) or SURE (Stein Unbiased Risk Estima-tor, Stein 1981). In all cases, hyperparameters are tuned tooptimize a chosen metric.

The canonical hyperparameter optimization method isgrid-search. It consists in fitting and selecting the bestmodel over a predefined grid of parameter values. Thecomplexity of grid-search is exponential with the numberof hyperparameters, making it only competitive when thenumber of hyperparameters is small. Other hyperparameterselection strategies include random search (Bergstra andBengio, 2012) and Bayesian optimization (Brochu et al.,2010; Snoek et al., 2012) that aims to learn an approxima-tion of the metric over the parameter space and rely on anexploration policy to find the optimum.

Another line of work for hyperparameter optimization(HO) relies on gradient descent in the hyperparameterspace. This strategy has been widely explored for smoothobjective functions (Larsen et al., 1996; Bengio, 2000;Larsen et al., 2012). The main challenge for this class ofmethods is estimating the gradient w.r.t. the hyperparame-ters. Gradient estimation techniques are mostly divided intwo categories. Implicit differentiation requires the exact

arX

iv:2

002.

0894

3v2

[st

at.M

L]

3 A

pr 2

020


solution of the optimization problem and involves the res-olution of a linear system (Bengio, 2000). This can be ex-pensive to compute and lead to numerical instabilities, es-pecially when the system is ill-conditioned (Lorraine et al.,2019). Alternatively, iterative differentiation computes thegradient using the iterates of an optimization algorithm.Backward iterative differentiation (Domke, 2012) is com-putationally efficient when the number of hyperparametersis large. However it is memory consuming since it requiresstoring all intermediate iterates. In contrast, forward itera-tive differentiation (Deledalle et al., 2014; Franceschi et al.,2017) does not require storing the iterates but can be com-putationally expensive with a large number of hyperparam-eters; see Baydin et al. (2018) for a survey.

This article proposes to investigate the use of these meth-ods to set the regularization hyperparameters in an auto-matic fashion for Lasso-type problems. To cover the casesof both low and high number of hyperparameters, two esti-mators are investigated, namely the Lasso and the weightedLasso which have respectively one or as many parametersas features. Our contributions are as follows:

• We show that forward iterative differentiation of blockcoordinate descent (BCD), a state-of-the-art solver forLasso-type problems, converges towards the true gra-dient. Crucially, we show that this scheme convergeslinearly once the support is identified and that its limitdoes not depend of the initial starting point.

• These results lead to the proposed algorithm (Algo-rithm 2) where the computation of the Jacobian is de-coupled from the computation of the regression co-efficients. The later can be done with state-of-the-artconvex solvers, and interestingly, it does not requiresolving a linear system, potentially ill-conditioned.

• We show through an extensive benchmark on simu-lated and real high dimensional data that the proposedmethod outperforms state-of-the-art HO methods.

Our work should not be confused with the line of work byGregor and LeCun (2010), where the Lasso objective valueis differentiated w.r.t. optimization parameters instead ofthe solution. In other words, we tackle argmin rather thanmin differentiation.

Notation The design matrix is X ∈ Rn×p (correspondingto n samples and p features) and the observation vector isy ∈ Rn. The regularization parameter, possibly multivari-ate, is denoted by λ = (λ1, . . . , λr)

> ∈ Rr. We denoteβ(λ) ∈ Rp the regression coefficients associated to λ. Wedenote J(λ) , (∇λβ(λ)

1 , . . . ,∇λβ(λ)p )> ∈ Rp×r the weak

Jacobian (Evans and Gariepy, 1992) of β(λ) w.r.t. λ. For afunction ψ : Rp × Rr → R with weak derivatives of order

two, we denote by∇βψ(β, λ) ∈ Rp (resp. ∇λ(β, λ) ∈ Rr)its weak gradient w.r.t. the first parameter (resp. the secondparameter). The weak Hessian ∇2ψ(β, λ) is a matrix inR(p+r)×(p+r) which has a block structure

∇2ψ(β, λ) =

( ∇2βψ(β, λ) ∇2

β,λψ(β, λ)

∇2λ,βψ(β, λ) ∇2

λψ(β, λ)

).

The support of β(λ) (the indices of non-zero coefficients)is denoted by S(λ), and s(λ) represents its cardinality(i.e., the number of non-zero coefficients). The sign vec-tor sign β(λ) ∈ Rp is the vector of component-wise signs(with the convention that sign(0) = 0) of β(λ). Note that toease the reading, we drop λ in the notation when it is clearfrom the context and use β, J , S and s.

2. Background2.1. Problem setting

To favor sparse coefficients, we consider Lasso-type es-timators based on non-smooth regularization functions.Such problems consist in finding:

β(λ) ∈ arg minβ∈Rp

ψ(β, λ) . (1)

The Lasso (Tibshirani, 1996) is recovered, with the numberof hyperparameters set to r = 1:

ψ(β, λ) =1

2n‖y −Xβ‖22 + eλ‖β‖1 , (2)

while the weighted Lasso (wLasso, Zou 2006, introducedto reduce the bias of the Lasso) has r = p hyperparametersand reads:

ψ(β, λ) =1

2n‖y −Xβ‖22 +

p∑j=1

eλj |βj | . (3)

Note that we adopt the hyperparameter parametrization ofPedregosa (2016), i.e., we write the regularization parame-ter as eλ. This avoids working with a positivity constraintin the optimization process and fixes scaling issues in theline search. It is also coherent with the usual choice of ageometric grid for grid search (Friedman et al., 2010).Remark 1. Other formulations could be investigated likeElastic-Net or non-convex formulation, e.g., MCP (Zhang,2010). Our theory does not cover non-convex cases, thoughwe illustrate that it behaves properly numerically. Handlingsuch non-convex cases is left as a question for future work.

The HO problem can be expressed as a nested bi-level op-timization problem. For a given differentiable criterionC : Rp 7→ R (e.g., hold-out loss or SURE), it reads:


arg minλ∈Rr

{L(λ) , C

(β(λ)

)}s.t. β(λ) ∈ arg min

β∈Rpψ(β, λ) . (4)

Note that SURE itself is not necessarily weakly differen-tiable w.r.t. β(λ). However a weakly differentiable approx-imation can be constructed (Ramani et al., 2008; Deledalleet al., 2014). Under the hypothesis that Problem (1) has aunique solution for every λ ∈ Rr, the function λ 7→ β(λ) isweakly differentiable (Vaiter et al., 2013). Using the chainrule, the gradient of L w.r.t. λ then writes:

∇λL(λ) = J>(λ)∇C(β(λ)

). (5)

Computing the weak Jacobian J(λ) of the inner problemis the main challenge, as once the hypergradient ∇λL(λ)has been computed, one can use usual gradient descent,λ(t+1) = λ(t) − ρ∇λL(λ(t)), for a step size ρ > 0.Note however that L is usually non-convex and conver-gence towards a global minimum is not guaranteed. In thiswork, we propose an efficient algorithm to compute J(λ)for Lasso-type problems, relying on improved forward dif-ferentiation.

2.2. Implicit differentiation (smooth case)

Implicit differentiation, which can be traced back to Larsenet al. (1996), is based on the knowledge of β and requiressolving a p× p linear system (Bengio, 2000, Sec. 4). Sincethen, it has been extensively applied in various contexts.Chapelle et al. (2002); Seeger (2008) used implicit differ-entiation to select hyperparameters of kernel-based mod-els. Kunisch and Pock (2013) applied it to image restora-tion. Pedregosa (2016) showed that each inner optimiza-tion problem could be solved only approximately, leverag-ing noisy gradients. Related to our work, Foo et al. (2008)applied implicit differentiation on a “weighted” Ridge-typeestimator (i.e., a Ridge penalty with one λj per feature).

Yet, all the aforementioned methods have a common draw-back : they are limited to the smooth setting, since they relyon optimality conditions for smooth optimization. Theyproceed as follows: if β 7→ ψ(β, λ) is a smooth convexfunction (for any fixed λ) in Problem (1), then for all λ, thesolution β(λ) satisfies the following fixed point equation:

∇βψ(β(λ), λ

)= 0 . (6)

Then, this equation can be differentiated w.r.t. λ:

∇2β,λψ(β(λ), λ) + J>(λ)∇2

βψ(β(λ), λ) = 0 . (7)

Assuming that ∇2βψ(β(λ), λ) is invertible this leads to a

closed form solution for the weak Jacobian J(λ):

J>(λ) = −∇2β,λψ

(β(λ), λ

)(∇2βψ(β(λ), λ)

)︸︷︷︸

p×p

−1, (8)

which in practice is computed by solving a linear system.Unfortunately this approach cannot be generalized for non-smooth problems since Equation (6) no longer holds.

2.3. Implicit differentiation (non-smooth case)

Related to our work Mairal et al. (2012) used implicit dif-ferentiation with respect to the dictionary (X ∈ Rn×p) onElastic-Net models to perform dictionary learning. Regard-ing Lasso problems, the literature is quite scarce, see (Dos-sal et al., 2013; Zou et al., 2007) and (Vaiter et al., 2013;Tibshirani and Taylor, 2011) for a more generic settingencompassing weighted Lasso. General methods for gra-dient estimation of non-smooth optimization schemes ex-ist (Vaiter et al., 2017) but are not practical since they de-pend on a possibly ill-posed linear system to invert. Amosand Kolter (2017) have applied implicit differentiation onestimators based on quadratic objective function with lin-ear constraints, whereas Niculae and Blondel (2017) haveused implicit differentiation on a smooth objective func-tion with simplex constraints. However none of these ap-proaches leverages the sparsity of Lasso-type estimators.

3. Hypergradients for Lasso-type problemsTo tackle hyperparameter optimization of non-smoothLasso-type problems, we propose in this section an efficientalgorithm for hypergradient estimation. Our algorithm re-lies on implicit differentiation, thus enjoying low-memorycost, yet does not require to naively solve a (potentiallyill-conditioned) linear system of equations. In the sequel,we assume access to a (weighted) Lasso solver, such asISTA (Daubechies et al., 2004) or Block Coordinate De-scent (BCD, Tseng and Yun (2009), see also Algorithm 5).

3.1. Implicit differentiation

Our starting point is the key observation that Lasso-typesolvers induce a fixed point iteration that we can leverageto compute a Jacobian. Indeed, proximal BCD algorithms(Tseng and Yun, 2009), consist in a local gradient step com-posed with a soft-thresholding step (ST), e.g., for the Lasso:

βj ← ST

(βj −

X>:,j(Xβ − y)

‖X:,j‖2,neλ

‖X:,j‖

), (9)

where ST(t, τ) = sign(t)·(|t|−τ)+ for any t ∈ R and τ ≥0 (extended for vectors component-wise). The solution of


the optimization problem satisfies, for any α > 0, the fixed-point equation (Combettes and Wajs, 2005, Prop. 3.1):

β(λ)j = ST

(β(λ)j − 1

αX>(Xβ(λ) − y),

neλ

α

). (10)

The former can be differentiated w.r.t. λ, see Lemma A.1in Appendix, leading to a closed form solution for the Ja-cobian J(λ) of the Lasso and the weighted Lasso.

Proposition 1 (Adapting Vaiter et al. 2013, Thm. 1).Let S be the support of the vector β(λ). Suppose thatX>SXS � 0 , then a weak Jacobian J = J(λ) of the

Lasso writes:

JS = −neλ(X>SXS

)−1sign βS , (11)

JSc = 0 , (12)

and for the weighted Lasso:

JS,S = −(X>SXS

)−1diag

(neλS � sign βS

)(13)

Jj1,j2 = 0 if j1 /∈ S or if j2 /∈ S . (14)

The proof of Proposition 1 can be found in Appendix A.1.Note that the positivity condition in Proposition 1 is satis-fied if the (weighted) Lasso has a unique solution. More-over, even for multiple solutions cases, there exists at leastone satisfying the positivity condition (Vaiter et al., 2013).

Proposition 1 shows that the Jacobian of the weightedLasso J(λ) ∈ Rp×p is row and column sparse. This iskey for algorithmic efficiency. Indeed, a priori, one has tostore a possibly dense p × p matrix, which is prohibitivewhen p is large. Proposition 1 leads to a simple algorithm(see Algorithm 1) to compute the Jacobian in a cheap way,as it only requires storing and inverting an s × s matrix.Even if the linear system to solve is of size s × s, insteadof p× p for smooth objective function, the system to invertcan be ill-conditioned, especially when a large support sizes is encountered. This leads to numerical instabilities andslows down the resolution (see an illustration in Figure 2).Forward (Algorithm 3 in Appendix) and backward (Algo-rithm 4 in Appendix) iterative differentiation, which do notrequire solving linear systems, can overcome these issues.

3.2. Link with iterative differentiation

Iterative differentiation in the field of hyperparameter set-ting can be traced back to Domke (2012) who derived abackward differentiation algorithm for gradient descent,heavy ball and L-BFGS algorithms applied to smooth lossfunctions. Agrawal et al. (2019) generalized it to a spe-cific subset of convex programs. Maclaurin et al. (2015)derived a backward differentiation for stochastic gradientdescent. On the other hand Deledalle et al. (2014) used

Algorithm 1 IMPLICIT DIFFERENTIATION

input : X, y, λ// jointly compute coef. and Jacobian

if Lasso thenGet β = Lasso(X, y, λ) and its support S.J = 0pJS = −neλ(X>

SXS)−1 sign βS

if wLasso thenGet β = wLasso(X, y, λ) and its support S.J = 0p×pJS,S = −(X>

SXS)−1 diag(neλS � sign βS)

return β, J

forward differentiation of (accelerated) proximal gradientdescent for hyperparameter optimization with non-smoothpenalties. Franceschi et al. (2017) proposed a benchmarkof forward mode versus backward mode, varying the num-ber of hyperparameters to learn.

Forward differentiation consists in differentiating each stepof the algorithm (w.r.t. λ in our case). For the Lassosolved with BCD it amounts differentiating Equation (9),and leads to the following recursive equation for the Jaco-bian, with zj = βj −X>:,j(Xβ − y)/ ‖X:,j‖2:

Jj ←∂1 ST(zj ,

neλ

‖X:,j‖2

)(Jj − 1

‖X:,j‖2X>:,jXJ

)+ ∂2 ST

(zj ,

neλ

‖X:,j‖2

)neλ

‖X:,j‖2, (15)

see Algorithm 3 (in Appendix) for full details.Our proposed algorithm exploits that after a fixednumber of epochs ∂1 ST(zj , ne

λ/ ‖X:,j‖2) and∂2 ST(zj , ne

λ/ ‖X:,j‖2) are constant. It is thus pos-sible to decouple the computation of the Jacobian byonly solving Problem (1) in a first step and then apply theforward differentiation recursion steps, see Algorithm 2.This can be seen as the forward counterpart in a non-smooth case of the recent paper Lorraine et al. (2019).An additional benefit of such updates is that they can be

restricted to the (current) support, which leads to fasterJacobian computation. We now show that the Jacobiancomputed using forward differentiation and our method,Algorithm 2, converges toward the true Jacobian.

Proposition 2. Assuming the Lasso solution (Prob-lem (2)) (or weighted Lasso Problem (3)) is unique,then Algorithms 2 and 3 converge toward the implicitdifferentiation solution J defined in Proposition 1.Moreover once the support has been identified theconvergence of the Jacobian is linear and its limit doesnot depend on the initial starting point J (0).

Proof of Proposition 2 can be found in Appendix A.2and A.3.


Imp. F. Iterdiff. (ours) F. Iterdiff. B. Iterdiff.

20 40 60Number of iterations

10−1

100

101

Tim

es(s

)

20 40 60Number of iterations

10−7

10−5

Obj

ecti

vem

inus

opti

mum

Figure 1. Time to compute a single gradient (Synthetic data,Lasso, n, p = 1000, 2000). Influence on the number of iterationsof BCD (in the inner optimization problem of Problem (4)) on thecomputation time (left) and the distance to “optimum” of the gra-dient ∇λL(λ)(right) for the Lasso estimator. The “optimum” ishere the gradient given by implicit differentiation (Algorithm 1).

As an illustration, Figure 1 shows the times of computa-tion of a single gradient ∇λL(λ) and the distance to “op-timum” of this gradient as a function of the number of it-erations in the inner optimization problem for the forwarditerative differentiation (Algorithm 3), the backward iter-ative differentiation (Algorithm 4), and the proposed algo-rithm (Algorithm 2). The backward iterative differentiationis several order of magnitude slower than the forward andour implicit forward method. Moreover, once the supporthas been identified (after 20 iterations) the proposed im-plicit forward method converges faster than other methods.Note also that in Propositions 1 and 2 the Jacobian for theLasso only depends on the support (i.e., the indices of thenon-zero coefficients) of the regression coefficients β(λ).In other words, once the support of β(λ) is correctly identi-fied, even if the value of the non-zeros coefficients are notcorrectly estimated, the Jacobian is exact, see Sun et al.(2019) for support identification guarantees.

4. ExperimentsOur Python code is released as an open source package:https://github.com/QB3/sparse-ho. All theexperiments are written in Python using Numba (Lam et al.,2015) for the critical parts such as the BCD loop. We com-pare our gradient computation technique against other com-petitors (see the competitors section) on the HO problem(Problem (4)).

Solving the inner optimization problem. Note that ourproposed method, implicit forward differentiation, has theappealing property that it can be used with any solver. Forinstance for the Lasso one can combine the proposed al-gorithm with state of the art solver such as Massias et al.(2018) which would be tedious to combine with iterativedifferentiation methods. However for the comparison to befair, for all methods we have used the same vanilla BCD

Algorithm 2 IMP. F. ITERDIFF. (proposed)input : X, y, λ, niter_jacinit : J = 0// sequentially compute coef. & Jacobian

if Lasso thenGet β = Lasso(X, y, λ) and its support S.dr = −X:,SJS // trick for cheap updates

if wLasso thenGet β = wLasso(X, y, λ) and its support S.dr = −X:,SJS,S

for k = 0, . . . , niter_jac − 1 dofor j ∈ S do

if Lasso thenJold = Jj // trick for cheap update

// diff. Equation (9) w.r.t. λ

Jj +=X>:,jdr

‖X:,j‖2− neλ

‖X:,j‖2sign βj // O(n)

drj −= X:,j(Jj,: − Jold) // O(n)if wLasso thenJold = Jj,: // trick for cheap update

// diff. Equation (9) w.r.t. λ

Jj,S += 1‖X:,j‖2

X>:,jdr // O(n× s)

Jj,j −= neλj

‖X:,j‖2sign βj // O(1)

dr −= X:,j ⊗ (Jj,: − Jold) // O(n× s)return β,J

algorithm (recalled in Algorithm 5). We stop the Lasso-types solver when f(β(k+1))−f(β(k))

f(0) < εtol , where f is thecost function of the Lasso or wLasso and εtol a given toler-ance. The tolerance is fixed at εtol = 10−5 for all methodsthroughout the different benchmarks.

Line search. For each hypergradient-based method, thegradient step is combined with a line-search strategy fol-lowing the work of Pedregosa (2016). The procedure isdetailed in Appendix1.

Initialization. Since the function to optimize L is not con-vex, initialization plays a crucial role in the final solutionas well as the convergence of the algorithm. For instance,initializing λ = λinit in a flat zone of L(λ) could lead toslow convergence. In the numerical experiments, the Lassois initialized with λinit = λmax− log(10), where λmax is thesmallest λ such that 0 is a solution of Problem (2).

Competitors. In this section we compare the empiricalperformance of implicit forward differentiation algorithmto different competitors. Competitors are divided in twocategories. Firstly, the ones relying on hyperparameter gra-dient:

• Imp. F. Iterdiff.: implicit forward differentiation

1see https://github.com/fabianp/hoag for details

https://github.com/QB3/sparse-ho

https://github.com/fabianp/hoag


Table 1. Summary of cost in time and space for each method

Mode Computed Space Time Space Timequantity (Lasso) (Lasso) (wLasso) (wLasso)

F. Iterdiff. J O(p) O(2npniter) O(p2) O(np2niter)B. Iterdiff. J>v O(2pniter) O(npniter + np2niter) O(p2niter) O(npniter + np2niter)Implicit J>v O(p) O(npniter + s3) O(p+ s2) O(npniter + s3)Imp. F. Iterdiff. J O(p) O(npniter + nsniter_jac) O(p+ s2) O(npniter + ns2nit_jac)

(proposed) described in Algorithm 2.

• Implicit: implicit differentiation, which requires solv-ing a s× s linear system as described in Algorithm 1.

• F. Iterdiff.: forward differentiation (Deledalle et al.,2014; Franceschi et al., 2017) which jointly computesthe regression coefficients β as well as the Jacobian Jas shown in Algorithm 3.

Secondly, the ones not based on hyperparameter gradient:

• Grid-search: as recommended by Friedman et al.(2010), we use 100 values on a uniformly-spaced gridfrom λmax to λmax − 4 log(10).

• Random-search: we sample uniformly at random100 values taken on the same interval as for theGrid-search [λmax − 4 log(10);λmax], as suggested byBergstra et al. (2013).

• Bayesian: sequential model based optimization(SMBO) using a Gaussian process to model the objec-tive function. We used the implementation of Bergstraet al. (2013).2 The constraints space for the hyperpa-rameter search was set in [λmax−4 log(10);λmax], andthe expected improvement (EI) was used as aquisitionfunction.

The cost and the quantity computed by each algorithm canbe found in Table 1. The backward differentiation (Domke,2012) is not included in the benchmark since it was severalorders of magnitude slower than the other techniques (seeFigure 1). This is due to the high cost of the BCD algorithmin backward mode, see Table 1.

4.1. Application to held-out loss

When using the held-out loss, each dataset (X, y) is split in3 equal parts: the training set (X train, ytrain), the validationset (Xval, yval) and the test set (X test, ytest).

(Lasso, held-out criterion). For the Lasso and the held-out2https://github.com/hyperopt/hyperopt

loss, the bilevel optimization Problem (4) reads:

arg minλ∈R

‖yval −Xvalβ(λ)‖2 (16)

s.t. β(λ) ∈ arg minβ∈Rp

12n‖ytrain −X trainβ‖22 + eλ‖β‖1 .

Figure 2 (top) shows on 3 datasets (see Appendix D fordataset details) the distance to the “optimum” of ‖yval −Xvalβ(λ)‖2 as a function of time. Here the goal is to findλ solution of Problem (16). The “optimum” is chosen asthe minimum of ‖yval − Xvalβ(λ)‖2 among all the meth-ods. Figure 2 (bottom) shows the loss ‖ytest −X testβ(λ)‖2on the test set (independent from the training set and thevalidation set). This illustrates how well the estimator gen-eralizes. Firstly, it can be seen that on all datasets the pro-posed implicit forward differentiation outperforms forwarddifferentiation which illustrates Proposition 2 and corrobo-rates the cost of each algorithm in Table 1. Secondly, it canbe seen that on the 20news dataset (Figure 2, top) the im-plicit differentiation (Algorithm 1) convergence is slowerthan implicit forward differentiation, forward differentia-tion, and even slower than the grid-search. In this case, thisis due to the very slow convergence of the conjugate gra-dient algorithm (Nocedal and Wright, 2006) when solvingthe ill-conditioned linear system in Algorithm 1.

(MCP, held-out criterion). We also applied our algorithmon an estimator based on a non-convex penalty: the MCP(Zhang, 2010) with 2 hyperparameters. Since the penalty isnon-convex the estimator may not be continuous w.r.t. hy-perparameters and the theory developed above does nothold. However experimentally implicit forward differen-tiation outperforms forward differentiation for the HO, seeAppendix C for full details.

4.2. Application to another criterion: SURE

Evaluating models on held-out data makes sense if the de-sign is formed from random samples as it is often consid-ered in supervised learning. However, this assumption doesnot hold for certain kinds of applications in signal or imageprocessing. For these applications, the held-out loss cannotbe used as the criterion for optimizing the hyperparame-ters of a given model. In this case, one may use a proxy ofthe prediction risk, like the Stein Unbiased Risk Estimation


Imp. F. Iterdiff. (ours) Implicit F. Iterdiff. Grid-search Bayesian Random-search

0.0 0.5 1.0

10−5

10−4

10−3

10−2

10−1

100

Obj

ecti

vem

inus

opti

mum

rcv1 (p = 19, 959)

0 5 10 15

10−3

10−2

10−1

100

101

102

20news (p = 130, 107)

0 100 200 300

10−4

10−3

10−2

10−1

100

101

finance (p = 1, 668, 737)

0.0 0.5 1.0Time (s)

10−1

100

Los

son

test

set

0 5 10 15Time (s)

101

102

0 100 200 300Time (s)

10−1

100

101

Figure 2. Computation time for the HO of the Lasso on real data. Distance to “optimum” (top) and performance (bottom) on the testset for the Lasso for 3 different datasets: rcv1, 20news and finance.

(SURE, Stein (1981)). The SURE is an unbiased estimatorof the prediction risk under weak differentiable conditions.The drawback of this criterion is that it requires the knowl-edge of the variance of the noise. The SURE is defined asfollows: SURE(λ) = ‖y−Xβ(λ)‖2−nσ2+2σ2dof(β(λ)) ,where the degrees of freedom (dof Efron 1986) is definedas dof(β(λ)) =

∑ni=1 cov(yi, (Xβ

(λ))i)/σ2 . The dof can

be seen a measure of the complexity of the model, for in-stance for the Lasso dof(β(λ)) = s, see Zou et al. (2007).The SURE can thus be seen as a criterion trading data-fidelity against model complexity. However, the dof isnot differentiable (not even continuous in the Lasso case),yet it is possible to construct a weakly differentiable ap-proximation of it based on Finite Differences Monte-Carlo(see Deledalle et al. 2014 for full details), with ε > 0 andδ ∼ N (0, Idn):

dofFDMC(y, λ, δ, ε) = 1ε 〈Xβ(λ)(y + εδ)−Xβ(λ)(y), δ〉 .

We use this smooth approximation in the bi-level optimiza-tion problem to find the best hyperparameter. The bi-leveloptimization problem then reads:

arg minλ∈R

‖y −Xβ(λ)‖2 + 2σ2dofFDMC(y, λ, δ, ε) (17)

s.t. β(λ)(y) ∈ arg minβ∈Rp

12n‖y −Xβ‖22 + eλ‖β‖1

β(λ)(y + εδ) ∈ arg minβ∈Rp

12n‖y + εδ −Xβ‖22 + eλ‖β‖1

Note that solving this problem requires the computation oftwo (instead of one for the held-out loss) Jacobians w.r.t. λof the solution β(λ) at the points y and y + εδ.

(Lasso, SURE criterion). To investigate the estimation per-formance of the implicit forward differentiation in com-parison to the competitors described above, we used asmetric the (normalized) Mean Squared Error (MSE) de-fined as MSE , ‖β − β∗‖2/‖β∗‖2. The entries of thedesign matrix X ∈ Rn×p are i.i.d. random Gaussian vari-ables N (0, 1). The number of rows is fixed to n = 100.Then, we generated β∗ with 5 non-zero coefficients equalsto 1. The vector y was computed by adding to Xβ∗ addi-tive Gaussian noise controlled by the Signal-to-Noise Ra-tio: SNR , ‖Xβ∗‖/‖y −Xβ∗‖ (here SNR = 3). Fol-lowing Deledalle et al. (2014), we set ε = 2σ/n0.3. Wevaried the number of features p between 200 and 10,000 ona linear grid of size 10. For a fixed number of features, weperformed 50 repetitions and each point of the curves rep-resents the mean of these repetitions. Comparing efficiencyin time between methods is difficult since they are not di-rectly comparable. Indeed, grid-search and random-searchdiscretize the HO space whereas others methods work inthe continuous space which is already an advantage. How-ever, to be able to compare the hypergradient methods andpossibly compare them to the others, we computed the to-tal amount of time for a method to return its optimal valueof λ. In order to have a fair comparison, we compared 50evaluations of the line-search for each hypergradient meth-


Imp. F. Iterdiff. (ours)

Implicit

F. Iterdiff.

Grid-search

Bayesian

Random-search

200 2500 5000 7500 10000Number of features (p)

0.20

0.25

0.30

0.35

MS

E


100

101

102

Tim

e(s

)

Figure 3. Lasso: estimation performance. Estimation MeanSquared Error (left) and running time (right) as a function of thenumber of features for the Lasso model.

ods, 50 evaluations of the Bayesian methods and finally 50evaluations on fixed or random grid. We are aware that thecost of each of these evaluations is not the same but it al-lows to see that our method stays competitive in time withoptimizing one parameter. Moreover we will also see thatour method scales better with a large number of hyperpa-rameters to optimize.

Figure 3 shows the influence of the number of featureson the MSE and the computation time. First, MSE of allcompetitors is comparable which means that the value ofβ(λ) obtained by the different methods is approximately thesame. The only method that performs worse than the oth-ers is implicit differentiation. We attribute this to instabil-ities in the matrix inversion of the implicit differentiation.Second, it can be seen that our method has the same esti-mation performance as state-of-the-art methods like grid-search. This illustrates that our approach is comparable tothe others in term of quality of the estimator. Yet, its run-ning time is the lowest of all hypergradient-based strate-gies and competes with the implicit differentiation and therandom-search.

(Weighted Lasso vs Lasso, SURE criterion). As our methodleverages the sparsity of the solution, it can be used for HOwith a large number of hyperparameters, contrary to classi-cal forward differentiation. The weighted Lasso (wLasso,Zou 2006) has p hyperparameters and was introduced toreduce the bias of the Lasso. However setting the p hyper-parameters is impossible with grid-search.

Figure 4 shows the estimation MSE and the running timeof the different methods to obtain the hyperparameter val-ues as a function of the number of features used to simu-late the data. The simulation setting is here the same asfor the Lasso problems investigated in Figure 3 (n = 100,SNR = 3). We compared the classical Lasso estimator andthe weighted Lasso estimator where the regularization hy-perparameter was chosen using implicit forward differenti-ation and the forward iterative differentiation as describedin Algorithm 3. Problem (4) is not convex for the weighted

Lasso Imp. F. Iterdiff. (ours)

Lasso F. Iterdiff.

wLasso Imp. F. Iterdiff. (ours)

wLasso F. Iterdiff.


0.1

0.2

0.3

MS

E


100

101

102

103

Tim

e(s

)

Figure 4. Lasso vs wLasso. Estimation Mean Squared Error (left)and running (right) of competitors as a function of the number offeatures for the weighted Lasso and Lasso models.

Lasso and a descent algorithm like ours can be trappedin local minima, crucially depending on the starting pointλinit. To alleviate this problem, we introduced a regular-ized version of Problem (4):

arg minλ∈R

C(β(λ)

)+ γ

p∑j

λ2j

s.t. β(λ) ∈ arg minβ∈Rp

, ψ(β, λ) . (18)

The solution obtained by solving Equation (18) is thenused as the initialization λ(0) for our algorithm. Inthis experiment the regularization term is constant γ =C(β(λmax))/10. We see in Figure 4 that the weighted Lassogives a lower MSE than the Lasso and allows for a betterrecovery of β∗. This experiment shows that the amountof time needed to obtain the vector of hyperparameters ofthe weighted Lasso via our algorithm is in the same rangeas for obtaining the unique hyperparameter of the Lassoproblem. It also shows that our proposed method is muchfaster than the naive way of computing the Jacobian usingforward iterative differentiation. A maximum running timethreshold was used for this experiment checking the run-ning time at each line-search iteration, explaining why theforward differentiation of the wLasso does not explode intime on Figure 4.

ConclusionIn this work we studied the performance of several methodsto select hyperparameters of Lasso-type estimators show-ing results for the Lasso and the weighted Lasso, whichhave respectively one or p hyperparameters. We exploitedthe sparsity of the solutions and the specific structure ofthe iterates of forward differentiation, leading to our im-plicit forward differentiation algorithm that computes effi-ciently the full Jacobian of these estimators w.r.t. the hyper-parameters. This allowed us to select them through a stan-dard gradient descent and have an approach that scales to a


high number of hyperparameters. Importantly, contrary toa classical implicit differentiation approach, the proposedalgorithm does not require solving a linear system. Finally,thanks to its two step nature, it is possible to leverage in thefirst step the availability of state-of-the-art Lasso solversthat make use of techniques such as active sets or screeningrules. Such algorithms, that involve calls to inner solversrun on subsets of features, are discontinuous w.r.t. hyperpa-rameters which would significantly challenge a single stepapproach based on automatic differentiation.

Acknowledgments This work was funded by ERC Start-ing Grant SLAB ERC-YStG-676943.


ReferencesA. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and

J. Z. Kolter. Differentiable convex optimization layers.In Advances in neural information processing systems,pages 9558–9570, 2019.

B. Amos and J. Z. Kolter. Optnet: Differentiable optimiza-tion as a layer in neural networks. In ICML, volume 70,pages 136–145, 2017.

A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M.Siskind. Automatic differentiation in machine learning:a survey. J. Mach. Learn. Res., 18(153):1–43, 2018.

Y. Bengio. Gradient-based optimization of hyperparame-ters. Neural computation, 12(8):1889–1900, 2000.

J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, 2012.

J. Bergstra, D. Yamins, and D. D. Cox. Hyperopt: A pythonlibrary for optimizing the hyperparameters of machinelearning algorithms. In Proceedings of the 12th Pythonin science conference, pages 13–20, 2013.

P. Breheny and J. Huang. Coordinate descent algorithmsfor nonconvex penalized regression, with applications tobiological feature selection. Ann. Appl. Stat., 5(1):232,2011.

E. Brochu, V. M. Cora, and Nando De Freitas. A tutorial onBayesian optimization of expensive cost functions, withapplication to active user modeling and hierarchical re-inforcement learning. arXiv preprint arXiv:1012.2599,2010.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee.Choosing multiple parameters for support vector ma-chines. Machine learning, 46(1-3):131–159, 2002.

P. L. Combettes and V. R. Wajs. Signal recovery by proxi-mal forward-backward splitting. Multiscale Modeling &Simulation, 4(4):1168–1200, 2005.

I. Daubechies, M. Defrise, and C. De Mol. An iterativethresholding algorithm for linear inverse problems witha sparsity constraint. Comm. Pure Appl. Math., 57(11):1413–1457, 2004.

C.-A. Deledalle, S. Vaiter, J. Fadili, and G. Peyré. SteinUnbiased GrAdient estimator of the Risk (SUGAR) formultiple parameter selection. SIAM J. Imaging Sci., 7(4):2448–2487, 2014.

J. Domke. Generic methods for optimization-based model-ing. In AISTATS, volume 22, pages 318–326, 2012.

C. Dossal, M. Kachour, M.J. Fadili, G. Peyré, and C. Ches-neau. The degrees of freedom of the lasso for generaldesign matrix. Statistica Sinica, 23(2):809–828, 2013.

B. Efron. How biased is the apparent error rate of a pre-diction rule? J. Amer. Statist. Assoc., 81(394):461–470,1986.

L. C. Evans and R. F. Gariepy. Measure theory and fineproperties of functions. CRC Press, 1992.

C. S. Foo, C. B. Do, and A. Y. Ng. Efficient multiple hyper-parameter learning for log-linear models. In Advances inneural information processing systems, pages 377–384,2008.

L. Franceschi, M. Donini, P. Frasconi, and M. Pontil. For-ward and reverse gradient-based hyperparameter opti-mization. In ICML, pages 1165–1173, 2017.

J. Friedman, T. J. Hastie, and R. Tibshirani. Regulariza-tion paths for generalized linear models via coordinatedescent. J. Stat. Softw., 33(1):1–22, 2010.

K. Gregor and Y. LeCun. Learning fast approximations ofsparse coding. In ICML, pages 399–406, 2010.

E. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for`1-minimization: Methodology and convergence. SIAMJ. Optim., 19(3):1107–1130, 2008.

K. Kunisch and T. Pock. A bilevel optimization approachfor parameter learning in variational models. SIAM J.Imaging Sci., 6(2):938–983, 2013.

S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT Compiler. In Proceedings of the Sec-ond Workshop on the LLVM Compiler Infrastructure inHPC, pages 1–6. ACM, 2015.

J. Larsen, L. K. Hansen, C. Svarer, and M. Ohlsson. Designand regularization of neural networks: the optimal use ofa validation set. In Neural Networks for Signal Process-ing VI. Proceedings of the 1996 IEEE Signal ProcessingSociety Workshop, pages 62–71, 1996.

J. Larsen, C. Svarer, L. N. Andersen, and L. K. Hansen.Adaptive regularization in neural network modeling. InNeural Networks: Tricks of the Trade - Second Edition,pages 111–130. Springer, 2012.

W. Liu, Y. Yang, et al. Parametric or nonparametric? aparametricness index for model selection. Ann. Statist.,39(4):2074–2102, 2011.

J. Lorraine, P. Vicol, and D. Duvenaud. Optimizingmillions of hyperparameters by implicit differentiation.arXiv preprint arXiv:1911.02590, 2019.


D. Maclaurin, D. Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversiblelearning. In ICML, volume 37, pages 2113–2122, 2015.

J. Mairal, F. Bach, and J. Ponce. Task-driven dictionarylearning. IEEE Trans. Pattern Anal. Mach. Intell., 34(4):791–804, 2012.

M. Massias, A. Gramfort, and J. Salmon. Celer: a FastSolver for the Lasso with Dual Extrapolation. In ICML,volume 80, pages 3315–3324, 2018.

M. Massias, S. Vaiter, A. Gramfort, and J. Salmon. Dualextrapolation for sparse generalized linear models. arXivpreprint arXiv:1907.05830, 2019.

V. Niculae and M. Blondel. A regularized frameworkfor sparse and structured neural attention. In Advancesin neural information processing systems, pages 3338–3348, 2017.

J. Nocedal and S. J. Wright. Numerical optimization.Springer Series in Operations Research and FinancialEngineering. Springer, New York, second edition, 2006.

F. Pedregosa. Hyperparameter optimization with approx-imate gradient. In ICML, volume 48, pages 737–746,2016.

S. Ramani, T. Blu, and M. Unser. Monte-Carlo SURE: ablack-box optimization of regularization parameters forgeneral denoising algorithms. IEEE Trans. Image Pro-cess., 17(9):1540–1554, 2008.

M. W. Seeger. Cross-validation optimization for large scalestructured classification kernel methods. J. Mach. Learn.Res., 9:1147–1178, 2008.

J. Snoek, H. Larochelle, and R. P. Adams. Practicalbayesian optimization of machine learning algorithms.In Advances in neural information processing systems,pages 2951–2959, 2012.

E. Soubies, L. Blanc-Féraud, and G. Aubert. A unifiedview of exact continuous penalties for `2-`0 minimiza-tion. SIAM J. Optim., 27(3):2034–2060, 2017.

C. M. Stein. Estimation of the mean of a multivariate nor-mal distribution. Ann. Statist., 9(6):1135–1151, 1981.

L. R. A. Stone and J.C. Ramer. Estimating WAIS IQ fromShipley Scale scores: Another cross-validation. Journalof clinical psychology, 21(3):297–297, 1965.

Y. Sun, H. Jeong, J. Nutini, and M. Schmidt. Are we thereyet? manifold identification of gradient-related proxi-mal methods. In AISTATS, volume 89, pages 1110–1119,2019.

R. Tibshirani. Regression shrinkage and selection via thelasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.

R. J. Tibshirani and J. Taylor. The solution path of thegeneralized lasso. Ann. Statist., 39(3):1335–1371, 2011.

P. Tseng and S. Yun. Block-coordinate gradient descentmethod for linearly constrained nonsmooth separableoptimization. J. Optim. Theory Appl., 140(3):513, 2009.

S. Vaiter, C.-A. Deledalle, G. Peyré, C. Dossal, andJ. Fadili. Local behavior of sparse analysis regulariza-tion: Applications to risk estimation. Appl. Comput.Harmon. Anal., 35(3):433–451, 2013.

S. Vaiter, C.-A. Deledalle, G. Peyré, J. M. Fadili, andC. Dossal. The degrees of freedom of partly smooth reg-ularizers. Ann. Inst. Stat. Math., 69(4):791–832, 2017.

M. Yuan and Y. Lin. Model selection and estimation inregression with grouped variables. J. R. Stat. Soc. Ser. BStat. Methodol., 68(1):49–67, 2006.

C.-H. Zhang. Nearly unbiased variable selection underminimax concave penalty. Ann. Statist., 38(2):894–942,2010.

H. Zou. The adaptive lasso and its oracle properties. J.Amer. Statist. Assoc., 101(476):1418–1429, 2006.

H. Zou and T. J. Hastie. Regularization and variable se-lection via the elastic net. J. R. Stat. Soc. Ser. B Stat.Methodol., 67(2):301–320, 2005.

H. Zou, T. J. Hastie, and R. Tibshirani. On the “degrees offreedom” of the lasso. Ann. Statist., 35(5):2173–2192,2007.


A. ProofsA.1. Proof of Proposition 1

We start by a lemma on the weak derivative of the soft-thresholding.

Lemma A.1. The soft-thresholding ST : R×R+ 7→ R defined by ST(t, τ) = sign(t) · (|t| − τ)+ is weakly differentiablewith weak derivatives

∂ ST

∂t(t, τ) = 1{|t|>τ} , (19)

and

∂ ST

∂τ(t, τ) = − sign(t) · 1{|t|>τ} , (20)

where

1{|t|>τ} =

{1, if |t| > τ,

0, otherwise.(21)

Proof. See (Deledalle et al., 2014, Proposition 1)

Proof. (Proposition 1, Lasso ISTA) The soft-thresholding is differentiable almost everywhere (a.e.), thus Equation (10)can be differentiated a.e. thanks to the previous lemma, and for any α > 0

J=

1{|β1|>τ}

...1{|βp|>τ}

� (Idp−1

αX>X

)J−

neλ

α

sign(β1)1{|β1|>τ}

...sign(βp)1{|βp|>τ}

.

Inspecting coordinates inside and outside the support of β leads to:{JSc = 0

JS = JS − 1αX>:,SX:,SJS − neλ

α sign βS .(22)

Rearranging the term of Equation (22) it yields:

X>:,SX:,SJS = −neλ sign βS (23)

JS = −neλ(X>

:,SX:,S

)−1sign βS . (24)

(Proposition 1, Lasso BCD)

The fixed point equations for the BCD case is

βj = ST

(βj −

1

‖X:j‖22X>:j (Xβj − y),

neλ

‖X:j‖22

). (25)

As before we can differentiate this fixed point equation Equation (25)

Jj = 1{|βj|>τ} ·(Jj −

1

‖X:j‖22X>:jXJ

)− neλ

‖X:j‖22sign (βj)1{|βj|>τ} , (26)

leading to the same result.


A.2. Proof of Proposition 2 in the ISTA case

Proof. (Lasso case, ISTA) In Algorithm 3, β(k) follows ISTA steps, thus (β(k))l∈N converges toward the solution of theLasso β. Let S be the support of the Lasso estimator β, and ν(S) > 0 the smallest eigenvalue of X>

:,SX:,S . Under

uniqueness assumption proximal gradient descent (a.k.a. ISTA) achieves sign identification (Hale et al., 2008), i.e., thereexists k0 ∈ N such that for all k ≥ k0 − 1:

signβ(k+1) = sign β . (27)

Recalling the update of the Jacobian J for the Lasso solved with ISTA is the following:

J (k+1) =∣∣∣signβ(k+1)

∣∣∣�(Id− 1

‖X‖22X>X

)J (k) − neλ

‖X‖22signβ(k+1) ,

it is clear that J (k) is sparse with the sparsity pattern β(k) for all k ≥ k0. Thus we have that for all k ≥ k0:

J (k+1)

S= J (k)

S− 1

‖X‖22X>

:,SXJ (k) − neλ

‖X‖22sign βS

= J (k)

S− 1

‖X‖22X>

:,SX:,SJ

(k)

S− neλ

‖X‖22sign βS

=

(IdS −

1

‖X‖22X>

:,SX:,S

)J (k)

S− neλ

‖X‖22sign βS . (28)

One can remark that J defined in Equation (11), satisfies the following:

JS =

(IdS −

1

‖X‖22X>

:,SX:,S

)JS −

neλ

‖X‖22sign βS . (29)

Combining Equations (28) and (29) and denoting ν(S) > 0 the smallest eigenvalue of X>SXS , we have for all k ≥ k0:

J (k+1)

S− JS =

(IdS −

1

‖X‖22X>

:,SX:,S

)(J (k)

S− JS

)‖J (k+1)

S− JS‖2 ≤

(1− ν(S)

‖X‖22

)‖J (k)

S− JS‖2

‖J (k)

S− JS‖2 ≤

(1− ν(S)

‖X‖22

)k−k0‖J (k0)

S− JS‖2 .

Thus the sequence of Jacobian(J (k)

)k∈N converges linearly to J once the support is identified.

Proof. (wLasso case, ISTA) Recalling the update of the Jacobian J ∈ Rp×p for the wLasso solved with ISTA is thefollowing:

J (k+1) =∣∣∣signβ(k+1)

∣∣∣�(Id− 1

‖X‖22X>X

)J (k)

− neλ

‖X‖22diag

(signβ(k+1)

), (30)

The proof follows exactly the same steps as the ISTA Lasso case to show convergence in spectral norm of the sequence(J (k))k∈N toward J .


A.3. Proof of Proposition 2 in the BCD case

The goal of the proof is to show that iterations of the Jacobian sequence (J (k))k∈N generated by the Block CoordinateDescent algorithm (Algorithm 3) converges toward the true Jacobian J . The main difficulty of the proof is to show thatthe Jacobian sequence follows an asymptotic Vector AutoRegressive (VAR, see Massias et al. (2019, Thm. 10) for moredetail), i.e., the main difficulty is to show that there exists k0 such that for all k ≥ k0:

J (k+1) = AJ (k) +B , (31)

with A ∈ Rp×p a contracting operator and B ∈ Rp. We follow exactly the proof of Massias et al. (2019, Thm. 10).

Proof. (Lasso, BCD)

Let j1, . . . , jS be the indices of the support of β, in increasing order. As the sign is identified, coefficients outside thesupport are 0 and remain 0. We decompose the k-th epoch of coordinate descent into individual coordinate updates: Letβ(0) ∈ Rp denote the initialization (i.e., the beginning of the epoch, ), β(1) = β(k) the iterate after coordinate j1 has beenupdated, etc., up to β(S) after coordinate jS has been updated, i.e., at the end of the epoch (β(S) = β(k+1)). Let s ∈ S,then β(s) and β(s−1) are equal everywhere, except at coordinate js:

J (s)js

= J (s−1)js

− 1

‖X:,js‖2X>:,jsXJ (s−1) − 1

‖Xjs‖2signβjs after sign identification we have: (32)

= J (s−1)js

− 1

‖X:,js‖2X>:,jsX:,SJ

(s−1)S

− 1

‖X:,js‖2sign βjs (33)

X:,SJ(s)

S=

(Idn−

X:,jsX>:,js

‖X:,js‖2

)︸︷︷︸

As

X:,SJ(s−1)S

− sign βjs

‖X:,js‖2X:,js︸︷︷︸

bs

(34)

We thus have:X:,SJ

(S)

S= AS . . . A1︸︷︷︸

A∈Rn×n

X:,SJ(0)

S+AS . . . A2b1 + · · ·+ASbS−1 + bS︸︷︷︸

b∈Rn

. (35)

After sign identification and a full update of coordinate descent we thus have:

X:,SJ(t+1)

S= AX:,SJ

(t)

S+ b . (36)

Since b ⊥ Ker(Idn−A), one can apply Massias et al. (2019, Prop. 15), i.e., ‖A‖2 = 1, however one can show thatA = A + A with A the orthogonal projection on Ker(Idn−A) and ‖A‖2 < 1. Moreover, after sign identification thesequence is a VAR with associated matrix A, whose spectral norm is less than 1.

Proof. (wLasso case, BCD) As for the Lasso case:

J (s)js,:

= J (s−1)js,:

− 1

‖X:,js‖2X>:,jsXJ (s−1) − 1

‖Xjs‖2signβjsejs after sign identification we have: (37)

J (s)

js,S= J (s−1)

js,S− 1

‖X:,js‖2X>:,jsX:,SJ

(s−1)S,S

− 1

‖X:,js‖2sign βjsejs (38)

X:,SJ(s)

S,S=

(Idn−

X:,jsX>:,js

‖X:,js‖2

)︸︷︷︸

As

X:,SJ(s−1)S,S

− sign βjs

‖X:,js‖2X:,js︸︷︷︸

Bs

⊗ejs (39)

X:,SJ(S)

S,S= AS . . . A1︸︷︷︸

A∈Rn×n

X:,SJ(0)

S,S+AS . . . A2B1 ⊗ e1 + · · ·+ASBS−1 ⊗ eS−1 +BS ⊗ eS︸︷︷︸

D∈Rn×p

. (40)

One can show that

D:,j = Πs−1i=jAiBj . (41)

As in Massias et al. (2019, Lemma 14), we can proove that D:,j ∈ Ker(Idn−A)⊥. As before, after sign identification thesequence is a VAR with associated matrix A, whose spectral norm is less than 1.


B. Block coordinate descent algorithmsAlgorithm 3 presents the forward iteration scheme which computes iteratively the solution of the Lasso or wLasso jointlywith the Jacobian computation. This is the naive way of computing the Jacobian without taking advantage of its sparsity.Eventually, it requires to differentiate every lines of code w.r.t. to λ and take advantage of the BCD updates for cheapupdates on the Jacobian as well.

Algorithm 3 FORWARD ITERDIFF (Deledalle et al., 2014; Franceschi et al., 2017)

input : X, y, β(0),J (0), λ, L, niter// jointly compute coef. & Jacobian

β = β(0) // warm start

J = J (0) // warm start

r = y −Xβdr = −XJfor k = 0, . . . , niter − 1 do

for j = 0, . . . , p− 1 do// update the regression coefficients

βold = βjzj = βj + 1

‖X:,j‖2X>:,jr // gradient step

βj = ST(zj , neλ/ ‖X:,j‖2) // proximal step

r −= X:,j(βj − βold)// update the Jacobian

if Lasso thenJold = JjJj = |signβj |

(Jj + 1

‖X:,j‖2X>:,jdr

)// diff. w.r.t. λ

Jj −= neλ

‖X:,j‖2signβj // diff. w.r.t. λ

drj −= X:,j(Jj − Jold)if wLasso thenJold = Jj,:Jj,: = | signβj |

(Jj,: + 1

‖X:,j‖2X>:,jdr

)// diff. w.r.t. λ1, . . . , λp

Jj,j −= neλj

‖X:,j‖2signβj // diff. w.r.t. λ1, . . . , λp

drj −= X:,j(Jj − Jold)return βniter ,J niter

(λ)

Algorithm 4 describes the backward iterative differentiation algorithm used for benchmark. Backward differentiationrequires the storage of every updates on β. As Figure 1 shows, this algorithm is not efficient for our case because thefunction to differentiate f : R → Rp ( f : Rp → Rp, for the wLasso) has a higher dimension output space than the inputspace. The storage is also an issue mainly for the wLasso case which makes this algorithm difficult to use in practice inour context.

Algorithm 5 presents the classical BCD iterative scheme for solving the Lasso problem using the composition of a gradientstep with the soft-thresholding operator.


Algorithm 4 BACKWARD ITERDIFF (Domke, 2012)

input : X, y, β(0), α, λ, L, niter// backward computation of β and J>

(λ)α

β = β(0) // warm start

// compute the regression coefficients and store the iterates

for k = 0, . . . , niter − 1 dofor j = 0, . . . , p− 1 do




r −= X:,j(βj − βold)// Init. backward differentiation

g = 0 // g stores J>λ α

// compute the Jacobian

for k = niter down to 1 dofor j = 0, . . . , p− 1 do

if Lasso theng −= neλ

‖X:,j‖2αj signβ

(k)j

αj ∗= | signβ(k)j |

α −= 1‖X:,j‖2

αjX>:,jX // O(np)

if wLasso thengj −= neλj

‖X:,j‖2αj signβ

(k)j

αj ∗= | signβ(k)j |

α −= 1‖X:,j‖2

αjX>:,jX

return βniter , g(1)

Algorithm 5 BCD FOR THE LASSO (Friedman et al., 2010)

input : X, y, β(0), λ, L, niterβ = β(0) // warm start

for k = 0, . . . , niter − 1 dofor j = 0, . . . , p− 1 do




r −= X:,j(βj − βold)return βniter


C. Derivations for MCPLet us remind the definition of the Minimax Concave Penalty (MCP) estimator introduced by Zhang (2010), also analyzedunder the name CELE0 by Soubies et al. (2017). First of all, for any t ∈ R:

pMCPλ,γ (t) =

{λ|t| − t2

2γ , if |t| ≤ γλ12γλ

2, if |t| > γλ .(42)

The proximity operator of pλ,γ for parameters λ > 0 and γ > 1 is defined as follow (see Breheny and Huang 2011, Sec.2.1):

proxMCPλ,γ (t) =

{ST(t,λ)

1− 1γ

if |t| ≤ γλt if |t| > γλ .

(43)

For ourselves we choose as for the Lasso an exponential parametrization of the coefficients, for λ ∈ R and γ > 0:

β(λ,γ)(y) , arg minβ∈Rp

1

2n‖y −Xβ‖22 +

p∑j=1

pMCPeλ,eγ (|βj |) . (44)

Update rule for Coordinate Descent Below, we provide equation to update the coefficient in the coordinate descentalgorithm of the MCP:

βj ← arg minβj∈R

1

2n‖y − βjX:,j −

∑j′ 6=j

βj′X:,j′‖22 +

p∑j′ 6=j

pMCPeλ,eγ (βj′) + pMCP

eλ,eγ (βj)

= arg minβj∈R

1

2n‖y − βjX:,j −

∑j′ 6=j

βj′X:,j′‖22 + pMCPeλ,eγ (βj)

= arg minβj∈R

‖X:,j‖22

1

2n

βj − 1

‖X:,j‖22

⟨y −

∑j′ 6=j

βj′X:,j′ , X:,j

⟩2

+1

‖X:,j‖22pMCPeλ,eγ (βj)

= arg min

βj∈R

1

2n

βj − 1

‖X:,j‖22

⟨y −

∑j′ 6=j


⟩2

+1

‖X:,j‖22pMCPeλ,eγ (βj)

= arg min

βj∈R

1

2Lj

βj − 1

‖X:,j‖22

⟨y −

∑j′ 6=j


⟩2

+ pMCPeλ,eγ (βj)

,with Lj ,n

‖X:,j‖22

= proxMCPeλ/Lj ,eγLj

(βj −

1∥∥X2:,j

∥∥X>:,j(Xβ − y), λ

). (45)

One can write the following fixed point equation satisfied by the estimator β, with Lj = ‖X:,j‖2 /n:

βj = proxMCPeλ/Lj ,eγLj

⟨y −∑k 6=j

βkX:,k,X:,j

‖X:,j‖2

⟩= proxMCP

eλ/Lj ,eγLj

(βj −

1

‖X:,j‖2X>:,j

(Xβ − y

)). (46)

Since the MCP penalty is non-convex, the estimator may not be continuous w.r.t. hyperparameters and gradient basedhyperparameter optimization may not be theoretically justified. However we can differentiate the fixed point equation


Imp. F. iterdiff. (ours) F. iterdiff. Grid-search

0 2 410−4

10−3

10−2

10−1

100

Obj

ecti

vem

inus

opti

mum

rcv1 (p=19,959)

0 10 20 3010−2

10−1

100

101

102 20news (p=130,107)

0 2 4Time (s)

10−1

100

Los

son

test

set

0 10 20 30Time (s)

101

102

Figure 5. Computation time for the HO of the MCP on real data Distance to “optimum” (top) and performance (bottom) on the testset for the MCP.

Equation (46) almost everywhere:

Jj =

(Jj −

1

‖X:j‖22X>:jXJ

)·∂ proxMCP

eλ/Lj ,eγLj

∂t

(βj −

1∥∥X2:,j

∥∥X>:,j(Xβ − y)

)

+eλ

Lj

∂ proxMCPeλ/Lj ,eγLj

∂λ

(βj −

1∥∥X2:,j

∥∥X>:,j(Xβ − y)

)

+ eγLj∂ proxMCP

eλ/Lj ,eγLj

∂γ

(βj −

1∥∥X2:,j

∥∥X>:,j(Xβ − y)

). (47)

where

∂ proxMCPλ,γ

∂t(t) =

{ | sign t|1− 1

γ

, if |t| ≤ λγ1, otherwise

, (48)

∂ proxMCPλ,γ

∂λ(t) =

0, if |t| ≤ λ− sign t

1− 1γ

, if λ ≤ |t| ≤ λγ0, if |t| > λγ

, (49)

∂ proxMCPλ,γ

∂γ(t) =

{−ST(t,λ)

(γ−1)2 if |t| ≤ λγ0 if |t| > λγ

. (50)

Contrary to other methods, HO based algorithms do not scale exponentially in the number of hyperparameters. Here wepropose experiments on the held-out loss with the MCP estimator (Zhang, 2010), which has 2 hyperparameters λ and γ.Our algorithm can generalize to such non-smooth proximity-based estimator.

Comments on Figure 5 (MCP, held-out criterion). Figure 5 (top) shows the convergence of the optimum on 2 datasets(rcv1 and 20news) for the MCP estimator. As before implicit forward differentiation outperforms forward differentiationillustrating Proposition 2 and Table 1.


D. Datasets and implementation detailsThe code used to produce all the figures as well as the implementation details can be found in the supplementary material inthe forward_implicit/expes folder. In particular in all experiments, for our algorithm, implicit forward differentiation,the size of the loop computing the Jacobian is fixed: n_iter_jac = 100. Reminding that the goal is to compute the gradient:

J>(λ)∇C(β(λ)

), (51)

we break the loop if‖(J (k+1) − J (k))∇C(β(λ))‖ ≤ ‖∇C(β(λ))‖ × εjac , (52)

with εjac = 10−3. All methods benefit from warm start.

D.1. Details on Figure 1

Figure 1 is done using synthetic data. As described in Section 4.2, X ∈ Rn×p is a Toeplitz correlated ma-trix, with correlation coefficient ρ = 0.9, (n, p) = (1000, 2000). β ∈ Rp is chosen with 5 non-zero coeffi-cients chosen at random. Then y ∈ Rn is chosen to be equal to Xβ contaminated by some i.i.d. random Gaus-sian noise, we chose SNR = 3. For Figure 1 all the implementation details can be found in the joint code in theforward_implicit/examples/plot_time_to_compute_single_gradient.py file. Figure 1 shows the time of compu-tation of one gradient and the distance to ”optimum”. For this figure we evaluated the gradient in λ = λmax − ln(10). The”optimum” is the gradient obtained using the implicit differentiation method.


Let us first begin by a description of all the datasets and where they can be downloaded.

rcv1. The rcv1 dataset can be downloaded here: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#rcv1v2%20(topics;%20subsets). The dataset contains n = 20, 242 sam-ples and p = 19, 959 features.

20news. The 20news dataset can be downloaded here https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#news20. The dataset contains n = 11, 314 samples andp = 130, 107 features.

finance. The finance (E2006-log1p on libsvm) dataset can be downloaded here: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-log1p. The dataset contains n = 16, 087samples and p = 1, 668, 737 features.

All the implementation details can be found in the code: forward_implicit/expes/main_lasso_pred.py.


Figure 3 was performed using simulated data. The matrix X ∈ Rn×p was obtained by simulated n × p i.i.d. GaussianvariablesN (0, 1). The number of rows was fixed at n = 100 and we changed the number of columns p from 200 to 10,000on a linear grid of size 10. Then , we generated β∗ with 5 coefficients equal to 1 and the rest equals to 0. The vector yis equal to Xβ∗ contaminated by some i.i.d. random Gaussian noise controlled by a SNR value of 3. We performed 50repetitions for each value of p and computed the average MSE on these repetitions. The initial value for the line-searchalgorithm was set at λmax + ln(0.7) and the number of iterations for the Jacobian at 500 for the whole experiment. All theimplementation details can be found in the code : forward_implicit/expes/main_lasso_est.py.


Figure 4 was performed using the same simulating process as described above only this time we performed only 25 repeti-tions for each value of p. We had to deal with the fact that Problem (4) is not convex for the weighted Lasso which meansthat our line-search algorithm could get stuck in local minima. In order to alleviate this problem, we introduced Equa-tion (18) to obtain an initial point for the line-search algorithm. We chose the regularization term to be constant and equals

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#rcv1v2%20(topics;%20subsets)

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#rcv1v2%20(topics;%20subsets)

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#news20

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#news20

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-log1p

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-log1p


to C(β(λmax))/10. We used a time treshold of 500 seconds which was hit only by the forward differentiation algorithm forthe wLasso. The details about this experiment can be found in the code : forward_implicit/expes/main_wLasso.py.

Abstract 1. Introduction - arXiv · Implicit differentiation of Lasso-type models for...

Documents

Transcript of Abstract 1. Introduction - arXiv · Implicit differentiation of Lasso-type models for...