# Safe Grid Search with Optimal Complexity

Eugene Ndiaye¹, Tam Le¹, Olivier Fercoq², Joseph Salmon³, Ichiro Takeuchi⁴

**Abstract.** Popular machine learning estimators involve regularization parameters that can be challenging to tune, and standard strategies rely on grid search for this task. In this paper, we revisit the techniques for approximating the regularization path up to a predefined tolerance ϵ in a unified framework, and show that its complexity is O(1/ϵ^{1/d}) for uniformly convex losses of order d ≥ 2 and O(1/√ϵ) for generalized self-concordant functions. This framework encompasses least-squares but also logistic regression, a case that, as far as we know, was not handled as precisely in previous works. We leverage our technique to provide refined bounds on the validation error, as well as a practical algorithm for hyperparameter tuning. The latter has a global convergence guarantee when targeting a prescribed accuracy on the validation set. Last but not least, our approach relieves the practitioner from the (often neglected) task of selecting a stopping criterion when optimizing over the training set: our method automatically calibrates this criterion based on the targeted accuracy on the validation set.

## 1. Introduction

Various machine learning problems are formulated as the minimization of an empirical loss function f plus a regularization function Ω whose calibration is controlled by a hyperparameter λ. The choice of λ is crucial in practice since it directly influences the generalization performance of the estimator, i.e., its score on unseen data sets. The most popular method in such a context is cross-validation (or some variant; see (Arlot & Celisse, 2010) for a detailed review). For simplicity, we investigate here the simplest case, the holdout version.
It consists in splitting the data in two parts: on the first part (training set) the method is trained for a predefined collection of candidates Λ_T := {λ_0, . . . , λ_{T−1}}, and on the second part (validation set) the best parameter is selected among the T candidates.

¹RIKEN AIP ²LTCI, Télécom ParisTech, Université Paris-Saclay ³IMAG, Univ. Montpellier, CNRS, Montpellier, France ⁴Nagoya Institute of Technology. Correspondence to: E. Ndiaye. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

For a piecewise quadratic loss f and a piecewise linear regularization Ω (e.g., the Lasso estimator), Osborne et al. (2000) and Rosset & Zhu (2007) have shown that the set of solutions follows a piecewise linear curve w.r.t. the parameter λ. Several algorithms can generate the full path by maintaining optimality conditions as the regularization parameter varies. This is what LARS performs for the Lasso (Efron et al., 2004), and similar approaches exist for SVM (Hastie et al., 2004) and generalized linear models (GLM) (Park & Hastie, 2007). Unfortunately, these methods have drawbacks that can be critical in many situations:

- Their worst-case complexity, i.e., the number of linear segments, is exponential in the dimension p of the problem (Gärtner et al., 2012), leading to impractical algorithms. Recently, Li & Singer (2018) have shown that for some specific design matrices with n observations, a polynomial complexity of O(np⁶) can be obtained. Note that even in the more favorable case of linear complexity in p, the exact path can be expensive to compute when the dimension p is large.
- They suffer from numerical instabilities due to multiple and expensive inversions of ill-conditioned matrices. As a result, these algorithms may fail before exploring the entire path, a common issue for small regularization parameters.
- They lack flexibility when it comes to incorporating different statistical learning tasks, because they usually rely on specific algebra to handle the structure of the regularization and loss functions. As far as we know, they can be applied only to a limited number of cases, and we are not aware of a general framework that bypasses these issues.
- They cannot benefit from early stopping. Following Bottou & Bousquet (2008), it is not necessary to optimize below the statistical error for suitable generalization. Exact regularization path algorithms need to maintain optimality conditions as the hyperparameter varies, which is time consuming.

To overcome these issues, an ϵ-approximation of the solution path was proposed, and an optimal complexity of O(1/ϵ) was proven by Giesen et al. (2010) in a fairly general setting. Then, Mairal & Yu (2012) provided an interesting algorithm whose complexity is O(1/√ϵ) for the Lasso case. The latter result was then extended by Giesen et al. (2012) to objective functions with a quadratic lower bound, while providing lower and upper bounds of order O(1/√ϵ). Unfortunately, these assumptions fail to hold for many problems, including logistic regression or the Huber loss. Following such ideas, Shibagaki et al. (2015) have proposed, for classification problems, to approximate the regularization path of the holdout cross-validation error. Indeed, the latter is a more natural criterion to monitor when one aims at selecting a hyperparameter guaranteed to achieve the best validation error. The main idea is to construct upper and lower bounds of the validation error as simple functions of the regularization parameter.

|  | Lasso | Logistic regr. |
|---|---|---|
| f_i(z) | (y_i − z)²/2 | log(1 + e^z) − y_i z |
| f_i*(u) | ((u + y_i)² − y_i²)/2 | Nh(u + y_i) |
| V_{f*,x}(u) | ‖u‖₂²/2 | w₄(‖u‖²_x/‖u‖₂) ‖u‖²_x |

Table 1. Loss functions, their conjugates and smoothness bounds, where w₄(τ) = ((1 − τ) log(1 − τ) + τ)/τ² and Nh(x) = x log(x) + (1 − x) log(1 − x).
Hence, by sequentially varying the parameters, one can estimate a range of parameters for which the validation error gap (i.e., the difference with the validation error achieved by the best parameter) is smaller than an accuracy ϵ_v > 0.

**Contributions.** We revisit the approximation of the solution and validation paths in a unified framework under general regularity assumptions commonly met in machine learning. We encompass both classification and regression problems and provide a complexity analysis along with explicit optimality guarantees. We highlight the relationship between the regularity of the loss function and the complexity of the approximation path. We prove that this complexity is O(1/ϵ^{1/d}) for uniformly convex losses of order d ≥ 2 (see Bauschke & Combettes (2011, Definition 10.5)) and O(1/√ϵ) for the logistic loss, thanks to a refined measure of its curvature through its generalized self-concordance properties (Sun & Tran-Dinh, 2017). As far as we know, previously known approximation path algorithms cannot handle these cases. We provide an algorithm with a global convergence property for selecting a hyperparameter whose validation error is ϵ_v-close to that of the optimal hyperparameter from a given grid. We obtain a natural stopping criterion when optimizing over the training set, making this criterion automatically calibrated. Our implementation is available at https://github.com/EugeneNdiaye/safe_grid_search.

Figure 1. Illustration of the approximation path for the Lasso at accuracy ϵ = ‖y‖₂²/20, showing the exact path P_λ(β̂(λ)), the shifted exact path P_λ(β̂(λ)) + ϵ and the approximated path P_λ(β(λ)). We choose λ_max = ‖Xᵀy‖∞ and λ_min = λ_max/50. The shaded gray region shows the interval where any ϵ-path must lie. The exact path is computed with LassoLars on the diabetes data from sklearn.

**Notation.** Given a proper, closed and convex function f : ℝⁿ → ℝ ∪ {+∞}, we denote dom f = {x ∈ ℝⁿ :
If f is a twice continuously differentiable function with positive definite Hessian 2f(x) at any x dom f, we denote z x = p 2f(x)z, z . The Fenchel Legendre transform of f is the function f : Rn R {+ } defined by f (x ) = supx dom f x , x f(x). The support function of a nonempty set C is defined as σC(x) = supc C c, x . If C is closed, convex and contains 0, we define its polar as σ C(x ) = supσC(x) 1 x , x . We denote by [T] the set {1, . . . , T} for any non zero integer T. The vector of observations is y Rn and the design matrix X = [x1, . . . , xn] Rn p has n observations row-wise, and p features (column-wise). 2. Problem setup Let us consider the class of regularized learning methods expressed as convex optimization problems, such as (regularized) GLM (Mc Cullagh & Nelder, 1989): ˆβ(λ) arg min β Rp f(Xβ) + λΩ(β) | {z } Pλ(β) (Primal). (1) We highlight two important cases: the regularized leastsquares and logistic regression where the loss functions are written as an empirical risk f(Xβ) = P i [n] fi(x i β) with the fi s given in Table 1. The penalty term is often used to incorporate prior knowledges by enforcing a certain regularity on the solutions. For instance, choosing a Ridge penalty (Hoerl & Kennard, 1970) Ω( ) = 1 2 2 2 improves the stability of the resolution of inverse problems while Ω( ) = 1 imposes sparsity at the feature level, a motivation that led to the Lasso estimator (Tibshirani, 1996); see also (Bach et al., 2012) for extensions to other structured penalties. In practice, obtaining ˆβ(λ), an exact solution to Problem (1) is unpractical and one aims achieving a prescribed precision ϵ > 0. More precisely, a (primal) vector β(λ) := β(λ,ϵ) (we will drop the dependency in ϵ for readability) is referred to as an ϵ-solution for λ if its (primal) objective value is optimal at precision ϵ: Pλ(β(λ)) Pλ(ˆβ(λ)) ϵ . 
(2)

We recall and illustrate in Figure 1 the notion of approximation path, as described by Giesen et al. (2012).

**Definition 1 (ϵ-path).** A set P_ϵ ⊂ ℝᵖ is called an ϵ-path for a parameter range [λ_min, λ_max] if

∀λ ∈ [λ_min, λ_max], ∃ an ϵ-solution β(λ) ∈ P_ϵ . (3)

We call the path complexity T_ϵ the cardinality of the ϵ-path.

To achieve the targeted ϵ-precision in (2) over a whole path and construct an ϵ-path¹, we rely on duality gap evaluations. For that, we compute ϵ_c-solutions² (for an accuracy ϵ_c < ϵ) over a finite grid, and then we control the gap variations w.r.t. λ to achieve the prescribed ϵ-precision over the whole range [λ_min, λ_max]; see Algorithm 1. We now recall Fenchel duality (Rockafellar, 1997, Chapter 31):

θ̂(λ) ∈ arg max_{θ ∈ ℝⁿ} −f*(−λθ) − λΩ*(Xᵀθ) =: D_λ(θ)   (Dual). (4)

For a primal/dual pair (β, θ) ∈ dom P_λ × dom D_λ, the duality gap is the difference between the primal and dual objectives:

G_λ(β, θ) = f(Xβ) + f*(−λθ) + λ(Ω(β) + Ω*(Xᵀθ)) .

Weak duality yields D_λ(θ) ≤ P_λ(β), whence

P_λ(β) − P_λ(β̂(λ)) ≤ G_λ(β, θ) , (5)

explaining the interest of the duality gap as an optimality certificate. Using (5), we can safely construct an approximation path for Problem (1): if β(λ_t) is an ϵ-solution for λ_t, it is guaranteed to remain one for all parameters λ such that G_λ(β(λ_t), θ(λ_t)) ≤ ϵ. Since the function λ ↦ G_λ(β(λ_t), θ(λ_t)) does not exhibit a simple dependence in λ, we rely on an upper bound on the gap encoding the structural regularity of the loss function (e.g., one-dimensional quadratics for strongly convex functions). This bound controls the optimization error as λ varies while preserving optimal complexity of the approximation path.

## 3. Bounds and approximation path

We introduce the tools to design an approximation path.

### 3.1. Preliminary results and technical tools

**Definition 2.** Given a differentiable function f and x ∈ dom f, let U_{f,x}(·) and V_{f,x}(·) be nonnegative functions that vanish at 0. We say that f is U_{f,x}-convex (resp.
V_{f,x}-smooth) at x when Inequality (6) (resp. (7)) is satisfied for any z ∈ dom f:

U_{f,x}(z − x) ≤ f(z) − f(x) − ⟨∇f(x), z − x⟩ , (6)
V_{f,x}(z − x) ≥ f(z) − f(x) − ⟨∇f(x), z − x⟩ . (7)

¹Note that such a path depends on the exact solutions β̂(λ)'s.
²The c stands for "computational" in ϵ_c.

This extends µ-strong convexity and ν-smoothness (Nesterov, 2004) and encompasses smooth uniformly convex losses as well as generalized self-concordant ones.

**Smooth uniformly convex case:** here U_{f,x}(z − x) = U(‖z − x‖) and V_{f,x}(z − x) = V(‖z − x‖), where U(·) and V(·) are increasing functions from [0, +∞) to [0, +∞] vanishing at 0; see Azé & Penot (1995). Examples of such functions are U(t) = (µ/d) t^d and V(t) = (ν/d) t^d, where d, µ and ν are positive constants. The case d = 2 corresponds to strong convexity and smoothness; in general such functions are called uniformly convex of order d, see (Juditsky & Nesterov, 2014) or (Bauschke & Combettes, 2011, Ch. 10.2 and 18.5) for details.

**Generalized self-concordant case:** a C³ convex function f is (M_f, ν)-generalized self-concordant of order ν ≥ 2 with M_f ≥ 0 if for all x ∈ dom f and u, v ∈ ℝⁿ:

|⟨∇³f(x)[v]u, u⟩| ≤ M_f ‖u‖²_x ‖v‖^{ν−2}_x ‖v‖^{3−ν}₂ .

In this case, Sun & Tran-Dinh (2017, Proposition 10) have shown that one can take

U_{f,x}(y − x) = w_ν(−d_ν(x, y)) ‖y − x‖²_x ,  V_{f,x}(y − x) = w_ν(d_ν(x, y)) ‖y − x‖²_x ,

where the last equality holds if d_ν(x, y) < 1 in the case ν > 2. Closed-form expressions of w_ν(·) and d_ν(·) are recalled in the Appendix for the logistic, quadratic and power losses.

**Approximating the duality gap path.** Assume we have constructed primal/dual feasible vectors for a finite grid of parameters Λ_T = {λ_0, . . . , λ_{T−1}}, i.e., we have at our disposal (β(λ_t), θ(λ_t)) for all λ_t ∈ Λ_T. Let us denote G_t = G_{λ_t}(β(λ_t), θ(λ_t)) and, for ζ_t = −λ_t θ(λ_t), Δ_t = f(Xβ(λ_t)) − f(∇f*(ζ_t)). For any function φ : ℝⁿ → [0, +∞] that vanishes at 0 and any ρ ∈ ℝ, we define

Q_{t,φ}(ρ) = G_t + ρ(Δ_t − G_t) + φ(ρ ζ_t) . (8)

The terms G_t and Δ_t represent a measure of the optimization error at λ_t.
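The quantities in (8) are cheap to manipulate numerically. A minimal sketch, assuming only that ρ ↦ Q_{t,φ}(ρ) is non-decreasing on [0, ρ_max] (which holds for the regularity functions φ used here when Δ_t ≥ G_t), finds the largest ρ ≥ 0 with Q_{t,φ}(ρ) ≤ ϵ by bisection; the quadratic φ chosen below (the smooth case of Table 1) is for illustration only:

```python
import numpy as np

def largest_rho(Q, eps, rho_max=1.0, tol=1e-12):
    """Largest rho in [0, rho_max] with Q(rho) <= eps, for Q non-decreasing.
    Q(0) = G_t <= eps is assumed (the current point is feasible)."""
    if Q(rho_max) <= eps:
        return rho_max
    lo, hi = 0.0, rho_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Q(mid) <= eps else (lo, mid)
    return lo

# Illustration with the quadratic phi of the smooth case:
# Q(rho) = G_t + rho * (Delta_t - G_t) + 0.5 * rho**2 * ||zeta_t||^2
G_t, Delta_t, zeta_sq, eps = 0.0, 0.0, 25.0, 1.0
Q = lambda rho: G_t + rho * (Delta_t - G_t) + 0.5 * rho ** 2 * zeta_sq
rho = largest_rho(Q, eps)
# Here the analytic root of 0.5 * rho^2 * zeta_sq = eps is sqrt(2 * eps / zeta_sq)
print(abs(rho - np.sqrt(2 * eps / zeta_sq)) < 1e-9)  # True
```

Bisection is the fallback used when no closed-form root of Q_{t,φ}(ρ) = ϵ is available, e.g., for the generalized self-concordant bounds w_ν.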
The notation introduced in (8) will be convenient to write concise upper and lower bounds on the duality gap. This is the goal of the next lemma, which leverages the regularity of the loss function introduced in Definition 2 and provides control on how the duality gap deviates when one evaluates it at another (close) parameter λ.

**Lemma 1.** Assume that −λθ(λ_t) ∈ dom f* and Xᵀθ(λ_t) ∈ dom Ω*. If f* is V_{f*}-smooth (resp. U_{f*}-convex)³, then for ρ = 1 − λ/λ_t, the right (resp. left) hand side of Inequality (9) holds true:

Q_{t,U_{f*}}(ρ) ≤ G_λ(β(λ_t), θ(λ_t)) ≤ Q_{t,V_{f*}}(ρ) . (9)

³We drop x in U_{f,x} and write U_f if no ambiguity holds.

*Proof.* The proofs of this result and of the other propositions and theorems are deferred to the Appendix.

Figure 2. Illustration of the construction of ϵ-paths for the Lasso on a synthetic data set generated with sklearn as X, y = make_regression(n = 30, p = 150), at accuracy ϵ = ‖y‖₂²/40 and ϵ_c = ϵ/10, with λ_max = ‖Xᵀy‖∞ and λ_min = λ_max/20: (a) uniform unilateral approximation path (as in Proposition 3); (b) bilateral approximation path (as in Proposition 4). For the Lasso the bounds on the duality gap are piecewise quadratic, and the shaded gray regions correspond to the regions where the true value of the duality gap lies. We obtain a path complexity of T_ϵ = 6 (resp. T_ϵ = 4) for the unilateral (resp. bilateral) path over [λ_min, λ_max].

The function φ, chosen as V_{f*} (resp. U_{f*}) for the upper (resp. lower) bound, essentially captures the regularity needed to approximate the duality gap at λ when using the primal/dual vector (β(λ_t), θ(λ_t)) for λ_t close to λ. When the function satisfies both inequalities, the tightness of the bounds can be related to the conditioning U_{f*}/V_{f*} of the dual loss f*. Equality holds for U_{f*} ≡ V_{f*} ≡ ½‖·‖₂² (least-squares), showing the tightness of the bounds. From Lemma 1, we have G_λ(β(λ_t), θ(λ_t)) ≤ ϵ as soon as Q_{t,V_{f*}}(ρ) ≤ ϵ, where ρ = 1 − λ/λ_t varies with λ.
Hence, we obtain the following proposition, which allows us to track the regularization path at an arbitrary precision on the duality gap. It proceeds by choosing the largest ρ = ρ_t such that the upper bound in Inequality (9) remains below ϵ, and leads to Algorithm 1 for computing an ϵ-path.

**Proposition 1 (Grid for a prescribed precision).** Given (β(λ_t), θ(λ_t)) such that G_t ≤ ϵ_c < ϵ, for all λ ∈ λ_t [1 − ρℓ_t(ϵ), 1 + ρr_t(ϵ)], we have G_λ(β(λ_t), θ(λ_t)) ≤ ϵ, where ρℓ_t(ϵ) (resp. ρr_t(ϵ)) is the largest non-negative ρ s.t. Q_{t,V_{f*}}(ρ) ≤ ϵ (resp. Q_{t,V_{f*}}(−ρ) ≤ ϵ).

Conversely, given a grid⁴ of T parameters Λ_T = {λ_0, . . . , λ_{T−1}}, we define ϵ_{Λ_T}, the error of the approximation path on [λ_min, λ_max], by using a piecewise constant approximation of the map λ ↦ G_λ(β(λ_t), θ(λ_t)):

ϵ_{Λ_T} := max_{λ ∈ [λ_min, λ_max]} min_{λ_t ∈ Λ_T} G_λ(β(λ_t), θ(λ_t)) . (10)

⁴We assume a decreasing order λ_{t+1} < λ_t, reflecting common practice for GLM, e.g., for the Lasso.

This error is however difficult to evaluate in practice, so we rely on a tight upper bound based on Lemma 1 that often leads to closed-form expressions.

**Proposition 2 (Precision for a given grid).** Given a grid of parameters Λ_T, the set {β(λ) : λ ∈ Λ_T} is an ϵ_{Λ_T}-path with ϵ_{Λ_T} ≤ max_{t ∈ [T]} Q_{t,V_{f*}}(1 − λ*_t/λ_t), where for all t ∈ {0, . . . , T − 1}, λ*_t is the largest λ ∈ [λ_{t+1}, λ_t] such that Q_{t,V_{f*}}(1 − λ/λ_t) ≥ Q_{t+1,V_{f*}}(1 − λ/λ_{t+1}).

**Construction of a dual feasible vector.** We rely on gradient rescaling to produce a dual feasible vector.

**Lemma 2.** For any β(λ_t) ∈ ℝᵖ, the vector

θ(λ_t) = −∇f(Xβ(λ_t)) / max(λ_t, σ°_{dom Ω*}(Xᵀ∇f(Xβ(λ_t))))

is feasible: −λθ(λ_t) ∈ dom f*, Xᵀθ(λ_t) ∈ dom Ω*.

**Remark 1.** When the regularization is a norm, Ω(·) = ‖·‖, then σ°_{dom Ω*} is the associated dual norm ‖·‖_*. The dual vector θ(λ_t) of Lemma 2 guarantees that G_t and Δ_t converge to 0 when β(λ_t) converges to β̂(λ_t) (Ndiaye et al., 2017).

**Finding ρ.** Following Proposition 1, a one-dimensional equation Q_{t,V_{f*}}(ρ) = ϵ needs to be solved to obtain an ϵ-path. This can be done efficiently at high precision by numerical solvers if no explicit solution is available.
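Putting Lemma 1, Proposition 1 and Lemma 2 together already yields a working path algorithm. The sketch below is a minimal, self-contained instance for the Lasso, where (cf. Table 1) the dual loss is ½‖·‖²₂-smooth, so Q_{t,V_{f*}}(ρ) = ϵ is a quadratic equation in ρ with an explicit root. It is an illustrative re-implementation under these assumptions, not the authors' released code, and the inner solver is plain proximal gradient (ISTA):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def gap_and_certificates(beta, lam):
    """Duality gap G, Delta and zeta for the Lasso, with the rescaled dual
    point theta = -grad / max(lam, ||X^T grad||_inf) of Lemma 2."""
    grad = X @ beta - y                                  # gradient of f at X beta
    theta = -grad / max(lam, np.max(np.abs(X.T @ grad)))
    zeta = -lam * theta
    # f(z) = ||y - z||^2 / 2 and f*(u) = ||u||^2 / 2 + <u, y>
    G = 0.5 * grad @ grad + 0.5 * zeta @ zeta + zeta @ y + lam * np.abs(beta).sum()
    Delta = 0.5 * grad @ grad - 0.5 * zeta @ zeta        # f(X beta) - f(grad f*(zeta))
    return G, Delta, zeta

def ista(beta, lam, eps_c, L):
    # Proximal gradient until the duality gap certifies an eps_c-solution, cf. (5)
    while gap_and_certificates(beta, lam)[0] > eps_c:
        beta = beta - X.T @ (X @ beta - y) / L
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lam / L, 0.0)
    return beta

def lasso_path(eps, eps_c, lam_min, lam_max):
    """Adaptive unilateral eps-path (Algorithm 1): solve the quadratic
    Q(rho) = eps in closed form for the 1-smooth dual loss."""
    L = np.linalg.norm(X, 2) ** 2                        # Lipschitz constant
    lam, beta, grid = lam_max, np.zeros(p), []
    while True:
        beta = ista(beta, lam, eps_c, L)                 # warm start from previous lam
        grid.append(lam)
        if lam <= lam_min:
            return grid
        G, Delta, zeta = gap_and_certificates(beta, lam)
        delta, delta_bar, z2 = eps - G, Delta - G, zeta @ zeta
        rho = (np.sqrt(2 * delta * z2 + delta_bar ** 2) - delta_bar) / z2
        lam = max(lam * (1 - rho), lam_min)

lam_max = np.max(np.abs(X.T @ y))
eps = 1e-2 * (y @ y)
grid = lasso_path(eps, eps_c=eps / 10, lam_min=lam_max / 20, lam_max=lam_max)
print(len(grid))  # number of grid points needed for this eps-path
```

At λ_max the zero vector is already optimal (its gap vanishes), so the loop starts for free; the grid then contracts geometrically with a data-driven ratio instead of a fixed one.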
As a corollary of Lemma 1 and Proposition 2, we recover the analysis of Giesen et al. (2012):

**Corollary 1.** If the function f* is (ν/2)‖·‖²₂-smooth, the left (ρℓ_t) and right (ρr_t) step sizes defined in Proposition 1 have the closed-form expressions

ρℓ_t = (√(2νδ_t‖ζ_t‖² + δ̄_t²) − δ̄_t) / (ν‖ζ_t‖²) ,  ρr_t = (√(2νδ_t‖ζ_t‖² + δ̄_t²) + δ̄_t) / (ν‖ζ_t‖²) ,

where δ_t := ϵ − G_t and δ̄_t := Δ_t − G_t. This simplifies to δ_t = ϵ − ϵ_c and δ̄_t = 0 when max(G_t, Δ_t) ≤ ϵ_c.

Algorithm 1 training_path
Input: f, Ω, ϵ, ϵ_c, [λ_min, λ_max]
Initialization: t = 0, λ_0 = λ_max, Λ = {λ_max}
repeat
  Get β(λ_t) by solving (1) to accuracy G_t ≤ ϵ_c < ϵ
  Compute the step size ρℓ_t(ϵ) following Proposition 3, 4 or 5
  Set λ_{t+1} = max(λ_t (1 − ρℓ_t), λ_min)
  Λ ← Λ ∪ {λ_{t+1}} and t ← t + 1
until λ_t ≤ λ_min
Return: {β(λ_t) : λ_t ∈ Λ}

### 3.2. Discretization strategies

We now establish new strategies for exploring the hyperparameter space in the search for an ϵ-path. For regularized learning methods, it is customary to start from a large regularizer⁵ λ_0 = λ_max and then to compute β̂(λ_{t+1}) after β̂(λ_t), until the smallest parameter of interest λ_min is reached. Models are thus computed by increasing complexity, allowing important speed-ups due to warm starts (Friedman et al., 2007) when the λ's are close to each other. Knowing λ_t, we provide a recursive strategy to construct λ_{t+1}.

**Adaptive unilateral.** The strategy we call unilateral consists in computing the new parameter as λ_{t+1} = λ_t (1 − ρℓ_t(ϵ)) as in Proposition 1.

**Proposition 3 (Unilateral approximation path).** Assume that f* is V_{f*}-smooth. We construct the grid of parameters Λ^{(u)}(ϵ) = {λ_0, . . . , λ_{T_ϵ−1}} by λ_0 = λ_max, λ_{t+1} = λ_t (1 − ρℓ_t(ϵ)), and (β(λ_t), θ(λ_t)) s.t. G_t ≤ ϵ_c < ϵ for all t. Then the set {β(λ_t) : λ_t ∈ Λ^{(u)}(ϵ)} is an ϵ-path for Problem (1).

This strategy is illustrated in Figure 2(a) on a Lasso case, and stands as a generic way to compute an approximation path for loss functions satisfying the assumptions in Definition 2.

**Adaptive bilateral.**
For uniformly convex functions, we can make a larger step by combining the information given by the left and right step sizes. Assume we explore the parameter range [λ_min, λ_max]. Starting from a parameter λ_t, the next step given by Proposition 1 is λℓ_t := λ_t(1 − ρℓ_t). Then there exists λ_{t′} ≤ λℓ_t such that λr_{t′} := λ_{t′}(1 + ρr_{t′}) = λℓ_t. Thus a larger step can be taken by using λ_{t+1} = λ_t (1 − ρℓ_t)/(1 + ρr_{t′}). However, ρr_{t′} depends on the (approximated) solution β(λ_{t′}), which we do not know before optimizing the problem at parameter λ_{t′} when computing the grid points sequentially in decreasing order, i.e., λ_{t′} ≤ λ_t. We overcome this issue in Lemma 3 by (upper) bounding all the constants in Q_{t′,V_{f*}}(ρ) that depend on the solution β(λ_{t′}) by constants involving only information given by β(λ_t).

⁵For the Lasso one often chooses λ_0 = λ_max := ‖Xᵀy‖∞.

**Lemma 3.** If f is uniformly smooth, then ‖∇f(Xβ(λ_{t′}))‖ ≤ R̃_t, where R̃_t := V_f^{−1}(f(Xβ(λ_t)) + 2ϵ_c/ρℓ_t(ϵ)). If additionally f is uniformly convex, then Δ_{t′} ≤ Δ̃_t, where Δ̃_t := R̃_t U_f^{−1}(ϵ_c), as well as

G_λ(β(λ_{t′}), θ(λ_{t′})) ≤ Q_{t′,V_{f*}}(ρ) ≤ Q̃_{t,V_{f*}}(ρ) , where Q̃_{t,V_{f*}}(ρ) = ϵ_c + ρ(Δ̃_t − ϵ_c) + V_{f*}(|ρ| R̃_t) .

Let us now define

ρ^{(b)}_t(ϵ) = (ρℓ_t(ϵ) + ρ̃r_t(ϵ)) / (1 + ρ̃r_t(ϵ)) ,

where ρℓ_t(ϵ) is defined in Proposition 1 and ρ̃r_t(ϵ) is the largest non-negative ρ such that Q̃_{t,V_{f*}}(ρ) ≤ ϵ in Lemma 3.

**Proposition 4 (Bilateral approximation path).** Assume that f is uniformly convex and smooth. We construct the grid Λ^{(b)}(ϵ) = {λ_0, . . . , λ_{T_ϵ−1}} by λ_0 = λ_max, λ_{t+1} = λ_t (1 − ρ^{(b)}_t(ϵ)), and (β(λ_t), θ(λ_t)) s.t. G_t ≤ ϵ_c < ϵ for all t. Then the set {β(λ_t) : λ_t ∈ Λ^{(b)}(ϵ)} is an ϵ-path for Problem (1).

This strategy is illustrated in Figure 2(b) on a Lasso example.

**Uniform unilateral and bilateral.** In some cases, it may be advantageous to have access to a predefined grid before launching a hyperparameter selection procedure such as Hyperband (Li et al., 2017), or for parallel computation.
Given the information (β(λ_0), θ(λ_0)) from the initialization, we can build a uniform grid that guarantees an ϵ-approximation before solving any optimization problem. Indeed, by applying Lemma 3 at t = 0, we have G_λ(β(λ_t), θ(λ_t)) ≤ Q̃_{0,V_{f*}}(ρ). We can define ρ̃ℓ_0(ϵ) (resp. ρ̃r_0(ϵ)) as the largest non-negative ρ s.t. Q̃_{0,V_{f*}}(ρ) ≤ ϵ (resp. Q̃_{0,V_{f*}}(−ρ) ≤ ϵ), and set

ρ_0(ϵ) = ρ̃ℓ_0(ϵ) for the unilateral path,  ρ_0(ϵ) = (ρ̃ℓ_0(ϵ) + ρ̃r_0(ϵ))/(1 + ρ̃r_0(ϵ)) for the bilateral path. (11)

**Proposition 5 (Uniform approximation path).** Assume that f is uniformly convex and smooth, and define the grid Λ^{(0)}(ϵ) = {λ_0, . . . , λ_{T_ϵ−1}} by λ_0 = λ_max, λ_{t+1} = λ_t (1 − ρ_0(ϵ)), and, for all t ∈ [T], (β(λ_t), θ(λ_t)) s.t. G_t ≤ ϵ_c < ϵ. Then the set {β(λ_t) : λ_t ∈ Λ^{(0)}(ϵ)} is an ϵ-path for Problem (1) with at most

T_ϵ = ⌈log(λ_min/λ_max) / log(1 − ρ_0(ϵ))⌉

grid points.

Figure 3. Computational time and number of grid points needed to reach the same approximation path error as default grids of sizes 10 to 300 (ϵ = 10⁻⁴‖y‖₂ for the least-squares case and ϵ = 10⁻⁴ min(n₁, n₂)/n, where n_i is the number of observations in class i ∈ {0, 1}, for the logistic case): (a) ℓ₁ least-squares regression on the climate data set NCEP/NCAR Reanalysis, with n = 814 observations and p = 73577 features (default, uniform unilateral/bilateral and adaptive unilateral/bilateral grids); (b) ℓ₁ logistic regression on the leukemia data set, with n = 72 observations and p = 7129 features (default and adaptive unilateral grids). We have used the same (vanilla) coordinate descent solver with warm start between parameters for all grids. Note that a smaller grid does not always imply faster computation, as the interplay with the warm start can be intricate in our sequential approach.
### 3.3. Limitations of previous framework

Previous algorithms for computing ϵ-paths were initially developed with a complexity of O(1/ϵ) (Clarkson, 2010; Giesen et al., 2010) for a large class of problems. Yet, losses arising in machine learning often have nicer regularities that can be exploited. This is all the more striking in the Lasso case, where a better complexity of O(1/√ϵ) was obtained by Mairal & Yu (2012) and Giesen et al. (2012). The relation between path complexity and regularity of the objective function remained unclear, and previous methods do not apply to all popular learning problems. For instance, the dual loss f* of the logistic regression is not uniformly smooth, so to apply the previous theory one needs to optimize over a (potentially badly pre-selected) compact set. Consider the one-dimensional toy example where p = 1, β ∈ ℝ, X = Id_p, y = 1 and the loss function is f(Xβ) = log(1 + exp(β)). We have ∇²f(β) = exp(β)/(1 + exp(β))². Then for Problem (1), since P_λ(β̂(λ)) ≤ P_λ(0), we have |β̂(λ)| ∈ [0, log(2)/λ], and only a smoothness constant for the dual that grows exponentially with 1/λ can be estimated at each step. This leads to an unreasonable algorithm with tiny step sizes in Corollary 1. Also, the algorithm proposed by Giesen et al. (2012) cannot be applied to the logistic loss, since the dual function is not polynomial. Our proposed algorithm does not suffer from such limitations, and we introduce a finer analysis that takes into account the regularity of the loss functions.

### 3.4. Complexity and regularity

**Lower bound on path complexity.** For our method, the lower bound on the duality gap quantifies how close the step size from Proposition 1 is to the best possible one for smooth loss functions. Indeed, at an optimal solution, we have G_t = Δ_t = 0. Thus the largest possible step starting at λ_t and moving in decreasing order is given by the smallest λ ∈ [λ_min, λ_t] such that U_{f*}(ρ ζ̂_t) > ϵ, where ζ̂_t = −λ_t θ̂(λ_t).
Hence, any algorithm computing an ϵ-path for a U_{f*}-uniformly convex dual loss necessarily has a complexity of order at least O(1/U_{f*}^{−1}(ϵ)).

**Upper bounds.** Recall that we write T_ϵ for the complexity of our proposed approximation path, i.e., the cardinality of the grid returned by Algorithm 1. In the following proposition, we bound this complexity w.r.t. the regularity of the loss function. Discussions on the constants and assumptions are provided in the Appendix.

**Proposition 6 (Approximation path: complexity).** Assuming that max(G_t, Δ_t) ≤ ϵ_c < ϵ at each step t, there exists an explicit constant C_f(ϵ_c) > 0 such that

T_ϵ ≤ ⌈ log(λ_max/λ_min) C_f(ϵ_c) / W_f(ϵ − ϵ_c) ⌉ , (13)

where W_f(·) = V_{f*}^{−1}(·) if f is uniformly convex and smooth, and the analogous inverse of the curvature bound w_ν (whose expression is given in the Appendix) if f is generalized self-concordant and uniformly smooth. Moreover, C_f(ϵ_c) is a uniform upper bound on ‖ζ_t‖ along the path, which tends to a constant C_f when ϵ_c goes to 0.

Proposition 6 applied to the special case where f is ν-smooth and µ-strongly convex yields a complexity of order

log(λ_max/λ_min) √( (ν/µ) f(Xβ(λ_0)) / (ϵ − ϵ_c) )

for any data X, y. This bound does not explicitly depend on the dimensions n and p, and is thus more scalable.

Algorithm 2 ϵ_v-path for the validation set
Input: f, Ω, ϵ_v, [λ_min, λ_max]
Compute ϵ_{v,µ} as in Proposition 7
Λ(ϵ_{v,µ}) = training_path(f, Ω, ϵ_{v,µ}, [λ_min, λ_max])
Return: Λ(ϵ_{v,µ})

## 4. Validation path

To achieve good generalization performance, estimators defined as solutions of Problem (1) require a careful adjustment of λ to balance data fitting and regularization. A standard approach to calibrating such a parameter is to select it by comparing the validation errors on a finite grid (say, with K-fold cross-validation). Unfortunately, it is often difficult to determine a priori the grid limits, the number of λ's (number of points in the grid), or how they should be distributed to achieve a low validation error.
Considering the validation data (X′, y′) with n′ observations and a loss⁶ L, we define the validation error for β ∈ ℝᵖ as

E_v(β) = L(y′, X′β) . (14)

For selecting a hyperparameter, we leverage our approximation path to solve the bi-level problem

arg min_{λ ∈ [λ_min, λ_max]} E_v(β̂(λ)) = L(y′, X′β̂(λ))  s.t.  β̂(λ) ∈ arg min_{β ∈ ℝᵖ} f(Xβ) + λΩ(β) .

Recent works have addressed this problem with gradient-based algorithms; see for instance Pedregosa (2016) and Franceschi et al. (2018), who have shown promising results in computational time and scalability w.r.t. multiple hyperparameters. Yet, they require assumptions such as smoothness of the validation function E_v and a non-singular Hessian of the inner optimization problem at the optimal values, which are difficult to check in practice since they depend on the optimal solutions β̂(λ). Moreover, they can only guarantee convergence to a stationary point. In this section, we generalize the approach of Shibagaki et al. (2015) and show that, with a safe and simple exploration of the parameter space, our algorithm has a global convergence property. For that, we assume the following conditions on the validation loss and on the inner optimization objective throughout the section:

A1: |L(a, b) − L(a, c)| ≤ L(b, c) for any a, b, c ∈ ℝ^{n′}.
A2: The function β ↦ P_λ(β) is µ-strongly convex.

⁶The data-fitting terms might differ from training to testing; for instance, for logistic regression the ℓ_{0/1}-loss is used for validation while the logistic function is optimized at training.

The assumption on the loss function is verified for norms (regression) and indicator functions (classification). Indeed, if L(a, b) = ‖a − b‖, A1 corresponds to the triangle inequality. For the ℓ_{0/1}-loss L(a, b) = (1/n′) Σ_{i=1}^{n′} 1_{a_i b_i < 0}, since for any reals s, u and v, |1_{us<0} − 1_{uv<0}| ≤ 1_{sv<0}, one has

| (1/n′) Σ_{i=1}^{n′} 1_{a_i b_i < 0} − (1/n′) Σ_{i=1}^{n′} 1_{a_i c_i < 0} | ≤ (1/n′) Σ_{i=1}^{n′} 1_{b_i c_i < 0} .

**Definition 3.**
Given a primal solution β̂(λ) for parameter λ and a primal point β(λ_t) returned by an algorithm, we define the gap on the validation error between λ_t and λ as

E_v(λ_t, λ) := |E_v(β̂(λ)) − E_v(β(λ_t))| . (15)

Suppose we have fixed a tolerance ϵ_v on the gap on the validation error, i.e., we target E_v(λ_t, λ) ≤ ϵ_v. Based on Assumption A1, if there is a region R_λ that contains the optimal solution β̂(λ) at parameter λ, then we have

E_v(λ_t, λ) ≤ L(X′β̂(λ), X′β(λ_t)) ≤ max_{β ∈ R_λ} L(X′β, X′β(λ_t)) .

A simple strategy consists in choosing R_λ as a ball.

**Lemma 4 (Gap safe region, Ndiaye et al. (2017)).** Under A2, any primal solution β̂(λ) belongs to the Euclidean ball with center β(λ_t) and radius

r_{t,µ}(λ) = √( (2/µ) G_λ(β(λ_t), θ(λ_t)) ) . (16)

Such safe balls leveraging the duality gap have proved useful to speed up sparse optimization solvers. The improved performance relies on the ability to identify the sparsity structure of the optimal solutions; approaches of this type are referred to as safe screening rules, as they provide safe certificates for such structures (El Ghaoui et al., 2012; Fercoq et al., 2015; Shibagaki et al., 2016; Ndiaye et al., 2017). Since the radius in Equation (16) depends explicitly on the duality gap, we can sequentially track a range of parameters for which the gap on the validation error remains below a prescribed tolerance by controlling the optimization error.

**Proposition 7 (Grid for a prescribed validation error).** Under Assumptions A1 and A2, define, for each index i ∈ [n′] in the validation set, ξ_i = |x′_iᵀβ(λ_t)|/‖x′_i‖₂ and

ϵ_{v,µ} = (µ/2) (ϵ_v/‖X′‖₂)²  (regression),  ϵ_{v,µ} = (µ/2) ξ²_{(⌈n′ϵ_v⌉+1)}  (classification), (17)

where ξ_{(⌈n′ϵ_v⌉+1)} is the (⌈n′ϵ_v⌉ + 1)-th smallest value of the ξ_i's. Given (β(λ_t), θ(λ_t)) such that G_t ≤ ϵ_{v,µ}, we have E_v(λ_t, λ) ≤ ϵ_v for every parameter λ in the interval λ_t [1 − ρℓ_t(ϵ_{v,µ}), 1 + ρr_t(ϵ_{v,µ})], where ρℓ_t(ϵ_{v,µ}) and ρr_t(ϵ_{v,µ}) are defined in Proposition 1.
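Lemma 4 and the classification part of Proposition 7 take only a few lines. The sketch below assumes (an assumption of this illustration) that the thresholds ξ_i are the validation margins |x′_iᵀβ(λ_t)| normalized by the row norms ‖x′_i‖₂, so that a sign flip of x′_iᵀβ forces β̂ to leave the safe ball:

```python
import numpy as np

def safe_radius(gap, mu):
    # Lemma 4: beta_hat(lam) lies in a ball of this radius around beta(lam_t)
    return np.sqrt(2.0 * gap / mu)

def eps_v_mu_classification(X_val, beta, mu, eps_v):
    """Duality-gap tolerance ensuring the 0/1 validation error moves by at
    most eps_v: allow at most ceil(n_val * eps_v) sign flips of the margins.
    Margins normalized by validation row norms (illustrative assumption)."""
    n_val = X_val.shape[0]
    xi = np.abs(X_val @ beta) / np.linalg.norm(X_val, axis=1)
    k = int(np.ceil(n_val * eps_v))       # number of flips we can afford
    r_max = np.sort(xi)[k]                # (k+1)-th smallest normalized margin
    return 0.5 * mu * r_max ** 2          # gap <= this  <=>  radius <= r_max

rng = np.random.default_rng(3)
X_val = rng.standard_normal((40, 10))
beta = rng.standard_normal(10)
tol = eps_v_mu_classification(X_val, beta, mu=2.0, eps_v=0.1)
# The resulting safe ball can flip at most ceil(40 * 0.1) = 4 of the 40 signs
print(np.isclose(safe_radius(tol, mu=2.0),
                 np.sort(np.abs(X_val @ beta) / np.linalg.norm(X_val, axis=1))[4]))
```

Used as the stopping criterion of Remark 2, this tolerance lets the training solver stop as soon as the gap falls below `tol`, with no hand-tuned ϵ_c.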
Figure 4. Safe selection of the optimal hyperparameter for the Elastic Net on the validation set (30% of the observations), on synthetic data sets generated with the sklearn commands (a) X, y = make_sparse_uncorrelated(n = 30, p = 50) and (b) X, y = make_regression(n = 500, p = 5000). Each panel shows the validation curve at machine precision together with the parameters selected at low precision 10 δ_v and high precision δ_v/10, where δ_v = max_{λ_t ∈ Λ} E_v(β(λ_t)) − min_{λ_t ∈ Λ} E_v(β(λ_t)) and Λ is the default grid of size T = 200 between λ_max = ‖Xᵀy‖∞ and λ_min = λ_max/100. The stars represent the worst-case solutions among the ones generated by Algorithm 2 (with the bilateral path). For loose precision, suboptimal parameters may be identified, but better ones are found as the accuracy ϵ_v decreases.

**Remark 2 (Stopping criterion for training).** For the current parameter λ_t, E_v(λ_t, λ_t) ≤ ϵ_v as soon as G_t ≤ ϵ_{v,µ}, which gives us a stopping criterion for optimizing on the training part (X, y) relative to the desired accuracy ϵ_v on the validation data (X′, y′). This has the appealing property of relieving the practitioner from selecting the stopping criterion ϵ_c when optimizing on the training set.

Algorithm 2 outputs a discrete set of parameters Λ(ϵ_{v,µ}) such that {β(λ_t) : λ_t ∈ Λ(ϵ_{v,µ})} is an ϵ_v-path for the validation error. Thus, for any λ ∈ [λ_min, λ_max], there exists λ_t ∈ Λ(ϵ_{v,µ}) such that

E_v(β(λ_t)) − ϵ_v ≤ E_v(β̂(λ)) . (18)

The following proposition is obtained by taking the minimum on both sides of the inequality.

**Proposition 8.** Under Assumptions A1 and A2, the set {β(λ_t) : λ_t ∈ Λ(ϵ_{v,µ})} is an ϵ_v-path for the validation error and

min_{λ_t ∈ Λ(ϵ_{v,µ})} E_v(β(λ_t)) − min_{λ ∈ [λ_min, λ_max]} E_v(β̂(λ)) ≤ ϵ_v .
Numerical experiments We illustrate our method on ℓ1-regularized least squares and logistic regression by comparing the computational times and number of grid points needed to compute an ϵ-path for a given range [λmin, λmax] for several strategies. The Default grid is the one used by default in the packages glmnet (Friedman et al., 2010) and sklearn (Pedregosa et al., 2011). It is defined as λt = λmax 10 δt/(T 1) (here δ = 3). The proposed grids are the adaptive unilateral/bilateral and uniform unilateral/bilateral grids that are defined in Propositions 3, 4 and 5. Thanks to Proposition 2, we measure the approximation path error ϵ of the default grid of size T and report the times and numbers of grid points Tϵ needed to achieve such a precision. Our experiments were conducted on the leukemia dataset, available in sklearn and the climate dataset NCEP/NCAR Reanalysis (Kalnay et al., 1996). The optimization algorithms are the same for all the grid, hence we compare only the grid construction impact. Results are reported in Figure 3 for classification and regression problem. Our approach leads to better guarantees for approximating the regularization path w.r.t. the default grid and often significant gain in computing time. Figure 4 illustrates convergence for Elastic Net (Enet) (Zou & Hastie, 2005), on synthetic data generated by sklearn as random regression problems make regression and make sparse uncorrelated (Celeux et al., 2012). For a decreasing levels of validation error, we represent the λ selected by our algorithm and its corresponding safe interval. Even when the validation curve is non smooth and non convex, the output of the safe grid search converges to the global minimum as stated in Proposition 8. 6. Conclusion We have shown how to efficiently construct one dimensional grids of regularization parameters for convex risk minimization, and to get an automatic calibration, optimal in term of hold-out test error. 
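For reference, the default geometric grid of glmnet and sklearn, $\lambda_t = \lambda_{\max}\, 10^{-\delta t/(T-1)}$, can be generated in a few lines. This is a small sketch of ours for illustration; the function name is not from either package.

```python
import numpy as np

def default_grid(lambda_max, T=100, delta=3):
    # Geometric grid lambda_t = lambda_max * 10^(-delta * t / (T - 1)),
    # t = 0, ..., T-1, spanning [lambda_max * 10^-delta, lambda_max].
    t = np.arange(T)
    return lambda_max * 10.0 ** (-delta * t / (T - 1))

grid = default_grid(lambda_max=1.0, T=5)
# grid[0] == 1.0, grid[-1] == 1e-3, and the ratio between consecutive
# grid points is constant (equal to 10^(-delta / (T - 1))).
```

The adaptive grids studied in this section replace this fixed geometric spacing with step sizes validated by Propositions 3, 4 and 5.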
Future research could examine how to adapt our framework to multi-dimensional parameter grids. This case is all the more interesting as it naturally arises when addressing non-convex problems, e.g., MCP or SCAD, with re-weighted ℓ1-minimization. Approximating a full path then requires optimizing up to precision $\epsilon_c$ at each step, even for non-promising hyperparameters, which is time consuming. Combining our approach with safe elimination procedures could provide faster hyperparameter selection algorithms.

Acknowledgements

This work was supported by the Chair Machine Learning for Big Data at Télécom ParisTech. TL acknowledges the support of JSPS KAKENHI Grant number 17K12745.

References

Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

Azé, D. and Penot, J.-P. Uniformly convex and uniformly smooth convex functions. In Annales de la faculté des sciences de Toulouse, pp. 705–730. Université Paul Sabatier, 1995.

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Convex optimization with sparsity-inducing norms. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2011. ISBN 978-1-4419-9466-0.

Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In NIPS, pp. 161–168, 2008.

Celeux, G., Anbari, M. E., Marin, J.-M., and Robert, C. P. Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Analysis, 7(2):477–502, 2012.

Clarkson, K. L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63:1–63:30, 2010.

Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. Least angle regression. Ann. Statist., 32(2):407–499, 2004. With discussion, and a rejoinder by the authors.
El Ghaoui, L., Viallon, V., and Rabbani, T. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

Fercoq, O., Gramfort, A., and Salmon, J. Mind the duality gap: safer rules for the lasso. In ICML, pp. 333–342, 2015.

Franceschi, L., Frasconi, P., Salzo, S., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In ICML, pp. 1563–1572, 2018.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2):302–332, 2007.

Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33(1):1, 2010.

Gärtner, B., Jaggi, M., and Maria, C. An exponential lower bound on the complexity of regularization paths. Journal of Computational Geometry, 2012.

Giesen, J., Jaggi, M., and Laue, S. Approximating parameterized convex optimization problems. In European Symposium on Algorithms, pp. 524–535, 2010.

Giesen, J., Müller, J. K., Laue, S., and Swiercy, S. Approximating concavely parameterized optimization problems. In NIPS, pp. 2105–2113, 2012.

Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. The entire regularization path for the support vector machine. J. Mach. Learn. Res., 5:1391–1415, 2004.

Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Juditski, A. and Nesterov, Y. Primal-dual subgradient methods for minimizing uniformly convex functions. arXiv preprint arXiv:1401.1792, 2014.

Kalnay, E., Kanamitsu, M., Kistler, R., Collins, W., Deaven, D., Gandin, L., Iredell, M., Saha, S., White, G., Woollen, J., et al. The NCEP/NCAR 40-year reanalysis project. Bulletin of the American Meteorological Society, 77(3):437–471, 1996.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(1):6765–6816, 2017.

Li, Y. and Singer, Y. The well-tempered lasso. In ICML, 2018.

Mairal, J. and Yu, B. Complexity analysis of the lasso regularization path. In ICML, pp. 353–360, 2012.

McCullagh, P. and Nelder, J. A. Generalized Linear Models. Chapman & Hall, 1989.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res., 18(128):1–33, 2017.

Nesterov, Y. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.

Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000.

Park, M. Y. and Hastie, T. L1-regularization path algorithm for generalized linear models. J. Roy. Statist. Soc. Ser. B, 69(4):659–677, 2007.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In ICML, pp. 737–746, 2016.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.

Rockafellar, R. T. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ, 1997. Reprint of the 1970 original, Princeton Paperbacks.

Rosset, S. and Zhu, J. Piecewise linear regularized solution paths. Ann. Statist., 35(3):1012–1030, 2007.

Shibagaki, A., Suzuki, Y., Karasuyama, M., and Takeuchi, I. Regularization path of cross-validation error lower bounds. In NIPS, pp. 1666–1674, 2015.

Shibagaki, A., Karasuyama, M., Hatano, K., and Takeuchi, I. Simultaneous safe screening of features and samples in doubly sparse modeling. In ICML, pp. 1577–1586, 2016.

Sun, T. and Tran-Dinh, Q. Generalized self-concordant functions: A recipe for Newton-type methods. Mathematical Programming, 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B, 67(2):301–320, 2005.