# provably_tuning_the_elasticnet_across_instances__3d2390ed.pdf

Provably tuning the Elastic Net across instances

Maria-Florina Balcan Mikhail Khodak Dravyansh Sharma Ameet Talwalkar

Carnegie Mellon University

An important unresolved challenge in the theory of regularization is to set the regularization coefﬁcients of popular techniques like the Elastic Net with general provable guarantees. We consider the problem of tuning the regularization parameters of Ridge regression, LASSO, and the Elastic Net across multiple problem instances, a setting that encompasses both cross-validation and multi-task hyperparameter optimization. We obtain a novel structural result for the Elastic Net which characterizes the loss as a function of the tuning parameters as a piecewise-rational function with algebraic boundaries. We use this to bound the structural complexity of the regularized loss functions and show generalization guarantees for tuning the Elastic Net regression coefﬁcients in the statistical setting. We also consider the more challenging online learning setting, where we show vanishing average expected regret relative to the optimal parameter pair. We further extend our results to tuning classiﬁcation algorithms obtained by thresholding regression ﬁts regularized by Ridge, LASSO, or Elastic Net. Our results are the ﬁrst general learning-theoretic guarantees for this important class of problems that avoid strong assumptions on the data distribution. Furthermore, our guarantees hold for both validation and popular information criterion objectives.

1 Introduction

Ridge regression [30, 43], LASSO [41], and their generalization the Elastic Net [28] are among the most popular algorithms in machine learning and statistics, with applications to linear classiﬁcation, regression, data analysis, and feature selection [15, 46, 28, 20, 24]. Given a supervised dataset (X, y) 2 Rm p Rm with m datapoints and p features, these algorithms compute the linear predictor

λ1,λ2 = arg min

β2Rp ky Xβk2

2 + λ1kβk1 + λ2kβk2

Here λ1, λ2 0 are regularization coefﬁcients constraining the 1 and 2 norms, respectively, of the model β. For general λ1 and λ2 the above algorithm is the Elastic Net, while setting λ1 = 0 recovers Ridge and setting λ2 = 0 recovers LASSO.

These coefﬁcients play a crucial role across ﬁelds: in machine learning controlling the norm of β implies provable generalization guarantees and prevent over-ﬁtting in practice [34], in data analysis their combined use yields parsimonious and interpretable models [28], and in Bayesian statistics they correspond to imposing speciﬁc priors on β [35, 33]. In practice, λ2 regularizes β by uniformly shrinking all coefﬁcients, while λ1 encourages the model vector to be sparse. This means that while they do yield learning-theoretic and statistical beneﬁts, setting them to be too high will cause models to under-ﬁt the data. The question of how to set the regularization coefﬁcients becomes even more unclear in the case of the Elastic Net, as one must juggle trade-offs between sparsity, feature correlation, and bias when setting both λ1 and λ2 simultaneously. As a result, there has been intense empirical and theoretical effort devoted to automatically tuning these parameters. Yet the

Correspondence: dravyans@cs.cmu.edu. Author emails: {ninamf,dravyans}@cs.cmu.edu, {khodak,talwalkar}@cmu.edu

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

state-of-the-art is quite unsatisfactory: proposed work consists of either heuristics without formal guarantees [26, 31], approaches that optimize over a ﬁnite grid or random set instead of the full continuous domain [17], or analyses that involve very strong theoretical assumptions [44].

In this work, we study a variant on the above well-established and intensely studied formulation. The key distinction is that instead of a single dataset (X, y), we consider a collection of datasets or instances of the same underlying regression problem (X(i), y(i)) and would like to learn a pair (λ1, λ2) that selects a model in equation (1) that has low loss on a validation dataset. This can be

useful to model practical settings, for example where new supervised data is obtained several times or where the set of features may change frequently [19]. We do not require all examples across datasets to be i.i.d. draws from the same data distribution, and can capture more general data generation scenarios like cross-validation and multi-task learning [45]. Despite these advantages, we remark that our problem formulation is quite different from the standard single dataset setting. Our formulation treats the selection of regularization coefﬁcients as data-driven algorithm design, which is often used to study combinatorial problems [27, 3], and has connections to meta-learning [12].

Our main contribution is a new structural result for the Elastic Net Regression problem, which implies generalization guarantees for selecting Elastic Net Regression coefﬁcients in the multiple-instance setting. In particular, Ridge and LASSO regressions are special cases. We extend our results to obtain low regret in the online learning setting, and to tuning related linear classiﬁcation algorithms. In summary, we make the following key contributions:

We formulate the problem of tuning the Elastic Net as a question of learning λ1 and λ2 simulta-

neously across multiple problem instances, either generated statistically or coming online. Our formulation captures relevant settings like cross-validation and multi-task learning.

We provide a novel structural result (Theorem 2.2) that characterizes the loss of the Elastic Net ﬁt.

We show that the hyperparameter space can be partitioned by polynomial curves of bounded degrees into pieces where the loss is a bivariate rational function. The result holds for both the usual Elastic Net validation objective and when it is augmented with information criteria like the AIC or BIC.

An important consequence of our structural result is a bound on the pseudo-dimension (Deﬁnition

5) for the loss function class, which yields strong generalization bounds for tuning λ1 and λ2 simultaneously in the statistical learning setting (Theorem 3.2). Informally, for Elastic Net regression problems with at most p parameters, for any problem distribution D, we show that O

2 (p2 log 1

problem instances (or datasets) are sufﬁcient to learn an -approximation to the best (λ1, λ2), with probability at least 1 δ.

In the online setting, we show under very mild data assumptions much weaker than prior

work that the problem satisﬁes a dispersion condition [6, 9]. As a result we can tune all parameters across a sequence of instances appearing online and obtain vanishing regret relative to the optimal parameter in hindsight over the sequence (Theorem 3.3) at the rate O(1/

T)2 wrt the length T of the sequence.

We also give distributional and online learning results for regularized classiﬁers (Theorems 4.1, 4.2).

We include a couple of remarks to emphasize the generality and signiﬁcance of our results. First, in our multiple-instance formulation the different problem instances need not have the same number of examples, or even the same set of features. This allows us to handle practical scenarios where the set of features changes across datasets, and we can learn parameters that work well on average across multiple different but related regression tasks. Second, by generating problem instances iid from a ﬁxed (training + validation) dataset, we can obtain iterations (training/validation splits) of popular cross-validation techniques (including the popular leave-one-out and Monte Carlo CV) and our result implies that O(p2/ 2) iterations are enough to determine an Elastic Net parameter ˆλ with loss within (w.h.p.) of the optimal parameter λ over the distribution induced by the cross-validation splits.

Key challenges and insights. A major challenge in learning the Elastic Net parameters is that the variation of the solution path as a function of λ2 is hard to characterize. Indeed the original Elastic Net paper [47] suggests using the heuristic of grid search to learn a good λ2, even though λ1 may be exactly optimized by computing full solution paths (for each λ2). We approach this indirectly by utilizing a

2The soft-O notation is used to emphasize dependence on T, and suppresses other factors as well as logarithmic terms.

characterization of the LASSO solution by [42], which is based on the KKT (Karush Kuhn Tucker) optimality conditions, to arrive at a precise piecewise structure for the problem. In more detail, we use these conditions to come up with a set of algebraic curves (polynomial equations in λ1 and λ2) of bounded degrees, such that the set of possible discontinuities is contained within the zero-set of these curves, and the loss function behaves well in the each piece of the partition of the parameter domain by these curves. This characterization is crucial in establishing a bound on the structural complexity needed to provide strong generalization guarantees. We further show additional structure on these algebraic curves that (roughly speaking) imply that the curves do not concentrate in any region of the domain, allowing us to use the powerful recipe of [8] for online learning.

Related work. Model selection for Ridge regression, LASSO or Elastic Net typically involves selecting the regularization parameter λ for given data, although some parameter-free techniques for variable selection have been recently proposed [32]. Choosing optimal parameters for tuning the regularization has been a subject of extensive theoretical and applied research. Much of this effort is heuristic [26, 31] or focused on developing tuning objectives beyond validation accuracy like AIC or BIC [1, 39] without providing procedures for provably optimizing them. The standard approach given a tuning objective is to optimize it over a grid or random set of parameters, for which there are guarantees [17], but this does not ensure optimality over the entire continuous tuning domain, especially since objectives such as 0-1 validation error or information criteria can have many discontinuities. Selecting a grid that is too ﬁne or too coarse can result in either very inefﬁcient or highly inaccurate estimates (respectively) for good parameters. Other guarantees make strong assumptions on the data distribution such as sub-Gaussian noise [44, 16] or depend on unknown parameters that are hard to quantify in practice [23]. Recent work has shown asymptotic consistency of cross-validation for ridge regression, even in the limiting case λ2 ! 0 which is particularly interesting for the overparameterized regime [29, 36]. A successful line of work has focused on efﬁciently obtaining models for different values of λ1 using regularization paths [22], but the guarantees are computational rather than learning-theoretic or statistical. In contrast, we provide principled approaches that guarantee near-optimality of selected parameters with high conﬁdence over the entire continuous domain of parameters.

Data-driven algorithm design has proved successful for tuning parameters for a variety of combinatorial problems like clustering, integer programming, auction design and graph-based learning [7, 11, 5, 4]. We provide an application of these techniques to parameter tuning in a problem that is not inherently combinatorial by revealing a novel discrete structure. We identify the underlying piecewise structure of the Elastic Net loss function which is extremely effective in establishing learning-theoretic guarantees [10]. To exploit this piecewise structure, we analyze the learning-theoretic complexity of rational algebraic function classes and infer generalization guarantees. We also employ and extend general tools and techniques for online data-driven learning from [8, 4] to rational functions in order to prove our online learning guarantees for regularization coefﬁcient tuning.

2 Preliminaries and a Key Structural Result

Given data (X, y) with X 2 Rm p and y 2 Rm, consisting of m labeled examples with p features, we seek estimators β 2 Rp which minimize the regularized loss. Popular regularization methods like LASSO and Elastic Net can be expressed as computing the solution of an optimization problem

λ,f 2 arg min

β2Rp ky Xβk2

2 + hλ, f(β)i

where f : Rp ! Rd

0 gives the regularization penalty for estimator β, λ 2 Rd

0 is the regularization parameter, and d is the number of regularization parameters. d = 1 for Ridge and LASSO, and d = 2 for the Elastic Net. Setting f = f2 with f2(β) = kβk2

2 yields Ridge regression, and setting f(β) = f1(β) := kβk1 corresponds to LASSO. Also using f EN(β) = (f1(β), f2(β)) gives the Elastic Net with regularization parameter λ = (λ1, λ2). Note that we use the same λ (with some notational overloading) to denote the regularization parameters for ridge, LASSO, or Elastic Net. We write ˆβ(X,y)

λ,f as simply ˆβλ,f when the dataset (X, y) is clear from context. On any instance x 2 Rp from the feature space, the prediction of the regularized estimator is given by the dot product hx, ˆβλ,fi. The average squared loss over a dataset (X0, y0) with X0 2 Rm0 p and y0 2 Rm0 is given

by lr(ˆβλ,f, (X0, y0)) = 1 m0

###y0 X0 ˆβλ,f

2. By setting (X0, y0) to be the training data (X, y), we

get the training loss lr(ˆβλ,f, (X, y)). We use (Xval, yval) to denote a validation split.

Distributional and Online Settings. In the distributional or statistical setting, we receive a collection of n instances of the regression problem P (i) = (X(i), y(i), X(i)

val ) 2 Rmi,pi,m0

i := Rmi pi Rmi Rm0

i for i 2 [n] generated i.i.d. from some problem distribution D. The problems are in the problem space given by m,p = S

m1 0,m2 m,p1 p Rm1,p1,m2 (note that the problem distribution D is over m,p). On any given instance P (i) the loss is given by

the squared loss on the validation set, EN(λ, P (i)) = lr(ˆβ(X(i),y(i))

λ,f EN , (X(i)

val )). On the other hand, in the online setting, we receive a sequence of T instances of the Elastic Net regression problem P (i) = (X(i), y(i), X(i)

val ) 2 m,p for i 2 [T] online. On any given instance P (i), the online learner is required to select the regularization parameter λ(i) without observing y(i)

experiences loss given by (λ(i), P (i)) = lc(ˆβ(X(i),y(i))

λ(i),f EN , (X(i)

val )). The goal is to minimize the regret w.r.t. choosing the best ﬁxed parameter in hindsight for the same problem sequence, i.e. RT = PT

i=1 (λ(i), P (i)) minλ

i=1 (λ, P (i)). We also deﬁne average regret as 1 T RT and expected regret as E[RT ] where the expectation is over both the randomness of the loss functions and any random coins used by the online algorithm.

Given a class of regularization algorithms A parameterized by regularization parameter λ over a set of problem instances X, and given loss function : A X ! R which measures the loss of any algorithm in A on any ﬁxed problem instance, consider the set of functions HA = { (A, ) | A 2 A}. For example, for the Elastic Net we have EN(λ, P) = lr(ˆβ(XP ,y P )

λ,f EN , (X0

P )), where (XP , y P ) and (X0

P ) are the training and validation sets associated with problem P 2 X respectively. Bounding the pseudo-dimension of HA gives a bound on the sample complexity for uniform convergence guarantees, i.e. a bound on the sample size n for which the algorithm ˆAS 2 A which minimizes the average loss on any sample S of size n drawn i.i.d. from any problem distribution D is guaranteed to be near-optimal with high probability [21]. See Appendix A for the relevant classic deﬁnitions and results. Deﬁne the dual class H of a set of real-valued functions H 2X as H = {h

x : H ! R | x 2 X} where h

x(h) = h(x). In the context of regression problems X, for each ﬁxed problem instance x 2 X there is a dual function h

x that computes the loss (A, x) for any (primal) function h A = (A, ) 2 HA. For a function class H, showing that dual class H is piecewise-structured in the sense of Deﬁnition 1 and bounding the complexity of the duals of boundary and piece functions of H are useful to understand the learnability of H [10]. Deﬁnition 1 (Piecewise structured functions, [10]). A function class H RX that maps a domain X to R is (F, G, k)-piecewise decomposable for a class G {0, 1}X of boundary functions and a class F RX of piece functions if the following holds: for every h 2 H, there are k boundary functions g1, . . . , gk 2 G and a piece function fb 2 F for each bit vector b 2 {0, 1}k such that for all x 2 X, h(x) = fbx(x) where bx = (g1(x), . . . , gk(x)) 2 {0, 1}k.

Intuitively, a real-valued function is piecewise-structured if the domain can be divided into pieces by a ﬁnite number of boundary functions (say linear or polynomial thresholds) and the function value over each piece is easy to characterize (e.g. constant, linear, polynomial). To state and understand our structural insights into the Elastic Net problem we will also need the deﬁnition of equicorrelation sets, the subset of features with maximum absolute correlation for any ﬁxed λ1, useful for characterizing LASSO/Elastic Net solutions. For any subset E [p] of the features, we deﬁne XE = (. . . X i . . .)i2E as the m |E| matrix of columns X i of X corresponding to indices i 2 E. Similarly βE 2 R|E| is the subset of estimators in β corresponding to indices in E. We will assume all the feature matrixes X (for training datasets) are in general position (Deﬁnition 6).

Deﬁnition 2 (Equicorrelation sets, [42]). Let β 2 arg minβ2Rp ky Xβk2

2+λ1||β||1. The equicorrelation set corresponding to β , E = {j 2 [p] | |x T

j (y Xβ )| = λ1}, is simply the set of covariates with maximum absolute correlation. We also deﬁne the equicorrelation sign vector for β as s = sign(XT

E (y Xβ )) 2 { 1}.

Consider the class of algorithms consisting of Elastic Net regressors for different values of λ = (λ1, λ2) 2 (0, 1) (0, 1). We assume λ1 > 0 for technical simplicity (cf. [42]). We seek to solve

problems of the form P = (X, y, Xval, yval) 2 m,p, where (X, y) is the training set, (Xval, yval) is the validation set with the same set of features, and m, p are upper bounds on the number of examples and features respectively in any dataset. Let HEN = { EN(λ, ) | λ 2 (0, 1) (0, 1)} denote the set of loss functions for the class of algorithms consisting of Elastic Net regressors for different values of λ 2 R+ R+. Additionally, we will consider information criterion based loss functions, AIC

EN (λ, P) = EN(λ, P) + 2||ˆβ(X,y)

λ,f EN ||0 and BIC

EN (λ, P) = EN(λ, P) + 2||ˆβ(X,y)

λ,f EN ||0 log m [1, 39]. Let HAIC

EN and HBIC

EN denote the corresponding sets of loss functions. These criteria are popularly used to compute the squared loss on the training set, to give alternatives to cross-validation. We do not make any assumption on the relation between training and validation sets in our formulation, so our analysis can capture these settings as well.

Figure 1: An illustration of the piecewise structure of the Elastic Net loss, as a function of the regularization parameters, for a ﬁxed problem instance. Pieces are regions where some bounded degree polynomials (r1, r2) have a ﬁxed sign pattern (one of 1, 1), and in each piece the loss is a ﬁxed (rational) function.

We will now establish a piecewise structure of the dual class loss functions (Deﬁnition 1). A key observation is that if the signed equicorrelation set (E, s) (i.e. a subset of features E [p] with the same maximum absolute correlation, assigned a ﬁxed sign pattern { 1, +1}|E|, see Deﬁnition 2) is ﬁxed, then the Elastic Net coefﬁcients may be characterized (Lemma C.1) and the loss is a ﬁxed rational polynomial piece function of the parameters λ1, λ2. We then show the existence of a set of boundary function curves G, such that any region of the parameter space located on a ﬁxed side of all the curves (more formally, for a ﬁxed sign pattern in Deﬁnition 1) in G has the same signed equicorrelation set. The boundary functions are a collection of possible curves at which a covariate may enter or leave the set E and correspond to algebraic curves. We make repeated use of the following lemma which provides useful properties of the piece functions as well the the boundary functions of the dual class loss functions. Lemma 2.1. Let A be an r s matrix. Consider the matrix B(λ) = (AT A + λIs) 1 and λ > 0.

1. Each entry of B(λ) is a rational polynomial Pij(λ)/Q(λ) for i, j 2 [s] with each Pij of

degree at most s 1, and Q of degree s.

2. Further, for i = j, Pij has degree s 1 and leading coefﬁcient 1, and for i 6= j Pij has

degree at most s 2. Also, Q(λ) has leading coefﬁcient 1.

The proof is straightforward (Appendix C). We will now formally state and prove our key structural result which is needed to establish our generalization and online regret guarantees in Section 3. Theorem 2.2. Let L be a set of functions {lλ : m,p ! R 0 | λ 2 R+ R 0} that map a regression problem instance P 2 m,p to the validation loss EN(λ, P) of Elastic Net trained with regularization parameter λ = (λ1, λ2). The dual class L is (F, G, p3p)-piecewise decomposable, with F = {fq : L ! R} consisting of rational polynomial functions fq1,q2 : lλ 7! q1(λ1,λ2)

q2(λ2) , where q1, q2 have degrees at most 2p, and G = {gr : L ! {0, 1}} consisting of semi-algebraic sets3 bounded by algebraic curves gr : uλ 7! I{r(λ1, λ2) < 0}, where r is a polynomial of degree 1 in λ1 and at most p in λ2.

Proof. Let P = (X, y, Xval, yval) 2 m,p be a regression problem instance. By using the standard reduction to LASSO [47] and well-known characterization of the LASSO solution in terms of equicorrelation sets, we can characterize the solution ˆβλ,f EN of the Elastic Net as follows (Lemma C.1):

ˆβλ,f EN = (XT

E XE + λ2I|E|) 1XT

E XE + λ2I|E|) 1s

for some E 2 [p] and s 2 { 1, 1}p. Thus for any λ = (λ1, λ2), the prediction ˆy on any validation example with features x 2 Rp satisﬁes (for some E, s 2 2[p] { 1, 1}p)

ˆyj = xˆβλ,f EN = x(XT

E XE + λ2I|E|) 1XT

E XE + λ2I|E|) 1s

3See Deﬁnition 7 for deﬁnitions of standard terminology from algebraic geometry.

For any subset R R2, if the signed equicorrelation set (E, s) is ﬁxed over R, then the above observation, together with Lemma C.2 implies that the loss function EN(λ, P) is a rational function of the form q1(λ1,λ2)

q2(λ2) , where q1 is a bivariate polynomial with degree at most 2|E| and q2 is univariate with degree 2|E|.

To show the piecewise structure, we need to demonstrate a set boundary functions G = {g1, . . . , gk} such that for any sign pattern b 2 {0, 1}k, the signed equicorrelation set (E, s) for the region with sign pattern b is ﬁxed. To this end, based on the observation above, we will consider the conditions (on λ) under which a covariate may enter or leave the equicorrelation set. We will show that this can happen only at one of a ﬁnite number of algebraic curves (with bounded degrees).

Condition for joining E. Fix E, s. Also ﬁx j /2 E. If covariate j enters the equicorrelation set, the KKT conditions (Lemma B.1) applied to the LASSO problem corresponding to the Elastic Net (Lemma C.1) imply

where c1 = (X

T y , c2 = (X

E) 1s, X = 1 p1+λ2

1 = λ1 p1+λ2 . Rearranging, and simplifying, we get

T XE + λ2I|E|) 1XE

T XE + λ2I|E|) 1s 1

Note that the terms (x

E y, and (x

j)T y = x T

j y do not depend on λ1 or λ2 (the λ2 terms are zeroed out since j /2 E). Moreover, (X

T XE +λ2I|E|) 1. Using Lemma C.2, we get an algebraic curve rj,E,s(λ1, λ2) = 0 with degree 1 in λ1 and |E| in λ2 corresponding to addition of j /2 E given E, s.

Condition for leaving E. Now consider a ﬁxed j0 2 E, given ﬁxed E, s. The coefﬁcient of j0 will be zero for λ

(c1)j0 (c2)j0 , which simpliﬁes to λ1((XE

T XE + λ2I|E|) 1s)j0 =

T XE +λ2I|E|) 1XE

T y)j0. Again by Lemma C.2, we get an algebraic curve rj0,E,s(λ1, λ2) = 0 with degree 1 in λ1 and at most |E| in λ2 corresponding to removal of j0 2 E given E, s.

Putting the two together, we get Pp

((p i) + i) = p3p algebraic curves of degree 1 in λ1 and at most p in λ2, across which the signed equicorrelation set may change. These curves characterize the complete set of points (λ1, λ2) at which (E, s) may possibly change. Thus by setting these p3p curves as the set of boundary functions G, E, s is guaranteed to be ﬁxed for each sign pattern, and the corresponding loss takes the rational function form shown above.

The exact same piecewise structure can be established for the dual function classes for loss functions AIC

EN (λ, ) and BIC

EN (λ, ). This is evident from the proof of Theorem 2.2, since any dual piece has a ﬁxed equicorrelation set, and therefore ||β||0 is ﬁxed. Given this piecewise structure, a challenge to learning values of λ that minimize the loss function is that the function may not be differentiable (or may even be discontinuous, for the information criteria based losses) at the piece boundaries, making well-known gradient-based (local) optimization techniques inapplicable here. In the following (speciﬁcally Algorithm 1) we will show that techniques from data-driven design may be used to overcome this optimization challenge.

3 Learning to Regularize the Elastic Net

We will consider the problem of learning provably good Elastic Net parameters for a given problem domain, from multiple datasets (problem instances) either available as a collection (Section 3.1), or arriving online (Section 3.2). Our parameter tuning techniques also apply to simpler regression techniques typically used for variable selection, like LARS and LASSO, which are reasonable choices if the features are not multicollinear. Proof details for this section are located in Appendix C.

3.1 Distributional Setting

Our main result in this section is the following upper bound on the pseudo-dimension of the classes of loss functions for the Elastic Net, which implies that in our distributional setting it is possible to learn near-optimal values of λ with polynomially many problem instances.

Theorem 3.1. PDIM(HEN) = O(p2). Further, PDIM(HAIC

EN ) = O(p2) and PDIM(HBIC

EN ) = O(p2).

Proof Sketch. We use the (F, G, p3p)-piecewise decomposable structure for the dual class function H

EN established in Theorem 2.2. We can bound the pseudo-dimension of the dual class of piece functions F (a class of bivariate rational functions) by O(log p) by giving an upper bound (of O(k3d3)) on the number of sign patterns over R2 induced by k algebraic curves of degree at most d. We can also bound the VC dimension of the dual class of boundary functions G (semi-algebraic sets in two variates) by O(p) using a standard linearization argument. Finally, a powerful result from [10] (Theorem C.3) allows us to bound the pseudodimension of H by combining the above results.

A key challenge to establish the theorem is providing new bounds on the pseudo-dimension of rational functions of bounded degrees (Lemma C.5). The upper bound above implies a guarantee on the sample complexity of learning the Elastic Net tuning parameter, using standard learning-theoretic results [2]. In our setting of learning from multiple problem instances, each sample is a dataset instance, so the sample complexity is simply the number of regression problem instances needed to learn the tuning parameters to any given approximation and conﬁdence level.

Theorem 3.2 (Sample complexity of tuning the Elastic Net). Let D be an arbitary distribution over the problem space m,p. There is an algorithm which given n = O

2 (p2 log 1

problem samples drawn from D, for any > 0 and δ 2 (0, 1), outputs a regularization parameter ˆλ for the Elastic Net such that with probability at least 1 δ over the draw of the problem samples, we have that

(((EP D[ EN(ˆλ, P)] min

λ EP D[ EN(λ, P)]

Proof. This follows from substituting our result in Theorem 3.1 into well-known generalization guarantee for function classes with bounded pseudo-dimensions (Theorem A.1).

Discussion and applications. Computing the parameters which minimize the loss on the problem samples (aka Empirical Risk Minimization, or ERM) achieves the sample complexity bound in Theorem 3.2. Even though we only need polynomially many samples to guarantee the selection of nearly-optimal parameters, it is not clear how to implement the ERM efﬁciently. Note that we do not assume the set of features is the same across problem instances, so our approach can handle feature reset i.e. different problem instances can differ in not only the number of examples but also the number of features. Moreover, as a special case application, we consider the commonly used techniques of leave-one-out cross validation (LOOCV) and Monte Carlo cross validation (repeated random test-validation splits, typically independent and in a ﬁxed proportion). Given a dataset of size mtr, LOOCV would require mtr regression ﬁts which can be inefﬁcient for large dataset size. Alternately, we can consider draws from a distribution DLOO which generates problem instances P from a ﬁxed dataset (X, y) 2 Rm+1 p Rm+1 by uniformly selecting j 2 [m + 1] and setting P = (X j , y j, Xj , yj). Theorem 3.2 now implies that O(p2/ 2) iterations are enough to determine an Elastic Net parameter ˆλ with loss within (w.h.p.) of the parameter λ obtained from running the full LOOCV. Similarly, we can deﬁne a distribution DMC to capture the Monte Carlo cross validation procedure and determine the number of iterations sufﬁcient to get an -approximation of the loss corresponding parameter selection with arbitrarily large number of runs of the procedure. Thus, in a very precise sense, our results answer the question of how much cross-validation is enough to effectively implement the above techniques.

Remark 1. While our result implies polynomial sample complexity, the question of learning the provably near-optimal parameter efﬁciently (even in output polynomial time) is left open. For the special cases of LASSO (λ2 = 0) and Ridge (λ1 = 0), the piece boundaries of the piecewise polynomial dual class (loss) function may be computed efﬁciently (using the LARS-LASSO algorithm of [22] for LASSO, and solving linear systems and locating roots of polynomials for Ridge). This applies to online and classiﬁcation settings in the following sections as well.

3.2 Online Learning

We will now extend our results to learning the regularization coefﬁcients given an online sequence of regression problems, such as when one needs to solve a new regression problem each day. Unlike the distributional setting above, we will not assume any problem distribution and our results will hold for an adversarial sequence of problem instances. We will need very mild assumptions on the data, namely boundedness of feature and prediction values and smoothness of predictions (formally stated as Assumptions 1 and 2), while our distributional results above hold for worst-case problem datasets. Our ﬁrst assumption is that all feature values and predictions are bounded, for training as well as validation examples. Assumption 1 (Boundedness). The predicted variable and all feature values are bounded by an absolute constant R, i.e. max{||X(i)||1,1, ||y(i)||1, ||X(i)

val ||1,1, ||y(i)

We will need the following deﬁnition of distribution smoothness to state our second assumption. Deﬁnition 3. A continuous probability distribution is said to be -bounded if the probability density function p(x) satisﬁes p(x) for any x in the sample space.

For example, the normal distribution N(µ, σ2) with mean µ and standard deviation σ is 1 σ

2 - bounded. We assume that the predicted variable y in the training set comes from a -bounded (i.e. smooth) distribution, which does not require the strong tail decay of sub-Gaussian distributions [44, 13]. Moreover, the online adversary is allowed to change the distribution as long as it is - bounded. Note that our assumption also captures common data preprocessing steps, for example the jitter parameter in the popular Python library scikit-learn [37] adds a uniform noise to the y values to help model stability. The assumption is formally stated as follows: Assumption 2 (Smooth predictions). The predicted variables y(i) in the training set are drawn from a joint -bounded distribution, i.e. for each i, the variables y(i) have a joint distribution with probability density bounded by .

Under these assumptions, we can show that it is possible to learn the Elastic Net parameters with sublinear expected regret when the problem instances arrive online. The learning algorithm (Algorithm 1) that achieves this regret is a continuous variant of the classic Exponential Weights algorithm [14, 6]. It samples points in the domain with probability inversely propotional to the exponentiated loss. To formally state our result, we will need the following deﬁnition of dispersed loss functions. Informally speaking, it captures how amenable a set of non-Lipschitz functions is to online learning by measuring the worst rate of occurrence of non-Lipschitzness (or discontinuities) between any pair of points in the domain. [6, 9, 8] show that dispersion is necessary and sufﬁcient for learning piecewise Lipschitz functions. Deﬁnition 4. Dispersion [8]. The sequence of random loss functions l1, . . . , l T is β-dispersed for the Lipschitz constant L if, for all T and for all T β, we have that, in expectation, at most O( T) functions (the soft-O notation suppresses dependence on quantities beside , T and β, as well as logarithmic terms) are not L-Lipschitz for any pair of points at distance in the domain C. That is, for all T and for all T β, E

max , 02C k 0k2

(({t 2 [T] | lt( ) lt( 0) > L k 0k2}

Our key contribution is to show that the loss sequence is dispersed (Deﬁnition 4) under the above assumptions. This involves establishing additional structure for the problem, speciﬁcally about the location of boundary functions in the piecewise structure from Theorem 2.2. This stronger characterization coupled with results from [8] on dispersion of algebraic discontinuities completes the proof. Theorem 3.3. Suppose Assumptions 1 and 2 hold. Let l1, . . . , l T : (0, λmax)2 ! R 0 denote an independent sequence of losses (e.g. fresh randomness is used to generate the validation set features in each round) as a function of the Elastic Net regularization parameter λ = (λ1, λ2),

li(λ) = lr(ˆβ(X(i),y(i))

λ,f EN , (X(i)

val)). The sequence of functions is 1

2-dispersed, and there is an online algorithm with O(

T)4 expected regret. The result also holds for loss functions adjusted by information criteria AIC and BIC.

Proof Sketch. We start with the (F, G, p3p)-piecewise decomposable structure for the dual class function H

EN from Theorem 2.2. Observe that the rational piece functions in F do not introduce

4The O( ) notation hides dependence on logarithmic terms, as well as on quantities other than T.

Algorithm 1 Data-driven Regularization ( )

1: Input: Problems (X(i), y(i)) and regularization penalty function f. 2: Hyperparameter: step size parameter 2 (0, 1].

3: Output: Parameters (λi)i2[T ] 2 C, C R+ (LASSO/Ridge) or C R+2 (Elastic Net). 4: Set w1(λ) = 1 for all λ 2 C. 5: for i = 1, 2, . . . , T do 6: Wi :=

7: Sample λ with probability pt(λ) = wi(λ)

Wi , output as λi.

8: Compute average loss function li(λ) = 1 |y(i)|l(ˆβλ,f, (X(i), y(i))).

9: For each λ 2 C, update weights wi+1(λ) = e (1 li(λ))wi(λ).

any new discontinuities since the denominator polynomials do not have positive roots. For each of two types of boundary functions in G (corresponding to leaving/entering the equicorrelation set) we show that the discontinuities between any pair of points λ, λ0 lie along the roots of polynomials with non-leading coefﬁcients bounded and smoothly distributed (bounded joint density). This allows us to use results from [8] to establish dispersion, and therefore online learnability.

We remark that the above result holds for arbitrary training features and validation sets in the problem sequence that satisfy our assumptions, in particular the loss functions are only assumed to be independent but not identically distributed. In contrast, the results in the previous section needed them to be drawn from the same distribution. Also the parameters need to be selected online, and cannot be changed for already seen instances. This setting captures interesting practical settings where the set of features (including feature dimensions) and the relevant training set (including training set size) may change over the online sequence. It is not clear how usual model selection techniques like cross-validation may be adapted to these challenging settings.

4 Extension to Regularized Least Squares Classiﬁcation

Regression techniques can also be used to train binary classiﬁers by using an appropriate threshold on top of the regression estimate. Intuitively, regression learns a linear mapping which projects the datapoints onto a one-dimensional space, i.e. a real number, after which a threshold may be applied to classify the points. The use of thresholds to make discrete classiﬁcations adds discontinuities to the empirical loss function. Thus, in general, the classiﬁcation setting is more challenging as it already includes the piecewise structure in the regression loss. We provide statistical and online learning guarantees for Ridge and LASSO. For the Elastic Net we present the extensions needed to the arguments from the previous sections to obtain results in the classiﬁcation setting.

More formally, we will restrict y to {0, 1}m. The estimator ˆβλ,f is obtained as before, and the prediction on a test instance x may be obtained by taking the sign of a thresholded regression estimate, sign(hx, ˆβλ,fi ), where sign : R ! {0, 1} maps x 2 R to I{x 0} and 2 R is the threshold. The threshold corresponds to the intercept or bias of the learned linear classiﬁer, here we will treat it as a tunable hyperparameter (in addition to λ1, λ2)5. The average 0-1 loss over the dataset (X, y) is given by lc(ˆβλ,f, (X, y), ) = 1

i=1 |yi sign(h Xi, ˆβλ,fi )|6. Proofs from this section are in Appendix D.

4.1 Distributional setting

The problem setting is the same as in Section 3.1, except that the labels y are binary and we use threshold for prediction. We bound the pseudo-dimension for classiﬁcation loss on these problem instances, which as before (c.f. Theorems 3.1 and 3.2) imply that polynomially many problem samples are sufﬁcient to generalize well over the problem distribution D. For Ridge and LASSO we

5We can still have a problem instance speciﬁc bias in β using the standard trick of adding a unit feature to X, thus we generalize the common practice of using a ﬁxed threshold. For example, the Ridge Classiﬁer implementation in Python library scikit-learn 1.1.1 [37] assumes y 2 { 1, +1}m and sets = 0.

6Squared loss and 0-1 loss are identical in this setting.

upper bound the number of discontinuities of the piecewise constant classiﬁcation loss by determining the values of λ where any prediction changes. Theorem 4.1. Let Hc

LASSO and Hc

EN denote the set of loss functions for classiﬁcation problems with at most m examples and p features, for linear classiﬁers regularized using Ridge, LASSO and Elastic Net regression respectively.

(i) PDIM(Hc

Ridge) = O(log mp) (ii) PDIM(Hc

LASSO) = O(p log m). Further, in the overparameterized regime (p m), we have that PDIM(Hc

LASSO) = O(m log p

m). (iii) PDIM(Hc

EN) = O(p2 + p log m).

The key difference with the bound for the regression loss in Theorem 3.1 is the additional O(p log m) term which corresponds to discontinuities induced by the thresholding in the regression based classiﬁers. We can establish a structure similar to Theorem 2.2 in this case (Lemma D.1).

4.2 Online setting

As in Section 3.2, we can deﬁne an online learning setting for classiﬁcation. Note that the smoothness of the predicted variable is not meaningful here, since y is a binary vector. Instead we will assume that the validation examples have smooth feature values. Intuitively this means that small perturbations to the feature values does not meaningfully change the problem.

Assumption 3 (Smooth validation features). The feature values (X(i)

val )jk in the validation examples are drawn from a joint -bounded distribution.

Under the assumption, we show that we can learn the regularization parameters online, for each of Ridge, LASSO and Elastic Net estimators. The proofs are straightforward extensions of the structural results developed in the previous sections, with minor technical changes to use the above validation set feature smoothness instead of Assumption 2, and are deferred to the appendix. Theorem 4.2. Suppose Assumptions 1 and 3 hold. Let l1, . . . , l T : (0, H]d [ H, H] ! R denote an independent sequence of losses as a function of the regularization parameter λ, li(λ, ) = lc(ˆβλ,f, (X(i), y(i)), ). If f is given by f1 (LASSO), f2 (Ridge), or f EN (Elastic Net) then the sequence of functions is 1

2-dispersed and there is an online algorithm with O(

T) expected regret.

5 Conclusions and Future Work

We obtain a novel structural result for the Elastic Net loss as a function of the tuning parameters. Our characterization gives polynomial upper bounds for the sample complexity of learning the parameters from multiple instances coming from the same problem domain. For the Elastic Net we show generalization and online regret guarantees, but efﬁcient implementation of the algorithms is an interesting question for further work. Also we show general learning-theoretic guarantees, i.e. without any signiﬁcant restrictions on the data-generating distribution, in learning from multiple problems. The problems may be drawn i.i.d. from an arbitrary problem distribution, or even arrive in an online sequence but with some smoothness properties. It is unclear if such guarantees may be given for tuning parameters for the more standard setting of tuning a single training set. In this work we only give upper bounds on the sample complexity by bounding the pseudodimension, showing lower bounds is an interesting direction for future work.

Acknowledgments

We thank Ryan Tibshirani for helpful discussions. This material is based on work supported by the National Science Foundation under grants CCF-1910321, IIS-1705121, IIS-1838017, IIS-1901403, IIS-2046613, IIS-2112471, and SES-1919453; the Defense Advanced Research Projects Agency under cooperative agreement HR00112020003; a Simons Investigator Award; an AWS Machine Learning Research Award; an Amazon Research Award; a Bloomberg Research Grant; a Microsoft Research Faculty Fellowship; funding from Meta, Morgan Stanley, and Amazon; and a Facebook Ph D Fellowship. Any opinions, ﬁndings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reﬂect the views of any of these funding agencies.

[1] Hirotugu Akaike. A new look at the statistical model identiﬁcation. IEEE transactions on

automatic control, 19(6):716 723, 1974.

[2] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations,

volume 9. cambridge university press Cambridge, 1999.

[3] Maria-Florina Balcan. Book chapter Data-Driven Algorithm Design. In Beyond Worst Case

Analysis of Algorithms, T. Roughgarden (Ed). Cambridge University Press, 2020.

[4] Maria-Florina Balcan and Dravyansh Sharma. Data driven semi-supervised learning. Advances

in Neural Information Processing Systems, 34, 2021.

[5] Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. Sample complexity of automated

mechanism design. Advances in Neural Information Processing Systems, 29, 2016.

[6] Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm

design, online learning, and private optimization. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 603 614. IEEE, 2018.

[7] Maria-Florina Balcan, Travis Dick, and Manuel Lang. Learning to link. In International

Conference on Learning Representations, 2019.

[8] Maria-Florina Balcan, Travis Dick, and Wesley Pegden. Semi-bandit optimization in the

dispersed setting. In Conference on Uncertainty in Artiﬁcial Intelligence, pages 909 918. PMLR, 2020.

[9] Maria-Florina Balcan, Travis Dick, and Dravyansh Sharma. Learning piecewise Lipschitz

functions in changing environments. In International Conference on Artiﬁcial Intelligence and Statistics, pages 3567 3577. PMLR, 2020.

[10] Maria-Florina Balcan, Dan De Blasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and

Ellen Vitercik. How much data is sufﬁcient to learn high-performing algorithms? generalization guarantees for data-driven algorithm design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 919 932, 2021.

[11] Maria-Florina Balcan, Siddharth Prasad, Tuomas Sandholm, and Ellen Vitercik. Sample

complexity of tree search conﬁguration: Cutting planes and beyond. Advances in Neural Information Processing Systems, 34, 2021.

[12] Maria-Florina F Balcan, Mikhail Khodak, Dravyansh Sharma, and Ameet Talwalkar. Learning-

to-learn non-convex piecewise-lipschitz functions. Advances in Neural Information Processing Systems, 34:15056 15069, 2021.

[13] Emmanuel J Candès and Yaniv Plan. Near-ideal model selection by l1 minimization. The

Annals of Statistics, 37(5A):2145 2177, 2009.

[14] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university

press, 2006.

[15] JM Chambers and TJ Hastie. Linear models. Chapter 4 of statistical models in S. Wadsworth &

Brooks/Cole, 1992, 1992.

[16] Denis Chetverikov, Zhipeng Liao, and Victor Chernozhukov. On cross-validated lasso in high

dimensions. The Annals of Statistics, 49(3):1300 1317, 2021.

[17] Michael Chichignoud, Johannes Lederer, and Martin J Wainwright. A practical scheme and

fast algorithm to tune the lasso with optimality guarantees. The Journal of Machine Learning Research, 17(1):8162 8181, 2016.

[18] Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with

applications in pattern recognition. IEEE transactions on electronic computers, (3):326 334, 1965.

[19] Amit Dhurandhar and Marek Petrik. Efﬁcient and accurate methods for updating generalized

linear models with multiple feature additions. The Journal of Machine Learning Research, 15 (1):2607 2627, 2014.

[20] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regres-

sion and classiﬁcation. The Annals of Statistics, 46(1):247 279, 2018.

[21] Richard M Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian

processes. Journal of Functional Analysis, 1(3):290 330, 1967.

[22] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.

The Annals of statistics, 32(2):407 499, 2004.

[23] Jianqing Fan and Jinchi Lv. A selective overview of variable selection in high dimensional

feature space. Statistica Sinica, 20(1):101, 2010.

[24] Manuel Fernández-Delgado, Manisha Sanjay Sirsat, Eva Cernadas, Sadi Alawadi, Senén Barro,

and Manuel Febrero-Bande. An extensive experimental survey of regression methods. Neural Networks, 111:11 34, 2019.

[25] Jean-Jacques Fuchs. Recovery of exact sparse representations in the presence of bounded noise.

IEEE Transactions on Information Theory, 51(10):3601 3608, 2005.

[26] Diane Galarneau Gibbons. A simulation study of some ridge estimators. Journal of the American

Statistical Association, 76(373):131 139, 1981.

[27] Rishi Gupta and Tim Roughgarden. A pac approach to application-speciﬁc algorithm selection.

SIAM Journal on Computing, 46(3):992 1017, 2017.

[28] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements

of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.

[29] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-

dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986, 2022.

[30] Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to nonorthogonal

problems. Technometrics, 12(1):69 82, 1970.

[31] Lisa-Ann Kirkland, Frans Kanfer, and Sollie Millard. LASSO tuning parameter selection. In

Annual Proceedings of the South African Statistical Association Conference, volume 2015,

pages 49 56. South African Statistical Association (SASA), 2015.

[32] Johannes Lederer and Christian Müller. Don t fall for tuning parameters: tuning-free variable

selection in high dimensions with the TREX. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 29, 2015.

[33] Qing Li and Nan Lin. The bayesian elastic net. Bayesian analysis, 5(1):151 170, 2010.

[34] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning.

MIT Press, 2012.

[35] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[36] Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani. Uniform consistency of

cross-validation estimators for high-dimensional ridge regression. In International Conference on Artiﬁcial Intelligence and Statistics, pages 3178 3186. PMLR, 2021.

[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,

P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

[38] David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012.

[39] Gideon Schwarz. Estimating the dimension of a model. The annals of statistics, pages 461 464,

[40] Igor Rostislavich Shafarevich. Basic Algebraic Geometry:(by) IR Shafarevich Transl. from the

Russian by KA Hirsch. Springer-Verlag, 1977.

[41] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society: Series B (Methodological), 58(1):267 288, 1996.

[42] Ryan J Tibshirani. The lasso problem and uniqueness. Electronic Journal of statistics, 7:

1456 1490, 2013.

[43] Andrey N Tikonov and Vasily Y Arsenin. Solutions of ill-posed problems. New York: Winston,

[44] Tong Zhang. Some sharp performance bounds for least squares regression with l1 regularization.

The Annals of Statistics, 37(5A):2109 2144, 2009.

[45] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge

and Data Engineering, 2021.

[46] Peng Zhao and Bin Yu. Stagewise lasso. The Journal of Machine Learning Research, 8:

2701 2726, 2007.

[47] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of

the royal statistical society: series B (statistical methodology), 67(2):301 320, 2005.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reﬂect the paper s

contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] We emphasize we are in a multiple

instance setting instead of the standard setting. (c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to

them? [Yes]

2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes]

3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experi-

mental results (either in the supplemental material or as a URL)? [N/A] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they

were chosen)? [N/A] (c) Did you report error bars (e.g., with respect to the random seed after running experi-

ments multiple times)? [N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type

of GPUs, internal cluster, or cloud provider)? [N/A]

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A]

(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

(d) Did you discuss whether and how consent was obtained from people whose data you re

using/curating? [N/A]

(e) Did you discuss whether the data you are using/curating contains personally identiﬁable

information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if

applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review

Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount

spent on participant compensation? [N/A]