# On Optimal Generalizability in Parametric Learning

Ahmad Beirami (beirami@seas.harvard.edu), Meisam Razaviyayn (razaviya@usc.edu), Shahin Shahrampour (shahin@seas.harvard.edu), Vahid Tarokh (vahid@seas.harvard.edu)

School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA. Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA 90089, USA.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

We consider the parametric learning problem, where the objective of the learner is determined by a parametric loss function. Employing empirical risk minimization with possible regularization, the inferred parameter vector will be biased toward the training samples. Such bias is measured in practice by the cross validation procedure, where the data set is partitioned into a training set used for training and a validation set, which is not used in training and is left to measure the out-of-sample performance. A classical cross validation strategy is leave-one-out cross validation (LOOCV), where one sample is left out for validation, training is done on the remaining samples presented to the learner, and this process is repeated over all of the samples. LOOCV is rarely used in practice due to its high computational complexity. In this paper, we first develop a computationally efficient approximate LOOCV (ALOOCV) and provide theoretical guarantees for its performance. We then use ALOOCV to provide an optimization algorithm for finding the regularizer in the empirical risk minimization framework. In our numerical experiments, we illustrate the accuracy and efficiency of ALOOCV as well as our proposed framework for the optimization of the regularizer.

1 Introduction

We consider the parametric supervised/unsupervised learning problem, where the objective of the learner is to build a predictor based on a set of historical data. Let $z^n = \{z_i\}_{i=1}^n$ denote the data samples at the learner's disposal, where $z_i \in \mathcal{Z}$; the samples are assumed to be drawn i.i.d. from an unknown density function $p(\cdot)$, and $\mathcal{Z}$ is compact. We assume that the learner expresses the objective in terms of minimizing a parametric loss function $\ell(z;\theta)$, which is a function of the parameter vector $\theta$. The learner solves for the unknown parameter vector $\theta \in \Theta \subseteq \mathbb{R}^k$, where $k$ denotes the number of parameters in the model class and $\Theta$ is a convex, compact set. Let
$$L(\theta) \triangleq \mathbb{E}\{\ell(z;\theta)\} \qquad (1)$$
be the risk associated with the parameter vector $\theta$, where the expectation is taken with respect to the density $p(\cdot)$ that is unknown to the learner. Ideally, the goal of the learner is to choose the parameter vector $\theta^*$ such that $\theta^* \in \arg\min_{\theta\in\Theta} L(\theta) = \arg\min_{\theta\in\Theta}\mathbb{E}\{\ell(z;\theta)\}$. Since the density function $p(\cdot)$ is unknown, the learner cannot compute $\theta^*$ and hence cannot achieve the ideal performance $L(\theta^*) = \min_{\theta\in\Theta} L(\theta)$ associated with the model class $\Theta$. Instead, one can consider the minimization of the empirical version of the problem through the empirical risk minimization framework:
$$\hat{\theta}(z^n) \in \arg\min_{\theta\in\Theta}\Big\{\sum_{i\in[n]}\ell(z_i;\theta) + r(\theta)\Big\},$$
where $[n] \triangleq \{1,2,\ldots,n\}$ and $r(\theta)$ is some regularization function. While the learner can evaluate her performance on the training data samples (also called the in-sample empirical risk, i.e., $\frac{1}{n}\sum_{i=1}^n \ell(z_i;\hat{\theta}(z^n))$), it is imperative to assess the average performance of the learner on fresh test samples, i.e., $L(\hat{\theta}(z^n))$, which is referred to as the out-of-sample risk.
A simple and universal approach to measuring the out-of-sample risk is cross validation [1]. Leave-one-out cross validation (LOOCV), which is a popular exhaustive cross validation strategy, uses $(n-1)$ of the samples for training while one sample is left out for testing. This procedure is repeated on the $n$ samples in a round-robin fashion, and the learner ends up with $n$ estimates of the out-of-sample loss, one per sample. Together, these estimates form a cross validation vector that can be used for estimating the out-of-sample performance, model selection, and tuning the model hyperparameters. While LOOCV provides a reliable estimate of the out-of-sample loss, it multiplies the computational cost of training by a factor of $n$, which makes it practically infeasible when the number of samples is large.

Contribution: Our first contribution is to provide an approximation to the cross validation vector, called ALOOCV, with much lower computational cost. We compare its performance with LOOCV on problems of moderate size where LOOCV is tractable, and we also test it on problems of large size where LOOCV is practically impossible to implement. We describe how to handle quasi-smooth loss/regularizer functions. We also show that ALOOCV is asymptotically equivalent to the Takeuchi information criterion (TIC) under certain regularity conditions. Our second contribution is to use ALOOCV to develop a gradient descent algorithm for jointly optimizing the regularization hyperparameters as well as the unknown parameter vector $\theta$. We show that multiple hyperparameters can be tuned using the developed algorithm. We emphasize that the second contribution would not have been possible without the developed estimator, as obtaining the gradient of the LOOCV with respect to the tuning parameters is computationally expensive. Our experiments show that the developed method handles quasi-smooth regularized loss functions as well as a number of tuning parameters that is on the order of the number of training samples. Finally, it is worth mentioning that although the leave-one-out cross validation scenario is considered in our analyses, the results and the algorithms can be extended to leave-q-out cross validation and bootstrap techniques.

Related work: A main application of cross validation (see [1] for a recent survey) is in model selection [2-4]. On the theoretical side, the proposed approximation to LOOCV is asymptotically equivalent to the Takeuchi information criterion (TIC) [4-7] under certain regularity conditions (see [8] for a proof of the asymptotic equivalence of AIC and LOOCV in autoregressive models). This is also related to Barron's predicted squared error (PSE) [9] and Moody's effective number of parameters for nonlinear systems [10]. Despite these asymptotic equivalences, our main focus is on the non-asymptotic performance of ALOOCV. ALOOCV simplifies to the closed-form derivation of LOOCV for linear regression, known as PRESS (see [11, 12]). Hence, this work can be viewed as an approximate extension of this closed-form derivation to arbitrary smooth regularized loss functions. This work is also related to the concept of influence functions [13], which has recently received renewed interest [14].
In contrast to methods based on influence functions, which require a large number of samples due to their asymptotic nature, we empirically show that the developed ALOOCV works well even when the number of samples and features are small and comparable to each other. In particular, ALOOCV is capable of predicting overfitting and hence can be used for model selection and for choosing the regularization hyperparameter. Finally, we expect that the idea of ALOOCV can be extended to derive computationally efficient approximate bootstrap estimators [15].

Our second contribution is a gradient descent optimization algorithm for tuning the regularization hyperparameters in parametric learning problems. A similar approach has been taken for tuning the single parameter in ridge regression, where cross validation can be obtained in closed form [16]. Most of the existing methods, on the other hand, ignore the response and carry out the optimization solely based on the features, e.g., the Stein unbiased estimator of the risk for multiple parameter selection [17, 18]. Bayesian optimization has been used for tuning the hyperparameters in the model [19-23]; it postulates a prior on the parameters and optimizes for the best parameter. Bayesian optimization methods are generally derivative-free, leading to slow convergence rates. In contrast, the proposed method is based on a gradient descent method. Other popular approaches to tuning the optimization parameters include grid search and random search [24-26]. These methods, by nature, also suffer from slow convergence. Finally, model selection has been cast as a bi-level optimization problem [27, 28], where the training process is modeled as a second-level optimization problem within the original problem. These formulations, similar to many other bi-level optimization problems, often lead to computationally intensive algorithms that are not scalable. We remark that ALOOCV can also be used within Bayesian optimization, random search, and grid search methods. Further, resource allocation can be used for improving the optimization performance in all such methods.

2 Problem Setup

To facilitate the presentation of the ideas, let us define the following concepts. Throughout, we assume that all vectors are in column format.

Definition 1 (regularization vector/regularized loss function) We suppose that the learner is concerned with $M$ regularization functions $r_1(\theta),\ldots,r_M(\theta)$ in addition to the main loss function $\ell(z;\theta)$. We define the regularization vector $r(\theta)$ as $r(\theta) \triangleq (r_1(\theta),\ldots,r_M(\theta))^\top$. Further, let $\lambda = (\lambda_1,\ldots,\lambda_M)^\top$ be the vector of regularization parameters. We call $w_n(z;\theta,\lambda)$ the regularized loss function, given by
$$w_n(z;\theta,\lambda) \triangleq \ell(z;\theta) + \frac{1}{n}\lambda^\top r(\theta) = \ell(z;\theta) + \frac{1}{n}\sum_{m\in[M]}\lambda_m r_m(\theta).$$
The above definition encompasses many popular learning problems. For example, elastic net regression [31] can be cast in this framework by setting $r_1(\theta) = \|\theta\|_1$ and $r_2(\theta) = \frac{1}{2}\|\theta\|_2^2$.

Definition 2 (empirical risk/regularized empirical risk) Let the empirical risk be defined as $\hat{L}_{z^n}(\theta) \triangleq \frac{1}{n}\sum_{i=1}^n \ell(z_i;\theta)$. Similarly, let the regularized empirical risk be defined as $\hat{W}_{z^n}(\theta,\lambda) \triangleq \frac{1}{n}\sum_{i=1}^n w_n(z_i;\theta,\lambda)$.

Definition 3 (regularized empirical risk minimization) We suppose that the learner solves the empirical risk minimization problem by selecting $\hat{\theta}_\lambda(z^n)$ as follows:
$$\hat{\theta}_\lambda(z^n) \in \arg\min_{\theta\in\Theta}\big\{\hat{W}_{z^n}(\theta,\lambda)\big\} = \arg\min_{\theta\in\Theta}\frac{1}{n}\Big\{\sum_{i\in[n]}\ell(z_i;\theta) + \lambda^\top r(\theta)\Big\}. \qquad (2)$$
Once the learner solves for $\hat{\theta}_\lambda(z^n)$, the corresponding empirical risk can be readily computed as $\hat{L}_{z^n}(\hat{\theta}_\lambda(z^n)) = \frac{1}{n}\sum_{i\in[n]}\ell(z_i;\hat{\theta}_\lambda(z^n))$.
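To make Definitions 1-3 concrete, the following is a minimal NumPy sketch of the regularization vector $r(\theta)$ and the regularized empirical risk $\hat{W}_{z^n}(\theta,\lambda)$ for the elastic net instance mentioned above. The squared loss, the toy data, and all function names are illustrative assumptions and are not part of the paper.

```python
import numpy as np

def elastic_net_reg_vector(theta):
    """Regularization vector r(theta) = (r_1, r_2) for elastic net (Definition 1):
    r_1(theta) = ||theta||_1 and r_2(theta) = 0.5 * ||theta||_2^2."""
    return np.array([np.sum(np.abs(theta)), 0.5 * float(theta @ theta)])

def regularized_empirical_risk(theta, lam, X, y):
    """hat{W}_{z^n}(theta, lambda) = (1/n) sum_i ell(z_i; theta) + (1/n) lambda^T r(theta)
    (Definition 2), here with the squared loss ell(z; theta) = 0.5 * (y - theta^T x)^2."""
    n = X.shape[0]
    per_sample_losses = 0.5 * (y - X @ theta) ** 2
    return per_sample_losses.mean() + float(lam @ elastic_net_reg_vector(theta)) / n

# Toy usage: evaluate the regularized empirical risk at a random parameter vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
theta = rng.standard_normal(5)
lam = np.array([0.5, 1.0])                 # lambda = (lambda_1, lambda_2)
print(regularized_empirical_risk(theta, lam, X, y))
```

Minimizing this quantity over $\theta$ for a fixed $\lambda$ is exactly the regularized ERM problem (2); any off-the-shelf solver can be used for that step.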
While the learner can evaluate her performance on the observed data samples (also called the in-sample empirical risk, i.e., $\hat{L}_{z^n}(\hat{\theta}_\lambda(z^n))$), it is imperative to assess the performance of the learner on unobserved fresh samples, i.e., $L(\hat{\theta}_\lambda(z^n))$ (see (1)), which is referred to as the out-of-sample risk. To measure the out-of-sample risk, it is common practice to perform cross validation, as it works outstandingly well in many practical situations and is conceptually universal and simple to implement. Leave-one-out cross validation (LOOCV) uses all of the samples but one for training, with the remaining sample left out for testing, leading to an $n$-dimensional cross validation vector of out-of-sample estimates. Let us formalize this notion. Let $z^{n\setminus i} \triangleq (z_1,\ldots,z_{i-1},z_{i+1},\ldots,z_n)$ denote the set of training examples excluding $z_i$.

Definition 4 (LOOCV empirical risk minimization/cross validation vector) Let $\hat{\theta}_\lambda(z^{n\setminus i})$ be the estimated parameter over the training set $z^{n\setminus i}$, i.e.,
$$\hat{\theta}_\lambda(z^{n\setminus i}) \in \arg\min_{\theta\in\mathbb{R}^k}\big\{\hat{W}_{z^{n\setminus i}}(\theta,\lambda)\big\} = \arg\min_{\theta\in\mathbb{R}^k}\frac{1}{n-1}\Big\{\sum_{j\in[n]\setminus i}\ell(z_j;\theta) + \lambda^\top r(\theta)\Big\}. \qquad (3)$$
The cross validation vector is given by $\{\mathrm{CV}_{\lambda,i}(z^n)\}_{i\in[n]}$, where $\mathrm{CV}_{\lambda,i}(z^n) \triangleq \ell(z_i;\hat{\theta}_\lambda(z^{n\setminus i}))$, and the cross validation out-of-sample estimate is given by $\mathrm{CV}_\lambda(z^n) \triangleq \frac{1}{n}\sum_{i=1}^n \mathrm{CV}_{\lambda,i}(z^n)$.

The empirical mean and the empirical variance of the $n$-dimensional cross validation vector are used by practitioners as surrogates for assessing the out-of-sample performance of a learning method. The computational cost of solving the problem in (3) is $n$ times that of the original problem in (2). Hence, while LOOCV provides a simple yet powerful tool to estimate the out-of-sample performance, the additional factor of $n$ in computational cost makes LOOCV impractical in large-scale problems. One common remedy is to perform validation on a smaller number of samples; the downside is that the learner obtains a much noisier, and sometimes completely unstable, estimate of the out-of-sample performance compared to the case where the entire LOOCV vector is at the learner's disposal. On the other hand, ALOOCV, described next, provides the benefits of LOOCV with negligible additional computational cost. We emphasize that the presented problem formulation is general and includes a variety of parametric machine learning tasks, where the learner empirically solves an optimization problem to minimize some loss function.

3 Approximate Leave-One-Out Cross Validation (ALOOCV)

We assume that the regularized loss function is three times differentiable with continuous derivatives (see Assumption 1). This includes many learning problems, such as the $L_2$-regularized logistic loss function. We later comment on how to handle the $\ell_1$ regularizer function in LASSO. To proceed, we need one more definition.

Definition 5 (Hessian/empirical Hessian) Let $\mathcal{H}(\theta)$ denote the Hessian of the risk function, defined as $\mathcal{H}(\theta) \triangleq \nabla^2_\theta L(\theta)$. Further, let $\hat{\mathcal{H}}_{z^n}(\theta,\lambda)$ denote the empirical Hessian of the regularized loss function, defined as $\hat{\mathcal{H}}_{z^n}(\theta,\lambda) \triangleq \hat{\mathbb{E}}_{z^n}\{\nabla^2_\theta w_n(z;\theta,\lambda)\} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta w_n(z_i;\theta,\lambda)$. Similarly, we define $\hat{\mathcal{H}}_{z^{n\setminus i}}(\theta,\lambda) \triangleq \hat{\mathbb{E}}_{z^{n\setminus i}}\{\nabla^2_\theta w_{n-1}(z;\theta,\lambda)\} = \frac{1}{n-1}\sum_{j\in[n]\setminus i} \nabla^2_\theta w_{n-1}(z_j;\theta,\lambda)$.

Next we present the set of assumptions we need to prove the main result of the paper.

Assumption 1 We assume that:
(a) there exists $\theta^* \in \Theta^\circ$, where $(\cdot)^\circ$ denotes the interior operator, such that $\hat{\theta}_\lambda(z^n) - \theta^* = o_p(1)$ (here $X_n = o_p(a_n)$ means that $X_n/a_n$ approaches 0 in probability with respect to the density function $p(\cdot)$);
(b) $w_n(z;\theta,\lambda)$ is of class $C^3$ as a function of $\theta$ for all $z \in \mathcal{Z}$;
(c) $\mathcal{H}(\theta^*) \succ 0$, i.e., the Hessian of the risk at $\theta^*$ is positive definite.

Theorem 1 Under Assumption 1, let
$$\tilde{\theta}^{(i)}_\lambda(z^n) \triangleq \hat{\theta}_\lambda(z^n) + \frac{1}{n-1}\Big[\hat{\mathcal{H}}_{z^{n\setminus i}}\big(\hat{\theta}_\lambda(z^n),\lambda\big)\Big]^{-1}\nabla_\theta\ell\big(z_i;\hat{\theta}_\lambda(z^n)\big), \qquad (4)$$
assuming the inverse exists.
Then, with high probability,
$$\hat{\theta}_\lambda(z^{n\setminus i}) - \tilde{\theta}^{(i)}_\lambda(z^n) = \frac{1}{n-1}\Big[\hat{\mathcal{H}}_{z^{n\setminus i}}\big(\hat{\theta}_\lambda(z^n),\lambda\big)\Big]^{-1}\varepsilon^{(i)}_{\lambda,n}, \qquad (5)$$
where
$$\varepsilon^{(i)}_{\lambda,n} = \varepsilon^{(i),1}_{\lambda,n} - \varepsilon^{(i),2}_{\lambda,n}, \qquad (6)$$
and $\varepsilon^{(i),1}_{\lambda,n}$ is defined as
$$\varepsilon^{(i),1}_{\lambda,n} \triangleq \frac{1}{2}\sum_{j\in[n]\setminus i}\sum_{\kappa\in[k]} \big(\hat{\theta}_\lambda(z^n)-\hat{\theta}_\lambda(z^{n\setminus i})\big)^\top \nabla_{\theta_\kappa}\nabla^2_\theta w_{n-1}\big(z_j;\zeta^{i,j,1}_{\lambda,\kappa}(z^n),\lambda\big)\,\big(\hat{\theta}_\lambda(z^n)-\hat{\theta}_\lambda(z^{n\setminus i})\big)\,\hat{e}_\kappa, \qquad (7)$$
where $\hat{e}_\kappa$ is the $\kappa$-th standard unit vector, and, for all $\kappa\in[k]$, $\zeta^{i,j,1}_{\lambda,\kappa}(z^n) = \alpha^{i,j,1}_\kappa\,\hat{\theta}_\lambda(z^n) + (1-\alpha^{i,j,1}_\kappa)\,\hat{\theta}_\lambda(z^{n\setminus i})$ for some $0 \le \alpha^{i,j,1}_\kappa \le 1$. Further, $\varepsilon^{(i),2}_{\lambda,n}$ is defined as
$$\varepsilon^{(i),2}_{\lambda,n} \triangleq \sum_{j\in[n]\setminus i}\sum_{\kappa,\nu\in[k]} \hat{e}_\nu^\top\big(\hat{\theta}_\lambda(z^n)-\hat{\theta}_\lambda(z^{n\setminus i})\big)\, \nabla^2_{\theta_\kappa\theta_\nu}\nabla_\theta w_{n-1}\big(z_j;\zeta^{i,j,2}_{\lambda,\kappa,\nu}(z^n),\lambda\big)\,\big(\hat{\theta}_\lambda(z^n)-\hat{\theta}_\lambda(z^{n\setminus i})\big)^\top\hat{e}_\kappa, \qquad (8)$$
where, for $\kappa,\nu\in[k]$, $\zeta^{i,j,2}_{\lambda,\kappa,\nu}(z^n) = \alpha^{i,j,2}_{\kappa,\nu}\,\hat{\theta}_\lambda(z^n) + (1-\alpha^{i,j,2}_{\kappa,\nu})\,\hat{\theta}_\lambda(z^{n\setminus i})$ for some $0 \le \alpha^{i,j,2}_{\kappa,\nu} \le 1$. Further, we have (here $X_n = O_p(a_n)$ means that $X_n/a_n$ is stochastically bounded with respect to the density function $p(\cdot)$)
$$\hat{\theta}_\lambda(z^n) - \hat{\theta}_\lambda(z^{n\setminus i}) = O_p\!\left(\frac{1}{n}\right), \qquad (9)$$
$$\hat{\theta}_\lambda(z^{n\setminus i}) - \tilde{\theta}^{(i)}_\lambda(z^n) = O_p\!\left(\frac{1}{n^2}\right). \qquad (10)$$
See the appendix for the proof. Inspired by Theorem 1, we provide an approximation to the cross validation vector.

Definition 6 (approximate cross validation vector) Let $\mathrm{ACV}_{\lambda,i}(z^n) \triangleq \ell\big(z_i;\tilde{\theta}^{(i)}_\lambda(z^n)\big)$. We call $\{\mathrm{ACV}_{\lambda,i}(z^n)\}_{i\in[n]}$ the approximate cross validation vector. We further call
$$\mathrm{ACV}_\lambda(z^n) \triangleq \frac{1}{n}\sum_{i=1}^n \mathrm{ACV}_{\lambda,i}(z^n) \qquad (11)$$
the approximate cross validation estimator of the out-of-sample loss.

We remark that the definition can be extended to leave-q-out and q-fold cross validation by replacing the index $i$ in (4) with an index set $S$, $|S| = q$, comprised of the $q$ left-out samples. The cost of computing $\{\tilde{\theta}^{(i)}_\lambda(z^n)\}_{i\in[n]}$ is upper bounded by $O(np + C(n,p))$, where $C(n,p)$ is the computational cost of solving for $\hat{\theta}_\lambda(z^n)$ in (2); see [14]. Note that the empirical risk minimization problem posed in (2) requires time at least $\Omega(np)$. Hence, the overall cost of computing $\{\tilde{\theta}^{(i)}_\lambda(z^n)\}_{i\in[n]}$ is dominated by solving (2). On the other hand, the cost of computing the true cross validation performance by naively solving the $n$ optimization problems $\{\hat{\theta}_\lambda(z^{n\setminus i})\}_{i\in[n]}$ posed in (3) would be $O(n\,C(n,p))$, which is necessarily $\Omega(n^2 p)$, making it impractical for large-scale problems.

Corollary 2 The approximate cross validation vector is exact for kernel ridge regression. That is, given that the regularized loss function is quadratic in $\theta$, we have $\tilde{\theta}^{(i)}_\lambda(z^n) = \hat{\theta}_\lambda(z^{n\setminus i})$ for all $i\in[n]$.

Proof We notice that the error term $\varepsilon^{(i)}_{\lambda,n}$ in (6) only depends on the third derivative of the loss function in a neighborhood of $\hat{\theta}_\lambda(z^n)$. Hence, provided that the regularized loss function is quadratic in $\theta$, $\varepsilon^{(i)}_{\lambda,n} = 0$ for all $i\in[n]$.

The fact that the cross validation vector can be obtained in closed form for kernel ridge regression without actually performing cross validation is not new; the method is known as PRESS [11]. In a sense, the presented approximation can be thought of as an extension of this idea to more general loss and regularizer functions, at the expense of losing the exactness property. We remark that the idea of ALOOCV is also related to that of influence functions. In particular, influence functions have been used in [14] to derive an approximation of LOOCV for neural networks with large sample sizes. However, we notice that methods based on influence functions usually underestimate overfitting, making them impractical for model selection. In contrast, we empirically demonstrate the effectiveness of ALOOCV in capturing overfitting and in model selection.
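As an illustration of how the one-step formula (4) and the resulting ACV vector of Definition 6 can be computed, here is a NumPy sketch for an $L_2$-regularized logistic loss. It assumes that the regularized ERM solution of (2) is available (a few plain gradient steps are used below just to keep the example self-contained), and it follows the convention of (3) that the full regularization weight $\lambda$ is retained when a sample is left out; all names and constants are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_loss(theta, x, y):
    # Gradient of the logistic loss ell(z; theta) at theta, with z = (x, y), y in {0, 1}.
    return (sigmoid(x @ theta) - y) * x

def hess_loss(theta, x):
    # Hessian of the logistic loss at theta.
    s = sigmoid(x @ theta)
    return s * (1.0 - s) * np.outer(x, x)

def aloocv(theta_hat, X, y, lam):
    """One-step ALOOCV (Theorem 1, eq. (4)) for the L2-regularized logistic loss,
    with regularizer r(theta) = 0.5 * ||theta||^2 kept at full weight lam on the left-out set.
    Returns the approximate leave-one-out parameters and the ALOOCV losses."""
    n, p = X.shape
    # Sum of per-sample loss Hessians plus the regularizer Hessian; subtracting sample i
    # below yields (n-1) * hat{H}_{z^{n\i}}(theta_hat, lam).
    H_all = sum(hess_loss(theta_hat, X[j]) for j in range(n)) + lam * np.eye(p)
    thetas, acv_losses = [], []
    for i in range(n):
        H_i = H_all - hess_loss(theta_hat, X[i])          # (n-1) * hat{H}_{z^{n\i}}
        g_i = grad_loss(theta_hat, X[i], y[i])
        theta_i = theta_hat + np.linalg.solve(H_i, g_i)   # eq. (4), 1/(n-1) absorbed in H_i
        s = np.clip(sigmoid(X[i] @ theta_i), 1e-12, 1 - 1e-12)
        acv_losses.append(-(y[i] * np.log(s) + (1 - y[i]) * np.log(1 - s)))
        thetas.append(theta_i)
    return np.array(thetas), np.array(acv_losses)

# Toy usage: fit theta_hat by plain gradient descent on the regularized objective,
# then compute the ALOOCV vector and its average.
rng = np.random.default_rng(1)
n, p, lam = 80, 5, 1.0
X = rng.standard_normal((n, p))
y = (X @ rng.standard_normal(p) > 0).astype(float)
theta_hat = np.zeros(p)
for _ in range(500):
    full_grad = sum(grad_loss(theta_hat, X[i], y[i]) for i in range(n)) + lam * theta_hat
    theta_hat -= 0.1 * full_grad / n
_, acv = aloocv(theta_hat, X, y, lam)
print("ALOOCV estimate of the out-of-sample loss:", acv.mean())
```

The only expensive objects are the rank-one downdated Hessians; everything is evaluated at the single fit $\hat{\theta}_\lambda(z^n)$, which is what gives ALOOCV its computational advantage over brute-force LOOCV.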
In the case of the $\ell_1$ regularizer, we assume that the support sets of $\hat{\theta}_\lambda(z^n)$ and $\hat{\theta}_\lambda(z^{n\setminus i})$ are the same. Although this holds for large enough $n$ under Assumption 1, it is not necessarily true for a given sample $z^n$ when sample $i$ is left out. Provided that the support set of $\hat{\theta}_\lambda(z^{n\setminus i})$ is known, we use the machinery developed in Theorem 1 on the subset of parameters that are non-zero. Further, we ignore the $\ell_1$ regularizer term in the regularized loss function, as it does not contribute to the Hessian matrix locally, and we assume that the regularized loss function is otherwise smooth in the sense of Assumption 1. In this case, the cost of calculating ALOOCV scales with $O(n p_a \log(1/\epsilon))$, where $p_a$ denotes the number of non-zero coordinates in the solution $\hat{\theta}_\lambda(z^n)$. We remark that although the guarantees in Theorem 1 are asymptotic in nature, we have experimentally observed that the estimator works very well even for $n$ and $p$ as small as 50 in elastic net regression, logistic regression, and ridge regression. Next, we also provide an asymptotic characterization of the approximate cross validation.

Lemma 3 Under Assumption 1, we have
$$\mathrm{ACV}_\lambda(z^n) = \hat{L}_{z^n}\big(\hat{\theta}_\lambda(z^n)\big) + \hat{R}_{z^n}\big(\hat{\theta}_\lambda(z^n),\lambda\big) + O_p\!\left(\frac{1}{n^2}\right), \qquad (12)$$
where
$$\hat{R}_{z^n}(\theta,\lambda) \triangleq \frac{1}{n(n-1)}\sum_{i\in[n]} \nabla_\theta\ell(z_i;\theta)^\top\Big[\hat{\mathcal{H}}_{z^{n\setminus i}}(\theta,\lambda)\Big]^{-1}\nabla_\theta\ell(z_i;\theta). \qquad (13)$$
Note that, in contrast to the ALOOCV (in Theorem 1), the $O_p(1/n^2)$ error term here depends on the second derivative of the loss function with respect to the parameters, consequently leading to worse performance and underestimation of overfitting.

4 Tuning the Regularization Parameters

Thus far, we presented an approximate cross validation vector that closely follows the predictions provided by the cross validation vector while being computationally inexpensive. In this section, we use the approximate cross validation vector to tune the regularization parameters for optimal out-of-sample performance. We are interested in solving $\min_\lambda \mathrm{CV}_\lambda(z^n) = \frac{1}{n}\sum_{i=1}^n \ell\big(z_i;\hat{\theta}_\lambda(z^{n\setminus i})\big)$. To this end, we need to calculate the gradient of $\hat{\theta}_\lambda(z^n)$ with respect to $\lambda$, which is given in the following lemma.

Lemma 4 We have $\nabla_\lambda \hat{\theta}_\lambda(z^n) = -\frac{1}{n}\Big[\hat{\mathcal{H}}_{z^n}\big(\hat{\theta}_\lambda(z^n),\lambda\big)\Big]^{-1}\nabla_\theta r\big(\hat{\theta}_\lambda(z^n)\big)$.

Corollary 5 We have $\nabla_\lambda \hat{\theta}_\lambda(z^{n\setminus i}) = -\frac{1}{n-1}\Big[\hat{\mathcal{H}}_{z^{n\setminus i}}\big(\hat{\theta}_\lambda(z^{n\setminus i}),\lambda\big)\Big]^{-1}\nabla_\theta r\big(\hat{\theta}_\lambda(z^{n\setminus i})\big)$.

In order to apply first-order optimization methods to minimize $\mathrm{CV}_\lambda(z^n)$, we need to compute its gradient with respect to the tuning parameter vector $\lambda$. The chain rule implies
$$\nabla_\lambda \mathrm{CV}_\lambda(z^n) = \frac{1}{n}\sum_{i=1}^n \big[\nabla_\lambda \hat{\theta}_\lambda(z^{n\setminus i})\big]^\top\nabla_\theta\ell\big(z_i;\hat{\theta}_\lambda(z^{n\setminus i})\big) \qquad (14)$$
$$= -\frac{1}{n(n-1)}\sum_{i=1}^n \big[\nabla_\theta r\big(\hat{\theta}_\lambda(z^{n\setminus i})\big)\big]^\top\Big[\hat{\mathcal{H}}_{z^{n\setminus i}}\big(\hat{\theta}_\lambda(z^{n\setminus i}),\lambda\big)\Big]^{-1}\nabla_\theta\ell\big(z_i;\hat{\theta}_\lambda(z^{n\setminus i})\big), \qquad (15)$$
where (15) follows by substituting $\nabla_\lambda\hat{\theta}_\lambda(z^{n\setminus i})$ from Corollary 5. However, (15) is computationally expensive and almost impossible to evaluate in practice, even for medium-sized datasets. Hence, we use the ALOOCV from (4) (Theorem 1) in (14) to approximate the gradient:
$$g^{(i)}_\lambda(z^n) \triangleq -\frac{1}{n-1}\big[\nabla_\theta r\big(\tilde{\theta}^{(i)}_\lambda(z^n)\big)\big]^\top\Big[\hat{\mathcal{H}}_{z^{n\setminus i}}\big(\tilde{\theta}^{(i)}_\lambda(z^n),\lambda\big)\Big]^{-1}\nabla_\theta\ell\big(z_i;\tilde{\theta}^{(i)}_\lambda(z^n)\big). \qquad (16)$$
Further, motivated by the suggested ALOOCV, let us define the approximate gradient $g_\lambda(z^n)$ as $g_\lambda(z^n) \triangleq \frac{1}{n}\sum_{i\in[n]} g^{(i)}_\lambda(z^n)$. Based on our numerical experiments, this approximate gradient closely follows the gradient of the cross validation, i.e., $\nabla_\lambda \mathrm{CV}_\lambda(z^n) \approx g_\lambda(z^n)$.

Figure 1: The progression of the loss (ALOOCV and out-of-sample loss vs. iteration number) when Algorithm 1 is applied to ridge regression with diagonal regressors; elapsed time: 28 seconds.

Figure 2: The progression of the $\lambda$'s (mean of $\lambda_1,\ldots,\lambda_m$ and mean of $\lambda_{m+1},\ldots,\lambda_p$ vs. iteration number) when Algorithm 1 is applied to ridge regression with diagonal regressors.
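The following generic sketch shows how the approximate gradient $g_\lambda(z^n) = \frac{1}{n}\sum_i g^{(i)}_\lambda(z^n)$ of (16) can be assembled once the ALOOCV parameters $\tilde{\theta}^{(i)}_\lambda(z^n)$ from Theorem 1 are available (e.g., from the sketch after Definition 6). The callables grad_loss, hess_loss, jac_reg, and hess_reg are placeholders that the user supplies for the model at hand; their names and signatures are assumptions, not part of the paper.

```python
import numpy as np

def approx_cv_gradient(theta_tildes, X, y, lam, grad_loss, hess_loss, jac_reg, hess_reg):
    """Approximate gradient g_lambda(z^n) = (1/n) sum_i g^{(i)} of CV_lambda w.r.t. lambda (eq. (16)).
    Assumed inputs:
      theta_tildes          : list of ALOOCV parameters tilde{theta}^{(i)} from Theorem 1, eq. (4)
      grad_loss(theta, x, y): gradient of ell(z; theta), shape (p,)
      hess_loss(theta, x, y): Hessian of ell(z; theta), shape (p, p)
      jac_reg(theta)        : Jacobian of r(theta), shape (M, p), rows are grad r_m(theta)^T
      hess_reg(theta, lam)  : Hessian of lambda^T r(theta), shape (p, p)"""
    n = X.shape[0]
    g = np.zeros(len(lam))
    for i in range(n):
        th = theta_tildes[i]
        # (n-1) * hat{H}_{z^{n\i}}(th, lam): loss Hessians over j != i plus the regularizer Hessian.
        H_scaled = sum(hess_loss(th, X[j], y[j]) for j in range(n) if j != i) + hess_reg(th, lam)
        v = np.linalg.solve(H_scaled, grad_loss(th, X[i], y[i]))
        # g^{(i)} = -(1/(n-1)) * jac_reg(th) [hat{H}]^{-1} grad_ell; the 1/(n-1) factor is
        # absorbed because H_scaled already equals (n-1) * hat{H}.
        g += -jac_reg(th) @ v
    return g / n
```

In practice, one would reuse the factorized Hessians already formed while computing $\tilde{\theta}^{(i)}_\lambda(z^n)$, so the gradient comes essentially for free on top of ALOOCV.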
Note that this approximation is straightforward to compute. Therefore, using this approximation, we can apply the first-order optimization procedure in Algorithm 1 to optimize the tuning parameter $\lambda$.

Algorithm 1 Approximate gradient descent algorithm for tuning $\lambda$
  Initialize the tuning parameter $\lambda_0$, choose a step-size selection rule, and set $t = 0$.
  for $t = 0, 1, 2, \ldots$ do
    calculate the approximate gradient $g_{\lambda_t}(z^n)$
    set $\lambda_{t+1} = \lambda_t - \alpha_t\, g_{\lambda_t}(z^n)$
  end for

Although Algorithm 1 is more computationally efficient than LOOCV (saving a factor of $n$), it might still be computationally expensive for large values of $n$, as its cost still scales linearly with $n$. Hence, we also present an online version of the algorithm based on the stochastic gradient descent idea; see Algorithm 2.

Algorithm 2 Stochastic (online) approximate gradient descent algorithm for tuning $\lambda$
  Initialize the tuning parameter $\lambda_0$ and set $t = 0$.
  for $t = 0, 1, 2, \ldots$ do
    choose a random index $i_t \in \{1,\ldots,n\}$
    calculate the stochastic gradient $g^{(i_t)}_{\lambda_t}(z^n)$ using (16)
    set $\lambda_{t+1} = \lambda_t - \alpha_t\, g^{(i_t)}_{\lambda_t}(z^n)$
  end for

5 Numerical Experiments

Ridge regression with diagonal regressors: We consider the following regularized loss function:
$$w_n(z;\theta,\lambda) = \ell(z;\theta) + \frac{1}{n}\lambda^\top r(\theta) = \frac{1}{2}(y-\theta^\top x)^2 + \frac{1}{2n}\theta^\top\mathrm{diag}(\lambda)\,\theta.$$

Figure 3: The histogram of the normalized difference between LOOCV and ALOOCV for 5 runs of the algorithm on randomly selected samples for each $\lambda$ in Table 1 (MNIST dataset with n = 200 and p = 400).

Table 1: The results of logistic regression (in-sample loss, out-of-sample loss, LOOCV, ALOOCV, and influence-function LOOCV) for different regularization parameters on the MNIST dataset with n = 200 and p = 400. The numbers in parentheses represent the standard error.

| $\lambda$ | $\hat{L}_{z^n}$ | $L$ | CV | ACV | IF |
|---|---|---|---|---|---|
| 3.3333 | 0.0637 (0.0064) | 0.1095 (0.0168) | 0.1077 (0.0151) | 0.1080 (0.0152) | 0.0906 (0.0113) |
| 1.6667 | 0.0468 (0.0051) | 0.1021 (0.0182) | 0.1056 (0.0179) | 0.1059 (0.0179) | 0.0734 (0.0100) |
| 0.8333 | 0.0327 (0.0038) | 0.0996 (0.0201) | 0.1085 (0.0214) | 0.1087 (0.0213) | 0.0559 (0.0079) |
| 0.4167 | 0.0218 (0.0026) | 0.1011 (0.0226) | 0.1158 (0.0256) | 0.1155 (0.0254) | 0.0397 (0.0056) |
| 0.2083 | 0.0139 (0.0017) | 0.1059 (0.0256) | 0.1264 (0.0304) | 0.1258 (0.0300) | 0.0267 (0.0038) |
| 0.1042 | 0.0086 (0.0011) | 0.1131 (0.0291) | 0.1397 (0.0356) | 0.1386 (0.0349) | 0.0171 (0.0024) |
| 0.0521 | 0.0051 (0.0006) | 0.1219 (0.0330) | 0.1549 (0.0411) | 0.1534 (0.0402) | 0.0106 (0.0015) |

Table 2: The results of logistic regression (in-sample loss, out-of-sample loss, and ACV) on the CIFAR-10 dataset with n = 9600 and p = 3072.

| $\lambda$ | $\hat{L}_{z^n}$ | $L$ | ACV |
|---|---|---|---|
| 1e5 | 0.6578 | 0.6591 | 0.6578 (0.0041) |
| 1e4 | 0.5810 | 0.6069 | 0.5841 (0.0079) |
| 1e3 | 0.5318 | 0.5832 | 0.5444 (0.0121) |
| 1e2 | 0.5152 | 0.5675 | 0.5379 (0.0146) |
| 1e1 | 0.4859 | 0.5977 | 0.5560 (0.0183) |
| 1e0 | 0.4456 | 0.6623 | 0.6132 (0.0244) |

Table 3: Comparison of the leave-one-out estimates on the 8 outlier samples with the highest in-sample loss in the MNIST dataset.

| $\ell(z_i;\hat{\theta}_\lambda(z^n))$ | CV | ACV | IF |
|---|---|---|---|
| 0.0872 | 8.5526 | 8.6495 | 0.2202 |
| 0.0920 | 2.1399 | 2.1092 | 0.2081 |
| 0.0926 | 10.8783 | 9.4791 | 0.2351 |
| 0.0941 | 3.5210 | 3.3162 | 0.2210 |
| 0.0950 | 5.7753 | 6.1859 | 0.2343 |
| 0.0990 | 5.2626 | 5.0554 | 0.2405 |
| 0.1505 | 12.0483 | 11.5281 | 0.3878 |

In other words, we consider one regularization parameter per model parameter. To validate the proposed optimization algorithm, we consider a scenario with p = 50, where $x$ is drawn i.i.d. from $N(0, I_p)$. We let $y = \theta^\top x + \epsilon$, where $\theta_1 = \cdots = \theta_{40} = 0$, $\theta_{41},\ldots,\theta_{50} \sim N(0,1)$ i.i.d., and $\epsilon \sim N(0, 0.1)$. We draw n = 150 samples from this model and apply Algorithm 1 to optimize for $\lambda = (\lambda_1,\ldots,\lambda_{50})$.
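A self-contained sketch of Algorithm 1 for the ridge-with-diagonal-regressors setup just described is given below. The closed-form ridge solve, the constant step size, and the projection of $\lambda$ onto the nonnegative orthant are practical assumptions made for this toy example (the paper only asks for a step-size selection rule); the ALOOCV step itself is exact here by Corollary 2.

```python
import numpy as np

# w_n(z; theta, lam) = 0.5 * (y - theta^T x)^2 + (1/(2n)) * theta^T diag(lam) theta,
# i.e., one regularization parameter per model parameter, r_m(theta) = 0.5 * theta_m^2,
# so the regularizer Jacobian is grad_theta r(theta) = diag(theta).

rng = np.random.default_rng(0)
n, p = 150, 50
theta_star = np.zeros(p)
theta_star[40:] = rng.standard_normal(10)      # first 40 features irrelevant, last 10 relevant
X = rng.standard_normal((n, p))
y = X @ theta_star + np.sqrt(0.1) * rng.standard_normal(n)

lam = np.full(p, 1.0 / 3.0)                    # initialization lambda_1 = ... = lambda_50 = 1/3
step = 1000.0                                  # ad hoc constant step size (an assumption)
XtX = X.T @ X

for t in range(200):
    # Regularized ERM (closed form for this quadratic objective).
    theta_hat = np.linalg.solve(XtX + np.diag(lam), X.T @ y)

    g = np.zeros(p)                            # approximate gradient g_lambda(z^n), eq. (16)
    acv = 0.0                                  # ALOOCV estimate of the out-of-sample loss
    for i in range(n):
        A_i = XtX - np.outer(X[i], X[i]) + np.diag(lam)        # (n-1) * hat{H}_{z^{n\i}}
        grad_i = -(y[i] - X[i] @ theta_hat) * X[i]             # grad of 0.5*(y_i - theta^T x_i)^2
        theta_i = theta_hat + np.linalg.solve(A_i, grad_i)     # ALOOCV step (exact, Corollary 2)
        acv += 0.5 * (y[i] - X[i] @ theta_i) ** 2 / n
        resid_grad = -(y[i] - X[i] @ theta_i) * X[i]
        g += -theta_i * np.linalg.solve(A_i, resid_grad) / n   # g^{(i)} = -diag(theta_i)[A_i]^{-1} grad_ell

    lam = np.maximum(lam - step * g, 0.0)      # gradient step; clipping at 0 keeps weights nonnegative

print("ALOOCV estimate:", acv)
print("mean lambda over the first 40 features:", lam[:40].mean())
print("mean lambda over the last 10 features:", lam[40:].mean())
```

Algorithm 2 is obtained from the same ingredients by drawing a single random index per iteration and stepping along $g^{(i_t)}_{\lambda_t}(z^n)$ instead of the full average.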
The problem is designed such that, out of the 50 features, the first 40 are irrelevant while the last 10 are important. We initialize the algorithm with $\lambda^1_1 = \cdots = \lambda^1_{50} = 1/3$ and compute ACV using Theorem 1. Recall that in this case ACV is exactly equivalent to CV (see Corollary 2). Figures 1 and 2 plot ALOOCV, the out-of-sample loss, and the mean value of $\lambda$ calculated over the irrelevant and the relevant features, respectively. As expected, the $\lambda$ for an irrelevant feature is set, on average, to a larger value than that of a relevant feature. Finally, we remark that the optimization of the 50 tuning parameters over 800 iterations took a mere 28 seconds on a PC.

Figure 4: The application of Algorithms 1 and 2 to elastic net regression (n = 70, p = 50). The left panel shows the estimated CV, the actual CV, and the out-of-sample loss vs. the number of iterations. The right panel shows the run-time vs. n (the sample size) for LOOCV and approximate LOOCV, with an inner panel showing the run-time ratio LOOCV/ALOOCV.

Logistic regression: The second example that we consider is logistic regression:
$$w_n(z;\theta,\lambda) = \ell(z;\theta) + \frac{1}{n}\lambda^\top r(\theta) = H\big(y \,\|\, \mathrm{sigmoid}(\theta_0 + \theta^\top x)\big) + \frac{1}{2n}\lambda\|\theta\|_2^2,$$
where $H(\cdot\|\cdot)$, for any $u\in[0,1]$ and $v\in(0,1)$, is given by $H(u\|v) := u\log\frac{1}{v} + (1-u)\log\frac{1}{1-v}$ and denotes the binary cross entropy function, and $\mathrm{sigmoid}(x) := 1/(1+e^{-x})$ denotes the sigmoid function. In this case, we only consider a single regularization parameter. Since the loss and the regularizer are smooth, we resort to Theorem 1 to compute ACV. We applied logistic regression to the MNIST and CIFAR-10 image datasets, using each pixel in the image as a feature according to the aforementioned loss function. In MNIST, we classify the digits 2 and 3, while in CIFAR-10 we classify bird versus cat. As can be seen in Tables 1 and 2, ACV closely follows CV on the MNIST dataset. On the other hand, the approximation of LOOCV based on influence functions [14] performs poorly in the regime where the model is significantly overfit, and hence it cannot be used for effective model selection. On CIFAR-10, ACV takes 1s to run per sample, whereas CV takes 60s per sample, requiring days to run for each $\lambda$ even for this medium-sized problem. The histogram of the normalized difference between the CV and ACV vectors is plotted in Figure 3 for 5 runs of the algorithm for each $\lambda$ in Table 1. As can be seen, CV and ACV are almost always within 5% of each other. We have also reported the loss for the eight outlier samples with the highest in-sample loss in the MNIST dataset in Table 3. As can be seen, ALOOCV closely follows LOOCV even when the leave-one-out loss is two orders of magnitude larger than the in-sample loss for these outliers. On the other hand, the approximation based on influence functions fails to capture the out-of-sample performance and the outliers in this case.

Elastic net regression: Finally, we consider the popular elastic net regression problem [31]:
$$w_n(z;\theta,\lambda) = \ell(z;\theta) + \frac{1}{n}\lambda^\top r(\theta) = \frac{1}{2}(y-\theta^\top x)^2 + \frac{1}{n}\lambda_1\|\theta\|_1 + \frac{1}{2n}\lambda_2\|\theta\|_2^2.$$
In this case, there are only two regularization parameters to be optimized for the quasi-smooth regularized loss. Similar to the previous case, we consider $y = \theta^\top x + \epsilon$, where $\theta_\kappa = \kappa\,\rho_\kappa\,\psi_\kappa$ with $\rho_\kappa$ a Bernoulli(1/2) random variable and $\psi_\kappa \sim N(0,1)$. Hence, the features are weighted non-uniformly in $y$, and half of them are zeroed out on average.
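Before turning to the results, the following sketch illustrates how ACV can be computed for this elastic net objective using the $\ell_1$ handling described in Section 3: restrict the Theorem 1 update to the support of $\hat{\theta}_\lambda(z^n)$ and drop the $\ell_1$ term from the Hessian. The ISTA solver, the data generation, and all constants and names are illustrative assumptions; any elastic net solver could be substituted.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_elastic_net(X, y, lam1, lam2, iters=5000):
    """ISTA sketch for argmin sum_i 0.5*(y_i - theta^T x_i)^2 + lam1*||theta||_1 + 0.5*lam2*||theta||_2^2."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X).max() + lam2            # Lipschitz constant of the smooth part
    theta = np.zeros(p)
    for _ in range(iters):
        grad_smooth = -X.T @ (y - X @ theta) + lam2 * theta
        theta = soft_threshold(theta - grad_smooth / L, lam1 / L)
    return theta

def aloocv_elastic_net(theta_hat, X, y, lam2):
    """ALOOCV for elastic net: apply the Theorem 1 step only on the support of theta_hat,
    ignoring the l1 term since it does not contribute to the Hessian locally."""
    n, p = X.shape
    S = np.flatnonzero(theta_hat)                           # assumed common support of the LOO solutions
    XS = X[:, S]
    acv = np.empty(n)
    for i in range(n):
        A_i = XS.T @ XS - np.outer(XS[i], XS[i]) + lam2 * np.eye(len(S))   # (n-1)*H on the support
        grad_i = -(y[i] - XS[i] @ theta_hat[S]) * XS[i]
        theta_i = theta_hat.copy()
        theta_i[S] = theta_hat[S] + np.linalg.solve(A_i, grad_i)
        acv[i] = 0.5 * (y[i] - X[i] @ theta_i) ** 2
    return acv

# Toy usage mirroring the setup described in the text (n = 70, p = 50).
rng = np.random.default_rng(0)
n, p = 70, 50
kappa = np.arange(1, p + 1)
theta_star = kappa * rng.binomial(1, 0.5, size=p) * rng.standard_normal(p)
X = rng.standard_normal((n, p))
y = X @ theta_star + rng.standard_normal(n)
theta_hat = fit_elastic_net(X, y, lam1=1.0, lam2=1.0)
print("ACV estimate:", aloocv_elastic_net(theta_hat, X, y, lam2=1.0).mean())
```

The $\ell_1$ weight enters only through the support of $\hat{\theta}_\lambda(z^n)$; the $\ell_2$ weight is kept at its full value on the left-out problem, consistent with (3).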
We apply both Algorithms 1 and 2, where we use the approximation in Theorem 1 and the procedure described above for handling $\ell_1$ regularizers to compute ACV. We initialize with $\lambda_1 = \lambda_2 = 0$. As can be seen in the left panel of Figure 4, ACV closely follows CV in this case. Further, we see that both algorithms are capable of significantly reducing the loss after only a few iterations. The right panel compares the run-time of the algorithms vs. the number of samples. This confirms our analysis that the run-time of CV scales as $O(n^2)$, as opposed to $O(n)$ for ACV. The impact is more pronounced in the inner panel, where the run-time ratio is plotted.

Acknowledgement

This work was supported in part by DARPA under Grant No. W911NF-16-1-0561. The authors are thankful to Jason D. Lee (USC), who brought to their attention the recent work [14] on influence functions for approximating leave-one-out cross validation.

References

[1] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.
[2] Seymour Geisser. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320-328, 1975.
[3] Peter Craven and Grace Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377-403, 1978.
[4] Kenneth P. Burnham and David R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer Science & Business Media, 2003.
[5] Hirotugu Akaike. Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22(1):203-217, 1970.
[6] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199-213. Springer, 1998.
[7] K. Takeuchi. Distribution of informational statistics and a criterion of model fitting. Suri-Kagaku (Mathematical Sciences), 153:12-18, 1976.
[8] Mervyn Stone. Cross-validation and multinomial prediction. Biometrika, pages 509-515, 1974.
[9] Andrew R. Barron. Predicted squared error: a criterion for automatic model selection. 1984.
[10] John E. Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems, pages 847-854, 1992.
[11] David M. Allen. The relationship between variable selection and data agumentation and a method for prediction. Technometrics, 16(1):125-127, 1974.
[12] Ronald Christensen. Plane Answers to Complex Questions: The Theory of Linear Models. Springer Science & Business Media, 2011.
[13] R. Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495-508, 1980.
[14] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, 2017.
[15] Bradley Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171-185, 1987.
[16] Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215-223, 1979.
[17] Charles-Alban Deledalle, Samuel Vaiter, Jalal Fadili, and Gabriel Peyré. Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM Journal on Imaging Sciences, 7(4):2448-2487, 2014.
[18] Sathish Ramani, Zhihao Liu, Jeffrey Rosen, Jon-Fredrik Nielsen, and Jeffrey A. Fessler.
Regularization parameter selection for nonlinear iterative image restoration and MRI reconstruction using GCV and SURE-based methods. IEEE Transactions on Image Processing, 21(8):3659-3672, 2012.
[19] Jonas Močkus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4):347-365, 1994.
[20] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400-404. Springer, 1975.
[21] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
[22] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507-523. Springer, 2011.
[23] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.
[24] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305, 2012.
[25] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847-855. ACM, 2013.
[26] Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, and Kevin Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, pages 1-5, 2013.
[27] Gautam Kunapuli, Kristin P. Bennett, Jing Hu, and Jong-Shi Pang. Bilevel model selection for support vector machines. In CRM Proceedings and Lecture Notes, volume 45, pages 129-158, 2008.
[28] Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection via bilevel optimization. In 2006 International Joint Conference on Neural Networks (IJCNN '06), pages 1922-1929, 2006.
[29] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004-2012, 2013.
[30] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proceedings of ICLR, 2017.
[31] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.