# Iterative Approximate Cross-Validation

Yuetian Luo¹, Zhimei Ren², Rina Foygel Barber²

Cross-validation (CV) is one of the most popular tools for assessing and selecting predictive models. However, standard CV suffers from high computational cost when the number of folds is large. Recently, under the empirical risk minimization (ERM) framework, a line of work has proposed efficient methods to approximate CV based on the solution of the ERM problem trained on the full dataset. However, in large-scale problems, it can be hard to obtain the exact solution of the ERM problem, either due to limited computational resources or due to early stopping as a way of preventing overfitting. In this paper, we propose a new paradigm to efficiently approximate CV when the ERM problem is solved via an iterative first-order algorithm, without running until convergence. Our new method extends existing guarantees for CV approximation to hold along the whole trajectory of the algorithm, including at convergence, thus generalizing existing CV approximation methods. Finally, we illustrate the accuracy and computational efficiency of our method through a range of empirical studies.

¹Data Science Institute, University of Chicago, Chicago IL 60637, USA. ²Department of Statistics, University of Chicago, Chicago IL 60637, USA. Correspondence to: Yuetian Luo.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

In machine learning and statistics, cross-validation (CV) (Allen, 1974; Stone, 1974; Geisser, 1975) is one of the most popular methods for assessing and selecting predictive models. It is conceptually simple and easy to implement. Among its many variants, one popular choice is leave-one-out CV (also called the jackknife), which often offers the most accurate estimate of the out-of-sample risk in many practical and challenging scenarios (Arlot & Celisse, 2010; Stephenson & Broderick, 2020; Rad & Maleki, 2020). Leave-one-out CV uses $n-1$ out of $n$ data points for training and the remaining one for testing, and then repeats this for each of the $n$ data points in the sample. The resulting $n$ cross-validation errors can be used for model assessment, model selection, tuning parameter selection, etc. However, the superior performance of leave-one-out CV is accompanied by a high computational cost when $n$ is large, as the potentially complex model needs to be fitted $n$ times.

To be able to run leave-one-out CV while avoiding the high computational burden, several methods have been proposed to approximate the expensive model refitting step with an inexpensive surrogate step, with much progress in the recent literature particularly in the context of empirical risk minimization (ERM) (Obuchi & Kabashima, 2016; Beirami et al., 2017; Giordano et al., 2019; Koh & Liang, 2017; Wang et al., 2018; Wilson et al., 2020; Rad & Maleki, 2020; Stephenson & Broderick, 2020).

Background: approximate CV for ERM. Suppose one aims to estimate some parameter of interest $\theta$ via solving the following regularized ERM problem:

$$\hat\theta = \mathrm{argmin}_{\theta \in \mathbb{R}^p} F(Z; \theta), \tag{1}$$

where, for a dataset $Z$ of size $n$,

$$F(Z; \theta) = \sum_{i=1}^n \ell(Z_i; \theta) + \lambda \pi(\theta).$$

Here $\ell(Z; \theta)$ is the loss function for data point $Z$ and parameter $\theta$, $Z = \{Z_i\}_{i=1}^n$ is the set of observed data points, $\pi(\cdot)$ is a regularization term, and $\lambda \ge 0$ is a tuning parameter.
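To make this setup concrete, the following is a minimal NumPy sketch of a regularized ERM objective of the form (1), for logistic loss with a ridge penalty (the running example used in the experiments later in the paper); the function names and data layout are our own illustration, not taken from the paper's code.

```python
import numpy as np

def logistic_loss(Z_i, theta):
    """Per-sample loss: l(Z_i; theta) = -y*x'theta + log(1 + exp(x'theta))."""
    x, y = Z_i
    u = x @ theta
    return -y * u + np.logaddexp(0.0, u)

def F(Z, theta, lam):
    """Regularized ERM objective F(Z; theta) = sum_i l(Z_i; theta) + lam*||theta||^2."""
    X, y = Z
    u = X @ theta
    return np.sum(-y * u + np.logaddexp(0.0, u)) + lam * theta @ theta

def grad_F(Z, theta, lam):
    """Gradient of F: sum_i (sigmoid(x_i'theta) - y_i) x_i + 2*lam*theta."""
    X, y = Z
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) + 2 * lam * theta
```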
Under this setting, the widely used leave-one-out CV loss for estimating the prediction performance of $\hat\theta$ is given by

$$\mathrm{CV}(\{\hat\theta_{-i}\}_{i=1}^n) = \frac{1}{n}\sum_{i=1}^n \ell(Z_i; \hat\theta_{-i}), \tag{2}$$

where $\hat\theta_{-i} := \mathrm{argmin}_{\theta \in \mathbb{R}^p} F(Z_{-i}; \theta)$ is the minimizer of the leave-one-out objective $F(Z_{-i}; \theta) := \sum_{j=1, j\ne i}^n \ell(Z_j; \theta) + \lambda\pi(\theta)$. Since computing $\hat\theta_{-i}$ for every $i \in [n] := \{1,\dots,n\}$ is often expensive, this motivates finding an approximation to $\mathrm{CV}(\{\hat\theta_{-i}\}_{i=1}^n)$ that does not require fitting $n$ many models.

Observe that $\hat\theta$ and $\hat\theta_{-i}$ are the solutions to two highly similar minimization problems: they minimize $F(Z;\theta)$ and $F(Z_{-i};\theta) = F(Z;\theta) - \ell(Z_i;\theta)$, respectively. One approach in the recent literature proposes approximating $\hat\theta_{-i}$ by initializing at $\theta = \hat\theta$ and taking a single Newton step (NS) on the objective $F(Z_{-i};\theta)$ (Beirami et al., 2017; Rad & Maleki, 2020),

$$\tilde\theta^{\mathrm{NS}}_{-i} = \hat\theta - \big[\nabla^2_\theta F(Z_{-i}; \hat\theta)\big]^{-1} \nabla_\theta F(Z_{-i}; \hat\theta). \tag{3}$$

Another approach is based on the infinitesimal jackknife (IJ) (Jaeckel, 1972; Efron, 1982; Giordano et al., 2019):

$$\tilde\theta^{\mathrm{IJ}}_{-i} = \hat\theta - \big[\nabla^2_\theta F(Z; \hat\theta)\big]^{-1} \nabla_\theta F(Z_{-i}; \hat\theta). \tag{4}$$

(Here for simplicity we consider the case where $F$ is twice differentiable; we will discuss other settings below.)

The challenge: iterative algorithms. The accuracy of the estimators in (3) and (4) relies heavily on the assumption that $\hat\theta$ can be computed exactly, which may not be the case in many scenarios. For example, when (1) is solved via an iterative algorithm such as gradient descent (GD), we may not be able to run the algorithm to convergence due to limited computational resources. Other algorithms we might use, such as stochastic gradient descent (SGD), have a very slow rate of convergence. Another important example is that, in some settings, we may intentionally stop training early to avoid overfitting (see, e.g., Jabbar & Khan (2015)).

Figure 1 shows a simple simulation that illustrates the loss of accuracy suffered by the NS and IJ methods when run with an estimate of $\hat\theta$ obtained before convergence. Specifically, consider logistic regression with a ridge penalty, with the estimator obtained by running GD or SGD for $t$ steps; that is, we would like to estimate $\mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$, where $\hat\theta^{(t)}_{-i}$ is the solution after $t$ steps of GD or SGD on the objective function $F(Z_{-i};\theta)$. $\hat\theta^{(t)}_{-i}$ is then approximated by running either NS (3) or IJ (4), but with $\hat\theta^{(t)}$ (the $t$th step of GD or SGD on the full objective function $F(Z;\theta)$) in place of the exact minimizer $\hat\theta$. We see highly inaccurate approximations of the NS and IJ methods to the leave-one-out CV loss $\mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$ during the early iterations of GD, and even after 1000 iterations for SGD. Details of this simulation are given in Section 6.

Our contribution: iterative approximate CV. The central question of our paper is this:

Can we develop new approximation schemes for leave-one-out CV when $\hat\theta$ is not known exactly, but is estimated with an iterative algorithm that is not run to convergence?

Figure 1. The red and blue lines show the error in estimating $\mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$ for the NS and IJ methods, run with $\hat\theta^{(t)}$ as an approximation to the exact solution $\hat\theta$, while the black line shows the error for our proposed method, iterative approximate CV (IACV). The objective function is given by logistic regression plus a ridge penalty, solved iteratively with GD (left, $n = 1000$) and SGD (right, $n = 1000$, batch size 400). Solid lines represent the median value over 100 repetitions and the shaded area shows the region between the lower and upper quartiles.
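As a point of reference for what follows, here is a minimal NumPy sketch of the NS and IJ corrections (3) and (4); `grad_loo`, `hess_loo`, and `hess_full` stand for user-supplied gradient and Hessian oracles and are our own naming, not the paper's.

```python
import numpy as np

def newton_step_loo(theta_hat, grad_loo, hess_loo):
    """NS estimator (3): one Newton step on the leave-one-out objective,
    started from the full-data solution theta_hat.
    grad_loo = grad F(Z_{-i}; theta_hat); hess_loo = Hessian of F(Z_{-i}; .) there."""
    return theta_hat - np.linalg.solve(hess_loo, grad_loo)

def infinitesimal_jackknife(theta_hat, grad_loo, hess_full):
    """IJ estimator (4): same step but with the full-data Hessian, which can be
    factorized once and reused for all n leave-one-out problems."""
    return theta_hat - np.linalg.solve(hess_full, grad_loo)
```

The practical appeal of IJ is visible here: the same full-data Hessian factorization is shared across all $n$ leave-one-out corrections, whereas NS re-solves with a different Hessian for each $i$.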
In this work, we provide a positive answer to this question. We find that before the convergence of the iterative algorithm, efficient leave-one-out CV approximation can still be possible if we leverage information about the iterative nature of the algorithm. In Figure 1, we see that our proposed method, Iterative Approximate CV (IACV), achieves an accurate approximation to the leave-one-out CV loss $\mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$ across all iterations $t$, even far before convergence; at convergence, IACV yields the same results as NS. Theoretically, we are able to show that, under some regularity conditions, IACV enjoys guaranteed CV approximation along the whole trajectory of the algorithm, and moreover, at convergence, IACV recovers the NS method (3).

1.1. Additional Related Literature

In addition to the NS and IJ methods described earlier, many other works in the literature offer methods and theory for the problem of approximating leave-one-out CV. The closest to ours is the work by Koh & Liang (2017); Ghosh et al. (2020), which also considers the problem of approximating CV with an inexact $\hat\theta$, but with results of a very different flavor: this line of work assumes that $\hat\theta^{(t)}$ is already an accurate estimate of $\hat\theta$ (i.e., $t$ is large and the iterative algorithm is near convergence), and bounds the error in approximating the leave-one-out models $\hat\theta_{-i}$ as a function of the convergence error $\|\hat\theta^{(t)} - \hat\theta\|_2$. In contrast, our work instead estimates the models $\hat\theta^{(t)}_{-i}$ (i.e., at each time $t$ rather than at convergence), and does not assume that $\|\hat\theta^{(t)} - \hat\theta\|_2$ is small.

Another recent line of work studies the problem of efficient data deletion in a trained model under the ERM framework, where the aim is to provide an estimator of $\theta$ that is approximately independent of data point $i$, for the sake of the $i$th individual's privacy. These problems are often referred to as "data deletion", "machine unlearning", or "decremental learning" in the literature (Tsai et al., 2014; Cao & Yang, 2015; Bourtoule et al., 2021; Neel et al., 2021). Like our work, these methods aim to approximate the leave-one-out model efficiently given the trained full model, but they additionally require that the approximated model be statistically indistinguishable from a model that would have resulted from retraining without the $i$th individual's data (Izzo et al., 2021; Guo et al., 2020; Ginart et al., 2019). These works generally need to add noise in the approximation procedure in order to ensure this constraint is satisfied.

2. Iterative Approximate CV

In this section, we present our method for approximating leave-one-out CV in the setting of iterative optimization, where the algorithm may not be run to convergence. We find that even when $\hat\theta$ cannot be computed exactly, we can still approximate the leave-one-out iterates $\hat\theta^{(t)}_{-i}$ accurately by leveraging the structure of the iterative update. In our theoretical results below, we will see that our estimates have guaranteed accuracy (under mild conditions) along all iterations $t \ge 1$ of the algorithm, i.e., even at iterations when the algorithm is far from convergence.

2.1. Framework

The iterative procedures that we will study in this paper (gradient descent, stochastic gradient descent, and proximal gradient descent) can all be expressed within the following general framework.
Suppose our objective function can be written in the form $F(Z;\theta) = g(Z;\theta) + h(\theta)$, where $g(Z;\theta)$ is twice-differentiable in $\theta$ while $h(\theta)$ may be nondifferentiable. We consider solving the problem (1) iteratively as follows: at each step $t \ge 1$, we take a gradient step on $g$ (possibly after subsampling the data points), and a proximal step on $h$. Specifically, the iterations are given by

$$\hat\theta^{(t)} = \mathrm{argmin}_\theta \Big\{\tfrac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\Big\} \quad\text{where } \theta^\dagger = \hat\theta^{(t-1)} - \alpha_t \nabla_\theta g(Z_{S_t}; \hat\theta^{(t-1)}), \tag{5}$$

where $S_t \subseteq [n]$ is a subset of indices, $Z_{S_t} := \{Z_i : i \in S_t\}$ is the corresponding subset of the data, and $\alpha_t > 0$ is the learning rate.

In the setting where the objective function is twice-differentiable, we can take $g(Z;\theta) = F(Z;\theta)$ and $h \equiv 0$. For example, in the regularized ERM setting, if the loss function $\ell$ and the penalty function $\pi(\theta)$ are twice-differentiable, we can simply take $g(Z;\theta) = \sum_{i=1}^n \ell(Z_i;\theta) + \lambda\pi(\theta)$. Then (5) yields gradient descent if we choose the full dataset $S_t = [n]$ at each iteration,

$$\text{GD}: \quad \hat\theta^{(t)} = \hat\theta^{(t-1)} - \alpha_t \nabla_\theta F(Z; \hat\theta^{(t-1)}), \tag{6}$$

or stochastic gradient descent if at each iteration we sample a random batch $S_t \subseteq [n]$,

$$\text{SGD}: \quad \hat\theta^{(t)} = \hat\theta^{(t-1)} - \alpha_t \nabla_\theta F(Z_{S_t}; \hat\theta^{(t-1)}). \tag{7}$$

In other settings, we may have a nondifferentiable term $h(\theta)$ with an inexpensive proximal map (i.e., the solution to (5) can be computed efficiently at each iteration). For instance, in regularized ERM with a nonsmooth penalty function $\pi$ such as the $\ell_1$ norm, we might take $g(Z;\theta) = \sum_{i=1}^n \ell(Z_i;\theta)$ and $h(\theta) = \lambda\pi(\theta)$. Then the general iterative scheme (5), run again with the full dataset $S_t = [n]$ at each iteration, reduces to proximal gradient descent,

$$\text{Prox GD}: \quad \hat\theta^{(t)} = \mathrm{argmin}_\theta \Big\{\tfrac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\Big\} \quad\text{where } \theta^\dagger = \hat\theta^{(t-1)} - \alpha_t \nabla_\theta g(Z; \hat\theta^{(t-1)}). \tag{8}$$
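As an illustration of the general update (5), the following sketch implements one Prox GD step (8) for the $\ell_1$ penalty $h(\theta) = \lambda\|\theta\|_1$, whose proximal map is soft-thresholding; this is our own minimal implementation, with `grad_g` a hypothetical gradient oracle for the smooth part.

```python
import numpy as np

def soft_threshold(v, c):
    """Proximal map of c*||.||_1: argmin_z (1/2)||z - v||^2 + c*||z||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

def prox_gd_step(theta, grad_g, alpha, lam):
    """One iteration of (8): gradient step on the smooth part g,
    then a proximal step on h(theta) = lam * ||theta||_1."""
    theta_dagger = theta - alpha * grad_g(theta)      # gradient step on g
    return soft_threshold(theta_dagger, alpha * lam)  # proximal step on h
```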
2.2. Estimation Procedure: General Case

We will now provide an algorithm for Iterative Approximate CV (IACV) for general iterative procedures of the form (5). After deriving the general formulation, we will then show how it specializes to each of the three settings: GD, SGD, and Prox GD.

The targets $\hat\theta^{(t)}_{-i}$ for $i \in [n]$ are obtained by running the same iterative solver as in (5), except with the $i$th data point left out: for each $t \ge 1$,

$$\hat\theta^{(t)}_{-i} = \mathrm{argmin}_\theta \Big\{\tfrac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\Big\} \quad\text{where } \theta^\dagger = \hat\theta^{(t-1)}_{-i} - \alpha_t \nabla_\theta g(Z_{S_t \setminus \{i\}}; \hat\theta^{(t-1)}_{-i}). \tag{9}$$

In many settings, running this iterative algorithm for each $i \in [n]$ is computationally too expensive, since at each iteration $t$ it requires computing gradients of $g$ at $n$ different parameter vectors $\{\hat\theta^{(t-1)}_{-i}\}_{i=1}^n$. Thus, we aim to find computationally inexpensive surrogates for these gradients.

We now define our procedure, which produces approximations $\tilde\theta^{(t)}_{-i} \approx \hat\theta^{(t)}_{-i}$ at each iteration $t \ge 1$ and for each $i \in [n]$. At iteration $t$, if we are not running the exact leave-one-out procedure, then computing $\theta^\dagger$ in (9) involves two unknowns: the previous iterate $\hat\theta^{(t-1)}_{-i}$, which we simply approximate with our previous iteration's estimate $\tilde\theta^{(t-1)}_{-i}$, and its gradient $\nabla_\theta g(Z_{S_t\setminus\{i\}}; \hat\theta^{(t-1)}_{-i})$, which we now address. If $\theta \mapsto g(\cdot;\theta)$ is twice-differentiable, a natural idea is to approximate $\nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)}_{-i})$ by its Taylor expansion at $\hat\theta^{(t-1)}$. The reason we choose $\hat\theta^{(t-1)}$ as the base point is that it can be shared across the $n$ problems of approximating $\nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)}_{-i})$ for each $i \in [n]$. We thus take the approximation

$$\nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)}_{-i}) \approx \nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)})\big[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big], \tag{10}$$

and then again plug in $\tilde\theta^{(t-1)}_{-i}$ for the unknown $\hat\theta^{(t-1)}_{-i}$ in the last term. Thus, for each $i \in [n]$ and each $t \ge 1$, the IACV approximation is given by

$$\text{IACV}: \quad \tilde\theta^{(t)}_{-i} = \mathrm{argmin}_\theta \Big\{\tfrac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\Big\}$$
$$\text{where } \theta^\dagger = \tilde\theta^{(t-1)}_{-i} - \alpha_t\Big(\nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)})\big[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big]\Big). \tag{11}$$

2.3. Estimation Procedure for GD, SGD, and Prox GD

Next, we give the details for implementing this general procedure in the specific settings of GD, SGD, and Prox GD.

Gradient descent (GD). For GD, we have $h(\theta) \equiv 0$ and $S_t \equiv [n]$, so the steps of IACV reduce to

$$\tilde\theta^{(t)}_{-i} = \tilde\theta^{(t-1)}_{-i} - \alpha_t\Big(\nabla_\theta F(Z_{-i}; \hat\theta^{(t-1)}) + \nabla^2_\theta F(Z_{-i}; \hat\theta^{(t-1)})\big[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big]\Big). \tag{12}$$

Stochastic gradient descent (SGD). For SGD, we again have $h(\theta) \equiv 0$, but the sets $S_t$ consist of small batches of data. The steps of IACV reduce to

$$\tilde\theta^{(t)}_{-i} = \tilde\theta^{(t-1)}_{-i} - \alpha_t\Big(\nabla_\theta F(Z_{S_t\setminus i}; \hat\theta^{(t-1)}) + \nabla^2_\theta F(Z_{S_t\setminus i}; \hat\theta^{(t-1)})\big[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big]\Big). \tag{13}$$

Proximal gradient descent (Prox GD). For Prox GD, we again take $S_t \equiv [n]$, but the function $h(\theta)$ is nontrivial (e.g., a nonsmooth regularizer). The steps of IACV reduce to

$$\tilde\theta^{(t)}_{-i} = \mathrm{argmin}_\theta \Big\{\tfrac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\Big\}$$
$$\text{where } \theta^\dagger = \tilde\theta^{(t-1)}_{-i} - \alpha_t\Big(\nabla_\theta g(Z_{-i}; \hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{-i}; \hat\theta^{(t-1)})\big[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big]\Big). \tag{14}$$
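To make the update concrete, here is a minimal NumPy sketch of one IACV iteration for GD, following (12); the gradient and Hessian arrays are assumed to have been evaluated at the shared base point, and all names are our own rather than the paper's released code.

```python
import numpy as np

def iacv_gd_step(theta_hat, theta_tilde, grad_loo, hess_loo, alpha):
    """One IACV iteration for gradient descent, eq. (12).

    theta_hat:   full-data iterate theta^(t-1), shape (p,)
    theta_tilde: current approximations of the n leave-one-out iterates, shape (n, p)
    grad_loo:    grad F(Z_{-i}; theta_hat) for each i, shape (n, p)
    hess_loo:    Hessian of F(Z_{-i}; .) at theta_hat for each i, shape (n, p, p)
    """
    diff = theta_tilde - theta_hat                          # (n, p)
    # Taylor-corrected gradients (10), then the update (12)
    corrected = grad_loo + np.einsum('ipq,iq->ip', hess_loo, diff)
    return theta_tilde - alpha * corrected
```

Note that all $n$ gradients and Hessians are evaluated at the single shared point $\hat\theta^{(t-1)}$, which is what makes the per-iteration cost in Table 1 below scale like $n(A_p + B_p) + np^2$ rather than $n^2 A_p$.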
| | IACV | Exact LOO CV |
|---|---|---|
| GD | $n(A_p + B_p) + np^2$ | $n^2 A_p + np$ |
| SGD | $K(A_p + B_p) + np^2$ | $nK A_p + np$ |
| Prox GD | $n(A_p + B_p + D_p) + np^2$ | $n^2 A_p + nD_p + np$ |

Table 1. The order of per-iteration computational complexity of iterative approximate CV (IACV) as compared to exact leave-one-out (LOO) CV. See Section 3 for details.

3. Computation and Memory Complexity

In this section, we compare the computation and memory complexity of the proposed leave-one-out CV approximation method (11) with exact leave-one-out CV (9). Note that running IACV (11) assumes that we have access to the iterates $\hat\theta^{(t)}$, obtained by running the iterative procedure (5) on the full dataset. In addition, at each time $t$, we also need to calculate $\nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)})$ and $\nabla^2_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)})$ in order to compute the estimator. In other words, at each time $t$ we need to compute the gradient and Hessian of $g$ for datasets $Z_{S_t\setminus i}$ that vary with $i$, but at a parameter vector $\hat\theta^{(t-1)}$ that is constant over $i$; in general, this can be done efficiently due to the separability of the data in the ERM problem. This is different from the computational cost of exact leave-one-out CV, which instead requires computing $\nabla_\theta g(Z_{S_t\setminus i}; \hat\theta^{(t-1)}_{-i})$ for each $i$ at each iteration $t$, i.e., the gradient of $g$ needs to be computed at $n$ different parameter vectors $\{\hat\theta^{(t-1)}_{-i}\}_{i=1}^n$.

For simplicity, let us consider the case where $g(Z;\theta) = \sum_{i=1}^n \ell(Z_i;\theta)$. We denote the computational cost of one call to $\nabla_\theta\ell(Z_i,\cdot)$ as $A_p$, and one call to $\nabla^2_\theta\ell(Z_i,\cdot)$ as $B_p$; both depend on the dimension $p$ of the parameter $\theta$. We also assume that the cost of computing $\nabla_\theta g(Z;\cdot) = \sum_{i=1}^n \nabla_\theta\ell(Z_i;\cdot)$ is $n A_p$, and similarly $n B_p$ for $\nabla^2_\theta g(Z;\cdot)$; this assumption is mild since, in most typical settings, the gradient or Hessian of the loss can only be computed via evaluation on each data point one by one. Finally, let $D_p$ denote the computational cost of one call to the proximal operator $\theta^\dagger \mapsto \mathrm{argmin}_\theta\{\frac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\}$.

In Table 1, we provide the per-iteration computational complexity of computing $\{\tilde\theta^{(t)}_{-i}\}_{i=1}^n$ for a single iteration $t$ of our proposed method IACV, as compared to computing $\{\hat\theta^{(t)}_{-i}\}_{i=1}^n$ in exact leave-one-out CV, for the three iterative algorithms GD, SGD (with batch size $K$), and Prox GD. Details for these calculations are given in Appendix B. For example, for GD, we can see that IACV is more efficient as long as $B_p + p^2 \lesssim n A_p$. Since typically $A_p$ and $B_p$ are of order $p$ and $p^2$, respectively, the condition $B_p + p^2 \lesssim n A_p$ is typically satisfied when $n \gtrsim p$ (which is exactly the regime considered in the literature for analyzing the existing CV approximation methods (Beirami et al., 2017; Wilson et al., 2020)), and thus IACV is computationally more efficient in this regime.

One limitation of IACV is an increased memory cost: the proposed method needs to store the data, gradient, and Hessian, which costs $O(np + p^2)$ space, while exact leave-one-out CV costs $O(np)$ as it only needs to store the data and gradient. In the $n \gtrsim p$ regime, however, the memory cost of the two methods is of the same order.

4. Accuracy Guarantees for IACV

In this section, we provide theoretical guarantees for IACV in the GD and SGD settings. (Guarantees for Prox GD are given in Appendix A.) For both GD and SGD, we will consider the objective function $F(Z;\theta) = \sum_{i=1}^n \ell(Z_i;\theta) + \lambda\pi(\theta)$, where $\pi$ is a twice-differentiable regularizer, as before. We will study two notions of accuracy: the approximation error,

$$\mathrm{Err}^{(t)}_{\mathrm{approx}} = \frac{1}{n}\sum_{i=1}^n \big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2, \tag{15}$$

measuring our accuracy in approximating the exact leave-one-out estimators $\hat\theta^{(t)}_{-i}$, and the CV error,

$$\mathrm{Err}^{(t)}_{\mathrm{CV}} = \big|\mathrm{CV}(\{\tilde\theta^{(t)}_{-i}\}_{i=1}^n) - \mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)\big|, \tag{16}$$

measuring our accuracy in estimating the leave-one-out CV loss for $\{\hat\theta^{(t)}_{-i}\}_{i=1}^n$. (This latter notion of error is the quantity that was plotted in Figure 1.)

4.1. Guarantees for GD

We first state our assumptions for the GD setting.

Assumption 4.1. For all $t \ge 1$, $\min_{i\in[n]} \lambda_{\min}\big(\nabla^2_\theta F(Z_{-i}; \hat\theta^{(t-1)})\big) \ge n\lambda_0$ and $\max_{i\in[n]} \lambda_{\max}\big(\nabla^2_\theta F(Z_{-i}; \hat\theta^{(t-1)})\big) \le n\lambda_1$, for some positive constants $\lambda_0, \lambda_1 > 0$.

This first assumption says that the Hessian of $F$ is well-conditioned along all iterates. Note that in the literature analyzing the NS (3) or IJ (4) estimators, the Hessian is often assumed to be well-conditioned at $\hat\theta$, the minimizer of (1) (Beirami et al., 2017; Wilson et al., 2020; Giordano et al., 2019), which is sufficient since $\tilde\theta^{\mathrm{NS}}_{-i}$ and $\tilde\theta^{\mathrm{IJ}}_{-i}$ in (3) and (4) are based on $\hat\theta$ only. Here we require a stronger assumption because our estimator depends on $\hat\theta^{(t)}$ for all $t$.

The second assumption is about the gradient.

Assumption 4.2. For all $t \ge 1$ and $i \in [n]$, we have $\|\nabla_\theta \ell(Z_i; \hat\theta^{(t-1)})\|_2 \le \eta_i$ for some $\eta_i > 0$.

The final assumption requires the Hessian to be Lipschitz.

Assumption 4.3. For all $i \in [n]$, $\nabla^2_\theta F(Z_{-i}; \cdot)$ is $n\gamma$-Lipschitz, i.e., for any $\theta_1, \theta_2 \in \mathbb{R}^p$, $\big\|\nabla^2_\theta F(Z_{-i}; \theta_1) - \nabla^2_\theta F(Z_{-i}; \theta_2)\big\| \le n\gamma \|\theta_1 - \theta_2\|_2$.

We are now ready to state our first main result about IACV.

Theorem 4.4 (Approximation Error of IACV for GD). Suppose $\hat\theta^{(0)} = \hat\theta^{(0)}_{-i} = \tilde\theta^{(0)}_{-i}$ for all $i \in [n]$, $\alpha_t \le 1/(n\lambda_1)$ for $t \ge 1$, and Assumptions 4.1-4.3 are satisfied with $\gamma < 2\lambda_0$ and $n \ge \frac{4\bar\eta}{2\lambda_0 - \gamma}$, where $\bar\eta = \max_{i\in[n]}\eta_i$. Then for all $t \ge 1$ and $i \in [n]$, we have

$$\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 \le \frac{2\eta_i}{(2\lambda_0 - \gamma)n} \quad\text{and}\quad \big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2 \le \frac{4\gamma\eta_i^2}{\lambda_0(2\lambda_0 - \gamma)^2 n^2}.$$

In Theorem 4.4, for clarity of the result we give upper bounds that are independent of $t$; tighter, iteration-dependent upper bounds are discussed in Appendix D. As an immediate consequence of the last bound, the approximation error (15) is bounded as $\mathrm{Err}^{(t)}_{\mathrm{approx}} \le \frac{4\gamma\bar\eta^2}{\lambda_0(2\lambda_0-\gamma)^2 n^2}$ for all $t \ge 1$. In Theorem 4.4, we see that the error bound for $\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$ is of a smaller order than the one for $\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$.
This is exactly what we need: the goal of leave-one-out CV is to determine how the leave-one-out estimators $\hat\theta^{(t)}_{-i}$ behave differently from the estimator $\hat\theta^{(t)}$ computed on the full dataset, and thus the approximations $\tilde\theta^{(t)}_{-i}$ are useful only if they can improve over the baseline accuracy of $\hat\theta^{(t)}$ itself. In addition, it has been shown in Beirami et al. (2017); Wilson et al. (2020) that the NS (3) and IJ (4) estimators achieve an approximation error of order $1/n^2$, which is the same as our result above. However, NS and IJ only achieve this error when run with the exact value of $\hat\theta$ (as we have seen in Figure 1, this assumption is crucial for empirical accuracy), while IACV achieves the $O(1/n^2)$ approximation error bound along all iterations of the algorithm, rather than only at convergence.

Based on the approximation error bounds established in Theorem 4.4, we are also able to provide guarantees for IACV's ability to approximate the leave-one-out CV loss $\mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$, with error on the order of $1/n^2$. We will need one additional assumption on the gradient of the loss, which can be viewed as a stronger version of Assumption 4.2:

Assumption 4.5. Suppose that for any $i \in [n]$ and all $t \ge 1$, there exists $\eta'_i > 0$ such that $\sup_{a\in[0,1]} \|\nabla_\theta\ell(Z_i; a\tilde\theta^{(t)}_{-i} + (1-a)\hat\theta^{(t)}_{-i})\|_2 \le \eta'_i$.

For example, this assumption will be satisfied if $\ell(Z_i,\cdot)$ is $\eta'_i$-Lipschitz for each $i$.

Theorem 4.6 (CV Error of IACV for GD). If the assumptions in Theorem 4.4 are satisfied, and Assumption 4.5 holds, then for all $t \ge 1$,

$$\mathrm{Err}^{(t)}_{\mathrm{CV}} \le \frac{1}{n}\sum_{i=1}^n \frac{4\gamma\eta'_i\eta_i^2}{\lambda_0(2\lambda_0-\gamma)^2 n^2}.$$

Theorem 4.6 suggests that IACV may be useful when we perform early stopping based on the CV loss, as it has guaranteed accurate estimates of $\mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$ along the entire trajectory, including times $t$ that are well before convergence.

When are these assumptions satisfied? Next, we provide an example of a setting where Assumptions 4.1-4.5 are likely satisfied. Suppose the data is generated from a generalized linear model with a canonical link function (McCullagh & Nelder, 1989). Specifically, suppose $Y_i$ has density $\exp(\eta_i Y_i - \phi(\eta_i))h(Y_i)$, where $\phi : \mathbb{R} \to \mathbb{R}$ is a given function and $\eta_i$ is linked with $X_i$ via $\eta_i = X_i^\top \theta^*$. Then, given i.i.d. data $Z = \{(X_i, Y_i)\}_{i=1}^n$, the gradient and the Hessian of the negative log-likelihood with a ridge regularizer are given by

$$\nabla_\theta F(Z;\theta) = -\sum_{i=1}^n \big(Y_i - \phi'(\theta^\top X_i)\big) X_i + \lambda\theta, \qquad \nabla^2_\theta F(Z;\theta) = \sum_{i=1}^n \phi''(\theta^\top X_i) X_i X_i^\top + \lambda I_p.$$

Since $\mathbb{E}(Y_i) = \phi'(\theta^{*\top} X_i)$ and $\mathrm{Var}(Y_i) = \phi''(\theta^{*\top} X_i) > 0$, we expect the first and second derivatives of the objective to be well-controlled when $\theta$ is close to $\theta^*$, the $X_i$'s are bounded, and $n \gtrsim p$. For many non-convex problems of interest, if we have a warm-start initialization, then the problem typically satisfies certain restricted strong convexity and smoothness properties locally, and the Hessians and gradients along the iterates are often well-controlled as well (Loh & Wainwright, 2013; Chi et al., 2019).
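As a quick numerical illustration of checking Assumption 4.1 in this GLM example, the sketch below computes the extreme Hessian eigenvalues for logistic regression with a ridge penalty (where $\phi''(u) = \sigma(u)(1-\sigma(u))$); it is our own diagnostic, not part of the paper's method.

```python
import numpy as np

def logistic_hessian(X, theta, lam):
    """Hessian of the ridge-regularized logistic negative log-likelihood:
    sum_i phi''(x_i'theta) x_i x_i' + lam * I, with phi''(u) = s(u)(1 - s(u))."""
    s = 1.0 / (1.0 + np.exp(-(X @ theta)))
    W = s * (1.0 - s)                        # per-sample curvature weights
    return (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])

def conditioning(X, theta, lam):
    """Smallest/largest eigenvalues, to compare against n*lambda_0 and n*lambda_1."""
    evals = np.linalg.eigvalsh(logistic_hessian(X, theta, lam))
    return evals[0], evals[-1]
```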
4.2. Guarantees for SGD

In the SGD setting, we assume the batches $\{S_t\}_{t\ge1}$ are drawn i.i.d. such that each data point is included in $S_t$ with probability $K/n$. To establish the approximation guarantee for IACV in this setting, we need the following three assumptions, which are stochastic analogues of Assumptions 4.1-4.3.

Assumption 4.7. Suppose $F(\cdot;\theta)$ is twice differentiable in $\theta$ and there exist $\lambda_0, \lambda_1 > 0$ such that for all $t \ge 1$, $i \in [n]$, and $\alpha_t \le \frac{1}{K\lambda_1}$, we have

$$\mathbb{E}\Big(\big\|\big[I_p - \alpha_t\nabla^2_\theta F(Z_{S_t\setminus\{i\}}; \hat\theta^{(t-1)})\big][\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}]\big\|_2\Big) \le (1 - \alpha_t K\lambda_0)\,\mathbb{E}\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2,$$

where the expectation is taken over the randomly drawn batches $S_1,\dots,S_t$. For example, this assumption would be satisfied under the stronger assumption that for all $a, \theta \in \mathbb{R}^p$, all $i \in [n]$, and a randomly drawn batch $S$, $\mathbb{E}_S\big(\|(I_p - \alpha_t\nabla^2_\theta F(Z_{S\setminus\{i\}};\theta))a\|_2\big) \le (1-\alpha_t K\lambda_0)\|a\|_2$. Essentially, we can interpret $K\lambda_0$ and $K\lambda_1$ as bounding the smallest and largest eigenvalues of $\nabla^2_\theta F(Z_{S\setminus\{i\}};\theta)$.

Our next assumptions bound the gradient and the Lipschitz constant of the Hessian.

Assumption 4.8. For all $t \ge 1$ and $i \in [n]$, we have $\mathbb{E}_{S_1,\dots,S_{t-1}}\|\nabla_\theta\ell(Z_i;\hat\theta^{(t-1)})\|_2 \le \eta_i$ for some $\eta_i > 0$.

Assumption 4.9. There exists $\gamma > 0$ such that for any $i \in [n]$ and $t \ge 1$,

$$\mathbb{E}\Big(\sup_{\bar\theta}\big\|\big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}}; \bar\theta) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}}; \hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\|_2\Big) \le \gamma K\, \mathbb{E}\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2^2,$$

where the supremum is taken over $\bar\theta = a\hat\theta^{(t-1)}_{-i} + (1-a)\hat\theta^{(t-1)}$ for $a \in [0,1]$. For example, this assumption would be satisfied if, given any $v, \theta_1, \theta_2 \in \mathbb{R}^p$, $\mathbb{E}_S\big(\|(\nabla^2_\theta F(Z_{S\setminus\{i\}};\theta_1) - \nabla^2_\theta F(Z_{S\setminus\{i\}};\theta_2))v\|_2\big) \le \gamma K \|v\|_2\|\theta_1 - \theta_2\|_2$.

In addition, we need one extra assumption, as follows.

Assumption 4.10. There exists $\beta \ge 1$ such that for all $t \ge 1$, we have

$$\mathbb{E}_{S_1,\dots,S_{t-1}}\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2^2 \le \beta\,\frac{n}{K}\Big(\mathbb{E}_{S_1,\dots,S_{t-1}}\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2\Big)^2.$$

Assumption 4.10 can be viewed as a reverse Jensen's inequality, but with an inflation factor $\beta\frac{n}{K}$ on the right-hand side. The order $\frac{n}{K}$ of this inflation factor is expected. A simple way to see this is to examine the inequality when $t = 2$. If $\hat\theta^{(0)}_{-i} = \hat\theta^{(0)}$, then $\hat\theta^{(1)}_{-i} - \hat\theta^{(1)} = 0$ with probability $1 - \frac{K}{n}$, and $\hat\theta^{(1)}_{-i} - \hat\theta^{(1)} = \alpha_1\nabla_\theta\ell(Z_i;\hat\theta^{(0)})$ with probability $\frac{K}{n}$ (depending on whether data point $i$ is excluded or included in the first batch $S_1$ at time $t = 1$). Thus, we have $\mathbb{E}_{S_1}\|\hat\theta^{(1)}_{-i} - \hat\theta^{(1)}\|_2^2 = \frac{n}{K}\big(\mathbb{E}_{S_1}\|\hat\theta^{(1)}_{-i} - \hat\theta^{(1)}\|_2\big)^2$.

Now we are ready to present a guarantee on the approximation error of IACV in the SGD setting.

Theorem 4.11 (Approximation Error of IACV for SGD). Suppose $\hat\theta^{(0)} = \hat\theta^{(0)}_{-i} = \tilde\theta^{(0)}_{-i}$ for all $i \in [n]$, $\alpha_t \le 1/(K\lambda_1)$ for all $t \ge 1$, and Assumptions 4.7-4.10 are satisfied with $\gamma\beta < 2\lambda_0$ and $K \ge \frac{4\bar\eta}{2\lambda_0 - \gamma\beta}$. Then for all $t \ge 1$ and $i \in [n]$, we have

$$\mathbb{E}\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 \le \frac{2\eta_i}{(2\lambda_0 - \gamma\beta)n} \quad\text{and}\quad \mathbb{E}\big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2 \le \frac{4\gamma\beta\eta_i^2}{\lambda_0(2\lambda_0 - \gamma\beta)^2 nK},$$

where the expectation is taken over the randomly drawn batches $S_1,\dots,S_t$.

By Theorem 4.11, the expected approximation error in the SGD case is bounded as $\mathbb{E}\,\mathrm{Err}^{(t)}_{\mathrm{approx}} \le \frac{4\gamma\beta\bar\eta^2}{\lambda_0(2\lambda_0-\gamma\beta)^2 nK}$ for all $t \ge 1$. Interestingly, we find that the baseline approximation error $\mathbb{E}\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$ does not depend on $K$ and is of the same order $\frac{1}{n}$ as the one in Theorem 4.4. In contrast, the IACV error $\mathbb{E}\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$ scales at the order of $\frac{1}{nK}$, as compared to order $\frac{1}{n^2}$ in the previous setting. This factor comes from the inflation factor $\frac{n}{K}$ in Assumption 4.10; as discussed immediately below that assumption, the factor $\frac{n}{K}$ cannot be removed if we expect the assumption to hold in practice for all $t \ge 1$, but a remaining open question is whether the factor $\frac{n}{K}$ can be removed if we only require it to hold for all $t \ge T_0$ for some large $T_0$. If this is the case, then (for sufficiently large $t$) the IACV error in Theorem 4.11 will scale as $\frac{1}{n^2}$ rather than $\frac{1}{nK}$.
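Before moving to the CV-error guarantee, here is a minimal NumPy sketch of one IACV iteration for SGD, following (7) and (13); batch sampling matches the inclusion model above (each point enters $S_t$ independently with probability $K/n$), and the per-sample gradient and Hessian oracles are hypothetical stand-ins for a user-supplied model.

```python
import numpy as np

def iacv_sgd_step(theta_hat, theta_tilde, grad_i, hess_i, alpha, K, rng):
    """One IACV iteration for SGD, eq. (13).

    grad_i(j, theta) / hess_i(j, theta): per-sample gradient / Hessian oracles
    for l(Z_j; .) (hypothetical stand-ins, not the paper's API).
    """
    n, p = theta_tilde.shape
    batch = np.flatnonzero(rng.random(n) < K / n)  # i in S_t w.p. K/n, independently
    if batch.size == 0:
        return theta_tilde.copy()                  # empty batch: no gradient info
    g = np.array([grad_i(j, theta_hat) for j in batch])   # (|S_t|, p)
    H = np.array([hess_i(j, theta_hat) for j in batch])   # (|S_t|, p, p)
    G_sum, H_sum = g.sum(axis=0), H.sum(axis=0)           # batch totals
    pos = {j: k for k, j in enumerate(batch)}
    theta_new = np.empty_like(theta_tilde)
    for i in range(n):
        # leave-one-out batch quantities: remove point i's term if i is in S_t
        G_i = G_sum - (g[pos[i]] if i in pos else 0.0)
        H_i = H_sum - (H[pos[i]] if i in pos else 0.0)
        diff = theta_tilde[i] - theta_hat
        theta_new[i] = theta_tilde[i] - alpha * (G_i + H_i @ diff)
    return theta_new
```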
Finally, we provide a CV error guarantee under the SGD setting, with one additional assumption:

Assumption 4.12. Suppose that for any $i \in [n]$ and all $t \ge 1$, there exists $\eta'_i > 0$ such that

$$\mathbb{E}\Big[\sup_{a\in[0,1]} \big|\nabla_\theta\ell(Z_i; a\tilde\theta^{(t)}_{-i} + (1-a)\hat\theta^{(t)}_{-i})^\top(\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i})\big|\Big] \le \eta'_i\, \mathbb{E}\big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2.$$

For example, this assumption will be satisfied if $\ell(Z_i,\cdot)$ is $\eta'_i$-Lipschitz for each $i$.

Theorem 4.13 (CV Error of IACV for SGD). If the assumptions in Theorem 4.11 and Assumption 4.12 are satisfied, then for all $t \ge 1$,

$$\mathbb{E}\,\mathrm{Err}^{(t)}_{\mathrm{CV}} \le \frac{1}{n}\sum_{i=1}^n \frac{4\gamma\beta\eta'_i\eta_i^2}{\lambda_0(2\lambda_0-\gamma\beta)^2 nK},$$

where the expectation is taken over the randomly drawn batches $S_1,\dots,S_t$.

As before, this result is analogous to Theorem 4.6 for the GD setting, but with rate $1/(nK)$ in place of $1/n^2$.

5. Limiting Behavior of IACV

As we have seen in Section 4, IACV achieves an accurate approximation along all iterations of the algorithm under some regularity conditions. Thus, in the setting where $\hat\theta^{(t)}$ indeed converges to $\hat\theta$, IACV also has the same approximation properties in the limit, and its approximation error bound is comparable to the guarantee of the one-step Newton (NS) estimator in (3), as we mentioned in Section 4.1. In this section, we show this is not a coincidence: in fact, there is a close connection between the proposed estimator $\tilde\theta^{(t)}_{-i}$ and the NS estimator when $t \to \infty$ and the algorithm converges to $\hat\theta$.

Theorem 5.1. Suppose $\hat\theta^{(t)}$ converges to $\hat\theta$, Assumption 4.1 is satisfied along all iterations, $\alpha_t = \alpha < 1/(n\lambda_1)$ for all $t \ge 1$, and $\nabla^2_\theta F(Z_{-i};\theta)$, $\nabla_\theta F(Z_{-i};\theta)$ are continuous in $\theta$ for all $i \in [n]$. Then $\tilde\theta^{(t)}_{-i}$ defined in (12) converges to $\tilde\theta^{\mathrm{NS}}_{-i}$ in (3), as $t \to \infty$.

Moreover, in Theorem A.3, we will show that similar guarantees continue to hold in the Prox GD setting. (In the SGD setting, however, the techniques for proving Theorem 5.1 fail; we will give an intuition for why this is the case in Appendix E.2.) In view of Theorem 5.1, we can regard IACV as an extension of $\tilde\theta^{\mathrm{NS}}_{-i}$ in (3) to iterative solvers, with provable per-iteration guarantees. This provides a safe and efficient way to approximate CV in practice, where it is often unknown whether the algorithm will converge to the solution of the ERM problem or not.

6. Simulation Studies

In this section, we conduct numerical studies to investigate the empirical performance of the proposed IACV method and to verify our theoretical findings for GD, SGD, and Prox GD.¹ The data is generated from a logistic regression model, with $Z_i = (X_i, Y_i) \in \mathbb{R}^p \times \{0,1\}$, dimension $p = 20$, $X_i$ drawn with i.i.d. $N(0,1)$ entries, and $Y_i \sim \mathrm{Bernoulli}\big(\exp(X_i^\top\theta^*)/(1+\exp(X_i^\top\theta^*))\big)$ for a true parameter vector $\theta^*$ which has 5 randomly chosen nonzero entries drawn as $N(0,1)$, and all other entries zero. Our objective function is the regularized negative log-likelihood, $F(Z;\theta) = \sum_{i=1}^n \ell(Z_i;\theta) + \lambda\pi(\theta)$, where $\ell(Z_i;\theta) = -Y_i X_i^\top\theta + \log(1 + \exp(X_i^\top\theta))$. For GD and SGD, we use ridge regularization, with $\pi(\theta) = \|\theta\|_2^2$ and penalty parameter $\lambda = 10^{-6} n$. For Prox GD we instead use the logistic lasso, with $\pi(\theta) = \|\theta\|_1$ and $\lambda = 10^{-6} n$. We initialize the algorithm at the origin. Each simulation study is repeated for 100 independent trials.

¹Code to reproduce all experiments is available at https://github.com/yuetianluo/IACV.
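For concreteness, here is a minimal NumPy sketch of the data-generating process just described; the function and variable names are our own illustration.

```python
import numpy as np

def simulate_logistic_data(n, p=20, s=5, seed=0):
    """Generate (X, y, theta_star) as in Section 6: Gaussian features,
    a sparse N(0,1) signal on s random coordinates, Bernoulli responses."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    theta_star[support] = rng.standard_normal(s)
    X = rng.standard_normal((n, p))
    prob = 1.0 / (1.0 + np.exp(-(X @ theta_star)))
    y = rng.binomial(1, prob)
    return X, y, theta_star
```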
Our proposed method is given by the IACV estimator $\tilde\theta^{(t)}_{-i}$, defined in (12) for GD and in (13) for SGD. For comparison, we also implement the one-step Newton (NS) and infinitesimal jackknife (IJ) estimators along the optimization path, i.e., we use the $t$th iterate $\hat\theta^{(t)}$ in place of the true minimizer $\hat\theta$ in the definitions of the NS (3) and IJ (4) estimators, leading to the approximate NS estimator

$$\tilde\theta^{\mathrm{NS}(t)}_{-i} = \hat\theta^{(t)} - \big[\nabla^2_\theta F(Z_{-i}; \hat\theta^{(t)})\big]^{-1}\nabla_\theta F(Z_{-i}; \hat\theta^{(t)})$$

and similarly the approximate IJ estimator

$$\tilde\theta^{\mathrm{IJ}(t)}_{-i} = \hat\theta^{(t)} - \big[\nabla^2_\theta F(Z; \hat\theta^{(t)})\big]^{-1}\nabla_\theta F(Z_{-i}; \hat\theta^{(t)}).$$

Finally, we also compare to a baseline estimator where we simply approximate the leave-one-out iterate $\hat\theta^{(t)}_{-i}$ with the full-data iterate $\hat\theta^{(t)}$: $\hat\theta^{\mathrm{baseline}(t)}_{-i} = \hat\theta^{(t)}$. Of course, this baseline is not useful in practice (since data point $i$ has not actually been removed from the estimator), and thus any proposed method is only meaningful if it can perform substantially better than this baseline.

We measure the accuracy of the four methods (baseline, NS, IJ, and IACV) in terms of the averaged approximation error $\mathrm{Err}^{(t)}_{\mathrm{approx}}$ defined in (15), and the relative CV error, $\mathrm{RelErr}^{(t)}_{\mathrm{CV}} = \mathrm{Err}^{(t)}_{\mathrm{CV}} / \mathrm{CV}(\{\hat\theta^{(t)}_{-i}\}_{i=1}^n)$, where $\mathrm{Err}^{(t)}_{\mathrm{CV}}$ is defined in (16).

6.1. Gradient Descent (GD)

For the gradient descent simulation, we consider sample sizes $n = 250$ and $n = 1000$, and $\alpha_t = 0.5/n$. In the top panels of Figure 2, we can see that along the iterations of the algorithm, the approximation error of IACV is always better than NS and IJ before the convergence of the algorithm. As a result, IACV also achieves better CV error, as illustrated in the middle panels of Figure 2. In fact, during early iterations, the error of both NS and IJ is higher than that of the (noninformative) baseline estimator $\hat\theta^{\mathrm{baseline}(t)}_{-i}$, while IACV's error is substantially lower; IACV also shows a smaller variance in estimation, as indicated by its narrower shaded areas. When the algorithm converges, we find that the performance of IACV and the NS estimator is almost the same, which matches the theoretical prediction in Theorem 5.1 (where we see that, under mild assumptions, IACV converges to the NS estimator, i.e., the black line and the red line will meet in the limit), while the IJ method shows higher error even at convergence. In addition, we observe that as the sample size $n$ increases from 250 to 1000, the limiting approximation error of IACV decreases from $1.5\times10^{-3}$ to $6.8\times10^{-5}$. This roughly matches what we have shown in Theorem 4.4: the approximation error decreases quadratically with respect to the sample size.

Finally, in the bottom panels of Figure 2, we report the runtime of IACV as compared to exact leave-one-out CV.

Figure 2. Comparison of the baseline, NS, IJ, and IACV methods in gradient descent for logistic regression with ridge regularizer. Top: approximation error comparison; middle: relative CV error comparison; bottom: runtime comparison of exact leave-one-out CV and IACV. Solid lines represent the median value over 100 experiments and shaded areas denote the region between the lower and upper quartiles.

We can see that IACV is much faster than the exact leave-one-out CV method; in particular, when $n = 1000$, IACV shows approximately a 6-7 times speed-up. A larger scale simulation for GD is provided in Appendix C.

6.2. Stochastic Gradient Descent (SGD)

For the stochastic gradient descent simulation, we take sample size $n = 1000$, and test batch sizes $K = 100$ and $K = 400$. We choose $\alpha_t$ based on a common strategy called epoch doubling in the literature, where we run $T_0 = 1000$ steps with step size $\alpha = 0.5/K$, then run $2T_0$ steps with step size $\alpha/2$, and so on. We plot the accuracy of the different methods in Figure 3, in the top (approximation error) and middle (CV error) panels. We can see that IACV has a clear advantage over the other methods. In contrast to the simulation results in Figure 2 for GD, here the red line (for NS) does not reach the black line (for IACV) even after $T = 10^5$ iterations, in terms of approximation error.
This may be due to the slow convergence of SGD; nonetheless, it is still unclear whether the NS estimator $\tilde\theta^{\mathrm{NS}(t)}_{-i}$ and our estimator $\tilde\theta^{(t)}_{-i}$ will converge to the same limit or not, since we do not know whether a result analogous to Theorem 5.1 holds in the setting of SGD. As for GD, we see that IACV offers error far lower than both NS and IJ during early iterations, when NS and IJ show error higher even than the baseline estimator. Finally, we show the runtime of IACV and of exact leave-one-out CV in the bottom panels of Figure 3. We can see that IACV still has clear computational advantages in this setting.

Figure 3. Comparison of the baseline, NS, IJ, and IACV methods in stochastic gradient descent for logistic regression with ridge regularizer; details of the panels are the same as in Figure 2. Note that in the top panels, the results for NS and IJ (red and blue) are nearly perfectly overlapping and thus difficult to distinguish.

6.3. Proximal Gradient Descent (Prox GD)

Finally, for the proximal gradient descent simulation, we take sample sizes $n = 250$ and $n = 1000$, and $\alpha_t = 0.5/n$, as in the GD simulation. The regularizer is now $\pi(\theta) = \|\theta\|_1$, a nonsmooth function (see Appendix A for the definition of the NS and IJ methods in this nonsmooth setting). The results for this setting, shown in Figure 4, are qualitatively very similar to the GD setting; again, we see that IACV shows good accuracy in terms of both approximation error and CV error even at early iterations, while NS and IJ show error higher than the baseline during early iterations. Finally, when the algorithm converges, the performance of our estimator is similar to the NS estimator, as predicted in Theorem A.3 (the analogue of Theorem 5.1 for the Prox GD setting).

Figure 4. Comparison of the baseline, NS, IJ, and IACV methods in proximal gradient descent for logistic regression with $\ell_1$ regularizer (the logistic lasso); details of the panels are the same as in Figure 2.

7. Conclusion and Discussion

In this paper, we provide a new method, iterative approximate cross-validation (IACV), to efficiently approximate the computationally expensive leave-one-out CV under the ERM framework when the problem is solved by common iterative algorithms. This work suggests several interesting directions for further exploration. For example, Theorem 5.1 establishes that, for gradient descent, the IACV and NS estimators coincide at convergence; we do not yet know whether an analogous result holds for SGD. We can also consider the performance of IACV in a high-dimensional setting. For instance, in regression with $p > n$, the NS estimator defined in (3) fails due to high dimensionality, but Stephenson & Broderick (2020) propose running $\ell_1$-regularized regression and then computing the NS estimator on the selected active set of $\hat\theta$. It would be interesting to see whether we can adapt the IACV method to that setting and provide theoretical guarantees for the sparse high-dimensional regime.

Acknowledgements

Z.R. and R.F.B. were supported by the Office of Naval Research via grant N00014-20-1-2337. R.F.B. was additionally supported by the National Science Foundation via grants DMS-1654076 and DMS-2023109.
References

Allen, D. M. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125-127, 1974.

Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.

Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. On optimal generalizability in parametric learning. Advances in Neural Information Processing Systems, 30, 2017.

Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141-159. IEEE, 2021.

Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463-480. IEEE, 2015.

Chi, Y., Lu, Y. M., and Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239-5269, 2019.

Efron, B. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, 1982.

Geisser, S. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320-328, 1975.

Ghosh, S., Stephenson, W., Nguyen, T. D., Deshpande, S., and Broderick, T. Approximate cross-validation for structured models. Advances in Neural Information Processing Systems, 33:8741-8752, 2020.

Ginart, A., Guan, M., Valiant, G., and Zou, J. Y. Making AI forget you: Data deletion in machine learning. Advances in Neural Information Processing Systems, 32, 2019.

Giordano, R., Stephenson, W., Liu, R., Jordan, M., and Broderick, T. A Swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1139-1147. PMLR, 2019.

Guo, C., Goldstein, T., Hannun, A., and Van Der Maaten, L. Certified data removal from machine learning models. In Proceedings of the 37th International Conference on Machine Learning, pp. 3832-3842, 2020.

Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008-2016. PMLR, 2021.

Jabbar, H. and Khan, R. Z. Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). Computer Science, Communication and Instrumentation Devices, 70:163-172, 2015.

Jaeckel, L. The infinitesimal jackknife. Memorandum MM 72-1215-11, Bell Laboratories, Murray Hill, NJ, 1972.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885-1894. PMLR, 2017.

Loh, P.-L. and Wainwright, M. J. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Advances in Neural Information Processing Systems, 26, 2013.

McCullagh, P. and Nelder, J. Generalized Linear Models. Chapman & Hall, 1989.

Neel, S., Roth, A., and Sharifi-Malvajerdi, S. Descent-to-delete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, pp. 931-962. PMLR, 2021.

Obuchi, T. and Kabashima, Y. Cross validation in LASSO and its acceleration. Journal of Statistical Mechanics: Theory and Experiment, 2016(5):053304, 2016.

Rad, K. R. and Maleki, A. A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):965-996, 2020.
Stephenson, W. and Broderick, T. Approximate cross-validation in high dimensions with guarantees. In International Conference on Artificial Intelligence and Statistics, pp. 2424-2434. PMLR, 2020.

Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111-133, 1974.

Tsai, C.-H., Lin, C.-Y., and Lin, C.-J. Incremental and decremental training for linear classification. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 343-352, 2014.

Wang, S., Zhou, W., Lu, H., Maleki, A., and Mirrokni, V. Approximate leave-one-out for fast parameter tuning in high dimensions. In International Conference on Machine Learning, pp. 5228-5237. PMLR, 2018.

Wilson, A., Kasy, M., and Mackey, L. Approximate cross-validation: Guarantees for model assessment and selection. In International Conference on Artificial Intelligence and Statistics, pp. 4530-4540. PMLR, 2020.

Wright, S. J. and Recht, B. Optimization for Data Analysis. Cambridge University Press, 2022.

A. Guarantees for IACV in the Proximal GD Setting

We now provide the analogues of Theorems 4.4 and 4.6 in the Prox GD setting, working with the objective function $F(Z;\theta) = g(Z;\theta) + h(\theta)$, where as before, $g(Z;\theta) = \sum_{i=1}^n \ell(Z_i;\theta)$ is the empirical risk and $h(\theta) = \lambda\pi(\theta)$ is the nonsmooth regularization term.

Theorem A.1. Suppose $\hat\theta^{(0)} = \hat\theta^{(0)}_{-i} = \tilde\theta^{(0)}_{-i}$ for all $i \in [n]$, $\alpha_t \le 1/(n\lambda_1)$ for $t \ge 1$, and Assumptions 4.1-4.3 are satisfied with $F(\cdot;\theta)$ replaced by $g(\cdot;\theta)$, with $\gamma < 2\lambda_0$ and $n \ge \frac{4\bar\eta}{2\lambda_0-\gamma}$. Assume also that $h$ is convex. Then for all $t \ge 1$ and $i \in [n]$, we have

$$\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 \le \frac{2\eta_i}{(2\lambda_0-\gamma)n} \quad\text{and}\quad \big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2 \le \frac{4\gamma\eta_i^2}{\lambda_0(2\lambda_0-\gamma)^2 n^2}.$$

Note that these bounds are identical to the results obtained in Theorem 4.4, and thus, as before, we obtain the approximation error bound $\mathrm{Err}^{(t)}_{\mathrm{approx}} \le \frac{4\gamma\bar\eta^2}{\lambda_0(2\lambda_0-\gamma)^2 n^2}$ for all $t \ge 1$. The proof of Theorem A.1 is provided in Appendix E.1.

Next we bound the CV error for IACV in the Prox GD setting.

Theorem A.2. Suppose the assumptions in Theorem A.1 and Assumption 4.5 are satisfied. Then for all $t \ge 1$,

$$\mathrm{Err}^{(t)}_{\mathrm{CV}} \le \frac{1}{n}\sum_{i=1}^n \frac{4\gamma\eta'_i\eta_i^2}{\lambda_0(2\lambda_0-\gamma)^2 n^2}.$$

Again, this bound is the same as the one obtained in Theorem 4.6 for GD. The proof of Theorem A.2 is essentially the same as the proof of Theorem 4.6, and for brevity we omit it here.

Next, we provide convergence theory for IACV in the Prox GD setting, to obtain a result analogous to Theorem 5.1 for the GD case, comparing the IACV and NS estimators. First, we need to define the NS estimator in this setting where $F(Z;\theta)$ is nonsmooth. The main idea is to replace the one Newton step in (3) with one proximal Newton step (Wilson et al., 2020):

$$\tilde\theta^{\mathrm{NS}}_{-i} = \mathrm{prox}^{\nabla^2_\theta g(Z_{-i};\hat\theta)}_h\Big(\hat\theta - \big[\nabla^2_\theta g(Z_{-i};\hat\theta)\big]^{-1}\nabla_\theta g(Z_{-i};\hat\theta)\Big), \tag{17}$$

where, given any positive definite matrix $H \in \mathbb{R}^{p\times p}$ and convex function $h : \mathbb{R}^p \to \mathbb{R}$, $\mathrm{prox}^H_h(x)$ is defined as follows:

$$\mathrm{prox}^H_h(x) = \mathrm{argmin}_{z\in\mathbb{R}^p} \tfrac{1}{2}(x-z)^\top H(x-z) + h(z). \tag{18}$$

(We can similarly define the IJ estimator in this setting as

$$\tilde\theta^{\mathrm{IJ}}_{-i} = \mathrm{prox}^{\nabla^2_\theta g(Z_{-i};\hat\theta)}_h\Big(\hat\theta - \big[\nabla^2_\theta g(Z;\hat\theta)\big]^{-1}\nabla_\theta g(Z_{-i};\hat\theta)\Big). \tag{19}$$

In our simulations in Section 6.3, we simply replace $\hat\theta$ with $\hat\theta^{(t)}$ in equations (17) and (19) to obtain our approximate NS and IJ iterations.)
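The $H$-weighted proximal map (18) generally has no closed form for $\ell_1$ when $H$ is non-diagonal; one way to compute it is with a generic solver, as in the following rough sketch (our own choice of `scipy.optimize.minimize` with the derivative-free Powell method, for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

def prox_H_l1(x, H, lam):
    """prox_h^H(x) from (18) with h(z) = lam*||z||_1:
    argmin_z 0.5*(x-z)'H(x-z) + lam*||z||_1, via a generic solver."""
    def obj(z):
        d = x - z
        return 0.5 * d @ H @ d + lam * np.sum(np.abs(z))
    res = minimize(obj, x0=x.copy(), method="Powell")  # derivative-free: |.| is nonsmooth
    return res.x

def proximal_newton_step(theta_hat, grad_loo, hess_loo, lam):
    """One proximal Newton step (17): Newton step on g, then H-weighted prox on h."""
    x = theta_hat - np.linalg.solve(hess_loo, grad_loo)
    return prox_H_l1(x, hess_loo, lam)
```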
Next, we show that $\tilde\theta^{(t)}_{-i}$ in (14) converges to $\tilde\theta^{\mathrm{NS}}_{-i}$ in (17), under suitable assumptions.

Theorem A.3. Suppose $\hat\theta^{(t)}$ converges to $\hat\theta$, Assumption 4.1 is satisfied along all iterations with $g(\cdot;\theta)$ in place of $F(\cdot;\theta)$, $\alpha_t = \alpha < 1/(n\lambda_1)$ for all $t \ge 1$, and $\nabla^2_\theta g(Z_{-i};\theta)$, $\nabla_\theta g(Z_{-i};\theta)$ are continuous in $\theta$ for all $i \in [n]$. In addition, assume $h(\theta)$ is convex. Then $\tilde\theta^{(t)}_{-i}$ in (14) converges to $\tilde\theta^{\mathrm{NS}}_{-i}$ in (17), as $t \to \infty$.

The proof of this theorem is provided in Appendix E.2.

Figure 5. Approximation error and relative CV error comparison of the baseline, NS, IJ, and IACV methods in gradient descent for logistic regression with ridge regularizer. Solid lines represent the median value over 100 experiments and shaded areas denote the region between the lower and upper quartiles.

B. Details for Deriving Computational Complexity

To derive the calculations in Table 1: for example, for GD, the main per-iteration cost of our method comes from computing $\{\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)})\}_{i=1}^n$ and $\{\nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\}_{i=1}^n$, and performing the Hessian-gradient product. As assumed above, due to the relation $\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) = \sum_{j\in[n], j\ne i}\nabla_\theta\ell(Z_j;\hat\theta^{(t-1)})$, the cost of computing $\{\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)})\}_{i=1}^n$ and $\{\nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\}_{i=1}^n$ is of order $n(A_p + B_p)$. Moreover, the cost of the Hessian-gradient product is of order $np^2$. On the other hand, the main per-iteration computational cost of exact leave-one-out CV (i.e., running an iteration of gradient descent on the dataset $Z_{-i}$, for each $i$) comes from evaluating $\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})$ and then subtracting this gradient from the current estimate, for each $i$. These two steps have cost of order $nA_p$ and $p$, respectively, for each $i$, and therefore the total cost of one iteration has order $n^2 A_p + np$. Similar calculations can be performed to compute the order of computational complexity for SGD and Prox GD as well.

Note that we do not intend to compare the runtime of NS and IJ with IACV and exact CV, because these methods are not comparable in terms of their target problem. NS and IJ are one-step methods: given a single solution (i.e., a single $\hat\theta^{(T)}$ approximating $\hat\theta$), we run the method once to approximate the leave-one-out models. On the other hand, exact iterative CV as well as our IACV method are both performed in an online fashion, with steps carried out for each $t = 1,\dots,T$. The cost of NS and IJ will not scale with $T$, while naturally the cost of CV and IACV must scale with $T$; this apparent computational benefit of NS and IJ is simply due to the fact that NS and IJ ignore the iterative nature of the algorithm, and thus are completely invalid (i.e., errors are higher than the noninformative baseline) for early, pre-convergence times $T$.

C. A Larger Scale Simulation for IACV in GD

In Figures 5 and 6, we provide simulation results comparing IACV and the other methods when $n = 5000$, $p = 50$. We observe a similar pattern as in the earlier plots: the approximation error of IACV is always better than NS and IJ before the convergence of the algorithm, and IACV is much faster than the exact leave-one-out CV. If we further grow $p$ and $n$, we find it is too expensive to run the whole program, as the exact leave-one-out CV takes too much time to run for many independent trials, but we would nonetheless expect to see similar performance.
D. An Iteration-Dependent Upper Bound for $\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$

Notice that in Theorem 4.4, we only present a version of the upper bound that is independent of $t$, for convenience. It is simple, and it clearly illustrates the idea that the error bound holds along the whole trajectory of the learning process. But based on our proof, a $t$-dependent iterative bound on the approximation error can be obtained, and it will typically be sharper in the early stage of training. Specifically, in our proof of Theorem 4.4, (25) shows that

$$\big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2 \le (1-\alpha_t n\lambda_0)\big\|\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}_{-i}\big\|_2 + \alpha_t n\gamma\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2^2, \quad t \ge 1.$$

Figure 6. Runtime comparison of exact leave-one-out CV and IACV in gradient descent for logistic regression with ridge regularizer.

The error bound for $\tilde\theta^{(t)}_{-i}$ depends on the error bound for $\tilde\theta^{(t-1)}_{-i}$ and on $\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\|_2$. Furthermore, $\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\|_2$ also satisfies the following iterative upper bound, by (23):

$$\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2 \le (1-\alpha_{t-1}n\lambda_0)\big\|\hat\theta^{(t-2)} - \hat\theta^{(t-2)}_{-i}\big\|_2 + \alpha_{t-1}\eta_i + \alpha_{t-1}n\gamma\big\|\hat\theta^{(t-2)} - \hat\theta^{(t-2)}_{-i}\big\|_2^2.$$

If we initialize $\hat\theta^{(0)} = \hat\theta^{(0)}_{-i} = \tilde\theta^{(0)}_{-i} = 0$, we can see that the error bound for $\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\|_2$ will slightly increase from 0 over the iterations $t$ and stabilize when the error contraction (the term $-\alpha_{t-1}n\lambda_0\|\hat\theta^{(t-2)} - \hat\theta^{(t-2)}_{-i}\|_2$) cancels out the per-iteration error accumulation (the terms $\alpha_{t-1}\eta_i + \alpha_{t-1}n\gamma\|\hat\theta^{(t-2)} - \hat\theta^{(t-2)}_{-i}\|_2^2$). By a similar argument, the error bound for $\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$ will also increase first and then stabilize.
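To see how these recursions behave, the following small sketch (our own illustration, with arbitrarily chosen constants) iterates the two bounds numerically:

```python
import numpy as np

def iterate_bounds(alpha, n, lam0, gamma, eta, T):
    """Iterate the Appendix D recursions for the leave-one-out gap bound b_t
    and the IACV approximation-error bound e_t, starting from zero."""
    b, e = 0.0, 0.0
    b_hist, e_hist = [], []
    for _ in range(T):
        e = (1 - alpha * n * lam0) * e + alpha * n * gamma * b**2  # uses previous b
        b = (1 - alpha * n * lam0) * b + alpha * eta + alpha * n * gamma * b**2
        b_hist.append(b)
        e_hist.append(e)
    return np.array(b_hist), np.array(e_hist)

# Example: both sequences rise from zero and then plateau, as described above.
b_hist, e_hist = iterate_bounds(alpha=0.5/1000, n=1000, lam0=0.5,
                                gamma=0.1, eta=1.0, T=5000)
```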
E.1. Proofs of Error Bound Results

In this section, we prove the approximation error bounds and CV error bounds for GD, SGD, and Prox GD.

Proof of Theorem 4.4. Step 1. In this step, we prove the bound on $\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$ by induction. By the update rules for $\hat\theta^{(t)}$ and $\hat\theta^{(t)}_{-i}$, we have

$$\hat\theta^{(t)} - \hat\theta^{(t)}_{-i} = \hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i} + \alpha_t\big(\nabla_\theta F(Z_{-i};\hat\theta^{(t-1)}_{-i}) - \nabla_\theta F(Z;\hat\theta^{(t-1)})\big) = \hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta\ell(Z_i;\hat\theta^{(t-1)}) + \alpha_t\big(\nabla_\theta F(Z_{-i};\hat\theta^{(t-1)}_{-i}) - \nabla_\theta F(Z_{-i};\hat\theta^{(t-1)})\big). \tag{20}$$

Given any $u \in \mathbb{R}^p$, by a Taylor expansion of $g_u(\theta) = \langle u, \nabla_\theta F(Z_{-i};\theta)\rangle$, we have

$$\langle u, \nabla_\theta F(Z_{-i};\hat\theta^{(t-1)}_{-i})\rangle = \langle u, \nabla_\theta F(Z_{-i};\hat\theta^{(t-1)})\rangle + \big\langle u, \nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\rangle + \big\langle u, \big(\nabla^2_\theta F(Z_{-i};\bar\theta) - \nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\rangle, \tag{21}$$

where $\bar\theta = a\hat\theta^{(t-1)} + (1-a)\hat\theta^{(t-1)}_{-i}$ for some $a \in [0,1]$. Plugging (21) into (20), we have for any $u \in \mathbb{R}^p$,

$$\langle \hat\theta^{(t)} - \hat\theta^{(t)}_{-i}, u\rangle = \big\langle \big(I_p - \alpha_t\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}] - \alpha_t\nabla_\theta\ell(Z_i;\hat\theta^{(t-1)}), u\big\rangle + \alpha_t\big\langle u, \big(\nabla^2_\theta F(Z_{-i};\bar\theta) - \nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\rangle. \tag{22}$$

Since $\hat\theta^{(0)} = \hat\theta^{(0)}_{-i}$, the claimed bound holds at $t = 0$. Suppose now that $\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\|_2 \le \frac{2\eta_i}{(2\lambda_0-\gamma)n}$ holds for some $t \ge 1$. Then, by (22) and the fact that $\|x\|_2 = \sup_{u:\|u\|_2=1}\langle x, u\rangle$, we have

$$\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 \overset{(a)}{\le} \big\|I_p - \alpha_t\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big\|\,\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2 + \alpha_t\eta_i + \alpha_t n\gamma\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2^2 \overset{(b)}{\le} (1-\alpha_t n\lambda_0)\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2 + \alpha_t\eta_i + \alpha_t n\gamma\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2^2 \tag{23}$$
$$\overset{(c)}{\le} (1-\alpha_t n\lambda_0)\frac{2\eta_i}{(2\lambda_0-\gamma)n} + \alpha_t\eta_i + \alpha_t n\gamma\Big(\frac{2\eta_i}{(2\lambda_0-\gamma)n}\Big)^2 \overset{(d)}{\le} (1-\alpha_t n\lambda_0)\frac{2\eta_i}{(2\lambda_0-\gamma)n} + \alpha_t\eta_i + \alpha_t n\gamma\cdot\frac{\eta_i}{(2\lambda_0-\gamma)n} = \frac{2\eta_i}{(2\lambda_0-\gamma)n},$$

where (a) is by the triangle inequality and Assumptions 4.2 and 4.3; (b) is by Assumption 4.1 together with $\alpha_t \le 1/(n\lambda_1)$; (c) is by the induction hypothesis; and (d) is because $n \ge \frac{4\bar\eta}{2\lambda_0-\gamma}$, so that $\frac{2\eta_i}{(2\lambda_0-\gamma)n} \le \frac{1}{2}$. This proves the desired bound on $\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$.

Step 2. We now prove the bound on $\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$ by induction. First, by the update rule for $\hat\theta^{(t)}_{-i}$ and the Taylor expansion in (21), we have for any $u \in \mathbb{R}^p$,

$$\langle \hat\theta^{(t)}_{-i}, u\rangle = \big\langle \hat\theta^{(t-1)}_{-i} - \alpha_t\big(\nabla_\theta F(Z_{-i};\hat\theta^{(t-1)}) + \nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big), u\big\rangle - \alpha_t\big\langle u, \big(\nabla^2_\theta F(Z_{-i};\bar\theta) - \nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\rangle.$$

Combining with (12), we have

$$\langle \hat\theta^{(t)}_{-i} - \tilde\theta^{(t)}_{-i}, u\rangle = \big\langle \big(I_p - \alpha_t\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \tilde\theta^{(t-1)}_{-i}], u\big\rangle - \alpha_t\big\langle u, \big(\nabla^2_\theta F(Z_{-i};\bar\theta) - \nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\rangle. \tag{24}$$

Since $\hat\theta^{(0)}_{-i} = \tilde\theta^{(0)}_{-i}$, the error bound holds at $t = 0$. Suppose now that $\|\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}_{-i}\|_2 \le \frac{4\gamma\eta_i^2}{\lambda_0(2\lambda_0-\gamma)^2n^2}$ holds. Then at time $t$, based on (24), we have

$$\big\|\hat\theta^{(t)}_{-i} - \tilde\theta^{(t)}_{-i}\big\|_2 \overset{(a)}{\le} \big\|I_p - \alpha_t\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\big\|\,\big\|\hat\theta^{(t-1)}_{-i} - \tilde\theta^{(t-1)}_{-i}\big\|_2 + \alpha_t n\gamma\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2^2 \overset{(b)}{\le} (1-\alpha_t n\lambda_0)\big\|\hat\theta^{(t-1)}_{-i} - \tilde\theta^{(t-1)}_{-i}\big\|_2 + \alpha_t n\gamma\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2^2 \tag{25}$$
$$\overset{(c)}{\le} (1-\alpha_t n\lambda_0)\frac{4\gamma\eta_i^2}{\lambda_0(2\lambda_0-\gamma)^2n^2} + \alpha_t n\gamma\Big(\frac{2\eta_i}{(2\lambda_0-\gamma)n}\Big)^2 = \frac{4\gamma\eta_i^2}{\lambda_0(2\lambda_0-\gamma)^2n^2}.$$

Here (a) is by the triangle inequality and Assumption 4.3; (b) is by Assumption 4.1; and (c) is by the induction hypothesis and the error bound proved for $\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\|_2$ in Step 1. This proves the desired bound on $\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$, and thus completes the proof of this theorem.

Proof of Theorem 4.6. First, for every $i \in [n]$, by a Taylor expansion we have $\ell(Z_i;\tilde\theta^{(t)}_{-i}) = \ell(Z_i;\hat\theta^{(t)}_{-i}) + \nabla_\theta\ell(Z_i;\bar\theta)^\top[\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}]$, where $\bar\theta = a\tilde\theta^{(t)}_{-i} + (1-a)\hat\theta^{(t)}_{-i}$ for some $0 \le a \le 1$. Thus

$$\big|\ell(Z_i;\tilde\theta^{(t)}_{-i}) - \ell(Z_i;\hat\theta^{(t)}_{-i})\big| \le \big\|\nabla_\theta\ell(Z_i;\bar\theta)\big\|_2\,\big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2 \le \eta'_i\,\big\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\big\|_2, \tag{26}$$

where the last inequality is by Assumption 4.5. This completes the proof of the theorem once we plug in the bound on $\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$ obtained in Theorem 4.4.

Proof of Theorem 4.11. Step 1. In this step, we prove the bound on $\mathbb{E}\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$ by induction. Similar to (21), given any $u \in \mathbb{R}^p$, by a Taylor expansion of $\langle \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}_{-i}), u\rangle$, we have

$$\langle \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}_{-i}), u\rangle = \big\langle \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}) + \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}], u\big\rangle + \big\langle \big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\bar\theta) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}], u\big\rangle, \tag{27}$$

where $\bar\theta = a\hat\theta^{(t-1)} + (1-a)\hat\theta^{(t-1)}_{-i}$ with $0 \le a \le 1$. Then

$$\langle \hat\theta^{(t)} - \hat\theta^{(t)}_{-i}, u\rangle = \big\langle \hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta F(Z_{S_t};\hat\theta^{(t-1)}) + \alpha_t\nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}_{-i}), u\big\rangle$$
$$\overset{(27)}{=} \big\langle \big(I_p - \alpha_t\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}], u\big\rangle - \alpha_t\big\langle \nabla_\theta F(Z_{S_t};\hat\theta^{(t-1)}) - \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}), u\big\rangle + \alpha_t\big\langle \big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\bar\theta) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}], u\big\rangle.$$

Then we have

$$\mathbb{E}\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 \le \underbrace{\mathbb{E}\big\|\big(I_p - \alpha_t\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}]\big\|_2}_{(A)} + \alpha_t\underbrace{\mathbb{E}\big\|\nabla_\theta F(Z_{S_t};\hat\theta^{(t-1)}) - \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big\|_2}_{(B)} + \alpha_t\underbrace{\mathbb{E}\big\|\big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\bar\theta) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\|_2}_{(C)}. \tag{28}$$

Next, we bound the terms (A), (B), (C) separately. First, for term (A),

$$(A) = \mathbb{E}_{S_1,\dots,S_{t-1}}\Big[\mathbb{E}_{S_t}\big\|\big(I_p - \alpha_t\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}]\big\|_2\Big] \overset{\text{Assumption 4.7}}{\le} (1-\alpha_t K\lambda_0)\,\mathbb{E}_{S_1,\dots,S_{t-1}}\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2. \tag{29}$$

Next, for term (B),

$$(B) = \mathbb{E}_{S_1,\dots,S_{t-1}}\Big[\mathbb{E}_{S_t}\big\|\nabla_\theta F(Z_{S_t};\hat\theta^{(t-1)}) - \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big\|_2\Big] = \frac{K}{n}\,\mathbb{E}_{S_1,\dots,S_{t-1}}\big\|\nabla_\theta\ell(Z_i;\hat\theta^{(t-1)})\big\|_2 \overset{\text{Assumption 4.8}}{\le} \frac{K\eta_i}{n}. \tag{30}$$

Here the next-to-last step holds because each data point is included independently in $S_t$ with probability $K/n$; thus, with probability $1 - \frac{K}{n}$ we have $\nabla_\theta F(Z_{S_t};\hat\theta^{(t-1)}) - \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}) = 0$, and with probability $\frac{K}{n}$ we have $i \in S_t$, in which case $\nabla_\theta F(Z_{S_t};\hat\theta^{(t-1)}) - \nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}) = \nabla_\theta\ell(Z_i;\hat\theta^{(t-1)})$.
And, for term (C),

$$(C) \le \mathbb{E}\Big[\sup_{b\in[0,1]}\big\|\big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}}; b\hat\theta^{(t-1)} + (1-b)\hat\theta^{(t-1)}_{-i}) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\|_2\Big] \overset{\text{Assumption 4.9}}{\le} \gamma K\,\mathbb{E}_{S_1,\dots,S_{t-1}}\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2^2 \overset{\text{Assumption 4.10}}{\le} \gamma\beta n\Big(\mathbb{E}\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2\Big)^2. \tag{31}$$

Since $\hat\theta^{(0)} = \hat\theta^{(0)}_{-i}$, the claimed bound holds at $t = 0$. Suppose now that $\mathbb{E}\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\|_2 \le \frac{2\eta_i}{(2\lambda_0-\gamma\beta)n}$ holds for some $t \ge 1$. Then, by plugging (29), (30), and (31) into (28), we have

$$\mathbb{E}\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 \le (1-\alpha_t K\lambda_0)\,\mathbb{E}\big\|\hat\theta^{(t-1)} - \hat\theta^{(t-1)}_{-i}\big\|_2 + \frac{\alpha_t K\eta_i}{n} + \alpha_t\gamma\beta n\Big(\mathbb{E}\big\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\big\|_2\Big)^2$$
$$\le (1-\alpha_t K\lambda_0)\frac{2\eta_i}{(2\lambda_0-\gamma\beta)n} + \frac{\alpha_t K\eta_i}{n} + \alpha_t\gamma\beta n\Big(\frac{2\eta_i}{(2\lambda_0-\gamma\beta)n}\Big)^2 \le (1-\alpha_t K\lambda_0)\frac{2\eta_i}{(2\lambda_0-\gamma\beta)n} + \frac{\alpha_t K\eta_i}{n} + \alpha_t K\gamma\beta\frac{\eta_i}{(2\lambda_0-\gamma\beta)n} = \frac{2\eta_i}{(2\lambda_0-\gamma\beta)n}, \tag{32}$$

where the next-to-last step holds since $\frac{4\eta_i}{(2\lambda_0-\gamma\beta)K} \le 1$ by the assumption on $K$. This finishes the proof of the first part.

Step 2. We now prove the bound on $\mathbb{E}\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2$ by induction. Based on the Taylor expansion in (27) and the update rule for $\tilde\theta^{(t)}_{-i}$ in (13), we have for any $u \in \mathbb{R}^p$,

$$\langle \hat\theta^{(t)}_{-i} - \tilde\theta^{(t)}_{-i}, u\rangle = \big\langle \hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}_{-i}) - \tilde\theta^{(t-1)}_{-i} + \alpha_t\big(\nabla_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)}) + \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big), u\big\rangle$$
$$\overset{(27)}{=} \big\langle \big(I_p - \alpha_t\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \tilde\theta^{(t-1)}_{-i}], u\big\rangle - \alpha_t\big\langle \big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\bar\theta) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}], u\big\rangle.$$

Thus we have

$$\mathbb{E}\big\|\hat\theta^{(t)}_{-i} - \tilde\theta^{(t)}_{-i}\big\|_2 \le \mathbb{E}\Big[\big\|\big(I_p - \alpha_t\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \tilde\theta^{(t-1)}_{-i}]\big\|_2\Big] + \alpha_t\,\mathbb{E}\big\|\big(\nabla^2_\theta F(Z_{S_t\setminus\{i\}};\bar\theta) - \nabla^2_\theta F(Z_{S_t\setminus\{i\}};\hat\theta^{(t-1)})\big)[\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\big\|_2$$
$$\le (1-\alpha_t K\lambda_0)\,\mathbb{E}\big\|\hat\theta^{(t-1)}_{-i} - \tilde\theta^{(t-1)}_{-i}\big\|_2 + \alpha_t\gamma\beta n\Big(\frac{2\eta_i}{(2\lambda_0-\gamma\beta)n}\Big)^2. \tag{33}$$

Here the last steps hold by arguments similar to (29) and (31), together with the error bound proved for $\mathbb{E}\|\hat\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}\|_2$ in Step 1. Since $\hat\theta^{(0)}_{-i} = \tilde\theta^{(0)}_{-i}$, the error bound holds at $t = 0$. Suppose now that $\mathbb{E}\|\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}_{-i}\|_2 \le \frac{4\gamma\beta\eta_i^2}{\lambda_0(2\lambda_0-\gamma\beta)^2 nK}$ holds. Then at time $t$, based on (33), we have

$$\mathbb{E}\big\|\hat\theta^{(t)}_{-i} - \tilde\theta^{(t)}_{-i}\big\|_2 \le (1-\alpha_t K\lambda_0)\frac{4\gamma\beta\eta_i^2}{\lambda_0(2\lambda_0-\gamma\beta)^2 nK} + \alpha_t\gamma\beta n\Big(\frac{2\eta_i}{(2\lambda_0-\gamma\beta)n}\Big)^2 = \frac{4\gamma\beta\eta_i^2}{\lambda_0(2\lambda_0-\gamma\beta)^2 nK}, \tag{34}$$

which proves the desired bound and thus completes the proof of this theorem.

Proof of Theorem 4.13. First, for every $i \in [n]$, by a Taylor expansion we have $\ell(Z_i;\tilde\theta^{(t)}_{-i}) = \ell(Z_i;\hat\theta^{(t)}_{-i}) + \nabla_\theta\ell(Z_i;\bar\theta)^\top[\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}]$, where $\bar\theta = a\tilde\theta^{(t)}_{-i} + (1-a)\hat\theta^{(t)}_{-i}$ for some $0 \le a \le 1$. Therefore,

$$\mathbb{E}\big(\mathrm{Err}^{(t)}_{\mathrm{CV}}\big) \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big|\ell(Z_i;\tilde\theta^{(t)}_{-i}) - \ell(Z_i;\hat\theta^{(t)}_{-i})\big| \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[\sup_{a\in[0,1]}\big|\nabla_\theta\ell(Z_i;a\tilde\theta^{(t)}_{-i} + (1-a)\hat\theta^{(t)}_{-i})^\top[\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}]\big|\Big] \le \frac{1}{n}\sum_{i=1}^n \frac{4\gamma\beta\eta'_i\eta_i^2}{\lambda_0(2\lambda_0-\gamma\beta)^2 nK},$$

where the last inequality is by Assumption 4.12 and Theorem 4.11.

Proof of Theorem A.1. The proof strategy of this theorem is similar to that of Theorem 4.4. For convenience, we define

$$\mathrm{prox}^{\alpha_t}_h(\theta^\dagger) := \mathrm{argmin}_\theta \Big\{\tfrac{1}{2\alpha_t}\|\theta - \theta^\dagger\|_2^2 + h(\theta)\Big\}. \tag{35}$$

A key property we use here is that the proximal operator $\mathrm{prox}^{\alpha_t}_h$ is nonexpansive for convex $h$ (Wright & Recht, 2022, Proposition 8.19), i.e., $\|\mathrm{prox}^{\alpha_t}_h(x) - \mathrm{prox}^{\alpha_t}_h(x')\|_2 \le \|x - x'\|_2$.

Step 1. In this step, we derive the error bound for $\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$. By the update rules for $\hat\theta^{(t)}$ and $\hat\theta^{(t)}_{-i}$, we have

$$\big\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\big\|_2 = \Big\|\mathrm{prox}^{\alpha_t}_h\big(\hat\theta^{(t-1)} - \alpha_t\nabla_\theta g(Z;\hat\theta^{(t-1)})\big) - \mathrm{prox}^{\alpha_t}_h\big(\hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})\big)\Big\|_2 \le \Big\|\hat\theta^{(t-1)} - \alpha_t\nabla_\theta g(Z;\hat\theta^{(t-1)}) - \big(\hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})\big)\Big\|_2.$$

Here the inequality holds by the nonexpansiveness of the proximal operator. The rest of the proof follows exactly as in Step 1 of Theorem 4.4, with $F(\cdot;\theta)$ replaced by $g(\cdot;\theta)$.
Proof of Theorem A.1. The proof strategy for this theorem is similar to that of Theorem 4.4. For convenience, define
$$\mathrm{prox}^{\alpha_t}_h(\theta') := \arg\min_\theta\left\{\frac{1}{2\alpha_t}\|\theta - \theta'\|_2^2 + h(\theta)\right\}. \qquad (35)$$
A key property we use here is that the proximal operator $\mathrm{prox}^{\alpha_t}_h$ is nonexpansive for convex $h$ (Wright & Recht, 2022, Proposition 8.19), i.e., $\|\mathrm{prox}^{\alpha_t}_h(x) - \mathrm{prox}^{\alpha_t}_h(x')\|_2 \le \|x - x'\|_2$.

Step 1. In this step, we derive the error bound for $\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2$. By the update rules of $\hat\theta^{(t)}$ and $\hat\theta^{(t)}_{-i}$, we have
$$\|\hat\theta^{(t)} - \hat\theta^{(t)}_{-i}\|_2 = \left\|\mathrm{prox}^{\alpha_t}_h\left(\hat\theta^{(t-1)} - \alpha_t\nabla_\theta g(Z;\hat\theta^{(t-1)})\right) - \mathrm{prox}^{\alpha_t}_h\left(\hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})\right)\right\|_2 \le \left\|\hat\theta^{(t-1)} - \alpha_t\nabla_\theta g(Z;\hat\theta^{(t-1)}) - \left(\hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})\right)\right\|_2.$$
Here the inequality holds by the nonexpansiveness of the proximal operator. The rest of the proof follows exactly as in Step 1 of Theorem 4.4, with $F(\cdot;\theta)$ replaced by $g(\cdot;\theta)$.

Step 2. By the definition of our estimator in (14), we have
$$\|\tilde\theta^{(t)}_{-i} - \hat\theta^{(t)}_{-i}\|_2 = \left\|\mathrm{prox}^{\alpha_t}_h\left(\tilde\theta^{(t-1)}_{-i} - \alpha_t\left(\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\right)\right) - \mathrm{prox}^{\alpha_t}_h\left(\hat\theta^{(t-1)}_{-i} - \alpha_t\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})\right)\right\|_2 \le \left\|\tilde\theta^{(t-1)}_{-i} - \alpha_t\left(\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\right) - \hat\theta^{(t-1)}_{-i} + \alpha_t\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}_{-i})\right\|_2,$$
where the inequality is again due to the nonexpansiveness of the proximal operator. The remainder of the proof follows the same argument as in Step 2 of Theorem 4.4, with $F(\cdot;\theta)$ replaced by $g(\cdot;\theta)$.

E.2. Proofs of convergence results

In this section, we prove Theorems 5.1 and A.3, which establish convergence of the IACV estimator to the NS estimator in the GD and prox-GD settings, respectively.

Proof of Theorem 5.1. Since we assume that $\hat\theta^{(t)}$ converges to $\hat\theta$ and that Assumption 4.1 is satisfied at all iterations, $\nabla^2_\theta F(Z_{-i};\hat\theta)$ is invertible by the continuity of $\nabla^2_\theta F(Z_{-i};\theta)$. Denote
$$b^t_{-i} = \hat\theta^{(t-1)} - \left(\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\right)^{-1}\nabla_\theta F(Z_{-i};\hat\theta^{(t-1)}).$$
First, based on the update rule for $\tilde\theta^{(t)}_{-i}$ in (12), we have
$$\tilde\theta^{(t)}_{-i} - b^t_{-i} = \left[I_p - \alpha\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\right](\tilde\theta^{(t-1)}_{-i} - b^t_{-i}).$$
This implies
$$\tilde\theta^{(t)}_{-i} - \tilde\theta^{NS}_{-i} = \left[I_p - \alpha\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\right](\tilde\theta^{(t-1)}_{-i} - b^t_{-i}) + b^t_{-i} - \tilde\theta^{NS}_{-i} = \left[I_p - \alpha\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\right](\tilde\theta^{(t-1)}_{-i} - \tilde\theta^{NS}_{-i}) - \alpha\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})(\tilde\theta^{NS}_{-i} - b^t_{-i}),$$
and hence
$$\|\tilde\theta^{(t)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 \le \left\|I_p - \alpha\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\right\|_2\|\tilde\theta^{(t-1)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 + \alpha\left\|\nabla^2_\theta F(Z_{-i};\hat\theta^{(t-1)})\right\|_2\|\tilde\theta^{NS}_{-i} - b^t_{-i}\|_2 \le (1-\alpha n\lambda_0)\|\tilde\theta^{(t-1)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 + \|\tilde\theta^{NS}_{-i} - b^t_{-i}\|_2,$$
where the last step holds by Assumption 4.1 together with the choice of $\alpha$. Since $\nabla^2_\theta F(Z_{-i};\theta)$ and $\nabla_\theta F(Z_{-i};\theta)$ are continuous in $\theta$, we have $b^t_{-i} \to \tilde\theta^{NS}_{-i}$ as $\hat\theta^{(t)}$ converges to $\hat\theta$, i.e., $\lim_{t\to\infty}\|\tilde\theta^{NS}_{-i} - b^t_{-i}\|_2 = 0$. Combined with the recursion above, this is sufficient to prove $\lim_{t\to\infty}\|\tilde\theta^{(t)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 = 0$, which completes the proof of the theorem.
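To make the proximal setting concrete before the proof of Theorem A.3, here is a minimal sketch of the prox-GD version of IACV (update (14)) for an $\ell_1$-penalized least-squares problem, where $h = \lambda\|\cdot\|_1$ and $\mathrm{prox}^\alpha_h$ is coordinatewise soft-thresholding. The data, names, and the constant step size (as in Theorem A.3) are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

# Sketch of prox-GD IACV (update (14)) with a smooth least-squares part
# g(Z; theta) = (1/2) sum_j (x_j' theta - y_j)^2 and h = lam * ||theta||_1.
# prox of alpha * lam * ||.||_1 is coordinatewise soft-thresholding.

rng = np.random.default_rng(2)
n, p, lam, alpha, T = 100, 5, 0.5, 0.002, 1500
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=n)

def soft_threshold(v, tau):
    # prox^alpha_h(v) for h = lam*||.||_1, with tau = alpha * lam.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

G = X.T @ X                      # full-data Hessian of g (constant here)
theta = np.zeros(p)              # prox-GD iterate \hat\theta^{(t)}
theta_tilde = np.zeros((n, p))   # IACV iterates \tilde\theta^{(t)}_{-i}

for t in range(T):
    r = X @ theta - y
    g_full = X.T @ r             # full-data gradient of g at theta
    for i in range(n):
        g_i = g_full - X[i] * r[i]         # gradient of g(Z_{-i}; theta)
        H_i = G - np.outer(X[i], X[i])     # Hessian of g(Z_{-i}; .)
        # Prox-GD IACV update (14): a linearized leave-one-out gradient
        # step followed by the proximal map of h.
        v = theta_tilde[i] - alpha * (g_i + H_i @ (theta_tilde[i] - theta))
        theta_tilde[i] = soft_threshold(v, alpha * lam)
    # Full-data prox-GD step.
    theta = soft_threshold(theta - alpha * g_full, alpha * lam)

print("theta after prox-GD:", np.round(theta, 3))
```

The only structural change relative to the GD sketch earlier is the proximal map wrapped around each update; the nonexpansiveness of that map is precisely what lets the proofs of Theorems A.1 and A.3 reuse the GD arguments.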
Taking norms,
$$\begin{aligned}\|\tilde\theta^{(t)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 &= \left\|\mathrm{prox}^\alpha_h\left(\tilde\theta^{(t-1)}_{-i} - \alpha\left(\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\right)\right) - \mathrm{prox}^\alpha_h\left(\tilde\theta^{NS}_{-i} - \alpha\left(\nabla_\theta g(Z_{-i};\hat\theta) + \nabla^2_\theta g(Z_{-i};\hat\theta)[\tilde\theta^{NS}_{-i} - \hat\theta]\right)\right)\right\|_2 \\ &\le \left\|\tilde\theta^{(t-1)}_{-i} - \alpha\left(\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) + \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})[\tilde\theta^{(t-1)}_{-i} - \hat\theta^{(t-1)}]\right) - \tilde\theta^{NS}_{-i} + \alpha\left(\nabla_\theta g(Z_{-i};\hat\theta) + \nabla^2_\theta g(Z_{-i};\hat\theta)[\tilde\theta^{NS}_{-i} - \hat\theta]\right)\right\|_2 \\ &= \left\|\left[I_p - \alpha\nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\right](\tilde\theta^{(t-1)}_{-i} - \tilde\theta^{NS}_{-i}) + \alpha\left(\nabla^2_\theta g(Z_{-i};\hat\theta) - \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\right)\tilde\theta^{NS}_{-i} - \alpha\left(\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) - \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\hat\theta^{(t-1)}\right) + \alpha\left(\nabla_\theta g(Z_{-i};\hat\theta) - \nabla^2_\theta g(Z_{-i};\hat\theta)\hat\theta\right)\right\|_2 \\ &\le \left\|\left[I_p - \alpha\nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\right](\tilde\theta^{(t-1)}_{-i} - \tilde\theta^{NS}_{-i})\right\|_2 + \alpha\Delta_t \\ &\le (1-\alpha n\lambda_0)\|\tilde\theta^{(t-1)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 + \alpha\Delta_t,\end{aligned}$$
where
$$\Delta_t := \left\|\left(\nabla^2_\theta g(Z_{-i};\hat\theta) - \nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\right)\tilde\theta^{NS}_{-i}\right\|_2 + \left\|\nabla_\theta g(Z_{-i};\hat\theta^{(t-1)}) - \nabla_\theta g(Z_{-i};\hat\theta)\right\|_2 + \left\|\nabla^2_\theta g(Z_{-i};\hat\theta^{(t-1)})\hat\theta^{(t-1)} - \nabla^2_\theta g(Z_{-i};\hat\theta)\hat\theta\right\|_2.$$
Here the second step holds because $\mathrm{prox}^\alpha_h$ is nonexpansive for convex $h$ (Wright & Recht, 2022, Proposition 8.19), and the last step applies Assumption 4.1 (with $g$ in place of $F$, as assumed in the theorem). Since $\nabla^2_\theta g(Z_{-i};\theta)$ and $\nabla_\theta g(Z_{-i};\theta)$ are continuous in $\theta$, and $\hat\theta^{(t)}$ is assumed to converge to $\hat\theta$, we have $\Delta_t \to 0$. As in the proof of Theorem 5.1, this implies $\lim_{t\to\infty}\|\tilde\theta^{(t)}_{-i} - \tilde\theta^{NS}_{-i}\|_2 = 0$.

Can Theorem 5.1 be extended to SGD? A key intuition behind the proof of Theorem 5.1 is that $\tilde\theta^{NS}_{-i}$ in (3) is a fixed point of the update equation
$$x^{(t)} = x^{(t-1)} - \alpha_t\left(\nabla_\theta F(Z_{-i};\hat\theta) + \nabla^2_\theta F(Z_{-i};\hat\theta)[x^{(t-1)} - \hat\theta]\right).$$
In addition, we use the fact that a sequence of scalars $\{x_t\}$ satisfying $x_t \le (1-\alpha)x_{t-1} + \epsilon_t$ with perturbations $\epsilon_t \to 0$ must have $x_t \to 0$ when we take a constant step size $\alpha_t \equiv \alpha > 0$. Finally, we note that the proof of Theorem 5.1 never uses the condition that $\hat\theta$ is the solution of (1); the conclusions there continue to hold with $\hat\theta$ replaced by any $\theta^\star$, as long as (i) $\hat\theta^{(t)}$ converges to $\theta^\star$ and (ii) Assumption 4.1 is satisfied along the iterates. A similar intuition holds for the proof of Theorem A.3.

In the SGD setting, on the other hand, the techniques used to prove Theorem 5.1 fail, since there the step sizes $\alpha_t$ are required to decay to zero for convergence. If a sequence $\{x_t\}$ satisfies $x_t \le (1-\alpha_t)x_{t-1} + \epsilon_t$ with $\alpha_t \to 0$, then even $\epsilon_t \to 0$ does not necessarily imply $x_t \to 0$. Whether a result analogous to Theorem 5.1 might be provable for SGD via a different technique remains an open question.
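The contrast between the two step-size regimes can be seen directly on the scalar recursion itself. The following toy computation (our own numerical example, not taken from the paper) iterates $x_t = (1-\alpha_t)x_{t-1} + \epsilon_t$ with the same vanishing perturbation $\epsilon_t = 1/t^2$ in both cases: a fixed contraction drives $x_t$ to (numerically) zero, while a decaying step size $\alpha_t = 1/t^2$ stalls at a positive value.

```python
# Toy illustration of the scalar recursions discussed above. In the GD
# setting the error recursion has a fixed contraction factor (1 - a) with
# a > 0 and perturbations e_t -> 0, which forces x_t -> 0. With SGD-style
# decaying step sizes a_t -> 0, the same recursion can stall.

T = 100_000
x_const, x_decay = 1.0, 1.0
for t in range(2, T):
    e_t = 1.0 / t**2                            # perturbation e_t -> 0
    x_const = (1 - 0.1) * x_const + e_t         # fixed step: contracts to 0
    x_decay = (1 - 1.0 / t**2) * x_decay + e_t  # decaying step: stalls

print(f"fixed step:    x_T = {x_const:.2e}")    # near zero
print(f"decaying step: x_T = {x_decay:.2e}")    # bounded away from zero
```

In the decaying case the product $\prod_{t\ge 2}(1 - 1/t^2) = 1/2$ stays bounded away from zero, so the initial error is never fully contracted away; this is exactly the obstruction described in the paragraph above.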