# Early Stopping for Nonparametric Testing

Meimei Liu, Department of Statistical Science, Duke University, Durham, NC 27705. meimei.liu@duke.edu
Guang Cheng, Department of Statistics, Purdue University, West Lafayette, IN 47907. chengg@purdue.edu

**Abstract.** Early stopping of iterative algorithms is an algorithmic regularization method to avoid over-fitting in estimation and classification. In this paper, we show that early stopping can also be applied to obtain minimax optimal testing in a general nonparametric setup. Specifically, a Wald-type test statistic is obtained based on an iterated estimate produced by functional gradient descent algorithms in a reproducing kernel Hilbert space. A notable contribution is to establish a sharp stopping rule: when the number of iterations achieves an optimal order, testing optimality is achievable; otherwise, testing optimality becomes impossible. As a by-product, a similar sharpness result is also derived for minimax optimal estimation under early stopping. All obtained results hold for various kernel classes, including Sobolev smoothness classes and Gaussian kernel classes.

## 1 Introduction

As a computationally efficient approach, early stopping often works by terminating an iterative algorithm at a pre-specified number of steps to avoid over-fitting. Recently, various forms of early stopping have been proposed in estimation and classification. Examples include boosting algorithms (Bühlmann and Yu [2003], Zhang and Yu [2005], Wei et al. [2017]) and gradient descent over reproducing kernel Hilbert spaces (Yao et al. [2007], Raskutti et al. [2014]), and references therein. However, statistical inference based on early stopping has largely remained unexplored. In this paper, we apply early stopping regularization to nonparametric testing and characterize its minimax optimality from an algorithmic perspective. Notably, this differs from the traditional framework of using penalization methods to conduct statistical inference.

Recall that classical nonparametric inference often involves minimizing objective functions in the loss + penalty form to avoid overfitting; examples include the penalized likelihood ratio test and Wald-type tests; see Fan and Jiang [2007], Shang and Cheng [2013], Liu et al. [2018] and references therein. However, solving the quadratic program in such penalized regularization requires $O(n^3)$ basic operations. Additionally, in practice the cross-validation method (Golub et al. [1979]) is often used as a tuning procedure, and it is known to be optimal for estimation but suboptimal for testing; see Fan et al. [2001]. As far as we are aware, there is no theoretically justified tuning procedure for obtaining optimal testing in our setup. We address this issue by proposing a data-dependent early stopping rule that enjoys both theoretical support and computational efficiency.

To be more specific, we first develop a Wald-type test statistic $D_{n,t}$ based on the iterated estimator $f_t$, with $t$ being the number of iterations. As illustrated in Figure 1 (a) and (b), the testing power exhibits a parabolic pattern: it increases as the iteration grows in the beginning, and then decreases after reaching its largest value at $t = T^*$, illustrating how over-fitting affects the power performance. To precisely quantify $T^*$, we analyze the power performance by characterizing the strength of the weakest detectable signals (SWDS).
We show that the SWDS at each iteration is controlled by the bias of the iterated estimator and the standard deviation of the test statistic. In fact, each iterative step reduces the former but increases the latter. Such a tradeoff in testing is rather different from the classical bias-variance tradeoff in estimation, as shown in Figure 1 (c). Hence, the early stopping rule to be provided is different from those in the literature such as Raskutti et al. [2014] and Wei et al. [2017]; also see Figure 1 (a) and (b) for a comparison of power and MSE.

Figure 1: (a) and (b) are the mean square error (MSE) and power performance of the gradient descent update at each iteration with constant step size $\alpha = 1$; power was calculated based on 500 replicates. (a) Data were generated via $y_i = 0.5x_i^2 + 0.5\sin(4\pi x_i) + \epsilon_i$ with sample size $n = 200$, $\{x_i\}_{i=1}^n \sim \mathrm{Unif}[0,1]$, $\epsilon_i \sim N(0,1)$. (b) Data were generated by $y_i = 0.5x_i^2 + 0.5|x_i - 0.5| + \epsilon_i$ with sample size $n = 200$. (c) Stopping rules for estimation and testing based on different tradeoff criteria.

The above analysis applies to many reproducing kernels and leads to specific optimal testing rates, depending on their eigendecay rates. In the specific examples of polynomial decaying kernels and exponential decaying kernels, we further show that the developed stopping rule is indeed *sharp*: testing optimality is obtained if and only if the number of iterations attains the optimal order defined by the stopping rule. As a by-product, we prove that the early stopping rule in Raskutti et al. [2014] and Wei et al. [2017] is also sharp for optimal estimation.

## 2 Background and Problem Formulation

We begin by introducing some background on reproducing kernel Hilbert spaces (RKHS) and functional gradient descent algorithms in an RKHS, together with our nonparametric testing formulation.

### 2.1 Nonparametric estimation in reproducing kernel Hilbert spaces

Consider the following nonparametric model
$$y_i = f(x_i) + \epsilon_i, \quad i = 1, \dots, n, \tag{2.1}$$
where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ for a fixed $d \ge 1$ are random covariates, and the $\epsilon_i$ are Gaussian random noise with mean zero and variance $\sigma^2$. Throughout we assume that $f \in \mathcal{H}$, where $\mathcal{H} \subset L^2(P_X)$ is a reproducing kernel Hilbert space (RKHS) associated with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and a reproducing kernel function $K(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. By Mercer's Theorem, $K$ has the following spectral expansion:
$$K(x, x') = \sum_{i=1}^{\infty} \mu_i \phi_i(x)\phi_i(x'), \quad x, x' \in \mathcal{X}, \tag{2.2}$$
where $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ is a sequence of eigenvalues and $\{\phi_i\}_{i=1}^{\infty}$ form a basis in $L^2(P_X)$. Moreover, for any $i, j \in \mathbb{N}$, $\langle \phi_i, \phi_j \rangle_{L^2(P_X)} = \delta_{ij}$ and $\langle \phi_i, \phi_j \rangle_{\mathcal{H}} = \delta_{ij}/\mu_i$. In the literature, e.g., Guo [2002] and Shang and Cheng [2013], it is common to assume that the $\phi_j$'s are uniformly bounded. This is also assumed throughout this paper.

**Assumption A1.** The eigenfunctions $\{\phi_j\}_{j=1}^{\infty}$ are uniformly bounded on $\mathcal{X}$, i.e., there exists a finite constant $c_K > 0$ such that $\sup_j \|\phi_j\|_{\sup} \le c_K$.

Two types of kernel are often considered in the nonparametric literature, depending on how fast the eigenvalues decay to zero. The first is $\mu_i \asymp i^{-2m}$, leading to the so-called polynomial decay kernel (PDK) of order $m > 0$. For instance, an $m$-order Sobolev space is an RKHS with a PDK of order $m$; see Wahba [1990], and the trigonometric basis in a periodic Sobolev space with PDK satisfies Assumption A1 trivially. The second is $\mu_i \asymp \exp(-\beta i^p)$ for some constants $\beta, p > 0$, corresponding to the so-called exponential-polynomial decay kernel (EDK) of order $p > 0$; see Schölkopf et al. [1999]. In particular, for EDK of order two, an example is the Gaussian kernel $K(x_1, x_2) = \exp(-(x_1 - x_2)^2/2)$. In the latter case, Assumption A1 holds according to Lu et al. [2016].
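To make these eigendecay conditions concrete, the following minimal sketch (ours, not from the paper) forms the $n \times n$ kernel matrix with entries $K(x_i, x_j)/n$ for the Gaussian kernel above (the $1/n$ scaling matches the empirical kernel matrix used in Section 2.2) and inspects how fast its eigenvalues decay; the sample size and the uniform design are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, size=n)

# Empirical kernel matrix [K]_ij = K(x_i, x_j) / n, here with the Gaussian
# kernel K(x, x') = exp(-(x - x')^2 / 2) used as an EDK example in the text.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) / n

# Eigenvalues mu_hat_1 >= mu_hat_2 >= ... >= mu_hat_n >= 0.
mu_hat = np.linalg.eigvalsh(K)[::-1]

# For an EDK of order p the population eigenvalues decay like exp(-beta * i^p);
# printing the leading empirical eigenvalues shows the very fast decay.
print(mu_hat[:10])
```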
By the representer theorem, any $f \in \mathcal{H}$ can be represented as $f(\cdot) = \frac{1}{\sqrt{n}}\sum_{i=1}^n w_i K(x_i, \cdot) + \xi(\cdot)$, where $\xi \in \mathcal{H}$ and $\xi(\cdot) \perp \mathrm{span}\{K(x_1,\cdot), \dots, K(x_n,\cdot)\}$. Given $x = (x_1, \dots, x_n)$, define the empirical kernel matrix $[K]_{ij} = \frac{1}{n}K(x_i, x_j)$ and $\mathbf{f} = (f(x_1), \dots, f(x_n))^\top$; then $\mathbf{f} = \sqrt{n}\,Kw$, where $w = (w_1, \dots, w_n)^\top \in \mathbb{R}^n$.

### 2.2 Gradient Descent Algorithms

Given the samples $\{(x_i, y_i)\}$, consider minimizing the least-squares loss function $L(f) = \frac{1}{2n}\sum_{i=1}^n (y_i - f(x_i))^2$ over the Hilbert space $\mathcal{H}$. Note that by the reproducing property, $f(x) = \langle f, K(x,\cdot)\rangle_{\mathcal{H}}$, so the gradient of $L(f)$ is $\nabla L(f) = n^{-1}\sum_{i=1}^n (f(x_i) - y_i)K(x_i, \cdot)$. Given $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, define $\mathbf{f}_t = (f_t(x_1), \dots, f_t(x_n))^\top$ for $t = 0, 1, \dots$. Then a straightforward calculation shows that the functional gradient descent algorithm generates a sequence of vectors $\{\mathbf{f}_t\}_{t=0}^{\infty}$ via the recursion
$$\mathbf{f}_{t+1} = \mathbf{f}_t - \alpha_t K(\mathbf{f}_t - y), \tag{2.3}$$
where $\{\alpha_t\}_{t=0}^{\infty}$ are the step sizes. Denote the total step size up to the $t$-th step by $\eta_t = \sum_{\tau=0}^{t-1}\alpha_\tau$. Consider the singular value decomposition $K = U\Lambda U^\top$, where $UU^\top = I_n$ and $\Lambda = \mathrm{diag}(\hat\mu_1, \hat\mu_2, \dots, \hat\mu_n)$ with $\hat\mu_1 \ge \hat\mu_2 \ge \cdots \ge \hat\mu_n \ge 0$. We impose the following assumption on the step sizes and $\eta_t$.

**Assumption A2.** The step sizes $\{\alpha_t\}_{t=0}^{\infty}$ are non-increasing, with $0 \le \alpha_\tau \le \min\{1, 1/\hat\mu_1\}$ for all $\tau = 0, 1, 2, \dots$. The total step size $\eta_t = \sum_{\tau=0}^{t-1}\alpha_\tau$ diverges as $t \to \infty$, and $\eta_{t_1} \asymp \eta_{t_2}$ for $0 \le t_1 \le t_2$ with $t_1 \asymp t_2$ as $t_2 \to \infty$.

Assumption A2 requires the step sizes $\{\alpha_t\}_{t=0}^{\infty}$ to be bounded and non-increasing, but not to decrease too fast as $t$ diverges. Many choices of step sizes satisfy Assumption A2. A trivial example is the constant step size $\alpha_0 = \cdots = \alpha_t = \min\{1, 1/\hat\mu_1\}$. Define $\kappa_t = \arg\min\{j : \mu_j < 1/\eta_t\} - 1$; we impose the following assumption on the population eigenvalues through $\kappa_t$.

**Assumption A3.** $\kappa_t$ diverges as $t \to \infty$.

It is easy to check that Assumption A3 is satisfied by the PDK and EDK classes introduced in Section 2.1.

### 2.3 Nonparametric testing

Our goal is to test whether the nonparametric function in (2.1) is equal to some known function. To be precise, we consider the nonparametric hypothesis testing problem $H_0: f = f_0$ vs. $H_1: f \in \mathcal{H}\setminus\{f_0\}$, where $f_0$ is a hypothesized function. For convenience, assume $f_0 = 0$, i.e., we will test
$$H_0: f = 0 \quad \text{vs.} \quad H_1: f \in \mathcal{H}\setminus\{0\}. \tag{2.4}$$
In general, testing $f = f_0$ (for an arbitrary known $f_0$) is equivalent to testing $f - f_0 = 0$, so (2.4) incurs no loss of generality. Based on the iterated estimator $f_t$, we propose the following Wald-type test statistic: $D_{n,t} = \|f_t\|_n^2$, where $\|f_t\|_n^2 = \frac{1}{n}\sum_{i=1}^n f_t^2(x_i)$. In what follows, we will derive the null limit distribution of $D_{n,t}$, and explicitly show how the stopping time affects the minimax optimality of testing.
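As a concrete illustration of the recursion (2.3) and the statistic $D_{n,t}$, here is a minimal Python sketch (ours, not the authors' code); the Gaussian kernel, the constant step size, and the simulated data are illustrative assumptions.

```python
import numpy as np

def gradient_descent_path(K, y, alpha=1.0, max_iter=500):
    """Run the functional gradient descent recursion (2.3),
    f_{t+1} = f_t - alpha * K (f_t - y), started from f_0 = 0,
    and record the Wald-type statistic D_{n,t} = ||f_t||_n^2 at each step."""
    f = np.zeros(len(y))
    stats = []
    for _ in range(max_iter):
        f = f - alpha * K @ (f - y)
        stats.append(np.mean(f ** 2))  # D_{n,t} = (1/n) sum_i f_t(x_i)^2
    return np.array(stats)

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = 0.5 * x ** 2 + 0.5 * np.sin(4 * np.pi * x) + rng.normal(size=n)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) / n   # empirical Gaussian kernel matrix
alpha = min(1.0, 1.0 / np.linalg.eigvalsh(K).max())     # constant step size (Assumption A2)
D = gradient_descent_path(K, y, alpha=alpha)
print(D[:5], D[-1])
```

The path of `D` traces how the test statistic grows with the number of iterations, which is the quantity the stopping rule below has to truncate at the right order.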
## 3 Main Results

### 3.1 Stopping rule for nonparametric testing

Given a sequence of step sizes $\{\alpha_t\}_{t=0}^{\infty}$ satisfying Assumption A2, we first introduce the stopping rule as follows:
$$T^* := \arg\min\Big\{t \in \mathbb{N} : \frac{1}{\eta_t} \le \frac{1}{n}\Big(\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}\Big)^{1/2}\Big\}. \tag{3.1}$$
As will be clarified in Section 3.2, the intuition underlying the stopping rule (3.1) is that $\frac{1}{\eta_t}$ controls the bias of the iterated estimator $f_t$, which is a decreasing function of $t$, while $\frac{1}{n}\big(\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}\big)^{1/2}$ is the standard deviation of the test statistic $D_{n,t}$, an increasing function of $t$. The optimal stopping rule is achieved by such a bias-standard deviation tradeoff. Recall that this tradeoff in (3.1) for testing is different from the bias-variance tradeoff in estimation (see Raskutti et al. [2014], Wei et al. [2017]), thus leading to a different optimal stopping time. In fact, as seen in Figure 1 (c), optimal estimation can be achieved at $\widetilde T$, which is earlier than $T^*$. This is also empirically confirmed by Figure 1 (a) and (b), where the minimum mean square error (MSE) is always achieved earlier than the maximum power. Please see Section 4 for more discussion.

### 3.2 Minimax optimal testing

In this section, we first derive the null limit distribution of the (standardized) $D_{n,t}$ as a standard Gaussian under mild conditions; that is, we only require the total step size $\eta_t$ to go to infinity. Define a sequence of diagonal shrinkage matrices $S_t = \prod_{\tau=0}^{t-1}(I_n - \alpha_\tau\Lambda)$. As stated in Raskutti et al. [2014], the matrix $S_t$ describes the extent of shrinkage towards the origin. By Assumption A2, $0 \le \alpha_\tau \le \min\{1, 1/\hat\mu_1\}$, so $S_t$ is positive semidefinite.

**Theorem 3.1.** Suppose Assumptions A2 and A3 are satisfied. Then under $H_0$, as $n \to \infty$ and $t \to \infty$, we have
$$\frac{D_{n,t} - \mu_{n,t}}{\sigma_{n,t}} \xrightarrow{d} N(0, 1).$$
Here $\mu_{n,t} = \mathbb{E}_{H_0}[D_{n,t}\,|\,x] = \frac{1}{n}\operatorname{tr}\big((I_n - S_t)^2\big)$ and $\sigma_{n,t}^2 = \operatorname{Var}_{H_0}[D_{n,t}\,|\,x] = \frac{2}{n^2}\operatorname{tr}\big((I_n - S_t)^4\big)$.

Based on Theorem 3.1, we have the following testing rule at significance level $\alpha$:
$$\phi_{n,t} = I\big(|D_{n,t} - \mu_{n,t}| \ge z_{1-\alpha/2}\,\sigma_{n,t}\big),$$
where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution.

**Lemma 3.2.** $\mu_{n,t} \asymp \frac{1}{n}\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}$ and $\sigma_{n,t}^2 \asymp \frac{1}{n^2}\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}$.

Define the squared separation rate
$$d_{n,t}^2 = \frac{1}{\eta_t} + \frac{1}{n}\Big(\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}\Big)^{1/2}.$$
The separation rate $d_{n,t}$ is used to measure the distance between the null hypothesis and a sequence of alternative hypotheses. The following Theorem 3.3 shows that, if the alternative signal $f$ is separated from zero by an order $d_{n,t}$, then the proposed test statistic $D_{n,t}$ asymptotically achieves high power at the total step size $\eta_t$. To achieve the maximum power, we need to minimize $d_{n,t}$. Under the stopping rule (3.1), we can see that when $t = T^*$, the separation rate achieves its minimal value $d_n^* := d_{n,T^*}$.

**Theorem 3.3.** (a) Suppose Assumptions A2 and A3 are satisfied. For any $\varepsilon > 0$, there exist positive constants $C_\varepsilon$, $t_\varepsilon$ and $N_\varepsilon$ such that, with probability greater than $1 - e^{-c\kappa_t}$,
$$\inf_{t \ge t_\varepsilon}\ \inf_{n \ge N_\varepsilon}\ \inf_{\substack{f \in \mathcal{B} \\ \|f\|_n \ge C_\varepsilon d_{n,t}}} P_f(\phi_{n,t} = 1\,|\,x) \ge 1 - \varepsilon,$$
where $c$ is a constant, $\mathcal{B} = \{f \in \mathcal{H}: \|f\|_{\mathcal{H}} \le C\}$ for a constant $C$, and $P_f(\cdot)$ is the probability measure under $f$. (b) The separation rate $d_{n,t}$ achieves its minimal value $d_n^* := d_{n,T^*}$.

The general Theorem 3.3 implies the following concrete stopping rules under various kernel classes.

**Corollary 3.4 (PDK of order $m$).** Suppose Assumption A2 holds and $m > 3/2$. Then at time $T^*$ with $\eta_{T^*} \asymp n^{4m/(4m+1)}$, for any $\varepsilon > 0$, there exist constants $C_\varepsilon$ and $N_\varepsilon$ such that, with probability greater than $1 - e^{-c_m n^{(2m-3)/(2m-1)}} - e^{-c_1 n^{2/(4m+1)}}$,
$$\inf_{n \ge N_\varepsilon}\ \inf_{\substack{f \in \mathcal{B} \\ \|f\|_n \ge C_\varepsilon d_n^*}} P_f(\phi_{n,T^*} = 1\,|\,x) \ge 1 - \varepsilon,$$
where $c_m$ is an absolute constant depending on $m$ only and $c_1$ is a constant.

Note that the minimal separation rate $n^{-2m/(4m+1)}$ is minimax optimal according to Ingster [1993]. Thus, $D_{n,T^*}$ is optimal when $\eta_{T^*} \asymp n^{4m/(4m+1)}$. Since $\eta_{T^*} = \sum_{t=0}^{T^*-1}\alpha_t$, we have $T^* \asymp n^{4m/(4m+1)}$ when constant step sizes are chosen.

**Corollary 3.5 (EDK of order $p$).** Suppose Assumption A2 holds and $p \ge 1$. Then at time $T^*$ with $\eta_{T^*} \asymp n(\log n)^{-1/(2p)}$, for any $\varepsilon > 0$, there exist constants $C_\varepsilon$ and $N_\varepsilon$ such that, with probability greater than $1 - e^{-c_{\beta,p}\,n(\log n)^{-2/p}} - e^{-c_1(\log n)^{1/p}}$,
$$\inf_{n \ge N_\varepsilon}\ \inf_{\substack{f \in \mathcal{B} \\ \|f\|_n \ge C_\varepsilon d_n^*}} P_f(\phi_{n,T^*} = 1\,|\,x) \ge 1 - \varepsilon,$$
where $c_{\beta,p}$ is an absolute constant depending on $\beta$ and $p$.

Note that the minimal separation rate $n^{-1/2}(\log n)^{1/(4p)}$ is proven to be minimax optimal in Corollary 1 of Wei and Wainwright [2017]. Hence, $D_{n,T^*}$ is optimal at the total step size $\eta_{T^*} \asymp n(\log n)^{-1/(2p)}$. When the step sizes are chosen as constants, the corresponding $T^* \asymp n(\log n)^{-1/(2p)}$.
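A minimal sketch (ours) of the null calibration and decision rule from Theorem 3.1, assuming a constant step size so that the diagonal of $S_t$ is simply $(1 - \alpha\hat\mu_i)^t$; the function name and its arguments are hypothetical, and the SciPy dependency is only used for the normal quantile.

```python
import numpy as np
from scipy.stats import norm

def wald_decision(D_nt, mu_hat, alpha_step, t, level=0.05):
    """Calibrate D_{n,t} as in Theorem 3.1 and return the decision phi_{n,t}.

    With a constant step size, S_t = diag((1 - alpha_step * mu_hat_i)^t), so
    mu_{n,t}      = tr((I_n - S_t)^2) / n     and
    sigma_{n,t}^2 = 2 tr((I_n - S_t)^4) / n^2."""
    n = len(mu_hat)
    shrink = 1.0 - (1.0 - alpha_step * mu_hat) ** t    # diagonal of I_n - S_t
    mu_nt = np.sum(shrink ** 2) / n
    sigma_nt = np.sqrt(2.0 * np.sum(shrink ** 4)) / n
    z = norm.ppf(1.0 - level / 2.0)
    return abs(D_nt - mu_nt) >= z * sigma_nt, mu_nt, sigma_nt
```

Here `mu_hat` would be the eigenvalues of the empirical kernel matrix and `D_nt` the value of $\|f_t\|_n^2$ computed from the recursion (2.3).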
### 3.3 Sharpness of the stopping rule

Theorem 3.3 shows that optimal testing can be achieved when $t = T^*$. In the specific examples of PDK and EDK, Theorem 3.6 further shows that when $\eta_t \ll \eta_{T^*}$ or $\eta_t \gg \eta_{T^*}$, there exists a local alternative $f$ that is not detectable by $D_{n,t}$ even when it is separated from zero by $d_n^*$. In this case, the asymptotic testing power is actually no larger than the significance level $\alpha$. Hence, we claim that $T^*$ is sharp in the sense that testing optimality is obtained if and only if the total step size achieves the order of $\eta_{T^*}$. Given a sequence of step sizes $\{\alpha_t\}_{t=0}^{\infty}$ satisfying Assumption A2, we have the following result.

**Theorem 3.6.** Suppose Assumption A2 holds, and $\eta_t \ll \eta_{T^*}$ or $\eta_t \gg \eta_{T^*}$. There exists a positive constant $C_1$ such that, with probability approaching 1,
$$\lim_{n \to \infty}\ \inf_{\substack{f \in \mathcal{B} \\ \|f\|_n \ge C_1 d_n^*}} P_f(\phi_{n,t} = 1\,|\,x) \le \alpha.$$

In the proof, we construct the alternative function as $\sum_{i=1}^n K(x_i, \cdot)w_i$, with $w_i$ defined in (A.8) and (A.9) for the two cases $\eta_t \ll \eta_{T^*}$ and $\eta_t \gg \eta_{T^*}$, respectively.

## 4 Sharpness of early stopping in nonparametric estimation

In this section, we review the existing early stopping rule for estimation, and further explore its sharpness property. In the literature, Raskutti et al. [2014] and Wei et al. [2017] proposed to use the fixed point of the local empirical Rademacher complexity to define the stopping rule as follows:
$$\widetilde T := \arg\min\Big\{t \in \mathbb{N} : \frac{1}{\eta_t} \le \frac{1}{n}\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}\Big\}. \tag{4.1}$$
Given the above stopping rule, the following theorem holds, where $f^*$ represents the truth.

**Theorem 4.1 (Raskutti et al. [2014]).** Given the stopping time $\widetilde T$ defined by (4.1), there are universal positive constants $(c_1, c_2)$ such that the following events hold with probability at least $1 - c_1\exp(-c_2 n/\eta_{\widetilde T})$: (a) for all iterations $t = 1, 2, \dots, \widetilde T$, $\|f_t - f^*\|_n^2 \lesssim \frac{1}{\eta_t}$; (b) at the iteration $\widetilde T$, $\|f_{\widetilde T} - f^*\|_n^2 \lesssim \frac{1}{\eta_{\widetilde T}}$; (c) for all $t > \widetilde T$, $\mathbb{E}_\epsilon\|f_t - f^*\|_n^2 \gtrsim \frac{1}{n}\sum_{i=1}^n \min\{1, \hat\mu_i\eta_t\}$.

To show the sharpness of $\widetilde T$, it suffices to examine the upper bound in Theorem 4.1 (a). In particular, we prove a complementary lower bound result. Specifically, Theorem 4.2 implies that once $\eta_t \ll \eta_{\widetilde T}$, rate optimality breaks down for at least one true $f^* \in \mathcal{B}$ with high probability. Denote the stopping time $\widetilde T$ satisfying
$$\eta_{\widetilde T} \asymp \begin{cases} n^{2m/(2m+1)}, & K \text{ is PDK of order } m, \\ n/(\log n)^{1/p}, & K \text{ is EDK of order } p. \end{cases}$$

**Theorem 4.2.** (a) (PDK of order $m$) Suppose Assumption A2 holds and $m > \frac{3}{2}$. For all $\eta_t \ll \eta_{\widetilde T}$, with probability approaching 1, there exists $f^* \in \mathcal{B}$ such that $\|f_t - f^*\|_n^2 \gg n^{-2m/(2m+1)}$. (b) (EDK of order $p$) Suppose Assumption A2 holds and $p \ge 1$. For all $\eta_t \ll \eta_{\widetilde T}$, with probability approaching 1, there exists $f^* \in \mathcal{B}$ such that $\|f_t - f^*\|_n^2 \gg n^{-1}(\log n)^{1/p}$.

Combining with Theorem 4.1, we claim that $\widetilde T$ is a sharp stopping time for estimation. Lastly, we comment briefly that the stopping rule for estimation and Theorem 4.1 (a), (b) can also be obtained in our framework as a by-product. Intuitively, the stopping time $\widetilde T$ in (4.1) is achieved by the classical bias-variance tradeoff. Note that $\|f_t - f^*\|_n^2$ has the trivial upper bound
$$\|f_t - f^*\|_n^2 \le 2\underbrace{\|f_t - \mathbb{E}_\epsilon f_t\|_n^2}_{\text{Variance}} + 2\underbrace{\|\mathbb{E}_\epsilon f_t - f^*\|_n^2}_{\text{Squared bias}},$$
where the expectation is taken with respect to $\epsilon$. The squared bias term is bounded by $\frac{1}{\eta_t}$ (see Lemma A.3); the variance term is bounded by the mean of $D_{n,t}$, that is, $\|f_t - \mathbb{E}_\epsilon f_t\|_n^2 = O_P(\mu_{n,t})$ (see Lemma A.1), where $\mu_{n,t} = \operatorname{tr}\big((I_n - S_t)^2\big)/n \asymp \frac{1}{n}\sum_{i=1}^n \min\{1, \eta_t\hat\mu_i\}$ as shown in Lemma 3.2. Obviously, according to (4.1), when $\eta_t \ll \eta_{\widetilde T}$, the squared bias dominates the variance.
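To make the contrast between the testing tradeoff (3.1) and the estimation tradeoff (4.1) concrete, here is a small sketch (ours; the constant step size and the omission of multiplicative constants are simplifying assumptions, since both rules are stated only up to order):

```python
import numpy as np

def stopping_times(mu_hat, alpha_step=1.0, max_iter=100000):
    """Return (T_test, T_est): the first t where 1/eta_t falls below
    (1/n) * sqrt(sum_i min(1, eta_t * mu_hat_i))   [testing rule (3.1)]  and
    (1/n) * sum_i min(1, eta_t * mu_hat_i)         [estimation rule (4.1)]."""
    n = len(mu_hat)
    T_test = T_est = None
    for t in range(1, max_iter + 1):
        eta_t = alpha_step * t                      # constant step size
        s = np.minimum(1.0, eta_t * mu_hat).sum()
        if T_test is None and 1.0 / eta_t <= np.sqrt(s) / n:
            T_test = t
        if T_est is None and 1.0 / eta_t <= s / n:
            T_est = t
        if T_test is not None and T_est is not None:
            return T_test, T_est
    return T_test, T_est

# Example with a polynomial decay kernel (PDK, m = 2): mu_i ~ i^{-4}.
n = 500
mu_hat = np.arange(1, n + 1, dtype=float) ** (-4.0)
print(stopping_times(mu_hat))  # estimation fires earlier than testing
```

With polynomially decaying eigenvalues the estimation rule fires first, matching the observation in Figure 1 (c) that $\widetilde T$ precedes $T^*$.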
## 5 Numerical Study

In this section, we compare our testing method with an oracle version of the stopping rule that uses knowledge of the true function $f$, as well as with a test based on penalized regularization. We further conduct simulation studies to verify our theoretical results.

**An oracle version of the early stopping rule.** The early stopping rule defined in (3.1) involves the bias of the iterated estimator $f_t$, which can be directly calculated as
$$\|\mathbb{E}_\epsilon f_t - f\|_n^2 = \|S_t U^\top\mathbf{f}\|_n^2 = \frac{1}{n}\sum_{i=1}^n \big[(S_t)_{ii}\big]^2\,[U^\top\mathbf{f}]_i^2,$$
and the standard deviation of $D_{n,t}$ is $\sigma_{n,t} = \frac{1}{n}\sqrt{2\operatorname{tr}\big((I_n - S_t)^4\big)}$. An oracle method bases its stopping time on the exact in-sample bias of $f_t$ and the standard deviation of $D_{n,t}$, defined as follows:
$$T^{\mathrm{o}} := \arg\min\Big\{t : \frac{1}{n}\sum_{i=1}^n \big[(S_t)_{ii}\big]^2\,[U^\top\mathbf{f}]_i^2 \le \sigma_{n,t}\Big\}. \tag{5.1}$$
Our oracle method represents the ideal case in which the true function $f$ is known.

**Algorithm based on the early stopping rule (3.1).** In the early stopping rule defined in (3.1), the bias term is bounded by the order of $\frac{1}{\eta_t}$. To implement the stopping rule (3.1) in practice, we propose a bootstrap method to approximate the bias term. Specifically, we calculate a sequence of estimates $\{f_t^{(b)}\}_{b=1}^B$ based on the pair-bootstrapped data $\{x_i^{(b)}, y_i^{(b)}\}_{i=1}^n$, and use $\|S_t U^\top\bar{\mathbf{f}}_t^{\,B}\|_n^2$ to approximate the bias term, where $\bar{\mathbf{f}}_t^{\,B} = \frac{1}{B}\sum_{b=1}^B \mathbf{f}_t^{(b)}$ and $B$ is a positive integer. On the other hand, the standard deviation term $\sigma_{n,t} = \frac{1}{n}\sqrt{2\operatorname{tr}\big((I_n - S_t)^4\big)}$ involves calculating all eigenvalues of the kernel matrix. This step can be implemented by many methods for fast computation of kernel eigenvalues; see Stewart [2002], Drineas and Mahoney [2005] and references therein.

**Penalization-based test.** As another reference, we also conduct a penalization-based test with the test statistic $D_{n,\lambda} = \|\hat f_{n,\lambda}\|_n^2$. Here $\hat f_{n,\lambda}$ is the kernel ridge regression (KRR) estimator (Shawe-Taylor and Cristianini [2004]) defined as
$$\hat f_{n,\lambda} := \arg\min_{f \in \mathcal{H}}\ \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}}^2,$$
where $\|f\|_{\mathcal{H}}^2 = \langle f, f\rangle_{\mathcal{H}}$ with $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ the inner product of $\mathcal{H}$. The penalty parameter $\lambda$ plays the same role as the total step size $\eta_t$ in avoiding overfitting. Liu et al. [2018] shows that the minimax optimal testing rate can be achieved by choosing the penalty parameter satisfying $\lambda \asymp \frac{1}{n}\sqrt{\operatorname{tr}\big((\Lambda(\Lambda + \lambda I_n)^{-1})^4\big)}$. The specific $\lambda$ varies across kernel classes. For example, in PDK, optimal testing can be achieved with $\lambda \asymp n^{-4m/(4m+1)}$; in EDK, the corresponding $\lambda \asymp n^{-1}(\log n)^{1/(2p)}$. We discover an interesting connection: the inverses of these $\lambda$ share the same order as the optimal total step sizes in Corollary 3.4 and Corollary 3.5, respectively. Lemma 5.1 provides a theoretical explanation for this connection.

**Lemma 5.1.** $\operatorname{tr}\big((\Lambda(\Lambda + \lambda I_n)^{-1})^4\big) \asymp \operatorname{tr}\big((I_n - S_t)^4\big)$ holds if and only if $\lambda \asymp \frac{1}{\eta_t}$.

However, it is still challenging to choose the optimal penalty parameter for testing in practice. A compromise strategy is to use the cross-validation (CV) method (Golub et al. [1979]), which was invented for optimal estimation problems. In the following numerical study, we show that the CV-based $D_{n,\lambda}$ performs less satisfactorily than our proposed early stopping method.
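Before turning to the experiments, here is a minimal Python sketch (ours, under our reading of rule (5.1) above) of the Oracle ES stopping time used as a benchmark below; a constant step size is assumed, and the function name and arguments are hypothetical.

```python
import numpy as np

def oracle_stopping_time(K, f_true, alpha_step=1.0, max_iter=10000):
    """Oracle rule (5.1): stop at the first t where the exact in-sample bias
    (1/n) * sum_i S_t[ii]^2 * [U^T f]_i^2 drops below the standard deviation
    sigma_{n,t} = sqrt(2 * tr((I_n - S_t)^4)) / n. Constant step size assumed."""
    n = len(f_true)
    mu_hat, U = np.linalg.eigh(K)                  # eigenpairs of the empirical kernel matrix
    proj = (U.T @ f_true) ** 2                     # [U^T f]_i^2 with f the true values at x
    for t in range(1, max_iter + 1):
        s_diag = (1.0 - alpha_step * mu_hat) ** t  # diagonal of S_t
        bias = np.sum(s_diag ** 2 * proj) / n
        sigma_nt = np.sqrt(2.0 * np.sum((1.0 - s_diag) ** 4)) / n
        if bias <= sigma_nt:
            return t
    return max_iter
```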
### 5.1 Numerical study I

In this section, we compare our early-stopping-based test statistic (ES) with two other methods: the oracle early stopping (Oracle ES) method and the penalization-based test described above. In particular, we consider the hypothesis testing problem $H_0: f = 0$. Data were generated from the regression model (2.1) with $f(x_i) = c\cos(4\pi x_i)$, where $x_i \overset{iid}{\sim} \mathrm{Unif}[0,1]$ and $c = 0, 1$, respectively; $c = 0$ is used for examining the size of the test, and $c = 1$ is used for examining its power. The sample size $n$ ranges from 100 to 1000. We use the Gaussian kernel (i.e., $p = 2$ in EDK) to fit the data. The significance level was chosen as 0.05. Both size and power were calculated as the proportion of rejections over 500 independent replications. For ES, we use the bootstrap method to approximate the bias, with $B = 10$ and constant step size $\alpha = 1$. For the penalization-based test, we use 10-fold cross-validation (10-fold CV) to select the penalty parameter. For the Oracle ES, we follow the stopping rule in (5.1) with constant step size $\alpha = 1$.

Figure 2 (a) shows that the sizes of the three testing methods approach the nominal level 0.05 for various $n$, demonstrating testing consistency. Figure 2 (b) displays the power of the three testing rules. ES exhibits better power performance than the penalization-based test with 10-fold CV across the sample sizes. Furthermore, as $n$ increases, the power of ES approaches that of the Oracle ES, which serves as the benchmark. As shown in Figure 2 (c), ES enjoys great computational efficiency compared with the Wald test with 10-fold CV, and it is expected that our proposed ES takes more time than the Oracle ES, due to the extra bootstrapping step. In Supplementary A.8, we show additional synthetic experiments with various $c$ based on the second-order Sobolev kernel, verifying our theoretical contribution.

Figure 2: (a) size with signal strength $c = 0$; (b) power with signal strength $c = 1$; (c) computational time (in seconds) for the three testing rules.

### 5.2 Numerical study II

In this section, we show synthetic experiments verifying our sharpness results stated in Corollary 3.4, Corollary 3.5 and Theorem 3.6. Data were generated from the regression model (2.1) with $f(x_i) = c\big(0.8(x_i - 0.5)^2 + 0.2\sin(4\pi x_i)\big)$, where $x_i \overset{iid}{\sim} \mathrm{Unif}[0,1]$ and $c = 0, 1$, respectively. The other settings are the same as in Section 5.1.

In Figure 3 (a) and (b), we use the second-order Sobolev kernel (i.e., $m = 2$ in PDK) to fit the model and set the constant step size $\alpha = 1$. Corollary 3.4 suggests that optimal power can be achieved at the stopping time $T^* \asymp n^{8/9}$. To display the impact of the stopping time on power performance, we set the total number of iterations $T$ to $(n^{8/9})^\gamma$ with $\gamma = 2/3, 1, 4/3$, and $n$ ranging from 100 to 1000. Figure 3 (a) shows that the size approaches the nominal level 0.05 under various choices of $(\gamma, n)$, demonstrating the testing consistency supported by Theorem 3.1. Figure 3 (b) displays the power of our testing rule. A key observation is that the power under the theoretically derived stopping rule ($\gamma = 1$) is the best, compared with the other stopping choices ($\gamma = 2/3, 4/3$). In Figure 3 (c) and (d), we use the Gaussian kernel (i.e., $p = 2$ in EDK) to fit the model and set the constant step size $\alpha = 1$. Here we set the total number of iterations to $(n/(\log n)^{1/4})^\gamma$ with $\gamma = 2/3, 1, 4/3$, and $n$ ranging from 100 to 1000. Note that $\gamma = 1$ corresponds to the optimal stopping time in Corollary 3.5. Overall, the interpretations are similar to those of Figure 3 (a) and (b) for PDK.

Figure 3: (a) size of $D_{n,t}$ with signal strength $c = 0$ under PDK; (b) power of $D_{n,t}$ with signal strength $c = 1$ under PDK; (c) size of $D_{n,t}$ with signal strength $c = 0$ under EDK; (d) power of $D_{n,t}$ with signal strength $c = 1$ under EDK.
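A self-contained Monte Carlo sketch (ours, not the authors' code) of how the empirical size and power in these studies can be computed; for simplicity it stops via the $1/\eta_t$ bias bound in (3.1) instead of the bootstrap approximation, and the kernel, step size, sample size, and replication count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def es_test_rejects(x, y, alpha_step=1.0, level=0.05, max_iter=5000):
    """One run of the early-stopping test: iterate (2.3) until rule (3.1) fires
    (using the 1/eta_t bias bound, a simplification of the bootstrap step),
    then reject if |D_{n,T} - mu_{n,T}| >= z_{1-level/2} * sigma_{n,T}."""
    n = len(y)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) / n  # Gaussian kernel (EDK, p = 2)
    mu_hat = np.linalg.eigvalsh(K)
    f = np.zeros(n)
    for t in range(1, max_iter + 1):
        f = f - alpha_step * K @ (f - y)                   # recursion (2.3)
        eta_t = alpha_step * t
        shrink = 1.0 - (1.0 - alpha_step * mu_hat) ** t    # diagonal of I_n - S_t
        if 1.0 / eta_t <= np.sqrt(np.minimum(1.0, eta_t * mu_hat).sum()) / n:
            break                                          # stopping rule (3.1)
    D = np.mean(f ** 2)                                    # D_{n,t} = ||f_t||_n^2
    mu_nt = np.sum(shrink ** 2) / n
    sigma_nt = np.sqrt(2.0 * np.sum(shrink ** 4)) / n
    return abs(D - mu_nt) >= norm.ppf(1.0 - level / 2.0) * sigma_nt

def rejection_rate(c, n=200, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, size=n)
        y = c * np.cos(4 * np.pi * x) + rng.normal(size=n)
        hits += es_test_rejects(x, y)
    return hits / reps

print("size :", rejection_rate(c=0.0))   # empirical size (close to nominal for larger n)
print("power:", rejection_rate(c=1.0))   # empirical power under the alternative
```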
## 6 Discussion

The main contribution of this paper is that we apply the early stopping strategy to nonparametric testing and propose, to the best of our knowledge, the first sharp stopping rule that guarantees minimax optimal testing. Our stopping rule depends on the eigenvalues of the kernel matrix, especially the first few leading eigenvalues. There are many efficient methods to compute the top eigenvalues quickly; see Drineas and Mahoney [2005], Ma and Belkin [2017]. As future work, we could also introduce randomly projected kernel methods to accelerate the computation.

## References

Peter Bühlmann and Bin Yu. Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98(462):324-339, 2003.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153-2175, 2005.

Jianqing Fan and Jiancheng Jiang. Nonparametric inference with generalized likelihood ratio tests. TEST, 16(3):409-444, 2007.

Jianqing Fan, Chunming Zhang, and Jian Zhang. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics, 29(1):153-193, 2001.

Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215-223, 1979.

Wensheng Guo. Inference in smoothing spline analysis of variance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):887-898, 2002.

Yuri I. Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Mathematical Methods of Statistics, 2(2):85-114, 1993.

Meimei Liu, Zuofeng Shang, and Guang Cheng. Nonparametric testing under random projection. arXiv preprint arXiv:1802.06308, 2018.

Junwei Lu, Guang Cheng, and Han Liu. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, 2016.

Siyuan Ma and Mikhail Belkin. Diving into the shallows: A computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3781-3790, 2017.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Early stopping and non-parametric regression: An optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335-366, 2014.

Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18(82):1-9, 2013.

Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola. Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

Zuofeng Shang and Guang Cheng. Local and global asymptotic inference in smoothing spline models. The Annals of Statistics, 41(5):2608-2638, 2013.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Gilbert W. Stewart. A Krylov-Schur algorithm for large eigenproblems. SIAM Journal on Matrix Analysis and Applications, 23(3):601-614, 2002.

Grace Wahba. Spline Models for Observational Data. SIAM, 1990.

Yuting Wei and Martin J. Wainwright. The local geometry of testing in ellipses: Tight control via localized Kolmogorov widths. arXiv preprint arXiv:1712.00711, 2017.

Yuting Wei, Fanny Yang, and Martin J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6067-6077, 2017.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.

Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538-1579, 2005.