# Asymptotics of K-Fold Cross Validation

Journal of Artificial Intelligence Research 78 (2023) 491-526. Submitted 06/2022; published 11/2023.

Jessie Li (jeqli@ucsc.edu)
Department of Economics, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA

©2023 The Authors. Published by AI Access Foundation under Creative Commons Attribution License CC BY 4.0.

Abstract

This paper investigates the asymptotic distribution of the K-fold cross validation error in an i.i.d. setting. As the number of observations n goes to infinity while keeping the number of folds K fixed, the K-fold cross validation error is √n-consistent for the expected out-of-sample error and has an asymptotically normal distribution. A consistent estimate of the asymptotic variance is derived and used to construct asymptotically valid confidence intervals for the expected out-of-sample error. A hypothesis test is developed for comparing two estimators' expected out-of-sample errors, and a subsampling procedure is used to obtain critical values. Monte Carlo simulations demonstrate the asymptotic validity of our confidence intervals for the expected out-of-sample error and investigate the size and power properties of our test. In our empirical application, we use our estimator selection test to compare the out-of-sample predictive performance of OLS, Neural Networks, and Random Forests for predicting the sale price of a domain name in a GoDaddy expiry auction.

1. Introduction

This paper studies the asymptotics of K-fold cross validation in an i.i.d. setting as a way to approximate an estimator's expected out-of-sample error, which is the expected predictive loss evaluated at a new observation drawn independently from the same distribution as the data. We first derive the asymptotic distribution of the K-fold cross validation error centered around the expected out-of-sample error as the number of observations n goes to infinity while the number of folds K stays fixed. The assumption of fixed K is reasonable given that researchers typically use K = 5 or K = 10 in practice. We note that the assumption of fixed K rules out leave-one-out cross validation, which effectively sets K = n. We find that the asymptotic distribution of the K-fold cross validation error does not depend on K, so for sufficiently large n, the choice of K should not be a first order concern. We also provide a consistent estimate of the asymptotic variance, which gives researchers a simple way to construct asymptotically valid standard errors for the K-fold cross validation error. The researcher can use these standard errors to form confidence intervals for the expected out-of-sample error. Our results are derived for estimators with asymptotically linear representations and well-defined probability limits, and for cross validation loss functions that are three times continuously differentiable at this probability limit with derivatives having bounded second moments. Additionally, we allow the objective function used to compute the estimators to be nondifferentiable; for example, we allow for quantile regression and ℓ2-norm Support Vector Machine regression.

It is sometimes the case that researchers would like to have a formal statistical test for comparing the predictive performance of two different estimators. To aid them on this front, we formulate a hypothesis test where the null hypothesis is that the two estimators have the same expected out-of-sample error.
Our null hypothesis is different from those of traditional model selection tests because we are not trying to discern the true functional form of the conditional mean function or some other feature of interest. Instead, we are comparing the out-of-sample predictive performance of different estimators. One motivation for doing such an exercise is that different estimators can sometimes be consistent for the same underlying conditional mean function, so traditional model selection tests would view them as equally good, and yet they can still differ in how well they predict the outcome out-of-sample. For instance, we can compare a kernel regression estimator to a local linear regression estimator using the same set of covariates, even if both estimators are consistent for the same underlying conditional mean function. Our test also allows the estimators to be consistent for different conditional mean functions, and the estimators can have different rates of convergence. For example, in the empirical application, we compare the predictive performance of OLS, Neural Networks, and Random Forests for predicting the sale price of a domain name in a GoDaddy expiry auction.

Section 2 provides a literature review. Section 3 contains the theoretical results on the asymptotic normality of the K-fold cross validation error and consistent estimation of the asymptotic variance. Section 4 contains the theoretical results for the hypothesis testing procedure. Section 5 contains Monte Carlo simulations examining the empirical coverage frequencies and rejection frequencies for a class of linear models. Section 6 contains an empirical application of our estimator selection test conducted pairwise and size-adjusted for multiple testing using a Bonferroni correction. Section 7 concludes. The appendix contains proofs of the main results.

Some notation that will be used in this paper is as follows: X_n = O_p(1) means that for any ϵ > 0, there exist a finite M > 0 and a finite N > 0 such that P(|X_n| > M) < ϵ for all n > N; X_n = o_p(1) means that for any ϵ > 0, lim_{n→∞} P(|X_n| ≥ ϵ) = 0; and plim denotes the probability limit.

2. Literature Review

An early paper that discusses K-fold cross validation is (Burman, 1989), which examines the bias and variance of K-fold cross validation as an estimator for the expected prediction error (measured using squared error loss) conditional on the data. (Zhang, 1993) derives the probabilities of selecting each model when performing K-fold cross validation using linear models. (Shao, 1993) provides necessary and sufficient conditions for variable selection consistency for linear models using leave-n_v-out cross validation, where n_t observations are used for estimating the models using linear regression and n_v observations are used for validating the accuracy of the models' predictions. He demonstrates that necessary and sufficient conditions for variable selection consistency are n_v/n_t → ∞ and n_t → ∞. (Yang, 2007) considers an extension of (Shao, 1993)'s consistency results to nonparametric regression. If at least one model is estimated at a rate slower than O(n_t^{-1/2}), then leave-n_v-out cross validation is consistent for the best model when n_t/n_v = O(1) and sometimes even when n_t/n_v → ∞. In the econometrics literature, (Li & Racine, 2004) and (Hall, Racine, & Li, 2004) examine the optimality properties of using cross validation to select the bandwidth for local linear regression and conditional probability density estimation, respectively, in the i.i.d. setting.
(Racine, 2000) proposes an improved version of (Burman, Chow, & Nolan, 1994)'s h-block cross validation method for dependent data, and he demonstrates that the probability of selecting the model with the best predictive ability converges to 1 as the total number of observations approaches infinity. An excellent survey article on cross validation is (Arlot & Celisse, 2010), which is the paper that first motivated our work and therefore shares similar notation with ours. None of the above papers consider constructing confidence intervals for the expected out-of-sample errors or testing whether two models have the same expected out-of-sample error.

A recent paper, (Lei, 2019), considers testing whether a linear model has the lowest average predictive risk among a fixed number of candidate models in the classical linear regression setting. However, their notion of predictive risk (also called the conditional test error) is different from our notion of expected out-of-sample error because they condition on the training data when taking the expectation over the new observations, whereas we take the expectation over both the new observations and the training data. This distinction is important because, as noted in Section 7.12 of (Hastie, Tibshirani, & Friedman, 2009) and more recently by (Zhu & Timmermann, 2020) and (Bates, Hastie, & Tibshirani, 2023), K-fold cross validation only estimates well the expected out-of-sample error (also called the expected test error) rather than the conditional test error. Although (Bates et al., 2023) propose a resampling procedure to obtain confidence intervals for the expected out-of-sample error, they do not derive the asymptotic distribution of the K-fold cross validation error. Their procedure (called nested CV) is also more computationally intensive than our analytic standard errors because it involves performing cross validation repeatedly for many iterations. An important recent paper, (Wager, 2020), derives the asymptotic distribution of K-fold cross validation for nonparametric estimators; we will show that our result is consistent with theirs in the case of nonparametric estimators, while also noting that our more general results allow for both parametric and nonparametric estimators. Another important recent paper, (Bayle, Bayle, Janson, & Mackey, 2020), also derives the asymptotic distribution of K-fold cross validation, but centered around the fold error rather than the expected out-of-sample error. They note at the end that further work could be done to extend their results to center around the expected out-of-sample error. Another important recent paper, (Austern & Zhou, 2020), does derive the asymptotic distribution centered around the expected out-of-sample error using different proof techniques based on Stein's method. However, they do not consider the problem of testing whether two estimators have the same expected out-of-sample error. Additionally, in their discussion of parametric M-estimators (see their Proposition 3 on page 12), they require the objective function to be strictly convex and twice differentiable in the parameter everywhere. We do not need the objective function to be differentiable or convex, which allows us to handle quantile regression and nonconvex maximum likelihood problems.

The literature on model selection tests also has many important papers; for example, an early paper on model specification testing is (Bierens, 1982).
He is testing the null hypothesis that the conditional expectation function can be specified as a parametric function, which is different from our null hypothesis comparing the expected out-of-sample errors of two estimators. (Vuong, 1989) develops a likelihood ratio test for comparing models estimated by maximum likelihood. Later, (Rivers & Vuong, 2002) generalize the framework to allow for models specified by moment conditions, but they require the estimator to be √n-consistent. (Lavergne & Vuong, 1996) propose a model selection test comparing two non-nested nonparametric models with different sets of regressors. (Lavergne, 2001) considers a test of the equality of two conditional expectation functions estimated using kernel regression. (Lavergne & Vuong, 2000) consider testing for the joint significance of a subset of continuous explanatory variables in nonparametric regression using kernel methods. (Chen, Hong, & Shum, 2007) use a likelihood ratio test to compare parametric likelihood models with moment-based models. (Hong & Preston, 2012) examine a more general class of estimators that includes many non-likelihood-based and semi-parametric estimators, but their procedure uses the Bayesian Information Criterion (BIC), which relies on knowing the effective model dimension, and this can be difficult to determine for nonparametric estimators. More recently, (Shi, 2015) has developed a non-degenerate version of (Vuong, 1989)'s likelihood ratio test which exhibits uniform size control. (Liao & Shi, 2020) extend (Shi, 2015) to semi-parametric and nonparametric models estimated using a sieve M-estimator. Another parametric likelihood-based model selection test that controls size uniformly over a large class of data generating processes is (Schennach & Wilhelm, 2017). The null hypotheses in these aforementioned papers are different from ours because they are testing properties of the true underlying model, whereas we are interested in evaluating the out-of-sample predictive performance of our estimators. In this regard, our test might be more similar to tests of predictive ability, such as the Diebold-Mariano-West tests ((Diebold & Mariano, 1995), (Diebold, 2015), and (West, 1996)), the (Giacomini & White, 2006) test of conditional predictive ability, and the (Clark & McCracken, 2015) bootstrap-based test of out-of-sample forecast accuracy. However, in contrast to these papers, we allow for a wider class of estimators with different convergence rates, and our focus is on i.i.d. cross-sectional data. The focus on i.i.d. cross-sectional data is because K-fold cross validation is primarily used in this context; however, it is not difficult to extend our method to balanced panel data.

3. Theoretical Properties of K-Fold Cross Validation

Consider an i.i.d. sample of data Ξ ≡ {ξ_i}_{i=1}^n ≡ {(x_i, y_i)}_{i=1}^n from some unknown distribution P. Let X ⊆ R^d be the support of x_i. The criterion for comparing a set of candidate estimators {ŝ_M : X → R, for M ∈ M} is the expected out-of-sample error EPE_{OUT,n}(M) ≡ E[γ(ŝ_M; ξ̃_i)], which is the expected loss if we were to apply estimator M's ŝ_M(·), computed using n observations, to a new ξ̃_i drawn independently from the same distribution P as Ξ. The expectation E[γ(ŝ_M; ξ̃_i)] is taken with respect to both the training data Ξ used to form ŝ_M(·) and the new observation ξ̃_i. For this reason, EPE_{OUT,n}(M) can be viewed as a measure of unconditional expected predictive accuracy. We will use K-fold cross validation to estimate the expected out-of-sample error.
Formally, the K-fold cross validation procedure selects the estimator among the set of candidate estimators M which minimizes the average prediction error for observations ξ_i in the kth validation fold, averaged over k = 1, ..., K:

L̂_n^CV(M) ≡ (1/K) Σ_{k=1}^K (1/n_v) Σ_{i ∈ I_k^(v)} γ(ŝ_M^(−k); ξ_i)

where I_k^(v) are the indices of the observations in the kth validation fold and n_v ≡ n/K is the number of observations in each validation fold, assuming WLOG that n is divisible by K. Later, we will use n_t ≡ (K−1)n/K to denote the number of observations in each training fold. ŝ_M^(−k)(·) is estimator M's estimate computed using the observations not in the kth validation fold. γ(·;·) is a loss function such as the squared error loss γ(ŝ_M; ξ_i) = (y_i − ŝ_M(x_i))².

While we do not discuss the details in this paper, it is not difficult to extend our results to balanced panel data where we have data on n individuals over a fixed time period T by redefining the K-fold cross validation error as

L̂_{n,T}^CV(M) ≡ (1/K) Σ_{k=1}^K (1/(n_v T)) Σ_{i ∈ I_k^(v)} Σ_{t=1}^T γ(ŝ_M^(−k); ξ_it)

where I_k^(v) are the indices of the individuals in the kth validation fold, n_v ≡ n/K is the number of individuals in each validation fold, and ŝ_M^(−k)(·) is model M's estimate computed using the individuals not in the kth validation fold.

We now present the assumptions that we will need throughout the paper. The first assumption says that s*_M(x_i) is the probability limit of ŝ_M(x_i) for all observations x_i and all estimators M in a finite set M. We emphasize that different estimators may have different s*_M's, and all of the s*_M's can be different from the true feature s_0 corresponding to the underlying data generating process. Therefore, our first assumption allows for estimators that are inconsistent for the true conditional mean function or other features of the data generating process.

Assumption 3.1 The set of candidate estimators M is finite, and

max_{M ∈ M} max_{1 ≤ i ≤ n} |ŝ_M(x_i) − s*_M(x_i)| = o_p(1).

The next assumption says the loss function is three times continuously differentiable with zero third derivative.

Assumption 3.2 γ(s*; ξ_i) is a three times continuously differentiable function of s*(x_i) with E[(∂γ(s*; ξ_i)/∂s*)²] < ∞, E[(∂²γ(s*; ξ_i)/∂s*²)²] < ∞, and zero third derivative ∂³γ(s*; ξ_i)/∂s*³ = 0 for all i.

We will use ∂γ(s*; ξ_i)/∂s* to denote the derivative. Although the notation may suggest otherwise, we are taking the derivative with respect to a particular point s*(x_i) rather than the whole function s*(·) itself. An example of a loss function that satisfies Assumption 3.2 is the squared error loss. Note that even though our loss function is three times continuously differentiable, the objective function used when computing the estimator ŝ_M(x_i) does not need to be differentiable. The next assumption says the estimators have an asymptotically linear representation.
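Before stating that assumption, the following is a minimal, self-contained sketch of the K-fold cross validation error L̂_n^CV(M) defined above, written in Python for a generic estimator. The helper names (kfold_cv_error, ols_fit, ols_predict, sq_loss) are illustrative assumptions and not part of the paper.

```python
import numpy as np

def kfold_cv_error(x, y, fit, predict, loss, K=5, seed=0):
    """K-fold cross validation error L^CV_n(M) for a generic estimator M.

    fit(x_train, y_train) returns a fitted object, predict(model, x_val) returns
    predictions at the validation points, and loss(y, yhat) is a pointwise loss
    such as squared error.  Folds have (roughly) n_v = n/K observations each.
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)                     # validation index sets I_k^(v)
    fold_errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(x[train], y[train])                # fold estimate uses n_t = n - n_v obs
        fold_errors.append(loss(y[val], predict(model, x[val])).mean())
    return float(np.mean(fold_errors))                 # average of the K fold errors

# Illustrative plug-ins for OLS with squared error loss:
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda beta, X: X @ beta
sq_loss = lambda y, yhat: (y - yhat) ** 2
```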
Assumption 3.3 For each estimator M M, there exists τn satisfying τn/n1/4 and τn/ n = O(1), and there exists a function φM (ξj, xi; s M) satisfying E [φM (ξj, xi; s M)| xi] = 0 and E φM (ξj, xi; s M) γ(s M;ξi) s M 2 < when τn = n, and V ar h E h φM (ξj, xi; s M) γ(s M;ξi) s M ξi ii 0 and V ar h E h φM (ξj, xi; s M) γ(s M;ξi) s M when τn n, such that for each xi X, ˆs M(xi) s M(xi) = 1 j=1 φM (ξj, xi; s M) + Rn,i where max 1 i n Rn,i = op 1 τn Assumption 3.3 states that the difference between ˆs M(xi) and its probability limit s M(xi) at each validation data point xi is asymptotically equivalent to a scaled average of terms φM (ξj, xi; s M) that depend only on the individual training observation ξj, the fixed point of evaluation xi, the function s M, and possibly some other nonrandom quantities. For n consistent estimators where E [φM (ξj, xi; s M)| xi] = 0, the function φM( , ; ) is called the influence function (see e.g. (Van der Vaart, 2000) or (Newey & Mc Fadden, 1994)). For example, OLS regression has an influence function φM (ξj, xi; β M) = x i E[xjx j] 1xj(yj x jβ M) which satisfies Assumption 3.3 under the weak exogeneity assumption E h xj(yj x jβ M) i = 0. For nonparametric estimators, typically E [φM (ξj, xi; s M)| xi] = 0 because of the bias, but V ar h E h φM (ξj, xi; s M) γ(s M;ξi) s M ξi ii 0 and V ar h E h φM (ξj, xi; s M) γ(s M;ξi) s M For nonparametric kernel estimators, the bias is of order h2 (where h is the bandwidth), so E h φM (ξj, xi; s M) γ(s M;ξi) s M ξi i = h2λ(xi) and E h φM (ξj, xi; s M) γ(s M;ξi) s M ξj i = h2ν(xj), where h 0 as n . Under weak regularity conditions, it is possible to show that V ar h2λ(xi) 0 and V ar h2ν(xi) 0. The φM( , ; ) function can be obtained from the first order Taylor expansion of the first order condition. For example, if ˆs M(xi) = F(xi, ˆβM), s M(xi) = F(xi, β M), where F( , ) is known, ˆβM = arg min β n QM n (β) = 1 n Pn j=1 q M j (β) o , and β M = arg min β n QM(β) = E h q M j (β) io , then the first order Taylor expansion of the first order condition for ˆβM is 0 = QM n (ˆβM) QM n (β M) β + ˆβM β M 2QM n (β M) β β + Op ˆβM β M 2 , which implies ˆs M(xi) s M(xi) = F(xi, ˆβM) F(xi, β M) = F(xi, β M) β ˆβM β M + Op j=1 F(xi, β M) β plim 2QM n (β M) β β 1 q M j (β M) β | {z } φM(ξj,xi;β M) For nonparametric estimators where ˆs M(xi) = arg min s {QM n (s, xi) = 1 n Pn j=1 q M j (s, xi)}, the first order Taylor expansion of the first order condition is 0 = QM n (ˆs M) s = QM n (s M) s + Asymptotics of K-Fold Cross Validation (ˆs M(xi) s M(xi)) 2QM n (s M) s2 + OP |ˆs M(xi) s M(xi)|2 , which implies ˆs M(xi) s M(xi) = 2QM n (s M) s2 q M j (s M, xi) s + OP |ˆs M(xi) s M(xi)|2 j=1 plim 2QM n (s M) s2 1 q M j (s M, xi) s | {z } φM(ξj,xi;s M) Examples of φM( , ; ) for some well known estimators are the following: 1. Ordinary Least Squares: For ˆs M(xi) = x i ˆβM, ˆs M(xi) = x i ˆβM = x i(X X) 1X Y β M = E[xjx j] 1E[xjy j] φM (ξj, xi; β M) = x i E[xjx j] 1xj(yj x jβ M) 2. Nonlinear Least Squares: For known F( , ), ˆs M(xi) = F(xi, ˆβM), ˆβM = arg min β j=1 (yj F(xj, β))2 β M = arg min β E h (yj F(xj, β))2i φM (ξj, xi; β M) = F(xi, β M) β E F(xj, β M) β F(xj, β M) β 1 F(xj, β M) β (yj F(xj, β M)) 3. Ridge Regression: For ˆs M(xi) = x i ˆβM, ˆβM = arg min β yi x iβ 2 + λβ β β M = arg min β E h yi x iβ 2i + λβ β ˆs M(xi) = x i(X X + λI) 1X Y φM (ξj, xi; β M) = x i E xjx j + λI 1 xj(yj x jβ M) λβ M 4. 
Kernel Regression: For h 0, nhd , Kh(xi xj) = 1 hd K xi xj h , f( ) the density function of xi, and s M(xi) = E[yi|xi], ˆs M(xi) = arg min s 1 n j=1 (yj s)2Kh(xi xj) = Pn j=1 Kh(xi xj)yj Pn j=1 Kh(xi xj) φM (ξj, xi; s M) = f(xi) 1Kh(xi xj)(yj s M(xi)) 5. Local Linear Regression: For h 0, nhd , Kh(xi xj) = 1 hd K xi xj h , f( ) the density function of xi, and s M(xi) = E[yi|xi], (ˆs M(xi), ˆβM) = arg min s,β j=1 (yj s (xi xj) β)2Kh(xi xj) φM (ξj, xi; s M, β M) = f(xi) 1Kh(xi xj)(yj s M(xi) (xi xj) β M) 6. Quantile Regression: For ˆs M(xi) = x i ˆβM, ρτ(u) = {(1 τ) 1 (u 0) + τ1 (u > 0)} |u|, and fyj|xj( ) the conditional density function of yj given xj, ˆβM = arg min β i=1 ρτ(yi x iβ) β M = arg min β E ρτ(yi x iβ) φM (ξj, xi; β M) = x i E h xjx jfyj|xj x jβ M i 1 xj τ 1 yj x jβ M 7. Local Linear Quantile Regression: For h 0, nhd , and s M(xi) = Qτ[yi|xi] the conditional τth quantile of yi given xi, (ˆs M(xi), ˆβM) = arg min s,β j=1 ρτ(yj s (xi xj) β)Kh(xi xj) φM (ξj, xi; s M, β M) = j=1 Kh(xi xj) fyj|xj s M(xi) + (xi xj) β M Kh(xi xj) τ 1 yj s M(xi) + (xi xj) β M 8. ℓ2-norm SVM regression: For x+ max(x, 0), ˆβM = arg min β ρτ yi x iβ κ + + λβ β β M = arg min β E h ρτ yi x iβ κ + + λβ β i φM (ξj, xi; β M) = x i H 1 xj τ1 yj x jβ M + κ (1 τ) 1 yj x jβ M κ 1 τ H = E xjx j τfyj|xj x jβ M + κ + (1 τ) fyj|xj x jβ M κ 1 τ 9. Generalized Method of Moments: For known F( , ), ˆs M(xi) = F(xi, ˆβM), s M(xi) = F(xi, β M), moment functions E [g( , β)] which might not equal zero at β M, and G(β M) Asymptotics of K-Fold Cross Validation E h g(ξj,β M) β i , ˆβM = arg min β j=1 g(ξj, β) j=1 g(ξj, β) β M = arg min β E [g(ξj, β)] WE [g(ξj, β)] φM (ξj, xi; β M) = F(xi, β M) β G(β M) WG(β M) 1 G(β M) Wg(ξj, β M) 10. Maximum Likelihood Estimation: For known F( , ), ˆs M(xi) = F(xi, ˆβM), s M(xi) = F(xi, β M), and a density function f( , ) which might not be the true density of ξi, ˆβM = arg max β j=1 log f(ξj, β) β M = arg max β E [log f(ξj, β)] φM (ξj, xi; β M) = F(xi, β M) β E 2 log f(ξj, β M) β β 1 log f(ξj, β M) β Remark 3.1 We would like to note that several other well-known machine learning estimators besides Ridge and ℓ2-norm SVM regression also have an influence function representation. For example, (Wager & Athey, 2018) derived the asymptotic normality of honest Random Forest estimators using Hajek projections, which indicates the existence of an influence function representation (see Lemma 3.3 in (Wager & Athey, 2018) for details). Similarly, (White, 1989), (Chen & White, 1999), and (White & Racine, 2001) demonstrated asymptotic normality of single-hidden layer Neural Networks by formulating them as nonlinear least squares estimators ˆβn = arg min 1 n Pn i=1 yi Λ Pq j=1 ψ (x iγj) θj 2 , where Λ( ) is a smooth increasing output function, ψ ( ) is a smooth increasing activation function such as the sigmoidal function and β (γ , θ ) are the network parameters such as the weights on the inputs and hidden units. The influence function is given by example 2 above after replacing F(xi, β) with Λ Pq j=1 ψ (x iγj) θj . However, there are also estimators that do not have an influence function representation generally. For example, Lasso typically does not have an influence function representation unless additional assumptions are imposed to guarantee asymptotic normality. 
In the finite-dimensional case, if √n λ_n → 0, then β̂_M = argmin_β (1/n) Σ_{i=1}^n (y_i − x_i'β)² + λ_n ‖β‖_1 will have an influence function representation equal to that of least squares, because Lasso's non-standard asymptotic distribution reduces to a normal distribution. For more details, see the last line of the proof of Theorem 2 of (Knight & Fu, 2000).

3.1 Asymptotic Normality of K-Fold Cross Validation Criterion

In this section, we derive the asymptotic normality of √n times the difference between the K-fold cross validation criterion and the expected out-of-sample error for each estimator M in a set of candidate estimators M. Recall that n_t ≡ (K−1)n/K is the number of observations in each training fold. We will use τ_{n_t} to denote the analog of τ_n evaluated using n_t observations.

Theorem 3.1 (Asymptotic Normality) Suppose Assumptions 3.1-3.3 are satisfied. Define for each estimator M ∈ M

ψ(ξ_j, ξ_i) ≡ φ_M(ξ_j, x_i; s*_M) ∂γ(s*_M; ξ_i)/∂s*_M,
σ²_γ ≡ Var[γ(s*_M; ξ_i)],
σ²_{01,ψ} ≡ Var[E[ψ(ξ_j, ξ_i) | ξ_j]],
λ ≡ Cov[γ(s*_M; ξ_j), E[ψ(ξ_j, ξ_i) | ξ_j]].

As n → ∞ while keeping K fixed,

√n (L̂_n^CV(M) − EPE_{OUT,n}(M)) →d N(0, σ²), where σ² = σ²_γ if τ_n ≪ √n, and σ² = σ²_γ + σ²_{01,ψ} + 2λ if τ_n = √n.

Remark 3.2 Note that for estimators that converge at a slower than √n rate (i.e., nonparametric estimators), the K-fold cross validation error still converges at the √n rate because the residual variance component E[γ(s*_M; ξ_i)] of the expected out-of-sample error can be consistently estimated at the √n rate. This result for nonparametric estimators is consistent with Proposition 1 of (Wager, 2020).

Remark 3.3 Note that the rate of convergence and asymptotic variance of the K-fold cross validation error do not depend on the number of folds K, assuming that K is fixed. This suggests that the choice of which fixed K to use is not first-order important in large samples.

3.2 Consistent Estimate of Asymptotic Variance

The benefit of deriving the asymptotic distribution of the K-fold cross validation error is that we can use a consistent estimate of the asymptotic variance to construct confidence intervals for the expected out-of-sample error. For nonparametric estimators, we will need to compute our estimator once at each data point ŝ_M(x_i) for i = 1, ..., n when forming our estimate σ̂²_γ. For parametric estimators, we will need to compute β̂_M once and then estimate the influence functions φ̂_M(ξ_j, x_i; ŝ_M) and first derivatives of γ(ŝ_M; ξ_i) at each data point. The following theorem demonstrates consistency of our estimate of the asymptotic variance.

Theorem 3.2 (Consistent Estimate of Asymptotic Variance) Define

σ̂²_n = σ̂²_γ if τ_n ≪ √n, and σ̂²_n = σ̂²_γ + σ̂²_{01,ψ} + 2λ̂ if τ_n = √n,

where

ψ̂(ξ_j, ξ_i; ŝ_M) = φ̂_M(ξ_j, x_i; ŝ_M) ∂γ(ŝ_M; ξ_i)/∂ŝ_M,
σ̂²_γ = (1/n) Σ_{i=1}^n ( γ(ŝ_M; ξ_i) − (1/n) Σ_{k=1}^n γ(ŝ_M; ξ_k) )²,
ĥ_M(ξ_j) = (1/(n−1)) Σ_{i: i≠j} ψ̂(ξ_j, ξ_i; ŝ_M),
σ̂²_{01,ψ} = (1/n) Σ_{j=1}^n ( ĥ_M(ξ_j) − (1/n) Σ_{k=1}^n ĥ_M(ξ_k) )²,
λ̂ = (1/n) Σ_{i=1}^n ( γ(ŝ_M; ξ_i) − (1/n) Σ_{k=1}^n γ(ŝ_M; ξ_k) ) ( ĥ_M(ξ_i) − (1/n) Σ_{k=1}^n ĥ_M(ξ_k) ).

Suppose Assumptions 3.1-3.3 are satisfied, in addition to

(1/n) Σ_{j=1}^n | (1/(n−1)) Σ_{i: i≠j} ψ̂(ξ_j, ξ_i; ŝ_M) − (1/(n−1)) Σ_{i: i≠j} ψ(ξ_j, ξ_i; s*_M) |² →p 0 and
(1/n) Σ_{i=1}^n |γ(ŝ_M; ξ_i) − γ(s*_M; ξ_i)|² →p 0.

Then σ̂²_n converges in probability to σ².

The additional two assumptions at the end are necessary to ensure that using the estimated influence function ψ̂(·, ·; ·) and estimated feature ŝ_M instead of their probability limits does not impact the consistency of our estimates σ̂²_γ, σ̂²_{01,ψ}, and λ̂. Similar assumptions are made in Lemma 8.3 of (Newey & McFadden, 1994) when showing consistency of estimators of the asymptotic variance of two-step estimators.
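To illustrate Theorem 3.2, here is a hedged sketch of the plug-in variance estimate and the resulting confidence interval for EPE_{OUT,n} in the √n-consistent case, using OLS with squared error loss and the OLS influence function from Example 1. The function name and interface are illustrative assumptions, not the paper's code.

```python
import numpy as np

def ols_variance_and_ci(X, y, cv_error):
    """Sketch of the plug-in asymptotic variance of Theorem 3.2 for OLS with
    squared error loss, and the nominal 95% confidence interval for EPE_OUT,n."""
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # full-sample OLS coefficients
    resid = y - X @ beta                                  # y_j - x_j' beta_hat
    Sxx_inv = np.linalg.inv(X.T @ X / n)                  # estimate of E[x x']^{-1}

    gamma = resid ** 2                                    # gamma_hat(xi_i): squared error loss
    dgamma = -2.0 * resid                                 # d gamma / d s at s = s_hat(x_i)

    # phi_hat[j, i] = x_i' (X'X/n)^{-1} x_j (y_j - x_j' beta_hat)  (Example 1)
    phi = (X @ Sxx_inv @ X.T) * resid[:, None]
    psi = phi * dgamma[None, :]                           # psi_hat(xi_j, xi_i)

    # h_hat(xi_j) = average of psi_hat(xi_j, xi_i) over i != j
    h = (psi.sum(axis=1) - np.diag(psi)) / (n - 1)

    sigma2_gamma = gamma.var()                            # sigma_hat^2_gamma
    sigma2_01 = h.var()                                   # sigma_hat^2_{01,psi}
    lam = np.cov(gamma, h, bias=True)[0, 1]               # lambda_hat
    sigma2 = sigma2_gamma + sigma2_01 + 2.0 * lam         # root-n (parametric) case

    half = 1.96 * np.sqrt(sigma2 / n)                     # 95% normal critical value
    return sigma2, (cv_error - half, cv_error + half)
```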
4. Estimator Selection Test

We now formulate a hypothesis test to compare the predictive performance of two estimators using their expected out-of-sample errors EPE_{OUT,n}(M) ≡ E[γ(ŝ_M; ξ̃_i)], where the expectation is taken with respect to both the data used to estimate ŝ_M and the new observation ξ̃_i. The two estimators could be estimating either the same feature s*_{M1} = s*_{M2} or different features s*_{M1} ≠ s*_{M2}. The null hypothesis states that the expected out-of-sample errors of the two estimators are the same, and the alternatives are that one is higher than the other:

H0 : EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1) = 0
H1 : EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1) > 0
H2 : EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1) < 0

4.1 Asymptotic Distribution of Test Statistic in the Absence of First Order Degeneracy

When the difference between the expected out-of-sample errors of the two estimators can be consistently estimated at the √n rate, we can use the following test statistic:

T_n ≡ √n ( L̂_n^CV(M2) − L̂_n^CV(M1) ) / σ̄̂_n

where σ̄̂²_n is a consistent estimate of σ̄², the asymptotic variance of the difference in the K-fold cross validation errors, which differs for parametric and nonparametric estimators:

σ̄² = σ̄²_γ + σ²_{01,ψ_{M1}} + 2λ_{M1} if τ_{n,2} ≪ τ_{n,1} = √n,
σ̄² = σ̄²_γ + σ²_{01,ψ_{M2}} + 2λ_{M2} if τ_{n,1} ≪ τ_{n,2} = √n,
σ̄² = σ̄²_γ if τ_{n,2}, τ_{n,1} ≪ √n,
σ̄² = σ̄²_γ + σ̄²_{01,ψ̄} + 2λ̄ if τ_{n,2} = τ_{n,1} = √n,

where

ψ̄(ξ_j, ξ_i) ≡ φ_{M2}(ξ_j, x_i; s*_{M2}) ∂γ(s*_{M2}; ξ_i)/∂s*_{M2} − φ_{M1}(ξ_j, x_i; s*_{M1}) ∂γ(s*_{M1}; ξ_i)/∂s*_{M1},
γ̄(ξ_i) ≡ γ(s*_{M2}; ξ_i) − γ(s*_{M1}; ξ_i), σ̄²_γ ≡ Var[γ̄(ξ_i)],
σ̄²_{01,ψ̄} ≡ Var[E[ψ̄(ξ_j, ξ_i) | ξ_j]], λ̄ ≡ Cov[γ̄(ξ_j), E[ψ̄(ξ_j, ξ_i) | ξ_j]].

The following theorem shows that when 0 < σ̄² < ∞, T_n has a standard normal asymptotic distribution under H0. It follows that the test which rejects H0 when |T_n| > Z_{1−α/2}, where Z_α is the α-th percentile of the standard normal, is pointwise consistent in level.

Theorem 4.1 Suppose Assumptions 3.1-3.3 are satisfied for M1 and M2, and 0 < σ̄² < ∞. Then T_n →d N(0, 1) under H0 and under alternatives of the form √n(EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1)) → 0. T_n →d N(c/σ̄, 1) under alternatives of the form √n(EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1)) → c, for some constant c. T_n →p ∞ under alternatives of the form √n(EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1)) → ∞. T_n →p −∞ under alternatives of the form √n(EPE_{OUT,n}(M2) − EPE_{OUT,n}(M1)) → −∞.

4.2 Consistent Estimation of Asymptotic Variance

We now provide a consistent estimate of σ̄².

Theorem 4.2 (Consistent Estimate of Asymptotic Variance) Define

σ̄̂²_n = σ̄̂²_γ + σ̂²_{01,ψ_{M1}} + 2λ̂_{M1} if τ_{n,2} ≪ τ_{n,1} = √n,
σ̄̂²_n = σ̄̂²_γ + σ̂²_{01,ψ_{M2}} + 2λ̂_{M2} if τ_{n,1} ≪ τ_{n,2} = √n,
σ̄̂²_n = σ̄̂²_γ if τ_{n,2}, τ_{n,1} ≪ √n,
σ̄̂²_n = σ̄̂²_γ + σ̄̂²_{01,ψ̄} + 2λ̄̂ if τ_{n,2} = τ_{n,1} = √n,

where

γ̄̂(ξ_i) ≡ γ(ŝ_{M2}; ξ_i) − γ(ŝ_{M1}; ξ_i), σ̄̂²_γ = (1/n) Σ_{i=1}^n ( γ̄̂(ξ_i) − (1/n) Σ_{k=1}^n γ̄̂(ξ_k) )²,
ψ̄̂(ξ_j, ξ_i; ŝ_{M1}, ŝ_{M2}) = φ̂_{M2}(ξ_j, x_i; ŝ_{M2}) ∂γ(ŝ_{M2}; ξ_i)/∂ŝ_{M2} − φ̂_{M1}(ξ_j, x_i; ŝ_{M1}) ∂γ(ŝ_{M1}; ξ_i)/∂ŝ_{M1},
h̄̂(ξ_j) = (1/(n−1)) Σ_{i: i≠j} ψ̄̂(ξ_j, ξ_i; ŝ_{M1}, ŝ_{M2}),
σ̄̂²_{01,ψ̄} = (1/n) Σ_{j=1}^n ( h̄̂(ξ_j) − (1/n) Σ_{k=1}^n h̄̂(ξ_k) )²,
λ̄̂ = (1/n) Σ_{i=1}^n ( γ̄̂(ξ_i) − (1/n) Σ_{k=1}^n γ̄̂(ξ_k) ) ( h̄̂(ξ_i) − (1/n) Σ_{k=1}^n h̄̂(ξ_k) ).

If all of the conditions in Theorem 3.2 are satisfied for M1 and M2, then σ̄̂²_n converges in probability to σ̄².

4.3 Testing Procedure Accounting for First Order Degeneracy

If σ̄² = 0, which can arise for example when two nonparametric estimators are estimating the same conditional mean function, in which case Var[γ̄(ξ)] = 0, then the asymptotic distribution of T_n under H0 becomes degenerate. In order to have a nondegenerate test statistic, we need to scale by n instead of √n (see the last part of the proof of Theorem 4.1 for an explanation of this result). Although we do not explicitly characterize the asymptotic distribution of our test statistic under first order degeneracy, we can consistently estimate this distribution using subsampling.
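Before turning to the degenerate case, here is a minimal sketch of the non-degenerate test from Sections 4.1-4.2, assuming the two K-fold cross validation errors and the plug-in variance σ̄̂²_n have already been computed; the function name and the hard-coded 5% critical value 1.96 are illustrative.

```python
import numpy as np

def estimator_selection_test(cv_err_m1, cv_err_m2, sigma2_bar, n):
    """Sketch of the non-degenerate test: T_n = sqrt(n) (L_CV(M2) - L_CV(M1)) / sigma_bar_hat,
    compared with the 5% standard normal critical value 1.96."""
    t_n = np.sqrt(n) * (cv_err_m2 - cv_err_m1) / np.sqrt(sigma2_bar)
    if t_n > 1.96:
        return t_n, "reject H0: M1 has the smaller expected out-of-sample error"
    if t_n < -1.96:
        return t_n, "reject H0: M2 has the smaller expected out-of-sample error"
    return t_n, "fail to reject H0: equal expected out-of-sample errors"
```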
If we knew first order degeneracy were present, we could use n (L̂_n^CV(M2) − L̂_n^CV(M1)) as the test statistic. However, it is often very difficult to know if first order degeneracy is present, so instead, we first obtain an estimate n^δ̂ of the rate of convergence of L̂_n^CV(M2) − L̂_n^CV(M1) using the following algorithm (described in Theorem 8.2.2 in (Politis, Romano, & Wolf, 1999)).

1. For r = 1, ..., R replications, draw I different subsamples of sizes b_i for i = 1, ..., I, where b_i → ∞ and b_i/n → 0, and compute Δ_{b_i,r} ≡ L̂_{b,i}^CV(M2) − L̂_{b,i}^CV(M1) − ( L̂_n^CV(M2) − L̂_n^CV(M1) ).

2. For each i = 1, ..., I and j = 1, ..., J, define c_{τ,ij} and c_{ρ,ij} as the τ_j-th and ρ_j-th percentiles of {Δ_{b_i,r}}_{r=1}^R, where τ_j and ρ_j are the jth elements of τ = {70, 70 + 20/(J−1), 70 + 40/(J−1), ..., 90} and ρ = {10, 10 + 20/(J−1), 10 + 40/(J−1), ..., 30}.

3. Define y_{ij} = log(c_{τ,ij} − c_{ρ,ij}), ȳ_i = (1/J) Σ_{j=1}^J y_{ij}, ȳ = (1/I) Σ_{i=1}^I ȳ_i, and avg log(b) = (1/I) Σ_{i=1}^I log(b_i).

4. The estimated rate of convergence is n^δ̂ for

δ̂ = − [ Σ_{i=1}^I (ȳ_i − ȳ)(log(b_i) − avg log(b)) ] / [ Σ_{i=1}^I (log(b_i) − avg log(b))² ].

We then use the test statistic T_n ≡ n^δ̂ ( L̂_n^CV(M2) − L̂_n^CV(M1) ), and the percentiles of

L̂_{n,b}(x) = (1/B) Σ_{i=1}^B 1{ b^δ̂ ( L̂_{b,i}^CV(M2) − L̂_{b,i}^CV(M1) − ( L̂_n^CV(M2) − L̂_n^CV(M1) ) ) ≤ x }

as the critical values, where the subsample size b satisfies b → ∞ and b/n → 0, and i = 1, ..., B indexes a random sample of the (n choose b) possible subsets of size b. This procedure requires computing the difference in the cross validation errors (I + 1)B + 1 times, where I is the number of different subsample sizes we use to estimate the rate of convergence δ using the procedure in Theorem 8.2.2 of (Politis et al., 1999). We acknowledge that the computational cost of this procedure can be high due to the repeated computation of the cross validation errors, but we cannot avoid this cost unless we know the rate of convergence of our test statistic, which is difficult to determine in practice. The next theorem states that our test using the subsampling procedure is pointwise consistent in level.

Theorem 4.3 (Pointwise Consistency in Level) Suppose Assumptions 3.1-3.3 are satisfied. Let ĉ_α be the α-th percentile of L̂_{n,b}(x). For φ_n = 1{ T_n > ĉ_{1−α/2} or T_n < ĉ_{α/2} },

lim sup_{n→∞} E_{H0}[φ_n] ≤ α.

Proof: The result follows from the consistency of subsampling for statistics with an estimated rate of convergence (e.g., Theorem 8.3.1 in (Politis et al., 1999)).

Remark 4.1 In order to illustrate when first order degeneracy may arise, take the simple example of linear regression with nested regressors:

M1 : Y_i = α_1 W_i + η_i
M2 : Y_i = α_2 W_i + β_2 X_i + ϵ_i

Define Z_i ≡ [W_i, X_i]'. The influence functions for the OLS estimates are

φ_{M1}(ξ_j, W_i; α_1) = W_i' E[W_j W_j']^{-1} W_j (Y_j − α_1 W_j)
φ_{M2}(ξ_j, Z_i; α_2, β_2) = Z_i' E[Z_j Z_j']^{-1} Z_j (Y_j − α_2 W_j − β_2 X_j)

The loss function is squared error loss:

γ(s*_{M1}; ξ_i) = (Y_i − α_1 W_i)², γ(s*_{M2}; ξ_i) = (Y_i − α_2 W_i − β_2 X_i)²

so that

ψ̄(ξ_j, ξ_i) = φ_{M2}(ξ_j, Z_i; α_2, β_2) ∂γ(s*_{M2}; ξ_i)/∂s*_{M2} − φ_{M1}(ξ_j, W_i; α_1) ∂γ(s*_{M1}; ξ_i)/∂s*_{M1}
= −2 (Y_j − α_2 W_j − β_2 X_j) Z_j' E[Z_j Z_j']^{-1} Z_i (Y_i − α_2 W_i − β_2 X_i) + 2 (Y_j − α_1 W_j) W_j' E[W_j W_j']^{-1} W_i (Y_i − α_1 W_i)

E[ψ̄(ξ_j, ξ_i) | ξ_j] = −2 (Y_j − α_2 W_j − β_2 X_j) Z_j' E[Z_j Z_j']^{-1} E[Z_i (Y_i − α_2 W_i − β_2 X_i)] + 2 (Y_j − α_1 W_j) W_j' E[W_j W_j']^{-1} E[W_i (Y_i − α_1 W_i)]

If it were the case that E[Z_i (Y_i − α_2 W_i − β_2 X_i)] = 0, E[W_i (Y_i − α_1 W_i)] = 0, and Var[ϵ_i² − η_i²] = 0, then we have first order degeneracy since σ̄² = σ̄²_γ + σ̄²_{01,ψ̄} + 2λ̄ = 0. This can happen if model 1 were the true model with E[W_i (Y_i − α_1 W_i)] = 0, which would imply that β_2 = 0.
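The following sketch illustrates the two steps just described: estimating the rate exponent δ via the quantile-spread regression of (Politis et al., 1999, Theorem 8.2.2) and then computing subsampling critical values. Here cv_diff_fn is a hypothetical helper returning L̂_b^CV(M2) − L̂_b^CV(M1) on a given (sub)sample, and the sign convention for δ̂ reflects the spread shrinking like b^{-δ}; all names are illustrative.

```python
import numpy as np

def estimate_rate_exponent(recentered_stats, b_sizes, taus=(70, 80, 90), rhos=(10, 20, 30)):
    """Quantile-spread estimate of delta in the rate n^delta, following the
    algorithm above.  recentered_stats[i] holds the R recentered subsample
    statistics Delta_{b_i,r} computed with subsample size b_sizes[i]."""
    y_bar = []
    for stats in recentered_stats:
        spread = np.percentile(stats, taus) - np.percentile(stats, rhos)
        y_bar.append(np.mean(np.log(spread)))            # average of y_ij over j
    y_bar, log_b = np.asarray(y_bar), np.log(np.asarray(b_sizes, dtype=float))
    slope = np.cov(y_bar, log_b, bias=True)[0, 1] / log_b.var()
    return -slope                                         # spread shrinks like b^{-delta}

def subsampling_test(x, y, cv_diff_fn, b, delta_hat, B=1000, alpha=0.05, seed=0):
    """Subsampling critical values for T_n = n^delta_hat (L_CV(M2) - L_CV(M1)).
    cv_diff_fn(x_sub, y_sub) is a hypothetical helper returning the K-fold CV
    error difference on the given sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    full_diff = cv_diff_fn(x, y)
    stats = []
    for _ in range(B):
        idx = rng.choice(n, size=b, replace=False)        # subsample without replacement
        stats.append(b ** delta_hat * (cv_diff_fn(x[idx], y[idx]) - full_diff))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    t_n = n ** delta_hat * full_diff
    return t_n, (lo, hi), bool(t_n < lo or t_n > hi)      # reject H0 outside the interval
```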
Remark 4.2 For multiple model comparisons, if we are interested in constructing a datadependent set of estimators that contain the best estimator(s) with a certain probability, we can construct a confidence set using a procedure similar to Definition 2 of (Hansen, Lunde, & Nason, 2011) which does not require a benchmark model to be selected as in (White, 2000). To construct the model confidence set, we first test the null hypotheses H0,M : EPEOUT,n (Mi) EPEOUT,n (Mj) = 0 for all Mi, Mj M using the test statistic TM,n = max i,j ˆLCV n (Mi) ˆLCV n (Mj) , which is the maximum of the absolute difference of the K-fold cross validation errors for all pairs of estimators in M. If H0,M is rejected, an elimination rule is used to eliminate from M an estimator with worst performance. We repeat the test using the smaller model set until it is accepted and the model confidence set is defined as the set of remaining models. Since the asymptotics of the test statistic TM,n are nonstandard, we will have to resort to subsampling or other consistent resampling methods to obtain critical values. If we are instead interested in obtaining pairwise rankings, then we can use the multiple testing procedures discussed in (Lehmann & Romano, 2005) and (Romano, Shaikh, & Wolf, 2010) and references therein. For example, we can use the Bonferroni correction, as we do in our empirical application. 5. Monte Carlo In this section, we use Monte Carlo simulations to study the coverage of our K-fold cross validation confidence intervals and the rejection frequencies of our estimator selection test. 5.1 Confidence Intervals for expected out-of-sample error We demonstrate how we can use the asymptotic distribution of the K-fold cross validation error to construct confidence intervals for the expected out-of-sample error. We investigate the empirical coverage frequencies for these confidence intervals for three data generating processes (DGPs). 5.1.1 Linear DGP For each Monte Carlo simulation r = 1...R, we generate the test data ( xri, zri, yri)n T est i=1 and the training data (xri, zri, yri)n i=1 independently of each other using the linear DGP: yi = β0xi + z iγ0 + ϵi, ϵi N(0, 1), ϵi xi, zi xi zi , 0.5 I10 + ι10ι 10 where µ = 2 3 4 5 4 5 6 7 8 , I10 is the 10 10 identity matrix, and ι10 is the 10 1 vector of all ones. Define w i [xi, z i]. Our candidate models include linear models for different values of p {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}: j=1 θjwij + ηi We also look at two additional linear models with group effects αg, which are 50 dummy variables generated independently of ϵ, x, and z. yi = θ1xi + αg + νi, yi = w iθ + αg + ζi All of our linear models are estimated using OLS. In addition, we also consider a univariate nonparametric model that is estimated using kernel (local-constant) regression of y on x using a Gaussian kernel Khn (x) = K (x/hn), K (x) = (2π) 1/2 e x2/2, and bandwidth hn = (4/3)1/5n 1/5. The 5-fold cross validation error using squared error loss is yi ˆs( k) nt (wi) 2 where ˆs( k) nt is the estimate of the conditional mean of y given w computed using the nt training observations, which are the observations not in the k-th fold. For OLS, ˆs( k) nt (wi) = j / I(v) k wjw j 1 1 nt P j / I(v) k wjyj . For the kernel estimator, ˆs( k) nt (xi) = j / I(v) k Khnt (xi xj)yj P j / I(v) k Khnt (xi xj) . 
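As a rough illustration of the coverage exercise described next, here is one Monte Carlo replication under a simplified stand-in for the linear DGP above (i.i.d. standard normal regressors rather than the correlated design), reusing the hypothetical kfold_cv_error, ols_fit, ols_predict, sq_loss, and ols_variance_and_ci helpers sketched earlier. The expected out-of-sample error is approximated by the test error on a large independent sample, as in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 5
beta0 = 0.5 * np.ones(p)                                  # stand-in coefficients

def one_replication():
    """One coverage check: does the 95% CI built from the 5-fold CV error and the
    plug-in variance cover the (simulated) expected out-of-sample error?"""
    X = rng.normal(size=(n, p))
    y = X @ beta0 + rng.normal(size=n)
    cv = kfold_cv_error(X, y, ols_fit, ols_predict, sq_loss, K=5)   # sketched earlier
    _, (lo, hi) = ols_variance_and_ci(X, y, cv)                      # sketched earlier
    # approximate EPE_OUT,n by the test error on a large independent sample
    Xt = rng.normal(size=(4 * n, p))
    yt = Xt @ beta0 + rng.normal(size=4 * n)
    epe = np.mean((yt - Xt @ ols_fit(X, y)) ** 2)
    return lo <= epe <= hi

coverage = np.mean([one_replication() for _ in range(200)])          # should be near 0.95
```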
We estimate the expected out-of-sample error using the test error averaged across R Monte Carlo simulations:

EPÊ_{OUT,n} = (1/R) Σ_{r=1}^R (1/n_Test) Σ_{i=1}^{n_Test} ( ỹ_ri − ŝ_{n,r}(w̃_ri) )²

where ŝ_{n,r} is the estimate of the conditional mean of y given w computed using the training data in the r-th simulation. For OLS, ŝ_{n,r}(w̃_ri) = w̃_ri' ( (1/n) Σ_{j=1}^n w_rj w_rj' )^{-1} (1/n) Σ_{j=1}^n w_rj y_rj. For the kernel estimator, ŝ_{n,r}(x̃_ri) = Σ_{j=1}^n K_{h_n}(x̃_ri − x_rj) y_rj / Σ_{j=1}^n K_{h_n}(x̃_ri − x_rj).

Table 1 shows the empirical coverage frequencies for EPÊ_{OUT,n} of the nominal 95% equal-tailed two-sided confidence intervals L̂_n^CV ± 1.96 σ̂_n/√n, where σ̂_n is given in Theorem 3.2. Note that we do not need to compute the influence function for our nonparametric estimator because σ̂²_n = σ̂²_γ = (1/n) Σ_{i=1}^n ( γ̂(ξ_i) − (1/n) Σ_{k=1}^n γ̂(ξ_k) )², where γ̂(ξ_i) = (y_i − ŝ_n(x_i))². For the OLS estimators, we use φ̂_M(ξ_j, w_i; θ̂) = w_i' ( (1/n) Σ_{j=1}^n w_j w_j' )^{-1} w_j ( y_j − w_j'θ̂ ) as our estimated influence functions. We consider three different values for γ_0 while keeping β_0 at 0.5. p refers to the number of right hand side variables, so the p = 51 and p = 60 columns refer to the models with group effects.

| p | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 51 | 60 | kernel |
|---|---|---|---|---|---|---|---|---|---|----|----|----|--------|
| γ0i = 0.5 | 0.950 | 0.948 | 0.952 | 0.951 | 0.954 | 0.949 | 0.947 | 0.947 | 0.959 | 0.956 | 0.950 | 0.954 | 0.948 |
| γ0i = n^{-1/2} | 0.956 | 0.950 | 0.956 | 0.951 | 0.952 | 0.952 | 0.954 | 0.956 | 0.957 | 0.958 | 0.954 | 0.956 | 0.948 |
| γ0i = n^{-1} | 0.958 | 0.956 | 0.956 | 0.958 | 0.957 | 0.957 | 0.958 | 0.958 | 0.959 | 0.958 | 0.952 | 0.949 | 0.953 |

Table 1: Empirical Coverage Frequencies, Linear DGP, n = 2000, n_Test = 4000, R = 2000

The empirical coverage frequencies for all estimators are close to the nominal level, which supports the use of our asymptotic theory to construct asymptotically valid confidence intervals for the expected out-of-sample error. Note that the probability limits of our estimators ŝ_M, except for the one with p = 10, are all different from the true conditional mean function of the underlying data generating process. We do not require our estimators to be consistent for some true feature of the underlying DGP when deriving the asymptotic distribution of the K-fold cross validation error.

5.1.2 Nonlinear DGP

We now repeat the same exercise as in the previous subsection, but with a nonlinear DGP:

y_i = exp(β_0 x_i) / (1 + exp(β_0 x_i)) + z_i'γ_0 + ϵ_i, ϵ_i ~ N(0, σ²_ϵ), ϵ_i ⊥ (x_i, z_i)

Table 2 shows the empirical coverage frequencies for EPÊ_{OUT,n} of the nominal 95% equal-tailed two-sided confidence intervals L̂_n^CV ± 1.96 σ̂_n/√n, where σ̂_n is given in Theorem 3.2. We see again that the coverage is close to the nominal level, even though our estimators are not consistent for the true underlying conditional mean function.

| p | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 51 | 60 | kernel |
|---|---|---|---|---|---|---|---|---|---|----|----|----|--------|
| γ0i = 0.5 | 0.950 | 0.945 | 0.952 | 0.952 | 0.951 | 0.949 | 0.949 | 0.939 | 0.959 | 0.951 | 0.948 | 0.951 | 0.946 |
| γ0i = n^{-1/2} | 0.943 | 0.949 | 0.953 | 0.953 | 0.951 | 0.954 | 0.956 | 0.952 | 0.955 | 0.953 | 0.946 | 0.948 | 0.951 |
| γ0i = n^{-1} | 0.952 | 0.949 | 0.952 | 0.950 | 0.953 | 0.952 | 0.951 | 0.954 | 0.952 | 0.953 | 0.953 | 0.954 | 0.947 |

Table 2: Empirical Coverage Frequencies, Nonlinear DGP, n = 2000, n_Test = 4000, R = 2000

5.1.3 Gaussian Mixture Model

We generate data according to the following model: for x ~ N(1, 1),

y_1|x ~ N(xµ_1, σ_1), y_2|x ~ N(xµ_2, σ_2), d|x ~ Bernoulli(π), y = (1 − d) y_1 + d y_2.

For β_0 = (π, µ_1, µ_2, σ_1, σ_2), the density of y is

f(y|x; β_0) = (1 − π) (1/σ_1) φ((y − xµ_1)/σ_1) + π (1/σ_2) φ((y − xµ_2)/σ_2)

and the conditional mean function is F(x, β_0) ≡ E[y|x] = (1 − π) xµ_1 + π xµ_2.
Using the information matrix equality, the influence function is given by φM (ξj, xi; β0) = F (xi, β0) β E 2 log f (yj|xj; β0) 1 log f (yj|xj; β0) = F (xi, β0) β E log f (yj|xj; β0) β log f (yj|xj; β0) 1 log f (yj|xj; β0) β = x (µ2 µ1) (1 π) x πx 0 0 log f (y|x; β0) 1 f(y|x;β0) 1 σ2 φ y x µ2 σ1 φ y x µ1 1 f(y|x;β0) (1 π) x σ2 1 φ y x µ1 1 f(y|x;β0)π x σ2 2 φ y x µ2 1 f(y|x;β0) (1 π) n y x µ1 σ2 1 φ y x µ1 1 f(y|x;β0)π n y x µ2 σ2 2 φ y x µ2 1 f(y|x;β0) 1 σ2 φ y x µ2 σ1 φ y x µ1 1 f(y|x;β0) (1 π) x 1 f(y|x;β0)π x 1 f(y|x;β0) 1 π 2 1 φ y x µ1 1 f(y|x;β0) π σ2 2 2 1 φ y x µ2 We set the true values of the mean parameters to µ1 = 1, µ2 = 2, σ1 = σ2 = 1. We estimate 9 different Gaussian mixture regression models corresponding to π {0.1, 0.2, 0.3, ...., 0.9}. The K-fold cross validation error is yi F xi, ˆβ( k) 2 We estimate the expected out-of-sample error using the test error averaged across R Monte Carlo simulations: \ EPEOUT,n = 1 yri F xri, ˆβn,r 2 Asymptotics of K-Fold Cross Validation We construct nominal 95% equal-tailed two-sided confidence intervals h ˆLCV n 1.96 ˆσn n i , where ˆσn is given in Theorem 3.2, using two different ways of estimating the influence function. The first way uses the information matrix equality formulation of the influence function. ˆφM ξj, xi; ˆβ = F xi, ˆβ log f yj|xj; ˆβ log f yj|xj; ˆβ 1 log f yj|xj; ˆβ The second way uses the estimated Hessian given as an output of the fmincon solver. ˆφM ξj, xi; ˆβ = F xi, ˆβ β ˆH 1 log f yj|xj; ˆβ The empirical coverage frequencies for \ EPEOUT,n are in Table 3, and we can see that they are all close to 95%. π = 0.1 π = 0.2 π = 0.3 π = 0.4 π = 0.5 π = 0.6 π = 0.7 π = 0.8 π = 0.9 Info. Matrix Eq. 0.948 0.951 0.950 0.952 0.954 0.952 0.951 0.950 0.946 Est. Hessian 0.948 0.951 0.949 0.952 0.954 0.952 0.950 0.950 0.946 Table 3: Empirical Coverage Frequencies, Gaussian Mixture Model DGP, n = 2000, n Test = 4000, R = 5000 5.2 Estimator Selection Test We examine the rejection frequencies of our estimator selection test for testing the equivalence of the expected out-of-sample errors. Consider the following simple data generating process: Yi = α0Wi + β0Xi + ϵi, ϵi i.i.d. N(0, 1), ϵi Wi Xi i.i.d. N 1 1 , 1 0.1 0.1 1 We compare two linear models estimated by OLS, one of which is nested inside the other: M1 : Yi = α1Wi + ηi M2 : Yi = α2Wi + β2Xi + ϵi We examine the empirical frequencies of failing to reject the null that both models are equally good in terms of out-of-sample predictive accuracy, rejecting in favor of model 1, and rejecting in favor of model 2 under six different choices of β0 n 0, 1 n, 1 n, n 1/4, n 1/6, 1 o . Table 4 shows the empirical rejection frequencies for nominal 5% and 10% tests using n = 5000 observations, R = 5000 Monte Carlo simulations, and B = 5000 subsampling replications with a subsample size of b = n. In all cases we use 5-fold cross validation to form the test statistic Tn nˆδ ˆLCV n (M2) ˆLCV n (M1) , and estimate ˆδ using I = 10 different values of the subsample size n(0.5:0.05:0.95). We also report the ˆδ estimates averaged over the R = 5000 Monte Carlo simulations. The average ˆδ is fairly close to 1 for β0 n 0, 1 o and close to 0.5 for β0 n 1/4, n 1/6, 1 . This suggests that first order degeneracy is a problem for values of β0 drifting very quickly to zero, but not an issue for the slower drifting values. For β0 n 0, 1 o , the two models are sufficiently similar such that our test fails to reject the null of equal expected out-of-sample error for a majority of the simulations. 
For β_0 ∈ {n^{-1/4}, n^{-1/6}, 1}, the two models are different enough that the probability of rejecting the null in favor of model M2 is very close to 1.

| | β0 = 0 | β0 = n^{-1} | β0 = n^{-1/2} | β0 = n^{-1/4} | β0 = n^{-1/6} | β0 = 1 |
|---|---|---|---|---|---|---|
| nominal 5%: Fail to Reject | 0.9852 | 0.9852 | 0.9922 | 0.0004 | 0 | 0 |
| nominal 5%: Favor M1 | 0.0148 | 0.0148 | 0.0078 | 0 | 0 | 0 |
| nominal 5%: Favor M2 | 0 | 0 | 0 | 0.9996 | 1 | 1 |
| nominal 10%: Fail to Reject | 0.9626 | 0.9624 | 0.9602 | 0 | 0 | 0 |
| nominal 10%: Favor M1 | 0.0344 | 0.0344 | 0.0214 | 0 | 0 | 0 |
| nominal 10%: Favor M2 | 0.0030 | 0.0032 | 0.0184 | 1 | 1 | 1 |
| Average δ̂ | 0.972 | 0.972 | 0.925 | 0.527 | 0.513 | 0.506 |

Table 4: Empirical Rejection/Non-rejection Frequencies, n = 5000, B = 5000, R = 5000

6. Empirical Application

The data come from GoDaddy, a domain name registrar responsible for managing sales of internet domain names. Each observation is a particular domain name listed in a GoDaddy expiry auction between May 12th, 2017 and July 11th, 2017. The domains are auctioned off in an open-bid English auction with a minimum bid of $12 and a duration of approximately 10 days. If the domain is still not sold after the 10 days are over, there is a 5 day closeout Dutch auction. One interesting fact about these auctions is that the majority of participants are speculators who have no intrinsic use for the domain other than turning a profit when they resell the name in an aftermarket. Another interesting fact is that very few of the English auctions result in a sale, partly due to the sheer volume of domains that are listed for sale. For example, of the 2,178,187 auctions with a start time on or after May 12th, 2017 and before July 11th, 2017, only 28,448 auctions met the minimum bid requirement.

Starting on May 12th, 2017, GoDaddy implemented a simple randomized experiment where some domain names would receive a valuation metric provided by a machine learning algorithm using deep learning. The idea was to provide auction participants with a better sense of the value of a domain name. Our goal is to compare three different estimators for predicting the sale price for those 28,448 auctions which met the minimum bid requirement, using the following nine features: a dummy for whether the domain has a valuation assigned by the deep learning algorithm, the domain valuation assigned by the deep learning algorithm (coded as 0 for those domains without a valuation), the number of characters in the domain name, three dummies for whether the domain is a .com, .org, or .net, and three dummies for whether the domain contains a word in the English dictionary, a number, or a vowel.

The first estimator is OLS using all independent variables. The second estimator is an honest Random Forest estimator using all independent variables and computed using the grf R package's regression_forest command with the default options. The third estimator is a single-hidden layer Neural Network with a sigmoidal activation function and 5 hidden units, using all independent variables and computed using the nn.train command in the deepnet R package. We perform three pairwise nominal (5/3)%-level estimator selection tests based on 5-fold cross validation with squared error loss, using B = 1000 subsampling iterations. We divide by 3 because we are using the Bonferroni correction to control the Family Wise Error Rate at 5%; this is just one of many ways that we could have corrected for the multiple testing issue. The 5-fold cross validation errors are 68323.73 for Random Forests, 94291.50 for Neural Networks, and 94444.18 for OLS. For all three tests, we reject the null of equal expected out-of-sample error for the two estimators.
When we compare Random Forests to Neural Networks, we find Random Forests has greater predictive accuracy than Neural Networks. When we compare Random Forests to OLS, we find Random Forests has greater predictive accuracy than OLS. When we compare Neural Networks to OLS, we find Neural Networks has greater predictive accuracy than OLS. The results are the same across a range of different values of the subsample size b = nκ , where κ {0.4, 0.5, 0.6, 0.7, 0.8}. 7. Conclusion We have demonstrated asymptotic normality of the K-fold cross validation error as the number of observations n goes to infinity keeping the number of folds K fixed. The rate of convergence of the K-fold cross validation error to the expected out-of-sample error is n for both parametric and nonparametric estimators, and the asymptotic variance does not depend on K. We have constructed an analytic estimate of the asymptotic variance, which can be used to construct confidence intervals for the expected out-of-sample error. We have also developed a hypothesis test for comparing two estimators expected out-of-sample errors by looking at the difference in their K-fold cross validation errors. In the absence of first-order degeneracy, the test statistic is asymptotically normal under the null and can be benchmarked against the standard normal critical values. In the presence of first-order degeneracy, the test statistic is not asymptotically normal but its distribution under the null is consistently estimable using subsampling. Appendix A. Appendix A.1 Proof of Theorem 3.1 Denote τnt as the analog of τn using nt = n(K 1)/K observations, and I(t) k as the indices of the observations in the kth training fold. Taking second order Taylor expansions of γ(ˆs( k) M ; ξi) and γ(ˆs M; ξi) around γ(s M; ξi) and γ(s M; ξi), n( ˆLCV n (M) EPEOUT,n(M)) γ(ˆs( k) M ; ξi) E[γ(s M; ξi)] E (ˆs M( xi) s M( xi)) γ(s M; ξi) s M E (ˆs M( xi) s M( xi))2 1 2 2γ(s M; ξi) s M 2 n γ(s M; ξi) E[γ(s M; ξi)] o ˆs( k) M (xi) s M(xi) γ(s M; ξi) s M E (ˆs M( xi) s M( xi)) γ(s M; ξi) s M ˆs( k) M (xi) s M(xi) 2 2γ(s M; ξi) s M 2 E (ˆs M( xi) s M( xi))2 2γ(s M; ξi) s M 2 Using Assumption 3.3 and the fact that xi is drawn from the same distribution as xi, (ˆs( k) M (xi) s M(xi)) γ(s M; ξi) s M E (ˆs M,nt( xi) s M( xi)) γ(s M; ξi) s M φM(ξj, xi) γ(s M; ξi) s M E φM(ξj, xi) γ(s M; ξi) s M γ(s M; ξi) s M E γ(s M; ξi) s M φM(ξj, xi) γ(s M; ξi) s M E φM(ξj, xi) γ(s M; ξi) s M ψ(ξj,ξi) E[ψ(ξj,ξi)] {ψ(ξj, ξi) E [ψ(ξj, ξi)]} Asymptotics of K-Fold Cross Validation Furthermore, ˆs( k) M (xi) s M(xi) 2 2γ(s M; ξi) s M 2 E (ˆs M,nt( xi) s M( xi))2 2γ(s M; ξi) s M 2 φM(ξj1, xi)φM(ξj2, xi) 2γ(s M; ξi) s M 2 nt E φM(ξj, xi)2 2γ(s M; ξi) s M 2 nt E φM(ξj1, xi)φM(ξj2, xi) 2γ(s M; ξi) s M 2 φM(ξj, xi) 2γ(s M; ξi) s M 2 E φM(ξj, xi) 2γ(s M; ξi) s M 2 2γ(s M; ξi) s M 2 E 2γ(s M; ξi) s M 2 nt 1 nv 1 nt φM(ξj, xi)2 2γ(s M; ξi) s M 2 E φM(ξj, xi)2 2γ(s M; ξi) s M 2 2 nvnt(nt 1) 1 2φM(ξj1, xi)φM(ξj2, xi) 2γ(s M; ξi) s M 2 | {z } ν(ξj1 ,ξj2 ,ξi) E [ν(ξj1, ξj2, ξi)] φM(ξj, xi) 2γ(s M; ξi) s M 2 E φM(ξj, xi) 2γ(s M; ξi) s M 2 2γ(s M; ξi) s M 2 E 2γ(s M; ξi) s M 2 For each k = 1...K, U1k is a two-sample U-statistic of degree (1,1) with kernel ψ(ξj, ξi) = φM(ξj, xi) γ(s M;ξi) s M , U2k is a two-sample U-statistic of degree (2,1) with kernel ν(ξj1, ξj2, ξi) = 1 2φM(ξj1, xi)φM(ξj2, xi) 2γ(s M;ξi) s M 2 , R1k is a two-sample U-statistic of degree (1,1) with kernel κ(ξj, ξi) = φM(ξj, xi)2 2γ(s M;ξi) s M 2 , and R2k is a two-sample U-statistic of degree (1,1) with kernel η(ξj, ξi) = φM(ξj, xi) 
2γ(s M;ξi) s M 2 . Results for two-sample U-statistics in e.g. (Van der Vaart, 2000) show that R1k and R2k are Op 1 nv+nt for all k, which imply they do not contribute to the asymptotic distribution of ˆLCV n (M). For n-consistent estimators, E [φM(ξj, xi)| xi] = 0 implies E [ψ(ξj, ξi)| ξi] = 0, E [ν(ξj1, ξj2, ξi)| ξi] = 0, E [ν(ξj1, ξh, ξi)| ξh] = 0, and E [ν(ξh, ξj2, ξi)| ξh] = 0 because E φM(ξj, xi) γ(s M; ξi) s M = E [φM(ξj, xi)| xi] γ(s M; ξi) s M = 0 E φM(ξj1, xi)φM(ξj2, xi) 2γ(s M; ξi) s M 2 = E [φM(ξj1, xi)| xi] E [φM(ξj2, xi)| xi] 2γ(s M; ξi) s M 2 = 0 E φM(ξj1, xi)φM(ξj2, xi) 2γ(s M; ξi) s M 2 = E φM(ξj1, xi)E [φM(ξj2, xi)| xi] 2γ(s M; ξi) s M 2 E φM(ξj1, xi)φM(ξj2, xi) 2γ(s M; ξi) s M 2 = E E [φM(ξj1, xi)| xi] φM(ξj2, xi) 2γ(s M; ξi) s M 2 This implies that U2k is a degenerate two-sample U-statistic of order Op 1 nvnt , and only U1k contributes to the asymptotic distribution. Additionally, E [ψ(ξj, ξi)| ξi] = 0 implies that one of the projection terms disappears in the Hoeffding decomposition for U1k (see e.g. (Lehmann, 1951), (Van der Vaart, 2000), or (Neumeyer, 2004)). nv U1k = nv nt E [ψ(ξj, ξi) E [ψ(ξj, ξi)]| ξj] E [ψ(ξj, ξi) E [ψ(ξj, ξi)]| ξi] | {z } =0 + nv 1k where 1k is a degenerate two-sample U-statistic of order Op 1 nvnt . Because each observation ξj appears in K 1 training sets, E [ψ(ξj, ξi) E [ψ(ξj, ξi)]| ξj] = (K 1) j=1 E [ψ(ξj, ξi) E [ψ(ξj, ξi)]| ξj] Since nt = (K 1)nv and nt = n(K 1)/K, nv U1k = 1 n j=1 E [ψ(ξj, ξi) E [ψ(ξj, ξi)]| ξj] + 1 n ˆLCV n (M) EPEOUT,n(M) = 1 {γ(s M; ξi) E[γ(s M; ξi)]} nv U1k + op(1) By the Lindeberg-Levy Central Limit Theorem, n ˆLCV n (M) EPEOUT,n(M) d N 0, σ2 γ + σ2 01,ψ + 2λ Asymptotics of K-Fold Cross Validation where σ2 γ V ar [γ(s M; ξi)], σ2 01,ψ V ar [E [ψ(ξj, ξi)| ξj]], and λ Cov [γ(s M; ξj), E [ψ(ξj, ξi)|ξj]]. For nonparametric estimators, note that our assumptions of V ar [E [ψ (ξj, ξi)| ξi]] 0 and V ar [E [ψ (ξj, ξi)| ξj]] 0 imply U1k is a degenerate two-sample U-statistic, so only the first term contributes to the asymptotic distribution: n ˆLCV n (M) EPEOUT,n(M) = 1 {γ(s M; ξi) E[γ(s M; ξi)]} + op(1) The asymptotic variance will be σ2 = σ2 γ. A.2 Proof of Theorem 3.2 Define ˆγM (ξi) = γ (ˆs M; ξi) and γ0M (ξi) = γ (s M; ξi). We assumed 1 n Pn i=1 |ˆγM (ξi) γ0M (ξi)|2 = op(1), and by the law of large numbers, 1 n Pn i=1 γ0M (ξi) p E [γ0M (ξi)]. Therefore, i=1 ˆγM (ξi) E [γ0M (ξi)] i=1 |ˆγM (ξi) γ0M (ξi)| + i=1 γ0M (ξi) E [γ0M (ξi)] i=1 ˆγ2 M (ξi) E γ2 0M (ξi) i=1 |ˆγM (ξi) γ0M (ξi)|2 + 2 1 i=1 |γ0M (ξi)| |ˆγM (ξi) γ0M (ξi)| i=1 γ2 0M (ξi) E γ2 0M (ξi) i=1 |ˆγM (ξi) γ0M (ξi)|2 + 2 i=1 |γ0M (ξi)|2 i=1 |ˆγM (ξi) γ0M (ξi)|2 + op(1) p 0 Therefore, ˆσ2 γ = 1 n Pn i=1 ˆγ2 M (ξi) 1 n Pn i=1 ˆγM (ξi) 2 p V ar [γ0M(ξi)]. We have shown that we can consistently estimate the asymptotic variance of the K-fold cross validation error for nonparametric estimators. Now we consider the case of n-consistent estimators. Define h0M (ξj) = E [ψ(ξj, ξi; s M)| ξj], h M (ξj) = 1 n 1 P i =j ψ (ξj, ξi; s M), and ˆh M (ξj) = 1 n 1 P i =j ˆψ (ξj, ξi; ˆs M). Notice that 1 n Pn j=1 h M (ξj) = 1 n 1 1 n Pn j=1 P i =j ψ (ξj, ξi; s M) is a U-statistic of order 2. Notice that 1 n Pn j=1 h0M (ξj) is one of the projection terms in the Hoeffding decomposition for Ustatistics, and the other projection term is zero since E [ψ (ξj, ξi; s M)| ξi] = 0. 
Since the remainder term in the Hoeffding decomposition must converge in probability to zero, j=1 h M (ξj) 1 j=1 h0M (ξj) i =j ψ (ξj, ξi; s M) 1 j=1 E [ψ (ξj, ξi; s M)| ξj] Similarly, notice that 1 n Pn j=1 h M (ξj)2 = 1 (n 1)2 1 n Pn j=1 P i1 =j P i2 =j ψ (ξj, ξi1; s M) ψ (ξj, ξi2; s M) is asymptotically equivalent to a U-statistic of order 3. Using the Hoeffding decomposition and E [ψ (ξj, ξi1; s M) ψ (ξj, ξi2; s M)| ξj] = E [ψ (ξj, ξi; s M)| ξj]2, j=1 h M (ξj)2 1 j=1 h0M (ξj)2 1 n (n 1) (n 2) i2 =i1 ψ (ξj, ξi1; s M) ψ (ξj, ξi2; s M) 1 j=1 E [ψ (ξj, ξi; s M)| ξj]2 Since we assumed 1 n Pn j=1 ˆh M (ξj) h M (ξj) 2 p 0, it follows that j=1 ˆh M (ξj)2 1 j=1 h0M (ξj)2 ˆh M (ξj) h M (ξj) 2 + 2 1 h M (ξj) ˆh M (ξj) h M (ξj) j=1 h M (ξj)2 1 j=1 h0M (ξj)2 ˆh M (ξj) h M (ξj) 2 + 2 h M (ξj) 2 v u u t 1 ˆh M (ξj) h M (ξj) 2 + op(1) j=1 ˆh M (ξj) 1 j=1 h0M (ξj) j=1 h M (ξj) 1 j=1 h0M (ξj) ˆh M (ξj) h M (ξj) which implies that j=1 ˆh M (ξj) j=1 h0M (ξj) j=1 ˆh M (ξj) + 1 j=1 h0M (ξj) j=1 ˆh M (ξj) 1 j=1 h0M (ξj) = Op(1)op(1) = op(1) Asymptotics of K-Fold Cross Validation and therefore, j=1 ˆh M (ξj)2 j=1 ˆh M (ξj) j=1 h0M (ξj)2 j=1 h0M (ξj) Since 1 n Pn j=1 h0M (ξj)2 1 n Pn j=1 h0M (ξj) 2 p V ar (h0M (ξj)), it follows that ˆσ2 01,ψ = 1 n Pn j=1 ˆh M (ξj)2 1 n Pn j=1 ˆh M (ξj) 2 p V ar (h0M (ξj)). Note that 1 n Pn j=1 γ0M (ξj) h M (ξj) is a U-statistic of order 2. By the Hoeffding decomposition, j=1 γ0M (ξj) h M (ξj) 1 j=1 γ0M (ξj) h0M (ξj) i =j γ (s M; ξj) ψ (ξj, ξi; s M) 1 j=1 E [γ (s M; ξj) ψ (ξj, ξi; s M)| ξj] Our assumptions and results so far imply 1 n i=1 ˆγM (ξi) ˆh M (ξi) 1 i=1 γ0M (ξi) h0M (ξi) i=1 (ˆγM (ξi) γ0M (ξi)) ˆh M (ξi) ˆh M (ξi) h M (ξi) γ0M (ξi) i=1 γ0M (ξi) h M (ξi) 1 i=1 γ0M (ξi) h0M (ξi) i=1 |ˆγM (ξi) γ0M (ξi)| ˆh M (ξi) + 1 ˆh M (ξi) h M (ξi) |γ0M (ξi)| + op(1) i=1 |ˆγM (ξi) γ0M (ξi)|2 ˆh M (ξi) 2 ˆh M (ξi) h M (ξi) 2 v u u t 1 i=1 |γ0M (ξi)|2 + op(1) = op (1) Op (1) + op (1) Op (1) = op (1) n Pn i=1 (γ0M(ξi)h0M(ξi) E [γ0M(ξi)h0M(ξi)]) p 0, ˆγM(ξi)ˆh M(ξi) E [γ0M(ξi)h0M(ξi)] p 0 Since also 1 n Pn i=1 (ˆγM(ξi) E [γ0M(ξi)]) p 0 and 1 n Pn i=1 ˆh M(ξi) E [h0M(ξi)] p 0, ˆλ p λ Cov [γ0M(ξi), h0M(ξi)] A.3 Proof of Theorem 4.1 The arguments are similar to the Proof of Theorem 3.1, except that we now have to account for the possibly different rates of convergence of the two estimators. Taking Taylor expansions of γ(ˆs( k) M ; ξi) and γ(ˆs M; ξi) around γ(s M; ξi) and γ(s M; ξi) for M {M1, M2}, n ˆLCV n (M2) ˆLCV n (M1) E[γ(ˆs M2; ξi)] E[γ(ˆs M1; ξi)] n γ(s M2; ξi) γ(s M1; ξi) E[γ(s M2; ξi)] E[γ(s M1; ξi)] o ( ˆs( k) M2 (xi) s M2(xi) γ(s M2; ξi) s M2 ˆs( k) M1 (xi) s M1(xi) γ(s M1; ξi) s M1 (ˆs M2( xi) s M2( xi)) γ(s M2; ξi) s M2 (ˆs M1( xi) s M1( xi)) γ(s M1; ξi) s M1 (ˆs( k) M2 (xi) s M2(xi))2 2γ(s M2; ξi) s M2 2 E (ˆs M2( xi) s M2( xi))2 2γ(s M2; ξi) s M2 2 (ˆs( k) M1 (xi) s M1(xi))2 2γ(s M1; ξi) s M1 2 E (ˆs M1( xi) s M1( xi))2 2γ(s M1; ξi) s M1 2 {γ(s M2; ξi) γ(s M1; ξi) (E[γ(s M2; ξi)] E[γ(s M1; ξi)])} | {z } γ(ξi) E[ γ(ξi)] φM2(ξj, xi) γ(s M2; ξi) s M2 | {z } ψM2 (ξj,ξi) E [ψM2(ξj, ξi)] φM1(ξj, xi) γ(s M1; ξi) s M1 | {z } ψM1 (ξj,ξi) E [ψM1(ξj, ξi)] { γ(ξi) E[ γ(ξi)]} + nv U1,M2,k 1 Note that U1k U1,M2,k U1,M1,k is a two-sample U-statistic with kernel function ψ(ξj, ξi) ψM2(ξj, ξi) ψM1(ξj, ξi). For n-consistent estimators τn,1 = τn,2 = n, E [φM(ξj, xi)| xi] = 0 for M {M1, M2} implies E h ψ(ξj, ξi) ξi i = 0. 
Using similar Hoeffding decomposition arguments as in the proof of Theorem 3.1,

$$\sqrt{n}\Big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\Big)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big\{\Delta\gamma(\xi_i)-E[\Delta\gamma(\xi_i)]\big\}+\frac{1}{\sqrt{n}}\sum_{j=1}^{n}E\big[\Delta\psi(\xi_j,\xi_i)-E[\Delta\psi(\xi_j,\xi_i)]\,\big|\,\xi_j\big]+o_p(1)$$

The asymptotic variance is $\sigma^2_\Delta=\sigma^2_{\Delta\gamma}+\sigma^2_{01,\Delta\psi}+2\lambda_\Delta$.

If $\tau_{n,1}\ll\tau_{n,2}=\sqrt{n}$, then since $\mathrm{Var}\big[E[\psi_{M_1}(\xi_j,\xi_i)\mid\xi_i]\big]\to0$ and $\mathrm{Var}\big[E[\psi_{M_1}(\xi_j,\xi_i)\mid\xi_j]\big]\to0$, $U_{1,M_1,k}$ is a degenerate two-sample U-statistic, so

$$\sqrt{n}\Big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\Big)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big\{\Delta\gamma(\xi_i)-E[\Delta\gamma(\xi_i)]\big\}+\frac{1}{\sqrt{n}}\sum_{j=1}^{n}E\big[\psi_{M_2}(\xi_j,\xi_i)-E[\psi_{M_2}(\xi_j,\xi_i)]\,\big|\,\xi_j\big]+o_p(1)$$

The asymptotic variance is $\sigma^2_\Delta=\sigma^2_{\Delta\gamma}+\sigma^2_{01,\psi_{M_2}}+2\lambda_{M_2}$.

If $\tau_{n,2}\ll\tau_{n,1}=\sqrt{n}$, then since $\mathrm{Var}\big[E[\psi_{M_2}(\xi_j,\xi_i)\mid\xi_i]\big]\to0$ and $\mathrm{Var}\big[E[\psi_{M_2}(\xi_j,\xi_i)\mid\xi_j]\big]\to0$, $U_{1,M_2,k}$ is a degenerate two-sample U-statistic, so

$$\sqrt{n}\Big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\Big)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big\{\Delta\gamma(\xi_i)-E[\Delta\gamma(\xi_i)]\big\}-\frac{1}{\sqrt{n}}\sum_{j=1}^{n}E\big[\psi_{M_1}(\xi_j,\xi_i)-E[\psi_{M_1}(\xi_j,\xi_i)]\,\big|\,\xi_j\big]+o_p(1)$$

The asymptotic variance is $\sigma^2_\Delta=\sigma^2_{\Delta\gamma}+\sigma^2_{01,\psi_{M_1}}+2\lambda_{M_1}$.

If $\tau_{n,1},\tau_{n,2}\ll\sqrt{n}$, then $\mathrm{Var}\big[E[\psi_{M_1}(\xi_j,\xi_i)\mid\xi_i]\big]\to0$, $\mathrm{Var}\big[E[\psi_{M_1}(\xi_j,\xi_i)\mid\xi_j]\big]\to0$, $\mathrm{Var}\big[E[\psi_{M_2}(\xi_j,\xi_i)\mid\xi_i]\big]\to0$, and $\mathrm{Var}\big[E[\psi_{M_2}(\xi_j,\xi_i)\mid\xi_j]\big]\to0$ imply that $U_{1,M_1,k}$ and $U_{1,M_2,k}$ are degenerate two-sample U-statistics. Then only the residual variance term contributes to the asymptotic distribution:

$$\sqrt{n}\Big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\Big)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big\{\Delta\gamma(\xi_i)-E[\Delta\gamma(\xi_i)]\big\}+o_p(1)$$

The asymptotic variance will be $\sigma^2_\Delta=\sigma^2_{\Delta\gamma}$.

Under $H_0$ and alternatives of the form $\sqrt{n}\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\to0$, $\sqrt{n}\big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)\big)$ has the same asymptotic distribution as $\sqrt{n}\big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1))\big)$. Since $\hat\sigma^2_{\Delta,n}\xrightarrow{p}\sigma^2_\Delta$,

$$\frac{\sqrt{n}\big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)\big)}{\hat\sigma_{\Delta,n}}\xrightarrow{d}N(0,1)$$

Under alternatives of the form $\sqrt{n}\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\to c$, we have

$$\frac{\sqrt{n}\big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)\big)}{\hat\sigma_{\Delta,n}}=\frac{\sqrt{n}\big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1))\big)}{\hat\sigma_{\Delta,n}}+\underbrace{\frac{\sqrt{n}\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)}{\hat\sigma_{\Delta,n}}}_{\to\,c/\sigma_\Delta}\xrightarrow{d}N(c/\sigma_\Delta,1)$$

Under alternatives of the form $\sqrt{n}\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\to\infty$, the second term on the right-hand side diverges to $+\infty$, so the studentized statistic diverges to $+\infty$; under alternatives of the form $\sqrt{n}\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\to-\infty$, it diverges to $-\infty$.

Up to this point we have examined the case when the rate of convergence of $\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)$ is $\sqrt{n}$, but it can happen that the rate is $n$ when $\sigma^2_\Delta=0$, which can occur when $\mathrm{Var}[\Delta\gamma(\xi)]=0$ and $\Delta U_{1k}$ is a degenerate two-sample U-statistic of degree (1,1). There are results in the literature that characterize the asymptotic distribution of degenerate two-sample U-statistics of degree (1,1) (see e.g. (Neuhaus, 1977), (Eagleson, 1979), (Dewan & Rao, 2001)). They show that the rate of convergence is the square root of the product of the two sample sizes. In our case, this would mean that $\Delta U_{1k}=O_P\!\left(\frac{1}{\sqrt{n_v n_t}}\right)$, so that

$$n\Big(\hat L^{CV}_n(M_2)-\hat L^{CV}_n(M_1)-\big(EPE_{OUT,n}(M_2)-EPE_{OUT,n}(M_1)\big)\Big)=\frac{n}{K}\sum_{k=1}^{K}\Delta U_{1k}+o_p(1)=O_P(1)$$

Although there are results in the literature that characterize the asymptotic distribution of a single degenerate two-sample U-statistic of degree (1,1) (see e.g. (Neuhaus, 1977), (Eagleson, 1979), (Dewan & Rao, 2001)), deriving the asymptotic distribution of a sum of K degenerate two-sample U-statistics which are not independent of each other is still fairly complicated. We leave the details for future research and instead use subsampling to estimate the distribution in practice.
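The subsampling step referred to above can be implemented generically. The following Python sketch follows the standard subsampling recipe of Politis, Romano, and Wolf (1999) rather than reproducing the paper's exact procedure; it assumes squared-error loss, the $\sqrt{n}$ scaling of the non-degenerate case, user-supplied fitting routines fit_predict_1 and fit_predict_2 (hypothetical callables that fit an estimator on a training sample and return predictions on a validation sample), NumPy arrays X and y, and arbitrary defaults for the subsample size b and the number of subsamples B.

import numpy as np

def cv_error_diff(X, y, fit_predict_1, fit_predict_2, K=5, rng=None):
    # K-fold CV error difference L_cv(M2) - L_cv(M1) under squared-error loss.
    rng = np.random.default_rng(rng)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    diffs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        pred1 = fit_predict_1(X[train], y[train], X[val])
        pred2 = fit_predict_2(X[train], y[train], X[val])
        diffs.append(np.mean((y[val] - pred2) ** 2) - np.mean((y[val] - pred1) ** 2))
    return float(np.mean(diffs))

def subsampling_test(X, y, fit_predict_1, fit_predict_2, b, B=200, alpha=0.05, K=5, rng=None):
    # Two-sided test of equal expected out-of-sample error with subsampled critical values.
    rng = np.random.default_rng(rng)
    n = len(y)
    d_n = cv_error_diff(X, y, fit_predict_1, fit_predict_2, K, rng)
    stats = []
    for _ in range(B):
        idx = rng.choice(n, size=b, replace=False)     # subsample without replacement
        d_b = cv_error_diff(X[idx], y[idx], fit_predict_1, fit_predict_2, K, rng)
        stats.append(np.sqrt(b) * abs(d_b - d_n))      # recentered, rescaled statistic
    critical_value = np.quantile(stats, 1.0 - alpha)
    stat = np.sqrt(n) * abs(d_n)
    return {"stat": stat, "critical_value": critical_value, "reject": stat > critical_value}

In the degenerate case discussed above, where the difference converges at the faster rate, the $\sqrt{b}$ and $\sqrt{n}$ scaling factors would have to be adjusted to match that rate.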
A.4 Proof of Theorem 4.2

First, note that $\frac{1}{n}\sum_{i=1}^{n}|\hat\gamma_M(\xi_i)-\gamma_{0M}(\xi_i)|^2\xrightarrow{p}0$ for $M\in\{M_1,M_2\}$ implies $\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|^2\xrightarrow{p}0$ and therefore $\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|\xrightarrow{p}0$, where $\Delta\hat\gamma(\xi_i)\equiv\hat\gamma_{M_2}(\xi_i)-\hat\gamma_{M_1}(\xi_i)$ and $\Delta\gamma_0(\xi_i)\equiv\gamma_{0M_2}(\xi_i)-\gamma_{0M_1}(\xi_i)$. Additionally, by the law of large numbers, $\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)\xrightarrow{p}E[\Delta\gamma_0(\xi_i)]$. Therefore,

$$\left|\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma(\xi_i)-E[\Delta\gamma_0(\xi_i)]\right|\le\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|+\left|\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)-E[\Delta\gamma_0(\xi_i)]\right|\xrightarrow{p}0$$

and

$$\begin{aligned}
\left|\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma^2(\xi_i)-E[\Delta\gamma^2_0(\xi_i)]\right|
&\le\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|^2+\frac{2}{n}\sum_{i=1}^{n}|\Delta\gamma_0(\xi_i)|\,\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|+\left|\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma^2_0(\xi_i)-E[\Delta\gamma^2_0(\xi_i)]\right|\\
&\le\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|^2+2\sqrt{\frac{1}{n}\sum_{i=1}^{n}|\Delta\gamma_0(\xi_i)|^2}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|^2}+o_p(1)\xrightarrow{p}0
\end{aligned}$$

Therefore, $\hat\sigma^2_{\Delta\gamma}=\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma^2(\xi_i)-\left(\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma(\xi_i)\right)^2\xrightarrow{p}\mathrm{Var}[\Delta\gamma_0(\xi_i)]$.

Define $\Delta\hat h(\xi_j)\equiv\hat h_{M_2}(\xi_j)-\hat h_{M_1}(\xi_j)$ and $\Delta\bar h(\xi_j)\equiv\bar h_{M_2}(\xi_j)-\bar h_{M_1}(\xi_j)$. Our assumption that $\frac{1}{n}\sum_{j=1}^{n}\big|\hat h_M(\xi_j)-\bar h_M(\xi_j)\big|^2\xrightarrow{p}0$ for $M\in\{M_1,M_2\}$ implies

$$\frac{1}{n}\sum_{j=1}^{n}\big|\Delta\hat h(\xi_j)-\Delta\bar h(\xi_j)\big|^2=\frac{1}{n}\sum_{j=1}^{n}\Big|\big(\hat h_{M_2}(\xi_j)-\bar h_{M_2}(\xi_j)\big)-\big(\hat h_{M_1}(\xi_j)-\bar h_{M_1}(\xi_j)\big)\Big|^2\le\frac{2}{n}\sum_{j=1}^{n}\big|\hat h_{M_2}(\xi_j)-\bar h_{M_2}(\xi_j)\big|^2+\frac{2}{n}\sum_{j=1}^{n}\big|\hat h_{M_1}(\xi_j)-\bar h_{M_1}(\xi_j)\big|^2\xrightarrow{p}0$$

Moreover, $\frac{1}{n}\sum_{j=1}^{n}\Delta\bar h(\xi_j)^2=\frac{1}{(n-1)^2}\frac{1}{n}\sum_{j=1}^{n}\sum_{i_1\ne j}\sum_{i_2\ne j}\Delta\psi(\xi_j,\xi_{i_1})\,\Delta\psi(\xi_j,\xi_{i_2})$ is asymptotically equivalent to a U-statistic of order 3. Using the Hoeffding decomposition and $E[\Delta\psi(\xi_j,\xi_{i_1})\,\Delta\psi(\xi_j,\xi_{i_2})\mid\xi_j]=E[\Delta\psi(\xi_j,\xi_i)\mid\xi_j]^2=\Delta h_0(\xi_j)^2$ for $i_1\ne i_2$, where $\Delta h_0(\xi_j)\equiv h_{0M_2}(\xi_j)-h_{0M_1}(\xi_j)$,

$$\frac{1}{n}\sum_{j=1}^{n}\Delta\bar h(\xi_j)^2-\frac{1}{n}\sum_{j=1}^{n}\Delta h_0(\xi_j)^2=\frac{1}{n(n-1)(n-2)}\sum_{j=1}^{n}\sum_{i_1\ne j}\sum_{\substack{i_2\ne j\\ i_2\ne i_1}}\Delta\psi(\xi_j,\xi_{i_1})\,\Delta\psi(\xi_j,\xi_{i_2})-\frac{1}{n}\sum_{j=1}^{n}E[\Delta\psi(\xi_j,\xi_i)\mid\xi_j]^2+o_p(1)\xrightarrow{p}0$$

It follows that

$$\begin{aligned}
\left|\frac{1}{n}\sum_{j=1}^{n}\Delta\hat h(\xi_j)^2-\frac{1}{n}\sum_{j=1}^{n}\Delta h_0(\xi_j)^2\right|
&\le\frac{1}{n}\sum_{j=1}^{n}\big|\Delta\hat h(\xi_j)-\Delta\bar h(\xi_j)\big|^2+\frac{2}{n}\sum_{j=1}^{n}\big|\Delta\bar h(\xi_j)\big|\,\big|\Delta\hat h(\xi_j)-\Delta\bar h(\xi_j)\big|+\left|\frac{1}{n}\sum_{j=1}^{n}\Delta\bar h(\xi_j)^2-\frac{1}{n}\sum_{j=1}^{n}\Delta h_0(\xi_j)^2\right|\\
&\le\frac{1}{n}\sum_{j=1}^{n}\big|\Delta\hat h(\xi_j)-\Delta\bar h(\xi_j)\big|^2+2\sqrt{\frac{1}{n}\sum_{j=1}^{n}\Delta\bar h(\xi_j)^2}\sqrt{\frac{1}{n}\sum_{j=1}^{n}\big|\Delta\hat h(\xi_j)-\Delta\bar h(\xi_j)\big|^2}+o_p(1)\xrightarrow{p}0
\end{aligned}$$

Since we showed in the proof of Theorem 3.2 that $\frac{1}{n}\sum_{j=1}^{n}\hat h_M(\xi_j)-\frac{1}{n}\sum_{j=1}^{n}h_{0M}(\xi_j)=o_p(1)$ for $M\in\{M_1,M_2\}$,

$$\frac{1}{n}\sum_{j=1}^{n}\Delta\hat h(\xi_j)-\frac{1}{n}\sum_{j=1}^{n}\Delta h_0(\xi_j)=\left(\frac{1}{n}\sum_{j=1}^{n}\hat h_{M_2}(\xi_j)-\frac{1}{n}\sum_{j=1}^{n}h_{0M_2}(\xi_j)\right)-\left(\frac{1}{n}\sum_{j=1}^{n}\hat h_{M_1}(\xi_j)-\frac{1}{n}\sum_{j=1}^{n}h_{0M_1}(\xi_j)\right)=o_p(1)$$

Since $\frac{1}{n}\sum_{j=1}^{n}\Delta h_0(\xi_j)^2-\left(\frac{1}{n}\sum_{j=1}^{n}\Delta h_0(\xi_j)\right)^2\xrightarrow{p}\mathrm{Var}\big(\Delta h_0(\xi_j)\big)$, it follows that $\hat\sigma^2_{01,\Delta\psi}=\frac{1}{n}\sum_{j=1}^{n}\Delta\hat h(\xi_j)^2-\left(\frac{1}{n}\sum_{j=1}^{n}\Delta\hat h(\xi_j)\right)^2\xrightarrow{p}\mathrm{Var}\big(\Delta h_0(\xi_j)\big)$.

Next, $\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)\Delta\bar h(\xi_i)-\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)\Delta h_0(\xi_i)\xrightarrow{p}0$ since, for $M,M'\in\{M_1,M_2\}$, $\frac{1}{n}\sum_{i=1}^{n}\gamma_{0M}(\xi_i)\bar h_{M'}(\xi_i)-\frac{1}{n}\sum_{i=1}^{n}\gamma_{0M}(\xi_i)h_{0M'}(\xi_i)\xrightarrow{p}0$. Since also $\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat h(\xi_i)-\Delta\bar h(\xi_i)\big|\xrightarrow{p}0$ and $\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|\xrightarrow{p}0$,

$$\begin{aligned}
\left|\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma(\xi_i)\Delta\hat h(\xi_i)-\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)\Delta h_0(\xi_i)\right|
&\le\left|\frac{1}{n}\sum_{i=1}^{n}\big(\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big)\Delta\hat h(\xi_i)\right|+\left|\frac{1}{n}\sum_{i=1}^{n}\big(\Delta\hat h(\xi_i)-\Delta\bar h(\xi_i)\big)\Delta\gamma_0(\xi_i)\right|+\left|\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)\Delta\bar h(\xi_i)-\frac{1}{n}\sum_{i=1}^{n}\Delta\gamma_0(\xi_i)\Delta h_0(\xi_i)\right|\\
&\le\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat\gamma(\xi_i)-\Delta\gamma_0(\xi_i)\big|^2}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\Delta\hat h(\xi_i)^2}+\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big|\Delta\hat h(\xi_i)-\Delta\bar h(\xi_i)\big|^2}\sqrt{\frac{1}{n}\sum_{i=1}^{n}|\Delta\gamma_0(\xi_i)|^2}+o_p(1)\\
&=o_p(1)O_p(1)+o_p(1)O_p(1)=o_p(1)
\end{aligned}$$

Since $\frac{1}{n}\sum_{i=1}^{n}\big(\Delta\gamma_0(\xi_i)\Delta h_0(\xi_i)-E[\Delta\gamma_0(\xi_i)\Delta h_0(\xi_i)]\big)\xrightarrow{p}0$, we obtain $\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma(\xi_i)\Delta\hat h(\xi_i)-E[\Delta\gamma_0(\xi_i)\Delta h_0(\xi_i)]\xrightarrow{p}0$. Since also $\frac{1}{n}\sum_{i=1}^{n}\Delta\hat\gamma(\xi_i)-E[\Delta\gamma_0(\xi_i)]\xrightarrow{p}0$ and $\frac{1}{n}\sum_{i=1}^{n}\Delta\hat h(\xi_i)-E[\Delta h_0(\xi_i)]\xrightarrow{p}0$,

$$\hat\lambda_\Delta\xrightarrow{p}\lambda_\Delta\equiv \mathrm{Cov}\big[\Delta\gamma_0(\xi_i),\Delta h_0(\xi_i)\big]$$

References

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
Austern, M., & Zhou, W. (2020). Asymptotics of cross-validation. arXiv preprint arXiv:2001.11111.
Bates, S., Hastie, T., & Tibshirani, R. (2023). Cross-validation: what does it estimate and how well does it do it? Journal of the American Statistical Association, 1–12.
Bayle, P., Bayle, A., Janson, L., & Mackey, L. (2020). Cross-validation confidence intervals for test error. 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
Bierens, H. J. (1982). Consistent model specification tests. Journal of Econometrics, 20(1), 105–134.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3), 503–514.
Burman, P., Chow, E., & Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2), 351–358.
Chen, X., Hong, H., & Shum, M. (2007). Nonparametric likelihood ratio model selection tests between parametric likelihood and moment condition models. Journal of Econometrics, 141(1), 109–140.
Chen, X., & White, H. (1999). Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Transactions on Information Theory, 45(2), 682–691.
Clark, T. E., & McCracken, M. W. (2015). Nested forecast model comparisons: a new approach to testing equal accuracy. Journal of Econometrics, 186(1), 160–177.
Dewan, I., & Rao, B. P. (2001). Asymptotic normality for U-statistics of associated random variables. Journal of Statistical Planning and Inference, 97(2), 201–225.
Diebold, F. X. (2015). Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold-Mariano tests. Journal of Business & Economic Statistics, 33(1), 1–1.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1), 134–144.
Eagleson, G. (1979). Orthogonal expansions and U-statistics. Australian Journal of Statistics, 21(3), 221–237.
Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545–1578.
Hall, P., Racine, J., & Li, Q. (2004). Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association, 99(468), 1015–1026.
Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The model confidence set. Econometrica, 79(2), 453–497.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Vol. 2. Springer.
Hong, H., & Preston, B. (2012). Bayesian averaging, prediction and nonnested model selection. Journal of Econometrics, 167(2), 358–369.
Knight, K., & Fu, W. (2000). Asymptotics for lasso-type estimators. Annals of Statistics, 28(5), 1356–1378.
Lavergne, P. (2001). An equality test across nonparametric regressions. Journal of Econometrics, 103(1-2), 307–344.
Lavergne, P., & Vuong, Q. (2000). Nonparametric significance testing. Econometric Theory, 16(4), 576–601.
Lavergne, P., & Vuong, Q. H. (1996). Nonparametric selection of regressors: The nonnested case. Econometrica: Journal of the Econometric Society, 207–219.
Lehmann, E. L. (1951). Consistency and unbiasedness of certain nonparametric tests. The Annals of Mathematical Statistics, 165–179.
Lehmann, E. L., & Romano, J. P. (2005). Testing Statistical Hypotheses (Springer Texts in Statistics). Springer.
Lei, J. (2019). Cross-validation with confidence. Journal of the American Statistical Association, 1–20.
Li, Q., & Racine, J. (2004). Cross-validated local linear nonparametric regression. Statistica Sinica, 485–512.
Liao, Z., & Shi, X. (2020). A nondegenerate Vuong test and post selection confidence intervals for semi/nonparametric models. Quantitative Economics, 11(3), 983–1017.
Neuhaus, G. (1977). Functional limit theorems for U-statistics in the degenerate case. Journal of Multivariate Analysis, 7(3), 424–439.
Neumeyer, N. (2004). A central limit theorem for two-sample U-processes. Statistics and Probability Letters, 67, 73–85.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245.
Politis, D., Romano, J., & Wolf, M. (1999). Subsampling. Springer Series in Statistics. Springer.
Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99, 39–61.
Rivers, D., & Vuong, Q. (2002). Model selection tests for nonlinear dynamic models. The Econometrics Journal, 5(1), 1–39.
Romano, J. P., Shaikh, A. M., & Wolf, M. (2010). Hypothesis testing in econometrics. Annual Review of Economics, 2(1), 75–104.
Schennach, S. M., & Wilhelm, D. (2017). A simple parametric model selection test. Journal of the American Statistical Association, 112(520), 1663–1674.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.
Shi, X. (2015). A non-degenerate Vuong test. Quantitative Economics, 85–121.
Van der Vaart, A. W. (2000). Asymptotic Statistics, Vol. 3. Cambridge University Press.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, 307–333.
Wager, S. (2020). Cross-validation, risk estimation, and model selection: Comment on a paper by Rosset and Tibshirani. Journal of the American Statistical Association, 115(529), 157–160.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.
West, K. D. (1996). Asymptotic inference about predictive ability. Econometrica: Journal of the Econometric Society, 1067–1084.
White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward network models. Journal of the American Statistical Association, 84(408), 1003–1013.
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126.
White, H., & Racine, J. (2001). Statistical inference, the bootstrap, and neural-network modeling with application to foreign exchange rates. IEEE Transactions on Neural Networks, 12(4), 657–673.
Yang, Y. (2007). Consistency of cross validation for comparing regression procedures. The Annals of Statistics, 35(6), 2450–2473.
Zhang, P. (1993). Model selection via multifold cross validation. The Annals of Statistics, 299–313.
Zhu, Y., & Timmermann, A. (2020). Can two forecasts have the same conditional expected accuracy? arXiv preprint arXiv:2006.03238.