# vstatistics_and_variance_estimation__95606069.pdf

Journal of Machine Learning Research 22 (2021) 1-48 Submitted 5/21; Revised 9/21; Published 10/21

V -statistics and Variance Estimation

Zhengze Zhou zz433@cornell.edu Department of Statistics and Data Science Cornell University Ithaca, NY 14850, USA

Lucas Mentch lkm31@pitt.edu Department of Statistics University of Pittsburgh Pittsburgh, PA 15260, USA

Giles Hooker ghooker@berkeley.edu Department of Statistics University of California, Berkeley Berkeley, CA 94720, USA

Editor: Genevera Allen

As machine learning procedures become an increasingly popular modeling option among applied researchers, there has been a corresponding interest in developing valid tools for understanding their statistical properties and uncertainty. Tree-based ensembles like random forests remain one such popular option for which several important theoretical advances have been made in recent years by drawing upon a connection between their natural subsampled structure and the classical theory of U-statistics. Unfortunately, the procedures for estimating predictive variance resulting from these studies are plagued by severe bias and extreme computational overhead. Here, we argue that the root of these problems lies in the use of subsampling without replacement and that with-replacement subsamples, resulting in V -statistics, substantially alleviates these problems. We develop a general framework for analyzing the asymptotic behavior of V -statistics, demonstrating asymptotic normality under precise regularity conditions and establishing previously unreported connections to U-statistics. Importantly, these ﬁndings allow us to produce a natural and eﬃcient means of estimating the variance of a conditional expectation, a problem of wide interest across multiple scientiﬁc domains that also lies at the heart of uncertainty quantiﬁcation for supervised learning ensembles.

Keywords: V -statistics, U-statistics, Variance Estimation, Uncertainty Quantiﬁcation, Supervised Ensembles

1. Introduction

There has been considerable interest in recent years in developing inferential procedures for random forests and related methods (Mentch and Hooker, 2016; Wager and Athey, 2018; Cui et al., 2017). In most cases, these procedures exploit the underlying subsampling structure to represent predictions from the resulting ensemble in terms of U-statistics. This representation then allows central limit theorems to be derived by extending those classi-

c 2021 Zhengze Zhou, Lucas Mentch and Giles Hooker.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/21-0575.html.

Zhou et al.

cal results (Hoeﬀding, 1948). While these advances have led to promising new inferential methodology (Mentch and Hooker, 2016; Hooker and Mentch, 2018; Zhou et al., 2018; Peng et al., 2019; Athey et al., 2019), each necessitates explicit variance estimation of the random forests predictions. The primary component of this variance can be expressed as the variance of a conditional expectation. Estimation of parameters of this form has been explored for some time in the broader academic literature as it arises in a number of practical applications (Zouaoui and Wilson, 2003; Staum, 2009). Mentch and Hooker (2016); Hooker and Mentch (2018) exploit this representation explicitly, considering a structured subsampling approach that produces a nested Monte Carlo estimate for the variance of random forest predictions. Years earlier, Sun et al. (2011) investigated a similar estimator in an operations research context, using failure time stability as a motivating example. Goda (2017) recently derived a generalized version of this estimator and demonstrated that estimation could be done in a non-nested fashion. Wang and Lindsay (2014) were explicitly interested in estimating the variance of classical U-statistics and introduced a partition resampling scheme that was shown to be the best unbiased estimator as long as the kernel degree k satisﬁes k n/2. In the context of ensemble variance estimation, Sexton and Laake (2009) proposed a bootstrap-of-littlebags approach and Wager et al. (2014); Wager and Athey (2018) utilize an extension of the Inﬁnitesimal Jackknife estimator put forth in Efron (2014). Unfortunately, all of these estimation procedures exhibit considerable upward bias unless the size of the ensemble is much larger than that required to construct a typical, predictivelyaccurate random forest. Such biased estimates generally result in conservative conﬁdence intervals and signiﬁcant decreases in the power of resulting hypothesis tests (Mentch and Hooker, 2016; Zhou et al., 2018). Methods proposed to correct this bias also frequently result in negative variance estimates, leading to ad hoc workarounds such as a pseudo-Bayes approach to enable inference (Athey et al., 2019). In this paper, we argue that the diﬃculty of correcting the bias in these estimates results from the U-statistic construction itself that employs subsamples without replacement. This structure results in negative correlation between the trees in the ensemble that induces undesired bias in the bias correction. By simply employing subsampling with replacement, the entire ensemble can be represented as an expectation over the empirical distribution of the data and we show that this framework provides considerably more reliable inference than in the case of U-statistics as well as reducing the sensitivity of predictive performance to subsample size. Buja and Stuetzle (2006) studied some equivalence for bagging on resampling with or without replacement along with the eﬀect of kernel size on bias and variance. Formally, methods based around subsampling with replacement fall under the framework of V -statistics. These have been treated by either demonstrating their asymptotic equivalence to U-statistics, or via a series expansions (von Mises, 1947). However, the equivalence hold only when the size of the V -statistic kernel grows more slowly than n1/4 meaning that the number of observations given to each tree grows very slowly and series expansions are complicated by the non-diﬀerential greedy tree-building process. Instead, we show that a V -statistic of any order can be exactly represented as a U-statistic but employing a diﬀerent kernel, allowing us to invoke the central limit theorems already derived for inﬁnite-order U-statistics, although some variation must be made to account for the use of random sampling for subsets.

V -statistics and Variance Estimation

Importantly, the general framework we develop allows us to present a uniﬁed theory for the general problem of variance estimation from which we can bridge the gap between the numerous approaches discussed above. In particular, we propose balanced method (BM) for variance estimation and show that it enjoys lower bias than the alternatives given ensembles of equal size. Further, we establish a close connection between the BM and the Inﬁnitesimal Jackknife (IJ) and prove their equivalence under a natural condition. To estimate variance in the limiting distribution in ﬁnite sample case, we develop a bias-corrected version of BM through an ANOVA-like framework (Sun et al., 2011). The new estimator is shown to produce much more reliable results with a moderate number of base learners such as would be incurred, for example, by utilizing a traditional bootstrap approach. Through the remainder of this paper, Section 2 and 3 review Uand V -statistics, their use in the theory of machine learning methods and derive their equivalence, although with modiﬁcations to the expression for variance in the incomplete case. In Section 4, we then examine variance estimates where we derive a more eﬃcient estimate version of the estimate in Mentch and Hooker (2016) and show that this is equivalent to the Inﬁnitesimal Jackknife if every observation is used the same number of times in the ensemble. This framework allows us to explicitly derive bias corrections in Section 5 and we show that while the natural estimate of the bias over-corrects the variance for the case of U-statistics, it is unbiased in the case of V -statistics. Section 6 extends the asymptotic results to randomized ensembles. Empirical studies in Section 7 corroborate the developed asymptotic theories and the eﬀectiveness of variance estimation procedure; it also suggest that the change from without replacement to with replacement subsampling does not have a consistent eﬀect on predictive performance but that the latter makes that performance less sensitive to subsample size, bringing the method closer to the original bootstrap sampling proposed in Breiman (2001). These results suggest that subsampling with replacement should be considered the appropriate default for ensemble methods. Proofs of all results, additional experiment studies and further discussions are collected in the Appendix. Code accompanying this paper can be found at https://github.com/ Zhengze Zhou/V-statistics-and-Variance-Estimation.

1.1 A Motivating Example

We will illustrate the practical importance of our contribution through a motivating example using Boston Housing Data1. The data set contains 506 samples with 13 features and the target is the median value of owner-occupied homes in $1000 s in the area of Boston Mass. To simulate practical use cases, we leave 20% of the data as test set and train on the remaining samples. A random sample is selected from the test set and a 95% conﬁdence interval for predictions are calculated by the methods described later in this paper. Previous work suggests building ensembles by sampling without replacement (U-statistics), and estimating predictive variance by Inﬁnitesimal Jackknife (IJ). We build the a random forests with subsample size 100 on the training data and calculate the intervals on the selected test sample, while varying the number of base learners B = 100, 500, 1000, 5000, 10000. The results are depicted by dashed lines in Figure 1 . Here the conﬁdence interval of predictions are calculated as predicted value 1.96 estimated standard deviation. The width

1. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

Zhou et al.

of the intervals decrease as B becomes larger and small values of B results in conservative estimates. The ideal number of B is usually prohibitively large in practice (B > 5000 in this example) in order to get accurate variance estimation. This demonstrates the insuﬃciency of existing work in conducting inference for ensembles, despite solid theoretical properties of U-statistics. This paper proposes the use of V -statistics by sampling with replacement along with a bias-corrected variance estimator (solid lines in Figure 1) . We can see that the proposed bias corrected estimates yield more accurate conﬁdence intervals with moderate size of B. We would like to emphasize that the contributions of our work are twofold: in additional to theoretical advancements in analyzing asymptotics of V -statistics, we also provide a general framework for eﬃcient variance estimation, which is of great signiﬁcance for practical applications. We note here that this interval is based on a central limit theorem that is centered on the expectation of the random forests, rather than using a target conditional mean E(Y |X = x). The consistency of such intervals will then depend on the size of the bias associated with the particular tree-building method used in the random forests; see (Wager and Athey, 2018; Scornet et al., 2015) for examples. Our focus in this paper is on the structure of the ensemble rather than its constituent and we do not address conﬁdence interval consistency here. Alternatively, Zhou and Hooker (2018) multiplies the calculated standard error by

2 to produce reproduction intervals , giving the range in which an independently-generated random forests would fall and thus a notion of stability. For the sake of simplicity, we have used standard interval calculations below.

2. Related Work on U-statistics

We ﬁrst give a brief introduction on the notion of U-statistics, and then illustrate how it can be utilized in the analysis of ensemble models. Assume that we have a training set D = {Z1, . . . , Zn} of i.i.d. observations of the form Zi = (Xi, Yi) drawn from an underlying distribution FZ, where X = (X1, . . . , Xp) X are p covariates. We want to estimate a parameter of interest θ. Suppose there exists an unbiased estimator h of θ that is a function of k n arguments (we call h a kernel of size or degree k) so that θ = Eh (Z1, . . . , Zk)

and without loss of generality, assume that h is permutation symmetric in its arguments and Eh2 (Z1, . . . , Zk) < . Then the minimum variance unbiased estimator for θ is given by

Un = 1 n k X

i h (Zi1, . . . , Zik) (1)

where {Zi1, . . . , Zik} consists of k distinct elements from the original sample {Z1, . . . , Zn} and the sum is taken over all n k subsamples of size k. The estimator in (1) is referred to as a complete U-statistic with kernel h of degree k. There are some natural extensions of (1). To produce more predicive ensembles, we would like k to grow with n so the kernel will have access to more information from the

V -statistics and Variance Estimation

Figure 1: Conﬁdence intervals for predictions based on Uand V -statistics. Dashed and solid lines are for Uand V -statistics respectively. The x-axis denotes the number of base learners used in an ensemble. Prediction values are drawn in blue, while IJ estimates of 95% conﬁdence intervals are drawn in red. Bias corrected estimator for V -statistics is shown in green.

Zhou et al.

data. This results in a kernel that varies with n, and an Inﬁnite Order U-statistic (IOUS; Frees, 1989)

Un,kn = 1 n kn X

i hkn Zi1, . . . , Zikn . (2)

Further, evaluating all n kn kernels is computationally infeasible for even moderately sized n or kn and thus an estimate can be achieved by averaging over only Bn < n kn subsamples. Incorporating this, the estimator becomes an Incomplete Inﬁnite Order U-statistic

Un,kn,Bn = 1

i hkn Zi1, . . . , Zikn . (3)

In (2) and (3) we use subscripts to denote that values of k and B may depend on n, and the degree of kernel h is kn. U-statistics of form (1) were ﬁrst studied in Halmos (1946) and Hoeﬀding (1948), where the latter also shows that these statistics are asymptotically normally distributed. Sen (1992) provides a review of Hoeﬀding s seminal paper and outlines the importance of U-statistics in modern statistical theory. A comprehensive treatment of the classical Ustatistics results can be found in Lee (1990) and Serﬂing (2009). Certain basic properties, such as almost sure consistency and asymptotic normality, are proved to hold in the case of (2) and (3) in Frees (1989). The connection between U-statistics and ensemble methods had not been observed until very recently in the work of Mentch and Hooker (2016) and Wager and Athey (2018). For simplicity we will focus on the regression setting, where predictions are assumed to be continuous. This can also incorporate binary classiﬁcation as long as the model predicts the probability by averaging outputs instead of predicting a label obtained from a majority vote. We are interested in estimating the conditional mean function at a test point x

µ (x) = E (Y |X = x) .

Given a base learner h, ensemble methods generate resamples R1, . . . , RB of the original data, apply h to each resample, and produce ﬁnal point estimates by averaging over those generated by each model, yielding estimates of the form

i=1 h (x; ωi, Ri) .

Here, the ωi denotes an auxiliary randomization parameter as used in randomized ensembles like random forests, but which may be dropped for simpler (non-randomized) estimation procedures like bagging2; Peng et al. (2019) refers to these estimators as generalized Ustatistics. When all instances of the randomness are considered, note that the kernel again becomes nonrandom (we can write the kernel as Eωih (x; ωi, Ri)), as in Wager and Athey (2018) where the authors assume B is large enough for Monte Carlo eﬀects not to matter.

2. In some papers, ωi incorporates drawing the subsamples in bagging and doing both resampling and random feature selection in random forests. Here ω has a diﬀerent meaning as we write out the resamples explicitly.

V -statistics and Variance Estimation

The conventional procedure in random forests is to take R1, . . . , RB to be bootstrap samples, which turns out very diﬃcult to analyze statistically. Mentch and Hooker (2016) propose the following procedure to construct an ensemble. Given a training set D of size n, an ensemble consisting of Bn base learners is constructed using subsamples of size kn

Un,kn,Bn (x) = 1

i=1 hkn x; Z i1, . . . , Z ikn

where {Z i1, . . . , Z ikn} is drawn without replacement from {Z1, . . . , Zn}. This ﬁts into the statistical framework of U-statistics and asymptotic normality can be demonstrated under some regularity conditions (see Peng et al. (2019) for reﬁned results). In particular, the explicit expression for the variance of predictions at any given point can be written in closed-form k2 n n ζ1,kn + 1

Bn ζkn,kn. (5)

For a given c, 1 c kn, the variance parameters are deﬁned as

ζc,kn = cov hkn(Z1, . . . , Zkn), hkn(Z1, . . . , Zc, Z c+1, . . . , Z kn) (6)

where Z c+1, . . . , Z kn are i.i.d. copies from the same distribution FZ and independent of the original data Z1, . . . , Zn. For notational simplicity, we drop the test point x in (6). Within these results, as in ours below, the asymptotic distribution is centered at θk = Ehkn(Z1, . . . , Zkn) instead of the true conditional mean E(Y |X = x). As noted above, this means that any inferential statements must, in general, be made about the sampling structure of the ensemble rather than the underlying data generating process. A careful analysis of speciﬁc choices of the base learner h and the relationship between covariates X and response Y are central in achieving consistent predictions and is not the focus of this paper. Some work along these lines includes Wager and Athey (2018) which focus on particular tree-building methods, and Scornet et al. (2015) which demonstrate the L2

consistency for random forests when the underlying response corresponds to an additive regression model. Also note that the asymptotic normality result in Wager and Athey (2018) can be viewed as a special case of (5). Here, the authors assume that ensemble size B is large enough for Monte Carlo eﬀects not to matter, in which case (5) reduces to k2 n n ζ1,kn.

3. V -statistics

V -statistics are closely related to U-statistics except that the data used in each kernel is sampled with replacement. Similar to (1), a complete V -statistic with kernel h of degree k is deﬁned as

Vn = n k n X

ik=1 hk (Zi1, . . . , Zik) (7)

where {Zi1, . . . , Zik} consists of k elements from {Z1, . . . , Zn} and the sum is taken over all nk subsamples of size k. An Inﬁnite Order V -statistic (IOVS) is deﬁned analogously to (2)

Zhou et al.

Vn,kn = n kn n X

ikn=1 hkn Zi1, . . . , Zikn . (8)

Asymptotic equivalence between V - and U-statistics for ﬁxed kernel degree is a well studied topic (Lee, 1990). Previous work has also shown that the equivalence hold when the size of V -statistic kernel grows more slowly than n 1 4 (Shieh, 1994). We provide a rigorous analysis of this argument in Appendix A. However, this growth rate of kernel size is fairly restrictive: the number of observations given to each base learner grows very slowly, which can harm predictive performance. In the following, we show that a V -statistic of any order can be exactly represented as a U-statistic but employing a diﬀerent kernel, thus enabling us to discard the restriction to attain a more general asymptotic results of V -statistics.

3.1 Representation As U-statistics

This section develops a broader connection between V - and U-statistics to show that the former automatically achieve almost all the properties of the latter. A complete, inﬁnite order V -statistic Vn,kn with kernel hkn can be written as a corresponding U-statistic but with a more complicated kernel derived from hkn( ). Let Ωdenote the set {1, 2, . . . , n}. We use Bkn(Ω) to denote all size kn permutations of Ωwith replacement, and let Skn(Ω) denote subsamples of size kn without replacement so that |Bkn(Ω)| = nkn and |Skn(Ω)| = n kn . We can write Vn,kn as

Vn,kn = n kn X

b Bkn(Ω) hkn (Zb)

where b has kn elements and Zb are those Z s with index in b. Equivalently, Vn,kn can be expressed as

Vn,kn = n kn X

b Bkn(s) ωbhkn (Zb)

where ωb is the weight associated with each evaluation of hkn to account for the multiplicity in sampling the same b from Bkn(s) for diﬀerent s. For b = {i1, i2, . . . , ikn}, we use u(b) {1, 2, . . . , kn} to denote the number of unique elements in b and we have

ωb = 1 n u(b) kn u(b) .

We can thus express Vn,kn as a U-statistic

Vn,kn = 1 n kn X

s Skn(Ω) h kn (Zs) (9)

where the kernel h kn is deﬁned as

b Bkn(s) ωbhkn (Zb) .

V -statistics and Variance Estimation

Here ωb is deﬁned as before, and

b Bkn(s) ωb = nkn

A general result for the aymptotics of V -statistics is stated in the following theorem, allowing us to remove the restriction kn = o(n 1 4 ). First we deﬁne a class to which the kernel function hkn belongs:

H = h : sup kn Eh2 kn Zi1, . . . , Zikn <

where (i1, . . . , ikn) are chosen from {1, . . . , kn} with replacement.

Theorem 1 Let Z1, Z2, . . . , Zn iid FZ and let Vn,kn,Bn be an incomplete, inﬁnite order V - statistic with kernel hkn such that hkn H. Let θ kn = Eh kn. Then under the assumption

that lim ζ kn,kn nζ 1,kn 0, we have

Vn,kn,Bn θ kn

n ζ 1,kn + 1 Bn ζkn,kn

In the complete case where Bn = nkn, we have Vn,kn θ kn

Here, the variance parameter ζ 1,kn is deﬁned as in Equation (6) by replacing kernel hkn with h kn ζ 1,kn = cov h kn (Z1, . . . , Zkn) , h kn Z1, Z 2, . . . , Z kn

and ζkn,kn = var (hkn (Z1, . . . , Zkn)) is still the variance across individual kernels hkn.

As above, we note that our asymptotic distribution is centered onθ kn = Eh kn. This theorem provides a more general result for the asymptotics of V -statistics. It is essentially a reduction to U-statistics by constructing a new kernel representation. The variance expression k2 n n ζ 1,kn + 1 Bn ζkn,kn again can be viewed as two parts: the ﬁrst part k2 n n ζ 1,kn comes from the complete case; the second part 1 Bn ζkn,kn is the additional Monte Carlo variance introduced due to incomplete case, which is why ζkn,kn only involves the original kernel hkn instead of the composite kernel h kn. Peng et al. (2019) provides a uniﬁed analysis of these two components by incorporating the choice of subsamples into the randomization parameters of the generalized U-statistics. This strategy is not available in our case, as detailed in Section 6. Unlike Theorem 10 (see Appendix A) where the expected value of V - and U-statistics are the same asymptotically when kn = o(n 1 4 ), in the more general case here one may

Zhou et al.

not have the expected value of the new kernel h kn equals that of hkn. The central limit theorem centers at the expectation of the statistics of interest. How to quantify the bias of the predictions is out of the scope of this paper; see Wager and Athey (2018) for a careful treatment of decision tree based ensembles. The introduction of new kernel h kn facilitates theoretical analysis, but it brings challenges in estimating variance component ζ 1,kn directly: it is not feasible to calculate h kn(Zs) for any s Skn(Ω). We will see in Section 4 that as a general variance estimation method, the Inﬁnitesimal Jackknife (IJ) can be applied to the original kernel function hkn. And based on Theorem 3, Balanced Variance Estimation Method is equivalently valid without resorting to evaluating h kn(Zs) directly.

4. Variance Estimation

This section addresses how to estimate variance in the limiting distribution. Mentch and Hooker (2016) propose Internal Variance Estimation Method (IM) based on a two-level sampling procedure. Inspired from this, we design the Balanced Variance Estimation Method (BM) which is shown to have lower bias compared to IM. Unlike IM and BM, the Inﬁnitesimal Jackknife (IJ) employed in Wager and Athey (2018) does not depend on an explicit expression for the variance term. All methods presented can apply to both Uand V -statistics, though they exhibit diﬀerent performances when sampling without or with replacement, especially in terms of bias. For notational simplicity, during the development of BM and IM we will use hkn to denote the kernel function and ζc,kn to denote variance parameters. For general V -statistics as characterized by Theorem 1, one can substitute them for h kn and ζ c,kn whenever needed. IM and BM operate by directly estimating ζ1,kn and ζkn,kn as deﬁned in (6). Notice that ζkn,kn = var(hkn(Z1, . . . , Zkn)), which can be simply estimated as the variance across all base learners. The estimation for ζ1,kn is much more involved. The sample covariance between predictions may serve as a consistent estimator, but in practice it is numerically unstable and often results in negative variance estimates (Mentch and Hooker, 2016). Thus we work with the equivalent expression for ζ1,kn (Lee, 1990)

ζ1,kn = var (E (hkn(Z1, . . . , Zkn) |Z1 = z1)) . (10)

Expressions of the form from (10) belong to an important theme in statistics: estimating the variance of a conditional expectation. It is usually related to uncertainty quantiﬁcation and has been studied intensively in a number of ﬁelds (Zouaoui and Wilson, 2003; Staum, 2009). For a more detailed review, we refer readers to Sun et al. (2011). In what follows, assume we have data D = {Z1, . . . , Zn} of i.i.d. observations of the form Zi = (Xi, Yi), and a kernel function hkn(Z1, . . . , Zkn). For simplicity, we suppress notations by dropping the test point x in the kernel expression.

4.1 Internal Variance Estimation Method

IM was ﬁrst proposed in Mentch and Hooker (2016) wherein the estimates are obtained as a result of restructuring the ensemble building procedure. It can be viewed as a nested two-level Monte Carlo, where we need to choose n OUT and n IN for the number of outer and inner iterations respectively. See Algorithm 1 for details.

V -statistics and Variance Estimation

Algorithm 1 Internal Variance Estimation Method

for i in 1 to n OUT do

Select initial ﬁxed point z(i)

for j in 1 to n IN do

Select subsample S z(i),j of size kn from training set that includes z(i)

Build base learner and evaluate hkn(S z(i),j) end for Record average of the n IN predictions end for Compute the variance of the n OUT averages to estimate ζ1,kn Compute the variance of all predictions to estimate ζkn,kn Compute the mean of all predictions to obtain ﬁnal ensemble prediction

We use the shorthand hi,j to denote hkn(S z(i),j). The average across inner level is calculated as hi = 1 n IN Pn IN j=1 hi,j. Further we use h = 1 n OUT Pn OUT i=1 hi to denote the average across outer level i. Then the estimates for ζ1,kn and ζkn,kn can be expressed as

ˆζIM 1,kn = 1 n OUT 1

ˆζIM kn,kn = 1 n IN n OUT 1

4.2 Balanced Variance Estimation Method

As will be shown in Figure 2 in Section 5.1, the estimator for ˆζIM 1,kn given by IM is severely biased upwards when n IN and n OUT are not suﬃciently large (Bn = n IN n OUT). IM is not optimal in the sense that it does not utilize all the information in the ensemble. In particular, hi,j is only used once in the outer iteration i when conditioned on z(i). Ideally we could also utilize hi,j by conditioning on the remaining kn 1 inputs. Further, we need to choose two hyperparameters n OUT and n IN instead of ﬁxing the number of base learners Bn. It is not clear what combination will yield optimal performance under the same computational budget this trade-oﬀwill likely diﬀer depending on whether we wish to optimize predictive performance or variance estimation. To address these issues, we design the Balanced Variance Estimation Method (Algorithm 2). In the following, we use hb to represent hkn(Sb) if there is no ambiguity. Let Ni,b denote the number of times the ith training sample appears in subsample Sb. In the case of Ustatistics where we sample without replacement, Ni,b {0, 1}. For V -statistics Ni,b can be larger than 1 due to sampling with replacement. Summing over b gives Ni = PBn b=1 Ni,b and the averaged version Ni = Ni

Bn . For 1 i n, deﬁne

Zhou et al.

Algorithm 2 Balanced Variance Estimation Method

for b in 1 to Bn do

Select subsample Sb of size kn from training set D of size n. Build base learner and evaluate hkn(Sb) end for for i in 1 to n do

Calculate mi as the average of hkn(Sb) where the ith training sample appears in Sb, weighted by the number of appearance. end for Compute the variance of mi to estimate ζ1,kn Compute the variance of all predictions to estimate ζkn,kn Compute the mean of all predictions to obtain ﬁnal ensemble prediction

where ωi,b = Ni,b

Ni . Further deﬁne the average of mi as m = 1 n Pn i=1 mi and the overall

average of hb as h = 1 Bn PBn b=1 hb. The estimates for ζ1,kn and ζkn,kn can be written as

ˆζBM 1,kn = 1 n 1

i=1 (mi m)2 ,

ˆζBM kn,kn = 1 Bn 1

4.3 Inﬁnitesimal Jackknife

The Inﬁnitesimal Jackknife (IJ) was ﬁrst studied by Jaeckel (1972) as an extension for the jackknife to estimate variance. The basic idea of the jackknife is to omit one observation and recompute the estimate using the remaining samples. Alternatively, if we assign a weight to each observation, omitting one is equivalent to setting the corresponding weight to zero. More generally, we can give each observation a weight slightly less than one every time. IJ is the limiting case as this deﬁciency in the weight approaches zero. Efron (1982) provided a more detailed treatment of these resampling plans. More recently, IJ was found to be a powerful tool for estimating standard errors in bagging (Efron, 2014). Wager et al. (2014) and Wager and Athey (2018) applied IJ in the context of random forests. In our setting, the Inﬁnitesimal Jackknife estimate of variance can be expressed as

i=1 cov2 (Ni,b, hb)

where cov(Ni,b, hb) =

PBn b=1(Ni,b Ni)(hb h) Bn and Ni is deﬁned in Section 4.2. As a general variance estimation method for ensembles, IJ is applied upon the original kernel function (Efron, 2014). Consistency results of IJ for U-statistics typed ensembles were developed in Wager and Athey (2018); Ghosal and Hooker (2020). IJ does not rely on an explicit expression of the variance term and is targeted at estimating the limiting variance assuming Bn is suﬃciently large. By applying IJ, we are

V -statistics and Variance Estimation

essentially estimating k2 n n ζ1,kn (as in Theorem 10) or k2 n n ζ 1,kn (as in Theorem 1). For general V -statistics, it is not practical to use BM on the composite kernel h kn to get an estimate for ζ 1,kn directly, and we can instead use IJ on the original kernel hkn to get variance estimates. A direct connection exists between BM and IJ, which we will show below.

Deﬁnition 2 Balanced Subsample Structure We call a subsample structure balanced if Bn kn is a multiple of n, and each training sample appears exactly rn = Bn kn

For U-statistics, this structure implies that each training observation appears in exactly rn base learners. For V -statistics, each sample is required to occur rn times but may be used in fewer than rn base learners since the sampling is done with replacement.

Theorem 3 If we have balanced subsample structure, the Balanced Variance Estimation Method and the Inﬁnitesimal Jackknife estimator satisfy

k2 n n ˆζBM 1,kn = n n 1 ˆVIJ.

Remark 4 The scaling factor n n 1 is a result of how we calculate the empirical variance. If

instead we deﬁne ˆζBM 1,kn = 1

n Pn i=1(mi m)2, then the two estimators are equal: k2 n n ˆζBM 1,kn = ˆVIJ. Theorem 3 also enables us to apply BM on the original kernel hkn to get valid variance estimates in general V -statistics without resorting to the calculation of composite kernel h kn, which is infeasible in practice. Further implications of this result are discussed in Appendix I.

5. Bias Corrections for Variance Estimates

We have so far presented three variance estimation methods (IM, BM and IJ) to estimate the variance of predictions given by an ensemble, being either U-statistics or V -statistics. Although both IM and BM are targeted at the speciﬁc variance expression (Equation (10)), IJ is a more general procedure. Applying IM or BM for the general V -statistics (Theorem 1) is infeasible since it involves the evaluation of the complex kernel h kn, while IJ can be naturally applied on the original kernel hkn (Efron, 2014). By the connection of BM and IJ in Theorem 3, BM applying on the original kernel also yields valid estimates. As a result, although we need the new kernel h kn for theoretical analysis, variance estimates can be achieved without explicitly evaluation. In this section, we will simply refer to the variance components of both U and V statistics as ζ1,kn and ζkn,kn.

5.1 Bias in Variance Estimation

As brieﬂy mentioned in Section 1, all existing variance estimation methods exhibit severe bias when the number of base learners is not suﬃciently large. We now conduct a simple simulation to demonstrate the extent of this bias. Suppose X 20 unif(0, 1) and Y = 2X + N(0, 1). An ensemble of decision trees is built to predict Y from X, and we calculate the variance of the prediction at x = 10 using IM, BM and IJ. In our simulation, we ﬁx

Zhou et al.

the number of training observations n = 500 and kernel size kn = 100. The number of base learners Bn is varied among 100, 1000 and 10000. Figure 2 shows the result for both Uand V -statistics. Notice that although larger Bn indicates lower variance (see Equation (5)), the eﬀect of this is negligible as the dominating component of the variance is k2 n n ζ1,kn in our case (compare Figure 6 with 7 in Appendix C). In

order to provide a fair comparison, the variance shown in the ﬁgure is for k2 n n ζ1,kn + 1 1000ζkn,kn for each value of Bn. Thus, diﬀerent values of Bn only have eﬀect on the estimation for ζ1,kn and ζkn,kn. Appendix G provides at a closer look at the relationship between two variance components as Bn varies. We can easily observe that all three methods (IM, BM, IJ) badly overestimate the variance (notice the log scale on y-axis). The bias mainly arises from an overestimation of ζ1,kn (see Figure 6 and 7 in Appendix C). However, BM and IJ are better than IM since they utilize more information. The plot also corroborates Theorem 3: BM and IJ are exactly the same up to a scaling factor. In Appendix C, we include additional simulation results on the eﬀect of diﬀerent kernel size kn. It is worth noting that the pattern of bias is consistent among V -statistics: an overestimation of ζ1,kn leads to severe bias which diminishes as the number of base learners increases. This eﬀect exists in U-statistics as well, as the estimated variance decreases as Bn increases. However for U-statistics, the variance estimates tend to underestimate when large kernels are used (Figure 4b and 5b in Appendix C). This is partly caused by the fact that the sampling scheme with U-statistics is not equivalent to sampling from the empirical distribution, especially when the kernel size is large. Both perspectives on variance estimation, either on estimating the variance of conditional expectation or resorting to Inﬁnitesimal Jackknife, are based on the idea of using the empirical distribution of the data to approximate the true underlying distribution. U-statistics, which operate by sampling without replacement, are not equivalent to sampling from empirical distribution, thus resulting in the underestimation phenomenon. In Wager and Athey (2018), the authors use a correction factor n(n 1) (n kn)2 as an empirical adjustment for this eﬀect. Figure 8 in Appendix F shows the result when this correction is applied (denoted by corrected-IJ). Empirically the correction mitigates the underestimation bias, and exhibits a similar pattern as V -statistics: a bias due to overestimation of ζ1,kn.

5.2 A Bias-corrected Estimator

In this section, we present a bias-corrected estimator for ζ1,kn under the framework of V - statistics. We use an ANOVA-like estimation of variance components similar to Sun et al. (2011). Derivations are collected in Appendix D. Following notation form Section 4.2, deﬁne

i=1 Ni (mi m)2

b=1 Ni,b (hb mi)2 .

V -statistics and Variance Estimation

(a) Subsampling with replacement (V -statistics).

(b) Subsampling without replacement (U-statistics).

Figure 2: Variance estimation by three diﬀerent methods: Internal Variance Estimation Method (IM), Balanced Variance Estimation Method (BM) and Inﬁnitesimal Jackknife (IJ). The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions. For IM, we choose n OUT = (10, 20, 50) for n estimators = (100, 1000, 10000) respectively.

Zhou et al.

A bias-corrected estimate is given by

ˆζ1,kn = SSτ (n 1) ˆσ2 ϵ C Pn i=1 N2 i /C

where C = Pn i=1 Ni = Bnkn and ˆσ2 ϵ = SSϵ

C n. As a special case for the Balanced Subsample Structure, we have N1 = N2 = . . . = Ni = rn, then

ˆσ2 ϵ = SSϵ C n = 1 n (rn 1)

b=1 Ni,b (hb mi)2

and ˆζ1,kn = 1 n 1

i=1 (mi m)2 1

rn ˆσ2 ϵ . (11)

The calculation for ˆζ1,kn in (11) may seem complicated at ﬁrst. In Appendix E, we show that under Balanced Subsample Structure we have ˆζ1,kn 1 n 1 Pn i=1(mi m)2 1

Bn n kn ˆζBM kn,kn, which is simply the original BM estimator minus a correction term calculated from ˆζBM kn,kn. This indicates that one can actually calculate the bias-corrected estimator without any extra computational eﬀort. In this case, the estimate for the limiting variance k2 n n ζ1,kn + 1 Bn ζkn,kn is k2 n n(n 1) Pn i=1(mi m)2 kn 1

Bn ˆζBM kn,kn, where it s clear that the incomplete part is negligible relative to the bias correction term we need to apply. Figure 3 shows simulation results for this bias-corrected estimator (we call it corrected V in the remaining part of this paper) compared with BM and IJ under the framework of V -statistics. Here we no longer display IM since it is systematically worse. We can see that even with only 100 base learners, the bias-corrected estimator achieves relatively accurate estimation of the variance. The bias-corrected term may introduce some instability when Bn is very small, but for a moderate size Bn it has much lower bias compared to BM and IJ. It is worth pointing out that this bias correction method does not work for U-statistics, for the same reason mentioned before: the sampling schema is not equivalent to sampling from empirical distribution. Figure 9 in Appendix F shows that the bias-corrected estimator over-corrects the variance for U-statistics. In Athey et al. (2019), the authors developed a method called the bootstrap of little bags to estimate variance based on the work of Sexton and Laake (2009). They also encountered the challenge of negative variance when Bn is small. In their software, an improper uniform prior over [0, ) was employed to help mitigate this issue. We conjecture that the phenomenon also stems from the mechanism of sampling without replacement. In Appendix F, we present an empirically accurate correction to U-statistics as well. We brieﬂy discuss computational costs to end this section. The majority of computational eﬀorts are spent at building the ensembles, which is scaled with the total number of base learners Bn. The time needed for variance calculation using either IM, BM or IJ along with the bias correction are only marginal. That being said, in many cases to get accurate variance estimates a prohibitively large Bn is needed. The bias correction method we developed in this section can reduce Bn to a moderate size, and thus improves computational eﬃciency for statistical inference.

V -statistics and Variance Estimation

Figure 3: Variance estimation by three diﬀerent methods: corrected-V, BM and IJ. The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions.

6. Randomized Ensembles

As brieﬂy mentioned before, randomized ensembles are widely used in practice. A general principle to achieve good performance in ensembles is to make individual learners both accurate and diverse (Zhou, 2012). To increase diversity, randomization is added to each base learner. For example in random forests (Breiman, 2001), each split is chosen from a randomly selected subset of all possible features. Similar to Peng et al. (2019), we deﬁne the notion of a generalized complete V -statistic

Vn,kn,ω = n kn n X

ikn=1 hkn Zi1, . . . , Zikn; ω . (12)

Note that for each kernel hkn we consider an i.i.d. sample of random ωi but the subscript is dropped for notational convenience. Similarly deﬁne the generalized incomplete statistic by

Vn,kn,Bn,ω = 1

i hkn Zi1, . . . , Zikn; ω . (13)

Following the same idea developed in Mentch and Hooker (2016) and Wager and Athey (2018), consider the expected version of (12)

V n,kn,ω = EωVn,kn,ω = n kn n X

ikn=1 Eωhkn Zi1, . . . , Zikn; ω (14)

where the expectation is taken over the randomization parameter ω. In this case, V n,kn,ω can be viewed as a non-randomized V -statistic with kernel h E kn = Eωhkn where Theorem 1 applies. We state this result formally in the following corollary.

Zhou et al.

Corollary 5 Let Z1, Z2, . . . , Zn iid FZ and let Vn,kn,ω be a generalized complete V -statistic deﬁned in (12) and the corresponding expected version V n,kn,ω in (14). Under the same conditions as Theorem 1, we have V n,kn,ω θ kn

where all parameters θ kn, ζ 1,kn, ζ kn,kn are deﬁned using new non-randomized kernel h E kn instead of hkn.

Given this, in order to retain the asymptotic normality of the corresponding randomized case (13), there are two steps: ﬁrst we show that Vn,kn,ω V n,kn,ω Var(V n,kn,ω) P 0 and thus Vn,kn,ω has

the same asymptotic distribution as V n,kn,ω. Then the asymptotics of Vn,kn,Bn,ω can be derived from that of Vn,kn,ω.

Theorem 6 Let Vn,kn,Bn,ω be a generalized incomplete V -statistic of the form deﬁned in (13). Further assume the corresponding statistic V n,kn,ω in (14) satisﬁes Corollary 5 and limn k2 nζ 1,kn > 0. Then as long as

sup kn E hkn Zi1, . . . , Zikn; ω Eωhkn Zi1, . . . , Zikn; ω <

where (i1, . . . , ikn) are chosen from {1, . . . , kn} with replacement, we have Vn,kn,Bn,ω θ kn

n ζ 1,kn + 1 Bn ζkn,kn

Here, ζkn,kn = var (hkn (Z1, . . . , Zkn, ω)) is the variance across individual randomized kernels, and all parameters θ kn, ζ 1,kn, ζ kn,kn are deﬁned using new kernel h E kn instead of hkn as in Corollary 5.

The analysis for randomized ensembles in Peng et al. (2019) directly treats the randomized kernel, rather than ﬁrst establishing a result for the expectation over ω. This is easier in the framework of U-statistics; in the case of V -statistics, the new kernel h kn constructed as in Section 3.1 no longer has independent randomization parameters, since diﬀerent h kn might share the same kernel hkn. The condition limn k2 nζ 1,kn > 0 required here has also appeared in Lemma 4.1 of Song et al. (2019). We believe it is generally satisﬁed with many base learners including trees, see Peng et al. (2019) for an in-depth analysis for the behavior of ζ1,kn.

7. Empirical Studies

Here, we conduct two suites of experiments. All simulations are implemented in Python. For building random forests, we apply Random Forest Classiﬁer and Random Forest Regressor from scikit-learn Friedman (1991). Unless otherwise noted, default parameter values are used.

V -statistics and Variance Estimation

7.1 Predictive Performance

In this section, we evaluate the predictive performance for diﬀerent sampling strategies. In particular, we focus on the scenario of U-statistics (sample without replacement) and V - statistics (sample with replacement), and varying subsample size (proportion of the size of training data to be 0.2, 0.4, 0.6, 0.8, 1.0). We address the following two questions empirically:

1. Should we subsample with or without replacement in terms of prediction performance?

2. What is the best subsample size?

There are six datasets taken from UCI Machine learning Repository (see Appendix K for details) and we also include a regression function (denoted by MARS) which was initially considered by Friedman (1991) for multivariate adaptive regression splines, and has since been used as a benchmark in many random forests publications. Each model is built using 100 trees and to full depth until a leaf is pure or contain fewer than 2 data points. 20% of samples are left as test set. For classiﬁcation, p of features are considered when searching for best splits, and p/3 for regression. Table 7.1 summarizes our results. The ﬁrst three datasets are regression tasks for which we report root-mean-squared error and the remaining four are classiﬁcation with accuracy given by correct classiﬁcation percentage. We repeat the process 20 times and denote standard error in parenthesis. Top performance entries are marked in bold separately for sampling with vs. without replacement and an asterisk indicates the best performance across all scenarios. We make two observations here. First, for both sampling with replacement or without, there is a best subsample size for prediction, though the proportion varies across diﬀerent datasets. Accuracy decreases as the subsample size moves away from the ideal proportion. It is worth noting that performance discrepancy, which is deﬁned to be the maximum performance diﬀerence across ﬁve subsample size settings, is generally larger in the case of U-statistics than V -statistics. For example, in diabetes dataset, there is a 4.1357 RMSE diﬀerence in U-statistics scenario compared to 1.03 in V -statistics. Similarly for retinopathy dataset, the accuracy discrepancy is 2.79% versus 1.26%. It may suggest that sampling with replacement is more robust to changes in subsample sizes; possibly as a result of combining trees built on diﬀerent numbers of unique data points. On the other hand, we did not see an obvious performance gap between two sampling techniques. The best result can be generated either by sampling with replacement or without depending on the speciﬁc data at hand. In practice, one will need to use cross validation to choose the best sampling strategy and subsample size, if predictive performance is the primary concern.

Zhou et al.

Dataset n p U-statistics (sample without replacement) V -statistics (sample with replacement)

0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1

boston 506 13 3.9696 (0.1112) 3.5754 (0. 0788) 3.3225 (0.1131) 3.2103 (0.1050) 3.0588* (0.0787) 4.0358 (0.0656) 3.6799 (0.0758) 3.4705 (0.0936) 3.3738 (0.1057) 3.2944 (0.0887)

diabetes 442 10 54.6137 (0.6936) 54.6289 (0.5164) 55.5140 (0.6097) 56.9017 (0.5681) 58.7494 (0.8187) 54.3830* (0.7100) 54.4282 (0.6347) 54.5220 (0.7739) 55.0782 (0.5642) 55.4130 (0.8877)

MARS 500 5 3.2106 (0.0648) 2.8850 (0.0980) 2.7211 (0.0845) 2.6329 (0.0482) 2.5465* (0.0484) 3.2908 (0.1059) 2.9706 (0.0644) 2.8201 (0.0683) 2.7498 (0.0591) 2.6965 (0.0425)

iris 150 4 0.9667* (0.0000) 0.9667* (0.0000) 0.9633 (0.0100) 0.9500 (0.0167) 0.9333 (0.0000) 0.9650 (0.0073) 0.9667* (0.0000) 0.9617 (0.0119) 0.9667* (0.0000) 0.9633 (0.0100)

digits 1797 64 0.9507 (0.0053) 0.9660 (0.0034) 0.9688 (0.0043) 0.9735* (0.0048) 0.9731 (0.0045) 0.9481 (0.0054) 0.9588 (0.0058) 0.9651 (0.0049) 0.9681 (0.0048) 0.9689 (0.0036)

retinopathy 1151 19 0.6935 (0.0156) 0.6825 (0.0112) 0.6766 (0.0114) 0.6708 (0.0082) 0.6656 (0.0125) 0.6892 (0.0135) 0.6946* (0.0126) 0.6868 (0.0125) 0.6846 (0.0109) 0.6820 (0.0117)

breast cancer 569 30 0.9772* (0.0070) 0.9746 (0.0038) 0.9750 (0.0064) 0.9741 (0.0019) 0.9745 (0.0026) 0.9741 (0.0071) 0.9737 (0.0039) 0.9746 (0.0038) 0.9759 (0.0047) 0.9754 (0.0035)

Table 1: Predictive performance on seven datasets under diﬀerent sampling strategies.

V -statistics and Variance Estimation

7.2 Asymptotic Normality and Variance Estimation

In this section, we illustrate empirically the asymptotic normality property and variance estimation algorithms for V -statistics. We will ﬁrst utilize the MARS function (Friedman, 1991; Mentch and Hooker, 2016) such that we have access to the underlying data generating distribution: y = f(x) = 10 sin(πx1x2) + 20(x3 0.05)2 + 10x4 + 5x5 + ϵ, where X U([0, 1]5) and ϵ N(0, 1). Our simulation runs for 500 iterations. In each iteration we generate n = 500 training observations and train random forests with subsample size k = 100, 250, 500 and the number of trees B = 500, 1000, 2500, 5000. We make evaluation on three test points: p1 = [0.5, 0.5, 0.5, 0.5, 0.5] and p2, p3 are randomly drawn from U([0, 1]5). For each k and B, let ˆfi,j denote the prediction at the ith data point and jth iteration (i = 1, 2, 3 and j = 1, . . . , 500). Similarly ˆVu,ij and ˆVc,ij are the variance estimates for IJ and corrected-V respectively. The following three metrics are reported:

Normality: We test the normality of predictions ˆfi,j for j = 1, 2, . . . 500 based on D Agostino (1971) and D Agostino and Pearson (1973) which combine skewness and kurtosis, and is implemented by scipy.stats.normaltest3 in Python. In Table 2, we report test statistics and corresponding p-values (in parenthesis). We can see from the p-values reported that normality for predictions generally hold, even for large subsample size. See Appendix H for a larger scale of experiments on asymptotic normality for ensembles.

Variance ratio: The estimated variance for each setting is given by ˆVu,i = 1 500 P500 t=1 ˆVu,ij and ˆVc,i = 1 500 P500 t=1 ˆVc,ij. And true variance V ( ˆfi) is approximated by the empirical variance of ˆfi,j for j = 1, 2, . . . 500. Note that in practice we cannot calculate the true asymptotic variance, but the between-simulation variance can serve as a good

approximation. The variance ratio is deﬁned as ˆVu,i V ( ˆfi) and ˆVc,i V ( ˆfi), where a value close to 1 is ideal. We can see similar patterns across three tables. The original version of IJ produces highly biased variance estimates, where the bias diminishes as the number of trees B becomes larger. The bias-corrected version successfully alleviates the issue. For k = 100, it starts to produce reasonable estimates for 1000 trees, and the variance ratios are close to one for larger B values. We can also see that it becomes harder to estimate the variance as the subsample sizes grow.

Coverage probability: constructing 95% conﬁdence intervals by ˆfi,j 1.96 q

or ˆfi,j 1.96 q

ˆVc,ij for j = 1, 2, . . . 500 in each setting and we can calculate a coverage probability by checking whether the expected prediction value (approximated by ˆfi = 1 500 P500 t=1 ˆfi,j) falls into this interval. (Note that this does not assess coverage of an underlying θ kn = Eh kn.) This is strongly related to our results for variance ratios. A larger variance ratios will produce conservative intervals, thus generating higher coverage probability. The bias-corrected algorithm produces coverage probability close to 0.95 with reasonable number of base learners.

3. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html

Zhou et al.

p1 p2 p3 original corrected original corrected original corrected

normality 3.5285 (0.1713) 0.2274 (0.8925) 0.6600 (0.7189) var ratio 7.6985 1.4962 8.3871 1.5648 5.7719 1.2725 coverage 100.0 97.2 100.0 96.4 100.0 95.2

normality 1.2341 (0.5395) 1.1715 (0.5567) 1.2146 (0.5448) var ratio 4.9540 1.3839 4.9290 1.3116 3.8187 1.2048 coverage 100.0 96.4 100.0 96.6 99.8 96.0

normality 1.9708 (0.3733) 2.8710 (0.2380) 0.1502 (0.9276) var ratio 2.3068 1.0328 2.5635 1.1058 2.1548 1.0770 coverage 99.2 94.2 99.8 94.6 99.4 94.4

normality 3.2266 (0.1992) 0.8750 (0.6456) 2.7195 (0.2567) var ratio 1.7073 1.0305 1.7748 1.0460 1.5446 1.0138 coverage 98.0 95.0 99.2 94.8 97.3 93.0

(a) k = 100 p1 p2 p3 original corrected original corrected original corrected

normality 0.0309 (0.9846) 2.9390 (0.2300) 1.1261 (0.5695) var ratio 9.2069 2.6342 9.1624 2.5651 7.0298 2.1752 coverage 100.0 99.2 100.0 99.0 100.0 98.6

normality 0.6423 (0.7253) 0.4018 (0.8180) 1.1071 (0.5749) var ratio 5.7815 2.0194 5.1873 1.7729 4.6717 1.7826 coverage 100.0 99.4 100.0 98.4 100.0 97.8

normality 8.7518 (0.0126) 3.4691 (0.1765) 0.9235 (0.6302) var ratio 2.6779 1.3200 2.5242 1.2031 2.1364 1.1433 coverage 99.8 95.0 99.4 95.4 98.8 94.8

normality 2.0317 (0.3621) 2.2734 (0.3209) 0.7299 (0.6942) var ratio 1.8688 1.1504 1.8303 1.1103 1.5598 1.0553 coverage 98.8 95.0 99.0 94.2 96.8 92.0

(b) k = 250 p1 p2 p3 original corrected original corrected original corrected

normality 0.5147 (0.7731) 2.1454 (0.3421) 2.3230 (0.3130) var ratio 10.5311 4.4362 11.3409 4.7405 7.3644 3.1930 coverage 100.0 99.8 100.0 100.0 100.0 99.4

normality 8.0832 (0.0176) 0.5806 (0.7481) 0.3763 (0.8285) var ratio 6.2113 2.8874 6.4252 2.9916 4.5268 2.2424 coverage 100.0 99.8 100.0 99.4 100.0 99.4

normality 1.0393 (0.5947) 2.6963 (0.2579) 0.7730 (0.6794) var ratio 2.8223 1.5995 3.3763 1.9137 2.4909 1.5236 coverage 99.8 96.2 99.4 97.8 99.0 96.4

normality 2.5268 (0.2827) 1.2847 (0.5361) 0.0623 (0.9693) var ratio 2.0838 1.405 1.9944 1.3492 1.6583 1.2001 coverage 98.4 94.8 98.2 95.8 96.4 93.6

(c) k = 500

Table 2: Asymptotic normality and variance estimation results for MARS function.

V -statistics and Variance Estimation

k = 100 k = 250 k = 500 original corrected original corrected original corrected

normality 0.10 0.00 0.15 var ratio 12.34(3.64) 1.41(0.36) 13.81(3.41) 2.32(0.53) 20.73(8.75) 5.08(1.98) coverage 100.0(0.0) 94.33(0.04) 100.0(0.0) 98.3(0.02) 100.0(0.0) 99.7(0.01)

normality 0.15 0.00 0.05 var ratio 7.12(2.07) 1.25(0.34) 7.62(2.10) 1.63(0.39) 11.07(4.87) 3.08(1.23) coverage 100.0(0.0) 95.0(0.05) 100.0(0.0) 96.9(0.03) 100.0(0.0) 98.7(0.02))

normality 0.10 0.05 0.05 var ratio 3.49(0.93) 1.10(0.28) 3.67(0.90) 1.21(0.25) 5.00(2.10) 1.78(0.06) coverage 100.0(0.0) 94.7(0.05) 100.0(0.0) 94.1(0.04) 99.6(0.02) 96.6(0.03)

normality 0.10 0.00 0.05 var ratio 2.32(0.63) 1.08(0.28) 2.31(0.54) 1.06(0.22) 2.93(1.08) 1.34(0.39) coverage 99.2(0.02) 94.0(0.05) 99.3(0.01) 91.7(0.05) 98.7(0.03) 93.3(0.02)

Table 3: Asymptotic normality and variance estimation results for Protein Tertiary Structure across 20 test samples.

In order to see how our results work in real world settings, we pick a relatively large scale dataset: Physicochemical Properties of Protein Tertiary Structure Data Set4. The data set contains 45730 samples with 9 covariates and the target is the size of the residue. To simulate the situation where one can attain alternative training data drawn from the same data generating distribution to quantify sampling uncertainty, we randomly select 45000 samples and partition them to 45 sets with 1000 in each. These 45 sample sets act as independent draws from the unknown data generating distribution. As in the previous experiment, we run for 45 iterations with subsample size k = 100, 250, 500 and the number of trees B = 500, 1000, 2500, 5000. We treat the average of the resulting 45 random forests as the target for inference. We evaluate normality, variance ratio and coverage probability across 20 randomly selected test points. In Table 3 we report the average rejection percentage for the normality test, average variance ratio and average coverage probability. Numbers in the parentheses denote standard deviations. We observe similar patterns as in the MARS setting.

8. Conclusion

In this paper, we present a framework for analyzing the asymptotics of V -statistics where the kernel size kn grows with the number of samples n. It is shown that a central limit theorem can be established similar to the work in Mentch and Hooker (2016), Wager and Athey (2018) and Peng et al. (2019), which focus on the case of U-statistics. The result brings new insight into the analysis of ensemble methods. We also provide uniﬁed treatment of variance estimation in both Uand V -statistics. We observe that existing methods for estimating the limiting variance exhibit severe bias and would require a prohibitively large number of base learners to achieve accurate results, hindering any practical applications such as constructing conﬁdence intervals or conducting

4. https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+ Structure

Zhou et al.

hypothesis tests. To this end, we propose a new method called Balanced Variance Estimation Method (BM), and carefully analyze its connection to other methods. In particular, we demonstrate an equivalence between BM and Inﬁnitesimal Jackknife. Additionally, a bias correction method is developed which is shown to produce more accurate variance estimation with a moderate size of base learners. Practically, we would suggest sampling with replacement in building ensembles since the bias correction for V -statistics is theoretically sound and much less involved. What s more, it appears that the asymptotic normality for V -statistics holds across a broarder spectrum compared to its U-statistics counterpart (Appendix H). We speculate that the relative insensitivity of V-statistics performance to subsample size results from averaging over trees with diﬀerent numbers of unique samples. More generally, we might speculate that our equivalent kernel representation in fact admits an approximation in terms of Ustatistics of somewhat lower orders. For example, the weight given to the value of the kernel with kn unique data points is asymptotically negligible. This analysis may lead to better understanding of these properties and the rates at which V -statistics subsamples can be allowed to grow. From another theoretical point of view, the analysis we provide here is essentially a reduction to U-statistics. We will further explore whether other approaches like Taylor expansion using diﬀerential methods Serﬂing (2009) could be applied to attain similar results. It would also be valuable to see if the results presented in this paper could be extended to high dimensional cases.

Acknowledgments

All three authors were partially supported by NSF grant DMS-1712554. LM was partially funded by NSF grant DMS-1712041.

V -statistics and Variance Estimation

Appendix A. Equivalence between V - and U-statistics

We ﬁrst show that the asymptotic behavior of Vn,kn is the same as that of Un,kn, provided kn = o(n 1 4 ). The following important lemma relates Vn,kn to a family of U-statistics, which is a simple extension from Theorem 1 in (Lee, 1990, p.183) to the case where the kernel size kn is changing with n.

Lemma 7 Let Vn,kn be a complete, inﬁnite order V -statistic based on a permutation symmetric kernel hkn of degree kn as deﬁned in (8). Then we may write

Vn,kn = n kn kn X

j=1 j!S(j) kn

where U(j) n is a U-statistic of degree j. The kernel φ(j) of U(j) n is given by

φ(j) (z1, . . . , zj) = j!S(j) kn

(j) hkn zi1, . . . , zikn

where the sum P (j) is taken over all kn-tuples (i1, . . . , ikn) formed from {1, 2, . . . , j} having

exactly j indices distinct, and the quantities S(j) kn are Stirling numbers of the second kind Rennie and Dobson (1969).

Intuitively, as n grows, if kn grows slowly enough, Vn,kn should behave like Un,kn, as the diﬀerence brought by sampling with or without replacement becomes negligible. Theorem 8 extends a result in Shieh (1994) which makes this argument rigorous.

Theorem 8 Suppose hkn H, kn = o(n 1 4 ) and limn Var( n Un,kn) > 0. Then Vn,kn and Un,kn have the same asymptotic distribution.

Remark 9 This theorem only states that asymptotically Vn,kn and Un,kn are indistinguishable. The assumption that limn Var( n Un,kn) > 0 simply indicates that the rate of convergence for Un,kn is n. Theorem 8 may possibly hold under other regimes, such as with degenerate kernels, where the convergence rate is not n, but this is out of the scope of this paper.

As in Equation (3), by averaging only Bn < nkn set of subsamples we have an incomplete, inﬁnite order V -statistic

Vn,kn,Bn = 1

i hkn Zi1, . . . , Zikn (15)

where {Zi1, . . . , Zik} is again drawn with replacement from {Z1, . . . , Zn}. Under some regularity conditions, similar asymptotic results as Theorem 1 in Mentch and Hooker (2016); Theorem 1 in Peng et al. (2019) can be shown.

Zhou et al.

Theorem 10 Let Z1, Z2, . . . , Zn iid FZ and let Vn,kn,Bn be an incomplete, inﬁnite order V -statistic with kernel hkn. Let θkn = Ehkn(Z1, . . . , Zkn) such that hkn H. Then under the assumptions that kn = o(n 1 4 ), limn k2 nζ1,kn > 0 and limn ζkn,kn nζ1,kn 0, we have

(Vn,kn,Bn θkn) q

n ζ1,kn + 1 Bn ζkn,kn

In the complete case where Bn = nkn, we have

(Vn,kn θkn) q

Note that the ﬁrst two assumptions kn = o(n 1 4 ) and limn k2 nζ1,kn > 0 ensure that we can apply Theorem 8. The proof requires an additional lemma and is collected together in Appendix B.2.

Appendix B. Proofs

B.1 Proof of Theorem 8

Proof By Slutsky s theorem, we only need to show (Vn,kn Un,kn)

p 0. Since we assume

limn Var( n Un,kn) > 0, it suﬃces to prove n(Vn,kn Un,kn) p 0. We seek to prove L1

convergence, which implies convergence in probability. According to Lemma 7, Vn,kn could be written as

j!S(j) kn n j

= kn!S(kn) kn n kn

nkn Un,kn +

j!S(j) kn n j

= n(n 1) . . . (n kn + 1)

nkn Un,kn +

j!S(j) kn n j

nkn U(j) n .

Since we assume the second moment of kernel h is bounded, U(j) n could also be bounded by a constant C < . We have

E n (Vn,kn Un,kn) = n E

n(n 1) . . . (n kn + 1)

nkn 1 Un,kn +

j!S(j) kn n j

n(n 1) . . . (n kn + 1)

nkn 1 Un,kn

j!S(j) kn n j

n n(n 1) . . . (n kn + 1)

j!S(j) kn n j

V -statistics and Variance Estimation

First it s easy to see

n n(n 1) . . . (n kn + 1)

1 Pkn 1 i=1

as n when kn = o(n 1 4 ). An upper bound for S(j) kn is provided in Rennie and Dobson (1969)

j!S(j) kn n j

nkn = j! n j

nj S(j) kn nkn j

nj kkn j n kkn j n nkn j

Let an = k2 n n, and we know an 0. Taking the sum yields

j!S(j) kn n j

j=1 akn j n

We could conclude that E| n(Vn,kn Un,kn)| 0.

Zhou et al.

B.2 Proof of Theorem 10

Since kn = o(n 1 4 ) and limn k2 nζ1,kn > 0, the complete case follows directly from Theorem 8 and Theorem 1 in Peng et al. (2019). We will need the following lemma for the incomplete case.

Lemma 11 Let a1, a2, . . . be a sequence of constants such that limn 1

n Pn i=1 ai = 0 and limn 1

n Pn i=1 a2 i = σ2 and let random variables M1, . . . , Mn have a multinomial distribution, multinomial(Bn; 1

n, . . . , 1

n). Then as Bn, n , the limiting distribution of

is N(0, σ2).

Proof The characteristic function of (M1, . . . , Mn) is ( 1

neit1 + . . . + 1

neitn)Bn since it s

multinomial(Bn; 1

n, . . . , 1

n). Thus the characteristic function of B 1

2 n Pn i=1 ai Mi Bn

n is given by

2 n Pn i=1 ai(Mi Bn

n ) = e it B 1

n Pn i=1 ai E eit B 1

2 n Pn i=1 ai Mi

= e it an B 1 2 n 1

2 n + . . . + 1

= e it an B 1 2 n 1 n

i=1 a2 i + . . .

= e it an B 1 2 n

2 + o B 1 n !Bn

where an = 1

n Pn i=1 ai and σ2 n = 1

n Pn i=1 a2 i . Taking the logarithm gives

log E eit B 1

2 n Pn i=1 ai(Mi Bn

n ) = it an B

1 2n + Bn log

2 + o B 1 n !

2 σ2 n a2 n it B 1

2 σ2 n a2 n t2 + o (1) .

Since we assume tht an 0 and σ2 n σ2, the above quantity converges to 1

2σ2t2, which is the logarithm of the characteristic function of N(0, σ2).

Now we could prove the major part of Theorem 10. Proof Without loss of generality we will assume θkn = 0. Suppose (M1, . . . , Mnkn) have a multinomial distribution, multinomial (Bn; 1 nkn , . . . , 1 nkn ). We could rewrite Vn,kn,Bn as

V -statistics and Variance Estimation

Vn,kn,Bn = 1

i hkn Zi1, . . . , Zikn

i=1 Mihkn Zi1, . . . , Zikn

hkn Zi1, . . . , Zikn

hkn Zi1, . . . , Zikn + 1 nkn

i=1 hkn Zi1, . . . , Zikn

hkn Zi1, . . . , Zikn + Vn,kn.

To show Vn,kn,Bn r

d N(0, 1), it is equivalent to prove

it Vn,kn,Bn q

n ζ1,kn + 1 Bn ζkn,kn

From the above decomposition of Vn,kn,Bn, we have

it Vn,kn,Bn q

n ζ1,kn + 1 Bn ζkn,kn

Bn Pnkn i=1 (Mi Bn nkn )hkn(Zi1, . . . , Zikn ) + Vn,kn) q

n ζ1,kn + 1 Bn ζkn,kn

Bn Pnkn i=1 (Mi Bn nkn )hkn(Zi1, . . . , Zikn ) + Vn,kn) q

n ζ1,kn + 1 Bn ζkn,kn

|Z1, . . . , Zn

n ζ1,kn + 1 Bn ζkn,kn

1 Bn Pnkn i=1 (Mi Bn nkn )hkn(Zi1, . . . , Zikn ) q

n ζ1,kn + 1 Bn ζkn,kn

|Z1, . . . , Zn

Zhou et al.

Since Vn,kn q

d N(0, 1) and by Lemma 11,

it Vn,kn,Bn q

n ζ1,kn + 1 Bn ζkn,kn

n ζ1,kn + 1 Bn ζkn,kn

1 Bn Pnkn i=1 (Mi Bn nkn )hkn(Zi1, . . . , Zikn ) q

n ζ1,kn + 1 Bn ζkn,kn

|Z1, . . . , Zn

k2 n n ζ1,kn

n ζ1,kn + 1 Bn ζkn,kn t2

1 Bn ζkn,kn

n ζ1,kn + 1 Bn ζkn,kn t2

B.3 Proof of Theorem 1

Proof By Equation (9), the complete case follows directly from Theorem 1 in Peng et al. (2019). The incomplete case follows exactly the same proof as Theorem 10.

B.4 Proof of Theorem 3

In the case of Balanced Subsample Structure where rn = Bn kn

n , we have m = h and Ni = rn for all i.

First we can rewrite ˆζBM 1,kn as

ˆζBM 1,kn = 1 n 1

i=1 (mi m)2

b=1 ωi,bhb h

= 1 n 1 1 r2n

b=1 Ni,b hb h !2

V -statistics and Variance Estimation

Then for ˆVIJ = Pn i=1 cov2(Ni,b, hb), we look at each individual term

cov (Ni,b, hb) = PBn b=1(Ni,b Ni)(hb h)

Ni,b Ni hb b + X

Ni,b Ni hb b

b,Zi b Ni,b hb h kn

b,Zi b Ni,b hb h

b=1 Ni,b hb h

where Zi denotes the ith training sample. Combining two previous identities

i=1 cov2(Ni,b, hb)

b=1 Ni,b(hb b)

= (n 1)r2 n B2n ˆζBM 1,kn

n k2 n n ˆζBM 1,kn

as claimed.

B.5 Proof of Theorem 6

Proof The assumption limn k2 nζ 1,kn > 0 implies limn Var( n V n,kn,Bn,ω) > 0. We ﬁrst show the complete case. Similar to the proof of Theorem 2 in Mentch and Hooker (2016), we have

E(Vn,kn,ω V n,kn,ω)2 = 1 (nkn)2 E X

hkn Zi1, . . . , Zikn; ω Eωhkn Zi1, . . . , Zikn; ω 2 .

Zhou et al.

n(Vn,kn,ω V n,kn,ω) q

Var( n V n,kn,ω)

= lim n E n

nkn 1 Var( n V n,kn,ω) 1 nkn X

i E hkn Zi1, . . . , Zikn; ω Eωhkn Zi1, . . . , Zikn; ω 2

since supkn E hkn Zi1, . . . , Zikn; ω Eωhkn Zi1, . . . , Zikn; ω < and limn Var( n V n,kn,ω) > 0. The incomplete case follows exactly as in Mentch and Hooker (2016).

Appendix C. Additional Simulation Results

Simulations in this section are based on a simple setting where X 20 unif(0, 1) and Y = 2X + N(0, 1). The number of training observations n = 500. The model is an ensemble of decision trees. Figure 4 and 5 displays variance estimation by IM, BM and IJ for kernel size kn = 250 and 400. Figure 6 and Figure 7 shows the estimated values for each variance components ζ1,kn and ζkn,kn for kn = 100. Since IJ does not target at ζ1,kn directly, we rescaled the estimated by a factor k2 n n according to Theorem 3. The estimators for ζkn,kn for the three methods shown are essentially the same as they are all calculating the variance across all base learners predictions.

Appendix D. Derivations on Bias-corrected Estimator

Our goal is to provide an estimator of ζ1,kn based on expression given in (10)

ζ1,kn = var (E (hkn (Z1, . . . , Zkn) |Z1 = z1)) .

To simplify notations, we introduce a more general mathematical representation. Consider a random variable X and its conditional distribution given a random variable Z, denoted by M = E(X|Z). We want to estimate the variance of σ2 M = Var(M). Use FZ and FX|Z to denote the distribution for Z and the conditional distribution X given Z respectively. Consider the following sampling framework: for k = 1, . . . , K:

1. Sample Zk randomly from FZ.

2. For j = 1, . . . , nk: Sample Xkj randomly from FX|Z=Zk.

V -statistics and Variance Estimation

(a) Subsampling with replacement (V -statistics).

(b) Subsampling without replacement (U-statistics).

Figure 4: Variance estimation by three diﬀerent methods: IM, BM and IJ. The kernel size kn = 250. The variance shown is for prediction at test point x = 10. The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions.

Zhou et al.

(a) Subsampling with replacement (V -statistics).

(b) Subsampling without replacement (U-statistics).

Figure 5: Variance estimation by three diﬀerent methods: IM, BM and IJ. The kernel size kn = 400. The variance shown is for prediction at test point x = 10. The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions.

V -statistics and Variance Estimation

(a) Subsampling with replacement (V -statistics).

(b) Subsampling without replacement (U-statistics).

Figure 6: ζ1,kn estimated by three diﬀerent methods: IM, BM and IJ. The kernel size kn = 100. The variance shown is for prediction at test point x = 10.

Zhou et al.

(a) Subsampling with replacement (V -statistics).

(b) Subsampling without replacement (U-statistics).

Figure 7: ζkn,kn estimated by three diﬀerent methods: IM, BM and IJ. The kernel size kn = 100. The variance shown is for prediction at test point x = 10.

V -statistics and Variance Estimation

We ll use the collections of samples Xkj (k = 1, . . . , K, j = 1, . . . , nk) to provide an estimator for σM. Deﬁne C = PK k=1 nk, σ2 ϵ = E(Var(X|Z)) and the following two sum of squares

k=1 nk( Xk X)2,

j=1 (Xkj Xk)2

where X = 1 C PK k=1 nk Xk, Xk = 1 nk Pnk j=1 Xkj. Following the calculations in Sun et al. (2011), we have

i=1 n2 i /C

σ2 M + (K 1) σ2 ϵ ,

E(SSϵ) = (C K)σ2 ϵ .

Thus we can get the estimator for σ2 M as

ˆσ2 M = SSτ (K 1)ˆσ2 ϵ C PK i=1 n2 i /C

ˆσ2 ϵ = SSϵ C K .

The unbiasedness of these estimators is shown by Searle et al. (2009). By setting Z = Z1 and X = hkn (Z1, . . . , Zkn) gives the estimator presented in Section 5.2.

Appendix E. An Alternative Version of Bias Correction

Our subsampling methods choose each data point with equal probability and we thus expect to obtain an approximately balanced subsample. For simplicity, our derivation assumes this structure holds exactly. Recall that we have N1 = N2 = . . . = Ni = rn, then

ˆσ2 ϵ = SSϵ C n = 1 n(rn 1)

b=1 Ni,b(hb mi)2,

ˆζ1,kn = 1 n 1

i=1 (mi m)2 1

We could rewrite ˆσ2 ϵ as

Zhou et al.

ˆσ2 ϵ = 1 n(rn 1)

b=1 Ni,b(hb mi)2

= 1 n(rn 1)

b=1 Ni,b(hb h + h mi)2

= 1 n(rn 1)

b=1 Ni,b(hb h)2 + 1 n(rn 1)

b=1 Ni,b( h mi)2

+ 2 n(rn 1)

b=1 Ni,b(hb h)( h mi)

= 1 n(rn 1)

b=1 (hb h)2 n X

i=1 Ni,b + 1 n(rn 1)

i=1 ( h mi)2 Bn X

+ 2 n(rn 1)

i=1 ( h mi)

b=1 Ni,b(hb h)

= 1 n(rn 1)

b=1 (hb h)2kn + 1 n(rn 1)

i=1 ( h mi)2rn

+ 2 n(rn 1)

i=1 ( h mi)rn(mi h)

= kn n(rn 1)

b=1 (hb h)2 rn n(rn 1)

i=1 ( h mi)2

= kn n(rn 1)

b=1 (hb h)2 rn n(rn 1)

i=1 (mi m)2.

Plug this into the expression for ˆζ1,kn, we have

V -statistics and Variance Estimation

ˆζ1,kn = 1 n 1

i=1 (mi m)2 1

i=1 (mi m)2 1

hb h 2 rn n(rn 1)

i=1 (mi m)2 !

= 1 n 1 1 n(rn 1)

i=1 (mi m)2 kn rnn(rn 1)

= 1 n 1 1 n(rn 1)

i=1 (mi m)2 kn rnn(rn 1) (Bn 1) ˆζBM kn,kn

i=1 (mi m)2 1

n kn ˆζBM kn,kn.

The approximation in the last line holds as long as rn grows with n, which is a reasonable assumption in most cases.

Appendix F. Bias Correction for U-statistics

Simulations in this section are based on a simple setting where X 20 unif(0, 1) and Y = 2X +N(0, 1). The number of training observations n = 500. The model is an ensemble of decision trees built under the framework of U-statistics: each tree is constructed using subsamples without replacement. Figure 8 shows the result for U-statistics by employing the correction by Wager and Athey (2018). Figure 9 shows the result of corrected-V developed in Section 5.2 applied to U-statistics. For simplicity, we will use the simpler but approximate variance estimation described in Appendix E

ˆζ1,kn = 1 n 1

i=1 (mi m)2 1

n kn ˆζBM kn,kn.

We ﬁnd that if we scale the correction term by n kn

n , and include the correction term in Wager and Athey (2018), it works for U-statistics empirically. The estimator for ζ1,kn for U-statistics is

ˆζU 1,kn = n(n 1)

i=1 (mi m)2 1

kn ˆζBM kn,kn

The blue bars (denoted by corrected-U) in Figure 10 shows the result. We can see that by combining both correction terms, the estimator yields stable and accurate variance estimation. How to theoretically analyze bias correction for U-statistics remains a promising future endeavor.

Zhou et al.

(a) kn = 100.

(b) kn = 250.

(c) kn = 400.

Figure 8: Variance Estimation by three diﬀerent methods: corrected-IJ, BM and IJ. The kernel size kn = 100, 250, 400. The variance shown is for prediction at test point x = 10. The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions.

V -statistics and Variance Estimation

Figure 9: Variance estimation by three diﬀerent methods: corrected-V, BM and IJ. The variance shown is for prediction at test point x = 10. The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions.

Appendix G. A Closer Look at Variance Components

A major diﬀerence between our work and Wager and Athey (2018) is that we also take into account the eﬀect of Monte Carlo eﬀect brought by the number of base learners Bn. Thus our variance has two components, where the ﬁrst part k2 n n ζ1,kn corresponds to the complete case and the second part 1 Bn ζkn,kn is the additional Monte Carlo variance introduced due to the incomplete case.

One can imagine that for smaller Bn, the second part of Monte Carlo variance might be much larger than the ﬁrst part, while as Bn gets larger the eﬀect diminishes and k2 n n ζ1,kn becomes the dominating one. We conduct an experiment to visualize this transition. As before, let X 20 unif(0, 1) and Y = 2X + N(0, 1). The model is an ensemble of decision trees built under the framework of V -statistics. We ﬁx the number of training observations n = 1000 and kernel size kn = 10 and build the ensembles with Bn = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000. For each Bn, 100 models are built and we calculate empirical variance of the predictions at test point x = 10. This procedure is repeated 10 times and we report the average of empirical variance.

Figure 11 shows four lines: the empirical variance; the two variance components k2 n n ζ1,kn and 1 Bn ζkn,kn; the total estimated variance which is simply the sum k2 n n ζ1,kn + 1 Bn ζkn,kn. We estimate ζ1,kn and ζkn,kn using an ensemble of size Bn = 1000. The dotted black line aligns well with the black line, which indicates that our variance estimates give accurate results. For small Bn = 100, each observation is expected to only appear once in the ensemble (since rn = kn Bn

n = 1), and as a result base learners will be approximately independent. In this case, the variance of the ensemble prediction should mainly come from 1 Bn ζkn,kn. When Bn grows, dependence between some base learners kicks in and the eﬀect of k2 n n ζ1,kn gradually becomes the dominating part as 1 Bn ζkn,kn decreases. This transition is depicted in Figure

Zhou et al.

(a) kn = 100.

(b) kn = 250.

(c) kn = 400.

Figure 10: Variance Estimation by three diﬀerent methods: corrected-U, BM and IJ. The kernel size kn = 100, 250, 400. The variance shown is for prediction at test point x = 10. The red line denotes true (log) variance obtained by generating data, training the ensemble 100 times and calculating the empirical variance of predictions.

V -statistics and Variance Estimation

Figure 11: Variance components for diﬀerent Bn. The number of training observations n = 1000 and kernel size kn = 10. The variance shown is for prediction at test point x = 10. Four lines shown are: empirical variance, two variance components (k2 n n ζ1,kn and 1 Bn ζkn,kn) and their sum as estimated variance.

11. Note that in practice the additional Monte Carlo variance introduced due to incomplete case is usually negligible as we would choose larger kn and Bn.

Appendix H. Additional Simulation Results on Normality for Ensembles

The experiment is conducted under the same setting of Section 7.2. Let y = f(x) = 10 sin(πx1x2) + 20(x3 0.05)2 + 10x4 + 5x5 + ϵ, where X U([0, 1]5) and ϵ N(0, 1).

We ﬁx B = 500 and vary subsample size k = 100, 200, 300, 400, 500. We randomly general 100 test points from U([0, 1]5). For each test point, we conduct 100 simulations where in each iteration we generate n = 500 training observations to ﬁt a random forests and record its prediction. Finally a normality test is conducted on 100 predictions of the same test point. The experiment is repeated for both V - and U-statistics. Table 4 shows the percentage of p-values falls below 0.05 (a normal hypothesis is rejected). We can see that in our setting normality for predictions generally hold for V -statistics across all kernel size, while for U-statistics it starts to break down for large kernel size.

Zhou et al.

V -statistics U-statistics 100 0.1 0.07 200 0.06 0.05 300 0.06 0.04 400 0.05 0.11 500 0.07 0.18

Table 4: Normality Test for Ensembles.

Appendix I. Variance Estimation for V -statistics and Its Implications

We prove the theoretical asymptotics for general V -statistics utilizing a composite kernel h kn, which is infeasible to evaluate in practice. Thus it remains a challenge to quantify its variance directly which involves ζ 1,kn, and we use Inﬁnitesimal Jackknife as a workaround. IJ was initially developed for computing standard errors and conﬁdence interval in bagging (Efron, 2014). As a general tool, IJ does not rely on any speciﬁc variance expressions and is applied upon the original kernel hkn instead of h kn. Consistency of IJ under the framework of U-statistics was proved in Wager and Athey (2018); Ghosal and Hooker (2020). The result in Theorem 3 establishes a connection between BM and IJ. Thus applying BM on the original kernel hkn of a V-statistic should yield valid variance estimates. This is how we calculate limiting variance in all of the empirical studies. Although we do not theoretically prove IJ is consistent for V -statistics, the empirical results in Table 2 show a promising sign. If this is the case, then we should have ζ1,kn = ζ 1,kn for a general V -statistics under the conditions of Theorem 1. We leave this conjecture as a future work.

Appendix J. U-statistics Results for Section 7.2

This section shows the results for U-statistics under the same setting of Section 7.2. Table 5 presents the normality test, variance ratio and coverage for the same three test points as in Table 2 using the original BM variance estimation method. This further supports our previous ﬁnding that for U-statistics a prohibitive number of base learner is needed for valid inference; while for larger kernel sizes, normality starts to break down and as a result the estimated variance is no longer valid. Corresponding U-statistics result for Protein Tertiary Structure data is shown in Table 6.

Appendix K. Datasets Information

Six of the seven datasets in Table 7.1 are taken from UCI Machine Learning Repository5:

boston: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/. The dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. This is a regression task to predict median value of owner-occupied homes in $1000 s.

5. https://archive.ics.uci.edu/ml/index.php

V -statistics and Variance Estimation

p1 p2 p3 original original original

normality 2.4464 (0.2943) 1.7276 (0.4216) 3.9023 (0.1421) var ratio 5.6619 6.1710 4.1774 coverage 100.0 100.0 99.8

normality 4.0015 (0.1352) 0.7224 (0.6968) 0.0999 (0.9513) var ratio 3.5290 3.6977 2.7600 coverage 99.8 100.0 99.8

normality 1.6279 (0.4431) 4.2958 (0.1167) 0.4680 (0.7914) var ratio 1.6725 1.8111 1.6239 coverage 98.2 99.0 97.2

normality 3.8553(0.1455) 0.7108 (0.7009) 2.1749 (0.3371) var ratio 1.2396 1.3060 1.1624 coverage 96.6 98.2 94.6

(a) k = 100 p1 p2 p3 original original original

normality 3.8301 (0.1473) 3.0423 (0.2185) 1.3942 (0.4980) var ratio 3.2972 2.9889 2.3566 coverage 99.6 100.0 99.0

normality 5.9749 (0.0504) 0.3142 (0.8546) 0.0418 (0.9793) var ratio 1.9543 1.7152 1.6739 coverage 98.4 98.6 97.2

normality 7.4631 (0.0239) 2.0683 (0.0.3555) 2.3659 (0.3064) var ratio 0.9110 0.8726 0.7628 coverage 91.8 92.0 89.6

normality 6.2611 (0.0437) 1.4629 (0.4812) 1.9362 (0.3798) var ratio 0.5918 0.6795 0.5756 coverage 85.4 87.6 83.4

(b) k = 250 p1 p2 p3 original original original

normality 28.9413 (5.19e-07) 1.7923 (0.4081) 7.0993 (0.0287) var ratio 12.27e-05 17.39e-05 16.13e-05 coverage 1.4 1.4 0.4

normality 11.3388 (0.0034) 3.8287 (0.1474) 8.5205 (0.0141) var ratio 9.03e-05 9.63e-05 8.48e-05 coverage 1.2 0.6 1.4

normality 2.9161 (0.2327) 0.3811 (0.8265) 13.3368 (0.0013) var ratio 3.17e-05 3.63e-05 3.48e-05 coverage 0.6 0.6 0.6

normality 5.2564 (0.7222) 4.0334 (0.1331) 10.9238 (0.0042) var ratio 1.63e-05 1.91e-05 1.65e-05 coverage 0.4 0.4 0.6

(c) k = 500

Table 5: Asymptotic normality and variance estimation results for MARS function: Ustatistics. 45

Zhou et al.

k = 100 k = 250 k = 500 original original original

normality 0.10 0.15 0.10 var ratio 10.77(2.65) 13.20(5.70) 6.88(1.73) coverage 100.0(0.0) 100.0(0.0) 99.88(0.00)

normality 0.15 0.15 0.15 var ratio 6.03(1.49) 6.95(2.46) 3.62(1.01) coverage 100.0(0.0) 100.0(0.0) 99.33(0.01)

normality 0.10 0.20 0.15 var ratio 3.07(0.77) 3.29(1.20) 1.67(0.40) coverage 99.33(0.01) 99.7(0.01) 96.8(0.03)

normality 0.20 0.15 0.15 var ratio 2.00(0.46) 2.05(0.72) 1.01(0.24) coverage 98.22(0.02) 97.67(0.03) 91.0(0.05)

Table 6: Asymptotic normality and variance estimation results for Protein Tertiary Structure across 20 test samples: U-statistics.

diabetes: https://archive.ics.uci.edu/ml/datasets/diabetes. The attributes are diabetes patient records and the target is an integer between 25 and 346. We simply cast it as a regression problem.

iris: https://archive.ics.uci.edu/ml/datasets/Iris. This is a classiﬁcation problem. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

digits: https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+ digits. A classiﬁcation task to predict integers from 0 to 9 with 64 attributes.

retinopathy: https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+ Debrecen+Data+Set. This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not.

breast cancer: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+ (Diagnostic). This is a classiﬁcation task to predict whether the diagnosis is malignant or benign based on features computed from a digitized image of a ﬁne needle aspirate (FNA) of a breast mass.

Susan Athey, Julie Tibshirani, Stefan Wager, et al. Generalized random forests. The Annals

of Statistics, 47(2):1148 1178, 2019.

Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001.

Andreas Buja and Werner Stuetzle. Observations on bagging. Statistica Sinica, pages 323 351, 2006.

V -statistics and Variance Estimation

Yifan Cui, Ruoqing Zhu, Mai Zhou, and Michael Kosorok. Consistency of survival tree and forest models: splitting bias and correction. ar Xiv preprint ar Xiv:1707.09631, 2017.

Ralph D Agostino and Egon S Pearson. Tests for departure from normality. empirical results for the distributions of b2 and b1. Biometrika, 60(3):613 622, 1973.

Ralph B D Agostino. An omnibus test of normality for moderate and large size samples.

Biometrika, 58(2):341 348, 1971.

Bradley Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.

Bradley Efron. Estimation and accuracy after model selection. Journal of the American

Statistical Association, 109(507):991 1007, 2014.

Edward W Frees. Inﬁnite order u-statistics. Scandinavian Journal of Statistics, pages 29 45, 1989.

Jerome H Friedman. Multivariate adaptive regression splines. The annals of statistics, pages 1 67, 1991.

Indrayudh Ghosal and Giles Hooker. Boosting random forests to reduce bias; one-step boosted forest and its variance estimate. Journal of Computational and Graphical Statistics, pages 1 10, 2020.

Takashi Goda. Computing the variance of a conditional expectation via non-nested monte carlo. Operations Research Letters, 45(1):63 67, 2017.

Paul R Halmos. The theory of unbiased estimation. The Annals of Mathematical Statistics, pages 34 43, 1946.

Wassily Hoeﬀding. A class of statistics with asymptotically normal distribution. The Annals

of Mathematical Statistics, pages 293 325, 1948.

Giles Hooker and Lucas Mentch. Bootstrap bias corrections for ensemble methods. Statistics

and Computing, 28(1):77 86, 2018.

Louis A Jaeckel. The inﬁnitesimal jackknife. Bell Telephone Laboratories, 1972.

Justin Lee. U-statistics: Theory and Practice. Citeseer, 1990.

Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via conﬁdence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841 881, 2016.

Wei Peng, Tim Coleman, and Lucas Mentch. Asymptotic distributions and rates of convergence for random forests and other resampled ensemble learners. ar Xiv preprint ar Xiv:1905.10651, 2019.

Basil Cameron Rennie and Annette Jane Dobson. On stirling numbers of the second kind.

Journal of Combinatorial Theory, 7(2):116 121, 1969.

Zhou et al.

Erwan Scornet, G erard Biau, Jean-Philippe Vert, et al. Consistency of random forests. The

Annals of Statistics, 43(4):1716 1741, 2015.

Shayle R Searle, George Casella, and Charles E Mc Culloch. Variance components, volume 391. John Wiley & Sons, 2009.

PK Sen. Introduction to hoeﬀding (1948) a class of statistics with asymptotically normal distribution. In Breakthroughs in statistics, pages 299 307. Springer, 1992.

Robert J Serﬂing. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009.

Joseph Sexton and Petter Laake. Standard errors for bagged and random forest estimators.

Computational Statistics & Data Analysis, 53(3):801 811, 2009.

Grace S Shieh. Inﬁnite order v-statistics. Statistics & Probability Letters, 20(1):75 80, 1994.

Yanglei Song, Xiaohui Chen, Kengo Kato, et al. Approximating high-dimensional inﬁniteorder u-statistics: Statistical and computational guarantees. Electronic Journal of Statistics, 13(2):4794 4848, 2019.

Jeremy Staum. Monte carlo computation in ﬁnance. In Monte Carlo and Quasi-Monte

Carlo Methods 2008, pages 19 42. Springer, 2009.

Yunpeng Sun, Daniel W Apley, and Jeremy Staum. Eﬃcient nested simulation for estimating the variance of a conditional expectation. Operations research, 59(4):998 1007, 2011.

Ralph von Mises. On the asymptotic distribution of diﬀerentiable statistical functions. The

annals of mathematical statistics, 18(3):309 348, 1947.

Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment eﬀects using random forests. Journal of the American Statistical Association, 113(523):1228 1242, 2018.

Stefan Wager, Trevor Hastie, and Bradley Efron. Conﬁdence intervals for random forests: The jackknife and the inﬁnitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625 1651, 2014.

Qing Wang and Bruce Lindsay. Variance estimation of a general u-statistic with application to cross-validation. Statistica Sinica, pages 1117 1141, 2014.

Yichen Zhou and Giles Hooker. Boulevard: Regularized stochastic gradient boosted trees and their limiting distribution. ar Xiv preprint ar Xiv:1806.09762, 2018.

Yichen Zhou, Zhengze Zhou, and Giles Hooker. Approximation trees: Statistical stability in model distillation. ar Xiv preprint ar Xiv:1808.07573, 2018.

Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.

Faker Zouaoui and James R Wilson. Accounting for parameter uncertainty in simulation input modeling. Iie Transactions, 35(9):781 792, 2003.