# Predictive Inference Is Free with the Jackknife+-after-Bootstrap

Byol Kim, Chen Xu, Rina Foygel Barber

Ensemble learning is widely used in applications to make predictions in complex decision problems, for example by averaging models fitted to a sequence of samples bootstrapped from the available training data. While such methods offer more accurate, stable, and robust predictions and model estimates, much less is known about how to perform valid, assumption-lean inference on the output of these types of procedures. In this paper, we propose the jackknife+-after-bootstrap (J+aB), a procedure for constructing a predictive interval, which uses only the available bootstrapped samples and their corresponding fitted models, and is therefore free in terms of the cost of model fitting. The J+aB offers a predictive coverage guarantee that holds with no assumptions on the distribution of the data, the nature of the fitted model, or the way in which the ensemble of models is aggregated: at worst, the failure rate of the predictive interval is inflated by a factor of 2. Our numerical experiments verify the coverage and accuracy of the resulting predictive intervals on real data.

1 Introduction

Ensemble learning is a popular technique for enhancing the performance of machine learning algorithms. It is used to capture a complex model space with simple hypotheses which are often significantly easier to learn, or to increase the accuracy of an otherwise unstable procedure [see 14, 27, 29, and references therein]. While ensembling can provide substantially more stable and accurate estimates, relatively little is known about how to perform provably valid inference on the resulting output. Particular challenges arise when the data distribution is unknown, or when the base learner is difficult to analyze. To consider a motivating example, suppose that each observation consists of a vector of features $X \in \mathbb{R}^p$ and a real-valued response $Y \in \mathbb{R}$. Even in an idealized scenario where we might be certain that the data follow a linear model, it is still not clear how we might perform inference on a bagged prediction obtained by, say, averaging the Lasso predictions on multiple bootstrapped samples of the data.

To address the problem of valid statistical inference for ensemble predictions, we propose a method for constructing a predictive confidence interval for a new observation that can be wrapped around existing ensemble prediction methods. Our method integrates ensemble learning with the recently proposed jackknife+ [2]. It is implemented by tweaking how the ensemble aggregates the learned predictions. This makes the resulting integrated algorithm output an interval-valued prediction that, when run at a target predictive coverage level of $1-\alpha$, provably covers the new response value at least a $1-2\alpha$ proportion of the time in the worst case, with no assumptions on the data beyond independent and identically distributed samples.

Byol Kim: Department of Statistics, The University of Chicago, Chicago, IL 60637, USA, byolkim@uchicago.edu. Chen Xu: H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA, cxu310@gatech.edu. Rina Foygel Barber: Department of Statistics, The University of Chicago, Chicago, IL 60637, USA, rina@uchicago.edu.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Our main contributions are as follows.
- We propose the jackknife+-after-bootstrap (J+aB), a method for constructing predictive confidence intervals that can be efficiently wrapped around an ensemble learning algorithm chosen by the user.
- We prove that the coverage of a J+aB interval is at worst $1-2\alpha$ under assumption-free theory. This lower bound is non-asymptotic, and holds for any sample size and any distribution of the data.
- We verify that the empirical coverage of a J+aB interval is actually close to $1-\alpha$.

2 Background and related work

Suppose we are given $n$ independent and identically distributed (i.i.d.) observations $(X_1, Y_1), \dots, (X_n, Y_n) \overset{\text{iid}}{\sim} P$ from some probability distribution $P$ on $\mathbb{R}^p \times \mathbb{R}$. Given the available training data, we would like to predict the value of the response $Y_{n+1}$ for a new data point with features $X_{n+1}$, where we assume that $(X_{n+1}, Y_{n+1})$ is drawn from the same probability distribution $P$. A common framework is to fit a regression model $\hat\mu : \mathbb{R}^p \to \mathbb{R}$ by applying some regression algorithm to the training data $\{(X_i, Y_i)\}_{i=1}^n$, and then predicting $\hat\mu(X_{n+1})$ as our best estimate of the unseen test response $Y_{n+1}$. However, the question arises: How can we quantify the likely accuracy or error level of these predictions? For example, can we use the available information to build an interval around our estimate $\hat\mu(X_{n+1})$ (some margin of error) that we believe is likely to contain $Y_{n+1}$?

More generally, we want to build a predictive interval $\hat C(X_{n+1}) \subseteq \mathbb{R}$ that maps the test features $X_{n+1}$ to an interval (or more generally, a set) believed to contain $Y_{n+1}$. Thus, instead of $\hat\mu : \mathbb{R}^p \to \mathbb{R}$, we would like our method to output $\hat C : \mathbb{R}^p \to \{\text{subsets of } \mathbb{R}\}$ with the property
$$\mathbb{P}\left[ Y_{n+1} \in \hat C(X_{n+1}) \right] \ge 1 - \alpha, \qquad (1)$$
where the probability is with respect to the distribution of the $n+1$ training and test data points (as well as any additional source of randomness used in obtaining $\hat C$). Ideally, we want $\hat C$ to satisfy (1) for any data distribution $P$. Such a $\hat C$ is said to satisfy distribution-free predictive coverage at level $1-\alpha$.

2.1 Jackknife and jackknife+

One of the methods that can output $\hat C$ with distribution-free predictive coverage is the recent jackknife+ of Barber et al. [2], which inspired our work. As suggested by the name, the jackknife+ is a simple modification of the jackknife approach to constructing predictive confidence intervals.

To define the jackknife and the jackknife+, we begin by introducing some notation. Let $\mathcal{R}$ denote any regression algorithm; $\mathcal{R}$ takes in a training data set, and outputs a model $\hat\mu : \mathbb{R}^p \to \mathbb{R}$, which can then be used to map a new $X$ to a predicted $Y$. We will write $\hat\mu = \mathcal{R}(\{(X_i, Y_i)\}_{i=1}^n)$ for the model fitted on the full training data, and will also write $\hat\mu_{\setminus i} = \mathcal{R}(\{(X_j, Y_j)\}_{j=1,\,j\neq i}^n)$ for the model fitted on the training data without the $i$-th point. Let $q^+_{\alpha,n}\{v_i\}$ and $q^-_{\alpha,n}\{v_i\}$ denote the upper and the lower $\alpha$-quantiles of a list of $n$ values indexed by $i$, that is to say,
$$q^+_{\alpha,n}\{v_i\} = \text{the } \lceil (1-\alpha)(n+1) \rceil\text{-th smallest value of } v_1, \dots, v_n, \quad\text{and}\quad q^-_{\alpha,n}\{v_i\} = -q^+_{\alpha,n}\{-v_i\}.$$
The jackknife prediction interval is given by
$$\hat C^{\rm J}_{\alpha,n}(x) = \hat\mu(x) \pm q^+_{\alpha,n}\{R_i\} = \left[ q^-_{\alpha,n}\{\hat\mu(x) - R_i\},\ q^+_{\alpha,n}\{\hat\mu(x) + R_i\} \right], \qquad (2)$$
where $R_i = |Y_i - \hat\mu_{\setminus i}(X_i)|$ is the $i$-th leave-one-out residual. This is based on the idea that the $R_i$'s are good estimates of the test residual $|Y_{n+1} - \hat\mu_{\setminus i}(X_{n+1})|$, because the data used to train $\hat\mu_{\setminus i}$ is independent of $(X_i, Y_i)$. Perhaps surprisingly, it turns out that fully assumption-free theory is impossible for (2) [see 2, Theorem 2].
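To make the quantile notation and the interval (2) concrete, here is a minimal NumPy sketch (ours, not the authors' code); the function names are illustrative, and the $+\infty$ return value corresponds to the case $\lceil(1-\alpha)(n+1)\rceil > n$, where the interval is unbounded.

```python
import numpy as np

def q_plus(values, alpha):
    """Upper alpha-quantile q^+_{alpha,n}: the ceil((1-alpha)(n+1))-th smallest of the n values.
    When ceil((1-alpha)(n+1)) > n, the theoretical quantile is +infinity."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    return np.inf if k > n else v[k - 1]

def q_minus(values, alpha):
    """Lower alpha-quantile q^-_{alpha,n}{v_i} = -q^+_{alpha,n}{-v_i}."""
    return -q_plus(-np.asarray(values, dtype=float), alpha)

def jackknife_interval(mu_x, residuals, alpha):
    """Equation (2): the jackknife interval mu(x) +/- q^+_{alpha,n}{R_i},
    written via q^- and q^+ exactly as in the displayed formula."""
    R = np.asarray(residuals, dtype=float)
    return q_minus(mu_x - R, alpha), q_plus(mu_x + R, alpha)
```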
By contrast, it is achieved by the jackknife+, which modifies (2) by replacing $\hat\mu$ with the $\hat\mu_{\setminus i}$'s:
$$\hat C^{\rm J+}_{\alpha,n}(x) = \left[ q^-_{\alpha,n}\{\hat\mu_{\setminus i}(x) - R_i\},\ q^+_{\alpha,n}\{\hat\mu_{\setminus i}(x) + R_i\} \right]. \qquad (3)$$
Barber et al. [2] showed that (3) satisfies distribution-free predictive coverage at level $1-2\alpha$. Intuitively, the reason that such a guarantee is impossible for (2) is that the test residual $|Y_{n+1} - \hat\mu(X_{n+1})|$ is not quite comparable with the leave-one-out residuals $|Y_i - \hat\mu_{\setminus i}(X_i)|$, because $\hat\mu$ always sees one more observation in training than $\hat\mu_{\setminus i}$ sees. The jackknife+ correction restores the symmetry, making assumption-free theory possible.

2.2 Ensemble methods

In this paper, we are concerned with ensemble predictions that apply a base regression method $\mathcal{R}$, such as linear regression or the Lasso, to different training sets generated from the training data by a resampling procedure. Specifically, the ensemble method starts by creating multiple training data sets (or multisets) of size $m$ from the available training data points $\{1, \dots, n\}$. We may choose the sets by bootstrapping (sampling $m$ indices uniformly at random with replacement; a typical choice is $m = n$), or by subsampling (sampling without replacement, for instance with $m = n/2$). For each $b$, the algorithm calls on $\mathcal{R}$ to fit the model $\hat\mu_b$ using the training set $S_b$, and then aggregates the $B$ predictions $\hat\mu_1(x), \dots, \hat\mu_B(x)$ into a single final prediction $\hat\mu_\varphi(x)$ via an aggregation function $\varphi$,⁴ typically chosen to be a simple function such as the median, mean, or trimmed mean. When $\varphi$ is the mean, the ensemble method run with bootstrapped $S_b$'s is referred to as bagging [5], while if we instead use subsampled $S_b$'s, then this ensembling procedure is referred to as subagging [8]. The procedure is formalized in Algorithm 1; a code sketch follows at the end of this subsection.

Algorithm 1 Ensemble learning
Input: data $\{(X_i, Y_i)\}_{i=1}^n$.
Output: ensembled regression function $\hat\mu_\varphi$.
for $b = 1, \dots, B$ do
  Draw $S_b = (i_{b,1}, \dots, i_{b,m})$ by sampling with or without replacement from $\{1, \dots, n\}$.
  Compute $\hat\mu_b = \mathcal{R}((X_{i_{b,1}}, Y_{i_{b,1}}), \dots, (X_{i_{b,m}}, Y_{i_{b,m}}))$.
end for
Define $\hat\mu_\varphi = \varphi(\hat\mu_1, \dots, \hat\mu_B)$.

Can we apply the jackknife+ to an ensemble method? While ensembling is generally understood to provide a more robust and stable prediction as compared to the underlying base algorithm, there are substantial difficulties in developing inference procedures for ensemble methods with theoretical guarantees. For one thing, ensemble methods are frequently used with highly discontinuous and nonlinear base learners, and aggregating many of them leads to models that defy an easy analysis. The problem is compounded by the fact that ensemble methods are typically employed in settings where good generative models of the data distribution are either unavailable or difficult to obtain. This makes distribution-free methods that can wrap around arbitrary machine learning algorithms, such as conformal prediction [36, 18], the split conformal method [24, 35, 18], or cross-validation and jackknife type methods [2], attractive, as they retain validity over any data distribution. However, when deployed with ensemble prediction methods, which often require a significant overhead from the extra cost of model fitting, the resulting combined procedures tend to be extremely computationally intensive, making them impractical in most settings. In the case of the jackknife+, if each ensembled model makes $B$ many calls to the base regression method $\mathcal{R}$, the jackknife+ would require a total of $Bn$ calls to $\mathcal{R}$. By contrast, our method will require only $O(B)$ many calls to $\mathcal{R}$, making the computational burden comparable to obtaining a single ensemble prediction.
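As an illustration of Algorithm 1, the following is a minimal Python sketch of the resample-fit-aggregate loop (our own illustration; `fit_base` is a hypothetical stand-in for the base algorithm $\mathcal{R}$ and is assumed to return a scikit-learn-style estimator). With the mean as the aggregation and sampling with replacement, this is bagging.

```python
import numpy as np

def fit_ensemble(X, y, fit_base, B=20, m=None, replace=True, rng=None):
    """Sketch of Algorithm 1: fit B base models on resampled index sets S_b.
    `fit_base(X_sub, y_sub)` stands in for the base algorithm R and should return an object
    with a `.predict` method. The index sets are returned as well, since the J+aB
    (Section 3) reuses them to form out-of-bag aggregates."""
    rng = np.random.default_rng(rng)
    n = len(y)
    m = n if m is None else m
    models, index_sets = [], []
    for _ in range(B):
        S_b = rng.choice(n, size=m, replace=replace)  # bootstrap (with replacement) or subsample
        models.append(fit_base(X[S_b], y[S_b]))
        index_sets.append(S_b)
    return models, index_sets

def predict_ensemble(models, X, aggregate=np.mean):
    """Aggregate the B predictions with phi; with the mean this is bagging."""
    preds = np.stack([mdl.predict(X) for mdl in models])  # shape (B, n_points)
    return aggregate(preds, axis=0)
```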
Table 1: Comparison of computational costs for obtaining $n_{\rm test}$ predictions

| | #calls to $\mathcal{R}$ | #evaluations | #calls to $\varphi$ |
|---|---|---|---|
| Ensemble | $B$ | $B \cdot n_{\rm test}$ | $n_{\rm test}$ |
| J+ with Ensemble | $Bn$ | $Bn(1 + n_{\rm test})$ | $n(1 + n_{\rm test})$ |
| J+aB | $B$ | $B(n + n_{\rm test})$ | $n(1 + n_{\rm test})$ |

Footnote 4: Formally, we define $\varphi$ as a map from $\bigcup_{k \ge 0} \mathbb{R}^k$ to $\mathbb{R}$, mapping any collection of predictions in $\mathbb{R}$ to a single aggregated prediction. (If the collection is empty, we would simply output zero or some other default choice.) $\varphi$ lifts naturally to a map on vectors of functions, by writing $\hat\mu_\varphi = \varphi(\hat\mu_1, \dots, \hat\mu_B)$, where $\hat\mu_\varphi(x)$ is defined for each $x \in \mathbb{R}^p$ by applying $\varphi$ to the collection $(\hat\mu_1(x), \dots, \hat\mu_B(x))$.

2.3 Related work

Our paper contributes to the fast-expanding literature on distribution-free predictive inference, which has garnered attention in recent years for providing valid inferential tools that can work with complex machine learning algorithms such as neural networks. This is because many of the proposed methods are wrapper algorithms that can be used in conjunction with an arbitrary learning procedure. This list includes the conformal prediction methodology of Vovk et al. [36] and Lei et al. [18], the split conformal methods explored in Papadopoulos [24], Vovk [35], and Lei et al. [18], and the jackknife+ of Barber et al. [2]. Meanwhile, methods such as cross-validation or leave-one-out cross-validation (also called the jackknife) stabilize the results in practice but require some assumptions to analyze theoretically [33, 34, 2]. The method we propose can also be viewed as a wrapper designed specifically for use with ensemble learners.

As mentioned in Section 2.2, applying a distribution-free wrapper around an ensemble prediction method typically results in a combined procedure that is computationally burdensome. This has motivated many authors to come up with cost-efficient wrappers for use in the ensemble prediction setting. For example, Papadopoulos et al. [26] and Papadopoulos and Haralambous [25] use a holdout set to assess the predictive accuracy of an ensembled model. However, when the sample size $n$ is limited, one may achieve more accurate predictions with a cross-validation or jackknife type method, as such a method avoids reducing the sample size in order to obtain a holdout set. Moreover, by using out-of-bag estimates [6], it is often possible to reduce the overall cost to the extent that it is on par with obtaining a single ensemble prediction. This is explored in Johansson et al. [16], who propose a prediction interval of the form $\hat\mu_\varphi(X_{n+1}) \pm q^+_{\alpha,n}\{R_i\}$, where $\hat\mu_{\varphi\setminus i} = \varphi(\{\hat\mu_b : b = 1, \dots, B,\ S_b \not\ni i\})$ and $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$. Zhang et al. [38] provide a theoretical analysis of this type of prediction interval, ensuring that predictive coverage holds asymptotically under additional assumptions. Devetyarov and Nouretdinov [10], Löfström et al. [20], Boström et al. [4, 3], and Linusson et al. [19] study variants of this type of method, but fully distribution-free coverage cannot be guaranteed for these methods. By contrast, our method preserves exchangeability, and hence is able to maintain assumption-free and finite-sample validity. More recently, Kuchibhotla and Ramdas [17] looked at aggregating conformal inference after subsampling or bootstrapping. Their work proposes ensembling multiple runs of an inference procedure, while in contrast our present work seeks to provide inference for ensembled methods.
Stepping away from distribution-free methods, for the popular random forests [15, 7], Meinshausen [22], Athey et al. [1], and Lu and Hardin [21] proposed methods for estimating conditional quantiles, which can be used to construct prediction intervals. The guarantees they provide are necessarily approximate or asymptotic, and rely on additional conditions. Tangentially related are methods for estimating the variance of the random forest estimator of the conditional mean, e.g., Sexton and Laake [32], Wager et al. [37], and Mentch and Hooker [23], which apply, in order, the jackknife-after-bootstrap (not jackknife+) [11], the infinitesimal jackknife [12], and U-statistics theory. Roy and Larocque [31] propose a heuristic for constructing prediction intervals using such variance estimates. For a comprehensive survey of statistical work related to random forests, we refer the reader to the literature review by Athey et al. [1].

3 Jackknife+-after-bootstrap (J+aB)

We present our method, the jackknife+-after-bootstrap (J+aB). To design a cost-efficient wrapper method suited to the ensemble prediction setting, we borrow an old insight from Breiman [6] and make use of the out-of-bag estimates. Specifically, it is possible to obtain the $i$-th leave-one-out model $\hat\mu_{\varphi\setminus i}$ without additional calls to the base regression method by reusing the already computed $\hat\mu_1, \dots, \hat\mu_B$: we aggregate only those $\hat\mu_b$'s whose underlying training data set $S_b$ did not include the $i$-th data point. This is formalized in Algorithm 2, and a code sketch is given at the end of this section.

Algorithm 2 Jackknife+-after-bootstrap (J+aB)
Input: data $\{(X_i, Y_i)\}_{i=1}^n$.
Output: predictive interval $\hat C^{\rm J+aB}_{\alpha,n,B}$.
for $b = 1, \dots, B$ do
  Draw $S_b = (i_{b,1}, \dots, i_{b,m})$ by sampling with or without replacement from $\{1, \dots, n\}$.
  Compute $\hat\mu_b = \mathcal{R}((X_{i_{b,1}}, Y_{i_{b,1}}), \dots, (X_{i_{b,m}}, Y_{i_{b,m}}))$.
end for
for $i = 1, \dots, n$ do
  Aggregate $\hat\mu_{\varphi\setminus i} = \varphi(\{\hat\mu_b : b = 1, \dots, B,\ S_b \not\ni i\})$.
  Compute the residual $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$.
end for
Compute the J+aB prediction interval: at each $x \in \mathbb{R}^p$,
$$\hat C^{\rm J+aB}_{\alpha,n,B}(x) = \left[ q^-_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(x) - R_i\},\ q^+_{\alpha,n}\{\hat\mu_{\varphi\setminus i}(x) + R_i\} \right].$$

Because the J+aB algorithm recycles the same $B$ models $\hat\mu_1, \dots, \hat\mu_B$ to compute all $n$ leave-one-out models $\hat\mu_{\varphi\setminus i}$, the cost of model fitting is identical for the J+aB algorithm and for ensemble learning. Table 1 compares the computational costs of an ensemble method, the jackknife+ wrapped around an ensemble, and the J+aB when the goal is to make $n_{\rm test}$ predictions. In settings where both model evaluations and aggregations remain relatively cheap, our J+aB algorithm is able to output a more informative confidence interval at virtually no extra cost beyond what is already necessary to produce a single ensemble point prediction. Thus, one can view the J+aB as offering predictive inference free of charge.
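The following is a minimal sketch of Algorithm 2, reusing the hypothetical `fit_ensemble`, `q_plus`, and `q_minus` helpers from the earlier sketches; it is our illustration under those assumptions, not the released code. Following footnote 4, an empty out-of-bag collection falls back to a default prediction of zero.

```python
import numpy as np

def jackknife_plus_after_bootstrap(X, y, X_test, fit_base, B, m=None, alpha=0.1,
                                   aggregate=np.mean, replace=True, rng=None):
    """Sketch of Algorithm 2 (J+aB): reuse the B fitted models to form out-of-bag
    leave-one-out aggregates, then build the interval with the q^-/q^+ quantiles."""
    n = len(y)
    models, index_sets = fit_ensemble(X, y, fit_base, B=B, m=m, replace=replace, rng=rng)

    n_test = len(X_test)
    mu_loo_test = np.zeros((n, n_test))  # row i: leave-one-out aggregate evaluated on X_test
    R = np.zeros(n)                      # leave-one-out residuals R_i
    for i in range(n):
        # keep only the models whose resample S_b did not contain the i-th training point
        oob = [b for b, S_b in enumerate(index_sets) if i not in S_b]
        if not oob:                      # rare for moderate B; default prediction is zero
            R[i] = abs(y[i])
            continue
        preds_i = np.stack([models[b].predict(X[i:i + 1]) for b in oob])   # (|oob|, 1)
        preds_test = np.stack([models[b].predict(X_test) for b in oob])    # (|oob|, n_test)
        R[i] = abs(y[i] - aggregate(preds_i, axis=0)[0])
        mu_loo_test[i] = aggregate(preds_test, axis=0)

    lower = np.array([q_minus(mu_loo_test[:, j] - R, alpha) for j in range(n_test)])
    upper = np.array([q_plus(mu_loo_test[:, j] + R, alpha) for j in range(n_test)])
    return lower, upper
```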
4 Theoretical guarantee

In this section, we prove that the coverage of a J+aB interval satisfies a distribution-free lower bound of $1-2\alpha$ in the worst case. We make two assumptions, one on the data distribution and the other on the ensemble algorithm.

Assumption 1. $(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, Y_{n+1}) \overset{\text{iid}}{\sim} P$, where $P$ is any distribution on $\mathbb{R}^p \times \mathbb{R}$.

Assumption 2. For any $k \ge 1$, any fixed $k$-tuple $((x_1, y_1), \dots, (x_k, y_k)) \in (\mathbb{R}^p \times \mathbb{R})^k$, and any permutation $\sigma$ of $\{1, \dots, k\}$, it holds that $\mathcal{R}((x_1, y_1), \dots, (x_k, y_k)) = \mathcal{R}((x_{\sigma(1)}, y_{\sigma(1)}), \dots, (x_{\sigma(k)}, y_{\sigma(k)}))$ and $\varphi(y_1, \dots, y_k) = \varphi(y_{\sigma(1)}, \dots, y_{\sigma(k)})$.

In other words, the base regression algorithm $\mathcal{R}$ and the aggregation $\varphi$ are both invariant to the ordering of the input arguments.⁵ Assumption 1 is fairly standard in the distribution-free prediction literature [36, 18, 2]. In fact, our results only require exchangeability of the $n+1$ data points, as is typical in distribution-free inference; the i.i.d. assumption is a familiar special case. Assumption 2 is a natural condition in the setting where the data points are i.i.d., and therefore should logically be treated symmetrically.

Theorem 1 gives the distribution-free coverage guarantee for the J+aB prediction interval, with one intriguing twist: the total number of base models, $B$, must be drawn at random rather than chosen in advance. This is because Algorithm 2 as given subtly violates symmetry even when $\mathcal{R}$ and $\varphi$ are themselves symmetric. However, we shall see that requiring $B$ to be Binomial is enough to restore symmetry, after which assumption-free theory is possible.

Theorem 1. Fix any integers $\tilde B \ge 1$ and $m \ge 1$, any base algorithm $\mathcal{R}$, and any aggregation function $\varphi$. Suppose the jackknife+-after-bootstrap method (Algorithm 2) is run with (i) $B \sim \mathrm{Binomial}\big(\tilde B, (1 - \tfrac{1}{n+1})^m\big)$ in the case of sampling with replacement, or (ii) $B \sim \mathrm{Binomial}\big(\tilde B, 1 - \tfrac{m}{n+1}\big)$ in the case of sampling without replacement. Then, under Assumptions 1 and 2, the jackknife+-after-bootstrap prediction interval satisfies
$$\mathbb{P}\left[ Y_{n+1} \in \hat C^{\rm J+aB}_{\alpha,n,B}(X_{n+1}) \right] \ge 1 - 2\alpha,$$
where the probability holds with respect to the random draw of the training data $(X_1, Y_1), \dots, (X_n, Y_n)$, the test data point $(X_{n+1}, Y_{n+1})$, and the Binomial $B$.

Footnote 5: If $\mathcal{R}$ and/or $\varphi$ involve any randomization (for example, if $\varphi$ operates by sampling from the collection of predictions), then we can require that the outputs are equal in distribution under any permutation of the input arguments, rather than requiring that equality holds deterministically. In this case, the coverage guarantees in our theorems hold on average over the randomization in $\mathcal{R}$ and/or $\varphi$, in addition to the distribution of the data.
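The Binomial draw of $B$ required by Theorem 1 is simple to implement; the sketch below is ours, with the success probability taken from cases (i) and (ii) of the theorem.

```python
import numpy as np

def draw_B(B_tilde, n, m, replace=True, rng=None):
    """Draw the random ensemble size B required by Theorem 1.
    Case (i), with replacement:    B ~ Binomial(B_tilde, (1 - 1/(n+1))^m), the probability
      that a bootstrap multiset of size m drawn from n+1 points misses a fixed point.
    Case (ii), without replacement: B ~ Binomial(B_tilde, 1 - m/(n+1))."""
    rng = np.random.default_rng(rng)
    p = (1.0 - 1.0 / (n + 1)) ** m if replace else 1.0 - m / (n + 1)
    return int(rng.binomial(B_tilde, p))
```

In practice, one would draw $B$ once with such a routine and then run Algorithm 2 (or the sketch above) with that value.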
Proof sketch. Our proof follows the main ideas of the jackknife+ guarantee [2, Theorem 1]. It is a consequence of the jackknife+ construction that the guarantee can be obtained by a simultaneous comparison of all $n$ pairs of leave-one-out(-of-$n$) residuals, $|Y_{n+1} - \hat\mu_{\setminus i}(X_{n+1})|$ vs. $|Y_i - \hat\mu_{\setminus i}(X_i)|$ for $i = 1, \dots, n$. The key insight provided by Barber et al. [2] is that this is easily done by regarding the residuals as leave-two-out(-of-$(n+1)$) residuals $|Y_i - \tilde\mu_{\setminus i,j}(X_i)|$ with $\{i, j\} \ni n+1$, where $\tilde\mu_{\setminus i,j}$ is a model trained on the augmented data combining both training and test points and then screening out the $i$-th and the $j$-th points, one of which is the test point. These leave-two-out residuals are naturally embedded in an $(n+1) \times (n+1)$ array of all the leave-two-out residuals, $\mathbf{R} = [R_{ij} = |Y_i - \tilde\mu_{\setminus i,j}(X_i)| : i \neq j \in \{1, \dots, n, n+1\}]$. Since the $n+1$ data points in the augmented data are i.i.d., they are exchangeable, and hence so is the array $\mathbf{R}$, i.e., the distribution of $\mathbf{R}$ is invariant to relabeling of the indices. A simple counting argument then ensures that the jackknife+ interval fails to cover with probability at most $2\alpha$. This is the essence of the jackknife+ proof.

Turning to our J+aB, it may be tempting to define $\tilde\mu_{\varphi\setminus i,j} = \varphi(\{\hat\mu_b : S_b \not\ni i, j\})$, the aggregation of all $\hat\mu_b$'s whose underlying data set $S_b$ excludes $i$ and $j$, and go through with the jackknife+ proof. However, this construction is no longer useful; the corresponding $\mathbf{R}$ in this case is no longer exchangeable. This is most easily seen by noting that there are always exactly $B$ many $S_b$'s that do not include the test observation $n+1$, whereas the number of $S_b$'s that do not contain a particular training observation $i \in \{1, \dots, n\}$ is a random number usually smaller than $B$. The issue here is that the J+aB algorithm as given fails to be symmetric in all $n+1$ data points. However, just as the jackknife+ symmetrized the jackknife by replacing $\hat\mu$ with the $\hat\mu_{\setminus i}$'s, the J+aB can also be symmetrized by merely requiring it to run with a Binomial $B$. To see why, consider the lifted Algorithm 3.

Algorithm 3 Lifted J+aB residuals
Input: data $\{(X_i, Y_i)\}_{i=1}^{n+1}$.
Output: residuals $(R_{ij} : i \neq j \in \{1, \dots, n+1\})$.
for $b = 1, \dots, \tilde B$ do
  Draw $\tilde S_b = (i_{b,1}, \dots, i_{b,m})$ by sampling with or without replacement from $\{1, \dots, n+1\}$.
  Compute $\tilde\mu_b = \mathcal{R}((X_{i_{b,1}}, Y_{i_{b,1}}), \dots, (X_{i_{b,m}}, Y_{i_{b,m}}))$.
end for
for pairs $i \neq j \in \{1, \dots, n+1\}$ do
  Aggregate $\tilde\mu_{\varphi\setminus i,j} = \varphi(\{\tilde\mu_b : \tilde S_b \not\ni i, j\})$.
  Compute the residual $R_{ij} = |Y_i - \tilde\mu_{\varphi\setminus i,j}(X_i)|$.
end for

Because all $n+1$ data points are treated equally by Algorithm 3, the resulting array of residuals $\mathbf{R} = [R_{ij} : i \neq j \in \{1, \dots, n+1\}]$ is again exchangeable. Now, for each $i = 1, \dots, n+1$, define $\tilde E_i$ as the event that $\sum_{j \in \{1,\dots,n+1\}\setminus\{i\}} \mathbb{1}[R_{ij} > R_{ji}] \ge (1-\alpha)(n+1)$. Because of the exchangeability of the array, the same counting argument mentioned above ensures $\mathbb{P}[\tilde E_{n+1}] \le 2\alpha$.

To relate the event $\tilde E_{n+1}$ to the actual J+aB interval $\hat C^{\rm J+aB}_{\alpha,n,B}(X_{n+1})$ being constructed, we need to couple Algorithms 2 and 3. Let $B = \sum_{b=1}^{\tilde B} \mathbb{1}[\tilde S_b \not\ni n+1]$, the number of $\tilde S_b$'s containing only the training data in the lifted construction, and let $1 \le b_1 < \dots < b_B \le \tilde B$ be the indices of such $\tilde S_b$'s. Note that $B$ is Binomially distributed, as required by the theorem. For each $k = 1, \dots, B$, define $S_k = \tilde S_{b_k}$. Then, each $S_k$ is an independent uniform draw from $\{1, \dots, n\}$, with or without replacement. Therefore, we can equivalently consider running Algorithm 2 with these particular $S_1, \dots, S_B$. Furthermore, this ensures that $\tilde\mu_{\varphi\setminus n+1,i} = \hat\mu_{\varphi\setminus i}$ for each $i$, that is, the leave-one-out models in Algorithm 2 coincide with the leave-two-out models in Algorithm 3. Thus, we have constructed a coupling of the J+aB with its lifted version.

Finally, define $E_{n+1}$ as the event that $\sum_{i=1}^n \mathbb{1}\big[ |Y_{n+1} - \hat\mu_{\varphi\setminus i}(X_{n+1})| > R_i \big] \ge (1-\alpha)(n+1)$, where $R_i = |Y_i - \hat\mu_{\varphi\setminus i}(X_i)|$ as before. By the coupling we have constructed, we can see that the event $E_{n+1}$ is equivalent to the lifted event $\tilde E_{n+1}$, and thus $\mathbb{P}[E_{n+1}] = \mathbb{P}[\tilde E_{n+1}] \le 2\alpha$. It can be verified that in the event that the J+aB interval fails to cover, i.e., if $Y_{n+1} \notin \hat C^{\rm J+aB}_{\alpha,n,B}(X_{n+1})$, the event $E_{n+1}$ must occur, which concludes the proof. The full version of this proof is given in Supplement A.

In most settings where a large number of models are being aggregated, we would not expect the distinction of random vs. fixed $B$ to make a meaningful difference to the final output. In Supplement B, we formalize this intuition and give a stability condition on the aggregating map $\varphi$ under which the J+aB has valid coverage for any choice of $B$. Finally, we remark that although we have exclusively used the regression residuals $|Y_i - \hat\mu_{\setminus i}(X_i)|$ in our exposition for concreteness, our method can also accommodate alternative measures of conformity, e.g., using quantile regression as in Romano et al. [30] or weighted residuals as in Lei et al. [18], which can better handle heteroscedasticity.
More generally, if $\hat c_{\varphi\setminus i}$ is the trained conformity measure aggregated from the $S_b$'s that did not use the $i$-th point, then the corresponding J+aB set is given by
$$\hat C^{\text{c-J+aB}}_{\alpha,n,B}(x) = \left\{ y \in \mathbb{R} : \sum_{i=1}^n \mathbb{1}\big[ \hat c_{\varphi\setminus i}(x, y) > \hat c_{\varphi\setminus i}(X_i, Y_i) \big] < (1-\alpha)(n+1) \right\}.$$

5 Experiments

In this section, we demonstrate numerically that the J+aB intervals enjoy coverage near the nominal level of $1-\alpha$, using three real data sets and different ensemble prediction methods. In addition, we also look at the results for the jackknife+, combined either with the same ensemble method (J+ENSEMBLE) or with the non-ensembled base method (J+NON-ENSEMBLE); the precise definitions are given in Supplement D.1. The code is available online.⁶

Footnote 6: https://www.stat.uchicago.edu/~rina/jackknife+-after-bootstrap_realdata.html

We used three real data sets, which were also used in Barber et al. [2], following the same data preprocessing steps as described therein. The Communities and Crime (COMMUNITIES) data set [28] contains information on 1,994 communities with $p = 99$ covariates. The response $Y$ is the per capita violent crime rate. The Blog Feedback (BLOG) data set [9] contains information on 52,397 blog posts with $p = 280$ covariates. The response is the number of comments left on the blog post in the following 24 hours, which we transformed as $Y = \log(1 + \#\text{comments})$. The Medical Expenditure Panel Survey (MEPS) 2016 data set from the Agency for Healthcare Research and Quality, with details for older versions in [13], contains information on 33,005 individuals with $p = 107$ covariates. The response is a score measuring each individual's utilization level of medical services. We transformed this as $Y = \log(1 + \text{utilization score})$.

For the base regression method $\mathcal{R}$, we used either ridge regression (RIDGE), the random forest (RF), or a neural network (NN). For RIDGE, we set the penalty at $\lambda = 0.001\|X\|^2$, where $\|X\|$ is the spectral norm of the training data matrix. RF was implemented using the RandomForestRegressor method from scikit-learn with 20 trees grown for each random forest using the mean absolute error criterion and the bootstrap option turned off, with default settings otherwise. For NN, we used the MLPRegressor method from scikit-learn with the L-BFGS solver and the logistic activation function, with default settings otherwise. For the aggregation $\varphi$, we used averaging (MEAN). Results obtained with other aggregation methods are discussed in Supplement D.2. A sketch of these base-learner configurations is given at the end of this subsection.

We fixed $\alpha = 0.1$ for a target coverage of 90%. We used $n = 40$ observations for training, sampling uniformly without replacement to create a training-test split for each trial. The results presented here are from 10 independent training-test splits of each data set. The ensemble wrappers J+aB and J+ENSEMBLE used sampling with replacement. We varied the size $m$ of each bootstrap replicate as $m/n = 0.2, 0.4, \dots, 1.0$. For J+ENSEMBLE, we used $B = 20$. For the J+aB, we drew $B \sim \mathrm{Binomial}\big(\tilde B, (1 - \tfrac{1}{n+1})^m\big)$ with $\tilde B = \big[ 20 / \{ (1 - \tfrac{1}{n+1})^m (1 - \tfrac{1}{n})^m \} \big]$, where $[\cdot]$ refers to the integer part of the argument. This ensures that the number of models being aggregated for each leave-one-out model is matched on average to the number in J+ENSEMBLE.

[Figure 1: Distributions of coverage (averaged over each test data) in 10 independent splits for $\varphi$ = MEAN. The black line indicates the target coverage of $1-\alpha$. Methods shown: J+ NON-ENSEMBLE, J+ ENSEMBLE, J+aB; panel label visible: COMMUNITIES.]
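For concreteness, here is a sketch (ours) of the base-learner configurations described above, using scikit-learn; settings not stated in the text are left at their defaults, and the random forest criterion string may differ across scikit-learn versions ('mae' in older releases).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

def make_base_learner(name, X_train):
    """Base learners roughly as configured in the experiments; anything not stated in the
    text is left at the scikit-learn defaults."""
    if name == "RIDGE":
        lam = 0.001 * np.linalg.norm(X_train, ord=2) ** 2  # lambda = 0.001 * (spectral norm)^2
        return Ridge(alpha=lam)
    if name == "RF":
        return RandomForestRegressor(n_estimators=20, criterion="absolute_error",
                                     bootstrap=False)
    if name == "NN":
        return MLPRegressor(solver="lbfgs", activation="logistic")
    raise ValueError(f"unknown base learner: {name}")
```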
We remark that the scale of our experiments, as reflected in the number of different training-test splits or the size of $n$ or $B$, has been limited by the computationally inefficient J+ENSEMBLE.

We emphasize that we made no attempt to optimize any of our models. This is because our goal here is to illustrate certain properties of our method that hold universally for any data distribution and any ensemble method, and not just in cases when the method happens to be the right one for the data. All other things being equal, the statistical efficiency of the intervals our method constructs would be most impacted by how accurately the model is able to capture the data. However, because the method we propose leaves this choice up to the users, performance comparisons along the axis of different ensemble methods are arguably not very meaningful. We are rather more interested in comparisons of the J+aB and J+ENSEMBLE, and of the J+aB (or J+ENSEMBLE) and J+NON-ENSEMBLE.

For the J+aB vs. J+ENSEMBLE comparison, we are on the lookout for potential systematic tradeoffs between computational and statistical efficiency. For each $i$, conditional on the event that the same number of models were aggregated for the $i$-th leave-one-out models $\hat\mu_{\varphi\setminus i}$ in the J+aB and in J+ENSEMBLE, the two $\hat\mu_{\varphi\setminus i}$'s have the same marginal distribution. However, this is not the case for the joint distribution of all $n$ leave-one-out models $\{\hat\mu_{\varphi\setminus i}\}_{i=1}^n$; with respect to the resampling measure, the collection is highly correlated in the case of the J+aB, and independent in the case of J+ENSEMBLE. Thus, in principle, the statistical properties of $\hat C^{\rm J+aB}_{\alpha,n,B}$ and $\hat C^{\rm J+ENSEMBLE}_{\alpha,n,B}$ could differ, although it would be a surprise if it were to turn out that one method always performed better than the other.

In comparing the J+aB (or J+ENSEMBLE) and J+NON-ENSEMBLE, we seek to reaffirm some known results in bagging. It is well known that bagging improves the accuracy of unstable predictors, but has little effect on stable ones [5, 8]. It is reasonable to expect that this property will manifest in some way when the width of $\hat C^{\rm J+aB}_{\alpha,n,B}$ (or $\hat C^{\rm J+ENSEMBLE}_{\alpha,n,B}$) is compared to that of $\hat C^{\rm J+NON\text{-}ENSEMBLE}_{\alpha,n}$. We expect the former to be narrower than the latter when the base regression method is unstable (e.g., RF), but not so when it is already stable (e.g., RIDGE).

[Figure 2: Distributions of interval width (averaged over each test data) in 10 independent splits for $\varphi$ = MEAN. Methods shown: J+ NON-ENSEMBLE, J+ ENSEMBLE, J+aB; panel label visible: COMMUNITIES.]

Figures 1 and 2 summarize the results of our experiments. First, from Figure 1, it is clear that the coverage of the J+aB is near the nominal level. This is also the case for J+ENSEMBLE and J+NON-ENSEMBLE. Second, in Figure 2, we observe no evidence of a consistent trend of one method always outperforming the other in terms of the precision of the intervals, although we do see some slight variations across different data sets and regression algorithms. Thus, we prefer the computationally efficient J+aB to the costly J+ENSEMBLE. Finally, comparing the J+aB (or J+ENSEMBLE) and J+NON-ENSEMBLE, we find the effect of bagging reflected in the interval widths: we see improved precision in the case of RF and, for some data sets and at some values of $m$, in the case of NN. Thus, in settings where the base learner is expected to benefit from ensembling, the J+aB is a practical method for obtaining informative prediction intervals that requires a level of computational resources on par with the ensemble algorithm itself.
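Putting the pieces together, here is a hypothetical end-to-end run on synthetic data, assuming the `draw_B` and `jackknife_plus_after_bootstrap` sketches above; the data and parameter choices are illustrative and are not the paper's experiments.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic-data illustration; draw_B and jackknife_plus_after_bootstrap are the earlier sketches.
rng = np.random.default_rng(0)
n, p, n_test, alpha, m = 40, 10, 200, 0.1, 40
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
X_test = rng.normal(size=(n_test, p))
y_test = X_test @ beta + rng.normal(size=n_test)

fit_base = lambda Xs, ys: Ridge(alpha=0.001 * np.linalg.norm(Xs, ord=2) ** 2).fit(Xs, ys)
B = draw_B(B_tilde=50, n=n, m=m, replace=True, rng=rng)   # random B, as in Theorem 1
lower, upper = jackknife_plus_after_bootstrap(X, y, X_test, fit_base, B=B, m=m,
                                              alpha=alpha, rng=rng)
coverage = np.mean((lower <= y_test) & (y_test <= upper))
print(f"B = {B}, empirical coverage = {coverage:.2f}")
# Theorem 1: coverage >= 1 - 2*alpha on average; in practice it is typically near 1 - alpha = 0.9.
```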
6 Conclusion

We propose the jackknife+-after-bootstrap (J+aB), a computationally efficient wrapper method tailored to the setting of ensemble learning, where, by a simple modification to the aggregation stage, the method outputs a predictive interval with a fully assumption-free coverage guarantee in place of a point prediction. The J+aB provides a mechanism for quantifying uncertainty in ensemble predictions that is both straightforward to implement and easy to interpret, and can therefore be easily integrated into existing ensemble models.

7 Broader impact

Machine learning algorithms are becoming increasingly pervasive in many application areas involving complicated and high-stakes decision making, including medical treatment planning and diagnosis, public health, and public policy. As the use of machine learning becomes more widespread, however, we are also becoming more cognizant of the potential pitfalls due to hidden biases in the data or unexpected behavior of blackbox algorithms. Quantifying the uncertainty in machine predictions is one way to safeguard against such errors. Doing so in a meaningful way without making unverifiable or overly simplistic assumptions is a challenge, as data in these applications often exhibit complex phenomena, such as censoring or missingness, heavy tails, multi-modality, etc. Methods such as our J+aB provide predictive inference guarantees that can be efficiently implemented on large-scale data sets under very few assumptions.

Acknowledgments and Disclosure of Funding

R.F.B. was supported by the National Science Foundation via grant DMS-1654076. The authors are grateful to Aaditya Ramdas and Chirag Gupta for helpful suggestions on related works.

References

[1] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. Ann. Statist., 47(2):1148–1178, 2019. doi: 10.1214/18-AOS1709. URL https://projecteuclid.org:443/euclid.aos/1547197251.
[2] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+, 2019. arXiv preprint.
[3] H. Boström, L. Asker, R. Gurung, I. Karlsson, T. Lindgren, and P. Papapetrou. Conformal prediction using random survival forests. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 812–817, 2017. doi: 10.1109/ICMLA.2017.00-57.
[4] Henrik Boström, Henrik Linusson, Tuve Löfström, and Ulf Johansson. Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence, 81(1):125–144, 2017. doi: 10.1007/s10472-017-9539-9. URL https://doi.org/10.1007/s10472-017-9539-9.
[5] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996. doi: 10.1007/BF00058655. URL https://doi.org/10.1007/BF00058655.
[6] Leo Breiman. Out-of-bag estimation. Technical report, Department of Statistics, University of California, Berkeley, 1997.
[7] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001. ISSN 1573-0565. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
[8] Peter Bühlmann and Bin Yu. Analyzing bagging. Ann. Statist., 30(4):927–961, Aug 2002. doi: 10.1214/aos/1031689014. URL https://doi.org/10.1214/aos/1031689014.
[9] Krisztian Buza. Feedback prediction for blogs. In Myra Spiliopoulou, Lars Schmidt-Thieme, and Ruth Janning, editors, Data Analysis, Machine Learning and Knowledge Discovery, pages 145–152. Springer International Publishing, 2014. ISBN 978-3-319-01595-8.
[10] Dmitry Devetyarov and Ilia Nouretdinov.
Prediction with confidence based on a random forest classifier. In Harris Papadopoulos, Andreas S. Andreou, and Max Bramer, editors, Artificial Intelligence Applications and Innovations, pages 37–44, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-16239-8.
[11] Bradley Efron. Jackknife-after-bootstrap standard errors and influence functions. Journal of the Royal Statistical Society, Series B (Methodological), 54(1):83–127, 1992. ISSN 0035-9246. URL http://www.jstor.org/stable/2345949.
[12] Bradley Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007, 2014. doi: 10.1080/01621459.2013.823775. URL https://doi.org/10.1080/01621459.2013.823775. PMID: 25346558.
[13] Trena M. Ezzati-Rice, Frederick Rohde, and Janet Greenblatt. Sample design of the Medical Expenditure Panel Survey household component, 1998–2007. Methodology Report 22, Agency for Healthcare Research and Quality, Rockville, MD, Mar 2008. URL http://www.meps.ahrq.gov/mepsweb/data_files/publications/mr22/mr22.pdf.
[14] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Ensemble Learning, pages 605–624. Springer New York, 2009. ISBN 978-0-387-84858-7. doi: 10.1007/978-0-387-84858-7_16. URL https://doi.org/10.1007/978-0-387-84858-7_16.
[15] Tin Kam Ho. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, Aug 1995. doi: 10.1109/ICDAR.1995.598994.
[16] Ulf Johansson, Henrik Boström, Tuve Löfström, and Henrik Linusson. Regression conformal prediction with random forests. Machine Learning, 97(1):155–176, 2014. doi: 10.1007/s10994-014-5453-0. URL https://doi.org/10.1007/s10994-014-5453-0.
[17] Arun K. Kuchibhotla and Aaditya K. Ramdas. Nested conformal prediction and the generalized jackknife+, 2019. arXiv preprint.
[18] Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018. doi: 10.1080/01621459.2017.1307116. URL https://doi.org/10.1080/01621459.2017.1307116.
[19] H. Linusson, U. Johansson, and H. Boström. Efficient conformal predictor ensembles. Neurocomputing, 2019. doi: 10.1016/j.neucom.2019.07.113. URL http://www.sciencedirect.com/science/article/pii/S0925231219316108.
[20] T. Löfström, U. Johansson, and H. Boström. Effective utilization of data in inductive conformal prediction using ensembles of neural networks. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2013. ISSN 2161-4393. doi: 10.1109/IJCNN.2013.6706817.
[21] Benjamin Lu and Johanna Hardin. A unified framework for random forest prediction error estimation, 2019. arXiv preprint.
[22] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006. URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-33745174860&partnerID=40&md5=93784fdaa267f35c561a576c66aad9ff.
[23] Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 17(26):1–41, 2016. URL http://jmlr.org/papers/v17/14-168.html.
[24] Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In Paula Fritzsche, editor, Tools in Artificial Intelligence, pages 325–330. InTech, 2008. ISBN 978-953-7619-03-9.
URL http://www.intechopen.com/books/tools_in_artificial_intelligence/inductive_conformal_prediction__theory_and_application_to_neural_networks.
[25] Harris Papadopoulos and Haris Haralambous. Reliable prediction intervals with regression neural networks. Neural Networks, 24(8):842–851, 2011. doi: 10.1016/j.neunet.2011.05.008. URL http://www.sciencedirect.com/science/article/pii/S089360801100150X.
[26] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen, editors, Machine Learning: ECML 2002, pages 345–356, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-36755-0.
[27] Robi Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, Mar 2006. ISSN 1558-0830. doi: 10.1109/MCAS.2006.1688199.
[28] Michael Redmond and Alok Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141:660–678, 2002. ISSN 0377-2217. doi: 10.1016/S0377-2217(01)00264-8. URL http://www.sciencedirect.com/science/article/pii/S0377221701002648.
[29] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1–39, Feb 2010. ISSN 1573-7462. doi: 10.1007/s10462-009-9124-7. URL https://doi.org/10.1007/s10462-009-9124-7.
[30] Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3543–3553. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8613-conformalized-quantile-regression.pdf.
[31] Marie-Hélène Roy and Denis Larocque. Prediction intervals with random forests. Statistical Methods in Medical Research, 29(1):205–229, 2020. doi: 10.1177/0962280219829885. URL https://doi.org/10.1177/0962280219829885.
[32] Joseph Sexton and Petter Laake. Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3):801–811, 2009. ISSN 0167-9473. doi: 10.1016/j.csda.2008.08.007. URL http://www.sciencedirect.com/science/article/pii/S0167947308003988.
[33] Lukas Steinberger and Hannes Leeb. Leave-one-out prediction intervals in linear regression models with many variables, 2016. arXiv preprint.
[34] Lukas Steinberger and Hannes Leeb. Conditional predictive inference for high-dimensional stable algorithms, 2018. arXiv preprint.
[35] Vladimir Vovk. Conditional validity of inductive conformal predictors. Machine Learning, 92(2-3):349–376, 2013. ISSN 0885-6125. doi: 10.1007/s10994-013-5355-6. URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-84880107869&doi=10.1007%2fs10994-013-5355-6&partnerID=40&md5=18a26112b5a5b1b33e6b108388ea3854.
[36] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, 2005. ISBN 978-0-387-25061-8. doi: 10.1007/b106715. URL https://doi.org/10.1007/b106715.
[37] Stefan Wager, Trevor Hastie, and Bradley Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15:1625–1651, 2014. URL http://jmlr.org/papers/v15/wager14a.html.
[38] Haozhe Zhang, Joshua Zimmerman, Dan Nettleton, and Daniel J. Nordman. Random forest prediction intervals.
The American Statistician, pages 1–15, Apr 2019. doi: 10.1080/00031305.2019.1585288. URL https://doi.org/10.1080/00031305.2019.1585288.