# Bootstrap in High Dimension with Low Computation

Henry Lam¹  Zhenyuan Liu¹

¹Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, USA. Correspondence to: Henry Lam. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

The bootstrap is a popular data-driven method to quantify statistical uncertainty, but for modern high-dimensional problems, it could suffer from huge computational costs due to the need to repeatedly generate resamples and refit models. We study the use of bootstraps in high-dimensional environments with a small number of resamples. In particular, we show that with a recent cheap bootstrap perspective, using a number of resamples as small as one could attain valid coverage even when the dimension grows closely with the sample size, thus strongly supporting the implementability of the bootstrap for large-scale problems. We validate our theoretical results and compare the performance of our approach with other benchmarks via a range of experiments.

1. Introduction

The bootstrap is a widely used method for statistical uncertainty quantification, notably confidence interval construction (Efron & Tibshirani, 1994; Davison & Hinkley, 1997; Shao & Tu, 2012; Hall & Martin, 1988). Its main idea is to resample data and use the distribution of resample estimates to approximate the sampling distribution. Typically, this approximation requires running many Monte Carlo replications to generate the resamples and refit models. This is affordable for classical problems, but for modern large-scale problems, the repeated refitting can impose a tremendous computational burden. This issue motivates an array of recent works that curb the computational effort, mostly through a subsampling perspective that fits models on smaller data sets in the bootstrap process, e.g., Kleiner et al. (2012); Lu et al. (2020); Giordano et al. (2019); Schulam & Saria (2019); Alaa & Van Der Schaar (2020). In contrast to subsampling, we consider in this paper the reduction of bootstrap computation cost by using a smaller number of Monte Carlo replications or resamples. In particular, we target the following question: Is it possible to run a valid bootstrap for high-dimensional problems with very little Monte Carlo computation?

While conventional bootstraps rely heavily on adequate resamples, recent work (Lam, 2022a;b) shows that it is possible to reduce the resampling effort dramatically, even down to one Monte Carlo replication. The rough idea of this cheap bootstrap is to exploit the approximate independence among the original and resample estimates, instead of their distributional closeness utilized in the conventional bootstraps. We will leverage this recent idea in this paper. However, since Lam (2022a;b) is based purely on asymptotic derivation, giving an affirmative answer to the above question requires the study of new finite-sample bounds to draw understanding of bootstrap behaviors jointly in terms of the problem dimension p, sample size n, and number of resamples B.
To this end, our main theoretical contribution in this paper is three-fold:

General Finite-Sample Bootstrap Bounds: We derive general finite-sample bounds on the coverage error of confidence intervals aggregated from B resample estimates, where B is small using the cheap bootstrap idea, and B = ∞ for traditional quantile-based bootstrap methods including the basic and percentile bootstraps (e.g., Davison & Hinkley (1997) Sections 5.2-5.3). Our bounds reveal that, given the same primitives on the approximate normality of the original and each resample estimate, the cheap bootstrap with fixed small B achieves similar coverage error bounds as conventional bootstraps using infinite resamples. This also simultaneously recovers the main result in Lam (2022a), but is stronger in terms of the finite-sample guarantee.

Bootstrap Bounds on Function-of-Mean Models Explicit in p, n and B: We specialize our general bounds above to the function-of-mean model that is customary in the high-dimensional Berry-Esseen and central limit theorem (CLT) literature (Pinelis & Molzon, 2016; Zhilova, 2020). In particular, our bounds, explicit in p, n and B, conclude vanishing coverage errors for the cheap bootstrap when p = o(n), for any given B ≥ 1. Note that the function-of-mean model does not capture all interesting problems, but it has been commonly used and, in fact, appears to be the only model used in deriving finite-sample CLT errors, for technical reasons. Our bounds indicate that, at least for this wide class of models, using a small number of resamples can achieve good coverage even when the dimension p grows closely with n.

Bootstrap Bounds on Linear Models Independent of p: We further specialize our bounds to linear functions with weaker tail conditions, which have orders independent of p under certain conditions on the Lp norm or Orlicz norm of the linearly scaled random variable.

In addition to theoretical bounds, we investigate the empirical performance of bootstraps using few resamples on large-scale problems, including high-dimensional linear regression, high-dimensional logistic regression, computational simulation modeling, and a real-world data set, RCV1-v2 (Lewis et al., 2004). To give a sense of our comparisons that support using the cheap bootstrap in high dimension, here is a general conclusion observed in our experiments: Figure 1(a) shows the coverage probabilities of 95%-level confidence intervals for three regression coefficients with corresponding true values 0, 2, 1 in a 9000-dimensional linear regression (in Section 4). The cheap bootstrap coverage probabilities are close to the nominal level 95% even with one resample, but the basic and percentile bootstraps only attain around 80% coverage with ten resamples. In this example, one Monte Carlo replication to obtain each resample estimate takes around 4 minutes on the e2-highmem-2 virtual machine on Google Cloud Platform. Therefore, the cheap bootstrap only requires 4 minutes to obtain a statistically valid interval, but the standard bootstrap methods are still far from the nominal coverage even after more than a 40-minute run. Figure 1(b) shows the average interval widths. This reveals the price of a wider interval for the cheap bootstrap when the Monte Carlo budget is very small, but considering the low coverages of the other two methods and the fast decay of the cheap bootstrap width over the first few resamples, such a price appears secondary.

Notation.
For a random vector X, we write $X^{\otimes k}$ for the k-th tensor power of X. The vector norm is taken as the usual Euclidean norm. The matrix and tensor norms are taken as the operator norm. For a square matrix M, tr(M) denotes the trace of M. $I_{p\times p}$ denotes the identity matrix in $\mathbb{R}^{p\times p}$ and $1_p$ denotes the vector in $\mathbb{R}^p$ whose components are all 1. $\Phi$ denotes the cumulative distribution function of the standard normal. $C^2(\mathbb{R}^p)$ denotes the set of twice continuously differentiable functions on $\mathbb{R}^p$. Throughout the whole paper, we use C > 0 (without subscripts) to denote a universal constant which may vary each time it appears. We use $C_1, C_2, \dots$ to denote constants that could depend on other parameters, and we will clarify their dependence when using them.

2. Background on Bootstrap Methods

We briefly review standard bootstrap methods and, from there, the recent cheap bootstrap. Suppose we are interested in estimating a target statistical quantity $\psi := \psi(P_X)$ where $\psi(\cdot): \mathcal{P} \to \mathbb{R}$ is a functional defined on the probability measure space $\mathcal{P}$. Given i.i.d. data $X_1, \dots, X_n \in \mathbb{R}^p$ following the unknown distribution $P_X$, we denote the empirical distribution as $\hat{P}_{X,n}(\cdot) := (1/n) \sum_{i=1}^n I(X_i \in \cdot)$. A natural point estimator is $\hat{\psi}_n := \psi(\hat{P}_{X,n})$. To construct a confidence interval from $\hat{\psi}_n$, a typical starting point is the distribution of $\hat{\psi}_n - \psi$, from which we can pivotize. As this distribution is unknown in general, the bootstrap idea is to approximate it using the resample counterpart, as if the empirical distribution were the true distribution. More concretely, conditional on $X_1, \dots, X_n$, we repeatedly, say for B times, resample (i.e., sample with replacement) the data n times to obtain resamples $\{X_1^{*b}, \dots, X_n^{*b}\}$, $b = 1, \dots, B$. Denoting $\hat{P}^{*b}_{X,n}$ as the resample empirical distributions, we construct B resample estimates $\hat{\psi}^{*b}_n := \psi(\hat{P}^{*b}_{X,n})$. Then we use the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\hat{\psi}^{*b}_n - \hat{\psi}_n$, called $q_{\alpha/2}$ and $q_{1-\alpha/2}$, to construct $[\hat{\psi}_n - q_{1-\alpha/2}, \hat{\psi}_n - q_{\alpha/2}]$ as a $(1-\alpha)$-level confidence interval, which is known as the basic bootstrap (Davison & Hinkley (1997) Section 5.2). Alternatively, we could also use the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\hat{\psi}^{*b}_n$, say $q^*_{\alpha/2}$ and $q^*_{1-\alpha/2}$, to form $[q^*_{\alpha/2}, q^*_{1-\alpha/2}]$, which is known as the percentile bootstrap (Davison & Hinkley (1997) Section 5.3). There are numerous other variants in the literature, such as studentization (Hall, 1988), calibration or the iterated bootstrap (Hall, 1986a; Beran, 1987), and bias correction and acceleration (Efron, 1987; Di Ciccio et al., 1996; Di Ciccio & Tibshirani, 1987), with the general goal of obtaining more accurate coverage.

All the above methods rely on the principle that $\hat{\psi}_n - \psi$ and $\hat{\psi}^*_n - \hat{\psi}_n$ (conditional on $X_1, \dots, X_n$) are close in distribution. Typically, this means that, with a $\sqrt{n}$-scaling, they both converge to the same normal distribution. In contrast, the cheap bootstrap proposed in Lam (2022a;b) constructs a $(1-\alpha)$-level confidence interval via

$$\left[\hat{\psi}_n - t_{B,1-\alpha/2} S_{n,B},\ \hat{\psi}_n + t_{B,1-\alpha/2} S_{n,B}\right], \quad (1)$$

where $S^2_{n,B} = (1/B) \sum_{b=1}^B (\hat{\psi}^{*b}_n - \hat{\psi}_n)^2$, and $t_{B,1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of $t_B$, the t-distribution with degree of freedom B. The quantity $S^2_{n,B}$ resembles the sample variance of the resample estimates $\hat{\psi}^{*b}_n$'s, in the sense that as $B \to \infty$, $S^2_{n,B}$ approaches the bootstrap variance $Var_*(\hat{\psi}^*_n)$ (where $Var_*(\cdot)$ denotes the variance of a resample conditional on the data). In this way, (1) reduces to the normality interval with a plug-in estimator of the standard error term when B and n are both large.
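To make the construction of (1) concrete, here is a minimal Python sketch; it is our own illustration rather than code from the paper, the function name `cheap_bootstrap_ci` and its defaults are assumptions, and `psi` stands for any estimator that refits on a resampled data set.

```python
import numpy as np
from scipy import stats

def cheap_bootstrap_ci(data, psi, B=1, alpha=0.05, seed=None):
    """Cheap bootstrap (1 - alpha)-level interval (1) for psi(P_X).

    data : array of n i.i.d. observations (rows)
    psi  : callable mapping a data array to a scalar estimate
    B    : number of resamples; can be as small as 1
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    psi_hat = psi(data)                                   # original estimate
    psi_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                  # sample with replacement
        psi_star[b] = psi(data[idx])                      # resample estimate
    S2 = np.mean((psi_star - psi_hat) ** 2)               # S_{n,B}^2
    half = stats.t.ppf(1 - alpha / 2, df=B) * np.sqrt(S2)
    return psi_hat - half, psi_hat + half

# Illustrative use on a synthetic ellipsoidal target g(mu) = ||mu||^2
X = np.random.default_rng(0).normal(0.02, 0.1, size=(1000, 50))
lo, hi = cheap_bootstrap_ci(X, lambda D: np.sum(D.mean(axis=0) ** 2), B=1)
```

With B = 1 the loop performs a single refit, which is what makes the interval cheap; substituting a model-refitting routine for `psi` (e.g., a regression fit returning one coefficient) mirrors the settings of Section 4.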
However, intriguingly, B does not need to be large, and $S^2_{n,B}$ is not necessarily close to the bootstrap variance. Instead, the idea is to consider the joint distribution of $(\hat{\psi}_n - \psi, \hat{\psi}^{*1}_n - \hat{\psi}_n, \dots, \hat{\psi}^{*B}_n - \hat{\psi}_n)$, whose components are argued to be asymptotically independent as $n \to \infty$ with B fixed, which subsequently allows the construction of a pivotal t-statistic and gives rise to (1) for fixed B. A more detailed explanation of the cheap bootstrap in the low-dimensional case (i.e., p is fixed) is given in Appendix A.

Figure 1. Empirical coverage probabilities and confidence interval widths for different numbers of resamples in a linear regression. (a) Coverage probability; (b) Confidence interval width. Both panels compare the cheap, basic and percentile bootstraps for coefficients with true values $\beta_i = 0, 2, 1$.

3. High-Dimensional Bootstrap Bounds

As explained in the introduction, we aim to study the coverage performance of bootstraps in high-dimensional problems, focusing on the cheap bootstrap approach that allows the use of a small computational effort. We describe our results at three levels: first under general assumptions (Section 3.1), then more explicit bounds under the function-of-mean model and sub-Gaussianity of X (Section 3.2), and finally bounds for a linear function under weaker tail assumptions on X (Section 3.3).

3.1. General Finite-Sample Bounds

We have the following finite-sample bound for the cheap bootstrap:

Theorem 3.1. Suppose we have the finite-sample accuracy for the estimator $\hat{\psi}_n$

$$\sup_{x \in \mathbb{R}} \left| P\big(\sqrt{n}(\hat{\psi}_n - \psi) \le x\big) - \Phi(x/\sigma) \right| \le E_1, \quad (2)$$

and with probability at least $1 - \beta$ we have the finite-sample accuracy for the bootstrap estimator $\hat{\psi}^*_n$

$$\sup_{x \in \mathbb{R}} \left| P_*\big(\sqrt{n}(\hat{\psi}^*_n - \hat{\psi}_n) \le x\big) - \Phi(x/\sigma) \right| \le E_2, \quad (3)$$

where $\sigma > 0$, $E_1$ and $E_2$ are deterministic quantities, and $P_*$ denotes the probability on a resample conditional on the data. Then the coverage error of (1) satisfies, for any $B \ge 1$,

$$\left| P\big(|\psi - \hat{\psi}_n| \le t_{B,1-\alpha/2} S_{n,B}\big) - (1-\alpha) \right| \le 2E_1 + 2BE_2 + \beta. \quad (4)$$

Condition (2) is a Berry-Esseen bound (Bentkus, 2003; Pinelis & Molzon, 2016) that gauges the normal approximation for the original estimate $\hat{\psi}_n$. Condition (3) manifests a similar normal approximation for the resample estimate $\hat{\psi}^*_n$, and has been a focus in the high-dimensional CLT literature (Zhilova, 2020; Lopes, 2022; Chernozhukov et al., 2020). Both conditions (in their asymptotic form) are commonly used to establish the validity of standard bootstrap methods. Theorem 3.1 shows that, under these conditions, the coverage error of the cheap bootstrap interval (1) with any $B \ge 1$ can be controlled. Note that Theorem 3.1 is very general in the sense that there is no direct assumption on the form of $\psi(\cdot)$. All we assume is approximate normality in the sense of (2) and (3). Due to technical delicacies, in the bootstrap literature, finite-sample or higher-order coverage errors are typically obtainable only with specific models (Hall, 2013; Zhilova, 2020; Lopes, 2022; Chernozhukov et al., 2020), the most popular being the function-of-mean model (see Section 3.2) or even simply the sample mean.
In contrast, the bound in Theorem 3.1, which concludes the sufficiency of using a very small B, is a general statement that does not depend on the delicacies of $\psi(\cdot)$. Moreover, by plugging in suitable bounds for $E_1$, $E_2$, $\beta$ under regularity conditions, Theorem 3.1 also recovers the main result (Theorem 1) in Lam (2022a). The detailed proof of Theorem 3.1 is in Appendix D (and so are the proofs of all other theorems). Below we give a sketch of the main argument.

Proof sketch of Theorem 3.1. Step 1: We write the coverage probability as the expected value (with respect to the data) of a multiple integral with respect to the distribution of $\sqrt{n}(\hat{\psi}^*_n - \hat{\psi}_n)$ (denoted by $Q_*$, conditional on the data), i.e.,

$$P\big(|\psi - \hat{\psi}_n| \le t_{B,1-\alpha/2} S_{n,B}\big) = E\left[\int \cdots \int I\left(\frac{|\sqrt{n}(\hat{\psi}_n - \psi)|}{\sqrt{\sum_{b=1}^B z_b^2 / B}} \le t_{B,1-\alpha/2}\right) dQ_*(z_B) \cdots dQ_*(z_1)\right]. \quad (5)$$

Step 2: Suppose (3) happens and denote this event by A, which satisfies $P(A^c) \le \beta$. For each $b = 1, \dots, B$, given all other $z_{b'}$, $b' \ne b$, the integration region is of the form $z_b \in (-\infty, -q] \cup [q, \infty)$ for some $q \ge 0$. Then we can replace the distribution $Q_*$ by the distribution of $N(0, \sigma^2)$ (denoted by $P_0$) with controlled error given in (3) and obtain

$$\int \cdots \int I\left(\frac{|\sqrt{n}(\hat{\psi}_n - \psi)|}{\sqrt{\sum_{b=1}^B z_b^2 / B}} \le t_{B,1-\alpha/2}\right) dQ_*(z_B) \cdots dQ_*(z_1) = \int \cdots \int I\left(\frac{|\sqrt{n}(\hat{\psi}_n - \psi)|}{\sqrt{\sum_{b=1}^B z_b^2 / B}} \le t_{B,1-\alpha/2}\right) dP_0(z_B) \cdots dP_0(z_1) + R_1, \quad (6)$$

where $|R_1| \le 2BE_2 + \beta$ accounts for the error from (3) and the small-probability event $A^c$.

Step 3: Following the same logic as in Step 2 and noticing that the integration region for $\sqrt{n}(\hat{\psi}_n - \psi)$ is $[-q, q]$ for some $q \ge 0$, we can also replace the distribution of $\sqrt{n}(\hat{\psi}_n - \psi)$ by the distribution $P_0$ with controlled error $|R_2| \le 2E_1$ according to (2):

$$E\left[\int \cdots \int I\left(\frac{|\sqrt{n}(\hat{\psi}_n - \psi)|}{\sqrt{\sum_{b=1}^B z_b^2 / B}} \le t_{B,1-\alpha/2}\right) dP_0(z_B) \cdots dP_0(z_1)\right] = \int \cdots \int I\left(\frac{|z_0|}{\sqrt{\sum_{b=1}^B z_b^2 / B}} \le t_{B,1-\alpha/2}\right) dP_0(z_B) \cdots dP_0(z_1)\, dP_0(z_0) + R_2 = 1 - \alpha + R_2, \quad (7)$$

where the last equality holds since $z_0 / \sqrt{\sum_{b=1}^B z_b^2 / B}$ follows the $t_B$ distribution when $z_0, z_1, \dots, z_B$ are i.i.d. $N(0, \sigma^2)$.

Step 4: Plugging (6) and (7) back into (5), we can express the coverage probability as the sum of the nominal level and a remainder term: $P(|\psi - \hat{\psi}_n| \le t_{B,1-\alpha/2} S_{n,B}) = 1 - \alpha + R_1 + R_2$ with error $|R_1 + R_2| \le 2E_1 + 2BE_2 + \beta$. This gives our conclusion.

Theorem 3.1 is designed to work well for small B (our target scenario), but deteriorates when B grows. However, in the latter case, we can strengthen the bound to cover the large-B regime with additional conditions on the variance estimator (see Appendix B.1).

We compare with standard basic and percentile bootstraps using $B = \infty$. Below is a generalization of Zhilova (2020), which focuses only on the basic bootstrap under the function-of-mean model.

Theorem 3.2. Suppose the conditions in Theorem 3.1 hold. If $q_{\alpha/2}, q_{1-\alpha/2}$ are the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\hat{\psi}^*_n - \hat{\psi}_n$ respectively given $X_1, \dots, X_n$, then a finite-sample bound on the basic bootstrap coverage error is

$$\left| P\big(\hat{\psi}_n - q_{1-\alpha/2} \le \psi \le \hat{\psi}_n - q_{\alpha/2}\big) - (1-\alpha) \right| \le 2E_1 + 2E_2 + 2\beta. \quad (8)$$

If $q^*_{\alpha/2}, q^*_{1-\alpha/2}$ are the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\hat{\psi}^*_n$ respectively given $X_1, \dots, X_n$, then a finite-sample bound on the percentile bootstrap coverage error is

$$\left| P\big(q^*_{\alpha/2} \le \psi \le q^*_{1-\alpha/2}\big) - (1-\alpha) \right| \le 2E_1 + 2E_2 + 2\beta. \quad (9)$$

In view of Theorems 3.1 and 3.2, the cheap bootstrap with any fixed B can achieve the same order of coverage error bound as the basic and percentile bootstraps with $B = \infty$, in the sense that

$$(1/2)\, EB_{Quantile} \le EB_{Cheap} \le B \cdot EB_{Quantile}, \quad (10)$$

where $EB_{Cheap}$ is the RHS error bound of (4) and $EB_{Quantile}$ is that of (8) or (9). This shows that, to attain a good coverage that is on par with standard basic/percentile bootstraps, it suffices to use the cheap bootstrap with a small B, which could save computation dramatically.
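For contrast with the cheap interval sketched earlier, the following sketch (again our own illustration, not the authors' code) builds the basic and percentile intervals of Section 2 from B resample estimates; these quantile-based constructions are only meaningful once B is large enough for the empirical α/2 and 1−α/2 quantiles to be stable, which is exactly the computational cost the cheap bootstrap avoids.

```python
import numpy as np

def quantile_bootstrap_cis(data, psi, B=1000, alpha=0.05, seed=None):
    """Basic and percentile (1 - alpha)-level bootstrap intervals."""
    rng = np.random.default_rng(seed)
    n = len(data)
    psi_hat = psi(data)
    psi_star = np.array(
        [psi(data[rng.integers(0, n, size=n)]) for _ in range(B)]
    )
    # alpha/2 and (1 - alpha/2) quantiles of psi*_n - psi_n
    q_lo, q_hi = np.quantile(psi_star - psi_hat, [alpha / 2, 1 - alpha / 2])
    basic = (psi_hat - q_hi, psi_hat - q_lo)        # [psi_n - q_{1-a/2}, psi_n - q_{a/2}]
    percentile = tuple(np.quantile(psi_star, [alpha / 2, 1 - alpha / 2]))
    return basic, percentile
```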
Besides coverage, another important quality of a confidence interval is its width. To this end, note that for any fixed B, (3) ensures that $\sqrt{n}\, S_{n,B} \Rightarrow \sigma \sqrt{\chi^2_B / B}$ (unconditionally as $n \to \infty$ with proper model assumptions). Therefore, the half-width of (1) is approximately $t_{B,1-\alpha/2}\, \sigma \sqrt{\chi^2_B / (nB)}$ with expected value

$$t_{B,1-\alpha/2}\, \sigma\, \sqrt{\frac{2}{Bn}}\, \frac{\Gamma((B+1)/2)}{\Gamma(B/2)}, \quad (11)$$

where $\Gamma(\cdot)$ is the gamma function. Since the dimensional impact is hidden in $\sigma$, which is a common factor in the expected width as B varies, we can see p does not affect the relative width behavior as B changes. In particular, from (11) the inflation of the expected width relative to the case $B = \infty$ is 417.3% for B = 1, and dramatically reduces to 94.6%, 24.8% and 10.9% for B = 2, 5, 10, thus giving an interval with both correct coverage and short width using a small computation budget B.

In the next sections, we will apply Theorem 3.1 to obtain explicit bounds for specific high-dimensional models. Here, in relation to (10), we briefly comment that the coverage error bounds for these models are of order $1/\sqrt{n}$, both for the cheap bootstrap (which we will derive) and state-of-the-art high-dimensional bootstrap CLTs. This is in contrast to the typical 1/n coverage error of two-sided bootstrap confidence intervals in low dimension (see Hall (2013) Section 3.5 for quantile-based bootstraps and Lam (2022a) Section 3.2 for the cheap bootstrap).

3.2. Function-of-Mean Models

We now specialize to the function-of-mean model $\psi = g(\mu)$ for a mean vector $\mu = E[X] \in \mathbb{R}^p$ and smooth function $g: \mathbb{R}^p \to \mathbb{R}$, which allows us to construct more explicit bounds. The original estimate $\hat{\psi}_n$ and resample estimate $\hat{\psi}^{*b}_n$ are now given by $g(\bar{X}_n)$ and $g(\bar{X}^{*b}_n)$ respectively, where $\bar{X}_n$ denotes the sample mean of the data and $\bar{X}^{*b}_n$ denotes the resample mean of $X_1^{*b}, \dots, X_n^{*b}$. We assume:

Assumption 3.3. The function $g(x) \in C^2(\mathbb{R}^p)$ has Hessian matrix $H_g(x)$ with uniformly bounded eigenvalues, that is, there exists a constant $C_{H_g} > 0$ such that $\sup_{x \in \mathbb{R}^p} |a^\top H_g(x) a| \le C_{H_g} \|a\|^2$ for all $a \in \mathbb{R}^p$.

Assumption 3.4. X is sub-Gaussian, i.e., there is a constant $\tau^2 > 0$ such that $E[\exp(a^\top(X - \mu))] \le \exp(\|a\|^2 \tau^2 / 2)$ for all $a \in \mathbb{R}^p$. Furthermore, X has a density bounded by a constant $C_X$ and its covariance matrix $\Sigma$ is positive definite with smallest eigenvalue $\lambda_\Sigma > 0$.

Based on Theorem 3.1, we derive the following explicit bound:

Theorem 3.5. Suppose the function g satisfies Assumption 3.3 and the random vector X satisfies Assumption 3.4. Moreover, assume $\|\nabla g(\mu)\| > C_{\nabla g} \sqrt{p}$ for some constant $C_{\nabla g} > 0$. Then we have

P(|g(µ) − g(X̄n)| ≤ t B,1−α/2 Sn,B) − (1 − α) ≤ m31 nσ3 + CHg m1/3 31 tr(Σ) nσ2 + CHg m2/3 32 n5/6σ + CHg m1/3 31 m2/3 32 nσ2 + CHg τ 2 + ||E[(X − µ)⊗3]|| 1/2 + τ 2 p τ 4(log n)3/2 λ2 Σ n + τ 2(log n)3/2 1/2 (log n + log p) p

where $m_{31} = E[|\nabla g(\mu)^\top (X - \mu)|^3]$, $m_{32} := E[\|X - \mu\|^3]$, $\sigma^2 = \nabla g(\mu)^\top \Sigma \nabla g(\mu)$, C is a universal constant and $C_1$ is a constant only depending on $C_X$.

Theorem 3.5 is obtained by tracing the implicit quantities in Theorem 3.1 for the function-of-mean model, via extracting the dependence on problem parameters in the Berry-Esseen theorems for the multivariate delta method (Pinelis & Molzon, 2016) and the standard bootstrap (Zhilova, 2020). In particular, the sub-Gaussian assumption is required to derive finite-sample concentration inequalities, in a similar spirit to the state-of-the-art high-dimensional CLTs (e.g., Chernozhukov et al. (2017); Lopes (2022)).
On the other hand, the third moments such as $\|E[(X-\mu)^{\otimes 3}]\|$ (the operator norm of the third-order tensor $E[(X-\mu)^{\otimes 3}]$), $m_{31}$ and $m_{32}$ are due to the use of the Berry-Esseen theorem and a multivariate higher-order Berry-Esseen inequality in Zhilova (2020), which generally requires this order of moments.

The bound in Theorem 3.5 can be simplified with reasonable assumptions on the involved quantities:

Corollary 3.6. Suppose the conditions in Theorem 3.5 hold. Moreover, suppose that $\tau, C_{H_g}, C_1 = O(1)$, $C_{\nabla g}, \lambda_\Sigma = \Theta(1)$, $\|\nabla g(\mu)\|^2 = O(p)$, $\sigma^2 = \Theta(p)$ and $\|E[(X-\mu)^{\otimes 3}]\| = O(1)$. Then as $p, n \to \infty$,

P(|g(µ) − g(X̄n)| ≤ t B,1−α/2 Sn,B) − (1 − α) = B O 1 + log n 1/2 (log n + log p) p

Consequently, for any fixed $B \ge 1$, the cheap bootstrap confidence interval is asymptotically exact provided p = o(n), i.e.,

$$\lim_{p,n \to \infty,\ p = o(n)} P\big(|g(\mu) - g(\bar{X}_n)| \le t_{B,1-\alpha/2} S_{n,B}\big) = 1 - \alpha.$$

In Corollary 3.6, the cheap bootstrap coverage error shrinks to 0 as $n \to \infty$ if p = o(n), i.e., the problem dimension grows more slowly than n in any arbitrary fashion. Although there is no theoretical guarantee that the choice of p = o(n) is tight, we offer numerical evidence in Section 4 where the cheap bootstrap works with a small B when p/n = 0.09 but fails (i.e., over-covers the target with a quite large interval width) when p/n = 0.25. Such a difference indicates that p = o(n) can be tight in some cases. Recall that $\|E[(X-\mu)^{\otimes 3}]\|$ denotes the operator norm of the third-order tensor $E[(X-\mu)^{\otimes 3}]$, and so the assumption $\|E[(X-\mu)^{\otimes 3}]\| = O(1)$ holds if the components of X are independent (or only weakly dependent). The other order assumptions in Corollary 3.6 are natural. An example of the function-of-mean model is $g(\mu) = \|\mu\|^2$, used also in Zhilova (2020), whose confidence interval becomes a simultaneous confidence region for the mean vector $\mu$.

3.3. Linear Functions

We consider a further specialization to linear g where, instead of sub-Gaussianity of X, we are now able to use weaker tail conditions. Assume $g(x) = g_1^\top x + g_2$, where $g_1 \in \mathbb{R}^p$ and $g_2 \in \mathbb{R}$ are known. Then $g(\bar{X}_n)$ and $g(\bar{X}^{*b}_n)$ are essentially the sample mean and resample mean of the i.i.d. random variables $g_1^\top X_i + g_2$, $i = 1, \dots, n$. First, we consider the case where $g_1^\top(X - \mu)$ is sub-exponential, i.e., $\|g_1^\top(X - \mu)\|_{\psi_1} := \inf\{\lambda > 0 : E[\psi_1(|g_1^\top(X - \mu)|/\lambda)] \le 1\} < \infty$, where $\|\cdot\|_{\psi_1}$ is the Orlicz norm induced by the function $\psi_1(x) = e^x - 1$. Sub-exponentiality is a weaker tail condition than sub-Gaussianity; see, e.g., Vershynin (2018) Sections 2.5 and 2.7. Under this condition, we have:

Theorem 3.7. Suppose g is a linear function of the form $g(x) = g_1^\top x + g_2$. Assume that $\sigma^2 = g_1^\top \Sigma g_1 > 0$ and $\|g_1^\top(X - \mu)\|_{\psi_1} < \infty$. Then for any $n \ge 3$, we have the following finite-sample bound on the cheap bootstrap coverage error

P(|g(µ) − g(X̄n)| ≤ t B,1−α/2 Sn,B) − (1 − α) ≤ n + BC E[|g⊤1 (X − µ)|3] + BC ||g⊤1 (X − µ)||4 ψ1 log11(n) σ4 n,

where C is a universal constant.

Note that the bootstrap in Theorem 3.7 effectively applies to the univariate $g_1^\top(X - \mu)$. Nonetheless, proving Theorem 3.7 requires tools from the high-dimensional CLT literature (Lopes, 2022; Chernozhukov et al., 2020), as this appears to be the only line of work that investigates finite-sample bootstrap errors (for mean estimation). The order of the bound in terms of p is controlled by $g_1^\top(X - \mu)$, and so if the latter is well-scaled by its standard deviation $\sigma$ in the sense that $E[|g_1^\top(X - \mu)/\sigma|^3], \|g_1^\top(X - \mu)/\sigma\|_{\psi_1} = O(1)$ (e.g., X follows a multivariate normal distribution), then the order is independent of p, which means the error goes to 0 for any p as long as $n \to \infty$.
However, if the orders of $E[|g_1^\top(X - \mu)/\sigma|^3]$ and $\|g_1^\top(X - \mu)/\sigma\|_{\psi_1}$ depend on p, then the growth rate of p must be restricted relative to n to ensure the error goes to 0. Next, we further weaken the tail condition on $g_1^\top(X - \mu)$. We only assume $E[|g_1^\top(X - \mu)|^q] < \infty$ for some $q \ge 4$. In this case, we have the following:

Theorem 3.8. Suppose g is a linear function of the form $g(x) = g_1^\top x + g_2$. Assume that $\sigma^2 = g_1^\top \Sigma g_1 > 0$ and $E[|g_1^\top(X - \mu)|^q] < \infty$ for some $q \ge 4$. Then for any $n \ge 3$, we have the following finite-sample bound on the cheap bootstrap coverage accuracy

P(|g(µ) − g(X̄n)| ≤ t B,1−α/2 Sn,B) − (1 − α) ≤ BC1 log n n1/2−3/(2q) max E[|g⊤1 (X − µ)/σ|q]1/q, q E[|g⊤1 (X − µ)/σ|4] + C E[|g⊤1 (X − µ)|3]

where C is a universal constant and $C_1$ is a constant depending only on q.

The implication of Theorem 3.8 on the choice of p is similar to Theorem 3.7. In parallel to the above, explicit finite-sample bounds for standard quantile-based bootstrap methods can also be obtained by means of Theorem 3.2 under the assumptions in Theorems 3.5, 3.7 or 3.8 (see Appendix B.2).

4. Numerical Experiments

We consider various high-dimensional examples:

Ellipsoidal estimation: The estimation target is $g(\mu) = \|\mu\|^2$, where $\mu$ is the mean of $X \in \mathbb{R}^p$ with ground-truth distribution $N(0.02 \cdot 1_p, 0.01 I_{p\times p})$. Sample size $n = 10^5$ and dimension $p = 2.5 \times 10^4$.

Sinusoidal estimation: The estimation target is $g(\mu) = \sum_{i=1}^p \sin(\mu_i)$, where $\mu = (\mu_i)_{i=1}^p$ is the mean of $X \in \mathbb{R}^p$ with ground-truth distribution $N(0, 0.01 I_{p\times p})$. Sample size $n = 10^5$ and dimension $p = 2.5 \times 10^4$.

Linear regression with independent covariates: Consider the true model $Y = X^\top \beta + \varepsilon$, where $X \in \mathbb{R}^p$ follows $N(0, 0.01 I_{p\times p})$ and $\varepsilon \sim N(0, 1)$ is independent of X. The first, second and last 1/3 of the components of $\beta = (\beta_i)_{i=1}^p$ are 0, 2, 1 respectively. We estimate $\beta$ given i.i.d. data $(X_i, Y_i)_{i=1}^n$ with $n = 10^5$ and $p = 9000$.

Logistic regression with independent covariates: Consider the true model $Y \in \{0, 1\}$, $X \in \mathbb{R}^p$, $P(Y = 1|X) = \exp(X^\top \beta)/(1 + \exp(X^\top \beta))$, where $X \sim N(0, 0.01 I_{p\times p})$. The first 300 components of the $\beta_i$'s are 1, the second 300 components are −1 and the rest are 0. As suggested in Sur & Candès (2019), we choose such values of the $\beta_i$'s to make sure $Var(X^\top \beta) = 6$ does not increase with p, so that $P(Y = 1|X)$ is not trivially equal to 0 or 1 in most cases. We estimate $\beta$ given i.i.d. data $(X_i, Y_i)_{i=1}^n$ with $n = 10^5$ and $p = 9000$.

Stochastic simulation model: Consider a stochastic computer communication model used to calculate the steady-state average message delay (Cheng & Holland (1997); Lin et al. (2015); Lam & Qian (2022); see Appendix C.3 for details). This problem can be cast as computing $\psi(P_1, \dots, P_p)$ where $\psi$ represents this expensive simulation model (due to the need to run for a very long time in order to reach steady state) and the $P_j$'s denote the input distributions, with p = 13 in total. The data sizes for observing these 13 input models range from $3 \times 10^4$ to $6 \times 10^4$.

A real data example: We run logistic regression on the RCV1-v2 data in Lewis et al. (2004). This dataset contains n = 804414 manually categorized newswire stories with a total of p = 47236 features. Economics ("ECAT") is chosen as the +1 label. We target coefficient estimation.

Linear regression with dependent covariates: We consider the same linear regression setup as before but with two different distributions of X. One is $X \sim N(0, 0.01\Sigma)$ where the components of $\Sigma$ are $\Sigma_{ij} = 0.8^{|i-j|}$. For this distribution, the i-th and j-th components of X are more dependent when i and j are closer to each other.
The other distribution is $X \sim N(0, 0.01 A A^\top)$, where A is a random matrix whose components are i.i.d. from U(0, 1). The two cases are referred to as "exponential decay" and "random covariance matrix" respectively.

Logistic regression with dependent covariates: We consider the same logistic regression setup as before but change the distribution of X to the exponential decay distribution mentioned above. Here we do not consider the random covariance matrix case because the significant noisiness of X makes $Var(X^\top \beta)$ so large that $P(Y = 1|X)$ is trivially equal to 0 or 1 in most cases, which is also avoided in other work (e.g., Sur & Candès (2019)).

Ridge regression with p > n: Consider the true linear model $Y = X^\top \beta + \varepsilon$, where $\varepsilon \sim N(0, 1)$ is independent of X. The first, second and last 1/3 of the components of $\beta = (\beta_i)_{i=1}^p$ are 0, 2, 1 respectively. We consider three distributions of X. The first one is $X \sim N(0, 0.01 I_{p\times p})$ (referred to as "independent"). The other two are the exponential decay distribution and random covariance matrix mentioned above. Given i.i.d. data $(X_i, Y_i)_{i=1}^n$ with n = 8000 and p = 9000, we estimate $\beta$ by means of ridge regression, which minimizes $\|Y - X\beta\|^2 + \lambda \|\beta\|^2$.

(Regularized) logistic regression with varying dimensions p: We use the same setup of logistic regression with independent covariates as before (e.g., $n = 10^5$) but use $p \in \{12000, 15000, 18000, 21000, 25000\}$ to test the valid boundary of p (i.e., the maximum p that makes the cheap bootstrap work) in this problem. Moreover, we also run a regularized version by adding the $\ell_2$ regularization term $\|\beta\|^2/2$ to the log-likelihood function of the logistic regression to see the effect of regularization.

Setups and comparison benchmarks. In each example above, our targets are 95%-level confidence intervals for the target parameters. We test four bootstrap confidence intervals: 1) the cheap bootstrap (1); 2) the basic bootstrap described in Section 2; 3) the percentile bootstrap described in Section 2; 4) the standard error bootstrap that uses the standard normal quantile and the standard deviation of the $\hat{\psi}^{*b}_n$'s in lieu of $t_{B,1-\alpha/2}$ and $S_{n,B}$ respectively in (1). For each setup except the real-data example, we run 1000 experimental repetitions, each time generating a new data set from the ground-truth distribution and constructing the intervals. We report the empirical coverage and average interval width over these repetitions. For examples with more than one target estimation quantity, we further average the coverages and widths over all these targets. For the high-dimensional linear regression with independent covariates, we additionally show a box plot of the coverage probabilities and confidence interval widths of each individual $\beta_i$. We vary the number of resamples B from 1 to 10 in all examples and report the running time (i.e., the model fitting time for one point estimate and B bootstrap estimates; the time for outputting the confidence intervals using these estimates is negligible compared to the model fitting time) on the e2-highmem-2 virtual machine on Google Cloud Platform. Some examples have larger scale and thus are run on the virtual machine e2-highmem-8 with larger memory and better CPU, whose running times will be starred (*).

Results and discussions. Tables 1-5 and Figure 2 describe our results (Tables 2-5 are delegated to Appendix C.1 due to the limitation of space), where we report B = 1, 2, 5, 10. "CB", "BB", "PB" and "SEB" in Figure 2 stand for the cheap bootstrap, basic bootstrap, percentile bootstrap and standard error bootstrap respectively.
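As a concrete illustration of this protocol, here is a hedged sketch of a single experimental repetition for the linear regression example using the cheap bootstrap; the helper name, the scikit-learn OLS fit, and the reduced problem size are our own simplifications and assumptions rather than the authors' actual experiment code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

def one_repetition(n=2000, p=200, B=1, alpha=0.05, seed=0):
    """One repetition: generate data, fit OLS, and build cheap bootstrap
    intervals (1) for every coefficient beta_i."""
    rng = np.random.default_rng(seed)
    beta = np.concatenate([np.zeros(p // 3), 2 * np.ones(p // 3),
                           np.ones(p - 2 * (p // 3))])
    X = rng.normal(0, 0.1, size=(n, p))                    # covariates ~ N(0, 0.01 I)
    Y = X @ beta + rng.normal(size=n)
    beta_hat = LinearRegression().fit(X, Y).coef_          # original estimate
    S2 = np.zeros(p)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                   # resample rows with replacement
        beta_star = LinearRegression().fit(X[idx], Y[idx]).coef_
        S2 += (beta_star - beta_hat) ** 2 / B              # S_{n,B}^2 per coefficient
    half = stats.t.ppf(1 - alpha / 2, df=B) * np.sqrt(S2)
    covered = np.abs(beta - beta_hat) <= half              # coverage indicator per beta_i
    return covered.mean(), 2 * half.mean()                 # coverage and average width
```

Averaging the returned coverage indicators and widths over many repetitions (and over the coefficients) reproduces the kind of summary reported in Table 1 and Figure 2, under the simplified settings assumed here.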
Coverage probability: According to Table 1, the cheap bootstrap performs the best in terms of the coverage probabilities in almost all cases (except the real-data example where we cannot validate and only report the interval widths). In all but three entries, the cheap bootstrap gives the closest coverages to the nominal 95% among all considered bootstrap methods, and in all but three entries the cheap bootstrap coverages are above 95%. In contrast, other approaches are substantially below the nominal level except for very few cases with B = 10. For example, in the ellipsoidal estimation, cheap bootstrap coverage probabilities are above 95% for all considered B s, while the highest coverage among other bootstrap methods is 82.1% even for B = 10. These observations corroborate with theory since unlike standard bootstrap methods, the cheap bootstrap gives small coverage errors even with very small B. Note that when B = 1, Bootstrap in High Dimension with Low Computation the entries of other bootstrap methods are all N.A. since quantile-based approaches cannot even output two distinct finite numbers using one resample, and standard error bootstrap uses B 1 in the denominator of the sample variance. Similar results are also observed from Tables 2 and 3, which confirm that even for distributions with dependent components and ridge regression with p > n, the cheap bootstrap continues to work and outperform other bootstrap methods. In terms of the individual plot in Figure 2, the cheap bootstrap has coverages close to the nominal level 95% for almost all βi s for any B. On the other hand, standard error bootstrap coverages are above 90% only when B = 10 while the quantile-based bootstrap coverages are still below 85% for most of the βi s even for B = 10. Interval width: Cheap bootstrap intervals are wider than other bootstrap intervals. However, these widths appear to decay very fast for the first few B s. In all examples, they decrease by around 2/3 from B = 1 to B = 2 and by around 4/5 from B = 1 to B = 10, hinting a quick enhancement of statistical efficiency as the computation budget increases. Note that while the other bootstrap intervals are shorter, they fall short in attaining the nominal coverage. It is thus reasonable to see the larger widths of the cheap bootstrap intervals which appear to push up the coverages by the right amounts. Valid boundary of p: Table 4 displays the results of logistic regression with varying p. We observe that the confidence widths have a dramatic increase when p increases to 21000 from 18000, which hints that the validity boundary of p under this setting should be at most 21000. Moreover, even for p = 15000 and p = 18000, the coverage probabilities are almost 100% which is overly large compared to the nominal level 95%. Combining with our existing favorable results for p = 9000, for this problem it appears the validity boundary could be indeed around 9000. As a comparison, Table 5 displays the results of regularized logistic regression. We can see although the interval widths seem to converge with the regularization, the coverage probabilities are still overly large for large p. For this regularized example, the validity boundary could be around 12000 or 15000, which is larger than the previous boundary. This contrast shows that regularization could be helpful to achieve a valid confidence interval when p is large. But no matter whether we add the regularization, it seems the validity boundary is always around the order p = o(n) for logistic regression. 
The results here, combined with the favorable results for ridge regression with p > n in Table 3, provide evidence that the precise validity threshold of p could be model-dependent.

5. Discussions and Other Connections

In this paper, we show how to run the bootstrap so that it achieves statistically valid coverage with very low cost, in terms of the number of resamples, even when the problem dimension grows closely with the data size. This is made possible by using sample-resample independence from a recent cheap bootstrap perspective. We provide general finite-sample coverage error bounds to support our validity, and specialize these bounds to explicit models including function-of-mean models with sub-Gaussian tails and linear models with weaker tail conditions. We discuss how our approach can operate with as few as one resample while attaining valid interval coverage in large samples. At the same time, the interval constructed with one resample tends to be wide, but fortunately shrinks quickly as the number of resamples increases from one even slightly. We run a wide set of numerical experiments to validate our theory and show the outperformance of our method compared to other benchmarks. As our numerical experiments hint that the tight growth rate of p in terms of n could be model-dependent, a future investigation is to establish more precise finite-sample bounds that are tight in some appropriate sense.

Figure 2. Box plots of empirical coverage probabilities (a) and confidence interval widths (b) of all $\beta_i$'s for different numbers of resamples in a linear regression, comparing CB, BB, PB and SEB.

Table 1. Coverage probabilities (Pro.), confidence interval widths (Wid.) and running time (unit: second) of the first six numerical examples. The closest coverage probability to the nominal 95% level among all methods in each setting is bold.

| Example | B | Cheap Pro. | Cheap Wid. | Basic Pro. | Basic Wid. | Percentile Pro. | Percentile Wid. | SE Pro. | SE Wid. | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| Ellipsoidal estimation | 1 | 96.0% | 0.069 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 9* |
| | 2 | 97.3% | 0.026 | 32.2% | 0.002 | 5.5% | 0.002 | 55.1% | 0.006 | 15* |
| | 5 | 97.4% | 0.016 | 66.0% | 0.005 | 13.6% | 0.005 | 70.1% | 0.007 | 36* |
| | 10 | 97.5% | 0.014 | 82.1% | 0.006 | 20.8% | 0.006 | 73.6% | 0.008 | 70* |
| Sinusoidal estimation | 1 | 94.4% | 0.958 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 7* |
| | 2 | 95.2% | 0.384 | 29.6% | 0.051 | 35.2% | 0.051 | 63.2% | 0.142 | 12* |
| | 5 | 93.6% | 0.248 | 71.2% | 0.117 | 66.4% | 0.117 | 86.4% | 0.187 | 27* |
| | 10 | 94.4% | 0.222 | 84.0% | 0.156 | 83.2% | 0.156 | 89.6% | 0.196 | 52* |
| Linear regression (independent covariates) | 1 | 95.1% | 0.68 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 443 |
| | 2 | 95.1% | 0.256 | 33.5% | 0.038 | 33.5% | 0.038 | 70.2% | 0.105 | 666 |
| | 5 | 95.2% | 0.164 | 67.0% | 0.078 | 67.0% | 0.078 | 88.1% | 0.123 | 1337 |
| | 10 | 95.2% | 0.146 | 82.2% | 0.103 | 82.2% | 0.103 | 92.1% | 0.128 | 2454 |
| Logistic regression (independent covariates) | 1 | 96.1% | 2.866 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 50 |
| | 2 | 96.9% | 1.074 | 39.7% | 0.147 | 31.7% | 0.147 | 73.4% | 0.407 | 81 |
| | 5 | 97.9% | 0.685 | 77.9% | 0.302 | 63.3% | 0.302 | 91.0% | 0.479 | 175 |
| | 10 | 98.4% | 0.609 | 91.9% | 0.400 | 77.7% | 0.400 | 94.6% | 0.496 | 331 |
| Stochastic simulation | 1 | 96.9% | 1.757×10⁻³ | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 1 |
| | 2 | 98.8% | 6.417×10⁻⁴ | 21.9% | 6.962×10⁻⁵ | 47.0% | 6.962×10⁻⁵ | 68.7% | 1.930×10⁻⁴ | 2 |
| | 5 | 99.7% | 4.044×10⁻⁴ | 43.2% | 1.428×10⁻⁴ | 90.4% | 1.428×10⁻⁴ | 87.1% | 2.269×10⁻⁴ | 3 |
| | 10 | 100% | 3.591×10⁻⁴ | 55.6% | 1.915×10⁻⁴ | 99.8% | 1.915×10⁻⁴ | 92.6% | 2.375×10⁻⁴ | 5 |
| Real data | 1 | N.A. | 3.594 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 156 |
| | 2 | N.A. | 1.361 | N.A. | 0.201 | N.A. | 0.201 | N.A. | 0.556 | 233 |
| | 5 | N.A. | 0.877 | N.A. | 0.414 | N.A. | 0.414 | N.A. | 0.658 | 464 |
| | 10 | N.A. | 0.779 | N.A. | 0.547 | N.A. | 0.547 | N.A. | 0.682 | 849 |

We close this paper by positioning our results in the broader literature. First, our work is related to bootstrap coverage analysis. The commonest approach is to use the Edgeworth expansion that reveals the asymptotic higher-order terms in the coverage errors; see the comprehensive monograph Hall (2013). It is only recently that finite-sample bounds on the bootstrap have appeared, mostly in the high-dimensional CLT literature where the target is the sample mean (Chernozhukov et al. (2017); Lopes (2022); Chernozhukov et al. (2020) and references therein). These works aim to prove a uniform finite-sample bound on the normal approximation of the (bootstrap) sample mean over all hyperrectangles. An alternative approach is to use Stein's method (Fang & Koike, 2021). Second, within the bootstrap framework, various approaches have been proposed to reduce the Monte Carlo sampling effort by, e.g., variance reduction such as importance sampling (Booth & Do, 1993), or analytic approximation especially when applying iterated bootstraps (Booth & Hall, 1994; Lee & Young, 1995). These methods, however, require additional knowledge such as an explicit way to calculate the variance, or focus on tail estimation issues. The closest work to the cheap bootstrap idea we utilize in this paper is Hall (1986b), who investigates the number of resamples for one-sided bootstrap intervals. Nonetheless, Hall (1986b) suggests a minimum of 19 for B in a 95%-level interval, obtained via an order-statistics calculation, which is still much larger than the minimal choice B = 1 in the cheap bootstrap.

Acknowledgement

We gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-1834710 and IIS-1849280.

References

Alaa, A. and Van Der Schaar, M. Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. In Proceedings of the 37th International Conference on Machine Learning, pp. 165-174. PMLR, 2020.

Balakrishnan, S. and Madigan, D. Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research, 9(10):313-337, 2008.

Bentkus, V. On the dependence of the Berry-Esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385-402, 2003.

Beran, R. Prepivoting to reduce level error of confidence sets. Biometrika, 74(3):457-468, 1987.

Booth, J. G. and Do, K.-A. Simple and efficient methods for constructing bootstrap confidence intervals. Computational Statistics, 8:333-333, 1993.

Booth, J. G. and Hall, P. Monte Carlo approximation and the iterated bootstrap. Biometrika, 81(2):331-340, 1994.

Cheng, R. C. and Holland, W. Sensitivity of computer simulation experiments to errors in input data. Journal of Statistical Computation and Simulation, 57(1-4):219-241, 1997.

Chernozhukov, V., Chetverikov, D., and Kato, K. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45(4):2309-2352, 2017.

Chernozhukov, V., Chetverikov, D., and Koike, Y. Nearly optimal central limit theorem and bootstrap approximations in high dimensions. arXiv preprint arXiv:2012.09513, 2020.

Davison, A. C. and Hinkley, D. V. Bootstrap Methods and Their Application. Cambridge University Press, 1997.

Di Ciccio, T. and Tibshirani, R. Bootstrap confidence intervals and bootstrap approximations. Journal of the American Statistical Association, 82(397):163-170, 1987.

Di Ciccio, T. J., Efron, B., et al.
Bootstrap confidence intervals. Statistical Science, 11(3):189 228, 1996. Efron, B. Better bootstrap confidence intervals. Journal of the American statistical Association, 82(397):171 185, 1987. Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1994. Fang, X. and Koike, Y. High-dimensional central limit theorems by Stein s method. Annals of Applied Probability, 31(4):1660 1686, 2021. Giordano, R., Stephenson, W., Liu, R., Jordan, M., and Broderick, T. A Swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1139 1147. PMLR, 2019. Hall, P. On the bootstrap and confidence intervals. Annals of Statistics, 14(4):1431 1452, 1986a. Hall, P. On the number of bootstrap simulations required to construct a confidence interval. Annals of Statistics, pp. 1453 1462, 1986b. Hall, P. Theoretical comparison of bootstrap confidence intervals. The Annals of Statistics, pp. 927 953, 1988. Hall, P. The Bootstrap and Edgeworth Expansion. Springer Science & Business Media, 2013. Hall, P. and Martin, M. A. On bootstrap resampling and iteration. Biometrika, 75(4):661 671, 1988. Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. The big data bootstrap. In Proceedings of the 29th International Conference on Machine Learning, pp. 1787 1794, 2012. Lam, H. A cheap bootstrap method for fast inference. ar Xiv preprint ar Xiv:2202.00090, 2022a. Lam, H. Cheap bootstrap for input uncertainty quantification. In 2022 Winter Simulation Conference (WSC), pp. 2318 2329. IEEE, 2022b. Lam, H. and Qian, H. Subsampling to enhance efficiency in input uncertainty quantification. Operations Research, 70(3):1891 1913, 2022. Lee, S. M. and Young, G. A. Asymptotic iterated bootstrap confidence intervals. Annals of Statistics, 23(4):1301 1330, 1995. Lewis, D. D., Yang, Y., Russell-Rose, T., and Li, F. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361 397, 2004. Lin, Y., Song, E., and Nelson, B. Single-experiment input uncertainty. Journal of Simulation, 9(3):249 259, 2015. Lopes, M. E. Central limit theorem and bootstrap approximation in high dimensions: Near 1/ n rates via implicit smoothing. Annals of Statistics, 50(5):2492 2513, 2022. Lu, Z., Ie, E., and Sha, F. Uncertainty estimation with infinitesimal jackknife, its distribution and mean-field approximation. ar Xiv preprint ar Xiv:2006.07584, 2020. Pinelis, I. and Molzon, R. Optimal-order bounds on the rate of convergence to normality in the multivariate delta method. Electronic Journal of Statistics, 10(1):1001 1063, 2016. Rudelson, M. and Vershynin, R. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18:1 9, 2013. Bootstrap in High Dimension with Low Computation Schulam, P. and Saria, S. Can you trust this prediction? Auditing pointwise reliability after learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1022 1031. PMLR, 2019. Shao, J. and Tu, D. The Jackknife and Bootstrap. Springer Science & Business Media, 2012. Singh, S., Kubica, J., Larsen, S., and Sorokina, D. Parallel large scale feature selection for logistic regression. In Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 1172 1183. SIAM, 2009. Sur, P. and Cand es, E. J. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516 14525, 2019. 
Van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 2000. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. Zhilova, M. Nonclassical Berry Esseen inequalities and accuracy of the bootstrap. Annals of Statistics, 48(4): 1922 1939, 2020. Bootstrap in High Dimension with Low Computation A. A More Detailed Explanation on the Cheap Bootstrap in the Low-Dimensional Case This section aims to give more details on the cheap bootstrap method in the low-dimensional case discussed in Section 2. Suppose we are interested in estimating a target statistical quantity ψ := ψ(PX) where ψ( ) : P 7 R is a functional defined on the probability measure space P. Given i.i.d. data X1, . . . , Xn Rp following the unknown distribution PX, we denote the empirical distribution as ˆPX,n( ) := (1/n) Pn i=1 I(Xi ). A natural point estimator is ˆψn := ψ( ˆPX,n). The cheap bootstrap confidence interval for ψ is constructed as follows. Conditional on X1, . . . , Xn, we independently resample, i.e., sample with replacement, the data for B times to obtain resamples {X b 1 , . . . , X b n }, b = 1, . . . , B. Denoting ˆP b X,n as the resample empirical distributions, we construct B resample estimates ˆψ b n := ψ( ˆP b X,n). A (1 α)-level confidence interval is then given by h ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B i , (12) where S2 n,B = (1/B) PB b=1( ˆψ b n ˆψn)2, and t B,1 α/2 is the (1 α/2)-th quantile of t B, the t-distribution with degree of freedom B. Theorem 1 in Lam (2022a) shows that, under conditions on par with standard bootstrap methods, (12) is an asymptotically exact (1 α)-level confidence interval for any fixed B 1, i.e., P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]) 1 α, as n (13) where P is the probability with respect to both the data and the randomness in the resampling process. Here we explain the asymptotic argument that gives rise to (13). Under suitable conditions, the sampling distribution of an estimate ˆψn and the distribution of a resample estimate ˆψ n are approximately equal. More formally, they are equal in the asymptotic sense of two CLTs n( ˆψn ψ) d N(0, σ2) for some σ2 > 0, and n( ˆψ n ˆψn) d N(0, σ2) (conditional on X1, . . . , Xn in probability) for the same σ2. By means of a conditional argument, we can combine the two aforementioned CLTs to obtain the following joint convergence n( ˆψn ψ, ˆψ 1 n ˆψn, . . . , ˆψ B n ˆψn) d (σZ0, σZ1, . . . , σZB), as n (14) where Zb, b = 0, . . . , B are i.i.d. standard normal. From (14), we can establish the convergence of a pivotal t-statistic ( ˆψn ψ)/Sn,B d t B which gives (13). The above shows that, with B fixed as small as 1, (12) already offers a coverage close to the nominal level as n . In this argument, the approximation accuracy of ( ˆψn ψ)/Sn,B by the t B random variable is crucial. However, in the high-dimensional case when p as n , the joint CLT (14) may not hold and thus the techniques in this paper are needed to establish the validity of the cheap bootstrap method. B. Additional Theoretical Results This section provides additional theoretical results. Appendix B.1 establishes an alternative finite-sample bound for the cheap bootstrap that generalizes Theorem 3.1 to cover the large-B regime. Section B.2 provides finite-sample bounds for standard quantile-based bootstrap methods under the conditions in Sections 3.2 and 3.3. B.1. 
Further Finite-Sample Bound for the Cheap Bootstrap The following result generalizes Theorem 3.1 to include both the small and large-B regimes: Theorem B.1. Suppose we have the finite sample accuracy for the estimator ˆψn P( n( ˆψn ψ) x) Φ x and with probability at least 1 β we have the finite sample accuracy for the bootstrap estimator ˆψ n P ( n( ˆψ n ˆψn) x) Φ x Bootstrap in High Dimension with Low Computation where σ > 0 and P denotes the probability on a resample conditional on the data. Further, suppose that the following concentration inequality holds b=1 ( n( ˆψ b n ˆψn))2 σ where E3 is deterministic and σ E3 > 0. Then we have the following finite sample bound on the cheap bootstrap coverage error |P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]) (1 α)| 2E1 + 2BE2 + β, 2E1 + 2E4 + 2 π |t B,1 α/2 z1 α/2| + σ t B,1 α/2 where z1 α/2 is the (1 α/2)-quantile of the standard normal. The finite sample accuracy in Theorem B.1 consists of two parts. The first one 2E1 + 2BE2 + β works well when B is small as shown in Sections 3.2 and 3.3 but it deteriorates when B grows. In contrast, the second part 2E1 + 2E4 + 2 π |t B,1 α/2 z1 α/2| + σ t B,1 α/2 (17) vanishes as B, n but does not if B is bounded even if n . Its behavior for bounded B is easy to see: The third term p 2/π|t B,1 α/2 z1 α/2| in (17) is bounded away from zero if B is bounded and thus (17) never converges to zero even if n . To explain why (17) vanishes as B, n , first note that the first term 2E1 is independent of B and satisfies E1 0 as n by the Berry-Esseen Theorem for a reasonable model ψ( ) such as the function-of-mean model in Section 3.2. Second, notice that q (1/B) PB b=1( n( ˆψ b n ˆψn))2 is the bootstrap estimator of the asymptotic standard deviation σ. Therefore, (16) is the concentration inequality for the bootstrap principle applied to the estimation of σ and would hold with a choice of E3 and E4 satisfying E3 0, E4 0 as B, n . Lastly, since t B d N(0, 1) as B , by Lemma 21.2 in Van der Vaart (2000), we have t B,1 α/2 z1 α/2 as B . Therefore, we can see the second part (17) converges to zero as B, n at any rate. Under concrete assumptions on X as in Sections 3.2 and 3.3, explicit forms of E3 and E4 depending on B, n and the distribution of X can be derived, based on similar arguments as the explicit bounds in Theorems 3.5-3.8. Then by studying the order of these explicit bounds with respect to B, p and n, we can deduce a proper growth rate of dimension p = p(B, n) which ensures a vanishing error as B, n . The concentration inequality (16) seems unexplored in the literature and we leave it as future work. B.2. Explicit Finite-Sample Bounds for Quantile-Based Bootstrap Methods In this section, in parallel to Theorems 3.5-3.8 for the cheap bootstrap, we provide a few explicit bounds for standard quantile-based bootstrap methods under the same conditions. The first result is in parallel to Theorem 3.5 under the function-of-mean model: Theorem B.2. Suppose the conditions in Theorem 3.5 hold. If qα/2, q1 α/2 are the α/2-th and (1 α/2)-th quantiles of g( X n) g( Xn) respectively given X1, . . . , Xn, then the finite-sample bound on the basic bootstrap coverage error is given by |P(g( Xn) q1 α/2 g(µ) g( Xn) qα/2) (1 α)| m31 nσ3 + CHgm1/3 31 tr(Σ) nσ2 + CHgm2/3 32 n5/6σ + CHgm1/3 31 m2/3 32 nσ2 n + ||E[(X µ)3]|| Bootstrap in High Dimension with Low Computation 1/2 + τ 2 p 1/2 + τ 3 p τ 4(log n)3/2 λ2 Σ n + τ 2(log n)3/2 1/2 (log n + log p) p where C is a universal constant and C1 is a constant only depending on CX. 
If q α/2, q 1 α/2 are the α/2-th and (1 α/2)-th quantiles of g( X n) respectively given X1, . . . , Xn, then the finite-sample bound on the percentile bootstrap coverage error is given by |P(q α/2 g(µ) q 1 α/2) (1 α)| m31 nσ3 + CHgm1/3 31 tr(Σ) nσ2 + CHgm2/3 32 n5/6σ + CHgm1/3 31 m2/3 32 nσ2 n + ||E[(X µ)3]|| 1/2 + τ 2 p 1/2 + τ 3 p τ 4(log n)3/2 λ2 Σ n + τ 2(log n)3/2 1/2 (log n + log p) p where C is a universal constant and C1 is a constant only depending on CX. Our discussion below Theorem 3.2 shows that the cheap bootstrap error bound with any given B and quantile-based bootstrap error bounds with B = only differ up to a constant. Therefore, the order analysis for the cheap bootstrap in Corollary 3.6 also applies here, that is, under the conditions in Corollary 3.6, the quantile-based bootstrap coverage errors shrink to 0 as n if p = o(n). The second result is in parallel to Theorem 3.7 under the sub-exponential assumption and linearity of g: Theorem B.3. Suppose the conditions in Theorem 3.7 hold. If qα/2, q1 α/2 are the α/2-th and (1 α/2)-th quantiles of g( X n) g( Xn) respectively given X1, . . . , Xn, then the finite-sample bound on the basic bootstrap coverage error is given by |P(g( Xn) q1 α/2 g(µ) g( Xn) qα/2) (1 α)| 1 n + E[|g 1 (X µ)|3] σ3 n + ||g 1 (X µ)||4 ψ1 log11(n) σ4 n + CE[|g 1 (X µ)|3] where C is a universal constant. If q α/2, q 1 α/2 are the α/2-th and (1 α/2)-th quantiles of g( X n) respectively given X1, . . . , Xn, then the finite-sample bound on the percentile bootstrap coverage error is given by |P(q α/2 g(µ) q 1 α/2) (1 α)| 1 n + E[|g 1 (X µ)|3] σ3 n + ||g 1 (X µ)||4 ψ1 log11(n) σ4 n + CE[|g 1 (X µ)|3] where C is a universal constant. The last result is in parallel to Theorem 3.8 under moment conditions and linearity of g: Theorem B.4. Suppose the conditions in Theorem 3.8 hold. If qα/2, q1 α/2 are the α/2 and (1 α/2)-quantiles of g( X n) g( Xn) respectively given X1, . . . , Xn, then the finite-sample bound on the basic bootstrap coverage error is given by |P(g( Xn) q1 α/2 g(µ) g( Xn) qα/2) (1 α)| Bootstrap in High Dimension with Low Computation 2 n + C1 max E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] (log n)3/2 n + log n n1/2 3/(2q) + CE[|g 1 (X µ)|3] where C is a universal constant and C1 is a constant depending only on q. If q α/2, q 1 α/2 are the α/2 and (1 α/2)- quantiles of g( X n) respectively given X1, . . . , Xn, then the finite-sample bound on the percentile bootstrap coverage error is given by |P(q α/2 g(µ) q 1 α/2) (1 α)| 2 n + C1 max E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] (log n)3/2 n + log n n1/2 3/(2q) + CE[|g 1 (X µ)|3] where C is a universal constant and C1 is a constant depending only on q. The order analysis for the cheap bootstrap also applies to Theorems B.3 and B.4. If g 1 (X µ) is well-scaled by its standard deviation σ in the sense that the Lp norm and Orlicz norm || ||ψ1 is independent of p, then the errors shrink to 0 for any p as n . Otherwise, the growth rate of p should depend on n to obtain a vanishing error. C. Details of Numerical Experiments and Additional Numerical Results In this section, we present additional results and details of the experiments in Section 4. We also report some additional experiments. Section C.1 presents tables for experimental results in Section 4 that have not been shown in previous sections. The following subsections, C.2, C.3 and C.4 provide additional details for logistic regression with independent covariates, the computer network and the real world problem presented in Section 4. 
Section C.5 further validates our performances by a simulation study with a lower nominal level 70%. Finally, Section C.6 studies the coverage error behavior as B and n vary for the sinusoidal estimation. C.1. Additional Tables Tables 2-5 reports the experimental results of regression problems with dependent covariates, ridge regression and (regularized) logistic regression with different p presented in Section 4 respecitvely. C.2. Logistic Regression with Independent Covariates Figure 3 presents the coverage probabilities and confidence interval widths of 95%-level confidence intervals for three typical choices of parameters: β1 = 1, β301 = 1 and β601 = 0. We observe that all cheap bootstrap coverage probabilities are close to or larger than the nominal level 95% while other bootstrap method coverages are below 90% except for the standard error bootstrap for β601 = 0 and B 5. Besides, cheap bootstrap interval widths are larger than others but decay very fast for the first few B s, in line with our observation in the previous linear regression example. In fact, it is already quite close to other bootstrap widths for β601 = 0 and B = 10. Figure 4 reports the box plot of the coverage probabilities and confidence interval widths of all βi s with B = 1, 2, 5, 10. We distinguish between βi = 0 and βi = 0 since the former has wider widths than the latter. For βi = 0, the cheap bootstrap widths shrink more slowly so that almost all cheap bootstrap coverage probabilities are 100% but other bootstrap method coverages are still below 90% in almost all cases. For βi = 0, the cheap bootstrap with any B, standard error bootstrap with B = 5, 10 and basic bootstrap with B = 10 have coverage probabilities close to the nominal level 95%. In other cases, most of the coverage probabilities are below 85%. A similar decay rate for the cheap bootstrap interval width is also observed here: it decreases by around 2/3 from B = 1 to B = 2 and by around 4/5 from B = 1 to B = 10. C.3. Computer Network We detail the specifications of the computer communication network simulation model; similar models have been used in Cheng & Holland (1997); Lin et al. (2015); Lam & Qian (2022). This network can be represented by an undirected graph Bootstrap in High Dimension with Low Computation 2 4 6 8 10 number of Monte Carlo simulations estimated coverage probability Cheap Boostrap Basic Boostrap Percentile Boostrap Standard Error Boostrap (a) Coverage probability for β1 = 1 2 4 6 8 10 number of Monte Carlo simulations estimated confidence interval width Cheap Boostrap Basic/Percentile Boostrap Standard Error Boostrap (b) Confidence interval width for β1 = 1 2 4 6 8 10 number of Monte Carlo simulations estimated coverage probability Cheap Boostrap Basic Boostrap Percentile Boostrap Standard Error Boostrap (c) Coverage probability for β301 = 1 2 4 6 8 10 number of Monte Carlo simulations estimated confidence interval width Cheap Boostrap Basic/Percentile Boostrap Standard Error Boostrap (d) Confidence interval width for β301 = 1 2 4 6 8 10 number of Monte Carlo simulations estimated coverage probability Cheap Boostrap Basic Boostrap Percentile Boostrap Standard Error Boostrap (e) Coverage probability for β601 = 0 2 4 6 8 10 number of Monte Carlo simulations estimated confidence interval width Cheap Boostrap Basic/Percentile Boostrap Standard Error Boostrap (f) Confidence interval width for β601 = 0 Figure 3. Empirical coverage probabilities and confidence interval widths for different numbers of resamples in a logistic regression. 
Table 2. Coverage probabilities (Pro.), confidence interval widths (Wid.) and running times (unit: second) of the regression problems with dependent covariates. The closest coverage probability to the nominal 95% level among all methods in each setting is bold.

Example               B    Cheap Bootstrap     Basic Bootstrap    Percentile Bootstrap   Standard Error Bootstrap   Time
                           Pro.    Wid.        Pro.    Wid.       Pro.    Wid.           Pro.    Wid.
Linear regression     1    95.1%   1.451       N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               167*
(exponential decay)   2    95.1%   0.546       33.6%   0.081      33.5%   0.081          70.3%   0.224              258*
                      5    95.2%   0.350       67.1%   0.166      67.1%   0.166          88.2%   0.264              531*
                      10   95.2%   0.311       82.2%   0.220      82.2%   0.220          92.1%   0.273              987*
Linear regression     1    94.7%   129.255     N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               511
(random cov matrix)   2    94.9%   52.791      33.9%   7.575      34.5%   7.575          69.9%   20.995             798
                      5    94.8%   30.733      67.8%   14.875     66.8%   14.875         87.5%   23.555             1659
                      10   95.0%   27.048      82.1%   19.398     82.1%   19.398         91.9%   23.787             3095
Logistic regression   1    95.3%   7.411       N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               114*
(exponential decay)   2    95.5%   2.787       34.9%   0.404      32.3%   0.404          70.7%   1.119              175*
                      5    95.8%   1.787       69.5%   0.832      64.8%   0.832          88.6%   1.318              360*
                      10   96.0%   1.588       84.6%   1.101      79.8%   1.101          92.6%   1.364              667*

C.3. Computer Network

We detail the specifications of the computer communication network simulation model; similar models have been used in Cheng & Holland (1997); Lin et al. (2015); Lam & Qian (2022). This network can be represented by the undirected graph in Figure 5. The four nodes denote message processing units and the four edges are transport channels. For every pair of nodes i, j (i ≠ j), external messages enter the network at node i and are to be transmitted to node j through a prescribed path. Their arrivals follow a Poisson process with rate λi,j shown in Table 6. All message lengths (unit: bits) are i.i.d. following a common exponential distribution with mean 300 bits. Each unit spends 0.001 second to process a message passing through it. We assume the node storage is unlimited but the channel storage is restricted to 275000 bits. The message speed in the transport channels is 150000 miles per second and channel i has length 100i miles, so it takes l/275000 + 100i/150000 seconds for a message of length l bits to pass through channel i. The network is empty at the beginning. The performance measure of interest is the steady-state average delay of the messages, where the delay of a message is the time from entering its origin node to reaching its destination node; its approximate true value is 7.05 × 10^{-3}. This example has 13 unknown input distributions, i.e., 12 inter-arrival time distributions Exp(λi,j) and one message length distribution Exp(1/300), for which we have data sizes ranging from 3 × 10^4 to 6 × 10^4.

Given input distributions $P_1, \ldots, P_{13}$, the performance measure of this system can be written as $\psi(P_1, \ldots, P_{13}) = \mathbb{E}_{P_1, \ldots, P_{13}}\big[\lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} D_k\big]$, where $D_k$ is the delay of the $k$-th message. The point estimator of $\psi(P_1, \ldots, P_{13})$ is taken as $\hat\psi = \psi(\hat P_{1,n_1}, \ldots, \hat P_{13,n_{13}})$, where each $\hat P_{i,n_i}$ is the empirical distribution of the $n_i$ i.i.d. samples $\{X_{i,j}, j = 1, \ldots, n_i\}$ from the $i$-th input distribution $P_i$. Next we construct the bootstrap estimators $\hat\psi^{*b}$. For each $b = 1, \ldots, B$ and $i = 1, \ldots, 13$, we sample with replacement from the data $\{X_{i,j}, j = 1, \ldots, n_i\}$ to obtain the bootstrap resample $\{X^{*b}_{i,j}, j = 1, \ldots, n_i\}$ and denote the resample empirical distribution by $\hat P^{*b}_{i,n_i}$; the resampling is conducted independently across different $b$ and $i$. The bootstrap estimator is taken as $\hat\psi^{*b} = \psi(\hat P^{*b}_{1,n_1}, \ldots, \hat P^{*b}_{13,n_{13}})$. The cheap bootstrap confidence interval is still constructed as in (1).
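As a minimal sketch of this construction under stated assumptions (psi stands in for the discrete-event simulator of the network, run long enough to estimate the steady-state delay, and the helper name cheap_bootstrap_simulation_ci is hypothetical), the independent resampling over the 13 input data sets and the interval in (1) can be organized as follows.

import numpy as np
from scipy import stats

def cheap_bootstrap_simulation_ci(datasets, psi, B=5, alpha=0.05, seed=0):
    """Cheap bootstrap CI for a simulation output driven by several input models.

    datasets: list of 1-d arrays, the observed i.i.d. data for each input
              distribution (here 12 inter-arrival samples and 1 length sample).
    psi:      function mapping a list of samples, treated as empirical
              distributions, to the simulated performance measure.
    """
    rng = np.random.default_rng(seed)
    psi_hat = psi(datasets)                  # plug in the empirical distributions
    psi_star = np.empty(B)
    for b in range(B):
        # resample each input data set independently, with replacement
        resampled = [rng.choice(d, size=len(d), replace=True) for d in datasets]
        psi_star[b] = psi(resampled)         # rerun the simulation on the resamples
    s = np.sqrt(np.mean((psi_star - psi_hat) ** 2))
    half_width = stats.t.ppf(1 - alpha / 2, df=B) * s
    return psi_hat - half_width, psi_hat + half_width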
The results for the above configuration can be found in the row "Stochastic simulation" of Table 1, and the corresponding discussions can be found in Section 4. To investigate the robustness of the cheap bootstrap and the other methods, we consider an alternative configuration in which the computer network is the same but the input models are different. More concretely, all 13 input models (12 inter-arrival time distributions and one message length distribution) are changed to Gamma distributions Gamma(α, β), which have densities of the form $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$, $x > 0$. The message length distribution follows Gamma(2.5, 1/200) and the parameters of the inter-arrival time distributions Gamma(αi,j, βi,j) are given in Table 7. Under the new input distributions, the true steady-state mean delay is approximately 0.0109. Figure 6 reports the results. The cheap bootstrap coverage probabilities are close to the nominal level 95% for any B, while those of the basic bootstrap and the standard error bootstrap are below 60% and 90% respectively even for B = 10. The percentile bootstrap also performs well when B ≥ 7, perhaps because of the skewness of the estimates, but this is not always the case in view of the previous numerical results.

Table 3. Coverage probabilities (Pro.), confidence interval widths (Wid.) and running times (unit: second) of the ridge regression with p = 9000 and n = 8000. The closest coverage probabilities to the nominal 95% level among all methods are bold.

Example                   B    Cheap Bootstrap     Basic Bootstrap    Percentile Bootstrap   Standard Error Bootstrap   Time
                               Pro.    Wid.        Pro.    Wid.       Pro.    Wid.           Pro.    Wid.
Ridge regression          1    96.7%   15.572      N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               48
(independent; λ = 0.1)    2    97.2%   5.719       23.1%   0.553      25.5%   0.553          68.8%   1.534              72
                          5    97.5%   3.595       47.0%   1.141      51.8%   1.141          86.6%   1.807              144
                          10   97.6%   3.171       60.2%   1.510      66.0%   1.510          90.7%   1.870              264
Ridge regression          1    96.3%   13.734      N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               48
(independent; λ = 1)      2    96.7%   5.082       27.1%   0.539      24.8%   0.539          68.7%   1.495              72
                          5    97.0%   3.212       54.8%   1.111      50.3%   1.111          86.5%   1.761              144
                          10   97.1%   2.839       69.2%   1.471      64.3%   1.471          90.6%   1.822              264
Ridge regression          1    96.9%   9.588       N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               47
(exponential decay;       2    97.7%   3.453       11.6%   0.263      37.3%   0.263          48.0%   0.728              71
λ = 0.1)                  5    98.4%   2.143       23.8%   0.541      73.9%   0.541          59.9%   0.857              142
                          10   98.7%   1.882       31.2%   0.716      88.5%   0.716          62.8%   0.887              260
Ridge regression          1    95.7%   5.776       N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               48
(exponential decay;       2    96.3%   2.145       19.5%   0.240      36.2%   0.240          61.6%   0.665              71
λ = 1)                    5    96.9%   1.360       39.8%   0.495      71.8%   0.495          78.0%   0.784              142
                          10   97.2%   1.203       51.5%   0.654      86.7%   0.654          81.9%   0.811              261
Ridge regression          1    96.7%   15.232      N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               49
(random cov matrix;       2    96.8%   5.527       19.4%   0.462      21.2%   0.462          64.4%   1.281              74
λ = 0.1)                  5    96.5%   3.448       39.6%   0.953      43.4%   0.953          81.5%   1.510              151
                          10   96.3%   3.032       51.2%   1.261      56.4%   1.261          85.6%   1.563              279
Ridge regression          1    96.3%   13.302      N.A.    N.A.       N.A.    N.A.           N.A.    N.A.               50
(random cov matrix;       2    96.4%   4.873       23.5%   0.457      20.9%   0.457          65.2%   1.267              76
λ = 1)                    5    96.2%   3.060       47.9%   0.943      42.8%   0.943          82.5%   1.493              155
                          10   95.9%   2.696       61.2%   1.247      55.7%   1.247          86.6%   1.545              286

C.4. Real World Problem

The data we use is the RCV1-v2 dataset in Lewis et al. (2004). This dataset contains n = 804414 manually categorized newswire stories with a total of p = 47236 features.
We compare the confidence interval widths of the logistic regression parameters for the four bootstrap methods, using all the observations to run the logistic regression and estimate the parameters. That is, the observation matrix is of size 804414 × 47236, i.e., about 4 × 10^10 entries. There are up to 103 different categories for all these newswire stories. As in Singh et al. (2009) and Balakrishnan & Madigan (2008), we only use the economics category ("ECAT") as the +1 label, i.e., the label Y is 1 if the newswire story is in economics and 0 if not, which leads to 119920 positive labels. Besides, we add ℓ2 regularization to this logistic regression as in Singh et al. (2009) and Balakrishnan & Madigan (2008).

Table 4. Coverage probabilities (Pro.), confidence interval widths (Wid.) and running times (unit: second) of the logistic regression with n = 10^5 and different p. The closest coverage probability to the nominal 95% level among all methods in each setting is bold.

B    Cheap Bootstrap      Basic Bootstrap     Percentile Bootstrap   Standard Error Bootstrap   Time
     Pro.    Wid.         Pro.    Wid.        Pro.    Wid.           Pro.    Wid.
1    96.7%   3.697        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               46*
2    97.7%   1.378        42.8%   0.182       32.3%   0.182          75.9%   0.504              76*
5    99.0%   0.878        83.1%   0.375       64.6%   0.375          93.1%   0.594              165*
10   99.5%   0.778        95.3%   0.496       79.0%   0.496          96.2%   0.615              314*

1    97.5%   5.368        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               84*
2    98.7%   1.993        46.3%   0.251       33.3%   0.251          80.2%   0.697              142*
5    99.8%   1.268        88.0%   0.518       66.4%   0.518          95.9%   0.821              316*
10   100%    1.123        96.2%   0.686       80.9%   0.686          98.1%   0.849              606*

1    98.7%   11.473       N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               160*
2    99.7%   4.249        47.2%   0.495       35.0%   0.495          88.3%   1.372              285*
5    100%    2.700        89.1%   1.022       69.6%   1.022          99.2%   1.619              659*
10   100%    2.389        96.2%   1.354       84.0%   1.354          99.9%   1.676              1283*

1    100%    1272.497     N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               143*
2    100%    468.656      36.6%   48.271      36.5%   48.271         99.9%   133.799            242*
5    100%    296.354      72.5%   99.900      72.2%   99.900         100%    158.067            537*
10   100%    261.811      86.6%   132.870     86.3%   132.870        100%    163.747            1029*

1    99.9%   201.611      N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               142*
2    100%    74.595       36.6%   7.597       35.3%   7.597          98.8%   21.057             215*
5    100%    47.205       72.5%   15.737      70.1%   15.737         100%    24.922             433*
10   100%    41.583       86.8%   20.819      84.7%   20.819         100%    25.750             797*

To run this logistic regression, we use sklearn.linear_model.LogisticRegression (from the scikit-learn machine learning package in Python) on the virtual machine c2-standard-8 in Google Cloud Platform, which takes about 30-40 minutes to run one bootstrap resample. Therefore, the common bootstrap methods, which require B = 50 or 100, would be computationally expensive. In Section 4, we report and discuss the average interval widths over all βi's for the four bootstrap methods. Here we display the results for three individual parameters, namely the first three βi's, in Figure 7. Since we are only able to run one experimental repetition in this real-world example, the confidence interval widths contain some noise and thus we cannot observe monotonicity of the widths as B increases. But we still see that the cheap bootstrap confidence interval widths are wider than the others, with general trends that resemble the average interval widths in our synthetic examples. This suggests that the cheap bootstrap confidence intervals would have higher and closer-to-nominal coverages than the other methods.
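Below is a minimal sketch of the resample-and-refit step on this dataset; it assumes the data are obtained via sklearn.datasets.fetch_rcv1 and uses illustrative regularization and solver settings rather than the exact ones in our runs. Each call to refit_on_resample is the step that takes roughly 30-40 minutes on the machine above.

import numpy as np
from sklearn.datasets import fetch_rcv1
from sklearn.linear_model import LogisticRegression

rcv1 = fetch_rcv1()                               # sparse 804414 x 47236 matrix
X = rcv1.data
ecat = list(rcv1.target_names).index("ECAT")      # economics category as the +1 label
y = np.asarray(rcv1.target[:, ecat].todense()).ravel()

def refit_on_resample(rng):
    """One bootstrap replication: resample rows and refit the l2-penalized model."""
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    model = LogisticRegression(penalty="l2", max_iter=200)  # settings are illustrative
    model.fit(X[idx], y[idx])
    return model.coef_.ravel()

rng = np.random.default_rng(0)
coef_star = [refit_on_resample(rng) for _ in range(5)]      # B = 5 costly refits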
C.5. Numerical Experiments with a Lower Nominal Level

In this section, we conduct a simulation study with the nominal level 70% to further support the validity of the cheap bootstrap. We choose the ellipsoidal estimation and the sinusoidal estimation presented in Section 4 as our models. All the settings are the same except that we use a different sample size n = 4 × 10^4 and a different dimension p = 9000. Table 8 presents the empirical coverage and average interval width over 1000 experimental repetitions. We observe that the cheap bootstrap coverage probabilities are still close to the nominal level, and are the closest among all methods in all cases except two (the sinusoidal estimation with B = 5 and B = 10), where the percentile bootstrap and the standard error bootstrap respectively outperform slightly. Regarding these exceptional cases, we note that the outperformance of the percentile bootstrap is likely a coincidence: as a quantile-based method, it cannot construct a symmetric 70%-level confidence interval from only 5 resamples. We have used the minimum and maximum of the 5 resamples to construct this percentile bootstrap confidence interval, whose actual nominal level should be close to 100%. So it is likely by accident that the percentile bootstrap coverage is closest to 70%, and in fact this coverage is far away from its actual nominal level of around 100%.

Table 5. Coverage probabilities (Pro.), confidence interval widths (Wid.) and running times (unit: second) of the ℓ2-regularized logistic regression with n = 10^5 and different p. The closest coverage probability to the nominal 95% level among all methods in each setting is bold.

B    Cheap Bootstrap      Basic Bootstrap     Percentile Bootstrap   Standard Error Bootstrap   Time
     Pro.    Wid.         Pro.    Wid.        Pro.    Wid.           Pro.    Wid.
1    96.3%   3.162        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               42*
2    97.3%   1.182        41.0%   0.161       32.2%   0.161          74.7%   0.446              69*
5    98.4%   0.755        80.3%   0.331       64.4%   0.331          92.2%   0.525              150*
10   98.9%   0.669        93.5%   0.438       78.9%   0.438          95.6%   0.543              284*

1    96.8%   3.943        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               83*
2    97.9%   1.471        43.5%   0.196       32.9%   0.196          76.9%   0.543              141*
5    99.1%   0.938        84.3%   0.404       65.6%   0.404          93.9%   0.640              315*
10   99.6%   0.832        95.9%   0.534       80.1%   0.534          96.8%   0.662              606*

1    97.3%   5.056        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               156*
2    98.5%   1.884        45.7%   0.246       33.6%   0.246          79.3%   0.683              241*
5    99.6%   1.202        87.6%   0.508       67.1%   0.508          95.5%   0.805              498*
10   99.9%   1.065        97.1%   0.673       81.6%   0.673          97.8%   0.833              925*

1    97.6%   6.400        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               191*
2    98.8%   2.385        47.1%   0.310       34.5%   0.310          81.4%   0.859              319*
5    99.8%   1.520        89.4%   0.639       68.6%   0.639          96.5%   1.012              703*
10   100%    1.348        97.5%   0.845       83.1%   0.845          98.3%   1.047              1343*

1    97.4%   7.259        N.A.    N.A.        N.A.    N.A.           N.A.    N.A.               208*
2    98.7%   2.712        46.5%   0.362       35.2%   0.362          80.8%   1.004              350*
5    99.7%   1.731        88.8%   0.746       69.9%   0.746          96.0%   1.183              777*
10   99.9%   1.536        98.3%   0.988       84.5%   0.988          97.9%   1.224              1487*

C.6. Coverage Error Behavior with Respect to B and n

In this section, we numerically study the cheap bootstrap coverage error behavior with respect to B and n and illustrate how it aligns with our theoretical bounds. We choose the sinusoidal estimation in Section 4 as our model, fix the dimension p = 9000, and vary B and n. Figure 8 displays colormaps of the absolute values of the empirical coverage errors (nominal level 95%), where the x-axis represents B and the y-axis represents n. We cut the results of the basic and percentile bootstraps for the first few B's because their errors are too large. From Figure 8(a), it appears that the cheap bootstrap coverage error does not change much in this regime of n. This matches, to some extent, Theorem 3.5 and Corollary 3.6, which guarantee that the coverage error decreases as n increases, but only at a slow rate of order $1/\sqrt{n}$. Further, when we fix n, the cheap bootstrap coverage error shows no clear trend and is otherwise quite stable as B changes. On the other hand, the basic and percentile bootstrap coverage errors show a clear decreasing trend as B increases. These different behaviors are attributed to the different ideas behind the methods. The basic bootstrap and percentile bootstrap are quantile-based: as B increases, the bootstrap quantile estimate gets closer and closer to the true quantile, which improves the coverage error. The cheap bootstrap, however, relies on a totally different idea, namely the approximate independence of the resample estimates from the original estimator, from which a t-distribution-based confidence interval (with B degrees of freedom) can be constructed. A different B just means a different pivotal t-distribution, and there is no evident reason that a pivotal t-distribution with a larger degree of freedom B will lead to a smaller coverage error.
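The fast width decay over the first few resamples reported above, together with the flat coverage error in B, is largely explained by the t-quantile in the cheap bootstrap half-width $t_{B,1-\alpha/2} S_{n,B}$. The short snippet below (plain scipy quantiles, nothing problem-specific assumed) prints these quantiles; their drop from B = 1 to B = 2 is roughly the two-thirds width reduction noted earlier, while under an idealized exact t-pivot the coverage level would be the same for every B.

from scipy import stats

# Half-width of the cheap bootstrap interval is t_{B,0.975} * S_{n,B} at the 95% level.
for B in (1, 2, 5, 10, 30):
    print(B, round(stats.t.ppf(0.975, df=B), 3))
# prints 12.706, 4.303, 2.571, 2.228, 2.042, versus z_{0.975} = 1.960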
Figure 4. Box plots of the empirical coverage probabilities and confidence interval widths over all βi's for different numbers of resamples in a logistic regression (CB: cheap bootstrap, BB: basic bootstrap, PB: percentile bootstrap, SEB: standard error bootstrap); the panels report coverage probabilities and widths separately for βi ≠ 0 and βi = 0.

Figure 5. A computer network with four nodes and four channels.

Figure 6. Empirical coverage probabilities (panel (a)) and confidence interval widths (panel (b)) for different numbers of resamples in a computer communication network.

Figure 7. Confidence interval widths of the first three β's for different numbers of resamples in a real-world logistic regression.

Table 6. Arrival rates λi,j of messages to be transmitted from node i to node j.

Node i \ Node j      1      2      3      4
1                    N.A.   40     30     35
2                    50     N.A.   45     15
3                    60     15     N.A.   20
4                    25     30     40     N.A.

Proof of Theorem 3.1: We define $Q^*$ as the distribution of $\sqrt{n}(\hat\psi^*_n - \hat\psi_n)$ conditional on $X_1, \ldots, X_n$. With the repeated bootstrap resampling, $\sqrt{n}(\hat\psi^{*b}_n - \hat\psi_n)$, $b = 1, \ldots, B$, are independent conditional on $X_1, \ldots, X_n$, each with distribution $Q^*$. Writing, for brevity, $I(z_1, \ldots, z_B) := 1\big\{|\sqrt{n}(\hat\psi_n - \psi)| \le t_{B,1-\alpha/2}\sqrt{\tfrac{1}{B}\sum_{b=1}^B z_b^2}\big\}$, we can express the coverage probability as
$P\big(\psi \in [\hat\psi_n - t_{B,1-\alpha/2}S_{n,B},\, \hat\psi_n + t_{B,1-\alpha/2}S_{n,B}]\big) = P\Big(\frac{|\sqrt{n}(\hat\psi_n - \psi)|}{\sqrt{\frac{1}{B}\sum_{b=1}^B (\sqrt{n}(\hat\psi^{*b}_n - \hat\psi_n))^2}} \le t_{B,1-\alpha/2}\Big) = E\Big[\int \cdots \int I(z_1, \ldots, z_B)\, dQ^*(z_B) \cdots dQ^*(z_1)\Big],$
where the expectation $E$ is taken with respect to $X_1, \ldots, X_n$.
Table 7. Parameters (αi,j, βi,j) for the inter-arrival time distribution of messages to be transmitted from node i to node j.

Node i \ Node j      1            2            3            4
1                    N.A.         (1.5, 60)    (0.7, 40)    (1.3, 50)
2                    (2, 80)      N.A.         (1.5, 65)    (0.6, 20)
3                    (3, 100)     (0.5, 25)    N.A.         (1.2, 30)
4                    (0.8, 40)    (1.1, 50)    (0.9, 35)    N.A.

Table 8. Coverage probabilities (Pro.), confidence interval widths (Wid.) and running times (unit: second) of the numerical examples. The closest coverage probability to the nominal 70% level among all methods in each setting is bold.

Example       B    Cheap Bootstrap         Basic Bootstrap         Percentile Bootstrap    Standard Error Bootstrap   Time
                   Pro.    Wid.            Pro.    Wid.            Pro.    Wid.            Pro.    Wid.
Ellipsoidal   1    73.0%   9.828 × 10^-3   N.A.    N.A.            N.A.    N.A.            N.A.    N.A.               2
estimation    2    70.0%   7.568 × 10^-3   30.9%   2.197 × 10^-3   7.7%    2.197 × 10^-3   34.3%   3.220 × 10^-3      4
              5    68.5%   6.639 × 10^-3   64.7%   4.429 × 10^-3   16.5%   4.429 × 10^-3   42.3%   3.717 × 10^-3      10
              10   66.9%   6.323 × 10^-3   60.1%   3.756 × 10^-3   10.8%   3.756 × 10^-3   41.8%   3.804 × 10^-3      20
Sinusoidal    1    70.3%   0.148           N.A.    N.A.            N.A.    N.A.            N.A.    N.A.               2
estimation    2    72.0%   0.116           34.9%   0.052           33.3%   0.052           52.4%   0.077              4
              5    72.2%   0.104           67.6%   0.108           68.6%   0.108           65.6%   0.091              10
              10   72.3%   0.100           65.8%   0.093           63.7%   0.093           68.8%   0.094              19

Figure 8. Colormaps of the absolute values of the empirical coverage errors (nominal level 95%) for the sinusoidal estimation: (a) cheap bootstrap, (b) basic bootstrap, (c) percentile bootstrap.

If we write $A$ as the event that
$\sup_{x \in \mathbb{R}} \big|Q^*((-\infty, x]) - \Phi(x/\sigma)\big| = \sup_{x \in \mathbb{R}} \big|P^*(\sqrt{n}(\hat\psi^*_n - \hat\psi_n) \le x) - \Phi(x/\sigma)\big| \le E_2,$
then we know that $P(A^c) \le \beta$. We consider the coverage probability intersected with $A$, i.e.,
$E\big[\int \cdots \int I(z_1, \ldots, z_B)\, dQ^*(z_B) \cdots dQ^*(z_1);\, A\big]. \quad (19)$
Note that conditional on $X_1, \ldots, X_n$ and given $z_1, \ldots, z_{B-1}$, the region of integration for $z_B$, namely $\{z_B : I(z_1, \ldots, z_B) = 1\}$, can be written as $(-\infty, -q] \cup [q, \infty)$ for some $q \ge 0$. Therefore, applying (18), on the event $A$ we have
$\Big|\int I(z_1, \ldots, z_B)\, dQ^*(z_B) - \int I(z_1, \ldots, z_B)\, dP_0(z_B)\Big| \le 2E_2,$
where $P_0$ is the distribution of $N(0, \sigma^2)$. Plugging this into (19), we have
$E\big[\int \cdots \int I\, dQ^*(z_B) \cdots dQ^*(z_1); A\big] = E\big[\int \cdots \int I\, dP_0(z_B)\, dQ^*(z_{B-1}) \cdots dQ^*(z_1); A\big] + R_B,$
where the error $R_B$ satisfies $|R_B| \le E\big[\int \cdots \int 2E_2\, dQ^*(z_{B-1}) \cdots dQ^*(z_1); A\big] \le 2E_2$. By the same argument, we can further replace the remaining $Q^*(z_i)$'s by $P_0(z_i)$'s and obtain
$E\big[\int \cdots \int I\, dQ^*(z_B) \cdots dQ^*(z_1); A\big] = E\big[\int \cdots \int I\, dP_0(z_B) \cdots dP_0(z_1); A\big] + \sum_{b=1}^B R_b,$
where each error $R_b$ satisfies $|R_b| \le 2E_2$. Therefore, the coverage probability satisfies
$E\big[\int \cdots \int I\, dQ^*(z_B) \cdots dQ^*(z_1)\big] = E\big[\int \cdots \int I\, dQ^* \cdots dQ^*; A\big] + E\big[\int \cdots \int I\, dQ^* \cdots dQ^*; A^c\big] = E\big[\int \cdots \int I\, dP_0(z_B) \cdots dP_0(z_1)\big] + R_{A^c} + \sum_{b=1}^B R_b, \quad (20)$
where the additional error $R_{A^c}$ is given by
$R_{A^c} = E\big[\int \cdots \int I\, dQ^*(z_B) \cdots dQ^*(z_1); A^c\big] - E\big[\int \cdots \int I\, dP_0(z_B) \cdots dP_0(z_1); A^c\big]$
and it satisfies $|R_{A^c}| \le P(A^c) \le \beta$. Now we handle the distribution of $\sqrt{n}(\hat\psi_n - \psi)$, which we denote by $Q_0$. Note that by Fubini's theorem we have
$E\big[\int \cdots \int I\, dP_0(z_B) \cdots dP_0(z_1)\big] = \int \cdots \int \int 1\big\{|z_0| \le t_{B,1-\alpha/2}\sqrt{\tfrac{1}{B}\sum_{b=1}^B z_b^2}\big\}\, dQ_0(z_0)\, dP_0(z_B) \cdots dP_0(z_1).$
Given $z_1, \ldots, z_B$, consider the innermost integral with respect to $Q_0$. By the finite-sample accuracy for $Q_0$, i.e., $\sup_{x \in \mathbb{R}} \big|P(\sqrt{n}(\hat\psi_n - \psi) \le x) - \Phi(x/\sigma)\big| = \sup_{x \in \mathbb{R}} |Q_0((-\infty, x]) - P_0((-\infty, x])| \le E_1$, and since the region of integration for $z_0$ is an interval, we have
$\Big|\int 1\big\{|z_0| \le t_{B,1-\alpha/2}\sqrt{\tfrac{1}{B}\sum_{b=1}^B z_b^2}\big\}\, dQ_0(z_0) - \int 1\big\{|z_0| \le t_{B,1-\alpha/2}\sqrt{\tfrac{1}{B}\sum_{b=1}^B z_b^2}\big\}\, dP_0(z_0)\Big| \le 2E_1.$
Therefore,
$\int \cdots \int \int 1\big\{|z_0| \le t_{B,1-\alpha/2}\sqrt{\tfrac{1}{B}\sum_{b=1}^B z_b^2}\big\}\, dQ_0(z_0)\, dP_0(z_B) \cdots dP_0(z_1) = \int \cdots \int \int 1\big\{|z_0| \le t_{B,1-\alpha/2}\sqrt{\tfrac{1}{B}\sum_{b=1}^B z_b^2}\big\}\, dP_0(z_0)\, dP_0(z_B) \cdots dP_0(z_1) + R_0 = 1 - \alpha + R_0, \quad (21)$
where the second equality follows from $Z_0\big/\sqrt{\tfrac{1}{B}\sum_{b=1}^B Z_b^2} \sim t_B$ for i.i.d. $Z_i \sim N(0, \sigma^2)$, $i = 0, \ldots, B$, and the error $R_0$ satisfies $|R_0| \le 2E_1$. Plugging (21) into (20), we have
$E\big[\int \cdots \int I(z_1, \ldots, z_B)\, dQ^*(z_B) \cdots dQ^*(z_1)\big] = 1 - \alpha + R_{A^c} + \sum_{b=0}^B R_b := 1 - \alpha + R,$
where the overall error satisfies $|R| \le 2E_1 + 2BE_2 + \beta$.

Proof of Theorem 3.2: Recall that for a cumulative distribution function $F$ of a random variable, the $q$-th quantile is defined as $F^{-1}(q) = \inf\{x : F(x) \ge q\}$. We first prove a useful result: if the cumulative distribution functions of two random variables $X$ and $Y$ satisfy
$\sup_{t \in \mathbb{R}} |F_X(t) - F_Y(t)| \le \varepsilon, \quad (22)$
then for any $\alpha \in [0, 1]$,
$F_Y^{-1}(\alpha - \varepsilon) \le F_X^{-1}(\alpha) \le F_Y^{-1}(\alpha + \varepsilon). \quad (23)$
To prove it, we note that if $\alpha - \varepsilon < 0$ then $F_Y^{-1}(\alpha - \varepsilon) = -\infty \le F_X^{-1}(\alpha)$ trivially holds, and if $\alpha + \varepsilon > 1$ then $F_X^{-1}(\alpha) \le F_Y^{-1}(\alpha + \varepsilon) = \infty$ trivially holds. So we assume $0 \le \alpha - \varepsilon \le \alpha + \varepsilon \le 1$. Let us first prove the inequality $F_Y^{-1}(\alpha - \varepsilon) \le F_X^{-1}(\alpha)$. By the definition of $F_X^{-1}$ and the right-continuity of $F_X$, we know that $F_X(F_X^{-1}(\alpha)) \ge \alpha$. Therefore, by (22), we know that $F_Y(F_X^{-1}(\alpha)) \ge \alpha - \varepsilon$, which implies $F_Y^{-1}(\alpha - \varepsilon) \le F_X^{-1}(\alpha)$ by the definition of $F_Y^{-1}(\alpha - \varepsilon)$. This proves the first inequality in (23). Interchanging the roles of $X$ and $Y$, we have $F_X^{-1}(\alpha - \varepsilon) \le F_Y^{-1}(\alpha)$; replacing $\alpha$ by $\alpha + \varepsilon$, we obtain the second inequality $F_X^{-1}(\alpha) \le F_Y^{-1}(\alpha + \varepsilon)$.

Now we consider the basic bootstrap. We write $A$ as the event that $\sup_{x \in \mathbb{R}} \big|P^*(\sqrt{n}(\hat\psi^*_n - \hat\psi_n) \le x) - \Phi(x/\sigma)\big| \le E_2$. By our assumption, we have $P(A^c) \le \beta$. Note that if $q_{\alpha/2}$ and $q_{1-\alpha/2}$ are the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\hat\psi^*_n - \hat\psi_n$ given $X_1, \ldots, X_n$, then $\sqrt{n} q_{\alpha/2}$ and $\sqrt{n} q_{1-\alpha/2}$ are the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\sqrt{n}(\hat\psi^*_n - \hat\psi_n)$ given $X_1, \ldots, X_n$. By the inequality (23), when $A$ happens, we have $\sigma z_{\alpha/2 - E_2} \le \sqrt{n} q_{\alpha/2} \le \sigma z_{\alpha/2 + E_2}$, where $z_q$ is the $q$-th quantile of the standard normal. This inequality implies
$P(\sqrt{n}(\hat\psi_n - \psi) < \sigma z_{\alpha/2 - E_2}; A) \le P(\sqrt{n}(\hat\psi_n - \psi) < \sqrt{n} q_{\alpha/2}; A) \quad (24)$
and
$P(\sqrt{n}(\hat\psi_n - \psi) < \sqrt{n} q_{\alpha/2}; A) \le P(\sqrt{n}(\hat\psi_n - \psi) < \sigma z_{\alpha/2 + E_2}; A). \quad (25)$
Next, we notice that $\sup_{x} |P(\sqrt{n}(\hat\psi_n - \psi) \le x) - \Phi(x/\sigma)| \le E_1$ also implies $\sup_{x} |P(\sqrt{n}(\hat\psi_n - \psi) < x) - \Phi(x/\sigma)| \le E_1$, by taking left limits and using the continuity of $\Phi$. Thus, (24) implies that
$P(\sqrt{n}(\hat\psi_n - \psi) < \sqrt{n} q_{\alpha/2}; A) \ge P(\sqrt{n}(\hat\psi_n - \psi) < \sigma z_{\alpha/2 - E_2}; A) \ge P(\sqrt{n}(\hat\psi_n - \psi) < \sigma z_{\alpha/2 - E_2}) - P(A^c) \ge \frac{\alpha}{2} - E_1 - E_2 - \beta.$
Similarly, (25) implies that
$P(\sqrt{n}(\hat\psi_n - \psi) < \sqrt{n} q_{\alpha/2}; A) \le P(\sqrt{n}(\hat\psi_n - \psi) < \sigma z_{\alpha/2 + E_2}) \le \frac{\alpha}{2} + E_1 + E_2.$
Therefore, we have the two-sided bound
$\frac{\alpha}{2} - E_1 - E_2 - \beta \le P(\sqrt{n}(\hat\psi_n - \psi) < \sqrt{n} q_{\alpha/2}; A) \le \frac{\alpha}{2} + E_1 + E_2.$
For the $(1-\alpha/2)$-th quantile, we can also derive a similar bound
$1 - \frac{\alpha}{2} - E_1 - E_2 - \beta \le P(\sqrt{n}(\hat\psi_n - \psi) \le \sqrt{n} q_{1-\alpha/2}; A) \le 1 - \frac{\alpha}{2} + E_1 + E_2.$
So we have
$\big|P(\sqrt{n} q_{\alpha/2} \le \sqrt{n}(\hat\psi_n - \psi) \le \sqrt{n} q_{1-\alpha/2}; A) - (1 - \alpha)\big| \le 2E_1 + 2E_2 + \beta,$
which gives rise to
$\big|P(\sqrt{n} q_{\alpha/2} \le \sqrt{n}(\hat\psi_n - \psi) \le \sqrt{n} q_{1-\alpha/2}) - (1 - \alpha)\big| \le \big|P(\sqrt{n} q_{\alpha/2} \le \sqrt{n}(\hat\psi_n - \psi) \le \sqrt{n} q_{1-\alpha/2}; A) - (1 - \alpha)\big| + P(\sqrt{n} q_{\alpha/2} \le \sqrt{n}(\hat\psi_n - \psi) \le \sqrt{n} q_{1-\alpha/2}; A^c) \le 2E_1 + 2E_2 + \beta + P(A^c) \le 2E_1 + 2E_2 + 2\beta,$
or equivalently
$\big|P(\hat\psi_n - q_{1-\alpha/2} \le \psi \le \hat\psi_n - q_{\alpha/2}) - (1 - \alpha)\big| \le 2E_1 + 2E_2 + 2\beta.$
The result for the percentile bootstrap follows similarly but we need to use the symmetry of $N(0, \sigma^2)$.
Note that P( n( ˆψn ψ) x) Φ x P( n( ˆψn ψ) < x) Φ x and the latter can be rewritten as P( n( ˆψn ψ) < x) Φ x P( n( ˆψn ψ) < x) Φ x P( n(ψ ˆψn) > x) Φ x (1 P( n(ψ ˆψn) > x)) 1 Φ x Bootstrap in High Dimension with Low Computation P( n(ψ ˆψn) x) Φ x where the last equality uses the symmetry of N(0, σ2), i.e., 1 Φ( x/σ) = Φ(x/σ). Moreover, if qα/2 and q1 α/2 are the α/2-th and (1 α/2)-th quantiles of ˆψ n given X1, . . . , Xn, then n(qα/2 ˆψn) and n(q1 α/2 ˆψn) are the α/2-th and (1 α/2)-th quantiles of n( ˆψ n ˆψn) given X1, . . . , Xn. Therefore, the proof for the basic bootstrap also applies if we replace nqα/2, nq1 α/2 and n( ˆψn ψ) in that proof by n(qα/2 ˆψn), n(q1 α/2 ˆψn) and n(ψ ˆψn) respectively. In particular, the final result now reads as follows |P( n(qα/2 ˆψn) n(ψ ˆψn) n(q1 α/2 ˆψn)) (1 α)| 2E1 + 2E2 + 2β or equivalently |P(qα/2 ψ q1 α/2) (1 α)| 2E1 + 2E2 + 2β. This completes our proof. To prove Theorem 3.5, we first need to prove two lemmas regarding (2) and (3) respectively. The following lemma is from Theorem 2.11 in Pinelis & Molzon (2016) which establishes the Berry-Esseen theorem in the multivariate delta method in the form of (2). Lemma D.1. Suppose that X1, . . . , Xn are i.i.d. random vectors in Rp satisfying E[X] = µ, V ar(X) = Σ, m31 := E[| g(µ) (X µ)|3] < and m32 := E[||X µ||3] < . Suppose g(x) satisfies Assumption 3.3 and σ2 := g(µ) Σ g(µ) > 0. Then there is a universal constant C > 0 s.t. P( n(g( Xn) g(µ)) x) Φ x m31 nσ3 + CHgm1/3 31 tr(Σ) nσ2 + CHgm2/3 32 n5/6σ + CHgm1/3 31 m2/3 32 nσ2 Proof of Lemma D.1: Define f(x) = g(x + µ) g(µ) and its linearization L(x) = g(µ) x. Then by the second order Taylor expansion of f(x) and boundedness property of Hg in Assumption 3.3, we can see (2.1) in Pinelis & Molzon (2016) holds for Mϵ = CHg and any ϵ > 0. By Theorem 2.11 in Pinelis & Molzon (2016) with V = X µ, c = 1/2 and ϵ , we have P( n(g( Xn) g(µ)) x) Φ x K0 + K1m31/σ3 n + K20 + K21m1/3 31 /σ n tr(Σ) + K30 + K31m1/3 31 /σ n m2/3 32 , (26) where the additional term Kϵ in Theorem 2.11 vanishes as ϵ . By the definition of these K s with c = 1/2 in (2.30) in Pinelis & Molzon (2016), we can see there is a universal constant C > 0 s.t. K0 C, K1 C, K20 C CHg σ , K21 C CHg σ , K30 C CHg σn1/3 , K31 C CHg Moreover, by Holder s inequality, m31 = E[| g(µ) (X µ)|3] E[| g(µ) (X µ)|2]3/2 = σ3, which implies that K0, K20 can be absorbed into K1m31/σ3, K21m1/3 31 /σ respectively by choosing a larger C. Therefore, (26) can be written as P( n(g( Xn) g(µ)) x) Φ x m31 nσ3 + CHgm1/3 31 tr(Σ) nσ2 + CHgm2/3 32 n5/6σ + CHgm1/3 31 m2/3 32 nσ2 This concludes our proof. Bootstrap in High Dimension with Low Computation Next we prove the finite-sample accuracy (3) for the bootstrap estimator by extracting the dependence on problem parameters in Theorem 4.2 in Zhilova (2020) and combining it with Lemma D.1. Lemma D.2. Suppose the conditions in Theorem 3.5 hold. Then with probability at least 1 6/n we have P ( n(g( X n) g( Xn)) x) Φ x m31 nσ3 + CHgm1/3 31 tr(Σ) nσ2 + CHgm2/3 32 n5/6σ + CHgm1/3 31 m2/3 32 nσ2 n + ||E[(X µ)3]|| 1/2 + τ 2 p 1/2 + τ 3 p τ 4(log n)3/2 λ2 Σ n + τ 2(log n)3/2 1/2 (log n + log p) p where m31, m32 and σ2 are defined in Lemma D.1, C is a universal constant and C1 only depends on CX. Proof of Lemma D.2: We use Theorem 4.2 in Zhilova (2020) to prove this lemma. First, we verify the conditions of Theorem 4.2 with K = 3. For K = 3, by Remark 2.1, we can take Ui 0, Yi µ = Zi µ N(0, Σz) independent of X1, . . . 
, Xn with Σz = Σ which satisfies (2.1) and (2.2) for approximating Xi µ (since Xi is not centered in Theorem 4.2, Yi should also be non-centered). In this case, we can see Cz := ||Σ 1/2 z || = ||Σ 1/2|| = 1/ λΣ. Similarly, CX := ||Σ 1/2|| = 1/ λΣ. Other conditions about f in Theorem 4.2 have already been assumed in the statement of this lemma. Thus, by Theorem 4.2, it holds with probability at least 1 6e x for x > 0: sup t R |P( n(g( Xn) g(µ)) t) P ( n(g( X n) g( Xn)) t)| 1 λΣ τ 2 1 + 2 rx n + CB,i.i.d. 1 λ3/2 Σ CM,3 1 n + 2CB,i.i.d. λ3/2 Σ τ 3 1 n + R1,n,3 (27) where CB,i.i.d. > 0 is a constant only depending on K and thus is a universal constant for K = 3, CM,3 is defined as CM,3 = ||E[(X µ)3]|| + ||E[(Y1 µ)3]|| and R1,n,3 is defined in (B.14) as R1,n,3 = τ 4Cx,2 2n + 4τ 2Cx,2 2n + Cϕ,2τ 3Cx,3 2λ3/2 Σ n . (28) Since Y1 µ N(0, Σz), the tensor power (Y1 µ)3 has expectation zero and thus CM,3 can be simplified as CM,3 = ||E[(X µ)3]||. Cϕ,2 in (28) is defined in (A.13) and according to Remark A.1, it depends on the choice of ϕ(t) which is a K = 3 times continuously differentiable smooth approximation of the indicator function 1{t 0}. Once we fix such ϕ(t), Cϕ,2 is a universal constant. Cx,2 and Cx,3 in (28) are defined in (B.27) of Theorem B.1 as Cx,2 = C1((x + log n) p p + 2(x + log n) x + log n + log p p + 2(x + log n + log p) Bootstrap in High Dimension with Low Computation ((x + log n + log p) p x + log n + log p) p + 2(x + log n) where C1 is a constant only depending on the density bound CX based on the proof of Theorem B.1. Besides, since we assume n 3, we know that x + log n x + log n and x + log n + log p x + log n + log p holds for any x > 0. Therefore, Cx,2 and Cx,3 can be simplified as Cx,2 = C1(x + log n) p + 2(x + log n) x + log n + log p p + 2(x + log n + log p) (x + log n + log p) p + 2(x + log n) Moreover, for any y > 0, we always have 1 + y 1 + 2 y + 2y 4(1 + y). Therefore, the remainder term (28) can be bounded in a more compact way up to some constants as follows: λ2 Σ n(x + log n) x + τ 2 λΣ n(x + log n) x 1 + x + log n + log p 1/2 (x + log n + log p) x 1 + x + log n 1/2 + τ 2 p 1 + x + log n 1/2 + τ 3 p 1 + x + log n where C is a universal constant and C1 only depends on CX. Plugging it back to (27), we can similarly write (27) in a more compact way: sup t R |P( n(g( Xn) g(µ)) t) P ( n(g( X n) g( Xn)) t)| n + ||E[(X µ)3]|| 1 + x + log n 1/2 + τ 2 p 1 + x + log n 1/2 + τ 3 p 1 + x + log n λ2 Σ n(x + log n) x + τ 2 λΣ n(x + log n) x 1 + x + log n + log p 1/2 (x + log n + log p) x where C is a universal constant and C1 only depends on CX. Now we choose x = log n. Then with probability at least 1 6/n, we have sup t R |P( n(g( Xn) g(µ)) t) P ( n(g( X n) g( Xn)) t)| n + ||E[(X µ)3]|| Bootstrap in High Dimension with Low Computation 1/2 + τ 2 p 1/2 + τ 3 p τ 4(log n)3/2 λ2 Σ n + τ 2(log n)3/2 1/2 (log n + log p) p where C is a universal constant, C1 only depends on CX and we have absorbed log p/p into the constant term 1 due to log p/p 1. Now we combine the above bound with Lemma D.1 in this paper. Since X is sub-Gaussian, the moment conditions in Lemma D.1 hold. Moreover, since Σ is positive definite and || g(µ)|| > 0, σ2 = g(µ) Σ g(µ) > 0 is also satisfied. 
Therefore, by Lemma D.1 and triangular inequality, we obtain the desired bound with probability at least 1 6/n P ( n(g( X n) g( Xn)) x) Φ x m31 nσ3 + CHgm1/3 31 tr(Σ) nσ2 + CHgm2/3 32 n5/6σ + CHgm1/3 31 m2/3 32 nσ2 n + ||E[(X µ)3]|| 1/2 + τ 2 p 1/2 + τ 3 p τ 4(log n)3/2 λ2 Σ n + τ 2(log n)3/2 1/2 (log n + log p) p where C is a universal constant and C1 only depends on CX. Proof of Theorem 3.5: Plugging the bounds in Lemma D.1 and Lemma D.2 into Theorem 3.1, we obtain the desired finite sample bound for the cheap bootstrap coverage accuracy. Besides, the error E1 can be absorbed in E2. We make a remark about the proof of Theorem 3.5. In Zhilova (2020), it appears that a generalized version of the Hanson Wright inequality (equation (B.32)) that allows dependent components is derived as a middle step of the proof of Theorem B.1. If the classical Hanson-Wright inequality (Theorem 1.1 in Rudelson & Vershynin (2013)) is used in that proof instead, then our Theorem 3.5 would be changed accordingly to have an additional assumption that X has independent components but no longer needs to be continuously distributed with a bounded density. In this case, C1 can be taken as a universal constant. Proof of Corollary 3.6: It suffices to show that if τ = O(1) and || g(µ)||2 = O(p), then tr(Σ) = O(p), m31 = O(p3/2) and m32 = O(p3/2). In fact, if these orders hold, with other orders assumed in Corollary 3.6, we can easily get the desired order after absorbing the small order terms into large order terms. Now let us prove tr(Σ) = O(p), m31 = O(p3/2) and m32 = O(p3/2) provided τ = O(1) and || g(µ)||2 = O(p). Recall that X is assumed to be sub-Gaussian, i.e., E[exp(a (X µ))] exp(||a||2τ 2/2), a Rp. (29) for some τ 2 > 0. Therefore, Xi µi, i = 1, . . . , p are sub-Gaussian random variables with sub-Gaussian norm τ up to a universal constant (see Vershynin (2018) Section 2.5). For simplicity, we write a b if a Cb for a universal constant C > 0. By Proposition 2.5.2 (ii) in Vershynin (2018), E[|Xi µi|2] = Σii τ 2 and E[|Xi µi|4] τ 4. Therefore, we can see tr(Σ) τ 2p = O(p). By H older s inequality, m32 = E[||X µ||3] E[||X µ||4]3/4 = i,j=1 E (Xi µi)2(Xj µj)2 Bootstrap in High Dimension with Low Computation E [(Xi µi)4]E[(Xj µj)4] = τ 3p3/2 = O(p3/2). Moreover, (29) also implies that g(µ) (X µ) is sub-Gaussian with sub-Gaussian norm || g(µ)||τ up to a universal constant. By Proposition 2.5.2 (ii) in Vershynin (2018), we have m31 = E[| g(µ) (X µ)|3] || g(µ)||3τ 3 = O(p3/2). This concludes our proof. Proof of Theorem 3.7: We will apply Theorem 3.1 to prove this theorem. The finite-sample bound (2) can be obtained by the Berry-Esseen theorem: P( n(g 1 Xn g 1 µ) x) Φ x CE[|g 1 (X µ)|3] σ3 n . (30) Next, we need to find a bound for P ( n(g 1 X n g 1 Xn) x) Φ x We consider the i.i.d. centered random variables g 1 (Xi µ), i = 1, . . . , n which have non-degenerate variance σ2 = g 1 Σg1 > 0. We apply Theorem 2.5 in Lopes (2022) to g 1 (Xi µ) s by choosing Y = N(0, g 1 Σg1) such that ϱ = 1, = 0, ω1 = ||g 1 (X µ)/σ||ψ1 = ||g 1 (X µ)||ψ1/σ and obtain that with probability at least 1 C/n, sup x R |P ( n(g 1 X n g 1 Xn) x) P( ng 1 ( Xn µ) x)| C||g 1 (X µ)||4 ψ1 log11(n) σ4 n , where C > 0 is a universal constant. By the triangle inequality and Berry-Esseen bound (30), the following holds with probability at least 1 C/n P ( n(g 1 X n g 1 Xn) x) Φ x CE[|g 1 (X µ)|3] σ3 n + C||g 1 (X µ)||4 ψ1 log11(n) σ4 n , (31) where C > 0 is a universal constant. 
By Theorem 3.1 and the bounds (30) and (31), we finally get P(g(µ) [g( Xn) t B,1 α/2Sn,B, g( Xn) + t B,1 α/2Sn,B]) (1 α) E[|g 1 (X µ)|3] σ3 n + ||g 1 (X µ)||4 ψ1 log11(n) σ4 n + CE[|g 1 (X µ)|3] where C is a universal constant. Furthermore, the last term can be absorbed into the second term by using a larger constant 2C, which completes our proof. Proof of Theorem 3.8: We use Theorem 3.1 to prove this theorem. As in the proof of Theorem 3.7, (2) is given by the Berry-Esseen theorem in (30) and we only need to bound P ( n(g 1 X n g 1 Xn) x) Φ x In this regard, we will use the results in Chernozhukov et al. (2020). Note that their results only apply for at least three-dimensional random vectors so we consider the following setting. Suppose we have 3n i.i.d. random variables g 1 (Xij µ), i = 1, . . . , n, j = 1, 2, 3 where each Xij has the same distribution as X1. Then we can construct n threedimensional i.i.d. random vectors as Xi := (g 1 (Xi1 µ), g 1 (Xi2 µ), g 1 (Xi3 µ)) , i = 1, . . . , n whose components have common variance σ2 = g 1 Σg1. Then we can see that conditions (E.3) and (M) are satisfied for Bn = max 3E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] . Bootstrap in High Dimension with Low Computation Therefore, by Corollary 3.2 in Chernozhukov et al. (2020) (σ ,W = 1 since Xi has independent components), we have with probability at least 1 1/ n that sup A R |P ( n( X n Xn) A) P(N(0, σ2I3 3) A)| log 3 log n p log(3 n) n + log 3 p log(3n) n1/2 3/(2q) C1 max 3E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] (log n)3/2 n + log n n1/2 3/(2q) where C1 > 0 denotes a constant depending only on q which are different for its two appearances and R contains all the hyperrectangles in R3. In particular, if we only focus on the first component of Xi, that is, we choose A = ( , x] R R, we have with probability at least 1 1/ n that P ( n(g 1 X n g 1 Xn) x) Φ x C1 max 3E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] (log n)3/2 n + log n n1/2 3/(2q) By Theorem 3.1 and the bounds (30) and (32), we then obtain P(g(µ) [g( Xn) t B,1 α/2Sn,B, g( Xn) + t B,1 α/2Sn,B]) (1 α) 2 n + BC1 max E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] n + log n n1/2 3/(2q) + CE[|g 1 (X µ)|3] where C is a universal constant and C1 is a constant depending only on q. Finally notice that n = o log n n1/2 3/(2q) and E[|g 1 (X µ)|3]/σ3 1. We can absorb (log n)3/2/ n into log n/n1/2 3/(2q) and absorb 2/ n into CE[|g 1 (X µ)|3]/σ3 n (with larger constants C1 and C), which leads to P(g(µ) [g( Xn) t B,1 α/2Sn,B, g( Xn) + t B,1 α/2Sn,B]) (1 α) BC1 max E[|g 1 (X µ)/σ|q]1/q, q E[|g 1 (X µ)/σ|4] log n n1/2 3/(2q) + CE[|g 1 (X µ)|3] Proof of Theorem B.1: In view of Theorem 3.1, it suffices to show |P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]) (1 α)| 2E1 + 2E4 + 2 π |t B,1 α/2 z1 α/2| + σ t B,1 α/2. Then taking the minimum of the two bounds, we can get the desired result. We write A as the event that b=1 ( n( ˆψ b n ˆψn))2 σ Bootstrap in High Dimension with Low Computation b=1 ( n( ˆψ b n ˆψn))2 σ + E3. Then we have P(Ac) E4. Note that the confidence interval can be written as ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B] n( ˆψn ψ) q 1 B PB b=1( n( ˆψ b n ˆψn))2 Therefore, we have | n( ˆψn ψ)| σ E3 t B,1 α/2; A P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]; A) | n( ˆψn ψ)| σ + E3 t B,1 α/2; A For the upper bound, we can further bound it as follows | n( ˆψn ψ)| σ + E3 t B,1 α/2; A | n( ˆψn ψ)| σ + E3 t B,1 α/2 = P( (σ + E3)t B,1 α/2 n( ˆψn ψ) (σ + E3)t B,1 α/2). 
By means of the finite-sample accuracy in (15), we have P( (σ + E3)t B,1 α/2 n( ˆψn ψ) (σ + E3)t B,1 α/2) σ t B,1 α/2 σ t B,1 α/2 Φ(z1 α/2) Φ( z1 α/2) + 2E1 + σ t B,1 α/2 z1 α/2 1 α + 2E1 + 2 π |t B,1 α/2 z1 α/2| + σ t B,1 α/2, where z1 α/2 is the (1 α/2)-th quantile of the standard normal and the second inequality is due to the 1/ 2π-Lipschitz property of Φ( ). For the lower bound, by a similar argument we can obtain | n( ˆψn ψ)| σ E3 t B,1 α/2; A | n( ˆψn ψ)| σ E3 t B,1 α/2 2 π |t B,1 α/2 z1 α/2| σ t B,1 α/2 P(Ac) 2 π |t B,1 α/2 z1 α/2| σ t B,1 α/2 E4. Bootstrap in High Dimension with Low Computation Thus, by combining the upper and lower bounds, we have the following bound for the coverage error when A happens |P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]; A) (1 α)| 2 π |t B,1 α/2 z1 α/2| + σ t B,1 α/2. Finally, the overall coverage error can be bounded by |P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]) (1 α)| |P(ψ [ ˆψn t B,1 α/2Sn,B, ˆψn + t B,1 α/2Sn,B]; A) (1 α)| + P(Ac) 2E1 + 2E4 + 2 π |t B,1 α/2 z1 α/2| + σ t B,1 α/2, which, combined with Theorem 3.1, gives us the desired bound. Proof of Theorem B.2: By means of Lemmas D.1 and D.2, this directly follows from Theorem 3.2. Besides, the error E1 can be absorbed into E2. Proof of Theorem B.3: Plugging the bounds (30) and (31) into Theorem 3.2, we get the desired result. Proof of Theorem B.4: Plugging the bounds (30) and (32) into Theorem 3.2, we get the desired result.