Journal of Machine Learning Research 26 (2025) 1-54 Submitted 7/21; Revised 3/25; Published 3/25

High-Dimensional L2-Boosting: Rate of Convergence

Ye Luo kurtluo@hku.hk
Hong Kong University Business School, The University of Hong Kong, Hong Kong

Martin Spindler martin.spindler@uni-hamburg.de
Institute for Statistics, University of Hamburg, Germany

Jannis Kueck kueck@dice.hhu.de
Düsseldorf Institute for Competition Economics, Heinrich Heine University Düsseldorf, Germany

Editor: Corinna Cortes

Abstract

Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of L2-Boosting in a high-dimensional setting under early stopping. We close a gap in the literature and provide the rate of convergence of L2-Boosting in a high-dimensional setting under approximate sparsity and without a beta-min condition. We also show that the rate of convergence of classical L2-Boosting depends on the design matrix, as described by a sparse eigenvalue condition. To show the latter results, we derive new, improved approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of L2-Boosting. These results might be of independent interest. Moreover, we introduce so-called restricted L2-Boosting. The restricted L2-Boosting algorithm sticks to the set of previously chosen variables, exploits the information contained in these variables first, and only occasionally allows new variables to be added to this set. We derive the rate of convergence for restricted L2-Boosting under early stopping, which is close to the convergence rate of Lasso in an approximately sparse, high-dimensional setting without a beta-min condition. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Finally, we present simulation studies to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, L2-Boosting clearly outperforms Lasso. An empirical illustration and the proofs are contained in the Appendix.

Keywords: Boosting, L2-Boosting, restricted L2-Boosting, early stopping, high-dimensional regression, rate of convergence, approximate sparsity, approximation theory

©2025 Luo, Spindler, Kueck. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/21-0725.html.

1. Introduction

In this paper we consider L2-Boosting algorithms for regression, which are coordinatewise greedy algorithms that estimate the target function under L2 loss. Boosting algorithms represent one of the major advances in machine learning and statistics in recent years. Freund and Schapire's AdaBoost algorithm for classification (Freund and Schapire (1997)) has attracted much attention in the machine learning community as well as in statistics. Many variants of the AdaBoost algorithm have been introduced and proven to be very competitive in terms of prediction accuracy in a variety of applications, with a strong resistance to overfitting. Boosting methods were originally proposed as ensemble methods, which rely on the principle of generating multiple predictions and majority voting (averaging) among the individual classifiers (Bühlmann and Hothorn (2007)).
An important step in the analysis of boosting algorithms was Breiman's interpretation of boosting as a gradient descent algorithm in function space, inspired by numerical optimization and statistical estimation (Breiman (1996), Breiman (1998)). Building on this insight, Friedman et al. (2000) and Friedman (2001) embedded boosting algorithms into the framework of statistical estimation and additive basis expansion. This also enabled the application of boosting to regression analysis. Boosting for regression was proposed by Friedman (2001), and Bühlmann and Yu (2003) then defined and introduced L2-Boosting. An extensive overview of the development of boosting and its manifold applications is given in the survey of Bühlmann and Hothorn (2007).

In the high-dimensional setting there are two important but unsolved problems concerning L2-Boosting. First, the convergence rate of L2-Boosting, in particular under early stopping, has not been thoroughly analyzed. Second, the pattern of the variables selected at each step of L2-Boosting is unknown. In this paper, we show that these two problems are closely related. We establish results on the sequence of variables that are selected by L2-Boosting. We call a step of L2-Boosting revisiting if the variable chosen in this step has already been selected in previous steps. We analyze the revisiting behavior of L2-Boosting, i.e., how often L2-Boosting revisits. We then utilize these results to derive an upper bound on the rate of convergence of L2-Boosting.1 We show that the frequency of revisiting, as well as the convergence speed of L2-Boosting, depend on the structure of the design matrix, namely on the minimal and maximal restricted eigenvalue. Moreover, we introduce the so-called restricted L2-Boosting, another variant of the classical boosting algorithm. The restricted L2-Boosting algorithm sticks to the set of previously chosen variables, exploits the information contained in these variables first, and only occasionally allows new variables to be added to this set. We show that its convergence rate is close to that of Lasso and achieves the same rate as orthogonal L2-Boosting, which was derived in Kueck et al. (2023), but without the need for orthogonal projections, which are costly to calculate. It should be noted that all results are shown in an approximately sparse, high-dimensional setting without a beta-min condition. We also introduce feasible, data-driven rules for early stopping for both algorithms, which can be easily implemented and used in applied work.

1. Without analyzing the sequence of variables selected at each step of L2-Boosting, only much weaker results on the convergence speed of L2-Boosting are available, based on DeVore and Temlyakov (1996) and Livshitz and Temlyakov (2003).

Compared to Lasso, boosting uses a somewhat unusual penalization scheme. The penalization is done by early stopping to avoid overfitting in the high-dimensional case. In the low-dimensional case, L2-Boosting without stopping converges to the ordinary least squares (OLS) solution. In a high-dimensional setting, early stopping is key for avoiding overfitting and for the predictive performance of boosting. We give new stopping rules that are simple to implement and also work very well in practical settings, as demonstrated in the simulation studies. We prove that such a stopping rule achieves the best bound obtained in our theoretical results.
In a deterministic setting, that is, when there is no noise or error term in the model, boosting methods are also known as greedy algorithms (the pure greedy algorithm (PGA) and the orthogonal greedy algorithm (OGA)). In signal processing, L2-Boosting is essentially the same as the matching pursuit algorithm of Mallat and Zhang (1993). We will employ the abbreviations BA (L2-Boosting algorithm), oBA (orthogonal L2-Boosting algorithm) and resBA (restricted L2-Boosting algorithm) for the stochastic versions we analyze. The rate of convergence of greedy algorithms has been analyzed in DeVore and Temlyakov (1996) and Livshitz and Temlyakov (2003). Temlyakov (2011) is an excellent survey of recent results on the theory of greedy approximation. To the best of our knowledge, we establish, under an additional assumption on the design matrix, the first results on revisiting in the deterministic setting, and we greatly improve the existing results of DeVore and Temlyakov (1996). These results, which are available in the appendix, are essential for our analysis of L2-Boosting, but might also be of interest in their own right.

As mentioned above, boosting for regression was introduced by Friedman (2001). L2-Boosting was defined in Bühlmann and Yu (2003). Its numerical convergence, consistency, and statistical rates of convergence under early stopping in a low-dimensional setting were obtained in Zhang and Yu (2005). Consistency in prediction norm of L2-Boosting in a high-dimensional setting was first proved in Bühlmann (2006). The numerical convergence properties of boosting in a low-dimensional setting are analyzed in Freund et al. (2016). The orthogonal boosting algorithm in a statistical setting under different assumptions is analyzed in Ing and Lai (2011). Rates for the PGA and OGA are obtained in Barron et al. (2008). For results on orthogonal boosting and modifications, we refer to the excellent survey by Lai and Yuan (2021).

In this paper we consider linear basis functions. Classification and regression trees, and the widely used neural networks, involve non-linear basis functions. We hope that our results can serve as a starting point for the analysis of non-linear basis functions, which is left for future research.

The structure of this paper is as follows: In Section 2, the L2-Boosting algorithm (BA/PGA) is defined together with its modifications, the restricted L2-Boosting algorithm (resBA/resPGA) and the orthogonalized version (oBA), which serves as a reference. In Section 3, we present a new approximation result for the pure greedy algorithm (PGA) and an analysis of the revisiting behavior of the boosting algorithm. In Section 4, we provide the main results of our analysis, namely an analysis of the boosting algorithm and some of its variants. The proofs, together with some details of the new approximation theory for the PGA, are provided in the appendix. Section 5 contains a simulation study that offers some insights into the methods and also provides guidance on stopping rules in applications. Additional empirical examples can be found in the appendix. Section 6 provides concluding remarks.

Notation: Let $z$ and $y$ be $n$-dimensional vectors. Define $\|z\|$ to be the Euclidean norm and $\|z\|_{2,n} := \sqrt{E_n[z^2]}$ the empirical $L_2$-norm, where $E_n[z] = \frac{1}{n}\sum_{i=1}^n z_i$. Define $\langle \cdot, \cdot \rangle_n$ to be the empirical inner product, $\langle z, y \rangle_n = \frac{1}{n}\sum_{i=1}^n z_i y_i$. For a random variable $X$, $E[X]$ denotes its expectation.
The correlation between the random variables $X$ and $Y$ is denoted by $\mathrm{corr}(X, Y)$. We use the notation $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to mean $a \leq cb$ for some constant $c > 0$ that does not depend on $n$, and $a \lesssim_P b$ to mean $a = O_P(b)$. For a vector $\beta \in \mathbb{R}^p$, $\mathrm{supp}(\beta)$ denotes the set of indices of the non-zero elements of $\beta$. Further, given a set of indices $T \subset \{1, \dots, p\}$, we denote by $\beta_T$ the vector with $\beta_{T,j} = \beta_j$ if $j \in T$ and $\beta_{T,j} = 0$ if $j \notin T$.

2. L2-Boosting with componentwise least squares

To define the boosting algorithm for linear models, we consider the following regression setting:
$$y_i = x_i'\beta + r_i + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1)$$
with vector $x_i = (x_{i,1}, \dots, x_{i,p_n})$ consisting of $p_n$ predictor variables, $\beta$ a $p_n$-dimensional coefficient vector, and a random, mean-zero error term $\varepsilon_i$ with $E[\varepsilon_i|x_i] = 0$. We allow the dimension of the predictors $p_n$ to grow with the sample size $n$ and to be even larger than the sample size, i.e., $\dim(\beta) = p_n > n$. We impose an approximate sparsity condition. This means that there is a large set of potential variables, but the number of relevant variables, which can grow with the sample size and is denoted by $s_n$, is small compared to the sample size, i.e., $\|\beta\|_0 = s_n < n$. The random variable $r_i$ denotes the approximation error of the sparse model. In the following, we will drop the dependence of $s_n$ and $p_n$ on the sample size in the notation and write $s$ and $p$ if no confusion arises. $X$ denotes the $n \times p$ design matrix whose rows are the single observations $x_i$. $X_j$ denotes the $j$th column of the design matrix and $x_{i,j}$ the $j$th component of the vector $x_i$. We consider a fixed design for the regressors. We assume that the regressors are standardized with mean zero and variance one, i.e., $E_n[x_{i,j}] = 0$ and $E_n[x_{i,j}^2] = 1$ for $j = 1, \dots, p$.

The basic principle of boosting can be described as follows. We follow the interpretation of Breiman (1998) and Friedman (2001) of boosting as a functional gradient descent optimization (minimization) method. The goal is to minimize a loss function, e.g., an $L_2$ loss or the negative log-likelihood function of a model, by an iterative optimization scheme. In each step, the negative gradient, which is used to update the current solution, is modeled and estimated by a parametric or nonparametric statistical model, the so-called base learner. The fitted gradient is used to update the solution of the optimization problem. A strength of boosting, besides the fact that it can be used for different loss functions, is its flexibility with regard to the base learners. This procedure is repeated until some stopping criterion is met.

The literature has developed many different forms of boosting algorithms. In this paper, we consider L2-Boosting with componentwise linear least squares, as well as two variants. All three are designed for regression analysis. "L2" refers to the loss function, which is the sum of squared residuals, $Q_n(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2$, typical in regression analysis. In this case, the gradient equals the residuals. "Componentwise linear least squares" refers to the base learners. We fit the gradient (i.e., the residuals) against each regressor ($p$ univariate regressions) and select the predictor/variable that correlates most highly with the gradient/residual, i.e., decreases the loss function most, and then update the estimator in this direction. We next update the residuals and repeat the procedure until some stopping criterion is met.
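As a concrete illustration of a single componentwise least-squares step, the following sketch (in R, which is also the language used for the simulations in Section 5) fits the current residuals against each regressor, picks the best-fitting one, and updates that coordinate. The toy design, sample sizes and variable names are ours and purely illustrative, not part of the formal setup.

```r
# Illustrative sketch of one componentwise least-squares step (toy data,
# not the authors' code): fit the current residuals against each regressor,
# pick the best-fitting one, and update that single coordinate.
set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))              # standardized regressors
beta_true <- c(5, 2, 1, rep(0, p - 3))
y <- as.vector(X %*% beta_true + rnorm(n, sd = 2))

beta_hat <- rep(0, p)
U <- y - as.vector(X %*% beta_hat)                   # residuals = negative gradient
gamma <- as.vector(crossprod(X, U)) / colSums(X^2)   # p univariate LS coefficients
j_star <- which.max(abs(gamma))                      # predictor most correlated with U
beta_hat[j_star] <- beta_hat[j_star] + gamma[j_star] # update this coordinate only
```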
In this paper, we focus on classical L2-Boosting, which was introduced in Friedman (2001) and refined in Bühlmann and Yu (2003) for regression analysis, and two modifications: restricted L2-Boosting and orthogonal L2-Boosting. Further, we also provide some results for post-L2-Boosting, which is defined in Remark 1. As far as we know, post-L2-Boosting has not yet been defined and analyzed in the literature.

2.1 L2-Boosting

For L2-Boosting with componentwise least squares, the algorithm is given below.

Algorithm 1 (L2-Boosting (BA/PGA))
(1) Start/Initialization: $\beta^0 = 0$ ($p$-dimensional vector), set the maximum number of iterations $m_{stop}$ and set the iteration index $m$ to 0.
(2) At the $(m+1)$th step, calculate the residuals $U_i^m = y_i - x_i'\beta^m$.
(3) For each predictor variable $j = 1, \dots, p$, calculate
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j \rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j_m$ that is most correlated with the residuals,2 i.e., the one attaining $\max_{1 \leq j \leq p} |\gamma_j^m|$.
(4) Update the estimator: $\beta^{m+1} := \beta^m + \gamma_{j_m}^m e_{j_m}$, where $e_{j_m}$ is the $j_m$th index vector.
(5) Increase $m$ by one. If $m < m_{stop}$, continue with (2); otherwise stop.

2. Equivalently, the one which fits the gradient best in an $L_2$-sense.

For simplicity, write $\gamma^m$ for the value of $\gamma_{j_m}^m$ at the $m$th step. The act of stopping is crucial for boosting algorithms, as stopping too late or never stopping leads to overfitting, and therefore some kind of penalization is required. A suitable solution is to stop early, i.e., before overfitting takes place. Early stopping can be interpreted as a form of penalization. Similar to Lasso, early stopping might induce a bias through shrinkage. A potential way to decrease the bias is post-L2-Boosting, which is defined in Remark 1 below. In general, during the run of the boosting algorithm, it is possible that the same variable is selected at different steps, which means the variable is revisited. This revisiting behavior is key to the analysis of the rate of convergence of L2-Boosting. In the next section, we will analyze the revisiting properties of boosting in more detail.

Remark 1 Post-L2-Boosting is a variant of L2-Boosting, namely a post-model selection estimator that applies ordinary least squares (OLS) to the model selected by L2-Boosting in the first step. To define this estimator formally, we make the following definitions: $T := \mathrm{supp}(\beta)$ and $\hat{T} := \mathrm{supp}(\beta^{m^*})$, the support of the true model and the support of the model estimated by L2-Boosting as described above with stopping at $m^*$. A superscript $C$ denotes the complement of the set with regard to $\{1, \dots, p\}$. In the context of Lasso, OLS after model selection was analyzed in Belloni and Chernozhukov (2013). Given the above definitions, the post-model selection estimator or OLS post-L2-Boosting estimator takes the form
$$\arg\min_{\beta \in \mathbb{R}^p} \Big\{ Q_n(\beta) : \beta_j = 0 \text{ for each } j \in \hat{T}^C \Big\}. \qquad (2)$$
In this paper, we will not focus on post-L2-Boosting.

2.2 Restricted L2-Boosting

In this section, we introduce the so-called restricted L2-Boosting algorithm. The motivation is that the variable selection behavior of the pure greedy algorithm, introduced in the section before, is challenging to analyze. Related to this, and even more important, the pure greedy algorithm also has a provably slow rate of convergence in general, driven by challenging singular cases, as highlighted by the work of Temlyakov (2011). These cases result in hard-to-handle variable selection patterns. In general, boosting first selects the relevant variables and then the irrelevant variables.
But the overall pattern is difficult to analyze, because the selection of relevant and irrelevant variables can alternate, in particular at an advanced stage of the algorithm when the correlation of the relevant variables with the remainder is small due to previous extraction. The restricted L2-Boosting algorithm sticks to the set of already chosen variables for some time and exploits the information contained in these variables, i.e., extracts the correlation of these variables with the remainder, until new variables are selected and added to the consideration set. The algorithm is given in the following way:

Algorithm 2 (Restricted L2-Boosting (resBA/resPGA))
(1) Start/Initialization: $\beta^0 = 0$ ($p$-dimensional vector) and set the iteration index $m$ to 0.
(2) At the $(m+1)$th step, calculate the residuals $U_i^m = y_i - x_i'\beta^m$. Define the set of variables selected so far as $\hat{T}^m$.
(3) Set $T := \{1, \dots, p\}$ if $l_m = 0$ and $T := \hat{T}^m$ if $l_m = 1$, where $l_m \in \{0, 1\}$ is a sequence of integers indexed by $m$.3
(4) For each predictor variable $j \in T$, calculate
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j \rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j_m$ that is most correlated with the residuals, i.e., the one attaining $\max_{j \in T} |\gamma_j^m|$.
(5) Update the estimator: $\beta^{m+1} := \beta^m + \gamma_{j_m}^m e_{j_m}$, where $e_{j_m}$ is the $j_m$th index vector.
(6) Stopping criterion: stop at $m^*$, defined as the first $m$ with $\|U^{m+1}\|_{2,n}^2 / \|U^m\|_{2,n}^2 > 1 - C_U \log(2p/\alpha)/n$ when $l_m = 0$, where $C_U$ and $\alpha$ are defined in Theorem 17.

3. This criterion essentially restricts the algorithm to the set of already selected variables.

Remark 2 A related version of the restricted L2-Boosting algorithm is the iterated post-L2-Boosting algorithm, where at certain iterations during the algorithm a projection step is performed on the already selected variables (Kueck et al. (2023)). By projecting the residuals on the selected variables, all correlation/information of these variables is taken out. The restricted L2-Boosting algorithm avoids these projection steps, which are computationally expensive.

2.3 Orthogonal L2-Boosting

A variant of the L2-Boosting algorithm is orthogonal boosting (oBA), or the orthogonal greedy algorithm in its deterministic version. Only the updating step is changed: after each selection step, an orthogonal projection of the response variable is conducted on all the variables which have been selected up to this point. The advantage of this method is that any variable is selected at most once in this procedure, while in the previous version the same variable might be selected at different steps, which makes the analysis far more complicated. More formally, the method can be described as follows by modifying step (4) in Algorithm 1:

Algorithm 3 (Orthogonal L2-Boosting (oBA))
(1) Start/Initialization: $\beta^0 = 0$ ($p$-dimensional vector), set the maximum number of iterations $m_{stop}$ and set the iteration index $m$ to 0.
(2) At the $(m+1)$th step, calculate the residuals $U_i^m = y_i - x_i'\beta^m$.
(3) For each predictor variable $j = 1, \dots, p$, calculate
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j \rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j_m$ that is most correlated with the residuals, i.e., the one attaining $\max_{1 \leq j \leq p} |\gamma_j^m|$.
(4) Update the estimator: $\beta^{m+1} = (X_{\hat{T}^{m+1}}' X_{\hat{T}^{m+1}})^{-1} X_{\hat{T}^{m+1}}' y$, i.e., the OLS fit of $y$ on the variables $\hat{T}^{m+1}$ selected up to this step.
(5) Increase $m$ by one. If $m < m_{stop}$, continue with (2); otherwise stop.

Remark 3 Orthogonal L2-Boosting (and also post-L2-Boosting) requires, to be well-defined, that the number of selected variables be smaller than the sample size. This is enforced by our stopping rule, as we will see later.
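The only difference between Algorithms 1 and 3 is the update step. The following sketch (again R, toy data and variable names of our own choosing) contrasts the two: instead of moving a single coordinate, the orthogonal variant refits OLS on all variables selected so far.

```r
# Illustrative sketch (not the authors' implementation) of the orthogonal
# update: after each selection, y is projected on all selected columns.
set.seed(1)
n <- 100; p <- 50; mstop <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X[, 1:3] %*% c(5, 2, 1) + rnorm(n, sd = 2))

selected <- integer(0)
U <- y
for (m in seq_len(mstop)) {
  gamma <- as.vector(crossprod(X, U)) / colSums(X^2)
  j_star <- which.max(abs(gamma))
  selected <- union(selected, j_star)              # each variable enters at most once
  fit <- lm.fit(X[, selected, drop = FALSE], y)    # orthogonal projection of y
  U <- y - as.vector(X[, selected, drop = FALSE] %*% fit$coefficients)
}
```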
2.4 Comparison

We have introduced different variants of L2-Boosting algorithms. Among them, L2-Boosting is the most widely used algorithm in empirical applications. As we will see later, its rate of convergence is slower than the Lasso rate, as there are cases for which the convergence, even in the noiseless case, is too slow. In signal processing and related fields, the noiseless version of orthogonal boosting ("orthogonal matching pursuit") is also very popular. It has been shown that for orthogonal L2-Boosting fast convergence rates with a feasible, data-driven stopping criterion in high dimensions can be obtained even in the case with noise (see Kueck et al. (2023) without a beta-min condition and Stankewitz (2024) under stronger assumptions such as a beta-min condition and normality). Restricted L2-Boosting can be considered a middle ground between the two. It only uses boosting steps (without orthogonal projection) but, as we will show, it achieves the same rate of convergence as orthogonal L2-Boosting. By restricting to the set of already selected variables for some steps, the information of these variables is partialled out of the target variable $y$, and in the limit this converges to an orthogonal projection step, as used in orthogonal L2-Boosting. Orthogonal L2-Boosting might be interpreted as post-L2-Boosting where the refit takes place after each step. By only selectively/occasionally adding new variables, and using only standard boosting steps, the rate of convergence is improved compared to L2-Boosting. Restricted L2-Boosting can be considered an approximation to orthogonal boosting, but without orthogonal projections. Therefore, it offers advantageous computational properties, as orthogonal projections are costly to calculate. The algorithm only employs the default boosting steps, which are based on univariate regressions and are fast and easy to compute. Hence, the proposed restricted L2-Boosting algorithm has excellent theoretical properties and is computationally superior to orthogonal L2-Boosting. It is also computationally superior to iterated L2-Boosting, where projection steps are imposed from time to time, since projections can be dispensed with altogether.

3. New Approximation Results for the Pure Greedy Algorithm

In approximation theory, a key question is how fast functions can be approximated by greedy algorithms. Approximation theory is concerned with deterministic settings, i.e., the case without noise:
$$y_i = x_i'\beta, \quad i = 1, \dots, n. \qquad (3)$$
Nevertheless, to derive rates for the L2-Boosting algorithm in a stochastic setting, the corresponding results for the deterministic part play a key role. For example, the results in Bühlmann (2006) are limited by the result used from approximation theory, namely the rate of convergence of weak relaxed greedy algorithms derived in Temlyakov (2000). For the pure greedy algorithm, DeVore and Temlyakov (1996) establish a rate of convergence of $m^{-1/6}$ in the $\ell_2$ norm, where $m$ denotes the number of steps iterated in the PGA. This rate was improved to $m^{-11/62}$ in Konyagin and Temlyakov (1999), but Livshitz and Temlyakov (2003) established a lower bound of $m^{-0.27}$. The class of functions $\mathcal{F}$ considered in those papers is determined by general dictionaries $\mathcal{D}$ and given by
$$\Big\{ f \in H : f = \sum_{k \in \Lambda} c_k w_k, \; w_k \in \mathcal{D}, \; |\Lambda| < \infty, \; \sum_{k \in \Lambda} |c_k| \leq M \Big\},$$
where $M$ is some constant, $H$ denotes a Hilbert space, and the sequence $(c_k)$ are the coefficients with regard to the dictionary $\mathcal{D}$.
In this section, we discuss the approximation bound of the pure greedy algorithm, where we impose an additional assumption on the Gram matrix $E_n[x_i x_i']$ that is widely used in high-dimensional statistics in order to tighten the bounds. We provide a new lemma on the revisiting behavior of the pure greedy algorithm and a new approximation result, which is the core of this section. The proofs for this section and a detailed analysis of the revisiting behavior of the algorithm are deferred to Appendix A and Appendix B.

Let us define the restricted eigenvalue assumption, which is also commonly used in the analysis of Lasso. To do this, consider
$$\Sigma(s, M) := \{A : \dim(A) \leq s \times s, \ A \text{ is a diagonal (principal) submatrix of } M\}$$
for any square matrix $M$.

Definition 4 The smallest and largest restricted eigenvalues are defined as
$$\phi_s(s, M) := \min_{W \in \Sigma(s, M)} \phi_s(W) \quad \text{and} \quad \phi_l(s, M) := \max_{W \in \Sigma(s, M)} \phi_l(W),$$
where $\phi_s(W)$ and $\phi_l(W)$ denote the smallest and largest eigenvalues of the matrix $W$.

Assumption A.1 (Sparse Eigenvalue (SE)) Consider the Gram matrix $\Sigma = E_n[x_i x_i']$. Assume that all elements on the diagonal of $\Sigma$ are equal to one. We assume that there exist positive constants $c_\phi \leq 1$ and $C_\phi > 1$ such that
$$0 < c_\phi \leq \phi_s(s', \Sigma) \leq \phi_l(s', \Sigma) \leq C_\phi < \infty$$
holds for $s' \leq M_n$, where $M_n$ is a sequence with $M_n \to \infty$ as $n \to \infty$ and $M_n \geq C_M s \log(n)$ for a large enough fixed constant $C_M$.

Remark 5 This condition is a variant of the so-called "sparse eigenvalue condition", which is used for the analysis of the Lasso estimator. A detailed discussion of this condition is given in Belloni et al. (2010). Similar conditions, such as the restricted isometry condition or the restricted eigenvalue condition, have been used for the analysis of the Dantzig selector (Candes and Tao (2007)) and the Lasso estimator (Bickel et al. (2009)). An extensive overview of different conditions on matrices and how they are related is given by van de Geer and Bühlmann (2009). Assuming that $\phi_s(m, E_n[x_i x_i']) > 0$ requires that all empirical Gram submatrices formed by any $m$ components of $x_i$ are positive definite. It is well known that condition SE is fulfilled for many designs of interest.
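The restricted eigenvalues of Definition 4 involve extrema over all principal submatrices of a given dimension, which is combinatorial to compute exactly. The sketch below (R, with an assumed random-subsampling shortcut and illustrative constants of our own) shows one crude way to approximate $c_\phi$ and $C_\phi$ in practice.

```r
# Illustrative sketch: approximate phi_s(s, Sigma) and phi_l(s, Sigma) of
# Definition 4 by sampling random s x s principal submatrices of the
# empirical Gram matrix (the exact extrema are combinatorial to compute).
set.seed(1)
n <- 200; p <- 100; s <- 5
X <- scale(matrix(rnorm(n * p), n, p))
Sigma <- crossprod(X) / n            # empirical Gram matrix, unit diagonal (approx.)

restricted_eigen <- function(Sigma, s, n_draws = 2000) {
  ev <- replicate(n_draws, {
    idx <- sample(ncol(Sigma), s)
    range(eigen(Sigma[idx, idx], symmetric = TRUE, only.values = TRUE)$values)
  })
  c(phi_s = min(ev[1, ]), phi_l = max(ev[2, ]))   # rough stand-ins for c_phi, C_phi
}
restricted_eigen(Sigma, s)
```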
Define $V^m = X\alpha^m$ as the residual of the PGA. Here, $\alpha^m$ is defined as the difference between the true parameter vector $\beta$ and the approximation at the $m$th step, $\beta^m$, i.e., $\alpha^m = \beta - \beta^m$. We would like to explore how fast $V^m$ converges to 0. In our notation, $\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 - (\gamma^m)^2$, so $\|V^m\|_{2,n}^2$ is non-increasing in $m$. As described in Algorithm 1, the sequence of variables selected by the PGA is denoted by $j_0, j_1, \dots$. Define $T^m := T \cup \{j_0, j_1, \dots, j_{m-1}\}$ with $T := \mathrm{supp}(\beta)$. Define $q(m) := |T^m|$ as the cardinality of $T^m$, $m = 0, 1, \dots$. It is obvious that $q(0) = s$ and $q(m) \leq m + s$, where $s = \|\beta\|_0$ denotes the number of relevant regressors. It is essential to understand how the PGA revisits the set of already selected variables. To analyze the revisiting behavior of the PGA, some definitions are needed to fix ideas.

Definition 6 We say that the PGA is revisiting at the $m$th step if and only if $j_{m-1} \in T^{m-1}$. We define the sequence of labels $\mathcal{A} := \{A_1, A_2, \dots\}$ with each entry $A_i$ labelled either R (revisiting) or N (non-revisiting).

Lemma 7 Assume that Assumption A.1 holds with $m < M_n$. Consider the sequence of steps $1, 2, \dots, m$ with $\|V^m\|_{2,n}^2 > 0$. Denote $\mu_a(c) = 1 - (1 + \frac{1}{c})^{-1/c}$ for any $c > 0$. Then, for any $\delta > 0$, the number of Rs in the sequence $\mathcal{A}$ at step $m$, denoted $R(m)$, must satisfy
$$|R(m)| \geq \frac{1 - (1+\delta)\mu_a(c_\phi)}{2 - (1+\delta)\mu_a(c_\phi)}\, m - \frac{(1+\delta)\mu_a(c_\phi)}{2 - (1+\delta)\mu_a(c_\phi)}\, q(0).$$

The lower bound stated in Lemma 7 has room for improvement: e.g., when $c_\phi = 1$, $|R(m)|/m = 1$, as shown in Lemma 20 in Appendix A, while Lemma 7 yields 1/2 as the lower bound on $|R(m)|/m$ as $m$ becomes large enough. Deriving tight bounds is an interesting question for future research. More detailed properties of the revisiting behavior of L2-Boosting are provided in Appendix A.

With an estimated bound for the proportion of Rs in the sequence $\mathcal{A}$, we are now able to derive an upper bound for $\|V^m\|_{2,n}^2$. By Lemma 7, define $n_k^*(m) := \frac{m + \mu_a(c_\phi) q(k)}{2 - \mu_a(c_\phi)}$, where $n_k^*(m)(1+\delta)$ is an upper bound of $|q(m+k) - q(k)|$ as $q(m)$ is large enough, for any $\delta > 0$. Before we state the main result of this section, we present another auxiliary lemma.

Lemma 8 Assume that Assumption A.1 holds. Consider the steps numbered $k+1, \dots, k+m$. Assume that $m + k < M_n$ and let $m = \lambda q(k)$ for a constant $\lambda > 0$. Define
$$\zeta(c, \lambda) := \frac{c\big((1-\mu_a(c))\lambda - \mu_a(c)\big)}{(2+\lambda)\,\log\frac{2+\lambda}{2-\mu_a(c)}} + c$$
for all $\lambda \geq \frac{\mu_a(c)}{1-\mu_a(c)}$ and $c > 0$. Then, for any arbitrarily small $\delta > 0$ and $q(k)$ large enough, there exists an arbitrarily small $\delta' > 0$ such that the following statement holds:
$$\frac{\|V^{m+k}\|_{2,n}^2}{\|V^{k}\|_{2,n}^2} \leq \left(\frac{q(k)}{q(k) + (1+\delta)\, n_k^*(m)}\right)^{\zeta(c_\phi, \lambda) - \delta'}.$$

Based on Lemma 8, we are able to develop our main results on the approximation theory of the pure greedy algorithm under $L_2$ loss.

Theorem 9 (Approximation Theory of PGA based on revisiting) Assume that Assumption A.1 holds. Define $\zeta^*(c) := \max_{\lambda \geq \mu_a(c)/(1-\mu_a(c))} \zeta(c, \lambda)$ as a function of $c$. Then, for any $\kappa > 0$ and $m < M_n$, there exists a fixed constant $C > 0$ such that
$$\frac{\|V^m\|_{2,n}^2}{\|V^0\|_{2,n}^2} \leq C \left(\frac{s}{m+s}\right)^{\zeta^*(c_\phi) - \kappa}$$
for $m$ large enough.

Remark 10 Our results stated in Theorem 9 depend on the lower bound of $|R(m)|/m$, which is the proportion of the Rs in the first $m$ terms of the sequence $\mathcal{A}$. We conjecture that the convergence rate of the PGA is close to exponential as $c \to 0$. Denote the actual proportion of R in the sequence $\mathcal{A}$ by $\psi(c)$, i.e., $|R(m)| \geq \psi(c)m - \psi_1(c)q(0)$, where $\psi(c), \psi_1(c)$ are some constants depending on $c$. If $\psi(c) \to 1$, it is easy to show, based on the proof of Theorem 9, that $\|V^m\|_{2,n}^2 \leq C\|V^0\|_{2,n}^2 \big(\frac{s}{m+s}\big)^{\zeta}$ for any arbitrarily large $\zeta$. In general, further improvements in the convergence rate of the PGA can be achieved by improving the lower bounds of $|R(m)|/m$. Table 1 gives, for different values of the SE constant $c_\phi$, the corresponding values of $\zeta^*(c_\phi)$. The convergence rate of the PGA, and hence of L2-Boosting, is affected by the frequency of revisiting. Since different values of $c_\phi$ impose different lower bounds on the frequency of revisiting, different values of $c_\phi$ imply a different convergence rate of the process in our framework.

Table 1: Relation between $c_\phi$ and $\zeta^*(c_\phi)$

c_phi   zeta*(c_phi)
1.0     1.19
0.9     1.04
0.8     0.89
0.7     0.76
0.6     0.63
0.5     0.51
0.4     0.40

Remark 11 As already mentioned, the function $\zeta^*(c)$ defined in Theorem 9 is used to provide a general lower bound on the revisiting behavior, which affects the rate of convergence of boosting algorithms. In some special cases, the PGA always selects a predictor from the true set $T$, indicating that $\zeta^*(c_\phi) \to \infty$, and therefore a stronger revisiting behavior can be shown. This is, for example, the case in a (near) orthogonal design, e.g., when $X_j$, $j = 1, \dots, p$, are i.i.d. standard normally distributed, or in an equi-correlated design, where $-1 < \mathrm{corr}(X_i, X_j) = \rho < 1$ for all $i \neq j$. A formal discussion of the equi-correlated design with $\rho > 0$ is given in Appendix D. In both cases, we achieve the Lasso rate of convergence. A small numerical check of the revisiting frequency of the PGA is sketched below.
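The following sketch (R, toy design and constants of our own choosing) runs the PGA on a noiseless model and records how often the selected variable already lies in $T^{m-1}$, i.e., the empirical frequency of R-steps in the sense of Definition 6.

```r
# Illustrative sketch: count revisiting steps (Definition 6) of the PGA on a
# noiseless model y = X beta.  A step is "R" if the selected variable already
# lies in T^{m-1}, the union of the true support and all previously selected
# variables; otherwise it is "N".
set.seed(1)
n <- 200; p <- 100; s <- 5; msteps <- 100
X <- scale(matrix(rnorm(n * p), n, p))
beta <- c(rep(1, s), rep(0, p - s))
y <- as.vector(X %*% beta)                       # deterministic model (3)

beta_hat <- rep(0, p)
T_m <- which(beta != 0)                          # T^0 = supp(beta), q(0) = s
revisits <- 0
for (m in seq_len(msteps)) {
  V <- y - as.vector(X %*% beta_hat)             # residual V^m of the PGA
  gamma <- as.vector(crossprod(X, V)) / colSums(X^2)
  j <- which.max(abs(gamma))
  if (j %in% T_m) revisits <- revisits + 1 else T_m <- c(T_m, j)
  beta_hat[j] <- beta_hat[j] + gamma[j]
}
revisits / msteps                                # empirical proportion of R-steps
```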
Under more general designs, however, greedy algorithms can only achieve a slower rate of convergence (see Temlyakov (2011)), preventing one from achieving the Lasso rate of convergence, as we discuss in more detail in Section 4.

4. Main Results

In this section, we provide the main results of our paper for L2-Boosting (BA) and restricted L2-Boosting (resBA), which were introduced in Section 2. To do this, we reconsider the high-dimensional approximately sparse regression model in (1):
$$y_i = x_i'\beta + r_i + \varepsilon_i, \quad i = 1, \dots, n,$$
with vector $x_i = (x_{i,1}, \dots, x_{i,p})$ consisting of $p$ predictor variables, $\beta$ a $p$-dimensional coefficient vector, and a random, mean-zero error term $\varepsilon_i$ with $E[\varepsilon_i|x_i] = 0$. The random variable $r_i$ denotes the approximation error of the exact sparse model. In Assumption A.2, we provide the explicit conditions of the approximately sparse regression model, as, for example, also considered in Belloni et al. (2013) and Belloni et al. (2016). Another sparsity assumption used in the literature is, e.g., weak sparsity (Negahban et al. (2012); Ing (2020)), i.e., $\sum_{j=1}^p |\beta_j|^q \leq R_n$ with $0 < q \leq 1$, but in this paper we focus on the approximately sparse regression model.4 In this section, the following assumptions are employed.

4. The extension of our results to different sparsity assumptions is left for future work.

Assumption A.2 (Approximate Sparsity)
(ii) $\|r\|_{2,n} \leq \sqrt{C_r s \log(2p/\alpha)/n}$ for some generic constant $C_r > 0$ and $\alpha > 0$.
(iii) There exists a constant $K \geq 1$ such that $\|\beta\|_2 \leq n^{K-1}$.

Assumption A.3 (Error term) For any $\alpha$ small enough, with probability $1 - \alpha$, we have
$$\max_{1 \leq j \leq p} |E_n[x_{i,j}\varepsilon_i]| \leq \sigma\sqrt{\frac{\log(2p/\alpha)}{n}}.$$
In addition, we require that $\hat\sigma^2 := E_n[\varepsilon_i^2]$ satisfies $|\hat\sigma^2 - \sigma^2| \leq \omega\sigma^2$ for some small enough constant $\omega \in (0, 1)$.

Remark 12 The previous assumption is implied, for example, if the error terms are i.i.d. $N(0, \sigma^2)$ random variables. This in turn can be generalized/weakened to cases of non-normality by self-normalized random vector theory (de la Peña et al. (2009)) or the approach introduced in Chernozhukov et al. (2014).

4.1 L2-Boosting with Componentwise Least Squares

First, we analyze the classical L2-Boosting algorithm with componentwise least squares. For this purpose, the approximation results derived in the previous section are key. While in the previous section the stochastic component was absent, in this section it is explicitly considered. The following definitions will be helpful for the analysis: $U^m$ denotes the residuals at the $m$th iteration, $U^m = y - X\beta^m = V^m + r + \varepsilon$. Here, $\beta^m$ is the estimator at the $m$th iteration. Again, we define the difference between the true and the estimated vector as $\alpha^m := \beta - \beta^m$. The prediction error is given by $V^m = X\alpha^m$. For boosting algorithms in the high-dimensional setting, it is essential to determine when to stop, i.e., the stopping criterion. In the low-dimensional case, the stopping time is not important: the value of the objective function decreases and the solution converges to the traditional OLS solution exponentially fast, as described in Bühlmann and Yu (2003). In the high-dimensional case, such fast convergence rates are usually not available: the residual $\varepsilon$ can be explained by $n$ linearly independent variables $X_j$. Thus, selecting more terms only leads to overfitting. Early stopping is comparable to the penalization in Lasso, which prevents one from choosing too many variables and hence overfitting. Similarly to Lasso, an (approximately) sparse structure will be needed for the analysis.
At each step, we minimize $\|U^m\|_{2,n}^2$ along the most greedy variable $X_{j_m}$. The following lemma establishes the main result on the convergence rate of L2-Boosting.

Lemma 13 Suppose Assumptions A.1-A.3 hold and $s\log(p)/n \to 0$. Assume $M_n$ is large enough, i.e.,
$$\log(M_n/s) \geq \Big(\xi + \frac{1}{1+\zeta^*(c_\phi)}\Big) \log\Big(\frac{n\|V^0\|_{2,n}^2}{s\log(2p/\alpha)}\Big)$$
for some $\xi > 0$. Let $m^* + 1$ be the first time that $\|V^m\|_{2,n} \leq \eta\sqrt{m+s}\,\lambda_n$, where $\eta$ is a constant large enough. Then, for any $\delta > 0$, with probability $1 - \alpha$,
(1) it holds that
$$m^* \lesssim s \Big(\frac{n\|V^0 + r\|_{2,n}^2}{s\log(2p/\alpha)}\Big)^{\frac{1}{1+\zeta^*(c_\phi)-\delta}} \quad \text{and} \quad m^* < M_n; \qquad (4)$$
(2) the prediction error $\|V^{m^*+1}\|_{2,n}$ satisfies
$$\|V^{m^*+1}\|_{2,n}^2 \lesssim \|V^0 + r\|_{2,n}^{\frac{2}{1+\zeta^*(c_\phi)-\delta}} \Big(\frac{s\log(2p/\alpha)}{n}\Big)^{\frac{\zeta^*(c_\phi)-\delta}{1+\zeta^*(c_\phi)-\delta}}. \qquad (5)$$

Remark 14 Lemma 13 shows that the convergence rate of L2-Boosting depends on the value of $c_\phi$. For different values of $c_\phi$, the lower bound on the proportion of revisiting ("R") steps in the sequence $\mathcal{A}$ is different. Such lower bounds on the frequency of revisiting naturally determine the upper bound for the deterministic component, which affects our results on the rate of convergence of L2-Boosting. As $\zeta^*(c_\phi) \to \infty$, statement (5) implies the usual Lasso rate of convergence.

The bound on the approximation error $\|V^m\|_{2,n}^2$ stated in inequality (5) is obtained under an infeasible stopping criterion. Below we establish another result with the same convergence rate but with a feasible stopping criterion that can be implemented in empirical studies.

Theorem 15 Suppose all conditions stated in Lemma 13 hold. Let $c_u > 4$ be a constant. Let $m_1^* + 1$ be the first time that
$$\|U^m\|_{2,n}^2 / \|U^{m-1}\|_{2,n}^2 > 1 - c_u\log(2p/\alpha)/n.$$
Then, with probability at least $1 - \alpha$,
$$\|V^{m_1^*+1}\|_{2,n}^2 \lesssim \|V^0\|_{2,n}^{\frac{2}{1+\zeta^*(c_\phi)-\delta}} \Big(\frac{s\log(2p/\alpha)}{n}\Big)^{\frac{\zeta^*(c_\phi)-\delta}{1+\zeta^*(c_\phi)-\delta}}$$
for any $\delta > 0$.

Remark 16 As we have already seen in the deterministic case, the rate of convergence depends on the constant $c_\phi$. Table 2 shows, for different values of $c_\phi$, the corresponding rate exponents when $\delta$ is set to zero. Hence, the rates can be interpreted as upper bounds.

Table 2: Relation between $c$ and $\zeta^*(c)/(1+\zeta^*(c))$

c     rate
1.0   0.54
0.9   0.51
0.8   0.47
0.7   0.43
0.6   0.39
0.5   0.34
0.4   0.29

4.2 Restricted L2-Boosting

The following theorem establishes the main result on the convergence rate of restricted L2-Boosting.

Theorem 17 Suppose Assumptions A.1-A.3 hold. Define a sequence of positive integers $m_k$ as follows: for $k = 1, 2, \dots$, define $L_k$ as a positive integer (WLOG we let $L_k = L_n$ be a constant) and define $F_k = C_F |\hat{T}^{m_{k-1}}|\log(n)$ for some absolute constant $C_F$ large enough,5 with $m_k := m_{k-1} + L_k + F_k$, where $m_0 := 0$. As suggested by restricted L2-Boosting, Algorithm 2 runs on the full set of variables when $l_m = 0$ and on the selected variables when $l_m = 1$. To this end, consider a sequence of indices $l_m \in \{0, 1\}$, $m = 1, 2, \dots$, with $l_m = 0$ for $m = m_k + 1, \dots, m_k + L_k$, and $l_m = 1$ for $m = m_k + L_k + 1, \dots, m_{k+1}$, $k \geq 0$. Suppose that $L_n < K_L s$ for some generic positive constant $K_L > 0$. For $C_U > 4/c_\phi$ defined in Algorithm 2, with probability $1 - \alpha$, we have
$$E_n\Big[\big(x_i'(\beta^{m^*} - \beta)\big)^2\Big] \lesssim \frac{s\log(n)\log(2p/\alpha)}{n}.$$

5. For example, $F_n$ can be chosen as $\ln^{\xi}(n)$ for some $\xi > 2$ for $n$ large enough.

Remark 18 The bound on the prediction error of restricted L2-Boosting in Theorem 17 is obtained under a feasible stopping criterion which can be easily implemented in empirical studies. The constant $C_U$ has to be chosen by the practitioner. In contrast to the constant $c_u > 4$ in Theorem 15, $C_U$ depends on the sparse eigenvalue $c_\phi$, which can be (pre-)estimated from the data, and $C_U$ is chosen accordingly. A schematic sketch of the block schedule and the stopping check is given below.
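To make the interplay of the blocks $L_k$, $F_k$ and the residual-ratio test concrete, the following sketch builds an index sequence $l_m$ and the stopping check. All constants ($L_n$, $C_F$, $C_U$, $\alpha$) and the helper names are illustrative choices of ours, not prescriptions from the theory.

```r
# Illustrative sketch (assumed constants, hypothetical helper names): the
# schedule l_m of Theorem 17 and the residual-ratio stopping check of
# Algorithm 2.  l_m = 0 means searching over all p variables, l_m = 1 means
# restricting to the already selected set.
make_schedule <- function(n, L_n = 5, C_F = 1, n_blocks = 10) {
  l <- integer(0)
  size_hat <- 1                              # crude proxy for |T_hat^{m_k}|
  for (k in seq_len(n_blocks)) {
    F_k <- ceiling(C_F * size_hat * log(n))  # length of the restricted phase
    l <- c(l, rep(0L, L_n), rep(1L, F_k))    # L_n unrestricted steps, then F_k restricted ones
    size_hat <- size_hat + L_n               # at most L_n new variables per block
  }
  l
}

stop_now <- function(U_new, U_old, n, p, C_U = 5, alpha = 0.05) {
  # stop once the residual sum of squares barely improves
  sum(U_new^2) / sum(U_old^2) > 1 - C_U * log(2 * p / alpha) / n
}
```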
It is worth noting that the restricted L2-Boosting algorithm is purely gradient-based. Hence, there is no need for orthogonal projections, which are computationally costly in high-dimensional settings. Therefore, restricted L2-Boosting combines two desirable properties: computational efficiency and a very fast rate of convergence. In practice, one also needs to specify the sequence of indices $l_m \in \{0, 1\}$, $m = 1, 2, \dots$, which depends on $L_k$ and $F_k$ and thus determines the so-called consideration set. As already outlined in Section 2.2, the restricted L2-Boosting algorithm sticks to the set of already chosen variables for some time ($F_k$ iterations) and exploits the information contained in these variables until new variables are added to the consideration set. By restricting to the set of already selected variables for some iterations, the information of these variables is partialled out of the target variable $y$, and in the limit this converges to an orthogonal projection step, as used in orthogonal L2-Boosting. The constant $L_k \geq 1$ is the number of iterations in which we allow the algorithm to select new variables.

5. Simulation Study

In this section, we present the results of our simulation study. The goal of this exercise is to illustrate the relevance of our theoretical results in providing insights into the functionality of boosting and the practical aspects of boosting. In particular, we demonstrate that the stopping rules for early stopping we propose work reasonably well in the simulations and give guidance for practical applications. Moreover, the comparison with Lasso might also be of interest. First, we start with an illustrative example; later we present further results, in particular for different designs and settings.

5.1 Illustrative Example

The goal of this section is to give an illustration of the different stopping criteria. We employ the following data generating process (dgp):6
$$y = 5x_1 + 2x_2 + 1x_3 + 0x_4 + \dots + 0x_{10} + \varepsilon, \qquad (6)$$
where $\varepsilon \sim N(0, 2^2)$ and $X = (X_1, \dots, X_{10}) \sim N_{10}(0, I_{10})$, with $I_{10}$ denoting the identity matrix of size $10 \times 10$. To evaluate the methods and, in particular, the stopping criteria, we conduct an analysis of both the in-sample and the out-of-sample mean squared error (MSE) defined in equation (8). For the out-of-sample analysis we draw a new observation for evaluation and calculation of the MSE. For the in-sample analysis we also repeat the procedure and form the average over all repetitions. In both cases we employ 60 repetitions. The sample size is n = 20. Hence, we have 20 observations to estimate 10 parameters.

6. In order to allow comparability, the dgp is adopted from Bühlmann (2006).

The results are presented in Figures 1 and 2. Both show how the MSE depends on the number of steps of the boosting algorithm. We see that the MSE first decreases with more steps, reaches its minimum, and then starts to increase again due to overfitting. In both graphs the solution of the L2-Boosting algorithm converges to the OLS solution. We also indicate the MSE of the Lasso estimators as horizontal lines (with cross-validated choice of the penalty parameter and data-driven choice of the penalization parameter). In order to find a feasible stopping criterion, we have to rely on the in-sample analysis. Figure 2 reveals that the stopping criterion we introduced in the previous sections performs very well, and even better than stopping based on a corrected AIC value, which has been proposed in the literature as a stopping criterion for boosting. The code sketch below outlines how such an MSE-versus-steps analysis can be generated.
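The following sketch simulates the dgp in (6), runs plain L2-Boosting, and traces the out-of-sample MSE over the number of steps (cf. Figures 1 and 2). Seeds, the size of the evaluation sample and the helper name are illustrative choices of ours.

```r
# Illustrative sketch: dgp (6) and out-of-sample MSE of L2-Boosting as a
# function of the number of boosting steps (cf. Figures 1 and 2).
set.seed(1)
n <- 20; p <- 10; msteps <- 50
beta <- c(5, 2, 1, rep(0, p - 3))
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% beta + rnorm(n, sd = 2))
X_new <- matrix(rnorm(50 * p), 50, p)            # evaluation sample

l2boost_path <- function(X, y, msteps) {
  b <- rep(0, ncol(X)); path <- matrix(0, msteps, ncol(X))
  for (m in seq_len(msteps)) {
    U <- y - as.vector(X %*% b)
    g <- as.vector(crossprod(X, U)) / colSums(X^2)
    j <- which.max(abs(g))
    b[j] <- b[j] + g[j]
    path[m, ] <- b                               # coefficient path over steps
  }
  path
}
path <- l2boost_path(X, y, msteps)
mse <- apply(path, 1, function(b) mean((X_new %*% (beta - b))^2))
plot(mse, type = "l", xlab = "boosting steps", ylab = "out-of-sample MSE")
```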
The average stopping steps of our criterion and the corrected AIC-based criterion (AICc) are indicated by the vertical lines. On average, our criterion stops earlier than the AICc-based one. As our criterion performs better than the AICc, we will not report AICc results in the following subsection. For the restricted L2-Boosting algorithm, similar patterns arise and are omitted.

Figure 1: Out-of-sample MSE of the L2-Boosting algorithm depending on the number of steps. The horizontal lines show the MSE of the OLS, boosting and Lasso estimates.

Figure 2: In-sample MSE of the L2-Boosting algorithm depending on the number of steps. The horizontal lines show the MSE of the OLS, boosting and Lasso estimates; the vertical lines mark the average stopping steps of our criterion and of the AICc criterion.

5.2 Further Results

In this section, we present results for different designs and settings to give a more detailed comparison of the methods. We consider the linear model
$$y = \sum_{j=1}^{p} \beta_j x_j + \varepsilon, \qquad (7)$$
with $\varepsilon$ i.i.d. standard normally distributed. For the coefficient vector $\beta$ we consider two designs. First, we consider a sparse design, i.e., the first $s$ elements of $\beta$ are set equal to one and all other components to zero ($\beta = (1, \dots, 1, 0, \dots, 0)$). Then we consider a polynomial design in which the $j$th coefficient is given by $1/j$, i.e., $\beta = (1, 1/2, 1/3, \dots, 1/p)$. For the design matrix $X$, we consider two different settings: an orthogonal setting and a correlated setting. In the former setting, the entries of $X$ are drawn as i.i.d. draws from a standard normal distribution. In the correlated design, the $x_i$ (rows of $X$) are distributed according to a multivariate normal distribution where the correlations are given by a Toeplitz matrix with factor 0.5 and alternating signs. To sum up, we have the following settings:

- $X$: orthogonal or correlated design
- coefficient vector $\beta$: sparse design or polynomially decaying design
- $n = 100, 200, 400$
- $p = 100, 200$
- out-of-sample prediction size $n_1 = 50$
- number of repetitions $R = 500$

We consider the following estimators: L2-Boosting with componentwise least squares, restricted L2-Boosting, orthogonal L2-Boosting and Lasso. For Lasso, we also consider the post-selection estimator ("p-Lasso"). Here, we consider a data-driven, regressor-dependent choice of the penalization parameter (Belloni et al. (2012)) and cross-validation. Although cross-validation is very popular, it does not rely on established theoretical results, and therefore we prefer a comparison with the formal penalty choice developed in Belloni et al.
Using this approach, boosting stops when the ratio of the estimated variances does not improve upon a certain amount any more. The oracle rule stops when the mean-squared-error (MSE), defined below, is minimized, which is not feasible in practical applications. The simulations were performed in R (R Core Team (2014)). For Lasso estimation the packages hdm by Chernozhukov et al. (2015) and glmnet by Jerome Friedman (2010) (for cross-validation) were used. The boosting procedures were implemented by the authors and the code is available upon request.7 To evaluate the performance of the estimators, we use the MSE criterion. We estimate the models on the same data sets and use the estimators to predict 50 observations out-of-sample. The (out-of-sample) MSE is defined as MSE = E[(f(X) fm(X))2] = E[(X (β βm))2], (8) where m denotes the iteration at which we stop, depending on the employed stopping rule. The MSE is estimated by i=1 [(f(xi) fm(xi))2] = 1 i=1 [(x i(β βm))2] (9) for the out-of-sample predictions. The results of the simulation study are shown in Tables 3 10. As expected, the oracle-based estimators clearly dominate in almost all cases, although our stopping criterion also gives very good results. Not surprisingly, given our theoretical results, both restricted L2-Boosting and orthogonal L2-Boosting outperform the standard L2-Boosting in most cases. A comparison of restricted and orthogonal L2-Boosting does 7. A R package is in preparation. High-Dimensional L2-Boosting Table 5: Simulation results: sparse, correlated design (Boosting) n p BA-oracle BA-our o BA-oracle o BA-our res BA-oracle res BA-our 100 100 1.501 2.232 0.357 1.255 0.670 2.028 100 200 3.220 3.080 2.516 2.552 2.301 2.968 200 200 0.627 0.753 0.059 0.249 0.144 0.215 400 200 0.195 0.245 0.027 0.113 0.068 0.102 800 200 0.081 0.104 0.013 0.054 0.035 0.051 Table 6: Simulation results: sparse, correlated design (Lasso) n p Lasso p-Lasso Lasso-CV p-Lasso-CV 100 100 2.63 1.35 0.97 1.37 100 200 2.96 2.04 1.63 2.38 200 100 1.10 0.23 0.33 0.57 200 200 1.64 0.38 0.49 0.87 400 100 0.38 0.10 0.13 0.23 400 200 0.36 0.15 0.16 0.31 Table 7: Simulation results: polynomial, iid design (Boosting) n p BA-oracle BA-our o BA-oracle o BA-our res BA-oracle res BA-our 100 100 0.411 0.548 0.400 0.658 0.400 0.536 100 200 0.252 0.324 0.248 0.356 0.246 0.312 200 200 0.282 0.399 0.274 0.463 0.271 0.381 400 200 0.182 0.232 0.180 0.251 0.178 0.221 800 200 0.122 0.143 0.121 0.148 0.121 0.140 Table 8: Simulation results: polynomial, iid design (Lasso) n p Lasso p-Lasso Lasso-CV p-Lasso-CV 100 100 0.43 0.83 0.45 0.84 100 200 0.50 1.06 0.54 0.76 200 100 0.30 0.34 0.26 0.47 200 200 0.34 0.52 0.33 0.52 400 100 0.19 0.19 0.15 0.24 400 200 0.21 0.31 0.20 0.38 Table 9: Simulation results: polynomial, correlated design (Boosting) n p BA-oracle BA-our o BA-oracle o BA-our res BA-oracle res BA-our 100 100 0.242 0.421 0.228 0.506 0.223 0.380 100 200 0.251 0.531 0.244 0.670 0.238 0.489 200 200 0.193 0.312 0.171 0.337 0.165 0.269 400 200 0.149 0.203 0.113 0.188 0.111 0.161 800 200 0.102 0.126 0.078 0.111 0.077 0.100 Table 10: Simulation results: polynomial, correlated design (Lasso) n p Lasso p-Lasso Lasso-CV p-Lasso-CV 100 100 0.33 0.53 0.33 0.55 100 200 0.34 0.93 0.36 0.55 200 100 0.27 0.31 0.23 0.41 200 200 0.28 0.47 0.29 0.46 400 100 0.17 0.18 0.14 0.24 400 200 0.16 0.24 0.15 0.29 Luo, Spindler and Kueck not provide a clear answer with advantages on both sides. 
It is worth noting that the post Lasso estimator improves upon Lasso, but there are some exceptions, probably driven by overfitting. Cross-validation works very well in many settings. An important objective of the simulation study is to compare L2-Boosting and Lasso. It seems that in the polynomial decaying setting, L2-Boosting (particulary orthogonal L2-Boosting with our stopping rule) dominates post-Lasso. This also seems true in the sparse i.i.d. setting. In the sparse correlated setting, they perform equally well overall. In summary, it seems that L2-Boosting is a serious contender for Lasso in high-dimensional linear regression models as our new theoretical results propose. 6. Conclusion Although boosting algorithms are widely used in research and industry, the analysis of their properties in high-dimensional settings has been quite challenging. In this paper, the rate of convergence for the L2-Boosting algorithm and variants under early stopping in a highdimensional setting are derived which has been a long-standing open problem until now. For the analysis of the L2-Boosting algorithm, new approximation results are derived which improve on previous results and might be of independent interest. The rate of pure greedy algorithms can be slow in some pathological cases, as shown in Livshitz and Temlyakov (2003). Therefore, we introduce a new variant called restricted L2-Boosting which achieves a faster rate of convergence that is comparable to the rate of convergence of Lasso. All results are derived without beta-min condition and under a feasible early stopping criterion. In this paper, we focus on linear basis functions. The analysis of nonlinear basis functions, like trees, is left for future research and the results in this paper might serve as a starting point. Acknowledgments We thank the editor Corinna Cortes for her support and two anonymous referees for their valuable comments which helped to improve the paper. We thank Chunrong Ai, Victor Chernozhukov, Jerry Hausman, Scott Kostyshak, Anna Mikusheva, Faming Liang, Whitney Newey, Philippe Rigollet, David Sappington and seminar participants at MIT and University of Florida for invaluable comments and discussions. Financial support by the Fritz Thyssen Stiftung (Az. 40.13.0.014) and Hong Kong RGC TRS (T32-615/24-R) is gratefully acknowledged. High-Dimensional L2-Boosting Appendix A. A new approximation theory for PGA A.1 Auxiliary lemmas on approximation theory of PGA In this section of the appendix, we introduce preparatory results for a new approximation theory based on revisiting. These results are useful to prove Lemma 7. The proofs of these lemmas are provided in the next section. For any m1 m, define L(m, m1) = ||V m1||2 2,n/||V m||2 2,n 1. For any integers q1 > q, define (q, q1) := Πq1 q 1 j=0 1 cφ q+j with cφ defined in Assumption A.1. It is easy to see that for any k1 > k, (k, k1)/(k/k1)cφ < 1 and (k, k1)/(k/k1)cφ 1 (10) as k . First of all, we can establish the following naive bounds on L(m, m1): Lemma 19 Suppose ||V m||2,n > 0, m + 1 < Mn and m1 < Mn. Under Assumption A.1, it holds a) For any m, L(m, m + 1) 1 cφ q(m); b) For any m 0, m1 > m, L(m, m1) (q(m), q(m) + m1 m). The bound of L(m, m1) established in Lemma 19 is loose. To obtain better results on the convergence rate of V m 2 2,n, the revisiting behavior of the PGA has to be analyzed in more detail. The revisiting behavior of PGA addresses the question when and how often variables are selected again which have already been selected before. 
When PGA chooses too many new variables, it leads on average to slower convergence rates and vice versa. The next results primarily focus on analyzing the revisiting behavior of the PGA. The following lemma summarizes a few basic facts of the sequence of Ai, i 1. Lemma 20 Suppose m < Mn and m1 < Mn. Further, assume that Assumption A.1 is satisfied. It holds a) If En[x ixi] is a diagonal matrix, i.e., cφ = 1, then there are only Rs in the sequence A. b) Define N(m) := {k|Ak = N, 1 k m}, the index set for the non-revisiting steps, and R(m) := {k|Ak = R, 1 k m}, the index set for the revisiting steps. Then |R(m)| + |N(m)| = m, q(m) = |N(m)| + q(0), and JN(m) := {jk|k N(m)} has cardinality equal to |N(m)|. c) L(0, m) Π|N(m)| i=1 1 cφ q(0)+i 1 1 cφ q(m) |R(m)| , i.e., the sequence to maximize the upper bound of L(0, m) stated above is NN . . . NRR . . . R. Consequently, the sequence {Am+1, ..., Am1} to maximize the upper bound of L(m, m1) for general m1 > m is also NN . . . NRR . . . R. The proof of this lemma is obvious and hence omitted. Much more involved is the following result for characterizing the revisiting behavior. Luo, Spindler and Kueck Lemma 21 Assume that Assumption A.1 holds and that m < Mn. Consider the sequence of steps 1, 2, . . . , m. Set µe(cφ) = (1 exp( 1/c2 φ)). Then, the number of Rs in the sequence A satisfies: |R(m)| 1 µe(cφ) 2 µe(cφ)m µe(cφ) 2 µe(cφ)q(0). Lemma 21 provides a lower bound of the proportions of Rs. It illustrates that the R spots occupy at least some significant proportion of the sequence A, with the lower bound of the proportion depending on cφ. In fact, such a result holds for arbitrary consecutive sequence Am, Am+1, . . . , Am+k, as long as m+k < Mn. In the main text, we further extend results stated in Lemma 21. A.2 Proofs of lemmas in Appendix A.1 Proof [Proof of Lemma 19] By definition, ||V m||2 2,n = X j T m αm j < V m, Xj >n= X j T m αm j V m 2,ncorr(V m, Xj). Define ρjm := |γm|/ V m 2,n = |corr(V m, Xjm)| since V m = Um in the deterministic case. Therefore, αm j V m 2,n, i.e., ρ2 jm P j T m |αm j | 2 V m 2 2,n. By the Cauchy-Schwarz inequality, j T m |αm j | q(m)||αm||2. Therefore, ρ2 jm cφ q(m) with cφ defined in Assumption A.1 and ||V m+1||2 2,n = ||V m||2 2,n(1 ρ2 jm) V m 2 2,n i.e. L(m, m + 1) 1 cφ q(m). The second statement follows from statement a) by iteration and the fact that q(m + 1) q(m ) + 1 for any m 0. Proof [Proof of Lemma 21] Define e N(m) := {l : jl / T 0, jl is only visited once within steps 1,2,. . . ,m} High-Dimensional L2-Boosting with T 0 = T = supp(β). It is easy to see that e N(m) N(m) and | e N(m)| 2|N(m)| m since we excluded T 0 in both N and N(m) and |N(m)| = m |R(m)| and | e N(m)| m 2|R(m)| where N(m) := {k|Ak = N, 1 k m} is the index set for the non-revisiting steps. For any jl with l e N(m), it holds αm jl = γl jl / T 0. If |R(m)| m/2, then the statement of this lemma trivially holds. Therefore, we can assume that e N(m) is non-empty. Hence, l e N(m) (γl 1)2. By the sparse eigenvalue condition A1, 1 cφ ||V m||2 2,n ||αm||2 X l e N(m) (γl 1)2. (12) Note that by Lemma 19, (γl 1)2 = ||V l 1||2 2,n ||V l||2 2,n = ||V l 1||2 2,n (1 L(l 1, l)) cφ q(l 1)||V l 1||2 2,n. (13) Therefore, (γl 1)2 cφ q(l 1)||V m||2 2,n for all l N(m). Plugging this back into (12), we get: 1 cφ ||V m||2 2,n cφ X 1 q(l 1)||V l 1||2 2,n. (14) Since l e N(m) are different integers with the maximum value of q(l 1) being less than or equal to q(m) = q(0) + |N(m)|, it holds | e N(m)| X 1 q(0) + |N(m)| l log((q(0)+|N(m)|)/(q(0)+|N(m)| | e N(m)|)). 
The inequality above implies that exp(1/c2 φ) (q(0) + |N(m)|)/(q(0) + |N(m)| | e N(m)|), i.e., | e N(m)| (1 exp( 1/(cφ)2))(q(0) + |N(m)|). Set µe(cφ) = (1 exp( 1/c2 φ)) (0, 1). Since we know that | e N(m)| 2|N(m)| m, we immediately have: |N(m)| 1 2 µe(cφ)(m + µe(cφ)q(0)) |R(m)| 1 µe(cφ) 2 µe(cφ)m µe(cφ) 2 µe(cφ)q(0). Luo, Spindler and Kueck Appendix B. Proofs of main results in Section 3 Proof [Proof of Lemma 7] First of all, WLOG, we can assume that q(0) exceeds a large enough constant Q(δ). Otherwise, it can be assumed that the true parameter β contains some infinitesimal components such that q(0) > Q(δ). Let s revisit inequality (13). It holds X l e N(m) (γl 1)2 X l e N(m) ||V l 1||2 cφ q(l 1). The right-hand side reaches its minimum when e N(m) = {m | e N(m)| + 1, m | e N(m)| + 2, ..., m}, and for the step m | e N(m)| + l, with l = 1, 2, ..., | e N(m)|, we have q(m | e N(m)| + l 1) = q(m) | e N(m)| + l 1. Hence, for any δ > 0, and q(0) large enough, l e N(m) (γl 1)2 (1 + δ) X l e N(m) ||V l 1||2 2,n cφ q(l 1) = (1 + δ) X l e N(m) ||V m||2 2,n 1 L(l 1, m) cφ q(l 1) (1 + δ)||V m||2 2,ncφ | e N(m)| X 1 q(m l) 1 L(m l, m) ||V m||2 2,ncφ | e N(m)| X q(m) q(m) l c2 φ αm 2q(m)cφ q(m) 1 X k=q(m) | N(m)| cφ αm 2q(m)cφ((q(m) | e N(m)|) cφ q(m) cφ) k=q(m) | N(m)| 1+cφ = ((q(m) | e N(m)|) cφ q(m) cφ), ||V m||2 2,n cφ αm 2 by Assumption A.1 and ||V l 1||2 2,n/||V m||2 2,n = 1/L(l 1, m), while L(m l, m) q(m) l as q(m) l q(m) | N(m)| q(0) by Lemma 19. Using αm 2 P l e N(m)(γl 1)2, we conclude (1 + δ) cφq(m)cφ((q(m) | e N(m)|) cφ q(m) cφ), High-Dimensional L2-Boosting | e N(m)| q(m) 1 1 + 1 + δ q(m)(1 + δ )µa(cφ), for some δ > 0, with δ 0 as δ 0. Since we know that | e N(m)| 2|N(m)| m, we have analogous to the proof of Lemma 21, |N(m)| (|N(m)| + q(0))(1 + δ )µa(cφ) + m 2 which implies |N(m)| q(0)(1 + δ )µa(cφ) + m 2 (1 + δ )µa(cφ) |R(m)| = m |N(m)| 1 (1 + δ )µa(cφ) 2 (1 + δ )µa(cφ)m (1 + δ )µa(cφ) 2 (1 + δ )µa(cφ)q(0). Proof [Proof of Lemma 8] Without loss of generality, we can assume that k = 0. We can also assume that ||V 0||2 2,n > 0, because otherwise ||V 0||2 2,n = ||V m||2 2,n = 0 so that the conclusion already holds. Set n0 = |N(m)| m + q(0)(1 + δ)µa(cφ) 2 (1 + δ)µa(cφ) (1 + δ)m + µa(cφ)q(0) 2 µa(cφ) = (1 + δ)n 0(m) for any δ > 0 by Lemma 7 when q(0) is large enough. Then, by Lemma 20, ||V m||2 2,n/||V 0||2 2,n Πn0 i=1 1 cφ q(0) + i 1 1 cφ q(0) + n0 where the right hand reaches its maximum when n0 = (1 + δ)n 0(m). When q(0) is large enough, we know that for any δ > 0, Π(1+δ)n 0(m) i=1 1 cφ q(0) + i 1 (1 + δ) q(0) q(0) + (1 + δ)n 0(m) = (1 + δ) 2 µa(cφ) λ(1 + δ) + 2 + δµa(cφ) where λ = m/q(0) (15) 1 cφ q(0) + (1 + δ)n 0(m) m (1+δ)n 0(m) = q(0)2+(1+δ)λ+δµa(cφ) q(0) ((1 δ) µa(cφ))λ (1+δ)µa(c) exp cφ((1 δ) µa(cφ))λ (1 + δ)µa(cφ)) 2 + (1 + δ)λ + δµa(cφ) Luo, Spindler and Kueck Thus, for any δ > 0, and for q(0) large enough, ||V m||2 2,n/||V 0||2 2,n (1 + δ) 2 µa(cφ) λ(1 + δ) + 2 + δµa(cφ) cφ exp cφ((1 δ) µa(cφ))λ (1 + δ)µa(cφ)) 2 + (1 + δ)λ + δµa(cφ) = (1 + δ) q(0) + (1 + δ)n 0(m) q(0) cφ exp cφ((1 δ) µa(cφ))λ (1 + δ)µa(cφ)) 2 + (1 + δ)λ + δµa(cφ) where 2 µa(cφ) λ(1 + δ) + 2 + δµa(cφ) = q(0) + (1 + δ)n 0(m) q(0) . It is worth noting that the bound on the right-hand side does not depend on q(0) or m only on λ. 
Hence, for some δ > 0 that is small enough that depends on any small enough δ > 0, and for q(0) large enough, ||V m||2 2,n ||V 0||2 2,n(1 + δ) q(0) q(0) + (1 + δ)n 0(m) cφ q(0) q(0) + (1 + δ)n 0(m) ||V 0||2 2,n q(0) q(0) + (1 + δ)n 0(m) ζ (cφ, λ, δ) := cφ((1 δ) µa(cφ))λ (1+δ)µa(cφ)) 2+(1+δ)λ+δµa(cφ) log q(0) q(0)+(1+δ)n 0(m) and ζ(cφ, λ) defined in the statement of this lemma, ζ(cφ, λ) := cφ((1 µa(cφ))λ µa(cφ)) log 2+λ 2 µa(cφ) + cφ. Proof [Proof of Theorem 9] As in the proof of Lemma 7, WLOG, we can assume that q(0) exceeds a large enough constant Q(δ). Otherwise, we can consider the true parameter β contains some infinitesimal components such that q(0) > Q(δ). Let λ µa(c) 1 µa(c) be the maximizer of c((1 µa(c))λ µa(c)) log 2+λ 2 µa(c) + c given c (0, 1). Let m0 = 0. Define the sequence mi+1 = 1 + mi + λ q(mi) with λ λi := mi+1 mi q(mi) λ + 1 q(mi) λ + 1 High-Dimensional L2-Boosting n i := mi+1 mi + µa(cφ)q(mi) By Lemma 8, for q(mi) large enough, e.g., q(mi) q for some q large enough, there exists δ > 0 that is small enough, so that ||V mi+1||2 2,n ||V mi||2 2,n q(mi) q(mi) + (1 + δ)n i q(mi) q(mi) + (1 + δ)n i ζ(cφ,λ ) δ , (16) where δ := δ +maxλ [λ ,λ + 1 q ] ζ(cφ,λ) q is an arbitrarily small constant, as q is large enough, δ is small enough, ζ is a continuously differentiable function with bounded derivatives, and (1 + δ)n i q(mi+1) q(mi). For those m with q(m) q, by (11) in the proof of Lemma 19, we have that V m+1 2 2,n V m 2 2,n 1 cφ q(m) V mi+1 2 2,n V mi 2 2,n = Πmi+1 1 m=mi V m+1 2 2,n V m 2 2,n Πmi+1 1 m=mi 1 cφ q(mi+1 1) 1 cφ q(mi+1) mi+1 mi . (17) Define i as the index of i such that q(mi 1) < q and q(mi ) q. By definition, q(mi ) q(mi 1) + mi+1 mi (1 + λ )q(mi 1) + 1 (1 + λ ) q + 1 =: eq. (18) Assume i i . Since (1+δ)n i q(mi+1) q(mi), it holds q(mi) (1+δ) Pi 1 j=i n j +q(mi ). If i < i , the statement follows by (17). As q(mi) q(mi)+(1+δ)n i is increasing in q(mi), we have that q(mi) q(mi) + (1 + δ)n i (1 + δ) Pi 1 j=i n j + q(mi ) (1 + δ) Pi 1 j=i n j + q(mi ) + (1 + δ)n i = (1 + δ) Pi 1 j=i n j + q(mi ) (1 + δ) Pi j=i n j + q(mi ) . (19) Luo, Spindler and Kueck By (19), we have that: V mi 2 2,n V 0 2 2,n Πi 1 j=i ||V mj+1||2 2,n ||V mj||2 2,n Πi 1 j=0 ||V mj+1||2 2,n ||V mj||2 2,n Πi 1 j=i q(mj) q(mj) + (1 + δ)n j !ζ(cφ,λ ) δ 1 cφ q(m i ) + (1 + δ) Pi 1 j=i n j !ζ(cφ,λ ) δ 1 cφ with q(mj+1) q q when j < i . Note that by definition, we have j=i n j = (1 + δ) 2 µa(cφ) j=i µa(cφ)q(mj) 1 2 µa(cφ)(mi mi ). (21) Plug in (21), we have that V mi 2 2,n V 0 2 2,n Cµ q(mi ) q(m i ) + mi mi ζ(cφ,λ ) δ 1 cφ where Cµ := (2 µa(cφ))ζ(cφ,λ ) δ . By (18), we have q(m i ) eq. If mi 2mi , we conclude q(mi ) q(m i ) + mi mi eq 1 q(m i ) + mi mi eq 2 q(0) + mi Therefore, V mi 2 2,n V 0 2 2,n C µ q(0) q(0) + mi ζ(cφ,λ ) δ , (24) ζ(cφ,λ ) δ Cµ(2eq)ζ(cφ,λ ) δ is a fixed constant given q and δ . If mi < 2mi , then, V mi 2 2,n V 0 2 2,n 1 cφ ζ(cφ,λ ) δ q(0) q(0) + mi ζ(cφ,λ ) δ , (25) for mi being large enough, as the left hand side of the above inequality decays exponentially in mi, while the right hand side decays only in fixed polynomial speed of mi. Therefore, in either case, we have that V mi 2 2,n V 0 2 2,n C µ q(0) q(0) + mi ζ(cφ,λ ) δ , (26) High-Dimensional L2-Boosting for some fixed C µ > 0. For any m > 0, m < M0, since m0, m1, . . . is an increasing sequence of positive integers, there exists i such that mi m < mi+1. Thus, m mi mi+1 mi (2 + λ ). 
Therefore, V m 2 2,n V 0 2 2,n V mi 2 2,n V 0 2 2,n C µ q(0) q(0) + mi ζ(cφ,λ ) δ e Cµ q(0) q(0) + m ζ(cφ,λ ) δ , (27) with e Cµ := C µ(2 + λ )ζ(cφ,λ ) δ is a fixed constant. Lastly, by replacing δ with κ, the proof is completed. Appendix C. Proofs for the L2-Boosting algorithm C.1 Auxiliary lemmas for L2-Boosting The two lemmas below state several basic properties of the L2-Boosting algorithm that will be useful in deriving the main results. Lemma 22 It holds ||Um+1||2 2,n = ||Um||2 2,n < Um, Xjm >2 n= ||Um||2 2,n(1 ρ2(Um, Xjm)) ||V m+1 + r||2 2,n = ||V m + r||2 2,n 2 < V m + r, γm jm Xjm >n +(γm jm)2, where γm jm =< Um, Xjm >n. Moreover, since V m = Um r ε, ||V m+1 + r||2 2,n = ||V m + r||2 2,n (2 < Um, Xjm >n< Um ε, Xjm >n) + < Um, Xjm >2 n = ||V m + r||2 2,n + 2γm jm < ε, Xjm >n (γm jm)2. Lemma 23 Assume that Assumptions A.1-A.3 hold and m Mn. Let Zm = ||Um||2 2,n ||V m + r||2 2,n. Then, with probability 1 α and uniformly in m, it holds |Zm σ2 n| 2 p (1 + ω)Crs + 2 m + s cφ V m 2,n with σ2 n := ||ε||2 2,n. Lemma 23 bounds the difference between ||Um||2 2,n and ||V m + r||2 2,n. Luo, Spindler and Kueck Proof [Proof of Lemma 23] By Assumption A.3, we have that: |Zm ||ε||2 2,n| 2 < ε, V m + r >n = 2 < ε, r + Xαm >n 2 ε 2,n r 2,n + 2| < ε, Xαm >n | Crs log(2p/α) n + 2 αm 1λn (1 + ω)Crsλn + 2 m + s αm 2λn (1 + ω)Crs + 2 m + s cφ V m 2,n as |supp(αm)| m + s. Next, we provide the lemmas that we require for proving our main result in Theorem 17. Define Uo T := y PT y as the residual of projection of y on XT for any T {1, 2, ..., p}. Further, we define Pm,T as the operator of performing m periods of PBA algorithm subject to the subset T, i.e., Um T := y Pm,T y = (I Pm,T )U0 with U0 := y and V m T := U0 T Um T denotes the approximation error. Lemma 24 measures a bound between the operator Pm,T and the projection operator PT . That said, Pm,T is an approximation of PT . Lemma 24 (Iterations) Suppose that Assumption A.1 holds. Suppose T {1, 2, ..., p} with |T| Mn. Then, starting with U0 = y, the L2-Boosting algorithm that runs on the restricted subset T satisfies that: Um Uo T 2 2,n y Uo T 2 2,n exp 1 |T|cφm . (28) Proof [Proof of Lemma 24] Given the L2-Boosting algorithm, γjm := max j T |< Um, Xjm >n Xjm 2,n | = | < Um, Xjm >n |, (29) with jm being the largest index. The approximation error can be written as follows: V m T := U0 T Um T = XT αm T . We consider Um T = V m T + η where η = Um T V m T = Uo T = y PT y is orthogonal to all the column vectors in XT . Note that η is never changing when m increases. As a result, we have that: |γjm| αm T 1 X j T < Um T , αm j Xjm >n= V m T 2 2,n. (30) High-Dimensional L2-Boosting |γjm|2 V m T 4 2,n αm T 2 1 V m T 4 2,n |T| αm T 2 2 cφ V m T 4 2,n |T| V m T 2 2,n = cφ V m T 2 2,n |T| . That is to say, V m+1 T 2 2,n = Um+1 T 2 2,n η 2 2,n = Um T 2 2,n |γjm|2 η 2 2,n V m T 2 2,n As a result, V m T 2 2,n V 0 T 2 2,n m V 0 T 2 2,n exp cφ which proves the statement of the lemma. Lemma 25 (Bounds on Residuals) Assuming that assumptions A.1-A.3 hold, for any positive integer M Mn, we have: sup T {1,2,...,p},|T| M 1 n X T ε 2 2,n Mλ2 n, sup T {1,2,...,p},|T| M PT ε 2 2,n 1 sup T {1,2,...,p},|T| M PT (r + ε) 2 2,n 2 Crs log(2p/α) + σ2/cφ log(2p/α)M with probability 1 α. The proof is given in Lemma 1 in the appendix of Kueck et al. (2023). Lemma 26 (lower bound on residuals) Given Assumption A.1. Consider Um = Xαm+ r + ε with αm 0 Mn. Then, |γm jm| r cφ αm 0 Xαm 2,n Crcφs log(2p/α) with probability 1 α. 
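As a numerical aside before the remaining proofs (the proof of Lemma 26 follows next), the geometric decay in Lemma 24 is easy to visualize: componentwise L2-Boosting run on a fixed index set T drives the residual towards the least-squares residual on X_T. The following R sketch is purely illustrative and not part of the formal argument; the data-generating process, the set T, and all constants are our own choices.

```r
## Illustration (ours) of Lemma 24: componentwise L2-Boosting restricted to a
## fixed index set T converges to the projection residual U^o_T = y - P_T y
## at a geometric rate.
set.seed(1)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
X <- scale(X) * sqrt(n / (n - 1))        # columns with mean 0 and ||X_j||_{2,n} = 1
beta <- c(rep(1, 5), rep(0, p - 5))
y <- as.vector(X %*% beta + rnorm(n))

Tset <- 1:10                             # the restricted index set T, |T| = 10
Uo   <- as.vector(residuals(lm(y ~ X[, Tset] - 1)))   # U^o_T

U   <- y                                 # U^0 = y
gap <- numeric(60)
for (m in seq_along(gap)) {
  g <- as.vector(crossprod(X[, Tset], U)) / n   # <U^{m-1}, X_j>_n for j in T
  j <- which.max(abs(g))                        # greedy selection within T
  U <- U - g[j] * X[, Tset[j]]                  # one componentwise update
  gap[m] <- mean((U - Uo)^2)                    # ||U^m - U^o_T||^2_{2,n}
}
print(round(gap[c(1, 5, 10, 20, 40, 60)], 5))
```

Since U^m - U^o_T lies in the span of X_T and U^o_T is orthogonal to that span, the gap equals ||U^m||^2_{2,n} - ||U^o_T||^2_{2,n}; it decreases monotonically and at least as fast as the exponential bound in (28).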
Proof [Proof of Lemma 26] Denote ρm := maxj=1,2,...,p |corr(Um, Xj)|. It can be shown that (γm jm)2 = ρ2 m Um 2 2,n. We know that < Um, Xαm >n Xαm 2 2,n r 2,n Xαm 2,n λn αm 1 Crs log(2p/α) n Xαm 2,n λn cφ Xαm 2,n. Luo, Spindler and Kueck On the other hand, we have that: < Um, Xαm >n X j supp(α) αm j < U, Xj >n ρm Um 2,n X j supp(α) |αm j | αm 0 αm 2 |γm jm| cφ Xαm 2,n. Therefore, we have that: |γm jm| r cφ | αm 0 Xαm 2,n Crcφs log 2p/α n αm 0 λn. (33) C.2 Proofs of main results in Section 4 Proof [Proof of Lemma 13] By assumption A.3, we know that λn max1 j p | < ε, Xj >n | with probability 1 α. According to our definition, m + 1 is the first time ||V m||2,n η where η is a fixed positive constant that is large enough with η > 2σ Cr. We know that in high-dimensional settings, ||Um||2,n 0, so ||V m||2 2,n σ2. Thus, by fixing p and n, such an m must exist. Therefore, we can consider any m < m := (m + 1) Mn. Note that T m := T0 {j0, j1, . . . , jm 1} and V m := Xαm = XT mαT m. It holds |γm jm| αT m 1 < Um, XT mαT m >n V m 2 2,n | < r, XT mαT m > | | < ε, XT mαT m > | V m 2 2,n r 2,n V m 2,n λn αT m 1. As we assume that V m 2,n > η m + sλn, with η being large enough, we have that r 2,n Cr ση V m 2,n. Then, V m 2 2,n r 2,n V m 2,n 1 Cr High-Dimensional L2-Boosting which implies that: ση V m 2 2,n α e T m 1 λn ση V m 2 2,n p q(m) α e T m 2 λn q(m) V m 2,n λn q(m) V m 2,n CV,1 V m + r 2,n p where V m 2,n η m + sλn η p q(m)λn, and is a positive constant when η is large enough, and approaches to cφ as η goes to infinity. By Lemma 22, it holds that ||V m + r||2 2,n ||V m+1 + r||2 2,n = ||Um||2 2,n ||Um+1||2,n 2γm jm < Xjm, ε >n = (γm jm)2 2γm jm < Xjm, ε >n . (35) By equation (34), it follows that: ||V m + r||2 2,n ||V m+1 + r||2 2,n (γm jm)2 2γm jm < Xjm, ε >n (γm jm)2 2|γm jm|λn CV,2 V m + r 2 2,n, when η is large enough where Luo, Spindler and Kueck is arbitrarily close to cφ when η is large enough. By the arguments in the proof of Lemma 21, when n is large enough, by (34) and r 2,n Cr ση V m 2,n, we have that: ση 2 V m + r 2 2,n 1 cφ ||V m||2 2,n k e N(m) (γk 1)2 C2 V,1 q(k 1)||V k 1 + r||2 2,n which corresponds to (14) in Lemma 21 considering V m +r 2 2,n instead of V m 2 2,n. Define cφ CV,2, cφ C2 V,1/ 1 Cr which can be arbitrarily close to 0 as η is large enough. Thus, following the proof of Lemma 7 and Lemma 8, we can treat cφ ψ as the constant cφ in Theorem 9. Therefore, results of Lemma 7 and Lemma 8 applies for V m + r with cφ replaced by cφ ψ. We conclude that V m + r 2 2,n V 0 + r 2 2,n C s s + m ζ (cφ) ψ , (36) for some arbitrarily small ψ > 0 as η is large enough. By using that V m + r 2,n V m 2,n r 2,n, we have that η Cr m + s V m + r 2,n for all m < em. It follows that η Cr 2 λ2 n( em 1 + s) e C V 0 + r 2 2,n ζ (cφ) ψ . (37) Therefore s log(p) n ||V 0 + r||2 2,n ζ (cφ) ψ +1 , where ψ can be arbitrarily close to 0 as m is large enough or equivalently, s log(2p/α) n||V 0 + r||2 2,n ! 1 1+ζ (cφ) ψ . By assumption, log(Mn/s) + ξ + 1 1 + ζ (cφ) s log(2p/α) n||V 0 + r||2 2,n High-Dimensional L2-Boosting for some ξ > 0. Thus, asymptotically, em = Mn (m + 1) < Mn, i.e., m + 1 < Mn. Thus, s log(2p/α) n||V 0 + r||2 2,n ! 1 1+ζ (cφ) ψ (38) ||V m +1||2 2,n η m + 1 + sλn ||V 0 + r|| 2 1+ζ (cφ) ψ 2,n s log(2p/α) 1+ζ (cφ) ψ , for any small ψ > 0 if η is large enough. Proof [Proof of Theorem 15] At the (m 1 + 1)th step, we have: ||Um 1+1||2 2,n > (1 cu log(2p/α)/n)||Um 1||2 2,n. It follows that (γm 1)2 < cu log(2p/α)/n||Um 1||2 2,n, while (γm)2 cu log(2p/α)/n||Um||2 2,n for all m < m 1. 
Consider the m defined in Lemma 13 as a reference point. Case (a): Suppose m 1 < m : By the proof of Lemma 13, ||V m + r||2 2,n is decreasing when m m 1 + 1 m . By Lemma 23, it holds ||Um 1||2 2,n σ2 n + (1 + ω)Crs + 2 m 1 + s cφ ||V m 1||2,n λn + V m 1 + r 2 2,n. It follows that (γm 1)2 < cu log(2p/α)/n||Um 1||2 2,n < (1 + ω)cuλ2 n + cu log(2p/α)/n (1 + ω)Crs + 2 m 1 + s cφ ||V m 1||2,n + cu log(2p/α)/n V m 1 + r 2 2,n. (39) By inequality (34), we have that |γm 1| CV,1 V m 1 + r 2,n p Luo, Spindler and Kueck for η large enough where V m 2,n η m + sλn holds for all m m , including m 1 by assumption that m 1 < m . Combining this with inequality (39), we have that: C2 V,1 V m 1 + r 2 2,n m 1 + s cuλ2 n(1 + ω) + 2cuλn log(2p/α) m 1 + s σ2 cφ V m 1 2,n + cu log(2p/α)/n V m 1 + r 2 2,n. (40) We show that this leads to a contradiction and hence m 1 m . First, by Lemma 13, we know that m + 1 + sλn ||V 0 + r|| 2 1+ζ (cφ) ψ 2,n 1+ζ (cφ) ψ 0 (41) as n is large enough since s log(2p/α) n 0 by assumption. As a result, for n large enough, we have that 1 3C2 V,1 V m 1 + r 2 2,n m 1 + s > cu log(2p/α)/n V m 1 + r 2 2,n (42) which corresponds to the third component in (40). Second, since V m 1 2,n η p m 1 + sλn by construction for η large enough, we have V m 1 + r 2 2,n 1 Cr 2 V m 1 2 2,n as shown in the proof of Lemma 13. Hence, for η being a large enough fixed constant and n large enough, we have that 1 3C2 V,1 V m 1 + r 2 2,n 3C2 V,1 V m 1 2 2,n V m 1 2,n 1 3C2 V,1 = 2 V m 1 2,ncu/(σ2 cφ)λ3 n p σ2 cφC2 V,1 2 /(6cuλ2 n) m 1 + s σ2 cφ V m 1 2,n (43) High-Dimensional L2-Boosting as σ2 cφC2 V,1 1 Cr 2ση 2 /(6cuλ2 n) . Third, for η large enough, we have that 1 3C2 V,1 V m 1 + r 2 2,n m 1 + s 1 3C2 V,1η2 1 Cr > 2cuλ2 n(1 + ω) > cuλ2 n(1 + ω) + 2cuλn log(2p/α) (1 + ω)Crs, (44) as 2cu log(2p/α) n s = 2cuλn/σ q s log(2p/α) n = o(λn) since s log(2p/α) n 0 for n large enough. Therefore, equations (42) (44) lead to a contradiction with (40). Case (b): We know that m 1 m : It follows that (γm jm)2 cu log(2p/α)/n||Um||2 2,n for all m < m 1. Since ||Um||2 2,n is a decreasing sequence, for δ small enough, there exists some m2 such that ||Um 2 ||2 2,n > (1 δ)σ2 n for any m m2, and ||Um2+1||2 2,n (1 δ)σ2 n. For δ small enough and m m2 m 1, it holds ||V m+1 + r||2 2,n ||V m + r||2 2,n = (γm jm)2 + 2γm jm < ε, Xjm >n (γm jm)2 + 2λn|γm jm| by Lemma 22. Since cu > 4, for δ small enough, it holds |γm jm|2 cu log(2p/α)/n||Um||2 2,n cu(1 δ)(1 ω)λ2 n > 4λ2 n and thus (γm jm)2+2λn|γm| < 0. Therefore, V m+r 2,n is strictly decreasing when m m2. Case (b.1): Suppose m 1 < m2: By same arguments as in the proof of Lemma 13, we have V m +1 + r 2,n p ||V 0 + r|| 1 1+ζ (cφ) ψ 2,n s log(2p/α) 2(1+ζ (cφ) ψ ) for any ψ > 0 since V m + r 2,n is strictly decreasing for m m2. As r 2,n s log(2p/α) 2 s log(2p/α) 2(1+ζ (cφ) ψ ) , it follows that ||V m 1+1||2,n ||V m 1+1 + r||2,n + r 2,n ||V m +1 + r||2,n + r 2,n p ||V 0 + r|| 1 1+ζ (cφ) ψ 2,n s log(2p/α) 2(1+ζ (cφ) ψ ) + s log(2p/α) 2(1+ζ (cφ) ψ ) p ||V 0 + r|| 1 1+ζ (cφ) ψ 2,n s log(2p/α) 2(1+ζ (cφ) ψ ) Luo, Spindler and Kueck which concludes the result. Case (b.2): Suppose m 1 m2: We show that this leads to a contradiction. First of all, we show that m2 m . We know that ||Um||2 2,n = σ2 n + ||V m + r||2 2,n + 2 < V m + r, ε >n for any m. Since ||Um2+1||2 2,n (1 δ)σ2 n, it holds that 2 < V m2+1 + r, ε >n +||V m2+1 + r||2 2,n δσ2 n. (45) Let s prove m2 m by contradiction. Suppose m2 < m , it follows that m2+1 m +1 < M0. 
Since we know that ||V m2+1||2,n > η m2 + 1 + sλn, it holds ||V m2+1 + r||2 2,n > 1 Cr 2 V m2+1 2 2,n. Also, | < r, ε >n | r 2,nσn 0 as n . Therefore, for η being a large enough fixed constant and n being large enough, we have that: ||V m2+1 + r||2 2,n + 2 < V m2+1, ε >n 2 ||V m2+1||2 2,n 2 m2 + 1 + s cφ λn V m2+1 2,n > 0 and 2 < r, ε >n> δσ2 n which leads to a contradiction with (45). Thus, it must hold that m2 m . Therefore, ||V m + r||2 2,n ||V m +1 + r||2 2,n 1 + Cr 2 η(m + s + 1)λ2 n for any m + 1 m m2 + 1. We also know that by assumption, (γm)2 cu log(2p/α)/n Um 2 2,n cu log(2p/α)/n(1 δ)σ2 n cu log(2p/α)/n(1 δ)(1 ω)σ2 cu(1 δ)(1 ω)λ2 n for any m m2 m 1. Since ||V m + r||2 2,n ||V m+1 + r||2 2,n = (γm)2 2γm < Xjm, ε >n cu1λ2 n > 0 for some constant cu1 > 0 if (1 δ)(1 ω)cu/4 > 4, it follows that ||V m + r||2 2,n ||V m+1 + r||2 2,n cu1λ2 n for m = m , m + 1, . . . , m2. Consequently, ||V m2+1 + r||2 2,n ||V m +1 + r||2 2,n. By assumption, at the (m2 + 1)th step, we know that ||Um2+1||2 2,n (1 δ)σ2 n. Hence, 2 < V m2+1 + r, ε >n + V m2+1 + r 2 2,n δσ2 n. (46) High-Dimensional L2-Boosting However, ||V m2+1 + r||2 2,n ||V m +1 + r||2 2,n, and therefore, by Lemma 13, 2| < V m2+1 + r, ε >n | ||V m2+1 + r||2,nσn ||V m +1 + r||2,nσn 0, which contradicts (46) when n is large enough. Therefore, Case (b.2) can not happen, and the proof is concluded. Proof [Proof of Theorem 17] WLOG., we can assume that CF log n is an integer, so that Fk is a positive integer for k = 1, 2, . . . . For some constant C > 0 that is large enough, define k := C s log n Ln . We first show that, if we run the algorithms forever, we have with probability 1 α: En h x i(βmk β0) 2i p s log(n) log(p) En[ βmk β0 2 2] p s log(n) log(p) Define ˆT m as the set of variables that are selected by the end of time period m. It is worth noting that | ˆT mk| k Lk with Lk = Ln for all k Z+ by construction. Define ˇmk := mk 1 + Lk. Therefore, | ˆT mk| k Ln C s log n, for all k k := C s log n Ln . As in Lemma 24, Pm,T denotes the operator of performing m periods of PBA algorithm subject to the subset T. Further, define Rk := (P ˆT ˇ mk PFk, ˆT ˇ mk)U ˇmk. By Lemma 24, we have that Rk 2 2,n = (1 PFk, ˆT ˇ mk)U ˇmk (1 P ˆT ˇ mk)U ˇmk 2 2,n P ˆT ˇ mk U ˇmk 2 2,n exp for Fk 2 cφ | ˆT ˇmk| log(n y 2,n) starting at U ˇmk. Note that Rk is a small term that could be neglected later with y 2,n = U0 2 2,n Um 2 2,n being a decreasing sequence in m. Note that U ˇmk = y x β ˇmk, Therefore, P ˆT ˇ mk U ˇmk = P ˆT ˇ mky in the projection of vector y on the columns of X with indices in T ˇmk. As a result, we have Umk = (I P ˆT ˇ mk)U ˇmk + Rk = (I P ˆT ˇ mk)y + Rk (47) with mk = ˇmk + Fk. Define e V k := V mk = X(β βmk) and e V k o = Xβ P ˆT ˇ mk Xβ = (I P ˆT ˇ mk)X ˆT ˇ mkβ ˆT ˇ mk. Therefore, e V k = Xβ (y Umk) = (r + ε) + (I P ˆT ˇ mk)y + Rk = (I P ˆT ˇ mk)Xβ P ˆT ˇ mk(r + ε) + Rk It implies that e V k = e V k o + Rk PT ˇ mk(r + ε). (48) By definition, e V k o 2 2,n is a decaying sequence in k as T ˇmk T ˇmk+1 holds for all k 0. We need the following proposition: Luo, Spindler and Kueck Proposition 27 Suppose all conditions in Theorem 17 hold. For Mn large enough, there exists an absolute constant Ck > 0 such that for some k Cks log n/Ln and k(Ln + 1) < Mn, we have that: En[ x i(βmk β) 2] p s log(n) log(p) βmk β 2 2 p s log(n) log(p) Proof [Proof of Proposition 27] Assume that k k with k = C s log n Ln . Define e T mk := T0\ ˆT mk. That said, e T mk is the set of variables that are in T0 but not yet being selected by time mk. 
Then, if e T mk = , then, all the variables in T0 are already selected. By definition, V mk = y Umk Xβ = y (I PFk, ˆT ˇ mk)U ˇmk X ˆT ˇ mkβ ˆT ˇ mk. It implies that: V mk 2 2,n = X ˆT ˇ mkβ ˆT ˇ mk y + (I PFk, ˆT ˇ mk)U ˇmk 2 2,n 2 X ˆT ˇ mkβ ˆT ˇ mk y + (I P ˆT ˇ mk)U ˇmk 2 2,n + 2 (P ˆT ˇ mk PFk, ˆT ˇ mk)U ˇmk 2 2,n 2 P ˆT ˇ mk(rn + ε) 2 2,n + 2 (P ˆT ˇ mk PFk, ˆT ˇ mk)U ˇmk 2 2,n (4Cr log(2p/α) + 4σ2C /cφ log(2p/α) log(n))s + 1 by Lemma 25 since ˆT ˇmk C s log(n) and by Lemma 24 since Fk > CF | ˆT ˇmk| log n for some CT large enough, which concludes the result. Now, suppose e T mk = for k = 0, 1, ..., k . Recall that Umk := (I P ˆT ˇ mk)y + Rk, with Rk 2,n 1 V mk = (I P ˆT ˇ mk)Xβ P ˆT ˇ mk(r + ε) + Rk = (I P ˆT ˇ mk)X e T ˇ mkβ e T ˇ mk. Suppose that V mk 2 2,n > CV s log(n) log(2p/α) n for all k Cks log(n) Ln , where Ck CM is a large enough constant. For CV large enough, we have that V m 2,n > q Crcφs log(2p/α) n . By (48) and Lemma 25, it follows that e V k o 2,n V mk 2,n P ˆT ˇ mk(r + ε) 2,n Rk 2,n CV s log(2p/α) log(n) 2Crs log(2p/α) + 2σ2/cφ| ˆT ˇmk| log(2p/α) CV /2s log(2p/α) log(n) given that CV > 2(2Cr + 2σ2/cφCk + 1) is large enough fixed constant where we used that | ˆT ˇmk| Lnk Cks log(n). Define ωn := c2 φ 4Ln C3 φ . Next, we divide our discussions into two High-Dimensional L2-Boosting Case (A1): P ˇmk t=mk 1+1 γ2 jt ωn e V k 1 o 2 2,n. By definition, we have that: Umk 2 2,n U ˇmk 2 2,n Umk 1 2 2,n t=mk 1+1 γ2 jt Umk 1 2 2,n ωn e V k 1 o 2 2,n. (51) Case (A2): P ˇmk t=mk 1+1 γ2 jt ωn e V k 1 o 2 2,n. Define δm,k := Pm t=mk 1+1 γjtejt, where ejt is the unit vector that is equal to 1 only at the jt-entry, and 0 otherwise, for m mk 1 and m ˇmk 1. It holds δm,k 2 2,n Ln t=mk 1+1 γ2 jt Lnωn e V k 1 o 2 2,n. (52) As a result, by definition, for any m = mk 1, ..., ˇmk 1, we have that: |γjm| β e T mk 1 1 < Um, X e T mk 1β e T mk 1 >n = < (I P ˆT mk 1)X e T mk 1β e T mk 1, X e T mk 1β e T mk 1 >n | {z } ΨV,1 + < (I P ˆT mk 1)(r + ε), X e T mk 1β e T mk 1 >n | {z } ΨV,2 + < Rk, X e T mk 1β e T mk 1 >n | {z } ΨV,3 + < Xδm,k, X e T mk 1β e T mk 1 >n | {z } ΨV,4 Um = Umk + Xδm,k = (I P ˆT mk 1 + Rk)Y + Xδm,k = (I P ˆT mk 1)X e T mk 1β e T mk 1 + (I P ˆT mk 1)(r + ε) + Rk + Xδm,k. For ΨV,1, we have that: ΨV,1 = (I P ˆT mk 1)X e T mk 1β e T mk 1 2 2,n. (53) There exists a δ Rp with supp(δ) ˆT mk 1 such that Xδ = P ˆT mk 1X e T mk 1β e T mk 1. By definition, e T mk 1 ˆT mk 1 = , we have that: ΨV,1 = Xδ X e T mk 1β e T mk 1 2 2,n cφ( β e T mk 1 2 + δ 2) cφ β e T mk 1 2 cφ Cφ X e T mk 1β e T mk 1 2. (54) By (50), we have that X e T mk 1β e T mk 1 2 2,n e V k 1 o 2 2,n CV s log(2p/α) log(n) Luo, Spindler and Kueck For ΨV,2, we have that: |ΨV,2| = | < (I P ˆT mk 1)r, X e T mk 1β e T mk 1 >n + < ε, X e T mk 1β e T mk 1 >n < P ˆT mk 1ε, X e T mk 1β e T mk 1 >n | r 2,n X e T mk 1β e T mk 1 2,n + β e T mk 1 | e T mk 1| 1 2 λn + cφ λn X e T mk 1β e T mk 1 2,n 4Cr CV log(n) + σ2Cφ CV log(n) + X e T mk 1β e T mk 1 2 2,n. (55) For ΨV,3, we have that: ΨV,3 Rk 2,n X e T mk 1β e T mk 1 2,n 2 sn log(2p/α) log(n)CV X e T mk 1β e T mk 1 2 2,n. (56) For ΨV,4, we have that: ΨV,4 Xδm,k 2,n X e T mk 1β e T mk 1 2,n p CφLnωn X e T mk 1β e T mk 1 2 2,n. (57) Combining (54), (55), (56) and (57), we have that: |γjm| β e T ˇ mk 1 1 cφ 4Cr CV log(n) σ2Cφ CV log(n) 2 sn log(2p/α) log(n)CV p CφLnωn X e T mk 1β e T mk 1 2 2,n 4Cφ X e T mk 1β e T mk 1 2 2,n, (58) given that CV > 17C2 φσ2Ck c3 φ and n being large enough. 
As a result, we have that: X e T mk 1β e T mk 1 2 2,n β e T mk 1 1 cφ X e T mk 1β e T mk 1 2 2,n s β e T mk 1 c3/2 φ 4Cφ s X e T mk 1β e T mk 1 2,n, (59) where the last inequality follows from the sparse eigenvalue condition and | e T mk 1| s. Therefore, we have that: m=mk 1+1 |γjm|2 Lnc3 φ 16C2 φs X e T mk 1β e T mk 1 2 2,n Lnc3 φ 16C2 φs e V k 1 o 2 2,n. (60) Combining both cases (A1) and (A2), we have that: Umk 2 2,n Umk 1 2 2,n κn e V k 1 o 2 2,n (61) High-Dimensional L2-Boosting where κn := min c2 φ 4Ln C3 φ , Lnc3 φ 16C2 φs s with κ := min c2 φ 4K2 LC3 φ , c3 φ 16C2 φ given that Ln KL s with KL being a fixed constant, and hence, κ > 0 is a fixed constant. Define q := s Ln as the smallest integer s Ln . WLOG., we can assume that q = s Ln for simplicity, i.e., s Ln is an integer. By applying (61) multiple times, we have that: Umk+q 2 2,n Umk 2 2,n κLn l=k e V l o 2 2,n. (62) We again divide our analysis into two cases: Assume that k k + q Cs log(n) Case (B1): e V k+q 1 o 2 2,n > 1 2 e V k o 2 2,n. It implies that e V l o 2 2,n 1 2 e V k o 2 2,n for all l = k, . . . , k + q 1 as e V l o 2 2,n = (I P ˆT ˇ ml)Xβ 2 2,n is a decreasing sequence in l. It follows that Umk+q 2 2,n Umk 2 2,n κLn l=k e V l o 2 2,n Umk 2 2,n κq Ln 2s e V k o 2 2,n = Umk 2 2,n κ 2 e V k o 2 2,n. Recall that by (47), we have: Umk = e V k o + (I P ˆT ˇ mk)(r + ε) + Rk. It implies that e V k+q o 2 2,n 1 κ e V k o 2 2,n + 2 < e V k o , (I P ˆT ˇ mk)(r + ε) >n 2 < e V k+q o , (I P ˆT ˇ mk+q )(r + ε) >n + (I P ˆT ˇ mk)(r + ε) 2 2,n (I P ˆT ˇ mk+q )(r + ε) 2 2,n + Rk 2 2,n Rk+q 2 2,n + 2 < e V k o + (I P ˆT ˇ mk)(r + ε), Rk >n 2 < e V k+q o + (I P ˆT ˇ mk+q )(r + ε), Rk+q >n . By (55), replacing k 1 by k, for n large enough, we have 2 < e V k o , (I P ˆT ˇ mk)(r + ε) >n 2 < e V k+q o , (I P ˆT ˇ mk+q )(r + ε) >n 4Cr CV log(n) + σ2Cφ CV log(n) + X e T mkβ e T mk 2 2,n cφCV e V k o 2 2,n κ 12 e V k o 2 2,n (63) Luo, Spindler and Kueck given that CV > 64 36 2C2 φσ2Ck c3 φκ2 . Here we used that e V k o := (1 P ˆT mk)Xβ = (1 P ˆT mk)X e T mkβ e T mk as e T mk = T0\ ˆT mk. Again, we can find a δ with supp(δ) ˆT mk such that Xδ = P ˆT mk X e T mkβ e T mk. Hence, the supports of δ and β e T mk have no intersection and we conclude e V k o 2 2,n cφ( δ 2 + β e T mk 2) cφ β e T mk 2 cφ Cφ X e T mkβ e T mk 2 2,n. By Lemma 25 and (50), for n large enough, we have that: (I P ˆT ˇ mk)(r + ε) 2 2,n (I P ˆT ˇ mk+q )(r + ε) 2 2,n (64) = P ˆT ˇ mk(r + ε) 2 2,n + P ˆT ˇ mk+q (r + ε) 2 2,n Crs log(2p/α) + σ2/cφ log(2p/α)| ˆT ˇmk+q| cφCV e V k o 2 2,n κ 12 e V k o 2 2,n (65) given that CV 60Ckσ2 cφκ . And for n large enough, for fixed CV , by (50), we have that: Rk 2 2,n Rk+q 2 2,n + 2 < e V k o + (I P ˆT ˇ mk)(r + ε), Rk >n 2 < e V k+q o + (I P ˆT ˇ mk+q )(r + ε), Rk+q >n n e V k o 2,n + 4 n r + ε 2,n κ 12 e V k n 2 2,n. (66) Combining (63), (65) and (66), we have that: e V k+q o 2 2,n 1 κ e V k o 2 2,n. (67) Hence, we have shown exponential decay of e V k o 2 2,n in Case (B1). Case (B2): e V k+q o 2 2,n 1 2 e V k o 2 2,n. This automatically proves exponential decay. Note that κ 1 4 as Cφ cφ. Combining Case (B1) and Case (B2), we have that e V k+q o 2 2,n 1 κ e V k o 2 2,n. Therefore, for any kq Cks log(n), we have that e V kq o 2 2,n e V 0 o 2 2,n 1 κ Since q = s Ln , let k = Ck Ln log(n) . WLOG., assume that Ck Ln log(n) is an integer so that k = Ckn log(n). 
For n large enough and Ck 4K Lnκ, we have that e V kq o 2 2,n n Ck Ln ln(1 κ 4 ) e V 0 o 2 2,n n Ck Lnκ/4n K 1 < CV s log(n) log(p) High-Dimensional L2-Boosting given that e V 0 o 2 2,n n K 1 for some fixed constant K > 0. This leads to a contradictory to (50). Therefore, for given constant Ck 4K Lnκ, there must exists k Cks log(n) Ln such that V mk 2 2,n CV s log(n) log(2p/α) which concludes our proof. By construction of Algorithm 2, we have that Um+1 2 2,n = Um 2 2,n γ2 jm for all non-negative integers m. Therefore, Um+1 2 2,n Um 2 2,n = 1 γ2 jm Um 2 2,n . (71) It implies that the algorithm will not stop if γ2 jm Um 2 2,n Cu log(2p/α) Suppose that the algorithm does not stop when k C s log n Ln k M := CMs log n Ln with CM defined in Assumption 1. If C > Ck, by Proposition 27, it holds En h x i(βmk β) 2i p s log(n) log(p) for some k Cks log n/Ln. By Proposition 27, there also exists a positive constant CV such that e V k 2,n CV s log(2p/α) log(n) with k C s log n/Ln. It implies that | ˆT ˇmk | = | ˆT mk | C s log n. Therefore, by Lemma 25, we have that (I P ˆT ˇ mk )Xβ 2,n e V k 2,n + P ˆT ˇ mk (r + ε) 2,n + Rk 2,n 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 r s log n log(2p/α) It implies that for any k k and k k M, we have that e V k 2,n (I P ˆT ˇ mk)Xβ 2,n + P ˆT ˇ mk(r + ε) 2,n + Rk 2,n (I P ˆT ˇ mk )Xβ 2,n + 2(Crs log(2p/α) + kσ2/cφ log(2p/α)s log n) Ck + 1 + 2(Cr + kσ2/cφ) 1 2 r s log n log(2p/α) Luo, Spindler and Kueck where C k := (C 1 2 V + (2Cr + σ2/cφC ) 1 2 + 1). By definition, with s log n log(2p/α) n 0, for n large enough, for arbitrarily small η > 0, we have that Umk 2,n = (ε + r) + e V k 2,n (ε + r) 2,n e V k 2,n p (1 2η)σ2 (75) and Umk 2,n = (ε + r) + e V k 2,n (ε + r) 2,n + e V k 2,n p (1 + 2η)σ2. (76) By definition, since Umk 2,n is a decreasing sequence, we have that: γ2 jm CU log(2p/α) n Um 2 2,n CU(1 2η)σ2 log(2p/α) As a result, we have that: Umk 2 2,n Umk 2 2,n X mk m mk 1, lm=1 γ2 jm Umk 2 2,n (k k )Ln CU(1 2η)σ2 log(2p/α) Therefore, it implies that for any k k , we have: Umk 2 2,n Umk 2 2,n = (I P ˆT ˇ mk)y + Rk 2 2,n (I P ˆT ˇ mk )y + Rk 2 2,n = (I P ˆT ˇ mk)y 2 2,n (I P ˆT ˇ mk )y 2 2,n | {z } Ψ1 + 2 < (I P ˆT ˇ mk)y, Rk >n < (I P ˆT ˇ mk )y, Rk >n | {z } Ψ2 + Rk 2 2,n Rk 2 2,n | {z } Ψ3 For Ψ1, we have that: Ψ1 = (I P ˆT ˇ mk)y 2 2,n (I P ˆT ˇ mk )y 2 2,n = (I P ˆT ˇ mk)Xβ 2 2,n (I P ˆT ˇ mk )Xβ 2 2,n | {z } Ψ11 2 < (P ˆT ˇ mk P ˆT ˇ mk )Xβ, (r + ε)) >n | {z } Ψ12 + P ˆT ˇ mk (r + ε) 2 2,n P ˆT ˇ mk(r + ε) 2 2,n | {z } Ψ13 For Ψ11, by (74), we have that: Ψ11 (I P ˆT ˇ mk )Xβ 2 2,n C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) High-Dimensional L2-Boosting For Ψ12, we know that (P ˆT ˇ mk P ˆT ˇ mk )Xβ = (I P ˆT ˇ mk )Xβ (I P ˆT ˇ mk)Xβ. It is worth noting that (I P ˆT ˇ mk)Xβ is a linear combination of columns of X with indices in Tk := ˆT ˇmk T0 where T0 := supp(β). Therefore, we can find a ζk Rp with supp(ζk) Tk, with Xζk = (I P ˆT ˇ mk)Xβ. 
Since (I P ˆT ˇ mk)Xβ 2 2,n (I P ˆT ˇ mk )Xβ 2 2,n 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) we have that Xζk 2 2,n C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) By sparse eigenvalue condition in Assumption A.1, it implies that 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) Ψ12 = 2 < Xζk Xζk , (r + ε) >n = 2 < Xζk Xζk , r >n +2 < ζk ζk , XTk ε >n 2( Xζk 2,n + Xζk 2,n) r 2,n 2( ζk 2,n + ζk 2,n)|Tk | 1 2 λn 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 s log(n) 1 2 log(2p/α) 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 s log(n) log(2p/α) For Ψ13, by Lemma 25, we have that: Ψ13 P ˆT ˇ mk(r + ε) 2 2,n 2 Crs log(2p/α) + σ2/cφ log(2p/α)k Ln Combining (79), (81) and (82), we have that: Ψ1 CΨ1 s log n log(2p/α) k Ln log(2p/α) where CΨ1 := (4 Cr + 2Cr + 4σ C / cφ + (C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1))(C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1) is a fixed constant. For Ψ2 and Ψ3, since Rk , Rk 1 n, and by (76), we have that for n large enough: Ψ2 4(1 + 2η)σ Luo, Spindler and Kueck Therefore, we have that: Umk 2 2,n Umk 2 2,n CΨ1 s log n log(2p/α) k Ln log(2p/α) n 4(1 + 2η)σ Recall that in equation (77), we have that Umk 2 2,n Umk 2 2,n (k k )Ln CU(1 2η)σ2 log(2pα) Therefore, combining (85) and (86), we have that (k k )Ln CU(1 2η)σ2 log(2pα) CΨ1 s log n log(2p/α) k Ln log(2p/α) n + 4(1 + 2η)σ and by assumption that CU > 4 cφ , for small enough η > 0, it holds CU(1 2η) 4 Since k C s log n Ln , it implies that for n large enough, we have that k k CU(1 2η) + CΨ1s log n/Ln + 1 cφ C s log n where C := C CU(1 2η)+CΨ1+1 cφ s log n/Ln. By assumption that C CM, all the analysis in the above that uses the sparse eigenvalue condition applies. Therefore, the algorithm must stop before mk with k C . Therefore, when we stop at m , we have that CU log(2p/α) CU log(2p/α) n ( V m 2,n + (r + ε) 2,n). (87) By Lemma 26, we have that cφ (| ˆT m | + s) V m 2,n Crcφs log 2p/α n(| ˆT m | + s) λn. It implies that CU (| ˆT m | + s) log(2p/α) CU (| ˆT m | + s) log(2p/α) n (2 + 2η)σ + Crcφs log(2p/α) (| ˆT m | + s)λn. Since | ˆT m | C s log(n) and s log(n) log(p) n 0, we have that for n large enough, s log(n) log(2p/α) CUC (2 + 2η)σ + p High-Dimensional L2-Boosting for C C + 1. It follows that V m 2 2,n 2 CUC (2 + 2η)σ + p Crcφ + σC 2 s log(n) log(2p/α) CUC (2 + 2η)σ + p Crcφ + σC 2 s log(n) log(2p/α) follows by sparse eigenvalue condition which concludes the proof. Appendix D. Discussion of the equi-correlated design In the equi-correlated design, we assume to have a set of predictors X1, . . . , Xp, such that Covn(Xi, Xj) = corrn(Xi, Xj) = ρ (0, 1) for i = j, where Covn represents the empirical covariance and corrn the empirical correlation, respectively. By assumption, all Xj, j = 1, 2, . . . , p, are standardized with mean zero and variance one. For the case Cov(Xi, Xj) = ρ, the results are similar and can be well approximated by the case Covn(Xi, Xj) = ρ. To analyze the revisiting behavior of the PGA algorithm in model (3) in the main text, it is sufficient to consider Xαm = X(β βm), which is the approximation error at the mth boosting step. It is worth noting that Covn(Xαm, Xj) = X i T m αm i ρ, if j / T m, (88) Covn(Xαm, Xj) = X i T m αm i ρ + αm j (1 ρ), if j T m, (89) where T m := T supp(βm) with T := supp(β). WLOG., one can assume that P i T m αm i 0. We divide our analysis into two cases: Case (a): If P i T m αm i > 0, there must be a j T m such that αm j > 0. Due to Equations (88) and (89), this implies |Covn(Xαm, Xj)| > |Covn(Xαm, Xl)| for all l / ˆT m since αm j (1 ρ) > 0 and P i T m αm i ρ > 0. 
Therefore, the algorithm must select one predictor from $T^m$. Case (b): If $\sum_{i \in T^m} \alpha^m_i = 0$, the current approximation error $X\alpha^m$ has correlation 0 with all variables $j$ that are not in $T^m$, see Equation (88). As a result, the algorithm must either (b1) select a predictor in $T^m$, if not all $\alpha^m_i$, $i \in T^m$, are zero, or (b2) stop, since all $\alpha^m_i$ are zero. In summary, the algorithm always selects a predictor from the existing set $T^m$ unless it stops because the approximation error is already zero. Therefore, $T = T^0 = T^1 = \cdots = T^m$. Hence, we have 100 percent revisiting, indicating that $\zeta^*(c_\phi) \to \infty$.

Appendix E. Applications

In this section, we present applications from different fields to illustrate the boosting algorithms. The applications demonstrate how the methods behave when applied to real data sets and allow a comparison with related methods, in particular Lasso. The focus is on prediction, which is an important task in many applications.

E.1 Riboflavin production

This application involves genetic data and analyzes the production of riboflavin. First, we describe the data set; then we present the results.

E.1.1 Data set

The data set has been provided by DSM (Kaiseraugst, Switzerland) and was made publicly available for academic research in Bühlmann et al. (2014) (Supplemental Material). The real-valued response/dependent variable is the logarithm of the riboflavin production rate. The covariables measure the logarithm of the expression levels of 4,088 genes (p = 4,088), which are normalized. This means that the covariables are standardized to have variance 1, and the dependent variable and the regressors are de-meaned, which is equivalent to including an unpenalized intercept. The data set consists of n = 71 observations which were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains and strains grown under different fermentation conditions were analyzed. For further details we refer to Bühlmann et al. (2014), their Supplemental Material, and the references therein.

E.1.2 Results

We analyze the data set on the production of riboflavin (vitamin B2) as follows. We split the data set randomly into a training set and a testing set, estimate the model with the different methods on the training set, and then use the testing set to calculate out-of-sample mean squared errors (MSE) in order to evaluate predictive accuracy. The size of the training set was 60; the remaining 11 observations were used for forecasting. The table below shows the MSE for the methods discussed in the previous sections.

Table 11: Results Riboflavin Production (out-of-sample MSE)

BA-our    oBA-our    Lasso     p-Lasso
0.3641    0.1080     0.1687    0.1539

All calculations were performed in R (R Core Team (2014)) with the package hdm (Chernozhukov et al. (2015)) and our own code. Replication files are available upon request. The results show, again, that orthogonal L2-Boosting outperforms Lasso and post-Lasso in this application.

E.2 Predicting Test Scores

E.2.1 Data Set

Here, the task is to predict the final score in the subjects Mathematics and Portuguese in secondary education. This is relevant, e.g., for identifying students who need additional support to master the material. The data contain both student grades and demographic, social, and school-related features and were collected using school reports and questionnaires.
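Both applications follow the same protocol: each estimator is fit on a training portion of the data and compared by out-of-sample MSE on held-out observations (a random train/test split for the riboflavin data, five-fold cross-validation for the test-score data). The following R sketch illustrates this protocol on simulated data, with a simple componentwise L2-Boosting routine stopped by a rule of the form analyzed in the proofs above (stop as soon as (γ^m)^2 falls below c_u log(2p/α)/n times ||U^m||^2_{2,n}) and Lasso fitted via glmnet. It is a schematic stand-in rather than our actual replication code; the data-generating process and all constants (e.g., c_u = 4.5) are illustrative choices.

```r
## Schematic out-of-sample comparison (ours): L2-Boosting with early stopping vs. Lasso.
library(glmnet)
set.seed(1)
n <- 100; p <- 500; s <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, s), rep(0, p - s))
y <- as.vector(X %*% beta + rnorm(n))
train <- 1:70; test <- 71:100

l2boost <- function(X, y, cu = 4.5, alpha = 0.05, max_steps = 500) {
  n <- nrow(X); p <- ncol(X)
  sdx <- sqrt(colMeans(X^2))              # ||X_j||_{2,n}
  Xs  <- sweep(X, 2, sdx, "/")            # columns normalized to ||X_j||_{2,n} = 1
  U   <- y; b <- numeric(p)
  thr <- cu * log(2 * p / alpha) / n      # c_u log(2p/alpha)/n, c_u illustrative
  for (m in 1:max_steps) {
    g <- as.vector(crossprod(Xs, U)) / n  # gamma_j^m = <U^m, X_j>_n
    j <- which.max(abs(g))                # greedy selection
    if (g[j]^2 < thr * mean(U^2)) break   # early stopping rule
    b[j] <- b[j] + g[j]                   # update selected coefficient
    U <- U - g[j] * Xs[, j]               # update residual
  }
  b / sdx                                 # coefficients on the original scale of X
}

b_boost   <- l2boost(X[train, ], y[train])
mse_boost <- mean((y[test] - X[test, ] %*% b_boost)^2)

cv_lasso  <- cv.glmnet(X[train, ], y[train])
mse_lasso <- mean((y[test] - predict(cv_lasso, newx = X[test, ], s = "lambda.min"))^2)

round(c(L2Boosting = mse_boost, Lasso = mse_lasso), 3)
```

For the real applications, the simulated X and y would simply be replaced by the training and test samples of the respective data sets.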
Two data sets are provided, one for each of the two subjects: Mathematics and Portuguese. The data set is available at the UCI Machine Learning Repository and was contributed by Paulo Cortez. The main reference for the data set is Cortez and Silva (2008).

E.2.2 Results

We employed five-fold cross-validation to evaluate predictive performance on this data set; the results remain stable when choosing a different number of folds. The data sets contain, for both test results, 33 variables, which are used as predictors. The data set for the Mathematics test scores contains 395 observations; the sample size for Portuguese is 649. The results confirm our theoretical derivations that boosting is comparable to Lasso.

Table 12: Prediction of education (out-of-sample MSE)

subject        BA-our    oBA-our    Lasso    p-Lasso
Mathematics    19.1      19.3       18.4     18.4
Portuguese     8.0       7.9        7.8      7.8

References

Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A. De Vore. Approximation and learning by greedy algorithms. Ann. Statist., 36(1):64–94, 02 2008. doi: 10.1214/009053607000000631. URL http://dx.doi.org/10.1214/009053607000000631.

A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012. ISSN 1468-0262. doi: 10.3982/ECTA9626. URL http://dx.doi.org/10.3982/ECTA9626.

Alexandre Belloni and Victor Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 05 2013. doi: 10.3150/11-BEJ410. URL http://dx.doi.org/10.3150/11-BEJ410.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference for high-dimensional sparse econometric models. Advances in Economics and Econometrics, 10th World Congress of Econometric Society, August 2010, III:245–295, 2010. ArXiv, 2011.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 11 2013. ISSN 0034-6527. doi: 10.1093/restud/rdt044. URL https://doi.org/10.1093/restud/rdt044.

Alexandre Belloni, Victor Chernozhukov, and Ying Wei. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics, 34(4):606–619, 2016. ISSN 0735-0015. URL http://www.jstor.org/stable/44166593.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. ISSN 0885-6125. doi: 10.1007/BF00058655. URL http://dx.doi.org/10.1007/BF00058655.

Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–824, 1998.

Peter Bühlmann. Boosting for high-dimensional linear models. The Annals of Statistics, 34(2):559–583, 2006.

Peter Bühlmann and Torsten Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477–505, 2007. doi: 10.1214/07-STS242. URL http://dx.doi.org/10.1214/07-STS242. With discussion.

Peter Bühlmann and Bin Yu. Boosting with the L2 Loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003. ISSN 0162-1459. URL http://www.jstor.org/stable/30045243.

Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.
doi: 10.1146/annurev-statistics-022513-115545. URL http://dx.doi.org/10.1146/annurev-statistics-022513-115545.

Emmanuel Candès and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.

Victor Chernozhukov, Christian Hansen, and Martin Spindler. hdm: High-Dimensional Metrics, 2015. R package version 0.1.

Paulo Cortez and Alice Maria Gonçalves Silva. Using data mining to predict secondary school student performance. 2008.

Victor H. de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized Processes. Probability and its Applications (New York). Springer-Verlag, Berlin, 2009. ISBN 978-3-540-85635-1. Limit theory and statistical applications.

R. A. De Vore and V. N. Temlyakov. Some remarks on greedy algorithms. Advances in Computational Mathematics, 5(1):173–187, 1996. ISSN 1572-9044. doi: 10.1007/BF02124742. URL http://dx.doi.org/10.1007/BF02124742.

Robert M. Freund, Paul Grigas, and Rahul Mazumder. A new perspective on boosting in linear regression via subgradient optimization and relatives. Annals of Statistics, 2016.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28:337–407, 2000. doi: 10.1214/aos/1016218223. URL http://projecteuclid.org/euclid.aos/1016218223. With discussion.

Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001. ISSN 0090-5364. URL http://www.jstor.org/stable/2699986.

Ching-Kang Ing. Model selection for high-dimensional linear regression with dependent observations. The Annals of Statistics, 48(4):1959–1980, 2020.

Ching-Kang Ing and Tze Leung Lai. A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica, 21(4):1473–1513, 2011.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01.

S. V. Konyagin and V. N. Temlyakov. Rate of convergence of pure greedy algorithm. East J. Approx., 5(4):493–499, 1999. ISSN 1310-6236.

Jannis Kueck, Ye Luo, Martin Spindler, and Zigan Wang. Estimation and inference of treatment effects with L2-boosting in high-dimensional settings. Journal of Econometrics, 234(2):714–731, 2023.

Tze Leung Lai and Hongsong Yuan. Stochastic Approximation: From Statistical Origin to Big-Data, Multidisciplinary Applications. Statistical Science, 36(2):291–302, 2021. doi: 10.1214/20-STS784. URL https://doi.org/10.1214/20-STS784.

E. D. Livshitz and V. N. Temlyakov. Two lower estimates in greedy approximation. Constructive Approximation, 19(4):509–523, 2003. ISSN 1432-0940. doi: 10.1007/s00365-003-0533-6. URL http://dx.doi.org/10.1007/s00365-003-0533-6.

S. G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. Trans. Sig. Proc., 41(12):3397–3415, Dec 1993. doi: 10.1109/78.258082. URL http://dx.doi.org/10.1109/78.258082.

Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, 27(4):538–557, 2012. doi: 10.1214/12-STS400. URL https://doi.org/10.1214/12-STS400.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.

Bernhard Stankewitz. Early stopping for L2-boosting in high-dimensional linear models. The Annals of Statistics, 52(2):491–518, 2024. doi: 10.1214/24-AOS2356. URL https://doi.org/10.1214/24-AOS2356.

Vladimir Temlyakov. Greedy Approximation. Cambridge University Press, New York, NY, USA, 1st edition, 2011. ISBN 1107003377, 9781107003378.

V. N. Temlyakov. Weak greedy algorithms. Advances in Computational Mathematics, 12(2):213–227, 2000. ISSN 1572-9044. doi: 10.1023/A:1018917218956. URL http://dx.doi.org/10.1023/A:1018917218956.

Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electron. J. Statist., 3:1360–1392, 2009. doi: 10.1214/09-EJS506. URL http://dx.doi.org/10.1214/09-EJS506.

T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33:1538–1579, 2005. doi: 10.1214/009053605000000255. URL http://projecteuclid.org/euclid.aos/1123250222.