Journal of Machine Learning Research 26 (2025) 1-54 Submitted 7/21; Revised 3/25; Published 3/25

High-Dimensional L2-Boosting: Rate of Convergence

Ye Luo kurtluo@hku.hk
Hong Kong University Business School, The University of Hong Kong, Hong Kong

Martin Spindler martin.spindler@uni-hamburg.de
Institute for Statistics, University of Hamburg, Germany

Jannis Kueck kueck@dice.hhu.de
Düsseldorf Institute for Competition Economics, Heinrich Heine University Düsseldorf, Germany

Editor: Corinna Cortes

Abstract

Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of L2-Boosting in a high-dimensional setting under early stopping. We close a gap in the literature and provide the rate of convergence of L2-Boosting in a high-dimensional setting under approximate sparsity and without a beta-min condition. We also show that the rate of convergence of classical L2-Boosting depends on the design matrix, as described by a sparse eigenvalue condition. To show the latter results, we derive new, improved approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of L2-Boosting. These results might be of independent interest. Moreover, we introduce so-called restricted L2-Boosting. The restricted L2-Boosting algorithm sticks to the set of previously chosen variables, exploits the information contained in these variables first, and only occasionally allows new variables to be added to this set. We derive the rate of convergence for restricted L2-Boosting under early stopping, which is close to the convergence rate of Lasso in an approximately sparse, high-dimensional setting without a beta-min condition. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Finally, we present simulation studies to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, L2-Boosting clearly outperforms Lasso. An empirical illustration and the proofs are contained in the Appendix.

Keywords: Boosting, L2-Boosting, restricted L2-Boosting, early stopping, high-dimensional regression, rate of convergence, approximate sparsity, approximation theory

©2025 Luo, Spindler, Kueck. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/21-0725.html.

1. Introduction

In this paper we consider L2-Boosting algorithms for regression, which are coordinatewise greedy algorithms that estimate the target function under L2 loss. Boosting algorithms represent one of the major advances in machine learning and statistics in recent years. Freund and Schapire's AdaBoost algorithm for classification (Freund and Schapire (1997)) has attracted much attention in the machine learning community as well as in statistics. Many variants of the AdaBoost algorithm have been introduced and proven to be very competitive in terms of prediction accuracy in a variety of applications, with a strong resistance to overfitting. Boosting methods were originally proposed as ensemble methods, which rely on the principle of generating multiple predictions and majority voting (averaging) among the individual classifiers (Bühlmann and Hothorn (2007)).
An important step in the analysis of boosting algorithms was Breiman's interpretation of boosting as a gradient descent algorithm in function space, inspired by numerical optimization and statistical estimation (Breiman (1996), Breiman (1998)). Building on this insight, Friedman et al. (2000) and Friedman (2001) embedded boosting algorithms into the framework of statistical estimation and additive basis expansion. This also enabled the application of boosting to regression analysis. Boosting for regression was proposed by Friedman (2001), and Bühlmann and Yu (2003) then defined and introduced L2-Boosting. An extensive overview of the development of boosting and its manifold applications is given in the survey of Bühlmann and Hothorn (2007).

In the high-dimensional setting there are two important but unsolved problems concerning L2-Boosting. First, the convergence rate of L2-Boosting, in particular under early stopping, has not been thoroughly analyzed. Second, the pattern of the variables selected at each step of L2-Boosting is unknown. In this paper, we show that these two problems are closely related. We establish results on the sequence of variables that are selected by L2-Boosting. We call a step of L2-Boosting revisiting if the variable chosen in this step has already been selected in previous steps. We analyze the revisiting behavior of L2-Boosting, i.e., how often L2-Boosting revisits. We then utilize these results to derive an upper bound on the rate of convergence of L2-Boosting.1 We show that the frequency of revisiting, as well as the convergence speed of L2-Boosting, depend on the structure of the design matrix, namely on the minimal and maximal restricted eigenvalue. Moreover, we introduce the so-called restricted L2-Boosting, another variant of the classical boosting algorithm. The restricted L2-Boosting algorithm sticks to the set of previously chosen variables, exploits the information contained in these variables first, and only occasionally allows new variables to be added to this set. We show that its convergence rate is close to that of Lasso and achieves the same rate as orthogonal L2-Boosting, which was derived in Kueck et al. (2023), but without the need for orthogonal projections, which are costly to calculate. It should be noted that all results are shown in an approximately sparse, high-dimensional setting without a beta-min condition. We also introduce feasible, data-driven rules for early stopping for both algorithms, which can be easily implemented and used in applied work.

1. Without analyzing the sequence of variables selected at each step of L2-Boosting, only much weaker results on the convergence speed of L2-Boosting are available, based on DeVore and Temlyakov (1996) and Livshitz and Temlyakov (2003).

Compared to Lasso, boosting uses a somewhat unusual penalization scheme. The penalization is done by early stopping to avoid overfitting in the high-dimensional case. In the low-dimensional case, L2-Boosting without stopping converges to the ordinary least squares (OLS) solution. In a high-dimensional setting, early stopping is key for avoiding overfitting and for the predictive performance of boosting. We give new stopping rules that are simple to implement and also work very well in practical settings, as demonstrated in the simulation studies. We prove that such a stopping rule achieves the best bound obtained in our theoretical results.
In a deterministic setting, that is, when there is no noise or error term in the model, boosting methods are also known as greedy algorithms (the pure greedy algorithm (PGA) and the orthogonal greedy algorithm (OGA)). In signal processing, L2-Boosting is essentially the same as the matching pursuit algorithm of Mallat and Zhang (1993). We will employ the abbreviations BA (L2-Boosting algorithm), oBA (orthogonal L2-Boosting algorithm) and resBA (restricted L2-Boosting algorithm) for the stochastic versions we analyze. The rate of convergence of greedy algorithms has been analyzed in DeVore and Temlyakov (1996) and Livshitz and Temlyakov (2003). Temlyakov (2011) is an excellent survey of recent results on the theory of greedy approximation. To the best of our knowledge, we establish, under an additional assumption on the design matrix, the first results on revisiting in the deterministic setting, and we greatly improve the existing results of DeVore and Temlyakov (1996). These results, which are available in the appendix, are essential for our analysis of L2-Boosting, but might also be of interest in their own right.

As mentioned above, boosting for regression was introduced by Friedman (2001). L2-Boosting was defined in Bühlmann and Yu (2003). Its numerical convergence, consistency, and statistical rates of convergence under early stopping in a low-dimensional setting were obtained in Zhang and Yu (2005). Consistency in prediction norm of L2-Boosting in a high-dimensional setting was first proved in Bühlmann (2006). The numerical convergence properties of boosting in a low-dimensional setting are analyzed in Freund et al. (2016). The orthogonal boosting algorithm in a statistical setting under different assumptions is analyzed in Ing and Lai (2011). Rates for the PGA and OGA are obtained in Barron et al. (2008). For results on orthogonal boosting and modifications, we refer to the excellent survey by Lai and Yuan (2021).

In this paper we consider linear basis functions. Classification and regression trees, and the widely used neural networks, involve non-linear basis functions. We hope that our results can serve as a starting point for the analysis of non-linear basis functions, which is left for future research.

The structure of this paper is as follows: In Section 2, the L2-Boosting algorithm (BA/PGA) is defined together with its modifications, the restricted L2-Boosting algorithm (resBA/resPGA) and the orthogonalized version (oBA), which serves as a reference. In Section 3, we present a new approximation result for the pure greedy algorithm (PGA) and an analysis of the revisiting behavior of the boosting algorithm. In Section 4, we provide the main results of our analysis, namely an analysis of the boosting algorithm and some of its variants. The proofs, together with some details of the new approximation theory for the PGA, are provided in the appendix. Section 5 contains a simulation study that offers some insights into the methods and also provides guidance on stopping rules in applications. Additional empirical examples can be found in the appendix. Section 6 provides concluding remarks.

Notation: Let $z$ and $y$ be $n$-dimensional vectors. Define $\|z\|$ to be the Euclidean norm and $\|z\|_{2,n} := \sqrt{E_n[z^2]}$ the empirical $L_2$-norm, where $E_n[z] = \frac{1}{n}\sum_{i=1}^n z_i$. Define $\langle \cdot, \cdot \rangle_n$ to be the empirical inner product, $\langle z, y \rangle_n = \frac{1}{n}\sum_{i=1}^n z_i y_i$. For a random variable $X$, $E[X]$ denotes its expectation.
The correlation between the random variables $X$ and $Y$ is denoted by $\mathrm{corr}(X, Y)$. We use the notation $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to mean $a \leq cb$ for some constant $c > 0$ that does not depend on $n$, and $a \lesssim_P b$ to mean $a = O_P(b)$. For a vector $\beta \in \mathbb{R}^p$, $\mathrm{supp}(\beta)$ denotes the set of indices of the non-zero elements of $\beta$. Further, given a set of indices $T \subset \{1, \dots, p\}$, we denote by $\beta_T$ the vector with $\beta_{T,j} = \beta_j$ if $j \in T$ and $\beta_{T,j} = 0$ if $j \notin T$.

2. L2-Boosting with componentwise least squares

To define the boosting algorithm for linear models, we consider the following regression setting:
$$y_i = x_i'\beta + r_i + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1)$$
with vector $x_i = (x_{i,1}, \dots, x_{i,p_n})$ consisting of $p_n$ predictor variables, $\beta$ a $p_n$-dimensional coefficient vector, and a random, mean-zero error term $\varepsilon_i$ with $E[\varepsilon_i|x_i] = 0$. We allow the dimension of the predictors $p_n$ to grow with the sample size $n$ and to be even larger than the sample size, i.e., $\dim(\beta) = p_n > n$. We impose an approximate sparsity condition. This means that there is a large set of potential variables, but the number of relevant variables, which can grow with the sample size and is denoted by $s_n$, is small compared to the sample size, i.e., $\|\beta\|_0 = s_n < n$. The random variable $r_i$ denotes the approximation error of the sparse model. In the following, we will drop the dependence of $s_n$ and $p_n$ on the sample size in the notation and write $s$ and $p$ if no confusion arises. $X$ denotes the $n \times p$ design matrix whose rows are the single observations $x_i$. $X_j$ denotes the $j$th column of the design matrix and $x_{i,j}$ the $j$th component of the vector $x_i$. We consider a fixed design for the regressors. We assume that the regressors are standardized with mean zero and variance one, i.e., $E_n[x_{i,j}] = 0$ and $E_n[x_{i,j}^2] = 1$ for $j = 1, \dots, p$.

The basic principle of boosting can be described as follows. We follow the interpretation of Breiman (1998) and Friedman (2001) of boosting as a functional gradient descent optimization (minimization) method. The goal is to minimize a loss function, e.g., an $L_2$ loss or the negative log-likelihood function of a model, by an iterative optimization scheme. In each step, the negative gradient, which is used to update the current solution, is modeled and estimated by a parametric or nonparametric statistical model, the so-called base learner. The fitted gradient is used to update the solution of the optimization problem. A strength of boosting, besides the fact that it can be used for different loss functions, is its flexibility with regard to the base learners. This procedure is repeated until some stopping criterion is met.

The literature has developed many different forms of boosting algorithms. In this paper, we consider L2-Boosting with componentwise linear least squares, as well as two variants. All three are designed for regression analysis. "L2" refers to the loss function, which is the sum of squared residuals, $Q_n(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2$, typical in regression analysis. In this case, the gradient equals the residuals. "Componentwise linear least squares" refers to the base learners. We fit the gradient (i.e., the residuals) against each regressor ($p$ univariate regressions) and select the predictor/variable that correlates most highly with the gradient/residual, i.e., decreases the loss function most, and then update the estimator in this direction. We next update the residuals and repeat the procedure until some stopping criterion is met.
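As a concrete illustration of a single componentwise least-squares step, the following sketch (in R, which is also the language used for the simulations in Section 5) fits the current residuals against each regressor, picks the best-fitting one, and updates that coordinate. The toy design, sample sizes and variable names are ours and purely illustrative, not part of the formal setup.

```r
# Illustrative sketch of one componentwise least-squares step (toy data,
# not the authors' code): fit the current residuals against each regressor,
# pick the best-fitting one, and update that single coordinate.
set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))              # standardized regressors
beta_true <- c(5, 2, 1, rep(0, p - 3))
y <- as.vector(X %*% beta_true + rnorm(n, sd = 2))

beta_hat <- rep(0, p)
U <- y - as.vector(X %*% beta_hat)                   # residuals = negative gradient
gamma <- as.vector(crossprod(X, U)) / colSums(X^2)   # p univariate LS coefficients
j_star <- which.max(abs(gamma))                      # predictor most correlated with U
beta_hat[j_star] <- beta_hat[j_star] + gamma[j_star] # update this coordinate only
```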
In this paper, we focus on classical L2-Boosting, which was introduced in Friedman (2001) and refined in Bühlmann and Yu (2003) for regression analysis, and two modifications: restricted L2-Boosting and orthogonal L2-Boosting. Further, we also provide some results for post-L2-Boosting, which is defined in Remark 1. As far as we know, post-L2-Boosting has not yet been defined and analyzed in the literature.

2.1 L2-Boosting

For L2-Boosting with componentwise least squares, the algorithm is given below.

Algorithm 1 (L2-Boosting (BA/PGA))
(1) Start/Initialization: $\beta^0 = 0$ ($p$-dimensional vector), set the maximum number of iterations $m_{stop}$ and set the iteration index $m$ to 0.
(2) At the $(m+1)$th step, calculate the residuals $U_i^m = y_i - x_i'\beta^m$.
(3) For each predictor variable $j = 1, \dots, p$, calculate
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j \rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j_m$ that is most correlated with the residuals,2 i.e., the one attaining $\max_{1 \leq j \leq p} |\gamma_j^m|$.
(4) Update the estimator: $\beta^{m+1} := \beta^m + \gamma_{j_m}^m e_{j_m}$, where $e_{j_m}$ is the $j_m$th index vector.
(5) Increase $m$ by one. If $m < m_{stop}$, continue with (2); otherwise stop.

2. Equivalently, the one which fits the gradient best in an $L_2$-sense.

For simplicity, write $\gamma^m$ for the value of $\gamma_{j_m}^m$ at the $m$th step. The act of stopping is crucial for boosting algorithms, as stopping too late or never stopping leads to overfitting, and therefore some kind of penalization is required. A suitable solution is to stop early, i.e., before overfitting takes place. Early stopping can be interpreted as a form of penalization. Similar to Lasso, early stopping might induce a bias through shrinkage. A potential way to decrease the bias is post-L2-Boosting, which is defined in Remark 1 below. In general, during the run of the boosting algorithm, it is possible that the same variable is selected at different steps, which means the variable is revisited. This revisiting behavior is key to the analysis of the rate of convergence of L2-Boosting. In the next section, we will analyze the revisiting properties of boosting in more detail.

Remark 1 Post-L2-Boosting is a variant of L2-Boosting, namely a post-model selection estimator that applies ordinary least squares (OLS) to the model selected by L2-Boosting in the first step. To define this estimator formally, we make the following definitions: $T := \mathrm{supp}(\beta)$ and $\hat{T} := \mathrm{supp}(\beta^{m^*})$, the support of the true model and the support of the model estimated by L2-Boosting as described above with stopping at $m^*$. A superscript $C$ denotes the complement of the set with regard to $\{1, \dots, p\}$. In the context of Lasso, OLS after model selection was analyzed in Belloni and Chernozhukov (2013). Given the above definitions, the post-model selection estimator or OLS post-L2-Boosting estimator takes the form
$$\arg\min_{\beta \in \mathbb{R}^p} \Big\{ Q_n(\beta) : \beta_j = 0 \text{ for each } j \in \hat{T}^C \Big\}. \qquad (2)$$
In this paper, we will not focus on post-L2-Boosting.

2.2 Restricted L2-Boosting

In this section, we introduce the so-called restricted L2-Boosting algorithm. The motivation is that the variable selection behavior of the pure greedy algorithm, introduced in the section before, is challenging to analyze. Related to this, and even more important, the pure greedy algorithm also has a provably slow rate of convergence in general, driven by challenging singular cases, as highlighted by the work of Temlyakov (2011). These cases result in hard-to-handle variable selection patterns. In general, boosting first selects the relevant variables and then the irrelevant variables.
But the overall pattern is difficult to analyze, because the selection of relevant and irrelevant variables can alternate, in particular at an advanced stage of the algorithm when the correlation of the relevant variables with the remainder is small due to previous extraction. The restricted L2-Boosting algorithm sticks to the set of already chosen variables for some time and exploits the information contained in these variables, i.e., extracts the correlation of these variables with the remainder, until new variables are selected and added to the consideration set. The algorithm is given in the following way:

Algorithm 2 (Restricted L2-Boosting (resBA/resPGA))
(1) Start/Initialization: $\beta^0 = 0$ ($p$-dimensional vector) and set the iteration index $m$ to 0.
(2) At the $(m+1)$th step, calculate the residuals $U_i^m = y_i - x_i'\beta^m$. Define the set of variables selected so far as $\hat{T}^m$.
(3) Set $T := \{1, \dots, p\}$ if $l_m = 0$ and $T := \hat{T}^m$ if $l_m = 1$, where $l_m \in \{0, 1\}$ is a sequence of integers indexed by $m$.3
(4) For each predictor variable $j \in T$, calculate
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j \rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j_m$ that is most correlated with the residuals, i.e., the one attaining $\max_{j \in T} |\gamma_j^m|$.
(5) Update the estimator: $\beta^{m+1} := \beta^m + \gamma_{j_m}^m e_{j_m}$, where $e_{j_m}$ is the $j_m$th index vector.
(6) Stopping criterion: stop at $m^*$, defined as the first $m$ with $\|U^{m+1}\|_{2,n}^2 / \|U^m\|_{2,n}^2 > 1 - C_U \log(2p/\alpha)/n$ when $l_m = 0$, where $C_U$ and $\alpha$ are defined in Theorem 17.

3. This criterion essentially restricts the algorithm to the set of already selected variables.

Remark 2 A related version of the restricted L2-Boosting algorithm is the iterated post-L2-Boosting algorithm, where at certain iterations during the algorithm a projection step is performed on the already selected variables (Kueck et al. (2023)). By projecting the residuals on the selected variables, all correlation/information of these variables is taken out. The restricted L2-Boosting algorithm avoids these projection steps, which are computationally expensive.

2.3 Orthogonal L2-Boosting

A variant of the L2-Boosting algorithm is orthogonal boosting (oBA), or the orthogonal greedy algorithm in its deterministic version. Only the updating step is changed: after each selection step, an orthogonal projection of the response variable is conducted on all the variables which have been selected up to this point. The advantage of this method is that any variable is selected at most once in this procedure, while in the previous version the same variable might be selected at different steps, which makes the analysis far more complicated. More formally, the method can be described as follows by modifying step (4) in Algorithm 1:

Algorithm 3 (Orthogonal L2-Boosting (oBA))
(1) Start/Initialization: $\beta^0 = 0$ ($p$-dimensional vector), set the maximum number of iterations $m_{stop}$ and set the iteration index $m$ to 0.
(2) At the $(m+1)$th step, calculate the residuals $U_i^m = y_i - x_i'\beta^m$.
(3) For each predictor variable $j = 1, \dots, p$, calculate
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j \rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j_m$ that is most correlated with the residuals, i.e., the one attaining $\max_{1 \leq j \leq p} |\gamma_j^m|$.
(4) Update the estimator: $\beta^{m+1} = (X_{\hat{T}^{m+1}}' X_{\hat{T}^{m+1}})^{-1} X_{\hat{T}^{m+1}}' y$, i.e., the OLS fit of $y$ on the variables $\hat{T}^{m+1}$ selected up to this step.
(5) Increase $m$ by one. If $m < m_{stop}$, continue with (2); otherwise stop.

Remark 3 Orthogonal L2-Boosting (and also post-L2-Boosting) requires, to be well-defined, that the number of selected variables be smaller than the sample size. This is enforced by our stopping rule, as we will see later.
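The only difference between Algorithms 1 and 3 is the update step. The following sketch (again R, toy data and variable names of our own choosing) contrasts the two: instead of moving a single coordinate, the orthogonal variant refits OLS on all variables selected so far.

```r
# Illustrative sketch (not the authors' implementation) of the orthogonal
# update: after each selection, y is projected on all selected columns.
set.seed(1)
n <- 100; p <- 50; mstop <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X[, 1:3] %*% c(5, 2, 1) + rnorm(n, sd = 2))

selected <- integer(0)
U <- y
for (m in seq_len(mstop)) {
  gamma <- as.vector(crossprod(X, U)) / colSums(X^2)
  j_star <- which.max(abs(gamma))
  selected <- union(selected, j_star)              # each variable enters at most once
  fit <- lm.fit(X[, selected, drop = FALSE], y)    # orthogonal projection of y
  U <- y - as.vector(X[, selected, drop = FALSE] %*% fit$coefficients)
}
```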
2.4 Comparison

We have introduced different variants of L2-Boosting algorithms. Among them, L2-Boosting is the most widely used algorithm in empirical applications. As we will see later, its rate of convergence is slower than the Lasso rate, as there are cases for which the convergence, even in the noiseless case, is too slow. In signal processing and related fields, the noiseless version of orthogonal boosting ("orthogonal matching pursuit") is also very popular. It has been shown that for orthogonal L2-Boosting fast convergence rates with a feasible, data-driven stopping criterion in high dimensions can be obtained even in the case with noise (see Kueck et al. (2023) without a beta-min condition and Stankewitz (2024) under stronger assumptions such as a beta-min condition and normality). Restricted L2-Boosting can be considered a middle ground between the two. It only uses boosting steps (without orthogonal projection) but, as we will show, it achieves the same rate of convergence as orthogonal L2-Boosting. By restricting to the set of already selected variables for some steps, the information of these variables is partialled out of the target variable $y$, and in the limit this converges to an orthogonal projection step, as used in orthogonal L2-Boosting. Orthogonal L2-Boosting might be interpreted as post-L2-Boosting where the refit takes place after each step. By only selectively/occasionally adding new variables, and using only standard boosting steps, the rate of convergence is improved compared to L2-Boosting. Restricted L2-Boosting can be considered an approximation to orthogonal boosting, but without orthogonal projections. Therefore, it offers advantageous computational properties, as orthogonal projections are costly to calculate. The algorithm only employs the default boosting steps, which are based on univariate regressions and are fast and easy to compute. Hence, the proposed restricted L2-Boosting algorithm has excellent theoretical properties and is computationally superior to orthogonal L2-Boosting. It is also computationally superior to iterated L2-Boosting, where projection steps are imposed from time to time, since projections can be dispensed with altogether.

3. New Approximation Results for the Pure Greedy Algorithm

In approximation theory, a key question is how fast functions can be approximated by greedy algorithms. Approximation theory is concerned with deterministic settings, i.e., the case without noise:
$$y_i = x_i'\beta, \quad i = 1, \dots, n. \qquad (3)$$
Nevertheless, to derive rates for the L2-Boosting algorithm in a stochastic setting, the corresponding results for the deterministic part play a key role. For example, the results in Bühlmann (2006) are limited by the result used from approximation theory, namely the rate of convergence of weak relaxed greedy algorithms derived in Temlyakov (2000). For the pure greedy algorithm, DeVore and Temlyakov (1996) establish a rate of convergence of $m^{-1/6}$ in the $\ell_2$ norm, where $m$ denotes the number of steps iterated in the PGA. This rate was improved to $m^{-11/62}$ in Konyagin and Temlyakov (1999), but Livshitz and Temlyakov (2003) established a lower bound of $m^{-0.27}$. The class of functions $\mathcal{F}$ considered in those papers is determined by general dictionaries $\mathcal{D}$ and given by
$$\Big\{ f \in H : f = \sum_{k \in \Lambda} c_k w_k, \; w_k \in \mathcal{D}, \; |\Lambda| < \infty, \; \sum_{k \in \Lambda} |c_k| \leq M \Big\},$$
where $M$ is some constant, $H$ denotes a Hilbert space, and the sequence $(c_k)$ are the coefficients with regard to the dictionary $\mathcal{D}$.
In this section, we discuss the approximation bound of the pure greedy algorithm, where we impose an additional assumption on the Gram matrix $E_n[x_i x_i']$ that is widely used in high-dimensional statistics in order to tighten the bounds. We provide a new lemma on the revisiting behavior of the pure greedy algorithm and a new approximation result, which is the core of this section. The proofs for this section and a detailed analysis of the revisiting behavior of the algorithm are deferred to Appendix A and Appendix B.

Let us define the restricted eigenvalue assumption, which is also commonly used in the analysis of Lasso. To do this, consider
$$\Sigma(s, M) := \{A : \dim(A) \leq s \times s, \ A \text{ is a diagonal (principal) submatrix of } M\}$$
for any square matrix $M$.

Definition 4 The smallest and largest restricted eigenvalues are defined as
$$\phi_s(s, M) := \min_{W \in \Sigma(s, M)} \phi_s(W) \quad \text{and} \quad \phi_l(s, M) := \max_{W \in \Sigma(s, M)} \phi_l(W),$$
where $\phi_s(W)$ and $\phi_l(W)$ denote the smallest and largest eigenvalues of the matrix $W$.

Assumption A.1 (Sparse Eigenvalue (SE)) Consider the Gram matrix $\Sigma = E_n[x_i x_i']$. Assume that all elements on the diagonal of $\Sigma$ are equal to one. We assume that there exist positive constants $c_\phi \leq 1$ and $C_\phi > 1$ such that
$$0 < c_\phi \leq \phi_s(s', \Sigma) \leq \phi_l(s', \Sigma) \leq C_\phi < \infty$$
holds for $s' \leq M_n$, where $M_n$ is a sequence with $M_n \to \infty$ as $n \to \infty$ and $M_n \geq C_M s \log(n)$ for a large enough fixed constant $C_M$.

Remark 5 This condition is a variant of the so-called "sparse eigenvalue condition", which is used for the analysis of the Lasso estimator. A detailed discussion of this condition is given in Belloni et al. (2010). Similar conditions, such as the restricted isometry condition or the restricted eigenvalue condition, have been used for the analysis of the Dantzig selector (Candes and Tao (2007)) and the Lasso estimator (Bickel et al. (2009)). An extensive overview of different conditions on matrices and how they are related is given by van de Geer and Bühlmann (2009). Assuming that $\phi_s(m, E_n[x_i x_i']) > 0$ requires that all empirical Gram submatrices formed by any $m$ components of $x_i$ are positive definite. It is well known that condition SE is fulfilled for many designs of interest.
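The restricted eigenvalues of Definition 4 involve extrema over all principal submatrices of a given dimension, which is combinatorial to compute exactly. The sketch below (R, with an assumed random-subsampling shortcut and illustrative constants of our own) shows one crude way to approximate $c_\phi$ and $C_\phi$ in practice.

```r
# Illustrative sketch: approximate phi_s(s, Sigma) and phi_l(s, Sigma) of
# Definition 4 by sampling random s x s principal submatrices of the
# empirical Gram matrix (the exact extrema are combinatorial to compute).
set.seed(1)
n <- 200; p <- 100; s <- 5
X <- scale(matrix(rnorm(n * p), n, p))
Sigma <- crossprod(X) / n            # empirical Gram matrix, unit diagonal (approx.)

restricted_eigen <- function(Sigma, s, n_draws = 2000) {
  ev <- replicate(n_draws, {
    idx <- sample(ncol(Sigma), s)
    range(eigen(Sigma[idx, idx], symmetric = TRUE, only.values = TRUE)$values)
  })
  c(phi_s = min(ev[1, ]), phi_l = max(ev[2, ]))   # rough stand-ins for c_phi, C_phi
}
restricted_eigen(Sigma, s)
```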
Define $V^m = X\alpha^m$ as the residual of the PGA. Here, $\alpha^m$ is defined as the difference between the true parameter vector $\beta$ and the approximation at the $m$th step, $\beta^m$, i.e., $\alpha^m = \beta - \beta^m$. We would like to explore how fast $V^m$ converges to 0. In our notation, $\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 - (\gamma^m)^2$, so $\|V^m\|_{2,n}^2$ is non-increasing in $m$. As described in Algorithm 1, the sequence of variables selected by the PGA is denoted by $j_0, j_1, \dots$. Define $T^m := T \cup \{j_0, j_1, \dots, j_{m-1}\}$ with $T := \mathrm{supp}(\beta)$. Define $q(m) := |T^m|$ as the cardinality of $T^m$, $m = 0, 1, \dots$. It is obvious that $q(0) = s$ and $q(m) \leq m + s$, where $s = \|\beta\|_0$ denotes the number of relevant regressors. It is essential to understand how the PGA revisits the set of already selected variables. To analyze the revisiting behavior of the PGA, some definitions are needed to fix ideas.

Definition 6 We say that the PGA is revisiting at the $m$th step if and only if $j_{m-1} \in T^{m-1}$. We define the sequence of labels $\mathcal{A} := \{A_1, A_2, \dots\}$ with each entry $A_i$ labelled either R (revisiting) or N (non-revisiting).

Lemma 7 Assume that Assumption A.1 holds with $m < M_n$. Consider the sequence of steps $1, 2, \dots, m$ with $\|V^m\|_{2,n}^2 > 0$. Denote $\mu_a(c) = 1 - (1 + \frac{1}{c})^{-1/c}$ for any $c > 0$. Then, for any $\delta > 0$, the number of Rs in the sequence $\mathcal{A}$ at step $m$, denoted $R(m)$, must satisfy
$$|R(m)| \geq \frac{1 - (1+\delta)\mu_a(c_\phi)}{2 - (1+\delta)\mu_a(c_\phi)}\, m - \frac{(1+\delta)\mu_a(c_\phi)}{2 - (1+\delta)\mu_a(c_\phi)}\, q(0).$$

The lower bound stated in Lemma 7 has room for improvement: e.g., when $c_\phi = 1$, $|R(m)|/m = 1$, as shown in Lemma 20 in Appendix A, while Lemma 7 yields 1/2 as the lower bound on $|R(m)|/m$ as $m$ becomes large enough. Deriving tight bounds is an interesting question for future research. More detailed properties of the revisiting behavior of L2-Boosting are provided in Appendix A.

With an estimated bound for the proportion of Rs in the sequence $\mathcal{A}$, we are now able to derive an upper bound for $\|V^m\|_{2,n}^2$. By Lemma 7, define $n_k^*(m) := \frac{m + \mu_a(c_\phi) q(k)}{2 - \mu_a(c_\phi)}$, where $n_k^*(m)(1+\delta)$ is an upper bound of $|q(m+k) - q(k)|$ as $q(m)$ is large enough, for any $\delta > 0$. Before we state the main result of this section, we present another auxiliary lemma.

Lemma 8 Assume that Assumption A.1 holds. Consider the steps numbered $k+1, \dots, k+m$. Assume that $m + k < M_n$ and let $m = \lambda q(k)$ for a constant $\lambda > 0$. Define
$$\zeta(c, \lambda) := \frac{c\big((1-\mu_a(c))\lambda - \mu_a(c)\big)}{(2+\lambda)\,\log\frac{2+\lambda}{2-\mu_a(c)}} + c$$
for all $\lambda \geq \frac{\mu_a(c)}{1-\mu_a(c)}$ and $c > 0$. Then, for any arbitrarily small $\delta > 0$ and $q(k)$ large enough, there exists an arbitrarily small $\delta' > 0$ such that the following statement holds:
$$\frac{\|V^{m+k}\|_{2,n}^2}{\|V^{k}\|_{2,n}^2} \leq \left(\frac{q(k)}{q(k) + (1+\delta)\, n_k^*(m)}\right)^{\zeta(c_\phi, \lambda) - \delta'}.$$

Based on Lemma 8, we are able to develop our main results on the approximation theory of the pure greedy algorithm under $L_2$ loss.

Theorem 9 (Approximation Theory of PGA based on revisiting) Assume that Assumption A.1 holds. Define $\zeta^*(c) := \max_{\lambda \geq \mu_a(c)/(1-\mu_a(c))} \zeta(c, \lambda)$ as a function of $c$. Then, for any $\kappa > 0$ and $m < M_n$, there exists a fixed constant $C > 0$ such that
$$\frac{\|V^m\|_{2,n}^2}{\|V^0\|_{2,n}^2} \leq C \left(\frac{s}{m+s}\right)^{\zeta^*(c_\phi) - \kappa}$$
for $m$ large enough.

Remark 10 Our results stated in Theorem 9 depend on the lower bound of $|R(m)|/m$, which is the proportion of the Rs in the first $m$ terms of the sequence $\mathcal{A}$. We conjecture that the convergence rate of the PGA is close to exponential as $c \to 0$. Denote the actual proportion of R in the sequence $\mathcal{A}$ by $\psi(c)$, i.e., $|R(m)| \geq \psi(c)m - \psi_1(c)q(0)$, where $\psi(c), \psi_1(c)$ are some constants depending on $c$. If $\psi(c) \to 1$, it is easy to show, based on the proof of Theorem 9, that $\|V^m\|_{2,n}^2 \leq C\|V^0\|_{2,n}^2 \big(\frac{s}{m+s}\big)^{\zeta}$ for any arbitrarily large $\zeta$. In general, further improvements in the convergence rate of the PGA can be achieved by improving the lower bounds of $|R(m)|/m$. Table 1 gives, for different values of the SE constant $c_\phi$, the corresponding values of $\zeta^*(c_\phi)$. The convergence rate of the PGA, and hence of L2-Boosting, is affected by the frequency of revisiting. Since different values of $c_\phi$ impose different lower bounds on the frequency of revisiting, different values of $c_\phi$ imply a different convergence rate of the process in our framework.

Table 1: Relation between $c_\phi$ and $\zeta^*(c_\phi)$

c_phi   zeta*(c_phi)
1.0     1.19
0.9     1.04
0.8     0.89
0.7     0.76
0.6     0.63
0.5     0.51
0.4     0.40

Remark 11 As already mentioned, the function $\zeta^*(c)$ defined in Theorem 9 is used to provide a general lower bound on the revisiting behavior, which affects the rate of convergence of boosting algorithms. In some special cases, the PGA always selects a predictor from the true set $T$, indicating that $\zeta^*(c_\phi) \to \infty$, and therefore a stronger revisiting behavior can be shown. This is, for example, the case in a (near) orthogonal design, e.g., when $X_j$, $j = 1, \dots, p$, are i.i.d. standard normally distributed, or in an equi-correlated design, where $-1 < \mathrm{corr}(X_i, X_j) = \rho < 1$ for all $i \neq j$. A formal discussion of the equi-correlated design with $\rho > 0$ is given in Appendix D. In both cases, we achieve the Lasso rate of convergence. A small numerical check of the revisiting frequency of the PGA is sketched below.
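The following sketch (R, toy design and constants of our own choosing) runs the PGA on a noiseless model and records how often the selected variable already lies in $T^{m-1}$, i.e., the empirical frequency of R-steps in the sense of Definition 6.

```r
# Illustrative sketch: count revisiting steps (Definition 6) of the PGA on a
# noiseless model y = X beta.  A step is "R" if the selected variable already
# lies in T^{m-1}, the union of the true support and all previously selected
# variables; otherwise it is "N".
set.seed(1)
n <- 200; p <- 100; s <- 5; msteps <- 100
X <- scale(matrix(rnorm(n * p), n, p))
beta <- c(rep(1, s), rep(0, p - s))
y <- as.vector(X %*% beta)                       # deterministic model (3)

beta_hat <- rep(0, p)
T_m <- which(beta != 0)                          # T^0 = supp(beta), q(0) = s
revisits <- 0
for (m in seq_len(msteps)) {
  V <- y - as.vector(X %*% beta_hat)             # residual V^m of the PGA
  gamma <- as.vector(crossprod(X, V)) / colSums(X^2)
  j <- which.max(abs(gamma))
  if (j %in% T_m) revisits <- revisits + 1 else T_m <- c(T_m, j)
  beta_hat[j] <- beta_hat[j] + gamma[j]
}
revisits / msteps                                # empirical proportion of R-steps
```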
Under more general designs, however, greedy algorithms can only achieve a slower rate of convergence (see Temlyakov (2011)), preventing one from achieving the Lasso rate of convergence, as we discuss in more detail in Section 4.

4. Main Results

In this section, we provide the main results of our paper for L2-Boosting (BA) and restricted L2-Boosting (resBA), which were introduced in Section 2. To do this, we reconsider the high-dimensional approximately sparse regression model in (1):
$$y_i = x_i'\beta + r_i + \varepsilon_i, \quad i = 1, \dots, n,$$
with vector $x_i = (x_{i,1}, \dots, x_{i,p})$ consisting of $p$ predictor variables, $\beta$ a $p$-dimensional coefficient vector, and a random, mean-zero error term $\varepsilon_i$ with $E[\varepsilon_i|x_i] = 0$. The random variable $r_i$ denotes the approximation error of the exact sparse model. In Assumption A.2, we provide the explicit conditions of the approximately sparse regression model, as, for example, also considered in Belloni et al. (2013) and Belloni et al. (2016). Another sparsity assumption used in the literature is, e.g., weak sparsity (Negahban et al. (2012); Ing (2020)), i.e., $\sum_{j=1}^p |\beta_j|^q \leq R_n$ with $0 < q \leq 1$, but in this paper we focus on the approximately sparse regression model.4 In this section, the following assumptions are employed.

4. The extension of our results to different sparsity assumptions is left for future work.

Assumption A.2 (Approximate Sparsity)
(ii) $\|r\|_{2,n} \leq \sqrt{C_r s \log(2p/\alpha)/n}$ for some generic constant $C_r > 0$ and $\alpha > 0$.
(iii) There exists a constant $K \geq 1$ such that $\|\beta\|_2 \leq n^{K-1}$.

Assumption A.3 (Error term) For any $\alpha$ small enough, with probability $1 - \alpha$, we have
$$\max_{1 \leq j \leq p} |E_n[x_{i,j}\varepsilon_i]| \leq \sigma\sqrt{\frac{\log(2p/\alpha)}{n}}.$$
In addition, we require that $\hat\sigma^2 := E_n[\varepsilon_i^2]$ satisfies $|\hat\sigma^2 - \sigma^2| \leq \omega\sigma^2$ for some small enough constant $\omega \in (0, 1)$.

Remark 12 The previous assumption is implied, for example, if the error terms are i.i.d. $N(0, \sigma^2)$ random variables. This in turn can be generalized/weakened to cases of non-normality by self-normalized random vector theory (de la Peña et al. (2009)) or the approach introduced in Chernozhukov et al. (2014).

4.1 L2-Boosting with Componentwise Least Squares

First, we analyze the classical L2-Boosting algorithm with componentwise least squares. For this purpose, the approximation results derived in the previous section are key. While in the previous section the stochastic component was absent, in this section it is explicitly considered. The following definitions will be helpful for the analysis: $U^m$ denotes the residuals at the $m$th iteration, $U^m = y - X\beta^m = V^m + r + \varepsilon$. Here, $\beta^m$ is the estimator at the $m$th iteration. Again, we define the difference between the true and the estimated vector as $\alpha^m := \beta - \beta^m$. The prediction error is given by $V^m = X\alpha^m$. For boosting algorithms in the high-dimensional setting, it is essential to determine when to stop, i.e., the stopping criterion. In the low-dimensional case, the stopping time is not important: the value of the objective function decreases and the solution converges to the traditional OLS solution exponentially fast, as described in Bühlmann and Yu (2003). In the high-dimensional case, such fast convergence rates are usually not available: the residual $\varepsilon$ can be explained by $n$ linearly independent variables $X_j$. Thus, selecting more terms only leads to overfitting. Early stopping is comparable to the penalization in Lasso, which prevents one from choosing too many variables and hence overfitting. Similarly to Lasso, an (approximately) sparse structure will be needed for the analysis.
At each step, we minimize $\|U^m\|_{2,n}^2$ along the most greedy variable $X_{j_m}$. The following lemma establishes the main result on the convergence rate of L2-Boosting.

Lemma 13 Suppose Assumptions A.1-A.3 hold and $s\log(p)/n \to 0$. Assume $M_n$ is large enough, i.e.,
$$\log(M_n/s) \geq \Big(\xi + \frac{1}{1+\zeta^*(c_\phi)}\Big) \log\Big(\frac{n\|V^0\|_{2,n}^2}{s\log(2p/\alpha)}\Big)$$
for some $\xi > 0$. Let $m^* + 1$ be the first time that $\|V^m\|_{2,n} \leq \eta\sqrt{m+s}\,\lambda_n$, where $\eta$ is a constant large enough. Then, for any $\delta > 0$, with probability $1 - \alpha$,
(1) it holds that
$$m^* \lesssim s \Big(\frac{n\|V^0 + r\|_{2,n}^2}{s\log(2p/\alpha)}\Big)^{\frac{1}{1+\zeta^*(c_\phi)-\delta}} \quad \text{and} \quad m^* < M_n; \qquad (4)$$
(2) the prediction error $\|V^{m^*+1}\|_{2,n}$ satisfies
$$\|V^{m^*+1}\|_{2,n}^2 \lesssim \|V^0 + r\|_{2,n}^{\frac{2}{1+\zeta^*(c_\phi)-\delta}} \Big(\frac{s\log(2p/\alpha)}{n}\Big)^{\frac{\zeta^*(c_\phi)-\delta}{1+\zeta^*(c_\phi)-\delta}}. \qquad (5)$$

Remark 14 Lemma 13 shows that the convergence rate of L2-Boosting depends on the value of $c_\phi$. For different values of $c_\phi$, the lower bound on the proportion of revisiting ("R") steps in the sequence $\mathcal{A}$ is different. Such lower bounds on the frequency of revisiting naturally determine the upper bound for the deterministic component, which affects our results on the rate of convergence of L2-Boosting. As $\zeta^*(c_\phi) \to \infty$, statement (5) implies the usual Lasso rate of convergence.

The bound on the approximation error $\|V^m\|_{2,n}^2$ stated in inequality (5) is obtained under an infeasible stopping criterion. Below we establish another result with the same convergence rate but with a feasible stopping criterion that can be implemented in empirical studies.

Theorem 15 Suppose all conditions stated in Lemma 13 hold. Let $c_u > 4$ be a constant. Let $m_1^* + 1$ be the first time that
$$\|U^m\|_{2,n}^2 / \|U^{m-1}\|_{2,n}^2 > 1 - c_u\log(2p/\alpha)/n.$$
Then, with probability at least $1 - \alpha$,
$$\|V^{m_1^*+1}\|_{2,n}^2 \lesssim \|V^0\|_{2,n}^{\frac{2}{1+\zeta^*(c_\phi)-\delta}} \Big(\frac{s\log(2p/\alpha)}{n}\Big)^{\frac{\zeta^*(c_\phi)-\delta}{1+\zeta^*(c_\phi)-\delta}}$$
for any $\delta > 0$.

Remark 16 As we have already seen in the deterministic case, the rate of convergence depends on the constant $c_\phi$. Table 2 shows, for different values of $c_\phi$, the corresponding rate exponents when $\delta$ is set to zero. Hence, the rates can be interpreted as upper bounds.

Table 2: Relation between $c$ and $\zeta^*(c)/(1+\zeta^*(c))$

c     rate
1.0   0.54
0.9   0.51
0.8   0.47
0.7   0.43
0.6   0.39
0.5   0.34
0.4   0.29

4.2 Restricted L2-Boosting

The following theorem establishes the main result on the convergence rate of restricted L2-Boosting.

Theorem 17 Suppose Assumptions A.1-A.3 hold. Define a sequence of positive integers $m_k$ as follows: for $k = 1, 2, \dots$, define $L_k$ as a positive integer (WLOG we let $L_k = L_n$ be a constant) and define $F_k = C_F |\hat{T}^{m_{k-1}}|\log(n)$ for some absolute constant $C_F$ large enough,5 with $m_k := m_{k-1} + L_k + F_k$, where $m_0 := 0$. As suggested by restricted L2-Boosting, Algorithm 2 runs on the full set of variables when $l_m = 0$ and on the selected variables when $l_m = 1$. To this end, consider a sequence of indices $l_m \in \{0, 1\}$, $m = 1, 2, \dots$, with $l_m = 0$ for $m = m_k + 1, \dots, m_k + L_k$, and $l_m = 1$ for $m = m_k + L_k + 1, \dots, m_{k+1}$, $k \geq 0$. Suppose that $L_n < K_L s$ for some generic positive constant $K_L > 0$. For $C_U > 4/c_\phi$ defined in Algorithm 2, with probability $1 - \alpha$, we have
$$E_n\Big[\big(x_i'(\beta^{m^*} - \beta)\big)^2\Big] \lesssim \frac{s\log(n)\log(2p/\alpha)}{n}.$$

5. For example, $F_n$ can be chosen as $\ln^{\xi}(n)$ for some $\xi > 2$ for $n$ large enough.

Remark 18 The bound on the prediction error of restricted L2-Boosting in Theorem 17 is obtained under a feasible stopping criterion which can be easily implemented in empirical studies. The constant $C_U$ has to be chosen by the practitioner. In contrast to the constant $c_u > 4$ in Theorem 15, $C_U$ depends on the sparse eigenvalue $c_\phi$, which can be (pre-)estimated from the data, and $C_U$ is chosen accordingly. A schematic sketch of the block schedule and the stopping check is given below.
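To make the interplay of the blocks $L_k$, $F_k$ and the residual-ratio test concrete, the following sketch builds an index sequence $l_m$ and the stopping check. All constants ($L_n$, $C_F$, $C_U$, $\alpha$) and the helper names are illustrative choices of ours, not prescriptions from the theory.

```r
# Illustrative sketch (assumed constants, hypothetical helper names): the
# schedule l_m of Theorem 17 and the residual-ratio stopping check of
# Algorithm 2.  l_m = 0 means searching over all p variables, l_m = 1 means
# restricting to the already selected set.
make_schedule <- function(n, L_n = 5, C_F = 1, n_blocks = 10) {
  l <- integer(0)
  size_hat <- 1                              # crude proxy for |T_hat^{m_k}|
  for (k in seq_len(n_blocks)) {
    F_k <- ceiling(C_F * size_hat * log(n))  # length of the restricted phase
    l <- c(l, rep(0L, L_n), rep(1L, F_k))    # L_n unrestricted steps, then F_k restricted ones
    size_hat <- size_hat + L_n               # at most L_n new variables per block
  }
  l
}

stop_now <- function(U_new, U_old, n, p, C_U = 5, alpha = 0.05) {
  # stop once the residual sum of squares barely improves
  sum(U_new^2) / sum(U_old^2) > 1 - C_U * log(2 * p / alpha) / n
}
```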
It is worth noting that the restricted L2-Boosting algorithm is purely gradient-based. Hence, there is no need for orthogonal projections, which are computationally costly in high-dimensional settings. Therefore, restricted L2-Boosting combines two desirable properties: computational efficiency and a very fast rate of convergence. In practice, one also needs to specify the sequence of indices $l_m \in \{0, 1\}$, $m = 1, 2, \dots$, which depends on $L_k$ and $F_k$ and thus determines the so-called consideration set. As already outlined in Section 2.2, the restricted L2-Boosting algorithm sticks to the set of already chosen variables for some time ($F_k$ iterations) and exploits the information contained in these variables until new variables are added to the consideration set. By restricting to the set of already selected variables for some iterations, the information of these variables is partialled out of the target variable $y$, and in the limit this converges to an orthogonal projection step, as used in orthogonal L2-Boosting. The constant $L_k \geq 1$ is the number of iterations in which we allow the algorithm to select new variables.

5. Simulation Study

In this section, we present the results of our simulation study. The goal of this exercise is to illustrate the relevance of our theoretical results in providing insights into the functionality of boosting and the practical aspects of boosting. In particular, we demonstrate that the stopping rules for early stopping we propose work reasonably well in the simulations and give guidance for practical applications. Moreover, the comparison with Lasso might also be of interest. First, we start with an illustrative example; later we present further results, in particular for different designs and settings.

5.1 Illustrative Example

The goal of this section is to give an illustration of the different stopping criteria. We employ the following data generating process (dgp):6
$$y = 5x_1 + 2x_2 + 1x_3 + 0x_4 + \dots + 0x_{10} + \varepsilon, \qquad (6)$$
where $\varepsilon \sim N(0, 2^2)$ and $X = (X_1, \dots, X_{10}) \sim N_{10}(0, I_{10})$, with $I_{10}$ denoting the identity matrix of size $10 \times 10$. To evaluate the methods and, in particular, the stopping criteria, we conduct an analysis of both the in-sample and the out-of-sample mean squared error (MSE) defined in equation (8). For the out-of-sample analysis we draw a new observation for evaluation and calculation of the MSE. For the in-sample analysis we also repeat the procedure and form the average over all repetitions. In both cases we employ 60 repetitions. The sample size is n = 20. Hence, we have 20 observations to estimate 10 parameters.

6. In order to allow comparability, the dgp is adopted from Bühlmann (2006).

The results are presented in Figures 1 and 2. Both show how the MSE depends on the number of steps of the boosting algorithm. We see that the MSE first decreases with more steps, reaches its minimum, and then starts to increase again due to overfitting. In both graphs the solution of the L2-Boosting algorithm converges to the OLS solution. We also indicate the MSE of the Lasso estimators as horizontal lines (with cross-validated choice of the penalty parameter and data-driven choice of the penalization parameter). In order to find a feasible stopping criterion, we have to rely on the in-sample analysis. Figure 2 reveals that the stopping criterion we introduced in the previous sections performs very well, and even better than stopping based on a corrected AIC value, which has been proposed in the literature as a stopping criterion for boosting. The code sketch below outlines how such an MSE-versus-steps analysis can be generated.
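The following sketch simulates the dgp in (6), runs plain L2-Boosting, and traces the out-of-sample MSE over the number of steps (cf. Figures 1 and 2). Seeds, the size of the evaluation sample and the helper name are illustrative choices of ours.

```r
# Illustrative sketch: dgp (6) and out-of-sample MSE of L2-Boosting as a
# function of the number of boosting steps (cf. Figures 1 and 2).
set.seed(1)
n <- 20; p <- 10; msteps <- 50
beta <- c(5, 2, 1, rep(0, p - 3))
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% beta + rnorm(n, sd = 2))
X_new <- matrix(rnorm(50 * p), 50, p)            # evaluation sample

l2boost_path <- function(X, y, msteps) {
  b <- rep(0, ncol(X)); path <- matrix(0, msteps, ncol(X))
  for (m in seq_len(msteps)) {
    U <- y - as.vector(X %*% b)
    g <- as.vector(crossprod(X, U)) / colSums(X^2)
    j <- which.max(abs(g))
    b[j] <- b[j] + g[j]
    path[m, ] <- b                               # coefficient path over steps
  }
  path
}
path <- l2boost_path(X, y, msteps)
mse <- apply(path, 1, function(b) mean((X_new %*% (beta - b))^2))
plot(mse, type = "l", xlab = "boosting steps", ylab = "out-of-sample MSE")
```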
The average stopping steps of our criterion and the corrected AIC-based criterion (AICc) are indicated by the vertical lines. On average, our criterion stops earlier than the AICc-based one. As our criterion performs better than the AICc, we will not report AICc results in the following subsection. For the restricted L2-Boosting algorithm, similar patterns arise and are omitted.

Figure 1: Out-of-sample MSE of the L2-Boosting algorithm depending on the number of steps. The horizontal lines show the MSE of the OLS, boosting and Lasso estimates.

Figure 2: In-sample MSE of the L2-Boosting algorithm depending on the number of steps. The horizontal lines show the MSE of the OLS, boosting and Lasso estimates; the vertical lines mark the average stopping steps of our criterion and of the AICc criterion.

5.2 Further Results

In this section, we present results for different designs and settings to give a more detailed comparison of the methods. We consider the linear model
$$y = \sum_{j=1}^{p} \beta_j x_j + \varepsilon, \qquad (7)$$
with $\varepsilon$ i.i.d. standard normally distributed. For the coefficient vector $\beta$ we consider two designs. First, we consider a sparse design, i.e., the first $s$ elements of $\beta$ are set equal to one and all other components to zero ($\beta = (1, \dots, 1, 0, \dots, 0)$). Then we consider a polynomial design in which the $j$th coefficient is given by $1/j$, i.e., $\beta = (1, 1/2, 1/3, \dots, 1/p)$. For the design matrix $X$, we consider two different settings: an orthogonal setting and a correlated setting. In the former setting, the entries of $X$ are drawn as i.i.d. draws from a standard normal distribution. In the correlated design, the $x_i$ (rows of $X$) are distributed according to a multivariate normal distribution where the correlations are given by a Toeplitz matrix with factor 0.5 and alternating signs. To sum up, we have the following settings:

- $X$: orthogonal or correlated design
- coefficient vector $\beta$: sparse design or polynomially decaying design
- $n = 100, 200, 400$
- $p = 100, 200$
- out-of-sample prediction size $n_1 = 50$
- number of repetitions $R = 500$

We consider the following estimators: L2-Boosting with componentwise least squares, restricted L2-Boosting, orthogonal L2-Boosting and Lasso. For Lasso, we also consider the post-selection estimator ("p-Lasso"). Here, we consider a data-driven, regressor-dependent choice of the penalization parameter (Belloni et al. (2012)) and cross-validation. Although cross-validation is very popular, it does not rely on established theoretical results, and therefore we prefer a comparison with the formal penalty choice developed in Belloni et al.
Using this approach, boosting stops when the ratio of the estimated variances does not improve upon a certain amount any more. The oracle rule stops when the mean-squared-error (MSE), defined below, is minimized, which is not feasible in practical applications. The simulations were performed in R (R Core Team (2014)). For Lasso estimation the packages hdm by Chernozhukov et al. (2015) and glmnet by Jerome Friedman (2010) (for cross-validation) were used. The boosting procedures were implemented by the authors and the code is available upon request.7 To evaluate the performance of the estimators, we use the MSE criterion. We estimate the models on the same data sets and use the estimators to predict 50 observations out-of-sample. The (out-of-sample) MSE is defined as MSE = E[(f(X) fm(X))2] = E[(X (β βm))2], (8) where m denotes the iteration at which we stop, depending on the employed stopping rule. The MSE is estimated by i=1 [(f(xi) fm(xi))2] = 1 i=1 [(x i(β βm))2] (9) for the out-of-sample predictions. The results of the simulation study are shown in Tables 3 10. As expected, the oracle-based estimators clearly dominate in almost all cases, although our stopping criterion also gives very good results. Not surprisingly, given our theoretical results, both restricted L2-Boosting and orthogonal L2-Boosting outperform the standard L2-Boosting in most cases. A comparison of restricted and orthogonal L2-Boosting does 7. A R package is in preparation. High-Dimensional L2-Boosting Table 5: Simulation results: sparse, correlated design (Boosting) n p BA-oracle BA-our o BA-oracle o BA-our res BA-oracle res BA-our 100 100 1.501 2.232 0.357 1.255 0.670 2.028 100 200 3.220 3.080 2.516 2.552 2.301 2.968 200 200 0.627 0.753 0.059 0.249 0.144 0.215 400 200 0.195 0.245 0.027 0.113 0.068 0.102 800 200 0.081 0.104 0.013 0.054 0.035 0.051 Table 6: Simulation results: sparse, correlated design (Lasso) n p Lasso p-Lasso Lasso-CV p-Lasso-CV 100 100 2.63 1.35 0.97 1.37 100 200 2.96 2.04 1.63 2.38 200 100 1.10 0.23 0.33 0.57 200 200 1.64 0.38 0.49 0.87 400 100 0.38 0.10 0.13 0.23 400 200 0.36 0.15 0.16 0.31 Table 7: Simulation results: polynomial, iid design (Boosting) n p BA-oracle BA-our o BA-oracle o BA-our res BA-oracle res BA-our 100 100 0.411 0.548 0.400 0.658 0.400 0.536 100 200 0.252 0.324 0.248 0.356 0.246 0.312 200 200 0.282 0.399 0.274 0.463 0.271 0.381 400 200 0.182 0.232 0.180 0.251 0.178 0.221 800 200 0.122 0.143 0.121 0.148 0.121 0.140 Table 8: Simulation results: polynomial, iid design (Lasso) n p Lasso p-Lasso Lasso-CV p-Lasso-CV 100 100 0.43 0.83 0.45 0.84 100 200 0.50 1.06 0.54 0.76 200 100 0.30 0.34 0.26 0.47 200 200 0.34 0.52 0.33 0.52 400 100 0.19 0.19 0.15 0.24 400 200 0.21 0.31 0.20 0.38 Table 9: Simulation results: polynomial, correlated design (Boosting) n p BA-oracle BA-our o BA-oracle o BA-our res BA-oracle res BA-our 100 100 0.242 0.421 0.228 0.506 0.223 0.380 100 200 0.251 0.531 0.244 0.670 0.238 0.489 200 200 0.193 0.312 0.171 0.337 0.165 0.269 400 200 0.149 0.203 0.113 0.188 0.111 0.161 800 200 0.102 0.126 0.078 0.111 0.077 0.100 Table 10: Simulation results: polynomial, correlated design (Lasso) n p Lasso p-Lasso Lasso-CV p-Lasso-CV 100 100 0.33 0.53 0.33 0.55 100 200 0.34 0.93 0.36 0.55 200 100 0.27 0.31 0.23 0.41 200 200 0.28 0.47 0.29 0.46 400 100 0.17 0.18 0.14 0.24 400 200 0.16 0.24 0.15 0.29 Luo, Spindler and Kueck not provide a clear answer with advantages on both sides. 
It is worth noting that the post Lasso estimator improves upon Lasso, but there are some exceptions, probably driven by overfitting. Cross-validation works very well in many settings. An important objective of the simulation study is to compare L2-Boosting and Lasso. It seems that in the polynomial decaying setting, L2-Boosting (particulary orthogonal L2-Boosting with our stopping rule) dominates post-Lasso. This also seems true in the sparse i.i.d. setting. In the sparse correlated setting, they perform equally well overall. In summary, it seems that L2-Boosting is a serious contender for Lasso in high-dimensional linear regression models as our new theoretical results propose. 6. Conclusion Although boosting algorithms are widely used in research and industry, the analysis of their properties in high-dimensional settings has been quite challenging. In this paper, the rate of convergence for the L2-Boosting algorithm and variants under early stopping in a highdimensional setting are derived which has been a long-standing open problem until now. For the analysis of the L2-Boosting algorithm, new approximation results are derived which improve on previous results and might be of independent interest. The rate of pure greedy algorithms can be slow in some pathological cases, as shown in Livshitz and Temlyakov (2003). Therefore, we introduce a new variant called restricted L2-Boosting which achieves a faster rate of convergence that is comparable to the rate of convergence of Lasso. All results are derived without beta-min condition and under a feasible early stopping criterion. In this paper, we focus on linear basis functions. The analysis of nonlinear basis functions, like trees, is left for future research and the results in this paper might serve as a starting point. Acknowledgments We thank the editor Corinna Cortes for her support and two anonymous referees for their valuable comments which helped to improve the paper. We thank Chunrong Ai, Victor Chernozhukov, Jerry Hausman, Scott Kostyshak, Anna Mikusheva, Faming Liang, Whitney Newey, Philippe Rigollet, David Sappington and seminar participants at MIT and University of Florida for invaluable comments and discussions. Financial support by the Fritz Thyssen Stiftung (Az. 40.13.0.014) and Hong Kong RGC TRS (T32-615/24-R) is gratefully acknowledged. High-Dimensional L2-Boosting Appendix A. A new approximation theory for PGA A.1 Auxiliary lemmas on approximation theory of PGA In this section of the appendix, we introduce preparatory results for a new approximation theory based on revisiting. These results are useful to prove Lemma 7. The proofs of these lemmas are provided in the next section. For any m1 m, define L(m, m1) = ||V m1||2 2,n/||V m||2 2,n 1. For any integers q1 > q, define (q, q1) := Πq1 q 1 j=0 1 cφ q+j with cφ defined in Assumption A.1. It is easy to see that for any k1 > k, (k, k1)/(k/k1)cφ < 1 and (k, k1)/(k/k1)cφ 1 (10) as k . First of all, we can establish the following naive bounds on L(m, m1): Lemma 19 Suppose ||V m||2,n > 0, m + 1 < Mn and m1 < Mn. Under Assumption A.1, it holds a) For any m, L(m, m + 1) 1 cφ q(m); b) For any m 0, m1 > m, L(m, m1) (q(m), q(m) + m1 m). The bound of L(m, m1) established in Lemma 19 is loose. To obtain better results on the convergence rate of V m 2 2,n, the revisiting behavior of the PGA has to be analyzed in more detail. The revisiting behavior of PGA addresses the question when and how often variables are selected again which have already been selected before. 
When PGA chooses too many new variables, it leads on average to slower convergence rates and vice versa. The next results primarily focus on analyzing the revisiting behavior of the PGA. The following lemma summarizes a few basic facts of the sequence of Ai, i 1. Lemma 20 Suppose m < Mn and m1 < Mn. Further, assume that Assumption A.1 is satisfied. It holds a) If En[x ixi] is a diagonal matrix, i.e., cφ = 1, then there are only Rs in the sequence A. b) Define N(m) := {k|Ak = N, 1 k m}, the index set for the non-revisiting steps, and R(m) := {k|Ak = R, 1 k m}, the index set for the revisiting steps. Then |R(m)| + |N(m)| = m, q(m) = |N(m)| + q(0), and JN(m) := {jk|k N(m)} has cardinality equal to |N(m)|. c) L(0, m) Π|N(m)| i=1 1 cφ q(0)+i 1 1 cφ q(m) |R(m)| , i.e., the sequence to maximize the upper bound of L(0, m) stated above is NN . . . NRR . . . R. Consequently, the sequence {Am+1, ..., Am1} to maximize the upper bound of L(m, m1) for general m1 > m is also NN . . . NRR . . . R. The proof of this lemma is obvious and hence omitted. Much more involved is the following result for characterizing the revisiting behavior. Luo, Spindler and Kueck Lemma 21 Assume that Assumption A.1 holds and that m < Mn. Consider the sequence of steps 1, 2, . . . , m. Set µe(cφ) = (1 exp( 1/c2 φ)). Then, the number of Rs in the sequence A satisfies: |R(m)| 1 µe(cφ) 2 µe(cφ)m µe(cφ) 2 µe(cφ)q(0). Lemma 21 provides a lower bound of the proportions of Rs. It illustrates that the R spots occupy at least some significant proportion of the sequence A, with the lower bound of the proportion depending on cφ. In fact, such a result holds for arbitrary consecutive sequence Am, Am+1, . . . , Am+k, as long as m+k < Mn. In the main text, we further extend results stated in Lemma 21. A.2 Proofs of lemmas in Appendix A.1 Proof [Proof of Lemma 19] By definition, ||V m||2 2,n = X j T m αm j < V m, Xj >n= X j T m αm j V m 2,ncorr(V m, Xj). Define ρjm := |γm|/ V m 2,n = |corr(V m, Xjm)| since V m = Um in the deterministic case. Therefore, αm j V m 2,n, i.e., ρ2 jm P j T m |αm j | 2 V m 2 2,n. By the Cauchy-Schwarz inequality, j T m |αm j | q(m)||αm||2. Therefore, ρ2 jm cφ q(m) with cφ defined in Assumption A.1 and ||V m+1||2 2,n = ||V m||2 2,n(1 ρ2 jm) V m 2 2,n i.e. L(m, m + 1) 1 cφ q(m). The second statement follows from statement a) by iteration and the fact that q(m + 1) q(m ) + 1 for any m 0. Proof [Proof of Lemma 21] Define e N(m) := {l : jl / T 0, jl is only visited once within steps 1,2,. . . ,m} High-Dimensional L2-Boosting with T 0 = T = supp(β). It is easy to see that e N(m) N(m) and | e N(m)| 2|N(m)| m since we excluded T 0 in both N and N(m) and |N(m)| = m |R(m)| and | e N(m)| m 2|R(m)| where N(m) := {k|Ak = N, 1 k m} is the index set for the non-revisiting steps. For any jl with l e N(m), it holds αm jl = γl jl / T 0. If |R(m)| m/2, then the statement of this lemma trivially holds. Therefore, we can assume that e N(m) is non-empty. Hence, l e N(m) (γl 1)2. By the sparse eigenvalue condition A1, 1 cφ ||V m||2 2,n ||αm||2 X l e N(m) (γl 1)2. (12) Note that by Lemma 19, (γl 1)2 = ||V l 1||2 2,n ||V l||2 2,n = ||V l 1||2 2,n (1 L(l 1, l)) cφ q(l 1)||V l 1||2 2,n. (13) Therefore, (γl 1)2 cφ q(l 1)||V m||2 2,n for all l N(m). Plugging this back into (12), we get: 1 cφ ||V m||2 2,n cφ X 1 q(l 1)||V l 1||2 2,n. (14) Since l e N(m) are different integers with the maximum value of q(l 1) being less than or equal to q(m) = q(0) + |N(m)|, it holds | e N(m)| X 1 q(0) + |N(m)| l log((q(0)+|N(m)|)/(q(0)+|N(m)| | e N(m)|)). 
The inequality above implies that exp(1/c2 φ) (q(0) + |N(m)|)/(q(0) + |N(m)| | e N(m)|), i.e., | e N(m)| (1 exp( 1/(cφ)2))(q(0) + |N(m)|). Set µe(cφ) = (1 exp( 1/c2 φ)) (0, 1). Since we know that | e N(m)| 2|N(m)| m, we immediately have: |N(m)| 1 2 µe(cφ)(m + µe(cφ)q(0)) |R(m)| 1 µe(cφ) 2 µe(cφ)m µe(cφ) 2 µe(cφ)q(0). Luo, Spindler and Kueck Appendix B. Proofs of main results in Section 3 Proof [Proof of Lemma 7] First of all, WLOG, we can assume that q(0) exceeds a large enough constant Q(δ). Otherwise, it can be assumed that the true parameter β contains some infinitesimal components such that q(0) > Q(δ). Let s revisit inequality (13). It holds X l e N(m) (γl 1)2 X l e N(m) ||V l 1||2 cφ q(l 1). The right-hand side reaches its minimum when e N(m) = {m | e N(m)| + 1, m | e N(m)| + 2, ..., m}, and for the step m | e N(m)| + l, with l = 1, 2, ..., | e N(m)|, we have q(m | e N(m)| + l 1) = q(m) | e N(m)| + l 1. Hence, for any δ > 0, and q(0) large enough, l e N(m) (γl 1)2 (1 + δ) X l e N(m) ||V l 1||2 2,n cφ q(l 1) = (1 + δ) X l e N(m) ||V m||2 2,n 1 L(l 1, m) cφ q(l 1) (1 + δ)||V m||2 2,ncφ | e N(m)| X 1 q(m l) 1 L(m l, m) ||V m||2 2,ncφ | e N(m)| X q(m) q(m) l c2 φ αm 2q(m)cφ q(m) 1 X k=q(m) | N(m)| cφ αm 2q(m)cφ((q(m) | e N(m)|) cφ q(m) cφ) k=q(m) | N(m)| 1+cφ = ((q(m) | e N(m)|) cφ q(m) cφ), ||V m||2 2,n cφ αm 2 by Assumption A.1 and ||V l 1||2 2,n/||V m||2 2,n = 1/L(l 1, m), while L(m l, m) q(m) l as q(m) l q(m) | N(m)| q(0) by Lemma 19. Using αm 2 P l e N(m)(γl 1)2, we conclude (1 + δ) cφq(m)cφ((q(m) | e N(m)|) cφ q(m) cφ), High-Dimensional L2-Boosting | e N(m)| q(m) 1 1 + 1 + δ q(m)(1 + δ )µa(cφ), for some δ > 0, with δ 0 as δ 0. Since we know that | e N(m)| 2|N(m)| m, we have analogous to the proof of Lemma 21, |N(m)| (|N(m)| + q(0))(1 + δ )µa(cφ) + m 2 which implies |N(m)| q(0)(1 + δ )µa(cφ) + m 2 (1 + δ )µa(cφ) |R(m)| = m |N(m)| 1 (1 + δ )µa(cφ) 2 (1 + δ )µa(cφ)m (1 + δ )µa(cφ) 2 (1 + δ )µa(cφ)q(0). Proof [Proof of Lemma 8] Without loss of generality, we can assume that k = 0. We can also assume that ||V 0||2 2,n > 0, because otherwise ||V 0||2 2,n = ||V m||2 2,n = 0 so that the conclusion already holds. Set n0 = |N(m)| m + q(0)(1 + δ)µa(cφ) 2 (1 + δ)µa(cφ) (1 + δ)m + µa(cφ)q(0) 2 µa(cφ) = (1 + δ)n 0(m) for any δ > 0 by Lemma 7 when q(0) is large enough. Then, by Lemma 20, ||V m||2 2,n/||V 0||2 2,n Πn0 i=1 1 cφ q(0) + i 1 1 cφ q(0) + n0 where the right hand reaches its maximum when n0 = (1 + δ)n 0(m). When q(0) is large enough, we know that for any δ > 0, Π(1+δ)n 0(m) i=1 1 cφ q(0) + i 1 (1 + δ) q(0) q(0) + (1 + δ)n 0(m) = (1 + δ) 2 µa(cφ) λ(1 + δ) + 2 + δµa(cφ) where λ = m/q(0) (15) 1 cφ q(0) + (1 + δ)n 0(m) m (1+δ)n 0(m) = q(0)2+(1+δ)λ+δµa(cφ) q(0) ((1 δ) µa(cφ))λ (1+δ)µa(c) exp cφ((1 δ) µa(cφ))λ (1 + δ)µa(cφ)) 2 + (1 + δ)λ + δµa(cφ) Luo, Spindler and Kueck Thus, for any δ > 0, and for q(0) large enough, ||V m||2 2,n/||V 0||2 2,n (1 + δ) 2 µa(cφ) λ(1 + δ) + 2 + δµa(cφ) cφ exp cφ((1 δ) µa(cφ))λ (1 + δ)µa(cφ)) 2 + (1 + δ)λ + δµa(cφ) = (1 + δ) q(0) + (1 + δ)n 0(m) q(0) cφ exp cφ((1 δ) µa(cφ))λ (1 + δ)µa(cφ)) 2 + (1 + δ)λ + δµa(cφ) where 2 µa(cφ) λ(1 + δ) + 2 + δµa(cφ) = q(0) + (1 + δ)n 0(m) q(0) . It is worth noting that the bound on the right-hand side does not depend on q(0) or m only on λ. 
Hence, for some δ > 0 that is small enough that depends on any small enough δ > 0, and for q(0) large enough, ||V m||2 2,n ||V 0||2 2,n(1 + δ) q(0) q(0) + (1 + δ)n 0(m) cφ q(0) q(0) + (1 + δ)n 0(m) ||V 0||2 2,n q(0) q(0) + (1 + δ)n 0(m) ζ (cφ, λ, δ) := cφ((1 δ) µa(cφ))λ (1+δ)µa(cφ)) 2+(1+δ)λ+δµa(cφ) log q(0) q(0)+(1+δ)n 0(m) and ζ(cφ, λ) defined in the statement of this lemma, ζ(cφ, λ) := cφ((1 µa(cφ))λ µa(cφ)) log 2+λ 2 µa(cφ) + cφ. Proof [Proof of Theorem 9] As in the proof of Lemma 7, WLOG, we can assume that q(0) exceeds a large enough constant Q(δ). Otherwise, we can consider the true parameter β contains some infinitesimal components such that q(0) > Q(δ). Let λ µa(c) 1 µa(c) be the maximizer of c((1 µa(c))λ µa(c)) log 2+λ 2 µa(c) + c given c (0, 1). Let m0 = 0. Define the sequence mi+1 = 1 + mi + λ q(mi) with λ λi := mi+1 mi q(mi) λ + 1 q(mi) λ + 1 High-Dimensional L2-Boosting n i := mi+1 mi + µa(cφ)q(mi) By Lemma 8, for q(mi) large enough, e.g., q(mi) q for some q large enough, there exists δ > 0 that is small enough, so that ||V mi+1||2 2,n ||V mi||2 2,n q(mi) q(mi) + (1 + δ)n i q(mi) q(mi) + (1 + δ)n i ζ(cφ,λ ) δ , (16) where δ := δ +maxλ [λ ,λ + 1 q ] ζ(cφ,λ) q is an arbitrarily small constant, as q is large enough, δ is small enough, ζ is a continuously differentiable function with bounded derivatives, and (1 + δ)n i q(mi+1) q(mi). For those m with q(m) q, by (11) in the proof of Lemma 19, we have that V m+1 2 2,n V m 2 2,n 1 cφ q(m) V mi+1 2 2,n V mi 2 2,n = Πmi+1 1 m=mi V m+1 2 2,n V m 2 2,n Πmi+1 1 m=mi 1 cφ q(mi+1 1) 1 cφ q(mi+1) mi+1 mi . (17) Define i as the index of i such that q(mi 1) < q and q(mi ) q. By definition, q(mi ) q(mi 1) + mi+1 mi (1 + λ )q(mi 1) + 1 (1 + λ ) q + 1 =: eq. (18) Assume i i . Since (1+δ)n i q(mi+1) q(mi), it holds q(mi) (1+δ) Pi 1 j=i n j +q(mi ). If i < i , the statement follows by (17). As q(mi) q(mi)+(1+δ)n i is increasing in q(mi), we have that q(mi) q(mi) + (1 + δ)n i (1 + δ) Pi 1 j=i n j + q(mi ) (1 + δ) Pi 1 j=i n j + q(mi ) + (1 + δ)n i = (1 + δ) Pi 1 j=i n j + q(mi ) (1 + δ) Pi j=i n j + q(mi ) . (19) Luo, Spindler and Kueck By (19), we have that: V mi 2 2,n V 0 2 2,n Πi 1 j=i ||V mj+1||2 2,n ||V mj||2 2,n Πi 1 j=0 ||V mj+1||2 2,n ||V mj||2 2,n Πi 1 j=i q(mj) q(mj) + (1 + δ)n j !ζ(cφ,λ ) δ 1 cφ q(m i ) + (1 + δ) Pi 1 j=i n j !ζ(cφ,λ ) δ 1 cφ with q(mj+1) q q when j < i . Note that by definition, we have j=i n j = (1 + δ) 2 µa(cφ) j=i µa(cφ)q(mj) 1 2 µa(cφ)(mi mi ). (21) Plug in (21), we have that V mi 2 2,n V 0 2 2,n Cµ q(mi ) q(m i ) + mi mi ζ(cφ,λ ) δ 1 cφ where Cµ := (2 µa(cφ))ζ(cφ,λ ) δ . By (18), we have q(m i ) eq. If mi 2mi , we conclude q(mi ) q(m i ) + mi mi eq 1 q(m i ) + mi mi eq 2 q(0) + mi Therefore, V mi 2 2,n V 0 2 2,n C µ q(0) q(0) + mi ζ(cφ,λ ) δ , (24) ζ(cφ,λ ) δ Cµ(2eq)ζ(cφ,λ ) δ is a fixed constant given q and δ . If mi < 2mi , then, V mi 2 2,n V 0 2 2,n 1 cφ ζ(cφ,λ ) δ q(0) q(0) + mi ζ(cφ,λ ) δ , (25) for mi being large enough, as the left hand side of the above inequality decays exponentially in mi, while the right hand side decays only in fixed polynomial speed of mi. Therefore, in either case, we have that V mi 2 2,n V 0 2 2,n C µ q(0) q(0) + mi ζ(cφ,λ ) δ , (26) High-Dimensional L2-Boosting for some fixed C µ > 0. For any m > 0, m < M0, since m0, m1, . . . is an increasing sequence of positive integers, there exists i such that mi m < mi+1. Thus, m mi mi+1 mi (2 + λ ). 
Therefore, V m 2 2,n V 0 2 2,n V mi 2 2,n V 0 2 2,n C µ q(0) q(0) + mi ζ(cφ,λ ) δ e Cµ q(0) q(0) + m ζ(cφ,λ ) δ , (27) with e Cµ := C µ(2 + λ )ζ(cφ,λ ) δ is a fixed constant. Lastly, by replacing δ with κ, the proof is completed. Appendix C. Proofs for the L2-Boosting algorithm C.1 Auxiliary lemmas for L2-Boosting The two lemmas below state several basic properties of the L2-Boosting algorithm that will be useful in deriving the main results. Lemma 22 It holds ||Um+1||2 2,n = ||Um||2 2,n < Um, Xjm >2 n= ||Um||2 2,n(1 ρ2(Um, Xjm)) ||V m+1 + r||2 2,n = ||V m + r||2 2,n 2 < V m + r, γm jm Xjm >n +(γm jm)2, where γm jm =< Um, Xjm >n. Moreover, since V m = Um r ε, ||V m+1 + r||2 2,n = ||V m + r||2 2,n (2 < Um, Xjm >n< Um ε, Xjm >n) + < Um, Xjm >2 n = ||V m + r||2 2,n + 2γm jm < ε, Xjm >n (γm jm)2. Lemma 23 Assume that Assumptions A.1-A.3 hold and m Mn. Let Zm = ||Um||2 2,n ||V m + r||2 2,n. Then, with probability 1 α and uniformly in m, it holds |Zm σ2 n| 2 p (1 + ω)Crs + 2 m + s cφ V m 2,n with σ2 n := ||ε||2 2,n. Lemma 23 bounds the difference between ||Um||2 2,n and ||V m + r||2 2,n. Luo, Spindler and Kueck Proof [Proof of Lemma 23] By Assumption A.3, we have that: |Zm ||ε||2 2,n| 2 < ε, V m + r >n = 2 < ε, r + Xαm >n 2 ε 2,n r 2,n + 2| < ε, Xαm >n | Crs log(2p/α) n + 2 αm 1λn (1 + ω)Crsλn + 2 m + s αm 2λn (1 + ω)Crs + 2 m + s cφ V m 2,n as |supp(αm)| m + s. Next, we provide the lemmas that we require for proving our main result in Theorem 17. Define Uo T := y PT y as the residual of projection of y on XT for any T {1, 2, ..., p}. Further, we define Pm,T as the operator of performing m periods of PBA algorithm subject to the subset T, i.e., Um T := y Pm,T y = (I Pm,T )U0 with U0 := y and V m T := U0 T Um T denotes the approximation error. Lemma 24 measures a bound between the operator Pm,T and the projection operator PT . That said, Pm,T is an approximation of PT . Lemma 24 (Iterations) Suppose that Assumption A.1 holds. Suppose T {1, 2, ..., p} with |T| Mn. Then, starting with U0 = y, the L2-Boosting algorithm that runs on the restricted subset T satisfies that: Um Uo T 2 2,n y Uo T 2 2,n exp 1 |T|cφm . (28) Proof [Proof of Lemma 24] Given the L2-Boosting algorithm, γjm := max j T |< Um, Xjm >n Xjm 2,n | = | < Um, Xjm >n |, (29) with jm being the largest index. The approximation error can be written as follows: V m T := U0 T Um T = XT αm T . We consider Um T = V m T + η where η = Um T V m T = Uo T = y PT y is orthogonal to all the column vectors in XT . Note that η is never changing when m increases. As a result, we have that: |γjm| αm T 1 X j T < Um T , αm j Xjm >n= V m T 2 2,n. (30) High-Dimensional L2-Boosting |γjm|2 V m T 4 2,n αm T 2 1 V m T 4 2,n |T| αm T 2 2 cφ V m T 4 2,n |T| V m T 2 2,n = cφ V m T 2 2,n |T| . That is to say, V m+1 T 2 2,n = Um+1 T 2 2,n η 2 2,n = Um T 2 2,n |γjm|2 η 2 2,n V m T 2 2,n As a result, V m T 2 2,n V 0 T 2 2,n m V 0 T 2 2,n exp cφ which proves the statement of the lemma. Lemma 25 (Bounds on Residuals) Assuming that assumptions A.1-A.3 hold, for any positive integer M Mn, we have: sup T {1,2,...,p},|T| M 1 n X T ε 2 2,n Mλ2 n, sup T {1,2,...,p},|T| M PT ε 2 2,n 1 sup T {1,2,...,p},|T| M PT (r + ε) 2 2,n 2 Crs log(2p/α) + σ2/cφ log(2p/α)M with probability 1 α. The proof is given in Lemma 1 in the appendix of Kueck et al. (2023). Lemma 26 (lower bound on residuals) Given Assumption A.1. Consider Um = Xαm+ r + ε with αm 0 Mn. Then, |γm jm| r cφ αm 0 Xαm 2,n Crcφs log(2p/α) with probability 1 α. 
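As a numerical aside before the remaining proofs (the proof of Lemma 26 follows next), the geometric decay in Lemma 24 is easy to visualize: componentwise L2-Boosting run on a fixed index set T drives the residual towards the least-squares residual on X_T. The following R sketch is purely illustrative and not part of the formal argument; the data-generating process, the set T, and all constants are our own choices.

```r
## Illustration (ours) of Lemma 24: componentwise L2-Boosting restricted to a
## fixed index set T converges to the projection residual U^o_T = y - P_T y
## at a geometric rate.
set.seed(1)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
X <- scale(X) * sqrt(n / (n - 1))        # columns with mean 0 and ||X_j||_{2,n} = 1
beta <- c(rep(1, 5), rep(0, p - 5))
y <- as.vector(X %*% beta + rnorm(n))

Tset <- 1:10                             # the restricted index set T, |T| = 10
Uo   <- as.vector(residuals(lm(y ~ X[, Tset] - 1)))   # U^o_T

U   <- y                                 # U^0 = y
gap <- numeric(60)
for (m in seq_along(gap)) {
  g <- as.vector(crossprod(X[, Tset], U)) / n   # <U^{m-1}, X_j>_n for j in T
  j <- which.max(abs(g))                        # greedy selection within T
  U <- U - g[j] * X[, Tset[j]]                  # one componentwise update
  gap[m] <- mean((U - Uo)^2)                    # ||U^m - U^o_T||^2_{2,n}
}
print(round(gap[c(1, 5, 10, 20, 40, 60)], 5))
```

Since U^m - U^o_T lies in the span of X_T and U^o_T is orthogonal to that span, the gap equals ||U^m||^2_{2,n} - ||U^o_T||^2_{2,n}; it decreases monotonically and at least as fast as the exponential bound in (28).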
Proof [Proof of Lemma 26] Denote ρm := maxj=1,2,...,p |corr(Um, Xj)|. It can be shown that (γm jm)2 = ρ2 m Um 2 2,n. We know that < Um, Xαm >n Xαm 2 2,n r 2,n Xαm 2,n λn αm 1 Crs log(2p/α) n Xαm 2,n λn cφ Xαm 2,n. Luo, Spindler and Kueck On the other hand, we have that: < Um, Xαm >n X j supp(α) αm j < U, Xj >n ρm Um 2,n X j supp(α) |αm j | αm 0 αm 2 |γm jm| cφ Xαm 2,n. Therefore, we have that: |γm jm| r cφ | αm 0 Xαm 2,n Crcφs log 2p/α n αm 0 λn. (33) C.2 Proofs of main results in Section 4 Proof [Proof of Lemma 13] By assumption A.3, we know that λn max1 j p | < ε, Xj >n | with probability 1 α. According to our definition, m + 1 is the first time ||V m||2,n η where η is a fixed positive constant that is large enough with η > 2σ Cr. We know that in high-dimensional settings, ||Um||2,n 0, so ||V m||2 2,n σ2. Thus, by fixing p and n, such an m must exist. Therefore, we can consider any m < m := (m + 1) Mn. Note that T m := T0 {j0, j1, . . . , jm 1} and V m := Xαm = XT mαT m. It holds |γm jm| αT m 1 < Um, XT mαT m >n V m 2 2,n | < r, XT mαT m > | | < ε, XT mαT m > | V m 2 2,n r 2,n V m 2,n λn αT m 1. As we assume that V m 2,n > η m + sλn, with η being large enough, we have that r 2,n Cr ση V m 2,n. Then, V m 2 2,n r 2,n V m 2,n 1 Cr High-Dimensional L2-Boosting which implies that: ση V m 2 2,n α e T m 1 λn ση V m 2 2,n p q(m) α e T m 2 λn q(m) V m 2,n λn q(m) V m 2,n CV,1 V m + r 2,n p where V m 2,n η m + sλn η p q(m)λn, and is a positive constant when η is large enough, and approaches to cφ as η goes to infinity. By Lemma 22, it holds that ||V m + r||2 2,n ||V m+1 + r||2 2,n = ||Um||2 2,n ||Um+1||2,n 2γm jm < Xjm, ε >n = (γm jm)2 2γm jm < Xjm, ε >n . (35) By equation (34), it follows that: ||V m + r||2 2,n ||V m+1 + r||2 2,n (γm jm)2 2γm jm < Xjm, ε >n (γm jm)2 2|γm jm|λn CV,2 V m + r 2 2,n, when η is large enough where Luo, Spindler and Kueck is arbitrarily close to cφ when η is large enough. By the arguments in the proof of Lemma 21, when n is large enough, by (34) and r 2,n Cr ση V m 2,n, we have that: ση 2 V m + r 2 2,n 1 cφ ||V m||2 2,n k e N(m) (γk 1)2 C2 V,1 q(k 1)||V k 1 + r||2 2,n which corresponds to (14) in Lemma 21 considering V m +r 2 2,n instead of V m 2 2,n. Define cφ CV,2, cφ C2 V,1/ 1 Cr which can be arbitrarily close to 0 as η is large enough. Thus, following the proof of Lemma 7 and Lemma 8, we can treat cφ ψ as the constant cφ in Theorem 9. Therefore, results of Lemma 7 and Lemma 8 applies for V m + r with cφ replaced by cφ ψ. We conclude that V m + r 2 2,n V 0 + r 2 2,n C s s + m ζ (cφ) ψ , (36) for some arbitrarily small ψ > 0 as η is large enough. By using that V m + r 2,n V m 2,n r 2,n, we have that η Cr m + s V m + r 2,n for all m < em. It follows that η Cr 2 λ2 n( em 1 + s) e C V 0 + r 2 2,n ζ (cφ) ψ . (37) Therefore s log(p) n ||V 0 + r||2 2,n ζ (cφ) ψ +1 , where ψ can be arbitrarily close to 0 as m is large enough or equivalently, s log(2p/α) n||V 0 + r||2 2,n ! 1 1+ζ (cφ) ψ . By assumption, log(Mn/s) + ξ + 1 1 + ζ (cφ) s log(2p/α) n||V 0 + r||2 2,n High-Dimensional L2-Boosting for some ξ > 0. Thus, asymptotically, em = Mn (m + 1) < Mn, i.e., m + 1 < Mn. Thus, s log(2p/α) n||V 0 + r||2 2,n ! 1 1+ζ (cφ) ψ (38) ||V m +1||2 2,n η m + 1 + sλn ||V 0 + r|| 2 1+ζ (cφ) ψ 2,n s log(2p/α) 1+ζ (cφ) ψ , for any small ψ > 0 if η is large enough. Proof [Proof of Theorem 15] At the (m 1 + 1)th step, we have: ||Um 1+1||2 2,n > (1 cu log(2p/α)/n)||Um 1||2 2,n. It follows that (γm 1)2 < cu log(2p/α)/n||Um 1||2 2,n, while (γm)2 cu log(2p/α)/n||Um||2 2,n for all m < m 1. 
Consider the m defined in Lemma 13 as a reference point. Case (a): Suppose m 1 < m : By the proof of Lemma 13, ||V m + r||2 2,n is decreasing when m m 1 + 1 m . By Lemma 23, it holds ||Um 1||2 2,n σ2 n + (1 + ω)Crs + 2 m 1 + s cφ ||V m 1||2,n λn + V m 1 + r 2 2,n. It follows that (γm 1)2 < cu log(2p/α)/n||Um 1||2 2,n < (1 + ω)cuλ2 n + cu log(2p/α)/n (1 + ω)Crs + 2 m 1 + s cφ ||V m 1||2,n + cu log(2p/α)/n V m 1 + r 2 2,n. (39) By inequality (34), we have that |γm 1| CV,1 V m 1 + r 2,n p Luo, Spindler and Kueck for η large enough where V m 2,n η m + sλn holds for all m m , including m 1 by assumption that m 1 < m . Combining this with inequality (39), we have that: C2 V,1 V m 1 + r 2 2,n m 1 + s cuλ2 n(1 + ω) + 2cuλn log(2p/α) m 1 + s σ2 cφ V m 1 2,n + cu log(2p/α)/n V m 1 + r 2 2,n. (40) We show that this leads to a contradiction and hence m 1 m . First, by Lemma 13, we know that m + 1 + sλn ||V 0 + r|| 2 1+ζ (cφ) ψ 2,n 1+ζ (cφ) ψ 0 (41) as n is large enough since s log(2p/α) n 0 by assumption. As a result, for n large enough, we have that 1 3C2 V,1 V m 1 + r 2 2,n m 1 + s > cu log(2p/α)/n V m 1 + r 2 2,n (42) which corresponds to the third component in (40). Second, since V m 1 2,n η p m 1 + sλn by construction for η large enough, we have V m 1 + r 2 2,n 1 Cr 2 V m 1 2 2,n as shown in the proof of Lemma 13. Hence, for η being a large enough fixed constant and n large enough, we have that 1 3C2 V,1 V m 1 + r 2 2,n 3C2 V,1 V m 1 2 2,n V m 1 2,n 1 3C2 V,1 = 2 V m 1 2,ncu/(σ2 cφ)λ3 n p σ2 cφC2 V,1 2 /(6cuλ2 n) m 1 + s σ2 cφ V m 1 2,n (43) High-Dimensional L2-Boosting as σ2 cφC2 V,1 1 Cr 2ση 2 /(6cuλ2 n) . Third, for η large enough, we have that 1 3C2 V,1 V m 1 + r 2 2,n m 1 + s 1 3C2 V,1η2 1 Cr > 2cuλ2 n(1 + ω) > cuλ2 n(1 + ω) + 2cuλn log(2p/α) (1 + ω)Crs, (44) as 2cu log(2p/α) n s = 2cuλn/σ q s log(2p/α) n = o(λn) since s log(2p/α) n 0 for n large enough. Therefore, equations (42) (44) lead to a contradiction with (40). Case (b): We know that m 1 m : It follows that (γm jm)2 cu log(2p/α)/n||Um||2 2,n for all m < m 1. Since ||Um||2 2,n is a decreasing sequence, for δ small enough, there exists some m2 such that ||Um 2 ||2 2,n > (1 δ)σ2 n for any m m2, and ||Um2+1||2 2,n (1 δ)σ2 n. For δ small enough and m m2 m 1, it holds ||V m+1 + r||2 2,n ||V m + r||2 2,n = (γm jm)2 + 2γm jm < ε, Xjm >n (γm jm)2 + 2λn|γm jm| by Lemma 22. Since cu > 4, for δ small enough, it holds |γm jm|2 cu log(2p/α)/n||Um||2 2,n cu(1 δ)(1 ω)λ2 n > 4λ2 n and thus (γm jm)2+2λn|γm| < 0. Therefore, V m+r 2,n is strictly decreasing when m m2. Case (b.1): Suppose m 1 < m2: By same arguments as in the proof of Lemma 13, we have V m +1 + r 2,n p ||V 0 + r|| 1 1+ζ (cφ) ψ 2,n s log(2p/α) 2(1+ζ (cφ) ψ ) for any ψ > 0 since V m + r 2,n is strictly decreasing for m m2. As r 2,n s log(2p/α) 2 s log(2p/α) 2(1+ζ (cφ) ψ ) , it follows that ||V m 1+1||2,n ||V m 1+1 + r||2,n + r 2,n ||V m +1 + r||2,n + r 2,n p ||V 0 + r|| 1 1+ζ (cφ) ψ 2,n s log(2p/α) 2(1+ζ (cφ) ψ ) + s log(2p/α) 2(1+ζ (cφ) ψ ) p ||V 0 + r|| 1 1+ζ (cφ) ψ 2,n s log(2p/α) 2(1+ζ (cφ) ψ ) Luo, Spindler and Kueck which concludes the result. Case (b.2): Suppose m 1 m2: We show that this leads to a contradiction. First of all, we show that m2 m . We know that ||Um||2 2,n = σ2 n + ||V m + r||2 2,n + 2 < V m + r, ε >n for any m. Since ||Um2+1||2 2,n (1 δ)σ2 n, it holds that 2 < V m2+1 + r, ε >n +||V m2+1 + r||2 2,n δσ2 n. (45) Let s prove m2 m by contradiction. Suppose m2 < m , it follows that m2+1 m +1 < M0. 
Since we know that ||V m2+1||2,n > η m2 + 1 + sλn, it holds ||V m2+1 + r||2 2,n > 1 Cr 2 V m2+1 2 2,n. Also, | < r, ε >n | r 2,nσn 0 as n . Therefore, for η being a large enough fixed constant and n being large enough, we have that: ||V m2+1 + r||2 2,n + 2 < V m2+1, ε >n 2 ||V m2+1||2 2,n 2 m2 + 1 + s cφ λn V m2+1 2,n > 0 and 2 < r, ε >n> δσ2 n which leads to a contradiction with (45). Thus, it must hold that m2 m . Therefore, ||V m + r||2 2,n ||V m +1 + r||2 2,n 1 + Cr 2 η(m + s + 1)λ2 n for any m + 1 m m2 + 1. We also know that by assumption, (γm)2 cu log(2p/α)/n Um 2 2,n cu log(2p/α)/n(1 δ)σ2 n cu log(2p/α)/n(1 δ)(1 ω)σ2 cu(1 δ)(1 ω)λ2 n for any m m2 m 1. Since ||V m + r||2 2,n ||V m+1 + r||2 2,n = (γm)2 2γm < Xjm, ε >n cu1λ2 n > 0 for some constant cu1 > 0 if (1 δ)(1 ω)cu/4 > 4, it follows that ||V m + r||2 2,n ||V m+1 + r||2 2,n cu1λ2 n for m = m , m + 1, . . . , m2. Consequently, ||V m2+1 + r||2 2,n ||V m +1 + r||2 2,n. By assumption, at the (m2 + 1)th step, we know that ||Um2+1||2 2,n (1 δ)σ2 n. Hence, 2 < V m2+1 + r, ε >n + V m2+1 + r 2 2,n δσ2 n. (46) High-Dimensional L2-Boosting However, ||V m2+1 + r||2 2,n ||V m +1 + r||2 2,n, and therefore, by Lemma 13, 2| < V m2+1 + r, ε >n | ||V m2+1 + r||2,nσn ||V m +1 + r||2,nσn 0, which contradicts (46) when n is large enough. Therefore, Case (b.2) can not happen, and the proof is concluded. Proof [Proof of Theorem 17] WLOG., we can assume that CF log n is an integer, so that Fk is a positive integer for k = 1, 2, . . . . For some constant C > 0 that is large enough, define k := C s log n Ln . We first show that, if we run the algorithms forever, we have with probability 1 α: En h x i(βmk β0) 2i p s log(n) log(p) En[ βmk β0 2 2] p s log(n) log(p) Define ˆT m as the set of variables that are selected by the end of time period m. It is worth noting that | ˆT mk| k Lk with Lk = Ln for all k Z+ by construction. Define ˇmk := mk 1 + Lk. Therefore, | ˆT mk| k Ln C s log n, for all k k := C s log n Ln . As in Lemma 24, Pm,T denotes the operator of performing m periods of PBA algorithm subject to the subset T. Further, define Rk := (P ˆT ˇ mk PFk, ˆT ˇ mk)U ˇmk. By Lemma 24, we have that Rk 2 2,n = (1 PFk, ˆT ˇ mk)U ˇmk (1 P ˆT ˇ mk)U ˇmk 2 2,n P ˆT ˇ mk U ˇmk 2 2,n exp for Fk 2 cφ | ˆT ˇmk| log(n y 2,n) starting at U ˇmk. Note that Rk is a small term that could be neglected later with y 2,n = U0 2 2,n Um 2 2,n being a decreasing sequence in m. Note that U ˇmk = y x β ˇmk, Therefore, P ˆT ˇ mk U ˇmk = P ˆT ˇ mky in the projection of vector y on the columns of X with indices in T ˇmk. As a result, we have Umk = (I P ˆT ˇ mk)U ˇmk + Rk = (I P ˆT ˇ mk)y + Rk (47) with mk = ˇmk + Fk. Define e V k := V mk = X(β βmk) and e V k o = Xβ P ˆT ˇ mk Xβ = (I P ˆT ˇ mk)X ˆT ˇ mkβ ˆT ˇ mk. Therefore, e V k = Xβ (y Umk) = (r + ε) + (I P ˆT ˇ mk)y + Rk = (I P ˆT ˇ mk)Xβ P ˆT ˇ mk(r + ε) + Rk It implies that e V k = e V k o + Rk PT ˇ mk(r + ε). (48) By definition, e V k o 2 2,n is a decaying sequence in k as T ˇmk T ˇmk+1 holds for all k 0. We need the following proposition: Luo, Spindler and Kueck Proposition 27 Suppose all conditions in Theorem 17 hold. For Mn large enough, there exists an absolute constant Ck > 0 such that for some k Cks log n/Ln and k(Ln + 1) < Mn, we have that: En[ x i(βmk β) 2] p s log(n) log(p) βmk β 2 2 p s log(n) log(p) Proof [Proof of Proposition 27] Assume that k k with k = C s log n Ln . Define e T mk := T0\ ˆT mk. That said, e T mk is the set of variables that are in T0 but not yet being selected by time mk. 
Then, if e T mk = , then, all the variables in T0 are already selected. By definition, V mk = y Umk Xβ = y (I PFk, ˆT ˇ mk)U ˇmk X ˆT ˇ mkβ ˆT ˇ mk. It implies that: V mk 2 2,n = X ˆT ˇ mkβ ˆT ˇ mk y + (I PFk, ˆT ˇ mk)U ˇmk 2 2,n 2 X ˆT ˇ mkβ ˆT ˇ mk y + (I P ˆT ˇ mk)U ˇmk 2 2,n + 2 (P ˆT ˇ mk PFk, ˆT ˇ mk)U ˇmk 2 2,n 2 P ˆT ˇ mk(rn + ε) 2 2,n + 2 (P ˆT ˇ mk PFk, ˆT ˇ mk)U ˇmk 2 2,n (4Cr log(2p/α) + 4σ2C /cφ log(2p/α) log(n))s + 1 by Lemma 25 since ˆT ˇmk C s log(n) and by Lemma 24 since Fk > CF | ˆT ˇmk| log n for some CT large enough, which concludes the result. Now, suppose e T mk = for k = 0, 1, ..., k . Recall that Umk := (I P ˆT ˇ mk)y + Rk, with Rk 2,n 1 V mk = (I P ˆT ˇ mk)Xβ P ˆT ˇ mk(r + ε) + Rk = (I P ˆT ˇ mk)X e T ˇ mkβ e T ˇ mk. Suppose that V mk 2 2,n > CV s log(n) log(2p/α) n for all k Cks log(n) Ln , where Ck CM is a large enough constant. For CV large enough, we have that V m 2,n > q Crcφs log(2p/α) n . By (48) and Lemma 25, it follows that e V k o 2,n V mk 2,n P ˆT ˇ mk(r + ε) 2,n Rk 2,n CV s log(2p/α) log(n) 2Crs log(2p/α) + 2σ2/cφ| ˆT ˇmk| log(2p/α) CV /2s log(2p/α) log(n) given that CV > 2(2Cr + 2σ2/cφCk + 1) is large enough fixed constant where we used that | ˆT ˇmk| Lnk Cks log(n). Define ωn := c2 φ 4Ln C3 φ . Next, we divide our discussions into two High-Dimensional L2-Boosting Case (A1): P ˇmk t=mk 1+1 γ2 jt ωn e V k 1 o 2 2,n. By definition, we have that: Umk 2 2,n U ˇmk 2 2,n Umk 1 2 2,n t=mk 1+1 γ2 jt Umk 1 2 2,n ωn e V k 1 o 2 2,n. (51) Case (A2): P ˇmk t=mk 1+1 γ2 jt ωn e V k 1 o 2 2,n. Define δm,k := Pm t=mk 1+1 γjtejt, where ejt is the unit vector that is equal to 1 only at the jt-entry, and 0 otherwise, for m mk 1 and m ˇmk 1. It holds δm,k 2 2,n Ln t=mk 1+1 γ2 jt Lnωn e V k 1 o 2 2,n. (52) As a result, by definition, for any m = mk 1, ..., ˇmk 1, we have that: |γjm| β e T mk 1 1 < Um, X e T mk 1β e T mk 1 >n = < (I P ˆT mk 1)X e T mk 1β e T mk 1, X e T mk 1β e T mk 1 >n | {z } ΨV,1 + < (I P ˆT mk 1)(r + ε), X e T mk 1β e T mk 1 >n | {z } ΨV,2 + < Rk, X e T mk 1β e T mk 1 >n | {z } ΨV,3 + < Xδm,k, X e T mk 1β e T mk 1 >n | {z } ΨV,4 Um = Umk + Xδm,k = (I P ˆT mk 1 + Rk)Y + Xδm,k = (I P ˆT mk 1)X e T mk 1β e T mk 1 + (I P ˆT mk 1)(r + ε) + Rk + Xδm,k. For ΨV,1, we have that: ΨV,1 = (I P ˆT mk 1)X e T mk 1β e T mk 1 2 2,n. (53) There exists a δ Rp with supp(δ) ˆT mk 1 such that Xδ = P ˆT mk 1X e T mk 1β e T mk 1. By definition, e T mk 1 ˆT mk 1 = , we have that: ΨV,1 = Xδ X e T mk 1β e T mk 1 2 2,n cφ( β e T mk 1 2 + δ 2) cφ β e T mk 1 2 cφ Cφ X e T mk 1β e T mk 1 2. (54) By (50), we have that X e T mk 1β e T mk 1 2 2,n e V k 1 o 2 2,n CV s log(2p/α) log(n) Luo, Spindler and Kueck For ΨV,2, we have that: |ΨV,2| = | < (I P ˆT mk 1)r, X e T mk 1β e T mk 1 >n + < ε, X e T mk 1β e T mk 1 >n < P ˆT mk 1ε, X e T mk 1β e T mk 1 >n | r 2,n X e T mk 1β e T mk 1 2,n + β e T mk 1 | e T mk 1| 1 2 λn + cφ λn X e T mk 1β e T mk 1 2,n 4Cr CV log(n) + σ2Cφ CV log(n) + X e T mk 1β e T mk 1 2 2,n. (55) For ΨV,3, we have that: ΨV,3 Rk 2,n X e T mk 1β e T mk 1 2,n 2 sn log(2p/α) log(n)CV X e T mk 1β e T mk 1 2 2,n. (56) For ΨV,4, we have that: ΨV,4 Xδm,k 2,n X e T mk 1β e T mk 1 2,n p CφLnωn X e T mk 1β e T mk 1 2 2,n. (57) Combining (54), (55), (56) and (57), we have that: |γjm| β e T ˇ mk 1 1 cφ 4Cr CV log(n) σ2Cφ CV log(n) 2 sn log(2p/α) log(n)CV p CφLnωn X e T mk 1β e T mk 1 2 2,n 4Cφ X e T mk 1β e T mk 1 2 2,n, (58) given that CV > 17C2 φσ2Ck c3 φ and n being large enough. 
As a result, we have that: X e T mk 1β e T mk 1 2 2,n β e T mk 1 1 cφ X e T mk 1β e T mk 1 2 2,n s β e T mk 1 c3/2 φ 4Cφ s X e T mk 1β e T mk 1 2,n, (59) where the last inequality follows from the sparse eigenvalue condition and | e T mk 1| s. Therefore, we have that: m=mk 1+1 |γjm|2 Lnc3 φ 16C2 φs X e T mk 1β e T mk 1 2 2,n Lnc3 φ 16C2 φs e V k 1 o 2 2,n. (60) Combining both cases (A1) and (A2), we have that: Umk 2 2,n Umk 1 2 2,n κn e V k 1 o 2 2,n (61) High-Dimensional L2-Boosting where κn := min c2 φ 4Ln C3 φ , Lnc3 φ 16C2 φs s with κ := min c2 φ 4K2 LC3 φ , c3 φ 16C2 φ given that Ln KL s with KL being a fixed constant, and hence, κ > 0 is a fixed constant. Define q := s Ln as the smallest integer s Ln . WLOG., we can assume that q = s Ln for simplicity, i.e., s Ln is an integer. By applying (61) multiple times, we have that: Umk+q 2 2,n Umk 2 2,n κLn l=k e V l o 2 2,n. (62) We again divide our analysis into two cases: Assume that k k + q Cs log(n) Case (B1): e V k+q 1 o 2 2,n > 1 2 e V k o 2 2,n. It implies that e V l o 2 2,n 1 2 e V k o 2 2,n for all l = k, . . . , k + q 1 as e V l o 2 2,n = (I P ˆT ˇ ml)Xβ 2 2,n is a decreasing sequence in l. It follows that Umk+q 2 2,n Umk 2 2,n κLn l=k e V l o 2 2,n Umk 2 2,n κq Ln 2s e V k o 2 2,n = Umk 2 2,n κ 2 e V k o 2 2,n. Recall that by (47), we have: Umk = e V k o + (I P ˆT ˇ mk)(r + ε) + Rk. It implies that e V k+q o 2 2,n 1 κ e V k o 2 2,n + 2 < e V k o , (I P ˆT ˇ mk)(r + ε) >n 2 < e V k+q o , (I P ˆT ˇ mk+q )(r + ε) >n + (I P ˆT ˇ mk)(r + ε) 2 2,n (I P ˆT ˇ mk+q )(r + ε) 2 2,n + Rk 2 2,n Rk+q 2 2,n + 2 < e V k o + (I P ˆT ˇ mk)(r + ε), Rk >n 2 < e V k+q o + (I P ˆT ˇ mk+q )(r + ε), Rk+q >n . By (55), replacing k 1 by k, for n large enough, we have 2 < e V k o , (I P ˆT ˇ mk)(r + ε) >n 2 < e V k+q o , (I P ˆT ˇ mk+q )(r + ε) >n 4Cr CV log(n) + σ2Cφ CV log(n) + X e T mkβ e T mk 2 2,n cφCV e V k o 2 2,n κ 12 e V k o 2 2,n (63) Luo, Spindler and Kueck given that CV > 64 36 2C2 φσ2Ck c3 φκ2 . Here we used that e V k o := (1 P ˆT mk)Xβ = (1 P ˆT mk)X e T mkβ e T mk as e T mk = T0\ ˆT mk. Again, we can find a δ with supp(δ) ˆT mk such that Xδ = P ˆT mk X e T mkβ e T mk. Hence, the supports of δ and β e T mk have no intersection and we conclude e V k o 2 2,n cφ( δ 2 + β e T mk 2) cφ β e T mk 2 cφ Cφ X e T mkβ e T mk 2 2,n. By Lemma 25 and (50), for n large enough, we have that: (I P ˆT ˇ mk)(r + ε) 2 2,n (I P ˆT ˇ mk+q )(r + ε) 2 2,n (64) = P ˆT ˇ mk(r + ε) 2 2,n + P ˆT ˇ mk+q (r + ε) 2 2,n Crs log(2p/α) + σ2/cφ log(2p/α)| ˆT ˇmk+q| cφCV e V k o 2 2,n κ 12 e V k o 2 2,n (65) given that CV 60Ckσ2 cφκ . And for n large enough, for fixed CV , by (50), we have that: Rk 2 2,n Rk+q 2 2,n + 2 < e V k o + (I P ˆT ˇ mk)(r + ε), Rk >n 2 < e V k+q o + (I P ˆT ˇ mk+q )(r + ε), Rk+q >n n e V k o 2,n + 4 n r + ε 2,n κ 12 e V k n 2 2,n. (66) Combining (63), (65) and (66), we have that: e V k+q o 2 2,n 1 κ e V k o 2 2,n. (67) Hence, we have shown exponential decay of e V k o 2 2,n in Case (B1). Case (B2): e V k+q o 2 2,n 1 2 e V k o 2 2,n. This automatically proves exponential decay. Note that κ 1 4 as Cφ cφ. Combining Case (B1) and Case (B2), we have that e V k+q o 2 2,n 1 κ e V k o 2 2,n. Therefore, for any kq Cks log(n), we have that e V kq o 2 2,n e V 0 o 2 2,n 1 κ Since q = s Ln , let k = Ck Ln log(n) . WLOG., assume that Ck Ln log(n) is an integer so that k = Ckn log(n). 
For n large enough and Ck 4K Lnκ, we have that e V kq o 2 2,n n Ck Ln ln(1 κ 4 ) e V 0 o 2 2,n n Ck Lnκ/4n K 1 < CV s log(n) log(p) High-Dimensional L2-Boosting given that e V 0 o 2 2,n n K 1 for some fixed constant K > 0. This leads to a contradictory to (50). Therefore, for given constant Ck 4K Lnκ, there must exists k Cks log(n) Ln such that V mk 2 2,n CV s log(n) log(2p/α) which concludes our proof. By construction of Algorithm 2, we have that Um+1 2 2,n = Um 2 2,n γ2 jm for all non-negative integers m. Therefore, Um+1 2 2,n Um 2 2,n = 1 γ2 jm Um 2 2,n . (71) It implies that the algorithm will not stop if γ2 jm Um 2 2,n Cu log(2p/α) Suppose that the algorithm does not stop when k C s log n Ln k M := CMs log n Ln with CM defined in Assumption 1. If C > Ck, by Proposition 27, it holds En h x i(βmk β) 2i p s log(n) log(p) for some k Cks log n/Ln. By Proposition 27, there also exists a positive constant CV such that e V k 2,n CV s log(2p/α) log(n) with k C s log n/Ln. It implies that | ˆT ˇmk | = | ˆT mk | C s log n. Therefore, by Lemma 25, we have that (I P ˆT ˇ mk )Xβ 2,n e V k 2,n + P ˆT ˇ mk (r + ε) 2,n + Rk 2,n 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 r s log n log(2p/α) It implies that for any k k and k k M, we have that e V k 2,n (I P ˆT ˇ mk)Xβ 2,n + P ˆT ˇ mk(r + ε) 2,n + Rk 2,n (I P ˆT ˇ mk )Xβ 2,n + 2(Crs log(2p/α) + kσ2/cφ log(2p/α)s log n) Ck + 1 + 2(Cr + kσ2/cφ) 1 2 r s log n log(2p/α) Luo, Spindler and Kueck where C k := (C 1 2 V + (2Cr + σ2/cφC ) 1 2 + 1). By definition, with s log n log(2p/α) n 0, for n large enough, for arbitrarily small η > 0, we have that Umk 2,n = (ε + r) + e V k 2,n (ε + r) 2,n e V k 2,n p (1 2η)σ2 (75) and Umk 2,n = (ε + r) + e V k 2,n (ε + r) 2,n + e V k 2,n p (1 + 2η)σ2. (76) By definition, since Umk 2,n is a decreasing sequence, we have that: γ2 jm CU log(2p/α) n Um 2 2,n CU(1 2η)σ2 log(2p/α) As a result, we have that: Umk 2 2,n Umk 2 2,n X mk m mk 1, lm=1 γ2 jm Umk 2 2,n (k k )Ln CU(1 2η)σ2 log(2p/α) Therefore, it implies that for any k k , we have: Umk 2 2,n Umk 2 2,n = (I P ˆT ˇ mk)y + Rk 2 2,n (I P ˆT ˇ mk )y + Rk 2 2,n = (I P ˆT ˇ mk)y 2 2,n (I P ˆT ˇ mk )y 2 2,n | {z } Ψ1 + 2 < (I P ˆT ˇ mk)y, Rk >n < (I P ˆT ˇ mk )y, Rk >n | {z } Ψ2 + Rk 2 2,n Rk 2 2,n | {z } Ψ3 For Ψ1, we have that: Ψ1 = (I P ˆT ˇ mk)y 2 2,n (I P ˆT ˇ mk )y 2 2,n = (I P ˆT ˇ mk)Xβ 2 2,n (I P ˆT ˇ mk )Xβ 2 2,n | {z } Ψ11 2 < (P ˆT ˇ mk P ˆT ˇ mk )Xβ, (r + ε)) >n | {z } Ψ12 + P ˆT ˇ mk (r + ε) 2 2,n P ˆT ˇ mk(r + ε) 2 2,n | {z } Ψ13 For Ψ11, by (74), we have that: Ψ11 (I P ˆT ˇ mk )Xβ 2 2,n C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) High-Dimensional L2-Boosting For Ψ12, we know that (P ˆT ˇ mk P ˆT ˇ mk )Xβ = (I P ˆT ˇ mk )Xβ (I P ˆT ˇ mk)Xβ. It is worth noting that (I P ˆT ˇ mk)Xβ is a linear combination of columns of X with indices in Tk := ˆT ˇmk T0 where T0 := supp(β). Therefore, we can find a ζk Rp with supp(ζk) Tk, with Xζk = (I P ˆT ˇ mk)Xβ. 
Since (I P ˆT ˇ mk)Xβ 2 2,n (I P ˆT ˇ mk )Xβ 2 2,n 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) we have that Xζk 2 2,n C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) By sparse eigenvalue condition in Assumption A.1, it implies that 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 2 s log n log(2p/α) Ψ12 = 2 < Xζk Xζk , (r + ε) >n = 2 < Xζk Xζk , r >n +2 < ζk ζk , XTk ε >n 2( Xζk 2,n + Xζk 2,n) r 2,n 2( ζk 2,n + ζk 2,n)|Tk | 1 2 λn 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 s log(n) 1 2 log(2p/α) 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1 s log(n) log(2p/α) For Ψ13, by Lemma 25, we have that: Ψ13 P ˆT ˇ mk(r + ε) 2 2,n 2 Crs log(2p/α) + σ2/cφ log(2p/α)k Ln Combining (79), (81) and (82), we have that: Ψ1 CΨ1 s log n log(2p/α) k Ln log(2p/α) where CΨ1 := (4 Cr + 2Cr + 4σ C / cφ + (C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1))(C 1 2 V + (2Cr + 2σ2/cφC ) 1 2 + 1) is a fixed constant. For Ψ2 and Ψ3, since Rk , Rk 1 n, and by (76), we have that for n large enough: Ψ2 4(1 + 2η)σ Luo, Spindler and Kueck Therefore, we have that: Umk 2 2,n Umk 2 2,n CΨ1 s log n log(2p/α) k Ln log(2p/α) n 4(1 + 2η)σ Recall that in equation (77), we have that Umk 2 2,n Umk 2 2,n (k k )Ln CU(1 2η)σ2 log(2pα) Therefore, combining (85) and (86), we have that (k k )Ln CU(1 2η)σ2 log(2pα) CΨ1 s log n log(2p/α) k Ln log(2p/α) n + 4(1 + 2η)σ and by assumption that CU > 4 cφ , for small enough η > 0, it holds CU(1 2η) 4 Since k C s log n Ln , it implies that for n large enough, we have that k k CU(1 2η) + CΨ1s log n/Ln + 1 cφ C s log n where C := C CU(1 2η)+CΨ1+1 cφ s log n/Ln. By assumption that C CM, all the analysis in the above that uses the sparse eigenvalue condition applies. Therefore, the algorithm must stop before mk with k C . Therefore, when we stop at m , we have that CU log(2p/α) CU log(2p/α) n ( V m 2,n + (r + ε) 2,n). (87) By Lemma 26, we have that cφ (| ˆT m | + s) V m 2,n Crcφs log 2p/α n(| ˆT m | + s) λn. It implies that CU (| ˆT m | + s) log(2p/α) CU (| ˆT m | + s) log(2p/α) n (2 + 2η)σ + Crcφs log(2p/α) (| ˆT m | + s)λn. Since | ˆT m | C s log(n) and s log(n) log(p) n 0, we have that for n large enough, s log(n) log(2p/α) CUC (2 + 2η)σ + p High-Dimensional L2-Boosting for C C + 1. It follows that V m 2 2,n 2 CUC (2 + 2η)σ + p Crcφ + σC 2 s log(n) log(2p/α) CUC (2 + 2η)σ + p Crcφ + σC 2 s log(n) log(2p/α) follows by sparse eigenvalue condition which concludes the proof. Appendix D. Discussion of the equi-correlated design In the equi-correlated design, we assume to have a set of predictors X1, . . . , Xp, such that Covn(Xi, Xj) = corrn(Xi, Xj) = ρ (0, 1) for i = j, where Covn represents the empirical covariance and corrn the empirical correlation, respectively. By assumption, all Xj, j = 1, 2, . . . , p, are standardized with mean zero and variance one. For the case Cov(Xi, Xj) = ρ, the results are similar and can be well approximated by the case Covn(Xi, Xj) = ρ. To analyze the revisiting behavior of the PGA algorithm in model (3) in the main text, it is sufficient to consider Xαm = X(β βm), which is the approximation error at the mth boosting step. It is worth noting that Covn(Xαm, Xj) = X i T m αm i ρ, if j / T m, (88) Covn(Xαm, Xj) = X i T m αm i ρ + αm j (1 ρ), if j T m, (89) where T m := T supp(βm) with T := supp(β). WLOG., one can assume that P i T m αm i 0. We divide our analysis into two cases: Case (a): If P i T m αm i > 0, there must be a j T m such that αm j > 0. Due to Equations (88) and (89), this implies |Covn(Xαm, Xj)| > |Covn(Xαm, Xl)| for all l / ˆT m since αm j (1 ρ) > 0 and P i T m αm i ρ > 0. 
Therefore, the algorithm must select one predictor from $T^m$. Case (b): If $\sum_{i \in T^m} \alpha^m_i = 0$, the current approximation error $X\alpha^m$ has correlation 0 with all variables $j$ that are not in $T^m$, see Equation (88). As a result, the algorithm must either (b1) select a predictor in $T^m$, if not all $\alpha^m_i$, $i \in T^m$, are zero, or (b2) stop, since all $\alpha^m_i$ are zero. In summary, the algorithm always selects a predictor from the existing set $T^m$ unless it stops because the approximation error is already zero. Therefore, $T = T^0 = T^1 = \cdots = T^m$. Hence, we have 100 percent revisiting, indicating that $\zeta^*(c_\phi) \to \infty$.

Appendix E. Applications

In this section, we present applications from different fields to illustrate the boosting algorithms. The applications demonstrate how the methods behave when applied to real data sets and allow a comparison with related methods, in particular Lasso. The focus is on prediction, which is an important task in many applications.

E.1 Riboflavin production

This application involves genetic data and analyzes the production of riboflavin. First, we describe the data set; then we present the results.

E.1.1 Data set

The data set has been provided by DSM (Kaiseraugst, Switzerland) and was made publicly available for academic research in Bühlmann et al. (2014) (Supplemental Material). The real-valued response/dependent variable is the logarithm of the riboflavin production rate. The covariables measure the logarithm of the expression levels of 4,088 genes (p = 4,088), which are normalized. This means that the covariables are standardized to have variance 1, and the dependent variable and the regressors are de-meaned, which is equivalent to including an unpenalized intercept. The data set consists of n = 71 observations which were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains and strains grown under different fermentation conditions were analyzed. For further details we refer to Bühlmann et al. (2014), their Supplemental Material, and the references therein.

E.1.2 Results

We analyze the data set on the production of riboflavin (vitamin B2) as follows. We split the data set randomly into a training set and a testing set, estimate the model with the different methods on the training set, and then use the testing set to calculate out-of-sample mean squared errors (MSE) in order to evaluate predictive accuracy. The size of the training set was 60; the remaining 11 observations were used for forecasting. The table below shows the MSE for the methods discussed in the previous sections.

Table 11: Results Riboflavin Production (out-of-sample MSE)

BA-our    oBA-our    Lasso     p-Lasso
0.3641    0.1080     0.1687    0.1539

All calculations were performed in R (R Core Team (2014)) with the package hdm (Chernozhukov et al. (2015)) and our own code. Replication files are available upon request. The results show, again, that orthogonal L2-Boosting outperforms Lasso and post-Lasso in this application.

E.2 Predicting Test Scores

E.2.1 Data Set

Here, the task is to predict the final score in the subjects Mathematics and Portuguese in secondary education. This is relevant, e.g., for identifying students who need additional support to master the material. The data contain both student grades and demographic, social, and school-related features and were collected using school reports and questionnaires.
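Both applications follow the same protocol: each estimator is fit on a training portion of the data and compared by out-of-sample MSE on held-out observations (a random train/test split for the riboflavin data, five-fold cross-validation for the test-score data). The following R sketch illustrates this protocol on simulated data, with a simple componentwise L2-Boosting routine stopped by a rule of the form analyzed in the proofs above (stop as soon as (γ^m)^2 falls below c_u log(2p/α)/n times ||U^m||^2_{2,n}) and Lasso fitted via glmnet. It is a schematic stand-in rather than our actual replication code; the data-generating process and all constants (e.g., c_u = 4.5) are illustrative choices.

```r
## Schematic out-of-sample comparison (ours): L2-Boosting with early stopping vs. Lasso.
library(glmnet)
set.seed(1)
n <- 100; p <- 500; s <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, s), rep(0, p - s))
y <- as.vector(X %*% beta + rnorm(n))
train <- 1:70; test <- 71:100

l2boost <- function(X, y, cu = 4.5, alpha = 0.05, max_steps = 500) {
  n <- nrow(X); p <- ncol(X)
  sdx <- sqrt(colMeans(X^2))              # ||X_j||_{2,n}
  Xs  <- sweep(X, 2, sdx, "/")            # columns normalized to ||X_j||_{2,n} = 1
  U   <- y; b <- numeric(p)
  thr <- cu * log(2 * p / alpha) / n      # c_u log(2p/alpha)/n, c_u illustrative
  for (m in 1:max_steps) {
    g <- as.vector(crossprod(Xs, U)) / n  # gamma_j^m = <U^m, X_j>_n
    j <- which.max(abs(g))                # greedy selection
    if (g[j]^2 < thr * mean(U^2)) break   # early stopping rule
    b[j] <- b[j] + g[j]                   # update selected coefficient
    U <- U - g[j] * Xs[, j]               # update residual
  }
  b / sdx                                 # coefficients on the original scale of X
}

b_boost   <- l2boost(X[train, ], y[train])
mse_boost <- mean((y[test] - X[test, ] %*% b_boost)^2)

cv_lasso  <- cv.glmnet(X[train, ], y[train])
mse_lasso <- mean((y[test] - predict(cv_lasso, newx = X[test, ], s = "lambda.min"))^2)

round(c(L2Boosting = mse_boost, Lasso = mse_lasso), 3)
```

For the real applications, the simulated X and y would simply be replaced by the training and test samples of the respective data sets.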
Two data sets are provided, one for each of the two subjects: Mathematics and Portuguese. The data set is available at the UCI Machine Learning Repository and was contributed by Paulo Cortez. The main reference for the data set is Cortez and Silva (2008).

E.2.2 Results

We employed five-fold cross-validation to evaluate predictive performance on this data set; the results remain stable when choosing a different number of folds. The data sets contain, for both test results, 33 variables, which are used as predictors. The data set for the Mathematics test scores contains 395 observations; the sample size for Portuguese is 649. The results confirm our theoretical derivations that boosting is comparable to Lasso.

Table 12: Prediction of education (out-of-sample MSE)

subject        BA-our    oBA-our    Lasso    p-Lasso
Mathematics    19.1      19.3       18.4     18.4
Portuguese     8.0       7.9        7.8      7.8

References

Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A. De Vore. Approximation and learning by greedy algorithms. Ann. Statist., 36(1):64–94, 02 2008. doi: 10.1214/009053607000000631. URL http://dx.doi.org/10.1214/009053607000000631.

A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012. ISSN 1468-0262. doi: 10.3982/ECTA9626. URL http://dx.doi.org/10.3982/ECTA9626.

Alexandre Belloni and Victor Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 05 2013. doi: 10.3150/11-BEJ410. URL http://dx.doi.org/10.3150/11-BEJ410.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference for high-dimensional sparse econometric models. Advances in Economics and Econometrics, 10th World Congress of Econometric Society, August 2010, III:245–295, 2010. ArXiv, 2011.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 11 2013. ISSN 0034-6527. doi: 10.1093/restud/rdt044. URL https://doi.org/10.1093/restud/rdt044.

Alexandre Belloni, Victor Chernozhukov, and Ying Wei. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics, 34(4):606–619, 2016. ISSN 0735-0015. URL http://www.jstor.org/stable/44166593.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. ISSN 0885-6125. doi: 10.1007/BF00058655. URL http://dx.doi.org/10.1007/BF00058655.

Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–824, 1998.

Peter Bühlmann. Boosting for high-dimensional linear models. The Annals of Statistics, 34(2):559–583, 2006.

Peter Bühlmann and Torsten Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477–505, 2007. doi: 10.1214/07-STS242. URL http://dx.doi.org/10.1214/07-STS242. With discussion.

Peter Bühlmann and Bin Yu. Boosting with the L2 Loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003. ISSN 0162-1459. URL http://www.jstor.org/stable/30045243.

Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.
doi: 10.1146/annurev-statistics-022513-115545. URL http://dx.doi.org/10.1146/annurev-statistics-022513-115545.

Emmanuel Candès and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.

Victor Chernozhukov, Christian Hansen, and Martin Spindler. hdm: High-Dimensional Metrics, 2015. R package version 0.1.

Paulo Cortez and Alice Maria Gonçalves Silva. Using data mining to predict secondary school student performance. 2008.

Victor H. de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized Processes. Probability and its Applications (New York). Springer-Verlag, Berlin, 2009. ISBN 978-3-540-85635-1. Limit theory and statistical applications.

R. A. De Vore and V. N. Temlyakov. Some remarks on greedy algorithms. Advances in Computational Mathematics, 5(1):173–187, 1996. ISSN 1572-9044. doi: 10.1007/BF02124742. URL http://dx.doi.org/10.1007/BF02124742.

Robert M. Freund, Paul Grigas, and Rahul Mazumder. A new perspective on boosting in linear regression via subgradient optimization and relatives. Annals of Statistics, 2016.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28:337–407, 2000. doi: 10.1214/aos/1016218223. URL http://projecteuclid.org/euclid.aos/1016218223. With discussion.

Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001. ISSN 0090-5364. URL http://www.jstor.org/stable/2699986.

Ching-Kang Ing. Model selection for high-dimensional linear regression with dependent observations. The Annals of Statistics, 48(4):1959–1980, 2020.

Ching-Kang Ing and Tze Leung Lai. A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica, 21(4):1473–1513, 2011.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01.

S. V. Konyagin and V. N. Temlyakov. Rate of convergence of pure greedy algorithm. East J. Approx., 5(4):493–499, 1999. ISSN 1310-6236.

Jannis Kueck, Ye Luo, Martin Spindler, and Zigan Wang. Estimation and inference of treatment effects with L2-boosting in high-dimensional settings. Journal of Econometrics, 234(2):714–731, 2023.

Tze Leung Lai and Hongsong Yuan. Stochastic Approximation: From Statistical Origin to Big-Data, Multidisciplinary Applications. Statistical Science, 36(2):291–302, 2021. doi: 10.1214/20-STS784. URL https://doi.org/10.1214/20-STS784.

E. D. Livshitz and V. N. Temlyakov. Two lower estimates in greedy approximation. Constructive Approximation, 19(4):509–523, 2003. ISSN 1432-0940. doi: 10.1007/s00365-003-0533-6. URL http://dx.doi.org/10.1007/s00365-003-0533-6.

S. G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. Trans. Sig. Proc., 41(12):3397–3415, Dec 1993. doi: 10.1109/78.258082. URL http://dx.doi.org/10.1109/78.258082.

Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, 27(4):538–557, 2012. doi: 10.1214/12-STS400. URL https://doi.org/10.1214/12-STS400.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.

Bernhard Stankewitz. Early stopping for L2-boosting in high-dimensional linear models. The Annals of Statistics, 52(2):491–518, 2024. doi: 10.1214/24-AOS2356. URL https://doi.org/10.1214/24-AOS2356.

Vladimir Temlyakov. Greedy Approximation. Cambridge University Press, New York, NY, USA, 1st edition, 2011. ISBN 1107003377, 9781107003378.

V. N. Temlyakov. Weak greedy algorithms. Advances in Computational Mathematics, 12(2):213–227, 2000. ISSN 1572-9044. doi: 10.1023/A:1018917218956. URL http://dx.doi.org/10.1023/A:1018917218956.

Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electron. J. Statist., 3:1360–1392, 2009. doi: 10.1214/09-EJS506. URL http://dx.doi.org/10.1214/09-EJS506.

T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33:1538–1579, 2005. doi: 10.1214/009053605000000255. URL http://projecteuclid.org/euclid.aos/1123250222.