Journal of Machine Learning Research 24 (2023) 1-30    Submitted 12/22; Revised 9/23; Published 10/23

Dimension Reduction and MARS

Yu Liu (LIUYUCHINA123@GMAIL.COM), School of Mathematical Sciences, University of Electronic Science and Technology of China, China

Degui Li (DEGUI.LI@YORK.AC.UK), Department of Mathematics, University of York, UK

Yingcun Xia (STAXYC@NUS.EDU.SG), Department of Statistics and Data Science, National University of Singapore, Singapore, and School of Mathematical Sciences, University of Electronic Science and Technology of China, China

Editor: Rajarshi Guhaniyogi

© 2023 Yu Liu, Degui Li and Yingcun Xia. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1422.html.

Abstract

The multivariate adaptive regression spline (MARS) is one of the most popular estimation methods for nonparametric multivariate regression. However, as MARS is based on marginal splines, products of the marginal splines must be used to incorporate interactions of covariates, which often leads to an unmanageable number of basis functions when the order of interaction is high and results in low estimation efficiency. In this paper, we improve the performance of MARS by using linear combinations of the covariates which achieve sufficient dimension reduction. The special basis functions of MARS facilitate calculation of gradients of the regression function, and estimation of these linear combinations is obtained via eigen-analysis of the outer product of the gradients. Under some technical conditions, the consistency property is established for the proposed estimation method. Numerical studies, including both simulations and empirical applications, show its effectiveness in dimension reduction and its improvement over MARS and other commonly used nonparametric methods in regression estimation and prediction.

Keywords: consistency, gradient estimation, multivariate adaptive regression spline, nonparametric regression, sufficient dimension reduction

1. Introduction

Nonparametric estimation is an effective tool in statistics and machine learning to capture a flexible nonlinear relationship between the response and explanatory variables, relaxing the pre-specified model structural assumptions required by parametric estimation methods. However, the extension of nonparametric regression estimation to the setting with multivariate regressors needs to be handled with care, as the number of observations required to achieve a given estimation accuracy increases exponentially with the dimension of the covariates, resulting in the so-called curse of dimensionality (e.g., Fan and Gijbels, 1996). To address this problem, we often have to restrict the class of multivariate regression functions so that only lower-dimensional nonparametric functions are to be estimated. Commonly used function classes include additive models (Hastie and Tibshirani, 1986), varying-coefficient models (Hastie and Tibshirani, 1993), partially linear models (Engle et al., 1986) and single-index models (Härdle et al., 1993). However, these restricted nonparametric estimation methods may have unstable numerical performance in practical data analysis when the regression function class is misspecified. Hence, it is imperative to develop a fully nonparametric multivariate estimation method that can reduce the curse of dimensionality but needs no restriction on the class of regression functions.
In nonparametric estimation, the regression function is often approximated by a linear expansion of base functions (e.g., Chapter 5 of Hastie et al., 2009). In the case of multivariate covariates, the required number of basis functions in the approximation may increase dramatically as the dimension of covariates increases. A commonly-used idea to design a feasible estimation algorithm is to control model complexity and thus limit the number of basis functions. This can be done by adaptively scanning the set of basis functions and selecting only those which contribute significantly to the model fitting. Among a long list of existing estimation algorithms, the multivariate adaptive regression spline (MARS, Friedman, 1991) is arguably the most popular one. It uses piecewise linear basis functions and can be viewed as a natural generalization of the stepwise linear regression approach. Because of the selection of splines in the estimation algorithm, MARS can also result in variable selection. MARS is well suited for high-dimensional nonparametric regression problems and can be further extended to tackle classification problems (e.g., Stone et al., 1997). Existing literature in statistical learning such as Hastie et al. (2009) usually implements the MARS algorithm directly without making any transformation or dimension reduction of the covariates. This may result in an unmanageable number of basis functions (if the level of model complexity or the order of interaction is high) and low estimation efficiency. In multivariate nonparametric regression, it is often the case that important features of multiple regressors are retrievable via low-dimensional projections. The low-dimensional sub-space is expected to retain all (or most of) the information provided by the covariates on the response, and is thus called the sufficient dimension reduction (SDR) space, which is first introduced by Li (1991). Aiming at dimension reduction for the conditional mean, which is more relevant to our main interest, a similar concept (central mean space) is also introduced by Cook and Li (2002). More recent developments on this topic can be found in Xia (2008), Chen et al. (2010), Yin and Li (2011), Fukumizu and Leng (2014), Luo et al. (2014), Ma and Zhu (2014), Wang et al. (2015), Yang et al. (2017), and Fertl and Bura (2022). This paper aims to combine SDR with MARS by incorporating linear combinations of covariates to improve the regression estimation. These linear combinations are the SDR directions or more precisely the central mean space of Cook and Li (2002) when the underlying model has a multiple-index structure and can effectively reduce the order of covariate interaction required in MARS and improve the estimation performance. As these linear combinations in MARS are dimension-reduced covariates, the proposed methodology is called dr MARS throughout the paper. The nonparametric estimation procedure developed in this paper includes two stages: (i) estimate the SDR space of the conditional mean; and (ii) modify MARS by incorporating these linear combinations of covariates (or SDR) to estimate the regression functions. The main technique in stage (i) is to conduct eigen-analysis of the outer-product of regression function gradient estimates and estimate the SDR directions by the eigenvectors correspond- DIMENSION REDUCTION AND MARS ing to the first few largest eigenvalues. 
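To fix ideas, the following is a minimal sketch of this two-stage procedure in R, assuming a numeric n × p design matrix X and a response vector y. MARS fits use the earth package; for brevity the gradient of the fitted surface is approximated here by finite differences of predict(), whereas the method developed in Section 2 differentiates the selected spline basis analytically; the function name drmars_sketch is illustrative only.

```r
library(earth)

drmars_sketch <- function(X, y, d, eps = 1e-4) {
  n <- nrow(X); p <- ncol(X)
  if (is.null(colnames(X))) colnames(X) <- paste0("x", seq_len(p))

  ## Stage (i): fit MARS on the original covariates and estimate gradients
  ## of the fitted regression surface (finite differences for simplicity).
  fit0 <- earth(x = X, y = y)
  grad_at <- function(x0) {
    sapply(seq_len(p), function(k) {
      xp <- x0; xm <- x0
      xp[k] <- xp[k] + eps; xm[k] <- xm[k] - eps
      fp <- predict(fit0, matrix(xp, 1, dimnames = list(NULL, colnames(X))))
      fm <- predict(fit0, matrix(xm, 1, dimnames = list(NULL, colnames(X))))
      as.numeric(fp - fm) / (2 * eps)
    })
  }
  grads <- t(apply(X, 1, grad_at))          # n x p matrix of gradient estimates

  ## Outer product of gradients and its leading eigenvectors (SDR directions).
  Sigma_G <- crossprod(grads) / n
  B_hat <- eigen(Sigma_G, symmetric = TRUE)$vectors[, seq_len(d), drop = FALSE]

  ## Stage (ii): refit MARS on the d projected covariates.
  Z <- X %*% B_hat
  colnames(Z) <- paste0("z", seq_len(d))
  fit1 <- earth(x = Z, y = y)
  list(B = B_hat, fit = fit1)
}
```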
In particular, we estimate the gradient via a linear basis expansion determined by MARS and further derive a sensible convergence property for the resulting estimates. With the MARS algorithm, this new gradient estimation is easy to implement, complementing other gradient estimation methods such as the local linear smoothing and reproducing kernel Hilbert space which have been extensively studied in the literature (e.g., Xia et al., 2002; Xia, 2008; Fukumizu and Leng, 2014). The dr MARS in stage (ii) incorporates the linear combinations of covariates, making it substantially different from the classic MARS in Friedman (1991). In particular, when a high-order interaction of covariates can be equivalently expressed as the multiple-index form, our dr MARS can significantly reduce the number of terms in the basis expansion and improve the estimation efficiency. As a simple example, (x1 + x2 + x3 + x4)3 has a third-order interaction when the conventional MARS is applied, but it has only a first-order interaction in the dr MARS if the linear combination is correctly identified. This is confirmed by our numerical studies, which also show the advantage of dr MARS even if the postulated model cannot reduce the order of interactions via the SDR-determined linear combinations of covariates. dr MARS inherits some nice features from MARS (such as the simple form of linear spline basis functions and selection of spline in the algorithm) and works well when the dimension of predictors is relatively large (see the simulation and empirical application). Under some technical conditions, we derive the consistency theory for the dr MARS estimation, complementing the existing asymptotic theory for the spline-based estimation (e.g., Stone, 1990, 1991; Zhou et al., 1998; Huang, 2003; Lin, 2013). Another work related to our approach is the random projection or random rotation (e.g., Blaser and Fryzlewicz, 2016; Cannings and Samworth, 2017; Bagnall et al., 2018). The random rotation is an ensemble procedure. It randomly selects the projections and estimates the model using the projected combinations of the variables as predictors for regression methods such as the random forest or support vector machine. Each set of projections thus generates a prediction. The final prediction is a weighted average of these predictions. In contrast, the rotation in our approach is based on the regression itself, i.e., SDR, and thus is more efficient for prediction. As we will show in the numerical studies, the rotation based on SDR has better estimation and prediction accuracy than the random rotation. The rest of the paper is organized as follows. Section 2 defines the SDR space, introduces the MARS-based estimation method, and develops the convergence properties of the estimates. Section 3 describes the dr MARS algorithm and its consistency theory. Sections 4 and 5 report the simulation studies and real data applications, respectively. Section 6 concludes the paper. Proofs of the main theorems are available in an appendix. Throughout the paper, for a vector u = (u1, , ud) , we define |u|q q = Pd i=1 |ui|q with q 1; for a d d matrix W = (wij)d d, we let W and W F be the spectral and Frobenius norms, respectively. 2. Estimation of SDR space via MARS Let Y and X be the response and p-dimensional vector of covariates, respectively. 
Assume the following multiple-index model structure:
$$G(x) = E(Y \mid X = x) = E\left(Y \mid B^\top X = B^\top x\right) = G_0\left(B^\top x\right), \qquad (1)$$
where $B$ is a $p \times d$ orthogonal matrix with $d$ smaller than $p$, $G(\cdot)$ is a multivariate nonparametric regression function on $\mathbb{R}^p$ and $G_0(\cdot)$ is a nonparametric link function on $\mathbb{R}^d$. It follows from model (1) that the projection of the $p$-dimensional $X$ onto the $d$-dimensional subspace $B^\top X$ retains all the information provided by $X$ for prediction of $Y$. Hence, the matrix $B$ determines the SDR directions (or the central mean subspace). The space spanned by $B$'s column vectors is called the SDR space. Letting $u = B^\top x$, by (1), we readily have that $G'(x) = B\, G_0'(u)$, where $G'$ and $G_0'$ are the gradient vectors. By Lemma 1 in Xia et al. (2002), the space spanned by $B$ is the same as that spanned by the eigenvectors of $\Sigma_G := E\left[G'(X) G'(X)^\top\right]$ corresponding to the largest $d$ eigenvalues, i.e., $\mathrm{span}(B) = \mathrm{span}(\beta_1, \ldots, \beta_d)$, where $\beta_j$ is the eigenvector of $\Sigma_G$ corresponding to the $j$-th largest eigenvalue. With a sample of observations $(Y_i, X_i)$, $i = 1, \ldots, n$, we estimate $\Sigma_G$ by the outer product of gradient estimates:
$$\widetilde{\Sigma}_G = \frac{1}{n} \sum_{i=1}^{n} \widetilde{G}'(X_i)\, \widetilde{G}'(X_i)^\top, \qquad (2)$$
where $\widetilde{G}'$ is a nonparametric estimate of the gradient $G'$. A natural estimate of $G'$ is via the local linear smoothing method (e.g., Fan and Gijbels, 1996). The estimate of $B$ can be obtained by subsequently conducting the eigen-analysis of $\widetilde{\Sigma}_G$ (e.g., Xia et al., 2002; Xia, 2008). However, local linear estimation is essentially a kernel-based local smoothing method which is sensitive to the choice of smoothing parameter, and it still suffers the curse of dimensionality when the dimension $p$ is large. Next, we propose an alternative technique to estimate $G'$ via MARS.

MARS is an adaptive estimation procedure using linear spline functions in the basis expansion. For the $k$-th covariate, we define the piecewise linear basis functions with knots taken from the set $\{t_{k,1}, \ldots, t_{k,n_k}\}$:
$$h^+_{k,j}(x_k) = (x_k - t_{k,j})_+ = \begin{cases} x_k - t_{k,j}, & \text{if } x_k > t_{k,j}, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$
$$h^-_{k,j}(x_k) = (x_k - t_{k,j})_- = (t_{k,j} - x_k)_+, \qquad (4)$$
which form reflected pairs for the $k$-th covariate at $t_{k,j}$, $j = 1, \ldots, n_k$. The collection of marginal basis functions for all the covariates is
$$\mathcal{C} = \left\{ h^+_{k,j},\, h^-_{k,j} : \ j = 1, \ldots, n_k,\ k = 1, 2, \ldots, p \right\}.$$
When each basis function depends only on a single covariate, the number of basis functions in $\mathcal{C}$ is $2 \sum_{k=1}^{p} n_k$, assuming all the knots are distinct. To incorporate interactions of covariates, we use tensor products of the basis functions in $\mathcal{C}$. Specifically, when the order of interaction is set to be $R$, a typical $R$-variate basis function is defined as
$$h_{k_1 j_1, \ldots, k_R j_R}(x_{k_1}, \ldots, x_{k_R}) = \prod_{r=1}^{R} h_{k_r, j_r}(x_{k_r}), \qquad (5)$$
where $h_{k,j}$ is a basis function from $\mathcal{C}$, $1 \le j_r \le 2 n_{k_r}$, and $1 \le k_1 \ne k_2 \ne \cdots \ne k_R \le p$. Note that the number of the $R$-variate basis functions increases dramatically as $p$ increases.

Suppose that the multivariate nonparametric regression function is approximated by the following form of basis expansion:
$$G(x) \approx G_m(x) := \theta_0 + \sum_{j=1}^{m} \theta_j h_j(x), \qquad (6)$$
where $h_j$ is either a basis function in $\mathcal{C}$ or a product of marginal basis functions as in (5), and $m$ is the number of basis functions, which may diverge to infinity. Here $G(x) \approx G_m(x)$ means that $G_m(x) \to G(x)$ as $m \to \infty$. The coefficients $\theta_j$, $j = 0, 1, \ldots, m$, are estimated by least squares as in standard linear regression. From (6), we further obtain the basis expansion for the gradient $G'$:
$$G'(x) \approx G'_m(x) := \sum_{j=1}^{m} \theta_j h'_j(x), \qquad (7)$$
where $h'_j$ is the gradient vector of $h_j$.
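To make the basis functions and the gradient expansion in (7) concrete, the following small R illustration codes a reflected pair from (3)-(4), a second-order tensor-product term as in (5), and their (sub)gradients; the function names are illustrative only, and the derivative value exactly at a knot is a convention.

```r
hinge_pos  <- function(x, t) pmax(x - t, 0)        # h^+_{k,j}(x) = (x - t)_+
hinge_neg  <- function(x, t) pmax(t - x, 0)        # h^-_{k,j}(x) = (t - x)_+
dhinge_pos <- function(x, t) as.numeric(x > t)     # derivative of (x - t)_+
dhinge_neg <- function(x, t) -as.numeric(x < t)    # derivative of (t - x)_+

## A second-order (R = 2) tensor-product term h(x) = (x_1 - t_1)_+ (t_2 - x_2)_+
## and its gradient, obtained from the product rule on the marginal splines.
h12  <- function(x, t1, t2) hinge_pos(x[1], t1) * hinge_neg(x[2], t2)
gh12 <- function(x, t1, t2) c(dhinge_pos(x[1], t1) * hinge_neg(x[2], t2),
                              hinge_pos(x[1], t1) * dhinge_neg(x[2], t2))
```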
When the order of interaction R is large (or even moderately large), it is practically infeasible to include all the R-variate basis functions. The real art of MARS is to provide an adaptive selection procedure including both the forward and backward stepwise algorithms to construct the basis functions with the linear spline functions in C. This adaptive selection reduces the number of basis functions in (6) while retains the model flexibility. We next briefly describe the MARS algorithm to determine the basis functions in (6) and (7). Start with the constant function h0(x) 1 and use the linear spline functions in C as the candidate functions. In each stage, let M be the set of basis functions which have been selected in the previous stages. Construct a new basis function from products of any basis function in M with one of the reflected pairs in C. This new term in the basis expansion has the following typical form: θ|M|+1hl(x)h+ k,j(xk) + θ|M|+2hl(x)h k,j(xk), hl M, h+ k,j, h k,j C, where θ|M|+1 and θ|M|+2 are the parameters to be estimated by least squares, and |M| denotes the cardinality of M. Add the products to the model approximation with the basis functions in M and choose the product which results in the largest decrease in the training estimation errors. Repeat the above process until the number of the selected basis functions reaches a pre-determined number M. As the number M is usually large, the model selected in the forward stepwise algorithm often overfits the data. Thus, a backward stepwise algorithm is needed to delete the term whose removal results in the smallest increase in the residual squared errors. Let h1, , h em be the basis functions selected by MARS and eh j be the gradient vector of ehj, j = 1, , em. We write e H(x) = h 1,eh1(x), ,eh em(x) i and e H (x) = h 0p,eh 1(x), ,eh em(x) i , where 0p is a p-dimensional zero vector. We use least squares to estimate the parameters in the final basis approximation: eα = (eα0, eα1, , eα em) = e H e H 1 e H Y, (8) LIU, LI AND XIA where e H = h e H(X1), , e H(Xn) i and Y = (Y1, , Yn) . Consequently the estimate of G (x) can be obtained by j=1 eαjeh j(x) = e H (x) e H e H 1 e H Y. (9) The above gradient estimate is then used to construct eΣG in (2). Letting eβj be the eigenvector of eΣG corresponding to the j-th largest eigenvalue, we obtain e B = (eβ1, , eβd), which will be shown to be a consistent estimate of B (subject to appropriate rotation); see Theorem 4 below. If the order of covariate interaction is set as R in MARS, we let H( ) be a vector containing all the basis functions which are either from C or tensor products of the marginal basis functions as in (5). Without loss of generality, we may write H( ) = h e H( ) , e H ( ) i , where e H ( ) is a vector of basis functions not selected by MARS. Let m H be the dimension of H( ) which is often much larger than em. It is worth pointing out that H( ) is a vector of deterministic functions which can be seen as the candidate basis functions in MARS, and m H is a non-random positive integer. We next study the convergence property for the MARS-based nonparametric estimates e G and eΣG, which requires the following technical conditions. Assumption 1 (i) Let (Yi, Xi), i = 1, , n, be independent and identically distributed (i.i.d.), and εi := Yi G(Xi) be zero-mean and homoskedastic, i.e., E(ε2 i |Xi) = σ2 > 0 almost surely (a.s.). (ii) The density function of Xi exists, and is bounded away from zero and infinity on a compact set X. Both G and G are continuous on X. 
(iii) The matrix Ω:= E H(Xi)H(Xi) is positive definite, and m H log m H = o(n). (iv) There exists eρ( ) satisfying eρ(u) 0 as u , such that G (x) e H (x) αo 2 = OP (eρ(m )) , αo = e H e H 1 e H G, conditional on em = m , where G = [G(X1), , G(Xn)] . (v) The matrix ΣG has full rank of d with positive and distinct eigenvalues. Remark 1 The independence restriction in Assumption 1(i) can be weakened and the theory developed in this section also holds for stationary and weakly dependent time series satisfying some mixing properties (e.g., Bradley, 2005). Assumption 1(ii) is commonly used in deriving asymptotic results of the spline-based estimation. The compact support restriction can be relaxed at the cost of slightly more lengthy proof with some moment condition on X. Assumption 1(iii) is a sufficient condition to ensure that the least squares estimate in (8) is well defined. In fact, by Lemma 10 in the appendix, 1 n e H e H is positive definite with probability DIMENSION REDUCTION AND MARS approaching one (w.p.a.1), which is implicitly assumed in Friedman (1991). As the linear spline basis functions are special polynomial spline functions, we may replace Assumption 1(iii) by an alternative condition through Huang (2003) s theoretical framework. Let e S be the estimation space containing the linear spline functions and their tensor products selected by MARS. Given a sample of covariates X1, , Xn, suppose that e S is empirically identifiable in the sense that g e S and |g|2 n = 1 n Pn i=1 g2(Xi) = 0 together imply g 0. For a vector v, v ( 1 n e H e H)v = 0 indicates that |v e H|2 n = 1 n Pn i=1[v e H(Xi)]2 = 0. Then, as v e H e S, by the empirical identifiability of e S, we readily have v e H(x) = 0 for any x X, and thus v = 0. This shows that 1 n e H e H is positive definite w.p.a.1, and its inverse is well defined. Assumption 1(iii) restricts the divergence rate of m H, which is very mild for the nonparametric series estimation. Assumption 1(iv) imposes a high-level condition on the uniform bias order of the gradient estimate. In fact, α o e H( ) can be seen as the projection of G onto the estimation space e S defined above. In the spline-based estimation theory, it is reasonable to assume |G(x) α o e H(x)| 0 uniformly over x X. Assumption 1(iv) shows that this uniform approximation continues to hold when G and its projection onto e S are replaced by their gradients. Let S be the estimation space containing all the linear spline functions and their tensor products as in H( ). It is clear that e S S. Letting H = [H(X1), , H(Xn)] , we define α = (H H) 1H G so that α H( ) can be seen as the projection of G onto S. The bias term of the gradient estimation can be decomposed as G (x) e H (x) αo = G (x) H (x) α + h H (x) α e H (x) αo i =: g bias1 (x) + g bias2 (x) . (10) The first term g bias1 (x) is due to the approximation error of G ( ) by its projection onto the space S. In fact, under some smoothness conditions on G and G (e.g., Stone, 1982; Huang, 2003), with the approximation theory, we conjecture that its order is upper bounded by m q/p H , where q is a positive number relevant to the order of bounded and continuous derivatives of G( ). The second term g bias2 (x) is induced by the projection onto the MARS-selected estimation space e S rather than S. According to the MARS algorithm, we expect this bias order tends to zero if em is sufficiently large. 
If em is of the same order as m H, it is reasonable to conjecture that the two terms on the right side of (10) have the same approximation order. Assumption 1(v) is analogous to the condition (C4) in Xia (2008), making it feasible to apply the Davis-Kahan theorem (e.g., Theorem 2 in Yu et al., 2015) to prove Theorem 4. Theorem 2 Suppose that Assumption 1(i) (iv) is satisfied. Then, conditional on em = m , e G (x) G (x) 2 = OP m1/2 n 1/2 + eρ(m ) , (11) eΣG ΣG = OP m1/2 n 1/2 + eρ(m ) . (12) Remark 3 The two convergence rates in (11) and (12) are due to the estimation variance and bias, respectively. They are comparable to the convergence results obtained by Fukumizu and Leng (2014), where the gradient is estimated by the covariance operator on the reproducing kernel Hilbert space. It is worth noting that the rates in (11) and (12) are slower than the root-n rate as MARS is essentially nonparametric. Furthermore, if the minimum eigenvalue of Ωconverges slowly to LIU, LI AND XIA zero, the convergence rates would be further slowed down. For instance, we may show the following convergence property for the MARS estimate of the gradient: e G (x) G (x) 2 = OP (m /λ)1/2 n 1/2 + eρ(m ) , λ = λmin(Ω), conditional on em = m . In Theorem 2, we assume that the number of candidate covariates in nonparametric regression is fixed. The convergence results in (11) and (12) can be further extended to the setting when the covariate number is divergent at a slow polynomial rate of n. Following the proof of Theorem 2 in the appendix and assuming that e S is empirically identifiable, we may show that e G (x) G (x) 2 = OP (pm )1/2 n 1/2 + eρ(m ) , eΣG ΣG = OP p1/2 h (pm )1/2 n 1/2 + eρ(m ) i , conditional on em = m . These convergence properties indicate that the dimension p must be of order smaller than n1/2. For high-dimensional nonparametric estimation with p possibly larger than n1/2, we may have to impose sparsity assumptions on G and ΣG, and combine the developed MARS estimation with a shrinkage technique (e.g., Bickel and Levina, 2008). Theorem 4 Suppose that Assumption 1(i) (v) is satisfied. Conditional on em = m , there exists a d d rotation matrix Q such that e B BQ = OP m1/2 n 1/2 + eρ(m ) . Remark 5 Xia (2008) derives a faster convergence rate by using the minimum average variance estimation with refined kernel weights. However, some restrictive conditions are imposed on the smoothing parameter and the dimension d. For instance, d cannot exceed 3 to achieve the root-n convergence. In contrast, we do not require additional restriction on d. We need to determine the dimension of the SDR space for which many criteria have been proposed (e.g., Li, 1991; Xia et al., 2002). In the simulation study, we select the dimension via the 10-fold cross-validation (CV) criterion. We do not study the theory of the dimension selection in this paper, but our simulations suggest that this criterion works reasonably well; see Table 2. As the basis of the SDR space is not unique, the MARS estimate e B converges to B up to appropriate transformation via the rotation matrix Q. However, with Assumption 1(v) and the model identification conditions as in Proposition 1.1 of Xia (2008), we may consider Q as an identity matrix and thus e B converges to B. Since BQ is also a base of SRD space, for notational convenience, we do not distinguish between B and BQ in the rest of the paper, and use B to denote both cases. 3. 
Dimension-reduced MARS Let X i = e B Xi = eβ 1Xi, , eβ d Xi be a d-dimensional vector of projected covariates, where e B is defined in Section 2. Generally, we can use any SDR method, such as SIR of Li (1991), to estimate e B, and then apply MARS to (Yi, X i ), i = 1, , n. We call this general approach SDR-MARS, while the MARS DIMENSION REDUCTION AND MARS estimation based on our dimension reduction proposed in Section 2 is still called dr MARS to avoid possible confusion. Due to the convergence property of e B in Theorem 4, we expect that X i can well approximate X i = B Xi. Write x = e B x = (x 1, , x d) and x = B x, x Rp. For the k-th projected covariate, we define h+ k,j(x k) and h k,j(x k) similarly to h+ k,j(xk) and h k,j(xk) in (3) and (4) but with the set of knots {tk,1, , tk,nk} replaced by n t k,1, , t k,n k o , and construct C = n h+ k,j, h k,j , j = 1, , n k, k = 1, 2, , d o . With a sample of response and projected covariates (Y1, X 1), , (Yn, X n), we use the linear spline functions in C as the candidate functions and follow the forward stepwise algorithm and then the backward stepwise algorithm as in Section 2 to adaptively select the basis functions denoted by bhj, j = 1, , bm. By (1), we readily have that G(x) = G0 B x = G0 (x ). Since x x by Theorem 4, instead of estimating G, we next estimate the nonparametric link function G0 using the dr MARS selected basis functions. Let b H( ) = h 1,bh1( ), ,bh bm( ) i , b H = h b H(X 1), , b H(X n) i . We estimate the parameters in the basis expansion via least squares, i.e., bγ = (bγ0, bγ1, , bγ bm) = b H b H 1 b H Y, (13) and then obtain the dr MARS estimate: b G0(x ) = bγ0 + j=1 bγjbhj(x ) = b H(x ) b H b H 1 b H Y. (14) The main difference between dr MARS and the conventional MARS in Friedman (1991) is that the former incorporates the linear combinations of covariates determined by the SDR projection in the estimation algorithm. Hence dr MARS is expected to work better when the underlying model contains the multiple-index structure (1). In particular, if a high-order interaction of covariates can be written as the multiple-index form, dr MARS can significantly reduce the number of basis functions in the model approximation, and subsequently improve the estimation efficiency; see the simulation studies in Section 4. Similar to H( ) defined in Section 2, we let H( ) be a vector containing all the basis functions which are either from C or the tensor products of the marginal basis functions, i.e., H( ) = h b H( ) , b H ( ) i , where b H ( ) is a vector of basis functions not selected by dr MARS. Let m H be the dimension of H( ). Note that H( ) is a vector of deterministic functions, facilitating the asymptotic derivation of the dr MARS estimation. We need the following technical conditions to derive the convergence theory of the dr MARS estimation. LIU, LI AND XIA Assumption 2 (i) The density function of X i exists, and is bounded away from zero and infinity on a compact set. The link function G0 is continuous and differentiable. (ii) The matrix Ω:= E H(X i )H(X i ) is positive definite, and m H plog m H = o(n). (iii) The numbers of dr MARS selected basis functions and MARS selected ones: bm and em, satisfy that bm em1/2n 1/2 + eρ( em) = o P (1). (iv) There exists bρ( ) satisfying bρ(u) 0 as u , such that G0(x ) b H(x ) γ = OP (bρ(m )) , γ = b H n b H n 1 b H n G, conditional on bm = m . Remark 6 Assumption 2(i) extends Assumption 1(ii) to the setting including projected covariates. 
As discussed in Remark 1, Assumption 2(ii) ensures that the least squares estimate (13) is well defined. In fact, Lemma 11 in the appendix shows that 1 n b H b H is positive definite w.p.a.1, indicating that its inverse matrix exists. Let b S be the estimation space by including the linear spline functions and their tensor products selected by dr MARS. We may show that Assumption 2(ii) can be replaced by the empirical identifiability condition on b S. Assumption 2(iii), combined with the convergence property in Theorem 4, is crucial to ensure the consistency property when we replace e B by B in dr MARS. We next discuss the high-level condition in Assumption 2(iv) on the dr MARS estimation bias. Letting H(X 1), , H(X n) , b H = h b H(X 1), , b H(X n) i , we define γ = (H G, γ = (b H b H ) 1 b H G. Similar to the bias decomposition (10) in Remark 1, we have G0(x ) b H(x ) γ = h G0(x ) b H(x ) γ i + h G0(x ) b H(x ) γ G0(x ) + b H(x ) γ i = G0(x ) H(x ) γ + h H(x ) γ b H(x ) γ i h G0(x ) b H(x ) γ G0(x ) + b H(x ) γ i =: d bias1 (x) + d bias2 (x) + d bias3 (x) . (15) Hence the high-level bias order in Assumption 2(iv) combines the three bias terms in the decomposition (15). The first term d bias1 (x) is caused by the approximation error of G0( ) by its projection onto S, an estimation space containing the linear spline functions and their tensor products as in H( ). As discussed in Remark 1, under some smoothness conditions on G0 , we may show that d bias1 (x) is of order m q/d H , where q is a positive number relevant to the smoothness level of G0( ). The second term d bias2 (x) is induced by the projection onto the dr MARS-selected estimation space b S rather than S. Lin (2013) discusses the order of d bias2 (x) under some extra restrictions. As discussed in Remark 1, if bm is of the same order as m H, we conjecture that d bias2 (x) would have the same approximation order as d bias1 (x). Finally, d bias3 (x) is the extra bias due to the replacement of B by e B in the dr MARS algorithm. DIMENSION REDUCTION AND MARS The following theorem gives the point-wise convergence rate for b G0(x ) defined in (14). Theorem 7 Suppose that Assumptions 1 and 2 are satisfied. The dr MARS estimate b G0(x ) has the following convergence result: b G0(x ) G0(x ) = OP m1/2 /n1/2 + bρ(m ) , (16) conditional on bm = m . Remark 8 The convergence rate obtained in Theorem 7 is comparable to those derived by Huang (2003) and Lin (2013) for the polynomial spline regression estimation. Assume that the nonparametric link function is sufficiently smooth, say G0( ) is q-smooth (e.g., Huang, 2003), bm m H, and the convergence of e B is sufficiently fast, say e B is root-n convergent (e.g., Xia, 2008). Following the discussion in Remark 6, we conjecture that bρ(m ) is dominated by d bias1 (x) + d bias2 (x), which is upper bounded by m q/d conditional on bm = m . Consequently the point-wise convergence rate of dr MARS becomes OP (m1/2 /n1/2 + m q/d ), indicating that the optimal order of bm = m is nd/(d+2q) and the optimal convergence rate is expected to be OP (n q/(d+2q)) (e.g., Stone, 1982). In contrast, as discussed in Lin (2013), if G( ) is q-smooth, the conventional MARS estimation (without SDR rotation) has the bias order bm q/d , where bm is the number of the MARS selected basis functions. If bm has the optimal order np/(p+2q), the point-wise convergence rate of the conventional MARS is OP n q/(p+2q) . 
As d is typically smaller than p, it is sensible to expect that dr MARS has faster convergence rate than the conventional MARS under the multiple-index model framework (1). This is confirmed by the numerical studies in Section 4 for finite samples. In practice, we may further modify dr MARS to obtain the nonparametric estimation that is robust to possible model misspecification, i.e., the multiple-index structural assumption (1) is violated. Let ˇXi = X i , X i be a vector combining both the original and projected covariates. Consider a sample (Y1, ˇX1), , (Yn, ˇXn), use the linear spline functions in C C as the candidate functions and apply MARS to adaptively select the basis functions denoted by ˇhj, j = 1, , ˇm. Similarly to (14), the estimate of G(x) is obtained by ˇG(x) = ˇφ0 + j=1 ˇφj ˇhj(ˇx), (17) where ˇφ0, ˇφ1, , ˇφ ˇm are the least squares estimates and ˇx = x , x with x = e B x. Furthermore, the nonparametric estimate can be recast into the following form: ˇG(x) = ˇφ0 + ˇG (x ) + ˇG (x), where ˇG (x ) is defined by summing over the terms in (17) whose basis functions involve only the projected covariates whereas ˇG (x) is defined by summing over the terms whose basis functions involve the original covariates. When the multiple-index model assumption is valid, it is expected that most of the basis functions involved in defining ˇG (x) would be screened out in the adaptive selection process, and consequently ˇG(x) is approximated by ˇφ0 + ˇG (x ), which is expected to be close to b G0(x ) defined in (14). LIU, LI AND XIA 4. Simulation studies In this section, we use simulated data to showcase the performance of the proposed dimension reduction and dr MARS methods in two aspects: estimation of the SDR (central mean) space and estimation of the regression function. For the SDR space estimation, we compare dr MARS with principal Hessian directions (p Hd, Cook and Li, 2004), conditional variance estimator (CVE, Fertl and Bura, 2022), gradient-based kernel dimension reductiong (g KDR, Fukumizu and Leng, 2014) and minimum average variance estimation (MAVE, Xia et al., 2002). The accuracy of an estimate e B is evaluated by D(e B, B) = (I B(B B) 1B )e B F / where d is the effective dimension which is assumed to be known. Selection of this dimension is evaluated separately. The smaller D(e B, B) is, the better the SDR space estimate is. For the regression function estimation, we compare dr MARS with two popular methods: the support vector machine (SVM, Cortes and Vapnik, 1995) and random forest (RF, Breiman, 2001). We also compare the function estimation using the SDR directions obtained by p Hd, CVE, g KDR, MAVE and our dr MARS, respectively, and call them SDR-MARS in general. For any estimate of the regression function G(x) = E(Y |X = x), say b G(x), we define the mean squared error (MSE): h b G(Zi) G(Zi) i2 , to evaluate the estimation accuracy, where, {X1, , Xn} is the in-sample used to estimate the regression function G( ) and {Z1, , Zm} is the out-of-sample used to compute the MSE. Both follow the same distribution. All methods are implemented with R. Specifically, package dr (Weisberg, 2002) for p Hd, cve function in package CVar E for CVE, package MAVE for MAVE, package earth (Milborrow et al., 2017) for MARS, svm function in package e1071 (Dimitriadou et al., 2008) for SVM, package random Forest (Liaw and Wiener, 2002) for RF are used in our numerical studies. 
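For reference, the accuracy criterion D(B̃, B) defined above can be computed as in the following minimal R sketch, which projects B̃ onto the orthogonal complement of span(B) and takes the Frobenius norm; the normalisation by the square root of the effective dimension d and the function name are assumptions of this sketch.

```r
## Subspace-distance criterion: Frobenius norm of the part of B_hat lying
## outside span(B), normalised by sqrt(d) (assumed convention).
subspace_distance <- function(B_hat, B) {
  d <- ncol(B_hat)
  P <- B %*% solve(crossprod(B)) %*% t(B)   # projection onto span(B)
  norm((diag(nrow(B)) - P) %*% B_hat, type = "F") / sqrt(d)
}
```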
The source codes for g KDR and dr MARS as well as all the relevant files can be downloaded from https://github.com/liuyu-star/dr MARS. For all the R functions, their default values of tuning parameters are used. In addition, as the random rotation is a commonly-used ensemble method (e.g., Blaser and Fryzlewicz, 2016; Cannings and Samworth, 2017; Bagnall et al., 2018), we also include it in our comparison, denoted by RAND. In our setting, for each random rotation matrix B, RAND applies MARS to (B Xi, Yi), i = 1, , n to train the model and then predict the testing data. The data is generated by the following nonlinear regression model: Yi = G(Xi) + εi, where Xi = (Xi1, , Xip) i.i.d. Up( 1, 1) or Np (0p, ΣX) with ΣX = 0.6|i j| εi i.i.d. N(0, 0.52). The specifications of G( ) are as follows, (M1) G(x) = 0.5(x1 + x2) + 2.5 exp( 2(x1 + x2 + x3)2), DIMENSION REDUCTION AND MARS (M2) G(x) = 1 30 exp(4x1) + 4 3 + 3 exp( 20(x2 0.5)) + 3x3 + 2x4 + x5 (M3) G(x) = 0.6 sin(πx1x2) + 1.2(x3 0.5)2 + 0.6x4 + 0.3x5, (M4) G(x) = 5x1x2x3, (M5) G(x) = 4(x1 x2 + x3) sin(0.5π(x1 + x2)), (M6) G(x) = x1(x1 + x2 + 1), (M7) G(x) = x1 0.5 + (x2 + 1.5)2 . In terms of SDR, the effective dimensions for M1, , M7 are 2, 3, 4, 3, 2, 2 and 2, respectively. For example, the SDR space of M3 is spanned by β1 = 1, 0 p 1 , β2 = 0, 1, 0 p 2 , β3 = 0, 0, 1, 0 p 3 and β4 = 0, 0, 0, 2/ 5, 0 p 5 . The dimension p is 50 or 100, and the sample size n is 200 or 500. For the estimation of the SDR space, the simulation results based on 100 replications are shown in Tables 1 and 2. For the seven models, Table 1 shows that the estimation errors of dr MARS are smaller than those of p Hd, CVE, g KDR and MAVE, indicating that dr MARS has significant improvement over the competing methods in estimating the SDR space. Moreover, in most cases the relative estimation error reduction of dr MARS over the others improves as the dimension p increases. For example, for M2 with X following the uniform distribution and n = 500, the relative estimation error reduction (dr MARS over MAVE) is (0.72 0.34)/0.72 = 0.5278 when p = 50, and it increases to (0.82 0.31)/0.82 = 0.6220 when p = 100. As mentioned in Section 2, we select the dimension of SDR space by the 10-fold CV criterion, using a similar idea as in Xia et al. (2002). The data sample is randomly divided into 10 equal subsamples Ik with size 0.1n , i.e., {1, , n} = 10 k=1Ik. The true dimension (denoted by d0) is estimated as follows bd = arg max 1 d d CV(d), CV(d) = 1 k=1 R2(d, Ik), where d is set as 5 in the simulation, R2(d, Ik) = 1 Yi b G Ik 0 (b B d Xi) 2 Yi Y Ik 2 , b Bd is computed from the whole data set by setting the dimension of SDR space as d, b G Ik 0 ( ) is computed from the data {(Yi, b B d Xi) : i / Ik} using MARS in the R package earth, and Y Ik is the average value of the response {Yi : i / Ik}. The frequencies of correctly selecting the correct dimension and the computational time for estimating the SDR space over 100 replications are reported in Table 2, where only the results for X Np(0p, ΣX) are reported as the performance is similar for uniformly distributed covariates. For most cases, the dimension estimates based on dr MARS are the most accurate one with ρ(bd = d0) larger than the other methods. 
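The 10-fold CV criterion described above can be implemented along the following lines; this is a minimal R sketch in which estimate_sdr_directions(X, y, d) is a hypothetical stand-in for the gradient-based direction estimator of Section 2, and MARS fits on the projected covariates use the earth package.

```r
library(earth)

select_dimension <- function(X, y, d_max = 5, n_folds = 10) {
  n <- nrow(X)
  folds <- split(sample(n), rep(seq_len(n_folds), length.out = n))
  cv_score <- numeric(d_max)
  for (d in seq_len(d_max)) {
    B_d <- estimate_sdr_directions(X, y, d)   # full-sample direction estimate
    Z <- X %*% B_d
    colnames(Z) <- paste0("z", seq_len(d))
    r2 <- numeric(n_folds)
    for (k in seq_len(n_folds)) {
      test <- folds[[k]]
      fit  <- earth(x = Z[-test, , drop = FALSE], y = y[-test])
      pred <- predict(fit, Z[test, , drop = FALSE])
      ## R^2 on the held-out fold against the training-sample mean of y
      r2[k] <- 1 - sum((y[test] - pred)^2) / sum((y[test] - mean(y[-test]))^2)
    }
    cv_score[d] <- mean(r2)                   # CV(d)
  }
  which.max(cv_score)                         # dimension maximising CV(d)
}
```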
Regarding the computing time, p Hd is the least time consuming LIU, LI AND XIA X Up( 1, 1) X Np(0p, ΣX) Model p n p Hd CVE g KDR MAVE dr MARS p Hd CVE g KDR MAVE dr MARS 50 200 0.89 0.75 0.83 0.74 0.53 0.98 0.76 0.83 0.87 0.80 500 0.74 0.66 0.69 0.60 0.47 0.96 0.71 0.75 0.79 0.71 100 200 0.98 0.94 0.91 0.75 0.58 0.99 0.83 0.87 0.87 0.85 500 0.86 0.71 0.81 0.69 0.44 0.99 0.74 0.82 0.87 0.73 50 200 0.96 0.84 0.82 0.81 0.39 0.96 0.85 0.76 0.90 0.81 500 0.94 0.81 0.77 0.72 0.34 0.95 0.85 0.66 0.91 0.79 100 200 0.98 0.89 0.86 0.81 0.38 0.98 0.91 0.83 0.89 0.82 500 0.97 0.84 0.82 0.82 0.31 0.98 0.90 0.75 0.94 0.83 50 200 0.94 0.88 0.85 0.84 0.72 0.93 0.90 0.83 0.82 0.38 500 0.89 0.84 0.78 0.68 0.66 0.89 0.86 0.77 0.76 0.20 100 200 0.98 0.94 0.90 0.86 0.77 0.97 0.95 0.93 0.81 0.46 500 0.96 0.89 0.86 0.84 0.70 0.94 0.90 0.86 0.82 0.25 50 200 0.94 0.92 0.96 0.95 0.78 0.96 0.85 0.87 0.86 0.21 500 0.93 0.81 0.92 0.95 0.51 0.95 0.83 0.78 0.82 0.04 100 200 0.98 0.97 0.98 0.95 0.85 0.98 0.89 0.96 0.86 0.33 500 0.96 0.96 0.98 0.98 0.65 0.98 0.84 0.89 0.88 0.08 50 200 0.81 0.78 0.97 0.42 0.31 0.93 0.93 0.98 0.96 0.79 500 0.50 0.21 0.77 0.11 0.49 0.90 0.85 0.93 0.94 0.44 100 200 0.97 0.96 0.99 0.66 0.27 0.97 0.97 0.99 0.95 0.84 500 0.75 0.71 0.99 0.21 0.47 0.94 0.94 0.99 0.98 0.54 50 200 0.95 0.76 0.73 0.71 0.57 0.92 0.74 0.92 0.69 0.28 500 0.83 0.61 0.56 0.59 0.56 0.84 0.71 0.69 0.61 0.08 100 200 0.98 0.87 0.83 0.71 0.57 0.98 0.78 0.96 0.67 0.40 500 0.95 0.73 0.68 0.70 0.57 0.92 0.73 0.91 0.70 0.13 50 200 0.96 0.88 0.86 0.88 0.76 0.97 0.80 0.86 0.82 0.47 500 0.91 0.77 0.76 0.81 0.71 0.93 0.75 0.73 0.74 0.22 100 200 0.99 0.94 0.92 0.90 0.76 0.99 0.86 0.90 0.81 0.55 500 0.97 0.87 0.84 0.89 0.66 0.98 0.78 0.83 0.84 0.26 Table 1: Average D(e B, B) for estimation of the SDR space over 100 replications: the smaller the value, the better the method. and CVE is the most time consuming, whereas dr MARS is in the middle. In summary, dr MARS with the 10-fold CV can substantially improve dimension estimation accuracy with reasonable (and acceptable) computational time. Table 3 lists the MSEs for nonparametric regression function estimation, where only the results for X Up( 1, 1) are reported as the performance for Gaussian covariates is similar. Generally, dr MARS has smaller MSEs than the conventional MARS. For example, for model M1 with p = 50 and n = 500, the MSE(G) of MARS is 0.52, the MSE(G) of the SDR-MARS with SDR estimated by p Hd, CVE, g KDR and MAVE are smaller, and the MSE(G) of our dr MARS is the smallest. 
A similar pattern can also be found for the other data generating DIMENSION REDUCTION AND MARS ρ(bd = d0) Computational time (in seconds) Model p n p Hd CVE g KDR MAVE dr MARS p Hd CVE g KDR MAVE dr MARS 50 200 0.00 0.14 0.21 0.13 0.31 0.01 5.79 0.33 2.08 1.57 500 0.00 0.00 0.17 0.09 0.25 0.01 27.56 1.55 7.42 3.38 100 200 0.00 0.31 0.19 0.18 0.19 0.02 17.70 0.70 2.11 2.78 500 0.00 0.01 0.11 0.00 0.20 0.04 55.85 4.85 18.70 6.43 50 200 0.15 0.08 0.10 0.12 0.17 0.01 8.42 0.28 1.87 0.82 500 0.13 0.10 0.08 0.14 0.24 0.01 47.90 1.84 6.71 2.22 100 200 0.17 0.10 0.04 0.13 0.21 0.02 22.04 0.79 1.89 1.24 500 0.13 0.14 0.03 0.17 0.20 0.03 76.34 5.06 16.95 3.17 50 200 0.21 0.00 0.35 0.05 0.36 0.01 13.05 0.26 1.90 1.50 500 0.24 0.00 0.27 0.06 0.30 0.01 66.94 1.53 6.77 4.01 100 200 0.30 0.00 0.29 0.01 0.27 0.02 41.58 0.74 1.95 2.82 500 0.15 0.00 0.23 0.10 0.34 0.03 134.85 4.65 17.25 7.32 50 200 0.13 0.01 0.23 0.14 0.33 0.01 9.53 0.29 1.91 1.13 500 0.15 0.00 0.20 0.04 0.36 0.01 49.21 1.83 6.76 3.09 100 200 0.17 0.00 0.10 0.12 0.35 0.02 28.93 0.82 1.93 1.96 500 0.16 0.00 0.18 0.11 0.43 0.04 93.66 5.18 17.24 5.41 50 200 0.04 0.16 0.17 0.20 0.26 0.01 6.43 0.25 2.08 1.35 500 0.00 0.14 0.20 0.15 0.40 0.01 33.98 1.44 7.38 4.33 100 200 0.04 0.14 0.01 0.15 0.30 0.02 19.38 0.70 2.12 2.61 500 0.00 0.14 0.15 0.04 0.29 0.03 63.76 4.44 18.68 6.75 50 200 0.23 0.02 0.07 0.33 0.44 0.01 5.37 0.26 2.07 1.01 500 0.14 0.01 0.08 0.50 0.46 0.01 25.99 1.53 7.39 1.42 100 200 0.13 0.01 0.16 0.38 0.33 0.02 17.86 0.71 2.11 1.93 500 0.21 0.05 0.01 0.25 0.47 0.04 49.32 4.70 18.63 2.78 50 200 0.03 0.18 0.19 0.21 0.35 0.01 5.43 0.25 2.08 1.38 500 0.03 0.21 0.30 0.25 0.50 0.02 27.11 1.43 7.39 3.48 100 200 0.09 0.18 0.06 0.16 0.39 0.02 17.09 0.70 2.11 2.76 500 0.05 0.23 0.02 0.07 0.53 0.04 51.15 4.28 18.71 6.60 Table 2: The proportion of selecting the true dimension of the SDR space ρ(bd = d0) and the computing time for estimating the SDR space (with the true dimension) over 100 replications when X Np(0p, ΣX). processes. Note that the MSE(G) of MARS may be larger than that of SVM and RF for some of the data generating processes (such as M1 and M4). However, in most cases, our dr MARS has smaller MSE(G) than (or comparable MSE(G) to) the SVM and RF methods. The simulation results in Table 3 also show that RAND has poorer numerical performance than the other SDR-MARS methods. 
LIU, LI AND XIA Original SDR-MARS Model p n SVM RF MARS RAND p Hd CVE g KDR MAVE dr MARS 50 200 0.95 0.80 0.99 0.92 1.03 0.49 1.02 0.54 0.35 500 0.87 0.61 0.52 0.82 0.46 0.15 0.51 0.13 0.09 100 200 0.98 0.83 1.23 0.98 1.65 1.42 1.45 0.63 0.40 500 0.95 0.66 0.64 0.92 0.74 0.26 0.66 0.32 0.10 50 200 0.43 0.28 0.42 0.35 0.42 0.33 0.36 0.59 0.36 500 0.28 0.16 0.29 0.23 0.29 0.27 0.28 0.34 0.27 100 200 0.61 0.31 0.43 0.55 0.43 0.54 0.62 0.61 0.38 500 0.38 0.18 0.34 0.34 0.34 0.31 0.33 0.57 0.31 50 200 0.49 0.29 0.63 0.43 0.64 0.44 0.44 0.65 0.36 500 0.35 0.22 0.42 0.30 0.42 0.28 0.32 0.32 0.27 100 200 0.63 0.31 0.72 0.59 0.74 0.67 0.74 0.71 0.42 500 0.45 0.24 0.55 0.42 0.55 0.39 0.40 0.64 0.32 50 200 0.98 0.98 2.20 1.00 1.22 1.28 1.10 1.91 1.11 500 0.99 0.96 1.25 0.97 0.92 0.62 0.93 1.10 0.70 100 200 0.97 0.96 2.45 0.99 1.73 1.49 1.73 1.99 1.41 500 0.98 0.96 1.94 0.97 1.24 1.27 1.13 1.84 0.97 50 200 6.57 4.18 1.09 6.57 1.09 1.09 1.09 1.08 0.87 500 6.30 2.83 0.24 6.20 0.24 0.24 0.24 0.26 0.25 100 200 6.60 4.53 1.56 6.68 1.56 1.63 1.56 1.54 0.95 500 6.61 3.03 0.27 6.49 0.27 0.27 0.27 0.27 0.28 50 200 0.34 0.15 0.36 0.30 0.37 0.24 0.34 0.32 0.15 500 0.25 0.09 0.26 0.21 0.26 0.10 0.20 0.16 0.09 100 200 0.40 0.16 0.40 0.39 0.40 0.47 0.52 0.36 0.19 500 0.31 0.09 0.31 0.30 0.32 0.20 0.27 0.34 0.10 50 200 0.08 0.05 0.35 0.08 0.23 0.18 0.13 0.31 0.14 500 0.06 0.03 0.23 0.06 0.17 0.10 0.08 0.18 0.12 100 200 0.09 0.05 0.39 0.09 0.35 0.24 0.24 0.32 0.17 500 0.07 0.04 0.29 0.07 0.23 0.17 0.11 0.29 0.12 Table 3: Average MSE(G) for the regression function estimation over 100 replications when X Up( 1, 1). 5. Real data analysis In this section, we apply the proposed dr MARS to the out-of-sample prediction of real data and statistical inference. Similarly to the simulation studies in Section 4, we consider the conventional MARS and SDR-MARS using various SDR estimation methods and compare their performance with other commonly-used nonparametric regression methods such as RF and SVM. We build the model using the training set (Xtrain i , Y train i ) : i = 1, , n , and make prediction for the testing set (Xtest i , Y test i ) : i = 1, , m . The prediction performance is DIMENSION REDUCTION AND MARS evaluated by the relative mean squared prediction error: b Y test i Y test i 2 / Y Y test i 2 , where b Y test i is the fitted value of the response Y test i and Y = 1 n Pn i=1 Y train i is a naive prediction using the average of response observations in the learning set. In addition, as the use of logistic regression, our dr MARS can be used for classification with two categories denoted as 0 and 1. Specifically, letting the prediction for the testing set be bytest i , i = 1, 2, , m, the classification is b Y test i = I bytest i > 0.5 , where I( ) is the indicator function. The classification performance is measured by the misclassification rate (MCR) defined as i=1 I(b Y test i = Y test i ). The following data sets are used to demonstrate the performance of prediction. data.1 The data (https://archive.ics.uci.edu/ml/datasets/concrete+compr essive+strength) is about the concrete compressive strength (Y ) and its dependence on concrete s ingredients and age (X). It has p = 8 predictors and N = 1, 030 observations. The square root transformation is made to the concrete compressive strength as the response. data.2 The data (www.kaggle.com/harlfoxem/housesalesprediction) contains house sale prices for King County in US including Seattle between May 2014 and May 2015. It contains N = 21, 613 house sale records. 
The interest is to predict the house sale prices (Y ) based on p = 18 variables (X). The logarithm transformation is made to the house sale prices. data.3 The data (archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitor ing) is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson s disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. Data on people s age, gender, time interval from baseline recruitment date and 16 biomedical voice measures are the covariates (X with p = 19), and N = 5, 875 voice recording from these individuals are collected. Our interest is to predict the motor scores ( motor UPDRS , Y ) from the 19 covariates. data.4 The data (https://archive.ics.uci.edu/ml/datasets/Residential+Bu ilding+Data+Set) contains construction cost, project variables, and economic variables corresponding to real estate single-family residential apartments in Tehran, Iran. It contains N = 372 observations. The interest is to predict the construction cost (Y ) using p = 102 predictors (X) without considering the construction year. The logarithm transformation is made to the construction cost. The following data sets are used to demonstrate the classification performance. LIU, LI AND XIA data.5 The data (www.kaggle.com/datasets/muratkokludataset/pistachio-da taset) includes a total of N = 2, 148 images, 1,232 of Kirmizi type (Y = 0) and 916 of Siirt type (Y = 1). Each image contains 12 morphological features, 4 shape features and 12 color features (p = 28). We are interested in the classification of the images based on the 28 features. data.6 The data (https://archive.ics.uci.edu/ml/datasets/Hill-Valley) contains N = 1, 212 records, each of which represents p = 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a bump in the terrain, Y = 1) or a Valley (a dip in the terrain, Y = 0). Our interest is to discriminate whether a given record is a Hill or a valley by 100 points on a two-dimensional graph. data.7 The data (www.kaggle.com/datasets/cnic92/200-financial-indicator s-of-us-stocks-20142018) includes N = 986 US stocks in year 2018, each of which contains p = 216 financial indicators. These predictors are commonly found in the 10-K filings each publicly traded company releases yearly. Each stock is classified into two classes: if the value of a stock increases during 2019 then Y = 1; if the value of a stock decreases during 2019 then Y = 0. The interest is to classify those stocks that are buy-worthy or not. We randomly select n = min(1000, N/3 ) or n = min(2000, 2N/3 ) observations as the training set, and the remaining observations as the testing set, and repeat the random splitting 100 times. The dimension of SDR space is selected using the 10-fold CV described in Section 4. The average r MSPEs and MCRs are reported in Table 4 below. 
Data Training Original SDR-MARS (N, p) size SVM RF MARS RAND p Hd CVE g KDR MAVE dr MARS data.1 343 21.30 17.86 15.85 21.42 15.85 15.85 15.85 15.88 12.69 (1030, 8) 686 16.75 12.05 14.21 20.10 14.21 14.21 14.15 13.90 10.72 data.2 1000 22.55 16.06 15.50 35.14 15.50 15.50 15.50 15.46 14.62 (21613, 18) 2000 19.57 14.35 13.89 33.37 13.89 13.89 13.86 13.67 13.09 data.3 1000 67.14 35.05 32.16 70.75 30.43 31.14 31.08 31.05 12.79 (5875, 19) 2000 60.20 23.98 30.62 69.58 26.43 28.95 26.51 28.37 10.60 data.4 124 10.02 8.74 5.85 28.86 12.82 5.83 5.86 6.01 4.75 (372, 102) 248 6.73 6.41 4.21 26.75 7.93 4.17 4.22 4.34 3.68 data.5 716 8.74 11.22 9.30 11.23 9.30 9.30 9.30 8.27 9.07 (2148,28) 1432 7.55 10.31 7.72 10.28 7.72 7.72 7.72 7.29 7.52 data.6 404 50.00 44.88 19.38 8.39 20.08 17.88 8.31 14.60 6.21 (1212, 100) 808 50.07 40.88 19.57 5.68 16.90 17.03 6.24 17.08 3.16 data.7 328 19.47 5.19 0.35 21.68 0.49 0.35 0.36 0.95 0.31 (986 ,216) 657 13.94 0.94 0.12 21.46 0.13 0.12 0.12 0.35 0.11 Table 4: Average r MSPE or MCR of the real data over 100 replications (in %) DIMENSION REDUCTION AND MARS As can be seen from Table 4, the conventional MARS often has smaller r MSPE than SVM and RF in particular when the size of training sets is relatively smaller. SDR-MARS based on various dimension reduction methods may make a further improvement over MARS and the improvement of the proposed dr MARS is more remarkable than the other SDR-MARS methods (see the columns under SDR-MARS ). In contrast, RAND has the worst performance in the out-of-sample prediction with the largest r MSPE. Regarding the classification performance, it can be seen that SDR-MARS again outperforms MARS and dr MARS often has much more significant improvement over MARS than the other methods (see data.6 and data.7). We next make some further illustration of the estimated model structure using data.3 and data.4. For ease of comparison, we standardize each variable. The estimation results of data.3 are listed in Table 5. The dimension of SDR space and the interaction degree of dr MARS are selected as 3 and 1, respectively, and the regression model is estimated as follows, E(Y | X = x) = 47.82 + g1(β 1x) + g2(β 2x) + g3(β 3x), g1(v1) = 553.72(v1 0.66)+ + 2.78(0.66 v1)+ + 150.52(v1 0.63)+, g2(v2) = 90.49(v2 + 1.27)+ 34.30( 1.27 v2)+ + 121.43(v2 + 0.00)+ 111.66(v2 0.19)+ + 28.26(v2 0.57)+ + 12.22(v2 + 0.41)+ +158.01(v2 + 1.01)+ 123.24(v2 + 0.65)+, g3(v3) = 7.67(v3 + 0.29)+ 8.60( 0.29 v3)+, where B = (β1, β2, β3) is the direction matrix in SDR space with the estimation results reported in Table 5. Note that the model is additive (with the interaction degree 1), we are able to estimate each additive function (with confidence bands) as plotted in Figure 1. í í sex β7 [ age 0DOH Female fitted function g ( ) fitted function g ( ) fitted function g ( ) Figure 1: The estimated additive functions using the SDR directions for data.3 Note that as MARS takes a step-wise procedure to select the spline bases, some of them may be screened out. As a consequence, some of the predictors may not be selected in the model. This is shown clearly in the estimated coefficients of the directions in Table LIU, LI AND XIA 5. It is known that gender and age are two important factors for the parkinson s disease, which are clearly dominant in the first two directions; see the first two columns of Table 5 and the left and middle plots of Figure 1. 
It is also interesting to see that the last column in Table 5 is mainly comprised of Shimmer.DDA (with coefficient 0.4893) and Jitter.RAP (with coefficient 0.4176) and Jitter.DDP (with coefficient -0.3865). As the first two directions are mainly influenced by age and sex, we can plot them separately as shown in Figure 1, which is in line with the understanding of the relationship between the disease and the two variables. The estimated coefficients for the third projection imply that higher value of Shimmer.DDA leads to a lower degree of the disease. The result sheds some light on the debate about the usefulness of Shimmer.DDA in the disease diagnostics, suggesting that Shimmer.DDA is indeed useful in identifying the disease (e.g., Hausdorff, 2007; Kirchner et al., 2014). age -0.0237 -0.9440 0.0722 sex -0.9443 -0.0126 0.0399 test time 0.0000 0.0000 -0.0547 Jitter(%) 0.0000 0.0000 -0.1374 Jitter.Abs 0.0000 0.0000 0.0606 Jitter.RAP 0.0000 0.0106 0.4176 Jitter.PPQ5 0.0000 0.0000 -0.1404 Jitter.DDP 0.0000 0.0000 -0.3865 Shimmer 0.0000 0.0000 -0.1577 Shimmer.d B 0.0000 0.0000 0.2654 Shimmer.APQ3 0.0000 0.0000 -0.3347 Shimmer.APQ5 0.0000 0.0000 -0.1152 Shimmer.APQ11 0.0000 0.0000 -0.0935 Shimmer.DDA 0.0000 0.0126 0.4893 NHR 0.0000 0.0000 0.0635 HNR 0.0000 0.0000 0.0814 RPDE 0.0000 0.0000 0.0226 DFA 0.0000 0.0000 0.3361 PPE 0.0000 0.0000 0.0000 Table 5: The estimated SDR directions for data.3 For data.4, the dimension of SDR space and the interaction degree of MARS are selected as 3 and 2, respectively, and the model is estimated as follows, E(Y | X = x) = 5.82 + g1(β 1x) + g2(β 2x) + g3(β 3x) + g12(β 1x, β 2x) + g13(β 1x, β 3x), where β1, β2 and β3 are directions of the SDR space, and g1(v1) = 31.67(v1 + 0.00)+ 42.83( 0.00 v1)+ 12.97(v1 + 0.03)+ 10.30(v1 0.03)+, DIMENSION REDUCTION AND MARS g2(v2) = 7.52(v2 + 0.00)+ 3.73( 0.00 v2)+, g3(v3) = 28.42(v2 0.03)+, g12(v1, v2) = 881.25( 0.00 v1)+(v2 0.01)+, g13(v1, v3) = 153.80(v1 + 0.03)+(v3 0.01)+. Inherited from MARS, dr MARS may remove some variables if they are not important. Hence only a small portion of the 102 predictors are selected by dr MARS in the final model and have nonzero coefficients in the directions, and the coefficients of the remaining variables are zero. Table 6 only lists those variables that have nonzero coefficients. Note that the 102 predictors include 7 project physical and financial variables, and 5 groups of time lag economic variables (5*19 variables in total), which we denote in Table 6 as lag k, k = 1, , 5. It shows that none of project physical and financial variables is significant, and significant economic variables appear in multiple time lags, indicating that economic variables have a durable effect on final costs. Specifically, the building services index (x9, x28, x47, x66) and consumer price index (x22, x23, x41, x42, x60, x61) are important factors for the final cost. The land price index (x33, x52) and the cumulative liquidity (x12, x31, x50, x69) also affect the final cost. 
variables that has non-zero coefficients in dr MARS β1 β2 β3 x9: Building services index (BSI) for a preselected base year (lag 1) 0.4291 -0.2057 0.1172 x12: Cumulative liquidity (lag 1) 0.1303 0.0000 0.1469 x22: Consumer price index (CPI) in the base year (lag 1) 0.0000 -0.1739 0.0000 x23: CPI of housing, water, fuel & power in the base year (lag 1) 0.3816 0.2211 0.0000 x28: Building services index (BSI) for a preselected base year (lag 2)-0.5875 0.0000 0.0000 x29: Wholesale price index (WPI) of building materials for the base 0.0000 -0.1322 0.0000 year (lag 2) x31: Cumulative liquidity (lag 2) 0.0000 -0.1167-0.1606 x33: Land price index for the base year (lag 2) 0.0000 0.1473 0.0000 x41: Consumer price index (CPI) in the base year (lag 2) 0.2372 0.0000 0.1360 x42: CPI of housing, water, fuel & power in the base year (lag 2) -0.2168 -0.5692-0.3176 x47: Building services index (BSI) for a preselected base year (lag 3) 0.1970 0.6146 0.3905 x50: Cumulative liquidity (lag 3) -0.1275 0.0000-0.3034 x52: Land price index for the base year (lag 3) -0.1467 0.0000-0.1483 x60: Consumer price index (CPI) in the base year (lag 3) 0.0000 -0.1859 0.0000 x61: CPI of housing, water, fuel & power in the base year (lag 3) 0.1178 0.2444 0.4858 x66: Building services index (BSI) for a preselected base year (lag 4)-0.2091 0.0000-0.4764 x69: Cumulative liquidity (lag 4) 0.1549 0.0000 0.2307 Table 6: The estimated SDR directions for data.4 with nonzero coefficients, while coefficients for those not listed here are all 0. Due to the interaction degree 2, we draw plots in the three-dimensional space for the dependence of the cost Y on the projected variables β j X, j = 1, 2, 3, as shown in Figure 2. The first row of plots shows the relationship between cost and the directions; the second row shows the corresponding fitted functions specified in the estimated model above. LIU, LI AND XIA 0.1 0.0 0.1 0.00. 0.2 3 0.1 0.0 0.1 fitted values E(Y|X) fitted values E(Y|X) fitted values E(Y|X) β T x β T x 1 β T x β T x 0 β T x β T x 0. 0.00. 0. 0. 0.0 0. 0. 0. 0. 0. 0.0 β T x β T x 1 β T x β T x ILWWHG IXQFWLRQ J J J ILWWHG IXQFWLRQ J J J ILWWHG IXQFWLRQ J J β T x β T x Figure 2: Plots for data.4: the plots on the top panel are the cost against each direction, and those on the bottom panel are the fitted functions Interestingly, as shown in the last panel of Figure 2, the cost has a non-linear dependence on the direction. The nonlinearity is along the second direction, which is a contrast between BSI for a preselected base year x47 (coefficient 0.6146) with the CPI of housing, water, fuel & power in the base year x42, x61 (coefficient -0.5692, 0.2444). The reason for this nonlinearity needs further investigation. 6. Conclusion This paper has proposed a general method that combines SDR with the commonly-used MARS algorithm to estimate nonparametric regression functions. The special structure of the MARS basis functions makes it easy to compute the gradient vector of regression functions and thus the SDR space. The selection of spline functions in MARS also makes our dimension reduction method more suitable for high dimensional data. The proposed dr MARS based on the SDR space can in turn improve the efficiency of conventional MARS. 
Through comparison with other commonly-used nonparametric estimation and dimension reduction techniques, our numerical studies, including both simulation and empirical applications, show that the proposed dr MARS has better finite-sample performance in both in-sample estimation and out-of-sample prediction. In summary, there are two key factors in dr MARS that contribute to its performance. First, dr MARS benefits from the automatic selection of basis functions, a feature inherent in MARS itself, which can also lead to variable selection. Second, dr MARS performs dimension reduction with the same objective as MARS, namely minimizing the MARS loss function.

Several issues can be studied further. Cai et al. (2022) suggest a hybrid of random projection and SDR, which uses random projection to reduce the dimension of the predictors to a lower-dimensional space and then applies SDR to the smaller space. We conjecture that such a hybrid may be adopted here. Our method can also be applied to other regression methods, such as random forests or support vector machines, to handle interactions between variables.

Appendix: Proofs of the asymptotic theorems

In this appendix we prove the main theorems in Sections 2 and 3. Throughout the proofs, we let C denote a generic positive constant whose value may change from line to line. We start with a useful inequality for independent random matrices (Tropp, 2012).

Lemma 9 Suppose that $\Lambda_i$, $i = 1, \ldots, n$, are independent $q_1 \times q_2$ random matrices with zero mean and $\max_{1 \le i \le n} \|\Lambda_i\| \le \lambda_n$. Then, for any $z > 0$,
$$
P\left(\Big\|\sum_{i=1}^n \Lambda_i\Big\| \ge z\right) \le (q_1 + q_2)\exp\left\{-\frac{z^2}{2(\xi_n^2 + \lambda_n z/3)}\right\},
\qquad
\xi_n^2 = \max\left\{\Big\|\sum_{i=1}^n E\big(\Lambda_i \Lambda_i^\top\big)\Big\|,\ \Big\|\sum_{i=1}^n E\big(\Lambda_i^\top \Lambda_i\big)\Big\|\right\}.
$$

The following lemma ensures that the least squares estimate (8) is well defined.

Lemma 10 Suppose that Assumption 1(i)-(iii) is satisfied. Then $\frac{1}{n}\widetilde{H}^\top\widetilde{H}$ is positive definite w.p.a.1.

Proof of Lemma 10. Recall that $H = [H(X_1), \ldots, H(X_n)]^\top$. We first prove
$$
\Big\|\frac{1}{n}H^\top H - \Omega\Big\| = o_P(1). \tag{18}
$$
Note that
$$
\frac{1}{n}H^\top H - \Omega = \frac{1}{n}\sum_{i=1}^n\big[H(X_i)H(X_i)^\top - \Omega\big].
$$
We next make use of the inequality in Lemma 9 with $\Lambda_i = H(X_i)H(X_i)^\top - \Omega$ to prove (18). It is easy to verify that $\lambda_n = c_1 m_H$ and $\xi_n^2 = c_2 m_H^2$, where $c_1$ and $c_2$ are two positive constants. By Lemma 9 and $m_H\log m_H = o(n)$, for any $\epsilon > 0$, we have
$$
P\left(\Big\|\frac{1}{n}H^\top H - \Omega\Big\| \ge \epsilon\right) = P\left(\Big\|\sum_{i=1}^n\big[H(X_i)H(X_i)^\top - \Omega\big]\Big\| \ge n\epsilon\right) \le 2m_H\exp\left\{-\frac{\epsilon^2 n^2}{2(c_2 m_H^2 + \epsilon c_1 m_H n/3)}\right\} = \exp\left\{\log(2m_H) - \frac{\epsilon^2 n^2}{2c_2 m_H^2 + 2\epsilon c_1 m_H n/3}\right\} \to 0,
$$
completing the proof of (18). Combining Assumption 1(iii) with (18), we may show that $\frac{1}{n}H^\top H$ is positive definite w.p.a.1, i.e., $\lambda_{\min}\big(\frac{1}{n}H^\top H\big)$ is positive and bounded away from zero, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a square matrix. It is trivial to verify that $\lambda_{\min}(H^\top H) \le \lambda_{\min}(\widetilde{H}^\top\widetilde{H})$. Hence, we may claim that $\frac{1}{n}\widetilde{H}^\top\widetilde{H}$ is positive definite w.p.a.1.

Proof of Theorem 2. Without loss of generality, we next prove the convergence results by setting $\widetilde m = m_*$, where $m_*$ is a non-random positive integer. Letting $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ and $G_* = [G_*(X_1), \ldots, G_*(X_n)]^\top$, by (8), we have
$$
\widetilde G_*(x) - G_*(x) = \Pi_1(x) + \Pi_2(x), \tag{19}
$$
where
$$
\Pi_1(x) = \widetilde H_*(x)\big(\widetilde H^\top\widetilde H\big)^{-1}\widetilde H^\top\varepsilon, \qquad
\Pi_2(x) = \widetilde H_*(x)\big(\widetilde H^\top\widetilde H\big)^{-1}\widetilde H^\top G_* - G_*(x).
$$
We first consider $\Pi_1(x)$. Let $\mathcal{F}_X = \sigma(X_1, \ldots, X_n)$ and $E_{m_*} = (e_1, e_2, \ldots, e_{m_*})^\top$, where $e_j$ is an $m_H$-dimensional vector with the $j$-th element being 1 and the others being zeros. Note that $\widetilde H^\top = E_{m_*} H^\top$. By Assumption 1(i), we have
$$
E_{m_*}\,\mathrm{Var}\big(n^{-1/2}H^\top\varepsilon \mid \mathcal{F}_X\big)E_{m_*}^\top = \sigma^2\,\frac{1}{n}E_{m_*}H^\top H E_{m_*}^\top = \sigma^2\,\frac{1}{n}\widetilde H^\top\widetilde H, \tag{20}
$$
indicating that
$$
|\Pi_1(x)|_2^2 \le \frac{C}{n}\,\widetilde H_*(x)\Big(\frac{1}{n}\widetilde H^\top\widetilde H\Big)^{-1}\widetilde H_*(x)^\top \quad \text{w.p.a.1.} \tag{21}
$$
As $|\widetilde H_*(x)|_2$ is of order $m_*^{1/2}$, it follows from (21) and Lemma 10 that
$$
|\Pi_1(x)|_2 = O_P\big(m_*^{1/2}n^{-1/2}\big). \tag{22}
$$
On the other hand, by Assumption 1(iv), we have
$$
|\Pi_2(x)|_2 = \Big|\widetilde H_*(x)\big(\widetilde H^\top\widetilde H\big)^{-1}\widetilde H^\top G_* - G_*(x)\Big|_2 = \big|\widetilde H_*(x)\,\alpha^o - G_*(x)\big|_2 = O_P\big(\widetilde\rho(m_*)\big). \tag{23}
$$
By (22) and (23), we complete the proof of (11).

We next turn to the proof of (12). Note that
$$
\widetilde\Sigma_G - \Sigma_G = \left[\frac{1}{n}\sum_{i=1}^n \widetilde G_*(X_i)\widetilde G_*(X_i)^\top - \frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top\right] + \left[\frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top - \Sigma_G\right] =: \Pi_3 + \Pi_4. \tag{24}
$$
By Assumption 1(i) and Lemma 9, we readily have
$$
P\left(\Big\|\frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top - \Sigma_G\Big\| \ge M n^{-1/2}\right) \to 0 \quad \text{when } M \to \infty.
$$
This indicates that
$$
\Big\|\frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top - \Sigma_G\Big\| = O_P\big(n^{-1/2}\big). \tag{25}
$$
Re-write $\Pi_3$ as
$$
\Pi_3 = \frac{1}{n}\sum_{i=1}^n\big[\widetilde G_*(X_i) - G_*(X_i)\big]G_*(X_i)^\top + \frac{1}{n}\sum_{i=1}^n G_*(X_i)\big[\widetilde G_*(X_i) - G_*(X_i)\big]^\top + \frac{1}{n}\sum_{i=1}^n\big[\widetilde G_*(X_i) - G_*(X_i)\big]\big[\widetilde G_*(X_i) - G_*(X_i)\big]^\top =: \Pi_{3,1} + \Pi_{3,2} + \Pi_{3,3}. \tag{26}
$$
Following the proofs of (20)-(23), we may show that
$$
\left(\frac{1}{n}\sum_{i=1}^n |\Pi_1(X_i)|_2^2\right)^{1/2} \le C n^{-1/2}\left(\frac{1}{n}\sum_{i=1}^n \big|\widetilde H_*(X_i)\big|_2^2\right)^{1/2} = O_P\big(n^{-1/2}m_*^{1/2}\big), \tag{27}
$$
$$
\left(\frac{1}{n}\sum_{i=1}^n |\Pi_2(X_i)|_2^2\right)^{1/2} = O_P\big(\widetilde\rho(m_*)\big). \tag{28}
$$
By the decomposition (19), the Cauchy-Schwarz inequality, (27) and (28), we have
$$
\big\|\Pi_{3,1}\big\| \le \left(\frac{1}{n}\sum_{i=1}^n\big|\widetilde G_*(X_i) - G_*(X_i)\big|_2^2\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n\big|G_*(X_i)\big|_2^2\right)^{1/2} \le C\left[\left(\frac{1}{n}\sum_{i=1}^n|\Pi_1(X_i)|_2^2\right)^{1/2} + \left(\frac{1}{n}\sum_{i=1}^n|\Pi_2(X_i)|_2^2\right)^{1/2}\right] = O_P\big(m_*^{1/2}n^{-1/2} + \widetilde\rho(m_*)\big), \tag{29}
$$
and similarly
$$
\big\|\Pi_{3,2}\big\| = O_P\big(m_*^{1/2}n^{-1/2} + \widetilde\rho(m_*)\big), \qquad \big\|\Pi_{3,3}\big\| = O_P\big(m_* n^{-1} + \widetilde\rho^2(m_*)\big). \tag{30}
$$
By virtue of (24)-(26), (29) and (30), we complete the proof of (12).

Proof of Theorem 4. By Assumption 1(v) and the Davis-Kahan theorem (e.g., Yu et al., 2015), there exists a $d \times d$ rotation matrix $Q$ such that
$$
\big\|\widetilde B - BQ\big\| \le C\big\|\widetilde\Sigma_G - \Sigma_G\big\|,
$$
which, together with (12), proves Theorem 4.

The following lemma ensures that the least squares estimate (13) is well defined.

Lemma 11 Suppose that Assumptions 1 and 2(i)-(iii) are satisfied. Then $\frac{1}{n}\widehat{H}^\top\widehat{H}$ is positive definite w.p.a.1.

Proof of Lemma 11. Recall that $\widehat H = [\widehat H(\widehat X_1^*), \ldots, \widehat H(\widehat X_n^*)]^\top$ and define $\widehat H^* = [\widehat H(X_1^*), \ldots, \widehat H(X_n^*)]^\top$. By Theorem 4, the smoothness property of the basis functions and Assumption 2(iii), we have
$$
\Big\|\frac{1}{n}\widehat H^\top\widehat H - \frac{1}{n}\widehat H^{*\top}\widehat H^*\Big\| = O_P\Big(\widehat m\big(\widetilde m^{1/2}n^{-1/2} + \widetilde\rho(\widetilde m)\big)\Big) = o_P(1). \tag{31}
$$
By (31), it is sufficient to show that $\frac{1}{n}\widehat H^{*\top}\widehat H^*$ is positive definite w.p.a.1. This can be proved by using Assumption 2(ii) and following the proof of Lemma 10.

Proof of Theorem 7. Without loss of generality, we next prove the consistency property by setting $\widehat m = m_*$, where $m_*$ is a non-random positive integer. Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ and $G = [G(X_1), \ldots, G(X_n)]^\top$ as in the proof of Theorem 4. Note that
$$
\widehat G_0(x^*) - G_0(x^*) = \widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top\varepsilon + \Big[\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top G - G_0(x^*)\Big] = \widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top\varepsilon + O_P\big(\widehat\rho(m_*)\big) \tag{32}
$$
conditional on $\widehat m = m_*$. Letting $H^* = [H(X_1^*), \ldots, H(X_n^*)]^\top$ with $H(\cdot)$ defined in Section 3, and $E_{m_*} = (e_1, e_2, \ldots, e_{m_*})^\top$, we then have $\widehat H^{*\top} = E_{m_*} H^{*\top}$. Note that
$$
\widehat H^\top\varepsilon = \widehat H^{*\top}\varepsilon + \big(\widehat H - \widehat H^*\big)^\top\varepsilon = E_{m_*}H^{*\top}\varepsilon + \big(\widehat H - \widehat H^*\big)^\top\varepsilon. \tag{33}
$$
As in the proof of (20), we may show that
$$
E_{m_*}\,\mathrm{Var}\big(n^{-1/2}H^{*\top}\varepsilon \mid \mathcal{F}_X\big)E_{m_*}^\top = \sigma^2\,\frac{1}{n}\widehat H^{*\top}\widehat H^*,
$$
which, together with Lemma 11, indicates that
$$
\Big|\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^{*\top}\varepsilon\Big|_2 = \Big|\widehat H(x^*)^\top\big(\widehat H^{*\top}\widehat H^*\big)^{-1}\widehat H^{*\top}\varepsilon\Big|_2\,(1 + o_P(1)) = O_P\big(m_*^{1/2}n^{-1/2}\big). \tag{34}
$$
On the other hand, by Theorem 4, Lemma 11 and the smoothness property of the MARS basis functions, we have
$$
\Big|\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\big(\widehat H - \widehat H^*\big)^\top\varepsilon\Big|_2 = o_P\big(m_*^{1/2}n^{-1/2}\big). \tag{35}
$$
By virtue of (33)-(35), we have
$$
\Big|\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top\varepsilon\Big|_2 = O_P\big(m_*^{1/2}n^{-1/2}\big). \tag{36}
$$
With (32) and (36), we complete the proof of (16).

Acknowledgements

The authors would like to thank an Editor and three reviewers for the constructive comments, which helped to improve the article. This project is supported by the National Natural Science Foundation of China (72033002 and 11931014) and MOE's Academic Research Fund (A-8000021-00-00) of Singapore. The second author is partly supported by the Leverhulme Research Fellowship (RF-2023-396).

References

Anthony Bagnall, M Flynn, J Large, J Line, A Bostrom, and G Cawley. Is rotation forest the best classifier for problems with continuous features? arXiv preprint arXiv:1809.06705, 2018.
Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604, 2008.

Rico Blaser and Piotr Fryzlewicz. Random rotation ensembles. Journal of Machine Learning Research, 17(1):126-151, 2016.

Richard C Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107-144, 2005.

Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.

Zhibo Cai, Yingcun Xia, and Weiqiang Hang. An outer-product-of-gradient approach to dimension reduction and its application to classification in high dimensional space. Journal of the American Statistical Association, forthcoming, 2022.

Timothy I Cannings and Richard J Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959-1035, 2017.

Xin Chen, Changliang Zou, and Dennis Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6):3696-3723, 2010.

Dennis Cook and Bing Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455-474, 2002.

R Dennis Cook and Bing Li. Determining the dimension of iterative Hessian transformation. The Annals of Statistics, 32(6):2501-2531, 2004.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

Evgenia Dimitriadou, Kurt Hornik, Friedrich Leisch, David Meyer, and Andreas Weingessel. Misc functions of the Department of Statistics (e1071), TU Wien. R package, 1:5-24, 2008.

Robert F. Engle, C. W. J. Granger, John Rice, and Andrew Weiss. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81(394), 1986.

Jianqing Fan and Irene Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC, 1996.

Lukas Fertl and Efstathia Bura. Conditional variance estimator for sufficient dimension reduction. Bernoulli, 28(3):1862-1891, 2022.

Jerome H Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-67, 1991.

Kenji Fukumizu and Chenlei Leng. Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109(505):359-370, 2014.

Wolfgang Härdle, Peter Hall, and Hidehiko Ichimura. Optimal smoothing in single-index models. The Annals of Statistics, 21(1):157-178, 1993.

Trevor Hastie and Robert Tibshirani. Generalized additive models. Statistical Science, 1(3):297-310, 1986.

Trevor Hastie and Robert Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 55(4):757-796, 1993.

Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

Jeffrey M Hausdorff. Gait dynamics, fractals and falls: finding meaning in the stride-to-stride fluctuations of human walking. Human Movement Science, 26(4):555-589, 2007.

Jianhua Z Huang. Local asymptotics for polynomial spline regression. The Annals of Statistics, 31(5):1600-1635, 2003.

Marietta Kirchner, Patric Schubert, Magnus Liebherr, and Christian T Haas. Detrended fluctuation analysis and adaptive fractal analysis of stride time data in Parkinson's disease: stitching together short gait trials. PLoS ONE, 9(1):e85787, 2014.
Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.

Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002.

Wei Lin. The Econometric Analysis of Interval-Valued Data and Adaptive Regression Splines. PhD thesis, UC Riverside, 2013.

Wei Luo, Bing Li, and Xiangrong Yin. On efficient dimension reduction with respect to a statistical functional of interest. The Annals of Statistics, 42(1):382-412, 2014.

Yanyuan Ma and Liping Zhu. On estimation efficiency of the central mean subspace. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(5):885-901, 2014.

Stephen Milborrow, Trevor Hastie, Robert Tibshirani, Alan Miller, and Thomas Lumley. earth: Multivariate adaptive regression splines. R package version, 5(2), 2017.

Charles J Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040-1053, 1982.

Charles J Stone. Large-sample inference for log-spline models. The Annals of Statistics, 18(2):717-741, 1990.

Charles J Stone. Asymptotics for doubly flexible logspline response models. The Annals of Statistics, 19(4):1832-1854, 1991.

Charles J Stone, Mark H. Hansen, Charles Kooperberg, and Young K. Truong. Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics, 25(4):1371-1425, 1997.

Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389-434, 2012.

Tao Wang, Peirong Xu, and Lixing Zhu. Variable selection and estimation for semiparametric multiple-index models. Bernoulli, 21(1):242-275, 2015.

S. Weisberg. Dimension reduction regression in R. Journal of Statistical Software, 7(1):1-22, 2002.

Yingcun Xia. A multiple-index model and dimension reduction. Journal of the American Statistical Association, 103(484):1631-1640, 2008.

Yingcun Xia, Howell Tong, Wai Keung Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363-410, 2002.

Zhuoran Yang, Krishnakumar Balasubramanian, and Han Liu. On Stein's identity and near-optimal estimation in high-dimensional index models. arXiv preprint arXiv:1709.08795, 2017.

Xiangrong Yin and Bing Li. Sufficient dimension reduction based on an ensemble of minimum average variance estimators. The Annals of Statistics, 39(6):3392-3416, 2011.

Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015.

S. Zhou, X. Shen, and D.A. Wolfe. Local asymptotics for regression splines and confidence regions. The Annals of Statistics, 26(5):1760-1782, 1998.