Journal of Machine Learning Research 24 (2023) 1-30    Submitted 12/22; Revised 9/23; Published 10/23

Dimension Reduction and MARS

Yu Liu (LIUYUCHINA123@GMAIL.COM), School of Mathematical Sciences, University of Electronic Science and Technology of China, China

Degui Li (DEGUI.LI@YORK.AC.UK), Department of Mathematics, University of York, UK

Yingcun Xia (STAXYC@NUS.EDU.SG), Department of Statistics and Data Science, National University of Singapore, Singapore, and School of Mathematical Sciences, University of Electronic Science and Technology of China, China

Editor: Rajarshi Guhaniyogi

© 2023 Yu Liu, Degui Li and Yingcun Xia. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1422.html.

Abstract

The multivariate adaptive regression spline (MARS) is one of the most popular estimation methods for nonparametric multivariate regression. However, as MARS is based on marginal splines, products of the marginal splines must be used to incorporate interactions of covariates, which often leads to an unmanageable number of basis functions when the order of interaction is high and results in low estimation efficiency. In this paper, we improve the performance of MARS by using linear combinations of the covariates which achieve sufficient dimension reduction. The special basis functions of MARS facilitate calculation of gradients of the regression function, and estimation of these linear combinations is obtained via eigen-analysis of the outer product of the gradients. Under some technical conditions, the consistency property is established for the proposed estimation method. Numerical studies, including both simulations and empirical applications, show its effectiveness in dimension reduction and its improvement over MARS and other commonly used nonparametric methods in regression estimation and prediction.

Keywords: consistency, gradient estimation, multivariate adaptive regression spline, nonparametric regression, sufficient dimension reduction

1. Introduction

Nonparametric estimation is an effective tool in statistics and machine learning to capture a flexible nonlinear relationship between the response and explanatory variables, relaxing the pre-specified model structural assumptions required by parametric estimation methods. However, the extension of nonparametric regression estimation to the setting with multivariate regressors needs to be handled with care, as the number of observations required to achieve a given estimation accuracy increases exponentially with the dimension of the covariates, resulting in the so-called curse of dimensionality (e.g., Fan and Gijbels, 1996). To address this problem, we often have to restrict the class of multivariate regression functions so that only lower-dimensional nonparametric functions are to be estimated. Commonly used function classes include additive models (Hastie and Tibshirani, 1986), varying-coefficient models (Hastie and Tibshirani, 1993), partially linear models (Engle et al., 1986) and single-index models (Härdle et al., 1993). However, these restricted nonparametric estimation methods may have unstable numerical performance in practical data analysis when the regression function class is misspecified. Hence, it is imperative to develop a fully nonparametric multivariate estimation method that can reduce the curse of dimensionality but needs no restriction on the class of regression functions.
In nonparametric estimation, the regression function is often approximated by a linear expansion of base functions (e.g., Chapter 5 of Hastie et al., 2009). In the case of multivariate covariates, the required number of basis functions in the approximation may increase dramatically as the dimension of covariates increases. A commonly-used idea to design a feasible estimation algorithm is to control model complexity and thus limit the number of basis functions. This can be done by adaptively scanning the set of basis functions and selecting only those which contribute significantly to the model fitting. Among a long list of existing estimation algorithms, the multivariate adaptive regression spline (MARS, Friedman, 1991) is arguably the most popular one. It uses piecewise linear basis functions and can be viewed as a natural generalization of the stepwise linear regression approach. Because of the selection of splines in the estimation algorithm, MARS can also result in variable selection. MARS is well suited for high-dimensional nonparametric regression problems and can be further extended to tackle classification problems (e.g., Stone et al., 1997). Existing literature in statistical learning such as Hastie et al. (2009) usually implements the MARS algorithm directly without making any transformation or dimension reduction of the covariates. This may result in an unmanageable number of basis functions (if the level of model complexity or the order of interaction is high) and low estimation efficiency. In multivariate nonparametric regression, it is often the case that important features of multiple regressors are retrievable via low-dimensional projections. The low-dimensional sub-space is expected to retain all (or most of) the information provided by the covariates on the response, and is thus called the sufficient dimension reduction (SDR) space, which is first introduced by Li (1991). Aiming at dimension reduction for the conditional mean, which is more relevant to our main interest, a similar concept (central mean space) is also introduced by Cook and Li (2002). More recent developments on this topic can be found in Xia (2008), Chen et al. (2010), Yin and Li (2011), Fukumizu and Leng (2014), Luo et al. (2014), Ma and Zhu (2014), Wang et al. (2015), Yang et al. (2017), and Fertl and Bura (2022). This paper aims to combine SDR with MARS by incorporating linear combinations of covariates to improve the regression estimation. These linear combinations are the SDR directions or more precisely the central mean space of Cook and Li (2002) when the underlying model has a multiple-index structure and can effectively reduce the order of covariate interaction required in MARS and improve the estimation performance. As these linear combinations in MARS are dimension-reduced covariates, the proposed methodology is called dr MARS throughout the paper. The nonparametric estimation procedure developed in this paper includes two stages: (i) estimate the SDR space of the conditional mean; and (ii) modify MARS by incorporating these linear combinations of covariates (or SDR) to estimate the regression functions. The main technique in stage (i) is to conduct eigen-analysis of the outer-product of regression function gradient estimates and estimate the SDR directions by the eigenvectors correspond- DIMENSION REDUCTION AND MARS ing to the first few largest eigenvalues. 
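To fix ideas, the following is a minimal sketch of this two-stage procedure in R, assuming a numeric n × p design matrix X and a response vector y. MARS fits use the earth package; for brevity the gradient of the fitted surface is approximated here by finite differences of predict(), whereas the method developed in Section 2 differentiates the selected spline basis analytically; the function name drmars_sketch is illustrative only.

```r
library(earth)

drmars_sketch <- function(X, y, d, eps = 1e-4) {
  n <- nrow(X); p <- ncol(X)
  if (is.null(colnames(X))) colnames(X) <- paste0("x", seq_len(p))

  ## Stage (i): fit MARS on the original covariates and estimate gradients
  ## of the fitted regression surface (finite differences for simplicity).
  fit0 <- earth(x = X, y = y)
  grad_at <- function(x0) {
    sapply(seq_len(p), function(k) {
      xp <- x0; xm <- x0
      xp[k] <- xp[k] + eps; xm[k] <- xm[k] - eps
      fp <- predict(fit0, matrix(xp, 1, dimnames = list(NULL, colnames(X))))
      fm <- predict(fit0, matrix(xm, 1, dimnames = list(NULL, colnames(X))))
      as.numeric(fp - fm) / (2 * eps)
    })
  }
  grads <- t(apply(X, 1, grad_at))          # n x p matrix of gradient estimates

  ## Outer product of gradients and its leading eigenvectors (SDR directions).
  Sigma_G <- crossprod(grads) / n
  B_hat <- eigen(Sigma_G, symmetric = TRUE)$vectors[, seq_len(d), drop = FALSE]

  ## Stage (ii): refit MARS on the d projected covariates.
  Z <- X %*% B_hat
  colnames(Z) <- paste0("z", seq_len(d))
  fit1 <- earth(x = Z, y = y)
  list(B = B_hat, fit = fit1)
}
```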
In particular, we estimate the gradient via a linear basis expansion determined by MARS and further derive a sensible convergence property for the resulting estimates. With the MARS algorithm, this new gradient estimation is easy to implement, complementing other gradient estimation methods such as the local linear smoothing and reproducing kernel Hilbert space which have been extensively studied in the literature (e.g., Xia et al., 2002; Xia, 2008; Fukumizu and Leng, 2014). The dr MARS in stage (ii) incorporates the linear combinations of covariates, making it substantially different from the classic MARS in Friedman (1991). In particular, when a high-order interaction of covariates can be equivalently expressed as the multiple-index form, our dr MARS can significantly reduce the number of terms in the basis expansion and improve the estimation efficiency. As a simple example, (x1 + x2 + x3 + x4)3 has a third-order interaction when the conventional MARS is applied, but it has only a first-order interaction in the dr MARS if the linear combination is correctly identified. This is confirmed by our numerical studies, which also show the advantage of dr MARS even if the postulated model cannot reduce the order of interactions via the SDR-determined linear combinations of covariates. dr MARS inherits some nice features from MARS (such as the simple form of linear spline basis functions and selection of spline in the algorithm) and works well when the dimension of predictors is relatively large (see the simulation and empirical application). Under some technical conditions, we derive the consistency theory for the dr MARS estimation, complementing the existing asymptotic theory for the spline-based estimation (e.g., Stone, 1990, 1991; Zhou et al., 1998; Huang, 2003; Lin, 2013). Another work related to our approach is the random projection or random rotation (e.g., Blaser and Fryzlewicz, 2016; Cannings and Samworth, 2017; Bagnall et al., 2018). The random rotation is an ensemble procedure. It randomly selects the projections and estimates the model using the projected combinations of the variables as predictors for regression methods such as the random forest or support vector machine. Each set of projections thus generates a prediction. The final prediction is a weighted average of these predictions. In contrast, the rotation in our approach is based on the regression itself, i.e., SDR, and thus is more efficient for prediction. As we will show in the numerical studies, the rotation based on SDR has better estimation and prediction accuracy than the random rotation. The rest of the paper is organized as follows. Section 2 defines the SDR space, introduces the MARS-based estimation method, and develops the convergence properties of the estimates. Section 3 describes the dr MARS algorithm and its consistency theory. Sections 4 and 5 report the simulation studies and real data applications, respectively. Section 6 concludes the paper. Proofs of the main theorems are available in an appendix. Throughout the paper, for a vector u = (u1, , ud) , we define |u|q q = Pd i=1 |ui|q with q 1; for a d d matrix W = (wij)d d, we let W and W F be the spectral and Frobenius norms, respectively. 2. Estimation of SDR space via MARS Let Y and X be the response and p-dimensional vector of covariates, respectively. 
Assume the following multiple-index model structure:
$$G(x) = E(Y \mid X = x) = E\left(Y \mid B^\top X = B^\top x\right) = G_0\left(B^\top x\right), \qquad (1)$$
where $B$ is a $p \times d$ orthogonal matrix with $d$ smaller than $p$, $G(\cdot)$ is a multivariate nonparametric regression function on $\mathbb{R}^p$ and $G_0(\cdot)$ is a nonparametric link function on $\mathbb{R}^d$. It follows from model (1) that the projection of the $p$-dimensional $X$ onto the $d$-dimensional subspace $B^\top X$ retains all the information provided by $X$ for prediction of $Y$. Hence, the matrix $B$ determines the SDR directions (or the central mean subspace). The space spanned by $B$'s column vectors is called the SDR space. Letting $u = B^\top x$, by (1), we readily have that $G'(x) = B\, G_0'(u)$, where $G'$ and $G_0'$ are the gradient vectors. By Lemma 1 in Xia et al. (2002), the space spanned by $B$ is the same as that spanned by the eigenvectors of $\Sigma_G := E\left[G'(X) G'(X)^\top\right]$ corresponding to the largest $d$ eigenvalues, i.e., $\mathrm{span}(B) = \mathrm{span}(\beta_1, \ldots, \beta_d)$, where $\beta_j$ is the eigenvector of $\Sigma_G$ corresponding to the $j$-th largest eigenvalue. With a sample of observations $(Y_i, X_i)$, $i = 1, \ldots, n$, we estimate $\Sigma_G$ by the outer product of gradient estimates:
$$\widetilde{\Sigma}_G = \frac{1}{n} \sum_{i=1}^{n} \widetilde{G}'(X_i)\, \widetilde{G}'(X_i)^\top, \qquad (2)$$
where $\widetilde{G}'$ is a nonparametric estimate of the gradient $G'$. A natural estimate of $G'$ is via the local linear smoothing method (e.g., Fan and Gijbels, 1996). The estimate of $B$ can be obtained by subsequently conducting the eigen-analysis of $\widetilde{\Sigma}_G$ (e.g., Xia et al., 2002; Xia, 2008). However, local linear estimation is essentially a kernel-based local smoothing method which is sensitive to the choice of smoothing parameter, and it still suffers the curse of dimensionality when the dimension $p$ is large. Next, we propose an alternative technique to estimate $G'$ via MARS.

MARS is an adaptive estimation procedure using linear spline functions in the basis expansion. For the $k$-th covariate, we define the piecewise linear basis functions with knots taken from the set $\{t_{k,1}, \ldots, t_{k,n_k}\}$:
$$h^+_{k,j}(x_k) = (x_k - t_{k,j})_+ = \begin{cases} x_k - t_{k,j}, & \text{if } x_k > t_{k,j}, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$
$$h^-_{k,j}(x_k) = (x_k - t_{k,j})_- = (t_{k,j} - x_k)_+, \qquad (4)$$
which form reflected pairs for the $k$-th covariate at $t_{k,j}$, $j = 1, \ldots, n_k$. The collection of marginal basis functions for all the covariates is
$$\mathcal{C} = \left\{ h^+_{k,j},\, h^-_{k,j} : \ j = 1, \ldots, n_k,\ k = 1, 2, \ldots, p \right\}.$$
When each basis function depends only on a single covariate, the number of basis functions in $\mathcal{C}$ is $2 \sum_{k=1}^{p} n_k$, assuming all the knots are distinct. To incorporate interactions of covariates, we use tensor products of the basis functions in $\mathcal{C}$. Specifically, when the order of interaction is set to be $R$, a typical $R$-variate basis function is defined as
$$h_{k_1 j_1, \ldots, k_R j_R}(x_{k_1}, \ldots, x_{k_R}) = \prod_{r=1}^{R} h_{k_r, j_r}(x_{k_r}), \qquad (5)$$
where $h_{k,j}$ is a basis function from $\mathcal{C}$, $1 \le j_r \le 2 n_{k_r}$, and $1 \le k_1 \ne k_2 \ne \cdots \ne k_R \le p$. Note that the number of the $R$-variate basis functions increases dramatically as $p$ increases.

Suppose that the multivariate nonparametric regression function is approximated by the following form of basis expansion:
$$G(x) \approx G_m(x) := \theta_0 + \sum_{j=1}^{m} \theta_j h_j(x), \qquad (6)$$
where $h_j$ is either a basis function in $\mathcal{C}$ or a product of marginal basis functions as in (5), and $m$ is the number of basis functions, which may diverge to infinity. Here $G(x) \approx G_m(x)$ means that $G_m(x) \to G(x)$ as $m \to \infty$. The coefficients $\theta_j$, $j = 0, 1, \ldots, m$, are estimated by least squares as in standard linear regression. From (6), we further obtain the basis expansion for the gradient $G'$:
$$G'(x) \approx G'_m(x) := \sum_{j=1}^{m} \theta_j h'_j(x), \qquad (7)$$
where $h'_j$ is the gradient vector of $h_j$.
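To make the basis functions and the gradient expansion in (7) concrete, the following small R illustration codes a reflected pair from (3)-(4), a second-order tensor-product term as in (5), and their (sub)gradients; the function names are illustrative only, and the derivative value exactly at a knot is a convention.

```r
hinge_pos  <- function(x, t) pmax(x - t, 0)        # h^+_{k,j}(x) = (x - t)_+
hinge_neg  <- function(x, t) pmax(t - x, 0)        # h^-_{k,j}(x) = (t - x)_+
dhinge_pos <- function(x, t) as.numeric(x > t)     # derivative of (x - t)_+
dhinge_neg <- function(x, t) -as.numeric(x < t)    # derivative of (t - x)_+

## A second-order (R = 2) tensor-product term h(x) = (x_1 - t_1)_+ (t_2 - x_2)_+
## and its gradient, obtained from the product rule on the marginal splines.
h12  <- function(x, t1, t2) hinge_pos(x[1], t1) * hinge_neg(x[2], t2)
gh12 <- function(x, t1, t2) c(dhinge_pos(x[1], t1) * hinge_neg(x[2], t2),
                              hinge_pos(x[1], t1) * dhinge_neg(x[2], t2))
```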
When the order of interaction R is large (or even moderately large), it is practically infeasible to include all the R-variate basis functions. The real art of MARS is to provide an adaptive selection procedure including both the forward and backward stepwise algorithms to construct the basis functions with the linear spline functions in C. This adaptive selection reduces the number of basis functions in (6) while retains the model flexibility. We next briefly describe the MARS algorithm to determine the basis functions in (6) and (7). Start with the constant function h0(x) 1 and use the linear spline functions in C as the candidate functions. In each stage, let M be the set of basis functions which have been selected in the previous stages. Construct a new basis function from products of any basis function in M with one of the reflected pairs in C. This new term in the basis expansion has the following typical form: θ|M|+1hl(x)h+ k,j(xk) + θ|M|+2hl(x)h k,j(xk), hl M, h+ k,j, h k,j C, where θ|M|+1 and θ|M|+2 are the parameters to be estimated by least squares, and |M| denotes the cardinality of M. Add the products to the model approximation with the basis functions in M and choose the product which results in the largest decrease in the training estimation errors. Repeat the above process until the number of the selected basis functions reaches a pre-determined number M. As the number M is usually large, the model selected in the forward stepwise algorithm often overfits the data. Thus, a backward stepwise algorithm is needed to delete the term whose removal results in the smallest increase in the residual squared errors. Let h1, , h em be the basis functions selected by MARS and eh j be the gradient vector of ehj, j = 1, , em. We write e H(x) = h 1,eh1(x), ,eh em(x) i and e H (x) = h 0p,eh 1(x), ,eh em(x) i , where 0p is a p-dimensional zero vector. We use least squares to estimate the parameters in the final basis approximation: eα = (eα0, eα1, , eα em) = e H e H 1 e H Y, (8) LIU, LI AND XIA where e H = h e H(X1), , e H(Xn) i and Y = (Y1, , Yn) . Consequently the estimate of G (x) can be obtained by j=1 eαjeh j(x) = e H (x) e H e H 1 e H Y. (9) The above gradient estimate is then used to construct eΣG in (2). Letting eβj be the eigenvector of eΣG corresponding to the j-th largest eigenvalue, we obtain e B = (eβ1, , eβd), which will be shown to be a consistent estimate of B (subject to appropriate rotation); see Theorem 4 below. If the order of covariate interaction is set as R in MARS, we let H( ) be a vector containing all the basis functions which are either from C or tensor products of the marginal basis functions as in (5). Without loss of generality, we may write H( ) = h e H( ) , e H ( ) i , where e H ( ) is a vector of basis functions not selected by MARS. Let m H be the dimension of H( ) which is often much larger than em. It is worth pointing out that H( ) is a vector of deterministic functions which can be seen as the candidate basis functions in MARS, and m H is a non-random positive integer. We next study the convergence property for the MARS-based nonparametric estimates e G and eΣG, which requires the following technical conditions. Assumption 1 (i) Let (Yi, Xi), i = 1, , n, be independent and identically distributed (i.i.d.), and εi := Yi G(Xi) be zero-mean and homoskedastic, i.e., E(ε2 i |Xi) = σ2 > 0 almost surely (a.s.). (ii) The density function of Xi exists, and is bounded away from zero and infinity on a compact set X. Both G and G are continuous on X. 
(iii) The matrix Ω:= E H(Xi)H(Xi) is positive definite, and m H log m H = o(n). (iv) There exists eρ( ) satisfying eρ(u) 0 as u , such that G (x) e H (x) αo 2 = OP (eρ(m )) , αo = e H e H 1 e H G, conditional on em = m , where G = [G(X1), , G(Xn)] . (v) The matrix ΣG has full rank of d with positive and distinct eigenvalues. Remark 1 The independence restriction in Assumption 1(i) can be weakened and the theory developed in this section also holds for stationary and weakly dependent time series satisfying some mixing properties (e.g., Bradley, 2005). Assumption 1(ii) is commonly used in deriving asymptotic results of the spline-based estimation. The compact support restriction can be relaxed at the cost of slightly more lengthy proof with some moment condition on X. Assumption 1(iii) is a sufficient condition to ensure that the least squares estimate in (8) is well defined. In fact, by Lemma 10 in the appendix, 1 n e H e H is positive definite with probability DIMENSION REDUCTION AND MARS approaching one (w.p.a.1), which is implicitly assumed in Friedman (1991). As the linear spline basis functions are special polynomial spline functions, we may replace Assumption 1(iii) by an alternative condition through Huang (2003) s theoretical framework. Let e S be the estimation space containing the linear spline functions and their tensor products selected by MARS. Given a sample of covariates X1, , Xn, suppose that e S is empirically identifiable in the sense that g e S and |g|2 n = 1 n Pn i=1 g2(Xi) = 0 together imply g 0. For a vector v, v ( 1 n e H e H)v = 0 indicates that |v e H|2 n = 1 n Pn i=1[v e H(Xi)]2 = 0. Then, as v e H e S, by the empirical identifiability of e S, we readily have v e H(x) = 0 for any x X, and thus v = 0. This shows that 1 n e H e H is positive definite w.p.a.1, and its inverse is well defined. Assumption 1(iii) restricts the divergence rate of m H, which is very mild for the nonparametric series estimation. Assumption 1(iv) imposes a high-level condition on the uniform bias order of the gradient estimate. In fact, α o e H( ) can be seen as the projection of G onto the estimation space e S defined above. In the spline-based estimation theory, it is reasonable to assume |G(x) α o e H(x)| 0 uniformly over x X. Assumption 1(iv) shows that this uniform approximation continues to hold when G and its projection onto e S are replaced by their gradients. Let S be the estimation space containing all the linear spline functions and their tensor products as in H( ). It is clear that e S S. Letting H = [H(X1), , H(Xn)] , we define α = (H H) 1H G so that α H( ) can be seen as the projection of G onto S. The bias term of the gradient estimation can be decomposed as G (x) e H (x) αo = G (x) H (x) α + h H (x) α e H (x) αo i =: g bias1 (x) + g bias2 (x) . (10) The first term g bias1 (x) is due to the approximation error of G ( ) by its projection onto the space S. In fact, under some smoothness conditions on G and G (e.g., Stone, 1982; Huang, 2003), with the approximation theory, we conjecture that its order is upper bounded by m q/p H , where q is a positive number relevant to the order of bounded and continuous derivatives of G( ). The second term g bias2 (x) is induced by the projection onto the MARS-selected estimation space e S rather than S. According to the MARS algorithm, we expect this bias order tends to zero if em is sufficiently large. 
If em is of the same order as m H, it is reasonable to conjecture that the two terms on the right side of (10) have the same approximation order. Assumption 1(v) is analogous to the condition (C4) in Xia (2008), making it feasible to apply the Davis-Kahan theorem (e.g., Theorem 2 in Yu et al., 2015) to prove Theorem 4. Theorem 2 Suppose that Assumption 1(i) (iv) is satisfied. Then, conditional on em = m , e G (x) G (x) 2 = OP m1/2 n 1/2 + eρ(m ) , (11) eΣG ΣG = OP m1/2 n 1/2 + eρ(m ) . (12) Remark 3 The two convergence rates in (11) and (12) are due to the estimation variance and bias, respectively. They are comparable to the convergence results obtained by Fukumizu and Leng (2014), where the gradient is estimated by the covariance operator on the reproducing kernel Hilbert space. It is worth noting that the rates in (11) and (12) are slower than the root-n rate as MARS is essentially nonparametric. Furthermore, if the minimum eigenvalue of Ωconverges slowly to LIU, LI AND XIA zero, the convergence rates would be further slowed down. For instance, we may show the following convergence property for the MARS estimate of the gradient: e G (x) G (x) 2 = OP (m /λ)1/2 n 1/2 + eρ(m ) , λ = λmin(Ω), conditional on em = m . In Theorem 2, we assume that the number of candidate covariates in nonparametric regression is fixed. The convergence results in (11) and (12) can be further extended to the setting when the covariate number is divergent at a slow polynomial rate of n. Following the proof of Theorem 2 in the appendix and assuming that e S is empirically identifiable, we may show that e G (x) G (x) 2 = OP (pm )1/2 n 1/2 + eρ(m ) , eΣG ΣG = OP p1/2 h (pm )1/2 n 1/2 + eρ(m ) i , conditional on em = m . These convergence properties indicate that the dimension p must be of order smaller than n1/2. For high-dimensional nonparametric estimation with p possibly larger than n1/2, we may have to impose sparsity assumptions on G and ΣG, and combine the developed MARS estimation with a shrinkage technique (e.g., Bickel and Levina, 2008). Theorem 4 Suppose that Assumption 1(i) (v) is satisfied. Conditional on em = m , there exists a d d rotation matrix Q such that e B BQ = OP m1/2 n 1/2 + eρ(m ) . Remark 5 Xia (2008) derives a faster convergence rate by using the minimum average variance estimation with refined kernel weights. However, some restrictive conditions are imposed on the smoothing parameter and the dimension d. For instance, d cannot exceed 3 to achieve the root-n convergence. In contrast, we do not require additional restriction on d. We need to determine the dimension of the SDR space for which many criteria have been proposed (e.g., Li, 1991; Xia et al., 2002). In the simulation study, we select the dimension via the 10-fold cross-validation (CV) criterion. We do not study the theory of the dimension selection in this paper, but our simulations suggest that this criterion works reasonably well; see Table 2. As the basis of the SDR space is not unique, the MARS estimate e B converges to B up to appropriate transformation via the rotation matrix Q. However, with Assumption 1(v) and the model identification conditions as in Proposition 1.1 of Xia (2008), we may consider Q as an identity matrix and thus e B converges to B. Since BQ is also a base of SRD space, for notational convenience, we do not distinguish between B and BQ in the rest of the paper, and use B to denote both cases. 3. 
Dimension-reduced MARS Let X i = e B Xi = eβ 1Xi, , eβ d Xi be a d-dimensional vector of projected covariates, where e B is defined in Section 2. Generally, we can use any SDR method, such as SIR of Li (1991), to estimate e B, and then apply MARS to (Yi, X i ), i = 1, , n. We call this general approach SDR-MARS, while the MARS DIMENSION REDUCTION AND MARS estimation based on our dimension reduction proposed in Section 2 is still called dr MARS to avoid possible confusion. Due to the convergence property of e B in Theorem 4, we expect that X i can well approximate X i = B Xi. Write x = e B x = (x 1, , x d) and x = B x, x Rp. For the k-th projected covariate, we define h+ k,j(x k) and h k,j(x k) similarly to h+ k,j(xk) and h k,j(xk) in (3) and (4) but with the set of knots {tk,1, , tk,nk} replaced by n t k,1, , t k,n k o , and construct C = n h+ k,j, h k,j , j = 1, , n k, k = 1, 2, , d o . With a sample of response and projected covariates (Y1, X 1), , (Yn, X n), we use the linear spline functions in C as the candidate functions and follow the forward stepwise algorithm and then the backward stepwise algorithm as in Section 2 to adaptively select the basis functions denoted by bhj, j = 1, , bm. By (1), we readily have that G(x) = G0 B x = G0 (x ). Since x x by Theorem 4, instead of estimating G, we next estimate the nonparametric link function G0 using the dr MARS selected basis functions. Let b H( ) = h 1,bh1( ), ,bh bm( ) i , b H = h b H(X 1), , b H(X n) i . We estimate the parameters in the basis expansion via least squares, i.e., bγ = (bγ0, bγ1, , bγ bm) = b H b H 1 b H Y, (13) and then obtain the dr MARS estimate: b G0(x ) = bγ0 + j=1 bγjbhj(x ) = b H(x ) b H b H 1 b H Y. (14) The main difference between dr MARS and the conventional MARS in Friedman (1991) is that the former incorporates the linear combinations of covariates determined by the SDR projection in the estimation algorithm. Hence dr MARS is expected to work better when the underlying model contains the multiple-index structure (1). In particular, if a high-order interaction of covariates can be written as the multiple-index form, dr MARS can significantly reduce the number of basis functions in the model approximation, and subsequently improve the estimation efficiency; see the simulation studies in Section 4. Similar to H( ) defined in Section 2, we let H( ) be a vector containing all the basis functions which are either from C or the tensor products of the marginal basis functions, i.e., H( ) = h b H( ) , b H ( ) i , where b H ( ) is a vector of basis functions not selected by dr MARS. Let m H be the dimension of H( ). Note that H( ) is a vector of deterministic functions, facilitating the asymptotic derivation of the dr MARS estimation. We need the following technical conditions to derive the convergence theory of the dr MARS estimation. LIU, LI AND XIA Assumption 2 (i) The density function of X i exists, and is bounded away from zero and infinity on a compact set. The link function G0 is continuous and differentiable. (ii) The matrix Ω:= E H(X i )H(X i ) is positive definite, and m H plog m H = o(n). (iii) The numbers of dr MARS selected basis functions and MARS selected ones: bm and em, satisfy that bm em1/2n 1/2 + eρ( em) = o P (1). (iv) There exists bρ( ) satisfying bρ(u) 0 as u , such that G0(x ) b H(x ) γ = OP (bρ(m )) , γ = b H n b H n 1 b H n G, conditional on bm = m . Remark 6 Assumption 2(i) extends Assumption 1(ii) to the setting including projected covariates. 
As discussed in Remark 1, Assumption 2(ii) ensures that the least squares estimate (13) is well defined. In fact, Lemma 11 in the appendix shows that 1 n b H b H is positive definite w.p.a.1, indicating that its inverse matrix exists. Let b S be the estimation space by including the linear spline functions and their tensor products selected by dr MARS. We may show that Assumption 2(ii) can be replaced by the empirical identifiability condition on b S. Assumption 2(iii), combined with the convergence property in Theorem 4, is crucial to ensure the consistency property when we replace e B by B in dr MARS. We next discuss the high-level condition in Assumption 2(iv) on the dr MARS estimation bias. Letting H(X 1), , H(X n) , b H = h b H(X 1), , b H(X n) i , we define γ = (H G, γ = (b H b H ) 1 b H G. Similar to the bias decomposition (10) in Remark 1, we have G0(x ) b H(x ) γ = h G0(x ) b H(x ) γ i + h G0(x ) b H(x ) γ G0(x ) + b H(x ) γ i = G0(x ) H(x ) γ + h H(x ) γ b H(x ) γ i h G0(x ) b H(x ) γ G0(x ) + b H(x ) γ i =: d bias1 (x) + d bias2 (x) + d bias3 (x) . (15) Hence the high-level bias order in Assumption 2(iv) combines the three bias terms in the decomposition (15). The first term d bias1 (x) is caused by the approximation error of G0( ) by its projection onto S, an estimation space containing the linear spline functions and their tensor products as in H( ). As discussed in Remark 1, under some smoothness conditions on G0 , we may show that d bias1 (x) is of order m q/d H , where q is a positive number relevant to the smoothness level of G0( ). The second term d bias2 (x) is induced by the projection onto the dr MARS-selected estimation space b S rather than S. Lin (2013) discusses the order of d bias2 (x) under some extra restrictions. As discussed in Remark 1, if bm is of the same order as m H, we conjecture that d bias2 (x) would have the same approximation order as d bias1 (x). Finally, d bias3 (x) is the extra bias due to the replacement of B by e B in the dr MARS algorithm. DIMENSION REDUCTION AND MARS The following theorem gives the point-wise convergence rate for b G0(x ) defined in (14). Theorem 7 Suppose that Assumptions 1 and 2 are satisfied. The dr MARS estimate b G0(x ) has the following convergence result: b G0(x ) G0(x ) = OP m1/2 /n1/2 + bρ(m ) , (16) conditional on bm = m . Remark 8 The convergence rate obtained in Theorem 7 is comparable to those derived by Huang (2003) and Lin (2013) for the polynomial spline regression estimation. Assume that the nonparametric link function is sufficiently smooth, say G0( ) is q-smooth (e.g., Huang, 2003), bm m H, and the convergence of e B is sufficiently fast, say e B is root-n convergent (e.g., Xia, 2008). Following the discussion in Remark 6, we conjecture that bρ(m ) is dominated by d bias1 (x) + d bias2 (x), which is upper bounded by m q/d conditional on bm = m . Consequently the point-wise convergence rate of dr MARS becomes OP (m1/2 /n1/2 + m q/d ), indicating that the optimal order of bm = m is nd/(d+2q) and the optimal convergence rate is expected to be OP (n q/(d+2q)) (e.g., Stone, 1982). In contrast, as discussed in Lin (2013), if G( ) is q-smooth, the conventional MARS estimation (without SDR rotation) has the bias order bm q/d , where bm is the number of the MARS selected basis functions. If bm has the optimal order np/(p+2q), the point-wise convergence rate of the conventional MARS is OP n q/(p+2q) . 
As d is typically smaller than p, it is sensible to expect that dr MARS has faster convergence rate than the conventional MARS under the multiple-index model framework (1). This is confirmed by the numerical studies in Section 4 for finite samples. In practice, we may further modify dr MARS to obtain the nonparametric estimation that is robust to possible model misspecification, i.e., the multiple-index structural assumption (1) is violated. Let ˇXi = X i , X i be a vector combining both the original and projected covariates. Consider a sample (Y1, ˇX1), , (Yn, ˇXn), use the linear spline functions in C C as the candidate functions and apply MARS to adaptively select the basis functions denoted by ˇhj, j = 1, , ˇm. Similarly to (14), the estimate of G(x) is obtained by ˇG(x) = ˇφ0 + j=1 ˇφj ˇhj(ˇx), (17) where ˇφ0, ˇφ1, , ˇφ ˇm are the least squares estimates and ˇx = x , x with x = e B x. Furthermore, the nonparametric estimate can be recast into the following form: ˇG(x) = ˇφ0 + ˇG (x ) + ˇG (x), where ˇG (x ) is defined by summing over the terms in (17) whose basis functions involve only the projected covariates whereas ˇG (x) is defined by summing over the terms whose basis functions involve the original covariates. When the multiple-index model assumption is valid, it is expected that most of the basis functions involved in defining ˇG (x) would be screened out in the adaptive selection process, and consequently ˇG(x) is approximated by ˇφ0 + ˇG (x ), which is expected to be close to b G0(x ) defined in (14). LIU, LI AND XIA 4. Simulation studies In this section, we use simulated data to showcase the performance of the proposed dimension reduction and dr MARS methods in two aspects: estimation of the SDR (central mean) space and estimation of the regression function. For the SDR space estimation, we compare dr MARS with principal Hessian directions (p Hd, Cook and Li, 2004), conditional variance estimator (CVE, Fertl and Bura, 2022), gradient-based kernel dimension reductiong (g KDR, Fukumizu and Leng, 2014) and minimum average variance estimation (MAVE, Xia et al., 2002). The accuracy of an estimate e B is evaluated by D(e B, B) = (I B(B B) 1B )e B F / where d is the effective dimension which is assumed to be known. Selection of this dimension is evaluated separately. The smaller D(e B, B) is, the better the SDR space estimate is. For the regression function estimation, we compare dr MARS with two popular methods: the support vector machine (SVM, Cortes and Vapnik, 1995) and random forest (RF, Breiman, 2001). We also compare the function estimation using the SDR directions obtained by p Hd, CVE, g KDR, MAVE and our dr MARS, respectively, and call them SDR-MARS in general. For any estimate of the regression function G(x) = E(Y |X = x), say b G(x), we define the mean squared error (MSE): h b G(Zi) G(Zi) i2 , to evaluate the estimation accuracy, where, {X1, , Xn} is the in-sample used to estimate the regression function G( ) and {Z1, , Zm} is the out-of-sample used to compute the MSE. Both follow the same distribution. All methods are implemented with R. Specifically, package dr (Weisberg, 2002) for p Hd, cve function in package CVar E for CVE, package MAVE for MAVE, package earth (Milborrow et al., 2017) for MARS, svm function in package e1071 (Dimitriadou et al., 2008) for SVM, package random Forest (Liaw and Wiener, 2002) for RF are used in our numerical studies. 
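For reference, the accuracy criterion D(B̃, B) defined above can be computed as in the following minimal R sketch, which projects B̃ onto the orthogonal complement of span(B) and takes the Frobenius norm; the normalisation by the square root of the effective dimension d and the function name are assumptions of this sketch.

```r
## Subspace-distance criterion: Frobenius norm of the part of B_hat lying
## outside span(B), normalised by sqrt(d) (assumed convention).
subspace_distance <- function(B_hat, B) {
  d <- ncol(B_hat)
  P <- B %*% solve(crossprod(B)) %*% t(B)   # projection onto span(B)
  norm((diag(nrow(B)) - P) %*% B_hat, type = "F") / sqrt(d)
}
```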
The source codes for g KDR and dr MARS as well as all the relevant files can be downloaded from https://github.com/liuyu-star/dr MARS. For all the R functions, their default values of tuning parameters are used. In addition, as the random rotation is a commonly-used ensemble method (e.g., Blaser and Fryzlewicz, 2016; Cannings and Samworth, 2017; Bagnall et al., 2018), we also include it in our comparison, denoted by RAND. In our setting, for each random rotation matrix B, RAND applies MARS to (B Xi, Yi), i = 1, , n to train the model and then predict the testing data. The data is generated by the following nonlinear regression model: Yi = G(Xi) + εi, where Xi = (Xi1, , Xip) i.i.d. Up( 1, 1) or Np (0p, ΣX) with ΣX = 0.6|i j| εi i.i.d. N(0, 0.52). The specifications of G( ) are as follows, (M1) G(x) = 0.5(x1 + x2) + 2.5 exp( 2(x1 + x2 + x3)2), DIMENSION REDUCTION AND MARS (M2) G(x) = 1 30 exp(4x1) + 4 3 + 3 exp( 20(x2 0.5)) + 3x3 + 2x4 + x5 (M3) G(x) = 0.6 sin(πx1x2) + 1.2(x3 0.5)2 + 0.6x4 + 0.3x5, (M4) G(x) = 5x1x2x3, (M5) G(x) = 4(x1 x2 + x3) sin(0.5π(x1 + x2)), (M6) G(x) = x1(x1 + x2 + 1), (M7) G(x) = x1 0.5 + (x2 + 1.5)2 . In terms of SDR, the effective dimensions for M1, , M7 are 2, 3, 4, 3, 2, 2 and 2, respectively. For example, the SDR space of M3 is spanned by β1 = 1, 0 p 1 , β2 = 0, 1, 0 p 2 , β3 = 0, 0, 1, 0 p 3 and β4 = 0, 0, 0, 2/ 5, 0 p 5 . The dimension p is 50 or 100, and the sample size n is 200 or 500. For the estimation of the SDR space, the simulation results based on 100 replications are shown in Tables 1 and 2. For the seven models, Table 1 shows that the estimation errors of dr MARS are smaller than those of p Hd, CVE, g KDR and MAVE, indicating that dr MARS has significant improvement over the competing methods in estimating the SDR space. Moreover, in most cases the relative estimation error reduction of dr MARS over the others improves as the dimension p increases. For example, for M2 with X following the uniform distribution and n = 500, the relative estimation error reduction (dr MARS over MAVE) is (0.72 0.34)/0.72 = 0.5278 when p = 50, and it increases to (0.82 0.31)/0.82 = 0.6220 when p = 100. As mentioned in Section 2, we select the dimension of SDR space by the 10-fold CV criterion, using a similar idea as in Xia et al. (2002). The data sample is randomly divided into 10 equal subsamples Ik with size 0.1n , i.e., {1, , n} = 10 k=1Ik. The true dimension (denoted by d0) is estimated as follows bd = arg max 1 d d CV(d), CV(d) = 1 k=1 R2(d, Ik), where d is set as 5 in the simulation, R2(d, Ik) = 1 Yi b G Ik 0 (b B d Xi) 2 Yi Y Ik 2 , b Bd is computed from the whole data set by setting the dimension of SDR space as d, b G Ik 0 ( ) is computed from the data {(Yi, b B d Xi) : i / Ik} using MARS in the R package earth, and Y Ik is the average value of the response {Yi : i / Ik}. The frequencies of correctly selecting the correct dimension and the computational time for estimating the SDR space over 100 replications are reported in Table 2, where only the results for X Np(0p, ΣX) are reported as the performance is similar for uniformly distributed covariates. For most cases, the dimension estimates based on dr MARS are the most accurate one with ρ(bd = d0) larger than the other methods. 
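The 10-fold CV criterion described above can be implemented along the following lines; this is a minimal R sketch in which estimate_sdr_directions(X, y, d) is a hypothetical stand-in for the gradient-based direction estimator of Section 2, and MARS fits on the projected covariates use the earth package.

```r
library(earth)

select_dimension <- function(X, y, d_max = 5, n_folds = 10) {
  n <- nrow(X)
  folds <- split(sample(n), rep(seq_len(n_folds), length.out = n))
  cv_score <- numeric(d_max)
  for (d in seq_len(d_max)) {
    B_d <- estimate_sdr_directions(X, y, d)   # full-sample direction estimate
    Z <- X %*% B_d
    colnames(Z) <- paste0("z", seq_len(d))
    r2 <- numeric(n_folds)
    for (k in seq_len(n_folds)) {
      test <- folds[[k]]
      fit  <- earth(x = Z[-test, , drop = FALSE], y = y[-test])
      pred <- predict(fit, Z[test, , drop = FALSE])
      ## R^2 on the held-out fold against the training-sample mean of y
      r2[k] <- 1 - sum((y[test] - pred)^2) / sum((y[test] - mean(y[-test]))^2)
    }
    cv_score[d] <- mean(r2)                   # CV(d)
  }
  which.max(cv_score)                         # dimension maximising CV(d)
}
```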
Regarding the computing time, p Hd is the least time consuming LIU, LI AND XIA X Up( 1, 1) X Np(0p, ΣX) Model p n p Hd CVE g KDR MAVE dr MARS p Hd CVE g KDR MAVE dr MARS 50 200 0.89 0.75 0.83 0.74 0.53 0.98 0.76 0.83 0.87 0.80 500 0.74 0.66 0.69 0.60 0.47 0.96 0.71 0.75 0.79 0.71 100 200 0.98 0.94 0.91 0.75 0.58 0.99 0.83 0.87 0.87 0.85 500 0.86 0.71 0.81 0.69 0.44 0.99 0.74 0.82 0.87 0.73 50 200 0.96 0.84 0.82 0.81 0.39 0.96 0.85 0.76 0.90 0.81 500 0.94 0.81 0.77 0.72 0.34 0.95 0.85 0.66 0.91 0.79 100 200 0.98 0.89 0.86 0.81 0.38 0.98 0.91 0.83 0.89 0.82 500 0.97 0.84 0.82 0.82 0.31 0.98 0.90 0.75 0.94 0.83 50 200 0.94 0.88 0.85 0.84 0.72 0.93 0.90 0.83 0.82 0.38 500 0.89 0.84 0.78 0.68 0.66 0.89 0.86 0.77 0.76 0.20 100 200 0.98 0.94 0.90 0.86 0.77 0.97 0.95 0.93 0.81 0.46 500 0.96 0.89 0.86 0.84 0.70 0.94 0.90 0.86 0.82 0.25 50 200 0.94 0.92 0.96 0.95 0.78 0.96 0.85 0.87 0.86 0.21 500 0.93 0.81 0.92 0.95 0.51 0.95 0.83 0.78 0.82 0.04 100 200 0.98 0.97 0.98 0.95 0.85 0.98 0.89 0.96 0.86 0.33 500 0.96 0.96 0.98 0.98 0.65 0.98 0.84 0.89 0.88 0.08 50 200 0.81 0.78 0.97 0.42 0.31 0.93 0.93 0.98 0.96 0.79 500 0.50 0.21 0.77 0.11 0.49 0.90 0.85 0.93 0.94 0.44 100 200 0.97 0.96 0.99 0.66 0.27 0.97 0.97 0.99 0.95 0.84 500 0.75 0.71 0.99 0.21 0.47 0.94 0.94 0.99 0.98 0.54 50 200 0.95 0.76 0.73 0.71 0.57 0.92 0.74 0.92 0.69 0.28 500 0.83 0.61 0.56 0.59 0.56 0.84 0.71 0.69 0.61 0.08 100 200 0.98 0.87 0.83 0.71 0.57 0.98 0.78 0.96 0.67 0.40 500 0.95 0.73 0.68 0.70 0.57 0.92 0.73 0.91 0.70 0.13 50 200 0.96 0.88 0.86 0.88 0.76 0.97 0.80 0.86 0.82 0.47 500 0.91 0.77 0.76 0.81 0.71 0.93 0.75 0.73 0.74 0.22 100 200 0.99 0.94 0.92 0.90 0.76 0.99 0.86 0.90 0.81 0.55 500 0.97 0.87 0.84 0.89 0.66 0.98 0.78 0.83 0.84 0.26 Table 1: Average D(e B, B) for estimation of the SDR space over 100 replications: the smaller the value, the better the method. and CVE is the most time consuming, whereas dr MARS is in the middle. In summary, dr MARS with the 10-fold CV can substantially improve dimension estimation accuracy with reasonable (and acceptable) computational time. Table 3 lists the MSEs for nonparametric regression function estimation, where only the results for X Up( 1, 1) are reported as the performance for Gaussian covariates is similar. Generally, dr MARS has smaller MSEs than the conventional MARS. For example, for model M1 with p = 50 and n = 500, the MSE(G) of MARS is 0.52, the MSE(G) of the SDR-MARS with SDR estimated by p Hd, CVE, g KDR and MAVE are smaller, and the MSE(G) of our dr MARS is the smallest. 
A similar pattern can also be found for the other data generating DIMENSION REDUCTION AND MARS ρ(bd = d0) Computational time (in seconds) Model p n p Hd CVE g KDR MAVE dr MARS p Hd CVE g KDR MAVE dr MARS 50 200 0.00 0.14 0.21 0.13 0.31 0.01 5.79 0.33 2.08 1.57 500 0.00 0.00 0.17 0.09 0.25 0.01 27.56 1.55 7.42 3.38 100 200 0.00 0.31 0.19 0.18 0.19 0.02 17.70 0.70 2.11 2.78 500 0.00 0.01 0.11 0.00 0.20 0.04 55.85 4.85 18.70 6.43 50 200 0.15 0.08 0.10 0.12 0.17 0.01 8.42 0.28 1.87 0.82 500 0.13 0.10 0.08 0.14 0.24 0.01 47.90 1.84 6.71 2.22 100 200 0.17 0.10 0.04 0.13 0.21 0.02 22.04 0.79 1.89 1.24 500 0.13 0.14 0.03 0.17 0.20 0.03 76.34 5.06 16.95 3.17 50 200 0.21 0.00 0.35 0.05 0.36 0.01 13.05 0.26 1.90 1.50 500 0.24 0.00 0.27 0.06 0.30 0.01 66.94 1.53 6.77 4.01 100 200 0.30 0.00 0.29 0.01 0.27 0.02 41.58 0.74 1.95 2.82 500 0.15 0.00 0.23 0.10 0.34 0.03 134.85 4.65 17.25 7.32 50 200 0.13 0.01 0.23 0.14 0.33 0.01 9.53 0.29 1.91 1.13 500 0.15 0.00 0.20 0.04 0.36 0.01 49.21 1.83 6.76 3.09 100 200 0.17 0.00 0.10 0.12 0.35 0.02 28.93 0.82 1.93 1.96 500 0.16 0.00 0.18 0.11 0.43 0.04 93.66 5.18 17.24 5.41 50 200 0.04 0.16 0.17 0.20 0.26 0.01 6.43 0.25 2.08 1.35 500 0.00 0.14 0.20 0.15 0.40 0.01 33.98 1.44 7.38 4.33 100 200 0.04 0.14 0.01 0.15 0.30 0.02 19.38 0.70 2.12 2.61 500 0.00 0.14 0.15 0.04 0.29 0.03 63.76 4.44 18.68 6.75 50 200 0.23 0.02 0.07 0.33 0.44 0.01 5.37 0.26 2.07 1.01 500 0.14 0.01 0.08 0.50 0.46 0.01 25.99 1.53 7.39 1.42 100 200 0.13 0.01 0.16 0.38 0.33 0.02 17.86 0.71 2.11 1.93 500 0.21 0.05 0.01 0.25 0.47 0.04 49.32 4.70 18.63 2.78 50 200 0.03 0.18 0.19 0.21 0.35 0.01 5.43 0.25 2.08 1.38 500 0.03 0.21 0.30 0.25 0.50 0.02 27.11 1.43 7.39 3.48 100 200 0.09 0.18 0.06 0.16 0.39 0.02 17.09 0.70 2.11 2.76 500 0.05 0.23 0.02 0.07 0.53 0.04 51.15 4.28 18.71 6.60 Table 2: The proportion of selecting the true dimension of the SDR space ρ(bd = d0) and the computing time for estimating the SDR space (with the true dimension) over 100 replications when X Np(0p, ΣX). processes. Note that the MSE(G) of MARS may be larger than that of SVM and RF for some of the data generating processes (such as M1 and M4). However, in most cases, our dr MARS has smaller MSE(G) than (or comparable MSE(G) to) the SVM and RF methods. The simulation results in Table 3 also show that RAND has poorer numerical performance than the other SDR-MARS methods. 
LIU, LI AND XIA Original SDR-MARS Model p n SVM RF MARS RAND p Hd CVE g KDR MAVE dr MARS 50 200 0.95 0.80 0.99 0.92 1.03 0.49 1.02 0.54 0.35 500 0.87 0.61 0.52 0.82 0.46 0.15 0.51 0.13 0.09 100 200 0.98 0.83 1.23 0.98 1.65 1.42 1.45 0.63 0.40 500 0.95 0.66 0.64 0.92 0.74 0.26 0.66 0.32 0.10 50 200 0.43 0.28 0.42 0.35 0.42 0.33 0.36 0.59 0.36 500 0.28 0.16 0.29 0.23 0.29 0.27 0.28 0.34 0.27 100 200 0.61 0.31 0.43 0.55 0.43 0.54 0.62 0.61 0.38 500 0.38 0.18 0.34 0.34 0.34 0.31 0.33 0.57 0.31 50 200 0.49 0.29 0.63 0.43 0.64 0.44 0.44 0.65 0.36 500 0.35 0.22 0.42 0.30 0.42 0.28 0.32 0.32 0.27 100 200 0.63 0.31 0.72 0.59 0.74 0.67 0.74 0.71 0.42 500 0.45 0.24 0.55 0.42 0.55 0.39 0.40 0.64 0.32 50 200 0.98 0.98 2.20 1.00 1.22 1.28 1.10 1.91 1.11 500 0.99 0.96 1.25 0.97 0.92 0.62 0.93 1.10 0.70 100 200 0.97 0.96 2.45 0.99 1.73 1.49 1.73 1.99 1.41 500 0.98 0.96 1.94 0.97 1.24 1.27 1.13 1.84 0.97 50 200 6.57 4.18 1.09 6.57 1.09 1.09 1.09 1.08 0.87 500 6.30 2.83 0.24 6.20 0.24 0.24 0.24 0.26 0.25 100 200 6.60 4.53 1.56 6.68 1.56 1.63 1.56 1.54 0.95 500 6.61 3.03 0.27 6.49 0.27 0.27 0.27 0.27 0.28 50 200 0.34 0.15 0.36 0.30 0.37 0.24 0.34 0.32 0.15 500 0.25 0.09 0.26 0.21 0.26 0.10 0.20 0.16 0.09 100 200 0.40 0.16 0.40 0.39 0.40 0.47 0.52 0.36 0.19 500 0.31 0.09 0.31 0.30 0.32 0.20 0.27 0.34 0.10 50 200 0.08 0.05 0.35 0.08 0.23 0.18 0.13 0.31 0.14 500 0.06 0.03 0.23 0.06 0.17 0.10 0.08 0.18 0.12 100 200 0.09 0.05 0.39 0.09 0.35 0.24 0.24 0.32 0.17 500 0.07 0.04 0.29 0.07 0.23 0.17 0.11 0.29 0.12 Table 3: Average MSE(G) for the regression function estimation over 100 replications when X Up( 1, 1). 5. Real data analysis In this section, we apply the proposed dr MARS to the out-of-sample prediction of real data and statistical inference. Similarly to the simulation studies in Section 4, we consider the conventional MARS and SDR-MARS using various SDR estimation methods and compare their performance with other commonly-used nonparametric regression methods such as RF and SVM. We build the model using the training set (Xtrain i , Y train i ) : i = 1, , n , and make prediction for the testing set (Xtest i , Y test i ) : i = 1, , m . The prediction performance is DIMENSION REDUCTION AND MARS evaluated by the relative mean squared prediction error: b Y test i Y test i 2 / Y Y test i 2 , where b Y test i is the fitted value of the response Y test i and Y = 1 n Pn i=1 Y train i is a naive prediction using the average of response observations in the learning set. In addition, as the use of logistic regression, our dr MARS can be used for classification with two categories denoted as 0 and 1. Specifically, letting the prediction for the testing set be bytest i , i = 1, 2, , m, the classification is b Y test i = I bytest i > 0.5 , where I( ) is the indicator function. The classification performance is measured by the misclassification rate (MCR) defined as i=1 I(b Y test i = Y test i ). The following data sets are used to demonstrate the performance of prediction. data.1 The data (https://archive.ics.uci.edu/ml/datasets/concrete+compr essive+strength) is about the concrete compressive strength (Y ) and its dependence on concrete s ingredients and age (X). It has p = 8 predictors and N = 1, 030 observations. The square root transformation is made to the concrete compressive strength as the response. data.2 The data (www.kaggle.com/harlfoxem/housesalesprediction) contains house sale prices for King County in US including Seattle between May 2014 and May 2015. It contains N = 21, 613 house sale records. 
The interest is to predict the house sale prices (Y ) based on p = 18 variables (X). The logarithm transformation is made to the house sale prices. data.3 The data (archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitor ing) is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson s disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. Data on people s age, gender, time interval from baseline recruitment date and 16 biomedical voice measures are the covariates (X with p = 19), and N = 5, 875 voice recording from these individuals are collected. Our interest is to predict the motor scores ( motor UPDRS , Y ) from the 19 covariates. data.4 The data (https://archive.ics.uci.edu/ml/datasets/Residential+Bu ilding+Data+Set) contains construction cost, project variables, and economic variables corresponding to real estate single-family residential apartments in Tehran, Iran. It contains N = 372 observations. The interest is to predict the construction cost (Y ) using p = 102 predictors (X) without considering the construction year. The logarithm transformation is made to the construction cost. The following data sets are used to demonstrate the classification performance. LIU, LI AND XIA data.5 The data (www.kaggle.com/datasets/muratkokludataset/pistachio-da taset) includes a total of N = 2, 148 images, 1,232 of Kirmizi type (Y = 0) and 916 of Siirt type (Y = 1). Each image contains 12 morphological features, 4 shape features and 12 color features (p = 28). We are interested in the classification of the images based on the 28 features. data.6 The data (https://archive.ics.uci.edu/ml/datasets/Hill-Valley) contains N = 1, 212 records, each of which represents p = 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a bump in the terrain, Y = 1) or a Valley (a dip in the terrain, Y = 0). Our interest is to discriminate whether a given record is a Hill or a valley by 100 points on a two-dimensional graph. data.7 The data (www.kaggle.com/datasets/cnic92/200-financial-indicator s-of-us-stocks-20142018) includes N = 986 US stocks in year 2018, each of which contains p = 216 financial indicators. These predictors are commonly found in the 10-K filings each publicly traded company releases yearly. Each stock is classified into two classes: if the value of a stock increases during 2019 then Y = 1; if the value of a stock decreases during 2019 then Y = 0. The interest is to classify those stocks that are buy-worthy or not. We randomly select n = min(1000, N/3 ) or n = min(2000, 2N/3 ) observations as the training set, and the remaining observations as the testing set, and repeat the random splitting 100 times. The dimension of SDR space is selected using the 10-fold CV described in Section 4. The average r MSPEs and MCRs are reported in Table 4 below. 
Data Training Original SDR-MARS (N, p) size SVM RF MARS RAND p Hd CVE g KDR MAVE dr MARS data.1 343 21.30 17.86 15.85 21.42 15.85 15.85 15.85 15.88 12.69 (1030, 8) 686 16.75 12.05 14.21 20.10 14.21 14.21 14.15 13.90 10.72 data.2 1000 22.55 16.06 15.50 35.14 15.50 15.50 15.50 15.46 14.62 (21613, 18) 2000 19.57 14.35 13.89 33.37 13.89 13.89 13.86 13.67 13.09 data.3 1000 67.14 35.05 32.16 70.75 30.43 31.14 31.08 31.05 12.79 (5875, 19) 2000 60.20 23.98 30.62 69.58 26.43 28.95 26.51 28.37 10.60 data.4 124 10.02 8.74 5.85 28.86 12.82 5.83 5.86 6.01 4.75 (372, 102) 248 6.73 6.41 4.21 26.75 7.93 4.17 4.22 4.34 3.68 data.5 716 8.74 11.22 9.30 11.23 9.30 9.30 9.30 8.27 9.07 (2148,28) 1432 7.55 10.31 7.72 10.28 7.72 7.72 7.72 7.29 7.52 data.6 404 50.00 44.88 19.38 8.39 20.08 17.88 8.31 14.60 6.21 (1212, 100) 808 50.07 40.88 19.57 5.68 16.90 17.03 6.24 17.08 3.16 data.7 328 19.47 5.19 0.35 21.68 0.49 0.35 0.36 0.95 0.31 (986 ,216) 657 13.94 0.94 0.12 21.46 0.13 0.12 0.12 0.35 0.11 Table 4: Average r MSPE or MCR of the real data over 100 replications (in %) DIMENSION REDUCTION AND MARS As can be seen from Table 4, the conventional MARS often has smaller r MSPE than SVM and RF in particular when the size of training sets is relatively smaller. SDR-MARS based on various dimension reduction methods may make a further improvement over MARS and the improvement of the proposed dr MARS is more remarkable than the other SDR-MARS methods (see the columns under SDR-MARS ). In contrast, RAND has the worst performance in the out-of-sample prediction with the largest r MSPE. Regarding the classification performance, it can be seen that SDR-MARS again outperforms MARS and dr MARS often has much more significant improvement over MARS than the other methods (see data.6 and data.7). We next make some further illustration of the estimated model structure using data.3 and data.4. For ease of comparison, we standardize each variable. The estimation results of data.3 are listed in Table 5. The dimension of SDR space and the interaction degree of dr MARS are selected as 3 and 1, respectively, and the regression model is estimated as follows, E(Y | X = x) = 47.82 + g1(β 1x) + g2(β 2x) + g3(β 3x), g1(v1) = 553.72(v1 0.66)+ + 2.78(0.66 v1)+ + 150.52(v1 0.63)+, g2(v2) = 90.49(v2 + 1.27)+ 34.30( 1.27 v2)+ + 121.43(v2 + 0.00)+ 111.66(v2 0.19)+ + 28.26(v2 0.57)+ + 12.22(v2 + 0.41)+ +158.01(v2 + 1.01)+ 123.24(v2 + 0.65)+, g3(v3) = 7.67(v3 + 0.29)+ 8.60( 0.29 v3)+, where B = (β1, β2, β3) is the direction matrix in SDR space with the estimation results reported in Table 5. Note that the model is additive (with the interaction degree 1), we are able to estimate each additive function (with confidence bands) as plotted in Figure 1. í í sex β7 [ age 0DOH Female fitted function g ( ) fitted function g ( ) fitted function g ( ) Figure 1: The estimated additive functions using the SDR directions for data.3 Note that as MARS takes a step-wise procedure to select the spline bases, some of them may be screened out. As a consequence, some of the predictors may not be selected in the model. This is shown clearly in the estimated coefficients of the directions in Table LIU, LI AND XIA 5. It is known that gender and age are two important factors for the parkinson s disease, which are clearly dominant in the first two directions; see the first two columns of Table 5 and the left and middle plots of Figure 1. 
It is also interesting to see that the last column in Table 5 is mainly comprised of Shimmer.DDA (with coefficient 0.4893) and Jitter.RAP (with coefficient 0.4176) and Jitter.DDP (with coefficient -0.3865). As the first two directions are mainly influenced by age and sex, we can plot them separately as shown in Figure 1, which is in line with the understanding of the relationship between the disease and the two variables. The estimated coefficients for the third projection imply that higher value of Shimmer.DDA leads to a lower degree of the disease. The result sheds some light on the debate about the usefulness of Shimmer.DDA in the disease diagnostics, suggesting that Shimmer.DDA is indeed useful in identifying the disease (e.g., Hausdorff, 2007; Kirchner et al., 2014). age -0.0237 -0.9440 0.0722 sex -0.9443 -0.0126 0.0399 test time 0.0000 0.0000 -0.0547 Jitter(%) 0.0000 0.0000 -0.1374 Jitter.Abs 0.0000 0.0000 0.0606 Jitter.RAP 0.0000 0.0106 0.4176 Jitter.PPQ5 0.0000 0.0000 -0.1404 Jitter.DDP 0.0000 0.0000 -0.3865 Shimmer 0.0000 0.0000 -0.1577 Shimmer.d B 0.0000 0.0000 0.2654 Shimmer.APQ3 0.0000 0.0000 -0.3347 Shimmer.APQ5 0.0000 0.0000 -0.1152 Shimmer.APQ11 0.0000 0.0000 -0.0935 Shimmer.DDA 0.0000 0.0126 0.4893 NHR 0.0000 0.0000 0.0635 HNR 0.0000 0.0000 0.0814 RPDE 0.0000 0.0000 0.0226 DFA 0.0000 0.0000 0.3361 PPE 0.0000 0.0000 0.0000 Table 5: The estimated SDR directions for data.3 For data.4, the dimension of SDR space and the interaction degree of MARS are selected as 3 and 2, respectively, and the model is estimated as follows, E(Y | X = x) = 5.82 + g1(β 1x) + g2(β 2x) + g3(β 3x) + g12(β 1x, β 2x) + g13(β 1x, β 3x), where β1, β2 and β3 are directions of the SDR space, and g1(v1) = 31.67(v1 + 0.00)+ 42.83( 0.00 v1)+ 12.97(v1 + 0.03)+ 10.30(v1 0.03)+, DIMENSION REDUCTION AND MARS g2(v2) = 7.52(v2 + 0.00)+ 3.73( 0.00 v2)+, g3(v3) = 28.42(v2 0.03)+, g12(v1, v2) = 881.25( 0.00 v1)+(v2 0.01)+, g13(v1, v3) = 153.80(v1 + 0.03)+(v3 0.01)+. Inherited from MARS, dr MARS may remove some variables if they are not important. Hence only a small portion of the 102 predictors are selected by dr MARS in the final model and have nonzero coefficients in the directions, and the coefficients of the remaining variables are zero. Table 6 only lists those variables that have nonzero coefficients. Note that the 102 predictors include 7 project physical and financial variables, and 5 groups of time lag economic variables (5*19 variables in total), which we denote in Table 6 as lag k, k = 1, , 5. It shows that none of project physical and financial variables is significant, and significant economic variables appear in multiple time lags, indicating that economic variables have a durable effect on final costs. Specifically, the building services index (x9, x28, x47, x66) and consumer price index (x22, x23, x41, x42, x60, x61) are important factors for the final cost. The land price index (x33, x52) and the cumulative liquidity (x12, x31, x50, x69) also affect the final cost. 
variables that has non-zero coefficients in dr MARS β1 β2 β3 x9: Building services index (BSI) for a preselected base year (lag 1) 0.4291 -0.2057 0.1172 x12: Cumulative liquidity (lag 1) 0.1303 0.0000 0.1469 x22: Consumer price index (CPI) in the base year (lag 1) 0.0000 -0.1739 0.0000 x23: CPI of housing, water, fuel & power in the base year (lag 1) 0.3816 0.2211 0.0000 x28: Building services index (BSI) for a preselected base year (lag 2)-0.5875 0.0000 0.0000 x29: Wholesale price index (WPI) of building materials for the base 0.0000 -0.1322 0.0000 year (lag 2) x31: Cumulative liquidity (lag 2) 0.0000 -0.1167-0.1606 x33: Land price index for the base year (lag 2) 0.0000 0.1473 0.0000 x41: Consumer price index (CPI) in the base year (lag 2) 0.2372 0.0000 0.1360 x42: CPI of housing, water, fuel & power in the base year (lag 2) -0.2168 -0.5692-0.3176 x47: Building services index (BSI) for a preselected base year (lag 3) 0.1970 0.6146 0.3905 x50: Cumulative liquidity (lag 3) -0.1275 0.0000-0.3034 x52: Land price index for the base year (lag 3) -0.1467 0.0000-0.1483 x60: Consumer price index (CPI) in the base year (lag 3) 0.0000 -0.1859 0.0000 x61: CPI of housing, water, fuel & power in the base year (lag 3) 0.1178 0.2444 0.4858 x66: Building services index (BSI) for a preselected base year (lag 4)-0.2091 0.0000-0.4764 x69: Cumulative liquidity (lag 4) 0.1549 0.0000 0.2307 Table 6: The estimated SDR directions for data.4 with nonzero coefficients, while coefficients for those not listed here are all 0. Due to the interaction degree 2, we draw plots in the three-dimensional space for the dependence of the cost Y on the projected variables β j X, j = 1, 2, 3, as shown in Figure 2. The first row of plots shows the relationship between cost and the directions; the second row shows the corresponding fitted functions specified in the estimated model above. LIU, LI AND XIA 0.1 0.0 0.1 0.00. 0.2 3 0.1 0.0 0.1 fitted values E(Y|X) fitted values E(Y|X) fitted values E(Y|X) β T x β T x 1 β T x β T x 0 β T x β T x 0. 0.00. 0. 0. 0.0 0. 0. 0. 0. 0. 0.0 β T x β T x 1 β T x β T x ILWWHG IXQFWLRQ J J J ILWWHG IXQFWLRQ J J J ILWWHG IXQFWLRQ J J β T x β T x Figure 2: Plots for data.4: the plots on the top panel are the cost against each direction, and those on the bottom panel are the fitted functions Interestingly, as shown in the last panel of Figure 2, the cost has a non-linear dependence on the direction. The nonlinearity is along the second direction, which is a contrast between BSI for a preselected base year x47 (coefficient 0.6146) with the CPI of housing, water, fuel & power in the base year x42, x61 (coefficient -0.5692, 0.2444). The reason for this nonlinearity needs further investigation. 6. Conclusion This paper has proposed a general method that combines SDR with the commonly-used MARS algorithm to estimate nonparametric regression functions. The special structure of the MARS basis functions makes it easy to compute the gradient vector of regression functions and thus the SDR space. The selection of spline functions in MARS also makes our dimension reduction method more suitable for high dimensional data. The proposed dr MARS based on the SDR space can in turn improve the efficiency of conventional MARS. 
Through comparison with other commonly-used nonparametric estimation and dimension reduction techniques, our numerical studies, including both simulation and empirical applications, show that the proposed dr MARS has better finite-sample performance in both in-sample estimation and out-of-sample prediction. In summary, there are two key factors in dr MARS that contribute to its performance. First, dr MARS benefits from the automatic selection of basis functions, a feature inherent in MARS itself, which can also lead to variable selection. Second, dr MARS performs dimension reduction with the same objective as MARS, namely minimizing the MARS loss function.

Several issues can be studied further. Cai et al. (2022) suggest a hybrid of random projection and SDR, which uses random projection to reduce the dimension of the predictors to a lower-dimensional space and then applies SDR to the smaller space. We conjecture that such a hybrid may be adopted here. Our method can also be applied to other regression methods, such as random forests or support vector machines, to handle interactions between variables.

Appendix: Proofs of the asymptotic theorems

In this appendix we prove the main theorems in Sections 2 and 3. Throughout the proofs, we let C denote a generic positive constant whose value may change from line to line. We start with a useful inequality for independent random matrices (Tropp, 2012).

Lemma 9 Suppose that $\Lambda_i$, $i = 1, \ldots, n$, are independent $q_1 \times q_2$ random matrices with zero mean and $\max_{1 \le i \le n} \|\Lambda_i\| \le \lambda_n$. Then, for any $z > 0$,
$$
P\left(\Big\|\sum_{i=1}^n \Lambda_i\Big\| \ge z\right) \le (q_1 + q_2)\exp\left\{-\frac{z^2}{2(\xi_n^2 + \lambda_n z/3)}\right\},
\qquad
\xi_n^2 = \max\left\{\Big\|\sum_{i=1}^n E\big(\Lambda_i \Lambda_i^\top\big)\Big\|,\ \Big\|\sum_{i=1}^n E\big(\Lambda_i^\top \Lambda_i\big)\Big\|\right\}.
$$

The following lemma ensures that the least squares estimate (8) is well defined.

Lemma 10 Suppose that Assumption 1(i)-(iii) is satisfied. Then $\frac{1}{n}\widetilde{H}^\top\widetilde{H}$ is positive definite w.p.a.1.

Proof of Lemma 10. Recall that $H = [H(X_1), \ldots, H(X_n)]^\top$. We first prove
$$
\Big\|\frac{1}{n}H^\top H - \Omega\Big\| = o_P(1). \tag{18}
$$
Note that
$$
\frac{1}{n}H^\top H - \Omega = \frac{1}{n}\sum_{i=1}^n\big[H(X_i)H(X_i)^\top - \Omega\big].
$$
We next make use of the inequality in Lemma 9 with $\Lambda_i = H(X_i)H(X_i)^\top - \Omega$ to prove (18). It is easy to verify that $\lambda_n = c_1 m_H$ and $\xi_n^2 = c_2 m_H^2$, where $c_1$ and $c_2$ are two positive constants. By Lemma 9 and $m_H\log m_H = o(n)$, for any $\epsilon > 0$, we have
$$
P\left(\Big\|\frac{1}{n}H^\top H - \Omega\Big\| \ge \epsilon\right) = P\left(\Big\|\sum_{i=1}^n\big[H(X_i)H(X_i)^\top - \Omega\big]\Big\| \ge n\epsilon\right) \le 2m_H\exp\left\{-\frac{\epsilon^2 n^2}{2(c_2 m_H^2 + \epsilon c_1 m_H n/3)}\right\} = \exp\left\{\log(2m_H) - \frac{\epsilon^2 n^2}{2c_2 m_H^2 + 2\epsilon c_1 m_H n/3}\right\} \to 0,
$$
completing the proof of (18). Combining Assumption 1(iii) with (18), we may show that $\frac{1}{n}H^\top H$ is positive definite w.p.a.1, i.e., $\lambda_{\min}\big(\frac{1}{n}H^\top H\big)$ is positive and bounded away from zero, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a square matrix. It is trivial to verify that $\lambda_{\min}(H^\top H) \le \lambda_{\min}(\widetilde{H}^\top\widetilde{H})$. Hence, we may claim that $\frac{1}{n}\widetilde{H}^\top\widetilde{H}$ is positive definite w.p.a.1.

Proof of Theorem 2. Without loss of generality, we next prove the convergence results by setting $\widetilde m = m_*$, where $m_*$ is a non-random positive integer. Letting $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ and $G_* = [G_*(X_1), \ldots, G_*(X_n)]^\top$, by (8), we have
$$
\widetilde G_*(x) - G_*(x) = \Pi_1(x) + \Pi_2(x), \tag{19}
$$
where
$$
\Pi_1(x) = \widetilde H_*(x)\big(\widetilde H^\top\widetilde H\big)^{-1}\widetilde H^\top\varepsilon, \qquad
\Pi_2(x) = \widetilde H_*(x)\big(\widetilde H^\top\widetilde H\big)^{-1}\widetilde H^\top G_* - G_*(x).
$$
We first consider $\Pi_1(x)$. Let $\mathcal{F}_X = \sigma(X_1, \ldots, X_n)$ and $E_{m_*} = (e_1, e_2, \ldots, e_{m_*})^\top$, where $e_j$ is an $m_H$-dimensional vector with the $j$-th element being 1 and the others being zeros. Note that $\widetilde H^\top = E_{m_*} H^\top$. By Assumption 1(i), we have
$$
E_{m_*}\,\mathrm{Var}\big(n^{-1/2}H^\top\varepsilon \mid \mathcal{F}_X\big)E_{m_*}^\top = \sigma^2\,\frac{1}{n}E_{m_*}H^\top H E_{m_*}^\top = \sigma^2\,\frac{1}{n}\widetilde H^\top\widetilde H, \tag{20}
$$
indicating that
$$
|\Pi_1(x)|_2^2 \le \frac{C}{n}\,\widetilde H_*(x)\Big(\frac{1}{n}\widetilde H^\top\widetilde H\Big)^{-1}\widetilde H_*(x)^\top \quad \text{w.p.a.1.} \tag{21}
$$
As $|\widetilde H_*(x)|_2$ is of order $m_*^{1/2}$, it follows from (21) and Lemma 10 that
$$
|\Pi_1(x)|_2 = O_P\big(m_*^{1/2}n^{-1/2}\big). \tag{22}
$$
On the other hand, by Assumption 1(iv), we have
$$
|\Pi_2(x)|_2 = \Big|\widetilde H_*(x)\big(\widetilde H^\top\widetilde H\big)^{-1}\widetilde H^\top G_* - G_*(x)\Big|_2 = \big|\widetilde H_*(x)\,\alpha^o - G_*(x)\big|_2 = O_P\big(\widetilde\rho(m_*)\big). \tag{23}
$$
By (22) and (23), we complete the proof of (11).

We next turn to the proof of (12). Note that
$$
\widetilde\Sigma_G - \Sigma_G = \left[\frac{1}{n}\sum_{i=1}^n \widetilde G_*(X_i)\widetilde G_*(X_i)^\top - \frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top\right] + \left[\frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top - \Sigma_G\right] =: \Pi_3 + \Pi_4. \tag{24}
$$
By Assumption 1(i) and Lemma 9, we readily have
$$
P\left(\Big\|\frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top - \Sigma_G\Big\| \ge M n^{-1/2}\right) \to 0 \quad \text{when } M \to \infty.
$$
This indicates that
$$
\Big\|\frac{1}{n}\sum_{i=1}^n G_*(X_i)G_*(X_i)^\top - \Sigma_G\Big\| = O_P\big(n^{-1/2}\big). \tag{25}
$$
Re-write $\Pi_3$ as
$$
\Pi_3 = \frac{1}{n}\sum_{i=1}^n\big[\widetilde G_*(X_i) - G_*(X_i)\big]G_*(X_i)^\top + \frac{1}{n}\sum_{i=1}^n G_*(X_i)\big[\widetilde G_*(X_i) - G_*(X_i)\big]^\top + \frac{1}{n}\sum_{i=1}^n\big[\widetilde G_*(X_i) - G_*(X_i)\big]\big[\widetilde G_*(X_i) - G_*(X_i)\big]^\top =: \Pi_{3,1} + \Pi_{3,2} + \Pi_{3,3}. \tag{26}
$$
Following the proofs of (20)-(23), we may show that
$$
\left(\frac{1}{n}\sum_{i=1}^n |\Pi_1(X_i)|_2^2\right)^{1/2} \le C n^{-1/2}\left(\frac{1}{n}\sum_{i=1}^n \big|\widetilde H_*(X_i)\big|_2^2\right)^{1/2} = O_P\big(n^{-1/2}m_*^{1/2}\big), \tag{27}
$$
$$
\left(\frac{1}{n}\sum_{i=1}^n |\Pi_2(X_i)|_2^2\right)^{1/2} = O_P\big(\widetilde\rho(m_*)\big). \tag{28}
$$
By the decomposition (19), the Cauchy-Schwarz inequality, (27) and (28), we have
$$
\big\|\Pi_{3,1}\big\| \le \left(\frac{1}{n}\sum_{i=1}^n\big|\widetilde G_*(X_i) - G_*(X_i)\big|_2^2\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n\big|G_*(X_i)\big|_2^2\right)^{1/2} \le C\left[\left(\frac{1}{n}\sum_{i=1}^n|\Pi_1(X_i)|_2^2\right)^{1/2} + \left(\frac{1}{n}\sum_{i=1}^n|\Pi_2(X_i)|_2^2\right)^{1/2}\right] = O_P\big(m_*^{1/2}n^{-1/2} + \widetilde\rho(m_*)\big), \tag{29}
$$
and similarly
$$
\big\|\Pi_{3,2}\big\| = O_P\big(m_*^{1/2}n^{-1/2} + \widetilde\rho(m_*)\big), \qquad \big\|\Pi_{3,3}\big\| = O_P\big(m_* n^{-1} + \widetilde\rho^2(m_*)\big). \tag{30}
$$
By virtue of (24)-(26), (29) and (30), we complete the proof of (12).

Proof of Theorem 4. By Assumption 1(v) and the Davis-Kahan theorem (e.g., Yu et al., 2015), there exists a $d \times d$ rotation matrix $Q$ such that
$$
\big\|\widetilde B - BQ\big\| \le C\big\|\widetilde\Sigma_G - \Sigma_G\big\|,
$$
which, together with (12), proves Theorem 4.

The following lemma ensures that the least squares estimate (13) is well defined.

Lemma 11 Suppose that Assumptions 1 and 2(i)-(iii) are satisfied. Then $\frac{1}{n}\widehat{H}^\top\widehat{H}$ is positive definite w.p.a.1.

Proof of Lemma 11. Recall that $\widehat H = [\widehat H(\widehat X_1^*), \ldots, \widehat H(\widehat X_n^*)]^\top$ and define $\widehat H^* = [\widehat H(X_1^*), \ldots, \widehat H(X_n^*)]^\top$. By Theorem 4, the smoothness property of the basis functions and Assumption 2(iii), we have
$$
\Big\|\frac{1}{n}\widehat H^\top\widehat H - \frac{1}{n}\widehat H^{*\top}\widehat H^*\Big\| = O_P\Big(\widehat m\big(\widetilde m^{1/2}n^{-1/2} + \widetilde\rho(\widetilde m)\big)\Big) = o_P(1). \tag{31}
$$
By (31), it is sufficient to show that $\frac{1}{n}\widehat H^{*\top}\widehat H^*$ is positive definite w.p.a.1. This can be proved by using Assumption 2(ii) and following the proof of Lemma 10.

Proof of Theorem 7. Without loss of generality, we next prove the consistency property by setting $\widehat m = m_*$, where $m_*$ is a non-random positive integer. Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ and $G = [G(X_1), \ldots, G(X_n)]^\top$ as in the proof of Theorem 4. Note that
$$
\widehat G_0(x^*) - G_0(x^*) = \widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top\varepsilon + \Big[\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top G - G_0(x^*)\Big] = \widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top\varepsilon + O_P\big(\widehat\rho(m_*)\big) \tag{32}
$$
conditional on $\widehat m = m_*$. Letting $H^* = [H(X_1^*), \ldots, H(X_n^*)]^\top$ with $H(\cdot)$ defined in Section 3, and $E_{m_*} = (e_1, e_2, \ldots, e_{m_*})^\top$, we then have $\widehat H^{*\top} = E_{m_*} H^{*\top}$. Note that
$$
\widehat H^\top\varepsilon = \widehat H^{*\top}\varepsilon + \big(\widehat H - \widehat H^*\big)^\top\varepsilon = E_{m_*}H^{*\top}\varepsilon + \big(\widehat H - \widehat H^*\big)^\top\varepsilon. \tag{33}
$$
As in the proof of (20), we may show that
$$
E_{m_*}\,\mathrm{Var}\big(n^{-1/2}H^{*\top}\varepsilon \mid \mathcal{F}_X\big)E_{m_*}^\top = \sigma^2\,\frac{1}{n}\widehat H^{*\top}\widehat H^*,
$$
which, together with Lemma 11, indicates that
$$
\Big|\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^{*\top}\varepsilon\Big|_2 = \Big|\widehat H(x^*)^\top\big(\widehat H^{*\top}\widehat H^*\big)^{-1}\widehat H^{*\top}\varepsilon\Big|_2\,(1 + o_P(1)) = O_P\big(m_*^{1/2}n^{-1/2}\big). \tag{34}
$$
On the other hand, by Theorem 4, Lemma 11 and the smoothness property of the MARS basis functions, we have
$$
\Big|\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\big(\widehat H - \widehat H^*\big)^\top\varepsilon\Big|_2 = o_P\big(m_*^{1/2}n^{-1/2}\big). \tag{35}
$$
By virtue of (33)-(35), we have
$$
\Big|\widehat H(x^*)^\top\big(\widehat H^\top\widehat H\big)^{-1}\widehat H^\top\varepsilon\Big|_2 = O_P\big(m_*^{1/2}n^{-1/2}\big). \tag{36}
$$
With (32) and (36), we complete the proof of (16).

Acknowledgements

The authors would like to thank an Editor and three reviewers for the constructive comments, which helped to improve the article. This project is supported by the National Natural Science Foundation of China (72033002 and 11931014) and MOE's Academic Research Fund (A-8000021-00-00) of Singapore. The second author is partly supported by the Leverhulme Research Fellowship (RF-2023-396).

References

Anthony Bagnall, M Flynn, J Large, J Line, A Bostrom, and G Cawley. Is rotation forest the best classifier for problems with continuous features? arXiv preprint arXiv:1809.06705, 2018.
Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604, 2008.

Rico Blaser and Piotr Fryzlewicz. Random rotation ensembles. Journal of Machine Learning Research, 17(1):126-151, 2016.

Richard C Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107-144, 2005.

Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.

Zhibo Cai, Yingcun Xia, and Weiqiang Hang. An outer-product-of-gradient approach to dimension reduction and its application to classification in high dimensional space. Journal of the American Statistical Association, forthcoming, 2022.

Timothy I Cannings and Richard J Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959-1035, 2017.

Xin Chen, Changliang Zou, and Dennis Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6):3696-3723, 2010.

Dennis Cook and Bing Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455-474, 2002.

R Dennis Cook and Bing Li. Determining the dimension of iterative Hessian transformation. The Annals of Statistics, 32(6):2501-2531, 2004.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

Evgenia Dimitriadou, Kurt Hornik, Friedrich Leisch, David Meyer, and Andreas Weingessel. Misc functions of the Department of Statistics (e1071), TU Wien. R package, 1:5-24, 2008.

Robert F. Engle, C. W. J. Granger, John Rice, and Andrew Weiss. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81(394), 1986.

Jianqing Fan and Irene Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC, 1996.

Lukas Fertl and Efstathia Bura. Conditional variance estimator for sufficient dimension reduction. Bernoulli, 28(3):1862-1891, 2022.

Jerome H Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-67, 1991.

Kenji Fukumizu and Chenlei Leng. Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109(505):359-370, 2014.

Wolfgang Härdle, Peter Hall, and Hidehiko Ichimura. Optimal smoothing in single-index models. The Annals of Statistics, 21(1):157-178, 1993.

Trevor Hastie and Robert Tibshirani. Generalized additive models. Statistical Science, 1(3):297-310, 1986.

Trevor Hastie and Robert Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 55(4):757-796, 1993.

Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

Jeffrey M Hausdorff. Gait dynamics, fractals and falls: finding meaning in the stride-to-stride fluctuations of human walking. Human Movement Science, 26(4):555-589, 2007.

Jianhua Z Huang. Local asymptotics for polynomial spline regression. The Annals of Statistics, 31(5):1600-1635, 2003.

Marietta Kirchner, Patric Schubert, Magnus Liebherr, and Christian T Haas. Detrended fluctuation analysis and adaptive fractal analysis of stride time data in Parkinson's disease: stitching together short gait trials. PLoS ONE, 9(1):e85787, 2014.
Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.

Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002.

Wei Lin. The Econometric Analysis of Interval-Valued Data and Adaptive Regression Splines. PhD thesis, UC Riverside, 2013.

Wei Luo, Bing Li, and Xiangrong Yin. On efficient dimension reduction with respect to a statistical functional of interest. The Annals of Statistics, 42(1):382-412, 2014.

Yanyuan Ma and Liping Zhu. On estimation efficiency of the central mean subspace. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(5):885-901, 2014.

Stephen Milborrow, Trevor Hastie, Robert Tibshirani, Alan Miller, and Thomas Lumley. earth: Multivariate adaptive regression splines. R package version, 5(2), 2017.

Charles J Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040-1053, 1982.

Charles J Stone. Large-sample inference for log-spline models. The Annals of Statistics, 18(2):717-741, 1990.

Charles J Stone. Asymptotics for doubly flexible logspline response models. The Annals of Statistics, 19(4):1832-1854, 1991.

Charles J Stone, Mark H. Hansen, Charles Kooperberg, and Young K. Truong. Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics, 25(4):1371-1425, 1997.

Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389-434, 2012.

Tao Wang, Peirong Xu, and Lixing Zhu. Variable selection and estimation for semiparametric multiple-index models. Bernoulli, 21(1):242-275, 2015.

S. Weisberg. Dimension reduction regression in R. Journal of Statistical Software, 7(1):1-22, 2002.

Yingcun Xia. A multiple-index model and dimension reduction. Journal of the American Statistical Association, 103(484):1631-1640, 2008.

Yingcun Xia, Howell Tong, Wai Keung Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363-410, 2002.

Zhuoran Yang, Krishnakumar Balasubramanian, and Han Liu. On Stein's identity and near-optimal estimation in high-dimensional index models. arXiv preprint arXiv:1709.08795, 2017.

Xiangrong Yin and Bing Li. Sufficient dimension reduction based on an ensemble of minimum average variance estimators. The Annals of Statistics, 39(6):3392-3416, 2011.

Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015.

S. Zhou, X. Shen, and D.A. Wolfe. Local asymptotics for regression splines and confidence regions. The Annals of Statistics, 26(5):1760-1782, 1998.