# additive_nonlinear_quantile_regression_in_ultrahigh_dimension__5eef35bc.pdf Journal of Machine Learning Research 23 (2022) 1-47 Submitted 8/19; Revised 10/21; Published 3/22 Additive nonlinear quantile regression in ultra-high dimension Ben Sherwood ben.sherwood@ku.edu School of Business University of Kansas Lawrence, KS 66045, USA Adam Maidman abmaidman@gmail.com Microsoft One Microsoft Way Redmond, WA 98052, USA Editor: Zaid Harchaoui We propose a method for simultaneous estimation and variable selection of an additive quantile regression model that can be used with high dimensional data. Quantile regression is an appealing method for analyzing high dimensional data because it can correctly model heteroscedastic relationships, is robust to outliers in the response, sparsity levels can change with quantiles, and it provides a thorough analysis of the conditional distribution of the response. An additive nonlinear model can capture more complex relationships, while avoiding the curse of dimensionality. The additive nonlinear model is fit using B-splines and a nonconvex group penalty is used for simultaneous estimation and variable selection. We derive the asymptotic properties of the estimator, including an oracle property, under general conditions that allow for the number of covariates, pn, and the number of true covariates, qn, to increase with the sample size, n. In addition, we propose a coordinate descent algorithm that reduces the computational cost compared to the linear programming approach typically used for solving quantile regression problems. The performance of the method is tested using Monte Carlo simulations, an analysis of fat content of meat conditional on a 100 channel spectrum of absorbances and predicting TRIM32 expression using gene expression data from the eyes of rats. Keywords: Quantile Regression; Oracle Property; Nonparametric Regression; Splines; nonconvex penalty. 1. Introduction We consider the sample {yi, zi}n i=1 where yi R and zi = (zi1, . . . , zipn) Rpn. The τth conditional quantile, τ (0, 1), of y given z is defined as Qy|z(τ) = inf{t : F(t|z) τ}, where F( |z) is the conditional distribution function of y given z. There are pn potential variables, but only qn(τ) of these variables are needed to model the τth conditional quantile. Without loss of generality we assume the first qn(τ) of these variables are active and the remaining pn qn(τ) are inactive. The index n allows the set of active and inactive variables to increase with n, including the ultra-high dimensional case where pn increase at an exponential rate c 2022 Ben Sherwood and Adam Maidman. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/19-697.html. SHERWOOD AND MAIDMAN of n. For a given τ we consider the following sparse model for the conditional quantile Qy|z(τ) = g0(z, τ) = α0(τ) + j=1 g0j(zj, τ) = α0(τ) + j=1 g0j(zj, τ), (1) and for identifiability we assume E[g0j(zj, τ)] = 0 for all j {1, . . . , qn(τ)}. The model is a high-dimensional, sparse, nonparametric model that provides great flexibility. We assume an additive model to avoid the curse of dimensionality. The active variables, intercept, and additive functions are indexed by τ as the model allows for these values to change with τ. For simplicity of notation the τ symbol will be dropped throughout the remaining of the paper, but we emphasize here that the model allows for the nonlinear relationships and sparsity structure to change with τ. We propose using B-splines to model the nonlinear relationships and a group nonconvex penalty to correctly identify the covariates that have a relationship with the response at the given conditional quantile. Tibshirani (1996) proposed the lasso penalty for simultaneous estimation and model selection, but strong conditions are required for model selection consistency (Zhao and Yu, 2006). Our focus is on model selection and our results will depend on using nonconvex penalty functions such as SCAD (Fan and Li, 2001) and MCP Zhang (2010) functions, which provide oracle estimators, a stronger result than model selection consistency, under milder conditions. When using splines multiple coefficients will be associated with a single covariate and thus we will use a group penalty, see Huang et al. (2012) for a review of group penalties in high-dimensional models. Previous works have proposed using splines with a group penalty for estimating an additive conditional mean function (Huang et al., 2010; Lin and Zhang, 2006; Meier et al., 2009; Xue, 2009). The work most similar to ours is Xue (2009) and Huang et al. (2010). Xue (2009) proposed using a group SCAD penalty and derived model consistency results for fixed q and p. Huang et al. (2010) proposed using a group adaptive lasso (Zou, 2006) and proved model selection consistency with fixed q, but allowing p to increase with n. Unlike these works, our focus is on estimating (1) instead of an additive conditional mean function. Since Koenker and Bassett (1978) proposed linear quantile regression there have been many extensions, including work on nonlinear quantile regression. For a univariate covariate He and Shi (1994) demonstrated that using B-splines for nonlinear quantile regression has the same optimal rate of convergence as nonlinear mean regression (Stone, 1982). Motivated by the work of Stone (1985) (additive mean regression) and Stone (1986) (generalized additive models), De Gooijer and Zerom (2003) proposed a kernel based method for estimating an additive nonlinear conditional quantile model and demonstrated that for fixed p, additive quantile regression achieves the same rate of convergence found in He and Shi (1994) and thus theoretically alleviates the curse of dimensionality, although the proposed method requires bias correction for p 5. Horowitz and Lee (2005) proposed a two-stage estimator for additive quantile regression that achieves the optimal rate of convergence and does not require a bias correction. Takeuchi et al. (2006) provided finite sample bounds for nonparametric quantile regression and discussed how to handle constraints such as monotonicity and non-crossing quantiles. Splines offer great flexibility in modeling conditional quantiles and have been proposed in a variety of conditional quantile models including, but not limited to, varying coefficients (Kim, 2007), growth curves (Wei et al., 2006) and ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION semiparametric models (He and Shi, 1996; He et al., 2002; Maidman and Wang, 2018; Wang et al., 2009b). Quantile regression is a robust method which estimates a conditional quantile of interest. Our proposed method estimates conditional quantiles while allowing the sparsity structure to vary with τ, which has been stated as another example of the flexibility quantile regression provides for analyzing high-dimensional data (He et al., 2013; Wang et al., 2012). Previous work in penalized quantile regression includes using the lasso penalty (Belloni and Chernozhukov, 2011) and the nonconvex penalties MCP and SCAD (Wang et al., 2012) for estimating linear quantile regression with high-dimensional covariates. The high-dimensional linear quantile model has been relaxed to a partially linear model, where variable selection is only done on the high-dimensional linear terms (Sherwood, 2016; Sherwood and Wang, 2016). Other work proposed using splines with a group penalty for simultaneous estimation and variable selection of additive quantile regression. Kato (2012) proposed using a group lasso penalty, their work focused on convergence rates, while our work focuses on deriving an estimator with the oracle property which is asymptotically equivalent to the estimator that would be fit if we a priori knew the active covariates. Zhao and Lian (2016) considered the case where p is fixed and proposed using a nonconvex group penalty with the L2 norm, while we allow p to increase with n and use the L1 norm in our group penalty. Lin et al. (2013) proposed a smoothing spline ANOVA method that focuses on computational aspects and does not contain asymptotic results. Lv et al. (2018) considered estimation of (1) where the univariate functions reside in a reproducing kernel Hilbert space. Their work focused on estimation bounds, while our work focuses on deriving an oracle estimator, and they proposed a different penalty function than the one presented here. Penalized quantile regression is not as commonly used as penalized least squares, but recent work has shown an interest in simultaneously estimating a conditional quantile and performing model selection. Essl et al. (2017) used penalized quantile regression to model extremes for the reserve capacity in the Australian electricity market, using time of day, year and week variables along with other forecast variables. Palma et al. (2020) modeled the age of a brain using MRI data for cognitively normal patients to better understand brain decay for cognitively impaired individuals. Quantile regression was used to model the .05, .5 and .95 quantiles, while the penalty was used to select the useful information from the MRI data. Motivated by the desire to identify counterfeit drugs, Ibrahim et al. (2020) used penalized quantile regression as a robust approach to model the amount of a certain chemical in a drug using spectroscopy data. Nonlinear or partially linear additive models with penalties have been used to simultaneously perform model selection and provide nonparametric estimates of conditional means. Examples include modeling stock returns given firm characteristics (Freyberger et al., 2020), predicting gene expression using DNA motifs (Lian et al., 2012), and constructing graphical models for frillice lettuce attributes and average environmental data during the cultivation period (Fujimoto et al., 2019). These are some examples of applications of penalized quantile regression and penalized additive models, but is by no means complete. In this paper we propose the penalized additive quantile model as a useful model for complex data. We demonstrate that this is a robust, theoretically sound model with few assumptions. To the best of our knowledge there are not many, or any, public applications of penalized additive nonlinear quantile models. However, given the flexibility of additive nonlinear quantile regression, we believe this can be a useful SHERWOOD AND MAIDMAN tool for data analysis. To bridge the gap between theory and application, we discuss how to compute this model. In addition, our implementation is publicly available on CRAN (Sherwood and Maidman, 2020). Theoretical challenges include dealing with a nonsmooth loss function, a nonconvex penalty function, approximation of nonlinear functions, and the number of covariates increasing with the sample size. Our asymptotic results allow for qn to increase with n, which is challenging to deal with because both the number of predictors and basis functions increase to infinity with n. In addition, previous work on deriving oracle results for high dimensional quantile regression estimators have used the fact that a quantile regression objective function with a SCAD or MCP penalty can be written as a difference of convex functions (Sherwood, 2016; Sherwood and Wang, 2016; Wang et al., 2012). The theoretical results depend on demonstrating that asymptotically the oracle estimators satisfy properties about local minimizers of difference of convex functions provided by Tao and An (1997). Our proofs are more akin to the general approach taken by Fan and Lv (2011), where we only use some very general conditions about the penalty function. Results from Fan and Lv (2011) were for likelihood based methods and assumed that the objective function was differentiable. In their proofs they used a Taylor approximation of the penalized objective function, which is not possible for the non-differentiable quantile objective function. In this paper we show that the approach of Fan and Lv (2011) can be extended to quantile regression by replacing the Taylor approximation with Knight s identity, a common tool for theoretical results about quantile regression (Knight, 1998; Koenker, 2005). It is worth noting that previous work for adaptive lasso quantile regression, which has a convex objective function, used Knight s identity when establishing oracle properties (Wang et al., 2007; Zheng et al., 2015). We believe the approach provided here will be useful for future theoretical results because working with Knight s identity is easier than dealing with the subdifferential functions, which is required when using the properties of difference of convex functions. In addition, the results provided here are more general because they work with a large class of non-convex penalty functions and are not limited to the SCAD or MCP functions. In addition to being theoretically challenging, high-dimensional quantile regression is a challenging computational problem. Koenker and Bassett (1978) showed that quantile regression can be solved by linear programming and many have found that minimizing penalized quantile regression objective functions can be framed as linear programming problems (Belloni and Chernozhukov, 2011; Sherwood and Wang, 2016; Wang et al., 2012; Wu and Liu, 2009). However, recent work has shown that in high dimensions alternative approaches can sacrifice little in terms of accuracy, while providing large computational gains. Peng and Wang (2015) proposed a coordinate descent algorithm for quantile regression with a non-convex penalty. Gu et al. (2018) proposed an alternating direction method of multiplier (ADMM) algorithm for quantile regression with lasso, adaptive lasso or a folded concave penalty. Yu et al. (2017) proposed an ADMM algorithm for nonconvex penalized quantile regression that can be computed in parallel. Yi and Huang (2017) proposed semismooth Newton coordinate descent algorithm for elastic-net penalized quantile regression that approximates the quantile loss function with a Huber loss function, creates a strong rule for discarding covariates, and uses a coordinate descent algorithm to update the remaining coefficients. None of the algorithms discussed use group penalties. We contribute to the ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION literature by proposing a coordinate descent algorithm for quantile regression with a nonconvex group penalty. Lv et al. (2018) also proposed a coordinate descent algorithm for penalized quantile regression. They approximated the quantile loss function with a smooth function and their penalty function is convex. In contrast, we do not approximate the quantile loss function and have a nonconvex penalty function. The rest of the article is organized as follows. In Section 2 we discuss estimating the additive quantile regression model, when the active covariates are known a priori. We refer to this model as the oracle model and asymptotic properties of the oracle model are presented in Section 2. In Section 3 we present a group nonconvex penalty and present a theorem demonstrating that under reasonable conditions the group penalized method is asymptotically equivalent to the oracle model. In Section 4 we propose our new algorithm. In Section 5 we compare the proposed method using Monte Carlo simulations and in Section 6 we implement the proposed method to model fat content of meat using a 100 channel spectrum of absorbances and model TRIM32 expression using other gene expression data from the eyes of 120 twelve-week old male rats. We conclude with a summary in Section 7. 2. Oracle Model To estimate (1) we propose first transforming the covariates using B-splines and then applying a group penalty method to simultaneously perform estimation and variable selection. In this section we assume that the active covariates are known a priori and thus are only estimating the model Qy|z(τ) = g0(z) = α0 + j=1 g0j(zj). (2) The work in this section establishes convergence rates for the optimal local minimum of the penalized estimator. To estimate the nonlinear functions we use B-splines of order m + 1 (degree m) with kn quasi-uniform internal knots for (kn + m + 1) spline functions. Let Jn = kn + m, for j {1, . . . , pn} the jth covariate has Jn + 1 corresponding functions of [bj,0( ), . . . , bj,Jn( )] of order m + 1 with kn quasi-uniform internal knots on [0, 1] for a total of 2(m + 1) + kn knots, (tj, m, . . . , tj,kn+m+1). Define hj = max s |tj,s tj,s+1|, the largest distance between knots for the jth covariate, and h = max j hj, the largest distance between knots for all covariates. A property of spline functions is that for any zij it follows that PJn s=0 bj,s(zij) = 1 and to avoid collinearity we drop the first term when fitting the model. See Schumaker (1981) for more details about the construction of B-splines. The ith observation of the jth covariate will have a corresponding vector of πj(zij) = [bj,1(zij), ..., bj,Jn(zij)] RJn. Define ΠA(zi) = h 1, π1(zi1) , . . . , πqn(ziqn) i RJnqn+1, as the B-splines vector of active covariates and Π(zi) = h 1, π1(zi1) , . . . , πpn(zipn) i RJnpn+1, SHERWOOD AND MAIDMAN as the B-splines vector for all covariates. B-splines can be used to approximate smooth functions and the following definitions help provide a formal definition of the class of functions for g0(z). Definition 1 Let r m + v, where m is a positive integer and v (0, 1]. Define Hr as the collection of functions h( ) on [0, 1] whose mth derivative h(m)( ) satisfies the H older condition of order v. That is, for any h( ) Hr, there exists some positive constant C such that h(m)(z ) h(m)(z) C z z v , 0 z , z 1. (3) Definition 2 Given z = (z1, ..., zqn) , the function g(z) is said to belong to the class of functions Gr if it has the representation g(z) = α + Pqn j=1 gj(zj) where α R and for all j {1, . . . , qn}, gj Hr, E[gj(zj)] = 0 and E[gj(zj)2] < M, for some positive constant M. Definition 3 Denote Gn as the space of additive functions spanned by [Π(zi)]n i=1. B-splines can approximate any function h( ) Hr. That is, there exists γ0j RJn such that sup z [0,1] g0j(z) πj(z) γ0j = O (k r n ) (Schumaker, 1981). Thus, a function g0( ) Gr can be approximated by a function from Gn, specifically there exists γA0 = (α0, γ 01, . . . , γ 0qn) Rqn Jn+1 such that sup z [0,1]qn g0(z) ΠA(z) γA0 = O qnk r n . (4) The quantile loss function is defined as ρτ(u) = u[τ I(u < 0)]. The oracle estimator only relies on the active covariates and is defined as ˆγA = argmin γA Rqn Jn+1 i=1 ρτ[yi ΠA(zi) γA]. (5) The estimator of g0(zi) is ˆg(zi) = ΠA(zi) ˆγA. The proposed estimator is a robust estimator and will be robust to outliers in the response. Similar to a univariate estimate of a quantile, if values of {yi}n i=1 are changed but the signs of residuals remain the same then the estimator ˆg(zi) remains unchanged. See Theorem 2.4 and the discussion surrounding this theorem from Koenker (2005) for a detailed discussion of this property of quantile regression. The following conditions were used to prove the rate of convergence of n 1 Pn i=1[ˆg(zi) g0(zi)]2. Condition 1 (Conditions on the random error) The random error ϵi has the conditional distribution function Fi( | zi), continuous conditional density function fi( | zi), and f i( | zi) is the derivative of the conditional density function. The density functions are uniformly bounded away from 0 and infinity in a neighborhood of zero and there exists a positive constant cf such that |f i( | zi)| cf for all i {1, . . . , n}. Condition 2 (Conditions on the covariates) Let zij [0, 1] for all i {1, . . . , n} and for all j {1, . . . , pn}. The joint density of the predictors is absolutely continuous and the density, fz(z), is bounded away from zero and infinity by positive constants. In addition, define fzj(z) as the density function for the jth covariate. There exist positive constants c1 and c2 such that c1 < fzj(z) < c2 for all z [0, 1] and j {1, . . . , pn}. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Condition 3 (Conditions on the splines) There exists a positive constant c3 such that max j {1,...,pn} max s (tj,s+1 tj,s) min s (tj,s+1 tj,s) c3. For the internal knots kn (qnn)1/(2r+1), h (qnn) 1/(2r+1) and for all j {1, . . . , pn} hj h 1, where a b means both a and b have the same order. Condition 4 (Condition on the size of the model) For the active variables qn = o[log(n)]. Condition 5 (Condition on the unknown functions) For r = m + v > 3 we assume g0( ) Gr. Condition 1 has been used for asymptotic results of a fixed dimensional linear quantile regression model (Koenker, 2005) and is a weaker condition than the Gaussian or sub Gaussian conditions that are common for penalized, high-dimensional models (Negahban et al., 2012). Condition 1 reflects the robust properties of quantile regression as it does not assume that any moments exist for the distribution of the errors, and thus the results will hold for heavy tailed distributions that have no moments such as the Cauchy distribution. Under Condition 2 there is no collinearity between the predictors. Stone (1985) introduced Condition 2 to provide a lower bound for the standard deviation of the additive function. It is also used to provide a lower bound for the minimum eigenvalue for the covariance matrix of the B-splines transformation of the active predictors (Chen et al., 2018b; Zhou et al., 1998). In addition, standard B-spline results depend on the covariates having a bounded support and without loss of generality, Condition 2 assumes the support to be the interval [0, 1]. The assumption that the density functions have a common lower and upper bound is frequent in work involving splines because the bounds allow for a direct application of Theorem 5.4.2 from Devore and Lorentz (2005). Condition 3 assumes that the distance between the internal knots are not drastically different, which holds in practice as long as the distributions of the covariates are not greatly skewed, and is a common assumption in work with splines (Huang, 1998a,b; Xue and Yang, 2006). In addition for fixed qn, the rate for kn is equivalent to the optimal rate found in Stone (1985). Condition 4 governs the rate at which qn can increase with n. Though the rate is slow because kn also needs to increase with n, most work in additive models assume q is fixed. The rate in Condition 4 is the same rate used by others that have considered an increasing number of true covariates when estimating additive models (Wang et al., 2014a). Condition 5 provides that only reasonably smooth functions can be estimated by the proposed method. The above conditions are used to prove the following theorem about the rate of convergence of ˆg( ). These conditions are sufficient for proving our results but are not necessarily the weakest conditions needed. Define ΠA = [ΠA(z1), . . . , ΠA(zn)] Rn qn Jn+1. Note that our conditions lack an explicit assumption about bounds on the eigenvalues for the sample covariance matrix of the active predictors, 1 n Pn i=1 ΠA(zi)ΠA(zi) , which is common in work that derives an oracle property for high-dimensional data (Fan and Lv, 2011; Loh and Wainwright, 2015; Wang et al., 2012; Zheng et al., 2015). Using the properties of B-splines and Conditions 1 - 3, the following lemma provides these bounds and insight into why the rate of qn provided in Condition 4 is so small. SHERWOOD AND MAIDMAN Lemma 4 Assume Conditions 1-3 hold. For a Rqn Jn+1 where ||a||2 = 1, there exist positive constants b1 > 0, B1 > 0 and δ (0, 1) with δqn = [(1 δ)/2]qn/2 such that for sufficiently large n that b1δ2 qnk 1 n a 1 nΠ AΠAa B1qn. Proof of Lemma 4 is provided in the Appendix. Stone (1985) first introduced a lower bound that depended on a term similar to δqn and is very common in the additive model literature. If qn is fixed then the term δqn is a constant and can be easily dealt within the asymptotic analysis. However, in the setting of qn increasing with n, the term has to be dealt with more care. Specifically, the proof of the the next theorem depends on n 1/2(qnkn)1/2δ 1 qn converging to zero and thus the need for Condition 4. Theorem 5 If Conditions 1-5 hold, then i=1 [ˆg(zi) g0(zi)]2 = OP qnkn/n + q2 nk 2r n . Thus, under Condition 3 and for fixed q, the estimator ˆg(zi) reaches the optimal rate of convergence found by Stone (1985) and extends to fixed dimensional quantile regression additive models (De Gooijer and Zerom, 2003; He and Shi, 1994; Horowitz and Lee, 2005). To the best of our knowledge, this is the first result that considers the rate of convergence for an additive quantile model with qn increasing with n. Proof of Theorem 5 is provided in the Appendix. 3. Variable Selection In the previous section we established that the oracle estimator is an optimal estimator but it requires a priori knowledge about the covariates that may not be known in practice. Define γ = (α, γ 1 , . . . , γ pn) RJnpn+1, where γj RJn is the coefficient vector for the B-spline functions of the jth covariate. For a vector a we define ||a||q as the Lq norm of a. To fit a sparse model that accounts for the groups of spline functions, we propose the following objective function i=1 ρτ[yi Π(zi) γ] + j=1 pλ,a ||γj||1 . (6) A group penalty is used to incorporate the group structure of the splines. Similar penalties have been used for mean additive models (Huang et al., 2010; Xue, 2009). Zhao and Lian (2016) consider a similar model for additive quantile regression, but use an L2 norm inside the penalty function and their theoretical results assume a fixed q and p. The L1 norm is used instead of the L2 norm for computational convenience. The L1 norm fits naturally with quantile regression and in the next section we discuss some of the computational conveniences it provides. Whether an L1 or L2 norm is used the oracle properties for group concave penalties are similar (Sherwood et al., 2020). Define the oracle estimator as ˆγ = h ˆγ A, 0 Jn(pn qn) i , the estimator that only uses relevant groups. To derive an oracle property we use a general class of nonconvex functions for pλ and will prove that ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION with probability going to one that ˆγ is a local minimum of Q(γ). Two commonly used penalty functions are SCAD and MCP. For the SCAD penalty function pλ(x) = λ|x|I(0 |x| < λ) + aλ|x| x2 + λ2 /2 a 1 I(λ |x| aλ) 2 I(|x| > aλ), for some a > 2, and for the MCP penalty function, pλ(x) = λ |x| x2 I(0 |x| < aλ) + aλ2 2 I (|x| aλ) , for some a > 1. For both penalty functions, the tuning parameter λ controls the complexity of the selected model and goes to zero as n increases to infinity. The other tuning parameter a controls how quickly p λ,a(x) goes to zero, but is considered fixed in the asymptotics. Our model selection consistency proofs use the following conditions about the penalty function and the strength of signal from a group of B-spline coefficients for an active covariate. Condition 6 The function pλ,a(x) is increasing, concave, and has a continuous positive derivative, p λ,a(x) for x [0, ). Also, p λ,a(x) is increasing with respect to λ for λ > 0, p λ,a(0+) = λ and p λ,a(x) = 0 for |x| > aλ. Condition 7 (Condition on the signal) There exist positive constants c4 and c5 such that 4/(2r + 1) < c4 < (2r 1)/(2r + 1) and n(1 c4)/2 min j {1,...,qn}||γ0j||1 c5. Both the SCAD and MCP satisfy Condition 6, which is very similar to Condition 1 from Fan and Lv (2011). Condition 7 is a strength of signal condition that is very common in high-dimensional linear models, for instance see Kim et al. (2008), Kim et al. (2012) and Wang et al. (2012). The upper and lower bounds for c4 are sensible by Condition 5. Again, these conditions are sufficient for proving our results, but are not necessarily the weakest conditions needed. Theorem 6 Assume Conditions 1 - 7 hold, λ = o n (1 c4)/2 , log(pn) = o(nλ2k 1 n ), n 1/2k2 n log(n) = o(λ) and nλ2 . Let Mn(λ) be the set of local minima of the penalized objective function Q(γ), defined in (6), for tuning parameter value λ then P [ˆγ Mn(λ)] 1. The conditions on λ are satisfied for λ = Cn 1/2+b, where b 2 2r+1, c4 2 and C is any positive constant, where 2/(2r + 1) < c4/2 is guaranteed by Condition 7. The motivation for using concave penalties is that with probability approaching one, for the correct value of λ, the oracle estimator is a local minimum of the penalized objective function. Thus, the optimal value of λ needs to properly balance over-fitting and under-fitting the model. The upper bound depends on c4, which depends on the function that provides the smallest signal. Smaller values of c4 indicate a weaker minimum signal and thus smaller values SHERWOOD AND MAIDMAN of λ are needed to avoid under-fitting. The lower bound depends on r and for smoother functions λ can be smaller. The intuition here is that for smoother functions it should be easier to separate the signal from the noise and thus smaller values of λ are needed. Finally, the oracle property holds for ultra-high dimensional predictors because the rates allow for pn = o exp nb 1/(2r+1) . Theorem 6 proves that with probability going to one the oracle estimator is a local minimizer of Q(γ), but provides no guarantee that the oracle estimator is the global minimizer. Nor does it provide any guarantees about other potential local minimizers. The next theorem provides a bound on the l2 norm of the difference between a sufficiently sparse local minimizer and the oracle estimator. However, an additional condition is used for that proof. Condition 8 If γ is a local minimizer of Q(γ), where || γ||0 = un then Pn i=1 I[yi = Π(zi) γ] = O(un) and there exists γ0 such that || γ γ0||2 = o P (1). Condition 8 protects against pathological cases. It assumes that the local minimizer γ, converges in probability to some fixed value γ0, but does not assume that γ0 is equal to γ0, the coefficients that provide the best approximation to the unknown additive function. In addition, it provides a bound on the number of zero-valued residuals. Consider an unpenalized linear quantile regression estimator with p predictors. Of the n residuals corresponding to this estimator, with probability one there will be exactly p+1 zero-valued residuals if the errors have a density with respect to a Lebesgue measure. See section 2.2.2 of (Koenker, 2005) for a more detailed discussion of this topic. Using the same notation as Condition 8, let un be the number of nonzero coefficients for a weighted lasso estimator. Then the weighted lasso quantile regression model will have at most un zero-valued residuals with probability one. This is because the weighted lasso quantile regression problem has a linear programming formulation similar to the standard quantile regression problem. Therefore, we believe the assumption is reasonable for the SCAD and MCP for two reasons. First, it will hold for unpenalized quantile regression and the motivation for both the SCAD and MCP is to approximate an unpenalized estimator. Second, it holds for the weighted lasso estimator which is the approximation we use for (6), see (7) in Section 4. Theorem 7 Define γ as a local minimizer of Q(γ) and define E = {j {1, . . . , pn}||| γj|| = 0 or ||ˆγj|| = 0} as the set of groups that have a non-zero entry in γ or ˆγ. Let wn = |E| = o[log(n)], assume Conditions 1-8 hold and that λ = n 1/2+b, where b 2 2r+1, c4 2 2r+1 < c4 || γ ˆγ||2 = OP log(n)δ 2 wnkn wnkn + knwnn 1 Corollary 8 Under the conditions of Theorem 7 with λ = n 1/2+b where b 2 2r+1, r 1 then || γ ˆγ||2 = o P (1) and || γ γ0||2 = o P (1). Corollary 8 provides that any sufficiently sparse local minimizer of Q(γ) will be a consistent estimator. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION 4. Algorithm The objective function Q(γ) is non-convex and for high-dimensional data a grid search approach is not reasonable. Algorithms exist for finding estimators with good statistical properties for a wide class of nonconvex problems, but they assume the loss function is differentiable, which is not the case for quantile regression (Loh and Wainwright, 2015; Wang et al., 2014b). Zou and Li (2008) proposed a local linear approximation (LLA) that provides a convex approximation to a non-convex objective function. Let ˆγt j represent the estimate of γ0j at iteration t, with ˆγ0 = 0, then the LLA of Q(γ) is i=1 ρτ[yi Π(zi) γ] + j=1 p λ ||γt 1 j ||1 ||γj||1. (7) For τ (0, 1), ρτ(u) + ρτ( u) = |u| and thus the above problem can be restated as a weighted quantile regression problem with augmented data (Sherwood, 2016; Sherwood and Wang, 2016; Wang et al., 2012). In this approach the final estimates are derived once the estimates converge or a maximum number of iterations has been made. At each iteration minimizing (7) becomes a linear programming problem, however linear programming can be quite slow for high dimensional problems. If the traditional L2 norm was used then solving (7) becomes a second-order cone programming problem, but these tend to be even slower than linear programming problems. In addition, using the L1 norm allows us to build on existing computational approaches for penalized quantile regression. Peng and Wang (2015) proposed the quantile iterative coordinate descent (QICD) algorithm for solving (7) for the standard SCAD or MCP penalty, where γj is a scalar, which greatly reduces computational complexity without sacrificing estimation in accuracy. We propose an extension of the QICD algorithm for the group penalty setting, where γj is a vector. The QICD algorithm is a two-step process that first majorizes the objective function and then uses a coordinate descent algorithm to solve each iteration of the majorization step. The coordinate descent algorithm is responsible for faster convergence. The key difference is our algorithm includes an L1 grouping of coefficients and to minimize (6), we modify the QICD algorithm to allow for group penalties. Let γ(k) js denote the value of γjs after the kth iteration, k = 1, 2, . . . and γ(k) j = h γ(k) j1 , . . . , γ(k) Jn i for j {1, . . . , pn} and s {1, . . . , Jn}. Furthermore, let p λ(x+) be the limit of p λ(y) as y x from above. Then, in the kth iteration, γj = p λ ||γ(k 1) j ||1+ Jn X s=1 |γjs| p λ ||γ(k 1) j ||1+ Jn X s=1 |γ(k 1) js | + pλ ||γ(k 1) j ||1 (8) majorizes the penalty function pλ ||γj||1 for k {1, 2, . . . , K}, where K is a user defined value for the maximum number of iterations, and j {1, . . . , pn}. More specifically, φγ(k 1) j γj pλ ||γj||1 for all γj with equality when γj = γ(k 1) j . Thus, the objective SHERWOOD AND MAIDMAN function Q(γ) defined in (6) is majorized by Qγ(k 1)(γ) = 1 i=1 ρτ[yi Π(zi) γ] + j=1 φγ(k 1) j The majorizing function in (9) is similar to the majorizing function in Peng and Wang (2015). However, in our setting, coefficients for spline functions associated with the same covariate all have the same weight p λ ||γ(k 1) j ||1+ . For each k = 1, 2, . . ., the update for γ is γ(k) = arg min γ Qγ(k 1)(γ). (10) This iteration scheme decreases the value of the objective function in (6) for each γ(k). Additionally, the solution to the original nonconvex minimization problem can now be found by solving a sequence of convex minimization problems. Coordinate descent can be used to solve the convex minimization problem in (10). In the following coordinate descent algorithm, each coefficient γjs is updated one-at-a-time until convergence. For the dth iteration of the coordinate descent step and the kth iteration of the majorization step, let ω(k)(d) js = γ(k)(d+1) 11 , . . . , γ(k)(d+1) js 1 , γ(k)(d) js , . . . , γ(k)(d) pn Jn be the vector of coefficients that contains updates for the first js 1 coefficients, but not the remaining ones. We update each coefficient in the coordinate descent step as γ(k)(d) js = arg min γjs Qγ(k 1) ω(k)(d) js . (11) We omit the complete derivation of the coordinate descent algorithm as it is very similar to Peng and Wang (2015). The algorithm converges when for some specified tolerance ϵ, ||γk γk 1||2 < ϵ or k equals K, the maximum number of iterations allowed. In the following data analysis we used K = 20 and ϵ = .00001. It is important to have appropriate starting values for the algorithm to converge. We recommend using the estimates from lasso penalized quantile regression with the lasso penalty applied individually to each coefficient (i.e., ignoring the group penalty) as the starting values for γ(0) js . The algorithm is implemented in the R package rq Pen (Sherwood and Maidman, 2020). 5. Simulations We consider three different simulation settings. In the first setting the response is generated from an additive model where each function is nonlinear. The purpose of this setting is to demonstrate the effectiveness of the proposed method compared to other approaches for modeling nonlinear functions. This setting also includes comparisons of the QICD algorithm to a linear programming approach. In the second setting we use the proposed ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION approach where it might not be optimal. This setting includes a linear model, partially linear model and a non-additive model. In this setting we compare the proposed approach to linear models to test if the proposed approach is competitive with simpler methods. In the last setting, to verify results of Theorem 6, we test the model selection properties of the proposed approach for varying values of n, qn, Jn and pn. In all settings the covariates are generated in two steps. For each observation a pdimensional vector is generated by x N(0p, Σp) with σjk being the entry for the jth row and kth column of Σp and σjk = .5|j k|. For a vector a = (a1, . . . , ap) Rp define Φ(a) = {Φ(a1), . . . , Φ(ap)} Rp, where Φ( ) is the normal CDF. Then the p-dimensional covariate vector is generated by z = Φ(x). In the first two sections we consider a sample size of n = 500 and the number of covariates as p = 100, 300 or 600. More details about the third simulation will be provided later which includes different values of n, pn, Jn and qn. For the first two settings, models are fit using 500 training samples. Then 1000 testing samples are generated from the same model. All models are fit using B-splines with the training and testing covariates transformed using cubic B-splines with Jn = 3. Let y i and ˆy i (τ) represent the observed value and the predicted τth quantile for the ith testing sample, where the prediction comes from a model that was fit only using the training data. A covariate is considered selected if any of its corresponding spline coefficients are non-zero. Models are compared using the following criteria. 1. Mean squared prediction error (MSPE), 1 1000 P1000 i=1 [y i ˆy i (τ)]2. 2. Mean absolute prediction error (MAPE), 1 1000 P1000 i=1 |y i ˆy i (τ)|. 3. Mean check prediction error (MCPE), 1 1000 P1000 i=1 ρτ[y i ˆy i (τ)]. 4. True positives (TP), the number of active covariates selected. 5. False positives (FP), the number of nonactive covariates selected. 6. Proportion smaller (PS), the proportion of testing responses smaller than their predicted value. For consistent methods the value of PS should be close to τ. When modeling the median, MAPE and MCPE differ only by a multiple of 2. Thus, we only report MCPE in settings where we fit models for non-median quantiles. In the first two settings 100 replications are run for each simulation setting, while in the last setting 50 replications are run for each setting. Setting I: Additive Model In Setting I we consider the proposed quantile additive model where pλ,a( ) is the SCAD penalty function. We implement both the coordinate descent (QA-SCAD CD) and linear programing (QA-SCAD LP) algorithms. We compare the method to the quantile additive model with the lasso penalty (QA-LASSO), QA-LASSO minimizes (6) with pλ,a(x) = λ|x|. In addition, we consider the mean additive model with the group SCAD (MA-SCAD) and SHERWOOD AND MAIDMAN group lasso (MA-LASSO) penalty. The mean regression methods use the same B-spline transformation and minimize i=1 [yi Π(zi) γ]2 + j=1 pλ,a ||γj||2 , (12) where for MA-LASSO pλ,a(x) = λ|x| and for MA-SCAD pλ,a( ) is the SCAD penalty function. For both MA-SCAD and QA-SCAD we set a = 3.7. Let, γλ be the coefficient vector for a given value of λ and qλ be the number of nonzero coefficients. For the quantile regression methods λ is selected by minimizing, i=1 ρτ[yi Π(zi) γλ] + qλ log(n) Let ℓ(γ) represent the Gaussian log-likelihood evaluated at γ. For the mean regression methods λ is selected by minimizing 2ℓ( γλ) + qλ log(n). The quantile and mean regression models are fit using the R packages rq Pen (Sherwood and Maidman, 2020) and grpreg (Breheny and Zeng, 2017), respectively. Theoretically, using BIC may not be optimal for high-dimensional variables. There exist challenges to demonstrating that BIC will select the true model when the number of predictors grows with the sample size that remain unsolved (Wang et al., 2009a; Lee et al., 2014). For additive quantile regression models, Lee et al. (2014) suggested replacing qλ log(n) 2n with Cn qλ log(n) 2n , where Cn , and provide theoretical justifications. The R package rq Pen allows for implementing this high dimensional BIC. We used the standard BIC as our preliminary results found that approach to work better, but the approach of Lee et al. (2014) has been shown to be superior in other settings. The response is generated under three different settings. For the first two settings we consider the model y = 1 + 2z3 1 + sin(2πz2) + 8(z3 .5)2 + ϵ, (13) with homoscedastic errors of ϵ N(0, 1) (Setting IA) or ϵ T3 (Setting IB). In the third setting we consider the following heteroscedastic errors model y = sin(2πz2) + 8(z3 .5)2 + (.5 + z3 1)ϵ, (14) with ϵ N(0, 1) (Setting IC). For Settings IA and IB the methods estimate the median, τ = .5, while in Setting IC the methods estimate the .9 quantile, τ = .9. The quantile regression methods can directly model the .9 quantile, but the mean regression methods do not directly provide non-median estimates. To estimate ˆy i (τ) for the mean regression methods we propose a naive estimate of the conditional quantile based on estimation of the conditional quantile in the linear mean model when the error terms are normally distributed and p is small. Define Y = (y1, . . . , yn) Rn as the vector of observed responses. For an estimate ˆγ, ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION define ˆy(ˆγ) Rn as the vector of fitted values, q(ˆγ) as the number of nonzero entries in ˆγ, Π(zi, ˆγ) Rq( ˆγ) as the vector of basis functions that have non-zero coefficients in ˆγ and ˆσ(ˆγ) = 1 Y ˆY(ˆγ) 2. Let t d,τ be the τ quantile of a T-distribution with d degrees of freedom. For a vector of covariates z i with estimate of the conditional mean, ˆy i , the naive estimate of the τth conditional quantile is ˆy i (τ) = ˆy i + t n q( ˆγ),τ ˆσ(ˆγ) v u u t1 + Π(z i , ˆγ) " n X i=1 Π(zi, ˆγ)Π(zi, ˆγ) # 1 Π(z i , ˆγ). (15) The above estimator is the standard prediction interval estimator of a conditional quantile using ordinary least squares. If the errors are i.i.d. and normally distributed then the estimator is the MLE for the conditional quantile. However, it is naive because it is a fixed dimensional solution to a high-dimensional problem. Even in the fixed dimension setting it will be an inconsistent estimate if the errors are not normally distributed or if there is nonconstant variance. The means (and standard deviations) across the 100 simulations for the previously defined six summary statistics are provided in Tables 1-3 for Settings IA-IC. For p = 100 and p = 300 the tables include two versions of QA-SCAD. The linear programming (LP) approach to solving (7), using the Barrodale and Roberts (1974) algorithm for regression quantiles (Koenker and D Orey, 1987, 1994), and the coordinate descent (CD) approach described in Section 4. The summary statistics between the two algorithms are almost identical in Settings IA and IB, except that the coordinate descent approach tends to select more false positives. In Setting IC the LP approach provides better results. Figure 1 compares the computational speed for the two algorithms, in the different settings when p = 100 or p = 300. For Settings IA and IB the QICD algorithm is noticeably faster for p = 100 or p = 300. For Setting IC, the QICD algorithm is slower at p = 100, but faster at p = 300. For p = 600 only the QICD algorithm was used, due to the excessive computational time of the linear programming approach. The rest of the simulations only consider the QICD algorithm because of the computational advantages over the linear programming approach. Results in Table 1 show that the group SCAD approaches are fitting smaller models that all contain the true covariates and are doing a better job in terms of prediction. For the different values of p the MA-SCAD approach does the best in terms of prediction error, but this is not surprising as we expect a method using a squared error loss function to perform well when the errors are homoscedastic and normally distributed. In Setting IB, presented in Table 2, the QA-SCAD methods perform the best in terms of model selection and prediction accuracy. The linear mean methods are selecting more false positives because of the extra noise from the heavier tailed error distribution, T3. For Setting IC, presented in Table 3, the largest difference can be seen in terms of TP. In this model z1 is only an active variable for τ = .5 and thus the mean regression methods do not consistently select this variable. Also, the PS results are slightly better for QA-SCAD approaches than the mean regression methods, which we expect because the mean regression methods for estimating the .9 quantile do not correctly account for the nonconstant variance. In all the results we see the SCAD methods picking smaller models than their LASSO counterparts, but still getting accurate results in terms of selecting the correct number of active covariates. SHERWOOD AND MAIDMAN Setting I p=300 Setting II p=300 Setting III p=300 Setting I p=100 Setting II p=100 Setting III p=100 QICD LP QICD LP QICD LP QICD LP QICD LP QICD LP 0 Time (Seconds) Figure 1: Computation Time comparison of coordinate descent method (QICD) and linear programming (LP) for 100 simulations with p = 100. Method p MSPE MAPE TP FP PS MA-LASSO 100 1.08 (0.06) 0.83 (0.02) 3 (0) 9.16 (2.4) 0.5 (0.02) MA-SCAD 100 1.03 (0.05) 0.81 (0.02) 3 (0) 2.38 (2.48) 0.5 (0.02) QA-LASSO 100 1.2 (0.09) 0.88 (0.03) 3 (0) 11.79 (5.45) 0.5 (0.03) QA-SCAD CD 100 1.05 (0.06) 0.82 (0.02) 3 (0) 1.23 (0.69) 0.5 (0.03) QA-SCAD LP 100 1.04 (0.06) 0.82 (0.02) 3 (0) 1.23 (0.78) 0.5 (0.03) MA-LASSO 300 1.09 (0.06) 0.83 (0.02) 3 (0) 18.14 (2.67) 0.51 (0.02) MA-SCAD 300 1.03 (0.05) 0.81 (0.02) 3 (0) 10.29 (3.15) 0.51 (0.02) QA-LASSO 300 1.29 (0.1) 0.91 (0.04) 3 (0) 14.17 (6.91) 0.5 (0.03) QA-SCAD CD 300 1.05 (0.05) 0.82 (0.02) 3 (0) 2.01 (1.55) 0.51 (0.03) QA-SCAD LP 300 1.04 (0.05) 0.81 (0.02) 3 (0) 1.6 (1.34) 0.51 (0.03) MA-LASSO 600 1.11 (0.06) 0.84 (0.02) 3 (0) 28.06 (4.23) 0.5 (0.02) MA-SCAD 600 1.04 (0.05) 0.81 (0.02) 3 (0) 20 (4.89) 0.5 (0.02) QA-LASSO 600 1.43 (0.13) 0.95 (0.04) 3 (0) 13.58 (8.26) 0.49 (0.03) QA-SCAD CD 600 1.07 (0.05) 0.82 (0.02) 3 (0) 2.7 (1.9) 0.5 (0.02) Table 1: Simulation results for homoscedastic N(0,1) errors (Setting IA) Setting II: Non-optimal Models The previous section demonstrated the computational advantages of the CD algorithm so for this setting we only consider the coordinate descent implementation of the proposed approach (QA-SCAD CD). In this section the response is generated from a model where ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Method p MSPE MAPE TP FP PS MA-LASSO 100 3.19 (2.2) 1.18 (0.06) 2.99 (0.1) 7.66 (2.12) 0.51 (0.03) MA-SCAD 100 3.1 (2.19) 1.15 (0.06) 2.99 (0.1) 6.76 (2.92) 0.5 (0.03) QA-LASSO 100 3.28 (2.2) 1.21 (0.06) 3 (0) 8.48 (4.47) 0.51 (0.03) QA-SCAD CD 100 3.03 (2.19) 1.12 (0.06) 3 (0) 1.59 (1.22) 0.5 (0.03) QA-SCAD LP 100 3.02 (2.19) 1.12 (0.06) 3 (0) 1.25 (0.77) 0.51 (0.03) MA-LASSO 300 3.05 (0.64) 1.18 (0.06) 3 (0) 15.74 (2.8) 0.5 (0.03) MA-SCAD 300 2.97 (0.65) 1.15 (0.06) 3 (0) 15.16 (4.09) 0.5 (0.03) QA-LASSO 300 3.31 (0.65) 1.26 (0.07) 2.96 (0.24) 7.24 (5.32) 0.51 (0.03) QA-SCAD CD 300 2.89 (0.62) 1.13 (0.05) 3 (0) 2.52 (2.28) 0.5 (0.03) QA-SCAD LP 300 2.88 (0.63) 1.12 (0.06) 3 (0) 2.04 (1.77) 0.5 (0.03) MA-LASSO 600 3.16 (0.96) 1.19 (0.07) 2.98 (0.14) 25.96 (4.69) 0.5 (0.03) MA-SCAD 600 4.99 (5.42) 1.45 (0.73) 2.98 (0.14) 40.37 (35.56) 0.5 (0.03) QA-LASSO 600 3.52 (0.95) 1.3 (0.08) 2.82 (0.5) 5.67 (5.68) 0.5 (0.03) QA-SCAD CD 600 2.99 (0.95) 1.14 (0.07) 3 (0) 3.32 (3) 0.5 (0.03) Table 2: Simulation results for homoscedastic T3 errors (Setting IB) Method p MSPE MAPE MCPE TP FP PS MA-LASSO 100 1.88 (0.14) 1.17 (0.05) 0.15 (0.01) 2.22 (0.42) 6.86 (2.19) 0.92 (0.01) MA-SCAD 100 1.78 (0.13) 1.14 (0.05) 0.15 (0.01) 2.04 (0.2) 2.08 (2.53) 0.92 (0.01) QA-LASSO 100 1.85 (0.36) 1.09 (0.12) 0.16 (0.01) 2.97 (0.17) 28.88 (16.28) 0.86 (0.03) QA-SCAD CD 100 1.89 (0.35) 1.1 (0.11) 0.15 (0.01) 2.92 (0.27) 2.9 (2.78) 0.89 (0.02) QA-SCAD LP 100 1.76 (0.21) 1.05 (0.06) 0.14 (0.01) 3 (0) 2.32 (2.19) 0.89 (0.02) MA-LASSO 300 1.97 (0.17) 1.21 (0.06) 0.15 (0.01) 2.17 (0.38) 14.92 (2.98) 0.93 (0.01) MA-SCAD 300 1.85 (0.15) 1.17 (0.06) 0.15 (0.01) 2.1 (0.3) 9.93 (3.78) 0.93 (0.01) QA-LASSO 300 1.9 (0.25) 1.11 (0.09) 0.16 (0.01) 3 (0) 45.61 (12.11) 0.86 (0.02) QA-SCAD CD 300 2.05 (0.49) 1.15 (0.14) 0.15 (0.02) 2.83 (0.4) 4.13 (4.39) 0.89 (0.02) QA-SCAD LP 300 1.75 (0.19) 1.04 (0.06) 0.14 (0.01) 3 (0) 6.13 (5.65) 0.88 (0.02) MA-LASSO 600 2.06 (0.18) 1.24 (0.06) 0.15 (0.01) 2.12 (0.33) 25.23 (4.49) 0.93 (0.01) MA-SCAD 600 1.92 (0.17) 1.2 (0.06) 0.15 (0.01) 2.08 (0.27) 19.94 (4.55) 0.93 (0.01) QA-LASSO 600 2 (0.3) 1.14 (0.1) 0.18 (0.01) 3 (0) 54.23 (16.41) 0.85 (0.02) QA-SCAD CD 600 2.06 (0.52) 1.16 (0.16) 0.16 (0.02) 2.74 (0.46) 5.87 (9.1) 0.88 (0.02) Table 3: Simulation results for heteroscedastic errors (Setting IC) the proposed approach is not optimal, either because it is too complex or not complex enough. The settings are Setting IIA (linear model) y = 1 + z1 z2 + 3z3 + ϵ; Setting IIB (partially linear model) y = 1 + z1 + 2 sin(4πz2) + 2(z3 .5)3 + ϵ; Setting IIC (nonadditive model) y = 1 + 2z3 1[sin(2πz2) + 1] + 8(z3 .5)2 + ϵ. In each setting ϵ T3. The QA-SCAD CD model is compared to simpler linear models. We consider linear mean and quantile regression with the SCAD (ML-SCAD, QL-SCAD) SHERWOOD AND MAIDMAN and lasso (ML-LASSO, QL-LASSO) penalty. For the linear models the objective functions are 1 n i=1 mτ(yi z i β) + j=1 pλ,a (|βj|) . (16) Where mτ(x) = x2 for the mean models and mτ(x) = ρτ(x) for the quantile regression models and pλ,a = λ|x| for lasso and for SCAD pλ,a( ) is the SCAD penalty function. For these simulations we consider the case of τ = .5. For all models λ is selected using BIC similar to what was described in the previous section and for the SCAD penalties a is fixed to 3.7. For Setting IIB Jn was set to 5, while in all other settings we fixed Jn = 3. The ML-LASSO and ML-SCAD models were fit using glmnet (Friedman et al., 2008) and ncvreg (Breheny and Huang, 2011), respectively. All the quantile regression models were fit using rq Pen (Sherwood and Maidman, 2020). Results for the simulations are reported in Tables 4-6. The QA-SCAD approach is competitive for both the linear and partially linear setting. In setting IIB it does best at selecting the true variables when p is large, p {300, 600}. For Setting IIC it dominates with respect to all metrics, except PS, which all do well at, and FP, where QL-LASSO does the best. This demonstrates that the proposed additive model has benefits compared to simpler models even when the true model is not additive. Method p MSPE MAPE TP FP PS ML-LASSO 100 3.09(2.21) 1.14(0.06) 1.73(0.62) 1.84(2.42) 0.5(0.03) ML-SCAD 100 3.06(2.21) 1.13(0.05) 1.66(0.83) 2.25(3.85) 0.5(0.03) QA-SCAD CD 100 3.07(2.21) 1.14(0.06) 2.21(0.41) 1.06(0.34) 0.51(0.03) QL-LASSO 100 3.08(2.21) 1.14(0.05) 1.71(0.61) 1.88(1.7) 0.5(0.03) QL-SCAD 100 3.05(2.2) 1.13(0.05) 2.42(0.81) 2.97(1.76) 0.51(0.03) ML-LASSO 300 2.95(0.63) 1.14(0.05) 1.51(0.52) 2.36(3.61) 0.5(0.03) ML-SCAD 300 2.91(0.64) 1.13(0.06) 1.39(0.58) 1.87(3.89) 0.5(0.03) QA-SCAD CD 300 2.91(0.63) 1.13(0.05) 2.3(0.46) 1.3(0.96) 0.5(0.03) QL-LASSO 300 2.93(0.63) 1.14(0.05) 1.7(0.48) 1.98(1.28) 0.5(0.03) QL-SCAD 300 2.9(0.63) 1.13(0.05) 1.9(0.72) 2.05(1.31) 0.5(0.03) ML-LASSO 600 3.05(0.81) 1.16(0.05) 1.5(0.51) 1.84(2.85) 0.5(0.03) ML-SCAD 600 3.01(0.82) 1.14(0.05) 1.16(0.37) 0.94(1.73) 0.5(0.04) QA-SCAD CD 600 3(0.84) 1.14(0.05) 2.26(0.44) 1.12(0.48) 0.5(0.03) QL-LASSO 600 3.02(0.82) 1.15(0.05) 1.7(0.51) 2.7(1.81) 0.5(0.03) QL-SCAD 600 2.99(0.82) 1.14(0.05) 1.22(0.46) 1.04(0.2) 0.5(0.03) Table 4: Simulation results for Setting IIA (linear model). Setting III: Model Selection Performance To validate the results of Theorem 6 we consider the model selection performance of QASCAD CD for qn, Jn, and pn increasing with n. The responses in this section are generated from the model y = 1 + 2z3 1 + sin(2πz2)I(qn > 1) + 8(z3 .5)2I(qn > 2) + 2z3 11I(qn > 3) + sin(2πz12)I(qn > 4) + 8(z13 .5)2I(qn = 6) + ϵ, ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Method p MSPE MAPE TP FP PS ML-LASSO 100 3.09(2.21) 1.14(0.06) 1.73(0.62) 1.84(2.42) 0.5(0.03) ML-SCAD 100 3.06(2.21) 1.13(0.05) 1.66(0.83) 2.25(3.85) 0.5(0.03) QA-SCAD CD 100 3.07(2.21) 1.14(0.06) 2.21(0.41) 1.06(0.34) 0.51(0.03) QL-LASSO 100 3.08(2.21) 1.14(0.05) 1.71(0.61) 1.88(1.7) 0.5(0.03) QL-SCAD 100 3.05(2.2) 1.13(0.05) 2.42(0.81) 2.97(1.76) 0.51(0.03) ML-LASSO 300 2.95(0.63) 1.14(0.05) 1.51(0.52) 2.36(3.61) 0.5(0.03) ML-SCAD 300 2.91(0.64) 1.13(0.06) 1.39(0.58) 1.87(3.89) 0.5(0.03) QA-SCAD CD 300 2.91(0.63) 1.13(0.05) 2.3(0.46) 1.3(0.96) 0.5(0.03) QL-LASSO 300 2.93(0.63) 1.14(0.05) 1.7(0.48) 1.98(1.28) 0.5(0.03) QL-SCAD 300 2.9(0.63) 1.13(0.05) 1.9(0.72) 2.05(1.31) 0.5(0.03) ML-LASSO 600 3.05(0.81) 1.16(0.05) 1.5(0.51) 1.84(2.85) 0.5(0.03) ML-SCAD 600 3.01(0.82) 1.14(0.05) 1.16(0.37) 0.94(1.73) 0.5(0.04) QA-SCAD CD 600 3(0.84) 1.14(0.05) 2.26(0.44) 1.12(0.48) 0.5(0.03) QL-LASSO 600 3.02(0.82) 1.15(0.05) 1.7(0.51) 2.7(1.81) 0.5(0.03) QL-SCAD 600 2.99(0.82) 1.14(0.05) 1.22(0.46) 1.04(0.2) 0.5(0.03) Table 5: Simulation results for Setting IIB (partially linear model). Method p MSPE MAPE TP FP PS ML-LASSO 100 3.61(2.2) 1.3(0.05) 1.12(0.57) 1.67(2.58) 0.52(0.03) ML-SCAD 100 3.6(2.19) 1.3(0.05) 1.26(0.65) 3.02(4.3) 0.52(0.03) QA-SCAD CD 100 3.21(2.2) 1.18(0.06) 2.95(0.22) 1.51(1.11) 0.5(0.03) QL-LASSO 100 3.63(2.21) 1.31(0.06) 1.86(0.49) 0.49(1) 0.51(0.03) QL-SCAD 100 3.62(2.19) 1.31(0.06) 2.51(0.52) 2.01(1.85) 0.51(0.03) ML-LASSO 300 3.48(0.62) 1.3(0.05) 0.79(0.43) 1.56(2.76) 0.52(0.03) ML-SCAD 300 3.47(0.62) 1.3(0.05) 0.85(0.54) 2.18(3.84) 0.52(0.03) QA-SCAD CD 300 3.07(0.63) 1.18(0.06) 2.88(0.33) 1.84(1.72) 0.5(0.03) QL-LASSO 300 3.48(0.62) 1.31(0.05) 1.67(0.47) 0.5(0.93) 0.51(0.03) QL-SCAD 300 3.54(0.63) 1.32(0.06) 2.38(0.58) 4.38(3.41) 0.5(0.03) ML-LASSO 600 3.55(0.93) 1.31(0.06) 0.86(0.45) 2.22(3.98) 0.51(0.03) ML-SCAD 600 3.55(0.93) 1.31(0.06) 0.85(0.48) 2.36(3.95) 0.51(0.03) QA-SCAD CD 600 3.19(0.93) 1.2(0.07) 2.82(0.41) 2.73(3.14) 0.5(0.03) QL-LASSO 600 3.56(0.92) 1.31(0.06) 1.59(0.49) 0.41(1.2) 0.5(0.03) QL-SCAD 600 3.53(0.93) 1.3(0.06) 2.01(0.44) 1.93(2.84) 0.5(0.03) Table 6: Simulation results for Setting IIC (nonadditive model). where ϵ N(0, 1). We fit the QA-SCAD model in 3 different settings where Jn, qn, or pn vary. In all the settings we fit models with sample size of 100, 300, 600 and 1000. In Setting IIIA we fit the model with Jn {3, 4, 5}. In setting IIIB models are fit with qn {1, 2, 3, 4, 5, 6}. Finally, in Setting IIIC models are fit with pn {100, 300, 500, 1000, 2000}. When they are not varying we fix Jn = 3, qn = 3 and pn = 300. For instance, in Setting IIIA we fix qn = 3 and pn = 300 and consider performance of QA-SCAD CD with different values of Jn and n. The purpose of these simulations is to corroborate the model selection SHERWOOD AND MAIDMAN 200 400 600 800 1000 0.2 0.4 0.6 0.8 1.0 True Positive Rate by n True Positive Rate Jn 3 Jn 4 Jn 5 200 400 600 800 1000 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 False Positive Rate by n False Positive Rate Jn 3 Jn 4 Jn 5 Figure 2: True positive and false positive rates by n with varying Jn from Setting IIIA properties presented in Theorem 6 and thus we only consider the QA-SCAD CD model. In addition, performance is evaluated with respect to the TP and FP rates, number of true or false covariates selected divided by the number of true or false potential covariates, to account for the fact that qn and pn can vary. Again to duplicate the settings of Theorem 6 we fix λ = n 1/10/4 which for the given additive functions satisfies the potential valid choices of λ outlined after Theorem 6. Figures 2-5 present how the average true positive and false positive rates vary with n in the different settings across the 50 replications. Figure 2 presents how the false and true positives vary with n and Jn. The relationship between Jn and correct model selection is not straightforward. From a theoretical perspective larger values of Jn will provide a better approximation of the functions and thus we could reason that for larger value of Jn the true positive rate should go up and false positive rate should go down. However, from a practical perspective larger values of Jn cause the size of the grouped coefficients to increase where the intuition is this could lead to an increase in the false positive rate. Figure 2 reflects some of this uncertainty, but also shows that as n increases, no matter the value of Jn, the active covariates tend to be selected and the inactive covariates tend to be dropped. Figure 3 demonstrates that the larger the value of qn the harder it is to select the correct variables. We also see that for large values of n that, no matter the value of qn, with high probability the active covariates are selected and the inactive covariates are discarded. Figures 4 and 5 present the false and true positives, respectively, as functions of n and n/ log(pn). The n/ log(pn) is used to verify that pn can grow exponentially with n. Both figures demonstrate that settings with large values of pn are doing worse for smaller n, but for the larger values of n the true covariates tend to be selected and the noise covariates are removed from the models. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION 200 400 600 800 1000 0.2 0.4 0.6 0.8 1.0 True Positive Rate by n True Positive Rate qn 1 qn 2 qn 3 qn 4 qn 5 qn 6 200 400 600 800 1000 0.005 0.010 0.015 False Positive Rate by n False Positive Rate qn 1 qn 2 qn 3 qn 4 qn 5 qn 6 Figure 3: True positive and false positive rates by n with varying qn from Setting IIIB 200 400 600 800 1000 0.2 0.4 0.6 0.8 1.0 True Positive Rate by n True Positive Rate pn 100 pn 300 pn 500 pn 1000 pn 2000 50 100 150 200 0.2 0.4 0.6 0.8 1.0 True Positive Rate by n/log(pn) True Positive Rate pn 100 pn 300 pn 500 pn 1000 pn 2000 Figure 4: True positive rates by n and n/log(pn) for varying pn from Setting IIIC SHERWOOD AND MAIDMAN 200 400 600 800 1000 0.000 0.005 0.010 0.015 0.020 False Positive Rate by n False Positive Rate pn 100 pn 300 pn 500 pn 1000 pn 2000 50 100 150 200 0.000 0.005 0.010 0.015 0.020 False Positive Rate by n/log(pn) False Positive Rate pn 100 pn 300 pn 500 pn 1000 pn 2000 Figure 5: False positive rates by n and n/log(pn) for varying pn from Setting IIIC 6. Data Analysis 6.1 Fat content of ground pork Borggaard and Thodberg (1992) measured the fat content and 100 channel spectrum of absorbances from 240 ground pork meat samples. Our analysis is limited to the 215 samples available from the R package faraway (Faraway, 2016) and we removed 5 observations with outlying values leaving us with 210 samples. We analyze this data using the QA-SCAD model and consider both the CD and LP algorithms. We compare these approaches to three other models: (1) MA-SCAD discussed in simulation Setting I; (2) ML-SCAD discussed in simulation Setting II; and (3) QL-SCAD discussed in simulation Setting II. For the linear models no B-spline transformation is done and each coefficient is penalized individually using the SCAD penalty. For the nonlinear additive models a B-spline transformation is used and group penalties are used for coefficients. The channel spectrum is scaled and centered to have a mean of zero and a standard deviation of one. The fat content data is first log transformed and then scaled and centered to have a mean of zero and a standard deviation of one. To compare the methods we randomly sample 200 of the 210 samples as training data and the other 10 samples are used as testing data. The channel spectrum data is highly correlated. Following an approach similar to the one outlined in Meier et al. (2009), we transform the predictors by using the first 30 principal components. The principal components are centered and scaled to have mean zero and a standard deviation of one. The five models are fit using the 30 principal components as covariates to model log of fat content. For the nonlinear models the principal components are transformed using cubic B-splines with Jn = 3. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Fitting a quantile regression model requires a choice of τ. Choosing τ = .5 provides a robust estimate of central tendencies. Two values of τ can be used to create a prediction interval that does not require a parametric assumption about the errors. For instance, models with τ = .1 and τ = .9 can be used to create an 80% prediction interval. If the whole conditional distribution is of interest then a wide range values of τ could be used. To estimate the whole conditional distribution, and test the proposed method for several values of τ, models are fit for values of τ T {.1, .2, .3, .4, .5, .6, .7, .8, .9}. The tuning parameter λ is selected using BIC, as outlined in the previous section, and we set a = 3.7. The mean models estimate the conditional quantiles using the naive procedure outlined in the previous section. The testing covariates are transformed using the rotation defined by the first 30 principal components from the testing data. The 30 covariates are then centered and scaled by the sample mean and standard deviation from the testing data. For the nonlinear models the testing covariates are transformed by the B-spline functions used on the training data. Using this transformed data, predictions from the six methods are made for the log fat content. This process is repeated 100 times. Let yij represent the scaled and centered log fat content of the ith sample from the jth testing data set and ˆyτ ij represent its corresponding estimate for the τth quantile. Let I(a b) take a value of one if a b and zero otherwise. The models are compared using 1. MSPE, 1 10 P10 i=1(yij ˆy.5 ij)2. 2. MAPE, 1 10 P10 i=1 |yij ˆy.5 ij|. 3. MCPE, 1 10 P τ T P10 i=1 ρτ(yij ˆyτ ij). 4. Quantile Bias (QB), P τ T 1 1000 P100 j=1 P10 i=1 I(yij ˆyτ ij) τ . Methods that correctly model the τth quantile will have i=1 I(yij ˆyτ ij) τ. Thus, QB is providing a summary of how accurate the conditional quantile estimates are across all partitions and all values of τ. The statistic QB and means (and standard deviations) of the other three statistics are reported in Table 7. In this data set we are comparing the linear and nonlinear approaches to see if there is justification for fitting the more complex nonlinear models. In addition, quantile and mean models are compared to see if the quantile models are providing a better description of the conditional distribution. One of the two nonlinear quantile algorithms has the best average results for MAPE, MCPE and QB. For MSPE the linear quantile approach does the best and the linear mean approach also does better than the nonlinear quantile approach. The superiority of the nonlinear quantile approach in terms of MCPE and QB suggests that the more complex nonlinear quantile models are providing useful predictions for non-central tendencies. Performance of the CD and LP algorithms is similar. SHERWOOD AND MAIDMAN Method MSPE MAPE MCPE QB ML-SCAD 0.66(0.41) 0.59(0.19) 0.24(0.07) 0.37 QL-SCAD 0.65(0.41) 0.58(0.19) 0.24(0.07) 0.32 MA-SCAD 0.96(1.56) 0.66(0.25) 0.28(0.1) 0.41 QA-SCAD CD 0.7(0.68) 0.58(0.22) 0.24(0.07) 0.21 QA-SCAD LP 0.71(0.61) 0.57(0.22) 0.23(0.08) 0.22 Table 7: Means (and standard deviations) of statistics from the Monte Carlo randomization results for the ground pork data. 6.2 Modeling TRIM32 expression levels The previous example provides some evidence that nonlinear quantile regression can provide a less biased estimate of a conditional quantile, but does not demonstrate a dramatic difference between linear and nonlinear quantile regression in terms of prediction accuracy. This section presents an example where the additive quantile regression model outperforms the linear quantile regression model in terms of prediction accuracy. Huang et al. (2010) presented an analysis of modeling high-dimensional genomics data, from Scheetz et al. (2006), where a nonlinear additive mean model improved upon the prediction performance of a linear mean model. We consider the same data set for modeling the conditional median. Scheetz et al. (2006) used 31,042 different probe sets to analyze RNA from the eyes of 120 twelve-week old male rats. Similar to Huang et al. (2010) we model the expression of gene TRIM32, because Chiang et al. (2006) identified TRIM32 as a Bardet-Biedl syndrome gene and one symptom of Bardet-Biedl syndrome is retinal degeneration. Scheetz et al. (2006) note that many of the probes were not expressed in the eye. Thus, following Huang et al. (2010) we limit our analysis to the 500 genes that have the highest absolute Pearson s correlation with the TRIM32 expression. To demonstrate that this is a setting where the nonlinear quantile model improves prediction accuracy over the linear counterpart we consider the QL-SCAD and QA-SCAD CD approaches using Monte Carlo randomization. All variables are log transformed and the predictors are further transformed to have a minimum value of zero and maximum value of one. First the data is randomly partitioned into a training set of 100 observations and a testing set of 20 observations. We fit the models using the 100 training observations and make prediction of TRIM32 expression on the remaining 20 testing observations. For the nonlinear model we set Jn = 4. This process was repeated 100 times and the MAPE recorded at each iteration. Figure 6 presents the MAPE of the two methods, demonstrating that the nonlinear model tends to be more accurate. In addition, in 69 of the 100 iterations the nonlinear model had a lower MAPE than the linear model. 7. Conclusions We proposed an additive nonlinear model to provide a flexible model. However, it is possible that too complex a model will be fit. For instance, if some of the true functions are linear than the model being fit will be more complex than necessary. To balance model ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION QA SCAD CD QL SCAD Method Figure 6: Monte Carlo randomization MAPE of TRIM32 for the genomics data. The nonlinear model tends to perform better and in 69 of the 100 Monte Carlo iterations had the lower MAPE. complexity and ease of interpretation Lou et al. (2016) proposed a penalized approach for mean regression that does both variable selection and automatic assignment of a covariate to a linear or nonlinear term. However, even this approach has some rigidity as, similar to our work, it requires preset definitions of the basis splines including the number and placement of knots. Desire for flexibility has resulted in methods which use adaptive knots (Petersen et al., 2016; Sadhanala and Tibshirani, 2019) and adaptive knot assignment and classification of predictors as linear or nonlinear (Petersen and Witten, 2019). However, all the cited work has focused on mean regression. Developing adaptive methods for quantile regression would be a useful contribution to this line of research. Acknowledgments The authors thank the anonymous reviewers for their time and effort, including finding errors in the proofs of previous versions of this work. 8. Appendix Throughout the proofs C is used to represent a generic positive constant that can change in value from line to line. We start by presenting some useful equalities that are used SHERWOOD AND MAIDMAN throughout the proofs. For u = 0, Knight (1998) introduced the equality of |u v| |u| = u[I(u > 0) I(u < 0)] + 2 Z v 0 [I(u s) I(u 0)]ds. (17) Define ψτ(u) = τ I(u < 0). As ρτ(u) = 1/2[|u| + (2τ 1)u], Koenker (2005) generalized (17) and for u = 0 presented Knight s identity as ρτ(u v) ρτ(u) = vψτ(u) + Z v 0 I(u s) I(u 0)ds. (18) Knight s identity also provides that ρτ(u a v) ρτ(u v) = Z (a+v) v ψτ(u + s)ds = Z a+v v [I(u < s) τ]ds, (19) which is an intuitive result as ψτ(u) is the derivative of ρτ(u) where it is defined. The following definitions are used throughout the proof uni = ΠA(zi) γA0 g0(zi), Dn = diag[f1(0 | z1), . . . , fn(0 | zn)] Rn n, W 2 D = Π ADnΠA Rqn Jn+1 qn Jn+1, θ = WD(γA γA0), W(zi) = W 1 D ΠA(zi), Qi(θ) = ρτ[ϵi W(zi) θ uni], Ez(x) = E (x | z) , Di(θ) = Qi(θ) Qi(0) Ezi [Qi(θ) Qi(0)] + W(zi) θψτ(ϵi), ˆθ = argmin θ Rqn Jn+1 i=1 ρτ[ϵi W(zi) θ uni]. Notice that ˆθ = WD(ˆγA γA0) and W(zi) ˆθ = ΠA(zi) ˆγA for all i {1, . . . , n}. Define dj,s = (tj,s+1 tj,s m)/(m + 1). The following Lemma is restating Corollary 1 from de Boor (1976) and provided here for convenience. Lemma 9 (Corollary 1 from de Boor (1976).) For 1 p < and for all j {1, . . . , qn} and s {0, . . . , Jn} 0 d 1 j,s |bj,s(z)|p dx 1/p 1. 8.1 Proof of Lemma 4 Proof The majority of the proof for the lower bound follows work done in proof of Lemma 1 from Chen et al. (2018a). The major difference is the constant term, corresponding to the intercept, is accounted for in these results while the results from Chen et al. (2018a) ignore the intercept because it can be removed in mean regression by centering the predictors and ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION response to be mean zero. However, for quantile regression an intercept is still required after such transformations. Define Πj Rn Jn, as the matrix of splines associated with the jth predictor, such that ΠA = [1n, Π1, . . . , Πqn] and a = (a0, a 1 , . . . , a qn) , where aj RJn for all j {1, . . . , qn}. Notice that a Π AΠAa = ||a01n+Π1a1+. . .+Πqnaqn||2 2. By Lemmas S.5 from Chen et al. (2018b) and that ||Πjaj||2 0 it follows that ||a01n + Π1a1 + . . . + Πqnaqn||2 2 1 δ j=1 ||Πjaj||2 2 From Lemma 6.2 of Zhou et al. (1998), for any j {1, . . . , qn} there exists a positive constant C such that λmin(Π j Πj) CJ 1 n n. Therefore, j=1 ||Πjaj||2 2 na2 0 + CJ 1 n n qn J 1 n n. It immediately follows that there exists a positive constant b1 such that b1δ2 qnk 1 n a 1 nΠ AΠAa. For the upper bound, using the Cauchy-Schwarz inequality and that PJn s=0 bj,s(zij) = 1 and that bj,s(z) 0 for all z [0, 1], j {1, . . . , pn} and s {0, . . . , Jn}, i=1 [a ΠA(zi)]2 ||a||2 2 1 n s=1 bj,s(zij)2 Set B1 = 2 and the result immediately follows. The following lemma provides some bounds on the vector and matrices of B-splines. Lemma 10 Under Conditions 1-4 and for sufficiently large n the following properties hold. (1) For a Rqn Jn+1 where ||a||2 = 1, there exist positive constants b2 and B2 such that for sufficiently large n that b2δ2 qnk 1 n a 1 n W 2 Da B2qn. (2) There exists a constant b3 such that max i || W(zi)||2 b3δ 1 qn (3) There exist constants m1 < M1 such that for all j {1, . . . , qn} and s {0, . . . , Jn} m1k 1 n Z 1 0 b2 j,s(z)dz M1k 1 n . (4) For all j {1, . . . , qn} and all s {0, . . . , Jn} there exist positive constants m2 < M2 such that m2k 1 n E[bj,s(zij)2] M2k 1 n . SHERWOOD AND MAIDMAN (1) Follows from Condition 1, providing uniform upper and lower bounds for fi(0) for all i {1, . . . , n}, and Lemma 4. (2) By Lemma 10 (1), it follows that || W(zi)||2 2 b2δ 2 qn knn 1||ΠA(zi)||2 2 = b2δ 2 qn knn 1 s=1 bj,s(zij)2 Cδ 2 qn knn 1qn. (3) Using Lemma 9 with p = 2, squaring all terms and moving dj,s to the upper and lower bounds it follow that (m + 1) 1dj,s Z 1 0 |bj,s(z)|2 dx dj,s. (20) By Condition 3 and the definition of h there exist positive constants c < C such that for all j {1, . . . , qn} and s {0, . . . , Jn} that c k 1 n dj,s C k 1 n . (21) Proof is complete by combining equations (20) and (21). (4) Using c1 and c2 from Condition 2 it follows that for all j {1, . . . , qn} and s {0, . . . , Jn} 0 b2 j,s(z)dz Z 1 0 b2 j,s(z)fzj(z)dz c2 0 b2 j,s(z)dz. (22) Proof is complete by combining (22) with Lemma 10 (3). Lemma 11 Under Condition 4, for any positive constants a and b, δ a qn = o nb . Proof Condition 4 provides that qn = o[log(n)]. Therefore, δ a qn nb = exp (log(n) {aqn/[2 log(n)] log[2/(1 δ)] b}) = o(1). The following lemma is central to our proof of Theorem 5. Lemma 12 For some positive constant L sup ||θ||2 L,θ Rqn Jn+1 (qnkn) 1 ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Proof Define Wn = max i || W(zi)||2. Let Fn1 denote the event Wn < C1n 1/2(qnkn)1/2δ 1 qn and Fn2 denote the event max i |uni| < C2qnk r n . Combining Lemma 10 and (4) it follows that there exist positive constants C1 and C2 such that P(Fn1, Fn2) 1. Thus, to prove Lemma 12 it is sufficient to show that ϵ > 0 sup ||θ||2 1, θ Rqn Jn+1 (qnkn) 1 > ϵ, Fn1 Fn2 Define Θ θ | ||θ||2 1, θ Rqn Jn+1 . We can partition Θ as a union of disjoint regions Θ1, ..., ΘMn, such that the diameter of each region does not exceed m0 = ϵδqn 4C1L n. Then, following the proof of Lemma 3.2 in He and Shi (1994), Mn 2 qn Jn + 1/m0 + 1 qn Jn+1. Let θ 1, ..., θ Mn be arbitrary points in Θ1, ..., ΘMn. Then sup ||θ||2 1 (qnkn) 1 > ϵ, Fn1 Fn2 + sup θ Θk (qnkn) 1 qnknθ) Di(L p qnknθ k) i > ϵ, Fn1 Fn2 We will next show that sup θ Θk (qnkn) 1 qnknθ) Di(L p qnknθ k) i I (Fn1 Fn2) ϵ/2. From definition of Di(θ) and Qi(θ) and that ρτ(u) = 1 2 u for fixed θ and θ qnknθ) Di( p qnknθ ) = 1 qnkn W(zi) θ uni ϵi p qnkn W(zi) θ uni i 1 2Ezi h ϵi p qnkn W(zi) θ uni ϵi p qnkn W(zi) θ uni i + p qnkn W(zi) [θ θ ]ψτ(ϵi). Then using the above equality, the triangle inequality and the definition of m0 sup θ Θk (qnkn) 1 qnknθ) Di(L p qnknθ k) i I (Fn1 Fn2) 2n Lm0(qnkn) 1/2 Wn I (Fn1 Fn2) 2 n Lm0C1δ 1 qn = ϵ/2. The proof is complete if it can be shown that > qnknϵ/2, Fn1 Fn2 We will use Bernstein s inequality to prove the above result. First we need upper bounds for the maximum and the second moment for the left side of the above inequality. Note that for any θ SHERWOOD AND MAIDMAN h ϵi W(zi) θ uni |ϵi uni| i 1 2Ezi h ϵi W(zi) θ uni |ϵi uni| i + W(zi) θψτ(ϵi). Then using the triangle inequality and the above equality we have, qnknθ k) I (Fn1 Fn2) 2L p qnkn Wn I (Fn1 Fn2) 2LC1δ 1 qn qnknn 1/2. Define Vi(θ) = Qi(θ) Qi(0) + W(zi) θψτ(ϵi). Notice that Di(θ) = Vi(θ) Ez[Vi(θ)], and that Pn i=1 Var [Di(θ)I (Fn1 Fn2) | zi] Pn i=1 E V 2 i (θ)I (Fn1 Fn2) | zi . Using Knight s identity qnknθ k) = L p qnkn W(zi) θ k [I(ϵi uni < 0) I(ϵi < 0)] + Z L qnkn W(zi) θ k 0 [I(ϵi uni < s) I(ϵi uni < 0)] ds Vi1 + Vi2. i=1 Ezi V 2 i1I (Fn1 Fn2) Cqnkn i=1 Ezi h W2 n I (0 < |ϵi| < |uni|) I (Fn1 Fn2) i Cδ 2 qn (qnkn)2n 1 n X |uni| fi(s | zi)ds I (Fn1 Fn2) Cδ 2 qn q3 nk2 nk r n , where the last inequality uses Condition 1. Noting that Vi2 is always non-negative and that there exists a positive constant C such that max i qnkn L W(zi) θ k I (Fn1 Fn2) Cδ 1 qn qnknn 1/2, we have i=1 Ezi V 2 i2I (Fn1 Fn2) max i qnkn L h W(zi) θ k i (Z qnkn L[ W(zi) θ k] I(ϵi uni < s) I(ϵi uni < 0) ds zi I (Fn1 Fn2) Cδ 1 qn qnknn 1/2 n X Z qnkn L W(zi) θ k 0 [Fi(s + uni | zi) Fi(uni | zi)] I (Fn1 Fn2) ds Cδ 1 qn qnknn 1/2 n X Z qnkn L W(zi) θ k 0 [fi(0 | zi) + o(1)][s + O(s2)]ds Cδ 1 qn (qnkn)2n 1/2θ k " n X i=1 fi(0 | zi) W(zi) W(zi) # θ k[1 + o(1)] = Cδ 1 qn (qnkn)2n 1/2||θ k||2 2[1 + o(1)] Cδ 1 qn (qnkn)2n 1/2[1 + o(1)]. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Where the last equality follows because Pn i=1 fi(0 | zi) W(zi) W(zi) = W 1 D W 2 DW 1 D = I. Therefore, for sufficiently large n, i=1 Var [Di(θ)I (Fn1 Fn2)] Cδ 1 qn (qnkn)2 δ 1 qn qnk r n + n 1/2 . By Bernstein s inequality, for all n sufficiently large, i=1 Di θ k, L p qnkn/n > qnknϵ/2, Fn1 Fn2 ( (qnkn)2ϵ2/4 Cδ 1 qn (qnkn)2 δ 1 qn qnk r n + n 1/2 + Cϵδ 1 qn qnknn 1/2 Cδ 2 qn qnk r n nqn Jn(ϵδqn) 1 + 1 iqn Jn+1 exp ϵ2 Cδ 2 qn qnk r n Cexp n Cqnkn log h C p nqnkn(ϵδqn) 1i Cϵ2δ2 qnq 1 n kr n o Cexp Cqnkn log(n) Cϵ2δ2 qnq 1 n kr n . By taking the expected value of the initial conditional probability and the final upper bound it follows that i=1 Di(θ k, L p > qnknϵ/2, Fn1 Fn2 Cexp Cqnkn log(n) Cϵ2δ2 qnq 1 n kr n . Where the upper bound goes to zero because by Conditions 3-4 and Lemma 11 it follows that δ2 qnq 1 n kr n and q2 nδ 2 qn kn log(n) 8.2 Proof of Theorem 5 Proof We will first prove that for all η > 0, there exists an L > 0 such that inf θ Rqn Jn+1 qnknθ) Qi(0) i > 0 Gn1(θ) = (qnkn) 1 n X Gn2(θ) = (qnkn) 1 n X i=1 Ezi h Qi p qnknθ Qi(0) i , Gn3(θ) = (qnkn) 1/2 n X i=1 W(zi) θψτ(ϵi), SHERWOOD AND MAIDMAN and note that (qnkn) 1 Pn i=1 Qi( qnknθ) Qi(0) = P3 k=1 Gnk(θ). From Lemma 12 we have that sup||θ||2 L |Gn1| = o P (1). For Gn3, first notice that E(Gn3) = 0. From Condition 1 there exists a positive constant c such that min i fi(0 | zi) c and thus E[G2 n3] C(qnkn) 1 n X i=1 E fi(0 | zi) h W(zi) θ i2 C(qnkn) 1θ E i=1 fi(0 | zi)ΠA(zi)ΠA(zi) W 1 D θ = C(qnkn) 1||θ||2 2. Therefore, sup||θ||2 L Gn3(θ) = OP (qnkn) 1/2 ||θ||2 . We will complete the proof by prov- ing that inf||θ||2 L Gn2(θ) has a positive asymptotic lower bound that does not converge to zero. Applying (19) Gn2(θ) = (qnkn) 1 n X (Z qnkn W(zi) θ+uni uni [I(ϵi s) I(ϵi 0)]ds = (qnkn) 1 n X i=1 fi(0 | zi)1 qnkn h W(zi) θ i2 + uni p qnkn W(zi) θ = C||θ||2 2[1 + o(1)] + (qnkn) 1/2 n X i=1 fi(0 | zi)uni W(zi) θ[1 + o(1)]. Define un = (un1, . . . , unn) Rn, by Condition 3 it follows that sup ||θ||2 L (qnkn) 1/2 n X i=1 fi(0 | zi)uni W(zi) θ sup ||θ||2 L (qnkn) 1/2||u n D1/2 n ||2||D1/2 n ΠAW 1 D ||2||θ||2 = OP (||θ||2). Proof of (24) is completed by noticing that for sufficiently large L, inf||θ||2 L Gn2(θ) has a dominating, positive lower bound of ||θ||2 2. By the corollary to Theorem 25 in Eggleston (1958) (p.47) and the convexity of Qi( ), (24) implies ||ˆθ||2 = OP ( qnkn). From the definition of ˆθ, it follows that ||WD(ˆγA γA0)||2 = OP qnkn . Condition 5 and (4) guarantee that uni = O(qnk r n ) and therefore i=1 fi(0 | zi) [ˆg(zi) g0(zi)]2 = n 1 n X i=1 fi(0 | zi) h ΠA(zi) (ˆγA γA0) uni i2 2n 1 (ˆγA γA0) W 2 D (ˆγA γA0) + OP q2 nk 2r n = OP n 1qnkn + q2 nk 2r n . By Condition 1, which provides a constant uniform lower bound for fi(0) for all i {1, . . . , n}, n 1 Pn i=1 [ˆg(zi) g0(zi)]2 = OP n 1qnkn + q2 nk 2r n . The following lemmas are used to prove Theorem 6 ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Lemma 13 If Conditions 1-4 hold, then ||ˆγA γA0||2 = OP Proof The proof of Theorem 5 shows ||WD(ˆγ γA0)||2 = OP ( qnkn). While from Lemma 10 it follows that ||ˆγ γA0||2 b 1/2 2 q n δ 1 qn ||WD(ˆγ γA0)||2. Lemma 14 If the Conditions of Theorem 6 hold then max qn+1 j pn 1 n i=1 πj(zij) {I [Yi g0(zi) 0] τ} Proof Recall that πj(zij) = [bj,1(zij), ..., bj,Jn(zij)] . Note, maxj,s,i |bj,s(zij) {I [Yi g0(zi) 0] τ}| 1 and E h b2 j,s(zij) {I [Yi g0(zi) 0] τ}2i Ck 1 n , see Theorem 10 (4) for the latter. For sufficiently large n, using Bernstein s inequality, i=1 bj,s(zij) {I [Yi g0(zi) 0] τ} > nk 1 n λ/4 2exp λ2n2k 2 n /32 Cnk 1 n + λnk 1 n /12 2exp Cλ2nk 1 n . max qn+1 j pn 1 n i=1 πj(zij) {I [Yi g0(zi) 0] τ} Ckn max qn+1 j pn max 1 s Jnn 1 i=1 bj,s(zij) {I [Yi g0(zi) 0] τ} Cpnknexp( Cnk 1 n λ2) = Cexp(log pn + log kn Cnk 1 n λ2) 0. Where the limit holds using the rates of pn and λ provided in Theorem 6. Lemma 15 Assume the Conditions of Theorem 6 hold max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2 i=1 πj(zij) n I[Yi Π(zi) AγA 0] I[Yi g0(zi) 0] P[Yi Π(zi) AγA 0 | zi] + P[Yi g0(zi) 0 | zi] o SHERWOOD AND MAIDMAN Proof Extending results from Welsh (1989) and Wang et al. (2012), we consider the set Z n γA : ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2o . The set Z can be covered by balls with radii Cknδ 1 qn q1/2 n n 5/2 and cardinality N |Z| Cn4k2 nδ 2 qn qn, for some positive constant C. Denote the N balls by γA(u1), ..., γA(u N), where the ball γA(ul) is centered at ul for l {1, . . . , N}. Let ϵi(γA) = Yi ΠA(zi) γA, ϵi = Yi g0(zi) and mi(a, b) = I(a 0) I(b 0). Then sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2 i=1 πj(zij) mi[ϵi(γA), ϵi] E{mi[ϵi(γA), ϵi] | zi} i=1 πj(zij) mi[ϵi(ul), ϵi] E{mi[ϵi(ul), ϵi] | zi} sup || γA ul||2 Cknδ 1 qn q1/2 n n 5/2 i=1 πj(zij) n mi [ϵi( γA), ϵi(ul)] E {mi [ϵi( γA), ϵi(ul)] | zi} o Inj1 + Inj2. Notice that i=1 bj,s(zij) (mi [ϵi(ul), ϵi] E {mi [ϵi(ul), ϵi] | zi}) To evaluate Inj1, define νijls = bj,s(zij) (mi [ϵi(ul), ϵi] E {mi [ϵi(ul), ϵi] | zi}) , which are bounded, independent mean-zero random variables. Note that Var (νijls | zi) = bj,s(zij)2 E{mi [ϵi(ul), ϵi]2 | zi} E {mi [ϵi(ul), ϵi] | zi}2 . Then using Condition 1 E{mi [ϵi(ul), ϵi]2 | zi} E{mi [ϵi(ul), ϵi] | zi}2 = Fi(0 | zi)[1 Fi(0 | zi)] + 2Fi(0 | zi)Fi[ΠA(zi) (ul γA0) + uni | zi] +Fi[ΠA(zi) (ul γA0) + uni | zi] n 1 Fi[ΠA(zi) (ul γA0) + uni | zi] o 2Fi n min[0, ΠA(zi) (ul γA0) + uni] | zi o C ΠA(zi) (ul γA0) + uni . Applying the Cauchy-Schwarz inequality and Lemma 10 it follows that i=1 Var(νijl | zi) Cn h ΠA(zi) (ul γA0) + uni i2 )1/2 C n q ||ul γA0||2 2qn + nqnk r n C knδ 1 qn qnn1/2 + nqnk r n . ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Recall that νijl is bounded, then applying Bernstein s inequality and using the assumed rates of λ, kn and qn it follows that > nλ/(16Jn) C n2λ2k 2 n knδ 1 qn qnn1/2 + nqnk r n + nλk 1 n exp Cnλk 1 n . The term nλk 1 n dominates the denominator in the first inequality by combining Conditions 3 and 4, Lemma 11 and the assumption that n 1/2k2 n log(n) = o(λ). Note, the upper bound does not depend on zi and taking expectations on both sides we get exp Cnλk 1 n . Inj1 CNknexp Cnλk 1 n = Cn4k2 nδ 2 qn qn+1/(2r+1)q1/(2r+1) n exp Cnλk 1 n Cexp 4k2 nδ 2 qn qn + 1/(2r + 1) log(n) + 1/(2r + 1) log(qn) Cnλk 1 n . To evaluate Inj2, note that I[ϵi( γA) 0] = I ϵi(ul) ΠA(zi) ( γA ul) . First, we will derive an upper bound for the sum in the probability statement. Since I(x a) is an increasing function of a, we have sup || γA ul||2 Cknδ 1 qn q1/2 n n 5/2 i=1 πj(zij) n mi [ϵi( γA), ϵi(ul)] E{mi [ϵi( γA), ϵi(ul)] | zi} o i=1 ||πj(zij)||1 I h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 i I [ϵi (ul) 0] P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i + P [ϵi (ul) 0|zi] i=1 ||πj(zij)||1 P h ϵi (ul) C||ΠA(zi)||2knδ 1 qn q1/2 n n 5/2 zi i P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i) First, the second sum will be examined. Using Condition 1, Taylor series expansion and that the elements of πj(zij) are bounded, i=1 ||πj(zij)||1 P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i) i=1 ||πj(zij)||1 ||ΠA(zi)||2knδ 1 qn q1/2 n n 5/2 Ck5/2 n δ 1 qn qnn 3/2 = o(nλ). SHERWOOD AND MAIDMAN αijl = ||πj(zij)||1 I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] P h ϵi (ul) C||ΠA(zi)||2knn 5/2 | zi i + P [ϵi (ul) 0 | zi] Then for n sufficiently large, Inj2 PN l=1 P Pn i=1 αijl nλ and again Bernstein s in- equality will be used to provide an upper bound for this probability. To evaluate αijl, define ωijls = |bj,s(zij)| I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] P h ϵi (ul) C||ΠA(zi)||2knn 5/2 | zi i + P [ϵi (ul) 0 | zi] which are bounded, independent mean-zero random variables. Using that the elements of ||ΠA(zi)||2 are bounded for all i, it follows that Var(ωijkl | zi) Cmax i I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] E n I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] zi o E n I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] zi o = CFi h W(zi) A(ul γA0) + uni + C||ΠA(zi)||2knn 5/2 zi i CFi h W(zi) A(ul γA0) + uni zi i Cmax i ||ΠA(zi)||2knn 5/2 C qnk3/2 n n 5/2. i=1 αijl nλ Applying Bernstein s inequality, for some positive constants C1, C2 and C3, C1n2λ2k 2 n C2 qnk3/2 n n 3/2 + C3λnk 1 n Cexp 4k2 nδ 2 qn qn + 1/(2r + 1) log(n) + 1/(2r + 1)qn Cnλk 1 n . Note that nλk 1 n dominates qnk3/2 n n 3/2 because qnk3/2 n n 3/2 nλk 1 n = qnk5/2 n n2λ = kn qnk2 n nλ . ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Where both fractions are o(1) by assumed rates. Note that under Condition 7, λ = o[n (1 c4)/2] implies λ = o(k 1 n ). To complete the proof, notice there exist positive constants C1, C2, C3 and C4 such that for all n sufficiently large, the probability of interest in the lemma is bounded by j=qn+1 (Inj1 + Inj2) C1exp log(pn) + C2 k2 nδ 2 qn qn + (2r + 1) 1 log(n) + C3/(2r + 1) log(qn) C4nλk 1 n . This upper bound converges to zero under the assumptions of this lemma and using Lemma 11. Lemma 16 Assume the conditions of Theorem 6 hold, then max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2 i=1 πj(zij) n P h Yi ΠA(zi) γA 0 | zi i P[Yi g0(zi) 0 | zi] o Proof Define vn = knδ 1 qn q1/2 n n 1/2, using the Cauchy-Schwarz inequality and Lemma 10 max qn+1 j pn sup ||γA γA0||2 Cvn i=1 πj(zij) P h Yi ΠA(zi) γA 0 | zi i P [Yi g0(zi) 0 | zi] max qn+1 j pn sup ||γA γA0|| Cvn i=1 b2 j,s(zij) i=1 {Fi [ΠA(zi) (γA γA0) uni | zi] Fi(0 | zi)}2 max qn+1 j pn sup ||γA γA0|| Cvn C E h b2 j,s(zij) i + OP (n 1/2) i=1 [ΠA(zi) (γA γA0)]2 + u2 ni M4k 1 n + OP (n 1/2) q C(k2nδ 2 qn q2nn 1 + q2nk 2r n ) C k3/2 n δ 1 qn qnn 1/2 + qnk1/2 r n [1 + o P (1)]. From the conditions on λ, kn and qn it can be derived that k3/2 n δ 1 qn qnn 1/2+qnk1/2 r n = o(λ), thus completing the proof. The following lemma is an extension of Lemma 2.2 and 2.3 from Wang et al. (2012) which considered the linear model. SHERWOOD AND MAIDMAN Lemma 17 Assume the Conditions of Theorem 6 hold. Then for the oracle estimator, ˆγ,with probability approaching one min j {1,...,qn} ˆγj 1 (a + 1/2)λ, (25) min j {qn+1,...,pn} i=1 πj(zij)ψτ[yi Π(zi) ˆγ] 1 λ/2. (26) Proof Proof of (25): Note that min j {1,...,qn}||ˆγj||1 min j {1,...,qn} ||γ0j||1 ||ˆγ γA0||1. (27) By Lemmas 11 and 13 and Conditions 3, 4 and 7, ||ˆγ γA0||1 Jnqn + 1||ˆγ γA0||2 = OP k3/2 n qnn 1/2δ 1 qn = o P n (1 c4)/2 . Condition 7 guarantees that there exists a positive constant c5 such that min j {1,...,qn} ||γ0j||1 c5n (1 c4)/2. Finally, λ = o n (1 c4)/2 and therefore P min j {1,...,qn}||ˆγj||1 > (a + 1/2)λ 1. Proof of (26): Define sj(γ) = 1 n Pn i=1 πj(zij)ψτ[yi Π(zi) ˆγ] and D = {i : Yi ΠA(zi) ˆγA = 0}. For j {qn + 1, ..., pn}, i=1 πj(zij) n I h Yi ΠA(zi) ˆγA 0 i τ o 1 i D πj(zij)[a i + (1 τ)], where a i [τ 1, τ] with i D such that sj(ˆγ) = 0Jn for j {1, . . . , qn} and n I h Yi ΠA(zi) ˆγA 0 i τ o 1 i D [a i + (1 τ)] = 0. From Section 2.2 of Koenker (2005) it follows that with probability one |D| qn Jn + 1. Then by Conditions 2-4 and the assumptions about the rate of λ it follows that max qn+1 j pn i D πj(zij)[a i + (1 τ)] 1 = OP qnknn 1 = o P (λ). Thus, it is sufficient to show that max qn+1 j pn i=1 πj(zij) n I[Yi ΠA(zi) ˆγA 0] τ o ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Using Lemma 14 for the second inequality it follows that, max qn+1 j pn i=1 πj(zij) n I[Yi ΠA(zi) ˆγA 0] τ o P max qn+1 j pn i=1 πj(zij) n I[Yi ΠA(zi) ˆγA 0] I[Yi g0(zi) 0] o max qn+1 j pn i=1 πj(zij) {I[Yi g0(zi) 0] τ} max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2 i=1 πj(zij) n I[Yi ΠA(zi) γA 0] I[Yi g0(zi) 0] o max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2 i=1 πj(zij) n I h Yi ΠA(zi) γA 0 i I[Yi g0(zi) 0] P[Yi ΠA(zi) γA 0 | zi] + P[Yi g0(zi) 0 | zi] o max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2 i=1 πj(zij) n P h Yi ΠA(zi) γA 0 | zi i P[Yi g0(zi) 0 | zi] o The two probability statements go to zero by Lemmas 15 and 16. This completes the proof. 8.3 Proof of Theorem 6 Proof Define the neighborhood Xφ = γ RJnpn+1 ||ˆγ γ||1 < φ < λ/2 . In this proof we show that for sufficiently large n there exists a φ such that Q(ˆγ) Q(γ) for all γ Xφ. Define W = n γ = (γ0, . . . , γJnpn) RJnpn+1 γj = 0 for j {Jnqn + 1, . . . , Jnpn} o and Fφ = W Xφ. For any γ Xφ and for any j {1, . . . , qn} it follows from Lemma 17 and the definition of Xφ that with probability approaching one ||γj||1 ||ˆγj||1 ||ˆγj γj||1 (a + 1/2)λ λ/2 = aλ. By Condition 6 and Lemma 17 it follows, with probability approaching one, that for any γ Fφ that pλ,a(||ˆγj||1) = pλ,a(||γj||1) for all j {1, . . . , pn}. By definition of the or- SHERWOOD AND MAIDMAN acle estimator and Fφ it follows that for any γ Fφ that 1 n Pn i=1 ρτ yi Π(zi) ˆγ 1 n Pn i=1 ρτ yi Π(zi) γ . Therefore, for any γ Fφ it holds that Q(ˆγ) Q(γ). For any vector γ Xφ let γ represent the projection of γ into Fφ. For sufficiently large n, and thus sufficiently small λ, Q(ˆγ) Q( γ) and thus the proof will be complete if it can be shown that Q( γ) Q(γ). Let γA represent the first qn Jn + 1 entries of γ and γN the remaining Jn(pn qn) entries such that γ = γ A, γ N and γ = h γ A, 0 Jn(pn qn) i . Similarly define ΠN(zi) such that Π(zi) = ΠA(zi) , ΠN(zi) . By Knight s identity Q(γ) Q( γ) = 1 n i=1 ρτ[yi Π(zi) γ] ρτ[yi Π(zi) γ] + pλ,a(||γj||1) pλ,a(0) i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ] Z Π(zi) (γ γ) 0 I[yi Π(zi) γ s] I[yi Π(zi) γ 0] ds pλ,a(||γj||1) pλ,a(0) . As Pn i=1 R Π(zi) (γ γ) 0 I[yi Π(zi) γ s] I[yi Π(zi) γ 0]ds is non-negative for all i, it will be sufficient to show that 1 n i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ] pλ,a(||γj||1) pλ,a(0) . Notice, 1 n i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ] j=qn+1 ||γj||1 i=1 πj(zij)ψτ[yi Π(zi) γ] By the mean value theorem, for some c j (0, ||γj||1) pλ,a(||γj||1) pλ,a(0) = p λ(c j)||γj||1. By Lemma 17 and Condition 6 there exists a sufficiently small φ such that for all j {qn + 1, . . . , pn} i=1 πj(zij)ψτ[yi Π(zi) γ] 1 p λ,a(φ). Note that c j < φ, for all j, and therefore by the assumption that pλ,a( ) is concave in [0, ), from Condition 6, it follows that p λ,a(φ) p λ,a(c j) for all j {qn + 1, . . . , pn}. Therefore, i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ] j=qn+1 ||γj||p λ,a(φ) j=qn+1 p λ(c j)||γj||1 pλ,a(||γj||1) pλ,a(0) . ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION 8.4 Proof of Theorem 7 Proof Define the vector functions of Q (γ, a, v) = 1 i=1 Π(zi){τ I[yi Π(zi) γ]} 1 i=1 Π(zi)(1 τ + ai)I[yi = Π(zi) γ] + v, i=1 Π(zi){I[yi Π(zi) γ] τ} E i=1 Π(zi){I[yi Π(zi) γ] τ} π(γ1, γ2, a1, a2) = 1 n i=1 Π(zi)(1 τ + a2i)I[yi = Π(zi) γ2] 1 i=1 Π(zi)(1 τ + a1i)I[yi = Π(zi) γ1], and r(γ1, γ2) = r(γ1) r(γ2). For a Rn define the subgradient of ||a||1 as ||a||1 = {b Rn|bk = sgn(ak) for ak = 0 and bk [ 1, 1] otherwise}, and define the sets V(γ) = {b = (0, b 1 , . . . , b pn) Rpn Jn+1|bj = p λ(||γj||)cj, where cj ||γj||1, for all j {1, . . . , pn}}, A(γ) = {b = (b1, . . . , bn) Rn|bi = 0 if yi = Π(zi) γ and bi [ 1, 1] otherwise}. By first order conditions if γ is a local minimizer of Q(γ) then there exists v V(γ) and a A(γ) such that Q (γ, a, v) = 0pn Jn+1. Thus, there exists v V( γ) and a A( γ) such that Q ( γ, a, v) = 0pn Jn+1. Similarly, with probability approaching one, by Theorem 6, there exists ˆv V(ˆγ) and ˆa A(ˆγ) such that Q (ˆγ, ˆa, ˆv) = 0pn Jn+1. By Condition 6 and that the first derivatives of differentiable concave functions are decreasing |p λ(||γj||)| λ for all j {1, . . . , pn} and thus ||v|| λ for all v V(γ) and any γ Rpn Jn+1. For any vector a Rpn Jn+1 define a E Rwn Jn+1 as the sub-vector from the element of E similar to how we have defined a A Rqn Jn+1. For some mi between uni + Π(zi) E ( γ γ0)E and uni + Π(zi) E (ˆγ γ0)E and using Conditions 1 and 8 there exists a positive constant C such that with probability approaching one 0 = Q E( γ, a, v) Q E(ˆγ, ˆa, ˆv) ( γE ˆγE) || γE ˆγE||2 n ( γE ˆγE) E 1 n Pn i=1 fi( mi)ΠE(zi)ΠE(zi) + r( γ, ˆγ) + v ˆv π( γ, ˆγ, a, ˆa) o || γE ˆγE||2 Cδ2 wnk 1 n || γE ˆγE||2 || r E( γ, ˆγ)||2 2λ p wn Jn || πE( γ, ˆγ, a, ˆa)||2. Note for any γ, ||r E(γ)||2 = OP p wn n by Lemma 10 (4) and thus || r E( γ, ˆγ)||2 = OP p wn n . By Condition 8 || πE( γ, ˆγ, a, ˆa)||2 = OP (knwnn 1 1 + wn). If with probability approaching one || γE ˆγE||2 has a lower bound of order log(n)δ 2 wnkn wnkn + knwnn 1 SHERWOOD AND MAIDMAN then with probability approaching one [Q E( γ, a, v) Q E(ˆγ, ˆa, ˆv)] ( γE ˆγE) || γE ˆγE||2 has a posi- tive lower bound, which is a contradiction. Therefore, || γE ˆγE||2 = OP log(n)δ 2 wnkn wnkn + knwnn 1 I. Barrodale and F.D.K. Roberts. Solution of an overdetermined system of equations in the ℓ1 norm. Communications of the ACM, 17:319 320, 1974. Alexandre Belloni and Victor Chernozhukov. L1-penalized quantile regression in highdimensional sparse models. Annals of Statistics, 39(1):82 130, 2011. Claus Borggaard and Hans Henrik Thodberg. Optimal minimal neural interpretation of spectra. Analytical Chemistry, 64(5):545 551, 1992. Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1):232 253, 2011. Patrick Breheny and Yaohui Zeng. grpreg: Regularization paths for regression models with grouped covariates 3.1-2, 2017. URL https://cran.r-project.org/web/packages/ grpreg/index.html. Zhao Chen, Jianqing Fan, and Runze Li. Error variance estimation in ultrahigh-dimensional additive models. Journal of the American Statistical Association, 113(521):315 327, 2018a. Zhao Chen, Jianqing Fan, and Runze Li. Supplemental material of error variance estimation in ultrahigh-dimensional additive models. Journal of the American Statistical Association, 113(521):315 327, 2018b. Annie P. Chiang, John S. Beck, Hsan-Jan Yen, Marwan K. Tayeh, Todd E. Scheetz, Ruth E. Swiderski, Darryl Y. Nishimura, Terry A. Braun, Kwang-Youn A. Kim, Jian Huang, Khalil Elbedour, Rivka Carmi, Diane C. Slusarski, Thomas L. Casavant, Edwin M. Stone, and Val C. Sheffield. Homozygosity mapping with snp arrays identifies trim32, an e3 ubiquitin ligase, as a bardet biedl syndrome gene (bbs11). Proceedings of the National Academy of Sciences, 103(16):6287 6292, 2006. Carl de Boor. Splines as linear combinations of b-splines. In Approximation Theory II, pages 1 47. Academic Press (New York), 1976. Jan G. De Gooijer and Dawit Zerom. On additive conditional quantiles with highdimensional covariates. Journal of the American Statistical Association, 98:135 146, 2003. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Ronald A. Devore and George G. Lorentz. Constructive Approximation. Cambridge University Press, 2005. Harold Gordon Eggleston. Convexity, Cambridge Tracts in Mathematics and Mathematical Physics, No.47. Cambridge University Press, 1958. A. Essl, A. Ortner, R. Haas, and P. Hettegger. Machine learning analysis for a flexibility energy approach towards renewable energy integration with dynamic forecasting of electricity balancing power. In 2017 14th International Conference on the European Energy Market (EEM), pages 1 6, 2017. Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348 1360, 2001. Jianqing Fan and Jinchi Lv. Nonconcave penalized likelihood with np-dimensionality. IEEE Transactions on Information Theory, 57(8):5467 5484, 2011. Julian Faraway. faraway: Functions and datasets for books by julian faraway 1.0.7, 2016. URL https://cran.r-project.org/web/packages/faraway/index.html. Joachim Freyberger, Andreas Neuhierl, and Michael Weber. Dissecting Characteristics Nonparametrically. The Review of Financial Studies, 33(5):2326 2377, 04 2020. ISSN 0893-9454. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularized paths for genearlized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2008. Y. Fujimoto, S. Murakami, N. Kaneko, H. Fuchikami, T. Hattori, and Y. Hayashi. Machine learning approach for graphical model-based analysis of energy-aware growth control in plant factories. IEEE Access, 7:32183 32196, 2019. Yuwen Gu, Jun Fan, Lingchen Kong, Shiqian Ma, and Hui Zou. Admm for high-dimensional sparse penalized quantile regression. Technometrics, 60(3):319 331, 2018. Xuming He and Peide Shi. Convergence rate of b-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3:299 308, 1994. Xuming He and Peide Shi. Bivariate tensor-product b-splines in a partly linear model. Journal of Multivariate Analysis, 58(2):162 181, 1996. Xuming He, Zhong-Yi Zhu, and Wing-Kam Fung. Estimation in semiparametric model for longitudinal data with unspecified dependence structure. Biometrika, 89(3):579 590, 2002. Xuming He, Lan Wang, and Hyokyoung Grace Hong. Quantile-adaptive model-free nonlinear feature screening for high-dimensional heterogeneous data. Annals of Statistics, 41 (1):342 369, 2013. Joel L. Horowitz and Sokbae Lee. Nonparametric estimation of additive quantile regression model. Journal of the American Statistical Association, 100(472):1238 1249, 2005. SHERWOOD AND MAIDMAN Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282 2313, 2010. Jian Huang, Patrick Breheny, and Shuangge Ma. A selective review of group selection in high-dimensional models. Statistical Science, 27(4):481 499, 2012. Jianhua Z. Huang. Projection estimation in multiple regression with application to functional anova models. The Annals of Statistics, 26(1):242 272, 1998a. Jianhua Z. Huang. Functional anova models for generalized regression. Journal of Multivariate Analysis, 67:49 71, 1998b. Ahmed M. Ibrahim, Hassan A.M. Hendawy, Wafaa S. Hassan, Abdalla Shalaby, and Manal S. El Masry. Determination of terazosin in the presence of prazosin: Different state-of-the-art machine learning algorithms with uv spectroscopy. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 236:1386 1425, 2020. Kengo Kato. Group lasso for high dimensional sparse quantile regression models. https://arxiv.org/pdf/1103.1458, March 2012. Mi-Ok Kim. Quantile regression with varying coefficients. The Annals of Statistics, 35(1): 92 108, 2007. Yongdai Kim, Hosik Choi, and Hee-Seok Oh. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103(484):1665 1673, 2008. Yongdai Kim, Sunghoon Kwon, and Hosik Choi. Consistent model selection criteria on high dimensions. Journal of Machine Learning Research, 13:1037 1057, 2012. Keith Knight. Limiting distributions for l1 regression estimators under general conditions. The Annals of Statistics, 26(2):755 770, 1998. Roger Koenker. Quantile Regression. Cambridge University Press, 2005. Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica, 46(1):33 50, 1978. Roger Koenker and Vasco D Orey. A remark on algorithm as 229: Computing dual regression quantiles and regression rank score. Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(2):410 414, 1994. Roger W. Koenker and Vasco D Orey. Computing regression quantiles. Journal of the Royal Statistical Society. Series C (Applied Statistics), 36(3):383 393, 1987. Eun Ryung Lee, Hohsuk Noh, and Byeong U. Park. Model selection via bayesian information criterion for quantile regression models. Journal of the American Statistical Association, 109(505):216 229, 2014. Heng Lian, Xin Chen, and Jian-Yi Yang. Identification of partially linear structure in additive models with an application to gene expression prediction from sequences. Biometrics, 68(2):437 445, 2012. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Chen-Yen Lin, Howard Bondell, Hao Helen Zhang, and Hui Zou. Variable selection for nonparametric quantile regression via smoothing spline analysis of variance. Stat, 2: 255 268, 2013. Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272 2297, 2006. Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559 616, 2015. Yin Lou, Jacob Bien, Rich Caruana, and Johannes Gehrke. Sparse partially linear additive models. Journal of Computational and Graphical Statistics, 25(4):1126 1140, 2016. Shaogao Lv, Huazhen Lin, Heng Lian, and Jian Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel hilbert space. Ann. Statist., 46(2): 781 813, 04 2018. Adam Maidman and Lan Wang. New semiparametric method for predicting high-cost patients. Biometrics, 74(3):1104 1111, 2018. Lukas Meier, Sara van de Geer, and Peter B uhlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779 3821, 2009. Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis fo m-estimators with decomposable regualrizers. Statistical Science, 27(4):538 557, 2012. Marco Palma, Shahin Tavakoli, Julia Brettschneider, and Thomas E. Nichols. Quantifying uncertainty in brain-predicted age using scalar-on-image quantile regression. Neuro Image, 219:1 14, 2020. Bo Peng and Lan Wang. An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 24(3):676 694, 2015. Ashley Petersen and Daniela Witten. Data-adaptive additive modeling. Statistics in Medicine, 38(4):583 600, 2019. Ashley Petersen, Daniela Witten, and Noah Simon. Fused lasso additive model. Journal of Computational and Graphical Statistics, 25(4):1005 1025, 2016. Veeranjaneyulu Sadhanala and Ryan J. Tibshirani. Additive models with trend filtering. Ann. Statist., 47(6):3032 3068, 12 2019. Todd E. Scheetz, Kwang-Youn A. Kim, Ruth E. Swiderski, Alisdair R. Philp, Terry A. Braun, Kevin L. Knudtson, Anne M. Dorrance, Gerald F. Di Bona, Jian Huang, Thomas L. Casavant, Val C. Sheffield, and Edwin M. Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429 14434, 2006. SHERWOOD AND MAIDMAN Larry L. Schumaker. Spline Functions: Basic Theory. Wiley, New York, 1981. Ben Sherwood. Variable selection for additive partial linear quantile regression with missing covariates. Journal of Multivariate Analysis, 152:206 223, 2016. Ben Sherwood and Adam Maidman. rqpen: Penalized quantile regression 2.2.2, 2020. URL https://cran.r-project.org/package=rq Pen. Ben Sherwood and Lan Wang. Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics, 44(1):288 317, 2016. Ben Sherwood, Aaron J. Molstad, and Sumanta Singha. Asymptotic properties of concave l1-norm group penalties. Statistics and Probability Letters, 157:108631, 2020. Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040 1053, 1982. Charles J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689 705, 1985. Charles J. Stone. The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 14(2):590 606, 1986. Ichiro Takeuchi, Quoc V. Le, Tim Sears, and Alexander J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231 1264, 2006. Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289 355, 1997. Robert Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(1):267 288, 1996. Hansheng Wang, Guodong Li, and Guohua Jiang. Robust regression shrinkage and consistent variable selection through the lad-lasso. Journal of Business & Economic Statistics, 25(3), 2007. Hansheng Wang, Bo Li, and Chenlei Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671 683, 2009a. Huixia Judy Wang, Zhongyi Zhu, and Jianhui Zhou. Quantile regression in partially linear varying coefficient models. The Annals of Statistics, 37(6B):3841 3866, 2009b. Lan Wang, Yichao Wu, and Runze Li. Quantile regression of analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497):214 222, 2012. Li Wang, Lan Xue, Annie Qu, and Hua Liang. Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates. The Annals of Statistics, 42(2):592 624, 2014a. ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6): 2164 2201, 2014b. Ying Wei, Anneli Pere, Roger Koenker, and Xuming He. Quantile regression methods for reference growth charts. Statistics in Medicine, 25:1369 1382, 2006. A.H. Welsh. On m-processes and m-estimation. The Annals of Statistics, 17(1):337 361, 1989. Yicaho Wu and Yufeng Liu. Variable selection in quantile regression. Statistica Sinica, 19 (2):801 817, 2009. Lan Xue. Consistent variable selection in additive models. Statistica Sinica, 19:1281 1296, 2009. Lan Xue and Lijian Yang. Additive coefficient modeling via polynomial spline. Statistica Sinica, 16(4):1423 1446, 2006. Congrui Yi and Jian Huang. Semismooth newton coordinate descent algorithm for elasticnet penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547 557, 2017. Liqun Yu, Nan Lin, and Lan Wang. A parallel algorithm for large-scale nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 26(4):935 939, 2017. Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894 942, 2010. Kaifeng Zhao and Heng Lian. Variable selection in additive quantile regression using nonconcave penalty. Statistics, 50(6):1276 1289, 2016. Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541 2563, 2006. Qi Zheng, Limin Peng, and Xuming He. Globally adaptive quantile regression with ultrahigh dimensional data. The Annals of Statistics, 43(5):2225 2258, 2015. S. Zhou, X. Shen, and D.A. Wolfe. Local asymptotics for regression splines and confidence regions. The Annals of Statistics, 26(5):1760 1782, 1998. Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418 1429, 2006. Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509 1533, 2008.