# additive_nonlinear_quantile_regression_in_ultrahigh_dimension__5eef35bc.pdf

Journal of Machine Learning Research 23 (2022) 1-47 Submitted 8/19; Revised 10/21; Published 3/22

Additive nonlinear quantile regression in ultra-high dimension

Ben Sherwood ben.sherwood@ku.edu School of Business University of Kansas Lawrence, KS 66045, USA

Adam Maidman abmaidman@gmail.com Microsoft One Microsoft Way Redmond, WA 98052, USA

Editor: Zaid Harchaoui

We propose a method for simultaneous estimation and variable selection of an additive quantile regression model that can be used with high dimensional data. Quantile regression is an appealing method for analyzing high dimensional data because it can correctly model heteroscedastic relationships, is robust to outliers in the response, sparsity levels can change with quantiles, and it provides a thorough analysis of the conditional distribution of the response. An additive nonlinear model can capture more complex relationships, while avoiding the curse of dimensionality. The additive nonlinear model is ﬁt using B-splines and a nonconvex group penalty is used for simultaneous estimation and variable selection. We derive the asymptotic properties of the estimator, including an oracle property, under general conditions that allow for the number of covariates, pn, and the number of true covariates, qn, to increase with the sample size, n. In addition, we propose a coordinate descent algorithm that reduces the computational cost compared to the linear programming approach typically used for solving quantile regression problems. The performance of the method is tested using Monte Carlo simulations, an analysis of fat content of meat conditional on a 100 channel spectrum of absorbances and predicting TRIM32 expression using gene expression data from the eyes of rats.

Keywords: Quantile Regression; Oracle Property; Nonparametric Regression; Splines; nonconvex penalty.

1. Introduction

We consider the sample {yi, zi}n i=1 where yi R and zi = (zi1, . . . , zipn) Rpn. The τth conditional quantile, τ (0, 1), of y given z is deﬁned as Qy|z(τ) = inf{t : F(t|z) τ}, where F( |z) is the conditional distribution function of y given z. There are pn potential variables, but only qn(τ) of these variables are needed to model the τth conditional quantile. Without loss of generality we assume the ﬁrst qn(τ) of these variables are active and the remaining pn qn(τ) are inactive. The index n allows the set of active and inactive variables to increase with n, including the ultra-high dimensional case where pn increase at an exponential rate

c 2022 Ben Sherwood and Adam Maidman.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/19-697.html.

SHERWOOD AND MAIDMAN

of n. For a given τ we consider the following sparse model for the conditional quantile

Qy|z(τ) = g0(z, τ) = α0(τ) +

j=1 g0j(zj, τ) = α0(τ) +

j=1 g0j(zj, τ), (1)

and for identiﬁability we assume E[g0j(zj, τ)] = 0 for all j {1, . . . , qn(τ)}. The model is a high-dimensional, sparse, nonparametric model that provides great ﬂexibility. We assume an additive model to avoid the curse of dimensionality. The active variables, intercept, and additive functions are indexed by τ as the model allows for these values to change with τ. For simplicity of notation the τ symbol will be dropped throughout the remaining of the paper, but we emphasize here that the model allows for the nonlinear relationships and sparsity structure to change with τ. We propose using B-splines to model the nonlinear relationships and a group nonconvex penalty to correctly identify the covariates that have a relationship with the response at the given conditional quantile. Tibshirani (1996) proposed the lasso penalty for simultaneous estimation and model selection, but strong conditions are required for model selection consistency (Zhao and Yu, 2006). Our focus is on model selection and our results will depend on using nonconvex penalty functions such as SCAD (Fan and Li, 2001) and MCP Zhang (2010) functions, which provide oracle estimators, a stronger result than model selection consistency, under milder conditions. When using splines multiple coeﬃcients will be associated with a single covariate and thus we will use a group penalty, see Huang et al. (2012) for a review of group penalties in high-dimensional models. Previous works have proposed using splines with a group penalty for estimating an additive conditional mean function (Huang et al., 2010; Lin and Zhang, 2006; Meier et al., 2009; Xue, 2009). The work most similar to ours is Xue (2009) and Huang et al. (2010). Xue (2009) proposed using a group SCAD penalty and derived model consistency results for ﬁxed q and p. Huang et al. (2010) proposed using a group adaptive lasso (Zou, 2006) and proved model selection consistency with ﬁxed q, but allowing p to increase with n. Unlike these works, our focus is on estimating (1) instead of an additive conditional mean function. Since Koenker and Bassett (1978) proposed linear quantile regression there have been many extensions, including work on nonlinear quantile regression. For a univariate covariate He and Shi (1994) demonstrated that using B-splines for nonlinear quantile regression has the same optimal rate of convergence as nonlinear mean regression (Stone, 1982). Motivated by the work of Stone (1985) (additive mean regression) and Stone (1986) (generalized additive models), De Gooijer and Zerom (2003) proposed a kernel based method for estimating an additive nonlinear conditional quantile model and demonstrated that for ﬁxed p, additive quantile regression achieves the same rate of convergence found in He and Shi (1994) and thus theoretically alleviates the curse of dimensionality, although the proposed method requires bias correction for p 5. Horowitz and Lee (2005) proposed a two-stage estimator for additive quantile regression that achieves the optimal rate of convergence and does not require a bias correction. Takeuchi et al. (2006) provided ﬁnite sample bounds for nonparametric quantile regression and discussed how to handle constraints such as monotonicity and non-crossing quantiles. Splines oﬀer great ﬂexibility in modeling conditional quantiles and have been proposed in a variety of conditional quantile models including, but not limited to, varying coeﬃcients (Kim, 2007), growth curves (Wei et al., 2006) and

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

semiparametric models (He and Shi, 1996; He et al., 2002; Maidman and Wang, 2018; Wang et al., 2009b). Quantile regression is a robust method which estimates a conditional quantile of interest. Our proposed method estimates conditional quantiles while allowing the sparsity structure to vary with τ, which has been stated as another example of the ﬂexibility quantile regression provides for analyzing high-dimensional data (He et al., 2013; Wang et al., 2012). Previous work in penalized quantile regression includes using the lasso penalty (Belloni and Chernozhukov, 2011) and the nonconvex penalties MCP and SCAD (Wang et al., 2012) for estimating linear quantile regression with high-dimensional covariates. The high-dimensional linear quantile model has been relaxed to a partially linear model, where variable selection is only done on the high-dimensional linear terms (Sherwood, 2016; Sherwood and Wang, 2016). Other work proposed using splines with a group penalty for simultaneous estimation and variable selection of additive quantile regression. Kato (2012) proposed using a group lasso penalty, their work focused on convergence rates, while our work focuses on deriving an estimator with the oracle property which is asymptotically equivalent to the estimator that would be ﬁt if we a priori knew the active covariates. Zhao and Lian (2016) considered the case where p is ﬁxed and proposed using a nonconvex group penalty with the L2 norm, while we allow p to increase with n and use the L1 norm in our group penalty. Lin et al. (2013) proposed a smoothing spline ANOVA method that focuses on computational aspects and does not contain asymptotic results. Lv et al. (2018) considered estimation of (1) where the univariate functions reside in a reproducing kernel Hilbert space. Their work focused on estimation bounds, while our work focuses on deriving an oracle estimator, and they proposed a diﬀerent penalty function than the one presented here. Penalized quantile regression is not as commonly used as penalized least squares, but recent work has shown an interest in simultaneously estimating a conditional quantile and performing model selection. Essl et al. (2017) used penalized quantile regression to model extremes for the reserve capacity in the Australian electricity market, using time of day, year and week variables along with other forecast variables. Palma et al. (2020) modeled the age of a brain using MRI data for cognitively normal patients to better understand brain decay for cognitively impaired individuals. Quantile regression was used to model the .05, .5 and .95 quantiles, while the penalty was used to select the useful information from the MRI data. Motivated by the desire to identify counterfeit drugs, Ibrahim et al. (2020) used penalized quantile regression as a robust approach to model the amount of a certain chemical in a drug using spectroscopy data. Nonlinear or partially linear additive models with penalties have been used to simultaneously perform model selection and provide nonparametric estimates of conditional means. Examples include modeling stock returns given ﬁrm characteristics (Freyberger et al., 2020), predicting gene expression using DNA motifs (Lian et al., 2012), and constructing graphical models for frillice lettuce attributes and average environmental data during the cultivation period (Fujimoto et al., 2019). These are some examples of applications of penalized quantile regression and penalized additive models, but is by no means complete. In this paper we propose the penalized additive quantile model as a useful model for complex data. We demonstrate that this is a robust, theoretically sound model with few assumptions. To the best of our knowledge there are not many, or any, public applications of penalized additive nonlinear quantile models. However, given the ﬂexibility of additive nonlinear quantile regression, we believe this can be a useful

SHERWOOD AND MAIDMAN

tool for data analysis. To bridge the gap between theory and application, we discuss how to compute this model. In addition, our implementation is publicly available on CRAN (Sherwood and Maidman, 2020). Theoretical challenges include dealing with a nonsmooth loss function, a nonconvex penalty function, approximation of nonlinear functions, and the number of covariates increasing with the sample size. Our asymptotic results allow for qn to increase with n, which is challenging to deal with because both the number of predictors and basis functions increase to inﬁnity with n. In addition, previous work on deriving oracle results for high dimensional quantile regression estimators have used the fact that a quantile regression objective function with a SCAD or MCP penalty can be written as a diﬀerence of convex functions (Sherwood, 2016; Sherwood and Wang, 2016; Wang et al., 2012). The theoretical results depend on demonstrating that asymptotically the oracle estimators satisfy properties about local minimizers of diﬀerence of convex functions provided by Tao and An (1997). Our proofs are more akin to the general approach taken by Fan and Lv (2011), where we only use some very general conditions about the penalty function. Results from Fan and Lv (2011) were for likelihood based methods and assumed that the objective function was diﬀerentiable. In their proofs they used a Taylor approximation of the penalized objective function, which is not possible for the non-diﬀerentiable quantile objective function. In this paper we show that the approach of Fan and Lv (2011) can be extended to quantile regression by replacing the Taylor approximation with Knight s identity, a common tool for theoretical results about quantile regression (Knight, 1998; Koenker, 2005). It is worth noting that previous work for adaptive lasso quantile regression, which has a convex objective function, used Knight s identity when establishing oracle properties (Wang et al., 2007; Zheng et al., 2015). We believe the approach provided here will be useful for future theoretical results because working with Knight s identity is easier than dealing with the subdiﬀerential functions, which is required when using the properties of diﬀerence of convex functions. In addition, the results provided here are more general because they work with a large class of non-convex penalty functions and are not limited to the SCAD or MCP functions. In addition to being theoretically challenging, high-dimensional quantile regression is a challenging computational problem. Koenker and Bassett (1978) showed that quantile regression can be solved by linear programming and many have found that minimizing penalized quantile regression objective functions can be framed as linear programming problems (Belloni and Chernozhukov, 2011; Sherwood and Wang, 2016; Wang et al., 2012; Wu and Liu, 2009). However, recent work has shown that in high dimensions alternative approaches can sacriﬁce little in terms of accuracy, while providing large computational gains. Peng and Wang (2015) proposed a coordinate descent algorithm for quantile regression with a non-convex penalty. Gu et al. (2018) proposed an alternating direction method of multiplier (ADMM) algorithm for quantile regression with lasso, adaptive lasso or a folded concave penalty. Yu et al. (2017) proposed an ADMM algorithm for nonconvex penalized quantile regression that can be computed in parallel. Yi and Huang (2017) proposed semismooth Newton coordinate descent algorithm for elastic-net penalized quantile regression that approximates the quantile loss function with a Huber loss function, creates a strong rule for discarding covariates, and uses a coordinate descent algorithm to update the remaining coeﬃcients. None of the algorithms discussed use group penalties. We contribute to the

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

literature by proposing a coordinate descent algorithm for quantile regression with a nonconvex group penalty. Lv et al. (2018) also proposed a coordinate descent algorithm for penalized quantile regression. They approximated the quantile loss function with a smooth function and their penalty function is convex. In contrast, we do not approximate the quantile loss function and have a nonconvex penalty function. The rest of the article is organized as follows. In Section 2 we discuss estimating the additive quantile regression model, when the active covariates are known a priori. We refer to this model as the oracle model and asymptotic properties of the oracle model are presented in Section 2. In Section 3 we present a group nonconvex penalty and present a theorem demonstrating that under reasonable conditions the group penalized method is asymptotically equivalent to the oracle model. In Section 4 we propose our new algorithm. In Section 5 we compare the proposed method using Monte Carlo simulations and in Section 6 we implement the proposed method to model fat content of meat using a 100 channel spectrum of absorbances and model TRIM32 expression using other gene expression data from the eyes of 120 twelve-week old male rats. We conclude with a summary in Section 7.

2. Oracle Model

To estimate (1) we propose ﬁrst transforming the covariates using B-splines and then applying a group penalty method to simultaneously perform estimation and variable selection. In this section we assume that the active covariates are known a priori and thus are only estimating the model

Qy|z(τ) = g0(z) = α0 +

j=1 g0j(zj). (2)

The work in this section establishes convergence rates for the optimal local minimum of the penalized estimator. To estimate the nonlinear functions we use B-splines of order m + 1 (degree m) with kn quasi-uniform internal knots for (kn + m + 1) spline functions. Let Jn = kn + m, for j {1, . . . , pn} the jth covariate has Jn + 1 corresponding functions of [bj,0( ), . . . , bj,Jn( )] of order m + 1 with kn quasi-uniform internal knots on [0, 1] for a total of 2(m + 1) + kn knots, (tj, m, . . . , tj,kn+m+1). Deﬁne hj = max s |tj,s tj,s+1|, the largest distance between knots for the jth covariate, and h = max j hj, the largest

distance between knots for all covariates. A property of spline functions is that for any zij it follows that PJn s=0 bj,s(zij) = 1 and to avoid collinearity we drop the ﬁrst term when ﬁtting the model. See Schumaker (1981) for more details about the construction of B-splines. The ith observation of the jth covariate will have a corresponding vector of πj(zij) = [bj,1(zij), ..., bj,Jn(zij)] RJn. Deﬁne

ΠA(zi) = h 1, π1(zi1) , . . . , πqn(ziqn) i RJnqn+1,

as the B-splines vector of active covariates and

Π(zi) = h 1, π1(zi1) , . . . , πpn(zipn) i RJnpn+1,

SHERWOOD AND MAIDMAN

as the B-splines vector for all covariates. B-splines can be used to approximate smooth functions and the following deﬁnitions help provide a formal deﬁnition of the class of functions for g0(z).

Deﬁnition 1 Let r m + v, where m is a positive integer and v (0, 1]. Deﬁne Hr as the collection of functions h( ) on [0, 1] whose mth derivative h(m)( ) satisﬁes the H older condition of order v. That is, for any h( ) Hr, there exists some positive constant C such that h(m)(z ) h(m)(z) C z z v , 0 z , z 1. (3)

Deﬁnition 2 Given z = (z1, ..., zqn) , the function g(z) is said to belong to the class of functions Gr if it has the representation g(z) = α + Pqn j=1 gj(zj) where α R and for all j {1, . . . , qn}, gj Hr, E[gj(zj)] = 0 and E[gj(zj)2] < M, for some positive constant M.

Deﬁnition 3 Denote Gn as the space of additive functions spanned by [Π(zi)]n i=1.

B-splines can approximate any function h( ) Hr. That is, there exists γ0j RJn such that sup z [0,1]

g0j(z) πj(z) γ0j = O (k r n ) (Schumaker, 1981). Thus, a function g0( ) Gr

can be approximated by a function from Gn, speciﬁcally there exists γA0 = (α0, γ 01, . . . , γ 0qn) Rqn Jn+1 such that sup z [0,1]qn

g0(z) ΠA(z) γA0 = O qnk r n . (4)

The quantile loss function is deﬁned as ρτ(u) = u[τ I(u < 0)]. The oracle estimator only relies on the active covariates and is deﬁned as

ˆγA = argmin γA Rqn Jn+1

i=1 ρτ[yi ΠA(zi) γA]. (5)

The estimator of g0(zi) is ˆg(zi) = ΠA(zi) ˆγA. The proposed estimator is a robust estimator and will be robust to outliers in the response. Similar to a univariate estimate of a quantile, if values of {yi}n i=1 are changed but the signs of residuals remain the same then the estimator ˆg(zi) remains unchanged. See Theorem 2.4 and the discussion surrounding this theorem from Koenker (2005) for a detailed discussion of this property of quantile regression. The following conditions were used to prove the rate of convergence of n 1 Pn i=1[ˆg(zi) g0(zi)]2.

Condition 1 (Conditions on the random error) The random error ϵi has the conditional distribution function Fi( | zi), continuous conditional density function fi( | zi), and f i( | zi) is the derivative of the conditional density function. The density functions are uniformly bounded away from 0 and inﬁnity in a neighborhood of zero and there exists a positive constant cf such that |f i( | zi)| cf for all i {1, . . . , n}.

Condition 2 (Conditions on the covariates) Let zij [0, 1] for all i {1, . . . , n} and for all j {1, . . . , pn}. The joint density of the predictors is absolutely continuous and the density, fz(z), is bounded away from zero and inﬁnity by positive constants. In addition, deﬁne fzj(z) as the density function for the jth covariate. There exist positive constants c1 and c2 such that c1 < fzj(z) < c2 for all z [0, 1] and j {1, . . . , pn}.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Condition 3 (Conditions on the splines) There exists a positive constant c3 such that

max j {1,...,pn}

max s (tj,s+1 tj,s)

min s (tj,s+1 tj,s) c3.

For the internal knots kn (qnn)1/(2r+1), h (qnn) 1/(2r+1) and for all j {1, . . . , pn} hj

h 1, where a b means both a and b have the same order.

Condition 4 (Condition on the size of the model) For the active variables qn = o[log(n)].

Condition 5 (Condition on the unknown functions) For r = m + v > 3 we assume g0( ) Gr.

Condition 1 has been used for asymptotic results of a ﬁxed dimensional linear quantile regression model (Koenker, 2005) and is a weaker condition than the Gaussian or sub Gaussian conditions that are common for penalized, high-dimensional models (Negahban et al., 2012). Condition 1 reﬂects the robust properties of quantile regression as it does not assume that any moments exist for the distribution of the errors, and thus the results will hold for heavy tailed distributions that have no moments such as the Cauchy distribution. Under Condition 2 there is no collinearity between the predictors. Stone (1985) introduced Condition 2 to provide a lower bound for the standard deviation of the additive function. It is also used to provide a lower bound for the minimum eigenvalue for the covariance matrix of the B-splines transformation of the active predictors (Chen et al., 2018b; Zhou et al., 1998). In addition, standard B-spline results depend on the covariates having a bounded support and without loss of generality, Condition 2 assumes the support to be the interval [0, 1]. The assumption that the density functions have a common lower and upper bound is frequent in work involving splines because the bounds allow for a direct application of Theorem 5.4.2 from Devore and Lorentz (2005). Condition 3 assumes that the distance between the internal knots are not drastically diﬀerent, which holds in practice as long as the distributions of the covariates are not greatly skewed, and is a common assumption in work with splines (Huang, 1998a,b; Xue and Yang, 2006). In addition for ﬁxed qn, the rate for kn is equivalent to the optimal rate found in Stone (1985). Condition 4 governs the rate at which qn can increase with n. Though the rate is slow because kn also needs to increase with n, most work in additive models assume q is ﬁxed. The rate in Condition 4 is the same rate used by others that have considered an increasing number of true covariates when estimating additive models (Wang et al., 2014a). Condition 5 provides that only reasonably smooth functions can be estimated by the proposed method. The above conditions are used to prove the following theorem about the rate of convergence of ˆg( ). These conditions are suﬃcient for proving our results but are not necessarily the weakest conditions needed. Deﬁne ΠA = [ΠA(z1), . . . , ΠA(zn)] Rn qn Jn+1. Note that our conditions lack an explicit assumption about bounds on the eigenvalues for the sample covariance matrix of the active predictors, 1 n Pn i=1 ΠA(zi)ΠA(zi) , which is common in work that derives an oracle property for high-dimensional data (Fan and Lv, 2011; Loh and Wainwright, 2015; Wang et al., 2012; Zheng et al., 2015). Using the properties of B-splines and Conditions 1 - 3, the following lemma provides these bounds and insight into why the rate of qn provided in Condition 4 is so small.

SHERWOOD AND MAIDMAN

Lemma 4 Assume Conditions 1-3 hold. For a Rqn Jn+1 where ||a||2 = 1, there exist positive constants b1 > 0, B1 > 0 and δ (0, 1) with δqn = [(1 δ)/2]qn/2 such that for suﬃciently large n that b1δ2 qnk 1 n a 1

nΠ AΠAa B1qn.

Proof of Lemma 4 is provided in the Appendix. Stone (1985) ﬁrst introduced a lower bound that depended on a term similar to δqn and is very common in the additive model literature. If qn is ﬁxed then the term δqn is a constant and can be easily dealt within the asymptotic analysis. However, in the setting of qn increasing with n, the term has to be dealt with more care. Speciﬁcally, the proof of the the next theorem depends on n 1/2(qnkn)1/2δ 1 qn converging to zero and thus the need for Condition 4.

Theorem 5 If Conditions 1-5 hold, then

i=1 [ˆg(zi) g0(zi)]2 = OP qnkn/n + q2 nk 2r n .

Thus, under Condition 3 and for ﬁxed q, the estimator ˆg(zi) reaches the optimal rate of convergence found by Stone (1985) and extends to ﬁxed dimensional quantile regression additive models (De Gooijer and Zerom, 2003; He and Shi, 1994; Horowitz and Lee, 2005). To the best of our knowledge, this is the ﬁrst result that considers the rate of convergence for an additive quantile model with qn increasing with n. Proof of Theorem 5 is provided in the Appendix.

3. Variable Selection

In the previous section we established that the oracle estimator is an optimal estimator but it requires a priori knowledge about the covariates that may not be known in practice. Deﬁne γ = (α, γ 1 , . . . , γ pn) RJnpn+1, where γj RJn is the coeﬃcient vector for the B-spline functions of the jth covariate. For a vector a we deﬁne ||a||q as the Lq norm of a. To ﬁt a sparse model that accounts for the groups of spline functions, we propose the following objective function

i=1 ρτ[yi Π(zi) γ] +

j=1 pλ,a ||γj||1 . (6)

A group penalty is used to incorporate the group structure of the splines. Similar penalties have been used for mean additive models (Huang et al., 2010; Xue, 2009). Zhao and Lian (2016) consider a similar model for additive quantile regression, but use an L2 norm inside the penalty function and their theoretical results assume a ﬁxed q and p. The L1 norm is used instead of the L2 norm for computational convenience. The L1 norm ﬁts naturally with quantile regression and in the next section we discuss some of the computational conveniences it provides. Whether an L1 or L2 norm is used the oracle properties for group concave penalties are similar (Sherwood et al., 2020). Deﬁne the oracle

estimator as ˆγ = h ˆγ A, 0 Jn(pn qn) i , the estimator that only uses relevant groups. To derive an oracle property we use a general class of nonconvex functions for pλ and will prove that

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

with probability going to one that ˆγ is a local minimum of Q(γ). Two commonly used penalty functions are SCAD and MCP. For the SCAD penalty function

pλ(x) = λ|x|I(0 |x| < λ) + aλ|x| x2 + λ2 /2 a 1 I(λ |x| aλ)

2 I(|x| > aλ), for some a > 2,

and for the MCP penalty function,

pλ(x) = λ |x| x2

I(0 |x| < aλ) + aλ2

2 I (|x| aλ) , for some a > 1.

For both penalty functions, the tuning parameter λ controls the complexity of the selected model and goes to zero as n increases to inﬁnity. The other tuning parameter a controls how quickly p λ,a(x) goes to zero, but is considered ﬁxed in the asymptotics. Our model selection consistency proofs use the following conditions about the penalty function and the strength of signal from a group of B-spline coeﬃcients for an active covariate.

Condition 6 The function pλ,a(x) is increasing, concave, and has a continuous positive derivative, p λ,a(x) for x [0, ). Also, p λ,a(x) is increasing with respect to λ for λ > 0, p λ,a(0+) = λ and p λ,a(x) = 0 for |x| > aλ.

Condition 7 (Condition on the signal) There exist positive constants c4 and c5 such that 4/(2r + 1) < c4 < (2r 1)/(2r + 1) and n(1 c4)/2 min j {1,...,qn}||γ0j||1 c5.

Both the SCAD and MCP satisfy Condition 6, which is very similar to Condition 1 from Fan and Lv (2011). Condition 7 is a strength of signal condition that is very common in high-dimensional linear models, for instance see Kim et al. (2008), Kim et al. (2012) and Wang et al. (2012). The upper and lower bounds for c4 are sensible by Condition 5. Again, these conditions are suﬃcient for proving our results, but are not necessarily the weakest conditions needed.

Theorem 6 Assume Conditions 1 - 7 hold, λ = o n (1 c4)/2 , log(pn) = o(nλ2k 1 n ), n 1/2k2 n log(n) = o(λ) and nλ2 . Let Mn(λ) be the set of local minima of the penalized objective function Q(γ), deﬁned in (6), for tuning parameter value λ then

P [ˆγ Mn(λ)] 1.

The conditions on λ are satisﬁed for λ = Cn 1/2+b, where b 2 2r+1, c4

2 and C is any positive constant, where 2/(2r + 1) < c4/2 is guaranteed by Condition 7. The motivation for using concave penalties is that with probability approaching one, for the correct value of λ, the oracle estimator is a local minimum of the penalized objective function. Thus, the optimal value of λ needs to properly balance over-ﬁtting and under-ﬁtting the model. The upper bound depends on c4, which depends on the function that provides the smallest signal. Smaller values of c4 indicate a weaker minimum signal and thus smaller values

SHERWOOD AND MAIDMAN

of λ are needed to avoid under-ﬁtting. The lower bound depends on r and for smoother functions λ can be smaller. The intuition here is that for smoother functions it should be easier to separate the signal from the noise and thus smaller values of λ are needed. Finally, the oracle property holds for ultra-high dimensional predictors because the rates allow for pn = o exp nb 1/(2r+1) . Theorem 6 proves that with probability going to one the oracle estimator is a local minimizer of Q(γ), but provides no guarantee that the oracle estimator is the global minimizer. Nor does it provide any guarantees about other potential local minimizers. The next theorem provides a bound on the l2 norm of the diﬀerence between a suﬃciently sparse local minimizer and the oracle estimator. However, an additional condition is used for that proof.

Condition 8 If γ is a local minimizer of Q(γ), where || γ||0 = un then Pn i=1 I[yi = Π(zi) γ] = O(un) and there exists γ0 such that || γ γ0||2 = o P (1).

Condition 8 protects against pathological cases. It assumes that the local minimizer γ, converges in probability to some ﬁxed value γ0, but does not assume that γ0 is equal to γ0, the coeﬃcients that provide the best approximation to the unknown additive function. In addition, it provides a bound on the number of zero-valued residuals. Consider an unpenalized linear quantile regression estimator with p predictors. Of the n residuals corresponding to this estimator, with probability one there will be exactly p+1 zero-valued residuals if the errors have a density with respect to a Lebesgue measure. See section 2.2.2 of (Koenker, 2005) for a more detailed discussion of this topic. Using the same notation as Condition 8, let un be the number of nonzero coeﬃcients for a weighted lasso estimator. Then the weighted lasso quantile regression model will have at most un zero-valued residuals with probability one. This is because the weighted lasso quantile regression problem has a linear programming formulation similar to the standard quantile regression problem. Therefore, we believe the assumption is reasonable for the SCAD and MCP for two reasons. First, it will hold for unpenalized quantile regression and the motivation for both the SCAD and MCP is to approximate an unpenalized estimator. Second, it holds for the weighted lasso estimator which is the approximation we use for (6), see (7) in Section 4.

Theorem 7 Deﬁne γ as a local minimizer of Q(γ) and deﬁne E = {j {1, . . . , pn}||| γj|| = 0 or ||ˆγj|| = 0} as the set of groups that have a non-zero entry in γ or ˆγ. Let wn = |E| =

o[log(n)], assume Conditions 1-8 hold and that λ = n 1/2+b, where b 2 2r+1, c4

2 2r+1 < c4

|| γ ˆγ||2 = OP

log(n)δ 2 wnkn

wnkn + knwnn 1

Corollary 8 Under the conditions of Theorem 7 with λ = n 1/2+b where b 2 2r+1, r 1

then || γ ˆγ||2 = o P (1) and || γ γ0||2 = o P (1).

Corollary 8 provides that any suﬃciently sparse local minimizer of Q(γ) will be a consistent estimator.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

4. Algorithm

The objective function Q(γ) is non-convex and for high-dimensional data a grid search approach is not reasonable. Algorithms exist for ﬁnding estimators with good statistical properties for a wide class of nonconvex problems, but they assume the loss function is diﬀerentiable, which is not the case for quantile regression (Loh and Wainwright, 2015; Wang et al., 2014b). Zou and Li (2008) proposed a local linear approximation (LLA) that provides a convex approximation to a non-convex objective function. Let ˆγt j represent the estimate of γ0j at iteration t, with ˆγ0 = 0, then the LLA of Q(γ) is

i=1 ρτ[yi Π(zi) γ] +

j=1 p λ ||γt 1 j ||1 ||γj||1. (7)

For τ (0, 1), ρτ(u) + ρτ( u) = |u| and thus the above problem can be restated as a weighted quantile regression problem with augmented data (Sherwood, 2016; Sherwood and Wang, 2016; Wang et al., 2012). In this approach the ﬁnal estimates are derived once the estimates converge or a maximum number of iterations has been made. At each iteration minimizing (7) becomes a linear programming problem, however linear programming can be quite slow for high dimensional problems. If the traditional L2 norm was used then solving (7) becomes a second-order cone programming problem, but these tend to be even slower than linear programming problems. In addition, using the L1 norm allows us to build on existing computational approaches for penalized quantile regression. Peng and Wang (2015) proposed the quantile iterative coordinate descent (QICD) algorithm for solving (7) for the standard SCAD or MCP penalty, where γj is a scalar, which greatly reduces computational complexity without sacriﬁcing estimation in accuracy. We propose an extension of the QICD algorithm for the group penalty setting, where γj is a vector. The QICD algorithm is a two-step process that ﬁrst majorizes the objective function and then uses a coordinate descent algorithm to solve each iteration of the majorization step. The coordinate descent algorithm is responsible for faster convergence. The key diﬀerence is our algorithm includes an L1 grouping of coeﬃcients and to minimize (6), we modify the QICD algorithm to allow for group penalties. Let γ(k) js denote the value of γjs after the kth

iteration, k = 1, 2, . . . and γ(k) j = h γ(k) j1 , . . . , γ(k) Jn

i for j {1, . . . , pn} and s {1, . . . , Jn}.

Furthermore, let p λ(x+) be the limit of p λ(y) as y x from above. Then, in the kth iteration,

γj = p λ ||γ(k 1) j ||1+ Jn X

s=1 |γjs| p λ ||γ(k 1) j ||1+ Jn X

s=1 |γ(k 1) js |

+ pλ ||γ(k 1) j ||1 (8)

majorizes the penalty function pλ ||γj||1 for k {1, 2, . . . , K}, where K is a user deﬁned value for the maximum number of iterations, and j {1, . . . , pn}. More speciﬁcally, φγ(k 1) j

γj pλ ||γj||1 for all γj with equality when γj = γ(k 1) j . Thus, the objective

SHERWOOD AND MAIDMAN

function Q(γ) deﬁned in (6) is majorized by

Qγ(k 1)(γ) = 1

i=1 ρτ[yi Π(zi) γ] +

j=1 φγ(k 1) j

The majorizing function in (9) is similar to the majorizing function in Peng and Wang (2015). However, in our setting, coeﬃcients for spline functions associated with the same covariate all have the same weight p λ ||γ(k 1) j ||1+ . For each k = 1, 2, . . ., the update for γ is

γ(k) = arg min γ Qγ(k 1)(γ). (10)

This iteration scheme decreases the value of the objective function in (6) for each γ(k). Additionally, the solution to the original nonconvex minimization problem can now be found by solving a sequence of convex minimization problems. Coordinate descent can be used to solve the convex minimization problem in (10). In the following coordinate descent algorithm, each coeﬃcient γjs is updated one-at-a-time until convergence. For the dth iteration of the coordinate descent step and the kth iteration of the majorization step, let

ω(k)(d) js = γ(k)(d+1) 11 , . . . , γ(k)(d+1) js 1 , γ(k)(d) js , . . . , γ(k)(d) pn Jn

be the vector of coeﬃcients that contains updates for the ﬁrst js 1 coeﬃcients, but not the remaining ones. We update each coeﬃcient in the coordinate descent step as

γ(k)(d) js = arg min γjs Qγ(k 1) ω(k)(d) js . (11)

We omit the complete derivation of the coordinate descent algorithm as it is very similar to Peng and Wang (2015). The algorithm converges when for some speciﬁed tolerance ϵ, ||γk γk 1||2 < ϵ or k equals K, the maximum number of iterations allowed. In the following data analysis we used K = 20 and ϵ = .00001. It is important to have appropriate starting values for the algorithm to converge. We recommend using the estimates from lasso penalized quantile regression with the lasso penalty applied individually to each coeﬃcient (i.e., ignoring the group penalty) as the starting values for γ(0) js . The algorithm is implemented in the R package rq Pen (Sherwood and Maidman, 2020).

5. Simulations

We consider three diﬀerent simulation settings. In the ﬁrst setting the response is generated from an additive model where each function is nonlinear. The purpose of this setting is to demonstrate the eﬀectiveness of the proposed method compared to other approaches for modeling nonlinear functions. This setting also includes comparisons of the QICD algorithm to a linear programming approach. In the second setting we use the proposed

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

approach where it might not be optimal. This setting includes a linear model, partially linear model and a non-additive model. In this setting we compare the proposed approach to linear models to test if the proposed approach is competitive with simpler methods. In the last setting, to verify results of Theorem 6, we test the model selection properties of the proposed approach for varying values of n, qn, Jn and pn. In all settings the covariates are generated in two steps. For each observation a pdimensional vector is generated by x N(0p, Σp) with σjk being the entry for the jth row and kth column of Σp and σjk = .5|j k|. For a vector a = (a1, . . . , ap) Rp deﬁne Φ(a) = {Φ(a1), . . . , Φ(ap)} Rp, where Φ( ) is the normal CDF. Then the p-dimensional covariate vector is generated by z = Φ(x). In the ﬁrst two sections we consider a sample size of n = 500 and the number of covariates as p = 100, 300 or 600. More details about the third simulation will be provided later which includes diﬀerent values of n, pn, Jn and qn. For the ﬁrst two settings, models are ﬁt using 500 training samples. Then 1000 testing samples are generated from the same model. All models are ﬁt using B-splines with the training and testing covariates transformed using cubic B-splines with Jn = 3. Let y i and ˆy i (τ) represent the observed value and the predicted τth quantile for the ith testing sample, where the prediction comes from a model that was ﬁt only using the training data. A covariate is considered selected if any of its corresponding spline coeﬃcients are non-zero. Models are compared using the following criteria.

1. Mean squared prediction error (MSPE), 1 1000 P1000 i=1 [y i ˆy i (τ)]2.

2. Mean absolute prediction error (MAPE), 1 1000 P1000 i=1 |y i ˆy i (τ)|.

3. Mean check prediction error (MCPE), 1 1000 P1000 i=1 ρτ[y i ˆy i (τ)].

4. True positives (TP), the number of active covariates selected.

5. False positives (FP), the number of nonactive covariates selected.

6. Proportion smaller (PS), the proportion of testing responses smaller than their predicted value.

For consistent methods the value of PS should be close to τ. When modeling the median, MAPE and MCPE diﬀer only by a multiple of 2. Thus, we only report MCPE in settings where we ﬁt models for non-median quantiles. In the ﬁrst two settings 100 replications are run for each simulation setting, while in the last setting 50 replications are run for each setting.

Setting I: Additive Model

In Setting I we consider the proposed quantile additive model where pλ,a( ) is the SCAD penalty function. We implement both the coordinate descent (QA-SCAD CD) and linear programing (QA-SCAD LP) algorithms. We compare the method to the quantile additive model with the lasso penalty (QA-LASSO), QA-LASSO minimizes (6) with pλ,a(x) = λ|x|. In addition, we consider the mean additive model with the group SCAD (MA-SCAD) and

SHERWOOD AND MAIDMAN

group lasso (MA-LASSO) penalty. The mean regression methods use the same B-spline transformation and minimize

i=1 [yi Π(zi) γ]2 +

j=1 pλ,a ||γj||2 , (12)

where for MA-LASSO pλ,a(x) = λ|x| and for MA-SCAD pλ,a( ) is the SCAD penalty function. For both MA-SCAD and QA-SCAD we set a = 3.7. Let, γλ be the coeﬃcient vector for a given value of λ and qλ be the number of nonzero coeﬃcients. For the quantile regression methods λ is selected by minimizing,

i=1 ρτ[yi Π(zi) γλ]

+ qλ log(n)

Let ℓ(γ) represent the Gaussian log-likelihood evaluated at γ. For the mean regression methods λ is selected by minimizing

2ℓ( γλ) + qλ log(n).

The quantile and mean regression models are ﬁt using the R packages rq Pen (Sherwood and Maidman, 2020) and grpreg (Breheny and Zeng, 2017), respectively. Theoretically, using BIC may not be optimal for high-dimensional variables. There exist challenges to demonstrating that BIC will select the true model when the number of predictors grows with the sample size that remain unsolved (Wang et al., 2009a; Lee et al., 2014). For additive quantile regression models, Lee et al. (2014) suggested replacing qλ log(n)

2n with Cn qλ log(n)

2n , where Cn , and provide theoretical justiﬁcations. The R package rq Pen allows for implementing this high dimensional BIC. We used the standard BIC as our preliminary results found that approach to work better, but the approach of Lee et al. (2014) has been shown to be superior in other settings. The response is generated under three diﬀerent settings. For the ﬁrst two settings we consider the model

y = 1 + 2z3 1 + sin(2πz2) + 8(z3 .5)2 + ϵ, (13)

with homoscedastic errors of ϵ N(0, 1) (Setting IA) or ϵ T3 (Setting IB). In the third setting we consider the following heteroscedastic errors model

y = sin(2πz2) + 8(z3 .5)2 + (.5 + z3 1)ϵ, (14)

with ϵ N(0, 1) (Setting IC). For Settings IA and IB the methods estimate the median, τ = .5, while in Setting IC the methods estimate the .9 quantile, τ = .9. The quantile regression methods can directly model the .9 quantile, but the mean regression methods do not directly provide non-median estimates. To estimate ˆy i (τ) for the mean regression methods we propose a naive estimate of the conditional quantile based on estimation of the conditional quantile in the linear mean model when the error terms are normally distributed and p is small. Deﬁne Y = (y1, . . . , yn) Rn as the vector of observed responses. For an estimate ˆγ,

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

deﬁne ˆy(ˆγ) Rn as the vector of ﬁtted values, q(ˆγ) as the number of nonzero entries in ˆγ, Π(zi, ˆγ) Rq( ˆγ) as the vector of basis functions that have non-zero coeﬃcients in ˆγ and ˆσ(ˆγ) = 1

Y ˆY(ˆγ) 2. Let t d,τ be the τ quantile of a T-distribution with d

degrees of freedom. For a vector of covariates z i with estimate of the conditional mean, ˆy i , the naive estimate of the τth conditional quantile is

ˆy i (τ) = ˆy i + t n q( ˆγ),τ ˆσ(ˆγ)

v u u t1 + Π(z i , ˆγ) " n X

i=1 Π(zi, ˆγ)Π(zi, ˆγ) # 1 Π(z i , ˆγ). (15)

The above estimator is the standard prediction interval estimator of a conditional quantile using ordinary least squares. If the errors are i.i.d. and normally distributed then the estimator is the MLE for the conditional quantile. However, it is naive because it is a ﬁxed dimensional solution to a high-dimensional problem. Even in the ﬁxed dimension setting it will be an inconsistent estimate if the errors are not normally distributed or if there is nonconstant variance. The means (and standard deviations) across the 100 simulations for the previously deﬁned six summary statistics are provided in Tables 1-3 for Settings IA-IC. For p = 100 and p = 300 the tables include two versions of QA-SCAD. The linear programming (LP) approach to solving (7), using the Barrodale and Roberts (1974) algorithm for regression quantiles (Koenker and D Orey, 1987, 1994), and the coordinate descent (CD) approach described in Section 4. The summary statistics between the two algorithms are almost identical in Settings IA and IB, except that the coordinate descent approach tends to select more false positives. In Setting IC the LP approach provides better results. Figure 1 compares the computational speed for the two algorithms, in the diﬀerent settings when p = 100 or p = 300. For Settings IA and IB the QICD algorithm is noticeably faster for p = 100 or p = 300. For Setting IC, the QICD algorithm is slower at p = 100, but faster at p = 300. For p = 600 only the QICD algorithm was used, due to the excessive computational time of the linear programming approach. The rest of the simulations only consider the QICD algorithm because of the computational advantages over the linear programming approach. Results in Table 1 show that the group SCAD approaches are ﬁtting smaller models that all contain the true covariates and are doing a better job in terms of prediction. For the diﬀerent values of p the MA-SCAD approach does the best in terms of prediction error, but this is not surprising as we expect a method using a squared error loss function to perform well when the errors are homoscedastic and normally distributed. In Setting IB, presented in Table 2, the QA-SCAD methods perform the best in terms of model selection and prediction accuracy. The linear mean methods are selecting more false positives because of the extra noise from the heavier tailed error distribution, T3. For Setting IC, presented in Table 3, the largest diﬀerence can be seen in terms of TP. In this model z1 is only an active variable for τ = .5 and thus the mean regression methods do not consistently select this variable. Also, the PS results are slightly better for QA-SCAD approaches than the mean regression methods, which we expect because the mean regression methods for estimating the .9 quantile do not correctly account for the nonconstant variance. In all the results we see the SCAD methods picking smaller models than their LASSO counterparts, but still getting accurate results in terms of selecting the correct number of active covariates.

SHERWOOD AND MAIDMAN

Setting I p=300 Setting II p=300 Setting III p=300

Setting I p=100 Setting II p=100 Setting III p=100

QICD LP QICD LP QICD LP

QICD LP QICD LP QICD LP 0

Time (Seconds)

Figure 1: Computation Time comparison of coordinate descent method (QICD) and linear programming (LP) for 100 simulations with p = 100.

Method p MSPE MAPE TP FP PS MA-LASSO 100 1.08 (0.06) 0.83 (0.02) 3 (0) 9.16 (2.4) 0.5 (0.02) MA-SCAD 100 1.03 (0.05) 0.81 (0.02) 3 (0) 2.38 (2.48) 0.5 (0.02) QA-LASSO 100 1.2 (0.09) 0.88 (0.03) 3 (0) 11.79 (5.45) 0.5 (0.03) QA-SCAD CD 100 1.05 (0.06) 0.82 (0.02) 3 (0) 1.23 (0.69) 0.5 (0.03) QA-SCAD LP 100 1.04 (0.06) 0.82 (0.02) 3 (0) 1.23 (0.78) 0.5 (0.03) MA-LASSO 300 1.09 (0.06) 0.83 (0.02) 3 (0) 18.14 (2.67) 0.51 (0.02) MA-SCAD 300 1.03 (0.05) 0.81 (0.02) 3 (0) 10.29 (3.15) 0.51 (0.02) QA-LASSO 300 1.29 (0.1) 0.91 (0.04) 3 (0) 14.17 (6.91) 0.5 (0.03) QA-SCAD CD 300 1.05 (0.05) 0.82 (0.02) 3 (0) 2.01 (1.55) 0.51 (0.03) QA-SCAD LP 300 1.04 (0.05) 0.81 (0.02) 3 (0) 1.6 (1.34) 0.51 (0.03) MA-LASSO 600 1.11 (0.06) 0.84 (0.02) 3 (0) 28.06 (4.23) 0.5 (0.02) MA-SCAD 600 1.04 (0.05) 0.81 (0.02) 3 (0) 20 (4.89) 0.5 (0.02) QA-LASSO 600 1.43 (0.13) 0.95 (0.04) 3 (0) 13.58 (8.26) 0.49 (0.03) QA-SCAD CD 600 1.07 (0.05) 0.82 (0.02) 3 (0) 2.7 (1.9) 0.5 (0.02)

Table 1: Simulation results for homoscedastic N(0,1) errors (Setting IA)

Setting II: Non-optimal Models

The previous section demonstrated the computational advantages of the CD algorithm so for this setting we only consider the coordinate descent implementation of the proposed approach (QA-SCAD CD). In this section the response is generated from a model where

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Method p MSPE MAPE TP FP PS MA-LASSO 100 3.19 (2.2) 1.18 (0.06) 2.99 (0.1) 7.66 (2.12) 0.51 (0.03) MA-SCAD 100 3.1 (2.19) 1.15 (0.06) 2.99 (0.1) 6.76 (2.92) 0.5 (0.03) QA-LASSO 100 3.28 (2.2) 1.21 (0.06) 3 (0) 8.48 (4.47) 0.51 (0.03) QA-SCAD CD 100 3.03 (2.19) 1.12 (0.06) 3 (0) 1.59 (1.22) 0.5 (0.03) QA-SCAD LP 100 3.02 (2.19) 1.12 (0.06) 3 (0) 1.25 (0.77) 0.51 (0.03) MA-LASSO 300 3.05 (0.64) 1.18 (0.06) 3 (0) 15.74 (2.8) 0.5 (0.03) MA-SCAD 300 2.97 (0.65) 1.15 (0.06) 3 (0) 15.16 (4.09) 0.5 (0.03) QA-LASSO 300 3.31 (0.65) 1.26 (0.07) 2.96 (0.24) 7.24 (5.32) 0.51 (0.03) QA-SCAD CD 300 2.89 (0.62) 1.13 (0.05) 3 (0) 2.52 (2.28) 0.5 (0.03) QA-SCAD LP 300 2.88 (0.63) 1.12 (0.06) 3 (0) 2.04 (1.77) 0.5 (0.03) MA-LASSO 600 3.16 (0.96) 1.19 (0.07) 2.98 (0.14) 25.96 (4.69) 0.5 (0.03) MA-SCAD 600 4.99 (5.42) 1.45 (0.73) 2.98 (0.14) 40.37 (35.56) 0.5 (0.03) QA-LASSO 600 3.52 (0.95) 1.3 (0.08) 2.82 (0.5) 5.67 (5.68) 0.5 (0.03) QA-SCAD CD 600 2.99 (0.95) 1.14 (0.07) 3 (0) 3.32 (3) 0.5 (0.03)

Table 2: Simulation results for homoscedastic T3 errors (Setting IB)

Method p MSPE MAPE MCPE TP FP PS MA-LASSO 100 1.88 (0.14) 1.17 (0.05) 0.15 (0.01) 2.22 (0.42) 6.86 (2.19) 0.92 (0.01) MA-SCAD 100 1.78 (0.13) 1.14 (0.05) 0.15 (0.01) 2.04 (0.2) 2.08 (2.53) 0.92 (0.01) QA-LASSO 100 1.85 (0.36) 1.09 (0.12) 0.16 (0.01) 2.97 (0.17) 28.88 (16.28) 0.86 (0.03) QA-SCAD CD 100 1.89 (0.35) 1.1 (0.11) 0.15 (0.01) 2.92 (0.27) 2.9 (2.78) 0.89 (0.02) QA-SCAD LP 100 1.76 (0.21) 1.05 (0.06) 0.14 (0.01) 3 (0) 2.32 (2.19) 0.89 (0.02) MA-LASSO 300 1.97 (0.17) 1.21 (0.06) 0.15 (0.01) 2.17 (0.38) 14.92 (2.98) 0.93 (0.01) MA-SCAD 300 1.85 (0.15) 1.17 (0.06) 0.15 (0.01) 2.1 (0.3) 9.93 (3.78) 0.93 (0.01) QA-LASSO 300 1.9 (0.25) 1.11 (0.09) 0.16 (0.01) 3 (0) 45.61 (12.11) 0.86 (0.02) QA-SCAD CD 300 2.05 (0.49) 1.15 (0.14) 0.15 (0.02) 2.83 (0.4) 4.13 (4.39) 0.89 (0.02) QA-SCAD LP 300 1.75 (0.19) 1.04 (0.06) 0.14 (0.01) 3 (0) 6.13 (5.65) 0.88 (0.02) MA-LASSO 600 2.06 (0.18) 1.24 (0.06) 0.15 (0.01) 2.12 (0.33) 25.23 (4.49) 0.93 (0.01) MA-SCAD 600 1.92 (0.17) 1.2 (0.06) 0.15 (0.01) 2.08 (0.27) 19.94 (4.55) 0.93 (0.01) QA-LASSO 600 2 (0.3) 1.14 (0.1) 0.18 (0.01) 3 (0) 54.23 (16.41) 0.85 (0.02) QA-SCAD CD 600 2.06 (0.52) 1.16 (0.16) 0.16 (0.02) 2.74 (0.46) 5.87 (9.1) 0.88 (0.02)

Table 3: Simulation results for heteroscedastic errors (Setting IC)

the proposed approach is not optimal, either because it is too complex or not complex enough. The settings are

Setting IIA (linear model) y = 1 + z1 z2 + 3z3 + ϵ;

Setting IIB (partially linear model) y = 1 + z1 + 2 sin(4πz2) + 2(z3 .5)3 + ϵ;

Setting IIC (nonadditive model) y = 1 + 2z3 1[sin(2πz2) + 1] + 8(z3 .5)2 + ϵ.

In each setting ϵ T3. The QA-SCAD CD model is compared to simpler linear models. We consider linear mean and quantile regression with the SCAD (ML-SCAD, QL-SCAD)

SHERWOOD AND MAIDMAN

and lasso (ML-LASSO, QL-LASSO) penalty. For the linear models the objective functions are 1 n

i=1 mτ(yi z i β) +

j=1 pλ,a (|βj|) . (16)

Where mτ(x) = x2 for the mean models and mτ(x) = ρτ(x) for the quantile regression models and pλ,a = λ|x| for lasso and for SCAD pλ,a( ) is the SCAD penalty function. For these simulations we consider the case of τ = .5. For all models λ is selected using BIC similar to what was described in the previous section and for the SCAD penalties a is ﬁxed to 3.7. For Setting IIB Jn was set to 5, while in all other settings we ﬁxed Jn = 3. The ML-LASSO and ML-SCAD models were ﬁt using glmnet (Friedman et al., 2008) and ncvreg (Breheny and Huang, 2011), respectively. All the quantile regression models were ﬁt using rq Pen (Sherwood and Maidman, 2020). Results for the simulations are reported in Tables 4-6. The QA-SCAD approach is competitive for both the linear and partially linear setting. In setting IIB it does best at selecting the true variables when p is large, p {300, 600}. For Setting IIC it dominates with respect to all metrics, except PS, which all do well at, and FP, where QL-LASSO does the best. This demonstrates that the proposed additive model has beneﬁts compared to simpler models even when the true model is not additive.

Method p MSPE MAPE TP FP PS ML-LASSO 100 3.09(2.21) 1.14(0.06) 1.73(0.62) 1.84(2.42) 0.5(0.03) ML-SCAD 100 3.06(2.21) 1.13(0.05) 1.66(0.83) 2.25(3.85) 0.5(0.03) QA-SCAD CD 100 3.07(2.21) 1.14(0.06) 2.21(0.41) 1.06(0.34) 0.51(0.03) QL-LASSO 100 3.08(2.21) 1.14(0.05) 1.71(0.61) 1.88(1.7) 0.5(0.03) QL-SCAD 100 3.05(2.2) 1.13(0.05) 2.42(0.81) 2.97(1.76) 0.51(0.03) ML-LASSO 300 2.95(0.63) 1.14(0.05) 1.51(0.52) 2.36(3.61) 0.5(0.03) ML-SCAD 300 2.91(0.64) 1.13(0.06) 1.39(0.58) 1.87(3.89) 0.5(0.03) QA-SCAD CD 300 2.91(0.63) 1.13(0.05) 2.3(0.46) 1.3(0.96) 0.5(0.03) QL-LASSO 300 2.93(0.63) 1.14(0.05) 1.7(0.48) 1.98(1.28) 0.5(0.03) QL-SCAD 300 2.9(0.63) 1.13(0.05) 1.9(0.72) 2.05(1.31) 0.5(0.03) ML-LASSO 600 3.05(0.81) 1.16(0.05) 1.5(0.51) 1.84(2.85) 0.5(0.03) ML-SCAD 600 3.01(0.82) 1.14(0.05) 1.16(0.37) 0.94(1.73) 0.5(0.04) QA-SCAD CD 600 3(0.84) 1.14(0.05) 2.26(0.44) 1.12(0.48) 0.5(0.03) QL-LASSO 600 3.02(0.82) 1.15(0.05) 1.7(0.51) 2.7(1.81) 0.5(0.03) QL-SCAD 600 2.99(0.82) 1.14(0.05) 1.22(0.46) 1.04(0.2) 0.5(0.03)

Table 4: Simulation results for Setting IIA (linear model).

Setting III: Model Selection Performance

To validate the results of Theorem 6 we consider the model selection performance of QASCAD CD for qn, Jn, and pn increasing with n. The responses in this section are generated from the model

y = 1 + 2z3 1 + sin(2πz2)I(qn > 1) + 8(z3 .5)2I(qn > 2) + 2z3 11I(qn > 3)

+ sin(2πz12)I(qn > 4) + 8(z13 .5)2I(qn = 6) + ϵ,

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Method p MSPE MAPE TP FP PS ML-LASSO 100 3.09(2.21) 1.14(0.06) 1.73(0.62) 1.84(2.42) 0.5(0.03) ML-SCAD 100 3.06(2.21) 1.13(0.05) 1.66(0.83) 2.25(3.85) 0.5(0.03) QA-SCAD CD 100 3.07(2.21) 1.14(0.06) 2.21(0.41) 1.06(0.34) 0.51(0.03) QL-LASSO 100 3.08(2.21) 1.14(0.05) 1.71(0.61) 1.88(1.7) 0.5(0.03) QL-SCAD 100 3.05(2.2) 1.13(0.05) 2.42(0.81) 2.97(1.76) 0.51(0.03) ML-LASSO 300 2.95(0.63) 1.14(0.05) 1.51(0.52) 2.36(3.61) 0.5(0.03) ML-SCAD 300 2.91(0.64) 1.13(0.06) 1.39(0.58) 1.87(3.89) 0.5(0.03) QA-SCAD CD 300 2.91(0.63) 1.13(0.05) 2.3(0.46) 1.3(0.96) 0.5(0.03) QL-LASSO 300 2.93(0.63) 1.14(0.05) 1.7(0.48) 1.98(1.28) 0.5(0.03) QL-SCAD 300 2.9(0.63) 1.13(0.05) 1.9(0.72) 2.05(1.31) 0.5(0.03) ML-LASSO 600 3.05(0.81) 1.16(0.05) 1.5(0.51) 1.84(2.85) 0.5(0.03) ML-SCAD 600 3.01(0.82) 1.14(0.05) 1.16(0.37) 0.94(1.73) 0.5(0.04) QA-SCAD CD 600 3(0.84) 1.14(0.05) 2.26(0.44) 1.12(0.48) 0.5(0.03) QL-LASSO 600 3.02(0.82) 1.15(0.05) 1.7(0.51) 2.7(1.81) 0.5(0.03) QL-SCAD 600 2.99(0.82) 1.14(0.05) 1.22(0.46) 1.04(0.2) 0.5(0.03)

Table 5: Simulation results for Setting IIB (partially linear model).

Method p MSPE MAPE TP FP PS ML-LASSO 100 3.61(2.2) 1.3(0.05) 1.12(0.57) 1.67(2.58) 0.52(0.03) ML-SCAD 100 3.6(2.19) 1.3(0.05) 1.26(0.65) 3.02(4.3) 0.52(0.03) QA-SCAD CD 100 3.21(2.2) 1.18(0.06) 2.95(0.22) 1.51(1.11) 0.5(0.03) QL-LASSO 100 3.63(2.21) 1.31(0.06) 1.86(0.49) 0.49(1) 0.51(0.03) QL-SCAD 100 3.62(2.19) 1.31(0.06) 2.51(0.52) 2.01(1.85) 0.51(0.03) ML-LASSO 300 3.48(0.62) 1.3(0.05) 0.79(0.43) 1.56(2.76) 0.52(0.03) ML-SCAD 300 3.47(0.62) 1.3(0.05) 0.85(0.54) 2.18(3.84) 0.52(0.03) QA-SCAD CD 300 3.07(0.63) 1.18(0.06) 2.88(0.33) 1.84(1.72) 0.5(0.03) QL-LASSO 300 3.48(0.62) 1.31(0.05) 1.67(0.47) 0.5(0.93) 0.51(0.03) QL-SCAD 300 3.54(0.63) 1.32(0.06) 2.38(0.58) 4.38(3.41) 0.5(0.03) ML-LASSO 600 3.55(0.93) 1.31(0.06) 0.86(0.45) 2.22(3.98) 0.51(0.03) ML-SCAD 600 3.55(0.93) 1.31(0.06) 0.85(0.48) 2.36(3.95) 0.51(0.03) QA-SCAD CD 600 3.19(0.93) 1.2(0.07) 2.82(0.41) 2.73(3.14) 0.5(0.03) QL-LASSO 600 3.56(0.92) 1.31(0.06) 1.59(0.49) 0.41(1.2) 0.5(0.03) QL-SCAD 600 3.53(0.93) 1.3(0.06) 2.01(0.44) 1.93(2.84) 0.5(0.03)

Table 6: Simulation results for Setting IIC (nonadditive model).

where ϵ N(0, 1). We ﬁt the QA-SCAD model in 3 diﬀerent settings where Jn, qn, or pn vary. In all the settings we ﬁt models with sample size of 100, 300, 600 and 1000. In Setting IIIA we ﬁt the model with Jn {3, 4, 5}. In setting IIIB models are ﬁt with qn {1, 2, 3, 4, 5, 6}. Finally, in Setting IIIC models are ﬁt with pn {100, 300, 500, 1000, 2000}. When they are not varying we ﬁx Jn = 3, qn = 3 and pn = 300. For instance, in Setting IIIA we ﬁx qn = 3 and pn = 300 and consider performance of QA-SCAD CD with diﬀerent values of Jn and n. The purpose of these simulations is to corroborate the model selection

SHERWOOD AND MAIDMAN

200 400 600 800 1000

0.2 0.4 0.6 0.8 1.0

True Positive Rate by n

True Positive Rate

Jn 3 Jn 4 Jn 5

200 400 600 800 1000

0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014

False Positive Rate by n

False Positive Rate

Jn 3 Jn 4 Jn 5

Figure 2: True positive and false positive rates by n with varying Jn from Setting IIIA

properties presented in Theorem 6 and thus we only consider the QA-SCAD CD model. In addition, performance is evaluated with respect to the TP and FP rates, number of true or false covariates selected divided by the number of true or false potential covariates, to account for the fact that qn and pn can vary. Again to duplicate the settings of Theorem 6 we ﬁx λ = n 1/10/4 which for the given additive functions satisﬁes the potential valid choices of λ outlined after Theorem 6. Figures 2-5 present how the average true positive and false positive rates vary with n in the diﬀerent settings across the 50 replications. Figure 2 presents how the false and true positives vary with n and Jn. The relationship between Jn and correct model selection is not straightforward. From a theoretical perspective larger values of Jn will provide a better approximation of the functions and thus we could reason that for larger value of Jn the true positive rate should go up and false positive rate should go down. However, from a practical perspective larger values of Jn cause the size of the grouped coeﬃcients to increase where the intuition is this could lead to an increase in the false positive rate. Figure 2 reﬂects some of this uncertainty, but also shows that as n increases, no matter the value of Jn, the active covariates tend to be selected and the inactive covariates tend to be dropped. Figure 3 demonstrates that the larger the value of qn the harder it is to select the correct variables. We also see that for large values of n that, no matter the value of qn, with high probability the active covariates are selected and the inactive covariates are discarded. Figures 4 and 5 present the false and true positives, respectively, as functions of n and n/ log(pn). The n/ log(pn) is used to verify that pn can grow exponentially with n. Both ﬁgures demonstrate that settings with large values of pn are doing worse for smaller n, but for the larger values of n the true covariates tend to be selected and the noise covariates are removed from the models.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

200 400 600 800 1000

0.2 0.4 0.6 0.8 1.0

True Positive Rate by n

True Positive Rate

qn 1 qn 2 qn 3 qn 4 qn 5 qn 6

200 400 600 800 1000

0.005 0.010 0.015

False Positive Rate by n

False Positive Rate

qn 1 qn 2 qn 3 qn 4 qn 5 qn 6

Figure 3: True positive and false positive rates by n with varying qn from Setting IIIB

200 400 600 800 1000

0.2 0.4 0.6 0.8 1.0

True Positive Rate by n

True Positive Rate

pn 100 pn 300 pn 500 pn 1000 pn 2000

50 100 150 200

0.2 0.4 0.6 0.8 1.0

True Positive Rate by n/log(pn)

True Positive Rate

pn 100 pn 300 pn 500 pn 1000 pn 2000

Figure 4: True positive rates by n and n/log(pn) for varying pn from Setting IIIC

SHERWOOD AND MAIDMAN

200 400 600 800 1000

0.000 0.005 0.010 0.015 0.020

False Positive Rate by n

False Positive Rate

pn 100 pn 300 pn 500 pn 1000 pn 2000

50 100 150 200

0.000 0.005 0.010 0.015 0.020

False Positive Rate by n/log(pn)

False Positive Rate

pn 100 pn 300 pn 500 pn 1000 pn 2000

Figure 5: False positive rates by n and n/log(pn) for varying pn from Setting IIIC

6. Data Analysis

6.1 Fat content of ground pork

Borggaard and Thodberg (1992) measured the fat content and 100 channel spectrum of absorbances from 240 ground pork meat samples. Our analysis is limited to the 215 samples available from the R package faraway (Faraway, 2016) and we removed 5 observations with outlying values leaving us with 210 samples. We analyze this data using the QA-SCAD model and consider both the CD and LP algorithms. We compare these approaches to three other models: (1) MA-SCAD discussed in simulation Setting I; (2) ML-SCAD discussed in simulation Setting II; and (3) QL-SCAD discussed in simulation Setting II. For the linear models no B-spline transformation is done and each coeﬃcient is penalized individually using the SCAD penalty. For the nonlinear additive models a B-spline transformation is used and group penalties are used for coeﬃcients. The channel spectrum is scaled and centered to have a mean of zero and a standard deviation of one. The fat content data is ﬁrst log transformed and then scaled and centered to have a mean of zero and a standard deviation of one. To compare the methods we randomly sample 200 of the 210 samples as training data and the other 10 samples are used as testing data. The channel spectrum data is highly correlated. Following an approach similar to the one outlined in Meier et al. (2009), we transform the predictors by using the ﬁrst 30 principal components. The principal components are centered and scaled to have mean zero and a standard deviation of one. The ﬁve models are ﬁt using the 30 principal components as covariates to model log of fat content. For the nonlinear models the principal components are transformed using cubic B-splines with Jn = 3.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Fitting a quantile regression model requires a choice of τ. Choosing τ = .5 provides a robust estimate of central tendencies. Two values of τ can be used to create a prediction interval that does not require a parametric assumption about the errors. For instance, models with τ = .1 and τ = .9 can be used to create an 80% prediction interval. If the whole conditional distribution is of interest then a wide range values of τ could be used. To estimate the whole conditional distribution, and test the proposed method for several values of τ, models are ﬁt for values of τ T {.1, .2, .3, .4, .5, .6, .7, .8, .9}. The tuning parameter λ is selected using BIC, as outlined in the previous section, and we set a = 3.7. The mean models estimate the conditional quantiles using the naive procedure outlined in the previous section. The testing covariates are transformed using the rotation deﬁned by the ﬁrst 30 principal components from the testing data. The 30 covariates are then centered and scaled by the sample mean and standard deviation from the testing data. For the nonlinear models the testing covariates are transformed by the B-spline functions used on the training data. Using this transformed data, predictions from the six methods are made for the log fat content. This process is repeated 100 times. Let yij represent the scaled and centered log fat content of the ith sample from the jth testing data set and ˆyτ ij represent its corresponding estimate for the τth quantile. Let I(a b) take a value of one if a b and zero otherwise. The models are compared using

1. MSPE, 1 10 P10 i=1(yij ˆy.5 ij)2.

2. MAPE, 1 10 P10 i=1 |yij ˆy.5 ij|.

3. MCPE, 1 10 P τ T P10 i=1 ρτ(yij ˆyτ ij).

4. Quantile Bias (QB), P τ T 1 1000 P100 j=1 P10 i=1 I(yij ˆyτ ij) τ .

Methods that correctly model the τth quantile will have

i=1 I(yij ˆyτ ij) τ.

Thus, QB is providing a summary of how accurate the conditional quantile estimates are across all partitions and all values of τ. The statistic QB and means (and standard deviations) of the other three statistics are reported in Table 7. In this data set we are comparing the linear and nonlinear approaches to see if there is justiﬁcation for ﬁtting the more complex nonlinear models. In addition, quantile and mean models are compared to see if the quantile models are providing a better description of the conditional distribution. One of the two nonlinear quantile algorithms has the best average results for MAPE, MCPE and QB. For MSPE the linear quantile approach does the best and the linear mean approach also does better than the nonlinear quantile approach. The superiority of the nonlinear quantile approach in terms of MCPE and QB suggests that the more complex nonlinear quantile models are providing useful predictions for non-central tendencies. Performance of the CD and LP algorithms is similar.

SHERWOOD AND MAIDMAN

Method MSPE MAPE MCPE QB ML-SCAD 0.66(0.41) 0.59(0.19) 0.24(0.07) 0.37 QL-SCAD 0.65(0.41) 0.58(0.19) 0.24(0.07) 0.32 MA-SCAD 0.96(1.56) 0.66(0.25) 0.28(0.1) 0.41 QA-SCAD CD 0.7(0.68) 0.58(0.22) 0.24(0.07) 0.21 QA-SCAD LP 0.71(0.61) 0.57(0.22) 0.23(0.08) 0.22

Table 7: Means (and standard deviations) of statistics from the Monte Carlo randomization results for the ground pork data.

6.2 Modeling TRIM32 expression levels

The previous example provides some evidence that nonlinear quantile regression can provide a less biased estimate of a conditional quantile, but does not demonstrate a dramatic diﬀerence between linear and nonlinear quantile regression in terms of prediction accuracy. This section presents an example where the additive quantile regression model outperforms the linear quantile regression model in terms of prediction accuracy. Huang et al. (2010) presented an analysis of modeling high-dimensional genomics data, from Scheetz et al. (2006), where a nonlinear additive mean model improved upon the prediction performance of a linear mean model. We consider the same data set for modeling the conditional median. Scheetz et al. (2006) used 31,042 diﬀerent probe sets to analyze RNA from the eyes of 120 twelve-week old male rats. Similar to Huang et al. (2010) we model the expression of gene TRIM32, because Chiang et al. (2006) identiﬁed TRIM32 as a Bardet-Biedl syndrome gene and one symptom of Bardet-Biedl syndrome is retinal degeneration. Scheetz et al. (2006) note that many of the probes were not expressed in the eye. Thus, following Huang et al. (2010) we limit our analysis to the 500 genes that have the highest absolute Pearson s correlation with the TRIM32 expression. To demonstrate that this is a setting where the nonlinear quantile model improves prediction accuracy over the linear counterpart we consider the QL-SCAD and QA-SCAD CD approaches using Monte Carlo randomization. All variables are log transformed and the predictors are further transformed to have a minimum value of zero and maximum value of one. First the data is randomly partitioned into a training set of 100 observations and a testing set of 20 observations. We ﬁt the models using the 100 training observations and make prediction of TRIM32 expression on the remaining 20 testing observations. For the nonlinear model we set Jn = 4. This process was repeated 100 times and the MAPE recorded at each iteration. Figure 6 presents the MAPE of the two methods, demonstrating that the nonlinear model tends to be more accurate. In addition, in 69 of the 100 iterations the nonlinear model had a lower MAPE than the linear model.

7. Conclusions

We proposed an additive nonlinear model to provide a ﬂexible model. However, it is possible that too complex a model will be ﬁt. For instance, if some of the true functions are linear than the model being ﬁt will be more complex than necessary. To balance model

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

QA SCAD CD QL SCAD Method

Figure 6: Monte Carlo randomization MAPE of TRIM32 for the genomics data. The nonlinear model tends to perform better and in 69 of the 100 Monte Carlo iterations had the lower MAPE.

complexity and ease of interpretation Lou et al. (2016) proposed a penalized approach for mean regression that does both variable selection and automatic assignment of a covariate to a linear or nonlinear term. However, even this approach has some rigidity as, similar to our work, it requires preset deﬁnitions of the basis splines including the number and placement of knots. Desire for ﬂexibility has resulted in methods which use adaptive knots (Petersen et al., 2016; Sadhanala and Tibshirani, 2019) and adaptive knot assignment and classiﬁcation of predictors as linear or nonlinear (Petersen and Witten, 2019). However, all the cited work has focused on mean regression. Developing adaptive methods for quantile regression would be a useful contribution to this line of research.

Acknowledgments

The authors thank the anonymous reviewers for their time and eﬀort, including ﬁnding errors in the proofs of previous versions of this work.

8. Appendix

Throughout the proofs C is used to represent a generic positive constant that can change in value from line to line. We start by presenting some useful equalities that are used

SHERWOOD AND MAIDMAN

throughout the proofs. For u = 0, Knight (1998) introduced the equality of

|u v| |u| = u[I(u > 0) I(u < 0)] + 2 Z v

0 [I(u s) I(u 0)]ds. (17)

Deﬁne ψτ(u) = τ I(u < 0). As ρτ(u) = 1/2[|u| + (2τ 1)u], Koenker (2005) generalized (17) and for u = 0 presented Knight s identity as

ρτ(u v) ρτ(u) = vψτ(u) + Z v

0 I(u s) I(u 0)ds. (18)

Knight s identity also provides that

ρτ(u a v) ρτ(u v) = Z (a+v)

v ψτ(u + s)ds = Z a+v

v [I(u < s) τ]ds, (19)

which is an intuitive result as ψτ(u) is the derivative of ρτ(u) where it is deﬁned. The following deﬁnitions are used throughout the proof

uni = ΠA(zi) γA0 g0(zi),

Dn = diag[f1(0 | z1), . . . , fn(0 | zn)] Rn n,

W 2 D = Π ADnΠA Rqn Jn+1 qn Jn+1,

θ = WD(γA γA0), W(zi) = W 1 D ΠA(zi),

Qi(θ) = ρτ[ϵi W(zi) θ uni],

Ez(x) = E (x | z) ,

Di(θ) = Qi(θ) Qi(0) Ezi [Qi(θ) Qi(0)] + W(zi) θψτ(ϵi),

ˆθ = argmin θ Rqn Jn+1

i=1 ρτ[ϵi W(zi) θ uni].

Notice that ˆθ = WD(ˆγA γA0) and W(zi) ˆθ = ΠA(zi) ˆγA for all i {1, . . . , n}. Deﬁne dj,s = (tj,s+1 tj,s m)/(m + 1). The following Lemma is restating Corollary 1 from de Boor (1976) and provided here for convenience.

Lemma 9 (Corollary 1 from de Boor (1976).) For 1 p < and for all j {1, . . . , qn} and s {0, . . . , Jn}

0 d 1 j,s |bj,s(z)|p dx 1/p 1.

8.1 Proof of Lemma 4

Proof The majority of the proof for the lower bound follows work done in proof of Lemma 1 from Chen et al. (2018a). The major diﬀerence is the constant term, corresponding to the intercept, is accounted for in these results while the results from Chen et al. (2018a) ignore the intercept because it can be removed in mean regression by centering the predictors and

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

response to be mean zero. However, for quantile regression an intercept is still required after such transformations. Deﬁne Πj Rn Jn, as the matrix of splines associated with the jth predictor, such that ΠA = [1n, Π1, . . . , Πqn] and a = (a0, a 1 , . . . , a qn) , where aj RJn

for all j {1, . . . , qn}. Notice that a Π AΠAa = ||a01n+Π1a1+. . .+Πqnaqn||2 2. By Lemmas S.5 from Chen et al. (2018b) and that ||Πjaj||2 0 it follows that

||a01n + Π1a1 + . . . + Πqnaqn||2 2 1 δ

j=1 ||Πjaj||2 2

From Lemma 6.2 of Zhou et al. (1998), for any j {1, . . . , qn} there exists a positive constant C such that λmin(Π j Πj) CJ 1 n n. Therefore,

j=1 ||Πjaj||2 2

na2 0 + CJ 1 n n

qn J 1 n n.

It immediately follows that there exists a positive constant b1 such that b1δ2 qnk 1 n a 1

nΠ AΠAa. For the upper bound, using the Cauchy-Schwarz inequality and that PJn s=0 bj,s(zij) = 1 and that bj,s(z) 0 for all z [0, 1], j {1, . . . , pn} and s {0, . . . , Jn},

i=1 [a ΠA(zi)]2 ||a||2 2 1 n

s=1 bj,s(zij)2

Set B1 = 2 and the result immediately follows.

The following lemma provides some bounds on the vector and matrices of B-splines.

Lemma 10 Under Conditions 1-4 and for suﬃciently large n the following properties hold.

(1) For a Rqn Jn+1 where ||a||2 = 1, there exist positive constants b2 and B2 such that for suﬃciently large n that b2δ2 qnk 1 n a 1

n W 2 Da B2qn.

(2) There exists a constant b3 such that max i || W(zi)||2 b3δ 1 qn

(3) There exist constants m1 < M1 such that for all j {1, . . . , qn} and s {0, . . . , Jn}

m1k 1 n Z 1

0 b2 j,s(z)dz M1k 1 n .

(4) For all j {1, . . . , qn} and all s {0, . . . , Jn} there exist positive constants m2 < M2 such that m2k 1 n E[bj,s(zij)2] M2k 1 n .

SHERWOOD AND MAIDMAN

(1) Follows from Condition 1, providing uniform upper and lower bounds for fi(0) for all i {1, . . . , n}, and Lemma 4.

(2) By Lemma 10 (1), it follows that

|| W(zi)||2 2 b2δ 2 qn knn 1||ΠA(zi)||2 2 = b2δ 2 qn knn 1

s=1 bj,s(zij)2

Cδ 2 qn knn 1qn.

(3) Using Lemma 9 with p = 2, squaring all terms and moving dj,s to the upper and lower bounds it follow that

(m + 1) 1dj,s Z 1

0 |bj,s(z)|2 dx dj,s. (20)

By Condition 3 and the deﬁnition of h there exist positive constants c < C such that for all j {1, . . . , qn} and s {0, . . . , Jn} that

c k 1 n dj,s C k 1 n . (21)

Proof is complete by combining equations (20) and (21).

(4) Using c1 and c2 from Condition 2 it follows that for all j {1, . . . , qn} and s {0, . . . , Jn}

0 b2 j,s(z)dz Z 1

0 b2 j,s(z)fzj(z)dz c2

0 b2 j,s(z)dz. (22)

Proof is complete by combining (22) with Lemma 10 (3).

Lemma 11 Under Condition 4, for any positive constants a and b, δ a qn = o nb .

Proof Condition 4 provides that qn = o[log(n)]. Therefore,

δ a qn nb = exp (log(n) {aqn/[2 log(n)] log[2/(1 δ)] b}) = o(1).

The following lemma is central to our proof of Theorem 5.

Lemma 12 For some positive constant L

sup ||θ||2 L,θ Rqn Jn+1 (qnkn) 1

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Proof Deﬁne Wn = max i || W(zi)||2. Let Fn1 denote the event Wn < C1n 1/2(qnkn)1/2δ 1 qn and Fn2 denote the event max i |uni| < C2qnk r n . Combining Lemma 10 and (4) it follows

that there exist positive constants C1 and C2 such that P(Fn1, Fn2) 1. Thus, to prove Lemma 12 it is suﬃcient to show that ϵ > 0

sup ||θ||2 1, θ Rqn Jn+1 (qnkn) 1

> ϵ, Fn1 Fn2

Deﬁne Θ θ | ||θ||2 1, θ Rqn Jn+1 . We can partition Θ as a union of disjoint regions Θ1, ..., ΘMn, such that the diameter of each region does not exceed m0 = ϵδqn 4C1L n.

Then, following the proof of Lemma 3.2 in He and Shi (1994), Mn 2 qn Jn + 1/m0 + 1 qn Jn+1. Let θ 1, ..., θ Mn be arbitrary points in Θ1, ..., ΘMn. Then

sup ||θ||2 1 (qnkn) 1

> ϵ, Fn1 Fn2

+ sup θ Θk (qnkn) 1

qnknθ) Di(L p

qnknθ k) i > ϵ, Fn1 Fn2

We will next show that

sup θ Θk (qnkn) 1

qnknθ) Di(L p

qnknθ k) i I (Fn1 Fn2) ϵ/2.

From deﬁnition of Di(θ) and Qi(θ) and that ρτ(u) = 1

2 u for ﬁxed θ and θ

qnknθ) Di( p

qnknθ ) = 1

qnkn W(zi) θ uni ϵi p

qnkn W(zi) θ uni i

1 2Ezi h ϵi p

qnkn W(zi) θ uni ϵi p

qnkn W(zi) θ uni i + p

qnkn W(zi) [θ θ ]ψτ(ϵi).

Then using the above equality, the triangle inequality and the deﬁnition of m0

sup θ Θk (qnkn) 1

qnknθ) Di(L p

qnknθ k) i I (Fn1 Fn2)

2n Lm0(qnkn) 1/2 Wn I (Fn1 Fn2) 2 n Lm0C1δ 1 qn = ϵ/2.

The proof is complete if it can be shown that

> qnknϵ/2, Fn1 Fn2

We will use Bernstein s inequality to prove the above result. First we need upper bounds for the maximum and the second moment for the left side of the above inequality. Note that for any θ

SHERWOOD AND MAIDMAN

h ϵi W(zi) θ uni |ϵi uni| i 1

2Ezi h ϵi W(zi) θ uni |ϵi uni| i + W(zi) θψτ(ϵi).

Then using the triangle inequality and the above equality we have,

qnknθ k) I (Fn1 Fn2) 2L p

qnkn Wn I (Fn1 Fn2) 2LC1δ 1 qn qnknn 1/2.

Deﬁne Vi(θ) = Qi(θ) Qi(0) + W(zi) θψτ(ϵi). Notice that Di(θ) = Vi(θ) Ez[Vi(θ)], and that Pn i=1 Var [Di(θ)I (Fn1 Fn2) | zi] Pn i=1 E V 2 i (θ)I (Fn1 Fn2) | zi . Using Knight s identity

qnknθ k) = L p

qnkn W(zi) θ k [I(ϵi uni < 0) I(ϵi < 0)]

+ Z L qnkn W(zi) θ k

0 [I(ϵi uni < s) I(ϵi uni < 0)] ds Vi1 + Vi2.

i=1 Ezi V 2 i1I (Fn1 Fn2) Cqnkn

i=1 Ezi h W2 n I (0 < |ϵi| < |uni|) I (Fn1 Fn2) i

Cδ 2 qn (qnkn)2n 1 n X

|uni| fi(s | zi)ds I (Fn1 Fn2) Cδ 2 qn q3 nk2 nk r n ,

where the last inequality uses Condition 1. Noting that Vi2 is always non-negative and that there exists a positive constant C such that max i

qnkn L W(zi) θ k I (Fn1 Fn2)

Cδ 1 qn qnknn 1/2, we have

i=1 Ezi V 2 i2I (Fn1 Fn2) max i

qnkn L h W(zi) θ k i

(Z qnkn L[ W(zi) θ k]

I(ϵi uni < s) I(ϵi uni < 0) ds zi

I (Fn1 Fn2)

Cδ 1 qn qnknn 1/2 n X

Z qnkn L W(zi) θ k

0 [Fi(s + uni | zi) Fi(uni | zi)] I (Fn1 Fn2) ds

Cδ 1 qn qnknn 1/2 n X

Z qnkn L W(zi) θ k

0 [fi(0 | zi) + o(1)][s + O(s2)]ds

Cδ 1 qn (qnkn)2n 1/2θ k " n X

i=1 fi(0 | zi) W(zi) W(zi) #

θ k[1 + o(1)]

= Cδ 1 qn (qnkn)2n 1/2||θ k||2 2[1 + o(1)] Cδ 1 qn (qnkn)2n 1/2[1 + o(1)].

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Where the last equality follows because Pn i=1 fi(0 | zi) W(zi) W(zi) = W 1 D W 2 DW 1 D = I. Therefore, for suﬃciently large n,

i=1 Var [Di(θ)I (Fn1 Fn2)] Cδ 1 qn (qnkn)2 δ 1 qn qnk r n + n 1/2 .

By Bernstein s inequality, for all n suﬃciently large,

i=1 Di θ k, L p

qnkn/n > qnknϵ/2, Fn1 Fn2

( (qnkn)2ϵ2/4 Cδ 1 qn (qnkn)2 δ 1 qn qnk r n + n 1/2 + Cϵδ 1 qn qnknn 1/2

Cδ 2 qn qnk r n

nqn Jn(ϵδqn) 1 + 1 iqn Jn+1 exp ϵ2

Cδ 2 qn qnk r n

Cexp n Cqnkn log h C p

nqnkn(ϵδqn) 1i Cϵ2δ2 qnq 1 n kr n o

Cexp Cqnkn log(n) Cϵ2δ2 qnq 1 n kr n .

By taking the expected value of the initial conditional probability and the ﬁnal upper bound it follows that

i=1 Di(θ k, L p

> qnknϵ/2, Fn1 Fn2

Cexp Cqnkn log(n) Cϵ2δ2 qnq 1 n kr n .

Where the upper bound goes to zero because by Conditions 3-4 and Lemma 11 it follows that δ2 qnq 1 n kr n and q2 nδ 2 qn kn log(n)

8.2 Proof of Theorem 5

Proof We will ﬁrst prove that for all η > 0, there exists an L > 0 such that

inf θ Rqn Jn+1

qnknθ) Qi(0) i > 0

Gn1(θ) = (qnkn) 1 n X

Gn2(θ) = (qnkn) 1 n X

i=1 Ezi h Qi p

qnknθ Qi(0) i ,

Gn3(θ) = (qnkn) 1/2 n X

i=1 W(zi) θψτ(ϵi),

SHERWOOD AND MAIDMAN

and note that (qnkn) 1 Pn i=1 Qi( qnknθ) Qi(0) = P3 k=1 Gnk(θ). From Lemma 12 we have that sup||θ||2 L |Gn1| = o P (1). For Gn3, ﬁrst notice that E(Gn3) = 0. From Condition

1 there exists a positive constant c such that min i fi(0 | zi) c and thus

E[G2 n3] C(qnkn) 1 n X

i=1 E fi(0 | zi)

h W(zi) θ i2

C(qnkn) 1θ E

i=1 fi(0 | zi)ΠA(zi)ΠA(zi) W 1 D

θ = C(qnkn) 1||θ||2 2.

Therefore, sup||θ||2 L Gn3(θ) = OP (qnkn) 1/2 ||θ||2 . We will complete the proof by prov-

ing that inf||θ||2 L Gn2(θ) has a positive asymptotic lower bound that does not converge

to zero. Applying (19)

Gn2(θ) = (qnkn) 1 n X

(Z qnkn W(zi) θ+uni

uni [I(ϵi s) I(ϵi 0)]ds

= (qnkn) 1 n X

i=1 fi(0 | zi)1

qnkn h W(zi) θ i2 + uni p

qnkn W(zi) θ

= C||θ||2 2[1 + o(1)] + (qnkn) 1/2 n X

i=1 fi(0 | zi)uni W(zi) θ[1 + o(1)].

Deﬁne un = (un1, . . . , unn) Rn, by Condition 3 it follows that

sup ||θ||2 L

(qnkn) 1/2 n X

i=1 fi(0 | zi)uni W(zi) θ

sup ||θ||2 L (qnkn) 1/2||u n D1/2 n ||2||D1/2 n ΠAW 1 D ||2||θ||2 = OP (||θ||2).

Proof of (24) is completed by noticing that for suﬃciently large L, inf||θ||2 L Gn2(θ) has

a dominating, positive lower bound of ||θ||2 2. By the corollary to Theorem 25 in Eggleston (1958) (p.47) and the convexity of Qi( ), (24) implies ||ˆθ||2 = OP ( qnkn). From the deﬁnition of ˆθ, it follows that ||WD(ˆγA γA0)||2 = OP qnkn . Condition 5 and (4) guarantee that uni = O(qnk r n ) and therefore

i=1 fi(0 | zi) [ˆg(zi) g0(zi)]2 = n 1 n X

i=1 fi(0 | zi) h ΠA(zi) (ˆγA γA0) uni i2

2n 1 (ˆγA γA0) W 2 D (ˆγA γA0) + OP q2 nk 2r n

= OP n 1qnkn + q2 nk 2r n .

By Condition 1, which provides a constant uniform lower bound for fi(0) for all i {1, . . . , n}, n 1 Pn i=1 [ˆg(zi) g0(zi)]2 = OP n 1qnkn + q2 nk 2r n .

The following lemmas are used to prove Theorem 6

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Lemma 13 If Conditions 1-4 hold, then

||ˆγA γA0||2 = OP

Proof The proof of Theorem 5 shows ||WD(ˆγ γA0)||2 = OP ( qnkn). While from Lemma

10 it follows that ||ˆγ γA0||2 b 1/2 2 q

n δ 1 qn ||WD(ˆγ γA0)||2.

Lemma 14 If the Conditions of Theorem 6 hold then

max qn+1 j pn 1 n

i=1 πj(zij) {I [Yi g0(zi) 0] τ}

Proof Recall that πj(zij) = [bj,1(zij), ..., bj,Jn(zij)] . Note, maxj,s,i |bj,s(zij) {I [Yi g0(zi) 0] τ}|

1 and E h b2 j,s(zij) {I [Yi g0(zi) 0] τ}2i Ck 1 n , see Theorem 10 (4) for the latter. For suﬃciently large n, using Bernstein s inequality,

i=1 bj,s(zij) {I [Yi g0(zi) 0] τ}

> nk 1 n λ/4

2exp λ2n2k 2 n /32 Cnk 1 n + λnk 1 n /12

2exp Cλ2nk 1 n .

max qn+1 j pn 1 n

i=1 πj(zij) {I [Yi g0(zi) 0] τ}

Ckn max qn+1 j pn max 1 s Jnn 1

i=1 bj,s(zij) {I [Yi g0(zi) 0] τ}

Cpnknexp( Cnk 1 n λ2) = Cexp(log pn + log kn Cnk 1 n λ2) 0.

Where the limit holds using the rates of pn and λ provided in Theorem 6.

Lemma 15 Assume the Conditions of Theorem 6 hold

max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2

i=1 πj(zij) n I[Yi Π(zi) AγA 0]

I[Yi g0(zi) 0] P[Yi Π(zi) AγA 0 | zi] + P[Yi g0(zi) 0 | zi] o

SHERWOOD AND MAIDMAN

Proof Extending results from Welsh (1989) and Wang et al. (2012), we consider the set Z n γA : ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2o . The set Z can be covered by balls with

radii Cknδ 1 qn q1/2 n n 5/2 and cardinality N |Z| Cn4k2 nδ 2 qn qn, for some positive constant C. Denote the N balls by γA(u1), ..., γA(u N), where the ball γA(ul) is centered at ul for l {1, . . . , N}. Let ϵi(γA) = Yi ΠA(zi) γA, ϵi = Yi g0(zi) and mi(a, b) = I(a 0) I(b 0). Then

sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2

i=1 πj(zij) mi[ϵi(γA), ϵi] E{mi[ϵi(γA), ϵi] | zi}

i=1 πj(zij) mi[ϵi(ul), ϵi] E{mi[ϵi(ul), ϵi] | zi}

sup || γA ul||2 Cknδ 1 qn q1/2 n n 5/2

i=1 πj(zij) n mi [ϵi( γA), ϵi(ul)]

E {mi [ϵi( γA), ϵi(ul)] | zi} o

Inj1 + Inj2.

Notice that

i=1 bj,s(zij) (mi [ϵi(ul), ϵi] E {mi [ϵi(ul), ϵi] | zi})

To evaluate Inj1, deﬁne νijls = bj,s(zij) (mi [ϵi(ul), ϵi] E {mi [ϵi(ul), ϵi] | zi}) , which are bounded, independent mean-zero random variables. Note that

Var (νijls | zi) = bj,s(zij)2 E{mi [ϵi(ul), ϵi]2 | zi} E {mi [ϵi(ul), ϵi] | zi}2 .

Then using Condition 1

E{mi [ϵi(ul), ϵi]2 | zi} E{mi [ϵi(ul), ϵi] | zi}2

= Fi(0 | zi)[1 Fi(0 | zi)] + 2Fi(0 | zi)Fi[ΠA(zi) (ul γA0) + uni | zi]

+Fi[ΠA(zi) (ul γA0) + uni | zi] n 1 Fi[ΠA(zi) (ul γA0) + uni | zi] o

2Fi n min[0, ΠA(zi) (ul γA0) + uni] | zi o C ΠA(zi) (ul γA0) + uni .

Applying the Cauchy-Schwarz inequality and Lemma 10 it follows that

i=1 Var(νijl | zi) Cn

h ΠA(zi) (ul γA0) + uni i2 )1/2 C n q

||ul γA0||2 2qn + nqnk r n

C knδ 1 qn qnn1/2 + nqnk r n .

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Recall that νijl is bounded, then applying Bernstein s inequality and using the assumed rates of λ, kn and qn it follows that

> nλ/(16Jn)

C n2λ2k 2 n knδ 1 qn qnn1/2 + nqnk r n + nλk 1 n

exp Cnλk 1 n .

The term nλk 1 n dominates the denominator in the ﬁrst inequality by combining Conditions 3 and 4, Lemma 11 and the assumption that n 1/2k2 n log(n) = o(λ). Note, the upper bound does not depend on zi and taking expectations on both sides we get

exp Cnλk 1 n .

Inj1 CNknexp Cnλk 1 n = Cn4k2 nδ 2 qn qn+1/(2r+1)q1/(2r+1) n exp Cnλk 1 n

Cexp 4k2 nδ 2 qn qn + 1/(2r + 1) log(n) + 1/(2r + 1) log(qn) Cnλk 1 n .

To evaluate Inj2, note that I[ϵi( γA) 0] = I ϵi(ul) ΠA(zi) ( γA ul) . First, we will derive an upper bound for the sum in the probability statement. Since I(x a) is an increasing function of a, we have

sup || γA ul||2 Cknδ 1 qn q1/2 n n 5/2

i=1 πj(zij) n mi [ϵi( γA), ϵi(ul)] E{mi [ϵi( γA), ϵi(ul)] | zi} o

i=1 ||πj(zij)||1

I h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 i I [ϵi (ul) 0]

P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i + P [ϵi (ul) 0|zi]

i=1 ||πj(zij)||1

P h ϵi (ul) C||ΠA(zi)||2knδ 1 qn q1/2 n n 5/2 zi i

P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i)

First, the second sum will be examined. Using Condition 1, Taylor series expansion and that the elements of πj(zij) are bounded,

i=1 ||πj(zij)||1

P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i

P h ϵi (ul) Cknδ 1 qn q1/2 n n 5/2||ΠA(zi)||2 zi i)

i=1 ||πj(zij)||1 ||ΠA(zi)||2knδ 1 qn q1/2 n n 5/2 Ck5/2 n δ 1 qn qnn 3/2 = o(nλ).

SHERWOOD AND MAIDMAN

αijl = ||πj(zij)||1

I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0]

P h ϵi (ul) C||ΠA(zi)||2knn 5/2 | zi i + P [ϵi (ul) 0 | zi]

Then for n suﬃciently large, Inj2 PN l=1 P

Pn i=1 αijl nλ

and again Bernstein s in-

equality will be used to provide an upper bound for this probability. To evaluate αijl, deﬁne

ωijls = |bj,s(zij)|

I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0]

P h ϵi (ul) C||ΠA(zi)||2knn 5/2 | zi i + P [ϵi (ul) 0 | zi]

which are bounded, independent mean-zero random variables. Using that the elements of ||ΠA(zi)||2 are bounded for all i, it follows that

Var(ωijkl | zi) Cmax i

I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0]

E n I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] zi o

E n I h ϵi (ul) C||ΠA(zi)||2knn 5/2i I [ϵi (ul) 0] zi o

= CFi h W(zi) A(ul γA0) + uni + C||ΠA(zi)||2knn 5/2 zi i

CFi h W(zi) A(ul γA0) + uni zi i

Cmax i ||ΠA(zi)||2knn 5/2 C qnk3/2 n n 5/2.

i=1 αijl nλ

Applying Bernstein s inequality, for some positive constants C1, C2 and C3,

C1n2λ2k 2 n C2 qnk3/2 n n 3/2 + C3λnk 1 n

Cexp 4k2 nδ 2 qn qn + 1/(2r + 1) log(n) + 1/(2r + 1)qn Cnλk 1 n .

Note that nλk 1 n dominates qnk3/2 n n 3/2 because

qnk3/2 n n 3/2

nλk 1 n = qnk5/2 n n2λ = kn

qnk2 n nλ .

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Where both fractions are o(1) by assumed rates. Note that under Condition 7, λ = o[n (1 c4)/2] implies λ = o(k 1 n ). To complete the proof, notice there exist positive constants C1, C2, C3 and C4 such that for all n suﬃciently large, the probability of interest in the lemma is bounded by

j=qn+1 (Inj1 + Inj2)

C1exp log(pn) + C2 k2 nδ 2 qn qn + (2r + 1) 1 log(n) + C3/(2r + 1) log(qn) C4nλk 1 n .

This upper bound converges to zero under the assumptions of this lemma and using Lemma 11.

Lemma 16 Assume the conditions of Theorem 6 hold, then

max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2

i=1 πj(zij) n P h Yi ΠA(zi) γA 0 | zi i

P[Yi g0(zi) 0 | zi] o

Proof Deﬁne vn = knδ 1 qn q1/2 n n 1/2, using the Cauchy-Schwarz inequality and Lemma 10

max qn+1 j pn sup ||γA γA0||2 Cvn

i=1 πj(zij)

P h Yi ΠA(zi) γA 0 | zi i P [Yi g0(zi) 0 | zi]

max qn+1 j pn sup ||γA γA0|| Cvn

i=1 b2 j,s(zij)

i=1 {Fi [ΠA(zi) (γA γA0) uni | zi] Fi(0 | zi)}2

max qn+1 j pn sup ||γA γA0|| Cvn C

E h b2 j,s(zij) i + OP (n 1/2)

i=1 [ΠA(zi) (γA γA0)]2 + u2 ni

M4k 1 n + OP (n 1/2) q

C(k2nδ 2 qn q2nn 1 + q2nk 2r n ) C k3/2 n δ 1 qn qnn 1/2 + qnk1/2 r n [1 + o P (1)].

From the conditions on λ, kn and qn it can be derived that k3/2 n δ 1 qn qnn 1/2+qnk1/2 r n = o(λ), thus completing the proof.

The following lemma is an extension of Lemma 2.2 and 2.3 from Wang et al. (2012) which considered the linear model.

SHERWOOD AND MAIDMAN

Lemma 17 Assume the Conditions of Theorem 6 hold. Then for the oracle estimator, ˆγ,with probability approaching one

min j {1,...,qn}

ˆγj 1 (a + 1/2)λ, (25)

min j {qn+1,...,pn}

i=1 πj(zij)ψτ[yi Π(zi) ˆγ]

1 λ/2. (26)

Proof Proof of (25): Note that

min j {1,...,qn}||ˆγj||1 min j {1,...,qn} ||γ0j||1 ||ˆγ γA0||1. (27)

By Lemmas 11 and 13 and Conditions 3, 4 and 7, ||ˆγ γA0||1 Jnqn + 1||ˆγ γA0||2 = OP k3/2 n qnn 1/2δ 1 qn = o P n (1 c4)/2 . Condition 7 guarantees that there exists a positive

constant c5 such that min j {1,...,qn} ||γ0j||1 c5n (1 c4)/2. Finally, λ = o n (1 c4)/2 and

therefore P min j {1,...,qn}||ˆγj||1 > (a + 1/2)λ 1.

Proof of (26): Deﬁne sj(γ) = 1

n Pn i=1 πj(zij)ψτ[yi Π(zi) ˆγ] and D = {i : Yi ΠA(zi) ˆγA = 0}. For j {qn + 1, ..., pn},

i=1 πj(zij) n I h Yi ΠA(zi) ˆγA 0 i τ o 1

i D πj(zij)[a i + (1 τ)],

where a i [τ 1, τ] with i D such that sj(ˆγ) = 0Jn for j {1, . . . , qn} and

n I h Yi ΠA(zi) ˆγA 0 i τ o 1

i D [a i + (1 τ)] = 0.

From Section 2.2 of Koenker (2005) it follows that with probability one |D| qn Jn + 1. Then by Conditions 2-4 and the assumptions about the rate of λ it follows that

max qn+1 j pn

i D πj(zij)[a i + (1 τ)]

1 = OP qnknn 1 = o P (λ).

Thus, it is suﬃcient to show that

max qn+1 j pn

i=1 πj(zij) n I[Yi ΠA(zi) ˆγA 0] τ o

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Using Lemma 14 for the second inequality it follows that,

max qn+1 j pn

i=1 πj(zij) n I[Yi ΠA(zi) ˆγA 0] τ o

P max qn+1 j pn

i=1 πj(zij) n I[Yi ΠA(zi) ˆγA 0] I[Yi g0(zi) 0] o

max qn+1 j pn

i=1 πj(zij) {I[Yi g0(zi) 0] τ}

max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2

i=1 πj(zij) n I[Yi ΠA(zi) γA 0]

I[Yi g0(zi) 0] o

max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2

i=1 πj(zij) n I h Yi ΠA(zi) γA 0 i

I[Yi g0(zi) 0] P[Yi ΠA(zi) γA 0 | zi] + P[Yi g0(zi) 0 | zi] o

max qn+1 j pn sup ||γA γA0||2 Cknδ 1 qn q1/2 n n 1/2

i=1 πj(zij) n P h Yi ΠA(zi) γA 0 | zi i

P[Yi g0(zi) 0 | zi] o

The two probability statements go to zero by Lemmas 15 and 16. This completes the proof.

8.3 Proof of Theorem 6

Proof Deﬁne the neighborhood Xφ = γ RJnpn+1 ||ˆγ γ||1 < φ < λ/2 . In this proof we show that for suﬃciently large n there exists a φ such that Q(ˆγ) Q(γ) for all γ Xφ. Deﬁne

W = n γ = (γ0, . . . , γJnpn) RJnpn+1 γj = 0 for j {Jnqn + 1, . . . , Jnpn} o

and Fφ = W Xφ. For any γ Xφ and for any j {1, . . . , qn} it follows from Lemma 17 and the deﬁnition of Xφ that with probability approaching one

||γj||1 ||ˆγj||1 ||ˆγj γj||1 (a + 1/2)λ λ/2 = aλ.

By Condition 6 and Lemma 17 it follows, with probability approaching one, that for any γ Fφ that pλ,a(||ˆγj||1) = pλ,a(||γj||1) for all j {1, . . . , pn}. By deﬁnition of the or-

SHERWOOD AND MAIDMAN

acle estimator and Fφ it follows that for any γ Fφ that 1 n Pn i=1 ρτ yi Π(zi) ˆγ 1 n Pn i=1 ρτ yi Π(zi) γ . Therefore, for any γ Fφ it holds that Q(ˆγ) Q(γ). For any vector γ Xφ let γ represent the projection of γ into Fφ. For suﬃciently large n, and thus suﬃciently small λ, Q(ˆγ) Q( γ) and thus the proof will be complete if it can be shown that Q( γ) Q(γ). Let γA represent the ﬁrst qn Jn + 1 entries of γ and γN the remaining Jn(pn qn) entries such that γ = γ A, γ N and γ = h γ A, 0 Jn(pn qn) i .

Similarly deﬁne ΠN(zi) such that Π(zi) = ΠA(zi) , ΠN(zi) . By Knight s identity

Q(γ) Q( γ) = 1 n

i=1 ρτ[yi Π(zi) γ] ρτ[yi Π(zi) γ] +

pλ,a(||γj||1) pλ,a(0)

i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ]

Z Π(zi) (γ γ)

0 I[yi Π(zi) γ s] I[yi Π(zi) γ 0] ds

pλ,a(||γj||1) pλ,a(0) .

As Pn i=1 R Π(zi) (γ γ) 0 I[yi Π(zi) γ s] I[yi Π(zi) γ 0]ds is non-negative for all i, it will be suﬃcient to show that 1 n

i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ]

pλ,a(||γj||1) pλ,a(0) .

Notice, 1 n

i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ]

j=qn+1 ||γj||1

i=1 πj(zij)ψτ[yi Π(zi) γ]

By the mean value theorem, for some c j (0, ||γj||1)

pλ,a(||γj||1) pλ,a(0) = p λ(c j)||γj||1.

By Lemma 17 and Condition 6 there exists a suﬃciently small φ such that for all j {qn + 1, . . . , pn}

i=1 πj(zij)ψτ[yi Π(zi) γ]

1 p λ,a(φ).

Note that c j < φ, for all j, and therefore by the assumption that pλ,a( ) is concave in [0, ), from Condition 6, it follows that p λ,a(φ) p λ,a(c j) for all j {qn + 1, . . . , pn}. Therefore,

i=1 Π(zi) (γ γ)ψτ[yi Π(zi) γ]

j=qn+1 ||γj||p λ,a(φ)

j=qn+1 p λ(c j)||γj||1

pλ,a(||γj||1) pλ,a(0) .

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

8.4 Proof of Theorem 7

Proof Deﬁne the vector functions of

Q (γ, a, v) = 1

i=1 Π(zi){τ I[yi Π(zi) γ]} 1

i=1 Π(zi)(1 τ + ai)I[yi = Π(zi) γ] + v,

i=1 Π(zi){I[yi Π(zi) γ] τ} E

i=1 Π(zi){I[yi Π(zi) γ] τ}

π(γ1, γ2, a1, a2) = 1 n

i=1 Π(zi)(1 τ + a2i)I[yi = Π(zi) γ2] 1

i=1 Π(zi)(1 τ + a1i)I[yi = Π(zi) γ1],

and r(γ1, γ2) = r(γ1) r(γ2). For a Rn deﬁne the subgradient of ||a||1 as

||a||1 = {b Rn|bk = sgn(ak) for ak = 0 and bk [ 1, 1] otherwise},

and deﬁne the sets

V(γ) = {b = (0, b 1 , . . . , b pn) Rpn Jn+1|bj = p λ(||γj||)cj, where cj ||γj||1, for all j {1, . . . , pn}},

A(γ) = {b = (b1, . . . , bn) Rn|bi = 0 if yi = Π(zi) γ and bi [ 1, 1] otherwise}.

By ﬁrst order conditions if γ is a local minimizer of Q(γ) then there exists v V(γ) and a A(γ) such that Q (γ, a, v) = 0pn Jn+1. Thus, there exists v V( γ) and a A( γ) such that Q ( γ, a, v) = 0pn Jn+1. Similarly, with probability approaching one, by Theorem 6, there exists ˆv V(ˆγ) and ˆa A(ˆγ) such that Q (ˆγ, ˆa, ˆv) = 0pn Jn+1. By Condition 6 and that the ﬁrst derivatives of diﬀerentiable concave functions are decreasing |p λ(||γj||)| λ for all j {1, . . . , pn} and thus ||v|| λ for all v V(γ) and any γ Rpn Jn+1. For any vector a Rpn Jn+1 deﬁne a E Rwn Jn+1 as the sub-vector from the element of E similar to how we have deﬁned a A Rqn Jn+1. For some mi between uni + Π(zi) E ( γ γ0)E and uni + Π(zi) E (ˆγ γ0)E and using Conditions 1 and 8 there exists a positive constant C such that with probability approaching one

0 = Q E( γ, a, v) Q E(ˆγ, ˆa, ˆv) ( γE ˆγE)

|| γE ˆγE||2

n ( γE ˆγE) E 1

n Pn i=1 fi( mi)ΠE(zi)ΠE(zi) + r( γ, ˆγ) + v ˆv π( γ, ˆγ, a, ˆa) o

|| γE ˆγE||2

Cδ2 wnk 1 n || γE ˆγE||2 || r E( γ, ˆγ)||2 2λ p

wn Jn || πE( γ, ˆγ, a, ˆa)||2.

Note for any γ, ||r E(γ)||2 = OP p wn

n by Lemma 10 (4) and thus || r E( γ, ˆγ)||2 = OP p wn

n . By Condition 8 || πE( γ, ˆγ, a, ˆa)||2 = OP (knwnn 1 1 + wn). If with probability approaching one || γE ˆγE||2 has a lower bound of order

log(n)δ 2 wnkn

wnkn + knwnn 1

SHERWOOD AND MAIDMAN

then with probability approaching one [Q E( γ, a, v) Q E(ˆγ, ˆa, ˆv)] ( γE ˆγE)

|| γE ˆγE||2

has a posi-

tive lower bound, which is a contradiction. Therefore,

|| γE ˆγE||2 = OP

log(n)δ 2 wnkn

wnkn + knwnn 1

I. Barrodale and F.D.K. Roberts. Solution of an overdetermined system of equations in the ℓ1 norm. Communications of the ACM, 17:319 320, 1974.

Alexandre Belloni and Victor Chernozhukov. L1-penalized quantile regression in highdimensional sparse models. Annals of Statistics, 39(1):82 130, 2011.

Claus Borggaard and Hans Henrik Thodberg. Optimal minimal neural interpretation of spectra. Analytical Chemistry, 64(5):545 551, 1992.

Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1):232 253, 2011.

Patrick Breheny and Yaohui Zeng. grpreg: Regularization paths for regression models with grouped covariates 3.1-2, 2017. URL https://cran.r-project.org/web/packages/ grpreg/index.html.

Zhao Chen, Jianqing Fan, and Runze Li. Error variance estimation in ultrahigh-dimensional additive models. Journal of the American Statistical Association, 113(521):315 327, 2018a.

Zhao Chen, Jianqing Fan, and Runze Li. Supplemental material of error variance estimation in ultrahigh-dimensional additive models. Journal of the American Statistical Association, 113(521):315 327, 2018b.

Annie P. Chiang, John S. Beck, Hsan-Jan Yen, Marwan K. Tayeh, Todd E. Scheetz, Ruth E. Swiderski, Darryl Y. Nishimura, Terry A. Braun, Kwang-Youn A. Kim, Jian Huang, Khalil Elbedour, Rivka Carmi, Diane C. Slusarski, Thomas L. Casavant, Edwin M. Stone, and Val C. Sheﬃeld. Homozygosity mapping with snp arrays identiﬁes trim32, an e3 ubiquitin ligase, as a bardet biedl syndrome gene (bbs11). Proceedings of the National Academy of Sciences, 103(16):6287 6292, 2006.

Carl de Boor. Splines as linear combinations of b-splines. In Approximation Theory II, pages 1 47. Academic Press (New York), 1976.

Jan G. De Gooijer and Dawit Zerom. On additive conditional quantiles with highdimensional covariates. Journal of the American Statistical Association, 98:135 146, 2003.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Ronald A. Devore and George G. Lorentz. Constructive Approximation. Cambridge University Press, 2005.

Harold Gordon Eggleston. Convexity, Cambridge Tracts in Mathematics and Mathematical Physics, No.47. Cambridge University Press, 1958.

A. Essl, A. Ortner, R. Haas, and P. Hettegger. Machine learning analysis for a ﬂexibility energy approach towards renewable energy integration with dynamic forecasting of electricity balancing power. In 2017 14th International Conference on the European Energy Market (EEM), pages 1 6, 2017.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348 1360, 2001.

Jianqing Fan and Jinchi Lv. Nonconcave penalized likelihood with np-dimensionality. IEEE Transactions on Information Theory, 57(8):5467 5484, 2011.

Julian Faraway. faraway: Functions and datasets for books by julian faraway 1.0.7, 2016. URL https://cran.r-project.org/web/packages/faraway/index.html.

Joachim Freyberger, Andreas Neuhierl, and Michael Weber. Dissecting Characteristics Nonparametrically. The Review of Financial Studies, 33(5):2326 2377, 04 2020. ISSN 0893-9454.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularized paths for genearlized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2008.

Y. Fujimoto, S. Murakami, N. Kaneko, H. Fuchikami, T. Hattori, and Y. Hayashi. Machine learning approach for graphical model-based analysis of energy-aware growth control in plant factories. IEEE Access, 7:32183 32196, 2019.

Yuwen Gu, Jun Fan, Lingchen Kong, Shiqian Ma, and Hui Zou. Admm for high-dimensional sparse penalized quantile regression. Technometrics, 60(3):319 331, 2018.

Xuming He and Peide Shi. Convergence rate of b-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3:299 308, 1994.

Xuming He and Peide Shi. Bivariate tensor-product b-splines in a partly linear model. Journal of Multivariate Analysis, 58(2):162 181, 1996.

Xuming He, Zhong-Yi Zhu, and Wing-Kam Fung. Estimation in semiparametric model for longitudinal data with unspeciﬁed dependence structure. Biometrika, 89(3):579 590, 2002.

Xuming He, Lan Wang, and Hyokyoung Grace Hong. Quantile-adaptive model-free nonlinear feature screening for high-dimensional heterogeneous data. Annals of Statistics, 41 (1):342 369, 2013.

Joel L. Horowitz and Sokbae Lee. Nonparametric estimation of additive quantile regression model. Journal of the American Statistical Association, 100(472):1238 1249, 2005.

SHERWOOD AND MAIDMAN

Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282 2313, 2010.

Jian Huang, Patrick Breheny, and Shuangge Ma. A selective review of group selection in high-dimensional models. Statistical Science, 27(4):481 499, 2012.

Jianhua Z. Huang. Projection estimation in multiple regression with application to functional anova models. The Annals of Statistics, 26(1):242 272, 1998a.

Jianhua Z. Huang. Functional anova models for generalized regression. Journal of Multivariate Analysis, 67:49 71, 1998b.

Ahmed M. Ibrahim, Hassan A.M. Hendawy, Wafaa S. Hassan, Abdalla Shalaby, and Manal S. El Masry. Determination of terazosin in the presence of prazosin: Diﬀerent state-of-the-art machine learning algorithms with uv spectroscopy. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 236:1386 1425, 2020.

Kengo Kato. Group lasso for high dimensional sparse quantile regression models. https://arxiv.org/pdf/1103.1458, March 2012.

Mi-Ok Kim. Quantile regression with varying coeﬃcients. The Annals of Statistics, 35(1): 92 108, 2007.

Yongdai Kim, Hosik Choi, and Hee-Seok Oh. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103(484):1665 1673, 2008.

Yongdai Kim, Sunghoon Kwon, and Hosik Choi. Consistent model selection criteria on high dimensions. Journal of Machine Learning Research, 13:1037 1057, 2012.

Keith Knight. Limiting distributions for l1 regression estimators under general conditions. The Annals of Statistics, 26(2):755 770, 1998.

Roger Koenker. Quantile Regression. Cambridge University Press, 2005.

Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica, 46(1):33 50, 1978.

Roger Koenker and Vasco D Orey. A remark on algorithm as 229: Computing dual regression quantiles and regression rank score. Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(2):410 414, 1994.

Roger W. Koenker and Vasco D Orey. Computing regression quantiles. Journal of the Royal Statistical Society. Series C (Applied Statistics), 36(3):383 393, 1987.

Eun Ryung Lee, Hohsuk Noh, and Byeong U. Park. Model selection via bayesian information criterion for quantile regression models. Journal of the American Statistical Association, 109(505):216 229, 2014.

Heng Lian, Xin Chen, and Jian-Yi Yang. Identiﬁcation of partially linear structure in additive models with an application to gene expression prediction from sequences. Biometrics, 68(2):437 445, 2012.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Chen-Yen Lin, Howard Bondell, Hao Helen Zhang, and Hui Zou. Variable selection for nonparametric quantile regression via smoothing spline analysis of variance. Stat, 2: 255 268, 2013.

Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272 2297, 2006.

Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559 616, 2015.

Yin Lou, Jacob Bien, Rich Caruana, and Johannes Gehrke. Sparse partially linear additive models. Journal of Computational and Graphical Statistics, 25(4):1126 1140, 2016.

Shaogao Lv, Huazhen Lin, Heng Lian, and Jian Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel hilbert space. Ann. Statist., 46(2): 781 813, 04 2018.

Adam Maidman and Lan Wang. New semiparametric method for predicting high-cost patients. Biometrics, 74(3):1104 1111, 2018.

Lukas Meier, Sara van de Geer, and Peter B uhlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779 3821, 2009.

Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A uniﬁed framework for high-dimensional analysis fo m-estimators with decomposable regualrizers. Statistical Science, 27(4):538 557, 2012.

Marco Palma, Shahin Tavakoli, Julia Brettschneider, and Thomas E. Nichols. Quantifying uncertainty in brain-predicted age using scalar-on-image quantile regression. Neuro Image, 219:1 14, 2020.

Bo Peng and Lan Wang. An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 24(3):676 694, 2015.

Ashley Petersen and Daniela Witten. Data-adaptive additive modeling. Statistics in Medicine, 38(4):583 600, 2019.

Ashley Petersen, Daniela Witten, and Noah Simon. Fused lasso additive model. Journal of Computational and Graphical Statistics, 25(4):1005 1025, 2016.

Veeranjaneyulu Sadhanala and Ryan J. Tibshirani. Additive models with trend ﬁltering. Ann. Statist., 47(6):3032 3068, 12 2019.

Todd E. Scheetz, Kwang-Youn A. Kim, Ruth E. Swiderski, Alisdair R. Philp, Terry A. Braun, Kevin L. Knudtson, Anne M. Dorrance, Gerald F. Di Bona, Jian Huang, Thomas L. Casavant, Val C. Sheﬃeld, and Edwin M. Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429 14434, 2006.

SHERWOOD AND MAIDMAN

Larry L. Schumaker. Spline Functions: Basic Theory. Wiley, New York, 1981.

Ben Sherwood. Variable selection for additive partial linear quantile regression with missing covariates. Journal of Multivariate Analysis, 152:206 223, 2016.

Ben Sherwood and Adam Maidman. rqpen: Penalized quantile regression 2.2.2, 2020. URL https://cran.r-project.org/package=rq Pen.

Ben Sherwood and Lan Wang. Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics, 44(1):288 317, 2016.

Ben Sherwood, Aaron J. Molstad, and Sumanta Singha. Asymptotic properties of concave l1-norm group penalties. Statistics and Probability Letters, 157:108631, 2020.

Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040 1053, 1982.

Charles J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689 705, 1985.

Charles J. Stone. The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 14(2):590 606, 1986.

Ichiro Takeuchi, Quoc V. Le, Tim Sears, and Alexander J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231 1264, 2006.

Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289 355, 1997.

Robert Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(1):267 288, 1996.

Hansheng Wang, Guodong Li, and Guohua Jiang. Robust regression shrinkage and consistent variable selection through the lad-lasso. Journal of Business & Economic Statistics, 25(3), 2007.

Hansheng Wang, Bo Li, and Chenlei Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671 683, 2009a.

Huixia Judy Wang, Zhongyi Zhu, and Jianhui Zhou. Quantile regression in partially linear varying coeﬃcient models. The Annals of Statistics, 37(6B):3841 3866, 2009b.

Lan Wang, Yichao Wu, and Runze Li. Quantile regression of analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497):214 222, 2012.

Li Wang, Lan Xue, Annie Qu, and Hua Liang. Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates. The Annals of Statistics, 42(2):592 624, 2014a.

ADDITIVE NONLINEAR QUANTILE REGRESSION IN ULTRA-HIGH DIMENSION

Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6): 2164 2201, 2014b.

Ying Wei, Anneli Pere, Roger Koenker, and Xuming He. Quantile regression methods for reference growth charts. Statistics in Medicine, 25:1369 1382, 2006.

A.H. Welsh. On m-processes and m-estimation. The Annals of Statistics, 17(1):337 361, 1989.

Yicaho Wu and Yufeng Liu. Variable selection in quantile regression. Statistica Sinica, 19 (2):801 817, 2009.

Lan Xue. Consistent variable selection in additive models. Statistica Sinica, 19:1281 1296, 2009.

Lan Xue and Lijian Yang. Additive coeﬃcient modeling via polynomial spline. Statistica Sinica, 16(4):1423 1446, 2006.

Congrui Yi and Jian Huang. Semismooth newton coordinate descent algorithm for elasticnet penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547 557, 2017.

Liqun Yu, Nan Lin, and Lan Wang. A parallel algorithm for large-scale nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 26(4):935 939, 2017.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894 942, 2010.

Kaifeng Zhao and Heng Lian. Variable selection in additive quantile regression using nonconcave penalty. Statistics, 50(6):1276 1289, 2016.

Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541 2563, 2006.

Qi Zheng, Limin Peng, and Xuming He. Globally adaptive quantile regression with ultrahigh dimensional data. The Annals of Statistics, 43(5):2225 2258, 2015.

S. Zhou, X. Shen, and D.A. Wolfe. Local asymptotics for regression splines and conﬁdence regions. The Annals of Statistics, 26(5):1760 1782, 1998.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418 1429, 2006.

Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509 1533, 2008.