# High-Dimensional Structured Quantile Regression

Vidyashankar Sivakumar, Arindam Banerjee

Department of Computer Science & Engineering, University of Minnesota, Twin Cities. Correspondence to: Vidyashankar Sivakumar, Arindam Banerjee.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

## Abstract

Quantile regression aims at modeling the conditional median and quantiles of a response variable given certain predictor variables. In this work we consider the problem of linear quantile regression in high dimensions, where the number of predictor variables is much larger than the number of samples available for parameter estimation. We assume the true parameter to have some structure, characterized as having a small value according to some atomic norm $R(\cdot)$, and consider the norm-regularized quantile regression estimator. We characterize the sample complexity for consistent recovery and give non-asymptotic bounds on the estimation error. While this problem has been considered before, our analysis reveals geometric and statistical characteristics of the problem not available in prior literature. We perform experiments on synthetic data which support the theoretical results.

## 1 Introduction

Considerable advances have been made over the past decade on fitting high-dimensional structured linear models when the number of samples $n$ is much smaller than the ambient dimensionality $p$ (Banerjee et al., 2014; Negahban et al., 2012; Chandrasekaran et al., 2012). Most of the advances have been made for linear models:
$$
y_i = \langle x_i, \theta^* \rangle + \omega_i, \qquad i = 1, \dots, n,
$$
where $\theta^* \in \mathbb{R}^p$ is assumed to be structured, e.g., sparse, group sparse, etc. Estimation of such structured $\theta^*$ is usually done using Lasso-type regularized estimators (Negahban et al., 2012; Banerjee et al., 2014) or Dantzig-type constrained estimators (Chandrasekaran et al., 2012; Chatterjee et al., 2014); other related estimators have also been explored (Hsu & Sabato, 2016; Vershynin, 2015). Such models have been extended to generalized linear models (Banerjee et al., 2014; Negahban et al., 2012), matrix completion (Candès & Recht, 2009), and vector auto-regressive models (Melnyk & Banerjee, 2016), among others.

In this paper, we consider the problem of structured quantile regression in high dimensions, which can be posed as follows: given the response variable $y_i$ and covariates $x_i$, the $\tau$-th conditional quantile function of $y_i$ given $x_i$ is
$$
F^{-1}_{y_i|x_i}(\tau|x_i) = \langle x_i, \theta^*_\tau \rangle, \qquad \tau \in (0,1),
$$
for some structured $\theta^*_\tau$ whose structure can be captured by a suitable atomic norm $R(\cdot)$, e.g., the $\ell_1$ norm for sparsity, the $\ell_1/\ell_2$ norm for group sparsity, etc. Here $F^{-1}_{y_i|x_i}(\cdot)$ is the inverse of the conditional distribution function of $y_i$ given $x_i$. We consider the following regularized estimator for the structured quantile regression problem:
$$
\hat{\theta}_n := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \Big\{ L_\tau(\theta) + \lambda_n R(\theta) \Big\} := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{n}\sum_{i=1}^n \rho_\tau\big(y_i - \langle x_i, \theta\rangle\big) + \lambda_n R(\theta) \Big\}, \quad (1)
$$
where $\rho_\tau(u) = (\tau - I(u \le 0))\,u$ is the asymmetric absolute deviation function (Koenker, 2005) and $I(\cdot)$ is the indicator function. The goal is to obtain non-asymptotic bounds on the estimation error $\|\hat{\theta}_n - \theta^*\|_2$. Many previous papers analyze the asymptotic performance of the estimator in (1) (Li & Zhu, 2008; Kai et al., 2011; Wu & Liu, 2009; Zou & Yuan, 2008; Wang et al., 2012).
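As a point of reference for the estimator in (1), the following is a minimal sketch of the quantile (pinball) loss and the regularized objective; the function names and the default choice of the $\ell_1$ norm as $R(\cdot)$ are illustrative, not part of the paper's development.

```python
import numpy as np

def pinball_loss(u, tau):
    """Asymmetric absolute deviation rho_tau(u) = (tau - I(u <= 0)) * u."""
    return (tau - (u <= 0)) * u

def quantile_objective(theta, X, y, tau, lam, reg=lambda t: np.abs(t).sum()):
    """L_tau(theta) + lambda_n * R(theta); R defaults to the l1 norm here."""
    residuals = y - X @ theta
    return pinball_loss(residuals, tau).mean() + lam * reg(theta)
```

Minimizing this objective is a convex but nonsmooth problem; for the $\ell_1$ norm it can be written as a linear program (a sketch appears in Section 3).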
In the non-asymptotic setting of interest in this paper, special cases of the estimator in (1) have been studied in recent literature (Belloni & Chernozhukov, 2011; Kato, 2011; Fan et al., 2014a), primarily focusing on specific norms like the $\ell_1$ norm and the non-overlapping group sparse norm. In contrast, our formulation and analysis apply to any atomic norm $R(\cdot)$, giving considerable flexibility in choosing a suitable structure for real-world problems, e.g., hierarchical sparsity, the $k$-support norm, the OWL norm, etc. More recently, Alquier et al. (2017) consider the more general problem of norm-regularized regression with Lipschitz loss functions, which includes (1) as a special case. They derive results similar to ours for bounds on the estimation error, but their analysis differs significantly and, in our opinion, does not leverage and highlight the key geometric and statistical characteristics of the problem.

In the setting of norm-regularized regression with the square loss, including the widely studied Lasso estimator (Tibshirani, 1996; Negahban et al., 2012; Banerjee et al., 2014), the sample complexity $n_0$ of the estimator is determined by a certain restricted strong convexity (RSC) property, which simplifies to the restricted eigenvalue (RE) condition on the matrix $X^TX$ (Bickel et al., 2009; Negahban et al., 2012); in the noiseless setting, i.e., when $\omega_i = 0$, the sample complexity determines a phase transition phenomenon: the probability of recovering the structured $\theta^*$ is small when $n < n_0$, and one can exactly recover $\theta^*$ with high probability when $n > n_0$. Our work gives an equivalent sample complexity characterization for structured quantile regression, which was not available in prior work.

The challenge in characterizing RSC in the context of quantile regression stems partly from the non-smoothness of the objective, so one has to work with subgradients. However, the unique aspect stems from the geometry of quantile regression, or, as the authoritative book on the topic puts it, "How does quantile regression work?" (Koenker, 2005, Section 2.2). In quantile regression, the $n$ samples get divided into three subsets: $\nu$ samples which are exactly interpolated, i.e., $y_i = \langle x_i, \hat{\theta}\rangle$; $(n - \nu)\tau$ samples which lie below the fitted hyperplane, i.e., $y_i < \langle x_i, \hat{\theta}\rangle$; and $(n - \nu)(1 - \tau)$ samples which lie above it, i.e., $y_i > \langle x_i, \hat{\theta}\rangle$. Note that when $\nu = n$ all samples are interpolated, the loss is zero, and the same $\hat{\theta}$ is a solution for all quantiles $\tau$: quantile regression is clearly not working. The Number of Inter Polated Samples (NIPS) $\nu$ is an important quantity, inherent to the structure in $\theta^*$, and determines the sample complexity of structured quantile regression (1). In fact, we show that when $n > \nu$, the RSC condition associated with the estimator in (1) is satisfied. When there is no structure in $\theta^*$, then $\nu = O(p)$, and quantile regression needs $n > O(p)$ samples to work. However, when $\theta^*$ has structure, such as sparsity or group sparsity, $\nu$ can be substantially smaller than $p$. Specifically, we show that $\nu$ is of the order of the square of the Gaussian width of the error set (Talagrand, 2014; Chandrasekaran et al., 2012) for a class of atomic norms which includes the $\ell_1$, $\ell_1/\ell_2$ group sparse, $k$-support (Argyriou et al., 2012), and OWL (Bogdan et al., 2013) norms. The Gaussian width as a measure of the complexity of a set has been used extensively in prior literature (Chandrasekaran et al., 2012; Tropp, 2015; Banerjee et al., 2014).
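Since Gaussian widths drive all of the sample-complexity statements below, a small Monte Carlo sketch may help fix ideas: for a set $A$, $w(A) = \mathbb{E}\sup_{u\in A}\langle g, u\rangle$ with $g \sim N(0, I_p)$ (the formal definition appears in Section 2). For the $\ell_1$ unit ball the supremum has the closed form $\|g\|_\infty$, so the width is roughly $\sqrt{2\log p}$. The code below is an illustrative check, not part of the paper's analysis.

```python
import numpy as np

def width_l1_ball(p, trials=2000, seed=0):
    """Monte Carlo estimate of w(Omega_R) for the l1 unit ball:
    sup_{||u||_1 <= 1} <g, u> = ||g||_inf, so w = E ||g||_inf ~ sqrt(2 log p)."""
    g = np.random.default_rng(seed).standard_normal((trials, p))
    return np.abs(g).max(axis=1).mean()

# Both numbers land in the 3.7-3.9 range for p = 1000.
print(width_l1_ball(1000), np.sqrt(2 * np.log(1000)))
```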
For example, when $\theta^*$ is sparse with $s$ non-zero entries, we show that $\nu = c\,s\log p$. When $n > \nu$ and the RSC condition is satisfied, building on recent developments in high-dimensional estimation (Negahban et al., 2012; Banerjee et al., 2014), we show that choosing $\lambda_n \ge 2R^*(\nabla L_\tau(\theta^*))$, where $R^*(\cdot)$ is the dual norm of $R(\cdot)$, leads to non-asymptotic bounds on the estimation error $\|\hat{\theta}_n - \theta^*\|_2$. While the specification of $\lambda_n$ looks complex, with its dependence on the dual norm and on $\theta^*$, we show that it suffices to set $\lambda_n$ based on the Gaussian width (Talagrand, 2014) of the unit norm ball of $R(\cdot)$ (Banerjee et al., 2014; Sivakumar et al., 2015). Interestingly, our estimation error bound for quantile regression has the same order as that for regularized least squares regression with general norms (Banerjee et al., 2014). In contrast to the least squares loss, the quantile loss is more robust, as the estimation error is independent of the $\ell_2$ norm of the noise. We discuss results for the $\ell_1$, $\ell_1/\ell_2$ group sparse, and $k$-support norms as examples, precisely characterizing the sample complexity for recovery and the non-asymptotic error bounds. In particular, our results for the $\ell_1$ norm match those from the existing literature on sparse quantile regression (Belloni & Chernozhukov, 2011).

The rest of the paper is organized as follows. In Section 2, we discuss the problem formulation along with assumptions, review the general framework for analyzing regularized estimation problems, and discuss the three atomic norms used as examples throughout the paper. In Section 3, we analyze the number of interpolated samples and establish precise sample complexities for a class of atomic norms in terms of Gaussian widths of suitable sets. In Section 4, we establish the key ingredients of the analysis and provide the main bound. We present experimental results in Section 5 and conclude in Section 6.

## 2 Background and Preliminaries

**Problem formulation:** We outline the assumptions on the data and the estimator. Similar conditions appear in all prior literature on quantile regression, and we refer to Section 2.5 of Belloni & Chernozhukov (2011) for examples of data satisfying the conditions. We consider data generated as $y = X\theta^* + \omega$, where $X \in \mathbb{R}^{n\times p}$ is the design matrix, $\theta^* \in \mathbb{R}^p$, and $\omega \in \mathbb{R}^n$ is the noise. We assume sub-Gaussian design matrices $X \in \mathbb{R}^{n\times p}$, a class which includes all bounded random variables. Note that this is not a restrictive assumption, since to avoid the quantile crossing phenomenon the covariate space has to be bounded. Quantile crossing occurs when the value of the $\tau_1$-th quantile is greater than that of the $\tau_2$-th quantile for some $\tau_1 < \tau_2$ (see Section 2.5 in Koenker (2005)). We make no assumptions on the noise vector $\omega \in \mathbb{R}^n$: the noise can be sampled from a heavy-tailed distribution, can be heteroscedastic as in the location-scale model where $\omega_i = \langle x_i, \eta\rangle\,\epsilon_i$ with $\eta \in \mathbb{R}^p$ and $\epsilon_i$ noise independent of $x_i$, can be bimodal, and so on. Note that this setting is more general than that for the least squares loss. We consider a parametric quantile regression model where the $\tau$-th conditional quantile function of the response variable $y_i$ given any $x_i \in \mathbb{R}^p$ is
$$
F^{-1}_{y_i|x_i}(\tau|x_i) = \langle x_i, \theta^*_\tau\rangle, \qquad \theta^*_\tau \in \mathbb{R}^p,\ \tau \in (0,1), \quad (2)
$$
where $F^{-1}_{y_i|x_i}$ is the inverse of the conditional distribution function of $y_i$ given $x_i$.
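To make the setting concrete, the sketch below generates data from the location-scale model mentioned above, for which the $\tau$-th conditional quantile is linear in $x_i$ as in (2) with $\theta^*_\tau = \theta^* + F^{-1}_\epsilon(\tau)\,\eta$. The bounded design and the specific choices of $\theta^*$ and $\eta$ are illustrative, not assumptions of the paper.

```python
import numpy as np
from scipy.stats import norm

def location_scale_data(n, p, s, tau, seed=0):
    """y_i = <x_i, theta*> + <x_i, eta> * eps_i with eps_i ~ N(0, 1).
    Since <x_i, eta> >= 0 here, the tau-th conditional quantile of y_i given x_i
    is <x_i, theta* + norm.ppf(tau) * eta>, i.e., linear in x_i as in (2)."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(p); theta_star[:s] = 1.0   # s-sparse location parameter
    eta = np.zeros(p); eta[:s] = 0.1                 # scale direction (heteroscedastic noise)
    X = rng.uniform(0.0, 1.0, size=(n, p))           # bounded covariates
    y = X @ theta_star + (X @ eta) * rng.standard_normal(n)
    theta_tau = theta_star + norm.ppf(tau) * eta     # the structured parameter in (2)
    return X, y, theta_tau
```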
We will assume that the conditional density of $y_i$ evaluated at the conditional quantile $\langle x_i, \theta^*_\tau\rangle$ is bounded away from zero uniformly for all $\tau$, that is, $f_{y_i|x_i}(\langle x_i, \theta^*_\tau\rangle) > f > 0$ for all $\tau$ and all $x_i$. The goal is to estimate a parameter $\hat{\theta}_\tau$ close to $\theta^*_\tau$ using $n$ observations of the data when $n < p$. The estimator in this paper belongs to the family of regularized estimators and is of the form
$$
\hat{\theta}_{\lambda_n,\tau} := \operatorname*{argmin}_{\theta\in\mathbb{R}^p}\ L_\tau(\theta) + \lambda_n R(\theta), \quad (3)
$$
where $L_\tau(\theta) = \frac{1}{n}\sum_{i=1}^n \rho_\tau(y_i - \langle x_i, \theta\rangle)$, $\rho_\tau(\cdot)$ is the quantile loss function, and $R(\cdot)$ is any atomic norm. Examples of atomic norms considered in this paper are the $\ell_1$ norm, the $\ell_1/\ell_2$ non-overlapping group sparse norm, and the $k$-support norm. We present all results for any single fixed $\tau \in (0,1)$ and, going forward, drop the subscripts from $\theta^*_\tau$ and $\hat{\theta}_{\lambda_n,\tau}$.

**Gaussian width:** All results will be in terms of Gaussian widths of suitable sets. For any set $A \subseteq \mathbb{R}^p$, the Gaussian width of $A$ is defined as (Gordon, 1985; Chandrasekaran et al., 2012)
$$
w(A) = \mathbb{E}\Big[\sup_{u\in A}\ \langle g, u\rangle\Big], \quad (4)
$$
where the expectation is over $g \sim N(0, I_{p\times p})$. Gaussian widths have been widely used in prior literature on structured estimation (Chandrasekaran et al., 2012; Banerjee et al., 2014; Sivakumar et al., 2015; Tropp, 2015).

**High-dimensional estimation:** Our analysis builds on developments over the past decade in high-dimensional structured regression for linear and generalized linear models, using both regularized and constrained estimators (Candès & Tao, 2007; Bickel et al., 2009; Chandrasekaran et al., 2012; Negahban et al., 2012; Banerjee et al., 2014). For the regularized formulation considered in this work, Banerjee et al. (2014) and Negahban et al. (2012) have established a general analysis framework when the loss is the least squares loss or, more generally, the negative log-likelihood of a generalized linear model. We give a brief overview of the main components of the analysis.

1. **Regularization parameter:** In Banerjee et al. (2014); Negahban et al. (2012) the regularization parameter is assumed to satisfy
$$
\lambda_n \ge 2\, R^*\big(\nabla L_\tau(\theta^*)\big). \quad (5)
$$
With $\Omega_R = \{u \,|\, R(u) \le 1\}$ denoting the unit norm ball, Banerjee et al. (2014) prove that, with high probability, a value $\lambda_n = O(w(\Omega_R)/\sqrt{n})$ satisfies the above condition for sub-Gaussian design matrices and noise and the least squares loss.

2. **Error set:** The assumption on the regularization parameter ensures that the error vector $\Delta = \hat{\theta} - \theta^*$ lies in the following error set (Banerjee et al., 2014):
$$
C := \Big\{\Delta \ \Big|\ R(\theta^* + \Delta) \le R(\theta^*) + \tfrac{1}{2}R(\Delta)\Big\}. \quad (6)
$$

3. **Norm compatibility constant:** It is defined as follows (Negahban et al., 2012; Banerjee et al., 2014):
$$
\Psi(C) = \sup_{u\in C}\ \frac{R(u)}{\|u\|_2}. \quad (7)
$$

4. **Restricted strong convexity (RSC):** In Banerjee et al. (2014); Negahban et al. (2012) the loss function is shown to satisfy the following RSC condition with high probability once the number of samples is of the order of the square of the Gaussian width of the error set, that is, $n = O(w^2(C))$:
$$
\inf_{u\in C}\ \delta L(\theta^*, u) = \inf_{u\in C}\ \big(L(\theta^* + u) - L(\theta^*) - \langle\nabla L(\theta^*), u\rangle\big) \ge \kappa\,\|u\|_2^2. \quad (8)
$$
For the squared loss, the RSC condition simplifies to the restricted eigenvalue (RE) condition (Bickel et al., 2009):
$$
\inf_{u\in C}\ \frac{1}{n}\|Xu\|_2^2 \ge \kappa\,\|u\|_2^2. \quad (9)
$$

5. **Recovery bounds:** When the RSC condition and the bound on the regularization parameter are satisfied, Banerjee et al. (2014) prove the following error bound:
$$
\|\Delta\|_2 = \|\hat{\theta} - \theta^*\|_2 \le \frac{c\,\Psi(C)\,w(\Omega_R)}{\kappa\sqrt{n}}, \quad (10)
$$
where $c$ is a constant.

**Atomic norms:** We consider the class of atomic norms for the regularizer. Mathematically, consider a set $A \subset \mathbb{R}^p$, the collection of atoms, that is compact and centrally symmetric about the origin (that is, $a \in A \Rightarrow -a \in A$).
Let $\|\theta\|_A$ denote the gauge of $A$,
$$
R(\theta) = \|\theta\|_A = \inf\{t > 0 : \theta \in t\,\mathrm{conv}(A)\} = \inf\Big\{\sum_{a\in A} c_a : \theta = \sum_{a\in A} c_a a,\ c_a \ge 0,\ a \in A\Big\}. \quad (11)
$$
For example, $A = \{\pm e_i\}_{i=1}^p$ yields $\|\theta\|_A = \|\theta\|_1$. Although the atomic set $A$ may contain uncountably many vectors, we assume $A$ can be decomposed as a union of $m$ sets, $A = A_1 \cup A_2 \cup \dots \cup A_m$, similar to the setting considered in Chen & Banerjee (2015). This decomposition assumption is satisfied by many popular atomic norms like the $\ell_1$, $\ell_1/\ell_2$ group sparse, $k$-support, and OWL norms. Throughout the paper we illustrate our results on the following norms.

1. **$\ell_1$ norm:** For the $\ell_1$ norm we assume $\theta^*$ is an $s$-sparse vector, that is, $\|\theta^*\|_0 = s$.

2. **$\ell_1/\ell_2$ non-overlapping group sparse norm:** Let $\{G_1, G_2, \dots, G_{N_G}\}$ denote a collection of groups which form blocks of any vector $\theta \in \mathbb{R}^p$. Let $\theta_{G_i}$ denote the vector with coordinates $(\theta_{G_i})_j = \theta_j$ if $j \in G_i$ and $(\theta_{G_i})_j = 0$ otherwise. The maximum size of any group is $l = \max_{i\in\{1,\dots,N_G\}} |G_i|$. The norm is given by $R(\theta) = \sum_{i=1}^{N_G} \|\theta_{G_i}\|_2$. Let $S_G \subseteq \{1, 2, \dots, N_G\}$ with cardinality $|S_G| = s_G$. We assume the true parameter $\theta^* \in \mathbb{R}^p$ is $s_G$-group-sparse, that is, $\theta^*_{G_i} = 0$ for $i \notin S_G$.

3. **$k$-support norm:** The $k$-support norm can be expressed as an infimal convolution (Argyriou et al., 2012),
$$
R(\theta) = \inf\Big\{\sum_i \|u_i\|_2 \ :\ \sum_i u_i = \theta,\ \|u_i\|_0 \le k\Big\}.
$$
Clearly it is an atomic norm with $A = \{a \in \mathbb{R}^p \,|\, \|a\|_0 \le k,\ \|a\|_2 = 1\}$, and $A$ is a union of $\binom{p}{k}$ subsets, that is, $A = A_1 \cup A_2 \cup \dots \cup A_{\binom{p}{k}}$. More results on the $k$-support norm can be found in Chatterjee et al. (2014); Chen & Banerjee (2015). We consider the setting where $\|\theta^*\|_0 = s$ and $k < s$.

For the results we require the Gaussian widths of the unit norm ball and of the error set, and the norm compatibility constants, for these norms. We provide them below for reference; all values are in order notation.

| Norm | $w(\Omega_R)$ | $w(C)$ | $\Psi(C)$ |
| --- | --- | --- | --- |
| $\ell_1$ | $c\sqrt{\log p}$ | $c\sqrt{s\log p}$ | $c\sqrt{s}$ |
| $\ell_1/\ell_2$ | $c\sqrt{l + \log N_G}$ | $c\sqrt{l s_G + s_G\log N_G}$ | $c\sqrt{s_G}$ |
| $k$-support | $c\sqrt{k + k\log(p/k)}$ | $c\sqrt{s + s\log(p/k)}$ | $c\sqrt{s/k}$ |

## 3 Number of Inter Polated Samples (NIPS)

We begin with intuitions on the geometry of the problem. In the high-sample, low-dimension setting $n \gg p$, with no regularization ($R(\theta) = 0$), minimizing the quantile loss is a linear program and hence the solutions are at vertices, that is, points where some $p$ of the $n$ samples are interpolated. Mathematically, define $Z = \{i : y_i = \langle x_i, \hat{\theta}\rangle = \langle x_i, \theta^* + u\rangle,\ u\in\mathbb{R}^p\}$ and note that $\nu = \sup_{u\in\mathbb{R}^p} |Z| = O(p)$. In the high-dimensional setting considered in this paper, $n < p$, and hence without regularization the number of interpolated samples is $\nu = n$. From an optimization perspective there are multiple such solutions, and all of them are optimal for all quantile parameters $\tau$; practically, quantile regression is not working. Introducing a regularizer with a suitable choice of the regularization parameter ensures that the error vector lies in a restricted subset of $\mathbb{R}^p$,
$$
C := \Big\{u \ \Big|\ R(\theta^* + u) \le R(\theta^*) + \tfrac{1}{2}R(u)\Big\} \subseteq \mathbb{R}^p.
$$
We are now interested in characterizing $\nu = \sup_{u\in C} |Z|$ with $Z = \{i : y_i = \langle x_i, \hat{\theta}\rangle = \langle x_i, \theta^* + u\rangle,\ u\in C\}$, that is, the maximum number of interpolated samples with the error restricted to a particular subset of $\mathbb{R}^p$. Again, if $\nu = n$, quantile regression is not working. Since there is no restriction on the number of non-zero elements of the error vector $u$, a first crude estimate is $\nu \le \min\{n, p, \|u\|_0\}$, and since $\|u\|_0$ can be as large as $p$, this implies quantile regression will not work unless we have at least $p$ samples. Intuitively, however, $\nu$ should depend on properties of the error set $C$, which this crude estimate fails to take advantage of.
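This geometric picture is easy to probe numerically. Below is a minimal sketch (assuming `scipy` is available; the function names are illustrative) that solves the $\ell_1$-penalized quantile regression in (3) as a linear program and counts interpolated samples; with $\lambda = 0$ and $n < p$ it generically interpolates all $n$ samples, which is exactly the "quantile regression is not working" regime described above.

```python
import numpy as np
from scipy.optimize import linprog

def l1_quantile_regression(X, y, tau, lam):
    """l1-penalized quantile regression written as a linear program.
    Variables: theta (p, free), t (p, with |theta_j| <= t_j), u+ (n), u- (n)."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), lam * np.ones(p),
                        (tau / n) * np.ones(n), ((1 - tau) / n) * np.ones(n)])
    A_eq = np.hstack([X, np.zeros((n, p)), np.eye(n), -np.eye(n)])   # X theta + u+ - u- = y
    A_ub = np.vstack([np.hstack([np.eye(p), -np.eye(p), np.zeros((p, 2 * n))]),    # theta - t <= 0
                      np.hstack([-np.eye(p), -np.eye(p), np.zeros((p, 2 * n))])])  # -theta - t <= 0
    bounds = [(None, None)] * p + [(0, None)] * (p + 2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * p), A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    return res.x[:p]

def count_interpolated(X, y, theta_hat, tol=1e-8):
    """Number of samples fitted exactly, i.e., the NIPS quantity nu."""
    return int(np.sum(np.abs(y - X @ theta_hat) < tol))
```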
Below we state a result which reinforces the intuition about the relation between $\nu$ and the properties of the set $C$. Specifically, we show that for the types of atomic norms considered in this work (which include all popularly used vector norms) the number of interpolated samples does not exceed the product of the square of the norm compatibility constant and the square of the Gaussian width of the unit norm ball. For the norms considered, this is precisely the square of the Gaussian width of the error set $C$. For example, for the $\ell_1$ norm with an $s$-sparse $\theta^*$ this evaluates to an upper bound of $\nu = O(s\log p)$. While the result statement considers sub-Gaussian design matrices, the result also holds for design matrices sampled from heavy-tailed distributions, using arguments similar to Lecué & Mendelson (2014); Sivakumar et al. (2015).

**Theorem 3.1** Consider $X$ with isotropic sub-Gaussian rows and $\theta^*$ an $s$-sparse vector that can be written as a linear combination of $k_1$ atoms from an atomic set of cardinality $m$,
$$
\theta^* = \sum_{i=1}^{k_1} c_i a_i, \qquad a_i \in A,\ c_i \ge 0,\ |A| = m. \quad (14)
$$
For the quantile regression problem penalized with the atomic norm $R(\theta) = \|\theta\|_A$,
$$
\hat{\theta} = \operatorname*{argmin}_{\theta\in\mathbb{R}^p}\ L_\tau(\theta) + \lambda R(\theta) = \operatorname*{argmin}_{\theta\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \rho_\tau\big(y_i - \langle x_i, \theta\rangle\big) + \lambda R(\theta), \quad (15)
$$
let $C := \{u \,|\, R(\theta^* + u) \le R(\theta^*) + \frac{1}{2}R(u)\}$ denote the error set, let $\lambda \ge R^*(\nabla L_\tau(\theta^*))$, and let $n \ge c_1 s + c_2\,\Psi^2(C)\,w^2(A)$, where $\Psi(C) = \sup_{u\in C} R(u)/\|u\|_2$ is the norm compatibility constant over the error set, $w(A)$ is the Gaussian width of the unit norm ball, and $c_1, c_2$ are constants. Then, with probability at least $1 - \exp(-c_2 k_1\log(em)) - 2\exp(-\eta\,w^2(C))$, the number of interpolated samples satisfies
$$
\nu = \sup_{u\in C}\ \big|\{i : y_i = \langle x_i, \theta^* + u\rangle,\ u\in C\}\big| \le c\,\Psi^2(C)\,w^2(A), \quad (16)
$$
where $c$ is a constant.

To understand the intuition, consider the case of the $\ell_1$ norm. For an error vector lying in a particular $s$-dimensional subspace, the maximum number of interpolated samples is $O(s)$ with high probability. Extending the argument to all such $s$-dimensional subspaces by a union bound over the $\binom{p}{s}$ subspaces, the maximum number of interpolated samples when the error vector is any $s$-sparse vector in $p$-dimensional space is $O(s\log p)$. Finally, the argument is extended to all vectors in the error set using Maurey's empirical approximation argument, previously employed in Sivakumar et al. (2015); Lecué & Mendelson (2014); Rudelson & Zhou (2013). Surprisingly, the importance of $\nu$ has not been explicitly discussed in prior literature on structured high-dimensional quantile regression. The same intuition about the importance of $\nu$ also shows up, in an elegant form, in the analysis of the RSC condition in Section 4.2.

Below we provide results on the number of interpolated samples for the $\ell_1$, $\ell_1/\ell_2$ non-overlapping group sparse, and $k$-support norms. For the $\ell_1/\ell_2$ non-overlapping group sparse norm and the $k$-support norm we first show that they are atomic norms. The results then follow by substituting the known values of the norm compatibility constant and the Gaussian width of the unit norm ball for the different norms. For the computation of these quantities for a general norm we refer the interested reader to Vershynin (2015); Chen & Banerjee (2015).

**Corollary 1** For the $\ell_1$ norm with $\theta^*$ an $s$-sparse vector, when $n > c_1 s + c_2 s\log p$, then with high probability
$$
\nu = \sup_{u\in C}\ \big|\{i : y_i = \langle x_i, \theta^* + u\rangle\}\big| \le c\,s\log p \quad (17)
$$
for some constant $c$.
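Using the sketch given earlier in this section, the bound in Corollary 1 can be probed empirically for a sparse $\theta^*$; the regularization level follows the $\sqrt{\tau(1-\tau)\log p/n}$ scaling of Section 4.1 (Corollary 4), and the constants and problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, tau = 200, 500, 5, 0.5
theta_star = np.zeros(p); theta_star[:s] = 1.0
X = rng.standard_normal((n, p))
y = X @ theta_star + rng.standard_normal(n)

lam = 2.0 * np.sqrt(tau * (1 - tau) * np.log(p) / n)   # scaling from Corollary 4; constant illustrative
theta_hat = l1_quantile_regression(X, y, tau, lam)      # from the sketch above
nu = count_interpolated(X, y, theta_hat)
print("interpolated samples:", nu, "  s log p =", round(s * np.log(p)))
```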
Before applying the result to the non-overlapping group sparse norm, note that any vector $\theta \in \mathbb{R}^p$ can be written as
$$
\theta = \sum_{i}\sum_{j} c_{ij}\,\beta_{ij}, \quad (18)
$$
where $\beta_{ij}$ is a unit norm vector in the subspace defined by group $i$. For any group $i$, let $\theta_{G_i}$ denote the vector constructed from $\theta$ such that its component $\theta_k = 0$ if $k \notin G_i$. By the definition of the atomic norm, $c_{ij} = \|\theta_{G_i}\|_2$ for the $\beta_{ij}$ in the same direction as $\theta_{G_i}$, and $c_{ij} = 0$ otherwise. Therefore the group sparse norm is an atomic norm with a hierarchical set structure: the outer set has $m = N_G$ elements, each element of the outer set is itself a set with infinitely many elements, and for a particular vector $\theta$ only one element per group is active, that is, $c_{ij} \ne 0$ for only one $j$ among the infinitely many.

**Corollary 2** Consider the $\ell_1/\ell_2$ non-overlapping group sparse norm with $n > c_1 s_G l + c_2 s_G\log N_G$. With high probability,
$$
\nu = \sup_{u\in C}\ \big|\{i : y_i = \langle x_i, \theta^* + u\rangle\}\big| \le c\,(l s_G + s_G\log N_G) \quad (19)
$$
for some constant $c$.

The $k$-support norm $\|\theta\|^{sp}_k = \inf\{\sum_i \|u_i\|_2 : \sum_i u_i = \theta,\ \|u_i\|_0 \le k\}$ can similarly be expressed via
$$
\theta = \sum_{i=1}^{\binom{p}{k}}\sum_{j} c_{ij}\,\beta_{ij}, \quad (20)
$$
where $\beta_{ij}$ is a unit vector in the $k$-dimensional subspace $i$. The difference compared to the non-overlapping group sparse norm is that many of the $c_{ij}$ can now be non-zero in the inner sum. This is somewhat more complex than the group sparse norm, where the inner set becomes a singleton for a given $\theta$, but nothing changes in the analysis.

**Corollary 3** Consider the $k$-support norm with $n > c_1 s + c_2 s\log(p/k)$. With high probability,
$$
\nu = \sup_{u\in C}\ \big|\{i : y_i = \langle x_i, \theta^* + u\rangle\}\big| \le c\,(s + s\log(p/k)) \quad (21)
$$
for some constant $c$.

## 4 Structured Quantile Regression

In this section we present results for the key components in the general analysis framework of Banerjee et al. (2014), which we briefly described in Section 2. We start with results on the regularization parameter by analyzing condition (5), then establish sample complexity bounds under which the restricted strong convexity condition (8) is satisfied. Finally, an $\ell_2$ bound on the error is obtained using (10). We consider sub-Gaussian design matrices throughout. All results are in terms of Gaussian widths of sets and the norm compatibility constant. Results for the $\ell_1$, $\ell_1/\ell_2$ non-overlapping group sparse, and $k$-support norms are given for illustration.

### 4.1 Regularization Parameter $\lambda_n$

We analyze the bound in (5). In prior literature a bound has been established specifically for the $\ell_1$ norm in Belloni & Chernozhukov (2011) (see their Theorem 1). Below we consider the case of a general atomic norm and obtain a result in terms of the Gaussian width of the unit norm ball. The analysis follows a similar result for the regularization parameter in Banerjee et al. (2014) for the least squares case.

**Theorem 4.1** Let $X \in \mathbb{R}^{n\times p}$ be a design matrix with independent isotropic sub-Gaussian rows with sub-Gaussian norm $|||x_i|||_{\psi_2} \le \kappa$. Define $\Omega_R = \{u : R(u) \le 1\}$, the unit norm ball, and let $\phi = \sup_{u\ne 0} \|u\|_2/R(u)$. Then
$$
\mathbb{E}\big[R^*(\nabla L_\tau(\theta^*))\big] \le c\,\sqrt{\tau(1-\tau)}\ \frac{w(\Omega_R)}{\sqrt{n}}, \quad (22)
$$
where $c$ is a fixed constant depending only on the sub-Gaussian norm $\kappa$. Moreover, with probability at least $1 - c_1\exp\big(-(\tau/(c_2\phi\kappa))^2\big) - 2\exp(-2t^2)$,
$$
R^*(\nabla L_\tau(\theta^*)) \le c\ \frac{\sqrt{\tau(1-\tau)}\,w(\Omega_R) + t}{\sqrt{n}}, \quad (23)
$$
where $c_1, c_2$ are absolute constants and $t > 0$.

A major difference from the least squares setting is that the regularization parameter is independent of assumptions on the noise vector (see, for example, Theorems 3 and 4 in Banerjee et al. (2014), where the noise is explicitly assumed to be sub-Gaussian and homoscedastic and enters the analysis through $\|\omega\|_2$). This gives the flexibility of considering noise vectors which are heavy-tailed or heteroscedastic; indeed, the most interesting applications of quantile regression arise in such settings. Below we provide bounds on the regularization parameter for the different norms by substituting the known values of the Gaussian width of the unit norm ball. The result for the $\ell_1$ norm matches Theorem 1 in Belloni & Chernozhukov (2011).

**Corollary 4** If $R(\cdot)$ is the $\ell_1$ norm, with high probability
$$
R^*(\nabla L_\tau(\theta^*)) \le c\ \frac{\sqrt{\tau(1-\tau)\log p} + t}{\sqrt{n}}. \quad (24)
$$

**Corollary 5** If $R(\cdot)$ is the $\ell_1/\ell_2$ non-overlapping group sparse norm, with high probability
$$
R^*(\nabla L_\tau(\theta^*)) \le c\ \frac{\sqrt{l + \log N_G} + t}{\sqrt{n}}.
$$

**Corollary 6** If $R(\cdot)$ is the $k$-support norm, with high probability
$$
R^*(\nabla L_\tau(\theta^*)) \le c\ \frac{\sqrt{k + k\log(p/k)} + t}{\sqrt{n}}.
$$

### 4.2 Restricted Strong Convexity (RSC)

The loss needs to satisfy the RSC condition (8). Prior literature on structured quantile regression has not discussed the RSC condition explicitly, though Fan et al. (2014b) consider it for the quantile Huber loss. We start by providing intuition for the RSC formulation for the quantile loss. The RSC condition (8) on the error set $C$ evaluates to
$$
\inf_{u\in C}\ \frac{1}{n}\sum_{i=1}^n \int_0^{\langle x_i, u\rangle} \big(I(y_i - \langle x_i,\theta^*\rangle \le z) - I(y_i - \langle x_i,\theta^*\rangle \le 0)\big)\,dz. \quad (25)
$$
Let $\nu = \sup_{u\in C}|Z| = \sup_{u\in C}|\{i \,|\, y_i = \langle x_i, \theta^* + u\rangle\}|$ be the number of interpolated samples. For any $n < p$, if the model can interpolate all points, that is, $\nu = n$, then (25) evaluates to zero. In general, as shown in Section 3, $\nu$ is determined by the structure; for example, for the $\ell_1$ norm $\nu = O(s\log p)$ rather than the ambient dimensionality $p$. Thus the sum over $n$ points in (25) effectively reduces to the sum over the $(n - \nu)$ points which are not interpolated, and this ensures the RSC condition when $n > \nu$. The intuition behind the NIPS property of Section 3 thus shows up elegantly in the RSC condition.

In (25), let $\xi_i = y_i - \langle x_i, \theta^*\rangle$, let $v_i = \int_0^{\langle x_i, u\rangle}\big(I(\xi_i \le z) - I(\xi_i \le 0)\big)\,dz$, and consider $\frac{1}{n}\sum_{i=1}^n \mathbb{E}[v_i]$:
$$
\frac{1}{n}\sum_{i=1}^n \mathbb{E}[v_i] = \frac{1}{n}\sum_{i=1}^n \int_0^{\langle x_i,u\rangle}\big(F_i(\langle x_i,\theta^*\rangle + z) - F_i(\langle x_i,\theta^*\rangle)\big)\,dz = \frac{1}{n}\sum_{i=1}^n \int_0^{\langle x_i,u\rangle} f_i(\langle x_i,\theta^*\rangle)\,z\,dz + o(1) = \frac{1}{2n}\sum_{i=1}^n f_i(\langle x_i,\theta^*\rangle)\,\langle x_i, u\rangle^2 + o(1) \ge \frac{f}{2n}\|Xu\|_2^2 \ge \frac{f\kappa\,\|u\|_2^2}{2}.
$$
The first equality follows from the definition of the conditional cumulative distribution function $F_i$ of $y_i$ given $x_i$, the second from a Taylor series expansion, and the last two inequalities from the assumption $f \le f_i(\langle x_i,\theta^*\rangle)$ for all $i$ and from $(1/n)\|Xu\|_2^2 \ge \kappa\|u\|_2^2$, where $\kappa$ is the restricted eigenvalue (RE) constant. The RE condition is satisfied with the same order of sample complexity as required for the NIPS property. More generally, RSC is a condition on the minimum eigenvalue of the matrix $\frac{1}{n}\sum_{i=1}^n f_i(\langle x_i,\theta^*\rangle)\,x_i x_i^T$ restricted to the error set $C$. This quantity has also been considered in prior literature (see Section 4.2 in Koenker (2005) and the proof on page 121, and also condition D.1 in Belloni & Chernozhukov (2011)). While the above analysis is in expectation, the following result gives large deviation bounds for the same quantity.

**Theorem 4.2** Consider $X \in \mathbb{R}^{n\times p}$ with sub-Gaussian rows. Let $0 < f < f_i(\langle x_i, \theta^*\rangle)$ be a uniform lower bound on the conditional density for all $x_i$ in the support of $x$. Let $\kappa$ denote the RE constant satisfying $\frac{1}{n}\|Xu\|_2^2 \ge \kappa\|u\|_2^2$. Let the number of samples satisfy $n > c\,w^2(C)$, where $w(C)$ is the Gaussian width of the error set $C$ and $c$ is a constant.
Then, with probability at least $1 - \exp(-\tau^2/2) - \exp(-\phi_1^2 f\xi)$, where $\phi_1, \xi < 1$ and $\tau$ are constants,
$$
\inf_{u\in C}\ \delta L_\tau(\theta^*, u) \ge c_1\,\kappa f\,\|u\|_2^2, \quad (26)
$$
where $0 < c_1 < 1$ is a constant.

Below we provide results for the different norms. The sample complexity for the $\ell_1$ norm matches the result in Belloni & Chernozhukov (2011) (see their equation (2.10)).

**Corollary 7** For the $\ell_1$ norm with $n > c\,s\log p$, with high probability the following RSC condition is satisfied:
$$
\inf_{u\in C}\ \delta L_\tau(\theta^*, u) \ge c\,\kappa f\,\|u\|_2^2. \quad (27)
$$

**Corollary 8** For the $\ell_1/\ell_2$ non-overlapping group sparse norm with $n > c\,(l s_G + s_G\log N_G)$, with high probability the following RSC condition is satisfied:
$$
\inf_{u\in C}\ \delta L_\tau(\theta^*, u) \ge c\,\kappa f\,\|u\|_2^2. \quad (28)
$$

**Corollary 9** For the $k$-support norm with $n > c\,(s + s\log(p/k))$, with high probability the following RSC condition is satisfied:
$$
\inf_{u\in C}\ \delta L_\tau(\theta^*, u) \ge c\,\kappa f\,\|u\|_2^2. \quad (29)
$$

### 4.3 Recovery Bounds

Following the general framework outlined in Banerjee et al. (2014) (see their Theorem 2), we state the following high-probability bound on the $\ell_2$ norm of the error vector $\Delta = \hat{\theta} - \theta^*$.

**Theorem 4.3** For the quantile regression problem, when $\lambda_n \ge c_1\sqrt{\tau(1-\tau)}\,\frac{w(\Omega_R)}{\sqrt{n}}$ and $n > c_2\,w^2(C)$ for some constants $c_1, c_2$, then with high probability
$$
\|\hat{\theta} - \theta^*\|_2 \le c\ \frac{\sqrt{\tau(1-\tau)}\,\Psi(C)\,w(\Omega_R)}{f\kappa\,\sqrt{n}},
$$
where $\Psi(C)$ is the norm compatibility constant over the error set.

The $\ell_2$ norm of the error depends on the two terms $\sqrt{\tau(1-\tau)}$ and $f$. The $\sqrt{\tau(1-\tau)}$ term is smallest at the tails and hence has the effect of reducing the estimation error. But typically it is dominated by the lower bound $f$ on the density, which makes the estimate less precise in regions of low density. This is to be expected, as there are very few samples with which to make a precise estimate in low-density regions. Similar observations are made on page 72 of Koenker (2005), but those results are asymptotic, whereas we show non-asymptotic recovery bounds. Another aspect we reiterate here is the independence of the results from the form of the noise: all results make no assumptions on the noise apart from the lower bound on the noise density. Below we provide recovery bounds for the different norms considered in the paper.

**Corollary 10** For the $\ell_1$ norm, when $\lambda_n \ge c_1\sqrt{\frac{\tau(1-\tau)\log p}{n}}$ and $n > c_2\,s\log p$, with high probability
$$
\|\Delta\|_2 \le c\ \frac{\sqrt{\tau(1-\tau)}}{f\kappa}\sqrt{\frac{s\log p}{n}}. \quad (31)
$$

**Corollary 11** For the $\ell_1/\ell_2$ non-overlapping group sparse norm, when $\lambda_n > c_1\sqrt{\frac{l + \log N_G}{n}}$ and $n > c\,(l s_G + s_G\log N_G)$, with high probability
$$
\|\Delta\|_2 \le c\ \frac{\sqrt{\tau(1-\tau)}}{f\kappa}\sqrt{\frac{l s_G + s_G\log N_G}{n}}. \quad (32)
$$

**Corollary 12** For the $k$-support norm, when $\lambda_n > c_1\sqrt{\frac{k + k\log(p/k)}{n}}$ and $n > c\,(s + s\log(p/k))$, with high probability
$$
\|\Delta\|_2 \le c\ \frac{\sqrt{\tau(1-\tau)}}{f\kappa}\sqrt{\frac{s + s\log(p/k)}{n}}. \quad (33)
$$

## 5 Experiments

We perform simulations with synthetic data.

### 5.1 Phase Transition

Data is generated as $y = X\theta^* + \omega$, with $\theta^* = [\underbrace{1,\dots,1}_{6},\ \underbrace{0,\dots,0}_{p-6}] \in \mathbb{R}^p$ for the $\ell_1$ norm and $\theta^* = [\underbrace{1,\dots,1}_{5},\ \underbrace{1,\dots,1}_{5},\ \underbrace{1,\dots,1}_{5},\ \underbrace{0,\dots,0}_{5},\ \dots,\ \underbrace{0,\dots,0}_{5}]$ for the $\ell_1/\ell_2$ group sparse norm, with $p \in \{500, 750, 1000\}$. The noise $\omega_i \sim N(0, 0.25)$, $i \in [n]$, is Gaussian with zero mean and variance 0.25. The design matrix $X$ has rows drawn from $N(0, I_{p\times p})$, multivariate Gaussian with identity covariance. We vary $n \in \{10, 20, 30, \dots, 120, 130\}$. For each $n$ we generate 100 datasets, with the probability of success defined as the fraction of times we are able to faithfully estimate the true parameter. For $p = 500$ we run simulations for $\tau \in \{0.1, 0.5, 0.9\}$, and for $p \in \{750, 1000\}$ we run simulations only for $\tau = 0.5$. For the optimization we use the Alternating Direction Method of Multipliers (Boyd et al., 2010); the details of the updates can be found in the flare documentation (Li et al., 2015). The code was implemented in Python. The plots in Figure 1 clearly show a phase transition for both the $\ell_1$ and $\ell_1/\ell_2$ group sparse norms for all quantiles, exemplifying the NIPS property described earlier.

Figure 1: Probability of recovering the true parameter versus the rescaled sample size ($n/(s\log p)$ for the $\ell_1$ norm, top; $n/(l s_G + s_G\log N_G)$ for the $\ell_1/\ell_2$ group sparse norm, bottom). There is a sharp phase transition when the number of samples exceeds NIPS.

### 5.2 Robustness

We showcase the robustness enjoyed by quantile regression over ordinary least squares estimation against heavy-tailed noise and outliers. We consider the $\ell_1$ norm with $y = X\theta^* + \omega$ and $\theta^* = [\underbrace{1,\dots,1}_{6},\ \underbrace{0,\dots,0}_{494}]$. For heavy-tailed noise we consider the Student t-distribution with different degrees of freedom, with lower degrees of freedom corresponding to heavier-tailed data. To show robustness to outliers, we randomly pick a certain percentage of samples from the dataset and multiply the noise by 10, that is, $\omega_i = 10\,\omega_i$ for a certain proportion of the dataset; we vary the proportion of contamination from 2.5% to 15%. We fix $n = 200$ for this simulation. For both exercises we run 100 simulations and plot the mean and standard deviation of the estimation error $\|\hat{\theta} - \theta^*\|_2$. The plots in Figure 2 show (1) the estimation error against varying degrees of freedom of the Student t-distribution and (2) the estimation error against the percentage of contamination. The observations agree with the conventional wisdom on the robustness of the quantile regression estimator to heavy-tailed noise and outliers.

Figure 2: Estimation error of the Lasso and $\ell_1$-penalized quantile regression against the degrees of freedom of the Student t-distribution noise (top) and against the percentage of contamination (bottom). Quantile regression is robust to heavy-tailed noise and outliers.

## 6 Conclusions

The paper presents a general framework for the analysis of non-asymptotic error and structured recovery in norm-regularized quantile regression for any atomic norm. Our results are based on extending the general analysis framework outlined in Banerjee et al. (2014); Negahban et al. (2012) using insights from the geometry of the problem. In particular, we introduce the Number of Inter Polated Samples (NIPS) as the critical quantity determining the sample complexity for consistent recovery. We prove that once the number of samples crosses the NIPS threshold, we start recovering the true parameter. This phase transition phenomenon for norm-regularized quantile regression problems has not been discussed in prior literature. We also prove that NIPS is of the order of the square of the Gaussian width of the error set for many atomic norms, which is the same order as that for regularized least squares regression, and our results match those of previous work for the $\ell_1$ norm (Belloni & Chernozhukov, 2011).

**Acknowledgements:** We thank the reviewers for their valuable comments.
This work was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and NASA grant NNX12AQ39A.

## References

Alquier, P., Cottet, V., and Lecué, G. Estimation Bounds and Sharp Oracle Inequalities of Regularized Procedures with Lipschitz Loss Functions. arXiv:1702.01402, 2017.

Argyriou, Andreas, Foygel, Rina, and Srebro, Nathan. Sparse Prediction with the k-Support Norm. In Neural Information Processing Systems (NIPS), 2012.

Banerjee, Arindam, Chen, Sheng, Fazayeli, Farideh, and Sivakumar, Vidyashankar. Estimation with Norm Regularization. In Neural Information Processing Systems (NIPS), 2014.

Belloni, Alexandre and Chernozhukov, Victor. l1-Penalized Quantile Regression in High-Dimensional Sparse Models. The Annals of Statistics, 39(1):82-130, 2011.

Bickel, Peter J., Ritov, Yaacov, and Tsybakov, Alexandre B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.

Bogdan, Malgorzata, van den Berg, Ewout, Su, Weijie, and Candès, Emmanuel. Statistical Estimation and Testing via the Sorted L1 Norm. arXiv:1310.1969, 2013.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.

Candès, Emmanuel J. and Recht, Benjamin. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Candès, Emmanuel J. and Tao, Terence. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.

Chandrasekaran, Venkat, Recht, Benjamin, Parrilo, Pablo A., and Willsky, Alan S. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

Chatterjee, Soumyadeep, Chen, Sheng, and Banerjee, Arindam. Generalized Dantzig Selector: Application to the k-support Norm. In Advances in Neural Information Processing Systems, 2014.

Chen, Sheng and Banerjee, Arindam. Structured Estimation with Atomic Norms: General Bounds and Applications. In Advances in Neural Information Processing Systems, 2015.

Fan, Jianqing, Fan, Yingying, and Barut, Emre. Adaptive Robust Variable Selection. The Annals of Statistics, 42(4):324-351, 2014a.

Fan, Jianqing, Li, Quefeng, and Wang, Yuyan. Robust Estimation of High-Dimensional Mean Regression. arXiv:1410.2150, 2014b.

Gordon, Yehoram. Some Inequalities for Gaussian Processes and Applications. Israel Journal of Mathematics, 50(4):265-289, 1985.

Hsu, Daniel and Sabato, Sivan. Loss Minimization and Parameter Estimation with Heavy Tails. Journal of Machine Learning Research, 17(18):1-40, 2016.

Kai, B., Li, R., and Zou, H. New Efficient Estimation and Variable Selection Methods for Semiparametric Varying-Coefficient Partially Linear Models. The Annals of Statistics, 39:305-332, 2011.

Kato, Kengo. Group Lasso for High Dimensional Sparse Quantile Regression Models. arXiv:1103.1458, 2011.

Koenker, Roger. Quantile Regression. Cambridge University Press, 2005.

Lecué, Guillaume and Mendelson, Shahar. Sparse Recovery under Weak Moment Assumptions. arXiv:1401.2188, 2014.

Li, Xingguo, Zhao, Tuo, Yuan, Xiaoming, and Liu, Han. The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R. Journal of Machine Learning Research, 16(1):553-557, 2015.

Li, Y. J. and Zhu, J. L1-norm Quantile Regression. Journal of Computational and Graphical Statistics, 17:163-185, 2008.

Melnyk, Igor and Banerjee, Arindam. Estimating Structured Vector Autoregressive Models. In International Conference on Machine Learning (ICML), 2016.

Negahban, Sahand N., Ravikumar, Pradeep, Wainwright, Martin J., and Yu, Bin. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, 27(4):538-557, 2012.

Rudelson, Mark and Zhou, Shuheng. Reconstruction from Anisotropic Random Measurements. IEEE Transactions on Information Theory, 59(6):3434-3447, 2013.

Sivakumar, Vidyashankar, Banerjee, Arindam, and Ravikumar, Pradeep. Beyond Sub-Gaussian Measurements: High-Dimensional Structured Estimation with Sub-Exponential Designs. In Advances in Neural Information Processing Systems, 2015.

Talagrand, Michel. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.

Tibshirani, Robert. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1):267-288, 1996.

Tropp, Joel A. Convex Recovery of a Structured Signal from Independent Random Linear Measurements. In Sampling Theory, a Renaissance, 2015.

Vershynin, Roman. Estimation in High Dimensions: A Geometric Perspective. In Sampling Theory, a Renaissance, pp. 3-66. Birkhäuser, Basel, 2015.

Wang, Lan, Wu, Yichao, and Li, Runze. Quantile Regression for Analyzing Heterogeneity in Ultra-high Dimension. Journal of the American Statistical Association, 107:214-222, 2012.

Wu, Y. C. and Liu, Y. F. Variable Selection in Quantile Regression. Statistica Sinica, 19:801-817, 2009.

Zou, H. and Yuan, M. Composite Quantile Regression and the Oracle Model Selection Theory. The Annals of Statistics, 36:1108-1126, 2008.