# Estimation with Norm Regularization

Arindam Banerjee, Sheng Chen, Farideh Fazayeli, Vidyashankar Sivakumar
Department of Computer Science & Engineering, University of Minnesota, Twin Cities
{banerjee,shengc,farideh,sivakuma}@cs.umn.edu

**Abstract.** Analysis of non-asymptotic estimation error and structured statistical recovery based on norm-regularized regression, such as Lasso, needs to consider four aspects: the norm, the loss function, the design matrix, and the noise model. This paper presents generalizations of such estimation error analysis on all four aspects. We characterize the restricted error set, establish relations between the error sets for the constrained and regularized problems, and present an estimation error bound applicable to any norm. Precise characterizations of the bound are presented for a variety of noise models, design matrices (including sub-Gaussian, anisotropic, and dependent samples), and loss functions (including least squares and generalized linear models). Gaussian width, a geometric measure of the size of sets, and associated tools play a key role in our generalized analysis.

## 1 Introduction

Over the past decade, progress has been made in developing non-asymptotic bounds on the estimation error of structured parameters based on norm-regularized regression. Such estimators are usually of the form [16, 9, 3]:

$$\hat{\theta}_{\lambda_n} = \underset{\theta \in \mathbb{R}^p}{\operatorname{argmin}} \; \mathcal{L}(\theta; Z^n) + \lambda_n R(\theta), \qquad (1)$$

where $R(\theta)$ is a suitable norm, $\mathcal{L}(\cdot)$ is a suitable loss function, $Z^n = \{(y_i, X_i)\}_{i=1}^n$ with $y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^p$ is the training set, and $\lambda_n > 0$ is a regularization parameter. The optimal parameter $\theta^*$ is often assumed to be structured, usually characterized as having a low value according to some norm $R(\cdot)$. Since $\hat{\theta}_{\lambda_n}$ is an estimate of the optimal structured $\theta^*$, the focus has been on bounding a suitable function of the error vector $\hat{\Delta}_n = \hat{\theta}_{\lambda_n} - \theta^*$, e.g., the $L_2$ norm $\|\hat{\Delta}_n\|_2$.

To understand the state of the art on non-asymptotic bounds on the estimation error for norm-regularized regression, four aspects of (1) need to be considered: (i) the norm $R(\theta)$, (ii) properties of the design matrix $X \in \mathbb{R}^{n \times p}$, (iii) the loss function $\mathcal{L}(\cdot)$, and (iv) the noise model, typically in terms of $\omega = y - E[y|x]$. Most of the literature has focused on a linear model, $y = X\theta^* + \omega$, and a squared loss function, $\mathcal{L}(\theta; Z^n) = \frac{1}{n}\|y - X\theta\|_2^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \langle \theta, X_i \rangle)^2$. Early work on such estimators focused on the $L_1$ norm [21, 20, 8], and led to sufficient conditions on the design matrix $X$, including the restricted isometry property (RIP) and restricted eigenvalue (RE) conditions [2, 9, 13, 3]. While much of the development has focused on isotropic Gaussian design matrices, recent work has extended the analysis for the $L_1$ norm to correlated Gaussian designs [13] as well as anisotropic sub-Gaussian design matrices [14]. Building on such development, [9] presents a unified framework for the case of decomposable norms and also considers generalized linear models (GLMs) for certain norms such as $L_1$. Two key insights are offered in [9]: first, for suitably large $\lambda_n$, the error vector $\hat{\Delta}_n$ lies in a restricted set, a cone or a star; and second, on the restricted error set, the loss function needs to satisfy restricted strong convexity (RSC), a generalization of the RE condition, for the analysis to work out. For isotropic Gaussian design matrices, additional progress has been made.
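To fix ideas, here is a minimal sketch of an estimator of the form (1) for its most common instantiation, squared loss with $R = \ell_1$ (the Lasso), solved by proximal gradient descent (ISTA). The solver, step size, problem sizes, and the choice $\lambda_n \propto \sqrt{\log p / n}$ (which anticipates Section 3) are illustrative, not prescribed by the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/n)||y - X theta||_2^2 + lam * ||theta||_1 via ISTA."""
    n, p = X.shape
    # Step size from the Lipschitz constant of the gradient of the loss.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 / n * X.T @ (y - X @ theta)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(0)
n, p, s = 100, 500, 5
theta_star = np.zeros(p); theta_star[:s] = 1.0       # s-sparse truth
X = rng.standard_normal((n, p))                      # isotropic Gaussian design
y = X @ theta_star + 0.1 * rng.standard_normal(n)    # Gaussian noise
lam = 0.2 * np.sqrt(np.log(p) / n)                   # lambda_n ~ w(Omega_R)/sqrt(n) scale
theta_hat = lasso_ista(X, y, lam)
print("||error||_2 =", np.linalg.norm(theta_hat - theta_star))
```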
[4] considers a constrained estimation formulation for all atomic norms, where the gain condition, equivalent to the RE condition, uses Gordon's inequality [5, 7] and is succinctly represented in terms of the Gaussian width of the intersection of the cone of the error set and a unit ball/sphere. [11] considers three related formulations for generalized Lasso problems, and establishes recovery guarantees based on Gordon's inequality and quantities related to the Gaussian width. Sharper analysis for recovery has been considered in [1], yielding a precise characterization of phase transition behavior using quantities related to the Gaussian width. [12] considers a linear programming estimator in a 1-bit compressed sensing setting and, interestingly, the concept of Gaussian width shows up in the analysis. In spite of these advances, most of the results are restricted to isotropic Gaussian design matrices.

In this paper, we consider structured estimation problems with norm regularization, which substantially generalize existing results on all four pertinent aspects: the norm, the design matrix, the loss, and the noise model. The analysis we present applies to all norms. We characterize the structure of the error set for all norms, develop precise relationships between the error sets of the regularized and constrained versions [2], and establish an estimation error bound in Section 2. The bound depends on the regularization parameter $\lambda_n$ and a certain RSC condition constant $\kappa$. In Section 3, for both Gaussian and sub-Gaussian noise $\omega$, we develop suitable characterizations for $\lambda_n$ in terms of the Gaussian width of the unit norm ball $\Omega_R = \{u : R(u) \leq 1\}$. In Section 4, we characterize the RSC condition for any norm, considering two families of design matrices $X \in \mathbb{R}^{n \times p}$, Gaussian and sub-Gaussian, and three settings for each family: independent isotropic designs, independent anisotropic designs where the rows are correlated as $\Sigma_{p \times p}$, and dependent isotropic designs where the rows are isotropic but the columns are correlated as $\Gamma_{n \times n}$, implying dependent samples. In Section 5, we show how to extend the analysis to generalized linear models (GLMs) with sub-Gaussian design matrices and any norm. Our analysis techniques are simple and largely uniform across different types of noise and design matrices. Parts of our analysis are geometric, where Gaussian widths, as a measure of the size of suitable sets, and associated tools play a key role [4, 7]. We also use standard covering arguments, use the Sudakov-Dudley inequality to switch from covering numbers to Gaussian widths [7], and use generic chaining to upper bound sub-Gaussian widths with Gaussian widths [15].
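Since the Gaussian width drives most of the bounds that follow, it helps to see that it is directly computable by Monte Carlo whenever the support function $\sup_{a \in A}\langle a, g\rangle$ of $A$ has a closed form; for a unit norm ball, the support function is the dual norm. A small sketch (the sample sizes are arbitrary choices):

```python
import numpy as np

def gaussian_width(support_fn, p, n_samples=20000, seed=0):
    """Monte Carlo estimate of w(A) = E_g[ sup_{a in A} <a, g> ].

    support_fn(g) must return sup_{a in A} <a, g> for one Gaussian
    vector g; for a unit norm ball A this is the dual norm of g.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_samples, p))
    return np.mean([support_fn(g) for g in G])

p = 1000
# A = unit L1 ball: support function is ||g||_inf, so w ~ sqrt(2 log p).
w_l1 = gaussian_width(lambda g: np.max(np.abs(g)), p)
# A = unit L_inf ball: support function is ||g||_1, so w ~ p*sqrt(2/pi).
w_linf = gaussian_width(lambda g: np.sum(np.abs(g)), p)
print(f"w(B_1)   = {w_l1:.2f}  vs sqrt(2 log p) = {np.sqrt(2*np.log(p)):.2f}")
print(f"w(B_inf) = {w_linf:.1f} vs p*sqrt(2/pi) = {p*np.sqrt(2/np.pi):.1f}")
```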
## 2 Restricted Error Set and Recovery Guarantees

In this section, we give a characterization of the restricted error set $E_r$ in which the error vector $\hat{\Delta}_n$ lives, establish clear relationships between the error sets for the regularized and constrained problems, and finally establish upper bounds on the estimation error. The error bound is deterministic, but has quantities which involve $\theta^*$, $X$, $\omega$, for which we develop high-probability bounds in Sections 3, 4, and 5.

### 2.1 The Restricted Error Set and the Error Cone

We start with a characterization of the restricted error set $E_r$ where $\hat{\Delta}_n$ will belong.

**Lemma 1.** For any $\beta > 1$, assuming

$$\lambda_n \geq \beta R^*\!\left(\nabla \mathcal{L}(\theta^*; Z^n)\right), \qquad (2)$$

the error vector $\hat{\Delta}_n = \hat{\theta}_{\lambda_n} - \theta^*$ belongs to the set

$$E_r = E_r(\theta^*, \beta) = \left\{ \Delta \in \mathbb{R}^p \,\middle|\, R(\theta^* + \Delta) \leq R(\theta^*) + \tfrac{1}{\beta} R(\Delta) \right\}. \qquad (3)$$

The restricted error set $E_r$ need not be convex for general norms. Interestingly, for $\beta = 1$, the inequality in (3) is just the triangle inequality, and is satisfied by all $\Delta$. Note that $\beta > 1$ restricts the set of $\Delta$ which satisfy the inequality, yielding the restricted error set. In particular, $\Delta$ cannot go in the direction of $\theta^*$, i.e., $\Delta \neq \alpha\theta^*$ for any $\alpha > 0$. Further, note that the condition in (2) is similar to that in [9] for $\beta = 2$, but the above characterization holds for any norm, not just decomposable norms [9]. While $E_r$ need not be a convex set, we establish a relationship between $E_r$ and $C_c$, the cone for the constrained problem [4], where

$$C_c = C_c(\theta^*) = \operatorname{cone}\left\{ \Delta \in \mathbb{R}^p \,\middle|\, R(\theta^* + \Delta) \leq R(\theta^*) \right\}. \qquad (4)$$

**Theorem 1.** Let $A_r = E_r \cap \rho B_2^p$ and $A_c = C_c \cap \rho B_2^p$, where $B_2^p = \{u \,|\, \|u\|_2 \leq 1\}$ is the unit ball of the $\ell_2$ norm and $\rho > 0$ is any suitable radius. Then, for any $\beta > 1$ we have

$$w(A_r) \leq \left(1 + \frac{2}{(\beta - 1)\|\theta^*\|_2}\right) w(A_c), \qquad (5)$$

where $w(A)$ denotes the Gaussian width of any set $A$, given by $w(A) = E_g\left[\sup_{a \in A} \langle a, g \rangle\right]$, where $g$ is an isotropic Gaussian random vector.

Thus, the Gaussian widths of the error sets of the regularized and constrained problems are closely related. In particular, for $\|\theta^*\|_2 = 1$, with $\rho = 1$, $\beta = 2$, we have $w(A_r) \leq 3\,w(A_c)$. Related observations have been made for the special case of the $L_1$ norm [2], although past work did not provide an explicit characterization in terms of Gaussian widths. The result also suggests that it is possible to move between the error analyses of the regularized and the constrained versions of the estimation problem.

### 2.2 Recovery Guarantees

In order to establish recovery guarantees, we start by assuming that restricted strong convexity (RSC) is satisfied by the loss function in $C_r = \operatorname{cone}(E_r)$, i.e., for any $\Delta \in C_r$, there exists a suitable constant $\kappa$ so that

$$\delta\mathcal{L}(\Delta, \theta^*) \triangleq \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), \Delta \rangle \geq \kappa \|\Delta\|_2^2. \qquad (6)$$

In Sections 4 and 5, we establish precise forms of the RSC condition for a wide variety of design matrices and loss functions. In order to establish recovery guarantees, we focus on the quantity

$$F(\Delta) = \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n\left( R(\theta^* + \Delta) - R(\theta^*) \right). \qquad (7)$$

Since $\hat{\theta}_{\lambda_n} = \theta^* + \hat{\Delta}_n$ is the estimated parameter, i.e., $\hat{\theta}_{\lambda_n}$ is the minimum of the objective, we clearly have $F(\hat{\Delta}_n) \leq 0$, which implies a bound on $\|\hat{\Delta}_n\|_2$. Unlike previous results, the bound can be established without making any additional assumptions on the norm $R(\theta)$. We start with the following result, which expresses the upper bound on $\|\hat{\Delta}_n\|_2$ in terms of the gradient of the objective at $\theta^*$.

**Lemma 2.** Assume that the RSC condition is satisfied in $C_r$ by the loss $\mathcal{L}(\cdot)$ with parameter $\kappa$. With $\hat{\Delta}_n = \hat{\theta}_{\lambda_n} - \theta^*$, for any norm $R(\cdot)$, we have

$$\kappa \|\hat{\Delta}_n\|_2 \leq \left\| \nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*) \right\|_2, \qquad (8)$$

where $\nabla R(\cdot)$ is any subgradient of the norm $R(\cdot)$.

Note that the right-hand side is simply the $L_2$ norm of the (sub)gradient of the objective evaluated at $\theta^*$. For the special case when $\hat{\theta}_{\lambda_n} = \theta^*$, the gradient of the objective is zero, implying correctly that $\|\hat{\Delta}_n\|_2 = 0$. While the above result provides useful insights about the bound on $\|\hat{\Delta}_n\|_2$, the quantities on the right-hand side depend on $\theta^*$, which is unknown. We present another form of the result in terms of quantities such as $\lambda_n$, $\kappa$, and the norm compatibility constant $\Psi(C_r) = \sup_{u \in C_r} \frac{R(u)}{\|u\|_2}$, which are often easier to compute or bound.

**Theorem 2.** Assume that the RSC condition is satisfied in $C_r$ by the loss $\mathcal{L}(\cdot)$ with parameter $\kappa$. With $\hat{\Delta}_n = \hat{\theta}_{\lambda_n} - \theta^*$, for any norm $R(\cdot)$, we have

$$\|\hat{\Delta}_n\|_2 \leq \frac{1 + \beta}{\beta}\,\frac{\lambda_n}{\kappa}\,\Psi(C_r). \qquad (9)$$

The above result is deterministic, but contains $\lambda_n$ and $\kappa$. In Section 3, we give precise characterizations of $\lambda_n$, which needs to satisfy (2). In Sections 4 and 5, we characterize the RSC condition constant $\kappa$ for different losses and a variety of design matrices.
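As a quick numeric illustration of Lemma 1, the membership test (3) can be evaluated directly; the choices $R = \ell_1$, $\beta = 2$, and a sparse $\theta^*$ below are arbitrary. Consistent with the discussion after Lemma 1, directions along $\theta^*$ are excluded from $E_r$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, beta = 50, 2.0
theta_star = np.zeros(p); theta_star[:5] = 1.0
R = lambda v: np.sum(np.abs(v))            # R = L1 norm for illustration

def in_Er(delta):
    """Membership test for the restricted error set (3)."""
    return R(theta_star + delta) <= R(theta_star) + R(delta) / beta

# Directions along theta* fail the test, as noted after Lemma 1 ...
print(in_Er(0.5 * theta_star))             # False
# ... while shrinking theta* on its support satisfies it.
print(in_Er(-0.5 * theta_star))            # True
# Random dense directions are typically excluded for beta > 1:
hits = np.mean([in_Er(rng.standard_normal(p)) for _ in range(10000)])
print(f"fraction of random Gaussian directions in E_r: {hits:.4f}")
```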
## 3 Bounds on the Regularization Parameter

Recall that the parameter $\lambda_n$ needs to satisfy the inequality

$$\lambda_n \geq \beta R^*\!\left(\nabla\mathcal{L}(\theta^*; Z^n)\right). \qquad (10)$$

The right-hand side of the inequality has two issues: it depends on $\theta^*$, and it is a random variable, since it depends on $Z^n$. In this section, we characterize $E[R^*(\nabla\mathcal{L}(\theta^*; Z^n))]$ in terms of the Gaussian width of the unit norm ball $\Omega_R = \{u : R(u) \leq 1\}$, and also discuss large-deviation bounds around the expectation. For ease of exposition, we present results for the case of squared loss, i.e., $\mathcal{L}(\theta^*; Z^n) = \frac{1}{2n}\|y - X\theta^*\|_2^2$ with the linear model $y = X\theta^* + \omega$, where $\omega$ can be Gaussian or sub-Gaussian noise. For this setting, $\nabla\mathcal{L}(\theta^*; Z^n) = -\frac{1}{n}X^T(y - X\theta^*) = -\frac{1}{n}X^T\omega$. The analysis can be extended to GLMs, using analysis techniques discussed in Section 5.

**Gaussian Designs:** First, we consider a Gaussian design $X$, where $x_{ij} \sim N(0,1)$ are independent, and $\omega$ is elementwise independent Gaussian or sub-Gaussian noise.

**Theorem 3.** Let $\Omega_R = \{u : R(u) \leq 1\}$. Then, for Gaussian design $X$ and Gaussian or sub-Gaussian noise $\omega$, for a suitable constant $\eta_0 > 0$, we have

$$E[R^*(\nabla\mathcal{L}(\theta^*; Z^n))] \leq \frac{\eta_0}{\sqrt{n}}\, w(\Omega_R). \qquad (11)$$

Further, for any $\tau > 0$, for suitable constants $\eta_1, \eta_2 > 0$, with probability at least $1 - \eta_1\exp(-\eta_2\tau^2)$,

$$R^*(\nabla\mathcal{L}(\theta^*; Z^n)) \leq \frac{\eta_0\, w(\Omega_R) + \tau}{\sqrt{n}}. \qquad (12)$$

For an anisotropic Gaussian design, i.e., when the rows of $X \in \mathbb{R}^{n \times p}$ have covariance $\Sigma_{p \times p}$, the above result continues to hold with $w(\Omega_R)$ replaced by $\sqrt{\Lambda_{\max}(\Sigma)}\, w(\Omega_R)$, where $\Lambda_{\max}(\Sigma)$ denotes the operator norm (largest eigenvalue). For a correlated isotropic design, i.e., when the columns of $X$, viewed as vectors in $\mathbb{R}^n$, have covariance $\Gamma_{n \times n}$, the result continues to hold with $w(\Omega_R)$ replaced by $\sqrt{\Lambda_{\max}(\Gamma)}\, w(\Omega_R)$.

**Sub-Gaussian Designs:** Recall that for a sub-Gaussian variable $x$, the sub-Gaussian norm is $|||x|||_{\psi_2} = \sup_{p \geq 1} \frac{1}{\sqrt{p}}\left(E[|x|^p]\right)^{1/p}$ [18]. Now, we consider a sub-Gaussian design $X$, where $|||x_{ij}|||_{\psi_2} \leq k$ and the $x_{ij}$ are i.i.d., and $\omega$ is elementwise independent Gaussian or sub-Gaussian noise.

**Theorem 4.** Let $\Omega_R = \{u : R(u) \leq 1\}$. Then, for sub-Gaussian design $X$ and Gaussian or sub-Gaussian noise $\omega$, for a suitable constant $\eta_0 > 0$, we have

$$E[R^*(\nabla\mathcal{L}(\theta^*; Z^n))] \leq \frac{\eta_0}{\sqrt{n}}\, w(\Omega_R). \qquad (13)$$

Interestingly, the analysis for the result above involves the sub-Gaussian width, which can be upper bounded by a constant times the Gaussian width using generic chaining [15]. Further, one can get Gaussian-like exponential concentration around the expectation for important classes of sub-Gaussian random variables, including bounded random variables [6], and when the $X_u = \langle h, u \rangle$, where $u$ is any unit vector, are such that their Malliavin derivatives have almost surely bounded norm in $L^2[0,1]$, i.e., $\int_0^1 |D_r X_u|^2\, dr \leq \eta$ [19].

Next, we provide a mechanism for bounding the Gaussian width $w(\Omega_R)$ of the unit norm ball in terms of the Gaussian width of a suitable cone, obtained by shifting or translating the norm ball. In particular, the result involves taking any point on the boundary of the unit norm ball, considering that as the origin, and constructing a cone using the norm ball. Since such a construction can be done with any point on the boundary, the tightest bound is obtained by taking the infimum over all points on the boundary. The motivation behind upper bounding the Gaussian width $w(\Omega_R)$ of the unit norm ball by the Gaussian width of such a cone is that considerable advances have been made in recent years in upper bounding the Gaussian widths of such cones.

**Lemma 3.** Let $\Omega_R = \{u : R(u) \leq 1\}$ be the unit norm ball and $\Theta_R = \{u : R(u) = 1\}$ its boundary. For any $\tilde{\theta} \in \Theta_R$, let $\rho(\tilde{\theta}) = \sup_{\theta : R(\theta) \leq 1} \|\theta - \tilde{\theta}\|_2$ be the diameter of $\Omega_R$ measured with respect to $\tilde{\theta}$, and let $G(\tilde{\theta}) = \operatorname{cone}(\Omega_R - \tilde{\theta}) \cap \rho(\tilde{\theta}) B_2^p$, i.e., the cone of $(\Omega_R - \tilde{\theta})$ intersected with the ball of radius $\rho(\tilde{\theta})$. Then

$$w(\Omega_R) \leq \inf_{\tilde{\theta} \in \Theta_R} w(G(\tilde{\theta})). \qquad (14)$$
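Closing the section with a sanity check of Theorem 3 in the simplest case: for $R = \ell_1$ we have $R^* = \ell_\infty$, so $R^*(\nabla\mathcal{L}(\theta^*; Z^n)) = \frac{1}{n}\|X^T\omega\|_\infty$, which should concentrate at the $w(\Omega_R)/\sqrt{n} \approx \sqrt{2\log p / n}$ scale. The sizes and trial counts below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 200, 2000, 200
vals = []
for _ in range(trials):
    X = rng.standard_normal((n, p))   # isotropic Gaussian design
    w = rng.standard_normal(n)        # Gaussian noise, unit variance
    # R = L1, so R* = L_inf and R*(grad L) = ||X^T w||_inf / n
    vals.append(np.max(np.abs(X.T @ w)) / n)
print("E[R*(grad L)] ~ %.4f" % np.mean(vals))
print("w(Omega_R)/sqrt(n) ~ sqrt(2 log p / n) = %.4f"
      % np.sqrt(2 * np.log(p) / n))
```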
## 4 Least Squares Models: Restricted Eigenvalue Conditions

When the loss function is the squared loss, i.e., $\mathcal{L}(\theta; Z^n) = \frac{1}{2n}\|y - X\theta\|_2^2$, the RSC condition (6) becomes equivalent to the restricted eigenvalue (RE) condition [2, 9], i.e., $\frac{1}{n}\|X\Delta\|_2^2 \geq \kappa\|\Delta\|_2^2$, or equivalently $\|X\Delta\|_2^2 \geq \kappa n \|\Delta\|_2^2$, for any $\Delta$ in the error cone $C_r$. Since the absolute magnitude of $\|\Delta\|_2$ does not play a role in the RE condition, without loss of generality we work with unit vectors $u \in A = C_r \cap S^{p-1}$, where $S^{p-1}$ is the unit sphere. In this section, we establish RE conditions for a variety of Gaussian and sub-Gaussian design matrices, with isotropic, anisotropic, or dependent rows, i.e., when the samples (rows of $X$) are correlated. Results for certain types of design matrices for certain types of norms, especially the $L_1$ norm, have appeared in the literature [2, 13, 14]. Our analysis considers a wider variety of design matrices and establishes RSC conditions for any $A \subseteq S^{p-1}$, thus corresponding to any norm. Interestingly, the Gaussian width $w(A)$ of $A$ shows up in all the bounds, as a geometric measure of the size of the set $A$, even for sub-Gaussian design matrices. In fact, all existing RE results do implicitly have the width term, but in a form specific to the chosen norm [13, 14]. The analysis of atomic norms in [4] has the $w(A)$ term explicitly, but the analysis relies on Gordon's inequality [5, 7], which is applicable only to isotropic Gaussian design matrices. The proof technique we use is simple, a standard covering argument, and is largely the same across all the cases considered. A unique aspect of our analysis, used in all the proofs, is a way to go from covering numbers of $A$ to the Gaussian width of $A$ using the Sudakov-Dudley inequality [7]. Our general techniques are in sharp contrast to much of the existing literature on RE conditions, which commonly uses specialized tools such as Gaussian comparison principles [13, 9] and/or specialized analysis geared to a particular norm such as $L_1$ [14].

### 4.1 Restricted Eigenvalue Conditions: Gaussian Designs

In this section, we focus on the case of Gaussian design matrices $X \in \mathbb{R}^{n \times p}$, and consider three settings: (i) independent-isotropic, where the entries are elementwise independent; (ii) independent-anisotropic, where the rows $X_i$ are independent but each row has covariance $E[X_i X_i^T] = \Sigma \in \mathbb{R}^{p \times p}$; and (iii) dependent-isotropic, where the rows are isotropic but the columns $X^j$ are correlated with $E[X^j (X^j)^T] = \Gamma \in \mathbb{R}^{n \times n}$. For convenience, we assume $E[x_{ij}^2] = 1$, noting that the analysis easily extends to the general case of $E[x_{ij}^2] = \sigma^2$.

**Independent Isotropic Gaussian (IIG) Designs:** The IIG setting has been extensively studied in the literature [3, 9]. As discussed in the recent work on atomic norms [4], one can use Gordon's inequality [5, 7] to get RE conditions for the IIG setting. Our goal in this section is two-fold: first, we present the RE conditions obtained using our simple proof technique, and show that they are equivalent, up to constants, to the RE condition obtained using Gordon's inequality, an arguably heavy-duty technique only applicable to the IIG setting; and second, we go over some facets of how we present the results, which will apply to all subsequent RE-style results as well as give a way to plug $\kappa$ into the estimation error bound in (9).
**Theorem 5.** Let the design matrix $X \in \mathbb{R}^{n \times p}$ be elementwise independent and normal, i.e., $x_{ij} \sim N(0,1)$. Then, for any $A \subseteq S^{p-1}$, any $n \geq 2$, and any $\tau > 0$, with probability at least $1 - \eta_1\exp(-\eta_2\tau^2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \frac{1}{2}\sqrt{n} - \eta_0\, w(A) - \tau, \qquad (15)$$

where $\eta_0, \eta_1, \eta_2 > 0$ are absolute constants.

We consider the equivalent result one could obtain by directly using Gordon's inequality [5, 7]:

**Theorem 6.** Let the design matrix $X$ be elementwise independent and normal, i.e., $x_{ij} \sim N(0,1)$. Then, for any $A \subseteq S^{p-1}$ and any $\tau > 0$, with probability at least $1 - 2\exp(-\tau^2/2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \gamma_n - w(A) - \tau, \qquad (16)$$

where $\gamma_n = E[\|h\|_2] > \frac{n}{\sqrt{n+1}}$ is the expected length of a Gaussian random vector in $\mathbb{R}^n$.

Interestingly, the results are equivalent, up to constants. However, unlike Gordon's inequality, our proof technique generalizes to all the other design matrices considered in the sequel. We emphasize three additional aspects in the context of the above analysis, which will continue to hold for all the subsequent results but will not be discussed explicitly. First, to get a form of the result which can be used as $\kappa$ and plugged into the estimation error bound (9), one can simply choose $\tau = \frac{1}{2}\left(\frac{1}{2}\sqrt{n} - \eta_0 w(A)\right)$ so as to get

$$\inf_{u \in A} \|Xu\|_2 \geq \frac{1}{4}\sqrt{n} - \frac{\eta_0}{2}\, w(A), \qquad (17)$$

with high probability. Table 1 shows a summary of recovery bounds for independent isotropic Gaussian design matrices with Gaussian noise. Second, the result does not depend on the fact that $u \in A \subseteq C_r \cap S^{p-1}$, so that $\|u\|_2 = 1$. For example, one can consider the cone $C_r$ intersecting a sphere $\rho S^{p-1}$ of a different radius $\rho$, giving $A_\rho = C_r \cap \rho S^{p-1}$ so that $u \in A_\rho$ has $\|u\|_2 = \rho$. For simplicity, let $A = A_1$, i.e., corresponding to $\rho = 1$. Then a straightforward extension yields $\inf_{u \in A_\rho} \|Xu\|_2 \geq \left(\frac{1}{2}\sqrt{n} - \eta_0 w(A) - \tau\right)\|u\|_2$ with probability at least $1 - \eta_1\exp(-\eta_2\tau^2)$, since $\|Xu\|_2 = \left\|X\frac{u}{\|u\|_2}\right\|_2 \|u\|_2$ and $w(\|u\|_2 A) = \|u\|_2\, w(A)$ [4]. Such scale independence is in fact necessary for the error bound analysis in Section 2. Finally, note that the leading constant $\frac{1}{2}$ was a consequence of our choice of $\epsilon = \frac{1}{4}$ for the $\epsilon$-net covering of $A$ in the proof. One can get other constants, less than 1, with different choices of $\epsilon$, and the constants $\eta_0, \eta_1, \eta_2$ will change based on this choice.

**Independent Anisotropic Gaussian (IAG) Designs:** We consider a setting where the rows $X_i$ of the design matrix are independent, but each row is sampled from an anisotropic Gaussian distribution, i.e., $X_i \sim N(0, \Sigma_{p \times p})$ with $X_i \in \mathbb{R}^p$. The setting has been considered in the literature [13] for the special case of the $L_1$ norm, and sharp results have been established using Gaussian comparison techniques [7]. We show that equivalent results can be obtained by our simple technique, which does not rely on Gaussian comparisons [7, 9].

**Theorem 7.** Let the design matrix $X$ be row-wise independent with each row $X_i \sim N(0, \Sigma_{p \times p})$. Then, for any $A \subseteq S^{p-1}$ and any $\tau > 0$, with probability at least $1 - \eta_1\exp(-\eta_2\tau^2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \frac{1}{2}\nu\sqrt{n} - \eta_0\sqrt{\Lambda_{\max}(\Sigma)}\, w(A) - \tau, \qquad (18)$$

where $\nu = \inf_{u \in A}\|\Sigma^{1/2}u\|_2$, $\sqrt{\Lambda_{\max}(\Sigma)}$ denotes the largest eigenvalue of $\Sigma^{1/2}$, and $\eta_0, \eta_1, \eta_2 > 0$ are constants.

A comparison with the results of [13] is instructive. The leading term $\nu$ appears in [13] as well; we have simply taken $\inf_{u \in A}$ on both sides, whereas the result in [13] is stated for any $u$ with the $\|\Sigma^{1/2}u\|_2$ term. The second term in [13] depends on the largest entry on the diagonal of $\Sigma$, $\log p$, and $\|u\|_1$. These terms are a consequence of the special-case analysis for the $L_1$ norm. In contrast, we consider the general case and simply get the scaled Gaussian width term $\sqrt{\Lambda_{\max}(\Sigma)}\, w(A)$.
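Before moving to dependent designs, the IIG bound (15) can be probed exactly in a small example: for $A$ the set of unit-norm $s$-sparse vectors (a superset of the $\ell_1$ error cone intersected with the sphere), $\inf_{u \in A}\|Xu\|_2$ equals the minimum smallest singular value over all $n \times s$ column submatrices, and $w(A) \leq \sqrt{c\, s\log(2p/s)}$ up to constants. The code below sets $c = 1$ and the leading constant to $\frac{1}{2}$ purely for illustration:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, s = 60, 20, 3
X = rng.standard_normal((n, p))   # isotropic Gaussian design

# For A = {unit-norm s-sparse u}, inf_{u in A} ||Xu||_2 is the minimum
# over all supports S of the smallest singular value of X[:, S].
inf_Xu = min(np.linalg.svd(X[:, list(S)], compute_uv=False)[-1]
             for S in combinations(range(p), s))

w_A = np.sqrt(s * np.log(2 * p / s))   # standard width bound for sparse vectors
print(f"inf ||Xu||_2                 = {inf_Xu:.2f}")
print(f"sqrt(n)/2 - w(A)  (with c=1) = {0.5*np.sqrt(n) - w_A:.2f}")
```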
**Dependent Isotropic Gaussian (DIG) Designs:** We now consider a setting where the rows of the design matrix $X$ are isotropic Gaussians, but the columns $X^j$ are correlated with $E[X^j (X^j)^T] = \Gamma \in \mathbb{R}^{n \times n}$. Interestingly, correlation structure over the columns makes the samples dependent, a scenario which has not yet been widely studied in the literature [22, 10]. We show that our simple technique continues to work in this scenario and gives a rather intuitive result.

**Theorem 8.** Let $X \in \mathbb{R}^{n \times p}$ be a matrix whose rows $X_i$ are isotropic Gaussian random vectors in $\mathbb{R}^p$ and whose columns $X^j$ are correlated with $E[X^j (X^j)^T] = \Gamma$. Then, for any set $A \subseteq S^{p-1}$ and any $\tau > 0$, with probability at least $1 - \eta_1\exp(-\eta_2\tau^2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \frac{1}{3}\sqrt{\operatorname{Tr}(\Gamma)} - \sqrt{\Lambda_{\max}(\Gamma)}\left(\eta_0\, w(A) + 5\tau\right), \qquad (19)$$

where $\eta_0, \eta_1, \eta_2 > 0$ are constants.

Note that with the assumption that $E[x_{ij}^2] = 1$, $\Gamma$ will be a correlation matrix, implying $\operatorname{Tr}(\Gamma) = n$ and making the sample-size dependence explicit. Intuitively, due to sample correlations, $n$ samples are effectively equivalent to $\frac{\operatorname{Tr}(\Gamma)}{\Lambda_{\max}(\Gamma)} = \frac{n}{\Lambda_{\max}(\Gamma)}$ samples.
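The effective sample size $\operatorname{Tr}(\Gamma)/\Lambda_{\max}(\Gamma)$ can be made concrete with a small simulation (not from the paper): take $\Gamma$ to be an AR(1) correlation matrix over the $n$ samples, generate $X = \Gamma^{1/2}Z$ with $Z$ elementwise standard Gaussian, and compare the spread of $\|Xu\|_2$ for a fixed unit direction $u$ against the independent case. The AR(1) structure and all parameter values are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 50, 0.9
# AR(1) correlation over samples: Gamma_ij = r^|i-j| (unit diagonal).
Gamma = r ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
G_half = np.linalg.cholesky(Gamma)

lam_max = np.linalg.eigvalsh(Gamma)[-1]
print(f"Tr(Gamma) = {np.trace(Gamma):.0f}, "
      f"effective n = Tr/lam_max = {np.trace(Gamma)/lam_max:.1f}")

u = rng.standard_normal(p); u /= np.linalg.norm(u)    # a fixed unit direction
norms_dep, norms_iid = [], []
for _ in range(2000):
    Z = rng.standard_normal((n, p))
    norms_dep.append(np.linalg.norm(G_half @ Z @ u))  # dependent rows
    norms_iid.append(np.linalg.norm(Z @ u))           # independent rows
print(f"dependent rows: mean {np.mean(norms_dep):.2f}, std {np.std(norms_dep):.2f}")
print(f"iid rows:       mean {np.mean(norms_iid):.2f}, std {np.std(norms_iid):.2f}")
```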
### 4.2 Restricted Eigenvalue Conditions: Sub-Gaussian Designs

In this section, we focus on the case of sub-Gaussian design matrices $X \in \mathbb{R}^{n \times p}$, and consider three settings: (i) independent-isotropic, where the rows are independent and isotropic; (ii) independent-anisotropic, where the rows $X_i$ are independent but each row has covariance $E[X_i X_i^T] = \Sigma_{p \times p}$; and (iii) dependent-isotropic, where the rows are isotropic and the columns $X^j$ are correlated with $E[X^j (X^j)^T] = \Gamma_{n \times n}$. For convenience, we assume $E[x_{ij}^2] = 1$ and the sub-Gaussian norm $|||x_{ij}|||_{\psi_2} \leq k$ [18]. In recent work, [17] also considers generalizations of RE conditions to sub-Gaussian designs, although our proof techniques are different.

**Independent Isotropic Sub-Gaussian Designs:** We start with the setting where the sub-Gaussian design matrix $X \in \mathbb{R}^{n \times p}$ has independent rows $X_i$ and each row is isotropic.

**Theorem 9.** Let $X \in \mathbb{R}^{n \times p}$ be a design matrix whose rows $X_i$ are independent isotropic sub-Gaussian random vectors in $\mathbb{R}^p$. Then, for any set $A \subseteq S^{p-1}$ and any $\tau > 0$, with probability at least $1 - 2\exp(-\eta_1\tau^2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \sqrt{n} - \eta_0\, w(A) - \tau, \qquad (20)$$

where $\eta_0, \eta_1 > 0$ are constants which depend only on the sub-Gaussian norm $|||x_{ij}|||_{\psi_2} = k$.

**Independent Anisotropic Sub-Gaussian Designs:** We consider a setting where the rows $X_i$ of the design matrix are independent, but each row is sampled from an anisotropic sub-Gaussian distribution, i.e., $|||x_{ij}|||_{\psi_2} = k$ and $E[X_i X_i^T] = \Sigma_{p \times p}$.

**Theorem 10.** Let the sub-Gaussian design matrix $X$ be row-wise independent, with each row satisfying $E[X_i X_i^T] = \Sigma \in \mathbb{R}^{p \times p}$. Then, for any $A \subseteq S^{p-1}$ and any $\tau > 0$, with probability at least $1 - 2\exp(-\eta_1\tau^2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \nu\sqrt{n} - \eta_0\sqrt{\Lambda_{\max}(\Sigma)}\, w(A) - \tau, \qquad (21)$$

where $\nu = \inf_{u \in A}\|\Sigma^{1/2}u\|_2$, $\sqrt{\Lambda_{\max}(\Sigma)}$ denotes the largest eigenvalue of $\Sigma^{1/2}$, and $\eta_0, \eta_1 > 0$ are constants which depend on the sub-Gaussian norm $|||x_{ij}|||_{\psi_2} = k$.

Note that [14] establishes RE conditions for anisotropic sub-Gaussian designs for the special case of the $L_1$ norm. In contrast, our results are general and in terms of the Gaussian width $w(A)$.

**Dependent Isotropic Sub-Gaussian Designs:** We consider the setting where the sub-Gaussian design matrix $X$ has isotropic sub-Gaussian rows, but the columns $X^j$ are correlated with $E[X^j (X^j)^T] = \Gamma$, implying dependent samples.

**Theorem 11.** Let $X \in \mathbb{R}^{n \times p}$ be a sub-Gaussian design matrix with isotropic rows and correlated columns with $E[X^j (X^j)^T] = \Gamma \in \mathbb{R}^{n \times n}$. Then, for any $A \subseteq S^{p-1}$ and any $\tau > 0$, with probability at least $1 - 2\exp(-\eta_1\tau^2)$, we have

$$\inf_{u \in A} \|Xu\|_2 \geq \frac{1}{3}\sqrt{\operatorname{Tr}(\Gamma)} - \eta_0\sqrt{\Lambda_{\max}(\Gamma)}\, w(A) - \tau, \qquad (22)$$

where $\eta_0, \eta_1$ are constants which depend on the sub-Gaussian norm $|||x_{ij}|||_{\psi_2} = k$.

## 5 Generalized Linear Models: Restricted Strong Convexity

In this section, we consider the setting where the conditional probability distribution of $y|x$ follows an exponential family distribution: $p(y|x; \theta) = \exp\{y\langle\theta, x\rangle - \psi(\langle\theta, x\rangle)\}$, where $\psi(\cdot)$ is the log-partition function. Generalized linear models take the negative log-likelihood of such conditional distributions as the loss function: $\mathcal{L}(\theta; Z^n) = \frac{1}{n}\sum_{i=1}^n\left(\psi(\langle\theta, X_i\rangle) - y_i\langle\theta, X_i\rangle\right)$. Least squares regression and logistic regression are popular special cases of GLMs. Since $\psi'(\langle\theta^*, x\rangle) = E[y|x]$, we have $\nabla\mathcal{L}(\theta^*; Z^n) = \frac{1}{n}X^T\omega$, where $\omega_i = \psi'(\langle\theta^*, X_i\rangle) - y_i = E[y|X_i] - y_i$ plays the role of noise. Hence, the analysis in Section 3 can be applied assuming $\omega$ is Gaussian or sub-Gaussian. To obtain RSC conditions for GLMs, first note that, by the mean value theorem,

$$\delta\mathcal{L}(\theta^*, \Delta; Z^n) = \frac{1}{n}\sum_{i=1}^n \psi''\!\left(\langle\theta^*, X_i\rangle + \gamma_i\langle\Delta, X_i\rangle\right)\langle\Delta, X_i\rangle^2, \qquad (23)$$

where $\gamma_i \in [0,1]$. Since $\psi$ is of Legendre type, the second derivative $\psi''(\cdot)$ is always positive. Since the RSC condition relies on a non-trivial lower bound for the above quantity, the analysis considers a suitable compact set where $\ell = \ell_\psi(T) = \min_{|a| \leq 2T} \psi''(a)$ is bounded away from zero. Outside this compact set, we will only use $\psi''(\cdot) > 0$. Then,

$$\delta\mathcal{L}(\theta^*, \Delta; Z^n) \geq \frac{\ell}{n}\sum_{i=1}^n \langle X_i, \Delta\rangle^2\, \mathbb{I}\left[|\langle X_i, \theta^*\rangle| < T\right] \mathbb{I}\left[|\langle X_i, \Delta\rangle| < T\right]. \qquad (24)$$

Table 1: A summary of various values for the $\ell_1$ and $\ell_\infty$ norms, with all values correct up to constants.

| $R(u)$ | $\lambda_n := c_1 \frac{w(\Omega_R)}{\sqrt{n}}$ | $\kappa := \left[\max\left\{1 - c_2\frac{w(A)}{\sqrt{n}},\, 0\right\}\right]^2$ | $\Psi(C_r)$ | $\|\hat{\Delta}_n\|_2 := c_3\,\Psi(C_r)\lambda_n$ |
|---|---|---|---|---|
| $\ell_1$ norm | $O\!\left(\sqrt{\frac{\log p}{n}}\right)$ | $O(1)$ | $O(\sqrt{s})$ | $O\!\left(\sqrt{\frac{s\log p}{n}}\right)$ |
| $\ell_\infty$ norm | $O\!\left(\frac{p}{\sqrt{n}}\right)$ | $O(1)$ | $1$ | $O\!\left(\frac{p}{\sqrt{n}}\right)$ |

We give a characterization of the RSC condition for independent isotropic sub-Gaussian design matrices $X \in \mathbb{R}^{n \times p}$. The analysis can be suitably generalized to the other design matrices considered in Section 4 by using the same techniques. As before, we denote $\Delta$ as $u$, and consider $u \in A \subseteq S^{p-1}$ so that $\|u\|_2 = 1$. Further, we assume $\|\theta^*\|_2 \leq c_1$ for some constant $c_1$. Assuming $X$ has sub-Gaussian entries with $|||x_{ij}|||_{\psi_2} \leq k$, $\langle X_i, \theta^*\rangle$ and $\langle X_i, u\rangle$ are sub-Gaussian random variables with sub-Gaussian norm at most $Ck$. Let $\varphi_1 = \varphi_1(T; u) = P\{|\langle X_i, u\rangle| > T\} \leq e\cdot\exp(-c_2 T^2/C^2k^2)$ and $\varphi_2 = \varphi_2(T; \theta^*) = P\{|\langle X_i, \theta^*\rangle| > T\} \leq e\cdot\exp(-c_2 T^2/C^2k^2)$. The result we present is in terms of the constants $\ell = \ell_\psi(T)$, $\varphi_1 = \varphi_1(T; u)$, and $\varphi_2 = \varphi_2(T; \theta^*)$ for any suitably chosen $T$.

**Theorem 12.** Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with independent isotropic sub-Gaussian rows. Then, for any set $A \subseteq S^{p-1}$, any $\alpha \in (0,1)$, any $\tau > 0$, and any

$$n \geq \frac{2}{\alpha^2(1 - \varphi_1 - \varphi_2)}\left(c\, w^2(A) + \frac{c_3(1 - \varphi_1 - \varphi_2)^5 c_4}{4k^4}(1-\alpha)\tau^2\right)$$

for suitable constants $c_3$ and $c_4$, with probability at least $1 - 3\exp(-\eta_1\tau^2)$, we have

$$\inf_{u \in A} \sqrt{n\,\delta\mathcal{L}(\theta^*; u, X)} \geq \sqrt{\ell}\left(\sqrt{\pi n} - \eta_0\, w(A) - \tau\right), \qquad (25)$$

where $\pi = (1-\alpha)(1 - \varphi_1 - \varphi_2)$, $\ell = \ell_\psi(T) = \min_{|a| \leq 2T + K}\psi''(a)$, and the constants $(\eta_0, \eta_1)$ depend on the sub-Gaussian norm $|||x_{ij}|||_{\psi_2} = k$.

The form of the result is closely related to the corresponding result for the RE condition on $\inf_{u \in A}\|Xu\|_2$ in Section 4.2. Note that RSC analysis for GLMs was considered in [9] for specific norms, especially $L_1$, whereas our analysis applies to any set $A \subseteq S^{p-1}$, and hence to any norm. Further, following a similar argument structure as in Section 4.2, the analysis for GLMs can be extended to anisotropic and dependent design matrices.
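For intuition on the truncated lower bound (24), it can be evaluated directly for logistic regression, where $\psi(t) = \log(1 + e^t)$ and $\psi''(t) = \sigma(t)(1 - \sigma(t))$, so that $\ell_\psi(T) = \psi''(2T)$ since $\psi''$ decreases in $|t|$. The design, $\theta^*$, $u$, and the truncation levels $T$ below are illustrative choices, not the paper's:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def delta_L_lower_bound(X, theta_star, u, T):
    """Truncated RSC lower bound (24) for the logistic loss.

    psi(t) = log(1 + e^t), so psi''(t) = sigmoid(t) * (1 - sigmoid(t))
    and ell_psi(T) = min_{|a| <= 2T} psi''(a) = psi''(2T).
    """
    ell = sigmoid(2 * T) * (1 - sigmoid(2 * T))
    ip_theta = X @ theta_star
    ip_u = X @ u
    mask = (np.abs(ip_theta) < T) & (np.abs(ip_u) < T)
    return ell * np.mean(ip_u ** 2 * mask)

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))                      # isotropic design
theta_star = rng.standard_normal(p); theta_star /= np.linalg.norm(theta_star)
u = rng.standard_normal(p); u /= np.linalg.norm(u)   # a unit direction in A

for T in (1.0, 2.0, 3.0):
    print(f"T = {T}: lower bound on delta L = "
          f"{delta_L_lower_bound(X, theta_star, u, T):.4f}")
```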
## 6 Conclusions

The paper presents a general set of results and tools for characterizing non-asymptotic estimation error in norm-regularized regression problems. The analysis holds for any norm, and includes much of the existing literature focused on structured sparsity and related themes as special cases. The work can be viewed as a direct generalization of the results in [9], which presented related results for decomposable norms. Our analysis illustrates the important role that Gaussian widths, as a geometric measure of the size of suitable sets, play in such results. Further, the error sets of the regularized and constrained versions of such problems are shown to be closely related [2]. Going forward, it will be interesting to explore similar generalizations for the semi-parametric and non-parametric settings.

**Acknowledgements:** We thank the anonymous reviewers for helpful comments and suggestions on related work. We thank Sergey Bobkov, Snigdhansu Chatterjee, and Pradeep Ravikumar for discussions related to the paper. The research was supported by NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and by NASA grant NNX12AQ39A.

## References

[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: A geometric theory of phase transitions in convex optimization. Information and Inference, 3(3):224-294, 2014.

[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705-1732, 2009.

[3] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011.

[4] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

[5] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, volume 1317 of Lecture Notes in Mathematics, pages 84-106. Springer, 1988.

[6] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.

[7] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2013.

[8] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246-270, 2009.

[9] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538-557, December 2012.

[10] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069-1097, 2011.

[11] S. Oymak, C. Thrampoulidis, and B. Hassibi. The squared-error of generalized Lasso: A precise analysis. arXiv:1311.0830v2, 2013.

[12] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482-494, 2013.

[13] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241-2259, 2010.

[14] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434-3447, 2013.

[15] M. Talagrand. The Generic Chaining. Springer, 2005.

[16] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

[17] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance. To appear, 2014.
[18] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, chapter 5, pages 210-268. Cambridge University Press, 2012.

[19] A. B. Vizcarra and F. G. Viens. Some applications of the Malliavin calculus to sub-Gaussian and non-sub-Gaussian random fields. In Seminar on Stochastic Analysis, Random Fields and Applications, volume 59 of Progress in Probability, pages 363-396. Birkhäuser, 2008.

[20] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.

[21] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2567, November 2006.

[22] S. Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532-562, 2014.