# The Well-Tempered Lasso

Yuanzhi Li¹  Yoram Singer¹ ²

¹Department of Computer Science, Princeton University  ²Google Brain. Correspondence to: Yuanzhi Li.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).*

**Abstract.** We study the complexity of the entire regularization path for least squares regression with 1-norm penalty, known as the Lasso. Every regression parameter in the Lasso changes linearly as a function of the regularization value. The number of changes is regarded as the Lasso's complexity. Experimental results using exact path following exhibit polynomial complexity of the Lasso in the problem size. Alas, the path complexity of the Lasso on artificially designed regression problems is exponential. We use smoothed analysis as a mechanism for bridging the gap between worst-case settings and the de facto low complexity. Our analysis assumes that the observed data has a tiny amount of intrinsic noise. We then prove that the Lasso's complexity is polynomial in the problem size.

## 1. Introduction

In high-dimensional learning problems, a sparse solution is often desired as it tends to generalize better and is easier to interpret. In order to promote sparse solutions, a regularization term that penalizes the 1-norm of the vector of parameters is often added to an empirical loss term. In regression problems, the empirical loss amounts to the sum of squared differences between the linear predictions and the true targets. The task of linear regression with a 1-norm penalty is known as the Lasso (Tibshirani, 1996). To obtain meaningful solutions for the Lasso, it is required to pick a good value of the ℓ1 regularizer. To automatically choose the best regularization value, algorithms that calculate all possible solutions were developed (Efron et al., 2004; Osborne et al., 2000; Tibshirani & Taylor, 2012). These algorithms find the solution set for all possible regularization values, commonly referred to as the entire regularization path. The algorithms typically build upon the property that the Lasso regularization path is piecewise linear in the constituents of the solution vector. As a result, their running times are also governed by the total number of linear segments. While experiments with real datasets suggest that the number of linear segments is in practice linear in the dimension of the problem (Rosset & Zhu, 2007), worst-case settings (Mairal & Yu, 2012) can yield exponentially many linear segments. The construction of exponentially complex regression problems of (Mairal & Yu, 2012) stands in stark contrast to the empirical behavior of the aforementioned methods. Provably polynomial path complexity has so far been derived only in vastly more restricted settings, such as the one described in (Dubiner & Singer, 2011).

We bridge the gap between the de facto complexity of the regularization path in real problems and the worst-case analysis of the number of linear segments. We show that under fairly general models, the complexity of the entire regularization path is guaranteed to be polynomial in the dimension of the problem. As an important observation, settings which attain the worst-case complexity of the regularization path often exhibit fragile algebraic structure. In contrast, natural datasets often come with noise, which renders those highly frail structures improbable. This approach is called smoothed analysis, introduced in the seminal work of (Spielman & Teng, 2009).
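As a concrete illustration of the object under study, the short sketch below computes an exact Lasso regularization path and counts its linear segments. It is not the authors' path-following code; it assumes scikit-learn's `lars_path` homotopy solver (whose objective rescales the squared loss by 1/n, which changes the regularization values but not the number of segments), and the random data is only a placeholder.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Illustrative sketch: count the linear segments of an exact Lasso path.
# lars_path(method="lasso") follows the piecewise-linear homotopy and returns
# one knot (breakpoint in lambda) per change of the active set / sign pattern.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
y /= np.linalg.norm(y)              # the paper assumes ||y||_2 = 1

alphas, active, coefs = lars_path(X, y, method="lasso")
# `alphas` holds the regularization values at the knots; the number of linear
# segments of the path is the number of intervals between consecutive knots.
num_segments = len(alphas) - 1
print(f"exact path has {num_segments} linear segments for n={n}, d={d}")
```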
The core of smoothed analysis is the assumption that the input data is subjected to a small amount of intrinsic noise. Such noise may come from uncertainty in the physical measurements when collecting the data, irrational decisions in human feedback, or simply the rounding errors in the computation process used for obtaining the data. In this model, we let $X \in \mathbb{R}^{n\times d}$ denote the data matrix, where $n$ is the number of observations and $d$ is the dimension (number of free parameters). The smoothed analysis assumption implies that $X$ is the sum of $X^h$, an unknown fixed matrix, and $G$, which consists of i.i.d. random samples from a normal distribution with zero mean and low variance, $X = X^h + G$. In this view, the data is neither completely random nor completely arbitrary. The smoothed complexity of the problem is measured as the expected complexity taken over the random choices of $G$. Using this framework, it was proved that the smoothed running time of the simplex, k-means, and the Perceptron algorithms (Spielman & Teng, 2001; Arthur et al., 2009; Blum & Dunagan, 2002) is in fact polynomial, while the worst-case complexity of these problems is exponential.

We use the above smoothed analysis model to show that on typical instances, the total number of linear segments of the Lasso's exact regularization path is polynomial in the problem size with high probability. Informally speaking and omitting technical details, our main result can be stated as follows. Let $X \in \mathbb{R}^{n\times d}$ be a data matrix of the form $X = X^h + G$ for any fixed matrix $X^h$ and a random Gaussian matrix $G$ with $G_{ij} \sim N(0, \sigma^2)$. Then, for an arbitrary vector of targets $y \in \mathbb{R}^n$, with high probability, the total number of linear segments of the Lasso's exact regularization path for $(X, y)$ is polynomial in $n$, $d$, and $1/\sigma$.

Our result is conceptually different from the one presented in (Mairal & Yu, 2012). Mairal and Yu showed that there exists an approximate regularization path with a small number of linear segments. However, their analysis, while being novel and inspiring, does not shed light on why, in practice, the exact number of linear segments is small, as the approximate path is unlikely to coincide with the exact path. Our analysis covers uncharted terrain and different aspects than the approximate-path algorithms in (Mairal & Yu, 2012; Giesen et al., 2010). On one hand, we show that when the input data is "typical", namely comes from a naturally smooth distribution, then with high probability the total number of linear segments of the exact Lasso path is already small, i.e. bounded by a polynomial. This part of our analysis provides theoretical backing to the empirical findings reported in (Rosset & Zhu, 2007). On the other hand, when the input matrix is atypical and induces a high-complexity path, we can still obtain a low-complexity approximate regularization path by adding a small amount of random noise to the data and then solving the Lasso's regularization path on the perturbed instance exactly. We also verify our analysis experimentally in Section 9. We show that even a tiny amount of perturbation to high-complexity data matrices results in a dramatic drop in the number of linear segments.

The technique used in this paper is morally different from the smoothed analyses obtained for the simplex algorithm (Spielman & Teng, 2001), k-means (Arthur et al., 2009), and the Perceptron (Blum & Dunagan, 2002), as there is no concrete algorithm involved.
We develop a new framework which shows that when the total number of linear segments is excessively high, we can tightly couple the solutions of the original, smoothed problem to another set of solutions in a manner that does not depend on $G$. We then use the randomness of $G$ to show that such couplings are unlikely to exist, and thus high-complexity solutions are rare. We believe that our framework can be extended to other problems such as the regularization path of support vector machines.

## 2. Preliminaries

We use uppercase boldface letters, e.g. $X$, to denote matrices and lowercase letters, e.g. $x, w$, to denote vectors, variables, and scalars. We use $X_i$ to denote the $i$'th column of $X$. Given a set $S \subseteq [d]$, we denote by $X_S \in \mathbb{R}^{n\times|S|}$ the sub-matrix of $X$ whose columns are $X_i$ for $i \in S$. Analogously, $X_{-S}$ denotes the sub-matrix of $X$ with columns $X_i$ for $i \notin S$. For a matrix $X \in \mathbb{R}^{n\times d}$ with $n \ge d$, we use the term smallest (largest) singular value of $X$ to denote the smallest (largest) right singular value of $X$. We define the generalized sign of a scalar $b$ as follows,
$$\operatorname{sign}(b) \;=\; \begin{cases} +1 & b > 0 \\ -1 & b < 0 \\ \phantom{+}0 & b = 0 \end{cases}\,.$$

Let $y$ be a vector in $\mathbb{R}^n$ and let $X = [X_1, \ldots, X_d]$ be a matrix in $\mathbb{R}^{n\times d}$. The Lasso is the following regression problem,
$$w[\lambda] \;=\; \operatorname*{argmin}_{w\in\mathbb{R}^d}\ \tfrac{1}{2}\,\|Xw - y\|_2^2 + \lambda\,\|w\|_1\,. \tag{1}$$
Here, $\lambda > 0$ is the regularization value. The value of $\lambda$ influences the sparsity level of the solution $w[\lambda]$: the larger $\lambda$ is, the sparser the solution is. When $X$ is of full column rank, the solution to (1) is unique. We therefore denote it by $w[\lambda]$. We use $\mathcal{P} = \{w[\lambda] \mid \lambda > 0\}$ to denote the set of all possible solution vectors. This set is also referred to as the entire regularization path. To establish our main result we need a few technical lemmas. The first lemma, from (Mairal & Yu, 2012), provides optimality conditions for $w[\lambda]$.

**Lemma 1.** Let $\lambda > 0$. Then $w[\lambda]$ is the optimal solution iff it satisfies the following conditions:

1. There exists a vector $u[\lambda]$ such that $X^\top(Xw[\lambda] - y) = -u[\lambda]$.
2. Each coordinate of $u[\lambda]$ satisfies
$$u_i[\lambda] \;\in\; \begin{cases} \{\lambda\,\operatorname{sign}(w_i[\lambda])\} & |w_i[\lambda]| > 0 \\ [-\lambda, \lambda] & \text{otherwise} \end{cases}\,.$$

Let us denote by $\operatorname{sign}(w[\lambda])$ the sign vector obtained by applying the generalized sign function $\operatorname{sign}(\cdot)$ element-wise to $w[\lambda]$. The result of (Mairal & Yu, 2012) shows that $\mathcal{P}$ is piecewise linear and unique in the following sense.

**Lemma 2.** Suppose $X$ is of full column rank. Then $\mathcal{P} = \{w[\lambda] \mid \lambda > 0\}$ is unique, well-defined, and $w[\lambda]$ is piecewise linear in $\lambda$. Moreover, for any $\lambda_1, \lambda_2 > 0$, if the sign vectors at $\lambda_1$ and $\lambda_2$ are equal, $\operatorname{sign}(w[\lambda_1]) = \operatorname{sign}(w[\lambda_2])$, then $w[\lambda_1]$ and $w[\lambda_2]$ are in the same linear segment.

We use $|\mathcal{P}|$ to denote the total number of linear segments in $\mathcal{P}$. We denote by $\alpha > 0$ the smallest singular value of $X$. Without loss of generality, as we can rescale $X$ and $y$ accordingly, we assume that $\|y\|_2 = 1$. To obtain our main result, we introduce the following smoothness assumption on the data matrix $X$.

**Assumption 3** (Smoothness). $X$ is generated according to $X = X^h + G$, where $X^h \in \mathbb{R}^{n\times d}$ ($n \ge d$) is a fixed unknown matrix with $\|X^h\|_2 \le 1$, and each entry of $G$ is an i.i.d. sample from the normal distribution with zero mean and variance $\sigma^2/n$.

We use $N(0,\sigma^2/n)$ instead of $N(0,\sigma^2)$ for the noise distribution $G$. This choice implies that when $\sigma$ is a constant the spectral norm of $G$ is also a constant in expectation, $\mathbb{E}[\|G\|_2] = O(1)$; see for instance (Rudelson & Vershynin, 2010). Therefore, the signal-to-noise ratio satisfies $\mathbb{E}\big[\|X\|_2\,\|G\|_2^{-1}\big] = O(1)$. It is convenient to think of $\sigma$ as an arbitrarily small constant. The analysis presented in the sequel employs a fixed constant $c$ that does not depend on the problem size.
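The scaling in Assumption 3 can be illustrated numerically. The sketch below is an illustration under the stated assumptions, not part of the paper: it draws a smoothed matrix $X = X^h + G$ with entries of $G$ distributed $N(0, \sigma^2/n)$ and prints the resulting spectral norms and smallest singular value. The base matrix `Xh` used here is an arbitrary stand-in for the unknown fixed matrix.

```python
import numpy as np

# Illustrative sketch of the smoothness assumption (Assumption 3): noise entries
# are N(0, sigma^2 / n), so ||G||_2 stays O(sigma) while ||X||_2 stays O(1).
rng = np.random.default_rng(1)
n, d, sigma = 500, 100, 0.01

Xh = rng.standard_normal((n, d))
Xh /= np.linalg.norm(Xh, 2)                       # enforce ||Xh||_2 <= 1
G = rng.normal(0.0, sigma / np.sqrt(n), size=(n, d))
X = Xh + G

print("||G||_2             =", np.linalg.norm(G, 2))   # roughly 2 * sigma
print("||X||_2             =", np.linalg.norm(X, 2))   # still O(1)
print("smallest sing. val  =", np.linalg.svd(X, compute_uv=False)[-1])
```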
We use $f(\cdot) \overset{c}{>} \mathrm{poly}(\cdot)$ (analogously, $f(\cdot) \overset{c}{<} \mathrm{poly}(\cdot)$) to denote the fact that the function $f$ is everywhere greater (smaller) than a polynomial function up to a multiplicative constant. Equipped with the above conventions and the smoothness assumption, the following lemma, due to (Sankar et al., 2006), characterizes the extremal singular values of $X$.

**Lemma 4.** Let $\delta > 0$. With probability of at least $1-\delta$, the smallest, denoted $\alpha$, and largest, denoted $\beta$, right singular values of $X$ satisfy
$$\alpha \;\overset{c}{>}\; \frac{\sigma\,\delta}{d} \qquad\text{and}\qquad \beta \;\overset{c}{<}\; 1 + \sigma\log(1/\delta)\,.$$

The bound on $\beta$ lets us assume henceforth that $\beta = O(1)$ for any reasonable choice of $\sigma$ and $\delta$. In our analysis we describe explicitly the dependencies on $\|X\|_2$ for clarity and state the main results with $\beta = O(1)$. The main result of the paper is stated in the following theorem.

**Theorem 5** (Lasso's Smoothed Complexity). Suppose Assumption 3 holds for arbitrary $n, d \in \mathbb{Z}_{+}$ with $n \ge d$ and $\sigma \in (0,1]$. Then, with probability of at least $1-\delta$ (over the random selection of $G$), the complexity of the Lasso satisfies
$$|\mathcal{P}| \;\overset{c}{<}\; n^{1.1}\, d\,\cdots\,,$$
a bound which is polynomial in $n$, $d$, $1/\sigma$, and $1/\delta$.

## 3. Main Lemmas

To prove the main theorem, we introduce several properties of $X$, $y$, $w[\lambda]$, and $u[\lambda]$ that are critical in the analysis of $|\mathcal{P}|$. We then use the smoothness assumption to bound these properties.

**Definition 1** (Lipschitzness). Let $w_i[\lambda]$ and $u_i[\lambda]$ be the value of the $i$'th coordinate of $w[\lambda]$ and $u[\lambda]$ respectively, for $i \in [d]$. The coordinate-wise Lipschitz parameters of $w$ and $u$ are defined as
$$L_w \;=\; \max_{i\in[d]}\ \sup_{\lambda>0}\ \Big|\frac{\partial w_i[\lambda]}{\partial\lambda}\Big|\,, \qquad L_u \;=\; \max_{i\in[d]}\ \sup_{\lambda>0}\ \Big|\frac{\partial u_i[\lambda]}{\partial\lambda}\Big|\,.$$

By definition, $L_w$ and $L_u$ characterize how much each coordinate of $w[\lambda]$ and $u[\lambda]$ can change as we vary the value of $\lambda$. We later use the smoothness assumption to show that $L_w$ and $L_u$ are bounded by a polynomial in the problem size. This implies that $w[\lambda]$ and $u[\lambda]$ do not change too fast with $\lambda$. However, Lipschitzness by itself does not give us a bound on $|\mathcal{P}|$, since $w[\lambda]$ can still oscillate around zero and induce an excessively large number of linear segments. Therefore, we also need the following property which defines the restricted distance between the column space of $X$ and $y$.

**Definition 2** (Subspace distance). For any $s, \delta > 0$, let $\gamma_s$ denote the largest value such that
$$\Pr\Big[\,\exists\, v \in \mathbb{R}^{d-s}\ \text{s.t.}\ \|X_{-S}\,v - y\|_2 \le \gamma_s\,\Big] \;\le\; \delta\,,$$
for all $S \subseteq [d]$ of size $s$.

This definition quantifies the distance of $y$ to a subspace spanned by $d-s$ columns of $X$. Since $y \in \mathbb{R}^n$, $n \ge d$, and $X$ is smooth, it can be shown that $y$ cannot be too close to the subspace spanned by $X_{-S}$. That is, $\gamma_s$ is inversely proportional to a polynomial in $n$, $d$, and $1/\sigma$. In the following we interchangeably use the original matrix $X$ with $v \in \mathbb{R}^d$ such that $v_i = 0$ for $i \in S$, and $X_{-S}$ with $v \in \mathbb{R}^{d-s}$. Using the above properties, we prove the following theorem.

**Theorem 6** (Exact Smooth Complexity). Let $X$ satisfy Assumption 3. Then, for all $s \in [d]$ and $\delta > 0$, with probability of at least $1-\delta$, the complexity of the Lasso is bounded by $3^{s}$ times a factor polynomial in $n$, $d$, $s$, $L_w/\alpha^2 + L_u$, and $1/\gamma_s$.

The following lemma characterizes the (smoothed) values of $L_w$, $L_u$, and $\gamma_s$.

**Lemma 7.** Let $X$ satisfy Assumption 3. Then with probability of at least $1-\delta$ the following properties hold,
$$L_w \;\overset{c}{<}\; \frac{\sqrt{d}}{\alpha^2}\,, \qquad L_u \;\overset{c}{<}\; \frac{\sqrt{d}}{\alpha^2}\,, \qquad \gamma_s \;\overset{c}{>}\; \frac{\sigma}{\sqrt{dn}\,(d/\delta)^{2/s}}\,.$$

Applying the bounds on $L_w$, $L_u$, $\gamma_s$, and $\alpha$ to Theorem 6 while letting $s$ be sufficiently large, we can directly prove Theorem 5. In Section 5, we use the value of $\alpha$ to bound $L_w$ and $L_u$. In Section 6, we employ the smoothness of $X$ to bound $\gamma_s$. Finally, in Section 7 we prove Theorem 6.

## 4. Proof sketch

Since $\|X\|_2 = O(1)$, there exists a constant $\lambda_{\max} = \Omega(1)$ such that for $\lambda \ge \lambda_{\max}$, $w[\lambda]$ is the zero vector.
Thus, we can divide $[0, \lambda_{\max}]$ into $\lambda_{\max}/\nu$ intervals, each of size $\nu$, for some (inverse-polynomially) small $\nu$. We then show that within every interval the total number of linear segments exceeds a fixed polynomial number only with exponentially small probability. The bound on the total number of linear segments follows by taking a union bound over all intervals. We need to specifically address the following two questions in the analysis.

*What if $|\mathcal{P}|$ within an interval is excessively large?* We will show that when there are $N$ linear segments in an interval, then there must be at least $\log_3(N)$ many coordinates of $w[\lambda]$, which we denote as the set $S$, that change their sign. Since $\nu$ is small and $w[\lambda]$ is a Lipschitz function of $\lambda$, we know that those coordinates of $w[\lambda]$ must be close to zero. Therefore, we can show that $w[\lambda]$ is close to the optimal solution $v[\lambda]$ of the Lasso problem in which the coordinates in $S$ are constrained to be exactly zero,
$$v[\lambda] \;=\; \operatorname*{argmin}_{w\in\mathbb{R}^d}\ \tfrac{1}{2}\,\|Xw - y\|_2^2 + \lambda\,\|w\|_1 \quad\text{s.t.}\quad \forall i \in S:\ w_i = 0\,. \tag{2}$$

*What if $w[\lambda]$ oscillates excessively around $v[\lambda]$?* From the optimality condition for $u[\lambda]$ and the smoothness of $u[\lambda]$, we also know that the coordinates in $S$ of $u[\lambda]$ must be close to either $\lambda$ or $-\lambda$. Thus, $u_S[\lambda] \overset{\text{def}}{=} -X_S^\top(Xw[\lambda] - y)$ is close to a vector on the scaled hypercube $\{-\lambda, \lambda\}^{|S|}$. On the other hand, if $w[\lambda]$ is close to $v[\lambda]$, we know that $X_S^\top(Xv[\lambda] - y)$ is also close to a vector in $\{-\lambda, \lambda\}^{|S|}$. However, $v[\lambda]$ does not depend on $X_S$ by construction. Therefore, the residual $Xv[\lambda] - y$ also does not depend on $X_S$. Thus, using the randomness etched in $X_S$, we can show that $X_S^\top(Xv[\lambda] - y)$ is close to a vector in $\{-\lambda, \lambda\}^{|S|}$ with probability which is exponentially small in the size of $S$. Therefore, we know that w.h.p. the total number of linear segments in this interval is unlikely to be large.

## 5. Bounding $L_w$ and $L_u$

Recall that we denote the smallest singular value of $X$ by $\alpha$. We first show the following lemma regarding perturbations of strongly convex functions. Also recall that a second-order smooth function $f : \mathbb{R}^d \to \mathbb{R}$ is $\alpha^2$-strongly convex if $\nabla^2 f(x) \succeq \alpha^2 I$ for all $x \in \mathbb{R}^d$, and is $L$-Lipschitz if $\|\nabla f(x)\|_2 \le L$ for all $x \in \mathbb{R}^d$.

**Lemma 8** (Perturbation of strongly convex functions I). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a non-negative, $\alpha^2$-strongly convex function, and let $g : \mathbb{R}^d \to \mathbb{R}$ be an $L$-Lipschitz non-negative convex function. For any $\lambda \ge 0$, let $z[\lambda]$ be the minimizer of $f(z) + \lambda g(z)$. Then we have
$$\Big\|\frac{d z[\lambda]}{d\lambda}\Big\|_2 \;\le\; \frac{L}{\alpha^2}\,.$$

*Proof of Lemma 8.* For any $\tau \ge 0$ and $\lambda \ge 0$, let us abbreviate $z = z[\lambda]$ and denote $\varepsilon = z[\lambda+\tau] - z$. From the $\alpha^2$-strong convexity of $f$ at $z$ and the optimality of $z$ at $\lambda$, we know that
$$\tfrac{1}{2}\alpha^2\|\varepsilon\|_2^2 + f(z) + \lambda g(z) \;\le\; f(z+\varepsilon) + \lambda g(z+\varepsilon)\,. \tag{3}$$
Moreover, using the optimality of $z + \varepsilon$ at $\lambda + \tau$, we know that
$$\tfrac{1}{2}\alpha^2\|\varepsilon\|_2^2 + f(z+\varepsilon) + (\lambda+\tau)\,g(z+\varepsilon) \;\le\; f(z) + (\lambda+\tau)\,g(z)\,. \tag{4}$$
Summing Eqs. (3) and (4) and rearranging terms yields
$$\alpha^2\|\varepsilon\|_2^2 \;\le\; \tau\,\big(g(z) - g(z+\varepsilon)\big) \;\le\; \tau\,\|\varepsilon\|_2\, L\,,$$
where the last inequality is due to the Lipschitzness assumption on $g$. Therefore, we get that $\|\varepsilon\|_2/\tau \le L/\alpha^2$. Letting $\tau \to 0^+$ completes the proof. ∎
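As a sanity check of Lemma 8 in the Lasso setting ($f(w) = \tfrac12\|Xw-y\|_2^2$, $g(w) = \|w\|_1$, so $L = \sqrt{d}$), the sketch below compares a finite-difference estimate of $\|w[\lambda+\tau] - w[\lambda]\|_2/\tau$ with the bound $\sqrt{d}/\alpha^2$. It relies on scikit-learn's coordinate-descent `Lasso`, whose objective is $(1/(2n))\|Xw-y\|_2^2 + a\|w\|_1$, so the paper's $\lambda$ is mapped to $a = \lambda/n$; the data and the values of $\lambda$ and $\tau$ are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Finite-difference check of Lemma 8 / Corollary 9: the Lasso solution should
# move at rate at most sqrt(d)/alpha^2 as lambda varies (alpha = smallest
# singular value of X). The mapping a = lambda / n accounts for sklearn's 1/n
# scaling of the squared loss, which leaves the minimizer unchanged.
rng = np.random.default_rng(2)
n, d = 300, 60
X = rng.standard_normal((n, d)) / np.sqrt(n)   # full column rank, well conditioned
y = rng.standard_normal(n)
y /= np.linalg.norm(y)

alpha = np.linalg.svd(X, compute_uv=False)[-1]  # smallest singular value of X
lam, tau = 0.05, 1e-2                            # arbitrary lambda and step

def lasso_solution(lmbda):
    return Lasso(alpha=lmbda / n, fit_intercept=False,
                 tol=1e-12, max_iter=200000).fit(X, y).coef_

rate = np.linalg.norm(lasso_solution(lam + tau) - lasso_solution(lam)) / tau
print(f"||w[l+t] - w[l]||_2 / t = {rate:.3f}")
print(f"bound sqrt(d)/alpha^2   = {np.sqrt(d) / alpha**2:.3f}")
```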
Applying Lemma 8 with $f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$ and $g(w) = \|w\|_1$, we obtain Lipschitz properties for $w[\lambda]$ and $u[\lambda]$. Since we assume that the minimum singular value of $X$ is $\alpha$, the function $f(w)$ is $\alpha^2$-strongly convex. In addition, the norm of any subgradient of $g(w)$ is clearly at most $\sqrt{d}$. To simplify notation, when $w[\lambda]$ is not differentiable at a point $\lambda$ we define $dw[\lambda]/d\lambda = 0$; due to Lipschitzness and strong convexity, this choice is consistent with the set of subgradients of $w[\lambda]$ with respect to $\lambda$. In summary we get the following corollary.

**Corollary 9** (Lipschitzness of $w$). For any $\lambda \ge 0$ it holds that
$$\Big\|\frac{dw[\lambda]}{d\lambda}\Big\|_2 \;\le\; \frac{\sqrt{d}}{\alpha^2}\,.$$

Since by definition $u[\lambda] = -X^\top(Xw[\lambda] - y)$, we obtain a similar corollary for $u$.

**Corollary 10** (Lipschitzness of $u$). For any $\lambda \ge 0$ it holds that
$$\Big\|\frac{du[\lambda]}{d\lambda}\Big\|_2 \;\le\; \frac{\|X\|_2^2\,\sqrt{d}}{\alpha^2}\,.$$

Recall that $\|X\|_2 = O(1)$; thus, from the above corollaries we get
$$L_w \;\overset{c}{<}\; \frac{\sqrt{d}}{\alpha^2} \qquad\text{and}\qquad L_u \;\overset{c}{<}\; \frac{\sqrt{d}}{\alpha^2}\,.$$
We also use in the next section the following lemma.

**Lemma 11** (Perturbation of strongly convex functions II). Let $f : \mathbb{R}^d \to \mathbb{R}$ be an $\alpha^2$-strongly convex function and $g : \mathbb{R}^d \to \mathbb{R}$ an $L$-Lipschitz convex function. Let $z_1$ and $z_2$ be the minimizers of $f(z)$ and $f(z) + g(z)$, respectively. Then $\|z_1 - z_2\|_2 \le L/\alpha^2$.

*Proof of Lemma 11.* Let $\varepsilon = z_2 - z_1$. From the strong convexity of $f$ and the optimality of $z_1$, we know that
$$\tfrac{1}{2}\alpha^2\|\varepsilon\|_2^2 + f(z_1) \;\le\; f(z_2)\,. \tag{6}$$
Due to the optimality of $z_2$ for $f(z) + g(z)$ we get
$$\tfrac{1}{2}\alpha^2\|\varepsilon\|_2^2 + f(z_2) + g(z_2) \;\le\; f(z_1) + g(z_1)\,. \tag{7}$$
Summing Eqs. (6) and (7) and rearranging terms gives
$$\alpha^2\|\varepsilon\|_2^2 \;\le\; g(z_1) - g(z_2) \;\le\; L\,\|\varepsilon\|_2\,.$$
We therefore get that $\|\varepsilon\|_2 \le L/\alpha^2$. ∎

## 6. Bounding $\gamma_s$

In this section we prove that Assumption 3 also yields a bound on $\gamma_s$.

**Lemma 12.** Let $y \in \mathbb{S}^{n-1}$ be an arbitrary unit vector. Then, for any $s \in \{10, \ldots, d\}$ and any set $S \subseteq [d]$ of size $s$,
$$\Pr\Big[\,\exists\, v \in \mathbb{R}^{d-s}\ \text{s.t.}\ \|X_{-S}\,v - y\|_2 \le \gamma\,\Big] \;\le\; \delta \qquad\text{for}\quad \gamma \;\overset{c}{<}\; \sigma\,\delta^{1/s}\cdots\,.$$
For brevity, we denote by $D_\gamma$ the complement of the event above, namely the event that no such $v$ exists.

*Proof.* The proof relies on the following simple fact about Gaussian random variables. For any unit vector $u$, scalar $\tau > 0$, $g \in \mathbb{R}$, and $i \in [d]$, the following holds,
$$\Pr\big[\,|\langle u, X_i\rangle - g| \le \tau\,\big] \;\le\; \frac{e\,\tau\,\sqrt{n}}{\sigma}\,, \tag{8}$$
where $X_i$ is a (column) vector with elements distributed i.i.d. according to $N(0, \sigma^2/n)$. Let us assume that there exists $v \in \mathbb{R}^{d-s}$ such that $\|X_{-S}v - y\|_2 \le \gamma$. From Lemma 4, we know that with probability of at least $1-\delta$, $\|X\|_2 \overset{c}{<} 1$. Since we assume that $\|y\|_2 = 1$, it holds that
$$\gamma \;\ge\; \|X_{-S}v - y\|_2 \;\ge\; \|y\|_2 - \|X_{-S}v\|_2 \;\overset{c}{\ge}\; 1 - \|v\|_2\,,$$
and therefore $\|v\|_2 \overset{c}{>} 1 - \gamma$. Since $\gamma$ is smaller than $1/2$, there must exist a coordinate $i \notin S$ for which $|v_i| \ge \frac{1}{2\sqrt{d}}$. Now, let us fix $G_j$ to its observed value for all $j \ne i$. In addition, let us denote by $U$ the subspace spanned by
$$\{(X^h)_i\} \,\cup\, \{X_j \mid j \ne i,\ j \notin S\} \,\cup\, \{y\}\,.$$
We know that $U$ is of dimension at most $d - s + 1$. Consider orthonormal vectors $u_1, \ldots, u_{s-1} \in \mathbb{R}^n$ that are orthogonal to $U$ (they exist since $n \ge d$). Suppose there exists $v$ such that $\|X_{-S}v - y\|_2 \le \gamma$. Then, for every $j \in [s-1]$, taking the inner product with $u_j$ yields
$$|\langle u_j, X_i\rangle\, v_i| \;\le\; \gamma \qquad\Longrightarrow\qquad |\langle u_j, X_i\rangle| \;\le\; \frac{\gamma}{|v_i|} \;\le\; 2\sqrt{d}\,\gamma\,.$$
Note that for any pair $j \ne j'$, the inner products $\langle u_j, X_i\rangle$ and $\langle u_{j'}, X_i\rangle$ are independent. We now use (8) over all $u_j$ and get that the event above holds with probability of at most $\big(2e\sqrt{dn}\,\gamma/\sigma\big)^{s-1}$. Finally, choosing $\gamma = \Omega\big(\sigma\,\delta^{1/s}\cdots\big)$ completes the proof. ∎

## 7. Coupling the solutions

In this section, we show that when the total number of linear segments in a small interval is excessively large, the optimal solution $w[\lambda]$ can be coupled with the optimal solution $v[\lambda]$ of the constrained Lasso problem (2).

**Sign changes.** For a given fixed $\lambda_0 > 0$ and $\nu > 0$, let us denote by $\zeta(i)$ the number of times that the generalized sign of $w_i$ changes as $\lambda$ increases from $\lambda_0$ to $\lambda_0 + \nu$. Thus, the total number of linear segments in the interval $[\lambda_0, \lambda_0 + \nu]$ is at least $\sum_{i=1}^d \zeta(i)$. We prove the following lemmas related to the sign changes.

**Lemma 13** (Number of sign changes). For any integer $N > 0$ and any $\lambda_0, \nu > 0$, if $\sum_{i=1}^d \zeta(i) \ge N$, then there exist at least $\log_3(N)$ indices $j \in [d]$ such that $\zeta(j) \ge 1$.

*Proof.* According to Lemma 2, each linear segment is associated with a unique sign pattern in $\{-1, 0, 1\}^d$. Since there are at least $N$ segments in the interval, the pigeonhole principle implies that there must exist at least $\log_3(N)$ coordinates of $w[\lambda]$ that change their sign in this interval: if only $k$ coordinates changed sign, at most $3^k$ distinct sign patterns, and hence at most $3^k$ segments, could occur. ∎
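The quantities $\zeta(i)$ are easy to read off a computed path. The sketch below (illustrative only, using scikit-learn's `lars_path` on synthetic data) counts the generalized-sign changes of each coordinate along the entire path, i.e. over $[0, \lambda_{\max}]$ rather than a single small interval, and compares their sum with the number of linear segments.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Illustrative: per-coordinate sign changes zeta(i) along an exact Lasso path.
rng = np.random.default_rng(3)
n, d = 200, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
y /= np.linalg.norm(y)

alphas, _, coefs = lars_path(X, y, method="lasso")   # coefs has shape (d, #knots)
signs = np.sign(coefs)                               # generalized sign at every knot
zeta = (np.diff(signs, axis=1) != 0).sum(axis=1)     # sign changes per coordinate
print("number of linear segments :", len(alphas) - 1)
print("sum of zeta(i)            :", int(zeta.sum()))
print("coordinates with zeta >= 1:", int((zeta >= 1).sum()))
```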
From the Lipschitzness of $w[\lambda]$ and $u[\lambda]$, we also obtain the following lemma.

**Lemma 14** (Sign change small weight). For $i \in [d]$, if $\zeta(i) \ge 1$, then the following properties hold:
$$|w_i[\lambda_0]| \;\le\; L_w\,\nu \qquad\text{and}\qquad \big|\,|u_i[\lambda_0]| - \lambda_0\,\big| \;\le\; (L_u + 1)\,\nu\,.$$

*Proof.* Since $\zeta(i) \ge 1$, we know that there exists $\lambda \in [\lambda_0, \lambda_0 + \nu]$ such that $w_i[\lambda] = 0$ and $|u_i[\lambda]| = \lambda$. Using the Lipschitzness of $w[\lambda]$ we get $|w_i[\lambda_0] - w_i[\lambda]| = |w_i[\lambda_0]| \le L_w\nu$. For $u_i[\lambda_0]$ we have
$$\big|\,|u_i[\lambda_0]| - \lambda_0\,\big| \;\le\; \big|\,|u_i[\lambda_0]| - |u_i[\lambda]|\,\big| + \big|\,|u_i[\lambda]| - \lambda\,\big| + |\lambda - \lambda_0| \;\le\; (L_u + 1)\,\nu\,,$$
which concludes the proof. ∎

We use $\bar L_u$ in the sequel as a shorthand for $L_u + 1$. Based on the two lemmas above we readily get the following corollary.

**Corollary 15.** For any integer $N > 0$ and $\nu, \lambda_0 \ge 0$, if $\sum_i \zeta(i) \ge N$, then there exists a subset $S \subseteq [d]$ of cardinality at least $\log_3(N)$ such that for every $i \in S$,
$$|w_i[\lambda_0]| \;\le\; L_w\,\nu \qquad\text{and}\qquad \big|\,|u_i[\lambda_0]| - \lambda_0\,\big| \;\le\; \bar L_u\,\nu\,.$$

**Rare events.** Let $S$ be defined as in Corollary 15. We next show that if the size of $S$ is too large, then certain rare couplings would take place. Thus, the size of $S$ is small with high probability. Throughout the rest of the paper we overload notation and denote by $X_S \in \mathbb{R}^{n\times d}$ the matrix where each column $X_i$, for $i \notin S$, is replaced with the zero vector. The matrix $X_{-S}$ is defined analogously. Note that by definition, $X_S + X_{-S} = X$.

**Lemma 16** (Size of $S$). For any fixed $\lambda_0, \nu > 0$ and a set $S \subseteq [d]$, let $X_{-S} \in \mathbb{R}^{n\times d}$ be defined as above. Let $v_{-S}[\lambda_0]$ be the minimizer
$$v_{-S}[\lambda_0] \;=\; \operatorname*{argmin}_{v\in\mathbb{R}^d}\ \tfrac{1}{2}\,\|X_{-S}\,v - y\|_2^2 + \lambda_0\,\|v\|_1\,.$$
Assume that the properties of Corollary 15 hold for the set $S$. Then, for every $j \in S$ the following inequality holds,
$$\Big|\,\big|X_j^\top(X_{-S}\,v_{-S}[\lambda_0] - y)\big| - \lambda_0\,\Big| \;\le\; \tau\,.$$
We refer to this event as $E^{(\tau)}_S$ with parameter
$$\tau \;=\; \sqrt{|S|}\,\|X\|_2^2\Big(1 + \frac{\|X\|_2^2}{\alpha^2}\Big)L_w\,\nu \;+\; \bar L_u\,\nu\,.$$

*Proof.* We know that the $i$'th coordinate of $v_{-S}[\lambda_0]$ is zero for all $i \in S$. Therefore, we may focus solely on the set of vectors
$$K \;\overset{\text{def}}{=}\; \{v \in \mathbb{R}^d \mid \forall i \in S:\ v_i = 0\}\,.$$
Since by definition $X_S = X - X_{-S}$, we can rewrite the original objective as
$$\tfrac{1}{2}\,\|X_{-S}\,w + X_S\,w - y\|_2^2 + \lambda_0\,\|w\|_1\,.$$
Let $w_{-S}[\lambda_0] \in \mathbb{R}^d$ be the vector whose $i$'th coordinate equals the $i$'th coordinate of $w[\lambda_0]$ for $i \notin S$ and is zero otherwise, and let $w_S[\lambda_0] = w[\lambda_0] - w_{-S}[\lambda_0]$. From the optimality of $w[\lambda_0]$, we know that $w_{-S}[\lambda_0]$ is the minimizer of
$$\tfrac{1}{2}\,\|X_{-S}\,w + X_S\,w_S[\lambda_0] - y\|_2^2 + \lambda_0\,\|w\|_1 \qquad\text{s.t.}\qquad w \in K\,.$$
Let $g(w) \overset{\text{def}}{=} \tfrac{1}{2}\|X_{-S}w - y\|_2^2 + \lambda_0\|w\|_1$ and $h(w) \overset{\text{def}}{=} \tfrac{1}{2}\|X_{-S}w + X_S w_S[\lambda_0] - y\|_2^2 + \lambda_0\|w\|_1$. Expanding terms we get that for every $w \in K$,
$$h(w) - g(w) \;=\; \tfrac{1}{2}\,\|X_S\,w_S[\lambda_0]\|_2^2 \;-\; \langle X_S\,w_S[\lambda_0],\, y\rangle \;+\; \langle X_S\,w_S[\lambda_0],\, X_{-S}\,w\rangle\,,$$
and the gradient of $h(w) - g(w)$ satisfies
$$\big\|\nabla\big(h(w) - g(w)\big)\big\|_2 \;=\; \big\|X_{-S}^\top X_S\,w_S[\lambda_0]\big\|_2 \;\le\; \|X\|_2^2\,\|w_S[\lambda_0]\|_2\,.$$
From our assumption that Corollary 15 holds for $S$, we know that $\|w_S[\lambda_0]\|_\infty \le L_w\nu$, which in turn implies that $\|w_S[\lambda_0]\|_2 \le \sqrt{|S|}\,L_w\nu$. We can now apply Lemma 11 with respect to $g(w)$ and $h(w) - g(w)$ to conclude that
$$\big\|v_{-S}[\lambda_0] - w_{-S}[\lambda_0]\big\|_2 \;\le\; \frac{\|X\|_2^2\,\sqrt{|S|}\,L_w\,\nu}{\alpha^2}\,.$$
Therefore, for every $j \in S$ the following holds,
$$\Big|\big|X_j^\top(X_{-S}v_{-S}[\lambda_0]-y)\big|-\lambda_0\Big| \;\le\; \Big|\big|X_j^\top(X_{-S}w_{-S}[\lambda_0]-y)\big|-\lambda_0\Big| + \frac{\|X\|_2^4\,\sqrt{|S|}\,L_w\,\nu}{\alpha^2} \;\le\; \Big|\big|u_j[\lambda_0]\big|-\lambda_0\Big| + \sqrt{|S|}\,\|X\|_2^2\,L_w\,\nu + \frac{\|X\|_2^4\,\sqrt{|S|}\,L_w\,\nu}{\alpha^2} \;\le\; \tau\,,$$
where the second inequality uses $X_{-S}w_{-S}[\lambda_0] = Xw[\lambda_0] - X_S w_S[\lambda_0]$ together with Corollary 15. This concludes the proof. ∎

Using again that $\|X\|_2 \overset{c}{<} 1$ we obtain
$$\tau \;\overset{c}{<}\; \sqrt{s}\,\Big(\frac{L_w}{\alpha^2} + \bar L_u\Big)\,\nu\,. \tag{9}$$
That is, $\tau$ scales as the product of the square root of the size of $S$ and the length of the interval $\nu$.

**Bounding the probability of bad events.** Next we show that for every fixed set $S$ of sufficiently large cardinality and sufficiently small $\nu$, the event $E^{(\tau)}_S$ holds with very small probability if $X$ satisfies the smoothness assumption.

**Lemma 17** (Smoothing). For any fixed $\lambda_0, \nu > 0$, $\tau \ge 0$, and $S \subseteq [d]$, let us decompose $G$ into $G = G_{-S} + G_S$ as in Lemma 16. Then, the following inequality holds,
$$\Pr\Big[\,E^{(\tau)}_S \;\Big|\; G_{-S}\,\Big] \;\le\; \Bigg(\frac{e\,\tau\,\sqrt{n}}{\sigma\,\big\|X_{-S}\,v_{-S}[\lambda_0] - y\big\|_2}\Bigg)^{|S|}\,.$$

*Proof.* Consider the vector $v_{-S}[\lambda_0]$ defined as in Lemma 16. We know that $v_{-S}[\lambda_0]$ depends only on $G_{-S}$ and not on $G_S$.
Therefore, for any fixed $G_0$, by conditioning on $G_{-S} = G_0$, we get for every $j \in S$ that
$$\Pr\Big[\,\big|\,|X_j^\top(X_{-S}\,v_{-S}[\lambda_0] - y)| - \lambda_0\,\big| \le \tau\,\Big] \;=\; \Pr\Big[\,\big|\,|\langle (X^h)_j + G_j,\ X_{-S}\,v_{-S}[\lambda_0] - y\rangle| - \lambda_0\,\big| \le \tau\,\Big] \;\le\; \frac{e\,\tau\,\sqrt{n}}{\sigma\,\big\|X_{-S}\,v_{-S}[\lambda_0] - y\big\|_2}\,.$$
Since the inequality holds for every $j \in S$, the columns $G_j$ for $j \in S$ are mutually independent given $G_{-S} = G_0$, and the bound holds for any $G_0$, the proof is completed. ∎

## 8. Proof of Theorem 6

Recall that we assume that the target vector is of unit norm, $\|y\|_2 = 1$. We slightly overload notation and denote by $\alpha(X)$ the smallest right singular value of $X$. From the optimality of $w[\lambda]$, we know that
$$\tfrac{1}{2}\,\|Xw[\lambda] - y\|_2^2 + \lambda\,\|w[\lambda]\|_1 \;\le\; \tfrac{1}{2}\,\|y\|_2^2 \;=\; \tfrac{1}{2}\,,$$
which implies that $\|Xw[\lambda] - y\|_2 \le 1$. From the necessary conditions for optimality we also get $X^\top(Xw[\lambda] - y) = -u[\lambda]$. Thus, we have
$$\|u[\lambda]\|_2 \;=\; \big\|X^\top(Xw[\lambda] - y)\big\|_2 \;\le\; \|X\|_2 \;=\; O(1)\,.$$
Therefore, there exists a constant $\lambda_{\max}$ such that $w[\lambda] = 0$ for $\lambda \ge \lambda_{\max}$.

We next employ the randomness of $G$. Consider a fixed $\alpha_0$ and examine the event that there exists a set $S$ of size $s$ such that $E^{(\tau)}_S$ is true. It holds that
$$\Pr\Big[\,\exists S:\ |S| = s,\ E^{(\tau)}_S \;\Big|\; D_\gamma \wedge \alpha(X) \ge \alpha_0\,\Big] \;\le\; \sum_{S_0\subseteq[d],\,|S_0|=s} \Pr\Big[\,E^{(\tau)}_{S_0},\ S = S_0 \;\Big|\; D_\gamma \wedge \alpha(X) \ge \alpha_0\,\Big] \;\le\; \sum_{S_0\subseteq[d],\,|S_0|=s} \Pr\Big[\,E^{(\tau)}_{S_0} \;\Big|\; D_\gamma \wedge \alpha(X) \ge \alpha_0\,\Big] \;\le\; d^{\,s}\,\Big(\frac{e\,\tau\,\sqrt{n}}{\sigma\,\gamma}\Big)^{s}\,,$$
where we used the definition of $D_\gamma$ and the fact that $\alpha \overset{\text{def}}{=} \alpha(X) \ge \alpha_0$ to obtain the last inequality. We now set $\tau = O\big((\delta\nu)^{1/s}\,\sigma\,\gamma\big)$ which, in view of (9), amounts to choosing
$$\nu^{\,1-1/s} \;\overset{c}{<}\; \frac{\delta^{1/s}\,\sigma\,\gamma}{d\,n\,s\,\big(L_w/\alpha_0^2 + \bar L_u\big)}\,, \qquad\text{that is,}\qquad \nu \;\overset{c}{<}\; \Bigg(\frac{\delta^{1/s}\,\sigma\,\gamma}{d\,n\,s\,\big(L_w/\alpha_0^2 + \bar L_u\big)}\Bigg)^{\frac{s}{s-1}}\,,$$
and with this choice
$$\Pr\Big[\,\exists S:\ |S| = s\ \text{s.t.}\ E^{(\tau)}_S\ \text{holds} \;\Big|\; D_\gamma \wedge \alpha(X) \ge \alpha_0\,\Big] \;\le\; \delta\,\nu\,.$$
Since we have at most $O(1/\nu)$ intervals, taking a union bound over all intervals we get that
$$\Pr\Big[\,\exists\ \text{interval and}\ \exists S:\ |S| = s\ \text{s.t.}\ E^{(\tau)}_S\ \text{holds} \;\Big|\; D_\gamma \wedge \alpha(X) \ge \alpha_0\,\Big] \;\overset{c}{<}\; \delta\,.$$
Finally, for properly chosen $\gamma$ and $\alpha_0$ we also obtain $\Pr\big[\,\neg D_\gamma \vee \big(\alpha(X) < \alpha_0\big)\,\big] \le 2\delta$, which completes the proof.

To recap, there exists a universal constant $c$ such that for all $s \in [d]$ the complexity of the Lasso path is bounded above by
$$|\mathcal{P}| \;\overset{c}{<}\; \frac{3^{s}}{\nu} \;\overset{c}{<}\; 3^{s}\,\Bigg(\frac{d\,n\,s\,\big(L_w/\alpha_0^2 + \bar L_u\big)}{\delta^{1/s}\,\sigma\,\gamma_s}\Bigg)^{\frac{s}{s-1}}\,.$$
We now use the bounds on $L_w$, $L_u$, $\gamma_s$, and $\alpha$ (Lemmas 4 and 7), yielding
$$\frac{L_w}{\alpha^2} + \bar L_u \;\overset{c}{<}\; \frac{d^{4.5}}{\delta^4\,\sigma^4}\,, \qquad \gamma_s \;\overset{c}{>}\; \frac{\sigma}{\sqrt{dn}\,(d/\delta)^{2/s}}\,, \qquad |\mathcal{P}| \;\overset{c}{<}\; 3^{s}\,\sqrt{s}\;n\;d^{\,6}\,\cdots\,.$$
By choosing $s = O\big(\log\frac{nd}{\delta\sigma}\big)$ we get that $|\mathcal{P}| \;\overset{c}{<}\; n^{1.1}\,d\,\cdots$, as stated in Theorem 5.

*Figure 1. Path complexity as a function of dimension for different levels of smoothing. The theoretical non-smoothed (σ = 0) complexity is exponential in the size of the problem.*

*Figure 2. Path complexity in a regression task that predicts the value of a pixel from its neighboring pixels using the MNIST dataset.*

## 9. Experiments

We performed two sets of experiments. Our path-following implementation uses Python with Float128 for high-accuracy computations. In the first set of experiments, we start with the exponential-complexity construction for $X^h \in \mathbb{R}^{d\times d}$ from (Mairal & Yu, 2012), whose exact path has $(3^d + 1)/2$ linear segments. We artificially added to each entry of $X^h$ i.i.d. Gaussian noise of mean zero and variance $\sigma^2$. In this setting, the largest value of the entries of $X^h$ is 1 and $y$ is the all-ones vector. We show the effect of the dimension $d$ and the smoothing level $\sigma$ on $N(\mathcal{P})$, the number of linear segments. We report the average over 100 random draws of the smoothing noise per $X^h$. As can be seen from Figure 1, even for a tiny amount of entry-wise noise of $10^{-10}$, the number of linear segments shrinks dramatically. We also include a full table of results (Table 1), where 1/SNR denotes $-\log_{10}(\sigma)$.
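A minimal sketch of this first experiment is given below, under two assumptions: segments are counted with scikit-learn's double-precision `lars_path` homotopy rather than the authors' Float128 implementation, and `build_hard_instance` is a hypothetical placeholder for the worst-case construction of (Mairal & Yu, 2012), which is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import lars_path

def count_segments(X, y):
    """Number of linear segments of the exact Lasso path (LARS homotopy knots)."""
    alphas, _, _ = lars_path(X, y, method="lasso", max_iter=10**6)
    return len(alphas) - 1

def smoothed_complexity(Xh, y, sigma, trials=100, seed=0):
    """Average segment count after adding i.i.d. N(0, sigma^2) noise to each entry of Xh."""
    rng = np.random.default_rng(seed)
    counts = [count_segments(Xh + rng.normal(0.0, sigma, Xh.shape), y)
              for _ in range(trials)]
    return float(np.mean(counts))

# `build_hard_instance(d)` is a hypothetical helper standing in for the
# worst-case design of Mairal & Yu (2012), whose exact path has (3^d + 1)/2
# segments, with y the all-ones vector:
# Xh, y = build_hard_instance(d=8)
# for inv_snr in range(0, 21, 5):                 # 1/SNR = -log10(sigma)
#     print(inv_snr, smoothed_complexity(Xh, y, 10.0 ** (-inv_snr), trials=10))
```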
For the next experiment we use the MNIST dataset. We randomly selected $n = 1000$ images from the dataset. We constructed the data matrix $X \in \mathbb{R}^{n\times d^2}$ such that the $i$'th row of $X$ is a randomly chosen patch of size $d\times d$ from the $i$'th image. We cast the center pixel of patch $i$ as the target $y_i$ and discard that pixel from $X$. Thus, the regression task amounts to predicting the center pixel $y_i$ from its surrounding pixels. We plot in Figure 2 the relation between the size of the patch and the path complexity $N(\mathcal{P})$. Each point in the graph is the average over 100 random samples of patches and images. As can be seen, when the amount of noise in the data is fixed and governed by the data acquisition process (the minimum pixel value is 0 and the maximum is 255 for MNIST), the number of linear segments increases barely faster than linearly with the dimension.

*Table 1. Path complexity in a worst-case synthetic setting. The theoretical non-smoothed complexity is exponential in the size of the problem.*

| 1/SNR | d = 4 | d = 5 | d = 6 | d = 7 | d = 8 | d = 9 | d = 10 |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 8 | 10 | 12 | 13 | 15 | 18 |
| 1 | 8 | 9 | 10 | 13 | 16 | 14 | 17 |
| 2 | 10 | 13 | 13 | 15 | 17 | 21 | 24 |
| 3 | 16 | 18 | 18 | 20 | 25 | 24 | 26 |
| 4 | 21 | 26 | 27 | 27 | 30 | 33 | 33 |
| 5 | 31 | 41 | 41 | 44 | 46 | 45 | 47 |
| 6 | 36 | 50 | 55 | 58 | 59 | 66 | 67 |
| 7 | 41 | 71 | 81 | 91 | 89 | 91 | 98 |
| 8 | 41 | 91 | 118 | 136 | 133 | 134 | 138 |
| 9 | 41 | 110 | 148 | 181 | 184 | 189 | 192 |
| 10 | 41 | 122 | 205 | 256 | 259 | 286 | 268 |
| 11 | 41 | 122 | 276 | 364 | 407 | 410 | 411 |
| 12 | 41 | 122 | 354 | 467 | 538 | 560 | 566 |
| 13 | 41 | 122 | 365 | 642 | 704 | 795 | 831 |
| 14 | 41 | 122 | 365 | 872 | 1088 | 1141 | 1165 |
| 15 | 41 | 122 | 365 | 978 | 1404 | 1601 | 1694 |
| 16 | 41 | 122 | 365 | 1094 | 1814 | 2162 | 2416 |
| 17 | 41 | 122 | 365 | 1094 | 2478 | 3046 | 3343 |
| 18 | 41 | 122 | 365 | 1094 | 3030 | 3894 | 4345 |
| 19 | 41 | 122 | 365 | 1094 | 3281 | 5323 | 6048 |
| 20 | 41 | 122 | 365 | 1094 | 3281 | 7137 | 8592 |
| ∞ (σ = 0) | 41 | 122 | 365 | 1094 | 3281 | 9842 | 29525 |

## 10. Conclusions

We proved that the smoothed complexity of the Lasso's regularization path is pragmatically polynomial in the input size. Our analysis contrasts with worst-case settings for which the Lasso's complexity is known to be exponential. To illustrate the key idea, we provided an analysis in which each entry of the data matrix is smoothed by adding a small amount of Gaussian noise. Although not presented here, our analysis carries over to settings in which the smoothing is performed using other distributions, which can be sub-Gaussian, sub-exponential, or even non-i.i.d., so long as the rows of $G$ are statistically independent. The nature of smoothed analysis usually imposes a large polynomial factor (Spielman & Teng, 2001), as we do not make any additional assumptions on the hidden matrix $X^h$. However, we believe that the polynomial degree in our results can be further reduced. For example, in our proof we used the fact that the condition number of $X$ can be of order $d$, which is close to being ill-conditioned. For well-conditioned matrices the polynomial bound can be improved to $O(d^{2.1}n^{1.1})$. This reduction is also valid in settings where $n \ge 2d$ (see e.g. (Rudelson & Vershynin, 2010) for the different behaviors of the condition number of Gaussian random matrices for $n = d$ and $n \ge 2d$). Furthermore, when $n \ge 2d$, we can also improve $\gamma_s$ to $\Omega(1)$, hence our polynomial dependency can be further reduced to $O(d^{1.6}n^{0.6})$. A final improvement may stem from $L_w$ and $L_u$: we actually proved that $\big\|\frac{dw[\lambda]}{d\lambda}\big\|_2 \le \frac{\sqrt{d}}{\alpha^2}$, while we only need a bound on the infinity norm $\big\|\frac{dw[\lambda]}{d\lambda}\big\|_\infty$. We leave these improvements and further generalizations to future research.

## References

Arthur, David, Manthey, Bodo, and Röglin, Heiko. k-means has polynomial smoothed complexity. In *Foundations of Computer Science, 2009. FOCS '09. 50th Annual IEEE Symposium on*, pp. 405–414. IEEE, 2009.

Blum, Avrim and Dunagan, John. Smoothed analysis of the perceptron algorithm for linear programming. In *Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms*, pp. 905–914, 2002.

Dubiner, Moshe and Singer, Yoram. Entire relaxation path for maximum entropy problems. In *Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011*, pp. 941–948, 2011.

Efron, Bradley, Hastie, Trevor, Johnstone, Iain, Tibshirani, Robert, et al. Least angle regression. *The Annals of Statistics*, 32(2):407–499, 2004.
Giesen, Joachim, Jaggi, Martin, and Laue, Sören. Approximating parameterized convex optimization problems. In *European Symposium on Algorithms*, pp. 524–535. Springer, 2010.

Mairal, Julien and Yu, Bin. Complexity analysis of the lasso regularization path. In *Proceedings of the 29th International Conference on Machine Learning*, pp. 1835–1842. Omnipress, 2012.

Osborne, Michael R., Presnell, Brett, and Turlach, Berwin A. A new approach to variable selection in least squares problems. *IMA Journal of Numerical Analysis*, 20(3):389–403, 2000.

Rosset, Saharon and Zhu, Ji. Piecewise linear regularized solution paths. *The Annals of Statistics*, pp. 1012–1030, 2007.

Rudelson, Mark and Vershynin, Roman. Non-asymptotic theory of random matrices: extreme singular values. In *Proceedings of the International Congress of Mathematicians 2010 (ICM 2010)*, pp. 1576–1602. World Scientific, 2010.

Sankar, Arvind, Spielman, Daniel A., and Teng, Shang-Hua. Smoothed analysis of the condition numbers and growth factors of matrices. *SIAM Journal on Matrix Analysis and Applications*, 28(2):446–476, 2006.

Spielman, Daniel and Teng, Shang-Hua. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In *Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing*, pp. 296–305. ACM, 2001.

Spielman, Daniel A. and Teng, Shang-Hua. Smoothed analysis: an attempt to explain the behavior of algorithms in practice. *Communications of the ACM*, 52(10):76–84, 2009.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society. Series B (Methodological)*, pp. 267–288, 1996.

Tibshirani, Ryan J. and Taylor, Jonathan. Degrees of freedom in lasso problems. *The Annals of Statistics*, 40(2):1198–1232, 2012.