# Feature Adaptation for Sparse Linear Regression

Jonathan A. Kelner (MIT), Frederic Koehler (Stanford), Raghu Meka (UCLA), Dhruv Rohatgi (MIT)

Sparse linear regression is a central problem in high-dimensional statistics. We study the correlated random design setting, where the covariates are drawn from a multivariate Gaussian $N(0, \Sigma)$, and we seek an estimator with small excess risk. If the true signal is $t$-sparse, then it is information-theoretically possible to achieve strong recovery guarantees with only $O(t \log n)$ samples. However, computationally efficient algorithms have sample complexity linear in (some variant of) the condition number of $\Sigma$. Classical algorithms such as the Lasso can require significantly more samples than necessary even if there is only a single sparse approximate dependency among the covariates. We provide a polynomial-time algorithm that, given $\Sigma$, automatically adapts the Lasso to tolerate a small number of approximate dependencies. In particular, we achieve near-optimal sample complexity when the sparsity is constant and $\Sigma$ has few outlier eigenvalues. Our algorithm fits into a broader framework of feature adaptation for sparse linear regression with ill-conditioned covariates. With this framework, we additionally provide the first polynomial-factor improvement over brute-force search for constant sparsity $t$ and arbitrary covariance $\Sigma$.

1 Introduction

Sparse linear regression is a fundamental problem in high-dimensional statistics. In a natural random design formulation of this problem, we are given $m$ independent and identically distributed samples $(X_i, y_i)_{i=1}^m$, where each sample's covariates are drawn from an $n$-dimensional Gaussian random vector $X_i \sim N(0, \Sigma)$, and each response is $y_i = \langle X_i, v^* \rangle + \xi_i$ for independent noise $\xi_i \sim N(0, \sigma^2)$ and a $t$-sparse ground truth regressor $v^* \in \mathbb{R}^n$, where $t$ is much smaller than $n$. The goal^5 is to output a vector $\hat{v} \in \mathbb{R}^n$ for which the excess risk
$$\mathbb{E}(\langle X_0, \hat{v} \rangle - y_0)^2 - \sigma^2 = (\hat{v} - v^*)^\top \Sigma (\hat{v} - v^*) =: \|\hat{v} - v^*\|_\Sigma^2$$
is as small as possible, where $(X_0, y_0)$ is an independent sample from the same model.

kelner@mit.edu. This work was supported in part by NSF Large CCF-1565235, NSF Medium CCF-1955217, and NSF TRIPODS 1740751. fkoehler@stanford.edu. This work was supported in part by NSF award CCF-1704417, NSF award IIS-1908774, and N. Anari's Sloan Research Fellowship. raghum@cs.ucla.edu. This work was supported in part by NSF CAREER Award CCF-1553605 and NSF Small CCF-2007682. drohatgi@mit.edu. This work was supported by a U.S. DoD NDSEG Fellowship.

^5 More generally, from a learning theory perspective, we could consider an arbitrary improper learner outputting a function $\hat{f}(X_0)$, rather than specifically learning a linear function $\langle X_0, \hat{v} \rangle$. At least when $\Sigma$ is known, there is no advantage, as we can always project $\hat{f}$ onto the space of linear functions.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Without the sparsity assumption, the number of samples needed to achieve small excess risk (say, $O(\sigma^2)$) is linear in the dimension; with $O(n)$ samples, simple and computationally efficient algorithms such as ordinary least squares achieve the statistically optimal excess risk $O(\sigma^2 n / m)$. Sparsity allows for a significant statistical improvement: ignoring computational efficiency, it is well known that there is an estimator $\hat{v}$ with excess risk $O\!\left(\frac{\sigma^2 t \log n}{m}\right)$ as long as $m = \Omega(t \log n)$ (see e.g. [13, 33]; Theorem 4.1 in [23]).
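Before turning to computational considerations, the following minimal simulation sketch makes the random-design model above concrete (Python with numpy; the dimensions, sparsity, and noise level are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, sigma = 200, 100, 5, 0.5

# Covariance of the covariates; the identity is the easy well-conditioned case,
# whereas the paper's focus is ill-conditioned Sigma.
Sigma = np.eye(n)

# t-sparse ground-truth regressor v*.
v_star = np.zeros(n)
support = rng.choice(n, size=t, replace=False)
v_star[support] = rng.normal(size=t)

# Samples X_i ~ N(0, Sigma) and responses y_i = <X_i, v*> + xi_i with xi_i ~ N(0, sigma^2).
X = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
y = X @ v_star + sigma * rng.normal(size=m)

def excess_risk(v_hat, v_star, Sigma):
    """Excess prediction risk ||v_hat - v*||_Sigma^2 = (v_hat - v*)^T Sigma (v_hat - v*)."""
    d = v_hat - v_star
    return float(d @ Sigma @ d)

# The trivial estimator v_hat = 0 has excess risk ||v*||_Sigma^2; any useful estimator
# should do much better once m is large enough.
print(excess_risk(np.zeros(n), v_star, Sigma))
```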
The catch is that computing this estimator involves a brute-force search over n t possibilities (i.e., the possible supports for v ). At first glance, this combinatorial search may seem unavoidable if we wish to take advantage of sparsity. Indeed, similar problems are notoriously difficult: the only non-trivial algorithms for e.g., learning t-sparse parities with noise still require nΩ(t) time [29, 37]. However, it is a celebrated fact that for sparse linear regression, computationally efficient methods such as Lasso and Orthogonal Matching Pursuit can avoid this combinatorial search and still achieve very strong theoretical guarantees under conditions such as the Restricted Isometry Property (see e.g. [7, 10, 5, 4, 3, 1]). In the random design setting we consider, the Lasso is known to achieve optimal statistical rates (up to constants) when the covariance matrix Σ is well-conditioned [32, 46]. What about when Σ is ill-conditioned? In contrast with the statistically optimal estimator, Lasso and its cousins provably require sample complexity scaling with (some variant of) the condition number of Σ (see e.g. Theorem 14 in [38] or Theorem 6.5 in [23]). And with a few exceptions (e.g., in some settings with special graphical structure [23]) there has been little progress on designing new efficient algorithms for sparse linear regression with ill-conditioned Σ (see Section 4 for further discussion). For a general covariance Σ, no algorithm is even known that can achieve sample complexity f(t) n1 ϵ (for an arbitrary function f) without brute-force search. A computationally efficient algorithm that approaches the optimal statistical rate for arbitrary Σ might be too much to hope for. While no computational lower bounds are known, even in restricted computational models such as the Statistical Query model,6 the related worst-case problem of finding a t-sparse solution to a system of linear equations requires nΩ(t) time under standard complexity assumptions [15]. So it is plausible, though not certain, that some assumptions on Σ are necessary. In this work inspired by a long tradition (in random matrix theory, statistics, graph theory, and other areas) of studying matrices with a spectrum that is split between a large bulk and a small number of outlier spike eigenvalues [28, 39, 43] we identify a broad generalization of the standard well-conditionedness assumption, under which brute-force search can still be avoided. 1.1 Beyond well-conditioned Σ Say that Σ has eigenvalues λ1 λn, and that the sparsity t is a constant.7 Then standard bounds for Lasso require sample complexity (λn/λ1) O(log n). But if the covariates contain even a single approximate linear dependency, then λn/λ1 may be arbitrarily large. Moreover, if the dependency is sparse (e.g. two covariates are highly correlated), then there is a natural choice of v for which Lasso provably fails (see Theorem 6.5 of [23]). Indeed, this phenomenon is not just a limitation of the analysis; Lasso fails empirically as well, even for very small t (see Figure 2 in Appendix H for a simple example with t = 3). Such dependencies arise in applications ranging from finance (e.g., where some pairs of stocks or ETFs may be highly correlated, and an investor may be interested in the differences) to genomic data (where functionally related genes may have highly correlated expression patterns). Two-sparse dependencies can be directly identified by looking at the covariance matrix; see Section 4 for some discussion of previous research in this direction. 
But as t increases, naive methods for identifying tsparse dependencies quickly become computationally intractable. With domain knowledge, it may be possible to manually identify and correct such dependencies, but this process would also be time-consuming. Thus, we ask the following question: instead of assuming that λn/λ1 is bounded, suppose that there are constants dℓand dh so that λn dh/λdℓ+1 is bounded, i.e. the spectrum of Σ has only dℓoutliers at the low end, and only dh outliers at the high end. Can we still design an algorithm that achieves sample complexity O(log n) without resorting to brute-force search? Main result. We give a positive answer: an algorithm for sparse linear regression that is both computationally and statistically efficient for covariance matrices with a small number of outlier 6There are lower bounds for a family of regression estimators with coordinate-separable regularization [44] and a family of preconditioned-Lasso estimators [23, 24]. 7Note that for moderate-sized datasets (e.g. n = 1000), brute-force search is infeasible even for t as small as four or five. eigenvalues. In particular, this means we can handle a few approximate dependencies among the covariates (quantified by the number of eigenvalues below a threshold). In comparison, Lasso and other classical algorithms cannot tolerate even a single sparse approximate dependency. Our main algorithmic result is the following: Theorem 1.1. Let n, t, dℓ, dh, L N and σ, δ > 0. Let Σ Rn n be a positive semi-definite matrix with (non-negative) eigenvalues λ1 λn. Let v Rn be any t-sparse vector. Let (Xi, yi)m i=1 be independent with Xi N(0, Σ) and yi = Xi, v + ξi, where ξi N(0, σ2). Let neff := t(λn dh/λdℓ+1) log(n L/δ) + t O(t)dl + dh. Given Σ, t, dℓ, δ, and (Xi, yi)m i=1, there is an estimator ˆv Rn that has excess risk ˆv v 2 Σ O σ2neff L + 2 L v 2 Σ with probability at least 1 δ, so long as m Ω(neff L). Moreover, ˆv can be computed in time poly(n). Specifically, taking L log(m v 2 Σ /σ2), the time complexity is dominated by L eigendecompositions and L calls to a Lasso program, for overall runtime O(n3) (see Algorithm 2). This is substantially faster than the brute-force method (which takes O(nt) time) even for small values of t. The excess risk decays at rate O(σ2neff/m) (hiding the logarithmic factor), which is near the statistically optimal rate of O(σ2t/m) so long as neff is small, i.e. t is small and only a few eigenvalues lie outside a constant-factor range. In our analysis, we prove that the standard Lasso estimator can already tolerate a few large eigenvalues the main algorithmic innovation is needed to tolerate a few small eigenvalues, which turns out to be much trickier. Notice that when dℓ= dh = 0 we recover standard Lasso guarantees up to the factor of L; thus, Theorem 1.1 morally represents a generalization of classical results. We also show how to achieve a different trade-off between time and samples, eliminating the dependence on dℓin sample complexity at the cost of larger runtime: Theorem 1.2. In the setting of Theorem 1.1, let n eff := t(λn dh/λdℓ+1) log(n L/δ)+t2 log(t)+dh. Given Σ, t, dℓ, δ, and (Xi, yi)m i=1, there is an estimator ˆv Rn that has excess risk σ2n eff L m + 2 L v 2 Σ with probability at least 1 δ, so long as m Ω(n eff L). Moreover, ˆv can be computed in time poly(n, m, dt ℓ, tt2). Discussion & limitations. We discuss two limitations of the above results. 
First, both results incur exponential dependence on the sparsity t (in the sample complexity for Theorem 1.1, and the runtime for Theorem 1.2), which may be suboptimal. For Theorem 1.1, we remark that in practice the algorithm may not suffer this dependence (see e.g. Figure 1), and it is possible that the analysis can be tightened. For Theorem 1.2, we emphasize that the runtime is still fundamentally different than brute-force search: in particular, it s fixed-parameter tractable in t and dℓ. Second, both results require that Σ is known. Thus, they are only applicable in settings where we either have a priori knowledge, or can estimate Σ accurately because a large amount of unlabelled data is available. At a high level, this limitation is due to the need to compute the eigendecomposition of Σ, which cannot be approximated from the empirical covariance of a small number of samples. For simplicity, we have stated our results in terms of Gaussian covariates and noise, but this is not a fundamental limitation. We expect it is possible to prove similar results in the sub-Gaussian case at the cost of making the proof longer for instance, by building upon the techniques from [25] and related works. Pseudocode & simulation. See Algorithm 1 for complete pseudocode of Adapted BP(), a simplification of the method for the noiseless setting σ = 0. In Figure 1 we show that Adapted BP() significantly outperforms standard Basis Pursuit (i.e. Lasso for noiseless data [7]) on a simple example with n = 1000 variables, dℓ= 10 sparse approximate dependencies, and a ground truth Algorithm 1: Adapted BP for sparse linear regression with few outlier eigenvalues Procedure Find Heavy Coordinates({v1, . . . , vk},α) /* Gram-Schmidt computes an orthonormalization of v1, . . . , vk */ a1, . . . , ak GRAM-SCHMIDT({v1, . . . , vk}) return {i [n] : Pk j=1((aj)i)2 α2} Procedure Iterative Peeling(Σ, d, t) Compute eigendecomposition Σ = Pn i=1 λiuiu i P Pn i=d+1 uiu i Kt {i [n] : Pii < 1 1/(9t2)} for j = t to 1 do IP (Kj) Find Heavy Coordinates({Pi : i Kj}, 1/(6t)) Kj 1 Kj IP (Kj) return K0 Procedure Adapted BP(Σ, d, t, (Xi, yi)m i=1) S Iterative Peeling(Σ, d, t) return ˆv argminv Rn:Xv=y P Figure 1: Basis Pursuit (BP) versus Adapted BP in a simple synthetic example with n = 1000 covariates. The x-axis is the number of samples. The y-axis is the out-of-sample prediction error (averaged over 10 independent runs, and error bars indicate the standard deviation). regressor with sparsity t = 13. The covariates X1:1000 are all independent N(0, 1) except for 10 disjoint triplets {(Xi, Xi+1, Xi+2) : i = 1, 4, . . . , 28}, each of which has joint distribution Xi := Zi; Xi+1 = Zi + 0.4Zi+1; Xi+2 = Zi+1 + 0.4Zi+2 where Zi, Zi+1, Zi+2 N(0, 1) are independent. The (noiseless) responses are y = 6.25(X1 X2) + 2.5X3 + 1 10 P1000 i=991 Xi. See Appendix I for implementation details. 1.2 Organization In Section 2 we give an overview of the proofs of Theorem 1.1 and 1.2 (the complete proofs and full algorithm pseudocode are given in Appendix C). In Section 3 we discuss our other results obtained via feature adaptation. Section 4 covers related work. 2 Proof techniques We obtain Theorems 1.1 and 1.2 as outcomes of a flexible algorithmic approach for tackling sparse linear regression with ill-conditioned covariates: feature adaptation. As a pre-processing step, adapt or augment the covariates with additional features (i.e. well-chosen linear combinations of the covariates). 
Then, to predict the responses, apply ℓ1-regularized regression (Lasso) over the new set of features rather than the original covariates. In other words, we algorithmically change the dictionary (set of features) used in the Lasso regression. See Section 4 for a comparison to past approaches. We start by explaining the goals of feature adaptation for general Σ, and then show how we achieve those desiderata when Σ has few outlier eigenvalues. More precisely, the main technical difficulty is in dealing with the small eigenvalues, so in this proof overview we focus on the case where the only outliers are small eigenvalues. Complete proofs of Theorems 1.1 and 1.2 are in Appendix C. 2.1 What makes a good dictionary: the view from weak learning Obviously, the feature adaptation approach generalizes Lasso. Surprisingly, even though the sample complexity of the standard Lasso estimator is thoroughly understood, the basic question of whether for every covariate distribution (i.e. every Σ) there exists a good dictionary remains wide-open. To crystallize the power of feature adaptation, we introduce the following notion of a good dictionary. We suggest considering the simplified setting of α-weak learning, where the goal is just to find some ˆv so that the predictions X, ˆv are α-correlated with the ground truth X, v when X N(0, Σ). Moreover, we focus first on the existential question (rather than the algorithmic question of finding the dictionary). We will return to the setting of Theorems 1.1 and 1.2 later. For now, in the weak learning setting, a good dictionary (when the covariate distribution is N(0, Σ)) is one that satisfies the following covering property, but is not too large: Definition 2.1. Let Σ Rn n be a positive semi-definite matrix and let t, α > 0. A set {D1, . . . , DN} Rn is a (t, α)-dictionary for Σ if for every t-sparse v Rn, there is some i [N] with | v, Di Σ| α v Σ Di Σ , where we define x, y Σ := x Σy and x 2 Σ := x Σx for any x, y Rn. Let Nt,α(Σ) be the size of the smallest (t, α)-dictionary. The relevance of the covering number Nt,α(Σ) is quite simple: given a (t, α)-dictionary D for Σ, and given samples (Xi, yi)m i=1, the weak learning algorithm can simply output the vector ˆv D that maximizes the empirical correlation between the predictions Xi, ˆv and the responses yi. So long as there are enough samples for empirical correlations to concentrate, Definition 2.1 guarantees success. Formally, allowing for preprocessing time to compute the dictionary, O(α)-weak learning is possible in time Nt,α(Σ) poly(n), with O(α 2 log Nt,α(Σ)) samples (Proposition A.5). Hypothetically, bounding Nt,α(Σ) may not be necessary to develop an efficient sparse linear regression algorithm. However, all assumptions on Σ that are currently known to enable efficient sparse linear regression also immediately imply bounds on Nt,α (see Appendix G). For example, when Σ is well-conditioned, the standard basis is a good dictionary of size n (Fact A.4). In contrast, the only known bounds for arbitrary Σ (until the present work) are Nt,1/ (the brute-force dictionary, which includes a Σ-orthonormal basis for every set of t covariates) and Nt,1/ n(Σ) n (a Σ-orthonormal basis for all n covariates, which doesn t take advantage of sparsity and corresponds to algorithms such as Ordinary Least Squares). Thus, the following basic question when can we improve upon these trivial bounds seems central to understanding when brute-force search can be avoided in sparse linear regression: Question 2.2. 
How large is Nt,α(Σ) for an arbitrary positive semi-definite Σ Rn n? Are there natural families of ill-conditioned Σ (and functions f, g) for which Nt,1/f(t)(Σ) g(t) poly(n)? 2.2 Constructing a good dictionary when Σ has few small eigenvalues We now address Question 2.2 in the setting where Σ has a small number of eigenvalues that are much smaller than λn. In this setting, the standard basis may not be a good dictionary. For example, if two covariates are highly correlated, their difference may not be correlated with any of them. Nonetheless, we can prove the following covering number bound: Theorem 2.3. Let n, t, d N. Let Σ Rn n be a positive semi-definite matrix with eigenvalues λ1 λn. Then Nt,α(Σ) t(7t)2t2+tdt + n, where α = 1 7 In particular, when t = O(1) and Σ is well-conditioned except for O(1) outliers λ1, . . . , λd, we get a linear-size dictionary just as in the case where Σ is well-conditioned. In fact, the desired (t, α)-dictionary can be constructed efficiently. Our key lemma shows that when Σ has few small eigenvalues, there is a small subset of covariates that causes all of the sparse approximate dependencies in the sense that the ℓ2 norm of any sparse vector, excluding the mass on the subset, can be upper bounded in terms of the Σ-norm of the vector. Moreover, there is an efficient algorithm that finds a superset of these covariates. Formally, we prove the following: Lemma 2.4. Let n, t, d N. Let Σ Rn n be a positive semi-definite matrix with eigenvalues λ1 λn. Given Σ, d, and t, there is a polynomial-time algorithm Iterative Peeling() producing a set S [n] with the following guarantees: (a) For every t-sparse v Rn, it holds that v[n]\S 2 3λ 1/2 d+1 v Σ. (b) |S| (7t)2t+1d. Once this set S has been found, the dictionary is simply the standard basis {e1, . . . , en}, together with a Σ-orthonormal basis for every set of t covariates in S. By guarantee (a), we can prove that every t-sparse vector correlates with some element of this dictionary under the Σ-inner product. By guarantee (b), the dictionary is much smaller than the brute-force dictionary that contains a basis for all n t sets of t covariates. Together, this gives an algorithmic proof for Theorem 2.3. Intuition for Iterative Peeling(). We compute the set S via a new iterative method which leverages knowledge of the small eigenspaces of Σ. See Algorithm 1 for the pseudocode. To compute S, the algorithm Iterative Peeling() first computes the orthogonal projection matrix P that projects onto the subspace spanned by the top n d eigenvectors of Σ. Starting with the set of coordinates that correlate with ker(P), the procedure then iteratively grows S in such a way that at each step, a new participant of each approximate sparse dependency is discovered, but S does not become too much larger. The intuition is as follows: as a preliminary attempt, we could identify all O(d) coordinates that correlate (with respect to the standard inner product) with the lowest d eigenspaces of Σ. If e.g. the covariates have a sparse dependency X1 + X2 = 0, then ker Σ contains the vector e1 + e2, so the coordinates {e1, e2} will be correctly discovered. Unfortunately, if Σ contains a more complex sparse dependency such as ϵ 1(X1 X2) X3 X4 = 0 where ϵ > 0 is very small, then this heuristic will discover {e1, e2} but miss {e3, e4}. For this example, the solution is to notice that e3 and e4 do correlate with the subspace spanned by ker(Σ) {e1, e2} (which contains e3 + e4). 
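The following small numerical sketch illustrates this intuition on the four-variable example above, instantiated with ε = 0.01 (an illustrative construction of ours, not code from the paper): the bottom eigenvector of Σ flags only coordinates 1 and 2, but once e1 and e2 are adjoined to the subspace, coordinates 3 and 4 also become correlated with it.

```python
import numpy as np

eps = 0.01
# X1 = Z1, X2 = Z1 - eps*(Z2 + Z3), X3 = Z2, X4 = Z3 with Z_i i.i.d. N(0,1), so that
# eps^{-1}(X1 - X2) - X3 - X4 = 0 exactly: a 4-sparse dependency hidden from a naive scan.
A = np.array([[1.0, 0.0, 0.0],     # X1
              [1.0, -eps, -eps],   # X2
              [0.0, 1.0, 0.0],     # X3
              [0.0, 0.0, 1.0]])    # X4
Sigma = A @ A.T

def correlated_coordinates(vectors, alpha):
    # sup over unit x in span(vectors) of |x_i| equals the i-th row norm of an
    # orthonormal basis of the span (computed here via QR; in the spirit of Lemma 2.5).
    Q, _ = np.linalg.qr(np.column_stack(vectors))
    return [i for i in range(Q.shape[0]) if np.linalg.norm(Q[i]) >= alpha]

eigvals, U = np.linalg.eigh(Sigma)
u_min = U[:, 0]                    # eigenvector of the (near-)zero eigenvalue
e = np.eye(4)
alpha = 1.0 / 24                   # the 1/(6t) threshold with t = 4

print(correlated_coordinates([u_min], alpha))                   # [0, 1]: only X1, X2 found
print(correlated_coordinates([u_min, e[:, 0], e[:, 1]], alpha)) # [0, 1, 2, 3]: X3, X4 too
```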
In general, if S is the set of coordinates discovered thus far, then by finding basis vectors that correlate with an appropriate subspace (of dimension at most |S|), we can efficiently augment S with at least one new coordinate from each t-sparse approximate dependency, without making S bigger by more than a factor of O(t). Iterating this augmentation t times therefore provably identifies all problematic coordinates. To formalize this intuition, the following lemma will be needed to bound how much S grows at each iteration; it shows that the number of coordinates that correlate with a low-dimensional subspace is not too large (proof deferred to Appendix B): Lemma 2.5. Let V Rn be a subspace with d := dim V . For some α > 0 define i [n] : sup x V \{0} Then |S| d/α2. Moreover, given a set of vectors that span V , we can compute S in time poly(n). We also define the set of vectors v that have unusually large norm outside a set S, compared to v Pv, which is the distance from v to the subspace spanned by the bottom d eigenvectors of Σ: Definition 2.6. For any matrix P Rn n and subset S [n], define WP,S := {v Rn : v Sc 2 > 3 We then formalize the guarantee of each iteration of Iterative Peeling() as follows: Lemma 2.7. Let n, t N and let P : n n be an orthogonal projection matrix. Suppose τ 1 and K [n] satisfy (a) Pii 1 1/(9t2) for all i K, (b) | supp(v) \ K| τ for every v B0(t) WP,K. Then there exists a set IP (K) with |IP (K)| 36t2|K| such that | supp(v) \ (IP (K) K)| τ 1 for all v B0(t) WP,K. Moreover, given P, K, and t, we can compute IP (K) in time poly(n). Proof sketch. We define the set a [n] \ K : sup x span{P ei:i K}\{0} |xa| x 2 1/(6t) It is clear from Lemma B.2 (applied with parameters V := span{Pei : i K} and α := 1/(6t)) that |IP (K)| 36t2|K|, and that IP (K) can be computed in time poly(n). It remains to show that |GP (v) \ (IP (K) K)| τ 1 for all v B0(t). Consider any v B0(t) WP,K. Then v Kc 2 > 3 Pv 2. It s sufficient to show that IP (K) contains some j supp(v) \ K, i.e. that there is some j supp(v) \ K such that ej correlates with span{Pi : i K}. We accomplish this by showing that v Kc correlates with Pv K = P At a high level, the reason for this is that v Kc is close to Pv Kc (since Pi ei for i Kc), and Pv = Pv K + Pv Kc is much smaller than Pv Kc v Kc, so Pv K and Pv Kc must be highly correlated. See Appendix B for the full proof. We can now complete the proof of Lemma 2.4 by repeatedly invoking Lemma B.4. Proof of Lemma 2.4. Let Σ = Pn i=1 λiuiu i be the eigendecomposition of Σ, and let P := Pn i=d+1 uiu i be the projection onto the top n d eigenspaces of Σ. Set Kt = {i [n] : Pii < 1 1/(9t2)}. Because tr(P) = n d and Pii 1 for all i [n], it must be that |Kt| 9t2d. Also, for any v B0(t) WP,Kt we have trivially by t-sparsity that | supp(v) \ Kt| t. Define Kt 1 to be Kt IP (Kt) where IP (Kt) is as defined in Lemma B.4; we have the guarantees that |Kt 1| (1+36t2)|Kt| and |GP (v)\Kt| t 1 for all v B0(t) WP,Kt. Since Kt 1 Kt, it holds that WP,Kt 1 WP,Kt, and thus |GP (v) \ Kt| t 1 for all v B0(t) WP,Kt 1. Moreover, since Kt 1 Kt, it obviously holds that Pii 1 1/(9t2) for all i Kt 1. This means we can apply Lemma B.4 with τ := t 1 and K := Kt 1 and so iteratively define sets Kt 2 K1 K0 [n] in the same way. In the end, we obtain the set K0 [n] with |K0| 9t2d(1 + 36t2)t and supp(v) K0 for all v B0(t) WP,K0. The latter guarantee means that in fact B0(t) WP,K0 = . So for any t-sparse v Rn it holds that v Kc 0 2 3 v Pv 3λ 1/2 d+1 where the last inequality holds since λd+1P Σ. 
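For readers who prefer code to pseudocode, the following is a compact Python reading of IterativePeeling() from Algorithm 1, using the same thresholds (1 − 1/(9t²) for the diagonal of P and 1/(6t) for the heavy-coordinate test). It is a sketch of our reading of the pseudocode, not the reference implementation.

```python
import numpy as np

def find_heavy_coordinates(vectors, alpha):
    """Coordinates i such that some unit vector x in span(vectors) has |x_i| >= alpha.
    After orthonormalizing (Gram-Schmidt / QR), this sup is the i-th row norm of Q."""
    Q, _ = np.linalg.qr(np.column_stack(vectors))
    return {i for i in range(Q.shape[0]) if np.linalg.norm(Q[i]) >= alpha}

def iterative_peeling(Sigma, d, t):
    """Return a superset S of the coordinates that participate in t-sparse approximate
    dependencies, assuming Sigma has (at most) d small outlier eigenvalues."""
    n = Sigma.shape[0]
    _, U = np.linalg.eigh(Sigma)            # eigenvalues in increasing order
    P = U[:, d:] @ U[:, d:].T               # projector onto the top n - d eigenspaces
    K = {i for i in range(n) if P[i, i] < 1.0 - 1.0 / (9 * t**2)}
    for _ in range(t):                      # the loop "for j = t down to 1" in Algorithm 1
        if not K:
            break
        K = K | find_heavy_coordinates([P[:, i] for i in sorted(K)], 1.0 / (6 * t))
    return K
```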
2.3 Beyond weak learning So far, we have sketched a proof that if Σ has few outlier eigenvalues, then there is an efficient algorithm to compute a good dictionary (as in Theorem 2.3). This gives an efficient α-weak learning algorithm (via Proposition A.5). However, our ultimate goal is to find a regressor ˆv with prediction error going to 0 as the number of samples increases. Definition 2.1 is not strong enough to ensure this.8 However, it turns out that the dictionary constructed in Theorem 2.3 in fact satisfies a stronger guarantee9 that is sufficient to achieve vanishing prediction error: 8Moreover, standard notions of boosting weak learners (e.g. in distribution-free classification) do not apply in this setting. 9See Lemma A.3 for a proof that the ℓ1-representation property implies the (t, α)-dictionary property. Definition 2.8. Let Σ Rn n be a positive semi-definite matrix and let t, B > 0. A set {D1, . . . , DN} Rn is a (t, B)-ℓ1-representation for Σ if for any t-sparse v Rn there is some α RN with v = PN i=1 αi Di and PN i=1 |αi| Di Σ B v Σ . With this definition in hand, we can actually prove the following strengthening of Theorem 2.3: Lemma 2.9. Let n, t, d N. Let Σ Rn n be a positive semi-definite matrix with eigenvalues λ1 λn. Then Σ has a (t, 7 λn/λd+1)-ℓ1-representation D of size at most n + t(7t)2t2+tdt. Moreover, D can be computed in time t O(t2)dt poly(n). Proof sketch. Let S be the output of Iterative Peeling(Σ, d, t). The dictionary D consists of the standard basis, together with a Σ-orthogonal basis for each set of t coordinates from S. The bound on |D| comes from the guarantee |S| (7t)2t+1d. For any t-sparse vector v Rn, we know that v Sc is efficiently represented by the standard basis (because Theorem B.1 guarantees that v Sc 2 O(λ 1/2 d+1 v Σ)), and v S is efficiently represented by one of the Σ-orthonormal bases. See Appendix B for the full proof. Why is the above guarantee useful? If each Di is normalized to unit Σ-norm, then the condition of (t, B)-ℓ1-representability is equivalent to α 1 B v Σ. That is, with respect to the new set of features, the regressor α has bounded ℓ1 norm. Thus, if we apply the Lasso with a set of features that is a (t, B)-ℓ1-representation for Σ, then standard slow rate guarantees hold (proof in Section A): Proposition 2.10. Let n, m, N, t N and B > 0. Let Σ Rn n be a positive semi-definite matrix and let D be a (t, B)-ℓ1-representation of size N for Σ, normalized so that v Σ = 1 for all v D. Fix a t-sparse vector v Rn, let X1, . . . , Xm N(0, Σ) be independent and let yi = Xi, v +ξi where ξi N(0, σ2). For any R > 0, define ˆw argmin w RN: w 1 BR XDw y 2 2 where D Rn N is the matrix with columns comprising the elements of D, and X Rm n is the matrix with rows X1, . . . , Xm. So long as m = Ω(log(n/δ)) and w Σ [R/2, R], it holds with probability at least 1 δ that D ˆw w 2 Σ = O m + σ2 log(4/δ) m + B2 w 2 Σ log(n) m Combining Proposition 2.10 with Lemma 2.9 shows that there is an algorithm with time complexity t O(t2)dt poly(n) and sample complexity O(poly(t)(λn/λd+1) log(n) log(d)) for finding a regressor with squared prediction error o(σ2 + v 2 Σ). This is a simplified version of Theorem 1.2. The full proof involves additional technical details (e.g. more careful analysis to take care of large eigenvalues, and to avoid needing an estimate R for w Σ) but the above exposition contains the central ideas. 
Theorem 1.1 similarly computes the set S from Lemma 2.4 but uses it to construct a different dictionary: the standard basis, plus a Σ-orthonormal basis for S.10 See Appendix C for the full proofs and pseudocode. 3 Additional Results We now return to Question 2.2 and ask whether there are other families of ill-conditioned Σ for which we can prove non-trivial bounds on Nt,α(Σ). First, we ask what can be shown for arbitrary covariance matrices. We prove that every covariance matrix Σ satisfies a non-trivial bound Nt,1/O(t3/2 log n)(Σ) O(nt 1/2). In fact, building on tools from computational geometry, we show the stronger result that Σ has a (t, O(t3/2 log n))-ℓ1representation that of size O(nt 1/2), that is computable from samples in time O(nt Ω(1/t)) for 10More precisely, the algorithm just skips regularizing S, which is morally equivalent. As it is simpler to implement, that is shown in Algorithm 1, and analyzed for the proofs. any constant t > 1 (Theorem D.5). As a corollary, we provide the first sparse linear regression algorithm with time complexity that is a polynomial-factor better than brute force, and with near-optimal sample complexity, for any constant t and arbitrary Σ (proof in Section D): Theorem 3.1. Let n, m, t, B N and σ > 0, and let Σ Rn n be a positive-definite matrix. Let w Rn be t-sparse, and suppose w Σ [B/2, B]. Suppose m Ω(t log n). Let (Xi, yi)m i=1 be independent samples where Xi N(0, Σ) and yi = Xi, w + N(0, σ2). Then there is an O(m2nt 1/2 +nt Ω(1/t) log O(t) n)-time algorithm that, given (Xi, yi)m i=1, B, and σ2, produces an estimate ˆw Rn satisfying, with probability 1 o(1), σ2 m + σ w Σ t3/2 m + w 2 Σ t3 Second, one goal is to improve sample complexity (i.e. obtain α without dependence on condition number) without paying too much in time complexity (i.e. retain bounds on Nt,α that are better than nt). To this end, we prove that the dependence on κ in the correlation level (see Fact A.4) can actually be replaced by dependence on κ in the dictionary size (proof in Appendix E): Theorem 3.2. Let n, t N. Let Σ Rn n be a positive-definite matrix with condition number κ. Then Nt,1/3t+1(Σ) 2O(t2)κ2t+1 n. In particular, for any constant t = 1/ϵ, our result shows that there is a nearly-linear size dictionary with constant correlations even for covariance matrices with polynomially-large condition number κ nϵ/100. While we are not currently aware of an efficient algorithm for computing the dictionary, the above bound nonetheless raises the interesting possibility that there may be a sample-efficient and computationally-efficient weak learning algorithm under a super-constant bound on κ. 4 Related work Dealing with correlated covariates. There is considerable work on improving the performance of Lasso in situations where some clusters of covariates are highly correlated [47, 19, 2, 42, 21, 12, 27]. These methods can work well for two-sparse dependencies, but generally do not work as well for higher-order dependencies hence they cannot be used to prove our main result. The approach of [2] is perhaps the closest in spirit to ours. They perform agglomerative clustering of correlated covariates, orthonormalize the clusters with respect to Σ, and apply Lasso (or solve an essentially equivalent group Lasso problem). This method fails, for example, when there is a single threesparse dependency, and the remaining covariates have some mild correlations. 
Depending on the correlation threshold, their method will either aggressively merge all covariates into a single cluster, or fail to merge the dependent covariates. Feature adaptation and preconditioning. Generalizations of Lasso via a preliminary change-ofbasis (or explicitly altering the regularization term) have been studied in the past, but largely not to solve sparse linear regression per se; instead the goal has been using ℓ1 regularization to encourage other structural properties such as piecewise continuity (e.g. in the fused lasso , see [35, 36, 20, 8] for some more examples). An exception is recent work showing that a sparse preconditioning step can enable Lasso to be statistically efficient for sparse linear regression when the covariates have a certain Markovian structure [23]. Our notion of feature adaptation via dictionaries generalizes sparse preconditioning, which corresponds to choosing a non-standard basis in which Σ becomes well-conditioned and the sparsity of the signal is preserved. Statistical query (SQ) model; sparse halfspaces. From the complexity standpoint, Nt,α(Σ) is a covering number and therefore closely corresponds to a packing number Pt,α(Σ) (see Section A.1 for the definition). This packing number is essentially the (correlational) statistical dimension, which governs the complexity of sparse linear regression with covariates from N(0, Σ) in the (correlational) SQ model (see e.g. [14] for exposition of this model). Whereas strong nΩ(t) SQ lower bounds are known for related problems such as sparse parities with noise [29], no non-trivial (i.e. super-linear) lower bounds are known for sparse linear regression. Relatedly, in a COLT open problem, Feldman asked whether any non-trivial bounds can be shown for the complexity of weak learning sparse halfspaces in the SQ model [11]. Our results also yield improved bounds for weakly SQ-learning sparse halfspaces over certain families of multivariate Gaussian distributions. Improving brute-force for arbitrary Σ. Several prior works have suggested improvements on brute-force search for variants of t-sparse linear regression [18, 16, 31, 6]. However, all of these have limitations preventing their application to the general setting we address in Theorem 3.1. Specifically, [18] requires Ω(nt) preprocessing time on the covariates; [16, 31] require noiseless responses; and [6] has time complexity scaling with logm n (since our random-design setting necessitates m Ω(t log n), their algorithm has time complexity much larger than nt). [1] Peter J Bickel, Ya acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of lasso and dantzig selector. The Annals of statistics, 37(4):1705 1732, 2009. [2] Peter B uhlmann, Philipp R utimann, Sara Van De Geer, and Cun-Hui Zhang. Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143(11):1835 1858, 2013. [3] Emmanuel Candes, Terence Tao, et al. The dantzig selector: Statistical estimation when p is much larger than n. Annals of statistics, 35(6):2313 2351, 2007. [4] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207 1223, 2006. [5] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203 4215, 2005. [6] Jean Cardinal and Aur elien Ooms. 
Algorithms for approximate sparse regression and nearest induced hulls. Journal of Computational Geometry, 13(1):377 398, 2022. [7] Shaobing Chen and David Donoho. Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41 44. IEEE, 1994. [8] Arnak S Dalalyan, Mohamed Hebiri, Johannes Lederer, et al. On the prediction performance of the lasso. Bernoulli, 23(1):552 581, 2017. [9] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. ar Xiv preprint ar Xiv:1102.3975, 2011. [10] David L Donoho and Philip B Stark. Uncertainty principles and signal recovery. SIAM Journal on Applied Mathematics, 49(3):906 931, 1989. [11] Vitaly Feldman. Open problem: The statistical query complexity of learning sparse halfspaces. In Conference on Learning Theory, pages 1283 1289. PMLR, 2014. [12] Mario Figueiredo and Robert Nowak. Ordered weighted l1 regularized regression with strongly correlated covariates: Theoretical aspects. In Artificial Intelligence and Statistics, pages 930 938. PMLR, 2016. [13] Dean P Foster and Edward I George. The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947 1975, 1994. [14] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587 3596. PMLR, 2020. [15] Aparna Gupte and Vinod Vaikuntanathan. The fine-grained hardness of sparse linear regression. ar Xiv preprint ar Xiv:2106.03131, 2021. [16] Aparna Ajit Gupte and Kerri Lu. Fine-grained complexity of sparse linear regression. 2020. [17] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. [18] Sariel Har-Peled, Piotr Indyk, and Sepideh Mahabadi. Approximate sparse linear regression. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018. [19] Jim C Huang and Nebojsa Jojic. Variable selection through correlation sifting. In RECOMB, volume 6577, pages 106 123. Springer, 2011. [20] Jan-Christian H utter and Philippe Rigollet. Optimal rates for total variation denoising. In Conference on Learning Theory, pages 1115 1146. PMLR, 2016. [21] Jinzhu Jia, Karl Rohe, et al. Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150 1172, 2015. [22] Jonathan Kelner, Frederic Koehler, Raghu Meka, and Ankur Moitra. Learning some popular gaussian graphical models without condition number bounds. In Proceedings of Neural Information Processing Systems (Neur IPS), 2020. [23] Jonathan Kelner, Frederic Koehler, Raghu Meka, and Dhruv Rohatgi. On the power of preconditioning in sparse linear regression. 62nd Annual IEEE Symposium on Foundations of Computer Science, 2021. [24] Jonathan A Kelner, Frederic Koehler, Raghu Meka, and Dhruv Rohatgi. Distributional hardness against preconditioned lasso via erasure-robust designs. ar Xiv preprint ar Xiv:2203.02824, 2022. [25] Guillaume Lecu e and Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. ar Xiv preprint ar Xiv:1305.4825, 2013. [26] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1049 1065. IEEE, 2015. 
[27] Yuan Li, Benjamin Mark, Garvesh Raskutti, and Rebecca Willett. Graph-based regularization for regression problems with highly-correlated designs. In 2018 IEEE Global Conference on Signal and Information Processing (Global SIP), pages 740 742. IEEE, 2018. [28] Shahar Mendelson et al. On the performance of kernel classes. 2003. [29] Elchanan Mossel, Ryan O Donnell, and Rocco P Servedio. Learning juntas. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 206 212, 2003. [30] Wolfgang Mulzer, Huy L Nguyˆen, Paul Seiferth, and Yannik Stein. Approximate k-flat nearest neighbor search. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 783 792, 2015. [31] Eric Price, Sandeep Silwal, and Samson Zhou. Hardness and algorithms for robust and sparse optimization. In International Conference on Machine Learning, pages 17926 17944. PMLR, 2022. [32] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241 2259, 2010. [33] Phillippe Rigollet and Jan-Christian H utter. High dimensional statistics. Lecture notes for course 18S997, 813:814, 2015. [34] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with a smooth loss. ar Xiv preprint ar Xiv:1009.3896, 2010. [35] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91 108, 2005. [36] Ryan J Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. The annals of statistics, 39(3):1335 1371, 2011. [37] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 11 20. IEEE, 2012. [38] Sara Van De Geer. On tight bounds for the lasso. Journal of Machine Learning Research, 19:46, 2018. [39] Sara A Van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000. [40] Sara A Van De Geer, Peter B uhlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360 1392, 2009. [41] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. [42] Fabian L Wauthier, Nebojsa Jojic, and Michael I Jordan. A comparative framework for preconditioned lasso algorithms. Advances in Neural Information Processing Systems, 26:1061 1069, 2013. [43] Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077 2098, 2005. [44] Yuchen Zhang, Martin J Wainwright, Michael I Jordan, et al. Optimal prediction for sparse linear models? lower bounds for coordinate-separable m-estimators. Electronic Journal of Statistics, 11(1):752 799, 2017. [45] Lijia Zhou, Frederic Koehler, Danica J Sutherland, and Nathan Srebro. Optimistic rates: A unifying theory for interpolation learning and regularization in linear regression. ar Xiv preprint ar Xiv:2112.04470, 2021. [46] Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices. ar Xiv preprint ar Xiv:0912.4045, 2009. [47] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301 320, 2005. 
A Preliminaries Throughout, we use the following standard notation. For positive integers n, m N, we write A : m n to denote a matrix with m rows, n columns, and real-valued entries. The standard inner product on Rn is denoted u, v := u v. For a positive semi-definite matrix Σ : n n we define the Σ-inner product on Rn by u, v Σ := u Σv and the Σ-norm by u Σ = p u, u Σ. For n N (made clear by context) we let e1, . . . , en Rn be the standard basis vectors ei(j) := 1[j = i]. For a vector v Rn and set S [n] we write v S to denote the restriction of v to coordinates in S. For symmetric matrices A, B : n n we write A B to denote that B A is positive semi-definite. A.1 Covering, packing, and ℓ1-representability We previously defined the covering number of t-sparse vectors with respect to a covariance matrix Σ. We next define the packing number (i.e. correlational statistical dimension) and ℓ1-representability, and discuss the connections between these quantities as well as their algorithmic implications. Definition A.1. Let Σ : n n be a positive semi-definite matrix and let t, α > 0. A set {v1, . . . , v N} Rn is a (t, α)-packing for Σ if every vi is t-sparse, and | vi, vj Σ| < α vi Σ vj Σ for all i, j [N] with i = j. The (correlational) statistical dimension of t-sparse vectors with maximum correlation α, under the Σ-inner product, is denoted Pt,α(Σ) and defined as the size of the largest (t, α)-packing. We will make use of the following connections between packing, covering, and ℓ1-representability. Lemma A.2 (Covering packing). For any positive semi-definite matrix Σ : n n and t, α > 0, it holds that (α2/3)Pt,α2/2(Σ) Nt,α(Σ) Pt,α(Σ). Proof. First inequality. Let {D1, . . . , DN} be any maximum-size (t, α2/2)-packing. Since the Di s are all t-sparse, each must be correlated with some element of a (t, α)-dictionary. Thus, it suffices to show that for any v Rn, the set S(v) := {i [N] : | Di, v Σ| α Di Σ v Σ} has size |S(v)| 3/α2. Indeed, for any i, j S(v) with i = j, we have by the definition of a packing that * v 2 Σ v, Dj Dj, v Σ Σ = Di, Dj Di, v Σ Dj, v Σ 2 Di Σ Dj Σ . For each i S(v) define Ri = Di Di, v Σv/ v 2 Σ. Then = |S(v)| + X i,j S(v):i =j Ri, Rj Σ Ri Σ Rj Σ |S(v)| |S(v)|(|S(v)| 1) α2 where the last inequality uses the bound Ri Σ Di Σ. Rearranging gives |S(v)| 1 + (2/α2). Second inequality. Let {D1, . . . , DN} be any maximal (t, α)-packing. Then for any t-sparse v Rn, maximality implies that there must be some i [N] with | Di, v Σ| α Di Σ v Σ. So {D1, . . . , DN} is also a (t, α)-dictionary. Lemma A.3 (ℓ1-representation = covering). Let Σ : n n be a positive semi-definite matrix and let t, B > 0. If {D1, . . . , DN} Rn is a (t, B)-ℓ1-representation for Σ, then it is also a (t, 1/B)-dictionary for Σ. Proof. Pick any t-sparse v Rn. By ℓ1-representability, there is some α RN with v = PN i=1 αi Di and PN i=1 |αi| Di Σ B v Σ. Hence i=1 αi v, Di Σ i=1 |αi| v Σ Di Σ max j [N] | v, Dj Σ| v Σ Dj Σ B v 2 Σ max j [N] | v, Dj Σ| v Σ Dj Σ and thus maxj [N] | v,Dj Σ| v Σ Dj Σ 1/B. We can now easily prove that the standard basis is a good dictionary for well-conditioned Σ. Fact A.4. Let Σ be a positive definite matrix with condition number λmax(Σ) λmin(Σ) κ. Under the Σ-inner product, every t-sparse vector is at least 1/( κt)-correlated with some standard basis vector. Proof. By Lemma A.3, it suffices to show that the standard basis {e1, . . . , en} is a (t, κt)-ℓ1representation for Σ. Indeed, for any t-sparse v Rn, i=1 |vi| ei Σ λmax(Σ) ei 2 = p λmax(Σ) v 1 as desired. 
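As a quick numerical sanity check of Fact A.4 (a small random experiment of ours, not part of the paper's proofs), one can verify that random t-sparse vectors are always at least 1/√(κt)-correlated with some standard basis vector under the Σ-inner product:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 50, 4

# A well-conditioned covariance: eigenvalues in [1, kappa] with a random eigenbasis,
# so the condition number is at most kappa.
kappa = 10.0
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
Sigma = Q @ np.diag(rng.uniform(1.0, kappa, size=n)) @ Q.T

def max_basis_correlation(v, Sigma):
    """max_i |<v, e_i>_Sigma| / (||v||_Sigma * ||e_i||_Sigma), using <v, e_i>_Sigma = (Sigma v)_i."""
    Sv = Sigma @ v
    return np.max(np.abs(Sv) / (np.sqrt(v @ Sv) * np.sqrt(np.diag(Sigma))))

worst = 1.0
for _ in range(1000):
    v = np.zeros(n)
    idx = rng.choice(n, size=t, replace=False)
    v[idx] = rng.normal(size=t)
    worst = min(worst, max_basis_correlation(v, Sigma))

print(worst, 1.0 / np.sqrt(kappa * t))   # Fact A.4: the first number should exceed the second
```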
A.2 Algorithmic implications An existential proof that Nt,α(Σ) is small unfortunately does not in general give an efficient algorithm for constructing a concise dictionary. However, with the caveat that the dictionary must be given to the algorithm as advice, bounds on Nt,α do imply weak learning algorithms with sample complexity O(α 2 log(n)): Proposition A.5. Let Σ : n n be a positive semi-definite matrix and let D be a (t, α)-dictionary for Σ, for some t N and α (0, 1). For m N and t-sparse v Rn, let X1, . . . , Xm N(0, Σ) be independent and let yi = Xi, v + ξi where ξi N(0, σ2). Define the estimator ˆv = argmin v D β R βXv y 2 2 where X : m n is the matrix with rows X1, . . . , Xm. For any δ > 0, if m Cα 2 log(32|D|/δ) for a sufficiently large absolute constant C, then with probability at least 1 δ, ˆβ ˆw w 2 Σ (1 α2/4) w 2 Σ + 400σ2 log(4|D|/δ) Proof. Since D is a (t, α)-dictionary, we know that there is some v D with | v, v Σ| α v Σ v Σ . We then apply Lemma F.4. The above guarantee is essentially of the form at least 1% of the signal variance can be explained . Under the ℓ1-representability condition, something much stronger is true: Proposition A.6. Let n, m, N, t N and B > 0. Let Σ : n n be a positive semi-definite matrix and let D be a (t, B)-ℓ1-representation of size N for Σ, normalized so that v Σ = 1 for all v D. Fix a t-sparse vector v Rn, let X1, . . . , Xm N(0, Σ) be independent and let yi = Xi, v +ξi where ξi N(0, σ2). For any R > 0, define ˆw argmin w RN: w 1 BR XDw y 2 2 where D : n N is the matrix with columns comprising the elements of D, and X : m n is the matrix with rows X1, . . . , Xm. So long as m = Ω(log(n/δ)) and R v Σ, it holds with probability at least 1 δ that D ˆw w 2 Σ = O m + σ2 log(4/δ) m + B2R2 log(n) Proof. By ℓ1-representability and normalization of D, there is some w RN such that v = Dw and w 1 B v Σ BR. Let Γ = D ΣD. Also, by normalization, maxi Γii = 1. Thus, we can apply standard slow rate Lasso guarantees to the samples (D Xi, yi)m i=1 to get the claimed bound (see e.g. Theorem 14 of [22]). A.3 Optimizing the Lasso in near-linear time Theorem A.7 (see e.g. Corollary 4 and Section 5.3 in [34]). Let n, m, B, H, T N and σ > 0. Fix X1, . . . , Xm Rn with Xi H for all i, and fix w Rn with w 1 B. For i [m] define yi = Xi, w + ξi where ξi N(0, σ2) are independent random variables. Given (Xi, yi)m i=1 as well as B, T, and σ2, there is an algorithm Mirror Descent Lasso((Xi, yi)m i=1, B, T, σ2), which optimizes the Lasso objective via T iterations of mirror descent, that produces an estimate ˆw Rn satisfying ˆw 1 B and, with probability 1 o(1), 1 m X ˆw y 2 2 1 m Xw y 2 2 + O Moreover, the time complexity of Mirror Descent Lasso() is O(nm T). Theorem A.8. Let n, m, B, H N and σ > 0. Let Σ : n n be positive semi-definite with maxj [n] Σjj H2. Fix w Rn with w 1 B. Let (Xi, yi)m i=1 be independent draws where Xi N(0, Σ) and yi = Xi, w + N(0, σ2). Then Mirror Descent Lasso((Xi, yi)m i=1, B, m, σ2) computes, in time O(nm2), an estimate ˆw satisfying, with probability 1 o(1), ˆw w 2 Σ O σ2 m + σHB m + H2B2 Proof. Since maxj Σjj H we have that maxi Xi O(H log n) with probability 1 o(1). Applying Theorem A.7 with this bound and with T = m, we obtain some ˆw Rn with ˆw 1 B and, with probability 1 o(1), 1 m X ˆw y 2 2 1 m Xw y 2 2 + ϵ where ϵ = O(H2B2/m) + p H2B2σ2/m). By χ2-concentration, we have 1 m Xw y 2 2 σ2(1 + O(1/ m)) with probability 1 o(1). Thus, X ˆw y 2 Xw y 2 + ϵm σ m + O(σm1/4) + ϵm and X ˆw y 2 2 Xw y 2 2 + mϵ σ2m + O(σ2 m) + ϵm. 
Next, since supw Rn: w 1 B w w , x 2B x O(HB log n) with probability 1 o(1) over x N(0, Σ), we can apply Theorem C.1 to get that with probability 1 o(1), ˆw w 2 Σ + σ2 1 + O(1/ m) m ( X ˆw y 2 + O(HB))2. Substituting the bounds on X ˆw y 2 and X ˆw y 2 2 gives ˆw w 2 Σ + σ2 σ2 + O(σ2m 1/2 + ϵ) + O(σHBm 1/2 + HB p ϵ/m) + O(H2B2/m). Substituting in the value of ϵ and simplifying, we get ˆw w 2 Σ O σ2 m + σHB m + H2B2 as claimed. B Iterative Peeling In this section we give the complete proof of Lemma 2.4, restated below as Theorem B.1, which describes the guarantees of Iterative Peeling() (see Algorithm 1). This is a key ingredient in the proofs of Theorems 1.1 and 1.2. We also use it to formally prove Theorem 2.3, as well as Lemma 2.9. Theorem B.1. Let n, t, d N. Let Σ : n n be a positive semi-definite matrix with eigenvalues λ1 λn. Given Σ, d, and t, there is a polynomial-time algorithm Iterative Peeling() producing a set S [n] with the following guarantees: For every t-sparse v Rn, it holds that v[n]\S 2 3λ 1/2 d+1 v Σ. |S| (7t)2t+1d. Essentially, the set S contains every coordinate i [n] that participates in an approximate sparse dependency, in the sense that there is some sparse linear combination of the covariates with small variance compared to the coefficient on i. To compute S, the algorithm Iterative Peeling() first computes the orthogonal projection matrix P that projects onto the subspace spanned by the top n d eigenvectors of Σ. Starting with the set of coordinates that correlate with ker(P), the procedure then iteratively grows S in such a way that at each step, a new participant of each approximate sparse dependency is discovered, but S does not become too much larger. The following lemma will be needed to bound how much S grows at each iteration: Lemma B.2. Let V Rn be a subspace with d := dim V . For some α > 0 define i [n] : sup x V \{0} Then |S| d/α2. Moreover, given a set of vectors that span V , we can compute S in time poly(n). Proof. Let k := |S| and without loss of generality suppose S = {1, . . . , k}. Define a matrix A Rn n as follows. For 1 i k let row Ai V be some vector such that Ai 2 = 1 and Aii α. For k + 1 i n let Ai = 0. Then tr(A) kα and A F = k. However, rank(A) d, so the singular values σ1 σ2 σn 0 of A satisfy σd+1 = 0. Thus, where the second inequality is by e.g. Von Neumann s trace inequality, and the third inequality is by d-sparsity of the vector σ. It follows that k d/α2 as claimed. Let A be the matrix with columns consisting of the given spanning set for V . By Gram-Schmidt, we may transform the spanning set into an orthonormal basis for V , so that A has d columns, and A A = Id. Fix i [n]. Then supx V \{0} xi/ x 2 α if and only if (Av)2 i α2 Av 2 2 0 for some nonzero v Rd. Equivalently, (Av)2 i α2 for some unit vector v. This is possible if and only if Ai 2 α (where Ai is the i-th row of A), which can be checked in polynomial time. For notational convenience, we also define the set WP,S of vectors v with unusually large norm outside the set S. Definition B.3. For any matrix P : n n and subset S [n], define WP,S := {v Rn : v Sc 2 > 3 We then formalize the guarantee of each iteration of Iterative Peeling() as follows: Lemma B.4. Let n, t N and let P : n n be an orthogonal projection matrix. Suppose τ 1 and K [n] satisfy (a) Pii 1 1/(9t2) for all i K, (b) | supp(v) \ K| τ for every v B0(t) WP,K. Then there exists a set IP (K) with |IP (K)| 36t2|K| such that | supp(v) \ (IP (K) K)| τ 1 for all v B0(t) WP,K. 
Moreover, given P, K, and t, we can compute IP (K) in time poly(n). Proof. We define the set a [n] \ K : sup x span{P ei:i K}\{0} |xa| x 2 1/(6t) It is clear from Lemma B.2 (applied with parameters V := span{Pei : i K} and α := 1/(6t)) that |IP (K)| 36t2|K|, and that IP (K) can be computed in time poly(n). It remains to show that | supp(v) \ (IP (K) K)| τ 1 for all v B0(t) WP,K. Consider any v B0(t) WP,K. Then v Kc 2 > 3 v Pv. We have v Kc 2 2 9 > v Pv = Pv 2 2 = where the first equality uses the fact that P is a projection matrix. We also know that i [n]\K vi(Pi ei) i [n]\K |vi| Pi ei 2 1 3 3 v Kc 2 (2) by the triangle inequality, the bound Pi ei 2 2 = (I P)ii = 1 Pii 1/(9t) (since i K), and t-sparsity of v. Moreover, (2) implies that i [n]\K vi Pi i [n]\K vi(Pi ei) 3 v Kc 2 . (3) Combining (1) and (3), the triangle inequality gives i [n]\K vi Pi 3 v Kc 2 . (4) Next, observe that v Kc 2 2 3 > 2 v Kc 2 (by (1)) i=1 vi Pi, v Kc + (by Cauchy-Schwarz) i [n]\K vi Pi, v Kc i K vi Pi, v Kc + (by triangle inequality) i [n]\K viei, v Kc i [n]\K vi(Pi ei), v Kc i K vi Pi, v Kc (by triangle inequality) i [n]\K vi(Pi ei) i K vi Pi, v Kc (by Cauchy-Schwarz) i K vi Pi, v Kc i K vi Pi, v Kc 3 v Kc 2 2 1 where the last inequality is by (4). On the other hand, observe that i K vi Pi, v Kc j [n]\K |vj| i K vi Pi, ej t v Kc 2 max j supp(v)\K i K vi Pi, ej Hence, there is some j supp(v) \ K such that i K vi Pi, ej So the vector x(v) := P i K vi Pi span{Pi : i K} satisfies |x(v)j| > x(v) 2 /(5 t). Moreover, x(v) is nonzero since |x(v)j| > 0. Thus, j IP (K). Since we chose j to be in supp(v) \ K, it follows that | supp(v) \ (IP (K) K)| | supp(v) \ K| 1 τ 1 where the last inequality is by assumption (b) in the lemma statement. We can now complete the proof of Theorem B.1 by repeatedly invoking Lemma B.4 (this proof was given in Section 2.2 and is duplicated here for completeness). Proof of Theorem B.1. Let Σ = Pn i=1 λiuiu i be the eigendecomposition of Σ, and let P := Pn i=d+1 uiu i be the projection onto the top n d eigenspaces of Σ. Set Kt = {i [n] : Pii < 1 1/(9t2)}. Because tr(P) = n d and Pii 1 for all i [n], it must be that |Kt| 9t2d. Also, for any v B0(t) WP,Kt we have trivially by t-sparsity that | supp(v) \ Kt| t. Define Kt 1 to be Kt IP (Kt) where IP (Kt) is as defined in Lemma B.4; we have the guarantees that |Kt 1| (1+36t2)|Kt| and |GP (v)\Kt| t 1 for all v B0(t) WP,Kt. Since Kt 1 Kt, it holds that WP,Kt 1 WP,Kt, and thus |GP (v) \ Kt| t 1 for all v B0(t) WP,Kt 1. Moreover, since Kt 1 Kt, it obviously holds that Pii 1 1/(9t2) for all i Kt 1. This means we can apply Lemma B.4 with τ := t 1 and K := Kt 1 and so iteratively define sets Kt 2 K1 K0 [n] in the same way. In the end, we obtain the set K0 [n] with |K0| 9t2d(1 + 36t2)t and supp(v) K0 for all v B0(t) WP,K0. The latter guarantee means that in fact B0(t) WP,K0 = . So for any t-sparse v Rn it holds that v Pv 3λ 1/2 d+1 where the last inequality holds since λd+1P Σ. Proof of Lemma 2.9. By Theorem B.1, there is a polynomial-time computable set S [n] such that v Sc 2 3 tλ 1/2 d+1 v Σ for all v B0(t), and |S| (7t)2t+1d. Let the dictionary D consist of the standard basis {e1, . . . , en} together with a Σ-orthogonal basis for each subspace spanned by t vectors in {ei : i S}. Let v Rn be t-sparse. Let v S denote the restriction of v to S, i.e. v S := v P i [n]\S viei. By construction of the dictionary, there is a Σ-orthogonal basis for {ei : i S supp(v)}, so there are d1, . . . , dt D and coefficients bd1, . . . 
, bdt R with v S = Pt i=1 bdidi and di, dj Σ = 0 for all i, j [t] with i = j. Note that v S 2 Σ = Pt i=1 b2 di di 2 Σ, so i=1 |bdi| di Σ i=1 b2 di di 2 Σ = Now, we claim that the desired coefficient vector {αd : d D} for v is defined by αd = bd + P i [n]\S vi1[d = ei]. We can check that P d D αdd = Pt i=1 bdi + P i [n]\S viei = v. Also, v S Σ v Σ + v Sc Σ v Σ + p λn v Sc 2 (1 + 3 p λn/λd+1) v Σ by the guarantee of set S. It follows that t X i=1 |bdi| di Σ (1 + 3 p d D |cd| d Σ (1 + 3 p i [n]\S |vi| ei Σ λn/λd+1 v Σ which completes the proof. Proof of Theorem 2.3. Immediate from Lemma 2.9 and Lemma A.3. C An efficient algorithm for handling outlier eigenvalues In this section we describe and provide error guarantees for a novel sparse linear regression algorithm BOAR-Lasso() (see Algorithm 2 for pseudocode), completing the proof of Theorem 1.1; in Section C.1 we then analyze a modified algorithm to prove Theorem 1.2. The key subroutine of BOAR-Lasso() is the procedure Adaptively Regularized Lasso(), which (like the simplified procedure Adapted BP() from Section 3) first invokes procedure Iterative Peeling() to compute the set of coordinates that participate in sparse approximate dependencies, and second computes a modified Lasso estimate where those coordinates are not regularized. We start with Theorem C.2, which shows that, in the setting where Σ has few outlier eigenvalues, the procedure Adaptively Regularized Lasso() estimates the sparse ground truth regressor at the slow rate (e.g. in the noiseless setting, the excess risk is at most O( v 2 Σ reff/m)). Typical excess risk analyses for Lasso proceed by applying some general-purpose machinery for generalization bounds, such as the following result which only requires understanding w w , X for X N(0, Σ). Algorithm 2: Solve sparse linear regression when covariate eigenspectrum has few outliers Procedure Adaptively Regularized Lasso(Σ, (Xi, yi)m i=1, t, dl, δ) Data: Covariance matrix Σ : n n, samples (Xi, yi)m i=1, sparsity t, small eigenvalue count dl, failure probability δ Result: Estimate ˆv of unknown sparse regressor, satisfying Theorem C.2 Pn i=1 λiuiu i eigendecomposition of Σ S Iterative Peeling(Σ, dl, t) /* See Algorithm 1 */ Return ˆv argmin v Rn i=1 ( Xi, v y)2+8λn dh log(12n/δ) v Sc 2 1+2 p 2λn dh log(12n/δ) v Sc 1 . Procedure BOAR-Lasso(Σ, (Yi, yi)m i=1, t, dl, L, δ) Data: Covariance matrix Σ : n n, samples (Xi, yi)m i=1, sparsity t, small eigenvalue count dl, repetition count L, failure probability δ Result: Estimate ˆv of unknown sparse regressor, satisfying Theorem C.3 ˆs(0) 0 Rn for 0 j < L do Σ(j) Σ (ˆs(j)) Σ Σˆs(j) (ˆs(j)) Σˆs(j) Set A(j) := {mj + 1, . . . , m(j + 1)} ˆw(j+1) Adaptively Regularized Lasso(Σ(j), ((Xi, Xi, ˆs(j) ), yi Xi, ˆs(j) )i A(j), t + 1, dl + 1, δ/L) ˆv(j+1) ˆw(j+1) [n] + ˆw(j+1) n+1 ˆs(j) ˆs(j+1) ˆs(j) + ˆv(j+1) return ˆs(L) Theorem C.1 (Theorem 1 in [45]). Let n, m N and ϵ, δ, σ > 0. Let Σ : n n be a positive semi-definite matrix and fix w Rn. Let X : m n have i.i.d. rows X1, . . . , Xm N(0, Σ), and let y = Xw + ξ where ξ N(0, σ2Im). Let F : Rd [0, ] be a continuous function such that Pr x N(0,Σ)[ sup w Rn w w , x F(w) > 0] δ. If m 196ϵ 2 log(12/δ), then with probability at least 1 4δ it holds that for all w Rd, w w 2 Σ + σ2 1 + ϵ m ( Xw y 2 + F(w))2 . In classical settings, e.g. 
(a) where v 1 is bounded and maxi Σii 1 (see Proposition A.6) or (b) where Σ satisfies the compatibility condition (see Definition G.1), the above result can be applied together with the straightforward bound v v , X v v 1 X . To prove Theorem C.2 we follow the same general recipe as (a), with several modifications. First, since maxi Σii could be arbitrarily large, we need to treat the (few) large eigenspaces of Σ separately when bounding v v , X . Similarly, since Theorem B.1 only gives bounds on v for coordinates outside S, we separately bound (v v )S, X using that |S| is small. Second, to achieve the optimal rate of σ2neff/m rather than σ2p neff/m, we do not directly apply Theorem C.1 to the noisy samples (Xi, yi); instead, we derive a modification of that result (Lemma F.7) that only invokes Theorem C.1 on the noiseless samples (Xi, Xi, v ), and separately bounds the in-sample prediction error X(ˆv v ) 2. A similar technique is used in [45] for constrained least-squares programs (see their Lemma 15); our Lemma F.7 applies to a broad family of additively regularized programs, which obviates the need to independently estimate v Σ but otherwise achieves comparable bounds. Theorem C.2. Let n, t, dl, dh, m N and σ, δ > 0. Let Σ : n n be a positive semi-definite matrix with eigenvalues λ1 λn. Let (Xi, yi)m i=1 be independent samples where Xi N(0, Σ) and yi = Xi, v + ξi, for ξi N(0, σ2) and a fixed t-sparse vector v Rn. Let ˆv be the output of Adaptively Regularized Lasso(Σ, (Xi, yi)m i=1, t, dl,δ). Let neff := (7t)2t+1dl + dh + log(48/δ) and let reff := t(λn dh/λdl+1) log(12n/δ). There are absolute constants c, C > 0 so that the following holds. If m Cneff, then with probability at least 1 δ, m + (σ + v Σ) v Σ reff m + v 2 Σ reff m Proof. Define projection matrix P := Pn dh i=1 uiu i , so that rank(P ) = dh and λmax(PΣP) λn dh. For any v Rn and X N(0, Σ), we can bound v v , X = (v v )Sc, PX + (v v )Sc, P X + (v v )S, X = (v v )Sc, PX + Σ1/2(v v ), Σ 1/2(P X)Sc + Σ1/2(v v ), Σ 1/2XS (v v )Sc 1 PX + Σ1/2(v v ) 2 ( Z 2 + W 2) where PX N(0, PΣP), Z N(0, Σ 1/2(P ΣP )Sc ScΣ 1/2), and W N(0, Σ 1/2ΣSSΣ 1/2). First, since maxi(PΣP)ii λmax(PΣP) λn dh, we have the Gaussian tail bound Pr h PX > p λn dh 2 log(12n/δ) i δ/12. Second, since Σ 1/2(P ΣP )Sc ScΣ 1/2 Σ 1/2(P ΣP )Σ 1/2 (by Cauchy Interlacing Theorem) = P (since P commutes with Σ) we have that Z 2 2 is stochastically dominated by χ2 dh, and thus Pr h Z 2 > p 2dh i e m/4 δ/12. Third, similarly, since Σ 1/2ΣSSΣ 1/2 I (again by Cauchy Interlacing Theorem) and also rank(Σ 1/2ΣSSΣ 1/2) |S|, we have that W 2 2 is stochastically dominated by χ2 |S|, and thus Pr h W 2 2 > p 2|S| i e m/4 δ/12. Combining the above bounds, we have that with probability at least 1 δ/4 over X N(0, Σ), for all v Rn, v v , X (v v )Sc 1 p λn dh 2 log(12n/δ) + Σ1/2(v v ) 2 ( p We can therefore apply Lemma F.7 with covariance Σ, seminorm Φ(v) := 2 p 2λn dh log(12n/δ) v Sc 1, p := 4(dh + |S|), ground truth v , samples (Xi, yi)m i=1, and failure probability δ/4. By the bound on |S| (Theorem B.1) we have |S| + dh (7t)2t+1dl + dh neff, so it holds that m 16p + 196 log(48/δ). Thus, with probability at least 1 2δ, we have m + (σ + v Σ) v Sc 1 p λn dh log(12n/δ) m + v Sc 2 1 λn dh log(12n/δ) By the guarantee of S (Theorem B.1) and t-sparsity of v , we have v Sc 2 3λ 1/2 dl+1 v Σ, and thus v Sc 1 3 tλ 1/2 dl+1 v Σ . Substituting into the previous bound, we get m + (σ + v Σ) v Σ reff m + v 2 Σ reff m as claimed. 
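To make the structure of this estimator concrete, the following is a minimal Python sketch (using cvxpy, which is not part of our codebase) of the program solved by Adaptively Regularized Lasso(): only coordinates outside the peeled set S are penalized, and the ℓ1 penalty on v_{S^c} enters both squared and multiplied by the norm of y, matching the additively regularized form analyzed via Lemma F.7. The inputs S (the output of Iterative Peeling(), not reproduced here) and lam (playing the role of λ_{n-d_h} log(12n/δ)) are assumed given; the constants are illustrative rather than tuned.

```python
import numpy as np
import cvxpy as cp

def adaptively_regularized_lasso(X, y, S, lam):
    """Sketch of the Adaptively_Regularized_Lasso() program (Algorithm 2).

    X   : (m, n) matrix of covariates, rows drawn from N(0, Sigma)
    y   : (m,) vector of responses
    S   : coordinates found by Iterative_Peeling(); left unregularized
    lam : stand-in for lambda_{n - d_h} * log(12 n / delta)  (assumed given)
    """
    m, n = X.shape
    mask = np.ones(n)
    mask[list(S)] = 0.0                       # no penalty on coordinates in S
    v = cp.Variable(n)
    phi = 2 * np.sqrt(2 * lam) * cp.norm1(cp.multiply(mask, v))   # seminorm Phi(v)
    obj = cp.sum_squares(X @ v - y) + cp.square(phi) + np.linalg.norm(y) * phi
    cp.Problem(cp.Minimize(obj)).solve()
    return v.value
```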
The limitation of Adaptively Regularized Lasso() is that the excess risk bound depends on v 2 Σ rather than just σ2. We next show that by a boosting approach, we can exponentially attenuate that dependence, essentially achieving the near-optimal rate of σ2neff/m. The key insight is that after producing an estimate ˆv of v , we can augment the set of covariates with the feature X, ˆv , and try to predict the response y X, ˆv , which is now a (t+1)-sparse combination of the features. In standard settings, this is typically a bad idea because it introduces a sparse linear dependence. However, by the Cauchy Interlacing Theorem it increases the number of outlier eigenvalues by at most one so our algorithms still apply. Thus, if we have enough samples that the excess risk bound in Theorem C.2 is non-trivially smaller than v 2 Σ, then we can iteratively achieve better and better estimates up to the noise limit. This is precisely what BOAR-Lasso() does; the precise guarantees are stated in the following theorem, which completes the proof of Theorem 1.1. Theorem C.3. Let n, t, dl, dh, m, L N and σ, δ > 0. Let Σ : n n be a positive semi-definite matrix with eigenvalues λ1 λn. Let (Xi, yi)m i=1 be independent samples where Xi N(0, Σ) and yi = Xi, v + ξi, for ξi N(0, σ2) and a fixed t-sparse vector v Rn. Then, given Σ, (Xi, yi)m i=1, t, dl, and δ, the algorithm BOAR-Lasso() outputs an estimator ˆv with the following properties. Let neff := (7t)2t+1dl + dh + log(48/δ) and let reff := t(λn dh/λdl+1) log(12n/δ). There are absolute constants c0, C0 > 0 such that the following holds. If m C0L(neff + reff), then with probability at least 1 δ, it holds that ˆv v 2 Σ c0 σ2(neff + reff) m/L + 2 L v 2 Σ . Moreover, BOAR-Lasso() has time complexity poly(n, m, t). Proof. Let (A0, . . . , AL 1) be an partition of [m] into L sets of size m/L. The idea of the algorithm is to compute vectors ˆv(1), . . . , ˆv(L) where each v(i) is an estimate of v Pi 1 j=1 ˆv(j). Concretely, fix some 0 j L 1 and suppose that we have computed some vectors ˆv(1), . . . , ˆv(j). Set ˆs(j) := ˆv(1) + + ˆv(j). Define a matrix Σ(j) : (n + 1) (n + 1) by Σ(j) := Σ (ˆs(j)) Σ Σˆs(j) (ˆs(j)) Σ(ˆs(j)) Thus, for example, Σ(0) has zeroes in the last row and last column. Now for each i Aj, define (X(j) i , y(j) i ) by X(j) i := (Xi, Xi, ˆs(j) ) y(j) i := yi Xi, ˆs(j) . By construction, the m/L samples (X(j) i , y(j) i )i Aj are independent and distributed as X(j) i N(0, Σ(j)) and y(j) i = X(j) i , (v , 1) + ξi. Let λ(j) 1 λ(j) n+1 be the eigenvalues of Σ(j). Now we apply Theorem C.2 with covariance Σ(j), samples (X(j) i , y(j) i )i Aj, sparsity t + 1, outlier counts dl + 1 and dh + 1, and failure probability δ/L; let n(j) eff and r(j) eff be the induced parameters defined in that theorem statement, and let c, C be the constants. By the Cauchy Interlacing Theorem, we have λ(j) dl+2 λdl+1 and similarly λ(j) n+1 (dh+1) λn dh. Thus r(j) eff 2reff. Also n(j) eff neff. Thus, if the constant C0 is chosen appropriately large, then m/L 16cr(j) eff and also m/L Cn(j) eff . Hence (by the error guarantee of Theorem C.2) with probability at least 1 δ/L we obtain a vector ˆw(j+1) such that ˆw(j+1) (v , 1) 2 Σ(j) cσ2n(j) eff m/L + c (v , 1) 2 Σ(j) r(j) eff m/L + cσ (v , 1) Σ(j) r(j) eff m/L m/L + (v , 1) 2 Σ(j) 4 + (v , 1) 2 Σ(j) 4 + 4c2σ2reff 2 σ2(neff + reff) m/L + (v , 1) 2 Σ(j) 2 (5) where the second inequality uses AM-GM to bound the third term, and the third inequality is by choosing c0 4c + 8c2. But now define ˆv(j+1) := ˆw(j+1) [n] + ˆw(j+1) n+1 ˆs(j). 
Then we observe that (v , 1) 2 Σ(j) = v ˆs(j) 2 Σ and ˆw(j+1) (v , 1) 2 Σ(j) = ˆv(j+1) (v ˆs(j)) 2 Σ = v ˆs(j+1) 2 Σ where ˆs(j+1) = ˆv(1) + + ˆv(j+1). So (5) is equivalent to v ˆs(j+1) 2 2 σ2(neff + reff) Inductively, we conclude that v ˆs(L) 2 Σ c0 σ2(neff + reff) m/L + 2 L v 2 Σ as desired. The time complexity (see Algorithm 2 for full pseudocode) is dominated by L eigendecompositions of n n Hermitian matrices (each of which takes time O(n3) by e.g. the QR algorithm), as well as L convex optimizations (each of which takes time O(n3) to solve to inversepolynomial accuracy [26], which is sufficient for the correctness proof). C.1 An alternative algorithm (proof of Theorem 1.2) In this section we prove Theorem 1.2, which essentially states that the sample complexity dependence on dl in BOAR-Lasso() can be removed at the cost of a time complexity depending on dt l. See Algorithm 3 for the pseudocode of how we modify Adaptively Regularized Lasso(): essentially, we brute force search over all size-t subsets of the set S produced by Iterative Peeling(), construct an appropriate dictionary for each of these |S| t subsets, and then perform a final model selection step (with fresh samples) to pick the best dictionary/estimator. The boosting step is exactly identical to that in BOAR-Lasso(). Lemma C.4. Let n, t, d N. Let Σ : n n be a positive semi-definite matrix with eigenvalues λ1 λn. Then there is a family D Rn (n+t) of size |D| (7t)2t2+t(2d)t, consisting entirely of n (n + t) matrices with the form D := [In d1 . . . dt] , with the following property. For any t-sparse v Rn, there is some D D and w Rn+k with v = Dw and w 1 7t1/2 p Proof. Let u1, . . . , un Rn be the eigenvectors of Σ corresponding to eigenvalues λ1, . . . , λn respectively, so that Σ = Pn i=1 λiuiu i . Define Σ := λ 1 d+1 Pn i=1 min(λi, λd+1)uiu i . Let S be the output of Iterative Peeling(Σ, dl, t), and let D := {D(T) : T S t }, where for any T S t , we let {d1, . . . , dt} be a Σ-orthonormal basis for span{ei : i T}, and let D(T) be the n (n + t) matrix with columns e1, . . . , en, d1, . . . , dt. The bound on |D| follows from Theorem B.1. For any t-sparse v Rn, pick the matrix D D indexed by any T S t with S supp(v) T. Let d1, . . . , dt Rn be the last t columns of D. Then there are coefficients b1, . . . , bt so that we can write v S = Pt i=1 bidi. Since d i Σdi = 1[i = i ] for all i, i [t], we have v S Σv S = Pt i=1 b2 i . v S Σv S. But we can bound q v S Σv S = Σ 1/2v S 2 Σ 1/2v 2 + Σ 1/2v Sc 2 (by triangle inequality) Algorithm 3: Alternative algorithm to solve sparse linear regression when covariate eigenspectrum has few outliers Procedure Augmented Dictionary Lasso(Σ, (Xi, yi)m i=1, t, dl, δ) Data: Covariance matrix Σ : n n, samples (Xi, yi)m i=1, sparsity t, small eigenvalue count dl, failure probability δ Result: Estimate ˆv of unknown sparse regressor, satisfying Theorem C.5 Pn i=1 λiuiu i eigendecomposition of Σ S Iterative Peeling(Σ, dl, t) /* See Algorithm 1 */ Σ λ 1 dl+1 Pn i=1 min(λi, λdl+1)uiu i for T S [t] do d(T ) 1 , . . . , d(T ) t Σ-orthogonal basis for span{ei : i T} D(T) h In d(T ) 1 . . . d(T ) t i ˆw(T) argmin w Rn+t Xi, D(T)w y1:m/2 2 + 8λn d log(8n/δ) w 2 1 + 2 p 2λn d log(8n/δ) y1:m/2 2 w 1 Select best hypothesis ˆT argmin T ( S [t]) i=m/2+1 ( Xi, D(T) ˆw(T) yi)2 return D( ˆT) ˆw( ˆT) λ 1/2 d+1 Σ1/2v 2 + v Sc 2 (by Σ λ 1 d+1Σ and Σ In) λ 1/2 d+1 Σ1/2v 2 + 3λ 1/2 d+1 v Σv (by Theorem B.1 and t-sparsity of v) We conclude that b 1 4 v Σv. Thus, if we define i [n]\S viei + i=1 bien+i, where here e1, . . 
. , en+k refer to the standard basis vectors in Rn+t, then we have Dw = v[n]\S + Pt i=1 bidi = v, and also w 1 b 1 + X i [n]\S |vi| 4 as desired. Theorem C.5. Let n, t, dl, dh, m N and let Σ : n n be a positive semi-definite matrix with eigenvalues λ1 λn. Let (Xi, yi)m i=1 be independent samples where Xi N(0, Σ) and yi = Xi, v + ξi, for ξi N(0, σ2) and a fixed t-sparse vector v Rn. Set k := t(7t)2t2+tdt l and let D be the family of matrices (of size at most k) guaranteed by Lemma C.4. Let δ > 0. For every D D, define ˆw(D) argmin w Rn+t X(1)Dw y1:m/2 2 2+8λn d log(8n/δ) w 2 1+2 p 2λn d log(8n/δ) y1:m/2 2 w 1 where X(1) : (m/2) n is the matrix with rows X1, . . . , Xm/2, and define ˆv = ˆD ˆw( ˆD) where ˆD argmin D D X(2)D ˆw(D) ym/2+1:m 2 where X(2) : (m/2) n is the matrix with rows Xm/2+1, . . . , Xm. Let neff := t2 log(t) + t log(dl) + dh + log(48/δ) and let reff := t(λn dh/λdl+1) log(8n/δ). There are absolute constants c, C > 0 so that the following holds. If m Cneff, then with probability at least 1 3δ it holds that ˆv v 2 Σ c σ2neff Let D D and w Rn+t be the matrix and vector guaranteed by Lemma C.4 for the t-sparse vector v . Let Γ = (D ) ΣD with eigenvalues γ1 γn+t. We make the following claim: Claim C.6. With probability at least 1 δ/4 over G N(0, Γ), it holds uniformly in w Rn+t that w w , G w w 1 p λn dh 2 log(8n/δ) + w w Γ p Proof. Since Σ is a principal submatrix of Γ, we have γn dh λn dh (by the Cauchy Interlacing Theorem). Suppose that Γ has eigendecomposition Γ = Pn+t i=1 γigig i , and define projection matrix P : (n + t) (n + t) by P := Pn dh i=1 gig i , so that rank(P ) = dh + t and λmax(PΓP) γn dh λn dh. Then for any w Rn+t and G N(0, Γ), we can bound w w , G = w w , PG + w w , P G w w 1 PG + Γ1/2(w w ), Γ 1/2P G = w w 1 PG + Γ1/2(w w ), P Γ 1/2G w w 1 PG + Γ1/2(w w ) 2 Z 2 where Z N(0, P ). The second equality above uses that Γ 1/2 and P are simultaneously diagonalizable (and therefore commute). But now for any δ > 0, we have the Gaussian tail bounds max i (PΓP)ii 2 log(8n/δ) δ/8 2 rank(P ) e m/8 δ/8. Thus, with probability at least 1 δ/4 over G N(0, Γ), for any w Rn+t, we have w w , G w w 1 q max i (PΓP)ii 2 log(8n/δ) + Γ1/2(w w ) 2 λn dh 2 log(8n/δ) + Γ1/2(w w ) 2 2(dh + t) = F(w) which proves the claim. We now proceed with proving the theorem. Proof of Theorem C.5. Applying Claim C.6, we can now invoke Lemma F.7 with covariance matrix Γ, seminorm Φ(v) := 2 p 2λn dh log(8n/δ) v 1, p := 2(dh + t), ground truth w , samples ((D ) Xi, yi)m/2 i=1 , and failure probability δ/4. Since we chose m sufficiently large that m/2 16p + 196 log(12/δ), we conclude that with probability at least 1 2δ over the randomness of (Xi, yi)m/2 i=1 , it holds that ˆw(D ) w 2 Γ O m + (σ + w Γ) w 1 p λn dh log(8n/δ) m + w 2 1 λn dh log(8n/δ) Since v = D w and w 1 7t1/2λ 1/2 dl+1 v Σ (the guarantees of Lemma C.4), it follows that D ˆw(D ) v 2 Σ O m + (σ + v Σ) v Σ reff m + v 2 Σ reff m To complete the proof of the theorem, condition on any values of (Xi, yi)m/2 i=1 for which the above bound holds. By applying Lemma F.2 with covariance matrix Σ, hypothesis set W := {D ˆw(D) : D D}, and samples (Xi, yi)m i=m/2+1 (which are independent of W), since m/2 32 log(2|D|/δ), we have with probability at least 1 2δ over the samples (Xi, yi)m i=m/2+1 that ˆD ˆw( ˆD) v 2 Σ 6 min D D D ˆw(D) v 2 Σ + 32σ2 log(2|D|/δ) Hence, with probability at least 1 5δ we have ˆD ˆw( ˆD) v 2 σ2(dh + t + log(2|D|/δ)) m + (σ + v Σ) v Σ reff m + v 2 Σ reff m which proves the theorem. 
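The final model-selection step of Augmented Dictionary Lasso() is standard, and the sketch below spells it out in Python: one estimator is fit per candidate dictionary on the first half of the samples, and the dictionary with the smallest empirical risk on the fresh second half is selected, as in the application of Lemma F.2 above. The helper fit_fn, which stands in for the regularized least-squares program of Theorem C.5, is an assumption of this sketch rather than code from the paper.

```python
import numpy as np

def select_best_dictionary(dictionaries, fit_fn, X1, y1, X2, y2):
    """Hold-out model selection over a finite dictionary family (cf. Lemma F.2).

    dictionaries : list of n x (n + t) matrices D(T), one per size-t subset T of S
    fit_fn       : fit_fn(D, X, y) -> w_hat in R^{n+t}; stands in for the regularized
                   program of Theorem C.5 (assumed, not implemented here)
    (X1, y1)     : first half of the samples, used for fitting
    (X2, y2)     : fresh second half, used only for selection
    """
    best = (np.inf, None)
    for D in dictionaries:
        v_hat = D @ fit_fn(D, X1, y1)               # candidate estimate of v*
        val_err = np.mean((X2 @ v_hat - y2) ** 2)   # empirical risk on fresh samples
        if val_err < best[0]:
            best = (val_err, v_hat)
    return best[1]
```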
We can use the above theorem (together with the previously discussed boosting approach) to get the following result, which proves Theorem 1.2. Theorem C.7. Let n, t, dl, dh, m, L N and σ, δ > 0. Let Σ : n n be a positive semi-definite matrix with eigenvalues λ1 λn. Let (Xi, yi)m i=1 be independent samples where Xi N(0, Σ) and yi = Xi, v + ξi, for ξi N(0, σ2) and a fixed t-sparse vector v Rn. Then, given Σ, (Xi, yi)m i=1, t, dl, and δ, there is an estimator ˆv with the following properties. Let n eff := t2 log(t) + t log(dl) + dh + log(48L/δ) and let r eff := t(λn dh/λdl+1) log(8n L/δ). There are absolute constants c0, C0 > 0 such that the following holds. If m C0L(n eff + r eff), then with probability at least 1 δ, it holds that ˆv v 2 Σ c0 σ2(n eff + r eff) m/L + 2 L v 2 Σ . Moreover, ˆv is computable in time (t + 1)O(t2)(dl + 1)t+1 poly(n). Proof. Identical to that of Theorem C.3, except using Theorem C.5 instead of Theorem C.2. D Faster sparse linear regression for arbitrary Σ In this section we prove Theorem 3.1. The approach is via feature adaptation: in Theorem D.5, we show that any covariance matrix Σ has a (t, O(t3/2 log n)-ℓ1-representation of size O(nt 1/2) that is computable in time nt Ω(1/t) log O(t) n, using O(t log n) samples from N(0, Σ). The algorithm for computing this representation is described in Algorithm 4. One of the key tools is the following result from computational geometry: Theorem D.1 ([30]). Let n, d, k N and δ > 0. Given points p1, . . . , pn Rd, query dimension k, and failure probability δ, there an algorithm DS((p1, . . . , pn), k, δ) with time complexity nk+1(log n)O(k) poly(d) log(1/δ), that constructs a data structure N that answers queries of the following form. Given a k-dimensional subspace F Rd, the output N(F) is some i [n]. With probability at least 1 δ, the query time complexity is n1 1/(2k) poly(d) log(1/δ), and it holds that min q F pi q 2 O(log n) min i [n] min q F pi q 2 . How do we use the above theorem to efficiently construct the ℓ1-representation? The intuition is as follows. Let X be the m n matrix where each row is a sample from N(0, Σ). Then each column is a vector pi representing a particular covariate. To find the ℓ1-representation, it essentially suffices Algorithm 4: ℓ1-representation for arbitrary Σ 1 Procedure Find Orthonormalization(p1, . . . , pt) Data: Nonzero vectors p1, . . . , pt Rm Result: α(1), . . . , α(t) Rt such that span{α(1), . . . , α(t)} = span{e1, . . . , et} and P ℓα(i) ℓpℓ, P ℓα(j) ℓpℓ = 0 for all i = j 3 for i = 1, . . . , t do 4 α(i) ei/ pi 2 Rk 5 for j = 1, . . . , i 1 do ℓα(j) ℓpℓ = 0 then 7 α(i) α(i) P ℓα(i) ℓpℓ,P ℓα(j) ℓ pℓ P ℓα(j) ℓ pℓ 2 α(j) 8 return α(1), . . . , α(k) 9 Procedure Represent Vectors({p1, . . . , pn},t,δ) Data: Unit vectors p1, . . . , pn Rm, sparsity parameter t, failure probability δ Result: Set D Rn of size O(nt 1/2), where all elements d D are t-sparse (and represented succinctly) 11 Compute partition I1 I n = [n] where |Ii| n for all i 12 Initialize D 13 for j = 1, . . . , n do 14 Construct data structure N j DS((pi : i Ij), t 1, δ/nt) /* Theorem D.1 */ 15 for T [n] t 1 do 16 h(T, j) N j(span{pi : i T}) /* Theorem D.1 */ 17 Find γ RT such that P i T γipi = Projspan{pi:i T } ph(T,j) 18 Write γ as a sparse vector in Rn (supported on T) 19 Add γ eh(T,j) to D 20 for T [n] t 2 do 21 for a, b Ij do 22 γ(1), . . . , γ(t) Find Orthonormalization((pi : i T {a, b})) 23 Write γ(1), . . . , γ(t) as sparse vectors in Rn (supported on T {a, b}) 24 Add γ(1), . . . 
, γ(t) to D 25 return D 26 Procedure Compute L1Representation({X1, . . . , Xm}, t) 27 Let X : m n be the matrix with rows X1, . . . , Xm 28 Let q1, . . . , qn be the columns of X, and let pi := qi/ qi 2 for i [n] 29 D Represent Vectors({p1, . . . , pn}, t, e m) 30 ˆD diag( q1 2 , . . . , qn 2) 31 D { ˆDd : d D} 32 return D to find a dictionary D of O(nt 1/2) sparse combinations of {p1, . . . , pn} so that every t-sparse combination of {p1, . . . , pn} can be written in terms of the chosen combinations, with a coefficient vector that has bounded ℓ1 norm. For notational ease, we define C(x) to be the cost of a particular linear combination x Rn with respect to the set D of chosen combinations: Definition D.2. For a subset D Rn, define CD : Rn [0, ] by CD(x) := min α RD:P With this notation, we want to construct a set D of size O(nt 1/2), consisting of t-sparse vectors, such that CD(x) poly(t, log n) X xipi 2 for all t-sparse x Rn. The construction is quite simple: divide the set {p1, . . . , pn} into n equal-sized groups. For each set T of t 1 vectors and each of the n groups, find the closest vector in the group to the subspace spanned by T (using Theorem D.1 to achieve sublinear time complexity). Then add the difference between the vector and its projection (onto the subspace) to the dictionary. Finally, for each set of t vectors where two of the vectors lie in the same group, add an orthonormal basis for those vectors to the dictionary. See the procedure Represent Vectors() in Algorithm 5 for pseudocode. By construction, the dictionary clearly has size O(nt 1/2). At a high level, the reason it satisfies the representational property is the following. Consider some t-sparse combination, such as p1+ +pt. If pt is not very close to p1+ +pt 1, then we can bound C(p1+ +pt) by C(p1+ +pt 1) and C(pt), which are O( t p1 + + pt 1 2) and O( t pt 2) respectively, since the dictionary contains an orthonormal basis for both terms. The only case where these bounds are not good enough is when p1 + + pt 2 is much smaller than p1 + + pt 1 2 and pt 2. In this case, pt is very close to span{p1, . . . , pt 1}. However, in the construction we found some (potentially different) pj which is just as close to span{p1, . . . , pt 1}, and moreover is in the same group as pt. Letting q be the projection of pj onto span{p1, . . . , pt 1}, we have the crucial fact that pj q 2 is as small as p1 + + pt 2. Now, bounding C(p1 + + pt) proceeds as follows. We can subtract some appropriate (bounded) multiple of pj q from p1 + + pt to zero out at least one of the coefficients. This residual then is a t-sparse combination of {p1, . . . , pt, pj} where two of the vectors {pt, pj} are in the same group; thus it has small cost with respect to D. Moreover, pj q is contained in D and thus has small cost (specifically, not much more than pj q 2, which crucially is not much more than p1 + + pt 2). It follows that p1 + + pt has small cost. Formalizing this argument, we start by proving one of the facts that we freely used above: that the cost function C satisfies the triangle inequality. Fact D.3. For any D Rn and x, y Rn, it holds that C(x + y) C(x) + C(y). Proof. For any α, β RD with P d αdd = x and P d βdd = y, the vector α + β satisfies P d(α + β)dd = x + y. Applying the triangle inequality to P d |(α + β)d| P i dipi 2 completes the proof. We now prove the key lemma, formalizing the above intuition. Lemma D.4. Let n, m, t N, with t 2, and δ > 0. Fix p1, . . . , pn Rm with pi 2 = 1 for all i [n]. 
Let D be the output of Represent Vectors({p1, . . . , pn}, t, δ). Then |D| = O(nt 1/2), and every element of D is t-sparse. Also, with probability at least 1 δ, the following guarantees hold. The time complexity of computing D is O(nt Ω(1/t)(log n)O(t)m O(1) log(1/δ)). Moreover, for every t-sparse x Rn it holds that CD(x) O(t3/2 log n) X xipi 2 . Proof. Since the algorithm Represent Vectors() makes less than nt queries to the data structures N j, and each query has failure probability at most δ = δ/nt, the probability that any query fails is at most 1 δ. We henceforth assume that all queries succeed, i.e. satisfy the correctness guarantee and time complexity bound stated in Theorem D.1. Time complexity. We start by analyzing the time complexity of Represent Vectors({p1, . . . , pn}, t, δ). For any fixed j [ n], the construction time of N j (with |Ij| = O( n) points in Rm, query dimension t 1, and failure probability δ/nt) is O(nt/2(log n)O(t)m O(1) log(1/δ)). We make n t 1 + |Ij|2 n t 2 = O(nt 1) queries to N j, each with time complexity n1/2 1/(4(t 1))m O(1) log(1/δ). Each projection step and each orthonormalization step has time complexity poly(t, m). Thus, since t 2, the time complexity for any fixed j is bounded by nt 1/2 1/(8t)(log n)O(t)m O(1) log(1/δ). Summing over j, the overall time complexity to compute D is at most nt 1/(8t)(log n)O(t)m O(1) log(1/δ) as claimed. Correctness. The bound on |D| and the fact that all elements of D are t-sparse are immediate from the algorithm definition. It remains to bound CD(x) for t-sparse vectors x. First, note that for any (t 1)-sparse y Rn, because of step (4), the dictionary contains vectors γ1, . . . , γt 1 that span supp(y) and satisfy Pn i=1 γk i pi, Pn i=1 γℓ i pi = 0 for all k = ℓ. Thus, letting α1, . . . , αt 1 R be such that y = α1γ1 + + αt 1γt 1, we get i=1 γj i pi i=1 γj i pi Now fix any nonzero t-sparse x Rn. Fix any a arg maxi [n] |xi|, and let j [ n] be such that a Ij. Let T = supp(x) \ {a}. Let q := Projspan{pi:i T } ph(T,j). Then by the correctness guarantee of N j on query span{pi : i T}, ph(T,j) q 2 O(log n) = O(log n) P i xipi 2 |xa| . (8) Case I. Suppose that ph(T,j) q 2 1/2. Then by (8), we have |xa| O(log n) P i xipi 2. Thus, by the triangle inequality, It follows from Fact D.3 and (7) that CD(x) CD(xaea) + CD(x xaea) as desired. Case II. It remains to consider the case that ph(T,j) q 2 1/2. In this case we have q 2 ph(T,j) 2 1/2 1/2. By step (3) of the algorithm, the dictionary contains some vector γ eh(T,j) such that supp(γ) T and q = P i T γipi. Fix any b arg maxi |γi|. Since q = P γipi we get t 1/(2t). Now, by Fact D.3, γb (eh(T,j) γ) + CD γb (eh(T,j) γ) . By construction, eh(T,j) γ is an element of the dictionary, so we can bound the first term as γb (eh(T,j) γ) |xb| i=1 (eh(T,j) γ)ipi ph(T,j) q 2 2t|xa| ph(T,j) q 2 where the equality uses that q = Pn i=1 γipi, the second inequality uses that |xb| |xa| and |γb| 1/(2t), and the final inequality uses (8). Finally, observe that z := x + xb γb (eh(T,j) γ) = xaea + xb γb eh(T,j) + X since the coefficients on eb cancel out. Thus, z is a linear combination of two elements of {pi : i Ij} together with t 2 elements of {pi : i [n]}. Because of step (4) of the algorithm, the dictionary contains vectors γ1, . . . , γt that span supp(z) and satisfy Pn i=1 γk i pi, Pn i=1 γℓ i pi = 0 for all k = ℓ. 
The same argument as for (7) gives that γb (eh(T,j) γ) i=1 xipi + xb γb (ph(T,j) q) t log n) |xb| |γb||xa| O(t3/2 log n) 2 where the second inequality uses the triangle inequality and (8), and the final inequality uses that |xb| |xa| and |γb| 1/(2t). Putting everything together, we conclude that CD(x) O(t3/2 log n) 2 as claimed. We now show that Represent Vectors() can be applied to the columns of the sample matrix to obtain a ℓ1-representation for Σ (procedure Compute L1Representation() in Algorithm 5). Up to an appropriate rescaling of the covariates, Lemma D.4 immediately implies that D gives a ℓ1representation for the empirical covariance ˆΣ. The main result then follows from concentration of ˆΣ and sparsity of the elements of the dictionary. Theorem D.5. Let n, m, t N and let Σ : n n be a positive-definite matrix. Suppose m Ct log n for a sufficiently large constant C. Let X1, . . . , Xm N(0, Σ) be independent samples, and let D be the output of Compute L1Representation({X1, . . . , Xm}, t). Then |D| O(nt 1/2), and every element of D is t-sparse. Also, with probability at least 1 e Ω(m), the time complexity of the algorithm is O(nt Ω(1/t)(log n)O(t)m O(1)), and D is a (t, Cl1rept3/2 log n)- ℓ1-representation for Σ, for some universal constant Cl1rep. Proof. Let ˆΣ = 1 m X X. Let D denote the intermediary dictionary constructed by the algorithm using Represent Vectors(). With probability at least 1 e m, the successful event of Lemma D.4 holds. By standard concentration bounds (see e.g. Exercise 4.7.3 in [41]), it holds that 1 2 x Σ x ˆΣ 2 x Σ for all t-sparse x Rn, with probability at least 1 e Ω(m). Henceforth assume that both of these events hold. Time complexity. The time complexity of the algorithm is dominated by the call to Represent Vectors(). By the guarantee of Lemma D.4, this takes time O(nt Ω(1/t)(log n)O(t)m O(1)). Correctness. The bounds on |D| and sparsity of elements of D follow from identical bounds for D (see Lemma D.4), and the fact that every element of D is obtained by rescaling the coordinates of some element of D. It remains to show that D is a (t, O(t3/2 log n))-ℓ1 representation for Σ. Fix any t-sparse v Rn, and define v = ˆDv. By the guarantee of Lemma D.4, since v is also t-sparse, there is some α R D such that v = P d D α d d and i=1 di qi qi 2 2 O(t3/2 log n) i=1 vi qi qi 2 Algorithm 5: Sparse linear regression for arbitrary Σ 1 Procedure Sparse Linear Regression((Xi, yi)m i=1, t, B, σ2) 2 D Compute L1Representation({X1, . . . , X100t log n}, t) 3 for j = m/2 + 1, . . . , m do 4 for d D do 5 Xj,d Xj, d/ q (2/m) Pm/2 i=1 Xi, d 2 /* See Theorem A.7 for definition of Mirror Descent Lasso(), and Theorem D.5 for definition of Cl1rep */ 6 ˆβ Mirror Descent Lasso(( Xi, yi)m i=m/2+1, 2Cl1rept3/2B log(n), m/2, σ2) d D ˆβdd/ q (2/m) Pm/2 i=1 Xi, d 2 8 return ˆw But note that vi = ˆDiivi = qi 2 vi for all i. Similarly, every d D corresponds to some d D with di = qi 2 di for all i. Thus, reindexing α according to D in the natural way, we have that v = P d D αdd and 2 O(t3/2 log n) But now let ˆΣ = 1 m X X. For any i, j [n] we have qi, qj = mˆΣii. Hence, i,j [n] vivj ˆΣij = v ˆΣv and similarly for Pn i=1 diqi 2 2. Thus, we get X d D |αd| d ˆΣ O(t3/2 log n) v ˆΣ . But as shown above, we know that 1 2 x Σ x ˆΣ 2 x Σ for all t-sparse x Rn. Since v and all d D are t-sparse, we conclude that X d D |αd| d Σ O(t3/2 log n) v Σ as desired. 
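The wrapper Compute L1Representation() amounts to a normalization, a call to Represent Vectors(), and a rescaling; the short Python sketch below makes this explicit. The argument represent_vectors is a stub standing in for the Represent Vectors() subroutine of Algorithm 4 and is not implemented here.

```python
import numpy as np

def compute_l1_representation(X, t, represent_vectors):
    """Sketch of Compute_L1Representation() (Algorithm 5).

    X : (m, n) matrix whose rows are the samples X_1, ..., X_m
    t : sparsity parameter
    represent_vectors(P, t, delta) : assumed implementation of Represent_Vectors()
        (Algorithm 4), returning a list of t-sparse coefficient vectors in R^n
    """
    m, n = X.shape
    col_norms = np.linalg.norm(X, axis=0)            # ||q_i||_2 for each column q_i of X
    P = X / col_norms                                # normalized columns p_i = q_i / ||q_i||_2
    D_tilde = represent_vectors(P, t, np.exp(-m))    # dictionary for the normalized columns
    D_hat = np.diag(col_norms)                       # rescaling matrix \hat{D}
    return [D_hat @ d for d in D_tilde]              # D = { \hat{D} d : d in D_tilde }
```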
We finally restate and prove Theorem 3.1, as a corollary of Theorem D.5 and the well-known fact that standard slow rate guarantees for Lasso (i.e. based on the ℓ1 norm of the regressor) can be achieved in near-linear time (Theorem A.8). The pseudocode for the main algorithm is given in Algorithm 5. Corollary D.6. Let n, m, t, B N and σ > 0, and let Σ : n n be a positive-definite matrix. Let w Rn be t-sparse with w Σ B. Suppose m Ct log n for a sufficiently large constant C. Let (Xi, yi)m i=1 be independent samples where Xi N(0, Σ) and yi = Xi, w + N(0, σ2). Then there is an O(m2nt 1/2+nt Ω(1/t) log O(t) n)-time algorithm (Algorithm 5) that, given (Xi, yi)m i=1, t, B, σ2, produces an estimate ˆw Rn satisfying, with probability 1 o(1), ˆw w 2 Σ O σ2 Proof. By Theorem D.5 it holds with probability 1 n 100t that D is a (t, Cl1rept3/2 log n)-ℓ1representation for Σ. Also, by standard concentration bounds (e.g. Exercise 4.7.3 in [41]), we have 1 2 x Σ x ˆΣ 2 x Σ for all t-sparse x Rn (where ˆΣ = 2 m Pm/2 i=1 Xi X i ) with probability at least 1 exp( Ω(m)). Suppose that both of these events occur. For each of the remaining m/2 samples Xj, compute Xj RD where the entry Xj,d corresponding to d D is Xj, d/ d ˆΣ (where ˆΣ = 2 m Pm/2 i=1 Xi X i is not explicitly computed; since d is sparse, both Xj, d and d ˆΣ can be computed in poly(t, m) time). Let N(0, Γ) denote the distribution of each Xj. For each d D, since d is t-sparse, we have that Ex N(0,Σ) x, d/ d ˆΣ 2 = d 2 Σ / d 2 ˆΣ 4. Thus, Γdd 4 for all d. Moreover, since w is t-sparse, there is some α RD with w = P d αdd and P d |αd| d Σ Cl1rept3/2 log(n) w Σ. Define β Rd by βd = αd d ˆΣ. Then w = P d βdd/ d ˆΣ and X d |αd| d Σ 2Cl1rept3/2 log(n) w Σ . But now for any of the remaining m/2 samples, we have that d Xj, d/ d ˆΣ αd d ˆΣ = Xj, X d αdd = Xj, w , and thus y Xj, β N(0, σ2). So we can apply Theorem A.8 to samples ( Xj, yj)m j=m/2+1 to compute an estimator ˆβ satisfying ˆβ β 2 using that β 1 2Cl1rept3/2 log(n) w Σ 2Cl1rept3/2B log(n), and using the bound maxd Γdd 4. The time complexity of this step is O(|D|m2) = O(m2nt 1/2). Finally, com- pute ˆw := P d ˆβdd/ d ˆΣ. We have that ˆw w Σ = ˆβ β Γ, which completes the proof. E Fixed-parameter tractability in κ and t In this section we prove Theorem 3.2, which shows we can achieve upper bounds on Nt,α(Σ) for α independent of κ and n, if we are willing to incur dependence on κ in the resulting bound. In fact, we actually prove an upper bound on the packing number Pt,α(Σ). To achieve this, the first key idea is to consider the dual certificates for a packing. Suppose that v1, . . . , v N are unit vectors (in the Σ-norm) with | vi, vj Σ| α for all i = j. Then | vi, Σvi | α 1 maxj =i | vj, Σvi |, so Σvi certifies that any linear combination vi = P j =i xjvj must have the property that x 1 α 1. Thus, to show that there cannot be a large packing of sparse vectors in the Σ-norm, it would suffice to prove that any large set of sparse vectors must have one vector that can be written as a linear combination of the remaining vectors, where the coefficient vector has small ℓ1 norm. In fact, this would give an upper bound on Nt,α(Σ) for all Σ. We do not know if such a statement is true. However, we can prove an approximate analogue. 
The following lemma shows that under a condition number bound on Σ, the dual certificate argument can be generalized to require only a weaker property: that any large set of sparse vectors must have one vector that can be approximately written as a linear combination of the remaining vectors, with low ℓ1 cost. The approximation error determines how small the condition number must be: Lemma E.1. Let n, N, t, T N and let δ > 0. Suppose that for all t-sparse vectors v1, . . . , v N Rn, there exists some i [N] and x RN such that x 1 T and vi X δ max j [N] vj 2 . Then for every positive-definite matrix Σ : n n with κ(Σ) < 1/(4δ2) it holds that Pt,1/(3T )(Σ) N log2 κ(Σ). Proof. Fix a positive-definite matrix Σ : n n and suppose that K := Pt,1/(3T )(Σ) > N log2 κ(Σ). By definition, there are nonzero t-sparse vectors v1, . . . , v K RN such that | vi, vj Σ| 1 3T vi Σ vj Σ for all i = j. Without loss of generality, assume that vi 2 = 1 for all i [K], so that λmin(Σ) vi 2 Σ λmax(Σ). So we can partition [K] into log2 κ(Σ) buckets such that maxi B vi 2 Σ / mini B vi 2 Σ 2 for each bucket B [K]. There must be some bucket B with |B| N. By assumption, there is some i B and x RN such that x 1 T and vi X j B:j =i xjvj j B:j =i xjvj j B:j =i xjvj j B:j =i xj vi, vj Σ + j B:j =i xjvj x 1 max j B:j =i | vi, vj Σ| + v i Σ 2 δ 3T max j B:j =i vi Σ vj Σ + δ q λmax(Σ) v i Σvi 2 vi 2 Σ 3 + vi Σ δ p Simplifying, we get vi Σ 2δ p λmax(Σ). Since also vi Σ p λmin(Σ), it follows that κ(Σ) = λmax(Σ)/λmin(Σ) 1/(4δ2). It remains to show that the precondition of Lemma E.1 can be satisfied for sub-constant δ without requiring N to scale with nt. We start by proving the desired property when the vectors are all t-sparse and binary, i.e. v1, . . . , v N {0, 1}n, and afterwards we will black-box extend the result to the real-valued setting. Concretely, given sparse binary vectors v1, . . . , v N {0, 1}n (with N n), we want to find one that can be efficiently approximated (in ℓ2 norm) by the rest, where efficient means that the coefficients have small absolute sum. Thinking of each vector as the indicator vector of a subset of [n], a first step towards an efficient approximation for vi = 1[ Si] may be constructing an efficient approximation for a standard basis vector ej for some j Si. Indeed, there is some j [n] such that Sj := {i : vij = 1} is large, i.e. |Sj| N/n. If the vectors (vi)i Sj were in some sense random, then the average 1 |Sj| P i Sj vi would be a good approximation for ej. It is also efficient, in that the absolute sum of coefficients is 1. But of course the vectors are not random; it could be that many vectors in Sj also contain some other coordinate j . In this case we restrict to the set of vectors containing both j and j . Now we may hope to approximate the vector 1[ {j, j }]. Completing this argument, we get the following lemma which states that there exists a subset of [n] that is contained in many of the vectors, and that is well-approximated by the average of those vectors. For notational convenience, for vectors x, y {0, 1}n we say that x y if xi yi for all i [n]. Lemma E.2. Let n, N, t, s N with sn N, and let v1, . . . , v N {0, 1}n be nonzero t-sparse binary vectors. Then there is some set S [N] of size |S| s and some nonzero vector u {0, 1}n such that u vi for all i S, and u 1 t(sn/N)1/t. Proof. For each J [n], define SJ := {i [N] : vij = 1 j J}. Since all vi are nonzero, there is some j [n] with |S{j }| N/n. We iteratively construct a set J [n] as follows. Initially, set J = {j }. 
While there exists some a [n] \ J such that |SJ {a}| > (sn/N)1/t|SJ|, update J to J {a} (if there are multiple such a, pick any one of them arbitrarily). At termination of this process, we have |SJ| > 0. Since every vi is t-sparse, it must be that |J| t. Thus, |SJ| (N/n) (sn/N)(t 1)/t s. Set S := SJ and u := 1J {0, 1}n. By definition of SJ, we have that u vi for all i S. For any j J, we have uj = 1 = 1 |S| P i S vij. For any j J, we have uj = 0 and 1 |S| = |{i S : vij = 1}| |S| = |SJ {j}| |SJ| (sn/N)1/t by construction of J. Thus, u 1 Additionally, u 1 i S vi 1 t. By the inequality x 2 2 x 1 x , we conclude that u 1 as claimed. We now use Lemma E.2 to show that if N is sufficiently large, then at least one of the vectors vi can be efficiently approximated by the rest. The proof is by induction on t. As a first attempt, one might use Lemma E.2 to find some u {0, 1}n and some large set S [N] such that u vi for all i S, and the average of the vi s approximates u. Then, restrict to the vectors in S, and induct on the (t 1)-sparse residual vectors {vi u : i S}. If one of the vi u s can be efficiently approximated by the other residuals, then since u can also be efficiently approximated, we can derive an efficient approximation of vi by the remaining vj s. This doesn t quite work, since at each step of the induction the set of vectors will become smaller by a factor of roughly n. However, instead of throwing away the vectors outside S =: S(1) we can iteratively re-apply Lemma E.2 to get disjoint sets S(1), S(2), . . . , S(m), where each S(a) has the same property as S (for some potentially different vector u(a)). We can then induct on the residual vectors a{vi u(a) : i S(a)}. This suffices to efficiently approximate some vi. Since we throw away fewer vectors at each step of the induction, we do not need the initial number of vectors N to be as large. We formalize the above ideas in the following theorem. Theorem E.3. Let n, N, t N and let v1, . . . , v N {0, 1}n be t-sparse binary vectors. Then there is some i [N] and x RN such that x 1 3t and vi X 9t(tn/N)1/t. Proof. We induct on t, observing that the case t = 0 is immediate. Fix t > 0 and t-sparse vectors {v1, . . . , v N} {0, 1}n, and suppose that the theorem statement holds for t 1. If any vi is identically zero, then the claim is trivially true with x = 0. If N t3t+1n then the RHS of the desired norm bound exceeds 4t t, so the claim is trivially true with x = 0 and any i [N]. Thus, we may assume that all vi are nonzero, and N t3t+1n. Applying the previous lemma with s := 3t+1 N/n gives some S(1) [N] and nonzero u(1) {0, 1}n such that |S(1)| 3t+1 and u(1) vi for all i S(1), and u(1) 1 |S(1)| 9t(n/N)1/t. If |N| |S(1)| N/t 3t+1n then we can reapply the lemma with vectors (vi)i [N]\S(1) and s := 3t+1 to get some S(2) [N]\S(1) and u(2) {0, 1}n. Continuing this process so long as there are at least N/t 3t+1n remaining vectors, we can generate disjoint sets S(1), . . . , S(m) [N] and vectors u(1), . . . , u(m) {0, 1}n with the following properties: (i) |S(1) S(m)| > N N/t (ii) |S(a)| 3t+1 for every a [m] (iii) For every a [m], it holds that u(a) is nonzero and u(a) vi for all i S(a) (iv) For every a [m], u(a) 1 |S(a)| 9t(tn/N)1/t. For each a [m] and i S(a), define v i := vi u(a). By Property (iii) we have that v i {0, 1}N and v i is (t 1)-sparse. 
By the inductive hypothesis applied to vectors (v i)i S(1) S(m), there is some i S(1) S(m) and x RN (supported on S(1) S(m)) such that x 1 3t 1 and v i X j =i x jv j 9(t 1)((t 1)n/|S(1) S(m)|)1/(t 1) 9t(tn/N)1/t (9) where the last inequality uses Property (i) and the bound N tn. Of course, without loss of generality x i = 0. Let a [m] be the unique index such that i S(a). We define x {0, 1}N (supported on S(1) S(m)) as follows. For each b [m] and each r S(b), set xr = x r 1 |S(b)| j S(b) x j + 1[b = a] Since x 1 3t 1, we can see that x 1 x 1 + X j S(b) |x j| + X 2 3t 1 + 1. Next, we use x to approximate vi. The following bound is almost what we want: j [N] xjvj 2 3 4t 1p 9t(tn/N)1/t Proof of claim. We have vi X u(a) 1 |S(a)| v i + 1 |S(a)| r S(a) vr X u(a) 1 |S(a)| r [N] x rvr + X j S(b) x jvr u(a) 1 |S(a)| r [N] x rv r r S(b) x ru(b) + X j S(b) x jvr u(a) 1 |S(a)| r [N] x rv r u(b) 1 |S(b)| where the first and third inequalities use that vr = v r + u(b) for all r S(b), and throughout we use that xr = x r = 0 for r S(1) S(m). Applying Property (iv), equation (9), and the bound x 1 3t 1, we get vi X 9t(tn/N)1/t + 4t 1 q 9t(tn/N)1/t + 3t 1 q 9t(tn/N)1/t 9t(tn/N)1/t as claimed. However, we wanted a bound on vi P j =i xjvj, and unfortunately xi = 0. Fortunately, it is enough that xi is bounded away from 1. Since x i = 0, we have |xi| 1 |S(a)| j S(a) |x j| + 1 |S(a)| x 1 + 1 |S(a)| 3t 1 Thus, by Claim E.4, vi 1 1 xi 1 1 xi 3 4t 1 q 9t(tn/N)1/t 4t q 9t(tn/N)1/t. Finally, we have x/(1 xi) 1 (9/8)(2 3t 1 + 1) 3t, so x/(1 xi) satisfies all the desired conditions. This completes the induction. Finally, we extend Theorem E.3 to real-valued sparse vectors via a discretization argument. Lemma E.5. Let n, N, t N and let v1, . . . , v N Rn be t-sparse vectors. Then there is some i [N] and x Rn such that x 1 3t and vi X t(n/N)1/(4t) max j [N] vj . Proof. Without loss of generality assume that maxj [N] vj = 1. Let k N be fixed later. Define a map φ : [ 1, 1] {0, 1}2k+1 by ek+1+ ck if c < 0 ek+1 if c = 0 ek+1+ ck if c > 0 . Also let Φ : R2k+1 R be the linear map that sends Φei 7 (i k 1)/k for each i [2k + 1]. Note that |Φφ(c) c| 1/k for all c [ 1, 1] and Φφ(0) = 0. Define φ n : [ 1, 1]n {0, 1}(2k+1)n by φ(c1, . . . , cn) = (φ(c1), . . . , φ(cn)), and define Φ n : {0, 1}(2k+1)n Rn by Φ n(x1, . . . , xn) = (Φ(x1), . . . , Φ(xn)). For any i [N], the vector φ n(vi) is t-sparse and lies in {0, 1}(2k+1)n. Thus, applying Theorem E.3 gives some i [N] and x RN with x 1 3t and φ n(vi) X j =i xjφ n(vj) 9t(tn(2k + 1)/N)1/t. Since Φ n is a linear map and Φ n 2 = Φ 2 2k + 1, we then get Φ nφ n(vi) X j =i xjΦ nφ n(vj) 9t(2k + 1)(tn(2k + 1)/N)1/t. But now for every j [N], we know that vj Φ nφ n(vj) 2 2 = X a supp(vj) (vja Φφ(vja))2 t We conclude that vi X Φ nφ n(vi) X j =i xjΦ nφ n(vj) + vi Φ nφ n(vi) 2 j =i |xj| vj Φ nφ n(vj) 2 9t(2k + 1)(tn(2k + 1)/N)1/t + (1 + 3t) (2k + 1) 4t+1 t(n/N)1/(2t) + 4t Taking k = (N/n)1/(4t) gives the claimed bound. Combining Lemma E.5 with Lemma E.1 lets us prove Theorem 3.2. Proof of Theorem 3.2. Set δ := p 1/(4κ) and N = 44t(t+3)t2tκ2tn. By Lemma E.5, for any t-sparse vectors v1, . . . , v N Rn with vi 2 1 for all i [N], there is some i [N] and x Rn such that x 1 3t and vi X t(n/N)1/(4t) 1 4 κ < δ. It follows from Lemma E.1 that Pt,1/3t+1(Σ) N log2 κ. Finally, by Lemma A.2, we conclude that Nt,1/3t+1(Σ) N log2 κ. F Generalization bounds F.1 Finite-class model selection Lemma F.1. Let n, m, neff N and let Σ be a positive semi-definite matrix. 
Fix a vector w Rn and a closed set W Rn and let (Xi, yi)m i=1 be independent draws Xi N(0, Σ) and yi = Xi, w + ξi where ξi N(0, σ2). Pick ˆw argmin w W Xw y 2 2 where X : m n is the matrix with rows X1, . . . , Xm. For any ϵ, δ (0, 1), suppose that with probability at least 1 δ, the following bounds hold uniformly over w W: m X(w w ) 2 2 w w 2 Σ ϵ w w 2 Σ D ξ, X(w w ) X(w w ) 2 Then with probability at least 1 δ it also holds that 1 + ϵ 1 ϵ inf w W w w Σ + 2σ Proof. Consider the event in which both bounds hold. Let wopt argminw W w w 2 Σ. Then X( ˆw w ) 2 2 = X ˆw y 2 2 + 2 ξ, X( ˆw w ) ξ 2 2 Xwopt y 2 2 + 2 ξ, X( ˆw w ) ξ 2 2 = X(wopt w ) 2 2 + 2 ξ, X( ˆw w ) 2 ξ, X(wopt w ) X(wopt w ) 2 2 + 2 X( ˆw w ) 2 + X(wopt w ) 2 σ neff. Subtracting X(wopt w ) 2 2 from both sides and dividing by X( ˆw w ) 2 + X(wopt w ) 2, we get that X( ˆw w ) 2 X(wopt w ) 2 2σ neff. It follows that 1 (1 ϵ)m X( ˆw w ) 2 1 (1 ϵ)m X(wopt w ) 2 + 2σ (1 + ϵ)neff 1 + ϵ 1 ϵ wopt w Σ + 2σ as desired. Lemma F.2. Let n, m N and let Σ be a positive semi-definite matrix. Fix a vector w Rn and a finite set W Rn and let (Xi, yi)m i=1 be independent draws Xi N(0, Σ) and yi = Xi, w + ξi where ξi N(0, σ2). Pick ˆw argmin w W Xw y 2 2 . For any ϵ, δ (0, 1), if m 8ϵ 2 log(2|W|/δ), then with probability at least 1 2δ, we have 1 + ϵ 1 ϵ inf w W w w Σ + 4σ log(2|W|/δ) Proof. For any fixed w W, the random variables Xi, w w N(0, w w 2 Σ) are independent, and therefore X(w w ) 2 2 w w 2 Σ χ2 m. It follows that for any ϵ > 0, Pr 1 m X(w w ) 2 2 w w 2 Σ > ϵ w w 2 Σ By the union bound, if m 8ϵ 2 log(2|W|/δ), then with probability at least 1 δ it holds that for all w W, 1 m X(w w ) 2 2 w w 2 Σ ϵ w w 2 Σ . (10) Also, for any fixed w W, conditioned on X, the random variable ξ, X(w w ) X(w w ) 2 has distribution N(0, σ2). Thus, by a Gaussian tail bound and the union bound, we have for any t > 0 that σt 2|W| e t2/2. In particular, with probability at least 1 δ it holds that 2 log(2|W|/δ). (11) Using (10) and (11) we apply Lemma F.1 which gives the desired bound. F.2 Weak learning Lemma F.3. Let n, m N and ϵ, δ > 0. Let Σ : n n be a positive semi-definite matrix and let X : m n have independent rows X1, . . . , Xm N(0, Σ). For any fixed u, v Rn, if m 8ϵ 2 log(8/δ), then it holds with probability at least 1 δ that u 1 m X X Σ v 2ϵ u Σ v Σ . Proof. Decompose u = av + w where v, w Σ = 0, so that a = u, v Σ/ v 2 Σ. Since Xv 2 2 v 2 Σ χ2 m and m 8ϵ 2 log(4/δ) it holds with probability at least 1 δ/2 that v 1 m X X Σ v = i=1 Xi, v 2 v 2 Σ m X X Σ v = i=1 Xi, w Xi, v i=1 Zi, Σ1/2w Zi, Σ1/2v where we define independent random vectors Z1, . . . , Zm N(0, In) so that Xi = Σ1/2Zi. Since m 8 log(2/δ), with probability at least 1 δ/4 we have Pm i=1 Zi, Σ1/2v 2 2m v 2 Σ. Condition on the value of this sum, and note that since Σ1/2v Σ1/2w, the random variables Zi, Σ1/2w are still (independent and) distributed as N(0, w 2 Σ). Thus i=1 Zi, Σ1/2w Zi, Σ1/2v N i=1 w 2 Σ Zi, Σ1/2v 2 ! When the variance is at most 2 w 2 Σ v 2 Σ /m, we have with probability at least 1 δ/4 that the sum is at most 2 w Σ v Σ p 2 log(8/δ)/m in magnitude. So, using m 8ϵ 2 log(8/δ) it holds unconditionally with probability at least 1 δ/2 that 1 m i=1 Zi, Σ1/2w Zi, Σ1/2v ϵ w Σ v Σ . In all, we have that u 1 m X X Σ v |a|ϵ v 2 Σ + ϵ w Σ v Σ 2ϵ u Σ v Σ using that |a| u Σ / v Σ and w Σ u Σ. Lemma F.4. Let n, m N and let Σ be a positive semi-definite matrix. Fix a vector w Rn and a finite set W Rn and let (Xi, yi)m i=1 be independent draws Xi N(0, Σ) and yi = Xi, w + ξi where ξi N(0, σ2). 
Pick ( ˆw, ˆβ) argmin w W β R βXw y 2 2 . Suppose α := maxw W w,w Σ w Σ w Σ > 0. For any δ > 0, if m Cα 2 log(32|W|/δ) for a sufficiently large absolute constant C, then with probability at least 1 δ, ˆβ ˆw w 2 Σ (1 α2/4) w 2 Σ + 400σ2 log(4|W|/δ) Proof. For any vectors u, v Rn, define (u, v) = u 1 Claim F.5. With probability at least 1 δ, the following bounds hold uniformly over w W and β R: D ξ, X(βw w ) X(βw w ) 2 E σ neff where neff := 2 log(32|W|/δ). 2. | (βw, w )| α 100 βw Σ w Σ 3. | (βw, βw)| α 100 βw 2 Σ . Proof of claim. For item (1), fix w W. Let Φ(w) : 2 m be a matrix whose rows form an orthonormal basis for span{Xw, Xw } Rm. Then (denoting the unit Euclidean ball in R2 by B2) we have for all β R that ξ, X(βw w ) X(βw w ) 2 D ξ, (Φ(w)) u E Φ(w)ξ 2 2 max i [2] | Φ(w) i , ξ |. Since Φ(w) i , ξ N(0, σ2), we have Pr[| Φ(w) i , ξ | > σ p 2 log(4|W|/δ)] δ/(4|W|). A union bound over i [2] and w W gives that condition (2) in Lemma F.1 is satisfied with probability at least 1 δ/2. For items (2) and (3), note that is bilinear, so it suffices to take β = 1. Applying Lemma F.3 and the union bound, so long as m Cα 2 log(32|W|/δ) for a sufficiently large constant C, items (2) and (3) hold simultaneously with probability at least 1 δ/2. Henceforth we assume that all of the events in the above claim hold. Let w0 W be such that | w0, w Σ| = α w0 Σ w Σ. Let β0 = w0, w Σ/ w0 2 Σ. Then β0w0 w 2 Σ = (1 α2) w 2 Σ . Claim F.6. The excess empirical risk can be bounded as X(ˆβ ˆw w ) 2 X(w0 w ) 2 + 2σ neff. Proof of claim. We have X(ˆβ ˆw w ) 2 2 = Xˆβ ˆw y 2 2 + 2 ξ, X(ˆβ ˆw w ) ξ 2 2 Xβ0w0 y 2 2 + 2 ξ, X(ˆβ ˆw w ) ξ 2 2 = X(β0w0 w ) 2 2 + 2 ξ, X(ˆβ ˆw w ) 2 ξ, X(β0w0 w ) X(β0w0 w ) 2 2 + 2 X(ˆβ ˆw w ) 2 + X(β0w0 w ) 2 σ neff where the last bound is by item (1) of Claim F.5. Simplifying, we get the claimed bound. Now we have ˆβ ˆw w 2 X(ˆβ ˆw w ) 2 2 (ˆβ ˆw w , ˆβ ˆw w ) m ( X(β0w0 w ) 2 + 2σ neff)2 (ˆβ ˆw w , ˆβ ˆw w ) m X(β0w0 w ) 2 2 + (1 + 100α 2)σ2neff m (ˆβ ˆw w , ˆβ ˆw w ) = (1 + α2/100) β0w0 w 2 Σ + (1 + 100α 2)σ2neff (ˆβ ˆw w , ˆβ ˆw w ) + 1 + α2 (β0w0 w , β0w0 w ) (1 + α2/100) β0w0 w 2 Σ + (1 + 100α 2)σ2neff m + | (ˆβ ˆw, ˆβ ˆw)| + 2| (ˆβ ˆw, w )| + (1 + α2/100)| (β0w0, β0w0)| + 2(1 + α2/100)| (β0w0, w )| + (α2/100)| (w , w )|. where the first inequality is by Claim F.6, the second inequality is by AM-GM, and the final inequality is expanding out the terms (ˆβ ˆw w , ˆβ ˆw w ) and (β0w0 w , β0w0 w ) (via bilinearity) and cancelling out the common term (w , w ). Finally applying items (2) and (3) of Claim F.5, we get ˆβ ˆw w 2 Σ (1 + α2/100) β0w0 w 2 Σ + (1 + 100α 2)σ2neff ˆβ ˆw Σ w Σ 50 β0w0 2 Σ + α 25 β0w0 Σ w Σ + α3 (1 9α2/10) w 2 Σ + 101σ2neff ˆβ ˆw Σ w Σ (12) where the second inequality uses the bounds β0w0 w 2 Σ = (1 α2) w 2 Σ and β0w0 Σ = | w0, w Σ| w0 Σ = α w Σ . But now on the other hand, ˆβ ˆw w 2 Σ = ˆβ ˆw 2 Σ + w 2 Σ 2 ˆβ ˆw, w Σ ˆβ ˆw 2 Σ + w 2 Σ + 2α ˆβ ˆw Σ w Σ . Comparing with (12) gives 1 α Σ 101σ2neff α2m + 3α ˆβ ˆw Σ w Σ and therefore ˆβ ˆw Σ 4α w Σ + σ Substituting into (12) we finally get ˆβ ˆw w 2 Σ (1 α2/2) w 2 Σ + 200σ2neff α2m as desired. F.3 Excess risk at optima of additively-regularized programs Lemma F.7. Let n N, and let Σ : n n be a positive semi-definite matrix. For some seminorm Φ : Rn [0, ) and some p, δ > 0, assume that with probability at least 1 δ over G N(0, Σ) it holds uniformly over v Rn that 2Φ(v) + p v Σ . Fix a vector v Rn. For any m N and σ > 0 let (Xi, yi)m i=1 be independent samples distributed as Xi N(0, Σ) and yi = Xi, v + ξi where ξi N(0, σ2). 
Define ˆv argmin v Rn Xv y 2 2 + Φ(v)2 + y 2 Φ(v) where X : m n is the matrix with rows X1, . . . , Xm. Then with probability at least 1 7δ over (Xi, yi)m i=1, so long as m 16p + 196 log(12/δ)), it holds that ˆv v 2 Σ 128σ2p m + 8(σ + v Σ)Φ(v ) m + 8Φ(w )2 Proof. For notational convenience, define F(v) := (1/2)Φ(v v ) + p v v Σ. We apply the lemma s assumption twice: For any fixed ξ, the random variable Xξ has distribution N(0, ξ 2 2 Σ). By the above claim, with probability at least 1 δ over X, we have ξ, X(v v ) ξ 2 F(v) uniformly in v Rn. Since ξ 2 2 σ2χ2 m and m 8 log(2/δ), it holds with probability at least 1 δ that 1 m ξ 2 2σ. Thus, with probability at least 1 2δ, we have 2mσF(v) (13) uniformly in v Rn. The assumption means that we can apply Theorem C.1 with (noiseless) samples (Xi, Xi, v )m i=1 to get the following: since m 196 log(12/δ), it holds with probability at least 1 4δ over the randomness of X that for all v Rn, m X(v v ) 2 2 + 2 m F(v)2. (14) We also observe that the entries of y are independent and identically distributed as N(0, v 2 Σ+σ2), so by a χ2 tail bound, since m 32 log(2/δ), it holds with probability at least 1 δ that 1 m y 2 2 1 2( v 2 Σ + σ2), 3 2( v 2 Σ + σ2) . (15) We now condition on the event (which occurs with probability at least 1 7δ) that the bounds (13), (14), and (15) all hold. Specifying (14) to v := ˆv, we get that m X(ˆv v ) 2 2 + F(ˆv)2 X(ˆv v ) 2 2 Xˆv y 2 2 Φ(ˆv)2 y 2 Φ(ˆv) + Xv y 2 2 + Φ(v )2 + y 2 Φ(v ) + F(ˆv)2 = 2 Xv y, X(ˆv v ) Φ(ˆv)2 y 2 Φ(ˆv) + Φ(v )2 + y 2 Φ(v ) + F(ˆv)2 2mσF(ˆv) Φ(ˆv)2 y 2 Φ(ˆv) + Φ(v )2 + y 2 Φ(v ) + F(ˆv)2 where the first inequality is by (14), the second inequality is by optimality of ˆv, and the third inequality is by (13). We now expand F(ˆv) in the above expression. If 2mpσ ˆv v Σ exceeds m 8 ˆv v 2 Σ then the lemma immediately holds since ˆv v 2 Σ 128σ2p So we may assume that in fact 2mpσ ˆv v Σ m 8 ˆv v 2 Σ. By the lemma assumptions, we also know that m 16p. Thus, expanding F(ˆv) and applying these bounds, 2Φ(ˆv v ) + p ˆv v Σ Φ(ˆv)2 y 2 Φ(ˆv) + Φ(v )2 + y 2 Φ(v ) 2Φ(ˆv v )2 + 2p ˆv v 2 Σ 2 σΦ(ˆv v ) + m Φ(ˆv)2 y 2 Φ(ˆv) + Φ(v )2 + y 2 Φ(v ) 2Φ(ˆv v )2 + m 8 ˆv v 2 Σ . Simplifying, applying the triangle inequality Φ(ˆv v ) Φ(ˆv) + Φ(v ), and grouping terms, we get m Φ(v ) + 2Φ(v )2 2(σ + v Σ) mΦ(v ) + 2Φ(v )2 where the last inequality uses both sides of the bound (15). G Covering bounds from classical assumptions In this section, we further motivate the definition of our covering number Nt,α(Σ) by showing that in all settings where efficient SLR algorithms are known, there is a straightforward linear upper bound on the covering number. This lends weight to the need for stronger upper bounds on Nt,α as a stepping stone towards more efficient algorithms for sparse linear regression. G.1 Compatibility condition Definition G.1 (Compatibility Condition, see e.g. [40]). For a positive semidefinite matrix Σ : n n, L 1, and set S [n], we say Σ has S-restricted ℓ1-eigenvalue ϕ2(Σ, S) = min w C(S) |S| w, Σw where the cone C(S) is defined as C(S) = {w = 0 : w SC 1 L w S 1}. For t N, the t-restricted ℓ1-eigenvalue ϕ2(Σ, t) is the minimum over all S of size at most t. It is well-known that an upper bound on maxi Σii ϕ2(Σ,t) is sufficient for the success of Lasso (as well as nearly necessary; see e.g. the Weak Compatibility Condition defined in [23]): Theorem G.2 (see e.g. Corollary 5 in [45]). Fix n, m, t N, σ, δ > 0, and a positive semi-definite matrix Σ : n n with maxi Σii 1. 
Fix a t-sparse vector v Rn and let (Xi, yi)m i=1 be independent samples distributed as Xi N(0, Σ) and yi = Xi, v + ξi where ξi N(0, σ2). Define ˆv argmin v Rn: v 1 v 1 Xv y 2 2 where X : m n is the matrix with rows X1, . . . , Xm. If m 4ϕ2(Σ, t) t log(16n/δ), then with probability at least 1 δ, it holds that ˆv v 2 Σ O σ2t log(16n/δ) Fact G.3. Let n, t N. For any positive semi-definite Σ : n n with ϕ2 := ϕ2(Σ, t) and maxi Σii 1, it holds that Nt,ϕ/ Proof. The proof is essentially the same as that of Fact A.4. By Lemma A.3, it suffices to show that the standard basis is a (t, t/ϕ)-ℓ1-representation for Σ. Indeed, for any t-sparse v Rn, we have i=1 |vi| ei Σ v 1 max i as claimed. G.2 Submodularity ratio Definition G.4 (see e.g. [9]). For a positive semi-definite matrix Σ : n n and a set L [n] define the normalized residual covariance matrx Σ(L) : n n by Σ(L) := (D1/2) Σ Σ LΣ LLΣL (D1/2) where D := diag Σ Σ LΣ LLΣL . Definition G.5. Fix a positive semi-definite matrix Σ : n n, a positive integer t N, and any v Rn. Define the t-submodularity ratio of Σ with respect to v by γt(Σ, v ) := min L,S [n]:|L|,|S| t,L S= (v ) (Σ(L)) S (Σ(L))Sv (v ) (Σ(L)) S (Σ(L)) SS(Σ(L))Sv . In any t-sparse linear regression model with true regressor v , when the above quantity γ := γt(Σ, v ) is bounded away from zero, it can be shown that the standard Forward Regression algorithm finds some t-sparse estimate ˆv Rn such that ˆv v 2 Σ e γ v 2 Σ (see e.g. Theorem 3.2 in [9]; that result is for the model where the algorithm is given exact access to v, v Σ for any t-sparse v Rn, but analogous finite-sample bounds can be obtained with O(γ O(1)t log(n)) samples by applying the theorem to the empirical covariance matrix and using concentration of t t submatrices). A similar guarantee is also known for Orthogonal Matching Pursuit (Theorem 3.7 in [9]). Once again, it is simple to show that the standard basis is a good dictionary for matrices with a large submodularity ratio. Fact G.6. Let n, t N. For any positive semi-definite Σ : n n with γ := minv Rn B0(t) γt(Σ, v ), it holds that Nt, Proof. We show that the standard basis is a (t, γ/t)-dictionary for Σ. Without loss of generality assume that Σii = 1 for all i [n]. Then Σ( ) = Σ. Fix any t-sparse v Rn. Setting S := supp(v ), we have that X i S ei, v 2 Σ = (v ) Σ S ΣSv γ(v ) Σ S (ΣSS) ΣSv = γ v 2 Σ where the inequality is by definition of γ, and the final equality uses that ΣSv = ΣSS(v )S (since v is supported on S). It follows that maxi S ei, v 2 Σ (γ/t) v 2 Σ. Since ei Σ = 1 for all i, we conclude that max i [n] | ei, v Σ| ei Σ v Σ rγ t as claimed. G.3 Sparse preconditioning Recent work [23] showed that if Σ : n n is a positive definite matrix and the support of Θ := Σ 1 is the adjacency matrix of a graph with low treewidth, then there is a polynomial-time, sampleefficient algorithm for sparse linear regression with covariates drawn from N(0, Σ). The key to this result was a proof that such covariance matrices are sparsely preconditionable: i.e., there is a matrix S : n n such that Σ = SS and S has sparse rows. We claim that this property also immediately enables succinct dictionaries. Concretely, suppose that S has s-sparse rows. By a change-of-basis argument, any t-sparse vector in the standard basis is st-sparse in the basis {(S ) 1 1 , . . . , (S ) 1 n }. Moreover these vectors are orthonormal under Σ. Thus, by the same argument as for Fact A.4, it s easy to see that {(S ) 1 1 , . . . , (S ) 1 n } is a (t, 1/ st)-dictionary for Σ. 
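As a numerical sanity check of the change-of-basis argument above, the following Python sketch builds a toy covariance Σ = SS^⊤ from a bidiagonal factor S with 2-sparse rows (an illustrative example, not the treewidth-based construction of [23]) and verifies the two claims: the columns of (S^⊤)^{-1} are orthonormal under Σ, and every standard basis vector has an s-sparse coefficient vector in that basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 8, 2

# A factor S with s-sparse rows (lower bidiagonal), and the covariance Sigma = S S^T.
S = np.diag(1.0 + rng.random(n)) + np.diag(0.5 * rng.random(n - 1), k=-1)
Sigma = S @ S.T

# Candidate dictionary: the columns of B = (S^T)^{-1}.
B = np.linalg.inv(S.T)

# (1) The dictionary vectors are orthonormal under Sigma: B^T Sigma B = I.
assert np.allclose(B.T @ Sigma @ B, np.eye(n))

# (2) The coefficients of e_j in this basis are the j-th row of S, hence s-sparse;
#     so a t-sparse vector in the standard basis is (s*t)-sparse in the new basis.
j = 3
coeffs = np.linalg.solve(B, np.eye(n)[:, j])
assert np.allclose(coeffs, S[j, :])
assert np.count_nonzero(S[j, :]) <= s
```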
H Supplementary figure

Figure 2: Performance of Basis Pursuit on a synthetic example with n = 1000 covariates. The covariates are all independent N(0, 1) except for (X_0, X_1, X_2), which have joint distribution X_0 = Z_0, X_1 = Z_0 + 0.4 Z_1, and X_2 = Z_1 + 0.4 Z_2, where Z_0, Z_1, Z_2 ~ N(0, 1) are independent. The noiseless responses are y = 6.25 (X_1 - X_2) + 2.5 X_3, i.e. the ground truth is 3-sparse. The x-axis is the number of samples; the y-axis is the out-of-sample prediction error (averaged over 10 independent runs; error bars indicate the standard deviation).

I Experimental details

The simulations were run using Python 3.9 and the Gurobi library [17]. Each figure took several minutes to generate on a standard laptop. See the file auglasso.py for code and execution instructions.
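For readers who wish to reproduce the qualitative behavior in Figure 2 without a Gurobi license, the following Python sketch generates the synthetic design described above and fits scikit-learn's Lasso as an illustrative stand-in for the Basis Pursuit solver in auglasso.py; the sample sizes and regularization level are arbitrary choices for this sketch, not the settings used in our experiments.

```python
import numpy as np
from sklearn.linear_model import Lasso

def make_figure2_data(m, n=1000, rng=None):
    """Synthetic design of Figure 2: i.i.d. N(0,1) covariates, except that
    (X_0, X_1, X_2) share the latent factors Z_0, Z_1, Z_2."""
    rng = rng if rng is not None else np.random.default_rng()
    X = rng.standard_normal((m, n))
    Z = rng.standard_normal((m, 3))
    X[:, 0] = Z[:, 0]
    X[:, 1] = Z[:, 0] + 0.4 * Z[:, 1]
    X[:, 2] = Z[:, 1] + 0.4 * Z[:, 2]
    y = 6.25 * (X[:, 1] - X[:, 2]) + 2.5 * X[:, 3]   # noiseless, 3-sparse ground truth
    return X, y

X_train, y_train = make_figure2_data(m=200)
X_test, y_test = make_figure2_data(m=2000)
model = Lasso(alpha=0.01, max_iter=10000).fit(X_train, y_train)
print("out-of-sample prediction error:", np.mean((model.predict(X_test) - y_test) ** 2))
```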