# Robust Sparse Regression with Non-Isotropic Designs

Chih-Hung Liu, Department of Electrical Engineering, National Taiwan University, chliu@ntu.edu.tw
Gleb Novikov, Lucerne School of Computer Science and Information Technology, gleb.novikov@hslu.ch

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We develop a technique to design efficiently computable estimators for sparse linear regression in the simultaneous presence of two adversaries: oblivious and adaptive. Consider the model y* = X*β* + η, where X* is an n × d random design matrix, β* ∈ R^d is a k-sparse vector, and the noise η is independent of X* and chosen by the oblivious adversary. Apart from the independence of η and X*, we only require a small fraction of the entries of η to have magnitude at most 1. The adaptive adversary is allowed to arbitrarily corrupt an ε-fraction of the samples (X*_1, y*_1), …, (X*_n, y*_n). Given the ε-corrupted samples (X_1, y_1), …, (X_n, y_n), the goal is to estimate β*. We assume that the rows of X* are iid samples from some d-dimensional distribution 𝒟 with zero mean and (unknown) covariance matrix Σ with bounded condition number. We design several robust algorithms that outperform the state of the art even in the special case of Gaussian noise η ∼ N(0, 1)^n. In particular, we provide a polynomial-time algorithm that with high probability recovers β* up to error O(√ε) as long as n ≥ Õ(k²/ε), only assuming some bounds on the third and the fourth moments of 𝒟. In addition, prior to this work, even in the special case of Gaussian design 𝒟 = N(0, Σ) and noise η ∼ N(0, 1), no polynomial-time algorithm was known to achieve error o(√ε) in the sparse setting n < d². We show that under some assumptions on the fourth and the eighth moments of 𝒟, there is a polynomial-time algorithm that achieves error o(√ε) as long as n ≥ Õ(k⁴/ε³). For the Gaussian distribution 𝒟 = N(0, Σ), this algorithm achieves error O(ε^{3/4}). Moreover, our algorithm achieves error o(√ε) for all log-concave distributions if ε ≲ 1/polylog(d). Our algorithms are based on filtering of the covariates using sum-of-squares relaxations, and on weighted Huber loss minimization with an ℓ1 regularizer. We provide a novel analysis of the weighted penalized Huber loss that is suitable for heavy-tailed designs in the presence of two adversaries. Furthermore, we complement our algorithmic results with Statistical Query lower bounds, providing evidence that our estimators are likely to have nearly optimal sample complexity.

1 Introduction

Linear regression is a fundamental task in statistics, with many applications in data science and machine learning. In ordinary (non-sparse) linear regression, we are given observations y*_1, …, y*_n and X*_1, …, X*_n ∈ R^d such that y*_i = ⟨X*_i, β*⟩ + η_i for some β* ∈ R^d and some noise η ∈ R^n, and the goal is to estimate β*. If η is independent of X* and has iid Gaussian entries η_i ∼ N(0, 1), the classical least squares estimator β̂ with high probability achieves the prediction error (1/√n)‖X*(β̂ − β*)‖ ≤ O(√(d/n)). Note that if d/n → 0, the error is vanishing. Despite the huge dimensions of modern data, many practical applications depend only on a small subset of the explanatory variables, motivating sparse regression, where only k ≪ d explanatory variables are actually important (i.e., β* is k-sparse). In this case we want the error to be small even if we only have n ≪ d samples, and there exists an estimator that achieves prediction error O(√(k log(d)/n)) (for η ∼ N(0, 1)^n).
However, this estimator requires exponential computation time. Moreover, under a standard assumption from computational complexity theory (NP ⊄ P/poly), estimators that can be computed in polynomial time require an assumption on X* called a restricted eigenvalue condition in order to achieve error O(√(k log(d)/n)) (see [ZWJ14] for more details). One efficiently computable estimator that achieves error O(√(k log(d)/n)) under the restricted eigenvalue condition is the Lasso, that is, a minimizer of the quadratic loss with an ℓ1 regularizer (a short code sketch is given at the end of this discussion). In particular, the restricted eigenvalue condition is satisfied for X* with iid rows X*_i ∼ N(0, Σ), where Σ has condition number O(1), as long as n ≳ k log d (with high probability). In what follows we assume that the designs have iid random rows and that the condition number of the covariance matrix is bounded by some constant. In addition, for random designs we use the standard error measure ‖Σ^{1/2}(β̂ − β*)‖. Note that when the number of samples is large enough, this error is very close to (1/√n)‖X*(β̂ − β*)‖.

Recently, there has been extensive interest in linear regression in the presence of adversarially chosen outliers. Under the assumption that X*_i ∼ N(0, Σ) iid, the line of works [TJSO14, BJKK17, SBRJ19, d NS21, d LN+21] studied the case when the noise η is unbounded and chosen by an oblivious adversary, i.e., when η is an arbitrary vector independent of X*. As was shown in [d LN+21], in this case it is possible to achieve the same error (up to a constant factor) as for η ∼ N(0, 1)^n if we only assume that an Ω(1) fraction of the entries of η have magnitude at most 1. They analyzed the Huber loss estimator with an ℓ1 regularizer.

Another line of works [BJK15, DT19, MNW22, Tho23] assumed that η has iid random entries that satisfy some assumptions on the moments, but an adversarially chosen ε-fraction of y*_1, …, y*_n is replaced by arbitrary values by an adaptive adversary that can observe X*, β* and η (so the corruptions can depend on them). [Tho23] showed that for X* with iid sub-Gaussian rows and η with iid sub-Gaussian entries with unit variance, the Huber loss estimator with an ℓ1 regularizer achieves an error of O(√(k log(d)/n) + ε log(1/ε)) with high probability. Note that the second term depends on ε, but not on n; hence, even if we take more samples, this term does not decrease (if ε remains the same). This is inherent: in the presence of adaptive adversarial outliers, even for X*_i ∼ N(0, Id) iid and η ∼ N(0, 1)^n, the information-theoretically optimal error is Ω(√(k log(d)/n) + ε), so independently of the number of samples it is Ω(ε). In algorithmic high-dimensional robust statistics we are interested in estimators that are computable in time poly(d). There is evidence that it is unlikely that poly(d)-time computable estimators can achieve error O(ε) [DKS17]. Furthermore, for other design distributions the optimal error can be different. Hence the natural questions to ask are: Given an error bound f(ε), does there exist a poly(d)-time computable estimator that achieves error at most f(ε) with high probability? If so, what is the smallest number of samples n that is enough to achieve error f(ε) in time poly(d)? In the rest of this section we write error bounds in terms of ε and mention the number of samples required to achieve this error. In addition, we focus on results in the high-dimensional regime, where f(ε) does not depend polynomially on k or d.
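The Lasso baseline mentioned above (ℓ1-penalized least squares) can be summarized by the following minimal sketch of the classical, non-robust setting. The dimensions, the identity-covariance design, and the penalty λ ≈ √(log(d)/n) are illustrative choices, and scikit-learn's `Lasso` is used only as a stand-in solver; none of these choices are prescribed by the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 500, 2000, 10            # sparse regime: n << d
X = rng.standard_normal((n, d))    # identity-covariance Gaussian design

beta_star = np.zeros(d)
beta_star[:k] = 1.0                # k-sparse ground truth
y = X @ beta_star + rng.standard_normal(n)

# Standard choice of penalty for Gaussian noise: lambda ~ sqrt(log(d)/n).
lam = np.sqrt(np.log(d) / n)
lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000)
lasso.fit(X, y)

print("||beta_hat - beta*|| =", np.linalg.norm(lasso.coef_ - beta_star))
```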
Another line of works [BDLS17, LSLC20, PJL20, Sas22, SF23] considered the case when the adaptive adversary is allowed to corrupt 𝜀-fraction of all observed data, i.e. not only 𝑦 1, . . . , 𝑦 𝑛, but also 𝑋 1, . . . , 𝑋 𝑛, while the noise 𝜂is assumed to have iid random entries that satisfy some concentration assumptions. For simplicity, to fix the scale of the noise, we formulate their results assuming that 𝜂 𝑁(0, 1)𝑛. In non-sparse settings, [PJL20] showed that in the case of identity covariance sub-Gaussian designs, Huber loss minimization after a proper filtering of 𝑋 achieves error 𝑂(𝜀) with 𝑛 𝑑/𝜀2 samples. Informally speaking, filtering removes the samples 𝑋 𝑖that look corrupted, and if the distribution of the design is nice enough, then after filtering we can work with (𝑋 , 𝑦 ) just like in the case when only 𝑦 is corrupted. For unknown covariance they showed a bound 𝑂 𝜀 for a large class of distributions of the design. If 𝑋 𝑖 iid 𝑁(0, Σ) for unknown Σ, one can use 𝑛 𝑂 𝑑2/𝜀2 samples to robustly estimate the covariance, and achieve nearly optimal error 𝑂(𝜀) in the case (see [DKS19] for more details). In the sparse setting, there is likely an information-computation gap for the sample complexity of this problem, even in the case of the isotropic Gaussian design 𝑋 𝑖 iid 𝑁(0, Id). While it is informationtheoretically possible to achieve optimal error 𝑂(𝜀) with 𝑛 𝑂(𝑘/𝜀2) samples, achieving any error 𝑜(1) is likely to be not possible for poly(𝑑)-time computable estimators if 𝑛 𝑘2. Formal evidence for this conjecture include reductions from some version of the Planted Clique problem [BB20], as well as a Statistical Query lower bound (Proposition 1.10). For 𝑛 𝑂(𝑘2/𝜀2), several algorithmic results are known to achieve error 𝑂(𝜀), in particular, [BDLS17, LSLC20], and [SF23] for more general isotropic sub-Gaussian designs. Similarly to the approach of [PJL20], [SF23] used (ℓ1-penalized) Huber minimization after filtering 𝑋 . The non-isotropic case (when Σ Id is unknown) is more challenging. [SF23] showed that for sub-Gaussian designs it is possible to achieve error 𝑂 𝜀 with 𝑛 𝑂(𝑘2) samples. [Sas22] showed that 𝑂 𝜀 error with 𝑛 𝑂(𝑘2 + 𝛽 4 1/𝑘2) samples can be achieved under some assumptions on the fourth and the eighth moments of the design distribution. While this result works for a large class of designs, the clear disadvantage is that the sample complexity depends polynomially on the norm of 𝛽 . For example, if all nonzero entries of 𝛽 have the same magnitude and 𝛽 = 𝑑, then the sample complexity is 𝑛> 𝑑2, which is not suitable in the sparse regime. Prior to this work, no poly(𝑑)-time computable estimator that could achieve error 𝑜 𝜀 with unknown Σ was known, even in the case of Gaussian designs 𝑋 𝑖 iid 𝑁(0, Σ) and the Gaussian noise 𝜂 𝑁(0, 1)𝑛(apart from the non-sparse setting, where such estimators require 𝑛> 𝑑2). 1.1 Results We present two main results, both of them follow from a more general statement; see Theorem B.3. Before formally stating the results, we define the model as follows. Definition 1.1 (Robust Sparse Regression with 2 Adversaries). Let 𝑛, 𝑑, 𝑘 N such that 𝑘 𝑑, 𝜎> 0, and 𝜀 (0, 1) is smaller than some sufficiently small absolute constant. Let 𝒟be a probability distribution in R𝑑with mean 0 and covariance Σ. Let 𝑦 = 𝑋 𝛽 + 𝜂, where 𝑋 is an 𝑛 𝑑random matrix with rows 𝑋 𝑖 iid 𝒟, 𝛽 R𝑑is 𝑘-sparse, 𝜂 R𝑛is independent of 𝑋 and has at least 0.01 𝑛 entries bounded by 𝜎in absolute value1. We denote by 𝜅(Σ) the condition number of Σ. 
An instance of our model is a pair (𝑋, 𝑦), where 𝑋 R𝑛 𝑑is a matrix and 𝑦 R𝑛is a vector such that there exists a set 𝑆good [𝑛] of size at least (1 𝜀)𝑛such that for all 𝑖 𝑆good, 𝑋𝑖= 𝑋 𝑖and 𝑦𝑖= 𝑦 𝑖. Note that random noise models studied in prior works are captured by our model in Definition 1.1. For example, if 𝜂has iid entries that satisfy E|𝜂𝑖| 𝜎/2, by Markov s inequality, |𝜂𝑖| 𝜎with probability at least 1/2, and with overwhelming probability, at least 0.01 𝑛entries of 𝜂are bounded by 𝜎in absolute value. In addition, Cauchy noise (that does not have the first moment) with location parameter 0 and scale 𝜎also satisfies these assumptions, as well as other heavy-tailed distributions studied in literature (with appropriate scale parameter 𝜎). 1Our result also works for more general model, where we require 𝛼𝑛entries to be bounded by 𝜎for some 𝛼 𝜀. The error bound in this case also depends on 𝛼. We formulate our results assuming that the condition number of the covariance is bounded by some constant: 𝜅(Σ) 𝑂(1). In the most general formulation (Theorem B.3), we show the dependence2 of the number of samples and the error on 𝜅(Σ). 1.1.1 Robust regression with heavy-tailed designs We use the following notion of boundness of the moments of 𝒟: Definition 1.2. Let 𝑀> 0, 𝑡 2 and 𝑑 N. We say that a probability distribution 𝒟in R𝑑with zero mean and covariance Σ has 𝑀-bounded 𝑡-th moment, if for all 𝑢 R𝑑 E 𝑥 𝒟| 𝑥, 𝑢 |𝑡 1/𝑡 𝑀 p Note that an arbitrary linear transformation of an isotropic distribution with 𝑀-bounded 𝑡-th moment also has 𝑀-bounded 𝑡-th moment. Also note that if 𝑡 𝑡and a distribution 𝒟has 𝑀-bounded 𝑡-th moment, then the 𝑡 -th moment of 𝒟is also 𝑀-bounded. In particular, 𝑀cannot be smaller than 1, since the second moment cannot be 𝑀-bounded for 𝑀< 1. In addition, we will need the following (weaker) notion of the boundness of moments: Definition 1.3. Let 𝜈> 0, 𝑡 2 and 𝑑 N. We say that a probability distribution 𝒟in R𝑑with zero mean and covariance Σ has entrywise 𝜈-bounded 𝑡-th moment, if max 𝑗 [𝑑] E 𝑥 𝒟|𝑥𝑗|𝑡 𝜈𝑡 Σ 𝑡/2 . If a distribution has 𝑀-bounded 𝑡-th moment, then it also has entrywise 𝑀-bounded 𝑡-th moment, but the converse might not be true for some distributions. Now we are ready to state our first result. Theorem 1.4. Let 𝑛, 𝑑, 𝑘, 𝑋, 𝑦, 𝜀, 𝒟, Σ, 𝜎, 𝛽 be as in Definition 1.1. Suppose that 𝜅(Σ) 𝑂(1) and that for some 1 𝑀 𝑂(1) and 1 𝜈 𝑂(1), 𝒟has 𝑀-bounded 3-rd moment and entrywise 𝜈-bounded 4-th moment. There exists an algorithm that, given 𝑋, 𝑦, 𝑘, 𝜀, 𝜎, in time (𝑛+ 𝑑)𝑂(1) outputs ˆ𝛽 R𝑑such that if 𝑛 𝑘2 log(𝑑)/𝜀, then with probability at least 1 𝑑 10, Σ1/2( ˆ𝛽 𝛽 ) 𝑂(𝜎 Let us compare Theorem 1.4 with the state of the art. For heavy-tailed designs, prior to this work, the best estimator was [Sas22]. That estimator also achieves error 𝑂(𝜎 𝜀), but its sample complexity depends polynomially on the norm of 𝛽 , while our sample complexity does not depend on it. In addition, they require the distribution to have bounded 4-th moment (as opposed to our 3-rd moment assumption), and bounded entrywise 8-th moment (as opposed to our entrywise 4-th moment assumption). Finally, our noise assumption is weaker than theirs since they required the entries of 𝜂 to be iid random variables such that E|𝜂𝑖| 𝜎 for some 𝜎 > 0 known to the algorithm designer; as we mentioned after Definition 1.1, it is a special case of the oblivious noise with 𝜎= 2𝜎 . Let us also discuss our assumptions and possibilities of an improvement of our result. 
The third moment assumption can be relaxed, more precisely, it is enough to require the 𝑡-th moment to be bounded, where 𝑡is an arbitrary constant greater than 2, and in this case the sample complexity is increased by a constant factor3; see Theorem B.3 for more details. The entrywise fourth moment assumption is not improvable with our techniques, that is, we get worse dependence on 𝑘if we relax it to, say, the third moment assumption. The dependence of 𝑛on 𝜀is not improvable with our techniques4. The dependence of the error on 𝜎 is optimal. The dependence of 𝑛on 𝑘and the error on 𝜀is likely to be (nearly) optimal: Statistical Query lower bounds (Proposition 1.10 and Proposition 1.11) provide evidence that for 𝜎= Θ(1), it is unlikely that polynomial-time algorithms can achieve error 𝑜(1) if 𝑛 𝑘2, or error 𝑜( 𝜀) if 𝑛 𝑘4. Remark 1.5. Our results also imply bounds on other types of error studied in literature. In particular, observe that ˆ𝛽 𝛽 Σ1/2( ˆ𝛽 𝛽 ) / p 𝜆min(Σ), where 𝜆min(Σ) is the minimal eigenvalue of Σ. 2We did not aim to optimize this dependence. 3This factor depends on 𝑀and 𝜅(Σ), as well as on 𝑡. In particular, it goes to infinity when 𝑡 2. 4Some dependence of 𝑛on 𝜀is inherent, but potentially our dependence could be suboptimal. For subexponential distributions it is possible to get better dependence, see Remark 1.9 and Appendix H. In addition, our estimator also satisfies ˆ𝛽 𝛽 1 𝑂( Σ1/2( ˆ𝛽 𝛽 ) p 𝑘/𝜆min(Σ)). The same is also true for our estimator from Theorem 1.7 below. These relations between different types of errors are standard for sparse regression, and they are not improvable. 1.1.2 Beyond 𝜀error Prior to this work, no polynomial-time algorithm for (non-isotropic) robust sparse regression was known to achieve error 𝑜(𝜎 𝜀), even for Gaussian designs 𝑋 𝑖 iid 𝑁(0, Σ) and Gaussian 𝜂 𝑁(0, 𝜎)𝑛. In this section we show that for a large class of designs, it is possible to achieve error 𝑜(𝜎 𝜀) in polynomial time, even when 𝜂is chosen by an oblivious adversary. For our second result, we require not only some bounds on the moments of 𝒟, but also their certifiability in the sum-of-squares proof system: Definition 1.6. Let 𝑀> 0 and let ℓ 4 be an even number. We say that a probability distribution 𝒟in R𝑑with zero mean and covariance Σ has ℓ-certifiably 𝑀-bounded 4-th moment, if there exist polynomials ℎ1, . . . , ℎ𝑚 R[𝑢1, . . . , 𝑢𝑑] of degree at most ℓ/2 such that E 𝑥 𝒟 𝑥, 𝑢 4 + 𝑖=1 ℎ2 𝑖(𝑢) = 𝑀4 Σ 2 𝑢 4 . Definition 1.6 with arbitrary ℓimplies Definition 1.2 (with the same 𝑀). Under standard complexitytheoretic assumptions, there exist distributions with bounded moments that are not ℓ-certifiably bounded even for very large ℓ[HL19]. Note that similarly to Definition 1.2, an arbitrary linear transformation of an isotropic distribution with ℓ-certifiably 𝑀-bounded 4-th moment also has ℓ-certifiably 𝑀-bounded 4-th moment. Distributions with certifiably bounded moments are very important in algorithmic robust statistics. They were extensively studied in literature, e.g. [KS17a, KS17b, HL18, HL19, DKK+22]. Now we can state our second result. Theorem 1.7. Let 𝑛, 𝑑, 𝑘, 𝑋, 𝑦, 𝜀, 𝒟, Σ, 𝜎, 𝛽 be as in Definition 1.1. Suppose that 𝜅(Σ) 𝑂(1), and that for some 𝑀 1, some even number ℓ 4, and 1 𝜈 𝑂(1), 𝒟has ℓ-certifiably 𝑀-bounded 4-th moment and entrywise 𝜈-bounded 8-th moment. There exists an algorithm that, given 𝑋, 𝑦, 𝑘, 𝜀, 𝜎, 𝑀, ℓ, in time (𝑛+ 𝑑)𝑂(ℓ) outputs ˆ𝛽 R𝑑such that if 𝑛 𝑀4 𝑘4 log(𝑑)/𝜀3 , then with probability at least 1 𝑑 10, Σ1/2( ˆ𝛽 𝛽 ) 𝑂(𝑀 𝜎 𝜀3/4) . 
In particular, in the regime M ≤ O(1), as long as n ≥ Õ(k⁴/ε³), the algorithm recovers β* from (X, y) up to error O(σε^{3/4}) (with high probability). If ℓ ≤ O(1), the algorithm runs in polynomial time. Note that in this theorem, as opposed to Theorem 1.4, we do not assume that M is constant, since for some natural classes of distributions only bounds on M that depend on d are known.

The natural question is which distributions have a certifiably bounded fourth moment with ℓ ≤ O(1). First, these are products of one-dimensional distributions with M-bounded fourth moment, and their linear transformations (with ℓ = 4). Hence, linear transformations of products of one-dimensional distributions with O(1)-bounded 8-th moment satisfy the assumptions of the theorem with M ≤ O(1) and ℓ = 4. Note that such distributions might not even have a 9-th moment. This class also includes Gaussian distributions (since they are linear transformations of N(0, 1)^d, and N(0, 1) has O(1)-bounded 8-th moment). A small empirical illustration of these moment quantities is given at the end of this subsection.

Another important class is the distributions that satisfy a Poincaré inequality. Concretely, these distributions, for some C_P ≥ 1, satisfy Var_{x∼𝒟} g(x) ≤ C_P² ‖Σ‖ E_{x∼𝒟} ‖∇g(x)‖₂² for all continuously differentiable functions g: R^d → R. [KS17a] showed that such distributions have 4-certifiably O(C_P)-bounded fourth moment. We will not discuss the Poincaré inequality further, and instead focus on known results about the classes of distributions that satisfy it. The Kannan-Lovász-Simonovits (KLS) conjecture from convex geometry says that C_P is bounded by some universal constant for all log-concave distributions. Recall that a distribution 𝒟 is called log-concave if for some convex function V: R^d → R, the density of 𝒟 is proportional to e^{−V(x)}. Apart from the Gaussian distribution, examples include uniform distributions over convex bodies, the Wishart distribution and the Dirichlet distribution ([Pré71], see also [KBJ00] for further examples). In recent years there has been big progress towards a proof of the KLS conjecture. [Che21] showed that C_P ≤ d^{o(1)}, and since then the upper bound has been significantly improved further. The best current bound is C_P ≤ O(√(log d)), obtained by [Kla23]. This bound implies that for all log-concave distributions whose covariance has bounded condition number, the error of our estimator is O(σ √(log d) ε^{3/4}). Hence for ε ≤ o(1/log²(d)) and σ ≤ O(1), the error is o(√ε). Note that if the KLS conjecture is true, the error of our estimator is O(σε^{3/4}) for all log-concave distributions with κ(Σ) ≤ O(1), without any restrictions on ε (except the standard requirement that ε is at most a sufficiently small constant).

Remark 1.8. Theorem 1.7 can be generalized as follows: If the (2t)-th moment of 𝒟 is M-bounded for a constant t ∈ N, t ≥ 2, if this bound can be certified by a constant-degree sum-of-squares proof5, and if 𝒟 has entrywise O(1)-bounded (4t)-th moment, then with high probability there is a poly(d)-time computable estimator that achieves error O(Mσε^{1−1/(2t)}) as long as n ≳ M⁴ k^{2t} log(d)/ε^{2t−1}. See Theorem B.3 for more details.

Remark 1.9. The dependence of n on ε can be improved under the assumption that 𝒟 is a sub-exponential distribution. In particular, all log-concave distributions are sub-exponential. Under this additional assumption, in order to achieve error O(σ√ε), it is enough to take n ≳ k² polylog(d) + k log(d)/ε samples, and to achieve error O(Mσε^{3/4}), it is enough to take n ≳ k⁴ polylog(d) + k log(d)/ε^{3/2} samples (assuming, as in Theorem 1.7, that the fourth moment is certifiably M-bounded).
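To complement the discussion of moment assumptions above, here is a small numerical sketch. It estimates, by Monte Carlo over random directions, the directional fourth-moment constant sup_u E⟨x, u⟩⁴ / (‖Σ‖² ‖u‖⁴) that Definition 1.6 requires to be certifiably at most M⁴; it is emphatically not a certificate (certification is what the sum-of-squares machinery of Appendix D provides). The particular distributions (standard Gaussian and a Student-t product), sample sizes, and number of directions are illustrative assumptions.

```python
import numpy as np

def empirical_fourth_moment_ratio(samples, n_dirs=2000, seed=0):
    """Monte Carlo lower bound on sup_u E<x,u>^4 / (||Sigma||_op^2 * ||u||^4),
    the quantity Definition 1.6 asks to be (certifiably) at most M^4."""
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    sigma_op = np.linalg.norm(np.cov(samples, rowvar=False), 2)  # spectral norm
    best = 0.0
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        best = max(best, np.mean((samples @ u) ** 4) / sigma_op ** 2)
    return best

rng = np.random.default_rng(1)
n, d = 20_000, 20
gauss = rng.standard_normal((n, d))
heavy = rng.standard_t(df=9, size=(n, d))  # product distribution with bounded 8th moment
print("Gaussian:    ", empirical_fourth_moment_ratio(gauss))  # close to 3 = E g^4
print("Student-t(9):", empirical_fourth_moment_ratio(heavy))
```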
1.1.3 Lower bounds We provide Statistical Query (SQ) lower bounds by which our estimators likely have optimal sample complexities needed to achieve the errors 𝑂( 𝜀) and 𝑜( 𝜀), even when the design and the noise are Gaussian. SQ lower bounds are usually interpreted as a tradeoff between the time complexity and sample complexity of estimators; see Appendix G and [DKS17] for more details. Our proofs are very similar to prior works [DKS17, DKS19, DKK+22] since as was observed in [DKS19], lower bounds for mean estimation can be used to prove lower bounds for linear regression, and we use the lower bounds for sparse mean estimation from [DKS17, DKK+22]. Let us fix the scale of the noise 𝜎= 1. The first proposition shows that already for Σ = Id, 𝑘2 samples are likely to be necessary to achieve error 𝑜(1): Proposition 1.10 (Informal, see Proposition G.9). Let 𝑛, 𝑑, 𝑘, 𝑋, 𝑦, 𝜀, 𝒟, Σ, 𝜎, 𝛽 be as in Definition 1.1. Suppose that 𝒟= 𝑁(0, Id) and 𝜂 𝑁(0, 𝜎2)𝑛, where 0.99 𝜎 1. Suppose that 𝑑0.01 𝑘 log 𝑑, and 𝑛 𝑘1.99. Then for each SQ algorithm 𝐴that finds ˆ𝛽such that 𝛽 ˆ𝛽 10 5, the simulation of 𝐴with 𝑛samples has to simulate super-polynomial (exp(𝑑Ω(1))) number of queries. Note that under assumptions of Proposition 1.10, Theorem 1.4 implies that if we take 𝑛 𝑘2 polylog(𝑑) samples, the estimator achieves error 𝑂( 𝜀) that is 𝑜(1) if 𝜀 0 as 𝑑 . The second proposition shows that for 1 2 Σ Id, 𝑘4 samples are likely to be necessary to achieve error 𝑜( 𝜀): Proposition 1.11 (Informal, see Proposition G.10). Let 𝑛, 𝑑, 𝑘, 𝑋, 𝑦, 𝜀, 𝒟, Σ, 𝜎, 𝛽 be as in Definition 1.1. Suppose that 𝒟= 𝑁(0, Σ) for some Σ such that 1 2 Σ Id, and 𝜂 𝑁(0, 𝜎2)𝑛, where 0.99 𝜎 1. Suppose that 𝑑0.01 𝑘 𝑑, 𝜀 1 log 𝑑, and 𝑛 𝑘3.99. Then for each SQ algorithm 𝐴that finds ˆ𝛽such that 𝛽 ˆ𝛽 10 5 𝜀, the simulation of 𝐴with 𝑛samples has to simulate super-polynomial (exp(𝑑Ω(1))) number of queries. Note that under assumptions of Proposition 1.11, Theorem 1.7 implies that if we take 𝑛 𝑘4 polylog(𝑑) samples, the estimator achieves error 𝑂(𝜀3/4) that is 𝑜( 𝜀) if 𝜀 0 as 𝑑 . 5See Definition B.2 for formal definition. 2 Techniques Since the problem has multiple aspects, we first illustrate our approach on the simplest example 𝑋 𝑖 iid 𝑁(0, Σ) under the assumption that 0.1 Id Σ 10 Id. Note that already in this case, even for 𝜂 𝑁(0, 1)𝑛, our estimator from Theorem 1.7 outperforms the state of the art. In addition, we assume that 𝜎= 1. Our estimators are based on preprocessing 𝑋, and then minimizing ℓ1-penalized Huber loss. In the Gaussian case, the preprocessing step consists only of filtering, while for heavy-tailed designs, an additional truncation step is required. The idea of using filtering before minimizing the Huber loss first appeared in [PJL20] for the dense settings, and was applied to sparse settings in [Sas22, SF23]. We will not discuss the filtering method in detail, and rather focus on its outcome: It is a set ˆ𝑆 [𝑛] of size at least (1 𝑂(𝜀))𝑛that satisfies some nice properties6. Further, we will see what properties we need from ˆ𝑆, and now let us define the Huber loss estimator. Definition 2.1. For 𝑆 [𝑛], the Huber loss function restricted to 𝑆is defined as 𝑖 𝑆 ℎ( 𝑋𝑖, 𝛽 𝑦𝑖) where ℎ(𝑥𝑖) = ( 1 2 𝑥2 𝑖 if |𝑥𝑖| 2; 2|𝑥𝑖| 2 otherwise. For a penalty parameter 𝜆, the ℓ1-penalized Huber loss restricted to 𝑆is defined as 𝐿𝑆(𝛽) := 𝐻𝑆(𝛽) + 𝜆 𝛽 1. We use the notation 𝜙(𝑥) for the derivative of ℎ(𝑥). Note that for all 𝑥, |𝜙(𝑥)| 2. Our estimator is the minimizer ˆ𝛽ˆ𝑆of 𝐿ˆ𝑆(𝛽), where ˆ𝑆is the set returned by the filtering algorithm. 
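As a concrete illustration of the minimization step just described, here is a minimal sketch using cvxpy. The penalty λ and the toy data are placeholder choices, and in the actual algorithm S_hat would come from the filtering procedure rather than being the full sample; the sketch only shows the ℓ1-penalized Huber loss of Definition 2.1, not the filtering.

```python
import numpy as np
import cvxpy as cp

def penalized_huber(X, y, S_hat, lam):
    """Minimize (1/n) * sum_{i in S_hat} h(<X_i, beta> - y_i) + lam * ||beta||_1,
    where h(x) = x^2 / 2 if |x| <= 2 and 2|x| - 2 otherwise (Definition 2.1)."""
    n, d = X.shape
    beta = cp.Variable(d)
    residuals = X[S_hat] @ beta - y[S_hat]
    # cvxpy's huber(r, M=2) equals r^2 for |r| <= 2 and 4|r| - 4 otherwise,
    # so 0.5 * huber(r, 2) is exactly h(r) from Definition 2.1.
    loss = cp.sum(cp.huber(residuals, M=2)) / (2 * n)
    cp.Problem(cp.Minimize(loss + lam * cp.norm1(beta))).solve()
    return beta.value

# Toy usage with placeholder data and no filtering (S_hat = all samples).
rng = np.random.default_rng(0)
n, d, k = 300, 100, 5
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:k] = 1.0
y = X @ beta_star + rng.standard_normal(n)
beta_hat = penalized_huber(X, y, np.arange(n), lam=np.sqrt(np.log(d) / n))
```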
To investigate the properties of this estimator, it is convenient to work with elastic balls. The 𝑘-elastic ball of radius 𝑟is the following set: ℰ𝑘(𝑟) := {𝑢 R𝑑| 𝑢 𝑟, 𝑢 1 𝑘 𝑟} . Note that this ball contains all 𝑘-sparse vectors with Euclidean norm at most 𝑟(as well as some other vectors). Elastic balls are very useful for sparse regression since if the following two properties hold, 1. Gradient bound: For all 𝑢 ℰ𝑘(𝑟), | 𝐻ˆ𝑆, 𝑢 | 𝑟 𝑘 𝑢 1 + 𝑟 𝑢 , 2. Strong convexity on the boundary: For all 𝑢 ℰ𝑘(𝑟) such that 𝑢 = 𝑟, 𝐻ˆ𝑆(𝛽 + 𝑢) 𝐻ˆ𝑆(𝛽 ) 𝐻ˆ𝑆, 𝑢 Ω 𝑟2 , then for an appropriate choice of the penalty parameter 𝜆, then 𝛽 ˆ𝛽ˆ𝑆 < 𝑟.7 Hence it is enough to show these two properties. In the Gaussian case, the strong convexity property can be proved in exactly the same way as it is done in [d LN+21] for the case of the oblivious adversary, while for heavy-tailed designs it is significantly more challenging. Since we now discuss the Gaussian case, let us focus on the gradient bound. Denote 𝐻 𝑆(𝛽) = 1 𝑛 Í 𝑖 𝑆ℎ 𝑋 𝑖, 𝛽 𝑦 𝑖 . By triangle inequality, | 𝐻ˆ𝑆, 𝑢 | = | 𝐻 𝑆good ˆ𝑆, 𝑢 + 𝐻𝑆bad ˆ𝑆, 𝑢 | | 𝐻 [𝑛], 𝑢 | + | 𝐻 [𝑛]\(𝑆good ˆ𝑆), 𝑢 | + | 𝐻𝑆bad ˆ𝑆, 𝑢 | . Since the first term can be bounded by 𝐻 [𝑛] 𝑢 1, it is enough to show that 𝐻 [𝑛] 𝑟/ where 𝑟is the error we aim to achieve. Note that 𝐻 [𝑛] = 1 𝑛 Í𝑛 𝑖=1 𝜙(𝜂𝑖) 𝑋 𝑖, 𝑢 does not depend on the outliers created by the adaptive adversary. The sharp bound on 𝐻 [𝑛] can be derived in exactly the same way as in [d LN+21] (or other prior works): Since 𝜂and 𝑋 are independent and |𝜙(𝜂)| 2, 𝐻 [𝑛] is a Gaussian vector whose entries have variance (1/𝑛). By standard properties of Gaussian vectors, 𝐻 [𝑛] 𝑂( p log(𝑑)/𝑛) with high probability. 6Technically, the filtering we use returns weights of the samples. For simplicity we assume here that the weights are 0 or 1. 7For simplicity, we omit some details, e.g. we need to work with ℰ𝑘 (𝑟) instead of ℰ𝑘(𝑟), where 𝑘 𝑘. See Theorem A.3 for the formal statement. Similar statements appeared in many prior works on sparse regression. To bound the second and the third term, we can use Cauchy Schwarz inequality and get 𝑂( 𝜀) dependence on the error (like it is done in prior works on robust sparse regression, for example, [Sas22] or [SF23]), or use Hölder s inequality and get better dependence, but also more challenges since we have to work with higher (empirical) moments of 𝑋 and 𝑋. Let us use Hölder sinequality and illustrate how we work with higher moments. Note that both sets [𝑛] \ (𝑆good ˆ𝑆) and 𝑆bad ˆ𝑆 have size at most 𝑂(𝜀𝑛). Hence the second term can be bounded by 𝑖 [𝑛]\(𝑆good ˆ𝑆) 1 𝑛 𝑋 𝑖, 𝑢 4 1/4 𝑂(𝜀3/4) Õ 1 𝑛 𝑋 𝑖, 𝑢 4 1/4 , while the third term is bounded by 1 𝑛 𝑋𝑖, 𝑢 4 1/4 𝑂(𝜀3/4) Õ 1 𝑛 𝑋𝑖, 𝑢 4 1/4 . A careful probabilistic analysis shows that with high probability, for all 𝑟 0 and all 𝑢 ℰ𝑘(𝑟), Í 𝑖 [𝑛] 1 𝑛 𝑋 𝑖, 𝑢 4 𝑂 𝑢 4 . Hence, our requirement on ˆ𝑆is that Í 𝑖 ˆ𝑆 1 𝑛 𝑋𝑖, 𝑢 4 𝑂(1) for all 𝑢 ℰ𝑘(1) (by scaling argument, it is enough to consider 𝑟= 1). If we find such a set ˆ𝑆, we get the desired bound. Indeed, if 𝑛 𝑘log(𝑑)/𝜀3/2, 𝐻 [𝑛] 𝑂(𝜀3/4/ 𝑘), and the other terms are bounded by 𝑂(𝜀3/4), implying that ˆ𝛽 ˆ𝛽ˆ𝑆 < 𝑟= 𝑂(𝜀3/4). Note that such sets of size (1 𝑂(𝜀))𝑛exist since 𝑆good satisfies this property. It is clear how to find such a set inefficiently: we just need to check all candidate sets 𝑆and maximize the quartic function Í 𝑖 𝑆 𝑋𝑖, 𝑢 4 over 𝑢 ℰ𝑘(1). 
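To make the last step concrete, the following numpy sketch checks membership in the elastic ball E_k(r) and evaluates the quartic form (1/n) Σ_{i∈S} ⟨X_i, u⟩⁴ that the set Ŝ must keep bounded. The random search over k-sparse directions only gives a heuristic lower bound on the maximum over E_k(1); it is not a certificate, and the efficient algorithm replaces this maximization by the sum-of-squares relaxation discussed next. All dimensions are illustrative.

```python
import numpy as np

def in_elastic_ball(u, k, r, tol=1e-9):
    """u lies in E_k(r) iff ||u||_2 <= r and ||u||_1 <= sqrt(k) * r."""
    return (np.linalg.norm(u) <= r + tol and
            np.linalg.norm(u, 1) <= np.sqrt(k) * r + tol)

def quartic_form(X, S, u):
    """(1/n) * sum_{i in S} <X_i, u>^4 -- the quantity the filter must control."""
    return np.sum((X[S] @ u) ** 4) / X.shape[0]

rng = np.random.default_rng(0)
n, d, k = 1000, 200, 10
X = rng.standard_normal((n, d))
S = np.arange(n)   # stand-in for a candidate set of (1 - O(eps)) * n samples

# k-sparse unit vectors lie in E_k(1); a crude random search over them gives
# a lower bound on max_{u in E_k(1)} quartic_form (no guarantee of optimality).
best = 0.0
for _ in range(200):
    u = np.zeros(d)
    u[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
    u /= np.linalg.norm(u)
    assert in_elastic_ball(u, k, 1.0)
    best = max(best, quartic_form(X, S, u))
print(f"random-search lower bound on the quartic form: {best:.2f}")
```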
Furthermore, the by-now standard filtering method allows to avoid checking all the sets: If we can maximize Í 𝑖 𝑆 𝑋𝑖, 𝑢 4 over 𝑢 ℰ𝑘(1) efficiently, we can also find the desired set efficiently. Before explaining how we maximize this function, let us see how prior works [BDLS17, SF23], optimized a simpler quadratic function Í 𝑖 𝑆 𝑋𝑖, 𝑢 2 over 𝑢 ℰ𝑘(1). They use the basic SDP relaxation for sparse PCA, that is, they optimize the linear function Í 𝑖 𝑆 𝑋𝑖𝑋 𝑖, 𝑈 over ℬ𝑘:= {𝑈 R𝑑 𝑑| 𝑈 0 , Tr(𝑈) 1 , 𝑈 1 𝑘}. This set has been used in literature for numerous sparse problems since it is a nice (perhaps the best) convex relaxation of the set 𝒮𝑘= {𝑢𝑢 | 𝑢 R𝑑, 𝑢 1 , 𝑢 0 𝑘}. Moreover, crucially for sparse regression, it is easy to see that ℬ𝑘also contains all matrices 𝑢𝑢 such that 𝑢 ℰ𝑘(1). Hence, one may try to optimize quartic functions by using relaxations of 𝒮𝑘= {𝑢 4 | 𝑢 R𝑑, 𝑢 1 , 𝑢 0 𝑘}. A natural relaxation is the sumof-squares with sparsity constraints. [DKK+22] used these relaxations for sparse mean estimation8. They showed that these relaxations provide nice guarantees for distributions with certifiably bounded 4-th moment, assuming that the distribution has sub-exponential tails. Since we now discuss the Gaussian case, the assumption on the tails is satisfied. However, there is no guarantee that these relaxations capture 𝑢 4 for all 𝑢 ℰ𝑘(1). So, for sparse regression, we need another relaxation. We use the sum-of-squares relaxations with elastic constraints. These constraints ensure that the set of relaxations 𝒫𝑘 R𝑑4 is guaranteed to contain 𝑢 4 for all 𝑢 ℰ𝑘(1). We show that if 𝑛 𝑂(𝑘4), there is a degree-𝑂(1) sum-of-squares proof from the elastic constraints of the fact that 1 𝑛 Í 𝑖 [𝑛] 𝑋𝑖, 𝑢 4 𝑂(1). It implies that the relaxation is nice: If 1 𝑛 Í 𝑖 𝑆 𝑋𝑖, 𝑢 4 𝑂(1) for all 𝑢 ℰ𝑘(1), then 1 𝑛 Í 𝑖 𝑆 𝑋 4 𝑖 , 𝑈 𝑂(1) for all 𝑈 𝒫𝑘. Since we can efficiently optimize over 𝒫𝑘, we get an efficiently computable estimator with error 𝑂(𝜀3/4) for Gaussian distributions. Furthermore, if we first use a proper thresholding (that we discuss below), our sum-of-squares proof also works for heavy-tailed distributions, that, apart from the certifiably bounded 4-th moment (that we cannot avoid with the sum-of-squares approach), are only required to have entrywise bounded 8-th moment. Robust sparse regression with heavy-tailed designs is much more challenging. Again, for simplicity assume that 0.1 Id Σ 10 Id and 𝜎= 1. First, there is an issue even without the adversarial noise: 𝐻 [𝑛] can be very large. Even under bounded fourth moment assumption, it can have magnitude 𝑂(𝑑1/4/𝑛), which is too large in the sparse setting. Hence we have to perform an additional thresholding step and remove large entries of 𝑋. Usually thresholding of the design matrix should be 8These relaxations were also used in [d KNS20] in the context of sparse PCA, but they used them in a different way. done very carefully since it breaks the relation between 𝑋and 𝑦. [Sas22] required the thresholding parameter 𝜏to be large enough and depend polynomially on 𝛽 so that this dependence does not break significantly. Since 𝐻 [𝑛] can be as large as 𝑂(𝜏/𝑛), the sample complexity of their estimator also depends polynomially on 𝛽 . Our idea of thresholding is very different, and it plays a significant role in our analysis, especially in the proof of strong convexity. Since we already have to work with outliers chosen by the adaptive adversary, we know that for an 𝜀-fraction of samples, the dependence of 𝑦on 𝑋can already be broken. 
So, if we choose the thresholding parameter τ to be large enough so that with high probability it only affects an ε-fraction of the samples, we can simply treat the samples affected by such thresholding as additional adversarial outliers, and assume that the adaptive adversary corrupted 2εn samples. Note that since 𝒟 is heavy-tailed, each sample X*_i might have entries of magnitude d^{Ω(1)}. However, y* depends only on the inner products ⟨X*_i, β*⟩, and this inner product depends only on the entries of X*_i that correspond to the support of β*. Even though we do not know the support, we can guarantee that for τ ≥ 20√(k/ε), all entries of X*_i from the support of β* are bounded by τ with probability at least 1 − ε/2. Indeed, since the variance of each entry is bounded by 10, Chebyshev's inequality implies that each such entry is smaller than τ with probability at least 1 − ε/(2k), and by a union bound, ⟨X*_i, β*⟩ is not affected by the thresholding with probability at least 1 − ε/2. Hence by a Chernoff bound, with overwhelming probability the number of samples affected by our thresholding is at most εn. Let us denote by 𝒟(τ) the distribution of the rows of X* after thresholding with parameter τ. After the thresholding step, we can assume that X*_i ∼ 𝒟(τ) iid. Note that thresholding can shift the mean, i.e. E X*_i(τ) can be nonzero. It is easy to see that ‖E_{x∼𝒟(τ)} x‖_∞ ≤ O(1/τ). Hence by Bernstein's inequality, ‖∇H*_[n]‖_∞ ≲ √(1/n) + τ/n + 1/τ with high probability9. In particular, in order to get the error bounded by O(ε^{3/4}), we need to take τ ≳ √k/ε^{3/4}, and this affects the sample complexity. Furthermore, our sum-of-squares proof requires that the entrywise deviation ‖(1/n) Σ_{i=1}^n X*_i(τ)^{⊗4} − E X*_1(τ)^{⊗4}‖_max is smaller than 1/k². It can be shown that this quantity is bounded by O(√(1/n) + τ⁴/n + 1/τ⁴) with high probability10. In particular, we need n ≥ Õ(τ⁴k²), so for τ ≈ √k/ε^{3/4} we have to take n ≥ Õ(k⁴/ε³). As was discussed in Remark 1.9, if 𝒟 has sub-exponential tails, we do not have to do the thresholding, and the bounds from [DKK+22] allow us to avoid this dependence of n on ε. Note that due to the SQ lower bound (Proposition 1.11), sample complexity k⁴ is likely to be necessary, even for Gaussian designs. The thresholding step is illustrated by the short code sketch at the end of this discussion.

Finally, let us discuss the strong convexity property. Here we do not assume any properties related to sum-of-squares, and focus on the weak assumptions of Theorem 1.4. First, assume that we need to show strong convexity only for sparse vectors, and not for all u ∈ ℰ_k(r). As was observed in prior works on regression with oblivious outliers, e.g. [d LN+21], ρ(u) := H_Ŝ(β* + u) − H_Ŝ(β*) − ⟨∇H_Ŝ(β*), u⟩ can be lower bounded by (1/(2n)) Σ_{i∈Ŝ} ⟨X_i, u⟩² 1[|⟨X_i, β*⟩ − y_i| ≤ 1] 1[|⟨X_i, u⟩| ≤ 1]. Let C(u) = S_good ∩ Ŝ ∩ A ∩ B(u), where A is the set of samples where |η_i| ≤ 1 and B(u) = {i ∈ [n] | |⟨X_i, u⟩| ≤ 1}. Then ρ(u) ≥ Ω((1/n) Σ_{i∈C(u)} ⟨X*_i, u⟩²). It can be shown that for some suitable r and for each k-sparse u of norm r, C(u) is a large subset of the set A (of size at least 0.99|A|). Note that since A is independent of X*, the rows of X* that correspond to indices from A are just iid samples from 𝒟. If the X*_i were Gaussian, we could have applied concentration bounds and proved strong convexity via a union bound over subsets of size 0.99|A|. In the heavy-tailed case we need a different argument. For a fixed set C of size 0.99|A|, we can use Bernstein's inequality11. We cannot use a union bound over all subsets of size 0.99|A| (there are too many), but fortunately we do not need to, since for each k-sparse u of norm r it is enough to show that (1/n) Σ_{i∈T(u)} ⟨X*_i, u⟩² ≥ Ω(r²), where T(u) ⊆ A is the set of the smallest (in absolute value) 0.99|A| entries of the vector X*_A u ∈ R^{|A|}.
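The thresholding step described above can be summarized by the following sketch (plain numpy). The factor 20√(k/ε) and the accounting of affected rows follow the text; the heavy-tailed design used here (a Student-t product) and all dimensions are illustrative assumptions rather than the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 5000, 1000, 10, 0.05
X_star = rng.standard_t(df=5, size=(n, d))   # heavy-tailed stand-in design
beta_star = np.zeros(d)
beta_star[:k] = 1.0

# Threshold large enough that, per row, every entry on supp(beta*) stays below
# tau with probability >= 1 - eps/2 (Chebyshev plus a union bound over k entries).
tau = 20 * np.sqrt(k / eps)
X_trunc = np.where(np.abs(X_star) <= tau, X_star, 0.0)  # X_ij(tau) = X_ij * 1[|X_ij| <= tau]

# Rows where <X_i, beta*> changed are treated as extra adversarial corruptions;
# with high probability there are at most about eps * n of them.
affected = np.any((np.abs(X_star) > tau)[:, beta_star != 0], axis=1)
print(f"rows affected by thresholding: {affected.sum()} (eps * n = {eps * n:.0f})")
```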
Hence, we can use an epsilon-net argument for the set of 𝑘-sparse vectors 𝑢(of norm 𝑟). This set has very dense nets of 9Here we used the fact that 𝜙(𝜂𝑖) 2. 10[DKLP22] used thresholding for robust sparse mean estimation, and showed a similar bound for secondorder tensors. We generalize it to higher order tensors. 11Using a standard truncation argument. See also Proposition C.1. of [PJL20] for a similar argument in the dense setting. (relatively) small size, and this is enough to show the lower bound Í 𝑖 𝐶(𝑢) 𝑋 𝑖, 𝑢 2 Ω(𝑟2) for all 𝑘-sparse 𝑢of norm 𝑟with high probability, as long as 𝑛 𝑂(𝑘2). In order to show the same bound for all 𝑢 ℰ𝑘(𝑟) of norm 𝑟, we observe that12 if a quadratic form is Θ(𝑟2) on 𝐾-sparse vectors of norm 𝑟for some 𝐾 𝑘, then it is also Θ(𝑟2) on all 𝑢 ℰ𝑘(𝑟), and applying the argument from the previous paragraph to 𝐾-sparse vectors, we get the desired bound. We remark that directly proving it for 𝑢 ℰ𝑘(𝑟) is challenging, since we extensively used the properties of the set of sparse vectors that are not satisfied by ℰ𝑘(𝑟), e.g. the existence of very dense epsilon-nets of small size. 3 Future Work There is an interesting open problem in robust sparse regression that is not captured by our techniques. For sparse mean estimation, in the Gaussian case, there exists a polynomial time algorithm with nearly optimal guarantees: It achieves error 𝑂( 𝜀) with 𝑘4 polylog(𝑑)/𝜀2 samples ([DKK+22]). This algorithm uses a sophisticated sum-of-squares program13. It is reasonable to apply the techniques of [DKK+22] to robust sparse regression in order to achieve nearly optimal error 𝑂( 𝜀) with poly(𝑘) samples. However, simple approaches (e.g. our approach with replacing the sparse constraints by the elastic constraints) fail in this case. Here we provide a high-level explanation of the issue. In order to combine the filtering algorithm with their techniques, we need to check whether the values of a certain quartic form are small on all sparse vectors. The analysis in [DKK+22] shows that this form is indeed small for the uncorrupted sample with high probability (see their Lemma E.2.). Since we want the filtering algorithm to be efficient, we have to use a relaxation of sparse vectors. Hence we need to find a sum-of-squares (or some other nice relaxation) version of the proof from [DKK+22]. However, in their proof they use a covering argument, and it is not clear how to avoid it. This argument fails for reasonable relaxations that we have thought about. Both potential outcomes (either an algorithm or a computational lower bound) are interesting: An algorithm would likely require new sophisticated ideas, and a lower bound would show a significant difference between robust sparse regression and robust mean estimation, while, so far, the complexity pictures of these problems have seemed to be quite similar. Another interesting direction is to get error 𝑜( 𝜀) for distributions that do not necessarily have certifiably bounded moments. As was shown in [HL19], only moment assumptions (without certifiability) are not enough for efficient robust mean estimation, and the same should be true also for linear regression. However, other assumptions on distribution 𝒟can make the problem solvable in polynomial time. For robust mean estimation, some symmetry assumptions are enough even for heavy-tailed distributions without the second moment14 (see [NST23]). 
It is interesting to investigate what assumptions on the design distribution are sufficient for existence of efficiently computable estimators for robust sparse regression. Acknowledgments and Disclosure of Funding Chih-Hung Liu is supported by Ministry of Education, Taiwan under Yushan Fellow Program with the grant number MOE-111-YSFEE-0003-006-P1 and by National Science and Technology Council, Taiwan with the grant number 111-2222-E-002-017-MY2. 12Similar arguments are sometimes used to prove the restricted eigenvalue property of random matrices. 13A similar program for the dense setting was studied in [KMZ22]. 14And, in some sense, even without the first moment, if instead of the mean we estimate the center of symmetry. [BB20] Matthew S. Brennan and Guy Bresler, Reducibility and statistical-computational gaps from secret leakage, Proceedings of the 33rd Annual Conference on Learning Theory (COLT), vol. 125, 2020, pp. 648 847. [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh, Computationally efficient robust sparse estimation in high dimensions, Proceedings of the 30th Conference on Learning Theory COLT 2017, 2017, pp. 169 212. [BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar, Robust regression via hard thresholding, NIPS, 2015, pp. 721 729. [BJKK17] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar, Consistent robust regression, Advances in Neural Information Processing Systems (I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017. [Che21] Yuansi Chen, An almost constant lower bound of the isoperimetric coefficient in the kls conjecture, 2021. [DKK+22] Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, and Thanasis Pittas, Robust sparse mean estimation via sum of squares, Proceedings of the 35th Annual Conference on Learning Theory (COLT22), Proceedings of Machine Learning Research, 2022, pp. 4703 4763. [DKLP22] Ilias Diakonikolas, Daniel Kane, Jasper C. H. Lee, and Ankit Pensia, Outlier-robust sparse mean estimation for heavy-tailed distributions, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, eds.), 2022. [d KNS20] Tommaso d Orsi, Pravesh K Kothari, Gleb Novikov, and David Steurer, Sparse pca: algorithms, adversarial perturbations and certificates, 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2020, pp. 553 564. [DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart, Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures, Proceedings of the IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS17), 2017, pp. 73 84. [DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart, Efficient algorithms and lower bounds for robust linear regression, Proceedings of the 30th Annual Symposium on Discrete Algorithms (SODA19), 2019, pp. 2745 2754. [d LN+21] Tommaso d Orsi, Chih-Hung Liu, Rajai Nasser, Gleb Novikov, David Steurer, and Stefan Tiegel, Consistent estimation for pca and sparse regression with oblivious outliers, Advances in Neural Information Processing Systems 34 (2021), 25427 25438. 
[d NS21] Tommaso d Orsi, Gleb Novikov, and David Steurer, Consistent regression when oblivious outliers overwhelm, Proceedings of the 38th International Conference on Machine Learning, (ICML 2021) (Marina Meila and Tong Zhang, eds.), 2021, pp. 2297 2306. [DT19] Arnak S. Dalalyan and Philip Thompson, Outlier-robust estimation of a sparse linear model using ℓ1-penalized huber s 𝑀-estimator, Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (Neur IPS19), 2019, pp. 13188 13198. [HL18] Samuel B. Hopkins and Jerry Li, Mixture models, robustness, and sum of squares proofs, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, 2018, pp. 1021 1034. [HL19] , How hard is robust mean estimation?, Proceedings of the 32nd Annual Conference on Learning Theory (COLT19), 2019, pp. 1649 1682. [KBJ00] Samuel Kotz, N. Balakrishnan, and Norman L. Johnson, Continuous multivariate distributions: Models and applications, Wiley, 2000. [Kla23] Bo az Klartag, Logarithmic bounds for isoperimetry and slices of convex sets, 2023. [KMZ22] Pravesh K. Kothari, Peter Manohar, and Brian Hu Zhang, Polynomial-time sum-ofsquares can robustly estimate mean and covariance of gaussians optimally, Proceedings of The 33rd International Conference on Algorithmic Learning Theory (Sanjoy Dasgupta and Nika Haghtalab, eds.), Proceedings of Machine Learning Research, vol. 167, PMLR, 29 Mar 01 Apr 2022, pp. 638 667. [KS17a] Pravesh K. Kothari and Jacob Steinhardt, Better agnostic clustering via relaxed tensor norms, Co RR abs/1711.07465 (2017). [KS17b] Pravesh K. Kothari and David Steurer, Outlier-robust moment-estimation via sum-ofsquares, Co RR abs/1711.11581 (2017). [Las01] Jean B. Lasserre, New positive semidefinite relaxations for nonconvex quadratic programs, Advances in convex analysis and global optimization (Pythagorion, 2000), Nonconvex Optim. Appl., vol. 54, Kluwer Acad. Publ., Dordrecht, 2001, pp. 319 331. MR 1846160 [LSLC20] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis, High dimensional robust sparse regression, Proceedings of the 23rd International Conference on Artificial Intelligence and stics AISTATS 2020, 2020, pp. 411 421. [MNW22] Stanislav Minsker, Mohamed Ndaoud, and Lang Wang, Robust and tuning-free sparse linear regression via square-root slope, ar Xiv preprint ar Xiv:2210.16808 (2022). [Nes00] Yurii Nesterov, Squared functional systems and optimization problems, High performance optimization, Appl. Optim., vol. 33, Kluwer Acad. Publ., Dordrecht, 2000, pp. 405 440. MR 1748764 [NST23] Gleb Novikov, David Steurer, and Stefan Tiegel, Robust mean estimation without moments for symmetric distributions, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, eds.), 2023. [Par00] Pablo A Parrilo, Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization, Ph.D. thesis, California Institute of Technology, 2000. [PJL20] Ankit Pensia, Varun S. Jog, and Po-Ling Loh, Robust regression with covariate filtering: Heavy tails and adversarial contamination, Co RR abs/2009.12976 (2020). [Pré71] András Prékopa, Logarithmic concave measures with application to stochastic programming, Acta Scientiarum Mathematicarum (1971), 301 316. 
[RH23] Philippe Rigollet and Jan-Christian Hütter, High-dimensional statistics, 2023. [Sas22] Takeyuki Sasai, Robust and sparse estimation of linear regression coefficients with heavy-tailed noises and covariates, Co RR abs/2206.07594 (2022). [SBRJ19] Arun Sai Suggala, Kush Bhatia, Pradeep Ravikumar, and Prateek Jain, Adaptive hard thresholding for near-optimal consistent robust regression, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA (Alina Beygelzimer and Daniel Hsu, eds.), Proceedings of Machine Learning Research, vol. 99, PMLR, 2019, pp. 2892 2897. [SF23] Takeyuki Sasai and Hironori Fujisawa, Outlier robust and sparse estimation of linear regression coefficients, Co RR abs/2208.11592 (2023). [Sho87] N. Z. Shor, Quadratic optimization problems, Izv. Akad. Nauk SSSR Tekhn. Kibernet. (1987), no. 1, 128 139, 222. MR 939596 [Tho23] Philip Thompson, Outlier-robust sparse/low-rank least-squares regression and robust matrix completion, 2023. [TJSO14] Efthymios Tsakonas, Joakim Jaldén, Nicholas D. Sidiropoulos, and Björn Ottersten, Convergence of the huber regression m-estimate in the presence of dense outliers, IEEE Signal Processing Letters 21 (2014), no. 10, 1211 1214. [Tro15] Joel A. Tropp, An introduction to matrix concentration inequalities, Foundations and Trends in Machine Learning 8 (2015), no. 1-2, 1 230. [ZWJ14] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan, Lower bounds on the performance of polynomial-time algorithms for sparse linear regression, Proceedings of the 27th Annual Conference on Learning Theory (COLT14), 2014, pp. 921 948. A Properties of the Huber loss minimizer Definition A.1. For 𝑤 R𝑛 0, the weighted Huber loss function is defined as 𝑖 [𝑛] 𝑤𝑖ℎ( 𝑋𝑖, 𝛽 𝑦𝑖) where ℎ(𝑥𝑖) = ( 1 2 𝑥2 𝑖 if |𝑥𝑖| 2; 2|𝑥𝑖| 2 otherwise. For a penalty parameter 𝜆, the ℓ1-penalized Huber loss restricted to 𝑆is defined as 𝐿𝑤(𝛽) := 𝐻𝑤(𝛽) + 𝜆 𝛽 1. Lemma A.2. Suppose that 𝑤 R𝑛 0, 𝑢 R𝑑, 𝛾1, 𝛾2, 𝜆> 0 satisfy the following properties: Í𝑛 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 𝛾1 𝑢 1 + 𝛾2 Σ1/2𝑢 , 3. 𝐻𝑤(𝛽 + 𝑢) + 𝜆 𝛽 + 𝑢 1 𝐻𝑤(𝛽 ) + 𝜆 𝛽 1 . Then 𝑢 1 4 p 𝑘/𝜎min + 2𝛾2/𝜆 Σ1/2𝑢 , where 𝜎min is the minimal eigenvalue of Σ. Proof. Let 𝒦= supp(𝛽 ). Note that 𝛽 + 𝑢 1 = 𝛽 + 𝑢𝒦+ 𝑢𝒦 1 𝛽 1 + 𝑢𝒦 1 𝑢𝒦 1 . By the convexity of 𝐻𝑤, 𝐻𝑤(𝛽 + 𝑢) 𝐻𝑤(𝛽 ) 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 𝜆 𝑢 1/2 𝛾2 Σ1/2𝑢 . 0 𝜆 𝛽 + 𝑢 1 𝛽 1 + 𝐻𝑤(𝜂+ 𝜁+ 𝑋𝑢) 𝐻𝑤(𝜂+ 𝜁) 𝜆 𝑢𝒦 1 𝑢𝒦 1 1 2𝜆 𝑢𝒦 1 𝛾2 . 𝜆 𝑢 1 4𝜆 𝑢𝒦 1 + 2𝛾2 Σ1/2𝑢 4𝜆 𝑘 𝑢 + 2𝛾2 Σ1/2𝑢 4𝜆 Σ1/2𝑢 + 2𝛾2 Σ1/2𝑢 . Theorem A.3. Let 𝜌, 𝛾1, 𝛾2 > 0 and where 𝜎min is the minimal eigenvalue of Σ. Let 𝑘 100𝑘/𝜎min. Consider the 𝑘 -elastic ellipsoid of radius 𝑟: ℰ𝑘 (𝑟) = n 𝑢 R𝑑 Σ1/2𝑢 𝑟, 𝑢 1 Suppose that the weights 𝑤 R𝑛are such that the following two properties hold: 1. Gradient bound: For all 𝑢 ℰ𝑘 (𝑟), 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 𝛾1 𝑢 1 + 𝛾2 Σ1/2𝑢 , , 2. Strong convexity on the boundary: For all 𝑢 ℰ𝑘 (𝑟) such that Σ1/2𝑢 = 𝑟, 𝐻𝑤(𝛽 + 𝑢) 𝐻𝑤(𝛽 ) 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 + 𝜌 𝑟2 . 𝜆 2𝛾1 + 𝛾2 r Then the minimizer ˆ𝛽of the weighted penalized Huber loss with penalty 𝜆and weights 𝑤satisfies Σ1/2 ˆ𝛽 𝛽 < 𝑟. Proof. Let ˆ𝑢= ˆ𝛽 𝛽 . If Σ1/2 ˆ𝑢 < 𝑟, we get the desired bound. Otherwise, let 𝑢be the (unique) point in the intersection of ℰ𝑘 (𝑟) and the segment [0, ˆ𝑢] R𝑑. By convexity of the penalized loss, 𝐻𝑤(𝜂+ 𝜁+ 𝑋𝑢) + 𝜆 𝛽 + 𝑢 1 𝐻𝑤(𝜂+ 𝜁) + 𝜆 𝛽 1 , Since 𝑢 ℰ𝑘 (𝑟), either Σ1/2𝑢 = 𝑟, or 𝑢 1 = 𝑘 𝑟. Let us show that the latter is not possible. Since 𝜆 2𝛾1, we can apply Lemma A.2: 𝑘/𝜎min + 2𝛾2/𝜆 𝑟. Cancelling 𝑟and using the bound 𝜆 𝛾2 q 𝑘, we get a contradiction. Hence Σ1/2𝑢 = 𝑟. 
By the strong convexity and the gradient bound, 𝐻𝑤(𝛽 + 𝑢) 𝐻𝑤(𝛽 ) 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 + 𝜌 Σ1/2𝑢 2 2𝜆 𝑢 1 𝛾2 Σ1/2𝑢 2𝜆 𝑢 1 𝛾2𝑟. Note that 𝐻𝑤(𝛽 + 𝑢) 𝐻𝑤(𝛽 ) 𝜆 𝛽 1 𝛽 + 𝑢 1 𝜆 𝑢 1 . By putting the above two inequality together and by Lemma A.2 , we have that 2𝜆 𝑢 1 + 𝛾2𝑟 6𝜆 p 𝑘/𝜎min 𝑟+ 5𝛾2𝑟. Dividing both sides by 𝜌 𝑟, we get a contradiction. Therefore, Σ1/2 ˆ𝑢 < 𝑟. B Heavy-tailed Designs First, we define a bit more general model than Definition 1.1 Definition B.1 (Robust Sparse Regression with 2 Adversaries). Let 𝑛, 𝑑, 𝑘 N such that 𝑘 𝑑, 𝜎> 0, 𝛼 (0, 1] and 𝜀 𝛼. Let 𝒟be a probability distribution in R𝑑with mean 0 and covariance Σ. Let 𝑦 = 𝑋 𝛽 + 𝜂, where 𝑋is an 𝑛 𝑑random matrix with rows 𝑋 𝑖 iid 𝒟, 𝛽 R𝑑is 𝑘-sparse, 𝜂 R𝑛is independent of 𝑋 and has at least 𝛼 𝑛entries bounded by 𝜎in absolute value15. An instance of our model is a pair (𝑋, 𝑦), where 𝑋 R𝑛 𝑑is a matrix and 𝑦 R𝑛is a vector such that there exists a set 𝑆good [𝑛] of size at least (1 𝜀)𝑛such that for all 𝑖 𝑆good, 𝑋𝑖= 𝑋 𝑖and 𝑦𝑖= 𝑦 𝑖. Definition B.2. Let 𝑀> 0, 𝑡 N, and let ℓ 2𝑡be an even number. We say that a probability distribution 𝒟in R𝑑with zero mean and covariance Σ has ℓ-certifiably 𝑀-bounded (2𝑡)-th moment, if there exist polynomials ℎ1, . . . , ℎ𝑚 R[𝑢1, . . . , 𝑢𝑑] of degree at most ℓ/2 such that E 𝑥 𝒟 𝑥, 𝑢 2𝑡+ 𝑖=1 ℎ2 𝑖(𝑢) = 𝑀2𝑡 Σ 𝑡 𝑢 2𝑡. In this section we prove the following theorem Theorem B.3 (Heavy-tailed designs, general formulation). Let 𝑛, 𝑑, 𝑘, 𝑋, 𝑦, 𝜀, 𝒟, Σ, 𝜎, 𝛼be as in Definition B.1, and let 𝛿 (0, 1). Suppose that for some 𝑠> 2, 𝑡 N, 𝑀𝑠, 𝑀2𝑡 1, and even number ℓ 2𝑡, 𝒟has 𝑀𝑠-bounded 𝑠-th moment, and ℓ-certifiably 𝑀2𝑡-bounded (2𝑡)-th moment. In addition, 𝒟has entrywise 𝜈-bounded (4𝑡)-th moment. There exists an algorithm that, given 𝑋, 𝑦, 𝑘, 𝜀, 𝜎, 𝑀2𝑡, ℓ, 𝑡, 𝛿and ˆ𝜎max such that Σ ˆ𝜎max 𝑂( Σ ), in time (𝑛+ 𝑑)𝑂(ℓ) outputs 𝑋 R𝑛 𝑑and weights 𝑤= (𝑤1, . . . , 𝑤𝑛) such that if 𝑛 1010𝑡 𝑀2𝑡 2𝑡 𝜈4𝑡+ 105𝑀𝑠 2𝑠 𝑠 2 𝜅(Σ)4+𝑠/(𝑠 2) + 𝜅(Σ)2𝑡 𝜀2𝑡 1 𝑘2𝑡log(𝑑/𝛿) then with probability at least 1 𝛿, the weighted ℓ1-penalized Huber loss estimator ˆ𝛽𝑤= ˆ𝛽𝑤(𝑋 , 𝑦) with weights 𝑤(as in Definition 2.1) and parameter ℎsatisfies Σ1/2 ˆ𝛽𝑤 𝛽 𝑂 𝜅(Σ) 𝛼 𝜎 𝜀1 1 Let us explain how this result implies Theorem 1.4 and Theorem 1.7. Theorem 1.4 is a special case of Theorem B.3 with 𝑡= 1, 𝑠= 3, ℓ= 2, 𝑀2𝑡= 1, 𝑀𝑠= 𝑀, 𝛼= 0.01. Indeed, we only need to estimate Σ up to a constant factor. We can do it by estimating the variance of the first coordinate of 𝑥 𝒟. Applying median-of-means algorithm16 to the first coordinate, we get an estimator 𝜎2 that is 𝑂(𝜈2 Σ 𝜀)-close to the variance of the first coordinate 𝜎2 1. Note that Σ /𝜅(Σ) 𝜎2 1 Σ . Since in Theorem 1.4 𝜅(Σ) and 𝜈are constants, and 𝜀is sufficiently small, we get that 1 2𝜅(Σ) Σ 𝜎2 max 2 Σ . Hence for a constant 𝐶 𝜅(Σ), ˆ𝜎max = 2𝐶 𝜎2 is the desired estimator of Σ . Similarly, Theorem 1.7 is a special case of Theorem B.3 with 𝑡= 2, 𝑠= 4, 𝑀2𝑡= 𝑀𝑠= 𝑀, 𝛼= 0.01. Σ can be estimated using the procedure described above. Before proving the theorem, note that we can without loss of generality assume that 𝜎= 1. Indeed, since 𝜎is known, we can simply divide 𝑋and 𝑦by it before applying the algorithm. B.1 Truncation We cannot work with 𝑋 directly since it might have very large values, and Bernstein inequality that we use for random vectors concentration would give very bad bounds if we work with 𝑋 . Fortunately, 15Our result also works for more general model, where we require 𝛼𝑛entries to be bounded by 𝜎for some 𝛼 𝜀. The error bound in this case also depends on 𝛼. 16See, for example, Fact 2.1. 
from [DKLP22], where they state the guarantees of the median-of-means algorithm. we can perform truncation. This technique was used in [DKLP22] for sparse mean estimation and in [Sas22] for sparse regression. For 𝜏> 0 let 𝑋 𝑖𝑗(𝜏) = 𝑋 𝑖𝑗1h |𝑋 𝑖𝑗| 𝜏 i. Note that since P h |𝑋 𝑖𝑗| > 𝜏 i Σ /𝜏2, if 𝜏 p then the number of entries 𝑖where 𝑋 𝑖, 𝛽 𝑋 𝑖(𝜏), 𝛽 is at most 𝜀𝑛with probability at least 1 2 𝜀𝑛/10 1 𝛿/10. Hence in the algorithm we assume that the input is 𝑋 (𝜏) instead of 𝑋 , and we treat the entries where 𝑋 𝑖, 𝛽 𝑋 𝑖(𝜏), 𝛽 as corrupted by an adversary. Concretely, further we assume that we are given 𝑋 𝑖(𝜏), 𝑦𝑖, 𝑤𝑖 𝑛 𝑖=1 such that 𝑦= 𝑋 (𝜏)𝛽 + 𝜂+ 𝜁, where 𝑋 (𝜏) R𝑛 𝑑differs from 𝑋 (𝜏) R𝑛 𝑑only in rows from the set 𝑆bad [𝑛] of size at most 𝜀𝑛(where 𝜀 𝜀 𝑂(𝜀)), 𝜁 R𝑛is an 𝜀𝑛-sparse vector such that supp(𝜁) 𝑆bad, 𝛽 R𝑑 a 𝑘-sparse vector, and 𝜂 R𝑛is oblivious noise such that at least 𝛼𝑛entries do not exceed 1 in absolute value. In addition, we define 𝑤 R𝑛 𝑖 [𝑛] 0 𝑤𝑖 1/𝑛, 𝑖=1 𝑤𝑖 (1 𝜀)𝑛 The weights for the Huber loss will be from 𝒲 𝜀. Appendix E will discuss more properties of the truncation. B.2 Gradient Bound Lemma B.4. Let 𝑏, 𝛾1 > 0. Suppose that 𝑤 𝒲 𝜀and 𝑢 R𝑑satisfy Õ 𝑖 [𝑛] 𝑤𝑖 𝑋 𝑖(𝜏), 𝑢 2𝑡 𝑏2𝑡 Σ1/2𝑢 2𝑡 , 𝑋 𝑖(𝜏), 𝑢 2𝑡 𝑏2𝑡 Σ1/2𝑢 2𝑡 , 𝑖 [𝑛] 𝜙(𝜂𝑖)𝑋 𝑖(𝜏) 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 𝛾1 𝑢 1 + 6 𝑏 Σ1/2𝑢 2𝑡 𝜀1 1 Proof. Denote 𝐹(𝑤) = Í 𝑖 [𝑛](1/𝑛 𝑤𝑖)𝜙(𝜂𝑖) 𝑋 𝑖(𝜏), 𝑢 . It is a linear function of 𝑤, so |𝐹(𝑤)| is maximized in one of the vertices of the polytope 𝒲 𝜀. This vertex corresponds to set 𝑆𝑤of size at least (1 𝜀)𝑛. That is, the weights of the entries from 𝑆𝑤are 1/𝑛, and outside of 𝑆𝑤the weights are zero. It follows that 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 𝑖 [𝑛] 𝑤𝑖𝜙(𝜂𝑖) 𝑋 𝑖(𝜏), 𝑢 + 𝑖 𝑆bad 𝑤𝑖𝜙(𝜂𝑖) 𝑋 𝑖(𝜏), 𝑢 + 𝑖 𝑆bad 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋 𝑖(𝜏), 𝑢 (Triangle Inequality) 𝛾1 𝑢 1 + 2 Õ 1 𝑛 𝑋 𝑖(𝜏), 𝑢 + 2 Õ 1 𝑛 𝑋 𝑖(𝜏), 𝑢 + 2 Õ 𝑖 𝑆bad 𝑤𝑖 𝑋 𝑖(𝜏), 𝑢 𝛾1 𝑢 1 + 4 𝜀1 1 1 𝑛 𝑋 𝑖(𝜏), 𝑢 2𝑡ª 1 2𝑡 + 2 𝜀1 1 𝑖 [𝑛] 𝑤𝑖 𝑋 𝑖(𝜏), 𝑢 2𝑡ª 1 2𝑡 (Hölder s inequality) 𝛾1 𝑢 1 + 6 𝜀1 1 2𝑡 𝑏 Σ1/2𝑢 2𝑡 . The following lemma provides a bound on 𝛾1: Lemma B.5. With probability at least 1 𝛿/10, 𝑖 [𝑛] 𝜙(𝜂𝑖)𝑋 𝑖(𝜏) Σ 𝑛log(𝑑/𝛿) + 10𝜏 log(𝑑/𝛿) + 2𝑛 Σ /𝜏. Proof. It follows from Bernstein s inequality Fact I.1 and the fact that 𝑖 [𝑛] |𝜙(𝜂𝑖)| E 𝑋 𝑖(𝜏) 2 E 𝑋 1(𝜏) 2𝑛 Σ /𝜏, where we used Corollary E.4. B.2.1 Strong Convexity Lemma B.6. Suppose that 𝛼 1000 𝜀, 𝜏 1000 𝜈2 Σ 𝑘 / 𝑟 𝜎min and 𝑛 (𝑘 )2 log 𝑑+ 𝑘 log(1/𝛿) 105𝑠/(𝑠 2)𝑀𝑠/(𝑠 2) 𝑠 𝜅(Σ)2+𝑠/(𝑠 2)/𝛼, where 𝑘 = 104 𝑘 p Σ . Then with probability 1 𝛿/10, for all 𝑢 ℰ𝑘 (𝑟) such that Σ1/2𝑢 = 𝑟, 𝐻(𝛽 + 𝑢) 𝐻(𝛽 ) 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋𝑖, 𝑢 Proof. Denote 𝐴good = 𝑆good 𝒜, where 𝒜is a set of entries 𝑖such that |𝜂𝑖| 1. Note that 𝒜is independent of 𝑋 . It follows that 𝐻(𝛽 + 𝑢) 𝐻(𝛽 ) 𝑖=1 𝑤𝑖𝜙(𝜂𝑖+ 𝜁𝑖) 𝑋𝑖, 𝑢 1 𝑖=1 𝑤𝑖 𝑋𝑖, 𝑢 21[|𝜂𝑖+𝜁𝑖| 1]1[| 𝑋𝑖,𝑢 | 1] 𝑖 𝑆good 𝑤𝑖 𝑋 𝑖(𝜏), 𝑢 21[|𝜂𝑖| 1]1[| 𝑋 𝑖(𝜏),𝑢 | 1] 𝑖 𝐴good 𝑤𝑖 𝑋 𝑖(𝜏), 𝑢 21[| 𝑋 𝑖(𝜏),𝑢 | 1] 𝑖 𝐴good 𝑤𝑖 𝑋𝑖, 𝑢 21[| 𝑋𝑖,𝑢 | 1] , where 𝑋𝑖= 1h 𝑋 𝑖(𝜏) 105𝑠/(𝑠 2) 𝑀𝑠/(𝑠 2) 𝑠 Σ 𝑘 i𝑋 𝑖(𝜏). Denote 𝐹(𝑤) = Í 𝑖 𝐴good 𝑤𝑖 𝑋𝑖, 𝑢 21[| 𝑋𝑖,𝑢 | 1]. It is a linear function of 𝑤, so it is maximized in one of the vertices of the polytope 𝒲 𝜀. This vertex corresponds to set 𝑆𝑤of size at least (1 𝜀)𝑛. That is, the weights of the entries from 𝑆𝑤are 1/𝑛, and outside of 𝑆𝑤the weights are zero. Õ 𝑖 𝐴good 𝑤𝑖 𝑋𝑖, 𝑢 21[| 𝑋𝑖,𝑢 | 1] 1 𝑋𝑖, 𝑢 21[| 𝑋𝑖,𝑢 | 1] . Hence we need a lower bound for Í 𝑖 𝐴(𝑢) 𝑋𝑖, 𝑢 2, where 𝐴(𝑢) = 𝐴good 𝑆𝑤 𝑖 [𝑛] | 𝑋𝑖, 𝑢 | 1 . In order to bound Í 𝑖 𝐴(𝑢) 𝑋𝑖, 𝑢 2 for vectors 𝑢from the elastic ball ℰ𝑘 (𝑟), we first show that it is bounded for 𝑘 -sparse vectors 𝑢 for some large enough 𝑘 . 
First we need to show that Í 𝑖 ℐ 𝑋𝑖, 𝑢 2 is well-concentrated for a fixed set ℐ. Concretely, we need the following lemma: Lemma B.7. Suppose that 𝜏 1000 𝑀2𝑡 𝜈2 Σ 𝑘 / 𝑟 𝜎min for some 𝑘 N. Then for a fixed (independent of 𝑋) set ℐof size |ℐ| (𝑘 )2 log 𝑑+ 𝑘 log(1/𝛿) 1010𝑠/(𝑠 2)𝑀2𝑠/(𝑠 2) 𝑠 𝜅(Σ)2+2𝑠/(𝑠 2) and for all 𝑘 -sparse vectors 𝑢 R𝑑such that 𝑟 Σ1/2𝑢 2𝑟, 0.99 Σ1/2𝑢 2 1 |ℐ| Õ 𝑋𝑖, 𝑢 2 1.01 Σ1/2𝑢 2 . with probability at least 1 𝛿. Proof. First let us show that 0.995 E 𝑋 𝑖, 𝑢 2 E 𝑋𝑖, 𝑢 2 1.005 E 𝑋 𝑖, 𝑢 2 . Since for each set 𝒦of size 𝑘 , E 𝑋 𝑖(𝜏) 2 = Í 𝑗 𝒦E 𝑋 𝑖𝑗(𝜏) 2 2 Σ 𝑘 , by Markov s inequality, 𝒦 2 > 1010𝑠/(𝑠 2) 𝑀2𝑠/(𝑠 2) 𝑠 𝜅(Σ)𝑠/(𝑠 2) Σ 𝑘 i 1 1010𝑠/(𝑠 2) 𝑀2𝑠/(𝑠 2) 𝑠 𝜅(Σ)𝑠/(𝑠 2) . Denote 𝐵= 105𝑠/(𝑠 2) 𝑀𝑠/(𝑠 2) 𝑠 𝜅(Σ)𝑠/(2𝑠 4)p Σ 𝑘 . By Hölder s inequality, for all vectors 𝑢 R𝑑with support 𝒦, E 𝑋 𝑖(𝜏), 𝑢 2 = E 𝑋𝑖(𝜏), 𝑢 21[ (𝑋 𝑖(𝜏))𝒦 𝐵] + E 𝑋 𝑖(𝜏), 𝑢 21[ (𝑋 𝑖(𝜏))𝒦 >𝐵] E 𝑋𝑖, 𝑢 2 + 2 E 𝑋 𝑖, 𝑢 21[ (𝑋 𝑖(𝜏))𝒦 >𝐵] + 2 E 𝑋 𝑖(𝜏) 𝑋 𝑖, 𝑢 2 E 𝑋𝑖, 𝑢 2 + 2 E 1[ (𝑋 𝑖(𝜏))𝒦 >𝐵] 1 2 𝑠 E 𝑋 𝑖, 𝑢 𝑠 2 𝑠+ 2𝜈4 Σ 2𝑘 𝑢 2 E 𝑋𝑖, 𝑢 2 + 2 Σ 𝑢 2 1010 𝜅(Σ) + 2𝑟2/106 E 𝑋𝑖, 𝑢 2 + 2𝑟2/1010 + 2𝑟2/106 . where we used Lemma E.1 and the fact that 𝑢 𝑢 1 𝑘 𝑢 𝑢 𝑘 𝑢 2. By Corollary E.5, E 𝑋 𝑖, 𝑢 2 2𝑟2/106 E 𝑋 𝑖(𝜏), 𝑢 2 E 𝑋 𝑖, 𝑢 2 + 2𝑟2/106. Hence 0.995 E 𝑋 𝑖, 𝑢 2 0.999 E 𝑋 𝑖(𝜏), 𝑢 2 E 𝑋𝑖, 𝑢 2 1.001 E 𝑋 𝑖(𝜏), 𝑢 2 1.005 E 𝑋 𝑖, 𝑢 2 . For a fixed set 𝒦of size 𝑘 and for all unit vectors 𝑢 R𝑑with support 𝒦, by Bernstein inequality for covariance Fact I.2, with probability 1 𝛿, 1 |ℐ| Õ 𝑋𝑖, 𝑢 2 E 𝑋𝑖, 𝑢 2 1000 Σ 𝐵2 log(𝑑/𝛿) |ℐ| + 𝐵2 log(𝑑/𝛿) Σ 𝐵2 log(𝑑/𝛿) 𝜎2 min|ℐ| + 𝐵2 log(𝑑/𝛿) In order to make this quantity smaller than 𝑟2/1000, it is sufficient to take |ℐ| 1010𝑠/(𝑠 2)𝑀2𝑠/(𝑠 2) 𝑠 𝜅(Σ)2+𝑠/(𝑠 2) 𝑘 log(𝑑/𝛿). By union bound over all subsets 𝒦of [𝑑] of size 𝑘 , we get the desired bound. Let us bound the size of 𝐴(𝑢). 𝐴good 𝑆𝑤has size at least (𝛼 3 𝜀)𝑛 0.997𝛼𝑛. By Lemma B.7 and Lemma F.2, 𝑋𝑢 2 1.1 𝛼𝑛𝑟2, hence at most 3𝛼𝑛𝑟2/ℎ 0.001𝛼𝑛entries of 𝑋𝑢can be greater than ℎ/2. Therefore, |𝐴(𝑢)| 0.99𝛼𝑛. Let 𝑘 = min 104𝑘 Σ , 𝑑 . Recall that 𝒜a set of entries 𝑖such that |𝜂𝑖| 1, and 𝒜is independent of 𝑋 . By union bound, the result of Lemma B.7 also holds for all sets ℐthat correspond to the bottom 0.99-fraction of entries of vectors 𝑋𝑢 𝒜, where 𝑢 are from an 1/𝑛10 -net 𝒩in the set of all 𝑘 -sparse vectors 𝑢 such that Σ1/2𝑢 = 1.01𝑟. Let 𝑢 be an arbitrary 𝑘 -sparse vector such that Σ1/2𝑢 = 1.01𝑟, and let 𝑢 = 𝑢 + Δ𝑢be the closest vector in the net 𝒩to 𝑢 . It follows that Õ 𝑋𝑖, 𝑢 2 = Õ 𝑋𝑖, 𝑢 + Δ𝑢 2 𝑋𝑖, 𝑢 2 2𝑛3/𝑛10 0.99𝛼𝑛 𝑟2 2𝑛 7 If 𝑘 = 𝑑, we get the desired bound, since we can take 𝑢 = 𝑢. Otherwise, by Lemma B.7, Õ 𝑋𝑖, 𝑢 2 1.1 𝛼𝑛 𝑟2 , and we get the desired bound by Lemma F.2. B.3 Putting everything together First, we truncate the entries of 𝑋and 𝑋 and obtain 𝑋 (𝜏) and 𝑋 (𝜏) using some 𝜏such that where 𝑘 = 106 𝑘 𝜅(Σ). We discuss the choice of 𝜏further in this subsection. Let us denote 𝜏 = 𝜏/ p Then we find the weights 𝑤1, . . . 𝑤𝑛using Algorithm C.1. We will show all the conditions of Theorem A.3 are satisfied if 𝑛 𝐶 1010𝑡 𝑀2𝑡 2𝑡 𝜈4𝑡+ 105𝑀𝑠 2𝑠 𝑠 2 𝜅(Σ)4+𝑠/(𝑠 2) + 𝜅(Σ)2𝑡 𝜀2𝑡 1 𝑘2𝑡log(𝑑/𝛿) for some large enough absolute constant 𝐶and 𝜆= 1000 𝑀2𝑡 p ˆ𝜎max 𝜀1 1/(2𝑡)/ 𝑘 1000 𝑀2𝑡 p 𝜅(Σ) 𝜀1 1/(2𝑡) p First let us show that the assumptions of Lemma B.4 are satisfied with 𝛾1 100 𝑀2𝑡 p Σ 𝜀1 1/(2𝑡)/ 𝑘and 𝛾2 10𝑀2𝑡 p First we bound 𝛾2. Note that if 𝑢 ℰ𝑘 (𝑟) for 𝑘 = 100𝑘/𝜎min, then 𝑢 1 𝑘 𝑢 . Hence if 𝑛 1000 𝜈4𝑡 (𝑘 )𝑡+ (𝜏 )2𝑡 (𝑘 )𝑡 𝑡log(𝑑/𝛿) , then Lemma D.2 implies that for all 𝑢 ℰ𝑘 (𝑟), with probability 1 𝛿/10, 𝑋 𝑖(𝜏), 𝑢 2𝑡 2𝑀2𝑡 p Σ 2𝑡 𝑢 2𝑡 2𝑀2𝑡 p 𝜅(Σ) 2𝑡 Σ1/2𝑢 2𝑡 . 
Lemma D.2 and Lemma C.2 imply that for all 𝑢 ℰ𝑘 (𝑟), with probability 1 𝛿/10, Õ 𝑖 [𝑛] 𝑤𝑖 𝑋 𝑖(𝜏)(𝜏), 𝑢 2𝑡 2𝑀2𝑡 p 𝜅(Σ) 𝑡 Σ1/2𝑢 2𝑡 . Let us bound 𝛾1. By Lemma B.5, if 𝑛 1000 𝑘log(𝑑/𝛿)/𝜀2 1/𝑡+ 𝜏log(𝑑/𝛿) 𝑘 /𝜀1 1/(2𝑡) , then by with probability 1 𝛿/10, 𝛾1 100 𝑀2𝑡 𝜅(Σ) 𝜀1 1/(2𝑡) The strong convexity holds by Lemma B.6 with probability 1 𝛿/10 as long as 𝑛 𝑘2 log(𝑑/𝛿) 105𝑠/(𝑠 2)𝑀𝑠/(𝑠 2) 𝑠 𝜅(Σ)4+𝑠/(𝑠 2)/𝜀, where we used the fact that 𝜀 𝛼and that 𝜏 p 2𝑡satisfies the assumption of that lemma. Therefore, all the conditions of Theorem A.3 are satisfied and we attain the desired bound of 𝜅(Σ) 𝛼 𝜀1 1 2𝑡 stated in Theorem B.3. Now let us discuss the choice of 𝜏. First we can find an estimator ˆ𝜅of 𝜅(Σ) by plugging it into the formula 𝑛= 𝐶 1010𝑡 ˆ𝜅4+𝑠/(𝑠 2) + ˆ𝜅2𝑡 𝜀2𝑡 1 𝑘2𝑡log(𝑑/𝛿). Then we can take 𝜏 = 0.01 𝑛 ˆ𝜅𝑡 𝑘𝑡 𝑡log(𝑑/𝛿) 1/(2𝑡) . Note that if we express 𝑛in terms of ˆ𝜅and plug into the formula for 𝜏 , we get that 𝜏 is an increasing function of ˆ𝜅. Also note that ˆ𝜅 𝜅(Σ). Hence both conditions are satisfied: 𝜏:= ˆ𝜎max 𝜏 is larger than the required lower bound for it, and 𝑛is larger than 10000(𝜏 )2𝑡 (𝑘 )𝑡log(𝑑/𝛿) and 10000𝜏log(𝑑/𝛿) 𝑘 /𝜀1 1/(2𝑡) as required. C Filtering We use the following system of elastic constraints with sparsity parameter 𝐾 1 and variables 𝑣1, . . . , 𝑣𝑑, 𝑠1, . . . , 𝑠𝑑: 𝑖 [𝑑] 𝑠2 𝑖= 1 𝑖 [𝑑] 𝑠𝑖𝑣𝑖 𝑣𝑖 𝑖 [𝑑] 𝑠𝑖𝑣𝑖 𝑣𝑖 Note that the vectors from the elastic ball n 𝑣 R𝑑 𝑣 1 , 𝑣 1 𝐾 o satisfy these constraints with 𝑠𝑖= sign(𝑣𝑖). We will later discuss the corresponding sum-of-squares certificates in Appendix D. Let 𝑎> 0 be such that D 1 𝑛 Í𝑛 𝑖=1 𝑋 𝑖 2𝑡, E𝑣 2𝑡E 𝑎2𝑡. Algorithm C.1 (Filtering algorithm). 1. Assign weights 𝑤1 = . . . 𝑤𝑛= 1/𝑛. 2. Find a degree 2ℓpseudo-expectation E that satisfies 𝒜𝐾and maximizes Í𝑛 𝑖=1 𝑤𝑖𝑋 𝑡 𝑖, E𝑣 𝑡 . 𝑛 Í𝑛 𝑖=1 𝑋 2𝑡 𝑖 , E𝑣 2𝑡 < 10𝑡𝑎2𝑡, stop. 4. Compute 𝜏𝑖= 𝑋 2𝑡 𝑖 , E𝑣 2𝑡 and reweight: 𝑤 𝑖= (1 𝜏𝑖 𝜏 ) 𝑤𝑖. Lemma C.2. If at each step D 1 𝑛 Í𝑛 𝑖=1 𝑋 𝑖 2𝑡, E𝑣 2𝑡E 𝑎2𝑡, then the algorithm terminates in at most 2𝜀𝑛 steps, and the resulting weights satisfy Í𝑛 𝑖=1 𝑤𝑖 1 2𝜀. To prove it, we will use the following lemma: Lemma C.3. Assume that D 1 𝑛 Í𝑛 𝑖=1 𝑋 𝑖 2𝑡, E𝑣 2𝑡E 𝑎2𝑡, 1 𝑛 Í𝑛 𝑖=1 𝑋 2𝑡 𝑖 , E𝑣 2𝑡 10𝑡𝑎2𝑡and Proof of Lemma C.3. Note, that it is enough to show that Õ 𝑖 𝑆𝑔 𝑤𝑖 𝑤 𝑖< Õ 𝑖 𝑆𝑏 𝑤𝑖 𝑤 𝑖. Further, recall that 𝑤 𝑖= 1 𝜏𝑖 𝜏max 𝑤𝑖, so for all 𝑖 [𝑛], 𝑤𝑖 𝑤 𝑖= 1 𝜏max 𝜏𝑖𝑤𝑖. Hence is enough to show that Õ 𝑖 𝑆𝑔 𝜏𝑖𝑤𝑖< Õ Since 𝑆𝑔and 𝑆𝑏partition [𝑛] and 𝑖=1 𝑤𝑖𝑋 2𝑡 𝑖 , E𝑣 2𝑡 + we can prove Í 𝑖 𝑆𝑔𝜏𝑖𝑤𝑖< Í 𝑖 𝑆𝑏𝜏𝑖𝑤𝑖by showing that 𝑖 𝑆𝑔 𝜏𝑖𝑤𝑖 𝑎2𝑡< Í𝑛 𝑖=1 𝑤𝑖𝑋 𝑡 𝑖, E𝑣 2𝑡 Note that Õ 𝑖 𝑆𝑔 𝑤𝑖𝑋 2𝑡 𝑖 , E𝑣 2𝑡 + 𝑋 𝑖 2𝑡, E𝑣 2𝑡 + Proof of Lemma C.2. We will show that the algorithm terminates after at most 2𝜀𝑛 iterations. Assume that it does not terminate after 𝑇= 2𝜀𝑛 iterations. Note that the number of entries of 𝑤 that are equal to 0 increases by at least 1 in every iteration. Hence, after 𝑇iterations we have set at least 𝜀𝑛entries of 𝑤to zero whose index lies in 𝑆𝑔. By assumption that the algorithm did not terminate and Lemma C.3, it holds that a contradiction. Let 𝑇be the index of the last iteration of the algorithm before termination. Then 1 𝑛 𝑤(𝑇) 1 = Õ 1 𝑛 𝑤(𝑇) 𝑖 + Õ 1 𝑛 𝑤(𝑇) 𝑖 < 2 Õ 1 𝑛 𝑤(𝑇) 𝑖 2𝜀. D Sum-of-Squares Certificates We use the standard sum-of-squares machinery, used in numerous prior works, e.g. [KS17a, KS17b, HL18, HL19, d KNS20, DKK+22]. Let 𝑓1, 𝑓2, . . . , 𝑓𝑟and 𝑔be multivariate polynomials in 𝑥. A sum-of-squares proof that the constraints { 𝑓1 0, . . . , 𝑓𝑚 0} imply the constraint {𝑔 0} consists of sum-of-squares polynomials (𝑝𝑆)𝑆 [𝑚] such that 𝑆 [𝑚] 𝑝𝑆 Π𝑖 𝑆𝑓𝑖. 
We say that this proof has degree ℓif for every set 𝑆 [𝑚], the polynomial 𝑝𝑆Π𝑖 𝑆𝑓𝑖has degree at most ℓ. If there is a degree ℓSo S proof that { 𝑓𝑖 0 | 𝑖 𝑟} implies {𝑔 0}, we write: { 𝑓𝑖 0 | 𝑖 𝑟} ℓ{𝑔 0} . We provide degree 2ℓsum-of-squares proofs from the system 𝒜𝐾(see below) of (𝑛+ 𝑑)𝑂(1) constraints. The sum-of-squares algorithm (that appeared it [Sho87, Par00, Nes00, Las01]. See, e.g., Theorem 2.6. [DKK+22] for the precise formulation) returns a linear functional E : R[𝑥] 2ℓ R, that is called a degree 2ℓpseudo-expectation, that satisfies the constraints of 𝒜𝐾in time (𝑛+ 𝑑)𝑂(ℓ). In particular, it means that once we prove in sum-of-squares of degree 2ℓthat constraints 𝒜𝐾imply that some polynomial 𝑔(𝑢) is non-negative, the value of the E returned by the algorithm on 𝑔(𝑢) is also non-negative. Recall the system 𝒜𝐾of elastic constraints in Equation (C.1) as follows: 𝑖 [𝑑] 𝑠2 𝑖= 1 𝑖 [𝑑] 𝑠𝑖𝑣𝑖 𝑣𝑖 𝑖 [𝑑] 𝑠𝑖𝑣𝑖 𝑣𝑖 Also recall that the vectors from the elastic ball n 𝑣 R𝑑 𝑣 1 , 𝑣 1 𝐾 o satisfy these constraints with 𝑠𝑖= sign(𝑣𝑖). The following lemma is similar to Lemma 3.4 from [DKK+22], but we prove it in using the elastic constraints. The derivation from the elastic constraints requires a bit more work. Lemma D.1. For arbitrary polynomial 𝑝(𝑣) = Í 1 𝑖1,...,𝑖𝑡 𝑑𝑝𝑖1...𝑖𝑡 𝑣𝑖1 𝑣𝑖𝑡of degree at most 𝑡we have 𝒜𝐾 4𝑡 𝑠,𝑣n (𝑝(𝑣))2 𝑝 2 𝐾𝑡o . Proof. Observe that 𝒜𝐾 2 𝑠,𝑣𝑠𝑖𝑣𝑖 0, hence 𝒜𝐾 4𝑡 𝑠,𝑣 Í𝑑 𝑖=1 𝑠𝑖𝑣𝑖 2𝑡 𝐾𝑡. In addition, by note that, 𝑣𝑖1 𝑣𝑖𝑡 𝑠𝑖1𝑣𝑖1 𝑠𝑖𝑡𝑣𝑖𝑡. It follows that 1 𝑖1,...,𝑖𝑡 𝑑 𝑝𝑖1...𝑖𝑡 𝑣𝑖1 𝑣𝑖𝑡 Õ 1 𝑖1,...,𝑖𝑡 𝑑 |𝑝𝑖1,...𝑖𝑡|𝑠𝑖1𝑣𝑖1 𝑠𝑖𝑡𝑣𝑖𝑡 1 𝑖1,...,𝑖𝑡 𝑑 𝑝𝑖1...𝑖𝑡 𝑣𝑖1 𝑣𝑖𝑡 Õ 1 𝑖1,...,𝑖𝑡 𝑑 |𝑝𝑖1,...𝑖𝑡|𝑠𝑖1𝑣𝑖1 𝑠𝑖𝑡𝑣𝑖𝑡 1 𝑖1,...,𝑖𝑡 𝑑 𝑝𝑖1...𝑖𝑡 𝑣𝑖1 𝑣𝑖𝑡 1 𝑖1,...,𝑖𝑡 𝑑 |𝑝𝑖1,...𝑖𝑡|𝑠𝑖1𝑣𝑖1 𝑠𝑖𝑡𝑣𝑖𝑡 1 𝑖1,...,𝑖𝑡 𝑑 𝑝𝑖1...𝑖𝑡 𝑣𝑖1 𝑣𝑖𝑡 1 𝑖1,...,𝑖𝑡 𝑑 𝑝𝑖1...𝑖𝑡 𝑣𝑖1 𝑣𝑖𝑡 The following lemma shows that we can certify an upper bound on the value of the empirical moments (as polylinear functions) of truncated distribution 𝑍𝑖(𝜏) on the vectors from the elastic ball. Lemma D.2 (Certifiable bound on empirical moments). Suppose that for some 𝑡, ℓ N and 𝑀2𝑡 1, 𝒜𝐾 ℓ 𝑠,𝑣 E 𝑋 1, 𝑣 2𝑡 𝑀2𝑡 2𝑡 Σ 𝑡 , and for some 𝜈 1 max 𝑗 [𝑑] E|𝑋 1𝑗|4𝑡 𝜈4𝑡 Σ 2𝑡. 𝐾𝑡 𝑡log(𝑑/𝛿) , then with probability at least 1 𝛿, for each degree 2ℓpseudo-expectation E that satisfies 𝒜𝐾, 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 # (2𝑀2𝑡)2𝑡 Σ 𝑡. Proof. Consider the polynomial 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 E 𝑋 1, 𝑣 2𝑡, . By Lemma E.6 and the assumptions on 𝑛and 𝜏, its coefficients are bounded by 𝜈4𝑡 Σ 2𝑡 𝑡log(𝑑/𝛿) 𝑛 + 20𝜏2𝑡 𝑡log(𝑑/𝛿) 𝑛 + 2𝑡𝜈4𝑡 Σ 2𝑡 𝜏2𝑡 2𝑡𝑀2𝑡 2𝑡 Σ 𝑡 It follows that 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 !2 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 E 𝑋 1, 𝑣 2𝑡+ E 𝑋 1, 𝑣 2𝑡 !2 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 !2 2 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 E 𝑋 1, 𝑣 2𝑡 !2 + 2 E 𝑋 1, 𝑣 2𝑡 2 𝑖=1 𝑋 𝑖(𝜏), 𝑣 2𝑡 !2 2Δ2𝐾2𝑡+ 2𝑀4𝑡 2𝑡 Σ 2𝑡 𝑖=1 𝑋𝑖, 𝑣 2𝑡 !2 (2𝑀2𝑡)4𝑡 Σ 2𝑡. By Cauchy-Schwarz inequality for pseudo-expectations (see, for example, Fact A.2. from [DKK+22]) we get the desired bound. E Properties of the truncation As before, let 𝑋 1, . . . , 𝑋 𝑛be iid samples from 𝒟. For 𝜏> 0, let 𝑋 𝑖𝑗(𝜏) = 𝑋 𝑖𝑗1h |𝑋 𝑖𝑗| 𝜏 i. In this section we prove some properties of 𝑋 𝑖𝑗(𝜏) that we use in the paper. We start with the following lemma. Lemma E.1. Suppose that for some 𝜈 1, max 𝑗 [𝑑] E|𝑋 𝑖𝑗|4 𝜈4 Σ 2 . Then E 𝑋 𝑖(𝜏) 𝑋 𝑖 𝑋 𝑖(𝜏) 𝑋 𝑖 𝜈4 Σ 2 E 𝑋 𝑖𝑗(𝜏) 𝑋 𝑖𝑗 𝑋 𝑖𝑗 (𝜏) 𝑋 𝑖𝑗 = E 𝑋 𝑖𝑗1h |𝑋 𝑖𝑗|>𝜏 i𝑋 𝑖𝑗 1h |𝑋 𝑖𝑗 |>𝜏 i E 1h |𝑋 𝑖𝑗|>𝜏 i 𝑋 𝑖𝑗 2 E 1h |𝑋 𝑖𝑗 |>𝜏 i 𝑋 𝑖𝑗 2 E 1h |𝑋 𝑖𝑗|>𝜏 i E 𝑋 𝑖𝑗 4 1/4 E 1h |𝑋 𝑖𝑗 |>𝜏 i E 𝑋 𝑖𝑗 4 1/4 P h |𝑋 𝑖𝑗|4 > 𝜏4i P h |𝑋 𝑖𝑗 |4 > 𝜏4i 1/4 𝜈2 Σ 𝜈4 Σ 2/𝜏2 . The following lemma shows that the moments of the truncated distribution are close to the moments of 𝑋 𝑖in ℓ -norm. Lemma E.2. 
Let 𝑡 N and suppose that for some 𝐵> 0 and 𝑞> 0, max 𝑗 [𝑑] E|𝑋 𝑖𝑗|𝑡+𝑞 𝐵𝑡+𝑞. Then E 𝑋 𝑖(𝜏) 𝑡 E 𝑋 𝑖 𝑡 𝑡 𝐵𝑡+𝑞 Proof. Denote 𝑎= 𝑋 𝑖(𝜏), 𝑏= 𝑋 𝑖. Note that by Hölder s inequality, for all 𝑠 [𝑡], E|𝑏𝑗1 𝑏𝑗𝑠 1| |𝑎𝑗𝑠 𝑏𝑗𝑠| |𝑎𝑗𝑠+1 𝑎𝑗𝑡| = E|𝑏𝑗1 𝑏𝑗𝑠 1𝑏𝑗𝑠𝑎𝑗𝑠+1 𝑎𝑗𝑡| 1[𝑎𝑗𝑠=0] P 𝑎𝑗𝑠= 0 𝑞 𝑡+𝑞 E|𝑏𝑗1 𝑏𝑗𝑠 1𝑏𝑗𝑠𝑎𝑗𝑠+1 𝑎𝑗𝑡|1+𝑞/𝑡 𝑡 𝑡+𝑞 P h (𝑋 𝑖𝑗𝑠)𝑡+𝑞> 𝜏𝑡+𝑞i 𝑞 𝑡+𝑞 E|𝑏𝑗1 𝑏𝑗𝑡|1+𝑞/𝑡 𝑡 𝑡+𝑞 𝜏𝑞 max 𝑗 [𝑑] E|𝑏𝑗|𝑡+𝑞 𝑡 𝑡+𝑞 It follows that E 𝑎𝑗1𝑎𝑗2 𝑎𝑗𝑡 E 𝑏𝑗1𝑏𝑗2 𝑏𝑗𝑡 E 𝑎𝑗1𝑎𝑗2 𝑎𝑗𝑡 𝑏𝑗1𝑏𝑗2 𝑏𝑗𝑡 E 𝑎𝑗1𝑎𝑗2 𝑎𝑗𝑡 𝑏𝑗1𝑎𝑗2 𝑎𝑗𝑡+ 𝑏𝑗1𝑎𝑗2 𝑎𝑗𝑡 𝑏𝑗1𝑏𝑗2 𝑏𝑗𝑡 E|𝑎𝑗1 𝑏𝑗1| |𝑎𝑗2 𝑎𝑗𝑡| + E|𝑏𝑗1| |𝑎𝑗2 𝑎𝑗𝑡 𝑏𝑗2 𝑏𝑗𝑡| The following statement is a straightforward corollary of Lemma E.2 with 𝑞= 𝑡: Corollary E.3. Let 𝑡 N and suppose that for some 𝐵> 0, max 𝑗 [𝑑] E|𝑋 𝑖𝑗|2𝑡 𝐵2𝑡. Then E 𝑋 𝑖(𝜏) 𝑡 E 𝑋 𝑖 𝑡 𝑡 𝐵2𝑡 The following two statements are special cases of Corollary E.3 for 𝑡= 1 and 𝑡= 2. Corollary E.4. Corollary E.5. Suppose that for some 𝜈 1, max 𝑗 [𝑑] E|𝑋 𝑖𝑗|4 𝜈4 Σ 2 . Then E 𝑋 𝑖(𝜏) 𝑋 𝑖(𝜏) E 𝑋 𝑖 𝑋 𝑖 2 𝜈4 Σ 2 The following lemma shows that the empirical mean of 𝑋 𝑖(𝜏) 𝑡is close to E 𝑋 1 𝑡for an appropriate choice of 𝜏and large enough 𝑛. Lemma E.6. Let 𝑡 N be and suppose that for some 𝜈 1 max 𝑗 [𝑑] E|𝑋 𝑖𝑗|2𝑡 𝜈2𝑡 Σ 𝑡. Then with probability 1 𝛿, 1 𝑛 𝑋 𝑖(𝜏) 𝑡 E 𝑋 1 𝑡 10 𝜈2𝑡 Σ 𝑡 𝑡log(𝑑/𝛿) 𝑛 + 10𝜏𝑡 𝑡log(𝑑/𝛿) 𝑛 + 𝑡 𝜈2𝑡 Σ 𝑡 Proof. It follows from Corollary E.3, Bernstein inequality Fact I.1, and a union bound over all 𝑑𝑡 entries of E 𝑋 1 𝑡. F Properties of sparse vectors Lemma F.1. Let Σ R𝑑 𝑑be a positive definite matrix, 𝑘 , 𝑘 N, 𝑟, 𝛿 0, and ℰ𝑘 (𝑟) = n 𝑢 R𝑑 Σ1/2𝑢 𝑟, 𝑢 1 𝒮𝑘 (𝑟) = n 𝑢 R𝑑 Σ1/2𝑢 = (1 + 𝛿) 𝑟, 𝑢is 𝑘 -sparse o . If 𝑘 4𝑘 Σ /𝛿2, then ℰ𝑘 (𝑟) conv(𝒮𝑘 (𝑟)) . Proof. Let us take some 𝑢 ℰ𝑘 (𝑟). Without loss of generality assume that 𝑢1 𝑢2 . . . 𝑢𝑑. Let s split indices {1, 2, . . . , 𝑑} into blocks 𝐵1 . . . , 𝐵 𝑑/𝑘 of size 𝑘 (the last block might be of smaller size). Let for each block 𝐵𝑖, let 𝑝𝑖= Σ1/2𝑢𝐵𝑖 Í 𝑑/𝑘 𝑗=1 Σ1/2𝑢𝐵𝑗 Since Í 𝑑/𝑘 𝑖 𝑝𝑖= 1 and 𝑢= Í 𝑑/𝑘 𝑖 𝑝𝑖𝑢𝐵𝑖/𝑝𝑖, it is sufficient to show that for all 𝑖, Σ1/2𝑢𝐵𝑖 /𝑝𝑖 (1 + 𝛿)𝑟. Note that for all 𝑗 2, since 𝑢𝐵𝑗 𝑘 𝑢𝐵𝑗 and 𝑢𝐵𝑗 1 𝑘 𝑢𝐵𝑗 1 1, 𝑘 𝑢𝐵𝑗 1 1 . By the triangle inequality, Σ1/2𝑢𝐵1 Σ1/2𝑢 + 𝑗=2 Σ1/2𝑢𝐵𝑗 . 𝑗=1 Σ1/2𝑢𝐵𝑗 𝑗=2 Σ1/2𝑢𝐵𝑗 𝑗=2 𝑢𝐵𝑗 1 1 Lemma F.2. Let Σ R𝑑 𝑑be a positive definite matrix, and let 𝑋 R𝑚 𝑑be a matrix such that for some 𝑟> 0 and 𝛿 (0, 1), for all 𝑘 -sparse vectors 𝑢 such that 𝑟 Σ1/2𝑢 2𝑟, (1 𝛿) Σ1/2𝑢 1 𝑚 𝑋𝑢 (1 + 𝛿) Σ1/2𝑢 If 𝑘 4𝑘 Σ /𝛿2, then for all 𝑢such that Σ1/2𝑢 = 𝑟and 𝑢 1 𝑟 (1 4𝛿) 𝑟 1 𝑚 𝑋𝑢 (1 + 4𝛿) 𝑟. Proof. The inequlity 1 𝑚 𝑋𝑢 (1 + 𝛿)2 𝑟 (1 + 4𝛿) 𝑟follows from Jensen s inequality and Lemma F.1. Let us show that (1 4𝛿) 𝑟 1 𝑚 𝑋𝑢 . Let 𝐵1, . . . , 𝐵 𝑑/𝑘 be blocks of indices as in the proof of Lemma F.1. It follows that 1 𝑚 𝑋𝑢 1 𝑚 𝑋𝑢𝐵1 (1 𝛿) Σ1/2𝑢𝐵1 (1 + 𝛿) 𝑗=2 Σ1/2𝑢𝐵𝑗 (1 𝛿) Σ1/2𝑢 2 𝑟 (1 𝛿)2 𝑟 𝛿𝑟 (1 4𝛿) 𝑟. G Lower bounds In this section we prove Statistical Query lower bounds. SQ lower bounds is a standard tool of showing computational lower bounds for statistical estimation and decision problems. SQ algorithms do not use samples, but have access to an oracle that can return the expectation of any bounded function (up to a desired additive error, called tolerance). The SQ lower bounds formally show the trateoff between the number of queries to the oracle and the tolerance. The standard interpretation of SQ lower bounds relies on the fact that simulating a query with small tolerance using iid samples requires large number of samples. Hence these lower bounds are interpreted as a tradeoff between the time complexity (number of queries) and sample complexity (tolerance) of estimators. 
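The following short Python snippet is a purely illustrative demonstration of this standard interpretation (the query f below is an arbitrary bounded function, not one used in the lower bound): answering a single statistical query within tolerance τ via an empirical mean requires on the order of 1/τ² iid samples.

```python
import numpy as np

# Simulating a STAT(tau) oracle with iid samples: the empirical mean of a bounded
# query f deviates from E f(X) by roughly n^{-1/2}, so tolerance tau needs n ~ 1/tau^2.
rng = np.random.default_rng(0)

def simulate_query(f, samples):
    # Answer a statistical query f: R^d -> [-1, 1] by its empirical mean.
    return np.mean(f(samples))

d, tau = 10, 0.01
f = lambda X: np.tanh(X[:, 0])        # an arbitrary bounded query
true_value = 0.0                      # E tanh(X_1) = 0 for X ~ N(0, Id), by symmetry
for n in [100, 10_000, 1_000_000]:    # note that 1/tau^2 = 10^4
    X = rng.standard_normal((n, d))
    err = abs(simulate_query(f, X) - true_value)
    print(f"n = {n:>9}: |empirical mean - E f(X)| = {err:.4f}  (target tolerance tau = {tau})")
```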
See [DKS17] for more details. First we give the necessary definitions. These definitions are standard and can be found in [DKS17].

Definition G.1 (STAT Oracle). Let 𝒟 be a distribution over R^d. A statistical query is a function f : R^d → [−1, 1]. For τ > 0, the STAT(τ) oracle responds to the query f with a value v such that |v − E_{X∼𝒟} f(X)| ≤ τ. The parameter τ is called the tolerance of the statistical query.

Simulating a query STAT(τ) normally requires Ω(1/τ²) iid samples from 𝒟, hence SQ lower bounds provide a trade-off between the running time (number of queries) and the sample complexity (Ω(1/τ²)).

Definition G.2 (Pairwise Correlation). Let 𝒟1, 𝒟2, 𝒟 be absolutely continuous distributions over R^d, and suppose that supp(𝒟) = R^d. The pairwise correlation of 𝒟1 and 𝒟2 with respect to 𝒟 is defined as χ_𝒟(𝒟1, 𝒟2) = ∫_{R^d} p_{𝒟1}(x) p_{𝒟2}(x) / p_𝒟(x) dx − 1, where p_{𝒟1}, p_{𝒟2}, p_𝒟 are the densities of 𝒟1, 𝒟2, 𝒟, respectively.

Definition G.3 (Chi-Squared Divergence). Let 𝒟′, 𝒟 be absolutely continuous distributions over R^d, and suppose that supp(𝒟) = R^d. The chi-squared divergence from 𝒟′ to 𝒟 is χ²(𝒟′, 𝒟) = χ_𝒟(𝒟′, 𝒟′).

Definition G.4 ((γ, ρ)-correlation). Let ρ, γ > 0, and let 𝒟 be a distribution over R^d. We say that a family of distributions ℱ over R^d is (γ, ρ)-correlated relative to 𝒟 if for all distinct 𝒟′, 𝒟″ ∈ ℱ, |χ_𝒟(𝒟′, 𝒟″)| ≤ γ and |χ_𝒟(𝒟′, 𝒟′)| ≤ ρ.

Fact G.5. Let 𝒟 be a distribution over R^d and let ℱ be a family of distributions over R^d that does not contain 𝒟, and consider the hypothesis testing problem of determining whether a given distribution 𝒟′ satisfies 𝒟′ = 𝒟 or 𝒟′ ∈ ℱ. Let γ, ρ > 0 and s ∈ N, and suppose that there exists a subfamily of ℱ of size s that is (γ, ρ)-correlated relative to 𝒟. Then for all γ′ > 0, every SQ algorithm for the hypothesis testing problem either makes a query of tolerance smaller than √(γ + γ′) or makes at least sγ′/(ρ − γ) queries.

We will also need the following facts:

Fact G.6 ([DKS17, Lemma 6.7]). Let c ∈ (0, 1) and k, d ∈ N be such that k ≤ d. There exists a set 𝒱 ⊆ R^d of k-sparse unit vectors of size at least d^{ck^c/8} such that for all distinct u, v ∈ 𝒱, ⟨v, u⟩ ≤ 2k^{c−1}.

Fact G.7 ([DKS17, Lemma 3.4]). Let m ∈ N, and suppose that a distribution ℳ over R matches the first m moments of N(0, 1). For a unit vector v ∈ R^d, let 𝒫_v be the distribution whose projection onto the direction of v has distribution ℳ, whose projection onto the orthogonal complement of v is N(0, Id_{d−1}), and such that these projections are independent. Then for all u, v ∈ R^d, χ_{N(0, Id_d)}(𝒫_v, 𝒫_u) ≤ |⟨u, v⟩|^{m+1} χ²(ℳ, N(0, 1)).

The following fact is a slight reformulation of Lemma E.4 from [DKS19].

Fact G.8 ([DKS19, Lemma E.4]). Let y ∼ N(0, 1), μ0 > 0, m ∈ N, and g : R → R. Let (ℳ_μ)_{μ∈R} be a family of distributions over R that satisfies the following properties:
1. ℳ_μ = (1 − ε_μ)N(μ, Θ(1)) + ε_μ ℬ_μ for some ε_μ and ℬ_μ such that ℳ_μ has the same first m moments as N(0, 1).
2. If |μ| ≥ 10μ0, then ε_μ/(1 − ε_μ) ≤ O(μ²) and χ²(ℳ_μ, N(0, 1)) ≤ e^{O(max{1/μ², μ²})}.
3. If |μ| ≤ 10μ0, then ε_μ = ε and χ²(ℳ_μ, N(0, 1)) ≤ g(ε).
For a unit vector v ∈ R^d, let 𝒫_{v,μ} be defined as 𝒫_v in Fact G.7, with the projection onto v distributed as ℳ_μ. Let 𝒬_v be the distribution over R^{d+1} such that (X, y) ∼ 𝒬_v satisfies y ∼ N(0, 1) and X | y ∼ 𝒫_{v, μ0·y}. Then for all unit u, v ∈ R^d, χ_𝒟(𝒬_v, 𝒬_u) ≤ (g(ε) + O(1)) |⟨v, u⟩|^{m+1}, where 𝒟 = N(0, Id_{d+1}).

Proposition G.9 (Formal version of Proposition 1.10). Let k, d ∈ N with k ≤ d, ε ∈ (0, 1/2), and c ∈ (0, 1). For a vector β* ∈ R^d and a number σ > 0, consider the distribution 𝒢(β*, σ) over R^{d+1} such that (X, y) ∼ 𝒢(β*, σ) satisfies X ∼ N(0, Id) and y = ⟨X, β*⟩ + η, where η ∼ N(0, σ²) is independent of X.
There exist a set ℬ R𝑑of 𝑘-sparse vectors, 0.99 𝜎 1 and a distribution 𝒬over R𝑑+1, such that if an SQ algorithm 𝒜given access to a mixture (1 𝜀)𝒢(𝛽 , Σ, 𝜎) + 𝜀𝒬for 𝛽 ℬ, outputs ˆ𝛽 such that 𝛽 ˆ𝛽 10 5, then 𝒜either makes 𝑑𝑐𝑘𝑐/8 𝑘 2+2𝑐queries, or makes at least one query with tolerance smaller than 𝑘 1+𝑐𝑒𝑂(1/𝜀2). Proof. Note that (𝑋, 𝑦) 𝒢(𝛽 , 𝜎) satisfy 𝑦 𝑁(0, 𝜎2 𝑦), where 𝜎2 𝑦= 𝛽 2 + 𝜎2 and 𝑋|𝑦 𝑁 𝑦 𝜎2𝑦𝛽 , Id 1 We will use vectors 𝛽 of norm 10 5. Denote 𝑣= 𝛽 / 𝛽 and let 𝜎2 = 1 𝛽 2. Consider a distribution ℳ𝜇= (1 𝜀)𝑁 𝜇, 1 𝛽 2 + 𝜀𝑁 1 𝜀 𝜀𝜇, 1 . Note that 𝜒2 ℳ𝜇, 𝑁(0, 1) 𝑒𝑂(max{1/𝜇2,𝜇2}), and 𝜀/(1 𝜀) 𝑂 𝜇2 for 𝜇 10 4. Hence by Fact G.8, 𝜒𝒟(𝒬 𝑣, 𝒬 𝑢) 𝑒𝑂(1/𝜀2) 𝑣, 𝑢 2 for all unit 𝑣, 𝑢 R𝑑. Using Fact G.6, we can apply Fact G.5 with 𝛾= 𝑘2𝑐 2𝑒𝑂(1/𝜀2), 𝜌= 𝑒𝑂(1/𝜀2), 𝛾 = (𝜌 𝛾) 𝑘 2+2𝑐, we get that 𝒜requires at least 𝑑𝑐𝑘𝑐/8𝑘 2+2𝑐queries with tolerance greater than 𝑘 1+𝑐𝑒𝑂(1/𝜀2). Proposition G.10 (Formal version of Proposition 1.11). Let 𝑘, 𝑑 N, 𝑘 𝑑, 𝜀 (0, 1/2), 𝑐 (0, 1). For a vector 𝛽 R𝑑, a positive definite matrix Σ and a number 𝜎> 0, consider the distribution 𝒢(𝛽 , Σ, 𝜎) over R𝑑+1 such that (𝑋, 𝑦) 𝒢(𝛽 , Σ, 𝜎) satisfy 𝑋 𝑁(0, Σ) and 𝑦= 𝑋, 𝛽 + 𝜂, where 𝜂 𝑁(0, 𝜎2) is independent of 𝑋. There exist a set ℬ R𝑑of 𝑘-sparse vectors, 1 2Id Σ Id, 0.99 𝜎 1 and a distribution 𝒬over R𝑑+1, such that if an SQ algorithm 𝒜given access to a mixture (1 𝜀)𝒢(𝛽 , Σ, 𝜎) + 𝜀𝒬for 𝛽 ℬ, outputs ˆ𝛽such that 𝛽 ˆ𝛽 10 5 𝜀, then 𝒜either makes 𝑑𝑐𝑘𝑐/8 𝑘 4+4𝑐queries, or makes at least one query with tolerance at least 𝑘 2+2𝑐𝑒𝑂(1/𝜀). Proof. Note that (𝑋, 𝑦) 𝒢(𝛽 , Σ, 𝜎) satisfy 𝑦 𝑁(0, 𝜎2 𝑦), where 𝜎2 𝑦= 𝛽 Σ𝛽 + 𝜎2 and 𝑋|𝑦 𝑁 𝑦 𝜎2𝑦Σ𝛽 , Σ 1 𝜎2𝑦(Σ𝛽 )(Σ𝛽 ) . We will use vectors 𝛽 of norm 10 5 𝜀. Denote 𝑣= 𝛽 / 𝛽 and let 𝜎2 = 1 𝛽 Σ𝛽 , Σ = Id 𝑐 𝑣𝑣 , where 𝑐 is a constant such that 𝜎2𝑦 (Σ𝛽 )(Σ𝛽 ) = Id 𝑐 𝑣𝑣 10 5(1 𝑐 )2𝜀 𝑣𝑣 = Id 𝑣𝑣 /3 . By [DKS19, Lemmas E.2], there exists a distribution ℳthat satisfies the assumption of Fact G.8 with 𝑚= 3 and 𝑔(𝜀) = 𝑒𝑂(1/𝜀). Hence by Fact G.8, 𝜒𝒟(𝒬 𝑣, 𝒬 𝑢) 𝑒𝑂(1/𝜀) 𝑣, 𝑢 4 for all unit 𝑣, 𝑢 R𝑑. Using Fact G.6, we can apply Fact G.5 with 𝛾= 𝑘4𝑐 4𝑒𝑂(1/𝜀), 𝜌= 𝑒𝑂(1/𝜀), 𝛾 = (𝜌 𝛾) 𝑘 4+4𝑐, we get that 𝒜requires at least 𝑑𝑐𝑘𝑐/8𝑘 4+4𝑐queries with tolerance smaller than 𝑘 2+2𝑐𝑒𝑂(1/𝜀). H Sub-exponential designs Recall that a distribution 𝒟in R𝑑is called 𝐿-sub-exponential, if it has (𝐿𝑡)-bounded 𝑡-th moment for each 𝑡 N. In particular, all log-concave distributions are 𝐿-sub-exponential for some 𝐿 𝑂(1). In this section we discuss how we can improve the dependence of the sample complexity on 𝜀if (in addition to the assumptions of Theorem B.3) we assume that 𝒟is 𝐿-sub-exponential. For these designs we do not need a truncation. First, let us show how the gradient bound Lemma B.5 modifies in this case. It can be obtained directly from Bernstein s inequality for sub-exponential distributions ([RH23, Theorem 1.13]) Lemma H.1. With probability at least 1 𝛿/10, 𝑖 [𝑛] 𝜙(𝜂𝑖)𝑋 𝑖 Σ 𝐿 log(𝑑/𝛿) The proof of strong convexity bound (Lemma B.6) is exactly the same, with 𝑋 (𝜏) = 𝑋 . Finally, we need to bound 1 𝑛 Í𝑛 𝑖=1 𝑋 𝑖 2𝑡 E 𝑋 1 2𝑡 , since we need to prove Lemma D.2 for sub-exponential distributions. By Lemma C.1. from [DKK+22], for all 𝐿-sub-exponential distributions, with probability 1 𝛿, 1 𝑛 𝑋 𝑖 2𝑡 E 𝑋 1 2𝑡 𝑂 Σ 𝑡2 log(𝑑/𝛿) 2𝑡 ! Hence with 𝑛 𝐾2𝑡 10𝑡2 log(𝑑/𝛿) 4𝑡+1, we get 1 𝑛 Í𝑛 𝑖=1 𝑋 𝑖 2𝑡 E 𝑋 1 2𝑡 𝐿2𝑡 Σ 𝑡/𝐾𝑡, so we get the conclusion of Lemma D.2. Putting everything together, the sample complexity is 𝑛 𝑘log(𝑑/𝛿) 𝑘2 log(𝑑/𝛿) 105𝑠/(𝑠 2)𝑀𝑠/(𝑠 2) 𝑠 𝜅(Σ)4+𝑠/(𝑠 2) 𝛼 +𝑘2𝑡 𝜅(Σ)2𝑡 106 𝑡2 log(𝑑/𝛿) 4𝑡+1 . Note that we can use 𝑠= 4 and 𝑀𝑠= 4𝐿. 
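As a quick numerical illustration (not a proof) of the bound from Lemma C.1 of [DKK+22] used above, the snippet below estimates the entrywise (ℓ∞) deviation of the empirical second-moment matrix, i.e. the case 2t = 2, for a design with iid Laplace coordinates, which is log-concave and hence O(1)-sub-exponential. The particular distribution, dimension, and sample sizes are illustrative choices; the observed deviation decays roughly like √(log(d)/n), in line with the kind of bound quoted above, and no truncation is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
# For iid Laplace(0, 1) coordinates, E[x_i x_j] = 2 * 1[i = j], so the population
# second-moment "tensor" (the case 2t = 2) is 2 * Id.
population = 2.0 * np.eye(d)

for n in [1_000, 10_000, 100_000]:
    X = rng.laplace(scale=1.0, size=(n, d))   # rows of a log-concave (sub-exponential) design
    empirical = X.T @ X / n                   # empirical second moment, no truncation
    dev = np.max(np.abs(empirical - population))
    print(f"n = {n:>7}: linf deviation of the empirical second moment = {dev:.3f}")
```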
Consider the case when 𝒟is log-concave, so 𝐿 𝑂(1). For 𝜅(Σ) 𝑂(1), 𝛼 Ω(1), 𝑡= 1, with high probability we get error 𝑂(𝜎 𝜀) as long as 𝜀 + 𝑘2 (log 𝑑)5 . Similarly, for 𝜅(Σ) 𝑂(1), 𝛼 Ω(1), 𝑡= 2, with high probability we get error 𝑂(𝑀𝜎𝜀3/4) (where 𝑀 𝑂( p log 𝑑) is the same as in Theorem 1.7) as long as 𝜀3/2 + 𝑘4 (log 𝑑)9 . Note that for sub-Gaussian distributions one can use better tail bounds and the polylog(𝑑) factors should be better in this case. I Concentration Bounds Throughout the paper we use the following versions of versions of Bernstein s inequality. The proofs can be found in [Tro15]. Fact I.1 (Bernstein inequality). Let 𝐿> 0 and let 𝑥 R𝑑be a zero-mean random variable. Let 𝑥1, . . . , 𝑥𝑛be i.i.d. copies of 𝑥. Suppose that |𝑥| 𝐿. Then the estimator 𝑥= 1 𝑛 Í𝑛 𝑖=1 𝑥𝑖satisfies for all 𝑡> 0 P(| 𝑥| 𝑡) 2 exp 𝑡2𝑛 2 E 𝑥2 + 𝐿𝑡 Fact I.2 (Bernstein inequality for covariance). Let 𝐿> 0 and let 𝑥 R𝑑be a 𝑑-dimensional random vector. Let 𝑥1, . . . , 𝑥𝑛be i.i.d. copies of 𝑥. Suppose that 𝑥 2 𝐿. Then the estimator Σ = 1 𝑛 Í𝑛 𝑖=1 𝑥𝑖𝑥 𝑖satisfies for all 𝑡> 0 P Σ E 𝑥𝑥 𝑡 2𝑑 exp 𝑡2𝑛 2𝐿 Σ + 𝐿𝑡 Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: Abstract, Section 1, Section 1.1 and Section 2. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 1.1 and Section 2. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. 
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Sections 1.1 and 2 and Appendices A to I Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: This is a theory paper, and does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. 
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: This is a theory paper, and does not include experiments. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: This is a theory paper, and does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: This is a theory paper. Guidelines: The answer NA means that the paper does not include experiments. 
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: This is a theory paper, and does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conforms, in every respect, with the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: There is no societal impact of the work performed. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. 
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper poses no such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: The paper does not use existing assets. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. 
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.