# List-Decodable Linear Regression

Sushrut Karmalkar (University of Texas at Austin, sushrutk@cs.utexas.edu), Adam R. Klivans (University of Texas at Austin, klivans@cs.utexas.edu), Pravesh K. Kothari (Princeton University and Institute for Advanced Study, kothari@cs.princeton.edu)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We give the first polynomial-time algorithm for robust regression in the list-decodable setting, where an adversary can corrupt a greater than 1/2 fraction of examples. For any α < 1, our algorithm takes as input a sample {(x_i, y_i)}_{i ≤ n} of n linear equations where αn of the equations satisfy y_i = ⟨x_i, ℓ*⟩ + ζ for some small noise ζ and (1−α)n of the equations are arbitrarily chosen. It outputs a list L of size O(1/α) - a fixed constant - that contains an ℓ that is close to ℓ*. Our algorithm succeeds whenever the inliers are chosen from a certifiably anti-concentrated distribution D. As a corollary of our algorithmic result, we obtain a (d/α)^{O(1/α⁸)} time algorithm to find an O(1/α)-size list when the inlier distribution is the standard Gaussian. For discrete product distributions that are anti-concentrated only in regular directions, we give an algorithm that achieves a similar guarantee under the promise that ℓ* has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary. To solve the problem, we introduce a new framework for list-decodable learning that strengthens the "identifiability to algorithms" paradigm based on the sum-of-squares method.

1 Introduction

In this work, we design algorithms for the problem of linear regression that are robust to training sets with an overwhelming (≥ 1/2) fraction of adversarially chosen outliers. Outlier-robust learning algorithms have been extensively studied (under the name robust statistics) in mathematical statistics [43, 37, 25, 23]. However, the algorithms resulting from this line of work usually run in time exponential in the dimension of the data [6]. An influential line of recent work [29, 1, 16, 33, 8, 30, 31, 24, 14, 17, 28] has focused on designing efficient algorithms for outlier-robust learning. Our work extends this line of research.

Our algorithms work in the list-decodable learning framework. In this model, a majority of the training data (a 1−α fraction) can be adversarially corrupted, leaving only an α ≪ 1/2 fraction of inliers. Since uniquely recovering the underlying parameters is information-theoretically impossible in such a setting, the goal is to output a list (of an absolute constant size) of parameters, one of which matches the ground truth. This model was introduced in [3] to give a discriminative framework for clustering. More recently, beginning with [8], various works [18, 30] have considered this as a model of untrusted data.

There has been phenomenal progress in developing techniques for outlier-robust learning with a small (≪ 1/2) fraction of outliers (e.g., outlier filters [13, 14, 10, 15], separation oracles for inliers [13], or the sum-of-squares method [31, 24, 30, 28]). In contrast, progress on algorithms that tolerate the significantly harsher conditions in the list-decodable setting has been slower. The only prior works [8, 18, 30] in this direction designed list-decodable algorithms for mean estimation via problem-specific methods.
Recently, [22] addressed the somewhat related problem of conditional linear regression, where the goal is to find a linear function with small square loss conditioned on a subset of training points whose indices satisfy some constant-width k-DNF formula.

In this paper, we develop a principled technique to give the first efficient list-decodable learning algorithm for the fundamental problem of linear regression. Our algorithm takes a corrupted set of linear equations with an α ≪ 1/2 fraction of inliers and outputs an O(1/α)-size list of linear functions, one of which is guaranteed to be close to the ground truth (i.e., the linear function that correctly labels the inliers). A key conceptual insight in this result is that list-decodable regression information-theoretically requires the inlier distribution to be anti-concentrated. Our algorithm succeeds whenever the distribution satisfies a stronger certifiable anti-concentration condition that is algorithmically usable. This class includes the standard gaussian distribution and, more generally, any spherically symmetric distribution with strictly sub-exponential tails.

Prior to our work¹, the state-of-the-art outlier-robust algorithms for linear regression [28, 19, 12, 39] could handle only a small (< 0.1) fraction of outliers, even under strong assumptions on the underlying distributions.

List-decodable regression generalizes the well-studied [11, 26, 21, 44, 2, 9, 45, 41, 34] and easier problem of mixed linear regression: given k clusters of examples that are labeled by one out of k distinct unknown linear functions, find the unknown set of linear functions. All known techniques for the problem rely on faithfully estimating certain moment tensors from samples and thus cannot tolerate the overwhelming fraction of outliers in the list-decodable setting. On the other hand, since we can take any cluster as inliers and treat the rest as outliers, our algorithm immediately yields new efficient algorithms for mixed linear regression. Unlike all prior works, our algorithms work without any pairwise separation or bounded condition-number assumptions on the k linear functions.

List-Decodable Learning via the Sum-of-Squares Method. Our algorithm relies on a strengthening of the robust-estimation framework based on the sum-of-squares (SoS) method. This paradigm has recently been used for clustering mixture models [24, 30] and for obtaining algorithms for moment estimation [31] and linear regression [28] that are resilient to a small (≪ 1/2) fraction of outliers under the mildest known assumptions on the underlying distributions. At the heart of this technique is a reduction of outlier-robust algorithm design to just finding simple proofs of unique identifiability of the unknown parameter of the original distribution from a corrupted sample. However, this principled method works only in the setting with a small (≪ 1/2) fraction of outliers. As a consequence, the work of [30] for mean estimation in the list-decodable setting relied on supplementing the SoS method with a somewhat ad hoc, problem-dependent technique.

As an important conceptual contribution, our work yields a framework for list-decodable learning that recovers some of the simplicity of the general blueprint. Central to our framework is a general method of "rounding by votes" for pseudo-distributions in the setting with a ≫ 1/2 fraction of outliers. Our rounding builds on the work of [32], who developed such a method to give a simpler proof of the list-decodable mean estimation result of [30].
In Section 2, we explain our ideas in detail.

The results in all the works above hold for any underlying distribution that has upper-bounded low-degree moments, and such bounds are captured within the SoS system. Such conditions are called certified bounded moment inequalities. An important contribution of this work is to formalize anti-concentration inequalities within the SoS system and prove certified anti-concentration for natural distribution families. Unlike bounded moment inequalities, there is no canonical encoding within SoS for such statements. We choose an encoding that allows proving certified anti-concentration for a distribution by showing the existence of a certain approximating polynomial. This allows showing certified anti-concentration of natural distributions via a completely modular approach that relies on a beautiful line of works that construct weighted polynomial approximators [35].

¹There is a long line of work on robust regression algorithms (see, e.g., [7, 27]) that can tolerate corruptions only in the labels. We are interested in algorithms robust against corruptions in both examples and labels.

We believe that our framework for list-decodable estimation and our formulation of the certified anti-concentration condition will likely have further applications in outlier-robust learning.

1.1 Our Results

We first define our model for generating samples for list-decodable regression.

Model 1.1 (Robust Linear Regression). For 0 < α < 1 and ℓ* ∈ R^d with ‖ℓ*‖_2 ≤ 1, let Lin_D(α, ℓ*) denote the following probabilistic process to generate n noisy linear equations S = {⟨x_i, a⟩ = y_i | 1 ≤ i ≤ n} in the variable a ∈ R^d with αn inliers I and (1−α)n outliers O:

1. Construct I by choosing αn i.i.d. samples x_i ∼ D and set y_i = ⟨x_i, ℓ*⟩ + ζ for additive noise ζ,
2. Construct O by choosing the remaining (1−α)n equations arbitrarily and potentially adversarially w.r.t. the inliers I.

Note that α measures the signal (fraction of inliers) and can be ≪ 1/2. The bound on the norm of ℓ* is without any loss of generality. For the sake of exposition, we will restrict to ζ = 0 for most of this paper and discuss (see Remarks 1.6 and 4.4) how our algorithms can tolerate additive noise.

An η-approximate algorithm for list-decodable regression takes as input a sample from Lin_D(α, ℓ*) and outputs a constant-size (depending only on α) list L of linear functions such that there is some ℓ ∈ L that is η-close to ℓ*.

One of our key conceptual contributions is to identify the strong relationship between anti-concentration inequalities and list-decodable regression. Anti-concentration inequalities are well-studied [20, 42, 40] in probability theory and combinatorics. The simplest of these inequalities upper bound the probability that a high-dimensional random variable has zero projections in any direction.

Definition 1.2 (Anti-Concentration). An R^d-valued zero-mean random variable Y has a δ-anti-concentrated distribution if, for every non-zero v, Pr[⟨Y, v⟩ = 0] < δ.

In Proposition 2.4, we provide a simple but conceptually illuminating proof that anti-concentration is sufficient for list-decodable regression. In Theorem 6.1, we prove a sharp converse and show that anti-concentration is information-theoretically necessary even for noiseless list-decodable regression. This lower bound surprisingly holds for a natural distribution: the uniform distribution on {0, 1}^d and, more generally, the uniform distribution on [q]^d for [q] = {0, 1, 2, . . . , q}. In fact, our lower bound shows the impossibility of even the easier problem of mixed linear regression on this distribution.
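To make Model 1.1 and Definition 1.2 concrete, here is a small numpy sketch (our own illustration, not from the paper; the helper names `sample_lin_d` and `zero_projection_rate` are ours). It generates a corrupted sample with Gaussian inliers and one simple adversary, and then empirically compares Pr[⟨Y, v⟩ = 0] for the standard Gaussian against the centered hypercube in a sparse direction, which is exactly the failure mode behind the lower bound above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lin_d(alpha, ell_star, n, d, noise=0.0):
    """Toy sampler for Model 1.1: alpha*n Gaussian inliers labeled by ell_star,
    the rest filled with adversarial equations labeled by a decoy vector."""
    n_in = int(alpha * n)
    X_in = rng.standard_normal((n_in, d))
    y_in = X_in @ ell_star + noise * rng.standard_normal(n_in)
    # One simple adversary: make the outliers perfectly consistent with a decoy,
    # so they form a large "soluble" set of their own.
    decoy = -ell_star
    X_out = rng.standard_normal((n - n_in, d))
    y_out = X_out @ decoy
    return np.vstack([X_in, X_out]), np.concatenate([y_in, y_out])

d, n, alpha = 10, 1000, 0.2
ell_star = np.ones(d) / np.sqrt(d)            # unit-norm ground truth
X, y = sample_lin_d(alpha, ell_star, n, d)

# Definition 1.2, empirically: fraction of exactly-zero projections along a direction v.
def zero_projection_rate(samples, v):
    return np.mean(np.abs(samples @ v) < 1e-12)

gauss = rng.standard_normal((100000, d))
cube = rng.integers(0, 2, size=(100000, d)) - 0.5   # centered uniform on {0,1}^d
v_sparse = np.eye(d)[0] - np.eye(d)[1]              # projection is x_1 - x_2

print("Gaussian, sparse direction:", zero_projection_rate(gauss, v_sparse))   # essentially 0
print("Hypercube, sparse direction:", zero_projection_rate(cube, v_sparse))   # about 1/2
```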
Theorem 1.3 (See Proposition 2.4 and Theorem 6.1). There is an (inefficient) list-decodable regression algorithm for Lin_D(α, ℓ*) with list size O(1/α) whenever D is α-anti-concentrated. Further, there exists a distribution D on R^d that is (α + ϵ)-anti-concentrated for every ϵ > 0 but there is no algorithm for α/2-approximate list-decodable regression for Lin_D(α, ℓ*) that returns a list of size < d.

To handle additive noise of variance ζ², we need a control of Pr[|⟨x, v⟩| ≤ ζ]. For our efficient algorithms, in addition, we need a certified version of the anti-concentration condition. Informally, this means that there is a low-degree sum-of-squares proof of anti-concentration of I. We give the precise definition and background in Section 3 (Sections 3-6 appear in the supplementary material). For this section, we will use this phrase informally and encourage the reader to think of it as a version of anti-concentration that the SoS method can reason about.

Definition 1.4 (Certifiable Anti-Concentration). A random variable Y has a k-certifiably (C, δ)-anti-concentrated distribution if there is a univariate polynomial p satisfying p(0) = 1 such that there is a degree-k sum-of-squares proof of the following two inequalities:

1. ∀v, ⟨Y, v⟩² ≤ δ² E⟨Y, v⟩² implies (p(⟨Y, v⟩) − 1)² ≤ δ².
2. ∀v, ‖v‖_2² ≤ 1 implies E p²(⟨Y, v⟩) ≤ Cδ.

Intuitively, certified anti-concentration asks for a certificate of the anti-concentration property of Y in the sum-of-squares proof system (see Section 3 for precise definitions). SoS is a proof system that reasons about polynomial inequalities. Since the core indicator 1(|⟨x, v⟩| ≤ δ) is not a polynomial, we phrase the condition in terms of an approximating polynomial p.

We are now ready to state our main result.

Theorem 1.5 (List-Decodable Regression). For every α, η > 0 and a k-certifiably (C, α²η²/10C)-anti-concentrated distribution D on R^d, there exists an algorithm that takes as input a sample generated according to Lin_D(α, ℓ*) and outputs a list L of size O(1/α) such that there is an ℓ ∈ L satisfying ‖ℓ − ℓ*‖_2 < η with probability at least 0.99 over the draw of the sample. The algorithm needs a sample of size n = (kd)^{O(k)} and runs in time n^{O(k)} = (kd)^{O(k²)}.

Remark 1.6 (Tolerating Additive Noise). For additive noise (not necessarily independent across samples) of variance ζ² in the inlier labels, our algorithm, in the same running time and sample complexity, outputs a list of size O(1/α) that contains an ℓ satisfying ‖ℓ − ℓ*‖_2 ≤ ζ/α + η. Since we normalize ℓ* to have unit norm, this guarantee is meaningful only when ζ ≪ α.

Remark 1.7 (Exponential Dependence on 1/α). List-decodable regression algorithms immediately yield algorithms for mixed linear regression (MLR) without any assumptions on the components. The state-of-the-art algorithms for MLR with gaussian components [34, 41] have an exponential dependence on k = 1/α in the running time in the absence of strong pairwise separation or a small condition number of the components. Li and Liang [34] (see page 10 of their paper) use the relationship to learning mixtures of k gaussians (with an exp(k) lower bound [38]) to note that there may not exist any algorithm with polynomial dependence on 1/α for MLR, and thus also for list-decodable regression.
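Definition 1.4 asks for a single low-degree polynomial p with p(0) = 1 that stays close to 1 near the origin and has small mean square under Gaussian-like projections; the actual certificates are built from the weighted polynomial approximators of [35]. The following sketch is our own numerical illustration, not the paper's construction (the names and parameter choices are ours): it fits such a p for one-dimensional projections ⟨Y, v⟩ ∼ N(0, 1) and prints the two quantities appearing in the definition, without, of course, producing an SoS certificate.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

delta, deg, R = 0.5, 24, 6.0          # anti-concentration scale, polynomial degree, fitting range

# Target: a smooth bump that equals 1 at 0 and is tiny once |z| >> delta.
bump = lambda z: np.exp(-z**2 / (2 * delta**2))

# Weighted least-squares fit on [-R, R]; Gaussian weights keep the fit honest
# where N(0,1) projections actually land, which controls E[p(<Y,v>)^2].
grid = np.linspace(-R, R, 4001)
weights = np.exp(-grid**2 / 2)
p = Polynomial.fit(grid, bump(grid), deg, w=weights)
p = p * (1.0 / float(p(0.0)))         # enforce p(0) = 1 exactly

# Condition 1 (informally): p stays within delta of 1 when |<Y,v>| <= delta * sqrt(E<Y,v>^2) = delta.
near_zero = np.linspace(-delta, delta, 201)
print("max |p(z) - 1| on [-delta, delta]:", np.max(np.abs(p(near_zero) - 1)))

# Condition 2 (informally): E[p(<Y,v>)^2] should be O(delta) for a unit direction v.
g = rng.standard_normal(200000)
print("E[p(g)^2] (Monte Carlo):", np.mean(p(g)**2), " vs delta =", delta)
```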
Certifiably anti-concentrated distributions. In Section 5, we show certifiable anti-concentration of some well-studied families of distributions. This includes the standard gaussian distribution and, more generally, any anti-concentrated spherically symmetric distribution with strictly sub-exponential tails. We also show that simple operations such as scaling, applying well-conditioned linear transformations, and sampling preserve certifiable anti-concentration. This yields:

Corollary 1.8 (List-Decodable Regression for Gaussian Inliers). For every α, η > 0, there is an algorithm for list-decodable regression for the model Lin_D(α, ℓ*) with D = N(0, Σ) and λ_max(Σ)/λ_min(Σ) = O(1) that needs n = (d/αη)^{O(1/α⁴η⁴)} samples and runs in time n^{O(1/α⁴η⁴)} = (d/αη)^{O(1/α⁸η⁸)}.

We note that certifiably anti-concentrated distributions are more restrictive compared to the families of distributions for which the most general robust estimation algorithms work [31, 30, 28]. To a certain extent, this is inherent. The families of distributions considered in these prior works do not satisfy anti-concentration in general. And as we discuss in more detail in Section 2, anti-concentration is information-theoretically necessary (see Theorem 1.3) for list-decodable regression. This surprisingly rules out families of distributions that might appear natural and "easy", for example, the uniform distribution on {0, 1}^d.

We rescue this to an extent for the special case when ℓ* in the model Lin_D(α, ℓ*) is a "Boolean vector", i.e., has all coordinates of equal magnitude. Intuitively, this helps because while the uniform distribution on {0, 1}^d (and, more generally, any discrete product distribution) is badly anti-concentrated in sparse directions, it is well anti-concentrated [20] in directions that are far from any sparse vectors. As before, for obtaining efficient algorithms, we need to work with a certified version (see Definition 4.5) of such a restricted anti-concentration condition. As a specific corollary (see Theorem 4.6 for a more general statement), this allows us to show:

Theorem 1.9 (List-Decodable Regression for Hypercube Inliers). For every α, η > 0, there is an η-approximate algorithm for list-decodable regression for the model Lin_D(α, ℓ*) with D uniform on {0, 1}^d that needs n = (d/αη)^{O(1/α⁴η⁴)} samples and runs in time n^{O(1/α⁴η⁴)} = (d/αη)^{O(1/α⁸η⁸)}.

In Section 4.1, we obtain similar results for general product distributions. It is an important open problem to prove certified anti-concentration for a broader family of distributions.

2 Overview of our Technique

In this section, we give a bird's eye view of our approach and illustrate the important ideas in our algorithm for list-decodable regression. Thus, given a sample S = {(x_i, y_i)}_{i=1}^n from Lin_D(α, ℓ*), we must construct a constant-size list L of linear functions containing an ℓ close to ℓ*. Our algorithm is based on the sum-of-squares method. We build on the "identifiability to algorithms" paradigm developed in several prior works [5, 4, 36, 31, 24, 30, 28] with some important conceptual differences.

An inefficient algorithm. Let's start by designing an inefficient algorithm for the problem. This may seem simple at the outset, but as we'll see, solving this relaxed problem will rely on some important conceptual ideas that will serve as a starting point for our efficient algorithm. Without computational constraints, it is natural to just return the list L of all linear functions ℓ that correctly label all examples in some S′ ⊆ S of size αn. We call such an S′ a large, soluble set.
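As a sanity check, this inefficient procedure is easy to spell out on a tiny instance: enumerate all αn-size subsets, keep those that are exactly soluble, and collect the solutions. The brute-force sketch below is our own illustration (it reuses the hypothetical toy sampler `sample_lin_d` from the earlier snippet); it is exponential in n, and, as noted next, the list it produces can also be huge.

```python
from itertools import combinations
import numpy as np

def brute_force_list(X, y, alpha, tol=1e-8):
    """Enumerate all alpha*n-size subsets and keep one solution per exactly-soluble subset."""
    n, d = X.shape
    m = int(alpha * n)
    L = []
    for idx in combinations(range(n), m):
        A, b = X[list(idx)], y[list(idx)]
        ell, *_ = np.linalg.lstsq(A, b, rcond=None)
        if np.max(np.abs(A @ ell - b)) < tol:          # subset is exactly soluble by ell
            if not any(np.linalg.norm(ell - u) < tol for u in L):
                L.append(ell)
    return L

# Tiny instance only: the loop ranges over C(n, alpha*n) subsets.
X, y = sample_lin_d(alpha=0.4, ell_star=np.array([0.6, 0.8]), n=10, d=2)
print(len(brute_force_list(X, y, 0.4)), "candidate linear functions")
```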
The true inliers I satisfy our search criteria, so ℓ* ∈ L. However, it is not hard to show (Proposition B.1) that one can choose outliers so that the list so generated has size exp(d) (far from a fixed constant!). A potential fix is to search instead for a coarse soluble partition of S, if it exists, into disjoint S_1, S_2, . . . , S_k and linear functions ℓ_1, ℓ_2, . . . , ℓ_k so that every |S_i| ≥ αn and ℓ_i correctly computes the labels in S_i. In this setting, our list is small (k ≤ 1/α). But it is easy to construct samples S for which this fails because there are coarse soluble partitions of S where every ℓ_i is far from ℓ*.

Anti-Concentration. It turns out that any (even inefficient) algorithm for list-decodable regression provably (see Theorem 6.1) requires that the distribution of inliers² be sufficiently anti-concentrated:

Definition 2.1 (Anti-Concentration). An R^d-valued random variable Y with mean 0 is δ-anti-concentrated³ if for all non-zero v, Pr[⟨Y, v⟩ = 0] < δ. A set T ⊆ R^d is δ-anti-concentrated if the uniform distribution on T is δ-anti-concentrated.

²As in the standard robust estimation setting, the outliers are arbitrary and potentially adversarially chosen.
³Definition 1.4 differs slightly to handle list-decodable regression with additive noise in the inliers.

As we discuss next, anti-concentration is also sufficient for list-decodable regression. Intuitively, this is because anti-concentration of the inliers prevents the existence of a soluble set that intersects significantly with I and yet can be labeled correctly by some ℓ ≠ ℓ*. This is simple to prove in the special case when S admits a coarse soluble partition.

Proposition 2.2. Suppose I is α-anti-concentrated. Suppose there exists a partition S_1, S_2, . . . , S_k of S such that each |S_i| ≥ αn and there exist ℓ_1, ℓ_2, . . . , ℓ_k such that y_j = ⟨ℓ_i, x_j⟩ for every j ∈ S_i. Then, there is an i such that ℓ_i = ℓ*.

Proof. Since k ≤ 1/α, there is a j such that |I ∩ S_j| ≥ α|I|. Then, ⟨x_i, ℓ_j⟩ = ⟨x_i, ℓ*⟩ for every i ∈ I ∩ S_j. Thus, Pr_{i∼I}[⟨x_i, ℓ_j − ℓ*⟩ = 0] ≥ α. This contradicts the anti-concentration of I unless ℓ_j − ℓ* = 0.

The above proposition allows us to use any soluble partition as a certificate of correctness for the associated list L. Two aspects of this certificate were crucial in the above argument: 1) largeness: each S_i is of size ≥ αn, so the generated list is small, and 2) uniformity: every sample is used in exactly one of the sets, so I must intersect one of the S_i's in at least an α fraction of the points.

Identifiability via anti-concentration. For arbitrary S, a coarse soluble partition might not exist. So we will generalize coarse soluble partitions to obtain certificates that exist for every sample S and guarantee largeness and a relaxation of uniformity (formalized below). For this purpose, it is convenient to view such certificates as distributions µ on αn-size soluble subsets of S: any collection C ⊆ 2^S of αn-size sets corresponds to the uniform distribution µ on C. To precisely define uniformity, let W_i(µ) = E_{S′∼µ}[1(i ∈ S′)] be the frequency of i, that is, the probability that the i-th sample is chosen to be in a set drawn according to µ. Then, the uniform distribution µ on any coarse soluble k-partition satisfies W_i = 1/k for every i. That is, all samples i ∈ S are used uniformly in such a µ. To generalize this idea, we define ∑_i W_i(µ)² as the distance to uniformity of µ. Up to a shift, this is simply the variance in the frequencies of the points in S used in draws from µ.
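To make this bookkeeping concrete, here is a tiny sketch (our own illustration; all names are ours) that computes the frequencies W_i(µ) and the distance to uniformity ∑_i W_i(µ)² for two explicit distributions over αn-size subsets: the uniform distribution over a coarse partition, which attains W_i = 1/k for every i, and a skewed distribution that keeps reusing a few samples and therefore has a strictly larger distance to uniformity.

```python
import numpy as np

n, alpha = 12, 0.25                      # toy sample of 12 equations, subsets of size 3
m = int(alpha * n)

def frequencies(mu):
    """mu: list of (subset, probability) pairs. Returns W_i(mu) for i = 0..n-1."""
    W = np.zeros(n)
    for subset, prob in mu:
        W[list(subset)] += prob
    return W

# A coarse soluble 4-partition viewed as the uniform distribution over its 4 blocks.
partition = [(range(0, 3), 0.25), (range(3, 6), 0.25), (range(6, 9), 0.25), (range(9, 12), 0.25)]
# A skewed alternative that keeps reusing the first few samples.
skewed = [((0, 1, 2), 0.5), ((0, 1, 3), 0.5)]

for name, mu in [("partition", partition), ("skewed", skewed)]:
    W = frequencies(mu)
    print(name, "distance to uniformity:", np.sum(W**2))
# partition: every W_i = 1/4, so the distance is 12 * (1/4)^2 = 0.75 (the minimum possible here);
# skewed: mass piles up on a few indices, giving a strictly larger value (2.5).
```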
Our generalization of a coarse soluble partition of S is any µ that minimizes ∑_i W_i(µ)², the distance to uniformity, and is thus maximally uniform among all distributions supported on large soluble sets. Such a µ can be found by convex programming. The following claim generalizes Proposition 2.2 to derive the same conclusion starting from any maximally uniform distribution supported on large soluble sets.

Proposition 2.3. For a maximally uniform µ on αn-size soluble subsets of S, ∑_{i∈I} E_{S′∼µ}[1(i ∈ S′)] ≥ α|I|.

The proof proceeds by contradiction (see Lemma 4.3). We show that if ∑_{i∈I} W_i(µ) < α|I|, then we can strictly reduce the distance to uniformity by taking a mixture of µ with the distribution that places all its probability mass on I. This allows us to obtain an (inefficient) algorithm for list-decodable regression, establishing identifiability.

Proposition 2.4 (Identifiability for List-Decodable Regression). Let S be a sample from Lin_D(α, ℓ*) such that I is δ-anti-concentrated for δ < α. Then, there is an (inefficient) algorithm that finds a list L of size 20/(α−δ) such that ℓ* ∈ L with probability at least 0.99.

Proof. Let µ be any maximally uniform distribution over αn-size soluble subsets of S. For k = 20/(α−δ), let S_1, S_2, . . . , S_k be independent samples from µ. Output the list L of the k linear functions that correctly compute the labels in each S_i. To see why ℓ* ∈ L, observe that E|S_j ∩ I| = ∑_{i∈I} E[1(i ∈ S_j)] ≥ α|I|. By averaging, Pr[|S_j ∩ I| ≥ ((α+δ)/2)|I|] ≥ (α−δ)/2. Thus, there is a j ≤ k such that |S_j ∩ I| ≥ ((α+δ)/2)|I| with probability at least 1 − (1 − (α−δ)/2)^{20/(α−δ)} ≥ 0.99. We can now repeat the argument in the proof of Proposition 2.2 to conclude that any linear function that correctly labels S_j must equal ℓ*.

An efficient algorithm. Our identifiability proof suggests the following simple algorithm: 1) find any maximally uniform distribution µ on soluble subsets of S of size αn, 2) take O(1/α) samples S_i from µ, and 3) return the list of linear functions that correctly label the equations in the S_i's. This is inefficient because searching over distributions is NP-hard in general.

To make this into an efficient algorithm, we start by observing that soluble subsets S′ ⊆ S of size αn can be described by the following set of quadratic equations, where w stands for the indicator of S′ and ℓ for the linear function that correctly labels the examples in S′:

∑_{i=1}^n w_i = αn,
w_i² = w_i for all i ∈ [n],
w_i · (y_i − ⟨x_i, ℓ⟩) = 0 for all i ∈ [n].        (2.1)

Our efficient algorithm searches for a maximally uniform pseudo-distribution on w satisfying (2.1). Degree-k pseudo-distributions (see Section 3 for precise definitions) are generalizations of distributions that nevertheless behave just like distributions whenever we take (pseudo-)expectations (denoted by Ẽ) of a class of degree-k polynomials. And unlike distributions, degree-k pseudo-distributions satisfying⁴ polynomial constraints (such as (2.1)) can be computed in time n^{O(k)}.

For the sake of intuition, it might be helpful to (falsely) think of pseudo-distributions µ̃ simply as distributions where we only get access to moments of degree at most k. Thus, we are allowed to compute expectations of all degree-k polynomials with respect to µ̃. Since W_i(µ̃) = Ẽ_µ̃[w_i] are just first moments of µ̃, our notion of maximally uniform distributions extends naturally to pseudo-distributions. This allows us to prove an analog of Proposition 2.3 for pseudo-distributions and gives us an efficient replacement for Step 1.

⁴See Fact 3.3 for a precise statement.

Proposition 2.5.
For any maximally uniform µ̃ of degree 2, ∑_{i∈I} Ẽ_µ̃[w_i] ≥ α|I| = α ∑_{i∈[n]} Ẽ_µ̃[w_i].

For Step 2, however, we hit a wall: it is not possible to obtain independent samples from µ̃ given only its low-degree moments.

Rounding by Votes. To circumvent this hurdle, our algorithm departs from the rounding strategies for pseudo-distributions used in prior works and instead rounds each sample to a candidate linear function. While a priori this method produces n different candidates instead of one, we will be able to extract a list of size O(1/α) that contains the true vector from them. This step will crucially rely on the anti-concentration properties of I.

Consider the vector v_i = Ẽ_µ̃[w_i ℓ] / Ẽ_µ̃[w_i] whenever Ẽ_µ̃[w_i] ≠ 0 (set v_i to zero otherwise). This is simply the (scaled) average, according to µ̃, of all the linear functions ℓ that are used to label the sets S′ of size αn in the support of µ̃ whenever i ∈ S′. Further, v_i depends only on the first two moments of µ̃. We think of the v_i's as votes cast by the i-th sample for the unknown linear function.

Let us focus our attention on the votes v_i for i ∈ I - the inliers. We will show that, according to the distribution proportional to Ẽ[w], the average ℓ_2 distance of v_i from ℓ* is at most η:

(1/∑_{i∈I} Ẽ[w_i]) ∑_{i∈I} Ẽ[w_i] ‖v_i − ℓ*‖_2 < η.        (*)

Before diving into (*), let's see how it gives us our efficient list-decodable regression algorithm:

1. Find a pseudo-distribution µ̃ satisfying (2.1) that minimizes the distance to uniformity ∑_i Ẽ_µ̃[w_i]².
2. For O(1/α) times, independently choose a random index i ∈ [n] with probability proportional to Ẽ_µ̃[w_i] and return the list of the corresponding v_i's.

Step 1 above is a convex program - it minimizes a norm over the convex set of pseudo-distributions - and can be solved in polynomial time. Let's analyze Step 2 to see why the algorithm works. Using (*) and Markov's inequality, conditioned on i ∈ I, ‖v_i − ℓ*‖_2 ≤ 2η with probability at least 1/2. By Proposition 2.5, ∑_{i∈I} Ẽ[w_i] / ∑_{i∈[n]} Ẽ[w_i] ≥ α, so i ∈ I with probability at least α. Thus, in each iteration of Step 2, with probability at least α/2, we choose an i such that v_i is 2η-close to ℓ*. Repeating O(1/α) times gives us the 0.99 chance of success.

(*) via anti-concentration. As in the information-theoretic argument, (*) relies on the anti-concentration of I. Let's do a quick proof for the case when µ̃ is an actual distribution µ.

Proof of (*) for actual distributions µ. Observe that µ is a distribution over pairs (w, ℓ) satisfying (2.1). Recall that w indicates a subset S′ ⊆ S of size αn, with w_i = 1 iff i ∈ S′, and ℓ ∈ R^d satisfies all the equations in S′. By Cauchy-Schwarz, ∑_{i∈I} ‖E_µ[w_i ℓ] − E_µ[w_i] ℓ*‖_2 ≤ E_µ[∑_{i∈I} w_i ‖ℓ − ℓ*‖_2]. Next, as in Proposition 2.2, since I is η-anti-concentrated, for every S′ such that |I ∩ S′| ≥ η|I| we have ℓ − ℓ* = 0. Thus, any such S′ in the support of µ contributes 0 to the expectation above. We will now show that the contribution from the remaining terms is upper bounded by 2η|I|. Observe that since ‖ℓ − ℓ*‖_2 ≤ 2,

E_µ[∑_{i∈I} w_i ‖ℓ − ℓ*‖_2] = E_µ[1(|S′ ∩ I| < η|I|) ∑_{i∈I} w_i ‖ℓ − ℓ*‖_2] = E_µ[1(|S′ ∩ I| < η|I|) ∑_{i∈S′∩I} ‖ℓ − ℓ*‖_2] ≤ 2η|I|.

Dividing by ∑_{i∈I} E_µ[w_i] ≥ α|I| (Proposition 2.3) gives (*) up to a rescaling of η by factors of α, which we ignore in this overview.
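For concreteness, the sketch below sets up a bare-bones degree-2 analogue of the above pipeline in cvxpy: a pseudo-moment matrix over the monomials (1, w, ℓ) with linearized versions of the constraints (2.1), the distance-to-uniformity objective, and the rounding-by-votes step. This is our own schematic (all variable names are ours); it uses only a degree-2 relaxation without any anti-concentration certificates, so it is not expected to enjoy the guarantees of Theorem 1.5 and is meant only to show the shape of the convex program and of the rounding.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)

# Toy corrupted sample: alpha*n Gaussian inliers labeled by ell_star, the rest by a decoy.
n, d, alpha = 30, 2, 0.4
ell_star = np.array([0.6, 0.8])
n_in = int(alpha * n)
X = rng.standard_normal((n, d))
y = np.concatenate([X[:n_in] @ ell_star, X[n_in:] @ (-ell_star)])

# Pseudo-moment matrix M indexed by the monomials (1, w_1..w_n, ell_1..ell_d).
m = 1 + n + d
W = lambda i: 1 + i          # column of w_i
L = lambda r: 1 + n + r      # column of ell_r
M = cp.Variable((m, m), symmetric=True)

cons = [M >> 0, M[0, 0] == 1]
for i in range(n):
    cons.append(M[W(i), W(i)] == M[0, W(i)])                      # w_i^2 = w_i
    cons.append(y[i] * M[0, W(i)] ==
                sum(X[i, r] * M[W(i), L(r)] for r in range(d)))   # w_i * (y_i - <x_i, ell>) = 0
cons.append(sum(M[0, W(i)] for i in range(n)) == alpha * n)       # sum_i w_i = alpha * n

# Step 1: minimize the distance to uniformity, sum_i (E~[w_i])^2.
prob = cp.Problem(cp.Minimize(cp.sum_squares(M[0, 1:1 + n])), cons)
prob.solve()

# Step 2 (rounding by votes): v_i = E~[w_i ell] / E~[w_i]; sample indices prop. to E~[w_i].
Ew = np.clip(M.value[0, 1:1 + n], 1e-9, None)
votes = M.value[1:1 + n, 1 + n:] / Ew[:, None]
picks = rng.choice(n, size=int(np.ceil(1 / alpha)) + 2, p=Ew / Ew.sum())
for i in picks:
    print("vote of sample", i, ":", np.round(votes[i], 3))
```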
SoS-izing Anti-Concentration. The key to proving (*) for pseudo-distributions is a sum-of-squares (SoS) proof of the anti-concentration inequality Pr_{x∼I}[⟨x, v⟩ = 0] ≤ η in the variable v. SoS is a restricted system for proving polynomial inequalities subject to polynomial inequality constraints. Thus, to even ask for an SoS proof, we must phrase anti-concentration as a polynomial inequality. To do this, let p(z) be a low-degree polynomial approximator for the function 1(z = 0). Then, we can hope to replace the use of the inequality Pr_{x∼I}[⟨x, v⟩ = 0] = E_{x∼I}[1(⟨x, v⟩ = 0)] ≤ η in the argument above by E_{x∼I}[p(⟨x, v⟩)²] ≤ η. Since polynomials grow unboundedly for large enough inputs, it is necessary for the uniform distribution on I to have sufficiently light tails to ensure that E_{x∼I}[p(⟨x, v⟩)²] is small. In Lemma A.1, we show that anti-concentration and strictly sub-exponential tails are sufficient to construct such a polynomial. We can finally ask for an SoS proof of E_{x∼I}[p(⟨x, v⟩)²] ≤ η in the variable v. We prove such certified anti-concentration inequalities for broad families of inlier distributions in Section 5.

3 Acknowledgements

The authors would like to thank the following sources of support. Sushrut Karmalkar was supported by NSF Award CNS-1414023. Adam Klivans was supported by NSF Award CCF-1717896. Pravesh Kothari was supported by a Schmidt Foundation Fellowship and Avi Wigderson's NSF Award CCF-1412958.

References

[1] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with malicious noise. CoRR, abs/1307.8371, 2013.
[2] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. CoRR, abs/1408.2156, 2014.
[3] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In STOC, pages 671-680. ACM, 2008.
[4] Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method [extended abstract]. In STOC '15: Proceedings of the 2015 ACM Symposium on Theory of Computing, pages 143-151. ACM, New York, 2015.
[5] Boaz Barak and Ankur Moitra. Noisy tensor completion via the sum-of-squares hierarchy. In COLT, volume 49 of JMLR Workshop and Conference Proceedings, pages 417-445. JMLR.org, 2016.
[6] Thorsten Bernholt. Robust estimators are hard to compute. Technical report, Universität Dortmund, SFB 475 Komplexitätsreduktion in Multivariaten Datenstrukturen, 2006.
[7] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2107-2116, 2017.
[8] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In STOC, pages 47-60. ACM, 2017.
[9] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, pages 560-604, 2014.
[10] Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755-2771, 2019.
[11] Richard D. De Veaux. Mixtures of linear regressions. Comput. Statist. Data Anal., 8(3):227-245, 1989.
[12] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. CoRR, abs/1803.02815, 2018.
[13] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In FOCS, pages 655-664. IEEE Computer Society, 2016.
[14] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. CoRR, abs/1704.03866, 2017.
[15] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2683-2702, 2018.
[16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Zheng Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. CoRR, abs/1604.06443, 2016.
[17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. CoRR, abs/1707.01242, 2017.
[18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 1047-1060, 2018.
[19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2745-2754. SIAM, 2019.
[20] P. Erdős. On a lemma of Littlewood and Offord. Bull. Amer. Math. Soc., 51(12):898-902, 1945.
[21] Susana Faria and Gilda Soromenho. Fitting mixtures of linear regressions. J. Stat. Comput. Simul., 80(1-2):201-225, 2010.
[22] John Hainline, Brendan Juba, Hai Le, and David Woodruff. Conditional sparse l_p-norm regression with optimal probability. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1042-1050, 2019.
[23] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust statistics: the approach based on influence functions, volume 114. John Wiley & Sons, 2011.
[24] Sam B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. 2017.
[25] Peter J. Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248-1251. Springer, 2011.
[26] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.
[27] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via l1 regression. In Jeremy T. Fineman and Michael Mitzenmacher, editors, SOSA@SODA, volume 69 of OASICS, pages 19:1-19:19. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019.
[28] Adam R. Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1420-1430, 2018.
[29] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715-2740, 2009.
[30] Pravesh K. Kothari and Jacob Steinhardt. Better agnostic clustering via relaxed tensor norms. 2017.
[31] Pravesh K. Kothari and David Steurer. Outlier-robust moment-estimation via sum-of-squares. CoRR, abs/1711.11581, 2017.
[32] Pravesh K. Kothari and David Steurer. List-decodable mean estimation made simple. Manuscript, 2019.
[33] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In FOCS, pages 665-674. IEEE Computer Society, 2016.
[34] Yuanzhi Li and Yingyu Liang. Learning mixtures of linear regressions with nearly optimal complexity. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1125-1144, 2018.
[35] Doron S. Lubinsky. A survey of weighted approximation for exponential weights. arXiv Mathematics e-prints, math/0701099, January 2007.
[36] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In FOCS, pages 438-446. IEEE Computer Society, 2016.
[37] Ricardo A. Maronna, R. Douglas Martin, and Victor Yohai. Robust statistics. John Wiley & Sons, Chichester, 2006.
[38] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In FOCS, pages 93-102. IEEE Computer Society, 2010.
[39] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. CoRR, abs/1802.06485, 2018.
[40] Mark Rudelson and Roman Vershynin. The Littlewood-Offord problem and invertibility of random matrices. Adv. Math., 218(2):600-633, 2008.
[41] Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning mixtures of generalized linear models. In AISTATS, volume 51 of JMLR Workshop and Conference Proceedings, pages 1223-1231. JMLR.org, 2016.
[42] Terence Tao and Van Vu. The Littlewood-Offord problem in high dimensions and a conjecture of Frankl and Füredi. Combinatorica, 32(3):363-372, 2012.
[43] John W. Tukey. Mathematics and the picturing of data. Pages 523-531, 1975.
[44] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. arXiv e-prints, arXiv:1310.3745, October 2013.
[45] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Mixed linear regression with multiple components. In NIPS, pages 2190-2198, 2016.