# List-Decodable Sparse Mean Estimation

Shiwei Zeng, Department of Computer Science, Stevens Institute of Technology, szeng4@stevens.edu
Jie Shen, Department of Computer Science, Stevens Institute of Technology, jie.shen@stevens.edu

Abstract. Robust mean estimation is one of the most important problems in statistics: given a set of samples in $\mathbb{R}^d$ where an $\alpha$ fraction are drawn from some distribution $D$ and the rest are adversarially corrupted, we aim to estimate the mean of $D$. A surge of recent research interest has focused on the list-decodable setting where $\alpha \in (0, 1/2]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean. In this paper, we consider the case that the underlying distribution $D$ is Gaussian with $k$-sparse mean. Our main contribution is the first polynomial-time algorithm that enjoys sample complexity $O(\mathrm{poly}(k, \log d))$, i.e. poly-logarithmic in the dimension. One of our core algorithmic ingredients is using low-degree sparse polynomials to filter outliers, which may find further applications.

1 Introduction

Mean estimation is arguably a fundamental inference task in statistics and machine learning. Given a set of samples $\{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$ where an $\alpha$ fraction are drawn from some well-behaved (e.g. Gaussian) distribution $D$ and the rest are adversarially corrupted, the goal is to estimate the mean of $D$. In the noiseless case where $\alpha = 1$, the problem can be easily solved in view of the concentration of measure phenomenon [LT91]. However, this is rarely the case, as modern data sets are often contaminated by random noise or even by adversarial corruptions. Thus, a great deal of recent effort has focused on efficiently and robustly estimating the target mean in the presence of outliers.

Generally speaking, there is a phase transition between $\alpha > 1/2$ and $0 < \alpha \le 1/2$, and solving either problem in a computationally efficient manner is highly nontrivial. The regime where most of the samples are uncorrupted, i.e. $\alpha > 1/2$, has a very long history dating back to the 1960s [Tuk60, Hub64], yet only recently have computationally efficient algorithms been established [DKK+16, LRV16]. The other, more challenging regime is where an overwhelming fraction of the samples are corrupted, i.e. $\alpha \le 1/2$, which renders consistent estimation impossible. This motivates a line of research on list-decodable mean estimation [CSV17], where in place of outputting one single estimate, the algorithm is allowed to generate a finite list of candidates and is considered successful if at least one candidate in the list is sufficiently close to the target mean.

In this work, we investigate the problem of list-decodable mean estimation, for which a plethora of elegant results have been established in recent years. From a high level, most of them concern error guarantees and running time. For example, [CSV17] proposed the first tractable algorithm based on semidefinite programming, which runs in polynomial time and achieves the optimal error rate for variance-bounded distributions. [DKS18b] developed a multi-filtering scheme and showed that the error rate can be improved by using high-degree polynomials if the underlying distribution is Gaussian. The more recent works [CMY20, DKK+21a] further addressed the computational efficiency of this task and achieved almost linear running time in certain regimes.
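Before turning to sample complexity, here is a minimal sketch (ours, not from the paper) of the corruption model just described: an $\alpha$-fraction of the points are i.i.d. draws from $N(\mu, I_d)$ with a $k$-sparse $\mu$, and the remaining points are placed by an adversary. All parameter values and the particular "adversary" below are purely illustrative.

```python
import numpy as np

def make_corrupted_sample(n, d, k, alpha, rng=None):
    """Generate n points in R^d: an alpha-fraction from N(mu, I_d) with a
    k-sparse mean mu, the rest from an (arbitrary) adversarial distribution.
    This only illustrates the data model; it is not the paper's code."""
    rng = np.random.default_rng(rng)
    mu = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    mu[support] = rng.normal(0.0, 3.0, size=k)        # k-sparse target mean

    n_good = int(np.ceil(alpha * n))
    inliers = rng.normal(size=(n_good, d)) + mu       # i.i.d. N(mu, I_d)

    # A crude "adversary": outliers shifted far away; chosen only for illustration.
    outliers = rng.normal(size=(n - n_good, d)) + 10.0
    X = np.vstack([inliers, outliers])
    rng.shuffle(X)
    return X, mu

if __name__ == "__main__":
    X, mu = make_corrupted_sample(n=5000, d=1000, k=10, alpha=0.1, rng=0)
    # With alpha << 1/2, the naive sample mean is far from mu:
    print(np.linalg.norm(X.mean(axis=0) - mu))
```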
Although all of these algorithms exhibit near-optimal guarantees on either error rate or computational complexity, it turns out that far less attention has been devoted to another important metric: the sample complexity. In particular, the sample complexity of all these algorithms is $O(\mathrm{poly}(d))$; hence they quickly break down for data-demanding applications such as healthcare, where the number of available samples is typically orders of magnitude smaller than the dimension $d$ [Wai19]. Therefore, a pressing question that needs to be addressed in such a high-dimensional regime is the following: Does there exist a provably robust algorithm for list-decodable mean estimation that runs in polynomial time and enjoys a sample complexity bound of $O(\mathrm{polylog}(d))$?

In this paper, we answer the question in the affirmative by showing that when the target mean is $k$-sparse, i.e. it has at most $k$ non-zero elements, it is attribute-efficiently list-decodable.

Theorem 1 (Main result). Given parameter $\alpha \in (0, 1/2]$, failure probability $\delta \in (0, 1)$, a natural number $\ell \ge 1$, and a set $T$ of samples in $\mathbb{R}^d$, of which at least a $(2\alpha)$-fraction are independent draws from the Gaussian distribution $N(\mu, I_d)$ where $\|\mu\|_0 \le k$, there exists an algorithm that runs in time $\mathrm{poly}(|T|, d^{\ell}, 1/\delta)$, uses polynomials of degree at most $2\ell$, and returns a list of $O(1/\alpha)$ $k$-sparse vectors such that, with probability $1 - \delta$, the list contains at least one $\hat\mu \in \mathbb{R}^d$ with $\|\hat\mu - \mu\|_2 = \widetilde{O}(\alpha^{-1/(2\ell)})$, where $\widetilde{O}(\cdot)$ hides poly-logarithmic factors.

Remark 2. The key message of the theorem is that when the true mean is $k$-sparse, it is possible to efficiently approximate it with $O(\mathrm{polylog}(d))$ samples. This is in stark contrast to existing list-decodable results [CSV17, DKS18b, CMY20, DKK20a, DKK+21a], where the sample complexity is $O(\mathrm{poly}(d))$. The only attribute-efficient robust mean estimators are [BDLS17, DKK+19, CDK+21], but their results hold only for the mild corruption regime where $\alpha > 1/2$.

Remark 3. Our algorithm and analysis hold for any degree $\ell \ge 1$. When $\ell = 1$, the sample complexity reads as $\widetilde{O}(\alpha^{-7} k^{8} \log^{6} d)$ and the algorithm achieves error $\widetilde{O}(\alpha^{-1/2})$. As opposed to an $O(1 - \alpha)$ error rate obtained for $\alpha > 1/2$, the (non-vanishing) error rate $\widetilde{O}(\alpha^{-1/2})$ is typically what one can expect for list-decodable mean estimation under a bounded second-order moment condition, in light of the lower bounds in [DKS18b]. When leveraging degree-$2\ell$ polynomials in the algorithmic design, we obtain the improved $\widetilde{O}(\alpha^{-1/(2\ell)})$ error guarantee. In particular, when taking $\ell = \Theta(\log(1/\alpha))$, our algorithm achieves an error rate of $\widetilde{O}(\sqrt{\log(1/\alpha)})$ in quasi-polynomial time. This is very close to the minimax error rate of $\Theta(\sqrt{\log(1/\alpha)})$ established in [DKS18b].

Remark 4. If we further increase the sample size by a multiplicative factor governed by $\ell = \Theta(\log(1/\alpha))$, our algorithm will achieve an $O(\sqrt{\log(1/\alpha)})$ error guarantee, which matches the minimax lower bound. The proof follows the same pipeline and we leave it to interested readers.

1.1 Overview of Our Techniques

Our main algorithm is inspired by the multifiltering framework of [DKS18b], where the primary idea is to construct a sequence of polynomials that test the concentration of the samples against the Gaussian, so that the algorithm either certifies that the sample set behaves like a Gaussian one, or sanitizes it by removing a sufficient number of outliers. Our key technical contribution lies in a new design of sparse polynomials, together with new filtering rules tailored to these sparse polynomials.
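The following is a schematic sketch (our own, heavily simplified) of the certify-or-filter loop that underlies the multifiltering idea: a test statistic of the current sample set is compared against what a Gaussian sample would exhibit, and the set is either certified or pruned. The concrete tests, filters, and the splitting step used by the paper are given in Section 3; the function and parameter names below are hypothetical.

```python
import numpy as np

def certify_or_filter(T, test_statistic, gaussian_threshold, build_filter, max_rounds=100):
    """Schematic multifilter loop: repeatedly test whether the sample set T
    looks Gaussian under `test_statistic`; if not, remove points flagged by
    `build_filter`. This is an illustration of the framework only -- the
    paper's algorithm may also split T into two overlapping subsets."""
    T = np.asarray(T, dtype=float)
    for _ in range(max_rounds):
        stat = test_statistic(T)
        if stat <= gaussian_threshold:
            return T.mean(axis=0)          # certified: return an estimate
        keep = build_filter(T, stat)       # boolean mask of points to retain
        if keep.all():
            break                          # filter made no progress
        T = T[keep]
    return T.mean(axis=0)
```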
Sparse polynomials and sparsity-induced filters. To ensure that our algorithm is attribute-efficient, we only control the maximum eigenvalue of the sample covariance matrix on sparse directions. Since such a computation is NP-hard in general, we first consider a sufficient condition which tests the maximum Frobenius norm under a cardinality constraint, similar to the idea of [DKK+19]. If this Frobenius norm is small, it implies a small restricted eigenvalue and hence the sample mean is returned. Otherwise, we construct sparse polynomials, in the sense that they can be represented by a set of $O(\ell^2 k^{4\ell})$ basis polynomials and $O(\ell k^{2\ell})$ coordinates of the samples (see Definition 7), and measure the concentration of these sparse polynomials against the Gaussian. Since the underlying polynomials are sparse, we also design new sparsity-induced filters to certify the sample set, as otherwise a large number of clean samples would be removed. See Algorithm 3 and Algorithm 4.

Clustering by $L_\infty$-norm. Technically, the success of our attribute-efficient multifiltering approach hinges on the condition that all the samples lie within a small $L_\infty$-norm ball. It is not hard to see that all the Gaussian samples satisfy this condition, and we show that there is a simple scheme which can simultaneously prune and cluster the given samples into $O(1/\alpha)$ groups, such that the retained samples are close under the $L_\infty$-norm and at least one group contains most of the Gaussian samples. We note that the use of the $L_\infty$-norm as our metric ensures attribute efficiency of this step. An immediate implication of this clustering step is that the polynomials of Gaussian samples will be close enough, which facilitates the analysis of the performance of our filters. See Section 2.3.

1.2 Related Works

Breaking the barrier of the typical $O(\mathrm{poly}(d))$ sample complexity bound is one of the central problems across many fields of science and engineering. Motivated by real-world applications, a property termed sparsity is often assumed to this end, meaning that only $k$ out of the $d$ attributes contribute to the underlying inference problem. In this way, an improved bound of $O(\mathrm{poly}(k, \log d))$ can be obtained in many inference paradigms such as linear regression [CDS98, Tib96, CT05, Don06, SL17a, SL17b, SL18, WSL18], learning of threshold functions [Lit87, BHL95, STT12, PV13, ABHZ16, ZSA20, She20, SZ21], principal component analysis [Ma13, DKK+19], and mean estimation [BDLS17, DKK+19, CDK+21]. Unfortunately, the success of all these attribute-efficient algorithms hinges on the presumption that the majority of the data are uncorrupted.

Learning with mild corruption ($\alpha > 1/2$). Learning in the presence of noise has been extensively studied in a broad context. In supervised learning, where a sample consists of an instance (i.e. feature vector) and a label, much research effort was dedicated to robust algorithms under label noise [AL87, Slo88, MN06]. Recent years have witnessed significant progress towards optimal algorithms in the presence of label noise; see, for example, [KKMS05, ABL17, DKTZ20, ZSA20, DKK+20b, ZS22] and the references therein. The regime where both instances and labels are corrupted turns out to be significantly more challenging. The problem of learning halfspaces under this setting was put forward in the 1980s [Val85, KL88], yet only recently have efficient algorithms been established with near-optimal noise tolerance [ABL17, DKS18a, She21, SZ21].
In addition, [BJK15, KKM18, LSLC20] studied robust linear regression, and [BDLS17] presented a set of interesting results under various statistical models. More in line with this work is the problem of robust mean estimation; see the breakthrough works of [DKK+16, LRV16] and many follow-up works [DKK+17, BDLS17, DKS17, SCV18, KSS18, DKK+19, HLZ20, CDK+21].

Learning with overwhelming corruption ($\alpha \le 1/2$). The agnostic label noise model of [Hau92, KSS92] seems to be the earliest model that allows the adversary to arbitrarily corrupt any fraction of the data (say 70%), though it can only corrupt labels. Following [CSV17], a considerable number of recent works have studied the scenario where both instances and labels are grossly corrupted, and the goal is to output a finite list of candidate parameters among which at least one is a good approximation to the target. This includes list-decodable learning of mixture models [DKS18b, DKK+21b], regression [KKK19, RY20a], and subspace recovery [RY20b, BK21]. Interestingly, there are some works studying the problem under crowdsourcing models, where the samples are collected from crowd workers and most of them behave adversarially [SVC16, ABHM17, MV18, ZS21]. It is worth noting that [DKK+22] concurrently and independently developed a polynomial-time algorithm for the same problem, with an interesting difference-of-pairs metric to filter outliers.

1.3 Roadmap

We collect useful notations, definitions, and some preliminary results in Section 2. Our main algorithms are described in Section 3 along with their performance guarantees. We conclude the work in Section 4, and defer all proof details to the appendix.

2 Preliminaries

Vector, matrix, and tensor. For a $d$-dimensional vector $v = (v_1, \dots, v_d)$, denote by $\|v\|_2$ its $L_2$-norm, $\|v\|_1$ its $L_1$-norm, $\|v\|_0$ its $L_0$-norm that counts the number of non-zeros, and $\|v\|_\infty$ its infinity norm. The hard thresholding operator $\mathrm{trim}_k : \mathbb{R}^d \to \mathbb{R}^d$ keeps the $k$ largest elements (in magnitude) of a vector and sets the remaining ones to zero. Let $[d] := \{1, 2, \dots, d\}$ for some natural number $d > 0$. For an index set $\Lambda \subseteq [d]$, $v_\Lambda$ is the restriction of $v$ onto $\Lambda$. We say a vector is $k$-sparse if it has at most $k$ non-zero elements, and likewise for matrices and tensors. For a matrix $M$ of size $d_1 \times d_2$, denote by $\|M\|_F$ its Frobenius norm and by $\|M\|_*$ its nuclear norm. For $U \subseteq [d_1] \times [d_2]$, denote by $M_U$ the submatrix of $M$ with entries restricted to $U$. We also use tensors in our algorithms to ease expressions. Note that vectors and matrices can be seen as order-1 and order-2 tensors, respectively. We say that an order-$l$ tensor $A$ is symmetric if $A_{i_1,\dots,i_l} = A_{\sigma(i_1,\dots,i_l)}$ for all permutations $\sigma$. Given two tensors $A$ and $B$, denote by $A \otimes B$ the outer product (or tensor product) of $A$ and $B$. We will slightly abuse $\|A\|_2$ to denote the $L_2$-norm of a tensor $A$ by viewing it as a long vector.

Probability. We reserve the capital letter $G$ for a random draw from $N(\mu, I_d)$, i.e. $G \sim N(\mu, I_d)$, where $\mu \in \mathbb{R}^d$ is the target mean that we aim to estimate, which is assumed to be $k$-sparse. Suppose that $T$ is a finite sample set. We use $\mu_T$ to denote the sample mean of $T$, i.e. $\mu_T = \frac{1}{|T|}\sum_{x \in T} x$, and use $p(T)$ to denote the random variable $p(x)$ where $x$ is drawn uniformly from $T$.

Constants. The capital letter $C$ and its subscript variants such as $C_1, C_2$ are used to denote positive absolute constants. However, their values may change from appearance to appearance.
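As a small illustration of the hard thresholding operator $\mathrm{trim}_k$ defined above, here is a minimal sketch (the function name is ours):

```python
import numpy as np

def trim_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest,
    mirroring the operator trim_k : R^d -> R^d defined in the text."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(v), -k)[-k:]   # indices of the top-k magnitudes
    out[idx] = v[idx]
    return out

# Example: trim_k([0.1, -2.0, 0.3, 1.5], 2) keeps -2.0 and 1.5 and zeroes the rest.
```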
2.1 Polynomials

Let $x = (x_1, \dots, x_d)$ be a $d$-dimensional vector in $\mathbb{R}^d$, and let $a = (a_1, \dots, a_d) \in \mathbb{N}^d$ be a $d$-dimensional multi-index. A monomial of $x$ is a product of powers of the coordinates of $x$ with natural exponents, written as $x^a := \prod_{j=1}^d x_j^{a_j}$. A polynomial of $x$, $p(x)$, is a finite sum of its monomials multiplied by real coefficients; that is, $p(x) = \sum_{a \in A} c_a x^a$, where $A \subseteq \mathbb{N}^d$ is a finite set of multi-indices and the $c_a$'s are real coefficients. Note that the degree of $p(x)$ is given by $\max_{a \in A} \|a\|_1$. We denote by $\mathcal{P}(\mathbb{R}^d, l)$ the class of polynomials on $\mathbb{R}^d$ with degree at most $l$. We will often use the probabilist's Hermite polynomials, which form a complete orthogonal basis with respect to $N(0, I_d)$.

Definition 5 (Hermite polynomials). Let $x \in \mathbb{R}$ be a variate. For any natural number $l \in \mathbb{N}$, the degree-$l$ Hermite polynomial is defined as $He_l(x) = (-1)^l e^{x^2/2}\,\frac{\mathrm{d}^l}{\mathrm{d}x^l} e^{-x^2/2}$. For $a \in \mathbb{N}^d$ and $x \in \mathbb{R}^d$, the $d$-variate Hermite polynomial is given by $He_a(x) := \prod_{i=1}^d He_{a_i}(x_i)$, which is of degree $\|a\|_1$.

Harmonic and homogeneous polynomials. A polynomial $h(x) \in \mathcal{P}(\mathbb{R}^d, l)$ is called harmonic if it can be written as a linear combination of degree-$l$ Hermite polynomials. A polynomial $\mathrm{Hom}(x) \in \mathcal{P}(\mathbb{R}^d, l)$ is called homogeneous if all of its monomials have degree exactly $l$.

Fact 6. If a polynomial is degree-$l$ harmonic or homogeneous, then there is a one-to-one mapping between it and an order-$l$ symmetric tensor.

To see this, we may define an operation $\odot$ such that $He_l(x_i) \odot He_l(x_j) = He_l(x_i)\,He_l(x_j)$ if $i \ne j$, and it equals $He_{2l}(x_i)$ otherwise. Then any degree-$l$ Hermite polynomial can be written as $He_1(x_{i_1}) \odot He_1(x_{i_2}) \odot \cdots \odot He_1(x_{i_l})$, where all the indices $i_t \in [d]$. We will consider that one such sequence $(i_1, \dots, i_l)$ exactly corresponds to one degree-$l$ Hermite polynomial on $\mathbb{R}^d$, and there are $d^l$ such sequences, which form all degree-$l$ Hermite polynomials. In this sense, any harmonic polynomial $h(x)$ can be written as $h(x) = \sum_{i_1,\dots,i_l} A_{i_1,\dots,i_l}\, He_1(x_{i_1}) \odot He_1(x_{i_2}) \odot \cdots \odot He_1(x_{i_l})$, where the $A_{i_1,\dots,i_l}$'s are the coefficients, which form an order-$l$ tensor. If we choose $A$ to be symmetric, it is easy to see that $A$ fully represents $h(x)$. Then, we can convert back to the regular product by counting the number of times a particular index $j$ appears in $(i_1, \dots, i_l)$. If we denote this number by $c_j(i_1, \dots, i_l)$, we have
$$h_A(x) := \frac{1}{\sqrt{l!}} \sum_{i_1,\dots,i_l} A_{i_1,\dots,i_l} \prod_{j=1}^d He_{c_j(i_1,\dots,i_l)}(x_j), \quad \text{with } \sum_{j=1}^d c_j(i_1,\dots,i_l) = l, \qquad (2.1)$$
where the factor $1/\sqrt{l!}$ is only used to normalize the magnitude of $A$ to ease our analysis. Likewise, any homogeneous polynomial takes the form $\frac{1}{\sqrt{l!}} \sum_{i_1,\dots,i_l} A_{i_1,\dots,i_l} \prod_{j=1}^d x_j^{c_j(i_1,\dots,i_l)}$.

Sparse polynomials. In order to define sparse polynomials, we first specify a set of basis polynomials $\{b_1, \dots, b_{d^l}\} \subseteq \mathcal{P}(\mathbb{R}^d, l)$. In this paper, we will either choose this set as all degree-$l$ monomials or as all degree-$l$ Hermite polynomials.

Definition 7 ($(\kappa, \rho)$-sparse polynomials). We say that a polynomial $p \in \mathcal{P}(\mathbb{R}^d, l)$ is $(\kappa, \rho)$-sparse if it can be represented by at most $\kappa$ basis polynomials and $\rho$ coordinates of the input vector. We denote by $\mathcal{P}(\mathbb{R}^d, l, \kappa, \rho)$ the class of $(\kappa, \rho)$-sparse polynomials.

Note that when $\kappa$ and $l$ are fixed, $p(x)$ will depend on at most $\kappa l$ coordinates. Thus, the introduction of the parameter $\rho$ makes sense only when $\rho \le \kappa l$. In our algorithm, we will always have $l \le 2\ell$, $\kappa = 4\ell^2 k^{4\ell}$, and $\rho = 2\ell k^{2\ell}$ for some natural number $\ell \ge 1$.
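To close this subsection, here is a short sketch (ours) of evaluating the probabilist's Hermite polynomials of Definition 5 via the standard three-term recurrence, together with the $d$-variate product; the function names are hypothetical.

```python
import numpy as np

def hermite_prob(l, x):
    """Probabilist's Hermite polynomial He_l(x) via the recurrence
    He_0 = 1, He_1 = x, He_{j+1}(x) = x*He_j(x) - j*He_{j-1}(x)."""
    x = np.asarray(x, dtype=float)
    h_prev, h = np.ones_like(x), x.copy()
    if l == 0:
        return h_prev
    for j in range(1, l):
        h_prev, h = h, x * h - j * h_prev
    return h

def hermite_multi(a, x):
    """d-variate Hermite polynomial He_a(x) = prod_i He_{a_i}(x_i),
    which has degree ||a||_1 (Definition 5)."""
    return np.prod([hermite_prob(ai, np.asarray([xi]))[0] for ai, xi in zip(a, x)])

# He_2(x) = x^2 - 1, so hermite_multi([2, 1], [3.0, 2.0]) = (9 - 1) * 2 = 16.
```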
2.2 Representative Set and Good Set

To ease our analysis, we will need a deterministic condition on the set of uncorrupted samples.

Definition 8 (Representative set). Given $\alpha \in (0, 1/2]$ and $\delta \in (0, 1)$, we say that a sample set $S_G \subseteq \mathbb{R}^d$ is representative with respect to $\mathcal{P} := \mathcal{P}(\mathbb{R}^d, 2\ell, 4\ell^2 k^{4\ell}, 2\ell k^{2\ell})$ if the following holds for all $p \in \mathcal{P}$: $|\Pr[p(G) \ge 0] - \Pr[p(S_G) \ge 0]| \le \epsilon_0$, where $\epsilon_0 := \frac{\alpha^3}{100\, k^{2\ell} \log^{2\ell}(d\,|S_G|/\delta)}$.

We show that a sufficiently large set drawn independently from $N(\mu, I_d)$ is representative. The proof follows from classic VC theory, and is deferred to Appendix A.1.

Proposition 9 (Sample complexity). Given $\epsilon \in (0, 1/2]$ and $\delta \in (0, 1)$, let $S_G$ be a set consisting of $|S_G| = C\,\frac{\kappa(l+\rho)\log d}{\epsilon^2}\log\frac{\kappa(l+\rho)\log d}{\epsilon\,\delta}$ independent samples from $N(\mu, I_d)$, where $C > 0$ is a sufficiently large absolute constant. Then, with probability $1 - \delta$, $\sup_{p \in \mathcal{P}(\mathbb{R}^d, l, \kappa, \rho)} |\Pr[p(G) \ge 0] - \Pr[p(S_G) \ge 0]| \le \epsilon$. In particular, when $l = 2\ell$, $\kappa = 4\ell^2 k^{4\ell}$, $\rho = 2\ell k^{2\ell}$, and $\epsilon = \frac{\alpha^3}{100\,k^{2\ell}\log^{2\ell}(d\,|S_G|/\delta)}$ for some natural number $\ell \ge 1$, it suffices to pick $|S_G| = \widetilde{O}(\alpha^{-6}\,\ell^4\,k^{8\ell})$, with a sufficiently large leading constant, so that $S_G$ is a representative set.

Our algorithm will progressively remove samples from $T$, and a key property that ensures the success of the algorithm is that most corrupted samples are eliminated while almost all uncorrupted samples are retained. In other words, we hope that $T$ keeps containing a representative set that makes up a nontrivial fraction of it. For technical reasons, we also require that all samples in $T$ lie in a small $L_\infty$-ball.

Definition 10 ($\alpha$-good set). A multiset $T \subseteq \mathbb{R}^d$ is $\alpha$-good if the following holds:
1. There exists a set $S_G$ which is representative and satisfies $|S_G \cap T| \ge \max\{(1 - \alpha/6)\,|S_G|,\ \alpha |T|\}$.
2. $\max_{x,y \in T} \|x - y\|_\infty \le C\sqrt{\log(d\,|S_G|/\delta)}$ for some constant $C > 0$.

It is not hard to verify that the initial sample set $T$ satisfies the first condition; it will also fulfill the second one after a simple data pre-processing step, as described in the next section.

2.3 Clustering for the Initial List

Since the corrupted samples may behave adversarially, we perform a preliminary clustering step which splits $T$ into an initial list of subsets, among which at least one is $\alpha$-good in the sense of Definition 10. We first show that all Gaussian samples have bounded $L_\infty$-norm with high probability, which simply follows from the Gaussian tail bound.

Lemma 11. Given $\delta \in (0, 1)$, with probability $1 - \delta$, we have $\max_{x \in S_G} \|x - \mu\|_\infty \le 2\sqrt{\log(d\,|S_G|/\delta)}$, where $S_G$ is a set of samples drawn independently from $N(\mu, I_d)$.

The above observation implies that for any $x, y \in S_G$, their distance under the $L_\infty$ metric is at most $4\sqrt{\log(d\,|S_G|/\delta)} = \widetilde{O}(1)$, as long as the size of $S_G$ is of the same order as the one in Proposition 9. To guarantee the existence of such an $S_G$, it suffices to draw a corrupted sample set $T$ that is $1/\alpha$ times larger than $|S_G|$. The lemma below further shows that this is sufficient to guarantee the existence of an $\alpha$-good subset of $T$.

Algorithm 1 CLUSTER($T, \alpha, \delta, \ell$)
Require: A multiset of samples $T \subseteq \mathbb{R}^d$, parameter $\alpha \in (0, 1/2]$, failure probability $\delta \in (0, 1)$, degree of polynomials $\ell \ge 1$.
1: A set of centers $\mathcal{C} \leftarrow \emptyset$, radius $\gamma \leftarrow C_0\sqrt{\log(d\,|T|/\delta)}$ for some constant $C_0 > 0$.
2: For each $x \in T$, proceed as follows: if there are at least $\alpha|T|$ samples $y$ in $T$ that satisfy $\|x - y\|_\infty \le 2\gamma$, and no center $x' \in \mathcal{C}$ satisfies $\|x - x'\|_\infty \le 6\gamma$, then $\mathcal{C} \leftarrow \mathcal{C} \cup \{x\}$.
3: For each $x_i \in \mathcal{C}$, let $T_i = \{y \in T : \|x_i - y\|_\infty \le 6\gamma\}$.
4: return $\{T_1, \dots, T_{|\mathcal{C}|}\}$.

Lemma 12 (CLUSTER). Given $\alpha \in (0, 1/2]$ and $\delta \in (0, 1)$, let $T$ be the sample set given to the learner. If $|T| = \widetilde{O}(\alpha^{-7} k^{14\ell})$, with a sufficiently large leading constant, and a $(2\alpha)$-fraction of it are independent samples from $N(\mu, I_d)$, then Algorithm 1 returns a list of at most $1/\alpha$ subsets of $T$, such that with probability at least $1 - \delta$, at least one of them is an $\alpha$-good set.
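A minimal sketch of the greedy $L_\infty$-ball clustering in the spirit of Algorithm 1 is given below; the thresholds $2\gamma$ and $6\gamma$ follow the pseudocode above, while the constant in $\gamma$ and the code itself are only illustrative.

```python
import numpy as np

def cluster_linf(X, alpha, delta, C0=4.0):
    """Greedy l_inf clustering in the spirit of Algorithm 1 (CLUSTER):
    a point becomes a center if at least alpha*|T| points lie within
    l_inf distance 2*gamma of it and no existing center is within 6*gamma.
    The choice of C0 is illustrative."""
    n, d = X.shape
    gamma = C0 * np.sqrt(np.log(d * n / delta))
    centers = []
    for x in X:
        dists = np.max(np.abs(X - x), axis=1)            # l_inf distances to x
        dense_enough = np.sum(dists <= 2 * gamma) >= alpha * n
        far_from_centers = all(np.max(np.abs(x - c)) > 6 * gamma for c in centers)
        if dense_enough and far_from_centers:
            centers.append(x)
    # Each center induces one candidate subset T_i.
    return [X[np.max(np.abs(X - c), axis=1) <= 6 * gamma] for c in centers]
```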
As will be clear in our analysis, the motivation for bounding the $L_\infty$-distance is to make sure that the function value of any $p(x) = h_A(x - \mu_T) \in \mathcal{P}(\mathbb{R}^d, l, \kappa, \rho)$ is bounded for samples in the $\alpha$-good subset $T_i$. This is because, when there exists a significant fraction of good samples in $T_i$, we want to efficiently distinguish the corrupted samples from the uncorrupted ones. A value-bounded polynomial function facilitates our analysis of the function variance.

Lemma 13. Suppose that $T$ is $\alpha$-good and a polynomial $p \in \mathcal{P}(\mathbb{R}^d, l, 4\ell^2 k^{4\ell}, 2\ell k^{2\ell})$ satisfies the following: there exists a symmetric order-$l$ tensor $A$ such that $\|A\|_2 \le 1$ and $p(x) = h_A(x - \mu_T)$. Then, it holds that $\max_{x,y \in T} |p(x) - p(y)| \le 2k^{\ell}\gamma^{l}$, where $\gamma = C_0\sqrt{\log(d\,|T|/\delta)}$.

3 Main Algorithms and Performance Guarantees

We start with a review of the multifiltering framework that has been broadly used in prior works [DKS18b, DKK20a, DKK+21b], followed by a highlight of our new techniques. The multifiltering framework, i.e. Algorithm 2, includes three major steps. The first step is to invoke CLUSTER (Algorithm 1) to generate an initial list $L$ which guarantees the existence of an $\alpha$-good subset of $T$ (see Lemma 12). We then imagine a tree whose root is the original contaminated sample set $T$ and whose children of the root represent the members of $L$. The algorithm iterates through these child nodes and performs one of the following: (1) creating a leaf node which is an estimate of the target mean; (2) creating one or two child nodes which are subsets of the parent node; (3) certifying that the set cannot be $\alpha$-good and deleting the branch. In the end, if no leaf of the tree can be further split or deleted, the means of the subsets on the leaf nodes are collected as a list $M$. It is worth noting that the goal of the algorithmic design is to guarantee that there always exists a branch that includes only $\alpha$-good subsets. In other words, at any level of the algorithm, at least one of the subsets of $T$ is $\alpha$-good, which ensures the existence of a good estimate in the returned list $M$. The final step is a black-box algorithm that reduces the size of $M$ from $O(\mathrm{poly}(1/\alpha))$ to $O(1/\alpha)$, which is due to [DKS18b]. Our technical contributions lie in an attribute-efficient implementation of the first and second steps. In this section, we elaborate on the second step, i.e. the ATTRIBUTE-EFFICIENT-MULTIFILTER algorithm.

3.1 Overview of Attribute-Efficient Multifiltering

The ATTRIBUTE-EFFICIENT-MULTIFILTER algorithm is presented in Algorithm 3. The starting point of the algorithm is the well-known fact that if the adversary were to significantly deteriorate our estimate of $\mu$, the spectral norm of a certain sample covariance matrix would become large [DKK+16, LRV16]. In order to achieve an attribute-efficient sample complexity of $O(\mathrm{poly}(k, \log d))$, it is however vital to control the spectral norm only on $k^{\ell}$-sparse directions for some pre-specified polynomial degree $\ell \ge 1$, which can in turn be certified by a small Frobenius norm restricted to the largest $k^{2\ell}$ entries. If the restricted Frobenius norm is sufficiently small, it implies that the sample covariance matrix behaves as a Gaussian one, and the algorithm returns the empirical mean truncated to be $k$-sparse (see Step 4). Otherwise, the algorithm will invoke either BASICMF (i.e. Algorithm 4)

Algorithm 2 Main Algorithm: Attribute-Efficient List-Decodable Mean Estimation
Require: A multiset of samples $T \subseteq \mathbb{R}^d$, parameter $\alpha \in (0, 1/2]$, failure probability $\delta \in (0, 1)$, degree of polynomials $\ell \ge 1$.
1: $\{T_1, \dots, T_m\} \leftarrow \mathrm{CLUSTER}(T, \alpha, \delta, \ell)$, $L \leftarrow \{(T_1, \alpha/2), \dots, (T_m, \alpha/2)\}$, $M \leftarrow \emptyset$.
2: while L 6= ; do 3: (T 0, 0) an element in L, L L\{(T 0, 0)}. 4: ANS ATTRIBUTE-EFFICIENT-MULTIFILTER(T 0, 0, / |T| , ). (i) if ANS is a vector then add it into M. (ii) if ANS is a list of (Ti, i) then append those with i 1 to L. (iii) if ANS = NO then go to the next iteration. 5: end while 6: return LISTREDUCTION(T, , , M). Algorithm 3 ATTRIBUTE-EFFICIENT-MULTIFILTER(T, , , ) Require: A multiset of samples T Rd, parameter 2 (0, 1/2], failure probability 2 (0, 1), degree of polynomials 1. 1: E[Pd, (T µT ) Pd, (T µT )>], and Pd, (x) is the column vector of all degree Hermite polynomials of x. 2: {(it, jt)} 1 2 (k2 +k ) t 1 index set of the k diagonal entries and 1 2(k2 k ) entries above the main diagonal of with largest magnitude. U {(it, jt)}t 1 [ {(jt, it)}t 1, U 0 I I, with I = {it}t 1 [ {jt}t 1. C1 ( + C1 log 1 ) log2(2 + log 1 2 for large enough constant C1 > 0. sparse then return ˆµ trimk(µT ). 5: (λ , v ) the largest eigenvalue and eigenvector of ( )U 0. 6: if λ λ sparse then 7: if = 1 then ANS BASICMF(T, , , p1) else ANS HARMONICMF(T, , , p1) where p1(x) := v Pd, (x µT ). 8: else 9: p2(x) 1 k A0k F Pd, (x µT )> A0 Pd, (x µT ) with A0 := ( )U 0. 10: ANS HARMONICMF(T, , , p2). 11: end if 12: return ANS. or HARMONICMF (i.e. Algortihm 5) to examine the concentration of a polynomial of the empirical data to that of Gaussian. Both algorithms will either assert that the current sample set does not contain a sufficiently large amount of Gaussian samples, or will prune many corrupted samples to increase the fraction of Gaussian ones. A more detailed description of the two algorithms can be found in Section 3.2.1 and Section 3.2.2 respectively. What is subtle in Algorithm 3 is that we will check the maximum eigenvalue λ of the empirical covariance matrix restricted on a carefully chosen subset U 0, which corresponds to the maximum eigenvalue on a certain (2k2 )-sparse direction. If λ is too large, this indicates an easy problem since it must be the case that the adversary corrupted the samples in an aggressively way. Therefore, it suffices to prune outliers using a degreepolynomial p1 which is simply the projection of Pd, (x µT ) onto the span of the maximum eigenvector; see Step 7 in Algorithm 3. On the other hand, if λ is on a moderate scale, it indicates that the adversary corrupted the samples in a very delicate way so that it passes the tests of both Frobenius norm and spectral norm. Now the main idea is to check the concentration of higher degree polynomials induced by the sample set; we show that it suffices to construct a degree-2 harmonic polynomial; see Step 10. While sparse mean estimation has been studied in [DKK+19] and the idea of using restricted Frobenius norm and filtering was also developed, we note that their analysis only holds in the mild corruption regime where > 1/2. To establish the main results, we will leverage the tools from [DKS18b], with a specific treatment on the fact that µ is k-sparse, to ensure an attribute-efficient sample complexity bound. As we will show later, a key idea to this end is to utilize a sequence of carefully chosen sparse polynomials in the sense of Definition 7 along with sparsity-induced filters. The performance guarantee of ATTRIBUTE-EFFICIENT-MULTIFILTER is as follows. Theorem 14 (Algorithm 3). Consider Algorithm 3 and denote by ANS its output. With probability 1 , the following holds. ANS cannot be TBD. If ANS is a k-sparse vector and if T is -good, then kµ ˆµk2 O . If ANS = NO, then T is not -good. 
If ANS $= \{(T_i, \alpha_i)\}_{i=1}^m$ for some $m \le 2$, then $T_i \subseteq T$ for all $i \in [m]$ and $\sum_{i=1}^m \alpha_i \ge \alpha/2$; if additionally $T$ is $\alpha$-good, then at least one $T_i$ is $\alpha_i$-good. Finally, the algorithm runs in time $O(\mathrm{poly}(|T|, d^{\ell}))$.

3.2 Analysis of ATTRIBUTE-EFFICIENT-MULTIFILTER

We first show that if the restricted Frobenius norm of the sample covariance matrix is small, then the sample mean is a good estimate of the target mean.

Lemma 15. Consider Algorithm 3. If the algorithm returns a vector $\hat\mu$ at Step 4 and if $T$ is $\alpha$-good, then $\|\hat\mu - \mu\|_2 = \widetilde{O}(\alpha^{-1/(2\ell)})$, where the hidden factors are poly-logarithmic in $1/\alpha$.

Next, we give performance guarantees for the remaining steps of Algorithm 3, where we consider the case that the algorithm does not return at Step 4. Namely, the algorithm will reach either Step 7 or Step 10, and will return the ANS obtained there. These two steps invoke BASICMF or HARMONICMF on different sparse polynomials. Observe that both algorithms may return 1) NO, which certifies that the current input set $T$ is not $\alpha$-good; 2) a list of subsets $\{(T_i, \alpha_i)\}_{i=1}^m$ for some $m \le 2$, on which Algorithm 3 will be called in a recursive manner; or 3) TBD, which indicates that the algorithm is uncertain about $T$ being $\alpha$-good. In the following, we prove that the way we invoke BASICMF and HARMONICMF ensures that they never return TBD when called within Algorithm 3. We then give performance guarantees for these two filtering algorithms when they return NO or $\{(T_i, \alpha_i)\}_{i=1}^m$, thus establishing Theorem 14.

Let us consider the case that the algorithm reaches Step 7, i.e. the largest eigenvalue on one sparse direction is larger than the threshold $\lambda_{\mathrm{sparse}}$. It is easy to see that when $\ell = 1$, ANS cannot be TBD, since the only way that BASICMF returns TBD is when $\mathrm{Var}[p(T)]$ is not too large, but this would violate the condition $\lambda^* > \lambda_{\mathrm{sparse}}$ in view of our setting of $\lambda_{\mathrm{sparse}}$. Similarly, we show that in the large-$\lambda^*$ regime, HARMONICMF will not return TBD either. Thus, we have the following lemma.

Lemma 16. Consider Algorithm 3. If it reaches Step 7, then ANS $\ne$ TBD.

It remains to consider the case that the algorithm reaches Step 10, which is more subtle since the evidence from the magnitude of the largest restricted eigenvalue is not strong enough to prune outliers. Note that this could happen even when $T$ contains many outliers, since $\lambda^*$ is not the maximum eigenvalue over all sparse directions but only over a submatrix indexed by $U'$. Fortunately, if $\lambda^*$ is not large, we show that the algorithm can still make progress by calling HARMONICMF on degree-$2\ell$ sparse polynomials. This is because higher-degree polynomials are more sensitive to outliers than low-degree polynomials, as long as we can certify the concentration of high-degree polynomials on clean samples. As a result, we have the following guarantee.

Lemma 17. Consider Algorithm 3. If it reaches Step 10, then ANS $\ne$ TBD.

3.2.1 Basic Multifilter for Sparse Polynomials

The BASICMF algorithm (Algorithm 4) is a key ingredient in the multifiltering framework. It takes as input a sparse polynomial $p$ and uses it to certify whether $T$ is $\alpha$-good and sufficiently concentrated. The central idea is to measure how $p(T)$ is distributed and compare it to the distribution of $p(G)$. We require that the input $p$ have certifiable variance on $G$, i.e. $\mathrm{Var}[p(G)] \le 1$, as otherwise it could filter away a large number of the good samples.
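Before the pseudocode, here is a bare-bones sketch (ours, heavily simplified) of the case analysis BASICMF performs on the scalar projections $\{p(x) : x \in T\}$: check whether most values fall in a short interval, then compare the empirical variance of $p(T)$ against a Gaussian-calibrated threshold. The thresholds are left as inputs here; in the paper they are explicit functions of $\alpha$ and the polynomial degree.

```python
import numpy as np

def basic_filter_decision(pvals, alpha, interval_len, var_threshold):
    """Simplified skeleton of the BASICMF case analysis on the projected
    values pvals = {p(x) : x in T}. Returns "TBD", "FILTER", or "SPLIT".
    This is an illustration of the logic only, not the paper's routine."""
    pvals = np.sort(np.asarray(pvals, dtype=float))
    n = len(pvals)
    # Is there a short interval [a, b] containing a (1 - 2*alpha)-fraction of values?
    w = int(np.ceil((1 - 2 * alpha) * n))
    short_interval = w >= 1 and np.min(pvals[w - 1:] - pvals[:n - w + 1]) <= interval_len
    if short_interval:
        if np.var(pvals) <= var_threshold:
            return "TBD"       # looks as concentrated as a Gaussian projection would
        return "FILTER"        # large variance: prune points in the far tails
    return "SPLIT"             # bimodal-looking: split T into two overlapping subsets
```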
We note that the bounded variance condition is always satisfied for degree-1 Hermite polynomials under proper normalization, while for high-degree polynomials, one cannot invoke BASICMF directly (see Section 3.2.2 for a remedy). The way that BASICMF certifies the input sample set T not being -good is quite simple: if not all samples lie in a small L1-ball, it returns NO at Step 2, in that this contradicts Lemma 13. Otherwise, the algorithm will attempt to search for a finer interval [a, b] such that it includes most of the samples. If such interval exists, then either the adversary corrupted the samples such that the sample variance is as small as that of Gaussian while the sample mean may deviate far from the target, in which case BASICMF returns TBD at Step 5; or the sample variance is large, in which case Algorithm 4 BASICMF(T, , , p) Require: A multiset of samples T Rd, parameter 2 (0, 1/2], failure probability 2 (0, 1), a polynomial p 2 P(Rd, l, 4 2k4 , 2 k2 ) such that l 2 , Var[p(G)] 1, and p(x) = h A(x µT ). 1: R (C1 log 1 2: if maxx,y2T |p(x) p(y)| > 2k γl then return NO . 3: if there is an interval [a, b] of length C1 R log(2 + log 1 ) that contains at least (1 2 )-fraction of samples in {p(x) : x 2 T} then 4: if Var[p(T)] C1 "l log2(2 + log 1 ) then 5: return TBD . 6: else 7: Find a threshold t > 2R such that min{|p(x) a| , |p(x) b|} t exp( (t 2R)2/l) + 2 2 8: T 0 {x 2 T : min{|p(x) a| , |p(x) b|} t}, 0 9: return {(T 0, 0)}. 10: end if 11: else 12: Find t 2 R, R0 > 0 such that the sets T1 := {x 2 T : p(x) > t R0} and T2 := {x 2 T : p(x) < t + R0} satisfy |T1|2 + |T2|2 |T|2 (1 /100)2 and |T| max(|T1| , |T2|) |T| /4. 13: i (1 2/100) |T| / |Ti|, for i = 1, 2. 14: return {(T1, 1), (T2, 2)}. 15: end if it is possible to construct a sparsity-induced filter to prune outliers (see Steps 7 and 8). We note that in Step 7, the first term on the right-hand side is derived from Chernoff bound for degree-l Gaussian polynomials and the second term is due to concentration of empirical samples to Gaussian (see Definition 8), both of which are scaled by a factor 8/ so that the number of the samples removed from T is 8/ times more than that of the good samples in the representative set SG T, which means most of the removed samples are outliers. We show by contradiction the existence of the threshold t (see Lemma 26). In fact, had such threshold t not existed, the set T must be sufficiently concentrated such that the algorithm would have returned at Step 5. This essentially relies on our result of the initial clustering of Algorithm 1, which guarantees that each subset T is bounded in a small L1-ball and the function value of p on the -good T does not change drastically (Lemma 13). We then show that equipped with such threshold t, T 0 is a subset of T and it is 0-good if T is -good (Lemma 28). When there is no such short interval [a, b], the algorithm splits T into two overlapping subsets {T1, T2} such that T1 \ T2 is large enough to contain most of the samples in SG. This guarantees that most of the samples in SG (if T is -good) are always contained in one subset and thus there always exists an -good subset of T. We show that an appropriate threshold t can also be found at Step 12 (Lemma 30), and at least one Ti is i-good if T is -good. As a result, we have the following guarantees for Algorithm 4; see Appendix B for the full proof. Theorem 18 (BASICMF). Consider Algorithm 4. Denote by ANS its return. Suppose that T being -good implies Var[p(G)] 1. Then with probability 1 , the following holds. 
ANS is either NO , TBD , or a list of {(Ti, i)}m i=1 with m 2. 1) If ANS = NO, then T is not -good. 2) If ANS = TBD, then Var[p(T)] O l log2(2+log 1 ; and if additionally T is -good, then |E[p(G)] E[p(T)]| O l 2 log(2 + log 1 . 3) If ANS = {(Ti, i)}m i=1, then Ti T and P i 1 2 for all i 2 [m]; if additionally T is -good, then at least one Ti is i-good. Algorithm 5 HARMONICMF(T, , , p) Require: A multiset of samples T Rd, parameter 2 (0, 1/2], failure probability 2 (0, 1), a polynomial p 2 P(Rd, l, 2 k2 , 2 k2 ) such that p(x) = h A(x µT ) and k Ak2 = 1. 1: for l0 = 0, 1, . . . , l do 2: Let B(l0) be an order-2l0 tensor with i1,...,il0,j1,...,jl0 = kl0+1,...,kl Ai1...,il0,kl0+1,...,kl Aj1...,jl0,kl0+1,...,kl. 3: Consider B(l0) as a dl0 dl0 symmetric matrix by grouping each of the i1, . . . , il0 and j1, . . . , jl0 coordinates together. Apply eigenvalue decomposition on B(l0) to obtain B(l0) = P i λi Vi Vi. 4: ANSi MULTILINEARMF(T, Vi, l0, , /(ldl)) for every Vi. If ANSi = NO or a list of {(Tj, j)} for some i, then return ANSi. If ANSi = TBD, continue. 5: end for 6: ANS BASICMF(T, , , 1 β h A(x µT )) with β := C1 (1 + log 1 ) log2(2 + log 1 2 . If ANS = NO or a list of (Tj, j), return ANS. If ANS = TBD, still return NO . 3.2.2 Harmonic Multifilter with Hermite Polynomials Recall that applying BASICMF (Algorithm 4) on a polynomial p requires Var[p(G)] 1. It is nontrivial to verify this condition for a high-degree polynomial p, as the variance of high-degree Gaussian polynomials depends on the distribution mean, i.e. µ µT in this case, which is unfortunately unknown. As a remedy, notice that for any harmonic polynomial h A(x), Ex N(µ0,I)[h A(x)2] equals the summation of homogeneous polynomials of µ0, which can also be seen as the expectation of multilinear polynomials over independent variables X(i) N(µ0, Id). Thus, we only need to verify the expectation of these corresponding multilinear polynomials, whose variance on G does not hinge on µ0. The harmonic multifilter is presented in Algorithm 5, where the subroutine MULTILINEARMF can be found in Appendix D. We first present the guarantee when Algorithm 5 returns all TBD at Step 4 and reaches Step 6, where we can certify a bounded variance for h A(x µT ) on G. Lemma 19 (Variance of p). Consider Algorithm 5. If it reaches Step 6 and T is -good, then we have E[h A(G µT )2] β2. Based on Lemma 19, we have that Var[h A(G µT )/β] 1, for which we can invoke BASICMF on h A(x µT )/β and Theorem 18 can be applied immediately. We are ready to elaborate the proof ideas for Lemma 16 and 17. First, observe that BASICMF returns TBD at Step 6 if and only if Var[h A(T µT )/β] C1 "l log2(2 + log 1 ). Now return to Algorithm 3. When h A(x µT ) = p1(x), this could not happen because Var[p1(T)] = Var[v Pd, (T µT )] λ λ C1 ( + C1 log 1 ) log2(2 + log 1 " log2(2 + log 1 ). A contradiction that implies Lemma 16. When h A(x µT ) = p2(x), the case is more delicate. Here, we instead show that T must not be -good and HARMONICMF will return NO correctly. This is because if T is -good, Proposition 18 implies that E[p2(G)] is close to E[p2(T)], and together with Lemma 19 we can show that E[p2(T)] is small. However, by construction ++ = E[p2(T)] λ sparse, a contradiction that gives Lemma 17. The detailed proof can be found in Appendix B.3. 
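As a concrete illustration of Steps 2-3 of Algorithm 5, the partial self-contraction of the coefficient tensor $A$ can be expressed with numpy's tensordot, after which the result is flattened into a $d^{l'} \times d^{l'}$ symmetric matrix and eigendecomposed. This is a sketch under the notation above, not the paper's implementation, and the shapes are only illustrative.

```python
import numpy as np

def flatten_contraction(A, l_prime):
    """Given an order-l coefficient tensor A, contract the last l - l_prime
    indices of A with themselves, i.e.
        B[i_1..i_{l'}, j_1..j_{l'}] = sum_k A[i_1..i_{l'}, k] * A[j_1..j_{l'}, k],
    then reshape B into a d^{l'} x d^{l'} matrix and eigendecompose it.
    This mirrors Steps 2-3 of Algorithm 5 (HARMONICMF) as an illustration."""
    l = A.ndim
    d = A.shape[0]
    contracted_axes = list(range(l_prime, l))
    B = np.tensordot(A, A, axes=(contracted_axes, contracted_axes))
    B_mat = B.reshape(d ** l_prime, d ** l_prime)
    eigvals, eigvecs = np.linalg.eigh(B_mat)        # B_mat is symmetric PSD
    return eigvals, eigvecs

# Example: for an order-2 tensor (matrix) A and l_prime = 1, B_mat equals A @ A.T.
```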
4 Conclusion and Future Work In this paper, we developed an attribute-efficient mean estimation algorithm which achieves sample complexity poly-logarithmic in the dimension with low-degree sparse polynomials under the listdecodable setting. A natural question is whether the current techniques could be utilized to attributeefficiently solve the other list-decodable problems, such as learning of halfspaces and linear regression. Acknowledgments and Disclosure of Funding We thank the anonymous reviewers and meta-reviewer for valuable discussions. This work is supported by NSF-IIS-1948133 and the startup funding from Stevens Institute of Technology. [ABHM17] Pranjal Awasthi, Avrim Blum, Nika Haghtalab, and Yishay Mansour. Efficient PAC learning from the crowd. In Proceedings of the 30th Annual Conference on Learning Theory, pages 127 150, 2017. [ABHZ16] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learn- ing and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Annual Conference on Learning Theory, pages 152 192, 2016. [ABL17] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50:1 50:27, 2017. [AL87] Dana Angluin and Philip D. Laird. Learning from noisy examples. Machine Learning, 2(4):343 370, 1987. [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Annual Conference on Learning Theory, pages 169 212, 2017. [BHL95] Avrim Blum, Lisa Hellerstein, and Nick Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences, 50(1):32 40, 1995. [BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard threshold- ing. In NIPS, pages 721 729, 2015. [BK21] Ainesh Bakshi and Pravesh K. Kothari. List-decodable subspace recovery: Dimen- sion independent error in polynomial time. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, pages 1279 1297, 2021. [CDK+21] Yu Cheng, Ilias Diakonikolas, Daniel M. Kane, Rong Ge, Shivam Gupta, and Mahdi Soltanolkotabi. Outlier-robust sparse estimation via non-convex optimization. Co RR, abs/2109.11515, 2021. [CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposi- tion by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33 61, 1998. [CMY20] Yeshwanth Cherapanamjeri, Sidhanth Mohanty, and Morris Yau. List decodable mean estimation in nearly linear time. In 61st IEEE Annual Symposium on Foundations of Computer Science, pages 141 148. IEEE, 2020. [CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47 60, 2017. [CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203 4215, 2005. [DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 655 664, 2016. [DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. 
In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 999 1008. PMLR, 2017. [DKK+19] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price, and Alistair Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Neur IPS, pages 10688 10699, 2019. [DKK20a] Ilias Diakonikolas, Daniel Kane, and Daniel Kongsgaard. List-decodable mean esti- mation via iterative multi-filtering. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, 2020. [DKK+20b] Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. A polynomial time algorithm for learning halfspaces with Tsybakov noise. Co RR, abs/2010.01705, 2020. [DKK+21a] Ilias Diakonikolas, Daniel Kane, Daniel Kongsgaard, Jerry Li, and Kevin Tian. List- decodable mean estimation in nearly-pca time. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, pages 10195 10208, 2021. [DKK+21b] Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, and Kevin Tian. Clustering mixture models in almost-linear time via list-decodable mean estimation. Co RR, abs/2106.08537, 2021. [DKK+22] Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, and Thanasis Pittas. List-decodable sparse mean estimation via difference-of-pairs filtering. Co RR, abs/2206.05245, 2022. [DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science, pages 73 84, 2017. [DKS18a] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing, pages 1061 1073, 2018. [DKS18b] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047 1060. ACM, [DKTZ20] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with Massart noise under structured distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, pages 1486 1513, 2020. [Don06] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289 1306, 2006. [Hau92] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78 150, 1992. [HLZ20] Samuel B. Hopkins, Jerry Li, and Fred Zhang. Robust and heavy-tailed mean estimation made simple, via regret minimization. Co RR, abs/2007.15839, 2020. [Hub64] Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73 101, 1964. [KKK19] Sushrut Karmalkar, Adam R. Klivans, and Pravesh Kothari. List-decodable linear regression. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, pages 7423 7432, 2019. [KKM18] Adam R. Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Proceedings of the 31st Annual Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1420 1430. PMLR, 2018. [KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. 
Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 11 20, 2005. [KL88] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 267 280, 1988. [KSS92] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, pages 341 352, 1992. [KSS18] Pravesh K. Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035 1046, 2018. [Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear- threshold algorithm. In Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pages 68 77, 1987. [LRV16] Kevin A. Lai, Anup B. Rao, and Santosh S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 665 674, 2016. [LSLC20] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. In The 23rd International Conference on Artificial Intelligence and Statistics,, volume 108 of Proceedings of Machine Learning Research, pages 411 421. PMLR, 2020. [LT91] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag Berlin Heidelberg, 1991. [Ma13] Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772 801, 2013. [MN06] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, pages 2326 2366, 2006. [MV18] Michela Meister and Gregory Valiant. A data prism: Semi-verified learning in the small-alpha regime. In Proceedings of the 31st Conference On Learning Theory, pages 1530 1546, 2018. [PV13] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482 494, 2013. [RY20a] Prasad Raghavendra and Morris Yau. List decodable learning via sum of squares. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, pages 161 180, 2020. [RY20b] Prasad Raghavendra and Morris Yau. List decodable subspace recovery. In Proceedings of the 33rd Annual Conference on Learning Theory, pages 3206 3226, 2020. [SCV18] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proceedings of the 9th Innovations in Theoretical Computer Science Conference, pages 45:1 45:21, 2018. [She20] Jie Shen. One-bit compressed sensing via one-shot hard thresholding. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, pages 510 519, 2020. [She21] Jie Shen. Sample-optimal PAC learning of halfspaces with malicious noise. In Pro- ceedings of the 38th International Conference on Machine Learning, pages 9515 9524, 2021. [SL17a] Jie Shen and Ping Li. On the iteration complexity of support recovery via hard thresholding pursuit. In Proceedings of the 34th International Conference on Machine Learning, pages 3115 3124, 2017. [SL17b] Jie Shen and Ping Li. Partial hard thresholding: Towards a principled analysis of support recovery. 
In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 3127 3137, 2017. [SL18] Jie Shen and Ping Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1 42, 2018. [Slo88] Robert H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 91 96, 1988. [STT12] Rocco A. Servedio, Li-Yang Tan, and Justin Thaler. Attribute-efficient learning and weight-degree tradeoffs for polynomial threshold functions. In Proceedings of the 25th Annual Conference on Learning Theory, pages 1 19, 2012. [SVC16] Jacob Steinhardt, Gregory Valiant, and Moses Charikar. Avoiding imposters and delinquents: Adversarial crowdsourcing and peer prediction. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 4439 4447, 2016. [SZ21] Jie Shen and Chicheng Zhang. Attribute-efficient learning of halfspaces with malicious noise: Near-optimal label complexity and noise tolerance. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pages 1072 1113, 2021. [Tib96] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267 288, 1996. [Tuk60] John W. Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, pages 448 485, 1960. [Val85] Leslie G. Valiant. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 560 566, 1985. [Wai19] Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cam- bridge University Press, 2019. [WSL18] Jing Wang, Jie Shen, and Ping Li. Provable variable selection for streaming features. In Proceedings of the 35th International Conference on Machine Learning, pages 5158 5166, 2018. [ZS21] Shiwei Zeng and Jie Shen. Semi-verified learning from the crowd with pairwise comparisons. Co RR, abs/2106.07080, 2021. [ZS22] Shiwei Zeng and Jie Shen. Efficient PAC learning from the crowd with pairwise com- parisons. In Proceedings of the 39th International Conference on Machine Learning, pages 25973 25993, 2022. [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pages 7184 7197, 2020. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] (c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] See the appendix. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experi- mental results (either in the supplemental material or as a URL)? [N/A] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] (c) Did you report error bars (e.g., with respect to the random seed after running experi- ments multiple times)? 
[N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]