Outlier-Robust Sparse Estimation via Non-Convex Optimization

Yu Cheng (Brown University, Providence, RI 02912, yu_cheng@brown.edu); Ilias Diakonikolas (University of Wisconsin-Madison, Madison, WI 53706, ilias@cs.wisc.edu); Rong Ge (Duke University, Durham, NC 27708, rongge@cs.duke.edu); Shivam Gupta (University of Texas at Austin, Austin, TX 78712, shivamgupta@utexas.edu); Daniel M. Kane (University of California, San Diego, La Jolla, CA 92093, dakane@cs.ucsd.edu); Mahdi Soltanolkotabi (University of Southern California, Los Angeles, CA 90089, soltanol@usc.edu)

We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yields a near-optimal solution for the underlying robust estimation task. As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks.1 The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work.

1 Introduction

In several modern machine learning (ML) applications, such as ML security [BNJT10, BNL12, SKL17, DKK+19a] and exploratory analysis of real datasets, e.g., in population genetics [RPW+02, PLJD10, LAT+08, DKK+17], typical datasets contain a non-trivial fraction of arbitrary (or even adversarial) outliers. Robust statistics [HRRS86, HR09] is the subfield of statistics aiming to design estimators that are tolerant to a constant fraction of outliers, independent of the dimensionality of the data. Early work in this field, see, e.g., [Tuk60, Hub64, Tuk75], developed sample-efficient robust estimators for various basic tasks, alas with runtime exponential in the dimension. During the past five years, a line of work in computer science, starting with [DKK+16, LRV16], has developed the first computationally efficient robust high-dimensional estimators for a range of tasks. This progress has led to a revival of robust statistics from an algorithmic perspective (see, e.g., [DK19, DKK+21] for surveys on the topic).

In this work, we focus on high-dimensional estimation tasks in the presence of sparsity constraints. To rigorously study these problems, we need to formally define the model of data corruption. Throughout this work, we work with the following standard contamination model.

1 An implementation of our algorithms is available at https://github.com/guptashvm/Sparse-GD.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Definition 1.1 (Strong Contamination Model, see [DKK+16]). Given a parameter $0 < \epsilon < 1/2$ and a distribution family $\mathcal{D}$ on $\mathbb{R}^d$, the adversary operates as follows: The algorithm specifies a number of samples $n$, and $n$ samples are drawn from some unknown $D \in \mathcal{D}$. The adversary is allowed to inspect the samples, remove up to $\epsilon n$ of them, and replace them with arbitrary points. This modified set of $n$ points is then given as input to the algorithm. We say that a set of samples is $\epsilon$-corrupted if it is generated by the above process.

High-dimensional robust statistics is algorithmically challenging because the natural optimization formulations of such tasks are typically non-convex.
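To make Definition 1.1 concrete, the following minimal numpy sketch generates an $\epsilon$-corrupted set of samples from a spherical Gaussian with a $k$-sparse mean. The particular adversary used here (a constant shift, similar to one of the noise models evaluated in Section 5) and the function name are illustrative choices of ours, not part of the definition, which allows arbitrary replacements.

```python
import numpy as np

def eps_corrupted_samples(n, d, k, eps, rng=None):
    """Draw n samples from N(mu, I) with a k-sparse mean mu, then let a
    simple (illustrative) adversary replace an eps-fraction of them."""
    rng = np.random.default_rng(rng)
    mu = np.zeros(d)
    mu[:k] = 1.0                          # unknown k-sparse mean
    X = rng.normal(size=(n, d)) + mu      # clean samples from N(mu, I)
    n_bad = int(eps * n)
    bad_idx = rng.choice(n, size=n_bad, replace=False)
    # One possible adversary: shift every coordinate of the replaced points.
    X[bad_idx] = rng.normal(size=(n_bad, d)) + 2.0
    return X, mu

X, mu = eps_corrupted_samples(n=5000, d=100, k=10, eps=0.1, rng=0)
```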
The recent line of work on algorithmic robust statistics has led to a range of sophisticated algorithms. In some cases, such algorithms require solving large convex relaxations, rendering them computationally prohibitive for large-scale problems. In other cases, they involve a number of hyper-parameters that may require careful tuning. Motivated by these shortcomings of known algorithms, recent work [CDGS20, ZJS20] established an intriguing connection between high-dimensional robust estimation and non-convex optimization. The high-level idea is quite simple: Even though typical robust statistics tasks lead to non-convex formulations, it may still be possible to leverage the underlying structure to show that standard first-order methods provably and efficiently reach near-optimal solutions. Indeed, [CDGS20, ZJS20] were able to prove such statements for robust mean estimation under natural distributional assumptions. Specifically, these works established that any (approximate) stationary point of a well-studied non-convex formulation for robust mean estimation yields a near-optimal solution for the underlying robust estimation task.

In this work, we continue this line of work with a focus on sparse estimation tasks. Leveraging sparsity in high-dimensional datasets is a fundamental problem of significant practical importance. Various formalizations of this problem have been investigated in statistics and machine learning for at least the past two decades (see, e.g., [HTW15] for a textbook on the topic). We focus on robust sparse mean estimation and robust sparse PCA. Sparse mean estimation is arguably one of the most fundamental sparse estimation tasks and is closely related to the Gaussian sequence model [Tsy08, Joh17]. The task of sparse PCA in the spiked covariance model, initiated in [Joh01], has been extensively investigated (see Chapter 8 of [HTW15] and references therein).

In the context of robust sparse mean estimation, we are given an $\epsilon$-corrupted set of samples from a distribution with unknown mean $\mu \in \mathbb{R}^d$, where $\mu$ is $k$-sparse, and we want to compute a vector $\hat{\mu}$ close to $\mu$. In the context of robust sparse PCA (in the spiked covariance model), we are given an $\epsilon$-corrupted set of samples from a distribution with covariance matrix $I + \rho vv^\top$, where $v \in \mathbb{R}^d$ is $k$-sparse, and the goal is to approximate $v$. It is worth noting that for both problems, we have access to much fewer samples compared to the non-sparse case (roughly $O(k^2 \log d)$ instead of $\Omega(d)$). Consequently, the design and analysis of optimization formulations for robust sparse estimation requires new ideas and techniques that significantly deviate from the standard (non-sparse) case.

1.1 Our Results and Contributions

We show that standard first-order methods lead to robust and efficient algorithms for sparse mean estimation and sparse PCA. Our main contribution is to propose novel (non-convex) formulations for these robust estimation tasks, and to show that approximate stationarity suffices for near-optimality. We establish landscape results showing that any approximate stationary point of our objective function yields a near-optimal solution for the underlying robust estimation task. Consequently, gradient descent (or any other method converging to stationarity) can solve these problems. Our results provide new insights and techniques in designing and analyzing (non-convex) optimization formulations of robust estimation tasks.
Our formulations and structural results immediately lead to simple and practical algorithms for robust sparse estimation. Importantly, the gradient of our objectives can be computed efficiently via a small number of basic matrix operations. In addition to their simplicity and practicality, our methods provably succeed under more general distributional assumptions compared to prior work.

For robust sparse mean estimation and robust sparse PCA, our landscape results require deterministic conditions on the original set of good samples. We refer to these conditions as stability conditions (Definitions 2.1 and 2.2, formally defined in Section 2). At a high level, they state that the first and second moments of a set of samples are stable when any $\epsilon$-fraction of the samples are removed. These stability conditions hold with high probability for a set of clean samples drawn from natural families of distributions (e.g., subgaussian).

For robust sparse mean estimation, we establish the following result.

Theorem 1.2 (Robust Sparse Mean Estimation). Let $0 < \epsilon < \epsilon_0$ for some universal constant $\epsilon_0$ and let $\delta > \epsilon$. Let $G^\star$ be a set of $n$ samples that is $(k, \epsilon, \delta)$-stable (per Definition 2.1) w.r.t. a distribution with unknown $k$-sparse mean $\mu \in \mathbb{R}^d$. Let $S = (X_i)_{i=1}^n$ be an $\epsilon$-corrupted version of $G^\star$.2 There is an algorithm that on inputs $S$, $k$, $\epsilon$, and $\delta$, runs in polynomial time and returns a $k$-sparse vector $\hat{\mu} \in \mathbb{R}^d$ such that $\|\hat{\mu} - \mu\|_2 \le O(\delta)$.

We emphasize that a key novelty of Theorem 1.2 is that the underlying algorithm is a first-order method applied to a novel non-convex formulation of the problem. The major advantage of our algorithm over prior work [BDLS17, DKK+19b] is its simplicity, practicality, and the fact that it seamlessly applies to a wider class of distributions on the clean data. As we will discuss in Section 3, when the ground-truth distribution $D$ is subgaussian with unknown $k$-sparse mean $\mu \in \mathbb{R}^d$ and identity covariance, a set of $n = \widetilde{\Omega}(k^2 \log d/\epsilon^2)$ samples drawn from $D$ is $(k, \epsilon, \delta)$-stable (Definition 2.1) with high probability for $\delta = O(\epsilon\sqrt{\log(1/\epsilon)})$. It follows as an immediate corollary of Theorem 1.2 that, given an $\epsilon$-corrupted set of samples, we can compute a vector $\hat{\mu}$ that is $O(\delta) = O(\epsilon\sqrt{\log(1/\epsilon)})$ close to the true mean $\mu$. This sample complexity matches the known computational-statistical lower bounds [DKS17, BB20]. More generally, one can relax the concentration assumption on the clean data and obtain qualitatively similar error guarantees.

Next we state our main result for robust sparse PCA.

Theorem 1.3 (Robust Sparse PCA). Let $0 < \rho \le 1$ and $0 < \epsilon < \epsilon_0$ for some universal constant $\epsilon_0$. Let $G^\star$ be a set of $n$ samples that is $(k, \epsilon, \delta)$-stable (as in Definition 2.2) w.r.t. a centered distribution with covariance $\Sigma = I + \rho vv^\top$, for an unknown $k$-sparse unit vector $v \in \mathbb{R}^d$. Let $S = (X_i)_{i=1}^n$ be an $\epsilon$-corrupted version of $G^\star$. There is an algorithm that on inputs $S$, $k$, and $\epsilon$, runs in polynomial time and returns a unit vector $u \in \mathbb{R}^d$ such that $\|uu^\top - vv^\top\|_F = O(\sqrt{\delta/\rho})$.

Interestingly, our algorithm for robust sparse PCA is a first-order method applied to a simple convex formulation of the problem. We view the existence of a convex formulation as an intriguing fact that, surprisingly, was not observed in prior work. As we will discuss in Section 4, when the ground-truth distribution $D$ is centered subgaussian with covariance $\Sigma = I + \rho vv^\top$, for an unknown $k$-sparse unit vector $v \in \mathbb{R}^d$, a set of $n = \widetilde{\Omega}(k^2 \log d/\epsilon^2)$ samples drawn from $D$ is $(k, \epsilon, \delta)$-stable (Definition 2.2) with high probability for $\delta = O(\epsilon \log(1/\epsilon))$.
Therefore, our algorithm outputs a vector that is $O(\sqrt{\epsilon \log(1/\epsilon)/\rho})$ close to the true direction $v$. The sample complexity in this case nearly matches the computational-statistical lower bound of $\Omega(k^2 \log d/\epsilon^2)$ [BR13], which holds even without corruptions. While the error guarantee of our algorithm is slightly worse compared to prior work [BDLS17, DKK+19b] for Gaussian data (we get $O(\sqrt{\delta/\rho})$ rather than $O(\delta/\rho)$), we note that our algorithm works for a broader family of distributions.

Prior Work on Robust Sparse Estimation. We provide a detailed summary of prior work for comparison. [BDLS17] obtained the first sample-efficient and polynomial-time algorithms for robust sparse mean estimation and robust sparse PCA. These algorithms succeed for Gaussian inliers and inherently use the ellipsoid method. The separation oracle required for the ellipsoid algorithm turns out to be another convex program, corresponding to an SDP for solving sparse PCA. As a consequence, the running time of these algorithms, while polynomially bounded, is impractically high. [LLC19] proposed an algorithm for robust sparse mean estimation via iterative trimmed hard thresholding, which can only tolerate a sub-constant fraction of corruptions. [DKK+19b] gave iterative spectral robust algorithms for sparse mean estimation and sparse PCA. These algorithms are still quite complex and are only shown to succeed under Gaussian inliers.

1.2 Overview of Our Approach

In this section, we give an overview of our approach for robust sparse mean estimation. At a very high level, we assign a nonnegative weight to each data point and try to find a good set of $(1-\epsilon)n$ samples. The constraint on the weight vector is that it represents at least a (fractional) $(1-\epsilon)$-portion of the input dataset. Formally, given $n$ datapoints $(X_i)_{i=1}^n$, the goal is to find a weight vector $w \in \mathbb{R}^n$ such that $\mu_w = \sum_i w_i X_i$ is close to the true mean $\mu$. The constraint on $w$ is that it belongs to
$$\Delta_{n,\epsilon} = \left\{ w \in \mathbb{R}^n : \|w\|_1 = 1 \ \text{and} \ 0 \le w_i \le \tfrac{1}{(1-\epsilon)n} \ \forall i \right\},$$
which is the convex hull of all uniform distributions over subsets $S \subseteq [n]$ of size $|S| = (1-\epsilon)n$.

2 For two sets of samples $S$ and $T$, we say $S$ is an $\epsilon$-corrupted version of $T$ if $|S| = |T|$ and $|S \setminus T| \le \epsilon|S|$.

Let $\Sigma_w = \sum_i w_i (X_i - \mu_w)(X_i - \mu_w)^\top$ denote the weighted empirical covariance matrix. It is well-known that if one can find $w \in \Delta_{n,\epsilon}$ that minimizes the weighted empirical variance $v^\top \Sigma_w v$ for all $k$-sparse unit vectors $v$, then $\mu_w$ must be close to $\mu$. Unfortunately, it is NP-hard to find the sparse direction $v$ with the largest variance. To get around this issue, [BDLS17] considered the following convex relaxation, minimizing the variance for convex combinations of sparse directions:
$$\min_w \ \max_{\mathrm{tr}(A)=1,\ \sum_{ij}|A_{ij}| \le k,\ A \succeq 0} \ (A \cdot \Sigma_w) . \quad (1)$$
Given $w$, the optimal $A$ can be found using semidefinite programming (SDP). [ZJS20] observed that any stationary point $w$ of (1) gives a good solution for robust sparse mean estimation. However, solving (1) requires convex programming to compute the gradient in each iteration. As explained in the discussion that follows, our approach circumvents this shortcoming, leading to a formulation for which each gradient can be computed using only basic matrix operations.

In this work, we propose and analyze the following optimization formulation:
$$\min_w \ f(w) = \|\Sigma_w - I\|_{F,k,k} \quad \text{subject to} \quad w \in \Delta_{n,\epsilon} ,$$
where $\|A\|_{F,k,k}$ is the Frobenius norm of the $k^2$ entries of $A$ with largest magnitude, with the additional constraint that these $k^2$ entries are chosen from $k$ rows with $k$ entries in each row.
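As a concrete illustration of the $\|\cdot\|_{F,k,k}$ norm just described (formally defined in Equation (2) of Section 2), the following minimal numpy sketch scores each row by the squared norm of its $k$ largest-magnitude entries and keeps the $k$ best rows; the function name and interface are ours, not taken from the authors' implementation.

```python
import numpy as np

def fkk_norm(A, k):
    """||A||_{F,k,k}: Frobenius norm of k^2 largest-magnitude entries of A,
    restricted to k entries taken from each of k rows."""
    d = A.shape[0]
    k = min(k, d)
    # For each row, squared 2-norm of its k largest-magnitude entries (||A_i||_{2,k}^2).
    topk_cols = np.argpartition(np.abs(A), d - k, axis=1)[:, d - k:]
    row_scores = (np.take_along_axis(A, topk_cols, axis=1) ** 2).sum(axis=1)
    # Keep the k rows with the largest scores and sum their contributions.
    best_rows = np.argpartition(row_scores, d - k)[d - k:]
    return np.sqrt(row_scores[best_rows].sum())

A = np.random.default_rng(0).normal(size=(50, 50))
print(fkk_norm(A, k=5))
```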
We prove that any stationary point of $f(w)$ yields a good solution for robust sparse mean estimation. Here we provide a brief overview of our proof (see Section 3 for more details). Given a weight vector $w$, we show that if $w$ is not a good solution, then moving toward $w^*$ (the weight vector corresponding to the uniform distribution on the clean input samples) will decrease the objective value. Formally, we will show that, for any $0 < \eta < 1$,
$$\Sigma_{(1-\eta)w + \eta w^*} = (1-\eta)\Sigma_w + \eta\Sigma_{w^*} + \eta(1-\eta)(\mu_w - \mu_{w^*})(\mu_w - \mu_{w^*})^\top .$$
We can then take the $\|\cdot\|_{F,k,k}$ norm on both sides (after subtracting $I$) and show that the third term can be essentially ignored. If the third term were not there, we would have
$$f((1-\eta)w + \eta w^*) = \|\Sigma_{(1-\eta)w+\eta w^*} - I\|_{F,k,k} \le (1-\eta)\|\Sigma_w - I\|_{F,k,k} + \eta\|\Sigma_{w^*} - I\|_{F,k,k} = (1-\eta)f(w) + \eta f(w^*) .$$
Therefore, if $w$ is a bad solution with $f(w)$ much larger than $f(w^*)$, then $w$ cannot be a stationary point, because $f$ decreases when we move from $w$ to $(1-\eta)w + \eta w^*$.

Remark 1.4. The technical overview for robust sparse PCA follows a similar high-level approach, but is somewhat more technical. It is deferred to Section 4.

Roadmap. In Section 2, we introduce basic notation and the deterministic stability conditions that we require on the good samples. We present our algorithms and analysis for robust sparse mean estimation in Section 3 and robust sparse PCA in Section 4. In Section 5, we evaluate our algorithm on synthetic datasets and show that it achieves good statistical accuracy under various noise models.

2 Preliminaries and Background

Notation. For a positive integer $n$, let $[n] = \{1, \ldots, n\}$. For a vector $v$, we use $\|v\|_0$, $\|v\|_1$, $\|v\|_2$, and $\|v\|_\infty$ for the number of non-zeros, the $\ell_1$, $\ell_2$, and $\ell_\infty$ norm of $v$ respectively. Let $I$ be the identity matrix. For a matrix $A$, we use $\|A\|_2$, $\|A\|_F$, and $\mathrm{tr}(A)$ for the spectral norm, Frobenius norm, and trace of $A$ respectively. For two vectors $x, y$, let $x^\top y$ denote their inner product. For two matrices $A, B$, we use $A \cdot B = \mathrm{tr}(A^\top B)$ for their entrywise inner product. A matrix $A$ is said to be positive semidefinite (PSD) if $x^\top A x \ge 0$ for all $x$. We write $A \preceq B$ iff $(B - A)$ is PSD. For a vector $w \in \mathbb{R}^n$, let $\mathrm{diag}(w) \in \mathbb{R}^{n \times n}$ denote a diagonal matrix with $w$ on the diagonal. For a matrix $A \in \mathbb{R}^{n \times n}$, let $\mathrm{diag}(A) \in \mathbb{R}^n$ denote a column vector with the diagonal of $A$.

For a vector $v \in \mathbb{R}^d$ and a set $S \subseteq [d]$, we write $v_S \in \mathbb{R}^d$ for a vector that is equal to $v$ on $S$ and zero everywhere else. Similarly, for a matrix $A \in \mathbb{R}^{d \times d}$ and a set $S \subseteq ([d] \times [d])$, we write $A_S$ for a matrix that is equal to $A$ on $S$ and zero everywhere else. For a vector $v$, we define $\|v\|_{2,k} = \max_{|S|=k} \|v_S\|_2$ to be the maximum $\ell_2$-norm of any $k$ entries of $v$. For a matrix $A$, we define $\|A\|_{F,k^2}$ to be the maximum Frobenius norm of any $k^2$ entries of $A$. Moreover, we define $\|A\|_{F,k,k}$ to be the maximum Frobenius norm of any $k^2$ entries with the extra requirement that these entries must be chosen from $k$ rows with $k$ entries in each row. Formally,
$$\|A\|_{F,k^2} = \max_{|Q|=k^2} \|A_Q\|_F \quad \text{and} \quad \|A\|_{F,k,k}^2 = \max_{|S|=k} \sum_{i \in S} \|A_i\|_{2,k}^2 , \ \text{where } A_i \text{ is the } i\text{-th row of } A . \quad (2)$$

Sample Reweighting Framework. We use $n$ for the number of samples, $d$ for the dimension, and $\epsilon$ for the fraction of corrupted samples. For sparse estimation, we use $k$ for the sparsity of the ground-truth parameters. We use $G^\star$ for the original set of $n$ good samples. We use $S = G \cup B$ for the input samples after the adversary replaced an $\epsilon$-fraction of $G^\star$, where $G \subseteq G^\star$ is the set of remaining good samples and $B$ is the set of bad samples (outliers) added by the adversary. Note that $|G| = (1-\epsilon)n$ and $|B| = \epsilon n$. Given $n$ samples $X_1, \ldots, X_n$, we write $X \in \mathbb{R}^{d \times n}$ for the sample matrix whose $i$-th column is $X_i$.
For a weight vector $w \in \mathbb{R}^n$, we use $\mu_w = Xw = \sum_i w_i X_i$ for the weighted empirical mean and $\Sigma_w = X \mathrm{diag}(w) X^\top - \mu_w \mu_w^\top = \sum_i w_i (X_i - \mu_w)(X_i - \mu_w)^\top$ for the weighted empirical covariance. Let $\Delta_{n,\epsilon}$ be the convex hull of all uniform distributions over subsets $S \subseteq [n]$ of size $|S| = (1-\epsilon)n$:
$$\Delta_{n,\epsilon} = \left\{ w \in \mathbb{R}^n : \|w\|_1 = 1 \ \text{and} \ 0 \le w_i \le \tfrac{1}{(1-\epsilon)n} \ \forall i \right\} .$$
In other words, every $w \in \Delta_{n,\epsilon}$ corresponds to a fractional set of $(1-\epsilon)n$ samples. We use $w^*$ to denote the uniform distribution on $G$ (the remaining good samples in $S$).

Deterministic Stability Conditions. For robust sparse mean estimation and robust sparse PCA, we require the following conditions respectively.

Definition 2.1 (Stability Conditions for Sparse Mean). A set of $n$ samples $G^\star = (X_i)_{i=1}^n$ is said to be $(k, \epsilon, \delta)$-stable (w.r.t. a distribution with mean $\mu$) iff for any weight vector $w \in \Delta_{n,2\epsilon}$, we have $\|\mu_w - \mu\|_{2,k} \le \delta$ and $\|\Sigma_w - I\|_{F,k,k} \le \delta^2/\epsilon$, where $\mu_w$ and $\Sigma_w$ are the weighted empirical mean and covariance matrix respectively, and the $\|\cdot\|_{F,k,k}$ norm is defined in Equation (2).

Definition 2.2 (Stability Conditions for Sparse PCA). A set of $n$ samples $G^\star = (X_i)_{i=1}^n$ is $(k, \epsilon, \delta)$-stable (w.r.t. a centered distribution with covariance $I + \rho vv^\top$) iff for any weight vector $w \in \Delta_{n,2\epsilon}$,
$$\|M_w - (I + \rho vv^\top)\|_{F,2k^2} \le \delta ,$$
where $M_w = \sum_i w_i X_i X_i^\top$ and the $\|\cdot\|_{F,2k^2}$ norm is defined in Equation (2).

First-Order Stationary Points. We give a formal definition of the notion of (approximate) first-order stationary point that we use in this paper.

Definition 2.3 (Approximate Stationary Points). Fix a convex set $\mathcal{K}$ and a differentiable function $f$. For $\gamma \ge 0$, we say that $x \in \mathcal{K}$ is a $\gamma$-stationary point of $f$ iff the following condition holds: For any unit vector $u$ where $x + \alpha u \in \mathcal{K}$ for some $\alpha > 0$, we have $u^\top \nabla f(x) \ge -\gamma$.

We note that the objective functions studied in this paper are not everywhere differentiable. This is because, taking the $\|\cdot\|_{F,k,k}$ norm as an example, there can be ties in choosing the largest $k^2$ entries. When the function $f$ is not differentiable, we use $\nabla f$ informally to denote an element of the sub-differential. We will show in Appendix C that, while $f$ is not differentiable, it does have a nonempty subdifferential, as it can be written as the pointwise maximum of differentiable functions.

3 Robust Sparse Mean Estimation

In this section, we present our non-convex approach for robust sparse mean estimation. We will optimize the following objective, where $\|\cdot\|_{F,k,k}$ is defined in Equation (2):
$$\min_w \ f(w) = \|\Sigma_w - I\|_{F,k,k} \quad \text{subject to} \quad w \in \Delta_{n,\epsilon} . \quad (3)$$
We will show that the objective function (3) has no bad stationary points (Theorem 3.1). In other words, every first-order stationary point of $f$ yields a good solution for robust sparse mean estimation. Our algorithm is stated in Algorithm 1. As a consequence of our landscape result (Theorem 3.1), we know that Algorithm 1 works no matter how we find a stationary point of $f$ (because any stationary point works), so we intentionally do not specify how to find such a point. As a simple illustration, we show that (projected) gradient descent can be used to minimize $f$. The convergence analysis and iteration complexity are provided in Appendix C.

Algorithm 1: Robust sparse mean estimation.
Input: $k > 0$, $0 < \epsilon < \epsilon_0$, and an $\epsilon$-corrupted set of samples $(X_i)_{i=1}^n$ drawn from a distribution with $k$-sparse mean $\mu$.3
Output: a vector $\hat{\mu}$ that is close to $\mu$.
1: Find a first-order stationary point $w \in \Delta_{n,\epsilon}$ of the objective $\min_w f(w) = \|\Sigma_w - I\|_{F,k,k}$.
2: Return $\hat{\mu} = (\mu_w)_Q$, where $Q$ is a set of $k$ entries of $\mu_w$ with largest magnitude.
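Since the choice of first-order method is left open (projected gradient descent is analyzed in Appendix C), the following self-contained numpy sketch shows one way step 1 of Algorithm 1 could look. The step size, iteration count, and the bisection-based projection onto $\Delta_{n,\epsilon}$ are our own illustrative choices, not taken from the authors' implementation, and the gradient formula is the natural (sub)gradient of $\|(\Sigma_w - I)_Q\|_F$ for the currently selected support $Q$.

```python
import numpy as np

def fkk_mask(A, k):
    """0/1 mask selecting k entries in each of the k best rows, as in ||.||_{F,k,k}."""
    d = A.shape[0]
    k = min(k, d)
    topk_cols = np.argpartition(np.abs(A), d - k, axis=1)[:, d - k:]
    row_scores = (np.take_along_axis(A, topk_cols, axis=1) ** 2).sum(axis=1)
    best_rows = np.argpartition(row_scores, d - k)[d - k:]
    mask = np.zeros_like(A)
    for i in best_rows:
        mask[i, topk_cols[i]] = 1.0
    return mask

def project_capped_simplex(v, cap):
    """Euclidean projection onto {w : sum(w) = 1, 0 <= w_i <= cap} via bisection
    on the KKT shift lambda, where the projection is clip(v - lambda, 0, cap)."""
    lo, hi = v.min() - cap, v.max()
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        if np.clip(v - lam, 0.0, cap).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    return np.clip(v - 0.5 * (lo + hi), 0.0, cap)

def sparse_gd_mean(X, k, eps, n_iters=300, step=None):
    """Illustrative projected (sub)gradient descent on f(w) = ||Sigma_w - I||_{F,k,k}
    over Delta_{n,eps}, followed by the truncation in step 2 of Algorithm 1."""
    n, d = X.shape
    cap = 1.0 / ((1.0 - eps) * n)
    w = np.full(n, 1.0 / n)                 # uniform weights are feasible
    step = step if step is not None else 1.0 / n
    for _ in range(n_iters):
        mu = X.T @ w                         # mu_w
        Sigma = (X * w[:, None]).T @ X - np.outer(mu, mu)
        A = Sigma - np.eye(d)
        B = A * fkk_mask(A, k)               # (Sigma_w - I) restricted to Q
        fro = np.linalg.norm(B)
        if fro < 1e-12:
            break
        # d f / d w_i = <B, X_i X_i^T - X_i mu^T - mu X_i^T> / ||B||_F
        XB = X @ B
        grad = (np.einsum('ij,ij->i', XB, X) - XB @ mu - (X @ B.T) @ mu) / fro
        w = project_capped_simplex(w - step * grad, cap)
    mu = X.T @ w
    out = np.zeros(d)
    top = np.argpartition(np.abs(mu), d - k)[d - k:]   # keep k largest entries of mu_w
    out[top] = mu[top]
    return out
```

For example, `mu_hat = sparse_gd_mean(X, k=10, eps=0.1)` could be run on the corrupted data from the earlier sketch; in practice the step size would be tuned or set according to the convergence analysis in Appendix C.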
Formally, we first prove that Algorithm 1 can output a vector $\hat{\mu} \in \mathbb{R}^d$ that is close to $\mu$ in the $\|\cdot\|_{2,k}$ norm, as long as the good samples satisfy the stability condition in Definition 2.1.

Theorem 3.1. Fix $k > 0$, $0 < \epsilon < \epsilon_0$, and $\delta > \epsilon$. Let $G^\star$ be a set of $n$ samples that is $(k, \epsilon, \delta)$-stable (as in Definition 2.1) w.r.t. a distribution with unknown $k$-sparse mean $\mu \in \mathbb{R}^d$. Let $S = (X_i)_{i=1}^n$ be an $\epsilon$-corrupted version of $G^\star$. Let $f(w) = \|\Sigma_w - I\|_{F,k,k}$. Let $\gamma = O(n^{1/2}\delta^2\epsilon^{-3/2})$. Then, for any $w \in \Delta_{n,\epsilon}$ that is a $\gamma$-stationary point of $f(w)$, we have $\|\mu_w - \mu\|_{2,k} = O(\delta)$.

Once we have a vector $\mu_w$ that is $O(\delta)$-close to $\mu$ in the $\|\cdot\|_{2,k}$ norm, we can guarantee that a truncated version of $\mu_w$ (the output $\hat{\mu}$ of Algorithm 1) is $O(\delta)$-close to $\mu$ in the $\ell_2$-norm:

Lemma 3.2. Fix two vectors $x, y$ with $\|x\|_0 \le k$ and $\|x - y\|_{2,k} \le \delta$. Let $z$ be the vector that keeps the $k$ entries of $y$ with largest absolute values and sets the rest to $0$. We have $\|x - z\|_2 = O(\delta)$.

Theorem 1.2 follows immediately from Theorem 3.1 and Lemma 3.2. We can apply Theorem 1.2 to get an end-to-end result for subgaussian distributions. We show that the required stability conditions are satisfied with a small number of samples.

Lemma 3.3. Fix $k > 0$ and $0 < \epsilon < \epsilon_0$. Let $G^\star$ be a set of $n$ samples that are drawn i.i.d. from a subgaussian distribution with mean $\mu$ and covariance $I$. If $n = \Omega(k^2 \log d/\epsilon^2)$, then with probability at least $1 - \exp(-\Omega(k^2 \log d))$, $G^\star$ is $(k, \epsilon, \delta)$-stable (as in Definition 2.1) for $\delta = O(\epsilon\sqrt{\log(1/\epsilon)})$.

Combining Theorem 1.2 and Lemma 3.3, we know that given an $\epsilon$-corrupted set of $O(k^2 \log d/\epsilon^2)$ samples drawn from a subgaussian distribution with $k$-sparse mean $\mu$, the output of Algorithm 1 is $O(\epsilon\sqrt{\log(1/\epsilon)})$-close to $\mu$ in $\ell_2$-norm.

In the rest of this section, we will prove Theorem 3.1. Omitted proofs in this section are in Appendix A. We start with some intuition on why we choose our objective function (3). We would like to design $f(w) = g(\Sigma_w - I)$ to satisfy the following properties:
1. $g(\Sigma_w - I)$ is an upper bound on $v^\top(\Sigma_w - I)v$ for all $k$-sparse unit vectors $v \in \mathbb{R}^d$. This way, a small objective value implies that $\|\mu_w - \mu\|_{2,k}$ is small.
2. $g(\Sigma_{w^*} - I)$ is small for $w^*$ (the uniform distribution on $G$). This guarantees that a good $w$ exists.
3. Triangle inequality on $g$. This allows us to upper bound the objective value when we move $w$ toward $w^*$ by the sum of $g(\cdot)$ of each term on the right-hand side:
$$\Sigma_{(1-\eta)w+\eta w^*} - I = (1-\eta)(\Sigma_w - I) + \eta(\Sigma_{w^*} - I) + \eta(1-\eta)(\mu_w - \mu_{w^*})(\mu_w - \mu_{w^*})^\top .$$
4. $g(uu^\top)$ is close to $g(vv^\top)$, where $v$ keeps only the $k$ largest entries of $u$. We want to approximate $\mu$ in the $\|\cdot\|_{2,k}$ norm, so intuitively $g(\Sigma_w - I)$ should depend only on the largest $k$ entries of $(\mu_w - \mu)$.

3 Without loss of generality, we can assume that $\epsilon$ is given to the algorithm. This is because we can run a binary search to determine $\epsilon$: if our guess of $\epsilon$ is too small, then the algorithm will output a $w$ whose objective value $f(w)$ is much larger than it should be.

Our choice of $f(w) = g(\Sigma_w - I) = \|\Sigma_w - I\|_{F,k,k}$ is motivated by (and satisfies) all these properties.

Lemma 3.4. Fix $A \in \mathbb{R}^{d \times d}$. We have $v^\top A v \le \|A\|_{F,k,k}$ for any $k$-sparse unit vector $v \in \mathbb{R}^d$.

Lemma 3.5. For any vector $v \in \mathbb{R}^d$, $\|vv^\top\|_{F,k,k} = \|v\|_{2,k}^2$.

We now continue to present key technical lemmas for proving our main structural result (Theorem 3.1). Lemma 3.6 gives the weighted empirical covariance for a convex combination of two weight vectors.

Lemma 3.6. Fix $n$ samples $X_1, \ldots, X_n \in \mathbb{R}^d$. Let $w, \hat{w} \in \mathbb{R}^n$ be two non-negative weight vectors with $\|w\|_1 = \|\hat{w}\|_1 = 1$. For any $\alpha, \beta \ge 0$ with $\alpha + \beta = 1$, letting $w' = \alpha w + \beta\hat{w}$, we have
$$\Sigma_{w'} = \alpha\Sigma_w + \beta\Sigma_{\hat{w}} + \alpha\beta(\mu_w - \mu_{\hat{w}})(\mu_w - \mu_{\hat{w}})^\top .$$

Proof. Because $w' = \alpha w + \beta\hat{w}$ and $\mu_w$ is linear in $w$, we have $\mu_{w'} = \alpha\mu_w + \beta\mu_{\hat{w}}$.
The lemma follows from the following calculations:
$$\Sigma_{w'} = \sum_i w'_i X_i X_i^\top - \mu_{w'}\mu_{w'}^\top = \Big(\sum_i \alpha w_i X_i X_i^\top - \alpha\mu_w\mu_w^\top\Big) + \Big(\sum_i \beta\hat{w}_i X_i X_i^\top - \beta\mu_{\hat{w}}\mu_{\hat{w}}^\top\Big) + \alpha\mu_w\mu_w^\top + \beta\mu_{\hat{w}}\mu_{\hat{w}}^\top - (\alpha\mu_w + \beta\mu_{\hat{w}})(\alpha\mu_w + \beta\mu_{\hat{w}})^\top$$
$$= \alpha\Sigma_w + \beta\Sigma_{\hat{w}} + \alpha\beta(\mu_w - \mu_{\hat{w}})(\mu_w - \mu_{\hat{w}})^\top .$$
The last step uses $\alpha - \alpha^2 = \beta - \beta^2 = \alpha\beta$, as $\alpha + \beta = 1$.

Let $w^*$ denote the uniform distribution on $G$, i.e., $w^*_i = \frac{1}{(1-\epsilon)n}$ if $i \in G$ and $w^*_i = 0$ otherwise. By Lemma 3.6, for any $w$, if we move toward $w^*$, we have
$$\Sigma_{(1-\eta)w+\eta w^*} = (1-\eta)\Sigma_w + \eta\Sigma_{w^*} + \eta(1-\eta)(\mu_w - \mu_{w^*})(\mu_w - \mu_{w^*})^\top .$$
We will show that we can essentially ignore the last rank-one term using Lemma 3.7.

Lemma 3.7. Let $G^\star$ be a $(k, \epsilon, \delta)$-stable set of samples with respect to the ground-truth distribution, where $0 < \epsilon \le \delta$. Let $S$ be an $\epsilon$-corrupted version of $G^\star$. Then, we have
$$\|(\mu_w - \mu_{w^*})(\mu_w - \mu_{w^*})^\top\|_{F,k,k} \le 4\epsilon\|\Sigma_w - I\|_{F,k,k} + O(\delta^2/\epsilon) .$$

We are now ready to prove our main result (Theorem 3.1).

Proof of Theorem 3.1. Fix any weight vector $w \in \Delta_{n,\epsilon}$. We will show that if $w$ is a bad solution, then $f(w)$ decreases if $w$ moves toward $w^*$, so $w$ cannot be a stationary point. Let $c_1$ be the constant in the $O(\cdot)$ in Lemma 3.7. By Lemma 3.7, if $\|\mu_w - \mu\|_{2,k} \ge c_2\delta$ for a sufficiently large constant $c_2$, then
$$\|\Sigma_w - I\|_{F,k,k} \ge \Big(\tfrac{c_2^2}{4} - c_1\Big)\tfrac{\delta^2}{\epsilon} .$$
By Lemma 3.6,
$$\Sigma_{(1-\eta)w+\eta w^*} - I = (1-\eta)(\Sigma_w - I) + \eta(\Sigma_{w^*} - I) + \eta(1-\eta)(\mu_w - \mu_{w^*})(\mu_w - \mu_{w^*})^\top .$$
Using the triangle inequality for $\|\cdot\|_{F,k,k}$, we have
$$\|\Sigma_{(1-\eta)w+\eta w^*} - I\|_{F,k,k} \le (1-\eta)\|\Sigma_w - I\|_{F,k,k} + \eta\|\Sigma_{w^*} - I\|_{F,k,k} + \eta(1-\eta)\|(\mu_w - \mu_{w^*})(\mu_w - \mu_{w^*})^\top\|_{F,k,k} .$$
We know that $\|\Sigma_{w^*} - I\|_{F,k,k} \le \frac{\delta^2}{\epsilon}$ by the stability condition in Definition 2.1. By Lemma 3.7 and $\|\Sigma_w - I\|_{F,k,k} = \Omega(\delta^2/\epsilon)$, we can show that for all $0 < \eta < 1$,
$$f((1-\eta)w + \eta w^*) = \|\Sigma_{(1-\eta)w+\eta w^*} - I\|_{F,k,k} \le (1-\eta)\|\Sigma_w - I\|_{F,k,k} + \eta\tfrac{\delta^2}{\epsilon} + 4\epsilon\eta\|\Sigma_w - I\|_{F,k,k} + O\Big(\tfrac{\eta\delta^2}{\epsilon}\Big)$$
$$\le (1-\eta+4\epsilon\eta)\|\Sigma_w - I\|_{F,k,k} + (4c_1+1)\tfrac{\eta\delta^2}{\epsilon} \le \Big(1 - \tfrac{\eta}{2}\Big)\|\Sigma_w - I\|_{F,k,k} = \Big(1 - \tfrac{\eta}{2}\Big)f(w) .$$
The last step uses $\big(\tfrac12 - 4\epsilon\big)\|\Sigma_w - I\|_{F,k,k} \ge (4c_1+1)\tfrac{\delta^2}{\epsilon}$, which holds if $\epsilon \le 1/10$ and $c_2^2 \ge 164c_1 + 40$.

It follows immediately that $w$ cannot be a stationary point. Let $u = \frac{w^* - w}{\|w^* - w\|_2}$ and $h = \eta\|w^* - w\|_2$. We have $w + hu = (1-\eta)w + \eta w^* \in \Delta_{n,\epsilon}$ because $\Delta_{n,\epsilon}$ is convex. Since $\|w^* - w\|_2 = O(\sqrt{\epsilon/n})$,
$$u^\top\nabla f(w) = \lim_{h\to 0}\frac{f(w+hu) - f(w)}{h} \le \frac{-\frac{\eta}{2}f(w)}{\eta\|w^* - w\|_2} \le \frac{-\Omega(\delta^2/\epsilon)}{\|w^* - w\|_2} \le -\Omega(n^{1/2}\delta^2\epsilon^{-3/2}) .$$
By Definition 2.3, we know $w$ cannot be a $\gamma$-stationary point of $f$ for some $\gamma = O(n^{1/2}\delta^2\epsilon^{-3/2})$.

4 Robust Sparse PCA

We consider a spiked covariance model for sparse PCA. In this model, there is a direction $v \in \mathbb{R}^d$ with at most $k$ nonzero entries. The good samples are drawn from a ground-truth distribution with covariance $\Sigma = I + \rho vv^\top$, where $\rho > 0$ is a parameter that intuitively measures the strength of the signal. We consider the more interesting case where $\rho \le 1$ (if $\rho$ is larger, the problem becomes easier). To solve the sparse PCA problem, we consider the following optimization problem, where $M_w = \sum_i w_i X_i X_i^\top$ and $\|A\|_{F,2k^2} = \max_{|Q|=2k^2}\|A_Q\|_F$:
$$\min_w \ f(w) = \|M_w - I\|_{F,2k^2} \quad \text{subject to} \quad w \in \Delta_{n,\epsilon} . \quad (5)$$
The objective function minimizes the Frobenius norm of the largest $2k^2$ entries of a reweighted second-moment matrix $M_w$. Note that $f(w)$ is actually convex in $w$, because the matrix $M_w$ is linear in $w$ and the $\|\cdot\|_{F,2k^2}$ norm is convex. Let $R$ be the support of $vv^\top$. Intuitively, the $k^2$ entries in $R$ could be large due to the spiked covariance. By minimizing the norm of the largest $2k^2$ entries, we hope to make the entries outside of $R$ very small. Our algorithm is given in Algorithm 2.

Algorithm 2: Robust sparse PCA.
Input: $k > 0$, $0 < \epsilon < \epsilon_0$, and an $\epsilon$-corrupted set of samples $(X_i)_{i=1}^n$ drawn from a distribution with covariance $I + \rho vv^\top$ for a $k$-sparse unit vector $v$.
Output: a vector $u$ that is close to $v$.
1: Find a first-order stationary point $w \in \Delta_{n,\epsilon}$ of the objective $\min_w f(w) = \|M_w - I\|_{F,2k^2}$.
2: Let $A = M_w - I$. Let $Q$ be the $k^2$ entries of $A$ with largest magnitude.
3: Return $u =$ the top eigenvector of $(A_Q + A_Q^\top)$.
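The following minimal numpy sketch illustrates steps 2 and 3 of Algorithm 2, i.e., the post-processing applied once a stationary (hence, by convexity of (5), globally optimal) weight vector $w$ has been found; how $w$ is obtained is left open here, and the helper name is ours.

```python
import numpy as np

def sparse_pca_postprocess(X, w, k):
    """Steps 2-3 of Algorithm 2: form A = M_w - I, keep its k^2 largest-magnitude
    entries, symmetrize, and return the top eigenvector."""
    n, d = X.shape
    M = (X * w[:, None]).T @ X               # M_w = sum_i w_i X_i X_i^T
    A = M - np.eye(d)
    m = min(k * k, A.size)
    idx = np.argpartition(np.abs(A), A.size - m, axis=None)[A.size - m:]
    rows, cols = np.unravel_index(idx, A.shape)
    AQ = np.zeros_like(A)
    AQ[rows, cols] = A[rows, cols]           # A restricted to the top k^2 entries
    eigvals, eigvecs = np.linalg.eigh(AQ + AQ.T)
    return eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
```

Because (5) is convex, the weights could in principle come from any off-the-shelf convex or projected subgradient solver; it is the landscape result of Theorem 3.1 that plays the analogous role for the non-convex mean objective (3).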
Theorem 4.1. Let $0 < \rho \le 1$, $0 < \epsilon < \epsilon_0$, and $\delta > \epsilon$. Let $G^\star$ be a set of $n$ samples that is $(k, \epsilon, \delta)$-stable (as in Definition 2.2) w.r.t. a centered distribution with covariance $I + \rho vv^\top$ for an unknown $k$-sparse unit vector $v \in \mathbb{R}^d$. Let $S = (X_i)_{i=1}^n$ be an $\epsilon$-corrupted version of $G^\star$. Algorithm 2 outputs a vector $u$ such that $\|uu^\top - vv^\top\|_F = O(\sqrt{\delta/\rho})$.

Theorem 1.3 is an immediate corollary of Theorem 4.1. We can apply Theorem 4.1 to get an end-to-end result for subgaussian distributions. Algorithm 2 requires the stability conditions (Definition 2.2) to hold for the original good samples $G^\star$. We show that these conditions are satisfied with a small number of samples.

Lemma 4.2. Let $0 < \rho \le 1$ and $0 < \epsilon < \epsilon_0$. Let $D$ be a centered subgaussian distribution with covariance $I + \rho vv^\top$ for a $k$-sparse unit vector $v \in \mathbb{R}^d$. Let $G^\star$ be a set of $n = \Omega(k^2 \log d/\delta^2)$ samples drawn from $D$. Then, with probability at least $1 - \exp(-\Omega(k^2 \log d))$, $G^\star$ is $(k, \epsilon, \delta)$-stable (as in Definition 2.2) w.r.t. $D$ for $\delta = O(\epsilon \log(1/\epsilon))$.

Combining Theorem 4.1 and Lemma 4.2, given as input an $\epsilon$-corrupted set of $n = \widetilde{\Omega}(k^2 \log d/\epsilon^2)$ samples drawn from a centered subgaussian distribution with covariance $I + \rho vv^\top$, Algorithm 2 returns a vector $u$ with $\|uu^\top - vv^\top\|_F = O(\sqrt{\epsilon \log(1/\epsilon)/\rho})$. We defer the proofs of Lemma 4.2 and Theorem 4.1 to Appendix B and give an overview of the proof of Theorem 4.1 here.

Proof Sketch of Theorem 4.1. We can use the stability conditions to upper bound the optimal objective value: for $w^*$ (the uniform distribution on the remaining good samples), we must have $\|M_{w^*} - (I + \rho vv^\top)\|_{F,2k^2} \le \delta$ by the stability conditions, and therefore
$$\|M_{w^*} - I\|_{F,2k^2} \le \|M_{w^*} - (I + \rho vv^\top)\|_{F,2k^2} + \|\rho vv^\top\|_{F,2k^2} \le \rho + \delta .$$
Because the objective function $f(w)$ is convex, any stationary point $w$ must be globally optimal and satisfies $f(w) \le \rho + \delta$. Fix a stationary point $w$ and let $A = M_w - I$. Let $R$ be the support of $vv^\top$ and let $Q$ be the set of $k^2$ largest entries of $A$. The stability conditions imply that, for any $w$, the projection in the $v$ direction must be large (formally, $v^\top A v \ge \rho - O(\delta)$). Because the objective function measures the norm of the largest $2k^2$ entries of $A$, and it is not much larger than the norm of the largest $k^2$ entries, we can argue that $A_R$ and $A_Q$ are close, so $v^\top A_Q v \ge \rho - O(\delta)$. Now $A_Q$ is a matrix with Frobenius norm at most $\rho + \delta$, while $v^\top A_Q v \ge \rho - O(\delta)$. Together these imply that the norm of $A_Q$ in the directions orthogonal to $vv^\top$ is at most $O(\sqrt{\rho\delta})$, and then by standard matrix perturbation bounds we know the top eigenvector of $(A_Q + A_Q^\top)$ is $O(\sqrt{\delta/\rho})$-close to $v$.

5 Experiments

We perform an experimental evaluation of our robust sparse mean estimation algorithm on synthetic datasets with a focus on statistical accuracy (the $\ell_2$-distance between the output and the true sparse mean). We evaluate our algorithm (Sparse Gradient Descent, Sparse GD) on different noise models and compare it to the following previous algorithms: oracle, which is told exactly which samples are inliers and outputs their empirical mean; the robust sparse mean estimation algorithm RME_sp from [DKK+19b]; NP (Naive Pruning), which removes samples far from the median and outputs the mean of the rest; and RANSAC, which randomly selects half of the points and computes their mean, where one candidate solution is preferred to another if it has more points in a ball of radius $O(\sqrt{d})$ around it. For algorithms that output non-sparse vectors, we take the largest $k$ entries before measuring the $\ell_2$ distance to the true mean.

We evaluate the algorithms on various noise models:

Linear-hiding noise. The inliers are drawn from $N(0, I)$. Let $S$ be a set of size $k$.
Then, half the outliers are drawn from $N(\mathbf{1}_S, I)$ and the other half are drawn from $N(0, 2I - I_S)$.

Tail-flipping noise. This noise model picks a $k$-sparse direction $v$ and replaces the $\epsilon$-fraction of points farthest in the $-v$ direction with points in the $+v$ direction.

Constant-bias noise. This model adds a constant to every coordinate of the outlier points. In Figure 3, we add 2 to every coordinate of every outlier point.

We ran our experiments on a computer with a 1.6 GHz Intel Core i5 processor and 8 GB of RAM. We built on the codebase of [DKK+19b]4 and implemented our new algorithm for the experiments. For each pair of algorithm and noise model, we repeat the same experiment 10 times and plot the median value of the measurements. We shade the interquartile region around the reported points in the figures as confidence intervals.

4 Available at: https://github.com/sushrutk/robust_sparse_mean_estimation (MIT license).

Figure 1: The performance of various algorithms under linear-hiding noise. (a) Fix the sparsity k and change the number of samples n (d = 1000, k = 40, ϵ = 0.1). (b) Fix n and change the sparsity k (d = 300, n = 5000, ϵ = 0.1). Notably, when the number of samples n or the sparsity k is small, our algorithm Sparse GD outperforms RME_sp.

Figure 2: The performance of various algorithms in the tail-flipping noise model. (a) Fix the sparsity k and change the number of samples n (d = 10, k = 1, ϵ = 0.1). (b) Fix n and k and change the fraction of corruption ϵ (d = 10, k = 1, n = 5000).

Figure 3: The performance of various algorithms in the constant-bias noise model. (a) Fix the sparsity k and change the number of samples n (d = 300, k = 10, ϵ = 0.1). (b) Fix n and change the sparsity k (d = 300, n = 5000, ϵ = 0.1).

Our experimental results are summarized in Figures 1, 2, and 3. For the linear-hiding and constant-bias noise models, we run two experiments: 1) fix the sparsity $k$ and change the number of samples $n$, and 2) fix $n$ and change $k$. For the tail-flipping noise model, we run two experiments: 1) fix the sparsity $k$ and change the number of samples $n$, and 2) fix $k$ and $n$ and change the fraction of corruption $\epsilon$.

In terms of statistical accuracy, our algorithm (Sparse GD) outperforms the filter-based RME_sp algorithm [DKK+19b] in the linear-hiding noise model when the number of samples or the sparsity is small, as shown in Figure 1. Our algorithm matches the performance of RME_sp under the tail-flipping and constant-bias noise models, as shown in Figures 2 and 3. Matching our theoretical results, our Sparse GD algorithm has accuracy $O(\epsilon\sqrt{\log(1/\epsilon)})$ and is within a constant factor of the $\Omega(\epsilon\sqrt{\log(1/\epsilon)})$ worst-case performance of oracle. In contrast, the naive algorithms NP and RANSAC both incur error that scales as $\epsilon\sqrt{k}$. The tail-flipping noise (Figure 2) illustrates that $\Omega(\epsilon\sqrt{\log(1/\epsilon)})$ error can occur no matter which algorithm is used (including oracle), because an $\epsilon$-fraction of the original good samples was removed.

Acknowledgments and Disclosure of Funding

Yu Cheng is supported in part by NSF Award CCF-2307106.
Ilias Diakonikolas is supported by NSF Medium Award CCF-2107079, NSF Award CCF-1652862 (CAREER), NSF Award AiTF-2006206, a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant. Rong Ge is supported by NSF Award CCF-1704656, NSF Award CCF-1845171 (CAREER), NSF Award CCF-1934964, a Sloan Research Fellowship, and a Google Faculty Research Award. Shivam Gupta is supported by NSF Award CCF-2008868, NSF Award AiTF-2006206, and the NSF AI Institute for Foundations of Machine Learning (IFML). Some of this work was done while Shivam Gupta was visiting the University of Wisconsin-Madison. Daniel M. Kane is supported by NSF Award CCF-1652862 (CAREER), NSF Award CCF-2107547, a Sloan Research Fellowship, and a grant from Casper Labs. Mahdi Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship, NSF Award CCF-1846369 (CAREER), NSF Award CCF-1813877, DARPA Learning with Less Labels (LwLL) and Fast Network Interface Cards (FastNICs) grants, and Faculty Research Awards from Google and Amazon.