# highdimensional_robust_mean_estimation_via_gradient_descent__7e7325cc.pdf High-dimensional Robust Mean Estimation via Gradient Descent Yu Cheng * 1 Ilias Diakonikolas * 2 Rong Ge * 3 Mahdi Soltanolkotabi * 4 We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomialtime algorithms for this problem with dimensionindependent error guarantees for a range of natural distribution families. In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our non-convex objective gives a near-optimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic highdimensional robust statistics and non-convex optimization, which may have broader applications to other robust estimation tasks. 1. Introduction Learning in the presence of outliers is an important goal in machine learning that has become a pressing challenge in a number of high-dimensional data analysis applications, including data poisoning attacks (Barreno et al., 2010; Biggio et al., 2012; Steinhardt et al., 2017) and exploratory analysis of real datasets with natural outliers, e.g., in biology (Rosenberg et al., 2002; Paschou et al., 2010; Li et al., 2008). In both these application domains, the outliers are not random but can be arbitrarily correlated, and could exhibit rather complex structures that is essentially impossible to accurately model. Hence, the goal in these settings is to design computationally efficient estimators that can tolerate *Equal contribution 1University of Illinois at Chicago, Chicago, IL, USA 2University of Wisconsin-Madison, Madison, WI, USA 3Duke University, Durham, NC, USA 4University of Southern California, Los Angeles, CA, USA. Correspondence to: Yu Cheng , Ilias Diakonikolas , Rong Ge , Mahdi Soltanolkotabi . Proceedings of the 37 th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). a small constant fraction of arbitrary outliers. Throughout this paper, we focus on the following data contamination model that generalizes several existing models, including Huber s contamination model (Huber, 1964). Definition 1.1 (Strong Contamination Model). Given a parameter 0 < ϵ < 1/2 and a distribution family D on Rd, the adversary operates as follows: The algorithm specifies the number of samples N, and N samples are drawn from some unknown D D. The adversary is allowed to inspect the samples, remove up to ϵN of them and replace them with arbitrary points. This modified set of N points is then given as input to the algorithm. We say that a set of samples is ϵ-corrupted if it is generated by the above process. The parameter ϵ in the above definition is the fraction of corrupted samples and quantifies the power of the adversary. Intuitively, among our samples, an unknown (1 ϵ) fraction are generated from a distribution of interest and are called inliers, and the rest are called outliers. The statistical foundations of outlier-robust estimation were laid out in early work by the robust statistics community, starting with the pioneering works of Tukey (1960) and Huber (1964). In contrast, until fairly recently, even the most basic algorithmic questions were poorly understood. Specifically, even for the basic task of high-dimensional mean estimation, all known robust estimators had runtime exponential in the dimension, rendering them ineffective in high-dimensional settings. Recently, Diakonikolas et al. (2016); Lai et al. (2016) gave the first efficiently computable robust estimators for highdimensional unsupervised learning tasks, including mean and covariance estimation. Specifically, Diakonikolas et al. (2016) obtained the first polynomial-time robust estimators with dimension-independent error guarantees, i.e., with error scaling only with the fraction of corrupted samples ϵ and not with the dimensionality of the data. Since the dissemination of these works, there has been a flurry of research activity on algorithmic aspects of high-dimensional robust statistics; see, e.g., Diakonikolas & Kane (2019) for a recent survey on the topic. Despite this exciting progress, the design of efficient robust estimators in high dimensions remains challenging. The difficulty, of course, lies in the non-convexity of the under- Robust Mean Estimation via Gradient Descent lying optimization problem. Prior work developed fairly sophisticated algorithmic tools, even for the task of robust mean estimation. These include convex relaxations (Diakonikolas et al., 2016) and quite subtle iterative spectral methods (Diakonikolas et al., 2016; Lai et al., 2016). A natural and important goal is to understand to what extent such sophisticated methods are indeed necessary or whether much simpler robust learning algorithms exist. In this work, we take a direct optimization view of these problems and ask the following general question: Is it possible to solve robust estimation tasks by standard first-order methods? We believe that this question merits investigation in its own right. Moreover, its positive resolution may have significant implications in the practical adoption of robust estimation methods. Particularly so since prior algorithms are either (1) computationally prohibitive (relying on large convex relaxations), (2) involve carefully crafted parameters that require precise tuning for practical deployment, or (3) are challenging to extend to more sophisticated robust estimation tasks. A tantalizing possibility is the following: For a range of high-dimensional robust estimation tasks, there exists a (natural) non-convex formulation such that gradient descent efficiently converges to a near-optimal solution. In this paper, we show that this premise is true for the task of high-dimensional robust mean estimation. In robust mean estimation, we are given a set of N ϵ-corrupted samples from an unknown distribution D in a known family D, and we want to output a hypothesis vector bµ such that bµ µ 2 is as small as possible, where µ is the mean of D. For simplicity, we will assume in this discussion that D is an unknown mean and identity covariance Gaussian on Rd. We note that our results hold under more general distributional assumptions, as in Diakonikolas et al. (2016; 2017a). The goal in robust mean estimation is to develop efficient algorithms whose ℓ2-error guarantee scales only with ϵ and not with the dimension d. In particular, for the identity covariance Gaussian case, Diakonikolas et al. (2016) gave polynomial-time algorithms for the problem that use N = eΩ(d/ϵ2) samples and guarantee error O(ϵ p log(1/ϵ)). This error guarantee matches known Statistical Query (SQ) lower bounds (Diakonikolas et al., 2017b). 1.1. Overview of Results and Contributions In this paper, we consider a natural non-convex optimization formulation of high-dimensional robust mean estimation, and show that gradient descent1 efficiently converges to a near-optimal solution. Specifically, we show that gradient 1Throughout, we informally use the term gradient descent to refer to variations of gradient descent methods, which involve descent converges in a polynomial number of iterations and matches the error guarantee of the best known polynomialtime algorithms for the problem. Our technical contribution lies in showing that any approximate stationary point of our non-convex objective suffices in the sense that it gives a near-optimal solution for the underlying estimation problem. To describe our non-convex formulation, we require some background. We use the following framework for robust mean estimation, introduced in Diakonikolas et al. (2016). The idea is to assign a non-negative weight to each data point and then find an appropriate combination of weights such that the weighted empirical mean is close to the true mean. The constraint on the chosen weights is that they represent at least a (1 ϵ)-density fractional subset of the dataset. More formally, given datapoints X1, . . . , XN Rd with corresponding data matrix X Rd N, the objective is to find a weight vector w RN such that µw = Xw is close to µ . The constraint on w is that it belongs in the set N,ϵ = n w RN : w 1 = 1 and 0 wi 1 (1 ϵ)N i o , which is the convex hull of all uniform distributions over subsets S [N] of size |S| = (1 ϵ)N. Diakonikolas et al. (2016) established a key structural lemma (Lemma 2.1), which formed the basis of their algorithms. Roughly speaking, the lemma states that any weight vector w is a good solution if the spectral norm of the weighted empirical covariance, Σw = PN i=1 wi(Xi µw)(Xi µw) , is small. This lemma directly motivates the following non-convex optimization formulation: Min Σw 2 subject to w N,2ϵ (1) It follows from the aforementioned structural lemma that a near-optimal solution w to (1) gives an µw that is close to µ . The challenge is that the objective function is not convex, hence it is unclear how to efficiently optimize. Faced with this difficulty, prior works on the topic (Diakonikolas et al., 2016; 2017a) developed various sophisticated algorithms. In this paper, we work directly with the natural formulation (1). Despite its non-convexity, we are able to leverage the structure of the problem to show that gradient descent efficiently converges to a good vector w. In more detail, we prove a novel result about the structure of approximate stationary points of this objective. Theorem 1.2 (informal statement). Any approximate stationary point w of (1) defines an µw that is close to µ . See Theorem 3.1 for a detailed formal statement. Technically speaking, our statement is more subtle for various reasons, including the fact that the objective function is not updates based on a generalized notion of a gradient, e.g., subgradient for non-differentiable functions. Robust Mean Estimation via Gradient Descent differentiable and the domain is constrained. As a result, we require a careful definition of stationarity in our setting. Given Theorem 1.2, we proceed to show that projected sub-gradient descent converges to an approximate stationary point in a polynomial number of iterations. This step is also somewhat intricate as the function is non-convex, non-smooth and the optimization problem (1) involves constraints. In summary, we establish the following theorem: Theorem 1.3. After e O(N 2d4) iterations, projected subgradient descent on (1) outputs a point w such that with high probability µw µ 2 = O(ϵ p The bound we establish on the convergence rate on the spectral norm objective (1) is polynomially bounded, but relatively slow. Our second main contribution involves considering the softmax version of the spectral norm, which has better smoothness properties. An analogous lemma about the structure of stationary points allows us to show a faster rate of convergence for this modified objective. Theorem 1.4. After e O(Nd3/ϵ) iterations, projected gradient descent on the softmax objective outputs a point w such that with high probability µw µ 2 = O(ϵ p As evident from the above result, the additional smoothness of the softmax objective allows us to establish a significantly improved bound on the number of iterations. 1.2. Related Work The algorithmic question of designing efficient robust mean estimators in high-dimensions has been extensively studied in recent years. After the initial papers (Diakonikolas et al., 2016; Lai et al., 2016), a number of works (Diakonikolas et al., 2017a; Steinhardt et al., 2018; Cheng et al., 2018; Dong et al., 2019; Depersin & Lecue, 2019; Cheng et al., 2019) have obtained algorithms with improved asymptotic worst-case runtimes that work under weaker distributional assumptions on the good data. Moreover, efficient highdimensional robust mean estimators have been used as primitives for robustly solving a range of machine learning tasks that can be expressed as stochastic optimization problems (Prasad et al., 2018; Diakonikolas et al., 2019a). We compare our approach with the works of Cheng et al. (2018) and Dong et al. (2019) that give the asymptotically fastest known algorithms for robust mean estimation. At a high-level, Cheng et al. (2018), building on the convex programming relaxation of Diakonikolas et al. (2016), proposed a primal-dual approach for robust mean estimation that reduces the problem to a poly-logarithmic number of packing and covering SDPs. Each such SDP is known to be solvable in time e O(Nd), using mirror descent Allen Zhu et al. (2016); Peng et al. (2016). Dong et al. (2019) build on the iterative spectral approach of Diakonikolas et al. (2016). That work uses the matrix multiplicative weights update method with a specific regularization and dimensionreduction to improve the worst-case runtime. In contrast to all of the above, we use a natural non-convex formulation of the robust mean estimation task, and show that a standard first-order method provably and efficiently converges to a near-optimal solution. Even though the convergence rates that we establish in this work do not yield the fastest known asymptotic runtimes for the problem, we believe that our approach is conceptually interesting for a number of reasons. First, our theorem regarding stationary points provides novel structural understanding about robust mean estimation and can be viewed as an explanation as to why this problem is polynomially solvable. Second, it is plausible that gradient descent applied in this context is more stable than previously known algorithms and may facilitate the adoption of robust estimation methods in practice. We hope that this work will serve as the starting point for solving other robust estimation tasks via first-order methods. Finally, we note that there is an increasing literature on developing rigorous guarantees for non-convex optimization problems via gradient descent, e.g., see the recent survey (Jain & Kar, 2017) for a review of this literature. With a few exceptions (Loh & Wainwright, 2011; Hassani et al., 2017), this literature mostly focuses on showing that gradient descent converges to a global optimum starting from a spectral (Keshavan et al., 2010; Candes et al., 2015; Tu et al., 2015) or random initialization (Ge et al., 2015) in settings where there are no bad local optima. In contrast to most of this literature, in this paper we show that any stationary point has good approximation properties so that no specialized or random initialization is necessary. We believe that such a perspective may enable rigorous analysis of many other non-convex optimization problems. 1.3. Roadmap In Section 2, we set up the necessary notation and provide some background on robust mean estimation. In the next two sections, we focus on the spectral norm objective. In Section 3, we prove our main structural result showing that any stationary point of the spectral norm objective yields a good solution. We also extend this result in Appendix B, showing that in fact, any approximate stationary point yields a sufficiently good solution. In Section 4, we show that gradient descent converges to an approximate stationary point and hence yields a good solution in a polynomial number of iterations. In Appendix C, we prove structural and algorithmic results for the softmax objective, showing that any approximate stationary point of the softmax objective yields a good solution, and we can find an approximate stationary point using projected gradient descent in a polynomial number of iterations. We conclude with future directions in Section 5. Robust Mean Estimation via Gradient Descent 2. Preliminaries and Background Notation. For N Z+, we denote [N] := {1, . . . , N}. For a vector x, we use x 1, x 2, and x to denote the ℓ1, ℓ2, and ℓ norm of x respectively. For a matrix A, we use A 2 to denote the spectral norm of A. For two vectors x, y Rn, we use x y = Pn i=1 xiyi to denote the inner product of x and y, and we use x y Rn to denote entrywise product of x and y. For a vector x Rn, let diag(x) Rn n denote a diagonal matrix with x on the diagonal. For a matrix A Rn n, let diag(A) Rn denote a column vector with the diagonal entries of A. Let I denote the identity matrix. For a matrix A Rn n, let tr(A) denote the trace of A. For two matrices A and B of the same dimensions, let A B = A, B = tr(A B) be the entry-wise inner product of A and B. We use exp(A) to denote the matrix exponential of A. A symmetric matrix A Rn n is said to be positive semidefinite (PSD) if x Ax 0 for all x Rn. For two symmetric matrices A and B, we write A B iff the matrix B A is positive semidefinite. Let n n be the set of all PSD matrices of trace 1. Framework. We use N for the number of input samples, d for the dimension of the ground-truth distribution, and ϵ for the fraction of corrupted samples. Given N datapoints X1, . . . , XN Rd, we use X Rd N to denote the sample matrix, where the i-th column of X is Xi. Given w RN, let µw = Xw = PN i=1 wi Xi denote the weighted empirical mean and let Σw = PN i=1 wi(Xi µw)(Xi µw) denote the weighted empirical covariance. Let N,ϵ denote the convex hull of all uniform distributions over subsets S [N] of size |S| = (1 ϵ)N: N,ϵ = n w RN : w 1 = 1 and 0 wi 1 (1 ϵ)N i o . Every weight vector w N,ϵ corresponds to a fractional set of (1 ϵ)N samples. Background on Robust Mean Estimation. As mentioned in the introduction, our non-convex formulation is directly motivated by the following structural lemma: Lemma 2.1 (Diakonikolas et al. (2016)). Let S be an ϵcorrupted set of N = eΩ(d/ϵ2) samples from an unknown N(µ , I) and w N,2ϵ. If λmax (Σw) 1 + δ, for some δ 0, then with high probability, we have that µ µw 2 = O( As in prior work, we will establish correctness for our algorithms under deterministic conditions on the inliers (good samples) that hold with high probability. Let G denote the original set of N good samples. Let S = G B denote the input samples after the adversary replaced ϵ-fraction of the samples, where G G is the set of remaining good samples and B is the set of bad samples (outliers) added by the adversary. Note that |G| = (1 ϵ)N and |B| = ϵN. Given w RN, let w G = P i G wi be the total weight on good samples, and w B be the total weight on bad samples. We require the following concentration bounds to hold for the original N good samples G (which happens with high probability when N = eΩ(d/ϵ2)). For all bw N,3ϵ, we require the following condition to hold for δ = O(ϵ log(1/ϵ)): i G bwi(Xi µ )(Xi µ ) I Condition (2) on original samples G implies the following conditions on the remaining good samples G. For any weight vector w N,2ϵ on the ϵ-corrupted set of samples S = G B: i G wi(Xi µ )(Xi µ ) I This is because we can define bw as follows: bwi = wi w G for all i G and bwi = 0 for all i B. Since w N,2ϵ, we have bw w 1 w B w 1 |B| w 1 (1 3ϵ)N . In other words, bw N,3ϵ and Condition (3) follows directly from Condition (2). Remark 2.2 (Distributional Assumptions). For simplicity, in this paper we focus on the fundamental setting that the good data are drawn from an unknown mean and identity covariance Gaussian distribution. It should be noted that our structural and algorithmic results hold under more general distributional assumptions. Specifically, Theorem 4.1 immediately applies to identity covariance subgaussian distributions, with the same error guarantees, since it only relies on the concentration bounds (2) and (3) that only require subgaussian tails (see, e.g., (Diakonikolas et al., 2017a).) Moreover, one can modify the proof of our structural results (Theorems 3.1 and 3.2), mutatis-mutandis, to apply (1) for distributions with bounded covariance (i.e., Σ I) and match the optimal O( ϵ) approximation to the mean (Diakonikolas et al., 2017a); and, (2) more generally, under the (ϵ, δ)-stability condition of (Diakonikolas & Kane, 2019) to yield an O(δ) ℓ2-approximation to the mean. Background and Definitions of Stationarity. Note that the spectral norm is not a differentiable function and therefore we need an alternative definition of stationarity. To address this issue, by the definition of spectral norm, we can define a function F(w, u) = u Σwu that takes two parameters as input: the weights w RN and a unit vector u Rd. Our non-convex objective minw f(w) := Robust Mean Estimation via Gradient Descent Σw 2 is then equivalent to solving the minimax problem minw maxu F(w, u). The function maxu F(w, u) is weakly-convex, and we use the following stationary point definition that is common in the weakly-convex optimization literature (Rockafellar, 1970; 1981; Drusvyatskiy, 2017; Davis & Drusvyatskiy, 2018; Jin et al., 2019). Definition 2.3 (First-order stationary point). Let F(w, u) be a function that is differentiable with respect to w for all u. Let f(w) = maxu F(w, u). Consider the constrained optimization problem minw K f(w), where K is a closed convex set. We say that w K is a first-order stationary point if there exists some u arg maxv F(w, v) such that ( w F(w, u)) ( ew w) 0 for all ew K . We also need a notion of an approximate stationary point in the sense that the updates from one iteration to the next do not change much. In the unconstrained and differentiable case, such a point can be characterized by the gradient being small. However, the objective function we consider is both non-differentiable and has constraints, so that a proper definition of approximate stationarity is much more subtle. To overcome this, we appeal to tools from conic geometry and notions of stationarity for weakly convex functions (Rockafellar, 1970; 1981; Drusvyatskiy, 2017; Davis & Drusvyatskiy, 2018) to define an appropriate notion of approximate stationarity. To discuss the notion of approximate stationarity that we use, we need to work with a smoothed variant of the objective known as the Moreau envelope. Definition 2.4 (Moreau envelope). For any function f and closed convex set K, its associated Moreau envelope fβ(w) is defined to be the function fβ(w) := min e w K f( ew) + β w ew 2 2 . The Moreau envelope can be thought of as a form of convolution between the original function f and a quadratic, so as to smoothen the landscape. In particular, when f(w) takes the form of a maximization problem (f(w) = maxu F(w, u)) with F a mapping that is β-smooth in the u parameter (| w F(w, eu) w F(w, u)| β eu u 2), the Moreau envelope is also β-smooth (Drusvyatskiy, 2017). Therefore, the approximate stationarity of the Moreau envelope can be easily defined through its gradient allowing us to define the following notion of approximate stationarity. Definition 2.5 (Approximate first-order stationary point). For any function f and closed convex set K consider its associated Moreau envelope fβ(w) per Definition 2.4. we say that a point w is a ρ-approximately stationary point if fβ(w) 2 ρ. As mentioned earlier, the spectral norm admits a minimax formulation of the form f(w) = maxu F(w, u). Further- more, as detailed in Appendix B, the corresponding function F(w, u) is β-smooth with β = 2 X 2 2, so that this notion of approximate stationarity can be applied to the objective of interest in this paper. 3. Structural Result: Any Approximate Stationary Point Suffices In this section, we establish our main structural result, which says that every approximate stationary point of (1) must give a µw that is close to µ . For simplicity of the exposition, in the main body of this paper, we state and prove a simpler theorem showing that every (exact) stationary point is a good solution. Theorem 3.1 (Any stationary point is a good solution). Let S denote an ϵ-corrupted set of N samples drawn from a d-dimensional Gaussian N(µ , I) with unknown mean µ . Suppose that S satisfies Lemma 2.1 and Condition (3). Let f(w) be the objective function defined in Equation (1). For any first-order stationary point w N,2ϵ of f(w), we have µw µ 2 = O(ϵ p We note that while Theorem 3.1 shows that any (exact) stationary point has small objective value, a stronger statement is required for our algorithmic results in the next section. Specifically, we require that any approximate stationary point in the sense of Definition 2.5 which gradient descent efficiently converges to, also has low objective value. This is accomplished in the next theorem which we prove in Appendix B. Specifically, by appealing to the gradient of the Moreau envelope from Definition 2.4, we extend the proof of Theorem 3.1 to show the following: Theorem 3.2 (Any approximate stationary point suffices). Consider the same setting as in Theorem 3.1. Consider the spectral norm objective f(w) = Σw 2 with fβ(w) denoting the corresponding Moreau envelope function per Definition 2.4 with β = 2 X 2 2. Then, for any w N,2ϵ satisfying fβ(w) 2 = O(log(1/ϵ)) , we have µw µ 2 = O(ϵ p In the remainder of this section, we focus on proving Theorem 3.1 and briefly discuss how this proof can be generalized to prove Theorem 3.2. Our proof is carried out in two steps: (1) We establish a structural lemma which states that every stationary point w must satisfy a bimodal subgradient property; (2) We show any point satisfying such property must have a small objective value. Given these two steps, we can conclude any stationary points µw is close to µ , by Lemma 2.1. For the first step, the bimodal subgradient property states that there exists a vector ν f(w) (in the sub-gradient Robust Mean Estimation via Gradient Descent of the function at that stationary point) whose entries divided in two groups of indices such that for any i S and any j S+ we have νi νj. Intuitively, S contains all indices with positive wi, so they can potentially be decreased; while S+ contains all indices with wi < 1 (1 2ϵ)N , so they can potentially be increased. If the bimodal subgradient property is violated, there must be indices i S , j S+, where νi > νj. In this case, decreasing wi and increasing wj would decrease the objective and thus violate stationarity. For the second step, recall that Σw = X diag(w)X Xww X and F(w, u) = u Σwu. Let us first compute the subgradient w F(w, u) with respect to a vector u: w F(w, u) = X u X u 2(u Xw)X u . (4) Our key observation is that the sub-gradient at direction u is equivalent to the gradient of w for the one-dimensional problem with input (X i u)N i=1. This allows us to effectively reduce our problem to a one-dimensional robust mean estimation problem. This reduction allows us to show that when the objective function is large, then there must be some non-zero weights associated with the corrupted points that are far away from the mean (these points will be in S ); while on the other hand, S+ must contain at least ϵ-fraction of the good points. One can then select indices from these two sets to violate the bimodal sub-gradient property. Fix a first-order stationary point w N,2ϵ. Definition 2.3 implies that there is a corresponding unit vector u Rd such that w is a stationary point of F(w, u). We first state the bimodal sub-gradient property. Lemma 3.3 (Bimodal sub-gradient property at stationarity). Fix w N,2ϵ and a unit vector u with u Σwu = Σw 2. Let S = {i : wi > 0} and S+ = {i : wi < 1 (1 2ϵ)N } denote the coordinates of w that can decrease and increase respectively. If w is a first-order stationary point of F(w, u), then w F(w, u)i w F(w, u)j , for all i S and j S+. Proof. Suppose there is some i S and j S+ such that w F(w, u)i > w F(w, u)j, then intuitively we can make f(w) smaller by decreasing wi and increasing wj. Formally, let ew = w + min(wi, 1 (1 2ϵ)N wj)(ej ei) where ei is the i-th basis vector. We have ew N,2ϵ and ( w F(w, u)) ( ew w) < 0, which violates the assumption that w is a stationary point (Definition 2.3). Given Lemma 3.3, we prove Theorem 3.1 by contradiction. We show that if µw is far from µ , then w violates the property stated in Lemma 3.3 and therefore cannot be a stationary point. More specifically, we show that, if µw is far from µ , then there exists a bad sample with index j S whose gradient is large (Lemma 3.4). Meanwhile, the concentration bounds in Condition (3) guarantee that there exists a good sample with index i S+ whose gradient is small (Lemma 3.5). Lemma 3.4 (Bad sample with large gradient). Assume that Condition (3) and Lemma 2.1 hold. Fix w N,2ϵ and a unit vector u with u Σwu = Σw 2. Let r = µw µ 2 and suppose r c2ϵ p ln(1/ϵ). Then there exists some i (B S ) such that w F(w, u)i u µ (µ 2µw) u > 2c3 r2 Here, c2 and c3 are universal positive constants. Lemma 3.5 (Good sample with small gradient). Consider the same setting as in Lemma 3.4. There is some j (G S+) such that w F(w, u)j u µ (µ 2µw) u c3 r2 We defer the proofs of Lemmas 3.4 and 3.5 to Sections 3.1 and 3.2, and we first use these two lemmas to prove Theorem 3.1. Proof of Theorem 3.1. Suppose that w N,2ϵ is a firstorder stationary point of f(w), and moreover, w is a bad solution where µw µ 2 c2ϵ p ln(1/ϵ). By Definition 2.3, there exists a unit vector u Rd such that w is a stationary point of F(w, u). Fix such a vector u. Since Condition (3) and Lemma 2.1 both hold, we can invoke Lemmas 3.4 and 3.5 on (w, u) to find two coordinates i S and j S+ that violate the bimodal subgradient condition in Lemma 3.3. Consequently, w cannot be a stationary point of F(w, u). This leads to a contradiction, and therefore, all first-order stationary points of f(w) are good solutions. We now briefly comment on the modifications required to prove Theorem 3.2 (see Appendix B). Theorem 3.2 is proven by first showing (using conic geometry) that for such an approximate stationary point an approximate bimodal sub-gradient property holds. Specifically, we show that the bimodal sub-gradient property (Lemma 3.3) is stable in the sense that for an approximate stationary point an approximate bimodal sub-gradient property holds, i.e., νi νj + δ. Further, for any point obeying such an approximate bimodal property, the objective is small and has good approximation guarantees. The last two steps when combined show that any approximate stationary point has good approximation guarantees (similar to the proof of Theorem 3.1 for exact stationary points). Robust Mean Estimation via Gradient Descent 3.1. Finding a Bad Sample With Large Gradient In this subsection, we prove Lemma 3.4. Lemma 3.4 states that when µw is far from µ , there exists an index i (B S ) such that the gradient w F(w, u)i is relatively large. Recall that w F(w, u) in Equation (4) is the same as the gradient of the variance (weighted by w) of the onedimensional samples X i u N i=1. Roughly speaking, for this one-dimensional problem, a sample far from the (projected) true mean should have large gradient. Our objective is to find such a sample with positive weight. More specifically, since w is a bad solution and u is in the top eigenspace of Σw, the weighted empirical variance of the projected samples is very large. Because the good samples cannot have this much variance, most of the variance comes from the bad samples. We show that among the bad samples that contribute a lot to the variance, one of them must be very far from the (projected) true mean. In this section and Section 3.2, we use c1, . . . , c4 to denote universal constants that are independent of N, d, and ϵ. We give a detailed description of how to set these constants in Appendix A. Proof of Lemma 3.4. We first show that the variance of onedimensional samples X i u N i=1 is relatively large. By Lemma 2.1, we know that if µw µ 2 r and r c2ϵ p ln(1/ϵ), then λmax(Σw) 1 + c4 r2 ϵ for some universal constant c4. Because u is a unit vector that maximizes u Σwu, we have u Σwu = λmax(Σw) 1 + c4r2 Recall that Σw = PN i=1 wi(Xi µw)(Xi µ w). If we replace µw with µ , we have i=1 wi(Xi µ )(Xi µ ) Σw , and therefore, i=1 wi(Xi µ )(Xi µ ) ! Next we show that most of this variance is due to bad samples. By Condition (3), i G wi(Xi µ )(Xi µ ) ! u 1 + c1ϵ ln(1/ϵ) . Consequently, i B wi(Xi µ )(Xi µ ) ! ϵ c1ϵ ln(1/ϵ) 0.98 c4 r2 The last step is because r c2 ϵ p ln(1/ϵ) and we can choose c4 to be sufficiently large. Now that we know most of the variance is due to the bad samples, observe that the total weight w B on the bad samples is at most ϵN 1 (1 2ϵ)N 2ϵ. Therefore, there must be some i B with wi > 0 such that u (Xi µ )(Xi µ ) u 0.98 c4 r2 ϵ 1 In other words, u (Xi µ ) 0.7 c4 r By definition, i B S . It remains to show that w F(w, u)i is large. w F(w, u)i u µ (µ 2µw) u = u (Xi µ )(Xi µ ) u 2u (Xi µ )(µw µ ) u u (Xi µ ) 2 2 u (Xi µ ) µw µ 2 ϵ2 2 0.7 c4 r ϵ r > 2c3 r2 The first inequality is by Cauchy-Schwarz. The last step uses the fact that ϵ is sufficiently small. 3.2. Finding a Good Sample With Small Gradient In this subsection, we prove Lemma 3.5. Lemma 3.5 states that there exists an index j (G S+) such that the gradient w F(w, u)j is relatively small. Similar to the previous section, a sample close to the (projected) true mean should have small gradient. Our goal is to find such a sample for which we can increase its weight. Recall that S+ contains all samples whose weight can be increased. We first prove that there are at least ϵN good samples in S+. Among these ϵN good samples, the concentration bounds imply that some Xj must be very close to the (projected) true mean. Proof of Lemma 3.5. Recall that S+ contains every coordinate i where wi < 1 (1 2ϵ)N . Since at most (1 2ϵ)N Robust Mean Estimation via Gradient Descent samples can have the maximum weight 1 (1 2ϵ)N , we know that |S+| 2ϵN. Combining this with |G| = (1 ϵ)N, we know that |G S+| ϵN. Fix a subset G+ (G S+) of size |G+| = ϵN. We first show that on average, samples in G+ do not contribute much to the variance. Let w be the uniform weight vector on G, i.e., w i = 1 (1 ϵ)N for all i G and w i = 0 otherwise. Since w N,2ϵ, by Condition (3), 1 |G|(Xi µ )(Xi µ ) I 2 c1 ϵ ln(1/ϵ) . Let w be the uniform weight vector on S \ G+ = (G \ G+) B, i.e., w i = 1 (1 ϵ)N for all i ((G \ G+) B) and w i = 0 otherwise. Since w N,2ϵ, again by Condition (3), we have 1 |G|(Xi µ )(Xi µ ) I c1ϵ ln(1/ϵ). Combining the previous two concentration bounds, 1 |G|(Xi µ )(Xi µ ) 2 1 |G|(Xi µ )(Xi µ ) I 1 |G|(Xi µ )(Xi µ ) I 2 2c1 ϵ ln(1/ϵ) . Consequently, because u is a unit vector, 1 |G|(Xi µ )(Xi µ ) ! u 2c1ϵ ln(1/ϵ) . At this point, we know samples in G+ do not contribute much to the variance. We now proceed to show that one of these samples satisfies the lemma. Let j = arg mini G+ u (Xi µ ) . We have u (Xj µ )(Xj µ ) u |G| |G+| 2c1 ϵ ln(1/ϵ) 2c1 ln(1/ϵ) . Finally, because u (Xj µ ) p 2c1 ln(1/ϵ), we can show that w F(w, u)j is small: f(w)j µ Y (µ 2µw) = u (Xj µ )(Xj µ ) u + 2u (Xj µ )(µw µ ) u 2c1 ln(1/ϵ) + 2 p 2c1 ln(1/ϵ) r The last step uses that c3 is sufficiently large, as well as the fact that ln(1/ϵ) r2 ϵ2 because r c2ϵ p 4. Algorithmic Result: Finding a Stationary Point via Gradient Descent In this section, we show that a simple Projected Gradient Descent (PGD) algorithm (Algorithm 1) can efficiently find an approximate stationary point w of our spectral norm objective, and that w is a good solution to our robust mean estimation task. Algorithm 1 Robust Mean Estimation via PGD Input: ϵ-corrupted set of N samples {Xi}N i=1 on Rd satisfying Condition (3), and ϵ < ϵ0. Output: w RN with µw µ 2 O(ϵ p log(1/ϵ)). Let F(w, u) = u Σwu. Let w0 be an arbitrary weight vector in N,2ϵ. Let T = e O(N 2d4). for τ = 0 to T 1 do Find a unit vector uτ Rd such that F(wτ, uτ) (1 ϵ) maxu F(wτ, u). wτ+1 = P N,2ϵ (wτ η w F(wτ, uτ)), where PK( ) is the ℓ2 projection operator onto K. end for return wτ where τ = arg min 0 τ