
High-dimensional Robust Mean Estimation via Gradient Descent

Yu Cheng * 1 Ilias Diakonikolas * 2 Rong Ge * 3 Mahdi Soltanolkotabi * 4

We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families. In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our non-convex objective gives a near-optimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization, which may have broader applications to other robust estimation tasks.

1. Introduction

Learning in the presence of outliers is an important goal in machine learning that has become a pressing challenge in a number of high-dimensional data analysis applications, including data poisoning attacks (Barreno et al., 2010; Biggio et al., 2012; Steinhardt et al., 2017) and exploratory analysis of real datasets with natural outliers, e.g., in biology (Rosenberg et al., 2002; Paschou et al., 2010; Li et al., 2008). In both these application domains, the outliers are not random but can be arbitrarily correlated, and could exhibit rather complex structures that are essentially impossible to model accurately. Hence, the goal in these settings is to design computationally efficient estimators that can tolerate a small constant fraction of arbitrary outliers.

*Equal contribution. 1University of Illinois at Chicago, Chicago, IL, USA 2University of Wisconsin-Madison, Madison, WI, USA 3Duke University, Durham, NC, USA 4University of Southern California, Los Angeles, CA, USA. Correspondence to: Yu Cheng <yucheng2@uic.edu>, Ilias Diakonikolas <ilias@cs.wisc.edu>, Rong Ge <rongge@cs.duke.edu>, Mahdi Soltanolkotabi <soltanol@usc.edu>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Throughout this paper, we focus on the following data contamination model that generalizes several existing models, including Huber's contamination model (Huber, 1964).

Definition 1.1 (Strong Contamination Model). Given a parameter $0 < \epsilon < 1/2$ and a distribution family $\mathcal{D}$ on $\mathbb{R}^d$, the adversary operates as follows: The algorithm specifies the number of samples $N$, and $N$ samples are drawn from some unknown $D \in \mathcal{D}$. The adversary is allowed to inspect the samples, remove up to $\epsilon N$ of them and replace them with arbitrary points. This modified set of $N$ points is then given as input to the algorithm. We say that a set of samples is $\epsilon$-corrupted if it is generated by the above process.

The parameter $\epsilon$ in the above definition is the fraction of corrupted samples and quantifies the power of the adversary. Intuitively, among our samples, an unknown $(1-\epsilon)$ fraction are generated from a distribution of interest and are called inliers, and the rest are called outliers.
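To make Definition 1.1 concrete, the following is a minimal sketch of how an $\epsilon$-corrupted sample set could be simulated. The function name `corrupt_samples`, the single far-away outlier cluster, and all numeric values are hypothetical choices for illustration only; the strong contamination model allows the adversary to pick any replacement points after inspecting the samples.

```python
import numpy as np

def corrupt_samples(rng, n_samples, dim, eps, true_mean):
    """Draw N samples from N(true_mean, I), then replace eps*N of them.

    The outlier placement (one cluster shifted along the first axis) is a
    weak, hypothetical adversary used only for illustration.
    """
    samples = rng.normal(size=(n_samples, dim)) + true_mean  # inliers
    n_bad = int(eps * n_samples)
    bad_idx = rng.choice(n_samples, size=n_bad, replace=False)
    shift = np.zeros(dim)
    shift[0] = 5.0
    samples[bad_idx] = true_mean + shift  # adversarial replacements
    is_outlier = np.zeros(n_samples, dtype=bool)
    is_outlier[bad_idx] = True
    return samples, is_outlier

rng = np.random.default_rng(0)
true_mean = np.ones(20)
X, is_outlier = corrupt_samples(rng, n_samples=1000, dim=20, eps=0.1,
                                true_mean=true_mean)
# The plain empirical mean is dragged towards the outlier cluster.
naive_first_coord = X.mean(axis=0)[0]
```

Even this simple adversary shifts the empirical mean by roughly $\epsilon$ times the outlier distance, which is the kind of error a robust estimator is meant to avoid.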

The statistical foundations of outlier-robust estimation were laid out in early work by the robust statistics community, starting with the pioneering works of Tukey (1960) and Huber (1964). In contrast, until fairly recently, even the most basic algorithmic questions were poorly understood. Speciﬁcally, even for the basic task of high-dimensional mean estimation, all known robust estimators had runtime exponential in the dimension, rendering them ineffective in high-dimensional settings.

Recently, Diakonikolas et al. (2016); Lai et al. (2016) gave the first efficiently computable robust estimators for high-dimensional unsupervised learning tasks, including mean and covariance estimation. Specifically, Diakonikolas et al. (2016) obtained the first polynomial-time robust estimators with dimension-independent error guarantees, i.e., with error scaling only with the fraction of corrupted samples $\epsilon$ and not with the dimensionality of the data. Since the dissemination of these works, there has been a flurry of research activity on algorithmic aspects of high-dimensional robust statistics; see, e.g., Diakonikolas & Kane (2019) for a recent survey on the topic.

Despite this exciting progress, the design of efficient robust estimators in high dimensions remains challenging. The difficulty, of course, lies in the non-convexity of the underlying optimization problem. Prior work developed fairly sophisticated algorithmic tools, even for the task of robust mean estimation. These include convex relaxations (Diakonikolas et al., 2016) and quite subtle iterative spectral methods (Diakonikolas et al., 2016; Lai et al., 2016).

A natural and important goal is to understand to what extent such sophisticated methods are indeed necessary or whether much simpler robust learning algorithms exist. In this work, we take a direct optimization view of these problems and ask the following general question:

Is it possible to solve robust estimation tasks by standard ﬁrst-order methods?

We believe that this question merits investigation in its own right. Moreover, its positive resolution may have significant implications in the practical adoption of robust estimation methods. This is particularly so since prior algorithms either (1) are computationally prohibitive (relying on large convex relaxations), (2) involve carefully crafted parameters that require precise tuning for practical deployment, or (3) are challenging to extend to more sophisticated robust estimation tasks. A tantalizing possibility is the following: For a range of high-dimensional robust estimation tasks, there exists a (natural) non-convex formulation such that gradient descent efficiently converges to a near-optimal solution.

In this paper, we show that this premise is true for the task of high-dimensional robust mean estimation. In robust mean estimation, we are given a set of $N$ $\epsilon$-corrupted samples from an unknown distribution $D$ in a known family $\mathcal{D}$, and we want to output a hypothesis vector $\hat\mu$ such that $\|\hat\mu - \mu^\star\|_2$ is as small as possible, where $\mu^\star$ is the mean of $D$. For simplicity, we will assume in this discussion that $D$ is an unknown mean and identity covariance Gaussian on $\mathbb{R}^d$. We note that our results hold under more general distributional assumptions, as in Diakonikolas et al. (2016; 2017a).

The goal in robust mean estimation is to develop efficient algorithms whose $\ell_2$-error guarantee scales only with $\epsilon$ and not with the dimension $d$. In particular, for the identity covariance Gaussian case, Diakonikolas et al. (2016) gave polynomial-time algorithms for the problem that use $N = \tilde\Omega(d/\epsilon^2)$ samples and guarantee error $O(\epsilon\sqrt{\log(1/\epsilon)})$. This error guarantee matches known Statistical Query (SQ) lower bounds (Diakonikolas et al., 2017b).

1.1. Overview of Results and Contributions

In this paper, we consider a natural non-convex optimization formulation of high-dimensional robust mean estimation, and show that gradient descent¹ efficiently converges to a near-optimal solution. Specifically, we show that gradient descent converges in a polynomial number of iterations and matches the error guarantee of the best known polynomial-time algorithms for the problem. Our technical contribution lies in showing that any approximate stationary point of our non-convex objective suffices in the sense that it gives a near-optimal solution for the underlying estimation problem.

¹Throughout, we informally use the term gradient descent to refer to variations of gradient descent methods, which involve updates based on a generalized notion of a gradient, e.g., subgradient for non-differentiable functions.

To describe our non-convex formulation, we require some background. We use the following framework for robust mean estimation, introduced in Diakonikolas et al. (2016). The idea is to assign a non-negative weight to each data point and then find an appropriate combination of weights such that the weighted empirical mean is close to the true mean. The constraint on the chosen weights is that they represent at least a $(1-\epsilon)$-density fractional subset of the dataset. More formally, given datapoints $X_1, \ldots, X_N \in \mathbb{R}^d$ with corresponding data matrix $X \in \mathbb{R}^{d \times N}$, the objective is to find a weight vector $w \in \mathbb{R}^N$ such that $\mu_w = Xw$ is close to $\mu^\star$. The constraint on $w$ is that it belongs in the set

$$\Delta_{N,\epsilon} = \Big\{ w \in \mathbb{R}^N : \|w\|_1 = 1 \text{ and } 0 \le w_i \le \tfrac{1}{(1-\epsilon)N} \ \forall i \Big\},$$

which is the convex hull of all uniform distributions over subsets $S \subseteq [N]$ of size $|S| = (1-\epsilon)N$.

Diakonikolas et al. (2016) established a key structural lemma (Lemma 2.1), which formed the basis of their algorithms. Roughly speaking, the lemma states that any weight vector $w$ is a good solution if the spectral norm of the weighted empirical covariance, $\Sigma_w = \sum_{i=1}^N w_i (X_i - \mu_w)(X_i - \mu_w)^\top$, is small. This lemma directly motivates the following non-convex optimization formulation:

$$\min \ \|\Sigma_w\|_2 \quad \text{subject to } w \in \Delta_{N,2\epsilon} \qquad (1)$$

It follows from the aforementioned structural lemma that a near-optimal solution $w$ to (1) gives a $\mu_w$ that is close to $\mu^\star$. The challenge is that the objective function is not convex, hence it is unclear how to efficiently optimize. Faced with this difficulty, prior works on the topic (Diakonikolas et al., 2016; 2017a) developed various sophisticated algorithms.

In this paper, we work directly with the natural formulation (1). Despite its non-convexity, we are able to leverage the structure of the problem to show that gradient descent efficiently converges to a good vector $w$. In more detail, we prove a novel result about the structure of approximate stationary points of this objective.

Theorem 1.2 (informal statement). Any approximate stationary point $w$ of (1) defines a $\mu_w$ that is close to $\mu^\star$.

See Theorem 3.1 for a detailed formal statement. Technically speaking, our statement is more subtle for various reasons, including the fact that the objective function is not differentiable and the domain is constrained. As a result, we require a careful definition of stationarity in our setting.

Given Theorem 1.2, we proceed to show that projected sub-gradient descent converges to an approximate stationary point in a polynomial number of iterations. This step is also somewhat intricate as the function is non-convex, non-smooth and the optimization problem (1) involves constraints. In summary, we establish the following theorem:

Theorem 1.3. After $\tilde O(N^2 d^4)$ iterations, projected sub-gradient descent on (1) outputs a point $w$ such that with high probability $\|\mu_w - \mu^\star\|_2 = O(\epsilon\sqrt{\log(1/\epsilon)})$.

The bound we establish on the convergence rate on the spectral norm objective (1) is polynomially bounded, but relatively slow. Our second main contribution involves considering the softmax version of the spectral norm, which has better smoothness properties. An analogous lemma about the structure of stationary points allows us to show a faster rate of convergence for this modified objective.

Theorem 1.4. After $\tilde O(N d^3/\epsilon)$ iterations, projected gradient descent on the softmax objective outputs a point $w$ such that with high probability $\|\mu_w - \mu^\star\|_2 = O(\epsilon\sqrt{\log(1/\epsilon)})$.

As evident from the above result, the additional smoothness of the softmax objective allows us to establish a signiﬁcantly improved bound on the number of iterations.
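The projected sub-gradient method on formulation (1) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tuned procedure analysed in the paper: the step size, the iteration count, the bisection-based projection onto $\Delta_{N,2\epsilon}$, the choice to return the best iterate, and all function names are hypothetical. The sub-gradient uses a top eigenvector $u$ of $\Sigma_w$ and the expression $(X^\top u) \odot (X^\top u) - 2(u^\top X w) X^\top u$ derived in Section 3.

```python
import numpy as np

def project_capped_simplex(v, cap, iters=100):
    """Euclidean projection onto {w : sum(w) = 1, 0 <= w_i <= cap} by bisection."""
    lo, hi = v.min() - cap, v.max()  # clipped sum is >= 1 at lo and 0 at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, cap).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi), 0.0, cap)

def top_direction_subgradient(X, w):
    """Sub-gradient of ||Sigma_w||_2 in w, and the current objective value."""
    mu_w = X @ w
    centered = X - mu_w[:, None]
    sigma_w = (centered * w) @ centered.T      # weighted empirical covariance
    eigvals, eigvecs = np.linalg.eigh(sigma_w)
    u = eigvecs[:, -1]                         # a top eigenvector of Sigma_w
    proj = X.T @ u
    grad = proj * proj - 2.0 * (u @ mu_w) * proj
    return grad, eigvals[-1]

def robust_mean_subgradient_descent(X, eps, step, n_iters):
    """X has shape (d, N). Returns the weighted mean of the best iterate."""
    d, N = X.shape
    cap = 1.0 / ((1.0 - 2.0 * eps) * N)        # weight cap of Delta_{N, 2 eps}
    w = np.full(N, 1.0 / N)
    best_w, best_val = w, np.inf
    for _ in range(n_iters):
        grad, val = top_direction_subgradient(X, w)
        if val < best_val:
            best_w, best_val = w, val
        w = project_capped_simplex(w - step * grad, cap)
    return X @ best_w, best_val

# Hypothetical demo: 10% of the samples replaced by one far-away cluster.
rng = np.random.default_rng(0)
d, N, eps = 10, 500, 0.1
true_mean = np.ones(d)
X = rng.normal(size=(d, N)) + true_mean[:, None]
X[:, : int(eps * N)] = (true_mean + 8.0 * np.eye(d)[0])[:, None]
mu_hat, objective = robust_mean_subgradient_descent(X, eps, step=0.02 / N, n_iters=200)
naive_error = np.linalg.norm(X.mean(axis=1) - true_mean)
robust_error = np.linalg.norm(mu_hat - true_mean)
```

The iteration count used here is far below the worst-case bounds of Theorems 1.3 and 1.4 and is chosen only so that the example runs quickly.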

1.2. Related Work

The algorithmic question of designing efficient robust mean estimators in high dimensions has been extensively studied in recent years. After the initial papers (Diakonikolas et al., 2016; Lai et al., 2016), a number of works (Diakonikolas et al., 2017a; Steinhardt et al., 2018; Cheng et al., 2018; Dong et al., 2019; Depersin & Lecue, 2019; Cheng et al., 2019) have obtained algorithms with improved asymptotic worst-case runtimes that work under weaker distributional assumptions on the good data. Moreover, efficient high-dimensional robust mean estimators have been used as primitives for robustly solving a range of machine learning tasks that can be expressed as stochastic optimization problems (Prasad et al., 2018; Diakonikolas et al., 2019a).

We compare our approach with the works of Cheng et al. (2018) and Dong et al. (2019) that give the asymptotically fastest known algorithms for robust mean estimation. At a high level, Cheng et al. (2018), building on the convex programming relaxation of Diakonikolas et al. (2016), proposed a primal-dual approach for robust mean estimation that reduces the problem to a poly-logarithmic number of packing and covering SDPs. Each such SDP is known to be solvable in time $\tilde O(Nd)$, using mirror descent (Allen-Zhu et al., 2016; Peng et al., 2016). Dong et al. (2019) build on the iterative spectral approach of Diakonikolas et al. (2016). That work uses the matrix multiplicative weights update method with a specific regularization and dimension reduction to improve the worst-case runtime.

In contrast to all of the above, we use a natural non-convex formulation of the robust mean estimation task, and show that a standard ﬁrst-order method provably and efﬁciently converges to a near-optimal solution. Even though the convergence rates that we establish in this work do not yield the fastest known asymptotic runtimes for the problem, we believe that our approach is conceptually interesting for a number of reasons. First, our theorem regarding stationary points provides novel structural understanding about robust mean estimation and can be viewed as an explanation as to why this problem is polynomially solvable. Second, it is plausible that gradient descent applied in this context is more stable than previously known algorithms and may facilitate the adoption of robust estimation methods in practice. We hope that this work will serve as the starting point for solving other robust estimation tasks via ﬁrst-order methods.

Finally, we note that there is an increasing literature on developing rigorous guarantees for non-convex optimization problems via gradient descent, e.g., see the recent survey (Jain & Kar, 2017) for a review of this literature. With a few exceptions (Loh & Wainwright, 2011; Hassani et al., 2017), this literature mostly focuses on showing that gradient descent converges to a global optimum starting from a spectral (Keshavan et al., 2010; Candes et al., 2015; Tu et al., 2015) or random initialization (Ge et al., 2015) in settings where there are no bad local optima. In contrast to most of this literature, in this paper we show that any stationary point has good approximation properties so that no specialized or random initialization is necessary. We believe that such a perspective may enable rigorous analysis of many other non-convex optimization problems.

1.3. Roadmap

In Section 2, we set up the necessary notation and provide some background on robust mean estimation. In the next two sections, we focus on the spectral norm objective. In Section 3, we prove our main structural result showing that any stationary point of the spectral norm objective yields a good solution. We also extend this result in Appendix B, showing that in fact, any approximate stationary point yields a sufﬁciently good solution. In Section 4, we show that gradient descent converges to an approximate stationary point and hence yields a good solution in a polynomial number of iterations. In Appendix C, we prove structural and algorithmic results for the softmax objective, showing that any approximate stationary point of the softmax objective yields a good solution, and we can ﬁnd an approximate stationary point using projected gradient descent in a polynomial number of iterations. We conclude with future directions in Section 5.


2. Preliminaries and Background

Notation. For $N \in \mathbb{Z}_+$, we denote $[N] := \{1, \ldots, N\}$. For a vector $x$, we use $\|x\|_1$, $\|x\|_2$, and $\|x\|_\infty$ to denote the $\ell_1$, $\ell_2$, and $\ell_\infty$ norm of $x$ respectively. For a matrix $A$, we use $\|A\|_2$ to denote the spectral norm of $A$.

For two vectors $x, y \in \mathbb{R}^n$, we use $x^\top y = \sum_{i=1}^n x_i y_i$ to denote the inner product of $x$ and $y$, and we use $x \odot y \in \mathbb{R}^n$ to denote the entrywise product of $x$ and $y$. For a vector $x \in \mathbb{R}^n$, let $\operatorname{diag}(x) \in \mathbb{R}^{n \times n}$ denote a diagonal matrix with $x$ on the diagonal. For a matrix $A \in \mathbb{R}^{n \times n}$, let $\operatorname{diag}(A) \in \mathbb{R}^n$ denote a column vector with the diagonal entries of $A$.

Let $I$ denote the identity matrix. For a matrix $A \in \mathbb{R}^{n \times n}$, let $\operatorname{tr}(A)$ denote the trace of $A$. For two matrices $A$ and $B$ of the same dimensions, let $A \bullet B = \langle A, B \rangle = \operatorname{tr}(A^\top B)$ be the entry-wise inner product of $A$ and $B$. We use $\exp(A)$ to denote the matrix exponential of $A$.

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is said to be positive semidefinite (PSD) if $x^\top A x \ge 0$ for all $x \in \mathbb{R}^n$. For two symmetric matrices $A$ and $B$, we write $A \preceq B$ iff the matrix $B - A$ is positive semidefinite. Let $\Delta_{n \times n}$ be the set of all PSD matrices of trace 1.

Framework. We use $N$ for the number of input samples, $d$ for the dimension of the ground-truth distribution, and $\epsilon$ for the fraction of corrupted samples. Given $N$ datapoints $X_1, \ldots, X_N \in \mathbb{R}^d$, we use $X \in \mathbb{R}^{d \times N}$ to denote the sample matrix, where the $i$-th column of $X$ is $X_i$.

Given $w \in \mathbb{R}^N$, let $\mu_w = Xw = \sum_{i=1}^N w_i X_i$ denote the weighted empirical mean and let $\Sigma_w = \sum_{i=1}^N w_i (X_i - \mu_w)(X_i - \mu_w)^\top$ denote the weighted empirical covariance. Let $\Delta_{N,\epsilon}$ denote the convex hull of all uniform distributions over subsets $S \subseteq [N]$ of size $|S| = (1-\epsilon)N$:

$$\Delta_{N,\epsilon} = \Big\{ w \in \mathbb{R}^N : \|w\|_1 = 1 \text{ and } 0 \le w_i \le \tfrac{1}{(1-\epsilon)N} \ \forall i \Big\}.$$

Every weight vector $w \in \Delta_{N,\epsilon}$ corresponds to a fractional set of $(1-\epsilon)N$ samples.
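As a small illustration of this framework, the sketch below computes $\mu_w$ and $\Sigma_w$ for a given weight vector and checks membership in $\Delta_{N,\epsilon}$. The helper names `weighted_mean_cov` and `in_delta`, the tolerance, and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def weighted_mean_cov(X, w):
    """X has shape (d, N) with samples as columns; w has shape (N,)."""
    mu_w = X @ w                                   # weighted empirical mean
    centered = X - mu_w[:, None]
    sigma_w = (centered * w) @ centered.T          # weighted empirical covariance
    return mu_w, sigma_w

def in_delta(w, eps, tol=1e-9):
    """Check w in Delta_{N,eps}: entries in [0, 1/((1-eps)N)] summing to 1."""
    cap = 1.0 / ((1.0 - eps) * len(w))
    return bool(abs(w.sum() - 1.0) <= tol and w.min() >= -tol and w.max() <= cap + tol)

rng = np.random.default_rng(1)
d, N, eps = 3, 10, 0.2
X = rng.normal(size=(d, N))
# Uniform weights on the first (1 - eps) N = 8 samples: a vertex of Delta_{N,eps}.
w = np.zeros(N)
w[:8] = 1.0 / 8
mu_w, sigma_w = weighted_mean_cov(X, w)
```

For such a vertex, $\mu_w$ and $\Sigma_w$ reduce to the ordinary empirical mean and (biased) empirical covariance of the selected subset.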

Background on Robust Mean Estimation. As mentioned in the introduction, our non-convex formulation is directly motivated by the following structural lemma:

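Lemma 2.1 (Diakonikolas et al. (2016)). Let $S$ be an $\epsilon$-corrupted set of $N = \tilde\Omega(d/\epsilon^2)$ samples from an unknown $\mathcal{N}(\mu^\star, I)$ and $w \in \Delta_{N,2\epsilon}$. If $\lambda_{\max}(\Sigma_w) \le 1 + \delta$, for some $\delta \ge 0$, then with high probability, we have that $\|\mu^\star - \mu_w\|_2 = O(\sqrt{\epsilon\delta} + \epsilon\sqrt{\log(1/\epsilon)})$.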

As in prior work, we will establish correctness for our algorithms under deterministic conditions on the inliers (good samples) that hold with high probability. Let $G^\star$ denote the original set of $N$ good samples. Let $S = G \cup B$ denote the input samples after the adversary replaced an $\epsilon$-fraction of the samples, where $G \subseteq G^\star$ is the set of remaining good samples and $B$ is the set of bad samples (outliers) added by the adversary. Note that $|G| = (1-\epsilon)N$ and $|B| = \epsilon N$. Given $w \in \mathbb{R}^N$, let $w_G = \sum_{i \in G} w_i$ be the total weight on good samples, and $w_B$ be the total weight on bad samples.

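We require the following concentration bounds to hold for the original $N$ good samples $G^\star$ (which happens with high probability when $N = \tilde\Omega(d/\epsilon^2)$). For all $\hat w \in \Delta_{N,3\epsilon}$, we require the following condition to hold for $\delta = O(\epsilon \log(1/\epsilon))$:

$$\Big\| \sum_{i \in G^\star} \hat w_i (X_i - \mu^\star)(X_i - \mu^\star)^\top - I \Big\|_2 \le \delta . \qquad (2)$$

Condition (2) on original samples $G^\star$ implies the following conditions on the remaining good samples $G$. For any weight vector $w \in \Delta_{N,2\epsilon}$ on the $\epsilon$-corrupted set of samples $S = G \cup B$:

$$\Big\| \sum_{i \in G} \frac{w_i}{w_G} (X_i - \mu^\star)(X_i - \mu^\star)^\top - I \Big\|_2 \le \delta . \qquad (3)$$

This is because we can define $\hat w$ as follows: $\hat w_i = \frac{w_i}{w_G}$ for all $i \in G$ and $\hat w_i = 0$ for all $i \in B$. Since $w \in \Delta_{N,2\epsilon}$, we have $\|\hat w\|_\infty \le \frac{\|w\|_\infty}{1 - w_B} \le \frac{\|w\|_\infty}{1 - |B| \, \|w\|_\infty} \le \frac{1}{(1-3\epsilon)N}$. In other words, $\hat w \in \Delta_{N,3\epsilon}$ and Condition (3) follows directly from Condition (2).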

Remark 2.2 (Distributional Assumptions). For simplicity, in this paper we focus on the fundamental setting that the good data are drawn from an unknown mean and identity covariance Gaussian distribution. It should be noted that our structural and algorithmic results hold under more general distributional assumptions. Specifically, Theorem 4.1 immediately applies to identity covariance subgaussian distributions, with the same error guarantees, since it only relies on the concentration bounds (2) and (3) that only require subgaussian tails (see, e.g., Diakonikolas et al., 2017a). Moreover, one can modify the proof of our structural results (Theorems 3.1 and 3.2), mutatis mutandis, to apply (1) for distributions with bounded covariance (i.e., $\Sigma \preceq I$) and match the optimal $O(\sqrt{\epsilon})$ approximation to the mean (Diakonikolas et al., 2017a); and, (2) more generally, under the $(\epsilon, \delta)$-stability condition of Diakonikolas & Kane (2019) to yield an $O(\delta)$ $\ell_2$-approximation to the mean.

Background and Definitions of Stationarity. Note that the spectral norm is not a differentiable function and therefore we need an alternative definition of stationarity. To address this issue, by the definition of spectral norm, we can define a function $F(w, u) = u^\top \Sigma_w u$ that takes two parameters as input: the weights $w \in \mathbb{R}^N$ and a unit vector $u \in \mathbb{R}^d$. Our non-convex objective $\min_w f(w) := \|\Sigma_w\|_2$ is then equivalent to solving the minimax problem $\min_w \max_u F(w, u)$. The function $\max_u F(w, u)$ is weakly-convex, and we use the following stationary point definition that is common in the weakly-convex optimization literature (Rockafellar, 1970; 1981; Drusvyatskiy, 2017; Davis & Drusvyatskiy, 2018; Jin et al., 2019).

Definition 2.3 (First-order stationary point). Let $F(w, u)$ be a function that is differentiable with respect to $w$ for all $u$. Let $f(w) = \max_u F(w, u)$. Consider the constrained optimization problem $\min_{w \in K} f(w)$, where $K$ is a closed convex set. We say that $w \in K$ is a first-order stationary point if there exists some $u \in \arg\max_v F(w, v)$ such that

$$(\nabla_w F(w, u))^\top (\tilde w - w) \ge 0 \quad \text{for all } \tilde w \in K .$$

We also need a notion of an approximate stationary point in the sense that the updates from one iteration to the next do not change much. In the unconstrained and differentiable case, such a point can be characterized by the gradient being small. However, the objective function we consider is both non-differentiable and has constraints, so that a proper deﬁnition of approximate stationarity is much more subtle. To overcome this, we appeal to tools from conic geometry and notions of stationarity for weakly convex functions (Rockafellar, 1970; 1981; Drusvyatskiy, 2017; Davis & Drusvyatskiy, 2018) to deﬁne an appropriate notion of approximate stationarity.

To discuss the notion of approximate stationarity that we use, we need to work with a smoothed variant of the objective known as the Moreau envelope.

Definition 2.4 (Moreau envelope). For any function $f$ and closed convex set $K$, its associated Moreau envelope $f_\beta(w)$ is defined to be the function

$$f_\beta(w) := \min_{\tilde w \in K} \; f(\tilde w) + \beta \|w - \tilde w\|_2^2 .$$

The Moreau envelope can be thought of as a form of convolution between the original function $f$ and a quadratic, so as to smoothen the landscape. In particular, when $f(w)$ takes the form of a maximization problem ($f(w) = \max_u F(w, u)$) with $F$ a mapping that is $\beta$-smooth in the $u$ parameter ($\|\nabla_w F(w, \tilde u) - \nabla_w F(w, u)\| \le \beta \|\tilde u - u\|_2$), the Moreau envelope is also $\beta$-smooth (Drusvyatskiy, 2017). Therefore, the approximate stationarity of the Moreau envelope can be easily defined through its gradient, allowing us to define the following notion of approximate stationarity.

Definition 2.5 (Approximate first-order stationary point). For any function $f$ and closed convex set $K$, consider its associated Moreau envelope $f_\beta(w)$ per Definition 2.4. We say that a point $w$ is a $\rho$-approximately stationary point if $\|\nabla f_\beta(w)\|_2 \le \rho$.
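The smoothing effect of Definition 2.4 can be seen on a one-dimensional toy problem. The sketch below evaluates the envelope of $f(x) = |x|$ over $K = [-1, 1]$ by brute force on a grid; the grid resolution, the value of $\beta$, and the function name are assumptions made for the example. For this particular $f$, a direct calculation gives $f_\beta(x) = \beta x^2$ when $|x| \le 1/(2\beta)$ and $f_\beta(x) = |x| - 1/(4\beta)$ otherwise (as long as the minimizer stays inside $K$), so the kink of $|x|$ at the origin is replaced by a quadratic.

```python
import numpy as np

def moreau_envelope_1d(f, x, beta, grid):
    """Brute-force min over y in grid of f(y) + beta * (x - y)**2."""
    return float(np.min(f(grid) + beta * (x - grid) ** 2))

grid = np.linspace(-1.0, 1.0, 4001)   # discretised constraint set K = [-1, 1]
beta = 2.0
env_near_kink = moreau_envelope_1d(np.abs, 0.1, beta, grid)   # quadratic regime
env_far = moreau_envelope_1d(np.abs, 0.5, beta, grid)          # linear regime
```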

As mentioned earlier, the spectral norm admits a minimax formulation of the form $f(w) = \max_u F(w, u)$. Furthermore, as detailed in Appendix B, the corresponding function $F(w, u)$ is $\beta$-smooth with $\beta = 2\|X\|_2^2$, so that this notion of approximate stationarity can be applied to the objective of interest in this paper.

3. Structural Result: Any Approximate Stationary Point Sufﬁces

In this section, we establish our main structural result, which says that every approximate stationary point of (1) must give a $\mu_w$ that is close to $\mu^\star$. For simplicity of the exposition, in the main body of this paper, we state and prove a simpler theorem showing that every (exact) stationary point is a good solution.

Theorem 3.1 (Any stationary point is a good solution). Let $S$ denote an $\epsilon$-corrupted set of $N$ samples drawn from a $d$-dimensional Gaussian $\mathcal{N}(\mu^\star, I)$ with unknown mean $\mu^\star$. Suppose that $S$ satisfies Lemma 2.1 and Condition (3). Let $f(w)$ be the objective function defined in Equation (1). For any first-order stationary point $w \in \Delta_{N,2\epsilon}$ of $f(w)$, we have $\|\mu_w - \mu^\star\|_2 = O(\epsilon\sqrt{\log(1/\epsilon)})$.

We note that while Theorem 3.1 shows that any (exact) stationary point has small objective value, a stronger statement is required for our algorithmic results in the next section. Speciﬁcally, we require that any approximate stationary point in the sense of Deﬁnition 2.5 which gradient descent efﬁciently converges to, also has low objective value. This is accomplished in the next theorem which we prove in Appendix B. Speciﬁcally, by appealing to the gradient of the Moreau envelope from Deﬁnition 2.4, we extend the proof of Theorem 3.1 to show the following:

Theorem 3.2 (Any approximate stationary point suffices). Consider the same setting as in Theorem 3.1. Consider the spectral norm objective $f(w) = \|\Sigma_w\|_2$ with $f_\beta(w)$ denoting the corresponding Moreau envelope function per Definition 2.4 with $\beta = 2\|X\|_2^2$. Then, for any $w \in \Delta_{N,2\epsilon}$ satisfying $\|\nabla f_\beta(w)\|_2 = O(\log(1/\epsilon))$, we have $\|\mu_w - \mu^\star\|_2 = O(\epsilon\sqrt{\log(1/\epsilon)})$.

In the remainder of this section, we focus on proving Theorem 3.1 and briefly discuss how this proof can be generalized to prove Theorem 3.2. Our proof is carried out in two steps: (1) We establish a structural lemma which states that every stationary point $w$ must satisfy a bimodal subgradient property; (2) We show that any point satisfying this property must have a small objective value. Given these two steps, we can conclude by Lemma 2.1 that for any stationary point $w$, the weighted mean $\mu_w$ is close to $\mu^\star$.

For the first step, the bimodal subgradient property states that there exists a vector $\nu \in \partial f(w)$ (in the sub-gradient of the function at that stationary point) whose entries can be divided into two groups of indices $S_-$ and $S_+$ such that for any $i \in S_-$ and any $j \in S_+$ we have $\nu_i \le \nu_j$. Intuitively, $S_-$ contains all indices with positive $w_i$, so they can potentially be decreased; while $S_+$ contains all indices with $w_i < \frac{1}{(1-2\epsilon)N}$, so they can potentially be increased. If the bimodal subgradient property is violated, there must be indices $i \in S_-$, $j \in S_+$, where $\nu_i > \nu_j$. In this case, decreasing $w_i$ and increasing $w_j$ would decrease the objective and thus violate stationarity.

For the second step, recall that

$$\Sigma_w = X \operatorname{diag}(w) X^\top - X w w^\top X^\top$$

and $F(w, u) = u^\top \Sigma_w u$. Let us first compute the sub-gradient $\nabla_w F(w, u)$ with respect to a vector $u$:

$$\nabla_w F(w, u) = (X^\top u) \odot (X^\top u) - 2 (u^\top X w) X^\top u . \qquad (4)$$

Our key observation is that the sub-gradient at direction $u$ is equivalent to the gradient of $w$ for the one-dimensional problem with input $(X_i^\top u)_{i=1}^N$. This allows us to effectively reduce our problem to a one-dimensional robust mean estimation problem. This reduction allows us to show that when the objective function is large, then there must be some non-zero weights associated with the corrupted points that are far away from the mean (these points will be in $S_-$); while on the other hand, $S_+$ must contain at least an $\epsilon$-fraction of the good points. One can then select indices from these two sets to violate the bimodal sub-gradient property.
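Equation (4) can be sanity-checked numerically. The sketch below writes $F(w, u)$ in its one-dimensional form $\sum_i w_i p_i^2 - (\sum_i w_i p_i)^2$ with $p = X^\top u$, which follows from the expression for $\Sigma_w$ above, and compares the closed-form gradient against central finite differences. The random test data and the function names `F` and `grad_F` are assumptions made for the example.

```python
import numpy as np

def F(X, w, u):
    """u^T Sigma_w u with Sigma_w = X diag(w) X^T - X w w^T X^T."""
    p = X.T @ u                      # one-dimensional projections X_i^T u
    return w @ (p * p) - (w @ p) ** 2

def grad_F(X, w, u):
    """Closed-form gradient in w from Equation (4)."""
    p = X.T @ u
    return p * p - 2.0 * (w @ p) * p

rng = np.random.default_rng(2)
d, N = 4, 7
X = rng.normal(size=(d, N))
w = rng.random(N)
w /= w.sum()
u = rng.normal(size=d)
u /= np.linalg.norm(u)

h = 1e-6
numeric = np.array([(F(X, w + h * e, u) - F(X, w - h * e, u)) / (2 * h)
                    for e in np.eye(N)])
analytic = grad_F(X, w, u)
```

Because $F$ is quadratic in $w$, the central difference agrees with the closed form up to floating-point rounding.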

Fix a first-order stationary point $w \in \Delta_{N,2\epsilon}$. Definition 2.3 implies that there is a corresponding unit vector $u \in \mathbb{R}^d$ such that $w$ is a stationary point of $F(w, u)$. We first state the bimodal sub-gradient property.

Lemma 3.3 (Bimodal sub-gradient property at stationarity). Fix $w \in \Delta_{N,2\epsilon}$ and a unit vector $u$ with $u^\top \Sigma_w u = \|\Sigma_w\|_2$. Let $S_- = \{i : w_i > 0\}$ and $S_+ = \{i : w_i < \frac{1}{(1-2\epsilon)N}\}$ denote the coordinates of $w$ that can decrease and increase respectively. If $w$ is a first-order stationary point of $F(w, u)$, then

$$\nabla_w F(w, u)_i \le \nabla_w F(w, u)_j \quad \text{for all } i \in S_- \text{ and } j \in S_+ .$$

Proof. Suppose there is some $i \in S_-$ and $j \in S_+$ such that $\nabla_w F(w, u)_i > \nabla_w F(w, u)_j$; then intuitively we can make $f(w)$ smaller by decreasing $w_i$ and increasing $w_j$. Formally, let $\tilde w = w + \min\big(w_i, \frac{1}{(1-2\epsilon)N} - w_j\big)(e_j - e_i)$ where $e_i$ is the $i$-th basis vector. We have $\tilde w \in \Delta_{N,2\epsilon}$ and $(\nabla_w F(w, u))^\top (\tilde w - w) < 0$, which violates the assumption that $w$ is a stationary point (Definition 2.3).
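The exchange argument in this proof can be sketched in code. Given a weight vector and a gradient vector, the hypothetical helper `exchange_step` below looks for a pair $(i, j)$ violating the bimodal property and, if one exists, moves weight from $i$ to $j$ exactly as in the definition of $\tilde w$. The toy weights and gradient values are made up for illustration.

```python
import numpy as np

def exchange_step(w, grad, eps, tol=1e-12):
    """Return (w_new, i, j) for a violating pair, or None if none exists."""
    cap = 1.0 / ((1.0 - 2.0 * eps) * len(w))
    s_minus = np.flatnonzero(w > tol)          # weights that can decrease
    s_plus = np.flatnonzero(w < cap - tol)     # weights that can increase
    i = s_minus[np.argmax(grad[s_minus])]
    j = s_plus[np.argmin(grad[s_plus])]
    if grad[i] <= grad[j]:
        return None                            # bimodal property holds
    t = min(w[i], cap - w[j])                  # largest feasible transfer
    w_new = w.copy()
    w_new[i] -= t
    w_new[j] += t
    return w_new, i, j

# Toy instance with N = 5 and eps = 0.1, so the cap is 1 / (0.8 * 5) = 0.25.
w = np.array([0.25, 0.25, 0.25, 0.25, 0.0])
grad = np.array([3.0, 1.0, 1.0, 1.0, 0.5])
w_new, i, j = exchange_step(w, grad, eps=0.1)
first_order_change = grad @ (w_new - w)        # negative for a violating pair
```

The transfer keeps the weights in $\Delta_{N,2\epsilon}$ and makes the first-order term negative, which is the contradiction with Definition 2.3 used in the proof.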

Given Lemma 3.3, we prove Theorem 3.1 by contradiction. We show that if $\mu_w$ is far from $\mu^\star$, then $w$ violates the property stated in Lemma 3.3 and therefore cannot be a stationary point. More specifically, we show that, if $\mu_w$ is far from $\mu^\star$, then there exists a bad sample with index $i \in S_-$ whose gradient is large (Lemma 3.4). Meanwhile, the concentration bounds in Condition (3) guarantee that there exists a good sample with index $j \in S_+$ whose gradient is small (Lemma 3.5).

Lemma 3.4 (Bad sample with large gradient). Assume that Condition (3) and Lemma 2.1 hold. Fix $w \in \Delta_{N,2\epsilon}$ and a unit vector $u$ with $u^\top \Sigma_w u = \|\Sigma_w\|_2$. Let $r = \|\mu_w - \mu^\star\|_2$ and suppose $r \ge c_2 \epsilon \sqrt{\ln(1/\epsilon)}$. Then there exists some $i \in (B \cap S_-)$ such that

$$\nabla_w F(w, u)_i - u^\top \mu^\star (\mu^\star - 2\mu_w)^\top u > \frac{2 c_3 r^2}{\epsilon^2} .$$

Here, $c_2$ and $c_3$ are universal positive constants.

Lemma 3.5 (Good sample with small gradient). Consider the same setting as in Lemma 3.4. There is some $j \in (G \cap S_+)$ such that

$$\nabla_w F(w, u)_j - u^\top \mu^\star (\mu^\star - 2\mu_w)^\top u \le \frac{c_3 r^2}{\epsilon^2} .$$

We defer the proofs of Lemmas 3.4 and 3.5 to Sections 3.1 and 3.2, and we first use these two lemmas to prove Theorem 3.1.

Proof of Theorem 3.1. Suppose that $w \in \Delta_{N,2\epsilon}$ is a first-order stationary point of $f(w)$, and moreover, $w$ is a bad solution where $\|\mu_w - \mu^\star\|_2 \ge c_2 \epsilon \sqrt{\ln(1/\epsilon)}$. By Definition 2.3, there exists a unit vector $u \in \mathbb{R}^d$ such that $w$ is a stationary point of $F(w, u)$.

Fix such a vector $u$. Since Condition (3) and Lemma 2.1 both hold, we can invoke Lemmas 3.4 and 3.5 on $(w, u)$ to find two coordinates $i \in S_-$ and $j \in S_+$ that violate the bimodal subgradient condition in Lemma 3.3. Consequently, $w$ cannot be a stationary point of $F(w, u)$. This leads to a contradiction, and therefore, all first-order stationary points of $f(w)$ are good solutions.

We now briefly comment on the modifications required to prove Theorem 3.2 (see Appendix B). Theorem 3.2 is proven by first showing (using conic geometry) that the bimodal sub-gradient property (Lemma 3.3) is stable, in the sense that for an approximate stationary point an approximate bimodal sub-gradient property holds, i.e., $\nu_i \le \nu_j + \delta$. Further, for any point obeying such an approximate bimodal property, the objective is small and has good approximation guarantees. These two steps, when combined, show that any approximate stationary point has good approximation guarantees (similar to the proof of Theorem 3.1 for exact stationary points).


3.1. Finding a Bad Sample With Large Gradient

In this subsection, we prove Lemma 3.4.

Lemma 3.4 states that when $\mu_w$ is far from $\mu^\star$, there exists an index $i \in (B \cap S_-)$ such that the gradient $\nabla_w F(w, u)_i$ is relatively large.

Recall that $\nabla_w F(w, u)$ in Equation (4) is the same as the gradient of the variance (weighted by $w$) of the one-dimensional samples $(X_i^\top u)_{i=1}^N$. Roughly speaking, for this one-dimensional problem, a sample far from the (projected) true mean should have large gradient. Our objective is to find such a sample with positive weight.

More speciﬁcally, since w is a bad solution and u is in the top eigenspace of Σw, the weighted empirical variance of the projected samples is very large. Because the good samples cannot have this much variance, most of the variance comes from the bad samples. We show that among the bad samples that contribute a lot to the variance, one of them must be very far from the (projected) true mean.

In this section and Section 3.2, we use c1, . . . , c4 to denote universal constants that are independent of N, d, and ϵ. We give a detailed description of how to set these constants in Appendix A.

Proof of Lemma 3.4. We ﬁrst show that the variance of onedimensional samples X i u N i=1 is relatively large.

By Lemma 2.1, we know that if µw µ 2 r and r c2ϵ p

ln(1/ϵ), then

λmax(Σw) 1 + c4 r2

ϵ for some universal constant c4.

Because u is a unit vector that maximizes u Σwu, we have

u Σwu = λmax(Σw) 1 + c4r2

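Recall that $\Sigma_w = \sum_{i=1}^N w_i (X_i - \mu_w)(X_i - \mu_w)^\top$. If we replace $\mu_w$ with $\mu^\star$, we have

\[ \sum_{i=1}^N w_i (X_i - \mu^\star)(X_i - \mu^\star)^\top \succeq \Sigma_w , \]

and therefore,

\[ u^\top \left( \sum_{i=1}^N w_i (X_i - \mu^\star)(X_i - \mu^\star)^\top \right) u \ge 1 + c_4 \frac{r^2}{\epsilon} . \]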

Next we show that most of this variance is due to bad samples. By Condition (3),

\[ u^\top \left( \sum_{i \in G} w_i (X_i - \mu^\star)(X_i - \mu^\star)^\top \right) u \le 1 + c_1 \epsilon \ln(1/\epsilon) . \]

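Consequently,

\[ u^\top \left( \sum_{i \in B} w_i (X_i - \mu^\star)(X_i - \mu^\star)^\top \right) u \ge c_4 \frac{r^2}{\epsilon} - c_1 \epsilon \ln(1/\epsilon) \ge 0.98\, c_4 \frac{r^2}{\epsilon} . \]

The last step is because $r \ge c_2\, \epsilon \sqrt{\ln(1/\epsilon)}$ and we can choose $c_4$ to be sufficiently large.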

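Now that we know most of the variance is due to the bad samples, observe that the total weight $w_B$ on the bad samples is at most $\epsilon N \cdot \frac{1}{(1-2\epsilon)N} \le 2\epsilon$. Therefore, there must be some $i \in B$ with $w_i > 0$ such that

\[ u^\top (X_i - \mu^\star)(X_i - \mu^\star)^\top u \ge 0.98\, c_4 \frac{r^2}{\epsilon} \cdot \frac{1}{2\epsilon} = 0.49\, c_4 \frac{r^2}{\epsilon^2} . \]

In other words, $|u^\top (X_i - \mu^\star)| \ge 0.7 \sqrt{c_4}\, \frac{r}{\epsilon}$.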
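The averaging step used here can be written out explicitly. The shorthand $a_i := u^\top (X_i - \mu^\star)$ below is introduced only for this note and is not notation from the paper.

```latex
\[
0.98\, c_4 \frac{r^2}{\epsilon}
\;\le\; \sum_{i \in B} w_i\, a_i^2
\;\le\; \Big( \max_{i \in B,\; w_i > 0} a_i^2 \Big) \sum_{i \in B} w_i
\;\le\; 2\epsilon \max_{i \in B,\; w_i > 0} a_i^2 ,
\]
\[
\text{so}\quad \max_{i \in B,\; w_i > 0} a_i^2 \;\ge\; 0.49\, c_4 \frac{r^2}{\epsilon^2}
\quad\text{and}\quad
\max_{i \in B,\; w_i > 0} |a_i| \;\ge\; 0.7 \sqrt{c_4}\, \frac{r}{\epsilon} .
\]
```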

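By definition, $i \in B \cap S_-$. It remains to show that $\nabla_w F(w, u)_i$ is large:

\[
\begin{aligned}
\nabla_w F(w, u)_i - u^\top \mu^\star (\mu^\star - 2\mu_w)^\top u
&= u^\top (X_i - \mu^\star)(X_i - \mu^\star)^\top u - 2 u^\top (X_i - \mu^\star)(\mu_w - \mu^\star)^\top u \\
&\ge \big( u^\top (X_i - \mu^\star) \big)^2 - 2\, \big| u^\top (X_i - \mu^\star) \big|\, \|\mu_w - \mu^\star\|_2 \\
&\ge 0.49\, c_4 \frac{r^2}{\epsilon^2} - 2 \cdot 0.7 \sqrt{c_4}\, \frac{r}{\epsilon} \cdot r \\
&> 2 c_3 \frac{r^2}{\epsilon^2} .
\end{aligned}
\]

The first inequality is by Cauchy-Schwarz. The last step uses the fact that $\epsilon$ is sufficiently small.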
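For completeness, the equality used at the start of this chain can be checked by direct expansion. The sketch below assumes $\nabla_w F(w,u)_i = (X_i^\top u)^2 - 2(\mu_w^\top u)(X_i^\top u)$, which is what differentiating $u^\top (X\,\mathrm{diag}(w) X^\top - X w w^\top X^\top) u$ in $w_i$ gives; the scalars $a$, $m$, $m_w$ are shorthand introduced only for this note.

```latex
% Shorthand: a = u^T X_i, m = u^T mu^*, m_w = u^T mu_w.
\[
\begin{aligned}
(a - m)^2 - 2 (a - m)(m_w - m)
&= a^2 - 2am + m^2 - 2 a m_w + 2am + 2 m m_w - 2 m^2 \\
&= a^2 - 2 a m_w - m (m - 2 m_w) \\
&= \nabla_w F(w, u)_i - u^\top \mu^\star (\mu^\star - 2 \mu_w)^\top u .
\end{aligned}
\]
```

The subtracted term $u^\top \mu^\star (\mu^\star - 2\mu_w)^\top u$ does not depend on the index $i$, so it shifts every coordinate of the gradient by the same amount and does not affect comparisons between two coordinates.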

3.2. Finding a Good Sample With Small Gradient

In this subsection, we prove Lemma 3.5.

Lemma 3.5 states that there exists an index $j \in (G \cap S_+)$ such that the gradient $\nabla_w F(w, u)_j$ is relatively small. Similar to the previous section, a sample close to the (projected) true mean should have small gradient. Our goal is to find such a sample whose weight we can increase.

Recall that $S_+$ contains all samples whose weight can be increased. We first prove that there are at least $\epsilon N$ good samples in $S_+$. Among these $\epsilon N$ good samples, the concentration bounds imply that some $X_j$ must be very close to the (projected) true mean.

Proof of Lemma 3.5. Recall that $S_+$ contains every coordinate $i$ where $w_i < \frac{1}{(1-2\epsilon)N}$. Since at most $(1-2\epsilon)N$ samples can have the maximum weight $\frac{1}{(1-2\epsilon)N}$, we know that $|S_+| \ge 2\epsilon N$. Combining this with $|G| = (1-\epsilon)N$, we know that $|G \cap S_+| \ge \epsilon N$.

Fix a subset $G_+ \subseteq (G \cap S_+)$ of size $|G_+| = \epsilon N$. We first show that on average, samples in $G_+$ do not contribute much to the variance.

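Let $w'$ be the uniform weight vector on $G$, i.e., $w'_i = \frac{1}{(1-\epsilon)N}$ for all $i \in G$ and $w'_i = 0$ otherwise. Since $w' \in \Delta_{N,2\epsilon}$, by Condition (3),

\[ \left\| \sum_{i \in G} \frac{1}{|G|} (X_i - \mu^\star)(X_i - \mu^\star)^\top - I \right\|_2 \le c_1 \epsilon \ln(1/\epsilon) . \]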

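Let $w''$ be the uniform weight vector on $S \setminus G_+ = (G \setminus G_+) \cup B$, i.e., $w''_i = \frac{1}{(1-\epsilon)N}$ for all $i \in ((G \setminus G_+) \cup B)$ and $w''_i = 0$ otherwise. Since $w'' \in \Delta_{N,2\epsilon}$, again by Condition (3), we have

\[ \left\| \sum_{i \in G \setminus G_+} \frac{1}{|G|} (X_i - \mu^\star)(X_i - \mu^\star)^\top - I \right\|_2 \le c_1 \epsilon \ln(1/\epsilon) . \]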

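Combining the previous two concentration bounds,

\[
\begin{aligned}
\left\| \sum_{i \in G_+} \frac{1}{|G|} (X_i - \mu^\star)(X_i - \mu^\star)^\top \right\|_2
&= \left\| \left( \sum_{i \in G} \frac{1}{|G|} (X_i - \mu^\star)(X_i - \mu^\star)^\top - I \right) - \left( \sum_{i \in G \setminus G_+} \frac{1}{|G|} (X_i - \mu^\star)(X_i - \mu^\star)^\top - I \right) \right\|_2 \\
&\le 2 c_1 \epsilon \ln(1/\epsilon) .
\end{aligned}
\]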

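Consequently, because $u$ is a unit vector,

\[ u^\top \left( \sum_{i \in G_+} \frac{1}{|G|} (X_i - \mu^\star)(X_i - \mu^\star)^\top \right) u \le 2 c_1 \epsilon \ln(1/\epsilon) . \]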

At this point, we know samples in G+ do not contribute much to the variance. We now proceed to show that one of these samples satisﬁes the lemma.

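Let $j = \arg\min_{i \in G_+} |u^\top (X_i - \mu^\star)|$. Since the minimum over $G_+$ is at most the average over $G_+$, we have

\[ u^\top (X_j - \mu^\star)(X_j - \mu^\star)^\top u \le \frac{|G|}{|G_+|} \cdot 2 c_1 \epsilon \ln(1/\epsilon) \le 2 c_1 \ln(1/\epsilon) . \]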

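Finally, because $|u^\top (X_j - \mu^\star)| \le \sqrt{2 c_1 \ln(1/\epsilon)}$, we can show that $\nabla_w F(w, u)_j$ is small:

\[
\begin{aligned}
\nabla_w F(w, u)_j - u^\top \mu^\star (\mu^\star - 2\mu_w)^\top u
&= u^\top (X_j - \mu^\star)(X_j - \mu^\star)^\top u - 2 u^\top (X_j - \mu^\star)(\mu_w - \mu^\star)^\top u \\
&\le 2 c_1 \ln(1/\epsilon) + 2 \sqrt{2 c_1 \ln(1/\epsilon)}\; r \\
&\le c_3 \frac{r^2}{\epsilon^2} .
\end{aligned}
\]

The last step uses that $c_3$ is sufficiently large, as well as the fact that $\ln(1/\epsilon) \le r^2/\epsilon^2$ because $r \ge c_2\, \epsilon \sqrt{\ln(1/\epsilon)}$.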

4. Algorithmic Result: Finding a Stationary Point via Gradient Descent

In this section, we show that a simple Projected Gradient Descent (PGD) algorithm (Algorithm 1) can efﬁciently ﬁnd an approximate stationary point w of our spectral norm objective, and that w is a good solution to our robust mean estimation task.

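Algorithm 1 Robust Mean Estimation via PGD

Input: $\epsilon$-corrupted set of $N$ samples $\{X_i\}_{i=1}^N$ on $\mathbb{R}^d$ satisfying Condition (3), and $\epsilon < \epsilon_0$.
Output: $w \in \mathbb{R}^N$ with $\|\mu_w - \mu^\star\|_2 \le O(\epsilon \sqrt{\log(1/\epsilon)})$.

Let $F(w, u) = u^\top \Sigma_w u$.
Let $w_0$ be an arbitrary weight vector in $\Delta_{N,2\epsilon}$.
Let $T = \tilde{O}(N^2 d^4)$.
for $\tau = 0$ to $T - 1$ do
    Find a unit vector $u_\tau \in \mathbb{R}^d$ such that $F(w_\tau, u_\tau) \ge (1 - \epsilon) \max_u F(w_\tau, u)$.
    $w_{\tau+1} = P_{\Delta_{N,2\epsilon}}(w_\tau - \eta \nabla_w F(w_\tau, u_\tau))$, where $P_K(\cdot)$ is the $\ell_2$ projection operator onto $K$.
end for
return $w_{\tau^\star}$ where $\tau^\star = \arg\min_{0 \le \tau < T} \|\Sigma_{w_\tau}\|_2$.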
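The loop of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: it takes $\Delta_{N,2\epsilon}$ to be the capped simplex $\{w : \sum_i w_i = 1,\ 0 \le w_i \le \frac{1}{(1-2\epsilon)N}\}$, computes the projection by bisection on a shift parameter, finds $u_\tau$ with an exact eigendecomposition instead of an approximate maximizer, and uses hand-picked values of $T$ and $\eta$ on synthetic data. All function names, the data, and these parameter values are hypothetical.

```python
import numpy as np

def project_capped_simplex(v, cap, iters=100):
    """Euclidean projection onto {w : sum(w) = 1, 0 <= w_i <= cap}.

    The projection has the form clip(v - tau, 0, cap); tau is found by bisection.
    """
    lo, hi = v.min() - cap, v.max()   # clipped sum is N*cap >= 1 at lo and 0 at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, cap).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi), 0.0, cap)

def weighted_cov(X, w):
    """Sigma_w = X diag(w) X^T - (X w)(X w)^T, with the samples as columns of X."""
    mu = X @ w
    return (X * w) @ X.T - np.outer(mu, mu)

def robust_mean_pgd(X, eps, T, eta):
    """Projected gradient descent on w -> lambda_max(Sigma_w) (sketch of Algorithm 1)."""
    d, N = X.shape
    cap = 1.0 / ((1.0 - 2.0 * eps) * N)
    w = np.full(N, 1.0 / N)                 # one valid starting point w_0
    best_w, best_val = w, np.inf
    for _ in range(T):
        evals, evecs = np.linalg.eigh(weighted_cov(X, w))
        lam, u = evals[-1], evecs[:, -1]    # exact top eigenpair (assumption)
        if lam < best_val:                  # keep the iterate with smallest ||Sigma_w||_2
            best_w, best_val = w, lam
        a = X.T @ u
        grad = a**2 - 2.0 * (w @ a) * a     # gradient of u^T Sigma_w u in w
        w = project_capped_simplex(w - eta * grad, cap)
    return X @ best_w, best_w

# Synthetic example: true mean 0, with an eps-fraction of outliers shifted along e_1.
rng = np.random.default_rng(0)
d, N, eps = 10, 500, 0.1
n_bad = int(eps * N)
good = rng.normal(size=(d, N - n_bad))
bad = 0.1 * rng.normal(size=(d, n_bad))
bad[0] += 5.0
X = np.hstack([good, bad])

mu_hat, w = robust_mean_pgd(X, eps, T=50, eta=1e-4)
err_naive = float(np.linalg.norm(X.mean(axis=1)))   # error of the empirical mean
err_robust = float(np.linalg.norm(mu_hat))          # error of the weighted mean
```

The step size and iteration count used here are chosen by hand for this toy instance and are far smaller than the worst-case values in the analysis; they carry no theoretical guarantee.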

We note that finding the unit vector $u_\tau$ required in the for loop of Algorithm 1 can be done in time $O(Nd \log(d)/\epsilon)$. Given the PSD matrix $A = \Sigma_{w_\tau}$, we want to find a unit vector $u \in \mathbb{R}^d$ such that $u^\top A u \ge (1 - \epsilon) \max_{\|v\|_2 = 1} v^\top A v$. This is the approximate top-eigenvector problem, which can be solved via the power method in $O(\log(d)/\epsilon)$ iterations. Since the matrix-vector product $Av = \Sigma_{w_\tau} v = (X\, \mathrm{diag}(w_\tau) X^\top - X w_\tau w_\tau^\top X^\top) v$ can be computed in time $O(Nd)$, the running time for finding such a vector $u_\tau$ is $O(Nd \log(d)/\epsilon)$.
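The $O(Nd)$ matrix-vector product and the power method described above can be sketched as follows. The sketch never forms the $d \times d$ matrix $\Sigma_w$; the function names, the random initialization, and the fixed iteration count are assumptions made for illustration (the text only specifies $O(\log(d)/\epsilon)$ iterations, with an unspecified constant).

```python
import numpy as np

def cov_matvec(X, w, v):
    """Sigma_w v = X diag(w) X^T v - (X w)(X w)^T v, computed in O(N d) time."""
    Xtv = X.T @ v                    # shape (N,)
    mu = X @ w                       # weighted mean, shape (d,)
    return X @ (w * Xtv) - mu * (mu @ v)

def top_eigvec_power(X, w, iters=200, seed=0):
    """Power method for an approximate top eigenvector of Sigma_w.

    `iters` is a placeholder for the O(log(d)/eps) iteration count.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=X.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        z = cov_matvec(X, w, u)
        u = z / np.linalg.norm(z)
    return u, float(u @ cov_matvec(X, w, u))   # unit vector and its Rayleigh quotient

# Illustrative data: a few samples shifted along e_1 create a dominant direction.
rng = np.random.default_rng(1)
d, N = 8, 200
X = rng.normal(size=(d, N))
X[0, :20] += 5.0
w = np.full(N, 1.0 / N)
u, val = top_eigvec_power(X, w)
```

Each iteration costs two passes over the data (one for $X^\top v$ and one for $X(\cdot)$), which is where the $O(Nd)$ per-iteration cost comes from.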

The main result of this section is the following theorem:

Theorem 4.1 (Gradient descent finds a good solution). Let $S$ be an $\epsilon$-corrupted set of $N = \tilde{\Omega}(d/\epsilon^2)$ samples from a $d$-dimensional Gaussian $\mathcal{N}(\mu^\star, I)$ with unknown mean $\mu^\star$. Suppose $S$ satisfies Condition (3) and Lemma 2.1. Then, after $\tilde{O}(N^2 d^4)$ iterations, Algorithm 1 outputs a weight vector $w \in \mathbb{R}^N$ such that $\|\mu_w - \mu^\star\|_2 = O(\epsilon \sqrt{\log(1/\epsilon)})$.

We ﬁrst give a high-level overview of the proof. Our proof of Theorem 4.1 can be divided into two steps:

1. The first step is an immediate consequence of Theorem 3.2, which allows us to conclude that any approximate stationary point (in the sense of Definition 2.5) has good approximation guarantees.

2. To finalize the proof, in the second step we show that simple iterative procedures such as (sub)gradient descent can converge in a polynomial number of iterations to such an approximate stationary point. We prove such a result by utilizing a simple and well-known observation: a minimax optimization problem which is smooth in the minimization parameter is weakly convex (after maximization) in the minimization parameter. This connection allows us to leverage recent literature (Drusvyatskiy, 2017; Davis & Drusvyatskiy, 2018) that provides convergence guarantees for weakly convex optimization problems to prove that our algorithm finds an approximate stationary point in a polynomial number of iterations.
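The weak-convexity observation in the second step follows from a short standard argument, sketched below. It writes $f(w) := \max_{\|u\|_2 = 1} F(w, u)$ and assumes that $\beta$-smoothness means $\nabla_w F(\cdot, u)$ is $\beta$-Lipschitz for every fixed $u$.

```latex
% beta-smoothness in w gives a quadratic lower bound for each fixed u:
\[
F(w', u) \;\ge\; F(w, u) + \langle \nabla_w F(w, u),\, w' - w \rangle - \frac{\beta}{2} \|w' - w\|_2^2 ,
\]
% which is equivalent to convexity of
\[
w \;\mapsto\; F(w, u) + \frac{\beta}{2} \|w\|_2^2 \qquad \text{for every fixed } u .
\]
% A pointwise maximum of convex functions is convex, hence
\[
f(w) + \frac{\beta}{2} \|w\|_2^2 \;=\; \max_{\|u\|_2 = 1} \Big( F(w, u) + \frac{\beta}{2} \|w\|_2^2 \Big)
\]
% is convex, i.e., f is beta-weakly convex.
```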

To elaborate further, in the second step of our proof, we utilize and slightly generalize² the analysis of Davis & Drusvyatskiy (2018) and prove that projected sub-gradient descent can find an approximate stationary point.

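Lemma 4.2. Let $K$ be a closed convex set. Let $F(w, u)$ be a function which is $L$-Lipschitz and $\beta$-smooth with respect to $w$. Consider the following optimization problem:

\[ \min_{w \in K}\; \max_{\|u\|_2 = 1} F(w, u) . \]

Starting from any initial point $w_0 \in K$, we run iterative updates of the form:

\[
\begin{aligned}
&\text{find } u_\tau \text{ with } F(w_\tau, u_\tau) \ge (1 - \epsilon) \max_u F(w_\tau, u) , \\
&w_{\tau+1} = P_K\big( w_\tau - \eta \nabla_w F(w_\tau, u_\tau) \big) ,
\end{aligned}
\]

for $T$ iterations with step size $\eta = \frac{\gamma}{\sqrt{T}}$. Then, we have

\[ \min_{0 \le \tau < T} \|\nabla f_\beta(w_\tau)\|_2^2 \le \frac{2}{\sqrt{T}} \left( \frac{f_\beta(w_0) - \min_w f(w)}{\gamma} + \gamma \beta L^2 \right) + 4 \beta \epsilon , \]

where $f_\beta(w)$ is the Moreau envelope as in Definition 2.4.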

As shown in Appendix B, the function $F(w, u)$ associated with $f(w)$ obeys the required Lipschitz and smoothness properties, with $L = \tilde{O}(\sqrt{N}\, d)$ and $\beta = \tilde{O}(Nd)$. In addition, we have $0 \le f(w) \le \tilde{O}(d)$ for all $w \in \Delta_{N,2\epsilon}$. Thus, we can apply the result above with the constraint $K = \Delta_{N,2\epsilon}$. Theorem 4.1 follows by combining Theorem 3.2 and Lemma 4.2. We defer the proofs to Appendix B.

² The generalization is to deal with constraints and handle the fact that the inner maximization is not solved precisely.

5. Discussion

The main conceptual contribution of this work is to establish an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization. Specifically, we showed that high-dimensional robust mean estimation can be efficiently solved by directly applying a first-order method to a natural non-convex formulation of the problem.

The main technical contribution of this paper is in showing that any approximate stationary point of our non-convex objective sufﬁces to solve the underlying learning problem. Our novel structural result may be viewed as an explanation as to why robust mean estimation can be solved efﬁciently in high dimensions, despite its non-convexity. Speciﬁcally, we establish that the optimization landscape of our non-convex objective is well-behaved, in a precise sense.

There are a number of directions along which our results could be improved. At the technical level, it would be interesting to obtain faster convergence rates for gradient descent (or other ﬁrst-order methods), with linear convergence as the ultimate goal. We note that our upper bound is fairly loose and we did not make an explicit effort to optimize the polynomial dependence.

A natural direction is to extend our approach to more general robust estimation tasks, including covariance estimation (Diakonikolas et al., 2016; Cheng et al., 2019), sparse PCA (Balakrishnan et al., 2017; Diakonikolas et al., 2019b), and robust regression (Klivans et al., 2018; Diakonikolas et al., 2019c). Such generalizations will appear in a follow-up work.

6. Acknowledgments

We thank Jelena Diakonikolas for sharing her expertise in optimization. Part of the work was done while Ilias Diakonikolas and Mahdi Soltanolkotabi were visiting the Simons Institute for the Theory of Computing during the Summer 2019 program on the Foundations of Deep Learning. Part of the work was done while Yu Cheng and Rong Ge were visiting the Institute for Advanced Study for the Special Year on Optimization, Statistics, and Theoretical Machine Learning. Ilias Diakonikolas is supported by NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant. Rong Ge is supported by NSF CCF-1704656, NSF CCF-1845171 (CAREER), NSF CCF-1934964, a Sloan Fellowship, and a Google Faculty Research Award. Mahdi Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, NSF CCF-CIF grants #1846369 and #1813877, AFOSR-YIP under award #FA9550-18-1-0078, DARPA Learning with Less Labels (LwLL) and Fast Network Interface Cards (FastNICs) programs, and a Google Faculty Research Award.

References

Allen-Zhu, Z., Lee, Y., and Orecchia, L. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proc. 27th Annual Symposium on Discrete Algorithms (SODA), pp. 1824–1831, 2016.

Balakrishnan, S., Du, S. S., Li, J., and Singh, A. Computationally efficient robust sparse estimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory, pp. 169–212, 2017.

Barreno, M., Nelson, B., Joseph, A. D., and Tygar, J. D. The security of machine learning. Machine Learning, 81(2):121–148, 2010.

Beck, A. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611974997.

Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.

Candes, E. J., Li, X., and Soltanolkotabi, M. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

Cheng, Y., Diakonikolas, I., and Ge, R. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. URL http://arxiv.org/abs/1811.09380. Conference version in SODA 2019, pp. 2755–2771.

Cheng, Y., Diakonikolas, I., Ge, R., and Woodruff, D. P. Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, pp. 727–757, 2019.

Davis, D. and Drusvyatskiy, D. Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.

Depersin, J. and Lecue, G. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.

Diakonikolas, I. and Kane, D. M. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019. URL http://arxiv.org/abs/1911.05911.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 655–664, 2016.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pp. 999–1008, 2017a.

Diakonikolas, I., Kane, D. M., and Stewart, A. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 73–84, 2017b.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Steinhardt, J., and Stewart, A. SEVER: A robust meta-algorithm for stochastic optimization. In Proc. 36th International Conference on Machine Learning (ICML), pp. 1596–1606, 2019a.

Diakonikolas, I., Karmalkar, S., Kane, D., Price, E., and Stewart, A. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems 33, NeurIPS 2019, pp. 10688–10699, 2019b.

Diakonikolas, I., Kong, W., and Stewart, A. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA), pp. 2745–2754, 2019c.

Dong, Y., Hopkins, S. B., and Li, J. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. CoRR, abs/1906.11366, 2019. URL http://arxiv.org/abs/1906.11366. Conference version in NeurIPS 2019.

Drusvyatskiy, D. The proximal point method revisited. arXiv preprint arXiv:1712.06038, 2017.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.


Hassani, H., Soltanolkotabi, M., and Karbasi, A. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pp. 5841–5851, 2017.

Huber, P. J. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

Jain, P. and Kar, P. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3-4):142–336, 2017.

Jin, C., Netrapalli, P., and Jordan, M. I. What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, 2019.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

Klivans, A., Kothari, P., and Meka, R. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT), pp. 1420–1430, 2018.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 665–674, 2016.

Li, J., Absher, D., Tang, H., Southwick, A., Casto, A., Ramachandran, S., Cann, H., Barsh, G., Feldman, M., Cavalli-Sforza, L., and Myers, R. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.

Loh, P. and Wainwright, M. J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pp. 2726–2734, 2011.

Paschou, P., Lewis, J., Javed, A., and Drineas, P. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47:835–847, 2010.

Peng, R., Tangwongsan, K., and Zhang, P. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. arXiv preprint arXiv:1201.5135v3, 2016.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Rockafellar, R. T. Convex analysis. Number 28. Princeton University Press, 1970.

Rockafellar, R. T. Favorable classes of Lipschitz continuous functions in subgradient optimization. 1981.

Rockafellar, R. T. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 2015. ISBN 9781400873173.

Rosenberg, N., Pritchard, J., Weber, J., Cann, H., Kidd, K., Zhivotovsky, L., and Feldman, M. Genetic structure of human populations. Science, 298:2381–2385, 2002.

Steinhardt, J., Koh, P. W., and Liang, P. S. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, pp. 3520–3532, 2017.

Steinhardt, J., Charikar, M., and Valiant, G. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pp. 45:1–45:21, 2018.

Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., and Recht, B. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.

Tukey, J. W. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.

Wilcox, R. M. Exponential operators and parameter differentiation in quantum physics. Journal of Mathematical Physics, 8(4):962–982, 1967.