Robustly Learning a Single Neuron via Sharpness

Puqian Wang*¹, Nikos Zarifis*¹, Ilias Diakonikolas¹, Jelena Diakonikolas¹

Abstract. We study the problem of learning a single neuron with respect to the $L_2^2$-loss in the presence of adversarial label noise. We give an efficient algorithm that, for a broad family of activations including ReLUs, approximates the optimal $L_2^2$-error within a constant factor. Notably, our algorithm succeeds under much milder distributional assumptions compared to prior work. The key ingredient enabling our results is a novel connection to local error bounds from optimization theory.

1. Introduction

We study the following learning task: given labeled examples $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ from an unknown distribution $D$, output the best-fitting ReLU (or other nonlinear function) with respect to square loss. This is a fundamental problem in machine learning that has been extensively studied in a number of interrelated contexts over the past two decades, including learning GLMs and neural networks. More specifically, letting $\sigma : \mathbb{R} \to \mathbb{R}$ denote a nonlinear activation, e.g., $\sigma(t) = \mathrm{ReLU}(t) = \max\{0, t\}$, the (population) square loss of a vector $\mathbf{w}$ is defined as the $L_2^2$ loss of the hypothesis $\sigma(\mathbf{w}\cdot\mathbf{x})$, i.e.,
$$L_2^{D,\sigma}(\mathbf{w}) := \mathbf{E}_{(\mathbf{x},y)\sim D}\left[(\sigma(\mathbf{w}\cdot\mathbf{x}) - y)^2\right].$$
Our learning problem is then formally defined as follows.

Problem 1.1 (Robustly Learning a Single Neuron). Fix $\epsilon > 0$, $W > 0$, and a class of distributions $\mathcal{G}$ on $\mathbb{R}^d$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation and $D$ a distribution on labeled examples $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that its $\mathbf{x}$-marginal $D_{\mathbf{x}}$ belongs to $\mathcal{G}$. For some $C \ge 1$, a $C$-approximate proper learner is given $\epsilon$, $W$, and i.i.d. samples from $D$, and outputs $\hat{\mathbf{w}} \in \mathbb{R}^d$ such that, with high probability, $L_2^{D,\sigma}(\hat{\mathbf{w}}) \le C\,\mathrm{OPT} + \epsilon$, where $\mathrm{OPT} := \min_{\|\mathbf{w}\|_2 \le W} L_2^{D,\sigma}(\mathbf{w})$ is the minimum attainable square loss. We use $\mathcal{W}^* := \mathrm{argmin}_{\|\mathbf{w}\|_2 \le W} L_2^{D,\sigma}(\mathbf{w})$ to denote the set of square-loss minimizers.

*Equal contribution. ¹Department of Computer Sciences, University of Wisconsin-Madison, Madison, USA. Correspondence to: Nikos Zarifis.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Problem 1.1 does not make realizability assumptions on the distribution $D$: the labels are allowed to be arbitrary, and we are interested in the best-fit function with respect to the $L_2^2$ error. This corresponds to the (distribution-specific) agnostic PAC learning model (Haussler, 1992; Kearns et al., 1994). In this paper, we focus on developing constant-factor approximate learners, corresponding to the case that $C$ is a universal constant greater than one.

The special case of Problem 1.1 where the labels are consistent with a function in $\mathcal{H} = \{\sigma(\mathbf{w}\cdot\mathbf{x}) : \|\mathbf{w}\|_2 \le W\}$ was studied in early work (Kalai & Sastry, 2009; Kakade et al., 2011). These papers gave efficient methods that succeed for any distribution on the unit ball and any monotone Lipschitz activation¹. More recently, (Yehudai & Shamir, 2020) showed that gradient descent on the nonconvex $L_2^2$ loss succeeds under a natural class of distributions (again in the realizable case) but fails in general. In other related work, (Soltanolkotabi, 2017) analyzed the case of ReLUs in the realizable setting under the Gaussian distribution and showed that gradient descent efficiently achieves exact recovery. The agnostic setting is computationally challenging.

¹The results in these works can tolerate zero-mean random noise, but do not apply to the adversarial noise setting.
First, even for the case that the marginal distribution on the examples is Gaussian, there is strong evidence that any algorithm achieving error $\mathrm{OPT} + \epsilon$ (corresponding to $C = 1$ in Problem 1.1) requires $d^{\mathrm{poly}(1/\epsilon)}$ time (Goel et al., 2019; Diakonikolas et al., 2020b; Goel et al., 2020; Diakonikolas et al., 2021). Second, even if we relax our goal to constant-factor approximations, some distributional assumptions are required: known NP-hardness results rule out proper learners achieving any constant factor (Šíma, 2002; Manurangsi & Reichman, 2018). More recent work (Diakonikolas et al., 2022a) has shown that no polynomial-time constant-factor improper learner exists (under cryptographic assumptions), even if the distribution is bounded. These intractability results motivate the design of constant-factor approximate learners (corresponding to $C > 1$ and $C = O(1)$) that succeed under as mild distributional assumptions as possible.

Prior algorithmic work in the robust setting can be classified into two categories. A line of work (Frei et al., 2020; Diakonikolas et al., 2022b; Awasthi et al., 2022) analyzes gradient descent-based algorithms on the natural nonconvex $L_2^2$ objective (possibly with regularization). These works show that under certain distributional assumptions gradient descent avoids poor local minima and converges to a good solution. Specifically, (Diakonikolas et al., 2022b) established that gradient descent efficiently converges to a constant-factor approximation for a family of well-behaved continuous distributions (including logconcave distributions). The second line of work (Diakonikolas et al., 2020a) proceeds by convexifying the problem, namely constructing a convex surrogate whose optimal solution gives a good solution to the initial nonconvex problem. This convex surrogate was analyzed in (Diakonikolas et al., 2020a) for the case of ReLUs, where it was shown to yield a constant-factor approximation for logconcave distributions.

The starting point of our investigation is the observation that all previous algorithmic works for Problem 1.1 impose fairly stringent distributional assumptions. These works require all of the following properties from the marginal distribution on examples: (i) anti-concentration, (ii) concentration, and (iii) anti-anti-concentration. Assumption (i) posits that every one-dimensional (or, in some cases, constant-dimensional) projection of the points should not put too much mass in any interval (or rectangle). Property (ii) means that every one-dimensional projection should be strongly concentrated around its mean; specifically, prior work required at least exponential concentration. Finally, (iii) requires that the density of every low-dimensional projection is bounded below by a positive constant. While some concentration appears necessary, prior work required sub-exponential concentration, which rules out the important case of heavy-tailed data. The anti-concentration assumption (i) from prior work rules out possibly lower-dimensional data, while the anti-anti-concentration rules out discrete distributions, which naturally occur in practice. The preceding discussion raises the following question: Under what distributional assumptions can we obtain efficient constant-factor learners for Problem 1.1?
In this paper, we give such an algorithm that succeeds under minimal distributional assumptions. Roughly speaking, our novel assumptions require anti-concentration only in the direction of the optimal solution (i.e., a margin assumption) and allow for heavy-tailed data. Moreover, by removing assumption (iii) altogether, we obtain the first positive results for structured discrete distributions (including, e.g., discrete Gaussians and the uniform distribution over the cube). In addition to its generality, our algorithm is simple (mini-batch SGD) and achieves significantly better sample complexity for distributions covered in prior work.

1.1. Overview of Results

We provide a simplified version of our distributional assumptions, followed by our main result for ReLU activations.

Distributional Assumptions. We make only the following two distributional assumptions.

Margin-like Condition: There exist $\mathbf{w}^* \in \mathcal{W}^*$ and constants $\gamma, \lambda > 0$ such that
$$\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[\mathbf{x}\mathbf{x}^\top \mathbb{1}\{\mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right] \succeq \lambda\,\mathbf{I}. \tag{1}$$

Concentration: There exists a non-increasing $h : \mathbb{R}_+ \to \mathbb{R}_+$ satisfying $h(r) = O(r^{-5})$ such that for any unit vector $\mathbf{u}$ and any $r \ge 1$, it holds that $\mathbf{Pr}_{\mathbf{x}\sim D_{\mathbf{x}}}[|\mathbf{u}\cdot\mathbf{x}| \ge r] \le h(r)$.

Before we state our algorithmic result, some comments are in order. Condition (1) is an anti-concentration condition, reminiscent of the classical margin condition for halfspaces. In comparison with prior work, our condition requires anti-concentration only in the direction of an optimal solution, as opposed to every direction. Our second condition requires that every univariate projection exhibits some concentration. Our concentration function $h$ can even be inverse polynomial, allowing for heavy-tailed data. In contrast, prior work only considered sub-exponential tails. As we will see, the function $h$ affects the sample complexity of our algorithm. As we show in Section E of the supplementary material, our distributional assumptions subsume all previous such assumptions considered in the literature and additionally include a range of distributions (including heavy-tailed and discrete distributions) not handled in prior work.

A simplified version of our main result for the special case of ReLU activations is as follows (see Theorem 3.3 for a detailed, more general statement):

Theorem 1.2 (Main Algorithmic Result, Informal). Let $W = O(1)$, let $\mathcal{G}$ be a class of marginal distributions satisfying the above distributional assumptions, and let $\sigma$ be the ReLU activation. There exists a sample-efficient, sample-linear-time algorithm that outputs a hypothesis $\hat{\mathbf{w}}$ such that, with high probability, $L_2^{D,\sigma}(\hat{\mathbf{w}}) = O(\mathrm{OPT}) + \epsilon$. In particular, if the tail function $h$ is subexponential, namely $h(r) = e^{-\Omega(r)}$, then the algorithm has sample complexity $n = O(d\,\mathrm{polylog}(1/\epsilon))$. For heavy-tailed distributions, namely for $h(r) = O(r^{-k})$ for some $k > 4$, the algorithm has sample complexity $n = O(d\,(1/\epsilon)^{2/(k-4)})$. The algorithm's runtime is always $O(nd)$.

Our algorithm is extremely simple: it amounts to mini-batch SGD on a natural convex surrogate of the problem. As we will explain subsequently, this convex surrogate has been studied before in closely related yet more restricted contexts. Our main technical contribution lies in the analysis, which hinges on a new connection to local error bounds from the theory of optimization. This connection is crucial for us in two ways: First, we leverage it to obtain the first constant-factor approximate learners under much weaker distributional assumptions.
Second, even for distributions covered by prior work, the connection allows us to obtain significantly more efficient algorithms. Finally, we note that our algorithmic result applies to a broad family of monotone activations (Definition 2.1 and Assumption 2.2), and can be adapted to handle non-monotone activations including GeLU (Hendrycks & Gimpel, 2016) and Swish (Ramachandran et al., 2017); see Appendix F.

1.2. Technical Contributions

The main algorithmic difficulty in solving Problem 1.1 is its non-convexity. Indeed, the $L_2^2$ loss is non-convex for nonlinear activations, even without noise. Of course, the presence of adversarial label noise only makes the problem more challenging. At a high level, our approach is to convexify the problem via an appropriate convex surrogate function (see, e.g., (Bartlett et al., 2006)). In more detail, given a distribution $D$ on labeled examples $(\mathbf{x}, y)$ and an activation $\sigma$, the surrogate $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ is defined by
$$L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) = \mathbf{E}_{(\mathbf{x},y)\sim D}\left[\int_0^{\mathbf{w}\cdot\mathbf{x}} (\sigma(r) - y)\, dr\right].$$
This function is not new. It was first defined in (Auer et al., 1995) and subsequently (implicitly) used in (Kalai & Sastry, 2009; Kakade et al., 2011) for learning GLMs with zero-mean noise. More recently, (Diakonikolas et al., 2020a) used this convex surrogate for robustly learning ReLUs under logconcave distributions. Roughly speaking, they showed that under the logconcavity assumption, a near-optimal solution to the (convex) optimization problem of minimizing $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ yields a constant-factor approximate learner for Problem 1.1 (for the special case of ReLU activations).

Very roughly speaking, our high-level approach is similar to that of (Diakonikolas et al., 2020a). The main novelty of our contributions lies in two aspects: (1) the generality of the distributional assumptions under which we obtain a constant-factor approximation, and (2) the sample and computational complexities of the associated algorithm. Specifically, our analysis yields a constant-factor approximate learner under a vastly more general class of distributions² as compared to prior work, and extends to a much broader family of activations beyond ReLUs. Moreover, even if we restrict ourselves to, e.g., logconcave distributions, the complexity of our algorithm is exponentially smaller as a function of $\epsilon$, namely $\mathrm{polylog}(1/\epsilon)$ as opposed to $\Omega(1/\epsilon^2)$. For a more detailed comparison, see Appendix B.

The key technical ingredient enabling our results is the notion of sharpness (local error bound) from optimization theory, which we prove holds for our stochastic surrogate problem. Before explaining how this comes up in our setting, we provide an overview from an optimization perspective.

²Recall that without distributional assumptions, obtaining any constant-factor approximate learner is NP-hard.

Local Error Bounds and Sharpness. Broadly speaking, given an optimization problem (P) and a residual function $r$ that is a measure of error of a candidate solution $\mathbf{w}$ to (P), an error bound certifies that a small residual translates into closeness between the candidate solution and the set of test (typically optimal) solutions $\mathcal{W}^*$ to (P). In particular, an error bound certifies an inequality of the form $r(\mathbf{w}) \ge (\mu/\nu)\,\mathrm{dist}(\mathbf{w}, \mathcal{W}^*)^\nu$ for some parameters $\mu, \nu > 0$, where $\mathrm{dist}(\mathbf{w}, \mathcal{W}^*) = \min_{\mathbf{w}'\in\mathcal{W}^*} \|\mathbf{w} - \mathbf{w}'\|_2$ (see, e.g., the survey (Pang, 1997)). When this bound holds only locally in some neighborhood of $\mathcal{W}^*$, it is referred to as a local error bound.
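As a simple concrete illustration (our example, not from the paper), take $f(w) = (\max\{|w| - 1, 0\})^2$ on $\mathbb{R}$ with residual $r(w) = f(w) - \min_u f(u) = f(w)$. Its minimizer set is $\mathcal{W}^* = [-1, 1]$, so $\mathrm{dist}(w, \mathcal{W}^*) = \max\{|w| - 1, 0\}$ and
$$r(w) = \mathrm{dist}(w, \mathcal{W}^*)^2 \ge \frac{\mu}{\nu}\,\mathrm{dist}(w, \mathcal{W}^*)^\nu \quad \text{with } \mu = \nu = 2,$$
a global error bound with exponent $\nu = 2$, even though $f$ is flat on all of $\mathcal{W}^*$. By contrast, $g(w) = (\max\{|w| - 1, 0\})^4$ admits no error bound with $\nu = 2$ near $\mathcal{W}^*$ (its residual vanishes faster than $\mathrm{dist}^2$), but does satisfy one with $\nu = 4$ and $\mu = 4$; the exponent $\nu$ thus captures how fast the residual can degenerate near the solution set.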
Local error bounds are well studied within optimization theory, with the earliest result in this area attributed to (Hoffman, 1952), which provided local error bounds for systems of linear inequalities. The work of (Hoffman, 1952) was extended to many other optimization problems; see, e.g., Chapter 6 in (Facchinei & Pang, 2003) for an overview of classical results and (Bolte et al., 2017; Karimi et al., 2016; Roulet & d'Aspremont, 2017; Liu et al., 2022) and references therein for a more contemporary overview. One of the most surprising early results in this area states that, for minimizing a convex function $f$, an inequality of the form
$$f(\mathbf{w}) - \min_{\mathbf{u}} f(\mathbf{u}) \ge (\mu/\nu)\,\mathrm{dist}(\mathbf{w}, \mathcal{W}^*)^\nu \tag{2}$$
holds generically whenever $f$ is a real analytic or subanalytic function (Łojasiewicz, 1963; 1993). The main downside of this result is that the parameters $\mu, \nu$ are usually impossible to evaluate; moreover, even when it is known that, e.g., $\nu = 2$, the parameter $\mu$ can be exponentially small in the dimension. Furthermore, local error bounds have primarily been studied in the context of deterministic optimization problems, with results for stochastic problems being very rare (Chen & Fukushima, 2005; Liu et al., 2018).

Perhaps the most surprising aspect of our results is that we show that the (stochastic) convex surrogate minimization problem not only satisfies a local error bound (a relaxation of (2) and a much weaker property than strong convexity; see Appendix A) with $\nu = 2$, but we are also able to characterize the parameter $\mu$ based on the assumptions about the activation function and the probability distribution over the data. More importantly, for standard activation functions such as ReLU, Swish, and GeLU, and for broad classes of distributions (including heavy-tailed and discrete ones), we prove that $\mu$ is an absolute constant. This is precisely what leads to the error and complexity results achieved in our work.

Robustly Learning a Single Neuron via Sharpness. Our technical approach can be broken down into the following main ideas. As a surrogate for minimizing the square loss, we first consider the noise-free convex surrogate, defined by
$$\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*) = \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[\int_0^{\mathbf{w}\cdot\mathbf{x}} (\sigma(r) - \sigma(\mathbf{w}^*\cdot\mathbf{x}))\, dr\right],$$
where $\mathbf{w}^* \in \mathcal{W}^*$ is a square-loss minimizer that satisfies our margin assumption. We keep this $\mathbf{w}^*$ fixed throughout the analysis and simply write $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ instead of $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$. Compared to the convex surrogate $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ introduced earlier in the introduction, the noise-free convex surrogate replaces the noisy labels $y$ with $\sigma(\mathbf{w}^*\cdot\mathbf{x})$. Clearly, $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$ is a function that cannot be directly optimized, as we lack knowledge of $\mathbf{w}^*$. On the other hand, the noise-free surrogate relates more directly to square-loss minimization: we prove (Lemma 2.5) that our distributional assumptions suffice for the noise-free surrogate to be sharp on a ball of radius $2\|\mathbf{w}^*\|_2$, denoted $\mathbb{B}(2\|\mathbf{w}^*\|_2)$; this structural result in turn implies that $\mathbf{w}^*$ is its unique minimizer. Hence, we can conclude that minimizing the noise-free surrogate $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$ leads to minimizing the $L_2^2$ loss. Of course, we cannot directly minimize $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$, as we do not know $\mathbf{w}^*$. Had there been no adversarial label noise, we could stop at this conclusion, as there would be no difference between $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ and $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$, and we could minimize the $L_2^2$ error to any desired accuracy by minimizing $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$.
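For intuition, the following minimal NumPy sketch (our illustration; the function names are ours) evaluates the two objectives for the ReLU activation, using the closed form $\int_0^s(\mathrm{ReLU}(r) - y)\,dr = \max\{s,0\}^2/2 - ys$:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def surrogate_loss(w, X, y):
    # L_sur(w) = E[ integral_0^{w.x} (relu(r) - y) dr ]
    #          = E[ relu(w.x)^2 / 2 - y * (w.x) ]   (closed form for ReLU)
    s = X @ w
    return np.mean(relu(s) ** 2 / 2.0 - y * s)

def surrogate_grad(w, X, y):
    # grad L_sur(w) = E[ (relu(w.x) - y) x ];  convex since relu is nondecreasing
    s = X @ w
    return (relu(s) - y) @ X / len(y)

def noise_free_surrogate_loss(w, X, w_star):
    # Same objective with the noisy labels y replaced by relu(w*.x)
    return surrogate_loss(w, X, relu(X @ w_star))
```

Subtracting the two closed forms shows that the gap between the noisy and noise-free objectives is the linear term $\mathbf{E}[(\sigma(\mathbf{w}^*\cdot\mathbf{x}) - y)(\mathbf{w}\cdot\mathbf{x})]$, contributed entirely by the label noise.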
This difference between $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ and $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$ is precisely what causes the $L_2^2$ error to only be brought down to $O(\mathrm{OPT}) + \epsilon$, where the constant in the big-O notation depends on the sharpness parameter $\mu$. On the technical side, we prove (Proposition 3.2) that $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ is also sharp w.r.t. the same $\mathbf{w}^*$ as $\bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}; \mathbf{w}^*)$, and with a sharpness parameter $\mu$ of the same order, but only on a nonconvex subset of the ball $\mathbb{B}(2\|\mathbf{w}^*\|_2)$ that excludes a neighborhood of $\mathbf{w}^*$. This turns out to be sufficient to relate minimizing $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ to minimizing the $L_2^2$ loss (Theorem 3.1).

What we have argued so far is sufficient for ensuring that minimizing the surrogate loss $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ leads to the claimed bound on the $L_2^2$ loss. However, it is not sufficient for obtaining the claimed sample and computational complexities, and there are additional technical hurdles that can only be handled using the specific structural properties of our resulting optimization problem. In particular, using solely smoothness and sharpness of the objective (even if the sharpness held on the entire region over which we are optimizing) would only lead to complexities scaling with $\frac{1}{\epsilon}$, using standard results from stochastic convex optimization. However, the complexity that we get is exponentially better, scaling with $\mathrm{polylog}(\frac{1}{\epsilon})$. This is enabled by a refined variance bound for the stochastic gradient estimate (see Corollary D.11), which, unlike in standard stochastic optimization settings (where we get a fixed upper bound), scales with $\mathrm{OPT} + \|\mathbf{w} - \mathbf{w}^*\|_2^2$.³ This property enables us to construct high-accuracy gradient estimates using mini-batching, which further leads to improved linear rates within the (nonconvex) region where the surrogate loss is sharp. To complete the argument, we further show that the excluded region on which the sharpness does not hold does not negatively impact the overall complexity, as within it the target approximation guarantee for the $L_2^2$ loss holds.

³Similar variance bound assumptions have been made in the more recent literature on stochastic optimization; see, e.g., Assumption 4.3(c) in (Bottou et al., 2018). We note, however, that our guarantees hold with high probability (compared to the more common expectation guarantees) and that the bulk of our technical contribution lies in proving that such a variance bound holds, rather than in analyzing SGD under such an assumption.

1.3. Notation

For $n \in \mathbb{Z}_+$, we denote by $[n]$ the set $\{1, \ldots, n\}$. We use lowercase boldface letters for vectors and uppercase boldface letters for matrices. For $\mathbf{x} \in \mathbb{R}^d$ and $i \in [d]$, $\mathbf{x}_i$ denotes the $i$-th coordinate of $\mathbf{x}$, and $\|\mathbf{x}\|_2 := (\sum_{i=1}^d \mathbf{x}_i^2)^{1/2}$ denotes the $\ell_2$-norm of $\mathbf{x}$. We use $\mathbf{x}\cdot\mathbf{y}$ for the standard inner product of $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ and $\theta(\mathbf{x}, \mathbf{y})$ for the angle between $\mathbf{x}, \mathbf{y}$. We use $\mathbb{1}_E$ for the characteristic function of the set/event $E$, i.e., $\mathbb{1}_E(\mathbf{x}) = 1$ if $\mathbf{x} \in E$ and $\mathbb{1}_E(\mathbf{x}) = 0$ if $\mathbf{x} \notin E$. We denote by $\mathbb{B}(r) = \{\mathbf{u} : \|\mathbf{u}\|_2 \le r\}$ the $\ell_2$-ball of radius $r$. We use the standard asymptotic notation $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to omit polylogarithmic factors in the argument. We write $E \gtrsim F$ for two nonnegative expressions $E$ and $F$ to denote that there exists some universal constant $c > 0$ (independent of the variables or parameters on which $E$ and $F$ depend) such that $E \ge c\,F$ (and $E \lesssim F$ is defined analogously). We use $\mathbf{E}_{X\sim D}[X]$ for the expectation of the random variable $X$ according to the distribution $D$ and $\mathbf{Pr}[E]$ for the probability of the event $E$. For simplicity of exposition, we may omit the distribution when it is clear from the context. For $(\mathbf{x}, y)$ distributed according to $D$, we denote by $D_{\mathbf{x}}$ the marginal distribution of $\mathbf{x}$.

2. Landscape of the Noise-Free Surrogate

We start by defining the class of activations and the distributional assumptions under which our results apply. We then establish our first structural result, showing that these conditions suffice for sharpness of the noise-free surrogate.

2.1. Activations and Distributional Assumptions

The main assumptions used throughout this paper to prove sharpness results are summarized below.
Definition 2.1 (Monotonic Unbounded Activations, (Diakonikolas et al., 2022b)). Let $\sigma : \mathbb{R} \to \mathbb{R}$ be nondecreasing and let $\alpha, \beta > 0$. We say that $\sigma$ is (monotonic) $(\alpha, \beta)$-unbounded if (i) $\sigma$ is $\alpha$-Lipschitz; and (ii) $\sigma'(t) \ge \beta$ for all $t > 0$.

The above class contains a range of popular activations, including the ReLU (which is $(1,1)$-unbounded) and the Leaky ReLU with parameter $0 \le \lambda \le \frac{1}{2}$, i.e., $\sigma(t) = \max\{\lambda t, (1-\lambda)t\}$ (which is $(1-\lambda, 1-\lambda)$-unbounded). Our results apply for the following class of activations.

Assumption 2.2 (Controlled Activation). The activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is $(\alpha, \beta)$-unbounded for some positive parameters $\alpha \ge 1$, $\beta \in (0, 1)$, and it holds that $\sigma(0) = 0$.

The assumption on the activation is important both for the convergence analysis of our algorithm and for proving the sharpness property of the surrogate loss. We can now state our distributional assumptions.

Assumption 2.3 (Margin). There exist $\mathbf{w}^* \in \mathcal{W}^*$ and parameters $\gamma, \lambda \in (0, 1]$ such that $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[\mathbf{x}\mathbf{x}^\top \mathbb{1}\{\mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}] \succeq \lambda\,\mathbf{I}$.

We note that in order to obtain a constant-factor approximate learner, the parameters $\gamma$ and $\lambda$ in Assumption 2.3 should be dimension-independent constants.

Assumption 2.4 (Concentration). There exists a non-increasing $h : \mathbb{R}_+ \to \mathbb{R}_+$ satisfying $h(r) \le B r^{-(4+\rho)}$ for some parameters $B \ge 1$ and $1 \ge \rho > 0$, such that for any $\mathbf{u} \in \mathbb{B}(1)$ and any $r \ge 1$, it holds that $\mathbf{Pr}[|\mathbf{u}\cdot\mathbf{x}| \ge r] \le h(r)$.

The concentration property enables us to control the moments of $|\mathbf{u}\cdot\mathbf{x}|$, playing an important role when we bound the variance of the gradient of the empirical surrogate loss.
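To make these conditions concrete, here is a small NumPy sketch (ours; the direction `w_star`, the value of `gamma`, and the Gaussian data are illustrative assumptions, not part of the paper) that estimates the margin parameter $\lambda$ as the smallest eigenvalue of the restricted second-moment matrix, and the tail probability bounded by $h$:

```python
import numpy as np

def estimate_margin_lambda(X, w_star, gamma):
    """Empirical version of Assumption 2.3: smallest eigenvalue of
    E[x x^T 1{w*.x >= gamma * ||w*||}], estimated from samples X (n x d)."""
    mask = X @ w_star >= gamma * np.linalg.norm(w_star)
    second_moment = (X[mask].T @ X[mask]) / len(X)
    return np.linalg.eigvalsh(second_moment)[0]     # lambda-hat

def empirical_tail(X, u, r):
    """Empirical version of Assumption 2.4: Pr[|u.x| >= r] for a unit vector u."""
    return np.mean(np.abs(X @ u) >= r)

# Example: the standard Gaussian satisfies both assumptions with
# dimension-independent constants (the direction and gamma are arbitrary here).
rng = np.random.default_rng(0)
d, n = 5, 200_000
X = rng.standard_normal((n, d))
w_star = np.ones(d) / np.sqrt(d)
print(estimate_margin_lambda(X, w_star, gamma=0.1))   # bounded away from 0
print(empirical_tail(X, w_star, r=3.0))               # small: sub-Gaussian tail
```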
2.2. Key Assumptions Suffice for Sharpness

We now prove that Assumptions 2.2–2.4 suffice to guarantee that the noise-free surrogate loss is sharp. We provide a proof sketch under the simplifying assumption that $\|\mathbf{w}^*\|_2 = 1$. The full proof can be found in Appendix C.1.

Lemma 2.5. Suppose that Assumptions 2.2–2.4 hold. Then the noise-free surrogate loss $\bar{L}_{\mathrm{sur}}^{D,\sigma}$ is $\Omega(\lambda^2\gamma\beta\rho/B)$-sharp in the ball $\mathbb{B}(2\|\mathbf{w}^*\|_2)$, i.e., for all $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$,
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \gtrsim (\lambda^2\gamma\beta\rho/B)\,\|\mathbf{w} - \mathbf{w}^*\|_2^2.$$

Proof Sketch of Lemma 2.5. Observe that $\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) = \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x}))\mathbf{x}]$. Using the fact that $\sigma$ is non-decreasing, it holds that
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) = \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x})||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\right].$$
Denote $E_m = \{\mathbf{w}^*\cdot\mathbf{x} \ge \gamma\}$. Using the fact that every term inside the expectation is nonnegative, we can further bound $\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*)$ from below by
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x})||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}_{E_m}(\mathbf{x})\right].$$
Since $\sigma$ is $(\alpha, \beta)$-unbounded, we have that $\sigma'(t) \ge \beta$ for all $t \in (0, \infty)$. By the mean value theorem, we can show that for $t_2 \ge t_1 \ge 0$ we have $|\sigma(t_1) - \sigma(t_2)| \ge \beta|t_1 - t_2|$. Additionally, if $t_1 \ge 0$ and $t_2 \le 0$, then $|\sigma(t_1) - \sigma(t_2)| \ge \beta t_1$. Therefore, by combining the above, and denoting the event $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} \le 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\}$ by $E_0$, we get
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge \beta\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > 0, E_m(\mathbf{x})\}\right] + \beta\,\underbrace{\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}_{E_0}(\mathbf{x})\right]}_{(Q)}. \tag{3}$$
We show that the term $(Q)$ can be bounded below by a quantity proportional to $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}_{E_0}(\mathbf{x})]$. To this end, we establish the following claim.

Claim 2.6. For $r_0 \ge 1$, define the event $E_1 = E_1(r_0) = \{\mathbf{x} : -2r_0 < \mathbf{w}\cdot\mathbf{x} \le 0, E_m(\mathbf{x})\}$. It holds that $(Q) \ge (\gamma/(3r_0))\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}_{E_1}(\mathbf{x})]$.

Proof of Claim 2.6. Since $E_1 \subseteq E_0$, it holds that $(Q) \ge \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}_{E_1}(\mathbf{x})]$. Restricting $\mathbf{x}$ to the event $E_1$, it holds that $|\mathbf{w}\cdot\mathbf{x}| \le 2(r_0/\gamma)|\mathbf{w}^*\cdot\mathbf{x}|$. Thus, $|\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}| = |\mathbf{w}\cdot\mathbf{x}| + |\mathbf{w}^*\cdot\mathbf{x}| \le (1 + 2r_0/\gamma)|\mathbf{w}^*\cdot\mathbf{x}|$. By Assumption 2.3 we have that $\gamma \in (0, 1]$, therefore we get that $|\mathbf{w}^*\cdot\mathbf{x}| \ge \frac{\gamma}{\gamma + 2r_0}|\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}| \ge \frac{\gamma}{3r_0}|\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|$, since $r_0 \ge 1$. Taking the expectation of $|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|$ with $\mathbf{x}$ restricted to the event $E_1$, we obtain
$$(Q) \ge \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}_{E_1}(\mathbf{x})\right] \ge \frac{\gamma}{3r_0}\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}_{E_1}(\mathbf{x})\right],$$
as desired. ∎

Combining Equation (3) and Claim 2.6, we get that
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge \frac{\beta\gamma}{3r_0}\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > -2r_0, E_m(\mathbf{x})\}\right], \tag{4}$$
where in the last inequality we used the fact that $1 \ge \gamma/(3r_0)$ (since $\gamma \in (0, 1]$ and $r_0 \ge 1$). To complete the proof, we need to show that, for an appropriate choice of $r_0$, the probability of the event $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} > -2r_0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\}$ is close to the probability of the event $\{\mathbf{x} : \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\}$. Given such a statement, the lemma follows from Assumption 2.3. Formally, we show the following claim.

Claim 2.7. Let $r_0 \ge 1$ be such that $h(r_0) \le \lambda^2\rho/(20B)$. Then, for all $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$, we have that
$$\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > -2r_0, E_m(\mathbf{x})\}\right] \ge (\lambda/2)\,\|\mathbf{w} - \mathbf{w}^*\|_2^2.$$
Since $h(r) \le B/r^{4+\rho}$ and $h(r)$ is non-increasing, such an $r_0$ exists and we can always take $r_0 \ge 1$. Combining Equation (4) and Claim 2.7, we get
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \gtrsim \frac{\gamma\lambda\beta}{r_0}\,\|\mathbf{w} - \mathbf{w}^*\|_2^2.$$
To complete the proof of Lemma 2.5, it remains to choose $r_0$ appropriately. By Claim 2.7, we need to select $r_0$ to be sufficiently large so that $h(r_0) \le \lambda^2\rho/(20B)$. By Assumption 2.4, we have that $h(r) \le B/r^{4+\rho}$. Thus, we can choose $r_0 = 5B/(\lambda\rho)$, by our assumptions. ∎

3. Efficient Constant-Factor Approximation

We now outline our main technical approach, including the algorithm, its analysis, the connections between the $L_2^2$ loss and the two (noisy and noise-free) surrogates, and the role of sharpness. Due to space constraints, this section contains simplified proofs and proof sketches, while the full technical details are deferred to Appendix C.

3.1. The Landscape of the Surrogate Loss

We start this section by showing how the landscape of the surrogate loss connects with the error of the true loss.

Theorem 3.1. Let $D$ be a distribution supported on $\mathbb{R}^d \times \mathbb{R}$ and let $\sigma : \mathbb{R} \to \mathbb{R}$ be an $(\alpha, \beta)$-unbounded activation. Fix $\mathbf{w}^* \in \mathcal{W}^*$ and suppose that $D_{\mathbf{x}}$ satisfies Assumptions 2.3 and 2.4 with respect to $\mathbf{w}^*$. Furthermore, let $C > 0$ be a sufficiently small absolute constant and let $\bar{\mu} = C\lambda^2\gamma\beta\rho/B$. Then, for any $\epsilon > 0$ and $\hat{\mathbf{w}} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$ such that $L_{\mathrm{sur}}^{D,\sigma}(\hat{\mathbf{w}}) \le \inf_{\mathbf{w}\in\mathbb{B}(2\|\mathbf{w}^*\|_2)} L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) + \epsilon$, it holds that
$$L_2^{D,\sigma}(\hat{\mathbf{w}}) = O\left((\alpha B/(\rho\bar{\mu}))^2\right)\left(L_2^{D,\sigma}(\mathbf{w}^*) + \alpha\epsilon\right).$$

Proof. For this proof, we assume for ease of presentation that $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[\mathbf{x}\mathbf{x}^\top] \preceq \mathbf{I}$ and $B, \rho, \alpha = 1$. Denote by $K$ the set of $\hat{\mathbf{w}}$ such that $\hat{\mathbf{w}} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$ and $L_{\mathrm{sur}}^{D,\sigma}(\hat{\mathbf{w}}) \le \inf_{\mathbf{w}\in\mathbb{B}(2\|\mathbf{w}^*\|_2)} L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) + \epsilon$. Next, observe that the set of minimizers of the loss $L_{\mathrm{sur}}^{D,\sigma}$ inside the ball $\mathbb{B}(2\|\mathbf{w}^*\|_2)$ is convex. Furthermore, the set $\mathbb{B}(2\|\mathbf{w}^*\|_2)$ is compact. Thus, for any point $\bar{\mathbf{w}} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$ that minimizes $L_{\mathrm{sur}}^{D,\sigma}$, it will either hold that $\|\nabla L_{\mathrm{sur}}^{D,\sigma}(\bar{\mathbf{w}})\|_2 = 0$ or $\bar{\mathbf{w}} \in \partial\mathbb{B}(2\|\mathbf{w}^*\|_2)$. Let $\mathcal{W}_{\mathrm{sur}}^*$ be the set of minimizers of $L_{\mathrm{sur}}^{D,\sigma}$. We first show that if there exists a minimizer $\bar{\mathbf{w}} \in \mathcal{W}_{\mathrm{sur}}^*$ such that $\bar{\mathbf{w}} \in \partial\mathbb{B}(2\|\mathbf{w}^*\|_2)$, then any point $\mathbf{w}$ inside the set $\mathbb{B}(2\|\mathbf{w}^*\|_2)$ gets error proportional to $L_2^{D,\sigma}(\mathbf{w}^*)$. Observe that for such a point $\bar{\mathbf{w}}$, by the necessary condition of optimality,
$$\nabla L_{\mathrm{sur}}^{D,\sigma}(\bar{\mathbf{w}})\cdot(\bar{\mathbf{w}} - \mathbf{w}) \le 0, \tag{5}$$
for any $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$. Using Proposition 3.2, we get that either $\nabla L_{\mathrm{sur}}^{D,\sigma}(\bar{\mathbf{w}})\cdot(\bar{\mathbf{w}} - \mathbf{w}^*) \ge (\bar{\mu}/2)\|\bar{\mathbf{w}} - \mathbf{w}^*\|_2^2$ or $\bar{\mathbf{w}} \in \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}^*\|_2^2 \le (20/\bar{\mu}^2) L_2^{D,\sigma}(\mathbf{w}^*)\}$. But Equation (5) contradicts $\nabla L_{\mathrm{sur}}^{D,\sigma}(\bar{\mathbf{w}})\cdot(\bar{\mathbf{w}} - \mathbf{w}^*) \ge (\bar{\mu}/2)\|\bar{\mathbf{w}} - \mathbf{w}^*\|_2^2 > 0$, since $\mathbf{w}^* \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$ and $\|\bar{\mathbf{w}}\|_2 = 2\|\mathbf{w}^*\|_2$, hence $\bar{\mathbf{w}} \ne \mathbf{w}^*$.
So it must be the case that $\bar{\mathbf{w}} \in \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}^*\|_2^2 \le (20/\bar{\mu}^2) L_2^{D,\sigma}(\mathbf{w}^*)\}$. Again, we have that $\bar{\mathbf{w}} \in \partial\mathbb{B}(2\|\mathbf{w}^*\|_2)$, therefore $\|\bar{\mathbf{w}} - \mathbf{w}^*\|_2 \ge \|\mathbf{w}^*\|_2$. Hence, $(20/\bar{\mu}^2) L_2^{D,\sigma}(\mathbf{w}^*) \ge \|\mathbf{w}^*\|_2^2$. Therefore, for any $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$, we have
$$L_2^{D,\sigma}(\mathbf{w}) = \mathbf{E}_{(\mathbf{x},y)\sim D}\left[(\sigma(\mathbf{w}\cdot\mathbf{x}) - y)^2\right] \le 2 L_2^{D,\sigma}(\mathbf{w}^*) + 2\|\mathbf{w} - \mathbf{w}^*\|_2^2 = O(1/\bar{\mu}^2)\,L_2^{D,\sigma}(\mathbf{w}^*),$$
where we used the fact that $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[\mathbf{x}\mathbf{x}^\top] \preceq \mathbf{I}$ and that $\sigma$ is 1-Lipschitz. Since the inequality above holds for any $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$, it will also be true for $\hat{\mathbf{w}} \in K \subseteq \mathbb{B}(2\|\mathbf{w}^*\|_2)$.

It remains to consider the case where the minimizers $\mathcal{W}_{\mathrm{sur}}^*$ are strictly inside $\mathbb{B}(2\|\mathbf{w}^*\|_2)$. Note that $L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$ is 1-smooth. Therefore, for any $\hat{\mathbf{w}} \in K$ it holds that $\|\nabla L_{\mathrm{sur}}^{D,\sigma}(\hat{\mathbf{w}})\|_2^2 \le 2\epsilon$. By Proposition 3.2 (stated and proved below), we get that either $\|\hat{\mathbf{w}} - \mathbf{w}^*\|_2^2 \lesssim (1/\bar{\mu}^2) L_2^{D,\sigma}(\mathbf{w}^*)$ or $\sqrt{2\epsilon} \ge (\bar{\mu}/2)\|\hat{\mathbf{w}} - \mathbf{w}^*\|_2$. Therefore, we obtain that $\|\hat{\mathbf{w}} - \mathbf{w}^*\|_2^2 \lesssim (1/\bar{\mu}^2)(L_2^{D,\sigma}(\mathbf{w}^*) + \epsilon)$. ∎

The proof of Theorem 3.1 required the following proposition, which shows that if the current vector $\mathbf{w}$ is sufficiently far away from the true vector $\mathbf{w}^*$, then the gradient of the surrogate loss has a large component in the direction of $\mathbf{w} - \mathbf{w}^*$; in other words, the surrogate loss is sharp.

Proposition 3.2. Let $D$ be a distribution supported on $\mathbb{R}^d \times \mathbb{R}$ and let $\sigma : \mathbb{R} \to \mathbb{R}$ be an $(\alpha, \beta)$-unbounded activation. Suppose that $D_{\mathbf{x}}$ satisfies Assumptions 2.3 and 2.4, let $C > 0$ be a sufficiently small absolute constant, and let $\bar{\mu} = C\lambda^2\gamma\beta\rho/B$. Fix $\mathbf{w}^* \in \mathcal{W}^*$ and let $S = \mathbb{B}(2\|\mathbf{w}^*\|_2) \setminus \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}^*\|_2^2 \le (20B/(\rho\bar{\mu}^2)) L_2^{D,\sigma}(\mathbf{w}^*)\}$. Then the surrogate loss $L_{\mathrm{sur}}^{D,\sigma}$ is $(\bar{\mu}/2)$-sharp in $S$, i.e.,
$$\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge (\bar{\mu}/2)\,\|\mathbf{w} - \mathbf{w}^*\|_2^2, \quad \forall \mathbf{w} \in S.$$

Proof. For this proof, we assume for ease of presentation that $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[\mathbf{x}\mathbf{x}^\top] \preceq \mathbf{I}$ and $\kappa, B, \rho, \alpha = 1$. We show that $\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*)$ is bounded away from zero. We decompose the gradient into two parts, i.e., $\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) = (\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)) + \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)$. First, we bound $\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)$ in the direction $\mathbf{w} - \mathbf{w}^*$, which yields
$$\left|\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)\cdot(\mathbf{w} - \mathbf{w}^*)\right| \le \sqrt{L_2^{D,\sigma}(\mathbf{w}^*)}\,\|\mathbf{w} - \mathbf{w}^*\|_2,$$
where we used the Cauchy-Schwarz inequality and that $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[\mathbf{x}\mathbf{x}^\top] \preceq \mathbf{I}$. It remains to bound the remaining term. Note that $\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*) = \nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})$. Using the fact that $\bar{L}_{\mathrm{sur}}^{D,\sigma}$ is $\mu$-sharp for any $\mathbf{w} \in S$ by Lemma 2.5, it holds that $\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge \mu\,\|\mathbf{w} - \mathbf{w}^*\|_2^2$. Combining everything together, we get the claimed result. ∎

3.2. Fast Rates for $L_2^2$ Loss Minimization

In this section, we proceed to show that when the surrogate loss is sharp, applying batch Stochastic Gradient Descent (SGD) on the empirical surrogate loss obtains a $C$-approximate parameter $\hat{\mathbf{w}}$ of the $L_2^2$ loss in linear time. Specifically, consider the following iteration update:
$$\mathbf{w}^{(t+1)} = \mathrm{argmin}_{\mathbf{w}\in\mathbb{B}(W)} \left\{\mathbf{w}\cdot\mathbf{g}^{(t)} + (1/(2\eta))\,\|\mathbf{w} - \mathbf{w}^{(t)}\|_2^2\right\}, \tag{6}$$
where $\eta$ is the step size and $\mathbf{g}^{(t)}$ is the empirical gradient of the surrogate loss, i.e., $\mathbf{g}^{(t)} = \frac{1}{N}\sum_{j=1}^N (\sigma(\mathbf{w}^{(t)}\cdot\mathbf{x}^{(j)}) - y^{(j)})\mathbf{x}^{(j)}$. The algorithm is summarized in Algorithm 1; an illustrative Python sketch is given after the proof sketch of Theorem 3.3 below.

Algorithm 1: Stochastic Gradient Descent on Surrogate Loss
  Input: iterations $T$, sample access to $D$, batch size $N$, step size $\eta$, bound $M$.
  Initialize: $\mathbf{w}^{(0)} \leftarrow \mathbf{0}$.
  for $t = 1$ to $T$ do
    Draw $N$ samples $\{(\mathbf{x}^{(j)}, y^{(j)})\}_{j=1}^N \sim D$.
    For each $j \in [N]$: $y^{(j)} \leftarrow \mathrm{sign}(y^{(j)})\min(|y^{(j)}|, M)$.
    $\mathbf{g}^{(t)} \leftarrow \frac{1}{N}\sum_{j=1}^N (\sigma(\mathbf{w}^{(t)}\cdot\mathbf{x}^{(j)}) - y^{(j)})\mathbf{x}^{(j)}$.
    $\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta\,\mathbf{g}^{(t)}$.
  end for
  Output: the weight vector $\mathbf{w}^{(T)}$.

We define the helper functions $H_2$ and $H_4$ as follows: $H_2(r) := \max_{\mathbf{u}\in\mathbb{B}(1)} \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\mathbf{u}\cdot\mathbf{x})^2\mathbb{1}\{|\mathbf{u}\cdot\mathbf{x}| \ge r\}]$ and $H_4(r) := \max_{\mathbf{u}\in\mathbb{B}(1)} \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\mathbf{u}\cdot\mathbf{x})^4\mathbb{1}\{|\mathbf{u}\cdot\mathbf{x}| \ge r\}]$.
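As an aside, $H_2$ and the resulting truncation threshold $M$ can be approximated numerically from samples. The sketch below (ours; it fixes a single direction $\mathbf{u}$, which only lower-bounds the maximum in the definition, and uses an illustrative Gaussian marginal) inverts $r \mapsto H_2(r)$ by a doubling search, which suffices since $H_2$ is non-increasing:

```python
import numpy as np

def H2_hat(X, u, r):
    """Empirical H_2(r) along one unit direction u."""
    s = X @ u
    return np.mean(np.where(np.abs(s) >= r, s ** 2, 0.0))

def truncation_threshold(X, u, alpha, W, eps):
    """Approximate M = alpha * W * H_2^{-1}(eps / (4 alpha^2 W^2))
    via doubling search on r (coarse, illustrative inverse)."""
    target = eps / (4.0 * alpha ** 2 * W ** 2)
    r = 1.0
    while H2_hat(X, u, r) > target:
        r *= 2.0
    return alpha * W * r

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 4))        # illustrative marginal
u = np.array([1.0, 0.0, 0.0, 0.0])
print(truncation_threshold(X, u, alpha=1.0, W=1.0, eps=1e-3))
```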
Now we state our main theorem.

Theorem 3.3 (Main Algorithmic Result). Fix $\epsilon, W > 0$ and suppose Assumptions 2.2 to 2.4 hold. Let $\mu := \mu(\lambda, \gamma, \beta, \rho, B)$ be a sufficiently small constant multiple of $\lambda^2\gamma\beta\rho/B$, and let $M = \alpha W H_2^{-1}(\epsilon/(4\alpha^2 W^2))$. Further, choose the parameter $r_\epsilon$ large enough so that $H_4(r_\epsilon)$ is a sufficiently small constant multiple of $\epsilon$. Then after $T = \tilde{\Theta}\left((B^2\alpha^2/(\rho^2\mu^2))\log(W/\epsilon)\right)$ iterations with batch size $N = \Omega(d\,T(r_\epsilon^2 + \alpha^2 M^2))$, Algorithm 1 converges to a point $\mathbf{w}^{(T)}$ such that $L_2^{D,\sigma}(\mathbf{w}^{(T)}) = O\left(B^2\alpha^2/(\rho^2\mu^2)\right)\mathrm{OPT} + \epsilon$, with probability at least $2/3$.

As shown in Theorem 3.1, when we find a vector $\hat{\mathbf{w}}$ that minimizes the surrogate loss, this $\hat{\mathbf{w}}$ is itself a $C$-approximate solution of Problem 1.1. However, minimizing the surrogate loss can be expensive in sample and computational complexity. Proposition 3.2 says that we can achieve strong-convexity-like rates as long as we are far away from a minimizer of the $L_2^2$ loss. Roughly speaking, we show that at each iteration $t$, it holds that $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \le C\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 + O(\mathrm{OPT})$, where $0 < C < 1$ is some constant depending on the parameters $\alpha, \beta, \mu, \rho$, and $B$. Then $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2$ contracts fast as long as $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 > (1/(1 - C))\,\mathrm{OPT}$. When this condition fails, we have converged to a point that achieves $O(\mathrm{OPT})$ $L_2^2$ error.

The following lemma states that we can truncate the labels $y$ to $|y| \le M$, where $M$ is a parameter depending on $D_{\mathbf{x}}$. The proof can be found in Appendix D.2.

Lemma 3.4. Let $M = \alpha W H_2^{-1}(\epsilon/(4\alpha^2 W^2))$ and $y' = \mathrm{sign}(y)\min(|y|, M)$. Then we have that $\mathbf{E}_{(\mathbf{x},y)\sim D}\left[(\sigma(\mathbf{w}^*\cdot\mathbf{x}) - y')^2\right] \le \mathrm{OPT} + \epsilon$.

Lemma 3.4 allows us to assume that $|y| \le M$.

Proof Sketch of Theorem 3.3. For this sketch, we assume for ease of notation that $B, \rho, \alpha = 1$ and that $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[\mathbf{x}\mathbf{x}^\top] \preceq \mathbf{I}$. The blueprint of the proof is to show that Algorithm 1 minimizes $\|\mathbf{w} - \mathbf{w}^*\|_2$ efficiently, in terms of both the sample complexity and the iteration complexity. Specifically, we show that at each iteration, $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \le (1 - C)\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 + (\text{small error})$, where $0 < C < 1$. The key technique is to exploit the sharpness property of the surrogate loss, which we have already proved in Proposition 3.2. To this end, we study the difference of $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2$ and $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$. We remind the reader that, for convenience of notation, we denote the empirical gradients as
$$\mathbf{g}^{(t)} = \frac{1}{N}\sum_{j=1}^N (\sigma(\mathbf{w}^{(t)}\cdot\mathbf{x}^{(j)}) - y^{(j)})\mathbf{x}^{(j)}, \qquad \mathbf{g}^* = \frac{1}{N}\sum_{j=1}^N (\sigma(\mathbf{w}^*\cdot\mathbf{x}^{(j)}) - y^{(j)})\mathbf{x}^{(j)}.$$
Moreover, we denote the noise-free empirical gradient by $\bar{\mathbf{g}}^{(t)}$, i.e., $\bar{\mathbf{g}}^{(t)} = \mathbf{g}^{(t)} - \mathbf{g}^*$. Plugging in the iteration scheme $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\mathbf{g}^{(t)}$ while expanding the squared norm, we get
$$\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 = \underbrace{\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 - 2\eta\,\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\cdot(\mathbf{w}^{(t)} - \mathbf{w}^*)}_{(Q_1)}\ \underbrace{-\,2\eta\,(\mathbf{g}^{(t)} - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)}))\cdot(\mathbf{w}^{(t)} - \mathbf{w}^*) + \eta^2\|\mathbf{g}^{(t)}\|_2^2}_{(Q_2)}.$$
Observe that we decomposed the right-hand side into two parts: the true contribution of the gradient $(Q_1)$ and the estimation error $(Q_2)$. In order to utilize the sharpness property of the surrogate loss at the point $\mathbf{w}^{(t)}$, the conditions
$$\mathbf{w}^{(t)} \in \mathbb{B}(2\|\mathbf{w}^*\|_2) \quad\text{and}\quad \mathbf{w}^{(t)} \notin \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}^*\|_2^2 \le (20/\bar{\mu}^2)\,\mathrm{OPT}\} \tag{7}$$
need to be satisfied. For the first condition, recall that we initialized $\mathbf{w}^{(0)} = \mathbf{0}$; hence, Equation (7) is valid for $t = 0$. By induction, it suffices to show that, assuming $\mathbf{w}^{(t)} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$ holds, we have $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2 \le (1 - C)\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2$ for some constant $0 < C < 1$. Thus, we assume temporarily that Equation (7) is true at iteration $t$, and we show in the remainder of the proof that $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2 \le (1 - C)\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2$ until we arrive at some final iteration $T$. Then, by induction, the first part of Equation (7) is satisfied at each step $t \le T$. For the second condition, note that if it is violated at some iteration $T$, then $\|\mathbf{w}^{(T)} - \mathbf{w}^*\|_2^2 = O(\mathrm{OPT})$, implying that this would be the solution we are looking for, and the algorithm could be terminated at $T$.
Therefore, whenever $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$ is far away from OPT, the prerequisites of Proposition 3.2 are satisfied and $L_{\mathrm{sur}}^{D,\sigma}$ is sharp. For the first term $(Q_1)$, using that $L_{\mathrm{sur}}^{D,\sigma}$ is $\mu$-sharp at $\mathbf{w}^{(t)}$ by Proposition 3.2, we immediately get a sufficient decrease at each iteration. Namely, applying Proposition 3.2, we get
$$(Q_1) = \|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 - 2\eta\,\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\cdot(\mathbf{w}^{(t)} - \mathbf{w}^*) \le (1 - 2\eta\mu)\,\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2,$$
where $\mu = C\lambda^2\gamma\beta$ for some sufficiently small constant $C$. Now it suffices to show that $(Q_2)$ can be bounded above by $C'\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$, where $C'$ is a parameter depending on $\eta$ and $\mu$ that can be made comparatively small. Formally, we show the following claim.

Claim 3.5. Suppose $\eta \le 1$. Fix $r_\epsilon \ge 1$ such that $H_4(r_\epsilon)$ is a sufficiently small constant multiple of $\epsilon$. Choosing $N$ to be a sufficiently large constant multiple of $(d/\delta)(r_\epsilon^2 + M^2)$, we have with probability at least $1 - \delta$ that
$$(Q_2) \le ((3/2)\eta\mu + 8\eta^2)\,\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 + (8\eta/\mu)(\mathrm{OPT} + \epsilon).$$

Proof. Observe that by the inequality $\mathbf{x}\cdot\mathbf{y} \le (\mu/2)\|\mathbf{x}\|_2^2 + (1/(2\mu))\|\mathbf{y}\|_2^2$ applied to the inner product $(\mathbf{g}^{(t)} - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)}))\cdot(\mathbf{w}^{(t)} - \mathbf{w}^*)$, we get
$$(Q_2) \le \frac{\eta}{\mu}\,\|\mathbf{g}^{(t)} - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\|_2^2 + \eta\mu\,\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 + 2\eta^2\|\bar{\mathbf{g}}^{(t)}\|_2^2 + 2\eta^2\|\mathbf{g}^*\|_2^2,$$
where $\mu$ is the sharpness parameter and we used the definition $\bar{\mathbf{g}}^{(t)} = \mathbf{g}^{(t)} - \mathbf{g}^*$ in the first inequality. Note that $\|\mathbf{g}^{(t)} - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\|_2^2 \le 2\|\bar{\mathbf{g}}^{(t)} - \nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\|_2^2 + 2\|\mathbf{g}^* - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)\|_2^2$, since we have $\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)}) = \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)}) - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)$. Thus, it holds that
$$(Q_2) \le \eta\mu\,\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 + 2\eta^2\|\bar{\mathbf{g}}^{(t)}\|_2^2 + 2\eta^2\|\mathbf{g}^*\|_2^2 + \frac{2\eta}{\mu}\left(\|\bar{\mathbf{g}}^{(t)} - \nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\|_2^2 + \|\mathbf{g}^* - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)\|_2^2\right). \tag{8}$$
Furthermore, using standard concentration tools, it can be shown that when $N \ge C d(r_\epsilon^2 + M^2)/\delta$, where $C$ is a sufficiently large absolute constant, with probability at least $1 - \delta$ it holds that $\|\bar{\mathbf{g}}^{(t)} - \nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^{(t)})\|_2^2 \le (\mu^2/4)\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$, $\|\bar{\mathbf{g}}^{(t)}\|_2^2 \le 4\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$, $\|\mathbf{g}^* - \nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)\|_2^2 \le \mathrm{OPT} + \epsilon$, and $\|\mathbf{g}^*\|_2^2 \le 2\,\mathrm{OPT} + \epsilon$. It remains to plug these bounds back into Equation (8). ∎

Combining the upper bounds on $(Q_1)$ and $(Q_2)$ and choosing $\eta = \mu/32$, we have:
$$\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \le \left(1 - \mu^2/128\right)\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 + (1/4)(\mathrm{OPT} + \epsilon). \tag{9}$$
When $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 \ge (64/\mu^2)(\mathrm{OPT} + \epsilon)$, in other words when $\mathbf{w}^{(t)}$ is still away from the minimizer $\mathbf{w}^*$, it further holds with probability $1 - \delta$ that
$$\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \le (1 - \mu^2/256)\,\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2, \tag{10}$$
which proves the sufficient decrease of $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$ that we proposed at the beginning.

Let $T$ be the first iteration such that $\mathbf{w}^{(T)}$ satisfies $\|\mathbf{w}^{(T)} - \mathbf{w}^*\|_2^2 \le (64/\mu^2)(\mathrm{OPT} + \epsilon)$. Recall that we need Equation (7) to be satisfied for every $t \le T$ in order to apply sharpness. The first condition is satisfied naturally, since $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \le \|\mathbf{w}^*\|_2^2$ as a consequence of Equation (10) (recall that $\mathbf{w}^{(0)} = \mathbf{0}$), which implies $\|\mathbf{w}^{(t+1)}\|_2 \le 2\|\mathbf{w}^*\|_2$. For the second condition, when $t + 1 < T$, we have $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \ge (64/\mu^2)(\mathrm{OPT} + \epsilon)$; hence the second condition also holds. When $t \le T$, the contraction of $\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$ indicates a linear convergence of SGD. Since $\mathbf{w}^{(0)} = \mathbf{0}$ and $\|\mathbf{w}^*\|_2 \le W$, it holds that
$$\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2 \le (1 - \mu^2/256)^t\,\|\mathbf{w}^{(0)} - \mathbf{w}^*\|_2^2 \le \exp(-t\mu^2/256)\,W^2.$$
Thus, to generate $\mathbf{w}^{(T)}$ such that $\|\mathbf{w}^{(T)} - \mathbf{w}^*\|_2^2 \le (64/\mu^2)(\mathrm{OPT} + \epsilon)$, it suffices to run Algorithm 1 for $T = \tilde{\Theta}((1/\mu^2)\log(W/\epsilon))$ iterations. Recall that at each step $t$ the contraction $\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\|_2^2 \le (1 - \mu^2/256)\|\mathbf{w}^{(t)} - \mathbf{w}^*\|_2^2$ holds with probability $1 - \delta$; thus the union bound implies that $\|\mathbf{w}^{(T)} - \mathbf{w}^*\|_2^2 \le (64/\mu^2)(\mathrm{OPT} + \epsilon)$ holds with probability $1 - T\delta$. Moreover, as $L_2^{D,\sigma}(\mathbf{w}^{(T)}) \lesssim \mathrm{OPT} + \|\mathbf{w}^{(T)} - \mathbf{w}^*\|_2^2$, if $\|\mathbf{w}^{(T)} - \mathbf{w}^*\|_2^2 \le (64/\mu^2)(\mathrm{OPT} + \epsilon)$, then $L_2^{D,\sigma}(\mathbf{w}^{(T)}) = O(1/\mu^2)(\mathrm{OPT} + \epsilon)$. Letting $\delta = 1/(3T)$ completes the proof. ∎
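As promised after Equation (6), the following is a minimal NumPy sketch of Algorithm 1 for the ReLU activation. The sampling oracle `draw_batch`, the corruption model, and the hyperparameter values are illustrative stand-ins, not the settings prescribed by Theorem 3.3:

```python
import numpy as np

def sgd_on_surrogate(draw_batch, d, T, N, eta, M):
    """Sketch of Algorithm 1: mini-batch SGD on the convex surrogate, sigma = ReLU.
    draw_batch(N) should return (X, y) with X of shape (N, d)."""
    w = np.zeros(d)                                  # w^(0) = 0
    for _ in range(T):
        X, y = draw_batch(N)
        y = np.sign(y) * np.minimum(np.abs(y), M)    # truncate labels (Lemma 3.4)
        g = (np.maximum(X @ w, 0.0) - y) @ X / N     # empirical surrogate gradient
        w = w - eta * g                              # SGD step
    return w

# Illustrative use: ReLU labels with a few adversarial corruptions, Gaussian marginal.
rng = np.random.default_rng(1)
d = 10
w_star = np.ones(d) / np.sqrt(d)

def draw_batch(N):
    X = rng.standard_normal((N, d))
    y = np.maximum(X @ w_star, 0.0)
    y[: N // 50] = 10.0                              # corrupt 2% of the labels
    return X, y

w_hat = sgd_on_surrogate(draw_batch, d, T=200, N=2000, eta=0.05, M=5.0)
print(np.linalg.norm(w_hat - w_star))                # small, of order sqrt(OPT)
```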
4. Conclusion

We provided an efficient constant-factor approximate learner for the problem of agnostically learning a single neuron over structured classes of distributions. Notably, our algorithmic result applies under much milder distributional assumptions as compared to prior work. Our results are obtained by leveraging a sharpness property (a local error bound) from optimization theory that we prove holds for the considered problems. This property is crucial both to establishing a constant-factor approximation and to obtaining improved sample complexity and runtime. An interesting direction for future work is to explore whether sharpness can be leveraged to obtain positive results for other related learning problems.

5. Acknowledgements

PW was supported in part by NSF Award CCF-2007757. NZ was supported in part by NSF Award DMS-2023239. ID was supported by NSF Medium Award CCF-2107079, NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant. JD was supported by NSF Award CCF-2007757 and by the Office of Naval Research under contract number N00014-22-1-2348. JD thanks Alexandre d'Aspremont and Jérôme Bolte for a useful discussion on local error bounds.

References

Auer, P., Herbster, M., and Warmuth, M. K. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems 8, NIPS, pp. 316-322. MIT Press, 1995.

Awasthi, P., Tang, A., and Vijayaraghavan, A. Agnostic learning of general ReLU activation using gradient descent. CoRR, abs/2208.02711, 2022.

Bartlett, P., Jordan, M., and McAuliffe, J. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

Bolte, J., Nguyen, T. P., Peypouquet, J., and Suter, B. W. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471-507, 2017.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

Chen, X. and Fukushima, M. Expected residual minimization method for stochastic linear complementarity problems. Mathematics of Operations Research, 30(4):1022-1038, 2005.

De, A., Diakonikolas, I., Feldman, V., and Servedio, R. A. Nearly optimal solutions for the Chow parameters problem and low-weight approximation of halfspaces. J. ACM, 61(2):11:1-11:36, 2014.

De, A., Diakonikolas, I., and Servedio, R. A. The inverse Shapley value problem. Games Econ. Behav., 105:122-147, 2017.

Diakonikolas, I., Goel, S., Karmalkar, S., Klivans, A. R., and Soltanolkotabi, M. Approximation schemes for ReLU regression. In Conference on Learning Theory, COLT, volume 125 of Proceedings of Machine Learning Research, pp. 1452-1485. PMLR, 2020a.

Diakonikolas, I., Kane, D. M., and Zarifis, N. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems, NeurIPS, 2020b.

Diakonikolas, I., Kane, D. M., Pittas, T., and Zarifis, N. The optimality of polynomial regression for agnostic learning under Gaussian marginals in the SQ model. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.

Diakonikolas, I., Kane, D., Manurangsi, P., and Ren, L. Hardness of learning a single neuron with adversarial label noise. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), 2022a.

Diakonikolas, I., Kontonis, V., Tzamos, C., and Zarifis, N. Learning a single neuron with adversarial label noise via gradient descent. In Conference on Learning Theory (COLT), pp. 4313-4361, 2022b.
Diakonikolas, I., Pavlou, C., Peebles, J., and Stewart, A. Efficient approximation algorithms for the inverse semivalue problem. In 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2022, pp. 354-362. International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2022c.

Facchinei, F. and Pang, J.-S. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, 2003.

Frei, S., Cao, Y., and Gu, Q. Agnostic learning of a single neuron with gradient descent. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

Goel, S., Karmalkar, S., and Klivans, A. R. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 2019.

Goel, S., Gollakota, A., and Klivans, A. R. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Hoffman, A. J. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49:263-265, 1952.

Kakade, S., Kanade, V., Shamir, O., and Kalai, A. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.

Kalai, A. T. and Sastry, R. The Isotron algorithm: High-dimensional isotonic regression. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795-811, 2016.

Karmakar, S. and Mukherjee, A. Provable training of a ReLU gate with an iterative non-gradient algorithm. Neural Networks, 151:264-275, 2022.

Kearns, M., Schapire, R., and Sellie, L. Toward efficient agnostic learning. Machine Learning, 17(2/3):115-141, 1994.

Liu, J., Cui, Y., and Pang, J.-S. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Mathematics of Operations Research, 2022.

Liu, M., Zhang, X., Zhang, L., Jin, R., and Yang, T. Fast rates of ERM and stochastic approximation: Adaptive to error bound conditions. Advances in Neural Information Processing Systems, 31, 2018.

Łojasiewicz, S. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87-89, 1963.

Łojasiewicz, S. Sur la géométrie semi- et sous-analytique. In Annales de l'institut Fourier, volume 43, pp. 1575-1595, 1993.

Manurangsi, P. and Reichman, D. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207, 2018.

Pang, J.-S. Error bounds in mathematical programming. Mathematical Programming, 79(1):299-332, 1997.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Roulet, V. and d'Aspremont, A. Sharpness, restart and acceleration. Advances in Neural Information Processing Systems, 30, 2017.

Šíma, J. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709-2728, 2002.
Soltanolkotabi, M. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pp. 2007-2017, 2017.

Yehudai, G. and Shamir, O. Learning a single neuron with gradient methods. In Conference on Learning Theory, COLT, 2020.

Supplementary Material

Organization. The supplementary material is organized as follows: In Appendix A, we provide some remarks on the sharpness property we have been using throughout the paper. In Appendix B, we provide additional detailed comparison with prior work. In Appendix C and Appendix D, we present the full contents of Section 2 and Section 3, respectively, providing supplementary lemmas and completing the omitted proofs from the main body. Appendix E shows that there are many natural distributions satisfying Assumption 2.3 and Assumption 2.4. Finally, in Appendix F, we show that our results extend to certain non-monotonic activations, including GeLU (Hendrycks & Gimpel, 2016) and Swish (Ramachandran et al., 2017).

Additional Notation. Some additional notation we use here is listed below. Given a distribution $D$ on $\mathbb{R}^d \times \mathbb{R}$, we use $\{(\mathbf{x}^{(j)}, y^{(j)})\}_{j=1}^N$ to denote $N$ i.i.d. samples from $D$. We slightly abuse notation and denote by $\mathbf{e}_i$ the $i$-th standard basis vector in $\mathbb{R}^d$. The notation $[\cdot]_+$ is used for the positive part of the argument, i.e., $[\cdot]_+ = \max\{\cdot, 0\}$. For a vector $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, $[\cdot]_+$ is applied element-wise: $[\mathbf{x}]_+ := ([\mathbf{x}_1]_+, \ldots, [\mathbf{x}_n]_+)$. For nonnegative expressions $E, F$, we write $E \lesssim F$ to denote $E \le C\,F$, where $C > 0$ is a sufficiently large universal constant (independent of the parameters of $E$ and $F$). The notation $\gtrsim$ is defined similarly.

A. Remarks about Sharpness

We recall the formal definition of sharpness, already mentioned in the introduction.

Definition A.1 (Sharpness). Given a function $f : \mathcal{C} \to \mathbb{R}$ where $\mathcal{C} \subseteq \mathbb{R}^d$, suppose the set of its minimizers $\mathcal{Z}^* = \mathrm{argmin}_{\mathbf{z}\in\mathcal{C}} f(\mathbf{z})$ is closed and nonempty. Let $f^* = \min_{\mathbf{z}\in\mathcal{C}} f(\mathbf{z})$. We say that $f$ is $\mu$-sharp, for some $\mu > 0$, if the following inequality holds:
$$f(\mathbf{z}) - f^* \ge \frac{\mu}{2}\,\mathrm{dist}(\mathbf{z}, \mathcal{Z}^*)^2, \quad \forall \mathbf{z} \in \mathbb{R}^d,$$
where $\mathrm{dist}(\mathbf{z}, \mathcal{Z}^*) = \min_{\mathbf{z}'\in\mathcal{Z}^*} \|\mathbf{z} - \mathbf{z}'\|_2$.

Remark A.2. We will slightly abuse the name sharpness to refer to sharpness-like properties. For example, if a function satisfies
$$\nabla f(\mathbf{z})\cdot(\mathbf{z} - \mathbf{z}^*) \ge \mu\,\|\mathbf{z} - \mathbf{z}^*\|_2^2 \tag{11}$$
for some $\mathbf{z}^* \in \mathcal{Z}^*$, then we say $f$ is $\mu$-sharp. This is due to the fact that when $f$ is a convex function, it holds that $f(\mathbf{z}) - f^* \le \nabla f(\mathbf{z})\cdot(\mathbf{z} - \mathbf{z}^*)$; hence Definition A.1 implies Equation (11). Thus, Equation (11) can be viewed as a milder property of sharpness.

Compared to strong convexity, sharpness is a milder condition. Indeed, for any $\mu$-strongly-convex function $f$, if $\mathbf{z}^* \in \mathrm{argmin}_{\mathbf{z}\in\mathbb{R}^d} f(\mathbf{z})$, then
$$f(\mathbf{z}) - f^* \ge \nabla f(\mathbf{z}^*)\cdot(\mathbf{z} - \mathbf{z}^*) + \frac{\mu}{2}\,\|\mathbf{z} - \mathbf{z}^*\|_2^2 \ge \frac{\mu}{2}\,\|\mathbf{z} - \mathbf{z}^*\|_2^2;$$
therefore, $f$ is $\mu$-sharp. However, the opposite does not hold in general. For example, consider $f : \mathbb{R} \to \mathbb{R}$ defined by $f(z) = z^2$ if $z \ge 0$ and $f(z) = 0$ otherwise, whose set of minimizers on $\mathbb{R}$ is $\mathcal{Z}^* = (-\infty, 0]$ and $f^* = 0$. Thus, if $z \ge 0$, then $f(z) - f^* = \mathrm{dist}(z, \mathcal{Z}^*)^2 = z^2$, and if $z < 0$, we have $f(z) - f^* = 0 = \mathrm{dist}(z, \mathcal{Z}^*)^2$. Therefore, $f$ is 2-sharp, but it is not strongly convex.
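A short numerical check (ours) of this example confirms both properties:

```python
import numpy as np

# f(z) = z^2 for z >= 0, 0 otherwise; minimizer set Z* = (-inf, 0], f* = 0.
f = lambda z: np.maximum(z, 0.0) ** 2
dist = lambda z: np.maximum(z, 0.0)            # dist(z, Z*)

z = np.linspace(-5, 5, 100001)
# 2-sharpness: f(z) - f* >= (2/2) * dist(z, Z*)^2 (here with equality).
assert np.all(f(z) - 0.0 >= dist(z) ** 2 - 1e-12)

# Not strongly convex: f is identically 0 on z < 0, so the strong-convexity
# lower bound f(z1) - f(z0) >= f'(z0)(z1 - z0) + (mu/2)(z1 - z0)^2 fails
# for any mu > 0, e.g., at z0 = -1 (where f'(z0) = 0) and z1 = -3.
mu, z0, z1 = 1e-6, -1.0, -3.0
assert not (f(z1) - f(z0) >= 0.0 * (z1 - z0) + (mu / 2) * (z1 - z0) ** 2)
```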
B. Additional Comparison to Prior Work

Here we summarize and provide additional technical comparison to prior work that did not appear in the main body, due to space limitations.

Comparison with Frei et al. (2020). The work of Frei et al. (2020) studies the problem of learning ReLU (and other nonlinear) activations and shows that gradient descent on the $L_2^2$ loss converges to a point achieving error $K \cdot \mathrm{OPT}$. The parameter $K$ depends on the maximum norm of the points $\mathbf{x}$ and can depend on the dimension $d$. Specifically, even for the basic case where the marginal distribution on examples is the standard normal distribution, the parameter $K$ scales (polynomially) with $d$. That is, Frei et al. (2020) does not provide constant-factor approximate learners in this setting.

Comparison with Diakonikolas et al. (2020a). The work of Diakonikolas et al. (2020a) studies the problem of learning ReLU activations using the same surrogate loss we consider in this work. Our work differs from Diakonikolas et al. (2020a) in two key aspects: the first concerns the generality and strength of the results; the second concerns the techniques.

In terms of the results themselves, the algorithm given in Diakonikolas et al. (2020a) is restricted to the case of ReLUs (while we handle a broader family of activations). More importantly, the distributional assumptions of Diakonikolas et al. (2020a) are much stricter than ours, focusing on logconcave distributions, whereas we handle broader classes of distributions, including heavy-tailed and discrete distributions not covered by any prior work (see also Appendix E). Informally, what allows us to handle broader classes of distributions is our focus on proving the sharpness property (as opposed to strong convexity), which is a much milder property. Further, we show that it suffices for this property to hold only in a small region (a ball of radius $2\|\mathbf{w}^*\|_2$) and for the (impossible to evaluate) noise-free surrogate loss. Another remark is that Diakonikolas et al. (2020a) assume that the (corrupted) labels are bounded, not fully capturing the agnostic setting. By contrast, our analysis can handle unbounded labels, i.e., we do not make further assumptions about the noise. Finally, even if we restrict our focus to the class of logconcave distributions, our algorithm has sample complexity scaling with $\mathrm{polylog}(1/\epsilon)$, as opposed to $1/\epsilon^2$ in Diakonikolas et al. (2020a).

The second and more important difference lies in the techniques used in each work. Diakonikolas et al. (2020a) optimizes the surrogate loss directly and shows that finding a point with a small gradient of the surrogate loss leads to small $L_2^2$ error. More specifically, the requirement in Diakonikolas et al. (2020a) is that the gradient is sufficiently small so that the optimality gap of the surrogate loss is of order $\epsilon$. This statement is similar to the result we show in Theorem 3.1. Crucially, while we utilize the gradients of the surrogate loss in the algorithm and in the analysis, we never impose a requirement that the optimality gap of the surrogate loss is of order $\epsilon$. Instead, we show that as long as the gradient is larger than order $\sqrt{\mathrm{OPT} + \epsilon}$, sharpness holds and a linear convergence rate applies. On the other hand, when the gradient is of order $\sqrt{\mathrm{OPT} + \epsilon}$ or smaller, we argue that the candidate solution that the algorithm maintains is already an $(O(\mathrm{OPT}) + \epsilon)$-approximate solution in terms of the $L_2^2$ error. This approach further enables us to be agnostic to the value of OPT. Notably, if $\epsilon \ll \mathrm{OPT}$ and we were to require that the algorithm find a solution with either a gradient of order $\epsilon$ or an optimality gap of $\epsilon$, we would need to optimize the surrogate loss within a region where sharpness does not necessarily hold. Without sharpness, only sublinear rates of convergence apply, and the number of iterations increases to order $\frac{1}{\epsilon}$.
Thus, leveraging the structural properties that we prove in this work is crucial to obtaining the exponential improvements in sample and computational complexities. Finally, Diakonikolas et al. (2020a) requires the surrogate loss to be strongly convex to connect the small-gradient condition with small $L_2^2$ error. This makes the argument rather straightforward, compared with what is used in our work. For the sake of discussion, assume that $L_{\mathrm{sur}}^{D,\sigma}$ is 1-strongly convex and the distribution is isotropic. Furthermore, denote by $\mathbf{w}^*$ the minimizer of the $L_2^2$ loss and by $\bar{\mathbf{w}}$ the minimizer of $L_{\mathrm{sur}}^{D,\sigma}$. The property that $L_{\mathrm{sur}}^{D,\sigma}$ is strongly convex implies that $\|\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*) - \nabla L_{\mathrm{sur}}^{D,\sigma}(\bar{\mathbf{w}})\|_2^2 \ge \|\mathbf{w}^* - \bar{\mathbf{w}}\|_2^2$; furthermore, it can be shown that $\|\nabla L_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}^*)\|_2^2 \lesssim L_2^{D,\sigma}(\mathbf{w}^*)$. Therefore, because $\nabla L_{\mathrm{sur}}^{D,\sigma}(\bar{\mathbf{w}}) = \mathbf{0}$, it immediately follows that $L_2^{D,\sigma}(\bar{\mathbf{w}}) \lesssim L_2^{D,\sigma}(\mathbf{w}^*)$. Our work leverages a much weaker property than strong convexity (sharpness, as summarized in Proposition 3.2). This weaker property turns out to be sufficient to ensure that the noise cannot make the gradient field guide us far away from the optimal solution.

Comparison with Diakonikolas et al. (2022b). The work of Diakonikolas et al. (2022b) studies the problem of ReLU (and other unbounded-activation) regression with agnostic noise. They show that for a class of well-behaved distributions (see Definition E.1), gradient descent on the $L_2^2$ loss converges to a point achieving $O(\mathrm{OPT}) + \epsilon$ error. Moreover, the sample and computational complexities of their algorithm are similar to those achieved in our work (for the class of well-behaved distributions). On the other hand, the distributional assumptions used in Diakonikolas et al. (2022b) are quite strong. Specifically, the well-behaved assumption requires that the marginal distribution have sub-exponential concentration and anti-anti-concentration in every lower-dimensional subspace; that is, the probability density function is lower bounded by a positive constant at every point. The latter assumption does not allow for several discrete distributions, like discrete Gaussians or the uniform distribution on the cube, which are handled in our work. Moreover, our work can additionally handle distributions with much weaker concentration properties.

Comparison with Karmakar & Mukherjee (2022). In a weaker noise model, the work of Karmakar & Mukherjee (2022) considered a similar-looking though crucially different condition for robust ReLU regression, namely that
$$\lambda_{\min}\left(\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[\mathbf{x}\mathbf{x}^\top\mathbb{1}\{\mathbf{w}^*\cdot\mathbf{x} \ge 2\theta_*\}\right]\right) = \lambda_1 > 0, \tag{12}$$
where $\theta_*$ is the largest possible absolute value of the noise; in other words, $\theta_* = \sup_{(\mathbf{x},y)\sim D} |y - \sigma(\mathbf{w}^*\cdot\mathbf{x})|$. It is worth noting that Equation (12) cannot be easily satisfied, as the noise in the agnostic model is not bounded. But even if the noise were bounded, this condition would give slack only for a small number of distributions. For instance, for the uniform distribution on the hypercube, if $\theta_* > 1/2$, then the minimum eigenvalue is zero. Furthermore, the algorithm in that work converges to a point that achieves $O(\theta_*)$ error, instead of $O(\mathrm{OPT})$ error. In contrast, we make no assumptions about the boundedness of the noise, and we obtain near-optimal error in more general settings.

Additional Related Work. As mentioned in the introduction, the convex surrogate we leverage was first defined in (Auer et al., 1995) and then implicitly used in (Kalai & Sastry, 2009; Kakade et al., 2011) for learning GLMs.
In addition to these and the aforementioned works, it is worth mentioning that the same convex surrogate has been useful in the context of learning linear separators from limited information (De et al., 2014) and in related game-theoretic settings (De et al., 2017; Diakonikolas et al., 2022c).

C. Full Version of Section 2

Discussion of the Parameters in Assumptions 2.2 to 2.4. If an activation $\sigma$ is $(\alpha', \beta')$-unbounded, then it is also $(\alpha, \beta)$-unbounded for $\alpha \ge \alpha'$ and $\beta \le \beta'$. This justifies the convention $\alpha \ge 1$ and $\beta \le 1$ in Assumption 2.2. If $\sigma(0) \ne 0$, we can generate new labels $y'$ by subtracting $\sigma(0)$ from $y$ and consider the activation $\sigma_0(t) = \sigma(t) - \sigma(0)$. Similar reasoning justifies the conventions $\lambda, \gamma \in (0, 1]$ in Assumption 2.3 and $B \ge 1$, $\rho \le 1$ in Assumption 2.4.

C.1. Proof of Lemma 2.5

For convenience, we restate the lemma, followed by its detailed proof.

Lemma C.1. Suppose that Assumptions 2.2–2.4 hold. Then the noise-free surrogate loss $\bar{L}_{\mathrm{sur}}^{D,\sigma}$ is $\Omega(\lambda^2\gamma\beta\rho/B)$-sharp in the ball $\mathbb{B}(2\|\mathbf{w}^*\|_2)$, i.e., for all $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$ we have
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \gtrsim (\lambda^2\gamma\beta\rho/B)\,\|\mathbf{w} - \mathbf{w}^*\|_2^2.$$

Proof. By definition, we can write $\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w}) = \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x}))\mathbf{x}]$. Therefore, the inner product $\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*)$ can be written as
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) = \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x}))(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})\right] = \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x})||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\right] \ge \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\sigma(\mathbf{w}\cdot\mathbf{x}) - \sigma(\mathbf{w}^*\cdot\mathbf{x})||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}\{\mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right],$$
where the second equality is due to the non-decreasing property of $\sigma$, and the inequality is due to the fact that every term inside the expectation is nonnegative. Since $\sigma$ is $(\alpha, \beta)$-unbounded, we have that $\sigma'(t) \ge \beta$ for all $t \in [0, \infty)$. By the mean value theorem, for $t_2 \ge t_1 \ge 0$, we have $\sigma(t_1) - \sigma(t_2) = \sigma'(\xi)(t_1 - t_2)$ for some $\xi \in (t_1, t_2)$. Thus, we obtain that $|\sigma(t_1) - \sigma(t_2)| \ge \beta|t_1 - t_2|$. Additionally, if $t_1 \ge 0$ and $t_2 \le 0$, then $|\sigma(t_1) - \sigma(t_2)| = |\sigma(t_1) - \sigma(0)| + |\sigma(0) - \sigma(t_2)| \ge |\sigma(t_1) - \sigma(0)| \ge \beta t_1$. Therefore, by combining the above, we get
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge \beta\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right] + \beta\,\underbrace{\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} \le 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right]}_{(Q)}. \tag{13}$$
Denote $E_0 = \{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} \le 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}$. We show that the term $(Q)$ can be bounded below by a quantity proportional to $\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}_{E_0}(\mathbf{x})]$. To this end, we establish the following claim.

Claim C.2. For $r_0 \ge 1$, define the event $E_1 = E_1(r_0) = \{\mathbf{x} : -2r_0\|\mathbf{w}^*\|_2 < \mathbf{w}\cdot\mathbf{x} \le 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}$. It holds that $(Q) \ge (\gamma/(3r_0))\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}_{E_1}(\mathbf{x})]$.

Proof of Claim C.2. Since $E_1 \subseteq E_0$, it holds that $(Q) \ge \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}_{E_1}(\mathbf{x})]$. Restricting $\mathbf{x}$ to the event $E_1$, it holds that $|\mathbf{w}\cdot\mathbf{x}| \le 2(r_0/\gamma)|\mathbf{w}^*\cdot\mathbf{x}|$. Therefore, we get $|\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}| = |\mathbf{w}\cdot\mathbf{x}| + |\mathbf{w}^*\cdot\mathbf{x}| \le (1 + 2r_0/\gamma)|\mathbf{w}^*\cdot\mathbf{x}|$. By Assumption 2.3, we have $\gamma \in (0, 1]$; therefore we get that $|\mathbf{w}^*\cdot\mathbf{x}| \ge \frac{\gamma}{\gamma + 2r_0}|\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}| \ge \frac{\gamma}{3r_0}|\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|$, since $r_0 \ge 1$. Taking the expectation of $|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|$ with $\mathbf{x}$ restricted to the event $E_1$, we obtain
$$(Q) \ge \mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[|\mathbf{w}^*\cdot\mathbf{x}||\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x}|\mathbb{1}_{E_1}(\mathbf{x})\right] \ge \frac{\gamma}{3r_0}\,\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}_{E_1}(\mathbf{x})\right],$$
as desired. ∎

Combining Equation (13) and Claim C.2, we get that
$$\nabla \bar{L}_{\mathrm{sur}}^{D,\sigma}(\mathbf{w})\cdot(\mathbf{w} - \mathbf{w}^*) \ge \beta\,\mathbf{E}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right] + \frac{\beta\gamma}{3r_0}\,\mathbf{E}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{-2r_0\|\mathbf{w}^*\|_2 < \mathbf{w}\cdot\mathbf{x} \le 0, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right] \ge \frac{\beta\gamma}{3r_0}\,\mathbf{E}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > -2r_0\|\mathbf{w}^*\|_2, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right], \tag{14}$$
where in the last inequality we used the fact that $1 \ge \gamma/(3r_0)$ (since $\gamma \in (0, 1]$ and $r_0 \ge 1$). To complete the proof, we need to show that, for an appropriate choice of $r_0$, the probability of the event $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} > -2r_0\|\mathbf{w}^*\|_2, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}$ is close to the probability of the event $\{\mathbf{x} : \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}$. Given such a statement, the lemma follows from Assumption 2.3. Formally, we show the following claim.

Claim C.3. Let $r_0 \ge 1$ be such that $h(r_0) \le \lambda^2\rho/(20B)$. Then, for all $\mathbf{w} \in \mathbb{B}(2\|\mathbf{w}^*\|_2)$, we have that
$$\mathbf{E}_{\mathbf{x}\sim D_{\mathbf{x}}}\left[(\mathbf{w}\cdot\mathbf{x} - \mathbf{w}^*\cdot\mathbf{x})^2\mathbb{1}\{\mathbf{w}\cdot\mathbf{x} > -2r_0\|\mathbf{w}^*\|_2, \mathbf{w}^*\cdot\mathbf{x} \ge \gamma\|\mathbf{w}^*\|_2\}\right] \ge \frac{\lambda}{2}\,\|\mathbf{w} - \mathbf{w}^*\|_2^2.$$
Since h(r) B/r4+ρ and h(r) is decreasing, such an r0 exists and we can always make r0 1. Proof of Claim C.3. By Assumption 2.3, we have that Ex Dx (w x)21{w x γ w 2} λ w 2 2. Let E2 = {w x 2r0 w 2, w x γ w 2}. We have that (w x w x)21{w x > 2r0 w 2, w x > γ w } (w x w x)21{w x γ w 2} E x Dx (w x w x)21E2(x) λ w w 2 2 E x Dx (w x w x)21E2(x) . By the Cauchy-Schwarz inequality, we get (w x w x)21E2(x) E x Dx (w x w x)21{w x 2r0 w 2} w w 2 2 max u B(1) E x Dx [(u x)4] p Pr [w x 2r0 w 2}]] . Since w B(2 w 2), it holds that w/(2 w 2) B(1). Thus, from the concentration properties of Dx, it follows that Pr [w x 2r0 w 2}] h(r0). It remains to bound maxu B(1) Ex Dx (u x)4 . It is not hard to see that for distributions satisfying the concentration property of Assumption 2.4, maxu B(1) Ex Dx (u x)4 as well as maxu B(1) Ex Dx (u x)2 are at most 5B/ρ. The proof of the following simple fact can be found in Appendix C.2. Robustly Learning a Single Neuron via Sharpness Fact C.4. Let Dx be a distribution satisfying Assumption 2.4. Then maxu B(1) Ex Dx (u x)i 5B/ρ for i = 2, 4. Although only the bound on the 4th order moment is needed here, the upper bound on maxu B(1) Ex Dx (u x)2 will also be used in later sections. Therefor, by our choice of r0, we have h(r0) λ2ρ 20B , hence maxu B(1) Ex Dx (u x)4 h(r0) λ2/4. Therefore, (w x w x)21E2(x) (λ/2) w w 2 2 , completing the proof of Claim C.3. Combining Equation (14) and Claim C.3, we get: LD,σ sur (w) (w w ) γλβ r0 w w 2 2. To complete the proof, it remains to choose r0 appropriately. By Claim C.3, we need to select r0 to be sufficiently large so that h(r0) λ2ρ/(20B). By Assumption 2.4, we have that h(r) B/r4+ρ. Thus, we can choose r0 = 5B/(λρ), which is at least 1 by our assumptions. This completes the proof of the lemma. C.2. Proof of Fact C.4 We restate and prove the following fact. Fact C.5. Let Dx be a distribution satisfying Assumption 2.4. Then maxu B(1) Ex Dx (u x)i 5B/ρ for i = 2, 4. Proof. Let i = 2 or 4. By Assumption 2.4, for any unit vector u, we have 0 Pr (u x)i t dt 0 isi 1Pr [|u x| s] ds 0 isi 1 min{1, h(s)} ds. By Assumption 2.4 we have h(s) B/s4+ρ for some 1 ρ > 0 and B 1, thus it further holds 0 isi 1 ds + Z 1 isi 1h(s) ds 1 + B Z 1 isi 1 1 s4+ρ ds 5B D. Full Version of Section 3 D.1. The Landscape of Surrogate Loss Theorem D.1 (Landscape of Surrogate Loss). Let µ (0, 1] and α, κ 1. Let D be a distribution supported on Rd R and let σ : R 7 R be an (α, β)-unbounded activation for some β > 0. Furthermore, assume that the maximum eigenvalue of the matrix Ex Dx[xx ] is κ. Further, fix w W and suppose LD,σ sur is µ-sharp with respect to w in a subset S1 Rd. Let S2 = {w : LD,σ 2 (w) (4ακ/ µ)2LD,σ 2 (w )}. Then for any w S1 S2, we have LD,σ sur (w) 2 α κ w w 2 + q κLD,σ 2 (w ) , and LD,σ sur (w) 2 µ 4α κ LD,σ 2 (w) . Robustly Learning a Single Neuron via Sharpness If we can assume that the set S1 of Theorem D.1 is convex and that there is no local minima in the boundary of S1, then by running any convex-optimization algorithm in the feasible set S1, we guarantee that we converge either to a local minimum which has zero gradient or to a point inside the set (S2)c where the true loss is sufficiently small. The next corollary shows that this is indeed the case for a distribution that satisfies Assumptions 2.3 and 2.4. Corollary D.2. Let D be a distribution supported on Rd R and let σ : R 7 R be an (α, β)-unbounded activation. Fix w W and suppose that Dx satisfies Assumptions 2.3 and 2.4 with respect to w . 
Furthermore, let C > 0 be a sufficiently small absolute constant and let µ = Cλ2γβρ/B. Then, for any ϵ > 0 and ˆw B(2 w 2), so that LD,σ sur ( ˆw) infw B(2 w 2) LD,σ sur (w) ϵ, it holds LD,σ 2 ( ˆw) O((αB/(ρ µ))2)(LD,σ 2 (w ) + αϵ) . Proof of Corollary D.2. Denote K as the set of ˆw such that ˆw B(2 w 2) and LD,σ sur ( ˆw) infw B(2 w 2) LD,σ sur (w) ϵ. First, note that as claimed in Fact C.4, Ex Dx[xx ] (5B/ρ)I for any unit vector u when Assumption 2.4 holds. Next, observe that the set of minimizers of the loss LD,σ sur inside the ball B(2 w 2) is convex. Furthermore, the set B(2 w 2) is compact. Thus, for any point w B(2 w 2) that minimizes LD,σ sur it will either hold that LD,σ sur (w ) 2 = 0 or w B(2 w 2). Let W sur be the set of minimizers of LD,σ sur . We first show that if there exists a minimizer w W sur such that w B(2 w 2), then any point w inside the set B(2 w 2) gets error proportional to LD,σ 2 (w ). Observe for such point ˆw, by the necessary condition of optimality, it should hold LD,σ sur (w ) (w w) 0 , (15) for any w B(2 w 2). Using Corollary D.4, we get that either LD,σ sur (w ) (w w ) ( µ/2) w w 2 2 or w {w : w w 2 2 (20B/( µ2ρ))LD,σ 2 (w )}. But Equation (15) contradicts with LD,σ sur (w ) (w w ) ( µ/2) w w 2 2 > 0 since w B(2 w 2), w 2 = 2 w 2 hence w = w . So it must be the case that w {w : w w 2 2 (20B/( µ2ρ))LD,σ 2 (w )}. Again, we have that w B(2 w 2), therefore w w 2 w 2. Hence, (20B/( µ2ρ))LD,σ 2 (w ) w 2 2 (1/9) w w 2 2 for any w B(2 w 2). Therefore, for any w B(2 w 2), we have LD,σ 2 (w) = E (x,y) D (σ(w x) y)2 2LD,σ 2 (w ) + 2 E x Dx (σ(w x) σ(w x))2 2LD,σ 2 (w ) + 10Bα2/ρ w w 2 2 (16) O(B2α2/( µ2ρ2))LD,σ 2 (w ) , where in the second inequality we used the fact that Ex Dx xx (5B/ρ)I and σ is α-Lipschitz. Since the inequality above holds for any w B(2 w 2), it will also be true for ˆw K B(2 w 2). It remains to consider the case where the minimizers W sur are strictly inside the B(2 w 2). Note that LD,σ sur (w) is α-smooth. Therefore, we get that for any ˆw K, it holds LD,σ sur ( ˆw) 2 2 2αϵ. By applying Corollary D.4, we get that either ˆw w 2 2 (20B/( µ2ρ))LD,σ 2 (w ) or that 2αϵ ( µ/2) ˆw w 2. Therefore we get that, ˆw w 2 2 (20B/( µ2ρ))(LD,σ 2 (w ) + αϵ). Then the result follows from Equation (16). To prove Theorem D.1, we need the following proposition which shows that if the current vector w is sufficiently far away from the true vector w , then the gradient of the surrogate loss has a large component in the direction of w w , in other words, the surrogate loss is sharp. Proposition D.3. Let D be a distribution supported on Rd R and let σ : R 7 R be an (α, β)-unbounded activation. Furthermore, assume that the maximum eigenvalue of the matrix Ex Dx[xx ] is κ > 0. Fix w W and suppose LD,σ sur is µ-sharp for some µ > 0 with respect to w in a nonempty subset S1 Rd. Further, let S2 = {w : w w 2 2 4(κ/ µ2)LD,σ 2 (w )}. Then for any w S1 S2, we have LD,σ sur (w) (w w ) ( µ/2) w w 2 2 . Robustly Learning a Single Neuron via Sharpness Proof of Proposition D.3. We show that LD,σ sur (w) (w w ) is bounded sufficiently far away from zero. We decompose the gradient into the noise-free part and the noisy, i.e., LD,σ sur (w) = LD,σ sur (w) + LD,σ sur (w ). First, we bound the noisy term in the direction w w , which yields LD,σ sur (w ) (w w ) E(x,y) D[|σ(w x) y||w x w x|] LD,σ 2 (w ) w w 2 κ , where we used the Cauchy-Schwarz inequality and that Ex Dx[xx ] κI. Next, we bound the contribution of LD,σ sur (w) in the direction w w . 
Using the fact that LD,σ sur (w) is µ-sharp for any w S1, it holds that LD,σ sur (w) (w w ) µ w w 2 2 . Combining everything together we have that LD,σ sur (w) (w w ) w w 2 ( κ/ µ) q LD,σ 2 (w ) . The proof is completed by taking any w S1 S2, where w w 2 (2 κ/ µ) q LD,σ 2 (w ), and therefore LD,σ sur (w) (w w ) ( µ/2) w w 2 2 . Corollary D.4. Let D be a distribution supported on Rd R and let σ : R 7 R be an (α, β)-unbounded activation. Suppose that Dx satisfies Assumptions 2.3 and 2.4 and let C > 0 be a sufficiently small absolute constant and let µ = Cλ2γβρ/B. Fix w W and let S = B(2 w 2) {w : w w 2 2 20B µ2ρ LD,σ 2 (w )}. Then, the surrogate loss LD,σ sur is µ-sharp in S, i.e., LD,σ sur (w) (w w ) ( µ/2) w w 2 2, w S. Proof of Corollary D.4. Note that maxu B(1) Ex Dx (u x)2 = κ 5B/ρ as proven in Fact C.4. Then combining Proposition D.3 and Lemma 2.5 we get the desired result. Proof of Theorem D.1. Using Proposition D.3, we get that for any w S S1, where S = {w : w w 2 2 4(κ/ µ2)LD,σ 2 (w )}, we have that LD,σ sur (w) (w w ) ( µ/2) w w 2 2. Note that LD,σ 2 (w) = E (x,y) D (σ(w x) y)2 (σ(w x) σ(w x))2 + 2 E (x,y) D (σ(w x) y)2 2α2κ w w 2 2 + 2LD,σ 2 (w ) 2α2κ w w 2 2 + (1/2)LD,σ 2 (w) , where we used that LD,σ 2 (w) 4LD,σ 2 (w ). Hence, it holds 4α2κ w w 2 2 LD,σ 2 (w). Therefore, when w S2, it holds that w w 2 2 (4ακ/ µ)2LD,σ 2 (w), hence S2 S . Now observe that for any unit vector v Rd, it holds LD,σ sur (w) 2 v LD,σ sur (w). Therefore, for any w S1 S2 S1 S , we have LD,σ sur (w) 2 LD,σ sur (w) w w ( µ/2) w w 2 µ 4α κ LD,σ 2 (w) . Robustly Learning a Single Neuron via Sharpness We now show that the gradient is also bounded from above. By definition, we have LD,σ sur (w) 2 = E (x,y) D [(σ(w x) y)x] 2 E x Dx [(σ(w x) σ(w x))x] 2 + E (x,y) D [(σ(w x) y)x] 2 max u 2 1 E x Dx [|σ(w x) σ(w x)||u x|] + max v 2 1 E (x,y) D [|σ(w x) y||v x|] Applying Cauchy-Schwarz to the inequality above, we further get LD,σ sur (w) 2 max u 2 1 E x Dx [|σ(w x) σ(w x)|2] E x Dx [|u x|2] + max v 2 1 E (x,y) D [|σ(w x) y|2] E x Dx [|v x|2] α κ w w 2 + q κLD,σ 2 (w ) , where in the last inequality we used the fact that σ is α-Lipschitz and that the maximum eigenvalue of Ex Dx[xx ] is κ. D.2. Fast Rates for Surrogate Loss In this section, we proceed to show that when the surrogate loss is sharp, then applying batch Stochastic Gradient Descent (SGD) on the empirical surrogate loss obtains a C-approximate parameter ˆw of the L2 2 loss in linear time. To be specific, consider the following iteration update w(t+1) = argmin w B(W ) 2η w w(t) 2 2 where η is the step size and g(t) is the empirical gradient of the surrogate loss: j=1 (σ(w(t) x(j)) y(j))x(j). (18) The algorithm is summarized in Algorithm 2. Algorithm 2 Stochastic Gradient Descent on Surrogate Loss Input: Iterations: T, sample access from D, batch size N, step size η, bound M. Initialize w(0) 0. for t = 1 to T do Draw N samples {(x(j), y(j))}N j=1 D. For each j [N], y(j) sign(y(j)) min(|y(j)|, M). Calculate j=1 (σ(w(t) x(j)) y(j))x(j). w(t+1) w(t) ηg(t). end for Output: The weight vector w(T ). Further, for simplicity of notation, we use g(t) to denote the empirical gradient of the noise-free surrogate loss: j=1 (σ(w(t) x(j)) σ(w x(j)))x(j). (19) Robustly Learning a Single Neuron via Sharpness In addition, we define the following helper functions H2 and H4. Definition D.5. 
Let Dx be a distribution on supported on Rd that satisfies Assumption 2.4 we define non-negative nonincreasing functions H2 and H4 as follows: H2(r) max u B(1) E x Dx (u x)21{|u x| r} , H4(r) max u B(1) E x Dx (u x)41{|u x| r} . Remark D.6. In particular, when r = 0, H2(0) and H4(0) bounds from above the second and fourth moments. Recall that in Fact C.4, it is proved that H2(0), H4(0) 5B/ρ. Now we state our main theorem. Theorem D.7 (Main Algorithmic Result). Fix ϵ > 0 and W > 0 and suppose Assumptions 2.2 to 2.4 hold. Let µ := µ(λ, γ, β, ρ, B) be a sufficiently small constant multiple of λ2γβρ/B, and let M = αWH 1 2 ϵ 4α2W 2 . Further, choose parameter rϵ large enough so that H4(rϵ) is a sufficiently small constant multiple of ϵ. Then after T = eΘ B2α2 iterations with batch size N = Ω(d T(r2 ϵ + α2M 2)), Algorithm 2 converges to a point w(T ) such that LD,σ 2 (w(T )) = O B2α2 ρ2µ2 OPT + ϵ , with probability at least 2/3. We now provide a brief overview of the proof. As follows from Corollary D.2, when we find a vector ˆw that minimizes the surrogate loss, then this ˆw is itself a C-approximate solution of Problem 1.1. However, minimizing the surrogate loss can be expensive in computational and sample complexity. Corollary D.4 says that we can achieve strong-convexity-like rates as long as we are far away from a minimizer of the L2 2 loss, i.e., when w w 2 2 O(OPT). Roughly speaking, we would like to show that at each iteration t, it holds ||w(t+1) w ||2 2 C||w(t) w ||2 2 where 0 < C < 1 is some constant depending on the parameters α, β, µ, ρ and B. Then since the distance from w(t) to w contracts fast, we are able to get the linear convergence rate of the algorithm. To this end, we prove that under a sufficiently large batch size, the empirical gradient of the surrogate loss g(t) approximates LD,σ sur (w(t)) with a small error. Thus, ||w(t+1) w ||2 2 can be written as ||w(t+1) w ||2 2 = ||w(t) w ||2 2 2η LD,σ sur (w(t)) (w(t) w ) + (error). We then apply the sharpness property of the surrogate (Proposition D.3) to the inner product LD,σ sur (w(t)) (w(t) w ), which as a result leads to ||w(t+1) w ||2 2 (1 2ηµ)||w(t) w ||2 2 + (error). By choosing the parameters η and the batch size N carefully, one can show that ||w(t+1) w ||2 2 (1 C)||w(t) w ||2 2 + C (OPT + ϵ), indicating a fast contraction ||w(t+1) w ||2 2 (1 C/2)||w(t) w ||2 2 whenever C (OPT + ϵ) (C/2)||w(t) w ||2 2. To prove the theorem, we provide some supplementary lemmata. The following lemma states that we can truncate the labels y to y M, where M is a parameter determined by distribution Dx. Lemma D.8. Define y = sign(y) min(|y|, M) for M = αWH 1 2 ( ϵ 4α2W 2 ), then: (σ(w x) y )2 = OPT + ϵ, meaning that we can consider y instead of y and assume |y| M without loss of generality, where H2 was defined in Definition D.5. Robustly Learning a Single Neuron via Sharpness Proof of Lemma D.8. Fix M > 0, and denote P : R R the operator that projects the points of R onto the interval [ M, M], i.e., P(t) = sign(t) min(|t|, M). To prove the aforementioned claim, we split the expectation into two events: the first event is when |w x| (M/α) and the second when the latter is not true. Observe that in the first case, P(σ(w x)) = σ(w x), hence, using the fact that P is non-expansive, we get (σ(w x) P(y))21{|w x| (M/α)} = E (x,y) D (P(σ(w x)) P(y))21{|w x| (M/α)} (σ(w x y)21{|w x| (M/α)} It remains to bound the error in the event that |w x| > (M/α). 
In this event α|w x| |P(y)|, and so we have (σ(w x) P(y))21{|w x| > (M/α)} 4α2 E (x,y) D (w x)21{|w x| > (M/α)} 4α2 w 2 2H2(M/(αW)) ϵ , where in the first inequality we used the standard inequality (a + b)2 2(a2 + b2) and that σ is α-Lipschitz hence |σ(w x)| = |σ(w x) σ(0)| α|w x|. Next, we show that the difference between the empirical gradients and the population gradients of the surrogate loss can be made small by choosing a large batch size N. Specifically, we have: Lemma D.9. Suppose N samples {(x(j), y(j))}N j=1 are drawn from D independently and suppose Assumptions 2.2 to 2.4 hold. Let g be the empirical gradient of LD,σ sur at w and let gt be the empirical gradient of LD,σ sur (wt), i.e., j=1 (σ(w x(j)) y(j))x(j), j=1 (σ(w(t) x(j)) σ(w x(j)))x(j). Moreover, let H4(r) be defined as in Definition D.5. Then for a fixed positive real number rϵ satisfying H4(rϵ) ϵ and rϵ 1, we have the following bounds holds with probability at least 1 δ: g LD,σ sur (w ) 2 d(r2ϵOPT + α2M 2ϵ) and similarly: g(t) LD,σ sur (w(t)) 2 δρN w(t) w 2. (21) Proof of Lemma D.9. The proof follows from a direct application of Markov s inequality and a careful bound on the variance term using the tail-bound assumptions. To be specific, by Markov s Inequality, for any ξ > 0 it holds: Pr g LD,σ sur (w ) 2 ξ = Pr g LD,σ sur (w ) 2 2 ξ2 1 ξ2 E (x,y) D g LD,σ sur (w ) 2 2 Now for the variance term E(x,y) D ||g LD,σ sur (w )||2 2 , recall that each sample x(j) and y(j) are i.i.d., therefore, we Robustly Learning a Single Neuron via Sharpness can bound it in the following way g LD,σ sur (w ) 2 2 = E (x,y) D (σ(w x(j)) y(j))x(j) E (x,y) D (σ(w x(j)) y(j))x(j) N E (x,y) D (σ(w x) y)x E (x,y) D [(σ(w x) y)x] 2 2 N E (x,y) D (σ(w x) y)x 2 2 , (22) where in the second equation we used that for any mean-zero independent random variables zj, we have E[|| P j zj||2 2] = P j E[ zj 2 2], and in the final inequality we used that for any random variable X, it holds E[ X E[X] 2 2] E[ X 2 2]. Next, we show that Ex Dx (σ(w x) y)x 2 2 can be bounded above in terms of H2 and H4. Claim D.10. E(x,y) D (σ(w x) y)x 2 2 d(r2 ϵOPT + α2M 2H4(rϵ)). Proof of Claim D.10. To prove the claim, note that x 2 2 = Pd i=1 |xi|2, therefore by linearity of expectation it holds (σ(w x) y)x 2 2 = i=1 E (x,y) D (σ(w x) y)2x2 i . Thus, the goal is to bound E(x,y) D (σ(w x) y)2x2 i effectively for each entry i. Deploying the intuition that the probability of |xi| = |ei x| being very large is tiny since we have Pr [|ei x| > r] h(r) and h(r) Br (4+ρ) by the Assumption 2.4, we fix some large rϵ and bound the expectation by looking separately at the events that |xi| rϵ and |xi| > rϵ, i.e., (σ(w x) y)2x2 i = E (x,y) D (σ(w x) y)2x2 i 1{|xi| rϵ} + E (x,y) D (σ(w x) y)2x2 i 1{|xi| > rϵ} . (23) Note when conditioned on the event |xi| rϵ the bound follows easily as: (σ(w x) y)2x2 i 1{|xi| rϵ} r2 ϵ E (x,y) D (σ(w x) y)2 = r2 ϵOPT. (24) When considering |xi| > rϵ, notice that σ is α-Lipschitz and that σ(0) = 0 , as well as that we assumed |y| M due to Lemma D.8, therefore, denoting uw = w / w 2, it holds: (σ(w x) y)2x2 i 1{|xi| > rϵ} 2 E (x,y) D ((σ(w x))2 + y2)x2 i 1{|xi| > rϵ} (α2(w x)2 + M 2)x2 i 1{|xi| > rϵ} 2α2 w 2 2 E x Dx (uw x)2x2 i 1{|xi| > rϵ} + 2M 2H2(rϵ) , where in the last inequality we used Definition D.5. 
For the first term above, note that uw is also a unit vector, so by Assumption 2.4 the probability mass of |uw x| > rϵ is also small, thus, we can show that Ex Dx (uw x)2x2 i 1{|xi| > rϵ} is dominated by r2 ϵ Ex Dx x2 i 1{|xi| > rϵ} , which can then be bounded above by H2 and H4. In detail, we split the expectation by conditioning on the events that |uw x| > rϵ and |uw x| rϵ, then noticing that 1{|xi| > rϵ, |uw x| Robustly Learning a Single Neuron via Sharpness rϵ} 1{|xi| rϵ}, we get: (uw x)2x2 i 1{|xi| > rϵ} E x Dx r2 ϵx2 i 1{|xi| > rϵ, |uw x| rϵ} (uw x)2x2 i 1{|xi| > rϵ, |uw x| > rϵ} r2 ϵx2 i 1{|xi| > rϵ} E x Dx [(uw x)41{|uw x| > rϵ}] E x Dx [x4 i 1{|xi| > rϵ}] r2 ϵH2(rϵ) + H4(rϵ), (25) where the second inequality comes from Cauchy-Schwarz and in the last inequality we applied H4(rϵ) Ex Dx (u x)41{|u x| rϵ} for any u B(1) by Definition D.5. Now plugging Equation (25) to the bound we get for E(x,y) D σ(w x) y)2x2 i 1{|xi| rϵ} , we have: (σ(w x) y)2x2 i 1{|xi| rϵ} 2α2 w 2 2(r2 ϵH2(rϵ) + H4(rϵ)) + 2M 2H2(rϵ). Further recall that by definition: H4(r) = max u B(1) E x Dx (u x)41{|u x| r} max u B(1) r2 E x Dx (u x)21{|u x| r} = r2H2(r), hence H4(r) H2(r) when r 1. Then applying these facts along with the fact that w 2 M simplifies the inequality above to the following: E (x,y) D (σ(w x) y)2x2 i 1{|xi| rϵ} α2M 2H4(rϵ). (26) Combining Equation (26) and Equation (24) with Equation (23), we get: (σ(w x) y)2 x 2 2 d(r2 ϵOPT + α2M 2H4(rϵ)), proving the desired claim. Plugging Claim D.10 above back to Equation (22), we immediately get: g LD,σ sur (w ) 2 2 d N (r2 ϵOPT + α2M 2ϵ), given that H4(rϵ) ϵ. Then choosing ξ q d δN (r2ϵOPT + α2M 2ϵ), we get Equation (20): g LD,σ sur (w ) 2 d δN (r2ϵ + α2M 2)OPT For Equation (21), we repeat the steps when proving Equation (20). Using Markov inequality again, we have Pr h g(t) LD,σ sur (w(t)) 2 ζ i = Pr h g(t) LD,σ sur (w(t)) 2 2 ζ2i g(t) LD,σ sur (w(t)) 2 2 The goal is to bound the expectation of the squared norm. Notice that (x(j), y(j)) D are i.i.d. samples, therefore, it holds: h g(t) LD,σ sur (w(t)) 2 2 i = 1 N 2 E x Dx (σ(w(t) x(j)) σ(w x(j)))x(j) E x Dx h (σ(w(t) x(j)) σ(w x(j)))x(j) i (σ(w(t) x) σ(w x))x E x Dx h (σ(w(t) x) σ(w x))x i 2 2 Robustly Learning a Single Neuron via Sharpness because for any i.i.d. zero-mean random variables z(j) it holds E[|| P j z(j)||2 2] = P j E[||z(j)||2 2]. Note that E[||z E[z]||2 2] E[||z||2 2], therefore, we can further bound the variance of gt LD,σ sur (wt) as: h g(t) LD,σ sur (w(t)) 2 2 i 1 h (σ(w(t) x) σ(w x))2 x 2 2 i h (σ(w(t) x) σ(w x))2x2 i i h (w(t) x w x)2x2 i i , (27) where in the last inequality we used |σ(w(t) x) σ(w x)| α|w(t) x w x|, as σ is α-Lipschitz. It remains to bound Ex Dx (w(t) x w x)2x2 i . Denote uw(t) = (w(t) w )/ w(t) w , which is a unit vector. Abstracting wt w 2 from the expectation then applying Cauchy-Schwarz, we get: h (w(t) x w x)2x2 i i = w(t) w 2 2 E x Dx (uw(t) x)2x2 i w(t) w 2 2 r E x Dx [(uw(t) x)4] E x Dx [x4 i ] w(t) w 2 2H4(0), where the last inequality comes from H4(0) = maxu B(1) Ex Dx (u x)4 , which holds by definition. Further from Fact C.4, H4(0) 5B/ρ, thus, we get h (w(t) x w x)2x2 i i 5B ρ w(t) w 2 2. (28) To sum up, plugging Equation (28) back to Equation (27), we have: h g(t) LD,σ sur (w(t)) 2 2 i α2d B ρN w(t) w 2 2. Finally, choosing ζ to be a sufficiently small multiple of q δρN w(t) w 2, Equation (21) follows. Corollary D.11. Suppose N samples {(x(j), y(j))}N j=1 are drawn from D independently and suppose Assumptions 2.2 to 2.4 hold. 
Let g(t) be the empirical gradient of LD,σ sur (w(t)). Moreover, let H4(r) be defined as in Definition D.5. Then for a fixed positive real number rϵ satisfying H4(rϵ) ϵ and rϵ 1, with probability at least 1 δ it holds g(t) LD,σ sur (w(t)) 2 w(t) w 2 + p r2ϵOPT + M 2ϵ . (29) Corollary D.12. Let D be a distribution in Rd R and suppose Assumptions 2.2 to 2.4 hold. Moreover, let H4(r) be defined as in Definition D.5. Fix a positive real number rϵ satisfying H4(rϵ) ϵ and rϵ 1. It holds that Ex Dx h LD,σ sur (w(t)) 2 2 i dα2B w(t) w 2 2 + r2 ϵOPT + M 2ϵ . (30) We further show that the norm of empirical gradients g and g(t) can be bounded with respect to OPT, ϵ and w(t) w 2. Corollary D.13. Suppose the conditions in Lemma D.9 are satisfied. Fix rϵ 1 such that H4(rϵ) is a sufficiently small multiple of ϵ. Then with probability at least 1 δ, we have: d(r2ϵOPT + α2M 2ϵ) Robustly Learning a Single Neuron via Sharpness w(t) w 2. (32) Proof. We first estimate the norm of LD,σ sur (w ) and LD,σ sur (w(t)). For the former, applying the Cauchy-Schwarz inequality, we get: LD,σ sur (w ) 2 = E[(σ(w x) y)x] 2 = max u 2=1 E[(σ(w x) y)u x] E[(σ(w x) y)2] E[(u x)2] where we used that H2(0) 5B/ρ from Fact C.4. In addition, by Lemma D.9, with probability at least 1 δ, we have: g LD,σ sur (w ) 2 d δN (r2ϵOPT + α2M 2ϵ), given that rϵ is chosen large enough so that H4(rϵ) ϵ. Then combining with the bound of LD,σ sur (w ) 2 above, it holds: d(r2ϵOPT + α2M 2ϵ) For the second claim, following the exact same approach and utilizing the fact that σ is α-Lipschitz continuous again, we have: LD,σ sur (w(t)) 2 = E x Dx h (σ(w(t) x) σ(w x))x i 2 = max u 2=1 E x Dx h |σ(w(t) x) σ(w x)|u x i α max u 2=1 E x Dx h |(w(t) w ) x|u x i . Applying Cauchy-Schwarz inequality, we have LD,σ sur (w(t)) 2 α max u 2=1 ((w(t) w ) x)2 E x Dx [(u x)2] 5αB ρ w(t) w 2. Then combining with Equation (21), we get the desired claim: Finally, we can turn to the proof of Theorem D.7. Proof of Theorem D.7. Recall that for a vector ˆw, we have LD,σ 2 ( ˆw) = E (x,y) D (σ( ˆw x) y)2 2LD,σ 2 (w ) + 2 E x Dx (σ( ˆw x) σ(w x))2 2LD,σ 2 (w ) + 10Bα2/ρ ˆw w 2 2, Robustly Learning a Single Neuron via Sharpness where in the last inequality we used the fact that σ is α-Lipschitz and Ex Dx[xx ] (5B/ρ)I according to Fact C.4. Thus when the algorithm generates some ˆw such that ˆw w 2 2 ϵ , it holds LD,σ 2 (w(T )) 2OPT + (10Bα2/ρ)ϵ (33) yielding a C-approximate solution to the Problem 1.1. Therefore, our ultimate goal is to minimize w w 2 efficiently. To this aim, we study the difference of w(t+1) w 2 2 and w(t) w 2 2. We remind the reader that for convenience of notation, we denote the empirical gradients as the following j=1 (σ(w(t) x(j)) y(j))x(j), j=1 (σ(w x(j)) y(j))x(j). Moreover, we denote the noise-free empirical gradient by g(t), i.e., g(t) = g(t) g = 1 j=1 (σ(w(t) x(j)) σ(w x(j)))x(j). Plugging in the iteration scheme w(t+1) = w(t) ηg(t) while expanding the squared norm, we get w(t+1) w 2 2 = w(t) w 2 2 2ηg(t) (w(t) w ) + η2 g(t) 2 2 w(t) w 2 2 2η LD,σ sur (w(t)) (w(t) w ) | {z } Q1 2η(g(t) LD,σ sur (w(t))) (w(t) w ) + η2 g(t) 2 2 | {z } Q2 Observe that we decomposed the right-hand side into two parts, the true contribution of the gradient (Q1) and the estimation error (Q2). Note that in order to utilize the sharpness property of surrogate loss at the point w(t), the conditions w(t) B(2||w ||2) and w(t) {w : w(t) w 2 2 20B/( µ2ρ)OPT} (34) need to be satisfied. 
For the first condition, recall that we initialized w(0) = 0, hence Equation (34) is valid for t = 0. By induction rule, it suffices to show that assuming w(t) B(2||w ||2) holds, we have ||w(t+1) w ||2 (1 C)||w(t) w ||2 for some constant 0 < C < 1. Thus, we assume temporarily Equation (34) is true at iteration t, and we will show in the remainder of the proof that ||w(t+1) w ||2 (1 C)||w(t) w ||2 until we arrived at some final iteration T. Then by induction, the first part of Equation (34) is satisfied at each step t T. For the second condition, note that if it is violated at some iteration T, then w(T ) w 2 O(OPT) implying that this would be the solution we are looking for and the algorithm could be terminated at T. Therefore, whenever w(t) w 2 is far away from OPT, the prerequisites of Proposition D.3 are satisfied and the sharpness property of LD,σ sur is allowed to use. Now for the first term (Q1), using the fact that LD,σ sur (w(t)) is µ(γ, λ, β, ρ, B)-sharp according to Corollary D.4, we immediately get a sufficient decrease at each iteration: ||w(t+1) w ||2 2 (1 C)||w(t) w ||2 2. Namely, denote µ(γ, λ, β, ρ, B) as µ for simplicity, applying Corollary D.4 we have (Q1) = w(t) w 2 2 2η LD,σ sur (w(t)) (w(t) w ) (1 2ηµ) w(t) w 2 2 , where µ = 1/2 µ, and µ = Cλ2γβρ/B for some sufficiently small constant C. Now it suffices to show that (Q2) can be bounded above by C ||w(t) w ||2 2, where C is a parameter depending on η and µ that can be made comparatively small. Formally, we show the following claim. Robustly Learning a Single Neuron via Sharpness Claim D.14. Suppose η 1. Fix rϵ 1 such that H4(rϵ) is a sufficiently small multiple of ϵ. Choosing N to be a sufficiently large constant multiple of d δ (r2 ϵ + α2M 2), then we have with probability at least 1 δ 2ηµ + 8η2α2B2 w(t) w 2 2 + 4η ρ OPT + ϵ . Proof of Claim D.14. Observe that by applying the Arithmetic-Geometric Mean inequality and Cauchy-Schwarz inequality, we get x y (a/2) x 2 2 + (1/2a) y 2 2 for any vector x and y, thus applying this inequality to the inner product (g(t) LD,σ sur (w(t))) (w(t) w ) with coefficient a = µ, we get (Q2) = 2η(g(t) LD,σ sur (w(t))) (w(t) w ) + 2η2 g(t) 2 2 2η(g(t) LD,σ sur (w(t))) (w(t) w ) + 2η2 g(t) 2 2 + 2η2 g 2 2 µ g(t) LD,σ sur (w(t)) 2 2 + ηµ w(t) w 2 2 + 2η2 g(t) 2 2 + 2η2 g 2 2 , where µ is the sharpness parameter and we used the definition that g(t) = g(t) g in the first inequality. Note that g(t) LD,σ sur (w(t)) 2 2 = g(t) g ( LD,σ sur (w(t)) LD,σ sur (w )) + g LD,σ sur (w ) 2 2 2 g(t) LD,σ sur (w(t)) 2 2 + 2 g LD,σ sur (w ) 2 2, since we have LD,σ sur (w(t)) = LD,σ sur (w(t)) LD,σ sur (w ). Thus, it holds µ g(t) LD,σ sur (w(t)) 2 2 + 2η µ g LD,σ sur (w ) 2 2 + ηµ w(t) w 2 2 + 2η2 g(t) 2 2 + 2η2 g 2 2 Furthermore, recall that as shown in Lemma D.9 and Corollary D.13, g(t) LD,σ sur (w(t)) 2 2, LD,σ sur (w(t)) 2 2, ||g ||2 2 and g(t) 2 2 can be made small by increasing the batch size N. 
In particular, when rϵ satisfies H4(rϵ) ϵ, we have proved that with probability at least 1 δ, it holds g(t) LD,σ sur (w(t)) 2 2 α2d B δρN w(t) w 2 2, g(t) 2 2 α2B2 w(t) w 2 2, g LD,σ sur (w ) 2 2 d δN (r2 ϵOPT + α2M 2ϵ), d(r2ϵOPT + α2M 2ϵ) 1 + r2 ϵd δN OPT + α2M 2d Therefore, choosing N C max dr2 ϵ δ , α2M 2d where C is a sufficiently large absolute constant, then with probability at least 1 δ, it holds g(t) LD,σ sur (w(t)) 2 2 µ2 4 w(t) w 2 2, g(t) 2 2 4α2B2 ρ2 w(t) w 2 2, and g LD,σ sur (w ) 2 2 OPT + ϵ , g 2 2 2B Plugging these bounds back to Equation (35), we get 2ηµ + 8η2α2B2 w(t) w 2 2 + 2η η + 1 2ηµ + 8η2α2B2 w(t) w 2 2 + 4η ρ OPT + ϵ , where in the last inequality we used the assumption that η 1 and µ 1. The proof is now complete. Robustly Learning a Single Neuron via Sharpness Now combining the upper bounds on (Q1) and (Q2) and choosing η = µρ2 32α2B2 , we have: w(t+1) w 2 2 1 1 2ηµ + 8η2α2B2 w(t) w 2 2 + 4η w(t) w 2 2 + ρ 4α2B OPT + ϵ . (37) When w(t) w 2 2 (64B/(ρµ2))(OPT + ϵ), in other words when w(t) is still away from the minimizer w , it further holds with probability 1 δ: w(t+1) w 2 2 1 µ2ρ2 w(t) w 2 2, (38) which proves the sufficient decrease of ||w(t) w ||2 2 that we proposed at the beginning. Let T be the first iteration such that w(T ) satisfies w(T ) w 2 2 (64B/(ρµ2))(OPT + ϵ). Recall that we need Equation (34) for every t T to be satisfied to implement sharpness. The first condition is satisfied naturally for w(t+1) w 2 2 w 2 2 as a consequence of Equation (38) (recall that w(0) = 0). For the second condition, when t + 1 T, since µ = 1/2 µ, we have w(t+1) w 2 2 64B ρµ2 (OPT + ϵ) 20B hence the second condition is also satisfied. When t T, the contraction of w(t) w 2 2 indicates a linear convergence rate of stochastic gradient descent. Since w(0) = 0, w 2 W, it holds w(t) w 2 2 (1 µ2ρ2/(256α2B2))t w(0) w 2 2 exp( tµ2ρ2/(256α2B2))W 2. Thus, to generate a point w(T ) such that w(T ) w 2 2 (64B/(ρµ2))(OPT + ϵ), it suffices to run Algorithm 2 for T = eΘ B2α2 iterations, where the logarithmic dependence on parameters α, B, ρ and µ are hidden in the eΘ( ) notation. Further, recall that at each step t the contraction w(t+1) w 2 2 (1 µ2ρ2 256α2B2 ) w(t) w 2 2 holds with probability 1 δ, thus the union bound inequality implies w(T ) w 2 2 (64B/(ρµ2))(OPT + ϵ) holds with probability 1 Tδ. Let δ = 1/(3T), we get with probability at least 2/3, w(T ) w 2 2 (64B/(ρµ2))(OPT + ϵ), and thus from Equation (33), LD,σ 2 (w(T )) 2OPT + 640α2B2 ρ2µ2 (OPT + ϵ) = O Bα and the proof is now complete. In the final part of this section we apply Theorem D.7 to sub-Exponential and k-Heavy tail distributions. Before we dig into the details, some upper bounds on H2(r) and H4(r) are needed. We provide the following simple fact. Fact D.15. Let H2(r) and H4(r) be as in Definition D.5. Then, we have the following bounds: H2(r) r2 min{1, h(r)} + Z r 2s min{1, h(s)} ds, H4(r) r4 min{1, h(r)} + Z r 4s3 min{1, h(s)} ds. Moreover, if Dx is sub-exponential with h(r) = exp( r/B) or k-heavy tailed with h(r) = B/rk, k > 4 + ρ, ρ > 0, and r max{1, B 4 ρ} then H2(r) r2h(r) and H4(r) r4h(r). Robustly Learning a Single Neuron via Sharpness Proof. To prove the fact, we bound the expectation Ex Dx |u x|i1{|u x| r} for any vector u B(1), where i = 2, 4. To calculate the expectation, observe that when t < ri, it holds Pr |u x|i1{|u x| r} t = Pr [|u x| r] . Thus, we have |u x|i1{|u x| r} = Z 0 Pr |u x|i1{|u x| r} t dt = Pr [|u x| r] Z ri ri Pr |u x|i1{|u x| r} t dt = ri Pr [|u x| r] + Z r i Pr |u x|i1{|u x| r} si si 1 ds. 
Since (by Assumption 2.4) Pr [|u x| r] min{1, h(r)}, and further note that when s r it holds Pr |u x|i1{|u x| r} si = Pr [|u x|1{|u x| r} s] = Pr [|u x| s] min{1, h(s)}, then we get |u x|i1{|u x| r} ri min{1, h(r)} + Z r isi 1 min{1, h(s)} ds, which holds for any u B(1). Therefore, we proved the first part of the claim by taking the maximum over u B(1) on both sides of the inequality. Now, consider r max{1, B 4 ρ}. Then for sub-Exponential distributions, as h(s) = exp( s B ) 1 when s r, we have: Z r sh(s) ds = Z r s exp( s/B) ds = B(r B) exp( r/B) Br2h(r), r s3h(s) ds = Z r s3 exp( s/B) ds = B4((r/B)3 + 3(r/B)2 + 6r/B + 6) exp( r/B) 16B4r4h(r), where we assumed without loss of generality that c 1. Hence H2(r) (1 + 2B)r2h(r) and H4(r) (1 + 64B4)r4h(r), proving the desired claim. Finally, for k-Heavy tail distributions with k > 4 + ρ, ρ > 0, h(r) = B/rk. Since h(r) 1 when r max{1, B 4 ρ}, we have: H2(r) r2h(r) + Z 2B sk 1 ds (1 + 2B)r2h(r), and in addition, H4(r) r4h(r) + Z 4B sk 3 ds 1 + 4B The claim is now complete. Applying Theorem D.7 to sub-Exponential distributions yeilds an L2 2 error of order O(OPT) + ϵ with Θ(log(1/ϵ)) convergence rate, using Ω(polylog(1/ϵ)) samples. Formally, we have the following corollaries. Corollary D.16 (Sub-Exponential Distributions). Fix ϵ > 0 and W > 0 and suppose Assumptions 2.2 and 2.3 hold. Moreover, assume that Assumption 2.4 holds for h(r) = exp( r/B) for some B 1. Let OPT denote the minimum value of the L2 2, i.e., OPT = minw B(W ) E(x,y) D (σ(w x) y)2 . Let µ := µ(λ, γ, β, B) be a sufficiently small multiple of λ2γβ, and let M = O(αWB log αW ϵ ). Then after T = eΘ B2α2 Robustly Learning a Single Neuron via Sharpness iterations with batch size N = eΩ d B4α6W 2 µ2 polylog(1/ϵ) , Algorithm 2 converges to a point w(T ) such that LD,σ 2 (w(T )) = O Bα µ 2OPT + ϵ with probability at least 2/3. Proof of Corollary D.16. Since by assumption it holds h(r) = exp( r/B) B/r4+ρ for ρ = 1, we can set ρ = 1 in Theorem D.7. Thus, a direct application of Theorem D.7 with parameter ρ = 1 gives the desired L2 2 error and the required number of iterations. It remains to determine the batch size with respect to the sub-Exponential distributions. To this aim, note that N = Ω(d T(r2 ϵ + α2M 2)), thus we need to find the truncation bound M, which is defined in Lemma D.8, and calculate rϵ such that H4(rϵ) ϵ. Denote κ = ϵ 4α2W 2 . Recall that M = αWH 1 2 (κ). To determine H 1 2 (κ), note that H2(r) is a non-increasing function, therefore it suffices to find a rκ such that H2(rκ) κ, then it holds H 1 2 (κ) rκ. For sub-Exponential distributions where h(r) = exp( r/B), choosing rκ = B log(1/κ2) satisfies r2 κh(rκ) = 4B2 log2 1 since log2(1/κ)κ 1 as κ = ϵ/(4α2W 2) 1. Further note that in Fact D.15 we showed H2(r) (1 + 2B)r2h(r), thus H2(rκ) κ hence M = O(αWB log((αW)/ϵ)). For rϵ, by the same idea one can show that for rϵ = B log(1/ϵ2), it holds H4(rϵ) 1 + 64B4 r4 ϵ exp( rϵ/B) = 1 + 64B4 16B4 log4(1/ϵ)ϵ2 1 + 64B4 80B4ϵ, where the first inequality is due to Fact D.15 and in the last inequality we used the fact that log4(1/ϵ)ϵ 5 when ϵ 1. Therefore, combining the bounds on M and rϵ, we get N = eΩ(d T(r2 ϵ + α2M 2)) = eΩ d B4α6W 2 ρ2µ2 log3 1 Next, we apply Theorem D.7 to Heavy-Tail distributions. Corollary D.17 (Heavy-Tail Distributions). Fix ϵ > 0 and W > 0 and suppose Assumptions 2.2 and 2.3 hold. Moreover, assume that Assumption 2.4 holds for h(r) = B/rk for some k > 4 + ρ where ρ > 0 and B 1. 
Let OPT denote the minimum value of the L2 2, i.e., OPT = minw B(W ) E(x,y) D (σ(w x) y)2 . Let µ := µ(λ, γ, β, ρ, B) be a sufficiently small multiple of λ2γβρ/B, and let M = Θ(αW αW B ϵ 1/(k 2)). Then after T = eΘ B2α2 iterations with batch size N = eΩ d B2α6W 2 Algorithm 2 converges to a point w(T ) such that LD,σ 2 (w(T )) = O Bα ρµ 2OPT + ϵ with probability at least 2/3. Proof of Corollary D.17. Applying Theorem D.7 directly we get the desired convergence rate and L2 2 loss. Now for batch size, we need to determine the truncation bound M = αWH 1 2 (ϵ/(4α2W 2)) (see Lemma D.8) as well as rϵ such that H4(rϵ) ϵ. First, denote κ = ϵ 4α2W 2 . Let rκ = ( 2B κ )1/(k 2). By Fact D.15, H2(rκ) r2 κh(rκ) = κ 2 . Since H2(r) is nonincreasing, we know H 1 2 (κ) rκ. Thus, M = O(αW( αW B ϵ )1/(k 2)). Next, choose rϵ = ( B ϵ )1/(k 4), then it holds H4(rϵ) r4 ϵh(rϵ) = ϵ satisfying the condition. Robustly Learning a Single Neuron via Sharpness Combining the bounds on rϵ and M, we get the batch size N = Ω(d T(r2 ϵ + α2M 2)) = eΩ d B2α2 2 k 4 + α4W 2 B = eΩ d B2α6W 2 Thus, Algorithm 2 yields an L2 2 error of O(OPT) + ϵ in eΘ(log(1/ϵ)) iterations with batch size eΩ((1/ϵ)2/(k 4)) when applied on k-Heavy Tailed distributions. E. Distributions Satisfying Our Assumptions In this section, we show that many natural distributions satisfy Assumptions 2.3 and 2.4 E.1. Well-Behaved Distributions from Diakonikolas et al. (2022b) We first consider the class of distributions defined by Diakonikolas et al. (2022b) and termed well-behaved . This distribution class contains many natural distributions like log-concave and s-concave distributions. Definition E.1 (Well-Behaved Distributions). Let L, R > 0. An isotropic (i.e., zero mean and identity covariance) distribution Dx on Rd is called (L, R)-well-behaved if for any projection (Dx)V of Dx onto a subspace V of dimension at most two, the corresponding pdf ϕV on R2 satisfies the following: For all x V such that x R it holds ϕV (x) L (anti-anti-concentration). For all x V it holds that ϕV (x) (1/L)(e L x 2) (anti-concentration and concentration). The distribution class that is (L, B)-well-behaved satisfies Assumption 2.4. Therefore, we need to show that the distributions in this class satisfy Assumption 2.3. Lemma E.2. Let Dx be a (L, B)-well-behaved distribution. Then, Dx satisfies Assumption 2.3, for γ = R/2 and λ = LR4/16. Proof. Let u, v Rd be any two orthonormal vectors and let V be the subspace spanned by u, v. We have that Ex Dx (u x)21{v x R/2} Ex Dx (u x)21{R u x R/2, R v x R/2} (R2/4)Ex Dx [1{R u x R/2, R v x R/2}] = (R2/4) Z R R/2 ϕV (x)dx (R2/4)L Z R R/2 dx = LR4/16 . Furthermore, similarly, we have that Ex Dx (v x)21{v x R/2} LR4/16. Therefore, Dx satisfies Assumption 2.3 with γ = R/2 and λ = LR4/16. E.2. Symmetric Product Distributions with Strong Concentration E.2.1. k-HEAVY TAILED SYMMETRIC DISTRIBUTIONS, k 7 Here we show that symmetric product distributions with sufficiently large polynomial tails satisfy our assumptions. Proposition E.3. Let Dx be a k-Heavy Tailed symmetric distribution with k 7 and i.i.d. coordinates, i.e., it satisfies Pr [|u x| r] B/rk for some absolute constant B 1. Let α = Ex Dx x2 i and β = Ex Dx x4 i . Suppose β α2 cα2, where c > 0 is an absolute constant and let C to be suffciently small absolute multiple of (k 6)c2α4/B. Then, Dx satisfies Assumptions 2.3 and 2.4 with γ = C 16B )1/k and λ = C5 Robustly Learning a Single Neuron via Sharpness Proof of Proposition E.3. 
First, observe that by definition, Dx satisfies Assumption 2.4 with h(r) = B/rk. We show that Dx satisfies Assumption 2.3 with some absolute constants γ and λ. Let u, v Rd be any two orthonormal vectors. We have that Ex Dx [|u x|v x] = 0 , since the distribution is symmetric. Therefore, we have that Ex Dx [|u x||v x|1{v x 0}] = Ex Dx [|u x||v x|] /2. Let V = |u x||v x|. We show that Ex Dx [V ] c2α4(k 6)/B. Claim E.4. Let V = |u x||v x|. Assume that x has i.i.d. zero mean coordinates and that x is k-Heavy tailed with parameter k 7, B 1. Let α = Ex Dx x2 i and β = Ex Dx x4 i . If β α2 cα2, where c > 0 is an absolute constant then, Ex Dx [V ] c2α4(k 6)/B. Proof of Claim E.4. First, note that we can write V 2 as V V 3/2, because V 0. Therefore, by applying the Cauchy Schwarz inequality, we have that V 2 2 E x Dx [V ] E x Dx Therefore, we have that Ex Dx [V ] (Ex Dx V 2 )2/ Ex Dx V 3 . We first bound Ex Dx V 2 from below. Let α = Ex Dx x2 i and β = Ex Dx x4 i . Observe that i1,i2,i3,i4 E x Dx [ui1ui2vi3vi4xi1xi2xi3xi4] i1,i2,i1 =i2 E x Dx u2 i1v2 i2x2 i1x2 i2 + 2ui1vi1ui2vi2x2 i1x2 i2 + X u2 i v2 i x4 i i1,i2,i1 =i2 u2 i1v2 i2 + 2α2 X i1,i2,i1 =i2 ui1vi1ui2vi2 + β X i u2 i v2 i i1,i2,i1 =i2 u2 i1v2 i2 2α2 X i u2 i v2 i + β X i u2 i v2 i i u2 i (1 v2 i ) 2α2 X i u2 i v2 i + β X i u2 i v2 i = (β 3α2) X i u2 i v2 i + α2 , where we used that Pd i=1 viui = 0 and u 2 = v 2 = 1, because v, u are orthonormal. Next, we show that P i u2 i v2 i is less than 1/2. Claim E.5. Let v, u be two orthonormal vectors. Then, P i u2 i v2 i 1/2. Proof of Claim E.5. Note that since P i v2 i = 1 and P i uivi = 0, it holds i u2 i v2 i + X 1 i 0. Furthermore, we can bound Ex Dx V 3 from above as the following. Using Cauchy-Schwarz, we have that Ex Dx V 3 maxu B(1) Ex Dx (u x)6 . Recall that x is a k-Heavy Tailed random variable with k 7, hence Ex Dx (u x)6 1 + 6B/(k 6). Therefore, we have that E x Dx [V ] = E x Dx [|u x||v x|] c2α4 This completes the proof of Claim E.4. Lemma E.6. Let Z = |u x||v x|1{v x 0}. Assume that, there exists a constant 1 > C > 0, so that Ex Dx [Z] C and Ex Dx Z2 1/C. Then it holds (u x)21 v x C (v x)21 v x C Proof of Lemma E.6. Using Paley-Zigmund inequality, we have that Pr Z ζ E x Dx [Z] (1 ζ)2 Ex Dx [Z]2 Ex Dx [Z2] . Therefore, we have that Pr [Z C/2] C3/4, where Z = |u x||v x|1{v x 0}. For simplicity of notation, let s denote |u x| and |v x|1{v x 0} as a and b respectively. Then fixing some t > 0, it holds: Note that Dx is k-Heavy Tailed, thus, when t q C3 1/k, it holds C/2 i = Pr h |u x| t p Similarly, for b = |v x|1{v x 0} |v x| it holds Pr h b t/ p C/2 i C3/16. Therefore, Pr [ab C/2] can be upper-bounded by Hence, we get Robustly Learning a Single Neuron via Sharpness where we choose t = q C3 1/k. As a result, (u x)21 v x C (u x)21 v x C 1/k , |u x| C Similarly, we also have Ex Dx[(v x)21{v x C 16B )1/k}] C5 From Claim E.4 we know that E[Z] = Ex Dx [|u x||v x|1{v x 0}] = 1 2 Ex Dx [|u x||v x|] (k 6)c2α4/B. In addition, using Cauchy-Schwarz we have E[Z2] maxu B(1) Ex Dx (u x)4 5B, where the last inequality comes from Fact C.4 (note that ρ = 1 suffices in when Dx is k-Heavy Tailed, k 7). Thus, choosing C to be suffciently small absolute multiple of (k 6)c2α4/B, it holds E[Z] C and E[Z2] 1/C. Let v = w / w 2 and let u be any vector that is orthonormal to v. Then by the results of Lemma E.6, we know that choosing γ = C 16B )1/k and λ = C5 16B )2/k, it holds Ex Dx xx 1{w x γ w 2} λI. E.2.2. DISCRETE GAUSSIANS Here we show that our assumptions are satisfied for discrete multivariate Gaussians. 
We will use the following standard definition of a discrete Gaussian. Definition E.7 (Discrete Gaussian). We define the discrete standard Gaussian distribution as follows: Fix θ R+ with θ > 0. Then, the pmf of the discrete Gaussian distribution is given by where Z is a normalization constant. Similarly, we define the high dimensional analogous as follows, we say that a random vector x Rd follows the d-dimensinal discrete Gaussian distribution if x is a vector of d independent random variables, each of which follows the discrete Gaussian distribution. Corollary E.8. Let θ (0, 1] and let Dx be a d-dimensional discrete Gaussian distribution with parameter θ. Then, there exists an absolute constant C > 0, so that Dx satisfies Assumptions 2.3 and 2.4 with ρ = 1, B = e9, γ = C(C3/B)1/7 and λ = C5(C3/B)2/7. Proof of Corollary E.8. We first show that the discrete Gaussian distribution is subgaussian with an appropriate parameter. Lemma E.9. Let θ (0, 1] then the discrete Gaussian distribution is subgaussian with parameter 2 Proof of Lemma E.9. By definition, a random variable X is D-subgaussian if Pr [|X| t] exp( t2/D2) . Let X Dx, where Dx is a discrete Gaussian distribution with parameter θ. Fix t θ, Then, we have that Pr [|X| t] 1 z θZ,|z| t exp z2 t/θ 1 exp z2θ2 where for the first inequality, we used the integral mean value theorem and the monotonicity, i.e., R k+1 k f(t)dt = f(ξ) for ξ (k, k + 1) and that f(k + 1) f(ξ) f(k) because f is a decreasing function. Further note that 2 R t/2 exp( x2/2) dx = Robustly Learning a Single Neuron via Sharpness 2π exp( t2/8). It remains to show that θZ is lower-bouned. Note that exp( x2/2) is a decreasing function when x 0, therefore for any z Z+ it holds exp( (θz)2/2) R z+1 z exp( (θx)2/2) dx. Thus, by definition, θZ = θ + 2θ X z Z+ exp((θz)2/2) 0 exp( (θx)2/2) dx + 2θ X z exp( (θx)2/2) dx 0 exp( t2/2) dt + Z 1 exp( t2/2) dt = rπ 2 (2 erf(1/ Thus, combining these results, we get Pr [|X| t] 4 exp( t2/8), thus discrete Gaussian is sub-Gaussian with parameter D = 2 It remains to show that the discrete Gaussian distribution satisfies the requirements of Proposition E.3. To be specific, we show that it holds Claim E.10. Let X be a discrete Gaussian random variable. Denote E[X2] as α and E[X4] as β. Then it holds α 1 and β 1.25. Proof of Claim E.10. By Poisson summation formula, we know that it holds P z Z f(z) = P z Z ˆf(z) where ˆf(t) is the fourier transform of f, i.e., ˆf(z) = R + f(x)e 2πixtdx. It is easy to calculate that for f(z) = θ2z2 exp θ2z2 2π θ3 (θ2 4π2t2) exp 2π2t2 and for g(z) = exp θ2z2 2 , we have 2π θ exp 2π2t2 Thus, by definition, α = E[X2] = P z Z g(z) = P z Z ˆf(z) P z Z ˆg(z) = 1 P θ2 exp 2π2z2 P z Z exp 2π2z2 For E[X4], note that t4 exp( t2/2) is an increasing function when t (0, 2) and is decreasing when t (2, ). Thus, denote z = 1, then by the property of integral, it holds z Z (θz)4 exp( (θz)2/2)θ( z) 0 t4 exp( t2/2) dt + Z 2 t4 exp( t2/2) dt 0 t4 exp( t2/2) dt + 2.06 4.4 where the second inequality is due to the fact that R 2 t4 exp( t2/2) dt 2.06 and θ (0, 1]. Further, in the last inequality we used the fact that R 1 0 t4 exp( t2/2) dt 0.14. Now, note that for θZ, it holds θZ = θ + 2 X z Z+ exp( (θz)2/2)θ( z) θ + 2 Z Therefore, combining with upper-bound on θZ, it holds E[X4] 4.4/(1 + Robustly Learning a Single Neuron via Sharpness Now since α 1, β 1.25α2 and discrete Gaussian is 2 2-sub-Gaussian as proved in Lemma E.9, we have Pr [|u x| r] exp( r2/8) e9/r7. 
Thus, the conditions in Proposition E.3 are satisfied with parameters B = e9, c = 0.25, k = 7. Thus, choosing C to be a small multiple of c2α4/B, then since according to Claim E.4, Ex Dx [Z] c2α4/B C and Ex Dx Z2 5B 1/C, thus, by Proposition E.3, we know that Assumption 2.3 is satisfied with parameters γ = C 16B )1/7 and λ = C5 E.2.3. UNIFORM DISCRETE DISTRIBUTION ON { 1, 0, 1}d Finally, we show that the uniform distribution on a hyper-grid satisfies our assumptions. Corollary E.11. Let Dx be a d-dimensional uniform distribution over the { 1, 0, 1}d. Then, there exists an absolute constant C > 0, so that Dx satisfies Assumptions 2.3 and 2.4 with B = 1, ρ = 1, γ = C(C3/B)1/7 and λ = C5(C3/B)2/7. Proof of Corollary E.11. Note that the distribution is 1-sub-Gaussian. Now since β = Ex Dx x4 i = 2/3 and α = Ex Dx x2 i = 2/3, therefore, β = 1.5α2. Thus, the conditions in Proposition E.3 are satisfied with parameters B = 1, α = 2/3, c = 0.5, k = 7. Now choosing C to be a small multiple of c2α4/B, then since according to Claim E.4, Ex Dx [Z] c2α4/B C and Ex Dx Z2 5B 1/C, thus, by Proposition E.3, we know that Assumption 2.3 is satisfied with parameters γ = C 16B )1/7 and λ = C5 F. Extension to Certain Non-Monotone Activations In this section, we extend our algorithmic results to certain cases where the activation function is not monotone. Specifically, we will consider activations like Ge LU (Hendrycks & Gimpel, 2016): σGe LU(t) = tΦ(t), where Φ(t) is the cdf. of the standard normal random variable N(0, 1) and Swish (Ramachandran et al., 2017) defined by σSwish(t) = t 1+exp( t). Definition F.1 (Non-Monotonic (α, β)-Unbounded Activations). Let σ : R 7 R be an activation function and let α, β > 0. We say that σ is non-monotonic (α, β)-unbounded if it satisfies the following assumptions: 1. σ(t2) 0 σ(t1) for any t2 0 t1; 2. σ is α-Lipschitz; and 3. σ (t) β for all t (0, ). As mentioned above, Definition F.1 contains Ge LU and Swish. Indeed, one can show that σGe LU(t) is actually nonmonotonic (1.1, 1/2)-unbounded, and σSwish(t) is non-monotonic (1.2, 0.4)-unbounded. We include the following Figure 1 of these activations to provide the readers with a better geometric intuition. Figure 1: Non-Monotonic (α, β)-Unbounded Activation Examples: Ge LU and Swish Now, we show that truncating a non-monotonic (α, β)-unbounded activation σ to ˆσ(t) = [σ(t)]+ and cutting off the negative part of y induces only a small L2 2 error at point w , i.e., E(x,y) D (ˆσ(w x) y1{y 0})2 OPT, implying that we can consider running an algorithm similar to Algorithm 1 on ˆσ(t) and truncated y. Robustly Learning a Single Neuron via Sharpness Lemma F.2. Let w = argminw B(W ) E(x,y) D (σ(w x) y)2 = argminw B(W ) LD,σ 2 (w) and denote LD,σ 2 (w ) as OPT. Define ˆy = [y]+ and ˆσ(t) = [σ(t)]+. Then: (ˆσ(w x) ˆy)2 OPT . Proof. The proof follows similar ideas in Lemma D.8. Since [t]+ is a non-expansive projection from R to R+, we have |[t1]+ [t2]+| |t1 t2| for any t1, t2 R. Thus, we get (ˆσ(w x) y )2 = E (x,y) D ([σ(w x)]+ [y]+)2 E (x,y) D (σ(w x) y)2 = OPT. The truncated activation function is a monotonic (α, β)-unbounded since ˆσ(t) is increasing when t 0 and ˆσ(t) = 0 when t 0. Thus, when Assumptions 2.3 and 2.4 hold with respect to distribution Dx and ˆσ, we can use a slightly modified algorithm Algorithm 3 that works as efficiently as Algorithm 1 since Lemma 2.5 and Theorem 3.3 can be applied to activation ˆσ(t) with minor modifications. Formally, we have the following corollaries: Corollary F.3. 
Let σ(t) be a non-monotonic (α, β)-unbounded activation function satisfying Definition F.1. Suppose that Assumption 2.3 and Assumption 2.4 holds. Further denote ˆσ(t) = σ(t)1{t 0}. Then the noise-free surrogate loss LD,ˆσ sur with respect to activation function ˆσ(t) is Ω(λ2γβρ/B)-sharp in the ball B(2 w 2), i.e., LD,ˆσ sur (w) (w w ) λ2γβρ/B w w 2 2, w B(2 w 2). Proof. Since ˆσ is monotonic (α, β)-unbounded, we have proven in Lemma 2.5 for monotonic (α, β)-unbounded activations and distribution Dx satisfying Assumption 2.2 to Assumption 2.4, LD,ˆσ sur is µ sharp with the parameter µ = Ω(λ2γβρ/B). Corollary F.4. Let σ(t) be a non-monotonic (α, β)-unbounded activation function, satisfying Definition F.1. Fix ϵ > 0 and W > 0 and suppose Assumptions 2.3 and 2.4 hold. Let OPT denote the minimum value of the L2 2 loss i.e., OPT = min w B(W ) E (x,y) D (σ(w x) y)2 . Let µ := µ(λ, γ, β, ρ, B) be a sufficiently small multiple of λ2γβρ/B, and let M = αWH 1 2 ϵ 4α2W 2 . Further, choose parameter rϵ large enough so that H4(rϵ) is a sufficiently small multiple of ϵ. Then after T = eΘ B2α2 iterations with batch size N = eΩ(d T(r2 ϵ + α2M 2)), Algorithm 3 converges to a point w(T ) such that LD,σ 2 (w(T )) = ρ2µ2 OPT + ϵ with probability at least 2/3. Proof. First observe that since the α-Lipschitz property remains for non-monotonic (α, β)-unbounded functions, the following is still valid: LD,σ 2 (w) = E (x,y) D (σ(w x) y)2 (σ(w x) σ(w x))2 + 2 E (x,y) D (σ(w x) y)2 ((w w ) x)2 + 2OPT (10Bα2/ρ) w w 2 2 + 2OPT. (41) Now denote ˆϵ = E(x,y) D (ˆσ(w x) ˆy)2 . If one can show that after T = eΘ B2α2/(ρ2µ2) log (W/ϵ) iterations with a large enough batch size N = eΩ(d T(r2 ϵ + α2M 2)), Algorithm 3 generates a point w(T ) such that it holds w(T ) w 2 2 64B ρµ2 ˆϵ + ϵ, (42) Robustly Learning a Single Neuron via Sharpness Algorithm 3 Stochastic Gradient Descent on Surrogate Loss For Non-Monotonic (α, β)-Unbounded Activations Input: Iterations: T, sample access from D, batch size N, step size η, bound M. Initialize w(0) 0. for t = 1 to T do Draw N samples {(x(j), y(j))}N j=1 D. for each j [N], y(j) min([y(j)]+, M) Let ˆσ(t) = [σ(t)]+, calculate j=1 (ˆσ(w(t) x(j)) y(j))x(j), w(t+1) w(t) ηg(t). end for Output: The weight vector w(T ). then, since we showed in Lemma F.2 that ˆϵ OPT, combining with Equation (41) we immediately get LD,σ 2 (w) 2 + 640B2α2 OPT + (10Bα2/ρ)ϵ, thus completing the corollary. In order to prove the claim above and Equation (42), one only needs to observe that ˆσ(t) is monotonic (α, β)-unobounded, and Assumption 2.2 to Assumption 2.4 holds for ˆσ(t) and the distribution Dx. Therefore, the exact same techniques for proving Theorem 3.3 (see Appendix D.2) can be applied. Results similar to Lemma D.8 and Lemma D.9 still hold for activation function ˆσ(t) and data points (x, ˆy), with the only difference being that we have E(x,y) D (ˆσ(w x) ˆy)2 = ˆϵ instead of OPT. Moreover, it has proven in Corollary F.3 that LD,ˆσ sur is µ sharp, therefore by Proposition 3.2 we know LD,σ sur is µ/2 sharp. Thus, Equation (42) follows from the same steps and the same choice of parameters as in the proof of Theorem 3.3 (see Appendix D.2).