# Robustly Learning Single-Index Models via Alignment Sharpness

Nikos Zarifis*, Puqian Wang*, Ilias Diakonikolas, Jelena Diakonikolas

*Equal contribution. Department of Computer Science, University of Wisconsin - Madison, Madison, WI, USA. Correspondence to: Puqian Wang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We study the problem of learning Single-Index Models under the $L_2^2$ loss in the agnostic model. We give an efficient learning algorithm, achieving a constant factor approximation to the optimal loss, that succeeds under a range of distributions (including log-concave distributions) and a broad class of monotone and Lipschitz link functions. This is the first efficient constant factor approximate agnostic learner, even for Gaussian data and for any nontrivial class of link functions. Prior work for the case of unknown link function either works in the realizable setting or does not attain a constant factor approximation. The main technical ingredient enabling our algorithm and analysis is a novel notion of a local error bound in optimization that we term alignment sharpness and that may be of broader interest.

1. Introduction

Single-index models (SIMs) (Ichimura, 1993; Hristache et al., 2001; Härdle et al., 2004; Dalalyan et al., 2008; Kalai & Sastry, 2009; Kakade et al., 2011; Dudeja & Hsu, 2018) are a classical supervised learning model extensively studied in statistics and machine learning. SIMs capture the common assumption that the target function $f$ depends on an unknown direction $w$, i.e., $f(x) = u(w \cdot x)$ for some link (a.k.a. activation) function $u: \mathbb{R} \to \mathbb{R}$ and $w \in \mathbb{R}^d$. In most settings, the link function is unknown and is assumed to satisfy certain regularity properties. Classical works (Kalai & Sastry, 2009; Kakade et al., 2011) studied the efficient learnability of SIMs for monotone and Lipschitz link functions and data distributed on the unit ball. These early algorithmic results succeed in the realizable setting (i.e., with clean labels) or in the presence of zero-mean label noise.

The focus of this work is on learning SIMs in the challenging agnostic (or adversarial label noise) model (Haussler, 1992; Kearns et al., 1994), where no assumptions are made on the labels of the examples and the goal is to compute a hypothesis that is competitive with the best-fit function in the class. Importantly, as will be formalized below, we will not assume a priori knowledge of the link function.

In more detail, let $D$ be a distribution on labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and let $L_2^2(h) = \mathbb{E}_{(x,y)\sim D}[(h(x) - y)^2]$ be the squared loss of the hypothesis $h: \mathbb{R}^d \to \mathbb{R}$ with respect to $D$. Given i.i.d. samples from $D$, the goal of the learner is to output a hypothesis $h$ with squared error competitive with $\mathrm{OPT}$, where $\mathrm{OPT} = \inf_{f \in \mathcal{C}} L_2^2(f)$ is the best attainable error by any function in the target class $\mathcal{C}$. In the context of this paper, the class $\mathcal{C}$ above is the class of SIMs, i.e., all functions of the form $f(x) = u(w \cdot x)$, where both the weight vector $w$ and the link function $u$ are unknown. For this task to be even information-theoretically solvable, one requires some assumptions on the vector $w$ and the link function $u$. We will assume, as is standard, that the $\ell_2$-norm of $w$ is bounded by a parameter $W$.
We will similarly assume that the link function lies in a family of well-behaved functions that are monotone and satisfy certain Lipschitz properties (see Definition 1.3). For a weight vector $w$ and link function $u$, the $L_2^2$ loss of the SIM hypothesis $u(w \cdot x)$ (defined by $u$ and $w$) is $L_2^2(w; u) = \mathbb{E}_{(x,y)\sim D}[(u(w \cdot x) - y)^2]$. Our problem of robustly learning SIMs is defined as follows.

Problem 1.1 (Robustly Learning Single-Index Models). Fix a class of distributions $\mathcal{G}$ on $\mathbb{R}^d$ and a class of link functions $\mathcal{F}$ (throughout this paper, we use the terms link function and activation interchangeably). Let $D$ be a distribution of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ such that its $x$-marginal $D_x$ belongs to $\mathcal{G}$. We say that an algorithm is a $C$-approximate proper SIM learner, for some $C \geq 1$, if given $\epsilon > 0$, $W > 0$, and i.i.d. samples from $D$, the algorithm outputs a link function $\hat{u} \in \mathcal{F}$ and a vector $\widehat{w} \in \mathbb{R}^d$ such that with high probability it holds that
$$L_2^2(\widehat{w}; \hat{u}) \leq C \cdot \mathrm{OPT} + \epsilon, \quad \text{where } \mathrm{OPT} := \min_{\|w\|_2 \leq W,\, u \in \mathcal{F}} L_2^2(w; u).$$
Throughout, we use $u^*(w^* \cdot x)$ to denote an(y) fixed optimal solution to the learning problem, i.e., $L_2^2(w^*; u^*) = \mathrm{OPT}$.

Some comments are in order. First, Problem 1.1 does not make realizability assumptions on the distribution $D$. That is, the labels are allowed to be arbitrary and the goal is to be competitive against the best-fit function in the class $\mathcal{C} = \{u(w \cdot x) \mid w \in \mathbb{R}^d, \|w\|_2 \leq W, u \in \mathcal{F}\}$. Second, our focus is on obtaining efficient learners that achieve a constant factor approximation to the optimum loss, i.e., where $C$ in Problem 1.1 is a universal constant independent of the dimension $d$ and the radius $W$ of the weight space.

Ideally, one would like an efficient learner that succeeds for all marginal distributions and achieves optimal error of $\mathrm{OPT} + \epsilon$ (corresponding to $C = 1$). Unfortunately, known computational hardness results rule out this possibility. Even for the very special case that the marginal distribution is Gaussian and the link function is known (e.g., a ReLU), there is strong evidence that any algorithm achieving error $\mathrm{OPT} + \epsilon$ requires $d^{\mathrm{poly}(1/\epsilon)}$ time (Diakonikolas et al., 2020b; Goel et al., 2020; Diakonikolas et al., 2021; 2023). Moreover, even if we relax our goal to constant factor approximation (i.e., $C = O(1)$), distributional assumptions are required both for proper (Síma, 2002; Manurangsi & Reichman, 2018) and improper learning (Diakonikolas et al., 2022a). As a consequence, algorithmic research in this area has focused on constant factor approximate learners that succeed under mild distributional assumptions.

Recent works (Diakonikolas et al., 2020a; 2022c; Awasthi et al., 2023; Wang et al., 2023) gave efficient, constant factor approximate learners, under natural distributional assumptions, for the special case of Problem 1.1 where the link function is known a priori (see also Frei et al. (2020)). For the general setting, the only prior algorithmic result was recently obtained in Gollakota et al. (2023). Specifically, Gollakota et al. (2023) gave an efficient algorithm that succeeds for the class of monotone 1-Lipschitz link functions and any marginal distribution with second moment bounded by $\lambda$. Their algorithm achieves $L_2^2$ error
$$O(W \sqrt{\mathrm{OPT}}) + \epsilon \tag{1}$$
under the assumption that the labels are bounded in $[0, 1]$. The error guarantee (1) is substantially weaker, both qualitatively and quantitatively, than the goal of this paper. Firstly, the dependence on $\mathrm{OPT}$ scales with its square root, as opposed to linearly.
Secondly, and arguably more importantly, the multiplicative factor inside the big-$O$ scales (linearly) with the diameter $W$ of the space. Interestingly, Gollakota et al. (2023) showed, via a hardness construction from Diakonikolas et al. (2022a), that under their distributional assumptions a multiplicative dependence on $W$ (in the error guarantee) is inherent for efficient algorithms. That is, to obtain an efficient constant factor approximation, it is necessary to restrict ourselves to distributions with additional structural properties. This discussion raises the following question:

Can we obtain efficient constant factor learners for Problem 1.1 under mild distributional assumptions?

The natural goal here is to match the guarantees of known algorithmic results for the special case of known link function (Diakonikolas et al., 2022c; Wang et al., 2023). As our main contribution, we answer this question in the affirmative. That is, we give the first efficient constant factor approximation learner that succeeds under natural and broad families of distributions (including log-concave distributions) and a broad class of link functions. We emphasize that this is the first polynomial-time constant factor approximate learner even for Gaussian marginals and for any nontrivial class of link functions. Roughly speaking, our distributional assumptions require concentration and (anti-)anti-concentration (see Definition 1.2).

1.1. Overview of Results

We start by providing the distributional assumptions and family of link functions for which our algorithm succeeds.

Distributional Assumptions. Our algorithm succeeds for the following class of structured distributions.

Definition 1.2 (Well-Behaved Distributions). Let $L, R > 0$. Let $V$ be any subspace of $\mathbb{R}^d$ of dimension at most 2. A distribution $D_x$ on $\mathbb{R}^d$ is called $(L, R)$-well-behaved if $D_x$ is isotropic and for any projection $(D_x)_V$ of $D_x$ onto a subspace $V$, the corresponding pdf $\gamma_V$ on $\mathbb{R}^2$ satisfies the following: for all $x_V \in V$ such that $\|x_V\| \leq R$, $\gamma_V(x_V) \geq L$ (anti-anti-concentration); and for all $x_V \in V$, $\gamma_V(x_V) \leq (1/L)\, e^{-L \|x_V\|_2}$ (anti-concentration and concentration).

This distribution class was introduced in Diakonikolas et al. (2020c), in the context of learning linear separators with noise, and has since been used in a number of prior works, including for robustly learning SIMs with known link functions (Diakonikolas et al., 2022c). The parameters $L, R$ in Definition 1.2 are viewed as universal constants, i.e., $L, R = O(1)$. Indeed, it is known that many natural distributions, most importantly isotropic log-concave distributions, fall in this category; see, e.g., Diakonikolas et al. (2020c).

Unbounded Activations. Our algorithm succeeds for a broad class of link functions that contains many well-studied activations, including ReLU. This class, defined in Diakonikolas et al. (2022c) and used in Wang et al. (2023), requires the link functions to be monotone, Lipschitz-continuous, and strictly increasing in the positive region.

Definition 1.3 (Unbounded Activations). Let $u: \mathbb{R} \to \mathbb{R}$. Given $a, b \in \mathbb{R}$ such that $0 < a \leq b$, we say that $u(z)$ is $(a, b)$-well-behaved if $u(0) = 0$ and $u(z)$ is non-decreasing, $b$-Lipschitz-continuous, and $u(z) - u(z') \geq a(z - z')$ for all $z \geq z' \geq 0$. We denote this function class by $\mathcal{U}_{(a,b)}$.

A simplified version of our main algorithmic result is as follows (see Theorem 4.2 for a more detailed statement):

Theorem 1.4 (Main Algorithmic Result, Informal).
Given Problem 1.1, where $\mathcal{G}$ is the class of $(L, R)$-well-behaved distributions with $L, R = O(1)$ and $\mathcal{F} = \mathcal{U}_{(a,b)}$ such that $1/a, b = O(1)$, there is an algorithm that draws $N = \mathrm{poly}(W) \cdot O(d/\epsilon^2)$ samples from $D$, runs in $\mathrm{poly}(N, d)$ time, and outputs a hypothesis $\hat{u}(\widehat{w} \cdot x)$ with $\hat{u} \in \mathcal{U}_{(a,b)}$, $\|\widehat{w}\|_2 \leq W$, such that $L_2^2(\widehat{w}; \hat{u}) \leq C \cdot \mathrm{OPT} + \epsilon$ with high probability, where $C > 0$ is an absolute constant.

We reiterate that the approximation factor $C$ in Theorem 1.4 is a universal constant, independent of the dimension and the diameter of the space. That is, our main result provides the first efficient learning algorithm achieving a constant factor approximation, even for the most basic case of Gaussian data and any non-trivial class of link functions.

1.2. Technical Overview

When it comes to learning SIMs in the agnostic model with target error $C \cdot \mathrm{OPT} + \epsilon$, to the best of our knowledge, all prior work that achieves such a guarantee with $C$ being an absolute constant applies only to the special case of a known link function $u^*$. Such results are established by proving growth conditions (local error bounds) that relate either the $L_2^2$ loss or a surrogate loss to the (squared) distance to the set of target solutions, using assumptions about the link function and the data distribution, such as concentration and (anti-)anti-concentration (Diakonikolas et al., 2020a; 2022b; Wang et al., 2023). Among these, most relevant to our work is Wang et al. (2023), which proved a sharpness property for a convex surrogate loss defined by
$$\mathcal{L}_{\mathrm{sur}}(w; u) = \mathbb{E}_{(x,y)\sim D}\Big[\int_0^{w \cdot x} (u(r) - y)\, dr\Big], \tag{2}$$
based on assumptions about the link function that are the same as ours and distributional assumptions that are somewhat weaker but comparable to ours. Their sharpness result corresponds to guaranteeing that for vectors $w$ that are not already $O(\mathrm{OPT}) + \epsilon$ accurate solutions, the following holds:
$$\nabla \mathcal{L}_{\mathrm{sur}}(w; u^*) \cdot (w - \bar{w}) \gtrsim \|w - \bar{w}\|_2^2, \tag{3}$$
where $\bar{w}$ is a vector that achieves error $O(\mathrm{OPT}) + \epsilon$.

One may hope that the sharpness result of Wang et al. (2023) can be generalized to the case of unknown link function and leveraged to obtain constant factor robust learners. However, as we discuss below, such direct generalizations are not possible, and there are several technical challenges that had to be overcome in our work. To illustrate some of the intricacies, consider first the following example.

Example 1.5. Let $x \sim \mathcal{N}(0, I)$ and $w = (1/2)w^*$, where $w^*$ is an arbitrary but fixed target unit vector. Let $b > 2a$. Suppose that the link function at hand is $u(z) = bz$ and the target link function is $u^*(z) = az$. Observe that both $u, u^* \in \mathcal{U}_{(a,b)}$, as required by our model. Furthermore, suppose there is no label noise, in which case $\mathrm{OPT} = 0$. Note that the $L_2^2$ error of $u(w \cdot x)$ in this case is
$$L_2^2(w; u) = \mathbb{E}_{x \sim \mathcal{N}(0,I)}[(u(w \cdot x) - u^*(w^* \cdot x))^2] = \mathbb{E}_{z \sim \mathcal{N}(0,1)}[(b/2 - a)^2 z^2] = (b/2 - a)^2 = \Theta(1).$$
However, the gradient of the surrogate loss, $\nabla \mathcal{L}_{\mathrm{sur}}(w; u) = \mathbb{E}[(u(w \cdot x) - u^*(w^* \cdot x))x]$, is negatively correlated with $w - w^*$, i.e., $\nabla \mathcal{L}_{\mathrm{sur}}(w; u) \cdot (w - w^*) < 0$, contrary to what we would hope for if a sharpness property as in Wang et al. (2023) were to hold. Thus, although $w$ and $u$ are both still far away from the target parameters $w^*$ and $u^*$, the gradient of the surrogate loss cannot provide useful information about the direction in which to update $w$.

What Example 1.5 demonstrates is that we cannot hope for the surrogate loss to satisfy a local error bound for an arbitrary parameter pair $(u, w)$ that would guide the convergence of an algorithm toward a target parameter pair $(u^*, w^*)$.
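To make Example 1.5 concrete, the following short numerical check (our illustration, not part of the paper; it assumes numpy and uses the closed form $\nabla \mathcal{L}_{\mathrm{sur}}(w; u) = \mathbb{E}[(u(w \cdot x) - y)x]$ stated above) estimates the surrogate gradient by Monte Carlo and confirms that it points away from $w^*$ even though the $L_2^2$ error is $\Theta(1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200_000
a, b = 1.0, 4.0                        # b > 2a, as in Example 1.5

w_star = np.zeros(d); w_star[0] = 1.0  # arbitrary fixed unit target vector
w = 0.5 * w_star                       # current iterate w = (1/2) w*

X = rng.standard_normal((n, d))        # x ~ N(0, I)
y = a * (X @ w_star)                   # noiseless labels u*(z) = a z, so OPT = 0

# L2^2 loss of the hypothesis u(w.x) = b (w.x); should be ~ (b/2 - a)^2.
loss = np.mean((b * (X @ w) - y) ** 2)

# Surrogate gradient: grad L_sur(w; u) = E[(u(w.x) - y) x].
grad = ((b * (X @ w) - y)[:, None] * X).mean(axis=0)

print(f"L2^2 loss       ~ {loss:.3f} (predicted {(b / 2 - a) ** 2:.3f})")
print(f"grad . (w - w*) ~ {grad @ (w - w_star):.3f} (negative: wrong direction)")
```

On this instance the estimated inner product concentrates around $-(1/2)(b/2 - a) = -0.5 < 0$, matching the calculation in Example 1.5.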
This seemingly insurmountable obstacle is surpassed by observing that we do not, in fact, need the surrogate loss to contain a signal that would guide us toward the target parameters for an arbitrary pair $(u, w)$. Instead, we can restrict our attention to pairs $(u, w)$ such that $u$ is a reasonably good link function for the vector $w$. Ideally, we would like to only consider link functions $u$ that minimize the $L_2^2$ loss for a given, fixed $w$, but it is unclear how to achieve that in a statistically and computationally efficient manner. As a natural approach, we consider link functions that are the best-fit functions in an empirical distribution sense. In particular, given a sample set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ and a parameter $w$, we select a function $\hat{u}_w$ that solves the following (convex) optimization problem:
$$\hat{u}_w \in \operatorname*{argmin}_{u \in \mathcal{U}_{(a,b)}} \sum_{i=1}^m (u(w \cdot x^{(i)}) - y^{(i)})^2. \tag{P}$$
For notational simplicity, we drop the parameter $w_t$ from $\hat{u}_{w_t}$ and use $\hat{u}_t$ instead. It is worth pointing out here that, in general, the problem of finding the best function that minimizes the $L_2^2$ error falls under the category of nonparametric regression, which unfortunately requires exponentially many samples (namely, $\Omega(1/\epsilon^d)$). Fortunately, in our setting, we are looking for the best function that lies in a one-dimensional space. Therefore, instead of looking at all possible directions, we can project all the points of the sample set $S$ to the direction $w$ and find the best-fitting link function efficiently (a concrete sketch is given at the end of this overview). We provide the full details for efficiently solving (P) in Appendix E.

Having settled on the best-fit link functions in the sense of the problem (P), the next obstacle one encounters when trying to prove a sharpness-like result is that neither the $L_2^2$ loss nor its surrogate conveys information about the scale of $w$ and $w^*$. This is because the models determined by $(u, w)$ and $(u/c, cw)$ for some parameter $c > 0$ have the same value of both loss functions. Thus, it seems unlikely that a more traditional local error bound as in (3) can be established in general, for either the surrogate loss or the original $L_2^2$ loss. Instead, we prove a weaker property that establishes strong correlation between the gradient of the empirical surrogate loss $\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t) = (1/m) \sum_{i=1}^m (\hat{u}_t(w_t \cdot x^{(i)}) - y^{(i)}) x^{(i)}$ and the direction $w_t - w^*$, which holds whenever $w_t$ is not an $O(\mathrm{OPT}) + \epsilon$ error solution and which is independent of the scale of $w_t$. This constitutes our key structural result, stated as Proposition 3.1 and discussed in detail in Section 3. We further discuss how this result relates to classical and recent local error bounds in Appendix B.

In addition to this weaker version of a sharpness property, we further prove in Corollary 3.4 that given a parameter $w_t$ and a dataset of $m$ samples from $D$, the activation $\hat{u}_t(w_t \cdot x)$ generated by optimizing the empirical risk on the dataset as in (P) satisfies $\mathbb{E}_{x \sim D_x}[(\hat{u}_t(w_t \cdot x) - u^*(w^* \cdot x))^2] \lesssim \mathrm{OPT} + \epsilon + b^2 \|w_t - w^*\|_2^2$ with high probability. As a result, we can guarantee that when $\|w_t - w^*\|_2$ decreases, the $L_2^2$ distance between $\hat{u}_t$ and $u^*$ diminishes as well. This is crucial, since without such a coupling we would not be able to argue about convergence over both model parameters $u, w$. Leveraging these results, we arrive at an algorithm that alternates between gradient descent-style updates for $w$ and best-fit updates for $u$.
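To illustrate this alternating scheme, below is a minimal sketch (our own illustration under simplifying assumptions, not the paper's implementation, which solves (P) exactly via its one-dimensional structure; see Appendix E). It fits the best link in $\mathcal{U}_{(a,b)}$ at the sampled projections $w \cdot x^{(i)}$ by posing (P) as a small quadratic program (assuming numpy and cvxpy as dependencies) and then takes one step against the empirical surrogate gradient; the grid over scales and the sphere normalization of Algorithm 1 are omitted here:

```python
import numpy as np
import cvxpy as cp

def fit_link(z, y, a, b):
    """Solve (P): least-squares fit over u in U_(a,b), parameterized by the
    values u_i = u(z_i) at the sorted projections z_i = w . x^(i), subject to
    monotonicity, b-Lipschitzness, the a-growth condition on the positive
    region, and u(0) = 0 (pinned by linear interpolation)."""
    order = np.argsort(z)
    zs, ys = z[order], y[order]
    u = cp.Variable(len(zs))
    dz = np.diff(zs)
    cons = [cp.diff(u) >= 0, cp.diff(u) <= b * dz]   # non-decreasing, b-Lipschitz
    pos = np.flatnonzero(zs[:-1] >= 0)               # consecutive pairs with z >= 0
    if len(pos) > 0:
        cons.append(cp.diff(u)[pos] >= a * dz[pos])  # u(z) - u(z') >= a (z - z')
    i = np.searchsorted(zs, 0.0)
    if 0 < i < len(zs):
        lam = -zs[i - 1] / (zs[i] - zs[i - 1])
        cons.append((1 - lam) * u[i - 1] + lam * u[i] == 0)
    cp.Problem(cp.Minimize(cp.sum_squares(u - ys)), cons).solve()
    return zs, u.value

def alternating_step(w, X, y, a, b, eta):
    """One round: refit the link via (P), then take a step against the
    empirical surrogate gradient (1/m) sum_i (u(w . x^(i)) - y^(i)) x^(i)."""
    z = X @ w
    zs, us = fit_link(z, y, a, b)
    u_of_z = np.interp(z, zs, us)    # evaluate the fitted piecewise-linear link
    grad = ((u_of_z - y)[:, None] * X).mean(axis=0)
    return w - eta * grad
```

Since the slope of the piecewise-linear fit between consecutive sorted projections is constrained to $[0, b]$ (and to $[a, b]$ on the positive region), the interpolated function lies in $\mathcal{U}_{(a,b)}$ on the sampled range, which is exactly the feasible set of (P).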
We note in passing that similar alternating updates have been used in classical work on SIM learning in the less challenging, non-agnostic setting (Kakade et al., 2011). In more detail, our algorithm fixes the scale $\beta$ of $\|w_t\|_2$ and alternates between taking a Riemannian gradient descent step on a sphere for $w_t$ with respect to the empirical surrogate loss and solving (P). The unknown scale of the true parameter vector $w^*$ is resolved by applying this approach with $\beta$ chosen from a sufficiently fine grid of the interval $[0, W]$ and employing a testing procedure at the end to select the best parameter vector. Although the idea is simple, the proof is quite technical, as it requires ensuring that the entire process does not accumulate spurious errors arising from the stochastic nature of the problem, adversarial labels, and approximate minimization of the surrogate loss, and, as a result, that it converges to the target error.

Technical Comparison to Gollakota et al. (2023). The only prior work addressing SIM learning (with unknown link functions) in the agnostic model is Gollakota et al. (2023), thus here we provide a technical comparison. While both Gollakota et al. (2023) and our work make use of the surrogate loss function from (2), on a technical level the two works are completely disjoint. Gollakota et al. (2023) uses a framework of omnipredictors to minimize the surrogate loss and then relates this result to the $L_2^2$ loss. Although they handle more general distributions and activations, their learner outputs a hypothesis with error that cannot be considered a constant factor approximation (see (1)) and is improper. By contrast, our work does not seek to minimize the surrogate loss. Instead, our main insight is that the gradient of the surrogate loss at a vector $w$ conveys information about the direction of a target vector $w^*$, for a fixed link function that minimizes the $L_2^2$ loss. We leverage this property to construct a proper learner with constant factor approximation.

2. Preliminaries

Basic Notation. For $n \in \mathbb{Z}_+$, let $[n] := \{1, \ldots, n\}$. We use lowercase boldface characters for vectors. We use $x \cdot y$ for the inner product of $x, y \in \mathbb{R}^d$ and $\theta(x, y)$ for the angle between $x, y$. For $x \in \mathbb{R}^d$ and $k \in [d]$, $x_k$ denotes the $k$-th coordinate of $x$, and $\|x\|_2$ denotes the $\ell_2$-norm of $x$. We use $\mathbb{1}_A = \mathbb{1}\{A\}$ to denote the characteristic function of the set $A$. For vectors $v, u \in \mathbb{R}^d$, we denote by $v^{\perp_u}$ the projection of $v$ onto the subspace orthogonal to $u$, i.e., $v^{\perp_u} := v - ((v \cdot u) u)/\|u\|_2^2$. We use $B(r)$ to denote the $\ell_2$ ball in $\mathbb{R}^d$ of radius $r$, centered at the origin.

Asymptotic Notation. We use the standard $O(\cdot), \Theta(\cdot), \Omega(\cdot)$ asymptotic notation. We use $\widetilde{O}(\cdot)$ to omit polylogarithmic factors in the argument. We use $O_p(\cdot)$ to suppress polynomial dependence on $p$, i.e., $O_p(\omega) = O(\mathrm{poly}(p)\, \omega)$; $\Theta_p(\cdot)$ and $\Omega_p(\cdot)$ are defined similarly. We write $E \gtrsim F$ for two non-negative expressions $E$ and $F$ to denote that there exists some positive universal constant $c > 0$ (independent of the variables or parameters on which $E$ and $F$ depend) such that $E \geq c\, F$. The notation $\lesssim$ is defined similarly.

Probability Notation. We use $\mathbb{E}_{X \sim D}[X]$ for the expectation of a random variable $X$ according to the distribution $D$ and $\Pr[E]$ for the probability of event $E$. For simplicity of notation, we omit the distribution when it is clear from the context. For $(x, y)$ distributed according to $D$, we use $D_x$ to denote the marginal distribution of $x$.

Organization. In Section 3, we establish our main structural result of alignment sharpness.
In Section 4, we describe and analyze our constant factor approximate SIM learner. We conclude the paper in Section 5. The full version of the proofs is deferred to the supplementary material.

3. Main Structural Result: Alignment Sharpness of Surrogate Loss

In this section, we establish our main structural result (Proposition 3.1), which is what crucially enables us to obtain the target $O(\mathrm{OPT}) + \epsilon$ error for the studied problem.

Proposition 3.1 states that the empirical gradient of the surrogate loss positively correlates with the direction of $w_t - w^*$ whenever $w_t$ does not correspond to an $O(\mathrm{OPT}) + \epsilon$ solution, and, moreover, the correlation is proportional to the quantity $\|(w^*)^{\perp_{w_t}}\|_2^2$. This is a key property that is leveraged in our algorithmic result (Theorem 4.2), both in obtaining an $O(\mathrm{OPT}) + \epsilon$ error result and in arguing about the convergence and computational efficiency of our algorithm. Intuitively, what Proposition 3.1 allows us to argue is that as long as the angle between $w_t$ and $w^*$ is not close to zero, we can update $w_t$ to better align it with $w^*$, in the sense that we reduce the angle between these two vectors.

Proposition 3.1 (Alignment Sharpness of the Convex Surrogate). Suppose that $D_x$ is $(L, R)$-well-behaved, $\mathcal{U}_{(a,b)}$ is as in Definition 1.3, and $\epsilon, \delta > 0$. Let $\mu \lesssim a^2 L R^4/b$. Given any $w_t \in B(W)$, denote by $\hat{u}_t$ the optimal solution to (P) with respect to $w_t$ and the sample set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ drawn i.i.d. from $D$. If $m$ satisfies
$$m \gtrsim d\, W^{9/2} b^4 L^{-4} \log^4(d/(\epsilon \delta)) \left(1/\epsilon^{3/2} + 1/(\epsilon \delta)\right),$$
then, with probability at least $1 - \delta$,
$$\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t) \cdot (w_t - w^*) \geq \mu \|(w^*)^{\perp_{w_t}}\|_2^2 - 2(\mathrm{OPT} + \epsilon)/b - 2(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon}) \|w_t - w^*\|_2.$$

To prove Proposition 3.1, we rely on the following key ingredients. In Section 3.1, we prove our main technical lemma (Lemma 3.2), which states that the $L_2^2$ distance between a hypothesis $u(w \cdot x)$ and the target $u^*(w^* \cdot x)$ is bounded below by the misalignment of $w_t$ and $w^*$, i.e., the squared norm of the component of $w^*$ that is orthogonal to $w_t$, $\|(w^*)^{\perp_{w_t}}\|_2^2$. As will become apparent in the proof of Proposition 3.1, the inner product $\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t) \cdot (w_t - w^*)$ can be bounded below as a function of the empirical $L_2^2$ error for $w_t$ and a different (but related) activation $\hat{u}^*_t$, which can in turn be argued to be close to the population $L_2^2$ error for a sufficiently large sample size, using concentration. Thus, Lemma 3.2 can be leveraged to obtain a term scaling with $\|(w^*)^{\perp_{w_t}}\|_2^2$ in the lower bound on $\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t) \cdot (w_t - w^*)$.

In Section 3.2, we characterize structural properties of the population-optimal link functions $u_t$ and $u^*_t$ (see (EP) and (EP*)), which play a crucial role in the proof of Proposition 3.1. Specifically, we show that the activation $u_t$ is close to the idealized activation $u^*_t$ (the optimal activation without noise, given $w_t$) in $L_2^2$ distance (Lemma 3.3). Since by standard uniform convergence results we have that $\hat{u}_t$ and $\hat{u}^*_t$ are close to their population counterparts $u_t$ and $u^*_t$, respectively, Lemma 3.3 certifies that $\hat{u}_t$ is not far away from $\hat{u}^*_t$. This property enables us to replace $\hat{u}_t$ by (the idealized) $\hat{u}^*_t$ in the empirical surrogate gradient $\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t)$, which is easier to analyze, since $\hat{u}^*_t$ is defined with respect to the ideal dataset (with uncorrupted labels).
Finally, as a simple corollary of Lemma 3.3, we prove Corollary 3.4, which gives a clear explanation of why our algorithm, which alternates between updating $w_t$ and $\hat{u}_t$, works: we show that the $L_2^2$ loss between the hypothesis $\hat{u}_t(w_t \cdot x)$ generated by our algorithm and the underlying optimal hypothesis $u^*(w^* \cdot x)$ is bounded above by the distance between $w_t$ and $w^*$. Since our structural sharpness result (Proposition 3.1) enables us to decrease $\|w_t - w^*\|_2$, Corollary 3.4 certifies that choosing the empirically-optimal activation leads to convergence of the hypothesis $\hat{u}_t(w_t \cdot x)$. Equipped with these technical lemmas, we prove our main structural result (Proposition 3.1) in Section 3.3.

3.1. $L_2^2$ Error and Misalignment

Our first key result is Lemma 3.2 below, which plays a critical role in the proof of Proposition 3.1. As discussed in Section 1.2, for two different activations $u$ and $u^*$ and parameters $w$ and $w^*$ such that $w$ and $w^*$ are parallel, even when the $L_2^2$ error is $\Omega(1)$, the gradient $\nabla \mathcal{L}_{\mathrm{sur}}(w; u)$ might not significantly align with the direction of $w - w^*$, and thus cannot provide sufficient information about the direction in which to decrease $\|w - w^*\|_2$. Intuitively, the following lemma shows that this is the only thing that can go wrong, and it happens when $w$ and $w^*$ are parallel. In particular, Lemma 3.2 shows that for any square-integrable link function $f$, we can relate the $L_2^2$ distance $\mathbb{E}_{x \sim D_x}[(f(w \cdot x) - u^*(w^* \cdot x))^2]$ to the magnitude of the component of $w^*$ that is orthogonal to $w$. The full proof of Lemma 3.2 is deferred to Appendix C.2.

Lemma 3.2 (Lower-Bounding $L_2^2$ Error by Misalignment). Let $u^* \in \mathcal{U}_{(a,b)}$, let $D_x$ be $(L, R)$-well-behaved, and let $f: \mathbb{R} \to \mathbb{R}$ be square-integrable with respect to the measure of the distribution $D_x$. Then, for any $w, w^* \in \mathbb{R}^d$, it holds that
$$\mathbb{E}_{x \sim D_x}[(f(w \cdot x) - u^*(w^* \cdot x))^2] \gtrsim a^2 L R^4 \|(w^*)^{\perp_w}\|_2^2.$$

Proof Sketch of Lemma 3.2. Suppose for simplicity that $R = 1$, and let us denote $(f(w \cdot x) - u^*(w^* \cdot x))^2$ by $p(x)$. The statement holds trivially when $w$ is parallel to $w^*$, so assume this is not the case. Let us also assume $w \cdot w^* \geq 0$, as for the case $w \cdot w^* \leq 0$ the proof can be carried out using similar arguments. Define $v = (w^*)^{\perp_w} = w^* - (w^* \cdot w) w/\|w\|_2^2$ and let $\bar{v} = v/\|v\|_2$. Then $w^* = \alpha w + v$, for some $\alpha \geq 0$. Let $V$ be the subspace spanned by $w, v$. Then, considering the event $A = \{w \cdot x \geq 0,\ \bar{v} \cdot x \in (1/16, 1/8) \cup (3/8, 1/2)\}$, we have $\mathbb{E}_{x \sim D_x}[p(x)] \geq \mathbb{E}_{x \sim D_x}[(f(w \cdot x_V) - u^*(w^* \cdot x_V))^2 \mathbb{1}\{A\}]$.

The main idea is to study the relation between the values of $f(w \cdot x)$ and $u^*(w^* \cdot x)$, utilizing the fact that $u^*(z)$ is strictly increasing when $z \geq 0$. In particular, define the following functions indicating the intervals that $f(w \cdot x)$ belongs to: $I_1(x) = \mathrm{sign}(f(w \cdot x) - u^*(\alpha w \cdot x + \|v\|_2/32))$, $I_2(x) = \mathrm{sign}(f(w \cdot x) - u^*(\alpha w \cdot x + \|v\|_2/4))$, and $I_3(x) = \mathrm{sign}(f(w \cdot x) - u^*(\alpha w \cdot x + \|v\|_2))$. Since $u^*$ is non-decreasing, it must be that $I_1(x) I_2(x) \geq 0$ or $I_2(x) I_3(x) \geq 0$. We discuss the cases $I_1(x) I_2(x) \geq 0$ and $I_2(x) I_3(x) \geq 0$, and provide a lower bound for each case respectively.

Consider first $I_1(x), I_2(x) \geq 0$, indicating that $f(w \cdot x) \geq u^*(\alpha w \cdot x + \|v\|_2/4)$. Further restricting $x$ to the band $B = \{w \cdot x \geq 0,\ \bar{v} \cdot x \in (1/16, 1/8)\}$, $B \subseteq A$, we have $u^*(w^* \cdot x) \leq u^*(\alpha w \cdot x + \|v\|_2/8)$. Using the fact that $u^*(z) - u^*(z') \geq a(z - z')$ when $z \geq z' \geq 0$, we have
$$p(x) \mathbb{1}\{B\} = \big(\{f(w \cdot x) - u^*(\alpha w \cdot x + \|v\|_2/4)\} + \{u^*(\alpha w \cdot x + \|v\|_2/4) - u^*(w^* \cdot x)\}\big)^2 \mathbb{1}\{B\} \geq (u^*(\alpha w \cdot x + \|v\|_2/4) - u^*(w^* \cdot x))^2 \mathbb{1}\{B\} \geq (a^2/8^2) \|v\|_2^2\, \mathbb{1}\{B\}.$$
With similar arguments, when both $I_1(x), I_2(x) \leq 0$, it holds that $p(x) \mathbb{1}\{B\} \geq (a^2/32^2) \|v\|_2^2\, \mathbb{1}\{B\}$. Thus, when $I_1(x) I_2(x) \geq 0$ we have $p(x) \mathbb{1}\{B\} \gtrsim a^2 \|v\|_2^2\, \mathbb{1}\{B\}$. For the case $I_2(x) I_3(x) \geq 0$, we instead restrict $x$ to $B' = \{w \cdot x \geq 0,\ \bar{v} \cdot x \in (3/8, 1/2)\}$.
Then, similarly, after discussing the cases $I_2(x), I_3(x) \geq 0$ and $I_2(x), I_3(x) \leq 0$, we get that when $I_2(x) I_3(x) \geq 0$, $p(x) \mathbb{1}\{B'\} \gtrsim a^2 \|v\|_2^2\, \mathbb{1}\{B'\}$. Finally, let $D = (B \cap \{I_1(x) I_2(x) \geq 0\}) \cup (B' \cap \{I_2(x) I_3(x) \geq 0\}) \subseteq A$. Since $D_x$ is $(L, 1)$-well-behaved, the probability density $\gamma_V$ satisfies $\gamma_V(x) \geq L$ when $\|x\| \leq 1$. Therefore, since $V$ is a 2-dimensional plane, using geometric observations, $\Pr[D]$ can be bounded below by $\Pr[D] \geq \Pr[D \cap \{\|x\| \leq 1\}] \geq L/16$. Hence, combining the bounds above and recalling that $b \geq a$, we finally get
$$\mathbb{E}_{x \sim D_x}[p(x)] \geq \mathbb{E}_{x \sim D_x}[p(x) \mathbb{1}\{D\}] \gtrsim a^2 \|v\|_2^2 \Pr[D] \gtrsim a^2 L \|v\|_2^2,$$
completing the proof of Lemma 3.2.

3.2. Closeness of Idealized and Attainable Activations

In this section, we bound the contribution of the error incurred from working with attainable link functions $\hat{u}_t$ in the iterations of the algorithm. The error incurred is due both to the arbitrary noise in the labels and to using a finite sample set. In bounding the error, for analysis purposes, we introduce auxiliary population-level link functions. Concretely, given $w \in B(W)$, a population-optimal activation is a solution to the following stochastic convex program:
$$u_w \in \operatorname*{argmin}_{u \in \mathcal{U}_{(a,b)}} \mathbb{E}_{(x,y)\sim D}[(u(w \cdot x) - y)^2]. \tag{EP}$$
We further introduce auxiliary idealized, noiseless activations, which, given noiseless labels $y^* = u^*(w^* \cdot x)$ and a parameter weight vector $w$, are defined via
$$u^*_w \in \operatorname*{argmin}_{u \in \mathcal{U}_{(a,b)}} \mathbb{E}_{(x,y)\sim D}[(u(w \cdot x) - y^*)^2]. \tag{EP*}$$
Below we relate $u_t := u_{w_t}$ and $u^*_t := u^*_{w_t}$ and show that their $L_2^2$ error for the parameter vector $w_t$ is bounded by $\mathrm{OPT}$. The proof of Lemma 3.3 is deferred to Appendix C.3.

Lemma 3.3 (Closeness of Population-Optimal Activations). Let $w_t \in B(W)$ and let $u^*_t, u_t$ be defined as solutions to (EP*), (EP), respectively. Then,
$$\mathbb{E}_{x \sim D_x}[(u_t(w_t \cdot x) - u^*_t(w_t \cdot x))^2] \lesssim \mathrm{OPT}.$$

As a consequence of the lemma above, we are able to relate $\hat{u}_t$ to the noiseless labels $y^* = u^*(w^* \cdot x)$ by showing that the $L_2^2$ distance between $u^*(w^* \cdot x)$ and the sample-optimal activation $\hat{u}_t(w_t \cdot x)$ is bounded by $\|w_t - w^*\|_2^2$. Although Corollary 3.4 is not used in the proof of Proposition 3.1, we still present it here, as it justifies the mechanism of our approach of alternating between updates for $w_t$ and $\hat{u}_t$. The proof of Corollary 3.4 can be found in Appendix C.4.

Corollary 3.4 (Closeness of Idealized and Attainable Activations). Let $\epsilon, \delta > 0$. Given a parameter $w_t \in B(W)$ and $m \gtrsim d \log^4(d/(\epsilon \delta))\, (b^2 W^3/(L^2 \epsilon))^{3/2}$ samples from $D$, let $\hat{u}_t$ be the sample-optimal activation on these samples given $w_t$, as defined in (P). Then, with probability at least $1 - \delta$,
$$\mathbb{E}_{x \sim D_x}[(\hat{u}_t(w_t \cdot x) - u^*(w^* \cdot x))^2] \leq 3(\epsilon + \mathrm{OPT} + b^2 \|w_t - w^*\|_2^2).$$

3.3. Proof of Proposition 3.1

We now provide a proof sketch for Proposition 3.1, while the detailed proof is deferred to Appendix C.1.

Proof Sketch of Proposition 3.1. Given any weight parameter $w_t \in B(W)$ and $\hat{u}_t$ as defined in the statement, let $u_t$ be the population-optimal activation defined by (EP). Given $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$, define $y^{*(i)} = u^*(w^* \cdot x^{(i)})$ for $i \in [m]$. Further define the following auxiliary problem:
$$\hat{u}^*_w \in \operatorname*{argmin}_{u \in \mathcal{U}_{(a,b)}} \sum_{i=1}^m (u(w \cdot x^{(i)}) - y^{*(i)})^2. \tag{P*}$$
Denote $\hat{u}^*_t := \hat{u}^*_{w_t}$. Observe that (P*) is the empirical version of (EP*). To prove Proposition 3.1, we decompose $\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t) \cdot (w_t - w^*)$ into three summation terms:
$$\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t) \cdot (w_t - w^*) = \underbrace{\frac{1}{m} \sum_{i=1}^m (\hat{u}_t(w_t \cdot x^{(i)}) - \hat{u}^*_t(w_t \cdot x^{(i)})) (w_t - w^*) \cdot x^{(i)}}_{Q_1} + \underbrace{\frac{1}{m} \sum_{i=1}^m (\hat{u}^*_t(w_t \cdot x^{(i)}) - y^{*(i)}) (w_t - w^*) \cdot x^{(i)}}_{Q_2} + \underbrace{\frac{1}{m} \sum_{i=1}^m (y^{*(i)} - y^{(i)}) (w_t - w^*) \cdot x^{(i)}}_{Q_3}. \tag{4}$$
We bound each term $Q_1, Q_2, Q_3$ in (4) separately and defer the proofs of the corresponding claims to Appendix C. We first show that with probability at least $1 - \delta$, we have
$$Q_1 \gtrsim -(\sqrt{\epsilon} + \sqrt{\mathrm{OPT}}) \|w_t - w^*\|_2 - (\mathrm{OPT} + \epsilon)/b. \tag{5}$$
Inequality (5) contains two error terms: the first one is due to noise and the second one is the estimation error. The second term comes from standard concentration results (see Appendix F). The first term comes from replacing $\hat{u}_t$ and $\hat{u}^*_t$ with their population counterparts $u_t$ and $u^*_t$ and combining the Cauchy-Schwarz inequality with Lemma 3.3.

We next show that $Q_2$ is a constant multiple of $\|(w^*)^{\perp_{w_t}}\|_2^2$, which is the main positive contribution among the three summands. In particular, we show that for an absolute constant $C'$, with probability at least $1 - \delta$,
$$Q_2 \geq \frac{C' a^2 L R^4}{b} \|(w^*)^{\perp_{w_t}}\|_2^2 - \sqrt{\epsilon} \|w_t - w^*\|_2 - \epsilon/b. \tag{6}$$
The proof of the above statement is rather technical. We first define an empirical inverse $\hat{f}: \mathbb{R} \to \mathbb{R}$ of the activation $u^*$, as $u^*$ is not strictly increasing and hence $(u^*)^{-1}$ is not well-defined on $\mathbb{R}$. Denote $q(x^{(i)}) := \hat{u}^*_t(w_t \cdot x^{(i)}) - u^*(w^* \cdot x^{(i)})$. Then, adding and subtracting $\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)}))$,
$$Q_2 = \frac{1}{m} \sum_{i=1}^m q(x^{(i)}) \big(w_t \cdot x^{(i)} - \hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)}))\big) + \frac{1}{m} \sum_{i=1}^m q(x^{(i)}) \big(\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)})) - w^* \cdot x^{(i)}\big).$$
Using the optimality conditions of (P*), the first summation above is always non-negative. For the second summation, by the definition of $\hat{f}$, we have that $|\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)})) - w^* \cdot x^{(i)}| \geq (1/b) |q(x^{(i)})|$ and $q(x^{(i)}) (\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)})) - w^* \cdot x^{(i)}) \geq 0$, leading to a lower bound of $(1/(bm)) \sum_{i=1}^m (q(x^{(i)}))^2$. Using Lemma 3.2 and standard concentration arguments, we then obtain Inequality (6).

Finally, using similar arguments as for $Q_1$, we show that with probability at least $1 - \delta$,
$$Q_3 \gtrsim -\sqrt{\mathrm{OPT}}\, \|w^* - w_t\|_2 - (\mathrm{OPT} + \epsilon)/b. \tag{7}$$
The proof is completed by plugging the bounds from Inequalities (5) to (7) back into Equation (4).

4. Robust SIM Learning via Alignment Sharpness

As discussed in Section 1.2, our algorithm can be viewed as employing an alternating procedure: taking a Riemannian gradient descent step on a sphere with respect to the empirical surrogate, given an estimate of the activation, and optimizing the activation function on the sample set for a given parameter weight vector. This procedure is performed using a fine grid of guesses of the scale of $\|w^*\|_2$. For this process to converge with the desired linear rate (even for a known value of $\|w^*\|_2$), the algorithm needs to be properly initialized to ensure that the initial weight vector has a nontrivial alignment with the optimal vector $w^*$. The initialization process is handled in the following subsection.

4.1. Initialization

We begin by showing that the Initialization subroutine stated in Algorithm 2 (see Appendix D.1) returns a point $w_0$ that has a sufficient alignment with $w^*$. As will become apparent later in the proof of Theorem 4.2, this property of the initial point is critical for Algorithm 1 to converge at a linear rate. We defer the more detailed statement of Lemma 4.1 (see Lemma D.1) and its proof to Appendix D.1.

Lemma 4.1 (Initialization). Given $\mu = O_{L,R}(a^2/b)$ and $\epsilon, \delta > 0$, (Initialization) Algorithm 2 draws $m_0 = O_{W,b,1/L,1/\mu}(d/(\delta \epsilon^{3/2}))$ i.i.d. samples from $D$, runs in time $O_{W,b,1/L,1/\mu}(d m_0)$, and with probability at least $1 - \delta$ generates a list of size $t_0 \lesssim (b/\mu)^6 \log(b/\mu)$ that contains a vector $w_0$ such that $\|(w^*)^{\perp_{w_0}}\|_2 \leq \max\{\mu \|w^*\|_2/b,\ (b^2/\mu^3)(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})\}$.

4.2. Optimization

Our main optimization algorithm is summarized in Algorithm 1 (see Algorithm 3 for a more detailed version). We now provide intuition for how guessing the value of $\|w^*\|_2$ is used in the convergence analysis. Let $\bar{w}_t = \|w^*\|_2\, w_t/\|w_t\|_2$, so that $\|\bar{w}_t\|_2 = \|w^*\|_2$, and let $v_t := (w^*)^{\perp_{w_t}}$. Observe that $\|v_t\|_2 = \|\bar{w}_t - w^*\|_2 \cos(\theta(w_t, w^*)/2)$. Applying Proposition 3.1, it can be shown that $\|\bar{w}_{t+1} - w^*\|_2^2 \leq \|\bar{w}_t - w^*\|_2^2 - C \|v_t\|_2^2$ for some constant $C$.
Thus, as long as the angle between $w_t$ and $w^*$ is not too large (ensured by initialization), $\|\bar{w}_t - w^*\|_2 \lesssim \|v_t\|_2$. Hence, we can argue that $\|\bar{w}_t - w^*\|_2$ contracts in each iteration, by observing that $\|\bar{w}_{t+1} - w^*\|_2^2 \lesssim \|v_{t+1}\|_2^2 \leq \|w_{t+1} - w^*\|_2^2$.

Algorithm 1 Optimization
1: Input: $w_{\mathrm{ini}} = 0$; $\epsilon > 0$; positive parameters $a, b, L, R, W, \mu$; step size $\eta$
2: $\{w^{\mathrm{ini}}_0, \ldots, w^{\mathrm{ini}}_{t_0}\} = \mathrm{Initialization}[w_{\mathrm{ini}}]$ (Algorithm 2)
3: $P = \{(w = 0;\ u(z) = 0)\}$
4: for $k = 0$ to $t_0 \lesssim (b/\mu)^6 \log(b/\mu)$ do
5:   for $j = 1$ to $J = W/(\eta \sqrt{\epsilon})$ do
6:     $w^0_{j,k} = w^{\mathrm{ini}}_k$, $\beta_j = j \eta \sqrt{\epsilon}$
7:     for $t = 0$ to $T = O((b/\mu)^2 \log(1/\epsilon))$ do
8:       $\widehat{w}^t_{j,k} = \beta_j\, (w^t_{j,k}/\|w^t_{j,k}\|_2)$
9:       Draw $m = \Theta_{W,b,1/L,1/\mu}(d/\epsilon^{3/2})$ new samples
10:      $\hat{u}^t_{j,k} = \operatorname{argmin}_{u \in \mathcal{U}_{(a,b)}} \sum_{i=1}^m (u(\widehat{w}^t_{j,k} \cdot x^{(i)}) - y^{(i)})^2$
11:      $w^{t+1}_{j,k} = \widehat{w}^t_{j,k} - \eta \nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(\widehat{w}^t_{j,k}; \hat{u}^t_{j,k})$
12:    end for
13:    $P \leftarrow P \cup \{(\widehat{w}^T_{j,k}; \hat{u}^T_{j,k})\}$
14:  end for
15: end for
16: $(\widehat{w}; \hat{u}) = \mathrm{Test}[(w; u) \in P]$ (Algorithm 4)
17: Return: $(\widehat{w}; \hat{u})$

Our main result is the following theorem (see Theorem D.2 for a more detailed statement and proof in Appendix D.2):

Theorem 4.2 (Main Result). Let $D$ be a distribution on $\mathbb{R}^d \times \mathbb{R}$ and suppose that $D_x$ is $(L, R)$-well-behaved. Let $\mathcal{U}_{(a,b)}$ be as in Definition 1.3 and let $\epsilon > 0$. Then, Algorithm 1 uses $N = O_{W,b,1/L,1/\mu}(d/\epsilon^2)$ samples, runs for $O_{W,b,1/\mu}(1/\sqrt{\epsilon})$ iterations, and, with probability at least $2/3$, returns a hypothesis $(\hat{u}, \widehat{w})$, where $\hat{u} \in \mathcal{U}_{(a,b)}$ and $\widehat{w} \in B(W)$, such that
$$L_2^2(\widehat{w}; \hat{u}) = O_{1/L,1/R,b/a}(\mathrm{OPT}) + \epsilon.$$

To prove Theorem 4.2, we make use of the two technical results stated below. First, Lemma 4.3 provides an upper bound on the norm of the empirical gradient of the surrogate loss. The proof of the lemma relies on concentration properties of $(L, R)$-well-behaved distributions $D_x$, and leverages the uniform convergence of the empirically-optimal activations $\hat{u}_t$. A more detailed statement (Lemma D.7) and the proof of Lemma 4.3 are deferred to Appendix D.3.

Lemma 4.3 (Bound on Empirical Gradient Norm). Let $S$ be a set of i.i.d. samples of size $m = \Theta_{W,b,1/L}(d/\epsilon^{3/2} + d/(\epsilon \delta))$. Given any $w_t \in B(W)$, let $\hat{u}_t \in \mathcal{U}_{(a,b)}$ be the solution of the optimization problem (P) with respect to $w_t$ and sample set $S$. Then, with probability at least $1 - \delta$,
$$\|\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(w_t; \hat{u}_t)\|_2^2 \leq 4b^2 \|w_t - w^*\|_2^2 + 10(\mathrm{OPT} + \epsilon).$$

The next claim bounds the $L_2^2$ error of a hypothesis $\hat{u}_w(w \cdot x)$ by the distance between $w$ and $w^*$. We defer a more detailed statement (Claim D.8) and the proof to Appendix D.4.

Claim 4.4. Let $w \in B(W)$ be any fixed vector. Let $\hat{u}_w$ be defined by (P) given $w$ and a sample set of size $m = \Theta_{W,b,1/L}(d/\epsilon^{3/2})$. Then, $\mathbb{E}_{(x,y)\sim D}[(\hat{u}_w(w \cdot x) - y)^2] \leq 8(\mathrm{OPT} + \epsilon) + 4b^2 \|w - w^*\|_2^2$.
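To make the structure of Algorithm 1 concrete, the following fragment (our illustrative sketch, not the authors' code) implements the inner loop (lines 7-12) for one fixed scale guess $\beta$: project the iterate onto the sphere of radius $\beta$, refit the link on fresh samples by solving (P), and step against the empirical surrogate gradient. It assumes a hypothetical sampler `draw_samples` and the hypothetical `fit_link` helper sketched in Section 1.2:

```python
import numpy as np

def inner_loop(w0, beta, draw_samples, fit_link, a, b, eta, T, m):
    """Sketch of lines 7-12 of Algorithm 1 for one scale guess beta."""
    w = w0.copy()
    for _ in range(T):
        w_hat = beta * w / np.linalg.norm(w)   # line 8: renormalize to the sphere
        X, y = draw_samples(m)                 # line 9: fresh i.i.d. samples
        z = X @ w_hat
        zs, us = fit_link(z, y, a, b)          # line 10: solve (P) for the link
        u_of_z = np.interp(z, zs, us)
        grad = ((u_of_z - y)[:, None] * X).mean(axis=0)
        w = w_hat - eta * grad                 # line 11: empirical surrogate step
    w_final = beta * w / np.linalg.norm(w)
    return w_final, fit_link(X @ w_final, y, a, b)  # pair added to P (line 13)
```

The outer loops of Algorithm 1 would call this routine for every initialization $w^{\mathrm{ini}}_k$ and every grid point $\beta_j = j\eta\sqrt{\epsilon}$, and select among the returned pairs using the testing procedure of line 16.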
Proof Sketch of Theorem 4.2. For this sketch, we consider the case $\|w^*\|_2 \gtrsim (b^3/\mu^4)(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$, so that the initialization subroutine generates a point $w^{\mathrm{ini}}_k \in \{w^{\mathrm{ini}}_k\}_{k=1}^{t_0}$ such that $\|(w^*)^{\perp_{w^{\mathrm{ini}}_k}}\|_2 \leq \mu \|w^*\|_2/(4b)$, by Lemma 4.1. Fix this initialized parameter $w^0_{j,k} = w^{\mathrm{ini}}_k$ at step $k$ and drop the subscript $k$ for simplicity. Since we constructed a grid of width $\eta \sqrt{\epsilon}$, there exists an index $j^*$ such that $|\beta_{j^*} - \|w^*\|_2| \leq \eta \sqrt{\epsilon}$. We consider the intermediate for-loop at this iteration $j^*$, and show that the inner loop with normalizer $\beta_{j^*}$ outputs a solution with error $O(\mathrm{OPT}) + \epsilon$. This solution can then be picked using standard testing procedures. We now focus on the iteration $j^*$, and drop the subscript $j^*$ for notational simplicity. Let $\bar{w}_t = \|w^*\|_2 (w_t/\|w_t\|_2)$ and denote $v_t := (w^*)^{\perp_{\widehat{w}_t}}$. Expanding $\|w_{t+1} - w^*\|_2^2$ and applying Proposition 3.1 and Lemma 4.3, we get
$$\|w_{t+1} - w^*\|_2^2 = \|\widehat{w}_t - \eta \nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(\widehat{w}_t; \hat{u}_t) - w^*\|_2^2 = \|\widehat{w}_t - w^*\|_2^2 + \eta^2 \|\nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(\widehat{w}_t; \hat{u}_t)\|_2^2 - 2\eta\, \nabla \widehat{\mathcal{L}}_{\mathrm{sur}}(\widehat{w}_t; \hat{u}_t) \cdot (\widehat{w}_t - w^*)$$
$$\leq \|\widehat{w}_t - w^*\|_2^2 + \eta^2 \big(10(\mathrm{OPT} + \epsilon) + 4b^2 \|\widehat{w}_t - w^*\|_2^2\big) + 2\eta \big(2(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon}) \|\widehat{w}_t - w^*\|_2 - \mu \|v_t\|_2^2\big) + 4\eta (\mathrm{OPT} + \epsilon)/b$$
$$\leq (1 + 4\eta^2 b^2) \|\bar{w}_t - w^*\|_2^2 + (24\eta^2 + 4\eta/b)(\mathrm{OPT} + \epsilon) + 2\eta \big(2(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon}) \|\bar{w}_t - w^*\|_2 - \mu \|v_t\|_2^2\big), \tag{8}$$
where in the last inequality we used $\|\widehat{w}_t - \bar{w}_t\|_2 \leq \eta \sqrt{\epsilon}$. Since $\bar{w}_t$ and $w^*$ are on the same sphere, $\|\bar{w}_t - w^*\|_2 \lesssim \|v_t\|_2$. In particular, letting $\rho_t = \|v_t\|_2/\|w^*\|_2$, we have $\|\bar{w}_t - w^*\|_2^2 \leq (1 + \rho_t^2) \|v_t\|_2^2 \leq 2\|v_t\|_2^2$. Recall that the algorithm is started from $w_0$ satisfying $\rho_0 \leq \mu/(4b)$. If $\rho_t \leq \mu/(4b)$, then $\|\bar{w}_t - w^*\|_2^2 \leq (1 + (\mu/(4b))^2) \|v_t\|_2^2$. Assuming in addition that $\|v_t\|_2 \geq (1/\mu)(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$, and choosing the step size $\eta = \mu/(4b^2)$, (8) implies that $\|v_{t+1}\|_2^2 \leq \|w_{t+1} - w^*\|_2^2 \leq (1 - \mu^2/(32b^2)) \|v_t\|_2^2$, and thus, in addition, $\rho_{t+1} \leq \mu/(4b)$. Therefore, by an inductive argument, we show that as long as $w_t$ is still far from $w^*$, i.e., $\|v_t\|_2 \geq (1/\mu)(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$, we have $\|v_{t+1}\|_2^2 \leq (1 - \mu^2/(32b^2)) \|v_t\|_2^2$ and $\rho_{t+1} \leq \mu/(4b)$. Hence, after $T = O((b^2/\mu^2) \log(1/\epsilon))$ iterations, it must be that $\|v_T\|_2 \leq (1/\mu)(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$, which implies $\|\bar{w}_T - w^*\|_2^2 \leq 2\|v_T\|_2^2 = O(\mathrm{OPT}) + \epsilon$. Finally, by Claim 4.4, the hypothesis $\hat{u}_T(\widehat{w}_T \cdot x)$ achieves $L_2^2$ error $O(\mathrm{OPT}) + \epsilon$, which completes the proof.

5. Conclusion

We presented the first constant-factor approximate SIM learner in the agnostic model, for the class of $(a, b)$-unbounded link functions under mild distributional assumptions. Immediate questions for future research involve extending these results to other classes of link functions. More specifically, our results require that $b/a$ is bounded by a constant. It is an open question whether the constant-factor approximation result in the agnostic model can be extended to all $b$-Lipschitz functions (with $a = 0$). This question is open even when the link function is known to the learner.

Acknowledgements

NZ was supported in part by NSF Medium Award CCF-2107079 and NSF Award DMS-2023239. PW was supported in part by NSF Award DMS-2023239. ID was supported by NSF Medium Award CCF-2107079 and a DARPA Learning with Less Labels (LwLL) grant. JD was supported by the U.S. Office of Naval Research under award number N00014-22-1-2348.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Awasthi, P., Tang, A., and Vijayaraghavan, A. Agnostic learning of general ReLU activation using gradient descent. In The Eleventh International Conference on Learning Representations, ICLR, 2023.

Bhojanapalli, S., Neyshabur, B., and Srebro, N. Global optimality of local search for low rank matrix recovery. Advances in Neural Information Processing Systems, 29, 2016.

Bolte, J., Nguyen, T. P., Peypouquet, J., and Suter, B. W. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471-507, 2017.

Dalalyan, A. S., Juditsky, A., and Spokoiny, V. A new algorithm for estimating the effective dimension-reduction subspace. The Journal of Machine Learning Research, 9:1647-1678, 2008.

Diakonikolas, I., Goel, S., Karmalkar, S., Klivans, A. R., and Soltanolkotabi, M. Approximation schemes for ReLU regression. In Conference on Learning Theory, COLT, volume 125 of Proceedings of Machine Learning Research, pp. 1452-1485. PMLR, 2020a.
Diakonikolas, I., Kane, D. M., and Zarifis, N. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems, NeurIPS, 2020b.

Diakonikolas, I., Kontonis, V., Tzamos, C., and Zarifis, N. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT, 2020c.

Diakonikolas, I., Kane, D. M., Pittas, T., and Zarifis, N. The optimality of polynomial regression for agnostic learning under Gaussian marginals in the SQ model. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.

Diakonikolas, I., Kane, D., Manurangsi, P., and Ren, L. Hardness of learning a single neuron with adversarial label noise. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), 2022a.

Diakonikolas, I., Kane, D. M., Kontonis, V., Tzamos, C., and Zarifis, N. Learning general halfspaces with general Massart noise under the Gaussian distribution. In STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, pp. 874-885. ACM, 2022b.

Diakonikolas, I., Kontonis, V., Tzamos, C., and Zarifis, N. Learning a single neuron with adversarial label noise via gradient descent. In Conference on Learning Theory (COLT), pp. 4313-4361, 2022c.

Diakonikolas, I., Kane, D. M., and Ren, L. Near-optimal cryptographic hardness of agnostically learning halfspaces and ReLU regression under Gaussian marginals. In ICML, 2023.

Dudeja, R. and Hsu, D. Learning single-index models in Gaussian space. In Conference on Learning Theory, COLT, volume 75 of Proceedings of Machine Learning Research, pp. 1887-1930. PMLR, 2018.

Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer, 2003.

Frei, S., Cao, Y., and Gu, Q. Agnostic learning of a single neuron with gradient descent. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

Goel, S., Gollakota, A., and Klivans, A. R. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

Gollakota, A., Gopalan, P., Klivans, A. R., and Stavropoulos, K. Agnostically learning single-index models using omnipredictors. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Härdle, W., Müller, M., Sperlich, S., Werwatz, A., et al. Nonparametric and semiparametric models, volume 1. Springer, 2004.

Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

Hoffman, A. J. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49:263-265, 1952.

Hristache, M., Juditsky, A., and Spokoiny, V. Direct estimation of the index coefficient in a single-index model. Annals of Statistics, pp. 595-623, 2001.

Ichimura, H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1-2):71-120, 1993.

Jin, C., Ge, R., Netrapalli, P., Kakade, S., and Jordan, M. How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724-1732. PMLR, 2017.

Kakade, S. M., Kanade, V., Shamir, O., and Kalai, A. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.
Kalai, A. T. and Sastry, R. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795-811, 2016.

Kearns, M., Schapire, R., and Sellie, L. Toward efficient agnostic learning. Machine Learning, 17(2/3):115-141, 1994.

Liu, J., Cui, Y., and Pang, J.-S. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Mathematics of Operations Research, 2022.

Łojasiewicz, S. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87-89, 1963.

Łojasiewicz, S. Sur la géométrie semi- et sous-analytique. In Annales de l'institut Fourier, volume 43, pp. 1575-1595, 1993.

Lu, C. and Hochbaum, D. S. A unified approach for a 1D generalized total variation problem. Mathematical Programming, 194(1-2):415-442, 2022.

Manurangsi, P. and Reichman, D. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207, 2018.

Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747-2774, 2018.

Roulet, V. and d'Aspremont, A. Sharpness, restart and acceleration. Advances in Neural Information Processing Systems, 30, 2017.

Síma, J. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709-2728, 2002.

Wang, P., Zarifis, N., Diakonikolas, I., and Diakonikolas, J. Robustly learning a single neuron via sharpness. In 40th International Conference on Machine Learning, 2023.

Zhang, H. and Yin, W. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.

Zheng, Q. and Lafferty, J. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.

Supplemental Material

Organization. The appendix is organized as follows. In Appendix A, we highlight some useful properties of the distribution class and the activation class. Appendix B provides a detailed introduction to the notion of local error bounds and its relation to our alignment sharpness structural result. In Appendix C, we provide detailed proofs that are omitted from Section 3, and in Appendix D we complete the proofs omitted from Section 4. In Appendix E, we provide a detailed discussion of computing the optimal empirical activation. Finally, in Appendix F we state and prove the standard uniform convergence results that are used throughout the paper.

A. Remarks about the Distribution Class and the Activation Class

In this section, we show that without loss of generality we can assume that the parameters $L, R$ in the distributional assumptions (Definition 1.2) are at most 1, while the parameters $a, b$ of the activation functions (see Definition 1.3) can be taken to satisfy $a \leq 1 \leq b$.

Remark A.1 (Distribution/Activation Parameters, Definition 1.2 & Definition 1.3). We observe that if a distribution $D_x$ is $(L, R)$-well-behaved, then it is also $(L', R')$-well-behaved for any $0 < L' \leq L$, $0 < R' \leq R$. Hence, it is without loss of generality to assume that $L, R \in (0, 1]$. Similarly, if an activation is $(a, b)$-unbounded, it is also an $(a, b')$-unbounded activation for any $b' \geq b$. Thus, we assume that $b \geq 1$. We can similarly assume $a \leq 1$.
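As a concrete instance of Definition 1.2 (a worked check we add for illustration; the constants below are ours, not from the paper), one can verify that the standard Gaussian $\mathcal{N}(0, I_d)$ is $(L, R)$-well-behaved with, e.g., $R = 1$ and any $L \leq 1/(2\pi\sqrt{e})$:

```latex
% Any projection of N(0, I_d) onto a 2-dimensional subspace V is N(0, I_2), so
\gamma_V(x_V) = \tfrac{1}{2\pi}\, e^{-\|x_V\|_2^2/2}.
% Anti-anti-concentration with R = 1: for all \|x_V\|_2 \le 1,
\gamma_V(x_V) \ \ge\ \tfrac{1}{2\pi}\, e^{-1/2} \ \ge\ L .
% Concentration: (t - L)^2/2 \ge 0 gives e^{-t^2/2} \le e^{L^2/2} e^{-Lt}, hence
\gamma_V(x_V) \ \le\ \tfrac{e^{L^2/2}}{2\pi}\, e^{-L\|x_V\|_2}
            \ \le\ \tfrac{1}{L}\, e^{-L\|x_V\|_2},
% where the last step uses L e^{L^2/2} \le 2\pi, valid for all L \le 1.
```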
In addition, we remark that $(L, R)$-well-behaved distributions are sub-exponential.

Remark A.2 (Sub-exponential Tails of Well-Behaved Distributions, Definition 1.2). Definition 1.2 might seem abstract, but to put it plainly, it implies that the random variable $x$ has a $1/L$-sub-exponential tail, and that the pdf of the projected random variable $x_V$ onto the space $V$ is lower bounded by $L$. To see the first statement, given any unit vector $p$, let $x_p$ be the projection of $x$ onto the one-dimensional linear space $V_p = \{z \in \mathbb{R}^d : z = tp,\ t \in \mathbb{R}\}$, i.e., $x_p = p \cdot x$. Then, by the anti-concentration and concentration property, we have
$$\Pr[|p \cdot x| \geq r] = \Pr[|x_p| \geq r] \leq \int_{|x| \geq r} \gamma(x)\, dx \leq 2 \int_r^{\infty} \frac{1}{L} \exp(-Lx)\, dx = \frac{2}{L^2} \exp(-Lr),$$
which implies that $x$ possesses a sub-exponential tail.

B. Local Error Bounds and Alignment Sharpness

Given a generic optimization problem $\min_w f(w)$ and a non-negative residual function $r(w)$ measuring the approximation error of the optimization problem, we say that the problem satisfies a local error bound if, in some neighborhood of the set of test (typically optimal) solutions $\mathcal{W}^*$, we have that
$$r(w) \geq (\mu/\nu)\, \mathrm{dist}(w, \mathcal{W}^*)^{\nu}. \tag{9}$$
In other words, a low value of the residual function implies that $w$ must be close to the test set $\mathcal{W}^*$. Local error bounds have been studied in the optimization literature for decades, starting with the seminal works of Hoffman (1952) and Łojasiewicz (1963); see, e.g., Chapter 6 in Facchinei & Pang (2003) for an overview of classical results, and (Bolte et al., 2017; Karimi et al., 2016; Roulet & d'Aspremont, 2017; Mei et al., 2018; Liu et al., 2022) and references therein for a more contemporary overview. While local error bounds can be shown to hold generically under fairly minimal assumptions on $f$ and for $r(w) = f(w) - \min_{w'} f(w')$ (Łojasiewicz, 1963; 1993), it is rarely the case that they can be ensured with a parameter $\mu$ that is not trivially small. On the other hand, learning problems often possess very strong structural properties that can lead to stronger local error bounds. There are two main such examples we are aware of, where local error bounds can be shown to hold with $\nu = 2$ and an absolute constant $\mu > 0$. The first example is low-rank matrix problems such as matrix completion and matrix sensing, which are unrelated to our work (Bhojanapalli et al., 2016; Zheng & Lafferty, 2016; Jin et al., 2017). More relevant to our work are the recent results in (Mei et al., 2018; Wang et al., 2023), which proved local error bounds of the form
$$r(w) \geq \frac{\mu}{2}\, \mathrm{dist}(w, \mathcal{W}^*)^2 \tag{10}$$
for problems more restricted than ours: Mei et al. (2018) only dealt with additive zero-mean noise, were given knowledge of the activation, and in addition assumed that the marginal $D_x$ is sub-Gaussian;
Compared to more traditional local error bounds and the bound from (Wang et al., 2023), which bound below the residual error function as a function of the distance to W , this is a much weaker local error bound since it does not distinguish between vectors of varying magnitudes along the direction of w . Since our lower bound is related to the sharpness notion studied in (Wang et al., 2023), we refer to it as the alignment sharpness to emphasize that it only relates the misalignment (as opposed to the distance) of vectors wt and w to the residual error. To the best of our knowledge, such a form of a local error bound, which only bounds the alignment of vectors as opposed to their distance, is novel. We expect it to find a more broader use in learning theory and optimization. C. Omitted Proofs from Section 3 C.1. Proof of Proposition 3.1 This subsection is devoted to the full version of the proof of Proposition 3.1. Proposition C.1 (Alignment Sharpness of the Convex Surrogate). Suppose that Dx is (L, R)-well-behaved, U(a,b) is as in Definition 1.3, and ϵ, δ > 0. Let µ a2LR4/b. Given any wt B(W), denote by ˆut the optimal solution to (P) with respect to wt and the sample set S = {(x(i), y(i))}m i=1 drawn i.i.d. from D. If m satisfies m d W 9/2b4L 4 log4(d/(ϵδ))(1/ϵ3/2 + 1/(ϵδ)) , then, with probability at least 1 δ, b Lsur(wt; ˆut) (wt w ) µ (w ) wt 2 2 2(OPT + ϵ)/b 2( OPT + ϵ) wt w 2 . Proof of Proposition 3.1. Given any weight parameter wt B(W) and ˆut from the lemma statement, let ut be the population-optimal activation, which recall was defined as the solution of the optimization problem (EP). Given a sample set S = {(x(i), y(i))}m i=1, suppose, hypothetically, that we could construct a new sample set S using x(i) s from S but with the true labels without noise: S = {(x(i), y (i))}m i=1, y (i) = u (w x(i)). For this reason, we define the following sub-problem, to use as reference: ˆu w argmin u U(a,b) i=1 (u(w x(i)) y (i))2. (P*) For a parameter wt, we will denote ˆu wt by ˆu t for simplicity. Recall that the population version of ˆu t was defined by (EP*). To prove Proposition 3.1, we decompose b Lsur(wt; ˆut) (wt w ) into three summation terms, as follows. b Lsur(wt; ˆut) (wt w ) i=1 (ˆut(wt x(i)) y(i))(wt w ) x(i) (ˆut(wt x(i)) ˆu t(wt x(i)))(wt w ) x(i) + (ˆu t(wt x(i)) y (i))(wt w ) x(i) i=1 (y (i) y(i))(wt x(i) w x(i)). (11) 2A local error bound of this form was first used in (Zhang & Yin, 2013) under the name restricted secant inequality. Robustly Learning Single-Index Models via Alignment Sharpness We tackle each term in (11) separately, using the following arguments. Because the proofs of these claims are technical, we defer them to later subsections in Appendix C. The first claim stated that the first summation in (11) is of order ( ϵ + OPT) wt w 2 + (OPT + ϵ)/b with high probability. Claim C.2. Let S = {(x(i), y(i))}m i=1 be m i.i.d. samples from D where m is as specified in the statement of Proposition 3.1. Let ˆut be the solution of optimization problem (P) given wt B(W) and S. Furthermore, denote the uncorrupted version of S by S = {(x(i), y (i))}m i=1, where y (i) = u (w x(i)). Let ˆu t be the solution of problem (P*). Then, with probability at least 1 δ, it holds i=1 ((ˆut(wt x(i)) ˆu t(wt x(i)))(wt w ) x(i) ( ϵ + OPT) wt w 2 (ϵ + OPT)/b . 
The proof of Claim C.2 proceeds along the following lines. First, as we show in Appendix F, by standard uniform convergence results the sample-optimal activations $\hat{u}_t$ and $\hat{u}^*_t$ are close to their population counterparts, $u_t$ and $u^*_t$, in $L_2^2$ distance. Therefore, applying Chebyshev's inequality, we are able to swap the sample-optimal activations in (11) for their population-optimal counterparts with high probability. The proof thus boils down to lower bounding the right-hand side of
$$\frac{1}{m} \sum_{i=1}^m (\hat{u}_t(w_t \cdot x^{(i)}) - \hat{u}^*_t(w_t \cdot x^{(i)})) (w_t - w^*) \cdot x^{(i)} \geq -\mathrm{error}_1 + \frac{1}{m} \sum_{i=1}^m (u_t(w_t \cdot x^{(i)}) - u^*_t(w_t \cdot x^{(i)})) (w_t - w^*) \cdot x^{(i)}.$$
Recall that in Lemma 3.3 we showed that $\mathbb{E}_{x \sim D_x}[(u_t(w_t \cdot x) - u^*_t(w_t \cdot x))^2] \lesssim \mathrm{OPT}$; thus, using Chebyshev's inequality, we can further show that the second term on the right-hand side of the inequality above can be lower bounded by some small error terms as well, completing the proof of the claim.

The second claim implements the misalignment lemma (Lemma 3.2) and shows that the second summation term in (11) is essentially a constant multiple of $\|(w^*)^{\perp_{w_t}}\|_2^2$, which is the dominant positive term among the three summations.

Claim C.3. Let $S^* = \{(x^{(i)}, y^{*(i)})\}_{i=1}^m$ be a sample set such that the $x^{(i)}$'s are $m$ i.i.d. samples from $D_x$, and $y^{*(i)} = u^*(w^* \cdot x^{(i)})$ for each $i$. Let $m$ be the value specified in the statement of Proposition 3.1. Then, given a parameter $w_t \in B(W)$, with probability at least $1 - \delta$ it holds that
$$\frac{1}{m} \sum_{i=1}^m (\hat{u}^*_t(w_t \cdot x^{(i)}) - y^{*(i)}) (w_t - w^*) \cdot x^{(i)} \geq \frac{C a^2 L R^4}{b} \|(w^*)^{\perp_{w_t}}\|_2^2 - \sqrt{\epsilon} \|w_t - w^*\|_2 - \epsilon/b,$$
where $C$ is an absolute constant.

The proof of Claim C.3 is rather technical. We first define an empirical inverse of the activation $u^*$, and denote it by $\hat{f}$. Note that $u^*(z) \in \mathcal{U}_{(a,b)}$ is not necessarily strictly increasing when $z \leq 0$, so $(u^*)^{-1}$ is not defined everywhere on $\mathbb{R}$, and the introduction of this empirical inverse function $\hat{f}$ is needed. Then, adding and subtracting $\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)}))$ in the $w_t \cdot x^{(i)} - w^* \cdot x^{(i)}$ term, we get
$$\frac{1}{m} \sum_{i=1}^m (\hat{u}^*_t(w_t \cdot x^{(i)}) - u^*(w^* \cdot x^{(i)})) (w_t - w^*) \cdot x^{(i)} = \frac{1}{m} \sum_{i=1}^m (\hat{u}^*_t(w_t \cdot x^{(i)}) - u^*(w^* \cdot x^{(i)})) \big(w_t \cdot x^{(i)} - \hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)}))\big) + \frac{1}{m} \sum_{i=1}^m (\hat{u}^*_t(w_t \cdot x^{(i)}) - u^*(w^* \cdot x^{(i)})) \big(\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)})) - w^* \cdot x^{(i)}\big).$$
Studying the KKT conditions of the optimization problem (P*), we obtain the critical observation that the first term in the equation above is always non-negative. Then, observe that by our definition, $\hat{f}$ behaves like the inverse function $(u^*)^{-1}$ and has the property that $|\hat{f}(\hat{u}^*_t(w_t \cdot x^{(i)})) - w^* \cdot x^{(i)}| \geq (1/b)\, |\hat{u}^*_t(w_t \cdot x^{(i)}) - u^*(w^* \cdot x^{(i)})|$, as well as
Plugging the bounds from Claim C.2, Claim C.3, and Claim C.4 back into (11), we get that with probability at least 1 3δ, b Lsur(wt; ˆut) (wt w ) Ca2LR4 b (w ) wt 2 2 2( OPT + ϵ) wt w 2 2(OPT + ϵ)/b, for some absolute constant C, and the proof is now complete. C.2. Proof of Lemma 3.2 We restate and prove Lemma 3.2. Lemma C.5 (Lower-Bounding L2 2 Error by Misalignment). Let u U(a,b), Dx be (L, R)-well-behaved, and f : R 7 R be square-integrable with respect to the measure of the distribution Dx. Then, for any w, w Rd, it holds that E x Dx[(f(w x) u (w x))2] a2LR4 (w ) w 2 2 . Proof. The statement holds trivially if w is parallel to w , so assume this is not the case. Let v = (w ) w = w (w w)w/ w 2 2. Suppose first that w w 0. Then w = αw + v, for some α > 0. Let V be the subspace spanned by w, v. Then, E x Dx[(f(w x) u (w x))2] = E x Dx[(f(w x V ) u (w x V ))2] E x Dx[(f(w x V ) u (w x V ))21{x V A}] , for any A Rd. For ease of notation, we drop the subscript V , and we assume that all x are projected to the subspace V . We denote by w = w/ w 2 (resp. v = v/ v 2) the unit vector in the direction of w (resp. v). We choose A = {w x 0, v x (R/16, R/8) (3R/8, R/2)}. The idea of the proof is to utilize the non-decreasing property of u and the fact that the marginal distribution Dx is anti-concentrated on the subspace S. In short, for any | v x| R, by the non-decreasing property of u we know that f(w x) falls into one of the following four intervals: ( , u (αw x + v 2R/32)], (u (αw x + v 2R/32), u (αw x + v 2R/4)], (u (αw x + v 2R/4), u (αw x + v 2R)], (u (αw x + v 2R), + ). When f(w x) belongs to any of the intervals above, we can show that with some constant probability, the difference between w x and w x is proportional to v 2 and hence u (w x) is far away from f(w x), due to the well-behaved property of the marginal Dx. To indicate that f(w x) belongs to one of the intervals above, denote I1(x) = f(w x) u (αw x + v 2R/32), I2(x) = f(w x) u (αw x + v 2R/4), I3(x) = f(w x) u (αw x + v 2R). Robustly Learning Single-Index Models via Alignment Sharpness For any x Rd, using the assumption that u is non-decreasing, we have that I1(x) I2(x) I3(x); as a consequence, it must be that I1(x)I2(x) 0 or I2(x)I3(x) 0. Figure 1: Under the assumption that v x (R/16, R/8), and I1(x) 0, I2(x) 0, the distance between f(w x) and u (w x) is at least |u (αw x + v 2R/4) u (w x)| a v 2R/8. Case 1: f(w x) (u (αw x + v 2R/4), ). Then I1(x) I2(x) 0. Let B := {w x 0, v x (R/16, R/8)} and notice that B A. We have that when x B, u (w x) = u (αw x + v 2 v x) (u (αw x + v 2R/16), u (αw x + v 2R/8)), thus we can conclude that (f(w x) u (w x))21{x B} = {f(w x) u (αw x + v 2R/4)} + {u (αw x + v 2R/4) u (w x)} 21{x B} (u (αw x + v 2R/4) u (w x))21{x B} , where in the last inequality we used that I2(x) = f(w x) u (αw x + v 2R/4) 0 and u (αw x + v 2R/4) u (w x) 0 by the non-decreasing property of u and that if a, b 0 then (a + b)2 max(a, b)2. Further, using u (t) u (t ) a(t t ) for t t 0 (which holds by assumption) and w = αw + v, we have (u (αw x + v 2R/4) u (w x))21{x B} a2( v 2R/4 v x)21{x B} a2 v 2 2(R/8)21{x B} , where in the last inequality we used that 0 v x R/8 (by the definition of the event B). A visual explanation of the result above is displayed in Figure 1. Case 2: f(w x) ( , u (αw x + v 2R/32)). Then 0 I1(x) I2(x). We follow a similar argument as in the previous case. In particular, we begin with (f(w x) u (w x))21{x B} = {f(w x) u (αw x + v 2R/32)} + {u (αw x + v 2R/32) u (w x)} 21{x B}. 
(12) Note that I1(x) 0 and u (w x) = u (αw x + v 2 v x) u (αw x + v 2R/32) since v x R/16 R/32 for x B, thus the two terms in curly brackets in (12) have the same sign and we further have: (f(w x) u (w x))21{x B} (u (αw x + v 2R/32) u (w x))21{x B} a2 v 2 2(R/32)21{x B}, where in the first inequality we used the fact that (a + b)2 max{a2, b2} when both a, b 0. By Case 1 and Case 2, we can conclude that when I1(x)I2(x) 0, it must be: (f(w x) u (w x))21{x B} a2 v 2 2R2/2101{x B}. (13) Robustly Learning Single-Index Models via Alignment Sharpness Case 3: f(w x) (u (αw x + v 2R), + ). Then I2(x) I3(x) 0 and we choose B = {w x 0, v x (3R/8, R/2)}. Following the same reasoning as in the previous two cases, we have (f(w x) u (w x))21{x B } = {f(w x) u (αw x + v 2R)} + {u (αw x + v 2R) u (w x)} 21{x B } (u (αw x + v 2R) u (w x))21{x B } a2 v 2 2(R/2)21{x B }. Case 4: f(w x) ( , u (αw x + v 2R/4)). Then 0 I2(x) I3(x). It follows that (f(w x) u(w x))21{x B } = {f(w x) u (αw x + v 2R/4)} + {u (αw x + v 2R/4) u(w x)} 21{x B } a2 v 2 2(R/8)21{x B }. Thus, we conclude from Case 3 and Case 4 that when I2(x)I3(x) 0, we have (f(w x) u (w x))21{x B } a2 v 2 2(R2/64)1{x B }. (14) Recall that for any x, at least one of the inequalities I1(x)I2(x) 0 or I2(x)I3(x) 0 happens, thus, 1{I1(x)I2(x) 0} 1 1{I2(x)I3(x) 0}. Therefore, the probability mass of the region (B {I1(x)I2(x) 0}) (B {I2(x)I3(x) 0}) can be lower bounded by: Pr x (B {I1(x)I2(x) 0}) (B {I2(x)I3(x) 0}) 1{x B}1{I1(x)I2(x) 0} + 1{x B }1{I2(x)I3(x) 0} γ(x) dx 1{x B}1{I1(x)I2(x) 0} + 1{x B }1{I2(x)I3(x) 0} L dx 1{x B} + (1{x B } 1{x B})1{I2(x)I3(x) 0} dx, (15) where in the first inequality we used the assumption that Dx is (L, R)-well-behaved. As a visual explanation of the lower bound above, we include the following Figure 2. To finish bounding the probability in (15), it remains to bound the integral from its final inequality, which now does not involve the pdf anymore, as we used the anti-concentration property of Dx to uniformly bound below γ(x). Recall that by definition, I1(x), I2(x), I3(x) are functions of w x that do not depend on v x. Denote the projection of x on the standard basis of space V by x w = w x and x v = v x. Then, we have: 1{x B } 1{x B} 1{I2(x)I3(x) 0} dx dx v1{x w 0, I2(x)I3(x) 0} dx w |x w| R 1{x w 0, I2(x)I3(x) 0} dx w Robustly Learning Single-Index Models via Alignment Sharpness Figure 2: On the 2-dimensional space V spanned by (xv, xw), at each point x B B , it must be that I1(x)I2(x) 0 or I2(x)I3(x) 0. Γ1 denotes the interval of xw = w x such that f(w x) u (αw x + v 2R), hence both I1(x)I2(x) 0, I2(x)I3(x) 0; Γ2 denotes the interval of xw such that f(w x) (u (αw x + v 2R/32), u (αw x + v 2R/4)), hence I2(x)I3(x) 0; finally, Γ3 denotes the interval of xw such that f(w x) (u (αw x + v 2R/4), u (αw x + v 2R/)), hence I1(x)I2(x) 0. The area of the union of the red and blue regions is the lower bound on the probability in (15). As displayed in the figure, the sum of the blue and red region is lower bounded by 1{x B} + (1{x B } 1{x B})1{I2(x)I3(x) 0}. Plugging the inequality above back into (15), we get: Pr x B {I1(x)I2(x) 0} + B {I2(x)I3(x) 0} V, x R 1{w x 0, v x (R/16, R/8)} dx = L ZZ (1{x w (0, R)} dx w)1{x v (R/16, R/8)} dx v = LR2/16. (16) We are now ready to provide a lower bound on the L2 2 distance between f(w x) and u (w x). 
Combining the inequalities from (13) and (14), we get
\begin{align*}
\mathbb{E}_{x \sim \mathcal{D}_x}\big[(f(w \cdot x) - u^*(w^* \cdot x))^2\big] &\ge \mathbb{E}_{x \sim \mathcal{D}_x}\big[(f(w \cdot x_V) - u^*(w^* \cdot x_V))^2\, \mathbb{1}\{x_V \in A\}\big] \\
&\ge a^2 (R^2/1024)\, \|v\|_2^2\; \mathbb{E}_{x \sim \mathcal{D}_x}\Big[ \mathbb{1}\big\{ x_V \in \big(B \cap \{I_1(x) I_2(x) \ge 0\}\big) \cup \big(B' \cap \{I_2(x) I_3(x) \ge 0\}\big) \big\} \Big] \\
&\ge a^2 (R^4/2^{13})\, L\, \|v\|_2^2 ,
\end{align*}
where we used (16) in the last inequality.

Now consider the case $w \cdot w^* \le 0$, in which $w^* = \alpha w + v$ with $\alpha \le 0$. Considering instead $A = \{w \cdot x \le 0,\ \bar v \cdot x \in (R/16, R/8) \cup (3R/8, R/2)\}$ and, similarly, $B = \{w \cdot x \le 0,\ \bar v \cdot x \in (R/16, R/8)\}$ and $B' = \{w \cdot x \le 0,\ \bar v \cdot x \in (3R/8, R/2)\}$, all the steps above remain valid without modification. This completes the proof of Lemma 3.2.

C.3. Proof of Lemma 3.3

In this section, we restate and prove Lemma 3.3. We first show the following claim, which is inspired by Lemma 9 of (Kakade et al., 2011).

Claim C.6. Let $w_t \in \mathbb{B}(W)$ and let $u'_t, u_t$ be defined as solutions to (EP*) and (EP), respectively. It holds that
\[ \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - v(w_t \cdot x))(y - u_t(w_t \cdot x))\big] \ge 0 \quad \text{for any } v \in \mathcal{U}_{(a,b)} ; \]
similarly, it holds that
\[ \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u'_t(w_t \cdot x) - v(w_t \cdot x))(y' - u'_t(w_t \cdot x))\big] \ge 0 \quad \text{for any } v \in \mathcal{U}_{(a,b)} . \]

Proof of Claim C.6. Let us denote by $\mathcal{F}_t$ the set of functions of the form $f(x) = u(w_t \cdot x)$, where $u \in \mathcal{U}_{(a,b)}$ and $w_t$ is a fixed vector in $\mathbb{B}(W)$. We first observe that $\mathcal{F}_t$ is a convex set of functions. Indeed, for any $\alpha \in [0,1]$ and any $f_1, f_2 \in \mathcal{F}_t$ with $f_1(x) = u_1(w_t \cdot x)$ and $f_2(x) = u_2(w_t \cdot x)$, letting $u_3(\cdot) = \alpha u_1(\cdot) + (1-\alpha) u_2(\cdot)$, it holds:
\[ \alpha f_1(x) + (1-\alpha) f_2(x) = \alpha u_1(w_t \cdot x) + (1-\alpha) u_2(w_t \cdot x) = u_3(w_t \cdot x) . \]
Note that $u_3$ is also $(a,b)$-bounded, non-decreasing, and satisfies $u_3(0) = 0$; hence $u_3 \in \mathcal{U}_{(a,b)}$ and $f_3(x) = u_3(w_t \cdot x) \in \mathcal{F}_t$. Thus, $\mathcal{F}_t$ is convex.

Since $\mathcal{F}_t$ is a convex set of functions, we can essentially regard $u_t(w_t \cdot x)$ as the $L^2$ projection of $y$ onto the convex set $\mathcal{F}_t$, and the classic inequalities for $\ell_2$ projections transfer seamlessly to our setting. In particular, below we prove that
\[ \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - v(w_t \cdot x))(y - u_t(w_t \cdot x))\big] \ge 0, \quad \forall v \in \mathcal{U}_{(a,b)} . \tag{17} \]
To prove (17), note first that $f_u(x) = u_t(w_t \cdot x) \in \mathcal{F}_t$ and $f_v(x) = v(w_t \cdot x) \in \mathcal{F}_t$, since $u_t, v \in \mathcal{U}_{(a,b)}$. Thus, for any $\alpha \in (0,1)$, we have $\alpha f_v(x) + (1-\alpha) f_u(x) \in \mathcal{F}_t$. Furthermore, by the definition of $u_t$, for every $f \in \mathcal{F}_t$ we have $\mathbb{E}_{(x,y) \sim \mathcal{D}}[(u_t(w_t \cdot x) - y)^2] \le \mathbb{E}_{(x,y) \sim \mathcal{D}}[(f(x) - y)^2]$; therefore, it holds:
\begin{align*}
0 &\le \frac{1}{\alpha}\, \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(\alpha f_v(x) + (1-\alpha) f_u(x) - y)^2\big] - \frac{1}{\alpha}\, \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - y)^2\big] \\
&= \frac{1}{\alpha}\, \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - y + \alpha(v(w_t \cdot x) - u_t(w_t \cdot x)))^2 - (u_t(w_t \cdot x) - y)^2\big] \\
&= \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[2(u_t(w_t \cdot x) - y)(v(w_t \cdot x) - u_t(w_t \cdot x)) + \alpha (v(w_t \cdot x) - u_t(w_t \cdot x))^2\big] .
\end{align*}
Letting $\alpha \to 0$, and noting that $\mathbb{E}_{x \sim \mathcal{D}_x}[(v(w_t \cdot x) - u_t(w_t \cdot x))^2] < +\infty$, we thus have
\[ \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - v(w_t \cdot x))(y - u_t(w_t \cdot x))\big] \ge 0 , \]
proving the claim. The same argument shows that the analogous result also holds for $u'_t$ and $y'$. Specifically, we have:
\[ \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u'_t(w_t \cdot x) - v(w_t \cdot x))(y' - u'_t(w_t \cdot x))\big] \ge 0, \quad \forall v \in \mathcal{U}_{(a,b)} . \tag{18} \]
This completes the proof of Claim C.6.

We now proceed to the proof of Lemma 3.3.

Lemma C.7 (Closeness of Population-Optimal Activations). Let $w_t \in \mathbb{B}(W)$ and let $u'_t, u_t$ be defined as solutions to (EP*) and (EP), respectively. Then,
\[ \mathbb{E}_{x \sim \mathcal{D}_x}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))^2\big] \le \mathrm{OPT} . \]

Proof.
Summing up the first and second statements of Claim C.6 (i.e., (17) and (18)), with $v = u'_t \in \mathcal{U}_{(a,b)}$ in (17) and $v = u_t \in \mathcal{U}_{(a,b)}$ in (18), we get:
\begin{align*}
0 &\le \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))(y - u_t(w_t \cdot x)) + (u'_t(w_t \cdot x) - u_t(w_t \cdot x))(y' - u'_t(w_t \cdot x))\big] \\
&= \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))(y - y' + u'_t(w_t \cdot x) - u_t(w_t \cdot x))\big] \\
&= \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))(y - y')\big] - \mathbb{E}_{x \sim \mathcal{D}_x}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))^2\big] .
\end{align*}
Therefore, moving the second term above to the left-hand side and then applying the Cauchy-Schwarz inequality, we have
\[ \mathbb{E}_{x \sim \mathcal{D}_x}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))^2\big] \le \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))(y - y')\big] \le \sqrt{\mathbb{E}_{x \sim \mathcal{D}_x}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))^2\big]\; \mathbb{E}\big[(y - y')^2\big]} \,; \]
since $\mathbb{E}[(y - y')^2] = \mathrm{OPT}$, dividing both sides by $\sqrt{\mathbb{E}_{x \sim \mathcal{D}_x}[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))^2]}$ and squaring yields the claim, completing the proof of Lemma 3.3.

C.4. Proof of Corollary 3.4

Corollary C.8 (Closeness of Idealized and Attainable Activations). Let $\epsilon, \delta > 0$. Given a parameter $w_t \in \mathbb{B}(W)$ and $m \gtrsim d \log^4(d/(\epsilon\delta))\, (b^2 W^3/(L^2\epsilon))^{3/2}$ samples from $\mathcal{D}$, let $\hat u_t$ be the sample-optimal activation on these samples given $w_t$, as defined in (P). Then, with probability at least $1-\delta$,
\[ \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u^*(w^* \cdot x))^2\big] \le 3\big(\epsilon + \mathrm{OPT} + b^2 \|w_t - w^*\|_2^2\big) . \]

Proof. The corollary follows directly by combining Lemma F.4 and Lemma 3.3, as we have:
\begin{align*}
\mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u^*(w^* \cdot x))^2\big] &= \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u_t(w_t \cdot x) + u_t(w_t \cdot x) - u'_t(w_t \cdot x) + u'_t(w_t \cdot x) - u^*(w^* \cdot x))^2\big] \\
&\le 3\Big( \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u_t(w_t \cdot x))^2\big] + \mathbb{E}_{x \sim \mathcal{D}_x}\big[(u_t(w_t \cdot x) - u'_t(w_t \cdot x))^2\big] + \mathbb{E}_{x \sim \mathcal{D}_x}\big[(u'_t(w_t \cdot x) - u^*(w^* \cdot x))^2\big] \Big) \\
&\le 3\big(\epsilon + \mathrm{OPT} + b^2 \|w_t - w^*\|_2^2\big) ,
\end{align*}
where we used that, because $u'_t \in \operatorname{argmin}_{u \in \mathcal{U}_{(a,b)}} \mathbb{E}_{x \sim \mathcal{D}_x}[(u(w_t \cdot x) - u^*(w^* \cdot x))^2]$, we have $\mathbb{E}_{x \sim \mathcal{D}_x}[(u'_t(w_t \cdot x) - u^*(w^* \cdot x))^2] \le \mathbb{E}_{x \sim \mathcal{D}_x}[(u^*(w_t \cdot x) - u^*(w^* \cdot x))^2] \le b^2 \|w_t - w^*\|_2^2$, with the last inequality following from the fact that $u^* \in \mathcal{U}_{(a,b)}$.

C.5. Proof of Claim C.2

In this subsection, we prove Claim C.2, which appeared in Appendix C.1 within the proof of Proposition 3.1.

Claim C.9. Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ be $m$ i.i.d. samples from $\mathcal{D}$, where $m$ is as specified in the statement of Proposition 3.1. Let $\hat u_t$ be the solution of the optimization problem (P) given $w_t \in \mathbb{B}(W)$ and $S$. Furthermore, denote the uncorrupted version of $S$ by $S' = \{(x^{(i)}, y'^{(i)})\}_{i=1}^m$, where $y'^{(i)} = u^*(w^* \cdot x^{(i)})$, and let $\hat u'_t$ be the solution of problem (P*). Then, with probability at least $1-\delta$, it holds
\[ \frac{1}{m} \sum_{i=1}^m \big( \hat u_t(w_t \cdot x^{(i)}) - \hat u'_t(w_t \cdot x^{(i)}) \big)(w_t - w^*) \cdot x^{(i)} \;\ge\; -\big(\sqrt{\epsilon} + \sqrt{\mathrm{OPT}}\big)\, \|w_t - w^*\|_2 - (\epsilon + \mathrm{OPT})/b . \]

Proof. Adding and subtracting $u_t(w_t \cdot x^{(i)})$ and $u'_t(w_t \cdot x^{(i)})$, we have
\begin{align}
\frac{1}{m} \sum_{i=1}^m \big( \hat u_t(w_t \cdot x^{(i)}) - \hat u'_t(w_t \cdot x^{(i)}) \big)(w_t - w^*) \cdot x^{(i)} &= \frac{1}{m} \sum_{i=1}^m \big( \hat u_t(w_t \cdot x^{(i)}) - u_t(w_t \cdot x^{(i)}) \big)(w_t - w^*) \cdot x^{(i)} + \frac{1}{m} \sum_{i=1}^m \big( u'_t(w_t \cdot x^{(i)}) - \hat u'_t(w_t \cdot x^{(i)}) \big)(w_t - w^*) \cdot x^{(i)} \nonumber \\
&\quad + \frac{1}{m} \sum_{i=1}^m \big( u_t(w_t \cdot x^{(i)}) - u'_t(w_t \cdot x^{(i)}) \big)(w_t - w^*) \cdot x^{(i)} . \tag{19}
\end{align}
To proceed, we use that both $\hat u_t$ and $\hat u'_t$ are close to their population counterparts $u_t$ and $u'_t$, respectively. In particular, in Lemma F.4 and Lemma F.2, we showed that, using a dataset $S$ of $m$ samples with $m \gtrsim d \log^4(d/(\epsilon\delta))\, (b^2 W^3/(L^2\epsilon))^{3/2}$, we have that, with probability at least $1-\delta$, for all $w_t, w^* \in \mathbb{B}(W)$ it holds
\[ \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u_t(w_t \cdot x))^2\big] \le \epsilon, \qquad \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u'_t(w_t \cdot x) - u'_t(w_t \cdot x))^2\big] \le \epsilon . \tag{20} \]
Now suppose that the inequalities in (20) hold for the given $w_t \in \mathbb{B}(W)$ (which happens with probability at least $1-\delta$). Applying Chebyshev's inequality to the first summation term in (19), we get:
\[ \Pr\Big[ \Big| \frac{1}{m} \sum_{i=1}^m \big( \hat u_t(w_t \cdot x^{(i)}) - u_t(w_t \cdot x^{(i)}) \big)(w_t - w^*) \cdot x^{(i)} - \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u_t(w_t \cdot x))(w_t - w^*) \cdot x\big] \Big| \ge s \Big] \le \frac{1}{m s^2}\, \mathbb{E}_{x \sim \mathcal{D}_x}\big[(\hat u_t(w_t \cdot x) - u_t(w_t \cdot x))^2 (w_t \cdot x - w^* \cdot x)^2\big] , \tag{21} \]
since the $x^{(i)}$ are i.i.d. random variables. The next step is to bound the variance. Note that $\mathcal{D}_x$ possesses a $(1/L)$-sub-exponential tail; thus we have $\Pr[|(w_t - w^*) \cdot x| \ge \|w_t - w^*\|_2\, r] \le (2/L^2) \exp(-Lr)$. Choose $r = (2W/L) \log(2/(L^2 \epsilon'))$; then, we have $\Pr[|(w_t - w^*) \cdot x| \ge r] \le \epsilon'$.
Now we separate the variance under two events: A = {x : |(wt w ) x| r} and A = {x : |(wt w ) x| r}. E x Dx[(ˆut(wt x) ut(wt x))2(wt x w x)2] = E x Dx[(ˆut(wt x) ut(wt x))2(wt x w x)21{A}] + E x Dx[(ˆut(wt x) ut(wt x))2(wt x w x)2(1 1{A})]. Using that Ex Dx[(ˆut(wt x) ut(wt x))2] ϵ, the first term in (22) can be bounded as follows: E x Dx[(ˆut(wt x) ut(wt x))2(wt x w x)21{A}] r2 E x Dx[(ˆut(wt x) ut(wt x))2] r2ϵ = 4W 2ϵ L2 log2(2/(L2ϵ )). (23) The second term in (22) can be bounded using that both ˆut and ut are non-decreasing b-Lipschitz and vanish at zero (thus |ˆut(wt x)| b|wt x| and |ut(wt x)| b|wt x|, with their signs determined by the sign of wt x) and then applying Young s inequality: E x Dx[(ˆut(wt x) ut(wt x))2(wt x w x)2(1 1{A})] b2 E x Dx[(wt x)2(wt x w x)2(1 1{A})] 2b2 E x Dx[((wt x)4 + (wt x)2(w x)2)(1 1{A})]. Since Dx is sub-exponential, we have E[(v x)8] c2/L8 for some absolute constant c, hence E x Dx[(wt x)4(1 1{A})] q E x Dx[W 8((wt/ wt 2) x)8] Pr[|wt x| r] c W 4 Similarly, for E[(wt x)2(w x)2(1 1{A})], we have: E x Dx[(wt x)2(w x)2(1 1{A})] 2 E x Dx[((wt x)4 + (w x)4)(1 1{A})] 2c(W/L)4 Combining the inequalities above with (23), we get the final upper bound on the variance in (22): E x Dx[(ˆut(wt x) ut(wt x))2(wt x w x)2] 4W 2ϵ L2 log2(2/(L2ϵ )) + 6cb2(W/L)4 Thus, choosing s = ϵ/b in (21), ϵ = ϵ2, and using m W 4b4 log2(1/ϵ)/(ϵδL4) samples we get L2 log2(2/(Lϵ )) + 12cb2W 4 b2L4ϵδ ϵ2W 4b4 log2(1/ϵ) Robustly Learning Single-Index Models via Alignment Sharpness Plugging the inequality above back into (21) and recalling that Ex Dx[(ˆut(wt x) ut(wt x))2] ϵ (from (20)), we finally have with probability at least 1 δ, i=1 (ˆut(wt x(i)) ut(wt x(i)))(wt w ) x(i) E x Dx[(ˆut(wt x) ut(wt x))(wt w ) x] ϵ/b E x Dx[(ˆut(wt x) ut(wt x))2] E x Dx[(wt x w x)2] ϵ/b ϵ wt w 2 ϵ/b, where in the second inequality we used the Cauchy-Schwarz inequality and in the last inequality we used the assumption that Dx is isotropic, i.e., Ex Dx[xx ] = I. Finally, note that (20) holds with probability at least 1 δ, applying a union bound we get that with probability at least 1 2δ, we have i=1 (ˆut(wt x(i)) ut(wt x(i)))(wt w ) x(i) ϵ wt w 2 ϵ/b. In summary, to guarantee that the inequality above remains valid, we need the batch size to be: m d W 9/2b4 log4(d/(ϵδ) We finished bounding the first term in (19). Since the same statements hold for the relationship between ˆu t and u t as they do for ˆut and ut, using the same argument we also get that with probability at least 1 2δ, i=1 (ˆu t(wt x(i)) u t(wt x(i)))(wt w ) x(i) ϵ wt w 2 ϵ/b, which is the lower bound for the second term in (19). Lastly, for the third term in (19), since in Lemma 3.3 we showed that for any wt it always holds: E x Dx[(ut(wt x) u t(wt x))2] OPT, the only change of the previous steps is at the right-hand side of (23), where instead of having the upper bound of r2ϵ, we have E x Dx[(ut(wt x) u t(wt x))2(wt x w x)21{A}] r2OPT = 4W 2OPT L2 log2(2/(L2ϵ )). And in the same reason, we have E x Dx[(ut(wt x) u t(wt x))2(wt x w x)2(1 1{A})] 6cb2(W/L)4 As a result, Chebyshev s inequality yields: i=1 (ut(wt x(i)) u t(wt x(i)))(wt w ) x(i) E x Dx[(ut(wt x) u t(wt x))(wt w ) x] s 1 ms2 E x Dx[(ut(wt x) u t(wt x))2(wt x w x)2] L2 log2(2/(L2ϵ )) + 6cb2W 4 Robustly Learning Single-Index Models via Alignment Sharpness Now instead of choosing s = ϵ, we let s = (OPT + ϵ)/b and keep ϵ as ϵ2 to get b2L4ϵδ d W 9/2b4 log4(d/(ϵδ))(OPT + ϵ)2 under our choice of m as specified in (24). 
Thus, we have that with probability at least 1 δ, it holds i=1 (ut(wt x(i)) u t(wt x(i)))(wt w ) x(i) E x Dx[(ut(wt x) u t(wt x))(wt w ) x] (OPT + ϵ)/b OPT wt w 2 (OPT + ϵ)/b, where in the last inequality we used the fact that | E x Dx[(ut(wt x) u t(wt x))(wt w ) x]| q E x Dx[(ut(wt x) u t(wt x))2] E x Dx[((wt w ) x)2] OPT wt w 2, since Ex Dx[(ut(wt x) u t(wt x))2] OPT by Lemma 3.3. Therefore, combing the upper bounds on the three terms in (19) we get that with probability at least 1 5δ, it holds: i=1 ((ˆut(wt x(i)) ˆu t(wt x(i)))(wt w ) x(i) (2 ϵ + OPT) wt w 2 (3ϵ + OPT)/b. (25) Since (25) was proved using arbitrary ϵ, δ > 0, it remains to replace δ δ/5 and ϵ ϵ/4 to complete the proof of Claim C.2. C.6. Proof of Claim C.3 In this subsection, we prove Claim C.3 that appeared in the proof of Proposition 3.1 in Appendix C.1. Claim C.10. Let S = {(x(i), y (i))}m i=1 be a sample set such that x(i) s are m i.i.d. samples from Dx, and y (i) = u (w x(i)) for each i. Let m be the value specified in the statement of Proposition 3.1. Then, given a parameter wt B(W), with probability at least 1 δ it holds that i=1 (ˆu t(wt x(i)) y (i))(wt w ) x(i) Ca2LR4 b (w ) wt 2 2 ϵ wt w 2 ϵ/b , where C is an absolute constant. Proof. Before we proceed to the proof of the claim, let us consider first the inverse of u . Since u (z) U(a,b) is strictly increasing when z 0, hence (u ) 1(α) exists for α 0. However, when z 0, u (z) could be a constant on some intervals, hence (u ) 1(α) might not exist for every α 0. We consider instead an empirical version of (u ) 1(α) based on S , which is defined on every α R. Given a sample set S = {(x(i), y (i))} where y (i) = u (w x(i)), let us sort the the index i in the increasing order of w x(i), i.e., w x(1) w x(m). Since u is a monotone function, this implies y (i) s are also in increasing order, i.e., we have y (1) y (m). We then partition the set {y (i)}m i=1 into blocks s = {y (ks 1+1), . . . , y (ks)}, s.t. y (ks 1+1) = = y (ks) = τs, for s = 1, . . . , s . Since {y (i)} is sorted in increasing order, we have τs 1 < τs for s = 2, . . . , s . Note that since u (z) is strictly increasing when z 0 and that u (0) = 0, s is a singleton set whenever τs > 0. Furthermore, let s denote s as the largest index among 1, . . . , s such that τs 0. Robustly Learning Single-Index Models via Alignment Sharpness Suppose first that τs < 0, let s define a function ˆf : R R in the following way: (u ) 1(α), α > 0 w x(ks ) + α τs τs (w x(ks )), α [τs , 0] w x(ks), α = τs, s = 1, . . . , s 1 w x(ks 1) + α τs 1 τs τs 1 (w x(ks 1+1) w x(ks 1)), α (τs 1, τs), s = 2, . . . , s b(α τ1), α ( , τ1) When τs = 0, we define (0 τs )/τs = 1, and hence ˆf(0) = 0. The rest remains unchanged. A visualization of ˆf with respect to Re LU activation is presented in Figure 3. Figure 3: Given u (z) = max{0, z} and a dataset S = {(x(1), u (w x(1))), . . . , (x(6), u (w x(6)))} where w x(1) < w x(2) < w x(3) < 0, ˆf, the empirical inverse of u , has image above. The function ˆf has the following properties. First of all, ˆf(α) satisfies ˆf(0) = 0, (α1 α2)/a ˆf(α1) ˆf(α2), for all α1 α2 0, since ˆf(α) = (u ) 1(α) when α > 0 and u U(a,b). Secondly, ˆf(α1) ˆf(α2) (α1 α2)/b for all α1, α2 R, α1 α2. This is because each segment of ˆf has tangent at least 1/b. Thirdly, for any α u (w x(i)), it holds that ˆf(α) w x(i) (α u (w x(i)))/b. 
This is because that suppose u (w x(i)) s, for any α τs it holds ˆf(α) w x(i) ˆf(α) w x(ks) = ˆf(α) ˆf(τs) = ˆf(α) ˆf(u (w x(i))) (α u (w x(i)))/b, by the fact that tangent of ˆf(α) is lower bounded by 1/b. In addition, for any α < u (w x(i)), it holds w x(i) ˆf(α) (u (w x(i)) α)/b. This can be seen similarly from the construction of ˆf. Suppose u (w x(i)) s, then for any α < τs it holds w x(i) ˆf(α) w x(ks 1+1) ˆf(α) = ˆf(τs) ˆf(α) = ˆf(u (w x(i))) ˆf(α) (u (w x(i)) α)/b. Again, we used the fact that ˆf(α1) ˆf(α2) (α1 α2)/b for all α1, α2 R, α1 α2 in the last inequality. Now we turn to the summation displayed in the statement of the claim. To proceed, we add and subtract ˆf(ˆu t(wt x)) in Robustly Learning Single-Index Models via Alignment Sharpness the second component in the inner product, which yields: i=1 (ˆu t(wt x(i)) u (w x(i)))(wt w ) x(i) i=1 (ˆu t(wt x(i)) u (w x(i)))(wt x(i) ˆf(ˆu t(wt x(i)))) i=1 (ˆu t(wt x(i)) u (w x(i)))( ˆf(ˆu t(wt x(i))) w x(i)). (27) To bound below the first term in (27), we make use of the following Fact C.11. The proof of Fact C.11 can be found at Appendix C.7. Fact C.11. Let wt B(W). Given m samples S = {(x(1), y (1)), , (x(m), y (m))}, let ˆu t be one of the solutions to the optimization problem (P*) , i.e., ˆu t argminu U(a,b)(1/m) Pm i=1(u(wt x(i)) y (i))2. Then i=1 (ˆu t(wt x(i)) y (i))(wt x(i) f(ˆu t(wt x(i)))) 0, for any function f : R R such that f(0) = 0, (α1 α2)/a f(α1) f(α2) for all α1 α2 0, and f(α1) f(α2) (α1 α2)/b, α1, α2 R, α1 α2. As we showed previously, ˆf satisfies the prerequisite of Fact C.11, hence applying Fact C.11 we observe that the first term in (27) is non-negative: i=1 (ˆu t(wt x(i)) u (w x(i)))(wt x(i) ˆf(ˆu t(wt x(i)))) 0. (28) Therefore, plugging (28) back into (27), we get: i=1 (ˆu t(wt x(i)) y (i))(wt w ) x(i) 1 i=1 (ˆu t(wt x(i)) y (i))( ˆf(ˆu t(wt x(i))) w x(i)). (29) Recall that we have showed the function ˆf satisfies ˆf(α) w x(i) (α u (w x(i)))/b 0 whenever α u (w x(i)), and moreover, w x(i) ˆf(α) (u (w x(i)) α)/b 0 when α < u (w x(i)). Therefore, let α = ˆu t(wt x(i)), combining these results we get (ˆu t(wt x(i)) u (w x(i)))(f(ˆu t(wt x(i))) w x(i)) 1 b (ˆu t(wt x(i)) u (w x(i)))2. Bringing the inequality above back to (29) we then get i=1 (ˆu t(wt x(i)) y (i))(wt w ) x(i) 1 mb i=1 (ˆu t(wt x(i)) y (i))2. (30) The goal now is to bound below the right-hand side of (30) by E[(ˆu t(wt x) y )2] and some small error terms using Chebyshev inequality as we did in Claim C.2. Plugging in Lemma 3.2 we can further lower bound E[(ˆu t(wt x) y )2] by (w ) wt 2 2 and then we are done with the proof of this claim. Note that Chebyshev s inequality yields i=1 (ˆu t(wt x(i)) y (i))2 E x Dx[(ˆu t(wt x) y )2] s 1 ms2 E x Dx[(ˆu t(wt x) y )4]. (31) Robustly Learning Single-Index Models via Alignment Sharpness We now bound Ex Dx[(ˆu t(wt x) y )4]. Observe that E x Dx[(ˆu t(wt x) y )4] = E x Dx[(ˆu t(wt x) u t(wt x) + u t(wt x) y )2(ˆu t(wt x) y )2] 4 E x Dx[(ˆu t(wt x) u t(wt x))2((ˆu t(wt x))2 + (y )2)] + 4 E x Dx[(u t(wt x) y )2((ˆu t(wt x))2 + (y )2)]. (32) We focus on the two terms in (32) separately. Again, choosing r = 2W L log(2/(L2ϵ )), then by the L-sub-exponential tail bound of Dx, it holds Pr[|wt x| r] ϵ , Pr[|w x| r] ϵ . 
Since y = u (w x) and both u and ˆu t are non-decreasing b-Lipschitz, it holds: E x Dx[(ˆu t(wt x) u t(wt x))2((ˆu t(wt x))2 + (y )2)] b2 E x Dx[(ˆu t(wt x) u t(wt x))2((wt x)2 + (w x)2)] = b2 E x Dx[(ˆu t(wt x) u t(wt x))2((wt x)2 + (w x)2)1{|wt x| r, |w x| r}] + b2 E x Dx[(ˆu t(wt x) u t(wt x))2((wt x)2 + (w x)2)1{|wt x| r or |w x| r}] 2b2r2 E x Dx[(ˆu t(wt x) u t(wt x))2] + 2b4 E x Dx[2(wt x)2((wt x)2 + (w x)2)1{|wt x| r or |w x| r}]. (33) The first term in (33) can be upper bounded using Lemma F.2, which states that when m d log(1/δ)(b2W 3 log2(d/ϵ)/(L2ϵ))3/2), with probability at least 1 δ it holds Ex Dx[(ˆu t(wt x) u t(wt x))2] ϵ for all wt B(W). Now suppose this inequality is valid the given wt B(W) (which happens with probability at least 1 δ). For the second term in (33), note that for any unit vector a it holds Ex Dx[(a x)8] c2/L8 for some absolute constant c > 0, and furthermore, the magnitude of r ensures that Pr[|wt x| r or |w x| r] 2ϵ , therefore, combining these bounds, we get: E x Dx[2(wt x)2((wt x)2 + (w x)2)1{|wt x| r or |w x| r}] E x Dx[(wt x)8] Pr[|wt x| r or |w x| r] 2( E x Dx[(wt x)8] + E x Dx[(w x)8]) Pr[|wt x| r or |w x| r] Plugging back into (33), we have E x Dx[(ˆu t(wt x) u t(wt x))2((ˆu t(wt x))2 + (y )2)] 2b2r2ϵ + 48c(b W/L)4 which is the upper bound on the first term of (32). For the second term in (32), since by definition we have u t argminu U(a,b) Ex Dx[(u(wt x) y )2], it holds that E x Dx[(u t(wt x) y )2] E x Dx[(u (wt x) u (w x))2] b2 E x Dx[((wt w ) x)2] b2 wt w 2 2, since x is isotropic. Thus, using similar steps as in (33), we have E x Dx[(u t(wt x) y )2((ˆu t(wt x))2 + (y )2)] 2b2r2 E x Dx[(u(wt x) y )2] + 2b4 E x Dx[2((wt x)2 + (w x)2)21{|wt x| r or |w x| r}] 2b4r2 wt w 2 2 + 48c(b W/L)4 ϵ. Robustly Learning Single-Index Models via Alignment Sharpness In summary, combining all the results and plugging them back into (32), we finally get the upper bound for the variance: E x Dx[(ˆu t(wt x) y )4] 32b2W 2 L2 log2(2/(L2ϵ ))(b2 wt w 2 2 + ϵ) + 384c(b W/L)4 Let s = b ϵ wt w 2 + ϵ/b and plug the last inequality back into (31) to get: i=1 (ˆu t(wt x(i)) y (i))2 E x Dx[(ˆu t(wt x) y )2] b ϵ wt w 2 + ϵ/b 1 m(ϵb2 wt w 2 2 + ϵ2/b2) (b2 wt w 2 2 + ϵ) + 384c(b W/L)4 Choosing ϵ = ϵ2/b4 and using similar arguments as in Claim C.2, we get that the right-hand side of the inequality above is bounded by δ, given our choice of m db4W 9/2 log4(d/(ϵδ))(1/ϵ3/2 + 1/(ϵδ)) as specified in the statement of Proposition 3.1. In summary, after a union bound on the probability above and the event that Ex Dx[(ˆu t(wt x) u t(wt x))2] ϵ, we have with probability at least 1 2δ, i=1 (ˆu t(wt x(i)) y (i))2 E x Dx[(ˆu t(wt x) y )2] ϵb wt w 2 ϵ/b. Recall that in Lemma 3.2 we showed that Ex Dx[(ˆu t(wt x) u (w x))2] Ca2LR4 (w ) wt 2 2 for an absolute constant C; thus, our final result is that with probability at least 1 δ, i=1 (ˆu t(wt x(i)) y (i))(wt w ) x(i) 1 mb i=1 (ˆu t(wt x(i)) y (i))2 b (w ) wt 2 2 ϵ wt w 2 ϵ/b. This completes the proof of Claim C.3. C.7. Proof of Fact C.11 We prove a modified version of Lemma 1 (Kakade et al., 2011), presented as the statement below. The statement considers a smaller activation class and a function f with different properties compared to (Kakade et al., 2011), and the proof is based on a rigorous KKT argument. Fact C.12. Let wt B(W). Given m samples S = {(x(1), y (1)), , (x(m), y (m))}, let ˆu t be one of the solutions to the optimization problem (P*) , i.e., ˆu t argminu U(a,b)(1/m) Pm i=1(u(wt x(i)) y (i))2. 
Then i=1 (ˆu t(wt x(i)) y (i))(wt x(i) f(ˆu t(wt x(i)))) 0, for any function f : R R such that f(0) = 0, (α1 α2)/a f(α1) f(α2) for all α1 α2 0, and f(α1) f(α2) (α1 α2)/b, α1, α2 R, α1 α2. Proof. We transform the optimization problem (P*) to a quadratic optimization problem with linear constraints. To guarantee that the solution of this quadratic problem corresponds to a function that is (a, b)-unbounded, we add a sample (x(k), y (k)) = (0, 0) to the sample set. Let zi = wt x(i) such that (perhaps after some permutation) z1 z2 zm and zk = 0, we solve the following optimization problem: min y(i),i [m] i=1 ( y(i) y (i))2 s.t. 0 y(i+1) y(i), 1 i k 1 , a(zi+1 zi) y(i+1) y(i), k i m 1 , y(i+1) y(i) b(zi+1 zi), 1 i m 1 , Robustly Learning Single-Index Models via Alignment Sharpness Denote the solution of (34) as ˆy (i), i = 1, , m. Let ˆu t(z) be the linear interpolation function of (zi, ˆy (i)), then ˆu t U(a,b) since ˆu t(0) = ˆu t(zk) = ˆy (k) = 0, ˆu t is b-Lipschitz and ˆu t(z) ˆu t(z ) a(z z ) for all z z 0. In other words, finding a solution of (P*) is equivalent to solving (34). Now observe that the summation Pm i=1(ˆy (i) y (i))(zi f(ˆy (i))) can be transformed into the following: i=1 (ˆy (i) y (i))(zi f(ˆy (i))) = j=1 (ˆy (j) y (j)) (zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))), (35) where we let zm+1 = 0, ˆy m+1 = 0 (and hence f(ˆy m+1) = 0 as f(0) = 0). To fully utilize the information that ˆy (i) is the minimizer of the optimization problem (34), we write down the KKT condition for the optimization problem (34) described above: ˆy (i) = y (i) + (λ i λ i 1)/2 (λi λi 1)/2 (νk/2)1{i = k}, i = 1, , m; (36) λi(ˆy (i+1) ˆy (i)) = 0, i = 1, , k 1; (37) λi(a(zi+1 zi) (ˆy (i+1) ˆy (i))) = 0, i = k, , m 1; (38) λ i((ˆy (i+1) ˆy (i)) b(zi+1 zi)) = 0, i = 1, , m 1; (39) νkˆy (k) = 0 , (40) where λi, λ i 0, for i = 1, . . . , m 1, and νk R are dual variables, and we let λ0 = λ 0 = 0 for the convenience of presenting (36). Summing up (36) recursively, we immediately get that j=1 (ˆy (i) y (i)) = 1 2((λ i λi) νk1{i k}). Bringing this equality above back to (35), we have i=1 (ˆy (i) y (i))(zi f(ˆy (i))) i=1 (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) + 1 i=k νk(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) i=1 (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) + νk(zk f(ˆy (k)) (zm+1 f(ˆy (m+1)))). (41) Since by definition, zm+1 = f(ˆy (m+1)) = 0, zk = 0, and as ˆy (i), i [m], is a feasible solution of (34), it holds ˆy (k) = 0, we thus have νk(zk f(ˆy (k)) (zm+1 f(ˆy (m+1)))) = 0. Bringing this back to (41), we get i=1 (ˆy (i) y (i))(zi f(ˆy (i))) = 1 i=1 (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))). (42) Consider first when i = 1, . . . , k 1. Suppose for some i {1, . . . , k 1} we have λ i, λi > 0, then according to the complementary slackness condition (37) and (38) it holds that 0 = ˆy (i+1) ˆy (i) = b(zi+1 zi). Therefore, the we have (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))). Suppose for some i {1, . . . , k 1}, it holds λ i > 0. Then, it must be the case that ˆy (i+1) ˆy (i) = b(zi+1 zi) 0, according to the KKT condition (39). Since f(ˆy (i+1)) f(ˆy (i)) (ˆy (i+1) ˆy (i))/b by assumption on f, we thus have (zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) 0. Finally, if λi > 0, then (37) indicates that 0 = ˆy (i+1) ˆy (i). Therefore, as zi+1 zi, the ith summand is also positive. In summary, the first summation in (42) is positive. Robustly Learning Single-Index Models via Alignment Sharpness Now consider i {k, . . . , m}. Observe that if for some i {k, . . . 
, m} it holds λi > 0 and λ i > 0 at the same time, then KKT conditions (38) and (39) imply that a(zi+1 zi) = ˆy (i+1) ˆy (i) = b(zi+1 zi), as a < b it has to be zi+1 zi = ˆy (i+1) ˆy (i) = 0, which indicates that the ith summand in the second term must be 0, i.e., (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) = 0. Now suppose for some i {1, . . . , m}, λ i > 0 and λi = 0, then by the complementary slackness conditions (38) and (39), it must be that ˆy (i+1) ˆy (i) = b(zi+1 zi) 0. Again, since f satisfies f(ˆy (i+1)) f(ˆy (i)) (ˆy (i+1) ˆy (i))/b for any ˆy (i+1) ˆy (i), we thus have zi zi+1 + (f(ˆy (i+1)) f(ˆy (i))) 0. Thus it holds that (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) 0. On the other hand, if λ i = 0 and λi > 0, then complementary slackness implies that ˆy (i+1) ˆy (i) = a(zi+1 zi) 0. Furthermore, since ˆy (i) ˆy (k) 0 when i k, using the assumption that (α1 α2)/a f(α1) f(α2) when α1 α2 0, we get zi zi+1+(f(ˆy (i+1)) f(ˆy (i))) 0, and hence (λ i λi)(zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) 0 holds as well. In summary, since each summand in (42) is positive, we finally get that i=1 (ˆy (i) y (i))(zi f(ˆy (i))) = j=1 (ˆy (j) y (j)) (zi f(ˆy (i)) (zi+1 f(ˆy (i+1)))) 0. This completes the proof of Fact C.11. C.8. Proof of Claim C.4 We restate and prove Claim C.4 that appeared in the proof of Proposition 3.1 in Appendix C.1. Claim C.13. Let S = {(x(i), y(i))}m i=1 be m i.i.d. samples from D, and denote S = {(x(i), y (i))}m i=1 the uncorrupted version of S where y (i) = u (w x(i)). Under the condition of Proposition 3.1, given a parameter wt B(W) with probability at least 1 δ it holds: i=1 (y (i) y(i))(wt x(i) w x(i)) OPT w wt 2 (OPT + ϵ)/b . Proof. By Chebyshev s inequality, we can write i=1 (y (i) y(i))(wt x(i) w x(i)) E (x,y) D[(y y)(wt w ) x] s E(x,y) D[(y y)2(wt x w x)2] L log(2/(L2ϵ )), then by the fact that Dx is sub-exponential, we have Pr[|(wt w ) x| r] ϵ . Furthermore, since |y| M where M = b W L log(16b4W 4/ϵ2), as stated in Fact F.3, the variance can be bounded as follows: E (x,y) D[(y y)2(wt x w x)2] E (x,y) D[(y y)2(wt x w x)21{|(wt w ) x| r}] + E (x,y) D[(y y)2(wt x w x)21{|(wt w ) x| r}] r2 E (x,y) D[(u (w x) y)2] + E (x,y) D[(2(u (w x))2 + y2)(wt x w x)21{|(wt w ) x| r}] r2OPT + E x Dx[2(b2(wt x)2 + M 2)(wt x w x)21{|(wt w ) x| r}]. Robustly Learning Single-Index Models via Alignment Sharpness Since for any unit vectors a, b we have Ex Dx[(a x)4] c2/L4 and Ex Dx[(a x)4(b x)4] c2/L8, we have: 2b2 E x Dx[(wt x)2(wt x w x2)21{|(wt w ) x| r}] E x Dx[((wt/ wt 2) x)4(((wt w )/ wt w 2) x)4] Pr[|(wt w ) x| r] and in addition, E x Dx[M 2((wt w ) x)21{|(wt w ) x| r}] E x Dx[((wt w ) x)4] Pr[|(wt w ) x| r] c M 2(W/L)2 Let s = (OPT + ϵ)/b, ϵ = ϵ2, under our choice of m db4W 9/2 log4(d/(ϵδ))(1/ϵ3/2 + 1/(ϵδ)), it holds that 4W 2 log2(1/(L2ϵ ))OPT L2 + (4cb2(W/L)4 + c M 2(W/L)2) Thus, with probability at least 1 δ it holds that i=1 (y (i) y(i))(wt x(i) w x(i)) E (x,y) D[(y y )(wt w ) x] (OPT + ϵ)/b. Since E (x,y) D[(y y )(wt w ) x] r E (x,y) D[(y y )2] E x Dx[((wt w ) x)2] OPT w wt 2, we finally have i=1 (y (i) y(i))(wt x(i) w x(i)) OPT w wt 2 (OPT + ϵ)/b, completing the proof of Claim C.4. D. Omitted Proofs from Section 4 D.1. Proof of Lemma 4.1 Our algorithm for initialization is the following Algorithm 2: Algorithm 2 Initialization 1: Input: w0 = 0; ϵ, δ > 0; positive parameters a, b, L, R, W; µ a2LR4/b, step size η = µ3/(27b4), number of iterations t0 (b/µ)6 log(b/µ); 2: for t = 0 to t0 do 3: Draw m0 W 9/2b10d log4(d/(ϵδ))/(L4µ6δϵ3/2) i.i.d. 
samples from D 4: ˆut = argmin u U(a,b) i=1 (u(wt x(i)) y(i))2. 5: b Lsur(wt; ˆut) = 1 m0 i=1 (ˆut(wt x(i)) y(i))x(i). 6: wt+1 = wt η b Lsur(wt; ˆut). 7: end for 8: Return: {w0, . . . , wt0} We restate and prove Lemma 4.1 in below. Robustly Learning Single-Index Models via Alignment Sharpness Lemma D.1 (Initialization). Let µ = Ca2LR4/b for an absolute constant C > 0 and let ϵ, δ > 0. Choose the step size η = µ3/(27b4) in Algorithm 2. Then, drawing m0 i.i.d. samples from D at each iteration such that m0 W 9/2b10d log4(d/(ϵδ)) L4µ6δϵ3/2 , ensures that within t0 b6 log(b/µ)/µ6 iterations, the initialization subroutine Algorithm 2 generates a list of size t0 that contains a point w0 such that (w ) w0 2 max{µ w 2/(4b), 64b2/µ3( OPT+ ϵ)}, with probability at least 1 δ. The total number of samples required for Algorithm 2 is N0 = t0m0. Proof. Assume first that w 2 64b2/µ3( OPT + ϵ). Then, for the parameter w0 = 0, it holds that (w ) w0 2 = w 2 64b2/µ3( OPT + ϵ). Hence, w0 = 0 satisfies the condition (w ) w0 2 max{µ w 2/(4b), 64b2/µ3( Now suppose w 2 64b2/µ3( OPT + ϵ). Let vt denote the component of w that is orthogonal to wt; i.e., vt = w (w wt)wt/ wt 2 2 = (w ) wt, where wt is defined in Algorithm 2. Our goal is to show that when vt 2 µ w 2/(4b) at iteration t, the distance between wt+1 and w contracts by a constant factor 1 c for some c < 1, i.e., wt+1 w 2 (1 c) wt w 2. This implies that when vt 2 is greater than µ w 2/(4b), wt+1 wt 2 contracts until vt 2 µ w 2/(4b) is violated at step t0; this wt0 is exactly the initial point we are seeking to initialize the optimization subroutine. Applying Proposition 3.1 we get that under our choice of batch size m, with probability at least 1 δ, at each iteration it holds b Lsur(wt; ˆut) (wt w ) Ca2LR4 b (w ) wt 2 2 2( OPT + ϵ) wt w 2 2(OPT + ϵ)/b. We now study the distance between wt+1 and w , where wt+1 is updated from wt according to Algorithm 2. wt+1 w 2 2 = wt η b Lsur(wt; ˆut) w 2 2 = wt w 2 2 + η2 b Lsur(wt; ˆut) 2 2 2η b Lsur(wt; ˆut) (wt w ). (43) Applying Lemma 4.3 to (43), and plugging in Proposition 3.1, we get that under our choice of batch size m it holds that with probability at least 1 δ, wt+1 w 2 2 wt w 2 2 + η2(10(OPT + ϵ) + 4b2 wt w 2 2) + 2η(2(OPT + ϵ)/b + 2( OPT + ϵ) wt w 2 µ vt 2 2) (1 + 4b2η2) wt w 2 2 + 2η(2( OPT + ϵ) wt w 2 µ vt 2 2) + 5η(OPT + ϵ), (44) where µ = Ca2LR4/b and C is an absolute constant. Note that in the last inequality we used that η 1/10, hence 10η2 η, and that b 1. Note that we have assumed OPT + ϵ µ3/(64b2) w 2. Furthermore, when t = 0, v0 = w hence we would have v0 2 µ w 2/(4b). Suppose that at iteration t, vt 2 µ w 2/(4b) is still valid. Then, (44) is transformed to: wt+1 w 2 2 (1 + 4b2η2) wt w 2 2 + 5η(OPT + ϵ) + 2η((µ3/(32b2)) wt w 2 w 2 (µ3/(16b2)) w 2 2) (45) We will use an induction argument to show that at iteration t, wt w 2 w 2, which will eventually yield a contraction wt+1 w 2 2 (1 c) wt w 2 2 for some constant c < 1. This condition wt w 2 w 2 certainly holds for the base case t = 0 as w0 = 0 hence w0 w 2 = w 2. Now, suppose wt w 2 w 2 holds for all the iterations from 0 to t. Then, bringing in η = µ3/(27b4) to (45), we get: wt+1 w 2 2 (1 + 4b2η2) wt w 2 2 + 2η((µ3/(32b2)) (µ3/(16b2))) wt w 2 w 2 + 5η(OPT + ϵ) (1 + 4η2b2 2ηµ3/(32b2)) wt w 2 2 + 5µ3/(27b4)(OPT + ϵ) (1 µ6/(211b6)) wt w 2 2 + 5µ3/(27b4)(OPT + ϵ). 
Robustly Learning Single-Index Models via Alignment Sharpness Since we have assumed OPT + ϵ µ3/(64b2) w 2, it holds wt w 2 vt 2 µ w 2/(4b) (16b/µ2)( OPT + ϵ), thus, we have (noting that µ 1): 5µ3/(27b4)(OPT + ϵ) 5µ3/(27b4)( OPT + ϵ)2 µ6/(212b6) wt w 2. Therefore, combining the results above, we get: wt+1 w 2 2 (1 µ6/(212b6)) wt w 2 2, for any iteration t such that vt 2 µ w 2/(4b) holds. This validates the induction argument that wt w 2 w 2 for every t = 0, . . . , t0 and at the same time yields the desired contraction property of the sequence wt w 2, t = 0, . . . , t0. Now, since w0 w 2 = w 2 and wt w 2 vt 2, we have vt+1 2 2 (1 µ6/(212b6))t w 2 2 exp( tµ6/(212b6)) w 2 2. Thus, after at most t0 = 212b6 log(4b/µ)/µ6 iterations, it must hold that among all those vectors v1, . . . , vt0, there exists a vector vt 0 such that vt 0 2 µ w 2/(4b). Since there are only a constant number of candidates, we can feed each one as the initialized input to the optimization subroutine Algorithm 3. This will only result in a constant boost up in the runtime and sample complexity. Finally, recall that we need to draw m W 9/2b4 log4(d/(ϵδ)) new samples at each iteration for (44) to hold with probability 1 δ, and the total number of iterations is t0. Thus, applying a union bound we know that the probability that (44) holds for all t0 is 1 t0δ. Hence, choosing δ δt0, and note that t0 b6/µ6 log(b/µ), it yields that setting the batch size to be m0 = Θ W 9/2b4 log4(d/(ϵδ)) ϵ3/2 + b6 log(b/µ) = Θ W 9/2b10d log4(d/(ϵδ)) suffices and the total number of sample complexity for the initialization process is t0m0. D.2. Proof of Theorem 4.2 In this subsection, we restate and prove our main theorem Theorem 4.2. The full version of the optimization algorithm as well as the main theorem Theorem 4.2 is displayed below: Theorem D.2 (Main Result). Let D be a distribution in Rd R and suppose that Dx is (L, R)-well-behaved. Furthermore, let U(a,b) be as in Definition 1.3, and ϵ > 0. Let µ = Ca2LR4/b, where C is an absolute constant. Running Algorithm 1 with the following parameters: step size η = µ/(4b2), batch size to be m d W 5.5b14 log5(d/ϵ)/(L4µ9ϵ3/2) and the total number of iterations to be T = t0JT = O(Wb10/(µ9 ϵ) log(1/ϵ)), where T = O((b/µ)2 log(1/ϵ)), then with probability at least 2/3, Algorithm 1 returns a hypothesis (ˆu, bw) where ˆu U(a,b) and bw B(W) such that L2(bw; ˆu) = O b4 using N = O(T m) = O(d W 6.5b24/(L4µ18ϵ2)) samples. Proof. As proved in Lemma 4.1, the initialization subroutine Algorithm 2 outputs a starting point w0 such that (w ) w0 2 max{µ w 2/(4b), 64b2/µ3( OPT+ ϵ)}. Suppose first that µ w 2/(4b) 64b2/µ3( OPT+ ϵ). Then this implies that w 2 256b3/µ4( OPT + ϵ). Therefore, applying Claim 4.4 we immediately get that the trivial hypothesis (w = 0, u(z) = 0) works as a constant approximate solution, as L2(w; u) 8(OPT + ϵ) + 4b2 w 2 = O((b/µ)8)OPT + ϵ. This hypothesis (w = 0, u(z) = 0) is contained in our solution set P (see Algorithm 3) and will be tested in Algorithm 4. Thus, we assume this is not the case and the initial point w0 satisfies (w ) w0 2 µ w 2/(4b). Since there exists a wini k {wini k }t0 k=1 such that (w ) wini k 2 µ w 2/(4b). Let us consider this initialized parameter at k step w0 j,k = wini k and ignore the subscript k for simplicity. 
Since we constructed a grid with grid width η ϵ from Robustly Learning Single-Index Models via Alignment Sharpness Algorithm 3 Optimization 1: Input: wini = 0; ϵ > 0; positive parameters: a, b, L, R, W; let µ a2LR4/b; step size η = µ/(4b2), number of iterations T = O((b/µ)2 log(1/ϵ)). 2: {wini 0 , . . . , wini t0 } = Initialization[wini] (Algorithm 2) 3: for k = 0 to t0 (b/µ)6 log(b/µ) do 4: Pk = {} 5: for j = 1 to J = W/(η ϵ) do 6: w0 j,k = wini k . 7: βj = jη ϵ. find an η ϵ approximation of w 2 8: for t = 0 to T 1 do 9: bwt j,k = βj( wt j,k/ wt j,k 2). normalize w 10: Draw m W 5.5b14 log5(d/ϵ)d/(L4µ9ϵ3/2) new i.i.d. samples from D 11: ˆut j,k = argminu U(a,b)(1/m) Pm i=1(u(bwt j,k x(i)) y(i))2. 12: b Lsur(bwt j,k; ˆut j,k) = (1/m) Pm i=1(ˆut j,k(bwt j,k x(i)) y(i))x(i). 13: wt+1 j,k = bwt j,k η b Lsur(bwt j,k; ˆut j,k) 14: end for 15: Pk Pk {(bw T j,k; ˆu T j,k)}. 16: end for 17: P = t0 k=1Pk {(w = 0; u(z) = 0)} 18: end for 19: (bw; ˆu) = Test[(w; u) P] (Algorithm 4) testing 20: Return: (bw; ˆu) Algorithm 4 Testing 1: Input: ϵ > 0; positive parameters: a, b, L, R, W; list of solutions P; let r 1 L log(b W/(Lϵ) log2(1/ϵ)) 2: Draw m (b W/L)4 log5(1/ϵ)/ϵ2 new i.i.d. samples from D. 3: (bw; ˆu) = argmin(w;u) P{ 1 i=1(u(w x(i)) y(i))21{|w x(i)| Wr} }. 4: Return: (bw; ˆu) 0 to W to find the (approximate) value of w 2, there must exist an index j such that the value of βj is η ϵ close to w 2, i.e., |βj w 2| η ϵ. We now consider this j th outer loop and ignore the subscript j for simplicity. Let wt = w 2( wt/ wt 2), which is the true normalized vector of wt that has no error. We study the squared distance between wt+1 and w : wt+1 w 2 2 = bwt η b Lsur(bwt; ˆut) w 2 2 = bwt w 2 2 + η2 b Lsur(bwt; ˆut) 2 2 2η b Lsur(bwt; ˆut) (bwt w ). (46) Applying Lemma 4.3 to (46), and plugging in Proposition 3.1, we get that when drawing m d W 9/2b4 log4(d/(ϵδ)) samples from the distribution, it holds with probability at least 1 δ that: wt+1 w 2 2 bwt w 2 2 + η2(10(OPT + ϵ) + 4b2 bwt w 2 2) + 2η(2(OPT + ϵ)/b + 2( OPT + ϵ) bwt w 2 µ vt 2 2), (48) where µ = Ca2LR4/b with C being an absolute constant, and where vt is the component of w that is orthogonal to bwt, i.e., vt = w (w bwt)bwt/ bwt 2 2 = (w ) b wt. Robustly Learning Single-Index Models via Alignment Sharpness Note that vt 2 is invariant to the rescaling of bwt, in other words, w has the same orthogonal component vt for all wt, wt Since bwt wt 2 η ϵ, we have bwt w 2 2 = bwt wt + wt w 2 2 wt w 2 2 + η2ϵ + 2η ϵ wt w 2. (49) In addition, by triangle inequality we have bwt w 2 wt w 2 + η ϵ. Therefore, substituting wt with bwt in (48), we get: wt+1 w 2 2 wt w 2 2 + η2ϵ + 2η ϵ wt w 2 + η2(10(OPT + ϵ) + 4b2 wt w 2 2 + 4b2η2ϵ + 8b2η ϵ wt w 2) + 2η(2(OPT + ϵ)/b + 2( OPT + ϵ)( wt w 2 + η ϵ) µ vt 2 2) wt w 2 2 + η2(24(OPT + ϵ) + 4b2 wt w 2 2) + 2η(2(OPT + ϵ)/b + 4( OPT + ϵ) wt w 2 µ vt 2 2), (50) where we used 4b2η2 1, which holds because η = µ/(4b2). Our goal is to show that vt+1 2 2 wt+1 w 2 2 (1 c) vt 2 2 + ϵ, where c (0, 1) is a constant and ϵ is a small error parameter. However, this linear contraction can only be obtained when vt 2 is relatively small compared to w 2. Specifically, as will be manifested in Claim D.3 and the proceeding proof, the linear contraction is achieved only when vt 2 µ w 2/(4b). Luckily, we can start with a v0 such that this condition is satisfied, due to the initialization subroutine Algorithm 2, as proved in Lemma 4.1. We prove the following claim. Claim D.3. Let η = µ/(4b2). 
Then, under the assumptions of Theorem 4.2, with probability at least 1 δ, we have wt+1 w 2 2 1 µ2 whenever vt 2 (96/µ)( Proof of Claim D.3. Since the norm of wt is normalized to w , the quantity wt w 2 is controlled by vt 2 2. In particular, let w = αtwt + vt. Then, since vt wt, we have w 2 2 = α2 t wt 2 2 + vt 2 2 = α2 t w 2 2 + vt 2 2, thus, α2 t = 1 vt 2 2/ w 2 2, and vt 2 2 = (1 α2 t) w 2 2. In addition, wt w 2 2 can be expressed as a function of αt and w , as wt w 2 2 = (1 αt)2 w 2 2 + vt 2 2 = 2(1 αt) w 2 2. (51) Note that since αt = p 1 vt 2 2/ w 2 2, denoting ρt = vt 2/ w 2, we further have: 1 vt 2 2/ w 2 2 = 1 q 2ρ4 t ρ2 t, ρt [0, 1]. (52) Therefore, plugging (51) and (52) back into (50), we get: wt+1 w 2 2 2(1 αt) w 2 2 + 4b2η2(2(1 αt) w 2 2) + 8η( 2(1 αt) w 2 2ηµ vt 2 2 + 24η2(OPT + ϵ) + 4η(OPT + ϵ)/b (ρ2 t + ρ4 t) w 2 2 + 4b2η2(ρ2 t + ρ4 t) w 2 2 + 8 OPT + ϵ)ρt w 2 2ηµ vt 2 2 + 24η2(OPT + ϵ) + 4η(OPT + ϵ)/b = (1 + ρ2 t + 4b2η2(1 + ρ2 t)) vt 2 2 + 12η( OPT + ϵ) vt 2 2ηµ vt 2 2 + 4(6η2 + η/b)(OPT + ϵ) (1 + ρ2 t + 4b2η2(1 + ρ2 t)) vt 2 2 + 12η( OPT + ϵ) vt 2 2ηµ vt 2 2 + 5η(OPT + ϵ), (53) where in the last inequality we observed that since η = µ 4b2 , it holds that 24η 1, as µ is small and b 1. Robustly Learning Single-Index Models via Alignment Sharpness Note that we have assumed that vt 2 (96/µ)( OPT + ϵ), which indicates OPT + ϵ) vt 2 1 8ηµ vt 2 2, since b 1, assuming without loss of generality. Furthermore, when vt 2 (96/µ)( OPT + ϵ), it also holds that 1 8ηµ vt 2 2 (96)2 8µ2 ηµ(OPT + ϵ) 5η(OPT + ϵ), since we have assumed µ = Ca2LR4/b 1 without loss of generality. Finally, as we will show in the rest of the proof, it holds that vt+1 2 vt 2 for t = 0, 1, . . . , T, thus as η = µ/(4b2), we have vt 2 ηµ w 2/2 = µ w 2/(4b), since v0 2 ηµ w 2/2. This condition guarantees that ρ2 t = vt 2 2/ w 2 2 1 Plugging these conditions back into (53), it is then simplified as (note that 1 + ρ2 t 1 + (1/4)ηµ 9/8 for ηµ 1/2): wt+1 w 2 2 1 + 9 2ηµ vt 2 2. Therefore, when η = µ/(4b2) we have wt+1 w 2 2 1 µ2 completing the proof. We proceed first under the condition that vt 2 (96/µ)( OPT + ϵ) holds for t = 0, . . . , T and show that after some certain number of iterations T this condition must be violated. Observe if vt 2 (96/µ)( OPT + ϵ), then it holds wt w 2 2 (1/µ2)(OPT + ϵ), implying that ˆut(wt x) is a hypothesis achieving constant approximation error according to Claim 4.4, hence the algorithm can be terminated. However, note that T only works as an upper bound for the iteration complexity of or algorithm, and it is possible that the condition vt 2 (96/µ)( OPT + ϵ) is violated at some step t < T. However, we will show later that the value of v T 2 can not be larger than c vt 2 where c is an absolute constant. We observe that: vt+1 = w (w wt+1)wt+1/ wt+1 2 2 = w (w wt+1) wt+1/ wt+1 2 2 = (w ) wt+1, therefore, vt+1 2 2 wt+1 w 2 2, which, combined with Claim D.3, yields vt+1 2 2 1 µ2 vt 2 2 1 µ2 t v0 2 2 exp µ2t The above contraction only holds when vt 2 (96/µ)( OPT + ϵ). Hence, after at most inner iterations, the algorithm outputs a vector wt with vt 2 96 OPT + ϵ), where t [T]. Now suppose at step t < T it holds that vt 2 96( OPT + ϵ)/µ but at the next iteration vt +1 2 96( OPT + ϵ)/µ. Recall first that in Lemma 4.3 we showed that b Lsur(bwt; ˆut) 2 2 4b2 bwt w 2 2 + 10(OPT + ϵ). 
Therefore, revisiting the updating scheme of the algorithm we have vt +1 2 2 wt +1 w 2 2 = bwt η b Lsur(bwt ; ˆut ) w 2 2 2 bwt w 2 2 + 2η2 b Lsur(bwt ; ˆut ) 2 2 (2 + 8b2η2) bwt w 2 2 + 20η2(OPT + ϵ) 3 bwt w 2 2 + (OPT + ϵ), Robustly Learning Single-Index Models via Alignment Sharpness where in the last inequality we plugged in the value of η = µ/(4b2), and used the assumption that µ 1 and b 1, hence 20η2 1 and 8b2η2 1. Furthermore, recall that by the construction of the grid, bwt wt 2 η ϵ, implying that bwt w 2 2 2 wt w 2 2 + 2η2ϵ by triangle inequality. Therefore, going back to the inequality of vt +1 2 2 above, we get vt +1 2 2 6 wt w 2 2 + 6η2ϵ + OPT + ϵ 6 wt w 2 2 + 2(OPT + ϵ). Finally, observe that since wt 2 = w 2, it holds wt w 2 2 vt 2, hence, we get vt +1 2 2 12 vt 2 2 + 2(OPT + ϵ). Now since vt +1 2 96( OPT + ϵ)/µ, the value of vt 2 2 will start to decrease again for t t + 1. This implies that the value of v T 2 satisfies OPT + ϵ) 384 Combining Claim 4.4 and Lemma F.4, as we have guaranteed that v T 2 (384/µ)( OPT + ϵ), the hypothesis ˆu T (bw T x) has the L2 2 error that can be bounded as: L2(bw T ; ˆu T ) 6OPT + 3b2(4 v T 2 2 + η2ϵ) + ϵ = O b2 µ2 (OPT + ϵ) . Setting ϵ C (b2/µ2)ϵ for some universal absolute constant C we get L2(bw T ; ˆu T ) O((b2/µ2)OPT) + ϵ. It still remains to determine the batch size as drawing a sample set of size m as displayed in (47) only guarantees that the contraction of vt 2 at step t holds with probability 1 δ. Applying a union bound on all t0JT = O( W b10 µ9 ϵ log(1/ϵ)) iterations yields that the contraction holds at every step with probability at least 1 t0JTδ. Therefore, setting δ δ(t0JT) and bringing the value of δ back to (47), we get that it suffices to choose the batch size as: m = Θ W 9/2b4 log4(d/(ϵδ)) ϵ3/2 + Wb10 = Θ W 5.5b14d log5(d/(ϵδ)) to guarantee that we get a O(OPT) + ϵ-solution with probability at least 1 δ. The argument above justifies the claim that among all t0J = Wb6 log(b/µ)/(ηµ6 ϵ) hypotheses in P = {(bw T j ; ˆu T j )}t0J j=1, there exists at least one hypothesis that achieves L2 2 error O(OPT) + ϵ. To select the correct hypothesis from the set P, one only needs to draw a new batch of m = Θ(b4W 4 log(1/δ)/(L4ϵ2)) i.i.d. samples from D, and choose the hypothesis from P that achieves the minimal empirical error defined in Algorithm 3. To be concrete, we prove the following claim, whose proof can be found in Appendix D.5. Claim D.4. Fix some positive real numbers µ, ϵ, δ. Let r = 1 L log( Cb4W 4 L6ϵ2 log2( b W ϵ )) where C is a large absolute constant. Given a set of parameter-activation pairs P = {(wj; uj)}t0J j=1 such that wj B(W) and uj U(a,b) for j [t0J], where t0J = 4b9W/(µ8 ϵ), we have that using m = Θ b4W 4 log(1/δ) L4ϵ2 log5 b W i.i.d. samples from D, for any (wj; uj) P it holds with probability at least 1 δ, i=1 (uj(wj x(i)) y(i))21{|wj x(i)| Wr} E (x,y) D[(uj(wj x) y)2] 2ϵ. Therefore, according to Claim D.4, we know that testing each (bw T j ; ˆu T j ) P on a fresh set of m samples and choosing the one that achieves minimum error yields a solution (bw; ˆu) that introduces at most 2ϵ error with high probability. In Robustly Learning Single-Index Models via Alignment Sharpness conclusion, it holds by a union bound that the Algorithm 3 delivers a solution with O(OPT) + ϵ error with probability at least 1 2δ. 
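For concreteness, the following is a minimal sketch (in Python) of the testing subroutine just described (Algorithm 4): each candidate pair is scored on a fresh batch with the truncated empirical squared loss, and the empirical minimizer is returned. The function and variable names are illustrative, and W*r plays the role of the truncation threshold from the algorithm.

import numpy as np

# Minimal sketch of the testing subroutine (Algorithm 4). `candidates`
# holds (w, u) pairs with u a callable activation; the truncated loss
# discards samples with |w . x| > W * r to control heavy tails.
def select_hypothesis(candidates, x_fresh, y_fresh, W, r):
    best, best_loss = None, np.inf
    for w, u in candidates:
        proj = x_fresh @ w
        mask = (np.abs(proj) <= W * r).astype(float)   # truncation indicator
        loss = float(np.mean((u(proj) - y_fresh) ** 2 * mask))
        if loss < best_loss:
            best, best_loss = (w, u), loss
    return best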
The total sample complexity of our algorithm is N = t0JTm + m = Θ W 6.5b24d log5(d/(ϵδ)) L4µ18δϵ2 + b4W 4 log(1/δ) log5(1/ϵ) = Θ W 6.5b24d log6(d/(ϵδ)) Choosing δ = 1/6 above, we get that the Algorithm 3 succeeds with probability at least 1 2δ = 2/3, completing the proof of Theorem D.2. D.3. Proof of Lemma 4.3 This subsection is devoted to the proof of Lemma 4.3. To this aim, we first show the following lemmas that bound from above the norm of the population gradient Lsur(wt; ˆut) and the difference between the population gradient and the empirical gradient b Lsur(wt; ˆut). Lemma D.5. Let S be a sample set of m i.i.d. samples of size at least m d log4(d/(ϵδ))(b2W 3/L2ϵ)3/2. Furthermore, given wt B(W), let ˆut be defined as in (P). Then, it holds that with probability at least 1 δ, Lsur(wt; ˆut) 2 2 8(OPT + ϵ) + 2b2 wt w 2 2. Proof. By the definition of ℓ2 norms, we have: Lsur(wt; ˆut) 2 = max v 2=1 Lsur(wt; ˆut) v = max v 2=1 E (x,y) D[(ˆut(wt x) y)v x] = max v 2=1 E (x,y) D[(ˆut(wt x) ut(wt x) + ut(wt x) u t(wt x))(v x)] + E (x,y) D[(u t(wt x) u (w x) + u (w x) y)(v x)] . By the Cauchy-Schwarz inequality, we further have: Lsur(wt; ˆut) 2 E x Dx[(ˆut(wt x) ut(wt x))2] E x Dx[(v x)2] + q E x Dx[(ut(wt x) u t(wt x))2] E x Dx[(v x)2] E x Dx[(u t(wt x) u (w x))2] E x Dx[(v x)2] + q E x Dx[(u (w x) y)2] E x Dx[(v x)2] E x Dx[(ˆut(wt x) ut(wt x))2] + q E x Dx[(ut(wt x) u t(wt x))2] E x Dx[(u t(wt x) u (w x))2] + q E x Dx[(u (w x) y)2], where in the last inequality we used the fact that Dx is isotropic hence Ex Dx[(v x)2] = 1. It remains to bound these four terms above respectively. The first term is bounded by ϵ for every wt B(W), with probability at least 1 δ according to Lemma F.4. The fourth term is bounded by OPT, by definition. Recall that in Lemma 3.3 we showed that the second term in the display above is upper-bounded: Ex Dx[(ut(wt x) u t(wt x))2] OPT. For the third term, note that u t argminu U(a,b) Ex Dx[(u(wt x) u (w x))2], therefore, since u U(a,b), we have E x Dx[(u t(wt x) u (w x))2] E x Dx[(u (wt x) u (w x))2] b2 wt w 2 2, after applying the fact that u is b-Lipschitz. Thus, in conclusion, we have Lsur(wt; ˆut) 2 2 OPT + ϵ + b wt w 2. Furthermore, since (a + b)2 2a2 + 2b2 for any a, b R, we get with probability at least 1 δ: Lsur(wt; ˆut) 2 2 8OPT + 8ϵ + 2b2 wt w 2 2, completing the proof of Lemma D.5. Robustly Learning Single-Index Models via Alignment Sharpness Now we prove that the distance between Lsur(wt; ˆut) and b Lsur(wt; ˆut) is bounded by b2 wt w 2 2 + OPT + ϵ with high probability. Lemma D.6. Let S be a sample set of m (d W 9/2b4 log4(d/(ϵδ))/L4)(1/ϵ3/2 + 1/(ϵδ)) i.i.d. samples. Given a vector wt B(W), it holds that with probability at least 1 δ, b Lsur(wt; ˆut) Lsur(wt; ˆut) 2 q b2 wt w 2 2 + OPT + ϵ. Proof. Since for any mean-zero independent random variables zj, we have E[|| P j zj||2 2] = P j E[ zj 2 2], hence, by Markov: Pr[ b Lsur(wt; ˆut) Lsur(wt; ˆut) 2 s] 1 ms2 E (x,y) D[ (ˆut(wt x) y)x 2 2] . (54) By linearity of expectation, we have: E (x,y) D[ (ˆut(wt x) y)x 2 2] = k=1 E (x,y) D[(ˆut(wt x) y)2(xk)2], where xk = ek x and ek is the kth unit basis of Rd. Let r = O(W/L log(1/(Lϵ ))), then it holds Pr[|xk| r] ϵ . Then, the variance above can be decomposed into the following parts: E (x,y) D[(ˆut(wt x) y)2x2 k] = E (x,y) D[(ˆut(wt x) y)2x2 k1{|xk| r}] + E (x,y) D[(ˆut(wt x) y)2x2 k1{|xk| r}]. 
Since |y| M = O(b W/L log(b W/ϵ)), and Ex Dx[(wt x)4x4 k] W 4c2/L8, Ex Dx[x4 k] c2/L4 for Dx is L-subexponential, we have E (x,y) D[(ˆut(wt x) y)2x2 k1{|xk| r}] 2 E (x,y) D[(ˆut(wt x))2 + y2)x2 k1{|xk| r}] 2 E (x,y) D[(b(wt x))2 + y2)x2 k1{|xk| r}] E x Dx[((wt x)4x4 k] Pr[|xk| r] E x Dx[x4 k] Pr[|xk| r] (2cb2W 2/L4) ϵ + (2c M 2/L2) ϵ (4c M 2/L2) In addition, (ˆut(wt x) y)2 can be decomposed as the following: E (x,y) D[(ˆut(wt x) y)2] 4 E x Dx[(ˆut(wt x) ut(wt x))2] + 4 E x Dx[(ut(wt x) u t(wt x))2] + 4 E x Dx[(u t(wt x) u (w x))2] + 4 E (x,y) D[(u (w x) y)2]. The first term is upper bounded by 4ϵ with probability at least 1 δ for every wt B(W) whenever m d log4(d/(ϵδ))(b2W 3/L2ϵ)3/2, as proved in Lemma F.4. The second term is smaller than 4OPT, which is shown in Lemma 3.3. The third term can be upper bounded using again the definition of u t = argminu U(a,b) Ex Dx[(u(wt x) y )2], as 4 E x Dx[(u t(wt x) u (w x))2] 4 E x Dx[(u (wt x) u (w x))2] 4b2 wt w 2 2, using the fact that u is b-Lipschitz and Dx is isotropic. Lastly, the fourth term is bounded by 4OPT by the definition of u (w x). In summary, we have E (x,y) D[(ˆut(wt x) y)2x2 k1{|xk| r}] r2 E (x,y) D[(ˆut(wt x) y)2] 4r2(b2 wt w 2 2 + 2OPT + ϵ), Robustly Learning Single-Index Models via Alignment Sharpness which, combining with (55), implies that the expectation E(x,y) D[(ˆut(wt x) y)2x2 k] is bounded by: E (x,y) D[(ˆut(wt x) y)2x2 k] 4r2b2 wt w 2 2 + 4r2(2OPT + 2ϵ) (b2 wt w 2 2 + OPT + ϵ), where C is a large absolute constant. Note to get the inequality above we chose ϵ = Cϵ2(L/b)4, which then indicates that 4c(M/L)2 ϵ r2ϵ. Summing the inequality above from k = 1 to d delivers the final upper bound on the variance: E (x,y) D[ (ˆut(wt x) y)x 2 2] d CW 2 (b2 wt w 2 2 + OPT + ϵ). Thus, plugging the upper bound on the variance above back to (54), as long as m (d W 2/L2) log2(b/(Lϵ))/δ, we get with probability at least 1 δ, b Lsur(wt; ˆut) Lsur(wt; ˆut) 2 q b2 wt w 2 2 + OPT + ϵ. Noting that m (d W 9/2b4 log4(d/(ϵδ))/L4)(1/ϵ3/2 + 1/(ϵδ)) certainly satisfies the condition on m above as m (d W 2/L2) log2(b/(Lϵ))/δ, thus, we completed the proof of Lemma D.6 We can now proceed to the proof of Lemma 4.3, which can be derived directly from the preceding lemmas. Lemma D.7 (Upper Bound on Empirical Gradient Norm). Let S be a set of i.i.d. samples of size m (d W 9/2b4 log4(d/(ϵδ))/L4)(1/ϵ3/2 + 1/(ϵδ)). Given any wt B(W), let ˆut U(a,b) be the solution of optimization problem (P) with respect to wt and sample set S. Then, with probability at least 1 δ, we have that b Lsur(wt; ˆut) 2 2 4b2 wt w 2 2 + 10(OPT + ϵ). Proof. The lemma follows directly by combining Lemma D.5, Lemma D.6 and the triangle inequality. D.4. Proof of Claim 4.4 We restate and prove Claim 4.4. Claim D.8. Let w be any vector from B. Let ˆuw be the activation defined as the empirical-optimal solution of the optimization problem (P) for a fixed parameter vector w Rd with batch size m d log4(d/(ϵδ))(b2W 3/(L2ϵ))3/2. Then the L2 2 error of ˆuw(w x) is bounded by: E(x,y) D[(ˆuw(w x) y)2] 8(OPT + ϵ) + 4b2 w w 2 2. Proof. Let u w, uw be the optimal activation of problem (EP*) and (EP) under parameter w respectively. 
Then, direct calculation gives:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_w(w\cdot x) - y)^2] = \mathbf{E}_{(x,y)\sim\mathcal{D}}\big[(\hat u_w(w\cdot x) - u_w(w\cdot x) + u_w(w\cdot x) - u_w^*(w\cdot x) + u_w^*(w\cdot x) - u^*(w^*\cdot x) + u^*(w^*\cdot x) - y)^2\big]$$
$$\le 4\,\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_w(w\cdot x) - u_w(w\cdot x))^2] + 4\,\mathbf{E}_{x\sim\mathcal{D}_x}[(u_w(w\cdot x) - u_w^*(w\cdot x))^2] + 4\,\mathbf{E}_{x\sim\mathcal{D}_x}[(u_w^*(w\cdot x) - u^*(w^*\cdot x))^2] + 4\,\mathrm{OPT}$$
$$\le 8(\mathrm{OPT} + \epsilon) + 4b^2\|w - w^*\|_2^2, \tag{56}$$
where in the second inequality we used the results of Lemma 3.3 and Lemma F.4, and applied the observation that
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(u_w^*(w\cdot x) - u^*(w^*\cdot x))^2] \le \mathbf{E}_{x\sim\mathcal{D}_x}[(u^*(w\cdot x) - u^*(w^*\cdot x))^2] \le b^2\|w - w^*\|_2^2,$$
which holds by the definition of $u_w^*$.

D.5. Proof of Claim D.4

We restate Claim D.4 and establish the number of samples needed for the testing subroutine Algorithm 4.

Claim D.9. Fix positive real numbers $\mu, \epsilon, \delta$. Let $r = \frac{1}{L}\log\big(\frac{Cb^4W^4}{L^6\epsilon^2}\log^2(\frac{bW}{\epsilon})\big)$, where $C$ is a large absolute constant. Given a set of parameter-activation pairs $P = \{(w_j; u_j)\}_{j=1}^{t_0 J}$ such that $w_j\in\mathbb{B}(W)$ and $u_j\in\mathcal{U}_{(a,b)}$ for $j\in[t_0 J]$, where $t_0 J = 4b^9W/(\mu^8\sqrt{\epsilon})$, we have that using
$$m = \Theta\Big(\frac{b^4W^4\log(1/\delta)}{L^4\epsilon^2}\log^5\big(\tfrac{bW}{\epsilon}\big)\Big)$$
i.i.d. samples from $\mathcal{D}$, for any $(w_j; u_j)\in P$ it holds with probability at least $1-\delta$ that
$$\Big|\frac{1}{m}\sum_{i=1}^m (u_j(w_j\cdot x^{(i)}) - y^{(i)})^2\mathbb{1}\{|w_j\cdot x^{(i)}| \le Wr\} - \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2]\Big| \le 2\epsilon.$$

Proof. Fix some $r \ge 0$ and fix any $(w_j, u_j)\in P$. Since $\mathcal{D}_x$ is sub-exponential, we have $\Pr[|w_j\cdot x| \ge \|w_j\|_2 r] \le \frac{2}{L^2}\exp(-Lr)$. Consider the random variables
$$Z_{i,j} = (u_j(w_j\cdot x^{(i)}) - y^{(i)})^2\mathbb{1}\{|w_j\cdot x^{(i)}| \le Wr\}, \quad i = 1,\dots,m,\ j = 1,\dots,t_0 J,$$
where the $(x^{(i)}, y^{(i)})$ are independent samples drawn from $\mathcal{D}$. Recall that, using the result from Fact F.3, we can truncate the labels $y$ so that $|y| \le M$, where $M = C(bW/L)\log(bW/\epsilon)$ for some large absolute constant $C$. Hence,
$$|Z_{i,j}| \le 2\big(u_j^2(w_j\cdot x^{(i)}) + (y^{(i)})^2\big)\mathbb{1}\{|w_j\cdot x^{(i)}| \le Wr\} \le 2(b^2W^2r^2 + M^2),$$
where we used the assumption that $u_j$ is $b$-Lipschitz (with $u_j(0) = 0$) in the last inequality. Therefore, applying Hoeffding's inequality to the $Z_{i,j}$, we get:
$$\Pr\Big[\Big|\sum_{i=1}^m (Z_{i,j} - \mathbf{E}[Z_{i,j}])\Big| \ge mt\Big] \le \exp\Big(-\frac{cmt^2}{(b^2W^2r^2 + M^2)^2}\Big),$$
where $c$ is an absolute constant. Since there are $t_0 J = Wb^6/(\mu^6\eta\sqrt{\epsilon}) = 4b^8W/(\mu^7\sqrt{\epsilon})$ elements in the set $P$, applying a union bound yields:
$$\Pr\Big[\exists j\in[t_0 J]: \Big|\sum_{i=1}^m (Z_{i,j} - \mathbf{E}[Z_{i,j}])\Big| \ge mt\Big] \le \exp\Big(-\frac{cmt^2}{(b^2W^2r^2 + M^2)^2} + \log\big(4b^8W/(\mu^7\sqrt{\epsilon})\big)\Big).$$
Therefore, when
$$m = \Omega\Big(\frac{(b^2W^2r^2 + M^2)^2}{\epsilon^2}\big(\log(4b^8W/(\mu^7\sqrt{\epsilon})) + \log(1/\delta)\big)\Big), \tag{57}$$
we have, taking $t = \epsilon$, that with probability at least $1-\delta$:
$$\Big|\frac{1}{m}\sum_{i=1}^m (u_j(w_j\cdot x^{(i)}) - y^{(i)})^2\mathbb{1}\{|w_j\cdot x^{(i)}| \le Wr\} - \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \le Wr\}]\Big| \le \epsilon \tag{58}$$
for any $(w_j, u_j)\in P$. In addition, as $\Pr[|w_j\cdot x| \ge Wr] \le \Pr[|w_j\cdot x| \ge \|w_j\|_2 r] \le \frac{2}{L^2}\exp(-Lr)$, letting $\epsilon' = \frac{2}{L^2}\exp(-Lr)$, we further have:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \ge Wr\}] \le 2\,\mathbf{E}_{x\sim\mathcal{D}_x}[((u_j(w_j\cdot x))^2 + M^2)\mathbb{1}\{|w_j\cdot x| \ge Wr\}]$$
$$\le 2b^2W^2\sqrt{\mathbf{E}_{x\sim\mathcal{D}_x}[((w_j/\|w_j\|_2)\cdot x)^4]\,\Pr[|w_j\cdot x| \ge Wr]} + 2M^2\Pr[|w_j\cdot x| \ge Wr] \le (2cb^2(W/L)^2 + 2M^2)\sqrt{\epsilon'},$$
where in the second inequality we used the Cauchy–Schwarz inequality, and in the last inequality we used the property that for any unit vector $a'$ it holds that $\mathbf{E}[(a'\cdot x)^4] \le c^2/L^4$ for some absolute constant $c$, as $x$ has a $1/L$-sub-exponential tail. Therefore, choosing $r = \frac{1}{L}\log\big(\frac{C^2b^4W^4}{L^6\epsilon^2}\log^2(\frac{bW}{\epsilon})\big) = O(\frac{1}{L}\log(\frac{bW}{L\epsilon}))$ for some large absolute constant $C$ renders $\sqrt{\epsilon'} \le \epsilon/(2Cb^2(W/L)^2\log^2(bW/\epsilon))$, and we have
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \ge Wr\}] \le \epsilon.$$
Observing that $\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2]$ is the sum of $\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \le Wr\}]$ and $\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| > Wr\}]$, we thus have
$$0 \le \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2] - \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \le Wr\}] = \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| > Wr\}] \le \epsilon.$$
Bringing the choice of $r$ back into (57), we get that it suffices to choose $m$ as
$$m = \Theta\Big(\frac{b^4W^4}{L^4\epsilon^2}\log^4\big(\tfrac{bW}{\epsilon}\big)\big(\log\log(1/\epsilon) + \log(1/\delta)\big)\Big) = \Theta\Big(\frac{b^4W^4\log(1/\delta)}{L^4\epsilon^2}\log^5\big(\tfrac{bW}{\epsilon}\big)\Big).$$
Therefore, using $m = \widetilde{\Omega}(b^4W^4/(L^4\epsilon^2))$ samples, (58) implies that with probability at least $1-\delta$, for any $(w_j, u_j)\in P$ it holds that
$$\Big|\frac{1}{m}\sum_{i=1}^m (u_j(w_j\cdot x^{(i)}) - y^{(i)})^2\mathbb{1}\{|w_j\cdot x^{(i)}| \le Wr\} - \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2]\Big|$$
$$\le \Big|\frac{1}{m}\sum_{i=1}^m (u_j(w_j\cdot x^{(i)}) - y^{(i)})^2\mathbb{1}\{|w_j\cdot x^{(i)}| \le Wr\} - \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \le Wr\}]\Big|$$
$$+ \Big|\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2] - \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_j(w_j\cdot x) - y)^2\mathbb{1}\{|w_j\cdot x| \le Wr\}]\Big| \le 2\epsilon,$$
thus completing the proof of Claim D.4.

E. Efficiently Computing the Optimal Empirical Activation

In this section, we show that the optimization problem (P) can be solved efficiently, following the framework of Lu & Hochbaum (2022) with minor modifications. We show that, for any $\epsilon > 0$, there is an efficient algorithm that runs in $O(m^2\log(1/\epsilon))$ time and outputs a solution $\hat v_t$ such that $|\hat v_t(z) - \hat u_t(z)| \le \epsilon$ pointwise.

Proposition E.1 (Approximating the Optimal Empirical Activation). Let $\epsilon > 0$, and let $\mathcal{D}_x$ be $(L,R)$-well-behaved. Let $\hat u_t\in\mathcal{U}_{(a,b)}$ be the optimal solution of the optimization problem (P), given a sample set $S$ of size $m$ drawn from $\mathcal{D}$ and a parameter $w_t\in\mathbb{B}(W)$. There exists an algorithm that produces an activation $\hat v_t\in\mathcal{U}_{(a,b)}$ such that $\|\hat v_t - \hat u_t\|_\infty \le \epsilon$, with computational complexity $O(m^2\log(bW/(L\epsilon)))$.

To show Proposition E.1, we leverage the following result:

Lemma E.2 (Section 5, Lu & Hochbaum (2022)). Let $f_i(y)$ and $h_i(y)$ be convex lower semi-continuous functions for $i = 1,\dots,m$. Consider the following convex optimization problem:
$$(\hat y_1,\dots,\hat y_m) = \mathrm{argmin}_{y_1,\dots,y_m} \sum_{i=1}^m f_i(y_i) + \sum_{i=1}^{m-1} h_i(y_i - y_{i+1}), \tag{59}$$
where $y_i\in[-U, U]$ for all $i = 1,\dots,m$, for some positive constant $U$. Then, for any $\epsilon > 0$, there exists an algorithm (the cc-algorithm of Lu & Hochbaum (2022)) that outputs an $\epsilon$-close solution $\{y_1,\dots,y_m\}$, i.e., $|y_i - \hat y_i| \le \epsilon$ for all $i\in[m]$, with computational complexity $O(m^2\log(U/\epsilon))$.

Proof of Proposition E.1. We first formulate problem (P) as a quadratic optimization problem with linear constraints. To guarantee that $\hat u_t$ is an element of $\mathcal{U}_{(a,b)}$ satisfying $\hat u_t(0) = 0$, we add a zero point $(x^{(0)}, y^{(0)}) = (0, 0)$ to the dataset $S$ if $S$ does not already contain $(0,0)$. We thus assume without loss of generality that the dataset contains $(0,0)$. Denote $z_i = w\cdot x^{(i)}$, rearranging the order of the $(x^{(i)}, y^{(i)})$ so that $z_1 \le z_2 \le \cdots \le z_m$, and suppose $z_k = w\cdot x^{(0)} = 0$ for some $k\in[m]$. Then (P) is equivalent to the following optimization problem:
$$(\hat y^{(1)},\dots,\hat y^{(m)}) = \mathrm{argmin}_{\tilde y^{(i)},\,i\in[m]} \sum_{i=1}^m (\tilde y^{(i)} - y^{(i)})^2$$
$$\text{s.t.}\quad 0 \le \tilde y^{(i+1)} - \tilde y^{(i)}, \quad 1 \le i \le k-1,$$
$$a(z_{i+1} - z_i) \le \tilde y^{(i+1)} - \tilde y^{(i)}, \quad k \le i \le m-1,$$
$$\tilde y^{(i+1)} - \tilde y^{(i)} \le b(z_{i+1} - z_i), \quad 1 \le i \le m-1.$$
Define $h_i(y) = I_{[-b(z_{i+1}-z_i),\,0]}(y)$ for $i = 1,\dots,k-1$ and $h_i(y) = I_{[-b(z_{i+1}-z_i),\,-a(z_{i+1}-z_i)]}(y)$ for $i = k,\dots,m-1$, where $I_{\mathcal{Y}}(y)$ is the indicator function of a convex set $\mathcal{Y}$, i.e., $I_{\mathcal{Y}}(y) = 0$ if $y\in\mathcal{Y}$ and $I_{\mathcal{Y}}(y) = +\infty$ otherwise. It is known that the $h_i$ are convex and sub-differentiable on their domains $\mathcal{Y}_i$. In addition, let $f_i(y) = \frac{1}{2}(y - y^{(i)})^2$ for $i \neq k$ and $f_k(y) = I_{\{0\}}(y)$. Then we have the following formulation of problem (P):
$$(\hat y^{(1)},\dots,\hat y^{(m)}) = \mathrm{argmin}_{\tilde y^{(i)},\,i=1,\dots,m} \sum_{i=1}^m f_i(\tilde y^{(i)}) + \sum_{i=1}^{m-1} h_i(\tilde y^{(i)} - \tilde y^{(i+1)}). \tag{P1}$$
Note that the functions $f_i$ and $h_i$ defined above satisfy the conditions of Lemma E.2; a concrete instantiation of (P1) as a box-constrained least-squares problem is given in the code sketch below. Thus, it only remains to find bounds on the variables $\tilde y^{(i)}$. This is easy to achieve, as every $\tilde y^{(i)}$ must satisfy $|\tilde y^{(i)}| \le b|z_i| = b|w\cdot x^{(i)}|$, and the $x^{(i)}$ are sub-exponential random variables.
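The following is a minimal reference sketch of the formulation (P1), written in Python with NumPy/SciPy; the function name `fit_empirical_activation` and the toy data are our own illustration, not part of the paper. Rather than running the cc-algorithm of Lemma E.2, the sketch reparametrizes (P1) by the increments $d_i = \tilde y^{(i+1)} - \tilde y^{(i)}$, which turns the chain constraints into simple box constraints that an off-the-shelf bounded least-squares solver can handle. It therefore matches the feasible set of (P1), but makes no attempt to match the $O(m^2\log(bW/(L\epsilon)))$ running time established in the proof.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_empirical_activation(z, y, a, b):
    """Sketch of problem (P1): least-squares fit of values yhat_i = u(z_i)
    over u in U_(a,b) with u(0) = 0, i.e., minimize sum_i (yhat_i - y_i)^2
    subject to, for each gap i (z sorted, k = index of the anchor z_k = 0):
        0 <= yhat_{i+1} - yhat_i <= b*(z_{i+1} - z_i)                 (i < k),
        a*(z_{i+1} - z_i) <= yhat_{i+1} - yhat_i <= b*(z_{i+1} - z_i) (i >= k).
    Assumes 0 <= a < b and that z contains 0 exactly once, with distinct values."""
    order = np.argsort(z)
    z, y = z[order], y[order]
    m = len(z)
    k = int(np.flatnonzero(z == 0.0)[0])  # index of the anchor point z_k = 0
    dz = np.diff(z)                        # gaps z_{i+1} - z_i > 0, length m - 1
    # Increment variables d_i = yhat_{i+1} - yhat_i with box constraints,
    # playing the role of the indicator penalties h_i in (P1):
    lb = np.where(np.arange(m - 1) >= k, a * dz, 0.0)  # a-growth only for z >= 0
    ub = b * dz                                        # b-Lipschitz everywhere
    # Cumulative-sum map: yhat = A @ d, with row k identically zero so that
    # yhat_k = 0 is pinned (the role of f_k = indicator of {0}):
    A = np.zeros((m, m - 1))
    for i in range(m):
        if i > k:
            A[i, k:i] = 1.0    # yhat_i =  d_k + ... + d_{i-1}
        elif i < k:
            A[i, i:k] = -1.0   # yhat_i = -(d_i + ... + d_{k-1})
    res = lsq_linear(A, y, bounds=(lb, ub))  # box-constrained least squares
    return z, A @ res.x                      # sorted grid and fitted values

# Toy usage: recover a leaky-ReLU-like activation from noisy samples.
rng = np.random.default_rng(0)
z = np.concatenate([[0.0], rng.standard_normal(200)])
y = np.maximum(z, 0.2 * z) + 0.1 * rng.standard_normal(len(z))
zs, yhat = fit_empirical_activation(z, y, a=0.2, b=1.0)
print("empirical squared loss:", np.mean((yhat - y[np.argsort(z)]) ** 2))
```

The reparametrization makes the feasible region a product of intervals, which is what allows a generic bounded least-squares routine to be used (the factor $1/2$ in $f_i$ does not affect the minimizer); the cc-algorithm instead exploits the chain structure of (59) directly, which is how it attains the $O(m^2\log(U/\epsilon))$ guarantee of Lemma E.2.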
Returning to the proof of Proposition E.1: since the $x^{(i)}$ are sub-exponential, following the same idea as in the proof of Lemma F.2, we know that for $U = \frac{2W}{L}\log(m/(L\delta))$ it holds with probability at least $1-\delta$ that $|\tilde y^{(i)}| \le b|w\cdot x^{(i)}| \le bU$ for all $i\in[m]$. Hence, applying Lemma E.2 to problem (P1), we get that it can be solved to within $\epsilon$-error with computational complexity $O(m^2\log(bW/(L\epsilon)))$.

Since the solution $\hat v_t$ is $\epsilon$-close to $\hat u_t$, this approximate solution only results in an $\epsilon$-additive error in the sharpness result (Proposition 3.1) and the gradient-norm concentration result (Lemma 4.3). In more detail, for the result of Proposition 3.1, we have
$$\big(\nabla\widehat{\mathcal{L}}_{\mathrm{sur}}(w_t;\hat v_t) - \nabla\widehat{\mathcal{L}}_{\mathrm{sur}}(w_t;\hat u_t)\big)\cdot(w_t - w^*) = \frac{1}{m}\sum_{i=1}^m \big(\hat v_t(w_t\cdot x^{(i)}) - \hat u_t(w_t\cdot x^{(i)})\big)\,(w_t - w^*)\cdot x^{(i)} \le \frac{\epsilon}{m}\sum_{i=1}^m |(w_t - w^*)\cdot x^{(i)}| \le 2\epsilon U,$$
since $|w_t\cdot x^{(i)}| \le U$ and $|w^*\cdot x^{(i)}| \le U$ with probability at least $1-\delta$. Therefore, choosing the accuracy parameter to be $\epsilon/U$, Proposition 3.1 holds for the approximate activations $\hat v_t$ with an additional $\epsilon$ error.

Let us denote the unit ball by $\mathbb{B}$. For the gradient-norm concentration result (Lemma 4.3), note that at any iteration $t$ it always holds that
$$\|\nabla\widehat{\mathcal{L}}_{\mathrm{sur}}(w_t;\hat v_t) - \nabla\widehat{\mathcal{L}}_{\mathrm{sur}}(w_t;\hat u_t)\|_2 = \max_{v\in\mathbb{B}} \frac{1}{m}\sum_{i=1}^m \big(\hat v_t(w_t\cdot x^{(i)}) - \hat u_t(w_t\cdot x^{(i)})\big)(x^{(i)}\cdot v) \le \max_{v\in\mathbb{B}} \frac{\epsilon}{m}\sum_{i=1}^m |x^{(i)}\cdot v|.$$
Since $x$ is isotropic and $v\in\mathbb{B}$, we have $\mathbf{E}_{x\sim\mathcal{D}_x}[|x\cdot v|] \le \sqrt{\mathbf{E}[(x\cdot v)^2]} \le 1$. Now, since the $|x^{(i)}\cdot v|$ are independent $1/L$-sub-exponential random variables, applying Bernstein's inequality, it holds for any $v\in\mathbb{B}$ that
$$\Pr\Big[\frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v| - \mathbf{E}_{x\sim\mathcal{D}_x}[|x\cdot v|] \ge s\Big] \le 2\exp(-cmL^2s^2).$$
Let $\mathcal{N}(\mathbb{B},\epsilon;\ell_2)$ be an $\epsilon$-net of the unit ball $\mathbb{B}$. Note that the covering number here is of order $(1/\epsilon)^{O(d)}$; therefore, applying a union bound over $\mathcal{N}(\mathbb{B},\epsilon;\ell_2)$ and over all $t_0JT = O(\log(1/\epsilon)/\sqrt{\epsilon})$ iterations, and setting $s = 1$, it holds that
$$\Pr\Big[\exists v\in\mathcal{N}(\mathbb{B},\epsilon;\ell_2): \frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v| - \mathbf{E}_{x\sim\mathcal{D}_x}[|x\cdot v|] \ge 1\Big] \le 2\exp\big(-cmL^2 + c'd\log(1/\epsilon)\big) \le \delta,$$
where the last inequality follows from the fact that the batch size satisfies $m \gtrsim W^{9/2}b^{14}d\log(1/\delta)\log^4(d/\epsilon)/(L^4\mu^9\delta\epsilon^{3/2})$. Let $v^* = \mathrm{argmax}_{v\in\mathbb{B}} \sum_{i=1}^m |x^{(i)}\cdot v|$. There exists $v'\in\mathcal{N}(\mathbb{B},\epsilon;\ell_2)$ such that $\|v^* - v'\|_2 \le \epsilon$, and hence
$$\frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v^*| \le \frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot(v^* - v')| + \frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v'| = \frac{\epsilon}{m}\sum_{i=1}^m \Big|x^{(i)}\cdot\frac{v^* - v'}{\epsilon}\Big| + \frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v'| \le \frac{\epsilon}{m}\sum_{i=1}^m |x^{(i)}\cdot v^*| + \frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v'|,$$
where the last inequality comes from the observation that $(v^* - v')/\epsilon\in\mathbb{B}$, so $\sum_{i=1}^m |x^{(i)}\cdot((v^*-v')/\epsilon)| \le \sum_{i=1}^m |x^{(i)}\cdot v^*|$ by the definition of $v^*$. Therefore, with probability at least $1-\delta$, we have
$$\frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v^*| \le \frac{1}{1-\epsilon}\cdot\frac{1}{m}\sum_{i=1}^m |x^{(i)}\cdot v'| \le 2\big(1 + \mathbf{E}_{x\sim\mathcal{D}_x}[|v'\cdot x|]\big) \le 4.$$
This implies that $\|\nabla\widehat{\mathcal{L}}_{\mathrm{sur}}(w_t;\hat v_t)\|_2 \le \|\nabla\widehat{\mathcal{L}}_{\mathrm{sur}}(w_t;\hat u_t)\|_2 + 4\epsilon$ for all iterations, with probability at least $1-\delta$. Therefore, Lemma 4.3 continues to hold for the approximately optimal activation $\hat v_t$ with only an additive $\epsilon$ error. Thus, inequalities (46) and (48) in the proof of Theorem 4.2 remain valid for $\hat v_t$ with an additional $\epsilon$ error, and the conclusion of Theorem 4.2 is unchanged.

F. Uniform Convergence of Activations

In this section, we provide some standard uniform convergence results showing that the empirically optimal activation concentrates nicely around its population counterpart. We first bound the $L_2^2$ distance between the sample-optimal and population-optimal activations under $w_t$. To do so, we build on Lemma 8 of Kakade et al. (2011). Note that Lemma 8 of Kakade et al. (2011) only applies to bounded 1-Lipschitz activations $u:\mathbb{R}\mapsto[0,1]$, hence it is not directly applicable to our case. Fortunately, since $\mathcal{D}_x$ has a sub-exponential tail (see Definition 1.2), we are able to bound the range of $u(w\cdot x)$ for $u\in\mathcal{U}_{(a,b)}$ and $w\in\mathbb{B}(W)$ with high probability. Concretely, we prove the following lemma.
Note that in the lemma statement, $\hat u_t^*$ is a random variable defined with respect to the (random) dataset $S'$, and thus the probabilistic statement is over this random variable. We make use of the following fact from Kakade et al. (2011):

Fact F.1 (Lemma 8, Kakade et al. (2011)). Let $\mathcal{V}$ be the set of non-decreasing 1-Lipschitz functions $v:\mathbb{R}\to[0,1]$. Given $S_m = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$, where the $(x^{(i)}, y^{(i)})$ are sampled i.i.d. from some distribution $\mathcal{D}'$, let
$$\hat v_w \in \mathrm{argmin}_{v\in\mathcal{V}} \sum_{i=1}^m (v(w\cdot x^{(i)}) - y^{(i)})^2.$$
Then, with probability at least $1-\delta$ over the random dataset $S_m$, it holds uniformly for any $w\in\mathbb{B}(W)$ that
$$\mathbf{E}_{(x,y)\sim\mathcal{D}'}[(\hat v_w(w\cdot x) - y)^2] \le \inf_{v\in\mathcal{V}} \mathbf{E}_{(x,y)\sim\mathcal{D}'}[(v(w\cdot x) - y)^2] + O\Big(W\Big(\frac{d\log(Wm/\delta)}{m}\Big)^{2/3}\Big).$$

The first lemma states that, with sufficiently many samples, the idealized empirical activation $\hat u_t^*$, defined as the optimal solution of (P*), is close to its population counterpart $u_t^*$, the optimal solution of (EP*).

Lemma F.2 (Approximating the Population-Optimal Noiseless Activation by the Empirical One). Let $\mathcal{D}_x$ be $(L,R)$-well-behaved and let $w_t\in\mathbb{B}(W)$. Given a dataset $S' = \{(x^{(i)}, y'^{(i)})\}$, where the $x^{(i)}$ are i.i.d. samples from $\mathcal{D}_x$ and $y'^{(i)} = u^*(w^*\cdot x^{(i)})$, let $\hat u_t^*$ be the sample-optimal activation on $S'$ as defined in (P*). In addition, let $u_t^*$ be the corresponding population-optimal activation, following the definition in (EP*). Then, for any $\epsilon, \delta > 0$, if the size $m$ of the dataset $S'$ is sufficiently large,
$$m \gtrsim d\log^4(d/(\epsilon\delta))\Big(\frac{b^2W^3}{L^2\epsilon}\Big)^{3/2},$$
we have that with probability at least $1-\delta$, for any $w_t\in\mathbb{B}(W)$:
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2] \le \mathbf{E}_{x\sim\mathcal{D}_x}[(u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2] + \epsilon,$$
and, furthermore,
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u_t^*(w_t\cdot x))^2] \le \epsilon.$$

Proof. Our goal is to show that, with high probability, the empirically optimal activation $\hat u_t^*\in\mathcal{U}_{(a,b)}$ and the population-optimal activation $u_t^*\in\mathcal{U}_{(a,b)}$ can be rescaled to 1-Lipschitz functions mapping $\mathbb{R}$ to $[0,1]$, so that Fact F.1 can be applied. Since $x$ has a sub-exponential tail, for any $w\in\mathbb{B}(W)$ we have $\Pr[|w\cdot x| \ge \|w\|_2 r] \le \frac{2}{L^2}\exp(-Lr)$. Therefore, with probability at least $1 - (\delta_1/m)^2$, it holds that $|w\cdot x| \le \frac{2W}{L}\log(m/(L\delta_1))$. Since we have $m$ samples, a union bound over these $m$ samples yields that, with probability at least $1 - \delta_1^2/m \ge 1 - \delta_1$, it holds that $|w\cdot x^{(i)}| \le \frac{2W}{L}\log(m/(L\delta_1))$ for all $i\in[m]$, for any given $w\in\mathbb{B}(W)$. Let $r = \frac{2W}{L}\log(m/(L\delta_1))$. In the remainder of the proof, we condition on the event that $|w_t\cdot x^{(i)}| \le r$ for every $x^{(i)}$ in the dataset $S'$, which happens with probability at least $1 - \delta_1^2/m \ge 1 - \delta_1$.

Let $\mathcal{V}$ be the set of non-decreasing 1-Lipschitz functions $v:\mathbb{R}\to[0,1]$ such that $v(0) = 1/2$ and $v(z_1) - v(z_2) \ge \frac{a}{2br}(z_1 - z_2)$ for all $z_1 \ge z_2 \ge 0$. We observe that, restricted to the interval $|z| \le r$, the function $\hat u_t^*(z)/(2br) + 1/2$ is 1-Lipschitz, non-decreasing, and bounded in $[0,1]$. Thus, $(\hat u_t^*(z)/(2br) + 1/2)|_{|z|\le r} = \hat v_t^*(z)|_{|z|\le r}$ for some $\hat v_t^*\in\mathcal{V}$. Furthermore, under the condition that $|w_t\cdot x^{(i)}| \le r$, we observe that $\hat v_t^*$ is the optimal activation in the function space $\mathcal{V}$, given the dataset $S'$ and parameter $w_t$, i.e.,
$$\hat v_t^* \in \mathrm{argmin}_{v\in\mathcal{V}} \sum_{i=1}^m \big(v(w_t\cdot x^{(i)}) - (u^*(w^*\cdot x^{(i)})/(2br) + 1/2)\big)^2.$$
In other words, $\hat u_t^*(z)/(2br) + 1/2$ is the optimal empirical activation in the function class $\mathcal{V}$ when restricted to the interval $|z| \le r$.

Consider $x\sim\mathcal{D}_x$; it holds that $\Pr[|w_t\cdot x| > r] \le (\delta_1/m)^2$. Then, for any $w_t\in\mathbb{B}(W)$, the expectation $\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2]$ can be decomposed into the following terms:
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2] = \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{|w_t\cdot x| \le r\}] + \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{|w_t\cdot x| > r\}]$$
$$\le \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{|w_t\cdot x| \le r\}] + 2\,\mathbf{E}_{x\sim\mathcal{D}_x}\big[\big((\hat u_t^*(w_t\cdot x))^2 + (u^*(w^*\cdot x))^2\big)\mathbb{1}\{|w_t\cdot x| > r\}\big].$$
Since both $\hat u_t^*$ and $u^*$ are $(a,b)$-unbounded functions with $\hat u_t^*(0) = u^*(0) = 0$, it holds that $(\hat u_t^*(w_t\cdot x))^2 \le b^2W^2((w_t/\|w_t\|_2)\cdot x)^2$ and, similarly, $(u^*(w^*\cdot x))^2 \le b^2W^2((w^*/\|w^*\|_2)\cdot x)^2$. Furthermore, since for any unit vector $a'$ the random variable $a'\cdot x$ is $1/L$-sub-exponential ($\mathcal{D}_x$ being $(L,R)$-well-behaved), it holds that $\mathbf{E}_{x\sim\mathcal{D}_x}[(a'\cdot x)^4] \le c^2/L^4$ for some absolute constant $c$. Therefore, applying the Cauchy–Schwarz inequality to $\mathbf{E}[(\hat u_t^*(w_t\cdot x))^2\mathbb{1}\{|w_t\cdot x| > r\}]$, it holds that
$$\mathbf{E}[(\hat u_t^*(w_t\cdot x))^2\mathbb{1}\{|w_t\cdot x| > r\}] \le b^2W^2\sqrt{\mathbf{E}_{x\sim\mathcal{D}_x}[((w_t/\|w_t\|_2)\cdot x)^4]\,\Pr[|w_t\cdot x| > r]} \le cb^2W^2\delta_1/(L^2m), \tag{60}$$
and, similarly, $\mathbf{E}[(u^*(w^*\cdot x))^2\mathbb{1}\{|w_t\cdot x| > r\}] \le cb^2W^2\delta_1/(L^2m)$. Thus, plugging these into the upper bound on $\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2]$ displayed above, we get
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2] \le \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{|w_t\cdot x| \le r\}] + 4cb^2W^2\delta_1/(L^2m).$$
We are now ready to apply Fact F.1 (note that $\mathcal{V}$ is a smaller function class than the class of bounded non-decreasing 1-Lipschitz functions described in Fact F.1, hence the result of Fact F.1 applies). Denote $A = \{x : |w_t\cdot x| \le r\}$. Let $y'' = y'/(2br) + 1/2$, where $y' = u^*(w^*\cdot x)$. Since, conditioned on $A$, $\hat u_t^*(z)/(2br) + 1/2$ is the optimal empirical activation, applying Fact F.1 we get, with probability at least $1-\delta_2$:
$$\mathbf{E}_{x\sim\mathcal{D}_x}\big[\big(\hat u_t^*(w_t\cdot x)/(2br) + 1/2 - (u^*(w^*\cdot x)/(2br) + 1/2)\big)^2 \,\big|\, A\big] = \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat v_t^*(w_t\cdot x) - y'')^2\,|\,A]$$
$$\le \inf_{v\in\mathcal{V}} \mathbf{E}_{x\sim\mathcal{D}_x}[(v(w_t\cdot x) - y'')^2\,|\,A] + O\big(W(d\log(m/\delta_2)/m)^{2/3}\big).$$
Let $\mathcal{V}|_{|z|\le r}$ and $\mathcal{U}_{(a,b)}|_{|z|\le r}$ denote the functions of $\mathcal{V}$ and $\mathcal{U}_{(a,b)}$ restricted to the interval $|z|\le r$, respectively. It is easy to see that, by the definitions of $\mathcal{U}_{(a,b)}$ and $\mathcal{V}$, $(\mathcal{U}_{(a,b)}|_{|z|\le r})/(2br) + 1/2 \subseteq \mathcal{V}|_{|z|\le r}$. Therefore,
$$\inf_{v\in\mathcal{V}} \mathbf{E}_{x\sim\mathcal{D}_x}[(v(w_t\cdot x) - y'')^2\,|\,A] \le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{x\sim\mathcal{D}_x}[(u(w_t\cdot x)/(2br) + 1/2 - y'')^2\,|\,A] = \frac{1}{4b^2r^2}\inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{x\sim\mathcal{D}_x}[(u(w_t\cdot x) - y')^2\,|\,A].$$
Hence, with probability at least $1-\delta_2$,
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{A\}] = 4b^2r^2\,\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat v_t^*(w_t\cdot x) - y'')^2\,|\,A]\,\Pr[A]$$
$$\le 4b^2r^2\inf_{v\in\mathcal{V}} \mathbf{E}_{x\sim\mathcal{D}_x}[(v(w_t\cdot x) - y'')^2\,|\,A]\,\Pr[A] + O\big(b^2r^2W(d\log(m/\delta_2)/m)^{2/3}\big)\Pr[A]$$
$$\le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{x\sim\mathcal{D}_x}[(u(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{A\}] + O\big(b^2r^2W(d\log(m/\delta_2)/m)^{2/3}\big)$$
$$\le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{x\sim\mathcal{D}_x}[(u(w_t\cdot x) - u^*(w^*\cdot x))^2] + O\big(b^2r^2W(d\log(m/\delta_2)/m)^{2/3}\big).$$
Setting $\delta_1 = \delta_2 = \delta/2$ and plugging everything back into (60), we finally get that, with probability at least $1-\delta$,
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2] \le \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\mathbb{1}\{|w_t\cdot x|\le r\}] + 2cb^2W^2\delta/(L^2m)$$
$$\le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{x\sim\mathcal{D}_x}[(u(w_t\cdot x) - u^*(w^*\cdot x))^2] + O\Big(\frac{b^2W^3}{L^2}\log^2\big(\tfrac{m}{L\delta}\big)\Big(\frac{d\log(m/\delta)}{m}\Big)^{2/3}\Big).$$
To complete the first part of the claim, it remains to choose $m$ as the following value:
$$m = \Theta\Big(d\log^4(d/(\epsilon\delta))\Big(\frac{b^2W^3}{L^2\epsilon}\Big)^{3/2}\Big).$$
For the second part of the claim, note that $\mathcal{U}_{(a,b)}$ is a closed convex set of functions, and that the infimum $\inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{x\sim\mathcal{D}_x}[(u(w_t\cdot x) - u^*(w^*\cdot x))^2]$ is attained by $u_t^*(z)$. As we have shown, with the sample size $m$ specified above, with probability at least $1-\delta$ it holds that
$$\epsilon \ge \mathbf{E}_{x\sim\mathcal{D}_x}\big[(\hat u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2 - (u_t^*(w_t\cdot x) - u^*(w^*\cdot x))^2\big]$$
$$= \mathbf{E}_{x\sim\mathcal{D}_x}\big[(\hat u_t^*(w_t\cdot x) - u_t^*(w_t\cdot x))(\hat u_t^*(w_t\cdot x) + u_t^*(w_t\cdot x) - 2u^*(w^*\cdot x))\big]$$
$$= \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u_t^*(w_t\cdot x))^2] + 2\,\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u_t^*(w_t\cdot x))(u_t^*(w_t\cdot x) - u^*(w^*\cdot x))].$$
Since $\hat u_t^*(z)\in\mathcal{U}_{(a,b)}$, applying the second part of Claim C.6 with $v = \hat u_t^*$ we get
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(u_t^*(w_t\cdot x) - \hat u_t^*(w_t\cdot x))(u^*(w^*\cdot x) - u_t^*(w_t\cdot x))] \ge 0.$$
Thus, we have:
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t^*(w_t\cdot x) - u_t^*(w_t\cdot x))^2] \le \epsilon.$$
This completes the proof of Lemma F.2.

To prove a similar uniform convergence result for the attainable activations $\hat u_t$, we make use of the following fact from prior work, which shows that we can, without loss of generality, take the noisy labels to be bounded by $M = O(\frac{bW}{L}\log(bW/\epsilon))$, owing to $\mathcal{D}_x$ being $(L,R)$-well-behaved.

Fact F.3 (Lemma D.8, Wang et al. (2023)).
Let $y' = \mathrm{sign}(y)\min(|y|, M)$ for $M = \frac{bW}{L}\log(\frac{16b^4W^4}{\epsilon^2})$. Then:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(u^*(w^*\cdot x) - y')^2] \le \mathrm{OPT} + \epsilon.$$

In other words, we can assume $|y| \le M$ without loss of generality by truncating the labels that exceed $M$ in magnitude. Under this assumption, as stated in Lemma F.4 below, we bound the $L_2^2$ distance between $\hat u_t$ and $u_t$ using arguments similar to those in Lemma F.2.

Lemma F.4 (Approximating the Population-Optimal Activation by the Empirical One). Let $w_t\in\mathbb{B}(W)$. Given a distribution $\mathcal{D}$ whose marginal $\mathcal{D}_x$ is $(L,R)$-well-behaved, let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$, where the $(x^{(i)}, y^{(i)})$ for $i\in[m]$ are i.i.d. samples from $\mathcal{D}$. Let $\hat u_t$ be a sample-optimal activation for the dataset $S$ and parameter vector $w_t$, as defined in (P). In addition, let $u_t$ be the corresponding population-optimal activation, as defined in (EP). Then, for any $\epsilon, \delta > 0$, choosing a sufficiently large
$$m \gtrsim d\log^4(d/(\epsilon\delta))\Big(\frac{b^2W^3}{L^2\epsilon}\Big)^{3/2},$$
we have that for any $w_t\in\mathbb{B}(W)$, with probability at least $1-\delta$ over the dataset $S$:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2] \le \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_t(w_t\cdot x) - y)^2] + \epsilon,$$
and, furthermore,
$$\mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t(w_t\cdot x) - u_t(w_t\cdot x))^2] \le \epsilon.$$

Proof. As in the proof of Lemma F.2, we choose $r = \frac{2cW}{L}\log(m/(L\delta_1))$ so that $|w_t\cdot x^{(i)}| \le r$ for all $x^{(i)}$ in the dataset with probability at least $1 - \delta_1^2/m \ge 1 - \delta_1$. We now condition on the event that $|w_t\cdot x^{(i)}| \le r$ for all $i = 1,\dots,m$. Let $\mathcal{V}$ be the set of non-decreasing 1-Lipschitz functions such that, for every $v\in\mathcal{V}$, $v(0) = 1/2$ and $v(z_1) - v(z_2) \ge \frac{a}{2br}(z_1 - z_2)$ for all $z_1 \ge z_2 \ge 0$. Then, conditioned on this event, we similarly have that $(\hat u_t(z)/(2br) + 1/2)|_{|z|\le r} = \hat v_t(z)$ for some $\hat v_t\in\mathcal{V}$, and $\hat v_t$ satisfies:
$$\hat v_t \in \mathrm{argmin}_{v\in\mathcal{V}} \sum_{i=1}^m \big(v(w_t\cdot x^{(i)}) - (y^{(i)}/(2br) + 1/2)\big)^2.$$
Again, studying the $L_2^2$ distance between $\hat u_t(z)$ and $u_t(z)$, we have:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2] = \mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2\mathbb{1}\{|w_t\cdot x| \le r\}] + \mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2\mathbb{1}\{|w_t\cdot x| > r\}].$$
The probability of $|w_t\cdot x| > r$ is small, owing to the sub-exponential tail of $\mathcal{D}_x$: $\Pr[|w_t\cdot x| > r] \le (\delta_1/m)^2$. Now, noting that $|y| \le M$ and $\mathbf{E}_{x\sim\mathcal{D}_x}[((w_t/\|w_t\|_2)\cdot x)^4] \le c^2/L^4$ by the sub-exponential property of $\mathcal{D}_x$, we have:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2\mathbb{1}\{|w_t\cdot x| > r\}] \le 2\,\mathbf{E}_{(x,y)\sim\mathcal{D}}[((\hat u_t(w_t\cdot x))^2 + y^2)\mathbb{1}\{|w_t\cdot x| > r\}]$$
$$\le 2\,\mathbf{E}_{x\sim\mathcal{D}_x}[b^2W^2((w_t/\|w_t\|_2)\cdot x)^2\mathbb{1}\{|w_t\cdot x| > r\}] + 2M^2\Pr[|w_t\cdot x| > r]$$
$$\le 2b^2W^2\sqrt{\mathbf{E}_{x\sim\mathcal{D}_x}[((w_t/\|w_t\|_2)\cdot x)^4]\,\Pr[|w_t\cdot x| > r]} + 2M^2\Pr[|w_t\cdot x| > r] \le 2cb^2W^2\delta_1/(L^2m) + 2M^2(\delta_1/m)^2,$$
where in the second inequality we used the fact that $\hat u_t$ is $b$-Lipschitz (with $\hat u_t(0) = 0$) and $w_t\in\mathbb{B}(W)$, and in the third inequality we applied the Cauchy–Schwarz inequality. Since $M = \frac{bW}{L}\log(\frac{16b^4W^4}{\epsilon^2})$, we have $M^2(\delta_1/m) \le cb^2W^2/L^2$ once $m \gtrsim \log^2(bW/\epsilon)$; thus, in the end, we get
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2\mathbb{1}\{|w_t\cdot x| > r\}] \le 4c(bW/L)^2\delta_1/m \tag{61}$$
for some absolute constant $c$.

The rest proceeds as in the proof of Lemma F.2. Let $A = \{x : |w_t\cdot x| \le r\}$ and $y'' = y/(2br) + 1/2$. As $\hat v_t(z) = \hat u_t(z)/(2br) + 1/2$ is the optimal empirical activation in $\mathcal{V}$ given $w_t$ (conditioned on $A$), applying Fact F.1 we have, with probability at least $1-\delta_2$:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}\big[\big((\hat u_t(w_t\cdot x)/(2br) + 1/2) - y''\big)^2\,\big|\,A\big] = \mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat v_t(w_t\cdot x) - y'')^2\,|\,A] \le \inf_{v\in\mathcal{V}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(v(w_t\cdot x) - y'')^2\,|\,A] + O\big(W(d\log(m/\delta_2)/m)^{2/3}\big).$$
Since $\mathcal{U}_{(a,b)}|_{|z|\le r}/(2br) + 1/2 \subseteq \mathcal{V}|_{|z|\le r}$, we further have
$$\inf_{v\in\mathcal{V}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(v(w_t\cdot x) - y'')^2\,|\,A] \le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u(w_t\cdot x)/(2br) + 1/2 - y'')^2\,|\,A] = \frac{1}{4b^2r^2}\inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u(w_t\cdot x) - y)^2\,|\,A].$$
Therefore, $\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2\mathbb{1}\{A\}]$ can be bounded from above by
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2\mathbb{1}\{A\}] = 4b^2r^2\,\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat v_t(w_t\cdot x) - y'')^2\,|\,A]\,\Pr[A]$$
$$\le 4b^2r^2\inf_{v\in\mathcal{V}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(v(w_t\cdot x) - y'')^2\,|\,A]\,\Pr[A] + O\big(b^2r^2W(d\log(m/\delta_2)/m)^{2/3}\big)$$
$$\le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u(w_t\cdot x) - y)^2\mathbb{1}\{A\}] + O\big(b^2r^2W(d\log(m/\delta_2)/m)^{2/3}\big)$$
$$\le \inf_{u\in\mathcal{U}_{(a,b)}} \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u(w_t\cdot x) - y)^2] + O\big(b^2r^2W(d\log(m/\delta_2)/m)^{2/3}\big).$$
Thus, combining with (61), we get with probability at least $1 - \delta_1 - \delta_2$:
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2] \le \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_t(w_t\cdot x) - y)^2] + O\Big(Wb^2r^2\Big(\frac{d\log(m/\delta_2)}{m}\Big)^{2/3}\Big).$$
Choosing the size of the sample set to be
$$m = \Theta\Big(d\log^4(d/(\epsilon\delta))\Big(\frac{b^2W^3}{L^2\epsilon}\Big)^{3/2}\Big),$$
and recalling that $r = \frac{2cW}{L}\log(m/(L\delta_1))$, we finally have
$$\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - y)^2] \le \mathbf{E}_{(x,y)\sim\mathcal{D}}[(u_t(w_t\cdot x) - y)^2] + \epsilon,$$
with probability at least $1-\delta$, after choosing $\delta_1 = \delta_2 = \delta/2$.

To prove the final claim of the lemma, we follow the same route as in Lemma F.2. We have just shown that, with probability at least $1-\delta$, it holds that
$$\epsilon \ge \mathbf{E}_{(x,y)\sim\mathcal{D}}\big[(\hat u_t(w_t\cdot x) - y)^2 - (u_t(w_t\cdot x) - y)^2\big] = \mathbf{E}_{x\sim\mathcal{D}_x}[(\hat u_t(w_t\cdot x) - u_t(w_t\cdot x))^2] + 2\,\mathbf{E}_{(x,y)\sim\mathcal{D}}[(\hat u_t(w_t\cdot x) - u_t(w_t\cdot x))(u_t(w_t\cdot x) - y)];$$
applying the first statement of Claim C.6 finishes the proof.
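To make the quantities used throughout this appendix concrete, the following minimal sketch (our own illustration in Python/NumPy; the helper names `truncate_labels` and `truncated_empirical_loss`, and all numeric values, are assumptions rather than quantities from the paper) computes the clipped labels $y' = \mathrm{sign}(y)\min(|y|, M)$ of Fact F.3 and the tail-restricted empirical loss that the testing subroutine (Algorithm 4 / Claim D.4) evaluates.

```python
import numpy as np

def truncate_labels(y, M):
    """Label clipping as in Fact F.3: y' = sign(y) * min(|y|, M)."""
    return np.sign(y) * np.minimum(np.abs(y), M)

def truncated_empirical_loss(u, w, X, y, threshold):
    """Tail-restricted test statistic of Claim D.4:
    (1/m) * sum_i (u(w . x_i) - y_i)^2 * 1{|w . x_i| <= threshold}."""
    proj = X @ w                      # projections w . x_i
    mask = np.abs(proj) <= threshold  # drop only the tail samples
    return np.mean((u(proj) - y) ** 2 * mask)

# Toy illustration: sub-exponential marginal, ReLU link, heavy-tailed noise.
rng = np.random.default_rng(1)
m, d = 20_000, 5
X = rng.laplace(size=(m, d))                  # Laplace coordinates: sub-exponential tails
w_star = np.ones(d) / np.sqrt(d)
y = np.maximum(X @ w_star, 0.0) + rng.standard_t(df=3, size=m)
M = 10.0   # truncation level, a stand-in for the M of Fact F.3
Wr = 8.0   # truncation radius, a stand-in for the W*r of Claim D.4
y_trunc = truncate_labels(y, M)
relu = lambda t: np.maximum(t, 0.0)
print("tail-restricted loss:", truncated_empirical_loss(relu, w_star, X, y_trunc, Wr))
print("unrestricted loss:   ", np.mean((relu(X @ w_star) - y_trunc) ** 2))
```

Because the loss is averaged against the indicator (rather than conditioned on it), the statistic is exactly the left-hand quantity that Claim D.4 compares to the population squared loss; on data with sub-exponential tails, the restricted and unrestricted averages should agree up to the $O(\epsilon)$ tail contribution bounded in the proof of Claim D.9.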