ReLU Regression with Massart Noise

Ilias Diakonikolas (University of Wisconsin-Madison, ilias@cs.wisc.edu), Jongho Park (University of Wisconsin-Madison, jongho.park@wisc.edu), Christos Tzamos (University of Wisconsin-Madison, tzamos@wisc.edu)

Abstract. We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is at most $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.

1 Introduction

Learning in the presence of outliers is a key challenge in machine learning, with several data analysis applications, including in ML security (4; 7; 46; 14) and in exploratory data analysis of real datasets with natural outliers, e.g., in biology (43; 41; 36). The goal is to design computationally efficient learners that can tolerate a constant fraction of outliers, independent of the dimensionality of the data. Early work in robust statistics (28; 30) gave sample-efficient robust estimators for various basic tasks, alas with exponential runtime. A recent line of work in computer science, starting with (13; 35), developed the first computationally efficient robust learning algorithms for various high-dimensional tasks. Since these early works, there has been significant progress in algorithmic robust high-dimensional statistics by several communities; see (16) for a recent survey.

In this work, we study the problem of learning Rectified Linear Units (ReLUs) in the presence of label noise. The ReLU function $\mathrm{ReLU}_w : \mathbb{R}^d \to \mathbb{R}$, parameterized by a vector $w \in \mathbb{R}^d$, is defined as $\mathrm{ReLU}_w(x) := \mathrm{ReLU}(w \cdot x) = \max\{0, w \cdot x\}$. ReLU regression, the task of fitting ReLUs to a set of labeled examples, is a fundamental task and an important primitive in the theory of deep learning. In recent years, ReLU regression has been extensively studied in theoretical machine learning, both from the perspective of designing efficient algorithms and from the perspective of computational hardness; see, e.g., (26; 45; 37; 48; 27; 49; 11; 15; 25; 18). The computational difficulty of this problem crucially depends on the assumptions about the input data. In the realizable case, i.e., when the labels are consistent with the target function, the problem is efficiently solvable; see, e.g., (45). On the other hand, in the presence of even a small constant fraction of adversarially labeled data, computational hardness results are known even for approximate recovery (29; 37) and under well-behaved distributions (27; 15; 25; 18). See Section 1.3 for a detailed summary of related work.
A challenging corruption model is the adversarial label noise model, in which an adversary is allowed to corrupt an arbitrary $\eta < 1/2$ fraction of the labels. The aforementioned hardness results rule out the existence of efficient algorithms for learning ReLUs with optimal error guarantees in this model, even when the underlying distribution on examples is Gaussian. Moreover, when no assumptions are made on the underlying data distribution, no fully polynomial time algorithm with a non-trivial guarantee is possible. In fact, in the distribution-independent setting, even for the simpler case of learning linear functions, there are strong computational hardness results for any constant $\eta$ (33; 29). These negative results motivate the following natural question:

Are there realistic label noise models in which efficient learning is possible without strong distributional assumptions?

Here we focus on ReLU regression in the presence of Massart (or bounded) noise (38), and provide an efficient learning algorithm with minimal distributional assumptions. In the process, we also provide an efficient noise-tolerant algorithm for the simpler case of linear regression.

The Massart model (38) is a classical semi-random noise model originally defined in the context of binary classification. In this model, an adversary has control over a random $\eta < 1/2$ fraction of the labels. Recent work (12) gave the first efficient learning algorithm for linear separators with non-trivial error guarantees in the Massart model without distributional assumptions. In this work, we ask to what extent such algorithmic results are possible for learning real-valued functions. To state our contributions, we formally define the following natural generalization of the model to real-valued functions.

Definition 1.1 (Learning Real-valued Functions with Massart Noise). Let $\mathcal{F}$ be a concept class of real-valued functions over $\mathbb{R}^d$ and let $f : \mathbb{R}^d \to \mathbb{R}$ be an unknown function in $\mathcal{F}$. For a given parameter $\eta < 1/2$, the algorithm specifies $m \in \mathbb{Z}_+$ and obtains $m$ samples $(x_i, y_i)_{i=1}^m$ such that: (a) every $x_i$ is drawn i.i.d. from a fixed distribution $D_x$, and (b) each $y_i$ is equal to $f(x_i)$ with probability $1 - \eta$ and takes an arbitrary value with probability $\eta$, chosen by an adversary after observing the samples drawn and the values that can be corrupted.

In the context of binary classification, the above model has been extensively studied in the theoretical ML community for the class of linear separators (2; 3; 51; 50; 20; 12; 9). Even though the noise model might appear innocuous at first sight, the ability of the Massart adversary to choose whether to perturb a given label and, if so, with what probability (which is unknown to the learner), makes the design of efficient algorithms in this model challenging. Specifically, for distribution-independent PAC learning of linear separators, even approximate learning in this model is computationally hard (17).

Extending this model to real-valued functions, we study regression under Massart noise in the realizable setting; that is, when the uncorrupted data exhibit clean functional dependencies, i.e., $y_i = \mathrm{ReLU}(w^* \cdot x_i)$. The realizable setting is of both theoretical and practical interest. Prior work (45; 23; 31; 49) developed algorithms for learning ReLUs in this setting (without Massart noise), providing theoretical insights on the success of deep learning architectures. On the practical side, there are many applications in which we observe clean functional dependencies on the uncorrupted data. For instance, clean measurements are prevalent in many signal processing applications, including medical imaging, and are at the heart of the widely popular field of compressive sensing (8).
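To make Definition 1.1 concrete, the following small sketch generates Massart-corrupted data for the ReLU concept class. The Gaussian marginal, the particular corruption values, and the helper name are illustrative choices of ours, not part of the model; the model only requires that each label is clean with probability at least $1 - \eta$ and may otherwise be set arbitrarily by an adversary that sees the samples it controls.

```python
import numpy as np

def sample_massart_relu(w_star, m, eta, rng):
    """Draw m samples in the spirit of Definition 1.1 for the ReLU concept class.

    Every x_i is drawn i.i.d. from a fixed marginal (here a standard Gaussian,
    an illustrative stand-in for D_x). Each label equals ReLU(w_star . x_i),
    except that a random ~eta fraction of labels may be overwritten arbitrarily
    by the adversary.
    """
    d = len(w_star)
    X = rng.standard_normal((m, d))            # x_i ~ D_x
    y = np.maximum(X @ w_star, 0.0)            # clean labels ReLU(w* . x_i)
    corrupted = rng.random(m) < eta            # each point corrupted with probability eta
    # One possible adversarial strategy: plant large negated values on the
    # points the adversary controls.
    y[corrupted] = -100.0 * (np.abs(y[corrupted]) + 1.0)
    return X, y, corrupted

rng = np.random.default_rng(0)
X, y, corrupted = sample_massart_relu(w_star=np.ones(10), m=2000, eta=0.3, rng=rng)
```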
1.1 Main Results

To build up to the more challenging case of ReLUs, we start with the simpler case of linear functions, which is in and of itself one of the most well-studied statistical tasks, with numerous applications in machine learning (44), as well as in other disciplines, including economics (21) and biology (39). In our Massart setting, the goal is to identify a linear relation $y = w^* \cdot x$ that the clean samples $(x, y)$ (inliers) satisfy. We show that, under the minimal (necessary) assumption that the distribution is not fully concentrated on any subspace, the problem is efficiently identifiable.

Theorem 1.2 (Exact Recovery of Linear Functions). Let $D_x$ be a distribution on $\mathbb{R}^d$ that has zero measure on any linear subspace and let $\eta < 1/2$ be the upper bound on the Massart noise rate. Denote by $w^*$ the vector representing the true linear function. There is an algorithm that draws $O\big(\frac{d^3}{(1-2\eta)^2}\big)$ samples, runs in $\mathrm{poly}(d, b, (1-2\eta)^{-1})$ time, where $b$ is an upper bound on the bit complexity of the samples and parameters, and outputs $w^*$ with probability at least 9/10.

We provide a more general algorithmic result that relaxes the density assumption on $D_x$ in Theorem C.3 of Appendix C, so that the only assumption needed is that the support of $D_x$ spans $\mathbb{R}^d$. It is important to note that if the distribution were concentrated entirely on a linear subspace, it would be information-theoretically impossible to determine the component of $w^*$ orthogonal to that subspace. That is, our anti-concentration assumption on the distribution is necessary for identifiability to be possible. Even when this assumption is violated and the problem is non-identifiable, we provide a (weaker) PAC learning guarantee for the linear case in Theorem D.1 of Appendix D.

Our main algorithmic result is for the problem of ReLU regression, where the inliers satisfy $y = \mathrm{ReLU}(w^* \cdot x)$ and an $\eta < 1/2$ fraction of the labels are corrupted by Massart noise. Even in this more challenging case, we show it is possible to efficiently identify the true parameter $w^*$, as long as every homogeneous halfspace contains a non-negligible fraction of the sample points.

Theorem 1.3 (Exact Recovery of ReLUs). Let $D_x$ be a distribution on $\mathbb{R}^d$ that has zero measure on any linear subspace and such that $\Pr_{x \sim D_x}[w \cdot x \ge 0] \ge \lambda$ for any $w \in \mathbb{R}^d$. Let $\eta < 1/2$ be the upper bound on the Massart noise rate. Denote by $w^*$ the parameter vector of the target ReLU. There is an algorithm that draws $O\big(\frac{d^3}{\lambda^2 (1-2\eta)^2}\big)$ samples, runs in time $\mathrm{poly}(d, b, \lambda^{-1}, (1-2\eta)^{-1})$, and outputs $w^*$ with probability at least 9/10.

We similarly provide a more general result relaxing the density assumption on $D_x$ in Theorem C.4 of Appendix C. We note that the assumption on the mass of halfspaces in Theorem 1.3 is necessary for identifiability. Indeed, if there were a halfspace, parameterized by $w \in \mathbb{R}^d$, such that $\Pr_{x \sim D_x}[w \cdot x \ge 0] = 0$, it would be impossible to distinguish between the functions $\mathrm{ReLU}(w \cdot x)$ and $\mathrm{ReLU}(2w \cdot x)$ (even without noise), as all points would have 0 labels. It remains an interesting open problem whether similar PAC learning guarantees can be obtained for the case of ReLU regression. We suspect that this problem is computationally hard in full generality.

1.2 Technical Overview

As explained in the introduction, the focus of our work is on the problem of robust regression in the presence of outliers.
When an $\eta < 1/2$ fraction of the data is arbitrarily corrupted in the realizable setting, the goal is to compute the function that fits as many points (inliers) as possible. Given a sufficient number of samples from a full-dimensional distribution, this function is unique for the class of ReLUs and matches the true function with high probability. However, even in the simpler case of linear functions, the corresponding computational problem of $\ell_0$-minimization is computationally hard without distributional assumptions, as it is an instance of robust subspace recovery (29). Our positive results are driven by relaxing the assumption that an arbitrary fraction of the points is corrupted. Instead, as defined in Definition 1.1, we consider a more restricted adversary that is presented with a uniformly random $\eta$ fraction of the points, which it can corrupt arbitrarily at will.

$\ell_0$ to $\ell_1$ minimization. Given this milder corruption model, we propose novel algorithms for efficient exact recovery of the underlying function. We obtain our algorithms by replacing $\ell_0$-minimization with $\ell_1$-minimization, which can be shown to converge to the true function in the limit and is efficient to optimize in the linear regression case. For intuition, consider a single-point distribution that always outputs labeled examples of the form $(x, y)$, where the example $x$ is always the same but the labels $y$ may differ. The Massart assumption guarantees that the value of $y$ is correct more than half of the time, so the estimate that maximizes the number of correct samples (the $\ell_0$-minimizer) recovers the underlying function. However, if one considers the $\ell_1$-minimizer, i.e., the value $v$ that minimizes $\mathbb{E}[|y - v|]$, this corresponds to the median value of $y$, which is also correct when more than half of the samples are correct.

We generalize this intuition and propose a natural and tight condition under which empirical $\ell_1$-minimization yields the true $\ell_0$-minimizer (see Lemma 2.2). While this condition holds under Massart noise for arbitrary distributions at the population level, it can fail to hold with high probability when considering only a finite set of samples from the distribution. For example, consider the one-dimensional case of $w^* = 1$, where most $x_i$'s are near zero and uncorrupted, while a few corrupted samples lie extremely far from zero. The empirical $\ell_1$-minimizer here will be dominated by the few corrupted samples and would not be the $\ell_0$-minimizer. In particular, the sample complexity of naive $\ell_1$-minimization would crucially depend on the concentration properties of the distribution on $x$.

Transforming the points via radial isotropy. The main technical idea behind obtaining sample-efficient algorithms that run in polynomial time is to transform the original point set into an equivalent one that satisfies the required properties with high probability, as it becomes sufficiently concentrated. In particular, performing a linear transformation mapping every point $x$ to $Ax$, while keeping the corresponding label $y$, is without loss of generality, as we are interested in identifying the true (generalized) linear function, which depends only on the inner product of every point with a parameter vector $w$. Finding such a vector $w'$ in the transformed space results in the equivalent vector $w = A^{\top} w'$ in the original space. Moreover, an additional operation we can perform is to take a single sample $(x, y)$ and multiply it by a positive scalar $\lambda > 0$, replacing it with the sample $(\lambda x, \lambda y)$. For both the linear and ReLU cases, any sample that is an inlier for the true function remains an inlier after these transformations (see the short check below).
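As a quick sanity check of these two invariances in the ReLU case, the following snippet (our own illustrative example, using NumPy and a symmetric positive definite $A$ as in Lemma 2.1) verifies that an inlier stays an inlier when the point is mapped by $A$ with the parameter mapped correspondingly, and when a single sample is rescaled by a positive $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w = rng.standard_normal(d)
x = rng.standard_normal(d)
y = max(w @ x, 0.0)                                   # an inlier: y = ReLU(w . x)

# Operation 1: map x -> Ax; the parameter in the transformed space is A^{-1} w
# (for the symmetric positive definite A of Lemma 2.1 this is the w = A^T w' correspondence).
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)                               # symmetric positive definite, hence invertible
w_t = np.linalg.solve(A, w)                           # w' = A^{-1} w, so w' . (Ax) = w . x
assert np.isclose(max(w_t @ (A @ x), 0.0), y)         # still an inlier in the transformed space

# Operation 2: rescale a single sample (x, y) -> (lam * x, lam * y) for lam > 0.
lam = 2.7
assert np.isclose(max(w @ (lam * x), 0.0), lam * y)   # ReLU is positively homogeneous
```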
We can use these two operations to bring our point set into radial-isotropic position, i.e., so that all the $x$'s in the dataset are unit-norm and the variance in any direction is nearly identical:

Definition 1.4 (Radial Isotropy). Let $S^{d-1}$ be the unit sphere in $\mathbb{R}^d$. Given $\{x_1, \ldots, x_n\} \subseteq S^{d-1}$, a linear map $A : \mathbb{R}^d \to \mathbb{R}^d$ is a radial-isotropic transformation if $\sum_{i=1}^n \frac{(Ax_i)(Ax_i)^{\top}}{\|Ax_i\|_2^2} = \frac{n}{d} I$. For $0 < \gamma < 1$, we say the points are in $\gamma$-approximate radial-isotropic position if, for all $v \in S^{d-1}$, it holds that $(d/n) \sum_{i=1}^n (x_i \cdot v)^2 \ge 1 - \gamma$.

In such a normalized position, we can argue that with high probability the weight of the inliers in every direction is more than the weight of the outliers, which guarantees that the empirical $\ell_1$-minimizer will converge to the true function.

Learning ReLUs. Unfortunately, while $\ell_1$-minimization for linear functions is convex and efficiently solvable via linear programming, $\ell_1$-minimization for ReLUs is challenging due to its non-convexity; that is, we cannot easily reduce ReLU regression to a simple optimization method. We instead establish a structural condition under which we can compute an efficient separation oracle between the optimal parameter vector $w^*$ and a query $w$. More specifically, we show that any suboptimal guess for the parameter vector $w$ can be improved by moving along the opposite direction of the gradient of the $\ell_1$-loss for the subset of points on which the condition in Lemma 3.1 is satisfied. Identifying such a direction of improvement yields a separating hyperplane, so we exploit this to efficiently identify $w^*$ by running the ellipsoid method with our separation oracle.

However, for this result to hold with a small number of samples, we need to again bring to radial-isotropic position the points that fall in the linear (positive) part of the ReLU for the current guess vector $w$. In contrast to the linear case, though, where this transformation was applied once, here it needs to be applied anew with every new guess. This results in a function that changes at every step, which is not suitable for direct optimization.

Using these ideas, our algorithms can efficiently recover the underlying function exactly using few samples. Our algorithms make mild genericity assumptions about the position of the points, requiring that the points are not concentrated on a lower-dimensional subspace or, for the case of ReLUs, do not lie entirely in an origin-centered halfspace. As already mentioned, such assumptions are necessary for the purposes of identifiability.

1.3 Related Work

Given the extensive literature on robust regression, here we discuss the most relevant prior work.

ReLU Regression. In the realizable setting, (45) and, more recently, (31) showed that gradient descent efficiently performs exact recovery for ReLU regression under the Gaussian distribution on examples. (49) generalized this result to a broader family of well-behaved distributions. In the agnostic or adversarial label noise model, a line of work has shown that learning with near-optimal error guarantees requires super-polynomial time, even under the Gaussian distribution (27; 15; 25; 18). On the positive side, (11) gave an efficient learner with approximation guarantees under log-concave distributions. Without distributional assumptions, even approximate learning is hard (29; 37).
The recent work (32) studies ReLU regression in the realizable setting under a noise model similar to, but more restrictive than, the Massart model of Definition 1.1. Specifically, in the setting of (32), the adversary can corrupt a label with probability at most $\eta$, but only via additive noise bounded above by a constant. (32) gives an SGD-type algorithm for ReLU regression in this model. We note that their algorithm does not achieve exact recovery and its guarantees crucially depend on the concentration properties of the marginal distribution and the bound on the additive noise.

Comparison of Noise Models. It is worth comparing the Massart noise model (Definition 1.1) with other noise models studied in the literature. The strongest corruption model we are aware of is the strong contamination model (13), in which an omniscient adversary can corrupt an arbitrary $\eta < 1/2$ fraction of the labeled examples. In the adversarial label noise model, the adversary can corrupt an arbitrary $\eta < 1/2$ fraction of the labels (but not the examples). Efficient robust learning algorithms in these models typically only give approximate error guarantees and require strong distributional assumptions. Specifically, for the case of linear regression, (34; 19; 14) give robust approximate learners in the strong contamination model under the Gaussian distribution and, more broadly, distributions with bounded moments. In the adversarial label noise model, (6) gave efficient robust learners, under strong concentration bounds on the underlying distribution, that can tolerate an $\eta < 1/50$ fraction of outliers.

The recent work (10) considers a Massart-like noise model in the context of linear regression with random observation noise. (10) provides an SDP-based approximate recovery algorithm when the noise rate satisfies $\eta < 1/3$. It should be noted that their algorithm does not efficiently achieve exact recovery. Due to space limitations, we provide a more detailed description of that work in Appendix F.

A related noise model is that of oblivious label noise, where the adversary can corrupt an $\eta$ fraction of the labels with additive noise that is independent of the covariate $x$. More precisely, the oblivious adversary corrupts the vector of labels $y \in \mathbb{R}^m$ by adding an $\eta m$-sparse corruption vector $b$. Since $b$ is independent of the covariates, oblivious noise can be viewed as corrupting each sample with probability $\eta$ with a random non-zero entry of $b$. Consequently, oblivious noise can be seen as a special case of Massart noise. We formally compare these two noise models in more detail in Appendix E. A line of work (5; 47; 22; 42) studied robust linear regression under oblivious noise and developed efficient exact recovery algorithms under strong distributional assumptions.

2 Warmup: Linear Regression with Massart Noise

In this section, we establish structural conditions under which we can perform efficient $\ell_0$-minimization for linear functions under Massart noise. It is imperative that we find the $\ell_0$-minimizer with respect to $w$ since, with a sufficient number of samples, the $\ell_0$-minimizer is the true function we wish to recover. We then show that appropriately transforming the data via a radial-isotropic transformation and then minimizing the empirical $\ell_1$-loss, $\arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m |y_i - w \cdot x_i|$, can efficiently recover the true parameter $w^*$. We describe the algorithm for recovering linear functions below.
Algorithm 1 Linear function recovery via radial isotropy
  Draw $m = O\big(\frac{d^3}{(1-2\eta)^2}\big)$ samples $(x_i, y_i)_{i=1}^m$ with $\eta$-Massart noise
  Compute $A$ that puts $(x_i, y_i)_{i=1}^m$ in 1/2-approximate radial-isotropic position
  $\hat{w} \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^m \big| \frac{y_i}{\|Ax_i\|_2} - w \cdot \frac{Ax_i}{\|Ax_i\|_2} \big|$ by solving the LP
  return $A\hat{w}$

In fact, there is no need to compute an exact radial-isotropic transformation ($\gamma = 0$), as an approximate one suffices. An approximate radial-isotropic transformation can be computed efficiently as in Lemma 2.1, which we prove in Appendix A.

Lemma 2.1. Given $S \subseteq \mathbb{R}^d$ in general position, there is a $\mathrm{poly}(n, d, b, \gamma^{-1})$ time algorithm that computes a positive definite symmetric matrix $A$ such that $\{Ax/\|Ax\|_2 : x \in S\}$ is in $\gamma$-approximate radial-isotropic position, where $b$ is an upper bound on the bit complexity of the parameters and samples in $S$. Moreover, the condition number of $A$ is at most $2^{\mathrm{poly}(n,d,b)}$.

Given that computing such an approximate transformation $A$ and solving a linear program (LP) can be done efficiently, Algorithm 1 gives us the polynomial runtime for Theorem 1.2. The proof of Theorem 1.2 relies on two key ideas. First, we can minimize the empirical $\ell_1$-loss with respect to $w$ instead of the $\ell_0$-loss, because the two are equivalent under the structural condition of Lemma 2.2. Once the following sufficient condition is satisfied, we can efficiently solve the $\ell_1$-minimization, as this is efficient in the case of linear functions by solving an LP.

Lemma 2.2 (Structural Condition for Recovery). Given $f : \mathbb{R} \to \mathbb{R}$ and $m$ samples $(x_i, y_i)_{i=1}^m$ with $x_i \in \mathbb{R}^d$, let the $\ell_0$-minimizer $w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m \|y_i - f(w \cdot x_i)\|_0$ be unique. If
$$\sum_{i\,:\, y_i = f(w^* \cdot x_i)} \big|f((w^* + r) \cdot x_i) - f(w^* \cdot x_i)\big| \;>\; \sum_{i\,:\, y_i \neq f(w^* \cdot x_i)} \big|f((w^* + r) \cdot x_i) - f(w^* \cdot x_i)\big| \qquad (\star)$$
for all non-zero $r \in \mathbb{R}^d$, then $w^*$ is also the $\ell_1$-minimizer $\arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m |y_i - f(w \cdot x_i)|$.

The structural condition $(\star)$ for linear functions reduces to having the sum of $|r \cdot x_i|$ over the good points be greater than the sum of $|r \cdot x_i|$ over the bad points in every direction $r$. However, this implies that if one sample is much greater in norm than the others in some direction, this point can have undue influence and may easily dominate the $\ell_1$-loss. Therefore, without any preprocessing or transformation of the data, one has to rely on naively increasing the sample complexity until there are enough points in this direction to satisfy condition $(\star)$. Instead, we minimize the dominating effects of such outlier points and reduce the sample complexity by transforming the dataset with radial isotropy. We defer the proofs of Lemma 2.2 and Theorem 1.2 to Appendix A.
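For concreteness, here is a compact sketch of Algorithm 1 in Python, assuming NumPy and CVXPY (CVXPY's LP solver is what the experiments in Section 4 use for $\ell_1$-regression). The radial-isotropy step below is a simple normalize-and-whiten iteration standing in for the exact algorithm of Lemma 2.1, and the function names are ours.

```python
import numpy as np
import cvxpy as cp

def approx_radial_isotropy(X, n_iter=50):
    """Heuristic stand-in for the Lemma 2.1 subroutine: repeatedly project the
    points onto the unit sphere and whiten them, accumulating the linear map A."""
    n, d = X.shape
    A = np.eye(d)
    Z = X.copy()
    for _ in range(n_iter):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # push points onto S^{d-1}
        cov = (d / n) * Z.T @ Z                            # equals I at radial isotropy
        evals, evecs = np.linalg.eigh(cov)
        evals = np.clip(evals, 1e-12, None)
        B = evecs @ np.diag(evals ** -0.5) @ evecs.T       # whitening step
        Z = Z @ B.T
        A = B @ A
    return A

def recover_linear(X, y):
    """Sketch of Algorithm 1: transform, solve the l1 problem as an LP, map back."""
    n, d = X.shape
    A = approx_radial_isotropy(X)
    Xt = X @ A.T                                           # transformed points A x_i
    scale = np.linalg.norm(Xt, axis=1)
    Xt, yt = Xt / scale[:, None], y / scale                # rescale each sample by ||A x_i||
    w = cp.Variable(d)
    cp.Problem(cp.Minimize(cp.sum(cp.abs(yt - Xt @ w)))).solve()
    return A.T @ w.value                                   # map back: A^T w' (= A w' for symmetric A)
```

On data satisfying the assumptions of Theorem 1.2 and with enough samples, the returned vector should match $w^*$; the heuristic isotropy step is the main simplification relative to the paper's guarantee.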
3 ReLU Regression with Massart Noise

In this section, we study the problem of exact recovery for ReLUs in the presence of Massart noise. For the case of ReLUs, we can still use the structural condition of Lemma 2.2 to perform $\ell_1$-minimization, $\arg\min_{w} \frac{1}{m}\sum_{i=1}^m |y_i - \mathrm{ReLU}(w \cdot x_i)|$. However, minimizing this objective is no longer straightforward, because the objective function is non-convex. Despite this fact, it is possible to exactly recover a ReLU under mild anti-concentration assumptions on the underlying distribution. The key idea behind Theorem 1.3 is establishing the condition under which we can compute an efficient separation oracle between the query $w$ and the true parameter $w^*$. Once we obtain a separation oracle, we can use the ellipsoid method to recover $w^*$ exactly.

In turn, similarly to Lemma 2.2, we identify a sufficient structural condition on the dataset which allows us to efficiently compute a separating hyperplane between $w$ and $w^*$ whenever $w \neq w^*$, and then use radial-isotropic transformations to ensure that this condition is satisfied. We state this separation condition in the following lemma.

Lemma 3.1 (Separation Condition). Let $\mathcal{H}$ be a hypothesis class such that $\mathcal{H} = \{h_w : h_w(x) = f(w \cdot x), w \in \mathbb{R}^d\}$, where $f : \mathbb{R} \to \mathbb{R}$ is monotonically non-decreasing. Given a set of $m$ samples $(x_i, y_i)_{i=1}^m$, let $w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m \|y_i - f(w \cdot x_i)\|_0$ be unique. Let $\epsilon > 0$ and let $B(w, \epsilon)$ be the open ball of radius $\epsilon$ centered at $w$. Denote the empirical $\ell_1$-loss by $\hat{L}(w) = (1/m) \sum_{i=1}^m |y_i - f(w \cdot x_i)|$. Given a query $w' \notin B(w^*, \epsilon)$, if
$$\sum_{i\,:\, y_i = f(w^* \cdot x_i)} |(w' - w^*) \cdot x_i|\, f'(w' \cdot x_i) \;-\; \sum_{i\,:\, y_i \neq f(w^* \cdot x_i)} |(w' - w^*) \cdot x_i|\, f'(w' \cdot x_i) \;\ge\; \epsilon m \qquad (\star\star)$$
then $\nabla \hat{L}(w') \cdot (w' - w) = 0$ is a separating hyperplane for $w'$ and $B(w^*, \epsilon/2)$, in the sense that $\nabla \hat{L}(w') \cdot (w' - w) > 0$ for all $w \in B(w^*, \epsilon/2)$.

In particular, the gradient of the empirical $\ell_1$-loss gives us the separating hyperplane above. Other than the fact that only the points on the non-negative side of the halfspace defined by $w'$ are considered in the separation condition $(\star\star)$, the condition closely resembles the structural condition used for linear functions. Analogously, we apply a radial-isotropic transformation to the points with $w' \cdot x_i \ge 0$. Thus, we have the following sub-procedure of the ellipsoid method, where $\epsilon$ is the radius of a ball which depends on the distance between the points (and hence the bit complexity $b$).

Algorithm 2 Separation oracle sub-procedure
  Input: $(x_i, y_i)_{i=1}^m$ with Massart noise, query $w'$, $\epsilon > 0$
  Output: Unless $w' \in B(w^*, \epsilon)$, a separating hyperplane between $w'$ and $B(w^*, \epsilon/2)$
  if $\mathrm{ReLU}(w' \cdot x)$ fits at least $m/2$ points then return $w'$ as the true parameter $w^*$
  $S \leftarrow \{(x_i, y_i) : w' \cdot x_i \ge 0,\ i \in [m]\}$
  Compute $A$ that puts $S_x$ into 1/2-approximate radial-isotropic position
  $r \leftarrow \frac{1}{|S|} \sum_{(x_i, y_i) \in S} \frac{Ax_i}{\|Ax_i\|_2}\, \mathrm{sgn}(w' \cdot x_i - y_i)$
  return the separating hyperplane $A^{-1} r \cdot (w' - w) = 0$

The main difference between the algorithm for ReLUs and the one for linear functions is that here we must apply a different radial-isotropic transformation to every new subset of points in every iteration, depending on the query $w'$. In turn, the algorithm transforms the space according to a new transformation $A$, computes a separating hyperplane, and transforms the hyperplane back into the original space. Due to these repeated transformations, the proof of Theorem 1.3 requires a more intricate argument to make the ellipsoid method work correctly.
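The following sketch mirrors Algorithm 2, reusing the approx_radial_isotropy helper from the Algorithm 1 sketch; the exact-fit tolerance and the threshold of $m/2$ fitted points are illustrative simplifications.

```python
import numpy as np

def separation_oracle(X, y, w_query, fit_tol=1e-9):
    """Sketch of Algorithm 2. Returns ("done", w_query) if the query already fits
    at least half of the samples exactly (up to fit_tol); otherwise returns a
    vector g such that g . (w_query - w) = 0 is the separating hyperplane of
    Lemma 3.1, with g . (w_query - w) > 0 for w near the true parameter."""
    m, d = X.shape
    residual = y - np.maximum(X @ w_query, 0.0)
    if np.sum(np.abs(residual) <= fit_tol) >= m / 2:
        return "done", w_query

    pos = X @ w_query >= 0                       # points on the positive side of the query
    S, yS = X[pos], y[pos]
    A = approx_radial_isotropy(S)                # heuristic stand-in for Lemma 2.1
    SA = S @ A.T
    SA = SA / np.linalg.norm(SA, axis=1, keepdims=True)
    signs = np.sign(S @ w_query - yS)            # sgn(w' . x_i - y_i)
    r = (signs[:, None] * SA).mean(axis=0)       # gradient direction in the transformed space
    g = np.linalg.solve(A, r)                    # map the hyperplane back: A^{-1} r
    return "cut", g
```

Inside the ellipsoid method, one would keep the halfspace on which $g \cdot (w_{\text{query}} - w) > 0$, since that is the side containing the ball around $w^*$.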
We now prove Theorem 1.3.

Proof of Theorem 1.3. Let $w'$ be the original query to the oracle and assume that the separation condition $(\star\star)$ holds for the set of points $(Ax_i/\|Ax_i\|_2, y_i/\|Ax_i\|_2)_{i=1}^m$ and the query $A^{-1}w'$. Then, by Lemma 3.1, $r \cdot (A^{-1}w' - w) = 0$, where $r = (1/m)\sum_{i=1}^m (Ax_i/\|Ax_i\|_2)\, \mathrm{sgn}(w' \cdot x_i - y_i)$, separates $A^{-1}w'$ and $A^{-1}w^*$. Thus the separating hyperplane for $w'$ and $w^*$ is $A^{-1}r \cdot (w' - w) = 0$.

Now it remains to check the sample complexity necessary for the separation condition $(\star\star)$. Each unique set $\{(x_i, y_i) : w' \cdot x_i \ge 0,\ i \in [m]\}$ determines a radial-isotropic transformation, but there can only be at most $m^{d+1}$ unique sets by the VC dimension of halfspaces. So there are only at most $m^{d+1}$ radial-isotropic transformations to consider. Let $A$ be the linear map of the radial-isotropic transformation applied to the points with $w' \cdot x_i \ge 0$. Denote $(\tilde{x}_i, \tilde{y}_i) = (Ax_i/\|Ax_i\|_2, y_i/\|Ax_i\|_2)$, $\tilde{w}^* = A^{-1}w^*$, $\tilde{w}' = A^{-1}w'$, and let $\tilde{D}_x$ be $D_x$ conditioned on $\{w' \cdot x \ge 0\}$, transformed by $A$ and then normalized so that $\tilde{D}_x$ lies on $S^{d-1}$.

Then, for all $m^{d+1}$ transformations, using $m = O(d/\epsilon'^2)$ samples the following VC inequality holds with high probability, uniformly over all thresholds $t \ge 0$ (the probability and expectations below are over a transformed sample $(\tilde{x}, \tilde{y})$):
$$\Pr\big[\, |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\, \mathbf{1}\{\tilde{w}' \cdot \tilde{x} \ge 0,\ \tilde{y} = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})\} > t \,\big] \;-\; \frac{1}{m}\sum_{i=1}^m \mathbf{1}\big\{ |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}_i| > t,\ \tilde{w}' \cdot \tilde{x}_i \ge 0,\ \tilde{y}_i = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x}_i) \big\} \;\le\; \epsilon'.$$
Let $\tilde{S} = \{(\tilde{x}_i, \tilde{y}_i) : \tilde{w}' \cdot \tilde{x}_i \ge 0\}$. Similarly to the proof of Theorem 1.2, we have that
$$\max_{(\tilde{x}, \tilde{y}) \in \tilde{S}} |r \cdot \tilde{x}| \;\le\; \frac{d}{1 - \gamma}\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\big[|r \cdot \tilde{x}|\big],$$
where $\gamma = 1/2$. Then, integrating the VC inequality over $t$ and using the bound on the maximum, we can write
$$\frac{1}{|\tilde{S}|}\sum_{(\tilde{x}_i, \tilde{y}_i)\in\tilde{S}} |(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}_i|\,\mathbf{1}\{\tilde{y}_i=\mathrm{ReLU}(\tilde{w}^*\cdot\tilde{x}_i)\} \;\ge\; \frac{m}{|\tilde{S}|}\Big( \mathbb{E}\big[|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}|\,\mathbf{1}\{\tilde{w}'\cdot\tilde{x}\ge 0,\ \tilde{y}=\mathrm{ReLU}(\tilde{w}^*\cdot\tilde{x})\}\big] - \epsilon'\max_{(\tilde{x},\tilde{y})\in\tilde{S}}|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}| \Big)$$
$$\ge\; \frac{m}{|\tilde{S}|}(1-\eta)\,\mathbb{E}\big[|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}|\,\mathbf{1}\{\tilde{w}'\cdot\tilde{x}\ge 0\}\big] \;-\; \frac{\epsilon' m d}{|\tilde{S}|(1-\gamma)}\,\mathbb{E}_{\tilde{x}\sim\tilde{D}_x}\big[|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}|\big].$$
By setting $\epsilon' = O(\lambda(1-2\eta)/d)$, we can bound $m/|\tilde{S}|$ by at most a constant times $\lambda^{-1}$, for all $m^{d+1}$ possible subsets $\tilde{S}$, using Hoeffding's inequality and the union bound. Then we have
$$\frac{1}{|\tilde{S}|}\sum_{(\tilde{x},\tilde{y})\in\tilde{S}\,:\,\tilde{y} = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})} |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}| \;\ge\; \Big(\frac{1}{2} + \frac{1-2\eta}{4}\Big)\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\big[|(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\big].$$
We can do the same for the corrupted points in $\tilde{S}$, obtaining an upper bound of $\big(\frac{1}{2} - \frac{1-2\eta}{4}\big)\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}[|(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|]$. Thus, for the points $(\tilde{x}, \tilde{y}) \in \tilde{S}$, we have the condition
$$\frac{1}{|\tilde{S}|}\bigg(\sum_{\tilde{y} = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})} |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}| \;-\; \sum_{\tilde{y} \neq \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})} |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\bigg) \;\ge\; \frac{1-2\eta}{2}\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\big[|(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\big] \;\ge\; \frac{1-2\eta}{2} \cdot \frac{\|\tilde{w}' - \tilde{w}^*\|_2}{d}.$$
By Lemma 3.1, the inequality above implies that we can find a hyperplane through $r$ that separates $\tilde{w}'$ and $B(\tilde{w}^*, \epsilon/2)$, where $\epsilon = \frac{1-2\eta}{2d}\|\tilde{w}' - \tilde{w}^*\|_2$. In the original space of $(x_i, y_i)_{i=1}^m$, the transformed hyperplane given by $A^{-1}r$ separates $w'$ and $B(w^*, \epsilon_0/2)$, where $\epsilon_0 = \frac{1-2\eta}{2d\,\lambda_{\max}(A)}\|w' - w^*\|_2$, since applying $A^{-1}$ to $r$ keeps the distance from the hyperplane to $w'$ at least $\frac{1-2\eta}{2d\,\lambda_{\max}(A)}\|w' - w^*\|_2$, and applying $A$ to $\tilde{w}$ bounds the distance from $w$ to $w'$ from below accordingly. Therefore, we can set the parameter $\epsilon$ of Algorithm 2 to be equal to $\min_{w \neq w^*} \frac{1-2\eta}{2d\,\lambda_{\max}(A)}\|w - w^*\|_2$. If $w \neq w^*$, by the bounded bit complexity $b$, we have that the volume of the ellipsoid decreases at every step, but a ball of radius $\frac{(1-2\eta)\lambda_{\min}(A)}{4d\,\lambda_{\max}(A)}$ will always be contained in it. Thus the algorithm terminates in $\mathrm{poly}(d, b, (1-2\eta)^{-1})$ iterations, since $\log(1/\epsilon) = \mathrm{poly}(d, b, (1-2\eta)^{-1})$.

4 Experiments

In this section, we apply our algorithms based on radial-isotropic transformations to both synthetic and real datasets and compare their robustness in regression against the baseline methods of $\ell_1$- and $\ell_2$-regression. Our experiments demonstrate the efficacy of radial-isotropic transformations in robust regression and show how our algorithms outperform baseline regression methods. All experiments were done on a laptop computer with a 2.3 GHz Dual-Core Intel Core i5 CPU and 8 GB of RAM. We ran CVXPY's linear program solver for $\ell_1$-regression for linear functions.

Recovering Linear Functions. We first show how our algorithm based on radial-isotropic position (Algorithm 1) compares to naive $\ell_1$-regression in exact recovery using an LP solver. As another baseline, we also ran $\ell_1$-regression with a normalization preprocessing step, where we normalize all points $(x, y)$ to $(\frac{x}{\|x\|}, \frac{y}{\|x\|})$. We did not run regression with an isotropic-transformation preprocessing step, because this yields identical results to naive regression with no preprocessing. We evaluated the different transformations of the data on the following synthetic distribution. Define a mixture of Gaussians $D_x = \frac{1}{2} N\big(e_1, \frac{1}{d^2} I_d\big) + \frac{1}{2d} \sum_{i=1}^d N\big(d e_i, \frac{1}{d^2} I_d\big)$, where $e_i$ denotes the $i$-th standard basis vector and $d = 30$. Let $w^* = 9 e_2 + \sum_{i=1}^d e_i$. For various noise levels $\eta$, consider the following $\eta$-Massart adversary: the labels of all $x$ for which any coordinate is greater than $\frac{d}{2}$ are flipped to $-w^* \cdot x$ with probability $\eta$, and the labels of all other points are not flipped. Essentially, only the points not from $N(e_1, \frac{1}{d^2} I_d)$ are affected by Massart noise.
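For reference, here is a sketch of this synthetic generator. The mixture weights (1/2 on the first component and 1/(2d) on each of the others) and the sign-flip corruption rule follow the description above as we read it, and should be treated as an illustrative rendering of the setup rather than the exact experimental code.

```python
import numpy as np

def synthetic_mixture(m, d=30, eta=0.25, rng=None):
    """Sample from the synthetic setup of Section 4 (illustrative rendering)."""
    rng = np.random.default_rng() if rng is None else rng
    w_star = 9 * np.eye(d)[1] + np.ones(d)                  # w* = 9 e_2 + sum_i e_i
    # component 0 -> N(e_1, I/d^2); component i >= 1 -> N(d e_i, I/d^2)
    comp = np.where(rng.random(m) < 0.5, 0, rng.integers(1, d + 1, size=m))
    means = np.vstack([np.eye(d)[0], d * np.eye(d)])        # rows: e_1, d e_1, ..., d e_d
    X = means[comp] + rng.standard_normal((m, d)) / d
    y = X @ w_star                                          # clean labels y = w* . x
    # eta-Massart adversary: only points with some coordinate > d/2 may be flipped
    eligible = (X > d / 2).any(axis=1)
    flip = eligible & (rng.random(m) < eta)
    y[flip] = -y[flip]                                      # label flipped to -w* . x
    return X, y, w_star
```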
We measured the exact parameter recovery rate, which captures how often the algorithm recovers $w^*$ exactly. We varied the noise rate $\eta$ while running the methods with 120 samples from $D_x$. We also varied the sample size while keeping the noise rate at $\eta = 0.25$. We ran 200 trials for each measurement of the exact recovery rate, and the error bars represent two standard deviations from the mean.

Recovering ReLUs. For ReLUs, we used the same distribution as in the experiments for linear functions to generate samples. We ran and compared constant-step-size gradient descent on the empirical $\ell_1$-loss $\frac{1}{m}\sum_{i=1}^m |y_i - \mathrm{ReLU}(w \cdot x_i)|$ with different transformations of the data. We ran gradient descent since our separation oracle for the ellipsoid method bears similarities to gradient descent. As seen in Lemma 3.1, this is due to the fact that our separating hyperplane is based on the gradient of the empirical $\ell_1$-loss of a subset of points. The experiment is set up with $\eta = 0.4$, $w^* = 9 e_2 + \sum_{i=1}^d e_i$, $w_0 = 0$, 240 samples from $D_x$, and a gradient descent step size of one. For "Original", we use a step size of 1/465 to keep the magnitude of the points $x_i$ comparable to that of the transformed points $\tilde{x}_i$.

In Figure 2(a), "Original" corresponds to naive gradient descent, while "Normalized" has a normalization preprocessing step. The transformations "Isotropic" and "Radial-isotropic" follow our algorithm for ReLUs from Section 3, where the transformation is only applied to the points on the positive side, $w \cdot x_i \ge 0$, for the current hypothesis $w$. The gradient is then calculated with the transformed points $\tilde{x}_i$ and appropriately transformed back to the original space in order to update $w$. The gradient descent update under transformation $A$ with step size $\alpha$ is the following: $w' \leftarrow A^{-1} w$, then $w \leftarrow w - \alpha (A \nabla_{w'} L')$, where $L'$ denotes the empirical $\ell_1$-loss for the transformed subset of points. This update method is directly adapted from our ellipsoid method.

Figure 1: Exact parameter recovery of linear functions on synthetic data; (a) recovery vs. noise rate, (b) recovery vs. sample size. The exact recovery rate (y-axis) measures how often the algorithm outputs the true parameter out of 200 trials. We compare Algorithm 1 with naive $\ell_1$-minimization and $\ell_1$-minimization on normalized data. Error bars cover two standard deviations from the mean.

Figure 2: (a) Gradient descent on ReLU; (b) drug discovery dataset. On the left, we compare the performance of the different transformations with gradient descent on synthetic data, where we measure the $\ell_2$-distance to the optimal solution. On the right, we compare different regression methods applied to real data where the labels are artificially corrupted with $\eta$-Massart noise. "Rescaled L1" represents Algorithm 1. We report the fraction of points that lie in the subspace generated by its output with margin.
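Below is a sketch of one step of the gradient-descent update just described, reusing the approx_radial_isotropy helper from the Algorithm 1 sketch. Subgradients are used at the non-differentiable points of the $\ell_1$-loss and the ReLU, and the function name is ours.

```python
import numpy as np

def transformed_gd_step(X, y, w, step=1.0):
    """One gradient-descent step on the empirical l1-loss of ReLU regression,
    recomputing the radial-isotropic transformation for the points on the
    positive side of the current hypothesis (a sketch of the update above)."""
    pos = X @ w >= 0
    S, yS = X[pos], y[pos]
    A = approx_radial_isotropy(S)                       # recomputed at every step
    w_t = np.linalg.solve(A.T, w)                       # w' = A^{-T} w (= A^{-1} w for symmetric A)
    SA = S @ A.T                                        # transformed points A x_i
    scale = np.linalg.norm(SA, axis=1)
    SA, yS = SA / scale[:, None], yS / scale
    pred = SA @ w_t
    # subgradient of (1/|S|) sum_i |y_i - ReLU(w' . x_i)| with respect to w'
    g = -(SA * (np.sign(yS - np.maximum(pred, 0.0)) * (pred >= 0))[:, None]).mean(axis=0)
    # take the step in the transformed space and map back (w <- w - step * A g for symmetric A)
    return w - step * (A.T @ g)
```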
Drug Discovery Dataset. The drug discovery dataset was originally curated by (40), and we used the same version as in (14). The dataset has a training set of 3084 points and a test set of 1000 points, each of 410 dimensions. The $\eta$-Massart adversary corrupts the training data $(x_i, y_i)$ so that each label is flipped to $-100\, y_i$ with probability $\eta$. We compared $\ell_1$-regression with a radial-isotropic transformation ("Rescaled L1") to other baseline methods, such as least squares and naive $\ell_1$-regression. For ridge regression, we optimized the regularization coefficient based on the uncorrupted data. We measure performance by computing the fraction of the test set that lies within the subspace generated by the output vector, with a margin of 2.

Results. In Figure 1, our algorithm with the radial-isotropic transformation outperforms the other baseline methods, both in robustness with respect to the noise level and in efficiency with respect to the sample size. This is in line with the results of Theorem 1.2. In fact, our experiments on ReLUs also empirically demonstrate that radial isotropy significantly improves ReLU regression via gradient descent, by making the dataset more robust to noise at each iteration. For the drug discovery dataset, although $\ell_1$-regression with radial isotropy performs slightly worse than naive $\ell_1$-regression when there is minimal noise, it significantly outperforms the baseline methods in regimes of high noise.

5 Conclusion

In this work, we propose a generalization of the Massart (or bounded) noise model, previously studied in binary classification, to the real-valued setting. The Massart model is a realistic semi-random noise model that is stronger than uniform random noise or oblivious noise, but weaker than adversarial label noise. Our main result is an efficient algorithm for ReLU regression (and, in the process, also linear regression) in this model under minimal distributional assumptions. At the technical level, we provide structural conditions for $\ell_0$-minimization to be efficiently computable. A key conceptual idea enabling our efficient algorithms is that of transforming the dataset using radial-isotropic transformations. We empirically validated the effectiveness of radial-isotropic transformations for robustness via experiments on both synthetic and real data. In contrast to previous works on robust regression that require strong distributional assumptions, our framework and results may be seen as an intricate balance between slightly weakening the noise model while affording generality in the underlying distribution.

Acknowledgments and Disclosure of Funding

Ilias Diakonikolas is supported by NSF Medium Award CCF-2107079, NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant.

References

[1] ARTSTEIN-AVIDAN, S., KAPLAN, H., AND SHARIR, M. On radial isotropic position: Theory and algorithms. arXiv preprint arXiv:2005.04918 (2020).
[2] AWASTHI, P., BALCAN, M. F., HAGHTALAB, N., AND URNER, R. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015 (2015), pp. 167-190.
[3] AWASTHI, P., BALCAN, M. F., HAGHTALAB, N., AND ZHANG, H. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016 (2016), pp. 152-192.
[4] BARRENO, M., NELSON, B., JOSEPH, A. D., AND TYGAR, J. D. The security of machine learning. Machine Learning 81, 2 (2010), 121-148.
[5] BHATIA, K., JAIN, P., KAMALARUBAN, P., AND KAR, P. Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (2017), pp. 2107-2116.
[6] BHATIA, K., JAIN, P., AND KAR, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (2015), pp. 721-729.
[7] BIGGIO, B., NELSON, B., AND LASKOV, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012 (2012).
[8] CANDÈS, E. J., AND WAKIN, M. B. An introduction to compressive sampling. IEEE Signal Processing Magazine 25, 2 (2008), 21-30.
[9] CHEN, S., KOEHLER, F., MOITRA, A., AND YAU, M. Classification under misspecification: Halfspaces, generalized linear models, and evolvability. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[10] CHEN, S., KOEHLER, F., MOITRA, A., AND YAU, M. Online and distribution-free robustness: Regression and contextual bandits with Huber contamination. arXiv preprint arXiv:2010.04157 (2020).
[11] DIAKONIKOLAS, I., GOEL, S., KARMALKAR, S., KLIVANS, A. R., AND SOLTANOLKOTABI, M. Approximation schemes for ReLU regression. In Conference on Learning Theory, COLT 2020 (2020), vol. 125 of Proceedings of Machine Learning Research, PMLR, pp. 1452-1485.
[12] DIAKONIKOLAS, I., GOULEAKIS, T., AND TZAMOS, C. Distribution-independent PAC learning of halfspaces with Massart noise. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 4751-4762.
[13] DIAKONIKOLAS, I., KAMATH, G., KANE, D. M., LI, J., MOITRA, A., AND STEWART, A. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS) (2016), pp. 655-664.
[14] DIAKONIKOLAS, I., KAMATH, G., KANE, D. M., LI, J., STEINHARDT, J., AND STEWART, A. SEVER: A robust meta-algorithm for stochastic optimization. In Proc. 36th International Conference on Machine Learning (ICML) (2019), pp. 1596-1606.
[15] DIAKONIKOLAS, I., KANE, D., AND ZARIFIS, N. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[16] DIAKONIKOLAS, I., AND KANE, D. M. Recent advances in algorithmic high-dimensional robust statistics. CoRR abs/1911.05911 (2019).
[17] DIAKONIKOLAS, I., AND KANE, D. M. Hardness of learning halfspaces with Massart noise. CoRR abs/2012.09720 (2020).
[18] DIAKONIKOLAS, I., KANE, D. M., PITTAS, T., AND ZARIFIS, N. The optimality of polynomial regression for agnostic learning under Gaussian marginals. CoRR abs/2102.04401 (2021). To appear in COLT 2021.
[19] DIAKONIKOLAS, I., KONG, W., AND STEWART, A. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA) (2019), pp. 2745-2754.
[20] DIAKONIKOLAS, I., KONTONIS, V., TZAMOS, C., AND ZARIFIS, N. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT 2020 (2020), J. D. Abernethy and S. Agarwal, Eds., vol. 125 of Proceedings of Machine Learning Research, PMLR, pp. 1486-1513.
[21] DIELMAN, T. E. Applied Regression Analysis for Business and Economics. Duxbury/Thomson Learning, Pacific Grove, CA, 2001.
[22] D'ORSI, T., NOVIKOV, G., AND STEURER, D. Regress consistently when oblivious outliers overwhelm. arXiv preprint arXiv:2009.14774 (2020).
[23] DU, S. S., LEE, J. D., AND TIAN, Y. When is a convolutional filter easy to learn? In 6th International Conference on Learning Representations, ICLR 2018 (2018).
[24] DUNAGAN, J., AND VEMPALA, S. Optimal outlier removal in high-dimensional spaces. J. Computer & System Sciences 68, 2 (2004), 335-373.
[25] GOEL, S., GOLLAKOTA, A., AND KLIVANS, A. R. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[26] GOEL, S., KANADE, V., KLIVANS, A., AND THALER, J. Reliably learning the ReLU in polynomial time. In Conference on Learning Theory (2017), pp. 1004-1042.
[27] GOEL, S., KARMALKAR, S., AND KLIVANS, A. R. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (2019), pp. 8582-8591.
[28] HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J., AND STAHEL, W. A. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.
[29] HARDT, M., AND MOITRA, A. Algorithms and hardness for robust subspace recovery. In Proc. 26th Annual Conference on Learning Theory (COLT) (2013), pp. 354-375.
[30] HUBER, P. J., AND RONCHETTI, E. M. Robust Statistics. Wiley, New York, 2009.
[31] KALAN, S. M. M., SOLTANOLKOTABI, M., AND AVESTIMEHR, S. Fitting ReLUs via SGD and quantized SGD. In 2019 IEEE International Symposium on Information Theory (ISIT) (2019), IEEE, pp. 2469-2473.
[32] KARMAKAR, S., MUKHERJEE, A., AND MUTHUKUMAR, R. A study of neural training with iterative non-gradient methods. arXiv e-prints (2020), arXiv:2005.
[33] KHACHIYAN, L. On the complexity of approximating extremal determinants in matrices. Journal of Complexity 11, 1 (1995), 138-153.
[34] KLIVANS, A., KOTHARI, P., AND MEKA, R. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT) (2018), pp. 1420-1430.
[35] LAI, K. A., RAO, A. B., AND VEMPALA, S. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS) (2016), pp. 665-674.
[36] LI, J., ABSHER, D., TANG, H., SOUTHWICK, A., CASTO, A., RAMACHANDRAN, S., CANN, H., BARSH, G., FELDMAN, M., CAVALLI-SFORZA, L., AND MYERS, R. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319 (2008), 1100-1104.
[37] MANURANGSI, P., AND REICHMAN, D. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207 (2018).
[38] MASSART, P., AND NEDELEC, E. Risk bounds for statistical learning. Ann. Statist. 34, 5 (2006), 2326-2366.
[39] MCDONALD, J. H. Handbook of Biological Statistics, vol. 2. Sparky House Publishing, Baltimore, MD, 2009.
[40] OLIER, I., SADAWI, N., BICKERTON, G. R., VANSCHOREN, J., GROSAN, C., SOLDATOVA, L., AND KING, R. D. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery. Machine Learning 107, 1 (2018), 285-311.
[41] PASCHOU, P., LEWIS, J., JAVED, A., AND DRINEAS, P. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics 47 (2010), 835-847.
[42] PESME, S., AND FLAMMARION, N. Online robust regression via SGD on the l1 loss. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[43] ROSENBERG, N., PRITCHARD, J., WEBER, J., CANN, H., KIDD, K., ZHIVOTOVSKY, L., AND FELDMAN, M. Genetic structure of human populations. Science 298 (2002), 2381-2385.
[44] ROUSSEEUW, P. J., AND LEROY, A. M. Robust Regression and Outlier Detection. John Wiley & Sons, Inc., New York, NY, USA, 1987.
[45] SOLTANOLKOTABI, M. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems (2017), pp. 2007-2017.
[46] STEINHARDT, J., KOH, P. W., AND LIANG, P. S. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30 (2017), pp. 3520-3532.
[47] SUGGALA, A. S., BHATIA, K., RAVIKUMAR, P., AND JAIN, P. Adaptive hard thresholding for near-optimal consistent robust regression. In Conference on Learning Theory, COLT 2019 (2019), pp. 2892-2897.
[48] YEHUDAI, G., AND SHAMIR, O. On the power and limitations of random features for understanding neural networks. CoRR abs/1904.00687 (2019).
[49] YEHUDAI, G., AND SHAMIR, O. Learning a single neuron with gradient methods. In Conference on Learning Theory (2020), PMLR, pp. 3756-3786.
[50] ZHANG, C., SHEN, J., AND AWASTHI, P. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[51] ZHANG, Y., LIANG, P., AND CHARIKAR, M. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017 (2017), pp. 1980-2022.