ReLU Regression with Massart Noise

Ilias Diakonikolas (University of Wisconsin-Madison, ilias@cs.wisc.edu), Jongho Park (University of Wisconsin-Madison, jongho.park@wisc.edu), Christos Tzamos (University of Wisconsin-Madison, tzamos@wisc.edu)

Abstract. We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is at most $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.

1 Introduction

Learning in the presence of outliers is a key challenge in machine learning, with several data analysis applications, including in ML security (4; 7; 46; 14) and in exploratory data analysis of real datasets with natural outliers, e.g., in biology (43; 41; 36). The goal is to design computationally efficient learners that can tolerate a constant fraction of outliers, independent of the dimensionality of the data. Early work in robust statistics (28; 30) gave sample-efficient robust estimators for various basic tasks, alas with exponential runtime. A recent line of work in computer science, starting with (13; 35), developed the first computationally efficient robust learning algorithms for various high-dimensional tasks. Since these early works, there has been significant progress in algorithmic robust high-dimensional statistics by several communities; see (16) for a recent survey.

In this work, we study the problem of learning Rectified Linear Units (ReLUs) in the presence of label noise. The ReLU function $\mathrm{ReLU}_w : \mathbb{R}^d \to \mathbb{R}$, parameterized by a vector $w \in \mathbb{R}^d$, is defined as $\mathrm{ReLU}_w(x) := \mathrm{ReLU}(w \cdot x) = \max\{0, w \cdot x\}$. ReLU regression, the task of fitting ReLUs to a set of labeled examples, is a fundamental task and an important primitive in the theory of deep learning. In recent years, ReLU regression has been extensively studied in theoretical machine learning, both from the perspective of designing efficient algorithms and from the perspective of computational hardness; see, e.g., (26; 45; 37; 48; 27; 49; 11; 15; 25; 18). The computational difficulty of this problem crucially depends on the assumptions about the input data. In the realizable case, i.e., when the labels are consistent with the target function, the problem is efficiently solvable; see, e.g., (45). On the other hand, in the presence of even a small constant fraction of adversarially labeled data, computational hardness results are known even for approximate recovery (29; 37) and under well-behaved distributions (27; 15; 25; 18). See Section 1.3 for a detailed summary of related work.
A challenging corruption model is the adversarial label noise model, in which an adversary is allowed to corrupt an arbitrary $\eta < 1/2$ fraction of the labels. The aforementioned hardness results rule out the existence of efficient algorithms for learning ReLUs with optimal error guarantees in this model, even when the underlying distribution on examples is Gaussian. Moreover, when no assumptions are made on the underlying data distribution, no fully polynomial time algorithm with a non-trivial guarantee is possible. In fact, in the distribution-independent setting, even for the simpler case of learning linear functions, there are strong computational hardness results for any constant $\eta$ (33; 29). These negative results motivate the following natural question:

Are there realistic label noise models in which efficient learning is possible without strong distributional assumptions?

Here we focus on ReLU regression in the presence of Massart (or bounded) noise (38), and provide an efficient learning algorithm with minimal distributional assumptions. In the process, we also provide an efficient noise-tolerant algorithm for the simpler case of linear regression.

The Massart model (38) is a classical semi-random noise model originally defined in the context of binary classification. In this model, an adversary has control over a random $\eta < 1/2$ fraction of the labels. Recent work (12) gave the first efficient learning algorithm for linear separators with non-trivial error guarantees in the Massart model without distributional assumptions. In this work, we ask to what extent such algorithmic results are possible for learning real-valued functions. To state our contributions, we formally define the following natural generalization of the model to real-valued functions.

Definition 1.1 (Learning Real-valued Functions with Massart Noise). Let $\mathcal{F}$ be a concept class of real-valued functions over $\mathbb{R}^d$ and let $f : \mathbb{R}^d \to \mathbb{R}$ be an unknown function in $\mathcal{F}$. For a given parameter $\eta < 1/2$, the algorithm specifies $m \in \mathbb{Z}_+$ and obtains $m$ samples $(x_i, y_i)_{i=1}^m$ such that: (a) every $x_i$ is drawn i.i.d. from a fixed distribution $D_x$, and (b) each $y_i$ is equal to $f(x_i)$ with probability $1 - \eta$ and takes an arbitrary value with probability $\eta$, chosen by an adversary after observing the samples drawn and the values that can be corrupted.

In the context of binary classification, the above model has been extensively studied in the theoretical ML community for the class of linear separators (2; 3; 51; 50; 20; 12; 9). Even though the noise model might appear innocuous at first sight, the ability of the Massart adversary to choose whether to perturb a given label and, if so, with what probability (which is unknown to the learner), makes the design of efficient algorithms in this model challenging. Specifically, for distribution-independent PAC learning of linear separators, even approximate learning in this model is computationally hard (17).

Extending this model to real-valued functions, we study regression under Massart noise in the realizable setting; that is, when the uncorrupted data exhibit clean functional dependencies, i.e., $y_i = \mathrm{ReLU}(w^* \cdot x_i)$. The realizable setting is of both theoretical and practical interest. Prior work (45; 23; 31; 49) developed algorithms for learning ReLUs in this setting (without Massart noise), providing theoretical insights on the success of deep learning architectures. On the practical side, there are many applications in which we observe clean functional dependencies on the uncorrupted data. For instance, clean measurements are prevalent in many signal processing applications, including medical imaging, and are at the heart of the widely popular field of compressive sensing (8).
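To make Definition 1.1 concrete, the following small sketch generates Massart-corrupted data for the ReLU concept class. The Gaussian marginal, the particular corruption values, and the helper name are illustrative choices of ours, not part of the model; the model only requires that each label is clean with probability at least $1 - \eta$ and may otherwise be set arbitrarily by an adversary that sees the samples it controls.

```python
import numpy as np

def sample_massart_relu(w_star, m, eta, rng):
    """Draw m samples in the spirit of Definition 1.1 for the ReLU concept class.

    Every x_i is drawn i.i.d. from a fixed marginal (here a standard Gaussian,
    an illustrative stand-in for D_x). Each label equals ReLU(w_star . x_i),
    except that a random ~eta fraction of labels may be overwritten arbitrarily
    by the adversary.
    """
    d = len(w_star)
    X = rng.standard_normal((m, d))            # x_i ~ D_x
    y = np.maximum(X @ w_star, 0.0)            # clean labels ReLU(w* . x_i)
    corrupted = rng.random(m) < eta            # each point corrupted with probability eta
    # One possible adversarial strategy: plant large negated values on the
    # points the adversary controls.
    y[corrupted] = -100.0 * (np.abs(y[corrupted]) + 1.0)
    return X, y, corrupted

rng = np.random.default_rng(0)
X, y, corrupted = sample_massart_relu(w_star=np.ones(10), m=2000, eta=0.3, rng=rng)
```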
1.1 Main Results

To build up to the more challenging case of ReLUs, we start with the simpler case of linear functions, which is in and of itself one of the most well-studied statistical tasks, with numerous applications in machine learning (44), as well as in other disciplines, including economics (21) and biology (39). In our Massart setting, the goal is to identify a linear relation $y = w^* \cdot x$ that the clean samples $(x, y)$ (inliers) satisfy. We show that, under the minimal (necessary) assumption that the distribution is not fully concentrated on any subspace, the problem is efficiently identifiable.

Theorem 1.2 (Exact Recovery of Linear Functions). Let $D_x$ be a distribution on $\mathbb{R}^d$ that has zero measure on any linear subspace and let $\eta < 1/2$ be the upper bound on the Massart noise rate. Denote by $w^*$ the vector representing the true linear function. There is an algorithm that draws $O\big(\frac{d^3}{(1-2\eta)^2}\big)$ samples, runs in $\mathrm{poly}(d, b, (1-2\eta)^{-1})$ time, where $b$ is an upper bound on the bit complexity of the samples and parameters, and outputs $w^*$ with probability at least 9/10.

We provide a more general algorithmic result that relaxes the density assumption on $D_x$ in Theorem C.3 of Appendix C, so that the only assumption needed is that the support of $D_x$ spans $\mathbb{R}^d$. It is important to note that if the distribution were concentrated entirely on a linear subspace, it would be information-theoretically impossible to determine the component of $w^*$ orthogonal to that subspace. That is, our anti-concentration assumption on the distribution is necessary for identifiability to be possible. Even when this assumption is violated and the problem is non-identifiable, we provide a (weaker) PAC learning guarantee for the linear case in Theorem D.1 of Appendix D.

Our main algorithmic result is for the problem of ReLU regression, where the inliers satisfy $y = \mathrm{ReLU}(w^* \cdot x)$ and an $\eta < 1/2$ fraction of the labels are corrupted by Massart noise. Even in this more challenging case, we show it is possible to efficiently identify the true parameter $w^*$, as long as every homogeneous halfspace contains a non-negligible fraction of the sample points.

Theorem 1.3 (Exact Recovery of ReLUs). Let $D_x$ be a distribution on $\mathbb{R}^d$ that has zero measure on any linear subspace and such that $\Pr_{x \sim D_x}[w \cdot x \ge 0] \ge \lambda$ for any $w \in \mathbb{R}^d$. Let $\eta < 1/2$ be the upper bound on the Massart noise rate. Denote by $w^*$ the parameter vector of the target ReLU. There is an algorithm that draws $O\big(\frac{d^3}{\lambda^2 (1-2\eta)^2}\big)$ samples, runs in time $\mathrm{poly}(d, b, \lambda^{-1}, (1-2\eta)^{-1})$, and outputs $w^*$ with probability at least 9/10.

We similarly provide a more general result relaxing the density assumption on $D_x$ in Theorem C.4 of Appendix C. We note that the assumption on the mass of halfspaces in Theorem 1.3 is necessary for identifiability. Indeed, if there were a halfspace, parameterized by $w \in \mathbb{R}^d$, such that $\Pr_{x \sim D_x}[w \cdot x \ge 0] = 0$, it would be impossible to distinguish between the functions $\mathrm{ReLU}(w \cdot x)$ and $\mathrm{ReLU}(2w \cdot x)$ (even without noise), as all points would have 0 labels. It remains an interesting open problem whether similar PAC learning guarantees can be obtained for the case of ReLU regression. We suspect that this problem is computationally hard in full generality.

1.2 Technical Overview

As explained in the introduction, the focus of our work is on the problem of robust regression in the presence of outliers.
When an $\eta < 1/2$ fraction of the data is arbitrarily corrupted in the realizable setting, the goal is to compute the function that fits as many points (inliers) as possible. Given a sufficient number of samples from a full-dimensional distribution, this function is unique for the class of ReLUs and matches the true function with high probability. However, even in the simpler case of linear functions, the corresponding computational problem of $\ell_0$-minimization is computationally hard without distributional assumptions, as it is an instance of robust subspace recovery (29). Our positive results are driven by relaxing the assumption that an arbitrary fraction of the points is corrupted. Instead, as defined in Definition 1.1, we consider a more restricted adversary that is presented with a uniformly random $\eta$ fraction of the points, which it can corrupt arbitrarily at will.

$\ell_0$ to $\ell_1$ minimization. Given this milder corruption model, we propose novel algorithms for efficient exact recovery of the underlying function. We obtain our algorithms by replacing $\ell_0$-minimization with $\ell_1$-minimization, which can be shown to converge to the true function in the limit and is efficient to optimize in the linear regression case. For intuition, consider a single-point distribution that always outputs labeled examples of the form $(x, y)$, where the example $x$ is always the same but the labels $y$ may differ. The Massart assumption guarantees that the value of $y$ is correct more than half of the time, so the estimate that maximizes the number of correct samples (the $\ell_0$-minimizer) recovers the underlying function. However, if one considers the $\ell_1$-minimizer, i.e., the value $v$ that minimizes $\mathbb{E}[|y - v|]$, this corresponds to the median value of $y$, which is also correct when more than half of the samples are correct.

We generalize this intuition and propose a natural and tight condition under which empirical $\ell_1$-minimization yields the true $\ell_0$-minimizer (see Lemma 2.2). While this condition holds under Massart noise for arbitrary distributions at the population level, it can fail to hold with high probability when considering only a finite set of samples from the distribution. For example, consider the one-dimensional case of $w^* = 1$, where most $x_i$'s are near zero and uncorrupted, while a few corrupted samples lie extremely far from zero. The empirical $\ell_1$-minimizer here will be dominated by the few corrupted samples and would not be the $\ell_0$-minimizer. In particular, the sample complexity of naive $\ell_1$-minimization would crucially depend on the concentration properties of the distribution on $x$.

Transforming the points via radial isotropy. The main technical idea behind obtaining sample-efficient algorithms that run in polynomial time is to transform the original point set into an equivalent one that satisfies the required properties with high probability, as it becomes sufficiently concentrated. In particular, performing a linear transformation mapping every point $x$ to $Ax$, while keeping the corresponding label $y$, is without loss of generality, as we are interested in identifying the true (generalized) linear function, which depends only on the inner product of every point with a parameter vector $w$. Finding such a vector $w'$ in the transformed space results in the equivalent vector $w = A^{\top} w'$ in the original space. Moreover, an additional operation we can perform is to take a single sample $(x, y)$ and multiply it by a positive scalar $\lambda > 0$, replacing it with the sample $(\lambda x, \lambda y)$. For both the linear and ReLU cases, any sample that is an inlier for the true function remains an inlier after these transformations (see the short check below).
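As a quick sanity check of these two invariances in the ReLU case, the following snippet (our own illustrative example, using NumPy and a symmetric positive definite $A$ as in Lemma 2.1) verifies that an inlier stays an inlier when the point is mapped by $A$ with the parameter mapped correspondingly, and when a single sample is rescaled by a positive $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w = rng.standard_normal(d)
x = rng.standard_normal(d)
y = max(w @ x, 0.0)                                   # an inlier: y = ReLU(w . x)

# Operation 1: map x -> Ax; the parameter in the transformed space is A^{-1} w
# (for the symmetric positive definite A of Lemma 2.1 this is the w = A^T w' correspondence).
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)                               # symmetric positive definite, hence invertible
w_t = np.linalg.solve(A, w)                           # w' = A^{-1} w, so w' . (Ax) = w . x
assert np.isclose(max(w_t @ (A @ x), 0.0), y)         # still an inlier in the transformed space

# Operation 2: rescale a single sample (x, y) -> (lam * x, lam * y) for lam > 0.
lam = 2.7
assert np.isclose(max(w @ (lam * x), 0.0), lam * y)   # ReLU is positively homogeneous
```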
We can use these two operations to bring our point set into radial-isotropic position, i.e., so that all the $x$'s in the dataset are unit-norm and the variance in any direction is nearly identical:

Definition 1.4 (Radial Isotropy). Let $S^{d-1}$ be the unit sphere in $\mathbb{R}^d$. Given $\{x_1, \ldots, x_n\} \subseteq S^{d-1}$, a linear map $A : \mathbb{R}^d \to \mathbb{R}^d$ is a radial-isotropic transformation if $\sum_{i=1}^n \frac{(Ax_i)(Ax_i)^{\top}}{\|Ax_i\|_2^2} = \frac{n}{d} I$. For $0 < \gamma < 1$, we say the points are in $\gamma$-approximate radial-isotropic position if, for all $v \in S^{d-1}$, it holds that $(d/n) \sum_{i=1}^n (x_i \cdot v)^2 \ge 1 - \gamma$.

In such a normalized position, we can argue that with high probability the weight of the inliers in every direction is more than the weight of the outliers, which guarantees that the empirical $\ell_1$-minimizer will converge to the true function.

Learning ReLUs. Unfortunately, while $\ell_1$-minimization for linear functions is convex and efficiently solvable via linear programming, $\ell_1$-minimization for ReLUs is challenging due to its non-convexity; that is, we cannot easily reduce ReLU regression to a simple optimization method. We instead establish a structural condition under which we can compute an efficient separation oracle between the optimal parameter vector $w^*$ and a query $w$. More specifically, we show that any suboptimal guess for the parameter vector $w$ can be improved by moving along the opposite direction of the gradient of the $\ell_1$-loss for the subset of points on which the condition in Lemma 3.1 is satisfied. Identifying such a direction of improvement yields a separating hyperplane, so we exploit this to efficiently identify $w^*$ by running the ellipsoid method with our separation oracle.

However, for this result to hold with a small number of samples, we need to again bring to radial-isotropic position the points that fall in the linear (positive) part of the ReLU for the current guess vector $w$. In contrast to the linear case, though, where this transformation was applied once, here it needs to be applied anew with every new guess. This results in a function that changes at every step, which is not suitable for direct optimization.

Using these ideas, our algorithms can efficiently recover the underlying function exactly using few samples. Our algorithms make mild genericity assumptions about the position of the points, requiring that the points are not concentrated on a lower-dimensional subspace or, for the case of ReLUs, do not lie entirely in an origin-centered halfspace. As already mentioned, such assumptions are necessary for the purposes of identifiability.

1.3 Related Work

Given the extensive literature on robust regression, here we discuss the most relevant prior work.

ReLU Regression. In the realizable setting, (45) and, more recently, (31) showed that gradient descent efficiently performs exact recovery for ReLU regression under the Gaussian distribution on examples. (49) generalized this result to a broader family of well-behaved distributions. In the agnostic or adversarial label noise model, a line of work has shown that learning with near-optimal error guarantees requires super-polynomial time, even under the Gaussian distribution (27; 15; 25; 18). On the positive side, (11) gave an efficient learner with approximation guarantees under log-concave distributions. Without distributional assumptions, even approximate learning is hard (29; 37).
The recent work (32) studies ReLU regression in the realizable setting under a noise model similar to, but more restrictive than, the Massart model of Definition 1.1. Specifically, in the setting of (32), the adversary can corrupt a label with probability at most $\eta$, but only via additive noise bounded above by a constant. (32) gives an SGD-type algorithm for ReLU regression in this model. We note that their algorithm does not achieve exact recovery and its guarantees crucially depend on the concentration properties of the marginal distribution and the bound on the additive noise.

Comparison of Noise Models. It is worth comparing the Massart noise model (Definition 1.1) with other noise models studied in the literature. The strongest corruption model we are aware of is the strong contamination model (13), in which an omniscient adversary can corrupt an arbitrary $\eta < 1/2$ fraction of the labeled examples. In the adversarial label noise model, the adversary can corrupt an arbitrary $\eta < 1/2$ fraction of the labels (but not the examples). Efficient robust learning algorithms in these models typically only give approximate error guarantees and require strong distributional assumptions. Specifically, for the case of linear regression, (34; 19; 14) give robust approximate learners in the strong contamination model under the Gaussian distribution and, more broadly, distributions with bounded moments. In the adversarial label noise model, (6) gave efficient robust learners, under strong concentration bounds on the underlying distribution, that can tolerate an $\eta < 1/50$ fraction of outliers.

The recent work (10) considers a Massart-like noise model in the context of linear regression with random observation noise. (10) provides an SDP-based approximate recovery algorithm when the noise rate satisfies $\eta < 1/3$. It should be noted that their algorithm does not efficiently achieve exact recovery. Due to space limitations, we provide a more detailed description of that work in Appendix F.

A related noise model is that of oblivious label noise, where the adversary can corrupt an $\eta$ fraction of the labels with additive noise that is independent of the covariate $x$. More precisely, the oblivious adversary corrupts the vector of labels $y \in \mathbb{R}^m$ by adding an $\eta m$-sparse corruption vector $b$. Since $b$ is independent of the covariates, oblivious noise can be viewed as corrupting each sample with probability $\eta$ with a random non-zero entry of $b$. Consequently, oblivious noise can be seen as a special case of Massart noise. We formally compare these two noise models in more detail in Appendix E. A line of work (5; 47; 22; 42) studied robust linear regression under oblivious noise and developed efficient exact recovery algorithms under strong distributional assumptions.

2 Warmup: Linear Regression with Massart Noise

In this section, we establish structural conditions under which we can perform efficient $\ell_0$-minimization for linear functions under Massart noise. It is imperative that we find the $\ell_0$-minimizer with respect to $w$ since, with a sufficient number of samples, the $\ell_0$-minimizer is the true function we wish to recover. We then show that appropriately transforming the data via a radial-isotropic transformation and then minimizing the empirical $\ell_1$-loss, $\arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m |y_i - w \cdot x_i|$, can efficiently recover the true parameter $w^*$. We describe the algorithm for recovering linear functions below.
Algorithm 1 Linear function recovery via radial isotropy
  Draw $m = O\big(\frac{d^3}{(1-2\eta)^2}\big)$ samples $(x_i, y_i)_{i=1}^m$ with $\eta$-Massart noise
  Compute $A$ that puts $(x_i, y_i)_{i=1}^m$ in 1/2-approximate radial-isotropic position
  $\hat{w} \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^m \big| \frac{y_i}{\|Ax_i\|_2} - w \cdot \frac{Ax_i}{\|Ax_i\|_2} \big|$ by solving the LP
  return $A\hat{w}$

In fact, there is no need to compute an exact radial-isotropic transformation ($\gamma = 0$), as an approximate one suffices. An approximate radial-isotropic transformation can be computed efficiently as in Lemma 2.1, which we prove in Appendix A.

Lemma 2.1. Given $S \subseteq \mathbb{R}^d$ in general position, there is a $\mathrm{poly}(n, d, b, \gamma^{-1})$ time algorithm that computes a positive definite symmetric matrix $A$ such that $\{Ax/\|Ax\|_2 : x \in S\}$ is in $\gamma$-approximate radial-isotropic position, where $b$ is an upper bound on the bit complexity of the parameters and samples in $S$. Moreover, the condition number of $A$ is at most $2^{\mathrm{poly}(n,d,b)}$.

Given that computing such an approximate transformation $A$ and solving a linear program (LP) can be done efficiently, Algorithm 1 gives us the polynomial runtime for Theorem 1.2. The proof of Theorem 1.2 relies on two key ideas. First, we can minimize the empirical $\ell_1$-loss with respect to $w$ instead of the $\ell_0$-loss, because the two are equivalent under the structural condition of Lemma 2.2. Once the following sufficient condition is satisfied, we can efficiently solve the $\ell_1$-minimization, as this is efficient in the case of linear functions by solving an LP.

Lemma 2.2 (Structural Condition for Recovery). Given $f : \mathbb{R} \to \mathbb{R}$ and $m$ samples $(x_i, y_i)_{i=1}^m$ with $x_i \in \mathbb{R}^d$, let the $\ell_0$-minimizer $w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m \|y_i - f(w \cdot x_i)\|_0$ be unique. If
$$\sum_{i\,:\, y_i = f(w^* \cdot x_i)} \big|f((w^* + r) \cdot x_i) - f(w^* \cdot x_i)\big| \;>\; \sum_{i\,:\, y_i \neq f(w^* \cdot x_i)} \big|f((w^* + r) \cdot x_i) - f(w^* \cdot x_i)\big| \qquad (\star)$$
for all non-zero $r \in \mathbb{R}^d$, then $w^*$ is also the $\ell_1$-minimizer $\arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m |y_i - f(w \cdot x_i)|$.

The structural condition $(\star)$ for linear functions reduces to having the sum of $|r \cdot x_i|$ over the good points be greater than the sum of $|r \cdot x_i|$ over the bad points in every direction $r$. However, this implies that if one sample is much greater in norm than the others in some direction, this point can have undue influence and may easily dominate the $\ell_1$-loss. Therefore, without any preprocessing or transformation of the data, one has to rely on naively increasing the sample complexity until there are enough points in this direction to satisfy condition $(\star)$. Instead, we minimize the dominating effects of such outlier points and reduce the sample complexity by transforming the dataset with radial isotropy. We defer the proofs of Lemma 2.2 and Theorem 1.2 to Appendix A.
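For concreteness, here is a compact sketch of Algorithm 1 in Python, assuming NumPy and CVXPY (CVXPY's LP solver is what the experiments in Section 4 use for $\ell_1$-regression). The radial-isotropy step below is a simple normalize-and-whiten iteration standing in for the exact algorithm of Lemma 2.1, and the function names are ours.

```python
import numpy as np
import cvxpy as cp

def approx_radial_isotropy(X, n_iter=50):
    """Heuristic stand-in for the Lemma 2.1 subroutine: repeatedly project the
    points onto the unit sphere and whiten them, accumulating the linear map A."""
    n, d = X.shape
    A = np.eye(d)
    Z = X.copy()
    for _ in range(n_iter):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # push points onto S^{d-1}
        cov = (d / n) * Z.T @ Z                            # equals I at radial isotropy
        evals, evecs = np.linalg.eigh(cov)
        evals = np.clip(evals, 1e-12, None)
        B = evecs @ np.diag(evals ** -0.5) @ evecs.T       # whitening step
        Z = Z @ B.T
        A = B @ A
    return A

def recover_linear(X, y):
    """Sketch of Algorithm 1: transform, solve the l1 problem as an LP, map back."""
    n, d = X.shape
    A = approx_radial_isotropy(X)
    Xt = X @ A.T                                           # transformed points A x_i
    scale = np.linalg.norm(Xt, axis=1)
    Xt, yt = Xt / scale[:, None], y / scale                # rescale each sample by ||A x_i||
    w = cp.Variable(d)
    cp.Problem(cp.Minimize(cp.sum(cp.abs(yt - Xt @ w)))).solve()
    return A.T @ w.value                                   # map back: A^T w' (= A w' for symmetric A)
```

On data satisfying the assumptions of Theorem 1.2 and with enough samples, the returned vector should match $w^*$; the heuristic isotropy step is the main simplification relative to the paper's guarantee.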
3 ReLU Regression with Massart Noise

In this section, we study the problem of exact recovery for ReLUs in the presence of Massart noise. For the case of ReLUs, we can still use the structural condition of Lemma 2.2 to perform $\ell_1$-minimization, $\arg\min_{w} \frac{1}{m}\sum_{i=1}^m |y_i - \mathrm{ReLU}(w \cdot x_i)|$. However, minimizing this objective is no longer straightforward, because the objective function is non-convex. Despite this fact, it is possible to exactly recover a ReLU under mild anti-concentration assumptions on the underlying distribution. The key idea behind Theorem 1.3 is establishing the condition under which we can compute an efficient separation oracle between the query $w$ and the true parameter $w^*$. Once we obtain a separation oracle, we can use the ellipsoid method to recover $w^*$ exactly.

In turn, similarly to Lemma 2.2, we identify a sufficient structural condition on the dataset which allows us to efficiently compute a separating hyperplane between $w$ and $w^*$ whenever $w \neq w^*$, and then use radial-isotropic transformations to ensure that this condition is satisfied. We state this separation condition in the following lemma.

Lemma 3.1 (Separation Condition). Let $\mathcal{H}$ be a hypothesis class such that $\mathcal{H} = \{h_w : h_w(x) = f(w \cdot x), w \in \mathbb{R}^d\}$, where $f : \mathbb{R} \to \mathbb{R}$ is monotonically non-decreasing. Given a set of $m$ samples $(x_i, y_i)_{i=1}^m$, let $w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m \|y_i - f(w \cdot x_i)\|_0$ be unique. Let $\epsilon > 0$ and let $B(w, \epsilon)$ be the open ball of radius $\epsilon$ centered at $w$. Denote the empirical $\ell_1$-loss by $\hat{L}(w) = (1/m) \sum_{i=1}^m |y_i - f(w \cdot x_i)|$. Given a query $w' \notin B(w^*, \epsilon)$, if
$$\sum_{i\,:\, y_i = f(w^* \cdot x_i)} |(w' - w^*) \cdot x_i|\, f'(w' \cdot x_i) \;-\; \sum_{i\,:\, y_i \neq f(w^* \cdot x_i)} |(w' - w^*) \cdot x_i|\, f'(w' \cdot x_i) \;\ge\; \epsilon m \qquad (\star\star)$$
then $\nabla \hat{L}(w') \cdot (w' - w) = 0$ is a separating hyperplane for $w'$ and $B(w^*, \epsilon/2)$, in the sense that $\nabla \hat{L}(w') \cdot (w' - w) > 0$ for all $w \in B(w^*, \epsilon/2)$.

In particular, the gradient of the empirical $\ell_1$-loss gives us the separating hyperplane above. Other than the fact that only the points on the non-negative side of the halfspace defined by $w'$ are considered in the separation condition $(\star\star)$, the condition closely resembles the structural condition used for linear functions. Analogously, we apply a radial-isotropic transformation to the points with $w' \cdot x_i \ge 0$. Thus, we have the following sub-procedure of the ellipsoid method, where $\epsilon$ is the radius of a ball which depends on the distance between the points (and hence the bit complexity $b$).

Algorithm 2 Separation oracle sub-procedure
  Input: $(x_i, y_i)_{i=1}^m$ with Massart noise, query $w'$, $\epsilon > 0$
  Output: Unless $w' \in B(w^*, \epsilon)$, a separating hyperplane between $w'$ and $B(w^*, \epsilon/2)$
  if $\mathrm{ReLU}(w' \cdot x)$ fits at least $m/2$ points then return $w'$ as the true parameter $w^*$
  $S \leftarrow \{(x_i, y_i) : w' \cdot x_i \ge 0,\ i \in [m]\}$
  Compute $A$ that puts $S_x$ into 1/2-approximate radial-isotropic position
  $r \leftarrow \frac{1}{|S|} \sum_{(x_i, y_i) \in S} \frac{Ax_i}{\|Ax_i\|_2}\, \mathrm{sgn}(w' \cdot x_i - y_i)$
  return the separating hyperplane $A^{-1} r \cdot (w' - w) = 0$

The main difference between the algorithm for ReLUs and the one for linear functions is that here we must apply a different radial-isotropic transformation to every new subset of points in every iteration, depending on the query $w'$. In turn, the algorithm transforms the space according to a new transformation $A$, computes a separating hyperplane, and transforms the hyperplane back into the original space. Due to these repeated transformations, the proof of Theorem 1.3 requires a more intricate argument to make the ellipsoid method work correctly.
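The following sketch mirrors Algorithm 2, reusing the approx_radial_isotropy helper from the Algorithm 1 sketch; the exact-fit tolerance and the threshold of $m/2$ fitted points are illustrative simplifications.

```python
import numpy as np

def separation_oracle(X, y, w_query, fit_tol=1e-9):
    """Sketch of Algorithm 2. Returns ("done", w_query) if the query already fits
    at least half of the samples exactly (up to fit_tol); otherwise returns a
    vector g such that g . (w_query - w) = 0 is the separating hyperplane of
    Lemma 3.1, with g . (w_query - w) > 0 for w near the true parameter."""
    m, d = X.shape
    residual = y - np.maximum(X @ w_query, 0.0)
    if np.sum(np.abs(residual) <= fit_tol) >= m / 2:
        return "done", w_query

    pos = X @ w_query >= 0                       # points on the positive side of the query
    S, yS = X[pos], y[pos]
    A = approx_radial_isotropy(S)                # heuristic stand-in for Lemma 2.1
    SA = S @ A.T
    SA = SA / np.linalg.norm(SA, axis=1, keepdims=True)
    signs = np.sign(S @ w_query - yS)            # sgn(w' . x_i - y_i)
    r = (signs[:, None] * SA).mean(axis=0)       # gradient direction in the transformed space
    g = np.linalg.solve(A, r)                    # map the hyperplane back: A^{-1} r
    return "cut", g
```

Inside the ellipsoid method, one would keep the halfspace on which $g \cdot (w_{\text{query}} - w) > 0$, since that is the side containing the ball around $w^*$.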
We now prove Theorem 1.3.

Proof of Theorem 1.3. Let $w'$ be the original query to the oracle and assume that the separation condition $(\star\star)$ holds for the set of points $(Ax_i/\|Ax_i\|_2, y_i/\|Ax_i\|_2)_{i=1}^m$ and the query $A^{-1}w'$. Then, by Lemma 3.1, $r \cdot (A^{-1}w' - w) = 0$, where $r = (1/m)\sum_{i=1}^m (Ax_i/\|Ax_i\|_2)\, \mathrm{sgn}(w' \cdot x_i - y_i)$, separates $A^{-1}w'$ and $A^{-1}w^*$. Thus the separating hyperplane for $w'$ and $w^*$ is $A^{-1}r \cdot (w' - w) = 0$.

Now it remains to check the sample complexity necessary for the separation condition $(\star\star)$. Each unique set $\{(x_i, y_i) : w' \cdot x_i \ge 0,\ i \in [m]\}$ determines a radial-isotropic transformation, but there can only be at most $m^{d+1}$ unique sets by the VC dimension of halfspaces. So there are only at most $m^{d+1}$ radial-isotropic transformations to consider. Let $A$ be the linear map of the radial-isotropic transformation applied to the points with $w' \cdot x_i \ge 0$. Denote $(\tilde{x}_i, \tilde{y}_i) = (Ax_i/\|Ax_i\|_2, y_i/\|Ax_i\|_2)$, $\tilde{w}^* = A^{-1}w^*$, $\tilde{w}' = A^{-1}w'$, and let $\tilde{D}_x$ be $D_x$ conditioned on $\{w' \cdot x \ge 0\}$, transformed by $A$ and then normalized so that $\tilde{D}_x$ lies on $S^{d-1}$.

Then, for all $m^{d+1}$ transformations, using $m = O(d/\epsilon'^2)$ samples the following VC inequality holds with high probability, uniformly over all thresholds $t \ge 0$ (the probability and expectations below are over a transformed sample $(\tilde{x}, \tilde{y})$):
$$\Pr\big[\, |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\, \mathbf{1}\{\tilde{w}' \cdot \tilde{x} \ge 0,\ \tilde{y} = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})\} > t \,\big] \;-\; \frac{1}{m}\sum_{i=1}^m \mathbf{1}\big\{ |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}_i| > t,\ \tilde{w}' \cdot \tilde{x}_i \ge 0,\ \tilde{y}_i = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x}_i) \big\} \;\le\; \epsilon'.$$
Let $\tilde{S} = \{(\tilde{x}_i, \tilde{y}_i) : \tilde{w}' \cdot \tilde{x}_i \ge 0\}$. Similarly to the proof of Theorem 1.2, we have that
$$\max_{(\tilde{x}, \tilde{y}) \in \tilde{S}} |r \cdot \tilde{x}| \;\le\; \frac{d}{1 - \gamma}\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\big[|r \cdot \tilde{x}|\big],$$
where $\gamma = 1/2$. Then, integrating the VC inequality over $t$ and using the bound on the maximum, we can write
$$\frac{1}{|\tilde{S}|}\sum_{(\tilde{x}_i, \tilde{y}_i)\in\tilde{S}} |(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}_i|\,\mathbf{1}\{\tilde{y}_i=\mathrm{ReLU}(\tilde{w}^*\cdot\tilde{x}_i)\} \;\ge\; \frac{m}{|\tilde{S}|}\Big( \mathbb{E}\big[|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}|\,\mathbf{1}\{\tilde{w}'\cdot\tilde{x}\ge 0,\ \tilde{y}=\mathrm{ReLU}(\tilde{w}^*\cdot\tilde{x})\}\big] - \epsilon'\max_{(\tilde{x},\tilde{y})\in\tilde{S}}|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}| \Big)$$
$$\ge\; \frac{m}{|\tilde{S}|}(1-\eta)\,\mathbb{E}\big[|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}|\,\mathbf{1}\{\tilde{w}'\cdot\tilde{x}\ge 0\}\big] \;-\; \frac{\epsilon' m d}{|\tilde{S}|(1-\gamma)}\,\mathbb{E}_{\tilde{x}\sim\tilde{D}_x}\big[|(\tilde{w}'-\tilde{w}^*)\cdot\tilde{x}|\big].$$
By setting $\epsilon' = O(\lambda(1-2\eta)/d)$, we can bound $m/|\tilde{S}|$ by at most a constant times $\lambda^{-1}$, for all $m^{d+1}$ possible subsets $\tilde{S}$, using Hoeffding's inequality and the union bound. Then we have
$$\frac{1}{|\tilde{S}|}\sum_{(\tilde{x},\tilde{y})\in\tilde{S}\,:\,\tilde{y} = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})} |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}| \;\ge\; \Big(\frac{1}{2} + \frac{1-2\eta}{4}\Big)\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\big[|(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\big].$$
We can do the same for the corrupted points in $\tilde{S}$, obtaining an upper bound of $\big(\frac{1}{2} - \frac{1-2\eta}{4}\big)\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}[|(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|]$. Thus, for the points $(\tilde{x}, \tilde{y}) \in \tilde{S}$, we have the condition
$$\frac{1}{|\tilde{S}|}\bigg(\sum_{\tilde{y} = \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})} |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}| \;-\; \sum_{\tilde{y} \neq \mathrm{ReLU}(\tilde{w}^* \cdot \tilde{x})} |(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\bigg) \;\ge\; \frac{1-2\eta}{2}\, \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\big[|(\tilde{w}' - \tilde{w}^*) \cdot \tilde{x}|\big] \;\ge\; \frac{1-2\eta}{2} \cdot \frac{\|\tilde{w}' - \tilde{w}^*\|_2}{d}.$$
By Lemma 3.1, the inequality above implies that we can find a hyperplane through $r$ that separates $\tilde{w}'$ and $B(\tilde{w}^*, \epsilon/2)$, where $\epsilon = \frac{1-2\eta}{2d}\|\tilde{w}' - \tilde{w}^*\|_2$. In the original space of $(x_i, y_i)_{i=1}^m$, the transformed hyperplane given by $A^{-1}r$ separates $w'$ and $B(w^*, \epsilon_0/2)$, where $\epsilon_0 = \frac{1-2\eta}{2d\,\lambda_{\max}(A)}\|w' - w^*\|_2$, since applying $A^{-1}$ to $r$ keeps the distance from the hyperplane to $w'$ at least $\frac{1-2\eta}{2d\,\lambda_{\max}(A)}\|w' - w^*\|_2$, and applying $A$ to $\tilde{w}$ bounds the distance from $w$ to $w'$ from below accordingly. Therefore, we can set the parameter $\epsilon$ of Algorithm 2 to be equal to $\min_{w \neq w^*} \frac{1-2\eta}{2d\,\lambda_{\max}(A)}\|w - w^*\|_2$. If $w \neq w^*$, by the bounded bit complexity $b$, we have that the volume of the ellipsoid decreases at every step, but a ball of radius $\frac{(1-2\eta)\lambda_{\min}(A)}{4d\,\lambda_{\max}(A)}$ will always be contained in it. Thus the algorithm terminates in $\mathrm{poly}(d, b, (1-2\eta)^{-1})$ iterations, since $\log(1/\epsilon) = \mathrm{poly}(d, b, (1-2\eta)^{-1})$.

4 Experiments

In this section, we apply our algorithms based on radial-isotropic transformations to both synthetic and real datasets and compare their robustness in regression against the baseline methods of $\ell_1$- and $\ell_2$-regression. Our experiments demonstrate the efficacy of radial-isotropic transformations in robust regression and show how our algorithms outperform baseline regression methods. All experiments were done on a laptop computer with a 2.3 GHz Dual-Core Intel Core i5 CPU and 8 GB of RAM. We ran CVXPY's linear program solver for $\ell_1$-regression for linear functions.

Recovering Linear Functions. We first show how our algorithm based on radial-isotropic position (Algorithm 1) compares to naive $\ell_1$-regression in exact recovery using an LP solver. As another baseline, we also ran $\ell_1$-regression with a normalization preprocessing step, where we normalize all points $(x, y)$ to $(\frac{x}{\|x\|}, \frac{y}{\|x\|})$. We did not run regression with an isotropic-transformation preprocessing step, because this yields identical results to naive regression with no preprocessing. We evaluated the different transformations of the data on the following synthetic distribution. Define a mixture of Gaussians $D_x = \frac{1}{2} N\big(e_1, \frac{1}{d^2} I_d\big) + \frac{1}{2d} \sum_{i=1}^d N\big(d e_i, \frac{1}{d^2} I_d\big)$, where $e_i$ denotes the $i$-th standard basis vector and $d = 30$. Let $w^* = 9 e_2 + \sum_{i=1}^d e_i$. For various noise levels $\eta$, consider the following $\eta$-Massart adversary: the labels of all $x$ for which any coordinate is greater than $\frac{d}{2}$ are flipped to $-w^* \cdot x$ with probability $\eta$, and the labels of all other points are not flipped. Essentially, only the points not from $N(e_1, \frac{1}{d^2} I_d)$ are affected by Massart noise.
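For reference, here is a sketch of this synthetic generator. The mixture weights (1/2 on the first component and 1/(2d) on each of the others) and the sign-flip corruption rule follow the description above as we read it, and should be treated as an illustrative rendering of the setup rather than the exact experimental code.

```python
import numpy as np

def synthetic_mixture(m, d=30, eta=0.25, rng=None):
    """Sample from the synthetic setup of Section 4 (illustrative rendering)."""
    rng = np.random.default_rng() if rng is None else rng
    w_star = 9 * np.eye(d)[1] + np.ones(d)                  # w* = 9 e_2 + sum_i e_i
    # component 0 -> N(e_1, I/d^2); component i >= 1 -> N(d e_i, I/d^2)
    comp = np.where(rng.random(m) < 0.5, 0, rng.integers(1, d + 1, size=m))
    means = np.vstack([np.eye(d)[0], d * np.eye(d)])        # rows: e_1, d e_1, ..., d e_d
    X = means[comp] + rng.standard_normal((m, d)) / d
    y = X @ w_star                                          # clean labels y = w* . x
    # eta-Massart adversary: only points with some coordinate > d/2 may be flipped
    eligible = (X > d / 2).any(axis=1)
    flip = eligible & (rng.random(m) < eta)
    y[flip] = -y[flip]                                      # label flipped to -w* . x
    return X, y, w_star
```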
We measured the exact parameter recovery rate, which captures how often the algorithm recovers $w^*$ exactly. We varied the noise rate $\eta$ while running the methods with 120 samples from $D_x$. We also varied the sample size while keeping the noise rate at $\eta = 0.25$. We ran 200 trials for each measurement of the exact recovery rate, and the error bars represent two standard deviations from the mean.

Recovering ReLUs. For ReLUs, we used the same distribution as in the experiments for linear functions to generate samples. We ran and compared constant-step-size gradient descent on the empirical $\ell_1$-loss $\frac{1}{m}\sum_{i=1}^m |y_i - \mathrm{ReLU}(w \cdot x_i)|$ with different transformations of the data. We ran gradient descent since our separation oracle for the ellipsoid method bears similarities to gradient descent. As seen in Lemma 3.1, this is due to the fact that our separating hyperplane is based on the gradient of the empirical $\ell_1$-loss of a subset of points. The experiment is set up with $\eta = 0.4$, $w^* = 9 e_2 + \sum_{i=1}^d e_i$, $w_0 = 0$, 240 samples from $D_x$, and a gradient descent step size of one. For "Original", we use a step size of 1/465 to keep the magnitude of the points $x_i$ comparable to that of the transformed points $\tilde{x}_i$.

In Figure 2(a), "Original" corresponds to naive gradient descent, while "Normalized" has a normalization preprocessing step. The transformations "Isotropic" and "Radial-isotropic" follow our algorithm for ReLUs from Section 3, where the transformation is only applied to the points on the positive side, $w \cdot x_i \ge 0$, for the current hypothesis $w$. The gradient is then calculated with the transformed points $\tilde{x}_i$ and appropriately transformed back to the original space in order to update $w$. The gradient descent update under transformation $A$ with step size $\alpha$ is the following: $w' \leftarrow A^{-1} w$, then $w \leftarrow w - \alpha (A \nabla_{w'} L')$, where $L'$ denotes the empirical $\ell_1$-loss for the transformed subset of points. This update method is directly adapted from our ellipsoid method.

Figure 1: Exact parameter recovery of linear functions on synthetic data; (a) recovery vs. noise rate, (b) recovery vs. sample size. The exact recovery rate (y-axis) measures how often the algorithm outputs the true parameter out of 200 trials. We compare Algorithm 1 with naive $\ell_1$-minimization and $\ell_1$-minimization on normalized data. Error bars cover two standard deviations from the mean.

Figure 2: (a) Gradient descent on ReLU; (b) drug discovery dataset. On the left, we compare the performance of the different transformations with gradient descent on synthetic data, where we measure the $\ell_2$-distance to the optimal solution. On the right, we compare different regression methods applied to real data where the labels are artificially corrupted with $\eta$-Massart noise. "Rescaled L1" represents Algorithm 1. We report the fraction of points that lie in the subspace generated by its output with margin.
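Below is a sketch of one step of the gradient-descent update just described, reusing the approx_radial_isotropy helper from the Algorithm 1 sketch. Subgradients are used at the non-differentiable points of the $\ell_1$-loss and the ReLU, and the function name is ours.

```python
import numpy as np

def transformed_gd_step(X, y, w, step=1.0):
    """One gradient-descent step on the empirical l1-loss of ReLU regression,
    recomputing the radial-isotropic transformation for the points on the
    positive side of the current hypothesis (a sketch of the update above)."""
    pos = X @ w >= 0
    S, yS = X[pos], y[pos]
    A = approx_radial_isotropy(S)                       # recomputed at every step
    w_t = np.linalg.solve(A.T, w)                       # w' = A^{-T} w (= A^{-1} w for symmetric A)
    SA = S @ A.T                                        # transformed points A x_i
    scale = np.linalg.norm(SA, axis=1)
    SA, yS = SA / scale[:, None], yS / scale
    pred = SA @ w_t
    # subgradient of (1/|S|) sum_i |y_i - ReLU(w' . x_i)| with respect to w'
    g = -(SA * (np.sign(yS - np.maximum(pred, 0.0)) * (pred >= 0))[:, None]).mean(axis=0)
    # take the step in the transformed space and map back (w <- w - step * A g for symmetric A)
    return w - step * (A.T @ g)
```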
Drug Discovery Dataset. The drug discovery dataset was originally curated by (40), and we used the same version as in (14). The dataset has a training set of 3084 points and a test set of 1000 points, each of 410 dimensions. The $\eta$-Massart adversary corrupts the training data $(x_i, y_i)$ so that each label is flipped to $-100\, y_i$ with probability $\eta$. We compared $\ell_1$-regression with a radial-isotropic transformation ("Rescaled L1") to other baseline methods, such as least squares and naive $\ell_1$-regression. For ridge regression, we optimized the regularization coefficient based on the uncorrupted data. We measure performance by computing the fraction of the test set that lies within the subspace generated by the output vector, with a margin of 2.

Results. In Figure 1, our algorithm with the radial-isotropic transformation outperforms the other baseline methods, both in robustness with respect to the noise level and in efficiency with respect to the sample size. This is in line with the results of Theorem 1.2. In fact, our experiments on ReLUs also empirically demonstrate that radial isotropy significantly improves ReLU regression via gradient descent, by making the dataset more robust to noise at each iteration. For the drug discovery dataset, although $\ell_1$-regression with radial isotropy performs slightly worse than naive $\ell_1$-regression when there is minimal noise, it significantly outperforms the baseline methods in regimes of high noise.

5 Conclusion

In this work, we propose a generalization of the Massart (or bounded) noise model, previously studied in binary classification, to the real-valued setting. The Massart model is a realistic semi-random noise model that is stronger than uniform random noise or oblivious noise, but weaker than adversarial label noise. Our main result is an efficient algorithm for ReLU regression (and, in the process, also linear regression) in this model under minimal distributional assumptions. At the technical level, we provide structural conditions for $\ell_0$-minimization to be efficiently computable. A key conceptual idea enabling our efficient algorithms is that of transforming the dataset using radial-isotropic transformations. We empirically validated the effectiveness of radial-isotropic transformations for robustness via experiments on both synthetic and real data. In contrast to previous works on robust regression that require strong distributional assumptions, our framework and results may be seen as an intricate balance between slightly weakening the noise model while affording generality in the underlying distribution.

Acknowledgments and Disclosure of Funding

Ilias Diakonikolas is supported by NSF Medium Award CCF-2107079, NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant.

References

[1] ARTSTEIN-AVIDAN, S., KAPLAN, H., AND SHARIR, M. On radial isotropic position: Theory and algorithms. arXiv preprint arXiv:2005.04918 (2020).
[2] AWASTHI, P., BALCAN, M. F., HAGHTALAB, N., AND URNER, R. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015 (2015), pp. 167-190.
[3] AWASTHI, P., BALCAN, M. F., HAGHTALAB, N., AND ZHANG, H. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016 (2016), pp. 152-192.
[4] BARRENO, M., NELSON, B., JOSEPH, A. D., AND TYGAR, J. D. The security of machine learning. Machine Learning 81, 2 (2010), 121-148.
[5] BHATIA, K., JAIN, P., KAMALARUBAN, P., AND KAR, P. Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (2017), pp. 2107-2116.
[6] BHATIA, K., JAIN, P., AND KAR, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (2015), pp. 721-729.
[7] BIGGIO, B., NELSON, B., AND LASKOV, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012 (2012).
[8] CANDÈS, E. J., AND WAKIN, M. B. An introduction to compressive sampling. IEEE Signal Processing Magazine 25, 2 (2008), 21-30.
[9] CHEN, S., KOEHLER, F., MOITRA, A., AND YAU, M. Classification under misspecification: Halfspaces, generalized linear models, and evolvability. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[10] CHEN, S., KOEHLER, F., MOITRA, A., AND YAU, M. Online and distribution-free robustness: Regression and contextual bandits with Huber contamination. arXiv preprint arXiv:2010.04157 (2020).
[11] DIAKONIKOLAS, I., GOEL, S., KARMALKAR, S., KLIVANS, A. R., AND SOLTANOLKOTABI, M. Approximation schemes for ReLU regression. In Conference on Learning Theory, COLT 2020 (2020), vol. 125 of Proceedings of Machine Learning Research, PMLR, pp. 1452-1485.
[12] DIAKONIKOLAS, I., GOULEAKIS, T., AND TZAMOS, C. Distribution-independent PAC learning of halfspaces with Massart noise. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 4751-4762.
[13] DIAKONIKOLAS, I., KAMATH, G., KANE, D. M., LI, J., MOITRA, A., AND STEWART, A. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS) (2016), pp. 655-664.
[14] DIAKONIKOLAS, I., KAMATH, G., KANE, D. M., LI, J., STEINHARDT, J., AND STEWART, A. SEVER: A robust meta-algorithm for stochastic optimization. In Proc. 36th International Conference on Machine Learning (ICML) (2019), pp. 1596-1606.
[15] DIAKONIKOLAS, I., KANE, D., AND ZARIFIS, N. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[16] DIAKONIKOLAS, I., AND KANE, D. M. Recent advances in algorithmic high-dimensional robust statistics. CoRR abs/1911.05911 (2019).
[17] DIAKONIKOLAS, I., AND KANE, D. M. Hardness of learning halfspaces with Massart noise. CoRR abs/2012.09720 (2020).
[18] DIAKONIKOLAS, I., KANE, D. M., PITTAS, T., AND ZARIFIS, N. The optimality of polynomial regression for agnostic learning under Gaussian marginals. CoRR abs/2102.04401 (2021). To appear in COLT 2021.
[19] DIAKONIKOLAS, I., KONG, W., AND STEWART, A. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA) (2019), pp. 2745-2754.
[20] DIAKONIKOLAS, I., KONTONIS, V., TZAMOS, C., AND ZARIFIS, N. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT 2020 (2020), J. D. Abernethy and S. Agarwal, Eds., vol. 125 of Proceedings of Machine Learning Research, PMLR, pp. 1486-1513.
[21] DIELMAN, T. E. Applied Regression Analysis for Business and Economics. Duxbury/Thomson Learning, Pacific Grove, CA, 2001.
[22] D'ORSI, T., NOVIKOV, G., AND STEURER, D. Regress consistently when oblivious outliers overwhelm. arXiv preprint arXiv:2009.14774 (2020).
[23] DU, S. S., LEE, J. D., AND TIAN, Y. When is a convolutional filter easy to learn? In 6th International Conference on Learning Representations, ICLR 2018 (2018).
[24] DUNAGAN, J., AND VEMPALA, S. Optimal outlier removal in high-dimensional spaces. J. Computer & System Sciences 68, 2 (2004), 335-373.
[25] GOEL, S., GOLLAKOTA, A., AND KLIVANS, A. R. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[26] GOEL, S., KANADE, V., KLIVANS, A., AND THALER, J. Reliably learning the ReLU in polynomial time. In Conference on Learning Theory (2017), pp. 1004-1042.
[27] GOEL, S., KARMALKAR, S., AND KLIVANS, A. R. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (2019), pp. 8582-8591.
[28] HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J., AND STAHEL, W. A. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.
[29] HARDT, M., AND MOITRA, A. Algorithms and hardness for robust subspace recovery. In Proc. 26th Annual Conference on Learning Theory (COLT) (2013), pp. 354-375.
[30] HUBER, P. J., AND RONCHETTI, E. M. Robust Statistics. Wiley, New York, 2009.
[31] KALAN, S. M. M., SOLTANOLKOTABI, M., AND AVESTIMEHR, S. Fitting ReLUs via SGD and quantized SGD. In 2019 IEEE International Symposium on Information Theory (ISIT) (2019), IEEE, pp. 2469-2473.
[32] KARMAKAR, S., MUKHERJEE, A., AND MUTHUKUMAR, R. A study of neural training with iterative non-gradient methods. arXiv e-prints (2020), arXiv:2005.
[33] KHACHIYAN, L. On the complexity of approximating extremal determinants in matrices. Journal of Complexity 11, 1 (1995), 138-153.
[34] KLIVANS, A., KOTHARI, P., AND MEKA, R. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT) (2018), pp. 1420-1430.
[35] LAI, K. A., RAO, A. B., AND VEMPALA, S. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS) (2016), pp. 665-674.
[36] LI, J., ABSHER, D., TANG, H., SOUTHWICK, A., CASTO, A., RAMACHANDRAN, S., CANN, H., BARSH, G., FELDMAN, M., CAVALLI-SFORZA, L., AND MYERS, R. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319 (2008), 1100-1104.
[37] MANURANGSI, P., AND REICHMAN, D. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207 (2018).
[38] MASSART, P., AND NEDELEC, E. Risk bounds for statistical learning. Ann. Statist. 34, 5 (2006), 2326-2366.
[39] MCDONALD, J. H. Handbook of Biological Statistics, vol. 2. Sparky House Publishing, Baltimore, MD, 2009.
[40] OLIER, I., SADAWI, N., BICKERTON, G. R., VANSCHOREN, J., GROSAN, C., SOLDATOVA, L., AND KING, R. D. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery. Machine Learning 107, 1 (2018), 285-311.
[41] PASCHOU, P., LEWIS, J., JAVED, A., AND DRINEAS, P. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics 47 (2010), 835-847.
[42] PESME, S., AND FLAMMARION, N. Online robust regression via SGD on the l1 loss. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[43] ROSENBERG, N., PRITCHARD, J., WEBER, J., CANN, H., KIDD, K., ZHIVOTOVSKY, L., AND FELDMAN, M. Genetic structure of human populations. Science 298 (2002), 2381-2385.
[44] ROUSSEEUW, P. J., AND LEROY, A. M. Robust Regression and Outlier Detection. John Wiley & Sons, Inc., New York, NY, USA, 1987.
[45] SOLTANOLKOTABI, M. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems (2017), pp. 2007-2017.
[46] STEINHARDT, J., KOH, P. W., AND LIANG, P. S. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30 (2017), pp. 3520-3532.
[47] SUGGALA, A. S., BHATIA, K., RAVIKUMAR, P., AND JAIN, P. Adaptive hard thresholding for near-optimal consistent robust regression. In Conference on Learning Theory, COLT 2019 (2019), pp. 2892-2897.
[48] YEHUDAI, G., AND SHAMIR, O. On the power and limitations of random features for understanding neural networks. CoRR abs/1904.00687 (2019).
[49] YEHUDAI, G., AND SHAMIR, O. Learning a single neuron with gradient methods. In Conference on Learning Theory (2020), PMLR, pp. 3756-3786.
[50] ZHANG, C., SHEN, J., AND AWASTHI, P. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 (2020).
[51] ZHANG, Y., LIANG, P., AND CHARIKAR, M. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017 (2017), pp. 1980-2022.