# Efficiently Learning Adversarially Robust Halfspaces with Noise

Omar Montasser¹  Surbhi Goel²  Ilias Diakonikolas³  Nathan Srebro¹

¹Toyota Technological Institute at Chicago  ²University of Texas at Austin  ³University of Wisconsin-Madison. Correspondence to: Omar Montasser.

We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to any $\ell_p$-perturbation.

## 1. Introduction

Learning predictors that are robust to adversarial examples remains a major challenge in machine learning. A line of work has shown that predictors learned by deep neural networks are not robust to adversarial examples (Szegedy et al., 2014; Biggio et al., 2013; Goodfellow et al., 2015). This has led to a long line of research studying different aspects of robustness to adversarial examples.

In this paper, we consider the problem of distribution-independent learning of halfspaces that are robust to adversarial examples at test time, also referred to as robust PAC learning of halfspaces. Halfspaces are binary predictors of the form $h_w(x) = \mathrm{sign}(\langle w, x\rangle)$, where $w \in \mathbb{R}^d$. In adversarially robust PAC learning, given an instance space $\mathcal{X}$ and label space $\mathcal{Y} = \{\pm 1\}$, we formalize an adversary that we would like to be robust against as a map $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$, where $\mathcal{U}(x) \subseteq \mathcal{X}$ represents the set of perturbations (adversarial examples) that can be chosen by the adversary at test time (we require that $x \in \mathcal{U}(x)$). For an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we observe $m$ i.i.d. samples $S \sim \mathcal{D}^m$, and our goal is to learn a predictor $\hat{h} : \mathcal{X} \to \mathcal{Y}$ that achieves small robust risk,

$$\mathrm{R}_{\mathcal{U}}(\hat{h}; \mathcal{D}) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\left[\sup_{z \in \mathcal{U}(x)} \mathbb{1}[\hat{h}(z) \neq y]\right]. \qquad (1)$$

The information-theoretic aspects of adversarially robust learning have been studied in recent work, see, e.g., (Schmidt et al., 2018; Cullina et al., 2018; Khim & Loh, 2018; Bubeck et al., 2019; Yin et al., 2019; Montasser et al., 2019). This includes studying what learning rules should be used for robust learning and how much training data is needed to guarantee high robust accuracy. It is now known that any hypothesis class $\mathcal{H}$ with finite VC dimension is robustly learnable, though sometimes improper learning is necessary and the sample complexity may be exponential in the VC dimension (Montasser et al., 2019). On the other hand, the computational aspects of adversarially robust PAC learning are less understood. In this paper, we take a first step towards studying this broad algorithmic question with a focus on the fundamental problem of learning adversarially robust halfspaces.

A first question to ask is whether efficient PAC learning implies efficient robust PAC learning, i.e., whether there is a general reduction that solves the adversarially robust learning problem. Recent work has provided strong evidence that this is not the case. Specifically, (Bubeck et al., 2019) showed that there exists a learning problem that can be learned efficiently non-robustly, but is computationally intractable to learn robustly (under plausible complexity-theoretic assumptions).
There is also more recent evidence suggesting that this is the case in the PAC model. (Awasthi et al., 2019) showed that it is computationally intractable to even weakly robustly learn degree-2 polynomial threshold functions (PTFs) with $\ell_\infty$ perturbations in the realizable setting, while PTFs of any constant degree are known to be efficiently PAC learnable non-robustly in the realizable setting. (Gourdeau et al., 2019) showed that there are hypothesis classes that are hard to robustly PAC learn, under the assumption that it is hard to non-robustly PAC learn them.

The aforementioned discussion suggests that when studying robust PAC learning, we need to characterize which types of perturbation sets $\mathcal{U}$ admit computationally efficient robust PAC learners, and under which noise assumptions. In the agnostic PAC setting, it is known that even weak (non-robust) learning of halfspaces is computationally intractable (Feldman et al., 2006; Guruswami & Raghavendra, 2009; Diakonikolas et al., 2011; Daniely, 2016). For $\ell_2$-perturbations, where $\mathcal{U}(x) = \{z : \|x - z\|_2 \le \gamma\}$, it was recently shown that the complexity of proper learning is exponential in $1/\gamma$ (Diakonikolas et al., 2019b). In this paper, we focus on the realizable case and the (more challenging) case of random label noise.

We can be more optimistic in the realizable setting. Halfspaces are efficiently PAC learnable non-robustly via Linear Programming (Maass & Turán, 1994), and under a margin assumption via the Perceptron algorithm (Rosenblatt, 1958). But what can we say about robustly PAC learning halfspaces? Given a perturbation set $\mathcal{U}$ and under the assumption that there is a halfspace $h_{w^*}$ that robustly separates the data, can we efficiently learn a predictor with small robust risk? Just as empirical risk minimization (ERM) is central for non-robust PAC learning, a core component of adversarially robust learning is minimizing the robust empirical risk on a dataset $S$,

$$\hat{h} \in \mathrm{RERM}_{\mathcal{U}}(S) \stackrel{\text{def}}{=} \underset{h \in \mathcal{H}}{\operatorname{argmin}}\; \sum_{i=1}^{m} \sup_{z \in \mathcal{U}(x_i)} \mathbb{1}[h(z) \neq y_i].$$

In this paper, we provide necessary and sufficient conditions on perturbation sets $\mathcal{U}$ under which the robust empirical risk minimization (RERM) problem is efficiently solvable in the realizable setting. We show that an efficient separation oracle for $\mathcal{U}$ yields an efficient solver for $\mathrm{RERM}_{\mathcal{U}}$, while an efficient approximate separation oracle for $\mathcal{U}$ is necessary for even computing the robust loss $\sup_{z \in \mathcal{U}(x)} \mathbb{1}[h_w(z) \neq y]$ of a halfspace $h_w$. In addition, we relax the realizability assumption and show that under random classification noise (Angluin & Laird, 1987), we can efficiently robustly PAC learn halfspaces with respect to any $\ell_p$ perturbation.

Main Contributions. Our main contributions can be summarized as follows:

1. In the realizable setting, the class of halfspaces is efficiently robustly PAC learnable with respect to $\mathcal{U}$, given an efficient separation oracle for $\mathcal{U}$.
2. To even compute the robust risk with respect to $\mathcal{U}$ efficiently, an efficient approximate separation oracle for $\mathcal{U}$ is necessary.
3. In the random classification noise setting, the class of halfspaces is efficiently robustly PAC learnable with respect to any $\ell_p$ perturbation.
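As a concrete illustration of the robust loss appearing inside $\mathrm{RERM}_{\mathcal{U}}$ above, consider $\ell_\infty$ perturbations of radius $\gamma$. For a halfspace, the inner supremum over the perturbation set has a closed form in terms of the dual ($\ell_1$) norm; this is a special case of Lemma 4.2 established later in the paper. The following minimal sketch is only illustrative, and the function names are ours.

```python
import numpy as np

def robust_loss_linf(w, x, y, gamma):
    """0-1 robust loss of the halfspace w on (x, y) under l_inf perturbations of
    radius gamma: sup_{||delta||_inf <= gamma} 1[y <w, x + delta> <= 0].
    The worst-case perturbation is delta = -y * gamma * sign(w), so the loss is
    1 exactly when y <w, x> - gamma * ||w||_1 <= 0 (cf. Lemma 4.2 with p = inf)."""
    return float(y * np.dot(w, x) - gamma * np.linalg.norm(w, 1) <= 0)

def robust_empirical_risk_linf(w, X, Y, gamma):
    """Average robust loss over a sample; this is the objective RERM_U minimizes."""
    return float(np.mean([robust_loss_linf(w, x, y, gamma) for x, y in zip(X, Y)]))
```

For general $\ell_p$ perturbations the same computation goes through with $\|w\|_1$ replaced by the dual norm $\|w\|_q$, where $1/p + 1/q = 1$.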
## 1.1. Related Work

Here we focus on the recent work that is most closely related to the results of this paper. (Awasthi et al., 2019) studied the tractability of RERM with respect to $\ell_\infty$ perturbations, obtaining efficient algorithms for halfspaces in the realizable setting, but showing that RERM for degree-2 polynomial threshold functions is computationally intractable (assuming NP ≠ RP). (Gourdeau et al., 2019) studied robust learnability of hypothesis classes defined over $\{0,1\}^n$ with respect to Hamming distance, and showed that monotone conjunctions are robustly learnable when the adversary can perturb only $O(\log n)$ bits, but are not robustly learnable, even under the uniform distribution, when the adversary can flip $\omega(\log n)$ bits. In this work, we take a more general approach: instead of considering specific perturbation sets, we provide methods in terms of oracle access to a separation oracle for the perturbation set $\mathcal{U}$, and aim to characterize which perturbation sets $\mathcal{U}$ admit tractable RERM. In the non-realizable setting, the only prior work we are aware of is by (Diakonikolas et al., 2019b), who studied the complexity of robustly learning halfspaces in the agnostic setting under $\ell_2$ perturbations.

## 2. Problem Setup

Let $\mathcal{X} = \mathbb{R}^d$ be the instance space and $\mathcal{Y} = \{\pm 1\}$ be the label space. We consider halfspaces $\mathcal{H} = \{x \mapsto \mathrm{sign}(\langle w, x\rangle) : w \in \mathbb{R}^d\}$. The following definitions formalize the notion of adversarially robust PAC learning in the realizable and random classification noise settings:

Definition 2.1 (Realizable Robust PAC Learning). We say $\mathcal{H}$ is robustly PAC learnable with respect to an adversary $\mathcal{U}$ in the realizable setting if there exist a learning algorithm $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{Y}^{\mathcal{X}}$ and a sample complexity $m : (0,1)^2 \to \mathbb{N}$ such that: for any $\epsilon, \delta \in (0,1)$ and for every data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ for which there exists a predictor $h^* \in \mathcal{H}$ with zero robust risk, $\mathrm{R}_{\mathcal{U}}(h^*; \mathcal{D}) = 0$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$, $\mathrm{R}_{\mathcal{U}}(\mathcal{A}(S); \mathcal{D}) \le \epsilon$.

Definition 2.2 (Robust PAC Learning with Random Classification Noise). Let $h^* \in \mathcal{H}$ be an unknown halfspace. Let $\mathcal{D}_x$ be an arbitrary distribution over $\mathcal{X}$ such that $\mathrm{R}_{\mathcal{U}}(h^*; \mathcal{D}_{h^*}) = 0$, where $\mathcal{D}_{h^*}$ denotes the distribution of $(x, h^*(x))$ with $x \sim \mathcal{D}_x$, and let $0 \le \eta < 1/2$. A noisy example oracle $\mathrm{EX}(h^*, \mathcal{D}_x, \eta)$ works as follows: each time $\mathrm{EX}(h^*, \mathcal{D}_x, \eta)$ is invoked, it returns a labeled example $(x, y)$, where $x \sim \mathcal{D}_x$, $y = h^*(x)$ with probability $1-\eta$, and $y = -h^*(x)$ with probability $\eta$. Let $\mathcal{D}$ be the joint distribution on $(x, y)$ generated by the above oracle. We say $\mathcal{H}$ is robustly PAC learnable with respect to an adversary $\mathcal{U}$ in the random classification noise model if there exist a sample complexity $m(\epsilon, \delta, \eta) \in \mathbb{N}$ and a learning algorithm $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{Y}^{\mathcal{X}}$ such that for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ (generated as above by a noisy example oracle), with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$, $\mathrm{R}_{\mathcal{U}}(\mathcal{A}(S); \mathcal{D}) \le \eta + \epsilon$.

Sample Complexity of Robust Learning. Denote by $\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}$ the robust loss class of $\mathcal{H}$,

$$\mathcal{L}^{\mathcal{U}}_{\mathcal{H}} = \left\{ (x,y) \mapsto \sup_{z \in \mathcal{U}(x)} \mathbb{1}[h(z) \neq y] \;:\; h \in \mathcal{H} \right\}.$$

It was shown by (Cullina et al., 2018) that for any set $B \subseteq \mathcal{X}$ that is nonempty, closed, convex, and origin-symmetric, and an adversary $\mathcal{U}$ defined as $\mathcal{U}(x) = x + B$ (e.g., $\ell_p$-balls), the VC dimension of the robust loss class of halfspaces $\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}})$ is at most the standard VC dimension $\mathrm{vc}(\mathcal{H}) = d+1$. Based on Vapnik's General Learning (Vapnik, 1982), this implies that we have uniform convergence of the robust risk with $m = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ samples. Formally, for any $\epsilon, \delta \in (0,1)$ and any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$,

$$\forall w \in \mathbb{R}^d,\quad \left|\mathrm{R}_{\mathcal{U}}(h_w; \mathcal{D}) - \mathrm{R}_{\mathcal{U}}(h_w; S)\right| \le \epsilon. \qquad (2)$$

In particular, this implies that for any adversary $\mathcal{U}$ that satisfies the conditions above, $\mathcal{H}$ is robustly PAC learnable w.r.t. $\mathcal{U}$
by minimizing the robust empirical risk on $S$,

$$\mathrm{RERM}_{\mathcal{U}}(S) = \underset{w \in \mathbb{R}^d}{\operatorname{argmin}}\; \frac{1}{m}\sum_{i=1}^{m} \sup_{z_i \in \mathcal{U}(x_i)} \mathbb{1}[y_i \langle w, z_i\rangle \le 0]. \qquad (3)$$

Thus, it remains to efficiently solve the $\mathrm{RERM}_{\mathcal{U}}$ problem. We discuss necessary and sufficient conditions for solving $\mathrm{RERM}_{\mathcal{U}}$ in the following section.

## 3. The Realizable Setting

In this section, we show necessary and sufficient conditions for minimizing the robust empirical risk $\mathrm{RERM}_{\mathcal{U}}$ on a dataset $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ in the realizable setting, i.e., when the dataset $S$ is robustly separable by a halfspace $h_{w^*}$ with $w^* \in \mathbb{R}^d$. In Theorem 3.5, we show that an efficient separation oracle for $\mathcal{U}$ yields an efficient solver for $\mathrm{RERM}_{\mathcal{U}}$. In Theorem 3.10, we show that an efficient approximate separation oracle for $\mathcal{U}$ is necessary for even computing the robust loss $\sup_{z \in \mathcal{U}(x)} \mathbb{1}[h_w(z) \neq y]$ of a halfspace $h_w$.

Note that the set of allowed perturbations $\mathcal{U}$ can be non-convex, and so it might seem difficult to solve the $\mathrm{RERM}_{\mathcal{U}}$ problem in full generality. But it turns out that for halfspaces it suffices to consider only convex perturbation sets, due to the following observation:

Observation 3.1. Let $w \in \mathbb{R}^d$ be a halfspace and $(x, y) \in \mathcal{X} \times \mathcal{Y}$ an example. If $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, then $\forall z \in \mathrm{conv}(\mathcal{U}(x)),\ y\langle w, z\rangle > 0$. And if $\exists z \in \mathcal{U}(x)$ with $y\langle w, z\rangle \le 0$, then $\exists z \in \mathrm{conv}(\mathcal{U}(x))$ with $y\langle w, z\rangle \le 0$, where $\mathrm{conv}(\mathcal{U}(x))$ denotes the convex hull of $\mathcal{U}(x)$.

Observation 3.1 shows that for any dataset $S$ that is robustly separable w.r.t. $\mathcal{U}$ by a halfspace $w^*$, $S$ is also robustly separable w.r.t. the convex hull $\mathrm{conv}(\mathcal{U})$ using the same halfspace $w^*$, where $\mathrm{conv}(\mathcal{U})(x) = \mathrm{conv}(\mathcal{U}(x))$. Thus, in the remainder of this section we only consider perturbation sets $\mathcal{U}$ that are convex, i.e., for each $x$, $\mathcal{U}(x)$ is convex.

Definition 3.2. Denote by $\mathrm{SEP}_{\mathcal{U}}$ a separation oracle for $\mathcal{U}$. $\mathrm{SEP}_{\mathcal{U}}(x, z)$ takes as input $x, z \in \mathcal{X}$ and either: asserts that $z \in \mathcal{U}(x)$, or returns a separating hyperplane $w \in \mathbb{R}^d$ such that $\langle w, z'\rangle < \langle w, z\rangle$ for all $z' \in \mathcal{U}(x)$.

Definition 3.3. For any $\eta > 0$, denote by $\mathrm{SEP}^{\eta}_{\mathcal{U}}$ an approximate separation oracle for $\mathcal{U}$. $\mathrm{SEP}^{\eta}_{\mathcal{U}}$ takes as input $x, z \in \mathcal{X}$ and either: asserts that $z \in \mathcal{U}(x)^{+\eta} \stackrel{\text{def}}{=} \{z' : \exists z'' \in \mathcal{U}(x) \text{ s.t. } \|z' - z''\|_2 \le \eta\}$, or returns a separating hyperplane $w \in \mathbb{R}^d$ such that $\langle w, z'\rangle \le \langle w, z\rangle + \eta$ for all $z' \in \mathcal{U}(x)^{-\eta} \stackrel{\text{def}}{=} \{z' : B(z', \eta) \subseteq \mathcal{U}(x)\}$.

Definition 3.4. Denote by $\mathrm{MEM}_{\mathcal{U}}$ a membership oracle for $\mathcal{U}$. $\mathrm{MEM}_{\mathcal{U}}(x, z)$ takes as input $x, z \in \mathcal{X}$ and either asserts that $z \in \mathcal{U}(x)$, or asserts that $z \notin \mathcal{U}(x)$.

When discussing a separation or membership oracle for a fixed convex set $K$, we overload notation and write $\mathrm{SEP}_K$, $\mathrm{SEP}^{\eta}_K$, and $\mathrm{MEM}_K$ (in this case only one argument is required).

### 3.1. An efficient separation oracle for U is sufficient to solve RERM efficiently

Let $\mathrm{Soln}^{\mathcal{U}}_S = \{w \in \mathbb{R}^d : \forall (x, y) \in S,\ \forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0\}$ denote the set of valid solutions for $\mathrm{RERM}_{\mathcal{U}}(S)$ (see Equation (3)). Note that $\mathrm{Soln}^{\mathcal{U}}_S$ is not empty, since we are considering the realizable setting. Although the treatment we present here is for homogeneous halfspaces (where a bias term is not needed), the results extend trivially to the non-homogeneous case. Below, we show that we can efficiently find a solution $w \in \mathrm{Soln}^{\mathcal{U}}_S$ given access to a separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$.

Theorem 3.5. Let $\mathcal{U}$ be an arbitrary convex adversary, and suppose we have access to a separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$ that runs in time $\mathrm{poly}(d, b)$. Then there is an algorithm that finds $w \in \mathrm{Soln}^{\mathcal{U}}_S$ in $\mathrm{poly}(m, d, b)$ time, where $b$ is an upper bound on the bit complexity of the valid solutions in $\mathrm{Soln}^{\mathcal{U}}_S$ and of the examples and perturbations in $S$.
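To make the oracle model concrete, here is a minimal sketch of a separation oracle in the sense of Definition 3.2 for $\ell_2$-ball perturbations $\mathcal{U}(x) = \{z : \|z - x\|_2 \le \gamma\}$, the kind of oracle Theorem 3.5 assumes. The function name and the return convention (returning `None` on membership) are ours.

```python
import numpy as np

def sep_l2_ball(x, z, gamma):
    """Separation oracle for U(x) = {z' : ||z' - x||_2 <= gamma} (Definition 3.2).

    Returns None if z is in U(x); otherwise returns a vector g such that
    <g, z'> < <g, z> for every z' in U(x)."""
    diff = z - x
    dist = np.linalg.norm(diff)
    if dist <= gamma:
        return None                 # z is a member of the ball
    g = diff / dist                 # unit normal pointing from x towards z
    # For any z' in the ball: <g, z'> <= <g, x> + gamma < <g, x> + dist = <g, z>.
    return g
```

For an $\ell_\infty$ ball one would instead return a signed standard basis vector along a coordinate of $z - x$ whose magnitude exceeds $\gamma$.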
Note that the polynomial dependence on $b$ in the runtime is unavoidable even in standard non-robust ERM for halfspaces, unless we can solve linear programs in strongly polynomial time, which is currently an open problem. Theorem 3.5 implies that for a broad family of perturbation sets $\mathcal{U}$, halfspaces $\mathcal{H}$ are efficiently robustly PAC learnable with respect to $\mathcal{U}$ in the realizable setting, as we show in the following corollary:

Corollary 3.6. Let $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$ be an adversary such that $\mathcal{U}(x) = x + B$, where $B$ is nonempty, closed, convex, and origin-symmetric. Then, given access to an efficient separation oracle $\mathrm{SEP}_{\mathcal{U}}$ that runs in time $\mathrm{poly}(d, b)$, $\mathcal{H}$ is robustly PAC learnable w.r.t. $\mathcal{U}$ in the realizable setting in time $\mathrm{poly}(d, b, 1/\epsilon, \log(1/\delta))$.

Proof. This follows from the uniform convergence guarantee for the robust risk of halfspaces (see Equation (2)) and Theorem 3.5.

This covers many types of perturbation sets that are considered in practice. For example, $\mathcal{U}$ could be perturbations of distance at most $\gamma$ w.r.t. some norm $\|\cdot\|$, such as the $\ell_\infty$ norm considered in many applications: $\mathcal{U}(x) = \{z \in \mathcal{X} : \|x - z\|_\infty \le \gamma\}$. In addition, Theorem 3.5 also implies that we can solve the RERM problem for other natural perturbation sets, such as translations and rotations in images (see, e.g., (Engstrom et al., 2019)), and perhaps mixtures of perturbations of different types (see, e.g., (Kang et al., 2019)), as long as we have access to efficient separation oracles for these sets.

Benefits of handling general perturbation sets U. One important implication of Theorem 3.5, which highlights the importance of a treatment that considers general perturbation sets (and not just $\ell_p$ perturbations, for example), is the following: for any efficiently computable feature map $\phi : \mathbb{R}^r \to \mathbb{R}^d$, we can efficiently solve the robust empirical risk problem over the induced halfspaces $\mathcal{H}_\phi = \{x \mapsto \mathrm{sign}(\langle w, \phi(x)\rangle) : w \in \mathbb{R}^d\}$, as long as we have access to an efficient separation oracle for the image of the perturbations $\phi(\mathcal{U}(x))$. Observe that in general $\phi(\mathcal{U}(x))$ may be non-convex and complicated even if $\mathcal{U}(x)$ is convex; however, Observation 3.1 combined with the realizability assumption implies that it suffices to have an efficient separation oracle for the convex hull $\mathrm{conv}(\phi(\mathcal{U}(x)))$.

Before we proceed with the proof of Theorem 3.5, we state the following requirements and guarantees for the Ellipsoid method, which will be useful in the remainder of the section:

Lemma 3.7 (see, e.g., Theorem 2.4 in (Bubeck et al., 2015)). Let $K \subseteq \mathbb{R}^d$ be a convex set, and $\mathrm{SEP}_K$ a separation oracle for $K$. Then the Ellipsoid method, using $O(d^2 b)$ oracle queries to $\mathrm{SEP}_K$, will find a $w \in K$ or assert that $K$ is empty. Furthermore, the total runtime is $O(d^4 b)$.

The proof of Theorem 3.5 relies on two key lemmas. First, we show that efficient robust certification yields an efficient solver for the RERM problem. Given a halfspace $w \in \mathbb{R}^d$ and an example $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$, efficient robust certification means that there is an algorithm that can efficiently either: (a) assert that $w$ is robust on $\mathcal{U}(x)$, i.e., $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, or (b) return a perturbation $z \in \mathcal{U}(x)$ such that $y\langle w, z\rangle \le 0$.

Lemma 3.8. Let $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ be a procedure that either: (a) asserts that $w$ is robust on $\mathcal{U}(x)$, i.e., $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, or (b) finds a perturbation $z \in \mathcal{U}(x)$ such that $y\langle w, z\rangle \le 0$. If $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ can be computed in $\mathrm{poly}(d, b)$ time, then there is an algorithm that finds $w \in \mathrm{Soln}^{\mathcal{U}}_S$ in $\mathrm{poly}(m, d, b)$ time.
Proof. Observe that $\mathrm{Soln}^{\mathcal{U}}_S$ is a convex set, since

$$w_1, w_2 \in \mathrm{Soln}^{\mathcal{U}}_S \;\Rightarrow\; \forall (x,y) \in S,\ \forall z \in \mathcal{U}(x),\ y\langle w_1, z\rangle > 0 \text{ and } y\langle w_2, z\rangle > 0$$
$$\Rightarrow\; \forall \alpha \in [0,1],\ \forall (x,y) \in S,\ \forall z \in \mathcal{U}(x),\ y\langle \alpha w_1 + (1-\alpha) w_2, z\rangle > 0 \;\Rightarrow\; \forall \alpha \in [0,1],\ \alpha w_1 + (1-\alpha) w_2 \in \mathrm{Soln}^{\mathcal{U}}_S.$$

Our goal is to find a $w \in \mathrm{Soln}^{\mathcal{U}}_S$. Let $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ be an efficient robust certifier that runs in $\mathrm{poly}(d, b)$ time. We will use $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ to implement a separation oracle for $\mathrm{Soln}^{\mathcal{U}}_S$, denoted $\mathrm{SEP}_{\mathrm{Soln}^{\mathcal{U}}_S}$. Given a halfspace $w \in \mathbb{R}^d$, we simply check whether $w$ is robustly correct on all datapoints by running $\mathrm{CERT}_{\mathcal{U}}(w, (x_i, y_i))$ on each $(x_i, y_i) \in S$. If there is a point $(x_i, y_i) \in S$ on which $w$ is not robustly correct, then we obtain a perturbation $z_i \in \mathcal{U}(x_i)$ with $y_i\langle w, z_i\rangle \le 0$, and we return $y_i z_i$ as a separating hyperplane. Otherwise, we know that $w$ is robustly correct on all datapoints, and we assert that $w \in \mathrm{Soln}^{\mathcal{U}}_S$. Once we have a separation oracle $\mathrm{SEP}_{\mathrm{Soln}^{\mathcal{U}}_S}$, we can use the Ellipsoid method (see Lemma 3.7) to solve the $\mathrm{RERM}_{\mathcal{U}}(S)$ problem. More specifically, with a query complexity of $O(d^2 b)$ to $\mathrm{SEP}_{\mathrm{Soln}^{\mathcal{U}}_S}$ and an overall runtime of $\mathrm{poly}(m, d, b)$ (which depends on the runtime of $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$), the Ellipsoid method will return a $w \in \mathrm{Soln}^{\mathcal{U}}_S$.

Next, we show that we can do efficient robust certification when given access to an efficient separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$.

Lemma 3.9. If we have an efficient separation oracle $\mathrm{SEP}_{\mathcal{U}}$ that runs in $\mathrm{poly}(d, b)$ time, then we can solve $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ in $\mathrm{poly}(d, b)$ time.

Proof. Given a halfspace $w \in \mathbb{R}^d$ and $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$, we want to either: (a) assert that $w$ is robust on $\mathcal{U}(x)$, i.e., $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, or (b) find a perturbation $z \in \mathcal{U}(x)$ such that $y\langle w, z\rangle \le 0$. Let $M(w, y) = \{z \in \mathcal{X} : y\langle w, z\rangle \le 0\}$ be the set of all points that $w$ mislabels. Observe that by definition $M(w, y)$ is convex, and therefore $\mathcal{U}(x) \cap M(w, y)$ is also convex. We argue that having an efficient separation oracle for $\mathcal{U}(x) \cap M(w, y)$ suffices to solve our robust certification problem. If $\mathcal{U}(x) \cap M(w, y)$ is not empty, then by definition we can find a perturbation $z \in \mathcal{U}(x)$ with $y\langle w, z\rangle \le 0$ using a separation oracle $\mathrm{SEP}_{\mathcal{U}(x) \cap M(w,y)}$ and the Ellipsoid method (see Lemma 3.7). If $\mathcal{U}(x) \cap M(w, y)$ is empty, then by definition $w$ is robustly correct on $\mathcal{U}(x)$, and the Ellipsoid method will terminate and assert that $\mathcal{U}(x) \cap M(w, y)$ is empty. Thus, it remains to implement $\mathrm{SEP}_{\mathcal{U}(x) \cap M(w,y)}$. Given a point $z \in \mathbb{R}^d$, we query the separation oracle for $\mathcal{U}(x)$ by calling $\mathrm{SEP}_{\mathcal{U}}(x, z)$, and the separation oracle for $M(w, y)$ by checking whether $y\langle w, z\rangle \le 0$. If $z \notin \mathcal{U}(x)$, then we get a separating hyperplane $c$ from $\mathrm{SEP}_{\mathcal{U}}$, and we can use it to separate $z$ from $\mathcal{U}(x) \cap M(w, y)$. Similarly, if $z \notin M(w, y)$, then by definition $y\langle w, z\rangle > 0$, and so we can use $y w$ as a separating hyperplane to separate $z$ from $\mathcal{U}(x) \cap M(w, y)$. The overall runtime of this separation oracle is $\mathrm{poly}(d, b)$, and so we can solve $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ in $\mathrm{poly}(d, b)$ time using the Ellipsoid method (Lemma 3.7).

We are now ready to proceed with the proof of Theorem 3.5.

Proof of Theorem 3.5. We want to efficiently solve $\mathrm{RERM}_{\mathcal{U}}(S)$, given a separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$ that runs in $\mathrm{poly}(d, b)$ time. By Lemma 3.9, we get an efficient robust certification procedure $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$. Then, by Lemma 3.8, we get an efficient solver for $\mathrm{RERM}_{\mathcal{U}}$. In particular, the runtime complexity is $\mathrm{poly}(m, d, b)$.
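The construction in Lemmas 3.8 and 3.9 amounts to two nested runs of the Ellipsoid method: an inner run that certifies a candidate halfspace on a single example, and an outer run over $\mathrm{Soln}^{\mathcal{U}}_S$. The following is a minimal numerical sketch of that pipeline, assuming a user-supplied oracle `sep_U(x, z)` in the sense of Definition 3.2 (returning `None` on membership, otherwise a cutting vector). The starting radius, iteration budget, and tolerance are illustrative simplifications and do not reproduce the bit-complexity bookkeeping of Lemma 3.7; the function names are ours.

```python
import numpy as np

def ellipsoid_search(sep_oracle, dim, R=10.0, max_iter=2000):
    """Simplified central-cut Ellipsoid method (cf. Lemma 3.7).

    sep_oracle(c) returns None if c lies in the target convex set K, otherwise a
    vector g with <g, v> < <g, c> for all v in K. Returns a point of K, or None
    if none is found within max_iter iterations (K is then treated as empty)."""
    c = np.zeros(dim)                     # center of the current ellipsoid
    P = (R ** 2) * np.eye(dim)            # shape matrix; start from the ball B(0, R)
    for _ in range(max_iter):
        g = sep_oracle(c)
        if g is None:
            return c
        Pg = P @ g
        gPg = float(g @ Pg)
        if gPg <= 1e-24:                  # ellipsoid has collapsed
            return None
        c = c - Pg / ((dim + 1) * np.sqrt(gPg))
        P = (dim ** 2 / (dim ** 2 - 1.0)) * (P - (2.0 / (dim + 1)) * np.outer(Pg, Pg) / gPg)
    return None

def certify(w, x, y, sep_U, R=10.0):
    """CERT_U(w, (x, y)) from Lemma 3.8, implemented as in Lemma 3.9: search
    U(x) intersected with M(w, y) using the Ellipsoid method.
    Returns a violating perturbation z, or None if w appears robustly correct."""
    def sep_intersection(z):
        if y * np.dot(w, z) > 0:          # z is outside M(w, y) = {z : y<w, z> <= 0}
            return y * w                  # y*w cuts z away from the intersection
        return sep_U(x, z)                # membership in U(x), or a cut for U(x)
    return ellipsoid_search(sep_intersection, dim=len(w), R=R)

def rerm_realizable(S, sep_U, dim, R=10.0):
    """Outer Ellipsoid run over Soln_S^U (proof of Lemma 3.8): find w with
    y<w, z> > 0 for all (x, y) in S and all z in U(x). Assumes realizability."""
    def sep_soln(w):
        for (x, y) in S:
            z = certify(w, x, y, sep_U)
            if z is not None:             # w errs somewhere in U(x);
                return -y * z             # every valid w' has <-y z, w'> < 0 <= <-y z, w>
        return None                       # w is robustly correct on the whole sample
    return ellipsoid_search(sep_soln, dim=dim, R=R)
```

With the $\ell_2$-ball oracle sketched earlier, `rerm_realizable(S, lambda x, z: sep_l2_ball(x, z, gamma), dim=d)` plays the role of the $\mathrm{RERM}_{\mathcal{U}}$ solver of Theorem 3.5 on a robustly separable sample.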
### 3.2. An efficient approximate separation oracle for U is necessary for computing the robust loss

Our efficient algorithm for $\mathrm{RERM}_{\mathcal{U}}$ requires a separation oracle for $\mathcal{U}$. We now show that even efficiently computing the robust loss of a halfspace $(w, b_0) \in \mathbb{R}^d \times \mathbb{R}$ on an example $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$ requires an efficient approximate separation oracle for $\mathcal{U}$.

Theorem 3.10. Suppose that, given a halfspace $(w, b_0) \in \mathbb{R}^d \times \mathbb{R}$ and an example $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$, the procedure $\mathrm{EVAL}_{\mathcal{U}}((w, b_0), (x, y))$ computes the robust loss $\sup_{z \in \mathcal{U}(x)} \mathbb{1}[y(\langle w, z\rangle + b_0) \le 0]$ in $\mathrm{poly}(d, b)$ time. Then for any $\gamma > 0$, we can implement an efficient $\gamma$-approximate separation oracle $\mathrm{SEP}^{\gamma}_{\mathcal{U}}(x, z)$ in $\mathrm{poly}(d, b, \log(1/\gamma), \log R)$ time, where $\mathcal{U}(x) \subseteq B(0, R)$.

Proof. Let $\gamma > 0$. We describe how to implement a $\gamma$-approximate separation oracle for $\mathcal{U}$, denoted $\mathrm{SEP}^{\gamma}_{\mathcal{U}}(x, z)$. Fix the first argument to an arbitrary $x \in \mathcal{X}$. Upon receiving a point $z \in \mathcal{X}$ as input, the main strategy is to search for a halfspace that labels all of $\mathcal{U}(x)$ with $+1$ and labels the point $z$ with $-1$. If $z \notin \mathcal{U}(x)$, then such a halfspace exists because $\mathcal{U}(x)$ is convex, but this is impossible if $z \in \mathcal{U}(x)$. Since we are only concerned with implementing an approximate separation oracle, we settle for a slight relaxation, which is to either: assert that $z$ is $\gamma$-close to $\mathcal{U}(x)$, i.e., $z \in B(\mathcal{U}(x), \gamma)$, or return a hyperplane separating $z$ from $\mathcal{U}(x)$.

Let $K = \{(w, b_0) : \forall z' \in \mathcal{U}(x),\ \langle w, z'\rangle + b_0 > 0\}$ denote the set of halfspaces that label all of $\mathcal{U}(x)$ with $+1$. Since $\mathcal{U}(x)$ is nonempty, it follows by definition that $K$ is nonempty. To evaluate membership in $K$, given a query $(w_q, b_q)$, we just make a call to $\mathrm{EVAL}_{\mathcal{U}}((w_q, b_q), (x, +1))$ and set $\mathrm{MEM}_K(w_q, b_q) = 1 - \mathrm{EVAL}_{\mathcal{U}}((w_q, b_q), (x, +1))$. This can be computed in $\mathrm{poly}(d, b)$ time.

Next, for any $\eta \in (0, 0.5)$, we can get an $\eta$-approximate separation oracle for $K$, denoted $\mathrm{SEP}^{\eta}_K$ (see Definition 3.3), using $O(d b \log(d/\eta))$ queries to the membership oracle $\mathrm{MEM}_K$ described above (Lee et al., 2018). When queried with a halfspace $\bar{w} = (w, b_0)$, $\mathrm{SEP}^{\eta}_K$ either asserts that $\bar{w} \in K^{+\eta}$, or returns a separating hyperplane $c$ such that $\langle c, \bar{w}'\rangle \le \langle c, \bar{w}\rangle + \eta$ for all halfspaces $\bar{w}' \in K^{-\eta}$. Observe that by definition $K^{-\eta} \subseteq K \subseteq K^{+\eta}$. Furthermore, for any $\bar{w} \in K^{+\eta}$, by definition there is $\bar{w}^* \in K$ such that $\|\bar{w} - \bar{w}^*\|_2 \le \eta$. Since, for each $z' \in \mathcal{U}(x)$, by definition of $K$ we have $\langle \bar{w}^*, (z', 1)\rangle = \langle w^*, z'\rangle + b_0^* > 0$, it follows by the Cauchy–Schwarz inequality (using $\|(z', 1)\|_2 \le 2R$ for $R \ge 1$) that

$$\forall \bar{w} \in K^{+\eta},\ \forall z' \in \mathcal{U}(x):\quad \langle \bar{w}, (z', 1)\rangle = \langle \bar{w} - \bar{w}^*, (z', 1)\rangle + \langle \bar{w}^*, (z', 1)\rangle > -2\eta R. \qquad (4)$$

Let $\mathrm{SEP}^{\gamma/4R}_K$ be a $\frac{\gamma}{4R}$-approximate separation oracle for $K$. Observe that if the distance between $z$ and $\mathcal{U}(x)$ is greater than $\gamma$, then there is $(w, b_0)$ such that $\langle w, z\rangle + b_0 \le -\gamma/2$ and $\langle w, z'\rangle + b_0 > 0$ for all $z' \in \mathcal{U}(x)$. By definition of $K$, this implies that $K \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is not empty, which in turn implies that the intersection $K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is nonempty. We also have the contrapositive: if the intersection $K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is empty, then we know that $z \in B(\mathcal{U}(x), \gamma)$.

To conclude the proof, we run the Ellipsoid method with the approximate separation oracle $\mathrm{SEP}^{\gamma/4R}_K$ to search over the restricted space $\{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$. Restricting the space is easy, because whenever a candidate violates the constraint we can use the query point itself, $(z, 1)$, as a separating hyperplane. Either the Ellipsoid method finds $(w, b_0) \in K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$, in which case, by Equation (4) with $\eta = \gamma/4R$, $(w, b_0)$ has the property that $\langle w, z\rangle + b_0 \le -\gamma/2$ while $\langle w, z'\rangle + b_0 > -\gamma/2$ for all $z' \in \mathcal{U}(x)$; the hyperplane $\{v : \langle w, v\rangle + b_0 = -\gamma/2\}$ thus separates $z$ from $\mathcal{U}(x)$, and we return it.
If the Ellipsoid method terminates without finding any such $(w, b_0)$, this implies that the intersection $K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is empty, and therefore, by the contrapositive above, we assert that $z \in B(\mathcal{U}(x), \gamma)$.

## 4. Random Classification Noise

In this section, we relax the realizability assumption to random classification noise (Angluin & Laird, 1987). We show that for any adversary $\mathcal{U}$ that represents perturbations of bounded norm (i.e., $\mathcal{U}(x) = x + B$, where $B = \{\delta \in \mathbb{R}^d : \|\delta\|_p \le \gamma\}$, $p \in [1, \infty]$), the class of halfspaces $\mathcal{H}$ is efficiently robustly PAC learnable with respect to $\mathcal{U}$ in the random classification noise model.

Theorem 4.1. Let $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$ be an adversary such that $\mathcal{U}(x) = x + B$, where $B = \{\delta \in \mathbb{R}^d : \|\delta\|_p \le \gamma\}$ and $p \in [1, \infty]$. Then $\mathcal{H}$ is robustly PAC learnable w.r.t. $\mathcal{U}$ under random classification noise in time $\mathrm{poly}(d, 1/\epsilon, 1/\gamma, 1/(1-2\eta), \log(1/\delta))$.

The proof of Theorem 4.1 relies on the following key lemma. We show that the structure of the perturbation set $B$ allows us to relate the robust loss of a halfspace $w \in \mathbb{R}^d$ to the $\gamma$-margin loss of $w$. Before we state the lemma, recall that the dual norm of $w$, denoted $\|w\|_*$, is defined as $\sup\{\langle u, w\rangle : \|u\| \le 1\}$; for $\|\cdot\| = \ell_p$ this is the $\ell_q$ norm with $1/p + 1/q = 1$.

Lemma 4.2. For any $w, x \in \mathbb{R}^d$ and any $y \in \mathcal{Y}$,

$$\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = \mathbb{1}\left\{ y\left\langle \frac{w}{\|w\|_*}, x\right\rangle \le \gamma \right\}.$$

Proof. First observe that

$$\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = \sup_{\delta \in B} \mathbb{1}\{y\langle w, x + \delta\rangle \le 0\} = \mathbb{1}\left\{\inf_{\delta \in B} y\langle w, x + \delta\rangle \le 0\right\}.$$

This holds because when $\inf_{\delta \in B} y\langle w, x + \delta\rangle > 0$, by definition $\forall \delta \in B,\ y\langle w, x + \delta\rangle > 0$, which implies that $\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = 0$. For the other direction, when $\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = 1$, by definition $\exists \delta \in B$ such that $y\langle w, x + \delta\rangle \le 0$, which implies that $\inf_{\delta \in B} y\langle w, x + \delta\rangle \le 0$. To conclude the proof, by definition of the set $B$ and the dual norm $\|\cdot\|_*$ (and the origin-symmetry of $B$), we have

$$\inf_{\delta \in B} y\langle w, x + \delta\rangle = y\langle w, x\rangle - \sup_{\delta \in B}\langle y w, \delta\rangle = y\langle w, x\rangle - \|w\|_*\,\gamma.$$

Lemma 4.2 implies that for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, to solve the $\gamma$-robust learning problem

$$\underset{w \in \mathbb{R}^d}{\operatorname{argmin}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\}\right], \qquad (5)$$

it suffices to solve the $\gamma$-margin learning problem

$$\underset{\|w\|_* = 1}{\operatorname{argmin}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\{y\langle w, x\rangle \le \gamma\}\right]. \qquad (6)$$

We will solve the $\gamma$-margin learning problem in Equation (6) in the random classification noise setting using an appropriately chosen convex surrogate loss. Our convex surrogate loss and its analysis build on a convex surrogate that appears in the appendix of (Diakonikolas et al., 2019a) for learning large $\ell_2$-margin halfspaces under random classification noise w.r.t. the 0-1 loss. We note that the idea of using a convex surrogate to (non-robustly) learn large-margin halfspaces in the presence of random classification noise is implicit in a number of prior works, starting with (Bylander, 1994). Our robust setting is more challenging for the following reasons. First, we are not interested in only ensuring small 0-1 loss, but rather ensuring small $\gamma$-margin loss. Second, we want to be able to handle all $\ell_p$ norms, as opposed to just the $\ell_2$ norm. As a result, our analysis is somewhat delicate. Define the piecewise-linear convex surrogate

$$\varphi(s) = \begin{cases} \lambda\left(1 - \dfrac{s}{\gamma}\right), & s > \gamma, \\[4pt] (1-\lambda)\left(1 - \dfrac{s}{\gamma}\right), & s \le \gamma. \end{cases}$$

We will show that solving the following convex optimization problem,

$$\underset{\|w\|_q \le 1}{\operatorname{argmin}}\ G^{\gamma}_{\lambda}(w) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\varphi(y\langle w, x\rangle)\right], \qquad (7)$$

where $\lambda = \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$, suffices to solve the $\gamma$-margin learning problem in Equation (6). Intuitively, the idea here is that for $\lambda = 0$ the $\varphi$ objective is exactly a scaled hinge loss, which gives a learning guarantee w.r.t. the $\gamma$-margin loss when there is no noise ($\eta = 0$). When the noise rate $\eta > 0$, we slightly adjust the slopes, so that even a correct prediction incurs some loss.
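For concreteness, here is a minimal sketch of the surrogate $\varphi$ and of an empirical version of the objective $G^{\gamma}_{\lambda}$ in Equation (7); the function names are ours, and $\lambda$ is set by the formula above.

```python
import numpy as np

def phi(s, gamma, lam):
    """Piecewise-linear convex surrogate from Equation (7).
    For lam = 0 it is exactly the scaled hinge loss max(0, 1 - s/gamma)."""
    return np.where(s > gamma,
                    lam * (1.0 - s / gamma),
                    (1.0 - lam) * (1.0 - s / gamma))

def lambda_choice(eps, gamma, eta):
    """The value of lambda used in the analysis (proof of Theorem 4.5)."""
    return (eps * gamma / 2.0 + eta) / (1.0 + eps * gamma)

def surrogate_objective(w, X, Y, gamma, lam):
    """Empirical counterpart of G_lambda^gamma(w) = E[phi(y <w, x>)]."""
    return float(np.mean(phi(Y * (X @ w), gamma, lam)))
```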
The choice of the slope is based on $\lambda$, which will depend on the noise rate $\eta$ and the $\epsilon$-suboptimality required for Equation (6). We can solve Equation (7) through samples with a standard first-order method, Stochastic Mirror Descent, when the dual norm is an $\ell_q$-norm. We state the following properties of Mirror Descent that we will require:

Lemma 4.3 (see, e.g., Theorem 6.1 in (Bubeck et al., 2015)). Let $G(w) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(w, (x, y))]$ be a convex function that is $L$-Lipschitz w.r.t. $\|w\|_q$, where $q > 1$. Then, using the potential function $\psi(w) = \frac{1}{2}\|w\|^2_q$, a suitable step size $\eta$ (here denoting the step size, not the noise rate), and the sequence of iterates $w_k$ computed by the update

$$w_{k+1} = \Pi^{\psi}_{B_q}\!\left((\nabla\psi)^{-1}\big(\nabla\psi(w_k) - \eta\, g_k\big)\right),$$

Stochastic Mirror Descent with $O\!\left(\frac{L^2}{(q-1)\epsilon^2}\right)$ stochastic gradients $g$ of $G$ will find an $\epsilon$-suboptimal point $\hat{w}$ such that $\|\hat{w}\|_q \le 1$ and $G(\hat{w}) \le \inf_{\|w\|_q \le 1} G(w) + \epsilon$.

Remark 4.4. When $q = 1$, we will use the entropy potential function $\psi(w) = \sum_{i=1}^d w_i \log w_i$. In this case, Stochastic Mirror Descent will require $O\!\left(\frac{L^2 \log d}{\epsilon^2}\right)$ stochastic gradients.

We are now ready to state our main result for this section:

Theorem 4.5. Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_p \le 1\}$. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ such that there exists a halfspace $w^* \in \mathbb{R}^d$ with $\Pr_{x\sim\mathcal{D}_x}[|\langle w^*, x\rangle| > \gamma] = 1$, and $y$ is generated by $h_{w^*}(x) := \mathrm{sign}(\langle w^*, x\rangle)$ corrupted by RCN with noise rate $\eta < 1/2$. An application of Stochastic Mirror Descent on $G^{\gamma}_{\lambda}(w)$ returns, with high probability, a halfspace $w$ with $\|w\|_q \le 1$ and $\gamma/2$-robust misclassification error $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}] \le \eta + \epsilon$, in $\mathrm{poly}(d, 1/\epsilon, 1/\gamma, 1/(1-2\eta))$ time.

With Theorem 4.5, the proof of Theorem 4.1 immediately follows.

Proof of Theorem 4.1. This follows from Lemma 4.2 and Theorem 4.5.

Remark 4.6. In Theorem 4.5, we get a $\gamma/2$-robustness guarantee assuming a $\gamma$-robust halfspace $w^*$ that is corrupted with random classification noise. This can be strengthened to a guarantee of $(1-c)\gamma$-robustness for any constant $c > 0$.

The rest of this section is devoted to the proof of Theorem 4.5. The high-level strategy is to show that an $\epsilon'$-suboptimal solution to Equation (7) gives us an $\epsilon$-suboptimal solution to Equation (6) (for a suitably chosen $\epsilon'$). In Lemma 4.8, we bound from above the $\gamma/2$-margin loss in terms of our convex surrogate objective $G^{\gamma}_{\lambda}$, and in Lemma 4.9 we show that the minimum of our convex surrogate $G^{\gamma}_{\lambda}$ is sufficiently small. These are the two key lemmas that we use to piece everything together.

For any $w, x \in \mathbb{R}^d$, consider the contribution to the objective $G^{\gamma}_{\lambda}$ of the point $x$, denoted $G^{\gamma}_{\lambda}(w, x)$ and defined as

$$G^{\gamma}_{\lambda}(w, x) = \mathbb{E}_{y\sim\mathcal{D}_y(x)}\left[\varphi(y\langle w, x\rangle)\right] = \eta\,\varphi(-z) + (1-\eta)\,\varphi(z), \quad\text{where } z = h_{w^*}(x)\langle w, x\rangle.$$

In the following lemma, we provide a decomposition of $G^{\gamma}_{\lambda}(w, x)$ that will help us in proving Lemmas 4.8 and 4.9.

Lemma 4.7. For any $w, x \in \mathbb{R}^d$, let $z = h_{w^*}(x)\langle w, x\rangle$. Then we have that:

$$G^{\gamma}_{\lambda}(w, x) = (\eta - \lambda)\frac{z}{\gamma} + \lambda + \eta - 2\lambda\eta + \mathbb{1}\{-\gamma \le z \le \gamma\}\,(1-\eta)(1-2\lambda)\left(1 - \frac{z}{\gamma}\right) + \mathbb{1}\{z < -\gamma\}\,(1-2\lambda)\left(1 - 2\eta - \frac{z}{\gamma}\right).$$

Proof. Based on the definition of the surrogate loss, it suffices to consider three cases.

Case $z > \gamma$:
$$l_1(z) \stackrel{\text{def}}{=} \eta\varphi(-z) + (1-\eta)\varphi(z) = \eta(1-\lambda)\left(1 + \frac{z}{\gamma}\right) + (1-\eta)\lambda\left(1 - \frac{z}{\gamma}\right) = \big(\eta(1-\lambda) - (1-\eta)\lambda\big)\frac{z}{\gamma} + \lambda + \eta - 2\lambda\eta = (\eta - \lambda)\frac{z}{\gamma} + \lambda + \eta - 2\lambda\eta.$$

Case $-\gamma \le z \le \gamma$:
$$l_2(z) \stackrel{\text{def}}{=} \eta\varphi(-z) + (1-\eta)\varphi(z) = \eta(1-\lambda)\left(1 + \frac{z}{\gamma}\right) + (1-\eta)(1-\lambda)\left(1 - \frac{z}{\gamma}\right) = (1-\lambda) - (1-2\eta)(1-\lambda)\frac{z}{\gamma}.$$

Case $z < -\gamma$:
$$l_3(z) \stackrel{\text{def}}{=} \eta\lambda\left(1 + \frac{z}{\gamma}\right) + (1-\eta)(1-\lambda)\left(1 - \frac{z}{\gamma}\right) = \big(\eta\lambda - (1-\eta)(1-\lambda)\big)\frac{z}{\gamma} + \eta\lambda + (1-\eta)(1-\lambda) = (\eta + \lambda - 1)\frac{z}{\gamma} + 1 - \eta - \lambda + 2\eta\lambda.$$

Considering the three cases above, we can write

$$G^{\gamma}_{\lambda}(w, x) = l_1(z) + \mathbb{1}\{-\gamma \le z \le \gamma\}\,(l_2(z) - l_1(z)) + \mathbb{1}\{z < -\gamma\}\,(l_3(z) - l_1(z)).$$
Then, we calculate $l_2(z) - l_1(z)$ and $l_3(z) - l_1(z)$ as follows:

$$l_2(z) - l_1(z) = -\big((1-2\eta)(1-\lambda) + \eta - \lambda\big)\frac{z}{\gamma} + (1-\eta)(1-2\lambda) = (1-\eta)(1-2\lambda)\left(1 - \frac{z}{\gamma}\right),$$
$$l_3(z) - l_1(z) = (2\lambda - 1)\frac{z}{\gamma} + (1-2\eta)(1-2\lambda) = (1-2\lambda)\left(1 - 2\eta - \frac{z}{\gamma}\right).$$

Using the above, we have that

$$G^{\gamma}_{\lambda}(w, x) = l_1(z) + \mathbb{1}\{-\gamma \le z \le \gamma\}\,(1-\eta)(1-2\lambda)\left(1 - \frac{z}{\gamma}\right) + \mathbb{1}\{z < -\gamma\}\,(1-2\lambda)\left(1 - 2\eta - \frac{z}{\gamma}\right).$$

The following lemma allows us to bound from below our convex surrogate $\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)]$ in terms of the $\gamma/2$-margin loss of $w$.

Lemma 4.8. Assume that $\lambda$ is chosen such that $\lambda < 1/2$ and $\eta < \lambda$. Then, for any $w \in \mathbb{R}^d$ with $\|w\|_q \le 1$,

$$\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \ge \frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] + \lambda + \eta - 2\lambda\eta.$$

Proof. By Lemma 4.7 and linearity of expectation, we have that

$$\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] = (\eta - \lambda)\,\mathbb{E}_{x}\!\left[\frac{z}{\gamma}\right] + \lambda + \eta - 2\lambda\eta + (1-\eta)(1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{-\gamma \le z \le \gamma\}\left(1 - \frac{z}{\gamma}\right)\right] + (1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{z < -\gamma\}\left(1 - 2\eta - \frac{z}{\gamma}\right)\right].$$

First, observe that for any $x$, $z = h_{w^*}(x)\langle w, x\rangle \le \|w\|_q\|x\|_p \le 1$, and since $\eta < \lambda$, we have $(\eta - \lambda)\,\mathbb{E}_x[z/\gamma] \ge (\eta - \lambda)/\gamma$. Then we observe that whenever $z < -\gamma$, $1 - 2\eta - \frac{z}{\gamma} > 2(1-\eta) \ge (1-\eta)/2$, and $\lambda < 1/2$; thus we can bound the third term from below:

$$(1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{z < -\gamma\}\left(1 - 2\eta - \frac{z}{\gamma}\right)\right] \ge \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{z < -\gamma\}\right].$$

Next, we note that whenever $-\gamma \le z \le \gamma$, $1 - \frac{z}{\gamma} \ge 0$. This implies that instead of considering $\mathbb{1}\{-\gamma \le z \le \gamma\}$, we can restrict to the subset $\mathbb{1}\{-\gamma \le z \le \frac{\gamma}{2}\}$, on which $1 - \frac{z}{\gamma} \ge 1/2$. Thus, we can bound the second term from below as follows:

$$(1-\eta)(1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{-\gamma \le z \le \gamma\}\left(1 - \frac{z}{\gamma}\right)\right] \ge \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{-\gamma \le z \le \frac{\gamma}{2}\right\}\right].$$

Combining the above, we obtain

$$\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \ge \frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{-\gamma \le z \le \frac{\gamma}{2}\right\} + \mathbb{1}\{z < -\gamma\}\right] + \lambda + \eta - 2\lambda\eta = \frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] + \lambda + \eta - 2\lambda\eta,$$

as desired.

We now show that the minimum of the convex surrogate $G^{\gamma}_{\lambda}$ is sufficiently small, which will be useful later in choosing the suboptimality parameter $\epsilon'$.

Lemma 4.9. Assume that $\lambda$ is chosen such that $\lambda < 1/2$ and $\eta < \lambda$. Then we have that

$$\inf_{w \in \mathbb{R}^d}\ \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \le 2\eta(1-\lambda).$$

Proof. By definition, we have that $\inf_{w \in \mathbb{R}^d} \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \le \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w^*, x)]$. By assumption, with probability 1 over $x \sim \mathcal{D}_x$, we have $z^* = h_{w^*}(x)\langle w^*, x\rangle > \gamma$. Thus, by Lemma 4.7 (only the first case is active),

$$\mathbb{E}_{x}[G^{\gamma}_{\lambda}(w^*, x)] = (\eta - \lambda)\,\mathbb{E}_{x}\!\left[\frac{z^*}{\gamma}\right] + \lambda + \eta - 2\lambda\eta \le (\eta - \lambda) + \lambda + \eta - 2\lambda\eta = 2\eta(1-\lambda),$$

where the last inequality follows from the fact that $\eta < \lambda$ and the fact that $\mathbb{E}_{x}[z^*/\gamma] > 1$.

Using the above lemmas, we are now able to bound from above the $\gamma/2$-margin loss of a halfspace $w$ that is $\epsilon'$-suboptimal for our convex optimization problem (see Equation (7)).

Lemma 4.10. For any $\epsilon' \in (0, 1)$ and any $w \in \mathbb{R}^d$ with $\|w\|_q \le 1$ such that $\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \le \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w^*, x)] + \epsilon'$, the $\gamma/2$-misclassification error of $w$ satisfies

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}\right] \le \eta + \frac{2}{1-2\lambda}\left(\epsilon' + (\lambda - \eta)\left(\frac{1}{\gamma} - 1\right)\right).$$

Proof. By Lemma 4.8 and Lemma 4.9, we have

$$\frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] + \lambda + \eta - 2\lambda\eta \le 2\eta(1-\lambda) + \epsilon'.$$

This implies

$$\mathbb{E}_{x}\!\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] \le \frac{2}{(1-2\lambda)(1-\eta)}\left(\epsilon' + (\lambda - \eta)\left(\frac{1}{\gamma} - 1\right)\right).$$

Since $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}] \le \eta + (1-\eta)\,\mathbb{E}_{x}[\mathbb{1}\{z \le \gamma/2\}]$, we get the desired result.

We are now ready to prove Theorem 4.5.

Proof of Theorem 4.5. Based on Lemma 4.10, we will choose $\lambda$ and $\epsilon'$ such that

$$\frac{2}{1-2\lambda}\left(\epsilon' + (\lambda - \eta)\left(\frac{1}{\gamma} - 1\right)\right) \le \epsilon.$$

By setting $\epsilon' = \lambda - \eta$, this condition reduces to $2(\lambda - \eta) \le (1-2\lambda)\gamma\epsilon$. This implies that we need $\lambda \le \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$, and we will choose $\lambda = \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$. Note that our analysis relied on having $\lambda < 1/2$ and $\eta < \lambda$; combined, these conditions require $\eta < \lambda < 1/2$. Our choice of $\lambda = \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$ satisfies them, since

$$\lambda - \eta = \frac{\epsilon\gamma/2 + \eta - \eta(1 + \epsilon\gamma)}{1 + \epsilon\gamma} = \frac{\epsilon\gamma(1/2 - \eta)}{1 + \epsilon\gamma} > 0 \quad\text{and}\quad \lambda - \frac{1}{2} = \frac{\epsilon\gamma/2 + \eta - \frac{1}{2}(1 + \epsilon\gamma)}{1 + \epsilon\gamma} = \frac{\eta - 1/2}{1 + \epsilon\gamma} < 0.$$

By our choice of $\lambda$, we have that $\epsilon' = \lambda - \eta = \frac{\epsilon\gamma(1/2 - \eta)}{1 + \epsilon\gamma} = \frac{\epsilon\gamma(1 - 2\eta)}{2(1 + \epsilon\gamma)}$. By the guarantees of Stochastic Mirror Descent (see Lemma 4.3), our theorem follows with $O\!\left(\frac{1}{\epsilon^2\gamma^2(1-2\eta)^2(q-1)}\right)$ samples for $q > 1$ and $O\!\left(\frac{\log d}{\epsilon^2\gamma^2(1-2\eta)^2}\right)$ samples for $q = 1$.
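Putting the pieces of this section together, a minimal end-to-end sketch of the resulting learner follows. For readability it instantiates the special case $p = q = 2$, where the mirror map of Lemma 4.3 reduces to plain projected stochastic subgradient descent on the Euclidean unit ball; the step size and iteration count are illustrative simplifications rather than the tuned constants of Lemma 4.3, and all function names are ours.

```python
import numpy as np

def phi_subgrad(s, gamma, lam):
    """A subgradient of the piecewise-linear surrogate phi at margin value s."""
    return np.where(s > gamma, -lam / gamma, -(1.0 - lam) / gamma)

def robust_halfspace_rcn_l2(X, Y, gamma, eps, eta, n_iters=None, seed=0):
    """Sketch of the Section 4 learner for l_2 perturbations of radius gamma under
    random classification noise of rate eta: minimize the empirical surrogate
    G_lambda^gamma over {||w||_2 <= 1} by projected stochastic subgradient descent
    (the q = 2 instance of Lemma 4.3) and return the averaged iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lam = (eps * gamma / 2.0 + eta) / (1.0 + eps * gamma)   # lambda from the proof of Theorem 4.5
    eps_prime = max(lam - eta, 1e-8)                        # target suboptimality for Equation (7)
    if n_iters is None:
        n_iters = int(np.ceil(1.0 / (gamma ** 2 * eps_prime ** 2)))
    step = gamma / np.sqrt(n_iters)                         # O(1/sqrt(T)) step size (illustrative)
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)
        s = Y[i] * (X[i] @ w)
        g = phi_subgrad(s, gamma, lam) * Y[i] * X[i]        # stochastic subgradient of G at w
        w = w - step * g
        norm = np.linalg.norm(w)
        if norm > 1.0:                                      # project back onto the unit l_2 ball
            w = w / norm
        w_sum += w
    return w_sum / n_iters

# By Lemma 4.2, for the returned w (with ||w||_2 <= 1) the robust error at radius
# gamma/2 under l_2 perturbations is at most its gamma/2-margin error, which
# Theorem 4.5 bounds by eta + eps.
```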
## 5. Conclusion

In this paper, we provide necessary and sufficient conditions on perturbation sets $\mathcal{U}$ under which we can efficiently solve the robust empirical risk minimization (RERM) problem. We give a polynomial-time algorithm that solves RERM given access to a polynomial-time separation oracle for $\mathcal{U}$. In addition, we show that an efficient approximate separation oracle for $\mathcal{U}$ is necessary for even computing the robust loss of a halfspace. As a corollary, we show that halfspaces are efficiently robustly PAC learnable for a broad range of perturbation sets. Relaxing the realizability assumption, we show that under random classification noise, we can efficiently robustly PAC learn halfspaces with respect to any $\ell_p$ perturbation. An interesting direction for future work is to understand the computational complexity of robustly PAC learning halfspaces under stronger noise models, including Massart noise and agnostic noise.

## Acknowledgments

We thank Sepideh Mahabadi for many insightful and helpful discussions, and Brian Bullins for helpful discussions on Mirror Descent. We also thank the anonymous reviewers for their helpful feedback. This work was initiated when the authors were visiting the Simons Institute for the Theory of Computing as part of the Summer 2019 program on Foundations of Deep Learning. Work by N.S. and O.M. was partially supported by NSF award IIS-1546500 and by DARPA cooperative agreement HR00112020003 (this paper does not reflect the position or the policy of the Government, and no endorsement should be inferred). I.D. was supported by NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant. S.G. was supported by the JP Morgan AI Research PhD Fellowship.

## Appendix: Another Approach to Large Margin Learning under Random Classification Noise

We remark that $\gamma$-margin learning of halfspaces has been studied in earlier work, and we have algorithms such as the Margin Perceptron (Balcan et al., 2008) and SVM. The Margin Perceptron (for the $\ell_2$ margin) and other $\ell_p$-margin algorithms have also been implemented in the SQ model (Feldman et al., 2017). But no explicit connection has been made to adversarial robustness. We present here a simple approach to learning $\gamma$-margin halfspaces under random classification noise using only a convex surrogate loss and Stochastic Mirror Descent. The construction of the convex surrogate is based on learning generalized linear models with a suitable link function $u : \mathbb{R} \to \mathbb{R}$. To the best of our knowledge, the result of this section is not explicit in prior work.

Theorem 5.1. Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_p \le 1\}$. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ such that there exists a halfspace $w^* \in \mathbb{R}^d$ with $\|w^*\|_q = 1$ and $\Pr_{x\sim\mathcal{D}_x}[|\langle w^*, x\rangle| > \gamma] = 1$, and $y$ is generated by $h_{w^*}(x) := \mathrm{sign}(\langle w^*, x\rangle)$ corrupted with random classification noise of rate $\eta < 1/2$. Then, running Stochastic Mirror Descent on the convex optimization problem

$$\min_{w \in \mathbb{R}^d,\ \|w\|_q \le 1}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w, (x, y))\right],$$

where the convex loss function $\ell$ is defined in Equation (8), returns with high probability a halfspace $w$ with $\gamma/2$-robust misclassification error $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}] \le \eta + \epsilon$.

We prove Theorem 5.1 in the remainder of this section. We will connect our problem to that of solving generalized linear models. We define the link function as follows:

$$u(s) = \begin{cases} \eta, & s < -\gamma, \\[2pt] \dfrac{1}{2} + \dfrac{(1-2\eta)\,s}{2\gamma}, & -\gamma \le s \le \gamma, \\[4pt] 1 - \eta, & s > \gamma. \end{cases}$$

Observe that $u$ is monotone and $\frac{1-2\eta}{2\gamma}$-Lipschitz.
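A minimal sketch of the link function and of the stochastic gradient used in the mirror-descent step follows. The gradient uses the standard generalized-linear-model identity $\nabla_w\,\ell(w, (x, y)) = (u(\langle w, x\rangle) - y)\,x$ for the surrogate of Equation (8); labels are assumed to have been mapped to $\{0, 1\}$, and the function names are ours.

```python
import numpy as np

def link_u(s, gamma, eta):
    """Link function u: a linear ramp of slope (1 - 2*eta)/(2*gamma) on [-gamma, gamma],
    clipped to the value eta below -gamma and to 1 - eta above gamma."""
    ramp = 0.5 + (1.0 - 2.0 * eta) * s / (2.0 * gamma)
    return np.clip(ramp, eta, 1.0 - eta)

def glm_surrogate_grad(w, x, y01, gamma, eta):
    """Stochastic gradient of ell(w, (x, y)) = integral_0^{<w,x>} (u(s) - y) ds,
    namely (u(<w, x>) - y) * x, for a label y01 in {0, 1}."""
    return (link_u(np.dot(w, x), gamma, eta) - y01) * x
```

Plugging this gradient into a projected stochastic-gradient loop like the one sketched at the end of Section 4 (with labels mapped to $\{0, 1\}$) gives the variant of the learner analyzed in this appendix.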
First, we relate our loss of interest, the $\gamma/2$-margin loss, to the squared loss defined in terms of the link function $u$.

Lemma 5.2. For any $w \in \mathbb{R}^d$,

$$\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \le \frac{16}{(1-2\eta)^2}\,\mathbb{E}_{x\sim\mathcal{D}_x}\left[\big(u(\langle w, x\rangle) - u(\langle w^*, x\rangle)\big)^2\right].$$

Proof. Let $E_+ = \{h_{w^*}(x) = +1\}$ and $E_- = \{h_{w^*}(x) = -1\}$, and let $U(x) = (u(\langle w, x\rangle) - u(\langle w^*, x\rangle))^2$. By the law of total expectation and the definition of the link function $u$ (on $E_+$ we have $u(\langle w^*, x\rangle) = 1-\eta$, and on $E_-$ we have $u(\langle w^*, x\rangle) = \eta$), we have

$$\mathbb{E}_{x\sim\mathcal{D}_x}[U(x)] = \mathbb{E}_{x}\left[U(x)\,\mathbb{1}_{E_+}\right] + \mathbb{E}_{x}\left[U(x)\,\mathbb{1}_{E_-}\right] = \mathbb{E}_{x}\left[(u(\langle w, x\rangle) - (1-\eta))^2\,\mathbb{1}_{E_+}\right] + \mathbb{E}_{x}\left[(u(\langle w, x\rangle) - \eta)^2\,\mathbb{1}_{E_-}\right].$$

We lower bound both terms: for any thresholds $a \le 1-\eta$ and $b \ge \eta$,

$$\mathbb{E}_{x}\left[(u(\langle w, x\rangle) - (1-\eta))^2\,\mathbb{1}_{E_+}\right] \ge \mathbb{E}_{x}\left[(a - (1-\eta))^2\,\mathbb{1}_{E_+}\,\mathbb{1}\{u(\langle w, x\rangle) \le a\}\right],$$
$$\mathbb{E}_{x}\left[(u(\langle w, x\rangle) - \eta)^2\,\mathbb{1}_{E_-}\right] \ge \mathbb{E}_{x}\left[(b - \eta)^2\,\mathbb{1}_{E_-}\,\mathbb{1}\{u(\langle w, x\rangle) \ge b\}\right].$$

Then, observe that the event $\{\langle w, x\rangle \le \gamma/2\}$ implies the event $\{u(\langle w, x\rangle) \le \frac{3-2\eta}{4}\}$, and similarly the event $\{\langle w, x\rangle \ge -\gamma/2\}$ implies the event $\{u(\langle w, x\rangle) \ge \frac{2\eta+1}{4}\}$. This means that, taking $a = \frac{3-2\eta}{4}$ and $b = \frac{2\eta+1}{4}$,

$$\left(\frac{3-2\eta}{4} - (1-\eta)\right)^2 \mathbb{E}_{x}\left[\mathbb{1}_{E_+}\,\mathbb{1}\left\{u(\langle w, x\rangle) \le \frac{3-2\eta}{4}\right\}\right] \ge \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}_{E_+}\,\mathbb{1}\{\langle w, x\rangle \le \gamma/2\}\right],$$
$$\left(\frac{2\eta+1}{4} - \eta\right)^2 \mathbb{E}_{x}\left[\mathbb{1}_{E_-}\,\mathbb{1}\left\{u(\langle w, x\rangle) \ge \frac{2\eta+1}{4}\right\}\right] \ge \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}_{E_-}\,\mathbb{1}\{\langle w, x\rangle \ge -\gamma/2\}\right].$$

We combine these observations to conclude the proof:

$$\mathbb{E}_{x}\left[\big(u(\langle w, x\rangle) - u(\langle w^*, x\rangle)\big)^2\right] \ge \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}_{E_+}\mathbb{1}\{\langle w, x\rangle \le \gamma/2\} + \mathbb{1}_{E_-}\mathbb{1}\{\langle w, x\rangle \ge -\gamma/2\}\right] = \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right].$$

Note, however, that the squared loss is non-convex in $w$, and so it may not be easy to optimize directly. Luckily, we can get a tight upper bound with the following surrogate loss (see (Kanade, 2018)):

$$\ell(w, (x, y)) = \int_0^{\langle w, x\rangle} (u(s) - y)\,ds. \qquad (8)$$

Note that $\ell(w, (x, y))$ is convex w.r.t. $w$, since the Hessian $\nabla^2_w\,\ell(w, (x, y)) = u'(\langle w, x\rangle)\,x x^{\top}$ is positive semidefinite. Assuming our labels $y$ have been transformed from $\{\pm 1\}$ to $\{0, 1\}$, observe that $\mathbb{E}[y \mid x] = u(\langle w^*, x\rangle)$. We now have the following guarantee (see, e.g., (Cohen, 2014; Kanade, 2018)) for $U(x) = (u(\langle w, x\rangle) - u(\langle w^*, x\rangle))^2$:

$$\mathbb{E}_{x\sim\mathcal{D}_x}[U(x)] \le \frac{1-2\eta}{\gamma}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w, (x, y)) - \ell(w^*, (x, y))\right]. \qquad (9)$$

Proof of Theorem 5.1. Combining Lemma 5.2 and Equation (9), we get the following guarantee for any $w \in \mathbb{R}^d$:

$$\frac{(1-2\eta)\,\gamma}{16}\,\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \le \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w, (x, y)) - \ell(w^*, (x, y))\right].$$

Thus, running Stochastic Mirror Descent with $\epsilon' = \epsilon\gamma(1-2\eta)/16$ and $O(1/\epsilon'^2)$ samples (with labels transformed to $\{0, 1\}$) returns, with high probability, a halfspace $w$ such that

$$\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \le \epsilon. \qquad (10)$$

Then, to conclude the proof, observe that

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}\right] = \eta\,\mathbb{E}_{x}\left[\mathbb{1}\{-h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] + (1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right]$$
$$= \eta\left(1 - \mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle < -\gamma/2\}\right]\right) + (1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \stackrel{(i)}{\le} \eta + (1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \stackrel{(ii)}{\le} \eta + \epsilon,$$

where (i) follows from the fact that $\mathbb{E}_{x}[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle < -\gamma/2\}] \ge 0$, and (ii) follows from Equation (10).

## References

Angluin, D. and Laird, P. D. Learning from noisy examples. Machine Learning, 2(4):343–370, 1987.

Awasthi, P., Dutta, A., and Vijayaraghavan, A. On robustness to adversarial examples and polynomial optimization. In Advances in Neural Information Processing Systems, pp. 13737–13747, 2019.

Balcan, M., Blum, A., and Srebro, N. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.

Bubeck, S., Lee, Y. T., Price, E., and Razenshteyn, I. Adversarial examples from computational constraints. In International Conference on Machine Learning, pp. 831–840, 2019.
Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Bylander, T. Learning linear threshold functions in the presence of classification noise. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT 1994), pp. 340–347. ACM, 1994.

Cohen, A. Surrogate Loss Minimization. PhD thesis, Hebrew University of Jerusalem, 2014.

Cullina, D., Bhagoji, A. N., and Mittal, P. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems 31, pp. 230–241. Curran Associates, Inc., 2018.

Daniely, A. Complexity theoretic limitations on learning halfspaces. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 105–117, 2016.

Diakonikolas, I., O'Donnell, R., Servedio, R. A., and Wu, Y. Hardness results for agnostically learning low-degree polynomial threshold functions. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2011), pp. 1590–1606. SIAM, 2011.

Diakonikolas, I., Gouleakis, T., and Tzamos, C. Distribution-independent PAC learning of halfspaces with Massart noise. In Advances in Neural Information Processing Systems, pp. 4751–4762, 2019a.

Diakonikolas, I., Kane, D., and Manurangsi, P. Nearly tight bounds for robust proper learning of halfspaces with a margin. In Advances in Neural Information Processing Systems, pp. 10473–10484, 2019b.

Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. Exploring the landscape of spatial robustness. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1802–1811. PMLR, 2019.

Feldman, V., Gopalan, P., Khot, S., and Ponnuswami, A. K. New results for learning noisy parities and halfspaces. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 563–574. IEEE, 2006.

Feldman, V., Guzman, C., and Vempala, S. S. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2017), pp. 1265–1277. SIAM, 2017.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR 2015), 2015. URL http://arxiv.org/abs/1412.6572.

Gourdeau, P., Kanade, V., Kwiatkowska, M., and Worrell, J. On the hardness of robust classification. In Advances in Neural Information Processing Systems, pp. 7444–7453, 2019.

Guruswami, V. and Raghavendra, P. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Kanade, V. Computational learning theory notes 8: Learning real-valued functions, 2018.
Kang, D., Sun, Y., Hendrycks, D., Brown, T., and Steinhardt, J. Testing robustness against unforeseen adversaries. CoRR, abs/1908.08016, 2019. URL http://arxiv.org/abs/1908.08016.

Khim, J. and Loh, P.-L. Adversarial risk bounds for binary classification via function transformation. arXiv preprint arXiv:1810.09519, 2018.

Lee, Y. T., Sidford, A., and Vempala, S. S. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pp. 1292–1294, 2018.

Maass, W. and Turán, G. How fast can a threshold gate learn? In Proceedings of a Workshop on Computational Learning Theory and Natural Learning Systems (vol. 1): Constraints and Prospects, pp. 381–414, 1994.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 2512–2530. PMLR, 2019.

Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 5019–5031, 2018.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations (ICLR 2014), 2014. URL http://arxiv.org/abs/1312.6199.

Vapnik, V. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.

Yin, D., Ramchandran, K., and Bartlett, P. L. Rademacher complexity for adversarially robust generalization. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pp. 7085–7094. PMLR, 2019.