# Efficiently Learning Adversarially Robust Halfspaces with Noise

Omar Montasser¹  Surbhi Goel²  Ilias Diakonikolas³  Nathan Srebro¹

¹Toyota Technological Institute at Chicago  ²University of Texas at Austin  ³University of Wisconsin-Madison. Correspondence to: Omar Montasser.

We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to any $\ell_p$-perturbation.

## 1. Introduction

Learning predictors that are robust to adversarial examples remains a major challenge in machine learning. A line of work has shown that predictors learned by deep neural networks are not robust to adversarial examples (Szegedy et al., 2014; Biggio et al., 2013; Goodfellow et al., 2015). This has led to a long line of research studying different aspects of robustness to adversarial examples.

In this paper, we consider the problem of distribution-independent learning of halfspaces that are robust to adversarial examples at test time, also referred to as robust PAC learning of halfspaces. Halfspaces are binary predictors of the form $h_w(x) = \mathrm{sign}(\langle w, x\rangle)$, where $w \in \mathbb{R}^d$. In adversarially robust PAC learning, given an instance space $\mathcal{X}$ and label space $\mathcal{Y} = \{\pm 1\}$, we formalize an adversary that we would like to be robust against as a map $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$, where $\mathcal{U}(x) \subseteq \mathcal{X}$ represents the set of perturbations (adversarial examples) that can be chosen by the adversary at test time (we require that $x \in \mathcal{U}(x)$). For an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we observe $m$ i.i.d. samples $S \sim \mathcal{D}^m$, and our goal is to learn a predictor $\hat{h} : \mathcal{X} \to \mathcal{Y}$ that achieves small robust risk,

$$\mathrm{R}_{\mathcal{U}}(\hat{h}; \mathcal{D}) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\left[\sup_{z \in \mathcal{U}(x)} \mathbb{1}[\hat{h}(z) \neq y]\right]. \qquad (1)$$

The information-theoretic aspects of adversarially robust learning have been studied in recent work, see, e.g., (Schmidt et al., 2018; Cullina et al., 2018; Khim & Loh, 2018; Bubeck et al., 2019; Yin et al., 2019; Montasser et al., 2019). This includes studying what learning rules should be used for robust learning and how much training data is needed to guarantee high robust accuracy. It is now known that any hypothesis class $\mathcal{H}$ with finite VC dimension is robustly learnable, though sometimes improper learning is necessary and the sample complexity may be exponential in the VC dimension (Montasser et al., 2019). On the other hand, the computational aspects of adversarially robust PAC learning are less understood. In this paper, we take a first step towards studying this broad algorithmic question with a focus on the fundamental problem of learning adversarially robust halfspaces.

A first question to ask is whether efficient PAC learning implies efficient robust PAC learning, i.e., whether there is a general reduction that solves the adversarially robust learning problem. Recent work has provided strong evidence that this is not the case. Specifically, (Bubeck et al., 2019) showed that there exists a learning problem that can be learned efficiently non-robustly, but is computationally intractable to learn robustly (under plausible complexity-theoretic assumptions).
There is also more recent evidence suggesting that this is the case in the PAC model. (Awasthi et al., 2019) showed that it is computationally intractable to even weakly robustly learn degree-2 polynomial threshold functions (PTFs) with $\ell_\infty$ perturbations in the realizable setting, while PTFs of any constant degree are known to be efficiently PAC learnable non-robustly in the realizable setting. (Gourdeau et al., 2019) showed that there are hypothesis classes that are hard to robustly PAC learn, under the assumption that it is hard to non-robustly PAC learn them.

The aforementioned discussion suggests that when studying robust PAC learning, we need to characterize which types of perturbation sets $\mathcal{U}$ admit computationally efficient robust PAC learners, and under which noise assumptions. In the agnostic PAC setting, it is known that even weak (non-robust) learning of halfspaces is computationally intractable (Feldman et al., 2006; Guruswami & Raghavendra, 2009; Diakonikolas et al., 2011; Daniely, 2016). For $\ell_2$-perturbations, where $\mathcal{U}(x) = \{z : \|x - z\|_2 \le \gamma\}$, it was recently shown that the complexity of proper learning is exponential in $1/\gamma$ (Diakonikolas et al., 2019b). In this paper, we focus on the realizable case and the (more challenging) case of random label noise.

We can be more optimistic in the realizable setting. Halfspaces are efficiently PAC learnable non-robustly via Linear Programming (Maass & Turán, 1994), and under a margin assumption via the Perceptron algorithm (Rosenblatt, 1958). But what can we say about robustly PAC learning halfspaces? Given a perturbation set $\mathcal{U}$ and under the assumption that there is a halfspace $h_{w^*}$ that robustly separates the data, can we efficiently learn a predictor with small robust risk? Just as empirical risk minimization (ERM) is central for non-robust PAC learning, a core component of adversarially robust learning is minimizing the robust empirical risk on a dataset $S$,

$$\hat{h} \in \mathrm{RERM}_{\mathcal{U}}(S) \stackrel{\text{def}}{=} \underset{h \in \mathcal{H}}{\operatorname{argmin}}\; \sum_{i=1}^{m} \sup_{z \in \mathcal{U}(x_i)} \mathbb{1}[h(z) \neq y_i].$$

In this paper, we provide necessary and sufficient conditions on perturbation sets $\mathcal{U}$ under which the robust empirical risk minimization (RERM) problem is efficiently solvable in the realizable setting. We show that an efficient separation oracle for $\mathcal{U}$ yields an efficient solver for $\mathrm{RERM}_{\mathcal{U}}$, while an efficient approximate separation oracle for $\mathcal{U}$ is necessary for even computing the robust loss $\sup_{z \in \mathcal{U}(x)} \mathbb{1}[h_w(z) \neq y]$ of a halfspace $h_w$. In addition, we relax the realizability assumption and show that under random classification noise (Angluin & Laird, 1987), we can efficiently robustly PAC learn halfspaces with respect to any $\ell_p$ perturbation.

Main Contributions. Our main contributions can be summarized as follows:

1. In the realizable setting, the class of halfspaces is efficiently robustly PAC learnable with respect to $\mathcal{U}$, given an efficient separation oracle for $\mathcal{U}$.
2. To even compute the robust risk with respect to $\mathcal{U}$ efficiently, an efficient approximate separation oracle for $\mathcal{U}$ is necessary.
3. In the random classification noise setting, the class of halfspaces is efficiently robustly PAC learnable with respect to any $\ell_p$ perturbation.
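As a concrete illustration of the robust loss appearing inside $\mathrm{RERM}_{\mathcal{U}}$ above, consider $\ell_\infty$ perturbations of radius $\gamma$. For a halfspace, the inner supremum over the perturbation set has a closed form in terms of the dual ($\ell_1$) norm; this is a special case of Lemma 4.2 established later in the paper. The following minimal sketch is only illustrative, and the function names are ours.

```python
import numpy as np

def robust_loss_linf(w, x, y, gamma):
    """0-1 robust loss of the halfspace w on (x, y) under l_inf perturbations of
    radius gamma: sup_{||delta||_inf <= gamma} 1[y <w, x + delta> <= 0].
    The worst-case perturbation is delta = -y * gamma * sign(w), so the loss is
    1 exactly when y <w, x> - gamma * ||w||_1 <= 0 (cf. Lemma 4.2 with p = inf)."""
    return float(y * np.dot(w, x) - gamma * np.linalg.norm(w, 1) <= 0)

def robust_empirical_risk_linf(w, X, Y, gamma):
    """Average robust loss over a sample; this is the objective RERM_U minimizes."""
    return float(np.mean([robust_loss_linf(w, x, y, gamma) for x, y in zip(X, Y)]))
```

For general $\ell_p$ perturbations the same computation goes through with $\|w\|_1$ replaced by the dual norm $\|w\|_q$, where $1/p + 1/q = 1$.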
## 1.1. Related Work

Here we focus on the recent work that is most closely related to the results of this paper. (Awasthi et al., 2019) studied the tractability of RERM with respect to $\ell_\infty$ perturbations, obtaining efficient algorithms for halfspaces in the realizable setting, but showing that RERM for degree-2 polynomial threshold functions is computationally intractable (assuming NP ≠ RP). (Gourdeau et al., 2019) studied robust learnability of hypothesis classes defined over $\{0,1\}^n$ with respect to Hamming distance, and showed that monotone conjunctions are robustly learnable when the adversary can perturb only $O(\log n)$ bits, but are not robustly learnable, even under the uniform distribution, when the adversary can flip $\omega(\log n)$ bits. In this work, we take a more general approach: instead of considering specific perturbation sets, we provide methods in terms of oracle access to a separation oracle for the perturbation set $\mathcal{U}$, and aim to characterize which perturbation sets $\mathcal{U}$ admit tractable RERM. In the non-realizable setting, the only prior work we are aware of is by (Diakonikolas et al., 2019b), who studied the complexity of robustly learning halfspaces in the agnostic setting under $\ell_2$ perturbations.

## 2. Problem Setup

Let $\mathcal{X} = \mathbb{R}^d$ be the instance space and $\mathcal{Y} = \{\pm 1\}$ be the label space. We consider halfspaces $\mathcal{H} = \{x \mapsto \mathrm{sign}(\langle w, x\rangle) : w \in \mathbb{R}^d\}$. The following definitions formalize the notion of adversarially robust PAC learning in the realizable and random classification noise settings:

Definition 2.1 (Realizable Robust PAC Learning). We say $\mathcal{H}$ is robustly PAC learnable with respect to an adversary $\mathcal{U}$ in the realizable setting if there exist a learning algorithm $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{Y}^{\mathcal{X}}$ and a sample complexity $m : (0,1)^2 \to \mathbb{N}$ such that: for any $\epsilon, \delta \in (0,1)$ and for every data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ for which there exists a predictor $h^* \in \mathcal{H}$ with zero robust risk, $\mathrm{R}_{\mathcal{U}}(h^*; \mathcal{D}) = 0$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$, $\mathrm{R}_{\mathcal{U}}(\mathcal{A}(S); \mathcal{D}) \le \epsilon$.

Definition 2.2 (Robust PAC Learning with Random Classification Noise). Let $h^* \in \mathcal{H}$ be an unknown halfspace. Let $\mathcal{D}_x$ be an arbitrary distribution over $\mathcal{X}$ such that $\mathrm{R}_{\mathcal{U}}(h^*; \mathcal{D}_{h^*}) = 0$, where $\mathcal{D}_{h^*}$ denotes the distribution of $(x, h^*(x))$ with $x \sim \mathcal{D}_x$, and let $0 \le \eta < 1/2$. A noisy example oracle $\mathrm{EX}(h^*, \mathcal{D}_x, \eta)$ works as follows: each time $\mathrm{EX}(h^*, \mathcal{D}_x, \eta)$ is invoked, it returns a labeled example $(x, y)$, where $x \sim \mathcal{D}_x$, $y = h^*(x)$ with probability $1-\eta$, and $y = -h^*(x)$ with probability $\eta$. Let $\mathcal{D}$ be the joint distribution on $(x, y)$ generated by the above oracle. We say $\mathcal{H}$ is robustly PAC learnable with respect to an adversary $\mathcal{U}$ in the random classification noise model if there exist a sample complexity $m(\epsilon, \delta, \eta) \in \mathbb{N}$ and a learning algorithm $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{Y}^{\mathcal{X}}$ such that for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ (generated as above by a noisy example oracle), with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$, $\mathrm{R}_{\mathcal{U}}(\mathcal{A}(S); \mathcal{D}) \le \eta + \epsilon$.

Sample Complexity of Robust Learning. Denote by $\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}$ the robust loss class of $\mathcal{H}$,

$$\mathcal{L}^{\mathcal{U}}_{\mathcal{H}} = \left\{ (x,y) \mapsto \sup_{z \in \mathcal{U}(x)} \mathbb{1}[h(z) \neq y] \;:\; h \in \mathcal{H} \right\}.$$

It was shown by (Cullina et al., 2018) that for any set $B \subseteq \mathcal{X}$ that is nonempty, closed, convex, and origin-symmetric, and an adversary $\mathcal{U}$ defined as $\mathcal{U}(x) = x + B$ (e.g., $\ell_p$-balls), the VC dimension of the robust loss class of halfspaces $\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}})$ is at most the standard VC dimension $\mathrm{vc}(\mathcal{H}) = d+1$. Based on Vapnik's General Learning (Vapnik, 1982), this implies that we have uniform convergence of the robust risk with $m = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ samples. Formally, for any $\epsilon, \delta \in (0,1)$ and any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$,

$$\forall w \in \mathbb{R}^d,\quad \left|\mathrm{R}_{\mathcal{U}}(h_w; \mathcal{D}) - \mathrm{R}_{\mathcal{U}}(h_w; S)\right| \le \epsilon. \qquad (2)$$

In particular, this implies that for any adversary $\mathcal{U}$ that satisfies the conditions above, $\mathcal{H}$ is robustly PAC learnable w.r.t. $\mathcal{U}$
by minimizing the robust empirical risk on $S$,

$$\mathrm{RERM}_{\mathcal{U}}(S) = \underset{w \in \mathbb{R}^d}{\operatorname{argmin}}\; \frac{1}{m}\sum_{i=1}^{m} \sup_{z_i \in \mathcal{U}(x_i)} \mathbb{1}[y_i \langle w, z_i\rangle \le 0]. \qquad (3)$$

Thus, it remains to efficiently solve the $\mathrm{RERM}_{\mathcal{U}}$ problem. We discuss necessary and sufficient conditions for solving $\mathrm{RERM}_{\mathcal{U}}$ in the following section.

## 3. The Realizable Setting

In this section, we show necessary and sufficient conditions for minimizing the robust empirical risk $\mathrm{RERM}_{\mathcal{U}}$ on a dataset $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ in the realizable setting, i.e., when the dataset $S$ is robustly separable by a halfspace $h_{w^*}$ with $w^* \in \mathbb{R}^d$. In Theorem 3.5, we show that an efficient separation oracle for $\mathcal{U}$ yields an efficient solver for $\mathrm{RERM}_{\mathcal{U}}$. In Theorem 3.10, we show that an efficient approximate separation oracle for $\mathcal{U}$ is necessary for even computing the robust loss $\sup_{z \in \mathcal{U}(x)} \mathbb{1}[h_w(z) \neq y]$ of a halfspace $h_w$.

Note that the set of allowed perturbations $\mathcal{U}$ can be non-convex, and so it might seem difficult to solve the $\mathrm{RERM}_{\mathcal{U}}$ problem in full generality. But it turns out that for halfspaces it suffices to consider only convex perturbation sets, due to the following observation:

Observation 3.1. Let $w \in \mathbb{R}^d$ be a halfspace and $(x, y) \in \mathcal{X} \times \mathcal{Y}$ an example. If $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, then $\forall z \in \mathrm{conv}(\mathcal{U}(x)),\ y\langle w, z\rangle > 0$. And if $\exists z \in \mathcal{U}(x)$ with $y\langle w, z\rangle \le 0$, then $\exists z \in \mathrm{conv}(\mathcal{U}(x))$ with $y\langle w, z\rangle \le 0$, where $\mathrm{conv}(\mathcal{U}(x))$ denotes the convex hull of $\mathcal{U}(x)$.

Observation 3.1 shows that for any dataset $S$ that is robustly separable w.r.t. $\mathcal{U}$ by a halfspace $w^*$, $S$ is also robustly separable w.r.t. the convex hull $\mathrm{conv}(\mathcal{U})$ using the same halfspace $w^*$, where $\mathrm{conv}(\mathcal{U})(x) = \mathrm{conv}(\mathcal{U}(x))$. Thus, in the remainder of this section we only consider perturbation sets $\mathcal{U}$ that are convex, i.e., for each $x$, $\mathcal{U}(x)$ is convex.

Definition 3.2. Denote by $\mathrm{SEP}_{\mathcal{U}}$ a separation oracle for $\mathcal{U}$. $\mathrm{SEP}_{\mathcal{U}}(x, z)$ takes as input $x, z \in \mathcal{X}$ and either: asserts that $z \in \mathcal{U}(x)$, or returns a separating hyperplane $w \in \mathbb{R}^d$ such that $\langle w, z'\rangle < \langle w, z\rangle$ for all $z' \in \mathcal{U}(x)$.

Definition 3.3. For any $\eta > 0$, denote by $\mathrm{SEP}^{\eta}_{\mathcal{U}}$ an approximate separation oracle for $\mathcal{U}$. $\mathrm{SEP}^{\eta}_{\mathcal{U}}$ takes as input $x, z \in \mathcal{X}$ and either: asserts that $z \in \mathcal{U}(x)^{+\eta} \stackrel{\text{def}}{=} \{z' : \exists z'' \in \mathcal{U}(x) \text{ s.t. } \|z' - z''\|_2 \le \eta\}$, or returns a separating hyperplane $w \in \mathbb{R}^d$ such that $\langle w, z'\rangle \le \langle w, z\rangle + \eta$ for all $z' \in \mathcal{U}(x)^{-\eta} \stackrel{\text{def}}{=} \{z' : B(z', \eta) \subseteq \mathcal{U}(x)\}$.

Definition 3.4. Denote by $\mathrm{MEM}_{\mathcal{U}}$ a membership oracle for $\mathcal{U}$. $\mathrm{MEM}_{\mathcal{U}}(x, z)$ takes as input $x, z \in \mathcal{X}$ and either asserts that $z \in \mathcal{U}(x)$, or asserts that $z \notin \mathcal{U}(x)$.

When discussing a separation or membership oracle for a fixed convex set $K$, we overload notation and write $\mathrm{SEP}_K$, $\mathrm{SEP}^{\eta}_K$, and $\mathrm{MEM}_K$ (in this case only one argument is required).

### 3.1. An efficient separation oracle for U is sufficient to solve RERM efficiently

Let $\mathrm{Soln}^{\mathcal{U}}_S = \{w \in \mathbb{R}^d : \forall (x, y) \in S,\ \forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0\}$ denote the set of valid solutions for $\mathrm{RERM}_{\mathcal{U}}(S)$ (see Equation (3)). Note that $\mathrm{Soln}^{\mathcal{U}}_S$ is not empty, since we are considering the realizable setting. Although the treatment we present here is for homogeneous halfspaces (where a bias term is not needed), the results extend trivially to the non-homogeneous case. Below, we show that we can efficiently find a solution $w \in \mathrm{Soln}^{\mathcal{U}}_S$ given access to a separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$.

Theorem 3.5. Let $\mathcal{U}$ be an arbitrary convex adversary, and suppose we have access to a separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$ that runs in time $\mathrm{poly}(d, b)$. Then there is an algorithm that finds $w \in \mathrm{Soln}^{\mathcal{U}}_S$ in $\mathrm{poly}(m, d, b)$ time, where $b$ is an upper bound on the bit complexity of the valid solutions in $\mathrm{Soln}^{\mathcal{U}}_S$ and of the examples and perturbations in $S$.
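To make the oracle model concrete, here is a minimal sketch of a separation oracle in the sense of Definition 3.2 for $\ell_2$-ball perturbations $\mathcal{U}(x) = \{z : \|z - x\|_2 \le \gamma\}$, the kind of oracle Theorem 3.5 assumes. The function name and the return convention (returning `None` on membership) are ours.

```python
import numpy as np

def sep_l2_ball(x, z, gamma):
    """Separation oracle for U(x) = {z' : ||z' - x||_2 <= gamma} (Definition 3.2).

    Returns None if z is in U(x); otherwise returns a vector g such that
    <g, z'> < <g, z> for every z' in U(x)."""
    diff = z - x
    dist = np.linalg.norm(diff)
    if dist <= gamma:
        return None                 # z is a member of the ball
    g = diff / dist                 # unit normal pointing from x towards z
    # For any z' in the ball: <g, z'> <= <g, x> + gamma < <g, x> + dist = <g, z>.
    return g
```

For an $\ell_\infty$ ball one would instead return a signed standard basis vector along a coordinate of $z - x$ whose magnitude exceeds $\gamma$.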
Note that the polynomial dependence on $b$ in the runtime is unavoidable even in standard non-robust ERM for halfspaces, unless we can solve linear programs in strongly polynomial time, which is currently an open problem. Theorem 3.5 implies that for a broad family of perturbation sets $\mathcal{U}$, halfspaces $\mathcal{H}$ are efficiently robustly PAC learnable with respect to $\mathcal{U}$ in the realizable setting, as we show in the following corollary:

Corollary 3.6. Let $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$ be an adversary such that $\mathcal{U}(x) = x + B$, where $B$ is nonempty, closed, convex, and origin-symmetric. Then, given access to an efficient separation oracle $\mathrm{SEP}_{\mathcal{U}}$ that runs in time $\mathrm{poly}(d, b)$, $\mathcal{H}$ is robustly PAC learnable w.r.t. $\mathcal{U}$ in the realizable setting in time $\mathrm{poly}(d, b, 1/\epsilon, \log(1/\delta))$.

Proof. This follows from the uniform convergence guarantee for the robust risk of halfspaces (see Equation (2)) and Theorem 3.5.

This covers many types of perturbation sets that are considered in practice. For example, $\mathcal{U}$ could be perturbations of distance at most $\gamma$ w.r.t. some norm $\|\cdot\|$, such as the $\ell_\infty$ norm considered in many applications: $\mathcal{U}(x) = \{z \in \mathcal{X} : \|x - z\|_\infty \le \gamma\}$. In addition, Theorem 3.5 also implies that we can solve the RERM problem for other natural perturbation sets, such as translations and rotations in images (see, e.g., (Engstrom et al., 2019)), and perhaps mixtures of perturbations of different types (see, e.g., (Kang et al., 2019)), as long as we have access to efficient separation oracles for these sets.

Benefits of handling general perturbation sets U. One important implication of Theorem 3.5, which highlights the importance of a treatment that considers general perturbation sets (and not just $\ell_p$ perturbations, for example), is the following: for any efficiently computable feature map $\phi : \mathbb{R}^r \to \mathbb{R}^d$, we can efficiently solve the robust empirical risk problem over the induced halfspaces $\mathcal{H}_\phi = \{x \mapsto \mathrm{sign}(\langle w, \phi(x)\rangle) : w \in \mathbb{R}^d\}$, as long as we have access to an efficient separation oracle for the image of the perturbations $\phi(\mathcal{U}(x))$. Observe that in general $\phi(\mathcal{U}(x))$ may be non-convex and complicated even if $\mathcal{U}(x)$ is convex; however, Observation 3.1 combined with the realizability assumption implies that it suffices to have an efficient separation oracle for the convex hull $\mathrm{conv}(\phi(\mathcal{U}(x)))$.

Before we proceed with the proof of Theorem 3.5, we state the following requirements and guarantees for the Ellipsoid method, which will be useful in the remainder of the section:

Lemma 3.7 (see, e.g., Theorem 2.4 in (Bubeck et al., 2015)). Let $K \subseteq \mathbb{R}^d$ be a convex set, and $\mathrm{SEP}_K$ a separation oracle for $K$. Then the Ellipsoid method, using $O(d^2 b)$ oracle queries to $\mathrm{SEP}_K$, will find a $w \in K$ or assert that $K$ is empty. Furthermore, the total runtime is $O(d^4 b)$.

The proof of Theorem 3.5 relies on two key lemmas. First, we show that efficient robust certification yields an efficient solver for the RERM problem. Given a halfspace $w \in \mathbb{R}^d$ and an example $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$, efficient robust certification means that there is an algorithm that can efficiently either: (a) assert that $w$ is robust on $\mathcal{U}(x)$, i.e., $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, or (b) return a perturbation $z \in \mathcal{U}(x)$ such that $y\langle w, z\rangle \le 0$.

Lemma 3.8. Let $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ be a procedure that either: (a) asserts that $w$ is robust on $\mathcal{U}(x)$, i.e., $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, or (b) finds a perturbation $z \in \mathcal{U}(x)$ such that $y\langle w, z\rangle \le 0$. If $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ can be computed in $\mathrm{poly}(d, b)$ time, then there is an algorithm that finds $w \in \mathrm{Soln}^{\mathcal{U}}_S$ in $\mathrm{poly}(m, d, b)$ time.
Proof. Observe that $\mathrm{Soln}^{\mathcal{U}}_S$ is a convex set, since

$$w_1, w_2 \in \mathrm{Soln}^{\mathcal{U}}_S \;\Rightarrow\; \forall (x,y) \in S,\ \forall z \in \mathcal{U}(x),\ y\langle w_1, z\rangle > 0 \text{ and } y\langle w_2, z\rangle > 0$$
$$\Rightarrow\; \forall \alpha \in [0,1],\ \forall (x,y) \in S,\ \forall z \in \mathcal{U}(x),\ y\langle \alpha w_1 + (1-\alpha) w_2, z\rangle > 0 \;\Rightarrow\; \forall \alpha \in [0,1],\ \alpha w_1 + (1-\alpha) w_2 \in \mathrm{Soln}^{\mathcal{U}}_S.$$

Our goal is to find a $w \in \mathrm{Soln}^{\mathcal{U}}_S$. Let $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ be an efficient robust certifier that runs in $\mathrm{poly}(d, b)$ time. We will use $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ to implement a separation oracle for $\mathrm{Soln}^{\mathcal{U}}_S$, denoted $\mathrm{SEP}_{\mathrm{Soln}^{\mathcal{U}}_S}$. Given a halfspace $w \in \mathbb{R}^d$, we simply check whether $w$ is robustly correct on all datapoints by running $\mathrm{CERT}_{\mathcal{U}}(w, (x_i, y_i))$ on each $(x_i, y_i) \in S$. If there is a point $(x_i, y_i) \in S$ on which $w$ is not robustly correct, then we obtain a perturbation $z_i \in \mathcal{U}(x_i)$ with $y_i\langle w, z_i\rangle \le 0$, and we return $y_i z_i$ as a separating hyperplane. Otherwise, we know that $w$ is robustly correct on all datapoints, and we assert that $w \in \mathrm{Soln}^{\mathcal{U}}_S$. Once we have a separation oracle $\mathrm{SEP}_{\mathrm{Soln}^{\mathcal{U}}_S}$, we can use the Ellipsoid method (see Lemma 3.7) to solve the $\mathrm{RERM}_{\mathcal{U}}(S)$ problem. More specifically, with a query complexity of $O(d^2 b)$ to $\mathrm{SEP}_{\mathrm{Soln}^{\mathcal{U}}_S}$ and an overall runtime of $\mathrm{poly}(m, d, b)$ (which depends on the runtime of $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$), the Ellipsoid method will return a $w \in \mathrm{Soln}^{\mathcal{U}}_S$.

Next, we show that we can do efficient robust certification when given access to an efficient separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$.

Lemma 3.9. If we have an efficient separation oracle $\mathrm{SEP}_{\mathcal{U}}$ that runs in $\mathrm{poly}(d, b)$ time, then we can solve $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ in $\mathrm{poly}(d, b)$ time.

Proof. Given a halfspace $w \in \mathbb{R}^d$ and $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$, we want to either: (a) assert that $w$ is robust on $\mathcal{U}(x)$, i.e., $\forall z \in \mathcal{U}(x),\ y\langle w, z\rangle > 0$, or (b) find a perturbation $z \in \mathcal{U}(x)$ such that $y\langle w, z\rangle \le 0$. Let $M(w, y) = \{z \in \mathcal{X} : y\langle w, z\rangle \le 0\}$ be the set of all points that $w$ mislabels. Observe that by definition $M(w, y)$ is convex, and therefore $\mathcal{U}(x) \cap M(w, y)$ is also convex. We argue that having an efficient separation oracle for $\mathcal{U}(x) \cap M(w, y)$ suffices to solve our robust certification problem. If $\mathcal{U}(x) \cap M(w, y)$ is not empty, then by definition we can find a perturbation $z \in \mathcal{U}(x)$ with $y\langle w, z\rangle \le 0$ using a separation oracle $\mathrm{SEP}_{\mathcal{U}(x) \cap M(w,y)}$ and the Ellipsoid method (see Lemma 3.7). If $\mathcal{U}(x) \cap M(w, y)$ is empty, then by definition $w$ is robustly correct on $\mathcal{U}(x)$, and the Ellipsoid method will terminate and assert that $\mathcal{U}(x) \cap M(w, y)$ is empty. Thus, it remains to implement $\mathrm{SEP}_{\mathcal{U}(x) \cap M(w,y)}$. Given a point $z \in \mathbb{R}^d$, we query the separation oracle for $\mathcal{U}(x)$ by calling $\mathrm{SEP}_{\mathcal{U}}(x, z)$, and the separation oracle for $M(w, y)$ by checking whether $y\langle w, z\rangle \le 0$. If $z \notin \mathcal{U}(x)$, then we get a separating hyperplane $c$ from $\mathrm{SEP}_{\mathcal{U}}$, and we can use it to separate $z$ from $\mathcal{U}(x) \cap M(w, y)$. Similarly, if $z \notin M(w, y)$, then by definition $y\langle w, z\rangle > 0$, and so we can use $y w$ as a separating hyperplane to separate $z$ from $\mathcal{U}(x) \cap M(w, y)$. The overall runtime of this separation oracle is $\mathrm{poly}(d, b)$, and so we can solve $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$ in $\mathrm{poly}(d, b)$ time using the Ellipsoid method (Lemma 3.7).

We are now ready to proceed with the proof of Theorem 3.5.

Proof of Theorem 3.5. We want to efficiently solve $\mathrm{RERM}_{\mathcal{U}}(S)$, given a separation oracle $\mathrm{SEP}_{\mathcal{U}}$ for $\mathcal{U}$ that runs in $\mathrm{poly}(d, b)$ time. By Lemma 3.9, we get an efficient robust certification procedure $\mathrm{CERT}_{\mathcal{U}}(w, (x, y))$. Then, by Lemma 3.8, we get an efficient solver for $\mathrm{RERM}_{\mathcal{U}}$. In particular, the runtime complexity is $\mathrm{poly}(m, d, b)$.
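The construction in Lemmas 3.8 and 3.9 amounts to two nested runs of the Ellipsoid method: an inner run that certifies a candidate halfspace on a single example, and an outer run over $\mathrm{Soln}^{\mathcal{U}}_S$. The following is a minimal numerical sketch of that pipeline, assuming a user-supplied oracle `sep_U(x, z)` in the sense of Definition 3.2 (returning `None` on membership, otherwise a cutting vector). The starting radius, iteration budget, and tolerance are illustrative simplifications and do not reproduce the bit-complexity bookkeeping of Lemma 3.7; the function names are ours.

```python
import numpy as np

def ellipsoid_search(sep_oracle, dim, R=10.0, max_iter=2000):
    """Simplified central-cut Ellipsoid method (cf. Lemma 3.7).

    sep_oracle(c) returns None if c lies in the target convex set K, otherwise a
    vector g with <g, v> < <g, c> for all v in K. Returns a point of K, or None
    if none is found within max_iter iterations (K is then treated as empty)."""
    c = np.zeros(dim)                     # center of the current ellipsoid
    P = (R ** 2) * np.eye(dim)            # shape matrix; start from the ball B(0, R)
    for _ in range(max_iter):
        g = sep_oracle(c)
        if g is None:
            return c
        Pg = P @ g
        gPg = float(g @ Pg)
        if gPg <= 1e-24:                  # ellipsoid has collapsed
            return None
        c = c - Pg / ((dim + 1) * np.sqrt(gPg))
        P = (dim ** 2 / (dim ** 2 - 1.0)) * (P - (2.0 / (dim + 1)) * np.outer(Pg, Pg) / gPg)
    return None

def certify(w, x, y, sep_U, R=10.0):
    """CERT_U(w, (x, y)) from Lemma 3.8, implemented as in Lemma 3.9: search
    U(x) intersected with M(w, y) using the Ellipsoid method.
    Returns a violating perturbation z, or None if w appears robustly correct."""
    def sep_intersection(z):
        if y * np.dot(w, z) > 0:          # z is outside M(w, y) = {z : y<w, z> <= 0}
            return y * w                  # y*w cuts z away from the intersection
        return sep_U(x, z)                # membership in U(x), or a cut for U(x)
    return ellipsoid_search(sep_intersection, dim=len(w), R=R)

def rerm_realizable(S, sep_U, dim, R=10.0):
    """Outer Ellipsoid run over Soln_S^U (proof of Lemma 3.8): find w with
    y<w, z> > 0 for all (x, y) in S and all z in U(x). Assumes realizability."""
    def sep_soln(w):
        for (x, y) in S:
            z = certify(w, x, y, sep_U)
            if z is not None:             # w errs somewhere in U(x);
                return -y * z             # every valid w' has <-y z, w'> < 0 <= <-y z, w>
        return None                       # w is robustly correct on the whole sample
    return ellipsoid_search(sep_soln, dim=dim, R=R)
```

With the $\ell_2$-ball oracle sketched earlier, `rerm_realizable(S, lambda x, z: sep_l2_ball(x, z, gamma), dim=d)` plays the role of the $\mathrm{RERM}_{\mathcal{U}}$ solver of Theorem 3.5 on a robustly separable sample.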
### 3.2. An efficient approximate separation oracle for U is necessary for computing the robust loss

Our efficient algorithm for $\mathrm{RERM}_{\mathcal{U}}$ requires a separation oracle for $\mathcal{U}$. We now show that even efficiently computing the robust loss of a halfspace $(w, b_0) \in \mathbb{R}^d \times \mathbb{R}$ on an example $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$ requires an efficient approximate separation oracle for $\mathcal{U}$.

Theorem 3.10. Suppose that, given a halfspace $(w, b_0) \in \mathbb{R}^d \times \mathbb{R}$ and an example $(x, y) \in \mathbb{R}^d \times \mathcal{Y}$, the procedure $\mathrm{EVAL}_{\mathcal{U}}((w, b_0), (x, y))$ computes the robust loss $\sup_{z \in \mathcal{U}(x)} \mathbb{1}[y(\langle w, z\rangle + b_0) \le 0]$ in $\mathrm{poly}(d, b)$ time. Then for any $\gamma > 0$, we can implement an efficient $\gamma$-approximate separation oracle $\mathrm{SEP}^{\gamma}_{\mathcal{U}}(x, z)$ in $\mathrm{poly}(d, b, \log(1/\gamma), \log R)$ time, where $\mathcal{U}(x) \subseteq B(0, R)$.

Proof. Let $\gamma > 0$. We describe how to implement a $\gamma$-approximate separation oracle for $\mathcal{U}$, denoted $\mathrm{SEP}^{\gamma}_{\mathcal{U}}(x, z)$. Fix the first argument to an arbitrary $x \in \mathcal{X}$. Upon receiving a point $z \in \mathcal{X}$ as input, the main strategy is to search for a halfspace that labels all of $\mathcal{U}(x)$ with $+1$ and labels the point $z$ with $-1$. If $z \notin \mathcal{U}(x)$, then such a halfspace exists because $\mathcal{U}(x)$ is convex, but this is impossible if $z \in \mathcal{U}(x)$. Since we are only concerned with implementing an approximate separation oracle, we settle for a slight relaxation, which is to either: assert that $z$ is $\gamma$-close to $\mathcal{U}(x)$, i.e., $z \in B(\mathcal{U}(x), \gamma)$, or return a hyperplane separating $z$ from $\mathcal{U}(x)$.

Let $K = \{(w, b_0) : \forall z' \in \mathcal{U}(x),\ \langle w, z'\rangle + b_0 > 0\}$ denote the set of halfspaces that label all of $\mathcal{U}(x)$ with $+1$. Since $\mathcal{U}(x)$ is nonempty, it follows by definition that $K$ is nonempty. To evaluate membership in $K$, given a query $(w_q, b_q)$, we just make a call to $\mathrm{EVAL}_{\mathcal{U}}((w_q, b_q), (x, +1))$ and set $\mathrm{MEM}_K(w_q, b_q) = 1 - \mathrm{EVAL}_{\mathcal{U}}((w_q, b_q), (x, +1))$. This can be computed in $\mathrm{poly}(d, b)$ time.

Next, for any $\eta \in (0, 0.5)$, we can get an $\eta$-approximate separation oracle for $K$, denoted $\mathrm{SEP}^{\eta}_K$ (see Definition 3.3), using $O(d b \log(d/\eta))$ queries to the membership oracle $\mathrm{MEM}_K$ described above (Lee et al., 2018). When queried with a halfspace $\bar{w} = (w, b_0)$, $\mathrm{SEP}^{\eta}_K$ either asserts that $\bar{w} \in K^{+\eta}$, or returns a separating hyperplane $c$ such that $\langle c, \bar{w}'\rangle \le \langle c, \bar{w}\rangle + \eta$ for all halfspaces $\bar{w}' \in K^{-\eta}$. Observe that by definition $K^{-\eta} \subseteq K \subseteq K^{+\eta}$. Furthermore, for any $\bar{w} \in K^{+\eta}$, by definition there is $\bar{w}^* \in K$ such that $\|\bar{w} - \bar{w}^*\|_2 \le \eta$. Since, for each $z' \in \mathcal{U}(x)$, by definition of $K$ we have $\langle \bar{w}^*, (z', 1)\rangle = \langle w^*, z'\rangle + b_0^* > 0$, it follows by the Cauchy–Schwarz inequality (using $\|(z', 1)\|_2 \le 2R$ for $R \ge 1$) that

$$\forall \bar{w} \in K^{+\eta},\ \forall z' \in \mathcal{U}(x):\quad \langle \bar{w}, (z', 1)\rangle = \langle \bar{w} - \bar{w}^*, (z', 1)\rangle + \langle \bar{w}^*, (z', 1)\rangle > -2\eta R. \qquad (4)$$

Let $\mathrm{SEP}^{\gamma/4R}_K$ be a $\frac{\gamma}{4R}$-approximate separation oracle for $K$. Observe that if the distance between $z$ and $\mathcal{U}(x)$ is greater than $\gamma$, then there is $(w, b_0)$ such that $\langle w, z\rangle + b_0 \le -\gamma/2$ and $\langle w, z'\rangle + b_0 > 0$ for all $z' \in \mathcal{U}(x)$. By definition of $K$, this implies that $K \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is not empty, which in turn implies that the intersection $K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is nonempty. We also have the contrapositive: if the intersection $K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is empty, then we know that $z \in B(\mathcal{U}(x), \gamma)$.

To conclude the proof, we run the Ellipsoid method with the approximate separation oracle $\mathrm{SEP}^{\gamma/4R}_K$ to search over the restricted space $\{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$. Restricting the space is easy, because whenever a candidate violates the constraint we can use the query point itself, $(z, 1)$, as a separating hyperplane. Either the Ellipsoid method finds $(w, b_0) \in K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$, in which case, by Equation (4) with $\eta = \gamma/4R$, $(w, b_0)$ has the property that $\langle w, z\rangle + b_0 \le -\gamma/2$ while $\langle w, z'\rangle + b_0 > -\gamma/2$ for all $z' \in \mathcal{U}(x)$; the hyperplane $\{v : \langle w, v\rangle + b_0 = -\gamma/2\}$ thus separates $z$ from $\mathcal{U}(x)$, and we return it.
If the Ellipsoid method terminates without finding any such $(w, b_0)$, this implies that the intersection $K^{+\gamma/4R} \cap \{(w, b_0) : \langle w, z\rangle + b_0 \le -\gamma/2\}$ is empty, and therefore, by the contrapositive above, we assert that $z \in B(\mathcal{U}(x), \gamma)$.

## 4. Random Classification Noise

In this section, we relax the realizability assumption to random classification noise (Angluin & Laird, 1987). We show that for any adversary $\mathcal{U}$ that represents perturbations of bounded norm (i.e., $\mathcal{U}(x) = x + B$, where $B = \{\delta \in \mathbb{R}^d : \|\delta\|_p \le \gamma\}$, $p \in [1, \infty]$), the class of halfspaces $\mathcal{H}$ is efficiently robustly PAC learnable with respect to $\mathcal{U}$ in the random classification noise model.

Theorem 4.1. Let $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$ be an adversary such that $\mathcal{U}(x) = x + B$, where $B = \{\delta \in \mathbb{R}^d : \|\delta\|_p \le \gamma\}$ and $p \in [1, \infty]$. Then $\mathcal{H}$ is robustly PAC learnable w.r.t. $\mathcal{U}$ under random classification noise in time $\mathrm{poly}(d, 1/\epsilon, 1/\gamma, 1/(1-2\eta), \log(1/\delta))$.

The proof of Theorem 4.1 relies on the following key lemma. We show that the structure of the perturbation set $B$ allows us to relate the robust loss of a halfspace $w \in \mathbb{R}^d$ to the $\gamma$-margin loss of $w$. Before we state the lemma, recall that the dual norm of $w$, denoted $\|w\|_*$, is defined as $\sup\{\langle u, w\rangle : \|u\| \le 1\}$; for $\|\cdot\| = \ell_p$ this is the $\ell_q$ norm with $1/p + 1/q = 1$.

Lemma 4.2. For any $w, x \in \mathbb{R}^d$ and any $y \in \mathcal{Y}$,

$$\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = \mathbb{1}\left\{ y\left\langle \frac{w}{\|w\|_*}, x\right\rangle \le \gamma \right\}.$$

Proof. First observe that

$$\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = \sup_{\delta \in B} \mathbb{1}\{y\langle w, x + \delta\rangle \le 0\} = \mathbb{1}\left\{\inf_{\delta \in B} y\langle w, x + \delta\rangle \le 0\right\}.$$

This holds because when $\inf_{\delta \in B} y\langle w, x + \delta\rangle > 0$, by definition $\forall \delta \in B,\ y\langle w, x + \delta\rangle > 0$, which implies that $\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = 0$. For the other direction, when $\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\} = 1$, by definition $\exists \delta \in B$ such that $y\langle w, x + \delta\rangle \le 0$, which implies that $\inf_{\delta \in B} y\langle w, x + \delta\rangle \le 0$. To conclude the proof, by definition of the set $B$ and the dual norm $\|\cdot\|_*$ (and the origin-symmetry of $B$), we have

$$\inf_{\delta \in B} y\langle w, x + \delta\rangle = y\langle w, x\rangle - \sup_{\delta \in B}\langle y w, \delta\rangle = y\langle w, x\rangle - \|w\|_*\,\gamma.$$

Lemma 4.2 implies that for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, to solve the $\gamma$-robust learning problem

$$\underset{w \in \mathbb{R}^d}{\operatorname{argmin}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sup_{\delta \in B} \mathbb{1}\{h_w(x + \delta) \neq y\}\right], \qquad (5)$$

it suffices to solve the $\gamma$-margin learning problem

$$\underset{\|w\|_* = 1}{\operatorname{argmin}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\{y\langle w, x\rangle \le \gamma\}\right]. \qquad (6)$$

We will solve the $\gamma$-margin learning problem in Equation (6) in the random classification noise setting using an appropriately chosen convex surrogate loss. Our convex surrogate loss and its analysis build on a convex surrogate that appears in the appendix of (Diakonikolas et al., 2019a) for learning large $\ell_2$-margin halfspaces under random classification noise w.r.t. the 0-1 loss. We note that the idea of using a convex surrogate to (non-robustly) learn large-margin halfspaces in the presence of random classification noise is implicit in a number of prior works, starting with (Bylander, 1994). Our robust setting is more challenging for the following reasons. First, we are not interested in only ensuring small 0-1 loss, but rather ensuring small $\gamma$-margin loss. Second, we want to be able to handle all $\ell_p$ norms, as opposed to just the $\ell_2$ norm. As a result, our analysis is somewhat delicate. Define the piecewise-linear convex surrogate

$$\varphi(s) = \begin{cases} \lambda\left(1 - \dfrac{s}{\gamma}\right), & s > \gamma, \\[4pt] (1-\lambda)\left(1 - \dfrac{s}{\gamma}\right), & s \le \gamma. \end{cases}$$

We will show that solving the following convex optimization problem,

$$\underset{\|w\|_q \le 1}{\operatorname{argmin}}\ G^{\gamma}_{\lambda}(w) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\varphi(y\langle w, x\rangle)\right], \qquad (7)$$

where $\lambda = \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$, suffices to solve the $\gamma$-margin learning problem in Equation (6). Intuitively, the idea here is that for $\lambda = 0$ the $\varphi$ objective is exactly a scaled hinge loss, which gives a learning guarantee w.r.t. the $\gamma$-margin loss when there is no noise ($\eta = 0$). When the noise rate $\eta > 0$, we slightly adjust the slopes, so that even a correct prediction incurs some loss.
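For concreteness, here is a minimal sketch of the surrogate $\varphi$ and of an empirical version of the objective $G^{\gamma}_{\lambda}$ in Equation (7); the function names are ours, and $\lambda$ is set by the formula above.

```python
import numpy as np

def phi(s, gamma, lam):
    """Piecewise-linear convex surrogate from Equation (7).
    For lam = 0 it is exactly the scaled hinge loss max(0, 1 - s/gamma)."""
    return np.where(s > gamma,
                    lam * (1.0 - s / gamma),
                    (1.0 - lam) * (1.0 - s / gamma))

def lambda_choice(eps, gamma, eta):
    """The value of lambda used in the analysis (proof of Theorem 4.5)."""
    return (eps * gamma / 2.0 + eta) / (1.0 + eps * gamma)

def surrogate_objective(w, X, Y, gamma, lam):
    """Empirical counterpart of G_lambda^gamma(w) = E[phi(y <w, x>)]."""
    return float(np.mean(phi(Y * (X @ w), gamma, lam)))
```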
The choice of the slope is based on $\lambda$, which will depend on the noise rate $\eta$ and the $\epsilon$-suboptimality required for Equation (6). We can solve Equation (7) through samples with a standard first-order method, Stochastic Mirror Descent, when the dual norm is an $\ell_q$-norm. We state the following properties of Mirror Descent that we will require:

Lemma 4.3 (see, e.g., Theorem 6.1 in (Bubeck et al., 2015)). Let $G(w) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(w, (x, y))]$ be a convex function that is $L$-Lipschitz w.r.t. $\|w\|_q$, where $q > 1$. Then, using the potential function $\psi(w) = \frac{1}{2}\|w\|^2_q$, a suitable step size $\eta$ (here denoting the step size, not the noise rate), and the sequence of iterates $w_k$ computed by the update

$$w_{k+1} = \Pi^{\psi}_{B_q}\!\left((\nabla\psi)^{-1}\big(\nabla\psi(w_k) - \eta\, g_k\big)\right),$$

Stochastic Mirror Descent with $O\!\left(\frac{L^2}{(q-1)\epsilon^2}\right)$ stochastic gradients $g$ of $G$ will find an $\epsilon$-suboptimal point $\hat{w}$ such that $\|\hat{w}\|_q \le 1$ and $G(\hat{w}) \le \inf_{\|w\|_q \le 1} G(w) + \epsilon$.

Remark 4.4. When $q = 1$, we will use the entropy potential function $\psi(w) = \sum_{i=1}^d w_i \log w_i$. In this case, Stochastic Mirror Descent will require $O\!\left(\frac{L^2 \log d}{\epsilon^2}\right)$ stochastic gradients.

We are now ready to state our main result for this section:

Theorem 4.5. Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_p \le 1\}$. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ such that there exists a halfspace $w^* \in \mathbb{R}^d$ with $\Pr_{x\sim\mathcal{D}_x}[|\langle w^*, x\rangle| > \gamma] = 1$, and $y$ is generated by $h_{w^*}(x) := \mathrm{sign}(\langle w^*, x\rangle)$ corrupted by RCN with noise rate $\eta < 1/2$. An application of Stochastic Mirror Descent on $G^{\gamma}_{\lambda}(w)$ returns, with high probability, a halfspace $w$ with $\|w\|_q \le 1$ and $\gamma/2$-robust misclassification error $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}] \le \eta + \epsilon$, in $\mathrm{poly}(d, 1/\epsilon, 1/\gamma, 1/(1-2\eta))$ time.

With Theorem 4.5, the proof of Theorem 4.1 immediately follows.

Proof of Theorem 4.1. This follows from Lemma 4.2 and Theorem 4.5.

Remark 4.6. In Theorem 4.5, we get a $\gamma/2$-robustness guarantee assuming a $\gamma$-robust halfspace $w^*$ that is corrupted with random classification noise. This can be strengthened to a guarantee of $(1-c)\gamma$-robustness for any constant $c > 0$.

The rest of this section is devoted to the proof of Theorem 4.5. The high-level strategy is to show that an $\epsilon'$-suboptimal solution to Equation (7) gives us an $\epsilon$-suboptimal solution to Equation (6) (for a suitably chosen $\epsilon'$). In Lemma 4.8, we bound from above the $\gamma/2$-margin loss in terms of our convex surrogate objective $G^{\gamma}_{\lambda}$, and in Lemma 4.9 we show that the minimum of our convex surrogate $G^{\gamma}_{\lambda}$ is sufficiently small. These are the two key lemmas that we use to piece everything together.

For any $w, x \in \mathbb{R}^d$, consider the contribution to the objective $G^{\gamma}_{\lambda}$ of the point $x$, denoted $G^{\gamma}_{\lambda}(w, x)$ and defined as

$$G^{\gamma}_{\lambda}(w, x) = \mathbb{E}_{y\sim\mathcal{D}_y(x)}\left[\varphi(y\langle w, x\rangle)\right] = \eta\,\varphi(-z) + (1-\eta)\,\varphi(z), \quad\text{where } z = h_{w^*}(x)\langle w, x\rangle.$$

In the following lemma, we provide a decomposition of $G^{\gamma}_{\lambda}(w, x)$ that will help us in proving Lemmas 4.8 and 4.9.

Lemma 4.7. For any $w, x \in \mathbb{R}^d$, let $z = h_{w^*}(x)\langle w, x\rangle$. Then we have that:

$$G^{\gamma}_{\lambda}(w, x) = (\eta - \lambda)\frac{z}{\gamma} + \lambda + \eta - 2\lambda\eta + \mathbb{1}\{-\gamma \le z \le \gamma\}\,(1-\eta)(1-2\lambda)\left(1 - \frac{z}{\gamma}\right) + \mathbb{1}\{z < -\gamma\}\,(1-2\lambda)\left(1 - 2\eta - \frac{z}{\gamma}\right).$$

Proof. Based on the definition of the surrogate loss, it suffices to consider three cases.

Case $z > \gamma$:
$$l_1(z) \stackrel{\text{def}}{=} \eta\varphi(-z) + (1-\eta)\varphi(z) = \eta(1-\lambda)\left(1 + \frac{z}{\gamma}\right) + (1-\eta)\lambda\left(1 - \frac{z}{\gamma}\right) = \big(\eta(1-\lambda) - (1-\eta)\lambda\big)\frac{z}{\gamma} + \lambda + \eta - 2\lambda\eta = (\eta - \lambda)\frac{z}{\gamma} + \lambda + \eta - 2\lambda\eta.$$

Case $-\gamma \le z \le \gamma$:
$$l_2(z) \stackrel{\text{def}}{=} \eta\varphi(-z) + (1-\eta)\varphi(z) = \eta(1-\lambda)\left(1 + \frac{z}{\gamma}\right) + (1-\eta)(1-\lambda)\left(1 - \frac{z}{\gamma}\right) = (1-\lambda) - (1-2\eta)(1-\lambda)\frac{z}{\gamma}.$$

Case $z < -\gamma$:
$$l_3(z) \stackrel{\text{def}}{=} \eta\lambda\left(1 + \frac{z}{\gamma}\right) + (1-\eta)(1-\lambda)\left(1 - \frac{z}{\gamma}\right) = \big(\eta\lambda - (1-\eta)(1-\lambda)\big)\frac{z}{\gamma} + \eta\lambda + (1-\eta)(1-\lambda) = (\eta + \lambda - 1)\frac{z}{\gamma} + 1 - \eta - \lambda + 2\eta\lambda.$$

Considering the three cases above, we can write

$$G^{\gamma}_{\lambda}(w, x) = l_1(z) + \mathbb{1}\{-\gamma \le z \le \gamma\}\,(l_2(z) - l_1(z)) + \mathbb{1}\{z < -\gamma\}\,(l_3(z) - l_1(z)).$$
Then, we calculate $l_2(z) - l_1(z)$ and $l_3(z) - l_1(z)$ as follows:

$$l_2(z) - l_1(z) = -\big((1-2\eta)(1-\lambda) + \eta - \lambda\big)\frac{z}{\gamma} + (1-\eta)(1-2\lambda) = (1-\eta)(1-2\lambda)\left(1 - \frac{z}{\gamma}\right),$$
$$l_3(z) - l_1(z) = (2\lambda - 1)\frac{z}{\gamma} + (1-2\eta)(1-2\lambda) = (1-2\lambda)\left(1 - 2\eta - \frac{z}{\gamma}\right).$$

Using the above, we have that

$$G^{\gamma}_{\lambda}(w, x) = l_1(z) + \mathbb{1}\{-\gamma \le z \le \gamma\}\,(1-\eta)(1-2\lambda)\left(1 - \frac{z}{\gamma}\right) + \mathbb{1}\{z < -\gamma\}\,(1-2\lambda)\left(1 - 2\eta - \frac{z}{\gamma}\right).$$

The following lemma allows us to bound from below our convex surrogate $\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)]$ in terms of the $\gamma/2$-margin loss of $w$.

Lemma 4.8. Assume that $\lambda$ is chosen such that $\lambda < 1/2$ and $\eta < \lambda$. Then, for any $w \in \mathbb{R}^d$ with $\|w\|_q \le 1$,

$$\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \ge \frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] + \lambda + \eta - 2\lambda\eta.$$

Proof. By Lemma 4.7 and linearity of expectation, we have that

$$\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] = (\eta - \lambda)\,\mathbb{E}_{x}\!\left[\frac{z}{\gamma}\right] + \lambda + \eta - 2\lambda\eta + (1-\eta)(1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{-\gamma \le z \le \gamma\}\left(1 - \frac{z}{\gamma}\right)\right] + (1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{z < -\gamma\}\left(1 - 2\eta - \frac{z}{\gamma}\right)\right].$$

First, observe that for any $x$, $z = h_{w^*}(x)\langle w, x\rangle \le \|w\|_q\|x\|_p \le 1$, and since $\eta < \lambda$, we have $(\eta - \lambda)\,\mathbb{E}_x[z/\gamma] \ge (\eta - \lambda)/\gamma$. Then we observe that whenever $z < -\gamma$, $1 - 2\eta - \frac{z}{\gamma} > 2(1-\eta) \ge (1-\eta)/2$, and $\lambda < 1/2$; thus we can bound the third term from below:

$$(1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{z < -\gamma\}\left(1 - 2\eta - \frac{z}{\gamma}\right)\right] \ge \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{z < -\gamma\}\right].$$

Next, we note that whenever $-\gamma \le z \le \gamma$, $1 - \frac{z}{\gamma} \ge 0$. This implies that instead of considering $\mathbb{1}\{-\gamma \le z \le \gamma\}$, we can restrict to the subset $\mathbb{1}\{-\gamma \le z \le \frac{\gamma}{2}\}$, on which $1 - \frac{z}{\gamma} \ge 1/2$. Thus, we can bound the second term from below as follows:

$$(1-\eta)(1-2\lambda)\,\mathbb{E}_{x}\!\left[\mathbb{1}\{-\gamma \le z \le \gamma\}\left(1 - \frac{z}{\gamma}\right)\right] \ge \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{-\gamma \le z \le \frac{\gamma}{2}\right\}\right].$$

Combining the above, we obtain

$$\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \ge \frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{-\gamma \le z \le \frac{\gamma}{2}\right\} + \mathbb{1}\{z < -\gamma\}\right] + \lambda + \eta - 2\lambda\eta = \frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] + \lambda + \eta - 2\lambda\eta,$$

as desired.

We now show that the minimum of the convex surrogate $G^{\gamma}_{\lambda}$ is sufficiently small, which will be useful later in choosing the suboptimality parameter $\epsilon'$.

Lemma 4.9. Assume that $\lambda$ is chosen such that $\lambda < 1/2$ and $\eta < \lambda$. Then we have that

$$\inf_{w \in \mathbb{R}^d}\ \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \le 2\eta(1-\lambda).$$

Proof. By definition, we have that $\inf_{w \in \mathbb{R}^d} \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \le \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w^*, x)]$. By assumption, with probability 1 over $x \sim \mathcal{D}_x$, we have $z^* = h_{w^*}(x)\langle w^*, x\rangle > \gamma$. Thus, by Lemma 4.7 (only the first case is active),

$$\mathbb{E}_{x}[G^{\gamma}_{\lambda}(w^*, x)] = (\eta - \lambda)\,\mathbb{E}_{x}\!\left[\frac{z^*}{\gamma}\right] + \lambda + \eta - 2\lambda\eta \le (\eta - \lambda) + \lambda + \eta - 2\lambda\eta = 2\eta(1-\lambda),$$

where the last inequality follows from the fact that $\eta < \lambda$ and the fact that $\mathbb{E}_{x}[z^*/\gamma] > 1$.

Using the above lemmas, we are now able to bound from above the $\gamma/2$-margin loss of a halfspace $w$ that is $\epsilon'$-suboptimal for our convex optimization problem (see Equation (7)).

Lemma 4.10. For any $\epsilon' \in (0, 1)$ and any $w \in \mathbb{R}^d$ with $\|w\|_q \le 1$ such that $\mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w, x)] \le \mathbb{E}_{x\sim\mathcal{D}_x}[G^{\gamma}_{\lambda}(w^*, x)] + \epsilon'$, the $\gamma/2$-misclassification error of $w$ satisfies

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}\right] \le \eta + \frac{2}{1-2\lambda}\left(\epsilon' + (\lambda - \eta)\left(\frac{1}{\gamma} - 1\right)\right).$$

Proof. By Lemma 4.8 and Lemma 4.9, we have

$$\frac{\eta - \lambda}{\gamma} + \frac{1}{2}(1-2\lambda)(1-\eta)\,\mathbb{E}_{x}\!\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] + \lambda + \eta - 2\lambda\eta \le 2\eta(1-\lambda) + \epsilon'.$$

This implies

$$\mathbb{E}_{x}\!\left[\mathbb{1}\left\{z \le \frac{\gamma}{2}\right\}\right] \le \frac{2}{(1-2\lambda)(1-\eta)}\left(\epsilon' + (\lambda - \eta)\left(\frac{1}{\gamma} - 1\right)\right).$$

Since $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}] \le \eta + (1-\eta)\,\mathbb{E}_{x}[\mathbb{1}\{z \le \gamma/2\}]$, we get the desired result.

We are now ready to prove Theorem 4.5.

Proof of Theorem 4.5. Based on Lemma 4.10, we will choose $\lambda$ and $\epsilon'$ such that

$$\frac{2}{1-2\lambda}\left(\epsilon' + (\lambda - \eta)\left(\frac{1}{\gamma} - 1\right)\right) \le \epsilon.$$

By setting $\epsilon' = \lambda - \eta$, this condition reduces to $2(\lambda - \eta) \le (1-2\lambda)\gamma\epsilon$. This implies that we need $\lambda \le \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$, and we will choose $\lambda = \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$. Note that our analysis relied on having $\lambda < 1/2$ and $\eta < \lambda$; combined, these conditions require $\eta < \lambda < 1/2$. Our choice of $\lambda = \frac{\epsilon\gamma/2 + \eta}{1 + \epsilon\gamma}$ satisfies them, since

$$\lambda - \eta = \frac{\epsilon\gamma/2 + \eta - \eta(1 + \epsilon\gamma)}{1 + \epsilon\gamma} = \frac{\epsilon\gamma(1/2 - \eta)}{1 + \epsilon\gamma} > 0 \quad\text{and}\quad \lambda - \frac{1}{2} = \frac{\epsilon\gamma/2 + \eta - \frac{1}{2}(1 + \epsilon\gamma)}{1 + \epsilon\gamma} = \frac{\eta - 1/2}{1 + \epsilon\gamma} < 0.$$

By our choice of $\lambda$, we have that $\epsilon' = \lambda - \eta = \frac{\epsilon\gamma(1/2 - \eta)}{1 + \epsilon\gamma} = \frac{\epsilon\gamma(1 - 2\eta)}{2(1 + \epsilon\gamma)}$. By the guarantees of Stochastic Mirror Descent (see Lemma 4.3), our theorem follows with $O\!\left(\frac{1}{\epsilon^2\gamma^2(1-2\eta)^2(q-1)}\right)$ samples for $q > 1$ and $O\!\left(\frac{\log d}{\epsilon^2\gamma^2(1-2\eta)^2}\right)$ samples for $q = 1$.
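Putting the pieces of this section together, a minimal end-to-end sketch of the resulting learner follows. For readability it instantiates the special case $p = q = 2$, where the mirror map of Lemma 4.3 reduces to plain projected stochastic subgradient descent on the Euclidean unit ball; the step size and iteration count are illustrative simplifications rather than the tuned constants of Lemma 4.3, and all function names are ours.

```python
import numpy as np

def phi_subgrad(s, gamma, lam):
    """A subgradient of the piecewise-linear surrogate phi at margin value s."""
    return np.where(s > gamma, -lam / gamma, -(1.0 - lam) / gamma)

def robust_halfspace_rcn_l2(X, Y, gamma, eps, eta, n_iters=None, seed=0):
    """Sketch of the Section 4 learner for l_2 perturbations of radius gamma under
    random classification noise of rate eta: minimize the empirical surrogate
    G_lambda^gamma over {||w||_2 <= 1} by projected stochastic subgradient descent
    (the q = 2 instance of Lemma 4.3) and return the averaged iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lam = (eps * gamma / 2.0 + eta) / (1.0 + eps * gamma)   # lambda from the proof of Theorem 4.5
    eps_prime = max(lam - eta, 1e-8)                        # target suboptimality for Equation (7)
    if n_iters is None:
        n_iters = int(np.ceil(1.0 / (gamma ** 2 * eps_prime ** 2)))
    step = gamma / np.sqrt(n_iters)                         # O(1/sqrt(T)) step size (illustrative)
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)
        s = Y[i] * (X[i] @ w)
        g = phi_subgrad(s, gamma, lam) * Y[i] * X[i]        # stochastic subgradient of G at w
        w = w - step * g
        norm = np.linalg.norm(w)
        if norm > 1.0:                                      # project back onto the unit l_2 ball
            w = w / norm
        w_sum += w
    return w_sum / n_iters

# By Lemma 4.2, for the returned w (with ||w||_2 <= 1) the robust error at radius
# gamma/2 under l_2 perturbations is at most its gamma/2-margin error, which
# Theorem 4.5 bounds by eta + eps.
```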
## 5. Conclusion

In this paper, we provide necessary and sufficient conditions on perturbation sets $\mathcal{U}$ under which we can efficiently solve the robust empirical risk minimization (RERM) problem. We give a polynomial-time algorithm that solves RERM given access to a polynomial-time separation oracle for $\mathcal{U}$. In addition, we show that an efficient approximate separation oracle for $\mathcal{U}$ is necessary for even computing the robust loss of a halfspace. As a corollary, we show that halfspaces are efficiently robustly PAC learnable for a broad range of perturbation sets. Relaxing the realizability assumption, we show that under random classification noise, we can efficiently robustly PAC learn halfspaces with respect to any $\ell_p$ perturbation. An interesting direction for future work is to understand the computational complexity of robustly PAC learning halfspaces under stronger noise models, including Massart noise and agnostic noise.

## Acknowledgments

We thank Sepideh Mahabadi for many insightful and helpful discussions, and Brian Bullins for helpful discussions on Mirror Descent. We also thank the anonymous reviewers for their helpful feedback. This work was initiated when the authors were visiting the Simons Institute for the Theory of Computing as part of the Summer 2019 program on Foundations of Deep Learning. Work by N.S. and O.M. was partially supported by NSF award IIS-1546500 and by DARPA cooperative agreement HR00112020003 (this paper does not reflect the position or the policy of the Government, and no endorsement should be inferred). I.D. was supported by NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant. S.G. was supported by the JP Morgan AI Research PhD Fellowship.

## Appendix: Another Approach to Large Margin Learning under Random Classification Noise

We remark that $\gamma$-margin learning of halfspaces has been studied in earlier work, and we have algorithms such as the Margin Perceptron (Balcan et al., 2008) and SVM. The Margin Perceptron (for the $\ell_2$ margin) and other $\ell_p$-margin algorithms have also been implemented in the SQ model (Feldman et al., 2017). But no explicit connection has been made to adversarial robustness. We present here a simple approach to learning $\gamma$-margin halfspaces under random classification noise using only a convex surrogate loss and Stochastic Mirror Descent. The construction of the convex surrogate is based on learning generalized linear models with a suitable link function $u : \mathbb{R} \to \mathbb{R}$. To the best of our knowledge, the result of this section is not explicit in prior work.

Theorem 5.1. Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_p \le 1\}$. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ such that there exists a halfspace $w^* \in \mathbb{R}^d$ with $\|w^*\|_q = 1$ and $\Pr_{x\sim\mathcal{D}_x}[|\langle w^*, x\rangle| > \gamma] = 1$, and $y$ is generated by $h_{w^*}(x) := \mathrm{sign}(\langle w^*, x\rangle)$ corrupted with random classification noise of rate $\eta < 1/2$. Then, running Stochastic Mirror Descent on the convex optimization problem

$$\min_{w \in \mathbb{R}^d,\ \|w\|_q \le 1}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w, (x, y))\right],$$

where the convex loss function $\ell$ is defined in Equation (8), returns with high probability a halfspace $w$ with $\gamma/2$-robust misclassification error $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}] \le \eta + \epsilon$.

We prove Theorem 5.1 in the remainder of this section. We will connect our problem to that of solving generalized linear models. We define the link function as follows:

$$u(s) = \begin{cases} \eta, & s < -\gamma, \\[2pt] \dfrac{1}{2} + \dfrac{(1-2\eta)\,s}{2\gamma}, & -\gamma \le s \le \gamma, \\[4pt] 1 - \eta, & s > \gamma. \end{cases}$$

Observe that $u$ is monotone and $\frac{1-2\eta}{2\gamma}$-Lipschitz.
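A minimal sketch of the link function and of the stochastic gradient used in the mirror-descent step follows. The gradient uses the standard generalized-linear-model identity $\nabla_w\,\ell(w, (x, y)) = (u(\langle w, x\rangle) - y)\,x$ for the surrogate of Equation (8); labels are assumed to have been mapped to $\{0, 1\}$, and the function names are ours.

```python
import numpy as np

def link_u(s, gamma, eta):
    """Link function u: a linear ramp of slope (1 - 2*eta)/(2*gamma) on [-gamma, gamma],
    clipped to the value eta below -gamma and to 1 - eta above gamma."""
    ramp = 0.5 + (1.0 - 2.0 * eta) * s / (2.0 * gamma)
    return np.clip(ramp, eta, 1.0 - eta)

def glm_surrogate_grad(w, x, y01, gamma, eta):
    """Stochastic gradient of ell(w, (x, y)) = integral_0^{<w,x>} (u(s) - y) ds,
    namely (u(<w, x>) - y) * x, for a label y01 in {0, 1}."""
    return (link_u(np.dot(w, x), gamma, eta) - y01) * x
```

Plugging this gradient into a projected stochastic-gradient loop like the one sketched at the end of Section 4 (with labels mapped to $\{0, 1\}$) gives the variant of the learner analyzed in this appendix.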
First, we relate our loss of interest, the $\gamma/2$-margin loss, to the squared loss defined in terms of the link function $u$.

Lemma 5.2. For any $w \in \mathbb{R}^d$,

$$\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \le \frac{16}{(1-2\eta)^2}\,\mathbb{E}_{x\sim\mathcal{D}_x}\left[\big(u(\langle w, x\rangle) - u(\langle w^*, x\rangle)\big)^2\right].$$

Proof. Let $E_+ = \{h_{w^*}(x) = +1\}$ and $E_- = \{h_{w^*}(x) = -1\}$, and let $U(x) = (u(\langle w, x\rangle) - u(\langle w^*, x\rangle))^2$. By the law of total expectation and the definition of the link function $u$ (on $E_+$ we have $u(\langle w^*, x\rangle) = 1-\eta$, and on $E_-$ we have $u(\langle w^*, x\rangle) = \eta$), we have

$$\mathbb{E}_{x\sim\mathcal{D}_x}[U(x)] = \mathbb{E}_{x}\left[U(x)\,\mathbb{1}_{E_+}\right] + \mathbb{E}_{x}\left[U(x)\,\mathbb{1}_{E_-}\right] = \mathbb{E}_{x}\left[(u(\langle w, x\rangle) - (1-\eta))^2\,\mathbb{1}_{E_+}\right] + \mathbb{E}_{x}\left[(u(\langle w, x\rangle) - \eta)^2\,\mathbb{1}_{E_-}\right].$$

We lower bound both terms: for any thresholds $a \le 1-\eta$ and $b \ge \eta$,

$$\mathbb{E}_{x}\left[(u(\langle w, x\rangle) - (1-\eta))^2\,\mathbb{1}_{E_+}\right] \ge \mathbb{E}_{x}\left[(a - (1-\eta))^2\,\mathbb{1}_{E_+}\,\mathbb{1}\{u(\langle w, x\rangle) \le a\}\right],$$
$$\mathbb{E}_{x}\left[(u(\langle w, x\rangle) - \eta)^2\,\mathbb{1}_{E_-}\right] \ge \mathbb{E}_{x}\left[(b - \eta)^2\,\mathbb{1}_{E_-}\,\mathbb{1}\{u(\langle w, x\rangle) \ge b\}\right].$$

Then, observe that the event $\{\langle w, x\rangle \le \gamma/2\}$ implies the event $\{u(\langle w, x\rangle) \le \frac{3-2\eta}{4}\}$, and similarly the event $\{\langle w, x\rangle \ge -\gamma/2\}$ implies the event $\{u(\langle w, x\rangle) \ge \frac{2\eta+1}{4}\}$. This means that, taking $a = \frac{3-2\eta}{4}$ and $b = \frac{2\eta+1}{4}$,

$$\left(\frac{3-2\eta}{4} - (1-\eta)\right)^2 \mathbb{E}_{x}\left[\mathbb{1}_{E_+}\,\mathbb{1}\left\{u(\langle w, x\rangle) \le \frac{3-2\eta}{4}\right\}\right] \ge \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}_{E_+}\,\mathbb{1}\{\langle w, x\rangle \le \gamma/2\}\right],$$
$$\left(\frac{2\eta+1}{4} - \eta\right)^2 \mathbb{E}_{x}\left[\mathbb{1}_{E_-}\,\mathbb{1}\left\{u(\langle w, x\rangle) \ge \frac{2\eta+1}{4}\right\}\right] \ge \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}_{E_-}\,\mathbb{1}\{\langle w, x\rangle \ge -\gamma/2\}\right].$$

We combine these observations to conclude the proof:

$$\mathbb{E}_{x}\left[\big(u(\langle w, x\rangle) - u(\langle w^*, x\rangle)\big)^2\right] \ge \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}_{E_+}\mathbb{1}\{\langle w, x\rangle \le \gamma/2\} + \mathbb{1}_{E_-}\mathbb{1}\{\langle w, x\rangle \ge -\gamma/2\}\right] = \frac{(1-2\eta)^2}{16}\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right].$$

Note, however, that the squared loss is non-convex in $w$, and so it may not be easy to optimize directly. Luckily, we can get a tight upper bound with the following surrogate loss (see (Kanade, 2018)):

$$\ell(w, (x, y)) = \int_0^{\langle w, x\rangle} (u(s) - y)\,ds. \qquad (8)$$

Note that $\ell(w, (x, y))$ is convex w.r.t. $w$, since the Hessian $\nabla^2_w\,\ell(w, (x, y)) = u'(\langle w, x\rangle)\,x x^{\top}$ is positive semidefinite. Assuming our labels $y$ have been transformed from $\{\pm 1\}$ to $\{0, 1\}$, observe that $\mathbb{E}[y \mid x] = u(\langle w^*, x\rangle)$. We now have the following guarantee (see, e.g., (Cohen, 2014; Kanade, 2018)) for $U(x) = (u(\langle w, x\rangle) - u(\langle w^*, x\rangle))^2$:

$$\mathbb{E}_{x\sim\mathcal{D}_x}[U(x)] \le \frac{1-2\eta}{\gamma}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w, (x, y)) - \ell(w^*, (x, y))\right]. \qquad (9)$$

Proof of Theorem 5.1. Combining Lemma 5.2 and Equation (9), we get the following guarantee for any $w \in \mathbb{R}^d$:

$$\frac{(1-2\eta)\,\gamma}{16}\,\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \le \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w, (x, y)) - \ell(w^*, (x, y))\right].$$

Thus, running Stochastic Mirror Descent with $\epsilon' = \epsilon\gamma(1-2\eta)/16$ and $O(1/\epsilon'^2)$ samples (with labels transformed to $\{0, 1\}$) returns, with high probability, a halfspace $w$ such that

$$\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \le \epsilon. \qquad (10)$$

Then, to conclude the proof, observe that

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\{y\langle w, x\rangle \le \gamma/2\}\right] = \eta\,\mathbb{E}_{x}\left[\mathbb{1}\{-h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] + (1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right]$$
$$= \eta\left(1 - \mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle < -\gamma/2\}\right]\right) + (1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \stackrel{(i)}{\le} \eta + (1-\eta)\,\mathbb{E}_{x}\left[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle \le \gamma/2\}\right] \stackrel{(ii)}{\le} \eta + \epsilon,$$

where (i) follows from the fact that $\mathbb{E}_{x}[\mathbb{1}\{h_{w^*}(x)\langle w, x\rangle < -\gamma/2\}] \ge 0$, and (ii) follows from Equation (10).

## References

Angluin, D. and Laird, P. D. Learning from noisy examples. Machine Learning, 2(4):343–370, 1987.

Awasthi, P., Dutta, A., and Vijayaraghavan, A. On robustness to adversarial examples and polynomial optimization. In Advances in Neural Information Processing Systems, pp. 13737–13747, 2019.

Balcan, M., Blum, A., and Srebro, N. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.

Bubeck, S., Lee, Y. T., Price, E., and Razenshteyn, I. Adversarial examples from computational constraints. In International Conference on Machine Learning, pp. 831–840, 2019.
Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Bylander, T. Learning linear threshold functions in the presence of classification noise. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT 1994), pp. 340–347. ACM, 1994.

Cohen, A. Surrogate Loss Minimization. PhD thesis, Hebrew University of Jerusalem, 2014.

Cullina, D., Bhagoji, A. N., and Mittal, P. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems 31, pp. 230–241. Curran Associates, Inc., 2018.

Daniely, A. Complexity theoretic limitations on learning halfspaces. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 105–117, 2016.

Diakonikolas, I., O'Donnell, R., Servedio, R. A., and Wu, Y. Hardness results for agnostically learning low-degree polynomial threshold functions. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2011), pp. 1590–1606. SIAM, 2011.

Diakonikolas, I., Gouleakis, T., and Tzamos, C. Distribution-independent PAC learning of halfspaces with Massart noise. In Advances in Neural Information Processing Systems, pp. 4751–4762, 2019a.

Diakonikolas, I., Kane, D., and Manurangsi, P. Nearly tight bounds for robust proper learning of halfspaces with a margin. In Advances in Neural Information Processing Systems, pp. 10473–10484, 2019b.

Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. Exploring the landscape of spatial robustness. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1802–1811. PMLR, 2019.

Feldman, V., Gopalan, P., Khot, S., and Ponnuswami, A. K. New results for learning noisy parities and halfspaces. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 563–574. IEEE, 2006.

Feldman, V., Guzman, C., and Vempala, S. S. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2017), pp. 1265–1277. SIAM, 2017.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR 2015), 2015. URL http://arxiv.org/abs/1412.6572.

Gourdeau, P., Kanade, V., Kwiatkowska, M., and Worrell, J. On the hardness of robust classification. In Advances in Neural Information Processing Systems, pp. 7444–7453, 2019.

Guruswami, V. and Raghavendra, P. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Kanade, V. Computational learning theory notes 8: Learning real-valued functions, 2018.
Kang, D., Sun, Y., Hendrycks, D., Brown, T., and Steinhardt, J. Testing robustness against unforeseen adversaries. CoRR, abs/1908.08016, 2019. URL http://arxiv.org/abs/1908.08016.

Khim, J. and Loh, P.-L. Adversarial risk bounds for binary classification via function transformation. arXiv preprint arXiv:1810.09519, 2018.

Lee, Y. T., Sidford, A., and Vempala, S. S. Efficient convex optimization with membership oracles. In Conference On Learning Theory, pp. 1292–1294, 2018.

Maass, W. and Turán, G. How fast can a threshold gate learn? In Proceedings of a Workshop on Computational Learning Theory and Natural Learning Systems (vol. 1): Constraints and Prospects, pp. 381–414, 1994.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 2512–2530. PMLR, 2019.

Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 5019–5031, 2018.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations (ICLR 2014), 2014. URL http://arxiv.org/abs/1312.6199.

Vapnik, V. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.

Yin, D., Ramchandran, K., and Bartlett, P. L. Rademacher complexity for adversarially robust generalization. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pp. 7085–7094. PMLR, 2019.