Consistent Non-Parametric Methods for Maximizing Robustness

Robi Bhattacharjee
University of California San Diego
rcbhatta@eng.ucsd.edu

Kamalika Chaudhuri
University of California San Diego
kamalika@eng.ucsd.edu

Abstract

Learning classifiers that are robust to adversarial examples has received a great deal of recent attention. A major drawback of the standard robust learning framework is that there is an artificial robustness radius r that applies to all inputs. This ignores the fact that data may be highly heterogeneous, in which case it is plausible that robustness regions should be larger in some parts of the space and smaller in others. In this paper, we address this limitation by proposing a new limit classifier, called the neighborhood preserving Bayes optimal classifier, that extends the Bayes optimal classifier outside its support by using the label of the closest in-support point. We then argue that this classifier maximizes the size of its robustness regions subject to the constraint of having accuracy equal to the Bayes optimal. We then present sufficient conditions under which general non-parametric methods that can be represented as weight functions converge towards this limit, and show that both nearest neighbors and kernel classifiers satisfy them under certain conditions.

1 Introduction

Adversarially robust classification, which has received much recent interest, is typically formulated as follows. We are given data drawn from an underlying distribution D, a metric d, and a pre-specified robustness radius r. We say that a classifier c is r-robust at an input x if it predicts the same label on the entire ball of radius r around x. Our goal in robust classification is to find a classifier c that maximizes astuteness, defined as accuracy on those examples where c is also r-robust.

While this formulation has inspired a great deal of recent work, both theoretical and empirical [5, 17, 19, 20, 26, 15, 18, 21, 22, 23, 30], a major limitation is that enforcing a pre-specified robustness radius r may lead to sub-optimal accuracy and robustness. To see this, consider what an ideally robust classifier would look like for the example in Figure 1. For simplicity, suppose that we know the data distribution. In this case, a classifier with a uniformly large robustness radius r will misclassify some points from the blue cluster on the left, leading to lower accuracy. This is illustrated in panel (a), in which the large robustness radius leads to intersecting robustness regions. On the other hand, in panel (b), the blue cluster on the right is highly separated from the red cluster, and could be accurately classified with a high margin. But this will not happen if the robustness radius is set small enough to avoid the problems posed in panel (a). Thus, enforcing a fixed robustness radius that applies to the entire dataset may lead to both lower accuracy and lower robustness.

In this work, we propose an alternative formulation of robust classification which ensures that, in the large sample limit, there is no robustness-accuracy trade-off, and that regions of space with higher separation are classified more robustly. An added advantage is that our formulation is achievable by existing methods. In particular, we show that two very common non-parametric algorithms, nearest neighbors and kernel classifiers, achieve these properties in the large sample limit. Our formulation is built on the notion of a new large-sample limit.
[Figure 1: A data distribution demonstrating the difficulties of fixed-radius balls as robustness regions. Red represents negatively labeled points, and blue positively labeled ones. Panel (a): if the robustness radius is set too large, the regions of A and B intersect, leading to a loss of accuracy. Panel (b): if the radius is set too small, robustness is lost at point C, where in principle it should be possible to defend against a larger range of adversarial attacks.]

In the standard statistical learning framework, the large-sample ideal is the Bayes optimal classifier, which maximizes accuracy on the data distribution and is undefined outside its support. Since the Bayes optimal is not always robust with radius r, prior work introduces the notion of an r-optimal classifier [31], which maximizes accuracy on points where it is also r-robust. However, this classifier suffers from the same challenges illustrated in Figure 1. We depart from both by introducing a new limit that we call the neighborhood preserving Bayes optimal classifier, described as follows. Given an input x that lies in the support of the data distribution D, it predicts the same label as the Bayes optimal. On an x outside the support, it outputs the prediction of the Bayes optimal on the nearest neighbor of x within the support of D. The first property ensures that there is no loss of accuracy, since the classifier always agrees with the Bayes optimal within the data distribution. The second ensures higher robustness in regions that are better separated.

Our goal is now to design classifiers that converge to the neighborhood preserving Bayes optimal in the large sample limit; this ensures that with enough data, the classifier will have accuracy approaching that of the Bayes optimal, as well as higher robustness where possible without sacrificing accuracy.

We next investigate how to design classifiers with this convergence property. Our starting point is classical statistical theory [25], which shows that a class of methods known as weight functions converge to the Bayes optimal in the large sample limit provided certain conditions hold; examples include k-nearest neighbors under certain conditions on k and n, certain kinds of decision trees, and kernel classifiers. Through an analysis of weight functions, we establish precise conditions under which they converge to the neighborhood preserving Bayes optimal in the large sample limit. As expected, these conditions are stronger than those for standard convergence to the Bayes optimal. We show that $k_n$-nearest neighbors converges to the neighborhood preserving Bayes optimal provided $k_n = \omega(\log n)$, and that kernel classifiers converge to it under certain technical conditions (such as the bandwidth shrinking sufficiently slowly). By contrast, certain types of histograms do not converge to the neighborhood preserving Bayes optimal, even though they do converge to the Bayes optimal. We round these results off with a lower bound showing that, for nearest neighbors, the condition $k_n = \omega(\log n)$ is tight: for $k_n = O(\log n)$, there exist distributions on which $k_n$-nearest neighbors provably fails to converge towards the neighborhood preserving Bayes optimal (despite converging towards the standard Bayes optimal).

In summary, the contributions of the paper are as follows.
First, we propose a new large-sample limit, the neighborhood preserving Bayes optimal classifier, together with a new formulation of robust classification. We then establish conditions under which weight functions, a class of non-parametric methods, converge to the neighborhood preserving Bayes optimal in the large sample limit. Using these conditions, we show that $k_n$-nearest neighbors satisfies them when $k_n = \omega(\log n)$, and that kernel classifiers satisfy them provided the kernel function K has faster-than-polynomial decay and the bandwidth parameter $h_n$ decreases sufficiently slowly.

To complement these results, we also include negative examples of non-parametric classifiers that do not converge. We provide an example where histogram classifiers do not converge to the neighborhood preserving Bayes optimal with increasing n. We also show a lower bound for nearest neighbors, indicating that $k_n = \omega(\log n)$ is both necessary and sufficient for convergence towards the neighborhood preserving Bayes optimal.

Our results indicate that the neighborhood preserving Bayes optimal formulation shows promise and has some interesting theoretical properties. We leave open the question of finding other alternative formulations that better balance robustness and accuracy for all kinds of data distributions while remaining achievable algorithmically. We believe that addressing this would greatly help address the challenges in adversarial robustness.

2 Preliminaries

We consider binary classification over $\mathbb{R}^d \times \{\pm 1\}$, and let $\rho$ denote a distance metric on $\mathbb{R}^d$. We let $\mu$ denote the measure over $\mathbb{R}^d$ corresponding to the probability distribution from which instances $x \in \mathbb{R}^d$ are drawn. Each instance x is labeled +1 with probability $\eta(x)$ and -1 with probability $1 - \eta(x)$. Together, $\mu$ and $\eta$ comprise our data distribution $D = (\mu, \eta)$ over $\mathbb{R}^d \times \{\pm 1\}$.

For comparison with the robust case, for a classifier $f : \mathbb{R}^d \to \{\pm 1\}$ and a distribution D over $\mathbb{R}^d \times \{\pm 1\}$, it is instructive to consider its accuracy, denoted $A(f, D)$, defined as the fraction of examples from D that f labels correctly. Accuracy is maximized by the Bayes optimal classifier, which we denote by g. It can be shown that for any $x \in \mathrm{supp}(\mu)$, $g(x) = +1$ if $\eta(x) \geq \frac{1}{2}$, and $g(x) = -1$ otherwise.

Our goal is to build classifiers $\mathbb{R}^d \to \{\pm 1\}$ that are both accurate and robust to small perturbations. For any example x, perturbations of x are constrained to lie in the robustness region of x, denoted $U_x$. We let $U = \{U_x : x \in \mathbb{R}^d\}$ denote the collection of all robustness regions. We say that a classifier $f : \mathbb{R}^d \to \{\pm 1\}$ is robust at x if $f(x') = f(x)$ for all $x' \in U_x$. Combining robustness and accuracy, we say that a classifier is astute at a point x if it is both accurate and robust there. Formally, we have the following definition.

Definition 1. A classifier $f : \mathbb{R}^d \to \{\pm 1\}$ is said to be astute at (x, y) with respect to robustness collection U if $f(x) = y$ and f is robust at x with respect to U. If D is a data distribution over $\mathbb{R}^d \times \{\pm 1\}$, the astuteness of f over D with respect to U, denoted $A_U(f, D)$, is the fraction of examples $(x, y) \sim D$ at which f is astute with respect to U. Thus $A_U(f, D) = P_{(x,y) \sim D}[f(x') = y \ \text{for all} \ x' \in U_x]$.
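Definition 1 is straightforward to estimate from samples whenever the robustness regions can be probed. The following is a minimal sketch (function names are illustrative, not from the paper; since sampling can only detect violations, it over-estimates true astuteness and is a sanity check rather than a certificate):

```python
import numpy as np

def empirical_astuteness(predict, X, y, region_sampler, n_probe=100):
    """Monte-Carlo estimate of astuteness A_U(f, D) on a labeled sample
    (X, y). `predict` maps an array of points to labels in {-1, +1};
    `region_sampler(x, m)` returns m points from the robustness region
    U_x (a stand-in, since U_x is rarely enumerable exactly)."""
    astute = 0
    for x, label in zip(X, y):
        if predict(x[None])[0] != label:
            continue  # f is not accurate at x, hence not astute there
        probes = region_sampler(x, n_probe)
        if np.all(predict(probes) == label):  # robust on all sampled probes
            astute += 1
    return astute / len(X)
```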
Non-parametric Classifiers. We now briefly review several kinds of non-parametric classifiers that we will consider throughout this paper. We begin with weight functions, a general class of non-parametric algorithms that encompasses many classic methods, including nearest neighbors and kernel classifiers.

Weight functions are built from a training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ by assigning to each training point a function $w_i^S : \mathbb{R}^d \to [0, 1]$ that scores how relevant the training point $(x_i, y_i)$ is to the example being classified. The functions $w_i^S$ may depend on $x_1, \ldots, x_n$ but must be independent of the labels $y_1, \ldots, y_n$. Given these functions, a point x is classified by checking the sign of $\sum_{i=1}^n y_i w_i^S(x)$: if it is nonnegative, we output +1, and otherwise -1. A complete description of weight functions is included in the appendix.

Next, we enumerate several common non-parametric classifiers that can be construed as weight functions; details can be found in the appendix.

Histogram classifiers partition the domain $\mathbb{R}^d$ into cells, recursively splitting cells that contain a sufficiently large number of points $x_i$. This corresponds to the weight function $w_i^S(x) = \frac{1}{k_x}$ if $x_i$ lies in the same cell as x, where $k_x$ denotes the number of training points in the cell containing x, and $w_i^S(x) = 0$ otherwise.

$k_n$-nearest neighbors corresponds to the weight function $w_i^S(x) = \frac{1}{k_n}$ if $x_i$ is one of the $k_n$ nearest neighbors of x, and $w_i^S(x) = 0$ otherwise.

Kernel-similarity classifiers are weight functions built from a kernel function $K : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ and a sequence of window sizes $(h_n)_1^\infty$, with $w_i^S(x) \propto K(\rho(x, x_i)/h_n)$, normalized by dividing by $\sum_{j=1}^n K(\rho(x, x_j)/h_n)$. A code sketch of these constructions follows.
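To make the weight-function abstraction concrete, here is a minimal sketch instantiating the $k_n$-nearest-neighbor and kernel-similarity weights above, assuming Euclidean $\rho$ (all function names are illustrative, not from the paper):

```python
import numpy as np

def knn_weights(X_train, x, k):
    """k-NN weights: 1/k on the k nearest training points, 0 elsewhere."""
    dist = np.linalg.norm(X_train - x, axis=1)
    w = np.zeros(len(X_train))
    w[np.argsort(dist)[:k]] = 1.0 / k
    return w

def kernel_weights(X_train, x, h, K=lambda t: np.exp(-t ** 2)):
    """Kernel-similarity weights w_i(x) = K(rho(x, x_i)/h), normalized
    to sum to 1 (Gaussian kernel as the default example)."""
    w = K(np.linalg.norm(X_train - x, axis=1) / h)
    return w / w.sum()

def predict(weights_fn, X_train, y_train, x):
    """Weight-function prediction: the sign of the weighted vote
    sum_i y_i * w_i(x), with +1 on ties, matching the convention above."""
    return 1 if np.dot(y_train, weights_fn(X_train, x)) >= 0 else -1
```

For instance, `predict(lambda X, z: knn_weights(X, z, 5), X_train, y_train, x0)` classifies x0 by a 5-nearest-neighbor vote, while passing `kernel_weights` instead yields a kernel classifier.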
[Figure 2: The decision boundary of the neighborhood preserving Bayes optimal classifier is shown in green, and the neighborhood preserving robustness region of x is shown in pink. The former consists of points equidistant from $\mu^+$ and $\mu^-$; the latter is bounded by points equidistant from x and $\mu^+$.]

3 The Neighborhood Preserving Bayes Optimal Classifier

Robust classification is typically studied by setting the robustness regions $U = \{U_x\}_{x \in \mathbb{R}^d}$ to be balls of radius r centered at x: $U_x = \{x' : \rho(x, x') \leq r\}$. The quantity r is the robustness radius, and is typically set by the practitioner before any training has occurred. This approach has a limitation with regard to trade-offs between accuracy and robustness. To increase the margin of robustness, we must use a large robustness radius (allowing us to defend against larger adversarial attacks). However, a large robustness radius can come at a cost in accuracy, as it is not possible to robustly assign different labels to points with intersecting robustness regions.

For an illustration, consider Figure 1. Here we consider a data distribution $D = (\mu, \eta)$ in which the blue regions denote all points with $\eta(x) > 0.5$ (which should be labeled +1), and the red regions denote all points with $\eta(x) < 0.5$ (which should be labeled -1). Observe that it is not possible to be simultaneously accurate and robust at points A and B while enforcing a large robustness radius, as demonstrated by the intersecting balls. While this can be resolved by using a smaller radius, doing so loses potential robustness at point C: in principle, we should be able to afford a large margin of robustness about C because of its relatively large distance from the red regions.

Motivated by this issue, we seek a formalism for robustness that simultaneously avoids any accuracy-robustness trade-off and adaptively sizes the robustness regions (allowing us to defend against a larger range of adversarial attacks at points located in more homogeneous zones of the distribution support).

To approach this, we first define an ideal limit object: a classifier that has the same accuracy as the Bayes optimal (meeting our first criterion) and that has good robustness properties. We call this the neighborhood preserving Bayes optimal classifier, defined as follows.

Definition 2. Let $D = (\mu, \eta)$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. Then the neighborhood preserving Bayes optimal classifier of D, denoted $g_{neighbor}$, is defined as follows. Let $\mu^+ = \{x : \eta(x) \geq \frac{1}{2}\}$ and $\mu^- = \{x : \eta(x) < \frac{1}{2}\}$. Then for any $x \in \mathbb{R}^d$, $g_{neighbor}(x) = +1$ if $\rho(x, \mu^+) \leq \rho(x, \mu^-)$, and $g_{neighbor}(x) = -1$ otherwise.

This classifier can be thought of as the most robust classifier that matches the accuracy of the Bayes optimal. We call it neighborhood preserving because it extends the Bayes optimal classifier into a local neighborhood about every point in the support. For an illustration, refer to Figure 2, which plots the decision boundary of the neighborhood preserving Bayes optimal for an example distribution.

Next, we turn our attention towards measuring its robustness, which must be done with respect to some set of robustness regions $U = \{U_x\}$. While these regions can be nearly arbitrary, we seek regions $U_x$ such that $A_U(g_{neighbor}, D) = A(g, D)$ (our astuteness equals the maximum possible accuracy) and the $U_x$ are "as large as possible" (representing large robustness). To this end, we propose the following regions.

Definition 3. Let $D = (\mu, \eta)$ be a data distribution over $\mathbb{R}^d \times \{\pm 1\}$. Let $\mu^+ = \{x : \eta(x) > \frac{1}{2}\}$, $\mu^- = \{x : \eta(x) < \frac{1}{2}\}$, and $\mu^{1/2} = \{x : \eta(x) = \frac{1}{2}\}$. For $x \in \mu^+$, we define the neighborhood preserving robustness region of x, denoted $V_x$, as $V_x = \{x' : \rho(x, x') < \rho(\mu^- \cup \mu^{1/2}, x')\}$. It consists of all points that are closer to x than they are to $\mu^- \cup \mu^{1/2}$ (the points oppositely labeled from x). The definition for $x \in \mu^-$ is symmetric. Finally, if $x \in \mu^{1/2}$, we simply set $V_x = \{x\}$.

These robustness regions take advantage of the structure of the neighborhood preserving Bayes optimal: they can be thought of as regions that extend maximally from any point x in the support of D to the decision boundary of the neighborhood preserving Bayes optimal. We include an illustration of the regions $V_x$ for an example distribution in Figure 2. As a technical note, points $x \in \mathrm{supp}(D)$ with $\eta(x) = 0.5$ receive a trivial robustness region. The rationale is that $\eta(x) = 0.5$ is an edge case whose classification is arbitrary, and consequently enforcing a nontrivial robustness region at such a point is both arbitrary and difficult.

We now formalize the robustness and accuracy guarantees of the neighborhood preserving Bayes optimal classifier with the following two results.

Theorem 4 (Accuracy). Let D be a data distribution, let V denote the collection of neighborhood preserving robustness regions, and let g denote the Bayes optimal classifier. Then the neighborhood preserving Bayes optimal classifier $g_{neighbor}$ satisfies $A_V(g_{neighbor}, D) = A(g, D)$, where $A(g, D)$ denotes the accuracy of the Bayes optimal. Thus, $g_{neighbor}$ maximizes accuracy.

Theorem 5 (Robustness). Let D be a data distribution, let f be a classifier, and let U be a set of robustness regions. Suppose that $A_U(f, D) = A(g, D)$, where g denotes the Bayes optimal classifier. Then there exists $x \in \mathrm{supp}(D)$ such that $V_x \not\subset U_x$, where $V_x$ denotes the neighborhood preserving robustness region about x. In particular, we cannot have $V_x$ be a strict subset of $U_x$ for all x.
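Before discussing these results, note that Definitions 2 and 3 are straightforward to operationalize when the support is known. Below is a minimal sketch, assuming $\mu^+$ and $\mu^-$ are represented as finite arrays of support points and $\rho$ is Euclidean (function names are illustrative, not from the paper):

```python
import numpy as np

def g_neighbor(x, mu_plus, mu_minus):
    """Neighborhood preserving Bayes optimal prediction (Definition 2):
    predict +1 iff x is at least as close to mu_plus as to mu_minus,
    where mu_plus / mu_minus are (m, d) arrays approximating the two sets."""
    d_plus = np.linalg.norm(mu_plus - x, axis=1).min()
    d_minus = np.linalg.norm(mu_minus - x, axis=1).min()
    return 1 if d_plus <= d_minus else -1

def in_V(x_prime, x, opposite):
    """Membership test for V_x (Definition 3): x' belongs to V_x iff it
    is strictly closer to x than to the oppositely labeled set."""
    return bool(np.linalg.norm(x_prime - x) <
                np.linalg.norm(opposite - x_prime, axis=1).min())
```

On points inside the support, `g_neighbor` agrees with the Bayes optimal; outside, it copies the label of the nearest in-support point, which is exactly the extension described in the introduction.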
Theorem 4 shows that the neighborhood preserving Bayes optimal classifier achieves maximal accuracy, while Theorem 5 shows that achieving strictly higher robustness (while maintaining accuracy) is not possible: while it is possible to construct accurate classifiers with higher robustness than $g_{neighbor}$ in some regions of space, this cannot hold across all regions. Thus, the neighborhood preserving Bayes optimal classifier can be thought of as a local maximum of the constrained optimization problem of maximizing robustness subject to having maximum accuracy (equal to that of the Bayes optimal).

3.1 Neighborhood Consistency

Having defined the neighborhood preserving Bayes optimal classifier, we now turn our attention towards building classifiers that converge towards it. Before doing so, we must precisely define what it means to converge. Intuitively, this means building classifiers whose robustness regions "approach" the robustness regions of the neighborhood preserving Bayes optimal classifier. This motivates the definition of partial neighborhood preserving robustness regions.

Definition 6. Let $0 < \kappa < 1$ be a real number, and let $D = (\mu, \eta)$ be a data distribution over $\mathbb{R}^d \times \{\pm 1\}$. Let $\mu^+ = \{x : \eta(x) > \frac{1}{2}\}$, $\mu^- = \{x : \eta(x) < \frac{1}{2}\}$, and $\mu^{1/2} = \{x : \eta(x) = \frac{1}{2}\}$. For $x \in \mu^+$, we define the partial neighborhood preserving robustness region, denoted $V_x^\kappa$, as $V_x^\kappa = \{x' : \rho(x, x') < \kappa \, \rho(\mu^- \cup \mu^{1/2}, x')\}$. It consists of all points that are closer to x than they are to $\mu^- \cup \mu^{1/2}$ (the points oppositely labeled from x) by a factor of $\kappa$. The definition for $x \in \mu^-$ is symmetric. Finally, if $x \in \mu^{1/2}$, we simply set $V_x^\kappa = \{x\}$.

Observe that $V_x^\kappa \subseteq V_x$ for all $0 < \kappa < 1$, and thus being robust with respect to $V_x^\kappa$ is a milder condition than being robust with respect to $V_x$. Using this notion, we can now define neighborhood consistency.

Definition 7. A learning algorithm A is said to be neighborhood consistent if the following holds for any data distribution D. For any $0 < \epsilon, \delta, \kappa < 1$, there exists N such that for all $n \geq N$, with probability at least $1 - \delta$ over $S \sim D^n$, $A_{V^\kappa}(A_S, D) \geq A(g, D) - \epsilon$, where g denotes the Bayes optimal classifier and $A_S$ denotes the classifier learned by algorithm A from dataset S.

This condition says that the astuteness of the classifier learned by the algorithm converges towards the accuracy of the Bayes optimal classifier, and that this holds when astuteness is measured with respect to any $V^\kappa$. Observe that as $\kappa \to 1$, these regions converge towards the neighborhood preserving robustness regions, giving a classifier with robustness effectively equal to that of the neighborhood preserving Bayes optimal classifier.
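The partial regions of Definition 6 admit the same kind of point-cloud test as before; a one-line variant of `in_V` above (again an illustrative sketch under the same assumptions, not the paper's code):

```python
import numpy as np

def in_V_kappa(x_prime, x, opposite, kappa):
    """Membership test for the partial region V^kappa_x of Definition 6:
    x' belongs to V^kappa_x iff rho(x, x') < kappa * rho(opposite, x'),
    where `opposite` is a point-cloud stand-in for the oppositely
    labeled support."""
    d_opp = np.linalg.norm(opposite - x_prime, axis=1).min()
    return bool(np.linalg.norm(x_prime - x) < kappa * d_opp)
```

Taking $\kappa \to 1$ recovers the test for $V_x$, matching the observation that the regions $V_x^\kappa$ grow towards $V_x$.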
4 Neighborhood Consistent Non-Parametric Classifiers

Having defined neighborhood consistency, we turn to the following question: which non-parametric algorithms are neighborhood consistent? Our starting point is the standard literature on the convergence of non-parametric classifiers with regard to accuracy. We begin by considering the standard conditions for $k_n$-nearest neighbors to converge (in accuracy) towards the Bayes optimal: $k_n$-nearest neighbors is consistent if and only if $\lim_{n \to \infty} k_n = \infty$ and $\lim_{n \to \infty} k_n / n = 0$. The first condition guarantees that each point is classified using an increasing number of nearest neighbors (making the probability of misclassification small), and the second guarantees that each point is classified using only points very close to it.

We will refer to the first condition as precision, and the second as locality. A natural question is whether the same principles suffice for neighborhood consistency as well. We begin by showing that without any additional constraints, the answer is no.

Theorem 8. Let $D = (\mu, \eta)$ be the data distribution where $\mu$ is the uniform distribution over $[0, 1]$ and $\eta(x) = x$, and let $\rho$ be the Euclidean distance metric. Suppose $k_n = O(\log n)$ for $1 \leq n < \infty$. Then $k_n$-nearest neighbors is not neighborhood consistent with respect to D.

The issue in the example above is that for smaller $k_n$, $k_n$-nearest neighbors lacks sufficient precision. For neighborhood consistency, points must be labeled using even more training points than are needed for accuracy, because the classifier must be uniformly correct across the entirety of $V_x^\kappa$. Thus, to build neighborhood consistent classifiers, we must bolster the precision beyond the amount needed for standard consistency. To do this, we introduce splitting numbers, a useful tool for bolstering the precision of weight functions.

4.1 Splitting Numbers

We now generalize beyond nearest neighbors to weight functions, which allows us to simultaneously analyze nearest neighbors and kernel classifiers. To do so, we must first substantiate our intuitions about increasing precision into concrete requirements. This requires several technical definitions.

Definition 9. Let $\mu$ be a probability measure over $\mathbb{R}^d$. For any $x \in \mathbb{R}^d$, the probability radius $r_p(x)$ is the smallest radius for which $B(x, r_p(x))$ has probability mass at least p. More precisely, $r_p(x) = \inf\{r : \mu(B(x, r)) \geq p\}$.

Definition 10. Let W be a weight function and let $S = \{x_1, x_2, \ldots, x_n\}$ be any finite subset of $\mathbb{R}^d$. For any $x \in \mathbb{R}^d$, $\alpha \geq 0$, and $0 \leq \beta \leq 1$, let $W_{x,\alpha,\beta} = \{i : \rho(x, x_i) \leq \alpha, \ w_i^S(x) \geq \beta\}$. Then the splitting number of W with respect to S, denoted $T(W, S)$, is the number of distinct subsets generated by $W_{x,\alpha,\beta}$ as x ranges over $\mathbb{R}^d$, $\alpha$ over $[0, \infty)$, and $\beta$ over $[0, 1]$. Thus $T(W, S) = |\{W_{x,\alpha,\beta} : x \in \mathbb{R}^d, \ \alpha \geq 0, \ 0 \leq \beta \leq 1\}|$.

Splitting numbers allow us to ensure a high amount of precision for a weight function. To prove neighborhood consistency, a classifier must be correct at all points in a given region, so techniques that consider a single point are insufficient; the splitting number provides a mechanism for studying entire regions simultaneously. For more details on splitting numbers, we include several examples in the appendix.

4.2 Sufficient Conditions for Neighborhood Consistency

We now state our main result.

Theorem 11. Let W be a weight function, D a distribution over $\mathbb{R}^d \times \{\pm 1\}$, U a neighborhood preserving collection, and $(t_n)_1^\infty$ a sequence of positive integers such that the following four conditions hold.
1. W is consistent (with respect to accuracy) with respect to D.
2. For any $0 < p < 1$, $\lim_{n \to \infty} E_{S \sim D^n}[\sup_{x \in \mathbb{R}^d} \sum_{i=1}^n w_i^S(x) \mathbf{1}[\rho(x, x_i) > r_p(x)]] = 0$.
3. $\lim_{n \to \infty} E_{S \sim D^n}[t_n \sup_{x \in \mathbb{R}^d} \max_i w_i^S(x)] = 0$.
4. $\lim_{n \to \infty} E_{S \sim D^n}[\log T(W, S)] / t_n = 0$.
Then W is neighborhood consistent with respect to D.

Remarks: Condition 1 is necessary because neighborhood consistency implies standard consistency, that is, convergence in accuracy to the Bayes optimal. Standard consistency has been well studied for non-parametric classifiers, and there are a variety of results that can be used to ensure it, for example Stone's theorem (included in the appendix). Conditions 2 and 3 are stronger versions of conditions 2 and 3 of Stone's theorem. In particular, both include a supremum taken over all $x \in \mathbb{R}^d$, as opposed to simply considering a random point $x \sim D$. This is necessary for ensuring correct labels on entire regions of points simultaneously. We also note that the dependence on $r_p(x)$ (as opposed to some fixed r) is a key property used for adaptive robustness: it allows the algorithm to adjust to differing distance scales over different regions of $\mathbb{R}^d$. This idea is reminiscent of the analysis given in [6], which also considers probability radii. Condition 4 is an entirely new condition that allows us to simultaneously consider all $T(W, S)$ subsets of S; it is needed for analyzing weighted sums with arbitrary weights.
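Definition 10 can also be probed numerically: for a fixed weight function and training set, enumerating the sets $W_{x,\alpha,\beta}$ over finite grids gives a lower bound on $T(W, S)$. A minimal sketch, reusing the illustrative `weights_fn` interface from Section 2 (hypothetical names; the true splitting number takes a supremum over all of $\mathbb{R}^d \times [0, \infty) \times [0, 1]$, so this is only a lower bound):

```python
import numpy as np
from itertools import product

def splitting_number_lb(weights_fn, X_train, x_grid, alphas, betas):
    """Brute-force lower bound on the splitting number T(W, S): count
    the distinct index sets W_{x, alpha, beta} = {i : rho(x, x_i) <= alpha
    and w_i(x) >= beta} over finite grids of x, alpha, and beta."""
    seen = set()
    for x, alpha, beta in product(x_grid, alphas, betas):
        dist = np.linalg.norm(X_train - x, axis=1)
        w = weights_fn(X_train, x)
        seen.add(frozenset(np.flatnonzero((dist <= alpha) & (w >= beta))))
    return len(seen)
```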
Next, we apply Theorem 11 to obtain specific examples of neighborhood consistent non-parametric algorithms.

4.3 Nearest Neighbors and Kernel Classifiers

We now provide sufficient conditions for $k_n$-nearest neighbors to be neighborhood consistent.

Corollary 12. Suppose $(k_n)_1^\infty$ satisfies (1) $\lim_{n \to \infty} k_n / n = 0$ and (2) $\lim_{n \to \infty} (\log n) / k_n = 0$. Then $k_n$-nearest neighbors is neighborhood consistent.

As a result of Theorem 8, Corollary 12 is tight for nearest neighbors: $k_n$-nearest neighbors is neighborhood consistent if and only if $k_n = \omega(\log n)$. Next, we give sufficient conditions for kernel-similarity classifiers.

Corollary 13. Let W be a kernel classifier over $\mathbb{R}^d \times \{\pm 1\}$ constructed from $K : \mathbb{R}^+ \to \mathbb{R}^+$ and $(h_n)$. Suppose the following properties hold.
1. K is decreasing and satisfies $\int_{\mathbb{R}^d} K(\|x\|) \, dx < \infty$.
2. $\lim_{n \to \infty} h_n = 0$ and $\lim_{n \to \infty} n h_n^d = \infty$.
3. For any $c > 1$, $\lim_{x \to \infty} K(cx) / K(x) = 0$.
4. For any $x \geq 0$, $\lim_{n \to \infty} \frac{n}{\log n} K(x / h_n) = \infty$.
Then W is neighborhood consistent.

Observe that conditions 1, 2, and 3 are satisfied by many common kernel functions, such as the Gaussian kernel ($K(x) = \exp(-x^2)$) and the exponential kernel ($K(x) = \exp(-x)$). Condition 4 can then be satisfied by taking $h_n$ sufficiently large. Overall, this corollary states that kernel classification is neighborhood consistent as long as the bandwidth shrinks slowly enough.
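One concrete pair of parameter schedules meeting these corollaries, shown below, may help build intuition. These are illustrative choices, not prescribed by the paper; the corollaries permit any schedules with the stated asymptotics:

```python
import numpy as np

def kn_schedule(n):
    """k_n = ceil((log n)^2): omega(log n) yet o(n), so it satisfies
    both conditions of Corollary 12 (illustrative choice, n >= 2)."""
    return max(1, int(np.ceil(np.log(n) ** 2)))

def hn_schedule(n):
    """h_n = (log n)^(-1/4): shrinks to 0, but slowly enough that for a
    Gaussian kernel K(x) = exp(-x^2) we get, for every fixed x,
    (n / log n) * K(x / h_n) = (n / log n) * exp(-x^2 * sqrt(log n)) -> inf,
    as condition 4 of Corollary 13 requires; n * h_n^d -> infinity also
    holds for every fixed d (illustrative choice, n >= 3)."""
    return np.log(n) ** -0.25
```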
4.4 Histogram Classifiers

Having discussed neighborhood consistent nearest neighbors and kernel classifiers, we now turn our attention towards another popular weight function: histogram classifiers. Recall that histogram classifiers operate by partitioning their input space into increasingly small cells, and then classifying each cell by a majority vote over the training examples within that cell (a detailed description can be found in the appendix). We seek to answer the following question: is increased precision sufficient to make histogram classifiers neighborhood consistent?

Unfortunately, the answer turns out to be no. The main issue is that histogram classifiers have no mechanism for performing classification outside the support of the data distribution. For an example, refer to Figure 3, which shows a distribution being classified by a histogram classifier. Observe that the cell labeled A contains points that are strictly closer to $\mu^+$ than to $\mu^-$, and consequently, for sufficiently large $\kappa$, $V_x^\kappa$ will intersect A for some point $x \in \mu^+$. A similar argument holds for the cells labeled B and C. However, since the cells A, B, and C will never contain any data, they will never be labeled in a meaningful way. Because of this, histogram classifiers are not neighborhood consistent.

[Figure 3: A histogram classifier applied to the blue and red regions. The classifier is unable to construct good labels in the cells labeled A, B, and C, and consequently is not robust with respect to $V_x^\kappa$ for sufficiently large $\kappa$.]

5 Validation

[Figure 4: Plots of astuteness against training sample size for (a) an exponential kernel and (b) a polynomial kernel. In both panels, accuracy is plotted in red, and astuteness for the varying levels of robustness regions ($\kappa$ = 0.1, 0.3, 0.5) is given in blue, green, and purple. In panel (a), as the sample size increases, every measure of astuteness converges towards 0.8, as predicted by Corollary 13. In panel (b), although the accuracy appears to converge, none of the robustness measures do; in fact, they get progressively worse as $\kappa$ grows.]

To complement our theoretical large-sample results for non-parametric classifiers, we now include several experiments to understand their behavior on finite samples. We seek to understand how quickly non-parametric classifiers converge towards the neighborhood preserving Bayes optimal. We focus our attention on kernel classifiers and use two different kernel similarity functions: an exponential kernel and a polynomial kernel. These classifiers were chosen so that the former meets the conditions of Corollary 13 and the latter does not. Full details on these classifiers can be found in the appendix.

To measure performance with increasing data size, we use a simple synthetic dataset of overlaid circles (see Figure 5 for an illustration), with the support designed so that the data is intrinsically multiscaled; in particular, this calls for different levels of robustness in different regions. For simplicity, we use a global label noise parameter of 0.2, meaning that any sample drawn from this distribution is labeled differently than its support region with probability 0.2. Further details about the dataset are given in Section D.

Performance Measure. For a given classifier, we evaluate its astuteness at a test point x with respect to the robustness region $V_x^\kappa$ (Definition 6). While these regions are not computable in practice, due to their dependence on the support of the data distribution, we are able to approximate them for this synthetic example thanks to our explicit knowledge of the distribution; details can be found in the appendix. To compute the empirical astuteness of a kernel classifier $W_K$ about a test point x, we perform a grid search over all points in $V_x^\kappa$ to verify that every point in the robustness region is labeled correctly. For each classifier, we measure empirical astuteness over three trials of 20 test points and take the average. While this is a relatively small amount of test data, it suffices for our purpose, which is simply to verify that the algorithm roughly converges towards the optimal possible astuteness. Recall that for any neighborhood consistent algorithm, as $n \to \infty$, $A_{V^\kappa}$ should converge towards $A^*$, the accuracy of the Bayes optimal classifier, for any $0 < \kappa < 1$. To verify this, we use $\kappa$ = 0.1, 0.3, 0.5, and for each value we plot the empirical astuteness as the training sample size n grows. As a baseline, we also plot standard accuracy on the test set. A sketch of this per-point evaluation appears below.
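The following sketch mirrors the grid-search check described above, under the same point-cloud and Euclidean-metric assumptions as the earlier snippets (the `grid` and `opposite` arrays are illustrative stand-ins for the appendix's exact construction):

```python
import numpy as np

def astute_at(predict, x, y, opposite, kappa, grid):
    """Grid-search astuteness check at a test point (x, y) with respect
    to V^kappa_x: every grid point x' satisfying
    rho(x, x') < kappa * rho(opposite, x')  (Definition 6)
    must receive the label y. `opposite` approximates the oppositely
    labeled support; `grid` covers a neighborhood of x."""
    if predict(x[None])[0] != y:
        return False  # not even accurate at x
    d_x = np.linalg.norm(grid - x, axis=1)
    d_opp = np.array([np.linalg.norm(opposite - g, axis=1).min() for g in grid])
    region = grid[d_x < kappa * d_opp]  # grid points inside V^kappa_x
    return len(region) == 0 or bool(np.all(predict(region) == y))
```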
Results and Discussion. The results are presented in Figure 4; the left panel is for the exponential kernel, and the right one is for the polynomial kernel. As predicted by our theory, the exponential kernel converges towards the maximum astuteness regardless of the value of $\kappa$; the only difference is that the rate of convergence is slower for larger values of $\kappa$. This is, of course, expected, because larger values of $\kappa$ entail larger robustness regions. By contrast, the polynomial kernel performs progressively worse for larger values of $\kappa$. This kernel was selected specifically to violate the conditions of Corollary 13, and in particular it fails condition 3. Note, however, that the polynomial kernel nevertheless performs well with respect to accuracy, giving another example of the added difficulty of neighborhood consistency relative to standard consistency.

These results bridge the gap between our asymptotic theoretical results and the finite-sample regime. In particular, we see that kernel classifiers that meet the conditions of Corollary 13 converge in astuteness towards the neighborhood preserving Bayes optimal classifier, while classifiers that do not meet these conditions fail to.

6 Related Work

There is a wealth of literature on robust classification, most of which imposes the same robustness radius r on the entire data. [5, 17, 19, 20, 26, 15, 16, 18, 21, 22, 23], among others, focus primarily on neural networks, and on robustness regions that are $\ell_1$, $\ell_2$, or $\ell_\infty$ norm balls of a given radius r. [7] and [12] show how to train neural networks with different robustness radii at different points by trading off robustness and accuracy; their work differs from ours in that they focus on neural networks, their robustness regions are still norm balls, and their work is largely empirical.

Our framework is also related to large-margin classification, in the sense that the robustness regions U induce a margin constraint on the decision boundary. The most popular large-margin classifier is the Support Vector Machine [9, 3, 14], a large-margin linear classifier that maximizes the worst-case margin over the training data. Similar ideas have also been used to design classifiers that are more flexible than linear; for example, [27] shows how to build large-margin Lipschitz classifiers by rounding globally Lipschitz functions. Finally, there has also been purely empirical work on achieving large margins for more complex classifiers, such as [13], which maximizes the worst-case margin for deep neural networks, and [29], which uses metric learning to find large-margin nearest neighbors. Our work differs from these in that our goal is to ensure a high enough local margin at each x (by considering the neighborhood preserving regions $V_x$), as opposed to optimizing a global margin.

Finally, our analysis builds on prior work on robust classification for non-parametric methods in the standard framework. [1, 24, 28, 31] provide adversarial attacks on non-parametric methods. Wang et al. [28] develop a defense for 1-NN that removes a subset of the training set to ensure higher robustness. Yang et al. [31] propose the r-optimal classifier, the maximally astute classifier in the standard robustness framework, along with a defense called Adversarial Pruning. On the theoretical side, [4] provide conditions under which weight functions converge towards the r-optimal classifier in the large sample limit. They show that for r-separated distributions, in which points from different classes are at least distance 2r apart, nearest neighbors and kernel classifiers satisfy these conditions. In the more general case, they use Adversarial Pruning as a preprocessing step to ensure that the training data is r-separated, and show that this preprocessing followed by nearest neighbors or kernel classifiers leads to solutions that are robust and accurate in the large sample limit.
Our result fundamentally differs from theirs in that we analyze a different algorithm, and our proof techniques are quite different. In particular, the fundamental differences between the r-optimal classifier and the neighborhood preserving Bayes optimal classifier call for different algorithms and different analysis techniques. In concurrent work, [8] proposes a limit similar to the neighborhood preserving Bayes optimal, which they refer to as the margin canonical Bayes. However, their work focuses on a data augmentation technique that leads to convergence, whereas we focus on proving the neighborhood consistency of classical non-parametric classifiers.

References

[1] Laurent Amsaleg, James Bailey, Dominique Barbe, Sarah M. Erfani, Michael E. Houle, Vinh Nguyen, and Milos Radovanovic. The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality. In 2017 IEEE Workshop on Information Forensics and Security, WIFS 2017, Rennes, France, December 4-7, 2017, pages 1-6, 2017.

[2] Robert B. Ash. Information Theory. Dover Publications, 1990.

[3] Kristin P. Bennett and Erin J. Bredensteiner. Duality and geometry in SVM classifiers. In Pat Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 57-64. Morgan Kaufmann, 2000.

[4] Robi Bhattacharjee and Kamalika Chaudhuri. When are non-parametric methods robust? In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 832-841. PMLR, 2020.

[5] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pages 39-57, 2017.

[6] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3437-3445. Curran Associates, Inc., 2014.

[7] Minhao Cheng, Qi Lei, Pin-Yu Chen, Inderjit S. Dhillon, and Cho-Jui Hsieh. CAT: Customized adversarial training for improved robustness. CoRR, abs/2002.06789, 2020.

[8] Sadia Chowdhury and Ruth Urner. On the (un-)avoidability of adversarial examples. CoRR, abs/2106.13326, 2021.

[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[10] Sanjoy Dasgupta, Daniel J. Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 353-360. Curran Associates, Inc., 2007.

[11] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, 1996.

[12] Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang. MMA training: Direct input space margin maximization through adversarial training. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[13] Gamaleldin F. Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 850-860, 2018.

[14] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.

[15] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2266-2276. Curran Associates, Inc., 2017.

[16] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Towards proving the adversarial robustness of deep neural networks. In Proceedings First Workshop on Formal Verification of Autonomous Vehicles, FVAV@iFM 2017, Turin, Italy, 19th September 2017, pages 19-26, 2017.

[17] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[19] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. ASIACCS, 2017.

[20] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21-24, 2016, pages 372-387, 2016.

[21] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, pages 582-597, 2016.

[22] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[23] Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[24] Chawin Sitawarin and David A. Wagner. On the robustness of deep k-nearest neighbors. In 2019 IEEE Security and Privacy Workshops, SP Workshops 2019, San Francisco, CA, USA, May 19-23, 2019, pages 1-7, 2019.

[25] Charles Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595-645, 1977.
[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[27] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, volume 2777 of Lecture Notes in Computer Science, pages 314-328. Springer, 2003.

[28] Yizhen Wang, Somesh Jha, and Kamalika Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5120-5129, 2018.

[29] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 18 (NIPS 2005, December 5-8, 2005, Vancouver, British Columbia, Canada), pages 1473-1480, 2005.

[30] Yao-Yuan Yang, Cyrus Rashtchian, Ruslan Salakhutdinov, and Kamalika Chaudhuri. A closer look at accuracy vs. robustness. In Neural Information Processing Systems (NeurIPS), 2020.

[31] Yao-Yuan Yang, Cyrus Rashtchian, Yizhen Wang, and Kamalika Chaudhuri. Robustness for non-parametric methods: A generic attack and a defense. In Artificial Intelligence and Statistics (AISTATS), 2020.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] We express our claims through theorems.
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [Yes]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] In the theorem statements.
(b) Did you include complete proofs of all theoretical results? [Yes] In the appendix.

3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In the appendix.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Many details are given in the main body, but a full explanation with all details is in the appendix.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] In the appendix; this was not particularly needed for our very light experiments.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Just a simple personal computer.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]