# When are Non-Parametric Methods Robust?

Robi Bhattacharjee* and Kamalika Chaudhuri*

*Equal contribution. Department of Computer Science, University of California, San Diego. Correspondence to: Robi Bhattacharjee.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

A growing body of research has shown that many classifiers are susceptible to adversarial examples: small strategic modifications to test inputs that lead to misclassification. In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. We establish general conditions under which non-parametric methods are r-consistent, in the sense that they converge to optimally robust and accurate classifiers in the large sample limit. Concretely, our results show that when data is well-separated, nearest neighbors and kernel classifiers are r-consistent, while histograms are not. For general data distributions, we prove that preprocessing by Adversarial Pruning (Yang et al., 2019), which makes data well-separated, followed by nearest neighbors or kernel classifiers also leads to r-consistency.

## 1. Introduction

Recent work has shown that many classifiers tend to be highly non-robust, and that small strategic modifications to regular test inputs can cause them to misclassify (Szegedy et al., 2014; Goodfellow et al., 2014; Lowd & Meek, 2005). Motivated by the use of machine learning in safety-critical applications, this phenomenon has recently received considerable interest; however, what exactly causes this phenomenon, known in the literature as adversarial examples, still remains a mystery.

Prior work has looked at three plausible reasons why adversarial examples might exist. The first, of course, is the possibility that in real data distributions, different classes are very close together in space, which does not seem plausible in practice. Another possibility is that classification algorithms may require more data to be robust than to be merely accurate; some prior work (Schmidt et al., 2018; Wang et al., 2018; Montasser et al., 2019) suggests that this might be true for certain classifiers or algorithms. Finally, others (Salman et al., 2019; Avdiukhin et al., 2019; Wang et al., 2018) have suggested that better training algorithms may give rise to more robust classifiers, and that in some cases, finding robust classifiers may even be computationally challenging.

In this work, we consider this problem in the context of general non-parametric classifiers. Unlike parametric methods, non-parametric methods are a form of local classifier, and include a large number of pattern recognition methods such as nearest neighbors, decision trees, random forests and kernel classifiers. There is a richly developed statistical theory of non-parametric methods (Devroye et al., 1996), which focuses on accuracy, and provides very general conditions under which these methods converge to the Bayes optimal with a growing number of samples. We, in contrast, analyze robustness properties of these methods, and ask instead when they converge to the classifier with the highest astuteness at a desired radius r. Recall that the astuteness of a classifier at radius r is the fraction of points from the distribution on which it is accurate and has the same prediction up to a distance r (Wang et al., 2018; Schmidt et al., 2018).
We begin by looking at the very simple case when data from different classes is well-separated by at least a distance 2r. Although achieving astuteness in this case may appear trivial, we show that even in this highly favorable case, not all non-parametric methods provide robust classifiers; this holds even for methods that converge to the Bayes optimal in the large sample limit.

This raises a natural question: when do non-parametric methods produce astute classifiers? We next provide conditions under which a non-parametric method converges to the most astute classifier in the large sample limit on well-separated data. Our conditions are analogous to the classical conditions for convergence to the Bayes optimal (Devroye et al., 1996; Stone, 1977), but a little stronger. We show that nearest neighbors, as well as kernel classifiers whose kernel functions decay fast enough, satisfy these conditions, and hence converge to astute classifiers in the large sample limit. In contrast, histogram classifiers, which do converge to the Bayes optimal in the large sample limit, may not converge to the most astute classifier. This indicates that there may be some non-parametric methods, such as nearest neighbors and kernel classifiers, that are more naturally robust when trained on well-separated data, and some that are not.

What happens when different classes in the data are not as well-separated? For this case, (Yang et al., 2019) proposes a method called Adversarial Pruning that preprocesses the training data by retaining the maximal subset of points in which different classes are distance 2r apart, and then trains a non-parametric method on the pruned data. We next prove that if a non-parametric method has certain properties, then the classifier produced by Adversarial Pruning followed by the method converges to the most astute classifier in the large sample limit. We show that, again, nearest neighbors and kernel classifiers whose kernel functions decay faster than inverse polynomials satisfy these properties. Our results thus complement and build upon the empirical results of (Yang et al., 2019) by providing a performance guarantee.

What can we conclude about the cause of adversarial examples? Our results seem to indicate that, at least for non-parametrics, it is mostly the training algorithms that are responsible. With a few exceptions, decades of prior work in machine learning and pattern recognition has largely focused on designing training methods that provide increasingly accurate classifiers, perhaps to the detriment of other aspects such as robustness. In this context, our results serve to (a) provide a set of guidelines that can be used for designing non-parametric methods that are robust and accurate on well-separated data, and (b) demonstrate that when data is not well-separated, preprocessing through Adversarial Pruning (Yang et al., 2019) may be used to ensure convergence to optimally astute solutions in the large sample limit.

### 1.1. Related Work

There is a large body of work on adversarial attacks (Carlini & Wagner, 2017; Liu et al., 2017; Papernot et al., 2017; 2016a; Szegedy et al., 2014) and defenses (Hein & Andriushchenko, 2017; Katz et al., 2017; Madry et al., 2018; Papernot et al., 2016b; Raghunathan et al., 2018; Sinha et al., 2018) in the parametric setting, specifically focusing on neural networks.
On the other hand, adversarial examples for non-parametric classifiers have mostly been studied in a much more ad-hoc manner, and to our knowledge, there has been no theoretical investigation into general properties of algorithms that promote robustness in non-parametric classifiers.

For nearest neighbors, there has been some prior work on adversarial attacks (Amsaleg et al., 2017; Sitawarin & Wagner, 2019; Wang et al., 2018; Yang et al., 2019) as well as defenses. Wang et al. (2018) propose a defense for 1-NN by pruning the input sample. However, their defense learns a classifier whose robustness regions converge towards those of the Bayes optimal classifier, which itself may potentially have poor robustness properties. Yang et al. (2019) account for this problem by proposing the notion of the r-optimal classifier, and propose an algorithm called Adversarial Pruning which can be interpreted as a finite sample approximation to the r-optimal classifier. However, they do not provide formal performance guarantees for Adversarial Pruning, which we do.

For kernel methods, Hein and Andriushchenko (2017) study lower bounds on the norm of the adversarial manipulation that is required for changing a classifier's output. They specifically study bounds for kernel classifiers, and propose an empirically motivated regularization idea that improves robustness. In this work, we improve the robustness properties of kernel classification through Adversarial Pruning, and show formal guarantees regarding convergence towards the r-optimal classifier.

For decision trees and random forests, attacks and defenses have been provided by (Andriushchenko & Hein, 2019; Kantchelian et al., 2015; Chen et al., 2019). Again, most of the work here is empirical in nature, and convergence guarantees are not provided.

Pruning has a long history of being applied to improve nearest neighbors (Gates, 1972; Gottlieb et al., 2014; Hart, 1968; Kontorovich et al., 2017; Kontorovich & Weiss, 2015; Hanneke et al., 2019), but this has been done entirely in the context of generalization, without accounting for robustness. Yang et al. (2019) empirically show that Adversarial Pruning can improve robustness for nearest neighbor classifiers, but do not provide any formal guarantees for their algorithms. In this work, we prove formal guarantees for Adversarial Pruning in the large sample limit, both for nearest neighbors as well as for more general weight functions.

There is a long history of literature on the consistency of kernel classifiers (Steinwart, 2005; Stone, 1977), but this has only been done for accuracy and generalization. In this work, we find that different conditions are needed to ensure that a kernel classifier converges in robustness in addition to accuracy.

## 2. Preliminaries

### 2.1. Setting

We consider binary classification where instances are drawn from a totally bounded metric space $\mathcal{X}$ equipped with a distance metric $d$, and the label space is $\{\pm 1\} = \{-1, +1\}$. The classical goal of classification is to build a highly accurate classifier, which we define as follows.

**Definition 1 (Accuracy).** Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, and let $f \in \{\pm 1\}^{\mathcal{X}}$ be a classifier. Then the accuracy of $f$ over $D$, denoted $A(f, D)$, is the fraction of examples $(x, y) \sim D$ for which $f(x) = y$. Thus $A(f, D) = P_{(x,y) \sim D}[f(x) = y]$.

In this work, we consider robustness in addition to accuracy.
Let $B(x, r)$ denote the closed ball of radius $r$ centered at $x$.

**Definition 2 (Robustness).** A classifier $f \in \{\pm 1\}^{\mathcal{X}}$ is said to be robust at $x$ with radius $r$ if $f(x) = f(x')$ for all $x' \in B(x, r)$.

Our goal is to find non-parametric algorithms that output classifiers that are robust, in addition to being accurate. To account for both criteria, we combine them into a notion of astuteness (Wang et al., 2018; Schmidt et al., 2018).

**Definition 3 (Astuteness).** A classifier $f \in \{\pm 1\}^{\mathcal{X}}$ is said to be astute at $(x, y)$ with radius $r$ if $f$ is robust at $x$ with radius $r$ and $f(x) = y$. The astuteness of $f$ over $D$, denoted $A_r(f, D)$, is the fraction of examples $(x, y) \sim D$ for which $f$ is astute at $(x, y)$ with radius $r$. Thus
$$A_r(f, D) = P_{(x,y) \sim D}\left[f(x') = y, \ \forall x' \in B(x, r)\right].$$

It is worth noting that $A_0(f, D) = A(f, D)$, since astuteness with radius 0 is simply the accuracy. For this reason, we will use $A_0(f, D)$ to denote accuracy from this point forwards.

### 2.2. Notions of Consistency

Traditionally, a classification algorithm is said to be consistent if, as the sample size grows to infinity, the accuracy of the classifier it learns converges towards the best possible accuracy on the underlying data distribution. We next introduce and formalize an alternative form of consistency, called r-consistency, that applies to robust classifiers. We begin with a formal definition of the Bayes optimal classifier (the most accurate classifier on a distribution) and consistency.

**Definition 4 (Bayes Optimal Classifier).** The Bayes optimal classifier on a distribution $D$, denoted by $g^*$, is defined as follows. Let $\eta(x) = p_D(+1 \mid x)$. Then
$$g^*(x) = \begin{cases} +1 & \eta(x) \ge 0.5 \\ -1 & \eta(x) < 0.5. \end{cases}$$
It can be shown that $g^*$ achieves the highest accuracy over $D$ of all classifiers.

**Definition 5 (Consistency).** Let $M$ be a classification algorithm over $\mathcal{X} \times \{\pm 1\}$. $M$ is said to be consistent if for any $D$ over $\mathcal{X} \times \{\pm 1\}$ and any $\epsilon, \delta \in (0, 1)$, there exists $N$ such that for $n \ge N$, with probability at least $1 - \delta$ over $S \sim D^n$, we have $A(M(S), D) \ge A(g^*, D) - \epsilon$, where $g^*$ is the Bayes optimal classifier for $D$.

How can we incorporate robustness in addition to accuracy in this notion? A plausible way, as used in (Wang et al., 2018), is to require that the classifier converge towards being astute wherever the Bayes optimal classifier is astute. However, the Bayes optimal classifier is not necessarily the most astute classifier, and may even have poor astuteness. To see this, consider the following example.

**Example 1.** Consider $D$ over $\mathcal{X} = [0, 1]$ such that $D_{\mathcal{X}}$ is the uniform distribution and
$$p(y = 1 \mid x) = \frac{1}{2} + \frac{1}{2}\sin\left(\frac{4\pi x}{r}\right).$$
For any point $x$, there exist $x_1, x_2 \in ([x - r, x + r] \cap [0, 1])$ such that $p(y = 1 \mid x_1) > \frac{1}{2}$ and $p(y = 1 \mid x_2) < \frac{1}{2}$. Consequently, $A_r(g^*, D) = 0$. However, the classifier that always predicts $f(x) = +1$ does better: it is robust everywhere, and since $P_{(x,y) \sim D}[y = +1] = \frac{1}{2}$, it follows that $A_r(f, D) = \frac{1}{2}$.

This motivates the notion of the r-optimal classifier, introduced by (Yang et al., 2019), which is the classifier with maximum astuteness.

**Definition 6 (r-optimal classifier).** The r-optimal classifier of a distribution $D$, denoted by $g_r^*$, is the classifier with maximum astuteness. Thus $g_r^* = \arg\max_f A_r(f, D)$. We let $A_r^*(D)$ denote $A_r(g_r^*, D)$.

Observe that $g_r^*$ is not necessarily unique. To account for this, we use $A_r^*(D)$ in our definition of r-consistency.

**Definition 7 (r-consistent).** Let $M$ be a classification algorithm over $\mathcal{X} \times \{\pm 1\}$. $M$ is said to be r-consistent if for any $D$, any $\epsilon, \delta \in (0, 1)$, and any $0 < \gamma < r$, there exists $N$ such that for $n \ge N$, with probability at least $1 - \delta$ over $S \sim D^n$,
$$A_{r - \gamma}(M(S), D) \ge A_r^*(D) - \epsilon.$$
If the above conditions hold for a specific distribution $D$, we say that $M$ is r-consistent with respect to $D$.

Observe that in addition to the usual $\epsilon$ and $\delta$, there is an extra parameter $\gamma$, which measures the gap in the robustness radius. We may need this parameter because when classes are exactly $2r$ apart, we may not be able to find the exact robust boundary with only finite samples.
Our analysis will be centered around understanding what kinds of algorithms $M$ provide highly astute classifiers for a given radius $r$. We begin by first considering the special case of r-separated distributions.

**Definition 8 (r-separated distributions).** A distribution $D$ is said to be r-separated if there exist subsets $T^+, T^- \subseteq \mathcal{X}$ such that
1. $P_{(x,y) \sim D}[x \in T^y] = 1$;
2. for all $x_1 \in T^+$ and $x_2 \in T^-$, $d(x_1, x_2) > 2r$.

Observe that if $D$ is r-separated, then $A_r(g_r^*, D) = 1$.

### 2.3. Non-parametric Classifiers

Many non-parametric algorithms classify points by averaging labels over a local neighborhood of the training data. A very general form of this idea is encapsulated in weight functions, which is the general form we will use.

**Definition 9** (Devroye et al., 1996). A weight function $W$ is a non-parametric classifier with the following properties.
1. Given input $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$, $W$ constructs functions $w_1^S, w_2^S, \ldots, w_n^S : \mathcal{X} \to [0, 1]$ such that for all $x \in \mathcal{X}$, $\sum_{i=1}^n w_i^S(x) = 1$. The functions $w_i^S$ are allowed to depend on $x_1, x_2, \ldots, x_n$ but must be independent of $y_1, y_2, \ldots, y_n$.
2. $W$ has output $W_S$ defined as
$$W_S(x) = \begin{cases} +1 & \sum_{i=1}^n w_i^S(x)\, y_i > 0 \\ -1 & \sum_{i=1}^n w_i^S(x)\, y_i \le 0. \end{cases}$$

As a result, $w_i^S(x)$ can be thought of as the weight that $(x_i, y_i)$ has in classifying $x$. Weight functions encompass a fairly extensive set of common non-parametric classifiers, which is the motivation for considering them. We now define several common non-parametric algorithms that can be construed as weight functions.

**Definition 10.** A histogram classifier $H$ is a non-parametric classification algorithm over $\mathbb{R}^d \times \{\pm 1\}$ that works as follows. For a distribution $D$ over $\mathbb{R}^d \times \{\pm 1\}$, $H$ takes $S = \{(x_i, y_i) : 1 \le i \le n\} \sim D^n$ as input. Let $k_i$ be a sequence with $\lim_{i \to \infty} k_i = \infty$ and $\lim_{i \to \infty} \frac{k_i}{i} = 0$. $H$ constructs a set of hypercubes $C = \{c_1, c_2, \ldots, c_m\}$ as follows:
1. Initially $C = \{c\}$, where $S \subseteq c$.
2. For $c \in C$, if $c$ contains more than $k_n$ points of $S$, partition $c$ into $2^d$ equally sized hypercubes, and insert them into $C$.
3. Repeat step 2 until all cubes in $C$ contain at most $k_n$ points.

For $x \in \mathbb{R}^d$, let $c(x)$ denote the unique cell in $C$ containing $x$. If $c(x)$ does not exist, then $H_S(x) = -1$ by default. Otherwise,
$$H_S(x) = \begin{cases} +1 & \sum_{x_i \in c(x)} y_i > 0 \\ -1 & \sum_{x_i \in c(x)} y_i \le 0. \end{cases}$$
Histogram classifiers are weight functions in which all $x_i$ contained within the same cell as $x$ are given the same weight $w_i^S(x)$ in predicting $x$, while all other $x_i$ are given weight 0.

**Definition 11.** A kernel classifier is a weight function $W$ over $\mathcal{X} \times \{\pm 1\}$ constructed from a function $K : \mathbb{R}^+ \cup \{0\} \to \mathbb{R}^+$ and a sequence $\{h_n\} \subseteq \mathbb{R}^+$ in the following manner. Given $S = \{(x_i, y_i)\} \sim D^n$, we have
$$w_i^S(x) = \frac{K\left(\frac{d(x, x_i)}{h_n}\right)}{\sum_{j=1}^n K\left(\frac{d(x, x_j)}{h_n}\right)}.$$
Then, as above, $W$ has output
$$W_S(x) = \begin{cases} +1 & \sum_{i=1}^n w_i^S(x)\, y_i > 0 \\ -1 & \sum_{i=1}^n w_i^S(x)\, y_i \le 0. \end{cases}$$

Finally, we note that $k_n$-nearest neighbors is also a weight function: $w_i^S(x) = \frac{1}{k_n}$ if $x_i$ is one of the $k_n$ closest neighbors of $x$, and 0 otherwise.
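To make Definitions 9 through 11 concrete, here is a minimal sketch of kernel and $k_n$-NN weights and the resulting weighted vote, assuming Euclidean distance; the function and variable names are illustrative, not part of the paper.

```python
# A minimal sketch of weight functions (Definition 9) with kernel
# weights (Definition 11) and k_n-NN weights; Euclidean distance assumed.
import numpy as np

def kernel_weights(x, X_train, K, h_n):
    """w_i^S(x) = K(d(x, x_i)/h_n) / sum_j K(d(x, x_j)/h_n)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    k = K(dists / h_n)
    return k / k.sum()

def knn_weights(x, X_train, k_n):
    """w_i^S(x) = 1/k_n on the k_n nearest neighbors of x, 0 elsewhere."""
    dists = np.linalg.norm(X_train - x, axis=1)
    w = np.zeros(len(X_train))
    w[np.argsort(dists)[:k_n]] = 1.0 / k_n
    return w

def weight_classifier(x, X_train, y_train, weights_fn):
    """Definition 9: predict the sign of the weighted vote (ties go to -1)."""
    w = weights_fn(x, X_train)
    return 1 if w @ y_train > 0 else -1
```

For instance, `weights_fn = lambda x, X: kernel_weights(x, X, lambda u: np.exp(-u**2), 0.1)` realizes a Gaussian kernel classifier with bandwidth $h_n = 0.1$ under this sketch.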
## 3. Warm Up: r-separated distributions

We begin by considering the case when the data distribution is r-separated; the more general case is considered in Section 4. While classifying r-separated distributions robustly may appear almost trivial, learning an arbitrary classifier does not necessarily produce an astute result. To see this, consider the following example involving a histogram classifier, which is known to be consistent. We let $H$ denote the histogram classifier over $\mathbb{R}$.

**Example 2.** Consider the data distribution $D = D^+ \cup D^-$, where $D^+$ is the uniform distribution over $[0, \frac{1}{4})$ and $D^-$ is the uniform distribution over $(\frac{1}{2}, 1]$, with $p(+1 \mid x) = 1$ for $x \in D^+$ and $p(-1 \mid x) = 1$ for $x \in D^-$. We make the following observations (refer to Figure 1; a small simulation sketch follows below).
1. $D$ is 0.1-separated, since the supports of $D^+$ and $D^-$ have distance $0.25 > 0.2$.
2. If $n$ is sufficiently large, $H$ will construct the cell $[0.25, 0.5)$, which will never be split because it never contains any points.
3. $H_S(x) = -1$ for $x \in [0.25, 0.5)$.
4. $H_S$ is not astute at $(x, 1)$ for $x \in (0.15, 0.25)$.

[Figure 1: $H_S$ over $[0, 1]$ is astute in the green region, but not robust in the red region.]

Thus $A_{0.1}(H_S, D) = 0.8$.
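The following minimal 1-D simulation of Example 2 illustrates the failure, under the assumption $k_n = \sqrt{n}$; the recursive splitting and the default label follow Definition 10, while all names and constants are illustrative.

```python
# A minimal 1-D simulation of Example 2, assuming k_n = sqrt(n).
# The empty cell [0.25, 0.5) defaults to -1, which breaks astuteness
# on (0.15, 0.25) at radius r = 0.1.
import numpy as np

def build_cells(xs, k_n, lo=0.0, hi=1.0):
    """Recursively split [lo, hi) until every cell holds <= k_n points."""
    inside = xs[(xs >= lo) & (xs < hi)]
    if len(inside) <= k_n:
        return [(lo, hi)]
    mid = (lo + hi) / 2
    return build_cells(xs, k_n, lo, mid) + build_cells(xs, k_n, mid, hi)

def histogram_predict(x, cells, xs, ys):
    for lo, hi in cells:
        if lo <= x < hi:
            votes = ys[(xs >= lo) & (xs < hi)].sum()
            return 1 if votes > 0 else -1
    return -1  # default label when no cell contains x

rng = np.random.default_rng(0)
n = 4000
labels = rng.choice([1, -1], size=n)            # equal-weight mixture
xs = np.where(labels == 1,
              rng.uniform(0.0, 0.25, size=n),   # D+ : label +1
              rng.uniform(0.5, 1.0, size=n))    # D- : label -1
cells = build_cells(xs, k_n=np.sqrt(n))

# Points just left of 0.25 are classified +1 but have a -1 prediction
# within distance 0.1 (inside the empty cell [0.25, 0.5)): accurate,
# yet not astute at radius 0.1.
for x in [0.20, 0.24]:
    print(x, histogram_predict(x, cells, xs, labels),
          histogram_predict(x + 0.1, cells, xs, labels))
```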
Example 2 shows that histogram classifiers do not always learn astute classifiers, even when run on r-separated distributions. This motivates the question: which non-parametric classifiers do? We answer this question in the following theorem, which gives sufficient conditions for a weight function (Definition 9) to be r-consistent over an r-separated distribution.

**Theorem 12.** Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, and let $W$ be a weight function. Let $X$ be a random variable with distribution $D_{\mathcal{X}}$, and $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$. Suppose that for any $0 < a < b$,
$$\lim_{n \to \infty} \mathbb{E}_{X, S}\left[\sup_{x' \in B(X, a)} \sum_{i=1}^n w_i^S(x')\, \mathbb{I}_{\|x_i - x'\| > b}\right] = 0.$$
Then if $D$ is r-separated, $W$ is r-consistent with respect to $D$.

First, we compare Theorem 12 to Stone's theorem (Stone, 1977), which gives sufficient conditions for a weight function to be consistent (i.e., to converge in accuracy towards the Bayes optimal). For convenience, we include a statement of Stone's theorem.

**Theorem 13** (Stone, 1977). Let $W$ be a weight function over $\mathcal{X} \times \{\pm 1\}$. Suppose the following conditions hold for any distribution $D$ over $\mathcal{X} \times \{\pm 1\}$, where $X$ is a random variable with distribution $D_{\mathcal{X}}$, $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$, and all expectations are taken over $X$ and $S$.
1. There is a constant $c$ such that, for every nonnegative measurable function $f$ satisfying $\mathbb{E}[f(X)] < \infty$,
$$\mathbb{E}\left[\sum_{i=1}^n w_i^S(X) f(x_i)\right] \le c\, \mathbb{E}[f(X)].$$
2. For all $a > 0$,
$$\lim_{n \to \infty} \mathbb{E}\left[\sum_{i=1}^n w_i^S(X)\, \mathbb{I}_{\|x_i - X\| > a}\right] = 0,$$
where $\mathbb{I}_{\|x_i - X\| > a}$ is an indicator variable.
3. $\lim_{n \to \infty} \mathbb{E}\left[\max_{1 \le i \le n} w_i^S(X)\right] = 0.$

Then $W$ is consistent.

There are two main differences between Theorem 12 and Stone's theorem.
1. Conditions 1 and 3 of Stone's theorem are no longer necessary. This is because r-separated distributions are well-separated and thus admit simpler conditions for consistency. In fact, a slight modification of the arguments of (Stone, 1977) shows that for r-separated distributions, condition 2 alone is sufficient for consistency.
2. Condition 2 is strengthened. Instead of requiring the weight of the $x_i$ outside of a given radius to go to 0 for $X \sim D$, we require the same to hold uniformly over a ball centered at $X$.

Theorem 12 provides a general condition that allows us to verify the r-consistency of non-parametric methods. We now show that two common non-parametric algorithms, $k_n$-nearest neighbors and kernel classifiers with rapidly decaying kernel functions, satisfy the conditions of Theorem 12.

**Corollary 14.** Let $D$ be any r-separated distribution. Let $k_n$ be any sequence such that $\lim_{n \to \infty} \frac{k_n}{n} = 0$, and let $M$ be the $k_n$-nearest neighbors classifier on a sample $S \sim D^n$. Then $M$ is r-consistent with respect to $D$.

We make two remarks.
1. Because the data distribution is r-separated, $k_n = 1$ will be r-consistent. Observe also that for r-separated distributions, $k_n = 1$ will converge towards the Bayes optimal classifier.
2. In general, $M$ converges towards the Bayes optimal classifier provided that $k_n \to \infty$ in addition to $k_n / n \to 0$. The condition $k_n \to \infty$ is not necessary for r-consistency because the distribution is r-separated.

We next show that kernel classifiers are also r-consistent on r-separated data distributions, provided the kernel function decreases rapidly enough.

**Corollary 15.** Let $W$ be a kernel classifier over $\mathcal{X} \times \{\pm 1\}$ constructed from $K$ and $h_n$. Suppose the following properties hold for $K$ and $h_n$:
1. for any $c > 1$, $\lim_{x \to \infty} \frac{K(cx)}{K(x)} = 0$;
2. $\lim_{n \to \infty} h_n = 0$.

If $D$ is an r-separated distribution over $\mathcal{X} \times \{\pm 1\}$, then $W$ is r-consistent with respect to $D$.

Observe that Condition 1 is satisfied by any $K(x)$ that decreases more rapidly than an inverse polynomial, and is hence satisfied by most popular kernels, such as the Gaussian kernel.

Is the condition on $K$ in Corollary 15 necessary? The following example illustrates that a kernel classifier with an arbitrary $K$ is not necessarily r-consistent. This indicates that some condition needs to be imposed on $K$ to ensure r-consistency; finding a tight necessary condition, however, is left for future work.

**Example 3.** Let $\mathcal{X} = [-1, 1]$ and let $D$ be a distribution with $p_D(-1, -1) = 0.1$ and $p_D(1, 1) = 0.9$. Clearly, $D$ is 0.3-separated. Let $K(x) = e^{-\min(|x|, 0.2)^2}$, and let $h_n$ be any sequence with $\lim_{n \to \infty} h_n = 0$ and $\lim_{n \to \infty} n h_n = \infty$. Let $W$ be the weight classifier with input $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ such that
$$w_i^S(x) = \frac{K\left(\frac{|x - x_i|}{h_n}\right)}{\sum_{j=1}^n K\left(\frac{|x - x_j|}{h_n}\right)}.$$
$W$ can be shown to satisfy all the conditions of Theorem 13 (the proof is analogous to the case of a Gaussian kernel classifier), and is therefore consistent. However, $W$ does not learn a robust classifier on $D$ for $r = 0.3$. Consider $x = -0.7$. For any $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$, every $x_i$ is either $-1$ or $1$. Therefore, since $K(|x - (-1)|/h_n) = K(|x - 1|/h_n)$ for sufficiently small $h_n$, it follows that $w_i^S(x) = \frac{1}{n}$ for all $1 \le i \le n$. Since $x_i = 1$ with probability 0.9, with high probability $x$ will be classified as $+1$, which means that $f$, the output of $W$, is not robust at $x = -1$. Thus $f$ has astuteness at most 0.9, which means that $W$ is not r-consistent for $r = 0.3$.
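As a quick numeric illustration of Condition 1 (this check is not from the paper; the kernel names are illustrative), the ratio $K(cx)/K(x)$ vanishes for the Gaussian kernel but tends to 1 for the truncated kernel of Example 3, whose constant tail is exactly what breaks r-consistency:

```python
# Checking Condition 1 of Corollary 15 numerically for two kernels.
import numpy as np

gaussian  = lambda x: np.exp(-x**2)
truncated = lambda x: np.exp(-np.minimum(np.abs(x), 0.2)**2)  # Example 3

c = 2.0
for x in [1.0, 5.0, 10.0]:
    print(x,
          gaussian(c * x) / gaussian(x),    # -> 0 as x grows: condition holds
          truncated(c * x) / truncated(x))  # -> 1 (constant tail): condition fails
```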
## 4. General Distributions

We next consider more general data distributions, where data from different classes may be close together in space, and may even overlap. Observe that unlike the r-separated case, here there may be no classifier with astuteness one. Thus, a natural question is: what does the optimally astute classifier look like, and how can we build non-parametric classifiers that converge to this limit?

### 4.1. The r-Optimal Classifier and Adversarial Pruning

(Yang et al., 2019) propose a large-sample limit called the r-optimal classifier and show that it is analogous to the Bayes optimal classifier for robustness. More specifically, given a data distribution $D$, to find the r-optimal classifier, we solve the following optimization problem:
$$\max_{S_{+1}, S_{-1}} \int_{x \in S_{+1}} p(y = +1 \mid x)\, d\mu_D(x) + \int_{x \in S_{-1}} p(y = -1 \mid x)\, d\mu_D(x) \tag{1}$$
$$\text{subject to } d(S_{+1}, S_{-1}) > 2r.$$
The r-optimal classifier is then defined as follows.

**Definition 16** (Yang et al., 2019). Fix $r$ and $D$. Let $S_{+1}^*$ and $S_{-1}^*$ be any optimizers of (1). Then the r-optimal classifier $g_r^*$ is any classifier such that $g_r^*(x) = j$ whenever $d(x, S_j^*) \le r$.

(Yang et al., 2019) show that the r-optimal classifier achieves the optimal astuteness over all classifiers on the data distribution $D$; hence, it is a robustness analogue of the Bayes optimal classifier. Therefore, for general distributions, the goal in robust classification is to find non-parametric algorithms that output classifiers that converge towards $g_r^*$.

To find robust classifiers, (Yang et al., 2019) propose Adversarial Pruning, a defense method that preprocesses the training data to make it better separated. More specifically, Adversarial Pruning takes as input a training dataset $S$ and a radius $r$, and finds the largest subset of the training set in which differently labeled points are more than distance $2r$ apart.

**Definition 17.** A set $S_r \subseteq \mathcal{X} \times \{\pm 1\}$ is said to be r-separated if for all $(x_1, y_1), (x_2, y_2) \in S_r$ with $y_1 \neq y_2$, we have $d(x_1, x_2) > 2r$. To adversarially prune a set $S$ is to return its largest r-separated subset. We let $AdvPrune(S, r)$ denote the result of adversarially pruning $S$.

Once an r-separated subset $S_r$ of the training set is found, a standard non-parametric method is trained on $S_r$. While (Yang et al., 2019) show good empirical performance of such algorithms, no formal guarantees are provided. We next formally characterize when Adversarial Pruning followed by a non-parametric method results in a classifier that is provably r-consistent. Specifically, we analyze the general algorithm given in Algorithm 1.

**Algorithm 1** RobustNonPar
1. Input: $S \sim D^n$, weight function $W$, robustness radius $r$.
2. $S_r \leftarrow AdvPrune(S, r)$.
3. Output: $W_{S_r}$.

### 4.2. Convergence Guarantees

We begin with some notation. For any weight function $W$ and radius $r > 0$, we let $RobustNonPar(W, r)$ represent the weight function that outputs weights for $S \sim D^n$ according to $RobustNonPar(S, W, r)$. In particular, this can be used to convert any weight function algorithm into a new weight function which takes robustness into account.
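As an illustration of Definition 17, here is a minimal sketch, not the paper's implementation: for binary labels, oppositely labeled points within distance $2r$ form a bipartite conflict graph, so the largest r-separated subset is the complement of a minimum vertex cover, which König's theorem recovers from a maximum matching. Euclidean distance and the networkx library are assumptions here, and all names are illustrative.

```python
# A minimal sketch of AdvPrune for binary labels via bipartite matching.
import numpy as np
import networkx as nx
from networkx.algorithms.bipartite import hopcroft_karp_matching, to_vertex_cover

def adv_prune(X, y, r):
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    G = nx.Graph()
    G.add_nodes_from(pos)  # one side of the bipartite conflict graph
    G.add_nodes_from(neg)
    for i in pos:          # conflict edges: opposite labels within 2r
        for j in neg:
            if np.linalg.norm(X[i] - X[j]) <= 2 * r:
                G.add_edge(i, j)
    matching = hopcroft_karp_matching(G, top_nodes=set(pos))
    # Konig's theorem: a minimum vertex cover from a maximum matching.
    cover = to_vertex_cover(G, matching, top_nodes=set(pos))
    keep = sorted(set(range(len(X))) - cover)
    return X[keep], y[keep]
```

The quadratic edge construction is written for clarity; on large samples a metric index would build the conflict graph faster.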
A natural question is: for which weight functions $W$ is $RobustNonPar(W, r)$ r-consistent? Our next theorem provides sufficient conditions.

**Theorem 18.** Let $W$ be a weight function over $\mathcal{X} \times \{\pm 1\}$, and let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$. Fix $r > 0$. Let $S_r = AdvPrune(S, r)$, and for convenience relabel the $(x_i, y_i)$ so that $S_r = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$. Suppose that for any $0 < a < b$,
$$\lim_{n \to \infty} \mathbb{E}_{S \sim D^n}\left[\frac{1}{m} \sum_{i=1}^m \sup_{x \in B(x_i, a)} \sum_{j=1}^m w_j^{S_r}(x)\, \mathbb{I}_{\|x_j - x\| > b}\right] = 0.$$
Then $RobustNonPar(W, r)$ is r-consistent with respect to $D$.

Remark: there are two important differences between the conditions of Theorem 18 and Theorem 12.
1. We replace $S$ with $S_r$.
2. The expectation over $X \sim D_{\mathcal{X}}$ is replaced with an average over $\{x_1, x_2, \ldots, x_m\}$.

The intuition here is that we are replacing $D$ with the uniform distribution over $S_r$. While $D$ may not be r-separated, the uniform distribution over $S_r$ is, and it represents the region of points where our classifier is astute.

A natural question is what satisfies the conditions in Theorem 18. We next show that $k_n$-nearest neighbors and kernel classifiers with rapidly decaying kernel functions continue to satisfy them; this means that these classifiers, when combined with Adversarial Pruning, converge to r-optimal classifiers in the large sample limit.

**Corollary 19.** Let $k_n$ be a sequence with $\lim_{n \to \infty} \frac{k_n}{n} = 0$, and let $M$ denote the $k_n$-nearest neighbor algorithm. Then for any $r > 0$, $RobustNonPar(M, r)$ is r-consistent.

Remark: Corollary 19 gives a formal guarantee in the large sample limit for the modified nearest-neighbor algorithm proposed by (Yang et al., 2019).

**Corollary 20.** Let $W$ be a kernel classifier over $\mathcal{X} \times \{\pm 1\}$ constructed from $K$ and $h_n$. Suppose the following properties hold for $K$ and $h_n$:
1. for any $c > 1$, $\lim_{x \to \infty} \frac{K(cx)}{K(x)} = 0$;
2. $\lim_{n \to \infty} h_n = 0$.

Then for any $r > 0$, $RobustNonPar(W, r)$ is r-consistent.

Observe again that Condition 1 is satisfied by any $K$ that decreases more rapidly than an inverse polynomial; it is thus satisfied by most popular kernels, such as the Gaussian kernel.

## 5. Validation

Our theoretical results are, by nature, large-sample; we next examine how well they apply to the finite sample case by trying them out on a simple example. In particular, we ask the following question: *how does the robustness of non-parametric classifiers change with increasing sample size?* This question is considered in the context of two simple non-parametric classifiers: one-nearest neighbor (which is guaranteed to be r-consistent) and histograms (which are not). To be able to measure performance with increasing data size, we look at a simple synthetic dataset, the Half Moons.

### 5.1. Experimental Setup

**Classifiers and Dataset.** We consider two different classification algorithms: one-nearest neighbor (NN) and a histogram classifier (HC). We use the Halfmoon dataset with two settings of the Gaussian noise parameter $\sigma$: $\sigma = 0$ (Noiseless) and $\sigma = 0.08$ (Noisy). In the Noiseless setting, observe that the data is already 0.1-separated; in the Noisy setting, we use Adversarial Pruning (Algorithm 1) with parameter $r = 0.1$ for both classification methods.

**Performance Measure.** We evaluate robustness with respect to the $\ell_\infty$ metric, which is commonly used in the adversarial examples literature. Specifically, for each classifier, we calculate the empirical astuteness, which is the fraction of test examples on which it is astute. Observe that computing the empirical astuteness of a classifier around an input $x$ amounts to finding the adversarial example that is closest to $x$ in the $\ell_\infty$ norm. For one-nearest neighbor, we do this using the optimal attack algorithm proposed by (Yang et al., 2019). For the histogram classifier, we use the optimal attack framework proposed by (Yang et al., 2019), and show that the structure of the classifier can be exploited to solve the resulting convex program efficiently. Details are in Appendix C.

[Figure 2: Empirical accuracy/astuteness of different classifiers as a function of training sample size; accuracy is shown in green, astuteness in purple. Panels: (a) Noiseless Histogram, (b) Noisy Histogram, (c) Histogram trained on 500 samples, (d) Noiseless 1-NN, (e) Noisy 1-NN, (f) Histogram trained on 3000 samples. Left: Noiseless setting. Right: Noisy setting. Top row: histogram classifier; bottom row: 1-nearest neighbor.]

We use an attack radius of $r = 0.1$ for the Noiseless setting, and $r = 0.09$ for the Noisy setting. For all classification algorithms, we plot the empirical astuteness as a function of the training set size. As a baseline, we also plot their standard accuracy on the test set.
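A rough sketch of this pipeline, under simplifying assumptions: sklearn's `make_moons` stands in for the halfmoon data, its 1-NN classifier for ours, and a sampling-based astuteness estimate replaces the exact attacks of (Yang et al., 2019), so the numbers are optimistic approximations rather than reproductions of Figure 2.

```python
# A rough sketch of the Section 5 pipeline; sampling-based astuteness
# is an optimistic stand-in for the exact attacks used in the paper.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

def sampled_astuteness(clf, X_test, y_test, r, n_probes=100, seed=0):
    """Optimistic astuteness estimate via random l_inf perturbations."""
    rng = np.random.default_rng(seed)
    ok = 0
    for x, y in zip(X_test, y_test):
        probes = np.vstack([x[None, :], x + rng.uniform(-r, r, (n_probes, 2))])
        ok += np.all(clf.predict(probes) == y)
    return ok / len(X_test)

for n in [50, 200, 1000, 3000]:
    X, y = make_moons(n_samples=n, noise=0.0, random_state=0)      # Noiseless
    X_test, y_test = make_moons(n_samples=500, noise=0.0, random_state=1)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(n, clf.score(X_test, y_test),
          sampled_astuteness(clf, X_test, y_test, r=0.1))
```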
### 5.2. Results

The results are presented in Figure 2; the left panels are for the Noiseless setting, while the center ones are for the Noisy setting. The results show that, as predicted by our theory, in the Noiseless setting the empirical astuteness of nearest neighbors converges to 1 as the training set grows. For histogram classifiers, the astuteness converges to 0.5, indicating that the classifier may grow less and less astute with larger sample size even on well-separated data. This is plausibly because the cell size induced by the histogram shrinks with growing training data; thus, the classifier, which outputs the default label $-1$ in empty cells, is incorrect on adversarial examples that are close to a point with label $+1$ but belong to a different, empty cell. The rightmost panels in Figure 2 provide a visual illustration of this process.

For the Noisy setting, the empirical astuteness of Adversarial Pruning followed by nearest neighbors converges to 0.8. For histograms with Adversarial Pruning, the astuteness converges to 0.7, which is higher than in the Noiseless setting but still clearly sub-optimal.

### 5.3. Discussion

Our results show that even though our theory is asymptotic, our predictions continue to be relevant in finite sample regimes. In particular, on well-separated data, nearest neighbors, which we theoretically predict to be intrinsically robust, is robust; histogram classifiers, which do not satisfy the conditions in Theorem 12, are not. Our predictions continue to hold for data that is not well-separated: nearest neighbors coupled with Adversarial Pruning continues to be robust with growing sample size, while histograms continue to be non-robust. Thus our theory is confirmed by practice.

## 6. Conclusion

In conclusion, we rigorously analyze when non-parametric methods provide classifiers that are robust in the large sample limit. We provide a general condition that characterizes when non-parametric methods are robust on well-separated data, and show that the Adversarial Pruning method of (Yang et al., 2019) works on data that is not well-separated. Our results serve to provide a set of guidelines that can be used for designing non-parametric methods that are robust and accurate on well-separated data; additionally, we demonstrate that when data is not well-separated, preprocessing by Adversarial Pruning (Yang et al., 2019) does lead to optimally astute solutions in the large sample limit.

## Acknowledgements

We thank NSF under CNS 1804829 for research support.

## References

Amsaleg, L., Bailey, J., Barbe, D., Erfani, S. M., Houle, M. E., Nguyen, V., and Radovanovic, M. The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality. In 2017 IEEE Workshop on Information Forensics and Security, WIFS 2017, Rennes, France, December 4-7, 2017, pp. 1-6, 2017.

Andriushchenko, M. and Hein, M. Provably robust boosted decision stumps and trees against adversarial attacks. In Advances in Neural Information Processing Systems 32, pp. 12997-13008. Curran Associates, Inc., 2019.

Avdiukhin, D., Mitrovic, S., Yaroslavtsev, G., and Zhou, S. Adversarially robust submodular maximization under knapsack constraints. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 148-156, 2019. doi: 10.1145/3292500.3330911.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39-57, 2017.

Chen, H., Zhang, H., Boning, D. S., and Hsieh, C.-J. Robust decision trees against adversarial examples. CoRR, abs/1902.10660, 2019.
Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, 1996.

Gates, G. W. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 18(3):431-433, 1972.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples, 2014. URL http://arxiv.org/abs/1412.6572.

Gottlieb, L., Kontorovich, A., and Nisnevitch, P. Near-optimal sample compression for nearest neighbors. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, December 8-13, 2014, pp. 370-378, 2014.

Hanneke, S., Kontorovich, A., Sabato, S., and Weiss, R. Universal Bayes consistency in metric spaces. CoRR, abs/1906.09855, 2019.

Hart, P. E. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14(3):515-516, 1968.

Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems 30, pp. 2266-2276. Curran Associates, Inc., 2017.

Kantchelian, A., Tygar, J. D., and Joseph, A. D. Evasion and hardening of tree ensemble classifiers. CoRR, abs/1509.07892, 2015. URL http://arxiv.org/abs/1509.07892.

Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Towards proving the adversarial robustness of deep neural networks. In Proceedings First Workshop on Formal Verification of Autonomous Vehicles, FVAV@iFM 2017, Turin, Italy, September 19, 2017, pp. 19-26, 2017.

Kontorovich, A. and Weiss, R. A Bayes consistent 1-NN classifier. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.

Kontorovich, A., Sabato, S., and Weiss, R. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. In Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, December 4-9, 2017, pp. 1573-1583, 2017.

Liu, Y., Chen, X., Liu, C., and Song, D. Delving into transferable adversarial examples and black-box attacks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017.

Lowd, D. and Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, August 21-24, 2005, pp. 641-647. ACM, 2005.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, COLT 2019, Phoenix, AZ, USA, June 25-28, 2019, pp. 2512-2530, 2019.

Papernot, N., McDaniel, P. D., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21-24, 2016, pp. 372-387, 2016a.
Papernot, N., McDaniel, P. D., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, pp. 582-597, 2016b.

Papernot, N., McDaniel, P. D., Goodfellow, I. J., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against deep learning systems using adversarial examples. ASIACCS, 2017.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

Salman, H., Yang, G., Li, J., Zhang, P., Zhang, H., Razenshteyn, I. P., and Bubeck, S. Provably robust deep learning via adversarially trained smoothed classifiers. CoRR, abs/1906.04584, 2019. URL http://arxiv.org/abs/1906.04584.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems 31, pp. 5014-5026. Curran Associates, Inc., 2018.

Sinha, A., Namkoong, H., and Duchi, J. C. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

Sitawarin, C. and Wagner, D. A. On the robustness of deep k-nearest neighbors. In 2019 IEEE Security and Privacy Workshops, SP Workshops 2019, San Francisco, CA, USA, May 19-23, 2019, pp. 1-7, 2019.

Steinwart, I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128-142, 2005.

Stone, C. Consistent nonparametric regression. The Annals of Statistics, 5(4):595-645, 1977.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014.

Wang, Y., Jha, S., and Chaudhuri, K. Analyzing the robustness of nearest neighbors to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 5120-5129, 2018.

Yang, Y., Rashtchian, C., Wang, Y., and Chaudhuri, K. Adversarial examples for non-parametric methods: Attacks, defenses and large sample limits. CoRR, abs/1906.03310, 2019. URL http://arxiv.org/abs/1906.03310.