# When are Non-Parametric Methods Robust?

Robi Bhattacharjee* and Kamalika Chaudhuri*

*Equal contribution. Department of Computer Science, University of California, San Diego. Correspondence to: Robi Bhattacharjee.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

A growing body of research has shown that many classifiers are susceptible to adversarial examples: small strategic modifications to test inputs that lead to misclassification. In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. We establish general conditions under which non-parametric methods are r-consistent, in the sense that they converge to optimally robust and accurate classifiers in the large sample limit. Concretely, our results show that when data is well-separated, nearest neighbors and kernel classifiers are r-consistent, while histograms are not. For general data distributions, we prove that preprocessing by Adversarial Pruning (Yang et al., 2019), which makes data well-separated, followed by nearest neighbors or kernel classifiers also leads to r-consistency.

## 1. Introduction

Recent work has shown that many classifiers tend to be highly non-robust, and that small strategic modifications to regular test inputs can cause them to misclassify (Szegedy et al., 2014; Goodfellow et al., 2014; Lowd & Meek, 2005). Motivated by the use of machine learning in safety-critical applications, this phenomenon has recently received considerable interest; however, what exactly causes this phenomenon, known in the literature as adversarial examples, still remains a mystery.

Prior work has looked at three plausible reasons why adversarial examples might exist. The first, of course, is the possibility that in real data distributions, different classes are very close together in space, which does not seem plausible in practice. Another possibility is that classification algorithms may require more data to be robust than to be merely accurate; some prior work (Schmidt et al., 2018; Wang et al., 2018; Montasser et al., 2019) suggests that this might be true for certain classifiers or algorithms. Finally, others (Salman et al., 2019; Avdiukhin et al., 2019; Wang et al., 2018) have suggested that better training algorithms may give rise to more robust classifiers, and that in some cases, finding robust classifiers may even be computationally challenging.

In this work, we consider this problem in the context of general non-parametric classifiers. Unlike parametric methods, non-parametric methods are a form of local classifier, and include a large number of pattern recognition methods such as nearest neighbors, decision trees, random forests and kernel classifiers. There is a richly developed statistical theory of non-parametric methods (Devroye et al., 1996), which focuses on accuracy, and provides very general conditions under which these methods converge to the Bayes optimal with a growing number of samples. We, in contrast, analyze robustness properties of these methods, and ask instead when they converge to the classifier with the highest astuteness at a desired radius r. Recall that the astuteness of a classifier at radius r is the fraction of points from the distribution on which it is accurate and has the same prediction up to a distance r (Wang et al., 2018; Schmidt et al., 2018).
We begin by looking at the very simple case when data from different classes is well-separated by at least a distance 2r. Although achieving astuteness in this case may appear trivial, we show that even in this highly favorable case, not all non-parametric methods provide robust classifiers; this holds even for methods that converge to the Bayes optimal in the large sample limit.

This raises a natural question: when do non-parametric methods produce astute classifiers? We next provide conditions under which a non-parametric method converges to the most astute classifier in the large sample limit on well-separated data. Our conditions are analogous to the classical conditions for convergence to the Bayes optimal (Devroye et al., 1996; Stone, 1977), but a little stronger. We show that nearest neighbors, as well as kernel classifiers whose kernel functions decay fast enough, satisfy these conditions, and hence converge to astute classifiers in the large sample limit. In contrast, histogram classifiers, which do converge to the Bayes optimal in the large sample limit, may not converge to the most astute classifier. This indicates that there may be some non-parametric methods, such as nearest neighbors and kernel classifiers, that are more naturally robust when trained on well-separated data, and some that are not.

What happens when different classes in the data are not as well-separated? For this case, (Yang et al., 2019) proposes a method called Adversarial Pruning that preprocesses the training data by retaining the maximal subset of points in which different classes are distance 2r apart, and then trains a non-parametric method on the pruned data. We next prove that if a non-parametric method has certain properties, then the classifier produced by Adversarial Pruning followed by the method converges to the most astute classifier in the large sample limit. We show that, again, nearest neighbors and kernel classifiers whose kernel functions decay faster than inverse polynomials satisfy these properties. Our results thus complement and build upon the empirical results of (Yang et al., 2019) by providing a performance guarantee.

What can we conclude about the cause of adversarial examples? Our results seem to indicate that, at least for non-parametrics, it is mostly the training algorithms that are responsible. With a few exceptions, decades of prior work in machine learning and pattern recognition has largely focused on designing training methods that provide increasingly accurate classifiers, perhaps to the detriment of other aspects such as robustness. In this context, our results serve to (a) provide a set of guidelines that can be used for designing non-parametric methods that are robust and accurate on well-separated data, and (b) demonstrate that when data is not well-separated, preprocessing through Adversarial Pruning (Yang et al., 2019) may be used to ensure convergence to optimally astute solutions in the large sample limit.

### 1.1. Related Work

There is a large body of work on adversarial attacks (Carlini & Wagner, 2017; Liu et al., 2017; Papernot et al., 2017; 2016a; Szegedy et al., 2014) and defenses (Hein & Andriushchenko, 2017; Katz et al., 2017; Madry et al., 2018; Papernot et al., 2016b; Raghunathan et al., 2018; Sinha et al., 2018) in the parametric setting, specifically focusing on neural networks.
On the other hand, adversarial examples for non-parametric classifiers have mostly been studied in a much more ad-hoc manner, and to our knowledge, there has been no theoretical investigation into general properties of algorithms that promote robustness in non-parametric classifiers.

For nearest neighbors, there has been some prior work on adversarial attacks (Amsaleg et al., 2017; Sitawarin & Wagner, 2019; Wang et al., 2018; Yang et al., 2019) as well as defenses. Wang et al. (2018) propose a defense for 1-NN by pruning the input sample. However, their defense learns a classifier whose robustness regions converge towards those of the Bayes optimal classifier, which itself may potentially have poor robustness properties. Yang et al. (2019) account for this problem by proposing the notion of the r-optimal classifier, and propose an algorithm called Adversarial Pruning which can be interpreted as a finite sample approximation to the r-optimal classifier. However, they do not provide formal performance guarantees for Adversarial Pruning, which we do.

For kernel methods, Hein and Andriushchenko (2017) study lower bounds on the norm of the adversarial manipulation that is required for changing a classifier's output. They specifically study bounds for kernel classifiers, and propose an empirically motivated regularization idea that improves robustness. In this work, we improve the robustness properties of kernel classification through Adversarial Pruning, and show formal guarantees regarding convergence towards the r-optimal classifier.

For decision trees and random forests, attacks and defenses have been provided by (Andriushchenko & Hein, 2019; Kantchelian et al., 2015; Chen et al., 2019). Again, most of the work here is empirical in nature, and convergence guarantees are not provided.

Pruning has a long history of being applied to improve nearest neighbors (Gates, 1972; Gottlieb et al., 2014; Hart, 1968; Kontorovich et al., 2017; Kontorovich & Weiss, 2015; Hanneke et al., 2019), but this has been done entirely in the context of generalization, without accounting for robustness. Yang et al. (2019) empirically show that Adversarial Pruning can improve robustness for nearest neighbor classifiers, but do not provide any formal guarantees for their algorithms. In this work, we prove formal guarantees for Adversarial Pruning in the large sample limit, both for nearest neighbors as well as for more general weight functions.

There is a long history of literature on the consistency of kernel classifiers (Steinwart, 2005; Stone, 1977), but this has only been done for accuracy and generalization. In this work, we find that different conditions are needed to ensure that a kernel classifier converges in robustness in addition to accuracy.

## 2. Preliminaries

### 2.1. Setting

We consider binary classification where instances are drawn from a totally bounded metric space $\mathcal{X}$ equipped with a distance metric $d$, and the label space is $\{\pm 1\} = \{-1, +1\}$. The classical goal of classification is to build a highly accurate classifier, which we define as follows.

**Definition 1 (Accuracy).** Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, and let $f \in \{\pm 1\}^{\mathcal{X}}$ be a classifier. Then the accuracy of $f$ over $D$, denoted $A(f, D)$, is the fraction of examples $(x, y) \sim D$ for which $f(x) = y$. Thus $A(f, D) = P_{(x,y) \sim D}[f(x) = y]$.

In this work, we consider robustness in addition to accuracy.
Let $B(x, r)$ denote the closed ball of radius $r$ centered at $x$.

**Definition 2 (Robustness).** A classifier $f \in \{\pm 1\}^{\mathcal{X}}$ is said to be robust at $x$ with radius $r$ if $f(x) = f(x')$ for all $x' \in B(x, r)$.

Our goal is to find non-parametric algorithms that output classifiers that are robust, in addition to being accurate. To account for both criteria, we combine them into a notion of astuteness (Wang et al., 2018; Schmidt et al., 2018).

**Definition 3 (Astuteness).** A classifier $f \in \{\pm 1\}^{\mathcal{X}}$ is said to be astute at $(x, y)$ with radius $r$ if $f$ is robust at $x$ with radius $r$ and $f(x) = y$. The astuteness of $f$ over $D$, denoted $A_r(f, D)$, is the fraction of examples $(x, y) \sim D$ for which $f$ is astute at $(x, y)$ with radius $r$. Thus
$$A_r(f, D) = P_{(x,y) \sim D}\left[f(x') = y, \ \forall x' \in B(x, r)\right].$$

It is worth noting that $A_0(f, D) = A(f, D)$, since astuteness with radius 0 is simply the accuracy. For this reason, we will use $A_0(f, D)$ to denote accuracy from this point forwards.

### 2.2. Notions of Consistency

Traditionally, a classification algorithm is said to be consistent if, as the sample size grows to infinity, the accuracy of the classifier it learns converges towards the best possible accuracy on the underlying data distribution. We next introduce and formalize an alternative form of consistency, called r-consistency, that applies to robust classifiers. We begin with a formal definition of the Bayes optimal classifier (the most accurate classifier on a distribution) and consistency.

**Definition 4 (Bayes Optimal Classifier).** The Bayes optimal classifier on a distribution $D$, denoted by $g^*$, is defined as follows. Let $\eta(x) = p_D(+1 \mid x)$. Then
$$g^*(x) = \begin{cases} +1 & \eta(x) \ge 0.5 \\ -1 & \eta(x) < 0.5. \end{cases}$$
It can be shown that $g^*$ achieves the highest accuracy over $D$ of all classifiers.

**Definition 5 (Consistency).** Let $M$ be a classification algorithm over $\mathcal{X} \times \{\pm 1\}$. $M$ is said to be consistent if for any $D$ over $\mathcal{X} \times \{\pm 1\}$ and any $\epsilon, \delta \in (0, 1)$, there exists $N$ such that for $n \ge N$, with probability at least $1 - \delta$ over $S \sim D^n$, we have $A(M(S), D) \ge A(g^*, D) - \epsilon$, where $g^*$ is the Bayes optimal classifier for $D$.

How can we incorporate robustness in addition to accuracy in this notion? A plausible way, as used in (Wang et al., 2018), is to require that the classifier converge towards being astute wherever the Bayes optimal classifier is astute. However, the Bayes optimal classifier is not necessarily the most astute classifier, and may even have poor astuteness. To see this, consider the following example.

**Example 1.** Consider $D$ over $\mathcal{X} = [0, 1]$ such that $D_{\mathcal{X}}$ is the uniform distribution and
$$p(y = 1 \mid x) = \frac{1}{2} + \frac{1}{2}\sin\left(\frac{4\pi x}{r}\right).$$
For any point $x$, there exist $x_1, x_2 \in ([x - r, x + r] \cap [0, 1])$ such that $p(y = 1 \mid x_1) > \frac{1}{2}$ and $p(y = 1 \mid x_2) < \frac{1}{2}$. Consequently, $A_r(g^*, D) = 0$. However, the classifier that always predicts $f(x) = +1$ does better: it is robust everywhere, and since $P_{(x,y) \sim D}[y = +1] = \frac{1}{2}$, it follows that $A_r(f, D) = \frac{1}{2}$.

This motivates the notion of the r-optimal classifier, introduced by (Yang et al., 2019), which is the classifier with maximum astuteness.

**Definition 6 (r-optimal classifier).** The r-optimal classifier of a distribution $D$, denoted by $g_r^*$, is the classifier with maximum astuteness. Thus $g_r^* = \arg\max_f A_r(f, D)$. We let $A_r^*(D)$ denote $A_r(g_r^*, D)$.

Observe that $g_r^*$ is not necessarily unique. To account for this, we use $A_r^*(D)$ in our definition of r-consistency.

**Definition 7 (r-consistent).** Let $M$ be a classification algorithm over $\mathcal{X} \times \{\pm 1\}$. $M$ is said to be r-consistent if for any $D$, any $\epsilon, \delta \in (0, 1)$, and any $0 < \gamma < r$, there exists $N$ such that for $n \ge N$, with probability at least $1 - \delta$ over $S \sim D^n$,
$$A_{r - \gamma}(M(S), D) \ge A_r^*(D) - \epsilon.$$
If the above conditions hold for a specific distribution $D$, we say that $M$ is r-consistent with respect to $D$.

Observe that in addition to the usual $\epsilon$ and $\delta$, there is an extra parameter $\gamma$, which measures the gap in the robustness radius. We may need this parameter because when classes are exactly $2r$ apart, we may not be able to find the exact robust boundary with only finite samples.
Our analysis will be centered around understanding what kinds of algorithms $M$ provide highly astute classifiers for a given radius $r$. We begin by first considering the special case of r-separated distributions.

**Definition 8 (r-separated distributions).** A distribution $D$ is said to be r-separated if there exist subsets $T^+, T^- \subseteq \mathcal{X}$ such that
1. $P_{(x,y) \sim D}[x \in T^y] = 1$;
2. for all $x_1 \in T^+$ and $x_2 \in T^-$, $d(x_1, x_2) > 2r$.

Observe that if $D$ is r-separated, then $A_r(g_r^*, D) = 1$.

### 2.3. Non-parametric Classifiers

Many non-parametric algorithms classify points by averaging labels over a local neighborhood of the training data. A very general form of this idea is encapsulated in weight functions, which is the general form we will use.

**Definition 9** (Devroye et al., 1996). A weight function $W$ is a non-parametric classifier with the following properties.
1. Given input $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$, $W$ constructs functions $w_1^S, w_2^S, \ldots, w_n^S : \mathcal{X} \to [0, 1]$ such that for all $x \in \mathcal{X}$, $\sum_{i=1}^n w_i^S(x) = 1$. The functions $w_i^S$ are allowed to depend on $x_1, x_2, \ldots, x_n$ but must be independent of $y_1, y_2, \ldots, y_n$.
2. $W$ has output $W_S$ defined as
$$W_S(x) = \begin{cases} +1 & \sum_{i=1}^n w_i^S(x)\, y_i > 0 \\ -1 & \sum_{i=1}^n w_i^S(x)\, y_i \le 0. \end{cases}$$

As a result, $w_i^S(x)$ can be thought of as the weight that $(x_i, y_i)$ has in classifying $x$. Weight functions encompass a fairly extensive set of common non-parametric classifiers, which is the motivation for considering them. We now define several common non-parametric algorithms that can be construed as weight functions.

**Definition 10.** A histogram classifier $H$ is a non-parametric classification algorithm over $\mathbb{R}^d \times \{\pm 1\}$ that works as follows. For a distribution $D$ over $\mathbb{R}^d \times \{\pm 1\}$, $H$ takes $S = \{(x_i, y_i) : 1 \le i \le n\} \sim D^n$ as input. Let $k_i$ be a sequence with $\lim_{i \to \infty} k_i = \infty$ and $\lim_{i \to \infty} \frac{k_i}{i} = 0$. $H$ constructs a set of hypercubes $C = \{c_1, c_2, \ldots, c_m\}$ as follows:
1. Initially $C = \{c\}$, where $S \subseteq c$.
2. For $c \in C$, if $c$ contains more than $k_n$ points of $S$, partition $c$ into $2^d$ equally sized hypercubes, and insert them into $C$.
3. Repeat step 2 until all cubes in $C$ contain at most $k_n$ points.

For $x \in \mathbb{R}^d$, let $c(x)$ denote the unique cell in $C$ containing $x$. If $c(x)$ does not exist, then $H_S(x) = -1$ by default. Otherwise,
$$H_S(x) = \begin{cases} +1 & \sum_{x_i \in c(x)} y_i > 0 \\ -1 & \sum_{x_i \in c(x)} y_i \le 0. \end{cases}$$
Histogram classifiers are weight functions in which all $x_i$ contained within the same cell as $x$ are given the same weight $w_i^S(x)$ in predicting $x$, while all other $x_i$ are given weight 0.

**Definition 11.** A kernel classifier is a weight function $W$ over $\mathcal{X} \times \{\pm 1\}$ constructed from a function $K : \mathbb{R}^+ \cup \{0\} \to \mathbb{R}^+$ and a sequence $\{h_n\} \subseteq \mathbb{R}^+$ in the following manner. Given $S = \{(x_i, y_i)\} \sim D^n$, we have
$$w_i^S(x) = \frac{K\left(\frac{d(x, x_i)}{h_n}\right)}{\sum_{j=1}^n K\left(\frac{d(x, x_j)}{h_n}\right)}.$$
Then, as above, $W$ has output
$$W_S(x) = \begin{cases} +1 & \sum_{i=1}^n w_i^S(x)\, y_i > 0 \\ -1 & \sum_{i=1}^n w_i^S(x)\, y_i \le 0. \end{cases}$$

Finally, we note that $k_n$-nearest neighbors is also a weight function: $w_i^S(x) = \frac{1}{k_n}$ if $x_i$ is one of the $k_n$ closest neighbors of $x$, and 0 otherwise.
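To make Definitions 9 through 11 concrete, here is a minimal sketch of kernel and $k_n$-NN weights and the resulting weighted vote, assuming Euclidean distance; the function and variable names are illustrative, not part of the paper.

```python
# A minimal sketch of weight functions (Definition 9) with kernel
# weights (Definition 11) and k_n-NN weights; Euclidean distance assumed.
import numpy as np

def kernel_weights(x, X_train, K, h_n):
    """w_i^S(x) = K(d(x, x_i)/h_n) / sum_j K(d(x, x_j)/h_n)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    k = K(dists / h_n)
    return k / k.sum()

def knn_weights(x, X_train, k_n):
    """w_i^S(x) = 1/k_n on the k_n nearest neighbors of x, 0 elsewhere."""
    dists = np.linalg.norm(X_train - x, axis=1)
    w = np.zeros(len(X_train))
    w[np.argsort(dists)[:k_n]] = 1.0 / k_n
    return w

def weight_classifier(x, X_train, y_train, weights_fn):
    """Definition 9: predict the sign of the weighted vote (ties go to -1)."""
    w = weights_fn(x, X_train)
    return 1 if w @ y_train > 0 else -1
```

For instance, `weights_fn = lambda x, X: kernel_weights(x, X, lambda u: np.exp(-u**2), 0.1)` realizes a Gaussian kernel classifier with bandwidth $h_n = 0.1$ under this sketch.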
## 3. Warm Up: r-separated distributions

We begin by considering the case when the data distribution is r-separated; the more general case is considered in Section 4. While classifying r-separated distributions robustly may appear almost trivial, learning an arbitrary classifier does not necessarily produce an astute result. To see this, consider the following example involving a histogram classifier, which is known to be consistent. We let $H$ denote the histogram classifier over $\mathbb{R}$.

**Example 2.** Consider the data distribution $D = D^+ \cup D^-$, where $D^+$ is the uniform distribution over $[0, \frac{1}{4})$ and $D^-$ is the uniform distribution over $(\frac{1}{2}, 1]$, with $p(+1 \mid x) = 1$ for $x \in D^+$ and $p(-1 \mid x) = 1$ for $x \in D^-$. We make the following observations (refer to Figure 1; a small simulation sketch follows below).
1. $D$ is 0.1-separated, since the supports of $D^+$ and $D^-$ have distance $0.25 > 0.2$.
2. If $n$ is sufficiently large, $H$ will construct the cell $[0.25, 0.5)$, which will never be split because it never contains any points.
3. $H_S(x) = -1$ for $x \in [0.25, 0.5)$.
4. $H_S$ is not astute at $(x, 1)$ for $x \in (0.15, 0.25)$.

[Figure 1: $H_S$ over $[0, 1]$ is astute in the green region, but not robust in the red region.]

Thus $A_{0.1}(H_S, D) = 0.8$.
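The following minimal 1-D simulation of Example 2 illustrates the failure, under the assumption $k_n = \sqrt{n}$; the recursive splitting and the default label follow Definition 10, while all names and constants are illustrative.

```python
# A minimal 1-D simulation of Example 2, assuming k_n = sqrt(n).
# The empty cell [0.25, 0.5) defaults to -1, which breaks astuteness
# on (0.15, 0.25) at radius r = 0.1.
import numpy as np

def build_cells(xs, k_n, lo=0.0, hi=1.0):
    """Recursively split [lo, hi) until every cell holds <= k_n points."""
    inside = xs[(xs >= lo) & (xs < hi)]
    if len(inside) <= k_n:
        return [(lo, hi)]
    mid = (lo + hi) / 2
    return build_cells(xs, k_n, lo, mid) + build_cells(xs, k_n, mid, hi)

def histogram_predict(x, cells, xs, ys):
    for lo, hi in cells:
        if lo <= x < hi:
            votes = ys[(xs >= lo) & (xs < hi)].sum()
            return 1 if votes > 0 else -1
    return -1  # default label when no cell contains x

rng = np.random.default_rng(0)
n = 4000
labels = rng.choice([1, -1], size=n)            # equal-weight mixture
xs = np.where(labels == 1,
              rng.uniform(0.0, 0.25, size=n),   # D+ : label +1
              rng.uniform(0.5, 1.0, size=n))    # D- : label -1
cells = build_cells(xs, k_n=np.sqrt(n))

# Points just left of 0.25 are classified +1 but have a -1 prediction
# within distance 0.1 (inside the empty cell [0.25, 0.5)): accurate,
# yet not astute at radius 0.1.
for x in [0.20, 0.24]:
    print(x, histogram_predict(x, cells, xs, labels),
          histogram_predict(x + 0.1, cells, xs, labels))
```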
Example 2 shows that histogram classifiers do not always learn astute classifiers, even when run on r-separated distributions. This motivates the question: which non-parametric classifiers do? We answer this question in the following theorem, which gives sufficient conditions for a weight function (Definition 9) to be r-consistent over an r-separated distribution.

**Theorem 12.** Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, and let $W$ be a weight function. Let $X$ be a random variable with distribution $D_{\mathcal{X}}$, and $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$. Suppose that for any $0 < a < b$,
$$\lim_{n \to \infty} \mathbb{E}_{X, S}\left[\sup_{x' \in B(X, a)} \sum_{i=1}^n w_i^S(x')\, \mathbb{I}_{\|x_i - x'\| > b}\right] = 0.$$
Then if $D$ is r-separated, $W$ is r-consistent with respect to $D$.

First, we compare Theorem 12 to Stone's theorem (Stone, 1977), which gives sufficient conditions for a weight function to be consistent (i.e., to converge in accuracy towards the Bayes optimal). For convenience, we include a statement of Stone's theorem.

**Theorem 13** (Stone, 1977). Let $W$ be a weight function over $\mathcal{X} \times \{\pm 1\}$. Suppose the following conditions hold for any distribution $D$ over $\mathcal{X} \times \{\pm 1\}$, where $X$ is a random variable with distribution $D_{\mathcal{X}}$, $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$, and all expectations are taken over $X$ and $S$.
1. There is a constant $c$ such that, for every nonnegative measurable function $f$ satisfying $\mathbb{E}[f(X)] < \infty$,
$$\mathbb{E}\left[\sum_{i=1}^n w_i^S(X) f(x_i)\right] \le c\, \mathbb{E}[f(X)].$$
2. For all $a > 0$,
$$\lim_{n \to \infty} \mathbb{E}\left[\sum_{i=1}^n w_i^S(X)\, \mathbb{I}_{\|x_i - X\| > a}\right] = 0,$$
where $\mathbb{I}_{\|x_i - X\| > a}$ is an indicator variable.
3. $\lim_{n \to \infty} \mathbb{E}\left[\max_{1 \le i \le n} w_i^S(X)\right] = 0.$

Then $W$ is consistent.

There are two main differences between Theorem 12 and Stone's theorem.
1. Conditions 1 and 3 of Stone's theorem are no longer necessary. This is because r-separated distributions are well-separated and thus admit simpler conditions for consistency. In fact, a slight modification of the arguments of (Stone, 1977) shows that for r-separated distributions, condition 2 alone is sufficient for consistency.
2. Condition 2 is strengthened. Instead of requiring the weight of the $x_i$ outside of a given radius to go to 0 for $X \sim D$, we require the same to hold uniformly over a ball centered at $X$.

Theorem 12 provides a general condition that allows us to verify the r-consistency of non-parametric methods. We now show that two common non-parametric algorithms, $k_n$-nearest neighbors and kernel classifiers with rapidly decaying kernel functions, satisfy the conditions of Theorem 12.

**Corollary 14.** Let $D$ be any r-separated distribution. Let $k_n$ be any sequence such that $\lim_{n \to \infty} \frac{k_n}{n} = 0$, and let $M$ be the $k_n$-nearest neighbors classifier on a sample $S \sim D^n$. Then $M$ is r-consistent with respect to $D$.

We make two remarks.
1. Because the data distribution is r-separated, $k_n = 1$ will be r-consistent. Observe also that for r-separated distributions, $k_n = 1$ will converge towards the Bayes optimal classifier.
2. In general, $M$ converges towards the Bayes optimal classifier provided that $k_n \to \infty$ in addition to $k_n / n \to 0$. The condition $k_n \to \infty$ is not necessary for r-consistency because the distribution is r-separated.

We next show that kernel classifiers are also r-consistent on r-separated data distributions, provided the kernel function decreases rapidly enough.

**Corollary 15.** Let $W$ be a kernel classifier over $\mathcal{X} \times \{\pm 1\}$ constructed from $K$ and $h_n$. Suppose the following properties hold for $K$ and $h_n$:
1. for any $c > 1$, $\lim_{x \to \infty} \frac{K(cx)}{K(x)} = 0$;
2. $\lim_{n \to \infty} h_n = 0$.

If $D$ is an r-separated distribution over $\mathcal{X} \times \{\pm 1\}$, then $W$ is r-consistent with respect to $D$.

Observe that Condition 1 is satisfied by any $K(x)$ that decreases more rapidly than an inverse polynomial, and is hence satisfied by most popular kernels, such as the Gaussian kernel.

Is the condition on $K$ in Corollary 15 necessary? The following example illustrates that a kernel classifier with an arbitrary $K$ is not necessarily r-consistent. This indicates that some condition needs to be imposed on $K$ to ensure r-consistency; finding a tight necessary condition, however, is left for future work.

**Example 3.** Let $\mathcal{X} = [-1, 1]$ and let $D$ be a distribution with $p_D(-1, -1) = 0.1$ and $p_D(1, 1) = 0.9$. Clearly, $D$ is 0.3-separated. Let $K(x) = e^{-\min(|x|, 0.2)^2}$, and let $h_n$ be any sequence with $\lim_{n \to \infty} h_n = 0$ and $\lim_{n \to \infty} n h_n = \infty$. Let $W$ be the weight classifier with input $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ such that
$$w_i^S(x) = \frac{K\left(\frac{|x - x_i|}{h_n}\right)}{\sum_{j=1}^n K\left(\frac{|x - x_j|}{h_n}\right)}.$$
$W$ can be shown to satisfy all the conditions of Theorem 13 (the proof is analogous to the case of a Gaussian kernel classifier), and is therefore consistent. However, $W$ does not learn a robust classifier on $D$ for $r = 0.3$. Consider $x = -0.7$. For any $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim D^n$, every $x_i$ is either $-1$ or $1$. Therefore, since $K(|x - (-1)|/h_n) = K(|x - 1|/h_n)$ for sufficiently small $h_n$, it follows that $w_i^S(x) = \frac{1}{n}$ for all $1 \le i \le n$. Since $x_i = 1$ with probability 0.9, with high probability $x$ will be classified as $+1$, which means that $f$, the output of $W$, is not robust at $x = -1$. Thus $f$ has astuteness at most 0.9, which means that $W$ is not r-consistent for $r = 0.3$.
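As a quick numeric illustration of Condition 1 (this check is not from the paper; the kernel names are illustrative), the ratio $K(cx)/K(x)$ vanishes for the Gaussian kernel but tends to 1 for the truncated kernel of Example 3, whose constant tail is exactly what breaks r-consistency:

```python
# Checking Condition 1 of Corollary 15 numerically for two kernels.
import numpy as np

gaussian  = lambda x: np.exp(-x**2)
truncated = lambda x: np.exp(-np.minimum(np.abs(x), 0.2)**2)  # Example 3

c = 2.0
for x in [1.0, 5.0, 10.0]:
    print(x,
          gaussian(c * x) / gaussian(x),    # -> 0 as x grows: condition holds
          truncated(c * x) / truncated(x))  # -> 1 (constant tail): condition fails
```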
## 4. General Distributions

We next consider more general data distributions, where data from different classes may be close together in space, and may even overlap. Observe that unlike the r-separated case, here there may be no classifier with astuteness one. Thus, a natural question is: what does the optimally astute classifier look like, and how can we build non-parametric classifiers that converge to this limit?

### 4.1. The r-Optimal Classifier and Adversarial Pruning

(Yang et al., 2019) propose a large-sample limit called the r-optimal classifier and show that it is analogous to the Bayes optimal classifier for robustness. More specifically, given a data distribution $D$, to find the r-optimal classifier, we solve the following optimization problem:
$$\max_{S_{+1}, S_{-1}} \int_{x \in S_{+1}} p(y = +1 \mid x)\, d\mu_D(x) + \int_{x \in S_{-1}} p(y = -1 \mid x)\, d\mu_D(x) \tag{1}$$
$$\text{subject to } d(S_{+1}, S_{-1}) > 2r.$$
The r-optimal classifier is then defined as follows.

**Definition 16** (Yang et al., 2019). Fix $r$ and $D$. Let $S_{+1}^*$ and $S_{-1}^*$ be any optimizers of (1). Then the r-optimal classifier $g_r^*$ is any classifier such that $g_r^*(x) = j$ whenever $d(x, S_j^*) \le r$.

(Yang et al., 2019) show that the r-optimal classifier achieves the optimal astuteness over all classifiers on the data distribution $D$; hence, it is a robustness analogue of the Bayes optimal classifier. Therefore, for general distributions, the goal in robust classification is to find non-parametric algorithms that output classifiers that converge towards $g_r^*$.

To find robust classifiers, (Yang et al., 2019) propose Adversarial Pruning, a defense method that preprocesses the training data to make it better separated. More specifically, Adversarial Pruning takes as input a training dataset $S$ and a radius $r$, and finds the largest subset of the training set in which differently labeled points are more than distance $2r$ apart.

**Definition 17.** A set $S_r \subseteq \mathcal{X} \times \{\pm 1\}$ is said to be r-separated if for all $(x_1, y_1), (x_2, y_2) \in S_r$ with $y_1 \neq y_2$, we have $d(x_1, x_2) > 2r$. To adversarially prune a set $S$ is to return its largest r-separated subset. We let $AdvPrune(S, r)$ denote the result of adversarially pruning $S$.

Once an r-separated subset $S_r$ of the training set is found, a standard non-parametric method is trained on $S_r$. While (Yang et al., 2019) show good empirical performance of such algorithms, no formal guarantees are provided. We next formally characterize when Adversarial Pruning followed by a non-parametric method results in a classifier that is provably r-consistent. Specifically, we analyze the general algorithm given in Algorithm 1.

**Algorithm 1** RobustNonPar
1. Input: $S \sim D^n$, weight function $W$, robustness radius $r$.
2. $S_r \leftarrow AdvPrune(S, r)$.
3. Output: $W_{S_r}$.

### 4.2. Convergence Guarantees

We begin with some notation. For any weight function $W$ and radius $r > 0$, we let $RobustNonPar(W, r)$ represent the weight function that outputs weights for $S \sim D^n$ according to $RobustNonPar(S, W, r)$. In particular, this can be used to convert any weight function algorithm into a new weight function which takes robustness into account.
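As an illustration of Definition 17, here is a minimal sketch, not the paper's implementation: for binary labels, oppositely labeled points within distance $2r$ form a bipartite conflict graph, so the largest r-separated subset is the complement of a minimum vertex cover, which König's theorem recovers from a maximum matching. Euclidean distance and the networkx library are assumptions here, and all names are illustrative.

```python
# A minimal sketch of AdvPrune for binary labels via bipartite matching.
import numpy as np
import networkx as nx
from networkx.algorithms.bipartite import hopcroft_karp_matching, to_vertex_cover

def adv_prune(X, y, r):
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    G = nx.Graph()
    G.add_nodes_from(pos)  # one side of the bipartite conflict graph
    G.add_nodes_from(neg)
    for i in pos:          # conflict edges: opposite labels within 2r
        for j in neg:
            if np.linalg.norm(X[i] - X[j]) <= 2 * r:
                G.add_edge(i, j)
    matching = hopcroft_karp_matching(G, top_nodes=set(pos))
    # Konig's theorem: a minimum vertex cover from a maximum matching.
    cover = to_vertex_cover(G, matching, top_nodes=set(pos))
    keep = sorted(set(range(len(X))) - cover)
    return X[keep], y[keep]
```

The quadratic edge construction is written for clarity; on large samples a metric index would build the conflict graph faster.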
A natural question is: for which weight functions $W$ is $RobustNonPar(W, r)$ r-consistent? Our next theorem provides sufficient conditions.

**Theorem 18.** Let $W$ be a weight function over $\mathcal{X} \times \{\pm 1\}$, and let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$. Fix $r > 0$. Let $S_r = AdvPrune(S, r)$, and for convenience relabel the $(x_i, y_i)$ so that $S_r = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$. Suppose that for any $0 < a < b$,
$$\lim_{n \to \infty} \mathbb{E}_{S \sim D^n}\left[\frac{1}{m} \sum_{i=1}^m \sup_{x \in B(x_i, a)} \sum_{j=1}^m w_j^{S_r}(x)\, \mathbb{I}_{\|x_j - x\| > b}\right] = 0.$$
Then $RobustNonPar(W, r)$ is r-consistent with respect to $D$.

Remark: there are two important differences between the conditions of Theorem 18 and Theorem 12.
1. We replace $S$ with $S_r$.
2. The expectation over $X \sim D_{\mathcal{X}}$ is replaced with an average over $\{x_1, x_2, \ldots, x_m\}$.

The intuition here is that we are replacing $D$ with the uniform distribution over $S_r$. While $D$ may not be r-separated, the uniform distribution over $S_r$ is, and it represents the region of points where our classifier is astute.

A natural question is what satisfies the conditions in Theorem 18. We next show that $k_n$-nearest neighbors and kernel classifiers with rapidly decaying kernel functions continue to satisfy them; this means that these classifiers, when combined with Adversarial Pruning, converge to r-optimal classifiers in the large sample limit.

**Corollary 19.** Let $k_n$ be a sequence with $\lim_{n \to \infty} \frac{k_n}{n} = 0$, and let $M$ denote the $k_n$-nearest neighbor algorithm. Then for any $r > 0$, $RobustNonPar(M, r)$ is r-consistent.

Remark: Corollary 19 gives a formal guarantee in the large sample limit for the modified nearest-neighbor algorithm proposed by (Yang et al., 2019).

**Corollary 20.** Let $W$ be a kernel classifier over $\mathcal{X} \times \{\pm 1\}$ constructed from $K$ and $h_n$. Suppose the following properties hold for $K$ and $h_n$:
1. for any $c > 1$, $\lim_{x \to \infty} \frac{K(cx)}{K(x)} = 0$;
2. $\lim_{n \to \infty} h_n = 0$.

Then for any $r > 0$, $RobustNonPar(W, r)$ is r-consistent.

Observe again that Condition 1 is satisfied by any $K$ that decreases more rapidly than an inverse polynomial; it is thus satisfied by most popular kernels, such as the Gaussian kernel.

## 5. Validation

Our theoretical results are, by nature, large-sample; we next examine how well they apply to the finite sample case by trying them out on a simple example. In particular, we ask the following question: *how does the robustness of non-parametric classifiers change with increasing sample size?* This question is considered in the context of two simple non-parametric classifiers: one-nearest neighbor (which is guaranteed to be r-consistent) and histograms (which are not). To be able to measure performance with increasing data size, we look at a simple synthetic dataset, the Half Moons.

### 5.1. Experimental Setup

**Classifiers and Dataset.** We consider two different classification algorithms: one-nearest neighbor (NN) and a histogram classifier (HC). We use the Halfmoon dataset with two settings of the Gaussian noise parameter $\sigma$: $\sigma = 0$ (Noiseless) and $\sigma = 0.08$ (Noisy). In the Noiseless setting, observe that the data is already 0.1-separated; in the Noisy setting, we use Adversarial Pruning (Algorithm 1) with parameter $r = 0.1$ for both classification methods.

**Performance Measure.** We evaluate robustness with respect to the $\ell_\infty$ metric, which is commonly used in the adversarial examples literature. Specifically, for each classifier, we calculate the empirical astuteness, which is the fraction of test examples on which it is astute. Observe that computing the empirical astuteness of a classifier around an input $x$ amounts to finding the adversarial example that is closest to $x$ in the $\ell_\infty$ norm. For one-nearest neighbor, we do this using the optimal attack algorithm proposed by (Yang et al., 2019). For the histogram classifier, we use the optimal attack framework proposed by (Yang et al., 2019), and show that the structure of the classifier can be exploited to solve the resulting convex program efficiently. Details are in Appendix C.

[Figure 2: Empirical accuracy/astuteness of different classifiers as a function of training sample size; accuracy is shown in green, astuteness in purple. Panels: (a) Noiseless Histogram, (b) Noisy Histogram, (c) Histogram trained on 500 samples, (d) Noiseless 1-NN, (e) Noisy 1-NN, (f) Histogram trained on 3000 samples. Left: Noiseless setting. Right: Noisy setting. Top row: histogram classifier; bottom row: 1-nearest neighbor.]

We use an attack radius of $r = 0.1$ for the Noiseless setting, and $r = 0.09$ for the Noisy setting. For all classification algorithms, we plot the empirical astuteness as a function of the training set size. As a baseline, we also plot their standard accuracy on the test set.
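A rough sketch of this pipeline, under simplifying assumptions: sklearn's `make_moons` stands in for the halfmoon data, its 1-NN classifier for ours, and a sampling-based astuteness estimate replaces the exact attacks of (Yang et al., 2019), so the numbers are optimistic approximations rather than reproductions of Figure 2.

```python
# A rough sketch of the Section 5 pipeline; sampling-based astuteness
# is an optimistic stand-in for the exact attacks used in the paper.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

def sampled_astuteness(clf, X_test, y_test, r, n_probes=100, seed=0):
    """Optimistic astuteness estimate via random l_inf perturbations."""
    rng = np.random.default_rng(seed)
    ok = 0
    for x, y in zip(X_test, y_test):
        probes = np.vstack([x[None, :], x + rng.uniform(-r, r, (n_probes, 2))])
        ok += np.all(clf.predict(probes) == y)
    return ok / len(X_test)

for n in [50, 200, 1000, 3000]:
    X, y = make_moons(n_samples=n, noise=0.0, random_state=0)      # Noiseless
    X_test, y_test = make_moons(n_samples=500, noise=0.0, random_state=1)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(n, clf.score(X_test, y_test),
          sampled_astuteness(clf, X_test, y_test, r=0.1))
```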
### 5.2. Results

The results are presented in Figure 2; the left panels are for the Noiseless setting, while the center ones are for the Noisy setting. The results show that, as predicted by our theory, in the Noiseless setting the empirical astuteness of nearest neighbors converges to 1 as the training set grows. For histogram classifiers, the astuteness converges to 0.5, indicating that the classifier may grow less and less astute with larger sample size even on well-separated data. This is plausibly because the cell size induced by the histogram shrinks with growing training data; thus, the classifier, which outputs the default label $-1$ in empty cells, is incorrect on adversarial examples that are close to a point with label $+1$ but belong to a different, empty cell. The rightmost panels in Figure 2 provide a visual illustration of this process.

For the Noisy setting, the empirical astuteness of Adversarial Pruning followed by nearest neighbors converges to 0.8. For histograms with Adversarial Pruning, the astuteness converges to 0.7, which is higher than in the Noiseless setting but still clearly sub-optimal.

### 5.3. Discussion

Our results show that even though our theory is asymptotic, our predictions continue to be relevant in finite sample regimes. In particular, on well-separated data, nearest neighbors, which we theoretically predict to be intrinsically robust, is robust; histogram classifiers, which do not satisfy the conditions in Theorem 12, are not. Our predictions continue to hold for data that is not well-separated: nearest neighbors coupled with Adversarial Pruning continues to be robust with growing sample size, while histograms continue to be non-robust. Thus our theory is confirmed by practice.

## 6. Conclusion

In conclusion, we rigorously analyze when non-parametric methods provide classifiers that are robust in the large sample limit. We provide a general condition that characterizes when non-parametric methods are robust on well-separated data, and show that the Adversarial Pruning method of (Yang et al., 2019) works on data that is not well-separated. Our results serve to provide a set of guidelines that can be used for designing non-parametric methods that are robust and accurate on well-separated data; additionally, we demonstrate that when data is not well-separated, preprocessing by Adversarial Pruning (Yang et al., 2019) does lead to optimally astute solutions in the large sample limit.

## Acknowledgements

We thank NSF under CNS 1804829 for research support.

## References

Amsaleg, L., Bailey, J., Barbe, D., Erfani, S. M., Houle, M. E., Nguyen, V., and Radovanovic, M. The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality. In 2017 IEEE Workshop on Information Forensics and Security, WIFS 2017, Rennes, France, December 4-7, 2017, pp. 1-6, 2017.

Andriushchenko, M. and Hein, M. Provably robust boosted decision stumps and trees against adversarial attacks. In Advances in Neural Information Processing Systems 32, pp. 12997-13008. Curran Associates, Inc., 2019.

Avdiukhin, D., Mitrovic, S., Yaroslavtsev, G., and Zhou, S. Adversarially robust submodular maximization under knapsack constraints. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 148-156, 2019. doi: 10.1145/3292500.3330911.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39-57, 2017.

Chen, H., Zhang, H., Boning, D. S., and Hsieh, C.-J. Robust decision trees against adversarial examples. CoRR, abs/1902.10660, 2019.
Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, 1996.

Gates, G. W. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 18(3):431-433, 1972.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples, 2014. URL http://arxiv.org/abs/1412.6572.

Gottlieb, L., Kontorovich, A., and Nisnevitch, P. Near-optimal sample compression for nearest neighbors. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, December 8-13, 2014, pp. 370-378, 2014.

Hanneke, S., Kontorovich, A., Sabato, S., and Weiss, R. Universal Bayes consistency in metric spaces. CoRR, abs/1906.09855, 2019.

Hart, P. E. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14(3):515-516, 1968.

Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems 30, pp. 2266-2276. Curran Associates, Inc., 2017.

Kantchelian, A., Tygar, J. D., and Joseph, A. D. Evasion and hardening of tree ensemble classifiers. CoRR, abs/1509.07892, 2015. URL http://arxiv.org/abs/1509.07892.

Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Towards proving the adversarial robustness of deep neural networks. In Proceedings First Workshop on Formal Verification of Autonomous Vehicles, FVAV@iFM 2017, Turin, Italy, September 19, 2017, pp. 19-26, 2017.

Kontorovich, A. and Weiss, R. A Bayes consistent 1-NN classifier. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.

Kontorovich, A., Sabato, S., and Weiss, R. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. In Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, December 4-9, 2017, pp. 1573-1583, 2017.

Liu, Y., Chen, X., Liu, C., and Song, D. Delving into transferable adversarial examples and black-box attacks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017.

Lowd, D. and Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, August 21-24, 2005, pp. 641-647. ACM, 2005.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, COLT 2019, Phoenix, AZ, USA, June 25-28, 2019, pp. 2512-2530, 2019.

Papernot, N., McDaniel, P. D., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21-24, 2016, pp. 372-387, 2016a.
Papernot, N., McDaniel, P. D., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, pp. 582-597, 2016b.

Papernot, N., McDaniel, P. D., Goodfellow, I. J., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against deep learning systems using adversarial examples. ASIACCS, 2017.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

Salman, H., Yang, G., Li, J., Zhang, P., Zhang, H., Razenshteyn, I. P., and Bubeck, S. Provably robust deep learning via adversarially trained smoothed classifiers. CoRR, abs/1906.04584, 2019. URL http://arxiv.org/abs/1906.04584.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems 31, pp. 5014-5026. Curran Associates, Inc., 2018.

Sinha, A., Namkoong, H., and Duchi, J. C. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

Sitawarin, C. and Wagner, D. A. On the robustness of deep k-nearest neighbors. In 2019 IEEE Security and Privacy Workshops, SP Workshops 2019, San Francisco, CA, USA, May 19-23, 2019, pp. 1-7, 2019.

Steinwart, I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128-142, 2005.

Stone, C. Consistent nonparametric regression. The Annals of Statistics, 5(4):595-645, 1977.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014.

Wang, Y., Jha, S., and Chaudhuri, K. Analyzing the robustness of nearest neighbors to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 5120-5129, 2018.

Yang, Y., Rashtchian, C., Wang, Y., and Chaudhuri, K. Adversarial examples for non-parametric methods: Attacks, defenses and large sample limits. CoRR, abs/1906.03310, 2019. URL http://arxiv.org/abs/1906.03310.