# Efficient Classification with Adaptive KNN

Puning Zhao and Lifeng Lai
University of California, Davis
{pnzhao, lflai}@ucdavis.edu

## Abstract

In this paper, we propose an adaptive kNN method for classification, in which different k are selected for different test samples. Our selection rule is easy to implement, since it is completely adaptive and does not require any knowledge of the underlying distribution. The convergence rate of the risk of this classifier to the Bayes risk is shown to be minimax optimal for various settings. Moreover, under some special assumptions, the convergence rate is especially fast and does not decay with the increase of dimensionality.

## Introduction

The k nearest neighbor (kNN) method is a simple and popular approach to nonparametric classification. In this setup, we have N independent and identically distributed (i.i.d.) training samples (X_i, Y_i), i = 1, . . . , N, which are all drawn from a pair of random variables (X, Y) with an unknown distribution. Given any new test sample x, the kNN classifier assigns the label determined by the majority vote of the k nearest neighbors of this new sample among the training dataset (Fix 1951). In the standard kNN method, k is fixed for all test samples. It has been proved that if k grows with N and k/N goes to zero, then as the sample size N increases, the risk of the kNN classifier converges to the Bayes risk, which is defined as the minimum possible error probability among all classifiers (Cover and Hart 1967; Biau and Devroye 2015; Stone 1977; Devroye et al. 1994; Devroye, Györfi, and Lugosi 2013; Cérou and Guyader 2006). To maximize the convergence rate, the growth rate of k needs to be properly selected to achieve a desirable bias and variance tradeoff. It has been shown that, if the support set of the distribution of the feature vector X is bounded and the probability density function (pdf) of X is bounded away from zero, then the risk of the standard kNN classifier converges to the Bayes risk with the best rate among all classifiers in the minimax sense, provided that the growth rate of k is properly selected (Audibert 2004; Kohler and Krzyzak 2007; Györfi 1981; Audibert and Tsybakov 2007; Chaudhuri and Dasgupta 2014; Döring, Györfi, and Walk 2017; Mammen and Tsybakov 1999). On the contrary, if the distribution of X has tails, then the convergence rate of the standard kNN classifier is no longer minimax optimal (Gadat, Klein, and Marteau 2016; Zhao and Lai 2019). This can be explained by the fact that the bias and the variance of the prediction vary among different locations, and hence the optimal k that achieves the best bias and variance tradeoff also depends on the feature vector. If we use the same k for all test samples, then the selected k may not be universally optimal. Therefore, to improve the performance of the kNN classifier, it is necessary to design a rule such that k is selected adaptively based on the locations of test samples (Ougiaroglou et al. 2007; Sun and Huang 2010; Balsubramani et al. 2019; Kpotufe 2011).

Several adaptive kNN classifiers have been designed and analyzed in (Gadat, Klein, and Marteau 2016; Cannings, Berrett, and Samworth 2017; Zhao and Lai 2019). In (Gadat, Klein, and Marteau 2016), a sliced nearest neighbor method was proposed.
This method divides the whole support into several regions depending on the pdf of X, and uses different k in different regions. Moreover, it is shown in (Gadat, Klein, and Marteau 2016) that this adaptive kNN method is minimax rate optimal for distributions with tails. However, this classifier requires precise knowledge of the pdf f(x), which is usually unavailable. If an estimate f̂ of the pdf is used instead, then the theoretical guarantee of this classifier has not been established once the estimation error is taken into consideration. (Cannings, Berrett, and Samworth 2017) proposed a method for semi-supervised learning, which means that, apart from the labeled training samples, we also have a much larger set of unlabeled samples. The pdf f(x) can be estimated using these unlabeled samples, and then the optimal k can be selected based on the estimated pdf. A new adaptive kNN method was proposed in (Zhao and Lai 2019), which relies entirely on the training dataset, without requiring precise knowledge of the pdf f(x) or a large number of unlabeled samples. In particular, for each test sample, the adaptive kNN method in (Zhao and Lai 2019) selects k according to the number of training samples that fall in a ball centered at the test sample with a fixed radius, so that a desirable bias and variance tradeoff can be achieved. Moreover, theoretical analysis shows that this method is minimax rate optimal for a broad range of distributions, under some tail, smoothness and margin assumptions. However, there are some design parameters of this method that need to be carefully tuned, and the optimal values of these parameters depend on the properties of the underlying joint distribution of the feature X and the label Y. Therefore, the implementation of this method still relies partly on the underlying distribution, which is unknown in practice.

In this paper, we propose a new adaptive kNN classifier that does not need any prior knowledge of the underlying distribution. We consider binary classification with Y taking values in {−1, 1}, and define η(x) = E[Y | X = x] as the conditional expectation of the label. The basic idea of our new method is to let k grow step by step. In each step, we estimate η(x) with the average of the labels of the k nearest neighbors of x. If the estimated η(x) is sufficiently far away from zero (this will be made precise in the sequel), then we can believe with high confidence that the sign of the estimated η(x) is correct, and hence the prediction will be the same as that of the Bayes classifier. In this case, our algorithm returns a prediction using the sign of the current estimate of η(x). On the contrary, if the estimated η(x) is not large enough, then the algorithm lets k increase, until the random error bound falls below the estimated |η(x)|, or k reaches an upper bound k_max. The key difference between our method and the previous adaptive kNN methods in (Gadat, Klein, and Marteau 2016; Cannings, Berrett, and Samworth 2017; Zhao and Lai 2019) is that previous methods select k based on the real or estimated pdf f(x), while our method selects k based on the estimated η(x).

We then analyze the convergence property of this new method. To begin with, we establish a general convergence bound, which depends on the joint distribution of X and Y, and holds universally without any assumptions.
Based on this general bound, we then obtain convergence rates for some common classes of distributions, as discussed in (Chaudhuri and Dasgupta 2014) and (Zhao and Lai 2019). Our results show that, for distributions both with and without tails, the proposed method is minimax rate optimal. Furthermore, we show that in a special case where the Bayes decision boundary is linear and some other assumptions hold, the convergence rate of the new adaptive kNN classifier is fast and does not become worse as the dimensionality increases.

Compared with the existing adaptive kNN methods (Gadat, Klein, and Marteau 2016; Cannings, Berrett, and Samworth 2017; Zhao and Lai 2019), our new method has the following advantages. Firstly, the new method does not require any prior knowledge of the underlying distribution, and the parameter tuning is convenient. The only parameter of this new method is k_max, the largest value of k that the algorithm will use. Unlike previous methods, in which the parameters need to be carefully tuned to achieve the optimal bias and variance tradeoff, in this new method it is always safe to use a larger k_max, although sometimes it is sufficient to use a smaller k_max to reduce the computational complexity without significantly deteriorating the performance. Moreover, the performance of our new method is also competitive. Although the previous methods in (Gadat, Klein, and Marteau 2016; Zhao and Lai 2019) are already minimax rate optimal under some tail, margin and smoothness assumptions, the convergence rate can still be further improved, since the minimax lower bound is only tight for the worst case among all distributions that satisfy those assumptions. For many common distributions, it is possible to achieve a faster convergence rate. We show that the convergence rate of our new classifier is usually better than the minimax bound for many common distributions.

Our method is a simplified form of the method proposed in (Balsubramani et al. 2019). In (Balsubramani et al. 2019), there are several parameters to be tuned. On the contrary, our method has only one design parameter, k_max, and the value of this parameter is not crucial, hence our method is easier to use. Moreover, we analyze the convergence rate of the overall excess risk of the adaptive kNN method for a broad class of different cases, while (Balsubramani et al. 2019) only analyzes its local convergence.

The remainder of this paper is organized as follows. In Section 2, we describe the detailed design of our newly proposed method. We then derive a general convergence bound of the proposed method without any assumptions in Section 3. In Section 4, we analyze the convergence rate under certain common assumptions. In Section 5, we show that for a special case, the new method has a fast convergence rate, and this rate does not become worse as the dimensionality increases. Finally, numerical examples and the concluding remarks are provided in Sections 6 and 7, respectively.

## The Proposed Method

In this section, we describe the proposed adaptive kNN classifier. Let the feature vector X and the target Y take values in R^d and {−1, 1}, respectively. (X, Y) is a pair of random variables that follows an unknown distribution. The training dataset contains N samples (X_i, Y_i), i = 1, . . . , N, which are drawn i.i.d. from this joint distribution. Based on these training samples, our goal is to learn a function g : R^d → {−1, 1}, which can be used to make a prediction Ŷ.
In this paper, we use the 0-1 loss function to evaluate the quality of classification, i.e.

L(Ŷ, Y) = 0 if Ŷ = Y, and L(Ŷ, Y) = 1 if Ŷ ≠ Y.   (1)

With this loss function, the risk of a classifier g is defined as

R(g) = E[L(Ŷ, Y)] = P(g(X) ≠ Y).   (2)

Moreover, define the regression function as

η(x) = E[Y | X = x].   (3)

From (1), it can be shown that

η(x) = P(Y = 1 | X = x) − P(Y = −1 | X = x).   (4)

It can be shown (Chaudhuri and Dasgupta 2014; Döring, Györfi, and Walk 2017) that the optimal classification rule is g*(x) = sign(η(x)), since if η(x) > 0, then Y is more likely to be 1, thus predicting Ŷ = 1 minimizes the error probability, and vice versa. The corresponding risk, called the Bayes risk, is

R* = P(g*(X) ≠ Y) = E[(1 − |η(X)|)/2].   (5)

In practice, η is unknown, and hence g*(x) is also unknown. The kNN classification rule returns g(x) = sign(η̂_k(x)) instead, in which η̂_k(x) is the estimated regression function, based on the average of the labels of the k nearest neighbors of x, i.e.

η̂_k(x) = (1/k) Σ_{i=1}^{k} Y^{(i)},   (6)

with Y^{(i)} being the label of the i-th nearest neighbor of x.

Algorithm 1: Adaptive kNN classification algorithm
Input: training samples (X_1, Y_1), . . . , (X_N, Y_N); test point x; k_max
Output: prediction Ŷ
  Find the labels Y^{(i)}, i = 1, . . . , k_max, of the nearest neighbors of x
  k ← ⌈ln^2 N⌉; η̂ ← (1/k) Σ_{i=1}^{k} Y^{(i)}
  while k ≤ k_max do
    if |η̂| > k^{−1/2} ln N then
      return sign(η̂)
    else
      η̂ ← η̂ + (Y^{(k+1)} − η̂)/(k + 1); k ← k + 1
    end if
  end while
  return a random value from {−1, 1}

In the standard kNN classifier, the value of k remains the same regardless of the value of x. In this paper, we design a new adaptive kNN classifier in which the value of k is different for different x. Our design is motivated by the following intuitions. The prediction of a kNN classifier is the same as that of the Bayes classifier if g(x) = g*(x), thus the best value of k maximizes the probability that η̂_k(x) and η(x) have the same sign. Such an optimal k changes with x. For example, consider two test samples located at x_1 and x_2, respectively, with f(x_1) > f(x_2) and |η(x_1)| < |η(x_2)|. In this case, the optimal k of the first sample is larger than that of the second one. For the first sample, since f(x_1) is larger, the kNN distances are smaller, thus we can use a larger k without worrying too much about the bias. On the other hand, since |η(x_1)| is relatively small, it is necessary to use a large k to reduce the estimation variance, so that the sign of η(x) is more likely to be inferred correctly. On the contrary, for the second sample, a smaller k is better. In the standard kNN method, k is fixed for all samples, therefore it is inevitable that k is suboptimal for some test points. As a result, in order to improve the classification accuracy, we need to estimate f(x) or η(x) to help us decide the best k to use.

Previous adaptive kNN methods in (Gadat, Klein, and Marteau 2016; Cannings, Berrett, and Samworth 2017; Zhao and Lai 2019) solve this problem by selecting k based on the real or estimated pdf f(x), so that a larger k is used where f(x) is high, and vice versa. Our new method takes a different approach: we select k based on the estimated |η(x)| instead of f(x). The detailed algorithm is shown as Algorithm 1. The main idea of this adaptive kNN algorithm is to let k grow step by step. In each step, we calculate the estimated regression function at the test point, i.e. η̂_k(x), based on the current k. Considering that the variance of η̂_k(x) scales with 1/k, if |η̂_k(x)| is larger than k^{−1/2} ln N, then with high probability the random error will not change the sign of η̂_k(x).
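To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm 1 for a single test point. The function name adaptive_knn_predict, the brute-force neighbor search, and the random-number generator used for the fallback are our own illustrative choices, not taken from the paper.

```python
# Minimal sketch of Algorithm 1 (adaptive kNN prediction for one test point).
# Names and the brute-force neighbor search are illustrative choices.
import numpy as np


def adaptive_knn_predict(X_train, y_train, x, k_max, rng=None):
    """Predict the label of x in {-1, +1} following Algorithm 1.

    X_train: (N, d) array of features; y_train: (N,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = X_train.shape[0]
    k_max = min(k_max, N)

    # Labels of the neighbors of x, sorted by distance (brute-force search).
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))
    neighbor_labels = y_train[order[:k_max]]

    log_n = np.log(N)
    k = max(int(np.ceil(log_n ** 2)), 1)       # start from k = ceil(ln^2 N)
    if k > k_max:
        return int(rng.choice([-1, 1]))        # cannot decide confidently
    eta_hat = neighbor_labels[:k].mean()       # (1/k) * sum of the k nearest labels

    while k <= k_max:
        if abs(eta_hat) > log_n / np.sqrt(k):  # |eta_hat| > k^{-1/2} ln N
            return int(np.sign(eta_hat))       # confident: return the sign
        if k == k_max:
            break
        # Incrementally include the (k+1)-th nearest neighbor in the average.
        eta_hat = eta_hat + (neighbor_labels[k] - eta_hat) / (k + 1)
        k += 1

    # |eta_hat| stayed below the threshold for all k <= k_max: give up and
    # return a random label, as in the last line of Algorithm 1.
    return int(rng.choice([-1, 1]))
```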
Even when this threshold is exceeded, it is still possible that the sign of η̂_k(x) is incorrect, due to the estimation bias, which increases with the kNN distances. Therefore, our algorithm uses the smallest k such that |η̂_k(x)| > k^{−1/2} ln N, in order to control the bias. On the other hand, if |η̂_k(x)| is not large enough, then it is likely that the sign is incorrect due to the random error, and therefore we continue to increase k, until k reaches an upper bound k_max, which is the only design parameter of our method. If |η̂_k(x)| ≤ k^{−1/2} ln N for all k ≤ k_max, then determining the sign of η(x) is hard, since |η(x)| is too low. In this case, we give up the prediction and return a random result. In Algorithm 1, k starts from ⌈ln^2 N⌉, since if k < ln^2 N, then k^{−1/2} ln N > 1, which always leads to an increase of k; thus it is not necessary to try these small values of k.

A similar idea has been proposed in (Balsubramani et al. 2019). Compared with (Balsubramani et al. 2019), our proposed method is simpler, since it uses fewer parameters. Moreover, we derive bounds on the overall excess risk for a broad class of different f and η functions, while (Balsubramani et al. 2019) only shows the local convergence rate. In the previous adaptive kNN methods (Zhao and Lai 2019; Gadat, Klein, and Marteau 2016), there are parameters that need to be tuned carefully to achieve the minimax optimal rate, and the optimal parameters depend on properties of the distribution, such as the tail, margin and smoothness parameters. However, these properties are usually unknown, therefore it is hard to find the optimal parameters. Our new method solves this problem, since it has only one design parameter, k_max, and we do not need to tune it very carefully. In fact, in Sections 3 and 4, we show that using a large k_max can always achieve good accuracy, although sometimes we can use a smaller k_max to accelerate the computation.

Note that compared with the standard kNN method, our new method does not increase the time complexity of the computation. In Algorithm 1, the computation cost consists of the nearest neighbor search and the update of η̂. The update requires O(k_max) time for each test sample, while the nearest neighbor search requires more time (Bentley, Stanat, and Williams Jr 1977; Leibe, Mikolajczyk, and Schiele 2006). As a result, the overall time complexity of our new method with parameter k_max is the same as that of the standard kNN method with k = k_max.

## General Bound

In this section, we show a general convergence bound of the new adaptive kNN classifier without any restricting assumptions. For any set S ⊂ R^d, define

η(S) := E[Y | X ∈ S],   (7)

which is the weighted average of the regression function within S. Based on the regression function, the excess risk of a classifier g(x) is

R − R* = E[|η(X)| 1(g(X) ≠ sign(η(X)))].   (8)

Denote B(x, r) = {x' ∈ R^d : ‖x' − x‖ < r} as the ball centered at x with radius r. Based on these definitions, we first establish a general convergence bound that depends on the joint distribution of X and Y and holds universally without any assumptions, and then specialize it to some common classes of distributions.

## Convergence Rate under Common Assumptions

Theorem 2 Assume that:
(a) there exist constants C and α such that P(0 < |η(X)| < t) ≤ C t^α for all t > 0;
(b) there exist two constants L, γ such that

|η(B(x, r)) − η(x)| ≤ L P^γ(B(x, r))   (15)

for all r > 0, in which P(B(x, r)) is the probability mass of B(x, r).
Then the convergence rate of the excess risk of the adaptive kNN classifier is bounded by

R − R* = O( N^{−γ(1+α)/(2γ+1)} ln^c N + k_max^{−(1+α)/2} (ln N)^{1+α} )   (16)

for some constant c. If k_max grows at least as fast as N^{2γ/(2γ+1)}, then from (16),

R − R* = O( N^{−γ(1+α)/(2γ+1)} ln^c N ).   (17)

The convergence rate in (16) matches the previous results in (Chaudhuri and Dasgupta 2014; Kohler and Krzyzak 2007), as well as the minimax lower bound in (Audibert and Tsybakov 2007), up to a log-polynomial factor.
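As a concrete illustration of the exponent in (17), take γ = 2/d and α = 1, the parameter values used later for case 1(a) in the numerical section; the short worked computation below is a sketch under these assumed values.

```latex
% Worked instance of the exponent in (17) with \gamma = 2/d and \alpha = 1
% (the parameter values used for case 1(a) in the numerical section).
\[
  \frac{\gamma(1+\alpha)}{2\gamma+1}
  \;=\; \frac{\frac{2}{d}\cdot 2}{\frac{4}{d}+1}
  \;=\; \frac{4}{d+4},
  \qquad\text{hence}\qquad
  R - R^{*} \;=\; O\!\left(N^{-\frac{4}{d+4}}\,\ln^{c} N\right).
\]
```

For d = 1, 2, 3 this exponent equals 0.80, 0.67 and 0.57, which are the theoretical entries reported for case 1(a) in Table 1.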
Assumption (a) is common in many previous works (Kohler and Krzyzak 2007; Döring, Györfi, and Walk 2017; Chaudhuri and Dasgupta 2014; Gadat, Klein, and Marteau 2016) and restricts the noise level. The convergence rate is faster with a larger α, since misclassification occurs more easily where η(x) is close to zero. Assumption (b) requires η to be continuous. Although assumption (b) does not strictly require that the pdf be bounded away from zero, it can be observed that if the distribution has tails, then in regions with low pdf, P(B(x, r)) is small, and thus |η(B(x, r)) − η(x)| needs to be very close to zero, which can be too restrictive. Therefore, this assumption usually holds for cases in which the feature distribution has bounded support and the pdf f(x) is bounded away from zero.

## Convergence Rate for Distributions with Tails

We now analyze the proposed adaptive kNN classifier for pdfs that can be arbitrarily close to zero. In particular, we analyze the convergence rate under the same assumptions as in (Zhao and Lai 2019).

Theorem 3 Assume that there exist constants Ca, Cb, Cc, Cd, α, β and D such that:
(a) for any t > 0, P(0 < |η(X)| < t) ≤ Ca t^α;
(b) for any t > 0, P(f(X) < t) ≤ Cb t^β;
(c) for any r > 0, |η(B(x, r)) − η(x)| ≤ Cc r^2;
(d) for any r ∈ (0, D], P(B(x, r)) ≥ Cd f(x) r^d.
Then

R − R* = O( N^{−min{β, 2β(1+α)/(βd+2(α+2β))}} ln^c N + k_max^{−(1+α)/2} (ln N)^{1+α} )   (18)

for some constant c.

Assumption (a) is the same as that in Theorem 2, which restricts the noise level. Assumption (b) concerns the strength of the tail: a lower β indicates a heavier tail. For example, for the one-dimensional Gaussian or exponential distribution, β = 1, while for the Cauchy distribution, β = 1/2. Assumption (c) assumes that η is smooth. Assumption (d) is the minimal mass assumption, which has been used in (Gadat, Klein, and Marteau 2016). From (18), it can be observed that if k_max grows at least as fast as N^{min{2β/(1+α), 4β/(βd+2(α+2β))}}, then the first term dominates, and hence (18) matches the minimax lower bound derived in (Zhao and Lai 2019) up to a log-polynomial factor. Moreover, using the same steps as in the proof of Theorem 3, we can also show that our new method is nearly minimax optimal for a slightly different case discussed in (Gadat, Klein, and Marteau 2016), which assumes (a), (b) and (d) in Theorem 3, while (c) is changed to the assumption that η is Lipschitz.

The results in Theorems 2 and 3 and the corresponding discussions show that our new adaptive method is nearly minimax rate optimal under a large variety of settings. From (16) and (18), we also find that it is not always necessary to use a large k_max. Letting k_max grow sublinearly with N instead of linearly with N can help reduce the computation cost without having a significant impact on the classification accuracy. However, considering that the minimum necessary k_max depends on the parameters of the distribution, such as α, β and γ, if these parameters are unknown, it would be better to use a larger k_max.

## Fast Convergence Rate for a Special Class of Distributions

The curse of dimensionality is a common problem for nonparametric learning methods. Minimax analysis in (Audibert and Tsybakov 2007; Gadat, Klein, and Marteau 2016; Zhao and Lai 2019) shows that this problem cannot be solved in general. However, we show that for a particular class of distributions, the convergence rate of the new adaptive kNN method does not decay with the increase of dimensionality.
Theorem 4 Suppose that f and η satisfy the following assumptions: there exist positive constants A, cf, Cf, cη, Cη, M1, M2, δ, D such that
(a) for all x such that |x_1| < A, in which x_j is the j-th component of x, j = 1, . . . , d, we have η(−x_1, x_2, . . . , x_d) = −η(x_1, . . . , x_d) and f(−x_1, x_2, . . . , x_d) = f(x_1, . . . , x_d);
(b) if |x_1| < A, then cf ≤ f(x_1 | x_2, . . . , x_d) ≤ Cf;
(c) if |x_1| < A, then cη ≤ ∂η(x)/∂x_1 ≤ Cη;
(d) if |x_1| < A, then …;
(e) if |x_1| < A, then ‖∇^2 η(x)‖ ≤ M2;
(f) if |x_1| ≥ A/2, then ….
Let k_max grow linearly with N. Then:
(1) if P(B(x, r)) ≥ f_L r^d for some constant f_L and all x and r ≤ D, then

R − R* = O( N^{−1} ln^2 N );   (19)

(2) if P(B(x, r)) ≥ Cd r^d f(x) for some constant Cd and all x and r ≤ D, and P(f(X) < t) ≤ Cb t^β for any t > 0, then

R − R* = O( N^{−2β/(2β+1)} ln^2 N + N^{−β} ln^{2β} N ).   (20)

Remark 1 Assumptions (a)-(f) hold for the case in which the underlying distribution is a random mixture of two distributions with opposite labels, and these two distributions are the same except that their means are different, i.e.

f(x | Y = i) = φ(x − ic), i ∈ {−1, 1},   (21)

in which c is a fixed vector, and

P(Y = 1) = P(Y = −1) = 1/2.   (22)

Remark 2 In Theorem 4, we assume that η is antisymmetric and f is symmetric around x_1 = 0. In fact, the axis x_1 = 0 can be generalized to an arbitrary linear (d − 1)-dimensional subspace. In this case, we just need to conduct a simple transformation from (x_1, . . . , x_d) to (x'_1, . . . , x'_d), such that the axis becomes x'_1 = 0. Moreover, assumption (c) can also be changed to cη ≤ −∂η(x)/∂x_1 ≤ Cη, and then (19) and (20) still hold.

From (19) and (20), we can observe that, if the pdf is bounded away from zero, then from Theorem 4 the convergence rate is always O(N^{−1}) up to a logarithmic factor, while for distributions with tails, the convergence rate depends on the tail parameter β. In both cases, the convergence rate of this new adaptive kNN method is faster than those derived in Sections 3 and 4, and does not decrease with the increase of the dimensionality, as long as assumptions (a)-(f) are satisfied. The complexity of the classification problem under these assumptions is lower than in Theorem 2 or Theorem 3, since the Bayes decision boundary is linear. As a result, with an adaptive selection of k, the convergence rate can become much faster than the minimax lower bounds derived in (Audibert and Tsybakov 2007; Gadat, Klein, and Marteau 2016; Zhao and Lai 2019).

## Numerical Examples

In this section, we use numerical experiments to validate our theoretical analysis. In particular, we calculate the convergence rates of the adaptive kNN classifier for some common distributions, and create a log-log plot of the estimated excess risk of the new adaptive kNN classifier versus the sample size. Each point in the curves is averaged over 5,000 trials. The results are shown in Figures 1 and 2.

In Figure 1, we show some examples that satisfy the assumptions in Theorem 2 or 3. In subfigures (a), (b), (c) and (d), the feature distribution is the uniform distribution on [−5, 5]^d, the standard Gaussian distribution, the standard Laplace distribution, and the triangular distribution on [−5, 5] with mode at 0, respectively. In subfigures (a) to (d), the regression functions are all sinusoidal, i.e. η(x) = sin(x_1). Subfigure (a) is an example that satisfies the assumptions in Theorem 2, while subfigures (b), (c) and (d) are examples that satisfy Theorem 3.

In Figure 2, we show some examples that satisfy the assumptions in Theorem 4. In particular, in subfigure (a),

η(x) = 2x_1 if |x_1| ≤ 1/2, η(x) = 1 if x_1 > 1/2, and η(x) = −1 if x_1 < −1/2,   (23)

in which x_1 is the first component of x.
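To illustrate how such excess-risk curves can be produced, the sketch below simulates a setting of the kind used in subfigure 2(a): labels are drawn with P(Y = 1 | X = x) = (1 + η(x))/2, as implied by (4), predictions are made with a vectorized equivalent of Algorithm 1, and the Bayes risk E[(1 − |η(X)|)/2] is estimated by Monte Carlo. The uniform support [−1, 1]^d, the sample sizes, the trial count and the choice k_max = N/2 are placeholder assumptions, not values taken from the paper's experiments.

```python
# Monte Carlo sketch of excess-risk estimation for a setting like subfigure 2(a).
# Assumptions not stated in the paper: X uniform on [-1, 1]^d, placeholder
# sample sizes and trial counts, and k_max = N // 2.
import numpy as np

rng = np.random.default_rng(0)


def eta(x1):
    """Regression function of (23), applied to the first feature component."""
    return np.clip(2.0 * x1, -1.0, 1.0)


def sample(n, d):
    """Draw n i.i.d. pairs (X, Y) with P(Y = 1 | X = x) = (1 + eta(x)) / 2."""
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    Y = np.where(rng.random(n) < (1.0 + eta(X[:, 0])) / 2.0, 1, -1)
    return X, Y


def adaptive_predict(X_train, y_train, x, k_max):
    """Vectorized equivalent of Algorithm 1: among k = ceil(ln^2 N), ..., k_max,
    use the smallest k with |eta_hat_k| > k^{-1/2} ln N; otherwise guess."""
    N = len(y_train)
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k_max]
    ks = np.arange(1, k_max + 1)
    eta_hats = np.cumsum(y_train[order]) / ks            # eta_hat_k for every k
    ok = (np.abs(eta_hats) > np.log(N) / np.sqrt(ks)) & (ks >= np.log(N) ** 2)
    idx = np.argmax(ok)                                   # first k passing the test
    if ok[idx]:
        return int(np.sign(eta_hats[idx]))
    return int(rng.choice([-1, 1]))


d, n_test, n_trials = 1, 200, 20
bayes = np.mean((1.0 - np.abs(eta(sample(100000, d)[0][:, 0]))) / 2.0)
for N in [200, 400, 800]:
    k_max = N // 2                                        # placeholder choice
    risks = []
    for _ in range(n_trials):
        X_tr, y_tr = sample(N, d)
        X_te, _ = sample(n_test, d)
        preds = np.array([adaptive_predict(X_tr, y_tr, x, k_max) for x in X_te])
        # P(g(X) != Y), computed from eta(X) directly for each test point.
        p_wrong = np.where(preds == 1,
                           (1.0 - eta(X_te[:, 0])) / 2.0,
                           (1.0 + eta(X_te[:, 0])) / 2.0)
        risks.append(p_wrong.mean())
    print(N, np.mean(risks) - bayes)                      # estimated excess risk
```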
In subfigures (b), (c) and (d), we analyze the case in which the feature distributions for Y = 1 and Y = −1 are the same except that they have different means:

f(x | Y = 1) = φ(x − e_1),   (24)
f(x | Y = −1) = φ(x + e_1),   (25)

in which e_1 = (1, 0, . . . , 0) is the unit vector in the first dimension, and φ can be the pdf of some common distribution, such as the Gaussian distribution. This type of distribution satisfies the assumptions in Theorem 4. In subfigures (b), (c) and (d), φ is the pdf of the standard Gaussian distribution, the standard Laplace distribution, and the triangular distribution on [−2, 2] with mode at 0, respectively.

Figure 1: The excess risk vs. log(N) for some examples satisfying the assumptions in Theorem 2 or 3. Each panel plots log(Estimated Excess Risk) against log(N) for d = 1, 2, 3: (a) uniform distributions with sinusoidal regression function; (b) Gaussian distributions with sinusoidal regression function; (c) Laplace distributions with sinusoidal regression function; (d) triangular distributions with sinusoidal regression function.

Figure 2: The excess risk vs. log(N) for some examples satisfying the assumptions in Theorem 4. Each panel plots log(Estimated Excess Risk) against log(N) for d = 1, 2, 3: (a) uniform distribution; (b) Gaussian distributions with different means; (c) Laplace distributions with different means; (d) triangular distributions with different means.

Based on Figures 1 and 2, we calculate the empirical convergence rates of the new adaptive kNN method, which are the negative slopes of the curves in the figures. The empirical rates are then compared with the theoretical rates. The results are shown in Table 1. For the convenience of expression, we use the value θ to indicate that the theoretical convergence rate is O(N^{−θ} ln^c N) for some constant c.

Table 1: The convergence rate of the new adaptive kNN classifier.

| Case | Empirical d=1 | Empirical d=2 | Empirical d=3 | Theoretical d=1 | Theoretical d=2 | Theoretical d=3 |
|------|---------------|---------------|---------------|-----------------|-----------------|-----------------|
| 1(a) | 0.98 | 1.03 | 0.74 | 0.80 | 0.67 | 0.57 |
| 1(b) | 0.80 | 0.81 | 0.83 | 0.57 | 0.50 | 0.44 |
| 1(c) | 0.60 | 0.54 | 0.50 | 0.57 | 0.50 | 0.44 |
| 1(d) | 0.96 | 0.60 | 0.49 | 0.67 | 0.57 | 0.50 |
| 2(a) | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2(b) | 0.99 | 0.99 | 0.99 | 0.67 | 0.67 | 0.67 |
| 2(c) | 1.05 | 1.04 | 1.03 | 0.67 | 0.67 | 0.67 |
| 2(d) | 0.97 | 0.97 | 0.98 | 0.80 | 0.80 | 0.80 |

The theoretical rate of Fig. 1(a) comes from Theorem 2 with γ = 2/d, while those of Fig. 1(b)-(d) come from Theorem 3 with β = 1, 1, 2, respectively. For Fig. 2, the theoretical results come from Theorem 4, in which (a) comes from Theorem 4(1), while the remainder come from Theorem 4(2). Note that the margin parameter is α = 1 for all of the eight cases, therefore α is not listed in the table.

From Table 1, we can observe that for some cases the empirical rates are approximately the same as the theoretical bounds, while for most of the other cases the empirical rates are actually faster. As discussed in Section 4, under the assumptions in Theorems 2 and 3, the convergence rates are already minimax optimal.
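A minimal sketch of how such an empirical rate can be extracted from a log-log curve is given below; the arrays are placeholders standing in for measured values, not data from the paper's experiments.

```python
# Fit the empirical convergence rate as the negative slope of
# log(excess risk) versus log(N); the arrays below are placeholders.
import numpy as np

log_N = np.log10([100, 178, 316, 562, 1000, 1778, 3162, 5623, 10000])
log_excess_risk = np.log10([0.12, 0.08, 0.05, 0.032, 0.02, 0.013, 0.008, 0.005, 0.003])

slope, intercept = np.polyfit(log_N, log_excess_risk, deg=1)
empirical_rate = -slope          # compare with the theoretical exponent theta
print(round(empirical_rate, 2))
```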
The minimax lower bounds in (Audibert and Tsybakov 2007; Zhao and Lai 2019), however, are established for the worst case that satisfies the assumptions. Our numerical results show that for many common distributions, the real convergence rates can be much faster. In particular, if assumptions (a)-(f) in Theorem 4 hold, then the convergence rates are fast and do not decay with the increase of dimensionality. These assumptions commonly hold for cases constructed by random mixtures of two distributions with opposite labels, different means, but the same shapes.

## Conclusion

In this paper, we have proposed a new adaptive kNN classifier, which selects different k for different test samples. Our analysis has shown that it is minimax optimal up to a log-polynomial factor under some assumptions. Moreover, if the Bayes decision boundary is linear, then under some other assumptions the convergence rate can be faster, and it does not become slower with the increase of the dimensionality. Compared with previous adaptive kNN methods, this method is more convenient to use, since it does not require any knowledge of the underlying distribution. The performance of our new method is also competitive, since for many common distributions the real convergence rate is faster than the minimax lower bound. Numerical experiments have been conducted to validate our theoretical analysis.

## References

Audibert, J.-Y. 2004. Classification under polynomial entropy and margin assumptions and randomized estimators. Preprint.

Audibert, J.-Y.; and Tsybakov, A. B. 2007. Fast learning rates for plug-in classifiers. The Annals of Statistics 35(2): 608–633.

Balsubramani, A.; Dasgupta, S.; Moran, S.; et al. 2019. An adaptive nearest neighbor rule for classification. In Advances in Neural Information Processing Systems, 7577–7586.

Bentley, J. L.; Stanat, D. F.; and Williams Jr, E. H. 1977. The complexity of finding fixed-radius near neighbors. Information Processing Letters 6(6): 209–212.

Biau, G.; and Devroye, L. 2015. Lectures on the nearest neighbor method. Springer.

Cannings, T. I.; Berrett, T. B.; and Samworth, R. J. 2017. Local nearest neighbour classification with applications to semi-supervised learning. arXiv preprint arXiv:1704.00642.

Cérou, F.; and Guyader, A. 2006. Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics 10: 340–355.

Chaudhuri, K.; and Dasgupta, S. 2014. Rates of convergence for nearest neighbor classification. In Proc. Advances in Neural Information Processing Systems, 3437–3445. Montreal, Canada.

Cover, T.; and Hart, P. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1): 21–27.

Devroye, L.; Györfi, L.; Krzyzak, A.; and Lugosi, G. 1994. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics 1371–1385.

Devroye, L.; Györfi, L.; and Lugosi, G. 2013. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media.

Döring, M.; Györfi, L.; and Walk, H. 2017. Rate of convergence of k-nearest-neighbor classification rule. Journal of Machine Learning Research 18(1): 8485–8500.

Fix, E. 1951. Discriminatory analysis: nonparametric discrimination, consistency properties. USAF School of Aviation Medicine.

Gadat, S.; Klein, T.; and Marteau, C. 2016. Classification in general finite dimensional spaces with the k-nearest neighbor rule. The Annals of Statistics 44(3): 982–1009.

Györfi, L. 1981.
The rate of convergence of kn-NN regression estimates and classification rules. IEEE Transactions on Information Theory 27(3): 362–364.

Kohler, M.; and Krzyzak, A. 2007. On the rate of convergence of local averaging plug-in classification rules under a margin condition. IEEE Transactions on Information Theory 53(5): 1735–1742.

Kpotufe, S. 2011. k-NN regression adapts to local intrinsic dimension. In Proc. Advances in Neural Information Processing Systems, 729–737. Granada, Spain.

Leibe, B.; Mikolajczyk, K.; and Schiele, B. 2006. Efficient clustering and matching for object class recognition. In BMVC, 789–798.

Mammen, E.; and Tsybakov, A. B. 1999. Smooth discrimination analysis. The Annals of Statistics 27(6): 1808–1829.

Ougiaroglou, S.; Nanopoulos, A.; Papadopoulos, A. N.; Manolopoulos, Y.; and Welzer-Druzovec, T. 2007. Adaptive k-nearest-neighbor classification using a dynamic number of nearest neighbors. In East European Conference on Advances in Databases and Information Systems, 66–82. Springer.

Stone, C. J. 1977. Consistent nonparametric regression. The Annals of Statistics 595–620.

Sun, S.; and Huang, R. 2010. An adaptive k-nearest neighbor algorithm. In 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery, volume 1, 91–94. IEEE.

Zhao, P.; and Lai, L. 2019. Minimax rate optimal adaptive nearest neighbor classification and regression. arXiv preprint arXiv:1910.10513.