# Distributionally Robust Weighted k-Nearest Neighbors

Shixiang Zhu, Carnegie Mellon University, shixianz@andrew.cmu.edu
Liyan Xie, The Chinese University of Hong Kong, Shenzhen, xieliyan@cuhk.edu.cn
Minghe Zhang, Georgia Institute of Technology, minghe_zhang@gatech.edu
Rui Gao, University of Texas at Austin, rui.gao@mccombs.utexas.edu
Yao Xie, Georgia Institute of Technology, yao.xie@isye.gatech.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has focused on developing k-nearest neighbor (k-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted k-nearest neighbors, which aims to find the optimal weighted k-NN classifiers that hedge against feature uncertainties. We develop an algorithm, Dr.k-NN, that efficiently solves this functional optimization problem and assigns minimax optimal weights to training samples when performing classification. These weights are class-dependent, and are determined by the similarities of sample features under the least favorable scenarios. The proposed framework can be shown to be equivalent to a Lipschitz norm regularization problem. We also couple our framework with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared to the state-of-the-art in the few-training-sample setting with various real-data experiments.

1 Introduction

Machine learning has proven successful in data-intensive applications but is often hampered when the data set is small. For example, in breast mammography diagnosis for breast cancer screening [5], diagnosing the type of breast cancer requires specialized analysis by pathologists in a highly time- and cost-consuming task and often leads to non-consensual results. As a result, labeled data in digital pathology are generally very scarce, as they are in many other applications. In this paper, we aim to tackle the general multi-class classification problem when only very few training samples are available for each class [43].

Evidently, the k-Nearest Neighbor (k-NN) algorithm [3] is a natural idea for tackling this problem and shows promising empirical performance. Notable contributions, including the seminal work [19] and the follow-up non-linear version [34], go beyond the vanilla k-NN and propose neighborhood component analysis (NCA). NCA learns a distance metric that minimizes the expected leave-one-out classification error on the training data using a stochastic neighbor selection rule. Some recent studies in few-shot learning utilize the limited training data using a similar idea, such as the matching network [41] and the prototypical network [39]. They are primarily based on distance-weighted k-NN, which classifies an unseen sample (a.k.a. a query) by a weighted vote of its neighbors, using the distance between two data points in the embedding space as the weights.

[Figure 1: Motivating example: a small training set of three image-label pairs: the first image is a mop; the second image is a dog; the third image looks like a mop but is, in fact, a dog (a Komondor dressing up as a mop). Both dog images are labeled Komondor. The query (the last image) is closer to the third one and more likely to be misclassified as a dog if we use distance-weighted k-NN.]
The classification performance of weighted k-NN critically depends on the choice of weighting scheme. The distance measuring the similarity between samples is typically chosen by metric learning, where a task-specific distance metric is automatically constructed from supervised data [24, 33, 41]. However, it has been recognized that such metrics may not be robust in the few-training-samples scenario, where an outlier can greatly deteriorate performance. An example illustrating this issue is shown in Figure 1. The training set, with only three labeled samples, includes the two categories we want to classify: mop and Komondor. As we can see, the query image is visually closer to the third sample and thus more likely to be misclassified as a Komondor. Here the third sample in the training set is an outlier, since it is a Komondor dressing up as a mop, and it misleads the metric learning model into capturing irrelevant details (e.g., the bucket and the mop handle) for the Komondor category. Such a problem becomes even more severe when the sample size is small.

The discussion above highlights the importance of choosing a good weighting scheme in weighted k-NN. To develop algorithms that are more robust in the few-training-samples setting, we propose a new formulation of distributionally robust weighted k-nearest neighbors. More specifically, for a given set of features of training samples, we solve a Wasserstein distributionally robust optimization problem that finds the minimax optimal weight functions for the k-nearest neighbors. This infinite-dimensional functional optimization over weight functions presents a unique challenge that the existing literature on distributionally robust optimization does not consider. To tackle this challenge, we first consider a relaxed problem that optimizes over all randomized classifiers, which turns out to admit a finite-dimensional convex programming reformulation in spite of being infinite-dimensional (Theorem 1). Next, we show that there is a weighted k-NN classifier achieving the same risk and sharing the same least favorable distributions (LFDs) as the robust classifier (Theorem 2). Thereby we prove the optimality of such a weighted k-NN classifier for the original distributionally robust weighted k-nearest neighbors problem. Furthermore, we prove the equivalence between the proposed robust weighted k-NN classifier and a Lipschitz norm regularization problem (Theorem 3), which provides promising insights into the generalization property of the robust classifier.

Based on these theoretical results, we propose a novel algorithm called Dr.k-NN. Unlike the traditional distance-weighted k-NN that uses the same weight for all label classes, our algorithm introduces a vector of weights, one for each class, for each sample in k-NN and performs a weighted majority vote. These weights are determined from the LFDs and reveal the significance of each sample in the worst case, thereby contributing effectively to the final decision making. An example is illustrated in Figure 2. Further, using differentiable optimization [4, 2], we incorporate a neural network into the minimax classifier that jointly learns the feature embedding and the minimax optimal classifier.
Numerical experiments show that our algorithm can effectively improve multi-class classification performance with few training samples on various data sets.

[Figure 2: An illustrative comparison of Dr.k-NN (panel (b)) and vanilla k-NN (panel (a)). Each colored dot is a training sample, where the color indicates its class membership. The horizontal/vertical bar represents the probability mass of one training sample under the distributions P1 and P2, respectively; panel (b) also shows the least favorable distributions and the resulting weight vectors.]

Related work. Recently, there has been much interest in multi-class classification with few training samples; see [31, 43] for a survey. The main idea of our work is related to metric learning [19, 22, 33, 34, 32], which essentially translates the hidden information carried by the limited data into a distance metric, and has been widely adopted in few-shot learning and meta learning [15, 24, 39, 41, 30]. However, unlike few-shot and meta learning, where the goal is to acquire meta knowledge from a large number of observed classes and then predict examples from unobserved classes, we focus on attacking a specific general classification problem where the number of categories is fixed but labeled data are scarce.

In this paper, we take a different probabilistic approach to exploit information from the data: we construct an uncertainty set for the distribution of each class based on the Wasserstein distance. Wasserstein distributionally robust optimization [10, 7, 13, 1, 8, 17, 38, 6, 36, 16] is an emerging paradigm for statistical learning; see [26] for a recent survey. Our work is most closely related to [18], a framework for Wasserstein robust hypothesis testing, but differs in four important ways. First, we focus on multi-class classification, while [18] only studied two hypotheses. Second, we directly minimize the misclassification error, while [18] used a convex relaxation of the 0-1 loss. Third, we analyze the generalization bound, whereas [18] does not. Fourth, while [18] requires sample features as an input, we develop a scalable algorithmic framework to simultaneously learn an optimal feature extractor parameterized by neural networks and the robust classifier, to achieve the best performance. A recent work [9] studies distributionally robust k-NN regression. Note that regression and classification are fundamentally different, as different performance metrics are used: in [9] the objective is to minimize the mean squared error, whereas in our work we minimize classification errors. Last but not least, our minimax formulation is similar to [14], which introduces a generalization of the maximum entropy principle and aims to address a different supervised learning problem. It is worth noting that [14] considers another type of ambiguity set, defined via cross-moment constraints; in contrast, we adopt Wasserstein sets in our formulation, leading to different tractable optimization reformulations and regularization terms. Another well-known work on optimal weighted nearest neighbor binary classification [35] assigns one weight to each sample; the optimal weights minimize an asymptotic expansion of the excess risk (regret). In contrast, we consider minimax robust multi-class classification, each training sample is associated with different weights for different classes, and we consider the few-sample regime instead of the asymptotic regime in [35].

2 Distributionally Robust k-NN

In this section, we present our model.
We first define the weighted k-NN classifier in Section 2.1, and then present the proposed framework of the distributionally robust k-NN problem in Section 2.2.

2.1 Weighted k-NN classifier

Let {(x_1, y_1), ..., (x_n, y_n)} be a set of training samples, where x_i denotes the i-th data sample in the observation space X, and y_i ∈ Y := {1, ..., M} denotes the class (label) of the i-th data sample. Let φ : X → Ω be a feature extractor that embeds samples into the feature space Ω (in Section 4.2 we will train a neural network to learn φ). Denote the sample feature vectors and the empirical support by ω_i := φ(x_i), i = 1, ..., n, and Ω̂ := {ω_1, ..., ω_n}. Let S = {(ω_1, y_1), ..., (ω_n, y_n)}. Define the empirical distributions

$$\widehat P_m := \frac{1}{|\{i : y_i = m\}|} \sum_{i=1}^n \delta_{\omega_i}\, \mathbb{I}\{y_i = m\}, \qquad m = 1, \ldots, M,$$

where δ_ω denotes the Dirac point mass at ω, |·| denotes the cardinality of a set, and I{·} denotes the indicator function.

Let π : Ω → Δ_M be a randomized classifier that assigns class m ∈ {1, ..., M} with probability π_m(ω) to a query feature vector ω ∈ Ω, where Δ_M is the probability simplex Δ_M = {π ∈ R_+^M : Σ_{m=1}^M π_m = 1}. It is worth mentioning that the randomized test is more general than the commonly seen deterministic test; in particular, the randomized classifier reduces to a deterministic test if for every ω there exists an m such that π_m(ω) = 1. Suppose the features in each class m follow a distribution P_m. We define the risk of a classifier π as the total error probability

$$\Psi(\pi; P_1, \ldots, P_M) := \sum_{m=1}^M \mathbb{E}_{\omega \sim P_m}\big[1 - \pi_m(\omega)\big]. \tag{1}$$

(To ease the exposition we consider only equal weights over the error probabilities, but our results can be easily generalized to any weighted average of error probabilities.)

Recall that the vanilla k-NN is performed as follows. Let c : Ω × Ω → R_+ be a metric on Ω that measures the distance between features. For any given query point ω, let σ_1^S(ω), ..., σ_n^S(ω) be a reordering of {1, ..., n} according to their distance to ω, i.e., c(ω, ω_{σ_i^S(ω)}) ≤ c(ω, ω_{σ_{i+1}^S(ω)}) for all i < n, where ties are broken arbitrarily. Here the superscript S indicates the dependence on the sample S. In vanilla k-NN, we compute the votes as

$$p_m(\omega) = \sum_{i=1}^k \mathbb{I}\{y_{\sigma_i^S(\omega)} = m\}, \qquad m = 1, \ldots, M. \tag{2}$$

The vanilla k-NN decides the class for ω by the majority vote, i.e., it accepts the class arg max_{1≤m≤M} p_m(ω). To define a weighted k-NN, we replace the equal weights in (2) by an arbitrary weight function w_m : Ω × Ω → R_+ for each class m = 1, ..., M:

$$p_m(\omega) = \sum_{i=1}^k w_m\big(\omega, \omega_{\sigma_i^S(\omega)}\big), \tag{3}$$

and use the shorthand notation w := (w_1, ..., w_M). In the sequel, we use the following general tie-breaking rule. For any ω, denote M_0(ω) := arg max_{1≤m≤M} p_m(ω). When |M_0(ω)| > 1, there is a tie at ω. We denote by q_m(ω) the probability of accepting class m for m ∈ M_0(ω), with Σ_{m∈M_0(ω)} q_m(ω) = 1. We define a weighted k-NN classifier π^knn(·; k, w) : Ω → Δ_M as

$$\pi_m^{\mathrm{knn}}(\omega; k, w) = \begin{cases} q_m(\omega), & m \in M_0(\omega), \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

A weighted k-NN classifier involves two parameters: the number of nearest neighbors k and the weighting scheme w. In particular, w_m(ω, ω_i) = I{y_i = m} recovers the vanilla k-NN, and w_m(ω, ω_i) = I{y_i = m}/c(ω, ω_i) recovers the distance-based weighted k-NN. Note that our definition (3) allows different weighting schemes for different classes, which is more general than the standard weighted k-NN.

The goal is to find the optimal weighted k-NN classifier π^knn(·; k, w) such that the risk defined in (1) is minimized. Since the underlying true distributions are unknown, the commonly used loss function is the empirical loss, obtained by substituting the empirical distributions P̂_1, ..., P̂_M into the risk function (1).
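To make the weighted voting rule of (2)-(4) concrete, the following is a minimal Python sketch (not the paper's released code); the function name, the array layout, and the use of the Euclidean distance for c are illustrative assumptions.

```python
import numpy as np

def weighted_knn_classify(query, omega, weights, k):
    """Weighted k-NN vote of eqs. (2)-(4), illustrative sketch.
    omega:   (n, d) array of training features w_1, ..., w_n
    weights: list of M functions; weights[m](query, i) returns w_m(query, w_i) >= 0
    """
    dists = np.linalg.norm(omega - query, axis=1)           # c(query, w_i) for all i
    nearest = np.argsort(dists)[:k]                         # indices of the k nearest neighbors
    votes = np.array([sum(w_m(query, i) for i in nearest) for w_m in weights])  # eq. (3)
    winners = np.flatnonzero(votes == votes.max())          # the tie set M_0(query) in (4)
    return int(np.random.choice(winners))                   # randomized tie-breaking

# E.g., weights = [lambda q, i, m=m: float(y[i] == m) for m in range(M)] recovers the
# vanilla k-NN, and dividing by the distance c(q, omega[i]) gives distance-weighted k-NN.
```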
Minimizing this empirical loss leads to the following optimization problem:

$$\min_{1 \le k \le n,\ w_m : \Omega \times \Omega \to \mathbb{R}_+,\ 1 \le m \le M}\ \Psi\big(\pi^{\mathrm{knn}}(\cdot; k, w); \widehat P_1, \ldots, \widehat P_M\big). \tag{5}$$

It is worth mentioning that this minimization problem is an infinite-dimensional functional optimization, since the weighting scheme w is a function on Ω × Ω.

2.2 Distributionally robust k-NN

In the few-training-sample setting, the empirical distributions might not be good estimates of the true distributions, since the sample size is small. To hedge against distributional uncertainty, we propose a distributionally robust counterpart of the weighted k-NN problem defined in the previous subsection. Specifically, suppose each class m is associated with a distributional uncertainty set P_m, which will be specified shortly. Given P_1, ..., P_M, define the worst-case risk of a classifier π as the worst-case total error probability

$$\max_{P_m \in \mathcal{P}_m,\, 1 \le m \le M}\ \Psi(\pi; P_1, \ldots, P_M),$$

where Ψ is defined in (1). We consider the following distributionally robust k-NN problem that finds the optimal weighted k-NN classifier minimizing the worst-case risk:

$$\min_{1 \le k \le n,\ w_m : \Omega \times \Omega \to \mathbb{R}_+}\ \max_{P_m \in \mathcal{P}_m,\, 1 \le m \le M}\ \Psi\big(\pi^{\mathrm{knn}}(\cdot; k, w); P_1, \ldots, P_M\big). \tag{6}$$

Here the optimal solution P*_1, ..., P*_M of the inner maximization problem is also called the least favorable distributions (LFDs) in the statistics literature [21]. We summarize the architecture of the proposed distributionally robust k-NN framework in Figure 3; more details are provided in Section 4.

Now we describe the uncertainty set P_m. First, since we are going to re-weight the training samples to build the classifier, we restrict the support of every distribution in P_m to Ω̂, the set of empirical points. Second, the uncertainty set is data-driven, containing the empirical distribution P̂_m and distributions in its neighborhood. Third, to measure the closeness between distributions, we choose the Wasserstein metric of order 1 [40], defined as

$$\mathcal{W}(P, P') := \min_{\gamma}\ \mathbb{E}_{(\omega, \omega') \sim \gamma}\big[c(\omega, \omega')\big]$$

for any two distributions P and P' on Ω, where the minimization over γ is taken over the set of all probability distributions on Ω × Ω with marginals P and P'. The main advantage of the Wasserstein metric is that it takes into account the geometry of the feature space by incorporating the metric c(·, ·) in its definition. Given the empirical distribution P̂_m for m = 1, ..., M, we define

$$\mathcal{P}_m := \big\{ P_m \in \mathcal{P}(\widehat\Omega) : \mathcal{W}(P_m, \widehat P_m) \le \vartheta_m \big\}, \tag{7}$$

where P(Ω̂) denotes the set of all probability distributions on Ω̂, and ϑ_m ≥ 0 specifies the size of the uncertainty set for the m-th class, i.e., the amount of deviation we would like to control.

3 Theoretical Properties

In this section, we analyze the computational tractability and statistical properties of the proposed distributionally robust weighted k-NN classifier found in (6). All proofs are delegated to Appendix B.

3.1 Robust Classification

Observe that, similar to (5), the formulation (6) is also an infinite-dimensional functional optimization. Let us first relate it to a relaxed robust classification problem, which turns out to be more tractable. Consider the following minimax robust classification problem over all randomized classifiers (recalling that Δ_M is the probability simplex in R^M):

$$\min_{\pi : \Omega \to \Delta_M}\ \max_{P_m \in \mathcal{P}_m,\, 1 \le m \le M}\ \Psi(\pi; P_1, \ldots, P_M). \tag{8}$$
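For concreteness, the order-1 Wasserstein distance that defines the balls in (7) is itself a small linear program when both distributions are supported on the empirical points. The sketch below (illustrative only, not the authors' implementation; function name and interface are ours) evaluates it with cvxpy.

```python
import numpy as np
import cvxpy as cp

def wasserstein1(p, q, C):
    """Order-1 Wasserstein distance between discrete distributions p and q
    supported on the same n points, with ground costs C[i, j] = c(w_i, w_j)."""
    n = len(p)
    gamma = cp.Variable((n, n), nonneg=True)            # transport plan (coupling)
    constraints = [cp.sum(gamma, axis=1) == p,          # row marginals equal p
                   cp.sum(gamma, axis=0) == q]          # column marginals equal q
    problem = cp.Problem(cp.Minimize(cp.sum(cp.multiply(gamma, C))), constraints)
    problem.solve()
    return problem.value

# Example: distance between the empirical weights and a re-weighted distribution.
omega = np.random.randn(5, 2)                           # 5 feature points in R^2
C = np.linalg.norm(omega[:, None, :] - omega[None, :, :], axis=-1)
p_hat = np.full(5, 0.2)
p_alt = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
print(wasserstein1(p_hat, p_alt, C))                    # in the ball (7) iff <= radius
```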
[Figure 3: An overview of the end-to-end learning framework, which consists of two cohesive components: (1) an architecture (e.g., a CNN feature extractor) that produces the feature embedding and the least favorable distributions P*_m for the training set; (2) Dr.k-NN, which makes a decision for any unseen sample based on the estimated weight vector p̃_m(ω) (probability mass on the least favorable distributions).]

Yet (8) is still an infinite-dimensional functional optimization, since we are optimizing over the set of all randomized classifiers. We establish the following theorem, stating a finite-dimensional convex programming reformulation of problem (8).

Theorem 1. For the uncertainty sets defined in (7), the least favorable distributions of problem (8) can be obtained by solving the following problem:

$$\begin{aligned}
\min_{\substack{p_1, \ldots, p_M \in \mathbb{R}^n_+ \\ \gamma_1, \ldots, \gamma_M \in \mathbb{R}^{n \times n}_+}} \quad & \sum_{i=1}^n \max_{1 \le m \le M} p^i_m \\
\text{s.t.} \quad & \sum_{i,j} \gamma^{ij}_m\, c(\omega_i, \omega_j) \le \vartheta_m, \qquad 1 \le m \le M, \\
& \sum_{j} \gamma^{ij}_m = p^i_m, \quad \sum_{i} \gamma^{ij}_m = \widehat P_m(\omega_j), \qquad 1 \le i, j \le n,\ 1 \le m \le M.
\end{aligned} \tag{9}$$

The decision variable γ_m ∈ R^{n×n}_+ can be viewed as a joint distribution on the n empirical points with marginal distributions P̂_m and P_m, the latter represented by the vector p_m ∈ R^n_+. The inequality constraint controls the Wasserstein distance between P_m and P̂_m.

Below we give an intuitive explanation of the objective function in (9). Note that max_{1≤m'≤M} p^i_{m'} − p^i_m measures the margin between the maximum likelihood of ω_i among all classes and its likelihood under the m-th class. Since the total mass Σ_i Σ_m p^i_m = M is constant, the objective in (9) can be equivalently rewritten as the minimization of the total margin

$$\sum_{i=1}^n \sum_{m=1}^M \Big( \max_{1 \le m' \le M} p^i_{m'} - p^i_m \Big).$$

When M = 2, the total margin reduces to the total variation distance. Also, let y^i_m ∈ {0, 1} be the class indicator variable of sample i, and observe that, for any t > 0,

$$\max_{1 \le m \le M} p^i_m \ \ge\ \frac{1}{t}\log\sum_{m=1}^M \exp\big(t\, p^i_m\big) - \frac{\log M}{t} \ =\ \sum_{m=1}^M y^i_m\, p^i_m \ -\ \frac{1}{t}\sum_{m=1}^M y^i_m \log\frac{\exp\big(t\, p^i_m\big)}{\sum_{m'=1}^M \exp\big(t\, p^i_{m'}\big)} \ -\ \frac{\log M}{t},$$

where the second term on the right-hand side represents the cross-entropy (or negative log-likelihood) between the class indicator and the softmax of (t p^i_1, ..., t p^i_M). Therefore, problem (9) perturbs (P̂_1, ..., P̂_M) into the LFDs (P*_1, ..., P*_M) so as to minimize the total margin as well as an upper bound on the cross-entropy of the LFDs; the smaller the margin (or cross-entropy) is, the more similar the classes are and thus the harder it is to distinguish among them.

3.2 Expressiveness of Weighted k-NN

In this subsection we study the expressive power of the class of weighted k-NN classifiers {π^knn(·; k, w) : 1 ≤ k ≤ n, w_m : Ω × Ω → R_+, 1 ≤ m ≤ M} defined in Section 2.1. The following theorem establishes the equivalence between the original problem (6) and the relaxed robust classification problem (8) studied in Section 3.1.

Theorem 2. For the uncertainty sets defined in (7), formulations (6) and (8) have identical optimal values. In addition, there exist optimal solutions of (6) and (8) that share common LFDs, which are optimal to (9).

Theorem 2 implies that the set of weighted k-NN classifiers is exhaustive, in the sense that it achieves the same optimal robust risk as optimizing over the set of all randomized classifiers. In our proof, we show that the weighted 1-NN classifier, with weights equal to the LFDs of (9), is an optimal solution to (6). Therefore, instead of solving (6) directly, by Theorem 1 we can solve the convex program (9) for the LFDs, based on which we construct a robust k-NN classifier. This justifies the Dr.k-NN algorithm to be described in Section 4.
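As a concrete illustration of this route (not the authors' released code), the convex program (9), as reconstructed above, can be written in a few lines of cvxpy; the function name, the argument layout, and the use of the Euclidean distance for c are assumptions.

```python
import numpy as np
import cvxpy as cp

def least_favorable_distributions(omega, y, M, radii):
    """Sketch of problem (9): compute LFDs (P*_1, ..., P*_M) supported on the
    empirical feature points `omega` (n x d array), given integer labels
    `y` in {0, ..., M-1} and per-class Wasserstein radii `radii`."""
    n = omega.shape[0]
    C = np.linalg.norm(omega[:, None, :] - omega[None, :, :], axis=-1)   # c(w_i, w_j)

    p = [cp.Variable(n, nonneg=True) for _ in range(M)]                  # LFD of class m
    gamma = [cp.Variable((n, n), nonneg=True) for _ in range(M)]         # couplings
    constraints = []
    for m in range(M):
        P_hat = (y == m).astype(float) / max((y == m).sum(), 1)          # empirical P_hat_m
        constraints += [
            cp.sum(cp.multiply(gamma[m], C)) <= radii[m],                # Wasserstein budget (7)
            cp.sum(gamma[m], axis=1) == p[m],                            # marginal: the LFD p_m
            cp.sum(gamma[m], axis=0) == P_hat,                           # marginal: empirical
        ]
    objective = cp.Minimize(cp.sum(cp.max(cp.vstack(p), axis=0)))        # sum_i max_m p_m^i
    cp.Problem(objective, constraints).solve()
    return [pm.value for pm in p]
```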
3.3 Lipschitz Regularization

Next, we discuss the connection between the proposed distributionally robust k-NN framework in (6), (8) and an equivalent Lipschitz regularization problem. Using duality for Wasserstein DRO [17], problem (8) is equivalent to

$$\min_{\substack{\pi : \widehat\Omega \to \Delta_M \\ \lambda_m \ge 0,\ m = 1, \ldots, M}}\ \sum_{m=1}^M \Big\{ \lambda_m \vartheta_m + \mathbb{E}_{\hat\omega \sim \widehat P_m}\Big[ \max_{\omega \in \widehat\Omega} \big( 1 - \pi_m(\omega) - \lambda_m c(\omega, \hat\omega) \big) \Big] \Big\}. \tag{10}$$

Then by [16], this problem is upper bounded by the following Lipschitz regularized classification problem

$$\min_{\pi : \widehat\Omega \to \Delta_M}\ \sum_{m=1}^M \Big\{ \mathbb{E}_{\widehat P_m}\big[ 1 - \pi_m(\omega) \big] + \vartheta_m \|\pi_m\|_{\mathrm{Lip}} \Big\}, \tag{11}$$

where

$$\|\pi_m\|_{\mathrm{Lip}} := \max_{\omega, \omega' \in \widehat\Omega,\ \omega \ne \omega'} \frac{|\pi_m(\omega) - \pi_m(\omega')|}{c(\omega, \omega')}$$

is the Lipschitz norm of the function π_m for each m = 1, ..., M. Perhaps surprisingly, the next result shows that (10) and (11) are actually equivalent (and thus, by Theorem 2, both are equivalent to (6)), despite the fact that the loss function does not satisfy the existing criteria ensuring this equivalence [13, 36, 16].

Theorem 3. Formulations (10) and (11) are equivalent, and there exists an optimizer π* of (11) satisfying ‖π*_m‖_Lip ≤ λ*_m, m = 1, ..., M, where (λ*_m)_m is the optimizer of (10) when π = π*.

The theorem is proved by using the c-transform [40] to show that any optimizer π_m can be modified into a λ*_m-Lipschitz classifier while maintaining optimality. Note that Lipschitz classifiers are known to enjoy good generalization properties [42, 20]; thus the equivalence provides promising insights into the generalization property of the robust classifier.

4 Proposed Algorithm Dr.k-NN

In this section, we present the Distributionally robust k-Nearest Neighbor (Dr.k-NN) algorithm, which is a direct consequence of the theoretical justifications in Section 3.

4.1 Dr.k-NN algorithm

Based on Theorem 2, our algorithm contains two steps.

Step 1. [Sample re-weighting] For each class m, re-weight the n samples using the distribution P*_m, where (P*_1, ..., P*_M) is the p-component of the minimizer of (9).

Step 2. [k-NN] Given a query point ω, order the training samples according to their distance to ω: c(ω, ω_{σ^S_1(ω)}) ≤ c(ω, ω_{σ^S_2(ω)}) ≤ ... ≤ c(ω, ω_{σ^S_n(ω)}). Compute the weighted k-NN votes, defined as

$$\tilde p_m(\omega) := \frac{1}{k} \sum_{i=1}^k P^*_m\big(\omega_{\sigma^S_i(\omega)}\big), \qquad m = 1, \ldots, M, \tag{12}$$

and decide the class for the query feature point ω as arg max_{1≤m≤M} p̃_m(ω), where ties are broken according to the rule (4).
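A minimal sketch of Step 2 (illustrative names; it assumes the LFDs P*_m have already been computed, for instance by the cvxpy sketch above, and that c is the Euclidean distance):

```python
import numpy as np

def dr_knn_predict(query, omega, P_star, k=5):
    """Weighted k-NN vote of eq. (12): use the LFD masses in P_star
    (a list of M arrays of length n) as class-dependent sample weights."""
    dists = np.linalg.norm(omega - query, axis=1)     # c(query, w_i) for all i
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = np.array([P_m[nearest].sum() / k for P_m in P_star])
    return int(np.argmax(votes))                      # ties broken arbitrarily here, cf. rule (4)
```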
[Figure 4: An example of the weights (P*_1, P*_2, P*_3) obtained from (9) and the corresponding results of Dr.k-NN, using a small subset of MNIST (digits 4 (red), 6 (blue), and 9 (green), with k = 5). Raw samples are projected onto a 2D feature space (d = 2), with the color indicating their true class membership. In panels (a)(b)(c), shaded areas indicate the kernel smoothing of the weights defined in (13). In panel (d), big dots represent the training points and small dots represent the query points, and their color depth indicates how likely the sample is to be classified into its true category.]

Table 1: Comparison of classification accuracy in the few-training-sample setting. Column labels are Dataset M/K (mini-IN = mini-ImageNet, Omni = Omniglot, Lung = Lung Cancer, CT = COVID-19 CT), where M is the number of classes and K is the number of training samples per class.

| Method | MNIST 2/5 | MNIST 2/10 | MNIST 5/5 | MNIST 5/10 | mini-IN 2/5 | mini-IN 2/10 | mini-IN 5/5 | mini-IN 5/10 | CIFAR-10 2/5 | CIFAR-10 2/10 | CIFAR-10 5/5 | CIFAR-10 5/10 | Omni 2/5 | Omni 2/10 | Omni 5/5 | Omni 5/10 | Lung 3/5 | Lung 3/8 | CT 2/5 | CT 2/10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA + k-NN | 0.801 | 0.872 | 0.614 | 0.678 | 0.578 | 0.667 | 0.268 | 0.277 | 0.687 | 0.711 | 0.262 | 0.270 | 0.597 | 0.638 | 0.309 | 0.358 | 0.617 | 0.647 | 0.658 | 0.719 |
| SVD + k-NN | 0.749 | 0.790 | 0.524 | 0.567 | 0.587 | 0.675 | 0.268 | 0.283 | 0.680 | 0.701 | 0.259 | 0.266 | 0.591 | 0.618 | 0.305 | 0.413 | 0.624 | 0.648 | 0.646 | 0.715 |
| NCA + k-NN | 0.602 | 0.640 | 0.340 | 0.355 | 0.547 | 0.578 | 0.245 | 0.258 | 0.597 | 0.616 | 0.232 | 0.236 | 0.549 | 0.574 | 0.267 | 0.346 | 0.575 | 0.582 | 0.612 | 0.624 |
| Matching Net | 0.732 | 0.830 | 0.625 | 0.732 | 0.687 | 0.703 | 0.286 | 0.360 | 0.632 | 0.641 | 0.241 | 0.247 | 0.735 | 0.769 | 0.412 | 0.433 | 0.621 | 0.635 | 0.715 | 0.732 |
| Prototypical Net | 0.742 | 0.842 | 0.671 | 0.759 | 0.710 | 0.725 | 0.296 | 0.348 | 0.651 | 0.664 | 0.254 | 0.259 | 0.769 | 0.836 | 0.448 | 0.532 | 0.632 | 0.644 | 0.729 | 0.744 |
| MetaOptNet | 0.725 | 0.843 | 0.658 | 0.790 | 0.732 | 0.741 | 0.255 | 0.363 | 0.702 | 0.713 | 0.257 | 0.298 | 0.742 | 0.755 | 0.412 | 0.453 | 0.638 | 0.642 | 0.713 | 0.739 |
| Feature embedding + k-NN | 0.792 | 0.798 | 0.546 | 0.551 | 0.738 | 0.742 | 0.490 | 0.486 | 0.689 | 0.691 | 0.492 | 0.494 | 0.725 | 0.751 | 0.445 | 0.495 | 0.664 | 0.691 | 0.701 | 0.710 |
| Kernel Smoothing | 0.777 | 0.873 | 0.559 | 0.579 | 0.593 | 0.601 | 0.272 | 0.278 | 0.642 | 0.661 | 0.272 | 0.282 | 0.520 | 0.565 | 0.240 | 0.285 | 0.367 | 0.370 | 0.582 | 0.604 |
| Truncated Dr.k-NN | 0.815 | 0.926 | 0.742 | 0.825 | 0.746 | 0.753 | 0.295 | 0.340 | 0.703 | 0.719 | 0.297 | 0.305 | 0.755 | 0.825 | 0.425 | 0.542 | 0.652 | 0.693 | 0.722 | 0.741 |
| Dr.k-NN | 0.838 | 0.959 | 0.746 | 0.831 | 0.752 | 0.786 | 0.306 | 0.358 | 0.707 | 0.728 | 0.309 | 0.311 | 0.765 | 0.850 | 0.465 | 0.580 | 0.667 | 0.704 | 0.734 | 0.752 |

Figure 4 gives an illustration showing the probabilistic weights (P*_1, P*_2, P*_3) for three classes and the corresponding decision boundary yielded by the weighted k-NN. For the sake of completeness, we also extend our algorithm to the non-few-training-sample setting, which is referred to as truncated Dr.k-NN. The key idea is to keep only the training samples that are important in deciding the decision boundary, based on the maximum entropy principle [11]. This can be particularly useful for the general classification problem with an arbitrary training set size. An illustration (Figure 6) and more details can be found in Appendix C.

4.2 Joint learning framework

In this section, we propose a framework that jointly learns the feature mapping and the robust classifier. Let the feature mapping φ(·; θ) be a neural network parameterized by θ whose input is a batch of training samples (Figure 3), and compose it with an optimization layer that packs the convex problem (9) as an output layer producing the LFDs of (8). The optimization layer is adopted from differentiable optimization [4, 2], in which an optimization problem is integrated as an individual layer of an end-to-end trainable deep network and the solution of the problem can be backpropagated through the network. To apply mini-batch stochastic gradient descent, we need to ensure that each batch comprises multiple mini-sets, one for each class, so that at least one training sample from each class is fed into the convex optimization layer. In light of (9), the objective of our joint learning framework is

$$\min_{\theta}\ J(\theta; P^*_1, \ldots, P^*_M) := \sum_{i=1}^n \max_{1 \le m \le M} P^*_m\big(\phi(x_i; \theta)\big),$$

where {P*_m(φ(·; θ))}_{1≤m≤M} are the LFDs generated by the convex solver defined in (9) given the input variables {ω_i = φ(x_i; θ)}_{1≤i≤n}.
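This composition could be sketched with cvxpylayers (the library behind [2]); everything below, including the function name, the fixed per-class empirical weights passed as parameters, and the commented usage with torch.cdist as the ground cost, is an illustrative assumption rather than the authors' implementation.

```python
import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

def build_lfd_layer(n, M, radii):
    """Wrap (a sketch of) problem (9) as a differentiable layer.
    Parameters: the pairwise cost matrix C (depends on the network) and the
    per-class empirical weights P_hat_m; outputs: the LFDs p_1, ..., p_M."""
    C = cp.Parameter((n, n), nonneg=True)
    P_hat = [cp.Parameter(n, nonneg=True) for _ in range(M)]
    p = [cp.Variable(n, nonneg=True) for _ in range(M)]
    gamma = [cp.Variable((n, n), nonneg=True) for _ in range(M)]
    cons = []
    for m in range(M):
        cons += [cp.sum(cp.multiply(gamma[m], C)) <= radii[m],   # Wasserstein budget
                 cp.sum(gamma[m], axis=1) == p[m],               # marginal: LFD
                 cp.sum(gamma[m], axis=0) == P_hat[m]]           # marginal: empirical
    obj = cp.Minimize(cp.sum(cp.max(cp.vstack(p), axis=0)))      # objective of (9)
    return CvxpyLayer(cp.Problem(obj, cons), parameters=[C] + P_hat, variables=p)

# Usage sketch: phi is a CNN feature extractor, x a mini-batch with labels y.
#   feats = phi(x)                                   # (n, d) torch features
#   C = torch.cdist(feats, feats)                    # pairwise Euclidean costs
#   P_star = lfd_layer(C, *P_hats)                   # LFDs, differentiable in phi
#   loss = torch.stack(P_star).max(dim=0).values.sum()   # J(theta) as above
#   loss.backward()
```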
The algorithm is summarized in Algorithm 1 (Appendix A).

[Figure 5: A comparison of the learned feature spaces and the corresponding decision boundaries. There are 10 training samples from two categories of MNIST, shown as large dots, and 1,000 query samples, shown as small dots. The color of the dots shows their true categories; the color of the region shows the decisions made by the corresponding method. Panel accuracies: (a) Dr.k-NN, 0.771; (b) PCA + k-NN, 0.709; (c) SVD + k-NN, 0.621; fourth panel, 0.507.]

5 Experiments

In this section, we evaluate our method and eight alternative approaches on four commonly used image data sets, MNIST [28], mini-ImageNet, CIFAR-10 [25], and Omniglot [27], and present a set of comprehensive numerical examples. We also test our method on two medical diagnosis data sets, Lung Cancer [12] and COVID-19 CT [44], where very few data samples are available for study due to privacy concerns and the high costs associated with harvesting data. Specifically, the Lung Cancer data record 56 attributes for only 32 patients who have been diagnosed with three types of pathological lung cancer; the COVID-19 Computed Tomography (CT) data contain 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 CTs. We present a few examples of COVID-19 and non-COVID-19 CT images in Appendix E.

Experiment set-up. We compare our method, including Dr.k-NN and its truncated version, with the following baselines: (1) k-NN based methods with different dimension reduction techniques, including Principal Component Analysis (PCA + k-NN), Singular Value Decomposition (SVD + k-NN), Neighbourhood Components Analysis (NCA + k-NN) [19], and the feature embeddings generated by Dr.k-NN (Feature embedding + k-NN) as a sanity check; (2) matching networks [41]; (3) prototypical networks [39]; (4) MetaOptNet [29]. To make these methods comparable, we adopt the same naive neural network with a single CNN layer in the matching network, the prototypical network, and our model, respectively, where the kernel size is 3, the stride is 1, and the width of the output layer is d = 400.

In our experiments, we focus on an M-class K-sample (K training samples per class) learning task. To generate the training data set, we randomly select M classes and for each class take K random samples, so the training data set contains MK samples overall. We then aim to classify a disjoint batch of unseen samples into one of these M classes; thus random performance on this task stands at 1/M. We test the average performance of different methods using 1,000 unseen samples from the same M classes. To obtain reliable results, we repeat each test 10 times and report the average accuracy. Other experimental configurations are as follows. The Adam optimizer [23] is adopted for all experiments conducted in this paper, with learning rate 10^-2. The mini-batch size is 32. The hyper-parameter ϑ_m is chosen by cross-validation and varies from application to application. The differentiable convex optimization layer we adopt is from [2]. To make all approaches comparable, we use the same network structure in the matching network, the prototypical network, and MetaOptNet as described above. We use the Euclidean distance c(ω, ω') = ‖ω − ω'‖_2 throughout our experiments. All experiments are performed on Google Colaboratory (Pro version) with 12 GB RAM and dual-core Intel processors with speeds up to 2.3 GHz (without GPU).
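For concreteness, the M-class K-sample episode construction described above could be sketched as follows; the function name, array layout, and random-number handling are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sample_episode(X, y, M=2, K=5, n_query=1000, seed=None):
    """Sample an M-class, K-sample-per-class training set and a disjoint query set."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(y), size=M, replace=False)
    train_idx, query_pool = [], []
    for c in classes:
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:K])              # K training samples of class c
        query_pool.extend(idx[K:])             # remaining samples are query candidates
    query_idx = rng.choice(query_pool, size=min(n_query, len(query_pool)), replace=False)
    return X[train_idx], y[train_idx], X[query_idx], y[query_idx]
```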
Results. We present the average test accuracy in Table 1 for the unseen samples with M = 2, 5 and K = 5, 10 on small subsets of MNIST, mini-ImageNet, CIFAR-10, and Omniglot, and with M = 3 and K = 5, 8 on the lung cancer data. Note that random performance for two-class and five-class classification is 0.5 and 0.2, respectively. The figures in Table 1 show that Dr.k-NN (k = 5) outperforms the other baselines in terms of average test accuracy on all data sets. We note that the 95% confidence interval of our method's performance on all data sets is smaller than 0.08. The truncated Dr.k-NN also yields competitive results using only 20% of the training samples (with the truncation parameter set to 0.9), compared to standard Dr.k-NN. To confirm that the proposed learning framework affects the distribution of the hidden representation of data points, we show the training and query samples in a 2D feature space and the corresponding decision boundaries in Figure 5. It turns out our framework finds a better feature representation in the 2D space, with a smooth decision boundary and a reasonable decision-confidence map (indicated by the color depth in Figure 5(a)).

Comparison to kernel smoothing. We also compare with an approach that uses kernel smoothing of the LFDs (instead of k-NN) to perform classification. Consider a Gaussian kernel κ_h(x) = |H_h|^{-1/2} κ(H_h^{-1/2} x), where H_h = h I_p is the isotropic bandwidth matrix with bandwidth h. Then we replace p̃_m(ω) in Step 2 of Dr.k-NN with

$$\tilde p_m(\omega) := \sum_{i=1}^n P^*_m(\omega_i)\, \kappa_h(\omega - \omega_i), \qquad \forall \omega \in \Omega,\ m = 1, \ldots, M. \tag{13}$$

We evaluate both methods on a subset of MNIST, which contains 1,000 testing samples (small dots) and 20 training samples (large dots) from two categories (shown in blue and red, respectively). As shown in Figure 7 and Figure 8 (Appendix D), the experimental results demonstrate the importance of using k-NN in our proposed algorithm: Dr.k-NN significantly outperforms the parallel version using kernel smoothing, even after the kernel bandwidth is optimized. We find that the performance when using kernel smoothing (13) heavily depends on selecting an appropriate kernel bandwidth h, as illustrated by Figure 8. Moreover, the best kernel bandwidth may vary from one dataset to another; therefore, cross-validation is required to find the best kernel bandwidth in practice, which is quite time-consuming. In contrast, choosing the hyper-parameter k is an easy task, since we only have limited choices of k in the few-training-sample scenario and the performance of Dr.k-NN is insensitive to the choice of k (see Figure 7).

6 Conclusion

We propose a distributionally robust k-NN classifier (Dr.k-NN) for tackling the multi-class classification problem with few training samples. To make a decision, each neighboring sample is weighted according to the least favorable distributions resulting from a distributionally robust problem. As shown in the theoretical results and demonstrated by experiments, our method achieves outstanding classification accuracy compared with other baselines while using minimal resources. The robust classifier layer (9) serves as an alternative to the usual softmax layer in a neural network for classification, and we believe it is promising for other machine learning tasks.

References

[1] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576-1584, 2015.
[2] Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J. Zico Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pages 9558-9570, 2019.
[3] Naomi S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175-185, 1992.
[4] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 136-145. JMLR.org, 2017.
[5] Guilherme Aresta, Teresa Araújo, Scotty Kwok, Sai Saketh Chennamsetty, Mohammed Safwan, Varghese Alex, Bahram Marami, Marcel Prastawa, Monica Chan, Michael Donovan, Gerardo Fernandez, Jack Zeineh, Matthias Kohl, Christoph Walz, Florian Ludwig, Stefan Braunewell, Maximilian Baust, Quoc Dang Vu, Minh Nguyen Nhat To, Eal Kim, Jin Tae Kwak, Sameh Galal, Veronica Sanchez-Freire, Nadia Brancati, Maria Frucci, Daniel Riccio, Yaqi Wang, Lingling Sun, Kaiqiang Ma, Jiannan Fang, Ismael Kone, Lahsen Boulmane, Aurélio Campilho, Catarina Eloy, António Polónia, and Paulo Aguiar. BACH: Grand challenge on breast cancer histology images. Medical Image Analysis, 56:122-139, 2019.
[6] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830-857, 2019.
[7] Jose Blanchet, Yang Kang, Fan Zhang, Fei He, and Zhangyi Hu. Doubly robust data-driven distributionally robust optimization. arXiv preprint arXiv:1705.07168, 2017.
[8] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565-600, 2019.
[9] Ruidi Chen and Ioannis Paschalidis. Selecting optimal decisions via distributionally robust nearest-neighbor regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 749-759. Curran Associates, Inc., 2019.
[10] Ruidi Chen, Ioannis Ch. Paschalidis, et al. Distributionally robust learning. Foundations and Trends in Optimization, 4(1-2):1-243, 2020.
[11] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
[12] Dheeru Dua and Casey Graff. UCI machine learning repository: Lung cancer data set, 2017.
[13] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115-166, 2018.
[14] Farzan Farnia and David Tse. A minimax approach to supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126-1135, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[16] Rui Gao, Xi Chen, and Anton J. Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.
[17] Rui Gao and Anton J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[18] Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using Wasserstein uncertainty sets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 7913-7923, Red Hook, NY, USA, 2018. Curran Associates Inc.
[19] Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, and Russ R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513-520. MIT Press, 2005.
[20] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data. IEEE Transactions on Information Theory, 60(9):5750-5759, 2014.
[21] Peter J. Huber. A robust version of the probability ratio test. Annals of Mathematical Statistics, 36(6):1753-1758, 1965.
[22] Wen Jiang, Kai Huang, Jie Geng, and Xinyang Deng. Multi-scale metric learning for few-shot learning. IEEE Transactions on Circuits and Systems for Video Technology, 31(3):1091-1102, 2021.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille, 2015.
[25] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2020.
[26] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130-166. INFORMS, 2019.
[27] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[28] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[29] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657-10665, 2019.
[30] Fei-Fei Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594-611, 2006.
[31] Xiaoxu Li, Xiaochen Yang, Zhanyu Ma, and Jing-Hao Xue. Deep metric learning for few-shot image classification: A selective review. arXiv preprint arXiv:2105.08149, 2021.
[32] Tingting Liao, Zhen Lei, Tianqing Zhu, Shan Zeng, Yaqin Li, and Cao Yuan. Deep metric learning for k nearest neighbor classification. IEEE Transactions on Knowledge and Data Engineering, 2021.
[33] Tobias Plötz and Stefan Roth. Neural nearest neighbors networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1095-1106, Red Hook, NY, USA, 2018. Curran Associates Inc.
[34] Ruslan Salakhutdinov and Geoff Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pages 412-419, San Juan, Puerto Rico, 21-24 Mar 2007. PMLR.
[35] Richard J. Samworth. Optimal weighted nearest neighbour classifiers. The Annals of Statistics, 40(5):2733-2763, 2012.
[36] Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1-68, 2019.
[37] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.
[38] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
[39] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4077-4087. Curran Associates, Inc., 2017.
[40] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[41] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3637-3645, Red Hook, NY, USA, 2016. Curran Associates Inc.
[42] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669-695, 2004.
[43] Yaqing Wang, Quanming Yao, James Kwok, and Lionel M. Ni. Generalizing from a few examples: A survey on few-shot learning, 2019.
[44] Xingyi Yang, Xuehai He, Jinyu Zhao, Yichen Zhang, Shanghang Zhang, and Pengtao Xie. COVID-CT-Dataset: A CT scan dataset about COVID-19, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]