# Weighted Distance Nearest Neighbor Condensing

Lee-Ad Gottlieb¹, Timor Sharabi¹, Roi Weiss¹

¹Department of Computer Science, Ariel University, Ariel, Israel. Correspondence to: Lee-Ad Gottlieb, Timor Sharabi, Roi Weiss.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

The problem of nearest neighbor condensing has enjoyed a long history of study, both in its theoretical and practical aspects. In this paper, we introduce the problem of weighted distance nearest neighbor condensing, where one assigns weights to each point of the condensed set, and then new points are labeled based on their weighted distance nearest neighbor in the condensed set. We study the theoretical properties of this new model, and show that it can produce dramatically better condensing than the standard nearest neighbor rule, yet is characterized by generalization bounds almost identical to the latter. We then suggest a condensing heuristic for our new problem. We demonstrate Bayes consistency for this heuristic, and also show promising empirical results.

## 1. Introduction

The nearest neighbor (NN) classifier, introduced by (Fix & Hodges, 1951), is an intuitive and popular learning tool. In this model, a learner observes a sample of labeled points and, given a new point to be classified, assigns the new point the same label as its nearest neighbor in the sample. It was subsequently demonstrated by Cover & Hart (1967) that when no label noise is present, the nearest neighbor classifier's expected error converges to zero as the sample size grows. This and other results helped spawn a deep body of research into proximity-based classification (Devroye et al., 1996; Shalev-Shwartz & Ben-David, 2014).

Nearest neighbor classifiers enjoy other advantages as well. They require only a distance function on the points, and do not even require that the host space be metric. They also extend naturally to the multi-class setting. Yet they are not without their disadvantages: for example, a naive nearest neighbor approach may require storing the entire sample.

To address the disadvantages of the NN classifier, (Hart, 1968) introduced the technique of sample compression for the NN classifier. This work defined the minimum consistent subset problem (also called the nearest neighbor condensing problem): given a sample, find a minimal subset of the sample that is consistent with it, meaning that for every point of the sample, its nearest neighbor in the subset (that is, the condensed set) has the same label. (Hart, 1968) further suggested a heuristic for the NN condensing problem. Analysis of this problem, as well as the creation of new heuristics for it, has been the subject of extensive research since its introduction (Gates, 1972; Ritter et al., 1975; Devroye et al., 1996; Wilson & Martinez, 2000).

**Our Contribution.** In this paper, we introduce a novel modification of the NN condensing problem. In our new condensing problem (presented formally in Section 2), each point of the condensed set is also assigned a weight. We modify the distance function to consider weighted distance, meaning that the distance from a new point to a point in the condensed set is their original distance divided by the weight of the point in the condensed set. It follows that assigning high weight to a point in the condensed set increases its influence on the labeling of new points.
We call the new problem of choosing a condensed set and assigning its weights the weighted distance nearest neighbor condensing problem. We proceed with an in-depth study of the statistical properties of weighted distance condensing (Section 3). Crucially, we find that our model allows dramatically better condensing than what is possible under standard (that is, unweighted) NN condensing: there are families of instances wherein the optimal condensing under the unweighted NN rule is of size Θ(n), while condensing under the weighted NN (WNN) rule yields a condensed set of size exactly 2 (Theorem 3.1). At the same time, we demonstrate generalization bounds for condensing under the WNN rule which are almost identical to those previously known for the NN rule (Theorem 3.4). This means that the much more powerful weighted rule can be adopted with only a negligible increase in the variance of the model, so that the more powerful rule does not contribute to overfitting.

Having established these statistical properties of WNN condensing, we suggest a greedy-based heuristic for this problem, that is, for the identification of a condensed set and the assignment of weights to members of this set (Section 4). We further demonstrate that the suggested heuristic is a member of a broad set of classifiers which are Bayes consistent, thereby lending statistical support to its use.

Having decided on a heuristic for our condensing problem, we compare its empirical condensing abilities to those of popular heuristics for the unweighted NN problem (Section 5). We find that the condensing achieved by our heuristic compares favorably to the others, which additionally suggests that further research on heuristics for WNN condensing is a promising direction.

### 1.1. Related Work

The nearest neighbor condensing problem is known to be NP-hard (Wilfong, 1991; Zukhba, 2010), and (Hart, 1968) provided the first heuristic for it. Many other heuristics have been suggested since, and we mention only a few of them here: (Gates, 1972) introduced the reduced nearest neighbor (RNN) rule heuristic to iteratively contract the sample set. (Ritter et al., 1975) introduced the selective subset heuristic, which additionally guarantees that for any sample point, the distance to its nearest neighbor in the compressed set is less than the distance to any sample point with the opposite label. (Barandela et al., 2005) subsequently suggested a modification to this algorithm, which they called the modified selective subset (MSS). Angiulli (2005) introduced the fast condensed nearest neighbor (FCNN) heuristic, while Flores-Velazco (2020) introduced the RSS and VSS heuristics, which modify the FCNN algorithm to improve its behavior in cases where the points are too close to each other. Another popular heuristic, modified condensed nearest neighbor (MCNN), was introduced by (Devi & Murty, 2002).

No concrete algorithmic condensing bounds were known for NN condensing until the work of Gottlieb et al. (2018): they derived hardness-of-approximation results, and designed an algorithm called NET, which computes in polynomial time an approximation to the minimum subset, almost matching the hardness bounds. This approach was later extended by Gottlieb & Ozeri (2019) to asymmetric distance functions. Also related is the result of Gottlieb & Kontorovich (2022), which presented non-uniform packings, that is, the use of balls with different radii.
As for the statistical properties of NN condensing rules, Devroye et al. (1996, Chapter 19) established Bayes consistency for an intractable rule that searches for a condensed set of fixed, data-independent size, minimizing the empirical error on the entire sample. They showed that universal Bayes consistency is achieved in finite-dimensional spaces provided the size of the condensed set is sublinear in the sample size. Hanneke et al. (2021) introduced a computationally efficient, data-dependent sample-compression NN rule, termed OptiNet, that computes a net of the samples and associates each point of the condensed set with the majority-vote label in its Voronoi cell. They showed that OptiNet is universally Bayes consistent in any separable metric space. A simpler sub-sampling NN rule achieving universal Bayes consistency in separable metric spaces was demonstrated by Győrfi & Weiss (2021), who also established error rates. Xue & Kpotufe (2018) studied the error rates achieved by more complex sub-sampling NN rules. Lastly, Kerem & Weiss (2023) studied jointly-achievable error and compression rates for OptiNet under a margin condition.

Our main statistical contribution is a compression bound (Theorem 3.4). Similar finite-sample bounds (up to constants) for standard NN rules were established in Gottlieb et al. (2014); Kontorovich et al. (2017); Hanneke et al. (2021); in particular, we leverage the fact that such bounds are dimension-free. See also (Cohen & Kontorovich, 2022), who gave compression bounds for learning mappings between two metric spaces.

We note that our weighted approach to NN condensing is not related to the weighted KNN model (Dudani, 1976). Weighted KNN is a classification model (not a compression optimization problem), a variant of KNN which classifies using the k closest neighbors while giving additional preference to some of them. Its local use of the underlying geometry of the k neighbors is unrelated to our assignment of weights to points of a condensed set in order to create a new global distance function. Our weights are chosen for condensing a realizable sample in a pre-processing stage, not for local denoising among the neighbors in the search stage. Likewise, our condensing approach places emphasis on farther points over closer ones, which is the opposite of what is done for denoising in the KNN model.

### 1.2. Preliminaries

**Metric Space.** A metric $d$ defined on a point set $X$ is a positive symmetric function satisfying the triangle inequality, i.e. $d(x,y) \le d(x,z) + d(z,y)$. The set $X$ and metric $d$ together define the metric space $(X,d)$. Let $B(x,r)$ denote the (open) ball centered at $x$ with radius $r$; a point $y \in X$ is in $B(x,r)$ if $d(x,y) < r$.

**Notation.** We use $[k]$ to denote $\{1,\dots,k\}$. We define the distance between a point $x$ and a set $S$ as the distance from $x$ to the closest point in $S$, that is, $d(x,S) = \min_{y \in S} d(x,y)$. For a set $S$, we denote its cardinality by $|S|$.

## 2. Nearest Neighbor Rules

Given a metric space $(X,d)$ and a labeled sample $S \subset X$ (where $l(x)$ is the label of point $x \in X$), the nearest neighbor condensing problem is to find a subset $\tilde S \subseteq S$ of minimal cardinality, such that the nearest neighbor rule on $\tilde S$ classifies all sample points in $S$ correctly; that is, for any point $x \in S$,
$$l(x) = l\Big(\operatorname*{arg\,min}_{y \in \tilde S} d(x,y)\Big).$$

Now let $w : X \to (0,\infty)$ be a positive weight function, and define the weighted distance
$$d_w(x,x') = \frac{d(x,x')}{w(x)\,w(x')}, \qquad x,x' \in X.$$
This gives our weighted distance nearest neighbor (WNN) rule, under which we may define a WNN classifier:
$$h_{(\tilde S,w)}(x) = l\Big(\operatorname*{arg\,min}_{y \in \tilde S} d_w(y,x)\Big), \qquad x \in X.$$
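To make the rule concrete, the following is a minimal Python sketch of a WNN classifier (NumPy arrays and the Euclidean metric are our own illustrative choices; since $w(x)=1$ for query points, the weighted distance to a condensed point $p_i$ reduces to $d(x,p_i)/w_i$):

```python
import numpy as np

def wnn_classify(x, condensed_pts, condensed_labels, weights):
    """Label x by its weighted-distance nearest neighbor in the condensed set.

    The weighted distance to condensed point p_i is d(x, p_i) / w_i, so a larger
    weight enlarges that point's region of influence.
    """
    dists = np.linalg.norm(condensed_pts - x, axis=1) / weights
    return condensed_labels[np.argmin(dists)]

# Toy usage: two prototypes, the left one carrying a larger weight.
prototypes = np.array([[0.0, 0.0], [3.0, 0.0]])
labels = np.array([+1, -1])
weights = np.array([1.5, 1.0])
# Under unweighted NN the boundary sits at x = 1.5; with these weights it moves to x = 1.8.
print(wnn_classify(np.array([1.7, 0.0]), prototypes, labels, weights))  # +1, since 1.7/1.5 < 1.3/1.0
```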
Note that the weighted distance may not satisfy the triangle inequality. In the weighted condensing problem, we seek a subset $\tilde S \subseteq S$ and a weight assignment $w$ for $\tilde S$ (with $w(x) = 1$ for all $x \in X \setminus \tilde S$), such that for each point $x \in S$, the weighted distance nearest neighbor of $x$ in $\tilde S$ has the same label as $x$, that is, $h_{(\tilde S,w)}(x) = l(x)$. It is easy to see that WNN is a generalization of the NN rule, as the latter can be recovered by simply taking all weights equal to 1.

Let us motivate our weighted distance function by illustrating the effect of weighting on the Euclidean decision boundary between two points. Consider two points in the Euclidean plane, $p_1 = (x_1,y_1)$ and $p_2 = (x_2,y_2)$. If $w(p_1) = w(p_2)$, then the WNN decision boundary coincides with the NN decision boundary, defined by
$$\sqrt{(x_1-x)^2 + (y_1-y)^2} = \sqrt{(x_2-x)^2 + (y_2-y)^2},$$
which can be simplified to a line in slope-intercept form:
$$y = x\cdot\frac{x_1-x_2}{y_2-y_1} + \frac{x_2^2 - x_1^2 + y_2^2 - y_1^2}{2(y_2-y_1)}.$$
In contrast, when $w(p_1) \neq w(p_2)$, the decision boundary is defined by
$$\frac{\sqrt{(x_1-x)^2 + (y_1-y)^2}}{w(p_1)} = \frac{\sqrt{(x_2-x)^2 + (y_2-y)^2}}{w(p_2)}.$$
Writing $w_1 = w(p_1)$ and $w_2 = w(p_2)$, this can be simplified to the standard equation of a circle:
$$\left(x + \frac{w_1^2 x_2 - w_2^2 x_1}{w_2^2 - w_1^2}\right)^2 + \left(y + \frac{w_1^2 y_2 - w_2^2 y_1}{w_2^2 - w_1^2}\right)^2 = \frac{w_1^2 x_2^2 + w_1^2 y_2^2 - w_2^2 x_1^2 - w_2^2 y_1^2}{w_2^2 - w_1^2} + \left(\frac{w_1^2 x_2 - w_2^2 x_1}{w_2^2 - w_1^2}\right)^2 + \left(\frac{w_1^2 y_2 - w_2^2 y_1}{w_2^2 - w_1^2}\right)^2.$$
See Figure 1.

Figure 1: Illustration of the decision boundary for equal (left) and unequal (right) weights. In the right figure, $w(p_1) = 1.5$ and $w(p_2) = 1$, where $p_1$ is the left point and $p_2$ is the right point.

The fact that WNN induces a circular boundary (and in higher dimensions, a ball separator) will be useful for our proofs and constructions. The separator induced by the standard unweighted NN rule is simply the Voronoi diagram. While the distance function for nearest neighbor condensing rules is often taken to be the Euclidean norm, all our results below hold for arbitrary metric distance functions.

## 3. Properties of Weighted Distance Nearest Neighbor

In this section, we establish condensing properties under WNN, especially in comparison to the condensing properties already known for the regular unweighted nearest neighbor (NN) rule. We show that the former rule is strictly more powerful: it can always yield condensing as good as the latter, and in some cases better by a factor of Θ(n). At the same time, we derive generalization bounds for WNN condensing that are essentially the same as those known for NN condensing, so that the utilization of a more powerful tool does not lead to an increase in generalization error.

### 3.1. Condensing Bounds

As weighted nearest neighbor generalizes unweighted nearest neighbor, it can only improve the condensing. We can in fact show the following:

**Theorem 3.1.** For any $n$, there exists an $n$-point dataset that can be condensed to 2 candidates under the WNN rule, but requires $\Theta(n)$ candidates under the NN rule.

In place of simply presenting a construction achieving the bound of Theorem 3.1, we will instead present a new condensing rule we call the ball cover (BC) rule, and show that WNN generalizes this rule as well. We then demonstrate that condensing under NN and BC is incomparable, in that each rule admits $n$-point sets that it can condense to a constant size, but which the other cannot condense below $\Theta(n)$ points. This expanded explanation will yield a better understanding of the power of WNN, and motivate the heuristic of Section 4 below.
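Before turning to the ball cover rule, here is a short numeric check of the boundary derivation above; the specific coordinates and weights are our own illustrative choices:

```python
import numpy as np

# Two prototypes in the plane with unequal weights, as in Figure 1 (values are illustrative).
p1, w1 = np.array([0.0, 0.0]), 1.5
p2, w2 = np.array([3.0, 0.0]), 1.0

# Center and squared radius of the boundary circle, read off from the equation above.
cx = -(w1**2 * p2[0] - w2**2 * p1[0]) / (w2**2 - w1**2)
cy = -(w1**2 * p2[1] - w2**2 * p1[1]) / (w2**2 - w1**2)
r2 = (w1**2 * (p2**2).sum() - w2**2 * (p1**2).sum()) / (w2**2 - w1**2) + cx**2 + cy**2

# Points on that circle should be exact weighted-distance ties between p1 and p2.
theta = np.linspace(0, 2 * np.pi, 100)
q = np.stack([cx + np.sqrt(r2) * np.cos(theta), cy + np.sqrt(r2) * np.sin(theta)], axis=1)
lhs = np.linalg.norm(q - p1, axis=1) / w1
rhs = np.linalg.norm(q - p2, axis=1) / w2
assert np.allclose(lhs, rhs)  # the WNN boundary is indeed this circle
```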
**Ball Cover Rule.** We define another rule for condensing, the ball cover (BC) rule: given the input points $S \subset X$, we must produce a subset $\tilde S \subseteq S$ of minimum cardinality, and also assign each point $x_i \in \tilde S$ a radius $r_i$. Let $B_i = B(x_i, r_i)$. We require for all $x_i \in \tilde S$ and $x_j \in S$ with $l(x_i) \neq l(x_j)$ that $r_i < d(x_i, x_j)$. It follows that no ball contains sample points of the opposite label. The decision rule simply assigns a point $x$ the same label as the center of a ball containing $x$. A valid BC condensing satisfies that every $x \in S \setminus \tilde S$ is found in a ball $B_i$ satisfying $l(x) = l(x_i)$. (There is, however, a caveat relating to the labeling of points not in the sample: these points can fall into multiple balls, or no ball at all. In these cases, one may impose some arbitrary decision rule, such as a priority over the balls.)

The BC rule is motivated by the NET algorithm of Gottlieb et al. (2018). This algorithm can be viewed as covering the space with balls of equal radius. The BC rule is an extension of the NET approach to balls of different radii.

It is easy to show that WNN generalizes the BC rule: given a condensed set $\tilde S$ with assigned radii $r_i$, a WNN classifier can be produced by taking the same points of $\tilde S$ and assigning weight $w(x_i) = r_i$ to each $x_i \in \tilde S$. Consider any point $x \in S \setminus \tilde S$ falling in $B_i$ but not in $B_j$; then
$$d_w(x, x_i) = \frac{d(x, x_i)}{w(x_i)} < \frac{r_i}{r_i} = 1, \qquad\text{while}\qquad d_w(x, x_j) = \frac{d(x, x_j)}{w(x_j)} \ge \frac{r_j}{r_j} = 1,$$
so that the WNN rule is indeed consistent with the BC rule on the sample.

**Comparison Between NN and BC.** It remains to prove the following lemma, concerning condensing under the NN and BC rules:

**Lemma 3.2.** For any $n$, there exists an $n$-point dataset that can be condensed to 2 candidates under the NN rule, but requires $\Theta(n)$ candidates under the BC rule. Likewise, for any $n$, there exists an $n$-point dataset that can be condensed to 2 candidates under the BC rule, but requires $\Theta(n)$ candidates under the NN rule.

As WNN generalizes BC, Theorem 3.1 follows immediately from Lemma 3.2. It remains to prove the lemma.

*Proof of Lemma 3.2.* For the first item of the lemma, assume without loss of generality that $n$ is even, and set $\gamma = n/2$. Our example set $X$ will have $\gamma$ red points and $\gamma$ blue points, with diameter $\Theta(n)$ and minimum distance 1 between the points. Let $c = \tfrac12$, and add to the set $\gamma$ red points at positions $\{(i, c) : i \in [\gamma]\}$, and $\gamma$ blue points at positions $\{(i, -c) : i \in [\gamma]\}$. See Figure 2.

Under the NN rule, there exists a solution which uses only two points: this solution $\tilde S$ takes the red point at $(1, c)$ together with the blue point at $(1, -c)$. Take any red point at $(i, c)$: its distance from the red candidate is $i-1$, while its distance from the blue candidate is exactly $\sqrt{(i-1)^2 + 4c^2} > i-1$. A similar argument applies to the blue points, and so we conclude that $\tilde S$ is a consistent condensing of $X$.

Turning to the BC rule, we show that any solution under this rule must use all $n$ points. Assume by contradiction that we can consistently cover all the points of $X$ with only $k$ balls, where $k < n$. This implies that there exists some solution ball which covers 2 or more points of the same color; assume without loss of generality that this ball is red, and that its center is $p_i = (i, c)$. As this ball covers more than a single red point, its radius must be greater than 1, but then it contains the blue point $(i, -c)$, which is forbidden. We conclude that no ball contains more than one point, and so the optimal solution under BC must contain exactly $n$ points.
For the second item of the lemma, assume without loss of generality that $n$ is odd, set $\gamma = \frac{n-1}{2}$, and further assume that $\gamma$ is odd. Our example has $\gamma$ blue points and $\gamma + 1$ red points: let the points $\{(0, 2i) : i \in [\gamma]\}$ be red, and the points $\{(1, 2i+1) : i \in [\gamma-1]\}$ be blue. Set $t = \frac{(\gamma+1)^2}{2}$, and add an additional red point $r = (-t, \gamma+1)$ and an additional blue point $b = (2t, \gamma+1)$. See Figure 2.

We show that under the BC rule, two balls are sufficient to correctly label all points. First take the red ball centered at $r$ with radius $\sqrt{t^2 + (\gamma+1)^2} = \sqrt{t^2 + 2t} < t+1$. Clearly this ball contains no blue points, since the line containing the blue points is at distance exactly $t+1$ from $r$. But the ball contains all the red points, as the farthest red points from $r$ are at $(0,2)$ and $(0,2\gamma)$, both of which are at distance $\sqrt{t^2 + (\gamma-1)^2}$ from the ball center, and hence inside the ball. Now take the blue ball centered at $b$ with radius $\sqrt{(2t-1)^2 + (\gamma+1)^2} = \sqrt{(2t-1)^2 + 2t} < 2t$. A similar argument to that above shows that this ball contains all the blue points but none of the red: the line containing the red points is at distance exactly $2t$ from $b$, while the farthest blue points from $b$, at $(1,3)$ and $(1,2\gamma-1)$, are at distance $\sqrt{(2t-1)^2 + (\gamma-2)^2}$ from the ball center, and hence inside the ball.

We now show that under the NN rule, at least $n-2$ points must be found in the condensed set. First consider the blue points on the line $x=1$: at least one of these must be in the condensed set, since if $b$ were the only blue point in the condensed set, then each blue point on the line would be closer to any red point of the condensed set than to $b$. Now take a blue point $(1, 2i+1)$ in the condensed set; it must be that the red points $(0, 2i)$ and $(0, 2i+2)$ are both in the condensed set, since this blue point is closer to each of them than any other red point. It similarly follows that the blue points $(1, 2i-1)$ and $(1, 2i+3)$ are both in the condensed set, and in fact that all red points on the line $x=0$ and all blue points on the line $x=1$ are in the condensed set.

Figure 2: Left: a set that admits good condensing under the NN rule, but not under the BC rule. Right: a set that admits good condensing under the BC rule, but not under the NN rule.
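The first construction can also be checked numerically; the following short sketch (the code and the choice $n = 20$ are our own, for illustration) verifies that the two-point NN condensing classifies every sample point correctly:

```python
import numpy as np

def nn_label(x, protos, labels):
    # Plain (unweighted) nearest neighbor label.
    return labels[np.argmin(np.linalg.norm(protos - x, axis=1))]

n = 20                       # any even n; the construction scales linearly
gamma, c = n // 2, 0.5
reds  = np.array([[i,  c] for i in range(1, gamma + 1)], dtype=float)
blues = np.array([[i, -c] for i in range(1, gamma + 1)], dtype=float)

# Two-point NN condensing from the first construction: one red and one blue prototype at i = 1.
protos = np.array([[1.0, c], [1.0, -c]])
labels = np.array([+1, -1])

assert all(nn_label(p, protos, labels) == +1 for p in reds)
assert all(nn_label(p, protos, labels) == -1 for p in blues)
```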
### 3.2. Learning Bounds

Here we establish uniform generalization bounds for WNN rules. By definition, a general WNN classifier is uniquely determined by a subset of labeled samples $\tilde S$ and a weight function $w$. In addition, since $w(x) = 1$ for all $x \notin \tilde S$, $w$ is uniquely determined by its values on $\tilde S$. The following representation lemma establishes a $2|\tilde S|$-sample compression scheme for the class of WNN rules, which will allow us to derive compression-based error bounds (Littlestone & Warmuth, 1986; Floyd & Warmuth, 1995; Graepel et al., 2005).

**Lemma 3.3.** There exist an encoding function $C$ and a reconstruction function $R$ that satisfy the following: for any finite labeled set $S$, any subset $\tilde S \subseteq S$, and any weight assignment $w$ for $\tilde S$, if the WNN classifier corresponding to the pair $(\tilde S, w)$ is consistent on $S$, then $C(S, \tilde S, w)$ returns two ordered labeled subsets $S', W' \subseteq S$, each of size $|\tilde S|$, such that the WNN classifier $h_{S',W'} = R(S', W')$ is consistent on $S$.

*Proof.* We first describe the encoding function $C$ that, given $S$ and the pair $(\tilde S, w)$, selects $S', W'$. Then we describe the reconstruction function $R$ and conclude that $R(S', W')$ is indeed consistent on $S$.

Given $S$ and $(\tilde S, w)$, the encoding function $C$ returns two ordered lists $S', W' \subseteq S$ such that $S'$ has the same sample points as $\tilde S$ but in a specific order, and $W'$ consists of sample points from $S$ from which appropriate weights for $S'$ can be deduced via $R$. The first sample in the list $S'$ corresponds to the point $x \in \tilde S$ having the maximal weight, and the first sample in $W'$ is also taken to be $x$.

Subsequent points are added to $S'$ and $W'$ by the following procedure. Starting with the initial weights as determined by $w$, multiplicatively increase the weights of all points of $\tilde S$ not yet placed in $S'$, until one of the following occurs:

(i) A point $x \in \tilde S \setminus S'$ has weight equal to that of a point already in $S'$, say $x_1$. Then $x$ is added to $S'$ and $x_1$ is added to $W'$.

(ii) There is some point $x \in S \setminus \tilde S$ that became equidistant (under the weighted distance) from its closest point $x_1 \in \tilde S$ of the same label and its closest point $x_2 \in \tilde S$ of the opposite label. It must be that $x_1$ is already in $S'$, or else the weights of $x_1, x_2$ would increase in unison. Then $x_2$ is added to $S'$ and $x$ is added to $W'$.

The weight-increase procedure is carried on until all samples of $\tilde S$ have been placed in $S'$.

As for the reconstruction, given two lists of sample point–label pairs $S' = ((x_1,y_1),\dots,(x_m,y_m))$ and $W' = ((x'_1,y'_1),\dots,(x'_m,y'_m))$ that have been computed by $C$ as described above, the reconstruction function $R$ constructs a WNN classifier corresponding to the pair $(S', w')$, where the weight assignment $w'$ is computed as follows. The weight of the first point in $S'$ is set to $w'(x_1) = 1$. The weights of subsequent points in $S'$ are set depending on their corresponding points in $W'$. For $k > 1$:

- If $x'_k$ appears in $(x_1,\dots,x_{k-1})$, say as $x_j$, then we are in case (i), and we thus put $w'(x_k) = w'(x_j)$.
- If $x'_k$ does not appear in $(x_1,\dots,x_{k-1})$, then we are in case (ii), and $(x'_k, y'_k)$ corresponds to a witness for $(x_k, y_k)$ and some other point in $(x_1,\dots,x_{k-1})$, say $(x_j, y_j)$. The identity of $x_j$ can be inferred by finding the point in $(x_1,\dots,x_{k-1})$ with label opposite to $y_k$ and having the minimal weighted distance to $x'_k$,
$$x_j = \operatorname*{arg\,min}_{x_i \in \{x_1,\dots,x_{k-1}\}:\, y_i \neq y_k} \frac{d(x'_k, x_i)}{w'(x_i)},$$
breaking ties towards the $x_i$ with smallest index $i$. Then $w'(x_k)$ is set to satisfy the equation
$$\frac{d(x'_k, x_k)}{w'(x_k)} = \frac{d(x'_k, x_j)}{w'(x_j)}.$$

Since multiplying the weights of several points in unison does not change their pairwise weighted-distance boundaries, it is clear that throughout the construction process carried out by $C$ the classifier remains consistent on $S$, provided ties in weighted distances are decided in favor of points in $S'$ of smaller index. Hence $R(S', W')$ is consistent on $S$.
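The reconstruction step is simple enough to sketch directly. The following Python fragment (our own illustrative code, with points represented as tuples and the encoding step $C$ omitted) implements the two cases of $R$ described in the proof:

```python
def reconstruct_weights(S_prime, W_prime, d):
    """Sketch of the reconstruction step R from Lemma 3.3.

    S_prime, W_prime: equal-length ordered lists of (point, label) pairs, with points
    given as tuples, as produced by the encoding C; d: the underlying metric.
    Returns the weights w'(x_1), ..., w'(x_m) of the points of S_prime, in order.
    """
    m = len(S_prime)
    w = [1.0] + [None] * (m - 1)            # w'(x_1) = 1
    for k in range(1, m):
        xk, yk = S_prime[k]
        xpk, _ = W_prime[k]
        earlier = [i for i in range(k) if S_prime[i][0] == xpk]
        if earlier:                          # case (i): x'_k is an earlier point of S'
            w[k] = w[earlier[0]]
        else:                                # case (ii): x'_k is a witness point
            opp = [i for i in range(k) if S_prime[i][1] != yk]
            j = min(opp, key=lambda i: (d(xpk, S_prime[i][0]) / w[i], i))
            w[k] = w[j] * d(xpk, xk) / d(xpk, S_prime[j][0])
    return w
```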
Lemma 3.3 can be used to derive generalization bounds, as we show in Theorem 3.4. Note first, however, that the reconstruction function $R$ of Lemma 3.3 relies heavily on the fact that $S'$ and $W'$ are ordered. Below in Section 4 we will consider a subclass of WNNs whose $R$ assumes no such matching is given, so that the matching needs to be deduced. In this case a tighter generalization bound holds, which we will leverage to establish Bayes consistency for the aforementioned subclass in Section 4.2. Formally, a reconstruction function $R$ is said to be permutation invariant if for any two arbitrary permutations $\sigma_1, \sigma_2$ of the samples in $S'$ and $W'$ respectively,
$$R(\sigma_1(S'), \sigma_2(W')) = R(S', W'). \qquad (1)$$
In other words, a permutation-invariant $R$ is able to deduce the matching between the samples in $S'$ and their weights from the unordered elements of $S'$ and $W'$ alone.

We can now present the generalization bounds. These essentially correspond to sample compression-based error bounds for the compression scheme established in Lemma 3.3. See Section 1.1 for previously known bounds related to these.

**Theorem 3.4.** For any probability distribution of $x$, any labeling function $l : X \to \{-1,1\}$, any $n \in \mathbb N$ and any $0 < \delta < 1$, it holds with probability $1-\delta$ over the i.i.d. labeled sample $S = \{(x_1, l(x_1)),\dots,(x_n, l(x_n))\}$ that for any $\tilde S \subseteq S$ and weight assignment $w$ for $\tilde S$, if the WNN classifier $h_{(\tilde S,w)}$ corresponding to the pair $(\tilde S, w)$ correctly classifies all points in $S$, then
$$\operatorname{err}(h_{(\tilde S,w)}) \le \frac{2}{n - |\tilde S|}\left(2|\tilde S|\log 2n + \log\frac{n}{\delta}\right). \qquad (2)$$
If in addition the reconstruction function $R$ is permutation invariant, then
$$\operatorname{err}(h_{(\tilde S,w)}) \le \frac{2}{n - |\tilde S|}\left(2|\tilde S|\log\frac{2en}{|\tilde S|} + \log\frac{n}{\delta}\right). \qquad (3)$$

The proof of Theorem 3.4 is deferred to Appendix A.

## 4. Greedy Heuristic for WNN, and its Properties

In this section, we suggest a heuristic to produce a weighted condensed set. The heuristic is motivated by the ball cover rule introduced above. After presenting the heuristic, we establish that it is Bayes consistent under mild assumptions.

### 4.1. Heuristic

Our heuristic for WNN condensing is based on a greedy approach for the BC rule, meaning that at each step we identify a ball covering the maximum number of uncovered points of the same label (and no points of the opposite label), and add it to our ball set. Similarly, in our WNN heuristic (see Algorithm 1), we iteratively add to the condensed set a point and weight which correspond to the center and radius of the ball covering the largest number of as-of-yet uncovered points. (The equivalence of the radius in the BC rule to the weight in the WNN rule was already established in Section 3.1.) As in (Flores-Velazco & Mount, 2021), the notation $d_{ne}(x)$ denotes the distance from $x$ to its closest oppositely labeled point in $S$ (its "nearest enemy").

```text
Algorithm 1: Greedy weighted heuristic
Input: point set S
Initialize: solution set T ← ∅, S′ ← S, weight function w : S → {1}
while S′ ≠ ∅ do
    x ← argmax_{x ∈ S} |B(x, d_ne(x)) ∩ S′|
    S′ ← S′ \ B(x, d_ne(x))
    T ← T ∪ {x}
    w(x) ← d_ne(x)
end while
return T, w
```
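For concreteness, the following is a compact Python sketch of Algorithm 1 for points in Euclidean space; the vectorized implementation details (an explicit pairwise-distance matrix and a boolean cover matrix, quadratic in memory) are our own illustrative choices and not necessarily those used in our experiments:

```python
import numpy as np

def greedy_wnn_condense(X, y):
    """Greedy weighted heuristic (Algorithm 1), sketched for points in R^d.

    X: (n, d) array of sample points, y: (n,) array of labels.
    Returns the condensed points T, their labels, and their weights
    (each weight being the point's nearest-enemy distance).
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    enemy = y[:, None] != y[None, :]
    dne = np.where(enemy, D, np.inf).min(axis=1)                # nearest-enemy distances
    covers = D < dne[:, None]                                   # covers[i, j]: x_j in B(x_i, d_ne(x_i))

    uncovered = np.ones(len(X), dtype=bool)
    T, weights = [], []
    while uncovered.any():
        i = np.argmax((covers & uncovered).sum(axis=1))         # ball covering most uncovered points
        uncovered &= ~covers[i]
        T.append(i)
        weights.append(dne[i])
    return X[T], y[T], np.array(weights)
```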
### 4.2. Bayes Consistency

In the following, we consider a family of WNN classifiers that use a specific (data-dependent) weight function $w_{ne}$, which assigns to each data point $(\tilde x, \tilde y) \in \tilde S \subseteq S$ a weight equal to the minimal distance from $\tilde x$ to the points in $S$ that have the opposite label from $\tilde y$, and for points $x \notin \tilde S$ assigns $w_{ne}(x) = 1$. That is, defining $S_+$ and $S_- = S \setminus S_+$ as the split of $S$ into positively and negatively labeled points respectively,
$$w_{ne}(\tilde x) = d_{ne}(\tilde x) = \begin{cases} d(\tilde x, S_-), & \text{if } \tilde x \in S_+,\\ d(\tilde x, S_+), & \text{if } \tilde x \in S_-. \end{cases}$$

Note that for this subclass of WNN classifiers there is a simpler compression scheme than that of Lemma 3.3; in particular, the reconstruction function $R$ can be made permutation invariant in the sense of (1). Indeed, the weight assignment $w_{ne}$ for $\tilde S$ can be encoded into a subset $\tilde W \subseteq S \setminus \tilde S$ consisting of the nearest enemies of the samples in $\tilde S$. Then the weight for $(\tilde x, \tilde y) \in \tilde S$ can be uniquely determined by splitting $\tilde W$ into its positively and negatively labeled samples $\tilde W_+$ and $\tilde W_-$ and putting $w_{ne}(\tilde x) = \min_{(x,y)\in\tilde W:\,y\neq\tilde y} d(\tilde x, x)$. With this rule, for any two permutations $\sigma_1, \sigma_2$, $R(\sigma_1(\tilde S), \sigma_2(\tilde W)) = R(\tilde S, \tilde W)$.

We first consider the (computationally intractable) learning rule that finds the subset $\tilde S^* \subseteq S$ of minimal cardinality such that the classifier $h_{(\tilde S^*, w_{ne})}$ is consistent on $S$,
$$\tilde S^* = \operatorname*{arg\,min}_{\tilde S \subseteq S}\Big\{|\tilde S| \ :\ h_{(\tilde S, w_{ne})}(x) = y \ \ \forall (x,y) \in S\Big\}. \qquad (4)$$

The following theorem establishes the Bayes consistency¹ of $h_{(\tilde S^*, w_{ne})}$, meaning that its error on a new sample, drawn independently from the same probability distribution that generated the dataset, converges to zero as the dataset size increases, with probability one over the random dataset.

¹ Not to be confused with consistency, which here means that $h_{(\tilde S^*, w_{ne})}$ correctly classifies all points in the uncondensed dataset.

**Theorem 4.1.** Let $(X,d)$ be a separable metric space, and assume $x$ has an atomless distribution and that the labeling function $l$ is countably piece-wise continuous. Then, almost surely,
$$\operatorname{err}(h_{(\tilde S^*, w_{ne})}) \xrightarrow[n\to\infty]{} 0. \qquad (5)$$

The proof of Theorem 4.1 is given in Appendix A. The proof essentially establishes that $|\tilde S^*|$ is almost surely sub-linear in the sample size $n$; an application of the error bound (3) of Theorem 3.4 (corresponding to a permutation-invariant rule) then establishes (5). Note that a sub-linear $|\tilde S^*|$ in conjunction with the error bound (2) (corresponding to a non-permutation-invariant rule) does not suffice to establish Bayes consistency with our proof technique: without further assumptions on the tail of the distribution of $x$, the rate at which $|\tilde S^*|/n \to 0$ can be arbitrarily slow.

As for our greedy heuristic in Algorithm 1, note that the intractable optimization problem (4) can be cast as a set cover problem. Algorithm 1 then corresponds to the standard greedy approximation for set cover (Chvatal, 1979). This approximation is guaranteed to compute a set $\tilde S$ of size at most $O(|\tilde S^*| \log |\tilde S^*|)$. Hence, if $|\tilde S^*| \log |\tilde S^*|$ is guaranteed to be almost surely sub-linear, an adaptation of our proof of Theorem 4.1 implies that the greedy heuristic is Bayes consistent. This is made formal in the following corollary.

**Corollary 4.2.** Under the conditions of Theorem 4.1 and an additional appropriate tail condition on the probability distribution of $x$, the greedy weighted heuristic of Algorithm 1 is Bayes consistent.

*Proof.* In the proof of Theorem 4.1 we used a function $t_{r^*} : \mathbb N \to \mathbb R_+$ in $o(1)$ to establish that the size of an $r^*$-net (with $r^* > 0$) is sub-linear in $n$. Inspecting that proof, to guarantee that $|\tilde S^*| \log |\tilde S^*|$ is almost surely sub-linear in $n$, it suffices to additionally assume that $t_{r^*} \in o(1/\log n)$.

As two examples of the applicability of Corollary 4.2, if the random variable $x$ is bounded then $|\tilde S^*| = O(1)$, and if $x$ has a normal distribution then $|\tilde S^*| = O(\log n)$. Hence in these examples Algorithm 1 is Bayes consistent.

## 5. Experimental Results

In this section we present promising experimental results for condensing under WNN, using the heuristic of Section 4. We ran two separate trials. The first experiment compared the condensing achieved by our heuristic to that achieved by other popular condensing algorithms; for these we used datasets already established as appropriate for condensing. The second experiment also compared condensing algorithms, but here we additionally computed the optimal unweighted condensed set and compared our results to it. This required the introduction of an exact integer program, and the use of very small datasets amenable to exact solutions.

### 5.1. Trial: Comparison Between Condensing Algorithms

As a proof of concept, we selected representative datasets from the condensing experiments of (Garcia et al., 2012) (see Tables 2 and 7 there), appearing in Table 1. We randomly split each dataset into training samples (70%) and testing samples (30%). On the training sets, we ran the popular MSS (Barandela et al., 2005) and the recent successful RSS (Flores-Velazco & Mount, 2021) heuristics, as well as our greedy heuristic for WNN condensing, as presented above in Algorithm 1.
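The evaluation protocol itself is straightforward; the following sketch (using scikit-learn's train_test_split and reusing the greedy_wnn_condense sketch from Section 4.1; our actual experimental code may differ in its details) shows how the retained fraction and test error reported below can be computed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def wnn_predict(Xq, P, Py, w):
    # Weighted-distance 1-NN prediction for query points Xq against the condensed set (P, Py, w).
    D = np.linalg.norm(Xq[:, None, :] - P[None, :, :], axis=2) / w
    return Py[np.argmin(D, axis=1)]

def evaluate(X, y, condense=greedy_wnn_condense, seed=0):
    # 70/30 split as in Section 5.1; reports fraction retained and test error.
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    P, Py, w = condense(Xtr, ytr)
    retained = len(P) / len(Xtr)
    test_err = np.mean(wnn_predict(Xte, P, Py, w) != yte)
    return retained, test_err
```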
We found that across all datasets considered, our weighted condensing heuristic achieved either superior or comparable compression of the training dataset, and either superior or comparable accuracy on the testing dataset, when compared to the unweighted heuristics; see Table 1, which reports the size of the condensed set as a fraction of the size of the original training dataset, and the test error as the fraction of wrongly classified samples from the test dataset.

Table 1: The fraction of training samples retained in the condensed subset and the error achieved on the testing samples, as described in Section 5.1.

| Dataset | Size | Classes | Retained (MSS) | Retained (RSS) | Retained (WNN) | Test error (MSS) | Test error (RSS) | Test error (WNN) |
|---|---|---|---|---|---|---|---|---|
| Magic | 19,020 | 2 | 0.29 | 0.37 | 0.26 | 0.22 | 0.26 | 0.21 |
| SatImage | 6,430 | 7 | 0.15 | 0.19 | 0.14 | 0.11 | 0.12 | 0.09 |
| Spambase | 4,560 | 2 | 0.27 | 0.33 | 0.27 | 0.21 | 0.21 | 0.18 |
| Twonorm | 7,400 | 2 | 0.15 | 0.16 | 0.06 | 0.06 | 0.11 | 0.03 |
| Phoneme | 5,404 | 2 | 0.19 | 0.22 | 0.16 | 0.13 | 0.12 | 0.12 |
| Segment | 2,310 | 7 | 0.13 | 0.14 | 0.10 | 0.07 | 0.05 | 0.05 |
| Shuttle | 43,498 | 7 | 0.030 | 0.008 | 0.005 | 0.004 | 0.002 | 0.002 |

### 5.2. Trial: Comparison with Exact Solvers on Small Datasets

In this experiment, we compared our condensing algorithm with the optimal solution produced by a brute-force solver. This is instructive in understanding the quality of the tested heuristics. Due to the significant limitations inherent in producing an exact solution, our comparison necessitated the use of small datasets amenable to computing an optimal solution using an integer program (IP) solver.

**Integer Programming for NN Condensing.** We formalize an integer program for NN condensing. As this problem is NP-hard, we do not expect the algorithm to have a reasonable run time for large sets, but for smaller sets it successfully returns an optimal solution (after a large run time). To model the NN condensing problem as an integer program, we introduce constraints corresponding to the inclusion of a point in the condensed set. This allows us to identify the minimal non-empty subset of examples that recovers all labels of the sample via the nearest neighbor classifier.

For each sample point $x$, we introduce a 0–1 variable $v(x)$, indicating whether $x$ appears in the condensed set. For each ordered pair of points $x$ and $x'$ with opposite labels, we introduce the constraint
$$v(x') \le \sum_{x'' \in C(x,x')} v(x''),$$
where $C(x,x')$ is the set of points with the same label as $x$ which are closer to $x$ than $x'$ is to $x$. (Note that $x \in C(x,x')$.) This constraint enforces that if $x'$ appears in the condensed set (meaning $v(x') = 1$), then the condensed set must also contain some point with the same label as $x$ that is closer to $x$. We also add the constraint $\sum_x v(x) \ge 1$ to disallow the empty set. Finally, the objective is to minimize $\sum_x v(x)$, which corresponds to minimizing the size of the condensed set. We implemented this program using the Python cvxpy library.
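A minimal cvxpy sketch of this formulation follows (for illustration; the details of our actual implementation may differ):

```python
import numpy as np
import cvxpy as cp

def nn_condense_ip(X, y):
    """Exact NN condensing as a 0-1 integer program (sketch of the formulation above)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    v = cp.Variable(n, boolean=True)

    constraints = [cp.sum(v) >= 1]                  # disallow the empty set
    for i in range(n):                              # the point x
        for j in range(n):                          # a candidate x' with the opposite label
            if y[i] != y[j]:
                C = [k for k in range(n) if y[k] == y[i] and D[i, k] < D[i, j]]
                constraints.append(v[j] <= sum(v[k] for k in C))

    prob = cp.Problem(cp.Minimize(cp.sum(v)), constraints)
    prob.solve()                                    # requires a mixed-integer solver to be installed
    keep = np.where(v.value > 0.5)[0]
    return X[keep], y[keep]
```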
**Datasets.** While the condensing heuristics can handle larger datasets, we found that (not surprisingly) the exact IP algorithm failed for set sizes much larger than 200 points. Accordingly, we ran trials on the small banana, circle and iris datasets, which all have binary classes; see Figure 3.

**Banana.** This dataset is a synthetic collection, previously used by Flores-Velazco & Mount (2021) for NN condensing. It contains instances arranged in several banana-shaped clusters. (The x and y axes represent the respective properties At1 and At2 defined there.) For our experiment, we retained only 200 of the more than 5,000 original points.

**Circle.** This is a synthetic randomized dataset containing 200 points. It contains instances arranged in a circular cluster, surrounded by instances of the opposing class.

**Iris.** This is the very popular dataset from the UCI Machine Learning Repository. It consists of three classes, each containing 50 instances of a certain species of iris. For our experiments, we considered only two of the three classes (namely Setosa and Versicolour).

**Results.** We ran the above NN condensing heuristics, the exact IP algorithm, and also our greedy heuristic for WNN condensing on the small datasets. We again found that our weighted condensing heuristic achieved superior compression when compared to the unweighted heuristics; see Table 2, which reports the exact sizes of the condensed sets. Our heuristic also approached the optimal unweighted solution computed by the brute-force IP, an algorithm which (unlike ours) does not scale to larger datasets.

Table 2: Number of samples retained in the condensed subset.

| Dataset | Points | MSS | RSS | IP | WNN |
|---|---|---|---|---|---|
| Circle | 200 | 52 | 45 | 7 | 12 |
| Banana | 200 | 74 | 66 | 32 | 35 |
| Iris | 100 | 11 | 9 | 2 | 4 |

Figure 3: The banana, circle and iris datasets.

## 6. Conclusions

We have demonstrated that WNN condensing is more powerful than standard NN condensing, yet is characterized by similar generalization bounds. Hence WNN can only improve the degree of compression, while maintaining the same theoretical guarantees. Our suggested heuristic is theoretically sound, and gave promising empirical results. This indicates that WNN condensing heuristics are deserving of further study.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Angiulli, F. Fast condensed nearest neighbor rule. In Proceedings of the 22nd International Conference on Machine Learning, pp. 25–32, 2005.

Barandela, R., Ferri, F. J., and Sánchez, J. S. Decision boundary preserving prototype selection for nearest neighbor classification. International Journal of Pattern Recognition and Artificial Intelligence, 19(06):787–806, 2005.

Chvatal, V. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.

Cohen, D. T. and Kontorovich, A. Learning with metric losses. In Loh, P.-L. and Raginsky, M. (eds.), Proceedings of the Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 662–700. PMLR, 02–05 Jul 2022.

Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967. doi: 10.1109/TIT.1967.1053964.

Devi, V. S. and Murty, M. N. An incremental prototype set building technique. Pattern Recognition, 35(2):505–513, 2002.

Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.

Dudani, S. A. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, (4):325–327, 1976.

Fix, E. and Hodges, J. L. Discriminatory analysis, nonparametric discrimination: Consistency properties. 57(3):238–247, 1951.

Flores-Velazco, A. Social distancing is good for points too! arXiv preprint arXiv:2006.15650, 2020.

Flores-Velazco, A. and Mount, D. Guarantees on nearest-neighbor condensation heuristics. Computational Geometry, 95:101732, 2021.
Floyd, S. and Warmuth, M. Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

Garcia, S., Derrac, J., Cano, J., and Herrera, F. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):417–435, 2012.

Gates, W. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 18:431–433, 1972.

Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Efficient classification for metric data (extended abstract COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014. doi: 10.1109/TIT.2014.2339840. URL http://dx.doi.org/10.1109/TIT.2014.2339840.

Gottlieb, L.-A. and Kontorovich, A. Non-uniform packings. Information Processing Letters, 174:106179, 2022.

Gottlieb, L.-A. and Ozeri, S. Classification in asymmetric spaces via sample compression. arXiv preprint arXiv:1909.09969, 2019.

Gottlieb, L.-A., Kontorovich, A., and Nisnevitch, P. Near-optimal sample compression for nearest neighbors. IEEE Transactions on Information Theory, 64(6):4120–4128, 2018.

Graepel, T., Herbrich, R., and Shawe-Taylor, J. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1):55–76, 2005.

Győrfi, L. and Weiss, R. Universal consistency and rates of convergence of multiclass prototype algorithms in metric spaces. The Journal of Machine Learning Research, 22(1):6702–6726, 2021.

Hanneke, S., Kontorovich, A., Sabato, S., and Weiss, R. Universal Bayes consistency in metric spaces. The Annals of Statistics, 49(4):2129–2150, 2021. doi: 10.1214/20-AOS2029. URL https://doi.org/10.1214/20-AOS2029.

Hart, P. The condensed nearest neighbor rule (corresp.). IEEE Transactions on Information Theory, 14(3):515–516, 1968.

Kerem, O. and Weiss, R. On error and compression rates for prototype rules. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 8228–8236, 2023.

Kontorovich, A., Sabato, S., and Weiss, R. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. In Advances in Neural Information Processing Systems, pp. 1573–1583, 2017.

Littlestone, N. and Warmuth, M. K. Relating data compression and learnability. Unpublished, 1986.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018.

Ritter, G., Woodruff, H., Lowry, S., and Isenhour, T. An algorithm for a selective nearest neighbor decision rule (corresp.). IEEE Transactions on Information Theory, 21(6):665–669, 1975.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Wilfong, G. Nearest neighbor problems. In Proceedings of the Seventh Annual Symposium on Computational Geometry, SCG '91, pp. 224–233, 1991.

Wilson, D. R. and Martinez, T. R. Reduction techniques for instance-based learning algorithms. Machine Learning, 38:257–286, 2000.

Xue, L. and Kpotufe, S. Achieving the time of 1-NN, but the accuracy of k-NN. In International Conference on Artificial Intelligence and Statistics, pp. 1628–1636. PMLR, 2018.

Zukhba, A. V. NP-completeness of the problem of prototype selection in the nearest neighbor method. Pattern Recognition and Image Analysis, 20(4):484–494, December 2010. ISSN 1054-6618.

## A. Deferred Proofs
*Proof of Theorem 3.4.* Consider first the case of a non-permutation-invariant $R$. Set $m = |\tilde S|$ and let $I_{n,m}$ denote the set of all (ordered) sequences of length $m$ of indices from $\{1,2,\dots,n\}$. For $\mathbf i = (i_1,\dots,i_m) \in I_{n,m}$, denote by $S(\mathbf i)$ the subset of $S$ with indices in $\mathbf i$. For a weight assignment $w(\mathbf i)$ for $S(\mathbf i)$, write $\widehat{\operatorname{err}}(h_{(S(\mathbf i),w(\mathbf i))})$ for the empirical error of $h_{(S(\mathbf i),w(\mathbf i))}$ over the $n-m$ samples in $S \setminus S(\mathbf i)$, that is,
$$\widehat{\operatorname{err}}(h_{(S(\mathbf i),w(\mathbf i))}) = \frac{1}{n-m}\sum_{(x_j,y_j)\in S\setminus S(\mathbf i)} \mathbf 1\{h_{(S(\mathbf i),w(\mathbf i))}(x_j) \neq y_j\},$$
and note that the samples in $S \setminus S(\mathbf i)$ are i.i.d. and independent of $S(\mathbf i)$. For $\varepsilon > 0$, we bound
$$\begin{aligned}
&\mathbb P\Big\{\widehat{\operatorname{err}}(h_{(\tilde S,w)}) = 0,\ |\tilde S| = m,\ \operatorname{err}(h_{(\tilde S,w)}) > \varepsilon\Big\}\\
&\quad\le \mathbb P\Big\{\exists\,\mathbf i \in I_{n,m},\ \exists\, w(\mathbf i):\ \widehat{\operatorname{err}}(h_{(S(\mathbf i),w(\mathbf i))}) = 0,\ \operatorname{err}(h_{(S(\mathbf i),w(\mathbf i))}) > \varepsilon\Big\}\\
&\quad\le \sum_{\mathbf i \in I_{n,m}} \mathbb E\Big[\mathbb P\Big\{\exists\, w(\mathbf i):\ \widehat{\operatorname{err}}(h_{(S(\mathbf i),w(\mathbf i))}) = 0,\ \operatorname{err}(h_{(S(\mathbf i),w(\mathbf i))}) > \varepsilon \ \Big|\ S(\mathbf i)\Big\}\Big]. \qquad (6)
\end{aligned}$$

Fix $\mathbf i \in I_{n,m}$ and $S(\mathbf i)$, and consider the class of WNN classifiers given by
$$\mathcal H_{S(\mathbf i)} = \big\{h_{(S(\mathbf i),w(\mathbf i))} : w(\mathbf i)\ \text{a weight assignment for } S(\mathbf i)\big\}.$$
The growth function $\Pi_{\mathcal H_{S(\mathbf i)}}(n)$ of $\mathcal H_{S(\mathbf i)}$ counts the maximum number of different possible labelings of $n$ points from $X$:
$$\Pi_{\mathcal H_{S(\mathbf i)}}(n) = \max_{\{z_1,\dots,z_n\}\subseteq X}\Big|\big\{(h(z_1),\dots,h(z_n)) : h \in \mathcal H_{S(\mathbf i)}\big\}\Big|.$$
Then by a standard argument (Mohri et al., 2018),
$$\mathbb P\Big\{\exists\, w(\mathbf i):\ \widehat{\operatorname{err}}(h_{(S(\mathbf i),w(\mathbf i))}) = 0,\ \operatorname{err}(h_{(S(\mathbf i),w(\mathbf i))}) > \varepsilon \ \Big|\ S(\mathbf i)\Big\} \le 2\,\Pi_{\mathcal H_{S(\mathbf i)}}(2(n-m))\,e^{-(n-m)\varepsilon/2}.$$

To bound $\Pi_{\mathcal H_{S(\mathbf i)}}(2(n-m))$, note that by Lemma 3.3, for any weight assignment $w(\mathbf i)$ for $S(\mathbf i)$ and any set $S'$ of $2(n-m)$ points from $X$, there exists a list $\tilde W \subseteq S' \cup S(\mathbf i)$ of size $m$ such that the WNN classifier $h_{S(\mathbf i),\tilde W}$ gives the same labeling as $h_{(S(\mathbf i),w(\mathbf i))}$ on the $2(n-m)$ points in $S'$. Since the number of different classifiers in $\{h_{S(\mathbf i),\tilde W} : \tilde W \text{ is a list of } m \text{ points from a sample of size } 2n-m\}$ is at most $|I_{2n-m,m}| \le |I_{2n,m}|$, it follows that $\Pi_{\mathcal H_{S(\mathbf i)}}(2(n-m)) \le |I_{2n,m}|$. Hence, Eq. (6) is bounded from above by
$$\sum_{\mathbf i\in I_{n,m}} 2\,|I_{2n,m}|\,e^{-(n-m)\varepsilon/2} = 2\,|I_{n,m}|\,|I_{2n,m}|\,e^{-(n-m)\varepsilon/2}. \qquad (7)$$
Setting
$$\varepsilon = \frac{2}{n-m}\Big(\log\big(2\,|I_{n,m}|\,|I_{2n,m}|\big) + \log\frac{n}{\delta}\Big) \le \frac{2}{n-m}\Big(2m\log 2n + \log\frac{n}{\delta}\Big), \qquad (8)$$
where we used the bounds $|I_{n,m}| \le n^m$ and $|I_{2n,m}| \le (2n)^m$ (so that $\log(2|I_{n,m}||I_{2n,m}|) \le 2m\log 2n$ for $m \ge 1$), the right-hand side of (7) equals $\delta/n$. Summing over the $n$ possible values of $m$ completes the proof for the case of a non-permutation-invariant $R$.

As for the case of a permutation-invariant $R$, the only difference from the proof above is the definition of $I_{n,m}$: in the permutation-invariant case we take $I_{n,m}$ to be the set of all (unordered) subsets of $\{1,2,\dots,n\}$ of size $m$. Then $|I_{n,m}| \le \binom{n}{m} \le \big(\frac{en}{m}\big)^m$, and similarly $|I_{2n,m}| \le \big(\frac{2en}{m}\big)^m$. Putting this into Eq. (8), we have
$$\varepsilon \le \frac{2}{n-m}\Big(2m\log\frac{2en}{m} + \log\frac{n}{\delta}\Big),$$
in accordance with (3).

*Proof of Theorem 4.1.* Denote by $\mu$ the probability distribution of $x$ and abbreviate $h_{\tilde S} = h_{(\tilde S, w_{ne})}$ for any $\tilde S \subseteq S$; in particular $h_{\tilde S^*} = h_{(\tilde S^*, w_{ne})}$. For $r > 0$ let
$$U_r = \left\{x \in \operatorname{supp}(\mu) :\ \frac{1}{\mu(B(x,r))}\int_{B(x,r)} l(x')\,\mu(dx') = l(x)\right\},$$
that is, $U_r$ is the set of all points $x$ in the support of $\mu$ for which $l$ is essentially constant in the ball of radius $r$ around $x$. The assumptions that the distribution of $x$ is atomless and that $l$ is piece-wise continuous imply that $\mu(U_r)$ is monotonically decreasing and continuous in $r$ and satisfies
$$\lim_{r\to0}\mu(U_r) = 1. \qquad (9)$$

Given $\varepsilon > 0$, let $r^* = r^*(\varepsilon) > 0$ be such that $\mu(U_{r^*}) \ge 1-\alpha$, where $\alpha = \alpha(\varepsilon) \in (0, 1/8)$ satisfies
$$\frac{16\alpha}{1-4\alpha}\log\frac{e}{2\alpha} \le \frac{\varepsilon}{2}. \qquad (10)$$
Since the left-hand side of (10) goes to 0 monotonically (and continuously) as $\alpha \to 0$, such an $\alpha$ always exists (this choice of $\alpha$ will be made clear below).

Given a sample $S$ of size $n$, denote by $X_n = (x_1,\dots,x_n)$ the instances in $S$ and by $Y_n = (y_1,\dots,y_n)$ their corresponding labels. Let $X_{r^*} \subseteq X_n \cap U_{r^*}$ be an $r^*$-net of $X_n \cap U_{r^*}$ (see (Gottlieb et al., 2018) for the formal definition of an $r$-net) and let $Y_{r^*}$ be the corresponding labels in $Y_n$, stacked into the labeled set $\tilde S^{(1)} = (X_{r^*}, Y_{r^*})$. Let $U^c_{r^*} = X \setminus U_{r^*}$ denote the set complement of $U_{r^*}$ and let $\tilde S^{(2)} = S \cap (U^c_{r^*} \times \mathcal Y)$, where $\mathcal Y = \{-1,1\}$ is the label set. Define the labeled dataset $\tilde S_{r^*} = \tilde S^{(1)} \cup \tilde S^{(2)} \subseteq S$. Then $h_{\tilde S_{r^*}}$ is consistent on $S$. Indeed, any $(x_i, y_i) \in S \cap (U^c_{r^*} \times \mathcal Y)$ is included in $\tilde S_{r^*}$ and is thus classified correctly by $h_{\tilde S_{r^*}}$.
For any $(x_i, y_i) \in S \cap (U_{r^*} \times \mathcal Y)$, since $X_{r^*}$ is an $r^*$-net of $X_n \cap U_{r^*}$, there is $\bar x \in X_{r^*}$ with $d(x_i, \bar x) < r^*$. Since $\bar x \in U_{r^*}$ and $d(x_i, \bar x) < r^*$, we have $y_i = l(x_i) = l(\bar x) = \bar y$ with probability 1. In addition, any point $(x_j, y_j) \in S$ with a label opposite to $\bar y$ satisfies $d(x_j, \bar x) \ge r^*$, and so $w_{ne}(\bar x) \ge r^*$. Thus,
$$d_{w_{ne}}(x_i, \bar x) = \frac{d(x_i, \bar x)}{w_{ne}(\bar x)} < \frac{r^*}{r^*} = 1.$$
Additionally, any $\bar x' \in \tilde S_{r^*}$ with a label different from $y_i$ has weight $w_{ne}(\bar x') \le d(x_i, \bar x')$, so that
$$d_{w_{ne}}(x_i, \bar x') = \frac{d(x_i, \bar x')}{w_{ne}(\bar x')} \ge 1.$$
So the WNN classifier $h_{\tilde S_{r^*}}$ classifies the point $(x_i, y_i)$ correctly in this case as well, and hence $h_{\tilde S_{r^*}}$ is consistent on $S$. It follows that the subset $\tilde S^*$ in (4) computed by the learning rule satisfies
$$|\tilde S^*| \le |\tilde S_{r^*}|. \qquad (11)$$

We next bound $|\tilde S_{r^*}| = |\tilde S^{(1)}| + |\tilde S^{(2)}|$ with high probability. Since by construction $\mu(U^c_{r^*}) \le \alpha$, Hoeffding's inequality implies that
$$\mathbb P\big\{|\tilde S^{(2)}| > 2n\alpha\big\} = \mathbb P\big\{|X_n \cap U^c_{r^*}| > 2n\alpha\big\} \le e^{-2n\alpha^2}.$$
As for $|\tilde S^{(1)}|$, by Hanneke et al. (2021, Lemma 3.7), there is $t_{r^*} : \mathbb N \to \mathbb R_+$ in $o(1)$ such that $\mathbb P\{|X_{r^*}| \ge n\,t_{r^*}(n)\} \le 1/n^2$. Since $t_{r^*} \in o(1)$, we may take $n$ sufficiently large so that $t_{r^*}(n) \le 2\alpha$. So for all sufficiently large $n$,
$$\mathbb P\big\{|\tilde S_{r^*}| > 4\alpha n\big\} \le \frac{1}{n^2} + e^{-2n\alpha^2}.$$
Hence,
$$\mathbb P\big\{\operatorname{err}(h_{\tilde S^*}) > \varepsilon\big\} \le \mathbb P\big\{\operatorname{err}(h_{\tilde S^*}) > \varepsilon,\ |\tilde S_{r^*}| \le 4\alpha n\big\} + \mathbb P\big\{|\tilde S_{r^*}| > 4\alpha n\big\} \le \mathbb P\big\{\operatorname{err}(h_{\tilde S^*}) > \varepsilon,\ |\tilde S_{r^*}| \le 4\alpha n\big\} + \frac{1}{n^2} + e^{-2n\alpha^2}. \qquad (12)$$

To complete the proof we show below that for all sufficiently large $n$,
$$\mathbb P\big\{\operatorname{err}(h_{\tilde S^*}) > \varepsilon,\ |\tilde S_{r^*}| \le 4\alpha n\big\} \le \frac{1}{n^2}. \qquad (13)$$
Since the right-hand side of (12) is then summable over $n$, the Borel–Cantelli lemma implies that almost surely $\operatorname{err}(h_{\tilde S^*}) \to 0$ as $n \to \infty$, concluding the proof of the theorem.

To show (13), put $\delta = \delta_n = 1/n^2$ in (3) and observe that the right-hand side of (3) is monotonically increasing in $|\tilde S|$. Thus, using (11), under the event $\{|\tilde S_{r^*}| \le 4\alpha n\}$ we have
$$\frac{2}{n-|\tilde S^*|}\Big(2|\tilde S^*|\log\frac{2en}{|\tilde S^*|} + 3\log n\Big) \le \frac{2}{n-|\tilde S_{r^*}|}\Big(2|\tilde S_{r^*}|\log\frac{2en}{|\tilde S_{r^*}|} + 3\log n\Big) \le \frac{16\alpha}{1-4\alpha}\log\frac{e}{2\alpha} + \frac{6\log n}{(1-4\alpha)n} \le \varepsilon,$$
where in the last inequality we used the choice of $\alpha$ in (10) and took $n$ sufficiently large so that $\frac{6\log n}{(1-4\alpha)n} \le \frac{\varepsilon}{2}$. Applying Theorem 3.4 (with $\delta_n = 1/n^2$), it follows that
$$\mathbb P\big\{\operatorname{err}(h_{\tilde S^*}) > \varepsilon,\ |\tilde S_{r^*}| \le 4\alpha n\big\} \le \mathbb P\Big\{\operatorname{err}(h_{\tilde S^*}) > \frac{2}{n-|\tilde S^*|}\Big(2|\tilde S^*|\log\frac{2en}{|\tilde S^*|} + 3\log n\Big)\Big\} \le \frac{1}{n^2},$$
establishing (13).