# active_nearestneighbor_learning_in_metric_spaces__d6d18382.pdf

Active Nearest-Neighbor Learning in Metric Spaces

Aryeh Kontorovich Department of Computer Science Ben-Gurion University of the Negev Beer Sheva 8499000, Israel

Sivan Sabato Department of Computer Science Ben-Gurion University of the Negev Beer Sheva 8499000, Israel

Ruth Urner Max Planck Institute for Intelligent Systems Department for Empirical Inference Tübingen 72076, Germany

We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classiﬁer. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is signiﬁcantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efﬁcient active model-selection procedure.

1 Introduction

In this paper we propose a non-parametric pool-based active learning algorithm for general metric spaces, which outputs a nearest-neighbor classiﬁer. The algorithm is named MArgin Regularized Metric Active Nearest Neighbor (MARMANN). In pool-based active learning [Mc Callum and Nigam, 1998] a collection of random examples is provided, and the algorithm can interactively query an oracle to label some of the examples. The goal is good prediction accuracy, while keeping the label complexity (the number of queried labels) low. MARMANN receives a pool of unlabeled examples in a general metric space, and outputs a variant of the nearest-neighbor classiﬁer. The algorithm obtains a prediction error guarantee that depends on a noisy-margin property of the input sample, and has a provably smaller label complexity than any passive learner with a similar guarantee.

The theory of active learning has received considerable attention in the past decade [e.g., Dasgupta, 2004, Balcan et al., 2007, 2009, Hanneke, 2011, Hanneke and Yang, 2015]. Active learning has been mostly studied in a parametric setting (that is, learning with respect to a ﬁxed hypothesis class with a bounded capacity). Various strategies have been analyzed for parametric classiﬁcation [e.g., Dasgupta, 2004, Balcan et al., 2007, Gonen et al., 2013, Balcan et al., 2009, Hanneke, 2011, Awasthi et al., 2013].An active model selection procedure has also been developed for the parametric setting Balcan et al. [2010]. However, the number of labels used there depends quadratically on the number of possible model classes, which is prohibitive in our non-parametric setting.

The potential beneﬁts of active learning for non-parametric classiﬁcation in metric spaces are less well understood. The paradigm of cluster-based active learning [Dasgupta and Hsu, 2008] has been shown to provide label savings under some distributional clusterability assumptions [Urner et al., 2013, Kpotufe et al., 2015]. Certain active learning methods for nearest neighbor classiﬁcation are known to be Bayes consistent [Dasgupta, 2012], and an active querying rule, based solely on information in

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

the unlabeled data, has been shown to be beneﬁcial for nearest neighbors under covariate shift [Berlind and Urner, 2015]. Castro and Nowak [2007] analyze minimax rates for a class of distributions in Euclidean space, characterized by decision boundary regularity and noise conditions. However, no active non-parametric strategy for general metric spaces, with label complexity guarantees for general distributions, has been proposed so far. Here, we provide the ﬁrst such algorithm and guarantees.

The passive nearest-neighbor classiﬁer is popular among theorists and practitioners alike [Fix and Hodges, 1989, Cover and Hart, 1967, Stone, 1977, Kulkarni and Posner, 1995]. This paradigm is applicable in general metric spaces, and its simplicity is an attractive feature for both implementation and analysis. When appropriately regularized [e.g. Stone, 1977, Devroye and Györﬁ, 1985, von Luxburg and Bousquet, 2004, Gottlieb et al., 2010, Kontorovich and Weiss, 2015] this type of learner can be made Bayes consistent. Another desirable property of nearest-neighbor-based methods is their ability to generalize at a rate that scales with the intrinsic data dimension, which can be much lower than that of the ambient space [Kpotufe, 2011, Gottlieb et al., 2014a, 2016a, Chaudhuri and Dasgupta, 2014]. Furthermore, margin-based regularization makes nearest neighbors ideally suited for sample compression, which yields a compact representation, faster classiﬁcation runtime, and improved generalization performance [Gottlieb et al., 2014b, Kontorovich and Weiss, 2015]. The resulting error guarantees can be stated in terms of the sample s noisy-margin, which depends on the distances between differently-labeled examples in the input sample.

Our contribution. We propose MARMANN, a non-parametric pool-based active learning algorithm that obtains an error guarantee competitive with that of a noisy-margin-based passive learner, but can provably use signiﬁcantly fewer labels. This is the ﬁrst non-parametric active learner for general metric spaces that achieves prediction error that is competitive with passive learning for general distributions, and provably improves label complexity.

Our approach. Previous passive learning approaches to classiﬁcation using nearest-neighbor rules under noisy-margin assumptions [Gottlieb et al., 2014b, 2016b] provide statistical guarantees using sample compression bounds [Graepel et al., 2005]. The ﬁnite-sample guarantees depend on the number of noisy labels relative to an optimal margin scale. A central challenge in the active setting is performing model selection (selecting the margin scale) with a low label complexity. A key insight that we exploit in this work is that by designing a new labeling scheme for the compression set, we can construct the compression set and estimate its error with label-efﬁcient procedures. We obtain statistical guarantees for this approach using a generalized sample compression analysis. We derive a label-efﬁcient (as well as computationally efﬁcient) active model-selection procedure. This procedure ﬁnds a good scale by estimating the sample error for some scales, using a small number of active querying rounds. Crucially, unlike cross-validation, our model-selection procedure does not require a number of labels that depends on the worst possible scale, nor does it test many scales. This allows our label complexity bounds to be low, and to depend only on the ﬁnal scale selected by the algorithm. Our error guarantee is a constant factor over the error guarantee of the passive learner of Gottlieb et al. [2016b]. An approach similar to Gottlieb et al. [2016b], proposed in Gottlieb et al. [2014a], has been shown to be Bayes consistent [Kontorovich and Weiss, 2015]. The Bayes-consistency of the passive version of our approach is the subject of ongoing work.

Paper outline. We deﬁne the setting and notations in Section 2. In Section 3 we provide our main result, Theorem 3.2, giving error and label complexity guarantees for MARMANN. Section 4 shows how to set the nearest neighbor rule for a given scale, and Section 5 describes the model selection procedure. Some of the analysis is omitted due to lack of space. The full analysis is available at Kontorovich et al. [2016].

2 Setting and notations

We consider learning in a general metric space (X, ρ), where X is a set and ρ is the metric on X. Our problem setting is that of classiﬁcation of the instance space X into some ﬁnite label set Y. Assume that there is some distribution D over X Y, and let S Dm be a labeled sample of size m, where m is an integer. Denote the sequence of unlabeled points in S by U(S). We sometimes treat S and U(S) as multisets, since the order is unimportant. The error of a classiﬁer h : X Y on D is denoted err(h, D) := P[h(X) = Y ], where (X, Y ) D. The empirical error on a labeled sample S instantiates to err(h, S) = 1 |S| P I[h(X) = Y ]. A passive learner receives a labeled sample Sin as input. An active learner receives the unlabeled part of the sample Uin := U(Sin) as input, and

is allowed to adaptively select examples from Uin and request their label from Sin. When either learner terminates, it outputs a classiﬁer ˆh : X Y, with the goal of achieving a low err(ˆh, D). An additional goal of the active learner is to achieve a performance competitive with that of the passive learner, while querying considerably fewer labels.

The diameter of a set A X is deﬁned by diam(A) := supa,a A ρ(a, a ). Denote the index of the closest point in U to x X by κ(x, U) := argmini:xi U ρ(x, xi). We assume here and throughout this work that when there is more than one minimizer for ρ(x, xi), ties are broken arbitrarily (but in a consistent fashion). For a set Z X, denote κ(Z, U) := {κ(z, U) | z Z}. Any labeled sample S = ((xi, yi))i [k] naturally induces the nearest-neighbor classiﬁer hnn S : X Y, via hnn S (x) := yκ(x,U(S)).

For x X, and t > 0, denote by ball(x, t) the (closed) ball of radius t around x: ball(x, t) := {x X | ρ(x, x ) t}. The doubling dimension, the effective dimension of the metric space, which controls generalization and runtime performance of nearest-neighbors [Kpotufe, 2011, Gottlieb et al., 2014a], is deﬁned as follows. Let λ = λ(X) be the smallest number such that every ball in X can be covered by λ balls of half its radius, where all balls are centered at points of X. Formally, λ(X) := min{λ N : x X, r > 0, x1, . . . , xλ X : ball(x, r) λ i=1ball(xi, r/2)}. Then the doubling dimension of X is deﬁned by ddim(X) := log2 λ. In line with modern literature, we work in the low-dimension, big-sample regime, where the doubling dimension is assumed to be constant and hence sample complexity and algorithmic runtime may depend on it exponentially. This exponential dependence is unavoidable, even under margin assumptions, as previous analysis [Kpotufe, 2011, Gottlieb et al., 2014a] indicates.

A set A X is t-separated if infa,a A:a =a ρ(a, a ) t. For A B X, the set A is a t-net of B if A is t-separated and B S

a A ball(a, t). Constructing a minimum size t-net for a general set B is NP-hard [Gottlieb and Krauthgamer, 2010], however efﬁcient procedures exist for constructing some t-net [Krauthgamer and Lee, 2004, Gottlieb et al., 2014b]. The size of any t-net is at most 2ddim(B) times the smallest possible size (see Kontorovich et al. [2016]). In addition, the size of any t-net is at most diam(B)/t ddim(X)+1 [Krauthgamer and Lee, 2004]. Throughout the paper, we ﬁx a deterministic procedure for constructing a t-net, and denote its output for a multiset U X by Net(U, t). Let Par(U, t) be a partition of X into regions induced by Net(U, t), that is: for Net(U, t) = {x1, . . . , x N}, deﬁne Par(U, t) := {P1, . . . , PN}, where Pi = {x X | κ(x, Net(U, t)) = i}. For t > 0, denote N(t) := |Net(Uin, t)|. For a labeled multiset S X Y and y Y, denote Sy := {x | (x, y) S}; in particular, U(S) = y YSy.

3 Main results

Non-parameteric binary classiﬁcation admits performance guarantees that scale with the sample s noisy-margin [von Luxburg and Bousquet, 2004, Gottlieb et al., 2010, 2016b]. We say that a labeled multiset S is (ν, t)-separated, for ν [0, 1] and t > 0 (representing a margin t with noise ν), if one can remove a ν-fraction of the points in S, and in the resulting multiset, points with different labels are at least t-far from each other. Formally, S is (ν, t)-separated if there exists a subsample S S such that |S \ S| ν|S| and y1 = y2 Y, a Sy1, b Sy2, we have ρ(a, b) t. For a given labeled sample S, denote by ν(t) the smallest value ν such that S is (ν, t)-separated. Gottlieb et al. [2016b] propose a passive learner with the following guarantees as a function of the separation of S. Setting α := m/(m N), deﬁne the following form of a generalization bound:

GB(ϵ, N, δ, m, k) := αϵ + 2

3 (N + 1) log(mk) + log( 1

δ ) m N + 3

αϵ((N + 1) log(mk) + log( 1

Theorem 3.1 (Gottlieb et al. [2016b]). Let m be an integer, Y = {0, 1}, δ (0, 1). There exists a passive learning algorithm that returns a nearest-neighbor classiﬁer hnn Spas, where Spas Sin, such that, with probability 1 δ,

err(hnn Spas, D) min t>0:N(t)<m GB(ν(t), N(t), δ, m, 1).

The passive algorithm of Gottlieb et al. [2016b] generates Spas of size approximately N(t) for the optimal scale t > 0 (found by searching over all scales), removing the |Sin|ν(t) points that

obstruct the t-separation between different labels in Sin, and then selecting a subset of the remaining labeled examples to form Spas, so that the examples are a t-net for Sin. We propose a different approach for generating a compression set for a nearest-neighbor rule. This approach, detailed in the following sections, does not require ﬁnding and removing all the obstructing points in Sin, and can be implemented in an active setting using a small number of labels. The resulting active learning algorithm, MARMANN, has an error guarantee competitive with that of the passive learner and a label complexity that can be signiﬁcantly lower. Our main result is the following guarantee for MARMANN. Theorem 3.2. Let Sin Dm, where m max(6, |Y|), δ (0, 1

4). Let ˆS be the output of MARMANN(Uin, δ), where ˆS X Y, and let ˆN := | ˆS|. Let ˆh := hnn ˆS and ˆϵ := err(ˆh, Sin), and

denote ˆG := GB(ˆϵ, ˆN, δ, m, 1). With a probability of 1 δ over Sin and randomness of MARMANN,

err(ˆh, D) 2 ˆG O min t>0:N(t)<m GB(ν(t), N(t), δ, m, 1) ,

and the number of labels from Sin requested by MARMANN is at most

ˆG ) + m ˆG .

Here O( ) hides only universal numerical constants.

To observe the advantages of MARMANN over a passive learner, consider a scenario in which the upper bound GB of Theorem 3.1, as well as the Bayes error of D, are of order Θ(1/ m). Then ˆG = Θ(1/ m) as well. Therefore, MARMANN obtains a prediction error guarantee of Θ(1/ m), similarly to the passive learner, but it uses only Θ( m) labels instead of m. Moreover, no learner that selects labels randomly from Sin can compete with MARMANN: In Kontorovich et al. [2016] we adapt an argument of Devroye et al. [1996] to show that for any passive learner that uses Θ( m) random labels from Sin, there exists a distribution D with the above properties, for which the prediction error of the passive learner in this case is Ω(m 1/4), a decay rate which is almost quadratically slower than the O(1/ m) rate achieved by MARMANN. Thus, the guarantees of MARMANN cannot be matched by any passive learner.

MARMANN operates as follows. First, a scale ˆt > 0 is selected, by calling ˆt Select Scale(δ), where Select Scale is our model selection procedure. Select Scale has access to Uin, and queries labels from Sin as necessary. It estimates the generalization error bound GB for several different scales, and executes a procedure similar to binary search to identify a good scale. The binary search keeps the number of estimations (and thus requested labels) small. Crucially, our estimation procedure is designed to prevent the search from spending a number of labels that depends on the net size of the smallest possible scale t, so that the total label complexity of MARMANN depends only on error of the selected ˆt. Second, the selected scale ˆt is used to generate the compression set by calling ˆS Generate NNSet(ˆt, [N(ˆt)], δ), where Generate NNSet is our compression set generation procedure. For clarity of presentation, we ﬁrst introduce in Section 4 the procedure Generate NNSet, which determines the compression set for a given scale, and then in Section 5, we describe how Select Scale chooses the appropriate scale.

4 Active nearest-neighbor at a given scale

The passive learner of Gottlieb et al. [2014a, 2016b] generates a compression set by ﬁrst ﬁnding and removing from Sin all points that obstruct (ν, t)-separation at a given scale t > 0. We propose below a different approach for generating a compression set, which seems more conducive to active learning: as we show below, it also generates a low-error nearest neighbor rule, just like the passive approach. At the same time, it allows us to estimate the error on many different scales using few label queries. A small technical difference, which will be evident below, is that in this new approach, examples in the compression set might have a different label than their original label in Sin. Standard sample compression analysis [e.g. Graepel et al., 2005] assumes that the classiﬁer is determined by a small number of labeled examples from Sin. This does not allow the examples in the compression set to have a different label than their original label in Sin. Therefore, we require a slight generalization of previous compression analysis, which allows setting arbitrary labels for examples that are assigned to the compression set. The following theorem quantiﬁes the effect of this change on generalization.

Theorem 4.1. Let m |Y| be an integer, δ (0, 1

4). Let Sin Dm. With probability at least 1 δ, if there exist N < m and S (X Y)N such that U(S) Uin and ϵ := err(hnn S , Sin) 1

2, then err(hnn S , D) GB(ϵ, N, δ, m, |Y|) 2GB(ϵ, N, 2δ, m, 1).

The proof is similar to that of standard sample compression schemes. If the compression set includes only the original labels, the compression analysis of Gottlieb et al. [2016b] gives the bound GB(ϵ, N, δ, m, 1). Thus the effect of allowing the labels to change is only logarithmic in |Y|, and does not appreciably degrade the prediction error.

We now describe the generation of the compression set for a given scale t > 0. Recall that ν(t) is the smallest value for which Sin is (ν, t)-separated. We deﬁne two compression sets. The ﬁrst one, denoted Sa(t), represents an ideal compression set, which induces an empirical error of at most ν(t), but calculating it might require many labels. The second compression set, denoted ˆSa(t), represents an approximation to Sa(t), which can be constructed using a small number of labels, and induces a sample error of at most 4ν(t) with high probability. MARMANN constructs only ˆSa(t), while Sa(t) is deﬁned for the sake of analysis only.

We ﬁrst deﬁne the ideal set Sa(t) := {(x1, y1), . . . , (x N, y N)}. The examples in Sa(t) are the points in Net(Uin, t/2), and the label of each example is the majority label out of the examples in Sin to which xi is closest. Formally, {x1, . . . , x N} := Net(Uin, t/2), and for i [N], yi := argmaxy Y |Sy Pi|, where Pi = {x X | κ(x, Net(U, t/2)) = i} Par(Uin, t/2). For i [N], let Λi := Syi Pi. The following lemma bounds the empirical error of hnn Sa(t).

Lemma 4.2. For every t > 0, err(hnn Sa(t), Sin) ν(t).

Proof. Since Net(Uin, t/2) is a t/2-net, diam(P) t for any P Par(Uin, t/2). Let S S be a subsample that witnesses the (ν(t), t)-separation of S, so that | S| m(1 ν(t)), and for any two points (x, y), (x , y ) S, if ρ(x, x ) t then y = y . Denote U := U( S). Since max P Par(Uin,t/2) diam(P) t, for any i [N] all the points in U Pi must have the same label in S. Therefore, y Y such that U Pi Sy Pi. Hence | U Pi| |Λi|. It follows

m err(hnn Sa(t), Sin) |S| X

i [N] |Λi| |S| X

i [N] | U Pi| = |S| | S| = m ν(t).

Dividing by m we get the statement of the theorem.

Now, calculating Sa(t) requires knowing most of the labels in Sin. MARMANN constructs instead an approximation ˆSa(t), in which the examples are the points in Net(Uin, t/2) (so that U( ˆSa(t)) = U(Sa(t)) ), but the labels are determined using a bounded number of labels requested from Sin. The labels in ˆSa(t) are calculated by the simple procedure Generate NNSet given in Alg. 1. The empirical error of the output of Generate NNSet is bounded in Theorem 4.3 below.1

A technicality in Alg. 1 requires explanation: In MARMANN, the generation of ˆSa(t) will be split into several calls to Generate NNSet, so that different calls determine the labels of different points in ˆSa(t). Therefore Generate NNSet has an additional argument I, which speciﬁes the indices of the points in Net(Uin, t/2) for which the labels should be returned this time. Crucially, if during the run of MARMANN, Generate NNSet is called again for the same scale t and the same point in Net(Uin, t/2), then Generate NNSet returns the same label that it returned before, rather than recalculating it using fresh labels from Sin. This guarantees that despite the randomness in Generate NNSet, the full ˆSa(t) is well-deﬁned within any single run of MARMANN, and is distributed like the output of Generate NNSet(t, [N(t/2)], δ), which is convenient for the analysis.

Theorem 4.3. Let ˆSa(t) be the output of Generate NNSet(t, [N(t/2)], δ). With a probability at least 1 δ 2m2 , we have err(hnn S , Sin) 4ν(t). Denote this event by E(t).

1In the case of binary labels (|Y| = 2), the problem of estimating Sa(t) can be formulated as a special case of the benign noise setting for parametric active learning, for which tight lower and upper bounds are provided in Hanneke and Yang [2015]. However, our case is both more general (as we allow multiclass labels) and more speciﬁc (as we are dealing with a speciﬁc hypothesis class). Thus we provide our own procedure and analysis.

Algorithm 1 Generate NNSet(t, I, δ) input Scale t > 0, a target set I [N(t/2)], conﬁdence δ (0, 1). output A labeled set S X Y of size |I|

{x1, . . . , x N} Net(Uin, t/2), {P1, . . . , PN} Par(Uin, t/2), S () for i I do

if ˆyi has not already been calculated for Uin with this values of t then

Draw Q := 18 log(2m3/δ) points uniformly at random from Pi and query their labels. Let ˆyi be the majority label observed in these Q queries. end if S S {(xi, ˆyi)}. end for Output S

Proof. By Lemma 4.2, err(hnn Sa(t), Sin) ν(t). In Sa(t), the labels assigned to each point in Net(Uin, t/2) are the majority labels (based on Sin) of the points in the regions in Par(Uin, t/2). Denote the majority label for region Pi by yi := argmaxy Y |Sy Pi|. We now compare these labels to the labels ˆyi assigned by Alg. 1. Let p(i) = |Λi|/|Pi| be the fraction of points in Pi which are labeled by the majority label yi. Let ˆp(i) be the fraction of labels equal to yi out of those queried by Alg. 1 in round i. Let β := 1/6. By Hoeffding s inequality and union bounds, we have that with a probability of at least 1 N(t/2) exp( Q

18) 1 δ 2m2 , we have maxi [N(t/2)] |ˆp(i) p(i)| β. Denote this good event by E . We now prove that E E(t). Let J [N(t/2)] = {i | ˆp(i) > 1

2}. It can be easily seen that ˆyi = yi for all i J. Therefore, for all x such that κ(x, U(Sa(t))) J, hnn S (x) = hnn Sa(t)(x), and hence err(hnn S , Uin) PX Uin[κ(X, U(Sa(t))) / J] + err(hnn Sa(t), Uin). The second term is at most ν(t), and it remains to bound the ﬁrst term, on the condition that E holds. We have PX U[κ(X, U(Sa(t))) / J] = 1

i/ J |Pi|. If E holds, then for any i / J, p(i) 1

2 +β, therefore |Pi| |Λi| = (1 p(i))|Pi| ( 1

2 β)|Pi|. Therefore

i/ J |Λi| 1

i/ J |Pi|( 1

2 β) = PX U[κ(X, U(Sa(t))) / J]( 1

On the other hand, as in the proof of Lemma 4.2, 1 1

m P i [N(t/2)] |Λi| ν(t). Thus, under E ,

PX U[κ(X, S) / J] ν(t) 1 2 β = 3ν(t). It follows that under E , err(hnn S , Uin) 4ν(t).

5 Model Selection

We now show how to select the scale ˆt that will be used to generate the output nearest-neighbor rule. The main challenge is to do this with a low label complexity: Generating the full classiﬁcation rule for scale t requires a number of labels that depends on N(t), which might be very large. We would like the label complexity of MARMANN to depend only on N(ˆt) (where ˆt is the selected scale), which is of the order m ˆG. Therefore, during model selection we can only invest a bounded number of labels in each tested scale. In addition, to keep the label complexity low, we cannot test all scales.

For t > 0, let ˆSa(t) be the model that MARMANN would generate if the selected scale were set to t. Our model selection procedure performs a search, similar to binary search, over the possible scales. For each tested scale t, the procedure estimates ϵ(t) := err(hnn ˆSa(t), S) within a certain accuracy, using an estimation procedure we call Estimate Err. Estimate Err outputs an estimate ˆϵ(t) of ϵ(t), up to a given accuracy θ > 0, using labels requested from Sin. It draws random examples from Sin, asks for their label, and calls Generate NNSet (which also might request labels) to ﬁnd the prediction error of hnn ˆSa(t) on these random examples. The estimate ˆϵ(t) is set to this prediction error. The number of random examples drawn by Estimate Err is determined based on the accuracy θ, using empirical Bernstein bounds [Maurer and Pontil, 2009]. Theorem 5.1 gives a guarantee for the accuracy and label complexity of Estimate Err. The full implementation of Estimate Err and the proof of Theorem 5.1 can be found in the long version of this paper Kontorovich et al. [2016].

Theorem 5.1. Let t, θ > 0 and δ (0, 1), and let ˆϵ(t) Estimate Err(t, θ, δ). Let Q be as deﬁned in Alg. 1. The following properties (which we denote below by V (t)) hold with a probability of 1 δ 2m2 over the randomness of Estimate Err (and conditioned on ˆSa(t)).

1. If ˆϵ(t) θ, then ϵ(t) 5θ/4. Otherwise, 4ϵ(t)

5 ˆϵ(t) 4ϵ(t)

2. Estimate Err requests at most 520(Q+1) log( 1040m2

ψ labels, where ψ := max(θ, ϵ(t)).

The model selection procedure Select Scale, given in Alg. 2, implements its search based on the guarantees in Theorem 5.1. First, we introduce some notation. Let G = mint GB(ν(t), N(t), δ, m, 1). We would like MARMANN to obtain a generalization guarantee that is competitive with G . Denote φ(t) := ((N(t) + 1) log(m) + log( 1

δ ))/m, and let G(ϵ, t) := ϵ + 2

ϵφ(t). Note that for all ϵ, t, GB(ϵ, N(t), δ, m, 1) = m m N(t)G(ϵ, t).

When referring to G(ν(t), t), G(ϵ(t), t), or G(ˆϵ(t), t) we omit the second t for brevity. Instead of directly optimizing GB, we will select a scale based on our estimate G(ˆϵ(t)) of G(ϵ(t)).

Let Dist denote the set of pairwise distances in the unlabeled dataset Uin (note that |Dist| < m 2 ). We remove from Dist some distances, so that the remaining distances have a net size N(t) that is monotone non-increasing in t. We also remove values with a very large net size. Concretely, deﬁne

Distmon := Dist \ {t | N(t) + 1 > m/2} \ {t | t Dist, t < t and N(t ) < N(t)}.

Then for all t, t Distmon such that t < t, we have N(t ) N(t). The output of Select Scale is always a value in Distmon. The following lemma shows that it sufﬁces to consider these scales. Lemma 5.2. Assume m 6 and let t m argmint Dist G(ν(t)). If G 1/3 then t m Distmon.

Proof. Assume by way of contradiction that t m Dist \ Distmon. First, since G(ν(t m)) G 1/3 we have N(t m)+1 m N(t m) log(m) 1

2. Therefore, since m 6, it is easy to verify N(t m) + 1 m/2. Therefore, by deﬁnition of Distmon there exists a t t m with φ(t) < φ(t m). Since ν(t) is monotone over all of t Dist, we also have ν(t) ν(t m). Now, φ(t) < φ(t m) and ν(t) ν(t m) together imply that G(ν(t)) < G(ν(t m)), a contradiction. Hence, t m Distmon.

Select Scale follows a search similar to binary search, however the conditions for going right and for going left are not complementary. The search ends when either none of these two conditions hold, or when there is nothing left to try. The ﬁnal output of the algorithm is based on minimizing G(ˆϵ(t)) over some of the values tested during search.

For c > 0, deﬁne γ(c) := 1 + 2

2c and γ(c) := 1

2c. For all t, ϵ > 0 we have the implications

ϵ cφ(t) γ(c)ϵ G(ϵ, t) and φ(t) cϵ γ(c)φ(t) G(ϵ, t). (1)

The following lemma uses Eq. (1) to show that the estimate G(ˆϵ(t)) is close to the true G(ϵ(t)). Lemma 5.3. Let t > 0, δ (0, 1), and suppose that Select Scale calls ˆϵ(t) Estimate Err(t, φ(t), δ). Suppose that V (t) as deﬁned in Theorem 5.1 holds. Then 1

6G(ˆϵ(t)) G(ϵ(t)) 6.5G(ˆϵ(t)).

Proof. Under V (t), we have that if ˆϵ(t) < φ(t) then ϵ(t) 5 4φ(t). In this case, G(ϵ(t)) γ(4/5)φ(t) 4.3φ(t), by Eq. (1). Therefore G(ϵ(t)) 3 4.3

2 G(ˆϵ(t)). In addition, G(ϵ(t)) 2

3φ(t) (from the deﬁnition of G), and by Eq. (1) and γ(1) 4, φ(t) 1

4G(ˆϵ(t)). Therefore G(ϵ(t)) 1 6G(ˆϵ(t)). On the other hand, if ˆϵ(t) φ(t), then by Theorem 5.1 4

5ϵ(t) ˆϵ(t) 4

3ϵ(t). Therefore G(ˆϵ(t)) 4

3G(ϵ(t)) and G(ϵ(t)) 5

4G(ˆϵ(t)). Taking the worst-case of both possibilities, we get the bounds in the lemma.

The next theorem bounds the label complexity of Select Scale. Let Ttest Distmon be the set of scales that are tested during Select Scale (that is, their ˆϵ(t) was estimated).

Algorithm 2 Select Scale(δ) input δ (0, 1) output Scale ˆt

T Distmon, # T maintains the current set of possible scales while T = do

t the median value in T # break ties arbitrarily ˆϵ(t) Estimate Err(t, φ(t), δ). if ˆϵ(t) < φ(t) then

T T \ [0, t] # go right in the binary search else if ˆϵ(t) > 11

10φ(t) then T T \ [t, ) # go left in the binary search else

t0 t, T0 {t0}. break from loop end if end while if T0 was not set yet then

If the algorithm ever went to the right, let t0 be the last value for which this happened, and let T0 := {t0}. Otherwise, T0 := . end if Let TL be the set of all t that were tested and made the search go left Output ˆt := argmint TL T0 G(ˆϵ(t))

Theorem 5.4. Suppose that the event V (t) deﬁned in Theorem 5.1 holds for all t Ttest for the calls ˆϵ(t) Estimate Err(t, φ(t), δ). If the output of Select Scale is ˆt, then the number of labels requested by Select Scale is at most

19240|Ttest|(Q + 1) 1 G(ϵ(ˆt)) log(38480m2

δG(ϵ(ˆt)) ),

where Q is as deﬁned in Alg. 1.

The following theorem provides a competitive error guarantee for the selected scale ˆt. Theorem 5.5. Suppose that V (t) and E(t), deﬁned in Theorem 5.1 and Theorem 4.3, hold for all values t Ttest, and that G 1/3. Then Select Scale outputs ˆt Distmon such that

GB(ϵ(ˆt), N(ˆt), δ, m, 1) O(G ),

where O( ) hides numerical constants only.

The idea of the proof is as follows: First, we show (using Lemma 5.3) that it sufﬁces to prove that G(ν(t m)) O(G(ˆϵ(ˆt))) to derive the bound in the theorem. Now, Select Scale ends in one of two cases: either T0 is set within the loop, or T = and T0 is set outside the loop. In the ﬁrst case, neither of the conditions for turning left and turning right holds for t0, so we have ˆϵ(t0) = Θ(φ(t0)) (where Θ hides numerical constants). We show that in this case, whether t m t0 or t m t0, G(ν(t m)) O(G(ˆϵ(t0))). In the second case, there exist (except for edge cases, which are also handled) two values t0 T0 and t1 TL such that t0 caused the binary search to go right, and t1 caused it to go left, and also t0 t1, and (t0, t1) Distmon = . We use these facts to show that for t m t1, G(ν(t m)) O(G(ˆϵ(t1))), and for t m t0, G(ν(t m)) O(G(ˆϵ(t0))). Since ˆt minimizes over a set that includes t0 and t1, this gives G(ν(t m)) O(G(ˆϵ(ˆt))) in all cases.

The proof of the main theorem, Theorem 3.2, which gives the guarantee for MARMANN, is almost immediate from Theorem 4.1, Theorem 4.3, Theorem 5.5 and Theorem 5.4.

Acknowledgements

Sivan Sabato was partially supported by the Israel Science Foundation (grant No. 555/15). Aryeh Kontorovich was partially supported by the Israel Science Foundation (grants No. 1141/12 and 755/15) and a Yahoo Faculty award. We thank Lee-Ad Gottlieb and Dana Ron for helpful discussions.

P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efﬁciently learning linear separators with malicious noise. Co RR, abs/1307.8371, 2013. M. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111 139, 2010. M.-F. Balcan, A. Broder, and T. Zhang. Margin-based active learning. In COLT, 2007. M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1), 2009. C. Berlind and R. Urner. Active nearest neighbors in changing environments. In ICML, pages 1870 1879, 2015. R. M. Castro and R. D. Nowak. Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA; June 13-15, 2007. Proceedings, chapter Minimax Bounds for Active Learning, pages 5 19. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classiﬁcation. In NIPS, 2014. T. M. Cover and P. E. Hart. Nearest neighbor pattern classiﬁcation. IEEE Transactions on Information Theory, 13:21 27, 1967. S. Dasgupta. Analysis of a greedy active learning strategy. In NIPS, pages 337 344, 2004. S. Dasgupta. Consistency of nearest neighbor classiﬁcation under selective sampling. In COLT, 2012. S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In ICML, pages 208 215, 2008. L. Devroye and L. Györﬁ. Nonparametric density estimation: the L1 view. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York, 1985. L. Devroye, L. Györﬁ, and G. Lugosi. A probabilistic theory of pattern recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996. ISBN 0-387-94618-7. E. Fix and J. Hodges, J. L. Discriminatory analysis. nonparametric discrimination: Consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):pp. 238 247, 1989. A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efﬁcient active learning of halfspaces: an aggressive approach. Journal of Machine Learning Research, 14(1):2583 2615, 2013. L. Gottlieb and R. Krauthgamer. Proximity algorithms for nearly-doubling spaces. In APPROX-RANDOM, pages 192 204, 2010. L. Gottlieb, L. Kontorovich, and R. Krauthgamer. Efﬁcient classiﬁcation for metric data. In COLT, pages 433 440, 2010. L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efﬁcient classiﬁcation for metric data. IEEE Transactions on Information Theory, 60(9):5750 5759, 2014a. L. Gottlieb, A. Kontorovich, and P. Nisnevitch. Near-optimal sample compression for nearest neighbors. In NIPS, pages 370 378, 2014b. L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, pages 105 118, 2016a. L. Gottlieb, A. Kontorovich, and P. Nisnevitch. Nearly optimal classiﬁcation for semimetrics. In Artiﬁcial Intelligence and Statistics (AISTATS), 2016b. T. Graepel, R. Herbrich, and J. Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classiﬁcation. Machine Learning, 59(1-2):55 76, 2005. S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333 361, 2011. S. Hanneke and L. Yang. Minimax analysis of active learning. JMLR, 16:3487 3602, 2015. A. Kontorovich and R. Weiss. A Bayes consistent 1-NN classiﬁer. In AISTATS, 2015. A. Kontorovich, S. Sabato, and R. Urner. Active nearest-neighbor learning in metric spaces. Co RR, abs/1605.06792, 2016. URL http://arxiv.org/abs/1605.06792. S. Kpotufe. k-NN regression adapts to local intrinsic dimension. In NIPS, 2011. S. Kpotufe, R. Urner, and S. Ben-David. Hierarchical label queries with data-dependent partitions. In COLT, pages 1176 1189, 2015. R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. In 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 791 801, Jan. 2004. S. R. Kulkarni and S. E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028 1039, 1995. A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009. A. K. Mc Callum and K. Nigam. Employing EM and pool-based active learning for text classiﬁcation. In ICML, 1998. C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595 620, 1977. R. Urner, S. Wulff, and S. Ben-David. PLAL: cluster-based active learning. In COLT, pages 376 397, 2013. U. von Luxburg and O. Bousquet. Distance-based classiﬁcation with Lipschitz functions. Journal of Machine Learning Research, 5:669 695, 2004.