# Active Nearest-Neighbor Learning in Metric Spaces

Journal of Machine Learning Research 18 (2018) 1-38. Submitted 10/16; Revised 12/17; Published 6/18.

Aryeh Kontorovich KARYEH@CS.BGU.AC.IL
Sivan Sabato SABATOS@CS.BGU.AC.IL
Department of Computer Science, Ben-Gurion University of the Negev, Beer Sheva 8499000, Israel

Ruth Urner RUTH@EECS.YORKU.CA
Lassonde School of Engineering, EECS Department, York University, Toronto, ON, Canada

Editor: Andreas Krause

Abstract

We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is significantly lower than that of any passive learner with similar error guarantees. MARMANN is based on a generalized sample compression scheme, and a new label-efficient active model-selection procedure.

Keywords: Nearest-neighbors, active learning, metric spaces, non-parametric learning

1. Introduction

Active learning is a framework for reducing the amount of label supervision for prediction tasks. While labeling large amounts of data can be expensive and time-consuming, unlabeled data is often much easier to come by. In this paper we propose a non-parametric pool-based active learning algorithm for general metric spaces, which outputs a nearest-neighbor classifier. In pool-based active learning (McCallum and Nigam, 1998), a collection of random examples is provided, and the algorithm can interactively query an oracle to label some of the examples. The goal is good prediction accuracy, while keeping the label complexity, that is, the number of labels queried, low. Our algorithm, MArgin Regularized Metric Active Nearest Neighbor (MARMANN), receives a pool of unlabeled examples in a general metric space, and outputs a variant of the 1-nearest-neighbor classifier, implemented by a 1-nearest-neighbor rule. The algorithm obtains a prediction error guarantee that depends on a noisy-margin property of the input sample, and has a provably smaller label complexity than any passive learner with a similar guarantee.

Active learning has been mostly studied in a parametric setting, in which learning takes place with respect to a fixed hypothesis class with a bounded capacity. There has also been some work on analyzing non-parametric active learning strategies under certain distributional assumptions (see Section 1.1 for more discussion on this). However, the question of whether active querying strategies can yield label savings for non-parametric methods in a general setting, without distributional assumptions, had not been analyzed prior to this work. Here, we provide a first demonstration that this is indeed possible. We discuss related work in detail in Section 1.1 below.

Our contributions. MARMANN is a new non-parametric pool-based active learning algorithm, which obtains an error guarantee competitive with that of a noisy-margin-based passive learner. Additionally, it provably uses significantly fewer labels in nontrivial regimes.
As far as the authors are aware, this is the first non-parametric active learner for general metric spaces, which achieves competitive prediction error guarantees to the passive learner, while provably improving label complexity. The guarantees of MARMANN are given in Theorem 4 in Section 3. We further provide a passive learning lower bound (Theorem 5), which together with Theorem 4 shows that MARMANN can have a significantly reduced label complexity compared to any passive learner. The passive lower bound is more general than previous lower bounds, relies on a novel technique, and may be of independent interest. Additionally, we give an active label complexity lower bound (Theorem 6), which holds for any active learner with similar error guarantees as MARMANN. The proof of this active lower bound relies on a new No-Free-Lunch type result, which holds for active learning algorithms. Our approach. Previous passive learning approaches to classification using nearest-neighbor rules under noisy-margin assumptions (Gottlieb et al., 2014b, 2017) provide statistical guarantees using sample compression bounds (Graepel et al., 2005). Their finite-sample guarantees depend on the number of noisy labels relative to an optimal margin scale. A central challenge in the active setting is performing model selection to select a margin scale with a low label complexity. A key insight that we exploit in this work is that by designing a new labeling scheme for the compression set, we can construct the compression set and estimate its error with label-efficient procedures. We obtain statistical guarantees for this approach using generalization bounds for sample compression with side information. We derive a label-efficient, as well as computationally efficient, active model-selection procedure. This procedure finds a good scale by estimating the sample error for some scales, using a small number of active querying rounds. Crucially, unlike cross-validation, our model-selection procedure does not require a number of labels that depends on the worst possible scale, nor does it test many scales. This allows our label complexity bounds to be low, and to depend only on the final scale selected by the algorithm. Our error guarantee is a constant factor over the error guarantee of the passive learner of Gottlieb et al. (2017). An approach similar to Gottlieb et al. (2017), proposed in Gottlieb et al. (2014a), has been shown to be Bayes consistent (Kontorovich and Weiss, 2015). The Bayes-consistency of the passive version of our approach has recently been established (Kontorovich et al., 2017). Paper structure. Related work is discussed in Section 1.1. We lay down the preliminaries in Section 2. In Section 3 we provide our main result: Theorem 4, which gives error and label complexity guarantees for MARMANN. Additionally we state the passive and active lower bounds, Theorem 5 and Theorem 6. The rest of the paper is devoted to the description and analysis of MARMANN, and proof of the main results. Section 4 shows how MARMANN defines the nearest neighbor rule for a given scale, and Section 5 describes the model selection procedure of MARMANN. Theorem 4 is proved in Section 6, based on a framework for compression with side information. The passive lower bound in Theorem 5 is proved in Section 7. The active lower bound Theorem 6 is proved in Section 8. We conclude with a discussion in Section 9. 
1.1 Related Work

The theory of active learning has received considerable attention in the past decade (e.g., Dasgupta, 2004; Balcan et al., 2007, 2009; Hanneke, 2011; Hanneke and Yang, 2015). Active learning theory has been mostly studied in a parametric setting (that is, learning with respect to a fixed hypothesis class with a bounded capacity). Benefits and limitations of various active querying strategies have been proven in the realizable setting (Dasgupta, 2004; Balcan et al., 2007; Gonen et al., 2013) as well as in the agnostic case (Balcan et al., 2009; Hanneke, 2011; Awasthi et al., 2014). It has also been shown that active queries can be beneficial for regression tasks (Castro et al., 2005; Sabato and Munos, 2014). Further, an active model selection procedure has been developed for the parametric setting (Balcan et al., 2010).

The potential benefits of active learning for non-parametric settings are less well understood. Practical Bayesian graph-based active learning methods (Zhu et al., 2003; Wei et al., 2015) rely on generative model assumptions, and therefore come without distribution-free performance guarantees. From a theoretical perspective, the label complexity of graph-based active learning has mostly been analyzed in terms of combinatorial graph parameters (Cesa-Bianchi et al., 2010; Dasarathy et al., 2015). While the latter work provides some statistical guarantees under specific conditions on the data-generating distribution, this type of analysis does not yield distribution-free statistical performance guarantees. Castro et al. (2005) and Castro and Nowak (2008) analyze minimax rates for non-parametric regression and classification, respectively, for a class of distributions in Euclidean space, characterized by decision boundary regularity and noise conditions, with uniform marginals. The paradigm of cluster-based active learning (Dasgupta and Hsu, 2008) has been shown to provide label savings under some distributional clusterability assumptions (Urner et al., 2013; Kpotufe et al., 2015). Dasgupta and Hsu (2008) showed that a suitable cluster-tree can yield label savings in this framework, and follow-up papers (Urner et al., 2013; Kpotufe et al., 2015) quantified the label savings under distributional clusterability assumptions. However, no active non-parametric strategy has been proposed so far that has label complexity guarantees for i.i.d. data from general distributions and general metric spaces. Here, we provide the first such algorithm and guarantees.

The passive nearest-neighbor classifier, introduced by Fix and Hodges (1951, 1989), is popular among theorists and practitioners alike (Fix and Hodges, 1989; Cover and Hart, 1967; Stone, 1977; Kulkarni and Posner, 1995; Boiman et al., 2008). This paradigm is applicable in general metric spaces, and its simplicity is an attractive feature for both implementation and analysis. When appropriately regularized, either by taking a majority vote among the k nearest neighbors (Stone, 1977; Devroye and Györfi, 1985; Zhao, 1987) or by enforcing a margin separating the classes (von Luxburg and Bousquet, 2004; Gottlieb et al., 2014a; Kontorovich and Weiss, 2015; Kontorovich et al., 2017), this type of learner can be made Bayes-consistent.
Another desirable property of nearest-neighbor-based methods is their ability to generalize at a rate that scales with the intrinsic data dimension, which can be much lower than that of the ambient space (Kpotufe, 2011; Gottlieb et al., 2014a, 2016; Chaudhuri and Dasgupta, 2014). Furthermore, margin-based regularization makes nearest neighbor classifiers ideally suited for sample compression, which yields a compact representation, faster classification runtime, and improved generalization performance (Gottlieb et al., 2014b). The resulting error guarantees can be stated in terms of the sample's noisy margin, which depends on the distances between differently-labeled examples in the input sample.

Active learning strategies specific to nearest neighbor classification have recently received attention. It has been shown that certain active querying rules maintain Bayes consistency for nearest neighbor classification, while other, seemingly natural, rules do not lead to a consistent algorithm (Dasgupta, 2012). A selective querying strategy has been shown to be beneficial for nearest neighbors under covariate shift (Berlind and Urner, 2015), where one needs to adapt to a change in the data-generating process. However, the querying rule in that work is based solely on information in the unlabeled data, to account for a shift in the distribution over the covariates. It does not imply any label savings in the standard learning setting, where training and test distribution are identical. In contrast, our current work demonstrates how an active learner can take label information into account, to reduce the label complexity of a general nearest neighbor method in the standard setting.

1.2 A Remark on Bayes-Consistency

We remark on the Bayes-consistency of the margin-based passive 1-NN methods. In Gottlieb et al. (2014a), a PAC-style generalization bound was given. At a given scale t, the algorithm first ensured t-separation of the sample by solving a minimum vertex cover problem to eliminate the t-blocking pairs. Following that, the hypothesis was constructed as a Lipschitz extension from the remaining sample; the latter is computationally implemented as a nearest neighbor classifier. Structural Risk Minimization (SRM) was used to select the optimal scale t. A very close variant of this learner was shown to be Bayes-consistent by Kontorovich and Weiss (2015). The only difference between the two is that the former analyzed the hypothesis complexity in terms of the fat-shattering dimension, while the latter used Rademacher averages. Thus, a margin-regularized 1-NN classifier was shown to be Bayes-consistent; however, no compression was involved. A compression-based alternative to Lipschitz extension was proposed in Gottlieb et al. (2014b). The idea is again to ensure t-separation via vertex cover and then compress the remaining sample down to a t-net. We conjecture that this latter algorithm is also Bayes-consistent, but currently have no proof. If instead one considers a compression-based passive learner implemented as in this paper (by taking a majority vote in each Voronoi region rather than enforcing t-separation via vertex cover), the resulting classifier is indeed Bayes-consistent, as was recently shown by Kontorovich et al. (2017).

2. Preliminaries

In this section we lay down the necessary preliminaries. We formally define the setting and necessary notation in Section 2.1.
We discuss nets in metric spaces in Section 2.2, and present the guarantees of the compression-based passive learner of Gottlieb et al. (2017) in Section 2.3.

2.1 Setting and Notation

For positive integers n, denote [n] := {1, . . . , n}. We consider learning in a general metric space (X, ρ), where X is a set and ρ is the metric on X. Our problem setting is that of classification of the instance space X into some finite label set Y. Assume that there is some distribution D over X × Y, and let S ~ D^m be a labeled sample of size m, where m is an integer. Denote the sequence of unlabeled points in S by U(S). We sometimes treat S and U(S) as multisets, since the order is unimportant. For a labeled multiset S ⊆ X × Y and y ∈ Y, denote S_y := {x | (x, y) ∈ S}; in particular, U(S) = ∪_{y∈Y} S_y. The error of a classifier h : X → Y on D, for any fixed h, is denoted err(h, D) := P[h(X) ≠ Y], where (X, Y) ~ D. The empirical error on a labeled sample S instantiates to err(h, S) := (1/|S|) Σ_{(x,y)∈S} I[h(x) ≠ y].

A passive learner receives a labeled sample Sin as input. An active learner receives the unlabeled part of the sample Uin := U(Sin) as input, and is allowed to interactively select examples from Uin and request their label from Sin. In other words, the active learner iteratively selects an example and requests its label, wherein all the labels requested so far can be used to make the next selection. When either learner terminates, it outputs a classifier ĥ : X → Y, with the goal of achieving a low err(ĥ, D). An additional goal of the active learner is to achieve a performance competitive with that of the passive learner, while querying considerably fewer labels.

The diameter of a set A ⊆ X is defined by diam(A) := sup_{a,a′∈A} ρ(a, a′). For a finite set U = {x_1, . . . , x_{|U|}} ⊆ X with some fixed numbering of its elements,¹ denote the index of the closest point in U to x ∈ X by κ(x, U) := argmin_{i : x_i ∈ U} ρ(x, x_i). We assume here and throughout this work that when there is more than one minimizer of ρ(x, x_i), ties are broken arbitrarily (but in a consistent and deterministic fashion). Any labeled sample S = ((x_i, y_i))_{i∈[k]} naturally induces the 1-nearest-neighbor classifier h^nn_S : X → Y, via h^nn_S(x) := y_{κ(x, U(S))}. For a set Z ⊆ X, denote by κ(Z, U) := {κ(z, U) | z ∈ Z} the set of all the indices κ(z, U), as defined above.

For x ∈ X and t > 0, denote by ball(x, t) the (closed) ball of radius t around x: ball(x, t) := {x′ ∈ X | ρ(x, x′) ≤ t}. A set A ⊆ X is t-separated if inf_{a,a′∈A : a≠a′} ρ(a, a′) ≥ t. For A ⊆ B ⊆ X, the set A is a t-net of B if A is t-separated and B ⊆ ∪_{a∈A} ball(a, t). Thus, A is a t-net of B if it is both a t-covering and a t-packing.

1. Invoking the well-ordering principle, we may assume X to be well-ordered, and then any U ⊆ X inherits the ordering of X.

2.2 Nets in Metric Spaces

The size of a t-net of a metric space is strongly related to its doubling dimension. The doubling dimension is the effective dimension of the metric space, which controls generalization and runtime performance of nearest-neighbors (Kpotufe, 2011; Gottlieb et al., 2014a). It is defined as follows. Let λ = λ(X) be the smallest number such that every ball in X can be covered by λ balls of half its radius, where all balls are centered at points of X. Formally,

    λ(X) := min{λ ∈ N : ∀x ∈ X, ∀r > 0, ∃x_1, . . . , x_λ ∈ X such that ball(x, r) ⊆ ∪_{i=1}^{λ} ball(x_i, r/2)}.

Then the doubling dimension of X is defined by ddim(X) := log₂ λ.
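To make these definitions concrete, the following short Python sketch (an illustration of ours, not one of the paper's algorithms) builds a t-net of a finite point set using the simple quadratic-time greedy strategy referenced just below, and implements the index map κ together with the induced 1-nearest-neighbor rule h^nn_S. The function names, the toy metric, and the data are ours.

```python
def greedy_t_net(points, t, dist):
    """Simple O(m^2) greedy construction of a t-net of `points`: a point is
    added to the net if it is at distance >= t from every point already in
    the net.  The result is t-separated, and every point of `points` lies
    within distance t of some net point."""
    net = []
    for p in points:
        if all(dist(p, q) >= t for q in net):
            net.append(p)
    return net


def kappa(x, net, dist):
    """Index of the closest net point to x (ties broken by smallest index)."""
    return min(range(len(net)), key=lambda i: dist(x, net[i]))


def nn_classifier(labeled_sample, dist):
    """The 1-nearest-neighbor rule h^nn_S induced by a labeled sample S."""
    xs = [x for x, _ in labeled_sample]
    ys = [y for _, y in labeled_sample]
    return lambda x: ys[kappa(x, xs, dist)]


if __name__ == "__main__":
    # Toy usage on the real line with the metric rho(a, b) = |a - b|.
    rho = lambda a, b: abs(a - b)
    pool = [0.0, 0.1, 0.15, 1.0, 1.05, 2.0]
    print(greedy_t_net(pool, t=0.5, dist=rho))     # [0.0, 1.0, 2.0]
    h = nn_classifier([(0.0, "a"), (1.0, "b"), (2.0, "a")], rho)
    print(h(0.2), h(1.4))                          # a b
```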
In line with modern literature, we work in the low-dimension, large-sample regime, where the doubling dimension is assumed to be constant, and hence sample complexity and algorithmic runtime may depend on it exponentially. This exponential dependence is unavoidable, even under margin assumptions, as previous analyses (Kpotufe, 2011; Gottlieb et al., 2014a) indicate. Generalization bounds in terms of the doubling dimension of the hypothesis space were established in Bshouty et al. (2009), while runtime and generalization errors in terms of ddim(X) were given in Gottlieb et al. (2014a). As shown in Gottlieb and Krauthgamer (2013), the doubling dimension is almost hereditary, in the sense that for A ⊆ X, we have ddim(A) ≤ c·ddim(X) for some universal constant c ≤ 2 (Feldmann et al., 2015, Lemma 6.6). In the works cited above, where generalization bounds are stated in terms of ddim(X), one can obtain tighter bounds in terms of ddim(U(S)) when the latter is substantially lower than that of the ambient space, and it is also possible to perform metric dimensionality reduction, as in Gottlieb et al. (2016).

Constructing a minimum-size t-net for a general set B is NP-hard (Gottlieb and Krauthgamer, 2013). However, a simple greedy algorithm constructs a (not necessarily minimal) t-net in time O(m²) (Gottlieb et al., 2014b, Algorithm 1). There is also an algorithm for constructing a t-net in time 2^{O(ddim(X))} m log(1/t) (Krauthgamer and Lee, 2004; Gottlieb et al., 2014b). The size of any t-net of a metric space A ⊆ X is at most

    ⌈diam(A)/t⌉^{ddim(A)+1}    (1)

(Krauthgamer and Lee, 2004). In addition, the size of any t-net is at most 2^{ddim(A)+1} times the size of the minimal t-net, as the following easy lemma shows.

Lemma 1 (comparison of two nets) Let t > 0 and suppose that M1, M2 are t-nets of A ⊆ X. Then |M1| ≤ 2^{ddim(A)+1}|M2|.

Proof Suppose that |M1| ≥ k|M2| for some positive integer k. Since M1 ⊆ ∪_{x∈M2} ball(x, t), it follows from the pigeonhole principle that at least one of the points in M2 must cover at least k points in M1. Thus, suppose that x ∈ M2 covers the set Z = {z_1, . . . , z_l} ⊆ M1, meaning that Z ⊆ ball(x, t), where l = |Z| ≥ k. By virtue of belonging to the t-net M1, the set Z is t-separated. Therefore Z is a t-net of Z. Since Z is contained in a t-ball, we have diam(Z) ≤ 2t. It follows from Eq. (1) that |Z| ≤ 2^{ddim(A)+1}, whence the claim.

Throughout the paper, we fix a deterministic procedure for constructing a t-net, and denote its output for a multiset U ⊆ X by Net(U, t). Let Par(U, t) be the partition of X into regions induced by Net(U, t), that is: for Net(U, t) = {x_1, . . . , x_N}, define Par(U, t) := {P_1, . . . , P_N}, where P_i = {x ∈ X | κ(x, Net(U, t)) = i}. For t > 0, let N(t) := |Net(Uin, t)| be the size of the t-net for the input sample.

2.3 Passive Compression-Based Nearest-Neighbors

Non-parametric binary classification admits performance guarantees that scale with the sample's noisy margin (von Luxburg and Bousquet, 2004; Gottlieb et al., 2014a, 2017). The original margin-based methods of von Luxburg and Bousquet (2004) and Gottlieb et al. (2014a) analyzed the generalization performance via the technique of Lipschitz extension. Later, it was noticed in Gottlieb et al. (2014b) that the presence of a margin allows for compression, in fact nearly optimally so.
We say that a labeled multiset S is (ν, t)-separated, for ν ∈ [0, 1] and t > 0 (representing a margin t with noise ν), if one can remove a ν-fraction of the points in S, and in the resulting multiset, any pair of differently labeled points is separated by a distance of at least t. Formally, we have the following definition.

Definition 2 S is (ν, t)-separated if there exists a subsample S̃ ⊆ S such that
1. |S \ S̃| ≤ ν|S|, and
2. for all y_1 ≠ y_2 ∈ Y, a ∈ S̃_{y_1}, b ∈ S̃_{y_2}, we have ρ(a, b) ≥ t.

For a given labeled sample S, denote by ν(t) the smallest value ν such that S is (ν, t)-separated. Gottlieb et al. (2017) propose a passive learner with the following guarantees² as a function of the separation of S. Setting α := m/(m − N), define the following form of a generalization bound:

$$\mathrm{GB}(\epsilon, N, \delta, m, k) := \alpha\epsilon + \frac{2\big((N+1)\log(mk) + \log\frac{1}{\delta}\big)}{3(m-N)} + 3\sqrt{\frac{\alpha\epsilon\big((N+1)\log(mk) + \log\frac{1}{\delta}\big)}{2(m-N)}}.$$

Further, for an integer m and δ ∈ (0, 1), denote Gmin(m, δ) := min_{t>0} GB(ν(t), N(t), δ, m, 1).

2. The guarantees hold for the more general case of semimetrics.

The quantity Gmin(m, δ) is small for datasets where we only need to remove a few points to obtain a reasonably well separated subset. As an example, consider data generated by two well-separated Gaussians (one generating the positively labeled points, and one generating the negatively labeled points). Then, most of the data points will be close to their respective means, but some will be farther, and may lie closer to the mean of the other Gaussian. Removing those few will result in a separated set.

Theorem 3 (Gottlieb et al. 2017) Let m be an integer, δ ∈ (0, 1). There exists a passive learning algorithm that returns a nearest-neighbor classifier h^nn_{Spas}, where Spas ⊆ Sin, such that, with probability 1 − δ, err(h^nn_{Spas}, D) ≤ Gmin(m, δ).

The bound above is data-dependent, meaning that the strength of the generalization guarantee depends on the quality of the random sample. Specifically, the passive algorithm of Gottlieb et al. (2017) generates Spas of size approximately N(t) for the optimal scale t > 0 (found by searching over all scales), by removing the |Sin|·ν(t) points that obstruct the t-separation between different labels in Sin, and then selecting a subset of the remaining labeled examples to form Spas, so that the examples are a t-net for Sin (not including the obstructing points). For the binary classification case (|Y| = 2), an efficient algorithm is shown in Gottlieb et al. (2017). However, in the general multiclass case, it is not known how to find a minimal t-separation efficiently: a naive approach requires solving the NP-hard problem of vertex cover. Our approach, which we detail below, circumvents this issue, and provides an efficient algorithm also for the multiclass case.

3. Main Results

We propose a novel approach for generating a subset for a nearest-neighbor rule. This approach, detailed in the following sections, does not require finding and removing all the obstructing points in Sin, and can be implemented in an active setting using a small number of labels. The resulting active learning algorithm, MARMANN, has an error guarantee competitive with that of the passive learner, and a label complexity that can be significantly lower. We term the subset used by the nearest-neighbor rule a compression set.

Algorithm 1 MARMANN: MArgin Regularized Metric Active Nearest Neighbor
input: Unlabeled sample Uin of size m, δ ∈ (0, 1)
ˆt ← Select Scale(δ)   # Select Scale is given in Section 5, Alg. 4.
ˆS ← Generate NNSet(ˆt, [N(ˆt)], δ)   # Generate NNSet is given in Section 4, Alg. 2.
Output hnn ˆS . MARMANN, listed in Alg. 1, operates as follows. First, a scale ˆt > 0 is selected, by calling ˆt Select Scale(δ), where Select Scale is our model selection procedure. Select Scale has access to Uin, and queries labels from Sin as necessary. It estimates the generalization error bound GB for several different scales, and executes a procedure similar to binary search to identify a good scale. The binary search keeps the number of estimations (and thus requested labels) small. Crucially, our estimation procedure is designed to prevent the search from spending a number of labels that depends on the net size of the smallest possible scale t, so that the total label complexity of MARMANN depends only on the error of the selected ˆt. Second, the selected scale ˆt is used to generate the compression set by calling ˆS Generate NNSet(ˆt, [N(ˆt)], δ), where Generate NNSet is our procedure for generating the compression set, and [N(ˆt)] {1, . . . , N(ˆt)}. Our main result is the following guarantee for MARMANN. Theorem 4 (Main result; Guarantee for MARMANN) Let Sin Dm, where m max(6, |Y|), δ (0, 1 4). Let hnn ˆS be the output of MARMANN(Uin, δ), where ˆS X Y, and let ˆN := | ˆS|. Let ˆh := hnn ˆS and ˆϵ := err(ˆh, Sin), and denote ˆG := GB(ˆϵ, ˆN, δ, m, 1). With a probability of 1 δ over Sin and randomness of MARMANN, err(ˆh, D) 2 ˆG O (Gmin(m, δ)) , and the number of labels from Sin requested by MARMANN is at most ˆG ) + m ˆG . (2) Here the O( ) notation hides only universal multiplicative constants. ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES Our error guarantee is thus a constant factor over the error guarantee of the passive learner of (Gottlieb et al., 2017), given in Theorem 3. The constant factor that we derive in our analysis is in the order of 2000 (this can be seen in the proof of Theorem 15). Note that we did not focus on optimizing it, opting instead for a more streamlined analysis. As the lower bound in Theorem 6 shows, the additive term m ˆG in Eq. (2) is essentially unavoidable. Whether the dependence on 1/ ˆG is indeed necessary is currently an open problem. To observe the advantages of MARMANN over a passive learner, consider a scenario in which the upper bound Gmin of Theorem 3, as well as the Bayes error of D, are of order Θ(1/ m). Then ˆG = Θ(1/ m) as well. Therefore, MARMANN obtains a prediction error guarantee of Θ(1/ m), similarly to the passive learner, but it uses only Θ( m) labels instead of m. In contrast, the following result shows that no learner that selects labels uniformly at random from Sin can compete with MARMANN: Theorem 5 below shows that for any passive learner that uses Θ( m) random labels from Sin, there exists a distribution D with the above properties, for which the prediction error of the passive learner in this case is Ω(m 1/4), a decay rate which is almost quadratically slower than the O(1/ m) rate achieved by MARMANN. Thus, the guarantees of MARMANN cannot be matched by any passive learner. Theorem 5 (Passive lower bound) Let m > 0 be an integer, and suppose that (X, ρ) is a metric space such that for some t > 0, there is a t-net T of X of size Θ( m). Consider any passive learning algorithm that maps i.i.d. samples Sℓ Dℓfrom some distribution D over X { 1, 1}, to functions ˆhℓ: X { 1, 1}. For any such algorithm and any ℓ= Θ( m), there exists a distribution D such that: i. The Bayes error of D is Θ(1/ m); ii. 
With at least a constant probability, both of the following events occur: (a) The passive learner achieves error err(ˆhℓ, D) = Ω(m 1/4), (b) Gmin(m, δ) = Θ(1/ m). Furthermore, i. and ii. continue to hold when the learning algorithm has access to the full marginal distribution over X. Thus, MARMANN even improves over a semi-supervised learner: its label savings stem from actively selecting labels, and are not achievable by merely exploiting information from unlabeled data or by randomly selecting examples to label. We deduce Theorem 5 from a more general result, which might be of independent interest. Theorem 18, given in Section 7, improves existing passive learning sample complexity lower bounds. In particular, our result removes the restrictions of previous lower bounds on the relationship between the sample size, the VC-dimension, and the noise level, which render existing bounds inapplicable to our parameter regime. The proof of Theorem 5 is given thereafter in Section 7, as a consequence of Theorem 18. We further provide a label complexity lower bound, in Theorem 6 below, which holds for any active learner that obtains similar guarantees to those of MARMANN. The lower bound shows that any active learning algorithm which guarantees a multiplicative accuracy over Gmin(m, δ) has a label complexity which is Ω(m Gmin(m, δ)), for a wide range of values of Gmin(m, δ) essentially, as long as Gmin(m, δ) is not trivially large or trivially small. This implies that the term m ˆG KONTOROVICH, SABATO AND URNER in the upper bound of the label complexity of MARMANN in Theorem 4 cannot be significantly improved. Theorem 6 (Active lower bound) Let X = R, δ (0, 1/14). Let C 1, and let A be an active learning algorithm that outputs ˆh. Suppose that for any distribution D over X Y, if the input unlabeled sample is of size m, then err(ˆh, D) CGmin(m, δ) with probability at least 1 δ. Then for any α (log(m)+log(28) 2m , 1 240C ) there exists a distribution D such with probability at least 1 28 over S Dm and the randomness of A, both of the following events hold: 1. α Gmin(m, δ) 30α 2. A queries at least 1 2 jm Gmin(m,δ) log( m δ ) 30 log(m) k Ω(m Gmin(m, δ)) labels. The proof of this lower bound is provided in Section 8. In the rest of the paper, the components of MARMANN are described in detail, and the main results are proved. 4. Active Nearest-Neighbor at a Given Scale A main challenge for active learning in our non-parametric setting is performing model selection, that is, selecting a good scale t similarly to the passive learner of Gottlieb et al. (2017). In the passive supervised setting, the approach developed in several previous works (Gottlieb et al., 2014b; Kontorovich and Weiss, 2014; Gottlieb et al., 2014a; Kontorovich and Weiss, 2015) performs model selection by solving a minimum vertex cover problem for each considered scale t, so as to eliminate all of the t-blocking pairs i.e., pairs of differently labeled points within a distance t. The passive algorithm generates a compression set by first finding and removing from Sin all points that obstruct (ν, t)-separation at a given scale t > 0. This incurs a computational cost but no significant sample complexity increase, aside from the standard logarithmic factor that comes from stratifying over data-dependent hierarchies (Shawe-Taylor et al., 1998). While this approach works for passive learning, in the active setting we face a crucial challenge: estimating the error of a nearest-neighbor rule at scale t using a small number of samples. 
A key insight that we exploit in this work is that instead of eliminating the blocking pairs, one may simply relabel some of the points in the compression set, and this would also generate a low-error nearest neighbor rule. This new approach enables estimation of the sample accuracy of a (possibly relabeled) t-net by label-efficient active sampling. In addition, this approach is significantly simpler than estimating the size of the minimum vertex cover of the t-blocking graph. Moreover, we gain improved algorithmic efficiency, by avoiding the relatively expensive vertex cover procedure. A small technical difference, which will be evident below, is that in this new approach, examples in the compression set might have a different label than their original label in Sin. Standard sample compression analysis (e.g. Graepel et al., 2005) assumes that the classifier is determined by a small number of labeled examples from Sin. This does not allow the examples in the compression set to have a different label than their original label in Sin. Therefore, we require a slight generalization of previous compression analysis (following previous works on compression, see details in Section 6.1), which allows adding side information to the compression set. This side information will be used to set the label of each of the examples in the compression set. The generalization incurs a small statistical penalty, which we quantify in Section 6, as a preliminary to proving Theorem 4. ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES We now describe our approach to generating a compression set for a given scale t > 0. Recall that ν(t) is the smallest value for which Sin is (ν, t)-separated. We define two compression sets. The first one, denoted Sa(t), represents an ideal compression set, constructed (solely for the sake of analysis) so that it induces an empirical error of at most ν(t). Calculating Sa(t) might require many labels, thus it is only used for analysis purposes; the algorithm never constructs it. The second compression set, denoted ˆSa(t), represents an approximation to Sa(t), which can be constructed using a small number of labels, and induces a sample error of at most 4ν(t) with high probability. We first define the ideal set Sa(t) := {(x1, y1), . . . , (x N, y N)}. The examples in Sa(t) are the points in Net(Uin, t/2), and the label of each example is the majority label, out of the labels of the examples in Sin to which xi is closest. Formally, {x1, . . . , x N} := Net(Uin, t/2), and for i [N], yi := argmaxy Y |Sy Pi|, where Pi = {x X | κ(x, Net(U, t/2)) = i} Par(Uin, t/2). Lemma 7 Let S be a labeled sample of size m, and let {P1, . . . , PN} be a partition of U(S), with maxi diam(Pi) t for some t 0. For i [N], let Λi := Syi Pi. Then i [N] |Λi|. Proof Let S S be a subsample that witnesses the (ν(t), t)-separation of S, so that | S| m(1 ν(t)), and for any two points (x, y), (x , y ) S, if ρ(x, x ) t then y = y . Denote U := U( S). Since maxi diam(Pi) t, for any i [N] all the points in U Pi must have the same label in S. Therefore, y Y s.t. U Pi Sy Pi. Hence | U Pi| |Λi|. It follows i [N] |Λi| |S| X i [N] | U Pi| = |S| | S| = m ν(t). Dividing by m we get the statement of the lemma. From Lemma 7, we get Cor. 8, which upper bounds the empirical error of hnn Sa(t) by ν(t). Corollary 8 For every t > 0, err(hnn Sa(t), Sin) ν(t). This corollary is immediate from Lemma 7, since for any Pi Par(Uin, t/2), diam(Pi) t, and err(hnn Sa(t), Sin) = 1 1 i [N] |Λi|. 
Now, calculating Sa(t) requires knowing most of the labels in Sin. MARMANN constructs instead an approximation ˆSa(t), in which the examples are the points in Net(Uin, t/2) (so that U( ˆSa(t)) = U(Sa(t)) ), but the labels are determined using a bounded number of labels requested from Sin. The labels in ˆSa(t) are calculated by the simple procedure Generate NNSet given in Alg. 2. The empirical error of the output of Generate NNSet is bounded in Theorem 9 below.3 3. In the case of binary labels (|Y| = 2), the problem of estimating Sa(t) can be formulated as a special case of the benign noise setting for parametric active learning, for which tight lower and upper bounds are provided in Hanneke and Yang (2015). However, our case is both more general (as we allow multiclass labels) and more specific (as we are dealing with a specific hypothesis class ). Thus we provide our own procedure and analysis. KONTOROVICH, SABATO AND URNER A technicality in Alg. 2 requires explanation: In MARMANN, the generation of ˆSa(t) will be split into several calls to Generate NNSet, so that different calls determine the labels of different points in ˆSa(t). Therefore Generate NNSet has an additional argument I, which specifies the indices of the points in Net(Uin, t/2) for which the labels should be returned this time. Crucially, if during the run of MARMANN, Generate NNSet is called again for the same scale t and the same point in Net(Uin, t/2), then Generate NNSet returns the same label that it returned before, rather than recalculating it using fresh labels from Sin. This guarantees that despite the randomness in Generate NNSet, the full ˆSa(t) is well-defined within any single run of MARMANN, and is distributed like the output of Generate NNSet(t, [N(t/2)], δ), which is convenient for the analysis. Define Q(m) := 18 log(4m3/δ) . (3) Algorithm 2 Generate NNSet(t, I, δ) input Scale t > 0, a target set I [N(t/2)], confidence δ (0, 1). output A labeled set S X Y of size |I| {x1, . . . , x N} Net(Uin, t/2), {P1, . . . , PN} Par(Uin, t/2), S () for i I do if ˆyi has not already been calculated for Uin with this value of t then Draw Q(m) points uniformly at random from Pi and query their labels. Let ˆyi be the majority label observed in these Q(m) queries. end if S S {(xi, ˆyi)}. end for Output S Theorem 9 Let ˆSa(t) be the output of Generate NNSet(t, [N(t/2)], δ). With a probability at least 1 δ 2m2 , the following event, which we denote by E(t), holds: err(hnn ˆSa(t), Sin) 4ν(t). Proof By Cor. 8, err(hnn Sa(t), Sin) ν(t). In Sa(t), the labels assigned to each point in Net(Uin, t/2) are the majority labels (based on Sin) of the points in the regions in Par(Uin, t/2). As above, we denote the majority label for region Pi by yi := argmaxy Y |Sy Pi|. We now compare these labels to the labels ˆyi assigned by Alg. 2. Let p(i) = |Λi|/|Pi| be the fraction of points in Pi which are labeled by the majority label yi, where Λi is as defined in Lemma 7. Let ˆp(i) be the fraction of labels equal to yi out of those queried by Alg. 2 in round i. Let β := 1/6. By Hoeffding s inequality and union bounds, we have that with a probability of at least 1 2N(t/2) exp( Q(m) 18 ) 1 δ 2m2 , we have maxi [N(t/2)] |ˆp(i) p(i)| β. Denote this good event by E . We now prove that E E(t). Let J = {i [N(t/2)] | ˆp(i) > 1 2}. It can be easily seen that ˆyi = yi for all i J. Therefore, for all x such that κ(x, U(Sa(t))) J, hnn ˆSa(t)(x) = hnn Sa(t)(x), and hence err(hnn S , Sin) PX Sin[κ(X, U(Sa(t))) / J] + err(hnn Sa(t), Uin). 
ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES The second term is at most ν(t) by Cor. 8, and it remains to bound the first term, on the condition that E holds. We have PX U[κ(X, U(Sa(t))) / J] = 1 m P i/ J |Pi|. If E holds, then for any i / J, p(i) 1 2 + β, therefore |Pi| |Λi| = (1 p(i))|Pi| (1 Recall that, by Lemma 7, ν(t) 1 1 m P i [N(t/2)] |Λi|. Therefore, i [N(t/2)] |Λi| i [N(t/2)] (|Pi| |Λi|) i/ J (|Pi| |Λi|) Thus, under E , PX U[κ(X, U(Sa(t))) / J] ν(t) 1 2 β = 3ν(t). It follows that under E , err(hnn S , Uin) 4ν(t). 5. Model Selection We now show how to select the scale ˆt that will be used to generate the output nearest-neighbor rule. The main challenge is to do this with a low label complexity: Generating the full classification rule for scale t requires a number of labels that depends on N(t), which might be very large. We would like the label complexity of MARMANN to depend only on N(ˆt) (where ˆt is the selected scale), which is of the order m ˆG. Therefore, during model selection we can only invest a bounded number of labels in each tested scale. In addition, to keep the label complexity low, we would like to avoid testing all scales. In Section 5.1 we describe how we estimate the error on a given scale. In Section 5.2 we provide a search procedure, resembling binary search, which uses the estimation procedure to select a single scale ˆt. 5.1 Estimating the Error at a Given Scale For t > 0, let ˆSa(t) be the compressed sample that MARMANN would generate if the selected scale were set to t. Our model selection procedure performs a search, similar to binary search, over the possible scales. For each tested scale t, the procedure estimates the empirical error ϵ(t) := err(hnn ˆSa(t), S) within a certain accuracy, using an estimation procedure given below, called Estimate Err. Estimate Err outputs an estimate ˆϵ(t) of ϵ(t), up to a given threshold θ > 0, using labels requested from Sin. KONTOROVICH, SABATO AND URNER To estimate the error, we sample random labeled examples from Sin, and check the prediction error of hnn ˆSa(t) on these examples. The prediction error of any fixed hypothesis h on a random labeled example from Sin is an independent Bernoulli variable with expectation err(h, Sin). Estimate Err is implemented using the following procedure, Est Ber, which adaptively estimates the expectation of a Bernoulli random variable to an accuracy specified by the parameter θ, using a small number of random independent Bernoulli experiments. Let B1, B2, . . . {0, 1} be i.i.d. Bernoulli random variables. For an integer n, denote ˆpn = 1 n Pn i=1 Bi. The estimation procedure Est Ber is given in Alg. 3. We prove a guarantee for this procedure in Theorem 10. Note that we assume that the threshold parameter is in (0, 1], since for θ 1 one can simply output 1 using zero random draws to satisfy Theorem 10. Algorithm 3 Est Ber(θ, β, δ) input A threshold parameter θ (0, 1], a budget parameter β 7, confidence δ (0, 1) S {B1, . . . , B4} K 4β δθ ) for i = 3 : log2(β log(2K/δ)/θ) do S S {Bn/2+1, . . . , Bn}. if ˆpn > β log(2n/δ)/n then break end if end for Output ˆpn. The following theorem states that Alg. 3 essentially estimates p, the expectation of the i.i.d. Bernoulli variables B1, B2, . . ., up to a multiplicative constant, except if p is smaller than a value proportional to the threshold θ, in which case the algorithm simply returns a value at most θ. 
Moreover, the theorem shows that the number of random draws required by the algorithm is inversely proportional to the maximum of the threshold θ and the expectation p. Thus, if p is very small, the number of random draws does not increase without bound. The parameter β controls the trade-off between the accuracy of estimation and the number of random draws. Theorem 10 Let δ (0, 1), θ (0, 1], β 7. Let B1, B2, . . . {0, 1} be i.i.d Bernoulli random variables with expectation p. Let po be the output of Est Ber(θ, β, δ). The following holds with a probability of 1 δ, where f(β) := 1 + 8 1. If po θ, then p f(β)θ. Otherwise, p f(β) po p 2 f(β). 2. Let ψ := max(θ, p/f(β)). The number of random draws in Est Ber is at most 4β log( 8β Proof First, consider any single round i with n = 2i. By the empirical Bernstein bound (Maurer and Pontil, 2009, Theorem 4), with a probability of 1 δ/n, for n 8, we have4 |ˆpn p| 8 log(2n/δ) 2ˆpn log(2n/δ) 4. This follows from Theorem 4 of Maurer and Pontil (2009) since 7 3(n 1) 8 3n for n 8. ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES Define g := (β + 8/3 + 2β), so that f(β) = g/β. Conditioned on Eq. (4), there are two cases: (a) ˆpn β log(2n/δ)/n. In this case, p g log(2n/δ)/n. (b) ˆpn > β log(2n/δ)/n. In this case, n β log(2n/δ)/ˆpn. Thus, by Eq. (4), |ˆpn p| ˆpn( 8 2/β) = ˆpn(g/β 1). Therefore βp g ˆpn p 2 g/β . Taking a union bound on all the rounds, we have that the guarantee holds for all rounds with a probability of at least 1 δ. Condition now on the event that these guarantees all hold. First, we prove the label complexity bound. Note that since β 7, K 28, thus we have 2 log(2K) > 8, therefore there is always at least one round. Let no be the value of n in the last round that the algorithm runs, and let po = ˆpno. Let i such that no = 2i+1, thus the algorithm stops during round i + 1. This implies ˆpn β log(2n/δ)/n for n = 2i, therefore case (a) holds for n, which means n g log(2n/δ)/p. It follows that n 2g log(4g/δ)/p, therefore no 4g log(4g/(δp))/p. In addition, the number of random draws in the algorithm is n0, which is bounded by n0 2 log2(β log(2K/δ)/θ) 2 2log2(β log(2K/δ)/θ) 2β log(2K/δ)/θ. Therefore we have the following bound on the number of random draws: no min 2β log(2K/δ) θ , 4g log(4g/(δp)) Plugging in the definition of K and substituting β f(β) for g yields δθ )) θ , 4f(β) log( 4βf(β) δθ ) θ , 4f(β) log(4βf(β) δ ) + log(1 δ ) + log(f(β) Using the definition of ψ, we get that the number of draws is at most 4β log( 8β δψ ) ψ . Next, we prove the accuracy of po (item 1 in the theorem statement) by considering two cases. (I) If po > β log(2no/δ)/no, then case (b) above holds for no, thus g po p 2 g/β . In addition, if po θ, the LHS implies p f(β)θ. Thus item 1 in the theorem statement holds in this case. KONTOROVICH, SABATO AND URNER (II) If po β log(2no/δ)/no, then Est Ber could not have ended by breaking out of the loop, thus it ran until the last round. Therefore no β log(2K/δ)/θ. In addition, case (a) holds for n0, therefore p g log(2no/δ) no gθ log(2n0/δ) β log(2K/δ) . (5) Now, for any possible value of no, no 2β log(2K/δ)/θ K. The first inequality follows from the bound on i in Est Ber, and the second inequality holds since, as defined in Est Ber, K 4β θδ ). Since n0 K, Eq. (5) implies that In addition, we have po β log(2no/δ)/no θ log(2n0/δ) log(2K/δ) θ. Therefore in this case, necessarily po θ and p f(β)θ, which satisfies item 1 in the theorem statement. In both cases item 1 holds, thus the theorem is proved. 
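For concreteness, here is a brief Python sketch of the adaptive estimation procedure Est Ber (Alg. 3): start with four draws, keep doubling the sample, and stop early once the empirical mean exceeds the data-dependent threshold β log(2n/δ)/n. The exact form of the budget constant K is only partially legible in the listing above, so the expression used below is our assumption, and the `sample` callback is a stand-in for drawing one fresh Bernoulli variable B_i.

```python
import math


def est_ber(sample, theta, beta=7.0, delta=0.1):
    """Sketch of Est_Ber (Alg. 3): adaptively estimate the mean of i.i.d.
    Bernoulli draws produced by `sample()` up to the threshold `theta`.

    The budget constant K below is our assumption; its exact form is only
    partially legible in the source listing."""
    assert 0 < theta <= 1 and beta >= 7
    draws = [sample() for _ in range(4)]                       # S <- {B_1, ..., B_4}
    K = (4 * beta / theta) * math.log(4 * beta / (delta * theta))
    last_round = math.ceil(math.log2(beta * math.log(2 * K / delta) / theta))
    for i in range(3, last_round + 1):
        n = 2 ** i
        draws += [sample() for _ in range(n - len(draws))]     # double the sample
        p_hat = sum(draws) / n
        if p_hat > beta * math.log(2 * n / delta) / n:         # stopping rule
            break
    return sum(draws) / len(draws)
```

In MARMANN, each such draw would be realized by checking whether the nearest-neighbor rule at the tested scale errs on a freshly sampled labeled example from Sin, as described next for Estimate Err.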
The procedure Estimate Err(t, θ, δ) is then implemented as follows: Call Est Ber(θ, 52, δ/(2m2)), where the random variables Bi are independent copies of the Bernoulli variable B := I[hnn ˆSa(t)(X) = Y ] and (X, Y ) Sin. To draw a single Bi, sample a random pair (x , y ) from Sin, set i := κ(x , Net(Uin, t/2)), and get S Generate NNSet(t, {i}, δ). This returns S = ((xi, ˆyi)) where ˆyi is the label of xi in ˆSa(t). Then Bi := I[ˆyi = y ]. Note that Bi is indeed distributed like B, and E[B] = ϵ(t). Note further that this call to Generate NNSet(t, {i}, δ) uses Q(m) label queries. Therefore the overall label complexity of a single draw of a Bi is Q(m) + 1. Cor. 11 gives a guarantee for the accuracy and label complexity of Estimate Err. The proof is immediate from Theorem 10, by setting β = 52, which implies f(β) 5/4. Corollary 11 Let t, θ > 0 and δ (0, 1), and let ˆϵ(t) Estimate Err(t, θ, δ). Let Q(m) as defined in Eq. (3) The following properties hold with a probability of 1 δ 2m2 over the randomness of Estimate Err (and conditioned on ˆSa(t)). ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES 1. If ˆϵ(t) θ, then ϵ(t) 5θ/4. Otherwise, 5 ˆϵ(t) 4ϵ(t) 2. Let ψ := max(θ, ϵ(t)). The number of labels that Estimate Err requests is at most 260(Q(m) + 1) log(1040m2 To derive item 2. above from Theorem 10, note that for β = 52, ψ = max(θ, ϵ(t)) f(β) max(θ, ϵ(t)/f(β)) = f(β)ψ 5 where ψ is as defined in Theorem 10. Below we denote the event that the two properties in Cor. 11 hold for t by V (t). 5.2 Selecting a Scale The model selection procedure Select Scale, given in Alg. 4, implements its search based on the guarantees in Cor. 11. First, we introduce some notation. We would like MARMANN to obtain a generalization guarantee that is competitive with Gmin(m, δ). Denote φ(t) := (N(t) + 1) log(m) + log(1 G(ϵ, t) := ϵ + 2 Note that for all ϵ, t, GB(ϵ, N(t), δ, m, 1) = m m N(t)G(ϵ, t). When referring to G(ν(t), t), G(ϵ(t), t), or G(ˆϵ(t), t) we omit the second t for brevity. Instead of directly optimizing G(ν(t)), we will select a scale based on our estimate G(ˆϵ(t)) of G(ϵ(t)). Let Dist denote the set of pairwise distances in the unlabeled dataset Uin (note that |Dist| < m 2 ). We remove from Dist some distances, so that the remaining distances have a net size N(t) that is monotone non-increasing in t. We also remove values with a very large net size. Concretely, define Distmon := Dist \ {t | N(t) + 1 > m/2} \ {t | t Dist, t < t and N(t ) < N(t)}. Then for all t, t Distmon such that t < t, we have N(t ) N(t). The output of Select Scale is always a value in Distmon. The following lemma shows that it suffices to consider these scales. Lemma 12 Assume m 6 and let t m argmint Dist G(ν(t)). If Gmin(m, δ) 1/3 then t m Distmon. KONTOROVICH, SABATO AND URNER Algorithm 4 Select Scale(δ) input δ (0, 1) output Scale ˆt 1: T Distmon, # T maintains the current set of possible scales 2: while T = do 3: t the median value in T # break ties arbitrarily 4: ˆϵ(t) Estimate Err(t, φ(t), δ). 5: if ˆϵ(t) < φ(t) then 6: T T \ [0, t] # go right in the binary search 7: else if ˆϵ(t) > 11 10φ(t) then 8: T T \ [t, ) # go left in the binary search 10: t0 t, T0 {t0}. 11: break from loop 13: end while 14: if T0 was not set yet then 15: If the algorithm ever went to the right, let t0 be the last value for which this happened, and let T0 := {t0}. Otherwise, T0 := . 
17: Let TL be the set of all t that were tested and made the search go left 18: Output ˆt := argmint TL T0 G(ˆϵ(t)) Proof Assume by way of contradiction that t m Dist \ Distmon. First, since G(ν(t m)) Gmin(m, δ) 1/3 we have N(t m) + 1 m N(t m) log(m) 1 Therefore, since m 6, it is easy to verify N(t m) + 1 m/2. Therefore, by definition of Distmon there exists a t t m with φ(t) < φ(t m). Since ν(t) is monotone over all of t Dist, we also have ν(t) ν(t m). Now, φ(t) < φ(t m) and ν(t) ν(t m) together imply that G(ν(t)) < G(ν(t m)), a contradiction. Hence, t m Distmon. Select Scale follows a search procedure similar to binary search, however the conditions for going right and for going left are not exhaustive, thus it is possible that neither condition holds. The search ends either when neither conditions hold, or when no additional scale should be tested. The final output of the algorithm is based on minimizing G(ˆϵ(t)) over some of the values tested during search. For c > 0, define γ(c) := 1 + 2 2c and γ(c) := 1 For all t, ϵ > 0 we have the implications ϵ cφ(t) γ(c)ϵ G(ϵ, t) and φ(t) cϵ γ(c)φ(t) G(ϵ, t). (7) ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES The following lemma uses Eq. (7) to show that the estimate G(ˆϵ(t)) is close to the true G(ϵ(t)). Lemma 13 Let t > 0, δ (0, 1), and suppose that Select Scale calls ˆϵ(t) Estimate Err(t, φ(t), δ). Suppose that V (t) as defined in Cor. 11 holds. Then 1 6G(ˆϵ(t)) G(ϵ(t)) 6.5G(ˆϵ(t)). Proof Under V (t), we have that if ˆϵ(t) < φ(t) then ϵ(t) 5 4φ(t). In this case, G(ϵ(t)) γ(4/5)φ(t) 4.3φ(t), by Eq. (7). Therefore G(ϵ(t)) 3 4.3 2 G(ˆϵ(t)). In addition, G(ϵ(t)) 2 3φ(t) (from the definition of G), and by Eq. (7) and γ(1) 4, Therefore G(ϵ(t)) 1 6G(ˆϵ(t)). On the other hand, if ˆϵ(t) φ(t), then by Cor. 11 4 5ϵ(t) ˆϵ(t) 4 Therefore G(ˆϵ(t)) 4 3G(ϵ(t)) and G(ϵ(t)) 5 4G(ˆϵ(t)). Taking the worst-case of both possibilities, we get the bounds in the lemma. The next theorem bounds the label complexity of Select Scale. Let Ttest Distmon be the set of scales that are tested during Select Scale (that is, their ˆϵ(t) was estimated). Theorem 14 Suppose that the event V (t) defined in Cor. 11 holds for all t Ttest for the calls ˆϵ(t) Estimate Err(t, φ(t), δ). If the output of Select Scale is ˆt, then the number of labels requested by Select Scale is at most 9620|Ttest|(Q(m) + 1) 1 G(ϵ(ˆt)) log(38480m2 δG(ϵ(ˆt)) ). Proof The only labels used by the procedure are those used by calls to Estimate Err. Let ψt := max(φ(t), ϵ(t)), and ψmin := mint Ttest ψt. Denote also ˆψt := max(φ(t), ˆϵ(t)). From Cor. 11 we have that the total number of labels in all the calls to Estimate Err in Select Scale is at most 260(Q(m) + 1) log(1040m2 ψt |Ttest| 260(Q(m) + 1) log(1040m2 We now lower bound ψmin using G(ϵ(ˆt)). By Lemma 13 and the choice of ˆt, G(ϵ(ˆt)) 6.5G(ˆϵ(ˆt)) = 6.5 min t TL T0 G(ˆϵ(t)). KONTOROVICH, SABATO AND URNER From the definition of G, for any t > 0, G(ˆϵ(t)) γ(1) max(φ(t), ˆϵ(t)) 25 ˆψt. Therefore G(ϵ(ˆt)) 25 min t TL T0 ˆψt. (9) We will show a similar upper bound when minimizing over all of Ttest, not just over TL T0. This is trivial if Ttest = TL T0. Consider the case TL T0 Ttest. For any t Ttest, we have one of: The search went left on t (step 8), hence t TL. The search went nowhere on t and the loop broke (step 11), hence t = t0 T0. The search went right on t (step 6) and this was the last value for which this happened, hence t = t0 T0. The search went right on t (step 6) and this was not the last value for which this happened. Hence t Ttest \ (TL T0). 
Set some t1 Ttest \ (TL T0). Since the search went right on t1, then t0 also exists, since the algorithm did go to the right for some t (see step 15). Since the binary search went right on t1, we have ˆϵ(t1) φ(t1). Since the binary search did not go left on t0 (it either broke from the loop or went right), ˆϵ(t0) 11 10φ(t0). In addition, t0 t1 (since the search went right at t1, and t0 was tested later than t1), thus φ(t0) φ(t1) (since t0, t1 Distmon). Therefore, ˆψt0 = max(φ(t0), ˆϵ(t0)) 11 10φ(t1) = 11 10 max(φ(t1), ˆϵ(t1)) = ˆψt1. It follows that for any such t1, min t TL T0 ˆψt 11 Therefore min t TL T0 ˆψt 11 10 min t Ttest ˆψt. Therefore, by Eq. (9) G(ϵ(ˆt)) 27.5 min t Ttest ˆψt. By Cor. 11, ˆϵ(t) max(φ(t), 4ϵ(t)/3), therefore ˆψt 4 3ψt. Therefore G(ϵ(ˆt)) 37ψmin. Therefore, from Eq. (8), the total number of labels is at most 9620|Ttest|(Q(m) + 1) 1 G(ϵ(ˆt)) log(38480m2 δG(ϵ(ˆt)) ). The following theorem provides a competitive error guarantee for the selected scale ˆt. ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES Theorem 15 Suppose that V (t) and E(t), defined in Cor. 11 and Theorem 9, hold for all values t Ttest, and that Gmin(m, δ) 1/3. Then Select Scale outputs ˆt Distmon such that GB(ϵ(ˆt), N(ˆt), δ, m, 1) O(Gmin(m, δ)), Where the O( ) notation hides only universal multiplicative constants. The full proof of this theorem is given below. The idea of the proof is as follows: First, we show (using Lemma 13) that it suffices to prove that G(ν(t m)) O(G(ˆϵ(ˆt))) to derive the bound in the theorem. Now, Select Scale ends in one of two cases: either T0 is set within the loop, or T = and T0 is set outside the loop. In the first case, neither of the conditions for turning left and turning right holds for t0, so we have ˆϵ(t0) = Θ(φ(t0)) (where Θ hides numerical constants). We show that in this case, whether t m t0 or t m t0, G(ν(t m)) O(G(ˆϵ(t0))). In the second case, there exist (except for edge cases, which are also handled) two values t0 T0 and t1 TL such that t0 caused the binary search to go right, and t1 caused it to go left, and also t0 t1, and (t0, t1) Distmon = . We use these facts to show that for t m t1, G(ν(t m)) O(G(ˆϵ(t1))), and for t m t0, G(ν(t m)) O(G(ˆϵ(t0))). Since ˆt minimizes over a set that includes t0 and t1, this gives G(ν(t m)) O(G(ˆϵ(ˆt))) in all cases. Proof First, note that it suffices to show that there is a constant C, such that for the output ˆt of Select Scale, we have G(ϵ(ˆt)) CG(ν(t m)). This is because of the following argument: From Lemma 12 we have that if Gmin(m, δ) 1/3, then t m Distmon. Now Gmin(m, δ) = m m N(t m)G(ν(t m)) G(ν(t m)). And, if we have the guarantee on G(ϵ(ˆt)) and Gmin(m, δ) 1/3 we will have GB(ϵ(ˆt), N(ˆt), δ, m, 1) = m m N(ˆt)G(ϵ(ˆt)) 2G(ϵ(ˆt)) 2CG(ν(t m)) 2CGmin(m, δ). (10) We now prove the existence of such a guarantee and set C. Denote the two conditions checked in Select Scale during the binary search by Condition 1: ˆϵ(t) < φ(t) and Condition 2: ˆϵ(t) > 11 10φ(t). The procedure ends in one of two ways: either T0 is set within the loop (Case 1), or T = and T0 is set outside the loop (Case 2). We analyze each case separately. In Case 1, none of the conditions 1 and 2 hold for t0. Therefore φ(t0) ˆϵ(t0) 11 Therefore, by Eq. (7), φ(t0) G(ˆϵ(t0))/ γ(10 By Cor. 11, since ˆϵ(t0) > φ(t0), 4ˆϵ(t0) ϵ(t0) 5 Suppose t m t0, then G(ν(t m)) ν(t m) ν(t0) 1 KONTOROVICH, SABATO AND URNER here we used ϵ(t0) 4ν(t0) by Theorem 9. Therefore, from Eq. (7) and Lemma 13, G(ν(t m)) 3 16φ(t0) 3 16 γ 40 55 G(ϵ(t0)) 1 2 16 γ(40 55)G(ˆϵ(t0)). 
Now, suppose t m < t0, then G(ν(t m)) 2 3φ(t0) 2 3 γ(10 11)G(ˆϵ(t0)). In this inequality we used the fact that t m, t0 Distmon, hence φ(t m) φ(t0). Combining the two possibilities for t m, we have in Case 1, G(ˆϵ(t0)) max(32 γ(40 55), 3 γ(10 11) 2 )G(ν(t m)). Since ˆt minimizes G(ˆϵ(t)) on a set that includes t0, we have, using Lemma 13 G(ϵ(ˆt)) 6.5G(ˆϵ(ˆt)) 6.5G(ˆϵ(t0)). Therefore, in Case 1, G(ϵ(ˆt)) 6.5 max(32 γ(40 55), 3 γ(10 11) 2 )G(ν(t m)). (11) In Case 2, the binary search halted without satisfying Condition 1 nor Condition 2 and with T = . Let t0 be as defined in this case in Select Scale (if it exists), and let t1 be the smallest value in TL (if it exists). At least one of these values must exist. If both values exist, we have t0 t1 and (t0, t1) Distmon = . If t0 exists, it is the last value for which the search went right. We thus have ˆϵ(t0) < φ(t0). If t m t0, from condition 1 on t0 and Eq. (7) with γ(1) 4, G(ν(t m)) 2 6G(ˆϵ(t0)). Here we used the monotonicity of φ on t m, t0 Distmon, and Eq. (7) applied to condition 1 for t0. If t1 exists, the search went left on t1, thus ˆϵ(t1) > 11 10φ(t1). By Cor. 11, it follows that ˆϵ(t1) 4 3ϵ(t1). Therefore, if t m t1, G(ν(t m)) ν(t m) ν(t1) 1 16ˆϵ(t1) 3 16γ(11/10)G(ˆϵ(t1)). Here we used ϵ(t1) 4ν(t1) by Theorem 9 and Eq. (7). Combining the two cases for t m, we get that if t0 exists and t m t0, or t1 exists and t m t1, G(ν(t m)) min(1 6, 3 16γ(11/10)) min t TE G(ˆϵ(t)). where we define TE = {t {t0, t1} | t exists}. We now show that this covers all possible values for t m: If both t0, t1 exist, then since (t0, t1) Distmon = , it is impossible to have t m (t0, t1). ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES If only t0 exists, then the search never went left, which means t0 = max(Distmon), thus t m t0. If only t1 exists, then the search never went right, which means t1 = min(Distmon), thus t m t1. Since ˆt minimizes G(ˆϵ(t)) on a set that has TE as a subset, we have, using Lemma 13 G(ϵ(ˆt)) 6.5G(ˆϵ(ˆt)) 6.5 mint TE G(ˆϵ(t)). Therefore in Case 2, G(ν(t m)) 1 6, 3 16γ(11/10) G(ϵ(ˆt). (12) From Eq. (11) and Eq. (12) we get that in both cases G(ν(t m)) 1 1 6, 3 16γ(11/10), 2 3 γ(10/11), 1 32 γ(40 G(ϵ(ˆt)) G(ϵ(ˆt))/865. Combining this with Eq. (10) we get the statement of the theorem. 6. Bounding the Label Complexity of MARMANN We are now almost ready to prove Theorem 4. Our last missing piece is quantifying the effect of side information on the generalization of sample compression schemes in Section 6.1. We then prove Theorem 4 in Section 6.2. 6.1 Sample Compression with Side Information It appears that compression-based generalization bounds were independently discovered by Littlestone and Warmuth (1986) and Devroye et al. (1996); some background is given in Floyd and Warmuth (1995). As noted in Section 4, our algorithm relies on a generalized sample compression scheme, which requires side information. This side information is used to represent the labels of the sample points in the compression set. A similar idea appears in Floyd and Warmuth (1995) for hypotheses with short description length. Here we provide a generalization that is useful for the analysis of MARMANN. Let Σ be a finite alphabet, and define a mapping Rec N : (X Y)N ΣN YX .5 This is a reconstruction function mapping a labeled sequence of size N with side information T ΣN to a classifier. For I [|S|], denote by S[I] the subsequence of S indexed by I. 
For a labeled sample S, define the set of possible hypotheses reconstructed from a compression of S of size N with side information in Σ:

H_N(S) := {h : X → Y | h = Rec_N(S[I], T), I ∈ [m]^N, T ∈ Σ^N}.

The following result closely follows the sample compression arguments in Graepel et al. (2005, Theorem 2) and Gottlieb et al. (2017, Theorem 6), but incorporates side information.

Theorem 16 Let m be an integer and δ ∈ (0, 1). Let S ∼ D^m. With probability at least 1 − δ, if there exist N < m and h ∈ H_N(S) with ˆϵ := err(h, S) ≤ 1/2, then err(h, D) ≤ GB(ˆϵ, N, δ, m, |Σ|).

Proof We recall a result of Dasgupta and Hsu (2008, Lemma 1): if ˆp ∼ Bin(n, p)/n and δ > 0, then the following holds with probability at least 1 − δ:

p ≤ ˆp + (2/(3n))·log(1/δ) + √((9ˆp/(2n))·log(1/δ)). (13)

Now fix N < m, and suppose that h ∈ H_N(S) has ˆϵ ≤ 1/2. Let I ∈ [m]^N and T ∈ Σ^N be such that h = Rec_N(S[I], T). We have err(h, S[[m] \ I]) ≤ ˆϵ·m/(m − N) = θˆϵ, where θ := m/(m − N). Substituting into (13) p := err(h, D), n := m − N and ˆp := err(h, S[[m] \ I]) ≤ θˆϵ yields that, for a fixed S[I] and a random S[[m] \ I] ∼ D^{m−N}, with probability at least 1 − δ,

err(h, D) ≤ θˆϵ + (2/(3(m − N)))·log(1/δ) + √((9θˆϵ/(2(m − N)))·log(1/δ)). (14)

To make (14) hold simultaneously for all (I, T) ∈ [m]^N × Σ^N, divide δ by (m|Σ|)^N. To make the claim hold for all N ∈ [m], stratify (as in Graepel et al. (2005, Lemma 1)) over the (fewer than) m possible choices of N, which amounts to dividing δ by an additional factor of m.

For MARMANN, we use the following sample compression scheme with Σ = Y. Given a subsequence S′ := S[I] := (x′_1, . . . , x′_N) and T = (t_1, . . . , t_N) ∈ Y^N, the reconstruction function Rec_N(S[I], T) generates the nearest-neighbor rule induced by the labeled sample ψ(S′, T) := ((x′_i, t_i))_{i∈[N]}. Formally, Rec_N(S′, T) = hnn_{ψ(S′,T)}. Note the slight abuse of notation: formally, the y_i in S_a(t) should be encoded as side information T, but for clarity, we have opted to relabel the examples {x_1, . . . , x_N} as dictated by the majority in each region. The following corollary is immediate from Theorem 16 and the construction above.

Theorem 17 Let m ≥ |Y| be an integer and δ ∈ (0, 1/4). Let Sin ∼ D^m. With probability at least 1 − δ, if there exist N < m and S′ ∈ (X × Y)^N such that U(S′) ⊆ Uin and ϵ := err(hnn_{S′}, Sin) ≤ 1/2, then err(hnn_{S′}, D) ≤ GB(ϵ, N, δ, m, |Y|) ≤ 2·GB(ϵ, N, 2δ, m, 1).

If the compression set includes only the original labels, the compression analysis of Gottlieb et al. (2017) gives the bound GB(ϵ, N, δ, m, 1). Thus the effect of allowing the labels to change is only logarithmic in |Y|, and does not appreciably degrade the prediction error.

6.2 Proof of Theorem 4

The proof of the main theorem, Theorem 4, which gives the guarantee for MARMANN, is almost immediate from Theorem 17, Theorem 9, Theorem 15 and Theorem 14.

Proof [of Theorem 4] We have |Distmon| ≤ m(m − 1)/2. By a union bound, the events E(t) and V(t) of Theorem 9 and Cor. 11 hold for all t ∈ Ttest ⊆ Distmon with probability at least 1 − δ/2. Under these events, we have by Theorem 15 that if Gmin(m, δ) ≤ 1/3, then GB(ϵ(ˆt), N(ˆt), δ, m, 1) ≤ O(min_t GB(ν(t), N(t), δ, m, 1)). By Theorem 17, with probability at least 1 − δ/2, if ϵ(ˆt) ≤ 1/2, then err(ˆh, D) ≤ 2·GB(ϵ(ˆt), N(ˆt), δ, m, 1). The statement of the theorem follows. Note that the statement trivially holds for Gmin(m, δ) > 1/3 and for ϵ(ˆt) > 1/2, thus these conditions can be removed.
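Before turning to the label complexity, the compression scheme of Section 6.1 can be made concrete with a small sketch. This is our illustration, not the paper's implementation; the function names, types, and the toy metric are ours. It only shows the reconstruction step Rec_N: given the compression points and their side-information labels, return the induced 1-nearest-neighbor rule. Producing the compression points and their majority labels is the job of the algorithm itself.

```python
# Sketch (illustrative, not the paper's code): Rec_N for the MARMANN scheme.
# The compression set is a sequence of points x'_1,...,x'_N; the side information
# T in Y^N supplies their (possibly relabeled) labels; the output is the 1-NN rule.
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")

def rec_nn(compressed_points: Sequence[X],
           side_info_labels: Sequence[Y],
           metric: Callable[[X, X], float]) -> Callable[[X], Y]:
    """Reconstruction function Rec_N(S', T): the nearest-neighbor rule
    induced by the labeled sample ((x'_i, t_i))_{i in [N]}."""
    assert len(compressed_points) == len(side_info_labels)

    def h_nn(x: X) -> Y:
        # Predict with the label attached to the nearest compression point.
        best = min(range(len(compressed_points)),
                   key=lambda i: metric(x, compressed_points[i]))
        return side_info_labels[best]

    return h_nn

if __name__ == "__main__":
    # Toy usage: three centers on the real line, relabeled by hypothetical
    # in-cell majorities rather than by their original labels.
    centers = [0.0, 1.0, 2.5]
    majority_labels = [0, 1, 1]          # side information T
    h = rec_nn(centers, majority_labels, metric=lambda a, b: abs(a - b))
    print([h(x) for x in (0.2, 0.9, 3.0)])   # -> [0, 1, 1]
```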
To bound the label complexity, note that the total number of labels used by MARMANN is at most the number of labels used by Select Scale plus the number of labels used by Generate NNSet when the final compression set is generated. ACTIVE NEAREST-NEIGHBOR LEARNING IN METRIC SPACES By Theorem 14, since Q(m) = O(log(m/δ)), the number of labels used by Select Scale is at most O |Ttest|log2(m/δ) G(ϵ(ˆt)) log 1 G(ϵ(ˆt) In addition, G(ϵ(ˆt)) GB(ϵ(ˆt), N(ˆt), δ, m, 1) = ˆG. The number of tested scales in Select Scale is bounded by |Ttest| log2(|Distmon|) + 1 2 log2(m) Therefore the number of labels used by Select Scale is O log3(m/δ) The number of labels used by Generate NNSet is at most Q(m)N(ˆt), where Q(m) O(log(m/δ), and from the definition of ˆG, N(ˆt) O(m ˆG/ log(m)). Summing up the number of labels used by Select Scale and the number used by Generate NNSet, this gives the bound in the statement of the theorem. 7. Passive Learning Lower Bounds Theorem 5 lower bounds the performance of a passive learner who observes a limited number ℓof random labels from Sin. The number ℓis chosen so that it is of the same order as the number of labels MARMANN observes for the case analyzed in Section 3. We deduce Theorem 5 from a more general result pertaining to the sample complexity of passive learning. The general result is given as Theorem 18 in Section 7.1. The proof of Theorem 5 is provided in Section 7.2. We note that while the lower bounds below assume that the passive learner observes only the random labeled sample of size ℓ, in fact their proofs hold also if the algorithm has access to the full unlabeled sample of size m of which Sℓis sampled. This is because the lower bound is based on requiring the learner to distinguish between distributions that all have the same marginal. Under this scenario, access to unlabeled examples does not provide any additional information to the learner. 7.1 A General Lower Bound In this section we show a general sample complexity lower bound for passive learning, which may be of independent interest. We are aware of two existing lower bounds for agnostic PAC with bounded Bayes error: Devroye et al. (1996, Theorem 14.5) and Audibert (2009, Theorem 8.8). Both place restrictions on the relationship between the sample size, VC-dimension, and Bayes error level, which render them inapplicable as stated to some parameter regimes, including the one needed for proving Theorem 5. Let H be a hypothesis class with VC-dimension d and suppose that L is a passive learner6 mapping labeled samples Sℓ= (Xi, Yi)i [ℓ] to hypotheses ˆhℓ H. For any distribution D over 6. We allow L access to an independent internal source of randomness. KONTOROVICH, SABATO AND URNER X { 1, 1}, define the excess risk of ˆhℓby (ˆhℓ, D) := err(ˆhℓ, D) inf h H err(h, D). Let D(η) be the collection of all η-bounded agnostic error distributions D over X { 1, 1} that satisfy infh H err(h, D) η. We say that Z { 1, 1} has Rademacher distribution with parameter b [ 1, 1], denoted Z Rb, if P[Z = 1] = 1 P[Z = 1] = 1 All distributions on { 1, 1} are of this form. For k N and b [0, 1], define the function bayes(k, b) = 1 2 Rk b Rk b 1 where Rk b is the corresponding product distribution on { 1, 1}k and 1 2 1 is the total variation norm. This expression previously appeared in Berend and Kontorovich (2015, Equation (25)) in the context of information-theoretic lower bounds; the current terminology was motivated in Kontorovich and Pinelis (2016), where various precise estimates on bayes( ) were provided. 
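Since bayes(k, b) is defined through the total variation distance between two product measures, it can be evaluated exactly for small k. The following minimal sketch (ours; the helper name bayes_kb is illustrative) computes it directly from the definition above, grouping outcomes by their number of +1 coordinates, since both product measures are exchangeable.

```python
# Sketch: compute bayes(k, b) = (1/2)(1 - TV(R_b^k, R_{-b}^k)) from the definition above.
from math import comb

def bayes_kb(k: int, b: float) -> float:
    p, q = (1 + b) / 2, (1 - b) / 2      # P[Z = 1] under R_b and under R_{-b}
    # L1 distance, summed over the number j of +1 coordinates.
    l1 = sum(comb(k, j) * abs(p**j * q**(k - j) - q**j * p**(k - j))
             for j in range(k + 1))
    return 0.5 * (1 - 0.5 * l1)

if __name__ == "__main__":
    # bayes(0, b) = 1/2 (no observations); the value decreases as k grows.
    print([round(bayes_kb(k, 0.3), 4) for k in (0, 1, 2, 3, 5)])
```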
In particular, Kontorovich and Pinelis (2016) define the function bǎyes(κ, b) as follows: for each fixed b ∈ [0, 1], bǎyes(·, b) is the largest convex minorant on [0, ∞) of the function bayes(·, b) on {0, 1, . . .}. It was shown in Kontorovich and Pinelis (2016, Proposition 2.8) that bǎyes(·, b) is the linear interpolation of bayes(·, b) at the points 0, 1, 3, 5, . . . .

Theorem 18 Let 0 < η < 1/2, ℓ ≥ 1, and let H be a hypothesis class with VC-dimension d > 1. Then, for all 0 < b, p < 1 satisfying

p(1 − b)/2 ≤ η, (15)

we have

inf_{ˆhℓ} sup_{D∈D(η)} E_{D^ℓ}[∆(ˆhℓ, D)] ≥ bp·bǎyes(ℓp/(d − 1), b). (16)

Furthermore, for 0 ≤ u < 1,

inf_{ˆhℓ} sup_{D∈D(η)} P[∆(ˆhℓ, Dσ,b,p) > bpu] > bǎyes(ℓp/(d − 1), b) − u. (17)

Proof This proof uses ideas from Devroye et al. (1996, Theorem 14.5), Anthony and Bartlett (1999, Theorem 5.2) and Kontorovich and Pinelis (2016, Theorem 2.2). We will construct adversarial distributions supported on a shattered subset of size d, and hence there is no loss of generality in taking X = [d] and H = {−1, 1}^X. A random distribution Dσ,b,p over X × {−1, 1}, parametrized by a random σ ∼ Unif({−1, 1}^{d−1}) and scalars b, p ∈ (0, 1) to be specified later, is defined as follows. The point x = d ∈ X gets a marginal weight of 1 − p, where p is a parameter to be set; the remaining d − 1 points each get a marginal weight of p/(d − 1):

P_{X∼Dσ,b,p}[X = d] = 1 − p,  and  P_{X∼Dσ,b,p}[X = j] = p/(d − 1) for each j < d. (18)

The distribution of Y conditional on X is given by

P_{(X,Y)∼Dσ,b,p}[Y = 1 | X = d] = 1  and  P_{(X,Y)∼Dσ,b,p}[Y = σ(j) | X = j] = (1 + b)/2 for j < d. (19)

Suppose that (Xi, Yi)_{i∈[ℓ]} is a sample drawn from D^ℓ_{σ,b,p}. The assumption that Dσ,b,p ∈ D(η) implies that b and p must satisfy the constraint (15). A standard argument (e.g., Anthony and Bartlett (1999), p. 63, display (5.5)) shows that, for any hypothesis ˆhℓ,

∆(ˆhℓ, Dσ,b,p) = err(ˆhℓ, Dσ,b,p) − inf_{h∈H} err(h, Dσ,b,p) = P_{X∼Dσ,b,p}[X = d, ˆhℓ(X) = −1] + b·P_{X∼Dσ,b,p}[X < d, ˆhℓ(X) ≠ σ(X)] ≥ b·P_{X∼Dσ,b,p}[X < d, ˆhℓ(X) ≠ σ(X)] = bp·P_{X∼Dσ,b,p}[ˆhℓ(X) ≠ σ(X) | X < d]. (20)

It follows from Kontorovich and Pinelis (2016, Theorems 2.1, 2.5) that

E_σ P_{X∼Dσ,b,p}[ˆhℓ(X) ≠ σ(X) | X < d] ≥ E_{N∼Bin(ℓ, p/(d−1))}[bayes(N, b)] ≥ E_{N∼Bin(ℓ, p/(d−1))}[bǎyes(N, b)] ≥ bǎyes(E[N], b) = bǎyes(ℓp/(d − 1), b), (21)

where the second inequality holds because bǎyes is, by definition, a convex minorant of bayes, and the third follows from Jensen's inequality. Combined with (20), this proves (16). To show (17), define the random variable Z = Z(σ, L) = P_{X∼Dσ,b,p}[ˆhℓ(X) ≠ σ(X) | X < d]. Since Z ∈ [0, 1], Markov's inequality implies

P[Z > u] ≥ (E[Z] − u)/(1 − u) > E[Z] − u,  for 0 ≤ u < 1.

Now (20) implies that ∆(ˆhℓ, Dσ,b,p) ≥ bp·Z, and hence, for 0 ≤ u < 1,

inf_{ˆhℓ} sup_{D∈D(η)} P[∆(ˆhℓ, Dσ,b,p) > bpu] = inf_{ˆhℓ} sup_{D∈D(η)} P[Z > u] > inf_{ˆhℓ} sup_{D∈D(η)} E[Z] − u ≥ (1/(bp))·inf_{ˆhℓ} sup_{D∈D(η)} E[∆(ˆhℓ, Dσ,b,p)] − u ≥ bǎyes(ℓp/(d − 1), b) − u.

7.2 Proof of Theorem 5

We break down the proof into several steps. For now, we assume that the labeled examples are sampled i.i.d. as per the classic PAC setup. At the end, we show how to extend the proof to the semi-supervised setting.

(i) Defining a family of adversarial distributions. Let T be a t-net of X of size Θ(√m) and let η = Θ(1/√m). For any passive learning algorithm mapping i.i.d. samples of size ℓ = Θ(√m) to hypotheses ˆhℓ : X → {−1, 1}, we construct a random adversarial distribution Dσ,b,p with agnostic error η. We accomplish this via the construction described in the proof of Theorem 18, with |T| = d = Θ(√m). The marginal distribution over T = {x1, . . . , xd} puts a mass of 1 − p on xd ∈ T and spreads the remaining mass uniformly over the other points, as in (18).
The heavy point has a deterministic label, and the remaining light points have noisy labels drawn from a random distribution with symmetric noise level b, as in (19). We proceed to choose b and p; namely, p = Θ(m^{−1/4}) and b = 1 − 2η/p = 1 − Θ(m^{−1/4}), which makes the constraint in (15) hold with equality; this means that the agnostic error is exactly η and, in particular, establishes (i).

(ii.a) Lower-bounding the passive learner's error. Our choice of p implies that ℓp/(d − 1) = η/2 =: κ < 1. For this range of κ, Kontorovich and Pinelis (2016, Proposition 2.8) implies that bǎyes(κ, b) = (1/2)(1 − κb) = Θ(1). Choosing u = (1/4)(1 − κb) = Θ(1) in (17), Theorem 18 implies that inf_{ˆhℓ} sup_{D∈D(η)} P[∆(ˆhℓ, D) > Ω(m^{−1/4})] > Ω(1). In more formal terms, there exist constants c0, c1 > 0 such that

inf_{ˆhℓ} sup_{D∈D(η)} P[∆(ˆhℓ, D) > c0·p] > c1. (22)

(ii.b) Upper-bounding ν(˜t). To establish (ii.b), it suffices to show that for (Xi, Yi)_{i∈[ℓ]} ∼ D^ℓ_{σ,b,p}, we have ν(˜t) = O(m^{−1/2}) with sufficiently high probability; indeed, the latter condition implies the requisite upper bound. Since E[ν(˜t)] ≤ η (the expected fraction of noisy labels), Markov's inequality gives, for any c2 > 0, P[ν(˜t) > c2·E[ν(˜t)]] ≤ 1/c2. Choosing c2 sufficiently large that 1/c2 < c1 (the latter from (22)) implies the existence of constants c0, c2, c3 > 0 such that

inf_{ˆhℓ} sup_{D∈D(η)} P[∆(ˆhℓ, D) > c0·p and ν(˜t) ≤ c2·η] > c3.

Since p = Θ(m^{−1/4}) and η = Θ(m^{−1/2}), this establishes (ii) and concludes the proof of Theorem 5.

(iii) Extending to the semi-supervised setting. Providing the learner with the exact weights of the points of X = [d] under our adversarial distribution does not give it any additional power. Indeed, the information-theoretic excess risk lower bound in Eq. (21) hinges on the fact that to estimate σ(x) with some desired certainty, the point x must be sampled some minimal number of times. The marginal probability of x does not enter that calculation, and hence knowing its value does not afford the learner improved performance.

8. Active Learning Lower Bound

We now prove the active learning lower bound stated in Theorem 6. To prove the theorem, we first prove a result similar to the classical No-Free-Lunch theorem, except that it holds for active learning algorithms. The proof closely follows the proof of the classical No-Free-Lunch theorem given in Shalev-Shwartz and Ben-David (2014, Theorem 5.1), with suitable modifications.

Theorem 19 Let β ∈ [0, 1/2) and let m be an integer. Let A be any active learning algorithm over a finite domain X which gets as input a random labeled sample S ∼ D^m (with hidden labels) and outputs ˆh. If A queries fewer than |X|/2 labels from S, then there exists a distribution D over X × {0, 1} such that its marginal on X is uniform, for each x ∈ X, P[Y = 1 | X = x] ∈ {β, 1 − β}, and E[err(ˆh, D)] ≥ 1/4.

Proof Let F = {f1, . . . , fT} be the set of possible functions fi : X → {0, 1}. Let Di be the distribution with a uniform marginal over X and P_{(X,Y)∼Di}[Y = 1 | X = x] = fi(x)(1 − β) + (1 − fi(x))β. Consider the following random process. First, draw an unlabeled sample X = (x1, . . . , xm) i.i.d. from D^m_X. Then, draw B = (b1, . . . , bm) independently from a Bernoulli distribution with P[bj = 1] = β. For i ∈ [T], let Si(X, B) = ((x1, y1), . . . , (xm, ym)), where the xj are set by X, and yj = fi(xj) if bj = 0 and yj = 1 − fi(xj) otherwise. Clearly, Si(X, B) is distributed according to D^m_i. Let hi(S) be the output of A when the labeled sample is S.
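The coupling just described, in which a single draw of (X, B) simultaneously induces the labeled samples Si(X, B) for every target fi, is the key device of the argument below. The following minimal sketch (ours; the domain, the two targets, and the sample size are illustrative) makes the coupling explicit.

```python
# Sketch of the coupling in the proof of Theorem 19 (illustrative, not part of the
# paper's algorithms): one draw of (X, B) yields, for every target function f,
# a labeled sample S_f(X, B) distributed according to the corresponding D_f^m.
import random
from typing import Callable, List, Tuple

def draw_coupling(domain: List[int], m: int, beta: float,
                  rng: random.Random) -> Tuple[List[int], List[int]]:
    """Shared randomness: points X ~ Uniform(domain)^m, noise bits B ~ Bernoulli(beta)^m."""
    X = [rng.choice(domain) for _ in range(m)]
    B = [1 if rng.random() < beta else 0 for _ in range(m)]
    return X, B

def labeled_sample(f: Callable[[int], int], X: List[int],
                   B: List[int]) -> List[Tuple[int, int]]:
    """S_f(X, B): label each x_j by f(x_j), flipped when the noise bit is 1."""
    return [(x, f(x) if b == 0 else 1 - f(x)) for x, b in zip(X, B)]

if __name__ == "__main__":
    rng = random.Random(0)
    domain = list(range(8))
    X, B = draw_coupling(domain, m=20, beta=0.1, rng=rng)
    f1 = lambda x: x % 2
    f2 = lambda x: x % 2 if x < 4 else 1 - (x % 2)   # agrees with f1 only on {0,...,3}
    S1, S2 = labeled_sample(f1, X, B), labeled_sample(f2, X, B)
    # If the learner only ever queries labels of points in {0,...,3}, its view is
    # identical under S1 and S2; pairing such targets is what drives the lower bound.
    print(S1[:5], S2[:5])
```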
Denote by ˆhi the (random) output of A when the sample is drawn from Di. Clearly E[err(ˆhi, Di)] = E_{X,B}[err(hi(Si(X, B)), Di)]. Therefore (as in (5.4) in Shalev-Shwartz and Ben-David (2014)), for some j and some fixed X, B it holds that

E[err(ˆhj, Dj)] ≥ (1/T)·Σ_{i=1}^T E[err(ˆhi, Di)] ≥ (1/T)·Σ_{i=1}^T err(hi(Si(X, B)), Di). (23)

Fix X, B, j as above, and denote for brevity hi := hi(Si(X, B)). Let Vi be the set of examples x ∈ X whose label A does not observe when the labeled sample is Si(X, B) (this includes both examples that do not appear in the sample at all and examples that appear in the sample but whose label is not requested by A). We have |Vi| > |X|/2 by assumption. Then (as in Eq. (5.6) therein)

(1/T)·Σ_{i=1}^T err(hi, Di) ≥ (1/T)·Σ_{i=1}^T (1/(2|Vi|))·Σ_{x∈Vi} I[hi(x) ≠ fi(x)]. (24)

Since A is active, it selects which labels to request, and this selection can depend on the labels observed by A so far. Therefore, Vi can be different for different i. However, an argument similar to that of the No-Free-Lunch theorem for the passive case still goes through, as follows. Let i, i′ be such that fi(x) = fi′(x) for all x ∉ Vi, and fi(x) = 1 − fi′(x) for all x ∈ Vi. Since X, B are fixed, A observes the same labels for all x ∉ Vi under both Si′(X, B) and Si(X, B); thus all its decisions and requests are identical for both samples, and so Vi = Vi′ and hi = hi′. Therefore, it is possible to partition [T] into T/2 pairs of indices i, i′ such that for each such pair,

(1/(2|Vi|))·Σ_{x∈Vi} I[hi(x) ≠ fi(x)] + (1/(2|Vi′|))·Σ_{x∈Vi′} I[hi′(x) ≠ fi′(x)] = (1/(2|Vi|))·Σ_{x∈Vi} (I[hi(x) ≠ fi(x)] + I[hi(x) ≠ 1 − fi(x)]) = 1/2.

Therefore, (1/T)·Σ_{i=1}^T (1/(2|Vi|))·Σ_{x∈Vi} I[hi(x) ≠ fi(x)] = 1/4. Therefore, from Eq. (24), (1/T)·Σ_{i=1}^T err(hi, Di) ≥ 1/4. Combining this with Eq. (23), it follows that E[err(ˆhj, Dj)] ≥ 1/4.

We will also make use of the following simple lemma.

Lemma 20 Let β ∈ [0, 1]. Let D be a distribution over X × {0, 1} such that for (X, Y) ∼ D and any x in the support of D, P[Y = 1 | X = x] ∈ {β, 1 − β}. Let N be the size of the support of D, and let S ∼ D^m. Denote by n_x the number of sample pairs (x′, y′) in S with x′ = x, and by n⁺_x the number of sample pairs (x′, y′) with x′ = x and y′ = 1. Let ˆp⁺_x = n⁺_x/n_x (or zero if n_x = 0). Then

2β(1 − β)(m − N) ≤ Σ_{x∈X} E[2·n_x·ˆp⁺_x·(1 − ˆp⁺_x)] ≤ 2β(1 − β)·m.

Proof We have E[2·n_x·ˆp⁺_x·(1 − ˆp⁺_x)] = Σ_{i=1}^m P[n_x = i]·i·E[2·ˆp⁺_x·(1 − ˆp⁺_x) | n_x = i]. Note that E[2·ˆp⁺_x·(1 − ˆp⁺_x) | n_x = 1] = 0. For i > 1, let y1, . . . , yi be the labels of the examples equal to x in S; then Σ_{j,k∈[i]} I[yk ≠ yj] = 2·n⁺_x·(i − n⁺_x) = i²·2·ˆp⁺_x·(1 − ˆp⁺_x). Therefore, letting (X1, Y1), (X2, Y2) ∼ D²,

E_{S∼D^m}[2·ˆp⁺_x·(1 − ˆp⁺_x) | n_x = i] = (1/i²)·E_{S∼D^m}[Σ_{j,k∈[n_x]} I[yk ≠ yj] | n_x = i] = ((i² − i)/i²)·P[Y1 ≠ Y2 | X1 = X2 = x] = (1 − 1/i)·2β(1 − β).

Therefore

E[2·n_x·ˆp⁺_x·(1 − ˆp⁺_x)] = 2β(1 − β)·Σ_{i=2}^m (i − 1)·P[n_x = i] = 2β(1 − β)·(E[n_x] + P[n_x = 0] − 1).

To complete the proof, sum over all x in the support of D, and note that Σ_x E[n_x] = m and Σ_x (P[n_x = 0] − 1) ∈ [−N, 0].

We now prove our lower bound, stated in Theorem 6, on the number of queries required by any active learner with competitive guarantees similar to ours.

Proof [of Theorem 6] Let N = ⌊m·α·log(m/δ)/log(m)⌋ and let β = 8α < 1/2. Consider a marginal distribution DX over X which is uniform over N points 1, . . . , N ∈ R. Consider the following family of distributions: all D whose marginal over X is DX and such that, for each x ∈ X, P[Y = 1 | X = x] ∈ {β, 1 − β}. Thus the Bayes optimal error of each of these distributions is β. Let S ∼ D^m. If one example in S is changed, ν(1/2) changes by at most 1/m. Hence, by McDiarmid's inequality (McDiarmid, 1989), with probability at least 1 − 1/28, |ν(1/2) − E[ν(1/2)]| ≤ √((log(m) + log(28))/(2m)). Denote the event that this holds by EM.
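The quantity ν(1/2) manipulated below admits a simple empirical description: at scale t = 1/2 every distinct support point forms its own cell, so ν(1/2) is the fraction of sample labels that disagree with the per-point majority label. A minimal sketch of this computation (ours, under that reading of ν) follows.

```python
# Sketch (under the reading used in the next paragraph): nu(1/2) as the fraction of
# sample labels disagreeing with the majority label among copies of the same point.
from collections import Counter, defaultdict
from typing import Hashable, List, Tuple

def nu_half(sample: List[Tuple[Hashable, int]]) -> float:
    counts = defaultdict(Counter)            # point -> label counts
    for x, y in sample:
        counts[x][y] += 1
    m = len(sample)
    # sum over points of min(n_x^+, n_x - n_x^+), divided by m.
    return sum(min(c[1], sum(c.values()) - c[1]) for c in counts.values()) / m

if __name__ == "__main__":
    S = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (2, 1)]
    print(nu_half(S))   # one disagreement out of six labels -> about 0.167
```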
Since β = 8α and α ≥ √((log(m) + log(28))/(2m)), it follows that under EM,

|ν(1/2) − E[ν(1/2)]| ≤ β/8. (25)

We now bound E[ν(1/2)]. Using the notation ˆp⁺_x, n_x, n⁺_x of Lemma 20, we have

ν(1/2) = (1/m)·Σ_{x∈X} min(n⁺_x, n_x − n⁺_x) = (1/m)·Σ_{x∈X} n_x·min(ˆp⁺_x, 1 − ˆp⁺_x).

Also, for all p ∈ [0, 1], min(p, 1 − p) ≤ 2p(1 − p) ≤ 2·min(p, 1 − p). Therefore

(1/(2m))·Σ_{x∈X} 2·n_x·ˆp⁺_x·(1 − ˆp⁺_x) ≤ ν(1/2) ≤ (1/m)·Σ_{x∈X} 2·n_x·ˆp⁺_x·(1 − ˆp⁺_x).

By Lemma 20, it follows that β(1 − β)(m − N)/m ≤ E[ν(1/2)] ≤ 2β(1 − β). Since N ≤ m/2 and β ∈ [0, 1/2), this gives E[ν(1/2)] ∈ (β/4, 2β). Combining this with Eq. (25), we get that under EM,

α = β/8 ≤ ν(1/2) ≤ 17β/8 = 17α.

Now, we bound Gmin from above and below, assuming that EM holds. Denote G(t) := GB(ν(t), N(t), δ, m, 1). To establish a lower bound on Gmin(m, δ), note that Gmin(m, δ) = min_{t>0} G(t) ≥ min_{t>0} ν(t). For t ∈ (0, 1/2], ν(t) = ν(1/2) (since the distance between any two distinct points in S is at least 1). In addition, since ν is monotonically increasing, we have ν(t) ≥ ν(1/2) for t ≥ 1/2. Hence min_{t>0} ν(t) ≥ ν(1/2) ≥ β/8 = α.

To show an upper bound on Gmin(m, δ), we upper bound G(1/2). Note that N(1/2) ≤ N. Recall the definition of φ(t) in Eq. (6). We have φ(1/2) ≤ ((N + 1)·log(m) + log(1/δ))/m. Then, since ν(1/2) ≤ 17α, G(1/2) ≤ (m/(m − N))·(ν(1/2) + φ(1/2)) ≤ 30α; in the last inequality we used the fact that m/(m − N) ≤ 10/9. So if EM holds, α ≤ Gmin(m, δ) ≤ G(1/2) ≤ 30α.

From the assumption on A, with probability at least 1 − δ, we have err(ˆh, D) ≤ C·Gmin(m, δ) ≤ 30Cα ≤ 1/8 (since α ≤ 1/(240C)). Let EL(D) denote the event that A queries fewer than N/2 labels, where the probability is over the randomness of S and A. Let h′ be the output of an algorithm that behaves like A whenever EL(D) holds, and queries at most N/2 labels otherwise. By Theorem 19, there exists some D in the family of distributions above such that E[err(h′, D)] ≥ 1/4. By Markov's inequality, P[err(h′, D) ≥ 1/8] ≥ 1/7. Also, P[h′ ≠ ˆh] ≤ 1 − P[EL(D)]. Therefore

P[err(ˆh, D) ≥ 1/8] ≥ P[err(h′, D) ≥ 1/8] + P[EL(D)] − 1 ≥ P[EL(D)] − 6/7.

Therefore P[EL(D)] − 6/7 ≤ P[err(ˆh, D) ≥ 1/8] ≤ δ. Since by assumption δ ≤ 1/14, it follows that P[EL(D)] ≤ 6/7 + δ ≤ 13/14. It follows that with probability at least 1/14, the negation of EL(D) holds. Since EM also holds with probability at least 1 − 1/28, it follows that with probability at least 1/28, both EM and the negation of EL(D) hold. Now, as shown above, EM implies the bounds on Gmin(m, δ) (item 1 in the theorem statement). In addition, the negation of EL(D) implies that A queries at least N/2 = (1/2)·⌊m·α·log(m/δ)/log(m)⌋ labels (item 2 in the theorem statement). This completes the proof.

9. Discussion

We have presented an efficient, fully empirical, proximity-based non-parametric active learner. Our approach provides competitive error guarantees for general distributions in a general metric space, while keeping the label complexity significantly lower than that of any passive learner with the same guarantees. MARMANN yields fully empirical error estimates, easily computable from finite samples. This is in contrast with classic techniques, which provide bounds and rates that depend on unknown distribution-dependent quantities. An interesting question is whether the guarantees can be related to the Bayes error of the distribution. Our error guarantees are within a constant factor of the error guarantees of Gottlieb et al. (2017). A variant of this approach (Gottlieb et al., 2014a) was shown to be Bayes-consistent (Kontorovich and Weiss, 2015), and we conjecture that this holds also for the algorithm of Gottlieb et al. (2017). The passive component of our learning algorithm is indeed Bayes-consistent (Kontorovich et al., 2017).
Since in our analysis MARMANN achieves a constant factor over the error of the passive learner, Bayes-consistency of the active learner cannot be inferred from our present techniques; we leave this problem open for future research.

Another important issue is that of efficient implementation. We mentioned that the naive O(m²) runtime for constructing a t-net may be improved to 2^{O(ddim(X))}·m·log(1/t), as shown in Krauthgamer and Lee (2004) and Gottlieb et al. (2014b). The fast t-net construction was the algorithmic work-horse of Gottlieb et al. (2014a,b, 2017) and inspired the passive component of our learner. We note that implementing even this passive component efficiently is far from trivial; this formed the core of a Master's thesis (Korsunsky, 2017). The remaining obstacle to making our algorithm fully practical is the magnitude of some of the constants. We believe these to be artifacts of the proof and intend to bring them down to manageable values in future work.

Acknowledgments

Sivan Sabato was partially supported by the Israel Science Foundation (grant No. 555/15). Aryeh Kontorovich was partially supported by the Israel Science Foundation (grants No. 1141/12 and 755/15) and by Yahoo Faculty and PayPal awards. We thank Lee-Ad Gottlieb and Dana Ron for helpful discussions, and the referees for carefully reading the manuscript and for their helpful suggestions.

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.

Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation. Annals of Statistics, 37(4):1591–1646, 2009.

Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Symposium on Theory of Computing, STOC 2014, pages 449–458, 2014.

Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, COLT 2007, pages 35–50, 2007.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1), 2009.

Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.

Daniel Berend and Aryeh Kontorovich. A finite sample analysis of the naive Bayes classifier. Journal of Machine Learning Research, 16:1519–1545, 2015.

Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 1870–1879, 2015.

Oren Boiman, Eli Shechtman, and Michal Irani. In defense of nearest-neighbor based image classification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2008, 2008.

Nader H. Bshouty, Yi Li, and Philip M. Long. Using the doubling dimension to analyze the generalization of learning algorithms. Journal of Computer and System Sciences, 75(6):323–335, 2009.

Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

Rui M. Castro, Rebecca Willett, and Robert D. Nowak. Faster rates in regression via active learning. In Advances in Neural Information Processing Systems, NIPS 2005, pages 179–186, 2005.
Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. Active learning on trees and graphs. In Proceedings of the 23rd Conference on Learning Theory, COLT 2010, pages 320–332, 2010.

Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, NIPS 2014, pages 3437–3445, 2014.

Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.

Gautam Dasarathy, Robert D. Nowak, and Xiaojin Zhu. S2: an efficient graph based active learning algorithm with application to nonparametric classification. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, pages 503–522, 2015.

Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, NIPS 2004, pages 337–344, 2004.

Sanjoy Dasgupta. Consistency of nearest neighbor classification under selective sampling. In Proceedings of the 25th Annual Conference on Learning Theory, COLT 2012, pages 18.1–18.15, 2012.

Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pages 208–215, 2008.

Luc Devroye and László Györfi. Nonparametric Density Estimation: The L1 View. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York, 1985.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.

Andreas Emil Feldmann, Wai Shing Fung, Jochen Könemann, and Ian Post. A (1 + ϵ)-embedding of low highway dimension graphs into bounded treewidth graphs. CoRR, abs/1502.04588, 2015. URL http://arxiv.org/abs/1502.04588.

Evelyn Fix and J. L. Hodges, Jr. Discriminatory analysis, nonparametric discrimination: Consistency properties. Report Number 4, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

Evelyn Fix and J. L. Hodges, Jr. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247, 1989.

Sally Floyd and Manfred K. Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

Alon Gonen, Sivan Sabato, and Shai Shalev-Shwartz. Efficient active learning of halfspaces: an aggressive approach (extended abstract ICML 2013). Journal of Machine Learning Research, 14(1):2583–2615, 2013.

Lee-Ad Gottlieb and Robert Krauthgamer. Proximity algorithms for nearly doubling spaces. SIAM Journal on Discrete Mathematics, 27(4):1759–1769, 2013.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014a.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Near-optimal sample compression for nearest neighbors. In Advances in Neural Information Processing Systems, NIPS 2014, pages 370–378, 2014b.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction (extended abstract ALT 2013). Theoretical Computer Science, pages 105–118, 2016.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearly optimal classification for semimetrics (extended abstract AISTATS 2016). Journal of Machine Learning Research, 2017.
Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1-2):55–76, 2005.

Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

Steve Hanneke and Liu Yang. Minimax analysis of active learning. Journal of Machine Learning Research, 16:3487–3602, 2015.

Aryeh Kontorovich and Iosif Pinelis. Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model. CoRR, abs/1606.08920, 2016.

Aryeh Kontorovich and Roi Weiss. Maximum margin multiclass nearest neighbors. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 892–900, 2014.

Aryeh Kontorovich and Roi Weiss. A Bayes consistent 1-NN classifier. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, 2015.

Aryeh Kontorovich, Sivan Sabato, and Roi Weiss. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. In Advances in Neural Information Processing Systems, NIPS 2017, pages 1572–1582, 2017.

Yevgeni Korsunsky. Classifying processes by their system call traces using 1-nearest-neighbor with compression. Master's thesis, Ben-Gurion University Computer Science Department, 2017. URL https://tinyurl.com/Korsunsky-msc.

Samory Kpotufe. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, NIPS 2011, pages 729–737, 2011.

Samory Kpotufe, Ruth Urner, and Shai Ben-David. Hierarchical label queries with data-dependent partitions. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, pages 1176–1189, 2015.

Robert Krauthgamer and James R. Lee. Navigating nets: Simple algorithms for proximity search. In 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 791–801, January 2004.

Sanjeev R. Kulkarni and S. E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.

Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, COLT 2009, 2009.

Andrew McCallum and Kamal Nigam. Employing EM and pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, ICML 1998, pages 350–358, 1998.

Colin McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, volume 141 of LMS Lecture Notes Series, pages 148–188. Morgan Kaufmann Publishers, San Mateo, CA, 1989.

Sivan Sabato and Rémi Munos. Active regression by stratification. In Advances in Neural Information Processing Systems, NIPS 2014, pages 469–477, 2014.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.
Ruth Urner, Sharon Wulff, and Shai Ben-David. PLAL: cluster-based active learning. In Proceedings of the 26th Annual Conference on Learning Theory, COLT 2013, pages 376–397, 2013.

Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.

Kai Wei, Rishabh K. Iyer, and Jeff A. Bilmes. Submodularity in data subset selection and active learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 1954–1963, 2015.

Lin Cheng Zhao. Exponential bounds of mean error for the nearest neighbor estimates of regression functions. Journal of Multivariate Analysis, 21(1):168–178, 1987.

Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 workshop, pages 58–65, 2003.