# Maximum Margin Multiclass Nearest Neighbors

Aryeh Kontorovich (KARYEH@CS.BGU.AC.IL), Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel
Roi Weiss (ROIWEI@CS.BGU.AC.IL), Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel

*Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).*

## Abstract

We develop a general framework for margin-based multicategory classification in metric spaces. The basic work-horse is a margin-regularized version of the nearest-neighbor classifier. We prove generalization bounds that match the state of the art in sample size $n$ and significantly improve the dependence on the number of classes $k$. Our point of departure is a nearly Bayes-optimal finite-sample risk bound independent of $k$. Although $k$-free, this bound is unregularized and non-adaptive, which motivates our main result: Rademacher and scale-sensitive margin bounds with a logarithmic dependence on $k$. As the best previous risk estimates in this setting were of order $k$, our bound is exponentially sharper. From the algorithmic standpoint, in doubling metric spaces our classifier may be trained on $n$ examples in $O(n^2 \log n)$ time and evaluated on new points in $O(\log n)$ time.

## 1. Introduction

Whereas the theory of supervised binary classification is by now fairly well developed, its multiclass extension continues to pose numerous novel statistical and computational challenges. On the algorithmic front, there is the basic question of how to adapt the hyperplane and kernel methods, ideally suited for two classes, to three or more. A host of new problems also arise on the statistical front. In the binary case, the VC-dimension characterizes the distribution-free sample complexity (Anthony & Bartlett, 1999), and tighter distribution-dependent bounds are available via Rademacher techniques (Bartlett & Mendelson, 2002; Koltchinskii & Panchenko, 2002). Characterizing the multiclass distribution-free sample complexity is far less straightforward, though impressive progress has been made recently (Daniely et al., 2011).

Following von Luxburg & Bousquet (2004) and Gottlieb et al. (2010), we adopt a proximity-based approach to supervised multicategory classification in metric spaces. The principal motivation for this framework is two-fold: (i) Many natural metrics, such as $L_1$, earthmover, and edit distance, cannot be embedded in a Hilbert space without a large distortion (Enflo, 1969; Naor & Schechtman, 2007; Andoni & Krauthgamer, 2010). Any kernel method is thus a priori at a disadvantage when learning to classify non-Hilbertian objects, since it cannot faithfully represent the data geometry. (ii) Nearest neighbor-based classification sidesteps the issue of $k$-to-binary reductions which, despite voluminous research, is still the subject of vigorous debate (Rifkin & Klautau, 2004; El-Yaniv et al., 2008). In terms of time complexity, the reductions approach faces an $\Omega(k)$ information-theoretic lower bound (Beygelzimer et al., 2009), while nearest neighbors admit solutions whose runtime does not depend on the number of classes.

**Main results.** Our contributions are both statistical and algorithmic in nature.
On the statistical front, we open with the observation that the nearest-neighbor classifier's expected risk is at most twice the Bayes-optimal risk plus a term that decays with the sample size at a rate independent of the number of classes $k$ (and continues to hold for $k = \infty$; Theorem 1). Although of interest as apparently the first $k$-free finite-sample result, it has the drawback of being non-adaptive, in the sense of depending on properties of the unknown sampling distribution and failing to provide the learner with a usable data-dependent bound. This difficulty is overcome in our main technical contribution (Theorems 4 and 5), where we give a margin-based multiclass bound of order

$$\min\left\{ \frac{1}{\gamma}\left(\frac{\log k}{n}\right)^{\frac{1}{D+1}},\ \frac{1}{\gamma^{D/2}}\sqrt{\frac{\log k}{n}} \right\}, \tag{1}$$

where $k$ is the number of classes, $n$ is the sample size, $D$ is the doubling dimension of the metric instance space, and $0 < \gamma \le 1$ is the margin. This matches the state-of-the-art asymptotics in $n$ for metric spaces and significantly improves the dependence on $k$, which hitherto was of order $k$ (Zhang, 2002; 2004) or worse. The exponential dependence on some covering dimension (such as $D$) is in general inevitable, as shown by a standard no-free-lunch argument (Ben-David & Shalev-Shwartz, 2014), but whether (1) is optimal remains an open question.

On the algorithmic front, using the above bounds, we show how to efficiently perform Structural Risk Minimization (SRM) so as to avoid overfitting. This involves deciding how many and which sample points one is allowed to err on. We reduce this problem to minimum vertex cover, which admits a greedy 2-approximation. Our algorithm admits a significantly faster $\varepsilon$-approximate version in doubling spaces, with a graceful degradation in $\varepsilon$ of the generalization bounds, based on approximate nearest-neighbor techniques developed by Gottlieb et al. (2010; 2013a). For doubling dimension $D$ and accuracy $\varepsilon$, our runtime is $O(2^{O(D)} n^2 \log n)$ for learning and $O(2^{O(D)} \log n)$ for evaluation on a test point. (Exact nearest neighbor requires $\Theta(n)$ evaluation time.) Finally, our generalization bounds and algorithm can be made adaptive to the intrinsic dimension of the data via a recent metric dimensionality-reduction technique (Gottlieb et al., 2013b).

**Related work.** Due to space constraints, we are only able to mention the most directly relevant results, and even these not in full generality, but rather with an eye to facilitating comparison with the present work. Supervised $k$-category classification approaches follow two basic paradigms: (I) defining a score function on point-label pairs and classifying by choosing the label with the optimal score, and (II) reducing the problem to several binary classification problems. Regarding the second paradigm, the seminal paper of Allwein et al. (2001) unified the various error-correcting output code (ECOC) based multiclass-to-binary reductions under a single margin-based framework. Their generalization bound requires the base classifier to have finite VC-dimension $d_{\mathrm{VC}} < \infty$ (and hence does not apply to nearest neighbors or infinite-dimensional Hilbert spaces) and is of the form $O\big(\sqrt{\log k / n}\big)$. Langford & Beygelzimer (2005) and Beygelzimer et al. (2009) gave $k$-free and $O(\log k)$ regret bounds, but these are conditional on the performance of the underlying binary classifiers, as opposed to the unconditional bounds we provide in this paper.
As for the first paradigm, proximity is perhaps the most natural score function; indeed, a formal analysis of the nearest-neighbor classifier (Cover & Hart, 1967) much predated the first multiclass extensions of SVM (Weston & Watkins, 1999). Crammer & Singer (2002a;b) considerably reduced the computational complexity of the latter approach and gave a risk bound decaying as $O(k^2/(n\gamma^2))$ for the separable case with margin $\gamma$. In an alternative approach based on choosing $q$ prototype examples, Crammer et al. (2002) gave a risk bound with rate $O\big(\sqrt{qk/2}\,/(\gamma\sqrt{n})\big)$. Ben-David et al. (1995) characterized the PAC learnability of $k$-valued functions in terms of combinatorial dimensions, such as the Natarajan dimension $d_{\mathrm{Nat}}$. Guermeur (2007; 2010) gave scale-sensitive analogues of these dimensions. He gave a risk bound decaying as $O\big(\sqrt{d^{\gamma}_{\mathrm{Nat}} \log k / n}\big)$, where $d^{\gamma}_{\mathrm{Nat}}$ is a scale-sensitive Natarajan dimension, essentially replacing the finite VC dimension $d_{\mathrm{VC}}$ in Allwein et al. (2001) by $d^{\gamma}_{\mathrm{Nat}}$. He further showed that for linear function classes in Hilbert spaces, $d^{\gamma}_{\mathrm{Nat}}$ is bounded by $O(k^2/\gamma^2)$, resulting in a risk bound decaying as $O\big(k/(\gamma^2\sqrt{n})\big)$. To the best of our knowledge, the sharpest current estimate on the Natarajan dimension (for some special function classes) is $d_{\mathrm{Nat}} = O(k)$, with a matching lower bound of $\Omega(k)$ (Daniely et al., 2011). A margin-based Rademacher analysis of score functions (Mohri et al., 2012) yields a bound of order $O\big(k^2/(\gamma\sqrt{n})\big)$, and this is also the $k$-dependence obtained by Cortes et al. (2013) in a recent paper proposing a multiple-kernel approach to multiclass learning. Closest in spirit to our work are the results of Zhang (2002; 2004), who used the chaining technique to obtain a Rademacher complexity with a $1/\gamma$ margin dependence and a linear dependence on $k$.

Besides the dichotomy of score functions vs. multiclass-to-binary reductions outlined above, multicategory risk bounds may also be grouped by the trichotomy of (a) combinatorial dimensions, (b) Hilbert spaces, and (c) metric spaces (see Table 1).

| Paper | decay rate $O(\cdot)$ | group |
|---|---|---|
| Allwein et al. (2001) | $\sqrt{\log k / n}$ | (II, a) |
| Daniely et al. (2011) | $\sqrt{d_{\mathrm{Nat}} \log k / n}$ | (a) |
| Guermeur (2010) | $\sqrt{d^{\gamma}_{\mathrm{Nat}} \log k / n}$ | (I, a) |
| Crammer & Singer (2002b) | $k^2/(\gamma^2 n)$ | (I, b) |
| Cortes et al. (2013) | $k^2/(\gamma\sqrt{n})$ | (I, b) |
| Guermeur (2010) | $k/(\gamma^2\sqrt{n})$ | (I, b) |
| Zhang (2004) | $k/(\gamma\sqrt{n})$ | (I, b) |
| current paper | $\frac{1}{\gamma^{D/2}}\sqrt{\log k / n}$ | (I, c) |
| current paper | $\frac{1}{\gamma}\left(\frac{\log k}{n}\right)^{\frac{1}{1+D}}$ | (I, c) |

Table 1. Comparing various multiclass bounds. (*) Not margin-based. (†) Only for the separable case. (‡) Combinatorial dimension depends on $k$.

Category (a) comprises algorithm-independent results that give generalization bounds in terms of some combinatorial dimension of a fixed concept class (Allwein et al., 2001; Ben-David et al., 1995; Guermeur, 2007; 2010; Daniely et al., 2011). Multiclass extensions of SVM and related kernel methods (Weston & Watkins, 1999; Crammer & Singer, 2002a;b; Crammer et al., 2002; Cortes et al., 2013) fall into category (b). Category (c), consisting of agnostic (in the sense of not requiring an a priori fixed concept class) metric-space methods, is the most sparsely populated. The pioneering asymptotic analysis of Cover & Hart (1967) was cast in a modern, finite-sample form by Ben-David & Shalev-Shwartz (2014), but only for binary classification. Unlike Hilbert spaces, which admit dimension-free margin bounds, we are not aware of any metric-space risk bound that does not explicitly depend on some metric dimension $D$ or covering numbers. The bounds in Ben-David & Shalev-Shwartz (2014) and Gottlieb et al.
(2013b) exhibit a characteristic curse-of-dimensionality decay rate of $O(n^{-1/(D+1)})$, but more optimistic asymptotics can be obtained (Guermeur, 2007; 2010; Zhang, 2002; 2004; Gottlieb et al., 2010). Although some sample lower bounds for proximity-based methods are known (Ben-David & Shalev-Shwartz, 2014), the optimal dependence on $D$ and $k$ is far from being fully understood.

## 2. Preliminaries

**Metric spaces.** Given two metric spaces $(\mathcal{X}, d)$ and $(\mathcal{Z}, \rho)$, a function $f : \mathcal{X} \to \mathcal{Z}$ is called $L$-Lipschitz if $\rho(f(x), f(x')) \le L\, d(x, x')$ for all $x, x' \in \mathcal{X}$. (The real line $\mathbb{R}$ is always considered with its Euclidean metric $|\cdot|$.) The Lipschitz constant of $f$, denoted $\|f\|_{\mathrm{Lip}}$, is the smallest $L$ for which $f$ is $L$-Lipschitz. The distance between two sets $A, B \subseteq \mathcal{X}$ is defined by $d(A, B) = \inf_{x \in A,\, x' \in B} d(x, x')$. For a metric space $(\mathcal{X}, d)$, let $\lambda$ be the smallest value such that every ball in $\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $\mathcal{X}$ is $\mathrm{ddim}(\mathcal{X}) := \log_2 \lambda$. A metric is doubling when its doubling dimension is bounded. The $\varepsilon$-covering number of a metric space $(\mathcal{X}, d)$, denoted $N(\varepsilon, \mathcal{X}, d)$, is defined as the smallest number of balls of radius $\varepsilon$ that suffices to cover $\mathcal{X}$. It can be shown (e.g., Krauthgamer & Lee (2004)) that

$$N(\varepsilon, \mathcal{X}, d) \le \left(\frac{2\,\mathrm{diam}(\mathcal{X})}{\varepsilon}\right)^{\mathrm{ddim}(\mathcal{X})}, \tag{2}$$

where $\mathrm{diam}(\mathcal{X}) = \sup_{x, x' \in \mathcal{X}} d(x, x')$ is the diameter of $\mathcal{X}$.

**The multiclass learning framework.** Let $(\mathcal{X}, d)$ be a metric instance space with $\mathrm{diam}(\mathcal{X}) = 1$ and $\mathrm{ddim}(\mathcal{X}) = D < \infty$, and let $\mathcal{Y} \subseteq \mathbb{N}$ be an at most countable label set. We observe a sample $S = (X_i, Y_i)_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^n$ drawn iid from an unknown distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. In line with paradigm (I) outlined in the Introduction, our classification procedure consists of optimizing a score function. In hindsight, the score at a test point will be determined by its labeled neighbors, but for now we consider an unspecified collection $\mathcal{F}$ of functions mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$. A score function $f \in \mathcal{F}$ induces the classifier $g_f : \mathcal{X} \to \mathcal{Y}$ via

$$g_f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y), \tag{3}$$

breaking ties arbitrarily. The margin of $f \in \mathcal{F}$ on $(x, y)$ is defined by

$$\gamma_f(x, y) = \frac{1}{2}\Big(f(x, y) - \sup_{y' \ne y} f(x, y')\Big). \tag{4}$$

Note that $g_f$ misclassifies $(x, y)$ precisely when $\gamma_f(x, y) < 0$. One of our main objectives is to upper-bound the generalization error $P(g_f(X) \ne Y) = \mathbb{E}\big[\mathbf{1}_{\{\gamma_f(X, Y) < 0\}}\big]$. To this end, we introduce the surrogate loss functions $\mathcal{L} : \mathbb{R} \to \mathbb{R}_+$,

$$\mathcal{L}_{\mathrm{cutoff}}(u) = \mathbf{1}_{\{u < 1\}}, \qquad \mathcal{L}_{\mathrm{margin}}(u) = T_{[0,1]}(1 - u), \qquad \text{where} \quad T_{[a,b]}(z) = \max\{a, \min\{b, z\}\} \tag{5}$$

is the truncation operator. The empirical loss $\widehat{\mathbb{E}}[\mathcal{L}(\gamma_f)]$ induced by any of the loss functions above is $\frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\gamma_f(X_i, Y_i))$. All probabilities $P(\cdot)$ and expectations $\mathbb{E}[\cdot]$ are with respect to the sampling distribution $P$. We will write $\mathbb{E}_S$ to indicate expectation over a sample (i.e., over $P^n$).
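To make the definitions above concrete, here is a small NumPy sketch (ours, not from the paper) of the classifier (3), the margin (4) with the $\tfrac12$ factor as reconstructed above, the truncation operator and surrogate losses (5), and the empirical losses. The array layout and function names are illustrative assumptions.

```python
import numpy as np

def classify(scores):
    """Classifier g_f of (3): predict the label with the largest score.
    scores[i, y] = f(X_i, y) for labels y in {0, ..., k-1}."""
    return scores.argmax(axis=1)

def margins(scores, y):
    """Margins gamma_f(X_i, Y_i) of (4): (f(x, y) - max_{y' != y} f(x, y')) / 2.
    y is an integer array of true labels."""
    n = scores.shape[0]
    true_scores = scores[np.arange(n), y]
    rivals = scores.copy()
    rivals[np.arange(n), y] = -np.inf            # exclude the true label
    return 0.5 * (true_scores - rivals.max(axis=1))

def truncate(z, a, b):
    """Truncation operator T_[a,b](z) = max{a, min{b, z}} of (5)."""
    return np.clip(z, a, b)

def empirical_losses(scores, y):
    """Empirical surrogate losses (1/n) sum_i L(gamma_f(X_i, Y_i))."""
    g = margins(scores, y)
    cutoff_loss = np.mean(g < 1.0)                       # L_cutoff(u) = 1{u < 1}
    margin_loss = np.mean(truncate(1.0 - g, 0.0, 1.0))   # L_margin(u) = T_[0,1](1 - u)
    return cutoff_loss, margin_loss
```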
## 3. Risk bounds

In this section we analyze the statistical properties of nearest-neighbor multicategory classifiers in metric spaces. In Section 3.1, Theorem 1, we record the observation that the 1-nearest-neighbor classifier is nearly Bayes optimal, with a risk decay that does not depend on the number of classes $k$. Of course, the naive 1-nearest-neighbor classifier is well known to overfit. This is reflected in the non-adaptive nature of the analysis: the bound is stated in terms of properties of the unknown sampling distribution, and fails to provide the learner with a usable data-dependent bound. To achieve the latter goal, we develop a margin analysis in Section 3.2. Our main technical result is Lemma 2, from which the logarithmic dependence on $k$ claimed in (1) follows. Although not $k$-free like the Bayes excess-risk bound of Theorem 1, $O(\log k)$ is exponentially sharper than the current state of the art (Zhang, 2002; 2004). Whether a $k$-free metric entropy bound is possible is left as an open problem. The metric entropy bound of Lemma 2 facilitates two approaches to bounding the risk: via Rademacher complexity (Section 3.2.2) and via scale-sensitive techniques in the spirit of Guermeur (2007) (Section 3.2.3). In Section 3.2.4 we combine these two margin bounds by taking their minimum. The resulting bound will be used in Section 4 to perform efficient Structural Risk Minimization.

### 3.1. Multiclass Bayes near-optimality

In this section, $(\mathcal{X}, d)$ is a metric space and $\mathcal{Y}$ is an at most countable (possibly infinite) label set. A sample $S = (X_i, Y_i)_{i=1}^{n}$ is drawn iid from an unknown distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. For $x \in \mathcal{X}$, let $(X_{\pi_1(x)}, Y_{\pi_1(x)})$ be its nearest neighbor in $S$: $\pi_1(x) = \operatorname{argmin}_{i \in [n]} d(X_i, x)$. Thus, the nearest-neighbor classifier $g_{\mathrm{NN}}$ is given by

$$g_{\mathrm{NN}}(x) = Y_{\pi_1(x)}. \tag{6}$$

Define the function $\eta : \mathcal{X} \to \mathbb{R}^{\mathcal{Y}}$ by $\eta_y(x) = P(Y = y \mid X = x)$. The Bayes-optimal classifier $g^*$ (i.e., one that minimizes $P(g(X) \ne Y)$ over all measurable $g \in \mathcal{Y}^{\mathcal{X}}$) is well known to have the form $g^*(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \eta_y(x)$, where ties are broken arbitrarily. Our only distributional assumption is that $\eta$ is $L$-Lipschitz with respect to the sup-norm. Namely, for all $x, x' \in \mathcal{X}$,

$$\|\eta(x) - \eta(x')\|_\infty = \sup_{y \in \mathcal{Y}} |\eta_y(x) - \eta_y(x')| \le L\, d(x, x').$$

This is a direct analogue of the Lipschitz assumption for the binary case (Cover & Hart, 1967; Ben-David & Shalev-Shwartz, 2014). We make the additional standard assumption that $\mathcal{X}$ has finite doubling dimension: $\mathrm{ddim}(\mathcal{X}) = D < \infty$. The Lipschitz and doubling assumptions suffice to extend the finite-sample analysis of binary nearest neighbors (Ben-David & Shalev-Shwartz, 2014) to the multiclass case:

**Theorem 1.**
$$\mathbb{E}_S\big[P(g_{\mathrm{NN}}(X) \ne Y)\big] \le 2\, P(g^*(X) \ne Y) + \frac{4L}{n^{1/(D+1)}}.$$

Note that the bound is independent of the number of classes $k$ and holds even for $k = \infty$. The proof is deferred to Appendix A.

[Figure 1. The mapping in Lemma 2 with $|\mathcal{Y}| = 3$ (panels: $h_f(x, y)$ and $\tilde h(x)$).]

### 3.2. Multiclass margin bounds

Here again $(\mathcal{X}, d)$ is a metric space, but now the label set $\mathcal{Y}$ is assumed finite: $|\mathcal{Y}| = k < \infty$. As before, $S = (X_i, Y_i)_{i=1}^{n}$ with $(X_i, Y_i) \sim P$ iid. It will be convenient to write $S_y = \{X_i : Y_i = y,\ i \in [n]\}$ for the subset of examples with label $y$. The metric induces the natural score function $f_{\mathrm{NN}}(x, y) = -d(x, S_y)$, with corresponding nearest-neighbor classifier

$$g_{\mathrm{NN}}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} f_{\mathrm{NN}}(x, y), \tag{7}$$

easily seen to be identical to the one in (6). At this point we make the simple but crucial observation that the function $f_{\mathrm{NN}}(\cdot, y) : \mathcal{X} \to \mathbb{R}$ is 1-Lipschitz. This will enable us to generalize the powerful Lipschitz extension framework of von Luxburg & Bousquet (2004) to $|\mathcal{Y}| > 2$.

We will need a few definitions. Let $F_L$ be the collection of all $L$-Lipschitz functions from $\mathcal{X}$ to $\mathbb{R}$ and put $\mathcal{F}_L = F_L^{\mathcal{Y}}$. Since each $f \in \mathcal{F}_L$ maps $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, the margin $\gamma_f(x, y)$ is well defined via (4). Putting

$$y^*_f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y), \qquad \gamma^*_f(x) = \gamma_f(x, y^*_f(x)),$$

we define the projection

$$\Phi f(x, y) = \begin{cases} \gamma^*_f(x), & \text{if } y = y^*_f(x), \\ -\gamma^*_f(x), & \text{otherwise.} \end{cases}$$

Finally, we define $\mathcal{H}_L$ as the truncated (as in (5)) projections of functions in $\mathcal{F}_L$:

$$\mathcal{H}_L = \Big\{ (x, y) \mapsto T_{[-1,1]}\big(\Phi f(x, y)\big) \,:\, f \in \mathcal{F}_L \Big\}. \tag{8}$$

Thus, $\mathcal{H}_L$ is a set of functions $h_f : \mathcal{X} \times \mathcal{Y} \to [-1, 1]$, where each $h_f(\cdot, y)$ is $L$-Lipschitz and $h_f(x, y) = \pm T_{[-1,1]}(\gamma^*_f(x))$, depending upon whether $y = y^*_f(x)$; see Figure 1 (left).

[Figure 2. The metric $\rho(\tilde h(x), \tilde h'(x))$ with $|\mathcal{Y}| = 3$.]
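The proof of Lemma 2 below reparametrizes each $h_f$ as $\tilde h_f = (y^*_f, \gamma^*_f)$ and treats it as an $L$-Lipschitz map from $(\mathcal{X}, d)$ into the star-like metric $\rho$ of Figure 2 (defined in the proof). The following short verification is our addition, not part of the paper's argument; it also serves as a sanity check of the $\tfrac12$ factor in the margin definition (4) as reconstructed above. Fix $f \in \mathcal{F}_L$ and $x, x' \in \mathcal{X}$.

```latex
% Case y^*_f(x) = y^*_f(x') =: \hat y.  Write u(z) = f(z,\hat y) and v(z) = \max_{y \ne \hat y} f(z,y);
% both are L-Lipschitz, and \gamma^*_f = \tfrac12(u - v) at x and x', so
\[
  \rho\big(\tilde h_f(x), \tilde h_f(x')\big)
  = \big|\gamma^*_f(x) - \gamma^*_f(x')\big|
  \le \tfrac12\big(|u(x) - u(x')| + |v(x) - v(x')|\big)
  \le L\, d(x, x').
\]
% Case y := y^*_f(x) \ne y' := y^*_f(x').  Then \gamma^*_f(x) \le \tfrac12(f(x,y) - f(x,y')) and
% \gamma^*_f(x') \le \tfrac12(f(x',y') - f(x',y)), so
\[
  \rho\big(\tilde h_f(x), \tilde h_f(x')\big)
  = \gamma^*_f(x) + \gamma^*_f(x')
  \le \tfrac12\big[\big(f(x,y) - f(x',y)\big) + \big(f(x',y') - f(x,y')\big)\big]
  \le L\, d(x, x').
\]
% The same triangle-inequality argument gives |d(x,S_y) - d(x',S_y)| \le d(x,x'),
% i.e., f_NN(., y) = -d(., S_y) is 1-Lipschitz, so f_NN lies in F_1.
```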
#### 3.2.1. Bounding the metric entropy

Our main technical result is a bound on the metric entropy of $\mathcal{H}_L$, which will be used to obtain error bounds (Theorems 4 and 5) for classifiers derived from this function class. The analysis differs from previous bounds (see Table 1) by explicitly taking advantage of the mutually exclusive nature of the labels, obtaining an exponential improvement in terms of the number of classes $k$. Endow $\mathcal{H}_L$ with the sup-norm $\|h - h'\|_\infty = \sup_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} |h(x, y) - h'(x, y)|$.

**Lemma 2.** For any $\varepsilon > 0$,
$$\log N(\varepsilon, \mathcal{H}_L, \|\cdot\|_\infty) \le \left(\frac{16L}{\varepsilon}\right)^{D} \log\!\left(\left\lceil \frac{4k}{\varepsilon} \right\rceil + 1\right).$$

*Proof.* By the definition of $\mathcal{H}_L$, for all $h_f \in \mathcal{H}_L$ and $x \in \mathcal{X}$ there is at most one $y \in \mathcal{Y}$ such that $h_f(x, y) > 0$. In addition, if $h_f(x, y) = c > 0$, then $h_f(x, y') = -c$ for all $y' \ne y$. Since $\gamma^*_f(x) \ge 0$, we may reparametrize $h_f(x, \cdot)$ by $(y^*_f(x), \gamma^*_f(x)) \in \mathcal{Y} \times [0, 1]$; see Figure 1. To complete the mapping $h_f \mapsto (y^*_f, \gamma^*_f)$, define the following star-like metric $\rho$ over $\mathcal{Y} \times [0, 1]$ (see Figure 2):

$$\rho\big((y, \gamma), (y', \gamma')\big) = \begin{cases} |\gamma - \gamma'|, & y = y', \\ \gamma + \gamma', & y \ne y'. \end{cases}$$

Let $\widetilde{\mathcal{H}}_L$ be the collection of functions $\tilde h : \mathcal{X} \to \mathcal{Y} \times [0, 1]$ that are $L$-Lipschitz: $\rho(\tilde h(x), \tilde h(x')) \le L\, d(x, x')$ for all $x, x' \in \mathcal{X}$. It is easily verified that the metric space $(\mathcal{H}_L, \|\cdot\|_\infty)$ is isometric to $(\widetilde{\mathcal{H}}_L, \rho_\infty)$ with $\rho_\infty(\tilde h, \tilde h') = \sup_{x \in \mathcal{X}} \rho(\tilde h(x), \tilde h'(x))$. Thus, $N(\varepsilon, \mathcal{H}_L, \|\cdot\|_\infty) = N(\varepsilon, \widetilde{\mathcal{H}}_L, \rho_\infty)$, and we proceed to bound the latter. (The remainder of the proof is based on a technique communicated to us by R. Krauthgamer, a variant of the classic Kolmogorov & Tikhomirov (1959) method.)

Fix a covering of $\mathcal{X}$ consisting of $|N| = N(\varepsilon/8L, \mathcal{X}, d)$ balls $\{U_1, \ldots, U_{|N|}\}$ of radius $\varepsilon' = \varepsilon/8L$, and choose $|N|$ points $N = \{x_i \in U_i\}_{i=1}^{|N|}$. Construct $\widehat{\mathcal{H}} \subset \widetilde{\mathcal{H}}_{2L}$ as follows. At every point $x_i \in N$, select one of the classes $y \in \mathcal{Y}$ and set $\hat h(x_i) = (y, \gamma(x_i))$ with $\gamma(x_i)$ some multiple of $2L\varepsilon' = \varepsilon/4$, while maintaining $\|\hat h\|_{\mathrm{Lip}} \le 2L$. Construct a $2L$-Lipschitz extension of $\hat h$ from $N$ to all of $\mathcal{X}$ (such an extension always exists; McShane, 1934; Whitney, 1934). We claim that every classifier in $\mathcal{H}_L$, via its twin $\tilde h \in \widetilde{\mathcal{H}}_L$, is close to some $\hat h \in \widehat{\mathcal{H}}$, in the sense that $\rho_\infty(\tilde h, \hat h) \le \varepsilon$. Indeed, every point $x \in \mathcal{X}$ is $2\varepsilon'$-close to some point $x_i \in N$, and since $\tilde h$ is $L$-Lipschitz and $\hat h$ is $2L$-Lipschitz,

$$\rho(\tilde h(x), \hat h(x)) \le \rho(\tilde h(x), \tilde h(x_i)) + \rho(\tilde h(x_i), \hat h(x_i)) + \rho(\hat h(x_i), \hat h(x)) \le L\, d(x, x_i) + \varepsilon/4 + 2L\, d(x, x_i) \le \varepsilon.$$

Thus, $\widehat{\mathcal{H}}$ provides an $\varepsilon$-cover for $\widetilde{\mathcal{H}}_L$ (and hence for $\mathcal{H}_L$). Note that $|\widehat{\mathcal{H}}| \le \big(\lceil 4k/\varepsilon \rceil + 1\big)^{|N|}$, since by construction the functions $\hat h$ are determined by their values on $N$, which at a given point can take one of $\lceil 4k/\varepsilon \rceil + 1$ possible values. Since by (2) we have $|N| = N(\varepsilon/8L, \mathcal{X}, d) \le \big(\tfrac{16L}{\varepsilon}\big)^{D}$, the bound follows. ∎

A tighter bound is possible when the metric space $(\mathcal{X}, d)$ possesses two additional properties:

1. $(\mathcal{X}, d)$ is *connected* if for all $x, x' \in \mathcal{X}$ and all $\varepsilon > 0$, there is a finite sequence of points $x = x_1, x_2, \ldots, x_m = x'$ such that $d(x_i, x_{i+1}) < \varepsilon$ for all $1 \le i < m$.
2. $(\mathcal{X}, d)$ is *centered* if for all $r > 0$ and all $A \subseteq \mathcal{X}$ with $\mathrm{diam}(A) \le 2r$, there exists a point $x \in \mathcal{X}$ such that $d(x, a) \le r$ for all $a \in A$.

**Lemma 3.** If $(\mathcal{X}, d)$ is connected and centered, then
$$\log N(\varepsilon, \mathcal{H}_L, \|\cdot\|_\infty) = O\!\left(\left(\frac{16L}{\varepsilon}\right)^{D} \log k + \log \frac{1}{\varepsilon}\right).$$

*Proof.* With the additional assumptions on $\mathcal{X}$, we follow the proof idea in Kolmogorov & Tikhomirov (1959) and demonstrate the tighter bound $|\widehat{\mathcal{H}}| \le \big(\lceil 4k/\varepsilon \rceil + 1\big)(2k + 1)^{|N| - 1} = O\big((2k)^{|N|}/\varepsilon\big)$. Here $\widehat{\mathcal{H}}$ is constructed as in the proof of Lemma 2, but now each $x_i \in N$ is taken to be a center of $U_i$, as furnished by Property 2 above. Let $x_j \in N$. Since $\mathcal{X}$ is connected, we may traverse a path from $x_1$ to $x_j$ via the cover points $x_1 = x_{i_1}, x_{i_2}, \ldots, x_{i_m} = x_j$, such that the distance between any two successive points $(x_{i_l}, x_{i_{l+1}})$ is at most $2\varepsilon' = \varepsilon/4L$. Since $\hat h$ is $2L$-Lipschitz, between any two such points the value of $\hat h$ can change by at most $\varepsilon/2$.
Thus, given the value $\hat h(x_{i_l})$, the value $\hat h(x_{i_{l+1}})$ can take one of at most $2k + 1$ values (as Figure 2 shows, at the star's hub $\hat h(x_{i_{l+1}})$ can take one of $2k + 1$ values, while at one of the spokes only 5 values are possible). We are thus left to choose the value of $\hat h$ at the point $x_1$, which can be one of the $\lceil 4k/\varepsilon \rceil + 1$ possible values. The bounds on $|\widehat{\mathcal{H}}|$ and on the metric entropy follow. ∎

#### 3.2.2. Rademacher analysis

The Rademacher complexity of the set of functions $\mathcal{H}_L$ is defined by

$$R_n(\mathcal{H}_L) = \mathbb{E}\left[\sup_{h \in \mathcal{H}_L} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(X_i, Y_i)\right], \tag{9}$$

where the $\sigma_i$ are $n$ independent random variables with $P(\sigma_i = +1) = P(\sigma_i = -1) = 1/2$. In Appendix B, we invoke Lemma 2 to derive the bound

$$R_n(\mathcal{H}_L) \le c_D\, L \left(\frac{\log 5k}{n}\right)^{1/(D+1)}, \tag{10}$$

where $c_D$ is a constant depending only on $D$; this in turn implies half of our main risk estimate (1).

**Theorem 4.** With probability at least $1 - \delta$, for all $L > 0$ and every $f \in \mathcal{F}_L$ with its projected version $h_f \in \mathcal{H}_L$,
$$P(g_f(X) \ne Y) \le \widehat{\mathbb{E}}\big[\mathcal{L}(h_f)\big] + \mathrm{Rad}(n, L, \delta),$$
where $g_f$ is the classifier defined in (3), $\mathcal{L}$ is any of the loss functions defined in Section 2, and $\mathrm{Rad}(n, L, \delta)$ is of order
$$c_D\, L \left(\frac{\log 5k}{n}\right)^{1/(D+1)} + \sqrt{\frac{\log\log_2 2L}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}.$$

#### 3.2.3. Scale-sensitive analysis

The following theorem, proved in Appendix C, is an adaptation of Guermeur (2007, Theorem 1), using Lemma 2.

**Theorem 5.** With probability at least $1 - \delta$, for all $L > 0$ and every $f \in \mathcal{F}_L$ with its induced $h_f \in \mathcal{H}_L$,
$$P(g_f(X) \ne Y) \le \widehat{\mathbb{E}}\big[\mathcal{L}_{\mathrm{cutoff}}(h_f)\big] + \mathrm{fat}(n, L, \delta),$$
where $\mathrm{fat}(n, L, \delta)$ is of order
$$\sqrt{\frac{2}{n}\left((16L)^{D} \log(20k) + \ln\frac{2L}{\delta}\right)}.$$

#### 3.2.4. Combined bound

Taking $\mathcal{L} = \mathcal{L}_{\mathrm{cutoff}}$ in Theorem 4, we can merge the above two bounds by taking their minimum. Namely, Theorem 5 holds with
$$\Delta(n, L, \delta) = \min\{\mathrm{Rad}(n, L, \delta),\ \mathrm{fat}(n, L, \delta)\}$$
in place of $\mathrm{fat}(n, L, \delta)$; see Figure 3. The resulting risk decay rate is of order
$$\min\left\{ L \left(\frac{\log k}{n}\right)^{\frac{1}{D+1}},\ L^{D/2} \left(\frac{\log k}{n}\right)^{\frac{1}{2}} \right\},$$
as claimed in (1). In terms of the number of classes $k$, our bound compares favorably to those in Allwein et al. (2001), Guermeur (2007; 2010), and, more recently, Daniely et al. (2011), which have a $k$-dependence of $O(d_{\mathrm{Nat}} \log k)$, where $d_{\mathrm{Nat}}$ is the (scale-sensitive, $k$-dependent) Natarajan dimension of the multiclass hypothesis class. The optimal dependence of the risk on $k$ is an intriguing open problem.

[Figure 3. The combined complexity bounds (k = 10, δ = 0.01); the plots show the Rad and fat terms for D = 5 and L ∈ {1, 2, 4, 8}.]

## 4. Algorithm

Theorems 4 and 5 yield generalization bounds of the schematic form

$$P(g(X) \ne Y) \le \widehat{\mathbb{E}}[\mathcal{L}] + \Delta(n, L, \delta). \tag{11}$$

The free parameter $L$ in (11) controls (roughly speaking) the bias-variance tradeoff: for larger $L$, we may achieve a smaller empirical loss $\widehat{\mathbb{E}}[\mathcal{L}]$ at the expense of a larger hypothesis complexity $\Delta(n, L, \delta)$. Our Structural Risk Minimization (SRM) consists of seeking the optimal $L$, i.e., one that minimizes the right-hand side of (11), via the following high-level procedure:

1. For each $L > 0$, minimize $\widehat{\mathbb{E}}[\mathcal{L}(h_f)]$ over $f \in \mathcal{F}_L$.
2. Choose the optimal $L^*$ and its corresponding classifier $g_f$ with $f \in \mathcal{F}_{L^*}$.

**Minimizing the empirical loss.** Let $S = (X_i, Y_i)_{i=1}^{n}$ be the training sample and $L > 0$ a given maximal allowed Lipschitz constant. We will say that a function $h \in \mathcal{H}_L$ is *inconsistent* with a sample point $(x, y)$ if $h(x, y) < 1$ (i.e., if the margin of $h$ on $(x, y)$ is less than one). Denote by $\hat m(L)$ the smallest possible number of sample points with which a function $h \in \mathcal{H}_L$ may be inconsistent (equivalently, up to normalization by $n$, the smallest achievable empirical cutoff loss):

$$\hat m(L) = \min_{h \in \mathcal{H}_L} \widehat{\mathbb{E}}\big[\mathcal{L}_{\mathrm{cutoff}}(h)\big].$$

Thus, our SRM problem consists of finding

$$L^* = \operatorname*{argmin}_{L > 0} \big\{ \hat m(L) + \Delta(n, L, \delta) \big\}.$$
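For concreteness, the complexity term $\Delta(n, L, \delta)$ appearing in the SRM objective can be tabulated directly from the expressions in Theorems 4 and 5. The sketch below is ours, not the authors' code; the constant `c_D`, the guard for small $L$, and the exact confidence terms are illustrative placeholders for the reconstructed expressions above.

```python
import math

def rad_term(n, L, delta, k, D, c_D=1.0):
    """Rademacher-based complexity term of Theorem 4 (constants illustrative)."""
    main = c_D * L * (math.log(5 * k) / n) ** (1.0 / (D + 1))
    strat = math.sqrt(math.log(max(math.log2(2 * L), 2.0)) / n)  # stratification over L, guarded for small L
    conf = math.sqrt(math.log(1.0 / delta) / n)                  # confidence term
    return main + strat + conf

def fat_term(n, L, delta, k, D):
    """Scale-sensitive complexity term of Theorem 5 (constants illustrative)."""
    return math.sqrt(2.0 * ((16 * L) ** D * math.log(20 * k) + math.log(2 * L / delta)) / n)

def Delta(n, L, delta, k, D):
    """Combined complexity of Section 3.2.4: the smaller of the two estimates."""
    return min(rad_term(n, L, delta, k, D), fat_term(n, L, delta, k, D))
```

Figure 3 plots precisely this minimum as a function of $n$ for several values of $L$.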
For $k = 2$, Gottlieb et al. (2010) reduced the problem of computing $\hat m(L)$ to that of finding a minimum vertex cover in a bipartite graph (by Kőnig's theorem, the latter is efficiently computable as a maximum matching). We will extend this technique to $k > 2$ as follows. Define the $k$-partite graph $G_L = (\{V^y\}_{y=1}^{k}, E)$, where each vertex set $V^y$ corresponds to the sample points $S_y$ with label $y$. Now, in order for $h \in \mathcal{H}_L$ to be consistent with the points $(X_i, Y_i)$ and $(X_j, Y_j)$ for $Y_i \ne Y_j$, the following relation must hold:

$$L\, d(X_i, X_j) \ge 2. \tag{12}$$

Hence, we define the edges of $G_L$ to consist of all point pairs violating (12):

$$(X_i, X_j) \in E \iff (Y_i \ne Y_j) \wedge (d(X_i, X_j) < 2/L).$$

Since removing either of $X_i, X_j$ in (12) also deletes the violating edge, $\hat m(L)$ is by construction equivalent to the size of the minimum vertex cover of $G_L$. Although minimum vertex cover is NP-hard to compute (and even hard to approximate within a factor of 1.3606; Dinur & Safra, 2005), a 2-approximation may be found in $O(n^2)$ time (Papadimitriou & Steiglitz, 1998). This yields a 2-approximation $\tilde m(L)$ of $\hat m(L)$.

**Optimizing over $L$.** Equipped with an efficient routine for computing $\tilde m(L) \le 2\hat m(L)$, we now seek an $L > 0$ that minimizes

$$Q(L) := \tilde m(L) + \Delta(n, L, \delta). \tag{13}$$

Since the Lipschitz constant induced by the data is determined by the $\binom{n}{2}$ distances among the sample points, we need only consider $O(n^2)$ values of $L$. Rather than brute-force searching over all of these values, Theorem 7 of Gottlieb et al. (2010) shows that, using an $O(\log n)$-step binary search over the values of $L$, one may approximately minimize $Q(L)$, which in turn yields an approximate solution to (11). The resulting procedure has runtime $O(n^2 \log n)$ and guarantees an $\tilde L$ for which

$$Q(\tilde L) \le 4\big[\hat m(L^*) + \Delta(n, L^*, \delta)\big]. \tag{14}$$

**Classifying test points.** Given the nearly optimal Lipschitz constant $\tilde L$ computed above, we construct the approximate (within a factor of 4) empirical risk minimizer $h \in \mathcal{H}_{\tilde L}$. The latter partitions the sample into $S = S_0 \cup S_1$, where $S_1$ consists of the points with which $h$ is consistent and $S_0 = S \setminus S_1$. Evaluating $h$ on a test point amounts to finding its nearest neighbor in $S_1$. Although in general metric spaces nearest-neighbor search requires $\Omega(n)$ time, in doubling spaces an exponential speedup is available via approximate nearest neighbors (see Section 5).
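The following schematic sketch (ours, not the authors' implementation) ties the pieces of this section together: it builds the conflict graph $G_L$, computes the greedy maximal-matching 2-approximation $\tilde m(L)$ to the minimum vertex cover, and scans the $O(n^2)$ candidate values of $L$. The paper instead uses the $O(\log n)$-step binary search of Gottlieb et al. (2010); the callable `Delta` stands for the complexity term (e.g., the sketch after Section 3.2.4), and normalizing the cover size by $n$ so that it is on the same scale as $\Delta$ is our choice.

```python
import itertools

def conflict_edges(X, y, L, dist):
    """Edges of G_L: differently labeled pairs with L * d(X_i, X_j) < 2, i.e. violating (12)."""
    return [(i, j) for i, j in itertools.combinations(range(len(X)), 2)
            if y[i] != y[j] and L * dist(X[i], X[j]) < 2.0]

def greedy_vertex_cover(edges):
    """Maximal-matching 2-approximation to minimum vertex cover."""
    cover = set()
    for i, j in edges:
        if i not in cover and j not in cover:
            cover.update((i, j))          # take both endpoints of an uncovered edge
    return cover

def approximate_srm(X, y, dist, Delta, k, D, delta=0.01):
    """Brute-force scan over candidate Lipschitz constants (schematic).
    Delta: callable (n, L, delta, k, D) -> complexity term."""
    n = len(X)
    # The structure of G_L changes only at L = 2 / d(X_i, X_j) for differently labeled pairs.
    candidates = sorted({2.0 / dist(X[i], X[j])
                         for i, j in itertools.combinations(range(n), 2)
                         if y[i] != y[j] and dist(X[i], X[j]) > 0})
    best = None
    for L in candidates:
        m_tilde = len(greedy_vertex_cover(conflict_edges(X, y, L, dist)))
        Q = m_tilde / n + Delta(n, L, delta, k, D)   # objective (13), cover size normalized by n
        if best is None or Q < best[0]:
            best = (Q, L, m_tilde)
    return best   # (objective value, chosen Lipschitz constant, number of points to discard)
```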
## 5. Extensions

In this section we discuss two approaches that render the methods presented above considerably more efficient, in terms of both runtime and generalization bounds. The first is based on the fact that in doubling spaces, hypothesis evaluation time may be reduced from $O(n)$ to $O(\log n)$ at the expense of a very slight degradation of the generalization bounds. The second relies on a recent metric dimensionality-reduction result. When the data is essentially low-dimensional, with intrinsic dimension much smaller than that of the ambient metric space, both the evaluation runtime and the generalization bounds may be significantly improved, depending essentially on the intrinsic dimension rather than the ambient one.

### 5.1. Exponential speedup via approximate NN

If $(\mathcal{X}, d)$ is a metric space, $E \subseteq \mathcal{X}$, and $x^* \in E$ is a minimizer of $d(x, x')$ over $x' \in E$, then $x^*$ is a nearest neighbor of $x$ in $E$. A simple information-theoretic argument shows that the problem of computing an exact nearest neighbor in general metric spaces has $\Omega(n)$ time complexity. However, an exponential speedup is possible if (i) $\mathcal{X}$ is a doubling space and (ii) one is willing to settle for approximate nearest neighbors. A $(1 + \eta)$-nearest-neighbor oracle returns an $\tilde x \in E$ such that

$$d(x, x^*) \le d(x, \tilde x) \le (1 + \eta)\, d(x, x^*). \tag{15}$$

We will use the fact that in a doubling space one may precompute a $(1 + \eta)$-nearest-neighbor data structure in $\big(2^{O(\mathrm{ddim}(\mathcal{X}))} \log n + \eta^{-O(\mathrm{ddim}(\mathcal{X}))}\big)\, n$ time and evaluate it on a test point in $2^{O(\mathrm{ddim}(\mathcal{X}))} \log n + \eta^{-O(\mathrm{ddim}(\mathcal{X}))}$ time (Cole & Gottlieb, 2006; Har-Peled & Mendel, 2006). (An alternative approximate nearest-neighbor search technique based on Locality Sensitive Hashing (Andoni & Indyk, 2008) is only applicable to vector-represented data and not to general metric spaces.)

The approximate nearest-neighbor oracle induces an $\eta$-approximate version of $g_{\mathrm{NN}}$ defined in (7). After performing SRM as described in Section 4, we are left with a subset $S_1 \subseteq S$ of the sample, which will be used to label test points. More precisely, the predicted label of a test point will be determined by its $\eta$-nearest neighbor in $S_1$. The exponential speedup afforded by approximate nearest neighbors comes at the expense of mildly degraded generalization guarantees. The modified generalization bounds are derived in three steps, whose details are deferred to Appendix D:

(i) We cast the evaluation of $h \in \mathcal{H}_L$ in (8) as a nearest-neighbor calculation, with a corresponding $\tilde h$ induced by the $(1 + \eta)$-approximate nearest-neighbor oracle. The nearest-neighbor formulation of $h$ is essentially the one obtained by von Luxburg & Bousquet (2004):

$$h(x, y) = \frac{1}{2}\left(\min_{(x', y') \in S_1}\big\{\xi(y, y') + L\, d(x, x')\big\} + \max_{(x', y') \in S_1}\big\{\xi(y, y') - L\, d(x, x')\big\}\right), \tag{16}$$

where $\xi(y, y') = 2 \cdot \mathbf{1}_{\{y = y'\}} - 1$.

(ii) We observe a simple relation between $h$ and $\tilde h$:
$$\|h - \tilde h\|_\infty = \sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} |h(x, y) - \tilde h(x, y)| \le 2\eta.$$

(iii) Defining the $2\eta$-perturbed function class $\mathcal{H}_{L, 2\eta} = \big\{T_{[-1,1]}(h') : \|h' - h\|_\infty \le 2\eta,\ h \in \mathcal{H}_L\big\}$, we relate its metric entropy to that of $\mathcal{H}_L$:

**Lemma 6.** For $\varepsilon > 2\eta > 0$, we have $N(\varepsilon, \mathcal{H}_{L, 2\eta}, \|\cdot\|_\infty) \le N(\varepsilon - 2\eta, \mathcal{H}_L, \|\cdot\|_\infty)$.

The metric entropy estimate for $\mathcal{H}_{L, 2\eta}$ readily yields $\eta$-perturbed versions of Theorems 4 and 5. From the standpoint of generalization bounds, the effect of the $\eta$-perturbation on $\mathcal{H}_L$ amounts, roughly speaking, to replacing $L$ by $L(1 + O(\eta))$, which constitutes a rather benign degradation.

### 5.2. Adaptive dimensionality reduction

The generalization bound in (1) and the runtime of our sped-up algorithm in Section 5.1 both depend exponentially on the doubling dimension of the metric space. Hence, even a modest dimensionality reduction could lead to dramatic savings in algorithmic and sample complexities. The standard Euclidean dimensionality-reduction tool, PCA, until recently had no metric analogue, at least not one with rigorous performance guarantees. The technique proposed in Gottlieb et al. (2013b) may roughly be described as a metric analogue of PCA. A set $X = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ inherits the metric $d$ of $\mathcal{X}$, and hence its doubling dimension $\mathrm{ddim}(X)$ is well-defined. We say that $\tilde X = \{\tilde x_1, \ldots, \tilde x_n\} \subseteq \mathcal{X}$ is an $(\alpha, \beta)$-perturbation of $X$ if $\sum_{i=1}^{n} d(x_i, \tilde x_i) \le \alpha$ and $\mathrm{ddim}(\tilde X) \le \beta$. Intuitively, the data is essentially low-dimensional if it admits an $(\alpha, \beta)$-perturbation with small $\alpha, \beta$, which leads to improved Rademacher estimates. The empirical Rademacher complexity of $\mathcal{H}_L$ on a sample $S = (X, Y) \in \mathcal{X}^n \times \mathcal{Y}^n$ is given by

$$\widehat{R}_n(\mathcal{H}_L; S) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}_L} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(X_i, Y_i)\right]$$

and is related to $R_n$ defined in (9) via

$$R_n(\mathcal{H}_L) = \mathbb{E}_S\big[\widehat{R}_n(\mathcal{H}_L; S)\big], \qquad P\big(|R_n - \widehat{R}_n| \ge \varepsilon\big) \le 2\exp(-\varepsilon^2 n / 2),$$

where the identity is obvious and the inequality is a simple consequence of measure concentration (Mohri et al., 2012). Hence, up to small changes in constants, the two may be used interchangeably in generalization bounds such as Theorem 4.
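The concentration inequality just quoted follows from the bounded-differences (McDiarmid) inequality; the following one-line derivation is our addition, using only the fact that functions in $\mathcal{H}_L$ take values in $[-1, 1]$.

```latex
% Replacing a single sample point (X_i, Y_i) changes
%   \widehat R_n(\mathcal H_L; S) = E_\sigma\bigl[\sup_{h \in \mathcal H_L}\tfrac1n\sum_i \sigma_i h(X_i, Y_i)\bigr]
% by at most c_i = 2/n, since |h| <= 1.  McDiarmid's inequality then gives
\[
  P\Bigl(\bigl|\widehat R_n(\mathcal H_L; S) - R_n(\mathcal H_L)\bigr| \ge \varepsilon\Bigr)
  \;\le\; 2\exp\!\Bigl(-\frac{2\varepsilon^2}{\sum_{i=1}^{n} c_i^{2}}\Bigr)
  \;=\; 2\exp\!\Bigl(-\frac{2\varepsilon^2}{n\,(2/n)^2}\Bigr)
  \;=\; 2\exp\!\bigl(-\varepsilon^2 n/2\bigr).
\]
```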
The data-dependent nature of $\widehat{R}_n$ lets us exploit essentially low-dimensional data (see Appendix E):

**Theorem 7.** Let $S = (X, Y) \in \mathcal{X}^n \times \mathcal{Y}^n$ be the training sample and suppose that $X$ admits an $(\alpha, \beta)$-perturbation $\tilde X$. Then
$$\widehat{R}_n(\mathcal{H}_L; S) = O\!\left(\frac{L\alpha}{n} + c_\beta\, L \left(\frac{\log k}{n}\right)^{\frac{1}{\beta + 1}}\right). \tag{17}$$

A pleasant feature of the bound above is that it does not depend on $\mathrm{ddim}(\mathcal{X})$ (the dimension of the ambient space) or even on $\mathrm{ddim}(X)$ (the dimension of the data). Note the inherent tradeoff between the distortion $\alpha$ and the dimension $\beta$, with some non-trivial $(\alpha^*, \beta^*)$ minimizing the right-hand side of (17). Although computing the optimal $(\alpha^*, \beta^*)$ seems computationally difficult, Gottlieb et al. (2013b) were able to obtain an efficient $(O(1), O(1))$-bicriteria approximation. Namely, their algorithm computes an $\tilde\alpha \le c_0 \alpha^*$ and $\tilde\beta \le c_1 \beta^*$, together with the corresponding perturbed set $\tilde X$, for universal constants $c_0, c_1$, with a runtime of $2^{O(\mathrm{ddim}(\mathcal{X}))} n \log n + O(n \log^5 n)$. The optimization routine over $(\alpha, \beta)$ may then be embedded inside our SRM optimization over the Lipschitz constant $L$ in Section 4. The end result is a nearly optimal (in the sense of (14)) Lipschitz constant $\tilde L$, which induces the partition $S = S_0 \cup S_1$, as well as $(\tilde\alpha, \tilde\beta)$, which induce the perturbed set $\tilde S_1$. To evaluate our hypothesis on a test point, we may invoke the $(1 + \eta)$-approximate nearest-neighbor routine from Section 5.1. This involves a precomputation of time complexity $\big(2^{O(\tilde\beta)} \log n + \eta^{-O(\tilde\beta)}\big)\, n$, after which new points are classified in $2^{O(\tilde\beta)} \log n + \eta^{-O(\tilde\beta)}$ time. Note that the evaluation time complexity depends only on the intrinsic dimension $\tilde\beta$ of the data, rather than on the ambient metric space dimension.

## Acknowledgements

A.K. was partially supported by the Israel Science Foundation (grant No. 1141/12) and a Yahoo Faculty award. R.W. was partially supported by the Lynne and William Frankel Center for Computer Science and a Kreitman Negev Fellowship.

## References

Allwein, E. L., Schapire, R. E., and Singer, Y. Reducing multiclass to binary: A unifying approach for margin classifiers. JMLR, 1:113-141, 2001.
Andoni, A. and Krauthgamer, R. The computational hardness of estimating edit distance. SICOMP, 39(6), 2010.
Anthony, M. and Bartlett, P. Neural Network Learning: Theoretical Foundations. 1999.
Bartlett, P. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
Ben-David, S. and Shalev-Shwartz, S. Understanding Machine Learning: From Theory to Algorithms. 2014.
Ben-David, S., Cesa-Bianchi, N., Haussler, D., and Long, P. Characterizations of learnability for classes of {0, ..., n}-valued functions. J. CSS, 50(1), 1995.
Beygelzimer, A., Langford, J., and Ravikumar, P. Error-correcting tournaments. ALT, 2009.
Cole, R. and Gottlieb, L. Searching dynamic point sets in spaces with bounded doubling dimension. STOC, 2006.
Cortes, C., Mohri, M., and Rostamizadeh, A. Multi-class classification with maximum margin multiple kernel. ICML, 2013.
Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Trans. Info. Theo., 13(1):21-27, 1967.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265-292, 2002a.
Crammer, K. and Singer, Y. On the learnability and design of output codes for multiclass problems. Mach. Learn., 47(2-3):201-233, 2002b.
Crammer, K., Gilad-Bachrach, R., Navot, A., and Tishby, N. Margin analysis of the LVQ algorithm. NIPS, 2002.
Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. JMLR - Proceedings Track, 19:207-232, 2011.
Dinur, I. and Safra, S. On the hardness of approximating minimum vertex cover. Ann. Math., 162(1), 2005.
Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Func. Anal., 1(3):290-330, 1967.
El-Yaniv, R., Pechyony, D., and Yom-Tov, E. Better multiclass classification via a margin-optimized single binary problem. Patt. Rec. Lett., 29(14):1954-1959, 2008.
Enflo, P. On the nonexistence of uniform homeomorphisms between Lp-spaces. Ark. Mat., 8:103-105, 1969.
Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Efficient classification for metric data. COLT, 2010.
Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Efficient regression in metric spaces via approximate Lipschitz extension. SIMBAD, 2013a.
Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Adaptive metric dimensionality reduction. ALT, 2013b.
Guermeur, Y. VC theory of large margin multi-category classifiers. JMLR, 8:2551-2594, 2007.
Guermeur, Y. Sample complexity of classifiers taking values in R^Q, application to multi-class SVMs. Comm. Statist. Theory Methods, 39(3):543-557, 2010.
Har-Peled, S. and Mendel, M. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comp., 35(5):1148-1184, 2006.
Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.
Kolmogorov, A. and Tikhomirov, V. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3-86, 1959.
Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30(1):1-50, 2002.
Krauthgamer, R. and Lee, J. Navigating nets: Simple algorithms for proximity search. SODA, 2004.
Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. COLT, 2005.
McShane, E. J. Extension of range of functions. Bull. Amer. Math. Soc., 40(12):837-842, 1934.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. The MIT Press, 2012.
Naor, A. and Schechtman, G. Planar earthmover is not in L1. SICOMP, 37:804-826, 2007.
Papadimitriou, C. and Steiglitz, K. Combinatorial Optimization: Algorithms and Complexity. 1998.
Rifkin, R. and Klautau, A. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.
von Luxburg, U. and Bousquet, O. Distance-based classification with Lipschitz functions. JMLR, 5:669-695, 2004.
Weston, J. and Watkins, C. Support vector machines for multi-class pattern recognition. ESANN, 61-72, 1999.
Whitney, H. Analytic extensions of differentiable functions defined in closed sets. Trans. Amer. Math. Soc., 36(1):63-89, 1934.
Zhang, T. Covering number bounds of certain regularized linear function classes. JMLR, 2:527-550, 2002.
Zhang, T. Statistical analysis of some multi-category large margin classification methods. JMLR, 5, 2004.