# Maximum Margin Multiclass Nearest Neighbors

Aryeh Kontorovich (KARYEH@CS.BGU.AC.IL), Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel
Roi Weiss (ROIWEI@CS.BGU.AC.IL), Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel

*Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).*

## Abstract

We develop a general framework for margin-based multicategory classification in metric spaces. The basic work-horse is a margin-regularized version of the nearest-neighbor classifier. We prove generalization bounds that match the state of the art in sample size $n$ and significantly improve the dependence on the number of classes $k$. Our point of departure is a nearly Bayes-optimal finite-sample risk bound independent of $k$. Although $k$-free, this bound is unregularized and non-adaptive, which motivates our main result: Rademacher and scale-sensitive margin bounds with a logarithmic dependence on $k$. As the best previous risk estimates in this setting were of order $k$, our bound is exponentially sharper. From the algorithmic standpoint, in doubling metric spaces our classifier may be trained on $n$ examples in $O(n^2 \log n)$ time and evaluated on new points in $O(\log n)$ time.

## 1. Introduction

Whereas the theory of supervised binary classification is by now fairly well developed, its multiclass extension continues to pose numerous novel statistical and computational challenges. On the algorithmic front, there is the basic question of how to adapt the hyperplane and kernel methods, ideally suited for two classes, to three or more. A host of new problems also arise on the statistical front. In the binary case, the VC-dimension characterizes the distribution-free sample complexity (Anthony & Bartlett, 1999), and tighter distribution-dependent bounds are available via Rademacher techniques (Bartlett & Mendelson, 2002; Koltchinskii & Panchenko, 2002). Characterizing the multiclass distribution-free sample complexity is far less straightforward, though impressive progress has been made recently (Daniely et al., 2011).

Following von Luxburg & Bousquet (2004) and Gottlieb et al. (2010), we adopt a proximity-based approach to supervised multicategory classification in metric spaces. The principal motivation for this framework is two-fold: (i) Many natural metrics, such as $L_1$, earthmover, and edit distance, cannot be embedded in a Hilbert space without a large distortion (Enflo, 1969; Naor & Schechtman, 2007; Andoni & Krauthgamer, 2010). Any kernel method is thus a priori at a disadvantage when learning to classify non-Hilbertian objects, since it cannot faithfully represent the data geometry. (ii) Nearest neighbor-based classification sidesteps the issue of $k$-to-binary reductions which, despite voluminous research, is still the subject of vigorous debate (Rifkin & Klautau, 2004; El-Yaniv et al., 2008). In terms of time complexity, the reductions approach faces an $\Omega(k)$ information-theoretic lower bound (Beygelzimer et al., 2009), while nearest neighbors admit solutions whose runtime does not depend on the number of classes.

**Main results.** Our contributions are both statistical and algorithmic in nature.
On the statistical front, we open with the observation that the nearest-neighbor classifier's expected risk is at most twice the Bayes-optimal risk plus a term that decays with the sample size at a rate independent of the number of classes $k$ (and continues to hold for $k = \infty$; Theorem 1). Although of interest as apparently the first $k$-free finite-sample result, it has the drawback of being non-adaptive, in the sense of depending on properties of the unknown sampling distribution and failing to provide the learner with a usable data-dependent bound. This difficulty is overcome in our main technical contribution (Theorems 4 and 5), where we give a margin-based multiclass bound of order

$$\min\left\{ \frac{1}{\gamma}\left(\frac{\log k}{n}\right)^{\frac{1}{D+1}},\ \frac{1}{\gamma^{D/2}}\sqrt{\frac{\log k}{n}} \right\}, \tag{1}$$

where $k$ is the number of classes, $n$ is the sample size, $D$ is the doubling dimension of the metric instance space, and $0 < \gamma \le 1$ is the margin. This matches the state-of-the-art asymptotics in $n$ for metric spaces and significantly improves the dependence on $k$, which hitherto was of order $k$ (Zhang, 2002; 2004) or worse. The exponential dependence on some covering dimension (such as $D$) is in general inevitable, as shown by a standard no-free-lunch argument (Ben-David & Shalev-Shwartz, 2014), but whether (1) is optimal remains an open question.

On the algorithmic front, using the above bounds, we show how to efficiently perform Structural Risk Minimization (SRM) so as to avoid overfitting. This involves deciding how many and which sample points one is allowed to err on. We reduce this problem to minimum vertex cover, which admits a greedy 2-approximation. Our algorithm admits a significantly faster $\varepsilon$-approximate version in doubling spaces, with a graceful degradation in $\varepsilon$ of the generalization bounds, based on approximate nearest-neighbor techniques developed by Gottlieb et al. (2010; 2013a). For doubling dimension $D$ and accuracy $\varepsilon$, our runtime is $O(2^{O(D)} n^2 \log n)$ for learning and $O(2^{O(D)} \log n)$ for evaluation on a test point. (Exact nearest neighbor requires $\Theta(n)$ evaluation time.) Finally, our generalization bounds and algorithm can be made adaptive to the intrinsic dimension of the data via a recent metric dimensionality-reduction technique (Gottlieb et al., 2013b).

**Related work.** Due to space constraints, we are only able to mention the most directly relevant results, and even these not in full generality, but rather with an eye to facilitating comparison with the present work. Supervised $k$-category classification approaches follow two basic paradigms: (I) defining a score function on point-label pairs and classifying by choosing the label with the optimal score, and (II) reducing the problem to several binary classification problems. Regarding the second paradigm, the seminal paper of Allwein et al. (2001) unified the various error-correcting output code (ECOC) based multiclass-to-binary reductions under a single margin-based framework. Their generalization bound requires the base classifier to have finite VC-dimension $d_{\mathrm{VC}} < \infty$ (and hence does not apply to nearest neighbors or infinite-dimensional Hilbert spaces) and is of the form $O\big(\sqrt{\log k / n}\big)$. Langford & Beygelzimer (2005) and Beygelzimer et al. (2009) gave $k$-free and $O(\log k)$ regret bounds, but these are conditional on the performance of the underlying binary classifiers, as opposed to the unconditional bounds we provide in this paper.
As for the first paradigm, proximity is perhaps the most natural score function; indeed, a formal analysis of the nearest-neighbor classifier (Cover & Hart, 1967) much predated the first multiclass extensions of SVM (Weston & Watkins, 1999). Crammer & Singer (2002a;b) considerably reduced the computational complexity of the latter approach and gave a risk bound decaying as $O(k^2/(n\gamma^2))$ for the separable case with margin $\gamma$. In an alternative approach based on choosing $q$ prototype examples, Crammer et al. (2002) gave a risk bound with rate $O\big(\sqrt{qk/2}\,/(\gamma\sqrt{n})\big)$. Ben-David et al. (1995) characterized the PAC learnability of $k$-valued functions in terms of combinatorial dimensions, such as the Natarajan dimension $d_{\mathrm{Nat}}$. Guermeur (2007; 2010) gave scale-sensitive analogues of these dimensions. He gave a risk bound decaying as $O\big(\sqrt{d^{\gamma}_{\mathrm{Nat}} \log k / n}\big)$, where $d^{\gamma}_{\mathrm{Nat}}$ is a scale-sensitive Natarajan dimension, essentially replacing the finite VC dimension $d_{\mathrm{VC}}$ in Allwein et al. (2001) by $d^{\gamma}_{\mathrm{Nat}}$. He further showed that for linear function classes in Hilbert spaces, $d^{\gamma}_{\mathrm{Nat}}$ is bounded by $O(k^2/\gamma^2)$, resulting in a risk bound decaying as $O\big(k/(\gamma^2\sqrt{n})\big)$. To the best of our knowledge, the sharpest current estimate on the Natarajan dimension (for some special function classes) is $d_{\mathrm{Nat}} = O(k)$, with a matching lower bound of $\Omega(k)$ (Daniely et al., 2011). A margin-based Rademacher analysis of score functions (Mohri et al., 2012) yields a bound of order $O\big(k^2/(\gamma\sqrt{n})\big)$, and this is also the $k$-dependence obtained by Cortes et al. (2013) in a recent paper proposing a multiple-kernel approach to multiclass learning. Closest in spirit to our work are the results of Zhang (2002; 2004), who used the chaining technique to obtain a Rademacher complexity with a $1/\gamma$ margin dependence and a linear dependence on $k$.

Besides the dichotomy of score functions vs. multiclass-to-binary reductions outlined above, multicategory risk bounds may also be grouped by the trichotomy of (a) combinatorial dimensions, (b) Hilbert spaces, and (c) metric spaces (see Table 1).

| Paper | decay rate $O(\cdot)$ | group |
|---|---|---|
| Allwein et al. (2001) | $\sqrt{\log k / n}$ | (II, a) |
| Daniely et al. (2011) | $\sqrt{d_{\mathrm{Nat}} \log k / n}$ | (a) |
| Guermeur (2010) | $\sqrt{d^{\gamma}_{\mathrm{Nat}} \log k / n}$ | (I, a) |
| Crammer & Singer (2002b) | $k^2/(\gamma^2 n)$ | (I, b) |
| Cortes et al. (2013) | $k^2/(\gamma\sqrt{n})$ | (I, b) |
| Guermeur (2010) | $k/(\gamma^2\sqrt{n})$ | (I, b) |
| Zhang (2004) | $k/(\gamma\sqrt{n})$ | (I, b) |
| current paper | $\frac{1}{\gamma^{D/2}}\sqrt{\log k / n}$ | (I, c) |
| current paper | $\frac{1}{\gamma}\left(\frac{\log k}{n}\right)^{\frac{1}{1+D}}$ | (I, c) |

Table 1. Comparing various multiclass bounds. (*) Not margin-based. (†) Only for the separable case. (‡) Combinatorial dimension depends on $k$.

Category (a) comprises algorithm-independent results that give generalization bounds in terms of some combinatorial dimension of a fixed concept class (Allwein et al., 2001; Ben-David et al., 1995; Guermeur, 2007; 2010; Daniely et al., 2011). Multiclass extensions of SVM and related kernel methods (Weston & Watkins, 1999; Crammer & Singer, 2002a;b; Crammer et al., 2002; Cortes et al., 2013) fall into category (b). Category (c), consisting of agnostic (in the sense of not requiring an a priori fixed concept class) metric-space methods, is the most sparsely populated. The pioneering asymptotic analysis of Cover & Hart (1967) was cast in a modern, finite-sample form by Ben-David & Shalev-Shwartz (2014), but only for binary classification. Unlike Hilbert spaces, which admit dimension-free margin bounds, we are not aware of any metric-space risk bound that does not explicitly depend on some metric dimension $D$ or covering numbers. The bounds in Ben-David & Shalev-Shwartz (2014) and Gottlieb et al.
(2013b) exhibit a characteristic curse-of-dimensionality decay rate of $O(n^{-1/(D+1)})$, but more optimistic asymptotics can be obtained (Guermeur, 2007; 2010; Zhang, 2002; 2004; Gottlieb et al., 2010). Although some sample lower bounds for proximity-based methods are known (Ben-David & Shalev-Shwartz, 2014), the optimal dependence on $D$ and $k$ is far from being fully understood.

## 2. Preliminaries

**Metric spaces.** Given two metric spaces $(\mathcal{X}, d)$ and $(\mathcal{Z}, \rho)$, a function $f : \mathcal{X} \to \mathcal{Z}$ is called $L$-Lipschitz if $\rho(f(x), f(x')) \le L\, d(x, x')$ for all $x, x' \in \mathcal{X}$. (The real line $\mathbb{R}$ is always considered with its Euclidean metric $|\cdot|$.) The Lipschitz constant of $f$, denoted $\|f\|_{\mathrm{Lip}}$, is the smallest $L$ for which $f$ is $L$-Lipschitz. The distance between two sets $A, B \subseteq \mathcal{X}$ is defined by $d(A, B) = \inf_{x \in A,\, x' \in B} d(x, x')$. For a metric space $(\mathcal{X}, d)$, let $\lambda$ be the smallest value such that every ball in $\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $\mathcal{X}$ is $\mathrm{ddim}(\mathcal{X}) := \log_2 \lambda$. A metric is doubling when its doubling dimension is bounded. The $\varepsilon$-covering number of a metric space $(\mathcal{X}, d)$, denoted $N(\varepsilon, \mathcal{X}, d)$, is defined as the smallest number of balls of radius $\varepsilon$ that suffices to cover $\mathcal{X}$. It can be shown (e.g., Krauthgamer & Lee (2004)) that

$$N(\varepsilon, \mathcal{X}, d) \le \left(\frac{2\,\mathrm{diam}(\mathcal{X})}{\varepsilon}\right)^{\mathrm{ddim}(\mathcal{X})}, \tag{2}$$

where $\mathrm{diam}(\mathcal{X}) = \sup_{x, x' \in \mathcal{X}} d(x, x')$ is the diameter of $\mathcal{X}$.

**The multiclass learning framework.** Let $(\mathcal{X}, d)$ be a metric instance space with $\mathrm{diam}(\mathcal{X}) = 1$ and $\mathrm{ddim}(\mathcal{X}) = D < \infty$, and let $\mathcal{Y} \subseteq \mathbb{N}$ be an at most countable label set. We observe a sample $S = (X_i, Y_i)_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^n$ drawn iid from an unknown distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. In line with paradigm (I) outlined in the Introduction, our classification procedure consists of optimizing a score function. In hindsight, the score at a test point will be determined by its labeled neighbors, but for now we consider an unspecified collection $\mathcal{F}$ of functions mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$. A score function $f \in \mathcal{F}$ induces the classifier $g_f : \mathcal{X} \to \mathcal{Y}$ via

$$g_f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y), \tag{3}$$

breaking ties arbitrarily. The margin of $f \in \mathcal{F}$ on $(x, y)$ is defined by

$$\gamma_f(x, y) = \frac{1}{2}\Big(f(x, y) - \sup_{y' \ne y} f(x, y')\Big). \tag{4}$$

Note that $g_f$ misclassifies $(x, y)$ precisely when $\gamma_f(x, y) < 0$. One of our main objectives is to upper-bound the generalization error $P(g_f(X) \ne Y) = \mathbb{E}\big[\mathbf{1}_{\{\gamma_f(X, Y) < 0\}}\big]$. To this end, we introduce the surrogate loss functions $\mathcal{L} : \mathbb{R} \to \mathbb{R}_+$,

$$\mathcal{L}_{\mathrm{cutoff}}(u) = \mathbf{1}_{\{u < 1\}}, \qquad \mathcal{L}_{\mathrm{margin}}(u) = T_{[0,1]}(1 - u), \qquad \text{where} \quad T_{[a,b]}(z) = \max\{a, \min\{b, z\}\} \tag{5}$$

is the truncation operator. The empirical loss $\widehat{\mathbb{E}}[\mathcal{L}(\gamma_f)]$ induced by any of the loss functions above is $\frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\gamma_f(X_i, Y_i))$. All probabilities $P(\cdot)$ and expectations $\mathbb{E}[\cdot]$ are with respect to the sampling distribution $P$. We will write $\mathbb{E}_S$ to indicate expectation over a sample (i.e., over $P^n$).
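To make the definitions above concrete, here is a small NumPy sketch (ours, not from the paper) of the classifier (3), the margin (4) with the $\tfrac12$ factor as reconstructed above, the truncation operator and surrogate losses (5), and the empirical losses. The array layout and function names are illustrative assumptions.

```python
import numpy as np

def classify(scores):
    """Classifier g_f of (3): predict the label with the largest score.
    scores[i, y] = f(X_i, y) for labels y in {0, ..., k-1}."""
    return scores.argmax(axis=1)

def margins(scores, y):
    """Margins gamma_f(X_i, Y_i) of (4): (f(x, y) - max_{y' != y} f(x, y')) / 2.
    y is an integer array of true labels."""
    n = scores.shape[0]
    true_scores = scores[np.arange(n), y]
    rivals = scores.copy()
    rivals[np.arange(n), y] = -np.inf            # exclude the true label
    return 0.5 * (true_scores - rivals.max(axis=1))

def truncate(z, a, b):
    """Truncation operator T_[a,b](z) = max{a, min{b, z}} of (5)."""
    return np.clip(z, a, b)

def empirical_losses(scores, y):
    """Empirical surrogate losses (1/n) sum_i L(gamma_f(X_i, Y_i))."""
    g = margins(scores, y)
    cutoff_loss = np.mean(g < 1.0)                       # L_cutoff(u) = 1{u < 1}
    margin_loss = np.mean(truncate(1.0 - g, 0.0, 1.0))   # L_margin(u) = T_[0,1](1 - u)
    return cutoff_loss, margin_loss
```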
## 3. Risk bounds

In this section we analyze the statistical properties of nearest-neighbor multicategory classifiers in metric spaces. In Section 3.1, Theorem 1, we record the observation that the 1-nearest-neighbor classifier is nearly Bayes optimal, with a risk decay that does not depend on the number of classes $k$. Of course, the naive 1-nearest-neighbor classifier is well known to overfit. This is reflected in the non-adaptive nature of the analysis: the bound is stated in terms of properties of the unknown sampling distribution, and fails to provide the learner with a usable data-dependent bound. To achieve the latter goal, we develop a margin analysis in Section 3.2. Our main technical result is Lemma 2, from which the logarithmic dependence on $k$ claimed in (1) follows. Although not $k$-free like the Bayes excess-risk bound of Theorem 1, $O(\log k)$ is exponentially sharper than the current state of the art (Zhang, 2002; 2004). Whether a $k$-free metric entropy bound is possible is left as an open problem. The metric entropy bound of Lemma 2 facilitates two approaches to bounding the risk: via Rademacher complexity (Section 3.2.2) and via scale-sensitive techniques in the spirit of Guermeur (2007) (Section 3.2.3). In Section 3.2.4 we combine these two margin bounds by taking their minimum. The resulting bound will be used in Section 4 to perform efficient Structural Risk Minimization.

### 3.1. Multiclass Bayes near-optimality

In this section, $(\mathcal{X}, d)$ is a metric space and $\mathcal{Y}$ is an at most countable (possibly infinite) label set. A sample $S = (X_i, Y_i)_{i=1}^{n}$ is drawn iid from an unknown distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. For $x \in \mathcal{X}$, let $(X_{\pi_1(x)}, Y_{\pi_1(x)})$ be its nearest neighbor in $S$: $\pi_1(x) = \operatorname{argmin}_{i \in [n]} d(X_i, x)$. Thus, the nearest-neighbor classifier $g_{\mathrm{NN}}$ is given by

$$g_{\mathrm{NN}}(x) = Y_{\pi_1(x)}. \tag{6}$$

Define the function $\eta : \mathcal{X} \to \mathbb{R}^{\mathcal{Y}}$ by $\eta_y(x) = P(Y = y \mid X = x)$. The Bayes-optimal classifier $g^*$ (i.e., one that minimizes $P(g(X) \ne Y)$ over all measurable $g \in \mathcal{Y}^{\mathcal{X}}$) is well known to have the form $g^*(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \eta_y(x)$, where ties are broken arbitrarily. Our only distributional assumption is that $\eta$ is $L$-Lipschitz with respect to the sup-norm. Namely, for all $x, x' \in \mathcal{X}$,

$$\|\eta(x) - \eta(x')\|_\infty = \sup_{y \in \mathcal{Y}} |\eta_y(x) - \eta_y(x')| \le L\, d(x, x').$$

This is a direct analogue of the Lipschitz assumption for the binary case (Cover & Hart, 1967; Ben-David & Shalev-Shwartz, 2014). We make the additional standard assumption that $\mathcal{X}$ has finite doubling dimension: $\mathrm{ddim}(\mathcal{X}) = D < \infty$. The Lipschitz and doubling assumptions suffice to extend the finite-sample analysis of binary nearest neighbors (Ben-David & Shalev-Shwartz, 2014) to the multiclass case:

**Theorem 1.**
$$\mathbb{E}_S\big[P(g_{\mathrm{NN}}(X) \ne Y)\big] \le 2\, P(g^*(X) \ne Y) + \frac{4L}{n^{1/(D+1)}}.$$

Note that the bound is independent of the number of classes $k$ and holds even for $k = \infty$. The proof is deferred to Appendix A.

[Figure 1. The mapping in Lemma 2 with $|\mathcal{Y}| = 3$ (panels: $h_f(x, y)$ and $\tilde h(x)$).]

### 3.2. Multiclass margin bounds

Here again $(\mathcal{X}, d)$ is a metric space, but now the label set $\mathcal{Y}$ is assumed finite: $|\mathcal{Y}| = k < \infty$. As before, $S = (X_i, Y_i)_{i=1}^{n}$ with $(X_i, Y_i) \sim P$ iid. It will be convenient to write $S_y = \{X_i : Y_i = y,\ i \in [n]\}$ for the subset of examples with label $y$. The metric induces the natural score function $f_{\mathrm{NN}}(x, y) = -d(x, S_y)$, with corresponding nearest-neighbor classifier

$$g_{\mathrm{NN}}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} f_{\mathrm{NN}}(x, y), \tag{7}$$

easily seen to be identical to the one in (6). At this point we make the simple but crucial observation that the function $f_{\mathrm{NN}}(\cdot, y) : \mathcal{X} \to \mathbb{R}$ is 1-Lipschitz. This will enable us to generalize the powerful Lipschitz extension framework of von Luxburg & Bousquet (2004) to $|\mathcal{Y}| > 2$.

We will need a few definitions. Let $F_L$ be the collection of all $L$-Lipschitz functions from $\mathcal{X}$ to $\mathbb{R}$ and put $\mathcal{F}_L = F_L^{\mathcal{Y}}$. Since each $f \in \mathcal{F}_L$ maps $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, the margin $\gamma_f(x, y)$ is well defined via (4). Putting

$$y^*_f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y), \qquad \gamma^*_f(x) = \gamma_f(x, y^*_f(x)),$$

we define the projection

$$\Phi f(x, y) = \begin{cases} \gamma^*_f(x), & \text{if } y = y^*_f(x), \\ -\gamma^*_f(x), & \text{otherwise.} \end{cases}$$

Finally, we define $\mathcal{H}_L$ as the truncated (as in (5)) projections of functions in $\mathcal{F}_L$:

$$\mathcal{H}_L = \Big\{ (x, y) \mapsto T_{[-1,1]}\big(\Phi f(x, y)\big) \,:\, f \in \mathcal{F}_L \Big\}. \tag{8}$$

Thus, $\mathcal{H}_L$ is a set of functions $h_f : \mathcal{X} \times \mathcal{Y} \to [-1, 1]$, where each $h_f(\cdot, y)$ is $L$-Lipschitz and $h_f(x, y) = \pm T_{[-1,1]}(\gamma^*_f(x))$, depending upon whether $y = y^*_f(x)$; see Figure 1 (left).

[Figure 2. The metric $\rho(\tilde h(x), \tilde h'(x))$ with $|\mathcal{Y}| = 3$.]
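The proof of Lemma 2 below reparametrizes each $h_f$ as $\tilde h_f = (y^*_f, \gamma^*_f)$ and treats it as an $L$-Lipschitz map from $(\mathcal{X}, d)$ into the star-like metric $\rho$ of Figure 2 (defined in the proof). The following short verification is our addition, not part of the paper's argument; it also serves as a sanity check of the $\tfrac12$ factor in the margin definition (4) as reconstructed above. Fix $f \in \mathcal{F}_L$ and $x, x' \in \mathcal{X}$.

```latex
% Case y^*_f(x) = y^*_f(x') =: \hat y.  Write u(z) = f(z,\hat y) and v(z) = \max_{y \ne \hat y} f(z,y);
% both are L-Lipschitz, and \gamma^*_f = \tfrac12(u - v) at x and x', so
\[
  \rho\big(\tilde h_f(x), \tilde h_f(x')\big)
  = \big|\gamma^*_f(x) - \gamma^*_f(x')\big|
  \le \tfrac12\big(|u(x) - u(x')| + |v(x) - v(x')|\big)
  \le L\, d(x, x').
\]
% Case y := y^*_f(x) \ne y' := y^*_f(x').  Then \gamma^*_f(x) \le \tfrac12(f(x,y) - f(x,y')) and
% \gamma^*_f(x') \le \tfrac12(f(x',y') - f(x',y)), so
\[
  \rho\big(\tilde h_f(x), \tilde h_f(x')\big)
  = \gamma^*_f(x) + \gamma^*_f(x')
  \le \tfrac12\big[\big(f(x,y) - f(x',y)\big) + \big(f(x',y') - f(x,y')\big)\big]
  \le L\, d(x, x').
\]
% The same triangle-inequality argument gives |d(x,S_y) - d(x',S_y)| \le d(x,x'),
% i.e., f_NN(., y) = -d(., S_y) is 1-Lipschitz, so f_NN lies in F_1.
```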
#### 3.2.1. Bounding the metric entropy

Our main technical result is a bound on the metric entropy of $\mathcal{H}_L$, which will be used to obtain error bounds (Theorems 4 and 5) for classifiers derived from this function class. The analysis differs from previous bounds (see Table 1) by explicitly taking advantage of the mutually exclusive nature of the labels, obtaining an exponential improvement in terms of the number of classes $k$. Endow $\mathcal{H}_L$ with the sup-norm $\|h - h'\|_\infty = \sup_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} |h(x, y) - h'(x, y)|$.

**Lemma 2.** For any $\varepsilon > 0$,
$$\log N(\varepsilon, \mathcal{H}_L, \|\cdot\|_\infty) \le \left(\frac{16L}{\varepsilon}\right)^{D} \log\!\left(\left\lceil \frac{4k}{\varepsilon} \right\rceil + 1\right).$$

*Proof.* By the definition of $\mathcal{H}_L$, for all $h_f \in \mathcal{H}_L$ and $x \in \mathcal{X}$ there is at most one $y \in \mathcal{Y}$ such that $h_f(x, y) > 0$. In addition, if $h_f(x, y) = c > 0$, then $h_f(x, y') = -c$ for all $y' \ne y$. Since $\gamma^*_f(x) \ge 0$, we may reparametrize $h_f(x, \cdot)$ by $(y^*_f(x), \gamma^*_f(x)) \in \mathcal{Y} \times [0, 1]$; see Figure 1. To complete the mapping $h_f \mapsto (y^*_f, \gamma^*_f)$, define the following star-like metric $\rho$ over $\mathcal{Y} \times [0, 1]$ (see Figure 2):

$$\rho\big((y, \gamma), (y', \gamma')\big) = \begin{cases} |\gamma - \gamma'|, & y = y', \\ \gamma + \gamma', & y \ne y'. \end{cases}$$

Let $\widetilde{\mathcal{H}}_L$ be the collection of functions $\tilde h : \mathcal{X} \to \mathcal{Y} \times [0, 1]$ that are $L$-Lipschitz: $\rho(\tilde h(x), \tilde h(x')) \le L\, d(x, x')$ for all $x, x' \in \mathcal{X}$. It is easily verified that the metric space $(\mathcal{H}_L, \|\cdot\|_\infty)$ is isometric to $(\widetilde{\mathcal{H}}_L, \rho_\infty)$ with $\rho_\infty(\tilde h, \tilde h') = \sup_{x \in \mathcal{X}} \rho(\tilde h(x), \tilde h'(x))$. Thus, $N(\varepsilon, \mathcal{H}_L, \|\cdot\|_\infty) = N(\varepsilon, \widetilde{\mathcal{H}}_L, \rho_\infty)$, and we proceed to bound the latter. (The remainder of the proof is based on a technique communicated to us by R. Krauthgamer, a variant of the classic Kolmogorov & Tikhomirov (1959) method.)

Fix a covering of $\mathcal{X}$ consisting of $|N| = N(\varepsilon/8L, \mathcal{X}, d)$ balls $\{U_1, \ldots, U_{|N|}\}$ of radius $\varepsilon' = \varepsilon/8L$, and choose $|N|$ points $N = \{x_i \in U_i\}_{i=1}^{|N|}$. Construct $\widehat{\mathcal{H}} \subset \widetilde{\mathcal{H}}_{2L}$ as follows. At every point $x_i \in N$, select one of the classes $y \in \mathcal{Y}$ and set $\hat h(x_i) = (y, \gamma(x_i))$ with $\gamma(x_i)$ some multiple of $2L\varepsilon' = \varepsilon/4$, while maintaining $\|\hat h\|_{\mathrm{Lip}} \le 2L$. Construct a $2L$-Lipschitz extension of $\hat h$ from $N$ to all of $\mathcal{X}$ (such an extension always exists; McShane, 1934; Whitney, 1934). We claim that every classifier in $\mathcal{H}_L$, via its twin $\tilde h \in \widetilde{\mathcal{H}}_L$, is close to some $\hat h \in \widehat{\mathcal{H}}$, in the sense that $\rho_\infty(\tilde h, \hat h) \le \varepsilon$. Indeed, every point $x \in \mathcal{X}$ is $2\varepsilon'$-close to some point $x_i \in N$, and since $\tilde h$ is $L$-Lipschitz and $\hat h$ is $2L$-Lipschitz,

$$\rho(\tilde h(x), \hat h(x)) \le \rho(\tilde h(x), \tilde h(x_i)) + \rho(\tilde h(x_i), \hat h(x_i)) + \rho(\hat h(x_i), \hat h(x)) \le L\, d(x, x_i) + \varepsilon/4 + 2L\, d(x, x_i) \le \varepsilon.$$

Thus, $\widehat{\mathcal{H}}$ provides an $\varepsilon$-cover for $\widetilde{\mathcal{H}}_L$ (and hence for $\mathcal{H}_L$). Note that $|\widehat{\mathcal{H}}| \le \big(\lceil 4k/\varepsilon \rceil + 1\big)^{|N|}$, since by construction the functions $\hat h$ are determined by their values on $N$, which at a given point can take one of $\lceil 4k/\varepsilon \rceil + 1$ possible values. Since by (2) we have $|N| = N(\varepsilon/8L, \mathcal{X}, d) \le \big(\tfrac{16L}{\varepsilon}\big)^{D}$, the bound follows. ∎

A tighter bound is possible when the metric space $(\mathcal{X}, d)$ possesses two additional properties:

1. $(\mathcal{X}, d)$ is *connected* if for all $x, x' \in \mathcal{X}$ and all $\varepsilon > 0$, there is a finite sequence of points $x = x_1, x_2, \ldots, x_m = x'$ such that $d(x_i, x_{i+1}) < \varepsilon$ for all $1 \le i < m$.
2. $(\mathcal{X}, d)$ is *centered* if for all $r > 0$ and all $A \subseteq \mathcal{X}$ with $\mathrm{diam}(A) \le 2r$, there exists a point $x \in \mathcal{X}$ such that $d(x, a) \le r$ for all $a \in A$.

**Lemma 3.** If $(\mathcal{X}, d)$ is connected and centered, then
$$\log N(\varepsilon, \mathcal{H}_L, \|\cdot\|_\infty) = O\!\left(\left(\frac{16L}{\varepsilon}\right)^{D} \log k + \log \frac{1}{\varepsilon}\right).$$

*Proof.* With the additional assumptions on $\mathcal{X}$, we follow the proof idea in Kolmogorov & Tikhomirov (1959) and demonstrate the tighter bound $|\widehat{\mathcal{H}}| \le \big(\lceil 4k/\varepsilon \rceil + 1\big)(2k + 1)^{|N| - 1} = O\big((2k)^{|N|}/\varepsilon\big)$. Here $\widehat{\mathcal{H}}$ is constructed as in the proof of Lemma 2, but now each $x_i \in N$ is taken to be a center of $U_i$, as furnished by Property 2 above. Let $x_j \in N$. Since $\mathcal{X}$ is connected, we may traverse a path from $x_1$ to $x_j$ via the cover points $x_1 = x_{i_1}, x_{i_2}, \ldots, x_{i_m} = x_j$, such that the distance between any two successive points $(x_{i_l}, x_{i_{l+1}})$ is at most $2\varepsilon' = \varepsilon/4L$. Since $\hat h$ is $2L$-Lipschitz, between any two such points the value of $\hat h$ can change by at most $\varepsilon/2$.
Thus, given the value $\hat h(x_{i_l})$, the value $\hat h(x_{i_{l+1}})$ can take one of at most $2k + 1$ values (as Figure 2 shows, at the star's hub $\hat h(x_{i_{l+1}})$ can take one of $2k + 1$ values, while at one of the spokes only 5 values are possible). We are thus left to choose the value of $\hat h$ at the point $x_1$, which can be one of the $\lceil 4k/\varepsilon \rceil + 1$ possible values. The bounds on $|\widehat{\mathcal{H}}|$ and on the metric entropy follow. ∎

#### 3.2.2. Rademacher analysis

The Rademacher complexity of the set of functions $\mathcal{H}_L$ is defined by

$$R_n(\mathcal{H}_L) = \mathbb{E}\left[\sup_{h \in \mathcal{H}_L} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(X_i, Y_i)\right], \tag{9}$$

where the $\sigma_i$ are $n$ independent random variables with $P(\sigma_i = +1) = P(\sigma_i = -1) = 1/2$. In Appendix B, we invoke Lemma 2 to derive the bound

$$R_n(\mathcal{H}_L) \le c_D\, L \left(\frac{\log 5k}{n}\right)^{1/(D+1)}, \tag{10}$$

where $c_D$ is a constant depending only on $D$; this in turn implies half of our main risk estimate (1).

**Theorem 4.** With probability at least $1 - \delta$, for all $L > 0$ and every $f \in \mathcal{F}_L$ with its projected version $h_f \in \mathcal{H}_L$,
$$P(g_f(X) \ne Y) \le \widehat{\mathbb{E}}\big[\mathcal{L}(h_f)\big] + \mathrm{Rad}(n, L, \delta),$$
where $g_f$ is the classifier defined in (3), $\mathcal{L}$ is any of the loss functions defined in Section 2, and $\mathrm{Rad}(n, L, \delta)$ is of order
$$c_D\, L \left(\frac{\log 5k}{n}\right)^{1/(D+1)} + \sqrt{\frac{\log\log_2 2L}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}.$$

#### 3.2.3. Scale-sensitive analysis

The following theorem, proved in Appendix C, is an adaptation of Guermeur (2007, Theorem 1), using Lemma 2.

**Theorem 5.** With probability at least $1 - \delta$, for all $L > 0$ and every $f \in \mathcal{F}_L$ with its induced $h_f \in \mathcal{H}_L$,
$$P(g_f(X) \ne Y) \le \widehat{\mathbb{E}}\big[\mathcal{L}_{\mathrm{cutoff}}(h_f)\big] + \mathrm{fat}(n, L, \delta),$$
where $\mathrm{fat}(n, L, \delta)$ is of order
$$\sqrt{\frac{2}{n}\left((16L)^{D} \log(20k) + \ln\frac{2L}{\delta}\right)}.$$

#### 3.2.4. Combined bound

Taking $\mathcal{L} = \mathcal{L}_{\mathrm{cutoff}}$ in Theorem 4, we can merge the above two bounds by taking their minimum. Namely, Theorem 5 holds with
$$\Delta(n, L, \delta) = \min\{\mathrm{Rad}(n, L, \delta),\ \mathrm{fat}(n, L, \delta)\}$$
in place of $\mathrm{fat}(n, L, \delta)$; see Figure 3. The resulting risk decay rate is of order
$$\min\left\{ L \left(\frac{\log k}{n}\right)^{\frac{1}{D+1}},\ L^{D/2} \left(\frac{\log k}{n}\right)^{\frac{1}{2}} \right\},$$
as claimed in (1). In terms of the number of classes $k$, our bound compares favorably to those in Allwein et al. (2001), Guermeur (2007; 2010), and, more recently, Daniely et al. (2011), which have a $k$-dependence of $O(d_{\mathrm{Nat}} \log k)$, where $d_{\mathrm{Nat}}$ is the (scale-sensitive, $k$-dependent) Natarajan dimension of the multiclass hypothesis class. The optimal dependence of the risk on $k$ is an intriguing open problem.

[Figure 3. The combined complexity bounds (k = 10, δ = 0.01); the plots show the Rad and fat terms for D = 5 and L ∈ {1, 2, 4, 8}.]

## 4. Algorithm

Theorems 4 and 5 yield generalization bounds of the schematic form

$$P(g(X) \ne Y) \le \widehat{\mathbb{E}}[\mathcal{L}] + \Delta(n, L, \delta). \tag{11}$$

The free parameter $L$ in (11) controls (roughly speaking) the bias-variance tradeoff: for larger $L$, we may achieve a smaller empirical loss $\widehat{\mathbb{E}}[\mathcal{L}]$ at the expense of a larger hypothesis complexity $\Delta(n, L, \delta)$. Our Structural Risk Minimization (SRM) consists of seeking the optimal $L$, i.e., one that minimizes the right-hand side of (11), via the following high-level procedure:

1. For each $L > 0$, minimize $\widehat{\mathbb{E}}[\mathcal{L}(h_f)]$ over $f \in \mathcal{F}_L$.
2. Choose the optimal $L^*$ and its corresponding classifier $g_f$ with $f \in \mathcal{F}_{L^*}$.

**Minimizing the empirical loss.** Let $S = (X_i, Y_i)_{i=1}^{n}$ be the training sample and $L > 0$ a given maximal allowed Lipschitz constant. We will say that a function $h \in \mathcal{H}_L$ is *inconsistent* with a sample point $(x, y)$ if $h(x, y) < 1$ (i.e., if the margin of $h$ on $(x, y)$ is less than one). Denote by $\hat m(L)$ the smallest possible number of sample points with which a function $h \in \mathcal{H}_L$ may be inconsistent (equivalently, up to normalization by $n$, the smallest achievable empirical cutoff loss):

$$\hat m(L) = \min_{h \in \mathcal{H}_L} \widehat{\mathbb{E}}\big[\mathcal{L}_{\mathrm{cutoff}}(h)\big].$$

Thus, our SRM problem consists of finding

$$L^* = \operatorname*{argmin}_{L > 0} \big\{ \hat m(L) + \Delta(n, L, \delta) \big\}.$$
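For concreteness, the complexity term $\Delta(n, L, \delta)$ appearing in the SRM objective can be tabulated directly from the expressions in Theorems 4 and 5. The sketch below is ours, not the authors' code; the constant `c_D`, the guard for small $L$, and the exact confidence terms are illustrative placeholders for the reconstructed expressions above.

```python
import math

def rad_term(n, L, delta, k, D, c_D=1.0):
    """Rademacher-based complexity term of Theorem 4 (constants illustrative)."""
    main = c_D * L * (math.log(5 * k) / n) ** (1.0 / (D + 1))
    strat = math.sqrt(math.log(max(math.log2(2 * L), 2.0)) / n)  # stratification over L, guarded for small L
    conf = math.sqrt(math.log(1.0 / delta) / n)                  # confidence term
    return main + strat + conf

def fat_term(n, L, delta, k, D):
    """Scale-sensitive complexity term of Theorem 5 (constants illustrative)."""
    return math.sqrt(2.0 * ((16 * L) ** D * math.log(20 * k) + math.log(2 * L / delta)) / n)

def Delta(n, L, delta, k, D):
    """Combined complexity of Section 3.2.4: the smaller of the two estimates."""
    return min(rad_term(n, L, delta, k, D), fat_term(n, L, delta, k, D))
```

Figure 3 plots precisely this minimum as a function of $n$ for several values of $L$.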
For $k = 2$, Gottlieb et al. (2010) reduced the problem of computing $\hat m(L)$ to that of finding a minimum vertex cover in a bipartite graph (by Kőnig's theorem, the latter is efficiently computable as a maximum matching). We will extend this technique to $k > 2$ as follows. Define the $k$-partite graph $G_L = (\{V^y\}_{y=1}^{k}, E)$, where each vertex set $V^y$ corresponds to the sample points $S_y$ with label $y$. Now, in order for $h \in \mathcal{H}_L$ to be consistent with the points $(X_i, Y_i)$ and $(X_j, Y_j)$ for $Y_i \ne Y_j$, the following relation must hold:

$$L\, d(X_i, X_j) \ge 2. \tag{12}$$

Hence, we define the edges of $G_L$ to consist of all point pairs violating (12):

$$(X_i, X_j) \in E \iff (Y_i \ne Y_j) \wedge (d(X_i, X_j) < 2/L).$$

Since removing either of $X_i, X_j$ in (12) also deletes the violating edge, $\hat m(L)$ is by construction equivalent to the size of the minimum vertex cover of $G_L$. Although minimum vertex cover is NP-hard to compute (and even hard to approximate within a factor of 1.3606; Dinur & Safra, 2005), a 2-approximation may be found in $O(n^2)$ time (Papadimitriou & Steiglitz, 1998). This yields a 2-approximation $\tilde m(L)$ of $\hat m(L)$.

**Optimizing over $L$.** Equipped with an efficient routine for computing $\tilde m(L) \le 2\hat m(L)$, we now seek an $L > 0$ that minimizes

$$Q(L) := \tilde m(L) + \Delta(n, L, \delta). \tag{13}$$

Since the Lipschitz constant induced by the data is determined by the $\binom{n}{2}$ distances among the sample points, we need only consider $O(n^2)$ values of $L$. Rather than brute-force searching over all of these values, Theorem 7 of Gottlieb et al. (2010) shows that, using an $O(\log n)$-step binary search over the values of $L$, one may approximately minimize $Q(L)$, which in turn yields an approximate solution to (11). The resulting procedure has runtime $O(n^2 \log n)$ and guarantees an $\tilde L$ for which

$$Q(\tilde L) \le 4\big[\hat m(L^*) + \Delta(n, L^*, \delta)\big]. \tag{14}$$

**Classifying test points.** Given the nearly optimal Lipschitz constant $\tilde L$ computed above, we construct the approximate (within a factor of 4) empirical risk minimizer $h \in \mathcal{H}_{\tilde L}$. The latter partitions the sample into $S = S_0 \cup S_1$, where $S_1$ consists of the points with which $h$ is consistent and $S_0 = S \setminus S_1$. Evaluating $h$ on a test point amounts to finding its nearest neighbor in $S_1$. Although in general metric spaces nearest-neighbor search requires $\Omega(n)$ time, in doubling spaces an exponential speedup is available via approximate nearest neighbors (see Section 5).
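The following schematic sketch (ours, not the authors' implementation) ties the pieces of this section together: it builds the conflict graph $G_L$, computes the greedy maximal-matching 2-approximation $\tilde m(L)$ to the minimum vertex cover, and scans the $O(n^2)$ candidate values of $L$. The paper instead uses the $O(\log n)$-step binary search of Gottlieb et al. (2010); the callable `Delta` stands for the complexity term (e.g., the sketch after Section 3.2.4), and normalizing the cover size by $n$ so that it is on the same scale as $\Delta$ is our choice.

```python
import itertools

def conflict_edges(X, y, L, dist):
    """Edges of G_L: differently labeled pairs with L * d(X_i, X_j) < 2, i.e. violating (12)."""
    return [(i, j) for i, j in itertools.combinations(range(len(X)), 2)
            if y[i] != y[j] and L * dist(X[i], X[j]) < 2.0]

def greedy_vertex_cover(edges):
    """Maximal-matching 2-approximation to minimum vertex cover."""
    cover = set()
    for i, j in edges:
        if i not in cover and j not in cover:
            cover.update((i, j))          # take both endpoints of an uncovered edge
    return cover

def approximate_srm(X, y, dist, Delta, k, D, delta=0.01):
    """Brute-force scan over candidate Lipschitz constants (schematic).
    Delta: callable (n, L, delta, k, D) -> complexity term."""
    n = len(X)
    # The structure of G_L changes only at L = 2 / d(X_i, X_j) for differently labeled pairs.
    candidates = sorted({2.0 / dist(X[i], X[j])
                         for i, j in itertools.combinations(range(n), 2)
                         if y[i] != y[j] and dist(X[i], X[j]) > 0})
    best = None
    for L in candidates:
        m_tilde = len(greedy_vertex_cover(conflict_edges(X, y, L, dist)))
        Q = m_tilde / n + Delta(n, L, delta, k, D)   # objective (13), cover size normalized by n
        if best is None or Q < best[0]:
            best = (Q, L, m_tilde)
    return best   # (objective value, chosen Lipschitz constant, number of points to discard)
```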
## 5. Extensions

In this section we discuss two approaches that render the methods presented above considerably more efficient, in terms of both runtime and generalization bounds. The first is based on the fact that in doubling spaces, hypothesis evaluation time may be reduced from $O(n)$ to $O(\log n)$ at the expense of a very slight degradation of the generalization bounds. The second relies on a recent metric dimensionality-reduction result. When the data is essentially low-dimensional, with intrinsic dimension much smaller than that of the ambient metric space, both the evaluation runtime and the generalization bounds may be significantly improved, depending essentially on the intrinsic dimension rather than the ambient one.

### 5.1. Exponential speedup via approximate NN

If $(\mathcal{X}, d)$ is a metric space, $E \subseteq \mathcal{X}$, and $x^* \in E$ is a minimizer of $d(x, x')$ over $x' \in E$, then $x^*$ is a nearest neighbor of $x$ in $E$. A simple information-theoretic argument shows that the problem of computing an exact nearest neighbor in general metric spaces has $\Omega(n)$ time complexity. However, an exponential speedup is possible if (i) $\mathcal{X}$ is a doubling space and (ii) one is willing to settle for approximate nearest neighbors. A $(1 + \eta)$-nearest-neighbor oracle returns an $\tilde x \in E$ such that

$$d(x, x^*) \le d(x, \tilde x) \le (1 + \eta)\, d(x, x^*). \tag{15}$$

We will use the fact that in a doubling space one may precompute a $(1 + \eta)$-nearest-neighbor data structure in $\big(2^{O(\mathrm{ddim}(\mathcal{X}))} \log n + \eta^{-O(\mathrm{ddim}(\mathcal{X}))}\big)\, n$ time and evaluate it on a test point in $2^{O(\mathrm{ddim}(\mathcal{X}))} \log n + \eta^{-O(\mathrm{ddim}(\mathcal{X}))}$ time (Cole & Gottlieb, 2006; Har-Peled & Mendel, 2006). (An alternative approximate nearest-neighbor search technique based on Locality Sensitive Hashing (Andoni & Indyk, 2008) is only applicable to vector-represented data and not to general metric spaces.)

The approximate nearest-neighbor oracle induces an $\eta$-approximate version of $g_{\mathrm{NN}}$ defined in (7). After performing SRM as described in Section 4, we are left with a subset $S_1 \subseteq S$ of the sample, which will be used to label test points. More precisely, the predicted label of a test point will be determined by its $\eta$-nearest neighbor in $S_1$. The exponential speedup afforded by approximate nearest neighbors comes at the expense of mildly degraded generalization guarantees. The modified generalization bounds are derived in three steps, whose details are deferred to Appendix D:

(i) We cast the evaluation of $h \in \mathcal{H}_L$ in (8) as a nearest-neighbor calculation, with a corresponding $\tilde h$ induced by the $(1 + \eta)$-approximate nearest-neighbor oracle. The nearest-neighbor formulation of $h$ is essentially the one obtained by von Luxburg & Bousquet (2004):

$$h(x, y) = \frac{1}{2}\left(\min_{(x', y') \in S_1}\big\{\xi(y, y') + L\, d(x, x')\big\} + \max_{(x', y') \in S_1}\big\{\xi(y, y') - L\, d(x, x')\big\}\right), \tag{16}$$

where $\xi(y, y') = 2 \cdot \mathbf{1}_{\{y = y'\}} - 1$.

(ii) We observe a simple relation between $h$ and $\tilde h$:
$$\|h - \tilde h\|_\infty = \sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} |h(x, y) - \tilde h(x, y)| \le 2\eta.$$

(iii) Defining the $2\eta$-perturbed function class $\mathcal{H}_{L, 2\eta} = \big\{T_{[-1,1]}(h') : \|h' - h\|_\infty \le 2\eta,\ h \in \mathcal{H}_L\big\}$, we relate its metric entropy to that of $\mathcal{H}_L$:

**Lemma 6.** For $\varepsilon > 2\eta > 0$, we have $N(\varepsilon, \mathcal{H}_{L, 2\eta}, \|\cdot\|_\infty) \le N(\varepsilon - 2\eta, \mathcal{H}_L, \|\cdot\|_\infty)$.

The metric entropy estimate for $\mathcal{H}_{L, 2\eta}$ readily yields $\eta$-perturbed versions of Theorems 4 and 5. From the standpoint of generalization bounds, the effect of the $\eta$-perturbation on $\mathcal{H}_L$ amounts, roughly speaking, to replacing $L$ by $L(1 + O(\eta))$, which constitutes a rather benign degradation.

### 5.2. Adaptive dimensionality reduction

The generalization bound in (1) and the runtime of our sped-up algorithm in Section 5.1 both depend exponentially on the doubling dimension of the metric space. Hence, even a modest dimensionality reduction could lead to dramatic savings in algorithmic and sample complexities. The standard Euclidean dimensionality-reduction tool, PCA, until recently had no metric analogue, at least not one with rigorous performance guarantees. The technique proposed in Gottlieb et al. (2013b) may roughly be described as a metric analogue of PCA. A set $X = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ inherits the metric $d$ of $\mathcal{X}$, and hence its doubling dimension $\mathrm{ddim}(X)$ is well-defined. We say that $\tilde X = \{\tilde x_1, \ldots, \tilde x_n\} \subseteq \mathcal{X}$ is an $(\alpha, \beta)$-perturbation of $X$ if $\sum_{i=1}^{n} d(x_i, \tilde x_i) \le \alpha$ and $\mathrm{ddim}(\tilde X) \le \beta$. Intuitively, the data is essentially low-dimensional if it admits an $(\alpha, \beta)$-perturbation with small $\alpha, \beta$, which leads to improved Rademacher estimates. The empirical Rademacher complexity of $\mathcal{H}_L$ on a sample $S = (X, Y) \in \mathcal{X}^n \times \mathcal{Y}^n$ is given by

$$\widehat{R}_n(\mathcal{H}_L; S) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}_L} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(X_i, Y_i)\right]$$

and is related to $R_n$ defined in (9) via

$$R_n(\mathcal{H}_L) = \mathbb{E}_S\big[\widehat{R}_n(\mathcal{H}_L; S)\big], \qquad P\big(|R_n - \widehat{R}_n| \ge \varepsilon\big) \le 2\exp(-\varepsilon^2 n / 2),$$

where the identity is obvious and the inequality is a simple consequence of measure concentration (Mohri et al., 2012). Hence, up to small changes in constants, the two may be used interchangeably in generalization bounds such as Theorem 4.
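The concentration inequality just quoted follows from the bounded-differences (McDiarmid) inequality; the following one-line derivation is our addition, using only the fact that functions in $\mathcal{H}_L$ take values in $[-1, 1]$.

```latex
% Replacing a single sample point (X_i, Y_i) changes
%   \widehat R_n(\mathcal H_L; S) = E_\sigma\bigl[\sup_{h \in \mathcal H_L}\tfrac1n\sum_i \sigma_i h(X_i, Y_i)\bigr]
% by at most c_i = 2/n, since |h| <= 1.  McDiarmid's inequality then gives
\[
  P\Bigl(\bigl|\widehat R_n(\mathcal H_L; S) - R_n(\mathcal H_L)\bigr| \ge \varepsilon\Bigr)
  \;\le\; 2\exp\!\Bigl(-\frac{2\varepsilon^2}{\sum_{i=1}^{n} c_i^{2}}\Bigr)
  \;=\; 2\exp\!\Bigl(-\frac{2\varepsilon^2}{n\,(2/n)^2}\Bigr)
  \;=\; 2\exp\!\bigl(-\varepsilon^2 n/2\bigr).
\]
```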
The data-dependent nature of $\widehat{R}_n$ lets us exploit essentially low-dimensional data (see Appendix E):

**Theorem 7.** Let $S = (X, Y) \in \mathcal{X}^n \times \mathcal{Y}^n$ be the training sample and suppose that $X$ admits an $(\alpha, \beta)$-perturbation $\tilde X$. Then
$$\widehat{R}_n(\mathcal{H}_L; S) = O\!\left(\frac{L\alpha}{n} + c_\beta\, L \left(\frac{\log k}{n}\right)^{\frac{1}{\beta + 1}}\right). \tag{17}$$

A pleasant feature of the bound above is that it does not depend on $\mathrm{ddim}(\mathcal{X})$ (the dimension of the ambient space) or even on $\mathrm{ddim}(X)$ (the dimension of the data). Note the inherent tradeoff between the distortion $\alpha$ and the dimension $\beta$, with some non-trivial $(\alpha^*, \beta^*)$ minimizing the right-hand side of (17). Although computing the optimal $(\alpha^*, \beta^*)$ seems computationally difficult, Gottlieb et al. (2013b) were able to obtain an efficient $(O(1), O(1))$-bicriteria approximation. Namely, their algorithm computes an $\tilde\alpha \le c_0 \alpha^*$ and $\tilde\beta \le c_1 \beta^*$, together with the corresponding perturbed set $\tilde X$, for universal constants $c_0, c_1$, with a runtime of $2^{O(\mathrm{ddim}(\mathcal{X}))} n \log n + O(n \log^5 n)$. The optimization routine over $(\alpha, \beta)$ may then be embedded inside our SRM optimization over the Lipschitz constant $L$ in Section 4. The end result is a nearly optimal (in the sense of (14)) Lipschitz constant $\tilde L$, which induces the partition $S = S_0 \cup S_1$, as well as $(\tilde\alpha, \tilde\beta)$, which induce the perturbed set $\tilde S_1$. To evaluate our hypothesis on a test point, we may invoke the $(1 + \eta)$-approximate nearest-neighbor routine from Section 5.1. This involves a precomputation of time complexity $\big(2^{O(\tilde\beta)} \log n + \eta^{-O(\tilde\beta)}\big)\, n$, after which new points are classified in $2^{O(\tilde\beta)} \log n + \eta^{-O(\tilde\beta)}$ time. Note that the evaluation time complexity depends only on the intrinsic dimension $\tilde\beta$ of the data, rather than on the ambient metric space dimension.

## Acknowledgements

A.K. was partially supported by the Israel Science Foundation (grant No. 1141/12) and a Yahoo Faculty award. R.W. was partially supported by the Lynne and William Frankel Center for Computer Science and a Kreitman Negev Fellowship.

## References

Allwein, E. L., Schapire, R. E., and Singer, Y. Reducing multiclass to binary: A unifying approach for margin classifiers. JMLR, 1:113-141, 2001.
Andoni, A. and Krauthgamer, R. The computational hardness of estimating edit distance. SICOMP, 39(6), 2010.
Anthony, M. and Bartlett, P. Neural Network Learning: Theoretical Foundations. 1999.
Bartlett, P. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
Ben-David, S. and Shalev-Shwartz, S. Understanding Machine Learning: From Theory to Algorithms. 2014.
Ben-David, S., Cesa-Bianchi, N., Haussler, D., and Long, P. Characterizations of learnability for classes of {0, ..., n}-valued functions. J. CSS, 50(1), 1995.
Beygelzimer, A., Langford, J., and Ravikumar, P. Error-correcting tournaments. ALT, 2009.
Cole, R. and Gottlieb, L. Searching dynamic point sets in spaces with bounded doubling dimension. STOC, 2006.
Cortes, C., Mohri, M., and Rostamizadeh, A. Multi-class classification with maximum margin multiple kernel. ICML, 2013.
Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Trans. Info. Theo., 13(1):21-27, 1967.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265-292, 2002a.
Crammer, K. and Singer, Y. On the learnability and design of output codes for multiclass problems. Mach. Learn., 47(2-3):201-233, 2002b.
Crammer, K., Gilad-Bachrach, R., Navot, A., and Tishby, N. Margin analysis of the LVQ algorithm. NIPS, 2002.
Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. JMLR - Proceedings Track, 19:207-232, 2011.
Dinur, I. and Safra, S. On the hardness of approximating minimum vertex cover. Ann. Math., 162(1), 2005.
Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Func. Anal., 1(3):290-330, 1967.
El-Yaniv, R., Pechyony, D., and Yom-Tov, E. Better multiclass classification via a margin-optimized single binary problem. Patt. Rec. Lett., 29(14):1954-1959, 2008.
Enflo, P. On the nonexistence of uniform homeomorphisms between Lp-spaces. Ark. Mat., 8:103-105, 1969.
Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Efficient classification for metric data. COLT, 2010.
Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Efficient regression in metric spaces via approximate Lipschitz extension. SIMBAD, 2013a.
Gottlieb, L., Kontorovich, A., and Krauthgamer, R. Adaptive metric dimensionality reduction. ALT, 2013b.
Guermeur, Y. VC theory of large margin multi-category classifiers. JMLR, 8:2551-2594, 2007.
Guermeur, Y. Sample complexity of classifiers taking values in R^Q, application to multi-class SVMs. Comm. Statist. Theory Methods, 39(3):543-557, 2010.
Har-Peled, S. and Mendel, M. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comp., 35(5):1148-1184, 2006.
Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.
Kolmogorov, A. and Tikhomirov, V. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3-86, 1959.
Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30(1):1-50, 2002.
Krauthgamer, R. and Lee, J. Navigating nets: Simple algorithms for proximity search. SODA, 2004.
Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. COLT, 2005.
McShane, E. J. Extension of range of functions. Bull. Amer. Math. Soc., 40(12):837-842, 1934.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. The MIT Press, 2012.
Naor, A. and Schechtman, G. Planar earthmover is not in L1. SICOMP, 37:804-826, 2007.
Papadimitriou, C. and Steiglitz, K. Combinatorial Optimization: Algorithms and Complexity. 1998.
Rifkin, R. and Klautau, A. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.
von Luxburg, U. and Bousquet, O. Distance-based classification with Lipschitz functions. JMLR, 5:669-695, 2004.
Weston, J. and Watkins, C. Support vector machines for multi-class pattern recognition. ESANN, 61-72, 1999.
Whitney, H. Analytic extensions of differentiable functions defined in closed sets. Trans. Amer. Math. Soc., 36(1):63-89, 1934.
Zhang, T. Covering number bounds of certain regularized linear function classes. JMLR, 2:527-550, 2002.
Zhang, T. Statistical analysis of some multi-category large margin classification methods. JMLR, 5, 2004.