# Discriminative Metric Learning by Neighborhood Gerrymandering

Shubhendu Trivedi, David McAllester, Gregory Shakhnarovich
Toyota Technological Institute, Chicago, IL 60637
{shubhendu,mcallester,greg}@ttic.edu

We formulate the problem of metric learning for k-nearest-neighbor classification as a large margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error. We describe an efficient algorithm for exact loss-augmented inference, and a fast gradient descent algorithm for learning in this model. The objective drives the metric to establish neighborhood boundaries that benefit the true class labels of the training points. Our approach, reminiscent of gerrymandering (redrawing of political boundaries to provide advantage to certain parties), is more direct in its handling of optimizing classification accuracy than those previously proposed. In experiments on a variety of data sets our method is shown to achieve excellent results compared to the current state of the art in metric learning.

1 Introduction

Nearest neighbor classifiers are among the oldest and most widely used tools in machine learning. Although nearest neighbor rules are often successful, their performance tends to be limited by two factors: the computational cost of searching for nearest neighbors, and the choice of the metric (distance measure) defining "nearest". The cost of searching for neighbors can be reduced with efficient indexing, e.g., [1, 4, 2], or by learning compact representations, e.g., [13, 19, 16, 9]. We will not address this issue here; we focus on the choice of the metric. The metric is often taken to be the Euclidean, Manhattan, or χ2 distance. However, it is well known that in many cases these choices are suboptimal in that they do not exploit statistical regularities that can be leveraged from labeled data. This paper focuses on supervised metric learning. In particular, we present a method for learning a metric so as to optimize the accuracy of the resulting nearest neighbor classifier.

Existing works on metric learning formulate learning as an optimization task with various constraints driven by considerations of computational feasibility and reasonable, but often vaguely justified, principles [23, 8, 7, 22, 21, 14, 11, 18]. A fundamental intuition is shared by most of the work in this area: an ideal distance for prediction is distance in the label space. Of course, that cannot be measured, since prediction of a test example's label is what we want to use the similarities for in the first place. Instead, one can learn a similarity measure with the goal of making it a good proxy for label similarity. Since the performance of kNN prediction is often the real motivation for similarity learning, the constraints typically involve pulling "good" neighbors (from the correct class for a given point) closer while pushing "bad" neighbors farther away. The exact formulation of "good" and "bad" varies, but is usually defined as a combination of proximity and agreement between labels. We give a formulation that, as far as we are aware, attempts to optimize kNN accuracy more directly than previous work. We discuss existing methods in more detail in Section 2, where we also place our work in context.

In the kNN prediction problem, given a point and a chosen metric, there is an implicit hidden variable: the choice of the k "neighbors".
The inference of the predicted label from these k examples is trivial: a simple majority vote among the associated labels. Given a query point, there may exist a very large number of choices of k points that correspond to zero loss: any set of k points in which the majority belongs to the correct class will do. We would like a metric to prefer one of these "good" example sets over any set of k neighbors that would vote for a wrong class. Note that to win, the right class need not account for all k neighbors; it just needs to get more votes than any other class. As the number of classes and the value of k grow, so does the space of available "good" (and "bad") example sets.

These considerations motivate our approach to metric learning. It is akin to the common, albeit negatively viewed, practice of gerrymandering in drawing up borders of election districts so as to provide advantages to desired political parties, e.g., by concentrating voters from that party or by spreading voters of opposing parties. In our case, the "districts" are the cells in the Voronoi diagram defined by the Mahalanobis metric, the "parties" are the class labels voted for by the neighbors falling in each cell, and the desired winner is the true label of the training points associated with the cell. This intuition is why we refer to our method as "neighborhood gerrymandering" in the title.

Technically, we write kNN prediction as an inference problem with a structured latent variable: the choice of k neighbors. Learning then involves minimizing a sum of a structural latent hinge loss and a regularizer [3]. Computing the structural latent hinge loss involves loss-adjusted inference: one must compute loss-adjusted values of both the output value (the label) and the latent items (the set of nearest neighbors). The loss-augmented inference corresponds to a choice of the "worst" k neighbors, in the sense that while having a high average similarity they also correspond to a high loss (the "worst offending" set of k neighbors). Given the inherent combinatorial considerations, the key to such a model is efficient inference and loss-augmented inference. We give an efficient algorithm for exact inference, and we design an optimization algorithm based on stochastic gradient descent on the surrogate loss.

Our approach achieves kNN accuracy higher than the state of the art on most of the data sets we tested, including against some methods specialized for the relevant input domains. Although the experiments reported here are restricted to learning a Mahalanobis distance in an explicit feature space, the formulation allows for nonlinear similarity measures, such as those defined by nonlinear kernels, provided computing the gradients of the similarities with respect to the metric parameters is feasible. Our formulation can also naturally handle a user-defined loss matrix on labels.

2 Related Work and Discussion

There is a large body of work on similarity learning done with the stated goal of improving kNN performance. In much of the recent work, the objective can be written as a combination of some sort of regularizer on the parameters of the similarity with a loss reflecting the desired "purity" of the neighbors under the learned similarity. Optimization then balances violation of these constraints with regularization. The main contrast between this body of work and our approach is in the form of the loss. A well-known family of methods of this type is based on the Large Margin Nearest Neighbor (LMNN) algorithm [22].
In LMNN, the constraints for each training point involve a set of predefined "target neighbors" from the correct class, and "impostors" from other classes. The set of target neighbors plays a role similar to our best correct set of k neighbors ($h^*$ in Section 4). However, the target neighbors are chosen at the outset based on the Euclidean distance (in the absence of a priori knowledge), and the set of target neighbors is not dynamically updated as the metric is optimized. There is no reason to believe that the original choice of neighbors based on the Euclidean distance remains optimal as the metric is updated. Also, $h^*$ represents the closest neighbors that attain zero loss, but they are not necessarily all of the same class; in LMNN the target neighbors are forced to be of the same class. In doing so LMNN does not fully leverage the power of the kNN objective. The role of impostors is somewhat similar to the role of the "worst offending" set of k neighbors in our method ($\hat{h}$ in Section 4). See Figure 1 for an illustration. Extensions of LMNN [21, 11] allow for non-linear metrics, but retain the same general flavor of constraints. Another extension of LMNN is more closely aligned with our work [20], in that it lifts the constraint of a static set of neighbors chosen by Euclidean distance and instead learns the neighborhood.

Figure 1: Illustration of objectives of LMNN (left) and our structured approach (right) for k = 3. The point x of class blue is the query point. In LMNN, the target points are the nearest neighbors of the same class, which are points a, b and c (the circle centered at x has radius equal to the farthest of the target points, i.e., point b). The LMNN objective will push all the points of the wrong class that lie inside this circle out (points e, f, h, i, and j), while pulling in the target points to enforce the margin. For our structured approach (right), the circle around x has radius equal to the distance of the farthest of the three nearest neighbors irrespective of class. Our objective only needs to ensure zero loss, which is achieved by pulling in point a of the correct class (blue) while pushing out one point of the incorrect class (point f). Note that two points of the incorrect class lie inside the circle (e and f), both of class red; however, f is pushed out and not e, since it is farther from x. Also see Section 2.

The above family of methods may be contrasted with methods of the flavor proposed in [23]. There, "good" neighbors are defined as all similarly labeled points, and each class is mapped into a ball of a fixed radius, but no separation is enforced between the classes. The kNN objective does not require that similarly labeled points be clustered together, and consequently such methods try to optimize a much harder objective than is needed for learning the metric.

In Neighborhood Component Analysis (NCA) [8], the piecewise-constant error of the kNN rule is replaced by a soft version. This leads to a non-convex objective that is optimized via gradient descent. This is similar to our method in the sense that it also attempts to directly optimize the choice of the nearest neighbors, at the price of losing convexity. The issue of non-convexity was partly remedied in [7] by optimization of a similar stochastic rule while attempting to collapse each class to one point. While this makes the optimization convex, collapsing classes to distinct points is unrealistic in practice.
Another recent extension of NCA [18] generalizes the stochastic classification idea to kNN classification with k > 1. In Metric Learning to Rank (MLR) [14], the constraints involve all the points: the goal is to push all the correct matches in front of all the incorrect ones. This again is not the same as requiring correct classification. In addition to global optimization constraints on the rankings (such as mean average precision for the target class), the authors allow localized evaluation criteria such as Precision at k, which can be used as a surrogate for classification accuracy in binary classification, but is a poor surrogate for multi-way classification. Direct use of kNN accuracy in the optimization objective is briefly mentioned in [14], but not pursued due to the difficulty of loss-augmented inference: the interleaving technique of [10], which is used to perform inference with other losses based inherently on contingency tables, fails in the multiclass case (since the number of data interleavings could be exponential). We take a very different approach to loss-augmented inference, using targeted inference and the classification loss matrix, and can easily extend it to an arbitrary number of classes. A similar approach is taken in [15], where the constraints are derived from triplets of points formed by a sample, a correct neighbor, and an incorrect neighbor. Again, these are assumed to be set statically as an input to the algorithm, and the optimization focuses on the distance ordering (ranking) rather than on classification accuracy.

3 Problem setup

We are given N training examples $X = \{x_1, \ldots, x_N\}$, represented by a native feature map, $x_i \in \mathbb{R}^d$, and their class labels $y = [y_1, \ldots, y_N]^\top$, with $y_i \in [R]$, where $[R]$ stands for the set $\{1, \ldots, R\}$. We are also given the loss matrix $\Lambda$, with $\Lambda(r, r')$ being the loss incurred by predicting $r'$ when the correct class is $r$. We assume $\Lambda(r, r) = 0$ and $\Lambda(r, r') \ge 0$ for all $(r, r')$. We are interested in Mahalanobis metrics

$$D_W(x, x_i) = (x - x_i)^\top W (x - x_i), \qquad (1)$$

parameterized by positive semidefinite $d \times d$ matrices $W$. Let $h \subset X$ be a set of examples in $X$. For a given $W$ we define the distance score of $h$ w.r.t. a point $x$ as

$$S_W(x, h) = -\sum_{x_j \in h} D_W(x, x_j), \qquad (2)$$

so that a larger score means a smaller total distance. Hence, the set of k nearest neighbors of $x$ in $X$ is

$$h_W(x) = \arg\max_{|h| = k} S_W(x, h). \qquad (3)$$

For the remainder we assume that k is known and fixed. From any set $h$ of k examples from $X$, we can predict the label of $x$ by (simple) majority vote: $\hat{y}(h) = \mathrm{majority}\{y_j : x_j \in h\}$, with ties resolved by a heuristic, e.g., according to the 1NN vote. In particular, the kNN classifier predicts $\hat{y}(h_W(x))$. Due to this deterministic dependence between $\hat{y}$ and $h$, we can define the classification loss incurred by a voting classifier when using the set $h$ as

$$\Delta(y, h) = \Lambda\left(y, \hat{y}(h)\right). \qquad (4)$$
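To make these definitions concrete, here is a minimal NumPy sketch of the quantities in Eqs. (1)-(4). The helper names and the arbitrary tie-breaking in the majority vote are our own illustration, not code from the paper.

```python
import numpy as np
from collections import Counter

def mahalanobis_dist(x, xi, W):
    """D_W(x, x_i) = (x - x_i)^T W (x - x_i), Eq. (1)."""
    d = x - xi
    return float(d @ W @ d)

def score(x, h_idx, X, W):
    """S_W(x, h) = -sum_{x_j in h} D_W(x, x_j), Eq. (2)."""
    return -sum(mahalanobis_dist(x, X[j], W) for j in h_idx)

def knn_set(x, X, W, k, exclude=None):
    """h_W(x): the k indices maximizing S_W, i.e., the k nearest neighbors, Eq. (3)."""
    dists = np.array([mahalanobis_dist(x, xi, W) for xi in X])
    if exclude is not None:
        dists[exclude] = np.inf          # leave-one-out exclusion when needed
    return np.argsort(dists)[:k]

def majority_vote(h_idx, y):
    """y_hat(h): majority label among the chosen neighbors (ties broken arbitrarily here)."""
    return Counter(y[j] for j in h_idx).most_common(1)[0][0]

def task_loss(y_true, h_idx, y, Lambda):
    """Delta(y, h) = Lambda(y, y_hat(h)), Eq. (4); Lambda assumed to be an R x R array."""
    return Lambda[y_true, majority_vote(h_idx, y)]
```

With these pieces, the kNN classifier of this section is simply `majority_vote(knn_set(x, X, W, k), y)`.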
4 Learning and inference

One might want to learn $W$ to minimize the training loss $\sum_i \Delta(y_i, h_W(x_i))$. However, this fails due to the intractable nature of the classification loss $\Delta$. We follow the usual remedy: define a tractable surrogate loss. Here we note that in our formulation, the output of the prediction is a structured object $h_W$, for which we eventually report the deterministically computed $\hat{y}$. Structured prediction problems usually involve a loss which is a generalization of the hinge loss; intuitively, it penalizes the gap between the score of the correct structured output and the score of the worst offending incorrect output (the one with the highest score and highest $\Delta$).

However, in our case there is no single correct output $h$, since in general many choices of $h$ lead to a correct $\hat{y}$ and zero classification loss: any $h$ in which the majority votes for the right class. Ideally, we want $S_W$ to prefer at least one of these correct $h$'s over all incorrect $h$'s. This intuition leads to the following surrogate loss definition:

$$L(x, y, W) = \max_{h} \left[ S_W(x, h) + \Delta(y, h) \right] \qquad (5)$$
$$\qquad\qquad\quad - \max_{h:\, \Delta(y, h) = 0} S_W(x, h). \qquad (6)$$

This is a bit different in spirit from the notion of margin sometimes encountered in ranking problems, where we want all the correct answers to be placed ahead of all the wrong ones. Here, we only care to put one correct answer on top; it does not matter which one, hence the max in (6).

5 Structured Formulation

Although we have motivated this choice of $L$ by intuitive arguments, it turns out that our problem is an instance of a familiar type of problem, latent structured prediction [24], and thus our choice of loss can be shown to form an upper bound on the empirical task loss $\Delta$. First, we note that the score $S_W$ can be written as

$$S_W(x, h) = \left\langle W, \; -\sum_{x_j \in h} (x - x_j)(x - x_j)^\top \right\rangle, \qquad (7)$$

where $\langle \cdot, \cdot \rangle$ stands for the Frobenius inner product. Defining the feature map

$$\Psi(x, h) = -\sum_{x_j \in h} (x - x_j)(x - x_j)^\top, \qquad (8)$$

we get the more compact expression $\langle W, \Psi(x, h) \rangle$ for (7). Furthermore, we can encode the deterministic dependence between $y$ and $h$ by a compatibility function $A(y, h) = 0$ if $y = \hat{y}(h)$ and $A(y, h) = -\infty$ otherwise. This allows us to write the joint inference of $y$ and (hidden) $h$ performed by the kNN classifier as

$$\hat{y}_W(x), \hat{h}_W(x) = \arg\max_{h, y} \left[ A(y, h) + \langle W, \Psi(x, h) \rangle \right]. \qquad (9)$$

This is the familiar form of inference in a latent structured model [24, 6] with latent variable $h$. So, despite our model's somewhat unusual property that the latent $h$ completely determines the inferred $y$, we can show the equivalence to standard latent structured prediction.

5.1 Learning by gradient descent

We define the objective for learning $W$ as

$$\min_W \; \|W\|_F^2 + C \sum_i L(x_i, y_i, W), \qquad (10)$$

where $\|\cdot\|_F$ stands for the Frobenius norm of a matrix. (We discuss other choices of regularizer in Section 7.) The regularizer is convex, but as in other latent structured models, the loss $L$ is non-convex due to the subtraction of the max in (6). To optimize (10), one can use the convex-concave procedure (CCCP) [25], which has been proposed specifically for latent SVM learning [24]. However, CCCP tends to be slow on large problems. Furthermore, its use is complicated here by the requirement that $W$ be positive semidefinite (PSD): the inner loop of CCCP would include solving a semidefinite program, making the algorithm slower still. Instead, we opt for a simpler choice, often faster in practice: stochastic gradient descent (SGD), described in Algorithm 1.

Algorithm 1: Stochastic gradient descent
  Input: labeled data set (X, Y), regularization parameter C, learning rate $\eta(\cdot)$
  Initialize $W^{(0)} = 0$
  for t = 0, 1, ..., while not converged do
    sample $i \in [N]$
    $\hat{h}_i = \arg\max_h \left[ S_{W^{(t)}}(x_i, h) + \Delta(y_i, h) \right]$
    $h^*_i = \arg\max_{h:\, \Delta(y_i, h) = 0} S_{W^{(t)}}(x_i, h)$
    $\delta W = \left[ \frac{\partial S_W(x_i, \hat{h}_i)}{\partial W} - \frac{\partial S_W(x_i, h^*_i)}{\partial W} \right]_{W^{(t)}}$
    $W^{(t+1)} = (1 - \eta(t)) \, W^{(t)} - C \, \delta W$
    project $W^{(t+1)}$ onto the PSD cone

The SGD algorithm requires solving two inference problems ($\hat{h}$ and $h^*$) and computing the gradient of $S_W$, which we address below. (Both inference problems over $h$ are solved in a leave-one-out setting, i.e., we impose the additional constraint $i \notin h$ under the argmax, not listed explicitly in the algorithms.)
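As a sketch of how Algorithm 1 might look in code, assuming the two inference routines of Sections 5.1.1-5.1.2 (sketched further below) as helpers, and with an illustrative step-size schedule; the function names and signatures are ours, not the authors':

```python
import numpy as np

def grad_score(x, h_idx, X):
    """dS_W(x, h)/dW = Psi(x, h) = -sum_{j in h} (x - x_j)(x - x_j)^T, cf. Eq. (12)."""
    G = np.zeros((x.size, x.size))
    for j in h_idx:
        d = (x - X[j])[:, None]
        G -= d @ d.T
    return G

def project_psd(W):
    """Project W onto the PSD cone by zeroing its negative eigenvalues."""
    vals, vecs = np.linalg.eigh((W + W.T) / 2.0)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def sgd_metric_learning(X, y, Lambda, k, R, C, n_steps, eta=lambda t: 1.0 / (t + 1.0)):
    """Sketch of Algorithm 1. loss_augmented_inference / targeted_inference are the routines
    sketched after Section 5.1.3; exclude=i enforces the leave-one-out constraint."""
    W = np.zeros((X.shape[1], X.shape[1]))
    for t in range(n_steps):
        i = np.random.randint(len(X))
        h_hat = loss_augmented_inference(X[i], y[i], X, y, W, k, R, Lambda, exclude=i)
        h_star = targeted_inference(X[i], y[i], X, y, W, k, R, tie_forbidden=True, exclude=i)
        dW = grad_score(X[i], h_hat, X) - grad_score(X[i], h_star, X)
        W = project_psd((1.0 - eta(t)) * W - C * dW)
    return W
```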
5.1.1 Targeted inference of $h^*_i$

Here we are concerned with finding the highest-scoring $h$ constrained to be compatible with a given target class $y$. We give an $O(N \log N)$ procedure in Algorithm 2; the proof of its correctness and the complexity analysis are in the Appendix.

Algorithm 2: Targeted inference
  Input: $x$, $W$, target class $y$, $\tau$ ($\tau = 1$ if ties are forbidden, 0 otherwise)
  Output: $\arg\max_{h:\, \hat{y}(h) = y} S_W(x, h)$
  Let $n = \left\lceil \frac{k + \tau (R - 1)}{R} \right\rceil$   // min. required number of neighbors from class $y$
  $h := \emptyset$
  for $j = 1, \ldots, n$ do
    $h := h \cup \arg\min_{x_i:\, y_i = y,\, i \notin h} D_W(x, x_i)$
  for $l = n + 1, \ldots, k$ do
    define $\#(r) = |\{i : x_i \in h, \, y_i = r\}|$   // count selected neighbors from class $r$
    $h := h \cup \arg\min_{x_i:\, (y_i = y \text{ or } \#(y_i) < \#(y) - \tau),\, i \notin h} D_W(x, x_i)$

The intuition behind Algorithm 2 is as follows. For a given combination of R (the number of classes) and k (the number of neighbors), the minimum number of neighbors from the target class $y$ required to allow (although not guarantee) zero loss is $n$ (see Proposition 1 in the Appendix). The algorithm first includes the $n$ highest-scoring neighbors from the target class. The remaining $k - n$ neighbors are picked by a greedy procedure that selects the highest-scoring neighbors (which might or might not be from the target class) while making sure that no non-target class ends up in a majority. When using Algorithm 2 to find an element of $H^*$ (the set of $h$ with $\Delta(y, h) = 0$), we forbid ties, i.e., set $\tau = 1$.

5.1.2 Loss-augmented inference of $\hat{h}_i$

Calculating the max term in (5) is known as loss-augmented inference. We note that

$$\max_{h'} \left[ \langle W, \Psi(x, h') \rangle + \Delta(y, h') \right] = \max_{y'} \left\{ \max_{h' \in H(y')} \langle W, \Psi(x, h') \rangle + \Lambda(y, y') \right\} = \max_{y'} \left\{ \langle W, \Psi(x, h^*(x, y')) \rangle + \Lambda(y, y') \right\}, \qquad (11)$$

where $H(y')$ denotes the sets $h$ of size k with $\hat{y}(h) = y'$, and $h^*(x, y')$ is the result of targeted inference for class $y'$. This immediately leads to Algorithm 3, relying on Algorithm 2. The intuition: perform targeted inference for each class (as if it were the target class), and then choose the set of neighbors for the class for which the loss-augmented score is the highest. In this case, in each call to Algorithm 2 we set $\tau = 0$, i.e., we allow ties, to make sure the argmax is over all possible $h$'s.

Algorithm 3: Loss-augmented inference
  Input: $x$, $W$, target class $y$
  Output: $\arg\max_h \left[ S_W(x, h) + \Delta(y, h) \right]$
  for $r \in \{1, \ldots, R\}$ do
    $h(r) := h^*(x, W, r, 0)$   // using Algorithm 2 with ties allowed
    Let $\mathrm{Value}(r) := S_W(x, h(r)) + \Lambda(y, r)$
  Let $r^* = \arg\max_r \mathrm{Value}(r)$
  return $h(r^*)$

5.1.3 Gradient update

Finally, we need to compute the gradient of the distance score. From (7), we have

$$\frac{\partial S_W(x, h)}{\partial W} = \Psi(x, h) = -\sum_{x_j \in h} (x - x_j)(x - x_j)^\top. \qquad (12)$$

Thus, the update in Algorithm 1 has a simple interpretation, illustrated in Figure 1 (right). For every $x_i \in h^* \setminus \hat{h}$, it pulls $x_i$ closer to $x$; for every $x_i \in \hat{h} \setminus h^*$, it pushes $x_i$ farther from $x$. Here "push" and "pull" refer to the increase or decrease of the Mahalanobis distance under the updated $W$. Any other $x_i$, including any $x_i \in h^* \cap \hat{h}$, has no influence on the update. This is a difference of our approach from LMNN, MLR, etc. In Figure 1 (right), $h^*$ corresponds to points a, c and e, whereas $\hat{h}$ corresponds to points c, e and f; thus point a is pulled in while point f is pushed away.

Since the update does not necessarily preserve $W$ as a PSD matrix, we enforce this by projecting $W$ onto the PSD cone, i.e., by zeroing its negative eigenvalues. Note that since we update (or "downdate") $W$ each time by a matrix of rank at most 2k, the eigendecomposition can be accomplished more efficiently than the naive $O(d^3)$ approach, e.g., as in [17].

Using first-order methods, and in particular gradient methods, for the optimization of non-convex functions is common across machine learning, for instance in training deep neural networks. Despite the lack (to our knowledge) of satisfactory convergence guarantees, these methods are often successful in practice; we will show in the next section that this is true here as well.
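A possible implementation sketch of Algorithms 2 and 3, under the assumption that class labels are 0, ..., R-1 and that $\Lambda$ is given as an R x R array; the helper names are ours and details such as tie-breaking are simplified relative to the paper:

```python
import math
import numpy as np
from collections import Counter

def targeted_inference(x, y_target, X, y, W, k, R, tie_forbidden, exclude=None):
    """Sketch of Algorithm 2: highest-scoring h of size k whose majority vote is y_target.
    Assumes integer class labels 0..R-1; distance ties are broken by index order."""
    tau = 1 if tie_forbidden else 0
    d2 = np.array([float((x - xi) @ W @ (x - xi)) for xi in X])     # D_W(x, x_i)
    if exclude is not None:
        d2[exclude] = np.inf                                         # leave-one-out constraint
    order = np.argsort(d2)                                           # closest first
    n = math.ceil((k + tau * (R - 1)) / R)                           # min. neighbors from y_target
    h, counts = [], Counter()
    for i in order:                                # first n slots: nearest target-class points
        if len(h) == n:
            break
        if y[i] == y_target:
            h.append(i); counts[y[i]] += 1
    for i in order:                                # remaining slots: nearest admissible points
        if len(h) == k:
            break
        if i in h:
            continue
        if y[i] == y_target or counts[y[i]] < counts[y_target] - tau:
            h.append(i); counts[y[i]] += 1
    return h

def loss_augmented_inference(x, y_true, X, y, W, k, R, Lambda, exclude=None):
    """Sketch of Algorithm 3: argmax_h [S_W(x, h) + Delta(y, h)] via per-class targeted inference."""
    best_val, best_h = -np.inf, None
    for r in range(R):                                               # treat each class as the target
        h = targeted_inference(x, r, X, y, W, k, R, tie_forbidden=False, exclude=exclude)
        s = -sum(float((x - X[j]) @ W @ (x - X[j])) for j in h)      # S_W(x, h)
        val = s + Lambda[y_true, r]                                  # loss-augmented value
        if val > best_val:
            best_val, best_h = val, h
    return best_h
```

With these, $\hat{h}_i$ and $h^*_i$ in Algorithm 1 correspond to `loss_augmented_inference(...)` and `targeted_inference(..., tie_forbidden=True)`, respectively.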
One might wonder whether this method is valid for our objective, which is not differentiable; we discuss this briefly before describing the experiments. A given $x$ imposes a Voronoi-type partition of the space of $W$ into a finite number of cells; each cell is associated with a particular combination of $\hat{h}(x)$ and $h^*(x)$ under the values of $W$ in that cell. The score $S_W$ is differentiable (in fact linear) on the interior of each cell, but may be non-differentiable (though continuous) on the boundaries. Since the boundaries between a finite number of cells form a set of measure zero, the score is differentiable almost everywhere.

6 Experiments

We compare the error of kNN classifiers using metrics learned with our approach to that obtained with other learned metrics. For this evaluation we replicate the protocol in [11], using the seven data sets in Table 1. For all data sets, we report the error of the kNN classifier for a range of values of k; for each k, we test the metric learned for that k. The competition to our method includes the Euclidean distance, LMNN [22], NCA [8], ITML [5], MLR [14] and GB-LMNN [11]; the latter learns non-linear metrics rather than Mahalanobis ones. For each of the competing methods, we used the code provided by the authors. In each case we tuned the parameters of each method, including ours, under the same cross-validation protocol. We omit a few other methods that have consistently been shown in the literature to be dominated by the ones we compare to, such as the χ2 distance, MLCC, and M-LMNN. We also could not include χ2-LMNN since code for it is not available; however, published results for k = 3 [11] indicate that our method would win against χ2-LMNN as well.

Isolet and USPS have a standard training/test partition; for the other five data sets, we report the mean and standard errors of 5-fold cross-validation (results for all methods are on the same folds). We experimented with different ways of initializing our method (given the non-convex objective), including the Euclidean distance, all zeros, etc., and found the Euclidean initialization to always be worse. We initialize each fold with either the diagonal matrix learned by ReliefF [12] (which gives a scaled Euclidean distance) or all zeros, depending on whether the scaled Euclidean distance obtained using ReliefF was better than the unscaled Euclidean distance. In each experiment, the $x$ are scaled by the mean and standard deviation of the training portion (for Isolet we also reduce dimensionality to 172 by PCA computed on the training portion). The value of C is tuned on a 75%/25% split of the training portion. Results using different scaling methods are provided in the appendix. Our SGD algorithm stops when the running average of the surrogate loss over the most recent epoch no longer decreases substantially, or after a maximum number of iterations. We use the learning rate $\eta(t) = 1/t$.

The results in Table 1 show that our method dominates the other competitors, including non-linear metric learning methods, and in some cases achieves results significantly better than those of the competition.
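A brief sketch of the preprocessing and initialization protocol described above; `relieff_feature_weights` is an assumed helper standing in for ReliefF [12], and `knn_set` / `majority_vote` are the sketches from Section 3:

```python
import numpy as np

def zscore(train, other):
    """Scale features by the mean and standard deviation of the training portion."""
    mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-12
    return (train - mu) / sigma, (other - mu) / sigma

def knn_error(X_tr, y_tr, X_val, y_val, W, k):
    """Validation error of the kNN rule under metric W."""
    preds = [majority_vote(knn_set(x, X_tr, W, k), y_tr) for x in X_val]
    return float(np.mean([p != t for p, t in zip(preds, y_val)]))

def choose_initial_W(X_tr, y_tr, X_val, y_val, k):
    """Initialize with the ReliefF-scaled diagonal metric, or with zeros if the plain
    Euclidean distance already beats the ReliefF scaling on validation data."""
    w = relieff_feature_weights(X_tr, y_tr)        # assumed helper: per-feature ReliefF weights
    W_relief = np.diag(np.clip(w, 0.0, None))      # scaled Euclidean distance
    W_eucl = np.eye(X_tr.shape[1])                 # unscaled Euclidean distance
    if knn_error(X_tr, y_tr, X_val, y_val, W_relief, k) < knn_error(X_tr, y_tr, X_val, y_val, W_eucl, k):
        return W_relief
    return np.zeros_like(W_eucl)
```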
Table 1: kNN error, for k = 3, 7 and 11. Features were scaled by z-scoring. Mean and standard deviation are shown for data sets on which the 5-fold partition was used. Best performing methods are shown in bold. Note that the only non-linear metric learning method in the table is GB-LMNN.

k = 3

| Dataset | Isolet | USPS | letters | DSLR | Amazon | Webcam | Caltech |
|---|---|---|---|---|---|---|---|
| d | 170 | 256 | 16 | 800 | 800 | 800 | 800 |
| N | 7797 | 9298 | 20000 | 157 | 958 | 295 | 1123 |
| C (classes) | 26 | 10 | 26 | 10 | 10 | 10 | 10 |
| Euclidean | 8.66 | 6.18 | 4.79 ± 0.2 | 75.20 ± 3.0 | 60.13 ± 1.9 | 56.27 ± 2.5 | 80.5 ± 4.6 |
| LMNN | 4.43 | 5.48 | 3.26 ± 0.1 | 24.17 ± 4.5 | 26.72 ± 2.1 | 15.59 ± 2.2 | 46.93 ± 3.9 |
| GB-LMNN | **4.13** | 5.48 | 2.92 ± 0.1 | 21.65 ± 4.8 | 26.72 ± 2.1 | 13.56 ± 1.9 | 46.11 ± 3.9 |
| MLR | 6.61 | 8.27 | 14.25 ± 5.8 | 36.93 ± 2.6 | 24.01 ± 1.8 | 23.05 ± 2.8 | 46.76 ± 3.4 |
| ITML | 7.89 | 5.78 | 4.97 ± 0.2 | 19.07 ± 4.9 | 33.83 ± 3.3 | 13.22 ± 4.6 | 48.78 ± 4.5 |
| NCA | 6.16 | 5.23 | 4.71 ± 2.2 | 31.90 ± 4.9 | 30.27 ± 1.3 | 16.27 ± 1.5 | 46.66 ± 1.8 |
| ours | 4.87 | **5.18** | **2.32 ± 0.1** | **17.18 ± 4.7** | **21.34 ± 2.5** | **10.85 ± 3.1** | **43.37 ± 2.4** |

k = 7

| Dataset | Isolet | USPS | letters | DSLR | Amazon | Webcam | Caltech |
|---|---|---|---|---|---|---|---|
| Euclidean | 7.44 | 6.08 | 5.40 ± 0.3 | 76.45 ± 6.2 | 62.21 ± 2.2 | 57.29 ± 6.3 | 80.76 ± 3.7 |
| LMNN | 3.78 | **4.9** | 3.58 ± 0.2 | 25.44 ± 4.3 | 29.23 ± 2.0 | 14.58 ± 2.2 | 46.75 ± 2.9 |
| GB-LMNN | **3.54** | **4.9** | 2.66 ± 0.1 | 25.44 ± 4.3 | 29.12 ± 2.1 | 12.45 ± 4.6 | 46.17 ± 2.8 |
| MLR | 5.64 | 8.27 | 19.92 ± 6.4 | 33.73 ± 5.5 | 23.17 ± 2.1 | 18.98 ± 2.9 | 46.85 ± 4.1 |
| ITML | 7.57 | 5.68 | 5.37 ± 0.5 | 22.32 ± 2.5 | 31.42 ± 1.9 | **10.85 ± 3.1** | 51.74 ± 2.8 |
| NCA | 6.09 | 5.83 | 5.28 ± 2.5 | 36.94 ± 2.6 | 29.22 ± 2.7 | 22.03 ± 6.5 | 45.50 ± 3.0 |
| ours | 4.61 | **4.9** | **2.54 ± 0.1** | **21.61 ± 5.9** | **22.44 ± 1.3** | 11.19 ± 3.3 | **41.61 ± 2.6** |

k = 11

| Dataset | Isolet | USPS | letters | DSLR | Amazon | Webcam | Caltech |
|---|---|---|---|---|---|---|---|
| Euclidean | 8.02 | 6.88 | 5.89 ± 0.4 | 73.87 ± 2.8 | 64.61 ± 4.2 | 59.66 ± 5.5 | 81.39 ± 4.2 |
| LMNN | **3.72** | **4.78** | 4.09 ± 0.1 | 23.64 ± 3.4 | 30.12 ± 2.9 | 13.90 ± 2.2 | 49.06 ± 2.3 |
| GB-LMNN | 3.98 | **4.78** | **2.86 ± 0.2** | 23.64 ± 3.4 | 30.07 ± 3.0 | 13.90 ± 1.0 | 49.15 ± 2.8 |
| MLR | 5.71 | 11.11 | 15.54 ± 6.8 | 36.25 ± 13.1 | 24.32 ± 3.8 | 17.97 ± 4.1 | 44.97 ± 2.6 |
| ITML | 7.77 | 6.63 | 6.52 ± 0.8 | **22.28 ± 3.1** | 30.48 ± 1.4 | 11.86 ± 5.6 | 50.76 ± 1.9 |
| NCA | 5.90 | 5.73 | 6.04 ± 2.8 | 40.06 ± 6.0 | 30.69 ± 2.9 | 26.44 ± 6.3 | 46.48 ± 4.0 |
| ours | 4.11 | 4.98 | 3.05 ± 0.1 | **22.28 ± 4.9** | **24.11 ± 3.2** | **11.19 ± 4.4** | **40.76 ± 1.8** |

7 Conclusion

We propose a formulation of metric learning for the kNN classifier as a structured prediction problem, with discrete latent variables representing the selection of the k neighbors. We give efficient algorithms for exact inference in this model, including loss-augmented inference, and devise a stochastic gradient algorithm for learning. This approach allows us to learn a Mahalanobis metric with an objective which is a more direct proxy for the stated goal (improvement of classification by the kNN rule) than previously proposed similarity learning methods. Our learning algorithm is simple yet efficient, converging on all the data sets we have experimented upon in reasonable time compared to the competing methods.

Our choice of the Frobenius regularizer is motivated by the desire to control model complexity without biasing towards a particular form of the matrix. We have experimented with alternative regularizers, both the trace norm of $W$ and shrinkage towards the Euclidean distance, $\|W - I\|_F^2$, but found both to be inferior to $\|W\|_F^2$. We suspect that the optimal $W$ often corresponds to a highly anisotropic scaling of the data dimensions, and thus a bias towards $I$ may be unhealthy.

The results in this paper are restricted to the Mahalanobis metric, which is an appealing choice for a number of reasons. In particular, learning such metrics is equivalent to learning a linear embedding of the data, allowing very efficient methods for metric search. Still, one can consider non-linear embeddings $x \mapsto \phi(x; w)$ and define the distance $D$ in terms of the embeddings, for example as $D(x, x_i) = \|\phi(x) - \phi(x_i)\|$ or as $\phi(x)^\top \phi(x_i)$. Learning $S$ in the latter form can be seen as learning a kernel with the discriminative objective of improving kNN performance. Such a model would be more expressive, but also more challenging to optimize. We are investigating this direction.
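The nonlinear extension mentioned here is not part of the paper's experiments; purely as an illustration, an embedding-based distance could be sketched as follows, with a fixed random-feature map standing in for a learned $\phi(\cdot; w)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phi(d_in, d_out):
    """An illustrative nonlinear embedding phi(x); any differentiable map could be substituted."""
    A = rng.normal(size=(d_out, d_in))
    b = rng.uniform(0.0, 2.0 * np.pi, size=d_out)
    return lambda x: np.sqrt(2.0 / d_out) * np.cos(A @ x + b)   # random Fourier features

phi = make_phi(d_in=16, d_out=128)

def embed_dist(x, xi):
    """D(x, x_i) = ||phi(x) - phi(x_i)||, the first form suggested in the conclusion."""
    return float(np.linalg.norm(phi(x) - phi(xi)))

def embed_sim(x, xi):
    """Or the similarity phi(x)^T phi(x_i), i.e., a learned-kernel view."""
    return float(phi(x) @ phi(xi))
```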
Acknowledgments

This work was partly supported by NSF award IIS-1409837.

References

[1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891-923, 1998.
[2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, pages 97-104. ACM, 2006.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144-152. ACM Press, 1992.
[4] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, pages 253-262. ACM, 2004.
[5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209-216, 2007.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE T. PAMI, 32(9):1627-1645, 2010.
[7] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, NIPS, pages 451-458, Cambridge, MA, 2006. MIT Press.
[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[9] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817-824. IEEE, 2011.
[10] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377-384. ACM Press, 2005.
[11] D. Kedem, S. Tyree, K. Weinberger, F. Sha, and G. Lanckriet. Non-linear metric learning. In NIPS, pages 2582-2590, 2012.
[12] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In AAAI, pages 129-134, 1992.
[13] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042-1050, 2009.
[14] B. McFee and G. Lanckriet. Metric learning to rank. In ICML, 2010.
[15] M. Norouzi, D. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, pages 1070-1078, 2012.
[16] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[17] P. Stange. On the efficient update of the singular value decomposition. PAMM, 8(1):10827-10828, 2008.
[18] D. Tarlow, K. Swersky, I. Sutskever, and R. S. Zemel. Stochastic k-neighborhood selection for supervised and unsupervised learning. In ICML, pages 199-207, 2013.
[19] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.
[20] J. Wang, A. Woznica, and A. Kalousis. Learning neighborhoods for metric learning. In ECML-PKDD, 2012.
[21] K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML, pages 1160-1167. ACM, 2008.
[22] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207-244, 2009.
[23] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, pages 505-512. MIT Press, 2002.
[24] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, pages 1169-1176. ACM, 2009.
[25] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In NIPS, pages 1033-1040, 2002.