# kernel_functions_based_on_triplet_comparisons__c78a30c0.pdf

Kernel functions based on triplet comparisons

Matthäus Kleindessner Department of Computer Science

Rutgers University Piscataway, NJ 08854 mk1572@cs.rutgers.edu

Ulrike von Luxburg Department of Computer Science

University of Tübingen Max Planck Institute for Intelligent Systems, Tübingen

luxburg@informatik.uni-tuebingen.de

Given only information in the form of similarity triplets Object A is more similar to object B than to object C about a data set, we propose two ways of deﬁning a kernel function on the data set. While previous approaches construct a lowdimensional Euclidean embedding of the data set that reﬂects the given similarity triplets, we aim at deﬁning kernel functions that correspond to high-dimensional embeddings. These kernel functions can subsequently be used to apply any kernel method to the data set.

1 Introduction

Assessing similarity between objects is an inherent part of many machine learning problems, be it in an unsupervised task like clustering, in which similar objects should be grouped together, or in classiﬁcation, where many algorithms are based on the assumption that similar inputs should produce similar outputs. In a typical machine learning setting one assumes to be given a data set D of objects together with a dissimilarity function d (or, equivalently, a similarity function s) quantifying how close objects are to each other. In recent years, however, a new branch of the machine learning literature has emerged that relaxes this scenario (see the next paragraph and Section 3 for references). Instead of being able to evaluate d itself, we only get to see a collection of similarity triplets of the form Object A is more similar to object B than to object C , which claims that d(A, B) < d(A, C). The main motivation for this relaxation comes from human-based computation: It is widely accepted that humans are better and more reliable at providing similarity triplets, which means assessing similarity on a relative scale, than at providing similarity estimates on an absolute scale ( The similarity between objects A and B is 0.8 ). This can be seen as a special case of the general observation that humans are better at comparing two stimuli than at identifying a single one (Stewart et al., 2005). For this reason, whenever one is lacking a meaningful dissimilarity function that can be evaluated automatically and has to incorporate human expertise into the machine learning process, collecting similarity triplets (e.g., via crowdsourcing) may be an appropriate means.

Given a data set D and similarity triplets for its objects, it is not immediately clear how to solve machine learning problems on D. A general approach is to construct an ordinal embedding of D, that is to map objects to a Euclidean space of a small dimension such that the given triplets are preserved as well as possible (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012; Terada and von Luxburg, 2014; Amid and Ukkonen, 2015; Heim et al., 2015; Amid et al., 2016; Jain et al., 2016). Once such an ordinal embedding has been constructed, one can solve a problem on D by solving it on the embedding. Only recently, algorithms have been proposed for solving various speciﬁc problems directly without constructing an ordinal embedding as an intermediate step (Heikinheimo and Ukkonen, 2013; Kleindessner and von Luxburg, 2017). With this paper we provide another generic means for solving machine learning problems based on similarity triplets that is different from

Work done while being a Ph D student at the University of Tübingen.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

the ordinal embedding approach. We deﬁne two data-dependent kernel functions on D, corresponding to high-dimensional embeddings of D, that can subsequently be used by any kernel method. Our proposed kernel functions measure similarity between two objects in D by comparing to which extent the two objects give rise to resembling similarity triplets. The intuition is that this quantiﬁes the relative difference in the locations of the two objects in D. Experiments on both artiﬁcial and real data show that this is indeed the case and that the similarity scores deﬁned by our kernel functions are meaningful. Our approach is appealingly simple, and other than ordinal embedding algorithms our kernel functions are deterministic and parameter-free. We observe them to run signiﬁcantly faster than well-known embedding algorithms and to be ideally suited for a landmark design.

Setup Let X be an arbitrary set and d : X X ! R+

0 be a symmetric dissimilarity function on X: a higher value of d means that two elements of X are more dissimilar to each other. The terms dissimilarity and distance are used synonymously. To simplify presentation, we assume that for all triples of distinct objects A, B, C 2 X either d(A, B) < d(A, C) or d(A, B) > d(A, C) is true. Note that we do not require d to be a metric. We formally deﬁne a similarity triplet as binary answer to a dissimilarity comparison

?< d(A, C). (1)

We refer to A as the anchor object. A similarity triplet can be incorrect, meaning that it claims a positive answer to the comparison (1) although in fact the negative answer is true. In the following, we deal with a ﬁnite data set D = {x1, . . . , xn} X and collections of similarity triplets that are encoded as follows: an ordered triple of distinct objects (xi, xj, xk) means d(xi, xj) < d(xi, xk). A collection of similarity triplets is the only information that we are given about D. Note that such a collection does not necessarily provide an answer to every possible dissimilarity comparison (1).

2 Our kernel functions

Assume we are given a collection S of similarity triplets for the objects of D. Similarity triplets in S can be incorrect, but for the moment assume that contradicting triples (xi, xj, xk) and (xi, xk, xj) cannot be present in S at the same time. We will discuss how to deal with the general case below.

Kernel function k1 Our ﬁrst kernel function is based on the following idea: We ﬁx two objects xa and xb. In order to compute a similarity score between xa and xb we would like to rank all objects in D with respect to their distance from xa and also rank them with respect to their distance from xb, and take a similarity score between these two rankings as similarity score between xa and xb. One possibility to measure similarity between rankings is given by the famous Kendall tau correlation coefﬁcient (Kendall, 1938), which is also known as Kendall s : for two rankings of n items, Kendall s between the two rankings is the fraction of concordant pairs of items minus the fraction of discordant pairs of items. Here, a pair of two items i1 and i2 is concordant if i1 i2 or i1 i2 according to both rankings, and discordant if it satisﬁes i1 i2 according to one and i1 i2 according to the other ranking. Formally, a ranking is represented by a permutation σ : {1, . . . , n} ! {1, . . . , n} such that σ(i) 6= σ(j), i 6= j, and σ(i) = m means that item i is ranked at the m-th position. Given two rankings σ1 and σ2, the number of concordant pairs equals

fc(σ1, σ2) =

{σ1(i) < σ1(j)}

{σ2(i) < σ2(j)} +

{σ1(i) > σ1(j)}

{σ2(i) > σ2(j)}],

the number of discordant pairs equals

fd(σ1, σ2) =

{σ1(i) < σ1(j)}

{σ2(i) > σ2(j)} +

{σ1(i) > σ1(j)}

{σ2(i) < σ2(j)}],

and Kendall s between σ1 and σ2 is given by (σ1, σ2) = [fc(σ1, σ2) fd(σ1, σ2)] /

By measuring similarity between the two rankings of objects (one with respect to their distance from xa and one with respect to their distance from xb) with Kendall s we would compute a similarity score between xa and xb. This idea is illustrated with an example in Figure 1 (left). It has been established recently that Kendall s is actually a kernel function on the set of total rankings (Jiao and Vert, 2015). Hence, by measuring similarity on D in the described way we would even end up with a

Figure 1: Illustrations of the ideas behind k1 (left) and k2 (right). For k1: In order to compute a similarity score between x1 (in red) and x2 (in blue) we would like to rank all objects with respect to their distance from x1 and also with respect to their distance from x2 and compute Kendall s between the two rankings. In this example, the objects would rank as x1 x3 x2 x4 x5 x6 x7 and x2 x3 x6 x1 x5 x4 x7, respectively. Kendall s between these two rankings is 1/3, and this would be the similarity score between x1 and x2. For comparison, the score between x1 and x7 (in green) would be 5/7, and between x2 and x7 it would be 3/7. For k2: In order to compute a similarity score between x1 and x2 we would like to check for every pair of objects (xi, xj) whether the distance comparisons d(xi, x1)

? < d(xi, xj) and d(xi, x2)

? < d(xi, xj) yield the same result or not. Here, we have 32 pairs for which they yield the same result and 17 pairs for which they do not. We would assign 7 2 (32 17) = 15/49 as similarity score between x1 and x2. The score between x1 and x7 would be 3/49, and between x2 and x7 it would be 1/49.

kernel function on D since the following holds: for any mapping h : D ! Z and kernel function k : Z Z ! R, k (h, h) : D D ! R is a kernel function.

In our situation, the problem is that in most cases S will contain only a small fraction of all possible similarity triplets and also that some of the triplets in S might be incorrect, so that there is no way of ranking all objects with respect to their distance from any ﬁxed object based on the similarity triplets in S. To adapt the procedure, we consider a feature map that corresponds to the kernel function just described. By a feature map corresponding to a kernel function k : D D ! R we mean a mapping Φ : D ! Rm for some m 2 N such that k(xi, xj) = hΦ(xi), Φ(xj)i = Φ(xi)T Φ(xj). It is easy to see from the above formulas (also compare with Jiao and Vert, 2015) that a feature map corresponding to the described kernel function is given by Φk : D ! R(

Φk (xa) = 1 q"n

{d(xa, xi) < d(xa, xj)}

{d(xa, xi) > d(xa, xj)}

In our situation, where we are only given S and cannot evaluate Φk in most cases, we have to replace Φk by an approximation: up to a normalizing factor, we replace an entry in Φk (xa) by zero if we cannot evaluate it based on the triplets in S. More precisely, we consider the feature map Φk1 : D ! R(

n 2) given by Φk1(xa) = ([Φk1(xa)]i,j)1 i<j n with

[Φk1(xa)]i,j = 1 p

|{(xi, xj, xk) 2 S : xi = xa}|

{(xa, xi, xj) 2 S}

{(xa, xj, xi) 2 S}

(2) and deﬁne our ﬁrst proposed kernel function k1 : D D ! R by

k1(xi, xj) = Φk1(xi)T Φk1(xj). (3) Note that the scaling factor in the deﬁnition of Φk1, ensuring that the feature embedding lies on the unit sphere, is crucial whenever the number of similarity triplets in which an object appears as anchor object is not approximately constant over the different objects. For ease of exposition we have assumed that every object in D appears at least once as an anchor object in a similarity triplet in S. In the unlikely case that xa does not appear at least once as an anchor object, meaning that we do not have any information for ranking the objects in D with respect to their distance from xa at all, we simply set Φk1(xa) to zero (which is consistent with (2) under the convention 0/0=0 ).

Kernel function k2 Our second kernel function is based on a similar idea. Now we do not consider xa and xb as anchor objects when measuring their similarity, but compare whether they rank similarly with respect to their distances from the various other objects. Concretely, we would like to count the number of pairs of objects (xi, xj) for which the comparisons

?< d(xi, xj) and d(xi, xb)

?< d(xi, xj) (4)

{y : d(y, xn) < d(y, x2)}

{y : d(y, x5) < d(y, x4)}

Figure 2: k1 measures similarity between two objects by counting in how many of the halfspaces that are obtained from distance comparisons the two objects reside at the same time. The outcome does not only depend on the distance between the two objects, but also on their location within the data set: although x1 and x2 are located far apart, k1 considers them to be very similar. See the running text for details.

yield the same result and subtract the number of pairs for which these comparisons yield different results. See the right-hand side of Figure 1 for an illustration of this idea. Adapted to our situation of being only given S it corresponds to considering the feature map Φk2 : D ! Rn2 given by

Φk2(xa) = 1 p

|{(xi, xj, xk) 2 S : xj = xa _ xk = xa}|

{(xi, xa, xj) 2 S}

{(xi, xj, xa) 2 S}

and deﬁning our second proposed kernel function k2 : D D ! R by

k2(xi, xj) = Φk2(xi)T Φk2(xj). (5)

Again, the scaling factor in the deﬁnition of Φk2 is crucial whenever there are objects appearing in more similarity triplets than others and we apply the convention 0/0=0 .

Contradicting similarity triplets If S contains contradicting triples (xi, xj, xk) and (xi, xk, xj) and there might be triples being present repeatedly, one can alter the deﬁnition of Φk1 or Φk2 as follows: if #{(xa, xi, xj) 2 S} denotes the number of how often the triple (xa, xi, xj) appears in S, set Φk1(xa) = eΦk1(xa)/

)) where eΦk1(xa) equals #{(xa, xi, xj) 2 S} #{(xa, xj, xi) 2 S}

#{(xa, xi, xj) 2 S} + #{(xa, xj, xi) 2 S}

The deﬁnition of Φk2 can be revised in an analogous way. In doing so, we incorporate a simple estimate of the likelihood of a triple being correct.

2.1 Reducing diagonal dominance

If the number |S| of given similarity triplets is small, our kernel functions suffer from a problem that is shared by many other kernel functions deﬁned on complex data: Φk1 and Φk2 map the objects in D to sparse vectors, that is almost all of their entries are zero. As a consequence, two different feature vectors Φki(xa) and Φki(xb) appear to be almost orthogonal and the similarity score ki(xa, xb) is much smaller than the self-similarity scores ki(xa, xa) or ki(xb, xb). This phenomenon, usually referred to as diagonal dominance of the kernel function, has been observed to pose difﬁculties for the kernel methods using the kernel function, and several ways have been proposed for dealing with it (Schölkopf et al., 2002; Greene and Cunningham, 2006). In all our experiments we deal with diagonal dominance in the following simple way: Let k denote a kernel function and K the kernel matrix on D, that is K = (k(xi, xj))n

i,j=1, which would be the input to a kernel method. Then we replace K by K λmin I where I 2 Rn n denotes the identity matrix and λmin is the smallest eigenvalue of K.

2.2 Geometric intuition

Intuitively, our kernel functions measure similarity between xa and xb by quantifying to which extent xa and xb can be expected to be located in the same region of D: Think of D as a subset of Rm and d being the Euclidean metric. A similarity triplet d(xa, xi) < d(xa, xj) then tells us that xa resides in the halfspace deﬁned by the hyperplane that is perpendicular to the line segment connecting xi and xj and goes through the segment s midpoint. If there is also a similarity triplet d(xb, xi) < d(xb, xj), xa and xb thus are located in the same halfspace (assuming the correctness of the similarity triplets) and this is reﬂected by a higher value of k1(xa, xb). Similarly, a similarity triplet d(xi, xa) < d(xi, xj)

400 points Distance matrix K1 K2 Similarity scores

-6 -4 -2 0 2 4 6

-5 0 5 10 15

-5 0 5 10 15 -10

Figure 3: Kernel matrices for two data sets, each consisting of 400 points, based on 10% of all similarity triplets. 1st plot of a row: Data points. 2nd plot: Distance matrix. 3rd / 4th plot: Kernel matrix for k1 / k2. 6th plot: Similarity scores between a ﬁxed point and the other points (for k1).

tells us that xa is located in a ball with radius d(xi, xj) centered at xi, and the value of k2(xa, xb) is higher if there is a similarity triplet d(xi, xb) < d(xi, xj) telling us that xb is located in this ball too and it is smaller if there is a triplet d(xi, xj) < d(xi, xb) telling us that xb is not located in this ball.

Note that the similarity scores between xa and xb deﬁned by k1 or k2 do not only depend on d(xa, xb), but rather on the locations of xa and xb within D and on how the points in D are spread in the space since this affects how the various hyperplanes or balls are related to each other. Consider the example illustrated in Figure 2: Let d(x3, xn) = 1 implying that d(xi, xi+1) = (1/n), 3 i < n, and d(x1, x2) > d(x2, xn) > d(x1, xn) > d(x2, x3) > d(x1, x3) > 1 be arbitrarily large. Although x1 and x2 are located at the maximum distance to each other, they satisfy d(x1, xi) < d(x1, xj) and d(x2, xi) < d(x2, xj) for all 3 i < j n, and hence both x1 and x2 are jointly located in all the halfspaces obtained from these distance comparisons. We end up with k1(x1, x2) ! 1, n ! 1, assuming k1 is computed based on all possible similarity triplets, all of which are correct. The distance between x3 and xn is much smaller, but there are many points in between them and the hyperplanes obtained from the distance comparisons with these points separate x3 and xn. We end up with k1(x3, xn) ! 1, n ! 1. Depending on the task at hand, this may be desirable or not.

Let us examine the meaningfulness of our kernel functions by calculating them on ﬁve visualizable data sets. Each of the ﬁrst four data sets consists of 400 points in R2 and d equals the Euclidean metric. The ﬁfth data set consists of 400 vertices of an undirected graph from a stochastic block model and d equals the shortest path distance. We computed k1 and k2 based on 10% of all possible similarity triplets (chosen uniformly at random from all triplets). The results for the ﬁrst two data sets are shown in Figure 3. The results for the remaining data sets are shown in Figure 6 in Section A.1 in the supplementary material. The ﬁrst plot of a row shows the data set. The second plot shows the distance matrix on the data set. Next, we can see the kernel matrices. The last plot of a row shows the similarity scores (encoded by color) based on k1 between one ﬁxed point (shown as a black cross) and the other points in the data set. Clearly, the kernel matrices reﬂect the block structures of the distance matrices, and the similarity scores between a ﬁxed point and the other points tend to decrease as the distances to the ﬁxed point increase. A situation like in the example of Figure 2 does not occur.

2.3 Landmark design

Our kernel functions are designed as to extract information from an arbitrary collection S of similarity triplets. However, by construction, a single triplet is useless, and what matters is the concurrent presence of two triplets: k1(xa, xb) is only affected by pairs of triplets answering d(xa, xi)

? < d(xa, xj)

and d(xb, xi)

? < d(xb, xj), while k2(xa, xb) is only affected by pairs of triplets answering (4). Hence, when we can choose which dissimilarity comparisons of the form (1) are evaluated for creating S (e.g., in crowdsourcing), we should aim at maximizing the number of appropriate pairs of triplets. This can easily be achieved by means of a landmark design inspired from landmark multidimensional scaling (de Silva and Tenenbaum, 2004): We choose a small subset of landmark objects L D. Then, for k1, only comparisons of the form d(xi, xj)

? < d(xi, xk) with xi 2 D and xj, xk 2 L are evalu-

ated. For k2, only comparisons of the form d(xj, xi)

? < d(xj, xk) with xi 2 D and xj, xk 2 L are evaluated. The landmark objects can be chosen either randomly or, if available, based on additional knowledge about D and the task at hand.

2.4 Computational complexity

General S A naive implementation of our kernel functions explicitly computes the feature vectors Φk1(xi) or Φk2(xi), i = 1, . . . , n, and subsequently calculates the kernel matrix K by means of (3) or (5). In doing so, we store the feature vectors in the feature matrix Φk1(D) = (Φk1(xi))n

n 2) n or Φk2(D) = (Φk2(xi))n

i=1 2 Rn2 n. Proceeding this way is straightforward and simple, requiring to go through S only once, but comes with a computational cost of O(|S| + n4) operations. Note that the number of different distance comparisons of the form (1) is O(n3) and hence one might expect that |S| 2 O(n3) and O(|S| + n4) = O(n4). By performing (3) or (5) in terms of matrix multiplication Φk1(D)T Φk1(D) or Φk2(D)T Φk2(D) and applying Strassen s algorithm (Higham, 1990) one can reduce the number of operations to O(|S| + n3.81), but still this is infeasible for many

data sets. Infeasibility for large data sets, however, is even more the case for ordinal embedding algorithms, which are the current state-of-the-art method for solving machine learning problems based on similarity triplets. All existing ordinal embedding algorithms iteratively solve an optimization problem. For none of these algorithms theoretical bounds for their complexity are available in the literature, but it is widely known that their running times are prohibitively high (Heim et al., 2015; Kleindessner and von Luxburg, 2017).

Landmark design If we know that S contains only dissimilarity comparisons involving landmark objects, we can adapt the feature matrices such that Φk1(D) 2 R(

2 ) n or Φk2(D) 2 R|L|2 n

and reduce the number of operations to O(|S| + min{|L|2, n}log2(7/8)|L|2n2), which is O(|S| + |L|1.62n2) if |L|2 n. Note that in this case we might expect that |S| 2 O(|L|2n).

In both cases, whenever the number of given similarity triplets |S| is small compared to the number of all different distance comparisons under consideration, the feature matrix Φk1(D) or Φk2(D) is sparse with only O(|S|) non-zero entries and methods for sparse matrix multiplication decrease computational complexity (Gustavson, 1978; Kaplan et al., 2006).

3 Related work

Similarity triplets are a special case of answers to the general dissimilarity comparisons d(A, B)

? < d(C, D), A, B, C, D 2 X. We refer to any collection of answers to these general comparisons as ordinal data. In recent years, ordinal data has become popular in machine learning. Among the work on ordinal data in general (see Kleindessner and von Luxburg, 2014, 2017, for references), similarity triplets have been paid particular attention: Jamieson and Nowak (2011) deal with the question of how many similarity triplets are required for uniquely determining an ordinal embedding of Euclidean data. This work has been carried on and generalized by Jain et al. (2016). Algorithms for constructing an ordinal embedding based on similarity triplets (but not on general ordinal data) are proposed in Tamuz et al. (2011), van der Maaten and Weinberger (2012), Amid et al. (2016), and Jain et al. (2016). Heikinheimo and Ukkonen (2013) present a method for medoid estimation based on statements Object A is the outlier within the triple of objects (A, B, C) , which correspond to the two similarity triplets d(B, C) < d(B, A) and d(C, B) < d(C, A). Ukkonen et al. (2015) use the same kind of statements for density estimation and Ukkonen (2017) uses them for clustering. Wilber et al. (2014) examine how to minimize time and costs when collecting similarity triplets via crowdsourcing. Producing a number of ordinal embeddings at the same time, each corresponding to a different dissimilarity function based on which a comparison (1) might have been evaluated, is studied in Amid and Ukkonen (2015). In Heim et al. (2015), one of the algorithms by van der Maaten and Weinberger (2012) is adapted from the batch setting to an online setting, in which similarity triplets are observed in a sequential way, using stochastic gradient descent. In Kleindessner and von Luxburg (2017), we propose algorithms for medoid estimation, outlier detection, classiﬁcation, and clustering based on statements Object A is the most central object within (A, B, C) , which comprise the two similarity triplets d(B, A) < d(B, C) and d(C, A) < d(C, B). Finally, Haghiri et al. (2017) study the problem of efﬁcient nearest neighbor search based on similarity triplets. There

Figure 4: Best viewed magniﬁed on screen. Left: Clustering of the food data set. Part of the dendrogram obtained from complete-linkage clustering using k1. Right: Kernel PCA on the car data set based on the kernel function k2.

is also a number of papers that consider similarity triplets as side information to vector data (e.g., Schultz and Joachims, 2003; Mc Fee and Lanckriet, 2011; Wilber et al., 2015).

4 Experiments

We performed experiments that demonstrate the usefulness of our kernel functions. We ﬁrst apply them to three small image data sets for which similarity triplets have been gathered via crowdsourcing. We then study them more systematically and compare them to an ordinal embedding approach in clustering tasks on subsets of USPS and MNIST digits using synthetically generated triplets.

4.1 Crowdsourced similarity triplets

In this section we present experiments on real crowdsourcing data that show that our kernel functions can capture the structure of a data set. Note that for the following data sets there is no ground truth available and hence there is no way other than visual inspection for evaluating our results.

Food data set We applied the kernelized version of complete-linkage clustering based on our kernel function k1 to the food data set introduced in Wilber et al. (2014). This data set consists of 100 images2 of a wide range of foods and comes with 190376 (unique) similarity triplets, which contain 9349 pairs of contradicting triplets. Figure 4 (left) shows a part of the dendrogram that we obtained. Each of the ten clusters depicted there contains pretty homogeneous images. For example, the fourth row only shows vegetables and salads whereas the ninth row only shows fruits and the last row only shows desserts. To give an impression of accelerated running time of our approach compared to an ordinal embedding approach: computation of k1 or k2 on this data set took about 0.1 seconds while computing an ordinal embedding using the GNMDS algorithm (Agarwal et al., 2007) took 18 seconds (embedding dimension equaling two; all computations performed in Matlab see Section 4.2 for details; the embedding is shown in Figure 9 in Section A.1 in the supplementary material).

Car data set We applied kernel PCA (Schölkopf et al., 1999) based on our kernel function k2 to the car data set, which we have introduced in Kleindessner and von Luxburg (2017). It consists of 60 images of cars. For this data set we have collected statements of the kind Object A is the most central object within (A, B, C) , meaning that d(B, A) < d(B, C) and d(C, A) < d(C, B), via crowdsourcing. We ended up with 13514 similarity triplets, of which 12502 were unique. The projection of the car data set onto the ﬁrst two kernel principal components can be seen in Figure 4 (right). The result looks reasonable, with the cars arranged in groups of sports cars (top left), ordinary cars (middle right) and off-road/sport utility vehicles (bottom left). Also within these groups there is some reasonable structure. For example, the race-like sports cars are located near to each other and close to the Formula One car, and the sport utility vehicles from German manufacturers are placed next to each other.

Nature data set We performed similar experiments on the nature data set introduced in Heikinheimo and Ukkonen (2013). The results are presented in Section A.2 in the supplementary material.

2According to Wilber et al., the data set contains copyrighted material under the educational fair use exemption to the U.S. copyright law.

We would like to discuss a question raised by one of the reviewers: in our setup (see Section 1), we assume that similarity triplets are noisy evaluations of dissimilarity comparisons (1), where d is some ﬁxed dissimilarity function. This leads to our (natural) way of dealing with contradicting similarity triplets as described in Section 2. In a different setup one could drop the dissimilarity function d and consider similarity triplets as elements of some binary relation on D D that is not necessarily transitive or antisymmetric. In the latter setup it is not clear whether our way of dealing with contradicting triplets is the right thing to do. However, we believe that the experiments of this section show that our setup is valid in a wide range of scenarios and our approach works in practice.

4.2 Synthetically generated triplets

We studied our kernel functions with respect to the number of input similarity triplets that they require in order to produce a valuable solution in clustering tasks. We found that in the scenario of a general collection S of triplets our approach is highly superior compared to an ordinal embedding approach in terms of running time, but on most data sets it is inferior regarding the required number of triplets. The full beneﬁt of our kernel functions emerges in a landmark design. There our approach can compete with an embedding approach in terms of the required number of triplets and is so much faster as to being easily applicable to large data sets to which ordinal embedding algorithms are not. In this section we want to demonstrate this claim. We studied k1 and k2 in a landmark design by applying kernel k-means clustering (Dhillon et al., 2001) to subsets of USPS and MNIST digits, respectively. Collections S of similarity triplets were generated as follows: We chose a certain number of landmark objects uniformly at random from all objects of the data set under consideration. Choosing d as the Euclidean metric, we created answers to all possible distance comparisons with the landmark objects as explained in Section 2.3. Answers were incorrect with some probability 0 ep 1 independently of each other. From the set of all answers we chose triplets in S uniformly at random without replacement. We compared our approach to an ordinal embedding approach with ordinary k-means clustering. We tried the GNMDS (Agarwal et al., 2007), the CKL (Tamuz et al., 2011), and the t-STE (van der Maaten and Weinberger, 2012) embedding algorithms in the Matlab implementation made available by van der Maaten and Weinberger (2012). In doing so, we set all parameters except the embedding dimension to the provided default parameters. The parameter µ of the CKL algorithm was set to 0.1 since we observed good results with this value. Note that in these unsupervised clustering tasks there is no immediate way of performing cross-validation for choosing parameters. We compared to the embedding algorithms in two scenarios: in one case they were provided the same triplets as input as our kernel functions, in the other case (denoted by the additional rand in the plots) they were provided a same number of triplets chosen uniformly at random with replacement from all possible triplets (no landmark design) and incorrect with the same probability ep. For further comparison, we considered ordinary k-means applied to the original point set and a random clustering. We always provided the correct number of clusters as input, and set the number of replicates in k-means and kernel k-means to ﬁve and the maximum number of iterations to 100. For assessing the quality of a clustering we computed its purity (e.g., Manning et al., 2008), which measures the accordance with the known ground truth partitioning according to the digits values. A high purity value indicates a good clustering. Note that the limitation for the scale of our experiments only comes from the running time of the embedding algorithms and not from our kernel functions. Still, in terms of the number of data points our experiments are comparable or actually even superior to all the papers on ordinal embedding cited in Section 3. In terms of the number of similarity triplets per data point, we used comparable numbers of triplets.

USPS digits We chose 1000 points uniformly at random from the subset of USPS digits 1, 2, and 3. Using 15 landmark objects, we studied the performance of our approach and the ordinal embedding approach as a function of the number of input triplets. The ﬁrst and the second row of Figure 5 show the results (average over 10 runs of an experiment) for k1. The results for k2 are shown in Figure 7 in Section A.1 in the supplementary material. The ﬁrst two plots of a row show the purity values of the various clusterings for ep = 0 and ep = 0.3, respectively. The third and the fourth plot show the corresponding time (in sec) that it took to compute our kernel function or an ordinal embedding. We set the embedding dimension to 2 (1st row) or 10 (2nd row). Based on the achieved purity values no method can be considered superior. Our kernel function k2 performs slightly worse than k1 and the ordinal embedding algorithms. The GNMDS algorithm apparently cannot deal with the landmark triplets at all and yields the same purity values as a random clustering when provided with the landmark triplets. Our approach is highly superior regarding running time. The running times of the

0 1 2 3 4 # input triplets

k1 on USPS (ep=0, embedding dim=2)

0 2 4 6 # input triplets 104

k1 on USPS (ep=0.3, embedding dim=2)

k1 GNMDS t-STE CKL GNMDS rand t-STE rand CKL rand Coordinates Random

0 1 2 3 4 # input triplets 104

Running time [s]

k1 on USPS (ep=0, embedding dim=2)

0 2 4 6 # input triplets 104

Running time [s]

k1 on USPS (ep=0.3, embedding dim=2)

0 1 2 3 4 # input triplets 104

k1 on USPS (ep=0, embedding dim=10)

0 2 4 6 # input triplets 104

k1 on USPS (ep=0.3, embedding dim=10)

0 1 2 3 4 # input triplets 104

Running time [s]

k1 on USPS (ep=0, embedding dim=10)

0 2 4 6 # input triplets 104

Running time [s]

k1 on USPS (ep=0.3, embedding dim=10)

0 2000 4000 6000 8000 10000 # points

k1 on MNIST (ep=0.15, embedding dim=5)

0 2000 4000 6000 8000 10000 # points

k2 on MNIST (ep=0.15, embedding dim=5)

0 2000 4000 6000 8000 10000 # points

Running time [s]

k1 on MNIST (ep=0.15, embedding dim=5)

0 2000 4000 6000 8000 10000 # points

Running time [s]

k2 on MNIST (ep=0.15, embedding dim=5)

Figure 5: 1st & 2nd row (USPS digits for k1): Clustering 1000 points from USPS digits 1, 2, and 3. Purity and running time as a function of the number of input triplets. 3rd row (MNIST digits): Clustering subsets of MNIST digits. Purity and running time as a function of the number of points.

ordinal embedding algorithms depend on the embedding dimension and ep and in these experiments the dependence is monotonic. All computations were performed in Matlab R2016a on a Mac Book Pro with 2.9 GHz Intel Core i7 and 8 GB 1600 MHz DDR3. In order to make a fair comparison we did not use MEX ﬁles or sparse matrix operations in the implementation of our kernel functions.

MNIST digits We studied the performance of the various methods as a function of the size n of the data set with the number of input triplets growing linearly with n. For i = 1, . . . , 10, we chose n = i 103 points uniformly at random from MNIST digits. We used 30 landmark objects and provided 150n input similarity triplets. The third row of Figure 5 shows the purity values of the various methods for k1 / k2 (1st / 2nd plot) and the corresponding running times (3rd / 4th plot) when ep = 0.15. The embedding dimension was set to 5. A spot check suggested that setting it to 2 would have given worse results, while setting it to 10 would have given similar results, but would have led to a higher running time. We computed the t-STE embedding only for n 6000 due to its high running time. It seems that GNMDS with random input triplets performs best, but for large values of n our kernel function k1 can compete with it. For 10000 points, computing k1 or k2 took 100 or 180 seconds, while even the fastest embedding algorithm ran for 2000 seconds. For further comparison, Figure 8 in Section A.1 in the supplementary material shows a kernel PCA embedding based on k1 (150n landmark triplets) and a 2-dim GNMDS embedding (150n random triplets) of n = 20000 digits. Here, computation of k1 took 900 seconds, while GNMDS ran for more than 6000 seconds.

5 Conclusion

We proposed two data-dependent kernel functions that can be evaluated when given only an arbitrary collection of similarity triplets for a data set D. Our kernel functions can be used to apply any kernel method to D. Hence they provide a generic alternative to the standard ordinal embedding approach based on numerical optimization for machine learning with similarity triplets. In a number of experiments we demonstrated the meaningfulness of our kernel functions. A big advantage of our kernel functions compared to the ordinal embedding approach is that our kernel functions run signiﬁcantly faster. A drawback is that, in general, they seem to require a higher number of similarity triplets for capturing the structure of a data set. However, in a landmark design our kernel functions can compete with the ordinal embedding approach in terms of the required number of triplets.

Acknowledgements

This work has been supported by the Institutional Strategy of the University of Tübingen (DFG, ZUK 63).

S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric

multidimensional scaling. In International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), 2007.

E. Amid and A. Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In

International Conference on Machine Learning (ICML), 2015.

E. Amid, N. Vlassis, and M. Warmuth. t-exponential triplet embedding. ar Xiv:1611.09957v1 [cs.AI],

V. de Silva and J. Tenenbaum. Sparse multidimensional scaling using landmark points. Technical

report, Stanford University, 2004.

I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In

International Conference on Knowledge Discovery and Data Mining (KDD), 2001.

D. Greene and P. Cunningham. Practical solutions to the problem of diagonal dominance in kernel

document clustering. In International Conference on Machine Learning (ICML), 2006.

F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition.

ACM Transactions on Mathematical Software, 4(3):250 269, 1978.

S. Haghiri, U. von Luxburg, and D. Ghoshdastidar. Comparison based nearest neighbor search. In

International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), 2017.

H. Heikinheimo and A. Ukkonen. The crowd-median algorithm. In Conference on Human Computa-

tion and Crowdsourcing (HCOMP), 2013. Data available on http://www.anttiukkonen.com/.

E. Heim, M. Berger, L. M. Seversky, and M. Hauskrecht. Efﬁcient online relative comparison kernel

learning. In SIAM International Conference on Data Mining (SDM), 2015.

N. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on

Mathematical Software, 16(4):352 368, 1990.

L. Jain, K. Jamieson, and R. Nowak. Finite sample prediction and recovery bounds for ordinal

embedding. In Neural Information Processing Systems (NIPS), 2016.

K. Jamieson and R. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In

Allerton Conference on Communication, Control, and Computing, 2011.

Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. In International Conference

on Machine Learning (ICML), 2015.

H. Kaplan, M. Sharir, and E. Verbin. Colored intersection searching via sparse rectangular matrix

multiplication. In Symposium on Computational Geometry (So CG), 2006.

M. Kendall. A new measure of rank correlation. Biometrika, 30(1 2):81 93, 1938.

M. Kleindessner and U. von Luxburg. Uniqueness of ordinal embedding. In Conference on Learning

Theory (COLT), 2014.

M. Kleindessner and U. von Luxburg. Lens depth function and k-relative neighborhood graph:

Versatile tools for ordinal data analysis. JMLR, 18(58):1 52, 2017. Data available on http://www.tml.cs.uni-tuebingen.de/team/luxburg/code_and_data/.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge

University Press, 2008.

B. Mc Fee and G. Lanckriet. Learning multi-modal similarity. JMLR, 12:491 523, 2011.

B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf,

C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 327 352. MIT Press, 1999.

B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W. Noble. A kernel approach for learning from

almost orthogonal patterns. In European Conference on Machine Learning (ECML), 2002.

M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Neural

Information Processing Systems (NIPS), 2003.

N. Stewart, G. D. A. Brown, and N. Chater. Absolute identiﬁcation by relative judgment. Psychologi-

cal Review, 112(4):881 911, 2005.

O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. Kalai. Adaptively learning the crowd kernel. In

International Conference on Machine Learning (ICML), 2011.

Y. Terada and U. von Luxburg. Local ordinal embedding. In International Conference on Machine

Learning (ICML), 2014.

A. Ukkonen. Crowdsourced correlation clustering with relative distance comparisons. In International

Conference on Data Mining series (ICDM), 2017.

A. Ukkonen, B. Derakhshan, and H. Heikinheimo. Crowdsourced nonparametric density estimation

using relative distances. In Conference on Human Computation and Crowdsourcing (HCOMP), 2015.

L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012. Code available on http://homepage.tudelft.nl/19j49/ste/.

M. Wilber, I. Kwak, and S. Belongie. Cost-effective hits for relative similarity comparisons. In

Conference on Human Computation and Crowdsourcing (HCOMP), 2014. Data available on http://vision.cornell.edu/se3/projects/cost-effective-hits/.

M. Wilber, I. Kwak, D. Kriegman, and S. Belongie. Learning concept embeddings with combined

human-machine expertise. In International Conference on Computer Vision (ICCV), 2015.