# Boosting for Comparison-Based Learning

Michaël Perrot¹ and Ulrike von Luxburg¹,²

¹Max Planck Institute for Intelligent Systems, Tübingen, Germany
²University of Tübingen, Department of Computer Science, Tübingen, Germany

michael.perrot@tuebingen.mpg.de, ulrike.luxburg@uni-tuebingen.de

**Abstract.** We consider the problem of classification in a comparison-based setting: given a set of objects, we only have access to triplet comparisons of the form "object A is closer to object B than to object C". In this paper we introduce TripletBoost, a new method that can learn a classifier just from such triplet comparisons. The main idea is to aggregate the triplet information into weak classifiers, which can subsequently be boosted to a strong classifier. Our method has two main advantages: (i) it is applicable to data from any metric space, and (ii) it can deal with large scale problems using only passively obtained and noisy triplets. We derive theoretical generalization guarantees and a lower bound on the number of necessary triplets, and we empirically show that our method is both competitive with state of the art approaches and resistant to noise.

## 1 Introduction

In the past few years the problem of comparison-based learning has attracted growing interest in the machine learning community [Agarwal et al., 2007, Jamieson and Nowak, 2011, Tamuz et al., 2011, Tschopp et al., 2011, Van Der Maaten and Weinberger, 2012, Heikinheimo and Ukkonen, 2013, Amid and Ukkonen, 2015, Kleindessner and Luxburg, 2015, Jain et al., 2016, Haghiri et al., 2017, Kazemi et al., 2018]. The motivation is to relax the assumption that an explicit representation of the objects or a distance metric between pairs of examples is available. Instead, one only has access to a set of ordinal distance comparisons that can take several forms depending on the problem at hand. In this paper we focus on triplet comparisons of the form "object $x_i$ is closer to object $x_j$ than to object $x_k$", that is, on relations of the form $d(x_i, x_j) < d(x_i, x_k)$ where $d$ is an unknown metric.¹

We address the problem of classification with noisy triplets that have been obtained in a passive manner: the examples lie in an unknown metric space, not necessarily Euclidean, and we are only given a small set of triplet comparisons; there is no way in which we could actively ask for more triplets. Furthermore, we assume that the answers to the triplet comparisons can be noisy. To deal with this problem one can try to first recover an explicit representation of the examples, a task that can be solved by ordinal embedding approaches [Agarwal et al., 2007, Van Der Maaten and Weinberger, 2012, Terada and von Luxburg, 2014, Jain et al., 2016], and then apply standard machine learning approaches. However, such embedding methods assume that the examples lie in a Euclidean space and do not scale well with the number of examples: typically they are too slow for datasets with more than $10^3$ examples. As an alternative, it would be desirable to have a classification algorithm that can work with triplets directly, without taking a detour via ordinal embedding.

¹ Note that this kind of ordinal information is sometimes just used as side information [Bellet et al., 2015, Kane et al., 2017], but, like the references in the main text, we focus on the setting where ordinal comparisons are the sole information available.
To the best of our knowledge, for the case of passively obtained triplets, this problem has not yet been solved in the literature. Another interesting question in this context is that of the minimal number of triplets required to successfully learn a classifier. It is known that to exactly recover an ordinal embedding one needs of the order of $\Omega(n^3)$ passively queried triplets in the worst case (essentially all of them), unless we make stronger assumptions on the underlying metric space [Jamieson and Nowak, 2011]. However, classification is a problem which seems simpler than ordinal embedding, and thus it might be possible to obtain better lower bounds.

In this paper we propose TripletBoost, a method for classification that is able to learn using only passively obtained triplets while not making any assumptions on the underlying metric space. To the best of our knowledge this is the first approach that is able to solve this problem. Our method is based on the idea that the triplets can be aggregated into simple triplet classifiers, which behave like decision stumps and are well suited for boosting approaches [Schapire and Freund, 2012]. From a theoretical point of view we prove that our approach learns a classifier with low training error, and we derive generalization guarantees that ensure that its error on new examples is bounded. Furthermore, we derive a new lower bound on the number of triplets that are necessary to ensure useful predictions. From an empirical point of view we demonstrate that our approach can be applied to datasets that are several orders of magnitude larger than the ones that can currently be handled by ordinal embedding methods. Furthermore, we show that our method is quite resistant to noise.

Figure 1: Example of a triplet classifier. Given two reference points $x_j$ and $x_k$, the space is divided in two half-spaces: examples closer to $x_j$ and examples closer to $x_k$. Triplet information is enough to reveal which half-space a point $x_i$ is in.

## 2 The TripletBoost Algorithm

In this paper we are interested in multi-class classification problems. Let $(X, d)$ be an unknown and general metric space, typically not Euclidean. Let $Y$ be a finite label space. Let $S = \{(x_i, y_i)\}_{i=1}^n$ be a set of $n$ examples drawn i.i.d. from an unknown distribution $D_S$ defined over $X \times Y$. Note that we use the notation $x_i$ as a convenient way to identify an object; it does not correspond to any explicit representation that could be used by an algorithm (such as coordinates in a vector space). Let $T = \{(x_i, x_j, x_k) : (x_i, y_i), (x_j, y_j), (x_k, y_k) \in S, x_j \neq x_k\}$ be a set of $m$ triplets. Each ordered tuple $(x_i, x_j, x_k) \in T$ encodes the following relation between the three examples:

$$d(x_i, x_j) < d(x_i, x_k). \quad (1)$$

Given the triplets in $T$ and the label information of all points, our goal is to learn a classifier. We make two main assumptions about the data. First, we assume that the triplets are uniformly and independently sampled from the set of all possible triplets. Second, we assume that the triplets in $T$ can be noisy (that is, the inequality has been swapped, for example $(x_i, x_k, x_j) \in T$ while the true relation is $d(x_i, x_j) < d(x_i, x_k)$), but the noise is uniform and independent from one triplet to another. In the following, $\mathbb{I}_a$ denotes the indicator function returning 1 when a property $a$ is verified and 0 otherwise, $W_c$ is an empirical distribution over $S \times Y$, and $w_{c,x_i,y}$ is the weight associated to object $x_i$ and label $y$.
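For concreteness, the following minimal sketch (our own illustration, not the paper's implementation; the function name and parameters are hypothetical) simulates this data model given oracle access to the metric $d$: a uniform subset of all triplets is drawn, each oriented by the true metric and then flipped independently with a fixed probability.

```python
import itertools
import random

def sample_noisy_triplets(points, metric, proportion, noise_level, seed=0):
    """Sample a uniform subset of the n^2(n-1)/2 possible triplets
    (x_i, x_j, x_k), meaning d(x_i, x_j) < d(x_i, x_k); each sampled
    triplet is flipped independently with probability `noise_level`."""
    rng = random.Random(seed)
    n = len(points)
    # All candidates: an anchor i together with an unordered pair {j, k}.
    candidates = [(i, j, k) for i in range(n)
                  for j, k in itertools.combinations(range(n), 2)]
    sampled = rng.sample(candidates, int(proportion * len(candidates)))
    triplets = []
    for i, j, k in sampled:
        # Orient the triplet according to the true (hidden) metric.
        if metric(points[i], points[j]) > metric(points[i], points[k]):
            j, k = k, j
        # Uniform noise: swap the answer with probability noise_level.
        if rng.random() < noise_level:
            j, k = k, j
        triplets.append((i, j, k))
    return triplets
```

Nothing in this sketch requires the space to be Euclidean: `metric` can be any distance oracle, and the learner only ever sees the resulting ordered index tuples.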
### 2.1 Weak Triplet Classifiers

Rather than considering triplets as individual pieces of information, we propose to aggregate them into decision stumps that we call triplet classifiers. The underlying idea is to select two reference examples, $x_j$ and $x_k$, and to divide the space into two half-spaces: examples that are closer to $x_j$ and examples that are closer to $x_k$. This is illustrated in Figure 1. In principle, this can be achieved with triplets only. However, a major difficulty in our setting is that our triplets are passively obtained: for most training points $x_i$ we do not know whether they are closer to $x_j$ or $x_k$. In particular, it is impossible to evaluate the classification accuracy of such a simple classifier on the whole training set. To deal with this problem we propose to use an abstention scheme where a triplet classifier abstains if it does not know on which side of the hyperplane the considered point lies. Given a set $T$ of triplets and two reference points $x_j$ and $x_k$, we define a triplet classifier as:

$$h_{j,k}(x) = \begin{cases} o_j & \text{if } (x, x_j, x_k) \in T, \\ o_k & \text{if } (x, x_k, x_j) \in T, \\ \vartheta & \text{otherwise.} \end{cases}$$

In our multi-class setting, $o_j$ and $o_k$ will be sets of labels, that is $o_j, o_k \subseteq Y$. In Section 2.2 we describe how we choose them in a data dependent fashion to obtain classifiers with minimal error on the training set. The prediction $\vartheta$ simply means that the triplet classifier abstains on the example. Let $\mathcal{H}$ denote the set of all possible triplet classifiers. The triplet classifiers are very simple and, in practice, we do not expect them to perform well at all. But we prove in Section 3.1 that, for appropriate choices of $o_j$ and $o_k$, they are at least as good as random predictors. This is all we need to ensure that they can be used successfully in a boosting framework. The next section describes how this works.
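Concretely, such an abstaining stump could be represented as follows; this is a minimal sketch under our own data representation (the class and field names are hypothetical), with the triplets stored in a set so that each lookup is O(1) on average.

```python
from typing import Hashable, Optional, Set, Tuple

# (x_i, x_j, x_k) encodes "x_i is closer to x_j than to x_k".
Triplet = Tuple[Hashable, Hashable, Hashable]

class TripletClassifier:
    """Decision stump built from two reference points x_j and x_k.

    Predicts the label set o_j if the point is known to be closer to x_j,
    o_k if it is known to be closer to x_k, and abstains (None) otherwise.
    """
    def __init__(self, xj, xk, oj: Set, ok: Set, triplets: Set[Triplet]):
        self.xj, self.xk = xj, xk
        self.oj, self.ok = oj, ok
        self.triplets = triplets  # observed triplets, possibly noisy

    def predict(self, x) -> Optional[Set]:
        if (x, self.xj, self.xk) in self.triplets:
            return self.oj      # x lies in the half-space of x_j
        if (x, self.xk, self.xj) in self.triplets:
            return self.ok      # x lies in the half-space of x_k
        return None             # abstain: the comparison was never observed
```

In the full algorithm the label sets `oj` and `ok` are not fixed in advance but chosen from the current example weights, as described next.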
### 2.2 TripletBoost

Boosting is based on the insight that weak classifiers (that is, classifiers marginally better than random predictors) are usually easy to obtain and can be combined in a weighted linear combination to obtain a strong classifier. This weighted combination can be obtained in an iterative fashion where, at each iteration, a weak classifier is chosen and weighted so as to minimize the error on the training examples. The weights of the points are then updated to put more focus on hard-to-classify examples [Schapire and Freund, 2012]. In this paper we use a well-known boosting algorithm called AdaBoost.MO [Schapire and Singer, 1999, Schapire and Freund, 2012]. This method can handle multi-class problems with a one-against-all approach, works with abstaining classifiers, and is theoretically well founded. Algorithm 1 summarizes the main steps of our approach that we detail below.

**Algorithm 1** TripletBoost: boosting with triplet classifiers

**Input:** $S = \{(x_i, y_i)\}_{i=1}^n$ a set of $n$ examples, $T = \{(x_i, x_j, x_k) : x_j \neq x_k\}$ a set of $m$ triplets.
**Output:** $H(\cdot)$, a strong classifier.

1. Let $W_1$ be the empirical uniform distribution: $\forall (x_i, y_i) \in S, \forall y \in Y$, $w_{1,x_i,y} = \frac{1}{n|Y|}$.
2. For $c = 1, \dots, C$:
   1. Choose a triplet classifier $h_c$ according to $W_c$ (Equation (2)).
   2. Compute the weight $\alpha_c$ of $h_c$ according to its performance on $W_c$ (Equation (3)).
   3. Update the weights of the examples to obtain a new distribution $W_{c+1}$ (Equation (4)).
3. Return $H(\cdot) = \arg\max_{y \in Y} \sum_{c=1}^{C} \alpha_c \, \mathbb{I}_{h_c(\cdot) \neq \vartheta} \, \mathbb{I}_{y \in h_c(\cdot)}$.

**Choosing a triplet classifier.** To choose a triplet classifier we proceed in two steps. In the first step, we select two reference points $x_j$ and $x_k$ such that $y_j \neq y_k$. This is done by randomly sampling from an empirical distribution $W_{c,X}$ on the examples, where $W_{c,X}$ denotes the marginal distribution of $W_c$ with respect to the examples. This distribution is updated at each iteration to put more focus on those parts of the space that are hard to classify, while promoting triplet classifiers that are able to separate different classes (see Equation (4)).

**Choosing the predicted labels.** In the second step, we choose $o_j$ and $o_k$, the sets of labels that should be predicted for each half-space of the triplet classifier. Given one of the half-spaces, we propose to add a label to the set of predicted labels if the weight of examples of this class is greater than the weight of examples of different classes. Formally, with $w_{c,x_i,y}$ defined as in Algorithm 1, we construct $o_j$ as follows:

$$o_j = \Bigg\{ y \in Y : \sum_{\substack{(x_i, y_i) \in S,\\ (x_i, x_j, x_k) \in T}} \left( \mathbb{I}_{y = y_i} - \mathbb{I}_{y \neq y_i} \right) w_{c,x_i,y} > 0 \Bigg\}. \quad (2)$$

We construct $o_k$ in a similar way. The underlying idea is that adding $h_c$ to the current combination of triplet classifiers should improve the predictions on the training set as much as possible. In Section 3.1 we show that this strategy is optimal and that it ensures that the selected triplet classifier is either a weak classifier or has a weight $\alpha_c$ of 0.

**Computing the weight of the triplet classifier.** To choose the weight of the triplet classifier $h_c$ we start by computing $W_{c,+}$ and $W_{c,-}$, the weights of correctly and incorrectly classified examples:

$$W_{c,+} = \sum_{\substack{(x_i, y_i) \in S,\\ h_c(x_i) \neq \vartheta}} \Bigg( \mathbb{I}_{y_i \in h_c(x_i)} w_{c,x_i,y_i} + \sum_{y \neq y_i} \mathbb{I}_{y \notin h_c(x_i)} w_{c,x_i,y} \Bigg),$$

$$W_{c,-} = \sum_{\substack{(x_i, y_i) \in S,\\ h_c(x_i) \neq \vartheta}} \Bigg( \mathbb{I}_{y_i \notin h_c(x_i)} w_{c,x_i,y_i} + \sum_{y \neq y_i} \mathbb{I}_{y \in h_c(x_i)} w_{c,x_i,y} \Bigg).$$

We then set

$$\alpha_c = \frac{1}{2} \log \frac{W_{c,+} + \frac{1}{n}}{W_{c,-} + \frac{1}{n}}. \quad (3)$$

The term $\frac{1}{n}$ is a smoothing constant [Schapire and Singer, 2000]: in our setting with few, passively queried triplets it helps to avoid numerical problems that might arise when $W_{c,+}$ or $W_{c,-}$ is 0. In Theorem 2 we show that this choice of $\alpha_c$ leads to a decrease in training error as the number of iterations increases.

**Updating the weights of the examples.** In each iteration of our algorithm, a new triplet classifier $h_c$ is added to the weighted combination of classifiers, and we need to update the empirical distribution $W_c$ over the examples for the next iteration. The idea is (i) to reduce the weights of correctly classified examples, (ii) to keep constant the weights of the examples for which the current triplet classifier abstains, and (iii) to increase the weights of incorrectly classified examples. The weights are then normalized by a factor $Z_c$ so that $W_{c+1}$ remains an empirical distribution over the examples. Formally, for all $(x_i, y_i) \in S$ and $y \in Y$,

$$w_{c+1,x_i,y} = \begin{cases} \dfrac{w_{c,x_i,y}}{Z_c} & \text{if } h_c(x_i) = \vartheta, \\[2mm] \dfrac{w_{c,x_i,y}}{Z_c} \exp\left\{ -\alpha_c \left( \mathbb{I}_{y = y_i} - \mathbb{I}_{y \neq y_i} \right) \left( \mathbb{I}_{y \in h_c(x_i)} - \mathbb{I}_{y \notin h_c(x_i)} \right) \right\} & \text{otherwise.} \end{cases} \quad (4)$$
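These three steps translate almost directly into code. The sketch below is our own illustration (reusing the hypothetical `TripletClassifier` from Section 2.1, and assuming examples are hashable identifiers): it computes $W_{c,+}$ and $W_{c,-}$, the weight $\alpha_c$ of Equation (3), and the update of Equation (4) for one boosting iteration.

```python
import math

def boosting_step(h, S, weights):
    """One TripletBoost-style iteration for a fixed stump h.

    `S` is a list of (x_i, y_i) pairs and `weights` maps (x_i, y) to
    w_{c,x_i,y}. Returns alpha_c and the updated, normalized weights.
    """
    n = len(S)
    labels = {y for _, y in S}
    # W_{c,+} / W_{c,-}: total weight of (example, label) pairs that h
    # gets right / wrong, ignoring examples on which h abstains.
    w_plus = w_minus = 0.0
    for x, y_true in S:
        pred = h.predict(x)
        if pred is None:            # abstention: contributes to neither
            continue
        for y in labels:
            if (y in pred) == (y == y_true):
                w_plus += weights[(x, y)]
            else:
                w_minus += weights[(x, y)]
    # Equation (3): smoothed confidence-rated weight.
    alpha = 0.5 * math.log((w_plus + 1.0 / n) / (w_minus + 1.0 / n))
    # Equation (4): shrink correct pairs, grow incorrect ones, leave
    # abstentions untouched, then renormalize by Z_c.
    new_weights = {}
    for x, y_true in S:
        pred = h.predict(x)
        for y in labels:
            w = weights[(x, y)]
            if pred is not None:
                sign = 1.0 if (y in pred) == (y == y_true) else -1.0
                w *= math.exp(-alpha * sign)
            new_weights[(x, y)] = w
    z = sum(new_weights.values())
    return alpha, {key: w / z for key, w in new_weights.items()}
```

Starting from the uniform distribution $w_{1,x_i,y} = \frac{1}{n|Y|}$ and repeating this step for $c = 1, \dots, C$ reproduces the loop of Algorithm 1.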
**Using $H$ for prediction.** Given a new example $x$, TripletBoost predicts its label as

$$H(x) = \arg\max_{y \in Y} \sum_{c=1}^{C} \alpha_c \, \mathbb{I}_{h_c(x) \neq \vartheta} \, \mathbb{I}_{y \in h_c(x)},$$

that is, the label with the highest weight as predicted by the weighted combination of selected triplet classifiers. However, recall that we are in a passive setting, and thus we assume that we are given a set of triplets $T_x = \{(x, x_j, x_k) : (x_j, y_j), (x_k, y_k) \in S, x_j \neq x_k\}$ associated with the example $x$ (but there is no way to choose them). Hence, some of the triplets in $T_x$ correspond to triplet classifiers in $H$ (that is, $(x, x_j, x_k) \in T_x$ and the reference points for some $h_c$ were $x_j$ and $x_k$) and some do not. In particular, it might happen that none of the triplets in $T_x$ corresponds to a triplet classifier in $H$; in this case, $H$ can only randomly predict a label. In Section 3.3 we provide a lower bound on the number of triplets necessary to avoid this behaviour.

The main computational bottleneck when predicting the label of a new example $x$ is to check whether the triplets in $T_x$ match a triplet classifier in $H$. A naive implementation would compare each triplet in $T_x$ to each triplet classifier, which can be as expensive as $O(|T_x| \cdot C)$. Fortunately, by first sorting the triplets and the triplet classifiers, a far more reasonable complexity of $O(|T_x| \log |T_x| + C \log C)$ can be achieved.
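One way to realize this complexity, sketched below under our own representation of a classifier by its reference pair (again reusing the hypothetical `TripletClassifier` fields `xj` and `xk`), is to sort both collections by canonical reference pair and merge them in a single linear pass.

```python
def match_triplets_to_classifiers(Tx, classifiers):
    """Pair each triplet in Tx with the classifier sharing its reference pair.

    Assumes object identifiers are comparable (e.g., integer indices) and
    that each unordered reference pair occurs at most once on each side;
    a single merge pass over the two sorted lists then costs
    O(|Tx| log |Tx| + C log C) instead of the naive O(|Tx| * C).
    """
    key = lambda j, k: (min(j, k), max(j, k))
    triplets = sorted(((key(j, k), (x, j, k)) for x, j, k in Tx),
                      key=lambda item: item[0])
    clfs = sorted(((key(h.xj, h.xk), h) for h in classifiers),
                  key=lambda item: item[0])
    matches, a, b = [], 0, 0
    while a < len(triplets) and b < len(clfs):
        if triplets[a][0] < clfs[b][0]:
            a += 1                  # triplet without a matching classifier
        elif triplets[a][0] > clfs[b][0]:
            b += 1                  # classifier that cannot fire on x
        else:
            matches.append((triplets[a][1], clfs[b][1]))
            a += 1
            b += 1
    return matches
```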
## 3 Theoretical Analysis

In this section we show that our approach is theoretically well founded. First we prove that the triplet classifiers with non-zero weights are weak learners: they are slightly better than random predictors (Theorem 1). Building upon this result we show that, as the number of iterations increases, the training error of the strong classifier learned by TripletBoost decreases (Theorem 2). Then, to ensure that TripletBoost does not over-fit, we derive a generalization bound showing that, given a sufficient amount of training examples, the test error is bounded (Theorem 3). Finally, we derive a lower bound on the number of triplets necessary to ensure that TripletBoost does not learn a random predictor (Theorem 4).

### 3.1 Triplet Classifiers and Weak Learners

We start this theoretical analysis by showing that the strategy to choose the predicted labels of triplet classifiers described in Equation (2) is optimal: it ensures that their error is minimal on the training set (compared to any other labelling strategy). We also show that the triplet classifiers are never worse than random predictors and, in fact, that only those triplet classifiers that are weak classifiers (strictly better than random classifiers) are assigned a non-zero weight. This is summarized in the next theorem.

**Theorem 1** (Triplet classifiers and weak learners). Let $W_c$ be an empirical distribution over $S \times Y$ and $h_c$ be the corresponding triplet classifier chosen as described in Section 2.2. It holds that:

1. the error of $h_c$ on $W_c$ is at most the error of a random predictor and is minimal compared to other labelling strategies,
2. the weight $\alpha_c$ of the classifier is non-zero if and only if $h_c$ is a weak classifier, that is, strictly better than a random predictor.

*Proof.* Given in Perrot and von Luxburg [2018].

### 3.2 Boosting Guarantees

From a theoretical point of view the boosting framework has been extensively investigated and it has been shown that most AdaBoost-based methods decrease the training error at each iteration [Freund and Schapire, 1997]. Another question that has attracted a lot of attention is the problem of generalization. It is known that when the training error has been minimized, AdaBoost-based methods often do not over-fit, and it might even be beneficial to further increase the number of weak learners. A popular explanation is the margin theory, which says that as the number of iterations increases, the confidence of the algorithm in its predictions increases and thus the test accuracy is improved [Schapire et al., 1998, Breiman, 1999, Wang et al., 2011, Gao and Zhou, 2013]. TripletBoost is based on AdaBoost.MO [Schapire and Freund, 2012], and thus it inherits the theoretical guarantees presented above. In this section, we provide two theorems which show that (i) TripletBoost reduces the training error as the number of iterations increases, and (ii) it generalizes well to new examples.

The following theorem shows that, as the number of iterations increases, TripletBoost decreases the training error.

**Theorem 2** (Reduction of the training error). Let $S$ be a set of $n$ examples and $T$ be a set of $m$ triplets (obtained as described in Section 2). Let $H(\cdot)$ be the classifier obtained after $C$ iterations of TripletBoost (Algorithm 1) using $S$ and $T$ as input. It holds that:

$$P_{(x,y) \sim S}\left[ H(x) \neq y \right] \le |Y| \prod_{c=1}^{C} Z_c, \quad \text{with } Z_c = (1 - W_{c,+} - W_{c,-}) + 2\sqrt{W_{c,+} W_{c,-}}.$$

*Proof.* This result, inherited from AdaBoost.MO, is proven in Perrot and von Luxburg [2018].

The next theorem shows that the true error of a classifier learned by TripletBoost can be bounded by a quantity related to the confidence of the classifier on the training examples, with respect to a margin, plus a term which decreases as the number of examples increases. The confidence of the classifier on the training examples is also bounded and decreases for sufficiently small margins.

**Theorem 3** (Generalization guarantees). Let $D_S$ be a distribution over $X \times Y$, let $S$ be a set of $n$ examples drawn i.i.d. from $D_S$, and let $T$ be a set of $m$ triplets (obtained as described in Section 2). Let $H(\cdot)$ be the classifier obtained after $C$ iterations of TripletBoost (Algorithm 1) using $S$ and $T$ as input. Let $\mathcal{H}$ be the set of triplet classifiers as defined in Section 2.1. Then, given a margin parameter $\theta > \sqrt{\frac{\log |\mathcal{H}|}{16 |Y|^2 n}}$ and a measure of the confidence of $H(\cdot)$ in its predictions $\theta_H(x, y)$, with probability at least $1 - \delta$, we have that

$$P_{(x,y) \sim D_S}\left[ H(x) \neq y \right] \le P_{(x,y) \sim S}\left[ \theta_H(x, y) \le \theta \right] + O\!\left( \frac{1}{\sqrt{n}} \sqrt{ \frac{\log |\mathcal{H}| \log n}{\theta^2} + \log \frac{1}{\delta} } \right).$$

Furthermore, we have that

$$P_{(x,y) \sim S}\left[ \theta_H(x, y) \le \theta \right] \le |Y| \prod_{c=1}^{C} e^{\theta \alpha_c} Z_c.$$

*Proof.* This result, inherited from AdaBoost.MO, is proven in Perrot and von Luxburg [2018].

At first glance it seems that this bound does not depend on $m$, the number of available triplets. However, this dependency is implicit: $m$ impacts the probability that the training examples are well classified with a large margin $\theta$. If the number of triplets is small, the probability that the training examples are well classified with a given margin is small. This probability increases when the number of triplets increases (as illustrated in Perrot and von Luxburg [2018]). To prove a bound that explicitly depends on $m$ would be of significant interest. However, this is a difficult problem, as it requires an explicit measure of complexity for general metric spaces, which is beyond the scope of this paper.

### 3.3 Lower Bound on the Number of Triplets

In this section we investigate the minimum number of triplets that are necessary to ensure that our algorithm performs well. Ideally, we would like to obtain a lower bound on the number of triplets that are necessary to achieve a given accuracy. In this paper we take a first step in this direction by deriving a bound on the number of triplets that are necessary to ensure that the learned classifier does not abstain on any unseen example. Theorems 4 and 5 show that it abstains with high probability if it is learned using too few triplets or if it combines too few triplet classifiers.
**Theorem 4** (Lower bound on the probability that a strong classifier abstains). Let $n \ge 2$ be the number of training examples, let $p = \frac{2 n^k}{n^3}$ with $k \in \left[ 0, 3 - \frac{\log 2}{\log n} \right]$ be the probability that a triplet is available in the triplet set $T$, and let $C = n^\beta$ with $\beta \in \left[ 0, 1 + \frac{\log (n-1)}{\log n} \right]$ be the number of classifiers combined in the learned classifier. Let $A$ be any algorithm learning a classifier $H(\cdot) = \arg\max_{y \in Y} \sum_{c=1}^{C} \alpha_c \, \mathbb{I}_{y \in h_c(\cdot)}$ that combines several triplet classifiers using some weights $\alpha_c \in \mathbb{R}$. Assume that triplet classifiers that abstain on all the training examples have a weight of 0 (that is, if $h_c(x_i) = \vartheta$ for all the examples $(x_i, y_i) \in S$ then $\alpha_c = 0$). Then the probability that $H$ abstains on a test example is bounded as follows:

$$P_{(x,y) \sim D_S}\left[ H(x) = \vartheta \right] \ge \left( 1 - p + p (1-p)^n \right)^C. \quad (6)$$

*Proof.* Given in Perrot and von Luxburg [2018].

Figure 2: (a) Moons, Euclidean metric, noise level 0%; (b) Moons, Euclidean metric, 10% of triplets; (c) Iris, 10% of triplets, noise level 0%. Moons is a small scale dataset with 500 examples and 2 dimensions; we fix the triplet metric to the Euclidean distance. In Figure 2a we consider the noise-free setting and vary the proportion of available triplets from 1 to 10% of all the triplets. In Figure 2b we fix this proportion to 10% and vary the noise level from 0 to 20%. Iris is a small scale dataset with 150 examples and 4 dimensions. In Figure 2c we fix the proportion of available triplets to 10% and the noise level to 0%, and vary the metric used to generate the triplets.

To understand the implications of this theorem we consider a concrete example.

**Example 1.** Assume that we build a linear combination of all possible triplet classifiers, that is $C = \frac{n(n-1)}{2}$. Then the lower bound of Equation (6) satisfies

$$\lim_{n \to \infty} \left( 1 - p + p(1-p)^n \right)^C = \begin{cases} 1 & \text{if } k < \frac{3}{2}, \\ e^{-2} & \text{if } k = \frac{3}{2}, \\ 0 & \text{if } k > \frac{3}{2}, \end{cases}$$

where $k$ is the parameter that controls the probability $p$ that a particular triplet is available in the triplet set $T$. The bottom line is that when $k < \frac{3}{2}$, that is, when we do not have at least $\Omega(n\sqrt{n})$ random triplets, the learned classifier abstains on all the examples.

*Proof.* Given in Perrot and von Luxburg [2018].

Theorem 4 shows that when $p$ and $C$ are too small, the strong classifier abstains with high probability. However, the theorem does not guarantee that the strong classifier does not abstain when $p$ and $C$ are large. The next theorem takes care of this other direction under slightly stronger assumptions on the weights learned by the algorithm.

**Theorem 5** (Exact bound on the probability that a strong classifier abstains). In Theorem 4, further assume that each triplet classifier that does not abstain on at least one training example has a weight different from 0 (if for at least one example $(x_i, y_i) \in S$ we have $h_c(x_i) \neq \vartheta$ then $\alpha_c \neq 0$). Then equality holds in Equation (6).

*Proof.* Given in Perrot and von Luxburg [2018].

Theorem 5 implies that equality holds in Example 1; thus when $C = \frac{n(n-1)}{2}$ we need at least $k > \frac{3}{2}$, that is at least $\Omega(n\sqrt{n})$ random triplets, to obtain a classifier that never abstains. In Perrot and von Luxburg [2018] we extend Example 1 and study the limit as $n \to \infty$ for general values of $C$ and $p$. We also provide a graphical illustration of the bound and a discussion of how this lower bound compares to existing results [Ailon, 2012, Jamieson and Nowak, 2011, Jain et al., 2016].
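To get a feel for the threshold in Example 1, the right-hand side of Equation (6) can be evaluated numerically. The short sketch below (our own; the function name is hypothetical) does so for $C = n(n-1)/2$ and a few values of $k$ around $3/2$.

```python
def abstention_bound(n, k):
    """Right-hand side of Equation (6) with p = 2n^k / n^3, C = n(n-1)/2."""
    p = 2.0 * n**k / n**3
    C = n * (n - 1) / 2
    return (1.0 - p + p * (1.0 - p) ** n) ** C

for n in (100, 1000, 10000):
    # Below k = 1.5 the bound approaches 1 (H abstains almost surely);
    # above it the bound collapses towards 0, as Example 1 predicts.
    print(n, [round(abstention_bound(n, k), 3) for k in (1.0, 1.5, 2.0)])
```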
## 4 Experiments

We propose an empirical evaluation of TripletBoost. We consider six datasets of varying scales and four baselines.

**Baselines.** First, we consider an embedding approach. We use t-STE [Van Der Maaten and Weinberger, 2012] to embed the triplets in a Euclidean space and we use the 1-nearest neighbour algorithm for classification. We would also like to compare to alternative approaches able to learn directly from triplets (without embedding as a first step). However, to the best of our knowledge, TripletBoost is the only method able to do classification using only passively obtained triplets. The only option is to choose competing methods that have access to more information (giving them an unfair advantage). We settled for a method that uses actively chosen triplets to build a comparison-tree to retrieve nearest neighbours (CompTree) [Haghiri et al., 2017]. Finally, to put the results obtained in the triplet setting in perspective, we consider two methods that use the original Euclidean representations of the data: the 1-nearest neighbour algorithm (1NN) and AdaBoost.SAMME (SAMME) [Hastie et al., 2009].

**Implementation details.** For t-STE we used the implementation distributed on the authors' website and we set the embedding dimension to the original dimension of the data. This method was only considered for small datasets with fewer than $10^3$ examples as it does not scale well to bigger datasets (Figure 3c). For CompTree we used our own implementation; the leaf size of the comparison tree is set to 1 as this is the only value for which this method can handle noise. For 1NN and SAMME we used scikit-learn [Pedregosa et al., 2011]. The number of boosting iterations for SAMME is set to $10^3$. Finally, for TripletBoost we set the number of iterations to $10^6$.

Figure 3: (a) Gisette, Euclidean metric, noise level 0%; (b) Gisette, Euclidean metric, 5% of triplets; (c) Moons, Euclidean metric, 10% of triplets, 10% noise. The Gisette dataset has 7000 examples and 5000 dimensions. In Figure 3a we consider the noise-free setting and vary the proportion of available triplets from 1 to 10% of all the triplets. In Figure 3b we fix the proportion of available triplets to 5% and vary the noise level from 0 to 20%. Figure 3c presents the training time of the triplet-based methods with training samples of increasing sizes on the Moons dataset; the proportion of triplets and the noise level were both set to 10%, and the results were obtained on a single core @3.40GHz.

**Datasets and performance measure.** We consider six datasets: Iris, Moons, Gisette, cod-rna, MNIST, and KMNIST. For each dataset we generate triplets as in Equation (1) using three metrics: the Euclidean, Cosine, and Cityblock distances (see Perrot and von Luxburg [2018] for details; a sketch of this generation step follows below). Given a set of $n$ examples there are $n^2(n-1)/2$ possible triplets. We consider three different regimes where 1%, 5%, or 10% of them are available, and three noise levels where 0%, 10%, or 20% of them are incorrect. We measure performances in terms of test accuracy (higher is better). For all the experiments we report the mean and standard deviation of 10 repetitions. Since the results are mostly consistent across the datasets, we present some representative ones here and defer the others to Perrot and von Luxburg [2018].
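As an illustration of how the three metrics could be plugged into the triplet generation of Section 2, one might write the following (our own sketch, reusing the hypothetical `sample_noisy_triplets` from above and assuming `scipy` is available; the toy matrix `X` stands in for a real dataset):

```python
from scipy.spatial import distance

# Toy feature matrix standing in for a real dataset (rows = examples).
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]

metrics = {
    "euclidean": distance.euclidean,
    "cosine": distance.cosine,
    "cityblock": distance.cityblock,
}
# One triplet set per (metric, proportion, noise level) configuration.
triplet_sets = {
    (name, prop, noise): sample_noisy_triplets(X, d, prop, noise)
    for name, d in metrics.items()
    for prop in (0.01, 0.05, 0.10)
    for noise in (0.0, 0.10, 0.20)
}
```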
**Small scale regime.** We first consider the datasets with fewer than $10^3$ training examples (Figure 2). In this setting our method does not perform well when the number of triplets is too small, but gets closer to the baselines when the number of triplets increases (Figure 2a). This behaviour can be easily explained: when only 1% of the triplets are available, the triplet classifiers abstain on all but 3 or 4 examples on average, and thus their performance evaluations are not reliable. Consequently, their weights cannot be chosen in a satisfactory manner. This problem vanishes when the number of triplets increases. With increasing noise levels (Figure 2b) one can notice that TripletBoost is more robust than CompTree. Indeed, CompTree generates a comparison-tree based on individual triplet queries, and the greedy decisions in the tree-building procedure can easily be misled in the presence of noise. Finally, our approach is less sensitive than t-STE to changes in the metric that generated the triplets (Figure 2c). Indeed, t-STE assumes that this metric is the Euclidean distance, while our approach does not make any assumptions.

**Large scale regime.** On larger datasets (Figure 3), our method does not reach the accuracy of 1NN and SAMME, which exploit a significant amount of extra information. Still, it performs quite well and is competitive with CompTree, the method that uses active rather than passive queries. The ordinal embedding methods cannot compete in this regime, as they are too slow to even finish (Figure 3c). Once again, TripletBoost is quite resistant to noise (Figure 3b).

## 5 Conclusion

In this paper we proposed TripletBoost to address the problem of comparison-based classification. It is particularly designed for situations where triplets cannot be queried actively and we have to live with whatever set of triplets we get. We do not make any geometric assumptions on the underlying space. From a theoretical point of view we have shown that TripletBoost is well founded, and we proved guarantees on both the training error and the generalization error of the learned classifier. Furthermore, we derived a new lower bound showing that to avoid learning a random predictor, at least $\Omega(n\sqrt{n})$ triplets are needed. In practice we have shown that, given a sufficient amount of triplets, our method is competitive with state of the art methods and that it is quite resistant to noise.

To the best of our knowledge, TripletBoost is the first algorithm that is able to handle large scale datasets using only passively obtained triplets. This means that the comparison-based setting could be considered for problems which were, until now, out of reach. As an illustration, consider a platform where users can watch, comment on, and rate movies. It is reasonable to assume that triplets of the form "movie $m_i$ is closer to $m_j$ than to $m_k$" can be automatically obtained using the ratings of the users, their comments, or their interactions. In this scenario, active learning methods are not applicable since the users might be reluctant to answer solicitations. Similarly, embedding methods are too slow to handle large numbers of movies or users. However, we can use TripletBoost to solve problems such as predicting the genres of the movies. As a proof of concept we considered the MovieLens 1M dataset [Harper and Konstan, 2016]. It contains 1 million ratings from 6040 users on 3706 movies. We used the users' ratings to obtain triplets about the movies and TripletBoost to learn a classifier able to predict the genres of a new movie (details are given in Perrot and von Luxburg [2018]). Given a new movie, in 83% of the cases the genre predicted as the most likely one is correct and, on average, the 5 genres predicted as the most likely ones cover 92% of the true genres.
## Acknowledgments

Ulrike von Luxburg acknowledges funding by the DFG through the Institutional Strategy of the University of Tübingen (DFG, ZUK 63) and the Cluster of Excellence EXC 2064/1, project number 390727645.

## References

[Agarwal et al., 2007] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pages 11–18, 2007.

[Ailon, 2012] Nir Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research, 13(Jan):137–164, 2012.

[Amid and Ukkonen, 2015] Ehsan Amid and Antti Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In International Conference on Machine Learning, pages 1472–1480, 2015.

[Bellet et al., 2015] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1–151, 2015.

[Breiman, 1999] Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.

[Freund and Schapire, 1997] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[Gao and Zhou, 2013] Wei Gao and Zhi-Hua Zhou. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

[Haghiri et al., 2017] Siavash Haghiri, Debarghya Ghoshdastidar, and Ulrike von Luxburg. Comparison-based nearest neighbor search. In Artificial Intelligence and Statistics, pages 851–859, 2017.

[Harper and Konstan, 2016] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19, 2016.

[Hastie et al., 2009] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class AdaBoost. Statistics and its Interface, 2(3):349–360, 2009.

[Heikinheimo and Ukkonen, 2013] Hannes Heikinheimo and Antti Ukkonen. The crowd-median algorithm. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[Jain et al., 2016] Lalit Jain, Kevin G. Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Neural Information Processing Systems, pages 2711–2719, 2016.

[Jamieson and Nowak, 2011] Kevin G. Jamieson and Robert D. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Conference on Communication, Control, and Computing, pages 1077–1084, 2011.

[Kane et al., 2017] Daniel M. Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In Foundations of Computer Science, pages 355–366, 2017.

[Kazemi et al., 2018] Ehsan Kazemi, Lin Chen, Sanjoy Dasgupta, and Amin Karbasi. Comparison based learning from weak oracles. arXiv preprint arXiv:1802.06942v1, 2018.

[Kleindessner and Luxburg, 2015] Matthäus Kleindessner and Ulrike Luxburg. Dimensionality estimation without distances. In Artificial Intelligence and Statistics, pages 471–479, 2015.
[Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[Perrot and von Luxburg, 2018] Michaël Perrot and Ulrike von Luxburg. Boosting for comparison-based learning. arXiv preprint arXiv:1810.13333, 2018.

[Schapire and Freund, 2012] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[Schapire and Singer, 1999] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.

[Schapire and Singer, 2000] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000.

[Schapire et al., 1998] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, pages 1651–1686, 1998.

[Tamuz et al., 2011] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In International Conference on Machine Learning, pages 673–680, 2011.

[Terada and von Luxburg, 2014] Yoshikazu Terada and Ulrike von Luxburg. Local ordinal embedding. In International Conference on Machine Learning, pages 847–855, 2014.

[Tschopp et al., 2011] Dominique Tschopp, Suhas Diggavi, Payam Delgosha, and Soheil Mohajer. Randomized algorithms for comparison-based search. In Neural Information Processing Systems, pages 2231–2239, 2011.

[Van Der Maaten and Weinberger, 2012] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing, pages 1–6, 2012.

[Wang et al., 2011] Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. A refined margin analysis for boosting algorithms via equilibrium margin. Journal of Machine Learning Research, 12(Jun):1835–1863, 2011.