# Comparison-Based Random Forests

Siavash Haghiri¹, Damien Garreau², Ulrike von Luxburg¹,²

¹Department of Computer Science, University of Tübingen, Germany. ²Max Planck Institute for Intelligent Systems, Tübingen, Germany. Correspondence to: Siavash Haghiri.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

Assume we are given a set of items from a general metric space, but we neither have access to the representation of the data nor to the distances between data points. Instead, suppose that we can actively choose a triplet of items (A, B, C) and ask an oracle whether item A is closer to item B or to item C. In this paper, we propose a novel random forest algorithm for regression and classification that relies only on such triplet comparisons. In the theory part of this paper, we establish sufficient conditions for the consistency of such a forest. In a set of comprehensive experiments, we then demonstrate that the proposed random forest is efficient both for classification and regression. In particular, it is even competitive with other methods that have direct access to the metric representation of the data.

## 1. Introduction

Assume we are given a set of items from a general metric space $(\mathcal{X}, \delta)$, but we neither have access to the representation of the data nor to the distances between data points. Instead, we have access to an oracle that we can actively ask for a triplet comparison: given any triplet of items $(x_i, x_j, x_k)$ in the metric space $\mathcal{X}$, is it true that

$$\delta(x_i, x_j) < \delta(x_i, x_k)\,?$$

Such a comparison-based framework has become popular in recent years, for example in the context of crowd-sourcing applications (Tamuz et al., 2011; Heikinheimo and Ukkonen, 2013; Ukkonen et al., 2015) and, more generally, whenever humans are supposed to give feedback or when constructing an explicit distance or similarity function is difficult (Wilber et al., 2015; Zhang et al., 2015; Wah et al., 2015; Balcan et al., 2016; Kleindessner and von Luxburg, 2017).

In the present work, we consider classification and regression problems in a comparison-based setting: we are given the labels $y_1, \ldots, y_n$ of unknown objects $x_1, \ldots, x_n$, and we can actively query triplet comparisons between objects. An indirect way to solve such problems is to first construct an ordinal embedding of the data points into a (typically low-dimensional) Euclidean space that satisfies the set of triplet comparisons, and then to apply standard machine learning methods to the Euclidean representation. However, this approach is often not satisfactory, because the new representation necessarily introduces distortions in the data. Furthermore, all existing ordinal embedding methods are painfully slow, even on moderately sized datasets (Agarwal et al., 2007; van der Maaten and Weinberger, 2012; Terada and von Luxburg, 2014). In addition, one has to estimate the embedding dimension, which is a challenging task by itself (Kleindessner and Luxburg, 2015).

As an alternative, we solve the classification/regression problems with a new random forest algorithm that requires only triplet comparisons. Standard random forests (Biau and Scornet, 2016) are among the most popular and successful classification/regression algorithms in Euclidean spaces (Fernández-Delgado et al., 2014). However, they heavily rely on the underlying vector space structure.
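Throughout the paper, the only primitive the learner may use is this triplet oracle. As a minimal illustration, here is a sketch of such an oracle in Python; the Euclidean implementation merely stands in for, say, a crowd worker, and the function name and interface are our own illustrative assumptions:

```python
import numpy as np

def triplet_oracle(x_i, x_j, x_k):
    """Answer the triplet query: is x_i closer to x_j than to x_k?

    Stand-in for any comparison oracle (e.g., a human annotator).
    The algorithms in this paper only ever see these boolean answers,
    never coordinates or distance values.
    """
    return np.linalg.norm(x_i - x_j) < np.linalg.norm(x_i - x_k)
```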
In our comparison-based setting we need a completely different tree-building strategy. We use the recently introduced comparison tree (Haghiri et al., 2017) for this purpose; in the Euclidean case it is distantly related to linear decision trees (Kane et al., 2017b; Ezra and Sharir, 2017; Kane et al., 2017a). A comparison-based random forest (CompRF) consists of a collection of comparison trees built on the training set. We study the proposed CompRF both from a theoretical and a practical point of view. In Section 3, we give sufficient conditions under which a slightly simplified variant of the comparison-based forest is statistically consistent. In Section 4, we apply the CompRF to various datasets. In the first set of experiments we compare our random forests to traditional CART forests on Euclidean data. In the second set of experiments, the distances between objects are known while their representation is missing. Finally, we consider a case in which only triplet comparisons are available.

## 2. Comparison-Based Random Forests

Random forests, first introduced in Breiman (2001), are one of the most popular algorithms for classification and regression in Euclidean spaces. In a comprehensive study on more than 100 classification tasks, random forests showed the best performance among many other general-purpose methods (Fernández-Delgado et al., 2014). However, standard random forests heavily rely on the vector space representation of the underlying data points, which is not available in a comparison-based framework. Instead, we propose a comparison-based random forest algorithm for classification and regression tasks. Its main ingredient is the comparison tree, which only uses triplet comparisons and relies neither on a Euclidean representation nor on distances between items.

Let us recap the CART random forest. The input consists of a labeled set $D_n = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$. To build an individual tree, we first draw a random subsample $D_s$ of $a_n$ points from $D_n$. Second, we select a random subset $\mathit{Dims}$ of size $m_{\text{try}}$ of all possible dimensions $\{1, 2, \ldots, d\}$. The tree is then built by recursive, axis-aligned splits along a dimension randomly chosen from $\mathit{Dims}$. The exact splitting point along this direction is determined via the CART criterion, which also involves the labels of the subset $D_s$ of points (see Biau and Scornet (2016) for details). The tree is grown until each cell contains at most $n_0$ points; these cells then correspond to the leaf nodes of the tree. To estimate a regression function $m(x)$, each individual tree routes the query point to the appropriate leaf and outputs the average response over all points in this leaf. The random forest aggregates $M$ such trees. Let us denote the prediction of tree $i$ at point $x$ by $m_i(x, \Theta_i, D_n)$, where $\Theta_i$ encodes the randomness in the tree construction. The final forest estimate at $x$ is then the average over all trees (for classification, the average is replaced by a majority vote):

$$m_{M,n}\bigl(x; (\Theta_i)_{1 \le i \le M}, D_n\bigr) = \frac{1}{M} \sum_{i=1}^{M} m_i(x, \Theta_i, D_n)\,.$$

The general consensus in the literature is that CART forests are surprisingly robust to parameter choices. Consequently, practitioners use explicit rules of thumb, for example $m_{\text{try}} = \lceil d/3 \rceil$, and $n_0 = 5$ (resp. $n_0 = 1$) for regression (resp. classification) tasks. We now suggest to replace the CART trees by comparison trees, leading to comparison-based random forests (CompRF).
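For concreteness, the aggregation step just described might look as follows; this is a minimal sketch under the assumption that each fitted tree exposes a `predict` method (our interface choice, not a fixed API):

```python
import numpy as np
from collections import Counter

def forest_predict(trees, x, task="regression"):
    """Aggregate the predictions of M base trees at a query point x:
    the average for regression, a majority vote for classification."""
    predictions = [tree.predict(x) for tree in trees]
    if task == "regression":
        return float(np.mean(predictions))
    return Counter(predictions).most_common(1)[0][0]
```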
Comparison trees were originally designed to find nearest neighbors by recursively splitting the search space into smaller subspaces. Inspired by the CART criterion, we propose a supervised variant of the comparison tree, which we refer to as the supervised comparison tree.

For classification, the supervised comparison tree construction for a labeled set $S \subset \mathcal{X} \times \{0, 1\}$ is as follows (see Algorithm 1 and Figure 1): we randomly choose two pivot points $x_1$ and $x_2$ with different labels $y_1$ and $y_2$ among the points in $S$ (the case where all points in $S$ have the same label is trivial). For every remaining point $(x, y) \in S$, we request the triplet comparison $\delta(x, x_1) < \delta(x, x_2)$. The answer to this query determines the relative position of $x$ with respect to the generalized hyperplane separating $x_1$ and $x_2$. We assign the points closer to $x_1$ to the first child node of $S$ and the points closer to $x_2$ to the other one. We then recurse on the child nodes until at most $n_0$ points remain in every leaf node of the tree.

Algorithm 1 CompTree(S, n0): supervised comparison tree construction.
Input: labeled data S and maximum leaf size n0. Output: comparison tree T.
1: T.root ← S
2: if |S| > n0 then
3:   sample distinct (x1, y1), (x2, y2) ∈ S such that y1 ≠ y2 (if all points have the same label, choose randomly)
4:   S1 ← {(x, y) ∈ S : δ(x, x1) ≤ δ(x, x2)}
5:   T.leftpivot ← x1, T.rightpivot ← x2
6:   T.leftchild ← CompTree(S1, n0)
7:   T.rightchild ← CompTree(S \ S1, n0)
8: end if
9: return T

[Figure 1. Construction of the comparison tree, illustrated in the Euclidean setting. (i) The current cell contains points with two different labels. (ii) Two pivot points with opposite labels are chosen randomly from all sample points in the current cell. (iii) The current cell is split according to whether points are closer to the one or the other pivot; in the Euclidean setting this corresponds to a hyperplane split. (iv) Result after recursive application of this principle with final leaf size n0 = 10.]

The supervised pivot selection is analogous to the CART criterion. However, instead of a costly optimization over the choice of split, it only requires choosing pivots with different labels. In Section 4.1, we show empirically that the supervised split procedure leads to better classification performance than CART forests. For regression, it is not obvious how the pivot selection should depend on the output values. Here we use an unsupervised version of the forest (unsupervised CompRF): we choose the pivots $x_1, x_2$ without considering $y_1, y_2$.

The final comparison-based forest consists of $M$ independently constructed comparison trees (see Algorithm 2). To assign a label to a query point, we traverse every tree to a leaf node, and then aggregate all items in the leaf nodes of the $M$ trees to estimate the label of the query item. For classification, the final label is the majority vote over the labels of the accumulated set (in the multiclass case we use a one-vs.-one approach). For regression, we use the mean output value.

Algorithm 2 CompRF(Dn, q, M, n0, r): CompRF prediction at query q.
Input: labeled dataset Dn ⊂ X × {0, 1}, query q ∈ X, leaf size n0, number of trees M, and subsampling ratio r. Output: yq, the label prediction for q.
1: set C ← ∅ as the list of predictor items
2: for j = 1, …, M do
3:   take a random subsample Ds ⊆ Dn with |Ds| = rn
4:   Tj ← CompTree(Ds, n0)
5:   given q, traverse the tree Tj to the leaf node Nj
6:   C ← C ∪ Nj
7: end for
8: return MajorityVote({y | (x, y) ∈ C})
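A compact Python sketch of Algorithms 1 and 2, assuming the triplet oracle interface from the introduction; the names, data layout, tie-breaking rule, and the omission of subsampling are our illustrative choices, not part of the paper:

```python
import random
from collections import Counter

class Node:
    """Comparison tree node: a leaf stores its points, an internal
    node stores the two pivots and its children."""
    def __init__(self, data):
        self.data = data              # list of (x, y) pairs
        self.pivots = None            # (x1, x2) once split
        self.left = self.right = None

def comp_tree(data, n0, oracle):
    """Algorithm 1 (sketch): split on random pivots with distinct labels;
    each point goes to the side of its closer pivot, as reported by the
    triplet oracle. Ties are sent to the right child here."""
    node = Node(data)
    if len(data) <= n0:
        return node
    if len({y for _, y in data}) > 1:
        while True:                   # supervised pivot selection
            (x1, y1), (x2, y2) = random.sample(data, 2)
            if y1 != y2:
                break
    else:
        (x1, _), (x2, _) = random.sample(data, 2)
    answers = [oracle(x, x1, x2) for x, _ in data]
    node.pivots = (x1, x2)
    node.left = comp_tree([p for p, a in zip(data, answers) if a], n0, oracle)
    node.right = comp_tree([p for p, a in zip(data, answers) if not a], n0, oracle)
    return node

def comp_rf_predict(q, data, M, n0, oracle):
    """Algorithm 2 (sketch, without subsampling): build M trees, route q
    to a leaf in each, pool the leaf labels, and take a majority vote."""
    pooled = []
    for _ in range(M):
        node = comp_tree(data, n0, oracle)
        while node.pivots is not None:
            x1, x2 = node.pivots
            node = node.left if oracle(q, x1, x2) else node.right
        pooled.extend(y for _, y in node.data)
    return Counter(pooled).most_common(1)[0][0]
```

For simplicity the sketch rebuilds the trees inside the prediction call; in practice one would of course build the forest once and reuse it for all queries.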
Intuitive comparison: The general understanding is that the efficiency of CART random forests stems from (1) the randomness due to subsampling of dimensions and data points (Breiman, 1996), and (2) the CART splitting criterion, which exploits the label information already during tree construction (Breiman et al., 1984). A weakness of CART splits is that they are necessarily axis-aligned, and thus not well adapted to the geometry of the data. In comparison trees, randomness is involved in the tree construction as well. But once a splitting direction has been determined by choosing the pivot points, the exact splitting point along this direction cannot be influenced any more, due to the lack of a vector space representation. On the other hand, comparison tree splits are well adapted to the data geometry by construction, which gives the comparison trees some advantage. All in all, the comparison-based forest is a promising candidate with slightly different strengths and weaknesses than the CART forest. Our empirical comparison in Section 4.1 reveals that it performs surprisingly well and can even outperform CART forests in certain settings.

## 3. Theoretical Analysis

Despite their intensive use in practice, theoretical questions regarding the consistency of the original procedure of Breiman (2001) are still under investigation. Most of the research focuses on simplified models in which the construction of the forest does not depend on the training set at all (Biau, 2012), or only via the $x_i$ but not the $y_i$ (Biau et al., 2008; Ishwaran and Kogalur, 2010; Denil et al., 2013). Recent efforts have nearly closed this gap, notably Scornet et al. (2015), who show that the original algorithm is consistent in the context of additive regression models and under suitable assumptions. However, there is no previous work on the consistency of random forests constructed only from triplet comparisons. As a first step in this direction, we investigate the consistency of individual comparison trees, the first building block in the study of random forest consistency.

As is common in the theoretical literature on random forests, we consider a slightly modified version of the comparison tree. We assume that the pivot points are not drawn from the underlying sample but according to the true distribution of the data. In this setting, we show that, as the number of observations grows to infinity, (i) the diameter of the cells converges to zero in probability, and (ii) each cell contains an arbitrarily large number of observations. Using a result of Devroye et al. (1996), we deduce that the associated classifier is consistent. The challenging part of the proof is to obtain control over the diameter of the cells. Intuitively, as in Dasgupta and Freund (2008, Lemma 12), it suffices to show that each cut is more likely to decrease the diameter of the current cell than to leave it unchanged. Proving this in our case is difficult because both the position of the next cut and the decrease in diameter it causes depend on the geometry of the cell.

### 3.1. Continuous Comparison Tree

As is the case for most theoretical results on random forests, we carry out our analysis in a Euclidean setting (the comparison forest, however, only has indirect access to the Euclidean metric, via triplet queries).
We assume that the input space is $\mathcal{X} = [0, 1]^d$, with distance $\delta$ given by the Euclidean norm, that is, $\delta(x, y) = \lVert x - y \rVert$. Let $X$ be a random variable with support included in $[0, 1]^d$. We assume that the observations $X_1, \ldots, X_n \in [0, 1]^d$ are drawn independently according to the distribution of $X$, and we make the following assumption:

Assumption 3.1. The random variable $X \in [0, 1]^d$ has a density $f$ with respect to the Lebesgue measure on $[0, 1]^d$. Additionally, $f$ is finite and bounded away from $0$.

For any $x, y \in \mathbb{R}^d$, let us define

$$\mathcal{H}(x, y) := \bigl\{ z \in \mathbb{R}^d \mid \delta(x, z) = \delta(y, z) \bigr\}.$$

In the Euclidean setting, $\mathcal{H}(x, y)$ is a hyperplane that separates $\mathbb{R}^d$ into two half-spaces. We call $H_x$ (resp. $H_y$) the open half-space containing $x$ (resp. $y$). The set $S_1$ in Algorithm 1 corresponds to $S \cap H_{x_1}$. We can now define the continuous comparison tree.

Definition 1 (Continuous comparison tree). A continuous comparison tree is a random infinite binary tree $T^0$ obtained via the following iterative construction:

- The root of $T^0$ is $[0, 1]^d$.
- Assuming that level $\ell$ of $T^0$ has already been built, level $\ell + 1$ is built as follows: for every cell $C$ at height $\ell$, draw $X_1, X_2 \in C$ independently according to the distribution of $X$ restricted to $C$. The children of $C$ are defined as the closures of $C \cap H_{X_1}$ and $C \cap H_{X_2}$.

For any sequence $(p_n)_{n \ge 0}$, a truncated continuous comparison tree $T^0(p_n)$ consists of the first $p_n$ levels of $T^0$.

From a mathematical point of view, the continuous tree has a number of advantages. (i) Its construction does not depend on the responses $Y_1, \ldots, Y_n$. Such a simplification is quite common, because data-dependent random tree structures are notoriously difficult to analyze (Biau et al., 2008). (ii) Its construction is formally independent of the finite set of data points, but close in spirit: rather than sampling the pivots among the data points in a cell, the pivots are sampled independently according to the underlying distribution. Whenever a cell contains a large number of sample points, the two distributions are close, but they may drift apart as the diameter of the cells goes to 0. (iii) In the continuous comparison tree, we stop splitting cells at height $p_n$, whereas in the discrete setting we stop when at most $n_0$ observations remain in the current cell. As a consequence, $T^0(p_n)$ is a perfect binary tree: each interior node has exactly two children. This is typically not the case for comparison trees.

### 3.2. Consistency

To each realization of $T^0(p_n)$ is associated a partition of $[0, 1]^d$ into disjoint cells $A_{1,n}, A_{2,n}, \ldots, A_{2^{p_n},n}$. For any $x \in [0, 1]^d$, let $A(x)$ be the cell of $T^0(p_n)$ containing $x$. Let us assume that the responses $(Y_i)_{1 \le i \le n}$ are binary labels. We consider the classifier defined by majority vote in each cell, that is,

$$g_n(x) = \begin{cases} 1 & \text{if } \sum_{X_i \in A(x)} \mathbf{1}_{Y_i = 1} > \sum_{X_i \in A(x)} \mathbf{1}_{Y_i = 0}, \\ 0 & \text{otherwise.} \end{cases}$$

Define $L_n := \mathbb{P}\left(g_n(X) \ne Y \mid D_n\right)$. Following Devroye et al. (1996), we say that the classifier $g_n$ is consistent if

$$\mathbb{E}[L_n] = \mathbb{P}(g_n(X) \ne Y) \xrightarrow[n \to +\infty]{} L^*,$$

where $L^*$ is the Bayes error probability. Our main result is the consistency of the classifier associated with the continuous comparison tree truncated at a logarithmic height.

Theorem 2 (Consistency of comparison-based trees). Under Assumption 3.1, the classifier associated with the continuous, truncated tree $T^0(\alpha \log n)$ is consistent for any constant $0 < \alpha < 1 / \log 2$.

In particular, since each individual tree is consistent, a random forest with base tree $T^0(p_n)$ is also consistent.
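To see where the constraint $\alpha < 1/\log 2$ enters, it helps to unroll the leaf-count computation used in the proof sketch below (our rewriting of that step, not a verbatim excerpt):

```latex
% A perfect binary tree truncated at height p_n = \alpha \log n
% has 2^{p_n} leaf cells, and
\[
  2^{p_n} = 2^{\alpha \log n} = e^{(\alpha \log 2)\,\log n}
          = n^{\alpha \log 2} = o(n)
  \qquad \text{since } \alpha \log 2 < 1 .
\]
% Hence the n sample points are spread over o(n) cells, so the
% expected number of points per cell grows without bound.
```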
Theorem 2 is a first step towards explaining why comparison-based trees perform well without having access to the representation of the points. Note also that, even though the continuous tree is a simplified version of the discrete tree, the two are quite similar and share all important characteristics. In particular, they roughly have the same depth: with high probability, the comparison tree has logarithmic depth (Haghiri et al., 2017, Theorem 1).

### 3.3. Sketch of the Proof

Since the construction of $T^0(p_n)$ does not depend on the labels, we can use Theorem 6.1 of Devroye et al. (1996), which gives sufficient conditions for classification rules based on space partitioning to be consistent. In particular, we have to show that the partition satisfies two properties: first, the leaf cells should be small enough that local changes of the distribution can be detected; second, the leaf cells should contain a sufficiently large number of points so that averaging among the labels makes sense. More precisely, we have to show that (1) $\operatorname{diam} A(X) \to 0$ in probability, where $\operatorname{diam} A := \sup_{x, y \in A} \delta(x, y)$ is the diameter of $A$, and (2) $N(X) \to \infty$ in probability, where $N(X)$ is the number of sample points in the cell containing $X$.

The second point is simple: it suffices to show that the number of cells in the partition associated with $T^0(\alpha \log n)$ is $o(n)$ (according to Lemma 20.1 in Devroye et al. (1996) and the remark that follows it). Proving (1) is much more challenging. A sketch of the proof of (1) follows; the complete proof of Theorem 2 can be found in the supplementary material.

The critical part in proving (1) is to show that, for any cell of the continuous comparison tree, the diameter of its descendants at least $k$ levels below is halved with high probability. More precisely, the following proposition shows that this probability is lower bounded by $1 - \delta$, where $\delta$ is exponentially decreasing in $k$.

Proposition 3 (Diameter control). Let $C$ be a cell of $T^0$ such that $\operatorname{diam} C \le D$. Then, under Assumption 3.1, the probability that there exists a descendant of $C$ which is more than $k$ levels below and yet has diameter greater than $D/2$ is at most $\delta = c \gamma^k$, where $c > 0$ and $\gamma \in (0, 1)$ are constants depending only on the dimension $d$ and the density $f$.

The proof of Proposition 3 amounts to showing that the probability of decreasing the diameter of any given cell is higher than the probability of keeping it unchanged; see the supplementary material. Assuming Proposition 3, the rest of the proof goes as follows. Let us set $\varepsilon \in (0, 1)$.
We are going to show that $\mathbb{P}(\operatorname{diam} A(X) > \varepsilon) \to 0$ as $n \to +\infty$. Let $\Gamma$ be the path in $T^0(\alpha \log n)$ that goes from the root to the leaf $A$ of maximum diameter. This path has length $p_n$, by the definition of $T^0(\alpha \log n)$. The root, which consists of the set $[0, 1]^d$, has diameter $\sqrt{d}$. This means that we need to halve the diameter of this cell $\pi = \lceil \log_2(\sqrt{d}/\varepsilon) \rceil$ times to obtain cells with diameter smaller than $\varepsilon$. Let us set $k = \lfloor p_n / \pi \rfloor$ and pick cells $C^{(j)}$, $0 \le j \le \pi$, along $\Gamma$ such that $C^{(0)} = [0, 1]^d$, $C^{(\pi)} = A$, and such that there are more than $k$ levels between $C^{(j)}$ and $C^{(j+1)}$. Then we can prove that

$$\mathbb{P}(\operatorname{diam} A > \varepsilon) \le \sum_{j=1}^{\pi} \mathbb{P}\Bigl(\operatorname{diam} C^{(j)} > \tfrac{1}{2} \operatorname{diam} C^{(j-1)}\Bigr).$$

Furthermore, according to Proposition 3, each term in the sum is at most $c\gamma^k$, so the right-hand side is upper bounded by $\pi c \gamma^k$. Since $k$ grows like $\log n$ and $\gamma \in (0, 1)$, this bound tends to 0, and we can conclude.

## 4. Experiments

In this section, we first examine comparison-based forests in the Euclidean setting. Second, we apply the CompRF method to non-Euclidean datasets for which a general metric is available. Finally, we run experiments in the setting where we are only given triplet comparisons.

### 4.1. Euclidean Setting

Here we examine the performance of CompRF on classification and regression tasks in the Euclidean setting, and compare it against CART random forests as well as against the KNN classifier as a baseline. As the distance function for KNN and CompRF we use the standard Euclidean distance. Since CompRF only has access to distance comparisons, the amount of information it uses is considerably lower than the information available to the CART forest. Hence, the goal of this experiment is not to show that comparison-based random forests perform better, but rather to find out whether their performance is still acceptable.

To emphasize the role of supervised pivot selection, we also report the performance of the unsupervised CompRF algorithm on the classification tasks. The tree construction in the unsupervised CompRF chooses the pivot points uniformly at random, without considering the labels. For the sake of simplicity, we do not perform subsampling when building the CompRF trees; we report some experiments concerning the role of subsampling in Section 3.2 of the supplementary material. All other parameters of CompRF are adjusted by cross-validation.

#### 4.1.1. Classification

We use four classification datasets. MNIST (Le Cun et al., 1998) and Gisette are handwritten digit datasets. Isolet and UCIHAR are speech recognition and human activity recognition datasets, respectively (Lichman, 2013). Details of the datasets are shown in the first three rows of Table 1.

Parameters of CompRF: We examine the behaviour of CompRF with respect to the choice of the leaf size $n_0$ and the number of trees $M$. We perform 10-fold cross-validation over $n_0 \in \{1, 4, 16, 64\}$ and $M \in \{1, 4, 16, 64, 256\}$.
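A sketch of this model-selection loop; `build_forest` and `forest_error` are hypothetical wrappers around the comparison-tree construction and prediction sketched in Section 2, and `X`, `y` are assumed to be NumPy arrays:

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def select_comp_rf_params(X, y, oracle, n0_grid=(1, 4, 16, 64),
                          M_grid=(1, 4, 16, 64, 256), n_folds=10):
    """10-fold cross-validation over the (n0, M) grid; returns the best pair.

    build_forest / forest_error are assumed helpers (not defined here)
    that wrap the CompTree construction and leaf-pooling prediction."""
    best_params, best_err = None, np.inf
    for n0, M in itertools.product(n0_grid, M_grid):
        errs = []
        for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
            forest = build_forest(X[train_idx], y[train_idx],
                                  n0=n0, M=M, oracle=oracle)
            errs.append(forest_error(forest, X[val_idx], y[val_idx], oracle))
        if np.mean(errs) < best_err:
            best_params, best_err = (n0, M), np.mean(errs)
    return best_params
```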
In Figure 2 we report the resulting cross-validation error. In line with the recommendations for CART forests (Biau and Scornet, 2016), we achieve the best performance when the leaf size is small, that is, $n_0 = 1$. Moreover, there is no significant improvement for $M$ greater than 100.

[Figure 2. Average classification error of the CompRF algorithm on the classification datasets. The x-axis shows the number of trees used in the forest; each panel corresponds to one dataset and each curve to a fixed value of n0.]

Comparison between CompRF, CART and KNN: Table 1 shows the average and standard deviation of the classification error over 10 independent runs of CompRF, the CART forest, and KNN. Training and test sets are those provided with the respective datasets. The parameters $n_0$ and $M$ of CompRF and CART, and $k$ of KNN, are chosen by 10-fold cross-validation on the training set. Note that KNN is not randomized, so there is no standard deviation to report.

Table 1. Average and standard deviation of the classification error for CompRF vs. other methods. The first three rows describe the datasets.

| | MNIST | Gisette | UCIHAR | Isolet |
|---|---|---|---|---|
| Dataset size | 70000 | 7000 | 10229 | 6238 |
| Variables | 784 | 5000 | 561 | 617 |
| Classes | 10 | 2 | 5 | 26 |
| KNN | 2.91 | 3.50 | 12.15 | 8.27 |
| CART RF | 2.90 (±0.05) | 3.04 (±0.26) | 7.47 (±0.32) | 5.48 (±0.27) |
| CompRF unsupervised | 4.21 (±0.05) | 3.28 (±0.19) | 8.70 (±0.32) | 6.65 (±0.14) |
| CompRF supervised | 2.50 (±0.05) | 2.48 (±0.13) | 6.54 (±0.11) | 4.43 (±0.26) |

The results show that, surprisingly, CompRF can slightly outperform the CART forests on classification tasks even though it uses considerably less information. The reason might be that the CompRF splits are better adapted to the geometry of the data than the CART splits. While the CART criterion for selecting the exact splitting point can be very informative for regression (see below), for classification a simple data-dependent splitting criterion, as in the supervised CompRF, appears to be just as effective. Conversely, unsupervised splitting, as in the unsupervised CompRF, is clearly worse than supervised splitting.

#### 4.1.2. Regression

Next we consider regression tasks on four datasets. Online News Popularity (ONP) is a dataset of articles, with the popularity of the article as target (Fernandes et al., 2015). Boston is a dataset of properties, with the estimated value as target variable. Forest Fire is a dataset for predicting the burned area of forest fires in the northeast region of Portugal (Cortez and Morais, 2007). White Wine (Wine) is a subset of the wine quality dataset (Cortez et al., 2009). Details of the datasets are shown in the first two rows of Table 2.

Since the regression datasets have no separate training and test sets, we assign 90% of the items to the training set and the remaining 10% to the test set. In order to remove the effect of the fixed partitioning, we repeat the experiments 10 times with random training/test set assignments. Note that we use CompRF with unsupervised tree construction for regression.

Parameters of CompRF: We report the behaviour of CompRF with respect to the parameters $n_0$ and $M$. We perform 10-fold cross-validation with the same range of parameters as in the previous section. Figure 3 shows the average root mean squared error (RMSE) over the 10 folds. The cross-validation is performed for 10 random training/test set assignments.
The figure corresponds to the first assignment out of the 10 (the behaviour for the other training/test set assignments is similar). The CompRF algorithm shows the best performance with $n_0 = 1$ on the Boston and Forest Fire datasets, whereas larger values of $n_0$ lead to better performance on the other datasets. We believe that the main reason for this variability is the unsupervised tree construction used by CompRF for regression.

[Figure 3. Average RMSE of the CompRF algorithm on the regression datasets. The x-axis shows the number of trees used in the forest; each panel corresponds to one dataset and each curve to a fixed value of n0.]

Comparison between CompRF and CART: Table 2 shows the average and standard deviation of the RMSE for CompRF and the CART forests over the 10 runs with random training/test set assignments. For each combination of training and test sets, we tuned the parameters independently by cross-validation. CompRF is constructed with unsupervised splitting, while the CART forests are built using the supervised CART criterion. On the Boston and Wine datasets, the performance of the CART forest is substantially better than that of CompRF: here, ignoring the Euclidean representation of the data and relying only on comparison-based trees leads to a significant decrease in performance. On the other two datasets, however, our method performs about the same as the CART forests. We conclude that in some cases the CART criterion can be essential for regression. Note, however, that in a purely comparison-based setting, without an actual vector space representation, it is hardly possible to devise an efficient supervised splitting criterion.

Table 2. Average and standard deviation of the RMSE for CompRF vs. the CART regression forest.

| | ONP | Boston | Forest Fire | White Wine |
|---|---|---|---|---|
| Dataset size | 39644 | 506 | 517 | 4898 |
| Variables | 58 | 13 | 12 | 11 |
| CART RF | 1.04 (±0.50) × 10⁴ | 3.02 (±0.95) | 45.32 (±4.89) | 59.00 (±2.94) × 10⁻² |
| CompRF | 1.05 (±0.50) × 10⁴ | 6.16 (±1.00) | 45.37 (±4.69) | 72.46 (±3.16) × 10⁻² |

### 4.2. Metric, Non-Euclidean Setting

In this set of experiments we aim to demonstrate the performance of CompRF in general metric spaces. We choose graph-structured data for this purpose: each data point is a graph, and as a distance between graphs we use graph-kernel-based functions. In particular, the Weisfeiler-Lehman graph kernels are a family of graph kernels with promising results on various graph datasets (Shervashidze et al., 2011). We compute the WL-subtree and WL-edge kernels on four of the datasets reported in Shervashidze et al. (2011): MUTAG, ENZYMES, NCI1 and NCI109. In order to evaluate triplet comparisons based on the graph kernels, we first convert the kernel matrix to a distance matrix in the standard way, expressing the Gram matrix in terms of squared distances: $\delta(x_i, x_j)^2 = k(x_i, x_i) + k(x_j, x_j) - 2\,k(x_i, x_j)$.

We compare the supervised and unsupervised CompRF with the kernel SVM and the KNN classifier in Table 3. Note that in this setting, CART forests are not applicable, as they would require an explicit vector space representation.

Table 3. Average and standard deviation of the classification error for CompRF in comparison with the kernel SVM on graph datasets, for two graph kernels: WL-subtree and WL-edge.

| | MUTAG | ENZYMES | NCI1 | NCI109 |
|---|---|---|---|---|
| Train size | 188 | 600 | 4110 | 4127 |
| Classes | 2 | 6 | 2 | 2 |
| *WL-subtree kernel* | | | | |
| Kernel SVM | 17.77 (±7.31) | 47.16 (±5.72) | 15.96 (±1.56) | 15.55 (±1.40) |
| KNN | 14.00 (±8.78) | 48.17 (±4.48) | 18.13 (±2.27) | 18.74 (±1.97) |
| CompRF unsupervised | 14.44 (±7.94) | 39.33 (±6.49) | 17.96 (±1.85) | 19.10 (±2.22) |
| CompRF supervised | 13.89 (±7.97) | 39.83 (±5.00) | 17.35 (±1.98) | 18.71 (±2.61) |
| *WL-edge kernel* | | | | |
| Kernel SVM | 15.55 (±6.30) | 53.67 (±6.52) | 15.13 (±1.44) | 15.38 (±1.69) |
| KNN | 12.78 (±7.80) | 51.00 (±4.86) | 18.56 (±1.36) | 18.30 (±1.82) |
| CompRF unsupervised | 11.67 (±7.15) | 38.50 (±4.19) | 17.91 (±1.42) | 19.56 (±1.61) |
| CompRF supervised | 11.11 (±8.28) | 38.17 (±5.35) | 18.05 (±1.63) | 18.40 (±2.27) |

The parameters of the kernel SVM and $k$ of the KNN classifier are adjusted by 10-fold cross-validation on the training sets. We set the parameters of CompRF to $n_0 = 1$ and $M = 200$, as this setting shows acceptable performance in the Euclidean case. We assign 90% of the items to the training set and the remaining 10% to the test set. The experiment is repeated 10 times with random training/test assignments. The average and standard deviation of the classification error are reported in Table 3. The CompRF algorithm outperforms the kernel SVM on the MUTAG and ENZYMES datasets.
However, it has slightly lower performance on the other two datasets. Note, though, that the kernel SVM requires a lot of background knowledge (one has to construct a kernel in the first place, which can be difficult), whereas our CompRF algorithm neither uses the explicit distance values nor requires them to satisfy the axioms of a kernel function.

### 4.3. Comparison-Based Setting

Now we assume that the distance metric is unknown and not directly accessible, but that we can actively ask for triplet comparisons. In this setting, the major competitors of comparison-based forests are indirect methods that first compute an ordinal embedding into a Euclidean space and then classify the data there. As practical active ordinal embedding methods do not really exist, we settle for a batch setting in this case. After embedding, we use CART forests and the KNN classifier in the Euclidean space. Comparing various ordinal embedding algorithms, such as GNMDS (Agarwal et al., 2007), LOE (Terada and von Luxburg, 2014) and TSTE (van der Maaten and Weinberger, 2012), shows that TSTE in combination with a classifier consistently outperforms the others (see Section 3.1 of the supplement). Therefore, we only report the comparison with the TSTE embedding algorithm. We choose the embedding dimension by 2-fold cross-validation over the range $d \in \{10, 20, 30, 40, 50\}$ (embedding into more than 50 dimensions is impossible in practice due to the running time of TSTE). We also adjust $k$ of the KNN classifier in the cross-validation process.

We design a comparison-based scenario based on Euclidean datasets. First, we let CompRF choose the triplets it requires to construct the forest and classify the test points. The embedding methods are then used in two different scenarios: once with exactly the same triplets as the CompRF algorithm, and once with a completely random set of triplets of the same size as the one used by CompRF. The size of our datasets by far exceeds the number of points that embedding algorithms, particularly TSTE, can handle. To reduce the size of the datasets, we restrict each dataset to its first two classes and then subsample 1000 items; Isolet already has fewer than 1000 items in its first two classes. We assign half of each dataset as training set and the other half as test set.
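For the random-triplet baseline, a set of oracle-answered triplets of a prescribed size could be drawn as in the following sketch (our illustration; the paper does not specify this code):

```python
import random

def sample_random_triplets(X, T, oracle):
    """Draw T random triplets (i, j, k) with distinct indices and record
    the oracle's answer to 'is x_i closer to x_j than to x_k?'.
    The resulting list is the kind of input a batch ordinal embedding
    method such as TSTE consumes."""
    n = len(X)
    triplets = []
    for _ in range(T):
        i, j, k = random.sample(range(n), 3)
        if oracle(X[i], X[j], X[k]):       # delta(x_i, x_j) < delta(x_i, x_k)
            triplets.append((i, j, k))
        else:
            triplets.append((i, k, j))     # store in canonical closer-first order
    return triplets
```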
[Figure 4. Average classification error of CompRF in comparison with the embedding approach on classification datasets restricted to at most 1000 items. Emb KNN-Same (resp. Emb CART-Same) denotes TSTE+KNN (resp. TSTE+CART) using the same triplets as CompRF, while Emb KNN-Rand (resp. Emb CART-Rand) uses TSTE with the same number of random triplets. The x-axis shows the number of trees (M) used by CompRF and the corresponding number of triplets (T) for the embedding; each set of bars corresponds to a fixed M. Note that increasing M increases the number of triplets used by CompRF, as shown on the x-axis. Panels: Gisette (n = 1000; T = 12100, 60964, 120494, 239554 for M = 1, 5, 10, 20), MNIST (n = 1000; T = 11792, 57952, 116076, 227595), UCIHAR (n = 1000; T = 10380, 51896, 103317, 207462), Isolet (n = 480; T = 4338, 22028, 44366, 87629).]

Bar plots in Figure 4 show the classification error of CompRF in comparison with the embedding methods for various numbers of trees in the forest ($M$). We set $n_0 = 1$ for CompRF. In each set of bars, which corresponds to a restricted comparison-based regime, CompRF either outperforms the embedding methods or matches their performance. Another significant advantage of CompRF compared with the embedding approach is its low computational cost; a simple demonstration is provided in Section 3.3 of the supplementary material.

## 5. Conclusion and Future Work

We propose comparison-based forests for classification and regression tasks. This method only requires comparisons of distances as input. From a practical point of view, it works surprisingly well in all kinds of circumstances (Euclidean spaces, metric spaces, the purely comparison-based setting) and is much simpler and more efficient than some of its competitors, such as ordinal embeddings. We have proven consistency in a simplified setting. As future work, this analysis should be extended to more realistic situations: tree construction that depends on the sample; forests that are consistent even though individual trees may not be; and, finally, supervised splits. In addition, it would be interesting to propose a comparison-based supervised tree construction for regression tasks.

## Acknowledgements

The authors thank Debarghya Ghoshdastidar and Michaël Perrot for fruitful discussions. This work has been supported by the German Research Foundation DFG (SFB 936/Z3), the Institutional Strategy of the University of Tübingen (DFG ZUK 63), and the International Max Planck Research School for Intelligent Systems (IMPRS-IS).

## References

S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In AISTATS, pages 11–18, 2007.

M.F. Balcan, E. Vitercik, and C. White. Learning combinatorial functions from pairwise comparisons. In COLT, pages 310–335, 2016.

G. Biau. Analysis of a random forests model. JMLR, 13(4):1063–1095, 2012.

G. Biau and E. Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. JMLR, 9(9):2015–2033, 2008.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. CRC Press, 1984.

P. Cortez and A.J.R. Morais. A data mining approach to predict forest fires using meteorological data. In Portuguese Conference on Artificial Intelligence, pages 512–523, 2007.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.

S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, pages 537–546, 2008.

M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In ICML, pages 1256–1264, 2013.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

E. Ezra and M. Sharir. A nearly quadratic bound for the decision tree complexity of k-SUM. In LIPIcs-Leibniz International Proceedings in Informatics, volume 77. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017.

K. Fernandes, P. Vinagre, and P. Cortez. A proactive intelligent decision support system for predicting the popularity of online news. In Portuguese Conference on Artificial Intelligence, pages 535–546, 2015.
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? JMLR, 15(1):3133–3181, 2014.

S. Haghiri, D. Ghoshdastidar, and U. von Luxburg. Comparison-based nearest neighbor search. In AISTATS, pages 851–859, 2017.

H. Heikinheimo and A. Ukkonen. The crowd-median algorithm. In HCOMP, 2013.

H. Ishwaran and U.B. Kogalur. Consistency of random survival forests. Statistics & Probability Letters, 80(13):1056–1064, 2010.

D.M. Kane, S. Lovett, and S. Moran. Near-optimal linear decision trees for k-SUM and related problems. arXiv preprint arXiv:1705.01720, 2017a.

D.M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In Foundations of Computer Science (FOCS), 2017b.

M. Kleindessner and U. Luxburg. Dimensionality estimation without distances. In AISTATS, pages 471–479, 2015.

M. Kleindessner and U. von Luxburg. Kernel functions based on triplet comparisons. In NIPS, pages 6810–6820, 2017.

Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

M. Lichman. UCI Machine Learning Repository, 2013. Available at http://archive.ics.uci.edu/ml.

E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.

N. Shervashidze, P. Schweitzer, E.J. van Leeuwen, K. Mehlhorn, and K.M. Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12:2539–2561, 2011.

O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. Kalai. Adaptively learning the crowd kernel. In ICML, pages 673–680, 2011.

Y. Terada and U. von Luxburg. Local ordinal embedding. In ICML, pages 847–855, 2014.

A. Ukkonen, B. Derakhshan, and H. Heikinheimo. Crowdsourced nonparametric density estimation using relative distances. In HCOMP, 2015.

L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), pages 1–6, 2012. Code available at http://homepage.tudelft.nl/19j49/ste.

C. Wah, S. Maji, and S. Belongie. Learning localized perceptual similarity metrics for interactive categorization. In Winter Conference on Applications of Computer Vision (WACV), pages 502–509, 2015.

M. Wilber, I.S. Kwak, D. Kriegman, and S. Belongie. Learning concept embeddings with combined human-machine expertise. In ICCV, pages 981–989, 2015.

L. Zhang, S. Maji, and R. Tomioka. Jointly learning multiple measures of similarities from triplet comparisons. arXiv preprint arXiv:1503.01521, 2015.