# Comparison-Based Random Forests

Siavash Haghiri¹, Damien Garreau², Ulrike von Luxburg¹,²

¹Department of Computer Science, University of Tübingen, Germany. ²Max Planck Institute for Intelligent Systems, Tübingen, Germany. Correspondence to: Siavash Haghiri.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

Assume we are given a set of items from a general metric space, but we neither have access to the representation of the data nor to the distances between data points. Instead, suppose that we can actively choose a triplet of items (A, B, C) and ask an oracle whether item A is closer to item B or to item C. In this paper, we propose a novel random forest algorithm for regression and classification that relies only on such triplet comparisons. In the theory part of this paper, we establish sufficient conditions for the consistency of such a forest. In a set of comprehensive experiments, we then demonstrate that the proposed random forest is efficient both for classification and regression. In particular, it is even competitive with other methods that have direct access to the metric representation of the data.

## 1. Introduction

Assume we are given a set of items from a general metric space $(\mathcal{X}, \delta)$, but we neither have access to the representation of the data nor to the distances between data points. Instead, we have access to an oracle that we can actively ask for a triplet comparison: given any triplet of items $(x_i, x_j, x_k)$ in the metric space $\mathcal{X}$, is it true that

$$\delta(x_i, x_j) < \delta(x_i, x_k)\,?$$

Such a comparison-based framework has become popular in recent years, for example in the context of crowd-sourcing applications (Tamuz et al., 2011; Heikinheimo and Ukkonen, 2013; Ukkonen et al., 2015) and, more generally, whenever humans are supposed to give feedback or when constructing an explicit distance or similarity function is difficult (Wilber et al., 2015; Zhang et al., 2015; Wah et al., 2015; Balcan et al., 2016; Kleindessner and von Luxburg, 2017).

In the present work, we consider classification and regression problems in a comparison-based setting: we are given the labels $y_1, \ldots, y_n$ of unknown objects $x_1, \ldots, x_n$, and we can actively query triplet comparisons between objects. An indirect way to solve such problems is to first construct an ordinal embedding of the data points into a (typically low-dimensional) Euclidean space that satisfies the set of triplet comparisons, and then to apply standard machine learning methods to the Euclidean representation. However, this approach is often not satisfactory, because the new representation necessarily introduces distortions in the data. Furthermore, all existing ordinal embedding methods are painfully slow, even on moderately sized datasets (Agarwal et al., 2007; van der Maaten and Weinberger, 2012; Terada and von Luxburg, 2014). In addition, one has to estimate the embedding dimension, which is a challenging task by itself (Kleindessner and Luxburg, 2015).

As an alternative, we solve the classification/regression problems with a new random forest algorithm that requires only triplet comparisons. Standard random forests (Biau and Scornet, 2016) are among the most popular and successful classification/regression algorithms in Euclidean spaces (Fernández-Delgado et al., 2014). However, they heavily rely on the underlying vector space structure.
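Throughout the paper, the only primitive the learner may use is this triplet oracle. As a minimal illustration, here is a sketch of such an oracle in Python; the Euclidean implementation merely stands in for, say, a crowd worker, and the function name and interface are our own illustrative assumptions:

```python
import numpy as np

def triplet_oracle(x_i, x_j, x_k):
    """Answer the triplet query: is x_i closer to x_j than to x_k?

    Stand-in for any comparison oracle (e.g., a human annotator).
    The algorithms in this paper only ever see these boolean answers,
    never coordinates or distance values.
    """
    return np.linalg.norm(x_i - x_j) < np.linalg.norm(x_i - x_k)
```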
In our comparison-based setting we need a completely different tree-building strategy. We use the recently introduced comparison tree (Haghiri et al., 2017) for this purpose; in the Euclidean case it is distantly related to linear decision trees (Kane et al., 2017b; Ezra and Sharir, 2017; Kane et al., 2017a). A comparison-based random forest (CompRF) consists of a collection of comparison trees built on the training set. We study the proposed CompRF both from a theoretical and a practical point of view. In Section 3, we give sufficient conditions under which a slightly simplified variant of the comparison-based forest is statistically consistent. In Section 4, we apply the CompRF to various datasets. In the first set of experiments we compare our random forests to traditional CART forests on Euclidean data. In the second set of experiments, the distances between objects are known while their representation is missing. Finally, we consider a case in which only triplet comparisons are available.

## 2. Comparison-Based Random Forests

Random forests, first introduced in Breiman (2001), are one of the most popular algorithms for classification and regression in Euclidean spaces. In a comprehensive study on more than 100 classification tasks, random forests showed the best performance among many other general-purpose methods (Fernández-Delgado et al., 2014). However, standard random forests heavily rely on the vector space representation of the underlying data points, which is not available in a comparison-based framework. Instead, we propose a comparison-based random forest algorithm for classification and regression tasks. Its main ingredient is the comparison tree, which only uses triplet comparisons and relies neither on a Euclidean representation nor on distances between items.

Let us recap the CART random forest. The input consists of a labeled set $D_n = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$. To build an individual tree, we first draw a random subsample $D_s$ of $a_n$ points from $D_n$. Second, we select a random subset $\mathit{Dims}$ of size $m_{\text{try}}$ of all possible dimensions $\{1, 2, \ldots, d\}$. The tree is then built by recursive, axis-aligned splits along a dimension randomly chosen from $\mathit{Dims}$. The exact splitting point along this direction is determined via the CART criterion, which also involves the labels of the subset $D_s$ of points (see Biau and Scornet (2016) for details). The tree is grown until each cell contains at most $n_0$ points; these cells then correspond to the leaf nodes of the tree. To estimate a regression function $m(x)$, each individual tree routes the query point to the appropriate leaf and outputs the average response over all points in this leaf. The random forest aggregates $M$ such trees. Let us denote the prediction of tree $i$ at point $x$ by $m_i(x, \Theta_i, D_n)$, where $\Theta_i$ encodes the randomness in the tree construction. The final forest estimate at $x$ is then the average over all trees (for classification, the average is replaced by a majority vote):

$$m_{M,n}\bigl(x; (\Theta_i)_{1 \le i \le M}, D_n\bigr) = \frac{1}{M} \sum_{i=1}^{M} m_i(x, \Theta_i, D_n)\,.$$

The general consensus in the literature is that CART forests are surprisingly robust to parameter choices. Consequently, practitioners use explicit rules of thumb, for example $m_{\text{try}} = \lceil d/3 \rceil$, and $n_0 = 5$ (resp. $n_0 = 1$) for regression (resp. classification) tasks. We now suggest to replace the CART trees by comparison trees, leading to comparison-based random forests (CompRF).
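For concreteness, the aggregation step just described might look as follows; this is a minimal sketch under the assumption that each fitted tree exposes a `predict` method (our interface choice, not a fixed API):

```python
import numpy as np
from collections import Counter

def forest_predict(trees, x, task="regression"):
    """Aggregate the predictions of M base trees at a query point x:
    the average for regression, a majority vote for classification."""
    predictions = [tree.predict(x) for tree in trees]
    if task == "regression":
        return float(np.mean(predictions))
    return Counter(predictions).most_common(1)[0][0]
```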
Comparison trees were originally designed to find nearest neighbors by recursively splitting the search space into smaller subspaces. Inspired by the CART criterion, we propose a supervised variant of the comparison tree, which we refer to as the supervised comparison tree.

For classification, the supervised comparison tree construction for a labeled set $S \subset \mathcal{X} \times \{0, 1\}$ is as follows (see Algorithm 1 and Figure 1): we randomly choose two pivot points $x_1$ and $x_2$ with different labels $y_1$ and $y_2$ among the points in $S$ (the case where all points in $S$ have the same label is trivial). For every remaining point $(x, y) \in S$, we request the triplet comparison $\delta(x, x_1) < \delta(x, x_2)$. The answer to this query determines the relative position of $x$ with respect to the generalized hyperplane separating $x_1$ and $x_2$. We assign the points closer to $x_1$ to the first child node of $S$ and the points closer to $x_2$ to the other one. We then recurse on the child nodes until at most $n_0$ points remain in every leaf node of the tree.

Algorithm 1 CompTree(S, n0): supervised comparison tree construction.
Input: labeled data S and maximum leaf size n0. Output: comparison tree T.
1: T.root ← S
2: if |S| > n0 then
3:   sample distinct (x1, y1), (x2, y2) ∈ S such that y1 ≠ y2 (if all points have the same label, choose randomly)
4:   S1 ← {(x, y) ∈ S : δ(x, x1) ≤ δ(x, x2)}
5:   T.leftpivot ← x1, T.rightpivot ← x2
6:   T.leftchild ← CompTree(S1, n0)
7:   T.rightchild ← CompTree(S \ S1, n0)
8: end if
9: return T

[Figure 1. Construction of the comparison tree, illustrated in the Euclidean setting. (i) The current cell contains points with two different labels. (ii) Two pivot points with opposite labels are chosen randomly from all sample points in the current cell. (iii) The current cell is split according to whether points are closer to the one or the other pivot; in the Euclidean setting this corresponds to a hyperplane split. (iv) Result after recursive application of this principle with final leaf size n0 = 10.]

The supervised pivot selection is analogous to the CART criterion. However, instead of a costly optimization over the choice of split, it only requires choosing pivots with different labels. In Section 4.1, we show empirically that the supervised split procedure leads to better classification performance than CART forests. For regression, it is not obvious how the pivot selection should depend on the output values. Here we use an unsupervised version of the forest (unsupervised CompRF): we choose the pivots $x_1, x_2$ without considering $y_1, y_2$.

The final comparison-based forest consists of $M$ independently constructed comparison trees (see Algorithm 2). To assign a label to a query point, we traverse every tree to a leaf node, and then aggregate all items in the leaf nodes of the $M$ trees to estimate the label of the query item. For classification, the final label is the majority vote over the labels of the accumulated set (in the multiclass case we use a one-vs.-one approach). For regression, we use the mean output value.

Algorithm 2 CompRF(Dn, q, M, n0, r): CompRF prediction at query q.
Input: labeled dataset Dn ⊂ X × {0, 1}, query q ∈ X, leaf size n0, number of trees M, and subsampling ratio r. Output: yq, the label prediction for q.
1: set C ← ∅ as the list of predictor items
2: for j = 1, …, M do
3:   take a random subsample Ds ⊆ Dn with |Ds| = rn
4:   Tj ← CompTree(Ds, n0)
5:   given q, traverse the tree Tj to the leaf node Nj
6:   C ← C ∪ Nj
7: end for
8: return MajorityVote({y | (x, y) ∈ C})
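A compact Python sketch of Algorithms 1 and 2, assuming the triplet oracle interface from the introduction; the names, data layout, tie-breaking rule, and the omission of subsampling are our illustrative choices, not part of the paper:

```python
import random
from collections import Counter

class Node:
    """Comparison tree node: a leaf stores its points, an internal
    node stores the two pivots and its children."""
    def __init__(self, data):
        self.data = data              # list of (x, y) pairs
        self.pivots = None            # (x1, x2) once split
        self.left = self.right = None

def comp_tree(data, n0, oracle):
    """Algorithm 1 (sketch): split on random pivots with distinct labels;
    each point goes to the side of its closer pivot, as reported by the
    triplet oracle. Ties are sent to the right child here."""
    node = Node(data)
    if len(data) <= n0:
        return node
    if len({y for _, y in data}) > 1:
        while True:                   # supervised pivot selection
            (x1, y1), (x2, y2) = random.sample(data, 2)
            if y1 != y2:
                break
    else:
        (x1, _), (x2, _) = random.sample(data, 2)
    answers = [oracle(x, x1, x2) for x, _ in data]
    node.pivots = (x1, x2)
    node.left = comp_tree([p for p, a in zip(data, answers) if a], n0, oracle)
    node.right = comp_tree([p for p, a in zip(data, answers) if not a], n0, oracle)
    return node

def comp_rf_predict(q, data, M, n0, oracle):
    """Algorithm 2 (sketch, without subsampling): build M trees, route q
    to a leaf in each, pool the leaf labels, and take a majority vote."""
    pooled = []
    for _ in range(M):
        node = comp_tree(data, n0, oracle)
        while node.pivots is not None:
            x1, x2 = node.pivots
            node = node.left if oracle(q, x1, x2) else node.right
        pooled.extend(y for _, y in node.data)
    return Counter(pooled).most_common(1)[0][0]
```

For simplicity the sketch rebuilds the trees inside the prediction call; in practice one would of course build the forest once and reuse it for all queries.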
Intuitive comparison: The general understanding is that the efficiency of CART random forests stems from (1) the randomness due to subsampling of dimensions and data points (Breiman, 1996), and (2) the CART splitting criterion, which exploits the label information already during tree construction (Breiman et al., 1984). A weakness of CART splits is that they are necessarily axis-aligned, and thus not well adapted to the geometry of the data. In comparison trees, randomness is involved in the tree construction as well. But once a splitting direction has been determined by choosing the pivot points, the exact splitting point along this direction cannot be influenced any more, due to the lack of a vector space representation. On the other hand, comparison tree splits are well adapted to the data geometry by construction, which gives the comparison trees some advantage. All in all, the comparison-based forest is a promising candidate with slightly different strengths and weaknesses than the CART forest. Our empirical comparison in Section 4.1 reveals that it performs surprisingly well and can even outperform CART forests in certain settings.

## 3. Theoretical Analysis

Despite their intensive use in practice, theoretical questions regarding the consistency of the original procedure of Breiman (2001) are still under investigation. Most of the research focuses on simplified models in which the construction of the forest does not depend on the training set at all (Biau, 2012), or only via the $x_i$ but not the $y_i$ (Biau et al., 2008; Ishwaran and Kogalur, 2010; Denil et al., 2013). Recent efforts have nearly closed this gap, notably Scornet et al. (2015), who show that the original algorithm is consistent in the context of additive regression models and under suitable assumptions. However, there is no previous work on the consistency of random forests constructed only from triplet comparisons. As a first step in this direction, we investigate the consistency of individual comparison trees, the first building block in the study of random forest consistency.

As is common in the theoretical literature on random forests, we consider a slightly modified version of the comparison tree. We assume that the pivot points are not drawn from the underlying sample but according to the true distribution of the data. In this setting, we show that, as the number of observations grows to infinity, (i) the diameter of the cells converges to zero in probability, and (ii) each cell contains an arbitrarily large number of observations. Using a result of Devroye et al. (1996), we deduce that the associated classifier is consistent. The challenging part of the proof is to obtain control over the diameter of the cells. Intuitively, as in Dasgupta and Freund (2008, Lemma 12), it suffices to show that each cut is more likely to decrease the diameter of the current cell than to leave it unchanged. Proving this in our case is difficult because both the position of the next cut and the decrease in diameter it causes depend on the geometry of the cell.

### 3.1. Continuous Comparison Tree

As is the case for most theoretical results on random forests, we carry out our analysis in a Euclidean setting (the comparison forest, however, only has indirect access to the Euclidean metric, via triplet queries).
We assume that the input space is $\mathcal{X} = [0, 1]^d$, with distance $\delta$ given by the Euclidean norm, that is, $\delta(x, y) = \lVert x - y \rVert$. Let $X$ be a random variable with support included in $[0, 1]^d$. We assume that the observations $X_1, \ldots, X_n \in [0, 1]^d$ are drawn independently according to the distribution of $X$, and we make the following assumption:

Assumption 3.1. The random variable $X \in [0, 1]^d$ has a density $f$ with respect to the Lebesgue measure on $[0, 1]^d$. Additionally, $f$ is finite and bounded away from $0$.

For any $x, y \in \mathbb{R}^d$, let us define

$$\mathcal{H}(x, y) := \bigl\{ z \in \mathbb{R}^d \mid \delta(x, z) = \delta(y, z) \bigr\}.$$

In the Euclidean setting, $\mathcal{H}(x, y)$ is a hyperplane that separates $\mathbb{R}^d$ into two half-spaces. We call $H_x$ (resp. $H_y$) the open half-space containing $x$ (resp. $y$). The set $S_1$ in Algorithm 1 corresponds to $S \cap H_{x_1}$. We can now define the continuous comparison tree.

Definition 1 (Continuous comparison tree). A continuous comparison tree is a random infinite binary tree $T^0$ obtained via the following iterative construction:

- The root of $T^0$ is $[0, 1]^d$.
- Assuming that level $\ell$ of $T^0$ has already been built, level $\ell + 1$ is built as follows: for every cell $C$ at height $\ell$, draw $X_1, X_2 \in C$ independently according to the distribution of $X$ restricted to $C$. The children of $C$ are defined as the closures of $C \cap H_{X_1}$ and $C \cap H_{X_2}$.

For any sequence $(p_n)_{n \ge 0}$, a truncated continuous comparison tree $T^0(p_n)$ consists of the first $p_n$ levels of $T^0$.

From a mathematical point of view, the continuous tree has a number of advantages. (i) Its construction does not depend on the responses $Y_1, \ldots, Y_n$. Such a simplification is quite common, because data-dependent random tree structures are notoriously difficult to analyze (Biau et al., 2008). (ii) Its construction is formally independent of the finite set of data points, but close in spirit: rather than sampling the pivots among the data points in a cell, the pivots are sampled independently according to the underlying distribution. Whenever a cell contains a large number of sample points, the two distributions are close, but they may drift apart as the diameter of the cells goes to 0. (iii) In the continuous comparison tree, we stop splitting cells at height $p_n$, whereas in the discrete setting we stop when at most $n_0$ observations remain in the current cell. As a consequence, $T^0(p_n)$ is a perfect binary tree: each interior node has exactly two children. This is typically not the case for comparison trees.

### 3.2. Consistency

To each realization of $T^0(p_n)$ is associated a partition of $[0, 1]^d$ into disjoint cells $A_{1,n}, A_{2,n}, \ldots, A_{2^{p_n},n}$. For any $x \in [0, 1]^d$, let $A(x)$ be the cell of $T^0(p_n)$ containing $x$. Let us assume that the responses $(Y_i)_{1 \le i \le n}$ are binary labels. We consider the classifier defined by majority vote in each cell, that is,

$$g_n(x) = \begin{cases} 1 & \text{if } \sum_{X_i \in A(x)} \mathbf{1}_{Y_i = 1} > \sum_{X_i \in A(x)} \mathbf{1}_{Y_i = 0}, \\ 0 & \text{otherwise.} \end{cases}$$

Define $L_n := \mathbb{P}\left(g_n(X) \ne Y \mid D_n\right)$. Following Devroye et al. (1996), we say that the classifier $g_n$ is consistent if

$$\mathbb{E}[L_n] = \mathbb{P}(g_n(X) \ne Y) \xrightarrow[n \to +\infty]{} L^*,$$

where $L^*$ is the Bayes error probability. Our main result is the consistency of the classifier associated with the continuous comparison tree truncated at a logarithmic height.

Theorem 2 (Consistency of comparison-based trees). Under Assumption 3.1, the classifier associated with the continuous, truncated tree $T^0(\alpha \log n)$ is consistent for any constant $0 < \alpha < 1 / \log 2$.

In particular, since each individual tree is consistent, a random forest with base tree $T^0(p_n)$ is also consistent.
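To see where the constraint $\alpha < 1/\log 2$ enters, it helps to unroll the leaf-count computation used in the proof sketch below (our rewriting of that step, not a verbatim excerpt):

```latex
% A perfect binary tree truncated at height p_n = \alpha \log n
% has 2^{p_n} leaf cells, and
\[
  2^{p_n} = 2^{\alpha \log n} = e^{(\alpha \log 2)\,\log n}
          = n^{\alpha \log 2} = o(n)
  \qquad \text{since } \alpha \log 2 < 1 .
\]
% Hence the n sample points are spread over o(n) cells, so the
% expected number of points per cell grows without bound.
```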
Theorem 2 is a first step towards explaining why comparison-based trees perform well without having access to the representation of the points. Note also that, even though the continuous tree is a simplified version of the discrete tree, the two are quite similar and share all important characteristics. In particular, they roughly have the same depth: with high probability, the comparison tree has logarithmic depth (Haghiri et al., 2017, Theorem 1).

### 3.3. Sketch of the Proof

Since the construction of $T^0(p_n)$ does not depend on the labels, we can use Theorem 6.1 of Devroye et al. (1996), which gives sufficient conditions for classification rules based on space partitioning to be consistent. In particular, we have to show that the partition satisfies two properties: first, the leaf cells should be small enough that local changes of the distribution can be detected; second, the leaf cells should contain a sufficiently large number of points so that averaging among the labels makes sense. More precisely, we have to show that (1) $\operatorname{diam} A(X) \to 0$ in probability, where $\operatorname{diam} A := \sup_{x, y \in A} \delta(x, y)$ is the diameter of $A$, and (2) $N(X) \to \infty$ in probability, where $N(X)$ is the number of sample points in the cell containing $X$.

The second point is simple: it suffices to show that the number of cells in the partition associated with $T^0(\alpha \log n)$ is $o(n)$ (according to Lemma 20.1 in Devroye et al. (1996) and the remark that follows it). Proving (1) is much more challenging. A sketch of the proof of (1) follows; the complete proof of Theorem 2 can be found in the supplementary material.

The critical part in proving (1) is to show that, for any cell of the continuous comparison tree, the diameter of its descendants at least $k$ levels below is halved with high probability. More precisely, the following proposition shows that this probability is lower bounded by $1 - \delta$, where $\delta$ is exponentially decreasing in $k$.

Proposition 3 (Diameter control). Let $C$ be a cell of $T^0$ such that $\operatorname{diam} C \le D$. Then, under Assumption 3.1, the probability that there exists a descendant of $C$ which is more than $k$ levels below and yet has diameter greater than $D/2$ is at most $\delta = c \gamma^k$, where $c > 0$ and $\gamma \in (0, 1)$ are constants depending only on the dimension $d$ and the density $f$.

The proof of Proposition 3 amounts to showing that the probability of decreasing the diameter of any given cell is higher than the probability of keeping it unchanged; see the supplementary material. Assuming Proposition 3, the rest of the proof goes as follows. Let us set $\varepsilon \in (0, 1)$.
We are going to show that $\mathbb{P}(\operatorname{diam} A(X) > \varepsilon) \to 0$ as $n \to +\infty$. Let $\Gamma$ be the path in $T^0(\alpha \log n)$ that goes from the root to the leaf $A$ of maximum diameter. This path has length $p_n$, by the definition of $T^0(\alpha \log n)$. The root, which consists of the set $[0, 1]^d$, has diameter $\sqrt{d}$. This means that we need to halve the diameter of this cell $\pi = \lceil \log_2(\sqrt{d}/\varepsilon) \rceil$ times to obtain cells with diameter smaller than $\varepsilon$. Let us set $k = \lfloor p_n / \pi \rfloor$ and pick cells $C^{(j)}$, $0 \le j \le \pi$, along $\Gamma$ such that $C^{(0)} = [0, 1]^d$, $C^{(\pi)} = A$, and such that there are more than $k$ levels between $C^{(j)}$ and $C^{(j+1)}$. Then we can prove that

$$\mathbb{P}(\operatorname{diam} A > \varepsilon) \le \sum_{j=1}^{\pi} \mathbb{P}\Bigl(\operatorname{diam} C^{(j)} > \tfrac{1}{2} \operatorname{diam} C^{(j-1)}\Bigr).$$

Furthermore, according to Proposition 3, each term in the sum is at most $c\gamma^k$, so the right-hand side is upper bounded by $\pi c \gamma^k$. Since $k$ grows like $\log n$ and $\gamma \in (0, 1)$, this bound tends to 0, and we can conclude.

## 4. Experiments

In this section, we first examine comparison-based forests in the Euclidean setting. Second, we apply the CompRF method to non-Euclidean datasets for which a general metric is available. Finally, we run experiments in the setting where we are only given triplet comparisons.

### 4.1. Euclidean Setting

Here we examine the performance of CompRF on classification and regression tasks in the Euclidean setting, and compare it against CART random forests as well as against the KNN classifier as a baseline. As the distance function for KNN and CompRF we use the standard Euclidean distance. Since CompRF only has access to distance comparisons, the amount of information it uses is considerably lower than the information available to the CART forest. Hence, the goal of this experiment is not to show that comparison-based random forests perform better, but rather to find out whether their performance is still acceptable.

To emphasize the role of supervised pivot selection, we also report the performance of the unsupervised CompRF algorithm on the classification tasks. The tree construction in the unsupervised CompRF chooses the pivot points uniformly at random, without considering the labels. For the sake of simplicity, we do not perform subsampling when building the CompRF trees; we report some experiments concerning the role of subsampling in Section 3.2 of the supplementary material. All other parameters of CompRF are adjusted by cross-validation.

#### 4.1.1. Classification

We use four classification datasets. MNIST (Le Cun et al., 1998) and Gisette are handwritten digit datasets. Isolet and UCIHAR are speech recognition and human activity recognition datasets, respectively (Lichman, 2013). Details of the datasets are shown in the first three rows of Table 1.

Parameters of CompRF: We examine the behaviour of CompRF with respect to the choice of the leaf size $n_0$ and the number of trees $M$. We perform 10-fold cross-validation over $n_0 \in \{1, 4, 16, 64\}$ and $M \in \{1, 4, 16, 64, 256\}$.
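A sketch of this model-selection loop; `build_forest` and `forest_error` are hypothetical wrappers around the comparison-tree construction and prediction sketched in Section 2, and `X`, `y` are assumed to be NumPy arrays:

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def select_comp_rf_params(X, y, oracle, n0_grid=(1, 4, 16, 64),
                          M_grid=(1, 4, 16, 64, 256), n_folds=10):
    """10-fold cross-validation over the (n0, M) grid; returns the best pair.

    build_forest / forest_error are assumed helpers (not defined here)
    that wrap the CompTree construction and leaf-pooling prediction."""
    best_params, best_err = None, np.inf
    for n0, M in itertools.product(n0_grid, M_grid):
        errs = []
        for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
            forest = build_forest(X[train_idx], y[train_idx],
                                  n0=n0, M=M, oracle=oracle)
            errs.append(forest_error(forest, X[val_idx], y[val_idx], oracle))
        if np.mean(errs) < best_err:
            best_params, best_err = (n0, M), np.mean(errs)
    return best_params
```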
In Figure 2 we report the resulting cross-validation error. In line with the recommendations for CART forests (Biau and Scornet, 2016), we achieve the best performance when the leaf size is small, that is, $n_0 = 1$. Moreover, there is no significant improvement for $M$ greater than 100.

[Figure 2. Average classification error of the CompRF algorithm on the classification datasets. The x-axis shows the number of trees used in the forest; each panel corresponds to one dataset and each curve to a fixed value of n0.]

Comparison between CompRF, CART and KNN: Table 1 shows the average and standard deviation of the classification error over 10 independent runs of CompRF, the CART forest, and KNN. Training and test sets are those provided with the respective datasets. The parameters $n_0$ and $M$ of CompRF and CART, and $k$ of KNN, are chosen by 10-fold cross-validation on the training set. Note that KNN is not randomized, so there is no standard deviation to report.

Table 1. Average and standard deviation of the classification error for CompRF vs. other methods. The first three rows describe the datasets.

| | MNIST | Gisette | UCIHAR | Isolet |
|---|---|---|---|---|
| Dataset size | 70000 | 7000 | 10229 | 6238 |
| Variables | 784 | 5000 | 561 | 617 |
| Classes | 10 | 2 | 5 | 26 |
| KNN | 2.91 | 3.50 | 12.15 | 8.27 |
| CART RF | 2.90 (±0.05) | 3.04 (±0.26) | 7.47 (±0.32) | 5.48 (±0.27) |
| CompRF unsupervised | 4.21 (±0.05) | 3.28 (±0.19) | 8.70 (±0.32) | 6.65 (±0.14) |
| CompRF supervised | 2.50 (±0.05) | 2.48 (±0.13) | 6.54 (±0.11) | 4.43 (±0.26) |

The results show that, surprisingly, CompRF can slightly outperform the CART forests on classification tasks even though it uses considerably less information. The reason might be that the CompRF splits are better adapted to the geometry of the data than the CART splits. While the CART criterion for selecting the exact splitting point can be very informative for regression (see below), for classification a simple data-dependent splitting criterion, as in the supervised CompRF, appears to be just as effective. Conversely, unsupervised splitting, as in the unsupervised CompRF, is clearly worse than supervised splitting.

#### 4.1.2. Regression

Next we consider regression tasks on four datasets. Online News Popularity (ONP) is a dataset of articles, with the popularity of the article as target (Fernandes et al., 2015). Boston is a dataset of properties, with the estimated value as target variable. Forest Fire is a dataset for predicting the burned area of forest fires in the northeast region of Portugal (Cortez and Morais, 2007). White Wine (Wine) is a subset of the wine quality dataset (Cortez et al., 2009). Details of the datasets are shown in the first two rows of Table 2.

Since the regression datasets have no separate training and test sets, we assign 90% of the items to the training set and the remaining 10% to the test set. In order to remove the effect of the fixed partitioning, we repeat the experiments 10 times with random training/test set assignments. Note that we use CompRF with unsupervised tree construction for regression.

Parameters of CompRF: We report the behaviour of CompRF with respect to the parameters $n_0$ and $M$. We perform 10-fold cross-validation with the same range of parameters as in the previous section. Figure 3 shows the average root mean squared error (RMSE) over the 10 folds. The cross-validation is performed for 10 random training/test set assignments.
The figure corresponds to the first assignment out of the 10 (the behaviour for the other training/test set assignments is similar). The CompRF algorithm shows the best performance with $n_0 = 1$ on the Boston and Forest Fire datasets, whereas larger values of $n_0$ lead to better performance on the other datasets. We believe that the main reason for this variability is the unsupervised tree construction used by CompRF for regression.

[Figure 3. Average RMSE of the CompRF algorithm on the regression datasets. The x-axis shows the number of trees used in the forest; each panel corresponds to one dataset and each curve to a fixed value of n0.]

Comparison between CompRF and CART: Table 2 shows the average and standard deviation of the RMSE for CompRF and the CART forests over the 10 runs with random training/test set assignments. For each combination of training and test sets, we tuned the parameters independently by cross-validation. CompRF is constructed with unsupervised splitting, while the CART forests are built using the supervised CART criterion. On the Boston and Wine datasets, the performance of the CART forest is substantially better than that of CompRF: here, ignoring the Euclidean representation of the data and relying only on comparison-based trees leads to a significant decrease in performance. On the other two datasets, however, our method performs about the same as the CART forests. We conclude that in some cases the CART criterion can be essential for regression. Note, however, that in a purely comparison-based setting, without an actual vector space representation, it is hardly possible to devise an efficient supervised splitting criterion.

Table 2. Average and standard deviation of the RMSE for CompRF vs. the CART regression forest.

| | ONP | Boston | Forest Fire | White Wine |
|---|---|---|---|---|
| Dataset size | 39644 | 506 | 517 | 4898 |
| Variables | 58 | 13 | 12 | 11 |
| CART RF | 1.04 (±0.50) × 10⁴ | 3.02 (±0.95) | 45.32 (±4.89) | 59.00 (±2.94) × 10⁻² |
| CompRF | 1.05 (±0.50) × 10⁴ | 6.16 (±1.00) | 45.37 (±4.69) | 72.46 (±3.16) × 10⁻² |

### 4.2. Metric, Non-Euclidean Setting

In this set of experiments we aim to demonstrate the performance of CompRF in general metric spaces. We choose graph-structured data for this purpose: each data point is a graph, and as a distance between graphs we use graph-kernel-based functions. In particular, the Weisfeiler-Lehman graph kernels are a family of graph kernels with promising results on various graph datasets (Shervashidze et al., 2011). We compute the WL-subtree and WL-edge kernels on four of the datasets reported in Shervashidze et al. (2011): MUTAG, ENZYMES, NCI1 and NCI109. In order to evaluate triplet comparisons based on the graph kernels, we first convert the kernel matrix to a distance matrix in the standard way, expressing the Gram matrix in terms of squared distances: $\delta(x_i, x_j)^2 = k(x_i, x_i) + k(x_j, x_j) - 2\,k(x_i, x_j)$.

We compare the supervised and unsupervised CompRF with the kernel SVM and the KNN classifier in Table 3. Note that in this setting, CART forests are not applicable, as they would require an explicit vector space representation.

Table 3. Average and standard deviation of the classification error for CompRF in comparison with the kernel SVM on graph datasets, for two graph kernels: WL-subtree and WL-edge.

| | MUTAG | ENZYMES | NCI1 | NCI109 |
|---|---|---|---|---|
| Train size | 188 | 600 | 4110 | 4127 |
| Classes | 2 | 6 | 2 | 2 |
| *WL-subtree kernel* | | | | |
| Kernel SVM | 17.77 (±7.31) | 47.16 (±5.72) | 15.96 (±1.56) | 15.55 (±1.40) |
| KNN | 14.00 (±8.78) | 48.17 (±4.48) | 18.13 (±2.27) | 18.74 (±1.97) |
| CompRF unsupervised | 14.44 (±7.94) | 39.33 (±6.49) | 17.96 (±1.85) | 19.10 (±2.22) |
| CompRF supervised | 13.89 (±7.97) | 39.83 (±5.00) | 17.35 (±1.98) | 18.71 (±2.61) |
| *WL-edge kernel* | | | | |
| Kernel SVM | 15.55 (±6.30) | 53.67 (±6.52) | 15.13 (±1.44) | 15.38 (±1.69) |
| KNN | 12.78 (±7.80) | 51.00 (±4.86) | 18.56 (±1.36) | 18.30 (±1.82) |
| CompRF unsupervised | 11.67 (±7.15) | 38.50 (±4.19) | 17.91 (±1.42) | 19.56 (±1.61) |
| CompRF supervised | 11.11 (±8.28) | 38.17 (±5.35) | 18.05 (±1.63) | 18.40 (±2.27) |

The parameters of the kernel SVM and $k$ of the KNN classifier are adjusted by 10-fold cross-validation on the training sets. We set the parameters of CompRF to $n_0 = 1$ and $M = 200$, as this setting shows acceptable performance in the Euclidean case. We assign 90% of the items to the training set and the remaining 10% to the test set. The experiment is repeated 10 times with random training/test assignments. The average and standard deviation of the classification error are reported in Table 3. The CompRF algorithm outperforms the kernel SVM on the MUTAG and ENZYMES datasets.
However, it has slightly lower performance on the other two datasets. Note, though, that the kernel SVM requires a lot of background knowledge (one has to construct a kernel in the first place, which can be difficult), whereas our CompRF algorithm neither uses the explicit distance values nor requires them to satisfy the axioms of a kernel function.

### 4.3. Comparison-Based Setting

Now we assume that the distance metric is unknown and not directly accessible, but that we can actively ask for triplet comparisons. In this setting, the major competitors of comparison-based forests are indirect methods that first compute an ordinal embedding into a Euclidean space and then classify the data there. As practical active ordinal embedding methods do not really exist, we settle for a batch setting in this case. After embedding, we use CART forests and the KNN classifier in the Euclidean space. Comparing various ordinal embedding algorithms, such as GNMDS (Agarwal et al., 2007), LOE (Terada and von Luxburg, 2014) and TSTE (van der Maaten and Weinberger, 2012), shows that TSTE in combination with a classifier consistently outperforms the others (see Section 3.1 of the supplement). Therefore, we only report the comparison with the TSTE embedding algorithm. We choose the embedding dimension by 2-fold cross-validation over the range $d \in \{10, 20, 30, 40, 50\}$ (embedding into more than 50 dimensions is impossible in practice due to the running time of TSTE). We also adjust $k$ of the KNN classifier in the cross-validation process.

We design a comparison-based scenario based on Euclidean datasets. First, we let CompRF choose the triplets it requires to construct the forest and classify the test points. The embedding methods are then used in two different scenarios: once with exactly the same triplets as the CompRF algorithm, and once with a completely random set of triplets of the same size as the one used by CompRF. The size of our datasets by far exceeds the number of points that embedding algorithms, particularly TSTE, can handle. To reduce the size of the datasets, we restrict each dataset to its first two classes and then subsample 1000 items; Isolet already has fewer than 1000 items in its first two classes. We assign half of each dataset as training set and the other half as test set.
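For the random-triplet baseline, a set of oracle-answered triplets of a prescribed size could be drawn as in the following sketch (our illustration; the paper does not specify this code):

```python
import random

def sample_random_triplets(X, T, oracle):
    """Draw T random triplets (i, j, k) with distinct indices and record
    the oracle's answer to 'is x_i closer to x_j than to x_k?'.
    The resulting list is the kind of input a batch ordinal embedding
    method such as TSTE consumes."""
    n = len(X)
    triplets = []
    for _ in range(T):
        i, j, k = random.sample(range(n), 3)
        if oracle(X[i], X[j], X[k]):       # delta(x_i, x_j) < delta(x_i, x_k)
            triplets.append((i, j, k))
        else:
            triplets.append((i, k, j))     # store in canonical closer-first order
    return triplets
```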
[Figure 4. Average classification error of CompRF in comparison with the embedding approach on classification datasets restricted to at most 1000 items. Emb KNN-Same (resp. Emb CART-Same) denotes TSTE+KNN (resp. TSTE+CART) using the same triplets as CompRF, while Emb KNN-Rand (resp. Emb CART-Rand) uses TSTE with the same number of random triplets. The x-axis shows the number of trees (M) used by CompRF and the corresponding number of triplets (T) for the embedding; each set of bars corresponds to a fixed M. Note that increasing M increases the number of triplets used by CompRF, as shown on the x-axis. Panels: Gisette (n = 1000; T = 12100, 60964, 120494, 239554 for M = 1, 5, 10, 20), MNIST (n = 1000; T = 11792, 57952, 116076, 227595), UCIHAR (n = 1000; T = 10380, 51896, 103317, 207462), Isolet (n = 480; T = 4338, 22028, 44366, 87629).]

Bar plots in Figure 4 show the classification error of CompRF in comparison with the embedding methods for various numbers of trees in the forest ($M$). We set $n_0 = 1$ for CompRF. In each set of bars, which corresponds to a restricted comparison-based regime, CompRF either outperforms the embedding methods or matches their performance. Another significant advantage of CompRF compared with the embedding approach is its low computational cost; a simple demonstration is provided in Section 3.3 of the supplementary material.

## 5. Conclusion and Future Work

We propose comparison-based forests for classification and regression tasks. This method only requires comparisons of distances as input. From a practical point of view, it works surprisingly well in all kinds of circumstances (Euclidean spaces, metric spaces, the purely comparison-based setting) and is much simpler and more efficient than some of its competitors, such as ordinal embeddings. We have proven consistency in a simplified setting. As future work, this analysis should be extended to more realistic situations: tree construction that depends on the sample; forests that are consistent even though individual trees may not be; and, finally, supervised splits. In addition, it would be interesting to propose a comparison-based supervised tree construction for regression tasks.

## Acknowledgements

The authors thank Debarghya Ghoshdastidar and Michaël Perrot for fruitful discussions. This work has been supported by the German Research Foundation DFG (SFB 936/Z3), the Institutional Strategy of the University of Tübingen (DFG ZUK 63), and the International Max Planck Research School for Intelligent Systems (IMPRS-IS).

## References

S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In AISTATS, pages 11–18, 2007.

M.F. Balcan, E. Vitercik, and C. White. Learning combinatorial functions from pairwise comparisons. In COLT, pages 310–335, 2016.

G. Biau. Analysis of a random forests model. JMLR, 13(4):1063–1095, 2012.

G. Biau and E. Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. JMLR, 9(9):2015–2033, 2008.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. CRC Press, 1984.

P. Cortez and A.J.R. Morais. A data mining approach to predict forest fires using meteorological data. In Portuguese Conference on Artificial Intelligence, pages 512–523, 2007.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.

S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, pages 537–546, 2008.

M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In ICML, pages 1256–1264, 2013.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

E. Ezra and M. Sharir. A nearly quadratic bound for the decision tree complexity of k-SUM. In LIPIcs-Leibniz International Proceedings in Informatics, volume 77. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017.

K. Fernandes, P. Vinagre, and P. Cortez. A proactive intelligent decision support system for predicting the popularity of online news. In Portuguese Conference on Artificial Intelligence, pages 535–546, 2015.
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? JMLR, 15(1):3133–3181, 2014.

S. Haghiri, D. Ghoshdastidar, and U. von Luxburg. Comparison-based nearest neighbor search. In AISTATS, pages 851–859, 2017.

H. Heikinheimo and A. Ukkonen. The crowd-median algorithm. In HCOMP, 2013.

H. Ishwaran and U.B. Kogalur. Consistency of random survival forests. Statistics & Probability Letters, 80(13):1056–1064, 2010.

D.M. Kane, S. Lovett, and S. Moran. Near-optimal linear decision trees for k-SUM and related problems. arXiv preprint arXiv:1705.01720, 2017a.

D.M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In Foundations of Computer Science (FOCS), 2017b.

M. Kleindessner and U. Luxburg. Dimensionality estimation without distances. In AISTATS, pages 471–479, 2015.

M. Kleindessner and U. von Luxburg. Kernel functions based on triplet comparisons. In NIPS, pages 6810–6820, 2017.

Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

M. Lichman. UCI Machine Learning Repository, 2013. Available at http://archive.ics.uci.edu/ml.

E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.

N. Shervashidze, P. Schweitzer, E.J. van Leeuwen, K. Mehlhorn, and K.M. Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12:2539–2561, 2011.

O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. Kalai. Adaptively learning the crowd kernel. In ICML, pages 673–680, 2011.

Y. Terada and U. von Luxburg. Local ordinal embedding. In ICML, pages 847–855, 2014.

A. Ukkonen, B. Derakhshan, and H. Heikinheimo. Crowdsourced nonparametric density estimation using relative distances. In HCOMP, 2015.

L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), pages 1–6, 2012. Code available at http://homepage.tudelft.nl/19j49/ste.

C. Wah, S. Maji, and S. Belongie. Learning localized perceptual similarity metrics for interactive categorization. In Winter Conference on Applications of Computer Vision (WACV), pages 502–509, 2015.

M. Wilber, I.S. Kwak, D. Kriegman, and S. Belongie. Learning concept embeddings with combined human-machine expertise. In ICCV, pages 981–989, 2015.

L. Zhang, S. Maji, and R. Tomioka. Jointly learning multiple measures of similarities from triplet comparisons. arXiv preprint arXiv:1503.01521, 2015.