# Label Ranking through Nonparametric Regression

Dimitris Fotakis¹, Alkis Kalavasis¹, Eleni Psaroudaki¹

Abstract

Label Ranking (LR) corresponds to the problem of learning a hypothesis that maps features to rankings over a finite set of labels. We adopt a nonparametric regression approach to LR and obtain theoretical performance guarantees for this fundamental practical problem. We introduce a generative model for Label Ranking, in noiseless and noisy nonparametric regression settings, and provide sample complexity bounds for learning algorithms in both cases. In the noiseless setting, we study the LR problem with full rankings and provide computationally efficient algorithms using decision trees and random forests in the high-dimensional regime. In the noisy setting, we consider the more general cases of LR with incomplete and partial rankings from a statistical viewpoint and obtain sample complexity bounds using the One-Versus-One approach of multiclass classification. Finally, we complement our theoretical contributions with experiments, aiming to understand how the input regression noise affects the observed output.

1. Introduction

Label Ranking (LR) studies the problem of learning a mapping from features to rankings over a finite set of labels. This task emerges in many domains. Common practical illustrations include pattern recognition (Geng & Luo, 2014), web advertisement (Djuric et al., 2014), sentiment analysis (Wang et al., 2011), document categorization (Jindal & Taneja, 2015) and bio-informatics (Balasubramaniyan et al., 2005). The importance of LR has spurred the development of several approaches for tackling this task from the perspective of the applied CS community (Vembu & Gärtner, 2010; Zhou et al., 2014b).

¹School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece. Correspondence to: Eleni Psaroudaki. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The overwhelming majority of these solutions comes with experimental evaluation and no theoretical guarantees; e.g., algorithms based on decision trees are a workhorse for practical LR and yet lack theoretical guarantees. Given state-of-the-art experimental results based on Random Forests (see Zhou & Qiu, 2018), we are highly motivated not only to work towards a theoretical understanding of this central learning problem but also to theoretically analyze how efficient tree-based methods can be under specific assumptions.

LR is a supervised learning problem that extends multiclass classification (Dekel et al., 2003). In the latter, with instance domain X ⊆ R^d and set of labels [k] := {1, . . . , k}, the learner draws i.i.d. labeled examples (x, y) ∈ X × [k] and aims to learn a hypothesis from instances to labels, following the standard PAC model. In LR, the learner observes labeled examples (x, σ) ∈ X × S_k and the goal is to learn a hypothesis h : X → S_k from instances to rankings of labels, where S_k is the symmetric group on k elements. The ranking h(x) corresponds to the preference list of the feature x and, as mentioned in previous works (Hüllermeier et al., 2008), a natural way to represent preferences is to evaluate individual alternatives through a real-valued utility (or score) function. Note that if the training data offered the utility scores directly, the problem would reduce to a standard regression problem.
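As a minimal illustration of this score-based view, a utility vector induces a ranking by listing labels in decreasing order of score; the snippet below is only a sketch of this convention (it is the sense in which argsort is used throughout the paper).

```python
import numpy as np

def ranking_from_scores(scores):
    """Labels ordered by decreasing utility score (the paper's argsort convention)."""
    return np.argsort(-np.asarray(scores))

# A score vector (0.4, 0.7, 0.1) for labels (0, 1, 2) yields the
# preference list (1, 0, 2): label 1 is ranked first.
print(ranking_from_scores([0.4, 0.7, 0.1]))   # [1 0 2]
```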
In our work, we assume that there exists such an underlying nonparametric score function m : X [0, 1]k, mapping features to score values. The value mi(x) corresponds to the score assigned to the label i [k] for input x and can be considered proportional to the posterior probability Pr(x,y)[y = i|x]. For each LR example (x, σ), the label σ is generated by sorting the underlying regression-score vector m(x), i.e., σ = argsort(m(x) + ξ) (with some regression noise ξ). We are also interested in cases where some of the alternatives of σ are missing, i.e., we observe incomplete rankings σ S k; the way that such rankings occur will be clarified later. Formally, we have: Definition 1.1 (Distribution-free Nonparametric LR). Let X Rd, [k] be a set of labels, C be a class of functions from X to [0, 1]k and Dx be an arbitrary distribution over X. Consider a noise distribution E over Rk. Let m be an unknown target function in C. An example oracle Ex(m, E) with complete rankings, Label Ranking through Nonparametric Regression works as follows: Each time Ex(m, E) is invoked, it returns a labeled example (x, σ) X Sk, where (i) x Dx and ξ E independently and (ii) σ = argsort(m(x) + ξ). Let DR be the joint distribution over (x, σ) generated by the oracle. In the noiseless case (ξ = 0 almost surely), we simply write Ex(m). Let M be a randomized mechanism that given a tuple (x, y) X Rk generates an incomplete ranking M(x, y) S k. An example oracle Ex(m, E, M) with incomplete rankings, works as follows: Each time Ex(m, E, M) is invoked, it returns a labeled example (x, σ) X S k, where (i) x Dx, ξ E , (ii) y = m(x) + ξ and (iii) σ = M(x, y). Let (x, σ) DM R . We denote h : X Sk the composition h = argsort m. Note that the oracle Ex(m, E, M) generalizes Ex(m, E) (which generalizes Ex(m) accordingly) since we can set M to be M(x, y) = argsort(y). 1.1. Problem Formulation and Contribution Most of our attention focuses on the two upcoming learning goals, which are stated for the abstract Label Ranking example oracle O {Ex(m), Ex(m, E), Ex(m, E, M)}. Let d be an appropriate ranking distance metric. Problem 1 (Computational). The learner is given i.i.d. samples from the oracle O and its goal is to efficiently output a hypothesis bh : X Sk such that with high probability the error Ex Dx[d(bh(x), h(x))] is small. Problem 2 (Statistical). Consider the median problem h = argminh E(x,σ)[d(h(x), σ)] where (x, σ) DR. The learner is given i.i.d. samples from the oracle O and its goal is to output a hypothesis bh : X Sk from some hypothesis class H such that with high probability the error Prx Dx[bh(x) = h (x)] against the median h is small. The main gap in the theoretical literature of LR was the lack of computational guarantees. Problem 1 identifies this gap and offers, in combination with the generative models of the previous section, a natural and formal way to study the theoretical performance of practical methods for LR such as decision trees and random forests. We believe that this is the main conceptual contribution of our work. In Problem 1, the runtime should be polynomial in d, k, 1/ε. While Problem 1 deals with computational aspects of LR, Problem 2 focuses on the statistical aspects, i.e., the learner may be computationally inefficient. This problem is extensively studied as Ranking Median Regression (Cl emenc on et al., 2018; Cl emenc on & Vogel, 2020) and is closely related to Empirical Risk Minimization (and this is why it is statistical , since NP-hardness barriers may arise). 
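For concreteness, the following sketch instantiates the example oracles of Definition 1.1. The particular score function m, the bounded uniform noise E, and the independent-erasure mechanism M are hypothetical choices made only for illustration; Definition 1.1 allows arbitrary ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5

def rank_of(y):
    """The paper's argsort: labels ordered by decreasing score, ignoring
    erased entries (np.nan plays the role of the erasure symbol)."""
    y = np.asarray(y, dtype=float)
    alive = np.flatnonzero(~np.isnan(y))
    return alive[np.argsort(-y[alive])]

def example_oracle(m, D_x, noise=None, mechanism=None):
    """One call to Ex(m), Ex(m, E) or Ex(m, E, M) from Definition 1.1."""
    x = D_x()                                    # x ~ D_x
    xi = noise() if noise is not None else 0.0   # xi ~ E (0 in the noiseless case)
    y = m(x) + xi
    if mechanism is not None:                    # incomplete-ranking oracle Ex(m, E, M)
        y = mechanism(x, y)
    return x, rank_of(y)

# --- hypothetical instantiations, for illustration only --------------------
W = rng.uniform(0.0, 1.0, size=(k, 10))          # per-label weights on 10 features

def D_x():                                       # binary features in {0,1}^d
    return rng.integers(0, 2, size=d)

def m(x):                                        # toy sparse score function m : X -> [0,1]^k
    rel = x[:10]                                 # only the first 10 coordinates matter
    return (W @ rel) / W.sum(axis=1)

def E():                                         # bounded zero-mean regression noise
    return rng.uniform(-0.05, 0.05, size=k)

def M(x, y):                                     # erase every label independently w.p. 0.2
    keep = rng.random(k) < 0.8
    return np.where(keep, y, np.nan)

x, sigma = example_oracle(m, D_x, noise=E, mechanism=M)
print(sigma)    # an (incomplete) ranking of the surviving labels
```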
We note that the median problem is defined w.r.t. DR (over complete rankings) but the learner receives examples from O (which may correspond to incomplete rankings). We study the distribution-free nonparametric LR task from either theoretical or experimental viewpoints in three cases: Noiseless Oracle with Complete Rankings. In this setting, we draw samples from Ex(m) (i.e., Ex(m, E) with ξ = 0). For this case, we resolve Problem 1 (under mild assumptions) and provide theoretical guarantees for efficient algorithms that use decision trees and random forests, built greedily based on the CART empirical MSE criterion, to interpolate the correct ranking hypothesis. This class of algorithms is widely used in applied LR but theoretical guarantees were missing. For the analysis, we adopt the labelwise decomposition technique (Cheng et al., 2013), where we generate one decision tree (or random forest) for each position of the ranking. We underline that decision trees and random forests are the state-of-the-art techniques for LR. Contribution 1: We provide the first theoretical performance guarantees for these algorithms for Label Ranking under mild conditions. We believe that our analysis and the identification of these conditions contributes towards a better understanding of the practical success of these algorithms. Noisy Oracle with Complete Rankings. We next replace the noiseless oracle of Problem 1 with Ex(m, E). In this noisy setting, the problem becomes challenging for theoretical analysis; we provide experimental evaluation aiming to quantify how noise affects the capability of decision trees and random forests to interpolate the true hypothesis. Contribution 2: Our experimental evaluation demonstrates that random forests and shallow decisions trees are robust to noise, not only in our noisy setting, but also in standard LR benchmarks. Noisy Oracle with Incomplete Rankings. We consider the oracle Ex(m, E, M) with incomplete rankings. We resolve Problem 2 for the Kendall tau distance, as in previous works (so, we resolve it for the weaker oracles too). Now, the learner is agnostic to the positions of the elements in the incomplete ranking and so labelwise decomposition cannot be applied. Using pairwise decomposition, we compute a ranking predictor that achieves low misclassification error compared to the optimal classifier h and obtain sample complexity bounds for this task. Contribution 3: Building on the seminal results of (Korba et al., 2017; Cl emenc on et al., 2018; Cl emenc on & Korba, 2018; Cl emenc on & Vogel, 2020), we give results for Problem 2 for incomplete rankings under appropriate conditions. 1.2. Related Work LR has received significant attention over the years (Shalev-Shwartz, 2007; H ullermeier et al., 2008; Cheng Label Ranking through Nonparametric Regression & H ullermeier, 2008; Har-Peled et al., 2003), due to the large number of practical applications. There are multiple approaches for tackling this problem (see Vembu & G artner, 2010; Zhou et al., 2014b, and the references therein). Some of them are based on probabilistic models (Cheng & H ullermeier, 2008; Cheng et al., 2010; Grbovic et al., 2012; Zhou et al., 2014a). Others are tree and ensemble based, such as adaption of decision trees (Cheng et al., 2009), entropy based ranking trees and forests (de S a et al., 2017), bagging techniques (Aledo et al., 2017), random forests (Zhou & Qiu, 2018), boosting (Dery & Shmueli, 2020), achieving highly competitive results. 
There are also works focusing on supervised clustering (Grbovic et al., 2013). Decomposition techniques are closely related to our work; they mainly transform the LR problem into simpler problems, e.g., binary or multiclass ones (Hüllermeier et al., 2008; Cheng & Hüllermeier, 2012; Cheng et al., 2013; Cheng & Hüllermeier, 2013; Gurrieri et al., 2014).

Comparison to Previous Work. To the best of our knowledge, there is no previous theoretical work focusing on the computational complexity of LR (a.k.a. Problem 1). However, there are many important works that adopt a statistical viewpoint. Closest to ours are the following seminal works on the statistical analysis of LR: Korba et al. (2017); Clémençon et al. (2018); Clémençon & Vogel (2020). Korba et al. (2017) introduced the statistical framework of consensus ranking (the unsupervised analogue of Problem 2) and identified crucial properties of the underlying distribution that yield fast learning rate bounds for empirical estimators. A crucial contribution of that work (which we also make use of) is the proof that, when Strict Stochastic Transitivity holds, the Kemeny median (the solution of Problem 2 under the KT distance) is unique and has a closed form. Problem 2 was introduced by Clémençon et al. (2018), where the authors provide fast rates (under standard conditions) when the learner observes complete rankings, which reveal the relative order, but not the positions, of the labels in the correct ranking. The work of Clémençon & Vogel (2020) provides a novel multiclass classification approach to Label Ranking, where the learner observes the top label with some noise, i.e., observes only the partial information σ_x^{-1}(1) in the presence of noise, in the form of the random label y assigned to x. Our contribution concerning Problem 2 is a natural follow-up of these works, where the learner observes noisy incomplete rankings (and so has only information about the relative order of the alternatives). Our solution for Problem 2 crucially relies on the conditions and techniques developed in (Korba et al., 2017; Clémençon et al., 2018; Clémençon & Vogel, 2020). In our setting, we have to modify the key conditions in order to handle incomplete rankings. Finally, our labelwise decomposition approach to Problem 1 is closely related to Korba et al. (2018), where many embeddings for ranking data are discussed.

Nonparametric Regression and CART. Regression trees constitute a fundamental approach to nonparametric regression. Our work is closely related to that of Syrgkanis & Zampetakis (2020), which shows that trees and forests, built greedily based on the CART empirical MSE criterion, provably adapt to sparsity in the high-dimensional regime. Specifically, Syrgkanis & Zampetakis (2020) analyze two greedy tree algorithms (both can be found in Appendix D): (a) in the Level-Splits variant, in each level of the tree, the same variable is greedily chosen at all nodes in order to maximize the overall variance reduction; (b) in Breiman's variant, which is the most popular in practice, the choice of the next variable to split on is decided locally at each node of the tree. In general, regression trees (Breiman et al., 1984) and random forests (Breiman, 2001) are among the most widely used estimation methods by ML practitioners (Loh, 2011; Louppe, 2014). For further literature review and preliminaries on decision trees and random forests, we refer to Appendix D.

Multiclass Prediction.
In multiclass prediction with k labels, there are various techniques such as One-versus-All and One-versus-One (see Shalev-Shwartz & Ben-David, 2014). We adopt the OVO approach for Problem 2, where we consider k 2 binary sub-problems (Hastie & Tibshirani, 1998; Moreira & Mayoraz, 1998; Allwein et al., 2000; F urnkranz, 2002; Wu et al., 2004) and we combine the binary predictions. A similar approach was employed for a variant of Problem 2 by Cl emenc on & Vogel (2020). 1.3. Notation For vectors, we use lowercase bold letters x; let xi be the ith coordinate of x. We write poly to denote that the degree of the polynomial depends on the subscripted parameters. Also, e O( ) is used to hide logarithmic factors. We denote the symmetric group over k elements with Sk and S k for incomplete rankings. For i [k], we let σ(i) denote the position of the i-th alternative. As distance metrics we use the Kendall Tau (KT) distance d KT (π, σ) = X i 0 (resp. β < 1), the structure of the problem changes and our theoretical guarantees fail. Interestingly, this is due to the fact that the geometry of the input noise is structurally different from the observed output. The input noise acts additively to the vector m(x), while the output is computed by permuting the elements. Hence, the relation between the observed ranking and the expected one is no more linear and hence one needs to extend the standard additive nonparametric regression setting y = f(x) + ξ to another geometry dealing with rankings σ = Eθ f(x), where f : X Sk is the regression function and Eθ : Sk Sk is a parameterized noise operator (e.g., a Mallows model). This change of geometry is interesting and to the best of our knowledge cannot be captured by existing theoretical results (see Appendix B for the experimental results of different additive nonparametric noise settings). Our experimental results aim to complement and go beyond our theoretical understanding of the capability of the decision trees and random forests to interpolate the correct underlying regression function with the presence of regression noise. To this end, we consider the oracle Ex(m, E) where the noise distribution E satisfies either property (i) or (ii) of Definition 2.2. For the experimental evaluation, two synthetic data set families were used, namely LFN (Large Features Number) and SFN (Small Features Number), consisting of a single noiseless and 50 noisy data sets, respectively. For either data set family, a common m : {0, 1}d [0, 1]k was employed, accordingly. Each noiseless data set was created according to the oracle Ex(m). It consists of 10000 samples (x, σ), where x {0, 1}d (d = 1000 for LFN and d = 100 for SFN) and σ S5 (k = 5), with r = 10 informative binary features per label (sparsity). The noisy data sets were Label Ranking through Nonparametric Regression 0.0 0.1 0.2 0.3 0.4 α [0.0, 1.0] mean kτ [ 1.0, 1.0] random forests (LFN) random forests (SFN) decision trees (LFN) decision trees (SFN) shallow trees (LFN) shallow trees (SFN) 0.75 0.80 0.85 0.90 0.95 1.00 β [ 1.0, 1.0] mean kτ [ 1.0, 1.0] random forests (LFN) random forests (SFN) decision trees (LFN) decision trees (SFN) shallow trees (LFN) shallow trees (SFN) Figure 1. Illustration of the experimental results in terms of mean kτ for different noise distributions E with respect to (a) α-inconsistency; (b) β-kτ gap. produced according to the generative process Ex(m, E), each using a different zero-mean noise distribution E. We implemented modified versions of Algorithm 1. 
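A data-generation sketch in the spirit of the LFN/SFN construction above is shown below; the concrete form of the r-sparse score function and the uniform noise are assumptions made only for illustration, since the exact construction is detailed in Appendix B and in the released code.

```python
import numpy as np

def make_lr_dataset(n=10000, d=1000, k=5, r=10, noise_scale=0.0, seed=0):
    """Synthetic Label Ranking data in the spirit of the LFN/SFN families:
    binary features, an r-sparse score function per label, and rankings
    obtained by sorting the (possibly noisy) scores."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, d))

    # r informative coordinates and random weights for each of the k labels
    relevant = np.stack([rng.choice(d, size=r, replace=False) for _ in range(k)])
    weights = rng.uniform(0.0, 1.0, size=(k, r))

    # m_i(x) in [0, 1]: a weighted average of the r relevant binary features
    scores = np.stack(
        [X[:, relevant[i]] @ weights[i] / weights[i].sum() for i in range(k)],
        axis=1,
    )
    noisy = scores + rng.uniform(-noise_scale, noise_scale, size=scores.shape)
    rankings = np.argsort(-noisy, axis=1)   # labels ordered by decreasing score
    return X, rankings, scores

# An LFN-like (d = 1000) noiseless data set and an SFN-like (d = 100) noisy one.
X_lfn, sigma_lfn, _ = make_lr_dataset(d=1000, noise_scale=0.0)
X_sfn, sigma_sfn, _ = make_lr_dataset(d=100, noise_scale=0.1, seed=1)
```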
The shallow trees ( , ), fully grown decision trees ( , ) and random forests ( , ) were built greedily based on the CART empirical MSE criterion, using the Breiman s method instead of Level Splits. Figure 1 summarizes the experimental results for different values of α [0, 1] and β [ 1, 1]. Results are obtained in terms of mean kτ, using the noisy data as training set and noiseless data as validation set. The y axis of Figure 1 depicts the mean kτ of all the corresponding pairs of the output of our model and the given noiseless ranking in the test sets. At this point we remind the reader that kτ is a normalization of d KT in [ 1, 1] and measures the proportion of the concordant pairs in two rankings, with kτ = 1 meaning that the two ranking have perfect agreement (i.e., the two rankings are exactly the same), while kτ = 1 that the two rankings have perfect disagreement. As expected, decision trees as well as random forests interpolate the m function successfully in the noiseless setting, since it is r-sparse. Moreover, the increase of noise level leads to the decay of the decision trees performance, indicated by the α-inconsistency. However, the β-kτ gap is a more appropriate noise level measure, because it quantifies the degree of deviation rather than the existence of it. The ratio of the performance in terms of mean kτ over β-kτ gap is approximately equal to one, revealing that decision trees fit the noise. On the contrary, shallow trees have better ability to generalize and avoid overfitting. Fully grown honest random forests are also resistant to overfitting due to bagging, and therefore are noise tolerant. This is visible in Figure 1 where the graphs of the random forests and of the shallow trees attain a mean KT coefficient close to 1 for most values of α-inconsistency and β-kτ gap. Appendix B presents the experimental setting in more detail and additional experimental results of LR standard benchmarks and different additive nonparametric noise settings. The code and data sets to reproduce our results are available: https://github.com/ pseleni/LR-nonparametric-regression. Remark 2.3. We have encountered Problem 1 with the oracles Ex(m) and Ex(m, E). A natural question is what happens if we use the incomplete oracle Ex(m, E, M). In this setting, the position of each alternative is not correctly observed (we observe a ranking with size ℓ k, see Section 2.3). Hence, we cannot apply our labelwise method for this oracle. One could obtain similar results for Problem 1 for incomplete rankings using pairwise decomposition, but we leave it for future work. Next we focus on Problem 2. 2.3. Noisy Oracle with Incomplete Rankings We now study Problem 2, where we consider a metric d in Sk, a distribution DR over X Sk that corresponds to the example oracle Ex(m, E) and set up the task of finding a measurable mapping h : X Sk that minimizes the objective R(h) = E(x,σ) DR[d(h(x), σ)]. In this work, we focus on the Kendall tau distance (d = d KT ) and ask how well we can estimate the minimizer of the above population objective if we observe i.i.d. samples from the incomplete rankings oracle Ex(m, E, M). We underline that in what follows whenever we refer to Problem 2, we have d = d KT in mind. A natural question is: What is the optimal solution? In binary classification, the learner aims to estimate the Bayes classifier, since it is known to minimize the misclassification error among all classifiers (Massart & N ed elec, 2006). 
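For reference, with regression function η(x) = Pr[y = +1 | x], the Bayes classifier and the standard excess-risk identity (as restated in Appendix A.3) read:

```latex
% Binary classification with labels {-1,+1} and regression function
% \eta(x) = \Pr[y = +1 \mid x].  Bayes classifier and excess-risk identity,
% applied later to every pairwise sub-problem:
g^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\},
\qquad
L(g) - L(g^*) = 2\,\mathbb{E}_{x \sim D_x}\big[\, |\eta(x) - 1/2| \,\mathbf{1}\{g(x) \neq g^*(x)\} \,\big].
```

The analysis of Problem 2 applies this identity to each pairwise sub-problem.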
Problem 2 deals with rankings and is well-studied in previous works when: d is the Kendall tau distance and Label Ranking through Nonparametric Regression the learner either observes complete rankings (Cl emenc on et al., 2018) or observes only the top element (under some BTL noise) (Cl emenc on & Vogel, 2020). As we will see later, the optimal solution h of Problem 2 is unique under mild conditions on DR, due to Korba et al. (2017). Our goal will be to estimate h from labeled examples generated by Ex(m, E, M). We are going to introduce the example oracle Ex(m, E, M): We are interested in the case where the mechanism M generates incomplete rankings and captures a general spectrum of ways to generate such rankings (e.g., H ullermeier et al., 2008). We begin with the incomplete rankings mechanism M. We assume that there exists a survival probabilities vector q : X [0, 1]k, which is feature-dependent, i.e., the vector q depends on the input x X. Hence, for the example x and alternative i [k], with probability qi(x), we set the score yi equal to a noisy value of the score mi(x) (the alternative i survives) and, otherwise, we set yi = . We mention that the events of observing i and j are not necessarily independent and so do not necessarily occur with probability qi(x)qj(x). We denote the probability of the event Observe both i and j in x by qi,j(x). We modify the argsort : [0, 1]k Sk routine so that it will ignore the symbol in the ranking, e.g., argsort(0.4, , 0.7, , 0.1) = (c a e). Crucially, we remark that another variant would preserve the symbols: this problem is easier since it reveals the correct position of the non-erased alternatives. In our model, the information about the location of each alternative is not preserved. In order to model regression noise, we consider a noise distribution E over the bounded cube [ 1/4, 1/4]k. Hence, we model Ex(m, E, M) as follows: Definition 2.4 (Generative Process for Incomplete Data). Consider an underlying score hypothesis m : X [0, 1]k and let Dx be a distribution over features. Let y : [k] [0, 1] { } and consider the survival probabilities vector q : X [0, 1]k. Each sample (x, σ) Dq R is generated as follows: (i) Draw x X from Dx and ξ [ 1 from E; (ii) draw q(x)-biased coins c { 1, +1}k; (iii) if ci > 0, set yi = mi(x) + ξi, else yi = ; (iv) compute σ = argsort(y), ignoring the symbol. In what follows, we resolve Problem 2 for the example oracle of Definition 2.4 and d = d KT . As in the complete ranking case, Definition 2.4 imposes restrictions neither on the structure of the true score hypothesis m nor on the noise distribution E. In order to resolve Problem 2, we assume the following. Recall that qi,j(x) is the probability of the event Observe both i and j in x . Condition 2. Let pij(x) = Prξ E[mi(x) + ξi > mj(x) + ξj|x] for x Rd. For any 1 i < j k, we assume that the following hold. 1. (Strict Stochastic Transitivity) For any x Rd and any u [k], we have that pij(x) = 1/2 and (piu(x) > 1/2 puj(x) > 1/2) pij(x) > 1/2. 2. (Tsybakov s Noise Condition) There exists a [0, 1] and B > 0 so that the probability that a random feature x Dx satisfies |pij(x) 1/2| < 2t , is at most B ta/(1 a) for all t 0. 3. (Deletion Tolerance) There exists φ (0, 1] so that qi,j(x) φ for any x Rd, where qi,j(x) is the survival probability of the pair i < j in x. Importance of Item 1: Our goal is to find a good estimate for the minimizer h of R(h) = E(x,σ) DR[d KT (h(x), σ)]. 
As observed in previous works (Korba et al., 2017; Cl emenc on et al., 2018), this problem admits a unique solution (with closed form) under mild assumptions on DR and specifically this is assured by Strict Stochastic Transitivity (SST) of the pairwise probabilities pij(x). The first condition guarantees (Korba et al., 2017) that the minimizer of R(h) is almost surely unique and is given, with probability one and for any i [k], by h (x; i) = 1 + X j =i 1{pij(x) < 1/2} . (2) This is the well-known Copeland rule and we note that SST is satisfied by most natural probabilistic ranking models. Importance of Item 2: The Tsybakov s noise condition is standard in binary classification and corresponds to a realistic noise model (Boucheron et al., 2005). In its standard form, this noise condition naturally requires that the regression function of a binary problem η(x) = E[Y |X = x] = 2 Pr[Y = +1|X = x] 1 is close to the the critical value 0 with low probability over the features, i.e., the labels are not completely random for a sufficiently large portion of the feature space. We consider that, for any two alternatives i = j, this condition must be satisfied by the true score function m and the noise distribution E (specifically the functions mi and mj and the random variables ξi, ξj). This condition is very common in binary classification since it guarantees fast rates and has been previously applied to LR (Cl emenc on & Vogel, 2020). Importance of Item 3: The last condition is natural in the sense that we need to observe the pair (i, j) at some frequency in order to achieve some meaningful results. Variants of this condition have already appeared in the incomplete rankings literature (see e.g., Fotakis et al., 2021) and in previous works in Label Ranking (see Cl emenc on & Vogel, 2020). We note that if we relax this condition to state that we only observe the pair i, j only in a portion of Rd (e.g., qij(x) = 0 for 40% of x s), then we will probably miss a crucial part of the structure of the underlying mapping. This intuitively justifies the reason that we need deletion tolerance to hold for any x Rd. Label Ranking through Nonparametric Regression Having described our conditions, we continue with our approach. We first remind the reader that the labelwise perspective we adopted in the complete case now fails. Second, since our data are incomplete, the learner cannot recover the optimal ranking rule h = argminh E[d KT (h(x), σ)] by simply minimizing an empirical version of this objective. To tackle this problem, we adopt a pairwise comparisons approach. In fact, the key idea is the closed form of the optimal ranking rule: From (2), we can write h (x; i) = 1 + P j =i 1{h ij(x) = 1} where h ij is the Bayes optimal classifier of the binary sub-problem of the pair i = j. We provide our result based on the standard One-Versus-One (OVO) approach, reducing this complex problem into multiple binary ones: We reduce the ranking problem into O(k2) binary sub-problems and each subproblem corresponds to a pairwise comparison between the alternatives i and j for any 1 i < j k. We solve each sub-problem separately by obtaining the Empirical Risk Minimizer bhij (for a fixed VC class G, as in the previous works) whose risk is compared to the optimal h ij and then we aggregate the k 2 binary classifiers into a single output hypothesis bh : Rd Sk. We compare the generalization of this empirical estimate bh with the optimal predictor h of (2). We set Lij(g) = E[g(x) = sgn(σ(i) σ(j))|σ {i, j}] where (x, σ) Dq R. 
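The pipeline just described can be sketched as follows. Logistic regression is a hypothetical stand-in for the ERM over a VC class G, and the aggregation counts pairwise wins, which is equivalent to the position-based score of Equation (2) (one plus the number of pairwise losses) up to the sorting direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pairwise_ovo(X, rankings, k):
    """One-versus-one reduction for (possibly incomplete) rankings.

    rankings[n] lists, in order of preference, the labels observed for
    example n (missing labels simply do not appear).  For every pair i < j
    we keep the examples in which both labels survive and record which one
    is preferred.  For brevity we assume each pairwise dataset contains
    both outcomes."""
    X = np.asarray(X)
    classifiers = {}
    for i in range(k):
        for j in range(i + 1, k):
            rows, labels = [], []
            for n, sigma in enumerate(rankings):
                sigma = list(sigma)
                if i in sigma and j in sigma:
                    rows.append(n)
                    # +1 if i appears before (is preferred to) j, else -1
                    labels.append(+1 if sigma.index(i) < sigma.index(j) else -1)
            classifiers[(i, j)] = LogisticRegression().fit(X[rows], labels)
    return classifiers

def predict_ranking(classifiers, X, k):
    """Aggregate the pairwise predictions into a ranking; counting wins and
    sorting decreasingly matches the Copeland-style rule of Eq. (2)."""
    X = np.asarray(X)
    wins = np.zeros((X.shape[0], k))
    for (i, j), clf in classifiers.items():
        pred = clf.predict(X)            # +1 means "i beats j"
        wins[:, i] += (pred == +1)
        wins[:, j] += (pred == -1)
    return np.argsort(-wins, axis=1)     # ties broken arbitrarily by argsort
```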
Our main result in this setting follows. Theorem 2.5 (Noisy and Incomplete LR). Let ϵ, δ (0, 1). Consider a hypothesis class G of binary classifiers with finite VC dimension. Under Condition 2 with parameters a, B, φ, there exists an algorithm (Algorithm 2) that draws a polya(φ ε) max log k samples from Dq R, as in Definition 2.4, and computes an estimate bh : Rd Sk so that Prx Dx[bh(x) = h (x)] is, with probability 1 δ, at most inf g G Li,j(g) L i,j where h is the optimal predictor of (2) and L i,j is the loss of the binary Bayes classifiers h i,j for 1 i < j k, where Ca,B is a constant depending on a, B. Our result for incomplete rankings is a PAC result, in the sense that we guarantee that, when optimizing over a VC class G, the gap between the empirical estimate (the algorithm s output) and the optimal predictor of (2) is at most C OPT + rn(δ), where OPT is (a function of) the gap between the best classifier in the class (argming G L(g)) and the Bayes classifier and rn(δ) is a function which tends to 0 as the number of samples n increases (see (3)). We remark that the algorithm does not come with a computational efficiency guarantee, since the results are based on the computation of the ERM of each pairwise comparison i < j. In general, this is NP-hard but if the binary hypothesis class G is simple then we also obtain computational guarantees. Algorithm 2 Algorithm of Theorem 2.5 1: T n i.i.d. samples (x, σ) Dq R (as in Theorem 2.5) 2: For any i = j, set Tij = 3: for 1 i < j k do 4: if (x, σ) T and σ {i, j} then 5: Add (x, sgn(σ(i) σ(j))) to Tij 6: endif 7: endfor 8: bs Estimate Aggregate(Tij for i < j, G) 9: On input x Rd, output argsort(bs(x)) breaking arbitrarily possible ties. Algorithm 2 works as follows: Given a training set T of the form (x(i), σ(i)) with incomplete rankings, the algorithm creates k 2 datasets Tij with the following criterion: For any i < j, if (x, σ) T and σ {i, j}, the algorithm adds to the dataset Tij the example (x, sgn(σ(i) σ(j))). For any such binary dataset, the algorithm computes the ERM and aggregates the estimates to bs (these routines can be found as Algorithm 6). This aggregate rule is based on the structure of the optimal classifier h (that is valid due to the SST condition). The final estimator is the function bh that, on input x Rd, outputs the ranking bh(x) = argsort(bs(x)) (by breaking ties randomly). Remark 2.6. (i) The oracle Ex(m, E, M) and Definition 2.4 can be adapted to handle partial rankings (see Appendix C). (ii) Theorem 2.4 directly controls the risk gap R(bh) R(h ) Ex[d KT (bh(x), h (x))], since d KT (π, σ) k21{π = σ}. (iii) We studied Problem 2 for Ex(m, E, M). Our results can be transferred to the oracles Ex(m, E) and Ex(m) under Condition 2 with φ = 1. (iv) This result is similar to Cl emenc on & Vogel (2020), where the learner observes (x, y) where y = σ 1 x (1) is the top-label. To adapt our incomplete setting to theirs, we must erase positions instead of alternatives. If eqi is the probability that the i-th position survives, then we have that, in Cl emenc on & Vogel (2020): eq1(x) = 1 and eqi =1(x) = 0 for all x Rd. Also, the generative processes of the two works are different (noisy nonparametric regression vs. Plackett-Luce based models). In general, our results and our analysis for Problem 2 are very closely related and rely on the techniques of Cl emenc on & Vogel (2020). 3. 
Technical Overview Proof Sketch of Theorem 2.1 Our starting point is the work of Syrgkanis & Zampetakis (2020), where they pro- Label Ranking through Nonparametric Regression vide a collection of nonparametric regression algorithms based on decision trees and random forests. We have to provide a vector-valued extension of these results. For decision trees, we show (see Theorem A.3) that if the learner observes i.i.d. samples (x, m(x) + ξ) for some unknown target m satisfying Condition 1 with r, C > 0, then there exists an algorithm ALGO that uses n = e O log(d) poly C,r(k Cr/ε) samples and computes an estimate m(n) which satisfies Ex Dx h m(x) m(n)(x) 2 2 i ε with probability 99%. In our setting, we receive examples from Ex(m) and set h(x) = argsort(m(x)). As described in Algorithm 1, we design the training set T = {x(i), y(i)}, where y(i) = (h(x(i); j))j [k] (we make the ranking h(x) a vector). We provide T as input to ALGO (with ξ = 0) and get a vector-valued estimation m(n) that approximates the vector (h( ; 1), ..., h( ; k)). Our next goal is to convert the estimate m(n) to a ranking by setting bh = argsort m(n). In Theorem A.4, we show that rounding our estimations back to permutations will yield bounds for the expected Spearman distance Ex Dx[d2(h(x),bh(x))]. Proof Sketch of Theorem 2.5 Consider the VC class G consisting of mappings g : Rd { 1, +1}. Let Gi,j = {gi,j : Rd { 1, +1}} be a copy of G for the pair (i, j). We let bgi,j and g i,j be the algorithm s empirical classifier and the Bayes classifier respectively for the pair (i, j). The first key step is that the SST property (which holds thanks to Item 1) implies that the optimal ranking predictor h is unique almost surely and satisfies (2). Thanks to the structure of the optimal solution, we can compute the score estimates bs(x; i) = 1 + P j =i 1{bgi,j(x) = 1} for any i [k] and we will set bh to be argsort bs. We first show that Prx Dx[bh(x) = h (x)] P i 0, under the score generative process of Definition A.2 with underlying score hypothesis m : {0, 1}d [0, 1]k and given i.i.d. data (x, y) D, the following hold: 1. There exists an algorithm (Decision Trees via Level-Splits - Algorithm 3) with set of splits Sn that computes a score estimate m(n) which satisfies Pr (x1,y1),...,(xn,yn) Dn m(x) m(n)(x; Sn) 2 and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse as per Definition A.1 and under the C-submodularity condition (mi and Dx satisfy Condition 3 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+2 (Cr/ε)Cr+2 samples and set the number of splits to be log(t) = Cr Cr+2(log(n) log(log(d/δ))). (b) If, additionally to 1.(a), the marginal over the feature vectors Dx is a Boolean product probability distribution, it suffices to draw n = e O log(dk/δ) 2r k2 (C/ε)2 samples and set the number of splits to be log(t) = r. 2. There exists an algorithm (Decision Trees via Breiman - Algorithm 4) that computes a score estimate m(n) which satisfies Pr (x1,y1),...,(xn,yn) Dn m(x) m(n)(x; Pn) 2 and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse and under the C-approximate diminishing returns condition (mi and Dx satisfy Condition 4 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+3 (Cr/ε)Cr+3 samples and set log(t) Cr Cr+3(log(n) log(log(d/δ))). 
Label Ranking through Nonparametric Regression (b) If, additionally to 2.(a), the distribution Dx is a Boolean product distribution, it suffices to draw n = e O log(dk/δ) k3 C2 2r/ε3 samples and set log(t) r. The running time of the algorithms is poly C,r(d, k, 1/ϵ). Algorithm 3 Level-Splits Algorithm for Score Learning 1: Input: Access to i.i.d. examples of the form (x, y) D. 2: Model: y = m(x) + ξ (Definition A.2) with m : {0, 1}d [0, 1]k. 3: Output: An estimate m(n)( ; Sn) that, with probability 1 δ, satisfies m(x) m(n)(x; Sn) 2 4: Learn Score(ε, δ): 5: Draw n = eΘ(log(dk/δ)(Crk/ε)O(Cr)) samples from D {Under sparsity and Condition 3.} 6: D(n) {x(j), y(j)}j [n] 7: output Learn Score-LS(D(n)) 8: Learn Score-LS (D(n)): 9: Set log(t) = Θ(r log(rk)) 10: Create k datasets Di = {(x(j), y(j) i )}j [n] 11: for i [k] do 12: m(n) i , S(i) n = Level Splits-Algo(0, Di, log(t)) {Call Algorithm 9.} 13: endfor 14: Output m(n)( ; S(1) n , ..., S(k) n ) = (m(n) 1 ( ; S(1) n ), ..., m(n) k ( ; S(k) n )) Algorithm 4 Breiman s Algorithm for Score Learning 1: Input: Access to i.i.d. examples of the form (x, y) D. 2: Model: y = m(x) + ξ (Definition A.2) with m : {0, 1}d [0, 1]k. 3: Output: An estimate m(n)( ; Sn) that, with probability 1 δ, satisfies m(x) m(n)(x; Pn) 2 4: Learn Score (ε, δ): 5: Draw n = eΘ log(dk/δ)(Crk/ε)O(Cr) samples from D {Under sparsity and Condition 4.} 6: D(n) {x(j), y(j)}j [n] 7: output Learn Score-Breiman(D(n)) 8: Learn Score-Breiman (D(n)): 9: Set log(t) = Θ(r log(rk)) 10: Create k datasets Di = {(x(j), y(j) i )}j [n] 11: for i [k] do 12: m(n) i , P (i) n = Breiman-Algo(0, Di, log(t)) {Call Algorithm 10.} 13: endfor 14: Output m(n)( ; P (1) n , ..., P (k) n ) = (m(n) 1 ( ; P (1) n ), ..., m(n) k ( ; P (k) n )) Proof. (of Theorem A.3) Let us set J = [1/4, 3/4] and let m : {0, 1}d Jk be the underlying score vector hypothesis and consider a training set with n samples of the form (x, y) {0, 1}d Jk with law D, generated as in Definition A.2. Label Ranking through Nonparametric Regression We decompose the mapping as m(x) = (m1(x), . . . , mk(x)) and aim to learn each function mi : {0, 1}d J separately. Note that since m is r-sparse, then any mi is r-sparse for any i [k]. We observe that each sample of Definition A.2 can be equivalently generated as follows: 1. x {0, 1}d is drawn from Dx, 2. For each i [k] : (a) Draw ξ [ 1/4, 1/4] from the zero mean distribution marginal Ei. (b) Compute yi = mi(x) + ξ. 3. Output (x, y), where y = (yi)i [k]. In order to estimate the coordinate i [k], i.e., the function mi : {0, 1}d J, we have to make use of the samples (x, yi) {0, 1}d [0, 1]. We have that m(x) m(n)(x; Sn) 2 i [k] E x Dx h (mi(x) m(n) i (x; Sn))2i > ε Pr[ i [k] : Bi] , where we consider the events Bi = E x Dx mi(x) m(n) i (x; Sn i ) 2 > ε/k , for any i [k], whose randomness lies in the random variables used to construct the empirical estimate mn i ( ; Sn i ) = mn i ( ; Sn i , (x(1), y(1) i ), . . . , (x(n), y(n) i )). We note that we have to split the dataset with examples (x, y) into k datasets (x, yi) and execute each sub-routine with parameters (ϵ/k, δ/k). We now turn to the sample complexity guarantees. Let us begin with the Level-Splits Algorithm. Case 1a. If each mi is r-sparse and under the submodularity condition, by Theorem D.2 with f = mi, we have that Pr[Bi] d exp( n/(Cr k/ε)Cr+2) . By the union bound, we have that Pr[ i [k] : Bi] X i [k] Pr[Bi] . 
In order to make this probability at most δ, it suffices to make the probability of the bad event Bi at most δ/k, and, so, it suffices to draw n = e O(log(dk/δ) (Crk/ε)Cr+2) . Case 1b. If each mi is r-sparse and under the submodularity and the independence of features conditions, by Theorem D.2 with f = mi, we have that Pr[Bi] d exp( n/(2r(Ck/ε)2)) . By the union bound and in order to make the probability Pr[ i [k] : Bi] at most δ, it suffices to draw n = e O log(dk/δ) 2r (Ck/ε)2 . Hence, in each one of the above scenarios, we have that m(x) m(n)(x; S1 n, ..., Sk n) 2 We proceed with the Breiman Algorithm. Label Ranking through Nonparametric Regression Case 2a. If each mi is r-sparse and under the approximate diminishing returns condition (Condition 4), by Theorem D.3 with f = mi, we have that Pr[Bi] d exp( n/(Cr k/ε)Cr+3) . By the union bound, we have that Pr[ i [k] : Bi] X i [k] Pr[Bi] . In order to make this probability at most δ, it suffices to make the probability of the bad event Bi at most δ/k, and, so, it suffices to draw n = e O(log(dk/δ) (Cr k/ε)Cr+3) . Case 2b. If each mi is r-sparse and under the approximate diminishing returns condition (Condition 4) and the independence of features conditions, by Theorem D.3 with f = mi, we have that Pr[Bi] d exp( nε3/(k3 C22r)) . By the union bound and in order to make the probability Pr[ i [k] : Bi] at most δ, it suffices to draw n = e O(log(dk/δ) k3C22r/ε3) . Hence, in each one of the above scenarios, we have that m(x) m(n)(x; P 1 n, ..., P k n) 2 A.1.3. MAIN RESULT FOR NOISELESS LR WITH DECISION TREES We are now ready to address Problem 1 for the oracle Ex(m). Our main theorem follows. We comment that Theorem 2.1 corresponds to the upcoming case 1(a). Theorem A.4 (Label Ranking with Decision Trees). Consider the example oracle Ex(m) of Definition 1.1 with underlying score hypothesis m : {0, 1}d [0, 1]k, where k N is the number of labels. Given i.i.d. data (x, σ) DR, the following hold for any ε > 0 and δ > 0: 1. There exists an algorithm (Decision Trees via Level-Splits - Algorithm 5) with set of splits Sn that computes an estimate h(n)( ; Sn) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n)(x; Sn)) i > ϵ k2 δ , and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse as per Definition A.1 and under the C-submodularity condition (mi and Dx satisfy Condition 3 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+2 (Cr/ε)Cr+2 samples and set the number of splits to be log(t) = Cr Cr+2(log(n) log(log(d/δ))). (b) If, additionally to 1.(a), the distribution Dx is a Boolean product distribution, it suffices to draw n = e O log(dk/δ) 2r k2 (C/ε)2 samples and set the number of splits to be log(t) = r. Label Ranking through Nonparametric Regression 2. There exists an algorithm (Decision Tress via Breiman - Algorithm 5) with Pn that computes an estimate h(n)( ; Pn) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n)(x; Pn)) i > ϵ k2 δ , and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse and under the C-approximate diminishing returns condition (mi and Dx satisfy Condition 4 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+3 (Cr/ε)C r+3 samples and set log(t) Cr Cr+3(log(n) log(log(d/δ))). (b) If, additionally to 2.(a), the distribution Dx is a Boolean product distribution, it suffices to draw n = e O log(dk/δ) k3 C2 2r/ε3 samples and set log(t) r. 
The running time of the algorithms is poly C,r(d, k, 1/ϵ). Algorithm 5 Algorithms for Label Ranking with Complete Rankings 1: Input: Access to i.i.d. examples of the form (x, σ) DR. 2: Model: Oracle Ex(m) with m : {0, 1}d [0, 1]k and h(x) = argsort(m(x)). (Definition 1.1) 3: Output: An estimate h(n)( ; Sn). 4: Label Rank(ε, δ): 5: Level-Splits Case 6: Draw n = eΘ(log(dk/δ)(rk/ε)r) samples from DR {Under sparsity and Condition 3.} 7: For any j [n], compute y(j) m C(σ(j)) {See Equation (6).} 8: D(n) {x(j), y(j)}j [n] 9: output argsort(Learn Score-LS(D(n))) {Call Algorithm 3.} 10: Breiman Case 11: Draw n = eΘ (log(dk/δ)(rk/ε)r) samples from D {Under sparsity and Condition 4.} 12: For any j [n], compute y(j) m C(σ(j)) {See Equation (6).} 13: D(n) {x(j), y(j)}j [n] 14: output argsort( Learn Score-Breiman(D(n))) {Call Algorithm 4.} Proof. (of Theorem A.4) We let the mapping m C : Sk [0, 1]k be the canonical representation of a ranking, i.e., for σ Sk, we define m C(σ) = (σ(i)/k)i [k] . (6) We reduce this problem to a score problem: for any sample (x, σ) = (x, argsort(m(x))) DR, we create the tuple (x, y ) = (x, m C(σ)), where m C is the canonical representation of the ranking σ. Hence, any permutation of length k is mapped to a vector whose entries are integer multiples of 1/k. Let us fix x {0, 1}d. So, the tuple (x, y ) falls under the setting of the score variant of the regression setting of Definition D.1 with regression function equal to m (which is equal to m C argsort m) and noise vector ξ = 0, i.e., y = m (x) + ξ . Recall that our goal is to use the transformed samples (x, y ) in order to estimate the true label ranking mapping h : X Sk. Let us set h (n) be the label ranking estimate using n samples. We will show that h (n) = argsort(m (n)) is close to h = argsort(m ) in Spearman s distance, where m (n) is the estimation of m using Theorem A.3. We have that m (x) m (n)(x; S1 n, ..., Sk n) 2 Label Ranking through Nonparametric Regression with high probability using the vector-valued tools developed in Theorem A.3. By choosing an appropriate method, we obtain each one of the items 1(a), 1(b), 2(a) and 2(b) (each sample complexity result is in full correspondence with Theorem A.3). Hence, our estimate is, by definition, close to m C argsort m, i.e., m C(argsort(m(x))) m (n)(x; S1 n, ..., Sk n) 2 thanks to the structure of the samples (x, y ). We can convert our estimate m (n) to a ranking by setting h = argsort(m (n)). For any i [k], let us set mi and bmi for the true and the estimation quantities for simplicity; intuitively (without the expectation operator), a gap of order ϵ/k to the estimate (mi bmi)2 yields a bound |mi bmi| p ε/k. Recall that mi = σ(i)/k and so this implies that, in integer scaling, |σ(i) k bmi| O( ε k). We now have to compute bσ(i), that is the rounded value of k bmi. When turning the values k bmi into a ranking, the distortion of the i-th element from the correct value σ(i) is at most the number of indices j = i that lie inside the estimation radius. So, any term of the Spearman distance is on expectation of order O(ε k). This is due to the fact that E[|mi k bmi|] p E[(mi k bmi)2] = O( ε k). To conclude, we get that Ex Dx d2(argsort(m(x))), h (x)) O(ϵ) k2 . A.2. Noiseless Oracle with Complete Rankings and Random Forests (Level Splits & Breiman) In this section, we provide similar algorithmic results for Fully Grown Honest Forests based on the Level-Splits and the Breiman s criteria. A.2.1. 
DEFINITION OF PROPERTIES FOR RANDOM FORESTS For algorithms that use Random Forests via Level Splits, we need the following condition. Condition 5 (Strong Sparsity). A target function f : {0, 1}d [ 1/2, 1/2] is (β, r)-strongly sparse if f is r-sparse with relevant features R (see Definition A.1) and the function e V (see Equation (4)) satisfies e V (T {j}) e V (T) + β e V (T {i}) e V (T) , for all i R, j [d] \ R and T [d] \ {i}. Moreover, a vector-valued function m : {0, 1}d [ 1/2, 1/2]k is (β, r)-strongly sparse if each mj is (β, r)-strongly sparse for any j [k]. For algorithms that use Random Forests via Breiman s criterion, we need the following condition. Condition 6 (Marginal Density Lower Bound). We say that the density Dx is (ζ, q)-lower bounded if, for every set Q [d] with size |Q| = q, for every w {0, 1}q, it holds that Pr x Dx[x Q = w] ζ/2q . A.2.2. MAIN RESULT FOR NOISELESS LR WITH RANDOM FORESTS Our theorem both for Score Learning and Label Ranking for Random Forests with Level-Splits follows. Theorem A.5 (Label Ranking with Fully Grown Honest Forests via Level-Splits). Let ε, δ > 0. Let H > 0. Under Definition A.2 with underlying score hypothesis m : {0, 1}d [0, 1]k, where k N is the number of labels and given access to i.i.d. data (x, σ) DR, the following hold. For any i [k], we have that: let m(n,s) i be the forest estimator for alternative i that is built with sub-sampling of size s from the training set and where every tree mi(x, Ds) is built using Algorithm 9, with inputs: log(t) large enough so that every leaf has two or three samples and h = 1. Under the strong sparsity condition (see Condition 5) for any i [k], if R is the set of relevant features and for every w {0, 1}r, it holds for the marginal probability that Prz Dx(z R = w) / (0, ζ/2r) and if s = eΘ(2r (log(dk/δ)/β2 + log(k/δ)/ζ)), then it holds that Pr (x1,σ1),...,(xn,σn) Dn R m(x) m(n,s)(x) 2 using a training set of size n = e O 2rk log(k/δ) ζ . Moreover, under the generative process of Ex(m) of Definition 1.1, it holds that there exists a poly(d, k, 1/ε)-time algorithm with the same sample complexity that computes an estimate h(n,s) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n,s)(x)) i > ϵ k2 δ . Label Ranking through Nonparametric Regression Proof. (of Theorem A.5) Recall that each sample of Definition A.2 can be equivalently generated as follows: 1. x {0, 1}d is drawn from Dx, 2. For each i [k] : (a) Draw ξ [ 1/4, 1/4] from the zero mean distribution marginal Ei. (b) Compute yi = mi(x) + ξ. 3. Output (x, y), where y = (yi)i [k]. In order to estimate the coordinate i [k], i.e., the function mi : {0, 1}d [1/4, 3/4], we have to make use of the samples (x, yi) {0, 1}d [0, 1]. We have that m(x) m(n,s)(x) 2 i [k] E x Dx h (mi(x) m(n,s) i (x))2i > ε Pr[ i [k] : Bi] , where we consider the events Bi = E x Dx mi(x) m(n,s) i (x) 2 > ε/k , for any i [k], whose randomness lies in the random variables used to construct the empirical estimate m(n,s) i = m(n,s) i ( ; (x(1), y(1) i ), . . . , (x(n), y(n) i )), where m(n,s) i is the forest estimator for the alternative i using subsampling of size s. We are ready to apply the result of Syrgkanis & Zampetakis (2020) for random forest with Level-Splits (Theorem 3.4) with accuracy ε/k and confidence δ/k. Fix i [k]. 
Let m(n,s) i be the forest estimator that is built with sub-sampling of size s from the training set and where every tree mi(x, Ds) is built using Algorithm 9, with inputs: log(t) large enough so that every leaf has two or three samples and h = 1. Under the strong sparsity condition for mi (see Condition 5), if R is the set of relevant features and for every w {0, 1}r, it holds for the marginal probability that Prz Dx(z R = w) / (0, ζ/2r) and if s = eΘ(2r (log(dk/δ)/β2 + log(k/δ)/ζ)), then it holds that mi(x) m(n,s) i (x) 2 ε/k δ/k , using a training set of size n = e O 2rk log(k/δ) ζ . Aggregating the k random forests, we get the desired result using the union bound. The Spearman s distance result follows using the canonical vector representation, as in Theorem A.4. The result for the Breiman s criterion is the following. Theorem A.6 (Label Ranking with Fully Grown Honest Forests via Breiman). Let ε, δ > 0. Under Definition A.2 with underlying score hypothesis m : {0, 1}d [0, 1]k, where k N is the number of labels and given access to i.i.d. data (x, σ) DR, the following hold. Suppose that Dx is (ζ, r)-lower bounded (see Condition 6). For any i [k], let m(n,s) i be the forest estimator for the i-th alternative that is built with sub-sampling of size s from the training set and where every tree mi(x, Ds) is built using the Algorithm 10, with inputs: log(t) large enough so that every leaf has two or three samples, training set Ds and h = 1. Then, using s = eΘ( 2r log(dk/δ) ζβ2 ) and under Condition 5 for any i [k], we have that Pr (x1,σ1),...,(xn,σn) Dn R m(x) m(n,s)(x) 2 using n = e O 2rk log(dk/δ) εζβ2 . Moreover, under the generative process of Ex(m) of Definition 1.1, it holds that there exists a poly(d, k, 1/ε)-time algorithm with the same sample complexity that computes an estimate h(n,s) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n,s)(x)) i > ϵ k2 δ . Label Ranking through Nonparametric Regression Proof. (of Theorem A.6) The proof is similar as in Theorem A.5 (by modifying the sample complexity and the value of subsampling s) and is omitted. A.3. Noisy Oracle with Incomplete Rankings In this section, we study the Label Ranking problem with incomplete rankings and focus on Problem 2. In Definition 2.4, we describe how a noisy score vector y and its associated ranking argsort(y) is generated. In order to resolve Problem 2, we consider an One-Versus-One (OVO) approach. In fact, we consider a VC class G of binary class and our goal is to use the incomplete observations and output a collection of k 2 classifiers from G so that, for a testing example x Dx with x Rd, the estimated ranking bσx, based on our selected hypotheses from G, will be close to the optimal one with high probability. We propose the following algorithm (Algorithm 7). Algorithm 6 Algorithm for Estimation and Aggregation for a VC class 1: Input: A collection of training sets Di,j for 1 i < j k, VC class G. 2: Output: An estimate bs : X Nk for the optimal score vector s . 3: Estimate Aggregate(Di,j for all i < j, G): 4: for 1 i < j k do 5: Find bgi,j = argming G 1 |Di,j| P (x,y) Di,j 1{g(x) = y} 6: endfor 7: for 1 i k do 8: bs(x; i) = 1 + P j =i 1{bgi,j(x) = 1} {Due to the Strict Stochastic Transitivity property (see Condition 2).} 9: endfor 10: Break ties randomly 11: output bs( ) = (bs( ; 1), ..., bs( ; k)) Algorithm 7 Algorithm for Label Ranking with Incomplete Rankings 1: Input: Sample access to i.i.d. examples of the form (x, σ) Dq R, VC class G. 
2: Model: Incomplete rankings are generated as in Definition 2.4. 3: Output: An estimate bσ : Rd Sk of the optimal classifier σ that satisfies Pr x Dx[bσ(x) = σ (x)] Ca,B φ2 OPT(G) + ε . 4: Label Rank Incomplete(ε, δ): 5: Set n = eΘ k 4(1 a) a max{log(k/δ), VC(G))}/polya(φ ε) {See Theorem A.7.} 6: Draw a training set D of n independent samples from Dq R 7: For any i = j, set Di,j = 8: for 1 i < j k do 9: if (x, σ) D and σ {i, j} then 10: Add (x, sgn(σ(i) σ(j))) to Di,j 11: endif 12: endfor 13: Training Phase: bs Estimate Aggregate(Di,j for all i < j, G) {See Algorithm 6.} 14: Testing Phase: On input x Rd, output argsort(bs(x)) In order to resolve Problem 2 under Condition 2, we will make use of the Kemeny embedding and the OVO approach. Let D be the training set with labeled examples of the form (x, σ) Dq R, where σ corresponds to an incomplete ranking generated as in Definition 2.4. Our algorithm proceeds as follows: 1. As a first step, for any pair of alternatives i < j with i, j [k], we create a dataset Di,j = . Label Ranking through Nonparametric Regression 2. For any i < j and for any feature x D whose incomplete ranking σ contains both i and j, we add in the dataset Di,j the example (x, y) := (x, sgn(σ(i) σ(j))). 3. For any i < j, we compute the ERM solution (see Algorithm 6) to the binary classification problem bgi,j = argming G b Li,j(g) where b Li,j(g) = 1 |Di,j| (x,y) Di,j 1{g(x) = y} . 4. We aggregate the binary classifiers (see Algorithm 6) using the score function: bs(x; i) = 1 + X j =i 1{bgi,j(x) = 1} . The structure of this score function comes from the SST property. 5. Break the possible ties randomly and output the prediction argsort(bs(x)). Let us consider a binary classification problem with labels 1, +1. Let the regression function be η(x) = Pr(x,y)[y = +1|x] and define the mapping g (x) = 1{η(x) 1/2}. If the distribution over (x, y) were known, the problem of finding an optimal classifier would be solved by simply outputting the Bayes classifier g , since it is known to minimize the misclassication probability Pr(x,y)[y = g(x)] over the collection of all classifiers. In particular, for any g G, it holds that L(g) L(g ) = 2 Ex Dx [|η(x) 1/2| 1{g(x) = g (x)}] . In Problem 2, our goal is to estimate the solution of the ranking median regression problem with respect to the KT distance σ = argmin h:X Sk E (x,σ) DR[d KT (h(x), σ)] . When the probabilities pij(x) = Pr[σ(i) > σ(j)|x] satisfy the SST property, the solution is unique almost surely and has a closed form (see (2)). Hence, we can use estimate the O(k2) binary optimal classifiers in order to estimate it. This is exactly what we will do in Algorithm 7. Our main result is the following. Theorem A.7 (Label Ranking with Incomplete Rankings). Let ε, δ (0, 1) and assume that Condition 2 holds, i.e., the Stochastic Transitivity property holds, the Tsybakov s noise condition holds with a (0, 1), B > 0 and the deletion tolerance condition holds for the survival probability vector with parameter φ (0, 1). Set Ca,B = B1 a/((1 a)1 aaa) and consider a hypothesis class G of binary classifiers with finite VC dimension. 
There exists an algorithm (Algorithm 7) that computes an estimate bσ : Rd Sk so that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ, where σ : Rd Sk is the mapping (see Equation (2)) induced by the aggregation of the k 2 Bayes classifiers g i,j with loss L i,j, using n independent samples from Dq R (see Definition 2.4), with Ca,B φ4 2a k 2 log(k/δ), VC(G) log In Table 1, we present our sample complexity results (concerning Theorem 2.5 (and Theorem A.7)) for various natural candidate VC classes, including halfspaces and neural networks. We let a b := max{a, b}. Label Ranking through Nonparametric Regression Table 1. The table depicts the sample complexity for Problem 2 and Theorem A.7 for various concept classes. In the sample complexity column, we set N0 = polya,B k φ ε . The VC dimension bounds for halfspaces and axis-aligned rectangles can be found in Shalev Shwartz & Ben-David (2014) and the VC dimension of L2-balls can be found in Dudley (1979). For the Neural Networks cases, M and N are the number of parameters and of neurons respectively and the corresponding VC dimension bounds are from Baum & Haussler (1989) and Karpinski & Macintyre (1997). CONCEPT CLASS VC DIMENSION SAMPLE COMPLEXITY HALFSPACES IN Rd d + 1 N0 O(log(k/δ) d log(d)) AXIS-ALIGNED RECTANGLES IN Rd 2d N0 O(log(k/δ) d log(d)) L2-BALLS IN Rd d + 1 N0 O(log(k/δ) d log(d)) NN WITH SIGMOID ACTIVATION O(M 2N 2) N0 O(log(k/δ) M 2N 2 log(M N)) NN WITH SIGN ACTIVATION O(M log(M)) N0 O(log(k/δ) M log2(M)) Remark A.8. We remark that, in the above results for the noisy nonparametric regression, we only focused on the sample complexity of our learning algorithms. Crucially, the runtime depends on the complexity of the Empirical Risk Minimizer and this depends on the selected VC class. Hence, the choice of the VC class involves a trade-off between computational complexity and expressivity/flexibility. We continue with the proof. In order to obtain fast learning rates for general function classes, the well-known Talagrand s inequality (see Fact 1 and Boucheron et al., 2005) is used combined with an upper bound on the variance of the loss (which is given by the noise condition) and convergence bounds on Rademacher averages (see e.g., Bartlett et al., 2005). Proof. (of Theorem A.7) We decompose the proof into a series of claims. Consider the binary hypothesis class G consisting of mappings g : Rd { 1, +1} of finite VC dimension. We let g be the Bayes classifier. We consider k 2 copies of this class, one for each unordered pair (i, j) and let Gi,j = {gi,j : Rd { 1, +1}} be the corresponding class. We let bgi,j and g i,j be the algorithm s empirical classifier and the Bayes classifier respectively for the pair (i, j). Claim 1. It holds that Pr x Dx[bσ(x) = σ (x)] X i 0 and i = j with i, j [k]. Assume that the Tsybakov condition (Condition 2.ii) holds with parameters a, B for the pair (i, j) and that the survival probabilities vector q satisfies Condition 2.iii with parameter φ (0, 1). Set Ca,B = B1 a (1 a)1 aaa . Then, for a training set Tn with elements (x, y) with y = sgn(σ(i) σ(j)) where (x, σ) Dq R conditioned that σ {i, j}, it holds that Li,j(bgi,j) Li,j(g ) 2 inf g G Li,j(g) Li,j(g ) + rn(δ) , with probability at least 1 δ, where bgi,j = argming G b Li,j(g; Tn) and g is the Bayes classifier and rn(δ) = max 1 2 a (C VC(G) log(n)) 1 2 a + (32 log(2/δ)) 1 2 a , C log(2/δ) where C, C are constants. 
Moreover, if the size of the training set Tn is at least n n PC := max 1 2C2 VC(G) , log(2/δ) 62 aφ1 a we have that rn(δ) = 2 16Ca,B 1 2 a (C VC(G) log(n)) 1 2 a + (32 log(2/δ)) Proof. Let us fix i = j. Consider the binary class G = Gi,j and let us set L i,j = Li,j(g ), where g is the Bayes classifier. Let us set yi,j = sgn(σ(i) σ(j)) { 1, +1}. Consider the loss function for the classifier g G: ci,j(g; x, σ) = 1{g(x) = yi,j σ {i, j}} . We introduce the class of loss functions Fi,j associated with Gi,j, where Fi,j = (x, σ) 7 1{σ {i, j}} (ci,j(g; x, σ) 1{g i,j(x) = yi,j}) : g Gi,j . Let F i,j be the star-hull of Fi,j with F i,j = {a f : a [0, 1], f Fi,j}. For any f Fi,j, we introduce the function T which controls the variance of the function f as follows: First, we have that Var(f) Pr x Dx[g(x) = g i,j(x)] Ca,B φ (Li,j(g) L i,j)a . We remark that the first inequality follows from the binary structure of f and the second inequality follows from the fact that the Tsybakov s noise condition implies (see Fact 2) that Pr (x,σ)[g(x) = g i,j(x)|σ {i, j}] Ca,B(Li,j(g) L i,j)a , and since Pr[σ {i, j}|x] > φ for all i = j and x Rd. Next, we can write that φ (Li,j(g) L i,j)a = Ca,B φ Pr[σ {i, j}]a (E f)a =: T 2(f) . Label Ranking through Nonparametric Regression We will make use of the following result of Boucheron et al. (2005). In order to state this result, we have to define the functions ψ and w, (we also refer the reader to Boucheron et al. (2005) for further intuition on the definition of these crucial functions). We set ψ(r) = E Rn{f F i,j : T(f) r} , where Rn is the Rademacher average2 of a subset of the star-hull of the loss class Fi,j whose variance is controlled by r2 and w(r) = sup f F i,j:E f r T(f) , which captures the largest variance (i.e., the value p Var(f)) among all loss functions f in the star hull of the loss class with bounded expectation. Theorem A.9 (Theorem 5.8 in (Boucheron et al., 2005)). Consider the class G of classifiers g : X { 1, +1}. For any δ > 0, let r (δ) denote the solution of r = 4ψ(w(r)) + 2w(r) n + 16 log(2/δ) and ε the positive solution of the equation r = ψ(w(r)). Then, for any θ > 0, with probability at least 1 δ, the empirical risk minimizer gn satisfies L(gn) inf g G L(g) θ inf g G L(g) L(g ) + (1 + θ)2 In our binary setting with i = j, the risk Li,j is conditioned on the event σ {i, j}. Hence, with probability at least 1 δ, we get that Li,j(gn) inf g G Li,j(g) θ inf g G Li,j(g) Li,j(g ) + (1 + θ)2 4θ r (δ) Pr[σ {i, j}] , where G = Gi,j. We set θ = 1 and, by adding and subtracting the Bayes error L i,j = Li,j(g ) in the left hand side, we obtain Li,j(bgi,j) L i,j 2 inf g G Li,j(g) L i,j + r (δ) Pr[σ {i, j}] . The result follows by using the fact that Pr[σ {i, j}] = Ω(φ) and by the analysis of Cl emenc on & Vogel (2020) for r (δ) (see the proof of Lemma 14 by Cl emenc on & Vogel (2020) and replace ε by φ). Finally, we set rn(δ) = r (δ)/ Pr[σ {i, j}]. Crucially, observe that the above result does not depend on k, since it focuses on the pairwise comparison i, j. Claim 3. For Ca,B and φ as defined in Claim 2, it holds that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ. Proof. Recall that (x, σ) Dq R and fix the training examples given that σ {i, j}. The Tsybakov s noise condition over the marginal over x given that σ {i, j} implies (see Fact 2) that Pr (x,σ)[gi,j(x) = g i,j(x)|σ {i, j}] Ca,B(Li,j(g) L i,j)a . 
2The Rademacher average of a set A is Rn(A) := 1 n E supa A | Pn i=1 σiai|, where σ1, ..., σn are independent Rademacher random variables. Label Ranking through Nonparametric Regression We have that Pr x Dx[gi,j(x) = g i,j(x)|σ {i, j}] = Z Rd 1{gi,j(x) = g i,j(x)}Dx(x|σ {i, j})dx = Dx(x|σ {i, j}) Dx(x) 1{gi,j(x) = g i,j(x)} , where Dx( |σ {i, j}) is the conditional distribution of x given that the label permutation contains both i and j. Then, using the deletion tolerance property, we can obtain that Dx(x|σ {i, j}) = Ω(φ Dx(x)). So, we have that Pr x [bgi,j(x) = g i,j(x)] Ca,B φ (Li,j(bgi,j) L i,j)a . Using Claim 2 for the pair i = j and Minkowski inequality, we have that Pr x [bgi,j(x) = g i,j(x)] Ca,B 2 inf g G Li,j(g) L i,j a + ra n(δ) . Hence, via the union bound, we have that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ. Claim 4. Let Ca,B and φ as defined in Claim 2. For any ε > 0, it suffices to draw samples from Dq R in order to get Ca,B Proof. Note that each pair (i, j) requires a training set of size at least n PC(δ) in order to make the failure probability at most δ/ k 2 . These samples correspond the rankings σ so that σ {i, j}. Hence, the sample complexity of the problem is equal to the number of (incomplete) permutations drawn from Dq R in order to obtain the desired number of pairwise comparisons. With high probability, the sample complexity is n PC k 2 mini =j Pr[σ {i, j}] since the random variable that corresponds to the number of pairwise comparisons provided by each sample (x, σ) Dq R is at least φ k 2 , with high probability. The desired sample complexity bound requires a number of pairwise comparisons n PC so that Ca,B The number of pairwise comparisons should be εφ Ca,B k 2 Let us set m = n PC. It remains to control the function rm. Recall that (see Claim 2) rm(δ) = max 1 2 a (C VC(G) log(m)) 1 2 a + (32 log(2/δ)) 1 2 a , C log(2/δ) Label Ranking through Nonparametric Regression If the second term is larger in the above maximum operator, we get that c1 log(k/δ) εφ Ca,B k 2 On the other side, we get that c1Ca,B VC(G) log(m) 1 2 a + c2Ca,B log(k/δ) εφ Ca,B k 2 and so we should take the maximum between the terms Ca,B log(k/δ) Let m = ey and set ye y = 1/M3. So, we have that ye y = 1/M3 and hence y = W( 1/M3), where W is the Lambert W function. Let N3 be the value of m that corresponds to the Lambert equation, i.e., log(m) = W( 1/M3) and so N3 e W ( 1/M3) with 1/M3 [0, 1/e]. Hence, the number of samples that suffice to draw from Dq R is n 1 φ k 2 max{N1, N2, N3} = 1 φ k 2 max{N2, N3} . We remark that, in the third case, it suffices to take m N3 2M3 log(M3) = eΩ since 2M log(M) M log(2M log(M)) for all M sufficiently large. These claims complete the proof. B. Additional Experimental Results for Noisy Oracle with Complete Rankings Experimental Setting and Evaluation Metrics. We follow the setting that was proposed by Cheng & H ullermeier (2008) (it has been used for the empirical evaluation of LR since then). For each data set, we run five repetitions of a ten-fold cross-validation process. Each data set is divided randomly into ten folds five times. For every division, we repeat the following process: every fold is used exactly one time as the validation set, while the other nine are used as the training set (i.e., ten iterations for every repetition of the ten-fold cross-validation process) (see James et al., 2013, p.181). For every test instance, we compute the Kendall tau coefficient between the output and the given ranking. 
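As a concrete illustration of this per-instance evaluation and its aggregation within a fold, a minimal sketch is the following; it assumes scipy's kendalltau as the coefficient implementation, and the array layout of the rankings is illustrative:

import numpy as np
from scipy.stats import kendalltau

def fold_mean_kendall_tau(predicted_rankings, true_rankings):
    # predicted_rankings, true_rankings: arrays of shape (n_test, k), each row
    # a permutation of {0, ..., k-1} describing the ranking of the k labels.
    taus = []
    for p, t in zip(predicted_rankings, true_rankings):
        tau, _ = kendalltau(p, t)   # Kendall tau coefficient of one test instance
        taus.append(tau)
    return float(np.mean(taus)), float(np.std(taus))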
In every iteration, we compute the mean Kendall tau coefficient over all the test instances. Finally, we compute the mean and standard deviation of the per-iteration aggregated results. This setting is used for the evaluation of both our synthetic data sets and the standard LR benchmarks.

Algorithms' Implementation. The algorithms were implemented in Python. We decided to use the version of our algorithms with Breiman's criterion; therefore, there was no need to implement decision trees and random forests from scratch, and we used the scikit-learn implementations. The code for reproducing our results can be found at: https://github.com/pseleni/LR-nonparametric-regression.

Data sets. The code for the creation of the synthetic benchmarks, the synthetic benchmarks themselves, and the standard LR benchmarks used in the experimental evaluation can be found at: https://github.com/pseleni/LR-nonparametric-regression.

Experimental Results for Different Noise Settings. In our noisy nonparametric regression setting, the input noise acts additively on the vector m(x) for an input x, and the output ranking is obtained by sorting the perturbed scores. We aim to understand whether the performance of the proposed algorithms differs when the noise is applied directly to the output ranking, through a parameterized noise operator, instead of being added to the vector m(x). Hence, we resort to a popular distance-based probability model introduced by Mallows (Mallows, 1957). The standard Mallows model M(σ_0, θ) is a two-parameter model supported on S_k, with density at a permutation σ equal to

Pr_{σ ∼ M(σ_0, θ)}[σ | θ, σ_0] = e^{−θ D(σ, σ_0)} / Z_k(θ, σ_0) .

The ranking σ_0 is the central ranking (location parameter) and θ > 0 is the spread (or dispersion) parameter. In this dispersion regime, the ranking σ_0 is the mode of the distribution; the probability of any other permutation decays exponentially as its distance from the central permutation increases, and the spread parameter controls how fast this decay occurs. Finally, Z_k(θ, σ_0) := Σ_{σ ∈ S_k} exp(−θ D(σ, σ_0)) is the partition function of the Mallows distribution. In what follows, we work with the KT distance (when D = d_KT, the partition function depends only on the dispersion θ and on k, and not on the central ranking, i.e., Z_k = Z_k(θ)).

We once again use the noiseless data sets of the two families, namely LFN and SFN. We create 20 noisy data sets for each family using the Mallows model as a noise operator (we refer to these data sets as MN - Mallows Noisy). For each noiseless sample, we generate the corresponding output ranking by drawing a permutation from a Mallows distribution with mode equal to the noiseless ranking and dispersion parameter θ ∈ [0.6, 4.4] (we use a different dispersion parameter for each MN data set, taking θ ∈ [0.6, 4.4] with step 0.2). We also generate 20 additional noisy data sets for each family (following the generative process Ex(m, E)) with different zero-mean Gaussian noise distributions (we refer to these data sets as GN - Gaussian Noisy), matching the α-inconsistency property of the MN data sets. Since the corresponding MN and GN data sets do not also share the β-kτ property, the results are not directly comparable. Therefore, we compare the MN and GN data sets, for each algorithm and each data set family separately, with respect to α-inconsistency, using the β-kτ gap as a reference point. The results are again obtained in terms of mean kτ from five repetitions of a ten-fold cross-validation, using the noisy output as the training data and the noiseless output as the validation data, and are visualized in Figure 2.
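Concretely, the two noise operators can be sketched as follows. This is a minimal sketch: the sampler uses the standard repeated-insertion scheme for the Mallows model under the KT distance, and the function names, the descending-argsort convention and the choice of the Gaussian scale are illustrative rather than taken from our repository.

import numpy as np
from scipy.stats import truncnorm

def sample_mallows(center, theta, rng):
    # Repeated-insertion sampling for the Mallows model under the KT distance:
    # item i (i = 1, ..., k) is inserted at 0-indexed position j of the current
    # list with probability proportional to exp(-theta * (i - 1 - j)), i.e.,
    # exponentially downweighted by the number of new inversions created.
    k = len(center)
    order = []
    for i in range(1, k + 1):
        inversions = i - 1 - np.arange(i)            # inversions added by each insertion slot
        weights = np.exp(-theta * inversions)
        j = rng.choice(i, p=weights / weights.sum())
        order.insert(j, i - 1)
    # 'order' is a sample centered at the identity; relabel it by the given center.
    return np.array([center[r] for r in order])

def mallows_operator(m_x, theta, rng):
    # Mallows noise operator: draw a ranking centered at the noiseless ranking
    # argsort(m(x)) (best-scoring label first).
    center = np.argsort(-np.asarray(m_x))
    return sample_mallows(center, theta, rng)

def gaussian_operator(m_x, scale, rng):
    # Gaussian noise operator (process Ex(m, E)): add zero-mean Gaussian noise
    # truncated to [-1/4, 1/4] to every coordinate of m(x), then sort.
    m_x = np.asarray(m_x)
    xi = truncnorm.rvs(-0.25 / scale, 0.25 / scale, loc=0.0, scale=scale,
                       size=m_x.shape, random_state=rng)
    return np.argsort(-(m_x + xi))

rng = np.random.default_rng(0)
sigma_mn = mallows_operator([0.2, 0.4, 0.1, 0.3, 0.5], theta=1.0, rng=rng)
sigma_gn = gaussian_operator([0.2, 0.4, 0.1, 0.3, 0.5], scale=0.1, rng=rng)

In this sketch, the width of the truncated Gaussian is the knob one would adjust so that a GN data set matches the α-inconsistency level of the corresponding MN data set.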
In Figure 2, the mean kτ values are shown as solid lines, the shaded areas correspond to the standard deviation, and the dotted lines indicate the β-kτ gap. We reach similar conclusions regarding the noise tolerance of each algorithm: e.g., shallow trees and fully grown random forests are noise tolerant, while the performance of fully grown decision trees decays almost linearly. What is really interesting is that, despite the β-kτ gap being higher in the MN data sets, i.e., the mean error in the input being smaller, the performance of the algorithms is worse than on the GN data sets. This reveals that neither α-inconsistency nor the β-kτ gap can be strictly correlated with the ability of the models to interpolate the correct underlying function.

Some comments are in order. The change in the geometry of the input noise is not the only thing to blame for the performance drop; it has more to do with the generative process used for the creation of the data sets. Specifically, in the GN data sets, for every x, k samples were drawn from a zero-mean Gaussian distribution truncated to [−0.25, 0.25] (i.e., ξ ∈ [−1/4, 1/4]^k) and subsequently resulted in a permutation of the elements. The permutation of the elements depends on the noiseless regression values and on the additive noise of each label. This results in irregular changes, which, given a large number of samples, cancel each other out. On the other hand, the MN data sets were created by drawing a sample from a Mallows distribution, which means that, for each ranking, certain permutations are more probable than others. The noise is no longer unbiased, and it is highly likely that the same ranking almost constantly appears as the output for a given center permutation. This could also be linked to a strategic addition of noise. With these in mind, we can safely conclude that the noise tolerance of our models is directly affected not only by the error in terms of permutation distance but also, quite naturally, by the frequency of repetitive errors, i.e., the frequency with which the same erroneous noisy output ranking arises from a given noiseless output ranking, as we have to observe the correct permutation at some frequency to achieve meaningful results.

[Figure 2: six panels plotting mean kτ ∈ [−1.0, 1.0] against α-inconsistency ∈ [0.0, 1.0] for the Mallows and Gaussian operators, together with their β reference curves: (a) SFN - fully grown random forests, (b) LFN - fully grown random forests, (c) SFN - fully grown decision trees, (d) LFN - fully grown decision trees, (e) SFN - shallow decision trees, (f) LFN - shallow decision trees.]

Figure 2. Illustration of the experimental results in terms of mean kτ for different noise operators E with respect to α-inconsistency and with the β-kτ gap as a reference point. The Gaussian operator gives a ranking σ = argsort(m(x) + ξ), with ξ ∈ [−1/4, 1/4]^k a zero-mean truncated Gaussian random variable. The Mallows operator gives a permutation σ ∼ M(argsort(m(x)), θ), where θ ∈ [0.6, 4.4].
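For reference, a compact scikit-learn sketch of the tree-based rankers compared above (and evaluated on the benchmarks below) is given here; the Borda-style conversion of training rankings into score vectors, the class name and the shallow-depth choice are assumptions of this illustration and not a description of our exact implementation.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

def rankings_to_scores(rankings, k):
    # Borda-style encoding (an assumption of this sketch): the label ranked
    # first receives score 1 and the label ranked last receives score 1/k.
    scores = np.zeros((len(rankings), k))
    for n, sigma in enumerate(rankings):          # sigma lists labels, best first
        for pos, label in enumerate(sigma):
            scores[n, label] = (k - pos) / k
    return scores

class TreeLabelRanker:
    def __init__(self, k, forest=True, shallow=False):
        if forest:
            self.reg = RandomForestRegressor(criterion="squared_error")
        else:
            depth = max(1, int(np.ceil(np.log2(k)))) if shallow else None  # illustrative depth choice
            self.reg = DecisionTreeRegressor(criterion="squared_error", max_depth=depth)
        self.k = k

    def fit(self, X, rankings):
        # Regress the score vectors with the MSE (Breiman) criterion.
        self.reg.fit(X, rankings_to_scores(rankings, self.k))
        return self

    def predict(self, X):
        # Rank the labels by decreasing predicted score, i.e., argsort of the estimate.
        return np.argsort(-self.reg.predict(X), axis=1)

ranker = TreeLabelRanker(k=5).fit(X_train, train_rankings)
predicted_rankings = ranker.predict(X_test)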
Evaluation on LR Standard Benchmarks. We also evaluate our algorithms on standard Label Ranking benchmarks, specifically on sixteen semi-synthetic data sets and on five real-world LR data sets. The semi-synthetic ones have been considered standard benchmarks for the evaluation of LR algorithms ever since they were proposed by Cheng & Hüllermeier (2008). They were created by transforming multi-class (Type A) and regression (Type B) data sets from the UCI repository and the Statlog collection into Label Ranking data (see Cheng & Hüllermeier, 2008). A summary of these data sets and their characteristics is given in Table 2. The real-world ones are genetic data, where the genome consists of 2465 genes and each gene is represented by an associated phylogenetic profile of length 24, i.e., 24 features. In these data sets, we aim to predict a qualitative representation of an expression profile. The expression profile of a gene is an ordered sequence of real-valued measurements, converted into a ranking (e.g., (2.1, 3.5, 0.7, 2.5) is converted to (2, 1, 3, 4)) (Hüllermeier et al., 2008). A summary of the real-world data sets is given in Table 3.

Table 2. Properties of the Semi-Synthetic Benchmarks

BENCHMARK     TYPE  NUMBER OF INSTANCES  NUMBER OF ATTRIBUTES  NUMBER OF LABELS
AUTHORSHIP    A     841                  70                    4
BODYFAT       B     4522                 7                     7
CALHOUSING    B     37152                4                     4
CPU-SMALL     B     14744                6                     5
ELEVATORS     B     29871                9                     9
FRIED         B     73376                9                     5
GLASS         A     214                  9                     6
HOUSING       B     906                  6                     6
IRIS          A     150                  4                     3
PENDIGITS     A     10992                16                    10
SEGMENT       A     2310                 18                    7
STOCK         B     1710                 5                     5
VEHICLE       B     846                  18                    14
VOWEL         A     528                  10                    11
WINE          A     178                  13                    3
WISCONSIN     B     346                  16                    16

Table 3. Properties of the Real-World Benchmarks

BENCHMARK  NUMBER OF INSTANCES  NUMBER OF FEATURES  NUMBER OF LABELS
SPO        2465                 24                  11
HEAT       2465                 24                  6
DTT        2465                 24                  4
COLD       2465                 24                  4
DIAU       2465                 24                  7

We follow the same experimental setting as before, i.e., five repetitions of a ten-fold cross-validation process for each data set. In Table 4 we summarize our results. RF stands for the algorithm using Random Forests, DT for Decision Trees, and SDT for Shallow Decision Trees (denoted DTS in the table headers). These are the vanilla versions of the proposed algorithms, i.e., the parameters of the regressors were not tuned (only the MSE criterion and the maximum depth were specified, while the other parameters were left at their default values). RFT and DTT stand for Random Forests Tuned and Decision Trees Tuned (each decision tree was tuned on the training data). The parameters of the regressor were tuned with a five-fold inner cross-validation for each training set. The parameter grids are reported in the anonymized repository. Random Forests have the best results overall. Interestingly, tuning does not always lead to better results; in the majority of the benchmarks, RF and RFT have comparable results. In Table 6, we compare our Random Forest results (RF and RFT) with other previously proposed methods, namely the Labelwise Decomposition (LWD) of Cheng et al. (2013), Ranking with Pairwise Comparisons (RPC) of Hüllermeier et al. (2008), Label Ranking Trees (LRT) of Cheng & Hüllermeier (2008), Label Ranking Random Forests (LR-RF) of Zhou & Qiu (2018), the k-NN Kemeny regressor (kNN Kemeny) of Korba et al. (2018), and the Boosting-based Learning Ensemble for LR (BoostLR) of Dery & Shmueli (2020). For the semi-synthetic data sets, our results are competitive with the LWD, RPC and LRT results, but with no systematic improvements.
As expected, in comparison to the most recent and state-of-the-art results, the experimental results are comparable but cannot compete the Label Ranking through Nonparametric Regression Table 4. Performance in terms of Kendall s tau coefficient - Semi Synthetic Benchmarks BENCHMARK RF DT DTS RFT DTT AUTHORSHIP 0.85 0.04 0.78 0.05 0.81 0.05 0.87 0.04 0.77 0.05 BODYFAT 0.12 0.05 0.05 0.06 0.09 0.07 0.11 0.06 0.07 0.07 CALHOUSING 0.32 0.01 0.24 0.01 0.17 0.02 0.33 0.01 0.16 0.03 CPU-SMALL 0.29 0.01 0.21 0.02 0.28 0.01 0.30 0.02 0.28 0.01 ELEVATORS 0.60 0.01 0.47 0.01 0.50 0.01 0.61 0.01 0.55 0.02 FRIED 0.96 0.00 0.90 0.00 0.65 0.01 0.96 0.00 0.90 0.00 GLASS 0.88 0.06 0.80 0.06 0.79 0.06 0.80 0.07 0.80 0.07 HOUSING 0.44 0.07 0.40 0.06 0.39 0.06 0.37 0.07 0.31 0.08 IRIS 0.95 0.07 0.91 0.09 0.92 0.08 0.95 0.07 0.90 0.10 PENDIGITS 0.86 0.01 0.77 0.018 0.63 0.01 0.86 0.01 0.78 0.01 SEGMENT 0.90 0.02 0.87 0.02 0.82 0.02 0.91 0.02 0.87 0.02 STOCK 0.80 0.03 0.76 0.04 0.76 0.04 0.78 0.03 0.73 0.05 VEHICLE 0.84 0.03 0.78 0.04 0.79 0.04 0.83 0.03 0.78 0.04 VOWEL 0.67 0.04 0.63 0.05 0.55 0.04 0.68 0.04 0.58 0.06 WINE 0.90 0.09 0.84 0.12 0.86 0.10 0.91 0.08 0.84 0.11 WISCONSIN 0.14 0.04 0.08 0.04 0.1 0.04 0.14 0.05 0.09 0.04 Table 5. Performance in terms of Kendall s tau coefficient Real World Benchmarks BENCHMARK RF DT DTS RFT DTT COLD 0.10 0.03 0.06 0.03 0.07 0.03 0.09 0.03 0.07 0.04 DIAU 0.15 0.03 0.12 0.02 0.12 0.02 0.14 0.03 0.12 0.04 DTT 0.13 0.04 0.10 0.04 0.09 0.04 0.13 0.03 0.10 0.03 HEAT 0.07 0.02 0.05 0.02 0.05 0.02 0.08 0.03 0.05 0.02 SPO 0.05 0.02 0.05 0.02 0.04 0.02 0.01 0.01 0.01 0.01 highly optimized applied algorithms. But we believe that the insights gained using this technique may be valuable to a variate of other Label Ranking methods. For the real-world data sets there are not so many methods to compare our results with. Therefore we compare them with the RPC method and Boost LR. The results are summarized in Table 7. The performance of our algorithm in the real-word benchmarks is worse than in the other data sets. We suspect that the non-sparsity of these data sets (genome data) is one of the main reasons that this pattern in the performance is observed. A more thorough investigation is left for future work. C. Results on the Noisy Oracle with Partial Rankings In this setting, we consider a distribution of partitions of the interval (of positive integers) [1..k], which depends on the feature x Rd. Before a formal definition, we provide an intuitive example: for some feature x, let the noisy score vector be equal to y = [0.2, 0.4, 0.1, 0.3, 0.5] (where y = m(x) + ξ, as in Definition 2.4). Then, it holds that σ = argsort(y) = (e b d a c). In the partial setting, we additionally draw an increasing4 partition I of the label space [k] from the distribution p(x). Assume that I = [1...2][3...3][4...5], whose size is 3. Then, the partial ranking is defined as Partial Rank(σ; I) and is equal to e = b d a = c. Definition C.1 (Generative Process for Partial Data). Consider an instance of the Label Ranking problem with underlying score hypothesis m : X [0, 1]k and let Dx be a distribution over features. Consider the partial partition distribution p. Each sample is generated as follows: 1. Draw x X from Dx and ξ [ 1/4, 1/4]k from the distribution E. 2. Compute y = m(x) + ξ. 4A partition I of the space [k] is called increasing if there exists an increasing sequence i1 < i2 < ... < im of indices of [k] so that I = [1..i1][i1..i2]...[im + 1..k]. 
For instance, the partition [1, 2][3, 4][5] is increasing, but the partition [1, 4][2, 3][5] is not. Label Ranking through Nonparametric Regression Table 6. Evaluation in terms of Kendall s tau coefficient - Semi-Synthetic Benchmarks BENCHMARK RF RFT LWD RPC LRT LR-RF KNN KEMENY BOOSTLR AUTHORSHIP 0.86 0.04 0.87 0.04 0.91 0.01 0.91 0.88 0.92 0.94 0.02 0.92 BODYFAT 0.12 0.05 0.11 0.06 - 0.28 0.11 0.19 0.23 0.06 0.20 CALHOUSING 0.32 0.01 0.33 0.01 - 0.24 0.36 0.37 0.33 0.01 0.44 CPU-SMALL 0.29 0.01 0.3 0.02 - 0.45 0.42 0.52 0.51 0.00 0.50 ELEVATORS 0.60 0.01 0.61 0.01 - 0.75 0.76 0.76 - 0.77 FRIED 0.96 0.00 0.96 0.00 - 1.00 0.89 1.00 0.89 0.00 0.94 GLASS 0.83 0.06 0.80 0.07 0.88 0.4 0.88 0.88 0.89 0.85 0.06 0.89 HOUSING 0.44 0.07 0.37 0.07 - 0.67 0.80 0.80 - 0.83 IRIS 0.95 0.07 0.95 0.07 0.93 0.06 0.89 0.95 0.97 0.95 0.04 0.83 PENDIGITS 0.86 0.01 0.86 0.01 - 0.93 0.94 0.94 0.94 0.00 0.94 SEGMENT 0.90 0.02 0.91 0.02 0.94 0.01 0.93 0.95 0.96 0.95 0.01 0.96 STOCK 0.80 0.03 0.78 0.03 - 0.78 0.90 0.92 - 0.93 VEHICLE 0.84 0.03 0.83 0.03 0.87 0.02 0.85 0.83 0.86 0.85 0.03 0.86 VOWEL 0.67 0.04 0.68 0.04 0.67 0.02 0.65 0.79 0.97 0.85 0.03 0.84 WINE 0.90 0.09 0.91 0.08 0.91 0.06 0.92 0.88 0.95 0.94 0.06 0.95 WISCONSIN 0.14 0.04 0.14 0.05 - 0.63 0.34 0.48 0.49 0.04 0.45 Table 7. Evaluation in terms of Kendall s tau coefficient - Real Word Benchmarks BENCHMARK RF RFT RPC BOOSTLR SPO 0.05 0.02 0.01 0.01 0.14 0.02 0.14 HEAT 0.07 0.02 0.08 0.03 0.13 0.2 0.13 DTT 0.13 0.04 0.13 0.03 0.17 0.3 0.17 COLD 0.10 0.03 0.09 0.0 0.22 0.03 0.21 DIAU 0.15 0.03 0.14 0.03 0.33 0.02 0.33 3. Set eσ = argsort(y). 4. Draw a partition I = [1...i1][i1 + 1...i2]...[i L + 1...k] of size L + 1 from a distribution p(x), for some L [0..k]. 5. Set σ = Partial Rank(eσ; I) 6. Output (x, σ). We let (x, σ) Dp R. Note that when L = 0, we observe a complete ranking σ. We remark that our generative model is quite general: It allows arbitrarily complex distributions over partitions, which depend on the feature space instances. We aim to address Problem 2 when dealing with partial observations. We are going to address this question in a similar fashion as in the incomplete regime. Specifically, we assume that Condition 2 still holds, but instead of the Item 3 (Deletion tolerance), we assume that the property of partial tolerance holds. Condition 7. For any 1 i < j k, we assume that the following hold: The Stochastic Transitivity and the Tsybakov s noise condition with parameters a, B (see Condition 2) are satisfied and the following condition holds: 1. (Partial tolerance): There exists ξ (0, 1) so that pi,j(x) ξ, where pi,j(x) is the probability that the pair i < j is not in the same subset of the partial partition for x Rd. Using similar technical tools as in Theorem 2.5, we obtain the following result. Theorem C.2 (Label Ranking with Partial Permutations). Let ε, δ (0, 1) and assume that Condition 7 holds, i.e., the Stochastic Transitivity property holds and both the Tsybakov s noise condition holds with a (0, 1), B > 0 and the partial tolerance condition holds for the partition probability vector p with parameter ξ (0, 1). Set Ca,B = B1 a/((1 a)1 aaa) Label Ranking through Nonparametric Regression and consider a hypothesis class G of binary classifiers with finite VC dimension. 
There exists an algorithm (Algorithm 8) that computes an estimate bσ : Rd Sk so that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ, where σ : Rd Sk is the mapping (see Equation (2)) induced by the aggregation of the k 2 Bayes classifiers g i,j with loss L i,j, using n independent samples from Dp R (see Definition C.1), with Ca,B ξ4 2a k 2 log(k/δ), VC(G) log Algorithm 8 Algorithm for Label Ranking with Partial Rankings 1: Input: Sample access to i.i.d. examples of the form (x, σ) Dp R, VC class G. 2: Model: Partial rankings are generated as in Definition C.1. 3: Output: An estimate bσ : Rd Sk of the optimal estimator σ that satisfies Pr x Dx[bσ(x) = σ (x)] Cα,B ξ2 OPT(G) + ε . 4: Label Rank Partial(ε, δ): 5: Set n = eΘ k 4(1 α) α max{log(k/δ), VC(G))}/polyα(ξ ε) {See Theorem C.2.} 6: Draw a training set D of n independent samples from Dp R 7: For any i = j, set Di,j = 8: for 1 i < j k do 9: if (x, σ) D and σ(i) = σ(j) then {i and j do not lie in the same partition.} 10: Add (x, sgn(σ(i) σ(j))) to Di,j 11: endif 12: endfor 13: Training Phase: bs Estimate Aggregate(Di,j for all i < j, G) {See Algorithm 6.} 14: Testing Phase: On input x Rd, output argsort(bs(x)) Proof. The proof is similar to the incomplete rankings case and the difference lies in the following steps 1. For any i = j, the conditioning on the event σ {i, j} should be replaced with the event that i does not lie in the same partition (σ(i) = σ(j)). This implies that any φ term (which corresponds to a lower bound on the probability Pr[σ {i, j}]) should be replaced by the term ξ (0, 1). 2. For the final sample complexity, the following holds: With high probability, the number of pairwise comparisons induced by a partial ranking is at least ξ k 2 . Hence, with high probability, a number of n PC/(ξ k 2 ) independent draws from the distribution Dp R suffices to obtain the desired bound. Label Ranking through Nonparametric Regression D. Background on Regression with Trees and Forests In this section, we provide a discussion on decision trees and forests. We refer to Shalev-Shwartz & Ben-David (2014), Syrgkanis & Zampetakis (2020) and Breiman et al. (1984) for further details. D.1. Further Previous Work From an information-theoretic viewpoint, sample complexity bounds of decision trees and other data-adaptive partitioning estimators have been established (Nobel, 1996; Lugosi & Nobel, 1996; Mansour & Mc Allester, 2000). However, from a computational aspect, the problem of choosing the optimal tree is NP-complete (e.g., Laurent & Rivest, 1976) and, hence, from a practical standpoint, trees and forests are built greedily (e.g., Level-Splits, Breiman) by identifying the most empirically informative split at each iteration (Breiman et al., 1984; Breiman, 2001). Advances have shown that such greedily constructed trees are asymptotically consistent (Biau, 2012; Denil et al., 2014; Scornet et al., 2015) in the low-dimensional regime. On the other side, the high-dimensional regime, where the number of features can grow exponentially with the number of samples, is studied by Syrgkanis & Zampetakis (2020). The literature related to the CART criterion (Breiman et al., 1984) is vast; however, there are various other strands of research dealing with the problem of sparse nonparametric regression (that we consider in the noiseless regression settings of our work). On the one side, several heuristic methods have been proposed (Friedman et al., 2001; Friedman, 1991; George & Mc Culloch, 1997; Smola & Bartlett, 2001). 
On the other side, various works, such as the ones of Lafferty & Wasserman (2008); Liu & Chen (2009); Comminges & Dalalyan (2012); Yang & Tokdar (2015), design and theoretically analyze greedy algorithmic approaches that exploit the sparsity of the regression function in order to get around with the curse of dimensionality of the input feature data. D.2. Preliminaries on Regression Trees Decision Trees. A decision tree is a predictor h : X Y, which, on the input feature x, predicts the label associated with the instance by following a decision path from a root node of a tree to a leaf. At each node on the root-to-leaf path, the successor child is chosen on the basis of a splitting of the input space. The splitting may be based on a specific feature of x or on a predefined set of splitting rules and a leaf always contains a specific label or value, depending on the context (classification or regression). Nonparametric Regression. In the nonparametric regression problem, we consider that we observe independent samples (x, y), generated as y = f(x) + ξ, where ξ corresponds to bounded zero mean noise. Specifically, we have the following generative process: Definition D.1 (Standard Nonparametric Regression). Consider the underlying regression function f : X [1/4, 3/4] and let Dx be a distribution over features. Each sample is generated as follows: 1. Draw x X from Dx. 2. Draw ξ [ 1/4, 1/4] from the zero mean distribution E. 3. Compute y = f(x) + ξ. We let (x, y) D. Note that the noise random variable does not depend on the feature x. In the high-dimensional regime, we assume that the target function is sparse (recall Definition A.1). Regression Tree Algorithms. Let us briefly describe how regression tree algorithms work in an abstract fashion. We will focus on binary trees. In general, the regression tree algorithms operate in two phases, which are the following: first the algorithm finds a partition P of the hypercube {0, 1}d. Afterwards, it assigns a single value to every cell of the partition P, which defines the estimation function f (n). Finally, the algorithm outputs this estimate. More concretely, we have that: Phase 1 (Partitioning the space). First, a depth 0 tree contains a single cell {{0, 1}d}. If this single node splits based on whether x1 = 0 or x1 = 1, we obtain a depth 1 tree with two cells {{0} {0, 1}d 1, {1} {0, 1}d 1}. In general, this procedure generates a partition P of the space {0, 1}d. We let P(x) denote the unique cell of P that contains x. Label Ranking through Nonparametric Regression Phase 2 (Computing the estimation). Let D denote the training set that contains examples of the form (x, y), generated as y = f(x) + ξ. For any cell c P, we create the dataset Dc of all the training examples (x, y) D that are contained in the cell c, i.e., x c. Then, we compute the value of the cell as f (n)(c; P) := 1 |Dc| (x,y) Dc y . The main question not covered in the above discussion is the following: How is the split of Phase 1 chosen? There are various splitting rules in order to partition the space in Phase 1. We discuss two such rules: the Breiman s Algorithm and the Level-Splits Algorithm. In Breiman s algorithm, every node can choose a different direction to split, by using the following greedy criterion: every node chooses the direction that minimizes its own empirical mean squared error. 
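To make the greedy criterion concrete, here is a minimal numpy sketch of a single Breiman-style split choice on binary features; the names and interface are illustrative (the experiments in this paper rely on the scikit-learn implementations).

import numpy as np

def within_children_mse(y_left, y_right):
    # Empirical mean squared error of a split in which each child cell
    # predicts the mean of its own responses.
    sse = 0.0
    for y in (y_left, y_right):
        if len(y) > 0:
            sse += np.sum((y - y.mean()) ** 2)
    return sse / (len(y_left) + len(y_right))

def breiman_split(X_cell, y_cell):
    # Greedy Breiman-style choice at a single node: try every coordinate and
    # keep the one minimizing the empirical MSE of the induced split
    # (equivalently, maximizing the empirical analogue of V).
    best_coord, best_mse = None, np.inf
    for i in range(X_cell.shape[1]):
        right = X_cell[:, i] == 1          # binary features x in {0, 1}^d
        mse = within_children_mse(y_cell[~right], y_cell[right])
        if mse < best_mse:
            best_coord, best_mse = i, mse
    return best_coord, best_mse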
On the other side, in the Level-Splits algorithm, every node at the same level has to split in the same direction, by using the greedy criterion: at every level, we choose the direction that minimizes the total empirical mean squared error. In the upcoming sections, we elaborate on the algorithms based on the Level-Splits (Appendix D.3) and Breiman s (Appendix D.4) criterion. D.3. Level-Splits Algorithm We define x S as the sub-vector of x, where we observe only the coordinates with indices in S [d]. Recall that in the Level-Splits algorithm with set of splits S, any level has to split at the same direction. Hence, each level provides a single index to the set S and the size of the set S is the depth of the decision tree. Given a set of splits S, we define the expected mean squared error of S as follows: e L(S) = E x Dx " f(x) E w Dx[f(w)|w S = x S] 2# f 2(x) E z S Dx,S E w Dx[f(w)|w S = z S] 2 . This function quantifies the population version of the mean squared error between the actual value of f at x and the mean value of f constrained at the cell of P that contains x, i.e., the subspace of {0, 1}d that is equal to P(x). Observe that e L depends only on f and Dx. We set e V (S) := E z S Dx,S E w Dx[f(w)|w S = z S] 2 . (7) The function e V can be seen as a measure of heterogeneity of the within-leaf mean values of the target function f, from the leafs created by split S. The following condition is required. Condition 8 (Approximate Submodularity). Let C 1. We say that the function e V is C-approximate submodular if and only if for any T, S [d], such that S T and any i [d], it holds that e V (T {i}) e V (T) C (e V (S {i})) e V (S)). We can equivalently write this condition as (this is the formulation we used in Condition 1): e L(T) e L(T {i}) C (e L(S) e L(S {i})). The reduction of the mean squared error when the coordinate i is added to a set of splits T is upper bounded by the reduction when adding i to a subset of T, i.e., if adding i does not decrease the mean squared error significantly at some point (when having the set S), then i cannot decrease the mean squared error significantly in the future either (for any superset of S). We remark that this condition is necessary for any greedy algorithm to work (see Syrgkanis & Zampetakis, 2020). Label Ranking through Nonparametric Regression Empirical MSE. For the algorithm, we will use the empirical version of the mean square error. Provided a set of splits S, we have that y(j) f (n)(x(j); S) 2 = 1 j [n] (y(j))2 1 j [n] f (n)(x(j); S)2 (8) j [n] (y(j))2 Vn(S) . (9) Algorithm 9 Level-Splits Algorithm (see (Syrgkanis & Zampetakis, 2020)) 1: Input: honesty flag h, training dataset Dn, maximum number of splits log(t). 2: Output: Tree approximation of f. 
3: Level Splits-Algo(h, Dn, log(t)): 4: V Dn,x {Keep only training features x.} 5: if h = 1 then Split randomly Dn in half; Dn/2, D n/2, n n/2, V D n,x 6: Set P0 = {{0, 1}d} {The partition that corresponds to the root.} 7: Pℓ= for any ℓ [n] 8: level 1, S 9: while level < log(t) do 10: level level + 1 11: Select the direction i [d] that maximizes Vn(S {i}) {For Vn, see Equation (9).} 12: for all C Plevel do 13: Partition the cell C into the cells Ci k = {x : x C xi = k}, k 0, 1 14: if |V Ci 0| 1 |V Ci 1| 1 then 15: Plevel+1 Plevel+1 {Ci 0, Ci 1} 16: else 17: Plevel+1 Plevel+1 {C} 18: endif 19: endfor 20: S S {i} 21: endwhile 22: Output (Pn, f (n)) = (Plevel+1, x 7 f (n)(x; S)) The following result summarizes the theoretical guarantees for algorithms with Decision Trees via the Level-Splits criterion (see Syrgkanis & Zampetakis, 2020). Theorem D.2 (Learning with Decision Trees via Level-Splits (see (Syrgkanis & Zampetakis, 2020))). Let ε, δ > 0. Let H > 0. Let Dn be i.i.d. samples from the nonparametric regression model y = f(x) + ξ, where f(x) [ 1/2, 1/2], ξ E, Eξ E[ξ] = 0 and ξ [ 1/2, 1/2]. Let also Sn be the set of splits chosen by the Level-Splits algorithm (see Algorithm 9), with input h = 0. The following statements hold. 1. Given n = e O log(d/δ) (Cr/ε)Cr+2 samples, if f is r-sparse as per Definition A.1 and under the submodularity Condition 8, and if we set the number of splits to be log(t) = Cr Cr+2(log(n) log(log(d/δ))), then it holds that h (f(x) f (n)(x; Sn))2i > ε δ . 2. If f is r-sparse as per Definition A.1 and under the submodularity Condition 8 and the independence of features condition, given n = e O log(d/δ) 2r (C/ε)2 samples and if we set the number of splits to be log(t) = r, then it holds that h (f(x) f (n)(x; Sn))2i > ε δ . Label Ranking through Nonparametric Regression Fully Grown Honest Forests with Level-Splits Algorithm. We first explain the term Fully Grown Honest Forests: The term Fully Grown or (deep) means that we split every node until every leaf has exactly 1 training sample. The term Honest (see Wager & Athey, 2018) corresponds to the following: the regression tree algorithms operate in two phases where in both stages we use the same set of training examples. In honest trees, we split randomly the training set and use half of the dataset (Dn/2) in find a partition of {0, 1}d and the other half (D n/2) to assign the values in the cells. Finally, the term Forest is used when we subsample s out of n samples and use them in order to build independent trees; we then output the average of these trees and this function is denoted by f (n,s). In the work of Syrgkanis & Zampetakis (2020), a result about Fully Grown Forests via the Level-Splits criterion is provided under Condition 5. Shortly, it holds that, using a training set of size n = e O 2r log(1/δ) ζ and if for every w {0, 1}r, it holds for the marginal probability that Prz Dx(z R = w) / (0, ζ/2r) and if s = eΘ(2r (log(d/δ)/β2 + log(1/δ)/ζ)), then it holds that Pr Dn Dn Ex Dx[(f(x) f (n,s)(x))2] ε δ. We remark that every tree f(x, Ds) is built using Algorithm 9, with inputs: log(t) large enough so that every leaf has two or three samples and h = 1. D.4. Breiman s Algorithm We now turn our attention to the Breiman s criterion. We define the total expected mean square error that is achieved by a partition P of {0, 1}d in the population model as follows: e L(P) = E x Dx " f(x) E z Dx[f(z)|z P(x)] 2# = E x Dx[f 2(x)] E x Dx E z Dx[f(z)|z P(x)] 2 . 
As in the Level-Splits criterion, we set e V (P) = E x Dx E z Dx[f(z)|z P(x)] 2 . In order to define the splitting criterion of the algorithm (due to the local nature of Breiman), one has to introduce the local version of the expected MSE for the cell A: e Lℓ(A, P) = E x Dx " f(x) E z Dx[f(z)|z P(x)] 2 x A = E x Dx[f 2(x)|x A] E x Dx " E z Dx[f(z)|z P(x)] 2 x A e Vℓ(A, P) = E x Dx " E z Dx[f(z)|z P(x)] 2 x A The following condition is required for decision tree-based algorithms that use the Breiman s criterion. Condition 9 (Approximate Diminishing Returns). For C 1, we say that the function e V has the C-approximate diminishing returns property if for any cells A, A , any i [d] and any T [d] such that A A, it holds that e Vℓ(A , T {i}) e Vℓ(A , T) C (e Vℓ(A, i) e Vℓ(A)) . For the algorithm, we need the empirical mean squared error, conditional on a cell A and a potential split direction i, which is defined as follows: let Nn(A) be the number of training points in the cell A. Recall that Ai z = {x A|xi = z} for Label Ranking through Nonparametric Regression z {0, 1}. Also, set f (n)(x; P) = g(n)(P(x)). Then, we have that Lℓ n(A, i) = X Nn(Ai z) Nn(A) 1 Nn(Aiz)(y(j) f (n)(x(j); P(x(j))))2 (10) j:x(j) A (y(j))2 X Nn(Ai z) Nn(A) (g(n)(Ai z))2 (11) j:x(j) A (y(j))2 V ℓ n(A, i) . (12) Algorithm 10 Breiman s Algorithm (see (Syrgkanis & Zampetakis, 2020)) 1: Input: honesty flag h, training dataset Dn, maximum number of splits t. 2: Output: Tree approximation of f. 3: Breiman-Algo(h, Dn, t): 4: V Dn,x 5: if h = 1 then Split randomly Dn in half; Dn/2, D n/2, n n/2, V D n,x 6: Set P0 = {{0, 1}d} {The partition that corresponds to the root.} 7: Pℓ= for any ℓ [n] 8: level 0, nnodes 1, queue P0 9: while nnodes < t do 10: if queue = do 11: level level + 1, queue Plevel 12: endif 13: Pick A the first element in queue 14: if |V A| 1 then 15: queue queue \ {A}, Plevel+1 Plevel+1 {A} 16: else 17: Select i [d] that maximizes V ℓ n(A, i) {See Equation (12).} 18: Cut the cell A to cells Ai k = {x|x A xi = k}, k = 0, 1 19: queue queue \ {A}, Plevel+1 Plevel+1 {Ai 0, Ai 1} 20: endif 21: endwhile 22: Plevel+1 Plevel+1 queue 23: Output (Pn, f (n)) = (Plevel+1, x 7 f (n)(x; Plevel+1)) Theorem D.3 (Learning with Decision Trees via Breiman (see (Syrgkanis & Zampetakis, 2020))). Let ε, δ > 0. Let H > 0. Let Dn be i.i.d. samples from the nonparametric regression model y = f(x) + ξ, where f(x) [ 1/2, 1/2], ξ E, Eξ E[ξ] = 0 and ξ [ 1/2, 1/2]. Let also Pn be the partition that the algorithm (see Algorithm 10) returns with input h = 0. The following statements hold. 1. If f be r-sparse as per Definition A.1 and if the approximate diminishing returns Condition 9 holds, then given n = e O log(d/δ)(C r/ε)C r+3 samples and if we set log(t) Cr Cr+3(log(n) log(log(d/δ))), then it holds that h (f(x) f (n)(x; Pn))2i > ε δ . 2. If f is r-sparse as per Definition A.1, if the approximate diminishing returns Condition 9 holds and the distribution Dx is a product distribution, given n = e O C22r log(d/δ)/ε3 samples and if we set log(t) r, then it holds that h (f(x) f (n)(x; Pn))2i > ε δ . Label Ranking through Nonparametric Regression Finally, in the work of Syrgkanis & Zampetakis (2020), a result about Fully Grown Forests via the Breiman s criterion is provided under Condition 6. Shortly, it holds that, using a training set of size n = 2r log(d/δ) εζβ2 and if s = eΘ( 2r log(d/δ) ζβ2 ), then it holds that Pr Dn Dn Ex Dx[(f(x) f (n,s)(x))2] ε δ. 
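Schematically, the fully grown honest forest construction used in these results (subsample s out of the n examples, split each subsample in half for honesty, average the resulting trees) can be sketched as follows; build_partition is a placeholder for one run of the greedy tree construction of this section with h = 1 and is not implemented here, and the fallback value 0.0 for an empty leaf is only a convenience of the sketch.

import numpy as np

def honest_forest_predict(x, X, y, build_partition, s, n_trees, rng):
    # X: (n, d) binary features, y: (n,) responses.
    # build_partition(X_half, y_half) returns a function mapping a point to an
    # identifier of its leaf cell in the partition chosen from X_half, y_half.
    preds = []
    for _ in range(n_trees):
        idx = rng.choice(len(y), size=s, replace=False)        # subsample s of n points
        split_half, value_half = idx[: s // 2], idx[s // 2 :]  # honesty: disjoint halves
        cell = build_partition(X[split_half], y[split_half])   # half 1 chooses the splits
        same_leaf = [i for i in value_half if cell(X[i]) == cell(x)]
        if same_leaf:                                          # half 2 assigns the leaf values
            preds.append(float(np.mean(y[same_leaf])))
    return float(np.mean(preds)) if preds else 0.0             # average over the trees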
Note that every tree f(x, Ds) is built using the Algorithm 10, with inputs: log(t) large enough so that every leaf has two or three samples, training set Ds and h = 1. E. Background on Statistical Learning Theory For a detailed exposition of a statistical learning theory perspective to binary classification, we refer to Bousquet et al. (2003); Boucheron et al. (2005). Talagrand s Inequality. Let Pf = E f and Pnf be the corresponding empirical functional. Talagrand s inequality provides a concentration inequality for the random variable supf F(Pf Pnf), which depends on the maximum variance attained by any function over the class F. Fact 1 (Theorem 5.4 in Boucheron et al. (2005)). Let b > 0 and F be a set of functions from X to R. Assume that all functions in F satisfy Pf f b. Then, with probability at least 1 δ, it holds that sup f F (Pf Pnf) 2 E[sup f F (Pf Pnf)] + 2 supf F Var(f) log(1/δ) n + 4b log(1/δ) On Tsybakov s Condition. The following property holds for the Tsybakov s noise condition. Fact 2 (Tsybakov s Condition). Let G be a class of binary classifiers. Under the Tsybakov s noise condition (see Condition 2.(ii)) with a, B > 0 and for i = j, it holds that Pr (x,σ) Dq R [g(x) = g i,j(x)|σ {i, j}] Ca,B(Li,j(g) Li,j(g ))a , where Ca,B = B1 a (1 a)1 aaa , g is the Bayes classifier and the loss function is defined as Li,j(g) := E (x,σ) Dq R 1{g(x) = sgn(σ(i) σ(j)) σ {i, j}} . Proof. Let us set L i,j = Li,j(g ). Define the quantity η(x) = E (X,σ)[sgn(σ(i) σ(j)) = +1|X = x] . The loss of the classifier is equal to Li,j(g) L i,j = E (x,σ) Dq R [|2η(x) 1| 1{g(x) = g i,j(x)}|σ {i, j}] , and so Li,j(g) L i,j t E[1{g(x) = g i,j(x)} 1{|2η(x) 1| t}|σ {i, j}] . Using Markov s inequality, we have that for all t 0: Li,j(g) L i,j t Pr[|2η(x) 1| t|σ {i, j}] t E[1{g(x) = g i,j(x)}1{|2η(x) 1| t}|σ {i, j}] . The Tsybakov s condition implies that Li,j(g) L i,j t(1 Bt a 1 a ) t Pr[g(x) = g i,j(x)|σ {i, j}] . Hence, Li,j(g) L i,j t(Pr[g(x) = g i,j(x)|σ {i, j}] Bt a 1 a ) . Choosing t appropriately, one gets that Pr[g(x) = g i,j(x)|σ {i, j}] B1 a (1 a)1 aaa (Li,j(g) L i,j)a . The proof is concluded by setting Ca,B = B1 a (1 a)1 aaa .
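For completeness, the choice of t in the last step can be made explicit by a short optimization over t, spelled out here. Setting P := Pr[g(x) ≠ g^*_{i,j}(x) | σ ∋ {i, j}], the last display states that, for every t ≥ 0,

L_{i,j}(g) − L^*_{i,j} ≥ t P − B t^{1/(1−a)} .

The right-hand side is maximized at t^* = ((1−a) P / B)^{(1−a)/a}, where it equals a ((1−a)/B)^{(1−a)/a} P^{1/a}. Rearranging yields

P ≤ B^{1−a} / ((1−a)^{1−a} a^a) · (L_{i,j}(g) − L^*_{i,j})^a = C_{a,B} (L_{i,j}(g) − L^*_{i,j})^a ,

which is exactly the claimed bound.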