# Label Ranking through Nonparametric Regression

Dimitris Fotakis¹, Alkis Kalavasis¹, Eleni Psaroudaki¹

Abstract

Label Ranking (LR) corresponds to the problem of learning a hypothesis that maps features to rankings over a finite set of labels. We adopt a nonparametric regression approach to LR and obtain theoretical performance guarantees for this fundamental practical problem. We introduce a generative model for Label Ranking, in noiseless and noisy nonparametric regression settings, and provide sample complexity bounds for learning algorithms in both cases. In the noiseless setting, we study the LR problem with full rankings and provide computationally efficient algorithms using decision trees and random forests in the high-dimensional regime. In the noisy setting, we consider the more general cases of LR with incomplete and partial rankings from a statistical viewpoint and obtain sample complexity bounds using the One-Versus-One approach of multiclass classification. Finally, we complement our theoretical contributions with experiments, aiming to understand how the input regression noise affects the observed output.

1. Introduction

Label Ranking (LR) studies the problem of learning a mapping from features to rankings over a finite set of labels. This task emerges in many domains. Common practical illustrations include pattern recognition (Geng & Luo, 2014), web advertisement (Djuric et al., 2014), sentiment analysis (Wang et al., 2011), document categorization (Jindal & Taneja, 2015) and bio-informatics (Balasubramaniyan et al., 2005). The importance of LR has spurred the development of several approaches for tackling this task from the perspective of the applied CS community (Vembu & Gärtner, 2010; Zhou et al., 2014b).

¹School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece. Correspondence to: Eleni Psaroudaki. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The overwhelming majority of these solutions comes with experimental evaluation and no theoretical guarantees; e.g., algorithms based on decision trees are a workhorse for practical LR and yet lack theoretical guarantees. Given state-of-the-art experimental results based on Random Forests (see Zhou & Qiu, 2018), we are highly motivated not only to work towards a theoretical understanding of this central learning problem but also to theoretically analyze how efficient tree-based methods can be under specific assumptions.

LR is a supervised learning problem that extends multiclass classification (Dekel et al., 2003). In the latter, with instance domain X ⊆ R^d and set of labels [k] := {1, . . . , k}, the learner draws i.i.d. labeled examples (x, y) ∈ X × [k] and aims to learn a hypothesis from instances to labels, following the standard PAC model. In LR, the learner observes labeled examples (x, σ) ∈ X × S_k and the goal is to learn a hypothesis h : X → S_k from instances to rankings of labels, where S_k is the symmetric group on k elements. The ranking h(x) corresponds to the preference list of the feature x and, as mentioned in previous works (Hüllermeier et al., 2008), a natural way to represent preferences is to evaluate individual alternatives through a real-valued utility (or score) function. Note that if the training data offered the utility scores directly, the problem would reduce to a standard regression problem.
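As a minimal illustration of this score-based view, a utility vector induces a ranking by listing labels in decreasing order of score; the snippet below is only a sketch of this convention (it is the sense in which argsort is used throughout the paper).

```python
import numpy as np

def ranking_from_scores(scores):
    """Labels ordered by decreasing utility score (the paper's argsort convention)."""
    return np.argsort(-np.asarray(scores))

# A score vector (0.4, 0.7, 0.1) for labels (0, 1, 2) yields the
# preference list (1, 0, 2): label 1 is ranked first.
print(ranking_from_scores([0.4, 0.7, 0.1]))   # [1 0 2]
```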
In our work, we assume that there exists such an underlying nonparametric score function m : X [0, 1]k, mapping features to score values. The value mi(x) corresponds to the score assigned to the label i [k] for input x and can be considered proportional to the posterior probability Pr(x,y)[y = i|x]. For each LR example (x, σ), the label σ is generated by sorting the underlying regression-score vector m(x), i.e., σ = argsort(m(x) + ξ) (with some regression noise ξ). We are also interested in cases where some of the alternatives of σ are missing, i.e., we observe incomplete rankings σ S k; the way that such rankings occur will be clarified later. Formally, we have: Definition 1.1 (Distribution-free Nonparametric LR). Let X Rd, [k] be a set of labels, C be a class of functions from X to [0, 1]k and Dx be an arbitrary distribution over X. Consider a noise distribution E over Rk. Let m be an unknown target function in C. An example oracle Ex(m, E) with complete rankings, Label Ranking through Nonparametric Regression works as follows: Each time Ex(m, E) is invoked, it returns a labeled example (x, σ) X Sk, where (i) x Dx and ξ E independently and (ii) σ = argsort(m(x) + ξ). Let DR be the joint distribution over (x, σ) generated by the oracle. In the noiseless case (ξ = 0 almost surely), we simply write Ex(m). Let M be a randomized mechanism that given a tuple (x, y) X Rk generates an incomplete ranking M(x, y) S k. An example oracle Ex(m, E, M) with incomplete rankings, works as follows: Each time Ex(m, E, M) is invoked, it returns a labeled example (x, σ) X S k, where (i) x Dx, ξ E , (ii) y = m(x) + ξ and (iii) σ = M(x, y). Let (x, σ) DM R . We denote h : X Sk the composition h = argsort m. Note that the oracle Ex(m, E, M) generalizes Ex(m, E) (which generalizes Ex(m) accordingly) since we can set M to be M(x, y) = argsort(y). 1.1. Problem Formulation and Contribution Most of our attention focuses on the two upcoming learning goals, which are stated for the abstract Label Ranking example oracle O {Ex(m), Ex(m, E), Ex(m, E, M)}. Let d be an appropriate ranking distance metric. Problem 1 (Computational). The learner is given i.i.d. samples from the oracle O and its goal is to efficiently output a hypothesis bh : X Sk such that with high probability the error Ex Dx[d(bh(x), h(x))] is small. Problem 2 (Statistical). Consider the median problem h = argminh E(x,σ)[d(h(x), σ)] where (x, σ) DR. The learner is given i.i.d. samples from the oracle O and its goal is to output a hypothesis bh : X Sk from some hypothesis class H such that with high probability the error Prx Dx[bh(x) = h (x)] against the median h is small. The main gap in the theoretical literature of LR was the lack of computational guarantees. Problem 1 identifies this gap and offers, in combination with the generative models of the previous section, a natural and formal way to study the theoretical performance of practical methods for LR such as decision trees and random forests. We believe that this is the main conceptual contribution of our work. In Problem 1, the runtime should be polynomial in d, k, 1/ε. While Problem 1 deals with computational aspects of LR, Problem 2 focuses on the statistical aspects, i.e., the learner may be computationally inefficient. This problem is extensively studied as Ranking Median Regression (Cl emenc on et al., 2018; Cl emenc on & Vogel, 2020) and is closely related to Empirical Risk Minimization (and this is why it is statistical , since NP-hardness barriers may arise). 
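For concreteness, the following sketch instantiates the example oracles of Definition 1.1. The particular score function m, the bounded uniform noise E, and the independent-erasure mechanism M are hypothetical choices made only for illustration; Definition 1.1 allows arbitrary ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5

def rank_of(y):
    """The paper's argsort: labels ordered by decreasing score, ignoring
    erased entries (np.nan plays the role of the erasure symbol)."""
    y = np.asarray(y, dtype=float)
    alive = np.flatnonzero(~np.isnan(y))
    return alive[np.argsort(-y[alive])]

def example_oracle(m, D_x, noise=None, mechanism=None):
    """One call to Ex(m), Ex(m, E) or Ex(m, E, M) from Definition 1.1."""
    x = D_x()                                    # x ~ D_x
    xi = noise() if noise is not None else 0.0   # xi ~ E (0 in the noiseless case)
    y = m(x) + xi
    if mechanism is not None:                    # incomplete-ranking oracle Ex(m, E, M)
        y = mechanism(x, y)
    return x, rank_of(y)

# --- hypothetical instantiations, for illustration only --------------------
W = rng.uniform(0.0, 1.0, size=(k, 10))          # per-label weights on 10 features

def D_x():                                       # binary features in {0,1}^d
    return rng.integers(0, 2, size=d)

def m(x):                                        # toy sparse score function m : X -> [0,1]^k
    rel = x[:10]                                 # only the first 10 coordinates matter
    return (W @ rel) / W.sum(axis=1)

def E():                                         # bounded zero-mean regression noise
    return rng.uniform(-0.05, 0.05, size=k)

def M(x, y):                                     # erase every label independently w.p. 0.2
    keep = rng.random(k) < 0.8
    return np.where(keep, y, np.nan)

x, sigma = example_oracle(m, D_x, noise=E, mechanism=M)
print(sigma)    # an (incomplete) ranking of the surviving labels
```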
We note that the median problem is defined w.r.t. DR (over complete rankings) but the learner receives examples from O (which may correspond to incomplete rankings). We study the distribution-free nonparametric LR task from either theoretical or experimental viewpoints in three cases: Noiseless Oracle with Complete Rankings. In this setting, we draw samples from Ex(m) (i.e., Ex(m, E) with ξ = 0). For this case, we resolve Problem 1 (under mild assumptions) and provide theoretical guarantees for efficient algorithms that use decision trees and random forests, built greedily based on the CART empirical MSE criterion, to interpolate the correct ranking hypothesis. This class of algorithms is widely used in applied LR but theoretical guarantees were missing. For the analysis, we adopt the labelwise decomposition technique (Cheng et al., 2013), where we generate one decision tree (or random forest) for each position of the ranking. We underline that decision trees and random forests are the state-of-the-art techniques for LR. Contribution 1: We provide the first theoretical performance guarantees for these algorithms for Label Ranking under mild conditions. We believe that our analysis and the identification of these conditions contributes towards a better understanding of the practical success of these algorithms. Noisy Oracle with Complete Rankings. We next replace the noiseless oracle of Problem 1 with Ex(m, E). In this noisy setting, the problem becomes challenging for theoretical analysis; we provide experimental evaluation aiming to quantify how noise affects the capability of decision trees and random forests to interpolate the true hypothesis. Contribution 2: Our experimental evaluation demonstrates that random forests and shallow decisions trees are robust to noise, not only in our noisy setting, but also in standard LR benchmarks. Noisy Oracle with Incomplete Rankings. We consider the oracle Ex(m, E, M) with incomplete rankings. We resolve Problem 2 for the Kendall tau distance, as in previous works (so, we resolve it for the weaker oracles too). Now, the learner is agnostic to the positions of the elements in the incomplete ranking and so labelwise decomposition cannot be applied. Using pairwise decomposition, we compute a ranking predictor that achieves low misclassification error compared to the optimal classifier h and obtain sample complexity bounds for this task. Contribution 3: Building on the seminal results of (Korba et al., 2017; Cl emenc on et al., 2018; Cl emenc on & Korba, 2018; Cl emenc on & Vogel, 2020), we give results for Problem 2 for incomplete rankings under appropriate conditions. 1.2. Related Work LR has received significant attention over the years (Shalev-Shwartz, 2007; H ullermeier et al., 2008; Cheng Label Ranking through Nonparametric Regression & H ullermeier, 2008; Har-Peled et al., 2003), due to the large number of practical applications. There are multiple approaches for tackling this problem (see Vembu & G artner, 2010; Zhou et al., 2014b, and the references therein). Some of them are based on probabilistic models (Cheng & H ullermeier, 2008; Cheng et al., 2010; Grbovic et al., 2012; Zhou et al., 2014a). Others are tree and ensemble based, such as adaption of decision trees (Cheng et al., 2009), entropy based ranking trees and forests (de S a et al., 2017), bagging techniques (Aledo et al., 2017), random forests (Zhou & Qiu, 2018), boosting (Dery & Shmueli, 2020), achieving highly competitive results. 
There are also works focusing on supervised clustering (Grbovic et al., 2013). Decomposition techniques are closely related to our work; they mainly transform the LR problem into simpler problems, e.g., binary or multiclass ones (Hüllermeier et al., 2008; Cheng & Hüllermeier, 2012; Cheng et al., 2013; Cheng & Hüllermeier, 2013; Gurrieri et al., 2014).

Comparison to Previous Work. To the best of our knowledge, there is no previous theoretical work focusing on the computational complexity of LR (a.k.a. Problem 1). However, there are many important works that adopt a statistical viewpoint. Closest to ours are the following seminal works on the statistical analysis of LR: Korba et al. (2017); Clémençon et al. (2018); Clémençon & Vogel (2020). Korba et al. (2017) introduced the statistical framework of consensus ranking (the unsupervised analogue of Problem 2) and identified crucial properties of the underlying distribution that yield fast learning rate bounds for empirical estimators. A crucial contribution of that work (which we also make use of) is the proof that, when Strict Stochastic Transitivity holds, the Kemeny median (the solution of Problem 2 under the KT distance) is unique and has a closed form. Problem 2 was introduced by Clémençon et al. (2018), where the authors provide fast rates (under standard conditions) when the learner observes complete rankings, which reveal the relative order, but not the positions, of the labels in the correct ranking. The work of Clémençon & Vogel (2020) provides a novel multiclass classification approach to Label Ranking, where the learner observes the top label with some noise, i.e., observes only the partial information σ_x^{-1}(1) in the presence of noise, in the form of the random label y assigned to x. Our contribution concerning Problem 2 is a natural follow-up of these works, where the learner observes noisy incomplete rankings (and so has only information about the relative order of the alternatives). Our solution for Problem 2 crucially relies on the conditions and techniques developed in (Korba et al., 2017; Clémençon et al., 2018; Clémençon & Vogel, 2020). In our setting, we have to modify the key conditions in order to handle incomplete rankings. Finally, our labelwise decomposition approach to Problem 1 is closely related to Korba et al. (2018), where many embeddings for ranking data are discussed.

Nonparametric Regression and CART. Regression trees constitute a fundamental approach to nonparametric regression. Our work is closely related to that of Syrgkanis & Zampetakis (2020), which shows that trees and forests, built greedily based on the CART empirical MSE criterion, provably adapt to sparsity in the high-dimensional regime. Specifically, Syrgkanis & Zampetakis (2020) analyze two greedy tree algorithms (both can be found in Appendix D): (a) in the Level-Splits variant, in each level of the tree, the same variable is greedily chosen at all nodes in order to maximize the overall variance reduction; (b) in Breiman's variant, which is the most popular in practice, the choice of the next variable to split on is decided locally at each node of the tree. In general, regression trees (Breiman et al., 1984) and random forests (Breiman, 2001) are among the most widely used estimation methods by ML practitioners (Loh, 2011; Louppe, 2014). For further literature review and preliminaries on decision trees and random forests, we refer to Appendix D.

Multiclass Prediction.
In multiclass prediction with k labels, there are various techniques such as One-versus-All and One-versus-One (see Shalev-Shwartz & Ben-David, 2014). We adopt the OVO approach for Problem 2, where we consider k 2 binary sub-problems (Hastie & Tibshirani, 1998; Moreira & Mayoraz, 1998; Allwein et al., 2000; F urnkranz, 2002; Wu et al., 2004) and we combine the binary predictions. A similar approach was employed for a variant of Problem 2 by Cl emenc on & Vogel (2020). 1.3. Notation For vectors, we use lowercase bold letters x; let xi be the ith coordinate of x. We write poly to denote that the degree of the polynomial depends on the subscripted parameters. Also, e O( ) is used to hide logarithmic factors. We denote the symmetric group over k elements with Sk and S k for incomplete rankings. For i [k], we let σ(i) denote the position of the i-th alternative. As distance metrics we use the Kendall Tau (KT) distance d KT (π, σ) = X i 0 (resp. β < 1), the structure of the problem changes and our theoretical guarantees fail. Interestingly, this is due to the fact that the geometry of the input noise is structurally different from the observed output. The input noise acts additively to the vector m(x), while the output is computed by permuting the elements. Hence, the relation between the observed ranking and the expected one is no more linear and hence one needs to extend the standard additive nonparametric regression setting y = f(x) + ξ to another geometry dealing with rankings σ = Eθ f(x), where f : X Sk is the regression function and Eθ : Sk Sk is a parameterized noise operator (e.g., a Mallows model). This change of geometry is interesting and to the best of our knowledge cannot be captured by existing theoretical results (see Appendix B for the experimental results of different additive nonparametric noise settings). Our experimental results aim to complement and go beyond our theoretical understanding of the capability of the decision trees and random forests to interpolate the correct underlying regression function with the presence of regression noise. To this end, we consider the oracle Ex(m, E) where the noise distribution E satisfies either property (i) or (ii) of Definition 2.2. For the experimental evaluation, two synthetic data set families were used, namely LFN (Large Features Number) and SFN (Small Features Number), consisting of a single noiseless and 50 noisy data sets, respectively. For either data set family, a common m : {0, 1}d [0, 1]k was employed, accordingly. Each noiseless data set was created according to the oracle Ex(m). It consists of 10000 samples (x, σ), where x {0, 1}d (d = 1000 for LFN and d = 100 for SFN) and σ S5 (k = 5), with r = 10 informative binary features per label (sparsity). The noisy data sets were Label Ranking through Nonparametric Regression 0.0 0.1 0.2 0.3 0.4 α [0.0, 1.0] mean kτ [ 1.0, 1.0] random forests (LFN) random forests (SFN) decision trees (LFN) decision trees (SFN) shallow trees (LFN) shallow trees (SFN) 0.75 0.80 0.85 0.90 0.95 1.00 β [ 1.0, 1.0] mean kτ [ 1.0, 1.0] random forests (LFN) random forests (SFN) decision trees (LFN) decision trees (SFN) shallow trees (LFN) shallow trees (SFN) Figure 1. Illustration of the experimental results in terms of mean kτ for different noise distributions E with respect to (a) α-inconsistency; (b) β-kτ gap. produced according to the generative process Ex(m, E), each using a different zero-mean noise distribution E. We implemented modified versions of Algorithm 1. 
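A data-generation sketch in the spirit of the LFN/SFN construction above is shown below; the concrete form of the r-sparse score function and the uniform noise are assumptions made only for illustration, since the exact construction is detailed in Appendix B and in the released code.

```python
import numpy as np

def make_lr_dataset(n=10000, d=1000, k=5, r=10, noise_scale=0.0, seed=0):
    """Synthetic Label Ranking data in the spirit of the LFN/SFN families:
    binary features, an r-sparse score function per label, and rankings
    obtained by sorting the (possibly noisy) scores."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, d))

    # r informative coordinates and random weights for each of the k labels
    relevant = np.stack([rng.choice(d, size=r, replace=False) for _ in range(k)])
    weights = rng.uniform(0.0, 1.0, size=(k, r))

    # m_i(x) in [0, 1]: a weighted average of the r relevant binary features
    scores = np.stack(
        [X[:, relevant[i]] @ weights[i] / weights[i].sum() for i in range(k)],
        axis=1,
    )
    noisy = scores + rng.uniform(-noise_scale, noise_scale, size=scores.shape)
    rankings = np.argsort(-noisy, axis=1)   # labels ordered by decreasing score
    return X, rankings, scores

# An LFN-like (d = 1000) noiseless data set and an SFN-like (d = 100) noisy one.
X_lfn, sigma_lfn, _ = make_lr_dataset(d=1000, noise_scale=0.0)
X_sfn, sigma_sfn, _ = make_lr_dataset(d=100, noise_scale=0.1, seed=1)
```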
The shallow trees ( , ), fully grown decision trees ( , ) and random forests ( , ) were built greedily based on the CART empirical MSE criterion, using the Breiman s method instead of Level Splits. Figure 1 summarizes the experimental results for different values of α [0, 1] and β [ 1, 1]. Results are obtained in terms of mean kτ, using the noisy data as training set and noiseless data as validation set. The y axis of Figure 1 depicts the mean kτ of all the corresponding pairs of the output of our model and the given noiseless ranking in the test sets. At this point we remind the reader that kτ is a normalization of d KT in [ 1, 1] and measures the proportion of the concordant pairs in two rankings, with kτ = 1 meaning that the two ranking have perfect agreement (i.e., the two rankings are exactly the same), while kτ = 1 that the two rankings have perfect disagreement. As expected, decision trees as well as random forests interpolate the m function successfully in the noiseless setting, since it is r-sparse. Moreover, the increase of noise level leads to the decay of the decision trees performance, indicated by the α-inconsistency. However, the β-kτ gap is a more appropriate noise level measure, because it quantifies the degree of deviation rather than the existence of it. The ratio of the performance in terms of mean kτ over β-kτ gap is approximately equal to one, revealing that decision trees fit the noise. On the contrary, shallow trees have better ability to generalize and avoid overfitting. Fully grown honest random forests are also resistant to overfitting due to bagging, and therefore are noise tolerant. This is visible in Figure 1 where the graphs of the random forests and of the shallow trees attain a mean KT coefficient close to 1 for most values of α-inconsistency and β-kτ gap. Appendix B presents the experimental setting in more detail and additional experimental results of LR standard benchmarks and different additive nonparametric noise settings. The code and data sets to reproduce our results are available: https://github.com/ pseleni/LR-nonparametric-regression. Remark 2.3. We have encountered Problem 1 with the oracles Ex(m) and Ex(m, E). A natural question is what happens if we use the incomplete oracle Ex(m, E, M). In this setting, the position of each alternative is not correctly observed (we observe a ranking with size ℓ k, see Section 2.3). Hence, we cannot apply our labelwise method for this oracle. One could obtain similar results for Problem 1 for incomplete rankings using pairwise decomposition, but we leave it for future work. Next we focus on Problem 2. 2.3. Noisy Oracle with Incomplete Rankings We now study Problem 2, where we consider a metric d in Sk, a distribution DR over X Sk that corresponds to the example oracle Ex(m, E) and set up the task of finding a measurable mapping h : X Sk that minimizes the objective R(h) = E(x,σ) DR[d(h(x), σ)]. In this work, we focus on the Kendall tau distance (d = d KT ) and ask how well we can estimate the minimizer of the above population objective if we observe i.i.d. samples from the incomplete rankings oracle Ex(m, E, M). We underline that in what follows whenever we refer to Problem 2, we have d = d KT in mind. A natural question is: What is the optimal solution? In binary classification, the learner aims to estimate the Bayes classifier, since it is known to minimize the misclassification error among all classifiers (Massart & N ed elec, 2006). 
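For reference, with regression function η(x) = Pr[y = +1 | x], the Bayes classifier and the standard excess-risk identity (as restated in Appendix A.3) read:

```latex
% Binary classification with labels {-1,+1} and regression function
% \eta(x) = \Pr[y = +1 \mid x].  Bayes classifier and excess-risk identity,
% applied later to every pairwise sub-problem:
g^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\},
\qquad
L(g) - L(g^*) = 2\,\mathbb{E}_{x \sim D_x}\big[\, |\eta(x) - 1/2| \,\mathbf{1}\{g(x) \neq g^*(x)\} \,\big].
```

The analysis of Problem 2 applies this identity to each pairwise sub-problem.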
Problem 2 deals with rankings and is well-studied in previous works when: d is the Kendall tau distance and Label Ranking through Nonparametric Regression the learner either observes complete rankings (Cl emenc on et al., 2018) or observes only the top element (under some BTL noise) (Cl emenc on & Vogel, 2020). As we will see later, the optimal solution h of Problem 2 is unique under mild conditions on DR, due to Korba et al. (2017). Our goal will be to estimate h from labeled examples generated by Ex(m, E, M). We are going to introduce the example oracle Ex(m, E, M): We are interested in the case where the mechanism M generates incomplete rankings and captures a general spectrum of ways to generate such rankings (e.g., H ullermeier et al., 2008). We begin with the incomplete rankings mechanism M. We assume that there exists a survival probabilities vector q : X [0, 1]k, which is feature-dependent, i.e., the vector q depends on the input x X. Hence, for the example x and alternative i [k], with probability qi(x), we set the score yi equal to a noisy value of the score mi(x) (the alternative i survives) and, otherwise, we set yi = . We mention that the events of observing i and j are not necessarily independent and so do not necessarily occur with probability qi(x)qj(x). We denote the probability of the event Observe both i and j in x by qi,j(x). We modify the argsort : [0, 1]k Sk routine so that it will ignore the symbol in the ranking, e.g., argsort(0.4, , 0.7, , 0.1) = (c a e). Crucially, we remark that another variant would preserve the symbols: this problem is easier since it reveals the correct position of the non-erased alternatives. In our model, the information about the location of each alternative is not preserved. In order to model regression noise, we consider a noise distribution E over the bounded cube [ 1/4, 1/4]k. Hence, we model Ex(m, E, M) as follows: Definition 2.4 (Generative Process for Incomplete Data). Consider an underlying score hypothesis m : X [0, 1]k and let Dx be a distribution over features. Let y : [k] [0, 1] { } and consider the survival probabilities vector q : X [0, 1]k. Each sample (x, σ) Dq R is generated as follows: (i) Draw x X from Dx and ξ [ 1 from E; (ii) draw q(x)-biased coins c { 1, +1}k; (iii) if ci > 0, set yi = mi(x) + ξi, else yi = ; (iv) compute σ = argsort(y), ignoring the symbol. In what follows, we resolve Problem 2 for the example oracle of Definition 2.4 and d = d KT . As in the complete ranking case, Definition 2.4 imposes restrictions neither on the structure of the true score hypothesis m nor on the noise distribution E. In order to resolve Problem 2, we assume the following. Recall that qi,j(x) is the probability of the event Observe both i and j in x . Condition 2. Let pij(x) = Prξ E[mi(x) + ξi > mj(x) + ξj|x] for x Rd. For any 1 i < j k, we assume that the following hold. 1. (Strict Stochastic Transitivity) For any x Rd and any u [k], we have that pij(x) = 1/2 and (piu(x) > 1/2 puj(x) > 1/2) pij(x) > 1/2. 2. (Tsybakov s Noise Condition) There exists a [0, 1] and B > 0 so that the probability that a random feature x Dx satisfies |pij(x) 1/2| < 2t , is at most B ta/(1 a) for all t 0. 3. (Deletion Tolerance) There exists φ (0, 1] so that qi,j(x) φ for any x Rd, where qi,j(x) is the survival probability of the pair i < j in x. Importance of Item 1: Our goal is to find a good estimate for the minimizer h of R(h) = E(x,σ) DR[d KT (h(x), σ)]. 
As observed in previous works (Korba et al., 2017; Cl emenc on et al., 2018), this problem admits a unique solution (with closed form) under mild assumptions on DR and specifically this is assured by Strict Stochastic Transitivity (SST) of the pairwise probabilities pij(x). The first condition guarantees (Korba et al., 2017) that the minimizer of R(h) is almost surely unique and is given, with probability one and for any i [k], by h (x; i) = 1 + X j =i 1{pij(x) < 1/2} . (2) This is the well-known Copeland rule and we note that SST is satisfied by most natural probabilistic ranking models. Importance of Item 2: The Tsybakov s noise condition is standard in binary classification and corresponds to a realistic noise model (Boucheron et al., 2005). In its standard form, this noise condition naturally requires that the regression function of a binary problem η(x) = E[Y |X = x] = 2 Pr[Y = +1|X = x] 1 is close to the the critical value 0 with low probability over the features, i.e., the labels are not completely random for a sufficiently large portion of the feature space. We consider that, for any two alternatives i = j, this condition must be satisfied by the true score function m and the noise distribution E (specifically the functions mi and mj and the random variables ξi, ξj). This condition is very common in binary classification since it guarantees fast rates and has been previously applied to LR (Cl emenc on & Vogel, 2020). Importance of Item 3: The last condition is natural in the sense that we need to observe the pair (i, j) at some frequency in order to achieve some meaningful results. Variants of this condition have already appeared in the incomplete rankings literature (see e.g., Fotakis et al., 2021) and in previous works in Label Ranking (see Cl emenc on & Vogel, 2020). We note that if we relax this condition to state that we only observe the pair i, j only in a portion of Rd (e.g., qij(x) = 0 for 40% of x s), then we will probably miss a crucial part of the structure of the underlying mapping. This intuitively justifies the reason that we need deletion tolerance to hold for any x Rd. Label Ranking through Nonparametric Regression Having described our conditions, we continue with our approach. We first remind the reader that the labelwise perspective we adopted in the complete case now fails. Second, since our data are incomplete, the learner cannot recover the optimal ranking rule h = argminh E[d KT (h(x), σ)] by simply minimizing an empirical version of this objective. To tackle this problem, we adopt a pairwise comparisons approach. In fact, the key idea is the closed form of the optimal ranking rule: From (2), we can write h (x; i) = 1 + P j =i 1{h ij(x) = 1} where h ij is the Bayes optimal classifier of the binary sub-problem of the pair i = j. We provide our result based on the standard One-Versus-One (OVO) approach, reducing this complex problem into multiple binary ones: We reduce the ranking problem into O(k2) binary sub-problems and each subproblem corresponds to a pairwise comparison between the alternatives i and j for any 1 i < j k. We solve each sub-problem separately by obtaining the Empirical Risk Minimizer bhij (for a fixed VC class G, as in the previous works) whose risk is compared to the optimal h ij and then we aggregate the k 2 binary classifiers into a single output hypothesis bh : Rd Sk. We compare the generalization of this empirical estimate bh with the optimal predictor h of (2). We set Lij(g) = E[g(x) = sgn(σ(i) σ(j))|σ {i, j}] where (x, σ) Dq R. 
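The pipeline just described can be sketched as follows. Logistic regression is a hypothetical stand-in for the ERM over a VC class G, and the aggregation counts pairwise wins, which is equivalent to the position-based score of Equation (2) (one plus the number of pairwise losses) up to the sorting direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pairwise_ovo(X, rankings, k):
    """One-versus-one reduction for (possibly incomplete) rankings.

    rankings[n] lists, in order of preference, the labels observed for
    example n (missing labels simply do not appear).  For every pair i < j
    we keep the examples in which both labels survive and record which one
    is preferred.  For brevity we assume each pairwise dataset contains
    both outcomes."""
    X = np.asarray(X)
    classifiers = {}
    for i in range(k):
        for j in range(i + 1, k):
            rows, labels = [], []
            for n, sigma in enumerate(rankings):
                sigma = list(sigma)
                if i in sigma and j in sigma:
                    rows.append(n)
                    # +1 if i appears before (is preferred to) j, else -1
                    labels.append(+1 if sigma.index(i) < sigma.index(j) else -1)
            classifiers[(i, j)] = LogisticRegression().fit(X[rows], labels)
    return classifiers

def predict_ranking(classifiers, X, k):
    """Aggregate the pairwise predictions into a ranking; counting wins and
    sorting decreasingly matches the Copeland-style rule of Eq. (2)."""
    X = np.asarray(X)
    wins = np.zeros((X.shape[0], k))
    for (i, j), clf in classifiers.items():
        pred = clf.predict(X)            # +1 means "i beats j"
        wins[:, i] += (pred == +1)
        wins[:, j] += (pred == -1)
    return np.argsort(-wins, axis=1)     # ties broken arbitrarily by argsort
```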
Our main result in this setting follows. Theorem 2.5 (Noisy and Incomplete LR). Let ϵ, δ (0, 1). Consider a hypothesis class G of binary classifiers with finite VC dimension. Under Condition 2 with parameters a, B, φ, there exists an algorithm (Algorithm 2) that draws a polya(φ ε) max log k samples from Dq R, as in Definition 2.4, and computes an estimate bh : Rd Sk so that Prx Dx[bh(x) = h (x)] is, with probability 1 δ, at most inf g G Li,j(g) L i,j where h is the optimal predictor of (2) and L i,j is the loss of the binary Bayes classifiers h i,j for 1 i < j k, where Ca,B is a constant depending on a, B. Our result for incomplete rankings is a PAC result, in the sense that we guarantee that, when optimizing over a VC class G, the gap between the empirical estimate (the algorithm s output) and the optimal predictor of (2) is at most C OPT + rn(δ), where OPT is (a function of) the gap between the best classifier in the class (argming G L(g)) and the Bayes classifier and rn(δ) is a function which tends to 0 as the number of samples n increases (see (3)). We remark that the algorithm does not come with a computational efficiency guarantee, since the results are based on the computation of the ERM of each pairwise comparison i < j. In general, this is NP-hard but if the binary hypothesis class G is simple then we also obtain computational guarantees. Algorithm 2 Algorithm of Theorem 2.5 1: T n i.i.d. samples (x, σ) Dq R (as in Theorem 2.5) 2: For any i = j, set Tij = 3: for 1 i < j k do 4: if (x, σ) T and σ {i, j} then 5: Add (x, sgn(σ(i) σ(j))) to Tij 6: endif 7: endfor 8: bs Estimate Aggregate(Tij for i < j, G) 9: On input x Rd, output argsort(bs(x)) breaking arbitrarily possible ties. Algorithm 2 works as follows: Given a training set T of the form (x(i), σ(i)) with incomplete rankings, the algorithm creates k 2 datasets Tij with the following criterion: For any i < j, if (x, σ) T and σ {i, j}, the algorithm adds to the dataset Tij the example (x, sgn(σ(i) σ(j))). For any such binary dataset, the algorithm computes the ERM and aggregates the estimates to bs (these routines can be found as Algorithm 6). This aggregate rule is based on the structure of the optimal classifier h (that is valid due to the SST condition). The final estimator is the function bh that, on input x Rd, outputs the ranking bh(x) = argsort(bs(x)) (by breaking ties randomly). Remark 2.6. (i) The oracle Ex(m, E, M) and Definition 2.4 can be adapted to handle partial rankings (see Appendix C). (ii) Theorem 2.4 directly controls the risk gap R(bh) R(h ) Ex[d KT (bh(x), h (x))], since d KT (π, σ) k21{π = σ}. (iii) We studied Problem 2 for Ex(m, E, M). Our results can be transferred to the oracles Ex(m, E) and Ex(m) under Condition 2 with φ = 1. (iv) This result is similar to Cl emenc on & Vogel (2020), where the learner observes (x, y) where y = σ 1 x (1) is the top-label. To adapt our incomplete setting to theirs, we must erase positions instead of alternatives. If eqi is the probability that the i-th position survives, then we have that, in Cl emenc on & Vogel (2020): eq1(x) = 1 and eqi =1(x) = 0 for all x Rd. Also, the generative processes of the two works are different (noisy nonparametric regression vs. Plackett-Luce based models). In general, our results and our analysis for Problem 2 are very closely related and rely on the techniques of Cl emenc on & Vogel (2020). 3. 
Technical Overview Proof Sketch of Theorem 2.1 Our starting point is the work of Syrgkanis & Zampetakis (2020), where they pro- Label Ranking through Nonparametric Regression vide a collection of nonparametric regression algorithms based on decision trees and random forests. We have to provide a vector-valued extension of these results. For decision trees, we show (see Theorem A.3) that if the learner observes i.i.d. samples (x, m(x) + ξ) for some unknown target m satisfying Condition 1 with r, C > 0, then there exists an algorithm ALGO that uses n = e O log(d) poly C,r(k Cr/ε) samples and computes an estimate m(n) which satisfies Ex Dx h m(x) m(n)(x) 2 2 i ε with probability 99%. In our setting, we receive examples from Ex(m) and set h(x) = argsort(m(x)). As described in Algorithm 1, we design the training set T = {x(i), y(i)}, where y(i) = (h(x(i); j))j [k] (we make the ranking h(x) a vector). We provide T as input to ALGO (with ξ = 0) and get a vector-valued estimation m(n) that approximates the vector (h( ; 1), ..., h( ; k)). Our next goal is to convert the estimate m(n) to a ranking by setting bh = argsort m(n). In Theorem A.4, we show that rounding our estimations back to permutations will yield bounds for the expected Spearman distance Ex Dx[d2(h(x),bh(x))]. Proof Sketch of Theorem 2.5 Consider the VC class G consisting of mappings g : Rd { 1, +1}. Let Gi,j = {gi,j : Rd { 1, +1}} be a copy of G for the pair (i, j). We let bgi,j and g i,j be the algorithm s empirical classifier and the Bayes classifier respectively for the pair (i, j). The first key step is that the SST property (which holds thanks to Item 1) implies that the optimal ranking predictor h is unique almost surely and satisfies (2). Thanks to the structure of the optimal solution, we can compute the score estimates bs(x; i) = 1 + P j =i 1{bgi,j(x) = 1} for any i [k] and we will set bh to be argsort bs. We first show that Prx Dx[bh(x) = h (x)] P i 0, under the score generative process of Definition A.2 with underlying score hypothesis m : {0, 1}d [0, 1]k and given i.i.d. data (x, y) D, the following hold: 1. There exists an algorithm (Decision Trees via Level-Splits - Algorithm 3) with set of splits Sn that computes a score estimate m(n) which satisfies Pr (x1,y1),...,(xn,yn) Dn m(x) m(n)(x; Sn) 2 and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse as per Definition A.1 and under the C-submodularity condition (mi and Dx satisfy Condition 3 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+2 (Cr/ε)Cr+2 samples and set the number of splits to be log(t) = Cr Cr+2(log(n) log(log(d/δ))). (b) If, additionally to 1.(a), the marginal over the feature vectors Dx is a Boolean product probability distribution, it suffices to draw n = e O log(dk/δ) 2r k2 (C/ε)2 samples and set the number of splits to be log(t) = r. 2. There exists an algorithm (Decision Trees via Breiman - Algorithm 4) that computes a score estimate m(n) which satisfies Pr (x1,y1),...,(xn,yn) Dn m(x) m(n)(x; Pn) 2 and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse and under the C-approximate diminishing returns condition (mi and Dx satisfy Condition 4 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+3 (Cr/ε)Cr+3 samples and set log(t) Cr Cr+3(log(n) log(log(d/δ))). 
Label Ranking through Nonparametric Regression (b) If, additionally to 2.(a), the distribution Dx is a Boolean product distribution, it suffices to draw n = e O log(dk/δ) k3 C2 2r/ε3 samples and set log(t) r. The running time of the algorithms is poly C,r(d, k, 1/ϵ). Algorithm 3 Level-Splits Algorithm for Score Learning 1: Input: Access to i.i.d. examples of the form (x, y) D. 2: Model: y = m(x) + ξ (Definition A.2) with m : {0, 1}d [0, 1]k. 3: Output: An estimate m(n)( ; Sn) that, with probability 1 δ, satisfies m(x) m(n)(x; Sn) 2 4: Learn Score(ε, δ): 5: Draw n = eΘ(log(dk/δ)(Crk/ε)O(Cr)) samples from D {Under sparsity and Condition 3.} 6: D(n) {x(j), y(j)}j [n] 7: output Learn Score-LS(D(n)) 8: Learn Score-LS (D(n)): 9: Set log(t) = Θ(r log(rk)) 10: Create k datasets Di = {(x(j), y(j) i )}j [n] 11: for i [k] do 12: m(n) i , S(i) n = Level Splits-Algo(0, Di, log(t)) {Call Algorithm 9.} 13: endfor 14: Output m(n)( ; S(1) n , ..., S(k) n ) = (m(n) 1 ( ; S(1) n ), ..., m(n) k ( ; S(k) n )) Algorithm 4 Breiman s Algorithm for Score Learning 1: Input: Access to i.i.d. examples of the form (x, y) D. 2: Model: y = m(x) + ξ (Definition A.2) with m : {0, 1}d [0, 1]k. 3: Output: An estimate m(n)( ; Sn) that, with probability 1 δ, satisfies m(x) m(n)(x; Pn) 2 4: Learn Score (ε, δ): 5: Draw n = eΘ log(dk/δ)(Crk/ε)O(Cr) samples from D {Under sparsity and Condition 4.} 6: D(n) {x(j), y(j)}j [n] 7: output Learn Score-Breiman(D(n)) 8: Learn Score-Breiman (D(n)): 9: Set log(t) = Θ(r log(rk)) 10: Create k datasets Di = {(x(j), y(j) i )}j [n] 11: for i [k] do 12: m(n) i , P (i) n = Breiman-Algo(0, Di, log(t)) {Call Algorithm 10.} 13: endfor 14: Output m(n)( ; P (1) n , ..., P (k) n ) = (m(n) 1 ( ; P (1) n ), ..., m(n) k ( ; P (k) n )) Proof. (of Theorem A.3) Let us set J = [1/4, 3/4] and let m : {0, 1}d Jk be the underlying score vector hypothesis and consider a training set with n samples of the form (x, y) {0, 1}d Jk with law D, generated as in Definition A.2. Label Ranking through Nonparametric Regression We decompose the mapping as m(x) = (m1(x), . . . , mk(x)) and aim to learn each function mi : {0, 1}d J separately. Note that since m is r-sparse, then any mi is r-sparse for any i [k]. We observe that each sample of Definition A.2 can be equivalently generated as follows: 1. x {0, 1}d is drawn from Dx, 2. For each i [k] : (a) Draw ξ [ 1/4, 1/4] from the zero mean distribution marginal Ei. (b) Compute yi = mi(x) + ξ. 3. Output (x, y), where y = (yi)i [k]. In order to estimate the coordinate i [k], i.e., the function mi : {0, 1}d J, we have to make use of the samples (x, yi) {0, 1}d [0, 1]. We have that m(x) m(n)(x; Sn) 2 i [k] E x Dx h (mi(x) m(n) i (x; Sn))2i > ε Pr[ i [k] : Bi] , where we consider the events Bi = E x Dx mi(x) m(n) i (x; Sn i ) 2 > ε/k , for any i [k], whose randomness lies in the random variables used to construct the empirical estimate mn i ( ; Sn i ) = mn i ( ; Sn i , (x(1), y(1) i ), . . . , (x(n), y(n) i )). We note that we have to split the dataset with examples (x, y) into k datasets (x, yi) and execute each sub-routine with parameters (ϵ/k, δ/k). We now turn to the sample complexity guarantees. Let us begin with the Level-Splits Algorithm. Case 1a. If each mi is r-sparse and under the submodularity condition, by Theorem D.2 with f = mi, we have that Pr[Bi] d exp( n/(Cr k/ε)Cr+2) . By the union bound, we have that Pr[ i [k] : Bi] X i [k] Pr[Bi] . 
In order to make this probability at most δ, it suffices to make the probability of the bad event Bi at most δ/k, and, so, it suffices to draw n = e O(log(dk/δ) (Crk/ε)Cr+2) . Case 1b. If each mi is r-sparse and under the submodularity and the independence of features conditions, by Theorem D.2 with f = mi, we have that Pr[Bi] d exp( n/(2r(Ck/ε)2)) . By the union bound and in order to make the probability Pr[ i [k] : Bi] at most δ, it suffices to draw n = e O log(dk/δ) 2r (Ck/ε)2 . Hence, in each one of the above scenarios, we have that m(x) m(n)(x; S1 n, ..., Sk n) 2 We proceed with the Breiman Algorithm. Label Ranking through Nonparametric Regression Case 2a. If each mi is r-sparse and under the approximate diminishing returns condition (Condition 4), by Theorem D.3 with f = mi, we have that Pr[Bi] d exp( n/(Cr k/ε)Cr+3) . By the union bound, we have that Pr[ i [k] : Bi] X i [k] Pr[Bi] . In order to make this probability at most δ, it suffices to make the probability of the bad event Bi at most δ/k, and, so, it suffices to draw n = e O(log(dk/δ) (Cr k/ε)Cr+3) . Case 2b. If each mi is r-sparse and under the approximate diminishing returns condition (Condition 4) and the independence of features conditions, by Theorem D.3 with f = mi, we have that Pr[Bi] d exp( nε3/(k3 C22r)) . By the union bound and in order to make the probability Pr[ i [k] : Bi] at most δ, it suffices to draw n = e O(log(dk/δ) k3C22r/ε3) . Hence, in each one of the above scenarios, we have that m(x) m(n)(x; P 1 n, ..., P k n) 2 A.1.3. MAIN RESULT FOR NOISELESS LR WITH DECISION TREES We are now ready to address Problem 1 for the oracle Ex(m). Our main theorem follows. We comment that Theorem 2.1 corresponds to the upcoming case 1(a). Theorem A.4 (Label Ranking with Decision Trees). Consider the example oracle Ex(m) of Definition 1.1 with underlying score hypothesis m : {0, 1}d [0, 1]k, where k N is the number of labels. Given i.i.d. data (x, σ) DR, the following hold for any ε > 0 and δ > 0: 1. There exists an algorithm (Decision Trees via Level-Splits - Algorithm 5) with set of splits Sn that computes an estimate h(n)( ; Sn) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n)(x; Sn)) i > ϵ k2 δ , and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse as per Definition A.1 and under the C-submodularity condition (mi and Dx satisfy Condition 3 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+2 (Cr/ε)Cr+2 samples and set the number of splits to be log(t) = Cr Cr+2(log(n) log(log(d/δ))). (b) If, additionally to 1.(a), the distribution Dx is a Boolean product distribution, it suffices to draw n = e O log(dk/δ) 2r k2 (C/ε)2 samples and set the number of splits to be log(t) = r. Label Ranking through Nonparametric Regression 2. There exists an algorithm (Decision Tress via Breiman - Algorithm 5) with Pn that computes an estimate h(n)( ; Pn) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n)(x; Pn)) i > ϵ k2 δ , and for the number of samples n and the number of splits log(t), we have that: (a) If m is r-sparse and under the C-approximate diminishing returns condition (mi and Dx satisfy Condition 4 for each alternative i [k]), it suffices to draw n = e O log(dk/δ) k Cr+3 (Cr/ε)C r+3 samples and set log(t) Cr Cr+3(log(n) log(log(d/δ))). (b) If, additionally to 2.(a), the distribution Dx is a Boolean product distribution, it suffices to draw n = e O log(dk/δ) k3 C2 2r/ε3 samples and set log(t) r. 
The running time of the algorithms is poly C,r(d, k, 1/ϵ). Algorithm 5 Algorithms for Label Ranking with Complete Rankings 1: Input: Access to i.i.d. examples of the form (x, σ) DR. 2: Model: Oracle Ex(m) with m : {0, 1}d [0, 1]k and h(x) = argsort(m(x)). (Definition 1.1) 3: Output: An estimate h(n)( ; Sn). 4: Label Rank(ε, δ): 5: Level-Splits Case 6: Draw n = eΘ(log(dk/δ)(rk/ε)r) samples from DR {Under sparsity and Condition 3.} 7: For any j [n], compute y(j) m C(σ(j)) {See Equation (6).} 8: D(n) {x(j), y(j)}j [n] 9: output argsort(Learn Score-LS(D(n))) {Call Algorithm 3.} 10: Breiman Case 11: Draw n = eΘ (log(dk/δ)(rk/ε)r) samples from D {Under sparsity and Condition 4.} 12: For any j [n], compute y(j) m C(σ(j)) {See Equation (6).} 13: D(n) {x(j), y(j)}j [n] 14: output argsort( Learn Score-Breiman(D(n))) {Call Algorithm 4.} Proof. (of Theorem A.4) We let the mapping m C : Sk [0, 1]k be the canonical representation of a ranking, i.e., for σ Sk, we define m C(σ) = (σ(i)/k)i [k] . (6) We reduce this problem to a score problem: for any sample (x, σ) = (x, argsort(m(x))) DR, we create the tuple (x, y ) = (x, m C(σ)), where m C is the canonical representation of the ranking σ. Hence, any permutation of length k is mapped to a vector whose entries are integer multiples of 1/k. Let us fix x {0, 1}d. So, the tuple (x, y ) falls under the setting of the score variant of the regression setting of Definition D.1 with regression function equal to m (which is equal to m C argsort m) and noise vector ξ = 0, i.e., y = m (x) + ξ . Recall that our goal is to use the transformed samples (x, y ) in order to estimate the true label ranking mapping h : X Sk. Let us set h (n) be the label ranking estimate using n samples. We will show that h (n) = argsort(m (n)) is close to h = argsort(m ) in Spearman s distance, where m (n) is the estimation of m using Theorem A.3. We have that m (x) m (n)(x; S1 n, ..., Sk n) 2 Label Ranking through Nonparametric Regression with high probability using the vector-valued tools developed in Theorem A.3. By choosing an appropriate method, we obtain each one of the items 1(a), 1(b), 2(a) and 2(b) (each sample complexity result is in full correspondence with Theorem A.3). Hence, our estimate is, by definition, close to m C argsort m, i.e., m C(argsort(m(x))) m (n)(x; S1 n, ..., Sk n) 2 thanks to the structure of the samples (x, y ). We can convert our estimate m (n) to a ranking by setting h = argsort(m (n)). For any i [k], let us set mi and bmi for the true and the estimation quantities for simplicity; intuitively (without the expectation operator), a gap of order ϵ/k to the estimate (mi bmi)2 yields a bound |mi bmi| p ε/k. Recall that mi = σ(i)/k and so this implies that, in integer scaling, |σ(i) k bmi| O( ε k). We now have to compute bσ(i), that is the rounded value of k bmi. When turning the values k bmi into a ranking, the distortion of the i-th element from the correct value σ(i) is at most the number of indices j = i that lie inside the estimation radius. So, any term of the Spearman distance is on expectation of order O(ε k). This is due to the fact that E[|mi k bmi|] p E[(mi k bmi)2] = O( ε k). To conclude, we get that Ex Dx d2(argsort(m(x))), h (x)) O(ϵ) k2 . A.2. Noiseless Oracle with Complete Rankings and Random Forests (Level Splits & Breiman) In this section, we provide similar algorithmic results for Fully Grown Honest Forests based on the Level-Splits and the Breiman s criteria. A.2.1. 
DEFINITION OF PROPERTIES FOR RANDOM FORESTS For algorithms that use Random Forests via Level Splits, we need the following condition. Condition 5 (Strong Sparsity). A target function f : {0, 1}d [ 1/2, 1/2] is (β, r)-strongly sparse if f is r-sparse with relevant features R (see Definition A.1) and the function e V (see Equation (4)) satisfies e V (T {j}) e V (T) + β e V (T {i}) e V (T) , for all i R, j [d] \ R and T [d] \ {i}. Moreover, a vector-valued function m : {0, 1}d [ 1/2, 1/2]k is (β, r)-strongly sparse if each mj is (β, r)-strongly sparse for any j [k]. For algorithms that use Random Forests via Breiman s criterion, we need the following condition. Condition 6 (Marginal Density Lower Bound). We say that the density Dx is (ζ, q)-lower bounded if, for every set Q [d] with size |Q| = q, for every w {0, 1}q, it holds that Pr x Dx[x Q = w] ζ/2q . A.2.2. MAIN RESULT FOR NOISELESS LR WITH RANDOM FORESTS Our theorem both for Score Learning and Label Ranking for Random Forests with Level-Splits follows. Theorem A.5 (Label Ranking with Fully Grown Honest Forests via Level-Splits). Let ε, δ > 0. Let H > 0. Under Definition A.2 with underlying score hypothesis m : {0, 1}d [0, 1]k, where k N is the number of labels and given access to i.i.d. data (x, σ) DR, the following hold. For any i [k], we have that: let m(n,s) i be the forest estimator for alternative i that is built with sub-sampling of size s from the training set and where every tree mi(x, Ds) is built using Algorithm 9, with inputs: log(t) large enough so that every leaf has two or three samples and h = 1. Under the strong sparsity condition (see Condition 5) for any i [k], if R is the set of relevant features and for every w {0, 1}r, it holds for the marginal probability that Prz Dx(z R = w) / (0, ζ/2r) and if s = eΘ(2r (log(dk/δ)/β2 + log(k/δ)/ζ)), then it holds that Pr (x1,σ1),...,(xn,σn) Dn R m(x) m(n,s)(x) 2 using a training set of size n = e O 2rk log(k/δ) ζ . Moreover, under the generative process of Ex(m) of Definition 1.1, it holds that there exists a poly(d, k, 1/ε)-time algorithm with the same sample complexity that computes an estimate h(n,s) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n,s)(x)) i > ϵ k2 δ . Label Ranking through Nonparametric Regression Proof. (of Theorem A.5) Recall that each sample of Definition A.2 can be equivalently generated as follows: 1. x {0, 1}d is drawn from Dx, 2. For each i [k] : (a) Draw ξ [ 1/4, 1/4] from the zero mean distribution marginal Ei. (b) Compute yi = mi(x) + ξ. 3. Output (x, y), where y = (yi)i [k]. In order to estimate the coordinate i [k], i.e., the function mi : {0, 1}d [1/4, 3/4], we have to make use of the samples (x, yi) {0, 1}d [0, 1]. We have that m(x) m(n,s)(x) 2 i [k] E x Dx h (mi(x) m(n,s) i (x))2i > ε Pr[ i [k] : Bi] , where we consider the events Bi = E x Dx mi(x) m(n,s) i (x) 2 > ε/k , for any i [k], whose randomness lies in the random variables used to construct the empirical estimate m(n,s) i = m(n,s) i ( ; (x(1), y(1) i ), . . . , (x(n), y(n) i )), where m(n,s) i is the forest estimator for the alternative i using subsampling of size s. We are ready to apply the result of Syrgkanis & Zampetakis (2020) for random forest with Level-Splits (Theorem 3.4) with accuracy ε/k and confidence δ/k. Fix i [k]. 
Let m(n,s) i be the forest estimator that is built with sub-sampling of size s from the training set and where every tree mi(x, Ds) is built using Algorithm 9, with inputs: log(t) large enough so that every leaf has two or three samples and h = 1. Under the strong sparsity condition for mi (see Condition 5), if R is the set of relevant features and for every w {0, 1}r, it holds for the marginal probability that Prz Dx(z R = w) / (0, ζ/2r) and if s = eΘ(2r (log(dk/δ)/β2 + log(k/δ)/ζ)), then it holds that mi(x) m(n,s) i (x) 2 ε/k δ/k , using a training set of size n = e O 2rk log(k/δ) ζ . Aggregating the k random forests, we get the desired result using the union bound. The Spearman s distance result follows using the canonical vector representation, as in Theorem A.4. The result for the Breiman s criterion is the following. Theorem A.6 (Label Ranking with Fully Grown Honest Forests via Breiman). Let ε, δ > 0. Under Definition A.2 with underlying score hypothesis m : {0, 1}d [0, 1]k, where k N is the number of labels and given access to i.i.d. data (x, σ) DR, the following hold. Suppose that Dx is (ζ, r)-lower bounded (see Condition 6). For any i [k], let m(n,s) i be the forest estimator for the i-th alternative that is built with sub-sampling of size s from the training set and where every tree mi(x, Ds) is built using the Algorithm 10, with inputs: log(t) large enough so that every leaf has two or three samples, training set Ds and h = 1. Then, using s = eΘ( 2r log(dk/δ) ζβ2 ) and under Condition 5 for any i [k], we have that Pr (x1,σ1),...,(xn,σn) Dn R m(x) m(n,s)(x) 2 using n = e O 2rk log(dk/δ) εζβ2 . Moreover, under the generative process of Ex(m) of Definition 1.1, it holds that there exists a poly(d, k, 1/ε)-time algorithm with the same sample complexity that computes an estimate h(n,s) : {0, 1}d Sk which satisfies Pr (x1,σ1),...,(xn,σn) Dn R h d2(h(x), h(n,s)(x)) i > ϵ k2 δ . Label Ranking through Nonparametric Regression Proof. (of Theorem A.6) The proof is similar as in Theorem A.5 (by modifying the sample complexity and the value of subsampling s) and is omitted. A.3. Noisy Oracle with Incomplete Rankings In this section, we study the Label Ranking problem with incomplete rankings and focus on Problem 2. In Definition 2.4, we describe how a noisy score vector y and its associated ranking argsort(y) is generated. In order to resolve Problem 2, we consider an One-Versus-One (OVO) approach. In fact, we consider a VC class G of binary class and our goal is to use the incomplete observations and output a collection of k 2 classifiers from G so that, for a testing example x Dx with x Rd, the estimated ranking bσx, based on our selected hypotheses from G, will be close to the optimal one with high probability. We propose the following algorithm (Algorithm 7). Algorithm 6 Algorithm for Estimation and Aggregation for a VC class 1: Input: A collection of training sets Di,j for 1 i < j k, VC class G. 2: Output: An estimate bs : X Nk for the optimal score vector s . 3: Estimate Aggregate(Di,j for all i < j, G): 4: for 1 i < j k do 5: Find bgi,j = argming G 1 |Di,j| P (x,y) Di,j 1{g(x) = y} 6: endfor 7: for 1 i k do 8: bs(x; i) = 1 + P j =i 1{bgi,j(x) = 1} {Due to the Strict Stochastic Transitivity property (see Condition 2).} 9: endfor 10: Break ties randomly 11: output bs( ) = (bs( ; 1), ..., bs( ; k)) Algorithm 7 Algorithm for Label Ranking with Incomplete Rankings 1: Input: Sample access to i.i.d. examples of the form (x, σ) Dq R, VC class G. 
2: Model: Incomplete rankings are generated as in Definition 2.4. 3: Output: An estimate bσ : Rd Sk of the optimal classifier σ that satisfies Pr x Dx[bσ(x) = σ (x)] Ca,B φ2 OPT(G) + ε . 4: Label Rank Incomplete(ε, δ): 5: Set n = eΘ k 4(1 a) a max{log(k/δ), VC(G))}/polya(φ ε) {See Theorem A.7.} 6: Draw a training set D of n independent samples from Dq R 7: For any i = j, set Di,j = 8: for 1 i < j k do 9: if (x, σ) D and σ {i, j} then 10: Add (x, sgn(σ(i) σ(j))) to Di,j 11: endif 12: endfor 13: Training Phase: bs Estimate Aggregate(Di,j for all i < j, G) {See Algorithm 6.} 14: Testing Phase: On input x Rd, output argsort(bs(x)) In order to resolve Problem 2 under Condition 2, we will make use of the Kemeny embedding and the OVO approach. Let D be the training set with labeled examples of the form (x, σ) Dq R, where σ corresponds to an incomplete ranking generated as in Definition 2.4. Our algorithm proceeds as follows: 1. As a first step, for any pair of alternatives i < j with i, j [k], we create a dataset Di,j = . Label Ranking through Nonparametric Regression 2. For any i < j and for any feature x D whose incomplete ranking σ contains both i and j, we add in the dataset Di,j the example (x, y) := (x, sgn(σ(i) σ(j))). 3. For any i < j, we compute the ERM solution (see Algorithm 6) to the binary classification problem bgi,j = argming G b Li,j(g) where b Li,j(g) = 1 |Di,j| (x,y) Di,j 1{g(x) = y} . 4. We aggregate the binary classifiers (see Algorithm 6) using the score function: bs(x; i) = 1 + X j =i 1{bgi,j(x) = 1} . The structure of this score function comes from the SST property. 5. Break the possible ties randomly and output the prediction argsort(bs(x)). Let us consider a binary classification problem with labels 1, +1. Let the regression function be η(x) = Pr(x,y)[y = +1|x] and define the mapping g (x) = 1{η(x) 1/2}. If the distribution over (x, y) were known, the problem of finding an optimal classifier would be solved by simply outputting the Bayes classifier g , since it is known to minimize the misclassication probability Pr(x,y)[y = g(x)] over the collection of all classifiers. In particular, for any g G, it holds that L(g) L(g ) = 2 Ex Dx [|η(x) 1/2| 1{g(x) = g (x)}] . In Problem 2, our goal is to estimate the solution of the ranking median regression problem with respect to the KT distance σ = argmin h:X Sk E (x,σ) DR[d KT (h(x), σ)] . When the probabilities pij(x) = Pr[σ(i) > σ(j)|x] satisfy the SST property, the solution is unique almost surely and has a closed form (see (2)). Hence, we can use estimate the O(k2) binary optimal classifiers in order to estimate it. This is exactly what we will do in Algorithm 7. Our main result is the following. Theorem A.7 (Label Ranking with Incomplete Rankings). Let ε, δ (0, 1) and assume that Condition 2 holds, i.e., the Stochastic Transitivity property holds, the Tsybakov s noise condition holds with a (0, 1), B > 0 and the deletion tolerance condition holds for the survival probability vector with parameter φ (0, 1). Set Ca,B = B1 a/((1 a)1 aaa) and consider a hypothesis class G of binary classifiers with finite VC dimension. 
There exists an algorithm (Algorithm 7) that computes an estimate bσ : Rd Sk so that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ, where σ : Rd Sk is the mapping (see Equation (2)) induced by the aggregation of the k 2 Bayes classifiers g i,j with loss L i,j, using n independent samples from Dq R (see Definition 2.4), with Ca,B φ4 2a k 2 log(k/δ), VC(G) log In Table 1, we present our sample complexity results (concerning Theorem 2.5 (and Theorem A.7)) for various natural candidate VC classes, including halfspaces and neural networks. We let a b := max{a, b}. Label Ranking through Nonparametric Regression Table 1. The table depicts the sample complexity for Problem 2 and Theorem A.7 for various concept classes. In the sample complexity column, we set N0 = polya,B k φ ε . The VC dimension bounds for halfspaces and axis-aligned rectangles can be found in Shalev Shwartz & Ben-David (2014) and the VC dimension of L2-balls can be found in Dudley (1979). For the Neural Networks cases, M and N are the number of parameters and of neurons respectively and the corresponding VC dimension bounds are from Baum & Haussler (1989) and Karpinski & Macintyre (1997). CONCEPT CLASS VC DIMENSION SAMPLE COMPLEXITY HALFSPACES IN Rd d + 1 N0 O(log(k/δ) d log(d)) AXIS-ALIGNED RECTANGLES IN Rd 2d N0 O(log(k/δ) d log(d)) L2-BALLS IN Rd d + 1 N0 O(log(k/δ) d log(d)) NN WITH SIGMOID ACTIVATION O(M 2N 2) N0 O(log(k/δ) M 2N 2 log(M N)) NN WITH SIGN ACTIVATION O(M log(M)) N0 O(log(k/δ) M log2(M)) Remark A.8. We remark that, in the above results for the noisy nonparametric regression, we only focused on the sample complexity of our learning algorithms. Crucially, the runtime depends on the complexity of the Empirical Risk Minimizer and this depends on the selected VC class. Hence, the choice of the VC class involves a trade-off between computational complexity and expressivity/flexibility. We continue with the proof. In order to obtain fast learning rates for general function classes, the well-known Talagrand s inequality (see Fact 1 and Boucheron et al., 2005) is used combined with an upper bound on the variance of the loss (which is given by the noise condition) and convergence bounds on Rademacher averages (see e.g., Bartlett et al., 2005). Proof. (of Theorem A.7) We decompose the proof into a series of claims. Consider the binary hypothesis class G consisting of mappings g : Rd { 1, +1} of finite VC dimension. We let g be the Bayes classifier. We consider k 2 copies of this class, one for each unordered pair (i, j) and let Gi,j = {gi,j : Rd { 1, +1}} be the corresponding class. We let bgi,j and g i,j be the algorithm s empirical classifier and the Bayes classifier respectively for the pair (i, j). Claim 1. It holds that Pr x Dx[bσ(x) = σ (x)] X i 0 and i = j with i, j [k]. Assume that the Tsybakov condition (Condition 2.ii) holds with parameters a, B for the pair (i, j) and that the survival probabilities vector q satisfies Condition 2.iii with parameter φ (0, 1). Set Ca,B = B1 a (1 a)1 aaa . Then, for a training set Tn with elements (x, y) with y = sgn(σ(i) σ(j)) where (x, σ) Dq R conditioned that σ {i, j}, it holds that Li,j(bgi,j) Li,j(g ) 2 inf g G Li,j(g) Li,j(g ) + rn(δ) , with probability at least 1 δ, where bgi,j = argming G b Li,j(g; Tn) and g is the Bayes classifier and rn(δ) = max 1 2 a (C VC(G) log(n)) 1 2 a + (32 log(2/δ)) 1 2 a , C log(2/δ) where C, C are constants. 
Moreover, if the size of the training set Tn is at least n n PC := max 1 2C2 VC(G) , log(2/δ) 62 aφ1 a we have that rn(δ) = 2 16Ca,B 1 2 a (C VC(G) log(n)) 1 2 a + (32 log(2/δ)) Proof. Let us fix i = j. Consider the binary class G = Gi,j and let us set L i,j = Li,j(g ), where g is the Bayes classifier. Let us set yi,j = sgn(σ(i) σ(j)) { 1, +1}. Consider the loss function for the classifier g G: ci,j(g; x, σ) = 1{g(x) = yi,j σ {i, j}} . We introduce the class of loss functions Fi,j associated with Gi,j, where Fi,j = (x, σ) 7 1{σ {i, j}} (ci,j(g; x, σ) 1{g i,j(x) = yi,j}) : g Gi,j . Let F i,j be the star-hull of Fi,j with F i,j = {a f : a [0, 1], f Fi,j}. For any f Fi,j, we introduce the function T which controls the variance of the function f as follows: First, we have that Var(f) Pr x Dx[g(x) = g i,j(x)] Ca,B φ (Li,j(g) L i,j)a . We remark that the first inequality follows from the binary structure of f and the second inequality follows from the fact that the Tsybakov s noise condition implies (see Fact 2) that Pr (x,σ)[g(x) = g i,j(x)|σ {i, j}] Ca,B(Li,j(g) L i,j)a , and since Pr[σ {i, j}|x] > φ for all i = j and x Rd. Next, we can write that φ (Li,j(g) L i,j)a = Ca,B φ Pr[σ {i, j}]a (E f)a =: T 2(f) . Label Ranking through Nonparametric Regression We will make use of the following result of Boucheron et al. (2005). In order to state this result, we have to define the functions ψ and w, (we also refer the reader to Boucheron et al. (2005) for further intuition on the definition of these crucial functions). We set ψ(r) = E Rn{f F i,j : T(f) r} , where Rn is the Rademacher average2 of a subset of the star-hull of the loss class Fi,j whose variance is controlled by r2 and w(r) = sup f F i,j:E f r T(f) , which captures the largest variance (i.e., the value p Var(f)) among all loss functions f in the star hull of the loss class with bounded expectation. Theorem A.9 (Theorem 5.8 in (Boucheron et al., 2005)). Consider the class G of classifiers g : X { 1, +1}. For any δ > 0, let r (δ) denote the solution of r = 4ψ(w(r)) + 2w(r) n + 16 log(2/δ) and ε the positive solution of the equation r = ψ(w(r)). Then, for any θ > 0, with probability at least 1 δ, the empirical risk minimizer gn satisfies L(gn) inf g G L(g) θ inf g G L(g) L(g ) + (1 + θ)2 In our binary setting with i = j, the risk Li,j is conditioned on the event σ {i, j}. Hence, with probability at least 1 δ, we get that Li,j(gn) inf g G Li,j(g) θ inf g G Li,j(g) Li,j(g ) + (1 + θ)2 4θ r (δ) Pr[σ {i, j}] , where G = Gi,j. We set θ = 1 and, by adding and subtracting the Bayes error L i,j = Li,j(g ) in the left hand side, we obtain Li,j(bgi,j) L i,j 2 inf g G Li,j(g) L i,j + r (δ) Pr[σ {i, j}] . The result follows by using the fact that Pr[σ {i, j}] = Ω(φ) and by the analysis of Cl emenc on & Vogel (2020) for r (δ) (see the proof of Lemma 14 by Cl emenc on & Vogel (2020) and replace ε by φ). Finally, we set rn(δ) = r (δ)/ Pr[σ {i, j}]. Crucially, observe that the above result does not depend on k, since it focuses on the pairwise comparison i, j. Claim 3. For Ca,B and φ as defined in Claim 2, it holds that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ. Proof. Recall that (x, σ) Dq R and fix the training examples given that σ {i, j}. The Tsybakov s noise condition over the marginal over x given that σ {i, j} implies (see Fact 2) that Pr (x,σ)[gi,j(x) = g i,j(x)|σ {i, j}] Ca,B(Li,j(g) L i,j)a . 
2The Rademacher average of a set A is Rn(A) := 1 n E supa A | Pn i=1 σiai|, where σ1, ..., σn are independent Rademacher random variables. Label Ranking through Nonparametric Regression We have that Pr x Dx[gi,j(x) = g i,j(x)|σ {i, j}] = Z Rd 1{gi,j(x) = g i,j(x)}Dx(x|σ {i, j})dx = Dx(x|σ {i, j}) Dx(x) 1{gi,j(x) = g i,j(x)} , where Dx( |σ {i, j}) is the conditional distribution of x given that the label permutation contains both i and j. Then, using the deletion tolerance property, we can obtain that Dx(x|σ {i, j}) = Ω(φ Dx(x)). So, we have that Pr x [bgi,j(x) = g i,j(x)] Ca,B φ (Li,j(bgi,j) L i,j)a . Using Claim 2 for the pair i = j and Minkowski inequality, we have that Pr x [bgi,j(x) = g i,j(x)] Ca,B 2 inf g G Li,j(g) L i,j a + ra n(δ) . Hence, via the union bound, we have that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ. Claim 4. Let Ca,B and φ as defined in Claim 2. For any ε > 0, it suffices to draw samples from Dq R in order to get Ca,B Proof. Note that each pair (i, j) requires a training set of size at least n PC(δ) in order to make the failure probability at most δ/ k 2 . These samples correspond the rankings σ so that σ {i, j}. Hence, the sample complexity of the problem is equal to the number of (incomplete) permutations drawn from Dq R in order to obtain the desired number of pairwise comparisons. With high probability, the sample complexity is n PC k 2 mini =j Pr[σ {i, j}] since the random variable that corresponds to the number of pairwise comparisons provided by each sample (x, σ) Dq R is at least φ k 2 , with high probability. The desired sample complexity bound requires a number of pairwise comparisons n PC so that Ca,B The number of pairwise comparisons should be εφ Ca,B k 2 Let us set m = n PC. It remains to control the function rm. Recall that (see Claim 2) rm(δ) = max 1 2 a (C VC(G) log(m)) 1 2 a + (32 log(2/δ)) 1 2 a , C log(2/δ) Label Ranking through Nonparametric Regression If the second term is larger in the above maximum operator, we get that c1 log(k/δ) εφ Ca,B k 2 On the other side, we get that c1Ca,B VC(G) log(m) 1 2 a + c2Ca,B log(k/δ) εφ Ca,B k 2 and so we should take the maximum between the terms Ca,B log(k/δ) Let m = ey and set ye y = 1/M3. So, we have that ye y = 1/M3 and hence y = W( 1/M3), where W is the Lambert W function. Let N3 be the value of m that corresponds to the Lambert equation, i.e., log(m) = W( 1/M3) and so N3 e W ( 1/M3) with 1/M3 [0, 1/e]. Hence, the number of samples that suffice to draw from Dq R is n 1 φ k 2 max{N1, N2, N3} = 1 φ k 2 max{N2, N3} . We remark that, in the third case, it suffices to take m N3 2M3 log(M3) = eΩ since 2M log(M) M log(2M log(M)) for all M sufficiently large. These claims complete the proof. B. Additional Experimental Results for Noisy Oracle with Complete Rankings Experimental Setting and Evaluation Metrics. We follow the setting that was proposed by Cheng & H ullermeier (2008) (it has been used for the empirical evaluation of LR since then). For each data set, we run five repetitions of a ten-fold cross-validation process. Each data set is divided randomly into ten folds five times. For every division, we repeat the following process: every fold is used exactly one time as the validation set, while the other nine are used as the training set (i.e., ten iterations for every repetition of the ten-fold cross-validation process) (see James et al., 2013, p.181). For every test instance, we compute the Kendall tau coefficient between the output and the given ranking. 
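As a concrete illustration of this per-instance evaluation and its aggregation within a fold, a minimal sketch is the following; it assumes scipy's kendalltau as the coefficient implementation, and the array layout of the rankings is illustrative:

import numpy as np
from scipy.stats import kendalltau

def fold_mean_kendall_tau(predicted_rankings, true_rankings):
    # predicted_rankings, true_rankings: arrays of shape (n_test, k), each row
    # a permutation of {0, ..., k-1} describing the ranking of the k labels.
    taus = []
    for p, t in zip(predicted_rankings, true_rankings):
        tau, _ = kendalltau(p, t)   # Kendall tau coefficient of one test instance
        taus.append(tau)
    return float(np.mean(taus)), float(np.std(taus))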
In every iteration, we compute the mean Kendall tau coefficient over all the test instances. Finally, we compute the mean and standard deviation of the per-iteration aggregated results. This setting is used for the evaluation of both our synthetic data sets and the standard LR benchmarks.

Algorithms' Implementation. The algorithms were implemented in Python. We decided to use the version of our algorithms with Breiman's criterion; therefore, there was no need to implement decision trees and random forests from scratch, and we used the scikit-learn implementations. The code for reproducing our results can be found at: https://github.com/pseleni/LR-nonparametric-regression.

Data sets. The code for the creation of the synthetic benchmarks, the synthetic benchmarks themselves, and the standard LR benchmarks used in the experimental evaluation can be found at: https://github.com/pseleni/LR-nonparametric-regression.

Experimental Results for Different Noise Settings. In our noisy nonparametric regression setting, the input noise acts additively on the vector m(x) for an input x, and the output ranking is obtained by sorting the perturbed scores. We aim to understand whether the performance of the proposed algorithms differs when the noise is applied directly to the output ranking, through a parameterized noise operator, instead of being added to the vector m(x). Hence, we resort to a popular distance-based probability model introduced by Mallows (Mallows, 1957). The standard Mallows model M(σ_0, θ) is a two-parameter model supported on S_k, with density at a permutation σ equal to

Pr_{σ ∼ M(σ_0, θ)}[σ | θ, σ_0] = e^{−θ D(σ, σ_0)} / Z_k(θ, σ_0) .

The ranking σ_0 is the central ranking (location parameter) and θ > 0 is the spread (or dispersion) parameter. In this dispersion regime, the ranking σ_0 is the mode of the distribution; the probability of any other permutation decays exponentially as its distance from the central permutation increases, and the spread parameter controls how fast this decay occurs. Finally, Z_k(θ, σ_0) := Σ_{σ ∈ S_k} exp(−θ D(σ, σ_0)) is the partition function of the Mallows distribution. In what follows, we work with the KT distance (when D = d_KT, the partition function depends only on the dispersion θ and on k, and not on the central ranking, i.e., Z_k = Z_k(θ)).

We once again use the noiseless data sets of the two families, namely LFN and SFN. We create 20 noisy data sets for each family using the Mallows model as a noise operator (we refer to these data sets as MN - Mallows Noisy). For each noiseless sample, we generate the corresponding output ranking by drawing a permutation from a Mallows distribution with mode equal to the noiseless ranking and dispersion parameter θ ∈ [0.6, 4.4] (we use a different dispersion parameter for each MN data set, taking θ ∈ [0.6, 4.4] with step 0.2). We also generate 20 additional noisy data sets for each family (following the generative process Ex(m, E)) with different zero-mean Gaussian noise distributions (we refer to these data sets as GN - Gaussian Noisy), matching the α-inconsistency property of the MN data sets. Since the corresponding MN and GN data sets do not also share the β-kτ property, the results are not directly comparable. Therefore, we compare the MN and GN data sets, for each algorithm and each data set family separately, with respect to α-inconsistency, using the β-kτ gap as a reference point. The results are again obtained in terms of mean kτ from five repetitions of a ten-fold cross-validation, using the noisy output as the training data and the noiseless output as the validation data, and are visualized in Figure 2.
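Concretely, the two noise operators can be sketched as follows. This is a minimal sketch: the sampler uses the standard repeated-insertion scheme for the Mallows model under the KT distance, and the function names, the descending-argsort convention and the choice of the Gaussian scale are illustrative rather than taken from our repository.

import numpy as np
from scipy.stats import truncnorm

def sample_mallows(center, theta, rng):
    # Repeated-insertion sampling for the Mallows model under the KT distance:
    # item i (i = 1, ..., k) is inserted at 0-indexed position j of the current
    # list with probability proportional to exp(-theta * (i - 1 - j)), i.e.,
    # exponentially downweighted by the number of new inversions created.
    k = len(center)
    order = []
    for i in range(1, k + 1):
        inversions = i - 1 - np.arange(i)            # inversions added by each insertion slot
        weights = np.exp(-theta * inversions)
        j = rng.choice(i, p=weights / weights.sum())
        order.insert(j, i - 1)
    # 'order' is a sample centered at the identity; relabel it by the given center.
    return np.array([center[r] for r in order])

def mallows_operator(m_x, theta, rng):
    # Mallows noise operator: draw a ranking centered at the noiseless ranking
    # argsort(m(x)) (best-scoring label first).
    center = np.argsort(-np.asarray(m_x))
    return sample_mallows(center, theta, rng)

def gaussian_operator(m_x, scale, rng):
    # Gaussian noise operator (process Ex(m, E)): add zero-mean Gaussian noise
    # truncated to [-1/4, 1/4] to every coordinate of m(x), then sort.
    m_x = np.asarray(m_x)
    xi = truncnorm.rvs(-0.25 / scale, 0.25 / scale, loc=0.0, scale=scale,
                       size=m_x.shape, random_state=rng)
    return np.argsort(-(m_x + xi))

rng = np.random.default_rng(0)
sigma_mn = mallows_operator([0.2, 0.4, 0.1, 0.3, 0.5], theta=1.0, rng=rng)
sigma_gn = gaussian_operator([0.2, 0.4, 0.1, 0.3, 0.5], scale=0.1, rng=rng)

In this sketch, the width of the truncated Gaussian is the knob one would adjust so that a GN data set matches the α-inconsistency level of the corresponding MN data set.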
In Figure 2, the mean kτ values are shown as solid lines, the shaded areas correspond to the standard deviation, and the dotted lines indicate the β-kτ gap. We reach similar conclusions regarding the noise tolerance of each algorithm: e.g., shallow trees and fully grown random forests are noise tolerant, while the performance of fully grown decision trees decays almost linearly. What is really interesting is that, despite the β-kτ gap being higher in the MN data sets, i.e., the mean error in the input being smaller, the performance of the algorithms is worse than on the GN data sets. This reveals that neither α-inconsistency nor the β-kτ gap can be strictly correlated with the ability of the models to interpolate the correct underlying function.

Some comments are in order. The change in the geometry of the input noise is not the only thing to blame for the performance drop; it has more to do with the generative process used for the creation of the data sets. Specifically, in the GN data sets, for every x, k samples were drawn from a zero-mean Gaussian distribution truncated to [−0.25, 0.25] (i.e., ξ ∈ [−1/4, 1/4]^k) and subsequently resulted in a permutation of the elements. The permutation of the elements depends on the noiseless regression values and on the additive noise of each label. This results in irregular changes, which, given a large number of samples, cancel each other out. On the other hand, the MN data sets were created by drawing a sample from a Mallows distribution, which means that, for each ranking, certain permutations are more probable than others. The noise is no longer unbiased, and it is highly likely that the same ranking almost constantly appears as the output for a given center permutation. This could also be linked to a strategic addition of noise. With these in mind, we can safely conclude that the noise tolerance of our models is directly affected not only by the error in terms of permutation distance but also, quite naturally, by the frequency of repetitive errors, i.e., the frequency with which the same erroneous noisy output ranking arises from a given noiseless output ranking, as we have to observe the correct permutation at some frequency to achieve meaningful results.

[Figure 2: six panels plotting mean kτ ∈ [−1.0, 1.0] against α-inconsistency ∈ [0.0, 1.0] for the Mallows and Gaussian operators, together with their β reference curves: (a) SFN - fully grown random forests, (b) LFN - fully grown random forests, (c) SFN - fully grown decision trees, (d) LFN - fully grown decision trees, (e) SFN - shallow decision trees, (f) LFN - shallow decision trees.]

Figure 2. Illustration of the experimental results in terms of mean kτ for different noise operators E with respect to α-inconsistency and with the β-kτ gap as a reference point. The Gaussian operator gives a ranking σ = argsort(m(x) + ξ), with ξ ∈ [−1/4, 1/4]^k a zero-mean truncated Gaussian random variable. The Mallows operator gives a permutation σ ∼ M(argsort(m(x)), θ), where θ ∈ [0.6, 4.4].
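For reference, a compact scikit-learn sketch of the tree-based rankers compared above (and evaluated on the benchmarks below) is given here; the Borda-style conversion of training rankings into score vectors, the class name and the shallow-depth choice are assumptions of this illustration and not a description of our exact implementation.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

def rankings_to_scores(rankings, k):
    # Borda-style encoding (an assumption of this sketch): the label ranked
    # first receives score 1 and the label ranked last receives score 1/k.
    scores = np.zeros((len(rankings), k))
    for n, sigma in enumerate(rankings):          # sigma lists labels, best first
        for pos, label in enumerate(sigma):
            scores[n, label] = (k - pos) / k
    return scores

class TreeLabelRanker:
    def __init__(self, k, forest=True, shallow=False):
        if forest:
            self.reg = RandomForestRegressor(criterion="squared_error")
        else:
            depth = max(1, int(np.ceil(np.log2(k)))) if shallow else None  # illustrative depth choice
            self.reg = DecisionTreeRegressor(criterion="squared_error", max_depth=depth)
        self.k = k

    def fit(self, X, rankings):
        # Regress the score vectors with the MSE (Breiman) criterion.
        self.reg.fit(X, rankings_to_scores(rankings, self.k))
        return self

    def predict(self, X):
        # Rank the labels by decreasing predicted score, i.e., argsort of the estimate.
        return np.argsort(-self.reg.predict(X), axis=1)

ranker = TreeLabelRanker(k=5).fit(X_train, train_rankings)
predicted_rankings = ranker.predict(X_test)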
Evaluation on LR Standard Benchmarks. We also evaluate our algorithms on standard Label Ranking benchmarks, specifically on sixteen semi-synthetic data sets and on five real-world LR data sets. The semi-synthetic ones have been considered standard benchmarks for the evaluation of LR algorithms ever since they were proposed by Cheng & Hüllermeier (2008). They were created by transforming multi-class (Type A) and regression (Type B) data sets from the UCI repository and the Statlog collection into Label Ranking data (see Cheng & Hüllermeier, 2008). A summary of these data sets and their characteristics is given in Table 2. The real-world ones are genetic data, where the genome consists of 2465 genes and each gene is represented by an associated phylogenetic profile of length 24, i.e., 24 features. In these data sets, we aim to predict a qualitative representation of an expression profile. The expression profile of a gene is an ordered sequence of real-valued measurements, converted into a ranking (e.g., (2.1, 3.5, 0.7, 2.5) is converted to (2, 1, 3, 4)) (Hüllermeier et al., 2008). A summary of the real-world data sets is given in Table 3.

Table 2. Properties of the Semi-Synthetic Benchmarks

BENCHMARK     TYPE  NUMBER OF INSTANCES  NUMBER OF ATTRIBUTES  NUMBER OF LABELS
AUTHORSHIP    A     841                  70                    4
BODYFAT       B     4522                 7                     7
CALHOUSING    B     37152                4                     4
CPU-SMALL     B     14744                6                     5
ELEVATORS     B     29871                9                     9
FRIED         B     73376                9                     5
GLASS         A     214                  9                     6
HOUSING       B     906                  6                     6
IRIS          A     150                  4                     3
PENDIGITS     A     10992                16                    10
SEGMENT       A     2310                 18                    7
STOCK         B     1710                 5                     5
VEHICLE       B     846                  18                    14
VOWEL         A     528                  10                    11
WINE          A     178                  13                    3
WISCONSIN     B     346                  16                    16

Table 3. Properties of the Real-World Benchmarks

BENCHMARK  NUMBER OF INSTANCES  NUMBER OF FEATURES  NUMBER OF LABELS
SPO        2465                 24                  11
HEAT       2465                 24                  6
DTT        2465                 24                  4
COLD       2465                 24                  4
DIAU       2465                 24                  7

We follow the same experimental setting as before, i.e., five repetitions of a ten-fold cross-validation process for each data set. In Table 4 we summarize our results. RF stands for the algorithm using Random Forests, DT for Decision Trees, and SDT for Shallow Decision Trees (denoted DTS in the table headers). These are the vanilla versions of the proposed algorithms, i.e., the parameters of the regressors were not tuned (only the MSE criterion and the maximum depth were specified, while the other parameters were left at their default values). RFT and DTT stand for Random Forests Tuned and Decision Trees Tuned (each decision tree was tuned on the training data). The parameters of the regressor were tuned with a five-fold inner cross-validation for each training set. The parameter grids are reported in the anonymized repository. Random Forests have the best results overall. Interestingly, tuning does not always lead to better results; in the majority of the benchmarks, RF and RFT have comparable results. In Table 6, we compare our Random Forest results (RF and RFT) with other previously proposed methods, namely the Labelwise Decomposition (LWD) of Cheng et al. (2013), Ranking with Pairwise Comparisons (RPC) of Hüllermeier et al. (2008), Label Ranking Trees (LRT) of Cheng & Hüllermeier (2008), Label Ranking Random Forests (LR-RF) of Zhou & Qiu (2018), the k-NN Kemeny regressor (kNN Kemeny) of Korba et al. (2018), and the Boosting-based Learning Ensemble for LR (BoostLR) of Dery & Shmueli (2020). For the semi-synthetic data sets, our results are competitive with the LWD, RPC and LRT results, but with no systematic improvements.
As expected, in comparison to the most recent and state-of-the-art results, the experimental results are comparable but cannot compete the Label Ranking through Nonparametric Regression Table 4. Performance in terms of Kendall s tau coefficient - Semi Synthetic Benchmarks BENCHMARK RF DT DTS RFT DTT AUTHORSHIP 0.85 0.04 0.78 0.05 0.81 0.05 0.87 0.04 0.77 0.05 BODYFAT 0.12 0.05 0.05 0.06 0.09 0.07 0.11 0.06 0.07 0.07 CALHOUSING 0.32 0.01 0.24 0.01 0.17 0.02 0.33 0.01 0.16 0.03 CPU-SMALL 0.29 0.01 0.21 0.02 0.28 0.01 0.30 0.02 0.28 0.01 ELEVATORS 0.60 0.01 0.47 0.01 0.50 0.01 0.61 0.01 0.55 0.02 FRIED 0.96 0.00 0.90 0.00 0.65 0.01 0.96 0.00 0.90 0.00 GLASS 0.88 0.06 0.80 0.06 0.79 0.06 0.80 0.07 0.80 0.07 HOUSING 0.44 0.07 0.40 0.06 0.39 0.06 0.37 0.07 0.31 0.08 IRIS 0.95 0.07 0.91 0.09 0.92 0.08 0.95 0.07 0.90 0.10 PENDIGITS 0.86 0.01 0.77 0.018 0.63 0.01 0.86 0.01 0.78 0.01 SEGMENT 0.90 0.02 0.87 0.02 0.82 0.02 0.91 0.02 0.87 0.02 STOCK 0.80 0.03 0.76 0.04 0.76 0.04 0.78 0.03 0.73 0.05 VEHICLE 0.84 0.03 0.78 0.04 0.79 0.04 0.83 0.03 0.78 0.04 VOWEL 0.67 0.04 0.63 0.05 0.55 0.04 0.68 0.04 0.58 0.06 WINE 0.90 0.09 0.84 0.12 0.86 0.10 0.91 0.08 0.84 0.11 WISCONSIN 0.14 0.04 0.08 0.04 0.1 0.04 0.14 0.05 0.09 0.04 Table 5. Performance in terms of Kendall s tau coefficient Real World Benchmarks BENCHMARK RF DT DTS RFT DTT COLD 0.10 0.03 0.06 0.03 0.07 0.03 0.09 0.03 0.07 0.04 DIAU 0.15 0.03 0.12 0.02 0.12 0.02 0.14 0.03 0.12 0.04 DTT 0.13 0.04 0.10 0.04 0.09 0.04 0.13 0.03 0.10 0.03 HEAT 0.07 0.02 0.05 0.02 0.05 0.02 0.08 0.03 0.05 0.02 SPO 0.05 0.02 0.05 0.02 0.04 0.02 0.01 0.01 0.01 0.01 highly optimized applied algorithms. But we believe that the insights gained using this technique may be valuable to a variate of other Label Ranking methods. For the real-world data sets there are not so many methods to compare our results with. Therefore we compare them with the RPC method and Boost LR. The results are summarized in Table 7. The performance of our algorithm in the real-word benchmarks is worse than in the other data sets. We suspect that the non-sparsity of these data sets (genome data) is one of the main reasons that this pattern in the performance is observed. A more thorough investigation is left for future work. C. Results on the Noisy Oracle with Partial Rankings In this setting, we consider a distribution of partitions of the interval (of positive integers) [1..k], which depends on the feature x Rd. Before a formal definition, we provide an intuitive example: for some feature x, let the noisy score vector be equal to y = [0.2, 0.4, 0.1, 0.3, 0.5] (where y = m(x) + ξ, as in Definition 2.4). Then, it holds that σ = argsort(y) = (e b d a c). In the partial setting, we additionally draw an increasing4 partition I of the label space [k] from the distribution p(x). Assume that I = [1...2][3...3][4...5], whose size is 3. Then, the partial ranking is defined as Partial Rank(σ; I) and is equal to e = b d a = c. Definition C.1 (Generative Process for Partial Data). Consider an instance of the Label Ranking problem with underlying score hypothesis m : X [0, 1]k and let Dx be a distribution over features. Consider the partial partition distribution p. Each sample is generated as follows: 1. Draw x X from Dx and ξ [ 1/4, 1/4]k from the distribution E. 2. Compute y = m(x) + ξ. 4A partition I of the space [k] is called increasing if there exists an increasing sequence i1 < i2 < ... < im of indices of [k] so that I = [1..i1][i1..i2]...[im + 1..k]. 
For instance, the partition [1, 2][3, 4][5] is increasing, but the partition [1, 4][2, 3][5] is not. Label Ranking through Nonparametric Regression Table 6. Evaluation in terms of Kendall s tau coefficient - Semi-Synthetic Benchmarks BENCHMARK RF RFT LWD RPC LRT LR-RF KNN KEMENY BOOSTLR AUTHORSHIP 0.86 0.04 0.87 0.04 0.91 0.01 0.91 0.88 0.92 0.94 0.02 0.92 BODYFAT 0.12 0.05 0.11 0.06 - 0.28 0.11 0.19 0.23 0.06 0.20 CALHOUSING 0.32 0.01 0.33 0.01 - 0.24 0.36 0.37 0.33 0.01 0.44 CPU-SMALL 0.29 0.01 0.3 0.02 - 0.45 0.42 0.52 0.51 0.00 0.50 ELEVATORS 0.60 0.01 0.61 0.01 - 0.75 0.76 0.76 - 0.77 FRIED 0.96 0.00 0.96 0.00 - 1.00 0.89 1.00 0.89 0.00 0.94 GLASS 0.83 0.06 0.80 0.07 0.88 0.4 0.88 0.88 0.89 0.85 0.06 0.89 HOUSING 0.44 0.07 0.37 0.07 - 0.67 0.80 0.80 - 0.83 IRIS 0.95 0.07 0.95 0.07 0.93 0.06 0.89 0.95 0.97 0.95 0.04 0.83 PENDIGITS 0.86 0.01 0.86 0.01 - 0.93 0.94 0.94 0.94 0.00 0.94 SEGMENT 0.90 0.02 0.91 0.02 0.94 0.01 0.93 0.95 0.96 0.95 0.01 0.96 STOCK 0.80 0.03 0.78 0.03 - 0.78 0.90 0.92 - 0.93 VEHICLE 0.84 0.03 0.83 0.03 0.87 0.02 0.85 0.83 0.86 0.85 0.03 0.86 VOWEL 0.67 0.04 0.68 0.04 0.67 0.02 0.65 0.79 0.97 0.85 0.03 0.84 WINE 0.90 0.09 0.91 0.08 0.91 0.06 0.92 0.88 0.95 0.94 0.06 0.95 WISCONSIN 0.14 0.04 0.14 0.05 - 0.63 0.34 0.48 0.49 0.04 0.45 Table 7. Evaluation in terms of Kendall s tau coefficient - Real Word Benchmarks BENCHMARK RF RFT RPC BOOSTLR SPO 0.05 0.02 0.01 0.01 0.14 0.02 0.14 HEAT 0.07 0.02 0.08 0.03 0.13 0.2 0.13 DTT 0.13 0.04 0.13 0.03 0.17 0.3 0.17 COLD 0.10 0.03 0.09 0.0 0.22 0.03 0.21 DIAU 0.15 0.03 0.14 0.03 0.33 0.02 0.33 3. Set eσ = argsort(y). 4. Draw a partition I = [1...i1][i1 + 1...i2]...[i L + 1...k] of size L + 1 from a distribution p(x), for some L [0..k]. 5. Set σ = Partial Rank(eσ; I) 6. Output (x, σ). We let (x, σ) Dp R. Note that when L = 0, we observe a complete ranking σ. We remark that our generative model is quite general: It allows arbitrarily complex distributions over partitions, which depend on the feature space instances. We aim to address Problem 2 when dealing with partial observations. We are going to address this question in a similar fashion as in the incomplete regime. Specifically, we assume that Condition 2 still holds, but instead of the Item 3 (Deletion tolerance), we assume that the property of partial tolerance holds. Condition 7. For any 1 i < j k, we assume that the following hold: The Stochastic Transitivity and the Tsybakov s noise condition with parameters a, B (see Condition 2) are satisfied and the following condition holds: 1. (Partial tolerance): There exists ξ (0, 1) so that pi,j(x) ξ, where pi,j(x) is the probability that the pair i < j is not in the same subset of the partial partition for x Rd. Using similar technical tools as in Theorem 2.5, we obtain the following result. Theorem C.2 (Label Ranking with Partial Permutations). Let ε, δ (0, 1) and assume that Condition 7 holds, i.e., the Stochastic Transitivity property holds and both the Tsybakov s noise condition holds with a (0, 1), B > 0 and the partial tolerance condition holds for the partition probability vector p with parameter ξ (0, 1). Set Ca,B = B1 a/((1 a)1 aaa) Label Ranking through Nonparametric Regression and consider a hypothesis class G of binary classifiers with finite VC dimension. 
There exists an algorithm (Algorithm 8) that computes an estimate bσ : Rd Sk so that Pr x Dx[bσ(x) = σ (x)] Ca,B inf g G Li,j(g) L i,j with probability at least 1 δ, where σ : Rd Sk is the mapping (see Equation (2)) induced by the aggregation of the k 2 Bayes classifiers g i,j with loss L i,j, using n independent samples from Dp R (see Definition C.1), with Ca,B ξ4 2a k 2 log(k/δ), VC(G) log Algorithm 8 Algorithm for Label Ranking with Partial Rankings 1: Input: Sample access to i.i.d. examples of the form (x, σ) Dp R, VC class G. 2: Model: Partial rankings are generated as in Definition C.1. 3: Output: An estimate bσ : Rd Sk of the optimal estimator σ that satisfies Pr x Dx[bσ(x) = σ (x)] Cα,B ξ2 OPT(G) + ε . 4: Label Rank Partial(ε, δ): 5: Set n = eΘ k 4(1 α) α max{log(k/δ), VC(G))}/polyα(ξ ε) {See Theorem C.2.} 6: Draw a training set D of n independent samples from Dp R 7: For any i = j, set Di,j = 8: for 1 i < j k do 9: if (x, σ) D and σ(i) = σ(j) then {i and j do not lie in the same partition.} 10: Add (x, sgn(σ(i) σ(j))) to Di,j 11: endif 12: endfor 13: Training Phase: bs Estimate Aggregate(Di,j for all i < j, G) {See Algorithm 6.} 14: Testing Phase: On input x Rd, output argsort(bs(x)) Proof. The proof is similar to the incomplete rankings case and the difference lies in the following steps 1. For any i = j, the conditioning on the event σ {i, j} should be replaced with the event that i does not lie in the same partition (σ(i) = σ(j)). This implies that any φ term (which corresponds to a lower bound on the probability Pr[σ {i, j}]) should be replaced by the term ξ (0, 1). 2. For the final sample complexity, the following holds: With high probability, the number of pairwise comparisons induced by a partial ranking is at least ξ k 2 . Hence, with high probability, a number of n PC/(ξ k 2 ) independent draws from the distribution Dp R suffices to obtain the desired bound. Label Ranking through Nonparametric Regression D. Background on Regression with Trees and Forests In this section, we provide a discussion on decision trees and forests. We refer to Shalev-Shwartz & Ben-David (2014), Syrgkanis & Zampetakis (2020) and Breiman et al. (1984) for further details. D.1. Further Previous Work From an information-theoretic viewpoint, sample complexity bounds of decision trees and other data-adaptive partitioning estimators have been established (Nobel, 1996; Lugosi & Nobel, 1996; Mansour & Mc Allester, 2000). However, from a computational aspect, the problem of choosing the optimal tree is NP-complete (e.g., Laurent & Rivest, 1976) and, hence, from a practical standpoint, trees and forests are built greedily (e.g., Level-Splits, Breiman) by identifying the most empirically informative split at each iteration (Breiman et al., 1984; Breiman, 2001). Advances have shown that such greedily constructed trees are asymptotically consistent (Biau, 2012; Denil et al., 2014; Scornet et al., 2015) in the low-dimensional regime. On the other side, the high-dimensional regime, where the number of features can grow exponentially with the number of samples, is studied by Syrgkanis & Zampetakis (2020). The literature related to the CART criterion (Breiman et al., 1984) is vast; however, there are various other strands of research dealing with the problem of sparse nonparametric regression (that we consider in the noiseless regression settings of our work). On the one side, several heuristic methods have been proposed (Friedman et al., 2001; Friedman, 1991; George & Mc Culloch, 1997; Smola & Bartlett, 2001). 
On the other side, various works, such as the ones of Lafferty & Wasserman (2008); Liu & Chen (2009); Comminges & Dalalyan (2012); Yang & Tokdar (2015), design and theoretically analyze greedy algorithmic approaches that exploit the sparsity of the regression function in order to get around with the curse of dimensionality of the input feature data. D.2. Preliminaries on Regression Trees Decision Trees. A decision tree is a predictor h : X Y, which, on the input feature x, predicts the label associated with the instance by following a decision path from a root node of a tree to a leaf. At each node on the root-to-leaf path, the successor child is chosen on the basis of a splitting of the input space. The splitting may be based on a specific feature of x or on a predefined set of splitting rules and a leaf always contains a specific label or value, depending on the context (classification or regression). Nonparametric Regression. In the nonparametric regression problem, we consider that we observe independent samples (x, y), generated as y = f(x) + ξ, where ξ corresponds to bounded zero mean noise. Specifically, we have the following generative process: Definition D.1 (Standard Nonparametric Regression). Consider the underlying regression function f : X [1/4, 3/4] and let Dx be a distribution over features. Each sample is generated as follows: 1. Draw x X from Dx. 2. Draw ξ [ 1/4, 1/4] from the zero mean distribution E. 3. Compute y = f(x) + ξ. We let (x, y) D. Note that the noise random variable does not depend on the feature x. In the high-dimensional regime, we assume that the target function is sparse (recall Definition A.1). Regression Tree Algorithms. Let us briefly describe how regression tree algorithms work in an abstract fashion. We will focus on binary trees. In general, the regression tree algorithms operate in two phases, which are the following: first the algorithm finds a partition P of the hypercube {0, 1}d. Afterwards, it assigns a single value to every cell of the partition P, which defines the estimation function f (n). Finally, the algorithm outputs this estimate. More concretely, we have that: Phase 1 (Partitioning the space). First, a depth 0 tree contains a single cell {{0, 1}d}. If this single node splits based on whether x1 = 0 or x1 = 1, we obtain a depth 1 tree with two cells {{0} {0, 1}d 1, {1} {0, 1}d 1}. In general, this procedure generates a partition P of the space {0, 1}d. We let P(x) denote the unique cell of P that contains x. Label Ranking through Nonparametric Regression Phase 2 (Computing the estimation). Let D denote the training set that contains examples of the form (x, y), generated as y = f(x) + ξ. For any cell c P, we create the dataset Dc of all the training examples (x, y) D that are contained in the cell c, i.e., x c. Then, we compute the value of the cell as f (n)(c; P) := 1 |Dc| (x,y) Dc y . The main question not covered in the above discussion is the following: How is the split of Phase 1 chosen? There are various splitting rules in order to partition the space in Phase 1. We discuss two such rules: the Breiman s Algorithm and the Level-Splits Algorithm. In Breiman s algorithm, every node can choose a different direction to split, by using the following greedy criterion: every node chooses the direction that minimizes its own empirical mean squared error. 
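To make the greedy criterion concrete, here is a minimal numpy sketch of a single Breiman-style split choice on binary features; the names and interface are illustrative (the experiments in this paper rely on the scikit-learn implementations).

import numpy as np

def within_children_mse(y_left, y_right):
    # Empirical mean squared error of a split in which each child cell
    # predicts the mean of its own responses.
    sse = 0.0
    for y in (y_left, y_right):
        if len(y) > 0:
            sse += np.sum((y - y.mean()) ** 2)
    return sse / (len(y_left) + len(y_right))

def breiman_split(X_cell, y_cell):
    # Greedy Breiman-style choice at a single node: try every coordinate and
    # keep the one minimizing the empirical MSE of the induced split
    # (equivalently, maximizing the empirical analogue of V).
    best_coord, best_mse = None, np.inf
    for i in range(X_cell.shape[1]):
        right = X_cell[:, i] == 1          # binary features x in {0, 1}^d
        mse = within_children_mse(y_cell[~right], y_cell[right])
        if mse < best_mse:
            best_coord, best_mse = i, mse
    return best_coord, best_mse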
On the other side, in the Level-Splits algorithm, every node at the same level has to split in the same direction, by using the greedy criterion: at every level, we choose the direction that minimizes the total empirical mean squared error. In the upcoming sections, we elaborate on the algorithms based on the Level-Splits (Appendix D.3) and Breiman s (Appendix D.4) criterion. D.3. Level-Splits Algorithm We define x S as the sub-vector of x, where we observe only the coordinates with indices in S [d]. Recall that in the Level-Splits algorithm with set of splits S, any level has to split at the same direction. Hence, each level provides a single index to the set S and the size of the set S is the depth of the decision tree. Given a set of splits S, we define the expected mean squared error of S as follows: e L(S) = E x Dx " f(x) E w Dx[f(w)|w S = x S] 2# f 2(x) E z S Dx,S E w Dx[f(w)|w S = z S] 2 . This function quantifies the population version of the mean squared error between the actual value of f at x and the mean value of f constrained at the cell of P that contains x, i.e., the subspace of {0, 1}d that is equal to P(x). Observe that e L depends only on f and Dx. We set e V (S) := E z S Dx,S E w Dx[f(w)|w S = z S] 2 . (7) The function e V can be seen as a measure of heterogeneity of the within-leaf mean values of the target function f, from the leafs created by split S. The following condition is required. Condition 8 (Approximate Submodularity). Let C 1. We say that the function e V is C-approximate submodular if and only if for any T, S [d], such that S T and any i [d], it holds that e V (T {i}) e V (T) C (e V (S {i})) e V (S)). We can equivalently write this condition as (this is the formulation we used in Condition 1): e L(T) e L(T {i}) C (e L(S) e L(S {i})). The reduction of the mean squared error when the coordinate i is added to a set of splits T is upper bounded by the reduction when adding i to a subset of T, i.e., if adding i does not decrease the mean squared error significantly at some point (when having the set S), then i cannot decrease the mean squared error significantly in the future either (for any superset of S). We remark that this condition is necessary for any greedy algorithm to work (see Syrgkanis & Zampetakis, 2020). Label Ranking through Nonparametric Regression Empirical MSE. For the algorithm, we will use the empirical version of the mean square error. Provided a set of splits S, we have that y(j) f (n)(x(j); S) 2 = 1 j [n] (y(j))2 1 j [n] f (n)(x(j); S)2 (8) j [n] (y(j))2 Vn(S) . (9) Algorithm 9 Level-Splits Algorithm (see (Syrgkanis & Zampetakis, 2020)) 1: Input: honesty flag h, training dataset Dn, maximum number of splits log(t). 2: Output: Tree approximation of f. 
3: Level Splits-Algo(h, Dn, log(t)): 4: V Dn,x {Keep only training features x.} 5: if h = 1 then Split randomly Dn in half; Dn/2, D n/2, n n/2, V D n,x 6: Set P0 = {{0, 1}d} {The partition that corresponds to the root.} 7: Pℓ= for any ℓ [n] 8: level 1, S 9: while level < log(t) do 10: level level + 1 11: Select the direction i [d] that maximizes Vn(S {i}) {For Vn, see Equation (9).} 12: for all C Plevel do 13: Partition the cell C into the cells Ci k = {x : x C xi = k}, k 0, 1 14: if |V Ci 0| 1 |V Ci 1| 1 then 15: Plevel+1 Plevel+1 {Ci 0, Ci 1} 16: else 17: Plevel+1 Plevel+1 {C} 18: endif 19: endfor 20: S S {i} 21: endwhile 22: Output (Pn, f (n)) = (Plevel+1, x 7 f (n)(x; S)) The following result summarizes the theoretical guarantees for algorithms with Decision Trees via the Level-Splits criterion (see Syrgkanis & Zampetakis, 2020). Theorem D.2 (Learning with Decision Trees via Level-Splits (see (Syrgkanis & Zampetakis, 2020))). Let ε, δ > 0. Let H > 0. Let Dn be i.i.d. samples from the nonparametric regression model y = f(x) + ξ, where f(x) [ 1/2, 1/2], ξ E, Eξ E[ξ] = 0 and ξ [ 1/2, 1/2]. Let also Sn be the set of splits chosen by the Level-Splits algorithm (see Algorithm 9), with input h = 0. The following statements hold. 1. Given n = e O log(d/δ) (Cr/ε)Cr+2 samples, if f is r-sparse as per Definition A.1 and under the submodularity Condition 8, and if we set the number of splits to be log(t) = Cr Cr+2(log(n) log(log(d/δ))), then it holds that h (f(x) f (n)(x; Sn))2i > ε δ . 2. If f is r-sparse as per Definition A.1 and under the submodularity Condition 8 and the independence of features condition, given n = e O log(d/δ) 2r (C/ε)2 samples and if we set the number of splits to be log(t) = r, then it holds that h (f(x) f (n)(x; Sn))2i > ε δ . Label Ranking through Nonparametric Regression Fully Grown Honest Forests with Level-Splits Algorithm. We first explain the term Fully Grown Honest Forests: The term Fully Grown or (deep) means that we split every node until every leaf has exactly 1 training sample. The term Honest (see Wager & Athey, 2018) corresponds to the following: the regression tree algorithms operate in two phases where in both stages we use the same set of training examples. In honest trees, we split randomly the training set and use half of the dataset (Dn/2) in find a partition of {0, 1}d and the other half (D n/2) to assign the values in the cells. Finally, the term Forest is used when we subsample s out of n samples and use them in order to build independent trees; we then output the average of these trees and this function is denoted by f (n,s). In the work of Syrgkanis & Zampetakis (2020), a result about Fully Grown Forests via the Level-Splits criterion is provided under Condition 5. Shortly, it holds that, using a training set of size n = e O 2r log(1/δ) ζ and if for every w {0, 1}r, it holds for the marginal probability that Prz Dx(z R = w) / (0, ζ/2r) and if s = eΘ(2r (log(d/δ)/β2 + log(1/δ)/ζ)), then it holds that Pr Dn Dn Ex Dx[(f(x) f (n,s)(x))2] ε δ. We remark that every tree f(x, Ds) is built using Algorithm 9, with inputs: log(t) large enough so that every leaf has two or three samples and h = 1. D.4. Breiman s Algorithm We now turn our attention to the Breiman s criterion. We define the total expected mean square error that is achieved by a partition P of {0, 1}d in the population model as follows: e L(P) = E x Dx " f(x) E z Dx[f(z)|z P(x)] 2# = E x Dx[f 2(x)] E x Dx E z Dx[f(z)|z P(x)] 2 . 
As in the Level-Splits criterion, we set e V (P) = E x Dx E z Dx[f(z)|z P(x)] 2 . In order to define the splitting criterion of the algorithm (due to the local nature of Breiman), one has to introduce the local version of the expected MSE for the cell A: e Lℓ(A, P) = E x Dx " f(x) E z Dx[f(z)|z P(x)] 2 x A = E x Dx[f 2(x)|x A] E x Dx " E z Dx[f(z)|z P(x)] 2 x A e Vℓ(A, P) = E x Dx " E z Dx[f(z)|z P(x)] 2 x A The following condition is required for decision tree-based algorithms that use the Breiman s criterion. Condition 9 (Approximate Diminishing Returns). For C 1, we say that the function e V has the C-approximate diminishing returns property if for any cells A, A , any i [d] and any T [d] such that A A, it holds that e Vℓ(A , T {i}) e Vℓ(A , T) C (e Vℓ(A, i) e Vℓ(A)) . For the algorithm, we need the empirical mean squared error, conditional on a cell A and a potential split direction i, which is defined as follows: let Nn(A) be the number of training points in the cell A. Recall that Ai z = {x A|xi = z} for Label Ranking through Nonparametric Regression z {0, 1}. Also, set f (n)(x; P) = g(n)(P(x)). Then, we have that Lℓ n(A, i) = X Nn(Ai z) Nn(A) 1 Nn(Aiz)(y(j) f (n)(x(j); P(x(j))))2 (10) j:x(j) A (y(j))2 X Nn(Ai z) Nn(A) (g(n)(Ai z))2 (11) j:x(j) A (y(j))2 V ℓ n(A, i) . (12) Algorithm 10 Breiman s Algorithm (see (Syrgkanis & Zampetakis, 2020)) 1: Input: honesty flag h, training dataset Dn, maximum number of splits t. 2: Output: Tree approximation of f. 3: Breiman-Algo(h, Dn, t): 4: V Dn,x 5: if h = 1 then Split randomly Dn in half; Dn/2, D n/2, n n/2, V D n,x 6: Set P0 = {{0, 1}d} {The partition that corresponds to the root.} 7: Pℓ= for any ℓ [n] 8: level 0, nnodes 1, queue P0 9: while nnodes < t do 10: if queue = do 11: level level + 1, queue Plevel 12: endif 13: Pick A the first element in queue 14: if |V A| 1 then 15: queue queue \ {A}, Plevel+1 Plevel+1 {A} 16: else 17: Select i [d] that maximizes V ℓ n(A, i) {See Equation (12).} 18: Cut the cell A to cells Ai k = {x|x A xi = k}, k = 0, 1 19: queue queue \ {A}, Plevel+1 Plevel+1 {Ai 0, Ai 1} 20: endif 21: endwhile 22: Plevel+1 Plevel+1 queue 23: Output (Pn, f (n)) = (Plevel+1, x 7 f (n)(x; Plevel+1)) Theorem D.3 (Learning with Decision Trees via Breiman (see (Syrgkanis & Zampetakis, 2020))). Let ε, δ > 0. Let H > 0. Let Dn be i.i.d. samples from the nonparametric regression model y = f(x) + ξ, where f(x) [ 1/2, 1/2], ξ E, Eξ E[ξ] = 0 and ξ [ 1/2, 1/2]. Let also Pn be the partition that the algorithm (see Algorithm 10) returns with input h = 0. The following statements hold. 1. If f be r-sparse as per Definition A.1 and if the approximate diminishing returns Condition 9 holds, then given n = e O log(d/δ)(C r/ε)C r+3 samples and if we set log(t) Cr Cr+3(log(n) log(log(d/δ))), then it holds that h (f(x) f (n)(x; Pn))2i > ε δ . 2. If f is r-sparse as per Definition A.1, if the approximate diminishing returns Condition 9 holds and the distribution Dx is a product distribution, given n = e O C22r log(d/δ)/ε3 samples and if we set log(t) r, then it holds that h (f(x) f (n)(x; Pn))2i > ε δ . Label Ranking through Nonparametric Regression Finally, in the work of Syrgkanis & Zampetakis (2020), a result about Fully Grown Forests via the Breiman s criterion is provided under Condition 6. Shortly, it holds that, using a training set of size n = 2r log(d/δ) εζβ2 and if s = eΘ( 2r log(d/δ) ζβ2 ), then it holds that Pr Dn Dn Ex Dx[(f(x) f (n,s)(x))2] ε δ. 
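Schematically, the fully grown honest forest construction used in these results (subsample s out of the n examples, split each subsample in half for honesty, average the resulting trees) can be sketched as follows; build_partition is a placeholder for one run of the greedy tree construction of this section with h = 1 and is not implemented here, and the fallback value 0.0 for an empty leaf is only a convenience of the sketch.

import numpy as np

def honest_forest_predict(x, X, y, build_partition, s, n_trees, rng):
    # X: (n, d) binary features, y: (n,) responses.
    # build_partition(X_half, y_half) returns a function mapping a point to an
    # identifier of its leaf cell in the partition chosen from X_half, y_half.
    preds = []
    for _ in range(n_trees):
        idx = rng.choice(len(y), size=s, replace=False)        # subsample s of n points
        split_half, value_half = idx[: s // 2], idx[s // 2 :]  # honesty: disjoint halves
        cell = build_partition(X[split_half], y[split_half])   # half 1 chooses the splits
        same_leaf = [i for i in value_half if cell(X[i]) == cell(x)]
        if same_leaf:                                          # half 2 assigns the leaf values
            preds.append(float(np.mean(y[same_leaf])))
    return float(np.mean(preds)) if preds else 0.0             # average over the trees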
Note that every tree f(x, Ds) is built using the Algorithm 10, with inputs: log(t) large enough so that every leaf has two or three samples, training set Ds and h = 1. E. Background on Statistical Learning Theory For a detailed exposition of a statistical learning theory perspective to binary classification, we refer to Bousquet et al. (2003); Boucheron et al. (2005). Talagrand s Inequality. Let Pf = E f and Pnf be the corresponding empirical functional. Talagrand s inequality provides a concentration inequality for the random variable supf F(Pf Pnf), which depends on the maximum variance attained by any function over the class F. Fact 1 (Theorem 5.4 in Boucheron et al. (2005)). Let b > 0 and F be a set of functions from X to R. Assume that all functions in F satisfy Pf f b. Then, with probability at least 1 δ, it holds that sup f F (Pf Pnf) 2 E[sup f F (Pf Pnf)] + 2 supf F Var(f) log(1/δ) n + 4b log(1/δ) On Tsybakov s Condition. The following property holds for the Tsybakov s noise condition. Fact 2 (Tsybakov s Condition). Let G be a class of binary classifiers. Under the Tsybakov s noise condition (see Condition 2.(ii)) with a, B > 0 and for i = j, it holds that Pr (x,σ) Dq R [g(x) = g i,j(x)|σ {i, j}] Ca,B(Li,j(g) Li,j(g ))a , where Ca,B = B1 a (1 a)1 aaa , g is the Bayes classifier and the loss function is defined as Li,j(g) := E (x,σ) Dq R 1{g(x) = sgn(σ(i) σ(j)) σ {i, j}} . Proof. Let us set L i,j = Li,j(g ). Define the quantity η(x) = E (X,σ)[sgn(σ(i) σ(j)) = +1|X = x] . The loss of the classifier is equal to Li,j(g) L i,j = E (x,σ) Dq R [|2η(x) 1| 1{g(x) = g i,j(x)}|σ {i, j}] , and so Li,j(g) L i,j t E[1{g(x) = g i,j(x)} 1{|2η(x) 1| t}|σ {i, j}] . Using Markov s inequality, we have that for all t 0: Li,j(g) L i,j t Pr[|2η(x) 1| t|σ {i, j}] t E[1{g(x) = g i,j(x)}1{|2η(x) 1| t}|σ {i, j}] . The Tsybakov s condition implies that Li,j(g) L i,j t(1 Bt a 1 a ) t Pr[g(x) = g i,j(x)|σ {i, j}] . Hence, Li,j(g) L i,j t(Pr[g(x) = g i,j(x)|σ {i, j}] Bt a 1 a ) . Choosing t appropriately, one gets that Pr[g(x) = g i,j(x)|σ {i, j}] B1 a (1 a)1 aaa (Li,j(g) L i,j)a . The proof is concluded by setting Ca,B = B1 a (1 a)1 aaa .
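For completeness, the choice of t in the last step can be made explicit by a short optimization over t, spelled out here. Setting P := Pr[g(x) ≠ g^*_{i,j}(x) | σ ∋ {i, j}], the last display states that, for every t ≥ 0,

L_{i,j}(g) − L^*_{i,j} ≥ t P − B t^{1/(1−a)} .

The right-hand side is maximized at t^* = ((1−a) P / B)^{(1−a)/a}, where it equals a ((1−a)/B)^{(1−a)/a} P^{1/a}. Rearranging yields

P ≤ B^{1−a} / ((1−a)^{1−a} a^a) · (L_{i,j}(g) − L^*_{i,j})^a = C_{a,B} (L_{i,j}(g) − L^*_{i,j})^a ,

which is exactly the claimed bound.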