# Linear Label Ranking with Bounded Noise

Dimitris Fotakis (NTUA, fotakis@cs.ntua.gr) Β· Alkis Kalavasis (NTUA, kalavasisalkis@mail.ntua.gr) Β· Vasilis Kontonis (UW Madison, kontonis@wisc.edu) Β· Christos Tzamos (UW Madison, tzamos@wisc.edu)

Abstract. Label Ranking (LR) is the supervised task of learning a sorting function that maps feature vectors π‘₯ ∈ R^𝑑 to rankings 𝜎(π‘₯) ∈ S_π‘˜ over a finite set of π‘˜ labels. We focus on the fundamental case of learning linear sorting functions (LSFs) under Gaussian marginals: π‘₯ is sampled from the 𝑑-dimensional standard normal and the ground-truth ranking 𝜎*(π‘₯) is the ordering induced by sorting the coordinates of the vector π‘Š*π‘₯, where π‘Š* ∈ R^{π‘˜Γ—π‘‘} is unknown. We consider learning LSFs in the presence of bounded noise: assuming that a noiseless example is of the form (π‘₯, 𝜎*(π‘₯)), we observe (π‘₯, πœ‹), where for any pair of elements 𝑖 β‰  𝑗, the probability that the order of 𝑖, 𝑗 is different in πœ‹ than in 𝜎*(π‘₯) is at most πœ‚ < 1/2. We design efficient non-proper and proper learning algorithms that learn hypotheses within normalized Kendall's Tau distance πœ– from the ground truth with 𝑁 = 𝑂(𝑑 log(π‘˜)/πœ–) labeled examples and runtime poly(𝑁, π‘˜). For the more challenging top-π‘Ÿ disagreement loss, we give an efficient proper learning algorithm that achieves πœ– top-π‘Ÿ disagreement with the ground truth with 𝑁 = 𝑂(π‘‘π‘˜π‘Ÿ/πœ–) samples and poly(𝑁) runtime.

1 Introduction

1.1 Background and Motivation

Label Ranking (LR) is the problem of learning a hypothesis that maps features to rankings over a finite set of labels. Given a feature vector π‘₯ ∈ R^𝑑, a sorting function 𝜎(Β·) maps it to a ranking of π‘˜ alternatives, i.e., 𝜎(π‘₯) is an element of the symmetric group on π‘˜ elements, S_π‘˜. Assuming access to a training dataset of features labeled with their corresponding rankings, i.e., pairs of the form (π‘₯, πœ‹) ∈ R^𝑑 Γ— S_π‘˜, the goal of the learner is to find a sorting function β„Ž(π‘₯) that generalizes well over a fresh sample. LR has received significant attention over the years [DSM03, SS07, HFCB08, CH08, FHMB08] due to the large number of applications. For example, ad targeting [DGR+14] is an LR instance where for each user we want to use their feature vector to predict a ranking over ad categories and present them with the most relevant ones. The practical significance of LR has led to the development of many techniques based on probabilistic models and instance-based methods [CH08, CDH10, GDV12, ZLGQ14], decision trees [CHH09], entropy-based ranking trees [RdSRSK15], bagging [AGM17], and random forests [dSSKC17, ZQ18]. However, almost all of these works come without provable guarantees and/or fail to learn in the presence of noise in the observed rankings.

Linear Sorting Functions (LSFs). In this work, we focus on the fundamental concept class of Linear Sorting Functions [HPRZ03]. A linear sorting function parameterized by a matrix π‘Š ∈ R^{π‘˜Γ—π‘‘} with π‘˜ rows π‘Š_1, . . . , π‘Š_π‘˜ takes a feature π‘₯ ∈ R^𝑑, maps it to π‘Šπ‘₯ = (π‘Š_1 Β· π‘₯, . . . , π‘Š_π‘˜ Β· π‘₯) ∈ R^π‘˜, and then outputs an ordering (𝑖_1, . . . , 𝑖_π‘˜) of the π‘˜ alternatives such that π‘Š_{𝑖_1} Β· π‘₯ β‰₯ π‘Š_{𝑖_2} Β· π‘₯ β‰₯ Β· Β· Β· β‰₯ π‘Š_{𝑖_π‘˜} Β· π‘₯. In other words, a linear sorting function ranks the π‘˜ alternatives (corresponding to rows of π‘Š) with respect to how well they correlate with the feature π‘₯.
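To make this concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) of evaluating a linear sorting function; the values of `W` and `x` below are arbitrary.

```python
import numpy as np

def linear_sorting_function(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return sigma_W(x): row indices of W sorted by decreasing inner product with x."""
    scores = W @ x              # scores[i] = W_i . x
    return np.argsort(-scores)  # decreasing order of correlation with x

# Example with k = 4 alternatives in d = 3 dimensions (illustrative values).
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
x = np.array([0.5, -0.2, 0.9])
print(linear_sorting_function(W, x))  # -> [3 2 0 1]
```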
We denote a linear sorting function with parameter π‘Š Rπ‘˜ 𝑑by πœŽπ‘Š(π‘₯) argsort(π‘Šπ‘₯) where argsort : Rπ‘˜ Sπ‘˜takes as input a vector (𝑣1, . . . , π‘£π‘˜) Rπ‘˜, sorts it in decreasing order to obtain 𝑣𝑖1 𝑣𝑖2 . . . π‘£π‘–π‘˜and returns the ordering (𝑖1, . . . , π‘–π‘˜). Noisy Ranking Distributions. Learning LSFs in the noiseless setting can be done efficiently by using linear programming. However, the common assumption both in theoretical and in applied works is that the observed rankings are noisy in the sense that they do not always correspond to the ground-truth ranking. We assume that the probability that the order of two elements 𝑖, 𝑗in the observed ranking πœ‹is different than their order in the ground-truth ranking 𝜎 is at most πœ‚< 1/2. Definition 1 (Noisy Ranking Distribution). Fix πœ‚ [0, 1/2). An πœ‚-noisy ranking distribution β„³(𝜎 ) with ground-truth ranking 𝜎 Sπ‘˜is a probability measure over Sπ‘˜that, for any 𝑖, 𝑗 [π‘˜], with 𝑖 = 𝑗, satisfies Prπœ‹ β„³(𝜎 )[𝑖 πœ‹π‘—| 𝑖 𝜎 𝑗] πœ‚. 1 Note that, when πœ‚= 0, we always observe the ground-truth permutation and, in the case of πœ‚= 1/2, we may observe a uniformly random permutation. We remark that most natural ranking distributions satisfy this bounded noise property, e.g., (i) the Mallows model, which is probably the most fundamental ranking distribution (see, e.g., [BM09, LB11, CPS13, ABSV14, BFFSZ19, FKS21, DOS18, LM18, MW20, LM21] for a small sample of this line of research) and (ii) the Bradley-Terry-Mallows model [Mal57], which corresponds to the ranking distribution analogue of the Bradley-Terry-Luce model [BT52, Luc12] (the most studied pairwise comparisons model; see, e.g., [Hun04, NOS17, APA18] and the references therein). For more details, see Appendix E. We consider the fundamental setting where the feature vector π‘₯ R𝑑is generated by a standard normal distribution and the ground-truth ranking for each sample π‘₯is given by the LSF πœŽπ‘Š (π‘₯) for some unknown parameter matrix π‘Š Rπ‘˜ 𝑑. For a fixed π‘₯, the ranking that we observe comes from an πœ‚-noisy ranking distribution with ground-truth ranking πœŽπ‘Š (π‘₯). Definition 2 (Noisy Linear Label Ranking Distribution). Fix πœ‚ [0, 1/2) and some ground-truth parameter matrix π‘Š Rπ‘˜ 𝑑. We assume that the πœ‚-noisy linear label ranking distribution π’Ÿover R𝑑 Sπ‘˜satisfies the following: 1. The π‘₯-marginal of π’Ÿis the 𝑑-dimensional standard normal distribution. 2. For any (π‘₯, πœ‹) π’Ÿ, the distribution of πœ‹conditional on π‘₯is an πœ‚-noisy ranking distribution with ground-truth ranking πœŽπ‘Š (π‘₯). At first sight, the assumption that the underlying π‘₯-marginal is the standard normal may look too strong. However, for π‘˜= 2, Definition 2 captures the problem of learning linear threshold functions with Massart noise. Without assumptions for the π‘₯-marginal, it is known [DGT19, CKMY20, DK20, NT22] that optimal learning of halfspaces under Massart noise requires super-polynomial time (in the Statistical Query model of [Kea98]). On the other hand, a lot of recent works [BZ17, MV19, DKTZ20, ZSA20, ZL21] have obtained efficient algorithms for learning Massart halfspaces under Gaussian marginals. The goal of this work is to provide efficient algorithms for the more general problem of learning LSFs with bounded noise under Gaussian marginals. 
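To illustrate Definition 2, the following sketch (ours, not from the paper) draws labeled examples whose π‘₯-marginal is standard Gaussian and whose observed rankings are Mallows-noisy versions of the ground truth, generated with the standard repeated-insertion sampler. As noted above (and in Appendix E of the paper), the Mallows model satisfies the bounded-noise property; the flip probability πœ‚ here is governed by the dispersion parameter `phi`, which is our illustrative knob.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma_W(W, x):
    """Ground-truth LSF: alternatives sorted by decreasing W_i . x."""
    return np.argsort(-(W @ x))

def mallows_perturb(sigma, phi):
    """Repeated-insertion sampler for a Mallows model centered at sigma.
    phi in (0, 1]; smaller phi concentrates more mass near sigma."""
    out = []
    for i, item in enumerate(sigma):
        # inserting at position j in {0, ..., i} adds (i - j) inversions w.r.t. sigma
        weights = phi ** (i - np.arange(i + 1))
        j = rng.choice(i + 1, p=weights / weights.sum())
        out.insert(j, item)
    return np.array(out)

def draw_example(W_star, phi):
    """One sample (x, pi) from an eta-noisy linear label ranking distribution."""
    d = W_star.shape[1]
    x = rng.standard_normal(d)          # Gaussian x-marginal
    pi = mallows_perturb(sigma_W(W_star, x), phi)
    return x, pi

W_star = rng.standard_normal((5, 10))   # k = 5, d = 10 (illustrative sizes)
x, pi = draw_example(W_star, phi=0.3)
```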
1.2 Our Results

The main contributions of this paper are the first efficient algorithms for learning LSFs with bounded noise with respect to Kendall's Tau distance and the top-π‘Ÿ disagreement loss.

Learning in Kendall's Tau Distance. The most standard metric on rankings [SSBD14] is Kendall's Tau (KT) distance which, for two rankings πœ‹, 𝜏 ∈ S_π‘˜, measures the fraction of pairs (𝑖, 𝑗) on which they disagree. That is, KT(πœ‹, 𝜏) = βˆ‘_{𝑖 ≻_πœ‹ 𝑗} 1{𝑖 β‰Ί_𝜏 𝑗} / (π‘˜ choose 2), where we use 𝑖 ≻_πœ‹ 𝑗 (resp. 𝑖 β‰Ί_πœ‹ 𝑗) to denote that the element 𝑖 is ranked higher (resp. lower) than 𝑗 according to the ranking πœ‹. Our first result is an efficient learning algorithm that, given samples from an πœ‚-noisy linear label ranking distribution π’Ÿ, computes a parameter matrix π‘Š that ranks the alternatives almost optimally with respect to the KT distance from the ground-truth ranking 𝜎_{π‘Š*}(Β·).

Theorem 1 (Learning LSFs in KT Distance). Fix πœ‚ ∈ [0, 1/2) and πœ–, 𝛿 ∈ (0, 1). Let π’Ÿ be an πœ‚-noisy linear label ranking distribution satisfying the assumptions of Definition 2 with ground-truth LSF 𝜎_{π‘Š*}(Β·). There exists an algorithm that draws 𝑁 = 𝑂( 𝑑/(πœ– (1 βˆ’ 2πœ‚)^6) Β· log(π‘˜/𝛿) ) samples from π’Ÿ, runs in sample-polynomial time, and computes a matrix π‘Š ∈ R^{π‘˜Γ—π‘‘} such that, with probability at least 1 βˆ’ 𝛿, E_{π‘₯ ∼ 𝒩_𝑑}[ KT(𝜎_π‘Š(π‘₯), 𝜎_{π‘Š*}(π‘₯)) ] ≀ πœ–.

Theorem 1 gives the first efficient algorithm with provable guarantees for the supervised problem of learning noisy linear rankings. We remark that the sample complexity of our learning algorithm is qualitatively optimal (up to logarithmic factors) since, for π‘˜ = 2, our problem subsumes learning a linear classifier with Massart noise (see footnote 2), for which Ω(𝑑/πœ–) samples are known to be information-theoretically necessary [MN06]. Moreover, our learning algorithm is proper in the sense that it computes a linear sorting function 𝜎_π‘Š(Β·). As opposed to improper learners (see also Section 1.3), a proper learning algorithm gives us a compact representation (storing π‘Š requires 𝑂(π‘˜π‘‘) memory) of the sorting function that allows us to efficiently compute (with runtime 𝑂(π‘˜π‘‘ + π‘˜ log π‘˜)) the ranking corresponding to a fresh datapoint π‘₯ ∈ R^𝑑.

Learning in top-π‘Ÿ Disagreement. We next present our learning algorithm for the top-π‘Ÿ metric, formally defined as top_π‘Ÿ(πœ‹, 𝜏) = 1{πœ‹_{1..π‘Ÿ} β‰  𝜏_{1..π‘Ÿ}}, where by πœ‹_{1..π‘Ÿ} we denote the ordering on the first π‘Ÿ elements of the permutation πœ‹. The top-π‘Ÿ metric is a disagreement metric in the sense that it takes binary values, and for π‘Ÿ = 1 it captures the standard (multiclass) top-1 classification loss. We remark that, in contrast with the top-π‘Ÿ classification loss, which only requires the predicted label to be in the top-π‘Ÿ predictions of the model, the top-π‘Ÿ ranking metric that we consider here requires that the model puts the same elements in the same order as the ground truth in the top-π‘Ÿ positions. The top-π‘Ÿ ranking metric is well-motivated as, for example, in ad targeting (discussed in Section 1.1) we want to be accurate on the top-π‘Ÿ ad categories for a user so that we can diversify the content that they receive.

Theorem 2 (Learning LSFs in top-π‘Ÿ Disagreement). Fix πœ‚ ∈ [0, 1/2), π‘Ÿ ∈ [π‘˜] and πœ–, 𝛿 ∈ (0, 1). Let π’Ÿ be an πœ‚-noisy linear label ranking distribution satisfying the assumptions of Definition 2 with ground-truth LSF 𝜎_{π‘Š*}(Β·).
There exists an algorithm that draws 𝑁= 𝑂 ( π‘‘π‘Ÿπ‘˜ πœ–(1 2πœ‚)6 log(1/𝛿) ) samples from π’Ÿ, runs in sample-polynomial time and computes a matrix π‘Š Rπ‘˜ 𝑑such that, with probability at least 1 𝛿, E π‘₯ 𝒩𝑑[ top π‘Ÿ(πœŽπ‘Š(π‘₯), πœŽπ‘Š (π‘₯))] πœ–. As a direct corollary of our result, we obtain a proper algorithm for learning the top-1 element with respect to the standard 0-1 loss that uses 𝑂(π‘˜π‘‘) samples. In fact, for small values of π‘Ÿ, i.e., π‘Ÿ= 𝑂(1), our sample complexity is essentially tight. It is known that Θ(π‘˜π‘‘) samples are information theoretically necessary [Nat89] for top-1 classification. 3 For the case π‘Ÿ= π‘˜, i.e., when we want to learn the whole ranking with respect to the 0-1 loss, our sample complexity is 𝑂(π‘˜2𝑑). However, using arguments similar to [DSBDSS11], one can show that in fact 𝑂(π‘‘π‘˜) ranking samples are sufficient in order to learn the whole ranking with respect to the 0-1 loss. In this case, it is unclear whether a better sample complexity can be achieved with an efficient algorithm and we leave this as an interesting open question for future work. 1.3 Our Techniques Learning in Kendall s Tau distance. Our proper learning algorithm consists of two steps: an improper learning algorithm that decomposes the ranking problem to 𝑂(π‘˜2) binary linear classification problems and a convex (second order conic) program that compresses the π‘˜2 linear classifiers 2Notice that in this case Kendall s Tau distance is simply the standard 0-1 binary loss. 3Strictly speaking, those lower bounds do not directly apply in our setting because our labels are whole rankings instead of just the top classes but, in the Appendix D, we show that we can adapt the lower bound technique of [DSBDSS11] to obtain the same sample complexity lower bound for our ranking setting. to obtain a π‘˜ 𝑑matrix π‘Š. Our improper learning algorithm splits the ranking learning problem into 𝑂(π‘˜2) binary, 𝑑-dimensional linear classification problems with Massart noise. In particular, for every pair of elements 𝑖, 𝑗 [π‘˜], each binary classification task asks whether element 𝑖is ranked higher than element 𝑗in the ground-truth permutation πœŽπ‘Š (π‘₯). As we already discussed, we have that, under the Gaussian distribution, there exist efficient Massart learning algorithms [BZ17, MV19, DKTZ20, ZSA20, ZL21] that can recover linear classifiers sgn(𝑣𝑖𝑗 π‘₯) that correctly order the pair 𝑖, 𝑗for all π‘₯apart from a region of 𝑂(πœ–)-Gaussian mass. However, we still need to aggregate the results of the approximate binary classifiers in order to obtain a ranking of the π‘˜ alternatives for each π‘₯. We first show that we can design a voting scheme that combines the results of the binary classifiers using an efficient constant factor approximation algorithm for the Minimum Feedback Arc Set (MFAS) problem [ACN08]. This gives us an efficient but improper algorithm for learning LSFs in Kendall s Tau distance. In order to obtain a proper learning algorithm, we further compress the 𝑂(π‘˜2) approximate linear classifiers with normal vectors 𝑣𝑖𝑗and obtain a matrix π‘Š Rπ‘˜ 𝑑with the property that the difference of every two rows π‘Šπ‘– π‘Šπ‘—is 𝑂(πœ–)-close to the vector 𝑣𝑖𝑗. More precisely, we show that, given the linear classifiers 𝑣𝑖𝑗 R𝑑, we can efficiently compute a matrix π‘Š Rπ‘˜ 𝑑such that the following angle distance with π‘Š is small: 𝑑angle(π‘Š, π‘Š ) max 𝑖,π‘—πœƒ(π‘Šπ‘– π‘Šπ‘—, π‘Š 𝑖 π‘Š 𝑗) 𝑂(πœ–) . 
(1) It is not hard to show that, as long as the above angle metric is at most 𝑂(πœ–), then (in expectation over the standard Gaussian) Kendall s Tau distance between the LSFs is also 𝑂(πœ–). A key technical difficulty that we face in this reduction is bounding the condition number of the convex (second order conic) program that finds the matrix π‘Šgiven the vectors 𝑣𝑖𝑗, see Claim 2. Finally, we remark that the proper learning algorithm of Theorem 1 results in a compact and efficient sorting function that requires: (i) storing 𝑂(π‘˜) weight vectors as opposed to the initial 𝑂(π‘˜2) vectors of the improper learner; and (ii) evaluating π‘˜inner products with π‘₯to find its ranking (instead of 𝑂(π‘˜2)). Learning in top-π‘ŸDisagreement. We next turn our attention to the more challenging top-π‘Ÿranking disagreement metric. In particular, suppose that we are interested in recovering only the top element of the ranking. One approach would be to directly use the improper learning algorithm for this task and ask for KT distance of order roughly πœ–/π‘˜2. The resulting hypothesis would produce good predictions for the top element but the required sample complexity would be 𝑂(π‘‘π‘˜2). While it seems that training 𝑂(π‘˜2) 𝑑-dimensional binary classifiers inherently requires 𝑂(π‘‘π‘˜2) samples, we show that, using the proper KT distance learning algorithm of Theorem 1, we can also obtain improved sample complexity results for the top-π‘Ÿmetric. Our main technical contribution here is a novel estimate of the top-π‘Ÿdisagreement in terms of the angle metric. In general, one can show that the top-π‘Ÿdisagreement is at most 𝑂(π‘˜2) 𝑑angle(π‘Š, π‘Š ). We significantly sharpen this estimate by showing the following lemma. Lemma 1 (Top-π‘ŸDisagreement via Parameter Distance). Consider two matrices π‘Š, π‘Š Rπ‘˜ 𝑑 and let 𝒩𝑑be the standard Gaussian in 𝑑dimensions. We have that Pr π‘₯ 𝒩𝑑[𝜎1..π‘Ÿ(π‘Šπ‘₯) = 𝜎1..π‘Ÿ(π‘Š π‘₯)] 𝑂(π‘˜π‘Ÿ) 𝑑angle(π‘Š, π‘Š ) . We remark that Lemma 1 is a general geometric tool that we believe will be useful in other distributionspecific multiclass learning settings. The proof of Lemma 1 mainly relies on geometric Gaussian surface area computations that we believe are of independent interest. For the details, we refer the reader to Section 4. An interesting question with a convex-geometric flavor is whether the sharp bound of Lemma 1 also holds under the more general class of isotropic log-concave distributions. 1.4 Related Work Robust Supervised Learning. We start with a summary of prior work on PAC learning with Massart noise. The Massart noise model was formally defined in [MN06] but similar variants had been defined by Vapnik, Sloan and Rivest [Vap06, Slo88, Slo92, RS94, Slo96]. This model is a strict extension of the Random Classification Noise (RCN) model [AL88], where the label noise is uniform, i.e., context-independent and is a special case of the agnostic model [Hau18, KSS94], where the label noise is fully adversarial and computational barriers are known to exist [GR09, FGKP06, Dan16, DKZ20, GGK20, DKPZ21, HSSVG22]. Our work partially builds upon on the algorithmic task of PAC learning halfspaces with Massart noise [BH20]. In the distribution-independent setting, known efficient algorithms [DGT19, CKMY20, DKT21] achieve error πœ‚+πœ–and the works of [DK20, NT22] indicate that this error bound is the best possible in the Statistical Query model [Kea98]. 
This lower bound motivates the study of the distribution-specific setting (which is also the case of our work). There is an extensive line of work in this direction: [ABHU15, ABHZ16, YZ17, ZLC17, BZ17, MV19, DKTZ20, ZSA20, ZL21] with the currently best algorithms succeeding for all πœ‚< 1/2 with a sample and computational complexity poly(𝑑, 1/πœ–, 1/(1 2πœ‚)) under a class of distributions including isotropic log-concave distributions. For details, see [DKK+21]. In this work we focus on Gaussian marginals but some of our results extend to larger distribution classes. Label Ranking. Our work lies in the area of Label Ranking, which has received significant attention over the years [SS07, HFCB08, CH08, HPRZ03, FHMB08, DSM03]. There are multiple approaches for tackling this problem (see [VG10], [ZLY+14]). Some of them are based on probabilistic models [CH08, CDH10, GDV12, ZLGQ14] or may be tree based, such as decision trees [CHH09], entropy based ranking trees and forests [Rd SRSK15, d SSKC17], bagging techniques [AGM17] and random forests [ZQ18]. There are also works focusing on supervised clustering [GDGV13]. Finally, [CH08, CDH10, CHH09] adopt an instance-based approaches using nearest neighbors approaches. The above results are industrial. From a theoretical perspective, LR has been mainly studied from a statistical learning theory framework [CV20, CKS18, KGB18, KCS17]. [FKP21] provide some computational guarantees for the performance of decision trees in the noiseless case and some experimental results on the robustness of random forests to noise. The setting of [DGR+14] is close to ours but is investigated from an experimental standpoint. We remark that while reducing LR to multiple binary classification tasks has been used in prior literature [HFCB08, CH12, FKP21], standard reductions can not tolerate noise in rankings (nevertheless, from an experimental perspective, e.g., random forests seem robust to noise but lack formal theoretical guarantees). Our reduction crucially relies on the existence of efficient learning algorithms for binary linear classification with Massart noise. 2 Notation and Preliminaries General Notation. We use 𝑂( ) to omit poly-logarithmic factors. A learning algorithm has samplepolynomial runtime if it runs in time polynomial in the size of the description of the input training set. We denote vectors by boldface π‘₯(with elements π‘₯𝑖) and matrices with π‘Š, where we let π‘Šπ‘– R𝑑 denote the 𝑖-th row of π‘Š Rπ‘˜ 𝑑and π‘Šπ‘–π‘—its elements. We denote π‘Ž 𝑏the inner product of two vectors and πœƒ(π‘Ž, 𝑏) their angle. Let 𝒩𝑑denote the 𝑑-dimensional standard normal and Ξ“( ) the Gaussian surface area. Rankings. We let argsort𝑖 [π‘˜]𝑣denote the ranking of [π‘˜] in decreasing order according to the values of 𝑣. For a ranking πœ‹, we let πœ‹(𝑖) denote the position of the 𝑖-th element. If πœ‹= πœ‹(π‘₯), we may also write πœ‹(π‘₯)(𝑖) to denote the position of 𝑖. We often refer to the elements of a ranking as alternatives. For a ranking 𝜎, we let 𝜎1..π‘Ÿdenote the top-π‘Ÿpart of 𝜎. When 𝜎= 𝜎(π‘₯), we may also write 𝜎1..π‘Ÿ(π‘₯) and πœŽβ„“(π‘₯) will be the alternative at the β„“-th position. We let KT denote the (normalized) KT distance, i.e., KT(πœ‹, 𝜏) = 𝑖 πœ‹π‘—1{𝑖 πœπ‘—}/ ( π‘˜ 2 ) for πœ‹, 𝜏 Sπ‘˜. 3 Learning in KT distance: Theorem 1 In this section, we present the main tools required to obtain our proper learning algorithm of Theorem 1. 
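Before describing the algorithm, and as a quick reference for the quantities defined in Section 2 and for the angle metric of Eq. (1), here is a small numpy sketch of how they can be computed; it is our own illustration and not part of the paper's algorithms, and the function names are placeholders.

```python
import numpy as np
from itertools import combinations

def kt_distance(pi, tau):
    """Normalized Kendall's Tau distance: fraction of pairs ordered differently.
    pi and tau list the alternatives from top to bottom."""
    k = len(pi)
    pos_pi = np.argsort(pi)    # pos_pi[a] = position of alternative a in pi
    pos_tau = np.argsort(tau)
    disagreements = sum(
        (pos_pi[a] < pos_pi[b]) != (pos_tau[a] < pos_tau[b])
        for a, b in combinations(range(k), 2)
    )
    return disagreements / (k * (k - 1) / 2)

def top_r_disagreement(pi, tau, r):
    """1 if the top-r prefixes differ as ordered sequences, else 0."""
    return int(not np.array_equal(pi[:r], tau[:r]))

def angle(u, v):
    cos = np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0)
    return np.arccos(cos)

def d_angle(W, W_star):
    """Angle metric of Eq. (1): max over pairs (i, j) of the angle between
    W_i - W_j and W*_i - W*_j."""
    k = W.shape[0]
    return max(angle(W[i] - W[j], W_star[i] - W_star[j])
               for i, j in combinations(range(k), 2))
```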
Our proper algorithm adopts a two-step approach: it first invokes an efficient improper algorithm which, instead of a linear sorting function (i.e., a matrix π‘Š Rπ‘˜ 𝑑), outputs a list of 𝑂(π‘˜2) linear classifiers. We then design a novel convex program in order to find the matrix π‘Š satisfying the guarantees of Theorem 1. Let us begin with the improper learner for LSFs with bounded noise with respect to the KT distance, whose description can be found in Algorithm 1. 3.1 Improper Learning Algorithm Let us assume that the target function is 𝜎 (π‘₯) = πœŽπ‘Š (π‘₯) = argsort(π‘Š π‘₯) for some π‘Š Rπ‘˜ 𝑑. Step 1: Binary decomposition and Noise Structure. For each drawn example (π‘₯, πœ‹) from the πœ‚-noisy linear label ranking distribution π’Ÿ(see Definition 2), we create ( π‘˜ 2 ) binary examples (π‘₯, 𝑦𝑖𝑗) Algorithm 1 Non-proper Learning Algorithm Improper LSF Input: Training set 𝑇= {(π‘₯𝑑, πœ‹π‘‘)}𝑑 [𝑁], πœ–, 𝛿 (0, 1), πœ‚ [0, 1/2) Output: Sorting function β„Ž: R𝑑 Sπ‘˜ For any 1 𝑖< 𝑗 π‘˜, create 𝑇𝑖𝑗= {(π‘₯𝑑, sgn(πœ‹π‘‘(𝑖) πœ‹π‘‘(𝑗)))} For any 1 𝑖< 𝑗 π‘˜, compute 𝑣𝑖𝑗= Massart LTF(𝑇𝑖𝑗, πœ– 4, 𝛿 10π‘˜2 , πœ‚) See Appendix A.1 Ranking Phase: Given π‘₯ R𝑑: (a) Construct directed graph 𝐺with 𝑉(𝐺) = [π‘˜] and edges 𝑒𝑖 𝑗only if 𝑣𝑖𝑗 π‘₯> 0 𝑖 = 𝑗 (b) Output β„Ž(π‘₯) = MFAS(𝐺) See Appendix A.1 with 𝑦𝑖𝑗= sgn(πœ‹(𝑖) πœ‹(𝑗)) for any 1 𝑖< 𝑗 π‘˜. We have that [ 𝑦𝑖𝑗 sgn((π‘Š 𝑖 π‘Š 𝑗) π‘₯) < 0 | π‘₯ ] = Pr πœ‹ β„³(𝜎 (π‘₯)) [ πœ‹(𝑖) < πœ‹(𝑗) | π‘Š 𝑖 π‘₯< π‘Š 𝑗 π‘₯ ] . Since β„³(𝜎 (π‘₯)) is an πœ‚-noisy ranking distribution (see Definition 1), we get that the above quantity is at most πœ‚< 1/2. Therefore, each sample (π‘₯, 𝑦𝑖𝑗) can be viewed as a sample from a distribution π’Ÿπ‘–π‘—with Gaussian π‘₯-marginal, optimal linear classifier sgn((π‘Š 𝑖 π‘Š 𝑗) π‘₯), and Massart noise πœ‚. Hence, we have reduced the task of learning noisy LSFs to a number of ( π‘˜ 2 ) sub-problems concerning the learnability of halfspaces in the presence of bounded (Massart) noise. Step 2: Solving Binary Sub-problems. We can now apply the algorithm Massart LTF for LTFs with Massart noise under standard Gaussian marginals [ZSA20] (for details, see Appendix A.1): for all the pairs of alternatives 1 𝑖< 𝑗 π‘˜with accuracy parameter πœ– , confidence 𝛿 = 𝑂(𝛿/π‘˜2), and a total number of 𝑁= Ω ( 𝑑 πœ– (1 2πœ‚)6 log(π‘˜/𝛿) ) i.i.d. samples from π’Ÿ, we can obtain a collection of linear classifiers with normal vectors 𝑣𝑖𝑗for any 𝑖< 𝑗. We remark that each one of these halfspaces 𝑣𝑖𝑗achieves πœ–disagreement with the ground-truth halfspaces π‘Š 𝑖 π‘Š 𝑗with high probability, i.e., Pr π‘₯ 𝒩𝑑[sgn(𝑣𝑖𝑗 π‘₯) = sgn((π‘Š 𝑖 π‘Š 𝑗) π‘₯)] πœ– . Step 3: Ranking Phase. We now have to aggregate the linear classifiers and compute a single sorting function β„Ž: R𝑑 Sπ‘˜. Given an example π‘₯, we create the tournament graph 𝐺with π‘˜nodes that contains a directed edge 𝑒𝑖 𝑗if 𝑣𝑖𝑗 π‘₯> 0. If 𝐺is acyclic, we output the induced permutation; otherwise, the graph contains cycles which should be eliminated. In order to output a ranking, we remove cycles from 𝐺with an efficient, 3-approximation algorithm for MFAS [ACN08, VZW09]. Hence, the output β„Ž(π‘₯) and the true target 𝜎 (π‘₯) will have Eπ‘₯ 𝒩𝑑[ KT(β„Ž(π‘₯), 𝜎 (π‘₯))] πœ– +3πœ– = 4πœ– . This last equation indicates why a constant factor approximation algorithm suffices for our purposes we can always pick πœ– = πœ–/4 and complete the proof. For details, see Appendix A.1. 
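To illustrate the ranking phase of Algorithm 1, here is a simplified Python sketch under our own naming: the dictionary `v` of pairwise normal vectors stands in for the output of the Massart-LTF step, and we aggregate with the randomized pivoting rule of [ACN08] as one concrete instantiation of the constant-factor MFAS approximation discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def pivot_order(nodes, beats):
    """Randomized pivot (QuickSort-style) ordering of a tournament.
    beats(i, j) == True means the edge i -> j is present (i voted above j).
    This is the pivoting scheme of Ailon-Charikar-Newman, a constant-factor
    approximation (in expectation) for feedback arc set on tournaments."""
    if len(nodes) <= 1:
        return list(nodes)
    pivot = nodes[rng.integers(len(nodes))]
    above = [i for i in nodes if i != pivot and beats(i, pivot)]
    below = [i for i in nodes if i != pivot and beats(pivot, i)]
    return pivot_order(above, beats) + [pivot] + pivot_order(below, beats)

def rank_example(x, v):
    """Ranking phase: v[(i, j)] for i < j is the learned normal vector of the
    pairwise classifier sgn(v_ij . x); a positive sign votes 'i above j'."""
    k = 1 + max(j for (_, j) in v)

    def beats(i, j):
        if i < j:
            return float(v[(i, j)] @ x) > 0.0
        return float(v[(j, i)] @ x) <= 0.0

    return pivot_order(list(range(k)), beats)
```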
3.2 Proper Learning Algorithm: Theorem 1 Having obtained the improper learning algorithm, we can now describe our proper Algorithm 2. Initially, the algorithm starts similarly with the improper learner and obtains a collection of binary linear classifiers. The crucial idea is the next step: the design of an appropriate convex program which will efficiently give the matrix π‘Š. We proceed with the details. For the proof, see Appendix A.2. Algorithm 2 Proper Learning Algorithm Proper LSF Input: Training set 𝑇= {(π‘₯𝑑, πœ‹π‘‘)}𝑑 [𝑁], πœ–, 𝛿 (0, 1), πœ‚ [0, 1/2) Output: Linear Sorting function β„Ž: R𝑑 Sπ‘˜, i.e., β„Ž( ) = πœŽπ‘Š( ) for some matrix π‘Š Rπ‘˜ 𝑑 Compute (𝑣𝑖𝑗)1 𝑖<𝑗 π‘˜= Improper LSF(𝑇, πœ–, 𝛿, πœ‚) See Algorithm 1 Setup the CP 1 and compute π‘Š= Ellipsoid(CP) See Appendix A.2 Ranking Phase: Given π‘₯ R𝑑, output β„Ž(π‘₯) = argsort(π‘Šπ‘₯) Step 1: Calling Non-proper Learners. As a first step, the algorithm calls Algorithm 1 with parameters πœ–, 𝛿and πœ‚ [0, 1/2) and obtains a list of linear classifiers with normal vectors 𝑣𝑖𝑗for 𝑖< 𝑗. Without loss of generality, assume that 𝑣𝑖𝑗 2 = 1. Step 2: Designing and Solving the CP 1. Our main goal is to find a matrix π‘Šwhose LSF is close to the true target in KT distance. We show the following lemma that connects the KT distance between two LSFs with the angle metric 𝑑angle( , ) defined in Eq. (1). The proof can be found in the Appendix A.2. Lemma 2. For π‘Š, π‘Š Rπ‘˜ 𝑑, it holds Eπ‘₯ 𝒩𝑑[ KT(πœŽπ‘Š(π‘₯), πœŽπ‘Š (π‘₯))] 𝑑angle(π‘Š, π‘Š ) . The above lemma states that, for our purposes, it suffices to control the 𝑑angle metric between the guess π‘Šand the true matrix π‘Š . It turns out that, given the binary classifiers 𝑣𝑖𝑗, we can design a convex program whose solution will satisfy this property. Thinking of the binary classifier 𝑣𝑖𝑗as a proxy for π‘Š 𝑖 π‘Š 𝑗, we want each difference π‘Šπ‘– π‘Šπ‘—to have small angle with 𝑣𝑖𝑗or equivalently to have large correlation with it, i.e., (π‘Šπ‘– π‘Šπ‘—) 𝑣𝑖𝑗 π‘Šπ‘– π‘Šπ‘— 2. To enforce this condition, we can therefore use the second order conic constraint (π‘Šπ‘– π‘Šπ‘—) 𝑣𝑖𝑗 (1 πœ‘) π‘Šπ‘– π‘Šπ‘— 2. We formulate the following convex program 1 with variable the matrix π‘Š: Find π‘Š Rπ‘˜ 𝑑, π‘Š 𝐹 1, such that (π‘Šπ‘– π‘Šπ‘—) 𝑣𝑖𝑗 (1 πœ‘) π‘Šπ‘– π‘Šπ‘— 2 for any 1 𝑖< 𝑗 π‘˜, (1) for some πœ‘ (0, 1) to be decided. Intuitively, since any 𝑣𝑖𝑗has good correlation with π‘Š 𝑖 π‘Š 𝑗 (by the guarantees of the improper learning algorithm) and the CP 1 requires that its solution π‘Š similarly correlates well with 𝑣𝑖𝑗, we expect that 𝑑angle(π‘Š, π‘Š ) will be small. We show that: Claim 1. The convex program 1 is feasible and any solution π‘Šof 1 satisfies 𝑑angle(π‘Š, π‘Š ) πœ–. To see this, note that any solution of CP 1 is a matrix π‘Šwhose angle metric (see Eq. (1)) with the true matrix is small by an application of the triangle inequality between the angles of (𝑣𝑖𝑗, π‘Šπ‘– π‘Šπ‘—) and (𝑣𝑖𝑗, π‘Š 𝑖 π‘Š 𝑗) for any 𝑖 = 𝑗. We next have to deal with the feasibility of CP 1. Our goal is to determine the value of πœ‘that makes the CP 1 feasible. For the pair 1 𝑖< 𝑗 π‘˜, the guess 𝑣𝑖𝑗and the true normal vector π‘Š 𝑖 π‘Š 𝑗satisfy, with high probability, Pr π‘₯ π’Ÿπ‘₯[sgn(𝑣𝑖𝑗 π‘₯) = sgn((π‘Š 𝑖 π‘Š 𝑗) π‘₯)] πœ–. (2) Under the Gaussian distribution (which is rotationally symmetric), it is well known that the angle πœƒ(𝑒, 𝑣) between two vectors 𝑒, 𝑣 R𝑑is equal to πœ‹ Prπ‘₯ 𝒩𝑑[sgn(𝑒 π‘₯) = sgn(𝑣 π‘₯)]. Hence, using Eq. 
(2), we get that the angle between the guess 𝑣𝑖𝑗and the true normal vector π‘Š 𝑖 π‘Š 𝑗 is πœƒ(π‘Š 𝑖 π‘Š 𝑗, 𝑣𝑖𝑗) π‘πœ–. For sufficiently small πœ–, this bound implies that the cosine of the above angle is of order 1 (π‘πœ–)2 and so the following inequality will hold (since 𝑣𝑖𝑗is unit): (π‘Š 𝑖 π‘Š 𝑗) 𝑣𝑖𝑗 (1 2(π‘πœ–)2) π‘Š 𝑖 π‘Š 𝑗 2 . Hence, by setting πœ‘= 2(π‘πœ–)2, the convex program 1 with variables π‘Š Rπ‘˜ 𝑑will be feasible; since π‘Š 𝐹 1 comes without loss of generality, π‘Š will be a solution with probability 1 𝛿. Next, we have to control the volume of the feasible region. This is crucial in order to apply the ellipsoid algorithm (for details, see in Appendix A.2.1) and, hence, solve the convex program. We show the following claim (see Appendix A.2.1 for the proof): Claim 2. There exists 𝜌 2 poly(𝑑,π‘˜,1/πœ–,log(1/𝛿)) so that the feasible set of CP 1 with πœ‘= 𝑂(πœ–2) contains a ball (with respect to the Frobenius norm) of radius 𝜌. Critically, the runtime of the ellipsoid algorithm is logarithmic in 1/𝜌. So, the ellipsoid runs in time polynomial in the parameters of the problem and outputs the desired matrix π‘Š. 4 Learning in top-π‘ŸDisagreement: Theorem 2 In this section we show that the proper learning algorithm of Section 3.2 learns noisy LSFs in the top-π‘Ÿ disagreement metric. We have seen that, with 𝑂(𝑑log(π‘˜)/πœ–) samples, Algorithm 2 of Section 3.2 computes a matrix π‘Šsuch that 𝑑angle(π‘Š, π‘Š ) πœ–, see Claim 1. Let us be more specific. Lemma 2 relates the expected KT distance with the angle metric of the two matrices (see also Equation (1)). Our Algorithm 2 essentially gives an upper bound on this angle metric. When we shift our objective and our goal is to control the top-π‘Ÿdisagreement, we can still apply Algorithm 2 which essentially controls the angle metric. The crucial ingredient that is missing is the relation between the loss we have to control, i.e., the expected top-π‘Ÿdisagreement and the angle metric of Equation 1. This relation is presented right after and essentially says that the expected top-π‘Ÿdisagreement is at most 𝑂(π‘˜π‘Ÿ) times this angle metric. Hence, in order to get top-π‘Ÿdisagreement of order πœ–, it suffices to apply our Algorithm 2 with πœ– = 𝑂(πœ–/(π‘˜π‘Ÿ)). We continue with our main contribution which is the following lemma that connects the top-π‘Ÿ disagreement metric with the geometric distance 𝑑angle( , ), recall Lemma 1. To keep this sketch simple we shall present a sketch of the proof of Lemma 1 for the special case of top-1 classification, which we restate below. The proof of the top-1 case can be found at the Appendix B. The detailed proof of the general case (π‘Ÿ> 1) can be found in the Appendix C. Lemma 3 (Top-1 Disagreement Loss via 𝑑angle( , )). Consider two matrices π‘ˆ, 𝑉 Rπ‘˜ 𝑑and let 𝒩𝑑be the standard Gaussian in 𝑑dimensions. We have that Pr π‘₯ 𝒩𝑑[𝜎1(π‘ˆπ‘₯) = 𝜎1(𝑉π‘₯)] 𝑂 ( π‘˜ log π‘˜ ) 𝑑angle(π‘ˆ, 𝑉) . We observe that Pr π‘₯ 𝒩𝑑[𝜎1(π‘ˆπ‘₯) = 𝜎1(𝑉π‘₯)] = 𝑖 [π‘˜] Pr π‘₯ 𝒩𝑑[𝜎1(π‘ˆπ‘₯) = 𝑖, 𝜎1(𝑉π‘₯) = 𝑖] . (1) We denote by π’ž(𝑖) π‘ˆ 1{π‘₯: 𝜎1(π‘ˆπ‘₯) = 𝑖} = 𝑗 =𝑖1{(π‘ˆπ‘– π‘ˆπ‘—) π‘₯ 0}, i.e., this is the set where the ranking corresponding to π‘ˆpicks 𝑖as the top element. Note that π’ž(𝑖) π‘ˆis the indicator of a homogeneous polyhedral cone since it can be written as the intersection of homogeneous halfspaces. Using these cones we can rewrite the top-1 disagreement of Eq. 
(1) as Pr π‘₯ 𝒩𝑑[𝜎1(π‘ˆπ‘₯) = 𝜎1(𝑉π‘₯)] = 𝑖 [π‘˜] Pr π‘₯ 𝒩𝑑[𝐢(𝑖) π‘ˆ(π‘₯) = 1, 𝐢(𝑖) 𝑉(π‘₯) = 0] . (2) Hence, our task is to control the mass of the disagreement region of two cones. The next Lemma 4 achieves this task and, combined with Eq. (2) directly gives the conclusion of Lemma 3. Next we work with two general homogeneous polyhedral cones with set indicators 𝐢1, 𝐢2: Lemma 4 (Cone Disagreement). Let 𝐢1, 𝐢2 : R𝑑 {0, 1} be homogeneous polyhedral cones defined by the π‘˜unit vectors 𝑣1, . . . , π‘£π‘˜and 𝑒1, . . . , π‘’π‘˜respectively. For some universal constant 𝑐> 0, it holds that Prπ‘₯ 𝒩𝑑[𝐢1(π‘₯) = 𝐢2(π‘₯)] 𝑐 log π‘˜max𝑖 [π‘˜] πœƒ(𝑣𝑖, 𝑒𝑖) . Roadmap of the Proof of Lemma 4: Assume that we rotate one face of the polyhedral cone 𝐢1 by a very small angle πœƒto obtain the perturbed cone 𝐢2. At a high-level, we expect the probability of the disagreement region between the new cone 𝐢2 and 𝐢1 to be roughly (this is an underestimation) equal to the size of the perturbation πœƒtimes the (Gaussian) surface area of the face of the convex cone that we perturbed. The Gaussian Surface Area (GSA) of a convex set 𝐴 R𝑑, is defined as Ξ“(𝐴) π΄πœ‘π‘‘(π‘₯)π‘‘πœ‡(π‘₯), where π‘‘πœ‡(π‘₯) is the standard surface measure in R𝑑and πœ‘π‘‘(π‘₯) = (2πœ‹) 𝑑/2 exp( π‘₯ 2 2/2). In fact, in Claim 3 below, we show that the probability of the disagreement between 𝐢1 and 𝐢2 is roughly 𝑂(πœƒ)Ξ“(𝐹1) log(1/Ξ“(𝐹1) + 1), where 𝐹1 is the face of cone 𝐢1 that we rotated. Now, when we perturb all the faces by small angles (all perturbations are at most πœƒ), we can show (via a sequence of triangle inequalities) that the total probability of the disagreement region is bounded above by the perturbation size πœƒtimes the sum of the Gaussian surface area of every face (times a logarithmic blow-up factor): Pr π‘₯ 𝒩𝑑[𝐢1(π‘₯) = 𝐢2(π‘₯)] 𝑂(πœƒ) log(1/Ξ“(𝐹𝑖) + 1) . Surprisingly, for homogeneous convex cones, the above sum cannot grow very fast with π‘˜. In fact, we show that it can be at most 𝑂( log π‘˜). To prove this, we crucially rely on the following convex geometry result showing that the Gaussian surface area of a homogeneous convex cone is 𝑂(1) regardless of the number of its faces π‘˜. Lemma 5 ([Naz03]). Let 𝐢be a homogeneous polyhedral cone with π‘˜faces 𝐹1, . . . , πΉπ‘˜. Then 𝐢 has Gaussian surface area Ξ“(𝐢) = π‘˜ 𝑖=1 Ξ“(𝐹𝑖) 1. Using an inequality similar to the fact that the maximum entropy of a discrete distribution on π‘˜elements is at most log π‘˜, and, since, from Lemma 5, it holds that π‘˜ 𝑖=1 Ξ“(𝐹𝑖) 1, we can show that π‘˜ 𝑖=1 Ξ“(𝐹𝑖) log(1/Ξ“(𝐹𝑖) + 1) = 𝑂( log π‘˜). Therefore, with the above lemma we conclude that, if the maximum angle perturbation that we perform on 𝐢1 is πœƒ, then the probability of the disagreement region is 𝑂(πœƒ). We next give the formal proof resulting in the upper bound of 𝑂( log π‘˜πœƒ) for the disagreement. Single Face Perturbation Bound: Claim 3: We will use the following notation for the positive orthant indicator 𝑅(𝑧) = π‘˜ 𝑖=1 1{𝑧𝑖 0}. Notice that the homogeneous polyhedral cone 𝐢1 can be written as 𝐢1(π‘₯) = 𝑅(𝑉π‘₯) = 𝑅(𝑣1 π‘₯, . . . , π‘£π‘˜ π‘₯). Claim 3 below shows that the disagreement of two cones that differ on a single normal vector is bounded by above by the Gaussian surface area of a particular face 𝐹1 times a logarithmic blow-up factor log(1/Ξ“(𝐹1) + 1). Claim 3. Let 𝑣1, . . . , π‘£π‘˜ R𝑑and π‘Ÿ R𝑑with πœƒ(𝑣1, π‘Ÿ) πœƒfor some sufficiently small πœƒ (0, πœ‹/2). 
Let 𝐹1 be the face with 𝑣1 π‘₯= 0 of the cone 𝑅(𝑉π‘₯) and 𝑐> 0 be some universal constant. Then, Pr π‘₯ 𝒩𝑑[𝑅(𝑣1 π‘₯, . . . , π‘£π‘˜ π‘₯) = 𝑅(π‘Ÿ π‘₯, 𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯)] 𝑐 πœƒ Ξ“(𝐹1) log ( 1 Ξ“(𝐹1) + 1 ) . Proof Sketch of Claim 3. Since the constraints 𝑣2 π‘₯ 0, . . . , π‘£π‘˜ π‘₯ 0 are common in the two cones, we have that 𝑅(𝑣1 π‘₯, . . . , π‘£π‘˜ π‘₯) = 𝑅(π‘Ÿ π‘₯, 𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯) only when the first halfspaces disagree, i.e., when (𝑣1 π‘₯)(π‘Ÿ π‘₯) < 0. Thus, we have that the LHS probability of Claim 3 is equal to E π‘₯ 𝒩𝑑[𝑅(𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯) 1{(𝑣1 π‘₯)(π‘Ÿ π‘₯) < 0}] . (3) This expectation contains two terms: the term 𝑅(𝑣2 π‘₯, . . . π‘£π‘˜ π‘₯) that contains the last π‘˜ 1 common constrains of the two cones and the region where the first two halfspaces disagree, i.e., the set {π‘₯: (𝑣1 π‘₯)(π‘Ÿ π‘₯) < 0}. In order to upper bound this integral in terms of the angle πœƒ, we observe that (for πœƒsufficiently small) it is not hard to show (see Appendix B) that the disagreement region, which is itself a (non-convex) cone, is a subset of the region {π‘₯: |𝑣1 π‘₯| 2πœƒ|π‘ž π‘₯|}, where π‘žthe normalized projection of π‘Ÿonto the orthogonal complement of 𝑣1, i.e., π‘ž= proj𝑣 1 π‘Ÿ/ proj𝑣 1 π‘Ÿ 2. Therefore, we have that the integral of Eq. (3) is at most E π‘₯ 𝒩𝑑[𝑅(𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯) 1{|𝑣1 π‘₯| 2πœƒ|π‘ž π‘₯|}] . This is where the definition of the Gaussian surface area appears. In fact, we have to compute the derivative of the above expression (which is a function of πœƒ) with respect to πœƒand evaluate it at πœƒ= 0. The idea behind this computation is that we can upper bound probability mass of the cone disagreement, i.e., the term Prπ‘₯ 𝒩𝑑[𝑅(𝑣1 π‘₯, . . . , π‘£π‘˜ π‘₯) = 𝑅(π‘Ÿ π‘₯, 𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯)] by its derivative with respect to πœƒ(evaluated at 0) times πœƒby introducing π‘œ(πœƒ) error. Hence, it suffices to upper bound the value of this derivative at 0, which is: 2 E π‘₯ 𝒩𝑑[𝑅(𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯) |π‘ž π‘₯| 𝛿(|𝑣1 π‘₯|)] , where 𝛿is the Dirac delta function. Notice that, if we did not have the term |π‘ž π‘₯|, the above expression would be exactly equal to two times the Gaussian surface area of the face with 𝑣1 π‘₯= 0, i.e., it would be equal to 2Ξ“(𝐹1). We now show that this extra term of |π‘ž π‘₯| can only increase the above surface integral by at most a logarithmic factor. For some πœ‰to be decided, we have that E π‘₯ 𝒩𝑑[𝑅(𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯) |π‘ž π‘₯| 𝛿(|𝑣1 π‘₯|)] = π‘₯ 𝐹1 πœ‘π‘‘(π‘₯)|π‘ž π‘₯|π‘‘πœ‡(π‘₯) π‘₯ 𝐹1 πœ‘π‘‘(π‘₯)|π‘ž π‘₯|1{|π‘ž π‘₯| πœ‰}π‘‘πœ‡(π‘₯) + π‘₯ 𝐹1 πœ‘π‘‘(π‘₯)|π‘ž π‘₯|1{|π‘ž π‘₯| πœ‰}π‘‘πœ‡(π‘₯) π‘₯ 𝐹1 πœ‘π‘‘(π‘₯)π‘‘πœ‡(π‘₯) + π‘₯ 𝐹1 πœ‘π‘‘(π‘₯)|π‘ž π‘₯|1{|π‘ž π‘₯| πœ‰}π‘‘πœ‡(π‘₯) , where π‘‘πœ‡(π‘₯) is the standard surface measure in R𝑑. The first integral above is exactly equal to the Gaussian surface area of the face 𝐹1. To bound from above the second term we can use the next claim showing that not a lot of mass of the face 𝐹1 can concentrate on the region where |π‘ž π‘₯| is very large. Its proof relies on standard Gaussian concentration arguments, and is provided in Appendix B. Claim 4. It holds that π‘₯ 𝐹1 πœ‘π‘‘(π‘₯)|π‘ž π‘₯|1{|π‘ž π‘₯| πœ‰}π‘‘πœ‡(π‘₯) 𝑂(exp( πœ‰2/2)) . Using the above result, we get that ( E π‘₯ 𝒩𝑑[𝑅(𝑣2 π‘₯, . . . 
, π‘£π‘˜ π‘₯) 1{|𝑣1 π‘₯| 2πœƒ|π‘ž π‘₯|}] ) πœƒ=0 𝑂(πœ‰) Ξ“(𝐹1)+𝑂(exp( πœ‰2/2)) . By picking πœ‰= Θ( log(1 + 1/Ξ“(𝐹1))), the result follows since, up to introducing π‘œ(πœƒ) error, we can bound the term Prπ‘₯ 𝒩𝑑[𝑅(𝑣1 π‘₯, . . . , π‘£π‘˜ π‘₯) = 𝑅(π‘Ÿ π‘₯, 𝑣2 π‘₯, . . . , π‘£π‘˜ π‘₯)] by its derivative with respect to πœƒ, evaluated at 0, times πœƒ. Conclusion. Our work presents the first theoretical guarantees for (linear) LR with noise and settles interesting directions for future work, as mentioned in Section 1. This paper is theoretical and does not have any negative social impact. Acknowledgments and Disclosure of Funding Dimitris Fotakis and Alkis Kalavasis were supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment grant , project BALSAM, HFRIFM17-1424. [ABHU15] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167 190. PMLR, 2015. [ABHZ16] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Conference on Learning Theory, pages 152 192. PMLR, 2016. [ABSV14] Pranjal Awasthi, Avrim Blum, Or Sheffet, and Aravindan Vijayaraghavan. Learning mixtures of ranking models. ar Xiv preprint ar Xiv:1410.8750, 2014. [ACN08] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1 27, 2008. [AGM17] Juan A Aledo, JosΓ© A GΓ‘mez, and David Molina. Tackling the supervised label ranking problem by bagging weak learners. Information Fusion, 35:38 50, 2017. [AL88] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343 370, 1988. [APA18] Arpit Agarwal, Prathamesh Patil, and Shivani Agarwal. Accelerated spectral ranking. In International Conference on Machine Learning, pages 70 79. PMLR, 2018. [BDCBL92] Shai Ben-David, NicolΓ² Cesa-Bianchi, and Philip M Long. Characterizations of learnability for classes of {O,..., n}-valued functions. In Proceedings of the fifth annual workshop on Computational learning theory, pages 333 340, 1992. [BFFSZ19] RΓ³bert Busa-Fekete, Dimitris Fotakis, BalΓ‘zs SzΓΆrΓ©nyi, and Manolis Zampetakis. Optimal learning of mallows block model. In Conference on Learning Theory, pages 529 532. PMLR, 2019. [BH20] Maria-Florina Balcan and Nika Haghtalab. Noise in classification., 2020. [BM09] Mark Braverman and Elchanan Mossel. Sorting from noisy information. ar Xiv preprint ar Xiv:0910.1191, 2009. [BT52] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1952. [BZ17] Maria-Florina F Balcan and Hongyang Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. Advances in Neural Information Processing Systems, 30, 2017. [CDH10] Weiwei Cheng, Krzysztof Dembczynski, and Eyke HΓΌllermeier. Label ranking methods based on the plackett-luce model. In ICML, 2010. [CH08] Weiwei Cheng and Eyke HΓΌllermeier. Instance-based label ranking using the mallows model. In ECCBR Workshops, pages 143 157, 2008. [CH12] Weiwei Cheng and Eyke HΓΌllermeier. Probability estimation for multi-class classification based on label ranking. 
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 83 98. Springer, 2012. [CHH09] Weiwei Cheng, Jens HΓΌhn, and Eyke HΓΌllermeier. Decision tree and instance-based learning for label ranking. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 161 168, 2009. [CKMY20] Sitan Chen, Frederic Koehler, Ankur Moitra, and Morris Yau. Classification under misspecification: Halfspaces, generalized linear models, and evolvability. Advances in Neural Information Processing Systems, 33:8391 8403, 2020. [CKS18] Stephan ClΓ©menΓ§on, Anna Korba, and Eric Sibony. Ranking median regression: Learning to order through local consensus. In Algorithmic Learning Theory, pages 212 245. PMLR, 2018. [CPS13] Ioannis Caragiannis, Ariel D Procaccia, and Nisarg Shah. When do noisy votes reveal the truth? In Proceedings of the fourteenth ACM conference on Electronic commerce, pages 143 160, 2013. [CV20] StΓ©phan ClΓ©menΓ§on and Robin Vogel. A multiclass classification approach to label ranking. In International Conference on Artificial Intelligence and Statistics, pages 1421 1430. PMLR, 2020. [Dan16] Amit Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105 117, 2016. [DGR+14] Nemanja Djuric, Mihajlo Grbovic, Vladan Radosavljevic, Narayan Bhamidipati, and Slobodan Vucetic. Non-linear label ranking for large-scale prediction of long-term user interests. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014. [DGT19] Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent pac learning of halfspaces with massart noise. Advances in Neural Information Processing Systems, 32, 2019. [DK20] Ilias Diakonikolas and Daniel M Kane. Hardness of learning halfspaces with massart noise. ar Xiv preprint ar Xiv:2012.09720, 2020. [DKK+21] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning general halfspaces with general massart noise under the gaussian distribution. ar Xiv preprint ar Xiv:2108.08767, 2021. [DKM05] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptronbased active learning. In International conference on computational learning theory, pages 249 263. Springer, 2005. [DKPZ21] Ilias Diakonikolas, Daniel M Kane, Thanasis Pittas, and Nikos Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals. ar Xiv preprint ar Xiv:2102.04401, 2021. [DKT21] Ilias Diakonikolas, Daniel Kane, and Christos Tzamos. Forster decomposition and learning halfspaces with noise. Advances in Neural Information Processing Systems, 34, 2021. [DKTZ20] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with massart noise under structured distributions. In Conference on Learning Theory, pages 1486 1513. PMLR, 2020. [DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586 13596, 2020. [DOS18] Anindya De, Ryan O Donnell, and Rocco Servedio. Learning sparse mixtures of rankings from noisy information. ar Xiv preprint ar Xiv:1811.01216, 2018. [DSBDSS11] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the erm principle. 
In Proceedings of the 24th Annual Conference on Learning Theory, pages 207 232. JMLR Workshop and Conference Proceedings, 2011. [DSM03] Ofer Dekel, Yoram Singer, and Christopher D Manning. Log-linear models for label ranking. Advances in neural information processing systems, 16:497 504, 2003. [DSS14] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287 316. PMLR, 2014. [d SSKC17] ClΓ‘udio Rebelo de SΓ‘, Carlos Soares, Arno Knobbe, and Paulo Cortez. Label ranking forests. Expert systems, 34(1):e12166, 2017. [FGKP06] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 06), pages 563 574. IEEE, 2006. [FHMB08] Johannes FΓΌrnkranz, Eyke HΓΌllermeier, Eneldo Loza MencΓ­a, and Klaus Brinker. Multilabel classification via calibrated label ranking. Machine learning, 73(2):133 153, 2008. [FKP21] Dimitris Fotakis, Alkis Kalavasis, and Eleni Psaroudaki. Label ranking through nonparametric regression. ar Xiv preprint ar Xiv:2111.02749, 2021. [FKS21] Dimitris Fotakis, Alkis Kalavasis, and Konstantinos Stavropoulos. Aggregating incomplete and noisy rankings. In International Conference on Artificial Intelligence and Statistics, pages 2278 2286. PMLR, 2021. [GDGV13] Mihajlo Grbovic, Nemanja Djuric, Shengbo Guo, and Slobodan Vucetic. Supervised clustering of label ranking data using label preference information. Machine learning, 93(2-3):191 225, 2013. [GDV12] Mihajlo Grbovic, Nemanja Djuric, and Slobodan Vucetic. Learning from pairwise preference data using gaussian mixture model. Preference Learning: Problems and Applications in AI, 33, 2012. [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147 2158, 2020. [GR09] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742 765, 2009. [Hau18] David Haussler. Decision theoretic generalizations of the pac model for neural net and other learning applications. In The Mathematics of Generalization, pages 37 116. CRC Press, 2018. [HFCB08] Eyke HΓΌllermeier, Johannes FΓΌrnkranz, Weiwei Cheng, and Klaus Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16):1897 1916, 2008. [HPRZ03] Sariel Har-Peled, Dan Roth, and Dav Zimak. Constraint classification for multiclass classification and ranking. Advances in neural information processing systems, pages 809 816, 2003. URL: https://proceedings.neurips.cc/paper/2002/file/ 16026d60ff9b54410b3435b403afd226-Paper.pdf. [HSSVG22] Daniel Hsu, Clayton Sanford, Rocco Servedio, and Emmanouil-Vasileios Vlatakis Gkaragkounis. Near-optimal statistical query lower bounds for agnostically learning intersections of halfspaces with gaussian marginals. ar Xiv preprint ar Xiv:2202.05096, 2022. [Hun04] David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384 406, 2004. [KCS17] Anna Korba, Stephan ClΓ©menΓ§on, and Eric Sibony. A learning theory of ranking aggregation. In Artificial Intelligence and Statistics, pages 1001 1010. PMLR, 2017. [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983 1006, 1998. [KGB18] Anna Korba, Alexandre Garcia, and Florence d AlchΓ© Buc. 
A structured prediction approach for label ranking. ar Xiv preprint ar Xiv:1807.02374, 2018. [KMS06] Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors a ptas for weighted feedback arc set on tournaments. In ELECTRONIC COLLOQUIUM ON COMPUTATIONAL COMPLEXITY, REPORT NO. 144 (2006). Citeseer, 2006. [KSS94] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17(2):115 141, 1994. [LB11] Tyler Lu and Craig Boutilier. Learning mallows models with pairwise preferences. In ICML, 2011. [LM18] Allen Liu and Ankur Moitra. Efficiently learning mixtures of mallows models. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 627 638. IEEE, 2018. [LM21] Allen Liu and Ankur Moitra. Robust voting rules from algorithmic robust statistics. ar Xiv preprint ar Xiv:2112.06380, 2021. [Luc12] R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012. [LV07] LΓ‘szlΓ³ LovΓ‘sz and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307 358, 2007. [Mal57] Colin L Mallows. Non-null ranking models. i. Biometrika, 44(1/2):114 130, 1957. [MN06] Pascal Massart and Γ‰lodie NΓ©dΓ©lec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326 2366, 2006. [MV19] Oren Mangoubi and Nisheeth K Vishnoi. Nonconvex sampling with the metropolisadjusted langevin algorithm. In Conference on Learning Theory, pages 2259 2293. PMLR, 2019. [MW20] Cheng Mao and Yihong Wu. Learning mixtures of permutations: Groups of pairwise comparisons and combinatorial method of moments. ar Xiv preprint ar Xiv:2009.06784, 2020. [Nat89] Balas K Natarajan. On learning sets and functions. Machine Learning, 4(1):67 97, 1989. [Naz03] Fedor Nazarov. On the maximal perimeter of a convex set in R𝑛with respect to a gaussian measure. In Geometric aspects of functional analysis, pages 169 187. Springer, 2003. [NOS17] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266 287, 2017. [NT22] Rajai Nasser and Stefan Tiegel. Optimal sq lower bounds for learning halfspaces with massart noise. ar Xiv preprint ar Xiv:2201.09818, 2022. [Pap81] Christos H Papadimitriou. On the complexity of integer programming. Journal of the ACM (JACM), 28(4):765 768, 1981. [Rd SRSK15] ClΓ‘udio Rebelo de SΓ‘, Carla Rebelo, Carlos Soares, and Arno Knobbe. Distance-based decision tree algorithms for label ranking. In Portuguese Conference on Artificial Intelligence, pages 525 534. Springer, 2015. [RS94] Ronald L Rivest and Robert Sloan. A formal model of hierarchical concept-learning. Information and Computation, 114(1):88 114, 1994. [Sch98] Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998. [Slo88] Robert Sloan. Types of noise in data for concept learning. In Proceedings of the first annual Workshop on Computational Learning Theory, pages 91 96, 1988. [Slo92] Robert H Sloan. Corrigendum to types of noise in data for concept learning. In Proceedings of the fifth annual workshop on Computational learning theory, page 450, 1992. [Slo96] Robert H Sloan. Pac learning, noise, and geometry. In Learning and Geometry: Computational Approaches, pages 21 41. Springer, 1996. [SS07] Shai Shalev-Shwartz. Online learning: Theory, algorithms, and applications. Ph D thesis, The Hebrew University of Jerusalem, 2007. [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. 
Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014. [Vap06] Vladimir Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006. [VG10] Shankar Vembu and Thomas GΓ€rtner. Label ranking algorithms: A survey. In Preference Learning, pages 45-64. Springer, 2010. [Vis21] Nisheeth K. Vishnoi. Algorithms for Convex Optimization. Cambridge University Press, 2021. [VZW09] Anke Van Zuylen and David P. Williamson. Deterministic pivoting algorithms for constrained ranking and clustering problems. Mathematics of Operations Research, 34(3):594-620, 2009. [YZ17] Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. Advances in Neural Information Processing Systems, 30, 2017. [ZL21] Chicheng Zhang and Yinan Li. Improved algorithms for efficient active learning halfspaces with Massart and Tsybakov noise. In Proceedings of the 34th Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4526-4527. PMLR, 2021. [ZLC17] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pages 1980-2022. PMLR, 2017. [ZLGQ14] Yangming Zhou, Yangguang Liu, Xiao-Zhi Gao, and Guoping Qiu. A label ranking method based on Gaussian mixture model. Knowledge-Based Systems, 72:108-113, 2014. [ZLY+14] Yangming Zhou, Yangguang Liu, Jiangang Yang, Xiaoqi He, and Liangliang Liu. A taxonomy of label ranking algorithms. J. Comput., 9(3):557-565, 2014. [ZQ18] Yangming Zhou and Guoping Qiu. Random forest for label ranking. Expert Systems with Applications, 112:99-109, 2018. [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems, volume 33, pages 7184-7197. Curran Associates, Inc., 2020.