MRA-based Statistical Learning from Incomplete Rankings

Eric Sibony ERIC.SIBONY@TELECOM-PARISTECH.FR
Stéphan Clémençon STEPHAN.CLEMENCON@TELECOM-PARISTECH.FR
LTCI UMR No. 5141, Telecom ParisTech/CNRS, Institut Mines-Telecom, Paris, 75013, France

Jérémie Jakubowicz JEREMIE.JAKUBOWICZ@TELECOM-SUDPARIS.EDU
SAMOVAR UMR No. 5157, Telecom SudParis/CNRS, Institut Mines-Telecom, Evry, 91000, France

Abstract

Statistical analysis of rank data describing preferences over small and variable subsets of a potentially large ensemble of items {1, . . . , n} is a very challenging problem. It is motivated by a wide variety of modern applications, such as recommender systems or search engines. However, very few inference methods have been documented in the literature to learn a ranking model from such incomplete rank data. The goal of this paper is twofold: it develops a rigorous mathematical framework for the problem of learning a ranking model from incomplete rankings and introduces a novel general statistical method to address it. Based on an original concept of multiresolution analysis (MRA) of incomplete rankings, it finely adapts to any observation setting, leading to a statistical accuracy and an algorithmic complexity that depend directly on the complexity of the observed data. Beyond theoretical guarantees, we also provide experimental results that show its statistical performance.

1. Introduction

Motivated by numerous modern applications, ranging from the design of recommender systems to customer analytics through the optimization of search engines, the analysis of rank data has recently been the subject of much attention in the machine-learning literature. A full ranking on a set of $n \geq 1$ items $[\![n]\!] := \{1, \dots, n\}$ is an ordering of the form $a_1 \succ \dots \succ a_n$, where $a_1$ and $a_n$ are respectively the items ranked first and last. It is usually described as the permutation $\sigma$ on $[\![n]\!]$ that maps an item to its rank, i.e. such that $\sigma(a_i) = i$ for all $i \in [\![n]\!]$. The variability of a statistical population of full rankings is thus modeled by a probability distribution $p$ on the symmetric group $\mathfrak{S}_n$, the set of all permutations on $[\![n]\!]$, called a ranking model. The statistical issue is then to learn this ranking model from the available observations, either through parametric modeling, such as the models proposed in the seminal contributions of (Mallows, 1957) and (Luce, 1959; Plackett, 1975) or more recently in (Fligner & Verducci, 1986; Liqun, 2000; Lebanon & Lafferty, 2002; Meek & Meila, 2014), or through nonparametric-like approaches (i.e. without assuming an explicit structure for the distribution), such as kernel estimation (Lebanon & Mao, 2008; Sun et al., 2012), techniques based on sparsity concepts (Jagabathula & Shah, 2011), independence modeling (Huang & Guestrin, 2009) or inference based on harmonic analysis of the space of distributions on $\mathfrak{S}_n$ (Diaconis, 1988; Huang et al., 2009; Kondor & Barbosa, 2010; Irurozki et al., 2011; Kakarala, 2011; Kondor & Dempsey, 2012). However, a major issue arises in practice from the fact that rank data seldom take the form of full rankings. Much more frequently, the available rank data carry only limited information, in the form of partial and/or incomplete rankings.
Formally, they correspond to partial orders on $[\![n]\!]$ and can be naturally identified with subsets of permutations on $[\![n]\!]$, namely the subsets formed by their linear extensions. Let $p$ be the probability distribution on $\mathfrak{S}_n$ of the random permutation $\Sigma$ that models the variability of the preferences among the population of interest. The probability of a partial order represented by $S \subset \mathfrak{S}_n$ is then defined by $\mathbb{P}[\Sigma \in S] = \sum_{\sigma \in S} p(\sigma)$. For instance, the probability that item $i$ is ranked first is given by $\mathbb{P}[\Sigma(i) = 1] = \sum_{\sigma \in \mathfrak{S}_n,\ \sigma(i) = 1} p(\sigma)$, and the probability that item $i$ is preferred to item $j$ is given by $\mathbb{P}[\Sigma(i) < \Sigma(j)] = \sum_{\sigma \in \mathfrak{S}_n,\ \sigma(i) < \sigma(j)} p(\sigma)$. These quantities correspond to marginal probabilities of the ranking model $p$. One may either seek to estimate the full ranking model $p$ from degraded observations under a strong structural assumption on $p$, or else try to learn what is accessible from the observations and typically build an empirical ranking model whose observable marginal distributions are close to those of the true underlying model.

Special attention has been paid in the literature to the situation where observations are partial rankings or bucket orders, defined as full rankings with ties, i.e. as orderings of the form $a_{1,1}, \dots, a_{1,n_1} \succ \dots \succ a_{r,1}, \dots, a_{r,n_r}$ where $1 \leq r \leq n$ and $\sum_{i=1}^{r} n_i = n$ (see Critchlow, 1985; Fagin et al., 2006, for instance). This type of data includes most preferred items or, more generally, top-k rankings. The information carried by such data is of global nature, in the sense that it involves all the items (e.g. a most preferred item is preferred to every other item). Most estimation procedures originally designed for the observation of full rankings then easily extend to the case where the available data are partial rankings (e.g. Lebanon & Lafferty, 2002; Lebanon & Mao, 2008; Huang et al., 2012; Kakarala, 2012). In many practical situations however, the rank data at disposal only provide local information, in the sense that they do not involve all the $n$ items but only small and variable subsets of items. In a recommendation setting for instance, users usually express their preferences through implicit feedback related to the subset of recommended items, not to all the items in the catalog. The first step towards the analysis of such data is the analysis of orderings of the form $a_1 \succ \dots \succ a_k$ with $k < n$, referred to as incomplete rankings. The case $k = 2$ corresponds to the setting of pairwise comparisons, the most widely considered in the literature (e.g. Wauthier et al., 2013; Busa-Fekete et al., 2014; Rajkumar & Agarwal, 2014). Roughly speaking, the general problem of ranking model inference based on incomplete rankings is poorly understood. For instance, while the Plackett-Luce model is naturally adapted to such rank data and can be used with various statistical estimation techniques (see Hunter, 2004; Guiver & Snelson, 2009; Azari Soufiani et al., 2013), inferring the Mallows model in this context is only possible with one method, introduced in (Lu & Boutilier, 2011). In addition, only two non-parametric methods capable of dealing with incomplete rankings of arbitrary length have been documented in the literature, namely those introduced in Kondor & Barbosa (2010) and Sun et al. (2012). It is the main purpose of this paper to propose a novel methodology for the statistical analysis of incomplete rankings. It crucially relies on a recent construction of a multiresolution analysis (MRA) for incomplete rankings, introduced in Clémençon et al. (2014).
Related concepts and results involved in the subsequent analysis are summarized in section 3. The first contribution of the present article is the rigorous definition of an appropriate statistical framework in section 2, together with a full characterization of the components of a ranking model that can be statistically recovered without any structural assumption. Its second contribution is the introduction of a general method to learn these components in section 4. The method is model-free and statistically efficient, while having tractable complexity, as shown in section 5. Experimental results are also provided in section 6 to illustrate its performance.

Notations. For a set $E$ of finite cardinality $|E| < \infty$, we set $\mathcal{P}(E) = \{A \subset E \mid |A| \geq 2\}$ and denote by $L(E) = \{f : E \to \mathbb{R}\}$ the Euclidean space of real-valued functions on $E$, equipped with the canonical inner product $\langle f, g \rangle = \sum_{x \in E} f(x)g(x)$ and the associated norm $\|f\| = (\sum_{x \in E} f(x)^2)^{1/2}$. The indicator function of a subset $S \subset E$ is denoted by $\mathbf{1}_S$ in general and by $\delta_x$ when $S$ is the singleton $\{x\}$, in which case it is called a Dirac function. For a r.v. $X$ and a probability distribution $\mu$ on a measurable space $\mathcal{X}$, the notation $X \sim \mu$ means that $\mu$ is the probability distribution of $X$. For any sigma-algebra $\mathcal{B}$, we denote by $\mathcal{F}(\mathcal{B}, \mathcal{X})$ the set of random variables on $\mathcal{X}$ that are $\mathcal{B}$-measurable.

2. Problem Statement

If full rankings can be regarded as permutations, it is more convenient to see an incomplete ranking $\pi_1 \succ \dots \succ \pi_k$ as the injective word $\pi = \pi_1 \dots \pi_k$, where $2 \leq k \leq n$ (see Kitaev, 2011). The content of a ranking $\pi$ is the set $c(\pi) = \{\pi_1, \dots, \pi_k\}$ and its length is the number $|\pi| = |c(\pi)|$. We denote by $\Gamma(A)$ the set of all injective words of content $A \in \mathcal{P}([\![n]\!])$ and by $\Gamma_n$ the set of all incomplete rankings on $[\![n]\!]$. Notice that, equipped with these notations, $\Gamma([\![n]\!])$ corresponds to $\mathfrak{S}_n$. Word $\pi' \in \Gamma_n$ is a subword of word $\pi \in \Gamma_n$ if there exist indices $1 \leq i_1 < \dots < i_{|\pi'|} \leq |\pi|$ such that $\pi' = \pi_{i_1} \dots \pi_{i_{|\pi'|}}$; we then write $\pi' \subset \pi$.

Incomplete rankings - Probabilistic model. With these notations, the marginal distribution of a ranking model $p$ on a subset $A \in \mathcal{P}([\![n]\!])$ is the probability distribution $P_A$ over $\Gamma(A)$ defined for $\pi \in \Gamma(A)$ by

$P_A(\pi) = \sum_{\sigma \in \mathfrak{S}_n,\ \pi \subset \sigma} p(\sigma).$    (1)

As explained in the Introduction, it is assumed here that full realizations of a random permutation $\Sigma \sim p$ are not observable. The variability of rankings over a given subset $A \in \mathcal{P}([\![n]\!])$ is described by $P_A$, and it is thus natural to model the observed rankings over $A$ as i.i.d. realizations of $P_A$. The complexity in learning from incomplete rankings stems from the fact that observations are not made on one single subset of items only, but on a possibly very large collection of (small) subsets. In e-commerce for instance, a user only expresses her preferences on the subset of items she came upon while browsing. We model the observation of an incomplete ranking by a random pair $(A, \Pi)$, where $A$ is the subset of items involved in the ranking and $\Pi$ is the ranking per se. We assume that $A$ is independent from the underlying random variable $\Sigma$ and drawn from an unknown probability distribution $\nu$ on $\mathcal{P}([\![n]\!])$. In this setting, a dataset $\mathcal{D}_N$ is formed of $N \geq 1$ i.i.d. pairs $(A_1, \Pi^{(1)}), \dots, (A_N, \Pi^{(N)})$, drawn according to the following scheme:

$A_i \sim \nu$ and $\Pi^{(i)} \mid (A_i = A) \sim P_A.$    (2)
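To make the sampling scheme (2) concrete, here is a minimal Python sketch of the data-generating process, assuming a toy ranking model $p$ and design distribution $\nu$ stored as plain dictionaries (the names and the representation of rankings as tuples listing items from best to worst are illustrative choices of ours, not the paper's). Drawing $\sigma \sim p$ and keeping only its restriction to $A$ indeed yields $\Pi \sim P_A$, by definition (1).

```python
import random
from itertools import permutations

def induced_ranking(sigma, A):
    """Ranking induced on the subset A by the full ranking sigma
    (a tuple listing the items from best to worst)."""
    return tuple(item for item in sigma if item in A)

def sample_dataset(p, nu, N, rng=random):
    """Draw N pairs (A_i, Pi_i) according to scheme (2): A_i ~ nu,
    then Pi_i ~ P_{A_i}, obtained by inducing sigma ~ p on A_i."""
    perms, p_weights = zip(*p.items())
    subsets, nu_weights = zip(*nu.items())
    dataset = []
    for _ in range(N):
        A = rng.choices(subsets, weights=nu_weights)[0]
        sigma = rng.choices(perms, weights=p_weights)[0]
        dataset.append((A, induced_ranking(sigma, A)))
    return dataset

# Toy example with n = 3: p uniform on S_3, nu uniform on the pairs.
p = {sigma: 1 / 6 for sigma in permutations(range(1, 4))}
nu = {frozenset(A): 1 / 3 for A in [(1, 2), (1, 3), (2, 3)]}
print(sample_dataset(p, nu, 5))
```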
Remark 1. The statistical model (2) can be viewed as a specific case of that introduced in Sun et al. (2012), where observations of incomplete rankings are modeled as censored observations of permutations, leading to the following setup: first a permutation $\Sigma$ is drawn from $p$, then the subset of items $A$ is drawn from a probability distribution $\nu_\Sigma$ on $\mathcal{P}([\![n]\!])$, and the ranking $\Pi$ is taken equal to the ranking induced by $\Sigma$ over $A$. This setting boils down to ours when the distribution $\nu_\sigma$ is equal to $\nu$ for all $\sigma \in \mathfrak{S}_n$.

Identifiability. In this paper, the goal pursued is not to recover the true underlying model $p$. Focus is rather on learning an empirical ranking model as close as possible to the true underlying model $p$ given $\nu$. Indeed, not all subsets of items are assumed to be observable, which is the case in many applications where preferences are only observed on small subsets. In the absence of structural assumptions, the ranking model $p$ is not identifiable. For instance, when only pairwise comparisons can be observed, one has access to the pairwise marginals, whose distributions are each characterized by a single parameter, leading to at most $n(n-1)/2$ accessible parameters, whereas $p$ is described by $n! - 1$ free parameters, and thus cannot be identified. In the general case, the accessible marginals are the $P_A$'s for $A$ lying in the support of the distribution $\nu$, which is denoted by $\mathcal{A} = \{A \in \mathcal{P}([\![n]\!]) \mid \nu(A) > 0\}$ and referred to as the observation design in the sequel. The number of accessible parameters is therefore equal to $\sum_{A \in \mathcal{A}} (|A|! - 1)$, which can be greater than $n! - 1$ even if $[\![n]\!] \notin \mathcal{A}$. It is then a priori unclear whether these parameters characterize $p$ or, more generally, how many degrees of freedom they have. The answers are provided by the following theorem. For any collection of subsets $\mathcal{S} \subset \mathcal{P}([\![n]\!])$, we set $\mathcal{P}(\mathcal{S}) = \bigcup_{A \in \mathcal{S}} \mathcal{P}(A)$.

Theorem 1. In the absence of any restrictive assumption on the ranking model $p$, only the marginals $P_B$ for $B \in \mathcal{P}(\mathcal{A})$ are identifiable from data drawn from (2). They are characterized by $\sum_{B \in \mathcal{P}(\mathcal{A})} d_{|B|}$ independent parameters, where $d_k$ is the number of fixed-point free permutations (also called derangements) on a set with $k$ elements.

Theorem 1 is a direct consequence of the MRA of incomplete rankings introduced in Clémençon et al. (2014), briefly recalled in section 3. It shows for instance that if $\mathcal{A} = \{A \subset [\![n]\!] \mid 2 \leq |A| \leq k\}$ then the accessible parameters have $O(n^k)$ degrees of freedom. Attempting to estimate $p$ without any assumption is thus vain in general.

Statistical framework. Practical applications require the construction of a ranking model that can be used to compute probabilities of rankings on any subset of items $A \in \mathcal{P}([\![n]\!])$, as is well illustrated in Sun et al. (2012). As $\Gamma_n = \bigsqcup_{A \in \mathcal{P}([\![n]\!])} \Gamma(A)$, we embed all the spaces $L(\Gamma(A))$ into $L(\Gamma_n)$. Let then $M_A : L(\Gamma_n) \to L(\Gamma(A))$ be the marginal operator on $A \in \mathcal{P}([\![n]\!])$, defined for any $f \in L(\Gamma_n)$ and $\pi \in \Gamma(A)$ by

$M_A f(\pi) = \sum_{\sigma \in \Gamma_n,\ \pi \subset \sigma} f(\sigma),$

so that $M_A p = P_A$ for all $A \in \mathcal{P}([\![n]\!])$ and $M_A f = 0$ if $f \in L(\Gamma(B))$ with $A \notin \mathcal{P}(B)$. We then consider the problem of building an empirical ranking model $\hat{q}_N$ on $\mathfrak{S}_n$ from the dataset $\mathcal{D}_N$ such that its marginals $M_A \hat{q}_N$ are accurate estimators of the $P_A$'s, for $A \in \mathcal{P}(\mathcal{A})$. Mathematically, building $\hat{q}_N$ from $\mathcal{D}_N$ means that $\hat{q}_N$ must be $\mathcal{B}_N$-measurable, where $\mathcal{B}_N$ is the $\sigma$-algebra generated by $\mathcal{D}_N$. We allow $\hat{q}_N$ to take negative values, but we impose that $\sum_{\sigma \in \mathfrak{S}_n} \hat{q}_N(\sigma) = 1$. The possible negativity of $\hat{q}_N(\sigma)$ for $\sigma \in \mathfrak{S}_n$ does not have much impact in practice because, when the marginals $M_A \hat{q}_N$ are close to the $P_A$'s, they only take positive values.
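For small $n$, the marginal operator $M_A$ can be implemented directly by grouping permutations according to the word they induce on $A$, which computes exactly the sum (1). A minimal sketch, with the same illustrative dictionary representation as above:

```python
from itertools import permutations

def marginal(p, A):
    """Marginal P_A of a ranking model p, per equation (1): P_A(pi) is
    the total mass of the permutations sigma admitting pi as a subword,
    i.e. those inducing the word pi on A. Permutations are tuples
    listing items from best to worst."""
    P_A = {}
    for sigma, prob in p.items():
        pi = tuple(item for item in sigma if item in A)
        P_A[pi] = P_A.get(pi, 0.0) + prob
    return P_A

# Probability that item 1 is preferred to item 2 under the uniform
# model on S_3: both marginal values equal 1/2.
p = {sigma: 1 / 6 for sigma in permutations(range(1, 4))}
print(marginal(p, {1, 2}))  # {(1, 2): 0.5, (2, 1): 0.5}
```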
We evaluate the quality of the estimation of $P_A$ for $A \in \mathcal{A}$ through the mean squared error (MSE) $\mathbb{E}[\|M_A \hat{q}_N - P_A\|_A^2]$, where $\|\cdot\|_A$ denotes the Euclidean norm on $L(\Gamma(A))$. Here and throughout the article, the symbol $\mathbb{E}$ represents the expectation with respect to the randomly drawn dataset $\mathcal{D}_N$. As we consider the problem of simultaneous estimation of the marginals, it is natural to consider the sum of the errors on each $A$ weighted by $\nu$.

Definition 1 (Performance measure). The performance of an empirical ranking model $\hat{q}_N \in \mathcal{F}(\mathcal{B}_N, L(\mathfrak{S}_n))$ with $\sum_{\sigma \in \mathfrak{S}_n} \hat{q}_N(\sigma) = 1$ is given by

$\mathcal{E}(\hat{q}_N) := \sum_{A \in \mathcal{A}} \nu(A)\, \mathbb{E}\big[\|M_A \hat{q}_N - P_A\|_A^2\big].$

Statistical and computational challenge. We emphasize that constructing an accurate empirical ranking model $\hat{q}_N$ in this setting is far from trivial, because the marginals $M_A \hat{q}_N$ are linked through highly entangled combinatorial relationships. Estimating marginals on different subsets thus cannot be done separately (refer to the Supplementary Material for an illustrative example). Even if the $P_A$'s are assumed to be known, finding a function $q \in L(\mathfrak{S}_n)$ such that $M_A q = P_A$ for all $A \in \mathcal{A}$ requires solving a linear system of $\sum_{A \in \mathcal{A}} |A|!$ equations with $n!$ unknowns, where each equation $M_A q(\pi) = P_A(\pi)$ for $A \in \mathcal{A}$ and $\pi \in \Gamma(A)$ involves at least $n!/|A|!$ operations. It therefore represents a daunting computational task as soon as $n > 10$, whereas $n$ is around $10^4$ in practical applications. Fortunately, as shall be seen below, the MRA framework brings a new representation of the data tailored to this task, drastically reducing this complexity.

3. MRA of Incomplete Rankings

Multiresolution analysis of incomplete rankings crucially relies on the following result: any function $f$ on $\mathfrak{S}_n$ can be decomposed as a sum of components that each localize the specific information of one marginal $M_B f$ for $B \in \mathcal{P}([\![n]\!])$, in the sense that the marginal $M_A f$ on any subset $A \in \mathcal{P}([\![n]\!])$ only involves the components specific to the subsets $B \in \mathcal{P}(A)$. This is formalized in the following theorem, established in Clémençon et al. (2014). Let us denote by $V_0 = \mathbb{R}\mathbf{1}_{\mathfrak{S}_n}$ the one-dimensional space of constant functions in $L(\mathfrak{S}_n)$ and, for $f \in L(\Gamma_n)$, set $\bar{f} = \sum_{\pi \in \Gamma_n} f(\pi)$.

Theorem 2. There exists a collection $(W_B)_{B \in \mathcal{P}([\![n]\!])}$ of subspaces of $L(\mathfrak{S}_n)$ orthogonal to $V_0$ such that:
1. $L(\mathfrak{S}_n) = V_0 \oplus \bigoplus_{B \in \mathcal{P}([\![n]\!])} W_B$,
2. For $B \in \mathcal{P}([\![n]\!])$ and $f \in W_B$, $M_B f = 0 \Rightarrow f = 0$,
3. For $A, B \in \mathcal{P}([\![n]\!])$ with $B \not\subset A$ and $f \in W_B$, $M_A f = 0$.
In particular, for $f \in L(\mathfrak{S}_n)$ decomposed as $f = (1/n!)\,\bar{f}\,\mathbf{1}_{\mathfrak{S}_n} + \sum_{B \in \mathcal{P}([\![n]\!])} f_B$ and $A \in \mathcal{P}([\![n]\!])$, one has:

$M_A f = \frac{1}{|A|!}\,\bar{f}\,\mathbf{1}_{\Gamma(A)} + \sum_{B \in \mathcal{P}(A)} M_A f_B.$

At last, $\dim W_B = d_{|B|}$ for all $B \in \mathcal{P}([\![n]\!])$.

Multiscale structure. MRA allows one to exploit the natural multiscale structure of the marginals $M_A f$ of any function $f \in L(\mathfrak{S}_n)$. Here, the notion of scale corresponds to the number of items in the subset on which the marginal is considered. For $k \in \{2, \dots, n\}$, the space of scale $k$ is defined by $W^k = \bigoplus_{|B| = k} W_B$. One thus obtains $L(\mathfrak{S}_n) = V_0 \oplus \bigoplus_{k=2}^{n} W^k$. This last decomposition of $L(\mathfrak{S}_n)$ is somehow analogous to classic MRA on $L^2(\mathbb{R})$ and offers a similar interpretation: if a function $f \in L(\mathfrak{S}_n)$ is projected onto $V_0 \oplus \bigoplus_{k=2}^{K} W^k$ with $K \in \{2, \dots, n\}$, then only the information of $f$ of scale up to $K$ can be captured. In other words, starting from a constant function in $V_0$, each space $W^k$ provides the supplementary level of details specific to scale $k$. The decomposition of $L(\mathfrak{S}_n)$ given by Theorem 2 actually allows one to further decompose a function $f \in L(\mathfrak{S}_n)$ in the "space of items", where each component $f_B$ provides the supplementary level of details specific to the marginal $M_B f$, for $B \in \mathcal{P}([\![n]\!])$.
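The dimension count in Theorem 2 can be checked numerically: summing $\dim V_0 = 1$ and $\dim W_B = d_{|B|}$ over all $B \in \mathcal{P}([\![n]\!])$ must recover $\dim L(\mathfrak{S}_n) = n!$, i.e. $n! = 1 + \sum_{k=2}^{n} \binom{n}{k}\, d_k$. A short sketch using the standard derangement recurrence $d_k = (k-1)(d_{k-1} + d_{k-2})$:

```python
from math import comb, factorial

def derangements(k):
    """Number d_k of fixed-point free permutations of k elements,
    via the recurrence d_k = (k - 1) * (d_{k-1} + d_{k-2})."""
    d = [1, 0]  # d_0 = 1, d_1 = 0
    for i in range(2, k + 1):
        d.append((i - 1) * (d[-1] + d[-2]))
    return d[k]

# Theorem 2 implies n! = 1 + sum_{k=2}^{n} C(n, k) * d_k,
# since dim V_0 = 1 and dim W_B = d_{|B|} for every subset B.
for n in range(2, 8):
    total = 1 + sum(comb(n, k) * derangements(k) for k in range(2, n + 1))
    assert total == factorial(n)
    print(n, factorial(n), total)
```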
To a certain extent, this space-scale localization is analogous to the classic space-scale localization in wavelet analysis. When incomplete ranking data generated through the scheme (2) are observed, one may form empirical versions of the marginals $M_A p$ of the ranking model $p$ on the subsets $A \in \mathcal{A}$ and represent them as elements of $L(\Gamma(A))$. This raw representation of the data is however not efficient for a statistical analysis because it does not allow one to exploit the structure induced by equation (1). In contrast, the MRA framework brings a new representation that defines efficient features for statistical analysis. To define this representation, we enter the details of the construction of the MRA framework.

For $B \in \mathcal{P}([\![n]\!])$, let $H_B$ be the subspace of $L(\Gamma(B))$ defined by $H_B = \{F \in L(\Gamma(B)) \mid M_{B'} F = 0 \text{ for all } B' \subsetneq B\}$. Let $\bar{0}$ be by convention the unique injective word of content $\emptyset$ and length 0. Word $\pi' \in \Gamma_n$ is said to be a contiguous subword of word $\pi \in \Gamma_n$ if there exists $i \in \{1, \dots, |\pi| - |\pi'| + 1\}$ such that $\pi' = \pi_i \pi_{i+1} \dots \pi_{i+|\pi'|-1}$. This is denoted by $\pi' \sqsubset \pi$. For $A \in \mathcal{P}([\![n]\!])$, define the linear embedding operator $\phi_A : L(\Gamma_n \cup \{\bar{0}\}) \to L(\Gamma(A))$ for $\sigma \in \Gamma(A)$ by $\phi_A \delta_{\bar{0}}(\sigma) = 1/|A|!$ and, for any ranking $\pi \in \Gamma_n$, by $\phi_A \delta_\pi(\sigma) = 1/(|A| - |\pi| + 1)!$ if $\pi \sqsubset \sigma$ and $0$ otherwise. Then for $A \in \mathcal{P}([\![n]\!])$, $\phi_{[\![n]\!]}$ is an isomorphism between $H_A$ and $W_A$. This means that for $f \in L(\mathfrak{S}_n)$ and $B \in \mathcal{P}([\![n]\!])$, there exists a unique element $X^{[\![n]\!]}_B f \in H_B$ such that $f_B = \phi_{[\![n]\!]} X^{[\![n]\!]}_B f$. Setting $X^{[\![n]\!]}_{\emptyset} f = \bar{f}\,\delta_{\bar{0}}$, one has

$f = \sum_{B \in \mathcal{P}([\![n]\!]) \cup \{\emptyset\}} \phi_{[\![n]\!]} X^{[\![n]\!]}_B f.$

More generally, Theorem 2 still holds true for any space $L(\Gamma(A))$ with $A \in \mathcal{P}([\![n]\!])$, so that $L(\Gamma(A)) = \mathbb{R}\mathbf{1}_{\Gamma(A)} \oplus \bigoplus_{B \in \mathcal{P}(A)} W^A_B$, where for $B \in \mathcal{P}(A)$, $W^A_B$ is the detail subspace of $L(\Gamma(A))$ related to $B$, and $\phi_A$ is an isomorphism between $H_B$ and $W^A_B$. Thus for any function $F \in L(\Gamma(A))$, there exists a unique family $(X^A_B F)_{B \in \mathcal{P}(A)}$ with $X^A_B F \in H_B$ for each $B \in \mathcal{P}(A)$ such that

$F = \sum_{B \in \mathcal{P}(A) \cup \{\emptyset\}} \phi_A X^A_B F,$

where $X^A_{\emptyset} F = \bar{F}\,\delta_{\bar{0}}$. This defines for any $B \in \mathcal{P}([\![n]\!])$ the operator $X_B : L(\Gamma_n) \to H_B$ on each space $L(\Gamma(A))$ with $A \in \mathcal{P}([\![n]\!])$ by $X_B F = X^A_B F$ if $B \subset A$ and $0$ otherwise, for $F \in L(\Gamma(A))$. Now, a result from Clémençon et al. (2014) shows that for any $A, A', B \in \mathcal{P}([\![n]\!])$ such that $B \subset A' \subset A$ and any $F \in L(\Gamma(A))$,

$X_B M_{A'} F = X_B F.$    (3)

Equation (3) means that the component $X_B F$ of $F$ is invariant under the marginal transform. In a statistical learning perspective, the marginals should thus be more efficiently estimated when represented through the projectors $X_B$. We call the operators $X_B$ the wavelet projectors, and define the wavelet transform as the operator $\Psi_X : L(\Gamma_n) \to \bigoplus_{B \in \mathcal{P}([\![n]\!])} H_B$, $F \mapsto (X_B F)_{B \in \mathcal{P}([\![n]\!])}$.

[Figure 1. Principle of the MRA-based linear ranking model.]

The wavelet transform defines the representation that we use for any function $F \in L(\Gamma(A))$ with $A \in \mathcal{P}([\![n]\!])$. We summarize its properties in the following theorem.

Theorem 3 (MRA representation). Let $A \in \mathcal{P}([\![n]\!])$ and $F \in L(\Gamma(A))$. Then

$F = \sum_{B \in \mathcal{P}(A) \cup \{\emptyset\}} \phi_A X_B F$ and $M_{A'} F = \sum_{B \in \mathcal{P}(A') \cup \{\emptyset\}} \phi_{A'} X_B F$ for any $A' \in \mathcal{P}(A)$.
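The embedding operator $\phi_A$ acts on Dirac functions in a fully explicit way, which the following sketch implements verbatim (injective words are represented as tuples, an illustrative choice of ours):

```python
from math import factorial

def is_contiguous_subword(pi, sigma):
    """True if pi occurs in sigma as a block of consecutive letters."""
    k = len(pi)
    return any(sigma[i:i + k] == pi for i in range(len(sigma) - k + 1))

def phi_dirac(pi, sigma):
    """phi_A(delta_pi)(sigma) for sigma in Gamma(A): equals
    1 / (|A| - |pi| + 1)! if pi is a contiguous subword of sigma,
    and 0 otherwise."""
    if is_contiguous_subword(pi, sigma):
        return 1 / factorial(len(sigma) - len(pi) + 1)
    return 0.0

# delta_{12} embedded into Gamma({1,2,3}): only the words where item 1
# is immediately followed by item 2 receive mass 1/2! = 0.5.
for sigma in [(1, 2, 3), (3, 1, 2), (1, 3, 2)]:
    print(sigma, phi_dirac((1, 2), sigma))
```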
4. MRA-based Estimation

The MRA framework shows that the marginals $P_A$ for $A \in \mathcal{A}$ of the ranking model $p$ are characterized by the wavelet projections $X_B p$ for $B \in \mathcal{P}(\mathcal{A}) := \bigcup_{A \in \mathcal{A}} \mathcal{P}(A)$. This proves Theorem 1 and shows more specifically that the component of $p$ that can be learned when data are drawn from the scheme (2) is $\phi_{[\![n]\!]} \sum_{B \in \mathcal{P}(\mathcal{A})} X_B p$. Based on this observation, we introduce a general learning method that relies on the estimation of the wavelet projections $X_B p$ of the ranking model $p$ for all $B \in \mathcal{P}(\mathcal{A})$.

Definition 2 (MRA-based empirical ranking model). The general MRA-based learning method consists in constructing for each $B \in \mathcal{P}(\mathcal{A})$ an estimator $\hat{X}_B \in \mathcal{F}(\mathcal{B}_N, H_B)$ of $X_B p$ from the dataset $\mathcal{D}_N$. The empirical ranking model is then given by $\hat{q}_N = \sum_{B \in \mathcal{P}(\mathcal{A}) \cup \{\emptyset\}} \phi_{[\![n]\!]} \hat{X}_B$, where we fix $\hat{X}_{\emptyset} = \delta_{\bar{0}}$ for all empirical ranking models.

From dependence to independence. We underline that the major advantage of this approach is that, as the wavelet projections are by construction linearly independent, it turns the complex problem of simultaneously estimating the marginals into a collection of estimation problems that can be solved independently. The performance of a ranking model constructed this way can be easily controlled via the following proposition.

Proposition 1. Let $(\hat{X}_B)_{B \in \mathcal{P}(\mathcal{A})}$ with $\hat{X}_B \in \mathcal{F}(\mathcal{B}_N, H_B)$ for each $B \in \mathcal{P}(\mathcal{A})$ and let $\hat{q}_N$ be the associated empirical ranking model from Definition 2. Then

$\mathcal{E}(\hat{q}_N) \leq \sum_{B \in \mathcal{P}(\mathcal{A})} \nu_\phi(B)\, \mathbb{E}\big[\|\hat{X}_B - X_B p\|_B^2\big],$

where $\nu_\phi(B) = \sum_{A \subset [\![n]\!],\ B \subset A} \nu(A)\, 2^{|A|} / (|A| - |B| + 1)!$.

Sketch of proof. One can show that for $A, B \in \mathcal{P}([\![n]\!])$ with $B \subset A$ and $F \in L(\Gamma(B))$, $\|\phi_A F\|_A^2 = \|F\|_B^2 / (|A| - |B| + 1)!$. The proof is then concluded using Theorem 3 and the Cauchy-Schwarz inequality. Refer to the Supplementary Material for the technical details.

Among all the empirical ranking models that can be built according to the principle formulated in Definition 2, we consider a specific class of models, which shall be referred to as MRA-based linear empirical ranking models. Its definition is based on several empirical estimates built from the dataset $\mathcal{D}_N = \{(A_1, \Pi^{(1)}), \dots, (A_N, \Pi^{(N)})\}$. For $A \in \mathcal{P}([\![n]\!])$, we set $\hat{I}_A = \{1 \leq i \leq N \mid A_i = A\}$, so that $\hat{I}_A \neq \emptyset$ means that $A$ is observed in the dataset $\mathcal{D}_N$. The empirical estimator $\hat{\nu}_N$ of $\nu$ is naturally defined by $\hat{\nu}_N(A) = |\hat{I}_A|/N$ for $A \in \mathcal{P}([\![n]\!])$. We denote its support by $\hat{\mathcal{A}}_N$ and refer to it as the empirical observation design. Notice that $\hat{\mathcal{A}}_N \subset \mathcal{A}$, the support of $\nu$. For $A \in \mathcal{P}([\![n]\!])$, we define the empirical estimator of $P_A$ for $\pi \in \Gamma(A)$ by $\hat{P}_A(\pi) = |\{i \in \hat{I}_A \mid \Pi^{(i)} = \pi\}| / |\hat{I}_A|$ if $A \in \hat{\mathcal{A}}_N$ and $1/|A|!$ otherwise. We denote by $\mathcal{B}^\nu_N$ the $\sigma$-algebra generated by $\hat{\nu}_N$.

For a given subset of items $B \in \mathcal{P}(\mathcal{A})$, there are two possibilities. Either $B \notin \mathcal{P}(\hat{\mathcal{A}}_N)$, meaning that it is not included in any of the observed subsets of items; in this case one cannot infer anything about $X_B p$ without additional regularity assumptions. Or else $B \in \mathcal{P}(\hat{\mathcal{A}}_N)$, meaning that there exists at least one observed subset of items $A \in \hat{\mathcal{A}}_N$ such that $B \subset A$; then natural candidates for $\hat{X}_B$ are the wavelet projections $X_B \hat{P}_A$ of the empirical estimators $\hat{P}_A$ for $A \in \hat{\mathcal{A}}_N$ such that $B \subset A$. This essential observation motivates the learning method proposed below. For $B \in \mathcal{P}([\![n]\!])$, we set $\mathcal{Q}(B) = \{A \subset [\![n]\!] \mid B \subset A\}$.

Definition 3 (MRA-based linear ranking model). For $B \in \mathcal{P}(\mathcal{A})$ and $\hat{\theta} \in \mathcal{F}(\mathcal{B}_N, \mathbb{R}^{2^n})$, the MRA-based linear estimator related to the weighting vector $\hat{\theta}$ is defined by

$\hat{X}_{B,\hat{\theta}} = \sum_{A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} \hat{\theta}(A)\, X_B \hat{P}_A,$

where by convention $\hat{X}_{B,\hat{\theta}} = 0$ when $\hat{\mathcal{A}}_N \cap \mathcal{Q}(B) = \emptyset$, or equivalently $B \notin \mathcal{P}(\hat{\mathcal{A}}_N)$. We denote by $\hat{q}_{N,\hat{\theta}}$ the related empirical ranking model and fix $\hat{X}_{\emptyset,\hat{\theta}} = \delta_{\bar{0}}$.

Example 1. The principle of the MRA-based estimator is depicted in Fig. 1. In this example, we assume that $\hat{\mathcal{A}}_N = \{\{2, 4\}, \{1, 2, 3\}, \{1, 3\}, \{3, 4\}\}$. For each $A \in \hat{\mathcal{A}}_N$, the empirical estimator $\hat{P}_A$ contributes to the estimators $\hat{X}_B$ of $X_B p$ for all $B \in \mathcal{P}(A)$. Then the prediction on a (possibly unobserved) subset $A$, $A = \{1, 3, 4\}$ in the illustration, only involves the $\hat{X}_B$'s for $B \in \mathcal{P}(A) \cap \mathcal{P}(\hat{\mathcal{A}}_N)$.
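The empirical quantities $\hat{\nu}_N$, $\hat{\mathcal{A}}_N$ and $\hat{P}_A$ underlying Definition 3 are straightforward to compute from a dataset of pairs $(A_i, \Pi^{(i)})$. A minimal sketch (it returns $\hat{P}_A$ on observed subsets only, leaving the convention $\hat{P}_A = 1/|A|!$ elsewhere implicit):

```python
from collections import Counter, defaultdict

def empirical_estimators(dataset):
    """Empirical observation design, nu_hat and P_hat_A from a dataset
    of pairs (A, pi), following the definitions of Section 4."""
    N = len(dataset)
    I = defaultdict(list)  # I_A: the observed rankings involving A
    for A, pi in dataset:
        I[frozenset(A)].append(pi)
    nu_hat = {A: len(rankings) / N for A, rankings in I.items()}
    P_hat = {A: {pi: c / len(rankings)
                 for pi, c in Counter(rankings).items()}
             for A, rankings in I.items()}
    return nu_hat, P_hat

dataset = [({1, 2}, (1, 2)), ({1, 2}, (2, 1)), ({1, 2}, (1, 2)),
           ({1, 2, 3}, (2, 1, 3))]
nu_hat, P_hat = empirical_estimators(dataset)
print(nu_hat[frozenset({1, 2})])         # 0.75
print(P_hat[frozenset({1, 2})][(1, 2)])  # 2/3
```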
Remark 2. Several methods for ranking aggregation or estimation rely on the breaking of rankings into pairwise comparisons (see Hüllermeier et al., 2008, for instance). For the usual parametric ranking models, it is usually shown that these methods do not degrade the available information too much (see for instance Lu & Boutilier, 2011; Meek & Meila, 2014). When no structural assumption is made on the ranking model $p$, however, breaking all the observed rankings into pairwise comparisons boils down to suppressing all the information of scale higher than 2 defined in the MRA framework. In particular, any MRA-based linear ranking model defined from pairwise comparisons only would be such that $\hat{X}_{B,\hat{\theta}} = 0$ for all $B \subset [\![n]\!]$ with $|B| > 2$.

The following result provides asymptotic guarantees for the accuracy of the MRA-based empirical ranking model we propose. We say that an empirical ranking model $\hat{q}_N \in \mathcal{F}(\mathcal{B}_N, L(\mathfrak{S}_n))$ is asymptotically unbiased if $\lim_{N \to \infty} \mathbb{E}[M_A \hat{q}_N] = P_A$ for all $A \in \mathcal{A}$.

Proposition 2. Let $\hat{\theta} \in \mathcal{F}(\mathcal{B}^\nu_N, \mathbb{R}^{2^n})$. Then the MRA-based linear ranking model $\hat{q}_{N,\hat{\theta}}$ is asymptotically unbiased if $\lim_{N \to \infty} \mathbb{E}\big[\sum_{A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} \hat{\theta}(A)\big] = 1$ for all $B \in \mathcal{P}(\mathcal{A})$.

Sketch of proof. By Theorem 3, one has $\mathbb{E}[M_A \hat{q}_N] = \sum_{B \in \mathcal{P}(A) \cup \{\emptyset\}} \phi_A \mathbb{E}[\hat{X}_{B,\hat{\theta}}]$ for $A \in \mathcal{A}$. Now, for $B \in \mathcal{P}(\mathcal{A})$ and $\pi \in \Gamma(B)$, one can show after some calculations that $\mathbb{E}[\hat{X}_{B,\hat{\theta}}(\pi)] = X_B p(\pi)\, \mathbb{E}\big[\sum_{A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} \hat{\theta}(A)\big]$. This suffices to conclude the proof. Refer to the Supplementary Material for the technical details.

Notice that Proposition 2 requires the weights $\hat{\theta}(A)$ to be $\mathcal{B}^\nu_N$-measurable, in other words constructed from $\hat{\nu}_N$. They can be constructed from $\hat{\mathcal{A}}_N$, but not, for instance, from the $\hat{P}_A$'s for $A \in \hat{\mathcal{A}}_N$. This hypothesis is however not too limiting in practice. It is satisfied in particular by the weighted least squares estimator defined below.

Definition 4 (WLS estimator). Let $B \in \mathcal{P}(\mathcal{A})$. Given $\hat{\nu}_N$, the solutions of the minimization problem

$\min_{\hat{\theta} \in \mathbb{R}^{2^n}} \sum_{A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} \hat{\nu}_N(A)\, \|\hat{X}_{B,\hat{\theta}} - X_B \hat{P}_A\|_B^2$

are the vectors $\hat{\theta} \in \mathbb{R}^{2^n}$ defined for all $A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)$ by

$\hat{\theta}_{WLS}(A) := \frac{\hat{\nu}_N(A)}{\sum_{A' \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} \hat{\nu}_N(A')}.$    (4)

We then define the weighted least squares estimator by $\hat{X}^{WLS}_B := \hat{X}_{B,\hat{\theta}_{WLS}}$ and denote the related empirical ranking model by $\hat{q}^{WLS}_N$.

Beyond the fact that the WLS ranking model is a natural choice among the class of MRA-based linear empirical ranking models, the result stated below reveals that it is asymptotically unbiased with an error rate of order $O(1/N)$, the optimal rate in parametric estimation.

Theorem 4. The WLS estimator $\hat{q}^{WLS}_N$ is asymptotically unbiased and its error satisfies, for all $N \geq 1$,

$\mathcal{E}(\hat{q}^{WLS}_N) \leq \frac{C_1}{N} + C_2\, \rho^N,$

where $0 < \rho < 1$ is a constant that only depends on $\nu$, and $C_1$ and $C_2$ are positive constants that only depend on $p$ and $\nu$, given in the Supplementary Material.

Sketch of proof. First, using (4) one obtains, for $B \in \mathcal{P}(\mathcal{A})$, $\mathbb{E}\big[\sum_{A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} \hat{\theta}_{WLS}(A)\big] = 1 - (1 - \sum_{A \in \mathcal{Q}(B)} \nu(A))^N$. Since $B \in \mathcal{P}(\mathcal{A})$, $\mathcal{A} \cap \mathcal{Q}(B) \neq \emptyset$ and $\sum_{A \in \mathcal{Q}(B)} \nu(A) > 0$. Proposition 2 then ensures that $\hat{q}^{WLS}_N$ is asymptotically unbiased. To bound the error, we use Proposition 1 and calculate explicitly the terms $\mathbb{E}\big[\|\hat{X}^{WLS}_B - X_B p\|_B^2\big]$ for $B \in \mathcal{P}(\mathcal{A})$ through the bias-variance decomposition. Refer to the Supplementary Material for the technical details.
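The weights (4) depend only on the empirical observation frequencies, so they are $\mathcal{B}^\nu_N$-measurable as required by Proposition 2. A minimal sketch, with subsets encoded as frozensets:

```python
def wls_weights(nu_hat, B):
    """WLS weighting vector of equation (4): for each observed subset A
    containing B, theta(A) = nu_hat(A) / sum of nu_hat over all observed
    supersets of B. Returns {} when no observed subset contains B."""
    supersets = {A: w for A, w in nu_hat.items() if B <= A}
    total = sum(supersets.values())
    if total == 0:
        return {}
    return {A: w / total for A, w in supersets.items()}

# With nu_hat from the previous sketch, the pair {1,2} is covered by
# both observed subsets, weighted by their observation frequencies.
nu_hat = {frozenset({1, 2}): 0.75, frozenset({1, 2, 3}): 0.25}
print(wls_weights(nu_hat, frozenset({1, 2})))  # weights 0.75 and 0.25
```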
Remark 3. The two non-parametric approaches proposed in the literature to handle incomplete rankings, introduced in Kondor & Barbosa (2010) and Sun et al. (2012), both rely on kernel regularization of a global estimator defined in the present setting as

$\hat{p}_N = \frac{1}{N} \sum_{i=1}^{N} \frac{|\Pi^{(i)}|!}{n!}\, \mathbf{1}_{\mathfrak{S}_n(\Pi^{(i)})},$    (5)

where $\mathfrak{S}_n(\pi) = \{\sigma \in \mathfrak{S}_n \mid \pi \subset \sigma\}$ is the set of linear extensions of the ranking $\pi \in \Gamma_n$. The choice of this estimator relies on the following heuristic: the observation of an incomplete ranking $\pi$ is actually a degraded observation of a full ranking $\sigma \in \mathfrak{S}_n(\pi)$, and it thus gives the same information as the uniform distribution over $\mathfrak{S}_n(\pi)$, equal to $(|\pi|!/n!)\,\mathbf{1}_{\mathfrak{S}_n(\pi)}$. The estimator $\hat{p}_N$ is then simply the average of the uniform distributions over the $\mathfrak{S}_n(\Pi^{(i)})$'s. Though this heuristic is appealing, the estimator $\hat{p}_N$ is fundamentally biased. Indeed, for $B \in \mathcal{P}([\![n]\!])$ and $\pi \in \Gamma(B)$,

$\mathbb{E}[M_B \hat{p}_N(\pi)] = \sum_{A \in \mathcal{A}} \nu(A)\, \frac{|A|!}{n!} \sum_{\pi' \in \Gamma(A)} P_A(\pi')\, \langle \mathbf{1}_{\mathfrak{S}_n(\pi')}, \mathbf{1}_{\mathfrak{S}_n(\pi)} \rangle,$

which is usually different from $P_B(\pi)$ because the $\mathfrak{S}_n(\pi)$'s are not disjoint for different $\pi$'s in general. For instance, in the case where only pairwise comparisons are observed, one obtains after some calculations, for $i \neq j \in [\![n]\!]$:

$\mathbb{E}\big[M_{\{i,j\}} \hat{p}_N(ij)\big] = \frac{1}{2} + \nu(\{i,j\})\, \bar{P}_{\{i,j\}}(ij) + \frac{1}{3} \sum_{k \in [\![n]\!] \setminus \{i,j\}} \big[\nu(\{i,k\})\, \bar{P}_{\{i,k\}}(ik) + \nu(\{j,k\})\, \bar{P}_{\{j,k\}}(kj)\big],$

where for $k \neq l \in [\![n]\!]$, $\bar{P}_{\{k,l\}}(kl) := P_{\{k,l\}}(kl) - 1/2$. Except if the distribution $\nu$ is concentrated on the pair $\{i,j\}$ solely (in which case there is not much to say), this expression shows that the estimate $\mathbb{E}[M_{\{i,j\}} \hat{p}_N(ij)]$ blends the probabilities of many other pairwise comparisons and is therefore fundamentally different from $P_{\{i,j\}}(ij)$. The MRA representation allows one to exploit only the information from the observed dataset, which is why the constants $C_1$ and $C_2$ in Theorem 4 depend on $p$ and $\nu$ and not directly on $n$. We point out however that the more diffuse $\nu$ is, the more degrees of freedom the dataset has, and the bigger these constants are. The same interpretation applies to the computational aspects.

5. Computational Advantages

To be useful in practice, an empirical ranking model $\hat{q}_N \in L(\mathfrak{S}_n)$ must face the following three intertwined computational challenges.
1. Storage of the ranking model: the naive storage of a vector $(\hat{q}_N(\sigma))_{\sigma \in \mathfrak{S}_n}$ requires $n! - 1$ parameters, which is largely unfeasible when $n$ becomes greater than 15.
2. Complexity of the learning procedure: the learning procedure can require a drastic amount of operations, because of the high dimensionality of the data.
3. Complexity of the computation of a marginal: for $A \in \mathcal{P}([\![n]\!])$ and $\pi \in \Gamma(A)$, the naive computation of $M_A \hat{q}_N(\pi)$ requires $n!/|A|!$ operations.

Example 2. Consider the empirical model $\hat{p}_N$ defined in Remark 3 by (5). It can be rewritten as

$\hat{p}_N = \sum_{A \in \hat{\mathcal{A}}_N} \hat{\nu}_N(A) \sum_{\pi \in \Gamma(A)} \hat{P}_A(\pi)\, \frac{|A|!}{n!}\, \mathbf{1}_{\mathfrak{S}_n(\pi)}.$

Its most efficient storage is under the form of the collections of parameters $(\hat{\nu}_N(A))_{A \in \hat{\mathcal{A}}_N}$ and $(\hat{P}_A(\pi))_{A \in \hat{\mathcal{A}}_N,\, \pi \in \Gamma(A)}$, and the learning procedure is naturally in $O(N)$. But then, each computation of the marginal probability of a ranking $\pi \in \Gamma_n$ involves the computation of all the inner products $\langle \mathbf{1}_{\mathfrak{S}_n(\pi')}, \mathbf{1}_{\mathfrak{S}_n(\pi)} \rangle$ for $\pi' \in \bigsqcup_{A \in \hat{\mathcal{A}}_N} \Gamma(A)$. This is at the root of the main computational limitation of the approaches introduced in Kondor & Barbosa (2010) and Sun et al. (2012).

The MRA-based linear empirical ranking model is defined in terms of wavelet projections and thus naturally overcomes challenges 1 and 3, as shown by the following result (see the Supplementary Material for the technical proof). We denote by $K = \max_{A \in \mathcal{A}} |A|$ the maximum size of an observable subset.

Proposition 3. Let $\hat{q}_N \in \mathcal{F}(\mathcal{B}_N, L(\mathfrak{S}_n))$ be an MRA-based linear ranking model. Its storage requires a number of parameters upper bounded by $K!\, 2^K \min(N, |\mathcal{A}|)$. The computation of $M_A \hat{q}_N(\pi)$ for $A \in \mathcal{P}([\![n]\!])$ and $\pi \in \Gamma(A)$ needs less than $|A|(|A| - 1)/2$ operations.

As $K$ is small in practical applications, these bounds are quite reasonable.
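The sets of linear extensions $\mathfrak{S}_n(\pi)$ at the heart of Example 2 are easy to enumerate for small $n$, which makes the $n!/|\pi|!$ growth behind challenge 3 tangible; a brute-force sketch:

```python
from itertools import permutations
from math import factorial

def is_subword(pi, sigma):
    """True if pi is a subword of sigma, i.e. the items of pi appear
    in sigma in the same relative order."""
    it = iter(sigma)
    return all(x in it for x in pi)

def linear_extensions(pi, n):
    """S_n(pi): the full rankings over {1,...,n} consistent with pi."""
    return [s for s in permutations(range(1, n + 1)) if is_subword(pi, s)]

# The naive estimator (5) spreads the mass of each observed ranking
# uniformly over its linear extensions; a marginal of (5) then requires
# inner products between indicator vectors of size n!.
n, pi = 4, (2, 1)
exts = linear_extensions(pi, n)
print(len(exts), factorial(n) // factorial(len(pi)))  # 12 12
```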
Analyzing the complexity of the learning procedure for a general MRA-based linear ranking model is too intricate because it depends on the choice of the weighting vector $\hat{\theta} \in \mathbb{R}^{2^n}$. We do it for the weighted least squares model, using the following explicit formula (proved in the Supplementary Material).

Proposition 4. For $B \in \mathcal{P}(\hat{\mathcal{A}}_N)$ and $\pi \in \Gamma(B)$,

$\hat{X}^{WLS}_B(\pi) = \frac{\sum_{\pi' \in \Gamma(B)} \alpha_B(\pi, \pi')\, \big|\{1 \leq i \leq N \mid \pi' \subset \Pi^{(i)}\}\big|}{\sum_{A \in \hat{\mathcal{A}}_N \cap \mathcal{Q}(B)} |\hat{I}_A|},$

where $\alpha_B(\pi, \pi') := X_B \delta_\pi(\pi')$ for $\pi' \in \Gamma(B)$.

The coefficients $\alpha_B(\pi, \pi')$ for $B \in \mathcal{P}([\![n]\!])$ and $\pi, \pi' \in \Gamma(B)$ depend neither on the application nor on the dataset, so they can be pre-computed. It is shown in the Supplementary Material how their computation can be implemented with complexity bounded by $K^2 K!$.

Proposition 5. Assuming that all coefficients $\alpha_B(\pi, \pi')$ for $B \in \mathcal{P}(\mathcal{A})$ and $\pi, \pi' \in \Gamma(B)$ have been pre-computed, the WLS ranking model can be learned with complexity bounded by $2^K (K! + 1)(N + |\mathcal{A}|)$.

Refer to the Supplementary Material for the proof of Proposition 5. The bounds given by Propositions 3 and 5 are not small, but they depend directly on the complexity of the observed dataset, not on the number of items $n$. This is a great achievement regarding the computational challenges of the analysis of incomplete rankings.

6. Numerical Experiments

Here we examine the performance of the WLS empirical ranking model in numerical experiments and compare it with three others: the Plackett-Luce model (estimated by means of the MM algorithm introduced in Hunter (2004)), the estimator from Sun et al. (2012), called SLK (we take the bandwidth of the kernel equal to $\binom{n}{2} + 1$ to be sure that the smoothing is applied to the entire dataset), and the collection of empirical estimators $(\hat{P}_A)_{A \in \mathcal{A}}$. The latter are not given as the marginals of a ranking model and, besides, it may be the case that no ranking model induces them, due to sampling noise. It is however interesting to see them as a baseline.

[Figure 2. Evolution of the performance $\mathcal{E}(\hat{q}_N)$ with $N$ for each estimator: WLS in squares, empirical in diamonds, Plackett-Luce in triangles and SLK in circles, with different underlying ranking models: APA dataset (first row), Mallows (second row), Plackett-Luce (third row), and with probability $\nu$ uniform on $\{A \subset [\![5]\!] \mid 2 \leq |A| \leq k\}$ with $k = 2, 3, 4, 5$ (from left to right). For the Mallows and Plackett-Luce models, the results are represented on a logarithmic scale.]
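Before detailing the experimental setup, we illustrate Proposition 4 at the smallest scale. For a pair $B = \{i, j\}$, $H_B$ is one-dimensional ($d_2 = 1$) and, working out the definitions of $\phi_B$ and $X_B$, the coefficients reduce to $\alpha_B(ij, ij) = 1/2$ and $\alpha_B(ij, ji) = -1/2$ (our own derivation, not stated explicitly in the paper), so that $\hat{X}^{WLS}_B$ is a centered empirical pairwise preference. A hedged sketch:

```python
def wls_pair_projection(dataset, i, j):
    """Sketch of Proposition 4 at the pairwise scale B = {i, j}.
    The alpha values 1/2 and -1/2 are derived by hand for |B| = 2;
    an assumption of this sketch, not a formula quoted from the paper."""
    n_ij = n_ji = 0
    for A, pi in dataset:
        if i in A and j in A:  # the observed subset A covers B
            if pi.index(i) < pi.index(j):
                n_ij += 1
            else:
                n_ji += 1
    total = n_ij + n_ji  # denominator of Proposition 4
    if total == 0:
        return 0.0       # B is not covered by any observed subset
    return 0.5 * (n_ij - n_ji) / total

dataset = [({1, 2}, (1, 2)), ({1, 2, 3}, (2, 1, 3)), ({1, 2}, (1, 2))]
print(wls_pair_projection(dataset, 1, 2))  # 0.5 * (2 - 1) / 3
```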
Each experiment is characterized by a ranking model $p$, a probability distribution $\nu$ and a number of observations $N$. We consider two theoretical ranking models, namely a Plackett-Luce model defined for $\sigma \in \mathfrak{S}_n$ by $p_w(\sigma) = \prod_{i=1}^{n} w_{\sigma_i} / (\sum_{j=i}^{n} w_{\sigma_j})$ with parameter vector $w = (w_1, \dots, w_n)$ drawn uniformly at random on the simplex $\{x \in [0,1]^n \mid \sum_{i=1}^{n} x_i = 1\}$, and a Mallows model defined for $\sigma \in \mathfrak{S}_n$ by $p(\sigma) \propto e^{-T(\sigma_0, \sigma)}$ where $T$ is the Kendall's tau distance on $\mathfrak{S}_n$ and $\sigma_0 = 12 \dots n$, as well as one empirical model, namely the distribution of the 5738 votes in the APA dataset (see Diaconis, 1989), which we consider as a ground-truth ranking model. In all the experiments, $n = 5$. For each ranking model, we examine the four settings where $\nu$ is the uniform probability distribution on $\{A \subset [\![5]\!] \mid 2 \leq |A| \leq k\}$ for $k = 2, 3, 4, 5$, and let the size of the drawn dataset $\mathcal{D}_N$ vary between 500 and 5000. We then evaluate the performance of an empirical ranking model $\hat{q}_N$ constructed from $\mathcal{D}_N$ through a Monte Carlo estimate of the performance $\mathcal{E}(\hat{q}_N)$, averaged over 100 drawings of $\mathcal{D}_N$. Fig. 2 depicts the experimental results.

As explained in Remark 3, the SLK ranking model applies a strong smoothing, which leads to a very small variance but an important bias when $p$ differs from the uniform distribution on $\mathfrak{S}_n$. This is why it converges rapidly and its performance is almost constant throughout the experiments for $N \geq 500$. The Plackett-Luce model relies on a structural assumption and is thus naturally biased when $p$ is not a Plackett-Luce model. This explains why it does not perform best in the latter case. As noticed in Remark 2, the marginals of the WLS ranking model are equal to the empirical estimators when only pairwise comparisons are observed. More generally, both are asymptotically unbiased whatever the underlying ranking model $p$ and have similar behaviors, except that the WLS has reduced variance and thus converges faster. Globally, the WLS ranking model quickly outperforms its competitors as $N$ grows.

7. Conclusion

In this paper, we rigorously formulated the issue of learning a ranking model from incomplete rankings as the problem of building an empirical ranking model whose marginals are good estimators of those of the true underlying ranking model, when they are observable. Based on the concept of MRA of incomplete rankings introduced in (Clémençon et al., 2014), we provided a general framework to construct an asymptotically unbiased ranking model with an optimal convergence rate, which can be computed with a complexity depending directly on the data. These theoretical guarantees, as well as the good performance observed in numerical experiments, are an encouragement to pursue the development of this framework, for instance to define efficient regularization procedures or to apply it to other statistical tasks.
References

Azari Soufiani, Hossein, Chen, William, Parkes, David C., and Xia, Lirong. Generalized method-of-moments for rank aggregation. In Advances in Neural Information Processing Systems 26, pp. 2706–2714, 2013.

Busa-Fekete, Róbert, Hüllermeier, Eyke, and Szörényi, Balázs. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1071–1079, 2014.

Clémençon, Stéphan, Jakubowicz, Jérémie, and Sibony, Eric. Multiresolution analysis of incomplete rankings. ArXiv e-prints, March 2014.

Critchlow, D. E. Metric Methods for Analyzing Partially Ranked Data, volume 34 of Lecture Notes in Statistics. Springer, 1985.

Diaconis, P. A generalization of spectral analysis with application to ranked data. The Annals of Statistics, 17(3):949–979, 1989.

Diaconis, Persi. Group Representations in Probability and Statistics. Institute of Mathematical Statistics Lecture Notes - Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1988. ISBN 0-940600-14-5.

Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing partial rankings. SIAM Journal on Discrete Mathematics, 20(3):628–648, 2006.

Fligner, M. A. and Verducci, J. S. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), 48(3):359–369, 1986.

Guiver, John and Snelson, Edward. Bayesian inference for Plackett-Luce ranking models. In ICML, 2009.

Huang, J. and Guestrin, C. Riffled independence for ranked data. In Proceedings of NIPS'09, 2009.

Huang, J., Guestrin, C., and Guibas, L. Fourier theoretic probabilistic inference over permutations. JMLR, 10:997–1070, 2009.

Huang, Jonathan, Kapoor, Ashish, and Guestrin, Carlos. Riffled independence for efficient inference with partial rankings. Journal of Artificial Intelligence Research, 44:491–532, 2012.

Hüllermeier, E., Fürnkranz, J., Cheng, W., and Brinker, K. Label ranking by learning pairwise preferences. Artificial Intelligence, 172:1897–1917, 2008.

Hunter, David R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32:384–406, 2004.

Irurozki, Ekhine, Calvo, Borja, and Lozano, J. Learning probability distributions over permutations by means of Fourier coefficients. Advances in Artificial Intelligence, pp. 186–191, 2011.

Jagabathula, Srikanth and Shah, Devavrat. Inferring rankings using constrained sensing. IEEE Transactions on Information Theory, 57(11):7288–7306, 2011.

Kakarala, Ramakrishna. A signal processing approach to Fourier analysis of ranking data: the importance of phase. IEEE Transactions on Signal Processing, pp. 1–10, 2011.

Kakarala, Ramakrishna. Interpreting the phase spectrum in Fourier analysis of partial ranking data. Advances in Numerical Analysis, 2012.

Kitaev, S. Patterns in Permutations and Words. Springer, 1st edition, 2011.

Kondor, Risi and Barbosa, Marconi S. Ranking with kernels in Fourier space. In Proceedings of COLT'10, pp. 451–463, 2010.

Kondor, Risi and Dempsey, Walter. Multiresolution analysis on the symmetric group. In Neural Information Processing Systems 25, 2012.

Lebanon, G. and Mao, Y. Non-parametric modeling of partially ranked data. JMLR, 9:2401–2429, 2008.

Lebanon, Guy and Lafferty, John. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning, pp. 363–370, 2002.

Liqun, Xu. A multistage ranking model. Psychometrika, 65(2):217–231, 2000.

Lu, Tyler and Boutilier, Craig. Learning Mallows models with pairwise preferences. In ICML, pp. 145–152, 2011.

Luce, R. D. Individual Choice Behavior. Wiley, 1959.

Mallows, C. L. Non-null ranking models. Biometrika, 44(1-2):114–130, 1957.

Meek, Christopher and Meila, Marina. Recursive inversion models for permutations. In Advances in Neural Information Processing Systems 27, pp. 631–639, 2014.

Plackett, R. L. The analysis of permutations. Applied Statistics, 24(2):193–202, 1975.

Rajkumar, Arun and Agarwal, Shivani. A statistical convergence perspective of algorithms for rank aggregation from pairwise data.
In Proceedings of the 31st International Conference on Machine Learning, 2014.

Sun, Mingxuan, Lebanon, Guy, and Kidwell, Paul. Estimating probabilities in recommendation systems. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):471–492, 2012.

Wauthier, Fabian, Jordan, Michael, and Jojic, Nebojsa. Efficient ranking from pairwise comparisons. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 109–117, 2013.