# learning_plackettluce_mixtures_from_partial_preferences__9df6da67.pdf The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) Learning Plackett-Luce Mixtures from Partial Preferences Ao Liu Rensselaer Polytechnic Institute Troy, NY 12180, USA liua6@rpi.edu Zhibing Zhao Rensselaer Polytechnic Institute Troy, NY 12180, USA zhaoz6@rpi.edu Chao Liao Shanghai Jiao Tong University Shanghai 200240, China chao.liao.95@gmail.com Pinyan Lu Shanghai University of Finance and Economics Shanghai 200433, China lu.pinyan@mail.shufe.edu.cn Lirong Xia Rensselaer Polytechnic Institute Troy, NY 12180, USA xial@cs.rpi.edu We propose an EM-based framework for learning Plackett Luce model and its mixtures from partial orders. The core of our framework is the efficient sampling of linear extensions of partial orders under Plackett-Luce model. We propose two Markov Chain Monte Carlo (MCMC) samplers: Gibbs sampler and the generalized repeated insertion method tuned by MCMC (GRIM-MCMC), and prove the efficiency of GRIMMCMC for a large class of preferences. Experiments on synthetic data show that the algorithm with Gibbs sampler outperforms that with GRIM-MCMC. Experiments on real-world data show that the likelihood of test dataset increases when (i) partial orders provide more information; or (ii) the number of components in mixtures of Plackett Luce model increases. Introduction Rank aggregation has found applications in many areas, such as meta-search (Dwork et al. 2001), information retrieval (Liu 2011), and recommender systems (Baltrunas, Makcinskas, and Ricci 2010). In the rank aggregation problem, we want to find an aggregated ranking from the rank data of a given population. One popular approach is learning to rank, which is to learn a statistical ranking model from the data, and interpret the learned parameter as a ranking. Plackett-Luce model (PL) (Plackett 1975; Luce 1959) is one of the most widely-used models. In PL, each alternative is assigned a positive value. The greater this value is, the more likely its corresponding alternative is ranked at a higher position. In recent years, mixtures of Plackett-Luce models have drawn attention for heterogeneous data. Mixture models provide better fitness to data (Zhao, Villamil, and Xia 2018), and can be used for clustering. A mixture model assumes that there are multiple clusters in the population, and each of the clusters can be learned using a component model. In the case of mixtures of Plackett-Luce models, each component is a Plackett-Luce model, while the quality of a certain alternative is usually different across different components. Algorithms to learn a single Plackett-Luce model and its mixtures from linear orders (full rankings) have been well Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. investigated. However, many real-world datasets are composed of partial orders over a subset of alternatives (Lu and Boutilier 2014; H ullermeier et al. 2008), e.g. the Netflix prize data. Some of the previously proposed algorithms work only for special structures of partial orders or general partial orders with ignored comparisons or dependence between pairwise comparisons. However, modern datasets can be unstructured or even implicit, e.g., preferences implied by comparing time users spent on news articles or whether a user finished watching a movie (Khetan and Oh 2016b). None of the existing algorithms is able to learn PL and its mixtures with full utilization of general partial orders with no restrictions. The only known approach to tackle general partial orders without ignoring information in literature is to sample linear extensions from partial orders under a given model (Mallows model) and learn the model from full rankings (Lu and Boutilier 2014). This paper addresses the following problem: How can we efficiently learn the Plackett-Luce model and its mixtures from partial orders? Our Contributions. We propose an EM framework to learn mixtures of Plackett-Luce models from partial orders. The E step consists of two parts: given the current estimate, (i) sampling multiple linear extensions from each partial order, and (ii) computing the membership of each sampled linear order (full ranking). The M step can adapt any learning algorithm for the single Plackett-Luce model. In this paper, we use the LSR algorithm by Maystre and Grossglauser (2015) as the subroutine. We address the technical challenge of sampling an linear extension of a partial order under PL in part (i) by proposing the following two samplers. Sampler 1: GRIM-MCMC generalizes the GRIM-based sampling algorithm by Lu and Boutilier (2014), which was for Mallows model, to PL. We further prove that GRIMMCMC is efficient for a large class of preferences called layered-structure preferences in Theorem 4. Sampler 2: Gibbs generalizes the sampler by Azari Soufiani, Parkes, and Xia (2012) to deal with partial orders, and can be naturally generalized to learn mixtures of Random Utility Models (Thurstone 1927), which subsume PL. We compare the performances of the proposed sampling algorithms for PL and its mixtures using experiments on synthetic data and show that Gibbs sampler is better. Our algorithm uses 75% information at the cost of mere 6% MSE compared to the best algorithm for PL mixtures for linear orders. Experiments on real-world data show that the likelihood of test dataset increases when partial orders provide more information; or the number of components in PL mixtures increases within a certain range. Related Work. We are not aware of any previous algorithm that learns a mixture of Plackett-Luce models from general partial orders without discarding information. A vast number of algorithms have been proposed to learn the Plackett Luce model (Hunter 2004; Azari Soufiani et al. 2013; Maystre and Grossglauser 2015; Khetan and Oh 2016b; 2016a; Negahban, Oh, and Shah 2017; Zhao and Xia 2018) and its mixtures (Gormley and Murphy 2008; Oh and Shah 2014; Mollica and Tardella 2016; Tkachenko and Lauw 2016; Zhao, Piech, and Xia 2016). Some of these algorithms were designed for linear orders (full rankings) (Azari Soufiani et al. 2013; Zhao and Xia 2018; Liu et al. 2019; Zhao, Piech, and Xia 2016), or some special structures of partial orders (Hunter 2004; Maystre and Grossglauser 2015; Khetan and Oh 2016b; Negahban, Oh, and Shah 2017; Mollica and Tardella 2016; Tkachenko and Lauw 2016; Zhao et al. 2018), e.g. top-l rankings. Very few algorithms were designed for general partial orders, but some ignore dependencies between pairwise comparisons with overlapping alternatives (Oh and Shah 2014), and some discard comparison information in order for the algorithm to be consistent (Khetan and Oh 2016a). Lu and Boutilier (2014) proposed an algorithm to learn the Mallows model and its mixtures from general partial orders by sampling linear extensions. We focus on a very different but even more popular model: Plackett-Luce model, as well as its mixtures. Efficiently sampling linear extensions from arbitrary distributions is an open problem. There have been research on generating linear extensions according to various distributions (e.g., uniform distribution, Mallows model, etc.). Repeated insertion methods (RIM) and MCMC are two groups of methods widely used in those works. We propose two algorithms to sample linear extensions from a partial order under the Plackett-Luce model: one inspired by Lu and Boutilier (2014) while the other based on the random utility nature of Plackett-Luce model. To our best knowledge, this is the first work that exploits all information from general partial orders by sampling linear extensions under the Plackett-Luce model, as well as other random utility models (Gibbs). Preliminaries Let C = {c1, , cm} denote a set of m alternatives. Let L(C) denote the set of linear orders, which are transitive, antisymmetric and total binary relations, over C. A linear order R is often denoted by ci1 ci2 cim, which means that ci1 is ranked at the top, ci2 is ranked at the second, ..., cim is ranked at the bottom. Let {1, , n} denote n agents. Each agent j is assumed to have a strict complete preference over C, represented by a linear order Rj. However, only partial preferences over a subset of alternatives ESj is observed, represented by a partial order1, which consists of irreflexive, transitive, and asymmetric binary relations. We call ESj the effective alternative set of agent j and let m j = |ESj|. We describe a partial order by a set of pairwise comparisons Vj = {ci1 ci 1, , ciq ci q|1 i1, i 1, , iq, i q m}. Due to transitivity, such description of a partial order is not unique. A unique representation of a partial order Vj is the transitive closure, denoted by tc(Vj), which consists of all pairwise comparisons in Vj considering transitivity. An example of a partial order and its transitive closure is shown in Figure 1. Let P = (V1, , Vn) denote the data (also called a preference profile), where Vj is agent j s partial preference. Let P(C) denote the set of all partial orders. Figure 1: Example of a partial order and its transitive closure. Let Ext(Vj) be the set of all linear extensions of Vj, which consists of all linear orders consistent with Vj. A linear extension bridges a partial order (observed data) and a linear order. For example, Ext( ) is the set of all permutations on C. In this paper, we assume each agent s preference is generated from the Plackett-Luce model, defined as follows. Definition 1 (Plackett-Luce model) The sample space is L(C)n. The parameter space is Θ = { θ = (θ1, , θm) | i, θi (0, 1), Pm i=1 θi = 1}. Given parameter θ Θ, the probability of any linear order R = ci1 , , cim is Pr PL(R| θ) = θi1 P p 1 θip θi2 P p 2 θip θim 1 θim 1 + θim . (1) The probability of any partial order V is the sum of all probability of its linear extensions, i.e. Pr PL(V | θ) = P R Ext(V ) Pr(R| θ). Another probability of interest is the probability of linear order R conditioned on a partial order V , where R Ext(V ). By the definition of conditional probability, when R Ext(V ), we have Pr PL (R| θ, V ) = Pr PL(R| θ) Pr PL(V | θ) . (2) The Plackett-Luce model has the random utility interpretation, where each alternative ci is associated with a Gumbel distribution centered at ln θi, whose cumulative distribution function is e θi e x (Yellott 1977). A linear order is generated by sampling a utility for each alternative from its corresponding utility distribution and ranking all the sampled utilities in descending order. For any θ Θ, we further define η θ = θmax θmin , where θmax = maxi [m] θi and θmin = mini [m] θi. Following (Ford Jr 1957) and (Hunter 2004), we make the following assumption on the data so that there exists a maximum likelihood estimation. 1Throughout this paper, all partial orders are strict. Assumption 1 In every possible partition of the alternatives into two nonempty subsets, some individual in the second set beats some individual in the first set at least once. We define the mixture of k Plackett-Luce models (k-PL), which subsumes the Plackett-Luce model by letting k = 1, as follows. Definition 2 (k-PL) Given integers m 2 and k 1, the k-mixture Plackett-Luce model is defined as follows. The sample space is L(C)n. The parameter space has two parts. The first part is the mixing coefficients α = (α1, , αk), where for all 1 r k, αr 0 and Pk r=1 αr = 1 The second part is θ(1), , θ(k) , where θ(r) Θ is the parameter of the r-th Plackett-Luce component. The probability of any linear order R is Prk-PL(R| θ) = Pk r=1 αr Pr PL(R| θ(r)). Similar to the (single) Plackett-Luce model, the probability of any partial order V is Prk-PL(V | θ) = P R Ext(V ) Prk-PL(R| θ), which is equivalent to Prk-PL(V | θ) = Pk r=1 αr Pr PL(V | θ(r)). An EM Framework for Learning PL Mixtures We propose an EM framework where the E-step has two parts: (1) generating N linear extensions from each partial order, and (2) computing the expected membership of each linear order. In the next section, we will introduce the two sampling algorithms for part (1): GRIM-MCMC (Algorithm 3) and Gibbs Sampling (Algorithm 4). Algorithm 1: EM Algorithm for k-PL with GRIM sampler and Gibbs sampler 1 Input: Profile P = (V1, , Vn), the number of components k, number of linear extensions per partial order N, the number of iterations T. 2 Output: For all 1 r k, α(T ) r and θ(r,T ) 3 Initialization: Randomly generate θ(1,0), , θ(k,0), and mixing coefficients α(0) 1 , , α(0) k . 4 for t = 1, , T do 6 Sample N linear extensions for each partial order using e.g. Algorithm 3 (GRIM-MCMC) or Algorithm 4 (Gibbs); 7 1 j n, 1 r k, compute w(t) jr using (3). 9 For all 1 r k, update α(t) r = j w(t) jr n and compute θ(r) using the LSR algorithm by (Maystre and Grossglauser 2015). Formally, let zjr denote the membership indicator where zjr = 1 means the linear order Rj belongs to the rth component. Let wjr denote the weight of ranking Rj in rth component. For all j, we have P r wjr = 1. Given last iteration Figure 2: Definition and labeling of vacancies. estimate ( α(t 1), θ(1,t 1), . . . , θ(k,t 1)), each linear order Rj is clustered to each component by the following equation w(t) jr = Pr(zjr = 1|Rj, θ(r,t 1)) = Pr(Rj| θ(r,t 1)) α(t 1) r Pk s=1 Pr(Rj| θ(s,t 1)) α(t 1) s . (3) In the case of k = 1, the membership computation is redundant. In the M-step, we update the mixing coefficients by j w(t) jr n and the parameters of all components ( θ(1), , θ(k)) using the LSR algorithm proposed by (Maystre and Grossglauser 2015). The algorithm is formally shown as Algorithm 1. Sampling Linear Extensions according to PL In the section, we introduce two efficient algorithms to sample linear extensions of a partial order under PL, which are used in step 6 of Algorithm 1. The GRIM-MCMC Sampler for Plackett-Luce We extend the GRIM-based algorithm by (Lu and Boutilier 2014) to Plackett-Luce model. Repeated insertion model (RIM) (Doignon, Pekeˇc, and Regenwetter 2004) is a polynomial-time tool applicable to exactly sample from Plackett-Luce model (given an empty partial order). Then, we generalize the Plackett-Luce RIM to an approximate sampler for linear extensions given an arbitrary partial order. GRIM for Plackett-Luce Model We first present the PL RIM sampler as the building block of GRIM. RIM samples a linear order by inserting alternatives one by one to the temporary ranking until the ranking is complete, where the order of insertions is arbitrary. W.l.o.g., we use the order from c1 to cm throughout this section. Let R(t 1) = [c(t 1) r1 c(t 1) rt 1 ] denote the temporary ranking over alternatives c1, , ct 1. Let the i-th vacancy of this ranking as the i-th possible position for ct to be inserted (shown in Figure 2), we define conditional insertion probabilities for PL RIM as: Pr RIM R(t) R(t 1) 1R(t) Ext(R(t 1)) Pr PL R(t) (θ1, , θt) Pr PL R(t 1) (θ1, , θt 1) . The condition in indicator function 1R(t) Ext(R(t 1)) guar- antees ct is inserted to one of the vacancies of R(t 1). Due to the space limit, we only present the generalized version of RIM in Algorithm 2, which will reduce to RIM if letting input V be . Theorem 1 PL RIM is an exact sampler for Plackett-Luce model that runs in O(m2) time. Proof: We first prove the correctness of PL RIM. Slightly abusing the notation, we let R(t) be the temporary ranking over c1, , ct following the same order as full ranking R. It s easy to check that R(t) Ext R(t 1) and R = R(m). Thus, for Plackett-Luce RIM defined above, we have, Pr RIM (R| θ) = Pr RIM R(m)|R(m 1) Pr RIM R(2)|R(1) Pr RIM R(1) = Pr PL (R| θ). We now analyze the complexity of calculating all insertion probabilities at the t-th insertion step. In stead of directly calculating (4) from (1), we use the following iterative formula to calculate Pr RIM R(t) i R(t 1) from Pr RIM R(t) i 1 R(t 1) : Pr RIM R(t) i R(t 1) Pr RIM R(t) i 1 R(t 1) = i =i 1 θr(t 1) i i =i θr(t 1) i = A(t 1) i B(t 1) i . It is noteworthy that A(t 1) i and B(t 1) i can be calculated through the iterative formulas A(t 1) i = A(t 1) i+1 + θ(t 1) ri 1 and Bi = A(t 1) i+1 + θt. The 2-level iteration process gives a O(t) approach to calculate all needed probabilities and Theorem 1 follows by combining all m insertion steps. 2 Algorithm 2: Plackett-Luce GRIM 1 Input: Partial order V , Plackett-Luce parameters θ = (θ1, , θm) (θi corresponds to alternative ci). 2 Initialization: Set R(1) = [c1] as initialized ranking. 3 for t = 2; t m do 4 Insert alternative ct to R(t 1) = [c(t 1) r1 c(t 1) r2 . . . c(t 1) rt 1 ] s ith vacancy (1 i t) with probability calculated by (5). 5 end 6 return R(m). Now we are ready to propose PL GRIM, which samples linear extensions of any given partial order, where in each step we only insert the current alternative to vacancies consistent with the partial order. We define the GRIM insertion probability by Pr GRIM R(t) i R(t 1) 1R(t) i Ext(V ) ZV,R(t 1), θ Pr PL R(t) i (θ1, , θt) Pr PL R(t 1) (θ1, , θt 1) , (5) ZV,R(t 1), θ = 1 R(t) i Ext(V ) Pr PL R(t) i (θ1, , θt) Pr PL R(t 1) (θ1, , θt 1) is the normalization factor. Formally, GRIM procedure is shown in Algorithm 2 (note again its fourth-step used (5) and thus used the information from V ). Theorem 2 PL GRIM (Algorithm 2) runs in O(m2) time, and the probability of outputting full ranking R is Pr GRIM R θ, V = 1R Ext(V ) Qm 1 t=1 ZV,R(t), θ Pr PL R θ . Table 1: Example of GRIM (probability columns show conditional probabilities on given rankings) Insert c1, c2 Insert c3 Ranking Pr Ranking Pr c1 c2 1 1+2 = 1 c1 c3 c2 3/5 c1 c2 c3 2/5 c2 c1 2 1+2 = 2 3 c2 c1 c3 1 The distribution of PL GRIM is not always exactly the distribution in (2) (See Example 1). Example 1 Consider the Plackett-Luce model over C = {c1, c2, c3}, partial order V = {[c1 c3]} and parameter θ = (1/6, 2/6, 3/6), the process of GRIM is shown in Table 1. The comparison between GRIM and Placket-Luce model is shown in Table 2. Table 2: Comparing the Distributions between GRIM PL c1 c3 c2 c1 c2 c3 c2 c1 c3 Pr GRIM 1/5 2/15 2/3 Pr PL 2/5 4/15 1/3 Even though GRIM is not a precise sampler for Plackett-Luce model, it provides us a good proposal distribution on linear extensions for Metropolis Hastings algorithm (Metropolis et al. 1953; Hastings 1970), which is an MCMC algorithm that tunes a proposal distribution to the target distribution. It works as follows. For each state β we first generate a candidate next state γ from a proposal distribution ˆP( ) (slightly abusing this notation). Then, let the next state to be γ with probability min n 1, P(γ) ˆ P(β|γ) P(β) ˆ P(γ|β) o , otherwise the next state re- mains β, where P( ) is the (probability of) target distribution. GRIM-MCMC In this section, we present the Markov Chain MPL used to tune PL GRIM in Algorithm 3. In each step of MPL, we first use Algorithm 2 to generate a full ranking γ, which is a linear extension of preference V . Then, with probability min n Pr PL(γ| θ) Pr PL(β| θ) Pr GRIM(β| θ) Pr GRIM(γ| θ), 1 o , let β = γ, otherwise the next ranking remains current ranking β. Before analyzing the mixing time of MPL, we first show the probability Pr PL(γ| θ) Pr PL(β| θ) Pr GRIM(β| θ) Pr GRIM(γ| θ) used in step 5 of Algorithm 2 can be easily computed after GRIM sampling in the following Proposition. Algorithm 3: Tuned Plackett-Luce GRIM by Markov Chain MPL 1 Input: Plackett-Luce parameters θ = (θ1, , θm), partial order Vj and number of iteration I. 2 Initialization: Generate an linear extension β over alternative set ESj by Algorithm 2. 3 for ℓ= 1; ℓ I do 4 Generate another linear extension γ using parameters θ over alternative set ESj using Algorithm 2. 5 With probability min n Pr PL(γ| θ) Pr PL(β| θ) Pr GRIM(β| θ) Pr GRIM(γ| θ), 1 o , let β = γ. 7 Insert all remaining alternatives in C \ ESi to β using Algorithm 2. 8 return R. Proposition 3 Pr PL(γ| θ) Pr PL(β| θ) Pr GRIM(β| θ) Pr GRIM(γ| θ) = Qm 1 t=1 ZV,γ(t), θ Z V,β(t), θ . Proof: By Theorem 2, we have, Pr PL (γ) Pr GRIM(β) Pr PL (β) Pr GRIM(γ) ZV Pr PL (γ| θ, V ) ZV Pr PL (β| θ, V ) 1β Ext(V ) Qm 1 t=1 ZV,β(t), θ Pr PL (β| θ, V ) 1γ Ext(V ) Qm 1 t=1 ZV,γ(t), θ Pr PL (γ| θ, V ) ZV,γ(t), θ ZV,β(t), θ , where all normalization factors, ZV,β(t), θ and ZV,γ(t), θ are already calculated in GRIM process. 2 We now prove that the mixing time of MPL is exponentially upper bounded for a large class of profiles, called layered-structure profiles, defined as follows. Intuitively, alternatives in such a partial order can be partitioned into multiple layers, such that each layer contains alternatives among which no comparisons are available, and there is a linear order over layers such that alternatives in a high-ranked layer is always preferred to alternatives in a low-ranked layer. Definition 3 (Layered-Structure Preferences) We call agent j s preference is layered-structured if and only if there exist non-overlapping sets of alternatives C1, , Cl such that: 1 For any l [l] and any two alternatives ci1, ci2 Cl , no preference between ci1, ci2 are given by agent j. 2 For any two alternatives ci1 Cl1 and ci2 Cl2, ci1 j ci2 if and only if l1 < l2. 3 l [l] Cl = ESj. Theorem 4 For layered-structured preferences, the mixing time of Markov Chain MPL is O(ηm 2 ln ϵ 1), where η = maxi [m] θi mini [m] θi and m = |ESj|. Proof: Since the proposal state γ is independent to current state β, we have the following Lemma: Lemma 4.1 [Example 2 of (Liu 1996)] The mixing time of MPL is O smax ln ϵ 1 , where smax = maxγ Ext(Vi) Pr PL(γ| θ) Pr GRIM(γ| θ). Armed with Lemma 4.1, we only need to provide an upper bound of smax (shown in Lemma 4.2). Lemma 4.2 For any layered-structured preference V , smax = maxγ Ext(Vi) Pr PL(γ| θ) Pr GRIM(γ| θ) ηm 2. Proof: We use ml to denote the number of alternatives in Cl and we define smin = minγ Ext(Vi) Pr PL(γ| θ) Pr GRIM(γ| θ) similar to smax. Since smax and smin are defined as maximum (minimum) ratio between the probabilities of GRIM and Plackett Luce, we have smax 1 smin. According to Proposition 3, we have, smin = max γ,β Ext(V ) ZV,γ(t), θ ZV,β(t), θ Inequality (6) reduce the problem to bounding GRIM normalization factors. Next, we will focus on layered-structure preferences and show the their normalization factors are bounded as: ZV,γ(t 1), θ ZV,β(t 1), θ ( 1 t < m1 ηt t m1 , (7) where γ, β Ext(V ) are two extensions of V . W.l.o.g., we let the insertion order in GRIM follow the same order of C1 Cl. Recalling the justification of RIM algorithm, we know GRIM is a perfect sampler on C1 since all alternatives in C1 can be inserted to any vacancies and the case of t < m1 follows. Recalling the definition of ZV,γ(t 1), θ, alternative ct Cl s normalization factor ZV,γ(t 1), θ is Pt i=1 1γ(t) i Ext(γ(t 1)) Pr PL γ(t) i (θ1,1, ,θ1,m1, ,θl ,1, ,θl ,t (m1+ +ml 1)) Pr PL γ(t) i (θ1,1, ,θ1,m1, ,θl ,1, ,θl ,t (m1+ +ml 1) 1) , where θi,j represent the j-th inserted alternative in Ci. Since all alternatives in Cl are strictly less preferred to alternatives in C1, , Cl 1, only the last t P i [l 1] mi vacancies are allowed by linear extension condition, and we have, ZV,γ(t 1), θ Pr PL γ(t) i (θ1,1, , θ1,m1, , θl ,1, , θl ,t (m1+ +ml 1)) Pr PL γ(t) i (θ1,1, , θ1,m1, , θl ,1, , θl ,t (m1+ +ml 1) 1) P i [l 1] mi Y θγj θt + Pt 1 i =j θγi θγj Pt 1 i =j θγi where γj denote the j-th ranked alternative in γ(t 1). Doing the same deviation to ranking β(t 1) and combine it with the result of γ, for any alternative ct Cl we have, ZV,γ(t 1), θ ZV,β(t 1), θ i [l 1] mi Y θt + Pt 1 i =j θβi θt + Pt 1 i =j θγi j=1+P i [l 1] mi Pt 1 i =j θβi Pt 1 i =j θγi i [l 1] mi Y Lemma 4.2 follows by combining (7) and (6). 2 Theorem 4 follows by applying Lemma 4.2 to Lemma 4.1. 2 The Gibbs Sampler for Plackett-Luce Model We first introduce one exact Plackett-Luce sampler for full rankings using random utility interpretation (Theorem 5). It is noteworthy that the complexity of this sampler is O(m) per iteration (or equivalently, the complexity of one Gibbs iteration in Algorithm 4 is O(m )). Algorithm 4: Gibbs Plackett-Luce Sampler 1 Input: Plackett-Luce parameter θ = (θ1, , θm), partial order Vj and number of iterations I. 2 Initialization: Arbitrarily set an 1 m array of utilities u = (u1, , um) and set of inserted alternatives S = . 3 for ℓ= 1; ℓ I do 4 for ct ESj do 5 Calculate l U = mini Ui e θt e ui, where Ui = {i | (ci S) ([ci j ct] Vj)}. 6 Calculate l L = maxi Li e θt e ui, where Li = {i | (ci S) ([ct j ci] Vj)}. 7 Uniformly generate a random number r from interval (max{0, l L}, min{1, l U}). 8 Let ut = ln( ln r + ln θt) and add ct to S. 11 Sample the remaining alternatives utilities according to Theorem 5. 12 Sort the utilities (u1, , um) to get the linear order R. 13 return R. Theorem 5 (Yellott 1977, Theorem 5) For sample space S = L(C) and parameters θ = (θ1, , θm), Plackett Luce sampling is equivalent to sample from Gumbel distributions whose cumulative distribution functions (CDFs) are e θi e x. Our Gibbs Plackett-Luce sampler given arbitrary partial orders (Algorithm 4) is based on the PL sampler introduced above and the Gibbs sampler by Azari Soufiani, Parkes, and Xia (2012). We let l U = 1 (or l L = 0) when Ui = (or Li = ). We note that the sampler shown in Algorithm 4 is applicable to any Random Utility Models with given CDF on utility distributions. Remark for Gibbs Sampler Bounding the mixing time of Gibbs sampler for arbitrary (truncated) distribution is still an open question. Moreover, the mixing time of Algorithm 4 is even harder to bound because it focuses on the distribution over rankings instead of utilities. Nevertheless, Gibbs sampler might be a fast algorithm in experiments (or, averaged case) because it takes the advantage of lower complexity per iteration. Experiments We illustrate the performances of our algorithms on both synthetic data and real-world data. We note that there is no existing algorithm for learning Plackett-Luce model and its mixtures from general partial orders without discarding any information. All experiments with recorded runtime were run on an Ubuntu Linux server with Intel Xeon E5 v3 CPUs each clocked at 3.50 GHz. Figure 3: MSE of the proposed EM algorithms for a single Plackett-Luce model. Figure 4: Runtime (in seconds) of the proposed EM algorithms for a single Plackett-Luce model. Note ILSR-Linear Orders is not applicable to arbitrary partial preferences. Experiments on Synthetic Data Learning the Plackett-Luce Model from Partial Orders. We first generated linear orders from the Plackett-Luce model over 10 alternatives. The ground truth parameters θ = (θ1, . . . , θm) were generated uniformly at random and normalized s.t. Pm i=1 θi = 1. The linear orders were sampled by Definition 1. Then we sampled partial orders in the following way. We fixed p = 0.5. Given any linear order, we sampled from all m(m 1) 2 pairwise comparisons by keeping each pairwise comparison with probability p independently. The sampled partial order is then converted to its transitive closure. In average, below 75% of all pairwise comparisons were sampled. We ran our proposed EM algorithms with GRIM-MCMC and Gibbs samplers respectively, denoted by ELSR-GRIM-MCMC and ELSR-Gibbs . Two linear extensions were sampled for each partial order. We use five EM iterations for each algorithm. Values were averaged over 2000 trials. To show the statistical efficiency of our algorithms, we also learned from linear orders (which are assumed to be unobservable) using the ILSR algorithm (denoted by ILSRLinear Order ) by (Maystre and Grossglauser 2015), which is the iterative LSR. Results are shown in Figures 3 and 4. Observations. Both statistical efficiency and computational efficiency of the algorithm with Gibbs sampler are slightly better than that with GRIM sampler. The statistical efficiency of both our algorithms are close to ILSR, which learns from supposedly unobservable linear orders. Figure 5: MSE of the proposed EM algorithms for mixtures of 3 Plackett-Luce models. Figure 6: Runtime (in seconds) of the proposed EM algorithms for mixtures of 3 Plackett-Luce models. Learning Mixtures of Plackett-Luce Models. The setup is similar to the previous experiments. We generated the ground truth parameters for 3-PL over 6 alternatives by generating the mixing coefficients and the parameter of each component uniformly at random and normalizing s.t. Pk r=1 αr = 1 and for all r, Pm i=1 θ(r) i = 1. Then we sampled partial orders using exactly the same way with a fixed p = 0.5. In average, below 65% of all pairwise comparisons were sampled due to fewer alternatives. 6 linear extensions were generated from each partial order. We use five EM iterations for each algorithm. Values were averaged over 2000 trials. Observations. From Figures 5 and 6 we observe that our algorithm with Gibbs sampler is more computationally efficient than that with GRIM-MCMC sampler. Again, both algorithms have similar statistical efficiency ELSR-Linear Order , which learns from supposedly unobservable linear orders. Figure 7: Log-likelihood of sushi test data as k increases when a k-PL is learned from the training data. Values were averaged over 100 trials. Experiments on Real-World Data The sushi data from Preflib (Mattei and Walsh 2013) consists of 5000 linear orders of 10 types of sushi. We randomly split this dataset into training data (3500 linear orders) and test data (1500 partial orders). To generate partial orders, we sample pairwise comparisons with p = 0.2, 0.4, 0.6, 0.8, 1 and learn a k-PL from the partial orders using the proposed E-LSR algorithm with Gibbs sampler, where k ranges from 1 to 5. We measure the fitness of a k-PL using the likelihood of the test data given the learned parameter. Results are shown in Figure 7. Observations. We observe that when p increases, the learning outcome fits the test data better. For k 5, k-PL fits the test data better as k increases. Conclusions and Future Work We propose an EM-based framework with two samplers for learning Plackett-Luce model and its mixtures from partial orders. We demonstrate the efficiency of our algorithms with experiments on both synthetic datasets and real-world data and conclude that Gibbs sampler outperforms GRIM-MCMC. For future work we plan to explore faster algorithms to sample linear extensions from partial orders or other approaches to learn mixtures of Plackett-Luce models, as well as other mixture models. Acknowledgments We thank all anonymous reviewers for helpful comments and suggestions. This work is supported by NSF #1453542 and ONR #N00014-17-1-2621. Azari Soufiani, H.; Chen, W.; Parkes, D. C.; and Xia, L. 2013. Generalized method-of-moments for rank aggregation. In Proceedings of Advances in Neural Information Processing Systems (NIPS). Azari Soufiani, H.; Parkes, D. C.; and Xia, L. 2012. Random utility theory for social choice. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 126 134. Baltrunas, L.; Makcinskas, T.; and Ricci, F. 2010. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems, 119 126. ACM. Doignon, J.-P.; Pekeˇc, A.; and Regenwetter, M. 2004. The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika 69(1):33 54. Dwork, C.; Kumar, R.; Naor, M.; and Sivakumar, D. 2001. Rank aggregation methods for the web. In Proceedings of the 10th World Wide Web Conference, 613 622. Ford Jr, L. R. 1957. Solution of a ranking problem from binary comparisons. The American Mathematical Monthly 64(8P2):28 33. Gormley, I. C., and Murphy, T. B. 2008. Exploring voting blocs within the Irish electorate: A mixture modeling approach. Journal of the American Statistical Association 103(483):1014 1027. Hastings, W. K. 1970. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1):97 109. H ullermeier, E.; F urnkranz, J.; Cheng, W.; and Brinker, K. 2008. Label ranking by learning pairwise preferences. Artificial Intelligence 172(16-17):1897 1916. Hunter, D. R. 2004. MM algorithms for generalized Bradley Terry models. In The Annals of Statistics, volume 32, 384 406. Khetan, A., and Oh, S. 2016a. Computational and statistical tradeoffs in learning to rank. In Advances in Neural Information Processing Systems, 739 747. Khetan, A., and Oh, S. 2016b. Data-driven rank breaking for efficient rank aggregation. Journal of Machine Learning Research 17(193):1 54. Liu, A.; Wu, Q.; Zhenming, L.; and Xia, L. 2019. Nearneighbor methods in random preference completion. In Proceedings of 33rd AAAI Conference on Artifical Intelligence (AAAI-19). Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing 6(2):113 119. Liu, T.-Y. 2011. Learning to Rank for Information Retrieval. Springer. Lu, T., and Boutilier, C. 2014. Effective sampling and learning for mallows models with pairwise-preference data. The Journal of Machine Learning Research 15(1):3783 3829. Luce, R. D. 1959. Individual Choice Behavior: A Theoretical Analysis. Wiley. Mattei, N., and Walsh, T. 2013. Pref Lib: A Library of Preference Data. In Proceedings of Third International Conference on Algorithmic Decision Theory (ADT 2013), Lecture Notes in Artificial Intelligence. Maystre, L., and Grossglauser, M. 2015. Fast and accurate inference of Plackett Luce models. In Advances in Neural Information Processing Systems, 172 180. Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; and Teller, E. 1953. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21(6):1087 1092. Mollica, C., and Tardella, L. 2016. Bayesian Plackett Luce mixture models for partially ranked data. Psychometrika 1 17. Negahban, S.; Oh, S.; and Shah, D. 2017. Rank centrality: Ranking from pairwise comparisons. Operations Research 65(1):266 287. Oh, S., and Shah, D. 2014. Learning mixed multinomial logit model from ordinal data. In Advances in Neural Information Processing Systems, 595 603. Plackett, R. L. 1975. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics) 24(2):193 202. Thurstone, L. L. 1927. A law of comparative judgement. Psychological Review 34(4):273 286. Tkachenko, M., and Lauw, H. W. 2016. Plackett-luce regression mixture model for heterogeneous rankings. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 237 246. ACM. Yellott, J. I. J. 1977. The relationship between Luce s Choice Axiom, Thurstone s Theory of Comparative Judgment, and the double exponential distribution. Journal of Mathematical Psychology 15(2):109 144. Zhao, Z., and Xia, L. 2018. Composite marginal likelihood methods for random utility models. In Proceedings of The 35th International Conference on Machine Learning. Zhao, Z.; Li, H.; Wang, J.; Kephart, J.; Mattei, N.; Su, H.; and Xia, L. 2018. A cost-effective framework for preference elicitation and aggregation. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI-18). Zhao, Z.; Piech, P.; and Xia, L. 2016. Learning mixtures of Plackett-Luce models. In Proceedings of The 33rd International Conference on Machine Learning. Zhao, Z.; Villamil, T.; and Xia, L. 2018. Learning mixtures of random utility models. In Proceedings of 32nd AAAI Conference on Artifical Intelligence (AAAI-18).