# Learning Mixtures of Random Utility Models

Zhibing Zhao, Tristan Villamil, Lirong Xia
Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY, USA
{zhaoz6, villat2}@rpi.edu, xial@cs.rpi.edu

## Abstract

We tackle the problem of identifiability and efficient learning of mixtures of Random Utility Models (RUMs). We show that when the PDFs of the utility distributions are symmetric, the mixture of k RUMs (denoted by k-RUM) is not identifiable when the number of alternatives m is no more than 2k - 1. On the other hand, when m ≥ max{4k - 2, 6}, any k-RUM is generically identifiable. We then propose three algorithms for learning mixtures of RUMs: an EM-based algorithm, which we call E-GMM, a direct generalized-method-of-moments (GMM) algorithm, and a sandwich (GMM-E-GMM) algorithm that combines the other two. Experiments on synthetic data show that the sandwich algorithm achieves the highest statistical efficiency and GMM is the most computationally efficient. Experiments on real-world data from Preflib show that Gaussian k-RUMs provide better fitness than a single Gaussian RUM, the Plackett-Luce model, and mixtures of Plackett-Luce models w.r.t. commonly-used model fitness criteria. To the best of our knowledge, this is the first work on learning mixtures of general RUMs.

## Introduction

In rank aggregation, the goal is to aggregate rank data to make an optimal decision, where each data point is a linear order over a set of alternatives. Rank aggregation has many applications. For example, in political elections, agents cast votes to elect a president; in information retrieval, rankings over documents are combined into a list; in crowdsourcing, crowd workers sometimes give rankings as answers, which are aggregated to estimate the correct ranking. In this paper, we take a machine learning approach towards rank aggregation, by addressing the problem of efficiently learning mixtures of Random Utility Models (RUMs).

RUMs are a widely applied statistical model for human behavior (Thurstone 1927). In a single, non-mixture RUM, each agent samples a utility for each alternative independently and reports the ranking over the alternatives obtained by sorting their utilities. A special case of RUMs is the Plackett-Luce model (Plackett 1975; Luce 1959), which can be seen as the extension of multinomial logistic regression to rank data. General RUMs can fit data better than Plackett-Luce models and thus provide more accurate predictions (Azari Soufiani, Parkes, and Xia 2012). Unfortunately, unlike the Plackett-Luce model, general RUMs are hard to learn due to the lack of a closed-form formula for the likelihood function. One prominent example is Thurstone's Case V model (Thurstone 1927), where the utility distributions are Gaussian; it can be seen as an extension of multinomial probit regression to rank data. See Related Work for more discussion.

A mixture of k RUMs, denoted by k-RUM, combines k RUM components via the mixing coefficients α = (α_1, α_2, ..., α_k). Given a k-RUM, a ranking can be generated as follows: (i) the r-th component is selected with probability α_r; (ii) a ranking is generated from the r-th RUM component. Like other mixture models, k-RUMs can potentially provide better fitness and can be used for clustering. We are not aware of previous work on learning mixtures of general RUMs.
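To make this generative process concrete, the following minimal Python sketch samples rankings from a Gaussian k-RUM, i.e., Thurstone-style utilities with unit variance. The function name, the Gaussian choice, the unit variance, and the toy parameters are illustrative assumptions, not part of the model definition above.

```python
import numpy as np

def sample_k_rum(alpha, theta, n, seed=None):
    """Sample n rankings from a Gaussian k-RUM (illustrative sketch).

    alpha : length-k mixing coefficients, summing to 1.
    theta : k x m array of mean utilities, one row per RUM component.
    Returns an (n, m) integer array; each row lists the alternatives of one
    ranking, from most preferred to least preferred.
    """
    rng = np.random.default_rng(seed)
    alpha, theta = np.asarray(alpha), np.asarray(theta, dtype=float)
    k, m = theta.shape
    rankings = np.empty((n, m), dtype=int)
    for j in range(n):
        r = rng.choice(k, p=alpha)             # (i) pick a component
        utilities = rng.normal(theta[r], 1.0)  # (ii) one utility per alternative
        rankings[j] = np.argsort(-utilities)   # sort by decreasing utility
    return rankings

# A toy 2-RUM over 3 alternatives (parameters are made up for illustration).
print(sample_k_rum([0.3, 0.7], [[1.0, -1.0, 0.0], [-2.0, 0.0, 2.0]], n=5, seed=0))
```

Each row of the output is one sampled ranking; the first index is the most preferred alternative.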
There are some recent works on mixtures of Plackett-Luce models (Gormley and Murphy 2008; Mollica and Tardella 2016; Tkachenko and Lauw 2016; Zhao, Piech, and Xia 2016), which are special k-RUMs. The motivation of our work on learning mixtures of general RUMs is that general k-RUMs outperform a single RUM, the Plackett-Luce model, and mixtures of Plackett-Luce models w.r.t. various commonly-used model fitting criteria, as we show later in the paper based on experiments on real-world data. This means that general k-RUMs often improve predictions and decisions compared to state-of-the-art models.

In this paper, we address the following two fundamental questions on learning mixtures of RUMs. Is a k-RUM learnable? Can we design efficient algorithms for learning k-RUMs in general?

## Our Contributions

In this paper, we measure learnability by identifiability, which means that different parameters correspond to different distributions over rank data. Identifiability is important for mixture models, both for parameter estimation and for clustering. The identifiability of k-RUMs depends on the number of components k in the mixture model and the number of alternatives m, as we show in the following two theorems.

Theorem 1. Let M be any symmetric RUM from the location family. When m ≤ 2k - 1, k-RUM_M over m alternatives is non-identifiable. (All identifiability results in this paper hold modulo label switching, which means that if we label the components differently, the parameter is treated as the same.)

Here k-RUM_M denotes the mixture of k RUMs, each of which is chosen from the RUM model M. As a corollary, the k-mixture of Thurstone's original Case V model (with Gaussian distributions) is non-identifiable when m ≤ 2k - 1.

Our second theorem proves generic identifiability: k-RUM_M is generically identifiable if the Lebesgue measure of the set of non-identifiable parameters is 0.

Theorem 2. For any RUM_M where all utility distributions have support (-∞, +∞), when m ≥ max{4k - 2, 6}, k-RUM_M over m alternatives is generically identifiable.

We propose three algorithms to learn k-RUMs. The first one is an EM-based algorithm (Dempster, Laird, and Rubin 1977), which we call E-GMM. It works as follows: the E-step is standard, calculating the membership of each ranking in the data w.r.t. the k components based on the likelihood ratios. In the M-step, we adopt the generalized-method-of-moments (GMM) algorithm proposed by Azari Soufiani, Parkes, and Xia (2014) to estimate the parameters of each RUM component.

Our second algorithm is a direct GMM algorithm (Hansen 1982) that tries to match the moments of the k-RUM. We prove that under mild conditions the algorithm is consistent. That is, it converges to the ground truth with probability going to 1 as the number of independently generated rankings approaches infinity.

Our third algorithm combines the advantages of the E-GMM algorithm and the GMM algorithm. We first run GMM, then use its output as the initial values for E-GMM. The algorithm is thus called the sandwich algorithm (GMM-E-GMM). We note that the two GMMs in the sandwich algorithm are different.

Experiments on synthetic data show that the sandwich algorithm is the most statistically efficient and the GMM algorithm is the fastest. Observing that the E-GMM algorithm is not as good as the sandwich algorithm w.r.t. both statistical efficiency and computational efficiency, we do not show the results of the E-GMM algorithm in this paper. Experiments on real-world Preflib data (Mattei and Walsh 2013) show that k-RUMs provide better model fitness than a single RUM, the Plackett-Luce model, and mixtures of Plackett-Luce models w.r.t. commonly-used model fitness criteria, including the Akaike Information Criterion (AIC) (Akaike 1974), the corrected AIC for finite samples (AICc) (Hurvich and Tsai 1989), and the Bayesian Information Criterion (BIC) (Schwarz 1978).
We note that such comparisons and conclusions are enabled by our algorithms for learning general k-RUMs, which are the first of their kind.

## Related Work and Discussions

Our (non-)identifiability theorems are the first ones for mixtures of general RUMs. Recently, Zhao, Piech, and Xia (2016) characterized the identifiability of mixtures of Plackett-Luce models. We focus on a different class of mixtures of RUMs: those whose utility distributions have symmetric PDFs, including Thurstone's Case V model. In order to prove the (non-)identifiability theorems, we adopt novel techniques that avoid direct calculation of the likelihood. This is the major difference between our proofs and the proofs of Zhao, Piech, and Xia (2016) for mixtures of Plackett-Luce models, where closed-form formulas for the likelihood function are available. We emphasize that our non-identifiability theorem does not apply to mixtures of Plackett-Luce models because the utility distributions in the Plackett-Luce model (Gumbel distributions) are not symmetric.

Our work also makes a number of algorithmic contributions to the literature of rank aggregation, sometimes known as learning to rank (Liu 2011). In particular, GMM-based techniques for learning a single (non-mixture) RUM, including the Plackett-Luce model, have been extensively investigated (Negahban, Oh, and Shah 2012; Azari Soufiani et al. 2013; Azari Soufiani, Parkes, and Xia 2014; Chen and Suh 2015; Khetan and Oh 2016b; 2016a). Learning algorithms have also been proposed for mixtures of Plackett-Luce models, including EM-based algorithms (Gormley and Murphy 2008; Mollica and Tardella 2016; Tkachenko and Lauw 2016) and a generalized-method-of-moments algorithm (Zhao, Piech, and Xia 2016). To the best of our knowledge, our algorithms for learning k-RUMs are quite general and are the first algorithms for learning mixtures of RUMs beyond mixtures of Plackett-Luce models.

Our E-GMM is inspired by the EMM algorithm for learning mixtures of Plackett-Luce models proposed by Gormley and Murphy (2008). However, the EMM cannot be easily applied to learn general k-RUMs, because EMM uses a Minorize-Maximization (MM) algorithm to estimate the parameters of each Plackett-Luce component, and no MM algorithm is known for general RUMs. To address this challenge, we adopt the GMM by Azari Soufiani, Parkes, and Xia (2014) in the M-step. Our GMM algorithm is inspired by the GMMs in the previous work discussed above, but we use a different set of moment conditions.

Recently, tensor-decomposition techniques have been explored to learn models with latent variables; see for example the work by Anandkumar et al. (2014). However, such techniques cannot be easily applied to learn RUMs and their mixtures. In the proof of Theorem 2 we shed some light on a potential algorithm, which requires multiple tensor decompositions in which each component is labeled only by its mixing probability. Designing computationally tractable tensor-decomposition algorithms is a promising direction for future work.

## Preliminaries
Let A = {a_i | i = 1, 2, ..., m} denote a set of m alternatives. Let L(A) be the set of linear orders (full rankings) over A, which are transitive, antisymmetric, and total binary relations. Let P = {V_1, V_2, ..., V_n} denote the data (also called a preference profile), where V_j ∈ L(A) for all j ≤ n.

Definition 1 (Random Utility Model (RUM)). Given m ≥ 2 alternatives, a random utility model M associates each alternative a_i with a utility distribution. The parameter space is Θ = {θ = (θ_1, θ_2, ..., θ_m)}, where θ_i is the parameter of the utility distribution of a_i. The sample space is S = L(A)^n. An agent's ranking is generated in the following steps: first, the agent samples a random utility U_i for each alternative independently from μ_i(· | θ_i); second, she ranks the alternatives w.r.t. these utilities. Formally, given a parameter θ, the probability of the ranking V = [a_{i_1} ≻ a_{i_2} ≻ ... ≻ a_{i_m}] is

Pr_M(V | θ) = ∫_{-∞}^{+∞} μ_{i_m}(x_{i_m} | θ_{i_m}) ∫_{x_{i_m}}^{+∞} μ_{i_{m-1}}(x_{i_{m-1}} | θ_{i_{m-1}}) ... ∫_{x_{i_2}}^{+∞} μ_{i_1}(x_{i_1} | θ_{i_1}) dx_{i_1} ... dx_{i_{m-1}} dx_{i_m}.

We assume that rankings are generated i.i.d. Therefore, given a preference profile P and θ ∈ Θ, we have Pr_M(P | θ) = ∏_{j=1}^n Pr_M(V_j | θ).

In this paper, we focus on the location family, where each utility distribution μ_i is parameterized only by its mean, denoted by θ_i. In other words, the shapes of the utility distributions are fixed. We note that shifting the means of all alternatives simultaneously by the same distance does not affect the distribution over rankings. To eliminate this problem, we require ∑_{i=1}^m θ_i = 0. We further say that an RUM is symmetric if the PDF of each utility distribution is symmetric around its mean.

Definition 2 (k-RUM_M). Given an RUM M within the location family, for any k ∈ N+ and m ≥ 2, we define the k-RUM_M as follows. The sample space is S = L(A)^n. The parameter space has two parts. The first part is the mixing coefficients α = (α_1, ..., α_k), where α_r ≥ 0 for all r ≤ k and ∑_{r=1}^k α_r = 1. The second part is (θ^(1), θ^(2), ..., θ^(k)), where θ^(r) is the parameter of the r-th random utility component. All components come from the same class of RUM M, i.e., the shape of the utility distribution of any alternative is fixed. The probability of a ranking V is

Pr_{k-RUM_M}(V | θ) = ∑_{r=1}^k α_r Pr_M(V | θ^(r)),

where Pr_M(V | θ^(r)) is the probability of generating V in the r-th RUM component.

Example 1. Consider a 2-RUM over 3 alternatives. The mixing coefficients are α = (0.3, 0.7). Let all utility distributions be Gaussian with standard deviation 1. The two RUM components are parameterized by θ^(1) = (-3, -1, 4) and θ^(2) = (5, -2, -3), respectively. Taking the ranking a_1 ≻ a_3 ≻ a_2 as an example, we have

Pr(a_1 ≻ a_3 ≻ a_2) = 0.3 ∫_{-∞}^{+∞} φ(x_2 + 1) ∫_{x_2}^{+∞} φ(x_3 - 4) ∫_{x_3}^{+∞} φ(x_1 + 3) dx_1 dx_3 dx_2 + 0.7 ∫_{-∞}^{+∞} φ(x_2 + 2) ∫_{x_2}^{+∞} φ(x_3 + 3) ∫_{x_3}^{+∞} φ(x_1 - 5) dx_1 dx_3 dx_2,

where φ(x) is the PDF of the standard Gaussian distribution.
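The triple integrals in Example 1 have no simple closed form, but they are easy to check numerically. The sketch below is an illustrative Monte Carlo estimate (not code from the paper) of Pr(a_1 ≻ a_3 ≻ a_2) for the parameters of Example 1: sample a component, sample the three unit-variance Gaussian utilities, and count how often U_1 > U_3 > U_2.

```python
import numpy as np

def mc_prob_a1_a3_a2(alpha, theta, n_samples=2_000_000, seed=0):
    """Monte Carlo estimate of Pr(a1 > a3 > a2) under a Gaussian k-RUM."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha)
    theta = np.asarray(theta, dtype=float)
    comps = rng.choice(len(alpha), size=n_samples, p=alpha)  # a component per sample
    utils = rng.normal(theta[comps], 1.0)                    # (n_samples, m) utilities
    ordered = (utils[:, 0] > utils[:, 2]) & (utils[:, 2] > utils[:, 1])
    return ordered.mean()

# Parameters of Example 1.
alpha = [0.3, 0.7]
theta = [[-3.0, -1.0, 4.0],   # theta^(1)
         [5.0, -2.0, -3.0]]   # theta^(2)
print(mc_prob_a1_a3_a2(alpha, theta))
```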
## Identifiability of k-RUM_M

We first recall the identifiability of a statistical model.

Definition 3 (Identifiability). Let M = {Pr(· | θ) : θ ∈ Θ} be a statistical model. M is identifiable if for all θ, γ ∈ Θ, Pr(· | θ) = Pr(· | γ) implies θ = γ.

We note that single RUMs are generally identifiable and can be learned using an MC-EM algorithm (Azari Soufiani, Parkes, and Xia 2012) or a GMM algorithm (Azari Soufiani, Parkes, and Xia 2014). However, identifiability of single RUMs does not imply identifiability of mixtures of RUMs. To eliminate the label switching problem (Redner and Walker 1984), we say that k-RUM_M is identifiable if there do not exist (1) 1 ≤ k_1, k_2 ≤ k and non-degenerate θ^(1), θ^(2), ..., θ^(k_1), γ^(1), γ^(2), ..., γ^(k_2), meaning that these k_1 + k_2 vectors are pairwise different, and (2) strictly positive mixing coefficients (α^(1)_1, ..., α^(1)_{k_1}) and (α^(2)_1, ..., α^(2)_{k_2}), such that for any ranking V we have

∑_{r=1}^{k_1} α^(1)_r Pr_M(V | θ^(r)) = ∑_{r=1}^{k_2} α^(2)_r Pr_M(V | γ^(r)).

A statistical model is generically identifiable if the Lebesgue measure of the set of parameters that do not satisfy the condition in Definition 3 is zero in the parameter space.

Theorem 1. Let M be any symmetric RUM from the location family. When m ≤ 2k - 1, k-RUM_M over m alternatives is non-identifiable.

Proof: The proof is constructive. We prove the case where m = 2k - 1; the proof for m < 2k - 1 is similar. For any k and m = 2k - 1, we will define non-degenerate θ^(1), ..., θ^(k), γ^(1), ..., γ^(k) with mixing probabilities α = [α_1, ..., α_k]^T and β = [β_1, ..., β_k]^T, respectively. We let θ^(1)_1, ..., θ^(k)_1, γ^(1)_1, ..., γ^(k)_1 be 2k pairwise different numbers such that θ^(r)_1 + γ^(r)_1 = 0 for all r = 1, ..., k. For any θ^(r), we let θ^(r)_2 = ... = θ^(r)_m = -θ^(r)_1 / (m - 1), so that ∑_{i=1}^m θ^(r)_i = 0. The γ^(r)'s are defined similarly.

Because the parameters for a_2, ..., a_m are equal in each RUM component, the distribution over data in each RUM component can be represented compactly using m events instead of m! rankings, namely a_1 at the first position, a_1 at the second position, etc. For convenience, we define F_θ as the m x k matrix

F_θ = [ Pr(a_1 top | θ^(1))    ...  Pr(a_1 top | θ^(k))
        ...                         ...
        Pr(a_1 bottom | θ^(1)) ...  Pr(a_1 bottom | θ^(k)) ],

and F_γ is defined similarly. We will prove that there exist positive α and β such that F_θ α = F_γ β, which proves non-identifiability.

Consider the matrix ΔF = F_θ - F_γ. For all 1 ≤ r ≤ k, the r-th column of ΔF is

ΔF^(r) = [ Pr(a_1 top | θ^(r)) - Pr(a_1 top | γ^(r)), ..., Pr(a_1 bottom | θ^(r)) - Pr(a_1 bottom | γ^(r)) ]^T.

Because the utility distributions of all alternatives are symmetric, we have Pr(a_1 at position i | θ^(r)) = Pr(a_1 at position m - i + 1 | γ^(r)). Therefore, the first element of ΔF^(r) is exactly the negative of its last element; the same holds for the second element and the second-to-last element, and so on. The center element (the k-th element) is zero. Since this holds for all r, the last k - 1 rows of ΔF are the negatives of the first k - 1 rows in reversed order, while the center row consists of only zeros. So the rank of ΔF is at most k - 1. Since ΔF has k columns, there exists a nonzero λ = [λ_1, λ_2, ..., λ_k]^T such that ΔF λ = 0. Elementwise, for every position i:

∑_{r=1}^k λ_r Pr(a_1 at position i | θ^(r)) = ∑_{r=1}^k λ_r Pr(a_1 at position i | γ^(r)).

If the entries of λ are all nonnegative, we let α = β = λ (normalized so that the entries sum to 1), and the proof is done. If there are negative elements in λ, we can switch the corresponding components θ^(r) and γ^(r) and flip the sign of λ_r until all elements of λ are nonnegative. This finishes the proof.

We conjecture that this bound is tight, since the counterexample has the fewest number of moments, which means the parameter is the most likely to be under-constrained.

Theorem 2. For any RUM_M where all utility distributions have support (-∞, +∞), when m ≥ max{4k - 2, 6}, k-RUM_M over m alternatives is generically identifiable.

Proof: We will focus on the parameters whose mixing coefficients (entries of α) are pairwise different. Formally, for any r_1, r_2 ∈ {1, ..., k},

α_{r_1} = α_{r_2} implies r_1 = r_2.   (1)

Such parameters have Lebesgue measure 1, because the parameters with any pair of identical mixing coefficients lie in a lower-dimensional space. Generic identifiability will be proved by analyzing the uniqueness of tensor decompositions. We will construct a rank-one tensor T^(r)(θ^(r)) to represent the r-th RUM component. Then the k-RUM_M can be represented by the weighted average of the tensors of its components, i.e.,

T(θ) = ∑_{r=1}^k α_r T^(r)(θ^(r)).
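The coordinates of each T^(r) are pairwise-comparison probabilities over a fixed pairing of the alternatives, as defined in the next part of the proof. As a preview, here is a small numpy sketch of the construction, assuming Gaussian utilities with unit variance; the pairing, the parameter values, and the Gaussian assumption are illustrative only.

```python
import numpy as np
from functools import reduce
from scipy.stats import norm

def pairwise_prob(theta, i, j):
    """Pr(a_i beats a_j) for Gaussian utilities with unit variance:
    U_i - U_j ~ N(theta_i - theta_j, 2)."""
    return norm.cdf((theta[i] - theta[j]) / np.sqrt(2.0))

def component_tensor(theta):
    """Rank-one tensor of one component: the outer product, over the pairs
    S_q = {a_{2q-1}, a_{2q}}, of the two pairwise-comparison probabilities."""
    factors = []
    for q in range(len(theta) // 2):
        p = pairwise_prob(theta, 2 * q, 2 * q + 1)
        factors.append(np.array([p, 1.0 - p]))
    return reduce(np.multiply.outer, factors)

def mixture_tensor(alpha, thetas):
    """T(theta) = sum_r alpha_r * T^(r)(theta^(r))."""
    return sum(a * component_tensor(np.asarray(t)) for a, t in zip(alpha, thetas))

# Toy example: k = 2 components over m = 6 alternatives gives a 2 x 2 x 2 tensor.
T = mixture_tensor([0.4, 0.6],
                   [[1.0, -1.0, 2.0, -2.0, 0.5, -0.5],
                    [-1.0, 1.0, 0.0, 0.0, 2.0, -2.0]])
print(T.shape, T.sum())   # (2, 2, 2); the entries sum to 1
```

Each factor sums to 1, so every component tensor, and hence the mixture tensor T(θ), sums to 1; this fact is used repeatedly in the argument below.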
We now provide a set of two sufficient conditions for θ = (α, θ^(1), ..., θ^(k)) to be identifiable, and then prove that both hold generically.

Condition 1. For every γ = (β, γ^(1), ..., γ^(k)) where β is not equal to α modulo label switching (the set of entries in β is different from the set of entries in α), we have T(γ) ≠ T(θ).

Condition 2. For every γ = (β, γ^(1), ..., γ^(k)) that is different from θ and has β = α, we have Pr_{k-RUM_M}(· | γ) ≠ Pr_{k-RUM_M}(· | θ).

It follows from the definition of identifiability that if both conditions hold, then θ is identifiable.

Condition 1 generically holds. We first show that T(θ) generically has a unique decomposition, then prove that uniqueness of the decomposition implies Condition 1. In the rest of the proof we assume that m is even; the theorem can be proved similarly for odd m.

To construct the rank-one tensor T^(r)(θ^(r)), we partition the set of alternatives into m/2 subsets: S_1 = {a_1, a_2}, S_2 = {a_3, a_4}, ..., S_{m/2} = {a_{m-1}, a_m}. We define the (m/2)-way rank-one tensor T^(r)(θ^(r)) for the r-th RUM_M component as follows. Let the q-th coordinate of T^(r) hold the probabilities of the pairwise comparison between the two alternatives in S_q. More precisely, for any 1 ≤ q ≤ m/2 and any ranking V_q ∈ L(S_q), the q-th coordinate of T^(r)(θ^(r)), denoted by p^(r)_q, is the following:

p^(r)_q = [ Pr(a_{2q-1} ≻ a_{2q} | θ^(r)), Pr(a_{2q} ≻ a_{2q-1} | θ^(r)) ].   (2)

According to Lemma 3 of (Zhao, Piech, and Xia 2016), which proves the independence of two rankings over mutually exclusive sets of alternatives and easily extends to multiple rankings, we have

Pr_M(V_1, V_2, ..., V_{m/2} | θ^(r)) = ∏_{q=1}^{m/2} Pr_M(V_q | θ^(r)),

where V_q ∈ L(S_q) for all q = 1, ..., m/2. It follows that T^(r)(θ^(r)) = ⊗_{q=1}^{m/2} p^(r)_q. The tensor for the mixture model can be written as T(θ) = ∑_{r=1}^k α_r T^(r)(θ^(r)).

To analyze the uniqueness of the tensor decomposition of T(θ), we investigate the Kruskal rank of each matrix P_q defined as P_q = [p^(1)_q, p^(2)_q, ..., p^(k)_q]. The Kruskal rank of a matrix is the largest number K such that every set of K columns of the matrix is linearly independent. We now prove that the Kruskal rank of P_q, denoted by K_q, is generically 2, i.e., any two columns of P_q are generically linearly independent. It is not hard to see that the two entries of p^(r)_q sum up to 1. So the only case where p^(r_1)_q and p^(r_2)_q are linearly dependent is that they are exactly the same. According to Proposition 1 in (Azari Soufiani, Parkes, and Xia 2014), Pr(a_{2q-1} ≻ a_{2q} | θ^(r)) monotonically increases as θ^(r)_{2q-1} - θ^(r)_{2q} increases, because all utility distributions have support (-∞, +∞). Therefore, p^(r_1)_q = p^(r_2)_q if and only if θ^(r_1)_{2q-1} - θ^(r_1)_{2q} = θ^(r_2)_{2q-1} - θ^(r_2)_{2q}, which has Lebesgue measure 0. Thus, K_q = 2 generically holds for all q ≤ m/2, and ∑_{q=1}^{m/2} K_q = 2 · (m/2) = m generically holds. Because k ≤ (m + 2)/4, we have ∑_{q=1}^{m/2} K_q = m ≥ 2k + m/2 - 1. Sidiropoulos and Bro (2000) proved that, for any N-way tensor, if ∑_{q=1}^N K_q ≥ 2k + N - 1, then the tensor decomposition is unique. This is true in our case since N = m/2.

We now prove that uniqueness of the decomposition of T(θ) implies Condition 1. Suppose, for the purpose of contradiction, that the decomposition of T(θ) is unique but Condition 1 does not hold. This means that there exists γ = (β, γ^(1), ..., γ^(k)), where β is not equal to α modulo label switching, such that T(γ) = T(θ). Because the entries of α are pairwise different, there exists r_1 ≤ k such that α_{r_1} ≠ β_{r_2} for all r_2 = 1, ..., k, while T(θ) = T(γ) = T. We will show that α_{r_1} T^(r_1)(θ^(r_1)) ≠ β_{r_2} T^(r_2)(γ^(r_2)) for every r_2 = 1, ..., k, which contradicts the unique decomposition of T.
Recall that for any r, the sum of the two entries of p^(r)_q is 1. It is not hard to see that the sum of all entries of T^(r)(·) is also 1. So the sums of all entries of α_{r_1} T^(r_1)(θ^(r_1)) and β_{r_2} T^(r_2)(γ^(r_2)) are α_{r_1} and β_{r_2}, respectively. We recall that α_{r_1} ≠ β_{r_2} for all r_2. Therefore, α_{r_1} T^(r_1)(θ^(r_1)) ≠ β_{r_2} T^(r_2)(γ^(r_2)) for all r_2, which is a contradiction. This finishes the proof that Condition 1 generically holds.

Condition 2 generically holds. Unfortunately, the tensor-decomposition technique used for Condition 1 no longer works for Condition 2. Here is a counterexample. Given θ = (α, θ^(1), ..., θ^(k)), we can construct γ = (α, γ^(1), ..., γ^(k)) such that T(γ) = T(θ) in the following way. For any r, we let γ^(r)_i = θ^(r)_i + 1 for i = 1, 2; γ^(r)_i = θ^(r)_i - 1 for i = 3, 4; and γ^(r)_i = θ^(r)_i for i = 5, ..., m. The resulting tensor satisfies T(γ) = T(θ), because the probabilities of the rankings restricted to each group are exactly the same.

To address this problem, we consider an additional tensor decomposition using a different partition of the alternatives, defined as follows: S'_1 = {a_2, a_3}, S'_2 = {a_4, a_5}, ..., S'_{m/2} = {a_m, a_1}. Let T'(θ) denote the tensor under this partition. Similarly to the proof for Condition 1, we can show that T'(θ) generically has a unique decomposition.

Next, we will prove that for any θ where T(θ) and T'(θ) have unique decompositions (which holds generically), Condition 2 holds. Suppose, for the sake of contradiction, that Condition 2 does not hold for a θ where T(θ) and T'(θ) have unique decompositions. Then there exists γ = (β, γ^(1), ..., γ^(k)) that is different from θ modulo label switching, such that Pr_{k-RUM_M}(· | γ) = Pr_{k-RUM_M}(· | θ). This means that T(γ) = T(θ) and T'(γ) = T'(θ).

We first match the components of γ and θ. Recall that uniqueness of T(θ) implies Condition 1. Because both T(θ) and T'(θ) have unique decompositions, Condition 1 must hold for both. It follows that the entries of β are exactly the entries of α; otherwise T(θ) ≠ T(γ), which is a contradiction. Since the entries of α are pairwise different, there is a unique way of matching the components of θ to the components of γ by matching the corresponding mixing coefficients. Consequently, we can relabel the components of γ so that β becomes exactly α. Let γ' = (α, γ'^(1), ..., γ'^(k)) denote the γ parameter after relabeling the components. Thus we have T(γ') = T(γ) = T(θ) and T'(γ') = T'(γ) = T'(θ). Because γ is not equal to θ modulo label switching, we have γ' ≠ θ, which implies that there exists r' such that

γ'^(r') ≠ θ^(r').   (3)

Next, we will show that the uniqueness of the decompositions of T(θ) and T'(θ) forces γ'^(r') = θ^(r'), which is a contradiction. To this end, we show how to uniquely determine the parameters of k-RUM_M (i.e., γ' and θ) from the decompositions of T(θ) and T'(θ). That T(γ') = T(θ) has a unique decomposition means that for any r_1 ∈ {1, ..., k}, there exists r_2 ∈ {1, ..., k} such that α_{r_1} T^(r_1)(γ'^(r_1)) = α_{r_2} T^(r_2)(θ^(r_2)), which implies that the sums of all entries of α_{r_1} T^(r_1)(γ'^(r_1)) and α_{r_2} T^(r_2)(θ^(r_2)) are equal, i.e., α_{r_1} = α_{r_2}. It follows from (1) that r_1 = r_2. Namely, we have α_r T^(r)(γ'^(r)) = α_r T^(r)(θ^(r)) for all r = 1, ..., k. It follows that T^(r)(γ'^(r)) = T^(r)(θ^(r)) for all r = 1, ..., k; in particular, T^(r')(γ'^(r')) = T^(r')(θ^(r')). Similarly, we also have T'^(r')(γ'^(r')) = T'^(r')(θ^(r')).
For any r and q, p^(r)_q can be easily obtained from T^(r)(θ^(r)) by normalizing the corresponding entries of T^(r)(θ^(r)). For example, p^(r)_1 can be obtained by normalizing the entries indexed by (1, 1, ..., 1) and (2, 1, ..., 1). Such a p^(r)_1 is uniquely determined by T^(r)(θ^(r)) because the two entries of p^(r)_1 sum up to 1. Specifically, for all q = 1, ..., m/2, p^(r')_q is uniquely determined by T^(r')(θ^(r')) and p'^(r')_q is uniquely determined by T'^(r')(θ^(r')). Next, we will prove that for all q = 1, ..., m/2, the p^(r')_q and p'^(r')_q uniquely determine θ^(r'), which contradicts (3).

We now focus on γ'^(r') and θ^(r') restricted to S_1. We claim that there exists a constant C_1 such that γ'^(r')_i = θ^(r')_i + C_1 for all i = 1, 2 (i.e., a_i ∈ S_1). The reason is as follows. Recall from (2) that p^(r')_1 consists of the probabilities of a_1 ≻ a_2 and a_2 ≻ a_1 given component r'. By Proposition 1 in (Azari Soufiani, Parkes, and Xia 2014), Pr(a_1 ≻ a_2 | θ^(r')) is a function of θ^(r')_1 - θ^(r')_2 and Pr(a_1 ≻ a_2 | γ'^(r')) is a function of γ'^(r')_1 - γ'^(r')_2. By matching the probabilities we have γ'^(r')_1 - γ'^(r')_2 = θ^(r')_1 - θ^(r')_2, which means that there exists a constant C_1 such that γ'^(r')_i = θ^(r')_i + C_1 for i = 1, 2. Similarly, for all q = 1, ..., m/2 there exists C_q such that

γ'^(r')_i = θ^(r')_i + C_q, for i = 2q - 1, 2q.   (4)

Similarly, from T'^(r)(γ'^(r)) = T'^(r)(θ^(r)), for all q = 1, ..., m/2 there exists C'_q such that

γ'^(r')_i = θ^(r')_i + C'_q, for i = 2q, (2q + 1 mod m).   (5)

Therefore, we have C_1 = C'_1 by letting i = 2 in (4) and (5); C'_1 = C_2 by letting i = 3; C_2 = C'_2 by letting i = 4; and so on. So C_1 = ... = C_{m/2} = C'_1 = ... = C'_{m/2}. Let C denote this common value. Then γ'^(r')_i = θ^(r')_i + C for all i = 1, ..., m. Because ∑_{i=1}^m γ'^(r')_i = ∑_{i=1}^m θ^(r')_i = 0, we have C = 0, which contradicts (3).

As we have proved, the Lebesgue measure of parameters where the tensor T(θ) (or T'(θ)) has a non-unique decomposition is 0. Therefore, the Lebesgue measure of parameters θ where both T(θ) and T'(θ) have unique decompositions is 1. This finishes the proof.

## The E-GMM Algorithm

To compute k-RUM_M, we propose an EM-based algorithm, which we call E-GMM, where the GMM algorithm proposed by Azari Soufiani, Parkes, and Xia (2014) is used in the M-step. The detailed algorithm is as follows. During the E-step, given the t-th step estimate α^(t) and θ^(1,t), ..., θ^(k,t), we use Bayes' rule to calculate the probability that a ranking V_j belongs to component r in the (t+1)-th iteration, denoted p^(jr)_{t+1}:

p^(jr)_{t+1} ∝ α^(t)_r Pr(V_j | θ^(r,t)).   (6)

We can normalize w.r.t. r because ∑_{r=1}^k p^(jr) = 1. In the M-step, the mixing probabilities are calculated by

α^(t+1)_r = (1/n) ∑_{j=1}^n p^(jr)_{t+1},   (7)

and the θ^(r,t+1)'s are estimated using the GMM algorithm from (Azari Soufiani, Parkes, and Xia 2014). Formally, the E-GMM algorithm is presented below as Algorithm 1.

Algorithm 1: E-GMM
Input: a profile P of n rankings, the number of components k, and the number of iterations T.
Output: α^(T+1)_r, θ^(r,T+1) for r = 1, 2, ..., k.
Initialize α^(1)_r and θ^(r,1) randomly for all r = 1, 2, ..., k.
1: for t = 1 to T do
2:   Given the estimate at step t: α^(t)_r, θ^(r,t).
3:   E-step: calculate the expected memberships by (6).
4:   M-step: calculate α^(t+1)_r using (7) and use GMM with weighted rankings/breakings to estimate θ^(r,t+1).
5: end for
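As a sketch of how one E-GMM iteration can be organized, the Python fragment below implements the responsibility computation (6) and the mixing-coefficient update (7). The helper rum_prob is a hypothetical placeholder for the single-RUM ranking probability Pr(V | θ^(r)) (e.g., estimated by sampling utilities); the remainder of the M-step, re-estimating each θ^(r) with the GMM of Azari Soufiani, Parkes, and Xia (2014) using the responsibilities as ranking weights, is not shown.

```python
import numpy as np

def e_step(rankings, alpha, thetas, rum_prob):
    """Eq. (6): responsibilities p^{(jr)} proportional to alpha_r * Pr(V_j | theta^{(r)}).

    rum_prob(ranking, theta) is a placeholder for the single-RUM ranking
    probability Pr(V | theta). Returns an (n, k) matrix whose rows sum to 1.
    """
    n, k = len(rankings), len(alpha)
    p = np.empty((n, k))
    for j, V in enumerate(rankings):
        p[j] = [alpha[r] * rum_prob(V, thetas[r]) for r in range(k)]
        p[j] /= p[j].sum()              # normalize over the k components
    return p

def m_step_mixing(p):
    """Eq. (7): alpha_r^{(t+1)} = (1/n) * sum_j p^{(jr)}."""
    return p.mean(axis=0)
```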
## GMM for k-RUM_M

Generalized-method-of-moments (GMM) (Hansen 1982) algorithms are a widely applied class of algorithms that generalize the classical method of moments. Each GMM is specified by a set of q ≥ 1 moment conditions g(V, θ), where V is a data point and θ is a parameter, such that for any θ_0, the expectation of each moment condition is zero at θ_0 when the data are generated from the model given θ_0; that is, E[g(V, θ_0)] = 0. The algorithm then computes the θ that minimizes a certain norm, e.g. the 2-norm, of ∑_{V ∈ P} g(V, θ).

Our GMM algorithm for k-RUM_M is defined as follows. We first define a breaking matrix B(P) = [b_{i_1 i_2}]_{m x m}, where b_{i_1 i_2} is the empirical probability that a_{i_1} is preferred over a_{i_2}, namely the number of times that a_{i_1} ≻ a_{i_2} divided by the number of rankings. The diagonal elements are zeros. For example, let P = (a_1 ≻ a_2 ≻ a_3, a_1 ≻ a_3 ≻ a_2); then

B(P) = [ 0   1    1
         0   0    0.5
         0   0.5  0  ].

We then define the following m(m-1)/2 moment conditions: for any i_1 < i_2, let g_{i_1 i_2}(P, θ) = b_{i_1 i_2} - Pr(a_{i_1} ≻ a_{i_2} | θ). In our GMM algorithm, we minimize the following objective function:

G = ∑_{i_1 < i_2} (b_{i_1 i_2} - Pr(a_{i_1} ≻ a_{i_2} | θ))²,   (8)

where Pr(a_{i_1} ≻ a_{i_2} | θ) = ∑_{r=1}^k α_r Pr(a_{i_1} ≻ a_{i_2} | θ^(r)). Formally, the algorithm is presented as Algorithm 2.

Algorithm 2: GMM for k-RUM_M
Input: a preference profile P.
1: Compute the breaking matrix B(P).
2: Compute the parameter that minimizes (8).

The advantages of our GMM algorithm are:
1. The gradient and Hessian of G are easy to compute. To calculate the gradient of G, we need the partial derivatives of Pr(a_{i_1} ≻ a_{i_2} | θ), which can be broken down into the partial derivatives of Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) w.r.t. θ^(r). These are given by Proposition 1 of (Azari Soufiani, Parkes, and Xia 2014).
2. Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) can be computed easily by sampling utilities from the utility distributions of a_{i_1} and a_{i_2}. For Gaussian distributions, Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) is the CDF, evaluated at θ^(r)_{i_1} - θ^(r)_{i_2}, of another Gaussian distribution with mean 0 and variance equal to the sum of the variances of the utility distributions of a_{i_1} and a_{i_2}.

For better accuracy we add m more moment conditions, corresponding to m cyclic triple-wise comparisons, i.e., T_1 = a_1 ≻ a_2 ≻ a_3, T_2 = a_2 ≻ a_3 ≻ a_4, ..., T_m = a_m ≻ a_1 ≻ a_2. Let b_{T_i} be the empirical probability of T_i. The new GMM minimizes

G' = G + ∑_{i=1}^m (b_{T_i} - Pr(T_i | θ))².
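A minimal sketch of the two steps of Algorithm 2 is given below, assuming Gaussian utility distributions with unit variance, so that Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) is the Gaussian CDF mentioned in advantage 2. In practice the objective (8), with the parameters flattened into a single vector and the constraints on α handled appropriately, would be passed to a generic numerical optimizer (e.g., scipy.optimize.minimize); the cyclic-triple terms of G' are omitted here, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def breaking_matrix(rankings, m):
    """B(P): entry (i1, i2) is the fraction of rankings in which a_{i1} beats a_{i2}."""
    B = np.zeros((m, m))
    for V in rankings:                       # V lists alternative indices, best first
        for pos, i1 in enumerate(V):
            for i2 in V[pos + 1:]:
                B[i1, i2] += 1.0
    return B / len(rankings)

def pairwise_mixture_prob(alpha, thetas, i1, i2):
    """Pr(a_{i1} beats a_{i2} | theta) = sum_r alpha_r * Pr(a_{i1} beats a_{i2} | theta^(r)),
    for Gaussian utilities with unit variance (so the inner term is a Gaussian CDF)."""
    return sum(a * norm.cdf((t[i1] - t[i2]) / np.sqrt(2.0))
               for a, t in zip(alpha, thetas))

def gmm_objective(B, alpha, thetas):
    """Objective (8): sum over i1 < i2 of (b_{i1 i2} - Pr(a_{i1} beats a_{i2} | theta))^2."""
    m = B.shape[0]
    return sum((B[i1, i2] - pairwise_mixture_prob(alpha, thetas, i1, i2)) ** 2
               for i1 in range(m) for i2 in range(i1 + 1, m))
```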
We now prove that our GMM algorithm is consistent: when the data are generated independently from k-RUM_M and the size of the data approaches infinity, the algorithm converges to the ground truth with probability going to 1.

Theorem 3. If k-RUM_M over m alternatives is identifiable, and the means of the utility distributions of all alternatives in all RUM components are bounded in closed intervals [0, C], then Algorithm 2 is consistent.

Proof: Hall (2005) provides a set of sufficient conditions for a GMM to be consistent. Therefore, it suffices to check that all assumptions in Theorem 3.1 of (Hall 2005) hold.

Assumption 3.1 (Strict Stationarity): the (n x 1) random vectors {v_t; -∞ < t < ∞} form a strictly stationary process with sample space S ⊆ R^n. Since the data are generated i.i.d., the process is strictly stationary.

Assumption 3.2 (Regularity Conditions for g(·,·)): the function g: S x Θ → R^q, where q < ∞, satisfies: (i) it is continuous on Θ for each P ∈ S; (ii) E[g(P, θ)] exists and is finite for every θ ∈ Θ; (iii) E[g(P, θ)] is continuous on Θ. Our moment conditions satisfy all the regularity conditions, since g_{i_1 i_2}(P, θ) is continuous on Θ and bounded in [-1, 1] for any i_2 ≠ i_1.

Assumption 3.3 (Population Moment Condition): the random vector v_t and the parameter vector θ_0 satisfy the (q x 1) population moment condition E[g(P, θ_0)] = 0. This assumption holds by the definition of our GMM.

Assumption 3.4 (Global Identification): E[g(P, θ')] ≠ 0 for all θ' ∈ Θ such that θ' ≠ θ_0. This is the assumption of the theorem.

Assumption 3.7 (Properties of the Weighting Matrix): W_t is a positive semi-definite matrix that converges in probability to a positive definite matrix of constants W. This holds because W = I.

Assumption 3.8 (Ergodicity): the random process {v_t; -∞ < t < ∞} is ergodic. Since the data are generated i.i.d., the process is ergodic.

Assumption 3.9: Θ is a compact set. The mixing probabilities lie in the compact interval [0, 1], and by the assumption of this theorem the θ^(r)_i's are bounded in [0, C].

Assumption 3.10 (Domination of g(P, θ)): E[sup_{θ ∈ Θ} ||g(P, θ)||] < ∞. This assumption holds because all moment conditions are finite.

## The Sandwich Algorithm

The performance of the E-GMM algorithm is sensitive to the initial value. Therefore, we propose the sandwich algorithm (GMM-E-GMM) to give our E-GMM algorithm a good starting point. We note that the first GMM stands for our GMM algorithm for learning k-RUMs, and the second GMM stands for Azari Soufiani, Parkes, and Xia's (2014) algorithm for learning a single RUM in each M-step. The algorithm is formally shown as Algorithm 3.

Algorithm 3: Sandwich (GMM-E-GMM)
Input: a profile P of n rankings, the number of components k, and the number of iterations T.
Output: α_r, θ^(r) for r = 1, 2, ..., k.
1: Run Algorithm 2, whose output is α^(0)_r, θ^(r,0) for r = 1, 2, ..., k.
2: Use the output as the initial value of Algorithm 1.

## Experiments

We implemented all algorithms in Matlab and tested them on synthetic data and Preflib data.

Synthetic data generation. In each trial, we first generate the ground truth parameters for the k RUM components and normalize each θ^(r) so that ∑_{i=1}^m θ^(r)_i = 0. The mixing coefficients α are generated uniformly at random in [0, 1] and then normalized so that ∑_{r=1}^k α_r = 1. Then, with probability α_r, the r-th component is selected to generate a full ranking. We run experiments on two settings: (i) k = 2, m = 6, 1000 trials (Figure 1), and (ii) k = 4, m = 15, 900 trials (Figure 2). In both figures, results for the E-GMM algorithm are not shown because the sandwich algorithm strictly improves on E-GMM w.r.t. both statistical efficiency and computational efficiency. All experiments were run on an Ubuntu Linux server with Intel Xeon E5 v3 CPUs clocked at 3.50 GHz.

Figure 1: MWRSE with 95% confidence intervals and runtime for 2-RUM over 6 alternatives. We use 10 EM iterations for the sandwich algorithm. Values are averaged over 1000 trials. The ground truth for each component ranges between 0 and 5.

Figure 2: MWRSE with 95% confidence intervals and runtime for 4-RUM over 15 alternatives. We use 5 EM iterations for the sandwich algorithm. Values are averaged over 900 trials. The ground truth for each component ranges between 0 and 10.

Measures. We note that components with small mixing coefficients are generally hard to learn accurately; in the extreme case, when the mixing coefficient of a component is 0, it is impossible to learn that component. Therefore, the standard mean square error is not very informative for our experiments. Consequently, we define MWRSE (mean weighted root square error) to mitigate the impact of such components. Formally, let θ_0 denote the ground truth parameter and θ̂ denote the estimate. We define WRSE as follows:

WRSE = ||α_0 - α̂||_2 + ∑_{r=1}^k α_{0,r} ||θ^(r)_0 - θ̂^(r)||_2.

MWRSE is computed by averaging the WRSE over all trials.
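For reference, here is a minimal sketch of the error measure just defined; it assumes the estimated components have already been matched (relabeled) against the ground-truth components, which the formula takes as given. MWRSE is then the average of this quantity over all trials.

```python
import numpy as np

def wrse(alpha0, thetas0, alpha_hat, thetas_hat):
    """WRSE = ||alpha0 - alpha_hat||_2 + sum_r alpha0_r * ||theta0^(r) - theta_hat^(r)||_2.

    Assumes the estimated components have already been relabeled to match
    the ground-truth components.
    """
    alpha0, alpha_hat = np.asarray(alpha0), np.asarray(alpha_hat)
    thetas0, thetas_hat = np.asarray(thetas0), np.asarray(thetas_hat)
    mixing_err = np.linalg.norm(alpha0 - alpha_hat)
    component_err = sum(a * np.linalg.norm(t0 - t1)
                        for a, t0, t1 in zip(alpha0, thetas0, thetas_hat))
    return mixing_err + component_err
```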
Observations. The sandwich algorithm has lower MWRSEs and narrower 95% confidence intervals than GMM under both settings (left subfigures of Figures 1 and 2), which means that the sandwich algorithm achieves better statistical efficiency. In terms of runtime, the sandwich algorithm is not as fast as GMM (right subfigures of Figures 1 and 2). Moreover, the running time of the E-step dominates that of the M-step. Our sandwich algorithm achieves satisfactory statistical efficiency based on the following high-level reasoning. From Figure 1 we observe that the MWRSE for the sandwich algorithm is no more than 0.6 for k = 2, m = 6. Since the weights sum up to 1, the root squared error for each component is typically about 0.6. For each single parameter, we would therefore expect an error of roughly √(0.6²/6) ≈ 0.245, which is reasonably small considering that each single parameter ranges over [0, 5]. Similarly, for k = 4, m = 15, the error of each single parameter is typically below √(2²/15) ≈ 0.52, which is also small compared to the range [0, 10].

Real-world data. We learn different models, including the Plackett-Luce model (PL), its mixtures (k-PLs), a single Gaussian RUM, and Gaussian k-RUMs, from 209 linear-order datasets on Preflib (Mattei and Walsh 2013) and compute their AIC, AICc and BIC, defined as

AIC = 2d - 2 ln(L),  AICc = AIC + 2d(d + 1)/(n - d - 1),  BIC = d ln(n) - 2 ln(L),

where L is the value of the likelihood function evaluated at the estimate, d is the number of parameters in the model, and n is the number of rankings. A smaller AIC, AICc or BIC means better fitness. For the mixture models (k-PL and k-RUM), we increase k until all three measurements start increasing. We compute the percentage of datasets for which one model is strictly better (lower) than another w.r.t. each measure, shown in Table 1. For example, in terms of AIC, k-RUM (with the best k) beats a single RUM in 60.3% of the datasets, which means that in 60.3% of the datasets the best k for k-RUM is at least 2.

Observations. From Table 1 we can see that the three information criteria agree on the following order of models: k-RUM ≻ k-PL ≻ RUM ≻ PL, where A ≻ B means that the number of datasets where A beats B is larger than the number of datasets where B beats A.

Table 1: Model fitness comparisons. Each entry is the percentage of datasets where the row model is strictly better (lower) than the column model, for AIC (top), AICc (middle), and BIC (bottom).

AIC:
|       | k-RUM | k-PL  | RUM   | PL    |
|-------|-------|-------|-------|-------|
| k-RUM | 0     | 60.8% | 60.3% | 90.0% |
| k-PL  | 39.2% | 0     | 79.4% | 90.4% |
| RUM   | 0     | 20.6% | 0     | 76.6% |
| PL    | 10.0% | 0     | 23.4% | 0     |

AICc:
|       | k-RUM | k-PL  | RUM   | PL    |
|-------|-------|-------|-------|-------|
| k-RUM | 0     | 60.3% | 59.8% | 90.0% |
| k-PL  | 39.7% | 0     | 79.4% | 89.5% |
| RUM   | 0     | 20.6% | 0     | 76.6% |
| PL    | 10.0% | 0     | 23.4% | 0     |

BIC:
|       | k-RUM | k-PL  | RUM   | PL    |
|-------|-------|-------|-------|-------|
| k-RUM | 0     | 66.0% | 40.2% | 84.2% |
| k-PL  | 34.0% | 0     | 59.8% | 66.0% |
| RUM   | 0     | 40.2% | 0     | 76.6% |
| PL    | 15.8% | 0     | 23.4% | 0     |

## Conclusions and Future Work

We characterized conditions for mixtures of general RUMs to be non-identifiable and generically identifiable, and designed three algorithms for computing them. Our experiments show that the sandwich algorithm achieves higher statistical efficiency and GMM achieves higher computational efficiency. Open questions include improving the identifiability theorems for k-RUMs and designing more efficient algorithms, such as tensor-decomposition-based algorithms.

## Acknowledgments

We thank all anonymous reviewers for helpful comments and suggestions. This work is supported by NSF #1453542 and ONR #N00014-17-1-2621.

## References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6):716-723.
Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15(1):2773-2832.
Azari Soufiani, H.; Chen, W.; Parkes, D. C.; and Xia, L. 2013. Generalized method-of-moments for rank aggregation. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
Azari Soufiani, H.; Parkes, D. C.; and Xia, L. 2012. Random utility theory for social choice. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 126-134.
Azari Soufiani, H.; Parkes, D. C.; and Xia, L. 2014. Computing parametric ranking models via rank-breaking. In Proceedings of the 31st International Conference on Machine Learning.
Chen, Y., and Suh, C. 2015. Spectral MLE: Top-k rank aggregation from pairwise comparisons. In Proceedings of the 32nd International Conference on Machine Learning, volume 37.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1):1-38.
Gormley, I. C., and Murphy, T. B. 2008. Exploring voting blocs within the Irish electorate: A mixture modeling approach. Journal of the American Statistical Association 103(483):1014-1027.
Hall, A. R. 2005. Generalized Method of Moments. Oxford University Press.
Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50(4):1029-1054.
Hurvich, C. M., and Tsai, C.-L. 1989. Regression and time series model selection in small samples. Biometrika 76(2):297-307.
Khetan, A., and Oh, S. 2016a. Computational and statistical tradeoffs in learning to rank. In Advances in Neural Information Processing Systems (NIPS).
Khetan, A., and Oh, S. 2016b. Data-driven rank breaking for efficient rank aggregation. In Proceedings of the 33rd International Conference on Machine Learning, volume 48.
Liu, T.-Y. 2011. Learning to Rank for Information Retrieval. Springer.
Luce, R. D. 1959. Individual Choice Behavior: A Theoretical Analysis. Wiley.
Mattei, N., and Walsh, T. 2013. Preflib: A library of preference data. In Proceedings of the Third International Conference on Algorithmic Decision Theory (ADT 2013), Lecture Notes in Artificial Intelligence.
Mollica, C., and Tardella, L. 2016. Bayesian Plackett-Luce mixture models for partially ranked data. Psychometrika 1-17.
Negahban, S.; Oh, S.; and Shah, D. 2012. Iterative ranking from pair-wise comparisons. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2483-2491.
Plackett, R. L. 1975. The analysis of permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics) 24(2):193-202.
Redner, R. A., and Walker, H. F. 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2):195-239.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6(2):461-464.
Sidiropoulos, N. D., and Bro, R. 2000. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics 14(3):229-239.
Thurstone, L. L. 1927. A law of comparative judgment. Psychological Review 34(4):273-286.
Tkachenko, M., and Lauw, H. W. 2016. Plackett-Luce regression mixture model for heterogeneous rankings. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 237-246. ACM.
Zhao, Z.; Piech, P.; and Xia, L. 2016. Learning mixtures of Plackett-Luce models. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).