# Learning Mixtures of Random Utility Models

Zhibing Zhao, Tristan Villamil, Lirong Xia
Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY, USA
{zhaoz6, villat2}@rpi.edu, xial@cs.rpi.edu

## Abstract

We tackle the problem of identifiability and efficient learning of mixtures of Random Utility Models (RUMs). We show that when the PDFs of the utility distributions are symmetric, the mixture of k RUMs (denoted by k-RUM) is not identifiable when the number of alternatives m is no more than 2k - 1. On the other hand, when m ≥ max{4k - 2, 6}, any k-RUM is generically identifiable. We then propose three algorithms for learning mixtures of RUMs: an EM-based algorithm, which we call E-GMM, a direct generalized-method-of-moments (GMM) algorithm, and a sandwich (GMM-E-GMM) algorithm that combines the other two. Experiments on synthetic data show that the sandwich algorithm achieves the highest statistical efficiency and GMM is the most computationally efficient. Experiments on real-world data from Preflib show that Gaussian k-RUMs provide better fitness than a single Gaussian RUM, the Plackett-Luce model, and mixtures of Plackett-Luce models w.r.t. commonly-used model fitness criteria. To the best of our knowledge, this is the first work on learning mixtures of general RUMs.

## Introduction

In rank aggregation, the goal is to aggregate rank data to make an optimal decision, where each data point is a linear order over a set of alternatives. Rank aggregation has many applications. For example, in political elections, agents cast votes to elect a president; in information retrieval, rankings over documents are combined into a list; in crowdsourcing, crowd workers sometimes give rankings as answers, which are aggregated to estimate the correct ranking. In this paper, we take a machine learning approach towards rank aggregation, by addressing the problem of efficiently learning mixtures of Random Utility Models (RUMs).

RUMs are a widely applied statistical model for human behavior (Thurstone 1927). In a single, non-mixture RUM, each agent samples a utility for each alternative independently and reports the ranking over the alternatives obtained by sorting their utilities. A special case of RUMs is the Plackett-Luce model (Plackett 1975; Luce 1959), which can be seen as the extension of multinomial logistic regression to rank data. General RUMs can fit data better than Plackett-Luce models and thus provide more accurate predictions (Azari Soufiani, Parkes, and Xia 2012). Unfortunately, unlike the Plackett-Luce model, general RUMs are hard to learn due to the lack of a closed-form formula for the likelihood function. One prominent example is Thurstone's Case V model (Thurstone 1927), where the utility distributions are Gaussian; it can be seen as an extension of multinomial probit regression to rank data. See Related Work for more discussion.

A mixture of k RUMs, denoted by k-RUM, combines k RUM components via the mixing coefficients α = (α_1, α_2, ..., α_k). Given a k-RUM, a ranking can be generated as follows: (i) the r-th component is selected with probability α_r; (ii) a ranking is generated from the r-th RUM component. Like other mixture models, k-RUMs can potentially provide better fitness and can be used for clustering. We are not aware of previous work on learning mixtures of general RUMs.
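To make this generative process concrete, the following minimal Python sketch samples rankings from a Gaussian k-RUM, i.e., Thurstone-style utilities with unit variance. The function name, the Gaussian choice, the unit variance, and the toy parameters are illustrative assumptions, not part of the model definition above.

```python
import numpy as np

def sample_k_rum(alpha, theta, n, seed=None):
    """Sample n rankings from a Gaussian k-RUM (illustrative sketch).

    alpha : length-k mixing coefficients, summing to 1.
    theta : k x m array of mean utilities, one row per RUM component.
    Returns an (n, m) integer array; each row lists the alternatives of one
    ranking, from most preferred to least preferred.
    """
    rng = np.random.default_rng(seed)
    alpha, theta = np.asarray(alpha), np.asarray(theta, dtype=float)
    k, m = theta.shape
    rankings = np.empty((n, m), dtype=int)
    for j in range(n):
        r = rng.choice(k, p=alpha)             # (i) pick a component
        utilities = rng.normal(theta[r], 1.0)  # (ii) one utility per alternative
        rankings[j] = np.argsort(-utilities)   # sort by decreasing utility
    return rankings

# A toy 2-RUM over 3 alternatives (parameters are made up for illustration).
print(sample_k_rum([0.3, 0.7], [[1.0, -1.0, 0.0], [-2.0, 0.0, 2.0]], n=5, seed=0))
```

Each row of the output is one sampled ranking; the first index is the most preferred alternative.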
There are some recent works on mixtures of Plackett-Luce models (Gormley and Murphy 2008; Mollica and Tardella 2016; Tkachenko and Lauw 2016; Zhao, Piech, and Xia 2016), which are special k-RUMs. The motivation of our work on learning mixtures of general RUMs is that general k-RUMs outperform a single RUM, the Plackett-Luce model, and mixtures of Plackett-Luce models w.r.t. various commonly-used model fitting criteria, as we show later in the paper based on experiments on real-world data. This means that general k-RUMs often improve predictions and decisions compared to state-of-the-art models.

In this paper, we address the following two fundamental questions on learning mixtures of RUMs. Is a k-RUM learnable? Can we design efficient algorithms for learning k-RUMs in general?

## Our Contributions

In this paper, we measure learnability by identifiability, which means that different parameters correspond to different distributions over rank data. Identifiability is important for mixture models, both for parameter estimation and for clustering. The identifiability of k-RUMs depends on the number of components k in the mixture model and the number of alternatives m, as we show in the following two theorems.

Theorem 1. Let M be any symmetric RUM from the location family. When m ≤ 2k - 1, k-RUM_M over m alternatives is non-identifiable. (All identifiability results in this paper hold modulo label switching, which means that if we label the components differently, the parameter is treated as the same.)

Here k-RUM_M denotes the mixture of k RUMs, each of which is chosen from the RUM model M. As a corollary, the k-mixture of Thurstone's original Case V model (with Gaussian distributions) is non-identifiable when m ≤ 2k - 1.

Our second theorem proves generic identifiability: k-RUM_M is generically identifiable if the Lebesgue measure of the set of non-identifiable parameters is 0.

Theorem 2. For any RUM_M where all utility distributions have support (-∞, +∞), when m ≥ max{4k - 2, 6}, k-RUM_M over m alternatives is generically identifiable.

We propose three algorithms to learn k-RUMs. The first one is an EM-based algorithm (Dempster, Laird, and Rubin 1977), which we call E-GMM. It works as follows: the E-step is standard, calculating the membership of each ranking in the data w.r.t. the k components based on the likelihood ratios. In the M-step, we adopt the generalized-method-of-moments (GMM) algorithm proposed by Azari Soufiani, Parkes, and Xia (2014) to estimate the parameters of each RUM component.

Our second algorithm is a direct GMM algorithm (Hansen 1982) that tries to match the moments of the k-RUM. We prove that under mild conditions the algorithm is consistent. That is, it converges to the ground truth with probability going to 1 as the number of independently generated rankings approaches infinity.

Our third algorithm combines the advantages of the E-GMM algorithm and the GMM algorithm. We first run GMM, then use its output as the initial values for E-GMM. The algorithm is thus called the sandwich algorithm (GMM-E-GMM). We note that the two GMMs in the sandwich algorithm are different.

Experiments on synthetic data show that the sandwich algorithm is the most statistically efficient and the GMM algorithm is the fastest. Observing that the E-GMM algorithm is not as good as the sandwich algorithm w.r.t. both statistical efficiency and computational efficiency, we do not show the results of the E-GMM algorithm in this paper. Experiments on real-world Preflib data (Mattei and Walsh 2013) show that k-RUMs provide better model fitness than a single RUM, the Plackett-Luce model, and mixtures of Plackett-Luce models w.r.t. commonly-used model fitness criteria, including the Akaike Information Criterion (AIC) (Akaike 1974), the corrected AIC for finite samples (AICc) (Hurvich and Tsai 1989), and the Bayesian Information Criterion (BIC) (Schwarz 1978).
We note that such comparisons and conclusions are enabled by our algorithms for learning general k-RUMs, which are the first of their kind.

## Related Work and Discussions

Our (non-)identifiability theorems are the first ones for mixtures of general RUMs. Recently, Zhao, Piech, and Xia (2016) characterized the identifiability of mixtures of Plackett-Luce models. We focus on a different class of mixtures of RUMs: those whose utility distributions have symmetric PDFs, including Thurstone's Case V model. In order to prove the (non-)identifiability theorems, we adopt novel techniques that avoid direct calculation of the likelihood. This is the major difference between our proofs and the proofs of Zhao, Piech, and Xia (2016) for mixtures of Plackett-Luce models, where closed-form formulas for the likelihood function are available. We emphasize that our non-identifiability theorem does not apply to mixtures of Plackett-Luce models because the utility distributions in the Plackett-Luce model (Gumbel distributions) are not symmetric.

Our work also makes a number of algorithmic contributions to the literature of rank aggregation, sometimes known as learning to rank (Liu 2011). In particular, GMM-based techniques for learning a single (non-mixture) RUM, including the Plackett-Luce model, have been extensively investigated (Negahban, Oh, and Shah 2012; Azari Soufiani et al. 2013; Azari Soufiani, Parkes, and Xia 2014; Chen and Suh 2015; Khetan and Oh 2016b; 2016a). Learning algorithms have also been proposed for mixtures of Plackett-Luce models, including EM-based algorithms (Gormley and Murphy 2008; Mollica and Tardella 2016; Tkachenko and Lauw 2016) and a generalized-method-of-moments algorithm (Zhao, Piech, and Xia 2016). To the best of our knowledge, our algorithms for learning k-RUMs are quite general and are the first algorithms for learning mixtures of RUMs beyond mixtures of Plackett-Luce models.

Our E-GMM is inspired by the EMM algorithm for learning mixtures of Plackett-Luce models proposed by Gormley and Murphy (2008). However, the EMM cannot be easily applied to learn general k-RUMs, because EMM uses a Minorize-Maximization (MM) algorithm to estimate the parameters of each Plackett-Luce component, and no MM algorithm is known for general RUMs. To address this challenge, we adopt the GMM by Azari Soufiani, Parkes, and Xia (2014) in the M-step. Our GMM algorithm is inspired by the GMMs in the previous work discussed above, but we use a different set of moment conditions.

Recently, tensor-decomposition techniques have been explored to learn models with latent variables; see for example the work by Anandkumar et al. (2014). However, such techniques cannot be easily applied to learn RUMs and their mixtures. In the proof of Theorem 2 we shed some light on a potential algorithm, which requires multiple tensor decompositions in which each component is labeled only by its mixing probability. Designing computationally tractable tensor-decomposition algorithms is a promising direction for future work.

## Preliminaries
Let A = {a_i | i = 1, 2, ..., m} denote a set of m alternatives. Let L(A) be the set of linear orders (full rankings) over A, which are transitive, antisymmetric, and total binary relations. Let P = {V_1, V_2, ..., V_n} denote the data (also called a preference profile), where V_j ∈ L(A) for all j ≤ n.

Definition 1 (Random Utility Model (RUM)). Given m ≥ 2 alternatives, a random utility model M associates each alternative a_i with a utility distribution. The parameter space is Θ = {θ = (θ_1, θ_2, ..., θ_m)}, where θ_i is the parameter of the utility distribution of a_i. The sample space is S = L(A)^n. An agent's ranking is generated in the following steps: first, the agent samples a random utility U_i for each alternative independently from μ_i(· | θ_i); second, she ranks the alternatives w.r.t. these utilities. Formally, given a parameter θ, the probability of the ranking V = [a_{i_1} ≻ a_{i_2} ≻ ... ≻ a_{i_m}] is

Pr_M(V | θ) = ∫_{-∞}^{+∞} μ_{i_m}(x_{i_m} | θ_{i_m}) ∫_{x_{i_m}}^{+∞} μ_{i_{m-1}}(x_{i_{m-1}} | θ_{i_{m-1}}) ... ∫_{x_{i_2}}^{+∞} μ_{i_1}(x_{i_1} | θ_{i_1}) dx_{i_1} ... dx_{i_{m-1}} dx_{i_m}.

We assume that rankings are generated i.i.d. Therefore, given a preference profile P and θ ∈ Θ, we have Pr_M(P | θ) = ∏_{j=1}^n Pr_M(V_j | θ).

In this paper, we focus on the location family, where each utility distribution μ_i is parameterized only by its mean, denoted by θ_i. In other words, the shapes of the utility distributions are fixed. We note that shifting the means of all alternatives simultaneously by the same distance does not affect the distribution over rankings. To eliminate this problem, we require ∑_{i=1}^m θ_i = 0. We further say that an RUM is symmetric if the PDF of each utility distribution is symmetric around its mean.

Definition 2 (k-RUM_M). Given an RUM M within the location family, for any k ∈ N+ and m ≥ 2, we define the k-RUM_M as follows. The sample space is S = L(A)^n. The parameter space has two parts. The first part is the mixing coefficients α = (α_1, ..., α_k), where α_r ≥ 0 for all r ≤ k and ∑_{r=1}^k α_r = 1. The second part is (θ^(1), θ^(2), ..., θ^(k)), where θ^(r) is the parameter of the r-th random utility component. All components come from the same class of RUM M, i.e., the shape of the utility distribution of any alternative is fixed. The probability of a ranking V is

Pr_{k-RUM_M}(V | θ) = ∑_{r=1}^k α_r Pr_M(V | θ^(r)),

where Pr_M(V | θ^(r)) is the probability of generating V in the r-th RUM component.

Example 1. Consider a 2-RUM over 3 alternatives. The mixing coefficients are α = (0.3, 0.7). Let all utility distributions be Gaussian with standard deviation 1. The two RUM components are parameterized by θ^(1) = (-3, -1, 4) and θ^(2) = (5, -2, -3), respectively. Taking the ranking a_1 ≻ a_3 ≻ a_2 as an example, we have

Pr(a_1 ≻ a_3 ≻ a_2) = 0.3 ∫_{-∞}^{+∞} φ(x_2 + 1) ∫_{x_2}^{+∞} φ(x_3 - 4) ∫_{x_3}^{+∞} φ(x_1 + 3) dx_1 dx_3 dx_2 + 0.7 ∫_{-∞}^{+∞} φ(x_2 + 2) ∫_{x_2}^{+∞} φ(x_3 + 3) ∫_{x_3}^{+∞} φ(x_1 - 5) dx_1 dx_3 dx_2,

where φ(x) is the PDF of the standard Gaussian distribution.
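The triple integrals in Example 1 have no simple closed form, but they are easy to check numerically. The sketch below is an illustrative Monte Carlo estimate (not code from the paper) of Pr(a_1 ≻ a_3 ≻ a_2) for the parameters of Example 1: sample a component, sample the three unit-variance Gaussian utilities, and count how often U_1 > U_3 > U_2.

```python
import numpy as np

def mc_prob_a1_a3_a2(alpha, theta, n_samples=2_000_000, seed=0):
    """Monte Carlo estimate of Pr(a1 > a3 > a2) under a Gaussian k-RUM."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha)
    theta = np.asarray(theta, dtype=float)
    comps = rng.choice(len(alpha), size=n_samples, p=alpha)  # a component per sample
    utils = rng.normal(theta[comps], 1.0)                    # (n_samples, m) utilities
    ordered = (utils[:, 0] > utils[:, 2]) & (utils[:, 2] > utils[:, 1])
    return ordered.mean()

# Parameters of Example 1.
alpha = [0.3, 0.7]
theta = [[-3.0, -1.0, 4.0],   # theta^(1)
         [5.0, -2.0, -3.0]]   # theta^(2)
print(mc_prob_a1_a3_a2(alpha, theta))
```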
## Identifiability of k-RUM_M

We first recall the identifiability of a statistical model.

Definition 3 (Identifiability). Let M = {Pr(· | θ) : θ ∈ Θ} be a statistical model. M is identifiable if for all θ, γ ∈ Θ, Pr(· | θ) = Pr(· | γ) implies θ = γ.

We note that single RUMs are generally identifiable and can be learned using an MC-EM algorithm (Azari Soufiani, Parkes, and Xia 2012) or a GMM algorithm (Azari Soufiani, Parkes, and Xia 2014). However, identifiability of single RUMs does not imply identifiability of mixtures of RUMs. To eliminate the label switching problem (Redner and Walker 1984), we say that k-RUM_M is identifiable if there do not exist (1) 1 ≤ k_1, k_2 ≤ k and non-degenerate θ^(1), θ^(2), ..., θ^(k_1), γ^(1), γ^(2), ..., γ^(k_2), meaning that these k_1 + k_2 vectors are pairwise different, and (2) strictly positive mixing coefficients (α^(1)_1, ..., α^(1)_{k_1}) and (α^(2)_1, ..., α^(2)_{k_2}), such that for any ranking V we have

∑_{r=1}^{k_1} α^(1)_r Pr_M(V | θ^(r)) = ∑_{r=1}^{k_2} α^(2)_r Pr_M(V | γ^(r)).

A statistical model is generically identifiable if the Lebesgue measure of the set of parameters that do not satisfy the condition in Definition 3 is zero in the parameter space.

Theorem 1. Let M be any symmetric RUM from the location family. When m ≤ 2k - 1, k-RUM_M over m alternatives is non-identifiable.

Proof: The proof is constructive. We prove the case where m = 2k - 1; the proof for m < 2k - 1 is similar. For any k and m = 2k - 1, we will define non-degenerate θ^(1), ..., θ^(k), γ^(1), ..., γ^(k) with mixing probabilities α = [α_1, ..., α_k]^T and β = [β_1, ..., β_k]^T, respectively. We let θ^(1)_1, ..., θ^(k)_1, γ^(1)_1, ..., γ^(k)_1 be 2k pairwise different numbers such that θ^(r)_1 + γ^(r)_1 = 0 for all r = 1, ..., k. For any θ^(r), we let θ^(r)_2 = ... = θ^(r)_m = -θ^(r)_1 / (m - 1), so that ∑_{i=1}^m θ^(r)_i = 0. The γ^(r)'s are defined similarly.

Because the parameters for a_2, ..., a_m are equal in each RUM component, the distribution over data in each RUM component can be represented compactly using m events instead of m! rankings, namely a_1 at the first position, a_1 at the second position, etc. For convenience, we define F_θ as the m x k matrix

F_θ = [ Pr(a_1 top | θ^(1))    ...  Pr(a_1 top | θ^(k))
        ...                         ...
        Pr(a_1 bottom | θ^(1)) ...  Pr(a_1 bottom | θ^(k)) ],

and F_γ is defined similarly. We will prove that there exist positive α and β such that F_θ α = F_γ β, which proves non-identifiability.

Consider the matrix ΔF = F_θ - F_γ. For all 1 ≤ r ≤ k, the r-th column of ΔF is

ΔF^(r) = [ Pr(a_1 top | θ^(r)) - Pr(a_1 top | γ^(r)), ..., Pr(a_1 bottom | θ^(r)) - Pr(a_1 bottom | γ^(r)) ]^T.

Because the utility distributions of all alternatives are symmetric, we have Pr(a_1 at position i | θ^(r)) = Pr(a_1 at position m - i + 1 | γ^(r)). Therefore, the first element of ΔF^(r) is exactly the negative of its last element; the same holds for the second element and the second-to-last element, and so on. The center element (the k-th element) is zero. Since this holds for all r, the last k - 1 rows of ΔF are the negatives of the first k - 1 rows in reversed order, while the center row consists of only zeros. So the rank of ΔF is at most k - 1. Since ΔF has k columns, there exists a nonzero λ = [λ_1, λ_2, ..., λ_k]^T such that ΔF λ = 0. Elementwise, for every position i:

∑_{r=1}^k λ_r Pr(a_1 at position i | θ^(r)) = ∑_{r=1}^k λ_r Pr(a_1 at position i | γ^(r)).

If the entries of λ are all nonnegative, we let α = β = λ (normalized so that the entries sum to 1), and the proof is done. If there are negative elements in λ, we can switch the corresponding components θ^(r) and γ^(r) and flip the sign of λ_r until all elements of λ are nonnegative. This finishes the proof.

We conjecture that this bound is tight, since the counterexample has the fewest number of moments, which means the parameter is the most likely to be under-constrained.

Theorem 2. For any RUM_M where all utility distributions have support (-∞, +∞), when m ≥ max{4k - 2, 6}, k-RUM_M over m alternatives is generically identifiable.

Proof: We will focus on the parameters whose mixing coefficients (entries of α) are pairwise different. Formally, for any r_1, r_2 ∈ {1, ..., k},

α_{r_1} = α_{r_2} implies r_1 = r_2.   (1)

Such parameters have Lebesgue measure 1, because the parameters with any pair of identical mixing coefficients lie in a lower-dimensional space. Generic identifiability will be proved by analyzing the uniqueness of tensor decompositions. We will construct a rank-one tensor T^(r)(θ^(r)) to represent the r-th RUM component. Then the k-RUM_M can be represented by the weighted average of the tensors of its components, i.e.,

T(θ) = ∑_{r=1}^k α_r T^(r)(θ^(r)).
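The coordinates of each T^(r) are pairwise-comparison probabilities over a fixed pairing of the alternatives, as defined in the next part of the proof. As a preview, here is a small numpy sketch of the construction, assuming Gaussian utilities with unit variance; the pairing, the parameter values, and the Gaussian assumption are illustrative only.

```python
import numpy as np
from functools import reduce
from scipy.stats import norm

def pairwise_prob(theta, i, j):
    """Pr(a_i beats a_j) for Gaussian utilities with unit variance:
    U_i - U_j ~ N(theta_i - theta_j, 2)."""
    return norm.cdf((theta[i] - theta[j]) / np.sqrt(2.0))

def component_tensor(theta):
    """Rank-one tensor of one component: the outer product, over the pairs
    S_q = {a_{2q-1}, a_{2q}}, of the two pairwise-comparison probabilities."""
    factors = []
    for q in range(len(theta) // 2):
        p = pairwise_prob(theta, 2 * q, 2 * q + 1)
        factors.append(np.array([p, 1.0 - p]))
    return reduce(np.multiply.outer, factors)

def mixture_tensor(alpha, thetas):
    """T(theta) = sum_r alpha_r * T^(r)(theta^(r))."""
    return sum(a * component_tensor(np.asarray(t)) for a, t in zip(alpha, thetas))

# Toy example: k = 2 components over m = 6 alternatives gives a 2 x 2 x 2 tensor.
T = mixture_tensor([0.4, 0.6],
                   [[1.0, -1.0, 2.0, -2.0, 0.5, -0.5],
                    [-1.0, 1.0, 0.0, 0.0, 2.0, -2.0]])
print(T.shape, T.sum())   # (2, 2, 2); the entries sum to 1
```

Each factor sums to 1, so every component tensor, and hence the mixture tensor T(θ), sums to 1; this fact is used repeatedly in the argument below.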
We now provide a set of two sufficient conditions for θ = (α, θ^(1), ..., θ^(k)) to be identifiable, and then prove that both hold generically.

Condition 1. For every γ = (β, γ^(1), ..., γ^(k)) where β is not equal to α modulo label switching (the set of entries in β is different from the set of entries in α), we have T(γ) ≠ T(θ).

Condition 2. For every γ = (β, γ^(1), ..., γ^(k)) that is different from θ and has β = α, we have Pr_{k-RUM_M}(· | γ) ≠ Pr_{k-RUM_M}(· | θ).

It follows from the definition of identifiability that if both conditions hold, then θ is identifiable.

Condition 1 generically holds. We first show that T(θ) generically has a unique decomposition, then prove that uniqueness of the decomposition implies Condition 1. In the rest of the proof we assume that m is even; the theorem can be proved similarly for odd m.

To construct the rank-one tensor T^(r)(θ^(r)), we partition the set of alternatives into m/2 subsets: S_1 = {a_1, a_2}, S_2 = {a_3, a_4}, ..., S_{m/2} = {a_{m-1}, a_m}. We define the (m/2)-way rank-one tensor T^(r)(θ^(r)) for the r-th RUM_M component as follows. Let the q-th coordinate of T^(r) hold the probabilities of the pairwise comparison between the two alternatives in S_q. More precisely, for any 1 ≤ q ≤ m/2 and any ranking V_q ∈ L(S_q), the q-th coordinate of T^(r)(θ^(r)), denoted by p^(r)_q, is the following:

p^(r)_q = [ Pr(a_{2q-1} ≻ a_{2q} | θ^(r)), Pr(a_{2q} ≻ a_{2q-1} | θ^(r)) ].   (2)

According to Lemma 3 of (Zhao, Piech, and Xia 2016), which proves the independence of two rankings over mutually exclusive sets of alternatives and easily extends to multiple rankings, we have

Pr_M(V_1, V_2, ..., V_{m/2} | θ^(r)) = ∏_{q=1}^{m/2} Pr_M(V_q | θ^(r)),

where V_q ∈ L(S_q) for all q = 1, ..., m/2. It follows that T^(r)(θ^(r)) = ⊗_{q=1}^{m/2} p^(r)_q. The tensor for the mixture model can be written as T(θ) = ∑_{r=1}^k α_r T^(r)(θ^(r)).

To analyze the uniqueness of the tensor decomposition of T(θ), we investigate the Kruskal rank of each matrix P_q defined as P_q = [p^(1)_q, p^(2)_q, ..., p^(k)_q]. The Kruskal rank of a matrix is the largest number K such that every set of K columns of the matrix is linearly independent. We now prove that the Kruskal rank of P_q, denoted by K_q, is generically 2, i.e., any two columns of P_q are generically linearly independent. It is not hard to see that the two entries of p^(r)_q sum up to 1. So the only case where p^(r_1)_q and p^(r_2)_q are linearly dependent is that they are exactly the same. According to Proposition 1 in (Azari Soufiani, Parkes, and Xia 2014), Pr(a_{2q-1} ≻ a_{2q} | θ^(r)) monotonically increases as θ^(r)_{2q-1} - θ^(r)_{2q} increases, because all utility distributions have support (-∞, +∞). Therefore, p^(r_1)_q = p^(r_2)_q if and only if θ^(r_1)_{2q-1} - θ^(r_1)_{2q} = θ^(r_2)_{2q-1} - θ^(r_2)_{2q}, which has Lebesgue measure 0. Thus, K_q = 2 generically holds for all q ≤ m/2, and ∑_{q=1}^{m/2} K_q = 2 · (m/2) = m generically holds. Because k ≤ (m + 2)/4, we have ∑_{q=1}^{m/2} K_q = m ≥ 2k + m/2 - 1. Sidiropoulos and Bro (2000) proved that, for any N-way tensor, if ∑_{q=1}^N K_q ≥ 2k + N - 1, then the tensor decomposition is unique. This is true in our case since N = m/2.

We now prove that uniqueness of the decomposition of T(θ) implies Condition 1. Suppose, for the purpose of contradiction, that the decomposition of T(θ) is unique but Condition 1 does not hold. This means that there exists γ = (β, γ^(1), ..., γ^(k)), where β is not equal to α modulo label switching, such that T(γ) = T(θ). Because the entries of α are pairwise different, there exists r_1 ≤ k such that α_{r_1} ≠ β_{r_2} for all r_2 = 1, ..., k, while T(θ) = T(γ) = T. We will show that α_{r_1} T^(r_1)(θ^(r_1)) ≠ β_{r_2} T^(r_2)(γ^(r_2)) for every r_2 = 1, ..., k, which contradicts the unique decomposition of T.
Recall that for any r, the sum of the two entries of p^(r)_q is 1. It is not hard to see that the sum of all entries of T^(r)(·) is also 1. So the sums of all entries of α_{r_1} T^(r_1)(θ^(r_1)) and β_{r_2} T^(r_2)(γ^(r_2)) are α_{r_1} and β_{r_2}, respectively. We recall that α_{r_1} ≠ β_{r_2} for all r_2. Therefore, α_{r_1} T^(r_1)(θ^(r_1)) ≠ β_{r_2} T^(r_2)(γ^(r_2)) for all r_2, which is a contradiction. This finishes the proof that Condition 1 generically holds.

Condition 2 generically holds. Unfortunately, the tensor-decomposition technique used for Condition 1 no longer works for Condition 2. Here is a counterexample. Given θ = (α, θ^(1), ..., θ^(k)), we can construct γ = (α, γ^(1), ..., γ^(k)) such that T(γ) = T(θ) in the following way. For any r, we let γ^(r)_i = θ^(r)_i + 1 for i = 1, 2; γ^(r)_i = θ^(r)_i - 1 for i = 3, 4; and γ^(r)_i = θ^(r)_i for i = 5, ..., m. The resulting tensor satisfies T(γ) = T(θ), because the probabilities of the rankings restricted to each group are exactly the same.

To address this problem, we consider an additional tensor decomposition using a different partition of the alternatives, defined as follows: S'_1 = {a_2, a_3}, S'_2 = {a_4, a_5}, ..., S'_{m/2} = {a_m, a_1}. Let T'(θ) denote the tensor under this partition. Similarly to the proof for Condition 1, we can show that T'(θ) generically has a unique decomposition.

Next, we will prove that for any θ where T(θ) and T'(θ) have unique decompositions (which holds generically), Condition 2 holds. Suppose, for the sake of contradiction, that Condition 2 does not hold for a θ where T(θ) and T'(θ) have unique decompositions. Then there exists γ = (β, γ^(1), ..., γ^(k)) that is different from θ modulo label switching, such that Pr_{k-RUM_M}(· | γ) = Pr_{k-RUM_M}(· | θ). This means that T(γ) = T(θ) and T'(γ) = T'(θ).

We first match the components of γ and θ. Recall that uniqueness of T(θ) implies Condition 1. Because both T(θ) and T'(θ) have unique decompositions, Condition 1 must hold for both. It follows that the entries of β are exactly the entries of α; otherwise T(θ) ≠ T(γ), which is a contradiction. Since the entries of α are pairwise different, there is a unique way of matching the components of θ to the components of γ by matching the corresponding mixing coefficients. Consequently, we can relabel the components of γ so that β becomes exactly α. Let γ' = (α, γ'^(1), ..., γ'^(k)) denote the γ parameter after relabeling the components. Thus we have T(γ') = T(γ) = T(θ) and T'(γ') = T'(γ) = T'(θ). Because γ is not equal to θ modulo label switching, we have γ' ≠ θ, which implies that there exists r' such that

γ'^(r') ≠ θ^(r').   (3)

Next, we will show that the uniqueness of the decompositions of T(θ) and T'(θ) forces γ'^(r') = θ^(r'), which is a contradiction. To this end, we show how to uniquely determine the parameters of k-RUM_M (i.e., γ' and θ) from the decompositions of T(θ) and T'(θ). That T(γ') = T(θ) has a unique decomposition means that for any r_1 ∈ {1, ..., k}, there exists r_2 ∈ {1, ..., k} such that α_{r_1} T^(r_1)(γ'^(r_1)) = α_{r_2} T^(r_2)(θ^(r_2)), which implies that the sums of all entries of α_{r_1} T^(r_1)(γ'^(r_1)) and α_{r_2} T^(r_2)(θ^(r_2)) are equal, i.e., α_{r_1} = α_{r_2}. It follows from (1) that r_1 = r_2. Namely, we have α_r T^(r)(γ'^(r)) = α_r T^(r)(θ^(r)) for all r = 1, ..., k. It follows that T^(r)(γ'^(r)) = T^(r)(θ^(r)) for all r = 1, ..., k; in particular, T^(r')(γ'^(r')) = T^(r')(θ^(r')). Similarly, we also have T'^(r')(γ'^(r')) = T'^(r')(θ^(r')).
For any r and q, p^(r)_q can be easily obtained from T^(r)(θ^(r)) by normalizing the corresponding entries of T^(r)(θ^(r)). For example, p^(r)_1 can be obtained by normalizing the entries indexed by (1, 1, ..., 1) and (2, 1, ..., 1). Such a p^(r)_1 is uniquely determined by T^(r)(θ^(r)) because the two entries of p^(r)_1 sum up to 1. Specifically, for all q = 1, ..., m/2, p^(r')_q is uniquely determined by T^(r')(θ^(r')) and p'^(r')_q is uniquely determined by T'^(r')(θ^(r')). Next, we will prove that for all q = 1, ..., m/2, the p^(r')_q and p'^(r')_q uniquely determine θ^(r'), which contradicts (3).

We now focus on γ'^(r') and θ^(r') restricted to S_1. We claim that there exists a constant C_1 such that γ'^(r')_i = θ^(r')_i + C_1 for all i = 1, 2 (i.e., a_i ∈ S_1). The reason is as follows. Recall from (2) that p^(r')_1 consists of the probabilities of a_1 ≻ a_2 and a_2 ≻ a_1 given component r'. By Proposition 1 in (Azari Soufiani, Parkes, and Xia 2014), Pr(a_1 ≻ a_2 | θ^(r')) is a function of θ^(r')_1 - θ^(r')_2 and Pr(a_1 ≻ a_2 | γ'^(r')) is a function of γ'^(r')_1 - γ'^(r')_2. By matching the probabilities we have γ'^(r')_1 - γ'^(r')_2 = θ^(r')_1 - θ^(r')_2, which means that there exists a constant C_1 such that γ'^(r')_i = θ^(r')_i + C_1 for i = 1, 2. Similarly, for all q = 1, ..., m/2 there exists C_q such that

γ'^(r')_i = θ^(r')_i + C_q, for i = 2q - 1, 2q.   (4)

Similarly, from T'^(r)(γ'^(r)) = T'^(r)(θ^(r)), for all q = 1, ..., m/2 there exists C'_q such that

γ'^(r')_i = θ^(r')_i + C'_q, for i = 2q, (2q + 1 mod m).   (5)

Therefore, we have C_1 = C'_1 by letting i = 2 in (4) and (5); C'_1 = C_2 by letting i = 3; C_2 = C'_2 by letting i = 4; and so on. So C_1 = ... = C_{m/2} = C'_1 = ... = C'_{m/2}. Let C denote this common value. Then γ'^(r')_i = θ^(r')_i + C for all i = 1, ..., m. Because ∑_{i=1}^m γ'^(r')_i = ∑_{i=1}^m θ^(r')_i = 0, we have C = 0, which contradicts (3).

As we have proved, the Lebesgue measure of parameters where the tensor T(θ) (or T'(θ)) has a non-unique decomposition is 0. Therefore, the Lebesgue measure of parameters θ where both T(θ) and T'(θ) have unique decompositions is 1. This finishes the proof.

## The E-GMM Algorithm

To compute k-RUM_M, we propose an EM-based algorithm, which we call E-GMM, where the GMM algorithm proposed by Azari Soufiani, Parkes, and Xia (2014) is used in the M-step. The detailed algorithm is as follows. During the E-step, given the t-th step estimate α^(t) and θ^(1,t), ..., θ^(k,t), we use Bayes' rule to calculate the probability that a ranking V_j belongs to component r in the (t+1)-th iteration, denoted p^(jr)_{t+1}:

p^(jr)_{t+1} ∝ α^(t)_r Pr(V_j | θ^(r,t)).   (6)

We can normalize w.r.t. r because ∑_{r=1}^k p^(jr) = 1. In the M-step, the mixing probabilities are calculated by

α^(t+1)_r = (1/n) ∑_{j=1}^n p^(jr)_{t+1},   (7)

and the θ^(r,t+1)'s are estimated using the GMM algorithm from (Azari Soufiani, Parkes, and Xia 2014). Formally, the E-GMM algorithm is presented below as Algorithm 1.

Algorithm 1: E-GMM
Input: a profile P of n rankings, the number of components k, and the number of iterations T.
Output: α^(T+1)_r, θ^(r,T+1) for r = 1, 2, ..., k.
Initialize α^(1)_r and θ^(r,1) randomly for all r = 1, 2, ..., k.
1: for t = 1 to T do
2:   Given the estimate at step t: α^(t)_r, θ^(r,t).
3:   E-step: calculate the expected memberships by (6).
4:   M-step: calculate α^(t+1)_r using (7) and use GMM with weighted rankings/breakings to estimate θ^(r,t+1).
5: end for
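As a sketch of how one E-GMM iteration can be organized, the Python fragment below implements the responsibility computation (6) and the mixing-coefficient update (7). The helper rum_prob is a hypothetical placeholder for the single-RUM ranking probability Pr(V | θ^(r)) (e.g., estimated by sampling utilities); the remainder of the M-step, re-estimating each θ^(r) with the GMM of Azari Soufiani, Parkes, and Xia (2014) using the responsibilities as ranking weights, is not shown.

```python
import numpy as np

def e_step(rankings, alpha, thetas, rum_prob):
    """Eq. (6): responsibilities p^{(jr)} proportional to alpha_r * Pr(V_j | theta^{(r)}).

    rum_prob(ranking, theta) is a placeholder for the single-RUM ranking
    probability Pr(V | theta). Returns an (n, k) matrix whose rows sum to 1.
    """
    n, k = len(rankings), len(alpha)
    p = np.empty((n, k))
    for j, V in enumerate(rankings):
        p[j] = [alpha[r] * rum_prob(V, thetas[r]) for r in range(k)]
        p[j] /= p[j].sum()              # normalize over the k components
    return p

def m_step_mixing(p):
    """Eq. (7): alpha_r^{(t+1)} = (1/n) * sum_j p^{(jr)}."""
    return p.mean(axis=0)
```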
## GMM for k-RUM_M

Generalized-method-of-moments (GMM) (Hansen 1982) algorithms are a widely applied class of algorithms that generalize the classical method of moments. Each GMM is specified by a set of q ≥ 1 moment conditions g(V, θ), where V is a data point and θ is a parameter, such that for any θ_0, the expectation of each moment condition is zero at θ_0 when the data are generated from the model given θ_0; that is, E[g(V, θ_0)] = 0. The algorithm then computes the θ that minimizes a certain norm, e.g. the 2-norm, of ∑_{V ∈ P} g(V, θ).

Our GMM algorithm for k-RUM_M is defined as follows. We first define a breaking matrix B(P) = [b_{i_1 i_2}]_{m x m}, where b_{i_1 i_2} is the empirical probability that a_{i_1} is preferred over a_{i_2}, namely the number of times that a_{i_1} ≻ a_{i_2} divided by the number of rankings. The diagonal elements are zeros. For example, let P = (a_1 ≻ a_2 ≻ a_3, a_1 ≻ a_3 ≻ a_2); then

B(P) = [ 0   1    1
         0   0    0.5
         0   0.5  0  ].

We then define the following m(m-1)/2 moment conditions: for any i_1 < i_2, let g_{i_1 i_2}(P, θ) = b_{i_1 i_2} - Pr(a_{i_1} ≻ a_{i_2} | θ). In our GMM algorithm, we minimize the following objective function:

G = ∑_{i_1 < i_2} (b_{i_1 i_2} - Pr(a_{i_1} ≻ a_{i_2} | θ))²,   (8)

where Pr(a_{i_1} ≻ a_{i_2} | θ) = ∑_{r=1}^k α_r Pr(a_{i_1} ≻ a_{i_2} | θ^(r)). Formally, the algorithm is presented as Algorithm 2.

Algorithm 2: GMM for k-RUM_M
Input: a preference profile P.
1: Compute the breaking matrix B(P).
2: Compute the parameter that minimizes (8).

The advantages of our GMM algorithm are:
1. The gradient and Hessian of G are easy to compute. To calculate the gradient of G, we need the partial derivatives of Pr(a_{i_1} ≻ a_{i_2} | θ), which can be broken down into the partial derivatives of Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) w.r.t. θ^(r). These are given by Proposition 1 of (Azari Soufiani, Parkes, and Xia 2014).
2. Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) can be computed easily by sampling utilities from the utility distributions of a_{i_1} and a_{i_2}. For Gaussian distributions, Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) is the CDF, evaluated at θ^(r)_{i_1} - θ^(r)_{i_2}, of another Gaussian distribution with mean 0 and variance equal to the sum of the variances of the utility distributions of a_{i_1} and a_{i_2}.

For better accuracy we add m more moment conditions, corresponding to m cyclic triple-wise comparisons, i.e., T_1 = a_1 ≻ a_2 ≻ a_3, T_2 = a_2 ≻ a_3 ≻ a_4, ..., T_m = a_m ≻ a_1 ≻ a_2. Let b_{T_i} be the empirical probability of T_i. The new GMM minimizes

G' = G + ∑_{i=1}^m (b_{T_i} - Pr(T_i | θ))².
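A minimal sketch of the two steps of Algorithm 2 is given below, assuming Gaussian utility distributions with unit variance, so that Pr(a_{i_1} ≻ a_{i_2} | θ^(r)) is the Gaussian CDF mentioned in advantage 2. In practice the objective (8), with the parameters flattened into a single vector and the constraints on α handled appropriately, would be passed to a generic numerical optimizer (e.g., scipy.optimize.minimize); the cyclic-triple terms of G' are omitted here, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def breaking_matrix(rankings, m):
    """B(P): entry (i1, i2) is the fraction of rankings in which a_{i1} beats a_{i2}."""
    B = np.zeros((m, m))
    for V in rankings:                       # V lists alternative indices, best first
        for pos, i1 in enumerate(V):
            for i2 in V[pos + 1:]:
                B[i1, i2] += 1.0
    return B / len(rankings)

def pairwise_mixture_prob(alpha, thetas, i1, i2):
    """Pr(a_{i1} beats a_{i2} | theta) = sum_r alpha_r * Pr(a_{i1} beats a_{i2} | theta^(r)),
    for Gaussian utilities with unit variance (so the inner term is a Gaussian CDF)."""
    return sum(a * norm.cdf((t[i1] - t[i2]) / np.sqrt(2.0))
               for a, t in zip(alpha, thetas))

def gmm_objective(B, alpha, thetas):
    """Objective (8): sum over i1 < i2 of (b_{i1 i2} - Pr(a_{i1} beats a_{i2} | theta))^2."""
    m = B.shape[0]
    return sum((B[i1, i2] - pairwise_mixture_prob(alpha, thetas, i1, i2)) ** 2
               for i1 in range(m) for i2 in range(i1 + 1, m))
```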
We now prove that our GMM algorithm is consistent: when the data are generated independently from k-RUM_M and the size of the data approaches infinity, the algorithm converges to the ground truth with probability going to 1.

Theorem 3. If k-RUM_M over m alternatives is identifiable, and the means of the utility distributions of all alternatives in all RUM components are bounded in closed intervals [0, C], then Algorithm 2 is consistent.

Proof: Hall (2005) provides a set of sufficient conditions for a GMM to be consistent. Therefore, it suffices to check that all assumptions in Theorem 3.1 of (Hall 2005) hold.

Assumption 3.1 (Strict Stationarity): the (n x 1) random vectors {v_t; -∞ < t < ∞} form a strictly stationary process with sample space S ⊆ R^n. Since the data are generated i.i.d., the process is strictly stationary.

Assumption 3.2 (Regularity Conditions for g(·,·)): the function g: S x Θ → R^q, where q < ∞, satisfies: (i) it is continuous on Θ for each P ∈ S; (ii) E[g(P, θ)] exists and is finite for every θ ∈ Θ; (iii) E[g(P, θ)] is continuous on Θ. Our moment conditions satisfy all the regularity conditions, since g_{i_1 i_2}(P, θ) is continuous on Θ and bounded in [-1, 1] for any i_2 ≠ i_1.

Assumption 3.3 (Population Moment Condition): the random vector v_t and the parameter vector θ_0 satisfy the (q x 1) population moment condition E[g(P, θ_0)] = 0. This assumption holds by the definition of our GMM.

Assumption 3.4 (Global Identification): E[g(P, θ')] ≠ 0 for all θ' ∈ Θ such that θ' ≠ θ_0. This is the assumption of the theorem.

Assumption 3.7 (Properties of the Weighting Matrix): W_t is a positive semi-definite matrix that converges in probability to a positive definite matrix of constants W. This holds because W = I.

Assumption 3.8 (Ergodicity): the random process {v_t; -∞ < t < ∞} is ergodic. Since the data are generated i.i.d., the process is ergodic.

Assumption 3.9: Θ is a compact set. The mixing probabilities lie in the compact interval [0, 1], and by the assumption of this theorem the θ^(r)_i's are bounded in [0, C].

Assumption 3.10 (Domination of g(P, θ)): E[sup_{θ ∈ Θ} ||g(P, θ)||] < ∞. This assumption holds because all moment conditions are finite.

## The Sandwich Algorithm

The performance of the E-GMM algorithm is sensitive to the initial value. Therefore, we propose the sandwich algorithm (GMM-E-GMM) to give our E-GMM algorithm a good starting point. We note that the first GMM stands for our GMM algorithm for learning k-RUMs, and the second GMM stands for Azari Soufiani, Parkes, and Xia's (2014) algorithm for learning a single RUM in each M-step. The algorithm is formally shown as Algorithm 3.

Algorithm 3: Sandwich (GMM-E-GMM)
Input: a profile P of n rankings, the number of components k, and the number of iterations T.
Output: α_r, θ^(r) for r = 1, 2, ..., k.
1: Run Algorithm 2, whose output is α^(0)_r, θ^(r,0) for r = 1, 2, ..., k.
2: Use the output as the initial value of Algorithm 1.

## Experiments

We implemented all algorithms in Matlab and tested them on synthetic data and Preflib data.

Synthetic data generation. In each trial, we first generate the ground truth parameters for the k RUM components and normalize each θ^(r) so that ∑_{i=1}^m θ^(r)_i = 0. The mixing coefficients α are generated uniformly at random in [0, 1] and then normalized so that ∑_{r=1}^k α_r = 1. Then, with probability α_r, the r-th component is selected to generate a full ranking. We run experiments on two settings: (i) k = 2, m = 6, 1000 trials (Figure 1), and (ii) k = 4, m = 15, 900 trials (Figure 2). In both figures, results for the E-GMM algorithm are not shown because the sandwich algorithm strictly improves on E-GMM w.r.t. both statistical efficiency and computational efficiency. All experiments were run on an Ubuntu Linux server with Intel Xeon E5 v3 CPUs clocked at 3.50 GHz.

Figure 1: MWRSE with 95% confidence intervals and runtime for 2-RUM over 6 alternatives. We use 10 EM iterations for the sandwich algorithm. Values are averaged over 1000 trials. The ground truth for each component ranges between 0 and 5.

Figure 2: MWRSE with 95% confidence intervals and runtime for 4-RUM over 15 alternatives. We use 5 EM iterations for the sandwich algorithm. Values are averaged over 900 trials. The ground truth for each component ranges between 0 and 10.

Measures. We note that components with small mixing coefficients are generally hard to learn accurately; in the extreme case, when the mixing coefficient of a component is 0, it is impossible to learn that component. Therefore, the standard mean square error is not very informative for our experiments. Consequently, we define MWRSE (mean weighted root square error) to mitigate the impact of such components. Formally, let θ_0 denote the ground truth parameter and θ̂ denote the estimate. We define WRSE as follows:

WRSE = ||α_0 - α̂||_2 + ∑_{r=1}^k α_{0,r} ||θ^(r)_0 - θ̂^(r)||_2.

MWRSE is computed by averaging the WRSE over all trials.
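For reference, here is a minimal sketch of the error measure just defined; it assumes the estimated components have already been matched (relabeled) against the ground-truth components, which the formula takes as given. MWRSE is then the average of this quantity over all trials.

```python
import numpy as np

def wrse(alpha0, thetas0, alpha_hat, thetas_hat):
    """WRSE = ||alpha0 - alpha_hat||_2 + sum_r alpha0_r * ||theta0^(r) - theta_hat^(r)||_2.

    Assumes the estimated components have already been relabeled to match
    the ground-truth components.
    """
    alpha0, alpha_hat = np.asarray(alpha0), np.asarray(alpha_hat)
    thetas0, thetas_hat = np.asarray(thetas0), np.asarray(thetas_hat)
    mixing_err = np.linalg.norm(alpha0 - alpha_hat)
    component_err = sum(a * np.linalg.norm(t0 - t1)
                        for a, t0, t1 in zip(alpha0, thetas0, thetas_hat))
    return mixing_err + component_err
```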
Observations. The sandwich algorithm has lower MWRSEs and narrower 95% confidence intervals than GMM under both settings (left subfigures of Figures 1 and 2), which means that the sandwich algorithm achieves better statistical efficiency. In terms of runtime, the sandwich algorithm is not as fast as GMM (right subfigures of Figures 1 and 2). Moreover, the running time of the E-step dominates that of the M-step. Our sandwich algorithm achieves satisfactory statistical efficiency based on the following high-level reasoning. From Figure 1 we observe that the MWRSE for the sandwich algorithm is no more than 0.6 for k = 2, m = 6. Since the weights sum up to 1, the root squared error for each component is typically about 0.6. For each single parameter, we would therefore expect an error of roughly √(0.6²/6) ≈ 0.245, which is reasonably small considering that each single parameter ranges over [0, 5]. Similarly, for k = 4, m = 15, the error of each single parameter is typically below √(2²/15) ≈ 0.52, which is also small compared to the range [0, 10].

Real-world data. We learn different models, including the Plackett-Luce model (PL), its mixtures (k-PLs), a single Gaussian RUM, and Gaussian k-RUMs, from 209 linear-order datasets on Preflib (Mattei and Walsh 2013) and compute their AIC, AICc and BIC, defined as

AIC = 2d - 2 ln(L),  AICc = AIC + 2d(d + 1)/(n - d - 1),  BIC = d ln(n) - 2 ln(L),

where L is the value of the likelihood function evaluated at the estimate, d is the number of parameters in the model, and n is the number of rankings. A smaller AIC, AICc or BIC means better fitness. For the mixture models (k-PL and k-RUM), we increase k until all three measurements start increasing. We compute the percentage of datasets for which one model is strictly better (lower) than another w.r.t. each measure, shown in Table 1. For example, in terms of AIC, k-RUM (with the best k) beats a single RUM in 60.3% of the datasets, which means that in 60.3% of the datasets the best k for k-RUM is at least 2.

Observations. From Table 1 we can see that the three information criteria agree on the following order of models: k-RUM ≻ k-PL ≻ RUM ≻ PL, where A ≻ B means that the number of datasets where A beats B is larger than the number of datasets where B beats A.

Table 1: Model fitness comparisons. Each entry is the percentage of datasets where the row model is strictly better (lower) than the column model, for AIC (top), AICc (middle), and BIC (bottom).

AIC:
|       | k-RUM | k-PL  | RUM   | PL    |
|-------|-------|-------|-------|-------|
| k-RUM | 0     | 60.8% | 60.3% | 90.0% |
| k-PL  | 39.2% | 0     | 79.4% | 90.4% |
| RUM   | 0     | 20.6% | 0     | 76.6% |
| PL    | 10.0% | 0     | 23.4% | 0     |

AICc:
|       | k-RUM | k-PL  | RUM   | PL    |
|-------|-------|-------|-------|-------|
| k-RUM | 0     | 60.3% | 59.8% | 90.0% |
| k-PL  | 39.7% | 0     | 79.4% | 89.5% |
| RUM   | 0     | 20.6% | 0     | 76.6% |
| PL    | 10.0% | 0     | 23.4% | 0     |

BIC:
|       | k-RUM | k-PL  | RUM   | PL    |
|-------|-------|-------|-------|-------|
| k-RUM | 0     | 66.0% | 40.2% | 84.2% |
| k-PL  | 34.0% | 0     | 59.8% | 66.0% |
| RUM   | 0     | 40.2% | 0     | 76.6% |
| PL    | 15.8% | 0     | 23.4% | 0     |

## Conclusions and Future Work

We characterized conditions for mixtures of general RUMs to be non-identifiable and generically identifiable, and designed three algorithms for computing them. Our experiments show that the sandwich algorithm achieves higher statistical efficiency and GMM achieves higher computational efficiency. Open questions include improving the identifiability theorems for k-RUMs and designing more efficient algorithms, such as tensor-decomposition-based algorithms.

## Acknowledgments

We thank all anonymous reviewers for helpful comments and suggestions. This work is supported by NSF #1453542 and ONR #N00014-17-1-2621.

## References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6):716-723.
Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15(1):2773-2832.
Azari Soufiani, H.; Chen, W.; Parkes, D. C.; and Xia, L. 2013. Generalized method-of-moments for rank aggregation. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
Azari Soufiani, H.; Parkes, D. C.; and Xia, L. 2012. Random utility theory for social choice. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 126-134.
Azari Soufiani, H.; Parkes, D. C.; and Xia, L. 2014. Computing parametric ranking models via rank-breaking. In Proceedings of the 31st International Conference on Machine Learning.
Chen, Y., and Suh, C. 2015. Spectral MLE: Top-k rank aggregation from pairwise comparisons. In Proceedings of the 32nd International Conference on Machine Learning, volume 37.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1):1-38.
Gormley, I. C., and Murphy, T. B. 2008. Exploring voting blocs within the Irish electorate: A mixture modeling approach. Journal of the American Statistical Association 103(483):1014-1027.
Hall, A. R. 2005. Generalized Method of Moments. Oxford University Press.
Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50(4):1029-1054.
Hurvich, C. M., and Tsai, C.-L. 1989. Regression and time series model selection in small samples. Biometrika 76(2):297-307.
Khetan, A., and Oh, S. 2016a. Computational and statistical tradeoffs in learning to rank. In Advances in Neural Information Processing Systems (NIPS).
Khetan, A., and Oh, S. 2016b. Data-driven rank breaking for efficient rank aggregation. In Proceedings of the 33rd International Conference on Machine Learning, volume 48.
Liu, T.-Y. 2011. Learning to Rank for Information Retrieval. Springer.
Luce, R. D. 1959. Individual Choice Behavior: A Theoretical Analysis. Wiley.
Mattei, N., and Walsh, T. 2013. Preflib: A library of preference data. In Proceedings of the Third International Conference on Algorithmic Decision Theory (ADT 2013), Lecture Notes in Artificial Intelligence.
Mollica, C., and Tardella, L. 2016. Bayesian Plackett-Luce mixture models for partially ranked data. Psychometrika 1-17.
Negahban, S.; Oh, S.; and Shah, D. 2012. Iterative ranking from pair-wise comparisons. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2483-2491.
Plackett, R. L. 1975. The analysis of permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics) 24(2):193-202.
Redner, R. A., and Walker, H. F. 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2):195-239.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6(2):461-464.
Sidiropoulos, N. D., and Bro, R. 2000. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics 14(3):229-239.
Thurstone, L. L. 1927. A law of comparative judgment. Psychological Review 34(4):273-286.
Tkachenko, M., and Lauw, H. W. 2016. Plackett-Luce regression mixture model for heterogeneous rankings. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 237-246. ACM.
Zhao, Z.; Piech, P.; and Xia, L. 2016. Learning mixtures of Plackett-Luce models. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).