# Learning Mixtures of Plackett-Luce Models

Zhibing Zhao (ZHAOZ6@RPI.EDU), Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180 USA
Peter Piech (PIECHP@RPI.EDU), Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180 USA
Lirong Xia (XIAL@CS.RPI.EDU), Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180 USA

Abstract

In this paper we address the identifiability and efficient learning problems of finite mixtures of Plackett-Luce models for rank data. We prove that for any $k \ge 2$, the mixture of $k$ Plackett-Luce models for no more than $2k-1$ alternatives is non-identifiable, and that this bound is tight for $k = 2$. For generic identifiability, we prove that the mixture of $k$ Plackett-Luce models over $m$ alternatives is generically identifiable if $k \le \lfloor\frac{m-2}{2}\rfloor!$. We also propose an efficient generalized method of moments (GMM) algorithm to learn the mixture of two Plackett-Luce models, and show that the algorithm is consistent. Our experiments show that our GMM algorithm is significantly faster than the EMM algorithm by Gormley & Murphy (2008), while achieving competitive statistical efficiency.

1. Introduction

In many machine learning problems the data are composed of rankings over a finite number of alternatives (Marden, 1995). For example, meta-search engines aggregate rankings over webpages from individual search engines (Dwork et al., 2001); rankings over documents are combined to find the most relevant document in information retrieval (Liu, 2011); noisy answers from online workers are aggregated to produce a more accurate answer in crowdsourcing (Mao et al., 2013). Rank data are also very common in economics and political science. For example, consumers often give discrete choice data (McFadden, 1974) and voters often give rankings over presidential candidates (Gormley & Murphy, 2008).

Perhaps the most commonly used statistical model for rank data is the Plackett-Luce model (Plackett, 1975; Luce, 1959). The Plackett-Luce model is a natural generalization of multinomial logistic regression. In a Plackett-Luce model, each alternative is parameterized by a positive number that represents its quality. The greater the quality, the greater the chance that the alternative is ranked at a higher position.

In practice, mixtures of Plackett-Luce models can provide better fitness than a single Plackett-Luce model. An additional benefit is that the learned parameter of a mixture model can naturally be used for clustering (McLachlan & Basford, 1988). The k-mixture of Plackett-Luce models combines k individual Plackett-Luce models via a vector of mixing coefficients. For example, Gormley & Murphy (2008) propose an Expectation Minorization-Maximization (EMM) algorithm to compute the MLE of Plackett-Luce mixture models. The EMM was applied to an Irish election dataset with 5 alternatives, and the four components in the mixture model are interpreted as voting blocs.

Surprisingly, the identifiability of Plackett-Luce mixture models is still unknown. Identifiability is an important property for statistical models, which requires that different parameters of the model have different distributions over samples.
Identifiability is crucial because if the model is not identifiable, then there are cases where it is impossible to estimate the parameter from the data, and in such cases conclusions drawn from the learned parameter can be wrong. In particular, if Plackett-Luce mixture models are not identifiable, then the voting blocs produced by the EMM algorithm of Gormley & Murphy (2008) can be dramatically different from the ground truth.

In this paper, we address the following two important questions about the theory and practice of Plackett-Luce mixture models for rank data.

Q1. Are Plackett-Luce mixture models identifiable?
Q2. How can we efficiently learn Plackett-Luce mixture models?

Q1 can be more complicated than one may think, because the non-identifiability of a mixture model usually comes from two sources. The first is label switching, which means that if we label the components of a mixture model differently, the distribution over samples does not change (Stephens, 2000). This can be avoided by ordering the components and merging identical components in the mixture model. The second is more fundamental: the mixture model may be non-identifiable even after ordering and merging duplicate components. Q1 is about the second type of non-identifiability.

The EMM algorithm by Gormley & Murphy (2008) converges to the MLE, but as we will see in the experiments, it can be very slow even when the data size is only moderately large. Therefore, to answer Q2, we want to design learning algorithms that are much faster than the EMM without sacrificing too much statistical efficiency, especially mean squared error (MSE).

1.1. Our Contributions

We answer Q1 with the following theorems. The answer depends on the number of components $k$ in the mixture model and the number of alternatives $m$.

Theorems 1 and 2. For any $m \ge 2$ and any $k \ge \frac{m+1}{2}$, the k-mixture Plackett-Luce model (denoted by k-PL) is non-identifiable. This lower bound on $k$ as a function of $m$ is tight for $k = 2$ ($m = 4$).

The second half of the theorem is positive: the mixture of two Plackett-Luce models is identifiable for four or more alternatives. We conjecture that the bound is tight for all $k > 2$.

We say that k-PL is generically identifiable for $m$ alternatives if the Lebesgue measure of its non-identifiable parameters is 0. We prove the following positive result for k-PL.

Theorem 3. For any $m \ge 6$ and any $k \le \lfloor\frac{m-2}{2}\rfloor!$, the k-PL is generically identifiable.

We note that $\lfloor\frac{m-2}{2}\rfloor!$ is exponentially larger than the bound $\frac{m+1}{2}$ for (strict) identifiability. One interpretation of the theorem is that, when $\lfloor\frac{m}{2}\rfloor + 1 \le k \le \lfloor\frac{m-2}{2}\rfloor!$, even though k-PL is not identifiable in the strict sense, one may not need to worry too much in practice due to generic identifiability.

For Q2, we propose a generalized method of moments (GMM) algorithm (Hansen, 1982) to learn the k-PL; this GMM should not be confused with Gaussian mixture models. We illustrate the algorithm for $k = 2$ and $m \ge 4$, and prove that the algorithm is consistent, which means that when the data are generated from k-PL and the data size $n$ goes to infinity, the algorithm will reveal the ground truth with probability that goes to 1.

We then compare our GMM algorithm and the EMM algorithm (Gormley & Murphy, 2008) w.r.t. statistical efficiency (mean squared error) and computational efficiency in synthetic experiments. As we will see in Section 5, our GMM algorithm is significantly faster than the EMM algorithm while achieving competitive statistical efficiency.
Therefore, we believe that our GMM algorithm is a promising candidate for learning Plackett-Luce mixture models from big rank data.

1.2. Related Work and Discussions

Most previous work on mixture models (especially Gaussian mixture models) focuses on cardinal data (Teicher, 1961; 1963; McLachlan & Peel, 2004; Kalai et al., 2012; Dasgupta, 1999). Little is known about the identifiability of mixtures of models for rank data. For rank data, Iannario (2010) proved the identifiability of the mixture of a shifted binomial model and the uniform model. Awasthi et al. (2014) proved the identifiability of mixtures of two Mallows models. Mallows mixture models were also studied by Lu & Boutilier (2014) and Chierichetti et al. (2015). Our paper, on the other hand, focuses on mixtures of Plackett-Luce models.

Technically, part of our (non-)identifiability proofs is motivated by the work of Teicher (1963), who obtained sufficient conditions for the identifiability of finite mixture models. However, these conditions cannot be directly applied to k-PL because they work either for finite families (Theorem 1 in (Teicher, 1963)) or for cardinal data (Theorem 2 in (Teicher, 1963)). Neither is the case for mixtures of Plackett-Luce models. To prove our (non-)identifiability theorems, we develop novel applications of the Fundamental Theorem of Algebra to analyze the rank of a matrix $F_m^k$ that represents k-PL (see Preliminaries for more details). Our proof of generic identifiability is based on a novel application of the tensor-decomposition approach, which analyzes the generic Kruskal rank of matrices, advocated by Allman et al. (2009).

In addition to being important in their own right, our (non-)identifiability theorems also carry a clear message that has been overlooked in the literature: when using Plackett-Luce mixture models to fit rank data, one must be very careful about the interpretation of the learned parameter. Specifically, when $m \le 2k-1$, it is necessary to double-check whether the learned parameter is identifiable (Theorem 1), which can be computationally hard. On the positive side, identifiability may not be a big concern in practice under a much milder condition ($k \le \lfloor\frac{m-2}{2}\rfloor!$, Theorem 3).

Gormley & Murphy (2008) used 4-PL to fit an Irish election dataset with 5 alternatives. According to our Theorem 1, 4-PL for 5 alternatives is non-identifiable. Moreover, our generic identifiability theorem (Theorem 3) does not apply because $m = 5 < 6$. Therefore, it is possible that there exists another set of voting blocs and mixing coefficients with the same likelihood as the output of the EMM algorithm. Whether this is the case or not, we believe that it is important to add discussions and justifications of the uniqueness of the voting blocs obtained by Gormley & Murphy (2008).

Parameter inference for single Plackett-Luce models is studied in (Cheng et al., 2010) and (Azari Soufiani et al., 2013). Azari Soufiani et al. (2013) proposed a GMM that is quite different from our method and cannot be applied to Plackett-Luce mixture models. The MM algorithm by Hunter (2004), which is compared in (Azari Soufiani et al., 2013), is also very different from the EMM that is compared in this paper.

2. Preliminaries

Let $A = \{a_i \mid i = 1, 2, \dots, m\}$ denote a set of $m$ alternatives. Let $\mathcal{L}(A)$ denote the set of linear orders (rankings) over $A$, which are transitive, antisymmetric, and total binary relations.
A ranking is often denoted by $a_{i_1} \succ a_{i_2} \succ \cdots \succ a_{i_m}$, which means that $a_{i_1}$ is the most preferred alternative, $a_{i_2}$ is the second most preferred, and $a_{i_m}$ is the least preferred. Let $P = (V_1, V_2, \dots, V_n)$ denote the data (also called a preference profile), where for all $j \le n$, $V_j \in \mathcal{L}(A)$.

Definition 1 (Plackett-Luce model). The parameter space is $\Theta = \{\vec\theta = (\theta_1, \dots, \theta_m) \mid \theta_i \in [0,1], \sum_{i=1}^m \theta_i = 1\}$. The sample space is $S = \mathcal{L}(A)^n$. Given a parameter $\vec\theta \in \Theta$, the probability of any ranking $V = a_{i_1} \succ a_{i_2} \succ \cdots \succ a_{i_m}$ is
$$\Pr\nolimits_{\mathrm{PL}}(V \mid \vec\theta) = \theta_{i_1} \cdot \frac{\theta_{i_2}}{\sum_{p>1}\theta_{i_p}} \cdots \frac{\theta_{i_{m-1}}}{\theta_{i_{m-1}}+\theta_{i_m}}.$$

We assume that data are generated i.i.d. in the Plackett-Luce model. Therefore, given a preference profile $P$ and $\vec\theta \in \Theta$, we have $\Pr_{\mathrm{PL}}(P \mid \vec\theta) = \prod_{j=1}^n \Pr_{\mathrm{PL}}(V_j \mid \vec\theta)$.

The Plackett-Luce model has the following intuitive explanation. Suppose there are $m$ balls, representing the $m$ alternatives, in an opaque bag. Each ball $a_i$ is assigned a quality value $\theta_i$. Then, we generate a ranking in $m$ stages. In each stage, we take one ball out of the bag. The probability of each remaining ball being taken out is the value assigned to it divided by the sum of the values assigned to the remaining balls. The order of drawing is the ranking over the alternatives. We require $\sum_i \theta_i = 1$ to normalize the parameter so that the Plackett-Luce model is identifiable. It is not hard to verify that for any Plackett-Luce model, the probability for the alternative $a_p$ ($p \le m$) to be ranked at the top of a ranking is $\theta_p$; the probability for $a_p$ to be ranked at the top and $a_q$ at the second position is $\frac{\theta_p\theta_q}{1-\theta_p}$, etc.

Definition 2 (k-mixture Plackett-Luce model). Given $m \ge 2$ and $k \ge 2$, we define the k-mixture Plackett-Luce model as follows. The sample space is $S = \mathcal{L}(A)^n$. The parameter space has two parts. The first part is the mixing coefficients $(\alpha_1, \dots, \alpha_k)$, where for all $r \le k$, $\alpha_r \ge 0$, and $\sum_{r=1}^k \alpha_r = 1$. The second part is $(\vec\theta^{(1)}, \vec\theta^{(2)}, \dots, \vec\theta^{(k)})$, where $\vec\theta^{(r)} = [\theta^{(r)}_1, \theta^{(r)}_2, \dots, \theta^{(r)}_m]$ is the parameter of the r-th Plackett-Luce component. The probability of a ranking $V$ is $\Pr_{k\text{-PL}}(V \mid \vec\theta) = \sum_{r=1}^k \alpha_r \Pr_{\mathrm{PL}}(V \mid \vec\theta^{(r)})$, where $\Pr_{\mathrm{PL}}(V \mid \vec\theta^{(r)})$ is the probability of $V$ in the r-th Plackett-Luce component given $\vec\theta^{(r)}$. For simplicity we use k-PL to denote the k-mixture Plackett-Luce model.

Definition 3 (Identifiability). Let $\mathcal{M} = \{\Pr(\cdot \mid \vec\theta) : \vec\theta \in \Theta\}$ be a statistical model. $\mathcal{M}$ is identifiable if for all $\vec\theta, \vec\gamma \in \Theta$ we have $\Pr(\cdot \mid \vec\theta) = \Pr(\cdot \mid \vec\gamma) \Rightarrow \vec\theta = \vec\gamma$.

In this paper, we slightly modify this definition to eliminate the label switching problem. We say that k-PL is identifiable if there do not exist (1) $1 \le k_1, k_2 \le k$ and non-degenerate $\vec\theta^{(1)}, \dots, \vec\theta^{(k_1)}, \vec\gamma^{(1)}, \dots, \vec\gamma^{(k_2)}$, which means that these $k_1 + k_2$ vectors are pairwise different, and (2) strictly positive mixing coefficients $(\alpha^{(1)}_1, \dots, \alpha^{(1)}_{k_1})$ and $(\alpha^{(2)}_1, \dots, \alpha^{(2)}_{k_2})$, such that for all rankings $V$ we have
$$\sum_{r=1}^{k_1} \alpha^{(1)}_r \Pr\nolimits_{\mathrm{PL}}(V \mid \vec\theta^{(r)}) = \sum_{r=1}^{k_2} \alpha^{(2)}_r \Pr\nolimits_{\mathrm{PL}}(V \mid \vec\gamma^{(r)}).$$

Throughout the paper, we represent the distribution over the $m!$ rankings of the $m$ alternatives induced by a Plackett-Luce component with parameter $\vec\theta$ as a column vector $f_m(\vec\theta)$ with $m!$ elements, one for each ranking, whose value is the probability of the corresponding ranking. For example, when $m = 3$, we have
$$f_3(\vec\theta) = \begin{bmatrix} \Pr(a_1 \succ a_2 \succ a_3 \mid \vec\theta) \\ \Pr(a_1 \succ a_3 \succ a_2 \mid \vec\theta) \\ \Pr(a_2 \succ a_1 \succ a_3 \mid \vec\theta) \\ \Pr(a_2 \succ a_3 \succ a_1 \mid \vec\theta) \\ \Pr(a_3 \succ a_1 \succ a_2 \mid \vec\theta) \\ \Pr(a_3 \succ a_2 \succ a_1 \mid \vec\theta) \end{bmatrix} = \begin{bmatrix} \theta_1\theta_2/(1-\theta_1) \\ \theta_1\theta_3/(1-\theta_1) \\ \theta_1\theta_2/(1-\theta_2) \\ \theta_2\theta_3/(1-\theta_2) \\ \theta_1\theta_3/(1-\theta_3) \\ \theta_2\theta_3/(1-\theta_3) \end{bmatrix}.$$

Given $\vec\theta^{(1)}, \dots, \vec\theta^{(2k)}$, we define $F_m^k$ as the $m! \times 2k$ matrix for k-PL with $m$ alternatives
$$F_m^k = \big[\, f_m(\vec\theta^{(1)})\ \ f_m(\vec\theta^{(2)})\ \cdots\ f_m(\vec\theta^{(2k)}) \,\big]. \qquad (1)$$
We note that $F_m^k$ is a function of $\vec\theta^{(1)}, \dots, \vec\theta^{(2k)}$, which are often omitted.
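To make Definition 1 and the matrix in (1) concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the names pl_prob, f_m, and F_matrix are ours) that computes the Plackett-Luce probability of a ranking and stacks the probabilities of all $m!$ rankings into the column vector $f_m(\vec\theta)$.

```python
from itertools import permutations
import numpy as np

def pl_prob(ranking, theta):
    """Plackett-Luce probability of `ranking` (0-based alternative indices,
    most preferred first) under the quality vector `theta` (summing to 1)."""
    remaining, prob = sum(theta), 1.0
    for i in ranking[:-1]:
        prob *= theta[i] / remaining      # sequential choice among the remaining alternatives
        remaining -= theta[i]
    return prob

def f_m(theta):
    """The m!-dimensional vector f_m(theta): one probability per ranking (lexicographic order)."""
    return np.array([pl_prob(V, theta) for V in permutations(range(len(theta)))])

def F_matrix(thetas):
    """The matrix in (1): one column f_m(theta^(r)) per component."""
    return np.column_stack([f_m(t) for t in thetas])

# Example with m = 3: the six entries match theta_i * theta_j / (1 - theta_i) from the text.
theta = [0.5, 0.3, 0.2]
print(f_m(theta), f_m(theta).sum())       # the probabilities sum to 1
```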
We prove the identifiability or non-identifiability of k-PL by analyzing the rank of $F_m^k$. The reason that we consider $2k$ components is that we want to find (or argue that we cannot find) another k-mixture model that has the same distribution as the original one.

3. Identifiability of Plackett-Luce Mixture Models

We first prove a general lemma that reveals a relationship between the rank of $F_m^k$ and the identifiability of Plackett-Luce mixture models. We recall that a set of vectors is non-degenerate if its elements are pairwise different.

Lemma 1. If the rank of $F_m^k$ is $2k$ for all non-degenerate $\vec\theta^{(1)}, \dots, \vec\theta^{(2k)}$, then k-PL is identifiable. Otherwise, $(2k-1)$-PL is non-identifiable.

Proof: Suppose for the sake of contradiction that the rank of $F_m^k$ is $2k$ for all non-degenerate $\vec\theta^{(1)}, \dots, \vec\theta^{(2k)}$ but k-PL is non-identifiable. Then there exist non-degenerate $\vec\theta^{(1)}, \dots, \vec\theta^{(k_1)}, \vec\gamma^{(1)}, \dots, \vec\gamma^{(k_2)}$ and strictly positive mixing coefficients $(\alpha^{(1)}_1, \dots, \alpha^{(1)}_{k_1})$ and $(\alpha^{(2)}_1, \dots, \alpha^{(2)}_{k_2})$ such that for all rankings $V$ we have $\sum_{r=1}^{k_1} \alpha^{(1)}_r \Pr_{\mathrm{PL}}(V \mid \vec\theta^{(r)}) = \sum_{r=1}^{k_2} \alpha^{(2)}_r \Pr_{\mathrm{PL}}(V \mid \vec\gamma^{(r)})$. Let $\vec\delta^{(1)}, \dots, \vec\delta^{(2k-(k_1+k_2))}$ denote any $2k - (k_1+k_2)$ vectors such that $\{\vec\theta^{(1)}, \dots, \vec\theta^{(k_1)}, \vec\gamma^{(1)}, \dots, \vec\gamma^{(k_2)}, \vec\delta^{(1)}, \dots, \vec\delta^{(2k-(k_1+k_2))}\}$ is non-degenerate. It follows that the rank of the corresponding $F_m^k$ is strictly smaller than $2k$, because for all $V$,
$$\sum_{r=1}^{k_1} \alpha^{(1)}_r \Pr\nolimits_{\mathrm{PL}}(V \mid \vec\theta^{(r)}) - \sum_{r=1}^{k_2} \alpha^{(2)}_r \Pr\nolimits_{\mathrm{PL}}(V \mid \vec\gamma^{(r)}) + \sum_{r=1}^{2k-k_1-k_2} 0 \cdot \Pr\nolimits_{\mathrm{PL}}(V \mid \vec\delta^{(r)}) = 0,$$
i.e., the columns of $F_m^k$ admit a nonzero linear combination equal to zero. This is a contradiction.

On the other hand, if $\mathrm{rank}(F_m^k) < 2k$ for some non-degenerate $\vec\theta$'s, then there exists a nonzero vector $\vec\alpha = [\alpha_1, \alpha_2, \dots, \alpha_{2k}]$ such that $F_m^k \vec\alpha = 0$. Suppose $\vec\alpha$ has $k_1$ positive elements and $k_2$ negative elements; then the $\max\{k_1, k_2\}$-mixture model is not identifiable, and $\max\{k_1, k_2\} \le 2k - 1$.

Theorem 1. For any $m \ge 2$ and any $k \ge \frac{m+1}{2}$, the k-PL is non-identifiable.

Proof sketch: The proof is constructive and is based on a refinement of the second half of Lemma 1. For any $k$ and $m = 2k-1$, we will define $\vec\theta^{(1)}, \dots, \vec\theta^{(2k)}$ and $\vec\alpha = [\alpha_1, \dots, \alpha_{2k}]^T$ such that (1) $F_m^k \vec\alpha = 0$ and (2) $\vec\alpha$ has $k$ positive elements and $k$ negative elements. In each $\vec\theta^{(r)}$, the values for the alternatives $\{a_2, \dots, a_m\}$ are the same. The proof for any $m < 2k-1$ is similar.

Formally, let $m = 2k-1$. For all $i \ge 2$ and $r \le 2k$, we let $\theta^{(r)}_i = \frac{1-\theta^{(r)}_1}{2k-2}$, where $\theta^{(r)}_i$ is the parameter corresponding to the i-th alternative of the r-th component. We write $e_r$ for $\theta^{(r)}_1$ and $b_r$ for $\frac{1-\theta^{(r)}_1}{2k-2}$. It is not hard to check that the probability for $a_1$ to be ranked at the i-th position in the r-th Plackett-Luce component is
$$\frac{(2k-2)!}{(2k-1-i)!} \cdot \frac{e_r\, (b_r)^{i-1}}{\prod_{p=0}^{i-1}(1 - p\, b_r)}. \qquad (2)$$

Because all rankings that place $a_1$ at the same position have equal probability under such a $\vec\theta^{(r)}$, $F_m^k$ can be reduced to a $(2k-1) \times 2k$ matrix. Because $\mathrm{rank}(F_m^k) \le 2k-1 < 2k$, Lemma 1 immediately tells us that $(2k-1)$-PL is non-identifiable for $2k-1$ alternatives, but this is much weaker than what we are proving in this theorem.

We now define a new $(2k-1) \times 2k$ matrix $H_k$, obtained from $F_m^k$ by performing the following linear operations on row vectors: (i) make the first row of $H_k$ all ones; (ii) for any $2 \le i \le 2k-1$, let the i-th row of $H_k$ be the probability for $a_1$ to be ranked at the $(i-1)$-th position according to (2); (iii) remove all constant factors. More precisely, for any $e_r$ we define the following vector-valued function:
$$\vec f(e_r) = \Big[\,1,\ \ e_r,\ \ \frac{e_r(1-e_r)}{e_r+2k-3},\ \ \dots,\ \ \frac{e_r(1-e_r)^{2k-3}}{(e_r+2k-3)(2e_r+2k-4)\cdots((2k-3)e_r+1)}\,\Big]^T,$$
whose i-th entry, for $i \ge 2$, is $e_r(1-e_r)^{i-2}\big/\prod_{p=1}^{i-2}(p\,e_r+2k-2-p)$. Then we define $H_k = [\vec f(e_1), \vec f(e_2), \dots, \vec f(e_{2k})]$.
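Before stating the key lemma, here is a numerical sanity check of the construction above (our own illustration, not part of the paper): for $k = 2$ and $m = 2k-1 = 3$, it builds $F_3^2$ with the parameters of Theorem 1, confirms that it is rank-deficient, and splits a null vector by sign into two different 2-component mixtures with the same distribution over rankings.

```python
from itertools import permutations
import numpy as np

def pl_prob(ranking, theta):
    remaining, prob = sum(theta), 1.0
    for i in ranking[:-1]:
        prob *= theta[i] / remaining
        remaining -= theta[i]
    return prob

def f_m(theta):
    return np.array([pl_prob(V, theta) for V in permutations(range(len(theta)))])

# Theorem 1 construction for k = 2, m = 3: a_1 gets weight e_r and the other
# alternatives share the remaining mass equally.
e = np.array([0.1, 0.3, 0.6, 0.8])
thetas = [np.array([er, (1 - er) / 2, (1 - er) / 2]) for er in e]
F = np.column_stack([f_m(t) for t in thetas])         # F^2_3 is 6 x 4

print(np.linalg.matrix_rank(F))                       # 3 < 2k = 4, so Lemma 1 applies
alpha = np.linalg.svd(F)[2][-1]                        # a null vector: F @ alpha ~ 0
pos, neg = np.clip(alpha, 0, None), np.clip(-alpha, 0, None)

# The positive and negative parts define two different 2-component mixtures with
# identical distributions over all 6 rankings: non-identifiability for m = 3.
print(np.allclose(F @ (pos / pos.sum()), F @ (neg / neg.sum())))
```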
Lemma 2. If there exist pairwise different $e_1, e_2, \dots, e_{2k} < 1$ and a nonzero vector $\vec\beta = [\beta_1, \beta_2, \dots, \beta_{2k}]$ such that (i) $H_k \vec\beta = 0$ and (ii) $\vec\beta$ has $k$ positive elements and $k$ negative elements, then k-PL for $2k-1$ alternatives is not identifiable.

The theorem is then proved by showing that the following $\vec\beta$ satisfies the conditions in Lemma 2: for any $r \le 2k$,
$$\beta_r = \frac{\prod_{p=1}^{2k-3}(p\,e_r + 2k-2-p)}{\prod_{q \ne r}(e_r - e_q)}. \qquad (3)$$
Note that the numerator is always positive.

Theorem 2. For $k = 2$ and any $m \ge 4$, the 2-PL is identifiable.

Proof sketch: We will apply Lemma 1 to prove the theorem. That is, we will show that $\mathrm{rank}(F_4^2) = 4$ for all non-degenerate $\vec\theta^{(1)}, \vec\theta^{(2)}, \vec\theta^{(3)}, \vec\theta^{(4)}$. We recall that $F_4^2$ is a $24 \times 4$ matrix. Instead of proving $\mathrm{rank}(F_4^2) = 4$ directly, we will first obtain a $4 \times 4$ matrix $F' = T F_4^2$ by linearly combining some row vectors of $F_4^2$ via a $4 \times 24$ matrix $T$. Then we show that $\mathrm{rank}(F') = 4$, which implies that $\mathrm{rank}(F_4^2) = 4$.

For simplicity we use $[e_r, b_r, c_r, d_r]$ to denote the parameter of the r-th Plackett-Luce component for $a_1, a_2, a_3, a_4$ respectively. Namely,
$$\big[\vec\theta^{(1)}\ \vec\theta^{(2)}\ \vec\theta^{(3)}\ \vec\theta^{(4)}\big] = \begin{bmatrix} e_1 & e_2 & e_3 & e_4 \\ b_1 & b_2 & b_3 & b_4 \\ c_1 & c_2 & c_3 & c_4 \\ d_1 & d_2 & d_3 & d_4 \end{bmatrix},$$
where for each $i \le 4$, $\vec\omega^{(i)}$ denotes the i-th row of this matrix. We further let $\vec 1 = [1, 1, 1, 1]$. Clearly we have $\sum_{i=1}^4 \vec\omega^{(i)} = \vec 1$. Therefore, if there exist three $\vec\omega$'s, for example $\{\vec\omega^{(1)}, \vec\omega^{(2)}, \vec\omega^{(3)}\}$, such that $\vec\omega^{(1)}, \vec\omega^{(2)}, \vec\omega^{(3)}$ and $\vec 1$ are linearly independent, then $\mathrm{rank}(F_4^2) = 4$, because each $\vec\omega^{(i)}$ gives the probabilities of $a_i$ being ranked at the top, which means that $\vec\omega^{(i)}$ is a linear combination of rows of $F_4^2$ (and $\vec 1$ is the sum of all rows of $F_4^2$).

Because $\vec\theta^{(1)}, \vec\theta^{(2)}, \vec\theta^{(3)}, \vec\theta^{(4)}$ are non-degenerate, at least one of $\{\vec\omega^{(1)}, \vec\omega^{(2)}, \vec\omega^{(3)}, \vec\omega^{(4)}\}$ is linearly independent of $\vec 1$. W.l.o.g. suppose $\vec\omega^{(1)}$ is linearly independent of $\vec 1$. This means that not all of $e_1, e_2, e_3, e_4$ are equal. The theorem is proved in the following two cases.

Case 1. $\vec\omega^{(2)}$, $\vec\omega^{(3)}$, and $\vec\omega^{(4)}$ are all linear combinations of $\vec 1$ and $\vec\omega^{(1)}$.
Case 2. There exists an $\vec\omega^{(i)}$ (where $i \in \{2, 3, 4\}$) that is linearly independent of $\vec 1$ and $\vec\omega^{(1)}$.

We will only show the proof for a subcase of Case 1 to illustrate the main idea. The full proof is quite involved and can be found in the full version on arXiv.

In Case 1, for all $i = 2, 3, 4$ we can write $\vec\omega^{(i)} = p_i \vec\omega^{(1)} + q_i \vec 1$ for some constants $p_i, q_i$. Because $\vec\omega^{(1)} + \vec\omega^{(2)} + \vec\omega^{(3)} + \vec\omega^{(4)} = \vec 1$, we have $p_2 + p_3 + p_4 = -1$ and $q_2 + q_3 + q_4 = 1$. In this case, for each $r \le 4$ the r-th column of $F_4^2$, which is $f_4(\vec\theta^{(r)})$, is a function of $e_r$. Because the $\vec\theta$'s are non-degenerate, $e_1, e_2, e_3, e_4$ must be pairwise different. We show the proof for the following subcase of Case 1.

Case 1.1: $p_2 + q_2 \ne 0$ and $p_2 + q_2 \ne 1$.

For this case we first define a $4 \times 4$ matrix $\hat F$ as in Table 1. We use $\vec 1$ and $\vec\omega^{(1)}$ to denote the first two rows of $\hat F$; $\vec\omega^{(1)}$ corresponds to the probability that $a_1$ is ranked at the top. We call such a probability a moment. Each moment is the sum of the probabilities of some rankings. For example, the $[a_1 \succ \text{others}]$ moment is the total probability of $\{V \in \mathcal{L}(A) : a_1 \text{ is ranked at the top of } V\}$.

Table 1. The matrix $\hat F$:
$$\hat F = \begin{bmatrix} 1 & 1 & 1 & 1 \\ e_1 & e_2 & e_3 & e_4 \\ \frac{e_1 b_1}{1-b_1} & \frac{e_2 b_2}{1-b_2} & \frac{e_3 b_3}{1-b_3} & \frac{e_4 b_4}{1-b_4} \\ \frac{e_1 b_1}{1-e_1} & \frac{e_2 b_2}{1-e_2} & \frac{e_3 b_3}{1-e_3} & \frac{e_4 b_4}{1-e_4} \end{bmatrix},$$
whose rows correspond to the moments $1$, $[a_1 \succ \text{others}]$, $[a_2 \succ a_1 \succ \text{others}]$, and $[a_1 \succ a_2 \succ \text{others}]$, respectively.

It follows that there exists a $4 \times 24$ matrix $\hat T$ such that $\hat F = \hat T F_4^2$. Define $\vec\theta^{(b)} = \big[\frac{1}{1-b_1}, \frac{1}{1-b_2}, \frac{1}{1-b_3}, \frac{1}{1-b_4}\big]$, where $b_i = p_2 e_i + q_2$. We then define $\vec\theta^{(e)} = \big[\frac{1}{1-e_1}, \frac{1}{1-e_2}, \frac{1}{1-e_3}, \frac{1}{1-e_4}\big]$, and let
$$F' = \begin{bmatrix} \vec 1 \\ \vec\omega^{(1)} \\ \vec\theta^{(b)} \\ \vec\theta^{(e)} \end{bmatrix}.$$
It can be verified (by expressing each row of $\hat F$ as a linear combination of the rows of $F'$) that $\hat F = T' F'$, where
$$T' = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ -\tfrac{1}{p_2} & -1 & \tfrac{1-q_2}{p_2} & 0 \\ -(p_2+q_2) & -p_2 & 0 & p_2+q_2 \end{bmatrix}.$$
Because Case 1.1 assumes that $p_2 + q_2 \ne 0$, and we can select $a_2$ such that $p_2 \ne 0$ and $q_2 \ne 1$, we have that $T'$ is invertible. Therefore, $F' = (T')^{-1} \hat F$, which means that $F' = T F_4^2$ for some $4 \times 24$ matrix $T$.

We now prove that $\mathrm{rank}(F') = 4$. For the sake of contradiction, suppose that $\mathrm{rank}(F') < 4$. It follows that there exists a nonzero row vector $\vec t = [t_1, t_2, t_3, t_4]$ such that $\vec t\, F' = 0$. This means that for all $r \le 4$,
$$t_1 + t_2 e_r + \frac{t_3}{1 - p_2 e_r - q_2} + \frac{t_4}{1 - e_r} = 0.$$
Let $f(x) = t_1 + t_2 x + \frac{t_3}{1 - p_2 x - q_2} + \frac{t_4}{1 - x}$ and let $g(x) = (1 - p_2 x - q_2)(1 - x) f(x)$. We recall that $e_1, e_2, e_3, e_4$ are four roots of $f(x)$, which means that they are also four roots of $g(x)$. Because in Case 1.1 we assume that $p_2 + q_2 \ne 1$, it can be verified that not all coefficients of $g(x)$ are zero. We note that the degree of $g(x)$ is at most 3. Therefore, by the Fundamental Theorem of Algebra, $g(x)$ has at most three different roots. This means that $e_1, e_2, e_3, e_4$ are not pairwise different, which is a contradiction. Therefore, $\mathrm{rank}(F') = 4$, which means that $\mathrm{rank}(F_4^2) = 4$. This finishes the proof for Case 1.1. All missing proofs can be found in the full version on arXiv.

Slightly abusing notation, we say that a parameter of k-PL is identifiable if there does not exist a different parameter, modulo label switching, with the same probability distribution over the sample space. The next theorem proves that the Lebesgue measure (in the $(km-1)$-dimensional Euclidean space) of non-identifiable parameters of k-PL for $m$ alternatives is 0 (generic identifiability as defined in Section 1.1).

Theorem 3. For $1 \le k \le \lfloor\frac{m-2}{2}\rfloor!$, k-PL over $m \ge 6$ alternatives is generically identifiable.

Proof: The theorem is proved by analyzing the uniqueness of a tensor decomposition. We construct a rank-one tensor for each Plackett-Luce component. Then the k-mixture model can be represented by another tensor, which is the weighted sum of $k$ rank-one tensors. If the tensor decomposition is unique, then k-PL is identifiable.

To construct the rank-one tensor $T_r$ for the r-th Plackett-Luce component, we partition the set of alternatives into three sets. In the rest of the proof we assume that $m$ is even; the theorem can be proved similarly for odd $m$. Let
$$S_A = \{a_1, a_2, \dots, a_{\frac{m-2}{2}}\}, \quad S_B = \{a_{\frac{m-2}{2}+1}, \dots, a_{m-2}\}, \quad S_C = \{a_{m-1}, a_m\}.$$
There are $n_1 = n_2 = \big(\frac{m-2}{2}\big)!$ rankings over $S_A$ and $S_B$ respectively, and two rankings over $S_C$ ($n_3 = 2$). Let the three coordinates of the tensor $T_r$ for the r-th Plackett-Luce component (with parameter $\vec\theta^{(r)}$) be $\vec p^{(r)}_A, \vec p^{(r)}_B, \vec p^{(r)}_C$, which represent the probabilities of all rankings within $S_A$, $S_B$, $S_C$ respectively. Then, for any rankings $V_A \in \mathcal{L}(S_A)$, $V_B \in \mathcal{L}(S_B)$, and $V_C \in \mathcal{L}(S_C)$, we can prove that $\Pr_{\mathrm{PL}}(V_A, V_B, V_C \mid \vec\theta^{(r)}) = \Pr_{\mathrm{PL}}(V_A \mid \vec\theta^{(r)}) \Pr_{\mathrm{PL}}(V_B \mid \vec\theta^{(r)}) \Pr_{\mathrm{PL}}(V_C \mid \vec\theta^{(r)})$. That is, $V_A$, $V_B$ and $V_C$ are independent given $\vec\theta^{(r)}$. We prove this result for a more general class of models called random utility models (RUMs), of which the Plackett-Luce model is a special case (Thurstone, 1927).

Lemma 3. Given a random utility model $\mathcal{M}(\vec\theta)$ over a set of $m$ alternatives $A$, let $A_1, A_2$ be two non-overlapping subsets of $A$, namely $A_1, A_2 \subseteq A$ and $A_1 \cap A_2 = \emptyset$. Let $V_1, V_2$ be rankings over $A_1$ and $A_2$, respectively. Then we have $\Pr(V_1, V_2 \mid \vec\theta) = \Pr(V_1 \mid \vec\theta) \Pr(V_2 \mid \vec\theta)$.

Because $S_A$, $S_B$, and $S_C$ are non-overlapping, it follows that $T_r = \vec p^{(r)}_A \otimes \vec p^{(r)}_B \otimes \vec p^{(r)}_C$. Because $k \le \big(\frac{m-2}{2}\big)!$, we have $\min\{k, |\mathcal{L}(S_A)|\} + \min\{k, |\mathcal{L}(S_B)|\} + \min\{k, |\mathcal{L}(S_C)|\} = 2k + 2$. By Corollary 3 in (Allman et al., 2009), k-PL is generically identifiable.
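Before restating the condition of Allman et al. (2009) used above, here is a small numerical check (our own illustration; the helper names are ours) of the independence property in Lemma 3 for the Plackett-Luce model, obtained by brute-force marginalization over all full rankings of two disjoint subsets of alternatives.

```python
from itertools import permutations
import numpy as np

def pl_prob(ranking, theta):
    remaining, prob = sum(theta[i] for i in ranking), 1.0
    for i in ranking[:-1]:
        prob *= theta[i] / remaining
        remaining -= theta[i]
    return prob

def restrict(V, sub):
    """The ranking induced on the subset `sub` by the full ranking V."""
    return tuple(x for x in V if x in sub)

def marginal(theta, sub, m):
    """Distribution of the induced ranking of `sub`, by summing over all full rankings."""
    return {order: sum(pl_prob(V, theta) for V in permutations(range(m))
                       if restrict(V, sub) == order)
            for order in permutations(sub)}

m = 6
theta = np.random.dirichlet(np.ones(m))
S_A, S_B = (0, 1, 2), (3, 4)                  # two disjoint subsets of alternatives
pA, pB = marginal(theta, S_A, m), marginal(theta, S_B, m)

for VA in permutations(S_A):
    for VB in permutations(S_B):
        joint = sum(pl_prob(V, theta) for V in permutations(range(m))
                    if restrict(V, S_A) == VA and restrict(V, S_B) == VB)
        assert np.isclose(joint, pA[VA] * pB[VB])   # Lemma 3: the sub-rankings are independent
print("independence verified")
```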
For completeness we include Corollary 3 here. Let $M(k; n_1, n_2, n_3)$ be a k-mixture, 3-feature statistical model, where $n_1, n_2, n_3$ are the cardinalities of the three sets of events we defined. The parameters of the model $M(k; n_1, n_2, n_3)$ are generically identifiable, up to label switching, provided $\min(k, n_1) + \min(k, n_2) + \min(k, n_3) \ge 2k + 2$. Since $n_1 = n_2 = \big(\frac{m-2}{2}\big)!$ and $n_3 = 2$, this condition holds.

4. A Generalized Method of Moments Algorithm for 2-PL

In a generalized method of moments (GMM) algorithm, a set of $q \ge 1$ moment conditions $g(V, \vec\theta)$ is specified. Moment conditions are functions of the parameter and the data whose expectations are zero at the ground truth. $g(V, \vec\theta) \in \mathbb{R}^q$ has two inputs: a data point $V$ and a parameter $\vec\theta$. For any $\vec\theta^*$, the expectation of any moment condition should be zero at $\vec\theta^*$ when the data are generated from the model given $\vec\theta^*$; formally, $\mathbb{E}[g(V, \vec\theta^*)] = 0$. In practice, the observed moment values should match the theoretical values from the model. In our algorithm, each moment condition corresponds to an event in the data, e.g. "$a_1$ is ranked at the top". We use "moments" to denote such events. Given any preference profile $P$, we let $g(P, \vec\theta) = \frac{1}{n}\sum_{V \in P} g(V, \vec\theta)$, which is a function of $\vec\theta$. The GMM algorithm we use then computes the parameter that minimizes the 2-norm of the empirical moment conditions:
$$\mathrm{GMM}_g(P) = \arg\min_{\vec\theta} \|g(P, \vec\theta)\|_2^2. \qquad (4)$$

In this paper, we show results for $m = 4$ and $k = 2$. Our GMM works for other combinations of $k$ and $m$ if the model is identifiable; otherwise the estimator is not consistent. For $m = 4$ and $k = 2$, the parameter of the 2-PL is $\vec\theta = (\alpha, \vec\theta^{(1)}, \vec\theta^{(2)})$. We use the following $q = 20$ moments from three categories.

(i) There are four moments, one for each of the four alternatives to be ranked at the top. Let $\{g_i : i \le 4\}$ denote the four moment conditions. Let $p_i = \alpha\theta^{(1)}_i + (1-\alpha)\theta^{(2)}_i$. For any $V \in \mathcal{L}(A)$, we have $g_i(V, \vec\theta) = 1 - p_i$ if and only if $a_i$ is ranked at the top of $V$; otherwise $g_i(V, \vec\theta) = -p_i$.

(ii) There are 12 moments, one for each ordered pair of top-2 alternatives in a ranking. Let $\{g_{i_1 i_2} : i_1 \ne i_2 \le 4\}$ denote the 12 moment conditions. Let $p_{i_1 i_2} = \alpha\frac{\theta^{(1)}_{i_1}\theta^{(1)}_{i_2}}{1-\theta^{(1)}_{i_1}} + (1-\alpha)\frac{\theta^{(2)}_{i_1}\theta^{(2)}_{i_2}}{1-\theta^{(2)}_{i_1}}$. For any $V \in \mathcal{L}(A)$, we have $g_{i_1 i_2}(V, \vec\theta) = 1 - p_{i_1 i_2}$ if and only if $a_{i_1}$ is ranked at the top and $a_{i_2}$ is ranked second in $V$; otherwise $g_{i_1 i_2}(V, \vec\theta) = -p_{i_1 i_2}$.

(iii) There are four moments that correspond to the following four rankings: $a_1 \succ a_2 \succ a_3 \succ a_4$, $a_2 \succ a_3 \succ a_4 \succ a_1$, $a_3 \succ a_4 \succ a_1 \succ a_2$, and $a_4 \succ a_1 \succ a_2 \succ a_3$. The corresponding $g_{i_1 i_2 i_3 i_4}$'s are defined similarly.

To illustrate how GMM works, we give the following example. Suppose the data $P$ contain the following 20 rankings:
8 @ $a_1 \succ a_2 \succ a_3 \succ a_4$
4 @ $a_2 \succ a_3 \succ a_4 \succ a_1$
3 @ $a_3 \succ a_4 \succ a_1 \succ a_2$
3 @ $a_4 \succ a_1 \succ a_2 \succ a_3$
2 @ $a_3 \succ a_1 \succ a_4 \succ a_2$

Then, for example,
$$g_1(P, \vec\theta) = \tfrac{8}{20} - \big(\alpha\theta^{(1)}_1 + (1-\alpha)\theta^{(2)}_1\big),$$
corresponding to the moment that $a_1$ is ranked at the top (category i);
$$g_{34}(P, \vec\theta) = \tfrac{3}{20} - \Big(\alpha\tfrac{\theta^{(1)}_3\theta^{(1)}_4}{\theta^{(1)}_1+\theta^{(1)}_2+\theta^{(1)}_4} + (1-\alpha)\tfrac{\theta^{(2)}_3\theta^{(2)}_4}{\theta^{(2)}_1+\theta^{(2)}_2+\theta^{(2)}_4}\Big),$$
corresponding to the moment of $a_3 \succ a_4 \succ$ others (category ii); and
$$g_{2341}(P, \vec\theta) = \tfrac{4}{20} - \Big(\alpha\tfrac{\theta^{(1)}_2\theta^{(1)}_3\theta^{(1)}_4}{(\theta^{(1)}_3+\theta^{(1)}_4+\theta^{(1)}_1)(\theta^{(1)}_4+\theta^{(1)}_1)} + (1-\alpha)\tfrac{\theta^{(2)}_2\theta^{(2)}_3\theta^{(2)}_4}{(\theta^{(2)}_3+\theta^{(2)}_4+\theta^{(2)}_1)(\theta^{(2)}_4+\theta^{(2)}_1)}\Big),$$
corresponding to the moment of $a_2 \succ a_3 \succ a_4 \succ a_1$ (category iii).
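To make the estimator (4) concrete for the worked example above, here is a minimal sketch (our own illustration, not the authors' implementation; Section 5 notes that the paper's code minimizes (4) with MATLAB's fmincon, whereas this sketch substitutes scipy.optimize.minimize, and all helper names are ours). The two equality constraints encode the normalization of each component.

```python
import numpy as np
from scipy.optimize import minimize

def pl_prefix_prob(prefix, theta):
    """Probability that a Plackett-Luce ranking starts with the given prefix."""
    remaining, prob = float(np.sum(theta)), 1.0
    for i in prefix:
        prob *= theta[i] / remaining
        remaining -= theta[i]
    return prob

# The 20 moments of the text: 4 top-1 events, 12 ordered top-2 events, 4 full rankings.
EVENTS = ([(i,) for i in range(4)]
          + [(i, j) for i in range(4) for j in range(4) if i != j]
          + [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2)])

def model_moments(x):
    """Theoretical event probabilities under x = (alpha, theta^(1), theta^(2))."""
    alpha, t1, t2 = x[0], x[1:5], x[5:9]
    return np.array([alpha * pl_prefix_prob(e, t1) + (1 - alpha) * pl_prefix_prob(e, t2)
                     for e in EVENTS])

def empirical_moments(rankings):
    """Observed frequencies of the same 20 events in the data."""
    n = len(rankings)
    return np.array([sum(V[:len(e)] == e for V in rankings) / n for e in EVENTS])

def objective(x, freq):
    return np.sum((freq - model_moments(x)) ** 2)       # ||g(P, theta)||_2^2, as in (4)

# The worked example above, with 0-based alternative indices.
data = ([(0, 1, 2, 3)] * 8 + [(1, 2, 3, 0)] * 4 + [(2, 3, 0, 1)] * 3
        + [(3, 0, 1, 2)] * 3 + [(2, 0, 3, 1)] * 2)
freq = empirical_moments(data)

x0 = np.array([0.5] + [0.25] * 8)                       # uniform starting point
cons = [{'type': 'eq', 'fun': lambda x: np.sum(x[1:5]) - 1.0},
        {'type': 'eq', 'fun': lambda x: np.sum(x[5:9]) - 1.0}]
res = minimize(objective, x0, args=(freq,), bounds=[(1e-3, 1.0)] * 9, constraints=cons)
print(res.x[0], res.x[1:5], res.x[5:9])                 # estimated alpha, theta^(1), theta^(2)
```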
Recall that $\sum_{i=1}^4 \theta^{(1)}_i = \sum_{i=1}^4 \theta^{(2)}_i = 1$. The choice of these moment conditions is based on the proof of Theorem 2, so that the 2-PL is strictly identifiable w.r.t. these moment conditions. Our simple GMM algorithm is therefore the following.

Algorithm 1: GMM for 2-PL
Input: Preference profile P with n full rankings.
Compute the frequency of each of the 20 moments.
Compute the output according to (4).

The theoretical guarantee of our GMM is its consistency, as defined in Section 1.1.

Theorem 4. Algorithm 1 is consistent w.r.t. 2-PL, where there exists $\epsilon > 0$ such that each parameter is in $[\epsilon, 1]$.

Originally each parameter lies in the half-open interval $(0, 1]$. The $\epsilon$ requirement in the theorem is introduced to make the parameter space compact, i.e., all parameters are chosen from closed intervals. The proof is done by applying Theorem 3.1 in (Hall, 2005). The main hardness is the identifiability of 2-PL w.r.t. the moment conditions used in our GMM. Our proof of the identifiability of 2-PL (Theorem 2) only uses the 20 moment conditions described above. (In fact our proof only uses 16 of them: 4 out of the 12 moment conditions in category (ii) are redundant. However, our synthetic experiments show that using 20 moments improves statistical efficiency without sacrificing too much computational efficiency.)

Complexity of GMM. For learning k-PL with m alternatives and n rankings with EMM, each E-step performs $O(nk^2)$ operations and each iteration of the MM algorithm for the M-step performs $O(m^2 n k)$ operations. Our GMM for $k = 2$ and $m = 4$ has overall complexity $O(n)$: the complexity of calculating the moments is $O(n)$, and the complexity of the optimization depends only on $m$ and $k$.

5. Experiments

The performance of our GMM algorithm (Algorithm 1) is compared to the EMM algorithm (Gormley & Murphy, 2008) for 2-PL with respect to running time and statistical efficiency on synthetic data. The synthetic datasets are generated as follows.

Generating the ground truth: for $k = 2$ mixtures and $m = 4$ alternatives, the mixing coefficient $\alpha$ is generated uniformly at random and the Plackett-Luce components $\vec\theta^{(1)}$ and $\vec\theta^{(2)}$ are each generated from the Dirichlet distribution $\mathrm{Dir}(\vec 1)$.

Generating data: given a ground truth $\vec\theta^*$, we generate each ranking with probability $\alpha$ from the PL model parameterized by $\vec\theta^{(1)}$ and with probability $1-\alpha$ from the PL model parameterized by $\vec\theta^{(2)}$, up to 45,000 full rankings.

The GMM algorithm is implemented in Python 3.4, and the termination criteria for the optimization are convergence of the solution and of the objective function values to within $10^{-10}$ and $10^{-6}$ respectively. The optimization of (4) uses the fmincon function through the MATLAB Engine for Python. The EMM algorithm is also implemented in Python 3.4, and the E and M steps are repeated together for a fixed number of iterations without convergence criteria. The running time of EMM is largely determined by the total number of iterations of the MM algorithm, and we look to compare the best EMM configuration with running time comparable to GMM. We have tested all configurations of EMM with 10 and 20 overall MM iterations, respectively.
We found that the optimal configurations are EMM-10-1 and EMM-20-1 (shown in Figure 1; results for other configurations are omitted), where EMM-20-1 means 20 iterations of the E step, each of which uses 1 MM iteration. We use the mean squared error (MSE) as the measure of statistical efficiency, defined as $\mathrm{MSE} = \mathbb{E}\big(\|\hat{\vec\theta} - \vec\theta^*\|_2^2\big)$, where $\hat{\vec\theta}$ is the learned parameter. All experiments are run on an Ubuntu Linux server with Intel Xeon E5 v3 CPUs, each clocked at 3.50 GHz.

[Figure 1. The MSE and running time of GMM and EMM. EMM-20-1 is 20 iterations of EMM overall with 1 iteration of MM for each M step; likewise, EMM-10-1 is 10 iterations of EMM with 1 iteration of MM each time. Values are calculated over 2000 datasets.]

Results. The comparison of the performance of the GMM algorithm to the EMM algorithm is presented in Figure 1 for up to n = 2000 rankings. Statistics are calculated over 2000 trials (datasets). We observe that:
- GMM is significantly faster than EMM, and the running time of GMM grows at a much slower rate in n.
- GMM achieves competitive MSE, and the MSE of EMM does not improve with n. In particular, when n is more than 1000, GMM achieves smaller MSE.

The implication is that GMM may be better suited for reasonably large datasets, where the running time of EMM becomes infeasibly large. Moreover, it is possible that the GMM algorithm can be further improved by using a more accurate optimizer or another set of moment conditions. GMM can also be used to provide a good initial point for other methods such as the EMM.

[Figure 2. The MSE and running time of GMM. GMM-Moments is the time to calculate the moment condition values observed in the data and GMM-Opt is the time to perform the optimization. Values are calculated over 50,000 trials.]

For larger datasets, the performance of the GMM algorithm is shown in Figure 2 for up to n = 45,000 rankings, calculated over 50,000 trials. As the data size increases, GMM converges toward the ground truth, which verifies our consistency theorem (Theorem 4). The overall running time of GMM shown in the figure comprises the time to calculate the moments from the data (GMM-Moments) and the time to optimize the objective function (GMM-Opt). The time for calculating the moment values increases linearly in n, but it is dominated by the time to perform the optimization.

6. Future Work

There are many directions for future research. An obvious open question is whether k-PL is identifiable for $2k$ alternatives for $k \ge 3$, which we conjecture to be true. It is also important to study how to efficiently check whether a learned parameter is identifiable for k-PL when $m < 2k$. Can we further improve the statistical efficiency and computational efficiency of learning k-PL? We also plan to develop efficient implementations of our GMM algorithm and apply it widely to various learning problems with big rank data.

Acknowledgments

This work is supported by the National Science Foundation under grant IIS-1453542 and a Simons-Berkeley research fellowship. We thank all reviewers for helpful comments and suggestions.

References

Allman, Elizabeth S., Matias, Catherine, and Rhodes, John A. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099-3132, 2009.

Awasthi, Pranjal, Blum, Avrim, Sheffet, Or, and Vijayaraghavan, Aravindan. Learning mixtures of ranking models. In Proceedings of Advances in Neural Information Processing Systems, pp. 2609-2617, 2014.

Azari Soufiani, Hossein, Chen, William, Parkes, David C., and Xia, Lirong. Generalized method-of-moments for rank aggregation. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 2013.

Cheng, Weiwei, Dembczynski, Krzysztof J., and Hüllermeier, Eyke. Label ranking methods based on the Plackett-Luce model. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 215-222, 2010.

Chierichetti, Flavio, Dasgupta, Anirban, Kumar, Ravi, and Lattanzi, Silvio.
On learning mixture models for permutations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pp. 85-92, 2015.

Dasgupta, Sanjoy. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, pp. 634-644, 1999.

Dwork, Cynthia, Kumar, Ravi, Naor, Moni, and Sivakumar, D. Rank aggregation methods for the web. In Proceedings of the 10th World Wide Web Conference, pp. 613-622, 2001.

Gormley, Isobel Claire and Murphy, Thomas Brendan. Exploring voting blocs within the Irish electorate: A mixture modeling approach. Journal of the American Statistical Association, 103(483):1014-1027, 2008.

Hall, Alastair R. Generalized Method of Moments. Oxford University Press, 2005.

Hansen, Lars Peter. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029-1054, 1982.

Hunter, David R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384-406, 2004.

Iannario, Maria. On the identifiability of a mixture model for ordinal data. Metron, 68(1):87-94, 2010.

Kalai, Adam Tauman, Moitra, Ankur, and Valiant, Gregory. Disentangling Gaussians. Communications of the ACM, 55:113-120, 2012.

Liu, Tie-Yan. Learning to Rank for Information Retrieval. Springer, 2011.

Lu, Tyler and Boutilier, Craig. Effective sampling and learning for Mallows models with pairwise-preference data. The Journal of Machine Learning Research, 15(1):3783-3829, 2014.

Luce, Robert Duncan. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.

Mao, Andrew, Procaccia, Ariel D., and Chen, Yiling. Better human computation through principled voting. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Bellevue, WA, USA, 2013.

Marden, John I. Analyzing and Modeling Rank Data. Chapman & Hall, 1995.

McFadden, Daniel. Conditional logit analysis of qualitative choice behavior. In Frontiers of Econometrics, pp. 105-142, New York, NY, 1974. Academic Press.

McLachlan, Geoffrey and Peel, David. Finite Mixture Models. John Wiley & Sons, 2004.

McLachlan, Geoffrey J. and Basford, Kaye E. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988.

Plackett, Robin L. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2):193-202, 1975.

Stephens, Matthew. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795-809, 2000.

Teicher, Henry. Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1):244-248, 1961.

Teicher, Henry. Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34(4):1265-1269, 1963.

Thurstone, Louis Leon. A law of comparative judgment. Psychological Review, 34(4):273-286, 1927.