# Accelerated Spectral Ranking

Arpit Agarwal¹, Prathamesh Patil¹, Shivani Agarwal¹

**Abstract.** The problem of rank aggregation from pairwise and multiway comparisons has a wide range of applications, from recommendation systems to sports rankings to social choice. Some of the most popular algorithms for this problem come from the class of spectral ranking algorithms; these include the rank centrality algorithm for pairwise comparisons, which returns consistent estimates under the Bradley-Terry-Luce (BTL) model for pairwise comparisons (Negahban et al., 2017), and its generalization, the Luce spectral ranking algorithm, which returns consistent estimates under the more general multinomial logit (MNL) model for multiway comparisons (Maystre & Grossglauser, 2015). In this paper, we design a provably faster spectral ranking algorithm, which we call accelerated spectral ranking (ASR), that is also consistent under the MNL/BTL models. Our accelerated algorithm is achieved by designing a random walk that has a faster mixing time than the random walks associated with previous algorithms. In addition to a faster algorithm, our results yield improved sample complexity bounds for recovery of the MNL/BTL parameters: to the best of our knowledge, we give the first general sample complexity bounds for recovering the parameters of the MNL model from multiway comparisons under any (connected) comparison graph (and improve significantly over previous bounds for the BTL model for pairwise comparisons). We also give a message-passing interpretation of our algorithm, which suggests a decentralized distributed implementation. Our experiments on several real world and synthetic datasets confirm that our new ASR algorithm is indeed orders of magnitude faster than existing algorithms.

¹Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA. Correspondence to: Arpit Agarwal, Shivani Agarwal.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).*

## 1. Introduction

The problem of rank aggregation from pairwise or multiway comparisons is a fundamental one in machine learning, with applications in recommendation systems, sports, social choice, etc. In this problem, given pairwise or multiway comparisons among $n$ items, the goal is to learn a score for each item. These scores can then be used to produce a ranking over the items. For example, in recommendation systems, the goal might be to learn a ranking over items by observing the choices that users make when presented with different subsets of these items; in sports, the goal might be to rank teams/individuals at the end of a tournament based on pairwise or multiway games between them; in social choice, the goal might be to aggregate the choices of individuals when presented with different alternatives, such as candidates in an election.

In the case of pairwise comparisons, a popular model is the Bradley-Terry-Luce (BTL) model (Bradley & Terry, 1952; Luce, 1959), which posits that given a set of $n$ items, there is a positive weight $w_i$ associated with each item $i$, and the probability that $i$ is preferred over $j$ in a pairwise comparison between $i$ and $j$ is $\frac{w_i}{w_i + w_j}$. The BTL model is a special case of the multinomial logit (MNL)/Plackett-Luce model (Plackett, 1975; McFadden, 1974), which is defined for more general multiway comparisons.
Under the MNL model, the probability that an item $i$ is preferred amongst all items in a set $S$ is $\frac{w_i}{\sum_{j \in S} w_j}$.

Rank aggregation under pairwise comparisons has been an active area of research, and several algorithms have been proposed that are consistent under the BTL model (Negahban et al., 2017; Rajkumar & Agarwal, 2014; Hunter, 2004; Chen & Suh, 2015; Jang et al., 2016; Guiver & Snelson, 2009; Soufiani et al., 2013). The case of multiway comparisons has also received some attention recently (Maystre & Grossglauser, 2015; Jang et al., 2017; Chen et al., 2017). Two popular algorithms are the rank centrality (RC) algorithm (Negahban et al., 2017) for the case of pairwise comparisons, and its generalization to the case of multiway comparisons, called the Luce spectral ranking (LSR) algorithm (Maystre & Grossglauser, 2015). The key idea behind these algorithms is to construct a random walk (equivalently, a Markov chain) over the comparison graph on $n$ items, where there is an edge between two items if they are compared in a pairwise or multiway comparison. This random walk is constructed such that its stationary distribution corresponds to the weights of the MNL/BTL model.

Given the widespread application of these algorithms, understanding their computational aspects is of paramount importance. For random walk based algorithms, this amounts to analyzing the mixing/convergence time of their random walks to stationarity. In the case of rank centrality and Luce spectral ranking, ensuring that the stationary distribution of the random walk corresponds exactly to the weights of the underlying model forces their construction to have self loops with large mass. These self loops can lead to a large mixing time of $\Omega(\xi^{-1} d_{\max})$, where $d_{\max}$ is the maximum number of unique comparisons that any item participates in, and $\xi$ is the spectral gap of the graph Laplacian. In practical settings $d_{\max}$ can be very large, for example when the graph follows a power-law degree distribution, and can even be $\Omega(n)$ if one item is compared to a large fraction of the items.

In this paper we show that it is possible to construct a faster mixing random walk whose mixing time is $O(\xi^{-1})$. We are able to construct this random walk by relaxing the condition that its stationary distribution should exactly correspond to the weights of the MNL model, and instead imposing the weaker condition that the weights can be recovered through a linear transform of the stationary distribution. We call the resulting algorithm accelerated spectral ranking (ASR).

In addition to computational advantages, the faster mixing property of our random walk also comes with statistical advantages, as it is well understood that faster mixing Markov chains lend themselves to tighter perturbation error bounds (Mitrophanov, 2005). We are able to establish a sample complexity bound of $O(\xi^{-2}\, n\, \mathrm{poly}(\log n))$, in terms of the total variation distance, for recovering the true weights under the MNL (and BTL) model for almost any comparison graph of practical interest. To our knowledge, these are the first sample complexity bounds for the general case of multiway comparisons under the MNL model. Negahban et al. (2017) show similar results in terms of $L_2$ error for the special case of the BTL model; however, their bounds have an additional dependence on $d_{\max}$, due to the large mixing time of their random walk.

We also show that our algorithm can be viewed as a message passing algorithm.
This connection provides a very attractive property of our algorithm: it can be implemented in a distributed manner, with decentralized communication and with comparison data stored on different machines. We finally conduct several experiments on synthetic and real world datasets to compare the convergence time of our algorithm with that of previous algorithms. These experiments confirm the behavior predicted by our theoretical analysis of mixing times: the convergence of our algorithm is in fact orders of magnitude faster than that of existing algorithms.

### 1.1. Our Contributions

We summarize our contributions as follows:

1. **Faster Algorithm:** We present an algorithm for aggregating pairwise comparisons under the BTL model, and more general multiway comparisons under the MNL model, that is provably faster than the previous algorithms of Negahban et al. (2017) and Maystre & Grossglauser (2015). We also give experimental evidence supporting this fact.

2. **New and Improved Error Bounds:** We present the first error bounds for parameter recovery by spectral ranking algorithms under the general MNL model for any general (connected) comparison graph. These bounds improve upon the existing bounds of Negahban et al. (2017) for the special case of the BTL model.

3. **Message Passing Interpretation:** We provide an interpretation of our algorithm as a message passing/belief propagation algorithm. This connection can be used to design a decentralized distributed algorithm, which can work with distributed data storage.

### 1.2. Organization

In Section 2 we describe the problem formally. In Section 3 we present our algorithm for rank aggregation under the MNL/BTL model. In Section 4 we analyze the mixing time of our random walk, showing that it converges much faster than existing approaches. In Section 5 we give bounds on the sample complexity for recovery of MNL parameters with respect to the total variation distance. In Section 6 we give a message passing view of our algorithm. In Section 7 we provide experimental results on synthetic and real world datasets.

## 2. Problem Setting and Preliminaries

We consider a setting where there are $n$ items, and one observes outcomes of noisy pairwise or multiway comparisons between these items. We will assume that the outcomes of these comparisons are generated according to the multinomial logit (MNL) model, which posits that each item $i \in [n]$ is associated with an (unknown) weight/score $w_i > 0$, and the probability that item $i$ wins a comparison is proportional to its weight $w_i$. More formally, when there is a (multiway) comparison between items of a set $S \subseteq [n]$, for $i \in S$ we have
$$p_{i|S} := \Pr(i \text{ is the most preferred item in } S) = \frac{w_i}{\sum_{j \in S} w_j}\,.$$
This model is also referred to as the Plackett-Luce model, and it reduces to the Bradley-Terry-Luce (BTL) model in the special case of pairwise comparisons, i.e. $|S| = 2$.

Let $w \in \mathbb{R}^n_+$ be the vector of weights, i.e. $w = (w_1, \ldots, w_n)^\top$. Note that this model is invariant to any scaling of $w$, so for uniqueness we will assume that $\sum_{i=1}^n w_i = 1$, i.e. $w \in \Delta_n$, where $\Delta_n$ is the $n$-dimensional probability simplex.

The comparison data is of the following form: there are $d$ different comparison sets $S_1, \ldots, S_d \subseteq [n]$, with $|S_a| = m$ for all $a \in [d]$ and some constant $m < n$. For each set $S_a$, $a \in [d]$, one observes the outcomes of $L$ independent $m$-way comparisons between items in $S_a$, drawn according to the MNL model.
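To make the data-generating process concrete, the following sketch draws the winners of the $L$ comparisons for each set under the MNL model. This is our illustrative code, not the authors'; the function name and data layout are our own assumptions.

```python
import numpy as np

def sample_mnl_winners(w, sets, L, rng=None):
    """Draw L independent MNL winners for each comparison set.

    w    : (n,) array of positive item weights
    sets : list of d comparison sets, each a list/array of item indices
    L    : number of independent m-way comparisons per set
    Returns a list of d arrays; entry a holds the L winners y_a^1, ..., y_a^L.
    """
    rng = np.random.default_rng(rng)
    outcomes = []
    for S in sets:
        S = np.asarray(S)
        p = w[S] / w[S].sum()  # p_{i|S} = w_i / sum_{j in S} w_j
        outcomes.append(rng.choice(S, size=L, p=p))
    return outcomes

# toy usage: n = 4 items, two 3-way comparison sets, L = 1000 comparisons each
w = np.array([0.4, 0.3, 0.2, 0.1])
Y = sample_mnl_winners(w, [[0, 1, 2], [1, 2, 3]], L=1000, rng=0)
```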
The assumptions that each comparison set is of the same size $m$, and that each set is compared an equal number $L$ of times, are only for simplicity of exposition; we give a generalization in the supplementary material. We will denote by $y^l_a$ the winner of the $l$-th comparison amongst items of $S_a$, for $l \in [L]$ and $a \in [d]$.

Given comparison data $Y = \{(S_a, y_a)\}_{a=1}^d$, where $y_a = (y^1_a, \ldots, y^L_a)$, the problem is to find a weight vector $\hat w \in \Delta_n$ which is close to the true weight vector $w$ under some notion of error/distance. More formally, the problem is to find $\hat w \in \Delta_n$ such that $\|\hat w - w\|$ can be bounded in terms of the parameters $n$, $L$, and $m$, for some norm $\|\cdot\|$. We will give results in terms of the total variation distance, which for two vectors $u, \hat u \in \Delta_n$ is defined as
$$\|u - \hat u\|_{TV} = \tfrac{1}{2}\,\|u - \hat u\|_1 = \tfrac{1}{2} \sum_{i \in [n]} |u_i - \hat u_i|\,.$$
In the following sections, we will present an algorithm for recovering an estimate $\hat w$ of $w$, and give bounds on the error $\|\hat w - w\|_{TV}$ in terms of the problem parameters under natural assumptions on the comparison data.

## 3. Accelerated Spectral Ranking Algorithm

In this section we describe our algorithm, which we term accelerated spectral ranking (ASR). Our algorithm is based on the idea of constructing a random walk (throughout this paper we use the terms Markov chain and random walk interchangeably) on the comparison graph with $n$ vertices, which has an edge between nodes $i$ and $j$ if items $i$ and $j$ are compared in any $m$-way comparison. The key idea is to construct the random walk such that the probability of transition from node $i$ to node $j$ is proportional to $w_j$. If $w_j$ is larger than $w_i$, then with other quantities being equal, one would expect the random walk to spend more time at node $j$ than at node $i$ in its steady state. Hence, if we can calculate the stationary distribution of this random walk, it might give us a way to estimate the weight vector $w$. Moreover, for computational efficiency, we would also want this random walk to have a fast mixing time, i.e. it should rapidly converge to its stationary distribution.

The rank centrality (RC) algorithm (Negahban et al., 2017) for the BTL model, and its generalization the Luce spectral ranking (LSR) algorithm (Maystre & Grossglauser, 2015) for the MNL model, are based on a similar idea of constructing a random walk over the comparison graph. These algorithms construct a random walk whose stationary distribution, in expectation, is exactly $w$. However, this construction forces their Markov chain to have self loops with large mass, slowing down the convergence rate. In this section we show that it is possible to design a significantly faster mixing random walk that belongs to a different class of random walks over the comparison graph. More precisely, the random walk that we construct is such that it is possible to recover the weight vector $w$ from its stationary distribution using a fixed linear transformation, whereas for RC and LSR the stationary distribution is exactly $w$. Our theoretical analysis in Section 5, as well as experiments on synthetic and real world datasets in Section 7, will show that this difference can lead to vastly improved results.

Given comparison data $Y$, let us denote by $G_c([n], E)$ the undirected graph on $n$ vertices with an edge $(i,j) \in E$ for any $i, j$ that are part of an $m$-way comparison. More formally, $(i,j) \in E$ if there exists an index $a \in [d]$ such that $i, j \in S_a$. We will call $G_c$ the comparison graph, and throughout this paper we will assume that $Y$ is such that $G_c$ is connected.
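As a companion sketch (again ours, with illustrative names), the comparison graph $G_c$, the degrees $d_i$ used below, and the connectivity assumption can be checked as follows:

```python
import numpy as np
from itertools import combinations

def comparison_graph(n, sets):
    """Adjacency matrix A of G_c and degrees d_i = #sets containing item i."""
    A = np.zeros((n, n), dtype=int)
    d = np.zeros(n, dtype=int)
    for S in sets:
        for i in S:
            d[i] += 1
        for i, j in combinations(S, 2):  # every pair within a set is an edge
            A[i, j] = A[j, i] = 1
    return A, d

def is_connected(A):
    """Simple DFS check that G_c is connected (required by the analysis)."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(A[u]):
            if int(v) not in seen:
                seen.add(int(v))
                stack.append(int(v))
    return len(seen) == len(A)
```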
We will denote by $d_i$ the number of unique $m$-way comparison sets of which item $i \in [n]$ is a part, i.e. $d_i = \sum_{a \in [d]} \mathbf{1}[i \in S_a]$. Let $D \in \mathbb{R}^{n \times n}$ be the diagonal matrix with $D_{ii} = d_i$, $i \in [n]$. Also, let $d_{\max} := \max_i d_i$ and $d_{\min} := \min_i d_i$.

Suppose for each $a \in [d]$ and $j \in S_a$ one had access to the true probability $p_{j|S_a}$ of $j$ being the most preferred item in $S_a$. Then one could define a random walk on $G_c$ with transition probability from node $i \in [n]$ to node $j \in [n]$ given by
$$P_{ij} = \frac{1}{d_i} \sum_{a \in [d]:\, i,j \in S_a} p_{j|S_a} = \frac{1}{d_i} \sum_{a \in [d]:\, i,j \in S_a} \frac{w_j}{\sum_{k \in S_a} w_k}\,. \tag{1}$$
Let $P := [P_{ij}]$. One can verify that $P$ corresponds to a valid transition probability matrix, as it is non-negative and row stochastic. Furthermore, $P$ defines a reversible Markov chain, as it satisfies the detailed balance equations $w_i d_i P_{ij} = w_j d_j P_{ji}$ for all $i, j \in [n]$. If the graph $G_c$ is connected, then $\pi = Dw / \|Dw\|_1$ is the unique stationary distribution of $P$, and one can recover the true weight vector $w$ from this stationary distribution using the linear transform $D^{-1}$.

In practice one does not have access to $P$, so we propose an empirical estimate of $P$ that can be computed from the given comparison data. Formally, define $\hat p_{i|S_a}$ to be the fraction of times that $i$ won an $m$-way comparison amongst items in the set $S_a$, i.e. $\hat p_{i|S_a} := \frac{1}{L} \sum_{l=1}^L \mathbf{1}[y^l_a = i]$. Let us then define a random walk where the probability of transition from node $i \in [n]$ to node $j \in [n]$ is given by
$$\hat P_{ij} = \frac{1}{d_i} \sum_{a \in [d]:\, i,j \in S_a} \hat p_{j|S_a}\,. \tag{2}$$
Let $\hat P := [\hat P_{ij}]$. One can again verify that $\hat P$ corresponds to a valid transition probability matrix. We can think of $\hat P$ as a perturbation of $P$, with the error due to the perturbation decreasing with more and more comparisons. There is a rich literature (Cho & Meyer, 2001; Mitrophanov, 2005) on analyzing the sensitivity of the stationary distribution of a Markov chain under small perturbations. Hence, given a large number of comparisons, one can expect the stationary distribution of $\hat P$ to be close to that of $P$. Since we take a linear transform of these stationary distributions, one also needs to show that closeness is preserved under this linear transform. We defer this analysis to Section 5.

**Algorithm 1: ASR**

> **Input:** Markov chain $\hat P$ according to Eq. (2)
> **Initialize** $\hat\pi = (\frac{1}{n}, \ldots, \frac{1}{n})^\top \in \Delta_n$
> **while** estimates do not converge **do** $\;\hat\pi^\top \leftarrow \hat\pi^\top \hat P\;$ **end while**
> **Output** $\hat w = \frac{D^{-1}\hat\pi}{\|D^{-1}\hat\pi\|_1}$

The pseudo-code for our algorithm is given in Algorithm 1. The algorithm computes the stationary distribution $\hat\pi$ of the Markov chain $\hat P$ using the power method. (The stationary distribution may also be computed using other linear algebraic techniques, but these typically have a running time of $O(n^3)$, which is impractical for most modern applications.) It then outputs the (normalized) vector $\hat w$ obtained after applying the linear transform $D^{-1}$ to $\hat\pi$, i.e. $\hat w = \frac{D^{-1}\hat\pi}{\|D^{-1}\hat\pi\|_1}$. In the next section we will compare the convergence time of our algorithm with previous algorithms (Negahban et al., 2017; Maystre & Grossglauser, 2015).
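To make Algorithm 1 concrete, here is a minimal sketch of ASR (ours, not the authors' released code; it assumes the data layout of the earlier snippets): it builds $\hat P$ via Eq. (2), runs the power iteration, and applies the $D^{-1}$ transform.

```python
import numpy as np

def asr(n, sets, outcomes, tol=1e-4, max_iter=100_000):
    """Accelerated spectral ranking (Algorithm 1), as a dense-matrix sketch.

    sets     : list of d comparison sets (index lists); assumed to cover [n]
               and to induce a connected comparison graph
    outcomes : list of d arrays holding the L winners for each set
    Returns the estimate w_hat on the probability simplex.
    """
    d = np.zeros(n)
    P = np.zeros((n, n))
    for S, ys in zip(sets, outcomes):
        S = np.asarray(S)
        p_hat = np.array([(ys == j).mean() for j in S])  # \hat p_{j|S_a}
        for i in S:
            d[i] += 1
            P[i, S] += p_hat       # accumulate the sum over sets containing i, j
    P /= d[:, None]                # row i of Eq. (2) is divided by d_i
    pi = np.full(n, 1.0 / n)       # power iteration for the stationary dist.
    for _ in range(max_iter):
        nxt = pi @ P               # pi^T <- pi^T P_hat
        if np.abs(nxt - pi).sum() < tol:
            pi = nxt
            break
        pi = nxt
    w_hat = pi / d                 # linear transform D^{-1}
    return w_hat / w_hat.sum()

# e.g. w_hat = asr(4, [[0, 1, 2], [1, 2, 3]], Y), with Y from the earlier snippet
```

A production implementation would use sparse matrices, since $\hat P$ has at most $O(d\,m^2)$ nonzero entries; the dense version above is only meant to mirror the equations.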
## 4. Comparison of Mixing Time with Rank Centrality (RC) and Luce Spectral Ranking (LSR)

The random walk $P^{RC}$ constructed by the RC algorithm (Negahban et al., 2017) for the BTL model is given by
$$P^{RC}_{ij} := \begin{cases} \frac{1}{d_{\max}} \sum_{a \in [d]:\, i,j \in S_a} p_{j|S_a} & \text{if } i \neq j\,, \\[2pt] 1 - \sum_{j' \neq i} P^{RC}_{ij'} & \text{if } i = j\,, \end{cases} \tag{3}$$
and the random walk $P^{LSR}$ constructed by LSR (Maystre & Grossglauser, 2015) for the MNL model is given by
$$P^{LSR}_{ij} := \begin{cases} \epsilon \sum_{a \in [d]:\, i,j \in S_a} p_{j|S_a} & \text{if } i \neq j\,, \\[2pt] 1 - \sum_{j' \neq i} P^{LSR}_{ij'} & \text{if } i = j\,, \end{cases} \tag{4}$$
where $\epsilon > 0$ is chosen such that the diagonal entries are non-negative. In general $\epsilon$ would be $O(\frac{1}{d_{\max}})$. The random walks $\hat P^{RC}$ and $\hat P^{LSR}$ constructed from the comparison data are defined analogously, using the empirical probabilities $\hat p_{j|S_a}$ instead of $p_{j|S_a}$.

We first begin by showing that for any given comparison data $Y$, both RC/LSR and our algorithm will return the same estimate upon convergence.

**Proposition 1.** *Given items $[n]$ and comparison data $Y = \{(S_a, y_a)\}_{a=1}^d$, let $\hat\pi$ be the stationary distribution of the Markov chain $\hat P$ constructed by ASR, and let $\hat w^{LSR}$ be the stationary distribution of the Markov chain $\hat P^{LSR}$. Then*
$$\hat w^{LSR} = \frac{D^{-1}\hat\pi}{\|D^{-1}\hat\pi\|_1}\,.$$
*The same result is also true for $\hat w^{RC}$ in the case of pairwise comparisons.*

We give a proof of this result in the supplementary material. Although the above proposition shows that upon convergence both these algorithms will return the same estimates, it says nothing about the time it takes to reach this convergent state. This is where the key difference lies. Observe that each row $i \in [n]$ of our matrix $P$ is divided by $d_i$, whereas each row of $P^{RC}$ is divided by $d_{\max}$, except the diagonal entries. Now if $d_{\max}$ is very large, a row $i \in [n]$ of $P^{RC}$ corresponding to an item $i$ with small $d_i$ would have very small non-diagonal entries. This can make the diagonal entry $P^{RC}_{ii}$ very large, which amounts to having a heavy self loop at node $i$. This heavy self loop can significantly increase the time it takes for the random walk to reach its stationary distribution, since many transitions starting from $i$ will return back to $i$. The same reasoning holds for LSR under multiway comparisons.

To formalize this intuition, we need to analyze the spectral gap of a random walk $X$, denoted $\mu(X)$, which plays an important role in determining its mixing time. The spectral gap of a reversible random walk (or Markov chain) $X$ is defined as $\mu(X) := 1 - \lambda_2(X)$, where $\lambda_2(X)$ is the second largest eigenvalue of $X$ in terms of absolute value. The following lemma (see Levin et al. (2008) for more details) gives both upper and lower bounds on the mixing time (w.r.t. the total variation distance) of a random walk in terms of the spectral gap.

**Lemma 1** (Levin et al., 2008). *Let $X$ be the transition probability matrix of a reversible, irreducible Markov chain with state space $[n]$, let $\pi$ be the stationary distribution of $X$, let $\pi_{\min} := \min_{i \in [n]} \pi_i$, and let $d(r) = \sup_{p \in \Delta_n} \|p^\top X^r - \pi^\top\|_{TV}$. For any $\gamma > 0$, let $t(\gamma) = \min\{r \in \mathbb{N} : d(r) \leq \gamma\}$; then*
$$\Big(\frac{1}{\mu(X)} - 1\Big) \log\Big(\frac{1}{2\gamma}\Big) \;\leq\; t(\gamma) \;\leq\; \log\Big(\frac{1}{\gamma\, \pi_{\min}}\Big)\, \frac{1}{\mu(X)}\,.$$

The above lemma states that the mixing time of a Markov chain $X$ is inversely proportional to its spectral gap $\mu(X)$. We now compare the spectral gap of our Markov chain $P$ with the spectral gaps of $P^{RC}$ and $P^{LSR}$.

**Proposition 2.** *Let the probability transition matrix $P$ for our random walk be as defined in Eq. (1), and let $P^{RC}$ and $P^{LSR}$ be as defined in Eq. (3) and Eq. (4), respectively. Then*
$$\frac{d_{\min}}{d_{\max}}\, \mu(P) \;\leq\; \mu(P^{RC}) \;\leq\; \mu(P)\,, \tag{5}$$
*and*
$$\epsilon\, d_{\min}\, \mu(P) \;\leq\; \mu(P^{LSR}) \;\leq\; \mu(P)\,, \tag{6}$$
*where $\epsilon = O(\frac{1}{d_{\max}})$.*

A formal proof of this proposition is given in the supplementary material, and uses comparison theorems for reversible Markov chains due to Diaconis & Saloff-Coste (1993). The proposition shows that the spectral gap of $P$ is always lower bounded by that of $P^{RC}$ (and $P^{LSR}$), but can be much larger. In the latter case, by Lemma 1, our algorithm will converge faster than the RC algorithm (and LSR). In fact, there are instances where $d_{\max}/d_{\min} = \Omega(n)$ and the leftmost inequalities in both Eq. (5) and Eq. (6) hold with equality. In these instances the convergence of our algorithm will be $\Omega(n)$ times faster. We give two examples of such instances.

**Example 1.** Let $n = 3$, $m = 2$, $w_1 = 1/2$, $w_2 = 1/4$ and $w_3 = 1/4$. In the comparison data, item 1 is compared to both 2 and 3, but items 2 and 3 are not compared to each other. This implies that $d_1 = 2$, and $d_i = 1$ for $i \neq 1$. One can calculate the matrices $P$ and $P^{RC}$, and their respective eigenvalues, and observe that $\mu(P) = 2\,\mu(P^{RC})$.

**Example 2.** Let $m = 2$, $w = (1/n, \ldots, 1/n)^\top$, and let the comparison data be such that item 1 is compared to every other item, and no other items are compared to each other. This implies that $d_1 = n - 1$, and $d_i = 1$ for $i \neq 1$. One can again calculate the matrices $P$ and $P^{RC}$, and their respective eigenvalues, and observe that $\mu(P) = (n-1)\,\mu(P^{RC})$.
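Example 2 can be checked numerically with a few lines. The sketch below is ours; it builds both chains for the star instance (uniform weights, so every pairwise probability is $1/2$) and computes the spectral gaps directly from the eigenvalues.

```python
import numpy as np

def spectral_gap(X):
    """mu(X) = 1 - |lambda_2|, eigenvalues sorted by absolute value."""
    lam = np.sort(np.abs(np.linalg.eigvals(X)))[::-1]
    return 1.0 - lam[1]

def star_chains(n):
    """P (Eq. (1)) and P_RC (Eq. (3)) for Example 2: item 0 compared to
    every other item, uniform weights, so p_{j|{i,j}} = 1/2 for each pair."""
    d = np.ones(n)
    d[0] = n - 1
    P = np.zeros((n, n))
    P_rc = np.zeros((n, n))
    for i in range(1, n):
        for a, b in [(0, i), (i, 0)]:
            P[a, b] += 0.5 / d[a]            # off-diagonal term, row scaled by d_a
            P[a, a] += 0.5 / d[a]            # diagonal term p_{a|{a,b}} / d_a
            P_rc[a, b] = 0.5 / (n - 1)       # RC scales every row by d_max = n-1
    P_rc += np.diag(1.0 - P_rc.sum(axis=1))  # RC's heavy self loops
    return P, P_rc

n = 50
P, P_rc = star_chains(n)
print(spectral_gap(P))                 # 0.5
print((n - 1) * spectral_gap(P_rc))    # ~0.5, i.e. mu(P) = (n-1) mu(P_RC)
```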
Note that Proposition 2 only relates the spectral gaps of the expected matrices $P$ and $P^{RC}$, and not of any particular realization $\hat P$ and $\hat P^{RC}$. If the Markov chains $\hat P$ and $\hat P^{RC}$ are reversible, then identical results hold; however, similar results are very hard to prove for non-reversible Markov chains (Dyer et al., 2006). Nevertheless, for large $L$, one can expect the realized matrices $\hat P$ and $\hat P^{RC}$ to be close to their expected matrices $P$ and $P^{RC}$, respectively. Hence, using eigenvalue perturbation bounds (Horn & Johnson, 1990), one can show that the spectra of $\hat P$ and $\hat P^{RC}$ are close to the spectra of $P$ and $P^{RC}$, respectively. The same analysis holds for LSR under multiway comparisons. In Section 7 we perform experiments on synthetic and real world datasets which empirically show that the mixing times of the realized Markov chains behave as predicted.

It has been observed that faster mixing rates of Markov chains give us the ability to prove sharper perturbation bounds for these Markov chains (Mitrophanov, 2005). In the following section we use these perturbation bounds to prove sharper sample complexity bounds for our algorithm.

## 5. Sample Complexity Bounds

In this section we present sample complexity bounds for the estimates returned by ASR in terms of the total variation distance. The following theorem gives an error bound, in terms of the total variation distance, for the estimates $\hat w$ of the MNL weights returned by our algorithm.

**Theorem 1.** *Given items $[n]$ and comparison data $Y = \{(S_a, y_a)\}_{a=1}^d$, let each set $S_a$ of cardinality $m$ be compared $L$ times, with outcomes $y_a = (y^1_a, \ldots, y^L_a)$ produced as per an MNL model with parameters $w = (w_1, \ldots, w_n)^\top$ such that $\|w\|_1 = 1$. If the random walk $\hat P$ (Eq. (2)) on the comparison graph $G_c([n], E)$ induced by the comparison data $Y$ is strongly connected, then the ASR algorithm (Algorithm 1) converges to a unique distribution $\hat w$, which with probability $\geq 1 - 3n^{-(C^2-50)/25}$ satisfies the following error bound:*
$$\|w - \hat w\|_{TV} \;\leq\; \frac{C\, \kappa\, \sqrt{d_{avg}}}{\mu(P)\, d_{\min}}\, \sqrt{\frac{\max\{m, \log n\}}{L}}\,,$$
*where $\kappa = \log\big(\frac{d_{avg}}{d_{\min} w_{\min}}\big)$, $w_{\min} = \min_{i \in [n]} w_i$, $d_{avg} = \sum_{i \in [n]} w_i d_i$, $d_{\min} = \min_{i \in [n]} d_i$, $\mu(P)$ is the spectral gap of the random walk $P$ (Eq. (1)), and $C$ is any constant.* (The dependence on $\kappa$ is due to the dependence on $\frac{1}{\pi_{\min}}$ in the mixing time upper bound of Lemma 1. There are other bounds on $\kappa$ in terms of the condition number of Markov chains, see e.g. Mitrophanov (2005), and any improvement in these bounds will lead to an improvement in our sample complexity. In the worst case, $\kappa$ has a trivial upper bound of $O(\log n)$.)

Recall from Section 3 that the Markov chain $\hat P$ can be viewed as a perturbation of $P$. To show that the stationary distributions of $\hat P$ and $P$ are close, we use the results of Mitrophanov (2005) on the stability of Markov chains under perturbations. We also show that closeness is preserved under the linear transformation $D^{-1}$, giving the final bound stated in the theorem above. We present a formal proof in the supplementary material.
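The $1/\sqrt{L}$ dependence in Theorem 1 can be sanity-checked empirically with a short simulation. This is our sketch, reusing `sample_mnl_winners` and `asr` from the earlier snippets; it is not one of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
w = rng.uniform(0.5, 1.0, size=n)   # keep b = max w_i / min w_j small
w /= w.sum()
# a few hundred random 5-way comparison sets (connected with high probability)
sets = [rng.choice(n, size=5, replace=False) for _ in range(10 * n)]
for L in [10, 100, 1000, 10000]:
    Y = sample_mnl_winners(w, sets, L, rng=rng)
    w_hat = asr(n, sets, Y)
    tv = 0.5 * np.abs(w - w_hat).sum()
    print(L, tv)   # TV error should shrink by roughly sqrt(10) per row
```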
In the error bound of Theorem 1, one can further bound the spectral gap $\mu(P)$ of $P$ in terms of the spectral gap of the random walk normalized Laplacian of $G_c$, which is a fundamental quantity associated with $G_c$. The Laplacian represents a random walk on $G_c$ that transitions from a node $i$ to one of its neighbors uniformly at random. Formally, the Laplacian is $\mathcal{L} := C^{-1} A$, where $C$ is a diagonal matrix with $C_{ii} = \big|\bigcup_{a \in [d]:\, i \in S_a} S_a\big|$, i.e. the number of unique items that $i$ was compared with, and $A$ is the adjacency matrix, such that for $i, j \in [n]$, $A_{ij} = 1$ if $(i,j) \in E$ and $A_{ij} = 0$ otherwise. Let $\xi := \mu(\mathcal{L})$ be the spectral gap of $\mathcal{L}$. Then we can lower bound $\mu(P)$ as follows (proof in the supplement):
$$\mu(P) \;\geq\; \frac{\xi}{m\, b^2}\,,$$
where $b$ is the ratio of the maximum to the minimum weight, i.e. $b = \max_{i,j \in [n]} w_i / w_j$. This gives us the following.

**Corollary 1.** *In the setting of Theorem 1, the ASR algorithm converges to a unique distribution $\hat w$, which with probability $\geq 1 - 3n^{-(C^2-50)/25}$ satisfies the following error bound:*
$$\|w - \hat w\|_{TV} \;\leq\; \frac{C\, m\, b^2\, \kappa\, \sqrt{d_{avg}}}{\xi\, d_{\min}}\, \sqrt{\frac{\max\{m, \log n\}}{L}}\,,$$
*where $b = \max_{i,j \in [n]} \frac{w_i}{w_j}$.*

In the discussion that follows, we will assume $b = O(1)$, and hence $\mu(P) = \Omega(\xi/m)$. The quantity $d_{avg}$ has an interesting interpretation: it is the weighted average of the number of sets in which each item appears. It has a trivial upper bound of $d_{\max}$; however, a more careful analysis reveals a better bound of $O(|E|/n)$, where $E$ is the set of edges in the comparison graph $G_c$. Using this observation, we can give the following corollary of the above theorem.

**Corollary 2.** *If the conditions of Theorem 1 are satisfied, and if the number of edges in the comparison graph $G_c$ is $O(n\, \mathrm{poly}(\log n))$, i.e. $|E| = O(n\, \mathrm{poly}(\log n))$, then in order to ensure a total variation error of $o(1)$, the required number of comparisons per set is upper bounded as*
$$L = O\big(\mu(P)^{-2}\, \mathrm{poly}(\log n)\big) = O\big(\xi^{-2}\, m^3\, \mathrm{poly}(\log n)\big)\,.$$
*Hence the sample complexity, i.e. the total number of $m$-way comparisons needed to estimate $w$ with error $o(1)$, is*
$$|E| \cdot L = O\big(\xi^{-2}\, m^3\, n\, \mathrm{poly}(\log n)\big)\,.$$

We again provide a proof of this corollary in the appendix. Note that the case when the total number of edges in the comparison graph is $O(n\, \mathrm{poly}(\log n))$ captures the most interesting regime in ranking and sorting. Also, in most practical settings the size $m$ of the comparison sets will be $O(\log n)$. In this case, the above corollary implies a sample complexity bound of $O(\xi^{-2}\, n\, \mathrm{poly}(\log n))$, which is sometimes referred to as quasi-linear complexity. The following simple example illustrates this sample complexity bound.

**Example 3.** Consider the star comparison graph discussed in Example 2, where one item $i \in [n]$ is compared to all other $n-1$ items, and no other items are compared to each other. Let $w = (\frac{1}{n}, \ldots, \frac{1}{n})^\top$. One can calculate the spectral gap $\mu(P)$ to be exactly $0.5$. In this case, the sample complexity bound given by our result is $O(n\, \mathrm{poly}(\log n))$.

**Discussion/Comparison.** For the special case of pairwise comparisons under the BTL model ($m = 2$), Negahban et al. (2017) give a sample complexity bound of $O\big(\frac{d_{\max}}{d_{\min}}\, \xi^{-2}\, n\, \mathrm{poly}(\log n)\big)$ for recovering estimates $\hat w$ with low (normalized) $L_2$ error.
Using Proposition 1, one can see that this bound also applies to the estimates returned by our algorithm, and that our bound in terms of the $L_1$ error applies to rank centrality as well. However, the bounds of Negahban et al. (2017) carry an additional dependence on the ratio $\frac{d_{\max}}{d_{\min}}$, because the spectral gap of their Markov chain can be smaller than $\xi$, the spectral gap of the Laplacian, by that factor. In Section 7 we show that for many real world datasets $\frac{d_{\max}}{d_{\min}}$ can be much larger than $\log n$, and hence their bounds are no longer quasi-linear. A large class of graphs that occur in many real world scenarios and exhibit this behavior are power-law graphs. Another real world scenario in which $\frac{d_{\max}}{d_{\min}} = \Omega(n)$ arises is choice modeling (Agrawal et al., 2016), where one explicitly models the no-choice option, i.e. the user has the option of not selecting any item from the set of items presented to her. In this case the no-choice option is present in every comparison set, and the comparison graph behaves like the star graph discussed in Example 2. In fact, for such graphs the results of Negahban et al. (2017) give a trivial bound of $\mathrm{poly}(n)$ in terms of the $L_2$ error. For the general case of multiway comparisons, we are not aware of any other sample complexity bounds.

It is also important to note that the dependence on the number of comparison sets enters only through the spectral gap $\xi$ of the natural random walk on the comparison graph. For example, if the graph is a cycle ($d = n$), then the spectral gap is $O(1/n^2)$, whereas if the graph is a clique ($d = O(n^2)$), the spectral gap is $O(1)$.
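The cycle-versus-clique contrast can be verified directly from the definition $\xi = \mu(\mathcal{L})$ above. The sketch below is ours; an odd cycle is used so that the walk is not periodic (for an even cycle the chain is bipartite and the absolute-value spectral gap degenerates to zero). For these graphs the matrix $C$ coincides with the degree matrix.

```python
import numpy as np

def laplacian_gap(A):
    """xi = mu(L) for the random walk Laplacian L = C^{-1} A of a graph."""
    Lap = A / A.sum(axis=1, keepdims=True)
    lam = np.sort(np.abs(np.linalg.eigvals(Lap)))[::-1]
    return 1.0 - lam[1]

n = 101                                     # odd, so the cycle is not bipartite
cycle = np.zeros((n, n))
for i in range(n):
    cycle[i, (i + 1) % n] = cycle[i, (i - 1) % n] = 1
clique = np.ones((n, n)) - np.eye(n)
print(laplacian_gap(cycle))    # Theta(1/n^2): 1 - cos(pi/n)
print(laplacian_gap(clique))   # Theta(1): equals 1 - 1/(n-1)
```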
## 6. Message Passing Interpretation of ASR

In this section we show that our spectral ranking algorithm can be interpreted as a message passing/belief propagation algorithm. This connection can be used to design a decentralized distributed version of our algorithm.

Let us introduce the factor graph, an important data structure used in message passing algorithms. The factor graph is a bipartite graph $G_f([n] \cup [d], E_f)$ with two types of nodes: item nodes, which correspond to the $n$ items, and set nodes, which correspond to the $d$ sets. More formally, there is an item node $i$ for each item $i \in [n]$, and a set node $a$ for each set $S_a$, $a \in [d]$. There is an edge $(i, a) \in E_f$ between nodes $i$ and $a$ if and only if $i \in S_a$. There is a weight $\hat p_{i|S_a}$ on the edge $(i, a)$, which corresponds to the fraction of times $i$ won in the set $S_a$.

**Algorithm 2: Message Passing**

> **Input:** Graph $G_f = ([n] \cup [d], E_f)$; edge $(i, a) \in E_f$ has weight $\hat p_{i|S_a}$
> **Initialize** $m^{(0)}_{a \to i} = m/n$ for all $a \in [d]$, $i \in S_a$
> **for** $t = 1, 2, \ldots$ until convergence **do**
> &nbsp;&nbsp;**for all** $i \in [n]$: $\;m^{(t)}_{i \to a} = \frac{1}{d_i} \sum_{a':\, i \in S_{a'}} \hat p_{i|S_{a'}}\, m^{(t-1)}_{a' \to i}$
> &nbsp;&nbsp;**for all** $a \in [d]$: $\;m^{(t)}_{a \to i} = \sum_{i' \in S_a} m^{(t)}_{i' \to a}$
> **end for**
> **Set** $\hat w_i \leftarrow m^{(t)}_{i \to a}$ for all $i \in [n]$
> **Output** $\hat w / \|\hat w\|_1$

We now describe the algorithm. In each iteration, the item nodes send a message to their neighboring set nodes, and the set nodes respond to these messages. A message from an item node $i$ to a set node $a$ represents an estimate of the weight $w_i$ of item $i$, and a message from a set node $a$ to an item node $i$ represents an estimate of the sum of the weights of the items contained in set $S_a$. In each iteration, the item nodes update their estimates based on the messages they received in the previous iteration, and send these estimates to their neighboring set nodes. The set nodes then update their estimates by summing up the messages they receive from their neighboring item nodes, and send these estimates back to their neighboring item nodes. This process continues until the messages converge. Formally, let $m^{(t-1)}_{i \to a}$ be the message from item node $i$ to set node $a$ in iteration $t-1$, and let $m^{(t-1)}_{a \to i}$ be the corresponding message from set node $a$ to item node $i$. Then the messages in the next iteration are updated as follows:
$$m^{(t)}_{i \to a} = \frac{1}{d_i} \sum_{a' \in [d]:\, i \in S_{a'}} \hat p_{i|S_{a'}}\, m^{(t-1)}_{a' \to i}\,, \qquad m^{(t)}_{a \to i} = \sum_{i' \in S_a} m^{(t)}_{i' \to a}\,.$$

Now, suppose that the empirical edge weights $\hat p_{i|S_a}$ are equal to the true probabilities $p_{i|S_a} = \frac{w_i}{\sum_{j \in S_a} w_j}$ for all $i \in [n]$, $a \in [d]$. Also, suppose that in some iteration $t$, the item messages $m^{(t)}_{i \to a}$ become equal to the item weights $w_i$, $i \in [n]$. Then each set node replies with $m^{(t)}_{a \to i} = \sum_{j \in S_a} w_j$, and it is easy to observe that the next iteration of messages $m^{(t+1)}_{i \to a}$ are again equal to $w_i$. Therefore, the true weights $w$ are, in some sense, a fixed point of the above set of equations. The following lemma shows that the ASR algorithm is equivalent to this message passing algorithm.

**Lemma 2.** *For any realization of the comparison data $Y$, there is a one-to-one correspondence between each iteration of the message passing algorithm (Algorithm 2) and the corresponding power iteration of the ASR algorithm (Algorithm 1), and both algorithms return the same estimates $\hat w$ for any $Y$.*

We give a proof of the above lemma in the supplementary material. The lemma gives an interesting connection between spectral ranking under the MNL model and message passing/belief propagation. Such connections have been observed for other problems, such as the problem of aggregating crowdsourced binary tasks (Khetan & Oh, 2016). A consequence of this connection is that it facilitates a fully decentralized distributed implementation of the ASR algorithm. This can be very useful for modern applications, where machines can communicate local parameter updates to each other without explicitly communicating the data.
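A direct transcription of Algorithm 2 is given below (our sketch, using the data layout of the earlier snippets). Since the message from item $i$ is the same for all its neighboring sets, one scalar per item suffices.

```python
import numpy as np

def message_passing(n, sets, outcomes, iters=1000, tol=1e-10):
    """Algorithm 2 as a sketch; returns the normalized item messages."""
    sets = [np.asarray(S) for S in sets]
    p_hat = [np.array([(ys == j).mean() for j in S])   # edge weights \hat p_{i|S_a}
             for S, ys in zip(sets, outcomes)]
    d = np.zeros(n)
    for S in sets:
        d[S] += 1
    m = len(sets[0])
    msg_set = [np.full(len(S), m / n) for S in sets]   # m^{(0)}_{a -> i} = m/n
    item = np.zeros(n)
    for _ in range(iters):
        new_item = np.zeros(n)                         # item-to-set messages
        for S, ph, ms in zip(sets, p_hat, msg_set):
            new_item[S] += ph * ms                     # sum_a' p_hat_{i|S_a'} m_{a'->i}
        new_item /= d                                  # divide by d_i
        msg_set = [np.full(len(S), new_item[S].sum())  # set-to-item replies
                   for S in sets]
        if np.abs(new_item - item).sum() < tol:
            item = new_item
            break
        item = new_item
    return item / item.sum()

# e.g. message_passing(4, [[0, 1, 2], [1, 2, 3]], Y) matches asr(...) up to tolerance,
# as Lemma 2 predicts
```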
## 7. Experiments

In this section we perform experiments on both synthetic and real data to compare our algorithm with the existing LSR (Maystre & Grossglauser, 2015) and RC (Negahban et al., 2017) algorithms for recovering the weight vector $w$ under the MNL and BTL models, respectively. The implementation of our algorithm (code available at https://github.com/agarpit/asr) is based on applying the power method to $\hat P$ (Eq. (2)). The power method was chosen for its simplicity, efficiency, and scalability to large problem sizes. Similarly, the implementations of LSR and RC are based on applying the power method to $\hat P^{LSR}$ (Eq. (4)) and $\hat P^{RC}$ (Eq. (3)), respectively. In the definition of $\hat P^{LSR}$, the parameter $\epsilon$ was chosen to be the maximum possible value that ensures $\hat P^{LSR}$ is a Markov chain.

### 7.1. Synthetic Data

We conducted experiments on synthetic data generated according to the MNL model, with weight vectors $w$ generated randomly (details below). We compared our algorithm with the LSR algorithm for comparison sets of size $m = 5$, and with the RC algorithm for sets of size $m = 2$. We used two different graph topologies for generating the comparison graph $G_c$, or equivalently the comparison sets:

1. **Random Topology:** This topology corresponds to random graphs where $n \log_2(n)$ comparison sets are chosen uniformly at random from all $\binom{n}{m}$ unique sets of cardinality $m$. This topology is very close to the Erdős-Rényi topology, which has been well studied in the literature; in fact, the degree distributions of nodes in this random topology are very close to those in the Erdős-Rényi topology (Mezard & Montanari, 2009). The only reason we study the former is computational, as iterating over all $\binom{n}{m}$ hyper-edges is computationally challenging.

2. **Star Topology:** In this topology, there is a single item that belongs to all sets; the remaining $m - 1$ items in each set are contained only in that set. We study this topology because it corresponds to the choice sets used in Example 2, where there was a factor of $\Omega(n)$ gap in the spectral gap between our algorithm and the other algorithms.

In our experiments we selected $n = 500$ (results for other values of $n$ are given in the supplement), and the weight $w_i$ of each item $i \in [n]$ was drawn uniformly at random from the range $(0, 1)$; the weights were then normalized so that they sum to 1. A comparison graph $G_c$ was generated according to each of the graph topologies above. The parameter $L$ was set to $300 \log_2 n$. The winner of each comparison was drawn according to the MNL model with weights $w$. The convergence criterion for all algorithms was the same: we ran each algorithm until the $L_1$ distance between the new estimates and the old estimates was at most $0.0001$. Each experiment was repeated 100 times, and the average values over all trials are reported.

For $n = 500$, $m \in \{2, 5\}$, and both graph topologies described above, we compared the convergence of each algorithm as a function of the number of iterations. (We also plotted the convergence as a function of running time; the results were similar, as the running time per iteration is similar for all these algorithms.) We plotted the $L_1$ error of the estimates produced by these algorithms after each iteration. The plots are given in Figure 1; they verify the mixing time analysis of Section 4, and show that our algorithm converges much faster than RC and LSR, and orders of magnitude faster in the case of the star graph.

*Figure 1. Results on synthetic data: $L_1$ error vs. number of iterations for our algorithm, ASR, compared with the RC algorithm (for $m = 2$) and the LSR algorithm (for $m = 5$), on data generated from the MNL/BTL model with the random and star graph topologies.*

### 7.2. Real World Datasets

We conducted experiments on the YouTube dataset (Shetty, 2012), the GIF-anger dataset (Rich et al.), and the SFwork and SFshop datasets (Koppelman & Bhat, 2006). Table 1 gives some statistics about these datasets; we also plot their degree distributions in the supplementary material.

**Table 1. Statistics for real world datasets**

| Dataset | $n$ | $m$ | $d$ | $d_{\max}/d_{\min}$ |
|---|---|---|---|---|
| YouTube | 21207 | 2 | 394007 | 600 |
| GIF-anger | 6119 | 2 | 64830 | 106 |
| SFwork | 6 | 3-6 | 12 | 4.3 |
| SFshop | 8 | 4-8 | 10 | 1.9 |

For these datasets, a ground-truth $w$ is either unknown or undefined; hence, we compare our algorithm and the RC/LSR algorithms with respect to the log-likelihood of the estimates as a function of the number of iterations. Because the number of comparisons per set (or pair) is very small, in order to ensure irreducibility of the random walks we use a regularized version of all algorithms (see the supplementary material, and also Section 3.3 in Negahban et al. (2017), for more details).
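The precise regularizer is given in the paper's supplementary material; as a loose illustration only, one standard option (our assumption, in the spirit of Section 3.3 of Negahban et al. (2017)) is additive smoothing of the win counts, which keeps every transition probability strictly positive:

```python
# Hypothetical additive smoothing of the empirical win fractions; the paper's
# exact regularization is specified in its supplementary material.
def p_hat_regularized(wins_i, L, m, lam=0.2):
    """wins_i: times item i won among L comparisons of an m-way set."""
    return (wins_i + lam) / (L + m * lam)
```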
Here we give results for regularization parameter $\lambda = 0.2$, and defer results for other parameter values to the supplementary material. The results are given in Figure 2. We observe that our algorithm converges rapidly to the peak log-likelihood value, while RC and LSR are always slower in converging to this value.

*Figure 2. Results on real data: log-likelihood vs. number of iterations for our algorithm, ASR, compared with the RC algorithm (for pairwise comparison data) and the LSR algorithm (for multiway comparison data), all with regularization parameter set to 0.2.*

## 8. Conclusion and Future Work

We presented a spectral algorithm for the problem of rank aggregation from pairwise and multiway comparisons. Our algorithm is considerably faster than previous algorithms; in addition, our analysis yields improved sample complexity results for estimation under the BTL and MNL models. We also gave a message passing/belief propagation interpretation of our algorithm. It would be interesting to see whether our algorithm can be used to give better guarantees for recovery of the top-$k$ items under the MNL model.

### Acknowledgements

We would like to thank Rohan Ghuge and Ashish Khetan for helpful discussions. We would also like to thank the anonymous reviewers for helpful comments. This material is based upon work supported by the US National Science Foundation under Grant No. 1717290. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

## References

Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In COLT, 2017.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. A near-optimal exploration-exploitation approach for assortment selection. In EC, 2016.

Arrow, K. J. Social Choice and Individual Values. 1963.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

Chen, X., Li, Y., and Mao, J. A nearly instance optimal algorithm for top-k ranking under the multinomial logit model. In SODA, 2017.

Chen, Y. and Suh, C. Spectral MLE: Top-k rank aggregation from pairwise comparisons. In ICML, 2015.

Cho, G. and Meyer, C. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137-150, 2001.

Condorcet, M. de. Essay on the application of analysis to the probability of majority decisions. Paris: Imprimerie Royale, 1785.

de Borda, J. C. Mémoire sur les élections au scrutin. 1781.

Devroye, L. The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. The Annals of Statistics, pp. 896-904, 1983.

Diaconis, P. and Saloff-Coste, L. Comparison theorems for reversible Markov chains. The Annals of Applied Probability, pp. 696-730, 1993.

Diaconis, P. and Stroock, D. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, pp. 36-61, 1991.

Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. Rank aggregation methods for the web. In WWW, 2001.

Dyer, M., Goldberg, L. A., Jerrum, M., and Martin, R. Markov chain comparison. Probability Surveys, 3:89-111, 2006.

Guiver, J. and Snelson, E. Bayesian inference for Plackett-Luce ranking models. In ICML, 2009.

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1990.

Hunter, D. R. MM algorithms for generalized Bradley-Terry models. Annals of Statistics, pp. 384-406, 2004.

Jang, M., Kim, S., Suh, C., and Oh, S. Top-k ranking from pairwise comparisons: When spectral ranking is optimal. arXiv preprint arXiv:1603.04153, 2016.
Jang, M., Kim, S., Suh, C., and Oh, S. Optimal sample complexity of m-wise data for top-k ranking. In NIPS, 2017.

Khetan, A. and Oh, S. Achieving budget-optimality with adaptive schemes in crowdsourcing. In NIPS, 2016.

Koppelman, F. S. and Bhat, C. A Self Instructing Course in Mode Choice Modeling: Multinomial and Nested Logit Models. US Department of Transportation, Federal Transit Administration, 2006.

Levin, D. A., Peres, Y., and Wilmer, E. L. Markov Chains and Mixing Times. American Mathematical Society, Providence, RI, USA, 2008.

Luce, D. R. Individual Choice Behavior. 1959.

Maystre, L. and Grossglauser, M. Fast and accurate inference of Plackett-Luce models. In NIPS, 2015.

McFadden, D. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics, pp. 105-142, 1974.

Mezard, M. and Montanari, A. Information, Physics, and Computation. Oxford University Press, Inc., New York, NY, USA, 2009.

Mitrophanov, A. Y. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003-1014, 2005.

Negahban, S., Oh, S., and Shah, D. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266-287, 2017.

Plackett, R. L. The analysis of permutations. Applied Statistics, pp. 193-202, 1975.

Rajkumar, A. and Agarwal, S. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In ICML, 2014.

Rich, T., Hu, K., and Tome, B. GIFGIF dataset. Data available at http://www.gif.gf.

Shetty, S. Quantifying comedy on YouTube: Why the number of o's in your LOL matter. Data available at https://archive.ics.uci.edu/ml/datasets/YouTube+Comedy+Slam+Preference+Data, 2012.

Soufiani, H. A., Chen, W. Z., Parkes, D. C., and Xia, L. Generalized method-of-moments for rank aggregation. In NIPS, 2013.