Journal of Machine Learning Research 24 (2023) 1-25 Submitted 9/22; Revised 1/23; Published 6/23

Efficient Computation of Rankings from Pairwise Comparisons

M. E. J. Newman mejn@umich.edu
Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, USA

Editor: Tina Eliassi-Rad

Abstract

We study the ranking of individuals, teams, or objects on the basis of pairwise comparisons between them, using the Bradley-Terry model. Estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that provably returns identical results but does so much faster, over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive a number of results regarding its convergence.

Keywords: Ranking, paired comparisons, Bradley-Terry model, iterative algorithms, networks

1. Introduction

The problem of ranking a set of individuals, teams, or objects on the basis of a set of pairwise comparisons between them arises in many contexts, including competitions in sports, chess, and other games, paired comparison studies of consumer choice, and observational studies of dominance behaviors in animals and humans (Zermelo, 1929; Bradley and Terry, 1952; Davidson and Farquhar, 1976; David, 1988; Cattelan, 2012). If a group of chess players play games against one another, for example, how can we rank the players from best to worst based on the outcomes of those games? The outcomes may be contradictory or ambiguous (underdogs sometimes win and strong players sometimes lose), so we adopt a probabilistic model. In the most common version we assign a numerical score s_i to each individual i, and the probability p_ij of i beating j is assumed to be some function of the difference in their scores: p_ij = f(s_i − s_j).
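As a concrete illustration of this setup, the win probability under the logistic choice of f described below can be sketched in a few lines (a minimal sketch; the function name is ours, not the paper's):

```python
import math

def win_probability(s_i, s_j):
    """Probability that player i beats player j when p_ij = f(s_i - s_j)
    and f is the logistic function f(s) = 1/(1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))
```

Equal scores give probability 1/2, and by construction p_ij + p_ji = 1.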
The most popular choice of functional form is the logistic function f(s) = 1/(1 + e^{−s}), which gives

    p_ij = e^{s_i} / (e^{s_i} + e^{s_j}).    (1)

This is the Bradley-Terry model, first introduced by Zermelo (1929) and heavily studied in the years since, particularly following its rediscovery by Bradley and Terry (1952). For convenience, one often introduces the shorthand π_i = e^{s_i}, so that p_ij can also be written as

    p_ij = π_i / (π_i + π_j),    (2)

and we will do that here.

©2023 M. E. J. Newman. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1086.html.

Zermelo (writing in German) referred to the non-negative parameters π_i as Spielstärken, or playing strengths, although they are elsewhere variously called worth parameters, skill parameters, merit parameters, ratings, or simply weights. Following Zermelo, we will call them strengths.

Given the outcomes of a series of pairwise competitions between N competitors, the strengths can be estimated straightforwardly. Commonly one makes a maximum-likelihood estimate. Defining w_ij to be the total number of times i beats j, or zero if i and j never competed, it can be shown that the maximum-likelihood values of the strengths are given by a simple procedure: starting from any convenient initial values, we iterate the equation

    π'_i = [Σ_{j=1}^N w_ij] / [Σ_{j=1}^N (w_ij + w_ji)/(π_i + π_j)]    (3)

until convergence is reached. This algorithm was also first described by Zermelo (1929) and we will refer to it as Zermelo's algorithm. An extraordinary number of papers have been written about it, its variants, its properties, and its applications. Although widely used, however, Zermelo's algorithm is known to be slow to converge (Dykstra, 1956; Hunter, 2004). In this paper we study the alternate iteration

    π'_i = [Σ_{j=1}^N w_ij π_j/(π_i + π_j)] / [Σ_{j=1}^N w_ji/(π_i + π_j)]
    (4)

We show that iteration of this equation solves the same problem and converges to the same solution as Zermelo's algorithm but does so significantly faster, over a hundred times faster in some cases. Given that Eq. (4) is also simple to implement, we know of no reason not to favor it over Eq. (3).

In recent years a number of other authors have also considered alternative and potentially more efficient algorithms for ranking under the Bradley-Terry model. One promising approach employs spectral methods that estimate rankings based on the properties of random walks on the network of directed interactions between individuals (Maystre and Grossglauser, 2015; Negahban et al., 2017; Agarwal et al., 2018). Although they do not directly maximize the likelihood under the Bradley-Terry model, these algorithms can be shown to converge closely to the maximum-likelihood solution. Some versions are quite numerically efficient, though they can also be complex to implement. Minorization-maximization (MM) algorithms, which optimize a minorizing proxy for the likelihood function, can also be applied to ranking problems. For the Bradley-Terry model the appropriate MM algorithm turns out to be exactly equivalent to Zermelo's algorithm and hence offers no speed improvement (Hunter, 2004), but techniques have been suggested for accelerating convergence (Vojnovic et al., 2019) and the MM formulation also provides an elegant route to developing algorithms for generalizations of the model. Perhaps more directly competitive with our approach is one of the simplest of methods: one can fit the Bradley-Terry model using Newton's method applied to the derivative of the likelihood. For small applications with only a few individuals or teams to be ranked this is typically the fastest approach, although for such small cases the difference may be moot.
As the number N of individuals gets larger, however, Newton's method becomes impractical because the time taken per iteration of the algorithm scales as N^3, which quickly becomes prohibitive (Hunter, 2004). For Eq. (4) the time per iteration scales only as the number of pairwise competitions, making this approach faster for larger applications. Maximum-likelihood methods are not the only approach for fitting: one can also adopt Bayesian approaches, although these usually require Monte Carlo estimation and hence are not competitive in terms of speed (Davidson and Solomon, 1973; Caron and Doucet, 2012). Overall, the particular combination of simplicity and speed offered by Eq. (4) makes it an attractive approach for practical applications.

The rest of this paper is organized as follows. In Section 2 we describe and derive Zermelo's original ranking algorithm and the new algorithm proposed here. In Section 3 we prove that the new algorithm converges to the same unique maximum-likelihood solution as Zermelo's algorithm. In Section 4 we briefly discuss a larger family of ranking algorithms of which Zermelo's algorithm and our own are special cases. Within this family, the algorithm of this paper is the fastest to converge and hence it is our primary focus. In Sections 5 and 6 we describe two extensions of our approach, one to maximum a posteriori (MAP) estimation of rankings and the other to a (previously proposed) generalization of the Bradley-Terry model that allows for ties or draws in competitions. In Section 7 we apply our algorithms to a broad selection of example data, both real and synthetic, finding in every case that the algorithm of this paper is faster than Zermelo's, often by a wide margin. In Section 8 we give our conclusions. Some additional technical results are presented in appendices.

2.
Iterative Algorithms and the Bradley-Terry Model

Consider a tournament in which N players or teams play games of some kind against one another in pairs. We will initially assume that no ties or draws are allowed, so there is a clear winner and loser of every game. The case where ties are allowed is treated separately in Section 6. We assume that the probability p_ij that player i beats player j obeys Eq. (1), and we consider the strengths π_i to be a measure of the skill of the players, higher values indicating better players.

Note that the win probabilities p_ij are invariant under multiplication of all the π_i by any constant. One can remove this ambiguity by imposing any convenient normalization condition. Here we fix the geometric mean strength to be 1, which is equivalent to setting Π_i π_i = 1. This choice has the nice effect that the probability p_1 of a player with strength π beating the average player with strength 1 is p_1 = π/(π + 1) and hence π = p_1/(1 − p_1). Thus the strength parameter has a simple interpretation: it is the odds of beating the average player.

2.1 Zermelo's Algorithm

Suppose the tournament consists of a total of M games between pairs of players and let w_ij be the number of times that player i beats player j. Using these data we can make a maximum-likelihood estimate of the strengths π_i as follows. The likelihood of the observed games given the strengths (represented by a matrix W = [w_ij] and vector π = [π_i] respectively) is

    P(W|π) = Π_ij p_ij^{w_ij} = Π_ij [π_i/(π_i + π_j)]^{w_ij},    (5)

so that the log-likelihood is

    log P(W|π) = Σ_ij w_ij log [π_i/(π_i + π_j)] = Σ_ij w_ij log π_i − Σ_ij w_ij log(π_i + π_j).    (6)

Differentiating with respect to π_i for any i and setting the result to zero, we get

    Σ_j w_ij/π_i − Σ_j (w_ij + w_ji)/(π_i + π_j) = 0,    (7)

which can be rearranged to read

    π_i = [Σ_j w_ij] / [Σ_j (w_ij + w_ji)/(π_i + π_j)].
    (8)

In general this equation has no closed-form solution, but it can be solved numerically by simple iteration: one picks a suitable set of non-negative starting values for the π_i (random values are often used) and then computes new values π'_i according to

    π'_i = [Σ_j w_ij] / [Σ_j (w_ij + w_ji)/(π_i + π_j)].    (9)

Iterating this process, it can be proved, subject to certain conditions, that we converge to the global maximum of the likelihood and hence obtain an estimate of the strengths π_i (Zermelo, 1929; Ford, 1957; Hunter, 2004). The values can then be sorted in order to give a ranking of the players, or simply used in their raw form as a kind of rating. The iteration can be performed synchronously (all π_i updated at the same time) or asynchronously (π_i updated one by one in cyclic fashion), but it is generally believed that asynchronous updating is more efficient, since the update of any individual π_i benefits from the improved estimates of previously updated ones. In this paper we use asynchronous updates. Our convergence results in Sections 3 and 4 are also for the asynchronous case. This iterative algorithm, first described by Zermelo (1929), is the standard method for calculating rankings within the Bradley-Terry model and has seen numerous applications over the years in a wide variety of contexts.

2.2 An Alternative Algorithm

In a sense, computing maximum-likelihood estimates for the Bradley-Terry model is a straightforward problem. As discussed in Section 3, the likelihood is concave and many standard convex optimization methods can be applied. Speed, however, is of the essence in practical applications of the model, so considerable effort has been exerted in recent years to find solution methods faster than Zermelo's algorithm (Maystre and Grossglauser, 2015; Negahban et al., 2017; Agarwal et al., 2018; Vojnovic et al., 2019). Some of these are quite complex, but here we propose a very simple approach that also turns out to be highly efficient.
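Zermelo's iteration of Eq. (9), with asynchronous updates, can be sketched as follows (a minimal sketch, ours; w is the matrix of win counts and pi the current strength estimates):

```python
def zermelo_round(w, pi):
    """One asynchronous round of Zermelo's iteration, Eq. (9): each strength
    is replaced in turn by (total wins) / (sum over opponents of
    (w_ij + w_ji)/(pi_i + pi_j)), using the latest values of the others."""
    n = len(pi)
    for i in range(n):
        wins = sum(w[i][j] for j in range(n))
        denom = sum((w[i][j] + w[j][i]) / (pi[i] + pi[j])
                    for j in range(n) if j != i)
        pi[i] = wins / denom
    return pi
```

In practice one repeats the round until the changes in the strengths fall below a chosen tolerance.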
Grouping the terms slightly differently, Eq. (7) can be rewritten as

    Σ_j w_ij π_j/(π_i + π_j) − π_i Σ_j w_ji/(π_i + π_j) = 0,    (10)

which can be rearranged as

    π_i = [Σ_j w_ij π_j/(π_i + π_j)] / [Σ_j w_ji/(π_i + π_j)].    (11)

This suggests a different iterative algorithm for the Bradley-Terry model. Again we choose suitable starting values (for instance at random), then we iterate the form

    π'_i = [Σ_j w_ij π_j/(π_i + π_j)] / [Σ_j w_ji/(π_i + π_j)]    (12)

to convergence. In Section 3 we prove that, like Zermelo's algorithm, this process converges to the global maximum of the likelihood.

One nice feature of this algorithm is that it is transparent from Eq. (12) that π_i = 0 for any individual who loses all their games and π_i = ∞ for any individual who wins all their games. Furthermore, the iteration converges to these values in a single step. The same values are also returned by the standard Zermelo algorithm, but it is less obvious from Eq. (9) that this is true: it takes some work to demonstrate the result for the player who wins every game, and moreover it takes the Zermelo algorithm an infinite number of iterations to reach the correct value instead of just one. This is a special case of the more general finding, which we explore in this paper, that (12) converges faster than Zermelo's algorithm.

3. Convergence

In this section we prove that the iteration of Eq. (12) converges to the global maximum of the likelihood, Eq. (5), from any starting point, whenever a maximum exists. Zermelo proved that the likelihood has only one stationary point for π_i > 0, corresponding to the global maximum, provided the π_i are normalized as discussed in Section 2 and the directed network of interactions (the network with adjacency matrix w_ij) is strongly connected, i.e., there is a directed path through the network from every individual to every other (Zermelo, 1929; Ford, 1957).
If the network is not strongly connected then there are no stationary points, there is no maximum of the likelihood, and hence our problem has no solution. For the moment we will assume, as other authors have done, that the network is strongly connected and hence that there is a maximum of the likelihood, although we show how to relax this requirement in Section 5.

Since any fixed point of the iteration of Eq. (12) corresponds to a stationary point of the likelihood, and since the iteration generates non-negative values of π_i only (given non-negative initial values), it follows that if the iteration converges to a fixed point, that point must be the global maximum. To prove that it converges to a fixed point it suffices to demonstrate that the value of the log-likelihood always increases upon application of Eq. (12) unless a fixed point has been reached, since the log-likelihood cannot increase without bound, being bounded above by the maximum.

We consider the asynchronous version of the iteration of Eq. (12) in which a single π_i is updated at each step, all other π_j remaining the same. The π_i are updated in order until all N have been updated. Consider the step on which a particular π_i is updated. We define a function f(π_i) equal to the sum of the terms in the log-likelihood, Eq. (6), that depend on π_i:

    f(π_i) = Σ_j w_ij log [π_i/(π_i + π_j)] − Σ_j w_ji log(π_i + π_j).    (13)

Noting that log x ≤ x − 1 for all real x > 0 and making the substitution x → x/y, we derive the useful inequality

    log x ≤ log y + x/y − 1    (14)

for all x, y > 0, or equivalently

    log y ≥ log x − x/y + 1,    (15)

with the exact equality holding if and only if x = y. This implies for any π_i and π'_i that

    log [π'_i/(π'_i + π_j)] ≥ log [π_i/(π_i + π_j)] − [π_i/(π_i + π_j)] / [π'_i/(π'_i + π_j)] + 1
                           = log [π_i/(π_i + π_j)] + [(π'_i − π_i)/π'_i] / [(π_i + π_j)/π_j],    (16)

and (14) implies that

    log(π'_i + π_j) ≤ log(π_i + π_j) + (π'_i + π_j)/(π_i + π_j) − 1 = log(π_i + π_j) + (π'_i − π_i)/(π_i + π_j).    (17)

Evaluating Eq. (13) at the point π'_i defined by Eq.
(12) and applying these two inequalities, we find that

    f(π'_i) = Σ_j w_ij log [π'_i/(π'_i + π_j)] − Σ_j w_ji log(π'_i + π_j)
            ≥ Σ_j w_ij log [π_i/(π_i + π_j)] + [(π'_i − π_i)/π'_i] Σ_j w_ij π_j/(π_i + π_j)
              − Σ_j w_ji log(π_i + π_j) − (π'_i − π_i) Σ_j w_ji/(π_i + π_j)
            = f(π_i) + (π'_i − π_i) [ (1/π'_i) Σ_j w_ij π_j/(π_i + π_j) − Σ_j w_ji/(π_i + π_j) ]
            = f(π_i),    (18)

where we have used Eq. (13) again, the term inside the square brackets vanishes because of (12), and the exact equality applies if and only if π'_i = π_i. Thus f(π_i) always increases upon application of Eq. (12), and hence so also does the log-likelihood, unless π'_i = π_i, in which case the log-likelihood remains the same but could still increase when one of the other π_i is updated. Only if π'_i = π_i for all i does the log-likelihood not increase at all, but if this occurs then by definition we have reached a fixed point of the iteration, and hence we have reached the global maximum. This now guarantees convergence of the iterative algorithm of Eq. (12) to the global likelihood maximum.

4. Other Iterative Algorithms

Given the existence of two different iterations, Eqs. (9) and (12), that both converge to the same maximum-likelihood estimate, one might wonder whether there exist any others. In fact, it turns out there is an entire one-parameter family of iterations that includes (9) and (12) as special cases, and all of them converge to the same solution. Of these iterations, Eq. (12) converges most rapidly and hence is our primary focus in this paper, but for the interested reader we discuss the full family briefly in this section. For any α we can rewrite Eq. (10) in the form

    Σ_j w_ij (απ_i + π_j)/(π_i + π_j) − π_i Σ_j (αw_ij + w_ji)/(π_i + π_j) = 0,    (19)

which we can solve by iterating

    π'_i = [Σ_j w_ij (απ_i + π_j)/(π_i + π_j)] / [Σ_j (αw_ij + w_ji)/(π_i + π_j)]    (20)

until convergence is achieved. When α = 1 this procedure is equivalent to Zermelo's algorithm, Eq. (9). When α = 0 it is equivalent to the algorithm presented in this paper, Eq. (12).
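The monotonicity property established in Section 3 can be checked numerically. The following sketch (ours) performs asynchronous updates of Eq. (12) and verifies that the log-likelihood of Eq. (6) never decreases:

```python
import math

def check_monotone(w, pi, rounds=50):
    """Run asynchronous rounds of the iteration of Eq. (12), confirming that
    the log-likelihood of Eq. (6) is non-decreasing after every single update."""
    n = len(pi)

    def loglik(p):
        return sum(w[i][j] * math.log(p[i] / (p[i] + p[j]))
                   for i in range(n) for j in range(n) if w[i][j] > 0)

    ok = True
    prev = loglik(pi)
    for _ in range(rounds):
        for i in range(n):
            num = sum(w[i][j] * pi[j] / (pi[i] + pi[j]) for j in range(n) if j != i)
            den = sum(w[j][i] / (pi[i] + pi[j]) for j in range(n) if j != i)
            pi[i] = num / den
            cur = loglik(pi)
            ok = ok and cur >= prev - 1e-12  # small slack for rounding
            prev = cur
    return ok
```

The small tolerance allows for floating-point rounding near the fixed point, where the theoretical increase is vanishingly small.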
For negative α the iteration does not generate positive values of π_i in general and hence is invalid, but for zero or positive α it gives a whole range of algorithms, all of which converge to the same maximum-likelihood solution as Zermelo's algorithm.

When 0 ≤ α ≤ 1, convergence can be proved using a straightforward generalization of the method of Section 3, as follows. Since the log-likelihood has only a single stationary point, corresponding to the global maximum, and since any fixed point of (20) is a solution of (19) and hence corresponds to a stationary point of the log-likelihood, it follows that if (20) converges to a fixed point at all it must converge to the global likelihood maximum. To show that it converges to a fixed point it suffices, as previously, to show that the log-likelihood always increases upon application of (20) unless a fixed point has been reached. To do this we rewrite Eq. (13) for the terms in the log-likelihood that depend on π_i as

    f(π_i) = α Σ_j w_ij log π_i + (1 − α) Σ_j w_ij log [π_i/(π_i + π_j)] − Σ_j (αw_ij + w_ji) log(π_i + π_j).    (21)

For any π_i, π'_i the inequality (15) implies that

    log π'_i ≥ log π_i − π_i/π'_i + 1 = log π_i + (π'_i − π_i)/π'_i,    (22)

with the exact equality applying if and only if π'_i = π_i. Evaluating Eq. (21) at the point π'_i given by Eq. (20) and applying this inequality along with (16) and (17), we find for 0 ≤ α ≤ 1 that

    f(π'_i) = α Σ_j w_ij log π'_i + (1 − α) Σ_j w_ij log [π'_i/(π'_i + π_j)] − Σ_j (αw_ij + w_ji) log(π'_i + π_j)
            ≥ α Σ_j w_ij [log π_i + (π'_i − π_i)/π'_i]
              + (1 − α) Σ_j w_ij [log (π_i/(π_i + π_j)) + ((π'_i − π_i)/π'_i) / ((π_i + π_j)/π_j)]
              − Σ_j (αw_ij + w_ji) [log(π_i + π_j) + (π'_i − π_i)/(π_i + π_j)]
            = f(π_i) + (π'_i − π_i) [ (1/π'_i) Σ_j w_ij (απ_i + π_j)/(π_i + π_j) − Σ_j (αw_ij + w_ji)/(π_i + π_j) ]
            = f(π_i),    (23)

where we have employed Eq. (21) again, the term in square brackets in the penultimate line vanishes because of Eq. (20), and the exact equality applies if and only if π'_i = π_i. Thus f(π_i) always increases upon application of (20) unless π'_i = π_i. The rest of the proof follows the same lines as in Section 3, and hence convergence to the global likelihood maximum is established.
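The family of Eq. (20) can be sketched as follows (a sketch, ours): α = 1 reproduces Zermelo's update, α = 0 reproduces the update of Eq. (12), and by the result above both converge to the same strengths up to overall normalization.

```python
def family_round(w, pi, alpha):
    """One asynchronous round of the one-parameter family of Eq. (20).
    alpha = 1 is Zermelo's algorithm, alpha = 0 the algorithm of this paper."""
    n = len(pi)
    for i in range(n):
        num = sum(w[i][j] * (alpha * pi[i] + pi[j]) / (pi[i] + pi[j])
                  for j in range(n) if j != i)
        den = sum((alpha * w[i][j] + w[j][i]) / (pi[i] + pi[j])
                  for j in range(n) if j != i)
        pi[i] = num / den
    return pi
```

Since the fixed point is unique only up to a multiplicative constant, comparisons between runs should be made on ratios of strengths (or after normalizing the geometric mean to 1).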
As a corollary, this also provides an alternative proof of the convergence of Zermelo's algorithm (the case α = 1) which is significantly simpler than the original proof given by Zermelo (1929) or the later proof by Ford (1957). The method of proof used here does not extend to the case α > 1, because 1 − α becomes negative and the inequality in (23) no longer follows from (16). It is still possible to prove convergence for α > 1, but the proof is more involved. See Appendix A for details. Numerical measurements, some of which are presented in Section 7, indicate that convergence of the algorithms of this section becomes monotonically slower with increasing α, so that the main algorithm presented in this paper, Eq. (12), which corresponds to the smallest allowed value α = 0, is the fastest, and it is on this case that we concentrate in the remainder of the paper. Some formal results on rates of convergence as a function of α are presented in Appendix B.

5. Prior on the Strength Parameters

Equation (12) provides a complete algorithm for fitting the Bradley-Terry model. In practice, however, pure maximum-likelihood fits such as this can be problematic for this model. In particular, as mentioned in Section 3, a likelihood maximum exists only if the network of interactions is strongly connected. If this condition is not met then the score parameters s_i will diverge, and the algorithm of Eq. (12), and indeed all maximum-likelihood methods for this model, will fail. The root cause of this problem is that the maximum-likelihood fit effectively assumes a uniform (improper) prior on the s_i, which places all but a vanishing fraction of its weight on arbitrarily large values and, when coupled with a network that is not strongly connected, causes divergences. An effective solution is to impose a better-behaved prior on s_i and then compute a maximum a posteriori (MAP) estimate of the scores instead of a maximum-likelihood estimate (MLE).
A range of priors have been proposed for this purpose (Davidson and Solomon, 1973; Caron and Doucet, 2012; Whelan, 2017), but arguably the most natural is a logistic prior. Recall from Section 2 that the probability p_1 of a player with strength π winning against the average player is p_1 = π/(π + 1). In the absence of any evidence to the contrary, we assume this probability to be uniformly distributed between zero and one so that P(p_1) = 1, a least informative or maximum-entropy prior. Then the prior on the scores s = log π is

    P(s) = P(p_1) |dp_1/ds| = π/(π + 1)^2 = 1/[(e^s + 1)(e^{−s} + 1)],    (24)

which is the logistic distribution. Combining this result with Eq. (5), we then get a posterior probability on the strengths that is given, up to a multiplicative constant, by

    P(π|W) ∝ Π_ij [π_i/(π_i + π_j)]^{w_ij} × Π_i π_i/(π_i + 1)^2.    (25)

Maximizing this posterior probability instead of the likelihood regularizes the values of the scores, preventing them from diverging. It also removes the invariance under multiplication of the π_i by a constant and hence eliminates the need to normalize them.

The iterative algorithm of Eq. (12) can be generalized straightforwardly to this MAP estimate. As observed by Whelan (2017), the prior for individual i can be thought of as

    π_i/(π_i + 1)^2 = [π_i/(π_i + 1)] × [1/(π_i + 1)],

which is precisely the probability that i plays two games against the average player (who has π = 1) and wins one of them and loses the other. Thus Eq. (25) can be thought of as the likelihood of a Bradley-Terry model in which we have added two fictitious games for each player, one won and one lost, and we can maximize this likelihood (and hence the posterior of Eq. (25)) using the same algorithm as before, merely adding these extra fictitious games to the data. This also means that our proof of convergence generalizes to the MAP case and that the network of interactions is now strongly connected, so the probability maximum always exists.
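A minimal sketch (ours) of the resulting MAP update: since the fictitious opponent has strength 1, the two fictitious games contribute a term 1/(π_i + 1) to both the numerator and the denominator of the update of Eq. (12).

```python
def map_round(w, pi):
    """One asynchronous round of the MAP version of the iteration of Eq. (12):
    each player gets one fictitious win and one fictitious loss against an
    average player of strength 1, adding 1/(pi_i + 1) to both sums."""
    n = len(pi)
    for i in range(n):
        num = 1.0 / (pi[i] + 1.0) + sum(w[i][j] * pi[j] / (pi[i] + pi[j])
                                        for j in range(n) if j != i)
        den = 1.0 / (pi[i] + 1.0) + sum(w[j][i] / (pi[i] + pi[j])
                                        for j in range(n) if j != i)
        pi[i] = num / den
    return pi
```

Unlike the pure maximum-likelihood iteration, this remains well behaved even when the network of wins is not strongly connected, for example when one player wins every game.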
Alternatively, and perhaps more conveniently, we can derive an explicit algorithm for the MAP case by differentiating Eq. (25) with respect to π_i for any i, which leads to the iteration

    π'_i = [1/(π_i + 1) + Σ_j w_ij π_j/(π_i + π_j)] / [1/(π_i + 1) + Σ_j w_ji/(π_i + π_j)].    (26)

This is the generalization of Eq. (12) to the MAP case. It is completely equivalent to adding the fictitious games and has the same guaranteed convergence. One can also add the same prior to the traditional Zermelo algorithm of Eq. (9), which gives

    π'_i = [1 + Σ_j w_ij] / [2/(π_i + 1) + Σ_j (w_ij + w_ji)/(π_i + π_j)].    (27)

In Section 7 we present the results of numerical experiments on the rate of convergence both of these MAP estimators and of the MLEs, using Eqs. (9), (12), (26), and (27).

6. Ties

Ties or draws can occur in certain types of competition, such as chess and soccer. There are a number of ways to generalize ranking calculations to include ties. The simplest is just to consider a tied game to be half of a win for each of the players. This approach is used for instance in the Elo chess rating system and can be trivially incorporated into our calculations by modifying the values w_ij. A more sophisticated approach, however, incorporates the probability of a tie into the model itself. There is more than one way to do this (Rao and Kupper, 1967; Davidson, 1970). Here we employ the modification of the Bradley-Terry model proposed by Davidson (1970). One again defines strengths π_i for each player, and the probabilities of a win p_ij and a tie q_ij between players i and j are

    p_ij = π_i / [π_i + π_j + 2ν √(π_iπ_j)],    q_ij = 2ν √(π_iπ_j) / [π_i + π_j + 2ν √(π_iπ_j)],    (28)

where ν > 0 is a parameter which controls the overall frequency of ties and which we will estimate by maximum likelihood along with the strengths. Note that when π_i = π_j we have q_ij = ν/(1 + ν) and hence ν = q_ij/(1 − q_ij), so ν can be interpreted as the odds of a tie between evenly matched players.
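The probabilities of Eq. (28) and their basic properties can be sketched as follows (a minimal sketch, ours):

```python
import math

def davidson_probs(pi_i, pi_j, nu):
    """Win and tie probabilities in the Davidson (1970) model, Eq. (28)."""
    tie_term = 2.0 * nu * math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + tie_term
    p_ij = pi_i / denom      # probability that i beats j
    q_ij = tie_term / denom  # probability of a tie
    return p_ij, q_ij
```

By construction p_ij + p_ji + q_ij = 1 and q_ij = q_ji, and for evenly matched players the tie probability is ν/(1 + ν).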
The form (28) satisfies the obvious requirements that p_ij + p_ji + q_ij = 1 and q_ij = q_ji, and also has the intuitive property that the probability of a tie is greatest when the players are evenly matched and vanishes as π_i and π_j become arbitrarily far apart. As with the standard Bradley-Terry model, the probabilities p_ij and q_ij are invariant under multiplication of all π_i by a constant, and again we remove this ambiguity by normalizing them so that Π_i π_i = 1. Davidson (1970) proposed an iterative algorithm for computing maximum-likelihood estimates of the strengths and the parameter ν within this model. Defining w_ij as before to be the number of times i beats j and t_ij = t_ji to be the number of ties, we can write the likelihood of a set of observations W = [w_ij], T = [t_ij] as

    P(W, T|π, ν) = Π_ij p_ij^{w_ij} Π_{i<j} q_ij^{t_ij}.

[...]

Algorithms with α > 1 converge slower than Zermelo's algorithm (the case α = 1), which means these are not normally of interest. We can also prove that convergence becomes monotonically faster with decreasing α down to some point α < 1, meaning that there provably exist algorithms that are faster than Zermelo's algorithm. In general, however, the proof does not extend to α = 0 (the algorithm of this paper), so for the moment the finding that convergence is fastest for α = 0 is a numerical one only.

8. Conclusions

We have presented an alternative to the classic algorithm of Zermelo for computing rankings from pairwise comparisons using fits to the Bradley-Terry model, with or without ties allowed. Like Zermelo's algorithm, the method presented is a simple iterative scheme. We have proved that the iteration always converges to the global maximum of the likelihood and given numerical evidence that it does so faster, typically many times faster, than Zermelo's algorithm. Given that it is also simple to implement, we know of no reason not to favor the algorithm presented here over Zermelo's algorithm.
Acknowledgments

This work was funded in part by the US National Science Foundation under grant DMS-2005899. All empirical data used in this paper are previously published and freely available online.

Appendix A: Proof of Convergence for α > 1

As discussed in Section 4, Zermelo's algorithm and the algorithm of this paper are both special cases of a larger one-parameter family of algorithms given by the iteration of

    π'_i = [Σ_j w_ij (απ_i + π_j)/(π_i + π_j)] / [Σ_j (αw_ij + w_ji)/(π_i + π_j)]

for any α ≥ 0. For 0 ≤ α ≤ 1 the convergence of this iteration to the likelihood maximum can be proved straightforwardly as described in Section 4. For α > 1 the same method of proof does not work, because 1 − α becomes negative and the inequality in (23) no longer follows from (16). It is still possible to prove convergence, but the method of proof is somewhat different, as we now describe.

From (15) we have for any x, y, c > 0

    log(x + c) ≥ log(y + c) − (y + c)/(x + c) + 1 = log(y + c) + (x − y)/(x + c)
              = log(y + c) + y(x − y)/(x(y + c)) + c(x − y)^2/(x(x + c)(y + c))
              ≥ log(y + c) + y(x − y)/(x(y + c)),

which is equivalent to

    log x − log y − (x − y)/x ≥ log [x/(x + c)] − log [y/(y + c)] − [(x − y)/x] / [(y + c)/c],

with the exact equality applying if and only if x = y. Noting that the left-hand side of this inequality is always non-negative by (15), for any α > 1 we then have

    α [log x − log y − (x − y)/x] ≥ (α − 1) [log (x/(x + c)) − log (y/(y + c)) − ((x − y)/x) / ((y + c)/c)],

which can be rearranged to read

    α log x + (1 − α) log [x/(x + c)] ≥ α [log y + (x − y)/x] + (1 − α) [log (y/(y + c)) + ((x − y)/x) / ((y + c)/c)].

Now setting x = π'_i, y = π_i, and c = π_j, multiplying by the positive quantities w_ij, and summing over j, we have

    α Σ_j w_ij log π'_i + (1 − α) Σ_j w_ij log [π'_i/(π'_i + π_j)]
        ≥ α Σ_j w_ij [log π_i + (π'_i − π_i)/π'_i]
          + (1 − α) Σ_j w_ij [log (π_i/(π_i + π_j)) + ((π'_i − π_i)/π'_i) / ((π_i + π_j)/π_j)],

where the exact equality applies if and only if π'_i = π_i. In combination with (17), this is now sufficient to establish the inequality in (23) once again, and hence convergence of the algorithm for α > 1 is assured.
Appendix B: Rate of Convergence

The numerical results of Section 7 show markedly faster convergence for the algorithms of this paper than for the standard Zermelo algorithm. As discussed at the end of Section 7, we do not at present have a proof that convergence is faster, but it is possible to prove that some algorithms within the family defined in Section 4 converge faster than Zermelo's algorithm.

As observed in Fig. 1, the iterative algorithms of this paper show exponential convergence, which is expected: in general, all iterations of the form x' = f(x) converge exponentially, if they converge at all, except in certain special cases that do not apply here. For the family of algorithms in Section 4, the rate of convergence for any given value of the parameter α can be quantified by the factor λ_i(α) by which the distance between the current estimate of π_i and the final maximum-likelihood estimate (MLE) π̂_i decreases when π_i is updated, as π_i approaches π̂_i. Thus

    λ_i(α) = lim_{π→π̂} (π'_i − π̂_i)/(π_i − π̂_i) = [∂π'_i/∂π_i]_π̂,

where the subscript π̂ indicates that the derivative is evaluated at the MLE. For instance, for Zermelo's algorithm (the case α = 1), applying Eq. (3) we have

    λ_i(1) = [Σ_j w_ij] [Σ_j (w_ij + w_ji)/(π̂_i + π̂_j)^2] / [Σ_j (w_ij + w_ji)/(π̂_i + π̂_j)]^2
           = 1 − [1/Σ_j w_ij] Σ_j (w_ij + w_ji) π̂_iπ̂_j/(π̂_i + π̂_j)^2,

where we have employed (3) again to simplify the expression and made use of the fact that π'_i = π_i = π̂_i at the MLE. Assuming once again that the network of interactions represented by w_ij is strongly connected, the π̂_i are, as shown by Ford (1957), strictly positive and finite, which means in turn that the value of λ_i(1) is strictly positive. For other α, however, the value of λ_i(α) can be negative (meaning that convergence to the MLE is oscillatory). This observation will be important in a moment.

The factor by which the RMS error of Fig.
1 decreases over a complete round of updates depends asymptotically on the slowest-decaying π_i and is given by

    λ_max(α) = max_i |λ_i(α)|,

where we take the absolute value to allow for the possibility of negative λ_i. An algorithm with given α asymptotically converges faster than Zermelo's algorithm if λ_max(α) < λ_max(1). Here we demonstrate that this is the case for at least some values of α.

We consider how the value of λ_i(α) changes with α and compute the derivative ∂λ_i(α)/∂α = [∂²π'_i/∂π_i ∂α]_π̂. From Eq. (20) we have

    ∂π'_i/∂α = [Σ_j w_ij/(π_i + π_j)] / [Σ_j (αw_ij + w_ji)/(π_i + π_j)] × (π_i − π'_i),    (37)

where we have used (20) again to simplify. Differentiating with respect to π_i, setting π_i = π̂_i for all i, and noting again that π'_i = π_i at the MLE, we find that

    ∂λ_i(α)/∂α = [Σ_j w_ij/(π̂_i + π̂_j)] / [Σ_j (αw_ij + w_ji)/(π̂_i + π̂_j)] × [1 − λ_i(α)],    (38)

where we have used Eq. (37). The fact that the iteration of Eq. (20) converges to the MLE for all α ≥ 0 implies that λ_i(α) must be strictly less than 1 for all i, and hence (38) is strictly positive, since the sums over j are strictly positive when the π̂_i are strictly positive and finite. At the same time it is also finite, and hence λ_i(α) is increasing in α and continuous for all α ≥ 0.

This now establishes some useful results. First, it implies that λ_i(α) > λ_i(1) for all i when α > 1 (and also that λ_i(α) is positive in this regime). Thus, if the largest value of λ_i(1) occurs for i = µ, then λ_max(α) ≥ λ_µ(α) > λ_µ(1) = λ_max(1). Hence all algorithms with α > 1 converge slower than Zermelo's algorithm. For this reason these algorithms are not normally of practical interest. Second, we also have λ_i(α) < λ_i(1) for all i when 0 ≤ α < 1. Unfortunately, this is not sufficient to establish that λ_max(α) < λ_max(1) in this regime (and hence that these algorithms converge faster than Zermelo's algorithm) because, as mentioned above, it is not guaranteed that λ_i(α) is positive. The value of λ_i(α) for α < 1 can, and in practice often does, become negative.
This means that |λ_i(α)| could be larger than λ_i(1), and indeed it is straightforward to find cases where this occurs. On the other hand, we can prove that there exist some algorithms that are faster than Zermelo's. Given that λ_i(α) is continuous and increasing in α, its value must diminish smoothly and monotonically as α decreases from 1 down to 0. Thus, given that λ_i(1) is strictly positive, one of two things must happen: either λ_i(α) never reaches the line λ_i(α) = −λ_i(1), in which case |λ_i(0)| < λ_i(1), or it does reach this line, in which case there exists some c_i < 1 such that λ_i(c_i) = −λ_i(1). In the latter case, by continuity, |λ_i(α)| < λ_i(1) in the non-vanishing interval c_i < α < 1. Repeating the same argument for all i, we define c = max_i c_i, or c = 0 if |λ_i(0)| < λ_i(1) for all i, and then for all i we have |λ_i(α)| < λ_i(1) in the non-vanishing interval c < α < 1. Now choose any α in this interval and suppose the largest value of |λ_i(α)| occurs for i = ν. Then at this α we have λ_max(α) = |λ_ν(α)| < λ_ν(1) ≤ λ_max(1). Hence all algorithms with c < α < 1 converge faster than Zermelo's algorithm. Algorithms with 0 ≤ α ≤ c may also converge faster than Zermelo's algorithm, and the numerical evidence suggests that they do, but this cannot be proved using the present approach.

Appendix C: Proof of Convergence for the Model with Ties

For the case where ties are allowed, the proof that iteration of Eqs. (35) and (36) converges to the maximum of the log-likelihood (30) follows similar lines to that for the case without ties. Davidson (1970) proved that the likelihood has only a single stationary point with respect to its parameters, corresponding to the global likelihood maximum, provided the π_i are normalized and the network of interactions is strongly connected (with a tie counting as an edge in both directions between the relevant pair of players). Since any fixed point of Eqs.
(35) and (36) corresponds to a stationary point of the likelihood, this implies that if our iteration converges to a fixed point at all then that point is the global maximum. To prove that we converge to a fixed point it suffices to show that the log-likelihood always increases upon application of either Eq. (35) or Eq. (36), unless a fixed point has been reached.

The terms in the log-likelihood of Eq. (30) that depend on π_i can be written in the form

    f(\pi_i) = \sum_j a_{ij} \log \frac{\pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
             - \sum_j a_{ji} \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr),  (39)

where a_{ij} = w_{ij} + \tfrac12 t_{ij} as previously. Applying the inequalities (14) and (15), we have for any π_i and π_i'

    \log \frac{\pi_i'}{\pi_i' + \pi_j + 2\nu\sqrt{\pi_i' \pi_j}}
        \ge \log \frac{\pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
          + \frac{\pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \frac{\pi_i}{\pi_i'} \Bigr)
          + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \sqrt{\pi_i/\pi_i'} \Bigr),

    \log \bigl( \pi_i' + \pi_j + 2\nu\sqrt{\pi_i' \pi_j} \bigr)
        \le \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr)
          + \frac{\pi_i' - \pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
          + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( \sqrt{\pi_i'/\pi_i} - 1 \Bigr).

Evaluating Eq. (39) at the point π_i' defined by Eq. (35) and applying these two inequalities gives

    f(\pi_i') \ge f(\pi_i)
        + \sum_j a_{ij} \biggl[ \frac{\pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \frac{\pi_i}{\pi_i'} \Bigr)
             + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \sqrt{\pi_i/\pi_i'} \Bigr) \biggr]
        - \sum_j a_{ji} \biggl[ \frac{\pi_i' - \pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
             + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( \sqrt{\pi_i'/\pi_i} - 1 \Bigr) \biggr].

Collecting terms and using the definition of π_i' in Eq. (35), the terms beyond f(π_i) on the right-hand side combine into a non-negative expression that vanishes if and only if π_i' = π_i. Hence f(π_i) always increases upon application of Eq. (35) unless π_i' = π_i, and so therefore does the log-likelihood as well.

The same is also true of the update (36) for the parameter ν. The terms in the log-likelihood that depend on ν can be written

    g(\nu) = \sum_{ij} t_{ij} \log \frac{\nu}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
           - \sum_{ij} w_{ij} \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr).  (40)

For any ν and ν' the inequalities (14) and (15) imply that

    \log \frac{\nu'}{\pi_i + \pi_j + 2\nu'\sqrt{\pi_i \pi_j}}
        \ge \log \frac{\nu}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
          + \frac{\pi_i + \pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \frac{\nu}{\nu'} \Bigr),

    \log \bigl( \pi_i + \pi_j + 2\nu'\sqrt{\pi_i \pi_j} \bigr)
        \le \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr)
          + (\nu' - \nu) \, \frac{2\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}.

Evaluating (40) at the point ν' given by Eq. (36) and applying these two inequalities, we get

    g(\nu') \ge g(\nu) + (\nu' - \nu)
        \biggl[ \frac{1}{\nu'} \sum_{ij} t_{ij} \, \frac{\pi_i + \pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
              - \sum_{ij} w_{ij} \, \frac{2\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \biggr],

where the term in square brackets vanishes because of Eq. (36), and the inequality is strict unless ν' = ν. Thus g(ν) always increases upon application of (36) unless ν' = ν, and so therefore does the log-likelihood. The remainder of the proof follows the same lines of argument as in Section 3, and hence convergence of Eqs. (35) and (36) to the unique likelihood maximum is established.

References

A. Agarwal, P. Patil, and S. Agarwal. Accelerated spectral ranking. Proceedings of Machine Learning Research, 80:70–79, 2018.
B. Ball and M. E. J. Newman. Friendship networks and social status. Network Science, 1:16–30, 2013.
R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39:324–345, 1952.
F. Caron and A. Doucet. Efficient Bayesian inference for generalized Bradley-Terry models. Journal of Computational and Graphical Statistics, 21:174–196, 2012.
M. Cattelan. Models for paired comparison data: A review with emphasis on dependent data. Statistical Science, 27:412–433, 2012.
H. A. David. The Method of Paired Comparisons. Griffin, London, 2nd edition, 1988.
R. R. Davidson and D. L. Solomon. A Bayesian approach to paired comparison experimentation.
Biometrika, 60:477–487, 1973.
R. R. Davidson. On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. Journal of the American Statistical Association, 65:317–328, 1970.
R. R. Davidson and P. H. Farquhar. A bibliography on the method of paired comparisons. Biometrics, 32:241–252, 1976.
O. Dykstra, Jr. A note on the rank analysis of incomplete block designs. Biometrics, 12:301–306, 1956.
L. R. Ford, Jr. Solution of a ranking problem from binary comparisons. American Mathematical Monthly, 64(8):28–33, 1957.
M. T. Hallinan and W. N. Kubitschek. The effect of individual and structural characteristics on intransitivity in social networks. Social Psychology Quarterly, 51:81–92, 1988.
D. R. Hunter. MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32:384–406, 2004.
L. Maystre and M. Grossglauser. Fast and accurate inference of Plackett-Luce models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Proceedings of the 2015 Conference on Advances in Neural Information Processing Systems, pages 172–180, Cambridge, MA, 2015. MIT Press.
S. Negahban, S. Oh, and D. Shah. Rank centrality: Ranking from pair-wise comparisons. Operations Research, 65:266–287, 2017.
N. Pavlichenko and D. Ustalov. IMDB-WIKI-SbS: An evaluation dataset for crowdsourced pairwise comparisons. Preprint arXiv:2110.14990, 2021.
P. V. Rao and L. L. Kupper. Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62:194–204, 1967.
J. R. Udry, P. S. Bearman, and K. M. Harris. National Longitudinal Study of Adolescent Health, 1997. This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J. Richard Udry, Peter S.
Bearman, and Kathleen Mullan Harris at the University of North Carolina at Chapel Hill, and funded by grant P01 HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations. Special acknowledgment is due Ronald R. Rindfuss and Barbara Entwisle for assistance in the original design. Information on how to obtain the Add Health data files is available on the Add Health website (https://www.cpc.unc.edu/addhealth). No direct support was received from grant P01 HD31921 for this analysis.
J. A. R. A. M. van Hooff and J. A. B. Wensing. Dominance and its behavioral measures in a captive wolf pack. In Harry Frank, editor, Man and Wolf, pages 219–252. Junk Publishers, Dordrecht, 1987.
C. Vilette, T. Bonnell, P. Henzi, and L. Barrett. Comparing dominance hierarchy methods using a data-splitting approach with real-world data. Behavioral Ecology, 31:1379–1390, 2020.
M. Vojnovic, S.-Y. Yun, and K. Zhou. Accelerated MM algorithms for ranking scores inference from comparison data. Preprint arXiv:1901.00150, 2019.
J. T. Whelan. Prior distributions for the Bradley-Terry model of paired comparisons. Preprint arXiv:1712.05311, 2017.
R. Yurko, S. Ventura, and M. Horowitz. nflWAR: A reproducible method for offensive player evaluation in football. Journal of Quantitative Analysis in Sports, 15:163–183, 2019.
E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29:436–460, 1929.