Journal of Machine Learning Research 24 (2023) 1-25 Submitted 9/22; Revised 1/23; Published 6/23

Efficient Computation of Rankings from Pairwise Comparisons

M. E. J. Newman mejn@umich.edu
Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, USA

Editor: Tina Eliassi-Rad

Abstract

We study the ranking of individuals, teams, or objects on the basis of pairwise comparisons between them, using the Bradley-Terry model. Estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that provably returns identical results but does so much faster, over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive a number of results regarding its convergence.

Keywords: Ranking, paired comparisons, Bradley-Terry model, iterative algorithms, networks

1. Introduction

The problem of ranking a set of individuals, teams, or objects on the basis of a set of pairwise comparisons between them arises in many contexts, including competitions in sports, chess, and other games, paired comparison studies of consumer choice, and observational studies of dominance behaviors in animals and humans (Zermelo, 1929; Bradley and Terry, 1952; Davidson and Farquhar, 1976; David, 1988; Cattelan, 2012). If a group of chess players play games against one another, for example, how can we rank the players from best to worst based on the outcomes of those games? The outcomes may be contradictory or ambiguous (underdogs sometimes win and strong players sometimes lose), so we adopt a probabilistic model. In the most common version we assign a numerical score s_i to each individual i, and the probability p_ij of i beating j is assumed to be some function of the difference in their scores: p_ij = f(s_i − s_j).
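As a concrete illustration of this setup, the win probability under the logistic choice of f described below can be sketched in a few lines (a minimal sketch; the function name is ours, not the paper's):

```python
import math

def win_probability(s_i, s_j):
    """Probability that player i beats player j when p_ij = f(s_i - s_j)
    and f is the logistic function f(s) = 1/(1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))
```

Equal scores give probability 1/2, and by construction p_ij + p_ji = 1.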
The most popular choice of functional form is the logistic function f(s) = 1/(1 + e^{−s}), which gives

    p_ij = e^{s_i} / (e^{s_i} + e^{s_j}).    (1)

This is the Bradley-Terry model, first introduced by Zermelo (1929) and heavily studied in the years since, particularly following its rediscovery by Bradley and Terry (1952). For convenience, one often introduces the shorthand π_i = e^{s_i}, so that p_ij can also be written as

    p_ij = π_i / (π_i + π_j),    (2)

and we will do that here.

©2023 M. E. J. Newman. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1086.html.

Zermelo (writing in German) referred to the non-negative parameters π_i as Spielstärken, or playing strengths, although they are elsewhere variously called worth parameters, skill parameters, merit parameters, ratings, or simply weights. Following Zermelo, we will call them strengths.

Given the outcomes of a series of pairwise competitions between N competitors, the strengths can be estimated straightforwardly. Commonly one makes a maximum-likelihood estimate. Defining w_ij to be the total number of times i beats j, or zero if i and j never competed, it can be shown that the maximum-likelihood values of the strengths are given by a simple procedure: starting from any convenient initial values, we iterate the equation

    π'_i = [Σ_{j=1}^N w_ij] / [Σ_{j=1}^N (w_ij + w_ji)/(π_i + π_j)]    (3)

until convergence is reached. This algorithm was also first described by Zermelo (1929) and we will refer to it as Zermelo's algorithm. An extraordinary number of papers have been written about it, its variants, its properties, and its applications. Although widely used, however, Zermelo's algorithm is known to be slow to converge (Dykstra, 1956; Hunter, 2004). In this paper we study the alternate iteration

    π'_i = [Σ_{j=1}^N w_ij π_j/(π_i + π_j)] / [Σ_{j=1}^N w_ji/(π_i + π_j)]
    (4)

We show that iteration of this equation solves the same problem and converges to the same solution as Zermelo's algorithm but does so significantly faster, over a hundred times faster in some cases. Given that Eq. (4) is also simple to implement, we know of no reason not to favor it over Eq. (3).

In recent years a number of other authors have also considered alternative and potentially more efficient algorithms for ranking under the Bradley-Terry model. One promising approach employs spectral methods that estimate rankings based on the properties of random walks on the network of directed interactions between individuals (Maystre and Grossglauser, 2015; Negahban et al., 2017; Agarwal et al., 2018). Although they do not directly maximize the likelihood under the Bradley-Terry model, these algorithms can be shown to converge closely to the maximum-likelihood solution. Some versions are quite numerically efficient, though they can also be complex to implement. Minorization-maximization (MM) algorithms, which optimize a minorizing proxy for the likelihood function, can also be applied to ranking problems. For the Bradley-Terry model the appropriate MM algorithm turns out to be exactly equivalent to Zermelo's algorithm and hence offers no speed improvement (Hunter, 2004), but techniques have been suggested for accelerating convergence (Vojnovic et al., 2019) and the MM formulation also provides an elegant route to developing algorithms for generalizations of the model. Perhaps more directly competitive with our approach is one of the simplest of methods: one can fit the Bradley-Terry model using Newton's method applied to the derivative of the likelihood. For small applications with only a few individuals or teams to be ranked this is typically the fastest approach, although for such small cases the difference may be moot.
As the number N of individuals gets larger, however, Newton's method becomes impractical because the time taken per iteration of the algorithm scales as N^3, which quickly becomes prohibitive (Hunter, 2004). For Eq. (4) the time per iteration scales only as the number of pairwise competitions, making this approach faster for larger applications. Maximum-likelihood methods are not the only approach for fitting: one can also adopt Bayesian approaches, although these usually require Monte Carlo estimation and hence are not competitive in terms of speed (Davidson and Solomon, 1973; Caron and Doucet, 2012). Overall, the particular combination of simplicity and speed offered by Eq. (4) makes it an attractive approach for practical applications.

The rest of this paper is organized as follows. In Section 2 we describe and derive Zermelo's original ranking algorithm and the new algorithm proposed here. In Section 3 we prove that the new algorithm converges to the same unique maximum-likelihood solution as Zermelo's algorithm. In Section 4 we briefly discuss a larger family of ranking algorithms of which Zermelo's algorithm and our own are special cases. Within this family, the algorithm of this paper is the fastest to converge and hence it is our primary focus. In Sections 5 and 6 we describe two extensions of our approach, one to maximum a posteriori (MAP) estimation of rankings and the other to a (previously proposed) generalization of the Bradley-Terry model that allows for ties or draws in competitions. In Section 7 we apply our algorithms to a broad selection of example data, both real and synthetic, finding in every case that the algorithm of this paper is faster than Zermelo's, often by a wide margin. In Section 8 we give our conclusions. Some additional technical results are presented in appendices.

2.
Iterative Algorithms and the Bradley-Terry Model

Consider a tournament in which N players or teams play games of some kind against one another in pairs. We will initially assume that no ties or draws are allowed, so there is a clear winner and loser of every game. The case where ties are allowed is treated separately in Section 6. We assume that the probability p_ij that player i beats player j obeys Eq. (1), and we consider the strengths π_i to be a measure of the skill of the players, higher values indicating better players.

Note that the win probabilities p_ij are invariant under multiplication of all the π_i by any constant. One can remove this ambiguity by imposing any convenient normalization condition. Here we fix the geometric mean strength to be 1, which is equivalent to setting Π_i π_i = 1. This choice has the nice effect that the probability p_1 of a player with strength π beating the average player with strength 1 is p_1 = π/(π + 1) and hence π = p_1/(1 − p_1). Thus the strength parameter has a simple interpretation: it is the odds of beating the average player.

2.1 Zermelo's Algorithm

Suppose the tournament consists of a total of M games between pairs of players and let w_ij be the number of times that player i beats player j. Using these data we can make a maximum-likelihood estimate of the strengths π_i as follows. The likelihood of the observed games given the strengths (represented by a matrix W = [w_ij] and vector π = [π_i] respectively) is

    P(W|π) = Π_ij p_ij^{w_ij} = Π_ij [π_i/(π_i + π_j)]^{w_ij},    (5)

so that the log-likelihood is

    log P(W|π) = Σ_ij w_ij log [π_i/(π_i + π_j)] = Σ_ij w_ij log π_i − Σ_ij w_ij log(π_i + π_j).    (6)

Differentiating with respect to π_i for any i and setting the result to zero, we get

    Σ_j w_ij/π_i − Σ_j (w_ij + w_ji)/(π_i + π_j) = 0,    (7)

which can be rearranged to read

    π_i = [Σ_j w_ij] / [Σ_j (w_ij + w_ji)/(π_i + π_j)].
    (8)

In general this equation has no closed-form solution, but it can be solved numerically by simple iteration: one picks a suitable set of non-negative starting values for the π_i (random values are often used) and then computes new values π'_i according to

    π'_i = [Σ_j w_ij] / [Σ_j (w_ij + w_ji)/(π_i + π_j)].    (9)

Iterating this process, it can be proved, subject to certain conditions, that we converge to the global maximum of the likelihood and hence obtain an estimate of the strengths π_i (Zermelo, 1929; Ford, 1957; Hunter, 2004). The values can then be sorted in order to give a ranking of the players, or simply used in their raw form as a kind of rating. The iteration can be performed synchronously (all π_i updated at the same time) or asynchronously (π_i updated one by one in cyclic fashion), but it is generally believed that asynchronous updating is more efficient, since the update of any individual π_i benefits from the improved estimates of previously updated ones. In this paper we use asynchronous updates. Our convergence results in Sections 3 and 4 are also for the asynchronous case. This iterative algorithm, first described by Zermelo (1929), is the standard method for calculating rankings within the Bradley-Terry model and has seen numerous applications over the years in a wide variety of contexts.

2.2 An Alternative Algorithm

In a sense, computing maximum-likelihood estimates for the Bradley-Terry model is a straightforward problem. As discussed in Section 3, the likelihood is concave and many standard convex optimization methods can be applied. Speed, however, is of the essence in practical applications of the model, so considerable effort has been exerted in recent years to find solution methods faster than Zermelo's algorithm (Maystre and Grossglauser, 2015; Negahban et al., 2017; Agarwal et al., 2018; Vojnovic et al., 2019). Some of these are quite complex, but here we propose a very simple approach that also turns out to be highly efficient.
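Zermelo's iteration of Eq. (9), with asynchronous updates, can be sketched as follows (a minimal sketch, ours; w is the matrix of win counts and pi the current strength estimates):

```python
def zermelo_round(w, pi):
    """One asynchronous round of Zermelo's iteration, Eq. (9): each strength
    is replaced in turn by (total wins) / (sum over opponents of
    (w_ij + w_ji)/(pi_i + pi_j)), using the latest values of the others."""
    n = len(pi)
    for i in range(n):
        wins = sum(w[i][j] for j in range(n))
        denom = sum((w[i][j] + w[j][i]) / (pi[i] + pi[j])
                    for j in range(n) if j != i)
        pi[i] = wins / denom
    return pi
```

In practice one repeats the round until the changes in the strengths fall below a chosen tolerance.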
Grouping the terms slightly differently, Eq. (7) can be rewritten as

    Σ_j w_ij π_j/(π_i + π_j) − π_i Σ_j w_ji/(π_i + π_j) = 0,    (10)

which can be rearranged as

    π_i = [Σ_j w_ij π_j/(π_i + π_j)] / [Σ_j w_ji/(π_i + π_j)].    (11)

This suggests a different iterative algorithm for the Bradley-Terry model. Again we choose suitable starting values (for instance at random), then we iterate the form

    π'_i = [Σ_j w_ij π_j/(π_i + π_j)] / [Σ_j w_ji/(π_i + π_j)]    (12)

to convergence. In Section 3 we prove that, like Zermelo's algorithm, this process converges to the global maximum of the likelihood.

One nice feature of this algorithm is that it is transparent from Eq. (12) that π_i = 0 for any individual who loses all their games and π_i = ∞ for any individual who wins all their games. Furthermore, the iteration converges to these values in a single step. The same values are also returned by the standard Zermelo algorithm, but it is less obvious from Eq. (9) that this is true: it takes some work to demonstrate the result for the player who wins every game, and moreover it takes the Zermelo algorithm an infinite number of iterations to reach the correct value instead of just one. This is a special case of the more general finding, which we explore in this paper, that (12) converges faster than Zermelo's algorithm.

3. Convergence

In this section we prove that the iteration of Eq. (12) converges to the global maximum of the likelihood, Eq. (5), from any starting point, whenever a maximum exists. Zermelo proved that the likelihood has only one stationary point for π_i > 0, corresponding to the global maximum, provided the π_i are normalized as discussed in Section 2 and the directed network of interactions (the network with adjacency matrix w_ij) is strongly connected, i.e., there is a directed path through the network from every individual to every other (Zermelo, 1929; Ford, 1957).
If the network is not strongly connected then there are no stationary points, there is no maximum of the likelihood, and hence our problem has no solution. For the moment we will assume, as other authors have done, that the network is strongly connected and hence that there is a maximum of the likelihood, although we show how to relax this requirement in Section 5.

Since any fixed point of the iteration of Eq. (12) corresponds to a stationary point of the likelihood, and since the iteration generates non-negative values of π_i only (given non-negative initial values), it follows that if the iteration converges to a fixed point, that point must be the global maximum. To prove that it converges to a fixed point it suffices to demonstrate that the value of the log-likelihood always increases upon application of Eq. (12) unless a fixed point has been reached, since the log-likelihood cannot increase without bound, being bounded above by the maximum.

We consider the asynchronous version of the iteration of Eq. (12) in which a single π_i is updated at each step, all other π_j remaining the same. The π_i are updated in order until all N have been updated. Consider the step on which a particular π_i is updated. We define a function f(π_i) equal to the sum of the terms in the log-likelihood, Eq. (6), that depend on π_i:

    f(π_i) = Σ_j w_ij log [π_i/(π_i + π_j)] − Σ_j w_ji log(π_i + π_j).    (13)

Noting that log x ≤ x − 1 for all real x > 0 and making the substitution x → x/y, we derive the useful inequality

    log x ≤ log y + x/y − 1    (14)

for all x, y > 0, or equivalently

    log y ≥ log x − x/y + 1,    (15)

with the exact equality holding if and only if x = y. This implies for any π_i and π'_i that

    log [π'_i/(π'_i + π_j)] ≥ log [π_i/(π_i + π_j)] − [π_i/(π_i + π_j)] / [π'_i/(π'_i + π_j)] + 1
                           = log [π_i/(π_i + π_j)] + [(π'_i − π_i)/π'_i] / [(π_i + π_j)/π_j],    (16)

and (14) implies that

    log(π'_i + π_j) ≤ log(π_i + π_j) + (π'_i + π_j)/(π_i + π_j) − 1 = log(π_i + π_j) + (π'_i − π_i)/(π_i + π_j).    (17)

Evaluating Eq. (13) at the point π'_i defined by Eq.
(12) and applying these two inequalities, we find that

    f(π'_i) = Σ_j w_ij log [π'_i/(π'_i + π_j)] − Σ_j w_ji log(π'_i + π_j)
            ≥ Σ_j w_ij log [π_i/(π_i + π_j)] + [(π'_i − π_i)/π'_i] Σ_j w_ij π_j/(π_i + π_j)
              − Σ_j w_ji log(π_i + π_j) − (π'_i − π_i) Σ_j w_ji/(π_i + π_j)
            = f(π_i) + (π'_i − π_i) [ (1/π'_i) Σ_j w_ij π_j/(π_i + π_j) − Σ_j w_ji/(π_i + π_j) ]
            = f(π_i),    (18)

where we have used Eq. (13) again, the term inside the square brackets vanishes because of (12), and the exact equality applies if and only if π'_i = π_i. Thus f(π_i) always increases upon application of Eq. (12), and hence so also does the log-likelihood, unless π'_i = π_i, in which case the log-likelihood remains the same but could still increase when one of the other π_i is updated. Only if π'_i = π_i for all i does the log-likelihood not increase at all, but if this occurs then by definition we have reached a fixed point of the iteration, and hence we have reached the global maximum. This now guarantees convergence of the iterative algorithm of Eq. (12) to the global likelihood maximum.

4. Other Iterative Algorithms

Given the existence of two different iterations, Eqs. (9) and (12), that both converge to the same maximum-likelihood estimate, one might wonder whether there exist any others. In fact, it turns out there is an entire one-parameter family of iterations that includes (9) and (12) as special cases, and all of them converge to the same solution. Of these iterations, Eq. (12) converges most rapidly and hence is our primary focus in this paper, but for the interested reader we discuss the full family briefly in this section. For any α we can rewrite Eq. (10) in the form

    Σ_j w_ij (απ_i + π_j)/(π_i + π_j) − π_i Σ_j (αw_ij + w_ji)/(π_i + π_j) = 0,    (19)

which we can solve by iterating

    π'_i = [Σ_j w_ij (απ_i + π_j)/(π_i + π_j)] / [Σ_j (αw_ij + w_ji)/(π_i + π_j)]    (20)

until convergence is achieved. When α = 1 this procedure is equivalent to Zermelo's algorithm, Eq. (9). When α = 0 it is equivalent to the algorithm presented in this paper, Eq. (12).
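The monotonicity property established in Section 3 can be checked numerically. The following sketch (ours) performs asynchronous updates of Eq. (12) and verifies that the log-likelihood of Eq. (6) never decreases:

```python
import math

def check_monotone(w, pi, rounds=50):
    """Run asynchronous rounds of the iteration of Eq. (12), confirming that
    the log-likelihood of Eq. (6) is non-decreasing after every single update."""
    n = len(pi)

    def loglik(p):
        return sum(w[i][j] * math.log(p[i] / (p[i] + p[j]))
                   for i in range(n) for j in range(n) if w[i][j] > 0)

    ok = True
    prev = loglik(pi)
    for _ in range(rounds):
        for i in range(n):
            num = sum(w[i][j] * pi[j] / (pi[i] + pi[j]) for j in range(n) if j != i)
            den = sum(w[j][i] / (pi[i] + pi[j]) for j in range(n) if j != i)
            pi[i] = num / den
            cur = loglik(pi)
            ok = ok and cur >= prev - 1e-12  # small slack for rounding
            prev = cur
    return ok
```

The small tolerance allows for floating-point rounding near the fixed point, where the theoretical increase is vanishingly small.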
For negative α the iteration does not generate positive values of π_i in general and hence is invalid, but for zero or positive α it gives a whole range of algorithms, all of which converge to the same maximum-likelihood solution as Zermelo's algorithm.

When 0 ≤ α ≤ 1, convergence can be proved using a straightforward generalization of the method of Section 3, as follows. Since the log-likelihood has only a single stationary point, corresponding to the global maximum, and since any fixed point of (20) is a solution of (19) and hence corresponds to a stationary point of the log-likelihood, it follows that if (20) converges to a fixed point at all it must converge to the global likelihood maximum. To show that it converges to a fixed point it suffices, as previously, to show that the log-likelihood always increases upon application of (20) unless a fixed point has been reached. To do this we rewrite Eq. (13) for the terms in the log-likelihood that depend on π_i as

    f(π_i) = α Σ_j w_ij log π_i + (1 − α) Σ_j w_ij log [π_i/(π_i + π_j)] − Σ_j (αw_ij + w_ji) log(π_i + π_j).    (21)

For any π_i, π'_i the inequality (15) implies that

    log π'_i ≥ log π_i − π_i/π'_i + 1 = log π_i + (π'_i − π_i)/π'_i,    (22)

with the exact equality applying if and only if π'_i = π_i. Evaluating Eq. (21) at the point π'_i given by Eq. (20) and applying this inequality along with (16) and (17), we find for 0 ≤ α ≤ 1 that

    f(π'_i) = α Σ_j w_ij log π'_i + (1 − α) Σ_j w_ij log [π'_i/(π'_i + π_j)] − Σ_j (αw_ij + w_ji) log(π'_i + π_j)
            ≥ α Σ_j w_ij [log π_i + (π'_i − π_i)/π'_i]
              + (1 − α) Σ_j w_ij [log (π_i/(π_i + π_j)) + ((π'_i − π_i)/π'_i) / ((π_i + π_j)/π_j)]
              − Σ_j (αw_ij + w_ji) [log(π_i + π_j) + (π'_i − π_i)/(π_i + π_j)]
            = f(π_i) + (π'_i − π_i) [ (1/π'_i) Σ_j w_ij (απ_i + π_j)/(π_i + π_j) − Σ_j (αw_ij + w_ji)/(π_i + π_j) ]
            = f(π_i),    (23)

where we have employed Eq. (21) again, the term in square brackets in the penultimate line vanishes because of Eq. (20), and the exact equality applies if and only if π'_i = π_i. Thus f(π_i) always increases upon application of (20) unless π'_i = π_i. The rest of the proof follows the same lines as in Section 3, and hence convergence to the global likelihood maximum is established.
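The family of Eq. (20) can be sketched as follows (a sketch, ours): α = 1 reproduces Zermelo's update, α = 0 reproduces the update of Eq. (12), and by the result above both converge to the same strengths up to overall normalization.

```python
def family_round(w, pi, alpha):
    """One asynchronous round of the one-parameter family of Eq. (20).
    alpha = 1 is Zermelo's algorithm, alpha = 0 the algorithm of this paper."""
    n = len(pi)
    for i in range(n):
        num = sum(w[i][j] * (alpha * pi[i] + pi[j]) / (pi[i] + pi[j])
                  for j in range(n) if j != i)
        den = sum((alpha * w[i][j] + w[j][i]) / (pi[i] + pi[j])
                  for j in range(n) if j != i)
        pi[i] = num / den
    return pi
```

Since the fixed point is unique only up to a multiplicative constant, comparisons between runs should be made on ratios of strengths (or after normalizing the geometric mean to 1).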
As a corollary, this also provides an alternative proof of the convergence of Zermelo's algorithm (the case α = 1) which is significantly simpler than the original proof given by Zermelo (1929) or the later proof by Ford (1957). The method of proof used here does not extend to the case α > 1, because 1 − α becomes negative and the inequality in (23) no longer follows from (16). It is still possible to prove convergence for α > 1, but the proof is more involved. See Appendix A for details. Numerical measurements, some of which are presented in Section 7, indicate that convergence of the algorithms of this section becomes monotonically slower with increasing α, so that the main algorithm presented in this paper, Eq. (12), which corresponds to the smallest allowed value α = 0, is the fastest, and it is on this case that we concentrate in the remainder of the paper. Some formal results on rates of convergence as a function of α are presented in Appendix B.

5. Prior on the Strength Parameters

Equation (12) provides a complete algorithm for fitting the Bradley-Terry model. In practice, however, pure maximum-likelihood fits such as this can be problematic for this model. In particular, as mentioned in Section 3, a likelihood maximum exists only if the network of interactions is strongly connected. If this condition is not met then the score parameters s_i will diverge, and the algorithm of Eq. (12), and indeed all maximum-likelihood methods for this model, will fail. The root cause of this problem is that the maximum-likelihood fit effectively assumes a uniform (improper) prior on the s_i, which places all but a vanishing fraction of its weight on arbitrarily large values and, when coupled with a network that is not strongly connected, causes divergences. An effective solution is to impose a better-behaved prior on s_i and then compute a maximum a posteriori (MAP) estimate of the scores instead of a maximum-likelihood estimate (MLE).
A range of priors have been proposed for this purpose (Davidson and Solomon, 1973; Caron and Doucet, 2012; Whelan, 2017), but arguably the most natural is a logistic prior. Recall from Section 2 that the probability p_1 of a player with strength π winning against the average player is p_1 = π/(π + 1). In the absence of any evidence to the contrary, we assume this probability to be uniformly distributed between zero and one so that P(p_1) = 1, a least informative or maximum-entropy prior. Then the prior on the scores s = log π is

    P(s) = P(p_1) |dp_1/ds| = π/(π + 1)^2 = 1/[(e^s + 1)(e^{−s} + 1)],    (24)

which is the logistic distribution. Combining this result with Eq. (5), we then get a posterior probability on the strengths that is given, up to a multiplicative constant, by

    P(π|W) ∝ Π_ij [π_i/(π_i + π_j)]^{w_ij} × Π_i π_i/(π_i + 1)^2.    (25)

Maximizing this posterior probability instead of the likelihood regularizes the values of the scores, preventing them from diverging. It also removes the invariance under multiplication of the π_i by a constant and hence eliminates the need to normalize them.

The iterative algorithm of Eq. (12) can be generalized straightforwardly to this MAP estimate. As observed by Whelan (2017), the prior for individual i can be thought of as

    π_i/(π_i + 1)^2 = [π_i/(π_i + 1)] × [1/(π_i + 1)],

which is precisely the probability that i plays two games against the average player (who has π = 1) and wins one of them and loses the other. Thus Eq. (25) can be thought of as the likelihood of a Bradley-Terry model in which we have added two fictitious games for each player, one won and one lost, and we can maximize this likelihood (and hence the posterior of Eq. (25)) using the same algorithm as before, merely adding these extra fictitious games to the data. This also means that our proof of convergence generalizes to the MAP case and that the network of interactions is now strongly connected, so the probability maximum always exists.
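A minimal sketch (ours) of the resulting MAP update: since the fictitious opponent has strength 1, the two fictitious games contribute a term 1/(π_i + 1) to both the numerator and the denominator of the update of Eq. (12).

```python
def map_round(w, pi):
    """One asynchronous round of the MAP version of the iteration of Eq. (12):
    each player gets one fictitious win and one fictitious loss against an
    average player of strength 1, adding 1/(pi_i + 1) to both sums."""
    n = len(pi)
    for i in range(n):
        num = 1.0 / (pi[i] + 1.0) + sum(w[i][j] * pi[j] / (pi[i] + pi[j])
                                        for j in range(n) if j != i)
        den = 1.0 / (pi[i] + 1.0) + sum(w[j][i] / (pi[i] + pi[j])
                                        for j in range(n) if j != i)
        pi[i] = num / den
    return pi
```

Unlike the pure maximum-likelihood iteration, this remains well behaved even when the network of wins is not strongly connected, for example when one player wins every game.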
Alternatively, and perhaps more conveniently, we can derive an explicit algorithm for the MAP case by differentiating Eq. (25) with respect to π_i for any i, which leads to the iteration

    π'_i = [1/(π_i + 1) + Σ_j w_ij π_j/(π_i + π_j)] / [1/(π_i + 1) + Σ_j w_ji/(π_i + π_j)].    (26)

This is the generalization of Eq. (12) to the MAP case. It is completely equivalent to adding the fictitious games and has the same guaranteed convergence. One can also add the same prior to the traditional Zermelo algorithm of Eq. (9), which gives

    π'_i = [1 + Σ_j w_ij] / [2/(π_i + 1) + Σ_j (w_ij + w_ji)/(π_i + π_j)].    (27)

In Section 7 we present the results of numerical experiments on the rate of convergence both of these MAP estimators and of the MLEs, using Eqs. (9), (12), (26), and (27).

6. Ties

Ties or draws can occur in certain types of competition, such as chess and soccer. There are a number of ways to generalize ranking calculations to include ties. The simplest is just to consider a tied game to be half of a win for each of the players. This approach is used for instance in the Elo chess rating system and can be trivially incorporated into our calculations by modifying the values w_ij. A more sophisticated approach, however, incorporates the probability of a tie into the model itself. There is more than one way to do this (Rao and Kupper, 1967; Davidson, 1970). Here we employ the modification of the Bradley-Terry model proposed by Davidson (1970). One again defines strengths π_i for each player, and the probabilities of a win p_ij and a tie q_ij between players i and j are

    p_ij = π_i / [π_i + π_j + 2ν √(π_iπ_j)],    q_ij = 2ν √(π_iπ_j) / [π_i + π_j + 2ν √(π_iπ_j)],    (28)

where ν > 0 is a parameter which controls the overall frequency of ties and which we will estimate by maximum likelihood along with the strengths. Note that when π_i = π_j we have q_ij = ν/(1 + ν) and hence ν = q_ij/(1 − q_ij), so ν can be interpreted as the odds of a tie between evenly matched players.
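The probabilities of Eq. (28) and their basic properties can be sketched as follows (a minimal sketch, ours):

```python
import math

def davidson_probs(pi_i, pi_j, nu):
    """Win and tie probabilities in the Davidson (1970) model, Eq. (28)."""
    tie_term = 2.0 * nu * math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + tie_term
    p_ij = pi_i / denom      # probability that i beats j
    q_ij = tie_term / denom  # probability of a tie
    return p_ij, q_ij
```

By construction p_ij + p_ji + q_ij = 1 and q_ij = q_ji, and for evenly matched players the tie probability is ν/(1 + ν).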
The form (28) satisfies the obvious requirements that p_ij + p_ji + q_ij = 1 and q_ij = q_ji, and also has the intuitive property that the probability of a tie is greatest when the players are evenly matched and vanishes as π_i and π_j become arbitrarily far apart. As with the standard Bradley-Terry model, the probabilities p_ij and q_ij are invariant under multiplication of all π_i by a constant, and again we remove this ambiguity by normalizing them so that Π_i π_i = 1. Davidson (1970) proposed an iterative algorithm for computing maximum-likelihood estimates of the strengths and the parameter ν within this model. Defining w_ij as before to be the number of times i beats j and t_ij = t_ji to be the number of ties, we can write the likelihood of a set of observations W = [w_ij], T = [t_ij] as

    P(W, T|π, ν) = Π_ij p_ij^{w_ij} Π_{i<j} q_ij^{t_ij}.

[...]

Algorithms with α > 1 converge slower than Zermelo's algorithm (the case α = 1), which means these are not normally of interest. We can also prove that convergence becomes monotonically faster with decreasing α down to some point α < 1, meaning that there provably exist algorithms that are faster than Zermelo's algorithm. In general, however, the proof does not extend to α = 0 (the algorithm of this paper), so for the moment the finding that convergence is fastest for α = 0 is a numerical one only.

8. Conclusions

We have presented an alternative to the classic algorithm of Zermelo for computing rankings from pairwise comparisons using fits to the Bradley-Terry model, with or without ties allowed. Like Zermelo's algorithm, the method presented is a simple iterative scheme. We have proved that the iteration always converges to the global maximum of the likelihood and given numerical evidence that it does so faster, typically many times faster, than Zermelo's algorithm. Given that it is also simple to implement, we know of no reason not to favor the algorithm presented here over Zermelo's algorithm.
Acknowledgments

This work was funded in part by the US National Science Foundation under grant DMS-2005899. All empirical data used in this paper are previously published and freely available online.

Appendix A: Proof of Convergence for α > 1

As discussed in Section 4, Zermelo's algorithm and the algorithm of this paper are both special cases of a larger one-parameter family of algorithms given by the iteration of

    π'_i = [Σ_j w_ij (απ_i + π_j)/(π_i + π_j)] / [Σ_j (αw_ij + w_ji)/(π_i + π_j)]

for any α ≥ 0. For 0 ≤ α ≤ 1 the convergence of this iteration to the likelihood maximum can be proved straightforwardly as described in Section 4. For α > 1 the same method of proof does not work, because 1 − α becomes negative and the inequality in (23) no longer follows from (16). It is still possible to prove convergence, but the method of proof is somewhat different, as we now describe.

From (15) we have for any x, y, c > 0

    log(x + c) ≥ log(y + c) − (y + c)/(x + c) + 1 = log(y + c) + (x − y)/(x + c)
              = log(y + c) + y(x − y)/(x(y + c)) + c(x − y)^2/(x(x + c)(y + c))
              ≥ log(y + c) + y(x − y)/(x(y + c)),

which is equivalent to

    log x − log y − (x − y)/x ≥ log [x/(x + c)] − log [y/(y + c)] − [(x − y)/x] / [(y + c)/c],

with the exact equality applying if and only if x = y. Noting that the left-hand side of this inequality is always non-negative by (15), for any α > 1 we then have

    α [log x − log y − (x − y)/x] ≥ (α − 1) [log (x/(x + c)) − log (y/(y + c)) − ((x − y)/x) / ((y + c)/c)],

which can be rearranged to read

    α log x + (1 − α) log [x/(x + c)] ≥ α [log y + (x − y)/x] + (1 − α) [log (y/(y + c)) + ((x − y)/x) / ((y + c)/c)].

Now setting x = π'_i, y = π_i, and c = π_j, multiplying by the positive quantities w_ij, and summing over j, we have

    α Σ_j w_ij log π'_i + (1 − α) Σ_j w_ij log [π'_i/(π'_i + π_j)]
        ≥ α Σ_j w_ij [log π_i + (π'_i − π_i)/π'_i]
          + (1 − α) Σ_j w_ij [log (π_i/(π_i + π_j)) + ((π'_i − π_i)/π'_i) / ((π_i + π_j)/π_j)],

where the exact equality applies if and only if π'_i = π_i. In combination with (17), this is now sufficient to establish the inequality in (23) once again, and hence convergence of the algorithm for α > 1 is assured.
Appendix B: Rate of Convergence

The numerical results of Section 7 show markedly faster convergence for the algorithms of this paper than for the standard Zermelo algorithm. As discussed at the end of Section 7, we do not at present have a proof that convergence is faster, but it is possible to prove that some algorithms within the family defined in Section 4 converge faster than Zermelo's algorithm.

As observed in Fig. 1, the iterative algorithms of this paper show exponential convergence, which is expected: in general, all iterations of the form x' = f(x) converge exponentially, if they converge at all, except in certain special cases that do not apply here. For the family of algorithms in Section 4, the rate of convergence for any given value of the parameter α can be quantified by the factor λ_i(α) by which the distance between the current estimate of π_i and the final maximum-likelihood estimate (MLE) π̂_i decreases when π_i is updated, as π_i approaches π̂_i. Thus

    λ_i(α) = lim_{π→π̂} (π'_i − π̂_i)/(π_i − π̂_i) = [∂π'_i/∂π_i]_π̂,

where the subscript π̂ indicates that the derivative is evaluated at the MLE. For instance, for Zermelo's algorithm (the case α = 1), applying Eq. (3) we have

    λ_i(1) = [Σ_j w_ij] [Σ_j (w_ij + w_ji)/(π̂_i + π̂_j)^2] / [Σ_j (w_ij + w_ji)/(π̂_i + π̂_j)]^2
           = 1 − [1/Σ_j w_ij] Σ_j (w_ij + w_ji) π̂_iπ̂_j/(π̂_i + π̂_j)^2,

where we have employed (3) again to simplify the expression and made use of the fact that π'_i = π_i = π̂_i at the MLE. Assuming once again that the network of interactions represented by w_ij is strongly connected, the π̂_i are, as shown by Ford (1957), strictly positive and finite, which means in turn that the value of λ_i(1) is strictly positive. For other α, however, the value of λ_i(α) can be negative (meaning that convergence to the MLE is oscillatory). This observation will be important in a moment.

The factor by which the RMS error of Fig.
1 decreases over a complete round of updates depends asymptotically on the slowest-decaying π_i and is given by

    λ_max(α) = max_i |λ_i(α)|,

where we take the absolute value to allow for the possibility of negative λ_i. An algorithm with given α asymptotically converges faster than Zermelo's algorithm if λ_max(α) < λ_max(1). Here we demonstrate that this is the case for at least some values of α.

We consider how the value of λ_i(α) changes with α and compute the derivative ∂λ_i(α)/∂α = [∂²π'_i/∂π_i ∂α]_π̂. From Eq. (20) we have

    ∂π'_i/∂α = [Σ_j w_ij/(π_i + π_j)] / [Σ_j (αw_ij + w_ji)/(π_i + π_j)] × (π_i − π'_i),    (37)

where we have used (20) again to simplify. Differentiating with respect to π_i, setting π_i = π̂_i for all i, and noting again that π'_i = π_i at the MLE, we find that

    ∂λ_i(α)/∂α = [Σ_j w_ij/(π̂_i + π̂_j)] / [Σ_j (αw_ij + w_ji)/(π̂_i + π̂_j)] × [1 − λ_i(α)],    (38)

where we have used Eq. (37). The fact that the iteration of Eq. (20) converges to the MLE for all α ≥ 0 implies that λ_i(α) must be strictly less than 1 for all i, and hence (38) is strictly positive, since the sums over j are strictly positive when the π̂_i are strictly positive and finite. At the same time it is also finite, and hence λ_i(α) is increasing in α and continuous for all α ≥ 0.

This now establishes some useful results. First, it implies that λ_i(α) > λ_i(1) for all i when α > 1 (and also that λ_i(α) is positive in this regime). Thus, if the largest value of λ_i(1) occurs for i = µ, then λ_max(α) ≥ λ_µ(α) > λ_µ(1) = λ_max(1). Hence all algorithms with α > 1 converge slower than Zermelo's algorithm. For this reason these algorithms are not normally of practical interest. Second, we also have λ_i(α) < λ_i(1) for all i when 0 ≤ α < 1. Unfortunately, this is not sufficient to establish that λ_max(α) < λ_max(1) in this regime (and hence that these algorithms converge faster than Zermelo's algorithm) because, as mentioned above, it is not guaranteed that λ_i(α) is positive. The value of λ_i(α) for α < 1 can, and in practice often does, become negative.
This means that |λ_i(α)| could be larger than λ_i(1), and indeed it is straightforward to find cases where this occurs. On the other hand, we can prove that there exist some algorithms that are faster than Zermelo's. Given that λ_i(α) is continuous and increasing in α, its value must diminish smoothly and monotonically as α decreases from 1 down to 0. Thus, given that λ_i(1) is strictly positive, one of two things must happen: either λ_i(α) never reaches the line λ_i(α) = −λ_i(1), in which case |λ_i(0)| < λ_i(1), or it does reach this line, in which case there exists some c_i < 1 such that λ_i(c_i) = −λ_i(1). In the latter case, by continuity, |λ_i(α)| < λ_i(1) in the non-vanishing interval c_i < α < 1. Repeating the same argument for all i, we define c = max_i c_i, or c = 0 if |λ_i(0)| < λ_i(1) for all i, and then for all i we have |λ_i(α)| < λ_i(1) in the non-vanishing interval c < α < 1. Now choose any α in this interval and suppose the largest value of |λ_i(α)| occurs for i = ν. Then at this α we have λ_max(α) = |λ_ν(α)| < λ_ν(1) ≤ λ_max(1). Hence all algorithms with c < α < 1 converge faster than Zermelo's algorithm. Algorithms with 0 ≤ α ≤ c may also converge faster than Zermelo's algorithm, and the numerical evidence suggests that they do, but this cannot be proved using the present approach.

Appendix C: Proof of Convergence for the Model with Ties

For the case where ties are allowed, the proof that iteration of Eqs. (35) and (36) converges to the maximum of the log-likelihood (30) follows similar lines to that for the case without ties. Davidson (1970) proved that the likelihood has only a single stationary point with respect to its parameters, corresponding to the global likelihood maximum, provided the π_i are normalized and the network of interactions is strongly connected (with a tie counting as an edge in both directions between the relevant pair of players). Since any fixed point of Eqs.
(35) and (36) corresponds to a stationary point of the likelihood, this implies that if our iteration converges to a fixed point at all then that point is the global maximum. To prove that we converge to a fixed point it suffices to show that the log-likelihood always increases upon application of either Eq. (35) or Eq. (36), unless a fixed point has been reached.

The terms in the log-likelihood of Eq. (30) that depend on π_i can be written in the form

    f(\pi_i) = \sum_j a_{ij} \log \frac{\pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
             - \sum_j a_{ji} \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr),  (39)

where a_{ij} = w_{ij} + \tfrac12 t_{ij} as previously. Applying the inequalities (14) and (15), we have for any π_i and π_i'

    \log \frac{\pi_i'}{\pi_i' + \pi_j + 2\nu\sqrt{\pi_i' \pi_j}}
        \ge \log \frac{\pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
          + \frac{\pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \frac{\pi_i}{\pi_i'} \Bigr)
          + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \sqrt{\pi_i/\pi_i'} \Bigr),

    \log \bigl( \pi_i' + \pi_j + 2\nu\sqrt{\pi_i' \pi_j} \bigr)
        \le \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr)
          + \frac{\pi_i' - \pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
          + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( \sqrt{\pi_i'/\pi_i} - 1 \Bigr).

Evaluating Eq. (39) at the point π_i' defined by Eq. (35) and applying these two inequalities gives

    f(\pi_i') \ge f(\pi_i)
        + \sum_j a_{ij} \biggl[ \frac{\pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \frac{\pi_i}{\pi_i'} \Bigr)
             + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \sqrt{\pi_i/\pi_i'} \Bigr) \biggr]
        - \sum_j a_{ji} \biggl[ \frac{\pi_i' - \pi_i}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
             + \frac{2\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( \sqrt{\pi_i'/\pi_i} - 1 \Bigr) \biggr].

Collecting terms and using the definition of π_i' in Eq. (35), the terms beyond f(π_i) on the right-hand side combine into a non-negative expression that vanishes if and only if π_i' = π_i. Hence f(π_i) always increases upon application of Eq. (35) unless π_i' = π_i, and so therefore does the log-likelihood as well.

The same is also true of the update (36) for the parameter ν. The terms in the log-likelihood that depend on ν can be written

    g(\nu) = \sum_{ij} t_{ij} \log \frac{\nu}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
           - \sum_{ij} w_{ij} \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr).  (40)

For any ν and ν' the inequalities (14) and (15) imply that

    \log \frac{\nu'}{\pi_i + \pi_j + 2\nu'\sqrt{\pi_i \pi_j}}
        \ge \log \frac{\nu}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
          + \frac{\pi_i + \pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \Bigl( 1 - \frac{\nu}{\nu'} \Bigr),

    \log \bigl( \pi_i + \pi_j + 2\nu'\sqrt{\pi_i \pi_j} \bigr)
        \le \log \bigl( \pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j} \bigr)
          + (\nu' - \nu) \, \frac{2\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}.

Evaluating (40) at the point ν' given by Eq. (36) and applying these two inequalities, we get

    g(\nu') \ge g(\nu) + (\nu' - \nu)
        \biggl[ \frac{1}{\nu'} \sum_{ij} t_{ij} \, \frac{\pi_i + \pi_j}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}}
              - \sum_{ij} w_{ij} \, \frac{2\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + 2\nu\sqrt{\pi_i \pi_j}} \biggr],

where the term in square brackets vanishes because of Eq. (36), and the inequality is strict unless ν' = ν. Thus g(ν) always increases upon application of (36) unless ν' = ν, and so therefore does the log-likelihood. The remainder of the proof follows the same lines of argument as in Section 3, and hence convergence of Eqs. (35) and (36) to the unique likelihood maximum is established.

References

A. Agarwal, P. Patil, and S. Agarwal. Accelerated spectral ranking. Proceedings of Machine Learning Research, 80:70–79, 2018.
B. Ball and M. E. J. Newman. Friendship networks and social status. Network Science, 1:16–30, 2013.
R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39:324–345, 1952.
F. Caron and A. Doucet. Efficient Bayesian inference for generalized Bradley-Terry models. Journal of Computational and Graphical Statistics, 21:174–196, 2012.
M. Cattelan. Models for paired comparison data: A review with emphasis on dependent data. Statistical Science, 27:412–433, 2012.
H. A. David. The Method of Paired Comparisons. Griffin, London, 2nd edition, 1988.
R. R. Davidson and D. L. Solomon. A Bayesian approach to paired comparison experimentation.
Biometrika, 60:477–487, 1973.
R. R. Davidson. On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. Journal of the American Statistical Association, 65:317–328, 1970.
R. R. Davidson and P. H. Farquhar. A bibliography on the method of paired comparisons. Biometrics, 32:241–252, 1976.
O. Dykstra, Jr. A note on the rank analysis of incomplete block designs. Biometrics, 12:301–306, 1956.
L. R. Ford, Jr. Solution of a ranking problem from binary comparisons. American Mathematical Monthly, 64(8):28–33, 1957.
M. T. Hallinan and W. N. Kubitschek. The effect of individual and structural characteristics on intransitivity in social networks. Social Psychology Quarterly, 51:81–92, 1988.
D. R. Hunter. MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32:384–406, 2004.
L. Maystre and M. Grossglauser. Fast and accurate inference of Plackett-Luce models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Proceedings of the 2015 Conference on Advances in Neural Information Processing Systems, pages 172–180, Cambridge, MA, 2015. MIT Press.
S. Negahban, S. Oh, and D. Shah. Rank centrality: Ranking from pair-wise comparisons. Operations Research, 65:266–287, 2017.
N. Pavlichenko and D. Ustalov. IMDB-WIKI-SbS: An evaluation dataset for crowdsourced pairwise comparisons. Preprint arXiv:2110.14990, 2021.
P. V. Rao and L. L. Kupper. Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62:194–204, 1967.
J. R. Udry, P. S. Bearman, and K. M. Harris. National Longitudinal Study of Adolescent Health, 1997. This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J. Richard Udry, Peter S.
Bearman, and Kathleen Mullan Harris at the University of North Carolina at Chapel Hill, and funded by grant P01 HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations. Special acknowledgment is due Ronald R. Rindfuss and Barbara Entwisle for assistance in the original design. Information on how to obtain the Add Health data files is available on the Add Health website (https://www.cpc.unc.edu/addhealth). No direct support was received from grant P01 HD31921 for this analysis.
J. A. R. A. M. van Hooff and J. A. B. Wensing. Dominance and its behavioral measures in a captive wolf pack. In Harry Frank, editor, Man and Wolf, pages 219–252. Junk Publishers, Dordrecht, 1987.
C. Vilette, T. Bonnell, P. Henzi, and L. Barrett. Comparing dominance hierarchy methods using a data-splitting approach with real-world data. Behavioral Ecology, 31:1379–1390, 2020.
M. Vojnovic, S.-Y. Yun, and K. Zhou. Accelerated MM algorithms for ranking scores inference from comparison data. Preprint arXiv:1901.00150, 2019.
J. T. Whelan. Prior distributions for the Bradley-Terry model of paired comparisons. Preprint arXiv:1712.05311, 2017.
R. Yurko, S. Ventura, and M. Horowitz. nflWAR: A reproducible method for offensive player evaluation in football. Journal of Quantitative Analysis in Sports, 15:163–183, 2019.
E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29:436–460, 1929.