# distributional_rank_aggregation_and_an_axiomatic_analysis__e270086b.pdf Distributional Rank Aggregation, and an Axiomatic Analysis Adarsh Prasad* ADARSH@CS.UTEXAS.EDU Harsh Pareek* HARSHP@CS.UTEXAS.EDU Pradeep Ravikumar PRADEEPR@CS.UTEXAS.EDU Department of Computer Science, The University of Texas, Austin, TX 78712, USA The rank aggregation problem has been studied with varying desiderata in varied communities such as Theoretical Computer Science, Statistics, Information Retrieval and Social Welfare Theory. We introduce a variant of this problem we call distributional rank aggregation, where the ranking data is only available via the induced distribution over the set of all permutations. We provide a novel translation of the usual social welfare theory axioms to this setting. As we show this allows for a more quantitative characterization of these axioms: which then are not only less prone to misinterpretation, but also allow simpler proofs for some key impossibility theorems. Most importantly, these quantitative characterizations lead to natural and novel relaxations of these axioms, which as we show, allow us to get around celebrated impossibility results in social choice theory. We are able to completely characterize the class of positional scoring rules with respect to our axioms and show that Borda Count is optimal in a certain sense. 1. Introduction We study the problem of rank aggregation: a set of agents provide their ranked preferences over a set of alternatives, and we wish to aggregate them into a consensus ranking. Due in part to its generality and importance, this problem has been studied with varying desiderata in varied communities. In Theoretical Computer Science, this has been framed as a combinatorial optimization problem of finding the ranking that minimizes the average distance (for some appropriate distance measure such as the Kemeny distance) to the set of Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s). agent rankings. These resulting problems are however typically NP-hard, and research effort has been devoted to developing polynomial time algorithms with approximation guarantees (Ailon, 2010; Dwork et al., 2001a;b). These can also be cast as maximizing the average utility (in contrast to minimizing the average distance); Pivato (2013); Conitzer & Sandholm (2005) in particular showed that a large class of existing voting rules could also be interpreted as such average utility maximizers. Lu & Boutilier (2010) design a suitable utility measure that in their context captured the facet of uncertainty about the availability of alternatives. In the field of Information Retrieval, the problem of rank aggregation arises as a meta-algorithm where for instance we are given the output rankings from multiple search engines, and we are to combine these into a single ranking (Dwork et al., 2001a;b). Also in this domain, Cheng et al. (2010) cast the aggregation task in a conditional or supervised setting, where we are given a training set of (feature-vector, ranking) pairs, and the task is to learn a map from the feature vector space to the set of permutations. In Sociometrics, social welfare theories by Condorcet (1785); Arrow (1951) specify normative properties (axioms) that any reasonable rank aggregation should satisfy. Some results (including the celebrated impossibility theorem of Arrow (1951)) then state that some subsets of axioms cannot be jointly compatible. Several voting rules such as Condorcet methods, Borda Count, Runoff procedures, etc. have then been suggested, which satisfy different subsets of these axioms. In Statistics, the agents preferences are viewed as i.i.d. samples from a statistical model and the problem is reformulated to that of learning a distribution over permutations and of computing the Maximum Likelihood Estimate (MLE) of the parameters; the final aggregation returned is then typically just the mode of this distribution. The focus in this line of work has then been on investigating efficiently learnable models for different types of in- *Denotes equal contribution by the authors Distributional Rank Aggregation, and an Axiomatic Analysis puts (Critchlow et al., 1991; Lebanon & Lafferty, 2002a;b; Hunter, 2004; Lebanon & Mao, 2008; Guiver & Snelson, 2009; Negahban et al., 2012; Soufiani et al., 2013). There have also been other proposals that go beyond this MLE based approach. In the social choice domain, where the aim is to pick a single winner, Young (1988) proposed to select a winning alternative that is most likely to be the best (i.e., top-ranked in the true ranking) and provided formulas to compute it for three alternatives. This idea has been formalized and extended by Procaccia et al. (2012) to choose a given number of alternatives with highest marginal probability under the Mallows model. In recent work, Soufiani et al. (2014) considered the estimation of the Mallows model (P( |W) φℓKT ( ,W )), and outputting a point estimate of the distribution based on statistical decision theoretic considerations. The interesting aspect of this work was that they then derived constraints on the dispersion parameter φ of the Mallows model for their estimator to satisfy different social-choice theoretic axioms. We consider rank aggregation from the computer science and social-welfare theoretic viewpoint: and wish to aggregate an arbitrary set of ranking preferences over a set of alternatives into a consensus ranking over the entire set of alternatives. We note that we focus on the more difficult social welfare problem of providing a ranking over the entire set of alternatives, in contrast to the simpler social choice problem of outputting only the winner, i.e. the top position, in the aggregation. In many cases however, we do not have access to the specific agent or voter identities associated with a particular preference, as for instance with data from a secret ballot. Even given access to such voter identities, the key normative property (i.e. social choice theoretic axiom) of anonymity entails that the consensus ranking not make use of this information. Accordingly, one might consider a rank aggregation algorithm that has access to (or makes use of) only the histogram of the ranking preferences, or in other words, the empirical distribution over the set of all possible rankings as entailed by the given set of agent ranking preferences. We term this problem as distributional rank aggregation: given any distribution over the set of all possible rankings over a set of alternatives, output a consensus ranking, that in effect serves as a pointestimate of this distribution. (We note that this is distinct from the statistical viewpoint of assuming preferences are i.i.d. samples from some distribution, and estimating this distribution.) Such distributional rank aggregation is also natural when aggregating preferences of a very large population, where we would expect almost all possible rankings to be present in differing relative proportions, and it would be natural to collate these preferences into a distribution over the set of all possible rankings over the set of alternatives. The first contribution of this paper is in translating the set of normative axioms studied in the social choice literature to our distributional ranking setting. As we show, our more general distributional setting allows to quantitatively characterize these axioms: which addresses a common problem in social choice practice where the qualitatively stated social choice theoretic axioms are frequently misunderstood or misinterpreted. This also provides a potential way to understand the relationships between the varied social welfare axioms. We also provide a counterpart of a classical impossibility theorem stating that it is impossible to satisfy all of a set of key axioms simultaneously, but where our quantitative characterization of the axioms allows a very simple proof. Another key advantage of a quantitative characterization of the axioms is in potentially relaxing them. In our second key contribution, we provide precisely such a novel quantitative characterization of approximately satisfiable or relaxed variants of the set of social choice axioms. In our third key contribution, we leverage these relaxed axiom variants to finesse the celebrated impossibility theorem(s): we show that there exist distributional rank aggregation algorithms that satisfy the relaxed variants of all of a set of key axioms. We build towards this in the following stages: we first deconstruct the set of normative axioms and derive an equivalent characterization in terms of certain margin-like quantities. We then focus on a particular class of rank aggregation algorithms known as positional scoring rules, and exhaustively characterize which members of this class satisfy the set of exact axioms as well as their relaxed counterparts. As we then show, there exist members of this family that do satisfy the relaxed variants of all of a set of key axioms, thus allowing us to finesse a key impossibility theorem. Our development leads us to a surprising result: that Borda Count has the minimum worst-case margin with respect to approximate satisfiability for certain axioms over all distributions. For specific distributions, our framework also allows us to identify positional losses which can outperform Borda and we present experiments to this effect. 2. Problem Setup Consider a set of alternatives or a label set X of size n; without loss of generality, we will assume this set is X := [n] = {1, . . . , n}. A total ordering over the label set is given by a permutation σ over the set [n], so that σ(x) = j indicates that the position of label x is j in σ. Let Sn be the set of all permutations. We also frequently refer to permutations as votes or rankings in this paper. We will refer to the alternatives or labels with alphabets {x, y}. The alternative x being preferred to y in a permutation σ is equivalent to having σ(x) < σ(y). Distributional Rank Aggregation, and an Axiomatic Analysis Consider any distribution P over the set of permutations Sn. Let the marginal probability of x being ranked first be denoted by P 1(x) = P σ:σ(x)=1 P(σ) and the marginal prob- ability of x being ranked above y be denoted as Px 0; σ(x) < σ(y), then Pareto-efficiency is satisfied iff σ P (x) < σ P (y). (D5) Independence of irrelevant alternatives (IIA): If voters change their preferences in a way that preserves the relative positions of x and y in each vote, the marginal probability of x being preferred over y remains constant. Let Px 0 and P(σ) = Q(σ) for σ = σ1, σ2 then, monotonicity is satisfied iff σ P (x) σ Q(x). (D8) Consistency: Consistency requires that if an alternative wins in two separate sets of votes, it should also win in the (multiset) union of those sets. For distributions the union of the sets is given as (P + Q)/2 and we have the condition: Given two distributions P and Q such that σ P (x) = 1, σ Q(x) = 1, then consistency is satisfied iff σ P +Q (D9) Majority rule: If an alternative x ranks at top of strictly greater than half of the votes, then x must win. Let P 1(x) = P σ:σ(x)=1 P(σ) measure the marginal probability of x being ranked first. Then, majority rule is satisfied iff P 1(x) > 1 2 σ P (x) = 1 (D10) Condorcet Criterion: Given a distribution P such that if Px 1 2 y X\{x} then the Condorcet criterion is satisfied iff σ P (x) = 1. Observe that, if the Condorcet criterion is satisfied, then Majority rule is also satisfied. This translation is a novel contribution of this paper. The following theorem, which follows from the above discussion, summarizes the equivalence of these axioms to those from social welfare theory: Theorem 1 (Equivalence of axioms). If voting data is accessible only via its distribution over Sn, the Social Welfare Axioms (S1-S10) are equivalent to the axioms (D1-D10) introduced above. Since non-dictatorship and transitivity are implicit in the distributional framework, Arrow s Impossibility Theorem states that a procedure cannot satisfy Universality, Pareto and IIA simultaneously. However, the quantitative nature of these axioms suggests that they can be relaxed and we will do so in the next section. 3.2. Relaxations of Social Welfare Axioms We see that in the distributional setting, most axioms can be interpreted in terms of marginals of the distribution P, and represent constraints on σ P based on these marginals. This probabilistic interpretation of the axioms motivates us to account for noise in the observed rankings by introducing a margin-like notion and go beyond impossibility results. Distributional Rank Aggregation, and an Axiomatic Analysis Consider for example, IIA. Let Mγ(x, y) = {P|Px 0, then sign σ Q1(x) σ Q1(y) = sign σ Q2(x) σ Q2(y) . Note that ϵ = 0 corresponds to the usual IIA. Similarly, the Majority Rule axiom can be parameterized in terms of marginal probability of an alternative x being ranked at the top, P 1(x) = P σ:σ(x)=1 P(σ) = γ . As this marginal probability goes to 1, a reasonable rank aggregation procedure would place x at the top. This intuition motivates the following definition: Definition 3. The ϵ-Majority Rule axiom is satisfied, iff for all P : P 1(x) 1 2 + ϵ ; ϵ > 0, σ P (x) = 1. The Condorcet Criterion also allows for a similar extension: Definition 4. ϵ-Condorcet Rule is satisfied, if for all P : Px 0 y X\{x}, σ P (x) = 1. While it is also possible to relax the other axioms in a similar manner, it is not as natural; for instance, we expect Pareto to always be satisfied for any reasonable rank aggregation procedure. In addition, as we will see in Section 4, the above relaxations are sufficient to ensure that a ranking procedure satisfying all our axioms exist. Note that the Condorcet criterion is stronger than Pareto in that Condorcet implies Pareto. It is also possible to strengthen the Condorcet criterion further by dropping the y requirement. We call this the ϵ-Strong Condorcet criterion: Definition 5. ϵ-Strong Condorcet Rule is satisfied, if for all P : Px 0, then σ P (x) < σ P (y). Thus, if x is ranked above y in a significant fraction of the votes, x should be ranked above y in the aggregation. We note that this may not always be desirable, due to the following proposition: Proposition 1. For ϵ < 1/6, ϵ-Strong Condorcet is incompatible with universality. We also have: Proposition 2. ϵ-Strong Condorcet ϵ-IIA. 3.3. Comparative Analysis of Axioms While the previous sections provided a quantitative characterization of social-choice theoretic axioms, as well as suitable relaxations of these, one might ask if there were commonalities between these different axioms. In this section, we investigate this question further. Before doing so, we first obtain a optimization based characterization of any distributional rank aggregation procedure. Consider the following class of distributional rank aggregation procedures: σ P = argmin σ Sn g(σ, P) (2) where g : Sn Pn 7 R, and Pn is the set of all distributions on Sn. This characterization is WLOG as: Proposition 3. Any distributional rank aggregation procedure can be expressed in the form (2) for some function g. We now define two quantities for any pair of alternatives x and y based on g and 0 γ 1. Lγ(g; x, y) = min P :Pxσ(y) g P (σ) Uγ(g; x, y) = max P :Pxσ(y) g P (σ) Thus, Lγ(g; x, y) and Uγ(g; x, y) are the lower and upper bounds for the quantity in brackets. We now show how most of the axioms can be re-written in terms of these two quantities. Proposition 4. g satisfies Pareto-efficiency iff Uγ(g; x, y) < 0, x, y X for γ = 1 Proposition 5. g satisfies IIA iff Lγ(g; x, y), Uγ(g; x, y) > 0, γ : 0 γ 1 and x, y X, Proposition 6. g satisfies the Condorcet criterion iff Uγ(g; x, y) < 0, for 1 2 < γ 1, x, y X. We define another quantity for any alternative x based on g and 0 γ 1 U 1 γ(g; x) = max P :P 1(x)=γ min σ:σ(x)=1 g P (σ) min σ:σ(x) =1 g P (σ) Proposition 7. g satisfies Majority Rule iff U 1 γ(g; x) < 0, for all 1 2 < γ 1, x X. Distributional Rank Aggregation, and an Axiomatic Analysis Similar conditions can be obtained for the relaxed definitions of axioms discussed before. Consistency and monotonicity both depend on the structure of g in ways that we believe cannot be captured as easily by the quantities Lγ(g; x, y) and Uγ(g; x, y) and we leave these for future work. 3.4. An Impossibility Theorem for Distributional Rank Aggregation We now present an impossibility theorem regarding the ability of any distributional rank aggregation procedure to satisfy both Universality and Pareto axioms simultaneously. Recall that Arrow s impossibility theorem in our setting states that no procedure satisfies Pareto, IIA and Universality. This result then provides an alternative intuitive proof of Arrow theorem for the special case of g continuous in P. Theorem 2. For n 2, if g(σ, P) in (2) is a continuous function of P for each fixed σ, both Universality and Pareto cannot be satisfied simultaneously. The continuity property is desirable for computational purposes and procedures used in practice for rank aggregation use a continuous g. Since Pareto is an important condition that we do not wish to relax, the above result implies that we must abandon the universality assumption, i.e. we must be content with procedures that sometimes return multiple aggregations. Regarding Arrow s impossibility theorem for general g, we first note that the usual proofs of Arrow s theorem which rely on identifying pivotal voters fail completely in this framework. In fact, Fishburn (1970) gives a possibility result in the case of infinite voters, which our framework captures. However, the proof there is not constructive and the g involved will quite likely be pathological it will certainly not be continuous in P. We hope to investigate this further in future work. 4. Positional Scoring Losses The impossibility theorems derived in the previous section state that there are no distributional rank aggregation procedures that simultaneously satisfy a set of key axioms. But would the impossibility theorem continue to hold if we only need satisfy our relaxed variants of the exact axioms? We can ask a more general and more difficult question: for each axiom, and its relaxed variant, can we exhaustively characterize which rank aggregation procedures would satisfy or not satisfy these? We consider a simplification of the above general question, and ask for such an exhaustive characterization for a simpler family of rank aggregation procedures known as posi- tional scoring rules (Kenneth J. Arrow & Suzumura, 2002, Chapter 7). Recall from Section 3.3 that any distributional rank aggregation function can be expressed in the form in (2) as the the minimizer of some loss functional g(σ, P). Let us first consider the following specialization of these loss functionals that are commonly used in varied computer scientific domains. Specifically, for any bivariate loss function ℓ: Sn Sn 7 R that measures the discrepancy between two rankings, consider the loss functional g(σ, P) set to the expected value of ℓover P. We then obtain the following rank aggregation procedure: σ ℓ,P = argmin σ Sn g(σ, P) = argmin σ Sn Eσ P [ℓ(σ, σ )] (3) In general, ℓ(σ , σ) can be any bivariate function but the following are of special interest: ℓ0 1(σ, σ ) = 1[σ = σ ]: This corresponds to the zeroone distance between permutations. The corresponding σ ℓ0 1,P is then the mode of the distribution P. Statistical and machine learning approaches to the rank aggregation problem fit a (unimodal) distribution to the P and then return the mode of the estimated distribution as a point estimate. Thus, these techniques implicitly use the zero-one loss. ℓKT (σ, σ ) = P (x,y) [n] 1[sign(σ(x) σ(y)) = sign(σ (x) σ (y))]: This loss measures the Kendall Tau distance between two permutations, i.e. the number of inversions between them. The minimization problem for this loss has been shown to be NP-hard and this loss has been studied widely in the Theoretical Computer Science literature. ℓ(σ, σ ) = P x [n] |σ(x) σ (x)|: This loss measures the Spearman-footrule distance and is used as an approximation to the Kendall-Tau. Positional scoring methods assign a score to each alternative in a permutation based on its position, these scores are aggregated across permutations and the final aggregation is obtained by sorting the alternatives based on their cumulative scores. Equivalently, Definition 6 (Positional Scoring Loss). A loss ℓh is a Positional Scoring Loss iff it can be decomposed as ℓh(σ, σ ) = P x h(σ (x)).σ(x), where h : [n] 7 R Using Equation (3), the corresponding σ P,ℓh can be written as: σ P,ℓh = argmin σ Sn Eσ P [ℓh(σ, σ )] = argmin σ Sn x Eσ P [h(σ (x))].σ(x) Distributional Rank Aggregation, and an Axiomatic Analysis Name h : [n] 7 R Ωh ωh ω1 h ϵ for IIA/Cond. ϵ for Maj. Borda Count h(i) = n i n 1 1 1 1/2 1/n 1/2 1/n Plurality Rule h(i) = 1 if i = 1, 0 else 1 0 1 1/2 0 Anti-Plurality Rule h(n) = 0; h(i) = 1 i = n 1 0 0 1/2 1/2 Log Rule h(i) = log(i) log(n) log( n n 1) log(2) log(n) 2 log(n2/(n 1)) log(n) 2 log(n2/(n 1)) Squared Rule h(i) = i2 n2 1 3 3 n2 4 2(n2+2) n2 4 2(n2+2) Table 1. Different Positional Scoring Rules and their key properties = argmin σ Sn x f P (x).σ(x) (4) where f P : X 7 R is given by f P (x) = Eσ P [h(σ (x))] is the cumulative scoring function. Proposition 8. Given f P (x) = Eσ P [h(σ (x))], then σ P,ℓh is the permutation achieved by sorting the alternatives in descending order of f P (x) This proposition also shows that the rank aggregation using Positional Scoring rules can be computed efficiently. Table 1 give examples of some well-known positional scoring rules, their corresponding h-functions, and their keyproperties. 4.1. Axiomatic Analysis of Positional Scoring Rules Non-dictatorship, Transitivity and Anonymity are implicit in the distributional framework. Positional scoring rules are continuous in P, so Theorem 2 applies and we will not require Universality. We define three properties of interest for any positional loss function. These quantities play a key role in establishing several axioms: Ωh = max {i,j [n]|i1 h(1) h(j) This is the minimum variation in possible scores for h with respect to the first position. This is used in analyzing the Majority Rule. We observe the following: For n = 2, Ωh = ωh = ω1 h. For n > 3, Ωh ωh with equality holding iff h : [n] 7 R is a constant function. Proposition 9 (IIA). No positional scoring loss ℓh satisfies IIA exactly. Proposition 10 (ϵ IIA). A positional scoring loss ℓh satisfies ϵ IIA iff ϵ > 1 2 Ωh ωh Ωh+ωh Proposition 11 (Pareto Efficiency). A positional scoring loss ℓh satisfies Pareto-efficiency exactly iff h : [n] 7 R is strictly monotonically decreasing. Proposition 12 (Monotonicity). A positional scoring loss ℓh satisfies Monotonicity iff h : [n] 7 R is non-increasing. Proposition 13 (Consistency). Every positional loss function ℓh satisfies consistency. Proposition 14 (Majority rule). A positional scoring loss ℓh satisfies Majority rule iff max(Ωh, ωh) ω1 h = 0 and ω1 h + max(Ωh, ωh) > 0 Proposition 15 (ϵ-Majority rule). A positional scoring loss ℓh satisfies ϵ Majority Rule iff ϵ > 1 2 max(Ωh, ωh) ω1 h max(Ωh, ωh)+ω1 h Proposition 16 (Condorcet Criterion). No positional scoring loss ℓh satisfies the Condorcet Criterion Proposition 17 (ϵ Strong Condorcet Criterion). A positional scoring loss ℓh satisfies the ϵ Strong Condorcet Criterion iff ϵ > 1 2 Ωh ωh Ωh+ωh Recall from Proposition 2, that ϵ-Strong Condorcet implies ϵ-IIA, consistent with propositions 10 and 17. Thus, we have the same ϵ for both IIA and Condorcet for positional loss functions and they are shown together in Table 1. 4.2. A Possibility Theorem with Relaxed Axioms Given the theory above, we can now quantify the extent to which a given positional loss satisfies each of the approximate axioms. A natural first question is whether any loss satisfies our relaxed axioms. Theorem 3. The set of positional loss functions which admits the following axioms is non-empty: exact versions of Pareto and Monotonicity, relaxed axioms ϵ-IIA, ϵ-Strong Condorcet, ϵ -Majority Rule A second question is whether we can design losses that satisfy the approximate axioms to the greatest extent possible. The following theorems show a surprising result, that the Borda Count is the optimal positional ranking function in a certain sense. Distributional Rank Aggregation, and an Axiomatic Analysis 0.5 0.6 0.7 0.8 0.9 1 0 Weight on Center 1 Prob. of Success Plurality Log Borda Squared Anti Plurality (a) Experiment 1 0.5 0.6 0.7 0.8 0.9 1 0 Weight on Center 1 Prob. of Success Plurality Log Borda Squared Anti Plurality (b) Experiment 2 Figure 1. Performance of different positional scoring rules on the mixture of Mallows models Lemma 1. For any positional scoring loss ℓh that satisfies Pareto, Ωh ωh n 1, where n is the number of alternatives. Theorem 4. For any fixed n, Borda count is optimal w.r.t. the ϵ-Strong Condorcet condition and ϵ-IIA, i.e. has the least ϵ among all positional loss functions. This fact seems to have been noted previously in the literature. Saari (1990) states the Borda method is the unique positionalist method to minimize the kinds and number of paradoxes that can occur . However, the justification given in (Saari, 1990) uses the much more complex machinery that our relaxation-based approach. 5. Experiments Theroem 4 showed that Borda Count is optimal in a worstcase sense in satisfying certain relaxed axioms. However, this does not mean that it dominates all other positional losses over all distributions and we present experiments to demonstrate this fact. Our framework shows that the key quantities to take into account when designing a positional loss are Ωh and ωh. For a non-increasing h, h can be scaled so that Ωh is fixed. Then ωh, intuitively the smallest (discrete) gradient of h, is the key quantity to compare among losses. A key factor in the following discussion will be the position in the list where this minimum value ωh is achieved. Our discussion will be terse for lack of space and we discuss the results in detail in Appendix B. Setup. We consider the following losses: Borda count where h is linear, a Convex loss log, a concave loss Square and the limiting convex and concave losses Plurality and Anti Plurality. The convex losses achieve minimum ωh = h(n) h(n 1) at the bottom of the list and thus pay more attention to relative positions at the top. The concave losses achieve minimum ωh = h(1) h(2) at the top of the list and thus pay more attention to relative positions at the bottom. See Table 1 for details on the losses. Let {A, B, C, D, E} be the set of alternatives X, n = 5. A Mallows model is given by P( |W) φℓKT ( ,W ), where W is the central permutation, φ (0, 1] is the dispersion parameter and ℓKT is the Kendall-Tau distance. A higher φ makes the distribution more concentrated around the central permutation. We use a mixture of two Mallows models for our experiments. Let Z1 and Z2 be the two central permutations and w and 1 w be their weights in the mixture. We perform two experiments. In Experiment 1, we fix centers Z1 = {D, E, A, B, C} and Z2 = {B, C, D, E, A}, while in Experiment 2 we fix centers Z1 = {A, B, C, D, E} and Z2 = {B, C, D, E, A}. Then, for φ = 0.8, we vary w from 0.51 to 1.0. Thus, in both experiments, we put higher weight on Z1 having σ(A) < σ(B) and as we increase the weight, we are effectively increasing Px