# rums_from_headtohead_contests__9a9934b4.pdf

RUMs from Head-to-Head Contests

Matteo Almanza 1 2 Flavio Chierichetti 3 Ravi Kumar 4 Alessandro Panconesi 3 Andrew Tomkins 4

Random utility models (RUMs) encode the likelihood that a particular item will be selected from a slate of competing items. RUMs are wellstudied objects in both discrete choice theory and, more recently, in the machine learning community, as they encode a fairly broad notion of rational user behavior. In this paper, we focus on slates of size two representing head-to-head contests. Given a tournament matrix M such that Mi,j is the probability that item j will be selected from {i, j}, we consider the problem of ﬁnding the RUM that most closely reproduces M. For this problem we obtain a polynomial-time algorithm returning a RUM that approximately minimizes the average error over the pairs. Our experiments show that RUMs can perfectly represent many of the tournament matrices that have been considered in the literature; in fact, the maximum average error induced by RUMs on the matrices we considered is negligible ( 0.001). We also show that RUMs are competitive, on prediction tasks, with previous approaches.

1. Introduction

Random utility models (RUMs) are perhaps the most important model in discrete choice (Train, 2003). In full generality, they posit that a user selecting an item from some slate of choices has a personal vector encoding the value of all items in the universe. The user behaves rationally by selecting the available item in the slate of the highest value. As each user may have a different value vector, the RUM model allows a wide range of effects. There are no known efﬁcient learning algorithms for RUMs.

In pairwise choice, the setting is the same, but the user must choose between only two alternatives, rather than from an

1Algorand Labs 2This work was carried out while the author was at Sapienza University of Rome. 3Sapienza University of Rome 4Google Mountain View. Correspondence to: Matteo Almanza <almanza@di.uniroma1.it>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

arbitrary slate of alternatives as in the full model. Pairwise choice arises whenever head-to-head competitions occur, as in two-player games, two-alternative forced-choice testing (Bogacz et al., 2006), or online experiences that compare an item with an alternative. The pairwise choice model is well studied, with a number of approaches that learn over either subsets of the class of RUMs, or incomparable models (Chen & Joachims, 2016; Adams et al., 2010; Makhijani & Ugander, 2019; Veerathu & Rajkumar, 2021).

Our main technical ﬁndings are general RUM learning algorithms for pairwise choice. First, we introduce RIPPLE, for Randomized Interior Point Permutation LEarner. We show that for pairwise choice, RIPPLE is guaranteed to return a RUM that attains almost the best possible average error over all RUMs, in polynomial time. The polynomialtime guarantee requires solving an exponential-size dual linear program using the ellipsoid method, which is unlikely to be efﬁcient in practice. Hence, we also introduce RUMRUNNER, a practical variant which is also guaranteed to return a near-optimal RUM. RUMRUNNER uses a separating hyperplane heuristic that performs efﬁciently in our experimental evaluation, but unlike RIPPLE it does not have a guarantee of polynomial running time. While previous results in this area require heuristics to optimize non-concave likelihoods, often requiring multiple runs to escape possible poor local optima, RUMRUNNER guarantees convergence to a solution whose value is ϵ-close to the optimum, and contains no hyper-parameters to tune. In empirical evaluations we observe that RUMRUNNER is neither subject to poor local optima nor has stability issues.

Our approach. A RUM may be characterized as a distribution over users, each of whom is represented by a vector giving the value of each item in the universe. Given a pair of alternatives, the likelihood that the ﬁrst element is selected is the likelihood that a value vector drawn from the distribution assigns a higher score to the ﬁrst item than the second. Since the decision depends only on which score is higher, we may represent these value vectors as permutations of {1, . . . , n} for a universe of n items, which simpliﬁes the RUM to a distribution over permutations.

Our approach begins by writing a large linear program (LP) in which a variable encodes the weight of each permutation in the RUM. The LP contains constraints for each pair {i, j} of elements asking that the total probability assigned

RUMs from Head-to-Head Contests

to permutations with i j should have at most a given error from the true target probability. The objective of the LP then is to minimize the total approximation error.

This LP has exponentially many variables, but its dual has only polynomially many variables (albeit with exponentially many constraints). The structure of the dual allows a separating oracle to be constructed by approximately solving an instance of the minimum Feedback Arc Set (FAS) problem. This in turn gives a polynomial-size set of permutations capable of additively approximating the best possible RUM, and the original LP can now be solved in polynomial time to recover the approximating solution.

Empirical ﬁndings. We present a comprehensive set of experiments comparing the RUM algorithm to other standard and recent approaches to pairwise choice: the Blade-Chest (BC) algorithm of Chen & Joachims (2016) and the MV algorithm of Makhijani & Ugander (2019).

In the pairwise setting, the entire data distribution has a particularly simple parameterization: it can be completely represented by an n n tournament matrix M such that Mi,j is the probability that j beats i when the user is shown the pair {i, j}. Furthermore, for several of our experimental datasets, the training data provides a high-accuracy estimate of M. In these cases, the training data is effectively the entire data distribution, so the best algorithm will exactly reproduce the matrix M. Thus, in addition to comparing test-set prediction accuracy (which will depend on the variance of the entries of M), we also report the ability of an algorithm to capture the matrix M as given. For datasets based on election data, our RUM-ﬁnding algorithm is able to perfectly encode M. For other datasets, the error we attain in approximating M is on the order of 1/n2. For predictive tasks, our RUM algorithms also perform comparably to the BC and MV algorithms.

Implications. Finding a high-quality RUM for pairwise comparisons has several desirable properties:

RUMs naturally provide the ability to predict winning probabilities for larger slates, even when trained only on pairwise data; no other algorithm we are aware of for this problem has this property. Our learned RUMs are based on weighted sets of permutations, and are thus naturally interpretable and introspectible. For instance, one can study the properties of the high-weight permutations. RUMs capture a natural notion of rational behavior, in which users select the item they value most highly. The permutations learned by our RUM algorithm represent a de facto clustering of the user population into segments, characterized by their different behaviors with respect to their choices. This may have value beyond predictions, e.g., in market segmentation.

The rest of the paper proceeds as follows. Section 2 covers related work and Section 3 contains the background. Section 4 presents our results for minimizing average errors and shows evidence that our approach will not extend to the uniform error setting. Section 5 describes techniques to ﬁnd succinct approximating RUMs. Experimental results are in Section 6 and Section 7 has concluding thoughts. All missing proofs are in the Supplementary Material.

2. Related work

The problem of representing a tournament matrix with choice models has been often considered in the literature. We mention those that are closest in spirit to our work.

Chen & Joachims (2016) proposed the Blade-Chest (BC) model, which represents each player i with two vectors bi (blade) and ci (chest) in the Euclidean space; the probability that player i beats player j is a function of bj, ci, bi, cj. The BC model can produce any tournament matrix, in particular those with non-transitive triplets1, that are observed in practice and cannot be achieved with standard MNL models. The BC model is optimized via gradient descent (GD) on a non-concave function, with no global optimum guarantee. Earlier, Adams et al. (2010) proposed a similar model to represent the outcomes of baseball matches.

Makhijani & Ugander (2019) proposed a different model, the Majority Vote (MV) model. In this model, each player is represented by k features, with k odd. Each feature corresponds to an independent RUM2, e.g., an MNL or a Gaussian Thurstone model. A match between two players i and j consists of k rounds: in the tth round, the two players play independently against each other using only their tth features. The player who wins the majority of rounds, is the winner of the match. They proposed GD to maximize the model s (non-concave) likelihood, with no guarantees on the global optimum. Interestingly, they show that for a class of models, the ability of models from the class to induce non-transitivity implies the non-log-concavity of their likelihoods, i.e., the GD procedure of any of these models could still get stuck at a local optimum.

In fact, Makhijani & Ugander (2019) suggest general (i.e., possibly dependent) RUMs to be a natural choice for representing tournament matrices, but point out that previous work (Train, 2003) has found optimization over unrestricted RUMs to be hard. In our paper we tackle this problem. We show how to ﬁnd an approximately optimal unrestricted RUM for a tournament matrix; we circumvent the aforementioned non-concavity issues by casting the prob-

1Players i, j, k such that each beats another with probability more than 1/2: Pi,j > 1/2, Pj,k > 1/2, and Pk,i > 1/2. 2An independent RUM is a RUM where the noise vector η (see Section 3.1) has mutually independent entries.

RUMs from Head-to-Head Contests

lem combinatorially, and solving it using LP and the ellipsoid method. We also observe experimentally that these unrestricted RUMs almost perfectly represent the tournament matrices that originated them vindicating that RUMs are ideal for this task!

Finally, we mention the novel two-level (2L) model of Veerathu & Rajkumar (2021). In general their model is as powerful (and as hard to optimize) as the BC model. They thus restrict their model to rank-2 tournaments and show a polynomial time learning algorithm in this case.3

Interestingly, their learning algorithm appeals to the minimum FAS problem, which they show can be solved optimally in polynomial time on rank-2 tournaments. We observe that there exist dependent RUMs that cannot be represented by the 2L model (see Appendix D).

The problem of actively, or passively, learning RUMs has been studied in several papers (Souﬁani et al., 2012; Oh & Shah, 2014; Chierichetti et al., 2018a;b; Negahban et al., 2018; Tang, 2020). Farias et al. (2009); Chierichetti et al. (2021) considered the problem of bounding a RUM s support without changing the RUM s behavior. The goal of our paper, instead, is to ﬁnd the RUM (approximately) closest to a given tournament matrix.

3. Preliminaries

3.1. Discrete Choice and Random Utility Models

In our paper, we use [n] to denote the set {1, . . . , n}, and 2[n] to denote the power set of [n]. We also use [n] k to denote the class of subsets of [n] of cardinality k.

We use Sn to denote the set of all the permutations of [n], and S n = Sn \ {(1 n)} to denote the set of all the permutations of [n] with the exception of the permutation that reverses the natural ordering relation. For a permutation π Sn and for an item i [n], we let π(i) [n] be the value (or position) of item i in π. For instance, if π = (3 1 4 2), then π(2) = 4, π(4) = 3, π(1) = 2, π(3) = 1.

As is standard in discrete choice literature, we use slate to denote a set in 2[n] \ { }, i.e., any non-empty subset of [n]. For a permutation π Sn and a slate T [n], we let π(T) = arg max i T π(i),

denote the item of T that has the largest value in π, i.e., the winner in T given π.

A random utility model (RUM) on [n] is a distribution D on Sn. (Henceforth, we drop [n] when it is clear from

3In their experiments, they do not measure how close their guesses of the win-probabilities are; they consider the upset metric, which counts the number of pairs of players such that the model correctly identiﬁes the most-likely winner in the pair.

the context.) For a slate T, we use DT to denote the distribution of the random variable π(T) for π D. (Clearly, supp(DT ) T.) In other words, DT is the induced distribution of the winner in T for a random permutation from D. Let DT (i) be the probability that i is a winner in T.

RUMs are often presented in terms of noisy item evaluations made by users: each item i has a base value Vi, each user samples (η1, . . . , ηn) from a joint noise distribution4, and the utility Ui of item i to this user is Vi + ηi. Given a slate, the rational user chooses the highest utility item in the slate (with ties broken u.a.r.). Equivalently, the user ﬁrst sorts all the items decreasingly according to the observed Ui s (again, with ties broken u.a.r.) to get a permutation. Then, for a given slate, the rational user chooses the highest ranked item in the slate according to the permutation. It is easy to see that these two deﬁnitions are equivalent (Chierichetti et al., 2018a).

3.2. Tournaments and Feedback Arc Set problems

Given a set [n] of players, and {i, j} [n] 2 , a tournament matrix {Pi,j}n i,j=1 gives the (empirical) probability that j wins in a head-to-head contest with i. Clearly, Pi,j, Pj,i 0 and Pi,j + Pj,i = 1, for each {i, j} [n] 2 .

As we will show, RUM optimization for head-to-head contests (i.e., tournament matrices) is closely related to the minimum Feedback Arc Set (FAS) problem, which is NPhard (Karp, 1972); the latter has two equivalent deﬁnitions.

The ﬁrst deﬁnition is perhaps the most well-known. A τbounded directed graph G(V, A, w) consists of the vertex set V = [n], the arc set A {(i, j) V 2 | i = j} satisfying (i, j) A = (j, i) A, and a non-negative weight function w : A [0, τ] on its arcs.

Problem 1 (FAS on τ-bounded directed graphs). Given a τ-bounded directed graph G(V, A, w), ﬁnd a permutation π of the vertices V that minimizes C G(V,A,w)(π) = X

(i,j) A and i πj w((i, j)).

The second deﬁnition forces input graph to be a transitive tournament (i.e., a complete DAG) with arcs going from lower-indexed vertices to higher-indexed ones. A τ-bounded transitive tournament G(V, A, w) is a complete DAG consisting of the vertex set V = [n], arc set A = {(i, j) V 2 | i < j}, and a weight function w : A [ τ, τ] on its arcs.

Problem 2 (FAS on τ-bounded transitive tournaments). Given a τ-bounded transitive tournament G(V, A, w), ﬁnd a permutation π of the vertices V that minimizes

4We recall that, in an independent RUM, the ηi s are mutually independent.

RUMs from Head-to-Head Contests

C G(V,A,w)(π) = X

1 i<j n and i πj w((i, j)).

Note that while the input graphs to Problem 2 are less general than those to Problem 1, the arc weights of Problem 2 can be chosen in a more general way.

For Problem 1, Frieze & Kannan (1999) give an O(δτn2)- additive approximation algorithm, for δ > 0.

Theorem 3 (Frieze & Kannan (1999)). There exists an approximation algorithm for Problem 1 that, for a τ-bounded directed graph G on n nodes, returns a permutation π

such that C G(π ) O(δτn2) + minπ Sn C G(π), in time O δ 4n + 2 e O(δ 2), with probability at least 2/3, for an arbitrary δ > 0.

We will use this approximation algorithm for Problem 2. The two problems are equivalent from an additive approximation point of view. The following is folklore.

Observation 4. Any α-additive approximation polynomial time algorithm for Problem 1 with a given τ can be transformed into an α-additive approximation polynomial time algorithm for Problem 2 with the same τ, and vice versa.

3.3. Approximating Tournaments by RUMs

We formally deﬁne the main problems that we consider.

Deﬁnition 5 (Average RUM approximation). A given tournament matrix P is approximated on average to within ϵ by a RUM R if avg 1 i<j n

R{i,j}(j) Pi,j ϵ.

Given P, let ϵ1(P) be the smallest5 x 0 such that there exists a RUM that approximates P on average to within x.

Note that each RUM R on [n] is such that R{i,j}(i) = 1 R{i,j}(j) for each {i, j} [n] 2 . Thus, the above average

is equal to

j [n]\{i} |R{i,j}(j) Pi,j|

n (n 1) , i.e., average the error over each pair and each winner of the pair.

Problem 6. Given a tournament matrix P, ﬁnd a δadditive approximation to ϵ1(P), i.e., obtain a RUM whose average distance from P is not larger than ϵ1(P) + δ.

Similarly, one can deﬁne the maximum-error variant.

Deﬁnition 7 (Uniform RUM approximation). A given tournament matrix P is uniformly approximated to within ϵ by a RUM R if max 1 i<j n R{i,j}(j) Pi,j ϵ.

Given P, let ϵ (P) be the smallest5 x 0 such that there exists a RUM that uniformly approximates P to within x.

5As we will see, the minimum exists since it can be obtained as the solution to a ﬁnite-sized LP.

4. Minimizing Average Error

In this section we describe the RIPPLE algorithm to ﬁnd a RUM that additively approximates ϵ1(P) to within δ = Θ log 0.49 n in time polynomial in n. Our approach has three steps: (i) writing down Problem 6 as a linear program (LP) of exponential size, (ii) developing an approximate separation oracle for its dual, and (iii) showing that the Ellipsoid algorithm converges to an approximate solution when aided by this approximate separation oracle.

We start by writing down the primal LP for Problem 6:

min n 2 1 P

1 i<j n ϵi,j

Pi,j ϵi,j P

pπ Pi,j + ϵi,j 1 i < j n

ϵi,j 0 1 i < j n pπ 0 π S n

The program minimizes the average ℓ1-error over the pairs, optimizing over the permutations. Note that, since its number of variables is exponential in n, this LP cannot be directly solved as is in time polynomial in n.

Observe that the LP (1) assigns a non-negative probability to each permutation π S n, and requires the sums of their probabilities to be not larger than 1. To get a RUM R out of a solution to the LP (1) one can assign, for each π S n, probability pπ to π. Recall that (1 n) is the only permutation of [n] not in S n (this particular permutation does not contribute to the probability that j beats i, for any j > i). The probability that the RUM R assigns to (1 n) is then equal to 1 P

Let us ﬁrst rewrite the above primal LP as:

min n 2 1 P

1 i<j n ϵi,j

Li,j : ϵi,j + P

pπ Pi,j 1 i<j n

Ui,j : ϵi,j P

pπ Pi,j 1 i<j n

ϵi,j 0 1 i<j n pπ 0 π S n

We then take its dual in order to reduce the number of variables to a polynomial in n:

1 i<j n (Pi,j (Li,j Ui,j))

ϵi,j : Li,j + Ui,j n 2 1 1 i<j n pπ : D + P

1 i<j n i πj

(Li,j Ui,j) 0 π S n

D, Ui,j, Li,j 0 1 i<j n

RUMs from Head-to-Head Contests

Observe that each feasible solution to the Dual LP (3) can be transformed into a feasible solution with the same value and with the property that for each i < j at most one of Ui,j and Li,j is non-zero. (Indeed, if they are both positive, then we can subtract min(Ui,j, Li,j) from both of them without impacting feasibility and without changing the objective function s value.) Then, given a feasible solution to the Dual LP, we deﬁne i,j = Ui,j Li,j and after the above transformation, we are guaranteed that Ui,j = max ( i,j, 0), Li,j = max ( i,j, 0), and | i,j| = Li,j +Ui,j. The Dual LP (3) is then equivalent to:

1 i<j n (Pi,j i,j)

1 i<j n i πj

i,j 0 π S n

n 2 1 i,j n 2 1 1 i < j n D 0

Observe that LP (4) optimizes over a vector whose ℓ - norm is bounded by n 2 1. The goal is to maximize the objective function under the constraint that the transitive tournament whose (i, j) arc has weight i,j, for each 1 i < j n, has a minimum FAS of value at least D. In other words, the separation oracle problem for LP (4) is an instance of Problem 2, i.e., of minimum FAS on 1/( n 2)- bounded transitive tournaments.

We now transform LP (4) from a maximization problem into a feasibility one, since we will be using the Ellipsoid algorithm to solve it:

1 i<j n (Pi,j i,j) ρ

1 i<j n i πj

i,j 0 π S n

ci,j : n 2 1 i,j n 2 1 1 i<j n c D : D 0

Clearly, LP (4) has value at least ρ iff Fρ is feasible. We now give an approximate separation oracle for Fρ.

Theorem 8. Fix any δ(n) Ω log 0.49 n . Then, there exists a randomized algorithm that gets as input an assignment {D, 1,2, . . . , n 1,n} to Fρ, and that in time O(n2), with probability at least 2/3, (i) returns an unsatisﬁed constraint of Fρ if at least one of the constraints cρ, c D, or ci,j (for 1 i < j n) is unsatisﬁed, otherwise (ii) if there exists at least one π S n such that D P

1 i<j n i πj

i,j > δ(n), the algorithm returns an un-

satisﬁed cπ constraint, otherwise (iii) the algorithm might not return any unsatisﬁed constraint (even if some exists).

Moreover, if there exists at least one π S n such that D P

1 i<j n i πj

i,j > δ(n), then for any constant c > 0,

if the above randomized algorithm is run independently for Θ(c log n) times, the probability that no unsatisﬁed constraint is returned shrinks to n c.

Proof. The algorithm can check the validity of each of the cρ, c D, and ci,j (for 1 i < j n) constraints in time O(n2); if one of the constraints is unsatisﬁed, it can be returned. Otherwise, they are all valid, and the algorithm creates the transitive tournament G(V, A, w) with V = [n], A = {(i, j) | 1 i < j n}, and w((i, j)) = i,j. Then, G(V, A, w) is a τ-bounded transitive tournament, for τ O(n 2), i.e., is an instance of Problem 2. In fact, consider any π S n: if C G(V,A,w)(π) is the cost of solution π for the instance G(V, A, w) of Problem 2, then the constraint cπ of Fγ is exactly D C G(V,A,w)(π) 0, or equivalently, C G(V,A,w)(π) D.

Theorem 3 and Observation 4 imply the existence of an algorithm for Problem 2 on G(V, A, w) that additively approximates the optimal solution to any δ > 0 in time O n δ 4 + 2 e O(δ 2). In particular, for each δ(n) Ω log 0.49 n , the problem can be approximated to within an additive δ(n) in time O(n2).

Suppose that π is the permutation returned by the additiveapproximation algorithm. Then, with probability at least 2/3, for each π Sn it holds that C G(V,A,w)(π) C G(V,A,w)(π ) δ(n).

Now, if D C G(V,A,w)(π ) > 0 then cπ is an unsatisﬁed constraint, which can be returned (indeed, if this happens, then π S n; in fact, 1 2 n is the only permutation in Sn S n, and C G(V,A,w) has a value of 0 on this permutation). Otherwise, C G(V,A,w)(π ) D; then, it must hold that C G(V,A,w)(π) C G(V,A,w)(π ) δ(n) D δ(n), i.e., for each π S n, the constraint D P

1 i<j n i πj

i,j δ(n) must hold.

Finally, to decrease the error probability to n c, it sufﬁces to (independently) run the randomized algorithm for Problem 2, O(c log n) times. Then a permutation with the smallest cost C G(V,A,w) in these runs can be returned.

We can now use the approximate separation oracle of Theorem 8 along with the Ellipsoid algorithm (Gr otschel et al., 1988), to obtain a δ-additive approximation to Problem 6.

Theorem 9. Problem 6 can be additively approximated to any δ Ω(log 0.49 n) in polynomial time.

Proof. First, our algorithm RIPPLE guesses a ρ i δ | 0 i 1

δ (the algorithm can binary search among the values in this set). For a given ρ, the Ellipsoid algorithm (Gr otschel et al., 1988) can be used with

RUMs from Head-to-Head Contests

the approximate separation oracle of Theorem 8 to approximately check the non-emptiness of Fρ. In particular, the Ellipsoid algorithm will call the separation oracle at most polynomially many times, obtaining at most polynomially many separating hyperplanes. If the set of returned hyperplanes forms an infeasible LP, the Ellipsoid algorithm can correctly conclude that Fρ is empty.

Otherwise, the Ellipsoid algorithm could return a point x = (D, 1,2, . . . , n 1,n) that the oracle was unable to separate from Fρ. This point could be outside of Fρ since the oracle only guarantees6 that the cπ constraints hold to within an additive δ error. We then deﬁne the point x = (D + δ, 1,2, . . . , n 1,n); we have that x Fρ δ. Indeed, (i) each cπ constraint will be satisﬁed by x (with x, cπ is off by at most δ; and, the value of the LHS of the cπ constraint decreases by δ from x to x ), (ii) the cρ δ constraint of Fρ δ will be satisﬁed by x (the constraint cρ of Fρ is satisﬁed with x, thus cρ δ is satisﬁed by x ), and (iii) each remaining constraint will still be satisﬁed.

Suppose that i is the smallest i for which the algorithm could conclude that x Fρ δ where ρ = i δ.7

Then, the Dual LP (4) does not admit a solution of value at least ρ + δ, but admits a solution of value at least ρ δ. It follows that the optimal solution of the Dual LP (4), and thus of the Primal LP (1), lies in [ρ δ, ρ +δ]. Hence, the Ellipsoid algorithm with the above separation oracle, lets us additively approximate the value of the optimal solution of Problem 6 to within 2δ.

To complete the description of RIPPLE, it only remains to obtain an approximating RUM having an error within the smallest possible plus 2δ. Consider the run of the Ellipsoid algorithm with ρ = ρ . In this run, the Ellipsoid algorithm calls the separation oracle at most polynomially many times; thus, the oracle returns at most polynomially many separating hyperplanes, some of which might refer to non-permutation constraints, while the rest refer to the permutation constraints of, say, permutations π1, . . . , πt (t n O(1)). Now, if we restrict the primal LP (1) to (i) its non-permutation variables and (ii) the permutation variables pπ1, . . . , pπt, then we obtain an LP of size polynomial in n (hence, solvable in polynomial time) and with an optimal value not larger than 2δ plus the optimum of the primal LP (1). Thus, solving the restricted LP allows us to obtain

6Given that the approximate separation oracle is a randomized algorithm, this guarantee might fail to hold. Still, given that the oracle is called no more than polynomially many times, and given that the probability of its error can be made as small as n c, for an arbitrary constant c > 0, with high probability the guarantee will hold at each oracle call. 7Such an i must exist since Problem 6 always admits a solution of value at most 1/2. Indeed, the RUM such that R{i,j}(j) = 1/2, for each 1 i < j n, can be realized, e.g., with a uniform distribution over Sn.

a RUM with an error not larger than the smallest possible plus 2δ.

One might ask if the same technique can be used to minimize uniform error as in Deﬁnition 7. Unfortunately, the separation oracle for this version is NP-hard to approximate to some additive constant. We discuss the uniform error problem in Section B of the Supplementary Material.

5. Succinct Representation for Average Errors

The RUMs obtained from Theorem 9 might be supported by (polynomially) many permutations. In this section we discuss succinct representation of those (and other) RUMs.

First, we note that a result of Chierichetti et al. (2021) implies that any RUM can be sketched to O(ϵ 2 n log2 n) bits in such a way that the probability distribution of each slate of size two is approximated to within an additive error ϵ. Here, we will show that O(ϵ 2 n log n) bits sufﬁce to approximate the average ℓ1-error (and the average ℓ2-error) over slates of size two to within an additive error ϵ.

As we will see, our sketch is a uniform RUM supported on a multiset of O(ϵ 2) permutations, i.e., by a number of permutations that is independent of n. Note that a uniform RUM supported on k permutations can be represented by a k-dimensional vector per item; the generic item s vector contains the ranks of that item in the k permutations of the RUM. Therefore, our succinct representation can also be seen as an embedding of the items in an O(ϵ 2)- dimensional space.

Deﬁne the (pairwise) RMSE8 between RUMs D and D as

i,j [n],i =j

D{i,j}(j) D {i,j}(j) 2

We also deﬁne a ℓq-version of the above measure:

ρq(D, D ) =

i,j [n],i<j

D{i,j}(j) D {i,j}(j) q

Since D{i,j}(i) = 1 D{i,j}(j), we get |D{i,j}(j) D {i,j}(j)| = |D{i,j}(i) D {i,j}(i)|; thus, ρ = ρ2.

The sketch we propose samples O(ϵ 2) permutations from the original RUM D and uses them as the support of a new RUM e D that, as we show, approximates D both in the ρ1and ρ2-senses.

Theorem 10. For a RUM D and for each ϵ > 0, there is a RUM e D that samples uniformly from a multiset of O ϵ 2

8Note that the RMSE deﬁned by Makhijani & Ugander (2019) is slightly different: their denominator is n2 instead of n (n 1). It is possible to show that our RUM approximation holds for their RMSE version as well.

RUMs from Head-to-Head Contests

Algorithm 1 RUMRUNNER, a heuristic for Problem 6.

1: S 2: π (1 n) 3: repeat 4: S S {π } 5: Solve the primal LP (1) restricted to the variables ϵ, and pπ for π S; let P be its optimal primal solution, and D be its optimal dual solution 6: π Viol-HP (D) 7: until π = 8: return the RUM induced by P, i.e., the RUM that samples π S with probability P(pπ)

permutations such that ρ1(D, e D) < ϵ and ρ2(D, e D) < ϵ.

Since each permutation can be represented with O(n log n) bits, an immediate consequence of Theorem 10 is that all RUMs (in particular, those obtained by our algorithms) can be represented succinctly.

Corollary 11. For a RUM D on [n] and for each ϵ > 0, there is a RUM e D representable with O(ϵ 2 n log n) bits such that ρ1(D, e D) < ϵ and ρ2(D, e D) < ϵ.

6. Experiments

We present three groups of experiments. Section 6.3 studies the effectiveness of RUMRUNNER at representing a given tournament matrix. As described in the Introduction, we view this as the key metric for situations in which the tournament matrix is known to reasonable precision, and we ﬁnd that the approximation error is very small for all datasets (zero, or order of 1/n2 for an n n tournament).

Next, in Section 6.4, we ask whether the resulting RUM can be approximated by a much smaller RUM; we consider a number of heuristics to sketch a RUM with support on only k = 10 permutations, and identify two approaches that are able to approximate all tournament matrices well in this regime. This suggests that very small RUMs may approximate real-world datasets, resulting in computational efﬁciencies and increased transparency.

Finally, in Section 6.5, we consider a standard prediction setting in which the data is split between train/test, and the learned model is measured according to its generalization performance. This setting is appropriate if the training sample is too small to fully characterize the data distribution.

6.1. Linear programs and separation oracles

For computational efﬁciency reasons, we do not use RIPPLE as described to ﬁt RUMs to the datasets. Instead, we develop RUMRUNNER using a separating-hyperplane approach, described in Algorithm 1.

RUMRUNNER employs a separation oracle for the dual: a

Algorithm 2 A randomized local search for Viol-HP, i.e., the separation oracle. In experiments, we set t = 100.

1: For a permutation π, let (i) fas(π) = P i<j i πj D( i,j) be

the FAS cost of π on the dual solution s directed graph and (ii) N(π) be the set of permutations that can be obtained by moving one of the elements of π to a new position. 2: fasmin 3: for i = 1, . . . , t do 4: π uniform at random permutation from Sn 5: while π N(π) such that fas(π ) < fas(π) do 6: π arg minπ N(π) fas(π ) 7: if fas(π) < fasmin then 8: fasmin fas(π) 9: πmin π 10: if D(D) fasmin 0 then 11: return 12: else 13: return πmin

function Viol-HP that, given the transitive tournament induced by the dual solution, either returns a violated hyperplane, or if it could not ﬁnd any such hyperplane. As mentioned in Section 4, the Viol-HP separation oracle needs to ﬁnd a permutation π inducing a minimum FAS of the transitive tournament. Given such a π, the function needs only to check if P

i<j,i πj D( i,j) D(D).

Now, given that ﬁnding a minimum FAS is NP-hard, we implemented two versions of Viol-HP. One is the classical exact dynamic programming algorithm (Lawler, 1964), running in time O(n2n), and another is a randomized local search heuristic, Algorithm 2; the latter might fail to return the minimum FAS. In our experiments, we could run the exact algorithm on most of the datasets; for the remaining, we resorted to Algorithm 2, sampling 100 permutations, and stopping an iteration if the gain in the last step was less than 10 5. Note that Algorithm 2 might fail to ﬁnd a separating hyperplane even if it exists, which means RUMRUNNER might fail to return a RUM at the minimum distance from the input matrix. (This is the reason why Table 1 has two columns detailing the average-error of RUMs: the ﬁrst column gives the average-error that the RUM returned by RUMRUNNER has with respect to the matrix, the second gives a lower bound on the minimum possible averageerror achievable by a RUM with respect to the matrix.) In practice however, as we discuss in Section C of the Supplementary Material, RUMRUNNER works fairly well: it always returns a RUM (hence always gives a correct upper bound on the distance of the input matrix to RUMs) and, with an extra ﬁnal check on the feasibility for the dual of its last D solution, RUMRUNNER can also certify a lower bound on the distance of the input matrix to the best approximating RUM. In our experiments, this ﬁnal check was implemented with an exact algorithm.

RUMs from Head-to-Head Contests

Table 1. Statistics about the RUMs obtained by RUMRUNNER: the cardinality of the support (|S|), the average error, a lower bound on the average error, and the uniform error.

Dataset n |S| avg. err. lower bound unif. err. on avg. err.

A5 16 121 A9 12 67 A17 13 79 0 A48 10 46 A81 11 56

SF 35 572 0.001408 0.001408 0.1438 Jester 100 3553 0.000461 0 0.0786

6.2. Experimental Setup

Our main algorithm is RUMRUNNER, which uses Algorithm 1 along with either Algorithm 2 or the exact algorithm (Lawler, 1964) for Viol-HP. We implemented RUMRUNNER in Python using IBM cplex9 as the LP solver. All experiments were done on commodity hardware10.

Baselines. As baselines we considered the Thurstone model (Thurstone, 1994), Multinomial Logit (MNL) (also known as Bradley Terry Luce model) (Bradley & Terry, 1952), Blade-Chest (BC) (Chen & Joachims, 2016), and 3D Gaussian majority vote (MV) (Makhijani & Ugander, 2019). We used, as implementations of these baselines, those made available by Chen & Joachims (2016) and Makhijani & Ugander (2019).

Datasets. We measured the performances of the algorithms on a number of datasets: a class of commonly studied election datasets A5, A9, A17, A48, A81 (Tideman, 2006), one videogame dataset representing matchups between Super Street Fighter IV (SF) characters (Chen & Joachims, 2016), and the Jester rating dataset (Goldberg et al., 2001).

Unlike the others, the election datasets are composed of ballots, each containing a partial ranking (i.e., a ranking of a subset) of the candidates. Following Makhijani & Ugander (2019), as we are looking for pairwise matches, we transformed a generic ballot a ranking of a subset S of candidates into a set of |S| 2 matches, one for each pair of candidates in S, where the winner of the generic match is the candidate that is ranked higher in the ballot.

6.3. RUM Approximation

Table 1 shows how close the RUMs computed by RUMRUNNER come to their respective input tournament matrices. In the case of the election datasets the error is zero: the representation is perfect. For SF, the total error = 35 2 0.001408 . . . 0.837, which is smaller than 1, the range of an entry of the 35 35 matrix. For

9http://ibm.com/analytics/cplex-optimizer 1012-Core Ryzen 3900X with 64GB of RAM

0 250 500 10 3

Random + LP Top-k (scaled)

Top-k + LP k-center + LP

light-RUM Original RUM

Figure 1. Average error for different number of permutations k (horizontal axis) for the considered heuristics.

Jester, the best RUM we could ﬁnd induces a total error of 100 2 0.000461 . . . 2.281, i.e., slightly more than two entries of the 100 100 matrix. While the algorithms are crafted to minimize average error, we also report the uniform/maximum error.

6.4. Sketching

Theorem 10 shows that the support of a generic RUM can be reduced to O(ϵ 2) permutations while guaranteeing that the new RUM is within a distance ϵ to the original, both in the RMSE and in the average sense. In this section we describe two heuristics for support reduction, that appear to work signiﬁcantly better in practice. We apply our heuristics to the RUMs produced by our algorithm; we note, though, that they can be applied to any RUM.

For each of our RUMs R, and for a ﬁxed value of k, we considered the following heuristics:

Random + LP: sample a set of k permutations u.a.r. from Sn, and run the primal LP (1) on that set of permutations to obtain their weights. Top-k (scaled): consider the set T of the k permutations having largest probability in R; create a new RUM supported on T, assign probabilities to T s permutations by scaling up their probabilities in R. Top-k + LP: given T as above, run the primal LP (1) on T to assign new probabilities to its permutations. k-Center + LP: run the Greedy k-center algorithm of (Gonzalez, 1985) on the support permutations of the RUM, using their Kendall distance scaled by the product of their probabilities in R; run the LP on the returned centers to assign them probabilities. Light-RUM: sample k permutations from R as in Theorem 10. Each heuristic returns a RUM supported on k permutations. We compared these RUMs to the tournament matrix that was used to produce R, i.e., to the original dataset.

RUMs from Head-to-Head Contests

Table 2. Average error of RUMs obtained through different support reduction heuristics, with the target support of size k = 10.

Dataset Original RUM Random + LP Top-k (scaled) Top-k + LP k-center + LP Light-RUM

A5 0 0.156 0.020 0.059 0.041 0.045 0.004 0.111 0.016 A9 0 0.133 0.030 0.031 0.010 0.011 0.002 0.119 0.023 A17 0 0.112 0.014 0.034 0.014 0.014 0.001 0.119 0.033 A48 0 0.107 0.022 0.043 0.015 0.015 0.001 0.119 0.029 A81 0 0.129 0.024 0.056 0.047 0.046 0.005 0.113 0.021 SF 0.00141 0.149 0.008 0.113 0.104 0.105 0.008 0.127 0.011 Jester 0.00046 0.168 0.008 0.119 0.108 0.108 0.003 0.121 0.006

In Table 2, we report the (expected) average error of the different heuristics (each averaged over 20 runs) for a ﬁxed value of k = 10. In Figure 1, we report the (expected) average error for different values of k. Observe that the quality of the solution produced by the LP (1) has a signiﬁcant dependence on the set of permutations/variables available to it Random + LP is signiﬁcantly worse than the other heuristics that use the LP in their ﬁnal step. Still, if we limit ourselves to scaling up the probabilities of the top-k permutations of the original RUM (Top-k (scaled)), the results are worse than, but comparable to, those obtained by the best LP method. It is also clear that Light-RUM is penalized by its uniformity its probabilities, by design, have a 1/k granularity.

Even with k = 10 permutations, the RUMs obtained by the two best heuristics, Top-k + LP and k-center + LP closely approximate the original tournament matrix.

6.5. Quality of generalization

We also compared the predictive performance of our algorithm with respect to several baselines. Our tests follows those of Makhijani & Ugander (2019): we use a ﬁve-fold cross-validation, and measure their RMSE as a metric.

For each dataset, we ﬁrst compute the list of its t matchups. We then split, uniformly at random, the list into ﬁve parts, conditioned on each part having length equal to either t/5 or t/5 . Then, for each part, we considered that part as the test set, and the union of the remaining four parts as the training set. For each train/test pair, we ran all the algorithms on the training data to obtain a prediction matrix. Finally, we computed the RMSE as deﬁned in Makhijani & Ugander (2019), ρ (D, D ) = q P

i,j [n],i =j(D{i,j}(j) D {i,j}(j)) 2/n2, between each predic-

tion matrix and its corresponding test set. We randomly sampled 10 splittings obtaining, for each dataset, 50 train/test pairs.

In Table 3 we report our prediction results. Observe that the baseline results are different from the ones reported in Makhijani & Ugander (2019),11 but the ratios of their RM-

11We conﬁrmed with the authors that our numbers are indeed correct; the discrepancy is due to a library error.

SEs remain similar.

The results show that RUMRUNNER is competitive with the other baselines. Since we are able to approximate perfectly (e.g., in the elections datasets) or almost perfectly (SF) the training matrix, our algorithm is essentially obtaining the same results of the training matrix used, with no change, as the predictor.

Table 3. Expected RMSE in the cross-validation experiment. We sampled ten 5-fold splittings per dataset. The only signiﬁcant difference is that RUMRUNNER beats MNL and Thurstone for A9. The standard deviations are all in the range [ 0.002, 0.007].

Dataset MNL Thurstone BC MV RUMRUNNER

A5 0.041 0.041 0.041 0.041 0.043 A9 0.026 0.027 0.023 0.023 0.021 A17 0.049 0.050 0.051 0.050 0.052 A48 0.034 0.035 0.033 0.034 0.035 A81 0.043 0.043 0.042 0.042 0.043 SF 0.137 0.137 0.138 0.137 0.142

7. Conclusions In this paper we studied the problem of approximating a given tournament matrix by a RUM. We obtained an algorithm to minimize the average error to within a o(1) margin; we also showed that such an approach is unlikely to work to minimize the uniform error. A natural open question is to extend our methods beyond head-to-head contests. For example, given the winning probabilities in slates of size three, what is the complexity of ﬁnding the closest RUM, to minimize the average error? Another interesting question is the following: given a tournament matrix, is there a RUM on support of size k that exactly represents it?

Acknowledgments

We thank the anonymous reviewers for their careful reading. We also thank Mohammad Mahdian, Rahul Makhijani, Tim Roughgarden, Johan Ugander, and Santosh Vempala for useful discussions and suggestions.

Flavio Chierichetti and Alessandro Panconesi were supported in part by Bi Ci Bertinoro International Center for Informatics. Flavio Chierichetti was supported in part by the PRIN project 2017K7XPAN.

RUMs from Head-to-Head Contests

Adams, R. P., Dahl, G. E., and Murray, I. Incorporating side information in probabilistic matrix factorization with Gaussian processes. In UAI, pp. 1 9, 2010.

Bogacz, R., Brown, E., Moehlis, J., Holmes, P., and Cohen, J. The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review, 113(4):700 765, 2006.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324 345, 1952.

Chen, S. and Joachims, T. Modeling intransitivity in matchup and comparison data. In WSDM, pp. 227 236, 2016.

Chierichetti, F., Kumar, R., and Tomkins, A. Discrete choice, permutations, and reconstruction. In SODA, pp. 576 586, 2018a.

Chierichetti, F., Kumar, R., and Tomkins, A. Learning a mixture of two multinomial logits. In ICML, pp. 961 969, 2018b.

Chierichetti, F., Kumar, R., and Tomkins, A. Light RUMs. In ICML, pp. 1888 1897, 2021.

Clementi, A. and Trevisan, L. Improved nonapproximability results for minimum vertex cover with density constraints. TCS, 225(1):113 0128, 1999.

Farias, V. F., Jagabathula, S., and Shah, D. A data-driven approach to modeling choice. In NIPS, pp. 504 512, 2009.

Frieze, A. and Kannan, R. Quick approximation to matrices and applications. Combinatorica, 19(2):175 220, 1999.

Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative ﬁltering algorithm. Information Retrieval, 4:133 151, 07 2001.

Gonzalez, T. F. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38: 293 306, 1985.

Gr otschel, M., Lov asz, L., and Schrijver, A. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer, 1988.

Karp, R. M. Reducibility among combinatorial problems. In Miller, R. E., Thatcher, J. W., and Bohlinger, J. D. (eds.), Complexity of Computer Computations. The IBM Research Symposia Series, pp. 85 103. Springer, 1972.

Lawler, E. A comment on minimum feedback arc sets. IEEE Transactions on Circuit Theory, 11(2):296 297, 1964.

Makhijani, R. and Ugander, J. Parametric models for intransitivity in pairwise rankings. In WWW, pp. 3056 3062, 2019.

Negahban, S., Oh, S., Thekumparampil, K. K., and Xu, J. Learning from comparisons and choices. JMLR, 19(1): 1478 1572, 2018.

Oh, S. and Shah, D. Learning mixed multinomial logit model from ordinal data. In NIPS, pp. 595 603, 2014.

Souﬁani, H. A., Parkes, D. C., and Xia, L. Random utility theory for social choice. In NIPS, pp. 126 134, 2012.

Tang, W. Learning an arbitrary mixture of two multinomial logits. ar Xiv, 2007.00204, 2020.

Thurstone, L. L. A law of comparative judgment. Psychological Review, 34:273 286, 1994.

Tideman, N. Collective Decisions and Voting: The Potential for Public Choice. Routledge, 2006.

Train, K. E. Discrete Choice Methods with Simulation. Cambridge University Press, 2003.

Veerathu, V. and Rajkumar, A. On the structure of parametric tournaments with application to ranking from pairwise comparisons. In Neur IPS, 2021.

RUMs from Head-to-Head Contests

Supplementary Material

A. Missing Proofs

A.1. Proof of Observation 4

Proof. Let G = ([n], A , w ) be an instance of Problem 1. Let A = {(i, j) | 1 i < j n}, and for each 1 i < j n, let

w ((i, j)) =

w ((i, j)) if (i, j) A , w ((i, j)) if (j, i) A , 0 otherwise.

Then, G = (V, A , w ) is an instance of Problem 2. Now, let W = P

(i,j) A j>i

w ((i, j)); then, for each permutation π Sn,

it holds that C G=(V,A ,w )(π) = C G=(V,A ,w )(π) W . Thus additive approximation is preserved in this direction.

Analogously, let G = ([n], A , w ) be an instance of Problem 2. Let A = {(i, j) | (1 i < j n and w ((i, j)) 0) or (1 j < i n and w ((i, j)) < 0)}, and for each (i, j) A , let

w ((i, j)) = w ((i, j)) if i < j, w ((i, j)) if j < i.

Then, G = (V, A , w ) is an instance of Problem 1. Now, let W = P

w ((i,j))<0

w ((i, j)); then, for each permutation

π Sn, it holds that C G=(V,A ,w )(π) = C G=(V,A ,w )(π) W . Thus additive approximation is preserved in this other direction, as well.

A.2. Proof of Theorem 10

Proof. In a manner similar to Chierichetti et al. (2021), we sample k = 4

ϵ2 independent permutations π1, . . . , πk from D, and we let e D be the uniform RUM over the multiset {π1, . . . , πk}, i.e., the RUM e D samples i uniformly at random from [k], and returns πi.

For arbitrary i < j, consider the random variable X = e D{i,j}(j). Then, X = k

k iff there are exactly k indices t in [k] such that j πt i. Observe that, by the iid choice of the πi s from D, the distribution of k is equal to the binomial distribution Bin(k, p) with p = D{i,j}(j).

Then, E[X] = E h e D{i,j}(j) i = kp

k = p = D{i,j}(j), and

E h (X p)2i = Var[X] = kp(1 p)

k2 = p(1 p)

since p(1 p) is maximized at p = 1/2.

Using Jensen s inequality, E[|X p|] p

E h e D{i,j}(j) D{i,j}(j) i = E[|X p|] 1 2

Thus for each q {1, 2},

E h e D{i,j}(j) D{i,j}(j) qi (4k) q/2 .

e D{i,j}(j) D{i,j}(j) q .

By the linearity of expectation, we have

We also have Sq 0; thus, by Markov s inequality, we get that Pr [Sq < 4E[Sq]] > 3/4. Consider the event ξq = Sq <

RUMs from Head-to-Head Contests n 2 41 q/2 k q/2 . We get

Pr [ξq] Pr [Sq < 4E[Sq]] > 3

Now, ρq D, e D = qq

Sq/ n 2 . Then, ξq entails that

ρq D, e D < q s n 2 41 q

2 n 2 4 1 q 1

where the last inequality follows from q 1 and k 4 ϵ2 . Since Pr[ξq] > 3/4 for q {1, 2}, with probability 1/2, e D is an ϵ-approximation of D both in ρ1and ρ2-senses.

B. Minimizing Uniform Error

As in the average error case, once can write the primal LP for minimizing the uniform error:

Li,j : ϵ + P

pπ Pi,j 1 i<j n

pπ Pi,j 1 i<j n

ϵ 0 pπ 0 π S n

We then take the dual, obtaining:

1 i<j n (Pi,j (Li,j Ui,j))

1 i<j n (Li,j + Ui,j) 1

1 i<j n i πj

(Li,j Ui,j) 0 π S n

D, Ui,j, Li,j 0 1 i < j n

As before, we observe that if min(Ui,j, Li,j) > 0, then we can subtract this quantity from both Li,j and Ui,j with no impact on feasibility nor on the objective. This way, we can set i,j = Ui,j Li,j. The ϵ constraint then becomes P

1 i<j n | i,j| 1. Thus, the dual LP is equivalent to:

1 i<j n (Pi,j i,j)

1 i<j n i πj

i,j 0 π S n

Thus, when minimizing the maximum error, the dual LP optimizes over a vector whose ℓ1-norm is at most 1 recall that, in the case of average error minimization, it was the ℓ -norm of that was bounded to be at most 1. The goal of the LP does not change: this LP aims to maximize the same objective function of the LP for the average error, under the same constraint that the -graph has a minimum FAS of value at least D.

We show that the Separation Oracle problem for LP (5) is NP-hard to additively approximate to some positive constant. Hence, the Ellipsoid-based approach we used to get an o(1)-approximation algorithm for Problem 6, fails for the maximumerror variant.

Theorem 12. There exists a constant α > 0 such that, given an assignment to (5), it is NP-hard to determine whether the assignment satisﬁes each constraint, or whether some constraint is off by at least α.

RUMs from Head-to-Head Contests

Proof. The classical reduction (Karp, 1972) from Vertex-Cover to (unweighted) Feedback Arc Set12 creates a FAS instance composed of the digraph H with 2n vertices, max-indegree and max-outdegree not larger than δ + 1, and a number of arcs equal to 2n + 2m, starting from a Vertex Cover instance composed of a graph G, having n nodes, maximum degree δ, and m edges. The reduction guarantees the existence of a k-Vertex Cover in G(V, E) iff a k-Feedback Arc Set exists in H(V, A).

The Vertex Cover instances of Clementi & Trevisan (1999) have maximum degree δ c = O(1), so that m δ

2 n and, also, m n

2 ;13 Clementi & Trevisan (1999) show that there exists a constant c > 1 such that it is NP-hard to distinguish whether the minimum Vertex Cover of such an instance has size k or c k, for k m

Then, if we plug the Vertex Cover instances of Clementi & Trevisan (1999) into the reduction of Karp (1972), we get a digraph H having m = 2n + 2m arcs, with

2 n = 2n + δn (2 + c) n,

and such that it is NP-hard to tell whether the minimum FAS of H is smaller than k or larger than c k, for the inapproximability ratio c > 1 of (Clementi & Trevisan, 1999).

We observe that, by k m

2 and δ c, we get k n

2c. Also, by m (2 + c) n, we get that k m 2c(2+c).

Now, let t = |{(j, i)|(j, i) A 1 i < j n}| be the number of arcs of H from a node of higher index to a node of lower index. We can assume that t c k.15

Now, consider the problem that a Separation Oracle for (5) has to solve that is, consider Problem 2. Given a digraph H with m arcs on the vertex set [n ], for each 1 i < j n , we set (i) i,j = 1 m if (i, j) is an arc of H, (ii) i,j = 1

m if (j, i) is an arc of H, and (iii) i,j = 0 otherwise. Now, let t be the number of pairs 1 i < j n such that i,j < 0. We set D = t c k

m observe, then, that D 0. The question is whether this assignment satisﬁes each constraint of (5), or whether it is off on some constraint by some additive constant α > 0.

Observe that, for any permutation π of the vertices, if C (π) is the cost of π for the FAS instance H, then

1 i<j n i πj

i,j = C (π)

Observe that the total ℓ1 weight of the -tournament is 1; thus, the | |1 1 constraint is satisﬁed. The reduction above shows that it is NP-hard to tell whether minπ Sn C (π) k t

m , or minπ Sn C (π) c k t

m . Now, in the latter case, our choice of D guarantees that each constraint is satisﬁed; in the former case, instead, some constraint (in particular, a constraint corresponding to a permutation obtaining a minimum Feedback Arc Set) would be off by at least an additive (c 1) k m term. Thus, it is NP-hard to determine if all the constraints of the separation oracle are satisﬁed, or if there exists a constraint that is off by an additive term of

α = (c 1) k

m c 1 2c2 + 4c = Ω(1).

Thus, there exists a constant α > 0 such that it is NP-hard to distinguish whether an assignment to (5) satisﬁes each constraint, or whether there exists some permutation constraint on which it is off by at least α. By the polynomial time equivalence of the separation oracle problem and the LP optimization problem (Gr otschel et al., 1988), it is thus NP-hard to ﬁnd the RUM that minimizes the maximum error with respect to a given tournament matrix.

12That is, to the version of Problem 1, where each arc has the same weight of 1. 13Indeed, nodes of degree 0 can be removed from the graph without impacting the size of its optimal Vertex Covers. And, a graph of n nodes, having no nodes of degree 0, has to have at least n

2 edges. 14Recall that a graph G with m edges and maximum degree δ, cannot have a Vertex Cover of size smaller than m

δ . That is, the Vertex Cover problem reduced to instances where k < m

δ is solvable in polynomial time. Thus, we can assume to reduce the set of hard Vertex Cover instances to those such that k m

δ . 15Given that we have a gap problem, which requires us to distinguish between graphs whose minimum FAS has value smaller than k, and graphs with minimum FAS having value larger than c k, instances that do not satisfy this property can be easily solved in polynomial time in such instances, the permutation 1 2 n has a FAS cost not larger than t < c k.

RUMs from Head-to-Head Contests

C. Certifying optimality

If the RUM returned by Algorithm 1 is at distance 0 from the matrix P, its optimality is trivially guaranteed. If, instead, the distance is non-zero, we need to do extra work to certify it. One way to do this is to test whether the dual solution D returned by the last iteration of the loop of Algorithm 1 satisﬁes all dual constraints, i.e., we have to obtain the exact minimum FAS of the transitive tournament with D( ) as the weights.

In the case of the Jester dataset in which n = 100, doing so is infeasible since the runtime of the exact dynamic programming algorithm for FAS is O(n2n); we then have no positive lower bound on the smallest distance to a RUM of that particular matrix. For the SF dataset in which n = 35, it so happens that the dual solution D obtained by the last iteration of Algorithm 1 s loop induces a directed graph with only 28 vertices hit by at least one non-zero arc. Then, the minimum FAS instance is effectively reduced to a graph of order n = 28: the exact algorithm terminated in less than a minute, certifying (i) the feasibility of D for the unrestricted dual (3) and that (ii) the distance to the SF matrix of the RUM returned by Algorithm 1 is optimal (see Table 1).

Thus, while Algorithm 1 is a heuristic, it always returns a RUM and hence always gives a correct upper bound on the distance of the input matrix to RUMs. Moreover, with an extra ﬁnal check on the feasibility for the dual of its last D solution, the heuristic can also certify a lower bound on the distance of the input matrix to the best approximating RUM. In our experiments, when applied, this ﬁnal check was made with an exact algorithm. The Ellipsoid Algorithm, instead, makes this check using the Approximate Separation Oracle of Theorem 8.

D. The 2L model

The two-level (2L) model of Veerathu & Rajkumar (2021) partitions the n items into groups: each group is associated to a positive weight, and to a rank-2 tournament between its items. Pairwise matches between players in a same group are governed by the rank-2 tournament; if a player belongs to group i, and a second player belongs to group j = i, and if the weights of the two groups are wi and wj, then the ﬁrst player wins with probability proportional to wi (that is, with probability wi wi+wj ).

One can easily show that there exist tournament matrices that can be represented via rank-2 tournaments (and thus via the 2L model), but not with RUMs. We show here that there also exist matrices that can be represented by RUMs, but not by 2L models.

In particular, consider the following tournament matrix over the set of players X = {(i, j) | 0 i, j 2},

(0, 0) (0, 1) (0, 2) (1, 0) (1, 1) (1, 2) (2, 0) (2, 1) (2, 2)

α 1 α α α α 1 α 1 α 1 α (0, 0) 1 α α α α α 1 α 1 α 1 α (0, 1) α 1 α α α α 1 α 1 α 1 α (0, 2) 1 α 1 α 1 α α 1 α α α α (1, 0) 1 α 1 α 1 α 1 α α α α α (1, 1) 1 α 1 α 1 α α 1 α α α α (1, 2) α α α 1 α 1 α 1 α α 1 α (2, 0) α α α 1 α 1 α 1 α 1 α α (2, 1) α α α 1 α 1 α 1 α α 1 α (2, 2)

for α [0, 1]. With this matrix,

player (i, j) beats player (i, k), for k = (j + 1) mod 3, with probability 1 α, and player (i, j) beats player (k, ℓ), for k = (i + 1) mod 3, with probability 1 α.

Note that for each α = 1 2, Pα has various disjoint non-transitive triplets, e.g., for each i {0, 1, 2}, the subset {(i, 0), (i, 1), (i, 2)} is a non-transitive triplet. Moreover, Pα conjoins those three disjoint non-transitive triplets to form other non-transitive triplets: for each i, j, k {0, 1, 2}, the subset {(0, i), (1, j), (2, k)} is also a non-transitive triplet.

We now show that (unrestricted) RUMs are able to represent such a Pα for each 1

3. We will later show that 2L models can represent Pα only if α = 1

2. Lemma 13. For each 1

3, the tournament matrix Pα can be represented with a RUM.

RUMs from Head-to-Head Contests

Proof. We start by observing that if Pα is representable with a RUM, then P1 α is also representable with a RUM. Indeed, for a permutation π of X, let π be the reverse of π. We also deﬁne the reverse of a RUM: the reverse R of RUM R is the RUM that assigns probability R(π) to the permutation π, for each π in the support of R. Then, for each pair a and b of items, the probability that a beats b in R is equal to the probability that b beats a in R. Observe that if Pα requires the generic item a beat another item b with probability pa,b, then P1 α requires b to beat a with probability pa,b. Thus, if R is a RUM for Pα, then R is a RUM for P1 α.

We now prove that the existence of a RUM for P1/3 implies the existence of a RUM for Pα, for each 1

First, if both the tournament matrix P and the tournament matrix P admit RUMs, then for each β [0, 1], the tournament matrix β P + (1 β) P also admits a RUM: to sample a permutation from this RUM, one ﬁrst ﬂips a coin with head probability β; if the coin comes up heads, one samples an independent permutation from the RUM of P , otherwise one samples an independent permutation from the rum of P .

We have already proved that the existence of a RUM for P1/3 implies the existence of a RUM for P2/3. Let Mα be the RUM that is obtained by mixing the RUM for P1/3, with weight β = 2 3α, and the RUM for P2/3, with weight 1 β = 3α 1. Then, an event that has probability 1

3 in P1/3 (and thus probability 2

3 in P2/3) ends up having probability (2 3α) 1

3 + (3α 1) 2

3 = α in Mα; and, likewise, an event having probability 2

3 in P1/3 (and probability 1

3 in P2/3) ends up having probability (2 3α) 2

3 + (3α 1) 1

3 = 1 α in Mα.

Thus, the existence of a RUM for P1/3 implies the existence of a RUM for Pα for each 1

Then, there only remains to prove that there exists a RUM for P1/3.

Consider the set S = {2 1 0, 0 2 1, 1 0 2} of permutations. We will consider the uniform distribution on S; with this distribution, for each i {0, 1, 2}, the probability that i beats (i + 1) mod 3 is exactly 2/3.

Our RUM R1/3 behaves as follows. For each i {0, 1, 2}, let πi be sampled independently and uniformly at random from S. Moreover, let π be also sampled independently and uniformly at random from S. Then, the RUM returns the permutation (π(0), ππ(0)(0)) (π(0), ππ(0)(1)) (π(0), ππ(0)(2)) (π(1), ππ(1)(0)) (π(1), ππ(1)(1)) (π(1), ππ(1)(2)) (π(2), ππ(2)(0)) (π(2), ππ(2)(1)) (π(2), ππ(2)(2))

With this RUM, for each player (i, j), and for k = (j + 1) mod 3, we have that Pr [(i, j) beats (i, k)] = 2/3.

Analogously, for each player (i, j), for k = (i+1) mod 3, and for each ℓ {0, 1, 2}, we that Pr [(i, j) beats (k, ℓ)] = 2/3. Thus R1/3 is a RUM that represents P1/3.

We then show that the 2L model of (Veerathu & Rajkumar, 2021) cannot represent Pα unless α = 1

2. In P1/2, the winner of each single match is chosen by ﬂipping a fair coin: thus, one can represent P1/2 with a uniform MNL and, consequently, with a 2L model.

Our construction and our proof hinge on some constraints that 2L models have to obey when representing non-transitive triplets. As we will see, these constraints are incompatible with Pα for each α = 1

Lemma 14. For each α = 1

2, the tournament matrix Pα cannot be represented with a 2L model.

Proof. Recall that X = {(i, j) | 0 i, j 2} is the set of players; let us also deﬁne three disjoint subsets of X, that we call parts , Xi = {(i, j) | 0 j 2} for 0 i 2.

Recall that the model of Veerathu & Rajkumar (2021) is on two levels. The model partitions the player set X in groups G0, G1, . . ., and assigns a positive weight wi to the generic group i. When two players, respectively from groups Gi and Gj, i = j, play against each other, a MNL is used to determine the winner: the player from group i wins with probability wi wi+wj , and the one from group j wins with probability wj wi+wj . When two players from the same group Gi play against each other, the game is decided by a rank-2 tournament.

Suppose by contradiction that a 2L model for representing Pα, α = 1

2 is composed of groups G0, . . . , Gt 1. Then,

RUMs from Head-to-Head Contests

there cannot exist a part Xi that is split among 3 distinct groups, say, Gi1, Gi2, Gi3: indeed, if this were the case, there would have to exist a permutation π of the indices such that wiπ(1) wiπ(1)+wiπ(2) > 1

2, wiπ(2) wiπ(2)+wiπ(3) > 1

2, wiπ(3) wiπ(3)+wiπ(1) > 1

i.e., wiπ(1) > wiπ(2) > wiπ(3) > wiπ(1), a contradiction; also, there cannot exist a part Xi that is split among 2 distinct groups, say, (i, j), (i, k) Ga and (i, ℓ) Gb for {j, k, ℓ} = {0, 1, 2} and a = b. Indeed, if this were the case, the probability that the model assigns to (i, ℓ) beating (i, j), i.e., wb wa+wb , would be the same to that it assigns to (i, ℓ) beating (i, k), contradicting Pα s requirements.

Therefore, each Xi is to be fully contained in some group; in particular, there can either be 1, 2, or 3 groups.

Veerathu & Rajkumar (2021) prove that rank-2 tournaments are unable to capture tournament matrices P having indices a, b, c, d such that Pa,b, Pb,c, Pc,a > 1

2 and Pa,d, Pb,d, Pc,d > 1

2, or such that Pa,b, Pb,c, Pc,a > 1

2 and Pa,d, Pb,d, Pc,d < 1

2. In other words, rank-2 tournaments cannot represent a non-transitive triplet (a is stronger than b, b is stronger than c, and c is stronger than a) together with an extra item that is either stronger than each item in the triple, or weaker than each item in the triple (i.e., each of a, b, c is weaker than d, or each of a, b, c is stronger than d).

Thus, if the partition into groups of the 2L model of Veerathu & Rajkumar (2021) has a group Gi having some Xj as a subset, and containing some element of X Xj, then the rank-2 tournament of Gi is unable to represent Pα restricted to Gi; thus the 2L model is unable to represent Pα.

It follows that a 2L model, to correctly represent Pα, has to contain exactly 3 groups, in fact, one group for each of the three parts X0, X1, X2. Without loss of generality, let us assume that Gi = Xi for i {0, 1, 2}.

Now, suppose that α < 1

2. To correctly represent Pα, the MNL weights would have to guarantee that the weight w1 of G1 is larger the weight w2 of G2 (since P(0,0),(1,0) = α < 1 2), that w2 > w3 (since P(1,0),(2,0) = α < 1 2), and that w3 > w1 (since P(2,0),(0,0) = α < 1

2), which is a contradiction. Likewise, suppose that α > 1

2. In this case, to correctly represent Pα, the MNL weights would have to guarantee that w1 < w2 (since P(0,0),(1,0) = α > 1

2), that w2 < w3 (since P(1,0),(2,0) = α > 1

2), and that w3 < w1 (since P(2,0),(0,0) = α > 1

2), which is also a contradiction.

Thus, the tournament matrix Pα cannot be represented by a 2L model, for each α = 1

The following is then immediate.

Corollary 15. For each α 1

3 , the tournament matrix Pα can be represented by a RUM and cannot be represented by 2L models.

We point out that the same result could be obtained by a subset of the rows and columns of Pα it is sufﬁcient to have each element of one of the three parts Xi, together with an arbitrary element of a second part Xj, and an arbitrary element of the last part Xk, for {i, j, k} = {0, 1, 2}. We used all the 9 elements of the union of the three parts, since we felt that, this way, the RUM construction is clearer. In fact, while our construction only tensors an intransitive triplet with itself, RUMs can be shown to withstand arbitrarily many such tensorings, provided that each triplet s α is bounded between 1