# RUMs from Head-to-Head Contests

Matteo Almanza¹ ² · Flavio Chierichetti³ · Ravi Kumar⁴ · Alessandro Panconesi³ · Andrew Tomkins⁴

¹Algorand Labs. ²This work was carried out while the author was at Sapienza University of Rome. ³Sapienza University of Rome. ⁴Google Mountain View. Correspondence to: Matteo Almanza.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

**Abstract.** Random utility models (RUMs) encode the likelihood that a particular item will be selected from a slate of competing items. RUMs are well-studied objects in both discrete choice theory and, more recently, in the machine learning community, as they encode a fairly broad notion of rational user behavior. In this paper, we focus on slates of size two, representing head-to-head contests. Given a tournament matrix M such that $M_{i,j}$ is the probability that item j will be selected from {i, j}, we consider the problem of finding the RUM that most closely reproduces M. For this problem we obtain a polynomial-time algorithm returning a RUM that approximately minimizes the average error over the pairs. Our experiments show that RUMs can perfectly represent many of the tournament matrices that have been considered in the literature; in fact, the maximum average error induced by RUMs on the matrices we considered is negligible (at most 0.001). We also show that RUMs are competitive, on prediction tasks, with previous approaches.

## 1. Introduction

Random utility models (RUMs) are perhaps the most important model in discrete choice (Train, 2003). In full generality, they posit that a user selecting an item from some slate of choices has a personal vector encoding the value of every item in the universe; the user behaves rationally, selecting the available item in the slate with the highest value. As each user may have a different value vector, the RUM model allows a wide range of effects. There are no known efficient learning algorithms for RUMs.

In pairwise choice, the setting is the same, but the user must choose between only two alternatives, rather than from an arbitrary slate of alternatives as in the full model. Pairwise choice arises whenever head-to-head competitions occur, as in two-player games, two-alternative forced-choice testing (Bogacz et al., 2006), or online experiences that compare an item with an alternative. The pairwise choice model is well studied, with a number of approaches that learn over either subsets of the class of RUMs or incomparable models (Chen & Joachims, 2016; Adams et al., 2010; Makhijani & Ugander, 2019; Veerathu & Rajkumar, 2021).

Our main technical findings are general RUM learning algorithms for pairwise choice. First, we introduce RIPPLE, for Randomized Interior Point Permutation LEarner. We show that, for pairwise choice, RIPPLE is guaranteed to return, in polynomial time, a RUM that attains almost the best possible average error over all RUMs. The polynomial-time guarantee requires solving an exponential-size dual linear program using the ellipsoid method, which is unlikely to be efficient in practice. Hence, we also introduce RUMRUNNER, a practical variant which is also guaranteed to return a near-optimal RUM. RUMRUNNER uses a separating-hyperplane heuristic that performs efficiently in our experimental evaluation, but, unlike RIPPLE, it does not have a guarantee of polynomial running time.
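To make the central object concrete, here is a minimal sketch (our illustrative Python, not the authors' code) that computes the tournament matrix induced by a RUM given as a distribution over rankings:

```python
from itertools import combinations

# A toy RUM over n = 3 items, given as a map: ranking (best-first tuple) -> probability.
# It mixes two "user segments" with different preferences.
rum = {
    (0, 1, 2): 0.6,  # 60% of users prefer 0 > 1 > 2
    (2, 0, 1): 0.4,  # 40% of users prefer 2 > 0 > 1
}
n = 3

def tournament_matrix(rum, n):
    """M[i][j] = probability that j is chosen from the slate {i, j}."""
    M = [[0.0] * n for _ in range(n)]
    for ranking, p in rum.items():
        pos = {item: r for r, item in enumerate(ranking)}  # rank of each item
        for i, j in combinations(range(n), 2):
            if pos[j] < pos[i]:   # j is ranked above i, so j wins {i, j}
                M[i][j] += p
            else:
                M[j][i] += p
    return M

M = tournament_matrix(rum, n)
for row in M:
    print([round(x, 2) for x in row])
# M[i][j] + M[j][i] == 1 for every pair, as required of a tournament matrix.
```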
While previous results in this area rely on heuristics that optimize non-concave likelihoods, often requiring multiple runs to escape poor local optima, RUMRUNNER is guaranteed to converge to a solution whose value is ϵ-close to the optimum, and it has no hyper-parameters to tune. In our empirical evaluations, we observe that RUMRUNNER is neither subject to poor local optima nor affected by stability issues.

**Our approach.** A RUM may be characterized as a distribution over users, each of whom is represented by a vector giving the value of each item in the universe. Given a pair of alternatives, the likelihood that the first element is selected is the likelihood that a value vector drawn from the distribution assigns a higher score to the first item than to the second. Since the decision depends only on which score is higher, we may represent these value vectors as permutations of {1, …, n} for a universe of n items, which simplifies a RUM to a distribution over permutations.

Our approach begins by writing a large linear program (LP) in which a variable encodes the weight of each permutation in the RUM. The LP contains, for each pair {i, j} of elements, a constraint asking that the total probability assigned to the permutations that rank j above i be within a given error of the true target probability; the objective of the LP is to minimize the total approximation error. This LP has exponentially many variables, but its dual has only polynomially many variables (albeit exponentially many constraints). The structure of the dual allows a separation oracle to be constructed by approximately solving an instance of the minimum Feedback Arc Set (FAS) problem. This in turn gives a polynomial-size set of permutations capable of additively approximating the best possible RUM, and the original LP can then be solved in polynomial time to recover the approximating solution.
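As an illustration of the primal side of this construction, the sketch below fits a RUM supported on an explicitly given set of permutations by solving the LP with scipy. It is a simplified stand-in for the paper's algorithms (which generate the support via the separation oracle rather than enumerating all n! permutations); all names are ours.

```python
import numpy as np
from itertools import combinations, permutations
from scipy.optimize import linprog

def fit_rum(M, perms):
    """Fit a RUM supported on `perms` to tournament matrix M,
    minimizing the average absolute error over pairs (a linear program)."""
    n = len(M)
    pairs = list(combinations(range(n), 2))
    P, E = len(perms), len(pairs)
    # Variables: [p_0..p_{P-1}, e_0..e_{E-1}]; minimize the average error.
    c = np.concatenate([np.zeros(P), np.ones(E) / E])
    A_ub, b_ub = [], []
    for t, (i, j) in enumerate(pairs):
        # win[k] = 1 iff j beats i under perms[k] (best-first tuples)
        win = np.array([1.0 if perm.index(j) < perm.index(i) else 0.0
                        for perm in perms])
        e_col = np.zeros(E); e_col[t] = 1.0
        A_ub.append(np.concatenate([win, -e_col])); b_ub.append(M[i][j])
        A_ub.append(np.concatenate([-win, -e_col])); b_ub.append(-M[i][j])
    A_eq = [np.concatenate([np.ones(P), np.zeros(E)])]  # probabilities sum to 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0], bounds=(0, None))
    return res.fun, dict(zip(perms, res.x[:P]))

# M[i][j] = Pr[j beats i]; this instance contains a non-transitive triplet,
# so the optimal average error is strictly positive.
M = [[0.0, 0.7, 0.4], [0.3, 0.0, 0.8], [0.6, 0.2, 0.0]]
err, weights = fit_rum(M, list(permutations(range(3))))
print(round(err, 4), {p: round(w, 3) for p, w in weights.items() if w > 1e-9})
```

For n = 3 the full enumeration is feasible; the point of RIPPLE and RUMRUNNER is precisely to avoid materializing all n! columns.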
**Empirical findings.** We present a comprehensive set of experiments comparing our RUM algorithm to standard and recent approaches to pairwise choice: the Blade-Chest (BC) algorithm of Chen & Joachims (2016) and the MV algorithm of Makhijani & Ugander (2019). In the pairwise setting, the entire data distribution has a particularly simple parameterization: it can be completely represented by an n × n tournament matrix M such that $M_{i,j}$ is the probability that j beats i when the user is shown the pair {i, j}. Furthermore, for several of our experimental datasets, the training data provides a high-accuracy estimate of M. In these cases, the training data is effectively the entire data distribution, so the best algorithm will exactly reproduce the matrix M. Thus, in addition to comparing test-set prediction accuracy (which will depend on the variance of the entries of M), we also report the ability of an algorithm to capture the matrix M as given. For datasets based on election data, our RUM-finding algorithm is able to perfectly encode M. For the other datasets, the error we attain in approximating M is on the order of $1/n^2$. For predictive tasks, our RUM algorithms also perform comparably to the BC and MV algorithms.

**Implications.** Finding a high-quality RUM for pairwise comparisons has several desirable properties. RUMs naturally provide the ability to predict winning probabilities for larger slates, even when trained only on pairwise data; no other algorithm we are aware of for this problem has this property. Our learned RUMs are based on weighted sets of permutations, and are thus naturally interpretable and introspectible; for instance, one can study the properties of the high-weight permutations. RUMs capture a natural notion of rational behavior, in which users select the item they value most highly. Moreover, the permutations learned by our RUM algorithm represent a de facto clustering of the user population into segments, characterized by their different behaviors with respect to their choices; this may have value beyond prediction, e.g., in market segmentation.

The rest of the paper proceeds as follows. Section 2 covers related work and Section 3 contains the background. Section 4 presents our results for minimizing average errors and shows evidence that our approach will not extend to the uniform error setting. Section 5 describes techniques to find succinct approximating RUMs. Experimental results are in Section 6, and Section 7 has concluding thoughts. All missing proofs are in the Supplementary Material.

## 2. Related work

The problem of representing a tournament matrix with choice models has often been considered in the literature; we mention the works that are closest in spirit to ours.

Chen & Joachims (2016) proposed the Blade-Chest (BC) model, which represents each player i with two vectors, $b_i$ (blade) and $c_i$ (chest), in Euclidean space; the probability that player i beats player j is a function of $b_j$, $c_i$, $b_i$, $c_j$. The BC model can produce any tournament matrix, in particular matrices with non-transitive triplets (players i, j, k such that each beats the next with probability more than 1/2: $P_{i,j} > 1/2$, $P_{j,k} > 1/2$, and $P_{k,i} > 1/2$), which are observed in practice and cannot be achieved with standard MNL models. The BC model is optimized via gradient descent (GD) on a non-concave function, with no global-optimum guarantee. Earlier, Adams et al. (2010) proposed a similar model to represent the outcomes of baseball matches.

Makhijani & Ugander (2019) proposed a different model, the Majority Vote (MV) model. In this model, each player is represented by k features, with k odd. Each feature corresponds to an independent RUM (i.e., a RUM whose noise vector η, see Section 3.1, has mutually independent entries), e.g., an MNL or a Gaussian Thurstone model. A match between two players i and j consists of k rounds: in the t-th round, the two players play independently against each other using only their t-th features. The player who wins the majority of the rounds is the winner of the match. They proposed GD to maximize the model's (non-concave) likelihood, with no guarantees on reaching the global optimum. Interestingly, they show that, for a class of models, the ability of models in the class to induce non-transitivity implies the non-log-concavity of their likelihoods; i.e., the GD procedure of any of these models could still get stuck at a local optimum.

In fact, Makhijani & Ugander (2019) suggest general (i.e., possibly dependent) RUMs as a natural choice for representing tournament matrices, but point out that previous work (Train, 2003) has found optimization over unrestricted RUMs to be hard. In our paper we tackle this problem: we show how to find an approximately optimal unrestricted RUM for a tournament matrix, and we circumvent the aforementioned non-concavity issues by casting the problem combinatorially and solving it using an LP and the ellipsoid method.
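Since non-transitive triplets recur throughout this discussion, here is a small self-contained check (our illustrative code, using the convention P[i][j] = Pr[j beats i]) that finds them in a tournament matrix:

```python
from itertools import combinations

def nontransitive_triplets(P):
    """Triplets where each player beats the next around a cycle with
    probability > 1/2, under the convention P[i][j] = Pr[j beats i]."""
    n = len(P)
    found = []
    for i, j, k in combinations(range(n), 3):
        for a, b, c in ((i, j, k), (i, k, j)):  # the two cycle orientations
            if P[a][b] > 0.5 and P[b][c] > 0.5 and P[c][a] > 0.5:
                found.append((a, b, c))  # b beats a, c beats b, a beats c
    return found

# A rock-paper-scissors-like matrix: no MNL can produce it, but a RUM can.
P = [[0.5, 0.8, 0.2],
     [0.2, 0.5, 0.8],
     [0.8, 0.2, 0.5]]
print(nontransitive_triplets(P))  # -> [(0, 1, 2)]
```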
We also observe experimentally that these unrestricted RUMs almost perfectly represent the tournament matrices that originated them, vindicating the claim that RUMs are ideal for this task.

Finally, we mention the novel two-level (2L) model of Veerathu & Rajkumar (2021). In general, their model is as powerful (and as hard to optimize) as the BC model. They thus restrict their model to rank-2 tournaments and show a polynomial-time learning algorithm for this case. (In their experiments, they do not measure how close their guesses of the win probabilities are; they consider the upset metric, which counts the number of pairs of players such that the model correctly identifies the most-likely winner of the pair.) Interestingly, their learning algorithm appeals to the minimum FAS problem, which they show can be solved optimally in polynomial time on rank-2 tournaments. We observe that there exist dependent RUMs that cannot be represented by the 2L model (see Appendix D).

The problem of actively, or passively, learning RUMs has been studied in several papers (Soufiani et al., 2012; Oh & Shah, 2014; Chierichetti et al., 2018a;b; Negahban et al., 2018; Tang, 2020). Farias et al. (2009) and Chierichetti et al. (2021) considered the problem of bounding a RUM's support without changing the RUM's behavior. The goal of our paper, instead, is to find the RUM (approximately) closest to a given tournament matrix.

## 3. Preliminaries

### 3.1. Discrete Choice and Random Utility Models

In our paper, we use [n] to denote the set {1, …, n}, and $2^{[n]}$ to denote the power set of [n]. We also use $\binom{[n]}{k}$ to denote the class of subsets of [n] of cardinality k. We use $S_n$ to denote the set of all permutations of [n], and $S'_n = S_n \setminus \{(1\,2\,\cdots\,n)\}$ to denote the set of all permutations of [n] with the exception of the permutation that reverses the natural ordering relation. For a permutation $\pi \in S_n$ and an item $i \in [n]$, we let $\pi(i) \in [n]$ be the value (or position) of item i in π. For instance, if π = (3 1 4 2), then π(2) = 4, π(4) = 3, π(1) = 2, π(3) = 1.

As is standard in the discrete choice literature, we use slate to denote a set in $2^{[n]} \setminus \{\emptyset\}$, i.e., any non-empty subset of [n]. For a permutation $\pi \in S_n$ and a slate $T \subseteq [n]$, we let
$$\pi(T) = \arg\max_{i \in T} \pi(i)$$
denote the item of T that has the largest value in π, i.e., the winner in T given π.

A random utility model (RUM) on [n] is a distribution D on $S_n$. (Henceforth, we drop [n] when it is clear from the context.) For a slate T, we use $D_T$ to denote the distribution of the random variable π(T) for π ∼ D. (Clearly, $\mathrm{supp}(D_T) \subseteq T$.) In other words, $D_T$ is the induced distribution of the winner in T for a random permutation from D. Let $D_T(i)$ be the probability that i is the winner in T.

RUMs are often presented in terms of noisy item evaluations made by users: each item i has a base value $V_i$; each user samples $(\eta_1, \ldots, \eta_n)$ from a joint noise distribution (recall that, in an independent RUM, the $\eta_i$'s are mutually independent); and the utility of item i to this user is $U_i = V_i + \eta_i$. Given a slate, the rational user chooses the highest-utility item in the slate (with ties broken uniformly at random). Equivalently, the user first sorts all the items decreasingly according to the observed $U_i$'s (again, with ties broken uniformly at random) to obtain a permutation; then, for a given slate, the rational user chooses the highest-ranked item in the slate according to the permutation. It is easy to see that these two definitions are equivalent (Chierichetti et al., 2018a).
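The equivalence of the two views can be checked numerically. The sketch below is ours; a Gaussian-noise (Thurstone-style) independent RUM is assumed purely for illustration:

```python
import random

def sample_winner_utilities(slate, V, noise):
    """Utility view: U_i = V_i + eta_i; pick the argmax over the slate."""
    U = {i: V[i] + noise() for i in V}
    return max(slate, key=lambda i: U[i])

def sample_winner_permutation(slate, V, noise):
    """Permutation view: sort all items by utility once; the winner of any
    slate is its highest-ranked member. Same distribution as above."""
    U = {i: V[i] + noise() for i in V}
    ranking = sorted(V, key=lambda i: U[i], reverse=True)  # best first
    pos = {item: r for r, item in enumerate(ranking)}
    return min(slate, key=lambda i: pos[i])

V = {0: 1.0, 1: 0.8, 2: 0.2}            # base values
noise = lambda: random.gauss(0.0, 1.0)  # iid Gaussian noise
random.seed(0)
slate = {0, 2}
trials = 100_000
for f in (sample_winner_utilities, sample_winner_permutation):
    wins = sum(f(slate, V, noise) == 0 for _ in range(trials))
    print(f.__name__, wins / trials)    # the two estimates agree up to noise
```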
### 3.2. Tournaments and Feedback Arc Set Problems

Given a set [n] of players and $\{i,j\} \in \binom{[n]}{2}$, a tournament matrix $\{P_{i,j}\}_{i,j=1}^{n}$ gives the (empirical) probability $P_{i,j}$ that j wins in a head-to-head contest with i. Clearly, $P_{i,j}, P_{j,i} \ge 0$ and $P_{i,j} + P_{j,i} = 1$, for each $\{i,j\} \in \binom{[n]}{2}$.

As we will show, RUM optimization for head-to-head contests (i.e., tournament matrices) is closely related to the minimum Feedback Arc Set (FAS) problem, which is NP-hard (Karp, 1972); the latter has two equivalent definitions. The first definition is perhaps the most well-known. A τ-bounded directed graph G(V, A, w) consists of the vertex set V = [n], an arc set $A \subseteq \{(i,j) \in V^2 \mid i \ne j\}$ satisfying $(i,j) \in A \Rightarrow (j,i) \notin A$, and a non-negative weight function $w : A \to [0, \tau]$ on its arcs.

Problem 1 (FAS on τ-bounded directed graphs). Given a τ-bounded directed graph G(V, A, w), find a permutation π of the vertices V that minimizes
$$C_{G(V,A,w)}(\pi) = \sum_{(i,j) \in A \,:\, i \succ_\pi j} w((i,j)).$$

The second definition forces the input graph to be a transitive tournament (i.e., a complete DAG) with arcs going from lower-indexed vertices to higher-indexed ones. A τ-bounded transitive tournament G(V, A, w) is a complete DAG consisting of the vertex set V = [n], the arc set $A = \{(i,j) \in V^2 \mid i < j\}$, and a weight function $w : A \to [-\tau, \tau]$ on its arcs.

Problem 2 (FAS on τ-bounded transitive tournaments). Given a τ-bounded transitive tournament G(V, A, w), find a permutation π of the vertices V that minimizes
$$C_{G(V,A,w)}(\pi) = \sum_{1 \le i < j \le n \,:\, i \succ_\pi j} w((i,j)).$$

Theorem 3 (Frieze & Kannan (1999)). There exists an approximation algorithm for Problem 1 that, for a τ-bounded directed graph G on n nodes and an arbitrary δ > 0, returns a permutation π* such that $C_G(\pi^*) \le O(\delta \tau n^2) + \min_{\pi \in S_n} C_G(\pi)$, in time $O(\delta^{-4} n + 2^{\tilde O(\delta^{-2})})$, with probability at least 2/3.

We will use this approximation algorithm for Problem 2; the two problems are equivalent from an additive-approximation point of view. The following is folklore.

Observation 4. Any α-additive-approximation polynomial-time algorithm for Problem 1 with a given τ can be transformed into an α-additive-approximation polynomial-time algorithm for Problem 2 with the same τ, and vice versa.

### 3.3. Approximating Tournaments by RUMs

We formally define the main problems that we consider.

Definition 5 (Average RUM approximation). A given tournament matrix P is approximated on average to within ϵ by a RUM R if
$$\operatorname{avg}_{1 \le i < j \le n} \left| P_{i,j} - R_{\{i,j\}}(j) \right| \le \epsilon.$$

Problem 6 asks, given a tournament matrix P, for a RUM R approximating P on average to within the smallest possible ϵ. Analogously, Definition 7 (uniform RUM approximation) measures the approximation with the maximum, rather than the average, of the errors $|P_{i,j} - R_{\{i,j\}}(j)|$ over the pairs; the uniform version of the problem asks for a RUM minimizing this maximum error.

## 4. Minimizing Average Errors

Problem 6 can be written as a linear program whose variables are the per-pair errors and a weight $p_\pi$ for each permutation π; the constraints ask, for each pair {i, j}, that the total weight of the permutations ranking j above i be within the allowed error of $P_{i,j}$. Since the weights sum to one, the variable of the permutation $(1\,2\,\cdots\,n)$ can be eliminated: the probability that the RUM R assigns to $(1\,2\,\cdots\,n)$ is then equal to $1 - \sum_{\pi \in S'_n} p_\pi$. Let us first rewrite the above primal LP (1) as a program over the variables $\{p_\pi\}_{\pi \in S'_n}$ with objective $\min \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \epsilon_{i,j}$. Its dual (3) has polynomially many variables, among them a variable D and variables $\Delta_{i,j}$ for $1 \le i < j \le n$, but exponentially many constraints: besides polynomially many constraints $c_\rho$, $c_D$, and $c_{i,j}$, it has one constraint $c_\pi$ for each permutation $\pi \in S'_n$, of the form
$$D - \sum_{1 \le i < j \le n \,:\, i \succ_\pi j} \Delta_{i,j} \le 0.$$
To run the ellipsoid method on the dual, we need a separation oracle; we obtain an approximate one.

Theorem 8 (Approximate separation oracle). There is a polynomial-time randomized algorithm that, given an assignment to the variables of the dual (3) and a function δ(n): (i) if one of the $c_\rho$, $c_D$, or $c_{i,j}$ constraints is unsatisfied, returns it; (ii) if some constraint $c_\pi$ is violated by more than δ(n), returns an unsatisfied $c_\pi$ constraint; otherwise, (iii) the algorithm might not return any unsatisfied constraint (even if some exists). Moreover, if there exists at least one $\pi \in S'_n$ such that $D - \sum_{1 \le i < j \le n \,:\, i \succ_\pi j} \Delta_{i,j} > \delta(n)$, then, for any constant c > 0, if the above randomized algorithm is run independently Θ(c log n) times, the probability that no unsatisfied constraint is returned shrinks to $n^{-c}$.
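Before the proof, a toy version of the separation step may help. The sketch below (ours) scans all permutations of a tiny instance for one whose Problem-2 cost falls below the dual value D, exactly the kind of violated constraint $c_\pi$ the theorem's algorithm looks for; the theorem replaces this exponential scan with the Frieze-Kannan algorithm.

```python
from itertools import permutations

def problem2_cost(perm, w):
    """Problem 2 cost: total weight of pairs (i, j), i < j, that perm
    inverts, i.e., ranks i above j (tuples list items best-first)."""
    pos = {item: r for r, item in enumerate(perm)}
    return sum(wij for (i, j), wij in w.items() if pos[i] < pos[j])

def brute_force_oracle(w, n, D):
    """Toy separation oracle: look for a constraint c_pi violated by the
    dual assignment, i.e., a permutation with cost below D, exhaustively."""
    best = min(permutations(range(n)), key=lambda p: problem2_cost(p, w))
    return best if problem2_cost(best, w) < D else None

# Dual weights Delta on the transitive tournament over [4]; can be negative.
w = {(0, 1): 0.3, (0, 2): -0.1, (1, 2): 0.2, (1, 3): -0.4, (2, 3): 0.1}
print(brute_force_oracle(w, 4, D=-0.2))  # -> a permutation with cost < D
```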
Proof. The algorithm can check the validity of each of the $c_\rho$, $c_D$, and $c_{i,j}$ (for $1 \le i < j \le n$) constraints in time $O(n^2)$; if one of these constraints is unsatisfied, it can be returned. Otherwise, they are all valid, and the algorithm creates the transitive tournament G(V, A, w) with V = [n], $A = \{(i,j) \mid 1 \le i < j \le n\}$, and $w((i,j)) = \Delta_{i,j}$. Then G(V, A, w) is a τ-bounded transitive tournament, for $\tau \in O(n^{-2})$, i.e., an instance of Problem 2. In fact, consider any $\pi \in S'_n$: if $C_{G(V,A,w)}(\pi)$ is the cost of solution π for the instance G(V, A, w) of Problem 2, then the constraint $c_\pi$ of the dual is exactly $D - C_{G(V,A,w)}(\pi) \le 0$, or equivalently, $C_{G(V,A,w)}(\pi) \ge D$.

Theorem 3 and Observation 4 imply the existence of an algorithm for Problem 2 on G(V, A, w) that additively approximates the optimal solution to within any δ > 0 in time $O(\delta^{-4} n + 2^{\tilde O(\delta^{-2})})$. In particular, for each $\delta(n) \in \Omega(\log^{-0.49} n)$, the problem can be approximated to within an additive δ(n) in time $O(n^2)$.

Suppose that π* is the permutation returned by the additive-approximation algorithm. Then, with probability at least 2/3, for each $\pi \in S_n$ it holds that $C_{G(V,A,w)}(\pi) \ge C_{G(V,A,w)}(\pi^*) - \delta(n)$. Now, if $D - C_{G(V,A,w)}(\pi^*) > 0$, then $c_{\pi^*}$ is an unsatisfied constraint, which can be returned (indeed, if this happens, then $\pi^* \in S'_n$; in fact, $(1\,2\,\cdots\,n)$ is the only permutation in $S_n \setminus S'_n$, and $C_{G(V,A,w)}$ has a value of 0 on this permutation). Otherwise, $C_{G(V,A,w)}(\pi^*) \ge D$; then it must hold that $C_{G(V,A,w)}(\pi) \ge C_{G(V,A,w)}(\pi^*) - \delta(n) \ge D - \delta(n)$, i.e., for each $\pi \in S'_n$, the constraint $c_\pi$ is satisfied up to an additive δ(n). Running the algorithm independently Θ(c log n) times, and returning any violated constraint found, amplifies the 2/3 success probability as claimed. □

Plugging this oracle into the ellipsoid method (amplified so that, with high probability, the guarantee holds at each oracle call) yields our main result for average errors, Theorem 9: for every δ > 0, RIPPLE runs in polynomial time and returns a RUM with an error not larger than the smallest possible plus 2δ. (Such a RUM must exist since Problem 6 always admits a solution of value at most 1/2: indeed, the RUM such that $R_{\{i,j\}}(j) = 1/2$, for each $1 \le i < j \le n$, can be realized, e.g., with a uniform distribution over $S_n$.)

One might ask if the same technique can be used to minimize the uniform error as in Definition 7. Unfortunately, the separation oracle for this version is NP-hard to approximate to within some additive constant. We discuss the uniform error problem in Section B of the Supplementary Material.

## 5. Succinct Representation for Average Errors

The RUMs obtained from Theorem 9 might be supported by (polynomially) many permutations. In this section we discuss succinct representations of those (and other) RUMs. First, we note that a result of Chierichetti et al. (2021) implies that any RUM can be sketched with $O(\epsilon^{-2}\, n \log^2 n)$ bits in such a way that the probability distribution of each slate of size two is approximated to within an additive error ϵ. Here, we will show that $O(\epsilon^{-2}\, n \log n)$ bits suffice to approximate the average ℓ1-error (and the average ℓ2-error) over slates of size two to within an additive error ϵ. As we will see, our sketch is a uniform RUM supported on a multiset of $O(\epsilon^{-2})$ permutations, i.e., on a number of permutations that is independent of n. Note that a uniform RUM supported on k permutations can be represented by a k-dimensional vector per item; the generic item's vector contains the ranks of that item in the k permutations of the RUM. Therefore, our succinct representation can also be seen as an embedding of the items into an $O(\epsilon^{-2})$-dimensional space.

Define the (pairwise) RMSE between RUMs D and D′ as
$$\mathrm{RMSE}(D, D') = \sqrt{\frac{1}{n(n-1)} \sum_{i,j \in [n],\, i \ne j} \left( D_{\{i,j\}}(j) - D'_{\{i,j\}}(j) \right)^2}.$$
(The RMSE defined by Makhijani & Ugander (2019) is slightly different: their denominator is $n^2$ instead of $n(n-1)$. It is possible to show that our RUM approximation holds for their RMSE version as well.) We also define an ℓq-version of the above measure,
$$\rho_q(D, D') = \left( \frac{1}{n(n-1)} \sum_{i,j \in [n],\, i \ne j} \left| D_{\{i,j\}}(j) - D'_{\{i,j\}}(j) \right|^q \right)^{1/q},$$
so that $\rho_2$ coincides with the RMSE.

Theorem 10. For a RUM D on [n] and for each ϵ > 0, there is a RUM $\tilde D$ that samples uniformly from a multiset of $O(\epsilon^{-2})$ permutations such that $\rho_1(D, \tilde D) < \epsilon$ and $\rho_2(D, \tilde D) < \epsilon$.
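A minimal sketch of the sampling construction behind Theorem 10 (ours; a small n and a toy RUM, purely for illustration):

```python
import random
from itertools import combinations
from math import ceil

def pairwise_prob(rum, i, j):
    """D_{i,j}(j): probability that j beats i (j above i, best-first tuples)."""
    return sum(p for perm, p in rum if perm.index(j) < perm.index(i))

def sketch(rum, eps, rng):
    """Theorem 10's construction: sample k = O(eps^-2) iid permutations from
    the RUM and return the uniform RUM over the sampled multiset."""
    k = ceil(4 / eps ** 2)
    perms, probs = zip(*rum)
    return [(perm, 1.0 / k) for perm in rng.choices(perms, weights=probs, k=k)]

def rho(rum_a, rum_b, n, q):
    """Average l_q error over slates of size two (each unordered pair yields
    the same error under both orderings, so unordered pairs suffice)."""
    diffs = [abs(pairwise_prob(rum_a, i, j) - pairwise_prob(rum_b, i, j)) ** q
             for i, j in combinations(range(n), 2)]
    return (sum(diffs) / len(diffs)) ** (1 / q)

rng = random.Random(0)
rum = [((0, 1, 2, 3), 0.5), ((3, 2, 1, 0), 0.3), ((1, 0, 3, 2), 0.2)]
tiny = sketch(rum, eps=0.2, rng=rng)  # uniform over 100 sampled permutations
print(round(rho(rum, tiny, 4, 1), 3), round(rho(rum, tiny, 4, 2), 3))
```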
Since each permutation can be represented with O(n log n) bits, an immediate consequence of Theorem 10 is that all RUMs (in particular, those obtained by our algorithms) can be represented succinctly.

Corollary 11. For a RUM D on [n] and for each ϵ > 0, there is a RUM $\tilde D$ representable with $O(\epsilon^{-2}\, n \log n)$ bits such that $\rho_1(D, \tilde D) < \epsilon$ and $\rho_2(D, \tilde D) < \epsilon$.

## 6. Experiments

We present three groups of experiments. Section 6.3 studies the effectiveness of RUMRUNNER at representing a given tournament matrix. As described in the Introduction, we view this as the key metric for situations in which the tournament matrix is known to reasonable precision, and we find that the approximation error is very small for all datasets (zero, or on the order of $1/n^2$ for an n × n tournament). Next, in Section 6.4, we ask whether the resulting RUM can be approximated by a much smaller RUM; we consider a number of heuristics to sketch a RUM with support on only k = 10 permutations, and identify two approaches that are able to approximate all tournament matrices well in this regime. This suggests that very small RUMs may approximate real-world datasets, resulting in computational efficiencies and increased transparency. Finally, in Section 6.5, we consider a standard prediction setting in which the data is split between train and test, and the learned model is measured according to its generalization performance. This setting is appropriate if the training sample is too small to fully characterize the data distribution.

### 6.1. Linear Programs and Separation Oracles

For computational efficiency reasons, we do not use RIPPLE as described to fit RUMs to the datasets. Instead, we develop RUMRUNNER, which uses a separating-hyperplane approach, described in Algorithm 1.

Algorithm 1 (RUMRUNNER, a heuristic for Problem 6):

1: S ← ∅
2: π* ← (1 2 ⋯ n)
3: repeat
4: &nbsp;&nbsp; S ← S ∪ {π*}
5: &nbsp;&nbsp; Solve the primal LP (1) restricted to the variables ϵ and $p_\pi$ for π ∈ S; let P* be its optimal primal solution, and D* be its optimal dual solution
6: &nbsp;&nbsp; π* ← Viol-HP(D*)
7: until π* = ⊥
8: return the RUM induced by P*, i.e., the RUM that samples π ∈ S with probability P*($p_\pi$)

RUMRUNNER employs a separation oracle for the dual: Viol-HP, a randomized local search over permutations, described in Algorithm 2 (in our experiments, we set t = 100). For a permutation π, the search uses the quantity $\mathrm{fas}(\pi) = \sum_{1 \le i < j \le n \,:\, i \succ_\pi j} \Delta_{i,j}$, i.e., the Problem-2 cost of π under the dual weights.

## A.1. Proof of Observation 4

Proof. Let G = (V, A, w) be an instance of Problem 1. For each $1 \le i < j \le n$, set $w'((i,j)) = w((i,j))$ if $(i,j) \in A$, set $w'((i,j)) = -w((j,i))$ if $(j,i) \in A$, and set $w'((i,j)) = 0$ otherwise. Then $G' = (V, A', w')$, with $A' = \{(i,j) \mid 1 \le i < j \le n\}$, is an instance of Problem 2. Let $W = \sum_{(j,i) \in A \,:\, i < j} w((j,i))$; then, for each permutation $\pi \in S_n$, it holds that $C_{G'(V,A',w')}(\pi) = C_{G(V,A,w)}(\pi) - W$. Thus additive approximation is preserved in this direction.

Analogously, let $G' = ([n], A', w')$ be an instance of Problem 2. Let $A'' = \{(i,j) \mid (1 \le i < j \le n$ and $w'((i,j)) \ge 0)$ or $(1 \le j < i \le n$ and $w'((j,i)) < 0)\}$, and, for each $(i,j) \in A''$, let $w''((i,j)) = w'((i,j))$ if $i < j$, and $w''((i,j)) = -w'((j,i))$ if $j < i$. Then $G'' = (V, A'', w'')$ is an instance of Problem 1. Now, let $W' = \sum_{w'((i,j)) < 0} w'((i,j))$; then, for each permutation $\pi \in S_n$, it holds that $C_{G''(V,A'',w'')}(\pi) = C_{G'(V,A',w')}(\pi) - W'$. Thus additive approximation is preserved in this other direction as well. □

## A.2. Proof of Theorem 10

Proof. In a manner similar to Chierichetti et al. (2021), we sample $k = 4\,\epsilon^{-2}$ independent permutations $\pi_1, \ldots, \pi_k$ from D, and we let $\tilde D$ be the uniform RUM over the multiset $\{\pi_1, \ldots, \pi_k\}$; i.e., $\tilde D$ samples i uniformly at random from [k] and returns $\pi_i$. For arbitrary i < j, consider the random variable $X = \tilde D_{\{i,j\}}(j)$. Then $X = \hat k / k$, where $\hat k$ is the number of indices $t \in [k]$ such that $j \succ_{\pi_t} i$. Observe that, by the iid choice of the $\pi_t$'s from D, the distribution of $\hat k$ is equal to the binomial distribution $\mathrm{Bin}(k, p)$ with $p = D_{\{i,j\}}(j)$.
Then
$$E[X] = E\left[\tilde D_{\{i,j\}}(j)\right] = \frac{kp}{k} = p = D_{\{i,j\}}(j),$$
and
$$E\left[(X - p)^2\right] = \mathrm{Var}[X] = \frac{kp(1-p)}{k^2} = \frac{p(1-p)}{k} \le \frac{1}{4k},$$
since $p(1-p)$ is maximized at p = 1/2. Using Jensen's inequality, $E[|X - p|] \le \sqrt{E[(X-p)^2]} \le \frac{1}{2\sqrt{k}}$. Thus, for each $q \in \{1, 2\}$,
$$E\left[\left|\tilde D_{\{i,j\}}(j) - D_{\{i,j\}}(j)\right|^q\right] \le (4k)^{-q/2}.$$
Let $S_q = \sum_{i,j \in [n],\, i \ne j} \left|\tilde D_{\{i,j\}}(j) - D_{\{i,j\}}(j)\right|^q$. By the linearity of expectation, we have $E[S_q] \le n(n-1)\,(4k)^{-q/2}$. We also have $S_q \ge 0$; thus, by Markov's inequality, we get that $\Pr[S_q < 4E[S_q]] > 3/4$. Consider the event $\xi_q = \left\{ S_q < 4\,n(n-1)\,(4k)^{-q/2} \right\}$. We get $\Pr[\xi_q] \ge \Pr[S_q < 4E[S_q]] > 3/4$. Now, $\rho_q(D, \tilde D) = \sqrt[q]{S_q / (n(n-1))}$. Then ξ_q entails that
$$\rho_q(D, \tilde D) < \left( 4\,(4k)^{-q/2} \right)^{1/q} = 4^{1/q}\,(4k)^{-1/2} \le \frac{2}{\sqrt{k}} \le \epsilon,$$
where the last inequalities follow from q ≥ 1 and $k \ge 4\,\epsilon^{-2}$. Since $\Pr[\xi_q] > 3/4$ for $q \in \{1, 2\}$, with probability at least 1/2 the RUM $\tilde D$ is an ϵ-approximation of D in both the ρ1 and ρ2 senses. □

## B. Minimizing Uniform Error

As in the average error case, one can write the primal LP for minimizing the uniform error: it has a single error variable ϵ and, for each $1 \le i < j \le n$, a lower-bound constraint $L_{i,j}$ and an upper-bound constraint $U_{i,j}$ requiring $\left| P_{i,j} - \sum_{\pi \,:\, j \succ_\pi i} p_\pi \right| \le \epsilon$. In the dual, with a slight abuse of notation, let $L_{i,j}$ and $U_{i,j}$ also denote the corresponding dual variables. If $\min(L_{i,j}, U_{i,j}) > 0$, then we can subtract this quantity from both $L_{i,j}$ and $U_{i,j}$ with no impact on feasibility nor on the objective. This way, we can set $\Delta_{i,j} = U_{i,j} - L_{i,j}$. The ϵ constraint then becomes $\sum_{1 \le i < j \le n} |\Delta_{i,j}| \le 1$, and the resulting dual (5) again has one constraint per permutation $\pi \in S'_n$, of the form $D - \sum_{1 \le i < j \le n \,:\, i \succ_\pi j} \Delta_{i,j} \le 0$. Unlike in the average-error case, the separation problem for (5) is hard.

Theorem 12. There exists a constant α > 0 such that, given an assignment to (5), it is NP-hard to determine whether the assignment satisfies each constraint, or whether some constraint is off by at least α.

Proof. The classical reduction (Karp, 1972) from Vertex Cover to (unweighted) Feedback Arc Set (that is, to the version of Problem 1 where each arc has the same weight 1) starts from a Vertex Cover instance composed of a graph G having n nodes, maximum degree δ, and m edges, and creates a FAS instance composed of a digraph H with 2n vertices, max in-degree and max out-degree not larger than δ + 1, and 2n + 2m arcs. The reduction guarantees the existence of a k-Vertex Cover in G(V, E) iff a k-Feedback Arc Set exists in H(V, A).

The Vertex Cover instances of Clementi & Trevisan (1999) have maximum degree δ ≤ c = O(1), so that m ≤ δn/2 and, also, m ≥ n/2. (Indeed, nodes of degree 0 can be removed from the graph without impacting the size of its optimal Vertex Covers; and a graph of n nodes, having no nodes of degree 0, has to have at least n/2 edges.) Clementi & Trevisan (1999) show that there exists a constant c > 1 such that it is NP-hard to distinguish whether the minimum Vertex Cover of such an instance has size k or c·k, for k ≥ m/δ. (Recall that a graph G with m edges and maximum degree δ cannot have a Vertex Cover of size smaller than m/δ; that is, the Vertex Cover problem restricted to instances where k < m/δ is solvable in polynomial time. Thus, we can assume the set of hard Vertex Cover instances is reduced to those such that k ≥ m/δ.)

Then, if we plug the Vertex Cover instances of Clementi & Trevisan (1999) into the reduction of Karp (1972), we get a digraph H having $n' = 2n$ vertices and $m' = 2n + 2m \le 2n + \delta n \le (2 + c)\,n$ arcs, and such that it is NP-hard to tell whether the minimum FAS of H is smaller than k or larger than c·k, for the inapproximability ratio c > 1 of Clementi & Trevisan (1999). We observe that, by k ≥ m/δ, m ≥ n/2, and δ ≤ c, we get k ≥ n/(2c); also, by $m' \le (2 + c)\,n$, we get that $k \ge m'/(2c(2+c))$. Now, let $t = |\{(j,i) \mid (j,i) \in A \wedge 1 \le i < j \le n'\}|$ be the number of arcs of H from a node of higher index to a node of lower index. We can assume that t ≥ c·k. (Given that we have a gap problem, which requires us to distinguish between graphs whose minimum FAS has value smaller than k and graphs whose minimum FAS has value larger than c·k, instances that do not satisfy this property can be easily solved in polynomial time: in such instances, the permutation (1 2 ⋯ n′) has a FAS cost not larger than t < c·k, so the minimum FAS must be smaller than k.)

Now, consider the problem that a separation oracle for (5) has to solve, that is, consider Problem 2. Given the digraph H with m′ arcs on the vertex set [n′], for each $1 \le i < j \le n'$ we set (i) $\Delta_{i,j} = 1/m'$ if (i, j) is an arc of H, (ii) $\Delta_{i,j} = -1/m'$ if (j, i) is an arc of H, and (iii) $\Delta_{i,j} = 0$ otherwise. Then t is the number of pairs $1 \le i < j \le n'$ such that $\Delta_{i,j} < 0$. We set $D = -\frac{t - c\,k}{m'}$; observe, then, that D ≤ 0. The question is whether this assignment satisfies each constraint of (5), or whether it is off on some constraint by some additive constant α > 0.
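The relation between the tournament cost under the weights Δ and the FAS cost of H, used next, can be sanity-checked numerically. The sketch below is ours, with best-first tuples and arcs (u, v) read as "v beats u":

```python
from itertools import permutations

def positions(perm):
    return {x: r for r, x in enumerate(perm)}  # best-first positions

def tournament_cost(perm, delta):
    """Problem 2 cost of perm under the reduction's weights delta."""
    pos = positions(perm)
    return sum(v for (i, j), v in delta.items() if pos[i] < pos[j])

def fas_cost(perm, arcs):
    """Arcs disagreeing with perm: arc (u, v) is read as 'v beats u', so it
    is violated when perm ranks u above v."""
    pos = positions(perm)
    return sum(1 for (u, v) in arcs if pos[u] < pos[v])

# A small digraph H; arcs (u, v) with u > v are the t 'backward' arcs.
arcs = [(0, 1), (1, 2), (2, 0), (3, 1), (0, 3)]
m = len(arcs)
t = sum(1 for u, v in arcs if u > v)
delta = {(min(u, v), max(u, v)): (1 if u < v else -1) / m for u, v in arcs}
# The identity used in the proof holds for every permutation:
assert all(abs(tournament_cost(p, delta) - (fas_cost(p, arcs) - t) / m) < 1e-12
           for p in permutations(range(4)))
print("cost relation verified on all 24 permutations")
```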
Observe that, for any permutation π of the vertices, if C(π) is the cost of π for the FAS instance H, then
$$\sum_{1 \le i < j \le n' \,:\, i \succ_\pi j} \Delta_{i,j} = \frac{C(\pi) - t}{m'}.$$
Hence, the assignment satisfies every permutation constraint of (5) iff the minimum FAS of H has value at least $t + D\,m' = c \cdot k$; and, if the minimum FAS of H has value at most k, then some constraint is off by at least $\frac{(c-1)\,k}{m'} \ge \frac{c-1}{2c(2+c)}$, a positive constant. Thus there exists a constant α > 0 such that it is NP-hard to distinguish whether an assignment to (5) satisfies each constraint, or whether there exists some permutation constraint on which it is off by at least α. By the polynomial-time equivalence of the separation oracle problem and the LP optimization problem (Grötschel et al., 1988), it is thus NP-hard to find the RUM that minimizes the maximum error with respect to a given tournament matrix. □

## C. Certifying Optimality

If the RUM returned by Algorithm 1 is at distance 0 from the matrix P, its optimality is trivially guaranteed. If, instead, the distance is non-zero, we need to do extra work to certify it. One way to do this is to test whether the dual solution D* returned by the last iteration of the loop of Algorithm 1 satisfies all dual constraints, i.e., we have to obtain the exact minimum FAS of the transitive tournament having the dual solution's $\Delta_{i,j}$'s as weights. In the case of the Jester dataset, in which n = 100, doing so is infeasible, since the runtime of the exact dynamic-programming algorithm for FAS is $O(n\,2^n)$; we then have no positive lower bound on the smallest distance to a RUM of that particular matrix. For the SF dataset, in which n = 35, it so happens that the dual solution D* obtained in the last iteration of Algorithm 1's loop induces a directed graph with only 28 vertices hit by at least one non-zero arc. The minimum FAS instance is then effectively reduced to a graph of order n′ = 28: the exact algorithm terminated in less than a minute, certifying (i) the feasibility of D* for the unrestricted dual (3) and (ii) that the distance to the SF matrix of the RUM returned by Algorithm 1 is optimal (see Table 1).

Thus, while Algorithm 1 is a heuristic, it always returns a RUM and hence always gives a correct upper bound on the distance of the input matrix to RUMs. Moreover, with an extra final check of the feasibility of its last dual solution D*, the heuristic can also certify a lower bound on the distance of the input matrix to the best approximating RUM. In our experiments, when applied, this final check was made with an exact algorithm; the Ellipsoid Algorithm, instead, makes this check using the Approximate Separation Oracle of Theorem 8.
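For completeness, here is a sketch of the kind of exact exponential-time check described above (ours; a Held-Karp-style dynamic program over subsets, assuming the same pair-inversion cost convention used throughout):

```python
def min_fas_transitive(delta, n):
    """Exact minimum Problem-2 cost via DP over subsets, O(2^n * n^2).
    Pair (i, j), i < j, pays delta.get((i, j), 0) when i is ranked above j."""
    INF = float("inf")
    dp = [INF] * (1 << n)
    dp[0] = 0.0
    for mask in range(1 << n):
        if dp[mask] == INF:
            continue
        for x in range(n):
            if mask >> x & 1:
                continue
            # Place x as the next-best item: it outranks every still-unplaced
            # y, so every pair (x, y), x < y, with y unplaced is now inverted.
            add = sum(delta.get((x, y), 0.0)
                      for y in range(x + 1, n) if not (mask >> y & 1))
            nxt = mask | (1 << x)
            dp[nxt] = min(dp[nxt], dp[mask] + add)
    return dp[(1 << n) - 1]

delta = {(0, 1): 0.3, (0, 2): -0.1, (1, 2): 0.2, (1, 3): -0.4, (2, 3): 0.1}
print(min_fas_transitive(delta, 4))  # -0.3; matches brute force over all 4!
```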
## D. The 2L Model

The two-level (2L) model of Veerathu & Rajkumar (2021) partitions the n items into groups: each group is associated with a positive weight and with a rank-2 tournament between its items. Pairwise matches between players in the same group are governed by the rank-2 tournament; if a player belongs to group i, a second player belongs to group j ≠ i, and the weights of the two groups are $w_i$ and $w_j$, then the first player wins with probability proportional to $w_i$ (that is, with probability $\frac{w_i}{w_i + w_j}$).

One can easily show that there exist tournament matrices that can be represented via rank-2 tournaments (and thus via the 2L model) but not with RUMs. We show here that there also exist matrices that can be represented by RUMs but not by 2L models. In particular, consider the following tournament matrix $P_\alpha$, for α ∈ [0, 1], over the set of players X = {(i, j) | 0 ≤ i, j ≤ 2}, where the entry in row x and column y is the probability that y beats x:

| $P_\alpha$ | (0,0) | (0,1) | (0,2) | (1,0) | (1,1) | (1,2) | (2,0) | (2,1) | (2,2) |
|---|---|---|---|---|---|---|---|---|---|
| (0,0) | – | α | 1−α | α | α | α | 1−α | 1−α | 1−α |
| (0,1) | 1−α | – | α | α | α | α | 1−α | 1−α | 1−α |
| (0,2) | α | 1−α | – | α | α | α | 1−α | 1−α | 1−α |
| (1,0) | 1−α | 1−α | 1−α | – | α | 1−α | α | α | α |
| (1,1) | 1−α | 1−α | 1−α | 1−α | – | α | α | α | α |
| (1,2) | 1−α | 1−α | 1−α | α | 1−α | – | α | α | α |
| (2,0) | α | α | α | 1−α | 1−α | 1−α | – | α | 1−α |
| (2,1) | α | α | α | 1−α | 1−α | 1−α | 1−α | – | α |
| (2,2) | α | α | α | 1−α | 1−α | 1−α | α | 1−α | – |

With this matrix, player (i, j) beats player (i, k), for k = (j + 1) mod 3, with probability 1 − α; and player (i, j) beats player (k, ℓ), for k = (i + 1) mod 3 and any ℓ, with probability 1 − α. Note that for each α ≠ 1/2, $P_\alpha$ has various disjoint non-transitive triplets; e.g., for each i ∈ {0, 1, 2}, the subset {(i, 0), (i, 1), (i, 2)} is a non-transitive triplet. Moreover, $P_\alpha$ conjoins those three disjoint non-transitive triplets to form other non-transitive triplets: for each i, j, k ∈ {0, 1, 2}, the subset {(0, i), (1, j), (2, k)} is also a non-transitive triplet.

We now show that (unrestricted) RUMs are able to represent such a $P_\alpha$ for each α ∈ [1/3, 2/3]. We will later show that 2L models can represent $P_\alpha$ only if α = 1/2.

Lemma 13. For each α ∈ [1/3, 2/3], the tournament matrix $P_\alpha$ can be represented with a RUM.

Proof. We start by observing that if $P_\alpha$ is representable with a RUM, then $P_{1-\alpha}$ is also representable with a RUM. Indeed, for a permutation π of X, let $\bar\pi$ be the reverse of π. We also define the reverse of a RUM: the reverse $\bar R$ of RUM R is the RUM that assigns probability R(π) to the permutation $\bar\pi$, for each π in the support of R. Then, for each pair a and b of items, the probability that a beats b in $\bar R$ is equal to the probability that b beats a in R. Observe that if $P_\alpha$ requires the generic item a to beat another item b with probability $p_{a,b}$, then $P_{1-\alpha}$ requires b to beat a with probability $p_{a,b}$. Thus, if R is a RUM for $P_\alpha$, then $\bar R$ is a RUM for $P_{1-\alpha}$.

We now prove that the existence of a RUM for $P_{1/3}$ implies the existence of a RUM for $P_\alpha$, for each α ∈ [1/3, 2/3]. First, if both the tournament matrix P′ and the tournament matrix P″ admit RUMs, then for each β ∈ [0, 1] the tournament matrix βP′ + (1 − β)P″ also admits a RUM: to sample a permutation from this RUM, one first flips a coin with heads probability β; if the coin comes up heads, one samples an independent permutation from the RUM of P′, otherwise one samples an independent permutation from the RUM of P″. We have already proved that the existence of a RUM for $P_{1/3}$ implies the existence of a RUM for $P_{2/3}$. Let $M_\alpha$ be the RUM that is obtained by mixing the RUM for $P_{1/3}$, with weight β = 2 − 3α, and the RUM for $P_{2/3}$, with weight 1 − β = 3α − 1. Then, an event that has probability 1/3 in $P_{1/3}$ (and thus probability 2/3 in $P_{2/3}$) ends up having probability $(2 - 3\alpha)\cdot\frac13 + (3\alpha - 1)\cdot\frac23 = \alpha$ in $M_\alpha$; and, likewise, an event having probability 2/3 in $P_{1/3}$ (and probability 1/3 in $P_{2/3}$) ends up having probability $(2 - 3\alpha)\cdot\frac23 + (3\alpha - 1)\cdot\frac13 = 1 - \alpha$ in $M_\alpha$.
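The mixture arithmetic above can be verified exactly (our check, using rationals):

```python
from fractions import Fraction

def mix(alpha):
    """Mix the RUMs for P_{1/3} and P_{2/3} with weights beta = 2 - 3*alpha
    and 1 - beta = 3*alpha - 1 (valid for alpha in [1/3, 2/3])."""
    beta = 2 - 3 * alpha
    p_13, p_23 = Fraction(1, 3), Fraction(2, 3)
    low = beta * p_13 + (1 - beta) * p_23   # should equal alpha
    high = beta * p_23 + (1 - beta) * p_13  # should equal 1 - alpha
    return low, high

for alpha in (Fraction(1, 3), Fraction(2, 5), Fraction(1, 2), Fraction(3, 5)):
    low, high = mix(alpha)
    assert low == alpha and high == 1 - alpha
print("mixture weights reproduce alpha and 1 - alpha exactly")
```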
Thus, the existence of a RUM for $P_{1/3}$ implies the existence of a RUM for $P_\alpha$ for each α ∈ [1/3, 2/3]. It then only remains to prove that there exists a RUM for $P_{1/3}$.

Consider the set S = {(2 1 0), (0 2 1), (1 0 2)} of permutations. We will consider the uniform distribution on S; with this distribution, for each i ∈ {0, 1, 2}, the probability that i beats (i + 1) mod 3 is exactly 2/3. Our RUM $R_{1/3}$ behaves as follows. For each i ∈ {0, 1, 2}, let $\pi_i$ be sampled independently and uniformly at random from S. Moreover, let π also be sampled independently and uniformly at random from S. Then, the RUM returns the permutation
$$(\pi(0), \pi_{\pi(0)}(0))\; (\pi(0), \pi_{\pi(0)}(1))\; (\pi(0), \pi_{\pi(0)}(2))\; (\pi(1), \pi_{\pi(1)}(0))\; (\pi(1), \pi_{\pi(1)}(1))\; (\pi(1), \pi_{\pi(1)}(2))\; (\pi(2), \pi_{\pi(2)}(0))\; (\pi(2), \pi_{\pi(2)}(1))\; (\pi(2), \pi_{\pi(2)}(2)).$$
With this RUM, for each player (i, j), and for k = (j + 1) mod 3, we have that Pr[(i, j) beats (i, k)] = 2/3. Analogously, for each player (i, j), for k = (i + 1) mod 3, and for each ℓ ∈ {0, 1, 2}, we have that Pr[(i, j) beats (k, ℓ)] = 2/3. Thus $R_{1/3}$ is a RUM that represents $P_{1/3}$. □

We now show that the 2L model of Veerathu & Rajkumar (2021) cannot represent $P_\alpha$ unless α = 1/2. In $P_{1/2}$, the winner of each single match is chosen by flipping a fair coin: thus, one can represent $P_{1/2}$ with a uniform MNL and, consequently, with a 2L model. Our construction and our proof hinge on some constraints that 2L models have to obey when representing non-transitive triplets. As we will see, these constraints are incompatible with $P_\alpha$ for each α ≠ 1/2.

Lemma 14. For each α ≠ 1/2, the tournament matrix $P_\alpha$ cannot be represented with a 2L model.

Proof. Recall that X = {(i, j) | 0 ≤ i, j ≤ 2} is the set of players; let us also define three disjoint subsets of X, which we call "parts": $X_i = \{(i,j) \mid 0 \le j \le 2\}$ for 0 ≤ i ≤ 2. Recall that the model of Veerathu & Rajkumar (2021) is on two levels. The model partitions the player set X into groups $G_0, G_1, \ldots$, and assigns a positive weight $w_i$ to the generic group $G_i$. When two players, respectively from groups $G_i$ and $G_j$, i ≠ j, play against each other, an MNL is used to determine the winner: the player from group i wins with probability $\frac{w_i}{w_i + w_j}$, and the one from group j wins with probability $\frac{w_j}{w_i + w_j}$. When two players from the same group $G_i$ play against each other, the game is decided by a rank-2 tournament.

Suppose by contradiction that a 2L model representing $P_\alpha$, α ≠ 1/2, is composed of groups $G_0, \ldots, G_{t-1}$. Then, there cannot exist a part $X_i$ that is split among 3 distinct groups, say $G_{i_1}, G_{i_2}, G_{i_3}$: indeed, if this were the case, there would have to exist a permutation π of the indices such that
$$\frac{w_{i_{\pi(1)}}}{w_{i_{\pi(1)}} + w_{i_{\pi(2)}}} > \frac12, \quad \frac{w_{i_{\pi(2)}}}{w_{i_{\pi(2)}} + w_{i_{\pi(3)}}} > \frac12, \quad \frac{w_{i_{\pi(3)}}}{w_{i_{\pi(3)}} + w_{i_{\pi(1)}}} > \frac12,$$
i.e., $w_{i_{\pi(1)}} > w_{i_{\pi(2)}} > w_{i_{\pi(3)}} > w_{i_{\pi(1)}}$, a contradiction. Also, there cannot exist a part $X_i$ that is split among 2 distinct groups, say $(i,j), (i,k) \in G_a$ and $(i,\ell) \in G_b$ for {j, k, ℓ} = {0, 1, 2} and a ≠ b. Indeed, if this were the case, the probability that the model assigns to (i, ℓ) beating (i, j), i.e., $\frac{w_b}{w_a + w_b}$, would be the same as the probability it assigns to (i, ℓ) beating (i, k), contradicting $P_\alpha$'s requirements. Therefore, each $X_i$ has to be fully contained in some group; in particular, there can either be 1, 2, or 3 groups.

Veerathu & Rajkumar (2021) prove that rank-2 tournaments are unable to capture tournament matrices P having indices a, b, c, d such that $P_{a,b}, P_{b,c}, P_{c,a} > \frac12$ and $P_{a,d}, P_{b,d}, P_{c,d} > \frac12$, or such that $P_{a,b}, P_{b,c}, P_{c,a} > \frac12$ and $P_{a,d}, P_{b,d}, P_{c,d} < \frac12$.
In other words, rank-2 tournaments cannot represent a non-transitive triplet (a stronger than b, b stronger than c, and c stronger than a) together with an extra item d that is either stronger than each item in the triplet or weaker than each item in the triplet (i.e., each of a, b, c is weaker than d, or each of a, b, c is stronger than d). Thus, if the partition into groups of the 2L model of Veerathu & Rajkumar (2021) has a group $G_i$ containing some $X_j$ as a subset together with some element of $X \setminus X_j$, then the rank-2 tournament of $G_i$ is unable to represent $P_\alpha$ restricted to $G_i$; thus the 2L model is unable to represent $P_\alpha$. It follows that a 2L model, to correctly represent $P_\alpha$, has to contain exactly 3 groups, in fact one group for each of the three parts $X_0, X_1, X_2$. Without loss of generality, let us assume that $G_i = X_i$ for i ∈ {0, 1, 2}.

Now, suppose that α < 1/2. To correctly represent $P_\alpha$, the MNL weights would have to guarantee that $w_0 > w_1$ (since $P_{(0,0),(1,0)} = \alpha < \frac12$), that $w_1 > w_2$ (since $P_{(1,0),(2,0)} = \alpha < \frac12$), and that $w_2 > w_0$ (since $P_{(2,0),(0,0)} = \alpha < \frac12$), which is a contradiction. Likewise, suppose that α > 1/2. In this case, to correctly represent $P_\alpha$, the MNL weights would have to guarantee that $w_0 < w_1$, that $w_1 < w_2$, and that $w_2 < w_0$, which is also a contradiction. Thus, the tournament matrix $P_\alpha$ cannot be represented by a 2L model, for each α ≠ 1/2. □

The following is then immediate.

Corollary 15. For each $\alpha \in [\frac13, \frac23] \setminus \{\frac12\}$, the tournament matrix $P_\alpha$ can be represented by a RUM and cannot be represented by 2L models.

We point out that the same result could be obtained with a subset of the rows and columns of $P_\alpha$: it is sufficient to have each element of one of the three parts $X_i$, together with an arbitrary element of a second part $X_j$ and an arbitrary element of the last part $X_k$, for {i, j, k} = {0, 1, 2}. We used all 9 elements of the union of the three parts since we felt that, this way, the RUM construction is clearer. In fact, while our construction only tensors a non-transitive triplet with itself, RUMs can be shown to withstand arbitrarily many such tensorings, provided that each triplet's α is bounded between 1/3 and 2/3.
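As a final sanity check, the construction of $R_{1/3}$ can be verified by exact enumeration of all 3⁴ draws. The code is ours; we read the paper's permutation notation as item lists in which later positions mean higher value:

```python
from itertools import product

# The three cyclic orders from the proof, written as item lists whose LAST
# element is the most-preferred (matching the paper's value convention).
S = [(2, 1, 0), (0, 2, 1), (1, 0, 2)]

def ranking(pi, pis):
    """The permutation returned by R_{1/3}: group blocks ordered by pi, and
    items within the block of group g ordered by pis[g]."""
    return [(g, item) for g in pi for item in pis[g]]

def win_prob(a, b):
    """Pr[a beats b] under R_{1/3}, by exact enumeration of the 3^4 draws."""
    wins = total = 0
    for pi, p0, p1, p2 in product(S, repeat=4):
        r = ranking(pi, {0: p0, 1: p1, 2: p2})
        wins += r.index(a) > r.index(b)  # later in the list = higher value
        total += 1
    return wins / total

assert abs(win_prob((0, 0), (0, 1)) - 2 / 3) < 1e-12  # within-group cycle
assert abs(win_prob((0, 0), (1, 2)) - 2 / 3) < 1e-12  # across-group cycle
assert abs(win_prob((2, 0), (0, 1)) - 2 / 3) < 1e-12  # group 2 beats group 0
print("R_{1/3} reproduces P_{1/3} on the checked pairs")
```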