# Light RUMs

Flavio Chierichetti¹, Ravi Kumar², Andrew Tomkins²

*Equal contribution. ¹Dipartimento di Informatica, Sapienza University of Rome, Italy. ²Google, Mountain View, CA, USA. Correspondence to: Flavio Chierichetti <flavio@di.uniroma1.it>, Ravi Kumar, Andrew Tomkins. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

A Random Utility Model (RUM) is a distribution on permutations over a universe of items. For each subset of the universe, a RUM induces a natural distribution of the winner in the subset: choose a permutation according to the RUM distribution and pick the maximum item in the subset according to the chosen permutation. RUMs are widely used in the theory of discrete choice. In this paper we consider the question of the (lossy) compressibility of RUMs on a universe of size $n$, i.e., the minimum number of bits required to approximate the winning probabilities of each slate. Our main result is that RUMs can be approximated using $\tilde{O}(n^2)$ bits, an exponential improvement over the standard representation; furthermore, we show that this bound is optimal. En route, we sharpen the classical existential result of McFadden & Train (2000) by showing that the minimum size of a mixture of multinomial logits required to approximate a general RUM is $\tilde{\Theta}(n)$.

1. Introduction

Random utility models, or RUMs, are the most influential and well-studied class of user behavior models in the field of discrete choice (see Train, 2003, for an overview). A model in the RUM family is a predictor for the item a user will choose when presented with a slate of options. The prediction has a specific form: the user arrives with a utility in mind for each item of the universe, drawn from a joint distribution over utility vectors. For any slate of options, the user behaves rationally by selecting the highest-utility option available. Much of the work in the area covers subclasses of RUMs in which the utility distribution has a specific form, but in this paper we consider the general class.

A seminal work of McFadden & Train (2000) showed that any RUM may be approximated by a mixture of multinomial logit (MNL) models. Models of this form, MNL mixtures, have seen significant recent study, as in (Shazeer et al., 2017; Yang et al., 2018). However, the construction of McFadden & Train (2000) relied on unboundedly many mixture components, leaving little understanding of the correct complexity. Partial progress on this question was attained by Chierichetti et al. (2018a), who showed that quadratically many mixture components or quadratically many permutations suffice to approximate any RUM. In this paper we continue the study of efficient representations for approximating any RUM.

We begin with some intuition about approximating RUMs. Any RUM encodes exponentially many distributions over a universe of $n$ items, in the sense that each of the exponentially many slates (subsets of the universe) induces a distribution over the item of the slate that a random user will prefer. These distributions are not arbitrary, as the RUM imposes some structure relating the distributions of nearby slates. Nonetheless, exactly representing the RUM requires exponentially many bits. The question therefore is whether the combination of the restrictions given by the form of RUMs, plus the ability to introduce a small and controllable error in the approximation, will allow a more concise representation of the entire RUM.
An answer to this question will reveal how much practical information a RUM actually carries, as well as how concisely it can be specified and communicated. As in sketching, metric approximation, and related areas, the information content captures some aspect of the representative power of the model. While RUMs have seen significant investment over many decades, leading to the 2000 Nobel prize being awarded to Daniel McFadden for his development of theory and methods for analyzing discrete choice, computational and information-theoretic properties of the model have not seen the same level of scrutiny until more recently. Hence, fundamental questions such as the amount of information required to approximate a RUM have remained unresolved until now.

Our Results. This paper closes the question to within logarithmic factors, as follows. For a universe of $n$ items, any RUM may be approximated arbitrarily closely on all slates either as a mixture of $\Theta(n)$ permutations or as a mixture of $\tilde{\Theta}(n)$ MNLs, and both these bounds are tight up to logarithmic factors. Either type of model may be represented using $\tilde{\Theta}(n^2)$ bits; such a representation is also tight, in the strict sense that no representation of asymptotically fewer bits is sufficient to approximate a generic RUM, no matter the form of the representation. We show additionally that the mixture-of-MNLs representation may be viewed as strictly more powerful than the mixture of permutations, in the technical sense that any mixture of $t$ permutations may be well approximated by a mixture of $t$ MNLs, but the converse is not true: even mixtures of a single MNL may require $\Omega(n)$ permutations to approximate. Finally, we show that a mixture of $\tilde{\Theta}_n(k)$ permutations is sufficient to approximate any RUM on slates of size at most $k$, a common setting, and this result also is tight to polylogarithmic factors in $n$.

We perform some experiments to explore the implications of these theoretical results. In our first set of experiments we study a dataset in which users provide their total ordering of different sushi variants, essentially encoding a complete RUM. This allows us to study exactly how well the construction in our upper bound approximates a RUM in a practical setting. We show that the quality of approximation is almost perfectly predicted by our theoretical results, and that a representation based on just 1% of the data provides an accurate approximation of the overall RUM. In our second set of experiments we consider the setting of Ragain & Ugander (2016), who present an interesting non-rational choice model that is incomparable to the class of RUMs. In this setting we are able to compare the non-rational choice models learned by Ragain & Ugander (2016) against RUMs we learn based on a simple linear program. RUMs perform well, outperforming the MNL mixtures model of Ragain & Ugander (2016), and in some cases outperforming the non-rational choice model as well. This suggests that, at least in some settings, our findings on the expressive power of permutation mixtures may point to new algorithmic approaches to RUM discovery.

The paper is structured as follows. Section 2 introduces the notation. Sections 3 and 4 give, respectively, the upper and lower bound on the bit complexity of arbitrary representations. Section 5 gives the corresponding bounds for MNL mixtures. Section 6 compares the RUM and the mixture-of-MNLs representations. Section 7 extends the results to the setting of slates of bounded size. Section 8 gives our experimental results.
The Supplementary Material contains each proof missing from the main body of this paper, along with an exponential lower bound on the bit complexity of exact representations of RUMs, and comparisons of RUMs with several other choice models.

2. Preliminaries

Distributions. Throughout the paper, we deal with discrete probability distributions. Let $\mathrm{supp}(D)$ denote the support of a discrete distribution $D$. We use $x \sim D$ to denote that $x \in \mathrm{supp}(D)$ is sampled according to $D$. For a generic $x$, we use $D(x)$ to denote the probability that $D$ assigns to $x$; in particular, if $x \in \mathrm{supp}(D)$, then $D(x) > 0$, otherwise $D(x) = 0$. For $S \subseteq \mathrm{supp}(D)$, we use $D(S)$ to denote the probability that $x \sim D$ is in $S$, i.e., $D(S) = \sum_{x \in S} D(x)$. When this creates no ambiguity, for a generic set $S$, we use $D(S)$ to denote $D(S \cap \mathrm{supp}(D))$. The total variation distance between $D$ and $D'$ is equal to
$$|D - D'|_{\mathrm{tv}} = \frac12 \sum_x |D(x) - D'(x)| = \frac12 |D - D'|_1.$$
The total variation distance between $D$ and $D'$ is also equal to the maximum, over all events, of the absolute difference between the probabilities of the event in $D$ and in $D'$, i.e.,
$$|D - D'|_{\mathrm{tv}} = \max_{S \subseteq \mathrm{supp}(D) \cup \mathrm{supp}(D')} |D(S) - D'(S)|.$$

Permutations and RUMs. Let $[n] = \{1, \ldots, n\}$, let $2^{[n]}$ be the power set of $[n]$, and let $\binom{[n]}{k}$ be the set of subsets of $[n]$ of size $k$. Let $S_n$ be the set of permutations of the set $[n]$. For a given permutation $\sigma \in S_n$ and for $i \in [n]$, we let $\sigma(i) \in [n]$ be the value (or position) of item $i$ in $\sigma$. E.g., if $\sigma = (2\ 3\ 1)$, then $\sigma(2) = 1$, $\sigma(3) = 2$, $\sigma(1) = 3$. In this paper we use the term slate to denote any non-empty subset of $[n]$. Given $\sigma \in S_n$ and a slate $T \subseteq [n]$, let $\sigma(T) = \arg\max_{i \in T} \sigma(i)$, i.e., the maximum item in $T$ according to $\sigma$, aka the winner. A RUM on $[n]$ is a probability distribution $D$ over $S_n$.¹ We drop the qualifier "on $[n]$" when it is obvious from the context. Given a slate $T \subseteq [n]$, we use $D_T$ to denote the distribution of the random variable $\sigma(T)$ for $\sigma \sim D$, i.e., the distribution of the winner in the slate $T$ with a random permutation from $D$. Note that $\mathrm{supp}(D_T) \subseteq T$.

¹ RUMs are typically presented in terms of noisy item evaluations made by users. Each item $i \in [n]$ is assumed to have some base value $V_i$; each user samples a noise vector $(E_1, \ldots, E_n)$ from a joint noise distribution, and observes the utility of $i \in [n]$ to be $U_i = V_i + E_i$. The user then chooses an item rationally as the option with the highest utility $U_i$ among the available ones (breaking ties, if they exist, u.a.r.). As the utilities are random, the family of resulting models is named Random Utility Models, or RUMs. In a second definition, the user first sorts all the items decreasingly according to their observed utilities $U_i$ (breaking ties, if they exist, u.a.r.), obtaining a permutation. Then, given a slate, a rational user will choose its item with highest rank in the permutation. These two definitions of RUMs are equivalent (see, e.g., Chierichetti et al., 2018a).

MNLs and MNL Mixtures. A Multinomial Logit (aka MNL) is a widely used kind of RUM. In an MNL $L$, one associates a positive weight $w_i > 0$ with each item $i \in [n]$.² To produce a random permutation, one samples the $n$ elements, one after the other without replacement, with probability proportional to their respective MNL weights. It is not hard to see that, with this RUM, the probability that $i$ wins in $S$ is exactly
$$L_S(i) = \frac{w_i}{\sum_{j \in S} w_j},$$
for each $i \in S \subseteq [n]$. A mixture of $t$ MNLs, also called a mixed logit, is given by a sequence $L^{(1)}, \ldots, L^{(t)}$ of MNLs and a mixture distribution $p$ over $[t]$. To determine the winner of a slate $S$, we first sample $i \sim p$ and then use the MNL $L^{(i)}$ to sample the winner of $S$. Since a mixture of RUMs is a RUM, it also holds that a mixture of MNLs is a RUM (see, e.g., McFadden & Train, 2000).

² In a machine learning setting, this weight is often the result of a linear or non-linear combination of features of the item, and is produced as the exponentiation of a computed logit.
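To make these definitions concrete, here is a small illustration (ours, not part of the paper; the function names are made up). It represents a RUM explicitly as a map from permutations to probabilities, computes the induced winner distribution $D_S$ of a slate, and evaluates the closed-form winner distribution $L_S$ of an MNL. A permutation is written as a tuple listing items from most to least preferred.

```python
def rum_winner_distribution(rum, slate):
    """Winner distribution D_S: each permutation sends its probability mass
    to its top-ranked item within the slate."""
    dist = {i: 0.0 for i in slate}
    for perm, prob in rum.items():
        winner = next(i for i in perm if i in slate)
        dist[winner] += prob
    return dist

def mnl_winner_distribution(weights, slate):
    """Winner distribution of an MNL: L_S(i) = w_i / sum_{j in S} w_j."""
    total = sum(weights[i] for i in slate)
    return {i: weights[i] / total for i in slate}

# A toy RUM on [3], uniform over two permutations.
rum = {(2, 3, 1): 0.5, (1, 2, 3): 0.5}
print(rum_winner_distribution(rum, {1, 3}))                       # {1: 0.5, 3: 0.5}
print(mnl_winner_distribution({1: 1.0, 2: 2.0, 3: 3.0}, {1, 3}))  # {1: 0.25, 3: 0.75}
```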
Approximating RUMs. To define an approximation notion for RUMs, we first define a distance between RUMs $D, D'$:
$$\mathrm{dist}(D, D') = \max_{\emptyset \neq S \subseteq [n]} |D_S - D'_S|_{\mathrm{tv}} = \max_{\emptyset \neq S' \subseteq S \subseteq [n]} |D_S(S') - D'_S(S')|.$$
I.e., the distance is the maximum, over the slates $S$, of the total variation distance of the winner distributions of $S$ with $D$ and $D'$. Equivalently, it is the maximum over $S' \subseteq S$ of the absolute difference of the probabilities, with $D$ and $D'$, that the random winner in slate $S$ is in $S'$. E.g., if $\mathrm{dist}(D, D') \le \epsilon$, $S$ is a slate of movies, and $S' \subseteq S$ is its subset of dark comedies, then the probability that the movie chosen from $S$ is a dark comedy changes by no more than $\epsilon$ from $D$ to $D'$. Conversely, if $\mathrm{dist}(D, D') \ge \epsilon$, then there is a slate $S$ and one of its subsets $S' \subseteq S$ such that the probability that the winner of $S$ is in $S'$ changes by at least $\epsilon$ from $D$ to $D'$.

Since a RUM is a probability distribution over permutations, it can be represented by $O(n!)$ real numbers, each giving the probability of a permutation. Clearly, this representation is prohibitive. If one is allowed to approximate a RUM, is a more succinct representation possible?

3. An Efficient Representation

In this section we show that $O(n^2 \log n)$ bits are sufficient to approximately represent a RUM. The algorithm we will give to produce the representation will repeatedly sample the distribution $D$ on $S_n$ underlying the RUM.

Theorem 1. Let $0 < \epsilon, \delta < 1$. There is a polynomial time algorithm that, given any distribution $D$ on $S_n$, produces a multiset $M$ of $O\!\left(\epsilon^{-2}\left(n + \ln \delta^{-1}\right)\right)$ permutations such that, with probability at least $1 - \delta$, the uniform distribution $\tilde{D}$ on $M$ guarantees that $\mathrm{dist}(D, \tilde{D}) \le \epsilon$.

Proof. The algorithm will first sample $t = \left\lceil \frac{n \ln 3 + \ln(2\delta^{-1})}{2\epsilon^2} \right\rceil$ independent permutations $\sigma_1, \ldots, \sigma_t$ from $D$. After this first step, the algorithm fixes $\tilde{D}$ to be the uniform distribution on the multiset of these samples, i.e., $\tilde{D}$ chooses $i \in [t]$ uniformly at random (u.a.r.) and returns $\sigma_i$. We now prove that, with high probability, $\mathrm{dist}(D, \tilde{D}) \le \epsilon$.

Consider any slate $S \subseteq [n]$, and any of its non-empty subsets $S' \subseteq S$. Let $\tilde{D}_S(S') = \sum_{s \in S'} \tilde{D}_S(s)$ be the probability that, using the distribution $\tilde{D}$, the winner in the slate $S$ belongs to $S'$. Then, $\tilde{D}_S(S') \in [0,1]$ is a random variable. Clearly, $\mathbb{E}[\tilde{D}_S(S')] = \sum_{s \in S'} \mathbb{E}[\tilde{D}_S(s)] = \sum_{s \in S'} D_S(s) = D_S(S')$, which is the probability that the winner in the slate $S$ with distribution $D$ is in $S'$. Since
$$|D_S - \tilde{D}_S|_{\mathrm{tv}} = \max_{\emptyset \neq S' \subseteq S} |D_S(S') - \tilde{D}_S(S')|,$$
the claim is proved if we show that, with probability at least $1 - \delta$, for each $\emptyset \neq S' \subseteq S \subseteq [n]$ it holds that $|D_S(S') - \tilde{D}_S(S')| \le \epsilon$. Indeed, by a Chernoff-Hoeffding bound (Hoeffding, 1963),
$$\Pr\left[|D_S(S') - \tilde{D}_S(S')| \ge \epsilon\right] \le 2 e^{-2\epsilon^2 t} \le 2 e^{-(n \ln 3 + \ln(2\delta^{-1}))} = \delta \cdot 3^{-n}.$$
There are at most $3^n$ pairs of slates $S' \subseteq S \subseteq [n]$ (the generic item either belongs to $S'$, or to $S \setminus S'$, or to $[n] \setminus S$), thus we get
$$\Pr\left[\exists\, \emptyset \neq S' \subseteq S \subseteq [n] : |D_S(S') - \tilde{D}_S(S')| \ge \epsilon\right] \le \delta \cdot 3^{-n} \cdot 3^n = \delta.$$

Note that the above upper bound improves the one in (Chierichetti et al., 2018a) from $O(n^2)$ to $O(n)$ permutations. An immediate consequence of this improvement is that any RUM can be approximately represented using $O(n^2 \log n)$ bits.
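The compression step of Theorem 1 is just i.i.d. sampling; the sketch below (ours) implements it under the assumption that the RUM is available as a permutation sampler. The helper `mnl_sampler`, used only to have something to sample from, draws a permutation by picking items without replacement with probability proportional to their MNL weights.

```python
import math
import random

def compress_rum(sample_permutation, n, eps, delta):
    """Theorem 1: draw t i.i.d. permutations and return the multiset;
    the compressed RUM is the uniform distribution on it."""
    t = math.ceil((n * math.log(3) + math.log(2.0 / delta)) / (2 * eps ** 2))
    return [sample_permutation() for _ in range(t)]

def mnl_sampler(weights):
    def sample():
        items, perm = list(weights), []
        while items:
            i = random.choices(items, weights=[weights[j] for j in items])[0]
            perm.append(i)
            items.remove(i)
        return tuple(perm)
    return sample

n = 6
sketch = compress_rum(mnl_sampler({i: float(i) for i in range(1, n + 1)}), n,
                      eps=0.1, delta=0.1)
print(len(sketch), "permutations stored")
```

Each stored permutation takes $O(n \log n)$ bits, which is where the $O(\epsilon^{-2} n^2 \log n)$-bit bound of Corollary 2 below comes from.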
Corollary 2. For each $0 < \epsilon < 1$, and for each RUM $D$, one can build a data structure using $O(\epsilon^{-2}\, n^2 \log n)$ bits that one can use to return, for each slate $S \subseteq [n]$, a distribution $\tilde{D}_S$ satisfying $|D_S - \tilde{D}_S|_{\mathrm{tv}} \le \epsilon$.

Proof. The algorithm of Theorem 1 provides, with probability at least $1/2$, a multiset of $O(n/\epsilon^2)$ permutations of $[n]$ such that the RUM $\tilde{D}$ that chooses a permutation u.a.r. in the multiset satisfies the approximation requirement of the statement. The generic permutation of $[n]$ can be represented with $O(n \log n)$ bits; thus, $\tilde{D}$ can be represented with $O(\epsilon^{-2}\, n^2 \log n)$ bits.

4. A Lower Bound on the Representation Size

In this section we prove an $\Omega(n^2)$ lower bound on the number of bits required to sketch a RUM. This shows that the representation obtained in Section 3 is near-optimal. The cornerstone of our lower bound is a class of $2^{\Omega(n^2)}$ pairwise-distant RUMs that will be constructed randomly. Before introducing this class, we introduce the key notion of a sieve RUM.

Definition 3 (Sieve RUM). Given a sequence $\sigma = (S_1, \ldots, S_s)$ of sets such that $S_i \subseteq [n-s]$ for each $i \in [s]$, we define $D^{(\sigma)}$, the sieve RUM with signature $\sigma$, as follows: for $i \in [s]$, let $\pi_i \in S_n$ have (i) the items of $S_i$ in its top $|S_i|$ positions, sorted increasingly, (ii) item $n-s+i$ at position $|S_i|+1$, and (iii) the remaining items in the bottom $n - |S_i| - 1$ positions, also sorted increasingly; the sieve RUM $D^{(\sigma)}$ will choose $i$ u.a.r. from $[s]$, and will return permutation $\pi_i$.

To build the class of RUMs, and prove our lower bound, we will independently sample exponentially many sieve RUMs from the following distribution.

Definition 4 (Random Sieve Distribution). Let $s$ be given. For each $i \in [s]$, let $S_i$ be an i.i.d. and u.a.r. subset of $[n-s]$. Let $\sigma = (S_1, \ldots, S_s)$. Return the sieve RUM $D^{(\sigma)}$ as in Definition 3.
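Definitions 3 and 4 are easy to instantiate; the sketch below (ours) materializes the $s$ permutations of a sieve RUM from a signature, writing each permutation with its top position first, and samples a random signature.

```python
import random

def sieve_permutations(signature, n):
    """Definition 3: the i-th permutation has S_i (sorted) on top, then item n-s+i,
    then all remaining items (sorted). The sieve RUM is uniform over these s permutations."""
    s = len(signature)
    perms = []
    for i, S_i in enumerate(signature, start=1):
        top = sorted(S_i)
        middle = [n - s + i]
        rest = sorted(set(range(1, n + 1)) - set(top) - set(middle))
        perms.append(tuple(top + middle + rest))
    return perms

def random_signature(n, s):
    """Definition 4: each S_i is an independent, uniformly random subset of [n-s]."""
    return [frozenset(j for j in range(1, n - s + 1) if random.random() < 0.5)
            for _ in range(s)]

n, s = 8, 4
for p in sieve_permutations(random_signature(n, s), n):
    print(p)
```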
As a first step in our lower bound proof, we show that, given two independent random sieve RUMs, with extremely high probability there exists at least one slate where the two induced distributions are far in total variation distance. To do so, we will prove that (i) there exists a class of $\Omega(n)$ slates such that, for each slate $S$ in that class, the probability that two random sieve RUMs have close winner distributions on $S$ is $2^{-\Omega(n)}$, and (ii) the behavior of a random sieve RUM on a generic subclass is independent of its behavior on any disjoint subclass. Thus, the probability that two random sieve RUMs behave similarly on each of the $\Omega(n)$ slates in the class can be upper bounded by $2^{-\Omega(n) \cdot \Omega(n)} = 2^{-\Omega(n^2)}$. Our lower bound proof will then be concluded by a union bound argument: if one samples $2^{\Omega(n^2)}$ random sieve RUMs, with large enough probability any two of the sampled RUMs will behave dissimilarly on at least one slate.

Let $H(x) = x \log_2 \frac{1}{x} + (1-x) \log_2 \frac{1}{1-x}$ denote the binary entropy of $x \in (0,1)$.

Lemma 5. Suppose that $D'$ and $D''$ are sieve RUMs sampled independently from the distribution in Definition 4, with $s = (1-\alpha) n$, for $0 < \alpha \le 1/2$. Then,
$$\Pr\left[\forall\, \emptyset \neq T \subseteq [n] :\ |D'_T - D''_T|_{\mathrm{tv}} \le \tfrac{1-\alpha}{4}\right] \le 2^{-\left(1 - H\left(\frac{1-\alpha}{2}\right)\right) \alpha (1-\alpha) n^2}.$$

Proof. We will in fact show that the event in the statement of the lemma holds, with small enough probability, even if we consider all and only the slates $T_1, \ldots, T_{n-s}$, where $T_k = \{k\} \cup ([n] \setminus [n-s])$ for $k \in [n-s]$.

Given the generic slate $T_k$ and a generic signature $\sigma = (S_1, \ldots, S_s)$, for any $i \in [s]$, the item $n-s+i$ wins in $T_k$ (when $\pi_i$ is sampled) if and only if $k \notin S_i$. Recall that $[n] \setminus [n-s] \subseteq T_k$ and hence, if $\pi_j$ is sampled from the sieve RUM, one of its $|S_j|+1$ top-most items will be the winner: indeed, the $(|S_j|+1)$-th item of $\pi_j$ is $n-s+j$ (which is in $T_k$) and thus either that item, or one of those preceding it (those of the set $S_j \subseteq [n-s]$), will be the winner. Thus, for $n-s+i$ to be chosen in the slate $T_k$, it must hold that (i) $\pi_i$ is sampled and (ii) $k \notin S_i$. In particular, if $k \in S_i$ we will have that $D^{(\sigma)}_{T_k}(n-s+i) = 0$; if $k \notin S_i$, then $D^{(\sigma)}_{T_k}(n-s+i) = 1/s$.

For a given $\sigma$, define the $(n-s) \times s$ matrix $M_\sigma$ whose $(k,i)$-th entry is $D^{(\sigma)}_{T_k}(n-s+i)$. Recall that the random sampling of $\sigma = (S_1, \ldots, S_s)$ is such that, for each $k \in [n-s]$ and $i \in [s]$, an independent fair coin flip determines whether $k \in S_i$. Thus, under our sampling of $\sigma$, each entry of $M_\sigma$ will be chosen independently and u.a.r. in $\{0, 1/s\}$.

Let $\sigma'$ (resp., $\sigma''$) be the (random) signature corresponding to $D'$ (resp., $D''$) and let $M' = M_{\sigma'}$, $M'' = M_{\sigma''}$. Pick any $k \in [n-s]$. Observe that
$$|D'_{T_k} - D''_{T_k}|_{\mathrm{tv}} = \tfrac12 \left( |D'_{T_k}(k) - D''_{T_k}(k)| + \delta_k \right) \ge \delta_k/2, \quad \text{where we let}\quad \delta_k = \sum_{i \in [n] \setminus [n-s]} |D'_{T_k}(i) - D''_{T_k}(i)|.$$
Now, consider the event $\mathcal{E}_k \equiv \left( \delta_k \le \tfrac{1-\alpha}{2} \right)$. Observe that $|D'_{T_k} - D''_{T_k}|_{\mathrm{tv}} \le \frac{1-\alpha}{4}$ implies $\delta_k \le \frac{1-\alpha}{2}$, that is, it implies $\mathcal{E}_k$. Moreover, whether $\mathcal{E}_k$ happens is a function of the $k$-th rows of $M'$ and $M''$. Since the entries of $M'$ and $M''$ are chosen i.i.d., the events $\mathcal{E}_1, \ldots, \mathcal{E}_{n-s}$ are mutually independent. Note also that, for each $i \in [s]$, the difference $|M'_{k,i} - M''_{k,i}|$ is chosen i.i.d. and u.a.r. in $\{0, 1/s\}$, i.e., $s\,\delta_k \sim \mathrm{Bin}(s, \tfrac12)$.³ Hence,
$$\Pr[\mathcal{E}_k] = \Pr\left[\mathrm{Bin}(s, \tfrac12) \le \tfrac{1-\alpha}{2}\, s\right] \le 2^{-\left(1 - H\left(\frac{1-\alpha}{2}\right)\right) s},$$
since $\sum_{i \le \lfloor \beta t \rfloor} \binom{t}{i} \le 2^{t H(\beta)}$ for $\beta \le 1/2$.

³ The binomial distribution $\mathrm{Bin}(n,p)$ has support $\{0, 1, \ldots, n\}$, and it is defined as $\Pr[\mathrm{Bin}(n,p) = k] = \binom{n}{k} p^k (1-p)^{n-k}$.

Putting these together, we have
$$\Pr\left[\forall\, \emptyset \neq T \subseteq [n] :\ |D'_T - D''_T|_{\mathrm{tv}} \le \tfrac{1-\alpha}{4}\right] \le \Pr\left[\forall k \in [n-s] :\ \mathcal{E}_k\right] \le 2^{-\left(1 - H\left(\frac{1-\alpha}{2}\right)\right) s (n-s)},$$
where we used the mutual independence of the events $\mathcal{E}_1, \ldots, \mathcal{E}_{n-s}$. The claim follows from $s = (1-\alpha)n$.

We now use the strong probability guarantee of Lemma 5, and a union bound, to produce $2^{\Omega(n^2)}$ pairwise-distant RUMs from the distribution of Definition 4.

Theorem 6. For each $0 < \alpha \le 1/2$, there is a set $\mathcal{D}$ of RUMs on $[n]$ such that (i) $|\mathcal{D}| = 2^{\left\lfloor \left(1 - H\left(\frac{1-\alpha}{2}\right)\right) \alpha(1-\alpha) n^2 / 2 \right\rfloor}$, and (ii) for each $\{D', D''\} \in \binom{\mathcal{D}}{2}$, $\mathrm{dist}(D', D'') > \frac{1-\alpha}{4}$.

Proof. Let $\mathcal{D} = \{D^{(\sigma_1)}, \ldots, D^{(\sigma_t)}\}$ be a multiset of $t = 2^{\left\lfloor \left(1 - H\left(\frac{1-\alpha}{2}\right)\right) \alpha(1-\alpha) n^2 / 2 \right\rfloor}$ RUMs sampled i.i.d. from the distribution in Definition 4. We apply Lemma 5 to each of the $\binom{t}{2}$ pairs of sampled RUMs, together with a union bound, to obtain
$$\Pr\left[\exists\, \{D', D''\} \in \tbinom{\mathcal{D}}{2},\ \forall\, \emptyset \neq T \subseteq [n] :\ |D'_T - D''_T|_{\mathrm{tv}} \le \tfrac{1-\alpha}{4}\right] \le 2^{-\left(1 - H\left(\frac{1-\alpha}{2}\right)\right) \alpha(1-\alpha) n^2} \cdot t^2/2 \le 1/2.$$
Thus, $\mathcal{D}$ has pairwise distinct elements (and is then a set) at distance larger than $\frac{1-\alpha}{4}$ from each other with probability at least $1/2$, i.e., with positive probability, $\mathcal{D}$ has properties (i) and (ii).
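The quantities in the proof of Lemma 5 are easy to compute directly. The sketch below (ours) builds the winner distributions of two independent random sieve RUMs on the slates $T_k$ and reports the largest total variation gap found; in line with Lemma 5, it is very unlikely that all of the gaps are small.

```python
import random

def random_signature(n, s):
    return [frozenset(j for j in range(1, n - s + 1) if random.random() < 0.5)
            for _ in range(s)]

def tk_winner_distribution(signature, n, k):
    """Winner distribution of the sieve RUM on T_k = {k} U {n-s+1, ..., n}:
    item n-s+i wins with probability 1/s iff k is not in S_i; otherwise k wins."""
    s = len(signature)
    dist = {k: 0.0}
    for i, S_i in enumerate(signature, start=1):
        if k in S_i:
            dist[k] += 1.0 / s
        else:
            dist[n - s + i] = dist.get(n - s + i, 0.0) + 1.0 / s
    return dist

def tv(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

n, s = 40, 20                      # s = (1 - alpha) n with alpha = 1/2
sig1, sig2 = random_signature(n, s), random_signature(n, s)
gaps = [tv(tk_winner_distribution(sig1, n, k), tk_winner_distribution(sig2, n, k))
        for k in range(1, n - s + 1)]
print(max(gaps))                   # with high probability above (1 - alpha)/4 = 0.125
```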
We conclude by showing the representation lower bound: representing a RUM to within a small enough constant maximum total variation distance requires $\Omega(n^2)$ bits.

Corollary 7. Fix some $0 < \alpha \le 1/2$. A data structure for a generic RUM $D$ that can be used, for each slate $S \subseteq [n]$, to return a distribution $\tilde{D}_S$ satisfying $|D_S - \tilde{D}_S|_{\mathrm{tv}} \le \frac{1-\alpha}{8}$, requires at least $\frac{\alpha^3}{6}\, n^2 - 1$ bits.

Proof. It is well known that $H\!\left(\frac{1-x}{2}\right) \le 1 - \frac{x^2}{\ln 4}$ for $x \in [-1, 1]$ (see, e.g., Calabro, 2009). Theorem 6 guarantees that, for each $0 < \alpha \le 1/2$, there exists a set $\mathcal{D}$ of RUMs such that for each $\{D, D'\} \in \binom{\mathcal{D}}{2}$ there exists a slate $T = T_{D,D'} \subseteq [n]$ with $|D_T - D'_T|_{\mathrm{tv}} > \frac{1-\alpha}{4}$, and with
$$\lg_2 |\mathcal{D}| \ge \frac{\left(1 - H\left(\frac{1-\alpha}{2}\right)\right) \alpha(1-\alpha)}{2}\, n^2 - 1 \ge \frac{\alpha^3 (1-\alpha)}{2 \ln 4}\, n^2 - 1 \ge \frac{\alpha^3}{4 \ln 4}\, n^2 - 1 > \frac{\alpha^3}{6}\, n^2 - 1,$$
since $0 < \alpha \le 1/2$ (so that $1-\alpha \ge 1/2$) and $4 \ln 4 < 6$.

Let $D \in \mathcal{D}$. Suppose we have a data structure that, for each slate $S \subseteq [n]$, can provide a distribution $\tilde{D}_S$ over $S$ such that $|D_S - \tilde{D}_S|_{\mathrm{tv}} \le \frac{1-\alpha}{8}$. We show that such a data structure uniquely determines $D \in \mathcal{D}$. Indeed, for each $D' \in \mathcal{D} \setminus \{D\}$ there exists at least one slate $T = T_{D,D'} \subseteq [n]$ such that
$$\frac{1-\alpha}{4} < |D_T - D'_T|_{\mathrm{tv}} \le |D_T - \tilde{D}_T|_{\mathrm{tv}} + |\tilde{D}_T - D'_T|_{\mathrm{tv}} \le \frac{1-\alpha}{8} + |\tilde{D}_T - D'_T|_{\mathrm{tv}},$$
where the second inequality is the triangle inequality. The above inequalities then entail that $|\tilde{D}_T - D'_T|_{\mathrm{tv}} > \frac{1-\alpha}{8}$. Hence, $D$ is the only RUM of $\mathcal{D}$ that, for all slates $S$, guarantees $|D_S - \tilde{D}_S|_{\mathrm{tv}} \le \frac{1-\alpha}{8}$. It follows that the data structure can be used to uniquely identify each RUM in $\mathcal{D}$. Thus, by counting, it uses at least $\lg_2 |\mathcal{D}| > \frac{\alpha^3}{6} n^2 - 1$ bits.

5. The Efficiency of MNL Mixtures

In this section we study the question of how well succinct MNL mixtures can approximate a RUM. We start by observing that the distribution of winners of a permutation can be well approximated by those of an MNL.

Observation 8. Let $D$ be a RUM supported on a single permutation. Then, for each $0 < \epsilon < 1$, there exists an MNL $L$ such that $\mathrm{dist}(L, D) \le \epsilon$.

Proof. Let $\mathrm{supp}(D) = \{\sigma\}$. We construct an MNL $L$ by assigning a weight of $\epsilon^r$ to the item with rank $r$ in $\sigma$, for each $r \in [n]$. For any slate $S \subseteq [n]$, let $i = \sigma(S)$ be the top element of $S$ with the ordering of $\sigma$; then, $D_S(i) = 1$. If $r$ is the rank of $i$ in $\sigma$, then every other item of $S$ has rank larger than $r$, so that
$$L_S(i) = \frac{\epsilon^r}{\sum_{j \in S} w_j} \ge \frac{\epsilon^r}{\sum_{j=r}^{\infty} \epsilon^j} = 1 - \epsilon.$$
Since $D_S(i) = 1$, we have $|L_S - D_S|_{\mathrm{tv}} \le \epsilon$.

Theorem 1 and Observation 8 immediately yield:

Corollary 9. For each RUM $D$, there exists a uniform mixture $L$ of $O(n/\epsilon^2)$ MNLs such that $\mathrm{dist}(L, D) \le \epsilon$.
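The reduction behind Observation 8 is constructive; the sketch below (ours) turns a permutation into an MNL with geometrically decaying weights and checks the induced winner probability of a slate's top item. Applying it to each permutation of the multiset of Theorem 1 gives the uniform mixture of Corollary 9.

```python
def permutation_to_mnl(perm, eps):
    """Observation 8: the item of rank r gets weight eps**r, so the top item of any
    slate wins with probability at least 1 - eps."""
    return {item: eps ** rank for rank, item in enumerate(perm, start=1)}

def mnl_winner_probability(weights, slate, item):
    return weights[item] / sum(weights[j] for j in slate)

perm = (4, 2, 5, 1, 3)                      # items from most to least preferred
weights = permutation_to_mnl(perm, eps=0.05)
slate = {1, 2, 3}
top = next(i for i in perm if i in slate)   # the permutation's winner in the slate
print(top, mnl_winner_probability(weights, slate, top))  # item 2, probability >= 0.95
```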
In the remainder of this section we will prove an almost matching lower bound: one cannot approximate a generic RUM to within some constant error using a mixture of only $o(n/\log n)$ MNLs. There are two ingredients in this proof: the representational lower bound for RUMs (Corollary 7) and a compression result for MNL mixtures that we show below (Theorem 12). The latter is of independent interest.

To show that MNL mixtures can be compressed we proceed as follows. First, we show that the weights in any MNL can be reduced to weights whose consecutive ratios can be represented with $O(\log \frac{n}{\epsilon})$ bits each, at the cost of a small error in the winner distributions. Consequently, any MNL can be approximated by a representation that uses only $O(n \log \frac{n}{\epsilon})$ bits. Then, we use this result to show that a mixture of $t$ MNLs can be approximated using only $O(tn \log \frac{n}{\epsilon})$ bits. The argument is concluded by applying Corollary 7, which ensures that, for a mixture of $t$ MNLs to approximate a generic RUM, one must have $tn \log n \ge \Omega(n^2)$, and thus $t \ge \Omega(n/\log n)$.

We begin by proving our main MNL compression result.

Theorem 10. For any MNL $L$ and for each $0 < \epsilon \le 1$, there is an MNL $\tilde{L}$ that can be represented with $O\!\left(n \log \frac{n}{\epsilon}\right)$ bits such that $\mathrm{dist}(L, \tilde{L}) \le \epsilon$.

Proof. Let $[n] = \{i_1, \ldots, i_n\}$ and assume w.l.o.g. that $1 = w_{i_1} \le \cdots \le w_{i_n}$ are the weights of the items in the MNL $L$. For each $j = 2, \ldots, n$, define
$$\ell_j = \min\left\{ \left\lfloor \log_{1 + \frac{\epsilon}{2n}} \frac{w_{i_j}}{w_{i_{j-1}}} \right\rfloor,\ \left\lceil \log_{1 + \frac{\epsilon}{2n}} \frac{2n}{\epsilon} \right\rceil \right\}.$$
Note that $\ell_j$ is a non-negative integer of value at most $O\!\left(\frac{n}{\epsilon} \log \frac{n}{\epsilon}\right)$ and hence can be represented using $O\!\left(\log \frac{n}{\epsilon}\right)$ bits. Thus, the full sequence $\ell_2, \ldots, \ell_n$, and the ordering $i_1, i_2, \ldots, i_n$, can be represented with $O\!\left(n \log \frac{n}{\epsilon}\right)$ bits. We define the MNL $\tilde{L}$ using $\ell_2, \ldots, \ell_n$ and $i_1, \ldots, i_n$: the weight of item $i_j \in [n]$ in $\tilde{L}$ is
$$\tilde{w}_{i_j} = \left(1 + \frac{\epsilon}{2n}\right)^{\sum_{t=2}^{j} \ell_t}.$$
To prove $\mathrm{dist}(L, \tilde{L}) \le \epsilon$, we need two key properties of these new weights, stated next.

Lemma 11. Let $j < j'$. (i) If $w_{i_{j'}}/w_{i_j} \le \frac{2n}{\epsilon}$, then $\tilde{w}_{i_{j'}}/\tilde{w}_{i_j}$ is within a multiplicative factor $1 \pm \frac{\epsilon}{2}$ of $w_{i_{j'}}/w_{i_j}$. (ii) If $w_{i_{j'}}/w_{i_j} > \frac{2n}{\epsilon}$, then $\tilde{w}_{i_{j'}}/\tilde{w}_{i_j} \ge \frac{2n}{\epsilon}$.

Now, consider a slate $S = \{s_1, \ldots, s_{|S|}\} \subseteq [n]$ with $w_{s_1} \le \cdots \le w_{s_{|S|}}$. We aim to show $|L_S - \tilde{L}_S|_{\mathrm{tv}} \le \epsilon$. If $|S| = 1$, then $L_S = \tilde{L}_S$ and $|L_S - \tilde{L}_S|_{\mathrm{tv}} = 0$. Otherwise, let $j \ge 1$ be the smallest integer such that $w_{s_j} \ge \frac{\epsilon}{2n}\, w_{s_{|S|}}$. Now, we write $|L_S - \tilde{L}_S|_{\mathrm{tv}} = (\alpha_S + \beta_S)/2$, where
$$\alpha_S = \sum_{j' < j} |L_S(s_{j'}) - \tilde{L}_S(s_{j'})|, \qquad (1)$$
$$\beta_S = \sum_{j' \ge j} |L_S(s_{j'}) - \tilde{L}_S(s_{j'})|. \qquad (2)$$
We begin by upper bounding $\alpha_S$. By the definition of $j$, each item $s_{j'}$ with $j' < j$ satisfies $w_{s_{j'}} < \frac{\epsilon}{2n}\, w_{s_{|S|}}$, so that $L_S(s_{j'}) \le \frac{w_{s_{j'}}}{w_{s_{|S|}}} < \frac{\epsilon}{2n}$; by Lemma 11(ii), it also satisfies $\tilde{L}_S(s_{j'}) \le \frac{\tilde{w}_{s_{j'}}}{\tilde{w}_{s_{|S|}}} \le \frac{\epsilon}{2n}$. Summing over the fewer than $n$ items that precede $s_j$, and applying the triangle inequality, we obtain $\alpha_S < \epsilon$.

We now upper bound $\beta_S$. To do this, we write
$$\tilde{L}_S(s_{j'}) = \frac{\tilde{w}_{s_{j'}}}{\sum_{\ell=1}^{|S|} \tilde{w}_{s_\ell}} = \frac{\tilde{w}_{s_{j'}} / \tilde{w}_{s_{|S|}}}{\sum_{\ell=1}^{|S|} \left( \tilde{w}_{s_\ell} / \tilde{w}_{s_{|S|}} \right)}$$
and apply Lemma 11(i) to get the upper and lower bounds
$$\left(1 - \frac{\epsilon}{2}\right) \frac{w_{s_{j'}} / w_{s_{|S|}}}{\sum_{\ell=1}^{|S|} \left( w_{s_\ell} / w_{s_{|S|}} \right)} \le \tilde{L}_S(s_{j'}) \le \frac{1}{1 - \epsilon/2} \cdot \frac{w_{s_{j'}} / w_{s_{|S|}}}{\sum_{\ell=1}^{|S|} \left( w_{s_\ell} / w_{s_{|S|}} \right)},$$
for each $j' \ge j$. Using these, we obtain
$$|L_S(s_{j'}) - \tilde{L}_S(s_{j'})| \le \frac{\epsilon/2}{1 - \epsilon/2} \cdot \frac{w_{s_{j'}}}{\sum_{\ell=1}^{|S|} w_{s_\ell}},$$
where we used $\epsilon \in (0, 1)$. Now,
$$\beta_S = \sum_{j'=j}^{|S|} |L_S(s_{j'}) - \tilde{L}_S(s_{j'})| \le \frac{\epsilon/2}{1 - \epsilon/2} \cdot \sum_{j'=j}^{|S|} \frac{w_{s_{j'}}}{\sum_{\ell=1}^{|S|} w_{s_\ell}} \le \frac{\epsilon/2}{1 - \epsilon/2} \le \epsilon.$$
Finally, $|L_S - \tilde{L}_S|_{\mathrm{tv}} = (\alpha_S + \beta_S)/2 \le \epsilon$.

We next show that a mixture of $t$ MNLs can be approximated by a mixture that admits an $O\!\left(tn \log \frac{n}{\epsilon}\right)$-bit representation.

Theorem 12. For any mixture $L$ of $t$ MNLs and for each $0 < \epsilon \le 1$, there is a mixture $\tilde{L}$ of $t$ MNLs that can be represented with $O\!\left(tn \log \frac{n}{\epsilon}\right)$ bits and $\mathrm{dist}(L, \tilde{L}) \le \epsilon$.

Using these, we can now prove the lower bound on the size of a mixture of MNLs that approximates a RUM.

Corollary 13. Fix any small enough constant $\epsilon > 0$. There is a RUM $D$ such that, for each mixture $L$ of $o(n/\log n)$ MNLs, it holds that $\mathrm{dist}(L, D) > \frac18 - \epsilon$.

Proof. Let $L$ be a generic mixture of $t = o(n/\log n)$ MNLs. By Theorem 12, there is a mixture $\tilde{L}$ such that $\mathrm{dist}(L, \tilde{L}) \le \epsilon/4$ and $\tilde{L}$ can be represented with $O\!\left(tn \log \frac{n}{\epsilon}\right) = o(n^2)$ bits. By contradiction, suppose that for each RUM $D$ one could find a mixture $L^{(D)}$ of $o(n/\log n)$ MNLs that approximates $D$ to within total variation distance $\frac18 - \epsilon$ on each slate $S$. Then one would be able to approximate $D_S$ to within total variation distance $\frac18 - \epsilon + \frac{\epsilon}{4} \le \frac{1-\alpha}{8}$ (for $\alpha \le 6\epsilon$) for each slate $S \subseteq [n]$ by storing only the representation of $\tilde{L}^{(D)}$, i.e., $o(n^2)$ bits. But this contradicts Corollary 7, and hence it is impossible that the mixture $L^{(D)}$ exists for each RUM $D$.

6. RUMs vs MNL Mixtures

In this section we compare RUMs and MNL mixtures having supports of the same size. We will see that, for large supports, they are near-equivalent. On the other hand, small mixtures of MNLs are more powerful than small-support RUMs. Earlier, we have shown that a RUM supported on $O(n)$ permutations (Theorem 1), and a mixture of $O(n)$ MNLs (Corollary 9), are both sufficient to approximate any RUM. We also proved that no mixture of $o(n/\log n)$ MNLs can approximately represent a generic RUM (Corollary 13).

We begin this section by proving that there are simple RUMs that can only be approximated by RUMs with $\Omega(n)$ support. In particular, let $U$ be the uniform RUM, i.e., the RUM that chooses a permutation u.a.r. from $S_n$. Clearly, $U_S(i) = 1/|S|$ for any slate $S \subseteq [n]$ and for each $i \in S$.⁴

Theorem 14. For any RUM $\tilde{U}$ with $|\mathrm{supp}(\tilde{U})| = o(n)$, it holds that $\mathrm{dist}(U, \tilde{U}) \ge 1 - o(1)$.

Thus, in the worst case, both RUMs and MNL mixtures require a support of size $\tilde{\Theta}(n)$ to approximately represent the generic RUM. This equivalence, though, does not translate uniformly to the whole space of RUMs. In particular, $U$ is equivalent to an MNL that assigns the same weight of $1$ to each item. Thus, $U$ is a RUM that can be perfectly represented with a single MNL, but that can only be approximated by RUMs having $\Omega(n)$ permutations in their support. Conversely, by Observation 8, any RUM supported on $k$ permutations can be approximately represented by a mixture of $k$ MNLs.

⁴ We point out that the distribution over permutations of $U$ is the one supporting the min-hash (or shingles) sketch (Broder, 1997).
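The separation behind Theorem 14 is easy to observe on small instances. The sketch below (ours) compares the uniform RUM, which a single equal-weight MNL represents exactly, against a RUM supported on just two permutations, and finds a slate on which the two winner distributions are far apart.

```python
import random
from itertools import combinations

def winner_dist(support, slate):
    """Winner distribution of the uniform RUM on `support` (permutations listed
    most preferred first)."""
    dist = {}
    for perm in support:
        w = next(i for i in perm if i in slate)
        dist[w] = dist.get(w, 0.0) + 1.0 / len(support)
    return dist

def tv(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

n = 8
items = list(range(1, n + 1))
small_support = [tuple(random.sample(items, n)) for _ in range(2)]  # |supp| = 2 << n

worst = 0.0
for size in range(1, n + 1):
    for slate in combinations(items, size):
        uniform = {i: 1.0 / size for i in slate}      # U_S(i) = 1/|S|
        worst = max(worst, tv(uniform, winner_dist(small_support, set(slate))))
print(worst)                                          # at least 1 - 2/n = 0.75 here
```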
7. The Case of Small Slates

In many practical settings, only the behavior of the RUM on slates of small sizes matters. In this section we consider this case and extend many of our results. First, we modify the distance notion to focus on small slates:
$$\mathrm{dist}_k(D, D') = \max_{\emptyset \neq S \subseteq [n],\ |S| \le k} |D_S - D'_S|_{\mathrm{tv}}.$$
We show that, in order to approximate the winner distributions only for slates of size at most $k$, the number of permutations needed can be shrunk from $\Theta(n)$ to $O(k \log n)$.

Theorem 15. Let $0 < \epsilon, \delta < 1$. There is a polynomial time algorithm that, given any distribution $D$ on $S_n$, produces a multiset $M$ of $O\!\left(\epsilon^{-2}\left(k \log n + \ln \delta^{-1}\right)\right)$ permutations such that, with probability at least $1 - \delta$, the uniform distribution $D'$ on $M$ guarantees that $\mathrm{dist}_k(D, D') \le \epsilon$.

The following is immediate and analogous to Corollary 2.

Corollary 16. For each $0 < \epsilon < 1$ and for each RUM $D$, one can build a data structure using $O(\epsilon^{-2}\, nk \log^2 n)$ bits that one can use to return, for each slate $S \subseteq [n]$ with $|S| \le k$, a distribution $\tilde{D}_S$ satisfying $|D_S - \tilde{D}_S|_{\mathrm{tv}} \le \epsilon$.

We conclude with a near-matching $\Omega(nk)$-bit lower bound.

Theorem 17. Fix some $0 < \alpha \le 1/2$. A data structure for any RUM $D$ that can be used, for each slate $S \subseteq [n]$ with $|S| \le k$, to return a distribution $\tilde{D}_S$ satisfying $|D_S - \tilde{D}_S|_{\mathrm{tv}} \le \frac{1-\alpha}{8}$, requires at least $\left(1 - O(k^{-2})\right) \frac{\alpha^3}{6}\, nk$ bits.

8. Experimental Results

The experiments were coded in Python and used IBM CPLEX.⁵ We ran them on an 8-core i9 MacBook Pro with 64 GiB of RAM; the total running time of all runs of all experiments was under two hours.

⁵ www.ibm.com/analytics/cplex-optimizer

Approximation vs. Size. In the first experiment, we aim to understand the relation between the maximum (and the average) total variation distance of the approximating RUM and the number of permutations in its support. We consider the Sushi 3A dataset (Kamishima, 2003). The dataset is composed of 5,000 permutations of $n = 10$ fixed types of sushi, where the $i$-th permutation represents user $i$'s preference order of the 10 types. A uniform distribution on the dataset defines a RUM $D$. To compress $D$, for each $t \in [25]$, we sample i.i.d. $t \cdot n$ permutations from $D$ and produce a RUM $\tilde{D}^{(t)}$ as in Theorem 1. We computed two errors for each such $\tilde{D}^{(t)}$: the maximum and the average total variation distance $|D_S - \tilde{D}^{(t)}_S|_{\mathrm{tv}}$ over all slates $S$. The results, each averaged over 1,000 runs of the sampling algorithm, are in Figure 1. The error, averaged over all the slates, is below 0.1 already with $5n = 50$ samples (i.e., 1% of the original data). The maximum error over the slates is below 0.1 already with $23n$ samples (4.6% of the original data).

[Figure 1 (total variation cost vs. number of permutations): the results for Sushi 3A. The data points represent the costs averaged over 1,000 runs of the sampling algorithm; the error bars represent one standard deviation. The solid lines are power laws with exponent $-1/2$. The $c\,y^{-2}n$ prediction of Theorem 1 ($x = c\, y^{-2} n$, i.e., $y = \sqrt{cn/x}$) fits the data quite precisely.]
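The evaluation used here is easy to reproduce in outline. The sketch below (ours; the dataset loading is left out, and a synthetic population of rankings stands in for the sushi data) computes the maximum and average slate-wise total variation distance between a full empirical RUM and a sampled compression of it, enumerating all $2^n - 1$ slates, which is feasible for $n = 10$.

```python
import random
from itertools import combinations

def winner_dist(perms, slate):
    """Winner distribution of the uniform RUM on `perms` (most preferred item first)."""
    dist = {}
    for perm in perms:
        w = next(i for i in perm if i in slate)
        dist[w] = dist.get(w, 0.0) + 1.0 / len(perms)
    return dist

def tv(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def max_and_avg_tv(full, sample, n):
    gaps = []
    for size in range(1, n + 1):
        for slate in combinations(range(1, n + 1), size):
            s = set(slate)
            gaps.append(tv(winner_dist(full, s), winner_dist(sample, s)))
    return max(gaps), sum(gaps) / len(gaps)

n = 10
population = [tuple(random.sample(range(1, n + 1), n)) for _ in range(2000)]
sample = random.choices(population, k=5 * n)     # t*n = 50 sampled permutations
print(max_and_avg_tv(population, sample, n))
```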
Choices Representation. In our second experiment, we compare the quality of the predictions given by a RUM representation with that of the PCMC model⁶ of Ragain & Ugander (2016; 2021), for the SFwork and the SFshop datasets (Koppelman & Bhat, 2006). These datasets represent the choices between transportation alternatives made by people that were to travel to and from their workplace (SFwork), and a shopping center (SFshop). SFwork contains 5,029 events, each of which is composed of a slate shown to a user, together with the user's choice in that slate; here, $n = 6$. SFshop is similar, with 3,157 events and $n = 8$.

⁶ A PCMC model is defined by an $n \times n$ matrix $Q$ representing a continuous-time Markov chain, with $Q_{i,j} + Q_{j,i} > 0$ for each $\{i,j\} \in \binom{[n]}{2}$. Given a slate $S$, the distribution of the winner of $S$ is the (unique) stationary distribution of the continuous-time Markov chain on state space $S$ and transition rates $q_{i,j} = Q_{i,j}$ for each $i \in S$ and $j \in S \setminus \{i\}$.

To measure the quality of a prediction, we follow Ragain & Ugander (2021) and use the expected total variation distance⁷, which we now define. Given a multiset $M = \{(S_1, i_1), \ldots, (S_t, i_t)\}$, with $i_j \in S_j$, of (slate, winner) pairs, one computes the empirical distribution $\mu_M$ over the slates as follows: $\mu_M(S)$ equals the fraction of the pairs of $M$ that have $S$ as their slate. Then, given the same $M$ and a slate $S$ such that $\mu_M(S) > 0$, one computes the empirical distribution of the winner of $S$: $M_S(i)$ is the ratio of the number of pairs of $M$ that have $S$ as their slate and $i$ as their winner, to the number of pairs of $M$ that have $S$ as their slate. To test the quality of a model $D$ against $M$, we compute the expected total variation distance:
$$\mathrm{dist}_M(D) = \mathbb{E}_{S \sim \mu_M}\left[ |D_S - M_S|_{\mathrm{tv}} \right].$$

⁷ Ragain & Ugander (2021) plot the expected $\ell_1$ distance, which is exactly twice the expected total variation distance.

[Figure 2 (expected total variation cost vs. percentage of data used for training; curves for PCMC and RUM on SFshop and SFwork): the expected total variation distance on the SFshop and SFwork datasets (averaged over 100 random partitions of the datasets) between $M^{\mathrm{te}}$ and the RUM model produced by our LP, and between $M^{\mathrm{te}}$ and the PCMC model of Ragain & Ugander (2021), as the percentage of the dataset used for training ranges from 5% to 75%.]

As in (Ragain & Ugander, 2021), for each dataset $M$, we split the events: a u.a.r. part $M^{\mathrm{tr}}$ of the dataset, conditioned on having a fixed size (in the 5%-75% range), was used for training; a u.a.r. part $M^{\mathrm{te}}$ of the dataset, disjoint from $M^{\mathrm{tr}}$ and conditioned on having size 25%, was used for testing. We use $M^{\mathrm{tr}}$ to produce the RUM $\tilde{D}$ that minimizes the expected total variation distance $\mathrm{dist}_{M^{\mathrm{tr}}}(\tilde{D})$ to the training data $M^{\mathrm{tr}}$, using the following linear program (LP):
$$\begin{aligned}
\min\ & \tfrac12 \sum_{S \in \mathrm{supp}(\mu_{M^{\mathrm{tr}}})} \sum_{i \in S} \mu_{M^{\mathrm{tr}}}(S)\, \delta_{S,i} \\
\text{s.t.}\ & \Big| \sum_{\pi \in S_n :\ \pi(S) = i} p_\pi - M^{\mathrm{tr}}_S(i) \Big| \le \delta_{S,i} && \forall\, i \in S \in \mathrm{supp}(\mu_{M^{\mathrm{tr}}}), \\
& \sum_{\pi \in S_n} p_\pi = 1, \qquad p_\pi \ge 0 && \forall\, \pi \in S_n.
\end{aligned}$$
The solution to this LP directly gives a RUM $\tilde{D}$: the RUM will sample permutation $\pi$ with probability $p_\pi$. After having computed the best RUM for $M^{\mathrm{tr}}$, we test it on $M^{\mathrm{te}}$.

Figure 2 plots, for each dataset, the $\mathrm{dist}_{M^{\mathrm{te}}}(\tilde{D})$ of our RUM and the $\mathrm{dist}_{M^{\mathrm{te}}}(P)$ of the PCMC model $P$ of Ragain & Ugander (2021), also trained on $M^{\mathrm{tr}}$. Each point represents the expected total variation distance, averaged over 100 random partitions of the dataset. For SFshop, approximating with a RUM gives a much better error than that obtained with PCMC (the average of the ratios of the PCMC error and of the RUM error is 2.29). For SFwork, the RUM gives a slightly worse approximation than the PCMC one (the average of the PCMC/RUM error ratios is 0.84). Thus RUM is competitive against PCMC, representation-wise; also, the RUMs we learned have a smaller expected total variation error than that of the mixture-of-MNLs model of Ragain & Ugander (2021) for both datasets (see their Figure 2, and the Figure 2 of their Supplementary Material; recall that the distance they plot is twice ours). When using the full dataset for both training and testing, the RUM representations given by the LP have a $\mathrm{dist}_M$ error of 0.026 for SFwork⁸ and a $\mathrm{dist}_M$ error of 0.027 for SFshop⁹.

⁸ This RUM model for SFwork is supported on $t = 16$ permutations. The model can then be represented with $t \cdot (n-1) = 80$ integers (used for storing the permutations in the RUM's support) and $t - 1 = 15$ floating point numbers (used for storing the probabilities that the RUM assigns to those $t$ permutations).

⁹ This RUM model for SFshop is supported on $t = 11$ permutations. The model can then be represented with $t \cdot (n-1) = 77$ integers and $t - 1 = 10$ floating point numbers.
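The LP is small enough (at most $n! = 720$ permutation variables for SFwork) to be solved with off-the-shelf tools. The sketch below (ours) builds and solves it with scipy.optimize.linprog in place of CPLEX, on toy data in the same (slate, winner) format; the function names are illustrative.

```python
from itertools import permutations
from scipy.optimize import linprog
import numpy as np

def fit_rum_lp(events, n):
    """Fit a RUM minimizing the expected total variation distance to `events`,
    a list of (slate, winner) pairs, via the LP described above."""
    perms = list(permutations(range(1, n + 1)))
    slates = sorted({frozenset(s) for s, _ in events}, key=sorted)
    mu = {S: sum(1 for s, _ in events if frozenset(s) == S) / len(events) for S in slates}
    m_s = {S: {i: sum(1 for s, w in events if frozenset(s) == S and w == i) /
                  sum(1 for s, _ in events if frozenset(s) == S) for i in S} for S in slates}

    pairs = [(S, i) for S in slates for i in sorted(S)]
    n_p, n_d = len(perms), len(pairs)

    def wins(perm, S, i):                     # does `perm` pick i as the winner of S?
        return next(x for x in perm if x in S) == i

    # Variables: [p_pi ..., delta_{S,i} ...]; minimize (1/2) sum mu(S) delta_{S,i}.
    c = np.concatenate([np.zeros(n_p), [0.5 * mu[S] for S, _ in pairs]])
    A_ub, b_ub = [], []
    for k, (S, i) in enumerate(pairs):
        row = np.array([1.0 if wins(p, S, i) else 0.0 for p in perms])
        d = np.zeros(n_d)
        d[k] = -1.0
        A_ub.append(np.concatenate([row, d]))     #  sum_pi - delta <= M_S(i)
        b_ub.append(m_s[S][i])
        A_ub.append(np.concatenate([-row, d]))    # -sum_pi - delta <= -M_S(i)
        b_ub.append(-m_s[S][i])
    A_eq = [np.concatenate([np.ones(n_p), np.zeros(n_d)])]   # sum_pi p_pi = 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=np.array(A_eq), b_eq=[1.0],
                  bounds=[(0, None)] * (n_p + n_d), method="highs")
    return {perm: prob for perm, prob in zip(perms, res.x[:n_p]) if prob > 1e-9}

# Toy events on n = 4 alternatives.
events = [({1, 2, 3}, 1), ({1, 2, 3}, 2), ({2, 4}, 4), ({2, 4}, 4), ({1, 4}, 1)]
print(fit_rum_lp(events, n=4))
```

A basic optimal solution of this LP has support bounded by the number of constraints, which is consistent with the small supports reported in the footnotes above.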
9. Related Work

Discrete choice theory is a well-established research topic in economics; see the excellent book by Train (2003). We only cover work that is directly relevant to our focus. Farias et al. (2009) considered the problem of approximating the choices made by users with a RUM model on bounded support. They proposed an $\ell_0$-minimization formulation of the problem, and showed that it can be optimized efficiently, under assumptions on its optimum. Our positive results can be seen as robust versions of their conditional result. Several papers (Chierichetti et al., 2018a;b; Negahban et al., 2018; Oh & Shah, 2014; Soufiani et al., 2012; Tang, 2020) have considered the problem of learning the behavior of a general RUM or of restricted RUM models, in both the passive and active learning settings. Our work, on the other hand, addresses the representation complexity of RUMs. Very recently, some deterministic choice models have been considered in the ML literature (Rosenfeld et al., 2020). These models, while interesting, provably cannot approximate RUMs. In fact, in the Supplementary Material, we show that deterministic models, as well as the models of (Ragain & Ugander, 2016; Seshadri et al., 2019), provably cannot represent general RUMs.

10. Conclusions

In this paper we consider the representational complexity of approximating an arbitrary RUM on $n$ items. We obtain near-optimal bounds of $\tilde{\Theta}(n^2)$ bits needed to represent any RUM. We also show a similarly tight bound of $\tilde{\Theta}(n)$ on the size of MNL mixtures for approximating the generic RUM. Besides the immediate question of how to close the small gap between our upper and lower bounds, our work opens up other research avenues. For example, it would be interesting to consider the variant of our question in which we only want to $\epsilon$-additively approximate the probability that the generic item wins in the generic slate. While our upper bound applies here, the lower bound does not. Another question: can a succinct representation be learned with a deep network?

Acknowledgements

We thank Johan Ugander and the anonymous reviewers for their comments and suggestions. Flavio Chierichetti was supported in part by the PRIN project 2017K7XPAN, by a Google Focused Research Award, by BiCi (Bertinoro international Center for informatics), and by the Dipartimenti di Eccellenza grant awarded to the Dipartimento di Informatica at Sapienza.

References

Broder, A. Z. On the resemblance and containment of documents. In SEQUENCES, pp. 21-29, 1997.

Calabro, C. The Exponential Complexity of Satisfiability Problems. PhD thesis, UCSD, 2009.

Chierichetti, F., Kumar, R., and Tomkins, A. Discrete choice, permutations, and reconstruction. In SODA, pp. 576-586, 2018a.

Chierichetti, F., Kumar, R., and Tomkins, A. Learning a mixture of two multinomial logits. In ICML, pp. 961-969, 2018b.

Farias, V. F., Jagabathula, S., and Shah, D. A data-driven approach to modeling choice. In NIPS, pp. 504-512, 2009.

Hoeffding, W. Probability inequalities for sums of bounded random variables. JASA, 58, 1963.

Kamishima, T. Nantonac collaborative filtering: Recommendation based on order responses. In KDD, pp. 583-588, 2003.

Koppelman, F. S. and Bhat, C. A self instructing course in mode choice modeling: Multinomial and nested logit models, 2006. US Department of Transportation, Federal Transit Administration.
McFadden, D. and Train, K. Mixed MNL models for discrete response. J. Applied Econometrics, 15(5):447-470, 2000.

Negahban, S., Oh, S., Thekumparampil, K. K., and Xu, J. Learning from comparisons and choices. JMLR, 19(1):1478-1572, 2018.

Oh, S. and Shah, D. Learning mixed multinomial logit model from ordinal data. In NIPS, pp. 595-603, 2014.

Ragain, S. and Ugander, J. Pairwise choice Markov chains. In NIPS, pp. 3198-3206, 2016.

Ragain, S. and Ugander, J. Pairwise choice Markov chains. arXiv:1603.02740, 2021.

Rosenfeld, N., Oshiba, K., and Singer, Y. Predicting choice with set-dependent aggregation. In ICML, pp. 8220-8229, 2020.

Seshadri, A., Peysakhovich, A., and Ugander, J. Discovering context effects from raw choice data. In ICML, pp. 5660-5669, 2019.

Seshadri, A., Ragain, S., and Ugander, J. Learning rich rankings. In NeurIPS, 2020.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.

Soufiani, H. A., Parkes, D. C., and Xia, L. Random utility theory for social choice. In NIPS, pp. 126-134, 2012.

Tang, W. Learning an arbitrary mixture of two multinomial logits. arXiv:2007.00204, 2020.

Train, K. Discrete Choice Methods with Simulation. Cambridge University Press, 2003.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.