Pairwise Choice Markov Chains

Stephen Ragain
Management Science & Engineering, Stanford University, Stanford, CA 94305
sragain@stanford.edu

Johan Ugander
Management Science & Engineering, Stanford University, Stanford, CA 94305
jugander@stanford.edu

Abstract

As datasets capturing human choices grow in richness and scale, particularly in online domains, there is an increasing need for choice models that escape traditional choice-theoretic axioms such as regularity, stochastic transitivity, and Luce's choice axiom. In this work we introduce the Pairwise Choice Markov Chain (PCMC) model of discrete choice, an inferentially tractable model that does not assume any of the above axioms while still satisfying the foundational axiom of uniform expansion, a considerably weaker assumption than Luce's choice axiom. We show that the PCMC model significantly outperforms both the Multinomial Logit (MNL) model and a mixed MNL (MMNL) model in prediction tasks on both synthetic and empirical datasets known to exhibit violations of Luce's axiom. Our analysis also synthesizes several recent observations connecting the Multinomial Logit model and Markov chains; the PCMC model retains the Multinomial Logit model as a special case.

1 Introduction

Discrete choice models describe and predict decisions between distinct alternatives. Traditional applications include consumer purchasing decisions, choices of schooling or employment, and commuter choices for modes of transportation among available options. Early models of probabilistic discrete choice, including the well-known Thurstone Case V model [27] and the Bradley-Terry-Luce (BTL) model [7], were developed and refined under diverse, strict assumptions about human decision making. As complex individual choices become increasingly mediated by engineered and learned platforms, from online shopping to web browser clicking to interactions with recommendation systems, there is a pressing need for flexible models capable of describing and predicting nuanced choice behavior.

Luce's choice axiom, popularly known as the independence of irrelevant alternatives (IIA), is arguably the most storied assumption in choice theory [18]. The axiom consists of two statements, applied to each subset of alternatives $S$ within a broader universe $U$. Let $p_{a|S} = \Pr(a \text{ chosen from } S)$ for any $S \subseteq U$, and in a slight abuse of notation let $p_{ab} = \Pr(a \text{ chosen from } \{a, b\})$ when there are only two elements. Luce's axiom is then that: (i) if $p_{ab} = 0$ then $p_{a|S} = 0$ for all $S$ containing $a$ and $b$; (ii) the probability of choosing $a$ from $U$, conditioned on the choice lying in $S$, is equal to $p_{a|S}$. The BTL model, which defines $p_{ab} = \gamma_a/(\gamma_a + \gamma_b)$ for latent quality parameters $\gamma_i > 0$, satisfies the axiom while Thurstone's Case V model does not [1].

Soon after its introduction, the BTL model was generalized from pairwise choices to choices from larger sets [4]. The resulting Multinomial Logit (MNL) model again employs quality parameters $\gamma_i > 0$ for each $i \in U$ and defines $p_{i|S}$, the probability of choosing $i$ from $S \subseteq U$, as proportional to $\gamma_i$ for all $i \in S$. Any model that satisfies Luce's choice axiom is equivalent to some MNL model [19].

One consequence of Luce's choice axiom is strict stochastic transitivity between alternatives: if $p_{ab} \ge 0.5$ and $p_{bc} \ge 0.5$, then $p_{ac} \ge \max(p_{ab}, p_{bc})$. A possibly undesirable consequence of strict stochastic transitivity is the necessity of a total order across all elements.
But note that strict stochastic transitivity does not imply the choice axiom; Thurstone's model exhibits strict stochastic transitivity. Many choice theorists and empiricists, including Luce, have noted that the choice axiom and stochastic transitivity are strong assumptions that do not hold for empirical choice data [9, 12, 13, 26, 28].

A range of discrete choice models striving to escape the confines of the choice axiom have emerged over the years. The most popular of these models have been Elimination by Aspects [29], mixed MNL (MMNL) [6], and nested MNL [22]. Inference is practically difficult for all three of these models [15, 23]. Additionally, Elimination by Aspects and the MMNL model both exhibit the rigid property of regularity, defined below.

A broad, important class of models in the study of discrete choice is the class of random utility models (RUMs) [4, 20]. A RUM affiliates with each $i \in U$ a random variable $X_i$ and defines, for each subset $S \subseteq U$, the probability $\Pr(i \text{ chosen from } S) = \Pr(X_i \ge X_j, \forall j \in S)$. An independent RUM has independent $X_i$. RUMs assume neither the choice axiom nor stochastic transitivity. Thurstone's Case V model and the BTL model are both independent RUMs; the Elimination by Aspects and MMNL models are both RUMs. A major result by McFadden and Train establishes that for any RUM there exists an MMNL model that can approximate the choice probabilities of that RUM to within an arbitrary error [23], a strong result about the generality of MMNL models. The nested MNL model, meanwhile, is not a RUM.

Although RUMs need not exhibit stochastic transitivity, they still exhibit the weaker property of regularity: for any choice sets $A \subseteq B$ and any $x \in A$, $p_{x|A} \ge p_{x|B}$. Regularity may at first seem intuitively pleasing, but it prevents models from expressing framing effects [12] and other empirical observations from modern behavioral economics [28]. This rigidity motivates us to contribute a new model of discrete choice that escapes historically common assumptions while still furnishing enough structure to be inferentially tractable.

The present work. In this work we introduce a conceptually simple and inferentially tractable model of discrete choice that we call the PCMC model. The parameters of the PCMC model are the off-diagonal entries of a rate matrix $Q$ indexed by $U$. The PCMC model affiliates each subset $S$ of the alternatives with a continuous time Markov chain (CTMC) on $S$ with transition rate matrix $Q_S$, whose off-diagonal entries are the entries of $Q$ indexed by pairs of items in $S$. The model defines $p_{i|S}$, the selection probability of alternative $i \in S$, as the probability mass on $i$ in the stationary distribution of the CTMC on $S$.

The transition rates of these CTMCs can be interpreted as measures of preference between pairs of alternatives. Special cases of the model use pairwise choice probabilities as transition rates, and as a result the PCMC model extends arbitrary models of pairwise choice to models of setwise choice. Indeed, we show that when the matrix $Q$ is parameterized with the pairwise selection probabilities of a BTL pairwise choice model, the PCMC model reduces to an MNL model. Recent parameterizations of non-transitive pairwise probabilities, such as the Blade-Chest model [8], can be usefully employed to reduce the number of free parameters of the PCMC model.
Our PCMC model can be thought of as building upon the observation underlying the recently introduced Iterative Luce Spectral Ranking (I-LSR) procedure for efficiently finding the maximum likelihood estimate of the parameters of MNL models [21]. The analysis of I-LSR precisely analyzes a PCMC model in the special case where the matrix $Q$ has been parameterized by BTL. In that case the stationary distribution of the chain is found to satisfy the stationarity conditions of the MNL likelihood function, establishing a strong connection between MNL models and Markov chains. The PCMC model generalizes that connection. Other recent connections between the MNL model and Markov chains include the work on Rank Centrality [24], which employs a discrete time Markov chain for inference in place of I-LSR's continuous time chain, in the special case where all data are pairwise comparisons. Separate recent work has contributed a different discrete time Markov chain model of choice substitution capable of approximating any RUM [3], a related problem but one with a strong focus on ordered preferences. Lastly, recent work by Kumar et al. explores conditions under which a probability distribution over discrete items can be expressed as the stationary distribution of a discrete time Markov chain with score functions similar to the quality parameters in an MNL model [17].

The PCMC model is not a RUM, and in general does not exhibit stochastic transitivity, regularity, or the choice axiom. We find that the PCMC model does, however, obey the lesser-known but fundamental axiom of uniform expansion, a weakened version of Luce's choice axiom proposed by Yellott that implies the choice axiom for independent RUMs [30]. In this work we define a convenient structural property termed contractibility, of which uniform expansion is a special case, and we show that the PCMC model exhibits contractibility. Of the models mentioned above, only Elimination by Aspects exhibits uniform expansion without being an independent RUM. Elimination by Aspects obeys regularity, which the PCMC model does not; as such, the PCMC model is uniquely positioned in the literature of axiomatic discrete choice, minimally satisfying uniform expansion without the other aforementioned axioms.

After presenting the model and its properties, we investigate choice predictions from our model on two empirical choice datasets as well as diverse synthetic datasets. The empirical choice datasets concern transportation choices made on commuting and shopping trips in San Francisco. Inference on synthetic data shows that PCMC is competitive with MNL when Luce's choice axiom holds, while PCMC outperforms MNL when the axiom does not hold. More significantly, for both of the empirical datasets we find that a learned PCMC model predicts empirical choices significantly better than a learned MNL model.

2 The PCMC model

Figure 1: Markov chains on choice sets {a, b}, {a, c}, and {b, c}, where line thicknesses denote transition rates. The chain on the choice set {a, b, c} is assembled using the same rates.

A Pairwise Choice Markov Chain (PCMC) model defines the selection probability $p_{i|S}$, the probability of choosing $i$ from $S \subseteq U$, as the probability mass on alternative $i \in S$ of the stationary distribution of a continuous time Markov chain (CTMC) on the set of alternatives $S$. The model's parameters are the off-diagonal entries $q_{ij}$ of a rate matrix $Q$ indexed by pairs of elements in $U$. See Figure 1 for a diagram.
We impose the constraint $q_{ij} + q_{ji} \ge 1$ for all pairs $(i, j)$, which ensures irreducibility of the chain for all $S$. Given a query set $S \subseteq U$, we construct $Q_S$ by restricting the rows and columns of $Q$ to the elements in $S$ and setting $q_{ii} = -\sum_{j \in S \setminus i} q_{ij}$ for each $i \in S$. Let $\pi_S = \{\pi_S(i)\}_{i \in S}$ be the stationary distribution of the corresponding CTMC on $S$, and let $\pi_S(A) = \sum_{x \in A} \pi_S(x)$. We define the choice probability $p_{i|S} := \pi_S(i)$, and now show that the PCMC model is well defined.

Proposition 1. The choice probabilities $p_{i|S}$ are well defined for all $i \in S$ and all $S \subseteq U$ of a finite $U$.

Proof. We need only show that there is a single closed communicating class. Because $S$ is finite, there must be at least one closed communicating class. Suppose the chain had more than one closed communicating class and that $i \in S$ and $j \in S$ were in different closed communicating classes. But $q_{ij} + q_{ji} \ge 1$, so at least one of $q_{ij}$ and $q_{ji}$ is strictly positive and the chain can switch communicating classes through the transition with strictly positive rate, a contradiction.

While the support of $\pi_S$ is the single closed communicating class, $S$ may have transient states corresponding to alternatives with selection probability 0. Note that the irreducibility argument needs only that $q_{ij} + q_{ji}$ be positive, not necessarily at least 1 as imposed in the model definition. One could simply constrain $q_{ij} + q_{ji} \ge \epsilon$ for some positive $\epsilon$. However, multiplying all entries of $Q$ by some $c > 0$ does not affect the stationary distribution of the corresponding CTMC, so multiplication by $1/\epsilon$ gives a $Q$ with the same selection probabilities.

In the subsections that follow, we develop key properties of the model. We begin by showing how assigning $Q$ according to a Bradley-Terry-Luce (BTL) pairwise model results in the PCMC model being equivalent to BTL's canonical extension, the Multinomial Logit (MNL) set-wise model. We then construct a $Q$ for which the PCMC model is neither regular nor a RUM.

2.1 Multinomial Logit from Bradley-Terry-Luce

We now observe that the Multinomial Logit (MNL) model, also called the Plackett-Luce model, is precisely a PCMC model with a matrix $Q$ consisting of pairwise BTL probabilities. Recall that the BTL model assumes the existence of latent quality parameters $\gamma_i > 0$ for $i \in U$ with $p_{ij} = \gamma_i/(\gamma_i + \gamma_j)$ for all $i, j \in U$, and that the MNL generalization defines $p_{i|S} \propto \gamma_i$ for all $i \in S$, for each $S \subseteq U$.

Proposition 2. Let $\gamma$ be the parameters of a BTL model on $U$. For $q_{ji} = \frac{\gamma_i}{\gamma_i + \gamma_j}$, the PCMC probabilities $p_{i|S}$ are consistent with an MNL model on $S$ with parameters $\gamma$.

Proof. We aim to show that $\pi_S = \gamma/\|\gamma\|_1$ is a stationary distribution of the PCMC chain, i.e. $\pi_S^T Q_S = 0$. We have:
$$(\pi_S^T Q_S)_i = \frac{1}{\|\gamma\|_1}\Big(\sum_{j \ne i} \gamma_j q_{ji} - \gamma_i \sum_{j \ne i} q_{ij}\Big) = \frac{\gamma_i}{\|\gamma\|_1}\Big(\sum_{j \ne i} \frac{\gamma_j}{\gamma_i + \gamma_j} - \sum_{j \ne i} \frac{\gamma_j}{\gamma_i + \gamma_j}\Big) = 0.$$
Thus $\pi_S$ is always the stationary distribution of the chain, and we know by Proposition 1 that it is unique. It follows that $p_{i|S} \propto \gamma_i$ for all $i \in S$, as desired.

Other parameterizations of $Q$, which can be used for parameter reduction or to extend arbitrary models of pairwise choice, are explored in Section 1 of the Supplementary Material.
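To make the construction concrete, the following minimal sketch (Python with NumPy; illustrative code, not the authors' released implementation, with arbitrary example values for the universe size and $\gamma$) builds $Q$ from BTL pairwise probabilities, fills in the diagonal, and computes the stationary distribution of the resulting CTMC. The output matches $\gamma/\|\gamma\|_1$, as Proposition 2 predicts.

```python
# Minimal sketch verifying Proposition 2 numerically: parameterizing Q with
# BTL pairwise probabilities recovers MNL choice probabilities.
import numpy as np

def stationary(Q_S):
    """Stationary distribution of the CTMC with rate matrix Q_S.

    Solves pi^T Q_S = 0 together with sum(pi) = 1 as a least-squares system.
    """
    n = Q_S.shape[0]
    A = np.vstack([Q_S.T, np.ones(n)])      # stack balance equations and normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def pcmc_rate_matrix(q):
    """Fill in the diagonal so rows sum to zero: q_ii = -sum_{j != i} q_ij."""
    Q = q.copy().astype(float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# Illustrative BTL quality parameters for three alternatives.
gamma = np.array([3.0, 2.0, 1.0])
n = len(gamma)

# q_ji = gamma_i / (gamma_i + gamma_j): the rate from j toward i is Pr(i beats j).
q = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            q[j, i] = gamma[i] / (gamma[i] + gamma[j])

Q = pcmc_rate_matrix(q)
print(stationary(Q))           # ~ [0.5, 0.333, 0.167]
print(gamma / gamma.sum())     # matches, as Proposition 2 predicts
```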
2.2 A counterexample to regularity

The regularity property stipulates that for any $S \subseteq S'$, the probability of selecting $a$ from $S$ is at least the probability of selecting $a$ from $S'$. All RUMs exhibit regularity because $S \subseteq S'$ implies $\Pr(X_a = \max_{j \in S} X_j) \ge \Pr(X_a = \max_{j \in S'} X_j)$. We now construct a simple PCMC model which does not exhibit regularity, and is thus not a RUM.

Consider $U = \{r, p, s\}$ corresponding to a rock-paper-scissors-like stochastic game where each pairwise matchup has the same win probability $\alpha > \frac{1}{2}$. Constructing a PCMC model where the transition rate from $i$ to $j$ is $\alpha$ if $j$ beats $i$ in rock-paper-scissors yields the rate matrix
$$Q = \begin{pmatrix} -1 & 1-\alpha & \alpha \\ \alpha & -1 & 1-\alpha \\ 1-\alpha & \alpha & -1 \end{pmatrix}.$$
We see that for pairs of objects, the PCMC model returns the same probabilities as the pairwise game, i.e. $p_{ij} = \alpha$ when $i$ beats $j$ in rock-paper-scissors, since $p_{ij} = q_{ji}$ when $q_{ij} + q_{ji} = 1$. Regardless of how the probability $\alpha$ is chosen, however, it is always the case that $p_{r|U} = p_{p|U} = p_{s|U} = 1/3$. It follows that regularity does not hold for $\alpha > 2/3$, since the losing alternative in each pairwise matchup is then selected from the pair with probability $1 - \alpha < 1/3$.

We view the PCMC model's lack of regularity as a positive trait in the sense that empirical choice phenomena such as framing effects and asymmetric dominance violate regularity [12], and the PCMC model is rare in its ability to model such choices. Deriving necessary and sufficient conditions on $Q$ for a PCMC model to be a RUM, analogous to known characterization theorems for RUMs [10] and known sufficient conditions for nested MNL models to be RUMs [5], is an interesting open challenge.
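The counterexample is easy to check numerically. The sketch below is illustrative Python (self-contained, repeating the stationary-distribution helper from the previous sketch); the value $\alpha = 0.8$ is an arbitrary choice above $2/3$. Every alternative receives mass $1/3$ on the full set, while the loser of a pairwise matchup receives only $1 - \alpha < 1/3$ on the pair, violating regularity.

```python
# Minimal sketch of the regularity counterexample (not from the paper's code release).
import numpy as np

def stationary(Q_S):
    n = Q_S.shape[0]
    A = np.vstack([Q_S.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

alpha = 0.8
Q = np.array([[-1.0, 1 - alpha, alpha],
              [alpha, -1.0, 1 - alpha],
              [1 - alpha, alpha, -1.0]])   # rows/columns ordered (r, p, s)

print(stationary(Q))                       # [1/3, 1/3, 1/3] regardless of alpha

# Pairwise choice set {r, p}: restrict Q to those rows/columns and reset the diagonal.
Q_rp = np.array([[-(1 - alpha), 1 - alpha],
                 [alpha, -alpha]])
print(stationary(Q_rp))                    # [0.8, 0.2]: the winner gets alpha, while the
                                           # loser gets 1 - alpha = 0.2 < 1/3, so adding the
                                           # third alternative raises its selection probability
```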
3 Properties

While we have already demonstrated that the PCMC model avoids several restrictive properties that are often inconsistent with empirical choice data, we demonstrate in this section that the PCMC model still exhibits deep structure in the form of contractibility, which implies uniform expansion. Inspired by a thought experiment that was posed as an early challenge to the choice axiom, we define the property of contractibility to handle notions of similarity between elements. We demonstrate that the PCMC model exhibits contractibility, which gracefully handles this thought experiment.

3.1 Uniform expansion

Yellott [30] introduced uniform expansion as a weaker condition than Luce's choice axiom, but one that implies the choice axiom in the context of any independent RUM. Yellott posed the axiom of invariance to uniform expansion in the context of copies of elements which are identical. In the context of our model, such copies would have identical transition rates to alternatives:

Definition 1 (Copies). For $i, j$ in $S \subseteq U$, we say that $i$ and $j$ are copies if $q_{ik} = q_{jk}$ for all $k \in S \setminus \{i, j\}$ and $q_{ij} = q_{ji}$.

Yellott's introduction to uniform expansion asks the reader to consider an offer of a choice of beverage from $k$ identical cups of coffee, $k$ identical cups of tea, and $k$ identical glasses of milk. Yellott contends that the probability that the reader chooses a type of beverage (e.g. coffee) in this scenario should be the same as if they were only shown one cup of each beverage type, regardless of $k \ge 1$.

Definition 2 (Uniform Expansion). Consider a choice between $n$ elements in a set $S_1 = \{i_{11}, \ldots, i_{n1}\}$, and another choice from a set $S_k$ containing $k$ copies of each of the $n$ elements: $S_k = \{i_{11}, \ldots, i_{1k}, i_{21}, \ldots, i_{2k}, \ldots, i_{n1}, \ldots, i_{nk}\}$. The axiom of uniform expansion states that for each $m = 1, \ldots, n$ and all $k \ge 1$:
$$p_{i_{m1}|S_1} = \sum_{j=1}^{k} p_{i_{mj}|S_k}.$$

We will show that the PCMC model always exhibits a more general property of contractibility, of which uniform expansion is a special case; it thus always exhibits uniform expansion. Yellott showed that for any independent RUM with $|U| \ge 3$ the double-exponential distribution family is the only family of independent distributions that exhibits uniform expansion for all $k \ge 1$, and that Thurstone's model, based on the Gaussian distribution family, in particular does not exhibit uniform expansion.

While uniform expansion seems natural in many discrete choice contexts, it should be regarded with some skepticism in applications that model competitions. Sports matches or races are often modeled using RUMs, where the winner of a competition can be modeled as the competitor with the best draw from their random variable. If a competitor has a performance distribution with a heavy upper tail (so that their wins come from occasional "good days"), uniform expansion would not hold. This observation relates to recent work on team performance and selection [14], where non-invariance under uniform expansion plays a key role.

3.2 Contractibility

In a book review of Luce's early work on the choice axiom, Debreu [9] considers a hypothetical choice between three musical recordings: one of Beethoven's eighth symphony conducted by $X$, another of Beethoven's eighth symphony conducted by $Y$, and one of a Debussy quartet conducted by $Z$. We will call these options $B_1$, $B_2$, and $D$ respectively. When compared to $D$, Debreu argues that $B_1$ and $B_2$ are indistinguishable in the sense that $p_{DB_1} = p_{DB_2}$. However, someone may prefer $B_1$ over $B_2$ in the sense that $p_{B_1 B_2} > 0.5$. This is impossible under a BTL model, in which $p_{DB_1} = p_{DB_2}$ implies that $\gamma_{B_1} = \gamma_{B_2}$ and in turn $p_{B_1 B_2} = 0.5$.

To address contexts in which elements compare identically to alternatives but not to each other (e.g. $B_1$ and $B_2$), we introduce contractible partitions that group these similar alternatives into sets. We then show that when a PCMC model contains a contractible partition, the relative probabilities of selecting from these partitions are independent of how comparisons are made between alternatives in the same set. Our contractible partition definition can be viewed as akin to (but distinct from) nests in nested MNL models [22].

Definition 3 (Contractible Partition). A partition of $U$ into non-empty sets $A_1, \ldots, A_k$ is a contractible partition if $q_{a_i a_j} = \lambda_{ij}$ for all $a_i \in A_i$, $a_j \in A_j$, for some $\Lambda = \{\lambda_{ij}\}$, $i, j \in \{1, \ldots, k\}$.

Proposition 3. For a given $\Lambda$, let $A_1, \ldots, A_k$ be a contractible partition for two PCMC models on $U$ represented by $Q$, $Q'$ with stationary distributions $\pi$, $\pi'$. Then for any $A_i$:
$$\sum_{j \in A_i} p_{j|U} = \sum_{j \in A_i} p'_{j|U}, \qquad (1)$$
or equivalently, $\pi(A_i) = \pi'(A_i)$.

Proof. Suppose $Q$ has contractible partition $A_1, \ldots, A_k$ with respect to $\Lambda$. If we decompose the balance equations (i.e. each row of $\pi^T Q = 0$), for $x \in A_1$ WLOG we obtain:
$$\pi(x)\Big(\sum_{y \in A_1 \setminus x} q_{xy} + \sum_{i=2}^{k} \sum_{a_i \in A_i} q_{x a_i}\Big) = \sum_{y \in A_1 \setminus x} \pi(y) q_{yx} + \sum_{i=2}^{k} \sum_{a_i \in A_i} \pi(a_i) q_{a_i x}. \qquad (2)$$
Noting that for $a_i \in A_i$ and $a_j \in A_j$, $q_{a_i a_j} = \lambda_{ij}$, (2) can be rewritten as
$$\pi(x)\Big(\sum_{y \in A_1 \setminus x} q_{xy} + \sum_{i=2}^{k} |A_i| \lambda_{1i}\Big) = \sum_{y \in A_1 \setminus x} \pi(y) q_{yx} + \sum_{i=2}^{k} \pi(A_i) \lambda_{i1}.$$
Summing over $x \in A_1$ then gives
$$\sum_{x \in A_1} \pi(x) \sum_{y \in A_1 \setminus x} q_{xy} + \pi(A_1) \sum_{i=2}^{k} |A_i| \lambda_{1i} = \sum_{x \in A_1} \sum_{y \in A_1 \setminus x} \pi(y) q_{yx} + |A_1| \sum_{i=2}^{k} \pi(A_i) \lambda_{i1}.$$
The leftmost term of each side is equal, so we have
$$\pi(A_1) = \frac{|A_1| \sum_{i=2}^{k} \pi(A_i) \lambda_{i1}}{\sum_{i=2}^{k} |A_i| \lambda_{1i}}, \qquad (3)$$
which makes $\pi(A_1)$ the solution to the global balance equations of a different continuous time Markov chain with states $\{A_1, \ldots, A_k\}$, transition rate $\tilde q_{A_i A_j} = |A_j| \lambda_{ij}$ between states $A_i$ and $A_j$, and $\tilde q_{A_i A_i} = -\sum_{j \ne i} \tilde q_{A_i A_j}$. Now $q_{a_i a_j} + q_{a_j a_i} \ge 1$ implies $\lambda_{ij} + \lambda_{ji} \ge 1$. Combining this observation with $|A_i| > 0$ shows (as with the proof of Proposition 1) that this contracted chain is irreducible and thus that $\{\pi(A_i)\}_{i=1}^{k}$ are well defined. Furthermore, because the contracted rate matrix $\tilde Q$ is determined entirely by $\Lambda$ and $|A_1|, \ldots, |A_k|$, we have that $\tilde Q = \tilde Q'$, and thus that $\pi(A_i) = \pi'(A_i)$ for all $i$ regardless of how $Q$ and $Q'$ may differ, completing the proof.

The intuition is that we can contract each $A_i$ to a single "type" because the probability of choosing an element of $A_i$ is independent of the pairwise probabilities between elements within the sets.
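A quick numerical check of Proposition 3 (illustrative Python under assumed block sizes and rates, not code from the paper): two PCMC models that share the between-block rates $\Lambda$ but disagree arbitrarily on within-block comparisons assign identical total mass to each block.

```python
# Minimal sketch checking Proposition 3: block masses depend only on Lambda and block sizes.
import numpy as np

def stationary(Q_S):
    n = Q_S.shape[0]
    A = np.vstack([Q_S.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def set_diagonal(q):
    Q = q.copy(); np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# Blocks A1 = {0, 1} (e.g. B1, B2) and A2 = {2} (e.g. D), with between-block
# rates lambda_12 = 0.4 and lambda_21 = 0.6 (arbitrary values summing to 1).
lam12, lam21 = 0.4, 0.6

def build_Q(q01, q10):
    q = np.array([[0.0, q01, lam12],
                  [q10, 0.0, lam12],
                  [lam21, lam21, 0.0]])
    return set_diagonal(q)

pi  = stationary(build_Q(0.9, 0.1))    # strongly asymmetric comparison within the block
pi2 = stationary(build_Q(0.5, 0.5))    # indifference within the block
print(pi[0] + pi[1], pi2[0] + pi2[1])  # equal block mass pi(A1), as Proposition 3 states
print(pi[2], pi2[2])                   # equal pi(A2) as well
```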
The above proposition, together with the contractibility of a PCMC model on all uniformly expanded sets, implies that all PCMC models exhibit uniform expansion.

Proposition 4. Any PCMC model exhibits uniform expansion.

Proof. We translate the problem of uniform expansion into the language of contractibility. Let $U_1$ be the universe of unique items $i_{11}, i_{21}, \ldots, i_{n1}$, and let $U_k$ be a universe containing $k$ copies of each item in $U_1$. Let $i_{mj}$ denote the $j$th copy of the $m$th item in $U_1$. Thus $U_k = \cup_{m=1}^{n} \cup_{j=1}^{k} \{i_{mj}\}$. Let $Q$ be the transition rate matrix of the CTMC on $U_1$. We construct a contractible partition of $U_k$ into $n$ sets, each containing the $k$ copies of some item in $U_1$. Thus $A_m = \cup_{j=1}^{k} \{i_{mj}\}$. By the definition of copies, $\{A_m\}_{m=1}^{n}$ is a contractible partition of $U_k$ with $\Lambda = Q$. Noting that $|A_m| = k$ for all $m$ in Equation (3) above, and that uniformly scaling all transition rates by $k$ does not change a stationary distribution, results in $\{\pi(A_m)\}_{m=1}^{n}$ being the solution to $\pi^T Q = \pi^T \Lambda = 0$. Thus $p_{i_{m1}|U_1} = \pi(A_m) = \sum_{j=1}^{k} p_{i_{mj}|U_k}$ for each $m$, showing that the model exhibits uniform expansion.

We end this section by noting that every PCMC model has a trivial contractible partition into singletons. Detection and exploitation of $Q$'s non-trivial contractible partitions (or appropriately defined "nearly contractible partitions") are interesting open research directions.

4 Inference and prediction

Our ultimate goal in formulating this model is to make predictions: using past choices from diverse subsets $S \subseteq U$ to predict future choices. In this section we first give the log-likelihood function $\log L(Q; C)$ of the rate matrix $Q$ given a collection of choice data of the form $C = \{(i_k, S_k)\}_{k=1}^{n}$, where $i_k \in S_k$ was the item chosen from $S_k$. We then investigate the ability of a learned PCMC model to make choice predictions on empirical data, benchmarked against learned MNL and MMNL models, and interpret the inferred model parameters $\hat Q$.

Let $C_{i|S}(C) = |\{(i_k, S_k) \in C : i_k = i, S_k = S\}|$ denote the number of times in the data that $i$ was chosen out of set $S$, and let $C_S(C) = |\{(i_k, S_k) \in C : S_k = S\}|$ be the number of times that $S$ was the choice set, for each $S \subseteq U$.

4.1 Maximum likelihood

For each $S \subseteq U$ and $i \in S$, recall that $p_{i|S}(Q)$ is the probability that $i$ is selected from set $S$ as a function of the rate matrix $Q$. After dropping all additive constants, the log-likelihood of $Q$ given the data $C$ (derived from the probability mass function of the multinomial distribution) is:
$$\log L(Q; C) = \sum_{S \subseteq U} \sum_{i \in S} C_{i|S}(C) \log(p_{i|S}(Q)).$$
Recall that for the PCMC model, $p_{i|S}(Q) = \pi_S(i)$, where $\pi_S$ is the stationary distribution of a CTMC with rate matrix $Q_S$, i.e. $\pi_S^T Q_S = 0$ and $\sum_{i \in S} \pi_S(i) = 1$. There is no general closed-form expression for $p_{i|S}(Q)$. The implicit definition also makes it difficult to derive gradients of $\log L$ with respect to the parameters $q_{ij}$. We employ SLSQP [25] to maximize $\log L(Q; C)$, which is non-concave in general. For more information on the optimization techniques used in this section, see the Supplementary Materials.
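A minimal sketch of the inference procedure described above (Python with NumPy/SciPy). This is not the authors' released implementation; the toy data, universe size, and starting point are placeholders, and a practical implementation would add smoothing and restarts. It parameterizes the off-diagonal entries of $Q$, enforces $q_{ij} + q_{ji} \ge 1$ via inequality constraints, and maximizes the log-likelihood with SLSQP, computing each $p_{i|S}(Q)$ by solving for the stationary distribution of $Q_S$.

```python
# Minimal sketch of maximum-likelihood inference for a PCMC model via SLSQP.
import numpy as np
from scipy.optimize import minimize
from itertools import permutations

n = 4                                         # illustrative universe size
pairs = list(permutations(range(n), 2))       # index the n(n-1) off-diagonal entries

def unpack(x):
    Q = np.zeros((n, n))
    for idx, (i, j) in enumerate(pairs):
        Q[i, j] = x[idx]
    return Q

def choice_prob(Q, S, i):
    """p_{i|S}: stationary mass of i for the CTMC restricted to S."""
    S = list(S)
    Q_S = Q[np.ix_(S, S)].copy()
    np.fill_diagonal(Q_S, 0.0)
    np.fill_diagonal(Q_S, -Q_S.sum(axis=1))
    A = np.vstack([Q_S.T, np.ones(len(S))])
    b = np.zeros(len(S) + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return max(pi[S.index(i)], 1e-12)         # clip for numerical safety in the log

def neg_log_lik(x, data):
    Q = unpack(x)
    return -sum(np.log(choice_prob(Q, S, i)) for i, S in data)

# Toy choice data: (chosen item, choice set) pairs.
data = [(0, (0, 1, 2)), (0, (0, 1)), (2, (1, 2, 3)), (1, (0, 1, 3)), (3, (2, 3))]

x0 = np.full(len(pairs), 0.5)                 # start from indifference
cons = [{"type": "ineq",
         "fun": (lambda x, a=pairs.index((i, j)), b=pairs.index((j, i)):
                 x[a] + x[b] - 1.0)}
        for i, j in pairs if i < j]           # q_ij + q_ji >= 1
res = minimize(neg_log_lik, x0, args=(data,), method="SLSQP",
               bounds=[(0.0, None)] * len(pairs), constraints=cons)
Q_hat = unpack(res.x)
print(res.success, choice_prob(Q_hat, (0, 1, 2), 0))
```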
4.2 Empirical data results

We evaluate our inference procedure on two empirical choice datasets, SFwork and SFshop, collected from a survey of transportation preferences around the San Francisco Bay Area [16]. The SFshop dataset contains 3,157 observations, each consisting of a choice set of transportation alternatives available to individuals traveling to and returning from a shopping center, as well as a choice from that choice set. The SFwork dataset, meanwhile, contains 5,029 observations consisting of commuting options and the choice made on a given commute.

Basic statistics describing the choice set sizes and the number of times each pair of alternatives appears in the same choice set are given in the Supplementary Materials. Data and code are available at https://github.com/sragain/pcmc-nips. We train our model on observations $T_{\text{train}} \subseteq C$ and evaluate on a test set $T_{\text{test}} \subseteq C$ via
$$\text{Error}(T_{\text{train}}; T_{\text{test}}) = \frac{1}{|T_{\text{test}}|} \sum_{(i,S) \in T_{\text{test}}} \sum_{j \in S} \big| p_{j|S}(\hat Q(T_{\text{train}})) - p_{j|S}(T_{\text{test}}) \big|, \qquad (4)$$
where $\hat Q(T_{\text{train}})$ is the estimate for $Q$ obtained from the observations in $T_{\text{train}}$ and $p_{j|S}(T_{\text{test}}) = C_{j|S}(T_{\text{test}})/C_S(T_{\text{test}})$ is the empirical probability that $j$ was selected from $S$ among observations in $T_{\text{test}}$. Note that $\text{Error}(T_{\text{train}}; T_{\text{test}})$ is the expected $\ell_1$-norm of the difference between the empirical distribution and the inferred distribution on a choice set drawn uniformly at random from the observations in $T_{\text{test}}$. We applied small amounts of additive smoothing to each dataset.

We compare our PCMC model against both an MNL model trained using Iterative Luce Spectral Ranking (I-LSR) [21] and a more flexible MMNL model. We used a discrete mixture of $k$ MNL models (with $O(kn)$ parameters), choosing $k$ so that the MMNL model had strictly more parameters than the PCMC model on each dataset. For details on how the MMNL model was trained, see the Supplementary Materials.

Figure 2: Prediction error on a 25% holdout of the SFwork data for the PCMC, MNL, and MMNL models. PCMC sees improvements of 35.9% and 24.5% in prediction error over MNL and MMNL, respectively, when training on 75% of the data.

Figure 2 shows $\text{Error}(T_{\text{train}}; T_{\text{test}})$ on the SFwork data as the learning procedure is applied to increasing amounts of data. The results are averaged over 1,000 different permutations of the data, with a 75/25 train/test split employed for each permutation. We show the error on the testing data as we train with increasing proportions of the training data. A similar figure for the SFshop data appears in the Supplementary Materials. We see that our model is better equipped to learn from and make predictions on both datasets: when using all of the training data, we observe error reductions of 36.2% and 46.5% compared to MNL and 24.4% and 31.7% compared to MMNL on SFwork and SFshop, respectively.

Figure 2 also gives two different heat maps of $\hat Q$ for the SFwork data, showing the relative rates $\hat q_{ij}/\hat q_{ji}$ between pairs of items as well as how the total rate $\hat q_{ij} + \hat q_{ji}$ between a pair compares to the total rates between other pairs. The index ordering of each matrix follows the estimated selection probabilities of the PCMC model on the full set of alternatives for that dataset. The ordered options for SFwork are: (1) driving alone, (2) sharing a ride with one other person, (3) walking, (4) public transit, (5) biking, and (6) carpooling with at least two others. Numerical values for the entries of $\hat Q$ for both datasets appear in the Supplementary Materials.

The inferred pairwise selection probabilities are $\hat p_{ij} = \hat q_{ji}/(\hat q_{ji} + \hat q_{ij})$. Constructing a tournament graph on the alternatives with $(i, j) \in E$ if $\hat p_{ij} \ge 0.5$, cyclic triplets are then length-3 cycles in the tournament. A bound due to Harary and Moser [11] establishes that the maximum number of cyclic triplets in a tournament on $n$ nodes is 8 when $n = 6$ and 20 when $n = 8$. According to our learned model, the choices exhibit 2 out of a maximum 8 cyclic triplets in the SFwork data and 6 out of a maximum 20 cyclic triplets in the SFshop data.

Additional evaluations of predictive performance across a range of synthetic datasets appear in the Supplementary Materials.
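For completeness, a small helper (illustrative, not from the paper's repository) computing the error metric of Equation (4) from a list of test observations and any predictor that returns inferred choice probabilities; the `predict` callback and the reuse of `choice_prob`/`Q_hat` from the earlier sketch in the usage line are assumptions.

```python
# Minimal sketch of the evaluation metric in Eq. (4): the expected l1 distance between
# the empirical choice distribution on a test choice set and the inferred distribution.
from collections import Counter

def prediction_error(test_obs, predict):
    """test_obs: list of (chosen_item, choice_set) pairs, with each choice_set a tuple.
    predict(choice_set) -> dict mapping each item in the set to its inferred p_{i|S}."""
    counts = Counter(test_obs)                       # C_{i|S} on the test data
    set_counts = Counter(S for _, S in test_obs)     # C_S on the test data
    total = 0.0
    for i, S in test_obs:                            # average over test observations
        p_hat = predict(S)
        p_emp = {j: counts[(j, S)] / set_counts[S] for j in S}
        total += sum(abs(p_hat[j] - p_emp[j]) for j in S)
    return total / len(test_obs)

# Usage with the SLSQP sketch above:
# prediction_error(test, lambda S: {j: choice_prob(Q_hat, S, j) for j in S})
```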
The majority of datasets in the literature on discrete choice focus on pairwise comparisons or ranked lists, where lists inherently assume transitivity and the independence of irrelevant alternatives. The SFwork and SFshop datasets are rare examples of public datasets that genuinely study choices from sets larger than pairs.

5 Conclusion

We introduce a Pairwise Choice Markov Chain (PCMC) model of discrete choice which defines selection probabilities according to the stationary distributions of continuous time Markov chains on the alternatives. The model's parameters are the transition rates between pairs of alternatives. In general the PCMC model is not a random utility model (RUM), and it maintains broad flexibility by eschewing the implications of Luce's choice axiom, stochastic transitivity, and regularity.

Despite this flexibility, we demonstrate that the PCMC model exhibits desirable structure by fulfilling uniform expansion, a property previously found only in the Multinomial Logit (MNL) model and the intractable Elimination by Aspects model. We also introduce the notion of contractibility, a property motivated by thought experiments instrumental in moving choice theory beyond the choice axiom, of which Yellott's axiom of uniform expansion is a special case. Our work demonstrates that the PCMC model exhibits contractibility, which implies uniform expansion. We also show that the PCMC model offers straightforward inference through maximum likelihood estimation, and that a learned PCMC model predicts empirical choice data with significantly higher fidelity than both MNL and MMNL models.

The flexibility and tractability of the PCMC model opens up many compelling research directions. First, what necessary and sufficient conditions on the matrix $Q$ guarantee that a PCMC model is a RUM [10]? The efficacy of the PCMC model also suggests exploring other effective parameterizations for $Q$, including developing inferential methods which exploit contractibility. There are also open computational questions, such as streamlining the likelihood maximization using gradients of the implicit function definitions. Very recently, learning results for nested MNL models have shown favorable query complexity under an oracle model [2], and a comparison of our PCMC model with these approaches to learning nested MNL models is important future work.

Acknowledgements. This work was supported in part by a David Morgenthaler II Faculty Fellowship and a Dantzig-Lieberman Operations Research Fellowship.

References

[1] E. Adams and S. Messick. An axiomatic formulation and generalization of successive intervals scaling. Psychometrika, 23(4):355-368, 1958.
[2] A. R. Benson, R. Kumar, and A. Tomkins. On the relevance of irrelevant alternatives. In WWW, 2016.
[3] J. Blanchet, G. Gallego, and V. Goyal. A Markov chain approximation to choice modeling. In EC, 2013.
[4] H. D. Block and J. Marschak. Random orderings and stochastic theories of responses. Contributions to Probability and Statistics, 2:97-132, 1960.
[5] A. Börsch-Supan. On the compatibility of nested logit models with utility maximization. Journal of Econometrics, 43(3):373-388, 1990.
[6] J. H. Boyd and R. E. Mellman. The effect of fuel economy standards on the US automotive market: an hedonic demand analysis. Transportation Research Part A: General, 14(5-6):367-378, 1980.
[7] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: the method of paired comparisons. Biometrika, 39(3-4):324-345, 1952.
[8] S. Chen and T. Joachims. Modeling intransitivity in matchup and comparison data. In WSDM, 2016.
[9] G. Debreu. Review of Individual Choice Behavior: A Theoretical Analysis. American Economic Review, 1960.
[10] J.-C. Falmagne. A representation theorem for finite random scale systems. J. Math. Psych., 18(1):52-72, 1978.
[11] F. Harary and L. Moser. The theory of round robin tournaments. The American Mathematical Monthly, 73(3):231-246, 1966.
[12] J. Huber, J. W. Payne, and C. Puto. Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, pages 90-98, 1982.
[13] S. Ieong, N. Mishra, and O. Sheffet. Predicting preference flips in commerce search. In ICML, 2012.
[14] J. Kleinberg and M. Raghu. Team performance with test scores. In EC, pages 511-528, 2015.
[15] R. Kohli and K. Jedidi. Error theory for elimination by aspects. Operations Research, 63(3):512-526, 2015.
[16] F. S. Koppelman and C. Bhat. A self instructing course in mode choice modeling: multinomial and nested logit models. US Department of Transportation, Federal Transit Administration, 31, 2006.
[17] R. Kumar, A. Tomkins, S. Vassilvitskii, and E. Vee. Inverting a steady-state. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 359-368. ACM, 2015.
[18] R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
[19] R. D. Luce. The choice axiom after twenty years. J. Math. Psych., 15(3):215-233, 1977.
[20] C. F. Manski. The structure of random utility models. Theory and Decision, 8(3):229-254, 1977.
[21] L. Maystre and M. Grossglauser. Fast and accurate inference of Plackett-Luce models. In NIPS, 2015.
[22] D. McFadden. Econometric models for probabilistic choice among products. Journal of Business, pages S13-S29, 1980.
[23] D. McFadden, K. Train, et al. Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447-470, 2000.
[24] S. Negahban, S. Oh, and D. Shah. Rank centrality: Ranking from pair-wise comparisons. arXiv preprint arXiv:1209.1688v4, 2015.
[25] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
[26] I. Simonson and A. Tversky. Choice in context: Tradeoff contrast and extremeness aversion. Journal of Marketing Research, 29(3):281, 1992.
[27] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
[28] J. S. Trueblood, S. D. Brown, A. Heathcote, and J. R. Busemeyer. Not just for consumers: context effects are fundamental to decision making. Psychological Science, 24(6):901-908, 2013.
[29] A. Tversky. Elimination by aspects: A theory of choice. Psychological Review, 79(4):281, 1972.
[30] J. I. Yellott. The relationship between Luce's choice axiom, Thurstone's theory of comparative judgment, and the double exponential distribution. J. Math. Psych., 15(2):109-144, 1977.