# Discovering Context Effects from Raw Choice Data

Arjun Seshadri 1 Alexander Peysakhovich 2 Johan Ugander 1

Many applications in preference learning assume that decisions come from the maximization of a stable utility function. Yet a large experimental literature shows that individual choices and judgements can be affected by irrelevant aspects of the context in which they are made. An important class of such contexts is the composition of the choice set. In this work, our goal is to discover such choice set effects from raw choice data. We introduce an extension of the Multinomial Logit (MNL) model, called the context dependent random utility model (CDM), which allows for a particular class of choice set effects. We show that the CDM can be thought of as a second-order approximation to a general choice system, can be inferred optimally using maximum likelihood and, importantly, is easily interpretable. We apply the CDM to both real and simulated choice data to perform principled exploratory analyses for the presence of choice set effects.

1. Introduction

Modeling individual choice is an important component of recommender systems (Resnick and Varian, 1997), search engine ranking (Schapire et al., 1998), analysis of auctions (Athey and Levin, 2001), marketing (Allenby and Rossi, 1998), and demand modeling in diverse domains (Berry et al., 1995; Bruch et al., 2016). The workhorse models used either implicitly or explicitly in these disparate literatures are random utility models (RUMs) (Manski, 1977), which assume that individuals have a numeric utility for each item and that they make choices that maximize noisy observations of these utilities (Luce, 1959; McFadden, 1980; Kreps, 1988).

1Stanford University, Stanford, CA 2Facebook Artificial Intelligence Research, New York, NY. Correspondence to: Arjun Seshadri, Alexander Peysakhovich, Johan Ugander.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

The most well-known and widely used RUM is the conditional multinomial logit (MNL), also called the Luce model, which is the unique RUM that satisfies the axiom known as the independence of irrelevant alternatives (IIA) (Luce, 1959). Informally, this axiom states that adding an item to a choice set does not change the relative probabilities of choosing the other items. This assumption is very strong, but it allows analysts to build powerful and interpretable models. However, if one assumes IIA but it is not actually true, predictions for out-of-sample choices could be very wrong. Thus, it is important for analysts to discover whether IIA is approximately true in a given dataset. At the same time, there is a large amount of experimental evidence showing significant deviations from rational choice across many domains. In particular, the value assigned to an item can strongly depend on the irrelevant elements of the context of the choice (Tversky, 1972; Tversky and Simonson, 1993). Attempts to model these context effects in a domain-free manner fall short of being practically valuable, either due to large parameter requirements, inferential intractability, or both (Park and Choi, 2013). Our contribution addresses both issues. We consider the IIA-satisfying MNL model and make small modifications to subsume a class of IIA violations that we believe are important in practice, while retaining parametric and inferential efficiency. We refer to this model as the context dependent random utility model (CDM). The CDM can be thought of as a second-order approximation of a general choice system (the MNL model, meanwhile, corresponds to a first-order approximation).
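The MNL choice rule and the IIA property it implies can be sketched in a few lines of Python; the item utilities below are hypothetical, chosen only for illustration:

```python
import math

def mnl_prob(x, C, v):
    """P(x | C) under MNL: a softmax of fixed item utilities v over the choice set C."""
    return math.exp(v[x]) / sum(math.exp(v[y]) for y in C)

v = {"a": 1.0, "b": 0.5, "c": -0.2}  # hypothetical item utilities

# IIA: the odds of a vs. b are unchanged when c is added to the choice set.
odds_pair = mnl_prob("a", ["a", "b"], v) / mnl_prob("b", ["a", "b"], v)
odds_trip = mnl_prob("a", ["a", "b", "c"], v) / mnl_prob("b", ["a", "b", "c"], v)
assert abs(odds_pair - odds_trip) < 1e-12
```

Both ratios equal exp(v(a) - v(b)), so no third alternative can shift them; this invariance is exactly the restriction that the CDM relaxes.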
Because the CDM nests MNL, it can fit data that does satisfy IIA just as well, and, importantly, can be used to construct a nested-model hypothesis test for whether a particular dataset is consistent with IIA. The key assumption of the CDM is that IIA violations come from pairwise interactions between items in the choice set and that larger choice set effects can be approximated additively using all pairwise effects. This assumption means that the CDM has many fewer parameters than a general choice system. We can further reduce the CDM's data dependence by assuming that these underlying effects can be well modeled by latent vectors of a smaller dimensionality than the number of items, resulting in what we call the low-rank CDM. The low-rank CDM can be useful in applications where the number of items is relatively large and where seeing all possible comparisons may be extremely costly. For a theoretical contribution, we furnish formal results for conditions under which the parameters of a CDM can or cannot be identified from data. In situations where identifiability is not achieved, we advocate for additive ℓ2 regularization of the log-likelihood to select the minimum norm solution. We also provide finite sample convergence guarantees for the expected squared ℓ2 error of the estimate as a function of comparison structure. For an applied contribution, we first test the CDM on synthetic data and show that a nested model likelihood ratio test between the CDM and MNL models has good finite sample properties. When IIA holds, a p < .05 hypothesis test rejects the null slightly less than 5% of the time. When IIA does not hold, the null hypothesis is overwhelmingly rejected even in medium-sized datasets. By contrast, we see that using a nested model test based on the nested structure of a general choice system and MNL model gives a test that wildly over-rejects the null, even when IIA is true.
We apply the CDM to several real-world datasets. First, we show that we can strongly reject IIA in the popular SFwork and SFshop choice datasets. Second, we consider using the CDM to model choices in the task of Heikinheimo and Ukkonen (2013), where individuals are presented with triplets of items and asked which item is least like the other two. Here the CDM can capture the underlying choice structure quite well while IIA is an extremely unreasonable assumption.

1.1. Related Work

The CDM vaguely resembles the continuous bag of words (CBOW) model popularized by word2vec (Mikolov et al., 2013). However, while the CBOW model predicts a word from the whole vocabulary, the CDM models choices from arbitrary subsets. Related work shows that word2vec-style models can be used to model complements and substitutes in supermarket shopping data (Ruiz et al., 2017), a setting where the target and context embedding each play a unique role (Rudolph et al., 2016). Utility models with a contextual component are widely used to analyze intertemporal choice (discount functions) (Muraven and Baumeister, 2000; Fudenberg and Levine, 2012), choice under uncertainty (Bordalo et al., 2012; Fox and Tversky, 1995), and choices about cooperation (List, 2007; Liberman et al., 2004; Peysakhovich and Rand, 2015). They are also workhorses in modeling consumer behavior in applied settings. Online recommender systems (Resnick and Varian, 1997), which model user-item interactions as inner products of low-rank vectors, can be seen as employing a utility function that is an inner product between item attributes and user weights. The CDM and low-rank CDM generalize a number of prominent choice models in a unified framework. In a later section, after we introduce the basic mathematical notation, we present connections to the work of Tversky and Simonson (1993), Batsell and Polking (1985), and Chen and Joachims (2016a;b).
Importantly, this means our convergence and identifiability results carry over to these other models, which all previously lacked such results.

2. Modeling Choice Systems

Let X be a finite set of n alternatives that we hold fixed and let $\mathcal{C} = \{C : C \subseteq X, |C| \ge 2\}$ be the set of all subsets of X of size greater than or equal to two. Throughout this work, we assume there is a single individual that is presented with choice sets and chooses a single item from each choice set. In this setting, the fundamental object of study is a choice system, a collection of probability distributions for every $C \in \mathcal{C}$, describing the probability that an item x is chosen from C, $x \in C$. We denote each such probability by P(x | C). In general, a choice system can model arbitrary preferences on arbitrary subsets with no further restrictions. The most commonly assumed restriction on choice systems is that they satisfy the independence of irrelevant alternatives (IIA), which can be stated as follows.

Assumption 1. A choice system on X satisfies the independence of irrelevant alternatives (IIA) if for any $x, y \in X$ and choice sets $A, B \subseteq X$ with $x, y \in A, B$ we have
$$\frac{P(x \mid A)}{P(y \mid A)} = \frac{P(x \mid B)}{P(y \mid B)}.$$

In other words, IIA states that the composition of a choice set does not affect the relative attractiveness of items. A main question in this work will be, given a dataset D of choices from choice sets, can we determine whether D was generated by a model satisfying IIA or a model of a more general choice system? If not generated by a model satisfying IIA, is it possible to define tractable model classes between the class of models satisfying IIA and the class of fully general choice systems? Our answer to this question is a formal truncation of a general choice system that we call the context dependent random utility model (CDM).

2.1.
Context-dependent Random Utility Models

A trivial model of a general choice system is the universal logit model (McFadden et al., 1977), which simply parameterizes the choice system object, defining utilities u(x | C), $x \in C$, for each $C \in \mathcal{C}$ that can vary arbitrarily for every item, for every set. The choice probabilities for a universal logit model are then:
$$P(x \mid C) = \frac{\exp(u(x \mid C))}{\sum_{y \in C} \exp(u(y \mid C))}.$$
The above model exhibits scale-invariance on each subset C, and thus we require that $\sum_{y \in C} u(y \mid C) = 0$ for all $C \in \mathcal{C}$ for the purposes of identifiability. While relatively uninteresting as a model, the above formulation is the starting point for the following observation about choice systems, first documented by Batsell and Polking (1985).

Lemma 1. The utilities in the universal logit model, u(x | C), $C \in \mathcal{C}$, $x \in C$, can be uniquely mapped as
$$u(x \mid C) = \sum_{B \subseteq C \setminus x} v(x \mid B),$$
where v(x | B) are values that satisfy the constraints $\sum_{x \notin B} v(x \mid B) = 0$ for all $B \subset X$.

For greater clarity, we expand out the terms individually,
$$u(x \mid C) = \underbrace{v(x)}_{\text{1st order}} + \underbrace{\sum_{y \in C \setminus x} v(x \mid \{y\})}_{\text{2nd order}} + \underbrace{\sum_{\{y,z\} \subseteq C \setminus x} v(x \mid \{y,z\})}_{\text{3rd order}} + \dots + \underbrace{v(x \mid C \setminus \{x\})}_{|C|\text{th order}},$$
and expand out the first three sets of constraints:
$$\sum_{x \in X} v(x) = 0, \qquad \sum_{x \in X \setminus y} v(x \mid \{y\}) = 0, \qquad \sum_{x \in X \setminus \{y,z\}} v(x \mid \{y,z\}) = 0.$$
We use $v(x) = v(x \mid \emptyset)$ for simplicity. This expansion reveals that arbitrary contextual utilities can be decomposed into intuitive contributions: the first order terms represent the item's intrinsic contribution to the utility, the second order terms represent the contextual contributions from every other item in the choice set, the third order terms the contributions from contextual pairs not modeled by contributions of the pairs' constituent items, and so on. The expansion invites one to consider a hierarchy of choice model classes indexed by their order1: the pth order model refers to forcing all terms of order greater than p to zero.
Denote this class of choice system models by $\mathcal{M}_p$. Clearly, we have $\mathcal{M}_1 \subseteq \mathcal{M}_2 \subseteq \dots \subseteq \mathcal{M}_{n-1}$, where $\mathcal{M}_{n-1}$ is the universal logit model. We next consider two exercises. First, we write out the 1st order model explicitly:
$$P(x \mid C) = \frac{\exp(v(x))}{\sum_{y \in C} \exp(v(y))}, \qquad \sum_{x \in X} v(x) = 0.$$
(1This expansion differs from the one used by Batsell and Polking (1985), which expands the log probability ratios of items being chosen instead of the underlying contextual utilities.)

This model $\mathcal{M}_1$ is the multinomial logit model, the workhorse model of discrete choice (Luce, 1959; McFadden, 1980). Moving to higher-order models, a counting exercise reveals that the number of free parameters in the pth order model is
$$\sum_{j=0}^{p-1} \binom{n}{j}(n - j - 1),$$
which simplifies to $n - 1$ and $(n-2)2^{n-1} + 1$ parameters for $\mathcal{M}_1$ (MNL) and $\mathcal{M}_{n-1}$ (universal logit), respectively. Clearly, the parameters grow polynomially in the number of items and exponentially in the order: there are $O(n)$ parameters for the 1st order model, $O(n^2)$ parameters for the 2nd, and so on. The principled next step, then, is to consider the minimal model class that accounts for context effects, $\mathcal{M}_2$:
$$P(x \mid C) = \frac{\exp\big(v(x) + \sum_{z \in C \setminus x} v(x \mid \{z\})\big)}{\sum_{y \in C} \exp\big(v(y) + \sum_{z \in C \setminus y} v(y \mid \{z\})\big)}, \qquad \sum_{x \in X} v(x) = 0, \quad \sum_{x \in X \setminus y} v(x \mid \{y\}) = 0 \;\; \forall y \in X.$$
We remove most constraints (see Appendix C.1 for details) by introducing new parameters $u_{xz} = v(x \mid \{z\}) - v(z)$, $x, z \in X$, interpretable as the pairwise push and pull of z on x's utility. We may then rewrite the above, for all $C \subseteq X$ and $x \in C$, as
$$P(x \mid C) = \frac{\exp(\sum_{z \in C \setminus x} u_{xz})}{\sum_{y \in C} \exp(\sum_{z \in C \setminus y} u_{yz})}, \tag{1}$$
with one constraint, $\sum_{x \in X} \sum_{y \in X \setminus x} u_{xy} = 0$. We may then do away with this final constraint by noticing that P(x | C) is invariant to a constant shift of $u_{xz}$, as in the case of both the MNL and universal logit models. We refer to the model $\mathcal{M}_2$, as parameterized in equation (1), as the context dependent random utility model (CDM), and note that it has $n(n-1) - 1$ free parameters. The CDM then corresponds to the following restriction on choice systems.
Assumption 2 (Pairwise linear dependence of irrelevant alternatives). A choice system on X satisfies pairwise linear dependence of irrelevant alternatives if, in the universal logit representation of Lemma 1, v(x | B) = 0 for all $B \subseteq X$ for which $|B| \ge 2$.

This assumption can either be taken literally, or can be justified as an approximation on the grounds of applications: in practice many problems are concerned with choices from relatively small sets, and the linear context effect assumption is then a decent approximation.

2.2. Low-rank CDMs

From equation (1), it is clear that the parameters of the CDM, $u_{xz}$, $x, z \in X$, have a matrix-like structure. Note that the parameters do not quite form a matrix, as the diagonal elements $u_{xx}$ are undefined and unused. But given this structure, it is natural to ask if the pairwise contextual utilities can be modeled by a lower-dimensional parameterization. Formally, we define the low-rank CDM as a CDM where the pairwise contextual utilities jointly admit a low-rank factorization $u_{xz} = c_z^T t_x$, $x, z \in X$. We call $t_x, c_x \in \mathbb{R}^r$ the target and context vectors, respectively, for each item $x \in X$. We can then write the choice probabilities of the low-rank CDM, for all $C \subseteq X$ and all $x \in C$, as:
$$P(x \mid C) = \frac{\exp\big((\sum_{z \in C \setminus x} c_z)^T t_x\big)}{\sum_{y \in C} \exp\big((\sum_{z \in C \setminus y} c_z)^T t_y\big)}. \tag{2}$$
The rank-r CDM then has 2nr parameters and has at most $\min\{(2n - r)r, n(n-1) - 1\}$ degrees of freedom. Our low-rank assumption is strongly related to standard additive utility models where one is given a low-dimensional featurization $x \in \mathbb{R}^r$ of each item and an individual's utility is $\beta^T x$. A difference here, other than the notion of contextual utility, is that we assume no featurization is available and that it must be learned.

3. Identifiability and Estimation of the CDM

Consider a dataset D of choices with generic element (x, C) that corresponds to observing element x being chosen from set C. Let $\mathcal{C}_D$ denote the collection of unique subsets of X represented in D.
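Before turning to identifiability, equations (1) and (2) can be sketched concretely. The block below implements the full-rank CDM choice rule and its low-rank counterpart with hypothetical rank-2 target/context vectors, and checks that the factorized model agrees with the full model built from $u_{xz} = c_z^T t_x$:

```python
import math

def cdm_prob(x, C, u):
    """Equation (1): P(x | C) with pairwise contextual utilities u[(x, z)]."""
    score = {y: math.exp(sum(u[(y, z)] for z in C if z != y)) for y in C}
    return score[x] / sum(score.values())

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def low_rank_cdm_prob(x, C, t, c):
    """Equation (2): P(x | C) with u_xz = c_z . t_x for target/context vectors."""
    def score(y):
        ctx = [sum(c[z][k] for z in C if z != y) for k in range(len(t[y]))]
        return math.exp(dot(ctx, t[y]))
    return score(x) / sum(score(y) for y in C)

# Hypothetical rank-2 embeddings for three items.
t = {"a": [0.5, -0.2], "b": [0.1, 0.3], "c": [-0.4, 0.6]}
c = {"a": [0.2, 0.7], "b": [-0.6, 0.1], "c": [0.3, -0.5]}
u = {(x, z): dot(c[z], t[x]) for x in t for z in t if x != z}

C = ["a", "b", "c"]
for x in C:
    assert abs(cdm_prob(x, C, u) - low_rank_cdm_prob(x, C, t, c)) < 1e-12
assert abs(sum(cdm_prob(x, C, u) for x in C) - 1.0) < 1e-12
```

The agreement holds because the score in equation (2) distributes the inner product over the summed context vectors, recovering exactly the sum of pairwise utilities in equation (1).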
If we assume that the data was generated by a CDM, it is important to understand conditions under which the parameters of that CDM are identifiable and conditions under which the expected error of a tractable estimation procedure converges to zero as the dataset gets large. In this section we furnish two sufficient conditions and one insufficient condition for identifiability. We then bound the expected squared ℓ2 error of the maximum likelihood estimate (MLE) of a full-rank CDM. Because the log-likelihood of the full-rank CDM is convex (by the convexity of log-sum-exp), we know we can efficiently find this MLE, and that this bound on the error of the full-rank model also bounds the error of any low-rank model. We consider the dataset D as being generated in the following hierarchical manner:

1. A choice set A is chosen at random from a distribution on the set of all subsets of X.
2. The chooser chooses an item x from the choice set A according to a CDM with parameters $\theta \in \Theta$.

We can parametrize the utility function by θ, referring to it as $u_\theta(\cdot \mid \cdot)$. Given a dataset D and a guess θ, we can write the probability of (x, A) as
$$P_\theta(x \mid A) = \frac{\exp(u_\theta(x \mid A))}{\sum_{y \in A} \exp(u_\theta(y \mid A))}.$$
This means we have a well defined likelihood function for the full dataset,
$$L(D \mid \theta) = \prod_{(x,A) \in D} P_\theta(x \mid A). \tag{3}$$
For now we consider a full-rank CDM where the parameter vector θ is the set of pairwise contextual utilities $u_{xz}$, $x, z \in X$. We will consider $u \in \mathbb{R}^d$ as the parameter vector, where for the full-rank CDM $d = n(n-1) - 1$. Because u can only be identified up to a shift, we consider possible CDMs with the constraint that $\sum_{x,z} u_{xz} = 0$. The likelihood (3) can be maximized using standard techniques. We will say that a dataset identifies a CDM if there are no two distinct sets of parameters that induce the same distributions P(x | C), $C \in \mathcal{C}$, $x \in C$. We now give a sufficient (but not necessary) condition for identification.

Theorem 1.
A CDM is identifiable from a dataset D if $\mathcal{C}_D$ contains comparisons over all choice sets of two sizes k, k′, where at least one of k, k′ is not 2 or n.

The proof is given in Appendix A. In the multinomial logit model, the constraints of IIA allow us to identify all parameters given just the probability distribution P(· | X), but in the less constrained CDM more information is needed. For a simple demonstration of the theorem, consider a choice system on X = {a, b, c} where P(a | X) = 0.8, P(b | X) = 0.1, P(c | X) = 0.1. Here, if we assume IIA we can infer any pairwise choice probability simply by taking the appropriate ratio. However, if we do not assume IIA and only assume Assumption 2 (equivalent to assuming the choice system is a CDM), any set of pairwise probabilities is consistent with what we've observed. Thus the CDM is not identified if we only receive data about choices from {a, b, c}. Moreover, because the CDM can fit any pairwise probability while retaining the above choice probabilities over {a, b, c}, the CDM clearly violates IIA. We elaborate further in Appendix C.2, showing specific CDM parameters handling violations of IIA, and further discuss the various context effects (e.g., the compromise effect) the CDM can discover and accommodate. In Appendix A we show that the identifiability of the full-rank CDM for a given dataset D is equivalent to testing the rank of an integer design matrix G(D) constructed from the dataset (Theorem 4). This characterization of the identifiability of the full-rank CDM also gives a sufficient condition for the identifiability of low-rank CDMs. The proof of this theorem can be easily expanded to demonstrate an advantage of the CDM over a general choice system: for a general choice system to be identified, $\mathcal{C}_D$ would need to include in its support every choice set.
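A concrete sketch of why data from a single choice-set size can fail to pin down the parameters: when only pairs are observed, P(x | {x, y}) depends on the parameters only through the difference $u_{xy} - u_{yx}$, so shifting both entries of a pair by a common constant yields an observationally equivalent CDM. The parameter values below are hypothetical.

```python
import math

def cdm_pair_prob(x, y, u):
    """CDM restricted to a pair {x, y}: depends only on u[(x,y)] - u[(y,x)]."""
    sxy, syx = math.exp(u[(x, y)]), math.exp(u[(y, x)])
    return sxy / (sxy + syx)

u1 = {("a", "b"): 0.7, ("b", "a"): -0.2}
u2 = {("a", "b"): 0.7 + 5.0, ("b", "a"): -0.2 + 5.0}  # common shift on the pair

# Identical pairwise choice probabilities: pairs-only data cannot tell them apart.
assert abs(cdm_pair_prob("a", "b", u1) - cdm_pair_prob("a", "b", u2)) < 1e-12
```

Larger choice sets break this invariance, which is why mixing set sizes restores identifiability.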
In addition to the above sufficient conditions for identifiability, we also have the following result about an important insufficient condition.

Theorem 2. No rank-r CDM, $1 \le r \le n$, is identifiable from a dataset D if $\mathcal{C}_D$ contains only choices from sets of a single size.

The proof is given in Appendix A. Requiring comparisons over two different choice set sizes is not unique to the CDM; recent results (Chierichetti et al., 2018) demonstrate that even a uniform mixture of two multinomial logit models, a special case of the mixed logit that violates IIA, requires comparisons over two different choice set sizes. As a result of this theorem, for choice data collected from sets of a fixed size, the parameters of a CDM model that has been fit to data cannot be interpreted without some amount of explicit or implicit regularization. This non-identifiability also applies to all blade-chest models (Chen and Joachims, 2016a), which (as alluded to in Section 3.3) are CDM models restricted to pairwise choices. We further explore regularization and identifiability in Appendix C.3.

3.1. Uniform Performance Guarantees

The likelihood function is log-concave and can thus be maximized to arbitrary precision through standard convex optimization procedures (avoiding shift invariance with the constraint $\sum_{x,z} u_{xz} = 0$). We now show that maximum likelihood estimation efficiently recovers the true CDM parameters under mild regularity conditions.

Theorem 3. Let u denote the true CDM model from which data is drawn, and let $\hat{u}_{\mathrm{MLE}}$ denote the maximum likelihood solution. Assume $\mathcal{C}_D$ identifies the CDM. For any $u \in U_B = \{u \in \mathbb{R}^d : \|u\|_\infty \le B, \mathbf{1}^T u = 0\}$, and expectation taken over the dataset D generated by the CDM model,
$$\mathbb{E}\,\|\hat{u}_{\mathrm{MLE}}(D) - u\|_2^2 \le c_{B,k_{\max}}\,\frac{d-1}{m},$$
where m is the number of observations in D, $k_{\max}$ refers to the maximum choice set size in the dataset, and $c_{B,k_{\max}}$ is a constant that depends on B, $k_{\max}$, and the spectrum of the design matrix G(D).
The proof is given in Appendix B, where we also state the exact relationship of $c_{B,k_{\max}}$ to the max norm radius B, the maximum choice set size $k_{\max}$, and the design matrix G(D). Both the identifiability condition and the maximum norm bound are essential to the statement, as the right hand side diverges when the former is violated, and diverges as $B \to \infty$. Theorem 3 is a generalization of a similar convergence result previously shown for the multinomial logit case (Shah et al., 2016) (the multinomial logit model class is a subset of the CDM model class). The proof follows the same steps, showing first that the objective satisfies a notion of strong convexity, and using that fact to bound the distance between the estimate and the true value. Our contributions augment the notation of Shah et al. (2016) to support multiple set sizes and the more complex structure of the CDM, and carefully bound the role of these deviations in the steps leading to the result. To our knowledge, this convergence bound furnishes the first tractable sample complexity result for a model that can accommodate deviations from a random utility model (RUM). A comparable lower bound, which we do not furnish in this work, would make clear whether the maximum likelihood procedure is inferentially optimal or not. And while stated for the full-rank model, our convergence bound holds for CDMs of any rank. It is possible that low-rank CDMs admit an improved rank-dependent convergence rate.

3.2. Testing

We can use the CDM to construct a statistical test of whether our data is indeed consistent with the MNL/Luce model, and thus IIA, across the choice sets we observe. Recall that the class of Luce models is nested within the CDM, which is in turn nested within the universal logit, as discussed in Section 2. We can consider the following likelihood ratio statistic,
$$\Lambda(D) = \frac{\sup_{\theta \in \Theta_{\mathrm{Luce}}} L(D \mid \theta)}{\sup_{\theta \in \Theta_{\mathrm{CDM}}} L(D \mid \theta)},$$
where $\Theta_{\mathrm{Luce}}$ and $\Theta_{\mathrm{CDM}}$ respectively refer to the parameter classes of the Luce and CDM models (with $\Theta_{\mathrm{Luce}} \subseteq \Theta_{\mathrm{CDM}}$).
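As a sketch with hypothetical maximized log-likelihood values, the statistic $-2\log\Lambda(D)$ and the degrees-of-freedom counts for the two nested comparisons can be computed as follows (the parameter counts follow from the model sizes derived in Section 2):

```python
from itertools import combinations

def lrt_statistic(loglik_null, loglik_alt):
    """-2 log Lambda for nested models, from maximized log-likelihoods."""
    return -2.0 * (loglik_null - loglik_alt)

def df_cdm_vs_luce(n):
    # (n(n-1) - 1) CDM parameters minus (n - 1) Luce parameters = n(n - 2).
    return n * (n - 2)

def df_universal_vs_luce(n, unique_sets):
    # Universal logit has |C| - 1 free parameters per observed set.
    return sum(len(C) - 1 for C in unique_sets) - (n - 1)

n = 6
# Hypothetical maximized log-likelihoods for Luce (null) and CDM (alternative).
stat = lrt_statistic(-1423.7, -1391.2)
assert stat > 0

assert df_cdm_vs_luce(n) == 24
# If every pair and triple from a 6-item universe appears in the data:
sets = list(combinations(range(n), 2)) + list(combinations(range(n), 3))
assert df_universal_vs_luce(n, sets) == 50
```

The statistic would then be compared against the $\chi^2$ quantile at the desired significance level with the corresponding degrees of freedom.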
We then appeal to a classical result from asymptotic statistics (Wilks, 1938): as the sample size $m \to \infty$, the statistic $-2\log(\Lambda(D))$ converges in distribution to the $\chi^2$ distribution with degrees of freedom Δ equal to the difference between the number of parameters of the two model classes. For CDM and Luce, $\Delta = n(n-2)$. For universal logit and Luce, $\Delta = \big(\sum_{C \in \mathcal{C}_D} (|C| - 1)\big) - (n - 1)$, where $\mathcal{C}_D$ are the unique subsets in the dataset that the test can reasonably evaluate. Our test then compares the statistic to the value of the $\chi^2$ distribution corresponding to a desired level of statistical significance. We are keen to note that the CDM test likely enjoys finite sample guarantees when the true distribution is sampled from a CDM, owing to the vanishing risk of the MLE shown in Theorem 3. In the experiments that follow, we look at the finite sample performance of this likelihood ratio test, evaluating this claim empirically and comparing the performance of our test to the universal logit test.

3.3. Unifying Existing Choice Models

The CDM and low-rank CDM generalize a number of prominent choice models in a unified framework. In this section we present connections to the work of Tversky and Simonson (1993), Batsell and Polking (1985), and Chen and Joachims (2016a;b). This means that our convergence and identifiability results carry over to these other models, which all previously lacked such results.

The Tversky-Simonson model. The additive separable utility model (ASM) is the cornerstone of random utility modeling in many applications. In the ASM the utility of item x can be written as an inner product $u(x) = w^T t_x$, where $t_x$ is a feature vector of item x (typically known to the analyst, but sometimes latent) and the vector w contains the parameters of the linear model (estimated from data). The parameters w have a real-world interpretation: they are the weights that an individual places on each attribute.
The ASM assumes that the weights are constant across contexts, making framing effects such as those observed in (Tversky and Kahneman, 1985) impossible if the ASM is indeed the true model. Tversky and Simonson (1993) expand the ASM to allow context to adjust the weights that individuals place on attributes while keeping the attributes fixed. This approach has a particular psychological interpretation: the presence of certain items makes some dimensions of a choice more salient than others, an effect that appears across a variety of decision situations. This is formalized by setting the utility of x in context C to be $u_{TS}(x \mid C) = w(C)^T t_x$. Tversky and Simonson discuss several ways in which some experimental results can be modeled using various forms of weights w(C), though their approach requires both features and context-dependent weight functions to be hand-engineered. Thus our CDM can be seen as a method for learning the parameters of a Tversky-Simonson model directly from data in an efficient manner.

The Batsell-Polking model. Batsell and Polking (1985) introduce a model of competing product market shares that can also be written as a truncated expansion of the log ratio of choice probabilities. The CDM can be viewed as an alternative parameterization of a third-order Batsell-Polking model. There are several significant differences between the way Batsell and Polking viewed their third-order model and how we view the CDM. First, Batsell and Polking advocated for fitting their models to data using a hand-tuned least squares procedure whereas we use more general maximum likelihood techniques. Second, our identifiability and convergence results are entirely new; their least-squares procedure understandably has no analogous guarantees. Lastly, our restriction to low-rank parameterizations is new and can greatly reduce the model complexity.

The Blade-Chest model.
Standard models for competition build on the Elo rating system for chess (Elo, 1978) and the TrueSkill rating system for online gaming (Herbrich et al., 2006). Both of these models assume that individuals have a one-dimensional latent skill parameter that can be discovered from matchup data between competitors. The Blade-Chest model (Chen and Joachims, 2016a;b) tries to model rock-paper-scissors-type intransitivities in pairwise matchups through a multidimensional latent embedding of skill. In the language of our CDM, the blade-chest inner product model (the authors also consider a distance model) defines the probability that x beats y as:
$$\Pr(x \mid \{x, y\}) = \frac{\exp(t_x^T c_y)}{\exp(t_x^T c_y) + \exp(t_y^T c_x)},$$
which is precisely a CDM restricted to pairs. We can view the CDM as a natural extension of the Blade-Chest model from pairs to larger sets. Considering our negative identifiability result for choice data consisting of only a single set size (Theorem 2), we conclude that the Blade-Chest model is not identifiable and requires either explicit or implicit regularization in order to make the parameters interpretable.

4. Experiments

We now evaluate the CDM and low-rank CDM on data. Our evaluation includes comparisons with MNL/Luce models and mixed MNL models (McFadden and Train, 2000). MNL and CDM model likelihoods are optimized using Adam (Kingma and Ba, 2014), a stochastic gradient descent algorithm with adaptive moment estimation. Mixed MNL likelihoods are optimized using open source code from Ragain and Ugander (2016). The CDM parameter optimization is initialized with values corresponding to a Luce MLE for that dataset. All datasets are pre-existing and public; replication code for all figures will be released at publication time.

Simulated Data. We begin with simulated data, which allows us to validate our theoretical results regarding the convergence of the MLE in a setting where the underlying data-generating process is known.
Since we know whether IIA holds in the simulated data, simulated data is also useful for examining two aspects of the CDM-based hypothesis test. First, we ask about the power of the test; in other words, does the test reject IIA when it is not true? Second, we ask about the conservatism of the test. The nested model likelihood ratio tests are only valid asymptotically; in our simulated data we can check whether the CDM over- or under-rejects the null hypothesis of IIA in finite samples. We consider three data-generating processes: one where the data is generated from an MNL model (where IIA holds), one where the data is generated from a CDM, and one where the data is generated from a general choice system. The universe has n = 6 items.

Figure 1. (a) Approximation error of an estimated CDM in 10 growing datasets validates our convergence theorem. The dashed black line is a visual guide of the slope 1/m. (b,c,d) The proportion of rejections for a CDM-based hypothesis test of IIA (at a threshold of p < .05) when the data is generated by an MNL, CDM, and general choice system model, as a function of the number of samples. When IIA is true, the CDM-based test has a roughly 5% false rejection rate while the test based on the general choice system is highly anti-conservative. When IIA is false, both tests quickly and correctly reject. All model parameters are described in the main text.

In the IIA dataset the underlying MNL model has parameters [1, 2, 3, 4, 5, 6]/21. In the CDM dataset the parameters $U = T^T C$ are generated by sampling the elements of both 6×6 matrices T and C i.i.d. from N(0, 1). The probabilities of the general choice system are sampled i.i.d. from U[0, 1] and renormalized. We sample choice sets uniformly at random (thus our identification conditions are quickly met) and then a choice according to the underlying model.
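The simulated data-generating process above can be sketched end-to-end: draw a choice set uniformly at random, draw a choice from the model on that set, and evaluate the likelihood of equation (3) on the resulting dataset. The MNL utilities mirror the [1, ..., 6]/21 parameters described in the text; everything else is illustrative.

```python
import math
import random
from itertools import combinations

random.seed(0)
n = 6
v = [(i + 1) / 21 for i in range(n)]  # MNL parameters from the text
all_sets = [C for k in range(2, n + 1) for C in combinations(range(n), k)]

def mnl_prob(x, C, v):
    return math.exp(v[x]) / sum(math.exp(v[y]) for y in C)

def sample_choice():
    C = random.choice(all_sets)  # step 1: uniformly random choice set
    r, acc = random.random(), 0.0
    for x in C:                  # step 2: sample an item from the MNL on C
        acc += mnl_prob(x, C, v)
        if r <= acc:
            return x, C
    return C[-1], C              # guard against floating-point round-off

data = [sample_choice() for _ in range(1000)]
assert all(x in C for x, C in data)

# Log of the likelihood in equation (3) for this dataset under the true model.
loglik = sum(math.log(mnl_prob(x, C, v)) for x, C in data)
assert loglik < 0
```

Replacing `mnl_prob` with a CDM choice rule yields the second data-generating process; the third simply draws the probabilities of each set at random and renormalizes.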
We fit a Luce (MNL), CDM, and universal logit model to the data and look at both the error of the CDM MLE (to evaluate convergence) and the p-values from the nested model likelihood ratio tests. When the p-value falls below .05 we say that the hypothesis of IIA is rejected. In addition, we compare the CDM-based nested test to another nested model test where we use a general choice system as the alternative model. Recall that the general choice system also nests MNL. However, the general choice system has combinatorially more parameters. Figure 1 shows our results. The left panel validates the O(1/m) convergence result in Theorem 3. The right three panels look at how often the hypothesis of IIA is rejected, out of 1000 independent growing datasets, when the underlying data comes from the three different data-generating processes. We see that the CDM rejects the null less than 5% of the time when the data-generating process indeed satisfies IIA, and rejects IIA when it is not true almost all the time, even with relatively small amounts of data. By contrast, we see that the universal logit requires quite a lot of data to reach asymptotically valid coverage, even with a universe of only 6 items. For finite samples it is highly anti-conservative, over-rejecting when IIA is true for small and medium amounts of data.

SFwork/SFshop. We now turn to two real-world datasets: SFwork and SFshop. These data are collected from a survey of transportation preferences around the San Francisco Bay Area (Koppelman and Bhat, 2006). SFshop consists of 3,157 observations of a choice set of transportation alternatives available to individuals traveling to and from a shopping center, as well as which alternative each individual actually chose. SFwork is similar, containing 5,029 observations consisting of commuting options and the choice made.

Figure 2. The out-of-sample negative log-likelihood of the MLE for the SFwork and SFshop datasets under an MNL/Luce model, mixed MNL models with varying numbers of mixture components, and CDMs of varying rank. The CDM outperforms the other models at all ranks.

These datasets are similar to those employed in many demand estimation applications. For example, Berry et al. (1995) fit an MNL model to aggregate data in order to estimate the utility function of an average consumer for automobiles, as well as how individuals (on average) trade off various qualities of a car (e.g., gas mileage vs. price). With access to the underlying parameters, the analyst can then make counterfactual estimates. With the SFwork/SFshop data, we can ask questions like: what would happen if we made certain types of transit more or less available? Of course, if the underlying assumption (IIA) of the MNL model is wrong, then we may expect our counterfactual answers to also be wrong. We both run the hypothesis test for IIA (asking: is there IIA in the data?) and examine the out-of-sample performance of the low-rank CDM (asking: does the violation of IIA have meaningful consequences for prediction?). From the hypothesis test we obtain a p-value of 10⁻⁷ and can strongly reject IIA. Figure 2 shows the out-of-sample fit on a held-out 20% of the data for low-rank CDMs, mixed MNLs, and an MNL model, again showing that IIA is not satisfied in this data. The full-rank unfactorized CDM is omitted from the figure for improved visibility, but attains an out-of-sample negative log-likelihood of 0.808 and 1.540 for SFwork and SFshop, respectively. Though the unfactorized CDM outperforms MNL and mixed MNLs, it fails to outperform the low-rank CDMs.
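The nested-model test used throughout is the classical likelihood-ratio test. A minimal sketch follows; the fitted log-likelihoods and the degrees-of-freedom gap are hypothetical placeholder values, not numbers from the paper.

```python
import numpy as np
from scipy.stats import chi2

def lr_test(loglik_null, loglik_alt, df_diff):
    """Likelihood-ratio test for nested models: under the null,
    2 * (LL_alt - LL_null) is asymptotically chi-squared with df_diff
    degrees of freedom (Wilks, 1938)."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2.sf(stat, df_diff)

# Hypothetical fitted values: MNL (null, IIA) nested inside the CDM.
stat, p = lr_test(loglik_null=-1250.0, loglik_alt=-1220.0, df_diff=20)
print(f"LR stat = {stat:.1f}, p = {p:.2g}, reject IIA at .05: {p < 0.05}")
```

The same machinery applies when the alternative is the general choice system instead of the CDM; only df_diff changes, and it is the combinatorially larger parameter count of that alternative that makes the test anti-conservative in finite samples.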
Figure 3. (Left) The out-of-sample negative log-likelihood and accuracy of the MLE for the nature photo dataset under an MNL/Luce model, mixed MNL models with varying numbers of mixture components, and CDMs of varying rank. The CDM outperforms the other models at all ranks in terms of both likelihood and accuracy. The target (center) and context (right) vector embeddings for the nature photo dataset with a rank-2 CDM, with three sample vectors highlighted. An item's context and target vectors have, on average, a negative dot product, showing that the addition of an item makes similar items already in the choice set less likely to be chosen.

Not Like The Other. We turn to a slightly different dataset to demonstrate another way the CDM can be used. We consider the task introduced by Heikinheimo and Ukkonen (2013), where individuals are shown triplets of nature photographs and asked to choose the one photo that is most unlike the other two. This task involves comparison-based choices (Kleinberg et al., 2017) where there are no irrelevant alternatives and IIA is clearly violated: consider two example tasks where the choice set is two mountains and a beach vs. two beaches and a mountain. The dataset is comprised of m = 3355 triplets spanning n = 120 photos. Because the dataset only has choice sets of a fixed size, the CDM is not directly identifiable (Theorem 2). We resolve this issue by adding an ℓ2 regularization term to the log-likelihood. For a small positive regularization penalty, the optimizer then selects the least-norm solution within the null space. We choose a non-negligible penalty, chosen through cross-validation, to serve the additional purpose of improving model generalization. We fit low-rank CDMs and see that they handily outperform an MNL model (i.e., just item-level utilities) and mixed MNL models on a 20% held-out test set (Figure 3, left).
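The role of the ℓ2 penalty in pinning down an otherwise unidentifiable solution can be illustrated on a toy objective with a flat direction. The gradient-descent loop and the toy loss below are illustrative stand-ins, not the paper's actual training setup.

```python
import numpy as np

def fit_penalized(grad_nll, dim, lam, lr=0.1, steps=500):
    """Gradient descent on NLL(theta) + lam * ||theta||^2. With lam > 0
    the objective is strictly convex along otherwise-flat directions,
    so the iterate approaches the least-norm minimizer of the NLL
    as lam -> 0."""
    theta = np.zeros(dim)
    for _ in range(steps):
        theta -= lr * (grad_nll(theta) + 2.0 * lam * theta)
    return theta

# Toy NLL with a flat direction: it depends only on theta[0] + theta[1],
# mimicking the null space left by fixed-size choice sets.
grad = lambda th: np.full(2, 2.0 * (th[0] + th[1] - 1.0))
theta_hat = fit_penalized(grad, dim=2, lam=1e-3)
print(theta_hat)  # close to [0.5, 0.5], the least-norm solution
```

Any point with theta[0] + theta[1] = 1 minimizes the unpenalized toy loss; the penalty selects the unique one closest to the origin, which is the same mechanism the cross-validated penalty exploits for the CDM.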
Though mixed MNL is often a competitive baseline, it is still a RUM, and cannot model the inherent asymmetric dominance of this task. The full-rank, unfactorized CDM is again omitted, but attains an out-of-sample negative log-likelihood of 0.843, yet again outperforming MNL and mixed MNL but falling significantly short of the low-rank models. We plot the vectors learned in the low-rank CDM (Figure 3, right). We see that similar images are grouped together both as targets and as contexts. We also see an intuitive property of the dataset: for most items x, tx and cx have a negative inner product. Essentially, having two copies of the same item in a choice set makes each copy less likely to be chosen.

5. Conclusion

Existing work has argued that context dependence, and in particular choice-set dependence, is an important part of human decision-making (Ariely et al., 2003; Slovic, 1995; Tversky and Simonson, 1993). Tractable tools like the CDM are therefore crucial to further understanding decision-making, providing both good empirical performance and optimistic worst-case guarantees. It should also be noted that IIA violations are often seen in intertemporal choice, choice under uncertainty, and choices about cooperation. Applying a CDM to these domains is an important area of future work. There is separate experimental evidence that human choices are intransitive in some settings, where people may prefer A to B and B to C but then C to A. This evidence has given rise to a theoretical literature on relaxing the transitivity axiom of rational choice or the regularity axiom of random utility (Tversky, 1969; Ragain and Ugander, 2016; Benson et al., 2016). To that end, an axiomatic characterization of the CDM and the kinds of violations of rational choice that the model can or cannot represent would be worthy of further study.
Understanding human decision-making is an important endeavor for both basic and applied science and is becoming increasingly important in human-centered machine learning and artificial intelligence. We view the introduction of techniques from machine learning and AI into behavioral science, and the flow of realistic models of human behavior in the other direction, as crucial and beneficial for both fields (Wager and Athey, 2015; Fudenberg and Peysakhovich, 2014; Naecker, 2015; Epstein et al., 2016; Peysakhovich and Rand, 2017). We hope that our work contributes to this important conversation.

Acknowledgments

We thank Fred Feinberg, Stephen Ragain, and Steve Yadlowsky for their helpful comments and feedback. AS was supported in part by an NSF Graduate Research Fellowship. JU was supported in part by an ARO Young Investigator Award.

References

Allenby, G. M. and Rossi, P. E. (1998), 'Marketing models of consumer heterogeneity', Journal of Econometrics 89(1), 57-78.
Ariely, D., Loewenstein, G. and Prelec, D. (2003), '"Coherent arbitrariness": Stable demand curves without stable preferences', The Quarterly Journal of Economics 118(1), 73-106.
Athey, S. and Levin, J. (2001), 'Information and competition in US Forest Service timber auctions', Journal of Political Economy 109(2), 375-417.
Batsell, R. R. and Polking, J. C. (1985), 'A new class of market share models', Marketing Science 4(3), 177-198.
Benson, A. R., Kumar, R. and Tomkins, A. (2016), 'On the relevance of irrelevant alternatives', in Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 963-973.
Berry, S., Levinsohn, J. and Pakes, A. (1995), 'Automobile prices in market equilibrium', Econometrica: Journal of the Econometric Society, pp. 841-890.
Bordalo, P., Gennaioli, N. and Shleifer, A. (2012), 'Salience theory of choice under risk', The Quarterly Journal of Economics, p. qjs018.
Bruch, E., Feinberg, F. and Lee, K. Y. (2016), 'Extracting multistage screening rules from online dating activity data', Proceedings of the National Academy of Sciences 113(38), 10530-10535.
Chen, S. and Joachims, T. (2016a), 'Modeling intransitivity in matchup and comparison data', in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, pp. 227-236.
Chen, S. and Joachims, T. (2016b), 'Predicting matchups and preferences in context', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 775-784.
Chierichetti, F., Kumar, R. and Tomkins, A. (2018), 'Learning a mixture of two multinomial logits', in International Conference on Machine Learning, pp. 960-968.
Elo, A. E. (1978), The Rating of Chessplayers, Past and Present, Arco Pub.
Epstein, Z. G., Peysakhovich, A. and Rand, D. G. (2016), 'The good, the bad, and the unflinchingly selfish: Cooperative decision-making can be predicted with high accuracy using only three behavioral types', Proceedings of the 17th Economics and Computation Conference (EC17).
Fox, C. R. and Tversky, A. (1995), 'Ambiguity aversion and comparative ignorance', The Quarterly Journal of Economics 110(3), 585-603.
Fudenberg, D. and Levine, D. K. (2012), 'Timing and self-control', Econometrica, pp. 1-42.
Fudenberg, D. and Peysakhovich, A. (2014), 'Recency, records and recaps: Learning and non-equilibrium behavior in a simple decision problem', in Proceedings of the Fifteenth ACM Conference on Economics and Computation, ACM, pp. 971-986.
Heikinheimo, H. and Ukkonen, A. (2013), 'The crowd-median algorithm', in First AAAI Conference on Human Computation and Crowdsourcing.
Herbrich, R., Minka, T. and Graepel, T. (2006), 'TrueSkill™: A Bayesian skill rating system', in Proceedings of the 19th International Conference on Neural Information Processing Systems, MIT Press, pp. 569-576.
Kingma, D. and Ba, J. (2014), 'Adam: A method for stochastic optimization', arXiv preprint arXiv:1412.6980.
Kleinberg, J., Mullainathan, S. and Ugander, J. (2017), 'Comparison-based choices', in Proceedings of the 2017 ACM Conference on Economics and Computation, ACM, pp. 127-144.
Koppelman, F. S. and Bhat, C. (2006), A Self Instructing Course in Mode Choice Modeling: Multinomial and Nested Logit Models.
Kreps, D. (1988), Notes on the Theory of Choice, Westview Press.
Liberman, V., Samuels, S. M. and Ross, L. (2004), 'The name of the game: Predictive power of reputations versus situational labels in determining prisoner's dilemma game moves', Personality and Social Psychology Bulletin 30(9), 1175-1185.
List, J. A. (2007), 'On the interpretation of giving in dictator games', Journal of Political Economy 115(3), 482-493.
Luce, R. D. (1959), Individual Choice Behavior: A Theoretical Analysis, John Wiley and Sons.
Manski, C. F. (1977), 'The structure of random utility models', Theory and Decision 8(3), 229-254.
McFadden, D. (1980), 'Econometric models for probabilistic choice among products', Journal of Business, pp. S13-S29.
McFadden, D. and Train, K. (2000), 'Mixed MNL models for discrete response', Journal of Applied Econometrics 15(5), 447-470.
McFadden, D., Tye, W. B. and Train, K. (1977), An Application of Diagnostic Tests for the Independence from Irrelevant Alternatives Property of the Multinomial Logit Model, Institute of Transportation Studies, University of California.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013), 'Distributed representations of words and phrases and their compositionality', in Advances in Neural Information Processing Systems, pp. 3111-3119.
Muraven, M. and Baumeister, R. F. (2000), 'Self-regulation and depletion of limited resources: Does self-control resemble a muscle?', Psychological Bulletin 126(2), 247.
Naecker, J. (2015), 'The lives of others: Predicting donations with non-choice responses'.
Park, S.-J. and Choi, S. (2013), 'A theoretical note on the number of free parameters in the elimination-by-aspects model', Journal of Mathematical Psychology 57(5), 255-259.
Peysakhovich, A. and Rand, D. G. (2015), 'Habits of virtue: Creating norms of cooperation and defection in the laboratory', Management Science 62(3), 631-647.
Peysakhovich, A. and Rand, D. G. (2017), 'In-group favoritism caused by Pokemon Go and the use of machine learning to learn its mechanisms', SSRN.
Ragain, S., Peysakhovich, A. and Ugander, J. (2018), 'Improving pairwise comparison models using empirical Bayes shrinkage', arXiv preprint arXiv:1807.09236.
Ragain, S. and Ugander, J. (2016), 'Pairwise choice Markov chains', in Advances in Neural Information Processing Systems, pp. 3198-3206.
Resnick, P. and Varian, H. R. (1997), 'Recommender systems', Communications of the ACM 40(3), 56-58.
Rudolph, M., Ruiz, F., Mandt, S. and Blei, D. (2016), 'Exponential family embeddings', in Advances in Neural Information Processing Systems, pp. 478-486.
Ruiz, F. J., Athey, S. and Blei, D. M. (2017), 'Shopper: A probabilistic model of consumer choice with substitutes and complements', arXiv preprint arXiv:1711.03560.
Schapire, R. E., Cohen, W. W. and Singer, Y. (1998), 'Learning to order things', Advances in Neural Information Processing Systems 10(451), 24.
Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K. and Wainwright, M. J. (2016), 'Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence', The Journal of Machine Learning Research 17(1), 2049-2095.
Slovic, P. (1995), 'The construction of preference', American Psychologist 50(5), 364.
Srivastava, N. and Schrater, P. R. (2012), 'Rational inference of relative preferences', in Advances in Neural Information Processing Systems, pp. 2303-2311.
Tversky, A. (1969), 'Intransitivity of preferences', Preference, Belief, and Similarity, p. 433.
Tversky, A. (1972), 'Elimination by aspects: A theory of choice', Psychological Review 79(4), 281.
Tversky, A. and Kahneman, D. (1985), 'The framing of decisions and the psychology of choice', in Environmental Impact Assessment, Technology Assessment, and Risk Analysis, Springer, pp. 107-129.
Tversky, A. and Simonson, I. (1993), 'Context-dependent preferences', Management Science 39(10), 1179-1189.
Wager, S. and Athey, S. (2015), 'Estimation and inference of heterogeneous treatment effects using random forests', arXiv preprint arXiv:1510.04342.
Wilks, S. S. (1938), 'The large-sample distribution of the likelihood ratio for testing composite hypotheses', The Annals of Mathematical Statistics 9(1), 60-62.