# Learning from Stochastically Revealed Preference

John R. Birge, The University of Chicago Booth School of Business, John.Birge@chicagobooth.edu
Xiaocheng Li, Imperial College Business School, Imperial College London, xiaocheng.li@imperial.ac.uk
Chunlin Sun, Institute for Computational and Mathematical Engineering, Stanford University, chunlin@stanford.edu

**Abstract.** We study the learning problem of revealed preference in a stochastic setting: a learner observes the utility-maximizing actions of a set of agents whose utility follows some unknown distribution, and the learner aims to infer the distribution through the observations of actions. The problem can be viewed as a single-constraint special case of the inverse linear optimization problem. Existing works all assume that the agents share one common utility function, an assumption that is easily violated in practical contexts. In this paper, we consider two settings for the underlying utility distribution: a Gaussian setting where the customer utility follows the von Mises-Fisher distribution, and a δ-corruption setting where the customer utility distribution concentrates on one fixed vector with high probability and is arbitrarily corrupted otherwise. We devise Bayesian approaches for parameter estimation and develop theoretical guarantees for the recovery of the true parameter. We illustrate the algorithm performance through numerical experiments.

## 1 Introduction

The problem of learning from revealed preference refers to learning a common utility function for a set of agents based on observations of the agents' utility-maximizing actions. The revealed preference problem has a long history in economics [Sam48, Afr67] (see [Var06] for a review). A line of works [BV06, ZR12, BDM+14, ACD+15] formulates the problem as a learning problem with two objectives: (i) rationalizing a set of observations, i.e., finding a utility function which explains a set of past observations; (ii) predicting the future behavior of a utility-maximizing agent. Mathematically, the action of the agents is modeled by an optimization problem that maximizes a linear (or concave) utility function subject to one linear budget constraint. The learner (decision maker) aims to learn the unknown utility function through a set of observations of the constraints and the optimal solutions. The problem can be viewed as a single-constraint special case of the inverse optimization problem [AO01], which covers a wider range of applications: geoscience [BT92], finance [BGP12], market analysis [BHP17], energy [ASS18], etc.

In this paper, we study the problem under a stochastic setting where the agents have a linear utility function randomly distributed according to some unknown distribution. Such a stochastic setting is well-motivated by application contexts where the agents are customers and the constraint models the prices and the customer's budget. The optimal solution encodes the customer's purchase behavior, and the stochastic utility (objective function) captures the heterogeneity of the customers' preferences for the products. The goal of learning in this stochastic setting thus becomes to learn the utility distribution through observations of the actions.
To the best of our knowledge, we provide the first result on learning a stochastic utility, both for the revealed preference problem and in the more general literature on the inverse optimization problem.

**Related literature.** The existing approaches to the problem can be roughly divided into two categories.

*Query-based:* In a query-based model, the learner aims to learn the utility function by querying an oracle for the agent's optimal actions, and the goal is to derive a sample complexity guarantee for a sufficiently accurate estimate of the utility function. [BV06] initiates this line of research and studies a statistical setup where the input data is a set of observations and the learner's performance is evaluated by sample complexity bounds. [ZR12] studies the case of a linear or linearly separable concave utility function, and [BDM+14] generalizes the setting and devises learning algorithms for several classes of utility functions. Beyond the statistical setup where the observations are sampled from some distribution, both works also study an active learning setting where the learner has the power to choose the linear constraint (set the prices of the products). Subsequent works along this line study the associated revenue management problem [ACD+15] and a game-theoretic setting [DRS+18] where the agents act strategically to hide their true actions.

*Optimization-based:* The optimization-based approach is usually adopted in the inverse optimization literature, and some algorithms developed therein can be applied to the special case of the revealed preference problem. [ZL96] and [AO01] study inverse optimization with a single observation and develop linear programming formulations to solve the problem. Later, [KWB11] and [ASS18] study the statistical (or data-driven) setting where the observations are sampled from some distribution. Specifically, [ASS18] considers a setting where the optimal actions of the agents are contaminated with independent noise, but all the agents still follow a common utility parameter vector. [MESAHK18] studies the distributionally robust version of the problem, and [BFL21] considers a contextual formulation. A recent line of works [BMPS18, DCZ18, DZ20, CKK20] casts the inverse optimization problem in an online context and develops (online) gradient-based algorithms.

To our understanding, all the existing algorithms and analyses on this topic rely on the assumption that all the agents share one common utility function (or a common utility parameter vector), and thus can fail in the stochastic setting. In this paper, we formulate the problem in Section 2 and focus on the statistical data input where the budget constraint is sampled from some unknown distribution. We consider two stochastic settings: a Gaussian setting in Section 3 and a δ-corruption setting in Section 4. We conclude with numerical experiments and discuss (i) how the results can be generalized to the inverse optimization problem and (ii) the implications for the query-based model where the learner has the power to choose the budget constraints.

## 2 Model Setup

Consider a customer who purchases a bundle of products subject to some budget constraint. The customer's utility-maximizing action can be modeled by the following linear program:

$$\mathrm{LP}(u, a, b) := \max_{x} \; u^\top x \quad \text{s.t.} \ \sum_{i=1}^{n} a_i x_i \le b, \quad 0 \le x_i \le 1, \ i = 1, \dots, n,$$

where $u = (u_1, \dots, u_n)^\top \in \mathbb{R}^n$, $a = (a_1, \dots, a_n)^\top \in \mathbb{R}^n_+$, and $b \in \mathbb{R}_+$ are the inputs of the LP.
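As a concrete illustration, here is a minimal sketch (ours, not the paper's code) of generating one observation $(x^*, a, b)$ by solving $\mathrm{LP}(u, a, b)$ with SciPy; the sampling choices for $u$, $a$, and $b$ are placeholders.

```python
# A minimal sketch of generating one observation (x*, a, b);
# SciPy's linprog minimizes, so we negate the utility vector.
import numpy as np
from scipy.optimize import linprog

def solve_lp(u, a, b):
    """Return an optimal x of LP(u, a, b): max u'x s.t. a'x <= b, 0 <= x <= 1."""
    res = linprog(c=-u, A_ub=a.reshape(1, -1), b_ub=[b],
                  bounds=[(0.0, 1.0)] * len(u), method="highs")
    return res.x

rng = np.random.default_rng(0)
n = 5
u = rng.normal(size=n); u /= np.linalg.norm(u)  # utility on the unit sphere
a = rng.uniform(1.0, 2.0, size=n)               # prices, here in [1, 2]
b = rng.uniform(1.0, n)                         # budget
x_star = solve_lp(u, a, b)                      # the observed action x*
```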
Here the decision variables $x$ encode the purchase decisions (a partial purchase is allowed), $u_i$ denotes the customer's utility for the $i$-th product, and $a_i$ denotes the price or cost of purchasing the $i$-th product. The right-hand side $b$ denotes the budget of the customer. Throughout this paper, we make the following assumption.

**Assumption 1.** We assume: (i) the utility vector $u$ follows some unknown distribution $P_u$; (ii) the LP's input $(a, b)$ follows some unknown distribution $P_{a,b}$ independent of $P_u$; (iii) there exists $\underline{a} > 0$ such that $a_i \in [\underline{a}, 1]$ almost surely for $i = 1, \dots, n$.

Our goal is to infer the distribution through observations of the customers' optimal actions. Mathematically, we aim to estimate the distribution $P_u$ through the dataset $\mathcal{D}_T = \{(x_t^*, a_t, b_t)\}_{t=1}^T$. Here the $t$-th sample corresponds to an unobservable $u_t$ generated from $P_u$, and $x_t^*$ is one optimal solution of $\mathrm{LP}(u_t, a_t, b_t)$. Due to the scale invariance of the utility vector, we restrict the distribution $P_u$ to the unit sphere $\mathbb{S}^{n-1} = \{u : \|u\|_2 = 1\}$. In the following two sections, we consider two settings: (i) Gaussian: $P_u$ follows the von Mises-Fisher distribution, i.e., the restriction of a multivariate Gaussian distribution to the unit sphere; (ii) δ-corruption: $P_u$ concentrates on one point $u^*$ with probability $1 - \delta$ and follows an arbitrarily corrupted distribution with probability $\delta$.

*Figure 1: Visualizing the challenge of the problem in 1-D and 2-D.*

**The challenge of the problem.** Each observation $(x_t^*, a_t, b_t)$ prescribes a region $\mathcal{U}_t \subseteq \mathbb{S}^{n-1}$,

$$\mathcal{U}_t := \left\{ u \in \mathbb{S}^{n-1} : x_t^* \text{ is an optimal solution of } \mathrm{LP}(u, a_t, b_t) \right\}.$$

The set $\mathcal{U}_t$ captures all the possible values of $u_t$ that are consistent with the $t$-th observation. The following lemma states that the set $\mathcal{U}_t$ can be expressed by a group of linear constraints.

**Lemma 1.** For each $\mathcal{U}_t$, there exist a matrix $V_t$ and a vector $w_t$ such that $\mathcal{U}_t = \{ u \in \mathbb{S}^{n-1} : V_t u \ge w_t \}$.

In the deterministic setting of the revealed preference problem, all the $u_t$'s are identical, and the learning problem is thus reduced to finding one feasible $u$ in the set $\cap_{t=1}^T \mathcal{U}_t$. But in a stochastic setting, it may happen that the set $\cap_{t=1}^T \mathcal{U}_t$ is empty. Figure 1 provides a conceptual visualization of this challenge of "empty intersection": each blue solid segment denotes one such $\mathcal{U}_t$, and the blue dashed line represents a value of $u$ that appears most frequently in these $\mathcal{U}_t$'s. We remark that the figure is just for illustrative purposes, as the problem may not be well-defined in the 1-dimensional case.

From an estimation viewpoint, the goal is to estimate the distribution $P_u$ without knowledge of the realized samples $u_t$'s, but merely with the knowledge of the set $\mathcal{U}_t$ to which each $u_t$ belongs. The sample efficiency of the estimation procedure is naturally contingent on the dispersion of the $\mathcal{U}_t$'s, which is essentially determined by the generation of $(a_t, b_t)$. For example, if all the $\mathcal{U}_t$'s coincide with each other, then one can hardly learn much about the underlying $P_u$. In this paper, we aim to pinpoint conditions on $P_{a,b}$ such that the learning of $P_u$ is possible. Also, an alternative way to measure the estimation accuracy is to evaluate the predictive performance of the estimated model on new observations generated from $P_{a,b}$; such performance bounds generally bear less dependency on the distribution $P_{a,b}$. We also provide theoretical guarantees in this sense.
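To make Lemma 1 concrete for this fractional-knapsack LP, the following sketch builds a candidate $V_t$ from one observation, under extra assumptions we add purely for illustration: the budget binds at optimality and degenerate ties are ignored. In that case, optimality of $x_t^*$ reduces to bang-per-buck comparisons $u_i/a_i$ between more-purchased and less-purchased items, and clearing denominators yields rows with $V_t u \ge 0$ (so $w_t = 0$ in this particular construction).

```python
# A hedged sketch of Lemma 1's construction for the fractional-knapsack LP:
# fully purchased items must have bang-per-buck u_i/a_i at least that of the
# fractional item, which in turn dominates the unpurchased items.
import numpy as np

def build_Vt(x_star, a, tol=1e-8):
    full = np.where(x_star > 1 - tol)[0]                       # x*_i = 1
    frac = np.where((x_star >= tol) & (x_star <= 1 - tol))[0]  # 0 < x*_i < 1
    none = np.where(x_star < tol)[0]                           # x*_i = 0
    rows = []
    for hi, lo in [(full, frac), (frac, none), (full, none)]:
        for i in hi:
            for j in lo:  # u_i/a_i >= u_j/a_j  <=>  a_j*u_i - a_i*u_j >= 0
                row = np.zeros(len(a))
                row[i], row[j] = a[j], -a[i]
                rows.append(row)
    return np.array(rows) if rows else np.zeros((0, len(a)))

def consistent(u, Vt, margin=0.0):
    """Indicator of u in U_t (or in U_t(gamma) when margin = gamma > 0)."""
    return bool(np.all(Vt @ u >= margin))
```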
## 3 Gaussian Setting

In this section, we consider a setting where the distribution $P_u$ follows the von Mises-Fisher distribution parameterized by $\theta = (\mu, \kappa)$ with density function

$$f(u; \theta) := \frac{\exp(\kappa \mu^\top u)}{\int_{u' \in \mathbb{S}^{n-1}} \exp(\kappa \mu^\top u')\, \mathrm{d}u'} \;\propto\; \exp(\kappa \mu^\top u).$$

Here the vector $\mu \in \mathbb{R}^n$ represents the mean direction and the parameter $\kappa > 0$ controls the concentration of the distribution. The deterministic setting of the revealed preference problem can be viewed as the case $\kappa = \infty$, where the distribution degenerates to a point mass on the unit sphere. Denote the true parameters of the distribution $P_u$ by $\theta^* = (\mu^*, \kappa^*)$. Then the likelihood of the dataset $\mathcal{D}_T$ under a parameter $\theta$ is

$$P(\mathcal{D}_T \mid \theta) := \prod_{t=1}^T P\left((x_t^*, a_t, b_t) \mid \theta\right) \;\propto\; \prod_{t=1}^T \int_{u \in \mathcal{U}_t} f(u; \theta)\, \mathrm{d}u.$$

We remark that the maximum likelihood approach cannot be applied here, for two reasons. First, the integral of $f(u; \theta)$ over the region $\mathcal{U}_t$ has no closed form. This point is not specific to the Gaussian parameterization of $P_u$: the scale invariance of the utility vector restricts the distribution $P_u$ to a unit sphere or a simplex, and consequently the likelihood function inevitably involves a non-closed-form integral. The issue can be partially resolved by using the Monte Carlo method to approximate the integral; fortunately, the same integrand is shared across all the observations. Second, the likelihood function is not analytical in $\theta$. This prevents the use of gradient-based algorithms to solve the problem and also makes it difficult to derive theoretical guarantees for the maximum likelihood estimator.

We propose a Bayesian perspective for the problem: instead of identifying the parameter that maximizes the likelihood function, we directly draw samples from the posterior distribution. We will see shortly that the approach can be justified through a concentration property of the posterior distribution. Suppose we have a prior distribution $P_0(\theta)$; then we can define the posterior distribution by

$$P_T(\theta) \;\propto\; P_0(\theta) \cdot P(\mathcal{D}_T \mid \theta) \;\propto\; P_0(\theta) \prod_{t=1}^T \int_{u \in \mathcal{U}_t} f(u; \theta)\, \mathrm{d}u.$$

With a slight abuse of notation, we use $P_T(\cdot)$ (or $P_0(\cdot)$) to refer to both the density function and the probability measure of the posterior (or prior) distribution. We make the following assumption on the prior distribution.

**Assumption 2.** We assume the concentration parameter $\kappa^* \in (\underline{\kappa}, \bar{\kappa})$, where $\underline{\kappa}, \bar{\kappa}$ are two known positive constants. The prior distribution $P_0(\theta)$ is the uniform distribution on $\mathbb{S}^{n-1} \times (\underline{\kappa}, \bar{\kappa})$.

**Theorem 1.** Let

$$\Theta_T := \left\{ \theta \in \mathbb{S}^{n-1} \times (\underline{\kappa}, \bar{\kappa}) : W\!\left( P\left((x_t^*, a_t, b_t) \mid \theta\right), P\left((x_t^*, a_t, b_t) \mid \theta^*\right) \right) \le \max(8, 8\bar{\kappa}) \sqrt{\frac{n \log T}{T}} \right\},$$

where $W(\cdot, \cdot)$ is the Wasserstein distance between two distributions supported on $\mathcal{X} \times \mathbb{R}^n_+ \times \mathbb{R}_+$ equipped with the Euclidean metric. Then, under Assumptions 1-2, $1 - P_T(\Theta_T) \to 0$ in probability as $T \to \infty$. Specifically, the following inequality holds:

$$\mathbb{E}\left[ P_T(\Theta_T) \right] \ge 1 - 3\sqrt{n}\, T^{-1/2} \log T,$$

where the expectation is taken with respect to the random distribution $P_T(\cdot)$ (essentially, with respect to the dataset $\mathcal{D}_T$).

Theorem 1 justifies the approach of posterior sampling. We first remark that the Bayesian sampling approach is proposed only to estimate the parameters; all the theoretical results are stated in frequentist language. The proof of Theorem 1 follows the standard analysis of the convergence of posterior distributions [GGVDV00, CDBW21]. While similar results should also hold for other underlying distributions of $P_u$, the von Mises-Fisher distribution provides much analytical convenience in deriving the bound. Each $\theta$, together with the distribution $P_{a,b}$, defines a distribution over the space of $(x_t^*, a_t, b_t)$.
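The Monte Carlo approximation mentioned above takes a convenient form here: since $f(\cdot; \theta)$ is a probability density on the sphere, $\int_{\mathcal{U}_t} f(u; \theta)\, \mathrm{d}u$ is exactly the probability that a vMF$(\mu, \kappa)$ draw lands in $\mathcal{U}_t$, so it can be estimated by the fraction of consistent draws, with one shared batch of draws reused across all observations. A sketch, assuming SciPy >= 1.11 for `scipy.stats.vonmises_fisher` and the `build_Vt` helper from the earlier sketch:

```python
# Monte Carlo approximation of the log-likelihood under theta = (mu, kappa):
# estimate each integral over U_t by the fraction of vMF draws landing in U_t.
import numpy as np
from scipy.stats import vonmises_fisher

def mc_log_likelihood(mu, kappa, dataset, m=2000, seed=0):
    """Approximate log P(D_T | theta) up to the theta-free P_{a,b} factor."""
    draws = vonmises_fisher(mu, kappa).rvs(m, random_state=seed)  # (m, n)
    loglik = 0.0
    for x_star, a, b in dataset:
        Vt = build_Vt(x_star, a)
        hit_rate = np.mean(np.all(draws @ Vt.T >= 0.0, axis=1))
        loglik += np.log(max(hit_rate, 1e-12))  # guard against zero hits
    return loglik
```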
As we use the observations $(x_t^*, a_t, b_t)$'s to identify the true $\theta^*$, the set $\Theta_T$ defines a set of $\theta$'s indistinguishable from $\theta^*$ based on the Wasserstein distance between the distributions of $(x_t^*, a_t, b_t)$. The set $\Theta_T$ shrinks as $T \to \infty$. The posterior sampling approach samples from the distribution $P_T(\cdot)$, and Theorem 1 states that the samples will concentrate in the set $\Theta_T$ with high probability. The posterior distribution $P_T(\cdot)$ depends on the dataset $\mathcal{D}_T$, so it is a random distribution itself, and the results in Theorem 1 are stated either as convergence in probability or in expectation. As a side note, the Wasserstein distance in the theorem is not critical; it can be replaced with other distances such as the total variation distance or the Hellinger distance.

Intuitively, Theorem 1 says that for any $\theta$ such that the likelihood distribution $P((x_t^*, a_t, b_t)|\theta)$ differs from $P((x_t^*, a_t, b_t)|\theta^*)$ to a certain extent, the posterior $P_T(\cdot)$ is unlikely to generate such a $\theta$. In other words, the posterior distribution identifies the true $\theta^*$ up to some equivalence in the space of likelihood distributions. The following corollary formalizes this intuition: if there is an equivalence between the likelihood distribution space and the underlying parameter space, then the posterior distribution is capable of identifying the true parameter.

**Corollary 1.** Suppose $W\left( P((x_t^*, a_t, b_t)|\theta), P((x_t^*, a_t, b_t)|\theta^*) \right) > 0$ for all $\theta \ne \theta^* \in \mathbb{S}^{n-1} \times [\underline{\kappa}, \bar{\kappa}]$. Then the posterior distribution $P_T(\cdot)$ converges to the point-mass distribution on $\theta^*$ almost surely as $T \to \infty$. Moreover, suppose there exists a constant $L > 0$ satisfying

$$W\left( P((x_t^*, a_t, b_t)|\theta), P((x_t^*, a_t, b_t)|\theta^*) \right) \ge L \|\theta - \theta^*\|_2 \tag{1}$$

for all $\theta \ne \theta^* \in \mathbb{S}^{n-1} \times [\underline{\kappa}, \bar{\kappa}]$. Under Assumptions 1-2, the following inequality holds with probability no less than $1 - \frac{6}{L} \sqrt{n}\, T^{-1/2} \log T$:

$$\mathbb{E}_T\left[ \|\theta_T - \theta^*\|_2 \right] \le \frac{\max(9, 9\bar{\kappa})}{L} \sqrt{\frac{n \log T}{T}},$$

where $\theta_T$ is sampled from the posterior distribution $P_T(\cdot)$.

The corollary states that when there is some equivalence between the likelihood distribution space and the parameter space as in (1), the true parameter is identifiable. The first part of the corollary is a consistency result: as long as all $\theta \ne \theta^*$ are distinguishable from $\theta^*$ through the likelihood function, posterior sampling will eventually identify the true $\theta^*$. The second part gives the convergence rate in terms of the equivalence parameter $L$.

In Assumption 1, we assume the constraint input $(a_t, b_t)$ is generated from some distribution $P_{a,b}$. We note that Theorem 1 and Corollary 1 hold without any additional assumption on $P_{a,b}$, but the topology of the likelihood distribution space is highly dependent on $P_{a,b}$. Specifically, the distribution of $(a_t, b_t)$ determines the separation of the parameter space through the value of $L$ in (1), or even its existence. The value of $L$ for a specific distribution $P_{a,b}$ can be examined through simulation, so if the learner has some flexibility in choosing the distribution $P_{a,b}$, the optimal choice would be the one that corresponds to a larger value of $L$. If the constraint input $(a_t, b_t)$ is not randomly generated but can be actively chosen, as in the query-based preference learning problem, the results in Theorem 1 and Corollary 1 still hold by conditioning on all the $(a_t, b_t)$'s. Unlike the deterministic case where the utility vector $u$ is fixed across all the observations, the stochastic nature of the problem setup makes it generally very complicated to fully extract the benefit of the learner designing the $(a_t, b_t)$'s. We leave this as an open question for future work.
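One hedged way to examine $L$ by simulation (our construction, not the paper's procedure) is to approximate the Wasserstein distance between the observation distributions under $\theta$ and $\theta^*$ by a sliced proxy, i.e., the average one-dimensional Wasserstein distance over random unit projections, and then divide by $\|\theta - \theta^*\|_2$; `solve_lp` is reused from the earlier sketch.

```python
# A simulation sketch for probing condition (1) via a sliced-Wasserstein proxy.
import numpy as np
from scipy.stats import vonmises_fisher, wasserstein_distance

def simulate_obs(mu, kappa, m, rng):
    """Stack m simulated observations (x*, a, b) under theta = (mu, kappa)."""
    n, out = len(mu), []
    for _ in range(m):
        u = vonmises_fisher(mu, kappa).rvs(1, random_state=rng)[0]
        a, b = rng.uniform(1.0, 2.0, size=n), rng.uniform(1.0, n)
        out.append(np.concatenate([solve_lp(u, a, b), a, [b]]))
    return np.array(out)

def sliced_w(s1, s2, n_proj, rng):
    """Average 1-D Wasserstein distance over random unit projections."""
    d, dists = s1.shape[1], []
    for _ in range(n_proj):
        v = rng.normal(size=d); v /= np.linalg.norm(v)
        dists.append(wasserstein_distance(s1 @ v, s2 @ v))
    return np.mean(dists)

# Candidate ratio for L (theta = (mu, kappa) stacked into one vector):
# sliced_w(simulate_obs(mu1, k1, 5000, rng),
#          simulate_obs(mu2, k2, 5000, rng), 50, rng) / ||theta1 - theta2||_2
```

Since each unit projection is 1-Lipschitz, every projected distance lower-bounds the full Wasserstein distance, so scanning this ratio over pairs of parameters yields a conservative suggestion for $L$.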
**Corollary 2.** Suppose $P_{a,b}$ is a discrete distribution with finite support. Let $(a, b)$ be a new sample from $P_{a,b}$, i.e., independent of the dataset $\mathcal{D}_T$, and let $\theta_T = (\mu_T, \kappa_T)$ be a sample from the posterior distribution $P_T(\cdot)$. Denote by $\hat{x}$ and $x^*$ the optimal solutions of $\mathrm{LP}(\mu_T, a, b)$ and $\mathrm{LP}(\mu^*, a, b)$, respectively. Then, under Assumptions 1-2, the following inequality holds with probability no less than $1 - 6\sqrt{n}\, T^{-1/2} \log T$:

$$\mathbb{E}\left[ \|\hat{x} - x^*\|_2 \right] \le \max(16, 16\bar{\kappa}) \sqrt{\frac{n \log T}{T}},$$

where the expectation is taken with respect to both the posterior distribution $P_T(\cdot)$ and $(a, b)$.

Corollary 2 provides an upper bound on the predictive performance of the posterior distribution. Specifically, we want to predict the optimal solution of a linear program specified by $\mu^*$ (proportional to $\mathbb{E}[u]$) and a new sample of the constraint $(a, b)$, and the prediction $\hat{x}$ is based on a posterior sample. We know from Theorem 1 that the posterior distribution concentrates on those $\theta$'s that are indistinguishable from the true $\theta^*$ in terms of the likelihood. For predictive performance, we are only concerned with the distribution of the optimal solution (equivalently, the likelihood) and do not require the identification of the exact true $\theta^*$, so Corollary 2 does not require condition (1) to hold. Intuitively, the prediction of the optimal solution for a new observation $(a, b)$ can be viewed as a conditional distribution of the optimal solution given $(a, b)$. While the definition of $\Theta_T$ in Theorem 1 concerns the joint distribution of the optimal solution and $(a, b)$, the finite-support condition on $P_{a,b}$ in Corollary 2 transforms the result on the joint distribution into one on the conditional distribution.

## 4 δ-Corruption Case

In this section, we consider a setting where the utility vector is specified by

$$u_t = \begin{cases} u^*, & \text{w.p. } 1 - \delta, \\ \tilde{u} \sim \tilde{P}_u, & \text{w.p. } \delta, \end{cases} \tag{2}$$

where $u^* \in \mathbb{S}^{n-1}$ is a fixed vector, $\delta \in [0, 1]$, and $\tilde{P}_u$ is an arbitrary distribution that corrupts the inference of $u^*$. The deterministic setting of the revealed preference problem in the literature can be viewed as the case $\delta = 0$, and the Gaussian setting of the previous section can be viewed as the case $\delta = 1$ with $\tilde{P}_u$ being the von Mises-Fisher distribution. In this setting, we do not aim to learn the distribution $\tilde{P}_u$; rather, our goal is to identify the vector $u^*$ using the dataset $\mathcal{D}_T$. A natural idea to estimate $u^*$ is to solve the following optimization problem:

$$\mathrm{OPT}_\delta := \max_{u \in \mathbb{S}^{n-1}} \sum_{t=1}^T \mathbb{I}_{\mathcal{U}_t}(u),$$

where the indicator function $\mathbb{I}_E(e) = 1$ if $e \in E$ and $\mathbb{I}_E(e) = 0$ otherwise. The rationale for the optimization problem is that, for the $t$-th observation, a vector $u$ is consistent with the observation, i.e., $x_t^*$ is an optimal solution of $\mathrm{LP}(u, a_t, b_t)$, if and only if $\mathbb{I}_{\mathcal{U}_t}(u) = 1$. Thus the optimization problem finds a vector $u$ that is consistent with the maximal number of observations. The objective function is discontinuous in $u$, so we propose a simulated annealing algorithm (Algorithm 2) to solve for its optimal solution.

We first build a connection between the optimization problem $\mathrm{OPT}_\delta$ and that of the deterministic setting with $\delta = 0$. Let $\bar{x}_t$ be the optimal solution of $\mathrm{LP}(u^*, a_t, b_t)$ and define

$$\bar{\mathcal{U}}_t := \left\{ u \in \mathbb{S}^{n-1} : \bar{x}_t \text{ is an optimal solution of } \mathrm{LP}(u, a_t, b_t) \right\}.$$

Then the deterministic setting of the revealed preference problem solves

$$\mathrm{OPT}_0 := \max_{u \in \mathbb{S}^{n-1}} \sum_{t=1}^T \mathbb{I}_{\bar{\mathcal{U}}_t}(u).$$

By the setup of the problem, $u^*$ is an optimizer of $\mathrm{OPT}_0$ and the optimal objective value is $T$. The following proposition establishes that the optimization problem $\mathrm{OPT}_\delta$ is a contaminated version of $\mathrm{OPT}_0$, and the effect that the contamination has on the objective function can be bounded using $\delta$.
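For concreteness, here is a sketch of the δ-corruption model (2) and the counting objective of $\mathrm{OPT}_\delta$, reusing `solve_lp`, `build_Vt`, and `consistent` from the earlier sketches; the uniform-on-the-sphere corruption distribution is our arbitrary choice.

```python
# Generate a delta-corrupted dataset per (2) and count consistent observations.
import numpy as np

def gen_corrupted_dataset(u_star, delta, T, rng):
    n, dataset = len(u_star), []
    for _ in range(T):
        if rng.random() < delta:   # corrupted draw from the corrupting law
            u = rng.normal(size=n); u /= np.linalg.norm(u)
        else:                      # the common utility vector u*
            u = u_star
        a, b = rng.uniform(1.0, 2.0, size=n), rng.uniform(1.0, n)
        dataset.append((solve_lp(u, a, b), a, b))
    return dataset

def count_consistent(u, dataset, margin=0.0):
    """Objective of OPT_delta (or OPT_delta(gamma) when margin = gamma)."""
    return sum(consistent(u, build_Vt(x, a), margin) for x, a, b in dataset)
```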
**Proposition 1.** Under Assumption 1, the following inequality holds:

$$\mathbb{E}\left[ \sup_{u \in \mathbb{S}^{n-1}} \left| \frac{1}{T} \sum_{t=1}^T \mathbb{I}_{\mathcal{U}_t}(u) - \frac{1}{T} \sum_{t=1}^T \mathbb{I}_{\bar{\mathcal{U}}_t}(u) \right| \right] \le \delta.$$

When the constraints $(a_t, b_t)$'s are generated from some distribution $P_{a,b}$, it can happen that some vectors $u$ are indistinguishable from $u^*$ based on the observations $\mathcal{D}_T$, as in the previous Gaussian case. So, we do not hope for an exact recovery of $u^*$; alternatively, we aim to derive a generalization bound for our estimator $\hat{u}$. Specifically, we define and analyze the accuracy

$$\mathrm{Acc}(\hat{u}) := \mathbb{E}\left[ \mathbb{I}_{\mathcal{U}}(\hat{u}) \right] \quad \text{with} \quad \mathcal{U} := \left\{ u \in \mathbb{S}^{n-1} : \tilde{x} \text{ is an optimal solution of } \mathrm{LP}(u, a, b) \right\},$$

where $\hat{u}$ is our estimator of $u^*$, $(a, b)$ is a new sample from the distribution $P_{a,b}$, $u_{\mathrm{new}}$ is a new sample following the law of (2), and $\tilde{x}$ is the optimal solution of $\mathrm{LP}(u_{\mathrm{new}}, a, b)$. In other words, the quantity captures the probability that $\hat{u}$ is consistent with a new (unseen) observation, and we know that for the true parameter, $\mathrm{Acc}(u^*) \ge 1 - \delta$, which serves as a performance benchmark.

The challenge in deriving a bound on $\mathrm{Acc}(\hat{u})$ arises from the discontinuity of the objective function of $\mathrm{OPT}_\delta$: the existing methods for deriving generalization bounds largely rely on the continuity and Lipschitzness of the loss function. To make things worse, from Lemma 1 we know that $\mathcal{U}_t$ is specified by $(V_t, w_t)$, and the $V_t$'s have different dimensions for different $t$'s. To overcome these challenges, we devise the following γ-margin objective function. Specifically, we first define a parameterized version of $\mathcal{U}_t$ by

$$\mathcal{U}_t(\gamma) := \left\{ u \in \mathbb{S}^{n-1} : V_t u \ge w_t + \gamma e \right\},$$

where $\gamma$ is a positive constant and $e$ is the all-one vector. It is obvious that $\mathcal{U}_t(\gamma) \subseteq \mathcal{U}_t$. Accordingly, we define the γ-margin optimization problem by

$$\mathrm{OPT}_\delta(\gamma) := \max_{u \in \mathbb{S}^{n-1}} \sum_{t=1}^T \mathbb{I}_{\mathcal{U}_t(\gamma)}(u).$$

**Proposition 2.** Under Assumption 1, the following inequality holds with probability no less than $1 - \epsilon$ for $\epsilon \in (0, 1)$:

$$\max_{u \in \mathbb{S}^{n-1}} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{I}_{\mathcal{U}_t(\gamma)}(u) - \mathrm{Acc}(u) \right\} \le \sqrt{\frac{4 \log T}{\underline{a}^2 \gamma^2 T}} + 6 \sqrt{\frac{\log(1/\epsilon)}{T}}.$$

Proposition 2 relates the generalization accuracy of an arbitrary $u$ to the corresponding objective value of the γ-margin optimization problem. As $\gamma$ increases, the γ-margin objective decreases and the right-hand side becomes tighter. Importantly, the accuracy is defined through the original indicator function (equivalently, $\mathcal{U}_t$), while the objective value is defined through the γ-margin indicator function (equivalently, $\mathcal{U}_t(\gamma)$). The implication is that when we optimize the γ-margin objective, we can still obtain a bound on the original accuracy $\mathrm{Acc}(u)$ for sufficiently large $\gamma$.

**Theorem 2.** Suppose $P_{a,b}$ is a continuous distribution with a density function upper bounded by $\bar{p}$. Then, under Assumption 1, the following inequality holds for sufficiently large $T$:

$$\mathrm{Acc}(\hat{u}) \ge 1 - \delta - \left( \frac{n \bar{p}}{\min_{i: u_i^* \ne 0} |u_i^*|} + 20 n \sqrt{\log T} \right) T^{-1/4},$$

where $\hat{u}$ is an optimal solution of $\mathrm{OPT}_\delta(\gamma)$ with $\gamma = \frac{1}{4n} T^{-1/4}$.

Theorem 2 states a generalization bound on the accuracy of $\hat{u}$ for continuous distributions of $(a, b)$. From Proposition 2, a larger $\gamma$ leads to a smaller gap between the accuracy and the γ-margin objective function. Meanwhile, a smaller $\gamma$ leads to a smaller gap between the optimal objective value $\mathrm{OPT}_\delta(\gamma)$ and $1 - \delta$; in the extreme case $\gamma = 0$, $\mathbb{E}[\mathrm{OPT}_\delta(0)]/T \ge \mathrm{Acc}(u^*) \ge 1 - \delta$. Theorem 2 optimizes the value of $\gamma$ to trade off these two aspects. We note that the continuity of the distribution and the upper bound on the density function make it possible to bound the gap in the second aspect.
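A short usage sketch of Theorem 2's margin choice (with the constants as reconstructed above), assuming `dataset` and `u_star` from the earlier sketches and a hypothetical candidate `u_hat`:

```python
# Evaluate the gamma-margin objective at the margin suggested by Theorem 2.
T, n = len(dataset), len(u_star)
gamma = T ** (-0.25) / (4 * n)                # gamma = T^{-1/4} / (4n)
margin_objective = count_consistent(u_hat, dataset, margin=gamma)
```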
Finally, we remark that the design of the γ-margin loss function is inspired by the max-margin classifier, but the analysis is entirely different. For the max-margin classifier, the introduction of the margin makes the underlying loss function 1-Lipschitz so that a generalization bound can be derived via Rademacher complexity; here, by contrast, our γ-margin objective function remains discontinuous.

## 5 Computational Aspects and Discussions

In the previous sections, we developed theoretical results for both the Gaussian and δ-corruption settings. Now we discuss computational aspects of sampling from the posterior $P_T(\cdot)$ and of optimizing $\mathrm{OPT}_\delta(\gamma)$. As mentioned earlier, posterior sampling removes the complication of optimizing over $\theta$ in maximum likelihood estimation, but it still inevitably needs to deal with the sampling and numerical approximation of the likelihood function.

**Algorithm 1** Posterior Sampling for the Gaussian Setting
1: Input: dataset $\mathcal{D}_T = \{(x_t^*, a_t, b_t)\}_{t=1}^T$, number of iterations $K$
2: Initialize $\theta^{(0)}$ by randomly sampling from the prior distribution $P_0(\theta)$
3: for $k = 1, \dots, K$ do
4: Draw a random $\theta'$ from a pre-determined proposal distribution $Q(\theta' \mid \theta^{(k-1)})$
5: Compute the acceptance rate $r = \min\left\{ \frac{P_T(\theta')}{P_T(\theta^{(k-1)})}, 1 \right\}$
6: Set $\theta^{(k)} = \theta'$ w.p. $r$ and $\theta^{(k)} = \theta^{(k-1)}$ w.p. $1 - r$
7: end for
8: Output: $\theta^{(K)}$

Algorithm 1 describes a standard Metropolis-Hastings algorithm to sample from the posterior distribution $P_T(\cdot)$. In the numerical experiments, we choose the proposal distribution $Q$ to be a Gaussian random perturbation, i.e., $\theta' = \mathrm{Proj}(\theta^{(k-1)} + \epsilon)$, where $\epsilon$ follows a Gaussian distribution and the projection ensures that the mean-direction component of $\theta'$ stays on the sphere $\mathbb{S}^{n-1}$. For the acceptance rate, as the posterior distribution is not available in closed form, a Monte Carlo subroutine is needed to estimate the ratio.

**Algorithm 2** Simulated Annealing for the δ-Corruption Setting
1: Input: dataset $\mathcal{D}_T = \{(x_t^*, a_t, b_t)\}_{t=1}^T$, margin $\gamma$, number of iterations $K$, interval length $\tau$
2: Initialize the temperature $\eta > 0$ and the reduction rate $c \in (0, 1)$
3: Randomly generate the first estimate $u^{(0)}$
4: for $k = 1, \dots, K$ do
5: if $k \bmod \tau = 0$ then update $\eta \leftarrow c\,\eta$
6: Draw a proposal $u'$ from a pre-determined proposal distribution $Q(u' \mid u^{(k-1)})$
7: Compute the acceptance rate
$$r = \min\left\{ \exp\!\left( \frac{1}{\eta} \left( \sum_{t=1}^T \mathbb{I}_{\mathcal{U}_t(\gamma)}(u') - \sum_{t=1}^T \mathbb{I}_{\mathcal{U}_t(\gamma)}(u^{(k-1)}) \right) \right),\; 1 \right\} \tag{3}$$
8: Set $u^{(k)} = u'$ w.p. $r$ and $u^{(k)} = u^{(k-1)}$ w.p. $1 - r$
9: end for
10: Output: $u^{(K)}$

Algorithm 2 presents a simulated annealing algorithm to solve the optimization problem $\mathrm{OPT}_\delta(\gamma)$ of Section 4. It follows a similar MCMC routine to Algorithm 1, and we use the same Gaussian random perturbation for the proposal distribution $Q$. As the temperature parameter $\eta$ decreases, the sampling distribution of Algorithm 2 gradually concentrates on the optimal solution set of $\mathrm{OPT}_\delta(\gamma)$. Algorithm 2 can be implemented more efficiently than Algorithm 1 in that the ratio calculation in (3) is analytical.
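Below is a minimal runnable sketch of Algorithm 2, assuming `count_consistent` from the earlier sketch; the step size, cooling schedule, and other defaults are our choices, not the paper's.

```python
# Simulated annealing for OPT_delta(gamma); acceptance rate follows (3).
import numpy as np

def simulated_annealing(dataset, n, gamma, K=1000, tau=100,
                        eta=1.0, c=0.9, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n); u /= np.linalg.norm(u)  # random first estimate
    f_u = count_consistent(u, dataset, margin=gamma)
    for k in range(1, K + 1):
        if k % tau == 0:
            eta *= c                                 # cool the temperature
        prop = u + step * rng.normal(size=n)         # Gaussian perturbation
        prop /= np.linalg.norm(prop)                 # project back to S^{n-1}
        f_prop = count_consistent(prop, dataset, margin=gamma)
        r = 1.0 if f_prop >= f_u else np.exp((f_prop - f_u) / eta)  # rate (3)
        if rng.random() < r:
            u, f_u = prop, f_prop
    return u
```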
Table 1 reports numerical results for the two algorithms. For both the Gaussian and δ-corruption settings, we consider three distributions $P_{a,b}$: (i) a uniform distribution where $a \sim \mathrm{Unif}([1, 2]^n)$ and $b \sim \mathrm{Unif}([1, n])$; (ii) a discrete distribution where $a \sim \mathrm{Unif}(\{1, 2\}^n)$ and $b \sim \mathrm{Unif}(\{1, \dots, n\})$; (iii) a fixed-$a$ distribution where $a = (1, \dots, 1)$ and $b \sim \mathrm{Unif}(\{1, \dots, n\})$. For the Gaussian case, the true parameters $(\mu^*, \kappa^*)$ are uniformly generated from $\mathbb{S}^{n-1} \times [1, 10]$, and the accuracy is calculated as $(\mu^{*\top} \hat{x})/(\mu^{*\top} x^*)$, where $\hat{x}$ and $x^*$ are defined as in Corollary 2. For the δ-corruption case, $u^*$ is uniformly generated from $\mathbb{S}^{n-1}$ and $\delta$ is set to 0.1, and the accuracy is calculated as $\mathrm{Acc}(\hat{u})/\mathrm{Acc}(u^*)$, where $\mathrm{Acc}(u)$ is defined in Section 4.

The numbers in Table 1 are averages over 20 simulation trials, and we run both Algorithm 1 and Algorithm 2 for $K = 1000$ iterations. We make the following observations from the numerical experiments. First, the theoretical results in the previous sections provide strong guarantees on the convergence of the posterior distribution, so the deterioration of the algorithm performance for $n = 25$ is caused solely by the inaccuracy of the approximate sampling in Algorithm 1 or Algorithm 2. Such inaccuracy can be mitigated to some extent by a more efficient implementation, such as parallel computing. However, we argue that the performance deterioration as $n$ grows may point to a curse of dimensionality intrinsic to this estimation problem: essentially, we aim to estimate a high-dimensional distribution only through partial information, i.e., the sets $\mathcal{U}_t$'s. On the positive side, the algorithms work well for $n \le 10$, so if the learner has the power of choosing $(a_t, b_t)$, s/he can break the high-dimensional estimation problem into a number of low-dimensional problems by focusing on a handful of dimensions each time. Moreover, we provide a visualization of condition (1) in Figure 2 for $n = 5$, calculated based on simulation. The visualization supports the existence of $L$ and thus the identifiability of the true parameters when the posterior sampling can be carried out accurately.

| Setting | $P_{a,b}$ | $n = 3$ | $n = 5$ | $n = 10$ | $n = 25$ |
|---|---|---|---|---|---|
| Gaussian | (i) | 99.9% | 99.9% | 98.8% | 59.9% |
| Gaussian | (ii) | 99.9% | 99.3% | 96.9% | 56.9% |
| Gaussian | (iii) | 99.9% | 94.8% | 92.6% | 68.5% |
| δ-corruption | (i) | 99.9% | 96.7% | 96.0% | 55.7% |
| δ-corruption | (ii) | 99.6% | 97.9% | 97.6% | 63.1% |
| δ-corruption | (iii) | 99.9% | 98.7% | 87.1% | 58.7% |

*Table 1: Predictive accuracies under the two settings.*

*Figure 2: Visualization of condition (1).*

We conclude our discussion with the following remarks.

**Query-based model with learner-chosen $(a_t, b_t)$:** In this paper, we have focused on the case where the constraints $(a_t, b_t)$'s are stochastically generated. When the concentration parameter $\kappa$ is known in the Gaussian case, there is an efficient way of learning $\mu$ by choosing the $(a_t, b_t)$'s (see the Appendix). In addition, the numerical experiments above also inspire a method that dismantles the high-dimensional estimation problem into a number of low-dimensional problems. Another interesting and important question is whether there exist designs of the $(a_t, b_t)$'s such that the posterior sampling can be carried out more efficiently.

**Multiple constraints and nonlinear utility:** The results in this paper are presented under the setting of a linear objective and a single constraint. We emphasize that the results can be easily generalized to the case of multiple constraints and parameterized nonlinear utility. Thus our result can be viewed as a preliminary effort to address the problem of stochastic inverse optimization. Our conjecture is that when the set $\mathcal{U}_t$ corresponds to a multiple-constraint problem, it may feature more structure and thus facilitate the learning of the utility distribution.

**Choice modeling:** The stochastic utility model in our paper also draws an interesting connection with the literature on choice modeling, which is a pillar of the pricing and assortment problems in revenue management [TVRVR04, GT+19]. For most of the existing choice models, the learning problem can be viewed as a special case of our study by letting $a = (1, \dots, 1)$ and $b = 1$. The results in our paper complement this line of literature by developing a model where customers can make multiple purchases.
## References

[ACD+15] Kareem Amin, Rachel Cummings, Lili Dworkin, Michael Kearns, and Aaron Roth. Online learning and profit maximization from revealed preferences. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

[Afr67] Sydney N Afriat. The construction of utility functions from expenditure data. International Economic Review, 8(1):67-77, 1967.

[AO01] Ravindra K Ahuja and James B Orlin. Inverse optimization. Operations Research, 49(5):771-783, 2001.

[ASS18] Anil Aswani, Zuo-Jun Shen, and Auyon Siddiq. Inverse optimization with noisy data. Operations Research, 66(3):870-892, 2018.

[BDM+14] Maria-Florina Balcan, Amit Daniely, Ruta Mehta, Ruth Urner, and Vijay V Vazirani. Learning economic parameters from revealed preferences. In International Conference on Web and Internet Economics, pages 338-353. Springer, 2014.

[BFL21] Omar Besbes, Yuri Fonseca, and Ilan Lobel. Contextual inverse optimization: Offline and online learning. Available at SSRN 3863366, 2021.

[BGP12] Dimitris Bertsimas, Vishal Gupta, and Ioannis Ch Paschalidis. Inverse optimization: A new perspective on the Black-Litterman model. Operations Research, 60(6):1389-1403, 2012.

[BHP17] John R Birge, Ali Hortaçsu, and J Michael Pavlin. Inverse optimization for the recovery of market structure from market outcomes: An application to the MISO electricity market. Operations Research, 65(4):837-855, 2017.

[BMPS18] Andreas Bärmann, Alexander Martin, Sebastian Pokutta, and Oskar Schneider. An online-learning approach to inverse optimization. arXiv preprint arXiv:1810.12997, 2018.

[BT92] Didier Burton and Ph L Toint. On an instance of the inverse shortest paths problem. Mathematical Programming, 53(1):45-61, 1992.

[BV06] Eyal Beigman and Rakesh Vohra. Learning from revealed preference. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 36-42, 2006.

[CDBW21] Minwoo Chae, Pierpaolo De Blasi, and Stephen G Walker. Posterior asymptotics in Wasserstein metrics on the real line. Electronic Journal of Statistics, 15(2):3635-3677, 2021.

[CKK20] Violet Xinying Chen and Fatma Kılınç-Karzan. Online convex optimization perspective for learning from dynamically revealed preferences. arXiv preprint arXiv:2008.10460, 2020.

[DCZ18] Chaosheng Dong, Yiran Chen, and Bo Zeng. Generalized inverse optimization through online learning. Advances in Neural Information Processing Systems, 31, 2018.

[DRS+18] Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 55-70, 2018.

[DZ20] Chaosheng Dong and Bo Zeng. Expert learning through generalized inverse multiobjective optimization: Models, insights, and algorithms. In International Conference on Machine Learning, pages 2648-2657. PMLR, 2020.

[GGVDV00] Subhashis Ghosal, Jayanta K Ghosh, and Aad W van der Vaart. Convergence rates of posterior distributions. Annals of Statistics, pages 500-531, 2000.

[GT+19] Guillermo Gallego, Huseyin Topaloglu, et al. Revenue Management and Pricing Analytics, volume 209. Springer, 2019.

[KWB11] Arezou Keshavarz, Yang Wang, and Stephen Boyd. Imputing a convex objective function. In 2011 IEEE International Symposium on Intelligent Control, pages 613-619. IEEE, 2011.

[MESAHK18] Peyman Mohajerin Esfahani, Soroosh Shafieezadeh-Abadeh, Grani A Hanasusanto, and Daniel Kuhn. Data-driven inverse optimization with imperfect information. Mathematical Programming, 167(1):191-234, 2018.

[Sam48] Paul A Samuelson. Consumption theory in terms of revealed preference. Economica, 15(60):243-253, 1948.

[TVRVR04] Kalyan T Talluri and Garrett Van Ryzin. The Theory and Practice of Revenue Management, volume 1. Springer, 2004.

[Var06] Hal R Varian. Revealed preference. Samuelsonian Economics and the Twenty-First Century, pages 99-115, 2006.

[ZL96] Jianzhong Zhang and Zhenhong Liu. Calculating some inverse linear programming problems. Journal of Computational and Applied Mathematics, 72(2):261-273, 1996.

[ZR12] Morteza Zadimoghaddam and Aaron Roth. Efficiently learning from revealed preference. In International Workshop on Internet and Network Economics, pages 114-127. Springer, 2012.