# Generalized Inverse Optimization through Online Learning

Chaosheng Dong, Department of Industrial Engineering, University of Pittsburgh, chaosheng@pitt.edu
Yiran Chen, Department of Electrical and Computer Engineering, Duke University, yiran.chen@duke.edu
Bo Zeng, Department of Industrial Engineering, University of Pittsburgh, bzeng@pitt.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

**Abstract** Inverse optimization is a powerful paradigm for learning preferences and restrictions that explain the behavior of a decision maker, based on a set of observed (external signal, decision) pairs. However, most inverse optimization algorithms are designed for the batch setting, where all the data is available in advance. As a consequence, these methods have rarely been used in an online setting suitable for real-time applications. In this paper, we propose a general framework for inverse optimization through online learning. Specifically, we develop an online learning algorithm that uses an implicit update rule and can handle noisy data. Moreover, under additional regularity assumptions on the data and the model, we prove that our algorithm converges at a rate of $O(1/\sqrt{T})$ and is statistically consistent. Our experiments show that the online learning approach learns the parameters with great accuracy, is very robust to noise, and achieves a dramatic improvement in computational efficiency over the batch learning approach.

## 1 Introduction

Possessing the ability to elicit customers' preferences and restrictions (PR) is crucial to the success of an organization in designing and providing services or products. Nevertheless, in most scenarios one can only observe customers' decisions or behaviors in response to external signals, without direct access to their decision making schemes. Indeed, decision makers probably do not have exact information regarding their own decision making process [1]. To bridge that discrepancy, inverse optimization has been proposed and has received significant research attention: it infers or learns the missing information of the underlying decision models from observed data, assuming that human decision makers are rationally making decisions [2, 3, 4, 5, 1, 6, 7, 8, 9, 10, 11]. Nowadays, extending from its initial form that only considers a single observation with clean data [2, 3, 4, 5], inverse optimization has been further developed and applied to handle more realistic cases that have many observations with noisy data [1, 6, 7, 9, 10, 11].

*Figure 1: An overview of inverse optimization through batch learning versus through online learning. Left: framework of inverse optimization in the batch setting. Right: framework of the generalized inverse optimization in the online setting proposed in this paper.*

Despite these remarkable achievements, traditional inverse optimization (typically in the batch setting) has not proven fully applicable for supporting recent attempts in AI to automate the elicitation of a human decision maker's PR in real time. Consider, for example, recommender systems (RSs) used by online retailers to increase product sales. The RSs first elicit one customer's PR from the historical sequence of her purchasing behaviors, and then make predictions about her future shopping actions. Indeed, building RSs for online retailers is challenging because of the sparsity issue.
Given the large number of products available, a customer's shopping vector, each element of which represents the quantity of one product purchased, is highly sparse. Moreover, the shift of the customer's shopping behavior along with external signals (e.g., price, season) aggravates the sparsity issue. Therefore, it is particularly important for RSs to have access to large data sets to perform accurate elicitation [12]. Considering the complexity of the inverse optimization problem (IOP), it is extremely difficult and time-consuming to extract a user's PR from large, noisy data sets using conventional techniques. Thus, incorporating traditional inverse optimization into RSs is impractical for real-time elicitation of a user's PR.

To automate the elicitation of a human decision maker's PR, we aim to unlock the potential of inverse optimization through online learning in this paper. Specifically, we formulate such a learning problem as an IOP considering noisy data, and develop an online learning algorithm to derive unknown parameters occurring in either the objective function or the constraints. At the heart of our algorithm is taking inverse optimization with a single observation as a subroutine to define an implicit update rule. Through such an implicit rule, our algorithm can rapidly incorporate sequentially arriving observations into the model, without keeping them in memory. Indeed, we provide a general mechanism for the incremental elicitation, revision and reuse of the inference about a decision maker's PR.

**Related work.** Our work is most closely related to inverse optimization with multiple observations, whose goal is to find an objective function or constraints that explain the observations well. This subject carries a data-driven flavor and becomes more applicable as large amounts of data are generated and become readily available, especially those from digital devices and online transactions. Solution methods in the batch setting for this type of IOP include convex optimization approaches [1, 13, 10] and non-convex optimization approaches [7]. The former often yields incorrect inferences of the parameters [7], while the latter is known to lead to intractable programs [10]. In contrast, we perform inverse optimization in the online setting, and the proposed online learning algorithm significantly accelerates the learning process with performance guarantees, allowing us to deal with more realistic and complex PR elicitation problems. Also related to our work is [6], which develops an online learning method to infer the utility function from sequentially arriving observations. They prove a different regret bound for that method under certain conditions, and demonstrate its applicability to both continuous and discrete decisions. However, their approach applies only when the utility function is linear and the data is assumed to be noiseless. In contrast, our approach makes no such assumption and only requires convexity of the underlying decision making problem. Besides the regret bound, we also show the statistical consistency of our algorithm by combining the consistency result proven in [7] with the regret bound provided in this paper, which guarantees that our algorithm asymptotically achieves the best prediction error permitted by the inverse model we consider.

**Our contributions.** To the best of the authors' knowledge, we propose the first general framework for eliciting a decision maker's PR using inverse optimization through online learning.
This framework can learn general convex utility functions and constraints from observed (signal, noisy decision) pairs. In Figure 1, we compare inverse optimization through batch learning versus through online learning. Moreover, we prove that the online learning algorithm, which adopts an implicit update rule, has an $O(\sqrt{T})$ regret under certain regularity conditions. In addition, this algorithm is statistically consistent when the data satisfies some rather common conditions, which guarantees that it asymptotically achieves the best prediction error permitted by the inverse model we consider. Numerical results show that our algorithm learns the parameters with great accuracy, is robust to noise even if some assumptions do not hold, and achieves a dramatic improvement over the batch learning approach in computational efficiency.

## 2 Problem setting

### 2.1 Decision making problem

We consider a family of parameterized decision making problems, in which $x \in \mathbb{R}^n$ is the decision variable, $u \in \mathcal{U} \subseteq \mathbb{R}^m$ is the external signal, and $\theta \in \Theta \subseteq \mathbb{R}^p$ is the parameter:

$$
\min_{x \in \mathbb{R}^n} \; f(x, u, \theta) \quad \text{s.t.} \quad g(x, u, \theta) \leq 0, \tag{DMP}
$$

where $f : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R}$ is a real-valued function and $g : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R}^q$ is a vector-valued function. We denote by $X(u, \theta) = \{x \in \mathbb{R}^n : g(x, u, \theta) \leq 0\}$ the feasible region of the DMP, and by $S(u, \theta) = \arg\min\{f(x, u, \theta) : x \in X(u, \theta)\}$ its optimal solution set.

### 2.2 Inverse optimization and online setting

Consider a learner who monitors the signal $u \in \mathcal{U}$ and the decision maker's decision $x \in S(u, \theta)$ in response to $u$. We assume that the learner does not know the decision maker's utility function or constraints in the DMP. Since the observed decision might carry measurement error or be generated with bounded rationality of the decision maker, i.e., be suboptimal, we denote by $y$ the observed noisy decision for $u \in \mathcal{U}$. Note that $y$ does not necessarily belong to $X(u, \theta)$, i.e., it might be infeasible with respect to $X(u, \theta)$. Throughout the paper, we assume that the (signal, noisy decision) pair $(u, y)$ is distributed according to some unknown distribution $\mathbb{P}$ supported on $\{(u, y) : u \in \mathcal{U}, y \in \mathcal{Y}\}$.

In our inverse optimization model, the learner aims to learn the decision maker's objective function or constraints from (signal, noisy decision) pairs. More precisely, the goal of the learner is to estimate the parameter $\theta$ of the DMP. In our online setting, the (signal, noisy decision) pairs become available to the learner one by one. Hence, the learning algorithm produces a sequence of hypotheses $(\theta_1, \ldots, \theta_{T+1})$. Here, $T$ is the total number of rounds, $\theta_1$ is an arbitrary initial hypothesis, and $\theta_t$ for $t \geq 2$ is the hypothesis chosen after observing the $(t-1)$th (signal, noisy decision) pair. Let $l(y_t, u_t, \theta_t)$ denote the loss the learning algorithm suffers when it tries to predict the $t$th decision given $u_t$ based on $\{(u_1, y_1), \ldots, (u_{t-1}, y_{t-1})\}$. The goal of the learner is to minimize the regret, i.e., the cumulative loss $\sum_{t \in [T]} l(y_t, u_t, \theta_t)$ compared against the smallest possible loss when the whole batch of (signal, noisy decision) pairs is available. Formally, the regret is defined as

$$
R_T = \sum_{t \in [T]} l(y_t, u_t, \theta_t) - \min_{\theta \in \Theta} \sum_{t \in [T]} l(y_t, u_t, \theta).
$$

In the following, we make a few assumptions to simplify our exposition; they are mild and frequently appear in the inverse optimization literature [1, 13, 10, 7].

**Assumption 2.1.** The set $\Theta$ is convex and compact, and there exists $D > 0$ such that $\|\theta\|_2 \leq D$ for all $\theta \in \Theta$. In addition, for each $u \in \mathcal{U}$ and $\theta \in \Theta$, both $f(x, u, \theta)$ and $g(x, u, \theta)$ are convex in $x$.
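As a small illustration of this notation (a toy instance chosen for exposition, not one of the experiments in Section 4), let $n = m = p = 1$, $q = 2$, and take

$$
f(x, u, \theta) = \tfrac{1}{2}(x - \theta u)^2, \qquad g(x, u, \theta) = \begin{pmatrix} -x \\ x - 1 \end{pmatrix},
$$

so that $X(u, \theta) = [0, 1]$ and, since $f$ is strongly convex in $x$, $S(u, \theta)$ is the singleton $\{\min(\max(\theta u, 0), 1)\}$. An observed pair $(u, y)$ with $y \notin [0, 1]$ is still admissible data: the loss introduced in Section 3.1 measures the distance from $y$ to $S(u, \theta)$ rather than feasibility.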
## 3 Learning the parameters

### 3.1 The loss function

Different loss functions that capture the mismatch between predictions and observations have been used in the inverse optimization literature. In particular, the (squared) distance between the observed decision and the predicted decision enjoys a direct physical meaning, and thus is most widely used [14, 15, 16, 7]. Hence, we take the (squared) distance as our loss function in this paper. In the batch setting, statistical properties of inverse optimization with such a loss function have been analyzed extensively in [7]; here we focus on the performance in the online setting. Given a (signal, noisy decision) pair $(u, y)$ and a hypothesis $\theta$, we define the loss as the minimum (squared) distance between $y$ and the optimal solution set $S(u, \theta)$:

$$
l(y, u, \theta) = \min_{x \in S(u, \theta)} \|y - x\|_2^2. \tag{Loss Function}
$$

### 3.2 Online implicit updates

Once the $t$th (signal, noisy decision) pair $(u_t, y_t)$ is received, $\theta_{t+1}$ can be obtained by solving the following optimization problem:

$$
\theta_{t+1} = \arg\min_{\theta \in \Theta} \; \tfrac{1}{2}\|\theta - \theta_t\|_2^2 + \eta_t\, l(y_t, u_t, \theta), \tag{1}
$$

where $\eta_t$ is the learning rate in round $t$ and $l(y_t, u_t, \theta)$ is defined in (Loss Function). The update rule (1) balances a tradeoff between "conservativeness" and "correctiveness": the first term characterizes how conservative we are in keeping the current estimate, while the second term indicates how correctively we would like to modify it given the new observation. As there is no closed form for $\theta_{t+1}$ in general, we call (1) an implicit update rule [17, 18].

To solve (1), we can replace $x \in S(u, \theta)$ by the KKT conditions (or other optimality conditions) of the DMP and obtain a mixed integer nonlinear program. Consider, for example, a decision making problem that is a quadratic program:

$$
\min_{x \in \mathbb{R}^n} \; \tfrac{1}{2}x^T Q x + c^T x \quad \text{s.t.} \quad Ax \geq b. \tag{QP}
$$

Suppose that $b$ changes over time $t$; that is, $b$ is the external signal for (QP) and equals $b_t$ at time $t$. If we seek to learn $c$, the optimal solution set of (QP) can be characterized through the KKT conditions as

$$
S(b_t) = \{x : Ax \geq b_t, \; u \in \mathbb{R}^m_+, \; u^T(Ax - b_t) = 0, \; Qx + c - A^T u = 0\},
$$

where $u$ is the dual variable of the constraints. Then, the single-level reformulation of the update rule (1) is

$$
\begin{aligned}
\min_{c \in \Theta} \quad & \tfrac{1}{2}\|c - c_t\|_2^2 + \eta_t \|y_t - x\|_2^2 \\
\text{s.t.} \quad & Ax \geq b_t, \quad u \leq Mz, \quad Ax - b_t \leq M(\mathbf{1} - z), \\
& Qx + c - A^T u = 0, \\
& c \in \mathbb{R}^n, \; x \in \mathbb{R}^n, \; u \in \mathbb{R}^m_+, \; z \in \{0, 1\}^m,
\end{aligned} \tag{IQP}
$$

where $z$ is a binary variable used to linearize the KKT complementarity conditions, and $M$ is an appropriately large constant used to bound the dual variable $u$ and the slack $Ax - b_t$. Clearly, (IQP) is a mixed integer second order conic program (MISOCP). More examples are given in the supplementary material.

Our application of the implicit updates to learn the parameter of the DMP proceeds as in Algorithm 1.

```
Algorithm 1: Implicit Online Learning for Generalized Inverse Optimization
 1: Input: (signal, noisy decision) pairs {(u_t, y_t)}_{t in [T]}
 2: Initialization: theta_1 could be an arbitrary hypothesis of the parameter
 3: for t = 1 to T do
 4:     receive (u_t, y_t)
 5:     suffer loss l(y_t, u_t, theta_t)
 6:     if l(y_t, u_t, theta_t) = 0 then
 7:         theta_{t+1} <- theta_t
 8:     else
 9:         set learning rate eta_t proportional to 1/sqrt(t)
10:         update theta_{t+1} = argmin_{theta in Theta} 1/2 ||theta - theta_t||_2^2 + eta_t * l(y_t, u_t, theta)   (solve (1))
11:     end if
12: end for
```

**Remark 3.1.** (i) In Algorithm 1, we let $\theta_{t+1} = \theta_t$ if the prediction error $l(y_t, u_t, \theta_t)$ is zero. In practice, we can instead set a threshold $\epsilon > 0$ and let $\theta_{t+1} = \theta_t$ once $l(y_t, u_t, \theta_t) < \epsilon$. (ii) Normalization of $\theta_{t+1}$ is needed in some situations to eliminate the impact of trivial solutions. (iii) *Minibatches.* One technique to enhance online learning is to consider multiple observations per update. In our framework, this means computing $\theta_{t+1}$ using $|N_t| > 1$ noisy decisions in (1).
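As an illustration of this reformulation, one update (1) for the (QP) example can be prototyped as follows. This is a sketch only: the function name `implicit_update_qp`, the box bounds standing in for $\Theta$, the value of $M$, and the use of Python/cvxpy with a mixed-integer-capable solver (e.g., Gurobi through `cp.GUROBI`) are illustrative assumptions; the experiments in Section 4 are implemented in Julia with Gurobi.

```python
import numpy as np
import cvxpy as cp

def implicit_update_qp(c_prev, Q, A, b_t, y_t, eta_t, c_box=(0.0, 5.0), M=1e2):
    """One implicit update (1) for  min 1/2 x'Qx + c'x  s.t.  Ax >= b_t,
    with S(b_t) replaced by its KKT system and the complementarity conditions
    linearized by binaries z and a big-M constant (the IQP reformulation)."""
    m, n = A.shape
    c = cp.Variable(n)                    # hypothesis being updated
    x = cp.Variable(n)                    # predicted decision (a KKT point)
    u = cp.Variable(m, nonneg=True)       # dual multipliers of Ax >= b_t
    z = cp.Variable(m, boolean=True)      # complementarity indicators
    constraints = [
        A @ x >= b_t,                     # primal feasibility
        Q @ x + c - A.T @ u == 0,         # stationarity
        u <= M * z,                       # u_i > 0  forces  z_i = 1
        A @ x - b_t <= M * (1 - z),       # z_i = 1  forces constraint i active
        c >= c_box[0], c <= c_box[1],     # box standing in for Theta
    ]
    objective = cp.Minimize(0.5 * cp.sum_squares(c - c_prev)
                            + eta_t * cp.sum_squares(y_t - x))
    cp.Problem(objective, constraints).solve(solver=cp.GUROBI)  # any MIQP solver
    return c.value
```

In Algorithm 1 this routine would be called once per round with $\eta_t \propto 1/\sqrt{t}$; $M$ must be chosen large enough to bound the dual variables and slacks, otherwise the reformulation may cut off valid KKT points.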
**Remark 3.2.** To obtain a strong initialization of $\theta$ in Algorithm 1, we can incorporate an idea from [1], which imputes a convex objective function by minimizing the residuals of the KKT conditions incurred by the noisy data. Assume we have a historical data set $\tilde{\mathcal{T}}$, which may be of poor quality for the current learning task. This leads to the following initialization problem:

$$
\begin{aligned}
\min_{\theta \in \Theta} \quad & \frac{1}{|\tilde{\mathcal{T}}|} \sum_{t \in \tilde{\mathcal{T}}} \big( r_c^t + r_s^t \big) \\
\text{s.t.} \quad & |u_t^T g(y_t, u_t, \theta)| \leq r_c^t, && \forall t \in \tilde{\mathcal{T}}, \\
& \|\nabla_x f(y_t, u_t, \theta) + u_t^T \nabla_x g(y_t, u_t, \theta)\|_2 \leq r_s^t, && \forall t \in \tilde{\mathcal{T}}, \\
& u_t \in \mathbb{R}^q_+, \; r_c^t \in \mathbb{R}_+, \; r_s^t \in \mathbb{R}_+, && \forall t \in \tilde{\mathcal{T}},
\end{aligned} \tag{2}
$$

where $r_c^t$ and $r_s^t$ are residuals corresponding to complementary slackness and stationarity in the KKT conditions for the $t$th noisy decision $y_t$, and $u_t$ is the dual variable corresponding to the constraints in the DMP. Note that (2) is a convex program; it can be solved quite efficiently compared to solving the inverse optimization problem in the batch setting [7]. Other initialization approaches based on similar ideas, e.g., computing a variational-inequality-based approximation of the inverse model [13], can also be incorporated into our algorithm.
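A sketch of the initialization problem (2), specialized to the (QP) example of Section 3.2, is given below. Again the helper name, the box standing in for $\Theta$, and the use of cvxpy are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
import cvxpy as cp

def initialize_c(Q, A, b_hist, y_hist, c_box=(0.0, 5.0)):
    """Warm start for the (QP) example: impute c by minimizing the average
    KKT residuals over a historical batch {(b_t, y_t)}, as in problem (2).
    r_c is the complementary-slackness residual, r_s the stationarity residual."""
    m, n = A.shape
    c = cp.Variable(n)
    constraints = [c >= c_box[0], c <= c_box[1]]
    total_residual = 0
    for b_t, y_t in zip(b_hist, y_hist):
        u = cp.Variable(m, nonneg=True)    # dual estimate for this observation
        r_c = cp.Variable(nonneg=True)
        r_s = cp.Variable(nonneg=True)
        constraints += [
            cp.abs(cp.sum(cp.multiply(u, A @ y_t - b_t))) <= r_c,
            cp.norm(Q @ y_t + c - A.T @ u, 2) <= r_s,
        ]
        total_residual += r_c + r_s
    problem = cp.Problem(cp.Minimize(total_residual / len(y_hist)), constraints)
    problem.solve()                        # a convex (second order cone) program
    return c.value
```

Solving a problem of this form on a historical batch and using the minimizer as $\theta_1$ corresponds to the Warm-start variant referred to in Section 4.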
### 3.3 Theoretical analysis

Note that the implicit online learning algorithm is generally applicable for learning the parameter of any convex DMP. In this section, we prove that the average regret $R_T/T$ converges at a rate of $O(1/\sqrt{T})$ under certain regularity conditions. Furthermore, we show that the proposed algorithm is statistically consistent when the data satisfies some common regularity conditions. We begin by introducing a few assumptions that are rather common in the literature [1, 13, 10, 7].

**Assumption 3.1.** (a) For each $u \in \mathcal{U}$ and $\theta \in \Theta$, $X(u, \theta)$ is closed and has a nonempty relative interior. $X(u, \theta)$ is also uniformly bounded; that is, there exists $B > 0$ such that $\|x\|_2 \leq B$ for all $x \in X(u, \theta)$. (b) $f(x, u, \theta)$ is $\lambda$-strongly convex in $x$ on $\mathcal{Y}$ for fixed $u \in \mathcal{U}$ and $\theta \in \Theta$; that is, for all $x, y \in \mathcal{Y}$,
$$
f(y, u, \theta) \geq f(x, u, \theta) + \nabla_x f(x, u, \theta)^T (y - x) + \lambda \|x - y\|_2^2.
$$

**Remark 3.3.** A strongly convex program has a unique optimal solution; therefore, Assumption 3.1(b) ensures that $S(u, \theta)$ is a single-valued set for each $u \in \mathcal{U}$. However, $S(u, \theta)$ might be multivalued for a general convex DMP with fixed $u$. Consider, for example, $\min_{x_1, x_2}\{x_1 + x_2 : x_1 + x_2 \geq 1\}$: all points on the line $x_1 + x_2 = 1$ are optimal. Such cases are quite common when there are many variables and constraints, and this is one of the major challenges when learning the parameters of a function that is not strongly convex using inverse optimization.

For convenience of analysis, we assume below that we seek to learn the objective function while the constraints are known. The performance of Algorithm 1 then also depends on how a change of $\theta$ affects the objective values. For $x \in \mathcal{Y}$, $u \in \mathcal{U}$, $\theta_1, \theta_2 \in \Theta$, we consider the difference function
$$
h(x, u, \theta_1, \theta_2) = f(x, u, \theta_1) - f(x, u, \theta_2). \tag{3}
$$

**Assumption 3.2.** There exists $\kappa > 0$ such that for all $u \in \mathcal{U}$ and $\theta_1, \theta_2 \in \Theta$, $h(\cdot, u, \theta_1, \theta_2)$ is Lipschitz continuous on $\mathcal{Y}$:
$$
|h(x, u, \theta_1, \theta_2) - h(y, u, \theta_1, \theta_2)| \leq \kappa \|\theta_1 - \theta_2\|_2 \, \|x - y\|_2, \quad \forall x, y \in \mathcal{Y}.
$$
Basically, this assumption says that the objective function does not change very much when either the parameter $\theta$ or the variable $x$ is perturbed. It holds in many common situations, including linear and quadratic programs.

**Lemma 3.1.** Under Assumptions 2.1–3.2, the loss function $l(y, u, \theta)$ is uniformly $\frac{4(B+R)\kappa}{\lambda}$-Lipschitz continuous in $\theta$. That is, for all $y \in \mathcal{Y}$, $u \in \mathcal{U}$, $\theta_1, \theta_2 \in \Theta$,
$$
|l(y, u, \theta_1) - l(y, u, \theta_2)| \leq \frac{4(B+R)\kappa}{\lambda} \|\theta_1 - \theta_2\|_2.
$$

The proof of Lemma 3.1 rests on the key observation that the perturbation of $S(u, \theta)$ induced by a perturbation of $\theta$ is bounded by the perturbation of $\theta$, obtained by applying Proposition 6.1 in [19]. Details are given in the supplementary material.

**Remark 3.4.** When we seek to learn the constraints, or to jointly learn the constraints and the objective function, a similar result can be established by applying Proposition 4.47 in [20], provided we restrict not only the Lipschitz continuity of the difference function in (3) but also the Lipschitz continuity of the distance between the feasible sets $X(u, \theta_1)$ and $X(u, \theta_2)$ (see Remark 4.40 in [20]).

**Assumption 3.3.** For the DMP, for all $y \in \mathcal{Y}$, $u \in \mathcal{U}$, $\theta_1, \theta_2 \in \Theta$ and $\alpha, \beta \geq 0$ with $\alpha + \beta = 1$, we have
$$
\big\|\alpha S(u, \theta_1) + \beta S(u, \theta_2) - S(u, \alpha\theta_1 + \beta\theta_2)\big\|_2 \leq \frac{\alpha\beta \, \|S(u, \theta_1) - S(u, \theta_2)\|_2^2}{2(B+R)}.
$$
Essentially, this assumption requires that the distance between $S(u, \alpha\theta_1 + \beta\theta_2)$ and the convex combination of $S(u, \theta_1)$ and $S(u, \theta_2)$ be small whenever $S(u, \theta_1)$ and $S(u, \theta_2)$ are close. An example showing that this assumption can be satisfied is provided in the supplementary material. Yet, we note that it is probably restrictive and hard to verify in general.

Let $\theta^*$ be an optimal inference to $\min_{\theta \in \Theta} \frac{1}{T}\sum_{t \in [T]} l(y_t, u_t, \theta)$, i.e., an inference derived with the whole batch of observations available. Then, the following theorem asserts that the regret $R_T = \sum_{t \in [T]} \big( l(y_t, u_t, \theta_t) - l(y_t, u_t, \theta^*) \big)$ of the implicit online learning algorithm is of $O(\sqrt{T})$.

**Theorem 3.2 (Regret bound).** Suppose Assumptions 2.1–3.3 hold. Then, choosing $\eta_t \propto \frac{D\lambda}{\sqrt{t}}$, we have $R_T = O(\sqrt{T})$.

**Remark 3.5.** We establish the above regret bound by extending Theorem 3.2 in [18]. Our extension involves several critical and rather involved analyses of the structure of the optimal solution set $S(u, \theta)$ as well as of the loss function, which are essential to our theoretical understanding. Moreover, we relax the smoothness requirement on the loss function in that theorem to Lipschitz continuity through an argument similar to Lemma 1 of [21] and [22].

By applying both Theorem 3 in [7] and the regret bound proved in Theorem 3.2, we show the risk consistency of the online learning algorithm in the sense that the average cumulative loss converges in probability to the true risk in the batch setting.

**Theorem 3.3 (Risk consistency).** Let $\theta_0 = \arg\min_{\theta \in \Theta} \mathbb{E}[l(y, u, \theta)]$ be the optimal solution that minimizes the true risk in the batch setting. Suppose the conditions of Theorem 3.2 hold and $\mathbb{E}[\|y\|_2^2] < \infty$. Then, choosing $\eta_t \propto \frac{D\lambda}{\sqrt{t}}$,
$$
\frac{1}{T}\sum_{t \in [T]} l(y_t, u_t, \theta_t) \;\xrightarrow{\,p\,}\; \mathbb{E}\big[l(y, u, \theta_0)\big].
$$

**Corollary 3.3.1.** Suppose that the true parameter $\theta_{\mathrm{true}} \in \Theta$, and that $y = x + \epsilon$, where $x \in S(u, \theta_{\mathrm{true}})$ for some $u \in \mathcal{U}$, $\mathbb{E}[\epsilon] = 0$, $\mathbb{E}[\epsilon^T\epsilon] < \infty$, and $u, x$ are independent of $\epsilon$. Let the conditions of Theorem 3.2 hold. Then, choosing $\eta_t \propto \frac{D\lambda}{\sqrt{t}}$,
$$
\frac{1}{T}\sum_{t \in [T]} l(y_t, u_t, \theta_t) \;\xrightarrow{\,p\,}\; \mathbb{E}[\epsilon^T\epsilon].
$$

**Remark 3.6.** (i) Theorem 3.3 guarantees that the online learning algorithm proposed in this paper asymptotically achieves the best prediction error permitted by the inverse model we consider. (ii) Corollary 3.3.1 suggests that some prediction error is inevitable as long as the data carries noise; in the long run, however, this prediction error is caused merely by the noisiness of the data.
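The limit in Corollary 3.3.1 can be read off from Assumption 3.1(b): since $f$ is strongly convex in $x$, $S(u, \theta_{\mathrm{true}})$ is a singleton $\{x\}$ and $y = x + \epsilon$, so

$$
l(y, u, \theta_{\mathrm{true}}) = \min_{x' \in S(u, \theta_{\mathrm{true}})} \|y - x'\|_2^2 = \|y - x\|_2^2 = \epsilon^T \epsilon,
\qquad \text{hence} \qquad
\mathbb{E}\big[l(y, u, \theta_0)\big] \leq \mathbb{E}\big[l(y, u, \theta_{\mathrm{true}})\big] = \mathbb{E}[\epsilon^T\epsilon].
$$

Combined with Theorem 3.3, the benchmark risk to which the average cumulative loss converges is exactly the noise level, consistent with Remark 3.6(ii).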
## 4 Applications to learning problems in IOP

In this section, we provide sketches of representative applications for inferring objective functions and constraints using the proposed online learning algorithm. Our preliminary experiments were run on the Bridges system at the Pittsburgh Supercomputing Center (PSC) [23]. The mixed integer second order conic programs, which arise from using the KKT conditions in (1), are solved by Gurobi. All the algorithms are programmed in Julia [24].

*Figure 2: Learning the utility function over T = 1000 rounds. (a) We run 100 repetitions of the experiment using Algorithm 1 with two settings. Cold-start means that we initialize r as a vector of zeros; Warm-start means that we initialize r by solving (2) with 1000 (price, noisy decision) pairs. We plot the estimation errors over round t in pink and brown for all 100 repetitions, and the average estimation errors of the 100 repetitions as a red line and a dashed brown line, respectively. (b) The dotted brown line is the error bar plot of the average running time over 10 repetitions in the batch setting; the blue line is the error bar plot of the average running time over 100 repetitions in the online setting. Here, the error bar is [mean − std, mean + std]. (c) We randomly pick one repetition. The loss per round is indicated by dots and the average cumulative loss by the line. The dotted line indicates the variance of the noise; here, $\mathbb{E}[\epsilon^T\epsilon] = 0.2083$.*

### 4.1 Learning consumer behavior

We now study the consumer behavior problem in a market with $n$ products. The prices of the products are denoted by $p_t \in \mathbb{R}^n_+$, which vary over time $t \in [T]$. We assume throughout that the consumer has a rational preference relation, and we take $u$ to be the utility function representing these preferences. The consumer's decision making problem of choosing her most preferred consumption bundle $x$, given the price vector $p_t$ and budget $b$, can be stated as the following utility maximization problem (UMP) [25]:

$$
\max_{x \in \mathbb{R}^n_+} \; u(x) \quad \text{s.t.} \quad p_t^T x \leq b, \tag{UMP}
$$

where $p_t^T x \leq b$ is the budget constraint at time $t$. For this application, we consider a concave quadratic representation of $u(x)$, that is, $u(x) = \frac{1}{2}x^T Q x + r^T x$, where $Q \in \mathbb{S}^n_-$ (the set of symmetric negative semidefinite matrices) and $r \in \mathbb{R}^n$. We consider a problem with $n = 10$ products and budget $b = 40$. $Q$ and $r$ are randomly generated and are given in the supplementary material. Suppose prices change over $T$ rounds. In each round, the learner receives one (price, noisy decision) pair $(p_t, y_t)$; her goal is to learn the utility function or the budget of the consumer.

The (price, noisy decision) pair in each round is generated as follows. In round $t$, we generate the prices from a uniform distribution, i.e., $p_t^i \sim U[p_{\min}, p_{\max}]$ with $p_{\min} = 5$ and $p_{\max} = 25$. Then, we solve (UMP) and obtain the optimal decision $x_t$. Next, the noisy decision $y_t$ is obtained by corrupting $x_t$ with noise that has a jointly uniform distribution with support $[-0.25, 0.25]^{10}$; namely, $y_t = x_t + \epsilon_t$, where each element of $\epsilon_t \sim U(-0.25, 0.25)$.
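For concreteness, the data-generating process just described can be sketched as follows; the function name and the use of Python/cvxpy are illustrative assumptions (the reported experiments use Julia with Gurobi), and `Q` must be a (numerically) negative semidefinite matrix for the problem to be concave.

```python
import numpy as np
import cvxpy as cp

def generate_pair(Q, r, b, n=10, p_min=5.0, p_max=25.0, noise=0.25, rng=None):
    """Generate one (price, noisy decision) pair for the UMP experiment:
    draw prices uniformly, solve  max 1/2 x'Qx + r'x  s.t.  p'x <= b, x >= 0
    (Q negative semidefinite, so the objective is concave), then corrupt the
    optimal bundle with element-wise uniform noise on [-noise, noise]."""
    rng = np.random.default_rng() if rng is None else rng
    p_t = rng.uniform(p_min, p_max, size=n)          # external signal (prices)
    x = cp.Variable(n, nonneg=True)
    objective = cp.Maximize(0.5 * cp.quad_form(x, Q) + r @ x)
    cp.Problem(objective, [p_t @ x <= b]).solve()
    y_t = x.value + rng.uniform(-noise, noise, size=n)  # noisy observed decision
    return p_t, y_t
```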
**Learning the utility function.** In the first set of experiments, the learner seeks to learn $r$ given $\{(p_t, y_t)\}_{t \in [T]}$, which arrives sequentially over $T = 1000$ rounds. We assume that $r$ lies in $[0, 5]^{10}$. The learning rate is set to $\eta_t = 5/\sqrt{t}$. We then implement Algorithm 1 with two settings and report the results in Figure 2. As can be seen in Figure 2a, solving the initialization problem provides quite good initial estimates of $r$, and Algorithm 1 with Warm-start converges faster than with Cold-start. Note that (2) is a convex program and the time to solve it is negligible within Algorithm 1; thus, the running times with and without Warm-start are roughly the same. This suggests that one might prefer Algorithm 1 with Warm-start when a relatively good estimate of the parameters is needed within a few iterations. However, as shown in the figure, both settings return very similar estimates of $r$ in the long run. For consistency, we use Algorithm 1 with Cold-start in the remaining experiments. We can also see that the estimation errors over rounds for different repetitions concentrate around the average, indicating that our algorithm is quite robust to noise. Moreover, Figure 2b shows that inverse optimization in the online setting is drastically faster than in the batch setting. This also suggests that a windowing approach to inverse optimization might be practically infeasible, since it fails even with a small subset of the data, such as a window size of 10. We then randomly pick one repetition and plot the loss per round and the average cumulative loss in Figure 2c. We see clearly that the average cumulative loss asymptotically converges to the variance of the noise. This makes sense because the loss merely reflects the noise in the data once the estimate converges to the true value, as stated in Remark 3.6.

*Figure 3: Learning the budget over T = 1000 rounds. (a) We run 100 repetitions of the experiment. We plot the estimation error over round t for all 100 repetitions in pink, and the average estimation error of the 100 repetitions in red. (b) The dotted brown line is the error bar plot of the average running time over 10 repetitions in the batch setting; the blue line is the error bar plot of the average running time over 100 repetitions in the online setting. (c) We randomly pick one repetition. The loss per round is indicated by dots and the average cumulative loss by the line. The dotted line is a reference line indicating the variance of the noise; here, $\mathbb{E}[\epsilon^T\epsilon] = 0.2083$.*

**Learning the budget.** In the second set of experiments, the learner seeks to learn the budget $b$ over $T = 1000$ rounds. We assume that $b$ lies in $[0, 100]$. The learning rate is set to $\eta_t = 100/\sqrt{t}$, and we apply Algorithm 1 with Cold-start. The results are shown in Figure 3. The analysis of the results for learning the utility function applies here as well. One thing to emphasize is that learning the budget is much faster than learning the utility function, as shown in Figures 2b and 3b. The main reason is that the budget $b$ is one-dimensional, while the utility vector $r$ is ten-dimensional, which makes (1) drastically more complex to solve.

### 4.2 Learning the transportation cost

We now consider a transshipment network $G = (V_s \cup V_d, E)$, where the nodes in $V_s$ are producers and the remaining nodes $V_d$ are consumers. The production level is $y_v$ for node $v \in V_s$ and has a maximum capacity of $w_v$. The demand level is $d_v^t$ for node $v \in V_d$ and varies over time $t \in [T]$. We assume that producing $y_v$ incurs a cost $C_v(y_v)$ for node $v \in V_s$; furthermore, we assume that there is a transportation cost $c_e x_e$ associated with each edge $e \in E$, and that the flow $x_e$ has a maximum capacity of $u_e$.
The transshipment problem can be formulated as follows:

$$
\begin{aligned}
\min \quad & \sum_{v \in V_s} C_v(y_v) + \sum_{e \in E} c_e x_e \\
\text{s.t.} \quad & \sum_{e \in \delta^+(v)} x_e - \sum_{e \in \delta^-(v)} x_e = y_v, && \forall v \in V_s, \\
& \sum_{e \in \delta^+(v)} x_e - \sum_{e \in \delta^-(v)} x_e = d_v^t, && \forall v \in V_d, \\
& 0 \leq x_e \leq u_e, \; 0 \leq y_v \leq w_v, && \forall e \in E, \; \forall v \in V_s,
\end{aligned} \tag{TP}
$$

where we want to learn the transportation cost $c_e$ for $e \in E$. For this application, we consider a convex quadratic production cost $C_v(y_v) = \frac{1}{2}\lambda_v y_v^2$ with $\lambda_v \geq 0$. We create instances of the problem based on the network in Figure 4a; $\lambda_1$, $\lambda_2$, $\{u_e\}_{e \in E}$, $\{w_v\}_{v \in V_s}$ and the randomly generated $\{c_e\}_{e \in E}$ are given in the supplementary material. In each round, the learner receives the demands $\{d_v^t\}_{v \in V_d}$, the production levels $\{y_v\}_{v \in V_s}$ and the flows $\{x_e\}_{e \in E}$, where the latter two are corrupted by noise. In round $t$, we generate $d_v^t$ for $v \in V_d$ from a uniform distribution, i.e., $d_v^t \sim U[-1.25, 0]$. Then, we solve (TP) and obtain the optimal production levels and flows. Next, the noisy production levels and flows are obtained by corrupting the optimal ones with noise that has a jointly uniform distribution with support $[-0.25, 0.25]^8$.

*Figure 4: Learning the transportation cost over T = 1000 rounds. (a) The five-node network used in our experiment. (b) Denote by $c \in \mathbb{R}^{|E|}$ the vector of transportation costs. We run 100 repetitions of the experiment, plot the estimation error at each round t for all 100 repetitions, and also plot the average estimation error of the 100 repetitions. (c) We randomly pick one repetition. The loss per round is indicated by dots and the average cumulative loss by the line. The dotted line is a reference line indicating the variance of the noise; here, $\mathbb{E}[\epsilon^T\epsilon] = 0.1667$.*

Suppose the transportation costs on edges $(2, 3)$ and $(2, 5)$ are unknown, and the learner seeks to learn them from the (demand, noisy decision) pairs that arrive sequentially over $T = 1000$ rounds. We assume that $c_e$ for $e \in E$ lies in $[1, 10]$. The learning rate is set to $\eta_t = 2/\sqrt{t}$, and we implement Algorithm 1 with Cold-start. Figure 4b shows the estimation error of $c$ in each round over the 100 repetitions, together with the average estimation error. As shown in this figure, $c_t$ converges to the true transportation cost $c_{\mathrm{true}}$ quite fast. Also, the estimation errors over rounds for different repetitions concentrate around the average, indicating that our algorithm is quite robust to noise. We then randomly pick one repetition and plot the loss per round and the average cumulative loss in Figure 4c. Note that the variance of the noise is $\mathbb{E}[\epsilon^T\epsilon] = 0.1667$; the average cumulative loss asymptotically converges to this value.

## 5 Conclusions and final remarks

In this paper, an online learning method to infer preferences or restrictions from noisy observations is developed and implemented. We prove a regret bound for the implicit online learning algorithm under certain regularity conditions, and show that the algorithm is statistically consistent, which guarantees that it asymptotically achieves the best prediction error permitted by the inverse model. Experimental results show that our algorithm learns the parameters with great accuracy, is robust to noise even when some assumptions are not satisfied or are difficult to verify, and achieves a dramatic improvement over the batch learning approach in computational efficiency.
Future research directions include algorithm development with more sophisticated online learning techniques for stronger performance, and theoretical investigation under less restrictive assumptions and with broader applicability.

## Acknowledgments

This work was partially supported by CMMI-1642514 from the National Science Foundation. This work used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

## References

[1] Arezou Keshavarz, Yang Wang, and Stephen Boyd. Imputing a convex objective function. In Intelligent Control (ISIC), 2011 IEEE International Symposium on, pages 613–619. IEEE, 2011.

[2] Ravindra K. Ahuja and James B. Orlin. Inverse optimization. Operations Research, 49(5):771–783, 2001.

[3] Garud Iyengar and Wanmo Kang. Inverse conic programming with applications. Operations Research Letters, 33(3):319–330, 2005.

[4] Andrew J. Schaefer. Inverse integer programming. Optimization Letters, 3(4):483–489, 2009.

[5] Lizhi Wang. Cutting plane algorithms for the inverse mixed integer linear programming problem. Operations Research Letters, 37(2):114–116, 2009.

[6] Andreas Bärmann, Sebastian Pokutta, and Oskar Schneider. Emulating the expert: Inverse optimization through online learning. In ICML, 2017.

[7] Anil Aswani, Zuo-Jun Shen, and Auyon Siddiq. Inverse optimization with noisy data. Operations Research, 2018.

[8] Timothy C. Y. Chan, Tim Craig, Taewoo Lee, and Michael B. Sharpe. Generalized inverse multiobjective optimization with application to cancer therapy. Operations Research, 62(3):680–695, 2014.

[9] Dimitris Bertsimas, Vishal Gupta, and Ioannis Ch. Paschalidis. Inverse optimization: A new perspective on the Black-Litterman model. Operations Research, 60(6):1389–1403, 2012.

[10] Peyman Mohajerin Esfahani, Soroosh Shafieezadeh-Abadeh, Grani A. Hanasusanto, and Daniel Kuhn. Data-driven inverse optimization with imperfect information. Mathematical Programming, pages 1–44, 2017.

[11] Chaosheng Dong and Bo Zeng. Inferring parameters through inverse multiobjective optimization. arXiv preprint arXiv:1808.00935, 2018.

[12] Charu C. Aggarwal. Recommender Systems: The Textbook. Springer, 2016.

[13] Dimitris Bertsimas, Vishal Gupta, and Ioannis Ch. Paschalidis. Data-driven estimation in equilibrium using inverse optimization. Mathematical Programming, 153(2):595–633, 2015.

[14] Hai Yang, Tsuna Sasaki, Yasunori Iida, and Yasuo Asakura. Estimation of origin-destination matrices from link traffic counts on congested networks. Transportation Research Part B: Methodological, 26(6):417–434, 1992.

[15] Stephan Dempe and Sebastian Lohse. Inverse linear programming. In Recent Advances in Optimization, pages 19–28. Springer, 2006.

[16] Timothy C. Y. Chan, Taewoo Lee, and Daria Terekhov. Inverse optimization: Closed-form solutions, geometry, and goodness of fit. Management Science, 2018.

[17] Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, and S. V. N. Vishwanathan. Implicit online learning with kernels. In NIPS, 2007.

[18] Brian Kulis and Peter L. Bartlett. Implicit online learning. In ICML, 2010.

[19] J. Frédéric Bonnans and Alexander Shapiro. Optimization problems with perturbations: A guided tour. SIAM Review, 40(2):228–264, 1998.

[20] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

[21] Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. Proceedings of Machine Learning Research, 65:1–37, 2017.
[22] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[23] Nicholas A. Nystrom, Michael J. Levine, Ralph Z. Roskies, and J. Ray Scott. Bridges: A uniquely flexible HPC resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, XSEDE '15, pages 30:1–30:8, New York, NY, USA, 2015. ACM.

[24] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.

[25] Andreu Mas-Colell, Michael Whinston, and Jerry R. Green. Microeconomic Theory. Oxford University Press, 1995.