Wasserstein Distributionally Robust Inverse Multiobjective Optimization

Chaosheng Dong1*, Bo Zeng2
1Amazon, 2University of Pittsburgh
chaosd@amazon.com, bzeng@pitt.edu

*Work was done prior to joining Amazon.

Abstract

Inverse multiobjective optimization provides a general framework for the unsupervised learning task of inferring parameters of a multiobjective decision making problem (DMP), based on a set of observed decisions from the human expert. However, the performance of this framework relies critically on the availability of an accurate DMP, sufficient decisions of high quality, and a parameter space that contains enough information about the DMP. To hedge against the uncertainties in the hypothetical DMP, the data, and the parameter space, we investigate in this paper the distributionally robust approach for inverse multiobjective optimization. Specifically, we leverage the Wasserstein metric to construct a ball centered at the empirical distribution of these decisions. We then formulate a Wasserstein distributionally robust inverse multiobjective optimization problem (WRO-IMOP) that minimizes a worst-case expected loss function, where the worst case is taken over all distributions in the Wasserstein ball. We show that the excess risk of the WRO-IMOP estimator has a sub-linear convergence rate. Furthermore, we propose a semi-infinite reformulation of WRO-IMOP and develop a cutting-plane algorithm that converges to an approximate solution in finitely many iterations. Finally, we demonstrate the effectiveness of our method on both a synthetic multiobjective quadratic program and a real-world portfolio optimization problem.

Introduction

Inverse multiobjective optimization provides a compelling tool to learn humans' decision-making schemes or emulate their behaviors (Dong and Zeng 2020). Its goal is to infer the parameters of the multiobjective decision making problem (DMP) based on a set of observed decisions from the human experts. More precisely, it seeks to learn $\theta$ given $\{y_i\}_{i \in [N]}$ that are observations of the Pareto optimal solutions of the multiobjective optimization problem (MOP)

$$\min_{x} \ \{f_1(x, \theta), f_2(x, \theta), \ldots, f_p(x, \theta)\} \quad \text{s.t.} \ x \in X(\theta),$$

where $\theta$ indicates the true but unknown parameter of the expert's multiobjective DMP.

This tool is generally applicable in many scenarios. The underlying multiobjective decision-making schemes, once obtained, would presumably play critical roles in various aspects, such as assisting agents in automating the process of providing professional services for clients. For example, modern portfolio theory, in which risk and profit are the two objectives to optimize, is often used by portfolio managers when buying or selling stocks on behalf of clients (Markowitz 1952). To automate portfolio management, one could leverage the portfolio manager's investment records to learn the key parameters of this model, e.g., the risk aversion score or the expected returns of the assets.

Despite its wide applications, inverse multiobjective optimization relies critically on the availability of an accurate decision making model, sufficient decisions of high quality, and a parameter space that contains as much information about the objective functions or constraints as possible. In practice, however, it is highly unlikely that all of these critical factors would be satisfied.
For example, outliers in a limited set of decisions can make the empirical distribution of decisions deviate from the true distribution, and thus significantly weaken the predictive power of the inverse multiobjective optimization estimator. We note that this issue is not unique to inverse multiobjective optimization; similar findings have been observed in inverse optimization models that have only one objective function (Keshavarz, Wang, and Boyd 2011; Bertsimas, Gupta, and Paschalidis 2015; Aswani, Shen, and Siddiq 2018; Esfahani et al. 2018; Dong, Chen, and Zeng 2018).

To hedge against these uncertainties contained in the hypothetical decision making model, the data, and the selected parameter space, we investigate the distributionally robust approach for inverse multiobjective optimization. More specifically, motivated by Shafieezadeh-Abadeh, Esfahani, and Kuhn (2015), Gao and Kleywegt (2016), and Esfahani and Kuhn (2018), among others, we use the Wasserstein metric (Villani 2008) to construct an uncertainty set centered at the empirical distribution of the observed decisions. Subsequently, we propose a distributionally robust inverse multiobjective optimization program that minimizes the worst-case risk of loss, where the worst case is taken over all distributions in the uncertainty set. With such a distributionally robust framework, we aim to bridge the gap between the lack of certainty in the available information and the expectation of accurately predicting a human's or robot's future behavior.

Related Work

Our work is most related to Dong and Zeng (2020), which proposes the general framework of using inverse multiobjective optimization to infer the objective functions or constraints of the multiobjective DMP based on observations of Pareto optimal solutions. Dong and Zeng (2020) adopts the empirical risk minimization framework for this unsupervised learning task, which generally works well when there are few uncertainties in the model, data, or hypothetical parameter space. In contrast, we believe that such uncertainties are inherent in applications of inverse multiobjective optimization, and we aim to hedge against their influence by adopting the distributionally robust optimization paradigm based on the Wasserstein metric. We demonstrate both theoretically and experimentally that our method has out-of-sample performance guarantees under uncertainties.

Our work draws inspiration from Esfahani et al. (2018), who develop a distributionally robust approach for inverse optimization to infer the utility function from sequentially arriving observations. They aim to mitigate the poor performance of inverse optimization models (Ahuja and Orlin 2001; Keshavarz, Wang, and Boyd 2011; Bertsimas, Gupta, and Paschalidis 2015; Aswani, Shen, and Siddiq 2018; Esfahani et al. 2018; Bärmann, Pokutta, and Schneider 2017; Dong, Chen, and Zeng 2018) when the learner has imperfect information, and show that the associated distributionally robust inverse optimization approach offers out-of-sample performance guarantees in such a situation. However, their approach is specifically designed for the simpler case where the DMP has only one objective function, whereas our approach considers the more complex situation where the DMP has multiple objectives. Moreover, instead of using the suboptimality loss function, we consider a loss function that better captures the learner's purpose of predicting the decision maker's decisions.
Due to the nonconvex nature of our loss function, extensive effort is made to develop an algorithm for solving the resulting nonconvex minmax program.

Contributions

We summarize our contributions as follows:
- We present a novel Wasserstein distributionally robust framework for constructing the inverse multiobjective optimization estimator. We use the prominent Wasserstein metric to construct the uncertainty set centered at the empirical distribution of observed decisions.
- We show that the proposed framework has statistical performance guarantees: the excess risk of the distributionally robust inverse multiobjective optimization estimator converges to zero at a sub-linear rate as the number of observed decisions approaches infinity.
- We reformulate the resulting minmax problem as a semi-infinite program and develop a cutting-plane algorithm that converges to an approximate solution in finitely many iterations.
- We demonstrate the effectiveness of our method on both a multiobjective quadratic program and a portfolio optimization problem.

Problem Setting

Multiobjective Decision Making Problem

We consider a family of parametrized multiobjective decision making problems with $p\ (\geq 2)$ objective functions:

$$\min_{x \in \mathbb{R}^n} \ \{f_1(x, \theta), f_2(x, \theta), \ldots, f_p(x, \theta)\} \quad \text{s.t.} \ x \in X(\theta), \tag{DMP}$$

where $\theta \in \Theta$ is the parameter of the multiobjective DMP. For ease of exposition, we denote by $f(x, \theta)$ the vector-valued function $(f_1(x, \theta), f_2(x, \theta), \ldots, f_p(x, \theta))^T$. The feasible set $X(\theta)$ is characterized as $X(\theta) = \{x \in \mathbb{R}^n : g(x, \theta) \leq 0\}$, where $g(x, \theta) = (g_1(x, \theta), \ldots, g_q(x, \theta))^T$ is another vector-valued function. Following Dong and Zeng (2020), we consider a convex DMP, in which all objectives and constraints are convex in $x$ for each $\theta \in \Theta$.

Definition 1 (Pareto optimality). For a fixed $\theta$, a decision $x \in X(\theta)$ is said to be Pareto optimal if there exists no other decision $x' \in X(\theta)$ such that $f_i(x', \theta) \leq f_i(x, \theta)$ for all $i \in [p]$, and $f_j(x', \theta) < f_j(x, \theta)$ for at least one $j \in [p]$.

We denote by $X_P(\theta)$ the Pareto optimal set consisting of all Pareto optimal solutions. The weighted sum approach (Gass and Saaty 1955) is often used to derive a Pareto optimal solution by solving

$$\min_{x} \ w^T f(x, \theta) \quad \text{s.t.} \ x \in X(\theta), \tag{WP}$$

where $w = (w_1, \ldots, w_p)^T$ is a nonnegative weight vector in the $(p-1)$-simplex $W_p \equiv \{w \in \mathbb{R}^p_+ : 1^T w = 1\}$. When each $w \in \mathbb{R}^p_{++}$, this set is denoted by $W_p^+$. We denote by $S(w, \theta)$ the set of optimal solutions of (WP), i.e.,

$$S(w, \theta) = \arg\min_x \ \big\{w^T f(x, \theta) : x \in X(\theta)\big\}.$$

We next make a few assumptions to simplify the analysis; they are mild and appear frequently in the inverse optimization literature.

Assumption 1. The set $\Theta$ is a convex compact set in $\mathbb{R}^{n_\theta}$. There exists $D > 0$ such that $\sup_{\theta \in \Theta} \|\theta\|_2 \leq D$. In addition, $f(x, \theta)$ and $g(x, \theta)$ are convex in $x$ for each $\theta \in \Theta$.

Inverse Multiobjective Optimization

Consider a learner who has access to decision makers' decisions but does not know the underlying decision making model. In the inverse multiobjective optimization model, the learner aims to learn the parameter $\theta$ of DMP from observed noisy decisions only; no information regarding the decision makers' preferences over the multiple objective functions is available. We denote by $y$ an observed noisy decision, which might carry measurement error or be generated under the bounded rationality of the decision maker, i.e., be suboptimal. Throughout the paper we assume that $y$ is a random variable distributed according to an unknown distribution $\mathbb{P}_y$ supported on $Y$.
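To make the weighted-sum mechanism concrete, the following Julia sketch computes the single-valued set $S(w, \theta)$ in closed form for a bi-objective quadratic DMP with strongly convex objectives and, as a simplifying assumption, an unconstrained feasible region; the specific matrices and the function name weighted_sum_solution are illustrative, not data from this paper.

```julia
# A minimal sketch of the weighted-sum approach (WP) for a bi-objective
# quadratic DMP with strongly convex objectives f_l(x) = 0.5 x'Q_l x + c_l'x.
# The data below and the omission of the constraint set X(θ) are illustrative
# assumptions, not part of the paper's formulation.
using LinearAlgebra

# Closed-form minimizer of w1*f1 + w2*f2 when X(θ) = R^n:
# x*(w) = -(w1*Q1 + w2*Q2) \ (w1*c1 + w2*c2).
function weighted_sum_solution(w, Q1, c1, Q2, c2)
    Q = w[1] * Q1 + w[2] * Q2          # positive definite, hence invertible
    c = w[1] * c1 + w[2] * c2
    return -Q \ c                      # the unique element of S(w, θ)
end

Q1, c1 = [1.0 0.0; 0.0 2.0], [-0.5, -1.0]   # hypothetical parameters
Q2, c2 = [2.0 0.0; 0.0 1.0], [-5.0, -2.5]

# Sweeping w over the 1-simplex traces (a subset of) the Pareto set X_P(θ).
pareto_points = [weighted_sum_solution([t, 1 - t], Q1, c1, Q2, c2) for t in 0.1:0.1:0.9]
```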
We next discuss the construction of the loss function for this unsupervised learning task, following Dong and Zeng (2020). Given a noisy decision $y$ and a hypothesis $\theta$, the loss function could ideally be defined as the minimum distance between $y$ and $X_P(\theta)$. For a general DMP, however, there typically exists no explicit way to characterize $X_P(\theta)$. Instead, a sampling approach is adopted: generate $w_k \in W_p$ for each $k \in [K]$ and approximate $X_P(\theta)$ by $\bigcup_{k \in [K]} S(w_k, \theta)$. The loss function is then defined as

$$l_K(y, \theta) = \min_{x \in \bigcup_{k \in [K]} S(w_k, \theta)} \|y - x\|_2^2. \quad \text{(loss function)}$$

By using binary variables, this loss function can be converted into the following problem:

$$l_K(y, \theta) = \min_{z_k \in \{0,1\}} \ \Big\|y - \sum_{k \in [K]} z_k x_k\Big\|_2^2 \quad \text{s.t.} \ \sum_{k \in [K]} z_k = 1, \ x_k \in S(w_k, \theta). \tag{1}$$

The constraint $\sum_{k \in [K]} z_k = 1$ ensures that exactly one Pareto optimal solution is chosen to measure the distance from $y$ to $X_P(\theta)$. Hence, solving this optimization problem identifies some $w_k$ with $k \in [K]$ such that the corresponding Pareto optimal solution in $S(w_k, \theta)$ is closest to $y$. We make the following assumptions, as in Dong and Zeng (2020).

Assumption 2. (a) For each $\theta \in \Theta$, $X(\theta)$ is compact and has a nonempty relative interior; namely, there exists $B > 0$ such that $\|x\|_2 \leq B$ for all $x \in X(\theta)$. The support $Y$ of the noisy decisions $y$ is contained within a ball of radius $R < \infty$. (b) Each function in $f$ is strongly convex on $\mathbb{R}^n$; that is, for each $l \in [p]$, there exists $\lambda_l > 0$ such that for all $x, y \in \mathbb{R}^n$,

$$\big(\nabla f_l(y, \theta_l) - \nabla f_l(x, \theta_l)\big)^T (y - x) \geq \lambda_l \|x - y\|_2^2.$$

Regarding Assumption 2(a), we note that assuming compactness of the feasible region is very common in inverse optimization. The finite support of the observations is needed because we do not want outliers to have too strong an impact on the learning process. Let $\lambda = \min_{l \in [p]}\{\lambda_l\}$. It follows that $w^T f(x, \theta)$ is strongly convex with parameter $\lambda$ for each $w \in W_p$. Therefore, Assumptions 2(a)-(b) together ensure that $S(w, \theta)$ is a single-valued set for each $w$ and $\theta$.

Given observations $\{y_i\}_{i \in [N]}$ drawn i.i.d. according to the distribution $\mathbb{P}_y$, the inverse multiobjective optimization program is given as follows:

$$\min_{\theta \in \Theta} \ \frac{1}{N} \sum_{i \in [N]} l_K(y_i, \theta). \tag{IMOP}$$

Statistical properties, algorithmic developments, and connections to other unsupervised learning tasks have been extensively investigated in Dong and Zeng (2020).

Wasserstein Ambiguity Set

Let $Y \subseteq \mathbb{R}^n$ be the observation space in which the observed noisy decisions take values. Denote by $\mathcal{P}(Y)$ the set of all probability distributions on $Y$. From now on, we let the Wasserstein ambiguity set $\mathcal{P}$ be the 1-Wasserstein ball of radius $\epsilon$ centered at $\mathbb{P}_0$:

$$\mathcal{P} = B_\epsilon(\mathbb{P}_0) := \{\mathbb{Q} \in \mathcal{P}(Y) : W(\mathbb{Q}, \mathbb{P}_0) \leq \epsilon\}, \tag{2}$$

where $\mathbb{P}_0$ is the nominal distribution on $Y$, $\epsilon > 0$ is the radius of the set, and $W(\mathbb{Q}, \mathbb{P}_0)$ is the Wasserstein distance of order 1 defined as (Villani 2008; Esfahani and Kuhn 2018; Gao and Kleywegt 2016)

$$W(\mathbb{Q}, \mathbb{P}_0) = \inf_{\pi \in \Pi(\mathbb{Q}, \mathbb{P}_0)} \int_{Y \times Y} \|z_1 - z_2\|_2 \, \pi(dz_1, dz_2),$$

where $\Pi(\mathbb{Q}, \mathbb{P}_0)$ is the set of probability distributions on $Y \times Y$ with marginals $\mathbb{Q}$ and $\mathbb{P}_0$.
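For intuition about the metric in (2), note that between two empirical distributions with the same number of atoms and uniform weights on the real line, the order-1 Wasserstein distance reduces to the mean absolute difference of sorted samples. The sketch below illustrates only this one-dimensional special case, a simplifying assumption; the paper works with distributions on $Y \subseteq \mathbb{R}^n$.

```julia
# A minimal sketch: 1-Wasserstein distance between two empirical measures
# with uniform weights on the real line. In 1-D the optimal coupling matches
# sorted samples, so W1 is the mean absolute difference of order statistics.
# This special case is for intuition only; the paper's Y lives in R^n.
using Statistics

w1_empirical(a::Vector{Float64}, b::Vector{Float64}) =
    mean(abs.(sort(a) .- sort(b)))       # requires length(a) == length(b)

samples = randn(100)                      # a nominal empirical distribution
shifted = samples .+ 0.3                  # a perturbed distribution
println(w1_empirical(samples, shifted))   # ≈ 0.3, the size of the shift
```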
Wasserstein Distributionally Robust IMOP

In this section, we propose the Wasserstein distributionally robust IMOP and show its equivalence to a semi-infinite program. Subsequently, we present an algorithm to handle the resulting reformulation and show its convergence in finitely many steps. Finally, we establish the statistical performance guarantees for the distributionally robust IMOP.

Given observations $\{y_i\}_{i \in [N]}$ drawn i.i.d. according to the distribution $\mathbb{P}_y$, the corresponding distributionally robust program of (IMOP) equipped with the Wasserstein ambiguity set is constructed as follows:

$$\min_{\theta \in \Theta} \ \sup_{\mathbb{Q} \in B_\epsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}_{y \sim \mathbb{Q}}\big[l_K(y, \theta)\big], \tag{WRO-IMOP}$$

which minimizes the worst-case expected loss over all distributions in the Wasserstein ambiguity set. Here $B_\epsilon(\widehat{\mathbb{P}}_N)$ is defined in (2), and $\widehat{\mathbb{P}}_N$ is the empirical distribution satisfying $\widehat{\mathbb{P}}_N(y_i) = 1/N$ for all $i \in [N]$.

Semi-infinite Reformulations

WRO-IMOP involves minimizing a supremum over infinitely many distributions, making it difficult to solve. In this section, we establish the reformulation of WRO-IMOP as a semi-infinite program. The performance of WRO-IMOP depends on how changes of $\theta$ affect the objective values. For $w \in W_p$ and $\theta_1, \theta_2 \in \Theta$, we consider the function

$$h(x, w, \theta_1, \theta_2) = w^T f(x, \theta_1) - w^T f(x, \theta_2).$$

Assumption 3. There exists $\kappa > 0$ such that for all $w \in W_p$ and $\theta_1 \neq \theta_2 \in \Theta$, $h(\cdot, w, \theta_1, \theta_2)$ is Lipschitz continuous on $Y$: for all $x, y \in Y$,

$$|h(x, w, \theta_1, \theta_2) - h(y, w, \theta_1, \theta_2)| \leq \kappa \|\theta_1 - \theta_2\|_2 \|x - y\|_2.$$

Basically, this assumption requires that the objective functions do not change much when either the parameter $\theta$ or the variable $x$ is perturbed. It holds in many common situations, including the multiobjective linear program (MLP) and the multiobjective quadratic program (MQP). As a motivating example, we give the $\kappa$ for an MQP.

Example 1. Suppose that

$$f(x, \theta) = \begin{pmatrix} \frac{1}{2}x^T Q_1 x + c_1^T x \\ \frac{1}{2}x^T Q_2 x + c_2^T x \end{pmatrix},$$

where $\theta = (Q_1, Q_2, c_1, c_2)$. Under Assumption 2, we know that $\|y\|_2 \leq R$. Then $h(\cdot, w, \theta_1, \theta_2)$ is $2R\|\theta_1 - \theta_2\|_2$-Lipschitz continuous on $Y$. That is, we can set $\kappa = 2R$.

Under the previous assumptions, we establish several properties of the loss function $l_K(y, \theta)$ that are essential for our reformulation of WRO-IMOP.

Lemma 1. Under Assumptions 1 - 3, the loss function $l_K(y, \theta)$ has the following properties:
(a) For all $y \in Y$ and $\theta \in \Theta$, $0 \leq l_K(y, \theta) \leq (B + R)^2$.
(b) $l_K(y, \theta)$ is uniformly $2(B + R)$-Lipschitz continuous in $y$: for all $\theta \in \Theta$ and $y_1, y_2 \in Y$, $|l_K(y_1, \theta) - l_K(y_2, \theta)| \leq 2(B + R)\|y_1 - y_2\|_2$.
(c) $l_K(y, \theta)$ is uniformly $\frac{4(B + R)\kappa}{\lambda}$-Lipschitz continuous in $\theta$: for all $y \in Y$ and $\theta_1, \theta_2 \in \Theta$, $|l_K(y, \theta_1) - l_K(y, \theta_2)| \leq \frac{4(B + R)\kappa}{\lambda}\|\theta_1 - \theta_2\|_2$.

Parts (a) and (b) of Lemma 1 follow from direct analysis of the loss function $l_K(y, \theta)$. The proof of (c) is much more involved and relies on the key observation that the perturbation of $S(w, \theta)$ due to $\theta$ is bounded by the perturbation of $\theta$, obtained by applying Proposition 6.1 of Bonnans and Shapiro (1998). Details of the proof are provided in the supplementary material.

Let

$$V := \Big\{v \in \mathbb{R}^{N+1} : V_1 \leq v_i \leq (m + 1)V_2 - m V_1, \ i \in [N], \ 0 \leq v_{N+1} \leq (V_2 - V_1)/\epsilon\Big\},$$

where $V_1$ and $V_2$ are lower and upper bounds for the loss function $l_K(y, \theta)$, respectively. By part (a) of Lemma 1, we set $V_1 = 0$ and $V_2 = (B + R)^2$ throughout the remainder of the paper.

The following theorem presents a tractable reformulation of the distributionally robust optimization problem WRO-IMOP and thus constitutes the first main result of this paper.

Theorem 1 (Semi-infinite Reformulation). Under Assumptions 1 - 3, WRO-IMOP is equivalent to the following semi-infinite program:

$$\begin{aligned} \min_{\theta, v} \ & \ \epsilon v_{N+1} + \frac{1}{N}\sum_{i \in [N]} v_i \\ \text{s.t.} \ & \ \sup_{\tilde y \in Y}\big(l_K(\tilde y, \theta) - v_{N+1}\|\tilde y - y_i\|_2\big) \leq v_i, \ \forall i \in [N], \\ & \ \theta \in \Theta, \ v \in V. \end{aligned} \tag{3}$$

Proof. Under Assumption 1, $\Theta$ is compact. Similarly, $Y$ is compact under Assumption 2(a). By Lemma 1(a), for all $y \in Y$ and $\theta \in \Theta$, $0 \leq l_K(y, \theta) \leq (B + R)^2$, and thus $l_K(y, \theta)$ is bounded. In addition, by Lemma 1(b), $l_K(y, \theta)$ is continuous in $y$ for any $\theta \in \Theta$. Finally, by Lemma 1(c), $l_K(y, \theta)$ is uniformly $\frac{4(B + R)\kappa}{\lambda}$-Lipschitz continuous in $\theta$. Hence, applying Corollary 3.8 of Luo and Mehrotra (2017) yields the result.
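The dual structure behind Theorem 1 can be checked numerically: for a fixed $\theta$, the inner worst-case expectation equals $\min_{v_{N+1} \geq 0} \epsilon v_{N+1} + \frac{1}{N}\sum_{i \in [N]} \sup_{\tilde y \in Y}\big(l_K(\tilde y, \theta) - v_{N+1}\|\tilde y - y_i\|_2\big)$. The Julia sketch below approximates this dual by brute-force grid search, with a one-dimensional toy loss standing in for $l_K$; the grids, the toy loss, and the function names are illustrative assumptions.

```julia
# A brute-force sketch of the Kantorovich dual behind Theorem 1: for fixed θ,
#   sup_{Q ∈ B_ε(P̂_N)} E_Q[l_K] = min_{λ≥0} ελ + (1/N) Σ_i sup_y (l_K(y,θ) - λ|y - y_i|).
# A 1-D toy loss stands in for l_K, and all sups/mins are taken on grids.
using Statistics

loss(y) = min((y - 1.0)^2, (y + 1.0)^2)      # toy stand-in for l_K(·, θ), K = 2

function worst_case_expectation(obs::Vector{Float64}, ε::Float64;
                                ygrid = -3.0:0.01:3.0, λgrid = 0.0:0.05:20.0)
    best = Inf
    for λ in λgrid                            # λ plays the role of v_{N+1}
        inner = mean(maximum(loss(ỹ) - λ * abs(ỹ - yi) for ỹ in ygrid) for yi in obs)
        best = min(best, ε * λ + inner)
    end
    return best
end

obs = [0.9, -1.1, 1.05]
println(worst_case_expectation(obs, 0.0))    # ε = 0 recovers the empirical risk
println(worst_case_expectation(obs, 0.1))    # larger ε ⇒ larger worst-case risk
```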
Remark 1. The establishment of Theorem 1 relies on the properties of $l_K(y, \theta)$ stated in Lemma 1. Although $l_K(y, \theta)$ might not be convex in $\theta$ or $y$, these properties ensure that strong (Kantorovich) duality holds for the inner problem of WRO-IMOP.

Next, we discuss how to incorporate the explicit form of $l_K(y, \theta)$ into the constraints of (3). For each $i \in [N]$, the corresponding constraint in (3) is equivalent to

$$\forall \tilde y \in Y: \quad \|\tilde y - x_k\|_2^2 - v_{N+1}\|\tilde y - y_i\|_2 - v_i \leq M z_{ik}, \qquad \sum_{k \in [K]} z_{ik} = K - 1, \quad z_{ik} \in \{0,1\}, \tag{4}$$

where the additional constraint $\sum_{k \in [K]} z_{ik} = K - 1$ is imposed to ensure that $\|\tilde y - x_k\|_2^2 - v_i - v_{N+1}\|\tilde y - y_i\|_2 \leq 0$ for at least one $k \in [K]$, and $M$ is a uniform upper bound on the left-hand side of the first constraint in (4). An appropriate choice is $M = (B + R)^2$, since for all $i \in [N]$ and $k \in [K]$, $\|\tilde y - x_k\|_2^2 - v_{N+1}\|\tilde y - y_i\|_2 - v_i \leq \|\tilde y - x_k\|_2^2 \leq (B + R)^2$. One can verify without much effort that (4) is indeed equivalent to the first set of constraints in (3).

Remark 2. We note that the semi-infinite reformulation in Theorem 1 might remain valid even when some assumption is not satisfied. Consider, for example, the case where one of the objective functions is known to be strongly convex and the decision maker always has a positive preference for it.

Algorithm and Analysis of Convergence

Theorem 1 shows that the Wasserstein distributionally robust inverse multiobjective program WRO-IMOP is equivalent to the semi-infinite program (3). Now, any existing method for solving general semi-infinite programs can be employed to solve (3). In particular, we are interested in exchange methods (Hettich and Kortanek 1993; Joachims, Finley, and Yu 2009), since our algorithm inherits the spirit of these methods when applied to the minmax problem. The basic idea is to approximate the infinite set of constraints in (3) with a sequence of finite sets of constraints. Iteratively, new constraints are added to the previous set of constraints by solving a maximum constraint violation problem. This is repeated until a certain stopping criterion is satisfied.

Next, we discuss how to construct the finite problem. Let $\widetilde Y_i = \{\tilde y_{i1}, \ldots, \tilde y_{iJ_i}\} \subseteq Y$, $i \in [N]$, be a collection of finite subsets of $Y$, where each subset has $J_i$ samples. Then the associated finite problem of (3) is

$$\begin{aligned} \min_{\theta, v} \ & \ \epsilon v_{N+1} + \frac{1}{N}\sum_{i \in [N]} v_i \\ \text{s.t.} \ & \ l_K(\tilde y_{ij}, \theta) - v_{N+1}\|\tilde y_{ij} - y_i\|_2 \leq v_i, \ \forall j \in [J_i], \ i \in [N], \\ & \ \theta \in \Theta, \ v \in V. \end{aligned} \tag{5}$$

By the same arguments used for the transformation from the constraints in (3) to those in (4), the constraints in (5) are equivalent to

$$\|\tilde y_{ij} - x_k\|_2^2 - v_{N+1}\|\tilde y_{ij} - y_i\|_2 - v_i \leq M z_{ijk}, \qquad \sum_{k \in [K]} z_{ijk} = K - 1, \quad \forall i \in [N], \ j \in [J_i].$$

Using the above transformation, (5) can be further cast as the following problem with finitely many constraints:

$$\begin{aligned} \min_{\theta, v, x_k, z_{ijk}} \ & \ \epsilon v_{N+1} + \frac{1}{N}\sum_{i \in [N]} v_i \\ \text{s.t.} \ & \ \|\tilde y_{ij} - x_k\|_2^2 - v_{N+1}\|\tilde y_{ij} - y_i\|_2 - v_i \leq M z_{ijk}, \\ & \ x_k \in S(w_k, \theta), \ \sum_{k \in [K]} z_{ijk} = K - 1, \\ & \ \theta \in \Theta, \ v \in V, \ z_{ijk} \in \{0,1\}, \ \forall i \in [N], \ j \in [J_i], \ k \in [K]. \end{aligned} \tag{6}$$

At each iteration, the new constraints to add to the previous set of constraints in (6) are determined by solving the following maximum constraint violation problem: for each $i \in [N]$,

$$CV_i = \max_{\tilde y \in Y} \ l_K(\tilde y, \hat\theta) - \hat v_{N+1}\|\tilde y - y_i\|_2 - \hat v_i. \tag{7}$$

Denote by $\tilde y_i$ the optimal solution of (7) for each $i \in [N]$. Whenever we find that $CV_i > 0$, we append $\tilde y_i$ to $\widetilde Y_i$. As a result, we tighten our approximation of the infinite set of constraints in (3) by imposing the additional constraint $l_K(\tilde y_i, \hat\theta) - \hat v_{N+1}\|\tilde y_i - y_i\|_2 - \hat v_i \leq 0$ in the next iteration. With the above assumptions and analysis, we now present our method to solve WRO-IMOP in Algorithm 1; its general scheme is illustrated in Figure 1.

Algorithm 1 Wasserstein Distributionally Robust IMOP
1: Input: noisy decisions $\{y_i\}_{i \in [N]}$, weights $\{w_k\}_{k \in [K]}$, radius $\epsilon$ of the Wasserstein ball, and stopping tolerance $\delta$
2: Initialize $\widetilde Y_i \leftarrow \emptyset$, $\forall i \in [N]$
3: repeat
4:   solve (6) with $\widetilde Y_i$, $i \in [N]$, and return an optimal solution $(\hat\theta, \hat v)$
5:   for $i = 1, \ldots, N$ do
6:     solve the maximum constraint violation problem (7)
7:     if $CV_i > 0$ then let $\widetilde Y_i \leftarrow \widetilde Y_i \cup \{\tilde y_i\}$ end if
8:   end for
9: until $\max_{i \in [N]} CV_i \leq \delta$
10: Output: a $\delta$-optimal solution $\hat\theta_N$ of (3)

Figure 1: General scheme of Algorithm 1. The finite problem (6) and the maximum constraint violation problem (7) iteratively exchange solutions and added constraints to approximate the semi-infinite problem (3).
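The following self-contained Julia sketch mimics the exchange scheme of Algorithm 1 on a deliberately tiny instance, so that the master problem (6) and the separation problem (7) can both be solved by grid search; in the paper these are solved with mixed-integer and global solvers instead, and all names, grids, and the scalar toy loss are illustrative assumptions.

```julia
# A sketch of the exchange/cutting-plane scheme in Algorithm 1 on a tiny
# instance: θ is a scalar, K = 1, and S(w, θ) = {θ}, so l_K(y, θ) = (y - θ)^2.
loss(y, θ) = (y - θ)^2

function cutting_plane(obs; ε = 0.1, δ = 1e-3,
                       θgrid = -2.0:0.02:2.0, λgrid = 0.0:0.1:10.0,
                       ygrid = -3.0:0.01:3.0, maxiter = 50)
    N = length(obs)
    Ỹ = [Float64[] for _ in 1:N]          # finite constraint sets Ỹ_i
    θ̂, λ̂, v̂ = 0.0, 0.0, zeros(N)
    for it in 1:maxiter
        # Master problem (cf. (6)): grid search over θ and v_{N+1} = λ;
        # for fixed (θ, λ), the tightest feasible v_i is taken directly.
        best = Inf
        for θ in θgrid, λ in λgrid
            v = [isempty(Ỹ[i]) ? 0.0 :
                 max(0.0, maximum(loss(ỹ, θ) - λ * abs(ỹ - obs[i]) for ỹ in Ỹ[i]))
                 for i in 1:N]
            obj = ε * λ + sum(v) / N
            if obj < best
                best, θ̂, λ̂, v̂ = obj, θ, λ, v
            end
        end
        # Separation problem (cf. (7)): maximum constraint violation CV_i.
        CV = [maximum(loss(ỹ, θ̂) - λ̂ * abs(ỹ - obs[i]) for ỹ in ygrid) - v̂[i]
              for i in 1:N]
        maximum(CV) ≤ δ && return θ̂       # δ-optimal solution reached
        for i in 1:N                       # add violated cuts to Ỹ_i
            if CV[i] > 0
                push!(Ỹ[i], ygrid[argmax([loss(ỹ, θ̂) - λ̂ * abs(ỹ - obs[i]) for ỹ in ygrid])])
            end
        end
    end
    return θ̂
end

println(cutting_plane([0.4, 0.6, 0.5]))    # recovers θ near the observations
```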
Remark 3. In Step 6, the maximum constraint violation problem can be solved exactly and efficiently by invoking a global solver such as BARON (Sahinidis 1996). Alternatively, it can be solved by decomposing it into $K$ subproblems, each of which is a possibly nonconvex program when $\hat v_{N+1} < 1$. We note, however, that each such nonconvex subproblem is a quadratically constrained quadratic program (QCQP) with a single constraint, and thus can be solved exactly and efficiently through the so-called S-procedure (Boyd and Vandenberghe 2004; Pólik and Terlaky 2007). Additionally, the $K$ subproblems can be solved independently and in parallel, allowing a linear speedup of Step 6.

For completeness, we give the convergence proof of Algorithm 1 in the following theorem.

Theorem 2. Under Assumptions 1 - 3, Algorithm 1 converges within $\big(\frac{G R_0}{\delta} + 1\big)^{n_\theta + N + 1}$ iterations. Here,

$$G = 1 + 2R + \frac{4(B + R)\kappa}{\lambda}, \qquad R_0 = \sqrt{D^2 + N\big((m + 1)V_2 - m V_1\big)^2 + \Big(\frac{V_2 - V_1}{\epsilon}\Big)^2}.$$

Remark 4. The proof of convergence is similar in spirit to that of cutting-plane methods for robust optimization and distributionally robust optimization (Mutapcic and Boyd 2009; Luo and Mehrotra 2017). In practice, the actual number of iterations required is typically much smaller than $\big(\frac{G R_0}{\delta} + 1\big)^{n_\theta + N + 1}$, as can be seen in the experiments.

Variants of Algorithm 1

Note that we add $\tilde y_i$ to $\widetilde Y_i$ whenever $CV_i > 0$ for each $i \in [N]$. Nevertheless, from the convergence proof, it suffices to add only the single $\tilde y_i$ corresponding to the largest positive $CV_i$. Consequently, we can dramatically ease the computational burden of each iteration.

Performance Guarantees

One of the main goals of statistical analysis of learning algorithms is to understand how the excess risk of a data-dependent decision rule output by empirical risk minimization depends on the sample size of the observations and on the complexity of the class $\Theta$. Next, we provide a performance guarantee for WRO-IMOP by showing that the excess risk of the estimator obtained by solving WRO-IMOP converges sub-linearly to zero.

Theorem 3 (Excess risk bound). Define the minimax risk estimator

$$\theta^* \in \arg\min_{\theta \in \Theta} \sup_{\mathbb{Q} \in B_\epsilon(\mathbb{P}^*)} \mathbb{E}_{y \sim \mathbb{Q}}\big[l_K(y, \theta)\big],$$

where $\mathbb{P}^*$ is the distribution from which the observations $\{y_i\}_{i \in [N]}$ are drawn, and the minimax empirical risk estimator

$$\hat\theta_N \in \arg\min_{\theta \in \Theta} \sup_{\mathbb{Q} \in B_\epsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}_{y \sim \mathbb{Q}}\big[l_K(y, \theta)\big],$$

where $\widehat{\mathbb{P}}_N$ is the empirical distribution of the observations $\{y_i\}_{i \in [N]}$. Under Assumptions 1 - 3, for all $0 < \delta < 1$, the following holds with probability at least $1 - \delta$:

$$\sup_{\mathbb{Q} \in B_\epsilon(\mathbb{P}^*)} \mathbb{E}_{y \sim \mathbb{Q}}\big[l_K(y, \hat\theta_N)\big] - \sup_{\mathbb{Q} \in B_\epsilon(\mathbb{P}^*)} \mathbb{E}_{y \sim \mathbb{Q}}\big[l_K(y, \theta^*)\big] \leq \frac{H}{\sqrt{N}} + 3(B + R)^2 \sqrt{\frac{\log(2/\delta)}{2N}},$$

where $H$ is a constant depending only on $D$, $B$, $R$, $n_\theta$, $\kappa$:

$$H = 96\sqrt{3}\, D\sqrt{n_\theta}\,(\kappa + 2R)(B + R).$$

Remark 5 (Performance Guarantees). The bounded support assumption on $Y$ for the noisy decisions is restrictive but seems unavoidable for a priori guarantees of the type described in Theorem 3. In future work, we will investigate whether other types of performance guarantees can be obtained while relaxing $\mathbb{P}_y$ to be light-tailed.
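As a rough numerical illustration of Theorem 3, the snippet below evaluates a bound of the stated form $a/\sqrt{N} + b\sqrt{\log(2/\delta)/(2N)}$ for placeholder constants $a$ and $b$ (hypothetical values, not the paper's $H$ or $3(B + R)^2$), showing the sub-linear decay in $N$.

```julia
# The excess-risk bound in Theorem 3 decays as O(1/sqrt(N)). The constants
# a and b below are arbitrary placeholders (hypothetical, not the paper's H
# or 3(B+R)^2); the point is only the sub-linear decay in the sample size N.
a, b, δ = 10.0, 5.0, 0.05
bound(N) = a / sqrt(N) + b * sqrt(log(2 / δ) / (2N))
for N in (10, 100, 1_000, 10_000)
    println("N = $N  =>  excess-risk bound ≈ $(round(bound(N), digits = 4))")
end
```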
Analogous to the convergence rate of empirical risk minimization when $\epsilon = 0$, we obtain an $O(1/\sqrt{N})$ excess risk bound. Note, however, that the obtained excess risk bound does not depend on the radius $\epsilon$ of the Wasserstein ambiguity set. Similar to Lee and Raginsky (2018), this phenomenon is due to the fact that we exploit the Lipschitz continuity of the loss function $l_K(y, \theta)$. The terms on the right-hand side of the excess risk bound increase as any of $D$, $B$, $R$, $n_\theta$ grows or as $\kappa$ shrinks, indicating that the learnability of the decision making model decreases. This is consistent with our observation that uncertainties in the model, data, and parameter space generally increase the difficulty of learning the parameters through inverse multiobjective optimization.

Experiments

In this section, we use an MQP and a portfolio optimization problem to illustrate the performance of Algorithm 1. The MISOCPs for IMOP are solved by Gurobi. All algorithms are implemented in Julia (Bezanson et al. 2017).

Synthetic Data: Learning the Objective Functions of an MQP

Consider the following multiobjective quadratic program:

$$\min_x \ \left\{f_1(x) = \tfrac{1}{2}x^T Q_1 x + c_1^T x, \ f_2(x) = \tfrac{1}{2}x^T Q_2 x + c_2^T x\right\},$$

where the parameters of the two objective functions are

$$Q_1 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \quad c_1 = \begin{pmatrix} -0.5 \\ -1 \end{pmatrix}, \quad Q_2 = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \quad c_2 = \begin{pmatrix} -5 \\ -2.5 \end{pmatrix},$$

and the parameter for the feasible region is

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

We seek to learn $c_1$ and $c_2$ in this experiment. The data are generated as follows. We first compute Pareto optimal solutions $\{x_i\}_{i \in [N]}$ by solving WP with weight samples $\{w_i\}_{i \in [N]}$ chosen uniformly from $W_2$. Next, the noisy decision $y_i$ is obtained by adding noise to $x_i$ for each $i \in [N]$. More precisely, $y_i = x_i + \epsilon_i$, where each element of $\epsilon_i$ is uniformly distributed on $[-0.25, 0.25]$ with mean $0$ for all $i \in [N]$. We assume that $c_1$ and $c_2$ lie within $[-6, 0]^2$ and that their first elements are given. $K = 6$ weights are evenly sampled from $W_2$. The radius $\epsilon$ of the Wasserstein ambiguity set is selected from the set $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$; we report below the results with the lowest prediction error across all candidate radii. The stopping tolerance $\delta$ is set to $0.1$. We then run Algorithm 1 with different $N$. To assess the performance of the algorithm statistically, we run 10 repetitions of each experiment.

Figure 2a shows the maximum constraint violation $\max_{i \in [N]} CV_i$ versus iteration for one repetition when $N = 10$. As can be seen in the figure, the algorithm converges very quickly. In Figure 2b, we report the prediction errors, averaged over 10 repetitions, of both the robust and non-robust approaches for different $N$. Here, we use an independent validation set consisting of $10^5$ noisy decisions, generated in the same way as the training data, to compute the prediction error. The experiments suggest that the Wasserstein distributionally robust approach can significantly reduce the prediction error, especially when $N$ is small, i.e., when we have a very limited number of observations. To further illustrate the performance of Algorithm 1, we randomly pick one repetition and plot the Pareto optimal sets estimated by both approaches in Figure 2c. We can see clearly that the Pareto optimal set estimated by the distributionally robust approach is closer to the real Pareto optimal set than that of the non-robust approach. One can also expect the estimated Pareto optimal sets to get closer and closer to the true one as $K$ and $N$ increase.

Figure 2: Learning the objective functions of an MQP. (a) Maximum constraint violation versus iteration for N = 15. (b) Prediction errors of the two methods for different N, averaged over 10 repetitions. (c) The Pareto optimal set and the Pareto optimal sets estimated by IMOP and WRO-IMOP with N = 10.
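A minimal sketch of this data-generation procedure is given below, under the simplifying assumption that the feasible region is inactive at the sampled solutions so that the closed-form weighted-sum solution applies; the signs of $c_1$ and $c_2$ follow the stated parameter space $[-6, 0]^2$.

```julia
# A minimal sketch of the synthetic data generation described above, assuming
# the feasible region is not active at the sampled solutions so that the
# weighted-sum problem has the closed-form solution used earlier.
using LinearAlgebra, Random

Random.seed!(1)
Q1, c1 = [1.0 0.0; 0.0 2.0], [-0.5, -1.0]
Q2, c2 = [2.0 0.0; 0.0 1.0], [-5.0, -2.5]

N = 10
weights = [(t = rand(); [t, 1 - t]) for _ in 1:N]         # w sampled from W_2
xs = [-(w[1]*Q1 + w[2]*Q2) \ (w[1]*c1 + w[2]*c2) for w in weights]
ys = [x .+ (0.5 .* rand(2) .- 0.25) for x in xs]           # noise ~ U[-0.25, 0.25]
```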
Real World Case Study: Learning the Expected Returns

We consider a portfolio selection problem in which investors need to determine the fraction of their wealth to invest in each security so as to maximize the total return and minimize the total risk. The classical Markowitz mean-variance portfolio selection model (Markowitz 1952), given below, is frequently employed by analysts:

$$\begin{aligned} \min_x \ & \ \{f_1(x) = -r^T x, \ f_2(x) = x^T Q x\} \\ \text{s.t.} \ & \ 0 \leq x_i \leq b_i, \ \forall i \in [n], \qquad \sum_{i=1}^n x_i = 1, \end{aligned}$$

where $r \in \mathbb{R}^n_+$ is a vector of individual security expected returns, $Q \in \mathbb{R}^{n \times n}$ is the covariance matrix of security returns, $x$ is a portfolio specifying the proportions of capital to be invested in the different securities, and $b_i$ is an upper bound on the proportion of security $i$, $i \in [n]$.

Dataset: The dataset is derived from monthly total returns of 30 stocks from a blue-chip index that tracks the performance of the top 30 stocks in the market, where the total investment universe consists of thousands of assets. The true expected returns and the true return covariance matrix for the first 8 securities are given in the supplementary material. Suppose a learner seeks to learn the expected returns of the first four securities that an analyst uses, based on 20 noisy decisions from the investors that the analyst serves. The noisy decision of each investor $i \in [20]$ is generated as follows. We set each upper bound on the proportions of the 8 securities to $b_i = 1.0$, $i \in [8]$. Then we uniformly sample 20 weights and use them to generate optimal portfolios on the efficient frontier plotted in Figure 4. Subsequently, each component of these portfolios is rounded to the nearest thousandth, which can be seen as measurement error. The radius $\epsilon$ of the Wasserstein ambiguity set is selected from the set $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and the stopping tolerance $\delta$ is set to $0.1$.

Figure 3 shows that our algorithm converges in 8 iterations.

Figure 3: Maximum constraint violation versus iteration.

We also plot the efficient frontiers estimated by both the robust and non-robust approaches with $K = 6$ in Figure 4. We can see that the efficient frontier estimated by the Wasserstein distributionally robust approach is closer to the real one than that of the non-robust approach, showing that our method allows for a lower prediction error when only a limited number of observed decisions is accessible. Note that the first objective function is not strongly convex; the experimental results suggest that our reformulation generalizes to a broader class of problems.

Figure 4: Efficient frontier and estimated efficient frontiers (annualized mean versus annualized standard deviation of portfolio returns). The red line indicates the real efficient frontier; the yellow dots indicate the efficient frontier estimated by solving WRO-IMOP; the blue dots indicate the efficient frontier estimated by the non-robust approach.
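The efficient portfolios used above can be sketched with a simple projected-gradient scheme: minimize the weighted-sum scalarization $w_1(-r^T x) + w_2\, x^T Q x$ over the probability simplex (with $b_i = 1$ the box constraints are redundant). The returns vector and covariance matrix below are hypothetical stand-ins, since the paper's data are given in its supplementary material.

```julia
# A minimal sketch of generating one efficient portfolio by weighted-sum
# scalarization of the Markowitz model, using projected gradient descent
# with Euclidean projection onto the probability simplex.
using LinearAlgebra

function project_simplex(v::Vector{Float64})
    u = sort(v; rev = true)
    css = cumsum(u)
    ρ = findlast(j -> u[j] + (1 - css[j]) / j > 0, 1:length(v))
    τ = (1 - css[ρ]) / ρ
    return max.(v .+ τ, 0.0)
end

function efficient_portfolio(r, Q, w; steps = 5_000, η = 1e-2)
    x = fill(1.0 / length(r), length(r))          # start at the uniform portfolio
    for _ in 1:steps
        g = -w[1] .* r .+ 2 .* w[2] .* (Q * x)     # gradient of the scalarization
        x = project_simplex(x .- η .* g)
    end
    return x
end

r = [0.10, 0.08, 0.12, 0.07]                       # hypothetical expected returns
Q = [0.09 0.01 0.02 0.0; 0.01 0.06 0.01 0.0;
     0.02 0.01 0.12 0.01; 0.0 0.0 0.01 0.04]       # hypothetical covariance matrix
println(efficient_portfolio(r, Q, [0.3, 0.7]))
```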
Conclusions and Future Work

In this paper, we present a novel Wasserstein distributionally robust framework for constructing the inverse multiobjective optimization estimator. We show that the proposed framework has statistical performance guarantees: the excess risk of the distributionally robust inverse multiobjective optimization estimator converges to zero at a sub-linear rate as the number of observed decisions approaches infinity. To solve the resulting minmax problem, we reformulate it as a semi-infinite program and develop a cutting-plane algorithm that converges to an approximate solution in finitely many iterations. We demonstrate the effectiveness of our method on both a multiobjective quadratic program and a portfolio optimization problem.

We note that the technical assumptions of strong convexity of the objective functions of DMP and of bounded support of the observations might not be fully satisfied in many real-world scenarios, and it remains to be seen what happens when these assumptions are relaxed. Without them, the formulations and the algorithm are still valid, but their performance can no longer be fully guaranteed; for example, we can then only guarantee that the optimal objective value of the semi-infinite reformulation in Theorem 1 provides an upper bound for WRO-IMOP. In future work, we will extend the current framework and analysis to scenarios in which the DMP has at least one strongly convex objective function or the data follow a light-tailed distribution.

References

Ahuja, R. K.; and Orlin, J. B. 2001. Inverse optimization. Operations Research 49(5): 771-783.

Aswani, A.; Shen, Z.-J.; and Siddiq, A. 2018. Inverse optimization with noisy data. Operations Research.

Bärmann, A.; Pokutta, S.; and Schneider, O. 2017. Emulating the expert: Inverse optimization through online learning. In International Conference on Machine Learning, 400-410.

Bertsimas, D.; Gupta, V.; and Paschalidis, I. C. 2015. Data-driven estimation in equilibrium using inverse optimization. Mathematical Programming 153(2): 595-633.

Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1): 65-98.

Bonnans, J. F.; and Shapiro, A. 1998. Optimization problems with perturbations: A guided tour. SIAM Review 40(2): 228-264.

Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Dong, C.; Chen, Y.; and Zeng, B. 2018. Generalized inverse optimization through online learning. In NeurIPS.

Dong, C.; and Zeng, B. 2020. Expert learning through generalized inverse multiobjective optimization: Models, insights, and algorithms. In ICML.

Esfahani, P. M.; and Kuhn, D. 2018. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming 171(1-2): 115-166.

Esfahani, P. M.; Shafieezadeh-Abadeh, S.; Hanasusanto, G. A.; and Kuhn, D. 2018. Data-driven inverse optimization with imperfect information. Mathematical Programming 167(1): 191-234.

Gao, R.; and Kleywegt, A. J. 2016. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199.

Gass, S.; and Saaty, T. 1955. The computational algorithm for the parametric objective function. Naval Research Logistics 2(1-2): 39-45.

Hettich, R.; and Kortanek, K. O. 1993. Semi-infinite programming: Theory, methods, and applications. SIAM Review 35(3): 380-429.

Joachims, T.; Finley, T.; and Yu, C.-N. J. 2009. Cutting-plane training of structural SVMs. Machine Learning 77(1): 27-59.

Keshavarz, A.; Wang, Y.; and Boyd, S. 2011. Imputing a convex objective function.
In Intelligent Control (ISIC), 2011 IEEE International Symposium on, 613-619. IEEE.

Lee, J.; and Raginsky, M. 2018. Minimax statistical learning with Wasserstein distances. In NeurIPS.

Luo, F.; and Mehrotra, S. 2017. Decomposition algorithm for distributionally robust optimization using Wasserstein metric. arXiv preprint arXiv:1704.03920.

Markowitz, H. 1952. Portfolio selection. The Journal of Finance 7(1): 77-91.

Mutapcic, A.; and Boyd, S. 2009. Cutting-set methods for robust convex optimization with pessimizing oracles. Optimization Methods & Software 24(3): 381-406.

Pólik, I.; and Terlaky, T. 2007. A survey of the S-lemma. SIAM Review 49(3): 371-418.

Sahinidis, N. V. 1996. BARON: A general purpose global optimization software package. Journal of Global Optimization 8(2): 201-205.

Shafieezadeh-Abadeh, S.; Esfahani, P. M.; and Kuhn, D. 2015. Distributionally robust logistic regression. In NIPS.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.