# IMO3: Interactive Multi-Objective Off-Policy Optimization

Nan Wang¹, Hongning Wang¹, Maryam Karimzadehgan², Branislav Kveton³, Craig Boutilier²
¹University of Virginia  ²Google Research  ³Amazon
{nw6a, hw5x}@virginia.edu, maryamk@google.com, bkveton@amazon.com, cboutilier@google.com
(This work started prior to joining Amazon.)

Abstract

Most real-world optimization problems have multiple objectives. A system designer needs to find a policy that trades off these objectives to reach a desired operating point. This problem has been studied extensively in the setting of known objective functions. However, we consider a more practical but challenging setting of unknown objective functions. In industry, optimization under this setting is mostly approached with online A/B testing, which is often costly and inefficient. As an alternative, we propose Interactive Multi-Objective Off-policy Optimization (IMO3). The key idea of IMO3 is to interact with a system designer using policies evaluated in an off-policy fashion to uncover which policy maximizes her unknown utility function. We theoretically show that IMO3 identifies a near-optimal policy with high probability, depending on the amount of the designer's feedback and of the training data for off-policy estimation. We demonstrate its effectiveness empirically on several multi-objective optimization problems.

1 Introduction

Most real-world optimization problems involve multiple objectives. Multi-objective optimization (MOO) has been studied and applied in various fields of system design, including engineering, economics, and logistics, where optimal policies need to trade off multiple, potentially conflicting objectives [Keeney and Raiffa, 1976]. The system designer aims to find the optimal policy that respects her design principles, preferences, and trade-offs. For example, when designing an investment portfolio, one's investment strategy requires trading off maximizing expected gain with minimizing risk [Liang and Qu, 2013].

Two key issues need to be addressed in MOO before policy optimization. First, the objectives in most real-world problems do not have explicit functional forms. Given a decision or policy space, we need a mapping of policies to the expected values of the objectives in question. These objective values may be obtained by executing new policies on live traffic, which is risky and time-consuming [Deaton and Cartwright, 2018; Kohavi et al., 2009]. A more efficient approach could be learning a model from data, such as for the expected return and risk of an investment portfolio. In practice, acquiring data for learning a model can be costly and the model may be biased due to the data-gathering policy [Strehl et al., 2010]. Correcting for such biases is the target of the literature on off-policy evaluation and optimization [Rosenbaum and Rubin, 1983; Strehl et al., 2010; Dudik et al., 2011]. When objective values can be obtained for a new policy, bandit algorithms can be used for optimization [Lattimore and Szepesvári, 2020; Wang et al., 2021]. However, these generally require scalar rewards that already dictate a decision maker's desired trade-offs among the objectives. This leads to the second issue: the specification of a single objective function that dictates the desired trade-offs. This can be viewed as the decision maker's utility function. In the example above, it might specify how much risk a decision maker can tolerate to attain some expected return.
Assessing utility functions almost always requires interaction with the decision maker, requiring human judgements that typically cannot be learned from data in the usual sense [Keeney and Raiffa, 1976]. Moreover, utility elicitation is generally challenging and costly due to the cognitive difficulty faced by human decision makers when trying to assess trade-offs among objectives in a quantitatively precise fashion [Tversky and Kahneman, 1974; Camerer, 2004]. While some elicitation techniques attempt to identify the full utility function [Keeney and Raiffa, 1976], others try to minimize this burden in various ways. One common principle is to limit trade-off assessments to only those that are relevant given the feasible or realizable combinations of objectives w.r.t. the utility model and policy constraints [Boutilier, 2013].¹ This requires that the model is known.

¹ Much work in MOO focuses on the identification of Pareto-optimal solutions [Mas-Colell et al., 1995]. The selection of a solution from this set still requires the decision maker to choose and thus make a trade-off among the objectives, possibly implicitly.

We propose an interactive off-policy technique which supports a system designer in identifying the optimal policy that trades off multiple objectives w.r.t. an unknown utility function [Branke et al., 2008]. The utility function is modeled as a linear scalarization of the objectives considered [Keeney and Raiffa, 1976], where the scalarization parameters specify the trade-off among the objectives. We use off-policy estimators to evaluate policies in an unbiased way without ever executing them. To learn the desired scalarization, we present the off-policy estimates of the objective values of candidate policies to the designer for feedback. The candidate policies are chosen judiciously to maximize the information gain from the feedback. Over time, (i) the scalarization converges to the trade-offs embodied in the unknown utility function by learning from the designer's feedback; and (ii) the policy induced by the scalarization converges to the optimal policy. We analyze our approach and prove theoretical guarantees for finding near-optimal policies. Our comprehensive empirical evaluation on four multi-objective optimization problems shows the effectiveness of our method.²

2 Problem Formulation

We denote the set {1, . . . , n} by [n]. Consider a policy optimization problem with d > 1 (potentially conflicting) objectives. Let X be a context space and A be an action space with K actions. In each round, x ∈ X is sampled from a context distribution P_x. An action a ∈ A is taken in response following a stochastic policy π(· | x), which is a distribution over A for any x ∈ X. The policy space is Π = {π : π(· | x) ∈ Δ_{K−1}, ∀x ∈ X}, where Δ_K denotes the K-simplex with K + 1 vertices. After taking action a, the agent receives a d-dimensional reward vector r ∈ [0, 1]^d sampled from a reward distribution P_r(· | x, a), corresponding to the d objectives. The expected value of policy π is

$$V(\pi) = \mathbb{E}_{x \sim P_x,\; a \sim \pi(\cdot \mid x),\; r \sim P_r(\cdot \mid x, a)}\left[r\right].$$

Note that V(π) is a d-dimensional vector whose i-th entry V_i(π) is the expected value of objective i under policy π. We assume that there exists a utility function u_θ, parameterized by θ, which is used by the designer to assess the quality u_θ(v) of any objective-value vector v ∈ R^d.
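To make the definition concrete, here is a minimal Monte Carlo sketch (ours, not from the paper) of how the d-dimensional value V(π) could be estimated when the environment can be sampled directly; `sample_context`, `policy`, and `sample_reward` are hypothetical stand-ins for P_x, π(· | x), and P_r(· | x, a).

```python
import numpy as np

def policy_value(sample_context, policy, sample_reward, d, n_samples=100_000, seed=0):
    """Monte Carlo estimate of V(pi) = E[r], a d-dimensional vector.

    sample_context(rng)      -> context x
    policy(x)                -> length-K probability vector pi(. | x)
    sample_reward(rng, x, a) -> length-d reward vector in [0, 1]^d
    """
    rng = np.random.default_rng(seed)
    total = np.zeros(d)
    for _ in range(n_samples):
        x = sample_context(rng)              # x ~ P_x
        p = policy(x)                        # pi(. | x)
        a = rng.choice(len(p), p=p)          # a ~ pi(. | x)
        total += sample_reward(rng, x, a)    # r ~ P_r(. | x, a)
    return total / n_samples                 # estimate of V(pi)
```

In the rest of the paper neither this sampler nor V(π) itself is available; Section 4 replaces it with off-policy estimates computed from logged data.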
Without loss of generality, we assume that u_θ is absolutely monotonic in each objective, but the correlations and conflicts among the objectives are unknown. We adopt the common assumption that u_θ is linear [Keeney and Raiffa, 1976] and determined by a scalarization u_θ(v) = θ^⊤ v of the objective values, where θ ∈ R^d determines the designer's trade-off among the objectives. We treat θ as a priori unknown and assume that it is not easy for the designer to specify directly. Hence, we learn it through interactions with the designer. The optimal policy, for any fixed designer's trade-off preferences θ ∈ R^d, is defined as

$$\pi^* = \arg\max_{\pi \in \Pi} u_\theta\!\left(V(\pi)\right). \qquad (1)$$

Since the interactions can be costly, we consider a fixed budget of T rounds of interactions with the designer. Our goal is to find a near-optimal policy with high probability after the interactions. Specifically, we use simple regret [Lattimore and Szepesvári, 2020] to measure the optimality of a policy π̂ identified after T rounds of interactions, which is the difference in the utilities of π* and π̂,

$$R^{\mathrm{sim}}_T = u_\theta\!\left(V(\pi^*)\right) - u_\theta\!\left(V(\hat{\pi})\right). \qquad (2)$$

Note that we do not consider the optimality of the policies presented during the interactions.

² An extended version including all appendices can be found at https://arxiv.org/abs/2201.09798.

3 General Algorithm Design

We first describe our approach in general terms, motivating it by the de facto standard approach to A/B testing in industry [Kohavi et al., 2009]. In the standard iterative approach, a policy designer proposes a candidate policy π and evaluates it on live traffic for some time period (say, two weeks, to average out basic seasonal trends). If π outperforms a production policy (e.g., it improves some metrics/objectives and does not degrade others, or it achieves a desired trade-off among all objectives), π is accepted and deployed. If it does not, the designer proposes a new candidate policy and the process is repeated. This approach has three major shortcomings. First, each iteration takes a long time and many iterations may be needed to find a good policy. Second, it is difficult to propose good candidate policies, because the policy space is large and it is not a priori clear which objective trade-offs are feasible. Finally, due to the difficulty of managing changes in the control and treatment groups in large-scale platforms, online randomized experiments often lead to unexpected results [Kohavi et al., 2009; Kohavi and Longbotham, 2011], which limit their efficiency and application in fast-evolving industrial settings.

Consider an idealized scenario where V(π) is known for any policy π ∈ Π. Then we can learn θ in (1) by interacting with the designer. A variety of preference elicitation techniques could be used [Keeney and Raiffa, 1976; Boutilier, 2013]. We study the following approach. In round (interaction) t, we (i) propose a policy π_t; (ii) present the value vector V(π_t) to the designer; and (iii) obtain a noisy response based on the designer's true utility u_{θ*}(V(π_t)). The feedback can take different forms, but ultimately reflects the designer's perceived value for π_t. We assume binary feedback of the form "Is policy π acceptable?", motivated by our industry example.
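As a minimal sketch of this idealized protocol (ours, not from the paper), assuming a known value oracle `policy_value` and hypothetical helpers `propose_policy` and `designer_feedback`:

```python
import numpy as np

def elicit(propose_policy, policy_value, designer_feedback, T, seed=0):
    """Idealized elicitation loop when V(pi) is known exactly.

    propose_policy(t, history) -> candidate policy pi_t
    policy_value(pi)           -> d-dimensional value vector V(pi)
    designer_feedback(rng, v)  -> 1 ("acceptable") or 0 ("not acceptable")
    """
    rng = np.random.default_rng(seed)
    history = []                          # (value vector, binary feedback) pairs
    for t in range(T):
        pi_t = propose_policy(t, history)
        v_t = policy_value(pi_t)          # value vector shown to the designer
        y_t = designer_feedback(rng, v_t)
        history.append((v_t, y_t))
    return history
```

Section 5 instantiates `propose_policy` with a G-optimal design over pre-selected candidates and `designer_feedback` with a logistic response model, and replaces `policy_value` with an off-policy estimate.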
In this work, we consider a more realistic but also more challenging setting where V(π) is unknown. In principle, any policy π can be evaluated on live traffic. However, online evaluation can be costly, inefficient, and time-consuming, leading to unacceptable delays in finding π* [Deaton and Cartwright, 2018; Kohavi et al., 2009]. To address this issue, we evaluate π offline using logged data generated by some prior policy, such as the production policy [Swaminathan and Joachims, 2015]. In Section 4, we introduce the three most common off-policy estimators for this purpose. The off-policy estimated value vector V̂(π) is then used in the elicitation process with the designer. Finally, we learn θ and π based on the estimated values and noisy feedback from interactions with the designer. We present our algorithm and analyze it in Section 5.

4 Multi-Objective Off-Policy Evaluation and Optimization

In this section, we discuss how to evaluate a policy π using logged data generated by another (say, production) policy, and how to optimize π w.r.t. any (fixed and known) scalarization parameters θ. We have a set of logged records D = {(x_j, a_j, r_j)}_{j=1}^N collected by a logging policy π_0. For the j-th record, x_j is the context, a_j is the action taken by π_0, and r_j ∈ R^d is the realized reward vector corresponding to the d objectives. We also assume that the propensity scores π_0(a_j | x_j) (i.e., the probability that π_0 takes action a_j given context x_j) are logged. If not, they can be estimated from the logged data [Strehl et al., 2010].

4.1 Evaluation

Off-policy evaluation has been studied extensively in the single-objective setting [Strehl et al., 2010; Dudik et al., 2011]. Generally, better evaluation leads to better optimization [Strehl et al., 2010]. By treating the reward as a d-dimensional vector rather than a scalar, we can adapt existing off-policy estimators to MOO. We adapt three popular estimators below.

The first estimator, the direct method (DM) [Lambert and Pregibon, 2007], estimates the expected reward vector E[r | x, a] by r̂(a, x) ∈ R^d, where r̂ is some offline-learned reward model. The policy value is estimated by

$$\hat{V}_{\mathrm{DM}}(\pi) = \frac{1}{N} \sum_{j=1}^{N} \sum_{a \in A} \pi(a \mid x_j)\, \hat{r}(a, x_j). \qquad (3)$$

Since the model is learned without knowledge of π, it may focus on areas irrelevant for estimating V(π), resulting in a biased estimate of V(π).

The second estimator, inverse propensity scoring (IPS) [Rosenbaum and Rubin, 1983], is less prone to bias. Instead of estimating rewards, IPS uses the propensities of the logged records to correct for the shift between the logging and new policies,

$$\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{j=1}^{N} \min\!\left\{M, \frac{\pi(a_j \mid x_j)}{\pi_0(a_j \mid x_j)}\right\} r_j, \qquad (4)$$

where M > 0 is a hyper-parameter that trades off the bias and variance of the estimate. The IPS estimator is unbiased for M = ∞, but can have a high variance if π takes actions that are unlikely under π_0. When M is small, the variance is small but the bias can be high, since the IPS scores are clipped.

To alleviate the high variance of IPS, we can take advantage of both r̂ and IPS to construct the doubly robust (DR) estimator [Dudik et al., 2011],

$$\hat{V}_{\mathrm{DR}}(\pi) = \frac{1}{N} \sum_{j=1}^{N} \frac{\pi(a_j \mid x_j)}{\pi_0(a_j \mid x_j)} \left(r_j - \hat{r}(a_j, x_j)\right) + \hat{V}_{\mathrm{DM}}(\pi). \qquad (5)$$

Intuitively, r̂ is used as a baseline for the IPS estimator. If the model for reward estimation is unbiased or the propensities are correctly specified, DR provides an unbiased estimate of the value. It has been shown that DR achieves lower variance than IPS [Dudik et al., 2011].

4.2 Optimization

A key component in our approach is policy optimization, i.e., finding the optimal policy given a scalarization vector θ,

$$\hat{\pi} = \arg\max_{\pi \in \Pi} u_\theta\!\left(\hat{V}(\pi)\right) = \arg\max_{\pi \in \Pi} \theta^\top \hat{V}(\pi), \qquad (6)$$

where V̂(π) is some off-policy estimator. The optimized variables are the entries of π ∈ Π that represent the probabilities of taking actions.
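The estimators of Section 4.1 can be written compactly in NumPy. The following is an illustrative sketch (ours, not the authors' code), assuming the logged contexts X, actions A, rewards R, and propensities p0 are given as arrays, and that a reward model `r_hat(a, x)` returning a d-dimensional prediction is available.

```python
import numpy as np

def v_dm(pi, X, r_hat, K):
    """Direct method (3): average over logged contexts of sum_a pi(a|x) r_hat(a, x)."""
    # X: contexts of the N logged records; r_hat(a, x) -> d-dimensional predicted reward
    return np.mean([sum(pi(a, x) * r_hat(a, x) for a in range(K)) for x in X], axis=0)

def v_ips(pi, X, A, R, p0, M):
    """Clipped inverse propensity scoring (4); R is (N, d), p0 are logged propensities."""
    w = np.minimum(M, np.array([pi(a, x) for x, a in zip(X, A)]) / p0)   # clipped weights
    return (w[:, None] * R).mean(axis=0)

def v_dr(pi, X, A, R, p0, r_hat, K):
    """Doubly robust (5): DM baseline plus an IPS correction on the model residuals."""
    w = np.array([pi(a, x) for x, a in zip(X, A)]) / p0                  # unclipped weights
    resid = R - np.array([r_hat(a, x) for x, a in zip(X, A)])            # (N, d) residuals
    return (w[:, None] * resid).mean(axis=0) + v_dm(pi, X, r_hat, K)
```

Under these conventions, the scalarized objective in (6) for a candidate policy is simply `float(theta @ v_ips(pi, X, A, R, p0, M))`, with clipping applied only to the IPS weights, as in (4)-(5).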
In Appendix A, we prove that (6) can be formulated as a linear program (LP) for all off-policy estimators in Section 4.1 in the tabular case, where the policy is parameterized separately for each context. For non-tabular policies, we suggest using gradient-based policy optimization methods [Swaminathan and Joachims, 2015], though we provide no theoretical guarantees for this case. Since (6) is an LP for all estimators, at least one solution to (6) is a vertex of the feasible set, corresponding to non-dominated policies, which cannot be written as a convex combination of other policies. For such policies, we can learn π* by first learning θ*.

5 IMO3: Interactive Multi-Objective Off-Policy Optimization

Off-policy estimation and optimization in Section 4 assume that the utility parameters θ are known. Now we turn to interactively estimating θ by querying the designer for feedback on carefully selected policies over T rounds. Utility elicitation can be accomplished using a variety of query formats (e.g., value queries, bound queries, k-wise comparisons, critiques) and optimization criteria for selecting queries [Keeney and Raiffa, 1976; Boutilier, 2002; Boutilier, 2013].

5.1 Response Model

Following common industrial practice (Section 3), we adopt a simple query format where we ask the designer to rate an objective-value vector v corresponding to the d objectives as acceptable or not acceptable. We require a response model that relates this stochastic feedback to the designer's underlying utility for v. We adopt a logistic response model

$$\ell_{\theta^*}(v) = \frac{1}{1 + \exp(-u_{\theta^*}(v))}, \qquad (7)$$

where u_{θ*}(v) = θ*^⊤ v, and the designer responds "acceptable" with probability ℓ_{θ*}(v) and "not acceptable" otherwise. Roughly speaking, this can be understood as the designer's noisy feedback relative to some implicit baseline (e.g., the value vector of the production policy). Logistic response of this form arises frequently in modeling binary or k-wise discrete choice in econometrics, psychometrics, marketing, AI, and other fields [McFadden, 1974; Viappiani and Boutilier, 2010]; and it lies at the heart of feedback mechanisms in much of the dueling bandits literature [Dudík et al., 2015]. We defer the study of other types of feedback to future work.

5.2 Algorithm

Now we introduce IMO3 for engaging the designer in solving the problem. We approach the problem as fixed-budget best-arm identification (BAI) [Karnin et al., 2013], where we minimize the simple regret (2) in T rounds of interactions. At a high level, IMO3 works as follows. In round t ∈ [T], it selects a policy (arm) π_t and presents its off-policy estimated value vector V̂(π_t) to the designer. The designer responds with Y_t ∼ Ber(ℓ_{θ*}(V̂(π_t))), where Ber(µ) is a Bernoulli distribution with mean µ. After T rounds, IMO3 computes the maximum likelihood estimate (MLE) θ̂ of θ*, where V̂(π_t) serves as the feature vector for response Y_t. To incorporate other feedback and response models, only the MLE would change. Therefore, our algorithmic template is very general.

Algorithm 1 IMO3
Input: Logging policy π_0, logged data D, budget T, and pre-selection budget L
1: W ← {}
2: for i = 1, . . . , L do
3:   Sample θ_i from a unit ball in R^d
4:   π_i ← arg max_{π∈Π} u_{θ_i}(V̂(π))
5:   W ← W ∪ {π_i}
6: P_G(W) ← G-optimal design over W
7: for t = 1, . . . , T do
8:   π_t ∼ P_G(W)
9:   Present V̂(π_t) to the designer and observe Y_t
10: θ̂ ← MLE({(V̂(π_t), Y_t)}_{t=1}^T)
11: Return π̂ ← arg max_{π∈Π} u_{θ̂}(V̂(π))
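To make the steps concrete, here is a compact, illustrative Python sketch of Algorithm 1 under simplifying assumptions (it is not the authors' implementation): candidate scalarizations are sampled as unit directions (the argmax in line 4 is invariant to the norm of θ_i), the G-optimal design is approximated with a few Frank-Wolfe steps, and the MLE is a plain logistic fit via scipy. The helpers `optimize_policy(theta)` (the LP/argmax of Section 4.2), `v_hat(pi)` (any estimator from Section 4.1), and `designer(v)` (binary feedback) are assumed to be provided.

```python
import numpy as np
from scipy.optimize import minimize

def g_optimal_design(V, n_iters=500):
    """Approximate G-optimal design over value vectors V (L x d) via Frank-Wolfe.

    Minimizes g(alpha) = max_i v_i^T G_alpha^{-1} v_i, G_alpha = sum_i alpha_i v_i v_i^T.
    """
    L, d = V.shape
    alpha = np.ones(L) / L
    for t in range(n_iters):
        G = V.T @ (alpha[:, None] * V) + 1e-8 * np.eye(d)   # small ridge for invertibility
        scores = np.einsum("id,de,ie->i", V, np.linalg.inv(G), V)
        i = int(np.argmax(scores))          # candidate with the largest design variance
        step = 2.0 / (t + 2)
        alpha *= (1.0 - step)
        alpha[i] += step                    # Frank-Wolfe step toward vertex e_i
    return alpha

def logistic_mle(Vs, Ys):
    """MLE of theta for Y ~ Ber(sigmoid(v^T theta))."""
    Vs, Ys = np.asarray(Vs, dtype=float), np.asarray(Ys, dtype=float)
    nll = lambda th: np.sum(np.logaddexp(0.0, Vs @ th) - Ys * (Vs @ th))
    return minimize(nll, np.zeros(Vs.shape[1]), method="BFGS").x

def imo3(v_hat, optimize_policy, designer, T, L, d, seed=0):
    rng = np.random.default_rng(seed)
    # Lines 1-5: pre-select L candidate policies, each optimal for a random scalarization.
    thetas = rng.normal(size=(L, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    W = [optimize_policy(th) for th in thetas]              # pi_i = argmax_pi th^T v_hat(pi)
    V = np.array([v_hat(pi) for pi in W])                   # their estimated value vectors
    # Line 6: G-optimal design over the candidate value vectors.
    alpha = g_optimal_design(V)
    # Lines 7-10: query the designer T times, then fit theta by MLE.
    idx = rng.choice(L, size=T, p=alpha / alpha.sum())
    Ys = [designer(V[i]) for i in idx]                      # binary "acceptable?" feedback
    theta_hat = logistic_mle(V[idx], Ys)
    # Line 11: return the policy maximizing the estimated utility.
    return optimize_policy(theta_hat)
```

Note that, as in the algorithm, the sketch uses no designer feedback during pre-selection; the designer is queried only in the second loop.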
To make IMO3 statistically efficient in identifying the optimal policy with a limited budget, we must design a good distribution over the policies presented to the designer. One challenge is that the policy space Π is continuous and infinite. To address this issue, we first discretize Π into a set W of L diverse policies, which are optimal under different random scalarizations. The other challenge is learning θ efficiently. We approach this as an optimal design problem [Wong, 1994]. Specifically, we use G-optimality to design a distribution over W, from which we draw π_t in round t, that minimizes the variance of the MLE θ̂. Since the design is variance-minimizing, IMO3 chooses the final policy π̂ solely based on the highest mean utility under θ̂. We experimented with more complex algorithm designs, where the distribution of π_t was adapted over time, analogous to sequential halving in BAI [Karnin et al., 2013; Jamieson and Talwalkar, 2016]. However, none of these approaches improved IMO3 under small fixed budgets, and thus we focus on the non-adaptive algorithm.

We present IMO3 in Algorithm 1. In lines 1-5, the policy space Π is discretized into L policies W. Any policy π_i ∈ W is optimal under its θ_i. Since the θ_i are sampled uniformly from a unit ball, representing all scalarization directions, the policies π_i are diverse and allow us to learn about any θ* efficiently. Note that we do not interact with the designer in this stage. In line 6, we compute the G-optimal design over W, a distribution over W that minimizes the variance of the MLE θ̂ after T rounds. In lines 7-9, we interact with the designer over T rounds: in round t ∈ [T], we draw π_t according to the G-optimal design, present its values V̂(π_t) to the designer, and receive feedback Y_t. In line 10, we compute the MLE θ̂ from all collected observations {(V̂(π_t), Y_t)}_{t=1}^T. Finally, in line 11, we use the estimated θ̂ to identify the optimal policy π̂ w.r.t. the off-policy estimated values using an LP (Section 4.2).

5.3 Regret Analysis

We now analyze the simple regret of IMO3, which is defined in (2). Due to space constraints, we focus on the IPS estimator and then discuss extensions to other estimators. To state our regret bound, we first introduce some notation. Let W = {π_i}_{i=1}^L be the set of pre-selected policies in IMO3 and V = {v_i}_{i=1}^L be their estimated values, with v_i = V̂(π_i). Let π̂* = arg max_{π∈Π} u_{θ*}(V̂(π)) be the optimal policy under θ* in the logged data. Assume π̂* ∈ W and let π_1 = π̂* without loss of generality. Let µ_i = v_i^⊤ θ* ∈ [0, 1] be the utility of policy π_i and Δ_i = µ_1 − µ_i be its gap. Let Δ_min = min_{i>1} Δ_i be the minimum gap. Let α* = arg min_{α∈Δ_{L−1}} g(α) be the G-optimal design on V, where g(α) = max_{i∈[L]} v_i^⊤ G_α^{−1} v_i and G_α = Σ_{i=1}^L α_i v_i v_i^⊤. Let h(·) be the sigmoid function and h′(·) its derivative.

Theorem 1. Let c_min, δ_1 > 0 be chosen such that min_{v∈V} min{h′(v^⊤ θ*), h′(v^⊤ θ̂)} ≥ c_min holds with probability at least 1 − δ_1. Then

$$R^{\mathrm{sim}}_T \le L \exp\!\left(-\frac{\Delta_{\min}^2 c_{\min}^2 T}{2\, g(\alpha^*)}\right) + d M \sqrt{\frac{2 \log(2d/\delta_2)}{N}} \qquad (8)$$

holds with probability at least 1 − (δ_1 + 2δ_2), where d is the number of objectives, M is the tunable parameter in the IPS estimator, and N is the size of the logged data.

The proof of Theorem 1 is in Appendix B. The regret bound decomposes into two terms. The first term is the regret of BAI w.r.t. the estimated policy values and decreases with the amount of designer feedback T. The second term reflects the error of the IPS estimator and decreases with the data size N. Specifically, the first term in (8) is O(L exp[−T]). While it increases with the number of pre-selected policies L, it decreases exponentially with the budget T. Therefore, even relatively small budgets of T = O(log L) lead to low simple regret. This is important, as a large L may be needed to guarantee that the optimal policy under θ* in the logged data satisfies π̂* ∈ W, a condition in our theorem. Regarding the other terms, Δ²_min c²_min is a problem-specific constant and we minimize g(α*) by design. The second term in (8) decreases with the data size N at an expected rate of O(√(1/N)).
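For intuition, the claim that T = O(log L) suffices follows from a short calculation (ours, not from the paper): driving the first term of (8) below a target ε requires

$$L \exp\!\left(-\frac{\Delta_{\min}^2 c_{\min}^2 T}{2\, g(\alpha^*)}\right) \le \varepsilon
\quad\Longleftrightarrow\quad
T \;\ge\; \frac{2\, g(\alpha^*)}{\Delta_{\min}^2 c_{\min}^2}\, \log\frac{L}{\varepsilon},$$

i.e., a budget that grows only logarithmically in the number of pre-selected policies L.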
Now we discuss the errors for other estimators. For the DM estimator, this error depends on the quality of the reward model and cannot be directly analyzed. It could be large when the reward model is biased. For the DR estimator, the estimate is unbiased if the reward model is unbiased or the propensity scores are correctly specified. If the reward model is unbiased, the error can be bounded in the same way as for the IPS estimator. Otherwise, the error cannot be directly analyzed due to the bias of the reward model.

6 Experiments

In this section, we evaluate IMO3 on four MOO problems. We introduce the problems in Section 6.1, describe several baseline methods in Section 6.2, and evaluate IMO3 against the baselines from different perspectives in Section 6.3. Due to space limits, we put the details of how the logged data are generated for each problem in Appendix C. To simulate designer feedback, we sample the ground-truth scalarization θ* ∈ R^d from the unit ball and sample responses from Ber(ℓ_{θ*}(V̂(π))), where V̂(π) is the off-policy estimated value vector presented to the designer. We generate feedback in the same way in all four problems.

[Figure 1: Simple regret of different algorithms for a fixed logged data size N = 20,000 and a varying interaction budget, on (a) ZDT1, (b) Crashworthiness, (c) Stock investment, and (d) Yahoo! news recommendation. Each experiment is averaged over 10 logged datasets, 10 randomly selected θ*, and 5 runs under each combination of logged data and θ*.]

6.1 Multi-Objective Optimization Problems

ZDT1. The ZDT test suite [Zitzler et al., 2000] is the most widely employed benchmark for MOO. We use ZDT1, the first problem in the test suite, a box-constrained n-dimensional two-objective problem, with objectives F_1 and F_2 defined as

$$F_1(x) = 5 x_1, \qquad F_2(x) = g(x)\left(1 - \sqrt{x_1}\right), \qquad g(x) = 1 + \frac{9 \sum_{i=2}^{n} x_i}{n - 1},$$

where x = (x_i)_{i=1}^n are the variables and x_i ∈ [0, 1] for i ∈ [n]. We use n = 5 in our experiments, treating (x_4, x_5) as the context and performing optimization over (x_i)_{i=1}^3. We sample five combinations of (x_4, x_5) uniformly to create the context set X and ten combinations of (x_i)_{i=1}^3 to create the action set A.
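For reference, a small sketch of these two objectives as reconstructed above (the exact form and scaling of the paper's ZDT1 variant may differ slightly, so treat the constants as illustrative):

```python
import numpy as np

def zdt1_objectives(x):
    """Two ZDT1-style objectives for x in [0, 1]^n (here n = 5), as described above."""
    x = np.asarray(x, dtype=float)
    n = x.size
    g = 1.0 + 9.0 * x[1:].sum() / (n - 1)
    f1 = 5.0 * x[0]                     # F1(x) = 5 * x1 (scaled variant described above)
    f2 = g * (1.0 - np.sqrt(x[0]))      # F2(x) = g(x) * (1 - sqrt(x1))
    return np.array([f1, f2])
```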
Crashworthiness. This MOO problem is extracted from a real-world crashworthiness domain [de Carvalho et al., 2018], where three objectives factor into the optimization of the crash-safety level of a vehicle. We refer to Sec. 2.1 of [de Carvalho et al., 2018] for the detailed objective functions and constraints. Five bounded decision variables (x_i)_{i=1}^5 represent the thickness of the reinforced members around the car front. We use different combinations of the last two variables as contexts and the first three as actions. The remaining settings are the same as for ZDT1.

Stock Investment. The stock investment problem is a widely studied real-world MOO problem [Liang and Qu, 2013], where we need to trade off the returns and volatility of an investment strategy. We consider investing one dollar in a stock at the end of each day as an action and try to optimize the relative gain and volatility of this investment at the end of the next day. Specifically, the relative gain is the stock's closing price on the second day minus that on the first day, and we use the absolute difference as a measure of investment volatility. Our goal is to maximize the relative gain and minimize the volatility between two consecutive days of a one-dollar investment, on average. We use 48 popular stocks (see Appendix C for the full list) as the action set A, and the four quarters of a year as the context set X. We collect the closing stock prices from Yahoo Finance for the period Nov. 1, 2020 to Nov. 1, 2021 to generate the logged data.

Yahoo! News Recommendation. This is a news article recommendation problem derived from the Yahoo! Today Module click log dataset (R6A). We consider two objectives to maximize: the click-through rate (CTR) and the diversity of the recommended articles. In the original dataset, each record contains the recommended article, the click event (0 or 1), the pool of candidate articles, and a 6-dimensional feature vector for each article in the pool. The recommended article is selected from the pool uniformly at random. We adopt the original click event in the logged dataset to measure the CTR of a recommendation, and use the ℓ2 distance between the recommended article's feature vector and the average feature vector in the pool to represent the diversity of the recommendation. For our experiments, we extract five different article pools as contexts and all logged records associated with them from the original data, resulting in 1,123,158 records in total. Each article pool has 20 candidates as actions.

6.2 Baselines

Random Policy (Rand-P). The random policy [Jamieson and Talwalkar, 2016] is a standard baseline in BAI, which selects a policy (arm) π_t ∈ Π uniformly at random from the policy space in each round t. The off-policy value estimate V̂(π_t) is presented to the designer for feedback Y_t. After T rounds, the value estimates and their feedback are used to form the maximum likelihood estimate θ̂ of θ*, which is used to solve (6) for the final identified policy.

Random Trade-off (Rand-T). Instead of sampling a random policy, Rand-T samples a trade-off vector θ_t uniformly at random from a d-dimensional unit ball, which is used to identify a policy π_t in each round by policy optimization in (6). The rest is the same as the Rand-P baseline.

Logistic Thompson Sampling (Log-TS). Many cumulative regret minimization algorithms with guarantees exist [Abeille and Lazaric, 2017; Kveton et al., 2020]. We also consider a cumulative-to-simple regret reduction as a baseline. We adapt Thompson sampling (TS) for generalized linear bandits [Abeille and Lazaric, 2017; Kveton et al., 2020] to the BAI problem, and output the best policy as the average of its selected policies.
In each round t, Log-TS samples a trade-off vector θ_t from the current posterior over θ*, which is used to identify a policy π_t by policy optimization using (6). Then V̂(π_t) and the feedback Y_t are used to update the posterior. The final output policy is the average of all policies selected over the T rounds, π̂ = (1/T) Σ_{t=1}^T π_t. This reduction of Log-TS leads to a simple regret of R̂^sim_T = Õ(d^{3/2} √(log(1/δ)/T)), where Õ denotes big-O up to logarithmic factors in T. The proof is in Appendix C.

[Figure 2: Simple regret of different algorithms for a fixed interaction budget T = 100 and a varying logged data size, on (a) ZDT1, (b) Crashworthiness, (c) Stock investment, and (d) Yahoo! news recommendation. Each experiment is averaged over 10 logged datasets, 10 randomly selected θ*, and 5 runs under each combination of logged data and θ*.]

IMO3 with different value estimators. We fix the pre-selection budget at L = 500, which requires no designer feedback. To assess the impact of off-policy estimated values on optimization performance, we test IMO3 with its off-policy estimated values replaced by the true expected values (dubbed IMO3-true). We use the IPS estimator by default. Experiments with the DM and DR estimators can be found in Appendix C.

6.3 Results and Analysis

For each of the problems, we first fix the size of the logged dataset and assess how simple regret (lower is better) varies with the interaction budget T. The results are shown in Figure 1. Each result is averaged over ten generated logged datasets, ten randomly sampled θ*, and 5 repeated runs under each combination of logged data and θ* (error bars represent standard errors). We see that IMO3 outperforms or performs comparably to our baselines in all four problems. While Rand-T is similar to the pre-selection phase of IMO3 and performs relatively well, its exploration is less efficient and limited by the budget, and thus it is worse than IMO3. This illustrates the advantage of using G-optimal design with a sufficient number of pre-selected policies to query the designer for feedback. The gap between IMO3 using estimated vs. true values is due to errors in value estimation; see the second term in our regret bound (Theorem 1). This term is invariant w.r.t. T, and thus the gap remains relatively constant as T varies in our experiments.

We further study how the amount of logged data influences the simple regret of IMO3. We fix T = 100 and vary the size of the logged dataset used for policy-value estimation. Intuitively, if the dataset is sufficient to provide an accurate value estimate for any policy, IMO3 should perform similarly to directly using the true values. The results in Figure 2 show that when the logged dataset is small, inaccurate value estimates cause algorithms that rely on off-policy estimates to perform poorly compared to using true values. As the size of the dataset increases, the decrease in value-estimation error allows IMO3 to outperform the baselines by selecting the most effective policies for querying the designer.
When the logged dataset is sufficiently large, more accurate value estimates ensure that IMO3 converges to the performance of using true values.

7 Related Work

Drugan and Nowé [2013] is the first work to propose, analyze and experiment with a Pareto UCB1 algorithm and a UCB1 algorithm with a scalarized objective for MOO. Auer et al. [2016] formulate the problem of Pareto-frontier identification as a BAI problem. Thompson sampling in MOO is studied, but not analyzed, by Yahyaa and Manderick [2015]. Two recent works apply Gaussian process (GP) bandits to MOO. Paria et al. [2019] model the posterior of each objective function as a GP and minimize regret w.r.t. a known distribution of scalarization vectors. Zhang and Golovin [2020] show that this algorithm generates a set of points that maximize random hypervolume scalarizations, an objective often used in practice. All of the above works are in the online setting, where the agent probes the environment to learn about its objective functions. Our setting is offline and the objective functions are estimated from logged data collected by some prior policy. In terms of motivation, the closest work to ours is that of Roijers et al. [2017], who treat online MOO as a two-stage problem, where the objective functions are estimated using initial interactions with the environment and the scalarization vector is then estimated via user interaction. Unlike our work, they do not propose a specific algorithm for their setting, but only adapt existing bandit algorithms based on learned utility functions. They also do not formulate their problem as off-policy optimization, and thus the process can be costly.

8 Conclusions

In this work, we study multi-objective optimization with unknown objective functions. We propose an interactive off-policy optimization algorithm for finding the optimal policy that achieves the desired trade-off among objectives. Specifically, we adapt off-policy estimators to evaluate policy values on all objectives, choose policies that effectively elicit the designer's preferences, and learn the optimal policy using best-arm identification. We prove upper bounds on the simple regret of our method and demonstrate its effectiveness with experiments on four MOO problems. For future work, we plan to generalize (and analyze) our algorithm to more complex utility functions and other types of feedback and response models. We applied G-optimal design for BAI to provide theoretical guarantees; using other BAI algorithms for MOO is also of interest.

Acknowledgments

This work is partially supported by the National Science Foundation under grants IIS-2128009, IIS-2007492, and IIS-1553568.

References

[Abeille and Lazaric, 2017] Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. In AISTATS, 2017.
[Auer et al., 2016] Peter Auer, Chao-Kai Chiang, Ronald Ortner, and Madalina M. Drugan. Pareto front identification from stochastic bandit feedback. In AISTATS, 2016.
[Boutilier, 2002] Craig Boutilier. A POMDP formulation of preference elicitation problems. In Eighteenth National Conference on Artificial Intelligence, 2002.
[Boutilier, 2013] Craig Boutilier. Computational decision support: Regret-based models for optimization and preference elicitation. In Comparative Decision Making: Analysis and Support Across Disciplines and Applications, 2013.
[Branke et al., 2008] J. Branke, K. Deb, Kaisa Miettinen, and R. Slowinski. Multiobjective Optimization: Interactive and Evolutionary Approaches. Springer-Verlag, 2008.
[Camerer, 2004] Colin F. Camerer. Advances in Behavioral Economics. 2004.
[de Carvalho et al., 2018] Vinicius Renan de Carvalho and Jaime Simão Sichman. Solving real-world multi-objective engineering optimization problems with an election-based hyper-heuristic. 2018.
[Deaton and Cartwright, 2018] Angus Deaton and Nancy Cartwright. Understanding and misunderstanding randomized controlled trials. Social Science & Medicine, 2018.
[Drugan and Nowé, 2013] Madalina M. Drugan and Ann Nowé. Designing multi-objective multi-armed bandits algorithms: A study. In IJCNN, 2013.
[Dudik et al., 2011] Miroslav Dudik, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In ICML, 2011.
[Dudík et al., 2015] Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In COLT, 2015.
[Jamieson and Talwalkar, 2016] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In AISTATS, 2016.
[Karnin et al., 2013] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In ICML, 2013.
[Keeney and Raiffa, 1976] Ralph L. Keeney and Howard Raiffa. Decisions with Multiple Objectives: Preferences and Value Trade-offs. Wiley, 1976.
[Kohavi and Longbotham, 2011] Ron Kohavi and Roger Longbotham. Unexpected results in online controlled experiments. SIGKDD Explorations Newsletter, 2011.
[Kohavi et al., 2009] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: Survey and practical guide. In KDD, 2009.
[Kveton et al., 2020] Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In AISTATS, 2020.
[Lambert and Pregibon, 2007] Diane Lambert and Daryl Pregibon. More bang for their bucks: Assessing new features for online advertisers. SIGKDD Explorations, 2007.
[Lattimore and Szepesvári, 2020] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[Liang and Qu, 2013] J. J. Liang and B. Y. Qu. Large-scale portfolio optimization using multiobjective dynamic multi-swarm particle swarm optimizer. In IEEE-SIS, 2013.
[Mas-Colell et al., 1995] Andreu Mas-Colell, Michael D. Whinston, and Jerry R. Green. Microeconomic Theory. Oxford University Press, 1995.
[McFadden, 1974] Daniel McFadden. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, 1974.
[Paria et al., 2019] Biswajit Paria, Kirthevasan Kandasamy, and Barnabás Póczos. A flexible framework for multi-objective Bayesian optimization using random scalarizations. In UAI, 2019.
[Roijers et al., 2017] Diederik M. Roijers, Luisa M. Zintgraf, and Ann Nowé. Interactive Thompson sampling for multi-objective multi-armed bandits. In ADT, 2017.
[Rosenbaum and Rubin, 1983] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 1983.
[Strehl et al., 2010] Alex Strehl, John Langford, Lihong Li, and Sham M. Kakade. Learning from logged implicit exploration data. In NeurIPS, 2010.
[Swaminathan and Joachims, 2015] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, 2015.
[Tversky and Kahneman, 1974] Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 1974.
[Viappiani and Boutilier, 2010] Paolo Viappiani and Craig Boutilier. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In NeurIPS, 2010.
[Wang et al., 2021] Nan Wang, Branislav Kveton, and Maryam Karimzadehgan. CORe: Capitalizing on rewards in bandit exploration. In UAI, 2021.
[Wong, 1994] Weng Kee Wong. Comparing robust properties of A, D, E and G-optimal designs. Computational Statistics & Data Analysis, 1994.
[Yahyaa and Manderick, 2015] Saba Q. Yahyaa and Bernard Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In ESANN, 2015.
[Zhang and Golovin, 2020] Richard Zhang and Daniel Golovin. Random hypervolume scalarizations for provable multi-objective black box optimization. In ICML, 2020.
[Zitzler et al., 2000] Eckart Zitzler, Kalyanmoy Deb, and Lothar Thiele. Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary Computation, 2000.