# Preferential Multi-Objective Bayesian Optimization

Published in Transactions on Machine Learning Research (05/2025)

Raul Astudillo¹ (rastudil@caltech.edu), Kejun Li¹ (kli5@caltech.edu), Maegan Tucker² (mtucker@gatech.edu), Chu Xin Cheng¹ (ccheng2@caltech.edu), Aaron D. Ames¹ (ames@caltech.edu), Yisong Yue¹ (yyue@caltech.edu)

¹California Institute of Technology, ²Georgia Institute of Technology

Reviewed on OpenReview: https://openreview.net/forum?id=mjsoESaWDH

**Abstract.** Preferential Bayesian optimization (PBO) is a framework for optimizing a decision-maker's latent preferences over available design choices. While real-world problems often involve multiple conflicting objectives, existing PBO methods assume that preferences can be encoded by a single objective function. For instance, in the customization of robotic assistive devices, technicians aim to maximize user comfort while minimizing energy consumption to extend battery life. Likewise, in autonomous driving policy design, stakeholders must evaluate safety and performance trade-offs before committing to a policy. To bridge this gap, we introduce the first framework for PBO with multiple objectives. Within this framework, we propose dueling scalarized Thompson sampling (DSTS), a multi-objective generalization of the popular dueling Thompson sampling algorithm, which may also be of independent interest beyond our setting. We evaluate DSTS across four synthetic test functions and two simulated tasks (exoskeleton personalization and driving policy design), demonstrating that it outperforms several benchmarks. Finally, we prove that DSTS is asymptotically consistent. Along the way, we provide, to our knowledge, the first convergence guarantee for dueling Thompson sampling in single-objective PBO.

## 1 Introduction

Bayesian optimization (BO) is a framework for optimizing objective functions with expensive or time-consuming evaluations.
It has been successful in real-world applications such as hyperparameter tuning of machine learning algorithms (Snoek et al., 2012), e-commerce platform design (Letham & Bakshy, 2019), and materials design (Zhang et al., 2020). Preferential Bayesian optimization (PBO), a subframework within BO, focuses on settings where the objective function is latent, i.e., where objective values cannot be observed and, instead, only ordinal preference feedback from a decision-maker (DM) is available. While prior work in PBO has demonstrated success in various applications (Brochu et al., 2010; Nielsen et al., 2015; Tucker et al., 2020b), existing methods operate under the assumption that preferences can be encoded by a single objective function. In practice, however, problems are often characterized by multiple conflicting objectives. This occurs, for instance, when multiple users with conflicting preferences collaborate in a joint design task, as illustrated in Figure 1, or when a user wishes to explore the trade-offs between multiple conflicting attributes before committing to a design.

To motivate the need for multi-objective PBO, we examine two illustrative applications. The first involves an exoskeleton customization task that aims to enhance user comfort. In this situation, a user assisted by an exoskeleton experiences different gait designs and indicates the most comfortable option (Tucker et al., 2020a;b). In this and other robotic assistive personalization applications, users and clinical technicians often collaborate on a design task to maximize user comfort (the user's objective) while optimizing energy consumption and other metrics related to the exoskeleton's long-term functionality (the technician's objective) (Kerdraon et al., 2021).

Figure 1: In this work, we extend preferential Bayesian optimization to the multi-objective setting. In contrast with existing approaches, our approach allows the decision-makers involved in the joint design task to efficiently explore optimal trade-offs between the conflicting objectives.

The second application is autonomous driving policy design, where a user is presented with multiple simulations of an autonomous vehicle under different driving policies, and the user indicates the one with better safety and performance attributes (Bıyık et al., 2019). In such settings, policy-makers often seek to understand the trade-offs between multiple latent objectives, such as lane keeping and speed tracking, before committing to a specific policy (Bhatia et al., 2020).

Motivated by the applications described above, we propose a framework for PBO with multiple objectives. Our contributions are as follows:

- To the best of our knowledge, our work proposes the first framework for preferential Bayesian optimization with multiple objectives.
- We present dueling scalarized Thompson sampling (DSTS), the first extension of dueling Thompson sampling (DTS) algorithms (Sui et al., 2017; Novoseller et al., 2020; Siivola et al., 2021) to the multi-objective setting.
- We prove that DSTS is asymptotically consistent. Furthermore, we provide the first convergence guarantee for DTS in single-objective PBO.
- We demonstrate our framework across six test problems, including simulated exoskeleton personalization and autonomous driving policy design tasks. Our results show that DSTS can efficiently explore the objectives' Pareto front using preference feedback.

## 2 Related work

### 2.1 Preference-based optimization

Preference-based optimization has been actively studied across various frameworks, including multi-armed bandits (Yue et al., 2012; Bengs et al., 2021), reinforcement learning (Wirth et al., 2017), and BO (Brochu et al., 2010; González et al., 2017; Astudillo et al., 2023).
It has been successful in a broad range of applications, such as personalized medicine (Nielsen et al., 2015; Sui et al., 2017; Tucker et al., 2020a), robot control (Bıyık et al., 2019; Tucker et al., 2021; Maccarini et al., 2022; Csomay-Shanklin et al., 2022) and, more recently, the alignment of large language models (Rafailov et al., 2023). Most work in this area focuses on the single-objective setting. Two notable exceptions are the works of Bhatia et al. (2020) and Zhou et al. (2023). Bhatia et al. (2020) consider one-shot preference-based optimization across multiple criteria over a finite design space. This study adopts a game-theoretic viewpoint and introduces the concept of a Blackwell winner, which implicitly requires the user to specify an acceptable trade-off between criteria, in contrast with our work. Zhou et al. (2023) consider multi-objective preference alignment of large language models. Like our work, these two works are motivated by the idea that preference-based optimization across multiple objectives is crucial for capturing richer human feedback.

Our work extends the dueling Thompson sampling algorithm for dueling bandits introduced by Sui et al. (2017) (termed self-sparring), which has been adapted to preference-based reinforcement learning (termed dueling posterior sampling) (Novoseller et al., 2020) and PBO (termed batch Thompson sampling) (Siivola et al., 2021). To our knowledge, we provide the first multi-objective generalization of this algorithm.

### 2.2 Multi-objective optimization

The field of multi-objective optimization has been extensively studied, encompassing both theoretical advancements and applications across various engineering problems (Miettinen, 1999; Marler & Arora, 2004; Deb, 2013). Literature within the BO framework is most closely related to our work (Khan et al., 2002; Knowles, 2006; Belakaria et al., 2019; Paria et al., 2020; Daulton et al., 2020).
Our algorithm draws inspiration from ParEGO (Knowles, 2006), a multi-objective BO algorithm that employs augmented Chebyshev scalarizations to convert a multi-objective optimization problem into multiple single-objective problems. Unlike Knowles (2006), our objectives are not observable, preventing direct modeling of scalarized values. Instead, we model each objective separately and scalarize samples drawn from these models, similar to the version of ParEGO due to Daulton et al. (2020). Additionally, our work is related to research that incorporates user preferences into multi-objective optimization, a topic that has been actively studied both within and beyond the BO framework (Branke & Deb, 2005; Wang et al., 2017; Hakanen & Knowles, 2017; Lin et al., 2022). In most of this prior work, all objectives are assumed to be directly observable, and user preferences are captured through a latent utility function that combines these objectives into a single score to guide optimization. In contrast, we do not assume access to the objective values. Instead, we receive binary preference feedback for each objective individually, without ever observing their actual values or requiring a predefined utility function to aggregate them.

### 2.3 Additional related work

Emerging from the operations research community, the field of multi-criteria decision analysis (MCDA) focuses on decision-making under multiple conflicting criteria (Keeney & Raiffa, 1993; Pomerol & Barba-Romero, 2000). Although our work is related to this field, it diverges from traditional MCDA approaches, which often involve aggregating preferences across criteria into a single performance measure (Young, 1974; Dyer & Sarin, 1979; Baskin & Krishnamurthi, 2009; Bhatia et al., 2020). Such aggregation requires additional assumptions about the DM's desired trade-off. Additionally, methods in this field have been explored outside the PBO framework, making them not directly applicable in our setting.
## 3 Problem setting

**Preferences.** Let $X$ denote the space of designs. We assume there is a DM (which may represent one or multiple users collaborating on a design task) aiming to maximize their preferences over designs. We assume the DM's preferences can be encoded via $m$ objective functions $f_1, \dots, f_m : X \to \mathbb{R}$ so that, for any given pair of designs $x, x' \in X$, the DM prefers $x$ over $x'$ with respect to objective $j$ if and only if $f_j(x) > f_j(x')$. For simplicity, we assume all $m$ objectives are latent, but our approach can be easily adapted to settings where some objectives are observable, as discussed in Section 4.

**Goal.** Let $f = [f_1, \dots, f_m] : X \to \mathbb{R}^m$ denote the concatenation of the $m$ objective functions. The DM seeks to find designs that maximize each objective. This concept is formalized through the notion of Pareto-dominance. For a pair of designs $x, x' \in X$, $x$ Pareto-dominates $x'$, denoted by $x \succ_f x'$, if $f_j(x) \geq f_j(x')$ for $j = 1, \dots, m$, with strict inequality for at least one index $j$. The DM seeks to find the Pareto-optimal set of $f$, defined by $X^*_f := \{x : \nexists\, x' \text{ such that } x' \succ_f x\}$. The set $Y^*_f := \{f(x) : x \in X^*_f\}$ is termed the Pareto front of $f$. Figure 2 depicts the Pareto front for one of our test problems; the light grey region is the set of feasible objective vectors, i.e., $\{f(x) : x \in X\}$, and the dark grey curve indicates the Pareto front of $f$.

**Feedback.** To assist the DM's goal, our algorithm collects preference feedback interactively (Algorithm 1). At each iteration, denoted by $n = 1, \dots, N$, the algorithm selects a query consisting of $q$ designs $X_n = (x_{n,1}, \dots, x_{n,q}) \in X^q$. The DM then indicates their most preferred design among these $q$ designs for each objective. Let $r_{j,n} \in \{1, \dots, q\}$ denote the DM's preferred design with respect to objective $j$. The collection of these responses is denoted by $r_n = [r_{1,n}, \dots, r_{m,n}]$.

Algorithm 1: Dueling Scalarized Thompson Sampling
Input: initial dataset $D_0$, and prior on $f$: $p_0$.
    for n = 1, ..., N do
        Compute p_n, the posterior on f given D_{n-1}
        Sample θ_n uniformly at random over Θ
        Draw samples f̃_{n,1}, ..., f̃_{n,q} i.i.d. from p_n
        Find x_{n,i} ∈ argmax_{x ∈ X} s(f̃_{n,i}(x); θ_n), for i = 1, ..., q
        Set X_n = (x_{n,1}, ..., x_{n,q}), and observe r_n
        Update dataset D_n = D_{n-1} ∪ {(X_n, r_n)}
    end for

Figure 2: Feasible region (light grey) and Pareto front (dark grey) of the DTLZ2 test function.

## 4 Dueling scalarized Thompson sampling

We introduce a novel algorithm termed dueling scalarized Thompson sampling (DSTS), summarized in Algorithm 1. DSTS is obtained by combining ideas from preference-based and multi-objective optimization to derive a sound algorithm with strong performance and convergence guarantees. As is common in BO, our algorithm comprises a probabilistic model of the objective functions, used for prediction and uncertainty reasoning, along with a sampling policy that, informed by the probabilistic model, iteratively selects new queries, balancing exploration and exploitation.

### 4.1 Probabilistic model

The probabilistic model is encoded by a prior distribution over $f$, denoted by $p_0$. We assume $p_0$ consists of a set of independent Gaussian processes, each corresponding to an objective. However, our framework does not rely on this choice and can easily accommodate other priors as long as samples from the posterior distribution can be drawn. As is standard in the PBO literature (González et al., 2017; Nguyen et al., 2021; Astudillo et al., 2023), we account for noise in the DM's responses by using a logistic likelihood for each objective $j = 1, \dots, m$ of the following form:

$$P(r_{j,n} = i \mid f_j(X_n)) = \frac{\exp(f_j(x_{n,i})/\lambda_j)}{\sum_{i'=1}^{q} \exp(f_j(x_{n,i'})/\lambda_j)}, \quad i = 1, \dots, q, \qquad (1)$$

where $\lambda_j > 0$ is the noise-level parameter. We estimate $\lambda_j$ along with the other hyperparameters via maximum likelihood. We assume noise is independent across objectives and interactions.
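The likelihood in Equation 1 is a softmax over the latent values of the $q$ designs in a query. A minimal NumPy sketch (the function name and inputs are illustrative, not from the paper's code):

```python
import numpy as np

def preference_likelihood(f_vals, lam):
    """Probability that the DM prefers each of the q designs in a query.

    f_vals: array of shape (q,) with latent values f_j(x_{n,1}), ..., f_j(x_{n,q}).
    lam:    noise-level parameter lambda_j > 0 (larger = noisier responses).
    Returns an array of shape (q,) summing to 1, per Equation 1.
    """
    z = f_vals / lam
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: two designs, the first with the higher latent value.
probs = preference_likelihood(np.array([1.0, 0.0]), lam=0.5)
# probs[0] > probs[1]: the better design is preferred more often.
```

As $\lambda_j \to 0$, the response becomes deterministic; as $\lambda_j \to \infty$, it approaches a uniform random choice.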
Let $D_0$ denote the initial dataset and $D_{n-1} = D_0 \cup \{(X_k, r_k)\}_{k=1}^{n-1}$ denote the data available right before the $n$-th interaction with the DM. Let $p_n$ denote the posterior over $f$ given $D_{n-1}$. The posterior cannot be computed in closed form but can be approximated using, e.g., a variational inducing-point approach (Nguyen et al., 2021). For observable objectives, the above model can be replaced by a standard Gaussian process model with a Gaussian likelihood (see Appendix B.1).

### 4.2 Sampling policy

Our primary algorithmic contribution is our sampling policy, which extends the dueling Thompson sampling (DTS) algorithmic family to the multi-objective setting. This is achieved by leveraging augmented Chebyshev scalarizations, a technique from multi-objective optimization used to decompose a multi-objective optimization problem into multiple single-objective problems. We next explain augmented Chebyshev scalarizations and describe how we integrate them with DTS.

**Augmented Chebyshev scalarizations.** Augmented Chebyshev scalarizations are widely used for multi-objective optimization (Miettinen, 1999). In BO, in particular, they were employed by Knowles (2006) and Paria et al. (2020). We also leverage them to derive a sound sampling policy in our setting. For a given vector of scalarization parameters, $\theta \in \Theta := \{\theta \in \mathbb{R}^m : \sum_{j=1}^{m} \theta_j = 1 \text{ and } \theta_j \geq 0,\ j = 1, \dots, m\}$, the augmented Chebyshev scalarization function is defined by

$$s(y; \theta) = \min_{j=1,\dots,m}\{\theta_j y_j\} + \rho \sum_{j=1}^{m} \theta_j y_j, \qquad (2)$$

where $\rho$ is a small positive constant. It can be shown that any solution of $\max_{x \in X} s(f(x); \theta)$ lies in the Pareto-optimal set of $f$. Conversely, if $\rho$ is small enough, every point in the Pareto-optimal set of $f$ is a solution of $\max_{x \in X} s(f(x); \theta)$ for some $\theta \in \Theta$ (Theorem 3.4.6, Miettinen, 1999).

**Dueling scalarized Thompson sampling.** At each iteration $n$, we draw a sample of the scalarization parameters uniformly at random over $\Theta$, denoted by $\theta_n$.
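Equation 2 and the uniform draw of $\theta$ over the simplex can be sketched in a few lines of NumPy (function names and the toy candidate set below are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def chebyshev_scalarization(y, theta, rho=0.05):
    """Augmented Chebyshev scalarization s(y; theta) from Equation 2.

    y:     array (..., m) of objective vectors.
    theta: array (m,) on the simplex (theta >= 0, sums to 1).
    rho:   small positive augmentation constant.
    """
    weighted = theta * y
    return weighted.min(axis=-1) + rho * weighted.sum(axis=-1)

# A flat Dirichlet draw is uniform over the probability simplex Theta.
m = 2
theta = rng.dirichlet(np.ones(m))

# Pick the design maximizing the scalarized values over a finite
# candidate set (here a toy set of objective vectors).
Y = np.array([[1.0, 0.0], [0.6, 0.6], [0.0, 1.0]])
best = int(np.argmax(chebyshev_scalarization(Y, theta)))
```

In DSTS, this argmax is applied to each posterior sample $\tilde{f}_{n,i}$ rather than to known objective values.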
We also draw $q$ independent samples, denoted by $\tilde{f}_{n,1}, \dots, \tilde{f}_{n,q}$, from the posterior distribution on $f$ given $D_{n-1}$. The next query is then given by $X_n = (x_{n,1}, \dots, x_{n,q})$, where

$$x_{n,i} \in \arg\max_{x \in X} s\big(\tilde{f}_{n,i}(x); \theta_n\big), \quad i = 1, \dots, q. \qquad (3)$$

Intuitively, our sampling policy operates by first determining a subset of the Pareto-optimal set of $f$ using $\theta_n$, denoted as $X^*(f; \theta_n) = \arg\max_{x \in X} s(f(x); \theta_n)$. Then, each $x_{n,i}$ is sampled according to the probability (induced by the posterior on $f$) that $x_{n,i} \in X^*(f; \theta_n)$, analogous to single-objective dueling posterior sampling (Sui et al., 2017). The DM's responses provide information about the highest-value point among $x_{n,1}, \dots, x_{n,q}$ for each objective, which in turn allows us to learn about $X^*(f; \theta_n)$. Since $\theta_n$ is drawn independently at each iteration, we explore a diverse collection of subsets $X^*(f; \theta_n)$ within $X^*_f$.

We note that our sampling policy is agnostic to the choice of the probabilistic model, provided that samples from the posterior can be drawn. In addition, our sampling policy is suitable for problems with mixed latent and observable objectives thanks to its dual interpretation as a policy for preference-based optimization (Sui et al., 2017) and for traditional optimization with observable objectives (Paria et al., 2020). Specifically, when all objectives are observable, our sampling policy can be interpreted as a batch generalization (Kandasamy et al., 2018) of the scalarized Thompson sampling algorithm proposed by Paria et al. (2020).

### 4.3 Theoretical analysis

We now study the convergence properties of DSTS. We begin by analyzing the single-objective setting and establish the asymptotic consistency of DTS. To our knowledge, this is the first such result for DTS in PBO. The result is stated in Theorem 1, with a proof based on a martingale argument provided in Appendix A.2.

**Theorem 1.** Suppose that $X$ is finite, $m = 1$, and the sequence of queries $\{X_n\}_{n=1}^{\infty}$ is chosen according to the DTS policy.
Then, for each $x \in X$, $\lim_{n \to \infty} P_n(x \in \arg\max_{x' \in X} f(x')) = \mathbb{1}\{x \in \arg\max_{x' \in X} f(x')\}$ almost surely for $f$ drawn from the prior.

Extending this result to the multi-objective setting requires a minor modification to the DSTS algorithm. Specifically, our proof introduces a small probability of comparing against a fixed reference design in each iteration. This modification is required by our proof due to the non-linear nature of Chebyshev scalarizations and is not required in the single-objective case.

Figure 3: Our framework was demonstrated on six test problems: DTLZ1 (a), DTLZ2 (b), Vehicle Safety (c), Car Side Impact (d), Autonomous Driving (e), and Exoskeleton (f). Each panel plots hypervolume against the number of queries for Random, PBO-DTS-IF, qParEGO, qEHVI, qMES, and DSTS. Overall, our proposed method (DSTS) delivers the best performance. qMES and qParEGO exhibit mixed performance, achieving good results in some test problems and poor results in others. The remaining methods, Random, PBO-DTS-IF, and qEHVI, consistently underperform DSTS.

The resulting convergence guarantee is stated in Theorem 2. The proof, which can be found in Appendix A.3, again relies on a martingale argument and the fact that varying $\theta$ allows Chebyshev scalarizations to recover all Pareto-optimal points. Before stating the result, we describe the modified DSTS policy under consideration. Assume $q = 2$, and fix any reference point $x_{\mathrm{ref}} \in X$ and $\delta \in (0, 1)$. At each iteration, the first design $x_{n,1}$ is selected as in Equation 3, while the second design $x_{n,2}$ is set to $x_{\mathrm{ref}}$ with probability $\delta$, or otherwise selected via Equation 3.

**Theorem 2.**
Suppose that $X$ is finite, $q = 2$, and the sequence of queries $\{X_n\}_{n=1}^{\infty}$ is chosen according to the modified DSTS policy described above. Then, for each $x \in X$, $\lim_{n \to \infty} P_n(x \in X^*_f) = \mathbb{1}\{x \in X^*_f\}$ almost surely for $f$ drawn from the prior.

We now place our results in context with prior theoretical work on DTS. Sui et al. (2017) showed that DTS achieves asymptotic consistency and sublinear regret in the dueling bandits setting, assuming independent pairs of arms. However, their analysis does not extend to our setting, where arms may be correlated. Notably, the analysis of Sui et al. (2017) relies on showing that all arms are chosen infinitely often, which may not be true in our context. Similarly, Novoseller et al. (2020) showed analogous convergence results in a reinforcement learning setting under a Bayesian linear reward model. In contrast, our result holds for non-linear objectives. Moreover, the result of Novoseller et al. (2020) also relies on showing that each arm is selected infinitely often. Finally, note that the results of Sui et al. (2017) and Novoseller et al. (2020) are only applicable in the single-objective setting; as discussed above, the multi-objective setting presents additional challenges.

Unlike prior work, we do not establish regret bounds for DSTS. Indeed, such bounds remain an open question even for DTS in single-objective PBO. While we see this as a valuable research direction, such analysis is beyond the scope of our work, which primarily aims to introduce multi-objective PBO. Finally, it is important to recognize that the asymptotic consistency of data-driven algorithms like DSTS cannot be taken for granted. For instance, Astudillo et al. (2023) showed that the adaptation of qEI proposed by Siivola et al. (2021) is not asymptotically consistent and can perform poorly in single-objective PBO, despite being one of the most widely used algorithms.
In Theorem 3, we show that qEHVI (Daulton et al., 2020), a multi-objective generalization of qEI, suffers from the same limitation in our setting. A proof is provided in Appendix A.4. Our empirical results support this finding, showing that qEHVI can perform very poorly.

**Theorem 3.** There exists a problem instance with finite $X$ and $q = 2$ such that, if $X_n \in \arg\max_{X \in X^q} \mathrm{qEHVI}_n(X)$ for all $n$, then $\lim_{n \to \infty} P_n(x \in X^*_f) = t$ almost surely for some fixed $x \in X$ and $t \in (0, 1)$.

## 5 Numerical experiments

We evaluate our algorithm across six test problems and compare it with five other sampling policies. All algorithms are implemented using BoTorch (Balandat et al., 2020). Details on the performance metric, the benchmark sampling policies, and the test problems are provided below. The code for reproducing our experiments can be found at https://github.com/RaulAstudillo06/PMBO.

### 5.1 Performance metric

We quantify performance using the hypervolume indicator, which has been shown to result in good coverage of Pareto fronts when maximized (Zitzler et al., 2003). Let $\hat{Y} = \{y_\ell\}_{\ell=1}^{L}$ be a finite approximation of the Pareto front of $f$. Its hypervolume is given by $\mathrm{HV}(\hat{Y}, r) = \mu\big(\bigcup_{\ell=1}^{L} [r, y_\ell]\big)$, where $r \in \mathbb{R}^m$ is a reference vector, $\mu$ denotes the Lebesgue measure over $\mathbb{R}^m$, and $[r, y_\ell]$ denotes the hyper-rectangle bounded by the vertices $r$ and $y_\ell$. We report performance by setting $\hat{Y}$ equal to the set of Pareto-optimal points across designs shown to the DM.

### 5.2 Benchmarks

We compare our algorithm (DSTS) against uniform random sampling (Random), three adapted algorithms from standard multi-objective BO (qParEGO, qEHVI, qMES), and a standard PBO algorithm with inconsistent overall preference feedback (PBO-DTS-IF). Our experiments in this section use the regular version of DSTS. In Appendix B.3, we show that the modified version of DSTS used in Theorem 2 achieves virtually the same performance for small values of $\delta$.
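As a concrete illustration of the hypervolume indicator from Section 5.1: in the two-objective case it reduces to a sum of rectangle areas over a sweep of the points (a minimal sketch under a maximization convention; BoTorch provides a general implementation for arbitrary $m$):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume dominated by `points` w.r.t. reference `ref` (maximization, m = 2).

    points: array-like (L, 2) of objective vectors, each dominating ref.
    ref:    array (2,) reference vector.
    """
    pts = np.asarray(points, dtype=float)
    # Sweep in decreasing order of the first objective; dominated points
    # are skipped automatically because they cannot raise y2_max.
    pts = pts[np.argsort(-pts[:, 0])]
    hv, y2_max = 0.0, ref[1]
    for y1, y2 in pts:
        if y2 > y2_max:
            hv += (y1 - ref[0]) * (y2 - y2_max)
            y2_max = y2
    return hv

# Two mutually non-dominated points w.r.t. reference (0, 0):
hv = hypervolume_2d([[2.0, 1.0], [1.0, 2.0]], ref=np.array([0.0, 0.0]))
# hv = 2*1 + 1*(2-1) = 3.0
```

The union of the two rectangles $[0,2]\times[0,1]$ and $[0,1]\times[0,2]$ indeed has area $2 + 2 - 1 = 3$.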
All algorithms use the same priors, and the resulting posteriors are approximated via the variational inducing-point approach proposed by Nguyen et al. (2021). Approximate samples from the posterior distribution used by DSTS and PBO-DTS-IF are obtained via 1000 random Fourier features (Rahimi & Recht, 2007).

**Adapted standard multi-objective BO methods.** A common approach in the PBO literature is to use a batch acquisition function designed for parallel BO with observable objectives, ignoring the fact that preference feedback is observed rather than objective values (Siivola et al., 2021; Takeno et al., 2023). Despite lacking the principled interpretations they enjoy in their original setting, these methods often deliver strong empirical performance. Following this principle, we adopt three batch acquisition functions from standard multi-objective BO as benchmarks: qParEGO (Knowles, 2006; Daulton et al., 2020), qEHVI (Daulton et al., 2020), and qMES (Belakaria et al., 2019). Since these algorithms were not originally designed for latent objectives, they require minor adaptations that we describe in Appendix B.5. These algorithms use the same probabilistic model as DSTS; thus, any difference in performance is solely due to the use of different sampling policies.

**Single-objective PBO with inconsistent aggregated preference feedback.** Single-objective PBO methods are often applied to problems characterized by multiple conflicting objectives. In such cases, DMs are expected to aggregate their preferences across objectives, which can be challenging for DMs and often results in inconsistent feedback.

Figure 4: Simulation environments used in our test problems: (a) Autonomous Driving and (b) Exoskeleton Walking.
For example, in the context of exoskeleton personalization, this would require forcing the exoskeleton user and the clinical technician to reach a unified response at every iteration, which can be challenging if the user's objective is to maximize comfort while the technician's objective is to ensure the exoskeleton's long-term energy efficiency. To understand the effect of using this approach, we include a standard single-objective PBO approach using inconsistent feedback. Additional details on this benchmark are provided in Appendix B.5.

Figure 5: Illustration of sampled designs for the DTLZ2 test function under (a) Random, (b) PBO-DTS-IF, (c) qParEGO, (d) qEHVI, (e) qMES, and (f) DSTS; each panel shows the feasible region and the Pareto front. These figures show that our proposed method (DSTS) provides a better exploration of the Pareto front than its competitors.

### 5.3 Test problems

We report performance across four synthetic test problems (DTLZ1, DTLZ2, Vehicle Safety, and Car Side Impact), a simulated autonomous driving policy design task (Autonomous Driving), and a simulated exoskeleton gait design task (Exoskeleton), using queries with $q = 2$ and $q = 4$ designs. Details of these test problems are provided below. In all problems, an initial dataset is obtained using $2(d + 1)$ queries chosen uniformly at random over $X^q$, where $d$ is the input dimension of the problem. After this initial stage, each algorithm was used to select 100 additional queries sequentially. Results for $q = 2$ are shown in Figure 3. Each plot shows the mean of the hypervolume of the designs included in queries thus far, plus and minus 1.96 times the standard error.
Each experiment was replicated 30 times using different initial datasets. In all problems, the DM's responses are corrupted by moderate levels of Gumbel noise, which is consistent with the use of a logistic likelihood (see Appendix B.2 for details). Results for $q = 4$ can be found in Appendix B.4.

**DTLZ1 and DTLZ2.** The DTLZ1 and DTLZ2 functions are standard test problems from the multi-objective optimization literature (Deb et al., 2005). In our experiments, we configure DTLZ1 with $d = 6$ design variables and $m = 2$ objectives, and DTLZ2 with $d = 3$ design variables and $m = 2$ objectives. Results for these problems are shown in Figures 3(a) and 3(b), respectively. Our approach achieves the best performance in both problems, tied with qParEGO on DTLZ2. Surprisingly, on the DTLZ2 problem, PBO-DTS-IF, qEHVI, and qMES underperform significantly, even being surpassed by Random. To understand this, we plot a representative set of objective vectors corresponding to the queried designs in Figure 5. As illustrated, Random offers a reasonable exploration of the Pareto front (likely due to the low dimensionality of DTLZ2). However, it exposes the user to many low-quality designs, which can potentially frustrate DMs. PBO-DTS-IF and qMES tend to favor designs where one of the objectives achieves its maximum possible value, which may be problematic for DMs seeking more balanced solutions. qEHVI fails to explore the Pareto front, concentrating its queries on a limited sub-optimal region instead. Finally, DSTS and qParEGO provide a more comprehensive exploration of the Pareto front.

**Vehicle Safety and Car Side Impact.** The Vehicle Safety and Car Side Impact test functions are designed to emulate various metrics of interest in the context of crashworthiness vehicle design. Overall, these test problems emulate an expert's assessment based on expensive experiments where cars are intentionally crashed and safety metrics are evaluated.
Vehicle Safety has $d = 5$ design variables and $m = 3$ objectives. Car Side Impact has $d = 7$ design variables and $m = 4$ objectives. For further details, we refer the reader to Tanabe & Ishibuchi (2020). Results for the Vehicle Safety and Car Side Impact experiments can be found in Figures 3(c) and 3(d), respectively. For the Vehicle Safety problem, qMES is the best-performing algorithm, followed by DSTS. For Car Side Impact, DSTS performs best, followed closely by qMES.

**Autonomous Driving Policy Design.** To supplement the synthetic test functions, we further evaluate our algorithm on a simulated autonomous driving policy design task. For this problem, we use a modification of the Driver environment presented in Bıyık et al. (2019). A similar environment was also used by Bhatia et al. (2020), providing empirical evidence that user preferences in this context are inherently governed by multiple latent objectives. In our modified environment, illustrated in Figure 4(a), an autonomous control policy is created to drive a trailing (red) vehicle forward to a goal location while maintaining a minimum distance from a leading (white) vehicle. The control policy switches between two modes, collision avoidance and goal-following, based on a minimum distance threshold. The behavior of the leading car is fixed via a pre-specified set of actions. Using this simulation environment, we consider four objectives representing approximations of subjective notions of safety and performance: lane keeping, speed tracking, heading angle, and collision avoidance. The design space is parameterized by four control variables: two parameters that account for how fast the vehicle approaches the goal or the other vehicle, respectively; one position gain that accounts for the adjustment of the desired heading; and the minimum distance threshold used to switch between the two modes. The results of this experiment are shown in Figure 3(e).
As illustrated, our approach again delivers better performance than its competitors.

**Exoskeleton Gait Customization.** Lastly, we evaluate our algorithm on an exoskeleton gait personalization task using a high-fidelity simulator of the lower-body exoskeleton Atalante (Kerdraon et al., 2021), illustrated in Figure 4(b). This problem emulates the scenario discussed in the introduction, in which there are two conflicting objectives: subjective user comfort and energy efficiency. For simulation purposes, we approximate comfort as a linear combination of three attributes: average walking speed (faster speed is preferred), maximum pelvis acceleration (lower peak acceleration is preferred), and center-of-mass tracking error (lower error is preferred). We approximate total energy consumption as the $\ell_2$-norm of joint-level torques, averaged over the simulation duration. We note that this is an observable objective; thus, our approach is modified, as discussed in Section 4 and further elaborated on in Appendix B.1, to leverage direct observations of this objective. The design space is parameterized by five gait features: step length, minimum center-of-mass position with respect to the stance foot in the sagittal and coronal planes, minimum foot clearance, and the percentage of the gait cycle at which minimum foot clearance is enforced. Each unique set of features corresponds to a unique gait. These gaits are synthesized using the FROST toolbox (Hereid & Ames, 2017) and are simulated in MuJoCo to obtain the corresponding objectives. Since simulations are time-consuming, we build surrogate objectives by fitting a (regular) Gaussian process to the objectives obtained from 1000 simulations, with each set of gait features drawn uniformly over the design space. As shown in Figure 3(f), DSTS achieves the best performance, followed closely by qParEGO and qMES.
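The surrogate-building step described above, fitting a Gaussian process to a batch of simulation outputs, can be sketched as follows. This is a hand-rolled noiseless GP posterior mean on placeholder data; the paper does not specify its implementation, and the kernel, lengthscale, and stand-in objective here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=0.5):
    """Squared-exponential kernel matrix between row sets A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

# Placeholder for simulated (gait features -> objective) pairs; the paper
# uses 1000 MuJoCo simulations over d = 5 gait features.
X = rng.uniform(0.0, 1.0, size=(50, 5))
y = np.sin(X.sum(axis=1))              # illustrative stand-in objective

# Posterior mean of a zero-mean GP, with a small jitter for stability.
K = rbf(X, X) + 1e-8 * np.eye(len(X))
alpha = np.linalg.solve(K, y)

def surrogate_mean(X_new):
    """Cheap surrogate that stands in for the expensive simulator."""
    return rbf(X_new, X) @ alpha

pred = surrogate_mean(X[:5])           # approximately recovers y[:5]
```

Once fitted, the surrogate replaces the time-consuming simulation loop during optimization, exactly as the surrogate objectives replace MuJoCo runs in this experiment.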
5.4 Discussion Across the broad range of problems considered, DSTS delivers the best overall performance. Specifically, DSTS yields the highest hypervolume in nearly all problems (except for the Vehicle Safety problem, where it is second to qMES). Two of the standard multi-objective benchmarks, qParEGO and qMES, exhibit mixed results, highlighting the importance of developing algorithms designed to handle preference feedback as opposed to naively adapting algorithms intended for observable objectives. Notably, qEHVI is the worst-performing algorithm, surpassed even by Random. This is consistent with Theorem 3, which shows that qEHVI is not consistent in general, thus highlighting the value of our asymptotic consistency result for DSTS (Theorem 2). Lastly, PBO-DTS-IF consistently underperforms DSTS, confirming that a single-objective PBO approach is insufficient to explore the optimal trade-offs in problems with multiple conflicting objectives. The runtimes of all methods are discussed in Appendix B.6. 6 Conclusion In this work, we proposed a framework for PBO with multiple latent objectives, where the goal is to help DMs efficiently explore the objectives' Pareto front guided by preference feedback. Within this framework, we introduced dueling scalarized Thompson sampling (DSTS), which, to our knowledge, is the first approach for PBO with multiple objectives. Our experiments demonstrate that DSTS provides significantly better exploration of the Pareto front than several benchmarks across six test problems, including simulated autonomous driving policy design and exoskeleton gait customization tasks. Moreover, we showed that DSTS is asymptotically consistent, providing the first convergence result for dueling Thompson sampling in PBO. While our work provides a sound approach to tackling important applications not covered by existing methods, there are also a few limitations that suggest avenues for future exploration.
Future work could include a deeper theoretical analysis of DSTS, such as investigating convergence rates and regret bounds, as well as the development of alternative sampling policies. For example, Astudillo et al. (2023) provided an efficient approach to approximate a one-step lookahead Bayes optimal policy in single-objective PBO, demonstrating superior performance against various established benchmarks. Although their approach cannot be easily adapted to our context, exploring alternative mechanisms for computing non-myopic sampling policies in our setting would be valuable. Finally, it would be interesting to explore DSTS in other settings, such as preference-based reinforcement learning. Acknowledgments This research was partially supported by Wandercraft. References Raul Astudillo, Zhiyuan Jerry Lin, Eytan Bakshy, and Peter Frazier. qEUBO: A decision-theoretic acquisition function for preferential Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1093–1114. PMLR, 2023. Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G Wilson, and Eytan Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Advances in Neural Information Processing Systems, 33:21524–21538, 2020. Jacob P Baskin and Shriram Krishnamurthi. Preference aggregation in group recommender systems for committee decision-making. In Proceedings of the Third ACM Conference on Recommender Systems, pp. 337–340, 2009. Syrine Belakaria, Aryan Deshwal, and Janardhan Rao Doppa. Max-value entropy search for multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019. Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference-based online learning with dueling bandits: A survey. The Journal of Machine Learning Research, 22(1):278–385, 2021.
Kush Bhatia, Ashwin Pananjady, Peter Bartlett, Anca Dragan, and Martin J Wainwright. Preference learning along multiple criteria: A game-theoretic perspective. Advances in Neural Information Processing Systems, 33:7413–7424, 2020. Erdem Bıyık, Malayandi Palan, Nicholas C Landolfi, Dylan P Losey, and Dorsa Sadigh. Asking easy questions: A user-friendly approach to active reward learning. arXiv preprint arXiv:1910.04365, 2019. Jürgen Branke and Kalyanmoy Deb. Integrating user preferences into evolutionary multi-objective optimization. In Knowledge Incorporation in Evolutionary Computation, pp. 461–477. Springer, 2005. Eric Brochu, Tyson Brochu, and Nando De Freitas. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 103–112, 2010. Noel Csomay-Shanklin, Maegan Tucker, Min Dai, Jenna Reher, and Aaron D Ames. Learning controller gains on bipedal walking robots via user preferences. In 2022 International Conference on Robotics and Automation (ICRA), pp. 10405–10411. IEEE, 2022. Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. Differentiable expected hypervolume improvement for parallel multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 33:9851–9864, 2020. Kalyanmoy Deb. Multi-objective optimization. In Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques, pp. 403–449. Springer, 2013. Kalyanmoy Deb, Lothar Thiele, Marco Laumanns, and Eckart Zitzler. Scalable test problems for evolutionary multiobjective optimization. In Evolutionary Multiobjective Optimization, pp. 105–145. Springer, 2005. doi: 10.1007/1-84628-137-7_6. Joseph L Doob. Application of the theory of martingales. Le calcul des probabilites et ses applications, pp. 23–27, 1949. James S Dyer and Rakesh K Sarin. Group preference aggregation rules based on strength of preference.
Management Science, 25(9):822–832, 1979. Javier González, Zhenwen Dai, Andreas Damianou, and Neil D Lawrence. Preferential Bayesian optimization. In International Conference on Machine Learning, pp. 1282–1291. PMLR, 2017. Jussi Hakanen and Joshua D Knowles. On using decision maker preferences with ParEGO. In Evolutionary Multi-Criterion Optimization, pp. 282–297. Springer, 2017. Ayonga Hereid and Aaron D Ames. FROST: Fast robot optimization and simulation toolkit. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 719–726. IEEE, 2017. Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Parallelised Bayesian optimisation via Thompson sampling. In Amos Storkey and Fernando Perez-Cruz (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 133–142. PMLR, 09–11 Apr 2018. Ralph L Keeney and Howard Raiffa. Decisions with multiple objectives: preferences and value trade-offs. Cambridge University Press, 1993. Jacques Kerdraon, Jean Gabriel Previnaire, Maegan Tucker, Pauline Coignard, Willy Allegre, Emmanuel Knappen, and Aaron Ames. Evaluation of safety and performance of the self-balancing walking system Atalante in patients with complete motor spinal cord injury. Spinal Cord Series and Cases, 7(1):71, 2021. Nazan Khan, David E Goldberg, and Martin Pelikan. Multi-objective Bayesian optimization algorithm. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, pp. 684–684, 2002. Joshua Knowles. ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10(1):50–66, 2006. Benjamin Letham and Eytan Bakshy. Bayesian optimization for policy search via online-offline experimentation. J. Mach. Learn.
Res., 20(145):1–30, 2019. Zhiyuan Jerry Lin, Raul Astudillo, Peter Frazier, and Eytan Bakshy. Preference exploration for efficient Bayesian optimization with multiple outcomes. In International Conference on Artificial Intelligence and Statistics, pp. 4235–4258. PMLR, 2022. Marco Maccarini, Filippo Pura, Dario Piga, Loris Roveda, Lorenzo Mantovani, and Francesco Braghin. Preference-based optimization of a human-robot collaborative controller. IFAC-PapersOnLine, 55(38):7–12, 2022. R Timothy Marler and Jasbir S Arora. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization, 26:369–395, 2004. Kaisa Miettinen. Nonlinear multiobjective optimization. International Series in Operations Research & Management Science. Springer, 1999. Quoc Phong Nguyen, Sebastian Tay, Bryan Kian Hsiang Low, and Patrick Jaillet. Top-k ranking Bayesian optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9135–9143, 2021. Jens Brehm Bagger Nielsen, Jakob Nielsen, and Jan Larsen. Perception-based personalization of hearing aids using Gaussian processes and active learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):162–173, 2015. Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick. Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pp. 1029–1038. PMLR, 2020. Biswajit Paria, Kirthevasan Kandasamy, and Barnabás Póczos. A flexible framework for multi-objective Bayesian optimization using random scalarizations. In Uncertainty in Artificial Intelligence, pp. 766–776. PMLR, 2020. Jean-Charles Pomerol and Sergio Barba-Romero. Multicriterion decision in management: principles and practice, volume 25. Springer Science & Business Media, 2000. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn.
Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007. Carl Edward Rasmussen and Christopher K I Williams. Gaussian processes for machine learning. MIT Press, 2006. Eero Siivola, Akash Kumar Dhaka, Michael Riis Andersen, Javier González, Pablo García Moreno, and Aki Vehtari. Preferential batch Bayesian optimization. In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing, pp. 1–6. IEEE, 2021. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012. Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017. Shion Takeno, Masahiro Nomura, and Masayuki Karasuyama. Towards practical preferential Bayesian optimization with skew Gaussian processes. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 33516–33533. PMLR, 23–29 Jul 2023. Ryoji Tanabe and Hisao Ishibuchi. An easy-to-use real-world multi-objective optimization problem suite. Applied Soft Computing, 89:106078, 2020. Maegan Tucker, Myra Cheng, Ellen Novoseller, Richard Cheng, Yisong Yue, Joel W Burdick, and Aaron D Ames. Human preference-based learning for high-dimensional optimization of exoskeleton walking gaits. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3423–3430. IEEE, 2020a.
Maegan Tucker, Ellen Novoseller, Claudia Kann, Yanan Sui, Yisong Yue, Joel W Burdick, and Aaron D Ames. Preference-based learning for exoskeleton gait optimization. In 2020 IEEE International Conference on Robotics and Automation, pp. 2351–2357. IEEE, 2020b. Maegan Tucker, Noel Csomay-Shanklin, Wen-Loong Ma, and Aaron D Ames. Preference-based learning for user-guided HZD gait generation on bipedal walking robots. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 2804–2810. IEEE, 2021. Handing Wang, Markus Olhofer, and Yaochu Jin. A mini-review on preference modeling and articulation in multi-objective optimization: Current status and challenges. Complex & Intelligent Systems, 3:233–245, 2017. Christian Wirth, Riad Akrour, Gerhard Neumann, Johannes Fürnkranz, et al. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017. H Peyton Young. A note on preference aggregation. Econometrica: Journal of the Econometric Society, pp. 1129–1131, 1974. Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. Yichi Zhang, Daniel W Apley, and Wei Chen. Bayesian optimization for materials design with mixed quantitative and qualitative variables. Scientific Reports, 10(1):4924, 2020. Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, and Yu Qiao. Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023. Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M Fonseca, and Viviane Grunert Da Fonseca. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117–132, 2003.
A Proofs of theoretical results A.1 Notation and auxiliary results We first introduce the following notation. Let $\mathcal{F}_n$ denote the $\sigma$-algebra generated by $\mathcal{D}_{n-1}$ and $\mathcal{F}_\infty$ denote the minimal $\sigma$-algebra generated by $\{\mathcal{F}_n\}_{n=1}^{\infty}$. We denote the conditional probability measures induced by $\mathcal{F}_n$ and $\mathcal{F}_\infty$ by $\mathbb{P}_n$ and $\mathbb{P}_\infty$, respectively. Unless otherwise stated, throughout this analysis, we assume that $f$ is a random function drawn from the prior. We will now prove two lemmas. Lemma 1 guarantees that for any Pareto optimal point, there is a set of scalarizations with positive measure for which this point is optimal. Lemma 2, on the other hand, ensures that promising points are compared against the reference point sufficiently many times. Lemma 1. Let $f$ be any fixed function. If $x \in X^*_f$, then there exists $\Psi \subseteq \Theta$ such that $x$ is the only element of $\operatorname{argmax}_{x' \in X} s(f(x'); \psi)$ for all $\psi \in \Psi$ and $\mathbb{P}(\theta \in \Psi) > 0$, where $\theta$ is a uniform random variable over $\Theta$. Proof. From Theorem 3.4.6 in Miettinen (1999), we know that $x \in X^*_f$ (i.e., $x$ is properly Pareto optimal) if and only if there exists $\theta \in \Theta$ such that $x$ is the only element of $\operatorname{argmax}_{x' \in X} s(f(x'); \theta)$. As a result, since $x \in X^*_f$, there exists at least one $\theta \in \Theta$ such that $x$ is the unique maximizer of $x' \mapsto s(f(x'); \theta)$. Since $X$ is finite, $s(f(x'); \theta)$ is continuous with respect to $\theta$, and $x$ is the unique maximizer at $\theta$, there exists, by continuity, a neighborhood around $\theta$ where $x$ remains the unique maximizer. Let $\Psi$ be this neighborhood around $\theta$. Since $\theta$ is uniformly distributed over $\Theta$, the set $\Psi$, being non-empty and open, has positive measure. Hence $\mathbb{P}(\theta \in \Psi) > 0$. Lemma 2. Suppose that the assumptions of Theorem 2 hold and let $x \in X$ be any point with $\mathbb{P}_\infty(x \in X^*_f) > 0$. Then, $f_j(x) - f_j(x_{\mathrm{ref}})$ is $\mathcal{F}_\infty$-measurable for $j = 1, \ldots, m$. Proof. Making a slight abuse of notation, we write $x = \operatorname{argmax}_{x' \in X} s(f(x'); \theta)$ if $x$ is the only element of $\operatorname{argmax}_{x' \in X} s(f(x'); \theta)$. Let $\theta$ denote a uniform random variable over $\Theta$ independent of $\mathcal{F}_\infty$.
The existence of such a $\theta$ is guaranteed by Kolmogorov's extension theorem. From Lemma 1, there exists a $\Psi \subseteq \Theta$ (depending on $f$) such that $\mathbb{P}_\infty(x = \operatorname{argmax}_{x' \in X} s(f(x'); \psi) \ \forall \psi \in \Psi) > 0$ and $\mathbb{P}_\infty(\theta \in \Psi) > 0$. A standard martingale argument shows that $\lim_{n \to \infty} \mathbb{P}_n(x = \operatorname{argmax}_{x' \in X} s(f(x'); \psi) \ \forall \psi \in \Psi) = \mathbb{P}_\infty(x = \operatorname{argmax}_{x' \in X} s(f(x'); \psi) \ \forall \psi \in \Psi) > 0$ almost surely. This convergence to a positive limit implies that for sufficiently large $n$, the conditional probability is bounded below by some $\epsilon_1 > 0$ such that $\mathbb{P}_n(x = \operatorname{argmax}_{x' \in X} s(f(x'); \psi) \ \forall \psi \in \Psi) > \epsilon_1$. Similarly, we can find $\epsilon_2$ such that $\mathbb{P}_n(\theta \in \Psi) > \epsilon_2$ for all $n$ large enough. Under the modified DSTS policy, where $x_{n,2}$ defaults to $x_{\mathrm{ref}}$ with probability $\delta$, we have $\mathbb{P}_n(X_n = (x, x_{\mathrm{ref}})) \geq \mathbb{P}_n(x = \operatorname{argmax}_{x' \in X} s(f(x'); \psi) \ \forall \psi \in \Psi)\,\mathbb{P}_n(\theta \in \Psi)\,\delta > \epsilon_1 \epsilon_2 \delta$ for all $n$ large enough. It follows that the event $X_n = (x, x_{\mathrm{ref}})$ occurs for infinitely many $n$. Let $n_1, n_2, \ldots$ denote the sequence of indices such that $X_{n_k} = (x, x_{\mathrm{ref}})$. Since $q = 2$, $\mathbb{E}[r_{j,n_k} - 1 \mid f_j] = \mathbb{P}(r_{j,n_k} = 2) = \frac{\exp(f_j(x_{\mathrm{ref}})/\lambda_j)}{\exp(f_j(x)/\lambda_j) + \exp(f_j(x_{\mathrm{ref}})/\lambda_j)}.$ Thus, by the law of large numbers, we have $\lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} (r_{j,n_k} - 1) = \frac{\exp(f_j(x_{\mathrm{ref}})/\lambda_j)}{\exp(f_j(x)/\lambda_j) + \exp(f_j(x_{\mathrm{ref}})/\lambda_j)} = \frac{1}{1 + \exp((f_j(x) - f_j(x_{\mathrm{ref}}))/\lambda_j)}$ almost surely. We deduce from this that $f_j(x) - f_j(x_{\mathrm{ref}})$ is $\mathcal{F}_\infty$-measurable for $j = 1, \ldots, m$. A.2 Proof of Theorem 1 Theorem 1. Suppose that $X$ is finite, $m = 1$, and the sequence of queries $\{X_n\}_{n=1}^{\infty}$ is chosen according to the DTS policy. Then, for each $x \in X$, $\lim_{n \to \infty} \mathbb{P}_n(x \in \operatorname{argmax}_{x' \in X} f(x')) = \mathbf{1}\{x \in \operatorname{argmax}_{x' \in X} f(x')\}$ almost surely for $f$ drawn from the prior. Proof. Observe that $X^*_f = \operatorname{argmax}_{x' \in X} f(x')$ in the single-objective setting. We will show that the event $0 < \mathbb{P}_\infty(x \in X^*_f) < 1$ occurs with probability zero by showing that a contradiction arises almost surely. Since $\mathbb{P}_\infty(x \in X^*_f) < 1$, there must exist a fixed $x' \in X$ such that $\mathbb{P}_\infty(f(x') > f(x)) > 0$. Moreover, we can choose $x'$ such that $\mathbb{P}_\infty(x' \in X^*_f) > 0$.
Similar to the proof of Lemma 2, this implies that we can find $\epsilon_x, \epsilon_{x'} > 0$ such that $\mathbb{P}_n(x \in X^*_f) > \epsilon_x$ and $\mathbb{P}_n(x' \in X^*_f) > \epsilon_{x'}$ for all $n$ large enough. Now observe that under the DTS policy, $\mathbb{P}_n(X_n = (x, x', \ldots, x')) = \mathbb{P}_n(x \in X^*_f)\,\mathbb{P}_n(x' \in X^*_f)^{q-1} > \epsilon_x \epsilon_{x'}^{q-1}$ for all $n$ large enough (note that we have used that $x_{n,1}, \ldots, x_{n,q}$ are chosen independently). This implies that the event $X_n = (x, x', \ldots, x')$ occurs infinitely often almost surely. Similar to the proof of Lemma 2, this implies that $\frac{\exp(f(x)/\lambda)}{\exp(f(x)/\lambda) + (q-1)\exp(f(x')/\lambda)}$ and $\frac{\exp(f(x')/\lambda)}{\exp(f(x)/\lambda) + (q-1)\exp(f(x')/\lambda)}$ are both $\mathcal{F}_\infty$-measurable. By taking the ratio between these two quantities, we see that $\exp((f(x) - f(x'))/\lambda)$ is $\mathcal{F}_\infty$-measurable, which in turn implies that $f(x) - f(x')$ is also $\mathcal{F}_\infty$-measurable. From this, it follows that $\mathbb{P}_\infty(f(x') > f(x)) = \mathbf{1}\{f(x') > f(x)\}$. Now recall that $\mathbb{P}_\infty(f(x') > f(x)) > 0$. Thus, it must be the case that $\mathbb{P}_\infty(f(x') > f(x)) = 1$. However, this implies that $\mathbb{P}_\infty(x \in X^*_f) = 0$, which is a contradiction. From the above, it follows that $\mathbb{P}_\infty(x \in X^*_f)$ is 0 or 1 almost surely. Similar to the proof of Theorem 2, we conclude that $\mathbb{P}_\infty(x \in X^*_f) = \mathbf{1}\{x \in X^*_f\}$ by virtue of Doob's Bayesian consistency theorem (Doob, 1949). A.3 Proof of Theorem 2 Theorem 2. Suppose that $X$ is finite, $q = 2$, and the sequence of queries $\{X_n\}_{n=1}^{\infty}$ is chosen according to the modified DSTS policy. Then, for each $x \in X$, $\lim_{n \to \infty} \mathbb{P}_n(x \in X^*_f) = \mathbf{1}\{x \in X^*_f\}$ almost surely for $f$ drawn from the prior. Proof. A standard martingale argument shows that $\lim_{n \to \infty} \mathbb{P}_n(x \in X^*_f) = \mathbb{P}_\infty(x \in X^*_f)$ almost surely. Thus, it remains to show that $\mathbb{P}_\infty(x \in X^*_f) = \mathbf{1}\{x \in X^*_f\}$ almost surely. To prove this, we will show that the event $0 < \mathbb{P}_\infty(x \in X^*_f) < 1$ occurs with probability zero. For the sake of contradiction, assume this event holds with positive probability. As we will see next, this yields a contradiction that holds almost surely. Since $\mathbb{P}_\infty(x \in X^*_f) < 1$, there must exist $x' \in X$ such that $\mathbb{P}_\infty(x' \succ_f x) > 0$.
Moreover, we can choose $x'$ such that $\mathbb{P}_\infty(x' \in X^*_f) > 0$. Given $\mathbb{P}_\infty(x \in X^*_f) > 0$, $f_j(x) - f_j(x_{\mathrm{ref}})$ is $\mathcal{F}_\infty$-measurable for each $j = 1, \ldots, m$ by Lemma 2. Similarly, since $\mathbb{P}_\infty(x' \in X^*_f) > 0$, $f_j(x') - f_j(x_{\mathrm{ref}})$ is $\mathcal{F}_\infty$-measurable. Thus, $(f_j(x') - f_j(x_{\mathrm{ref}})) - (f_j(x) - f_j(x_{\mathrm{ref}})) = f_j(x') - f_j(x)$ is $\mathcal{F}_\infty$-measurable for each $j = 1, \ldots, m$. Next, observe that $x' \succ_f x$ if and only if $\min_{j=1,\ldots,m} f_j(x') - f_j(x) > 0$. Hence, $\mathbf{1}\{x' \succ_f x\}$ is $\mathcal{F}_\infty$-measurable and $\mathbb{P}_\infty(x' \succ_f x) = \mathbf{1}\{x' \succ_f x\}$ almost surely. Recall that $\mathbb{P}_\infty(x' \succ_f x) > 0$. Thus, it must be the case that $\mathbb{P}_\infty(x' \succ_f x) = 1$. This implies that $\mathbb{P}_\infty(x \in X^*_f) = 0$, which contradicts our initial assumption that $0 < \mathbb{P}_\infty(x \in X^*_f) < 1$. Finally, since $\mathbb{P}_\infty(x \in X^*_f)$ is 0 or 1 almost surely, from Doob's Bayesian consistency theorem (Doob, 1949) we conclude that $\mathbb{P}_\infty(x \in X^*_f) = \mathbf{1}\{x \in X^*_f\}$. A.4 Proof of Theorem 3 Theorem 3. There exists a problem instance with finite $X$ and $q = 2$ such that if $X_n \in \operatorname{argmax}_{X \in X^q} \mathrm{qEHVI}_n(X)$ for all $n$, then $\lim_{n \to \infty} \mathbb{P}_n(x \in X^*_f) = t$ almost surely for some fixed $x \in X$ and $t \in (0, 1)$. Proof. For simplicity, we focus on the single-objective case. Our example can be easily extended to the multi-objective case, e.g., by augmenting the problem with dummy constant objectives. In this case, $X^*_f = \operatorname{argmax}_{x' \in X} f(x')$ and the qEHVI acquisition function reduces to the qEI acquisition function (Siivola et al., 2021), defined by $\mathrm{qEI}_n(x) = \mathbb{E}_n[\{f(x) - \mu^*_n\}^+]$, (4) where $\mu^*_n$ is the maximum posterior mean value over designs previously shown. We build on the example provided by Astudillo et al. (2023). Specifically, we let $X = \{1, 2, 3, 4\}$ and consider the functions $f(\cdot\,; k) : X \to \mathbb{R}$, for $k = 1, 2, 3, 4$, given by $f(1; k) = -1$ and $f(2; k) = 0$ for all $k$, and $f(x; 1) = \begin{cases} 1, & x = 3 \\ 1/2, & x = 4 \end{cases}$, $f(x; 2) = \begin{cases} 1/2, & x = 3 \\ 1, & x = 4 \end{cases}$, $f(x; 3) = \begin{cases} -1/2, & x = 3 \\ -1, & x = 4 \end{cases}$, $f(x; 4) = \begin{cases} -1, & x = 3 \\ -1/2, & x = 4 \end{cases}$. Let $s$ be a number with $0 < s < 1/3$ and set $t = 1 - s$. We consider a prior distribution on $f$ with support $\{f(\cdot\,; k)\}_{k=1}^{4}$ such that $p_{0,k} = \mathbb{P}(f = f(\cdot\,; k)) = \begin{cases} s/2, & k = 1, 2 \\ t/2, & k = 3, 4. \end{cases}$
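As a numerical sanity check of the invariant this proof establishes (the posterior mass on $f(\cdot\,;3)$ and $f(\cdot\,;4)$ never moves), the following sketch applies a Bayesian update in which models 1 and 3 share a per-round likelihood weight $w$ and models 2 and 4 share $1 - w$. This is an illustration alongside the proof, not part of it; the values of $w$ below are arbitrary.

```python
def update(p, w):
    """One Bayesian update: models 1 and 3 share likelihood w for the
    observed response, models 2 and 4 share 1 - w; then renormalize."""
    unnorm = [p[0] * w, p[1] * (1 - w), p[2] * w, p[3] * (1 - w)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

s, t = 0.2, 0.8                    # any 0 < s < 1/3, with t = 1 - s
p = [s / 2, s / 2, t / 2, t / 2]   # prior p_{0,k}
for w in [0.3, 0.9, 0.55, 0.1]:    # arbitrary per-round likelihood weights
    p = update(p, w)
# The mass on models 3 and 4 stays at t after every update.
```

The invariance holds because each update multiplies $p_{n,1}, p_{n,3}$ by the same factor (and likewise $p_{n,2}, p_{n,4}$), so the ratios $p_{n,1}/p_{n,3}$ and $p_{n,2}/p_{n,4}$ used in the proof are preserved.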
We assume a logistic likelihood given by $\mathbb{P}(r_n = i \mid f(X_n)) = \frac{\exp(f(x_{n,i})/\lambda)}{\exp(f(x_{n,1})/\lambda) + \exp(f(x_{n,2})/\lambda)}, \quad i = 1, 2.$ From the proof of Astudillo et al. (2023), we know that $X_n = (3, 4)$ for all $n$ and the posterior distribution evolves according to the equations $p_{n+1,k} \propto \begin{cases} p_{n,k} w_n, & k = 1, 3 \\ p_{n,k}(1 - w_n), & k = 2, 4, \end{cases}$ where $w_n = a\mathbf{1}\{r_n = 1\} + (1 - a)\mathbf{1}\{r_n = 2\}$ and $a = \exp(1/(2\lambda))/(1 + \exp(1/(2\lambda)))$. We will use the above to show that $p_{n,3} + p_{n,4} = t$ for all $n$. To this end, observe that the above equations imply that $p_{n,1}/p_{n,3} = p_{0,1}/p_{0,3} = s/t$ for all $n$. Similarly, $p_{n,2}/p_{n,4} = p_{0,2}/p_{0,4} = s/t$ for all $n$. Thus, $p_{n,1} = (s/t)p_{n,3}$ and $p_{n,2} = (s/t)p_{n,4}$ for all $n$. Moreover, $1 = p_{n,1} + p_{n,2} + p_{n,3} + p_{n,4} = (s/t)p_{n,3} + (s/t)p_{n,4} + p_{n,3} + p_{n,4} = (p_{n,3} + p_{n,4})(1 + s/t)$, so $p_{n,3} + p_{n,4} = 1/(1 + s/t) = t$. Finally, let $x = 2$ and observe that $x \in \operatorname{argmax}_{x' \in X} f(x')$ if and only if $f = f(\cdot\,; 3)$ or $f = f(\cdot\,; 4)$. Hence, $\mathbb{P}_n(x \in \operatorname{argmax}_{x' \in X} f(x')) = \mathbb{P}_n(f = f(\cdot\,; 3) \text{ or } f = f(\cdot\,; 4)) = p_{n,3} + p_{n,4} = t$. B Additional experimental details and results B.1 Additional details on our probabilistic model under observable objectives Suppose for concreteness that objective $j$ is observable. Then, at each iteration $n$, we observe (potentially noisy) measurements of $f_j(x_{n,1}), \ldots, f_j(x_{n,q})$. Let $y_{n,1}, \ldots, y_{n,q}$ denote these measurements, and let $\mathcal{D}_{n-1} = \{(x_{k,i}, y_{k,i})\}_{k=1,\ldots,n,\,i=1,\ldots,q}$ denote all the measurements available right before the $n$-th interaction with the DM. As is standard in the BO literature, we can assume Gaussian noise such that $y_{n,i} = f_j(x_{n,i}) + \epsilon_{n,i}$, where the terms $\epsilon_{n,i} \sim \mathcal{N}(0, \sigma^2)$ are independent across $n$ and $i$. If we assume a Gaussian prior over $f_j$, then the posterior distribution of $f_j$ given $\mathcal{D}_{n-1}$ is again a Gaussian process, whose mean and covariance functions can be computed using the standard Gaussian process regression equations (equations 2.23 and 2.24 of Rasmussen & Williams, 2006).
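For convenience, the standard Gaussian process regression equations referenced above can be written out as follows (stated here for a zero-mean prior, matching equations 2.23 and 2.24 of Rasmussen & Williams, 2006):

```latex
\mu_n(x) = \mathbf{k}(x)^\top \left[K + \sigma^2 I\right]^{-1} \mathbf{y},
\qquad
k_n(x, x') = k(x, x') - \mathbf{k}(x)^\top \left[K + \sigma^2 I\right]^{-1} \mathbf{k}(x'),
```

where $\mathbf{y}$ collects the observed measurements, $K$ is the kernel matrix over the observed inputs, and $\mathbf{k}(x)$ is the vector of kernel evaluations between $x$ and the observed inputs.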
B.2 Noise in the DM's responses In our experiments, we simulate noise in the DM's responses using additive Gumbel noise. Specifically, if $X_n$ is the query presented at iteration $n$, then the response observed is $r_n = [r_{1,n}, \ldots, r_{m,n}]$, where $r_{j,n} = \operatorname{argmax}_{i=1,\ldots,q} f_j(x_{n,i}) + \epsilon_{n,i,j}$, and $\epsilon_{n,i,j} \sim \mathrm{Gumbel}(0, \lambda^{\mathrm{true}}_j)$ for $i = 1, \ldots, q$ and $j = 1, \ldots, m$ are independent. Under this choice, we have $\mathbb{P}(r_{j,n} = i \mid f_j(X_n)) = \frac{\exp(f_j(x_{n,i})/\lambda^{\mathrm{true}}_j)}{\sum_{i'=1}^{q} \exp(f_j(x_{n,i'})/\lambda^{\mathrm{true}}_j)}, \quad i = 1, \ldots, q,$ i.e., this recovers a logistic likelihood. Following Astudillo et al. (2023), in each problem, for every objective, $\lambda^{\mathrm{true}}_j$ is chosen such that, on average, the DM makes a mistake 20% of the time when comparing pairs of designs among those with the top 1% objective values within $X$. We estimate this percentage by uniformly sampling a large number of design points over $X$. Noise in the DM's responses observed by PBO-DTS-IF is generated by scalarizing the corrupted objective values defined above using a Chebyshev scalarization. Concretely, if $r'_n$ denotes the response observed by PBO-DTS-IF when presented with query $X_n$, then $r'_n = \operatorname{argmax}_{i=1,\ldots,q} s(f(x_{n,i}) + \epsilon_{n,i}; \theta_n)$, where $\epsilon_{n,i} = [\epsilon_{n,i,1}, \ldots, \epsilon_{n,i,m}]$, $\epsilon_{n,i,j}$ is defined as before, and $\theta_n$ is drawn uniformly from $\Theta$. Figure 6: Performance of DSTS and modified DSTS (DSTS-M) using queries with q = 2 alternatives across three test problems: (a) Car Side Impact, (b) Autonomous Driving, (c) Exoskeleton. Their performance is almost identical in the three problems considered. B.3 Results for modified DSTS We evaluate the performance of the modified version of DSTS used in the proof of Theorem 2 on three of our test problems. We set $x_{\mathrm{ref}}$ by drawing a point uniformly at random over $X$. This point is different for each replication.
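The equivalence between Gumbel-perturbed argmax responses and the logistic likelihood above can be checked by simulation. The sketch below uses illustrative values with q = 2 and is a Monte Carlo demonstration, not the paper's noise-calibration code.

```python
import math
import random

random.seed(0)

def gumbel(scale):
    # Inverse-CDF sampling of a Gumbel(0, scale) variable.
    u = random.random()
    return -scale * math.log(-math.log(u))

f_vals = [1.0, 0.3]  # latent objective values at the two alternatives
lam = 0.5            # noise scale (illustrative stand-in for lambda_j^true)
trials = 100_000
wins = sum(
    f_vals[0] + gumbel(lam) > f_vals[1] + gumbel(lam)
    for _ in range(trials)
)
empirical = wins / trials
logistic = math.exp(f_vals[0] / lam) / (
    math.exp(f_vals[0] / lam) + math.exp(f_vals[1] / lam)
)
# empirical and logistic agree to Monte Carlo accuracy
```

This is also how the noise scale can be calibrated in practice: sweep the scale until the simulated mistake rate on near-optimal pairs matches the target (20% in the paper's setup).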
In addition, we set $\delta = 0.05$ so that $x_{\mathrm{ref}}$ is selected five times on average across 100 iterations. The results are shown in Figure 6. B.4 Results for q = 4 We carry out experiments analogous to those presented in the main paper, using queries with q = 4 alternatives each. We focus on DSTS and the two strongest benchmarks, qParEGO and qMES. The results are shown in Figure 7. For a clearer comparison, we also include results for q = 2. As depicted, increasing the number of alternatives improves the performance of all three algorithms. DSTS delivers the best overall performance under queries with q = 2 and q = 4 alternatives. B.5 Additional details on our benchmarks B.5.1 Adapted standard multi-objective BO algorithms qParEGO In standard multi-objective BO with observable objectives, the qParEGO acquisition function is defined by $\alpha_n(x) = \mathbb{E}_n[\{s(f(x); \theta_n) - \max_{x' \in X_n} s(f(x'); \theta_n)\}^+]$, where $X_n$ is the set of all designs in queries presented to the DM up to time $n$ and $\theta_n$ is drawn uniformly at random over $\Theta$. Following Siivola et al. (2021), we replace $f(x')$ by the posterior mean of $f$ at $x'$ to avoid taking the expectation with respect to the random variable $\max_{x' \in X_n} s(f(x'); \theta_n)$. qEHVI In the standard multi-objective BO setting with observable objectives, the qEHVI acquisition function is defined by $\alpha_n(x) = \mathbb{E}_n[\mathrm{HV}(Y_n \cup \{f(x)\}, r) - \mathrm{HV}(Y_n, r)]$, where $Y_n$ is the set of objective vectors corresponding to designs presented to the DM up to time $n$. As with qParEGO, we replace these objective vectors with posterior mean vectors to avoid taking the expectation with respect to these unknown values.
Figure 7: Performance of qParEGO, qMES, and DSTS using queries with q = 2 and q = 4 alternatives across our six test problems, including (c) Vehicle Safety, (d) Car Side Impact, (e) Autonomous Driving, and (f) Exoskeleton. Using a larger number of alternatives improves the performance of the three algorithms. DSTS delivers the best overall performance under queries with q = 2 and q = 4 alternatives. Note that qEHVI also requires specifying a reference point $r$, which is challenging if the objectives are latent. In our experiments, we set $r$ equal to the coordinate-wise minimum posterior mean value across designs presented to the DM up to time $n$. qMES Unlike qParEGO and qEHVI, qMES does not require any modification to be applied in our setting. However, it is worth noting that the lookahead step in the definition of qMES assumes that the objective values will be observed at the selected points, which is not the case in our setting. We believe this can be problematic, as it may cause qMES to over-value queries constituted by designs with very distinct objective values, even though preference feedback from such queries can often be uninformative. B.5.2 PBO with inconsistent preference feedback The feedback used by PBO-DTS-IF is produced as follows. At each iteration $n$, a set of scalarization parameters $\theta_n$ is drawn uniformly at random over $\Theta$. Given a query $X_n = (x_{n,1}, \ldots, x_{n,q})$, we assume the DM then provides a noisy response to $\operatorname{argmax}_{i=1,\ldots,q} s(f(x_{n,i}); \theta_n)$.
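This scalarized response can be sketched as follows. The augmented Chebyshev form below follows ParEGO (Knowles, 2006) and is an assumption for illustration; the paper does not restate the exact scalarization here, and `rho` is an illustrative constant.

```python
# Hedged sketch of an (augmented) Chebyshev scalarization for maximization,
# in the spirit of ParEGO (Knowles, 2006); rho and the exact form are
# illustrative assumptions, not the paper's stated definition.

def chebyshev(y, theta, rho=0.05):
    """Scalarize objective vector y with weights theta (theta_j >= 0, sum 1)."""
    terms = [t * v for t, v in zip(theta, y)]
    return min(terms) + rho * sum(terms)

def dm_response(query_objectives, theta):
    """Index of the alternative the DM prefers under scalarization theta
    (noiseless version; the experiments add Gumbel noise to the objectives)."""
    scores = [chebyshev(y, theta) for y in query_objectives]
    return scores.index(max(scores))
```

Redrawing `theta` at every iteration is exactly what makes the resulting preference sequence inconsistent with any single latent objective.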
Inconsistency arises from sampling different scalarization parameters at every iteration, which, in general, implies that preferences cannot be encoded by a single objective function. We argue this mimics the DM's desire to explore the trade-offs between objectives before committing to a solution. At the same time, we note that these responses respect intuitive user behavior, such as a preference for queries for which each objective is as large as possible. The responses provided by the DM are used to fit a (single-output) Gaussian process with a logistic likelihood, for which posterior inference is carried out using the same approach we use for the other algorithms. New queries are generated using the (single-objective) dueling Thompson sampling strategy under this probabilistic model. The performance of this method is expected to be poor when the Pareto-optimal set is large, i.e., when the trade-offs between objectives are significant.

Table 1: Average runtimes in seconds for all the algorithms compared (except for Random) across all test problems.

Problem / Algorithm   PBO-DTS-IF  qParEGO  qEHVI  qMES   DSTS
DTLZ1                 5.7         24.3     25.6   58.5   19.2
DTLZ2                 6.8         16.4     10.6   16.5   13.3
Vehicle Safety        8.4         15.4     36.7   43.8   17.8
Car Side Impact       5.9         38.1     352.8  324.4  36.3
Autonomous Driving    5.5         34.1     102.7  72.2   32.1
Exoskeleton           6.3         16.5     23.7   59.3   14.6

B.6 Runtimes The average runtimes for all algorithms across all test problems are presented in Table 1. DSTS is comparable with qParEGO and faster than qEHVI and qMES. PBO-DTS-IF is the fastest algorithm since, in contrast with the other algorithms, it requires a single Gaussian process model instead of m. However, it is worth noting that the runtimes of these algorithms could be reduced by parallelizing the training of their m models.
Moreover, the runtimes of DSTS could be further reduced by parallelizing the generation of the q alternatives in each query during the optimization step.