# Principled Preferential Bayesian Optimization

Wenjie Xu 1 2, Wenbin Wang 1, Yuning Jiang 1, Bratislav Svetozarevic 2 3, Colin N. Jones 1

Abstract

We study the problem of preferential Bayesian optimization (BO), where we aim to optimize a black-box function with only preference feedback over a pair of candidate solutions. Inspired by the likelihood ratio idea, we construct a confidence set of the black-box function using only the preference feedback. An optimistic algorithm with an efficient computational method is then developed to solve the problem, which enjoys an information-theoretic bound on the total cumulative regret, a first-of-its-kind for preferential BO. This bound further allows us to design a scheme to report an estimated best solution with a guaranteed convergence rate. Experimental results on sampled instances from Gaussian processes, standard test functions, and a thermal comfort optimization problem all show that our method stably achieves better or competitive performance compared to the existing state-of-the-art heuristics, which, however, do not have theoretical guarantees on regret bounds or convergence.

1Automatic Control Laboratory, EPFL, Lausanne, Switzerland. 2Urban Energy Systems Laboratory, Empa, Zurich, Switzerland. 3The Institute for Artificial Intelligence Research and Development of Serbia, Serbia. Correspondence to: Yuning Jiang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Bayesian optimization (BO) is a popular sample-efficient black-box optimization method (Shahriari et al., 2015; Frazier, 2018). It is widely applied to tuning hyperparameters of machine learning models (Snoek et al., 2012), optimizing the performance of control systems (Xu et al., 2022b), and discovering new drugs (Negoescu et al., 2011). The main idea of BO is surrogate modeling: a learning algorithm (typically Gaussian process regression) learns the unknown black-box function from historical samples and outputs a learned surrogate together with uncertainty quantification. BO algorithms, such as the popular Expected Improvement (Jones et al., 1998) and GP-UCB (Srinivas et al., 2012) algorithms, then use this learned surrogate and its uncertainty quantification to choose the next sample point. The conventional BO setting assumes that each sample, which typically corresponds to a round of real-world experiment or software simulation in practice, returns a noisy scalar evaluation of the black-box function. However, many human-in-the-loop systems cannot return such a scalar value, or it is much more difficult to directly obtain such a scalar evaluation from humans, since humans are bad at sensing absolute magnitude (Kahneman & Tversky, 2013). In contrast, it is much easier for a human to compare a pair of solutions and report which one is preferred (Lichtenstein & Slovic, 1971; Tversky & Kahneman, 1974; Kahneman & Tversky, 2013). This gives rise to preferential Bayesian optimization (González et al., 2017), where a scalar evaluation of the black-box function is not available; instead, we can query an oracle to compare a pair of solutions, the so-called duels.
Such settings arise widely in a broad range of applications, such as visual design optimization (Koyama et al., 2020), thermal comfort optimization (Abdelrahman & Miller, 2022) and robotic gait optimization (Li et al., 2021). Existing preferential Bayesian optimization methods are mostly heuristic, without formal guarantees on cumulative regret or convergence to the global optimal solution. For example, (Gonz alez et al., 2017) proposes several heuristic acquisition strategies, including expected improvement and Thompson sampling-based methods, for preferential Bayesian optimization. (Mikkola et al., 2020) extends the preferential Bayesian optimization to the projective setting. (Takeno et al., 2023) proposes a Thompson sampling-based method for practical preferential Bayesian optimization with skew Gaussian process. (Astudillo et al., 2023) proposes a decision theoretical acquisition strategy with a convergence rate guarantee for a finite input set. However, as far as we know, all the existing preferential Bayesian optimization methods can not provide theoretical guarantees on cumulative regret or global convergence with continuous input space, partially due to the challenge of quantifying uncertainty in a principled way. Beyond preferential BO, optimization from preference feed- Principled Preferential Bayesian Optimization back has also been investigated in other contexts. In the following, we first survey the related work other than preferential BO and then highlight our unique contributions. Dueling Bandits In dueling bandits (Yue et al., 2012), the goal is to identify the best arm from a set of finite arms, using only the noisy comparison feedback. It has also been extended to adversarial (Gajane et al., 2015) and contextual (Dud ık et al., 2015; Saha & Krishnamurthy, 2022) settings. One extension that is most related to this work is kernelized dueling bandits (Sui et al., 2017; 2018). However, this line of research is typically restricted to the case where the number of arms is finite, and the regret bound can blow up to infinity when the number of arms goes to infinity (e.g., Thm. 2 in (Sui et al., 2017)). A recent work (Mehta et al., 2023) proposes an offline method with suboptimality bound by learning winning probability, which, however, are not applicable to online learning problems due to linear growth of regret over the randomly sampled compared point sequences. In the existing literature, there is no cumulative regret bound that depends on an inherent complexity metric (such as covering number and maximum information gain (Srinivas et al., 2012)) of the black-box function with continuous input space. Convex Optimization with Preference Feedback (Saha et al., 2021; Yue & Joachims, 2009) consider the optimization of convex functions, where only a comparison oracle of function values over different points is available. The proposed methods estimate the gradient from the preference signals. However, this line of research restricts the function to be convex, while in practice, the black-box function may be non-convex. The proposed method may get stuck in a local optimum and can be sample-inefficient since each estimate of the gradient already needs several samples. Reinforcement Learning from Human Feedback Reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Griffith et al., 2013) has recently become very popular. 
It has found many successes in wide applications, including training robots (Hiranaka et al., 2023), playing games (Warnell et al., 2018), and remarkably large language models (Ouyang et al., 2022). On the theoretical line of RLHF research, recent results analyze the offline learning of the implicit reward function (Zhu et al., 2023) and the model-based optimistic reinforcement learning from human feedback (Wang et al., 2023). However, the existing theoretical analysis either only deals with finite-dimensional generalized linear models or highly relies on the complexity measure of Eluder dimension (Osband & Van Roy, 2014). The existing generic theoretical analysis for RLHF can not be directly applied to the Bayesian optimization setting, where the Eluder dimension of the infinite-dimensional reproducing kernel Hilbert space is not well understood. Optimistic Model-based Sequential Decision Making Op- timism in the face of uncertainty is a widely adopted design principle for model-based sequential decision making problems, such as in Bayesian optimization/reinforcement learning (Wu et al., 2022; Xu et al., 2023; Pacchiano et al., 2021; Curi et al., 2020; Liu et al., 2023). The optimism principle has also been applied to RLHF (Wang et al., 2023) recently. However, as far as we know, there is no existing principled optimistic algorithm for preferential BO yet. Our contributions. Guided by the optimism principle, we design a preferential Bayesian optimization algorithm that enjoys information-theoretic bounds on the cumulative regret. Specifically, our contributions include: Algorithm design. Inspired by the recent work of the confidence set based on optimistic maximum likelihood estimate (Liu et al., 2023) and the likelihood ratio confidence set idea (Owen, 1990; Emmenegger et al., 2023), we construct a confidence set by only using the preference feedback. We then exploit the principle of optimism in the face of uncertainty to design a Principled Optimistic Preferential Bayesian Optimization (POP-BO) algorithm, together with a scheme of reporting an estimated best solution. Theoretical analysis. Under some mild regularity assumptions, we prove an information-theoretic bound on the cumulative regret of POP-BO algorithm, which is first-of-its-kind 1 for preferential Bayesian optimization. This is significant since previous informationtheoretic regret bounds typically assume the direct scalar evaluations of black-box functions (Srinivas et al., 2012) while the recent generic theoretical results for RLHF typically rely on Eluder dimension, which is not well understood for RKHS. Efficient computations. The optimistic algorithm needs to solve bi-level optimization problems with the inner variable in an infinite-dimensional function space. We leverage the representer theorem (Sch olkopf et al., 2001) to reduce the inner optimization problem to finite-dimensional space, which turns out to be tractable via convex optimization. This further allows efficient grid-free joint optimization. Empirical validations and toolbox. 2 Experimental results show that POP-BO consistently achieves better or competitive performance as compared to the stateof-the-art heuristic baselines and more than 10 times speed-up in computation as compared to the Thompson sampling based method. We also provide a reusable toolbox for future applications of our method. 1(Mehta et al., 2023) provides a bound on the partial cumulative regret, which only captures the suboptimality of one point in each compared duel. 
We consider stronger total cumulative regret over both points in the compared duel. See Appendix Q for a detailed discussion. 2Code link: https://github.com/PREDICT-EPFL/POP-BO Principled Preferential Bayesian Optimization 2. Problem Statement We consider the maximization of a black-box function f, max x X f(x), (1) where X Rd with d as the input dimension. We use x x to denote the event that x is preferred to x . In contrast to the standard BO setup, we assume that we can not directly evaluate the scalar value of f(x) but rather, we have a comparison oracle that compares any two points x, x and returns a preference signal 1x x , which is defined as ( 1, if x is preferred, 0, if x is preferred. (2) Before proceeding, we state a set of common assumptions. Assumption 2.1. X is compact and nonempty. Assumption 2.1 is reasonable because, in many applications (e.g., continuous hyperparameter tuning) of Bayesian Optimization, we are able to restrict the optimization into certain ranges based on domain knowledge. Regarding the black-box function f, we assume that, Assumption 2.2. f Hk, where k : Rd Rd R is a symmetric, positive semidefinite kernel function and Hk is the corresponding reproducing kernel Hilbert space (RKHS, see (Sch olkopf et al., 2001)). Furthermore, we assume f k B, where k is the norm induced by the inner product in the corresponding RKHS. Assumption 2.2 requires that the function to be optimized is regular in the sense that it has a bounded norm in the RKHS, which is a common assumption (Chowdhury & Gopalan, 2017a; Zhou & Ji, 2022). For simplicity, we will use Bf to denote the set n f Hk| f k B o , which is a ball with radius B in Hk. Remark 2.3 (Choice of B). In practice, a tight norm bound B might not be known beforehand. In the theoretical analysis, we only assume that there is a finite bound B, possibly unknown beforehand. In the practical implementation of our algorithm, we can adapt B based on hypothesis testing (Newey & Mc Fadden, 1994). For example, we can double B every time we detect a low likelihood value (See more elaboration in Appendix G.). Assumption 2.4. k(x, x ) 1, x, x X and k(x, x ) is continuous on Rd Rd. Assumption 2.4 is a commonly adopted mild assumption in the BO literature (Srinivas et al., 2012; Chowdhury & Gopalan, 2017a). It holds for most commonly used kernel functions after normalization, such as the linear kernel, the Mat ern kernel, and the squared exponential kernel. Assumption 2.5. The random preference feedback 1x x from the comparison oracle follows the Bernoulli distribution with P(1x x = 1) = px x = σ(y y ), where y = f(x), y = f(x ) and σ(u) = 1/(1+e u)3. Assumption 2.5 equivalently assumes that, P(1x x = 1) = ef(x) ef(x) + ef(x ) , (3) which can be observed to be the widely used Bradley-Terry Luce (BTL) model (Bradley & Terry, 1952) for pairwise comparison. The intuition here is that the more advantage f(x) has as compared to f(x ), the more likely x is preferred. The same comparison model is also used in, e.g., training large language models (Ouyang et al., 2022). At step t, our algorithm queries the pair (xt, x t) and the comparison oracle returns the random preference 1xt x t {0, 1}. For the simplicity of notation, we use 1τ {0, 1} to denote the realization of the Bernoulli random variable 1xτ x τ when querying the comparison oracle at step τ. Based on the historical comparison results Dt := {(xτ, x τ, 1τ)}t τ=1, (4) the algorithm needs to decide the next pair of samples to compare. 
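To make the feedback model concrete, the following minimal Python sketch simulates the comparison oracle of Assumption 2.5 (the BTL model in Eq. (3)) and accumulates the historical dataset D_t of Eq. (4). The toy objective, the class name `ComparisonOracle`, and the other identifiers are illustrative choices of ours, not part of the paper or its released toolbox.

```python
import numpy as np

def sigmoid(u):
    # sigma(u) = 1 / (1 + exp(-u)), the link function in Assumption 2.5
    return 1.0 / (1.0 + np.exp(-u))

class ComparisonOracle:
    """Simulates the Bernoulli preference feedback of Assumption 2.5:
    P(x preferred to x') = sigma(f(x) - f(x')), i.e., the BTL model in Eq. (3)."""

    def __init__(self, f, rng=None):
        self.f = f                               # ground-truth function, hidden from the learner
        self.rng = rng or np.random.default_rng(0)

    def compare(self, x, x_prime):
        p = sigmoid(self.f(x) - self.f(x_prime))
        return int(self.rng.random() < p)        # realization of 1_{x > x'} as in Eq. (2)

# Example: a 1-D toy objective on X = [0, 1]; the learner only observes duel outcomes.
f = lambda x: np.sin(3.0 * x) + 0.5 * x
oracle = ComparisonOracle(f)

D = []                                           # D_t = {(x_tau, x'_tau, 1_tau)} as in Eq. (4)
x_prev = 0.3
for x in [0.1, 0.7, 0.9, 0.5]:
    outcome = oracle.compare(x, x_prev)
    D.append((x, x_prev, outcome))
    x_prev = x                                   # Alg. 1 later reuses the last point as the reference
print(D)
```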
Without further notice, all the theoretical results in this paper are under the assumptions 2.1, 2.2, 2.4, 2.5, and all the corresponding proofs are in the appendices. Notations. The probability, denoted as P( ), is taken over the randomness of the preference feedback generated by the comparison oracle and the randomness generated by the algorithm. Let the filtration Ft capture all the randomness up to step t. N(Bf, ϵ, ) denotes the standard covering number (Zhou, 2002) of the function space ball Bf with the covering balls radius ϵ and the infinity norm . We will also use [τ] to denote the set {1, , τ}. 3. High Confidence Set 3.1. Likelihood-based Confidence Set We first introduce the function, p ˆ f(xτ, x τ, 1τ) :=1τσ( ˆf(xτ) ˆf(x τ))+ (5) (1 1τ) 1 σ( ˆf(xτ) ˆf(x τ)) , which is the likelihood of ˆf over the event 1xτ x τ = 1τ under the Bernoulli preference model in Assumption 2.5. We can then derive the likelihood function of a fixed function 3We mainly consider the widely used sigmoid function here. Our result can be extended to more general σ under mild regularity conditions. Principled Preferential Bayesian Optimization ˆf over the historical preference dataset Dt 4. P ˆ f((xτ, x τ, 1τ)t τ=1) := τ=1 p ˆ f(xτ, x τ, 1τ) (6) Taking log gives the log-likelihood function, ℓt( ˆf) := log P ˆ f((xτ, x τ, 1τ)t τ=1) = τ=1 log p ˆ f(xτ, x τ, 1τ) ezτ 1τ + ez τ (1 1τ) ezτ + ez τ τ=1 (zτ1τ + z τ(1 1τ)) τ=1 log ezτ + ez τ , where zτ = ˆf(xτ), z τ = ˆf(x τ), 1τ {0, 1} is the data realization of 1xτ x τ , and the last equality can be checked correct for either 1τ = 1 or 1τ = 0. A common method for statistical estimation is by maximizing the likelihood. Hence, we introduce the maximum likelihood estimator (MLE), ˆf MLE t arg max f Bf log P f((xτ, x τ, 1τ)t τ=1). (8) With the maximum likelihood estimator introduced, the posterior high confidence set can be derived as shown in Thm. 3.1 using the maximum log-likelihood value. Theorem 3.1 (Likelihood-based Confidence Set). ϵ, δ > 0, let, Bt+1 f := { f Bf|ℓt( f) ℓt( ˆf MLE t ) β1(ϵ, δ, t)}, (9) where β1(ϵ, δ, t) := q 32t B2 log π2t2N(Bf ,ϵ, ) t log t N(Bf ,ϵ, ) δ + ϵt , with CL a con- stant independent of δ, t and ϵ. We have, P f Bt+1 f , t 1 1 δ. (10) Intuitively, the confidence set Bt+1 f includes the functions with the log-likelihood value that is only a little worse than the maximum likelihood estimator. It turns out that by correctly setting the worse level β1, the confidence set Bt+1 f contains the ground-truth function f with high probability. This is reasonable because the preference data is generated with the ground-truth function, and thus the likelihood of the ground-truth function will not be too much lower than the maximum likelihood estimator. 4Note that P ˆ f( ) is the likelihood function in ˆf over the historical data Dt, not the probability taken over the data/algorithm randomness. Remark 3.2 (Choice of ϵ). In Thm. 3.1, β1(ϵ, δ, t) also depends on a small positive value ϵ, which is to be chosen. In the theoretical analysis, it will be seen that ϵ can be selected to be 1/T, where T is the algorithm s running horizon. Remark 3.3 (Likelihood Ratio Idea). The confidence set Bt+1 f contains the functions f that satisfy, P f((xτ, x τ, 1τ)t τ=1) P ˆ f MLE t ((xτ, x τ, 1τ)t τ=1) e β1(ϵ,δ,t), (11) which is the likelihood ratio confidence set (Owen, 1990). Remark 3.4. Surrogate-based black-box optimization with kernel method is often referred to as Bayesian optimization due to its close relations to Bayesian Gaussian process model. 
Hence, we refer to our method as preferential BO. Based on the confidence set in Thm. 3.1, we can derive the pointwise confidence range for the black-box function. inf f Bt f f(x) f(x) sup f Bt f Fig. 1 demonstrates the maximum likelihood estimate function and the confidence range with the ground truth function sampled from a Gaussian process, random comparison inputs, and β1(ϵ, δ, t) set to be a constant 1.0. It can be seen that the maximum likelihood estimate approximates the ground truth better and better with the confidence range shrinking, as we have more and more comparison data. 3.2. Bound Duel-wise Error Thm. 3.1 gives a high confidence set based on the likelihood function. However, it is not straightforward how the likelihood bounds lead to the error bounds on function value differences over a compared pair (x, x ), which determines the preference distribution. The following theorem further gives such a bound over the historical samples. Lemma 3.5 (Elliptical Bound). For any estimate ˆft+1 Bt+1 f that is measurable with respect to the filtration Ft, we have, with probability at least 1 δ, t 1, ˆft+1(xτ) ˆft+1(x τ) (f(xτ) f(x τ)) 2 β(ϵ, δ/2, t), (13) and f Bt+1 f , (14) where β(ϵ, δ/2, t) = σ 2 Hσ (β2(ϵ, δ/2, t) + 2β1(ϵ, δ/2, t)) = t log t N(Bf ,ϵ, ) δ + ϵt + ϵ2t , with β2(ϵ, δ, t) = 8Hσ σ 2ϵ2t+2CLϵt+ q 8t B2p log π2t2N(Bf ,ϵ, ) 3δ and the constants σ , Hσ, σ , Bp as defined in Appendix B. Principled Preferential Bayesian Optimization MLE function Ground Truth Confidence Range Figure 1. Demonstration of the maximum likelihood function and the confidence set based on likelihood. The results are derived using random sequential comparisons (that is, comparing xt to xt 1), where each xt is uniformly randomly sampled from the input set. Lem. 3.5 highlights that with high probability, all the functions in the confidence set have difference values over the historical sample points that lie in a ball with the ground-truth function difference value as the center and p β(ϵ, δ/2, t) as the radius. Lem. 3.5 indicates that our likelihood-based learning scheme can gradually learn the function differences f(xτ) f(x τ) but not the absolute value f(xτ). This is reasonable since shifting f by a constant will not change the distribution of preference feedback. Furthermore, to derive an error bound over a new pair (x, x ), we need to quantify the uncertainty of f(x) f(x ), where f Bf. Since f Bf by the definition of Bf, it can be seen that f(x) f(x ) Bff , where Bff := {F(x, x ) = f(x) + f (x )| f, f Bf}. (15) Indeed, Bff is the ball with radius 2B in the RKHS equipped with the additive kernel function kff ((x, x ), ( x, x )) := k(x, x) + k(x , x ), which we term as the augmented RKHS here, and inner product f1 + f 1, f2 + f 2 kff := f1, f2 k + f 1, f 2 k. The readers are referred to (Christmann & Hable, 2012; Kandasamy et al., 2015) for more details of the additive kernel and the corresponding RKHS. To quantify the uncertainty of a new pair (x, x ), we further introduce the function, σff t (ω) 2 = kff (ω, ω) (16) kff (ω1:t 1, ω) Kff t 1 + λI 1 kff (ω1:t 1, ω) , where ω := (x, x ), ω1:t 1 := ((xτ, x τ))t 1 τ=1, Kff t 1 := (kff ((xτ1, x τ1), (xτ2, x τ2)))τ1 [t 1],τ2 [t 1], and λ is a positive regularization constant. Theorem 3.6 (Duel-wise Error Bound). For any estimate ˆft+1 Bt+1 f measurable with respect to Ft, we have, with probability at least 1 δ, t 1, (x, x ) X X, ( ˆft+1(x) ˆft+1(x )) (f(x) f(x )) 2 2B + λ 1/2p β(ϵ, δ/2, t) σff t+1((x, x )). (17) Remark 3.7. 
In preferential BO, we do not get the scalar value of f(x) f(x ). Hence, σff t can not be interpreted as the posterior standard deviation as in (Srinivas et al., 2012). However, it turns out that σff t , as a measure of uncertainty, still accounts for a factor of the duel-wise error. To characterize the complexity of this augmented RKHS, we use the maximum information gain (Srinivas et al., 2012), T := max Ω X X;|Ω|=T 1 2 log I + λ 1Kff Ω = kff ((x, x ), ( x, x )) (x,x ),( x, x ) Ω. 4. Algorithm 4.1. Principled Optimistic Algorithm We are now ready to give the optimistic algorithm in Alg. 1. Algorithm 1 Principled Optimistic Preferential Bayesian Optimization (POP-BO). 1: Given the initial point x0 X and set B1 f = Bf. 2: for t [T] do 3: Set the reference point x t = xt 1. 4: Compute xt arg max x X max f Bt f ( f(x) f(x t)), with the inner optimal function denoted as ft. 5: Query the comparison oracle to get the feedback result 1t and append the new data to Dt. 6: Update the maximum likelihood estimator ˆf MLE t and the posterior confidence set Bt+1 f . 7: end for The key to Alg. 1 is line 4. The idea is to maximize the Principled Preferential Bayesian Optimization optimistic advantage of f(x) as compared to f(x t) with the uncertainty of the black-box function f Bt f. In line 3, we set the reference point x t as the last generated point xt 1. In practice, this may correspond to two possible scenarios. In the first, each comparison requires one experiment, such as image quality comparison. In this case, we only need to set one of the compared pair as the last newly generated solution. While in the other scenario, comparing xt and x t needs separate experiments for xt and x t. For example, when optimizing the building thermal comfort, the occupants need to experience both thermal conditions to report preference. If at step t, the oracle still has memory about the experience with input xt 1, we can directly compare xt and xt 1. In this case, setting x t to be xt 1 saves the experimental expense with x t. For online applications, cumulative regret is more of our interest. However, for an offline optimization setting, it may be of more interest to identify one near-optimal solution to report. Unlike in the scalar evaluation setting, where we can directly use the scalar value to report the best observed solution, we can not directly identify the best sampled solution in the preferential Bayesian optimization scenario. To address this issue, we report the solution xt , where t arg min t [T ] 2 2B + λ 1/2p β(ϵ, δ/2, t) σff t ((xt, x t)). (19) The idea is that although the best sample may not be known, we can derive a solution by minimizing the known term 2(2B +λ 1/2p β(ϵ, δ/2, t))σff t ((xt, x t)) to find a solution xt to report. Indeed, this term upper bounds the uncertainty of the optimistic advantage (as shown in Thm. 3.6). Hence, the smaller it is, the more certain that f(xt) is close to the ground-truth optimal value. At step t, we can report the current estimated solution with index τ (t) satisfying a similar formula to Eq. (19). 4.2. Efficient Computations Line 4 in Alg. 1 requires solving a nested optimization problem with inner variables in an infinite-dimensional function space. The update of the maximum likelihood estimator also requires solving an optimization problem with an infinitedimensional function as the decision variable. These are in general not tractable in their current forms. 
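Before turning to the tractable reformulation that continues right after this sketch, the following Python schematic lays out the overall structure of Alg. 1, together with the duel-wise uncertainty σ^ff_t of Eq. (16) that drives the reporting rule in Eq. (19). It is a simplified sketch under our own assumptions: a finite candidate set, hypothetical placeholder routines `solve_mle` and `optimistic_advantage` standing in for the convex programs of Sec. 4.2, and the β-dependent weight of Eq. (19) dropped for brevity; none of these names come from the paper's code.

```python
import numpy as np

def k_se(x, y, lengthscale=1.0):
    # squared exponential base kernel, bounded by 1 (Assumption 2.4)
    return np.exp(-np.sum((np.atleast_1d(x) - np.atleast_1d(y)) ** 2) / (2 * lengthscale ** 2))

def k_ff(w1, w2):
    # additive kernel on duels: k_ff((x, x'), (x~, x~')) = k(x, x~) + k(x', x~')
    (x1, xp1), (x2, xp2) = w1, w2
    return k_se(x1, x2) + k_se(xp1, xp2)

def sigma_ff(w, past_duels, lam=1.0):
    """Uncertainty of a new duel w = (x, x'), Eq. (16)."""
    if not past_duels:
        return np.sqrt(k_ff(w, w))
    K = np.array([[k_ff(wi, wj) for wj in past_duels] for wi in past_duels])
    kvec = np.array([k_ff(wi, w) for wi in past_duels])
    var = k_ff(w, w) - kvec @ np.linalg.solve(K + lam * np.eye(len(past_duels)), kvec)
    return np.sqrt(max(var, 0.0))

def pop_bo(oracle, candidates, T, solve_mle, optimistic_advantage):
    """Schematic POP-BO loop (Alg. 1); `solve_mle` and `optimistic_advantage`
    are placeholders for the finite-dimensional convex problems (22) and (24)."""
    D, duels, report_scores = [], [], []
    x_ref = candidates[0]                        # x_0
    for t in range(1, T + 1):
        # line 4: maximize the optimistic advantage over the confidence set
        x_t = max(candidates, key=lambda x: optimistic_advantage(x, x_ref, D))
        # line 5: query the comparison oracle and store the outcome
        D.append((x_t, x_ref, oracle.compare(x_t, x_ref)))
        duels.append((x_t, x_ref))
        # line 6: refit the MLE / confidence set (placeholder)
        solve_mle(D)
        # uncertainty term of the reporting rule (19); the beta_t weight is omitted here
        report_scores.append(sigma_ff((x_t, x_ref), duels[:-1]))
        x_ref = x_t                              # line 3 of the next iteration
    t_star = int(np.argmin(report_scores))       # report x_{t*} as in Eq. (19)
    return duels[t_star][0]
```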
Fortunately, we can reduce the infinite-dimensional problems to finitedimensional ones, thanks to the structures of the problem and the representer theorem (Sch olkopf et al., 2001). Maximum likelihood estimation. Since the log-likelihood function ℓt( f) = log P f((xτ, x τ, 1τ)t τ=1) (20) τ=1 (zτ1τ + z τ(1 1τ)) τ=1 log ezτ + ez τ only depends on the function value (zτ, z τ) = ( f(xτ), f(x τ)), we only need to optimize over (zτ, z τ) subject to that they are functions in Hk with norm less or equal to B. Furthermore, Alg. 1 sets x τ = xτ 1 and thus z τ = zτ 1. So we can reduce the optimization variables to only (zτ)t τ=0. Hence, Eq. (20) is reduced to the following log-likelihood function that only depends on (zτ)t τ=0, ℓ(Z0:t|Dt) (21) := Z 1:t11:t + Z0:t 1 (1 11:t) τ=1 log (ezτ + ezτ 1) , where Z0:t := (zτ)t τ=0, Z1:t := (zτ)t τ=1, Z0:t 1 := (zτ)t 1 τ=0 and 11:t = (1τ)t τ=1. By the representer theorem (Sch olkopf et al., 2001), the maximum likelihood estimation problem can be solved via, ℓt( ˆf MLE t ) = max Z0:t Rt+1 ℓ(Z0:t|Dt) subject to Z 0:t K 1 0:t Z0:t B2, (22) where K0:t := (k(xτ1, xτ2))τ1 {0} [t],τ2 {0} [t]. The constraint restricts that the function values need to come from a function inside the function space ball Bf, where the lefthand side is indeed the minimum norm square of the possible interpolant through {(xτ, zτ)}t τ=0 as shown in (Wendland, 2004). It can be checked that the maximization problem in Eq. (22) has a concave objective (as shown in Appendix A) with a convex feasible set. Thus, the problem in Eq. (22) is tractable via convex optimization. Generating new sample point. On the line 4 of Alg. 1, a bi-level optimization problem needs to be solved, where the inner-level part has an infinite-dimensional function variable. The inner optimization problem has the form, max f f(x) f(xt) subject to f Bf, ℓt( f) ℓt( ˆf MLE t ) β1(ϵ, δ, t), where β1(ϵ, δ, t) is as given in Thm. 3.1. Similar to the representer theorem, we have, Lemma 4.1. Prob. (23) can be equivalently reduced to, max Z0:t Rt+1,z R z zt subject to Z0:t z ℓ(Z0:t|Dt) ℓt( ˆf MLE t ) β1(ϵ, δ, t), (24) Principled Preferential Bayesian Optimization K0:t,x = K0:t (k(xτ, x))t τ=0 (k(xτ, x))t τ=0 k(x, x) Similarly, it can be checked that the Prob. (24) is convex. For low-dimensional x, the outer-level problem can be solved via grid search. For medium-dimensional problems, we can optimize the inner/outer variables using a gradientbased/zero-order optimization method. Alternatively, we can jointly optimize x, Z0:t, and z by a nonlinear programming solver from multiple random initial conditions. That is, we add x as another optimization variable as shown in the Prob. (26), max Z0:t Rt+1,z R,x X z zt subject to Constraints of Prob. (24). (26) More details on this joint optimization approach is in Appendix H. Remark 4.2. We add a matrix ϵKI to K0:t and K0:t,x before inversion to avoid numerical issue, where ϵK > 0 is small. Remark 4.3. In this paper, we mainly consider the setting where in each step, the preference is queried over two candidate points. Our Alg. 1 and the efficient computation schemes in this section can be easily extended to multiplechoice setting, where in each step, the best or most preferred point is queried over a batch of candidates. The detailed discussion is in Appendix I. 5. Theoretical Analysis We first introduce the performance metrics to use. As in the scalar Bayesian optimization setting ((Srinivas et al., 2012)), cumulative regret is used as defined in Eq. 
(27), t=1 (f(x ) f(xt)) , (27) where x arg maxx X f(x). Remark 5.1. The cumulative regret RT as defined in Eq. (27) does not explicitly consider the sub-optimality of the reference point x t. However, since x t = xt 1, the cumulative regret of the reference points is the same as RT in Eq. (27), up to the difference of the first/last term. Cumulative regret is of interest in the online setting. In the offline optimization setting, it is of more interest to analyze the sub-optimality of the final reported solution, i.e., f(x ) f(xt ), (28) where xt is the final reported solution as defined in Eq. (19). 5.1. Regret Bound and Convergence Rate Theorem 5.2 (Cumulative Regret Bound). With probability at least 1 δ, the cumulative regret of Alg. 1 satisfies, βT γff T T , (29) βT = β(1/T, δ, T) = O T log TN(Bf, 1/T, ) Remark 5.3 (Differentiate from GP-UCB regret). Our bound has a similar form as compared to the well-known regret bound for standard GP-UCB type algorithms (Srinivas et al., 2012; Chowdhury & Gopalan, 2017a). However, the βT term here is significantly different from that in the existing literature (e.g., in Thm. 3 in (Srinivas et al., 2012)). It is derived specifically for the preferential BO and will lead to a bit larger bound for specific kernels in Sec. 5.2. We highlight that Thm. 5.2 provides the first-of-its-kind information-theoretic bound on the cumulative regret of preferential BO, which further allows us to derive a convergence rate for the reported solution xt in Thm. 5.4. Theorem 5.4 (Convergence Guarantee). Let t be defined as in Eq. (19). With probability at least 1 δ, f(x ) f(xt ) O Thm 5.4 highlights that by minimizing the known term 2 2B +λ 1/2 q 2, t) σff t ((xt, x t)), the reported final solution xt has a guaranteed convergence rate. 5.2. Kernel-Specific Bounds and Rates In this section, we show kernel-specific bounds for the regret and convergence rate for the reported solution. The explicit forms of the considered kernels are given in Appendix L. Theorem 5.5 (Kernel-Specific Regret Bounds). Setting ϵ = 1/T and running our POP-BO algorithm in Alg. 1, 1. If k(x, y) = x, y , we have, RT = O T 3/4(log T) 3/4 . (31) 2. If k(x, y) is a squared exponential kernel, we have, RT = O T 3/4(log T) 3/4(d+1) . (32) 3. If k(x, y) is a Mat ern kernel, we have, RT = O T 3/4(log T) 3/4T d ν 1 4 + d+1 4+2(d+1)d/ν Principled Preferential Bayesian Optimization where ν is the smooth parameter of the Mat ern kernel that is assumed to be large enough such that ν > d 4(3 + d + d2 + 14d + 17) = Θ(d2). Remark 5.6 (Comparison to GP-UCB with Scalar Feedback). Interestingly, as compared to the kernel-specific bounds in the scalar evaluation-based optimization (Fig. 1 in (Srinivas et al., 2012)), the regret bound of preferential Bayesian optimization approximately has an additional factor of T 1/4. This is reasonable since intuitively, scalar evaluation can imply preference, but not vice versa. Therefore, preference feedback contains less information and thus may suffer from higher regret. Fig. 2 in Sec. 6.1 and Fig. 4 in Appendix N empirically verify our bounds here. We then derive the kernel-specific convergence rates for the reported solution xt , as shown in Tab. 3 in the Appendix O. 6. Experimental Results In this section, we compare our method to the state-of-theart preferential BO methods on sampled instances from Gaussian process, standard test functions, and a thermal comfort optimization problem. The comparison outcome is sampled as assumed in Assump. 2.5. 
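To illustrate the experimental setup just described, the snippet below draws one test instance from a Gaussian process prior with a squared exponential kernel on a one-dimensional grid and generates preference outcomes according to Assumption 2.5. It is a simplified NumPy sketch under our own choices (grid representation, seed), not the GPy/CasADi implementation described next; the kernel hyperparameters (variance 9.0, lengthscale 1.0) follow Sec. 6.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# grid over a 1-D domain used to represent one draw from the GP prior
X = np.linspace(0.0, 10.0, 200)

# squared exponential kernel matrix with variance 9.0 and lengthscale 1.0 (Sec. 6.1)
sqdist = (X[:, None] - X[None, :]) ** 2
K = 9.0 * np.exp(-0.5 * sqdist)

# one sampled instance f ~ GP(0, K), evaluated on the grid
f_grid = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))

def compare(i, j):
    # Bernoulli preference feedback with P(x_i preferred) = sigma(f(x_i) - f(x_j))  (Assump. 2.5)
    p = 1.0 / (1.0 + np.exp(-(f_grid[i] - f_grid[j])))
    return int(rng.random() < p)

# a few random duels on the grid; any of the compared methods is driven by such outcomes
duels = [(int(rng.integers(len(X))), int(rng.integers(len(X)))) for _ in range(5)]
feedback = [compare(i, j) for i, j in duels]
print(feedback)
```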
We implement our algorithm based on the Gaussian process package GPy (GPy, since 2012). The optimization problems for the MLE and for generating new samples are formulated and solved using CasADi (Andersson et al., 2019) and Ipopt (Wächter & Biegler, 2006). We compare our method to three baselines: dueling Thompson sampling (González et al., 2017), skew-GP based preferential BO (Takeno et al., 2023), and qEUBO (Astudillo et al., 2023). The dueling Thompson sampling method derives the next pair to compare by maximizing the soft-Copeland score. The skew-GP based method applies standard BO algorithms conditioned on Thompson sampling results at the historical sample points that are consistent with the historical preference feedback. The qEUBO method uses the expected utility of the best option as an acquisition function. More experimental details and results on thermal comfort optimization are given in Appendix P.

6.1. Sampled Instances from Gaussian Process

In this section, we sample the black-box function f from a Gaussian process with the squared exponential kernel shown in Appendix L, with variance parameter 9.0 and lengthscale 1.0. We sampled 30 instances in total. Fig. 2 shows the performance comparison with the baselines. Our method achieves the lowest sublinear growth in cumulative regret. It also achieves better or competitive convergence speed for the reported solution as compared to the DTS method, while outperforming SGP. Moreover, our method uses less than 10% of the computation time of DTS, as shown in Tab. 1. The SGP method gets stuck in local optima because it overly trusts the random preference feedback (a hard constraint when doing Thompson sampling). Although the qEUBO method performs slightly better in the reported solution, it suffers from more than 2.5 times the cumulative regret of ours. Similar to qEUBO (which reports the posterior mean maximizer), we can report the maximizer of the minimum-norm MLE (POP-BO max-MLE in Fig. 2) instead of the solution in Eq. (19), which achieves faster convergence than qEUBO.

Figure 2. Cumulative regret and the suboptimality of the reported solution, where the shaded areas represent 0.1 standard deviation. qEUBO represents the method in (Astudillo et al., 2023), which reports the solution that maximizes the expected objective value conditioned on the historical samples. SGP represents the skew-GP based method (Takeno et al., 2023), which reports the first point of the duel proposed by the algorithm in the last step. DTS represents the dueling Thompson sampling method in (González et al., 2017), which reports the Condorcet winner.

Table 1. Computation time normalized against the DTS method.

| DTS | qEUBO | SGP | POP-BO (ours) |
|-----|-------|-----|---------------|
| 1.0 | 0.21  | 0.07 | 0.09 |

6.2. Test Function Optimization

In this section, we evaluate our method on several well-known global optimization test functions (Dixon, 1978; Molga & Smutnicki, 2005), each normalized by the standard deviation of its samples over a grid. We run our method multiple times from different random initial points. Tab. 2 shows that POP-BO consistently finds better or comparable solutions as compared to the other baselines.

Table 2. Suboptimality of the final reported solution after 30 steps. The results (mean ± standard deviation) are taken over 30 runs with random starting points.

| Problem | DTS | qEUBO | SGP | POP-BO (ours) |
|---------|-----|-------|-----|---------------|
| Beale | 0.84 ± 0.52 | 0.15 ± 0.52 | 0.10 ± 0.19 | 0.008 ± 0.025 |
| Branin | 1.35 ± 1.16 | 0.71 ± 1.16 | 2.20 ± 0.81 | 0.31 ± 0.29 |
| Bukin | 1.45 ± 1.13 | 0.59 ± 1.20 | 1.27 ± 0.80 | 0.92 ± 0.54 |
| Cross-in-Tray | 1.56 ± 1.39 | 2.03 ± 1.82 | 1.79 ± 1.49 | 1.38 ± 0.97 |
| Eggholder | 3.08 ± 0.55 | 3.11 ± 0.55 | 1.87 ± 0.94 | 1.83 ± 0.96 |
| Holder Table | 3.21 ± 1.38 | 3.20 ± 1.38 | 1.56 ± 1.62 | 1.22 ± 1.01 |
| Levy13 | 2.36 ± 1.22 | 1.06 ± 1.22 | 1.29 ± 1.00 | 0.35 ± 0.31 |

6.3. Scalability to Higher Dimension

To demonstrate the computational scalability of our joint optimization approach (Prob. (26)), we consider a set of higher-dimensional problems. Due to space limitations, we show the results for the optimization of a 12-dimensional black-box function sampled from a Gaussian process with a squared exponential kernel; more results can be found in Appendix P.1 and Appendix P.2. The optimization domain is set to [0, 10]^12. We run 10 randomly sampled instances for 100 steps. The average update time per step is only 18.0 seconds on a personal computer with one Intel64 Family 6 Model 142 Stepping 12 Genuine Intel 1803 MHz processor and 16.0 GB RAM. This is comparatively very small, considering that each query to the comparison oracle can be very expensive in practice (e.g., heating a room up to a certain temperature to evaluate occupant comfort, which may take tens of minutes). We compare our method to the SGP baseline, which is one of the state-of-the-art computationally practical preferential Bayesian optimization methods. Fig. 3 shows the cumulative regret (in log scale) and the suboptimality of the reported solution. Our algorithm still achieves sublinear regret growth and good convergence of the reported solution within 100 steps in this 12-dimensional problem. Fig. 3 also shows that POP-BO converges faster in this higher-dimensional problem and thus scales better than the SGP method.

Figure 3. Cumulative regret in log scale and the suboptimality of the reported solution in linear scale for a 12-dimensional problem sampled from a Gaussian process. For reference, a linear-growth curve is also plotted in the cumulative regret plot, and the shaded areas represent 0.2 standard deviation.

7. Conclusion and Future Work

In this paper, we have presented a principled optimistic preferential Bayesian optimization algorithm based on a likelihood-based confidence set, together with an efficient computational method to implement it. We further show an information-theoretic bound on the cumulative regret, a first-of-its-kind for preferential BO, and design a scheme to report an estimated optimal solution with a guaranteed convergence rate. Experimental results show that our method achieves better or competitive performance as compared to state-of-the-art heuristics, which, however, do not have theoretical guarantees on regret. Future work includes extensions to safety-critical problems (Berkenkamp et al., 2016; Guo et al., 2023) and game-theoretic settings. The likelihood-based confidence set and the error bound in Sec. 3 can also be applied to more scenarios with preference feedback.
Acknowledgements This research was supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, the Swiss Federal Office of Energy SFOE as part of the SWEET consortium SWICE, and in part by the Swiss Data Science Center, grant agreement C20-13. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Principled Preferential Bayesian Optimization Abdelrahman, M. M. and Miller, C. Targeting occupant feedback using digital twins: Adaptive spatial temporal thermal preference sampling to optimize personal comfort models. Building and Environment, 218:109090, 2022. Andersson, J. A., Gillis, J., Horn, G., Rawlings, J. B., and Diehl, M. Cas ADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1 36, 2019. Astudillo, R., Lin, Z. J., Bakshy, E., and Frazier, P. q EUBO: A decision-theoretic acquisition function for preferential Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1093 1114. PMLR, 2023. Berkenkamp, F., Schoellig, A. P., and Krause, A. Safe controller optimization for quadrotors with Gaussian processes. In 2016 IEEE international conference on robotics and automation (ICRA), pp. 491 496. IEEE, 2016. Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1952. Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011. Chowdhury, S. R. and Gopalan, A. On kernelized multiarmed bandits. In International Conference on Machine Learning, pp. 844 853. PMLR, 2017a. Chowdhury, S. R. and Gopalan, A. On kernelized multiarmed bandits. ar Xiv preprint ar Xiv:1704.00445, 2017b. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017. Christmann, A. and Hable, R. Consistency of support vector machines using additive kernels for additive models. Computational Statistics & Data Analysis, 56(4):854 873, 2012. Curi, S., Berkenkamp, F., and Krause, A. Efficient modelbased reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 33:14156 14170, 2020. Dixon, L. C. W. The global optimization problem: an introduction. Towards Global Optimiation 2, pp. 1 15, 1978. Dud ık, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. Contextual dueling bandits. In Conference on Learning Theory, pp. 563 587. PMLR, 2015. Edmunds, D. E. and Triebel, H. Function spaces, entropy numbers, differential operators, volume 120. Cambridge Univ Pr, 1996. Emmenegger, N., Mutny, M., and Krause, A. Likelihood ratio confidence sets for sequential decision making. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Fanger, P. O. et al. Thermal comfort. analysis and applications in environmental engineering. Thermal comfort. Analysis and applications in environmental engineering., 1970. Frazier, P. I. A tutorial on Bayesian optimization. ar Xiv preprint ar Xiv:1807.02811, 2018. Gajane, P., Urvoy, T., and Cl erot, F. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. 
In International Conference on Machine Learning, pp. 218 227. PMLR, 2015. Gonz alez, J., Dai, Z., Damianou, A., and Lawrence, N. D. Preferential Bayesian optimization. In International Conference on Machine Learning, pp. 1282 1291. PMLR, 2017. GPy. GPy: A Gaussian process framework in python. http://github.com/Sheffield ML/GPy, since 2012. Griffith, S., Subramanian, K., Scholz, J., Isbell, C. L., and Thomaz, A. L. Policy shaping: Integrating human feedback with reinforcement learning. Advances in Neural Information Processing Systems, 26, 2013. Guo, B., Jiang, Y., Kamgarpour, M., and Ferrari-Trecate, G. Safe zeroth-order convex optimization using quadratic local approximations. In 2023 European Control Conference (ECC), pp. 1 8. IEEE, 2023. Hiranaka, A., Hwang, M., Lee, S., Wang, C., Fei-Fei, L., Wu, J., and Zhang, R. Primitive skill-based robot learning from human evaluative feedback. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7817 7824. IEEE, 2023. Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455 492, 1998. Kahneman, D. and Tversky, A. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pp. 99 127. World Scientific, 2013. Principled Preferential Bayesian Optimization Kandasamy, K., Schneider, J., and P oczos, B. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pp. 295 304. PMLR, 2015. Koyama, Y., Sato, I., and Goto, M. Sequential gallery for interactive visual design optimization. ACM Transactions on Graphics (TOG), 39(4):88 1, 2020. Lalley, S. P. Concentration inequalities. Lecture notes, University of Chicago, 2013. Li, K., Tucker, M., Bıyık, E., Novoseller, E., Burdick, J. W., Sui, Y., Sadigh, D., Yue, Y., and Ames, A. D. ROIAL: Region of interest active learning for characterizing exoskeleton gait preference landscapes. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 3212 3218. IEEE, 2021. Lichtenstein, S. and Slovic, P. Reversals of preference between bids and choices in gambling decisions. Journal of experimental psychology, 89(1):46, 1971. Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363 376, 2023. Lyu, J., Shi, Y., Du, H., and Lian, Z. Sex-based thermal comfort zones and energy savings in spaces with joint operation of air conditioner and fan. Building and Environment, 246:111002, 2023. ISSN 0360-1323. doi: https://doi.org/10.1016/j.buildenv.2023.111002. URL https://www.sciencedirect.com/ science/article/pii/S0360132323010296. Maddalena, E. T., Scharnhorst, P., and Jones, C. N. Deterministic error bounds for kernel-based learning techniques under bounded noise. Automatica, 134:109896, 2021. Mehta, V., Neopane, O., Das, V., Lin, S., Schneider, J., and Neiswanger, W. Kernelized offline contextual dueling bandits. ar Xiv preprint ar Xiv:2307.11288, 2023. Mikkola, P., Todorovi c, M., J arvi, J., Rinke, P., and Kaski, S. Projective preferential Bayesian optimization. In International Conference on Machine Learning, pp. 6884 6892. PMLR, 2020. Molga, M. and Smutnicki, C. Test functions for optimization needs. Test functions for optimization needs, 101:48, 2005. Negoescu, D. M., Frazier, P. 
I., and Powell, W. B. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346 363, 2011. Newey, W. K. and Mc Fadden, D. Large sample estimation and hypothesis testing. Handbook of econometrics, 4: 2111 2245, 1994. Osband, I. and Van Roy, B. Model-based reinforcement learning and the Eluder dimension. Advances in Neural Information Processing Systems, 27, 2014. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744, 2022. Owen, A. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90 120, 1990. Pacchiano, A., Ball, P., Parker-Holder, J., Choromanski, K., and Roberts, S. Towards tractable optimism in modelbased reinforcement learning. In Uncertainty in Artificial Intelligence, pp. 1413 1423. PMLR, 2021. Saha, A. and Krishnamurthy, A. Efficient and optimal algorithms for contextual dueling bandits under realizability. In International Conference on Algorithmic Learning Theory, pp. 968 994. PMLR, 2022. Saha, A., Koren, T., and Mansour, Y. Dueling convex optimization. In International Conference on Machine Learning, pp. 9245 9254. PMLR, 2021. Sch olkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In International Conference on Computational Learning Theory, pp. 416 426. Springer, 2001. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148 175, 2015. Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Inf. Process. Syst., 25, 2012. Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250 3265, 2012. Sui, Y., Zhuang, V., Burdick, J. W., and Yue, Y. Multidueling bandits with dependent arms. ar Xiv preprint ar Xiv:1705.00253, 2017. Sui, Y., Burdick, J., Yue, Y., et al. Stage-wise safe Bayesian optimization with Gaussian processes. In Proc. of the Int. Conf. on Mach. Learn., pp. 4781 4789, 2018. Principled Preferential Bayesian Optimization Takeno, S., Nomura, M., and Karasuyama, M. Towards practical preferential Bayesian optimization with skew Gaussian processes. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 33516 33533, 2023. Tversky, A. and Kahneman, D. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science, 185 (4157):1124 1131, 1974. W achter, A. and Biegler, L. T. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25 57, 2006. Wang, Y., Liu, Q., and Jin, C. Is RLHF more difficult than standard RL? A theoretical perspective. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Warnell, G., Waytowich, N., Lawhern, V., and Stone, P. Deep TAMER: Interactive agent shaping in highdimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Wendland, H. Scattered data approximation, volume 17. Cambridge university press, 2004. Wu, C., Li, T., Zhang, Z., and Yu, Y. 
Bayesian optimistic optimization: Optimistic exploration for model-based reinforcement learning. Advances in Neural Information Processing Systems, 35:14210 14223, 2022. Wu, Y. Lecture notes on information-theoretic methods for high-dimensional statistics. Lecture Notes for ECE598YW (UIUC), 16, 2017. Xu, W., Jiang, Y., Maddalena, E. T., and Jones, C. N. Lower bounds on the worst-case complexity of efficient global optimization. ar Xiv preprint ar Xiv:2209.09655, 2022a. Xu, W., Jones, C. N., Svetozarevic, B., Laughman, C. R., and Chakrabarty, A. VABO: Violation-aware Bayesian optimization for closed-loop control performance optimization with unmodeled constraints. In 2022 American Control Conference (ACC), pp. 5288 5293. IEEE, 2022b. Xu, W., Jiang, Y., Svetozarevic, B., and Jones, C. Constrained efficient global optimization of expensive blackbox functions. In International Conference on Machine Learning, pp. 38485 38498. PMLR, 2023. Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201 1208, 2009. Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 1556, 2012. Zhang, H., Lee, S., and Tzempelikos, A. Bayesian meta-learning for personalized thermal comfort modeling. Building and Environment, 249:111129, February 2024. ISSN 03601323. doi: 10.1016/j.buildenv.2023. 111129. URL https://linkinghub.elsevier. com/retrieve/pii/S0360132323011563. Zhou, D.-X. The covering number in learning theory. Journal of Complexity, 18(3):739 767, 2002. Zhou, X. and Ji, B. On kernelized multi-armed bandits with constraints. Advances in Neural Information Processing Systems, 35, 2022. Zhu, B., Jordan, M., and Jiao, J. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 43037 43067, 23 29 Jul 2023. Principled Preferential Bayesian Optimization Without further notice, all the results shown in this appendix are under the assumptions 2.1, 2.2, 2.4, and 2.5. A. Preliminaries To prepare for the proofs of the main results shown in this paper, we first state several useful lemmas. Lemma A.1. The function ψ(y, y ) = log(ey + ey ) is convex in (y, y ). Proof. We calculate the Hessian of the function ψ and derive (ey + ey )2 Hence, ψ is convex. Therefore, we can see ℓ(Z0:t|Dt) is concave in Z0:t. Lemma A.2. f Bf, x X, f(x) [ B, B]. Proof. | f(x)| = | f, k(x, ) | f k(x, ) B p k(x, x) B, where the first inequality follows by Cauchy Schwarz inequality, the second inequality follows by Assump. 2.2, and the last inequality follows by Assump. 2.4. B. Properties of the Function σ( ) When applying the function σ to the difference of objective function f(x) f(x ), f Bf, we have the calculations by single variable calculus, u := f(x) f(x ) [ 2B, 2B], σ(u) [σ, σ], σ (u) = 1 2 + eu + e u [σ , σ ], where σ = 1/(1+e2B), σ = 1/(1+e 2B) and σ = 1/(2+e2B+e 2B), σ = 1/4. We also introduce some constants Bp = σ σ, Hσ = 1 2 σ2 and CL = 1 + 2 1+e 2B , which will be used in the proof. C. Proof of Thm. 3.1 To prepare for the proof of the theorem, we first prove several lemmas. Lemma C.1. For any fixed ˆf Bf, we have, P log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) r 32t B2 log 1 where f is the ground-truth function. Proof. We use yτ (y τ resp.) to denote f(xτ) (f(x τ) resp.). 
We use zτ (z τ resp.) to denote ˆf(xτ) ( ˆf(x τ) resp.). And we use pτ to denote σ(yτ y τ). P log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) ξ τ=1 ((zτ yτ)1τ + (z τ y τ)(1 1τ)) τ=1 log ezτ + ez τ + τ=1 log eyτ + ey τ ξ Principled Preferential Bayesian Optimization τ=1 ((zτ yτ)1τ + (z τ y τ)(1 1τ)) τ=1 ((zτ yτ)pτ + (z τ y τ)(1 pτ)) ξ ! where ξ = ξ + Pt τ=1 log ezτ + ez τ Pt τ=1 log eyτ + ey τ Pt τ=1 ((zτ yτ)pτ + (z τ y τ)(1 pτ)), and the probability P is taken with respect to the randomness from the comparison oracle and the randomness from the algorithm. It can be checked that ψτ(y, y ) := log ey + ey pτy (1 pτ)y is a convex function and ψτ(yτ, y τ) = 0. This implies that (yτ, y τ) achieves the minimum for the convex function ψτ. Therefore, log eyτ + ey τ pτyτ (1 pτ)y τ log ezτ + ez τ pτzτ (1 pτ)z τ. Rearrangement gives, log ezτ + ez τ log eyτ + ey τ ((zτ yτ)pτ + (z τ y τ)(1 pτ)) 0. Hence, ξ ξ. Therefore, P log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) ξ τ=1 ((zτ yτ)1τ + (z τ y τ)(1 1τ)) τ=1 ((zτ yτ)pτ + (z τ y τ)(1 pτ)) ξ ! τ=1 ((zτ yτ)1τ + (z τ y τ)(1 1τ)) τ=1 ((zτ yτ)pτ + (z τ y τ)(1 pτ)) ξ We further notice that E[((zτ yτ)1τ + (z τ y τ)(1 1τ)) ((zτ yτ)pτ + (z τ y τ)(1 pτ)) |Fτ 1] = 0, (36) and with probability one, |((zτ yτ)1τ + (z τ y τ)(1 1τ)) ((zτ yτ)pτ + (z τ y τ)(1 pτ))| = |(zτ yτ z τ + y τ)(1τ pτ)| 4B. (37) We can thus apply the Azuma-Hoeffding inequality (see, e.g., (Lalley, 2013)). By Azuma Hoeffding inequality, P log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) ξ τ=1 ((zτ yτ)1τ + (z τ y τ)(1 1τ)) τ=1 ((zτ yτ)pτ + (z τ y τ)(1 pτ)) ξ Set exp n ξ2 32t B2 o = δt. That is, ξ = q 32t B2 log 1 δt . We then get the desired result. We then have the following high probability confidence set lemma. Lemma C.2. For any fixed ˆf Bf that is independent of ((xτ, x τ, 1τ)t τ=1), we have, with probability at least 1 δ, log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) 32t B2 log π2t2 6δ , t 1. (38) Principled Preferential Bayesian Optimization Proof. We use Et to denote the event log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) q 32t B2 log 1 δt . We pick δt = (6δ)/(π2t2) and have, P log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) r 32t B2 log 1 t=1 (1 P (Et)) 1 P log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) r 32t B2 log 1 We then have a lemma to bound the difference of log likelihood when two functions are close in infinity-norm sense. Lemma C.3. There exists an independent constant CL > 0, such that, ϵ > 0, f1, f2 Bf that satisfies f1 f2 ϵ, we have, log Pf1((xτ, x τ, 1τ)t τ=1) log Pf2((xτ, x τ, 1τ)t τ=1) CLϵt. (39) Proof. We use zi,τ (z i,τ, resp.) to denote fi(xτ) (fi(x τ), resp.), i {0, 1}. log Pf1((xτ, x τ, 1τ)t τ=1) log Pf2((xτ, x τ, 1τ)t τ=1) (z1,τ z2,τ)1τ + (z 1,τ z 2,τ)(1 1τ) τ=1 log ez1,τ + ez 1,τ + τ=1 log ez2,τ + ez 2,τ (40) τ=1 max z,z [ B,B] z,z log ez + ez (z1,τ, z 1,τ) (z2,τ, z 2,τ) (41) = 1 + 2 1 + e 2B where the equality (40) follows by the definition of log-likelihood function, and the inequality (41) follows by the assumption and the mean-value theorem. The conclusion follows by setting CL = 1 + 2 1+e 2B . Main proof: We use N(Bf, ϵ, ) to denote the covering number of the set Bf, with (f ϵ i )N(Bf ,ϵ, ) i=1 be a set of ϵ-covering for the set Bf. Reset the δ in Lem. 
C.2 as δ/N(Bf ,ϵ, ) and applying the probability union bound, we have, Principled Preferential Bayesian Optimization with probability at least 1 δ, f ϵ i , t 1, log Pf ϵ i ((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) 32t B2 log π2t2N(Bf, ϵ, ) By the definition of ϵ-covering, there exists j [N(Bf, ϵ, )], such that, ˆf MLE t f ϵ j ϵ. (45) Hence, with probability at least 1 δ, log P ˆ f MLE t ((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) = log P ˆ f MLE t ((xτ, x τ, 1τ)t τ=1) log Pf ϵ j ((xτ, x τ, 1τ)t τ=1) + log Pf ϵ j ((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) 32t B2 log π2t2N(Bf, ϵ, ) where the inequality follows by Lem. C.3 and the inequality (44). D. Proof of Lem. 3.5 We first have a lemma. Lemma D.1. We have, log ˆp log p 1 p(ˆp p) Hσ(ˆp p)2, p, ˆp [σ, σ], (46) where Hσ = 1 2 σ2 . Proof. Let ζ(ˆp) = log ˆp log p 1 p(ˆp p) + Hσ(ˆp p)2, p, ˆp [σ, σ]. We have, p + 2Hσ(ˆp p) = (ˆp p) 1 , ˆp [σ, σ]. Since p, ˆp [σ, σ], we have 1 σ2 1 ˆpp 0. Hence, ζ (ˆp) 0, ˆp [σ, p] and ζ (ˆp) 0, ˆp [p, σ]. Therefore, ζ(ˆp) achieves the maximum over [σ, σ] at the point p. So ζ(ˆp) ζ(p) = 0. Rearrangement then gives the desired result. For any fixed function ˆf Bf, we use the notations ˆpτ = σ( ˆf(xτ) ˆf(x τ)) [σ, σ] and pτ = σ(f(xτ) f(x τ)) [σ, σ]. We have, log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) log p ˆ f(xτ, x τ, 1τ) log pf(xτ, x τ, 1τ) τ=1 (1τ (log ˆpτ log pτ) + (1 1τ) (log (1 ˆpτ) log(1 pτ))) . log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) τ=1 (1τ (log ˆpτ log pτ) + (1 1τ) (log (1 ˆpτ) log(1 pτ))) pτ Hσ (ˆpτ pτ)2 + (1 1τ) pτ ˆpτ 1 pτ Hσ (ˆpτ pτ)2 Principled Preferential Bayesian Optimization Rearrangement gives, τ=1 (ˆpτ pτ)2 + log P ˆ f((xτ, x τ, 1τ)t τ=1) log Pf((xτ, x τ, 1τ)t τ=1) pτ + (1 1τ)pτ ˆpτ (47) We then have the following lemma, Lemma D.2. For any fixed ˆf Bf and t 1, we have, with probability at least 1 δt, τ=1 (ˆpτ pτ)2 log Pf((xτ, x τ, 1τ)t τ=1) log P ˆ f((xτ, x τ, 1τ)t τ=1) + r 2t B2p log 1 Proof. Since E h 1τ ˆpτ pτ pτ + (1 1τ) pτ ˆpτ 1 pτ |Fτ 1 i = E h pτ ˆpτ pτ pτ + (1 pτ) pτ ˆpτ 1 pτ |Fτ 1 i = 0 and with probability one, 1τ ˆpτ pτ pτ + (1 1τ)pτ ˆpτ + (1 1τ) pτ ˆpτ ˆpτ pτ 1 + (1 1τ) 1 ˆpτ 1 pτ 1 (50) σ = Bp, (51) where the last inequality follows by that ˆpτ, pτ, 1 ˆpτ, 1 pτ [σ, σ]. Thus we can apply the Azuma Hoeffding inequality. By Azuma Hoeffding inequality, we have, pτ + (1 1τ)pτ ˆpτ We set exp n ξ2 o = δt, and derive pτ + (1 1τ)pτ ˆpτ 2t B2p log 1 Combining the inequality (47) and the inequality (53), the desired result is derived. Lemma D.3. For any fixed ˆf Bf, we have, with probability at least 1 δ, τ=1 (ˆpτ pτ)2 log Pf((xτ, x τ, 1τ)t τ=1) log P ˆ f((xτ, x τ, 1τ)t τ=1) + 2t B2p log π2t2 6δ , t 1. (54) Proof. We use Et 5 to denote the event Hσ Pt τ=1 (ˆpτ pτ)2 log Pf((xτ, x τ, 1τ)t τ=1) log P ˆ f((xτ, x τ, 1τ)t τ=1) + q 2t B2p log 1 δt and pick δt = (6δ)/(π2t2). We have, τ=1 (ˆpτ pτ)2 log Pf((xτ, x τ, 1τ)t τ=1) log P ˆ f((xτ, x τ, 1τ)t τ=1) + r 2t B2p log 1 5With abuse of notation here. Et is only a local notation in this proof here. Principled Preferential Bayesian Optimization t=1 (1 P (Et)) τ=1 (ˆpτ pτ)2 log Pf((xτ, x τ, 1τ)t τ=1) log P ˆ f((xτ, x τ, 1τ)t τ=1) + r 2t B2p log 1 Main Proof: Resetting the δ in Lem. D.3 to be δ/N(Bf ,ϵ, ), we can guarantee the Eq. (54) holds for all the function in an ϵ-covering of Bf with probability at least 1 δ, by applying the probability union bound. For any ˆft+1 Bt+1 f Bf, there exists a function in the ϵ-covering of Bf, which we set to be ˆf, such that ˆft+1 ˆf ϵ. 
We also use $\hat p^{t+1}_\tau$ to denote $\sigma(\hat f_{t+1}(x_\tau) - \hat f_{t+1}(x'_\tau))$. Thus, we have
$$\begin{aligned}
&H_{\bar\sigma}\sum_{\tau=1}^t\big(\hat p^{t+1}_\tau - p_\tau\big)^2 \qquad&(55)\\
&\le 2H_{\bar\sigma}\sum_{\tau=1}^t\big(\hat p^{t+1}_\tau - \hat p_\tau\big)^2 + 2H_{\bar\sigma}\sum_{\tau=1}^t(\hat p_\tau - p_\tau)^2 \qquad&(56)\\
&\le 2H_{\bar\sigma}\bar\sigma'^2\sum_{\tau=1}^t\Big(\big(\hat f_{t+1}(x_\tau) - \hat f_{t+1}(x'_\tau)\big) - \big(\hat f(x_\tau) - \hat f(x'_\tau)\big)\Big)^2 + 2H_{\bar\sigma}\sum_{\tau=1}^t(\hat p_\tau - p_\tau)^2 \qquad&(57)\\
&\le 8H_{\bar\sigma}\bar\sigma'^2\epsilon^2 t + 2H_{\bar\sigma}\sum_{\tau=1}^t(\hat p_\tau - p_\tau)^2 \qquad&(58)\\
&\le 8H_{\bar\sigma}\bar\sigma'^2\epsilon^2 t + \sqrt{8tB_p^2\log\frac{\pi^2 t^2 N(\mathcal{B}_f, \epsilon, \|\cdot\|_\infty)}{6\delta}} + 2\big(\log P_{f}(\mathcal{D}_t) - \log P_{\hat f}(\mathcal{D}_t)\big) \qquad&(59)\\
&\le C(\epsilon, \delta, t) + 2\big(\log P_{\hat f^{\mathrm{MLE}}_t}(\mathcal{D}_t) - \log P_{\hat f_{t+1}}(\mathcal{D}_t)\big) + 2\big(\log P_{\hat f_{t+1}}(\mathcal{D}_t) - \log P_{\hat f}(\mathcal{D}_t)\big) \qquad&(60)\\
&\le C(\epsilon, \delta, t) + 2C_L\epsilon t + 2\beta_1(\epsilon, \delta, t) \qquad&(61)\\
&= \beta_2(\epsilon, \delta, t) + 2\beta_1(\epsilon, \delta, t), \qquad&(62)
\end{aligned}$$
where $C(\epsilon, \delta, t) = 8H_{\bar\sigma}\bar\sigma'^2\epsilon^2 t + \sqrt{8tB_p^2\log\frac{\pi^2 t^2 N(\mathcal{B}_f, \epsilon, \|\cdot\|_\infty)}{6\delta}}$ and $\beta_2(\epsilon, \delta, t) = C(\epsilon, \delta, t) + 2C_L\epsilon t$. The inequality (56) follows by the fact that $(a+b)^2 \le 2a^2 + 2b^2$, $\forall a, b \in \mathbb{R}$. The inequality (58) follows because $\|\hat f_{t+1} - \hat f\|_\infty \le \epsilon$. The inequality (59) follows by Lem. D.3 (with the reset $\delta$). The inequality (60) follows from $\log P_{\hat f^{\mathrm{MLE}}_t}(\mathcal{D}_t) \ge \log P_{f}(\mathcal{D}_t)$. The inequality (61) follows by the fact that $\hat f_{t+1} \in \mathcal{B}^{t+1}_f$ and Lem. C.3. Furthermore,
$$\sum_{\tau=1}^t\big(\hat p^{t+1}_\tau - p_\tau\big)^2 = \sum_{\tau=1}^t\Big(\sigma\big(\hat f_{t+1}(x_\tau) - \hat f_{t+1}(x'_\tau)\big) - \sigma\big(f(x_\tau) - f(x'_\tau)\big)\Big)^2 \qquad (63)$$
$$\ge \underline\sigma'^2\sum_{\tau=1}^t\Big(\big(\hat f_{t+1}(x_\tau) - \hat f_{t+1}(x'_\tau)\big) - \big(f(x_\tau) - f(x'_\tau)\big)\Big)^2, \qquad (64)$$
where the inequality follows by the mean value theorem (here $\bar\sigma'$ and $\underline\sigma'$ denote upper and lower bounds on the derivative of $\sigma$ over the relevant range). The conclusion then follows.
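To make the confidence-set construction of Appendices C and D concrete, the following is a minimal numerical sketch (not the paper's implementation): it computes the preference log-likelihood of a candidate function on the observed duels under a logistic link and applies the likelihood-ratio membership test $\ell_t(\hat f) \ge \ell_t(\hat f^{\mathrm{MLE}}_t) - \beta_1$, with a candidate represented only by its values at the queried points. The helper names, the reference values standing in for the MLE, and the toy data are illustrative assumptions.

```python
import numpy as np

def pref_loglik(z, z_prime, outcomes):
    """Preference log-likelihood sum_tau [1_tau log sigma(z_tau - z'_tau)
    + (1 - 1_tau) log sigma(z'_tau - z_tau)] under a logistic link sigma."""
    diff = np.asarray(z, dtype=float) - np.asarray(z_prime, dtype=float)
    p = np.clip(1.0 / (1.0 + np.exp(-diff)), 1e-12, 1 - 1e-12)
    o = np.asarray(outcomes)
    return float(np.sum(o * np.log(p) + (1 - o) * np.log(1 - p)))

def in_confidence_set(z_cand, zp_cand, z_ref, zp_ref, outcomes, beta1):
    """Likelihood-ratio test: keep the candidate if its log-likelihood is within
    beta1 of the reference (e.g., maximum-likelihood) values."""
    return pref_loglik(z_cand, zp_cand, outcomes) >= pref_loglik(z_ref, zp_ref, outcomes) - beta1

# toy usage with made-up duel values and simulated outcomes
rng = np.random.default_rng(0)
z_true, zp_true = rng.normal(size=20), rng.normal(size=20)
outcomes = rng.binomial(1, 1.0 / (1.0 + np.exp(-(z_true - zp_true))))
print(in_confidence_set(z_true + 0.1, zp_true, z_true, zp_true, outcomes, beta1=5.0))
```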
E. Proof of Thm. 3.6

Before we proceed to prove Thm. 3.6, we first conduct a black-box analysis in Sec. E.1 to bound the pointwise error for a generic RKHS with a generic learning scheme, which we think can be of independent interest.

E.1. Black-box Analysis on the Pointwise Inference Error in a Generic RKHS

Suppose we have a generic RKHS $\bar{\mathcal{H}}$ with a generic positive semidefinite kernel function $\bar k(\cdot, \cdot)$. After obtaining some information (preference information in this paper) on a sequence $\bar x_1, \bar x_2, \ldots, \bar x_{t-1}$, a learning scheme outputs a learnt uncertainty set,
$$\mathcal{S}_t = \Big\{\hat h \in \bar{\mathcal{B}}\ \Big|\ \sum_{\tau=1}^{t-1}\big(\hat h(\bar x_\tau) - \bar h(\bar x_\tau)\big)^2 \le \bar\beta_t\Big\}, \qquad (65)$$
where $\bar{\mathcal{B}}$ is a function-space ball with radius $\bar B$ in $\bar{\mathcal{H}}$, $\bar h \in \bar{\mathcal{B}}$ is the ground-truth function, and $\bar\beta_t$ quantifies the size of this confidence set. Let $\bar{\mathcal{X}}$ denote the function input set, which is assumed to be compact. We introduce the function
$$\bar\sigma^2_t(\bar x) = \bar k(\bar x, \bar x) - \bar k(\bar x_{1:t-1}, \bar x)^\top\big(\bar K_{t-1} + \lambda I\big)^{-1}\bar k(\bar x_{1:t-1}, \bar x), \qquad (66)$$
where $\lambda$ is a positive constant and $\bar K_{t-1} = (\bar k(\bar x_i, \bar x_j))_{i\in[t-1], j\in[t-1]}$. We have the following theorem.

Theorem E.1. $\forall\hat h \in \mathcal{S}_{t+1}$, $\forall\bar x \in \bar{\mathcal{X}}$, we have
$$\big|\hat h(\bar x) - \bar h(\bar x)\big| \le 2\big(\bar B + \lambda^{-1/2}\bar\beta^{1/2}_{t+1}\big)\bar\sigma_{t+1}(\bar x). \qquad (67)$$

Proof. For simplicity, we use $\phi(\bar x)$ to denote the function $\bar k(\bar x, \cdot)$, where $\phi : \mathbb{R}^{\bar d} \to \bar{\mathcal{H}}$ maps a finite-dimensional point $\bar x \in \mathbb{R}^{\bar d}$ to the RKHS $\bar{\mathcal{H}}$, and we use $h_1^\top h_2$ to denote the inner product of two functions $h_1, h_2$ from the RKHS $\bar{\mathcal{H}}$. Therefore, $\bar h(\bar x) = \langle\bar h, \bar k(\bar x, \cdot)\rangle_{\bar k} = \bar h^\top\phi(\bar x)$ and $\bar k(\bar x, \bar x') = \langle\bar k(\bar x, \cdot), \bar k(\bar x', \cdot)\rangle = \phi(\bar x)^\top\phi(\bar x')$, $\forall\bar x, \bar x' \in \bar{\mathcal{X}}$. We introduce the feature map $\Phi_t := [\phi(\bar x_1), \ldots, \phi(\bar x_t)]^\top$; we then get the kernel matrix $\bar K_t = \Phi_t\Phi_t^\top = (\bar k(\bar x_i, \bar x_j))_{i,j\in[t]}$, $\bar k_t(\bar x) = \Phi_t\phi(\bar x) = (\bar k(\bar x, \bar x_i))_{i\in[t]}$ for all $\bar x \in \bar{\mathcal{X}}$, and $\bar h_{1:t} = \Phi_t\bar h$. Note that when the Hilbert space $\bar{\mathcal{H}}$ is finite-dimensional, $\Phi_t$ is interpreted as an ordinary finite-dimensional matrix. In the more general setting where $\bar{\mathcal{H}}$ can be an infinite-dimensional space, $\Phi_t$ is the evaluation operator $\bar{\mathcal{H}} \to \mathbb{R}^t$ defined by $\Phi_t h := [h(\bar x_1), \ldots, h(\bar x_t)]^\top$, $\forall h \in \bar{\mathcal{H}}$, with $\Phi_t^\top : \mathbb{R}^t \to \bar{\mathcal{H}}$ its adjoint operator. For the simplicity of notation, we abuse the notation $I$ to denote the identity mapping in both the RKHS $\bar{\mathcal{H}}$ and $\mathbb{R}^t$; the specific meaning of $I$ depends on the context.

Since the matrices/operators $(\Phi_t^\top\Phi_t + \lambda I)$ and $(\Phi_t\Phi_t^\top + \lambda I)$ are strictly positive definite and $(\Phi_t^\top\Phi_t + \lambda I)\Phi_t^\top = \Phi_t^\top(\Phi_t\Phi_t^\top + \lambda I)$, we have
$$\Phi_t^\top\big(\Phi_t\Phi_t^\top + \lambda I\big)^{-1} = \big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\Phi_t^\top. \qquad (68)$$
Also, from the definitions above, $(\Phi_t^\top\Phi_t + \lambda I)\phi(\bar x) = \Phi_t^\top\bar k_t(\bar x) + \lambda\phi(\bar x)$, and thus $\phi(\bar x) = (\Phi_t^\top\Phi_t + \lambda I)^{-1}\Phi_t^\top\bar k_t(\bar x) + \lambda(\Phi_t^\top\Phi_t + \lambda I)^{-1}\phi(\bar x)$. Hence, from Eq. (68) we deduce that
$$\phi(\bar x) = \Phi_t^\top\big(\Phi_t\Phi_t^\top + \lambda I\big)^{-1}\bar k_t(\bar x) + \lambda\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x), \qquad (69)$$
which gives
$$\phi(\bar x)^\top\phi(\bar x) = \bar k_t(\bar x)^\top\big(\Phi_t\Phi_t^\top + \lambda I\big)^{-1}\bar k_t(\bar x) + \lambda\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x), \qquad (70)$$
by multiplying both sides of Eq. (69) with $\phi(\bar x)^\top$. This implies
$$\lambda\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x) = \bar k(\bar x, \bar x) - \bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\bar k_t(\bar x) = \bar\sigma^2_{t+1}(\bar x), \qquad (71)$$
where the second equality follows by the definition of $\bar\sigma_{t+1}(\bar x)$. Now observe that, $\forall h \in \bar{\mathcal{B}}$,
$$\begin{aligned}
&\big|h(\bar x) - \bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}h_{1:t}\big| \qquad&(72)\\
&= \big|\phi(\bar x)^\top h - \phi(\bar x)^\top\Phi_t^\top\big(\Phi_t\Phi_t^\top + \lambda I\big)^{-1}\Phi_t h\big| \qquad&(73)\\
&= \big|\phi(\bar x)^\top h - \phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\Phi_t^\top\Phi_t h\big| \qquad&(74)\\
&= \big|\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\big(\Phi_t^\top\Phi_t + \lambda I\big)h - \phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\Phi_t^\top\Phi_t h\big| \qquad&(75)\\
&= \big|\lambda\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}h\big| \qquad&(76)\\
&\le \big\|\lambda\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x)\big\|_{\bar k}\,\|h\|_{\bar k} \qquad&(77)\\
&= \|h\|_{\bar k}\sqrt{\lambda\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\lambda I\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x)} \qquad&(78)\\
&\le \bar B\sqrt{\lambda\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\big(\Phi_t^\top\Phi_t + \lambda I\big)\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x)} \qquad&(79)\\
&= \bar B\,\bar\sigma_{t+1}(\bar x), \qquad&(80)
\end{aligned}$$
where the equality (74) uses Eq. (68), the inequality (77) is by Cauchy-Schwarz, the inequality (79) follows by the assumption that $\|h\|_{\bar k} \le \bar B$ and that $\Phi_t^\top\Phi_t$ is positive semidefinite, and the equality (80) is from Eq. (71). We define $\Delta_{1:t} = \hat h_{1:t} - \bar h_{1:t}$. Then
$$\begin{aligned}
&\big|\bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\Delta_{1:t}\big| \qquad&(81)\\
&= \big|\phi(\bar x)^\top\Phi_t^\top\big(\Phi_t\Phi_t^\top + \lambda I\big)^{-1}\Delta_{1:t}\big| \qquad&(82)\\
&= \big|\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\Phi_t^\top\Delta_{1:t}\big| \qquad&(83)\\
&\le \big\|\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1/2}\phi(\bar x)\big\|_{\bar k}\,\big\|\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1/2}\Phi_t^\top\Delta_{1:t}\big\|_{\bar k} \qquad&(84)\\
&= \sqrt{\phi(\bar x)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\phi(\bar x)}\sqrt{\big(\Phi_t^\top\Delta_{1:t}\big)^\top\big(\Phi_t^\top\Phi_t + \lambda I\big)^{-1}\Phi_t^\top\Delta_{1:t}} \qquad&(85)\\
&= \lambda^{-1/2}\bar\sigma_{t+1}(\bar x)\sqrt{\Delta_{1:t}^\top\Phi_t\Phi_t^\top\big(\Phi_t\Phi_t^\top + \lambda I\big)^{-1}\Delta_{1:t}} \qquad&(86)\\
&= \lambda^{-1/2}\bar\sigma_{t+1}(\bar x)\sqrt{\Delta_{1:t}^\top\bar K_t\big(\bar K_t + \lambda I\big)^{-1}\Delta_{1:t}} \qquad&(87)\\
&\le \lambda^{-1/2}\bar\sigma_{t+1}(\bar x)\sqrt{\Delta_{1:t}^\top\Delta_{1:t}} \qquad&(88)\\
&\le \lambda^{-1/2}\bar\beta^{1/2}_{t+1}\bar\sigma_{t+1}(\bar x), \qquad&(89)
\end{aligned}$$
where the equality (83) is from Eq. (68), the inequality (84) is by Cauchy-Schwarz, and the equality (86) uses both Eq. (68) and Eq. (71). We can finally derive
$$\begin{aligned}
&\big|\hat h(\bar x) - \bar h(\bar x)\big| \qquad&(90)\\
&= \Big|\bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\big(\hat h_{1:t} - \bar h_{1:t}\big) - \Big(\bar h(\bar x) - \bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\bar h_{1:t}\Big) + \Big(\hat h(\bar x) - \bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\hat h_{1:t}\Big)\Big| \qquad&(91)\\
&\le \Big|\bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\big(\hat h_{1:t} - \bar h_{1:t}\big)\Big| + \Big|\bar h(\bar x) - \bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\bar h_{1:t}\Big| + \Big|\hat h(\bar x) - \bar k_t(\bar x)^\top\big(\bar K_t + \lambda I\big)^{-1}\hat h_{1:t}\Big| \qquad&(92)\\
&\le \big(2\bar B + \lambda^{-1/2}\bar\beta^{1/2}_{t+1}\big)\bar\sigma_{t+1}(\bar x), \qquad&(93)
\end{aligned}$$
where the equality (91) follows by splitting, the inequality (92) follows by the triangle inequality, and the last inequality follows by combining the inequality (80) (applied to both $\hat h$ and $\bar h$) and the inequality (89). The conclusion then follows.

Remark E.2. The proof idea is inspired by the proof of Thm. 2 in (Chowdhury & Gopalan, 2017b).

E.2. Main Proof of Thm. 3.6

We set the generic RKHS $\bar{\mathcal{H}}$ to be the augmented RKHS with the additive kernel function $k^{ff}$, the function-space ball to be $\mathcal{B}^{ff}$ with $\bar B = 2B$, and the confidence set as
$$\Big\{\tilde f(x) - \tilde f(x')\ \Big|\ \tilde f \in \mathcal{B}_f,\ \sum_{\tau=1}^{t-1}\Big(\big(\tilde f(x_\tau) - \tilde f(x'_\tau)\big) - \big(f(x_\tau) - f(x'_\tau)\big)\Big)^2 \le \beta(\epsilon, \delta/2, t-1)\Big\}.$$
The desired result then follows by applying Thm. E.1.
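As a numerical companion to Eq. (66) and Thm. E.1, the sketch below computes $\bar\sigma_{t+1}(\bar x)$ from a kernel matrix and evaluates the resulting pointwise error bound; the squared-exponential kernel and the values of $\bar B$, $\lambda$, and $\bar\beta_{t+1}$ are placeholder assumptions, not quantities prescribed by the paper.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    # placeholder kernel choice; any positive semidefinite kernel works here
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

def sigma_bar(x_query, X_data, kernel, lam):
    """sigma_{t+1}^2(x) = k(x, x) - k_t(x)^T (K_t + lam I)^{-1} k_t(x), cf. Eq. (66)."""
    K = kernel(X_data, X_data)
    kx = kernel(X_data, x_query[None, :])[:, 0]
    kxx = kernel(x_query[None, :], x_query[None, :])[0, 0]
    var = kxx - kx @ np.linalg.solve(K + lam * np.eye(len(X_data)), kx)
    return np.sqrt(max(var, 0.0))

def pointwise_bound(x_query, X_data, kernel, lam, B_bar, beta_bar):
    """Right-hand side of Thm. E.1: 2 (B_bar + lam^{-1/2} beta_bar^{1/2}) sigma_{t+1}(x)."""
    return 2.0 * (B_bar + np.sqrt(beta_bar / lam)) * sigma_bar(x_query, X_data, kernel, lam)

# toy usage on random data points
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(15, 2))
print(pointwise_bound(np.array([0.3, 0.7]), X, se_kernel, lam=1.0, B_bar=2.0, beta_bar=4.0))
```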
F. Proof of Lem. 4.1

It suffices to prove that for any feasible solution of Prob. (23) we can find a corresponding feasible solution of Prob. (24) with the same objective value, and that the inverse direction also holds.

1. In this part, we first show that for any feasible solution of Prob. (23), we can find a corresponding feasible solution of Prob. (24) with the same objective value. Let $\tilde f$ be a feasible solution of Prob. (23). We construct $\tilde Z_{0:t} = (\tilde f(x_\tau))_{\tau=0}^t$ and $\tilde z = \tilde f(x)$. Consider the minimum-norm interpolation problem
$$\min_{s\in\mathcal{B}_f}\ \|s\|_k^2 \quad \text{subject to} \quad s(x_\tau) = \tilde z_\tau,\ \forall\tau\in\{0\}\cup[t], \quad s(x) = \tilde z. \qquad (94)$$
By the representer theorem, Prob. (94) admits an optimal solution of the form $\alpha^\top k_{0:t,x}(\cdot)$, where $k_{0:t,x}(\cdot) := (k(w, \cdot))_{w\in\{x_0,\ldots,x_t,x\}}$. So Prob. (94) can be reduced to
$$\min_{\alpha\in\mathbb{R}^{t+2}}\ \alpha^\top K_{0:t,x}\alpha \quad \text{subject to} \quad K_{0:t,x}\alpha = \begin{bmatrix}\tilde Z_{0:t}\\ \tilde z\end{bmatrix}. \qquad (95)$$
Hence, by solving Prob. (95), we can derive the minimum norm square under the interpolation constraints as
$$\begin{bmatrix}\tilde Z_{0:t}\\ \tilde z\end{bmatrix}^\top K_{0:t,x}^{-1}\begin{bmatrix}\tilde Z_{0:t}\\ \tilde z\end{bmatrix}.$$
Since $\tilde f$ itself is an interpolant by the construction of $(\tilde Z_{0:t}, \tilde z)$, we have
$$\begin{bmatrix}\tilde Z_{0:t}\\ \tilde z\end{bmatrix}^\top K_{0:t,x}^{-1}\begin{bmatrix}\tilde Z_{0:t}\\ \tilde z\end{bmatrix} \le \|\tilde f\|_k^2 \le B^2.$$
And since the log-likelihood only depends on $\tilde Z_{0:t}$, it holds that $\ell(\tilde Z_{0:t}|\mathcal{D}_t) = \ell_t(\tilde f) \ge \ell_t(\hat f^{\mathrm{MLE}}_t) - \beta_1(\epsilon, \delta, t)$. The objectives also satisfy $\tilde z - \tilde z_t = \tilde f(x) - \tilde f(x_t)$. Therefore, $(\tilde Z_{0:t}, \tilde z)$ is a feasible solution of Prob. (24) with the same objective as $\tilde f$ for Prob. (23).

2. We then show that for any feasible solution of Prob. (24), we can find a corresponding feasible solution of Prob. (23) with the same objective value. Let $(Z_{0:t}, z)$ be a feasible solution of Prob. (24). We construct $f_z = [Z_{0:t}^\top\ z]\,K_{0:t,x}^{-1}k_{0:t,x}(\cdot)$. Then
$$\|f_z\|_k^2 = \begin{bmatrix}Z_{0:t}\\ z\end{bmatrix}^\top K_{0:t,x}^{-1}\begin{bmatrix}Z_{0:t}\\ z\end{bmatrix} \le B^2,$$
and it can be checked that $f_z(x_\tau) = z_\tau$, $\forall\tau\in\{0\}\cup[t]$, and $f_z(x) = z$. So $\ell_t(f_z) = \ell(Z_{0:t}|\mathcal{D}_t) \ge \ell_t(\hat f^{\mathrm{MLE}}_t) - \beta_1(\epsilon, \delta, t)$. The objectives also satisfy $f_z(x) - f_z(x_t) = z - z_t$. So it is proved that for any feasible solution of Prob. (24), we can find a corresponding feasible solution of Prob. (23) with the same objective value. The desired result then follows.

G. Elaboration on Remark 2.3

By Assumption 2.2, we assume that there exists a large enough constant $B$ that upper bounds the norm of the ground-truth black-box function $f$. However, the exact value of this upper bound may be unknown to us in practice, while the execution of our algorithm relies on the knowledge of $B$ (in Problem (23), $B$ is a key parameter). So we need to guess the value of $B$. Suppose our guess is $\hat B$. It is possible that $\hat B$ is even smaller than the ground-truth function norm $\|f\|_k$. To detect this wrong guess, we observe that, with a correct setting of $B$ such that $B \ge \|f\|_k$, we have, by Thm. 3.1 and the definition of the maximum likelihood estimate, with high probability,
$$\ell_t\big(\hat f^{\mathrm{MLE}}_{t|B}\big) \ge \ell_t(f) \ge \ell_t\big(\hat f^{\mathrm{MLE}}_{t|B}\big) - \beta_1(\epsilon, \delta, t|B),$$
where $\hat f^{\mathrm{MLE}}_{t|B}$ is the maximum likelihood estimate with function norm bound $B$, and $\beta_1(\epsilon, \delta, t|B)$ is the corresponding parameter as defined in Thm. 3.1 with norm bound $B$. We also have that $2B$ is a valid upper bound on $\|f\|_k$, and thus
$$\ell_t\big(\hat f^{\mathrm{MLE}}_{t|2B}\big) \ge \ell_t(f) \ge \ell_t\big(\hat f^{\mathrm{MLE}}_{t|2B}\big) - \beta_1(\epsilon, \delta, t|2B).$$
Hence, $\ell_t\big(\hat f^{\mathrm{MLE}}_{t|B}\big) \ge \ell_t(f) \ge \ell_t\big(\hat f^{\mathrm{MLE}}_{t|2B}\big) - \beta_1(\epsilon, \delta, t|2B)$. That is to say, $\ell_t\big(\hat f^{\mathrm{MLE}}_{t|B}\big)$ needs to be greater than or equal to $\ell_t\big(\hat f^{\mathrm{MLE}}_{t|2B}\big) - \beta_1(\epsilon, \delta, t|2B)$ whenever $B$ is a valid upper bound on $\|f\|_k$. Therefore, we can use the following heuristic: every time we find that $\ell_t\big(\hat f^{\mathrm{MLE}}_{t|\hat B}\big) < \ell_t\big(\hat f^{\mathrm{MLE}}_{t|2\hat B}\big) - \beta_1(\epsilon, \delta, t|2\hat B)$, we double the upper bound guess $\hat B$.

H. Jointly Optimizing $x$, $Z_{0:t}$, and $z$ in Problem (24)

For medium-dimensional problems ($d > 4$), we can jointly optimize $x$, $Z_{0:t}$, and $z$ with a nonlinear programming solver from multiple random initial conditions. That is, we can also treat $x$ in Problem (23) as an optimization variable. In this way, we lose convexity but only need to solve the resulting problem once in each step $t$. More specifically, we solve the optimization problem (96):
$$\begin{aligned}
\max_{x\in\mathbb{R}^d,\ Z_{0:t}\in\mathbb{R}^{t+1},\ z\in\mathbb{R}}\quad & z - z_t\\
\text{subject to}\quad & \begin{bmatrix}Z_{0:t}\\ z\end{bmatrix}^\top K_{0:t,x}^{-1}\begin{bmatrix}Z_{0:t}\\ z\end{bmatrix} \le B^2,\\
& \ell(Z_{0:t}|\mathcal{D}_t) \ge \ell_t\big(\hat f^{\mathrm{MLE}}_t\big) - \beta_1(\epsilon, \delta, t). \qquad (96)
\end{aligned}$$
The only constraint that involves $x$ is $[Z_{0:t};\, z]^\top K_{0:t,x}^{-1}[Z_{0:t};\, z] \le B^2$. Applying block matrix inversion, we derive that its left-hand side is equal to
$$Z_{0:t}^\top K_{0:t}^{-1}Z_{0:t} + \frac{\big(z - k_t(x)^\top K_{0:t}^{-1}Z_{0:t}\big)^2}{k(x, x) - k_t(x)^\top K_{0:t}^{-1}k_t(x)},$$
where $k_t(x) := (k(x_\tau, x))_{\tau=0}^t$. We can then apply a nonlinear programming solver such as Ipopt to solve Problem (96) from randomly sampled initial points. The best converged solution is then set to be the next sample point $x_t$.
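A minimal sketch of this joint optimization is given below, using scipy's SLSQP from random restarts as a stand-in for Ipopt. It assumes the duels compare consecutive points $(x_\tau, x_{\tau-1})$ under a logistic link, and the helper names, jitter, restart scheme, and threshold argument (standing for $\ell_t(\hat f^{\mathrm{MLE}}_t) - \beta_1$) are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_problem_96(X, outcomes, kernel, B, loglik_threshold, domain, n_restarts=10):
    """Jointly optimize (x, Z_{0:t}, z) as in Problem (96).
    X: (t+1, d) array of past points x_0..x_t; outcomes[tau] in {0,1} is the
    outcome of the duel (x_tau vs. x_{tau-1}); domain: (d, 2) array of box bounds for x."""
    t1, d = X.shape

    def unpack(v):
        return v[:d], v[d:d + t1], v[-1]            # x, Z_{0:t}, z

    def loglik(Z):
        # preference log-likelihood of the values Z under a logistic link
        diff = Z[1:] - Z[:-1]
        p = np.clip(1.0 / (1.0 + np.exp(-diff)), 1e-12, 1 - 1e-12)
        o = np.asarray(outcomes)[1:]
        return np.sum(o * np.log(p) + (1 - o) * np.log(1 - p))

    def norm_sq(v):
        # [Z_{0:t}; z]^T K_{0:t,x}^{-1} [Z_{0:t}; z]
        x, Z, z = unpack(v)
        Xa = np.vstack([X, x])
        K = kernel(Xa, Xa) + 1e-8 * np.eye(t1 + 1)   # jitter for numerical stability
        w = np.append(Z, z)
        return w @ np.linalg.solve(K, w)

    cons = [{"type": "ineq", "fun": lambda v: B ** 2 - norm_sq(v)},
            {"type": "ineq", "fun": lambda v: loglik(unpack(v)[1]) - loglik_threshold}]
    bounds = [tuple(b) for b in domain] + [(None, None)] * (t1 + 1)
    rng = np.random.default_rng(0)
    best = None
    for _ in range(n_restarts):
        v0 = np.concatenate([rng.uniform(domain[:, 0], domain[:, 1]),
                             rng.normal(scale=0.5, size=t1 + 1)])
        res = minimize(lambda v: -(v[-1] - v[d + t1 - 1]),   # maximize z - z_t
                       v0, method="SLSQP", bounds=bounds, constraints=cons)
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return None if best is not None else None if best is None else unpack(best.x)
```

A kernel callable like the `se_kernel` sketch above can be passed in; in the paper's setting, a solver with exact derivatives such as Ipopt would be the more robust choice.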
I. Extension to the Multiple-Choice Setting

In this paper, we mainly consider the setting where the human expresses a preference over only two choices, because of its low cognitive burden on the human user and the simplicity of the theoretical analysis. However, we can extend POP-BO to the multiple-choice setting, where the human compares multiple choices and reports the favorite one. Suppose that in each step $\tau$, we aim to generate a batch of $q$ points. Then we can mix the new batch with the old batch generated in step $\tau-1$, and query the comparison oracle to report the favorite point among the $2q$ points. Firstly, the confidence set of functions can be similarly constructed using the likelihood-ratio idea and the multiple-choice probabilistic preference model as in (Astudillo et al., 2023),
$$\mathbb{P}(x_r \text{ is the favorite}) = \frac{e^{f(x_r)}}{\sum_{x\in\{\text{last batch and the new batch}\}} e^{f(x)}}. \qquad (99)$$
Secondly, to generate the new batch, the basic idea is to apply a bootstrap-type technique. More specifically, we sequentially generate the new batch $x^1, x^2, \ldots, x^q$. When generating the new point $x^{r+1}$, we maximize its corresponding optimistic advantage of $z^{r+1}$ over the maximum of $z_{t-q+1:t}, z^1, \ldots, z^r$ by solving a problem similar to Problem (23). That is, we solve Problem (100) to generate the new point $x^{r+1}$ in the same batch,
$$\begin{aligned}
\max_{x\in\mathbb{R}^d,\ z\in\mathbb{R},\ z^{1:r}\in\mathbb{R}^r,\ Z_{0:t}\in\mathbb{R}^{t+1}}\quad & z - \max\{z_{t-q+1}, \ldots, z_t, z^1, \ldots, z^r\}\\
\text{subject to}\quad & \begin{bmatrix}Z_{0:t}\\ z^{1:r}\\ z\end{bmatrix}^\top K_{0:t,x^{1:r},x}^{-1}\begin{bmatrix}Z_{0:t}\\ z^{1:r}\\ z\end{bmatrix} \le B^2,\\
& \ell(Z_{0:t}|\mathcal{D}_t) \ge \ell_t\big(\hat f^{\mathrm{MLE}}_t\big) - \beta_t, \qquad (100)
\end{aligned}$$
which is equivalent to
$$\begin{aligned}
\max_{x\in\mathbb{R}^d,\ z\in\mathbb{R},\ v\in\mathbb{R},\ z^{1:r}\in\mathbb{R}^r,\ Z_{0:t}\in\mathbb{R}^{t+1}}\quad & z - v\\
\text{subject to}\quad & \begin{bmatrix}Z_{0:t}\\ z^{1:r}\\ z\end{bmatrix}^\top K_{0:t,x^{1:r},x}^{-1}\begin{bmatrix}Z_{0:t}\\ z^{1:r}\\ z\end{bmatrix} \le B^2,\\
& \ell(Z_{0:t}|\mathcal{D}_t) \ge \ell_t\big(\hat f^{\mathrm{MLE}}_t\big) - \beta_t,\\
& v \ge z_{t-i+1},\ \forall i\in[q], \qquad v \ge z^j,\ \forall j\in[r], \qquad (101)
\end{aligned}$$
by introducing an auxiliary variable $v\in\mathbb{R}$. Problem (101) can be efficiently solved by the nonlinear programming solver Ipopt.

J. Proof of Thm. 5.2

To prepare for the following analysis, we first give a useful lemma.

Lemma J.1 (Lemma 4, (Chowdhury & Gopalan, 2017b)). We have
$$\sum_{t=1}^T\sigma^{ff}_t\big((x_t, x'_t)\big) \le \sqrt{4(T+2)\gamma^{ff}_T}, \qquad (102)$$
where $\sigma^{ff}_t$ is as defined in Eq. (16) and $\gamma^{ff}_T$ is as defined in Eq. (18).

Proof. Apply Lemma 4 in (Chowdhury & Gopalan, 2017b) with the kernel function set to $k^{ff}$.

For convenience, we use $\beta_t$ to denote $\beta(\epsilon, \delta/2, t)$. We can then analyze the regret of the optimistic algorithm:
$$\begin{aligned}
\sum_{t=1}^T\big[f(x^\star) - f(x_t)\big] &= \sum_{t=1}^T\big[\big(f(x^\star) - f(x'_t)\big) - \big(f(x_t) - f(x'_t)\big)\big]\\
&\le \sum_{t=1}^T\big[\big(\tilde f_t(x_t) - \tilde f_t(x'_t)\big) - \big(f(x_t) - f(x'_t)\big)\big]\\
&\le \sum_{t=1}^T 2\big(2B + \lambda^{-1/2}\beta_t^{1/2}\big)\sigma^{ff}_t\big((x_t, x'_t)\big),
\end{aligned}$$
where the first inequality follows by the optimality of $(x_t, \tilde f_t)$ for the optimization problem in line 4 of Alg. 1, and the second inequality follows by Thm. 3.6 (note that $\beta(\epsilon, \delta/2, t-1) \le \beta_t = \beta(\epsilon, \delta/2, t)$). Hence,
$$\sum_{t=1}^T 2\big(2B + \lambda^{-1/2}\beta_t^{1/2}\big)\sigma^{ff}_t\big((x_t, x'_t)\big) \le 2\big(2B + \lambda^{-1/2}\beta_T^{1/2}\big)\sum_{t=1}^T\sigma^{ff}_t\big((x_t, x'_t)\big) \le 2\big(2B + \lambda^{-1/2}\beta_T^{1/2}\big)\sqrt{4(T+2)\gamma^{ff}_T}.$$

K. Proof of Thm. 5.4

We have
$$\begin{aligned}
f(x^\star) - f(x_{t^\star}) &= \big(f(x^\star) - f(x'_{t^\star})\big) - \big(f(x_{t^\star}) - f(x'_{t^\star})\big)\\
&\le \big(\tilde f_{t^\star}(x_{t^\star}) - \tilde f_{t^\star}(x'_{t^\star})\big) - \big(f(x_{t^\star}) - f(x'_{t^\star})\big)\\
&\le 2\big(2B + \lambda^{-1/2}\beta_{t^\star}^{1/2}\big)\sigma^{ff}_{t^\star}\big((x_{t^\star}, x'_{t^\star})\big),
\end{aligned}$$
where $\sigma^{ff}_t$ is as given in Eq. (16) with the kernel function $k^{ff}((x_1, x'_1), (x_2, x'_2)) = k(x_1, x_2) + k(x'_1, x'_2)$ and $\beta_{t^\star} = \beta(\epsilon, \delta/2, t^\star)$. Furthermore, by the definition of $t^\star$,
$$2\big(2B + \lambda^{-1/2}\beta_{t^\star}^{1/2}\big)\sigma^{ff}_{t^\star}\big((x_{t^\star}, x'_{t^\star})\big) \le \frac{1}{T}\sum_{t=1}^T 2\big(2B + \lambda^{-1/2}\beta_t^{1/2}\big)\sigma^{ff}_t\big((x_t, x'_t)\big) \le \frac{2}{T}\big(2B + \lambda^{-1/2}\beta_T^{1/2}\big)\sum_{t=1}^T\sigma^{ff}_t\big((x_t, x'_t)\big) \le \frac{2}{T}\big(2B + \lambda^{-1/2}\beta_T^{1/2}\big)\sqrt{4(T+2)\gamma^{ff}_T}.$$
The conclusion then follows.

L. Commonly Used Specific Kernel Functions

Linear: $k(x, x') = x^\top x'$.
Squared Exponential (SE): $k(x, x') = \sigma_{\mathrm{SE}}^2\exp\big(-\frac{\|x - x'\|^2}{2l^2}\big)$, where $\sigma_{\mathrm{SE}}^2$ is the variance parameter and $l$ is the lengthscale parameter.

Matérn: $k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\Big(\frac{\sqrt{2\nu}\,\|x - x'\|}{\rho}\Big)^\nu K_\nu\Big(\frac{\sqrt{2\nu}\,\|x - x'\|}{\rho}\Big)$, where $\rho$ and $\nu$ are the two positive parameters of the kernel function, $\Gamma$ is the gamma function, and $K_\nu$ is the modified Bessel function of the second kind. $\nu$ captures the smoothness of the kernel function.

M. Proof of Thm. 5.5

Recall that
$$\beta(\epsilon, \delta/2, t) = \frac{\beta_2(\epsilon, \delta, t) + 2\beta_1(\epsilon, \delta, t)}{H_{\bar\sigma}\,\underline\sigma'^2} = O\Big(\sqrt{t\log\frac{t\,N(\mathcal{B}_f, \epsilon, \|\cdot\|_\infty)}{\delta}} + \epsilon t + \epsilon^2 t\Big).$$
We pick $\epsilon = 1/T$ and can thus derive
$$\beta_T = \beta(T^{-1}, \delta/2, T) = O\Big(\sqrt{T\log\frac{T\,N(\mathcal{B}_f, T^{-1}, \|\cdot\|_\infty)}{\delta}}\Big).$$

1. If $k$ is a linear kernel, then the corresponding RKHS is a finite-dimensional space and $\log N(\mathcal{B}_f, T^{-1}, \|\cdot\|_\infty) = O\big(\log\frac{1}{\epsilon}\big) = O(\log T)$ (see, e.g., (Wu, 2017)). The corresponding $k^{ff}((x, x'), (y, y')) = x^\top y + x'^\top y' = \langle(x, x'), (y, y')\rangle$, which is also linear. Thus, by Thm. 5 in (Srinivas et al., 2012), $\gamma^{ff}_T = O(\log T)$. Hence, $R_T = O\big((T\log T)^{1/4 + 1/2}\big) = O\big(T^{3/4}(\log T)^{3/4}\big)$.

2. If $k$ is a squared exponential kernel, then $\log N(\mathcal{B}_f, T^{-1}, \|\cdot\|_\infty) = O\big((\log\frac{1}{\epsilon})^{d+1}\big) = O\big((\log T)^{d+1}\big)$ (Example 4, (Zhou, 2002)). By Thm. 4 in (Kandasamy et al., 2015), we have $\gamma^{ff}_T = O\big((\log T)^{d+1}\big)$. Hence, $R_T = O\big(T^{3/4}(\log T)^{\frac{3(d+1)}{4}}\big)$.

3. If $k$ is a Matérn kernel, Lem. 3 in (Bull, 2011) implies the equivalence between the RKHS and a Sobolev Hilbert space. We can then apply the rich results on the bound of the covering number of Sobolev Hilbert spaces (Edmunds & Triebel, 1996). So $\log N(\mathcal{B}_f, T^{-1}, \|\cdot\|_\infty) = O\big((\frac{1}{\epsilon})^{d/\nu}\log\frac{1}{\epsilon}\big) = O\big(T^{d/\nu}\log T\big)$ (by combining the lower bound in Thm. 5.1 of (Xu et al., 2022a) and the convergence rate in Thm. 1 of (Bull, 2011)). By Thm. 4 in (Kandasamy et al., 2015), we have $\gamma^{ff}_T = O\big(T^{\frac{d(d+1)}{2\nu + d(d+1)}}\log T\big)$. Hence,
$$R_T = O\Big(T^{3/4}(\log T)^{3/4}\,T^{\frac{d}{4\nu} + \frac{d(d+1)}{4\nu + 2d(d+1)}}\Big) \le O\Big(T^{3/4}(\log T)^{3/4}\,T^{\frac{d(d+2)}{4\nu}}\Big).$$

N. Empirical Evidence for the Order of the Cumulative Regret

Fig. 4 shows the cumulative regret of the POP-BO algorithm. The experimental conditions are the same as in Sec. 6.1. Note that both the horizontal and vertical axes in Fig. 4 are in log scale, and thus the slope of the curve roughly represents the power of the cumulative regret. It can be clearly seen that the order of the cumulative regret is between $\sqrt{T}$ and $T$ (indeed, close to $T^{3/4}$ by checking the slope in log scale), which verifies our theoretical results in Thm. 5.5.

Figure 4. Cumulative regret of our algorithm in log scale. For reference purposes, we also plot $\sqrt{T}$ and $T$ in log scale.
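The slope reading described above can be made precise with a least-squares fit in log-log scale. A minimal sketch follows; the regret trace and burn-in length are placeholders, not the paper's recorded data.

```python
import numpy as np

def regret_exponent(cumulative_regret, burn_in=10):
    """Fit log R_T = p log T + c and return p; p close to 0.75 is consistent
    with the O(T^{3/4}) bound (up to log factors) of Thm. 5.5."""
    R = np.asarray(cumulative_regret, dtype=float)
    T = np.arange(1, len(R) + 1)
    mask = T >= burn_in                       # drop the initial transient
    p, _ = np.polyfit(np.log(T[mask]), np.log(R[mask]), 1)
    return p

# toy usage with a synthetic regret trace growing roughly like T^{0.75}
print(regret_exponent(np.cumsum(np.arange(1, 201) ** (-0.25))))
```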
O. Kernel-Specific Convergence Rate

Similar to the bounds in Appendix M, we can plug in the kernel-specific covering number and maximum information gain to derive the kernel-specific convergence rates in Tab. 3.

Table 3. Kernel-specific convergence rate for $x_{t^\star}$.
- Linear: $f(x^\star) - f(x_{t^\star}) = O\big(T^{-1/4}(\log T)^{3/4}\big)$.
- Squared Exponential: $f(x^\star) - f(x_{t^\star}) = O\big(T^{-1/4}(\log T)^{\frac{3(d+1)}{4}}\big)$.
- Matérn with $\nu > \frac{d}{4}\big(d + 3 + \sqrt{d^2 + 14d + 17}\big) = \Theta(d^2)$: $f(x^\star) - f(x_{t^\star}) = O\big(T^{-\frac{1}{4} + \frac{d}{4\nu} + \frac{d(d+1)}{4\nu + 2d(d+1)}}(\log T)^{3/4}\big)$.

P. More Experimental Results and Details

Selection of Hyperparameters. Three key hyperparameters influence the performance of POP-BO: the kernel lengthscale, the norm bound, and the confidence-level term $\beta$ as shown in Thm. 3.1. We set $\beta = \beta_0\sqrt{t}$, where $\beta_0$ is set to 1.0 by default. For the sampled instances from Gaussian processes, the lengthscale is set to the ground truth and the norm bound is set to 1.1 times the ground truth. For the test-function examples, we choose the lengthscale by maximizing the likelihood value over a set of randomly sampled data and set the norm bound to 6 by default (with the test functions all normalized).

Details on Sampled Instances from Gaussian Processes. Specifically, we randomly sample some knot points from the joint Gaussian distribution marginalized from the Gaussian process, and then construct the corresponding minimum-norm interpolant (Maddalena et al., 2021) as the ground-truth function.

Empirical Method for Reporting a Solution. In the test-function optimization experiments, we report the point that maximizes the minimum-norm maximum likelihood estimator $\hat f^{\mathrm{MLE}}_t$, which achieves better empirical performance.

Solution Report Method for Baselines. The approach to reporting a solution is the same as in the original paper of each baseline algorithm if it is mentioned there. Therefore, for the baseline qEUBO (Astudillo et al., 2023), we report the solution that maximizes the expected objective value conditioned on the historical samples. For the baseline SGP (Takeno et al., 2023), we report the first point of the duel proposed by the algorithm in step $t$. For the baseline DTS (González et al., 2017), we report the Condorcet winner.

Effect of Hyperparameters. We conducted more experiments to assess the effect of the hyperparameters. We observe that the hyperparameters with the most influence are the norm bound $B$ and the confidence level $\beta_t$. The larger the norm bound $B$ is, the more variance the estimated function has. If $B$ is set too large, the convergence of the suboptimality of the reported solution tends to be slower. In practice, $\beta_t$ can be set to $\beta_0\sqrt{t}$, where $\beta_0$ is a fixed constant that determines the level of exploration. The larger $\beta_0$ is, the more explorative the algorithm is, which may lead to higher cumulative regret. But setting $\beta_0$ very small may also cause weak exploration and make the suboptimality of the reported solution converge more slowly.

P.1. Experimental Results for Higher-Dimensional Problems

Higher-Dimensional Problems Sampled from a Gaussian Process. We consider the optimization of a 7-dimensional black-box function sampled from a Gaussian process with the kernel function shown in Eq. (103),
$$k(x, x') = \sigma_{\mathrm{SE}}^2\exp\Big(-\frac{\|x - x'\|^2}{2l^2}\Big), \qquad (103)$$
where $\sigma_{\mathrm{SE}}^2 = 9.0$ and $l = 5\sqrt{7}$. The optimization domain is set to $[0, 10]^7$. We run 20 randomly sampled instances for 100 steps. The average update time for each step $t$ is only 11.0 seconds on a personal computer with one Intel64 Family 6 Model 142 Stepping 12 Genuine Intel 1803 MHz processor and 16.0 GB RAM. This is comparatively very small considering that each query to the comparison oracle can be very expensive in practice (e.g., heating the room up to a certain temperature to evaluate occupant comfort, which may take tens of minutes). We compare our method to the SGP baseline.

Figure 5. Cumulative regret in log scale and the suboptimality of the reported solution in linear scale for a 7-dimensional problem sampled from a Gaussian process. For reference purposes, we also plot $T$ (linear growth) in the cumulative regret plot in log scale. The shaded areas represent 0.2 standard deviation.

Fig. 5 shows the cumulative regret (in log scale) and the suboptimality of the reported solution for our POP-BO algorithm, where the reported solution is derived by maximizing the maximum likelihood estimate function. It can be clearly seen that our algorithm achieves both sublinear regret growth and fast convergence of the suboptimality of the reported solution in this 7-dimensional problem. Interestingly, the suboptimality of SGP converges similarly to our method before 50 steps, but gets even worse after 50 steps. This is because SGP ignores the randomness in the preference feedback, which leads to misbelief about the function difference value, and such misbelief is more significant when the function difference value is small.
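To make the construction of these sampled instances concrete, the following minimal sketch combines the knot-point procedure described under "Details on Sampled Instances from Gaussian Processes" with the kernel of Eq. (103); the number of knots, the jitter, and the random seed are arbitrary illustrative choices rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_knots = 7, 60
sigma2_se, lengthscale = 9.0, 5.0 * np.sqrt(7)       # parameters of Eq. (103)

def k_se(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2_se * np.exp(-sq / (2.0 * lengthscale ** 2))

# sample knot values from the GP marginal, then build the minimum-norm interpolant
knots = rng.uniform(0.0, 10.0, size=(n_knots, d))
K = k_se(knots, knots) + 1e-8 * np.eye(n_knots)       # jitter for numerical stability
y = rng.multivariate_normal(np.zeros(n_knots), K)
alpha = np.linalg.solve(K, y)

def ground_truth(x):
    """Minimum-norm RKHS interpolant of the sampled knot values."""
    return k_se(np.atleast_2d(x), knots) @ alpha

print(ground_truth(rng.uniform(0.0, 10.0, size=d)))
```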
Higher-Dimensional Test Problem. In this section, we further consider the optimization of the 6-dimensional Ackley function as in (Astudillo et al., 2023). For this problem, we compare the POP-BO algorithm to the qEUBO algorithm proposed in (Astudillo et al., 2023). Fig. 6 shows the cumulative regret and the suboptimality of the reported solution. In this particular problem, qEUBO performs better than our POP-BO algorithm in terms of cumulative regret, while our POP-BO algorithm performs slightly better than qEUBO in terms of the suboptimality of the reported solution.

Figure 6. Cumulative regret and the suboptimality of the reported solution for the 6-dimensional Ackley function optimization problem, where the shaded areas represent 0.5 standard deviation.

P.2. Occupant Thermal Comfort Optimization

Two-Dimensional Comfort Optimization. An accurate model of human thermal comfort is crucial for improving occupants' comfort while saving energy in buildings. However, establishing such a model has proven to be a complex and challenging task (Zhang et al., 2024), and standard offline models ignore the individual differences among occupants. In this section, we consider the real-world problem of maximizing occupant thermal comfort directly from thermal preference feedback. To emulate real human thermal sensation, we use the well-known and widely adopted Predicted Mean Vote (PMV) model (Fanger et al., 1970) as the ground truth and generate the preference feedback according to the Bernoulli model assumed in Assumption 2.5. We optimize the indoor air temperature and air speed, which are the two major factors that influence thermal comfort and are controllable by HVAC (Heating, Ventilation, and Air Conditioning) systems and fans. Indeed, tuning these two factors has been proven effective in providing thermal comfort while minimizing energy consumption (Lyu et al., 2023). The result is shown in Fig. 7, where the mean is taken over 30 instances of simulation. It can be seen that our method stably achieves superior performance in optimizing human thermal comfort, which implies its potential to deal with preferential feedback in real-world applications. It is also noticeable that although qEUBO achieves slightly better performance in terms of the convergence of the reported solution, the cumulative regret of qEUBO is almost twice POP-BO's cumulative regret. This means our method is more favorable in applications where online performance during the optimization is also critical, such as online tuning of HVAC systems.

Figure 7. Cumulative regret and the suboptimality of the reported solution of different algorithms for thermal comfort optimization.

Scalability to Higher Dimension. Additionally, to demonstrate the scalability of POP-BO in this real-world comfort optimization problem, we additionally tune the mean radiant temperature and relative humidity, which results in a four-dimensional black-box optimization problem. The result is shown in Fig. 8. It can be observed that increasing the dimensionality does not drastically decrease the convergence rate of our method. Furthermore, the baseline method qEUBO can decrease the objective value very quickly in the initial steps, but remains very oscillatory after 10 steps. In contrast, our method converges faster than SGP and does not suffer from the oscillation issue of qEUBO.

Figure 8. Cumulative regret and the suboptimality of the reported solution of different algorithms for the four-dimensional thermal comfort optimization problem.
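For clarity, the way preference feedback is emulated in these experiments can be sketched as follows. The PMV-based comfort function is abstracted here as a generic ground-truth utility `f` (the PMV computation itself is not reproduced), and the logistic link instantiates the Bernoulli comparison model of Assumption 2.5; the stand-in comfort function in the usage example is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_oracle(f, x, x_prime):
    """Return 1 if x is reported as preferred over x_prime:
    P(x preferred) = sigma(f(x) - f(x_prime)) with a logistic link,
    i.e., the Bernoulli comparison model assumed in Assumption 2.5."""
    p = 1.0 / (1.0 + np.exp(-(f(x) - f(x_prime))))
    return int(rng.random() < p)

# toy usage: a stand-in comfort function over (air temperature, air speed)
comfort = lambda x: -((x[0] - 24.0) / 3.0) ** 2 - ((x[1] - 0.3) / 0.2) ** 2
print(preference_oracle(comfort, np.array([24.5, 0.3]), np.array([28.0, 0.1])))
```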
P.3. Details About the Results in Tab. 2

The cumulative regret and the evolution of the suboptimality for the different test problems in Tab. 2 are shown in Fig. 9. Since the considered problems only have 2-dimensional inputs, and since in applications of Bayesian optimization it is typically desired to obtain a solution with an objective value as close to the optimal value as possible, we only consider 30 steps here. The other baselines make limited progress in terms of the suboptimality of the reported solution within only 30 steps (partially also due to the adversarial properties of the test functions, i.e., severe non-convexity and multiple local maxima), as shown in Tab. 2. In sharp contrast, our POP-BO algorithm makes significant progress in reducing the suboptimality of the reported solution by balancing exploration and exploitation, and by estimating the best solution in a principled way.

Figure 9. Cumulative regret and the suboptimality of the reported solution of the POP-BO algorithm for the different test problems in Tab. 2.

To provide more insight into POP-BO's performance across different settings, we compare our algorithm's evolution of cumulative regret and suboptimality to the other baseline methods for each test problem in Fig. 10 and Fig. 11. It can be observed that our method may perform slightly worse than some baselines on certain problems. For example, our method performs slightly worse than qEUBO on the Bukin problem in terms of suboptimality. However, our method performs stably and is consistently one of the best on all the test problems in terms of the suboptimality.

Q. Additional Contributions as Compared to (Mehta et al., 2023)

Notably, (Mehta et al., 2023) proposes the Borda-AE algorithm, which directly learns the winning-probability function using kernel ridge regression. This key design allows the authors to derive an information-theoretic convergence rate and an efficient computation method without diving into the learning of the underlying reward function. However, (Mehta et al., 2023) has key limitations, and our paper makes additional contributions in the following two aspects.

1. Cumulative regret bound. There are two possible ways to define cumulative regret. One way is to define the (partial) cumulative regret as the summation of the suboptimality of only $x_t$, that is, $\sum_{t=1}^T\big(f(x^\star) - f(x_t)\big)$.
With this (partial) cumulative regret definition, the Borda-AE algorithm can provide a sublinear (partial) cumulative regret bound, although it has linear growth in the cumulative regret of the compared point sequence $\{x'_t\}_{t=1}^T$. However, in many practical online learning applications, it is desired to control the suboptimality of both the $x_t$ and $x'_t$ sequences. For example, when tuning the thermal/visual comfort of room occupants, we require the occupants to experience both the $x_t$ and $x'_t$ conditions for comparison purposes, and the suboptimality (which links to discomfort) caused by both $x_t$ and $x'_t$ needs to be managed. Therefore, it is more practically relevant to define the (total) cumulative regret as the total cumulative suboptimality of both the $x_t$ and $x'_t$ sequences, that is, $\sum_{t=1}^T\big(f(x^\star) - f(x_t)\big) + \sum_{t=1}^T\big(f(x^\star) - f(x'_t)\big)$. Interestingly, since $x'_t = x_{t-1}$ by the design of our POP-BO algorithm, this (total) cumulative regret reduces to $2\sum_{t=1}^T\big(f(x^\star) - f(x_t)\big)$, for which we provide our sublinear cumulative regret bound. As such, the (total) cumulative regret bound provided by our paper is stronger than the (partial) cumulative regret bound that could be obtained by (Mehta et al., 2023).

2. Applicability to online learning problems. Following the last point, (Mehta et al., 2023) is not applicable to the online learning problem since, in line 6 of the Borda-AE algorithm, $a_t$ is uniformly sampled from the action space, which leads to linear growth of the cumulative regret. This means Borda-AE has very poor online performance and cannot be applied to an online learning problem. For example, in building thermal comfort tuning, we also want to control the discomfort caused during the tuning process. In contrast, our POP-BO algorithm has good online performance, with both a theoretical bound on the cumulative regret (Thm. 5.2) and empirical evidence of small cumulative regret (Fig. 2).

Figure 10. Cumulative regret and the suboptimality of the reported solution of different algorithms for the test problems Beale, Branin, Bukin, and Cross-in-Tray in Tab. 2.

Figure 11. Cumulative regret and the suboptimality of the reported solution of different algorithms for the test problems Eggholder, Holder Table, and Levy13 in Tab. 2.