Risk-averse Heteroscedastic Bayesian Optimization

Anastasiia Makarova, ETH Zürich (anmakaro@ethz.ch); Ilnura Usmanova, ETH Zürich (ilnurau@ethz.ch); Ilija Bogunovic, ETH Zürich (ilijab@ethz.ch); Andreas Krause, ETH Zürich (krausea@ethz.ch)

Abstract. Many black-box optimization tasks arising in high-stakes applications require risk-averse decisions. The standard Bayesian optimization (BO) paradigm, however, optimizes the expected value only. We generalize BO to trade off the mean and input-dependent variance of the objective, both of which we assume to be unknown a priori. In particular, we propose a novel risk-averse heteroscedastic Bayesian optimization algorithm (RAHBO) that aims to identify a solution with high return and low noise variance, while learning the noise distribution on the fly. To this end, we model both expectation and variance as (unknown) RKHS functions, and propose a novel risk-aware acquisition function. We bound the regret for our approach and provide a robust rule to report the final decision point for applications where only a single solution must be identified. We demonstrate the effectiveness of RAHBO on synthetic benchmark functions and hyperparameter tuning tasks.

1 Introduction

Black-box optimization tasks arise frequently in high-stakes applications such as drug and material discovery [17, 22, 28], genetics [16, 27], robotics [5, 12, 25], and hyperparameter tuning of complex learning systems [13, 21, 34], to name a few. In many of these applications, there is often a trade-off between achieving high utility and minimizing risk. Moreover, uncertain and costly evaluations are an inherent part of black-box optimization tasks, and modern learning methods need to handle these aspects when balancing between the two objectives.

Bayesian optimization (BO) is a powerful framework for optimizing such costly black-box functions from noisy zeroth-order evaluations. Classical BO approaches are typically risk-neutral, as they seek to optimize the expected function value only. In practice, however, two different solutions might attain similar expected function values, but one might produce significantly noisier realizations. This is of major importance when it comes to actual deployment of the found solutions. For example, when selecting hyperparameters of a machine learning algorithm, we might prefer configurations that lead to slightly higher test errors but at the same time to smaller variance.

In this paper, we generalize BO to trade off mean and input-dependent noise variance when sequentially querying points and outputting final solutions. We introduce a practical setting where both the black-box objective and the input-dependent noise variance are unknown a priori, and the learner needs to estimate them on the fly. We propose a novel optimistic risk-averse algorithm, RAHBO, that makes sequential decisions by simultaneously balancing exploration (learning about uncertain actions), exploitation (choosing actions that lead to high gains), and risk (avoiding unreliable actions). We bound the cumulative regret of RAHBO as well as the number of samples required to output a single near-optimal risk-averse solution. In our experiments, we demonstrate the risk-averse performance of our algorithm and show that standard BO methods can severely fail in applications where reliability of the reported solutions is of utmost importance.
Figure 1: When there is a choice between identical optima with different noise levels, standard BO tends to query points corresponding to higher noise. (a) Unknown objective $f$ with 3 global maxima marked as (A, B, C); (b) heteroscedastic noise variance $\rho^2(x_0, x_1)$ over the same domain: the noise level at (A, B, C) varies according to a sigmoid function; (c) empirical variance distribution at all points acquired during the BO procedure (over 9 experiments with different seeds) for RAHBO ($\alpha=1$), GP-UCB, and RAHBO-US ($\alpha=1$). The three bumps correspond to the three global optima with different noise variance. RAHBO dominates in choosing the risk-averse optimum, consequently yielding lower risk-averse regret in Figure 5a.

Related work. Bayesian optimization (BO) [26] refers to approaches for optimizing a noisy black-box objective that is typically expensive to evaluate. A great number of BO methods have been developed over the years, including a significant number of variants of popular algorithms such as GP-UCB [35], Expected Improvement [26], and Thompson Sampling [14]. While the focus of standard BO approaches is mainly on trading off exploration vs. exploitation and optimizing for the expected performance, in this work, we additionally focus on the risk that is involved when working with noisy objectives, as illustrated in Figure 1.

The vast majority of previous BO works assume (sub-)Gaussian and homoscedastic noise (i.e., input-independent and of some known fixed level). Both assumptions can be restrictive in practice. For example, as demonstrated in [15], the majority of hyperparameter tuning tasks exhibit heteroscedasticity. A few works relax the first assumption and consider, e.g., heavy-tailed noise models [31] and adversarially corrupted observations [8]. The second assumption is typically generalized via heteroscedastic Gaussian processes (GPs), allowing an explicit dependence of the noise distribution on the evaluation points [6, 7, 10, 20]. Similarly, in this work, we consider heteroscedastic GP models, but unlike the previous works, we specifically focus on the risk that is associated with querying and reporting noisy points.

Several works have recently considered robust and risk-averse aspects in BO. Their central focus is on designing robust strategies and protecting against changes/shifts in uncontrollable covariates. They study various notions including worst-case robustness [9], distributional robustness [19, 30], robust mixed strategies [33], and other notions of risk-aversion [11, 18, 29], and while some of them report robust regret guarantees, their focus is primarily on robustness in the homoscedastic GP setting. Instead, in our setting, we account for the risk that comes from the realization of random noise with unknown distribution. Rather than optimizing the expected performance, in our risk-averse setting, we prefer inputs with lower variance. To this end, we incorporate the learning of the noise distribution into the optimization procedure via a mean-variance objective. The closest to our setting is risk-aversion with respect to noise in multi-armed bandits [32]. Their approach, however, fails to exploit correlation in rewards among similar arms.

Contributions. We propose a novel Risk-averse Heteroscedastic Bayesian Optimization (RAHBO) approach based on the optimistic principle that trades off the expectation and uncertainty of the mean-variance objective function.
We model both the objective and the variance as (unknown) functions belonging to an RKHS, and propose a practical risk-aware algorithm in the heteroscedastic GP setting. In our theoretical analysis, we establish rigorous sublinear regret guarantees for our algorithm, and provide a robust reporting rule for applications where only a single solution is required. We demonstrate the effectiveness of RAHBO on synthetic benchmarks, as well as on hyperparameter tuning tasks for the Swiss free-electron laser and a machine learning model.

2 Problem setting

Let $\mathcal{X}$ be a given compact set of inputs ($\mathcal{X} \subset \mathbb{R}^d$ for some $d \in \mathbb{N}$). We consider the problem of sequentially interacting with a fixed and unknown objective $f : \mathcal{X} \to \mathbb{R}$. At every round of this procedure, the learner selects an action $x_t \in \mathcal{X}$ and obtains a noisy observation
$$y_t = f(x_t) + \xi(x_t), \tag{1}$$
where $\xi(x_t)$ is zero-mean noise, independent across different time steps $t$. In this work, we consider sub-Gaussian heteroscedastic noise that depends on the query location.

Definition 1. A zero-mean real-valued random variable $\xi$ is $\rho$-sub-Gaussian if there exists a variance-proxy $\rho^2$ such that $\mathbb{E}[e^{\lambda \xi}] \le e^{\lambda^2 \rho^2/2}$ for all $\lambda \in \mathbb{R}$.

For a sub-Gaussian $\xi$, its variance $\mathrm{Var}[\xi]$ lower bounds any valid variance-proxy $\rho^2$, i.e., $\mathrm{Var}[\xi] \le \rho^2$. In case $\mathrm{Var}[\xi] = \rho^2$ holds, $\xi$ is said to be strictly $\rho$-sub-Gaussian. Besides zero-mean Gaussian random variables, most standard symmetric bounded random variables (e.g., Bernoulli, beta, uniform, binomial) are strictly sub-Gaussian (see [2, Proposition 1.1]). Throughout the paper, we consider sub-Gaussian noise, and in Section 3.3, we specialize to the case of strictly sub-Gaussian noise.

Optimization objective. Unlike previous works that mostly focus on sequential optimization of $f$ in the homoscedastic noise case, in this work, we consider the trade-off between risk and return in the heteroscedastic case. While there exist a number of risk-averse objectives, we consider the simple and frequently used mean-variance objective (MV) [32]. Here, the objective value at $x \in \mathcal{X}$ is a trade-off between the (mean) return $f(x)$ and the risk expressed by its variance-proxy $\rho^2(x)$:
$$\mathrm{MV}(x) = f(x) - \alpha \rho^2(x), \tag{2}$$
where $\alpha \ge 0$ is a so-called coefficient of absolute risk tolerance. In this work, we assume $\alpha$ is fixed and known to the learner. In the case of $\alpha = 0$, maximizing $\mathrm{MV}(x)$ coincides with the standard BO objective.

Performance metrics. We aim to construct a sequence of input evaluations $x_t$ that eventually maximizes the risk-averse objective $\mathrm{MV}(x)$. To assess this convergence, we consider two metrics. The first metric corresponds to the notion of cumulative regret similar to the one used in standard BO and bandits. Here, the learner's goal is to maximize its risk-averse cumulative reward over a time horizon $T$, or equivalently minimize its risk-averse cumulative regret:
$$R_T = \sum_{t=1}^{T} \big[ \mathrm{MV}(x^*) - \mathrm{MV}(x_t) \big], \tag{3}$$
where $x^* \in \arg\max_{x \in \mathcal{X}} \mathrm{MV}(x)$. Sublinear growth of $R_T$ with $T$ implies vanishing average regret $R_T/T \to 0$ as $T \to \infty$. Intuitively, this implies the existence of some $t$ such that $\mathrm{MV}(x_t)$ is arbitrarily close to the optimal value $\mathrm{MV}(x^*)$.

The second metric is used when the learner seeks to simultaneously minimize the number of expensive function evaluations $T$. Namely, for a given accuracy $\epsilon \ge 0$, we report a single "good" risk-averse point $\hat{x}_T \in \mathcal{X}$ after a total of $T$ rounds, that satisfies:
$$\mathrm{MV}(x^*) - \mathrm{MV}(\hat{x}_T) \le \epsilon. \tag{4}$$
Both metrics are important for choosing risk-averse solutions, and which one is preferred depends on the application at hand.
For example, risk-averse cumulative regret $R_T$ might be of greater interest in online recommendation systems, while reporting a single point with high MV value might be more suitable when tuning machine learning hyperparameters. We consider both performance metrics in our experiments.

Regularity assumptions. We consider standard smoothness assumptions [9, 35] for the unknown function $f : \mathcal{X} \to \mathbb{R}$. In particular, we assume that $f(\cdot)$ belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_\kappa$ (a space of smooth, real-valued functions defined on $\mathcal{X}$) induced by a kernel function $\kappa(\cdot, \cdot)$, i.e., $f \in \mathcal{H}_\kappa$. We also assume that $\kappa(x, x') \le 1$ for every $x, x' \in \mathcal{X}$. Moreover, the RKHS norm of $f(\cdot)$ is assumed to be bounded, $\|f\|_\kappa \le B_f$, for some fixed constant $B_f > 0$. We assume that the noise $\xi(x)$ is $\rho(x)$-sub-Gaussian with variance-proxy $\rho^2(x)$ uniformly bounded, $\rho(x) \in [\underline{\varrho}, \bar{\varrho}]$, for some constant values $\bar{\varrho} \ge \underline{\varrho} > 0$.

3 Algorithms

We first recall the Gaussian process (GP) based framework for sequential learning of RKHS functions from observations with heteroscedastic noise. Then, in Section 3.2, we consider a simple risk-averse Bayesian optimization problem with known variance-proxy, and later on, in Section 3.3, we focus on our main problem setting in which the variance-proxy is unknown.

3.1 Bayesian optimization with heteroscedastic noise

Before addressing the risk-averse objective, we briefly recall the standard GP-UCB algorithm [35] in the setting of heteroscedastic sub-Gaussian noise. The regularity assumptions permit the construction of confidence bounds via a GP model. In particular, to decide which point to query at every round, GP-UCB makes use of the posterior GP mean and variance, denoted by $\mu_t(\cdot)$ and $\sigma^2_t(\cdot)$, respectively. They are computed based on the previous measurements $y_{1:t} = [y_1, \ldots, y_t]^\top$ and the given kernel $\kappa(\cdot, \cdot)$:
$$\mu_t(x) = \kappa_t(x)^\top (K_t + \lambda \Sigma_t)^{-1} y_{1:t}, \tag{5}$$
$$\sigma^2_t(x) = \tfrac{1}{\lambda} \big( \kappa(x, x) - \kappa_t(x)^\top (K_t + \lambda \Sigma_t)^{-1} \kappa_t(x) \big), \tag{6}$$
where $\Sigma_t := \mathrm{diag}(\rho^2(x_1), \ldots, \rho^2(x_t))$, $(K_t)_{i,j} = \kappa(x_i, x_j)$, $\kappa_t(x) = [\kappa(x_1, x), \ldots, \kappa(x_t, x)]^\top$, $\lambda > 0$, and the prior modelling assumptions are $\xi(\cdot) \sim \mathcal{N}(0, \rho^2(\cdot))$ and $f \sim \mathcal{GP}(0, \lambda^{-1}\kappa)$. At time $t$, GP-UCB maximizes the upper confidence bound of $f(\cdot)$, i.e.,
$$x_t \in \arg\max_{x \in \mathcal{X}} \underbrace{\mu_{t-1}(x) + \beta_t \sigma_{t-1}(x)}_{=:\,\mathrm{ucb}^f_t(x)}. \tag{7}$$
If the noise $\xi_t(x_t)$ is heteroscedastic and $\rho(x_t)$-sub-Gaussian, the following confidence bounds hold:

Lemma 1 (Lemma 7 in [20]). Let $f \in \mathcal{H}_\kappa$, and let $\mu_t(\cdot)$ and $\sigma^2_t(\cdot)$ be defined as in Eqs. (5) and (6) with $\lambda > 0$. Assume that the observations $(x_t, y_t)_{t \ge 1}$ satisfy Eq. (1). Then the following holds for all $t \ge 1$ and $x \in \mathcal{X}$ with probability at least $1 - \delta$:
$$|\mu_{t-1}(x) - f(x)| \le \underbrace{\sqrt{2 \ln \frac{\det(\lambda \Sigma_t + K_t)^{1/2}}{\delta \det(\lambda \Sigma_t)^{1/2}}}}_{:=\,\beta_t} \, \sigma_{t-1}(x). \tag{8}$$
Here, $\beta_t$ stands for the parameter that balances exploration vs. exploitation and ensures the validity of the confidence bounds. Analogous concentration inequalities in the case of homoscedastic noise were considered in [1, 14, 35].

Failure of GP-UCB in the risk-averse setting. GP-UCB is guaranteed to achieve sublinear cumulative regret with high probability in the risk-neutral (homoscedastic/heteroscedastic) BO setting [14, 35]. However, for the risk-averse setting in Eq. (2), the maximizers $x^* \in \arg\max_{x \in \mathcal{X}} \mathrm{MV}(x)$ and $x^*_f \in \arg\max_{x \in \mathcal{X}} f(x)$ might not coincide, and consequently, $\mathrm{MV}(x^*)$ can be significantly larger than $\mathrm{MV}(x^*_f)$. This is illustrated in Figure 1, where GP-UCB most frequently chooses optimum A, which has the highest risk.
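To make Eqs. (5)-(7) concrete, the following is a minimal NumPy sketch of the heteroscedastic GP posterior and the risk-neutral GP-UCB rule over a finite candidate set. The RBF kernel, the lengthscale, and all function names are our illustrative choices rather than the paper's implementation; a practical implementation would use Cholesky solves instead of an explicit matrix inverse.

```python
# Minimal sketch of the heteroscedastic GP posterior, Eqs. (5)-(6), and the
# risk-neutral GP-UCB acquisition, Eq. (7). Kernel and names are illustrative.
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel kappa(x, x'), with kappa(x, x) = 1."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def hetero_gp_posterior(X_train, y_train, rho2_train, X_query, lam=1.0):
    """Posterior mean/variance with per-point noise variances rho2_train."""
    K = rbf_kernel(X_train, X_train)            # (K_t)_{ij} = kappa(x_i, x_j)
    Sigma = np.diag(rho2_train)                 # Sigma_t = diag(rho^2(x_1..t))
    A = np.linalg.inv(K + lam * Sigma)          # (K_t + lam * Sigma_t)^{-1}
    k_star = rbf_kernel(X_train, X_query)       # kappa_t(x) for each query x
    mu = k_star.T @ A @ y_train                 # Eq. (5)
    var = (1.0 / lam) * (1.0 - np.einsum('ij,ik,kj->j', k_star, A, k_star))  # Eq. (6)
    return mu, np.maximum(var, 0.0)

def gp_ucb(X_train, y_train, rho2_train, X_cand, beta=2.0):
    """Risk-neutral rule: argmax of mu + beta * sigma over candidates, Eq. (7)."""
    mu, var = hetero_gp_posterior(X_train, y_train, rho2_train, X_cand)
    return X_cand[np.argmax(mu + beta * np.sqrt(var))]
```

Note that the per-point diagonal $\lambda \Sigma_t$ plays the role of the usual homoscedastic nugget $\lambda \sigma^2 I$; this is the only change relative to the standard GP posterior.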
3.2 Warm up: Known variance-proxy

We remedy the previous issue with GP-UCB by proposing a natural Risk-averse Heteroscedastic BO (RAHBO) algorithm for the case of a known variance-proxy $\rho^2(\cdot)$. At each round $t$, RAHBO chooses the action:
$$x_t \in \arg\max_{x \in \mathcal{X}} \ \mu_{t-1}(x) + \beta_t \sigma_{t-1}(x) - \alpha \rho^2(x), \tag{9}$$
where $\beta_t$ is from Lemma 1 and $\alpha$ is from Eq. (2). In the next section, we further relax the assumption on the variance-proxy and consider a more practical setting in which $\rho^2(\cdot)$ is unknown to the learner. For the current setting, the performance of RAHBO is formally captured in the following proposition.

Proposition 1. Consider any $f \in \mathcal{H}_\kappa$ with $\|f\|_\kappa \le B_f$ and the sampling model from Eq. (1) with known variance-proxy $\rho^2(x)$. Let $\{\beta_t\}_{t=1}^T$ be set as in Lemma 1 with $\lambda = 1$. Then, with probability at least $1 - \delta$, RAHBO attains cumulative risk-averse regret $R_T = \mathcal{O}\big(\beta_T \sqrt{T \gamma_T}\, (\bar{\varrho}^2 + 1)\big)$.

Here, $\gamma_T$ denotes the maximum information gain [35] at time $T$, defined via the mutual information $I(y_{1:T}; f_{1:T})$ between the evaluations $y_{1:T}$ and $f_{1:T} = [f(x_1), \ldots, f(x_T)]^\top$ at points $A \subset \mathcal{X}$:
$$\gamma_T := \max_{A \subset \mathcal{X},\, |A| = T} I(y_{1:T}; f_{1:T}), \tag{10}$$
where
$$I(y_{1:T}; f_{1:T}) = \frac{1}{2} \sum_{t=1}^{T} \ln\Big(1 + \frac{\sigma^2_{t-1}(x_t)}{\rho^2(x_t)}\Big) \tag{11}$$
in the case of heteroscedastic noise (see Appendix A.1.1). Upper bounds on $\gamma_T$ are provided in [35] for widely used kernels. These upper bounds typically scale sublinearly in $T$: for the linear kernel, $\gamma_T = \mathcal{O}(d \log T)$, and for the squared exponential kernel, $\gamma_T = \mathcal{O}(d (\log T)^{d+1})$. While these bounds are derived assuming the homoscedastic GP setting with some fixed constant noise variance, we show (in Appendix A.1.3) that the same rates (up to a multiplicative constant factor) apply in the heteroscedastic case.

3.3 RAHBO for unknown variance-proxy

In the case of an unknown variance-proxy, the confidence bounds for the unknown $f(x)$ in Lemma 1 cannot be readily used, and we construct new ones for the combined mean-variance objective. To learn about the unknown $\rho^2(x)$, we make some further assumptions.

Assumption 1. The variance-proxy $\rho^2(x)$ belongs to an RKHS induced by some kernel $\kappa_{\mathrm{var}}$, i.e., $\rho^2 \in \mathcal{H}_{\kappa_{\mathrm{var}}}$, and its RKHS norm is bounded, $\|\rho^2\|_{\kappa_{\mathrm{var}}} \le B_{\mathrm{var}}$, for some finite $B_{\mathrm{var}} > 0$. Moreover, the noise $\xi(x)$ in Eq. (1) is strictly $\rho(x)$-sub-Gaussian, i.e., $\mathrm{Var}[\xi(x)] = \rho^2(x)$ for every $x \in \mathcal{X}$.

As a consequence of this assumption, we can now focus on estimating the variance, since $\mathrm{Var}[\xi(\cdot)]$ and $\rho^2(\cdot)$ coincide. In particular, to estimate $\mathrm{Var}[\xi(\cdot)]$, we consider a repeated experiment setting, where for each $x_t$ we collect $k > 1$ evaluations $\{y_i(x_t)\}_{i=1}^k$ with $y_i(x_t) = f(x_t) + \xi_i(x_t)$. The sample mean and variance of the evaluations at $x_t$ are then given by:
$$\hat{m}_k(x_t) = \frac{1}{k} \sum_{i=1}^{k} y_i(x_t) \quad \text{and} \quad \hat{s}^2_k(x_t) = \frac{1}{k-1} \sum_{i=1}^{k} \big( y_i(x_t) - \hat{m}_k(x_t) \big)^2. \tag{12}$$
The key idea is that for strictly sub-Gaussian noise $\xi(x)$, $\hat{s}^2_{1:t} = [\hat{s}^2_k(x_1), \ldots, \hat{s}^2_k(x_t)]^\top$ yields unbiased, but noisy, evaluations of the unknown variance-proxy $\rho^2_{1:t} = [\rho^2(x_1), \ldots, \rho^2(x_t)]^\top$, i.e.,
$$\hat{s}^2_k(x_t) = \rho^2(x_t) + \eta(x_t) \tag{13}$$
with zero-mean noise $\eta(x_t)$. In order to efficiently estimate $\rho^2(\cdot)$, we need an additional assumption.

Assumption 2. The noise $\eta(x)$ in Eq. (13) is $\rho_\eta(x)$-sub-Gaussian with known $\rho^2_\eta(x)$, and the realizations $\{\eta(x_t)\}_{t \ge 1}$ are independent across $t$.

We note that a similar assumption is made in [32] in the multi-armed bandit setting. The fact that $\rho^2_\eta(\cdot)$ is known is rather mild, as Assumption 1 allows controlling its value. For example, in the case of strictly sub-Gaussian $\eta(x)$, we show (in Appendix A.2) that $\mathrm{Var}[\eta(\cdot)] = \rho^2_\eta(\cdot) \le 2\rho^4(\cdot)/(k-1)$. Then, given that $\rho^2(\cdot) \le \bar{\varrho}^2$, we can utilize the following (rather conservative) bound as a variance-proxy: $\rho^2_\eta(x) = 2\bar{\varrho}^4/(k-1)$.
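The repeated-experiment estimator of Eqs. (12)-(13) is simple to state in code. Below is a minimal sketch that, for illustration only, draws Gaussian noise (the analysis requires only strict sub-Gaussianity); `evaluate_with_repetitions` and `eta_variance_proxy` are our hypothetical helper names.

```python
# Sketch of Eqs. (12)-(13): k repeated evaluations at the same x_t give an
# unbiased mean estimate (noise variance rho^2(x_t)/k) and, under strict
# sub-Gaussianity, unbiased noisy evaluations of the variance-proxy rho^2(x_t).
import numpy as np

def evaluate_with_repetitions(objective, noise_std, x_t, k=10, rng=None):
    rng = rng or np.random.default_rng(0)
    # Gaussian noise used only for this demo; xi_i(x_t) has std noise_std(x_t)
    ys = objective(x_t) + noise_std(x_t) * rng.standard_normal(k)
    m_hat = ys.mean()          # sample mean \hat m_k(x_t)
    s2_hat = ys.var(ddof=1)    # sample variance \hat s_k^2 = rho^2(x_t) + eta(x_t)
    return m_hat, s2_hat

def eta_variance_proxy(rho_bar, k):
    """Conservative variance-proxy for eta from Appendix A.2:
    rho_eta^2 = 2 * rho_bar^4 / (k - 1), with rho_bar an upper bound on rho(x)."""
    return 2.0 * rho_bar**4 / (k - 1)
```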
RAHBO algorithm. We present our Risk-averse Heteroscedastic BO approach for an unknown variance-proxy in Algorithm 1. Our method relies on building two GP models. First, we use the sample variance evaluations $\hat{s}^2_{1:t}$ to construct a GP model for $\rho^2(\cdot)$. The corresponding $\mu^{\mathrm{var}}_{t-1}(\cdot)$ and $\sigma^{\mathrm{var}}_{t-1}(\cdot)$ are computed as in Eqs. (5) and (6), using the kernel $\kappa_{\mathrm{var}}$, the variance-proxy $\rho^2_\eta(\cdot)$, and the noisy observations $\hat{s}^2_{1:t}$. From these, we build the upper and lower confidence bounds $\mathrm{ucb}^{\mathrm{var}}_t(\cdot)$ and $\mathrm{lcb}^{\mathrm{var}}_t(\cdot)$ of the variance-proxy $\rho^2(\cdot)$, setting $\beta^{\mathrm{var}}_t$ according to Lemma 1:
$$\mathrm{ucb}^{\mathrm{var}}_t(x) := \mu^{\mathrm{var}}_{t-1}(x) + \beta^{\mathrm{var}}_t \sigma^{\mathrm{var}}_{t-1}(x), \tag{14}$$
$$\mathrm{lcb}^{\mathrm{var}}_t(x) := \mu^{\mathrm{var}}_{t-1}(x) - \beta^{\mathrm{var}}_t \sigma^{\mathrm{var}}_{t-1}(x). \tag{15}$$
Second, we use the sample mean evaluations $\hat{m}_{1:t} = [\hat{m}_k(x_1), \ldots, \hat{m}_k(x_t)]^\top$ to construct a GP model for $f(\cdot)$. The mean $\mu_t(\cdot)$ and variance $\sigma^2_t(\cdot)$ in Eqs. (5) and (6), however, rely on the unknown variance-proxy $\rho^2(\cdot)$ in $\Sigma_t$, and we thus use its upper confidence bound $\mathrm{ucb}^{\mathrm{var}}_t(\cdot)$ truncated at $\bar{\varrho}^2$:
$$\hat{\Sigma}_t = \frac{1}{k} \,\mathrm{diag}\big( \min\{\mathrm{ucb}^{\mathrm{var}}_t(x_1), \bar{\varrho}^2\}, \ldots, \min\{\mathrm{ucb}^{\mathrm{var}}_t(x_t), \bar{\varrho}^2\} \big), \tag{16}$$
where $\hat{\Sigma}_t$ is corrected by $k$ since every evaluation in $\hat{m}_{1:t}$ is an average over $k$ samples. This substitution of the unknown variance-proxy by its conservative estimate guarantees that the confidence bounds $\mathrm{ucb}^f_t(x) := \mu_{t-1}(x) + \beta_t \sigma_{t-1}(x)$ on $f$ also hold with high probability (conditioned on the confidence bounds for $\rho^2(\cdot)$ holding true; see Appendix A.3 for more details). Finally, we define the acquisition function as $\mathrm{ucb}^{\mathrm{MV}}_t(x) := \mathrm{ucb}^f_t(x) - \alpha\, \mathrm{lcb}^{\mathrm{var}}_t(x)$, i.e., we select $x_t \in \arg\max_{x \in \mathcal{X}} \mathrm{ucb}^{\mathrm{MV}}_t(x)$ at each round $t$.

Algorithm 1 Risk-averse Heteroscedastic Bayesian Optimization (RAHBO)
Require: Parameters $\alpha$, $\{\beta_t, \beta^{\mathrm{var}}_t\}_{t \ge 1}$, $\lambda$, $k$; priors $\mu^f_0 = \mu^{\mathrm{var}}_0 = 0$; kernel functions $\kappa$, $\kappa_{\mathrm{var}}$
1: for $t = 1, 2, \ldots$ do
2:   Construct confidence bounds $\mathrm{ucb}^{\mathrm{var}}_t(\cdot)$ and $\mathrm{lcb}^{\mathrm{var}}_t(\cdot)$ as in Eqs. (14) and (15)
3:   Construct $\mathrm{ucb}^f_t(\cdot)$ as in Eq. (7)
4:   Select $x_t \in \arg\max_{x \in \mathcal{X}} \mathrm{ucb}^f_t(x) - \alpha\, \mathrm{lcb}^{\mathrm{var}}_t(x)$
5:   Observe $k$ samples: $y_i(x_t) = f(x_t) + \xi_i(x_t)$ for every $i \in [k]$
6:   Use the samples $\{y_i(x_t)\}_{i=1}^k$ to compute the sample mean $\hat{m}_k(x_t)$ and variance $\hat{s}^2_k(x_t)$ as in Eq. (12)
7:   Use $x_t$, $\hat{s}^2_k(x_t)$ to update the posterior $\mu^{\mathrm{var}}_t(\cdot)$ and $\sigma^{\mathrm{var}}_t(\cdot)$ as in Eqs. (32) and (33)
8:   Use $\mathrm{ucb}^{\mathrm{var}}_t(\cdot)$ to compute $\hat{\Sigma}_t$ as in Eq. (16)
9:   Use $x_t$, $\hat{m}_k(x_t)$ and $\hat{\Sigma}_t$ to update the posterior $\mu_t(\cdot)$ and $\sigma_t(\cdot)$ as in Eqs. (5) and (6)
10: end for

The proposed algorithm leads to new maximum information gains $\hat{\gamma}_T = \max_A I(\hat{m}_{1:T}; f_{1:T})$ and $\Gamma_T = \max_A I(\hat{s}^2_{1:T}; \rho^2_{1:T})$ for the sample mean $\hat{m}_{1:T}$ and sample variance $\hat{s}^2_{1:T}$ evaluations. The corresponding mutual information in $\hat{\gamma}_T$ and $\Gamma_T$ is computed according to Eq. (11) for heteroscedastic noise with variance-proxies $\bar{\varrho}^2/k$ and $\rho^2_\eta$, respectively (see Appendix A.4). The performance of RAHBO is captured in the following theorem.

Theorem 1. Consider any $f \in \mathcal{H}_\kappa$ with $\|f\|_\kappa \le B_f$ and the sampling model in Eq. (1) with unknown variance-proxy $\rho^2(x)$ that satisfies Assumptions 1 and 2. Let $\{x_t\}_{t=1}^T$ denote the set of actions chosen by RAHBO (Algorithm 1) over $T$ rounds. Set $\{\beta_t\}_{t=1}^T$ and $\{\beta^{\mathrm{var}}_t\}_{t=1}^T$ according to Lemma 1 with $\lambda = 1$, $R^2 = \max_{x \in \mathcal{X}} \rho^2_\eta(x)$, and $\rho(\cdot) \in [\underline{\varrho}, \bar{\varrho}]$. Then, with probability at least $1 - \delta$, the risk-averse cumulative regret of RAHBO is bounded, for all $T \ge 1$, as follows:
$$R_T \le 2\beta_T \sqrt{\frac{2T \hat{\gamma}_T}{\ln(1 + k/\bar{\varrho}^2)}} + 2\alpha \beta^{\mathrm{var}}_T k \sqrt{\frac{2T \Gamma_T}{\ln(1 + R^{-2})}}. \tag{17}$$
The risk-averse cumulative regret of RAHBO depends sublinearly on $T$ for most of the popularly used kernels. This follows from the implicit sublinear dependence on $T$ in $\beta_T$, $\beta^{\mathrm{var}}_T$ and $\hat{\gamma}_T$, $\Gamma_T$ (the bounds in the case of heteroscedastic noise replicate the ones used in Proposition 1, as shown in Appendices A.1.2 and A.1.3).
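For concreteness, the following sketch assembles Algorithm 1 from the helpers introduced above (`hetero_gp_posterior`, `evaluate_with_repetitions`, `eta_variance_proxy`). It optimizes over a finite candidate set and makes a few practical choices that are ours, not the paper's: clipping the variance lower bound at zero, a random first query, and fixed $\beta_t = \beta^{\mathrm{var}}_t$.

```python
# Minimal sketch of Algorithm 1 (RAHBO) over a finite candidate set X_cand.
import numpy as np

def rahbo(objective, noise_std, X_cand, T=50, k=10, alpha=1.0,
          beta=2.0, beta_var=2.0, rho_bar2=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    rho_eta2_val = eta_variance_proxy(np.sqrt(rho_bar2), k)  # proxy for eta
    X, m_obs, s2_obs = [], [], []
    x = X_cand[rng.integers(len(X_cand))]       # arbitrary first query (our choice)
    for t in range(T):
        # lines 5-6: k repeated evaluations -> sample mean and sample variance
        m_hat, s2_hat = evaluate_with_repetitions(objective, noise_std, x, k, rng)
        X.append(x); m_obs.append(m_hat); s2_obs.append(s2_hat)
        Xa, rho_eta2 = np.array(X), np.full(len(X), rho_eta2_val)
        # line 7 + Eqs. (14)-(15): GP on rho^2 fitted to sample variances;
        # clipping lcb at 0 is a practical tweak (a variance is nonnegative)
        mu_v, var_v = hetero_gp_posterior(Xa, np.array(s2_obs), rho_eta2, X_cand)
        lcb_var = np.clip(mu_v - beta_var * np.sqrt(var_v), 0.0, None)
        # line 8 + Eq. (16): truncated UCB of rho^2 at the queried points,
        # divided by k because the mean observations average k samples
        mu_vt, var_vt = hetero_gp_posterior(Xa, np.array(s2_obs), rho_eta2, Xa)
        sigma_hat = np.minimum(mu_vt + beta_var * np.sqrt(var_vt), rho_bar2) / k
        # line 9: GP on f fitted to sample means with plug-in noise levels
        mu_f, var_f = hetero_gp_posterior(Xa, np.array(m_obs), sigma_hat, X_cand)
        # lines 2-4: risk-averse acquisition ucb^f(x) - alpha * lcb^var(x)
        x = X_cand[np.argmax(mu_f + beta * np.sqrt(var_f) - alpha * lcb_var)]
    # reporting rule (Corollary 1.1): maximize lcb^MV over the queried points
    mu_fa, var_fa = hetero_gp_posterior(Xa, np.array(m_obs), sigma_hat, Xa)
    ucb_var_a = np.minimum(mu_vt + beta_var * np.sqrt(var_vt), rho_bar2)
    lcb_mv = mu_fa - beta * np.sqrt(var_fa) - alpha * ucb_var_a
    return Xa[np.argmax(lcb_mv)]
```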
Finally, the result of Theorem 1 provides a non-trivial trade-off for the number of repetitions $k$: larger $k$ increases the sample complexity but also leads to better estimation of the noise model. Furthermore, we obtain a bound on the number of rounds $T$ required for identifying an $\epsilon$-optimal point:

Corollary 1.1. Consider the setup of Theorem 1. Let $A = \{x_t\}_{t=1}^T$ denote the actions selected by RAHBO over $T$ rounds. Then, with probability at least $1 - \delta$, the reported point $\hat{x}_T := \arg\max_{x_t \in A} \mathrm{lcb}^{\mathrm{MV}}_t(x_t)$, where $\mathrm{lcb}^{\mathrm{MV}}_t(x) = \mathrm{lcb}^f_t(x) - \alpha\, \mathrm{ucb}^{\mathrm{var}}_t(x)$, achieves $\epsilon$-accuracy, i.e., $\mathrm{MV}(x^*) - \mathrm{MV}(\hat{x}_T) \le \epsilon$, after
$$T \ge \frac{4\beta_T^2 \hat{\gamma}_T / \ln(1 + k/\bar{\varrho}^2) + 4\alpha (\beta^{\mathrm{var}}_T)^2 \Gamma_T / \ln(1 + R^{-2})}{\epsilon^2}$$
rounds.

This result demonstrates the sample complexity rates when a single risk-averse reported solution is required. We note that both Theorem 1 and Corollary 1.1 provide guarantees for choosing risk-averse solutions, and depending on the application at hand, we might consider either of the proposed performance metrics. We demonstrate use-cases for both in the following section.

Figure 2: (a) Unknown true objective (the sine function) along with noisy evaluations with varying noise level (left), and unknown true noise variance and its evaluations (right). (b) Cumulative regret. (c) Simple MV regret for the reporting rule $\hat{x}_T = \arg\max_{x_t} \mathrm{lcb}_T(x_t)$. (d) Simple regret $f(x^*) - f(\hat{x}_T)$ for the unknown function at the reported point $\hat{x}_T$ from (c). RAHBO not only leads to strong results in terms of MV, but also in terms of the mean objective $f(x)$.

4 Experiments

In this section, we experimentally validate RAHBO on two synthetic examples and two real hyperparameter tuning tasks, and compare it with the baselines. We provide an open-source implementation of our method at https://github.com/Avidereta/risk-averse-hetero-bo.

Baselines. We compare against two baselines. The first is GP-UCB with heteroscedastic noise, a standard risk-neutral algorithm that optimizes the unknown $f(x)$. The second is a risk-averse baseline that uniformly learns the variance-proxy $\rho^2(x)$ before the optimization procedure, in contrast to RAHBO, which learns the variance-proxy on the fly. We call it RAHBO-US, standing for RAHBO with uncertainty sampling. It consists of two stages: (i) uniformly learning $\rho^2(x)$ via uncertainty sampling, and (ii) GP-UCB applied to the mean-variance objective, in which the unknown $\rho^2(x)$ is replaced by the mean of the learned model. Note that RAHBO-US is the closest to the contextual BO setting [18], where the context distribution is assumed to be known.

Experimental setup. At each iteration $t$, an algorithm queries a point $x_t$ and observes the sample mean and sample variance of $k$ observations $\{y_i(x_t)\}_{i=1}^k$. We use a heteroscedastic GP for modelling $f(x)$ and a homoscedastic GP for $\rho^2(x)$. We set $\lambda = 1$ and $\beta_t = 2$, a value commonly used in practice to improve performance over the theoretical choice. Before the BO procedure, we determine the GP hyperparameters by maximizing the marginal likelihood. To this end, we use initial points that are the same for all baselines and are chosen via a Sobol sequence, which generates low-discrepancy quasi-random samples. We repeat each experiment several times, generating new initial points for every repetition.
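As a small sketch of this shared initial design, the snippet below draws Sobol points and rescales them to the search domain, assuming SciPy >= 1.7 (`scipy.stats.qmc`); the helper name and the example domain are ours.

```python
# Sketch of the shared initialization: low-discrepancy Sobol points reused
# across all baselines.
import numpy as np
from scipy.stats import qmc

def sobol_initial_design(bounds, n_init=10, seed=0):
    """Draw n_init Sobol points in a box given by bounds of shape (d, 2)."""
    bounds = np.asarray(bounds, dtype=float)
    sampler = qmc.Sobol(d=len(bounds), scramble=True, seed=seed)
    unit = sampler.random(n_init)  # in [0, 1)^d; SciPy warns if n_init is not a power of 2
    return qmc.scale(unit, bounds[:, 0], bounds[:, 1])

X_init = sobol_initial_design([[0.0, 2.0]], n_init=10)  # e.g., the sine example domain
```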
We use two metrics: (a) the risk-averse cumulative regret $R_t$, computed for the acquired inputs, and (b) the simple regret $\mathrm{MV}(x^*) - \mathrm{MV}(\hat{x}_T)$, computed for the inputs reported via Corollary 1.1. For each metric, we report its mean ± two standard errors over the repetitions.

Example function. We first illustrate the methods' performance on the sine function depicted in Figure 2a. This function has two global optimizers. We induce heteroscedastic zero-mean Gaussian noise on the measurements. We use a sigmoid function for the noise variance, as depicted in Figure 2a, that induces small noise on [0, 1] and higher noise on (1, 2]. We initialize the algorithms by selecting 10 inputs $x$ at random and keep these points the same for all algorithms. We use $k = 10$ samples at each chosen $x_t$. The number of acquisition rounds is $T = 60$. We repeat the experiment 30 times for each method and show their average performance in Figure 2.

Figure 3: GP models fitted for GP-UCB (left) and RAHBO (right) on the sine function. After initialization with the same sampled points, GP-UCB concentrates on the high-noise region, whereas RAHBO prefers small variance. Additional plots are presented in Appendix A.6.

Branin benchmark. Next, we evaluate the methods on the (negated) Branin benchmark function in Figure 1a, which achieves its optimum value $f^* = -0.4$ at $(-\pi, 12.3)$, $(\pi, 2.3)$, and $(9.4, 2.5)$. The heteroscedastic variance function illustrated in Figure 1b defines different noise variances for the three optima. We initialize all algorithms by selecting 10 inputs. We use $k = 10$ samples to estimate the noise variance. The number of acquisition rounds is $T = 150$. We repeat BO 25 times and show the results in Figures 1c and 5a. Figure 1c provides more intuition behind the observed regret: GP-UCB exploits the noisiest maxima the most, while RAHBO prefers smaller variance.

Tuning the Swiss free-electron laser. In this experiment, we tune the parameters of the Swiss X-ray free-electron laser (SwissFEL), an important scientific instrument that generates very short pulses of X-ray light and enables researchers to observe extremely fast processes. The main objective is to maximize the pulse energy measured by a gas detector, a time-consuming and repetitive task during SwissFEL operation. Such (re-)tuning takes place while user experiments on the SwissFEL are running, and thus the cumulative regret is a metric of high importance in this application. We use real SwissFEL measurements collected in [21] to train a neural network surrogate model, and use it to simulate the SwissFEL objective $f(x)$ for new parameter settings $x$. We similarly fit a model of the heteroscedastic variance by regressing the squared residuals via a GP model. Here, we focus on the calibration of the four most sensitive parameters.

We report our comparison in Figure 4, where we also assess the effect of varying the coefficient of absolute risk tolerance $\alpha$. We use 30 points to initialize the baselines and then perform 200 acquisition rounds. We repeat each experiment 15 times. In Figure 4a, we plot the empirical frequency of the true (unknown to the methods) values $f(x_t)$ and $\rho^2(x_t)$ at the inputs $x_t$ acquired by the methods. The empirical frequency for $\rho^2(x)$ illustrates the tendency of risk-neutral GP-UCB to query points with higher noise, while the risk-averse methods achieve substantially reduced variance with only a minimal reduction in mean performance. At times, risk-neutral GP-UCB also fails to query the points with the highest $f$-value.
That tendency results in lower cumulative regret for RAHBO in Figures 4c and 4d. We also compare the performance of the reporting rule from Corollary 1.1 in Figure 4b, where we plot error bars with standard deviations both for $f(\hat{x}_T)$ and $\rho^2(\hat{x}_T)$ at the reported point $\hat{x}_T$. As before, RAHBO drastically reduces the variance compared to GP-UCB, while having only slightly lower mean performance. Additional results are presented in Figure 10 in the Appendix.

Random Forest tuning. BO is widely used by cloud services for tuning machine learning hyperparameters, and the resulting models might then be used in high-stakes applications such as credit scoring or fraud detection. In k-fold cross-validation, the average metric over the validation sets is optimized, a canonical example of the repeated experiment setting that we consider in this paper. High across-folds variance is a practical problem [24] where the mean-variance approach might be beneficial. In our experiment, we tune the hyperparameters of a random forest classifier (RF) on a dataset of fraudulent credit card transactions [23] (https://www.kaggle.com/mlg-ulb/creditcardfraud). It consists of 285k transactions with 29 features (processed due to confidentiality issues) that are distributed over time, and only 0.2% are fraud examples (see the Appendix for more details). The search space for the RF hyperparameters is also provided in the Appendix. We use the balanced accuracy score and 5 validation folds, i.e., $k = 5$, where each validation fold is shifted in time with respect to the training data. We seek not only high performance on average, but also low variance across the validation folds that have different time shifts with respect to the training data; a sketch of this cross-validation objective is given after the experimental details below.

Figure 4: Experimental results for SwissFEL: (a) Distributions of $f(x)$ and $\rho^2(x)$ for all points queried during the optimization. GP-UCB queries points with higher noise (but not necessarily high return $f$), in contrast to the risk-averse methods. (b) Mean $f(\hat{x}_T)$ and variance $\rho^2(\hat{x}_T)$ at the reported $\hat{x}_T = \arg\max_{x_t} \mathrm{lcb}_T(x_t)$: for each method, we repeat the BO experiment 15 times (separate points) and plot the corresponding standard deviation error bars. RAHBO reports solutions with a reasonable mean-variance trade-off, while GP-UCB produces solutions with high mean value but also high noise variance. (c-d) Cumulative regret for $\alpha = 0.5$ and $\alpha = 1$ (see more in Appendix A.7.1).

We initialize the algorithms by selecting 10 hyperparameter settings and keep these points the same for all algorithms. We use Matérn 5/2 kernels with Automatic Relevance Determination (ARD) and normalize the input features to the unit cube. The number of acquisition rounds in one experiment is 50, and we repeat each experiment 15 times. We demonstrate our results in Figures 5b and 5c, where we plot the mean ± 2 standard errors. While both RAHBO and GP-UCB perform comparably in terms of the mean error, the standard deviation for RAHBO is smaller.
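To make the repeated-experiment view of cross-validation concrete, here is a minimal sketch of such an objective, assuming scikit-learn. `TimeSeriesSplit` is our stand-in for the paper's time-shifted folds, and the helper name and hyperparameter handling are illustrative.

```python
# Sketch of the cross-validation objective for the random-forest experiment:
# the k=5 per-fold balanced accuracies play the role of the repeated
# evaluations {y_i(x_t)}; their sample mean and variance feed the two GPs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import TimeSeriesSplit

def cv_objective(params, X, y, k=5):
    """Return sample mean/variance of balanced accuracy over k time-shifted folds.

    X, y are numpy arrays ordered in time; params is a dict of RF
    hyperparameters (assumed not to contain random_state)."""
    scores = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=k).split(X):
        model = RandomForestClassifier(**params, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        scores.append(balanced_accuracy_score(y[val_idx], model.predict(X[val_idx])))
    scores = np.asarray(scores)
    return scores.mean(), scores.var(ddof=1)  # \hat m_k and \hat s_k^2 of Eq. (12)
```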
Figure 5: Branin: (a) Cumulative regret. Random Forest: (b-c) Simple regret for the reported $\hat{x}_T = \arg\max_{x_t} \mathrm{MV}(x_t)$ for (b) $\alpha = 20$ and (c) $\alpha = 100$, compared against GP-UCB and Random Search. While both methods have comparable mean, RAHBO has consistently lower variance.

5 Conclusion

In this work, we generalize Bayesian optimization to the risk-averse setting and propose the RAHBO algorithm, which aims to find an input with both a large expected return and small input-dependent noise variance. Both the mean objective and the variance are assumed to be unknown a priori and hence are estimated online. RAHBO is equipped with theoretical guarantees showing (under reasonable assumptions) sublinear dependence on the number of evaluation rounds $T$, both for the cumulative risk-averse regret and for the $\epsilon$-accurate mean-variance metric. The empirical evaluation of the algorithm on synthetic benchmarks and hyperparameter tuning tasks demonstrates promising examples of heteroscedastic use-cases benefiting from RAHBO.

Acknowledgements

This research has been gratefully supported by NCCR Automation grant 51NF40 180545, by the ERC under the European Union's Horizon grant 815943, SNSF grant 200021_172781, and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47. The authors thank Sebastian Curi, Mojmír Mutný and Johannes Kirschner, as well as the anonymous reviewers of this paper, for their helpful feedback.

References

[1] Y. Abbasi-Yadkori. Online learning for linearly parametrized control problems. PhD thesis, Edmonton, Alberta, 2012.
[2] J. Arbel, O. Marchal, and H. D. Nguyen. On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables. arXiv:1901.09188, 2019.
[3] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In Neural Information Processing Systems (NeurIPS), 2020.
[4] E. Benhamou. A few properties of sample variance. arXiv:1809.03774, 2018.
[5] F. Berkenkamp, A. Krause, and A. P. Schoellig. Bayesian optimization with safety constraints: Safe and automatic parameter tuning in robotics. arXiv:1602.04450, 2016.
[6] M. Binois, R. B. Gramacy, and M. Ludkovski. Practical heteroscedastic Gaussian process modeling for large simulation experiments. Journal of Computational and Graphical Statistics, 2018.
[7] M. Binois, J. Huang, R. B. Gramacy, and M. Ludkovski. Replication or exploration: Sequential design for stochastic simulation experiments. Technometrics, 2019.
[8] I. Bogunovic, A. Krause, and J. Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[9] I. Bogunovic, J. Scarlett, S. Jegelka, and V. Cevher. Adversarially robust optimization with Gaussian processes. In Neural Information Processing Systems (NeurIPS), 2018.
[10] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Neural Information Processing Systems (NeurIPS), 2016.
[11] S. Cakmak, R. Astudillo, P. I. Frazier, and E. Zhou. Bayesian optimization of risk measures. In Neural Information Processing Systems (NeurIPS), 2020.
[12] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 2016.
[13] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas. Bayesian optimization in AlphaGo. arXiv:1812.06855, 2018.
[14] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), 2017.
[15] A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, H. Jianye, J. Wang, and H. B. Ammar. An empirical study of assumptions in Bayesian optimisation. arXiv:2012.03826, 2021.
[16] J. González, J. Longworth, D. C. James, and N. D. Lawrence. Bayesian optimization for synthetic gene design. arXiv:1505.01627, 2015.
[17] R.-R. Griffiths and J. M. Hernández-Lobato. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chemical Science, 2020.
[18] S. Iwazaki, Y. Inatsu, and I. Takeuchi. Mean-variance analysis in Bayesian optimization under uncertainty. In Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
[19] J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally robust Bayesian optimization. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[20] J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory (COLT), 2018.
[21] J. Kirschner, M. Mutny, N. Hiller, R. Ischebeck, and A. Krause. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (ICML), 2019.
[22] K. Korovina, S. Xu, K. Kandasamy, W. Neiswanger, B. Poczos, J. Schneider, and E. P. Xing. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. arXiv:1908.01425, 2019.
[23] Y.-A. Le Borgne and G. Bontempi. Machine Learning for Credit Card Fraud Detection - Practical Handbook. Université Libre de Bruxelles, 2021.
[24] A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and C. Archambeau. Overfitting in Bayesian optimization: an empirical study and early-stopping solution. In ICLR Workshop on Neural Architecture Search, 2021.
[25] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe. Automatic LQR tuning based on Gaussian process global optimization. In International Conference on Robotics and Automation (ICRA), 2016.
[26] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques, 1975.
[27] H. Moss, D. Leslie, D. Beck, J. González, and P. Rayson. BOSS: Bayesian optimization over string spaces. In Neural Information Processing Systems (NeurIPS), 2020.
[28] D. M. Negoescu, P. I. Frazier, and W. B. Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 2011.
[29] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Value-at-risk optimization with Gaussian processes. In International Conference on Machine Learning (ICML), 2021.
[30] T. Nguyen, S. Gupta, H. Ha, S. Rana, and S. Venkatesh. Distributionally robust Bayesian quadrature optimization. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[31] S. Ray Chowdhury and A. Gopalan. Bayesian optimization under heavy-tailed payoffs. In Neural Information Processing Systems (NeurIPS), 2019.
[32] A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Neural Information Processing Systems (NeurIPS), 2012.
[33] P. G. Sessa, I. Bogunovic, M. Kamgarpour, and A. Krause. Mixed strategies for robust optimization of unknown objectives.
In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[34] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems (NeurIPS), 2012.
[35] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.