# optimizing_conditional_valueatrisk_of_blackbox_functions__a72095d8.pdf Optimizing Conditional Value-At-Risk of Black-Box Functions Quoc Phong Nguyen, Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet Dept. of Computer Science, National University of Singapore, Republic of Singapore Dept. of Electrical Engineering and Computer Science, MIT, USA {qphong,daizhongxiang,lowkh}@comp.nus.edu.sg, jaillet@mit.edu This paper presents two Bayesian optimization (BO) algorithms with theoretical performance guarantee to maximize the conditional value-at-risk (CVa R) of a black-box function: CV-UCB and CV-TS which are based on the well-established principle of optimism in the face of uncertainty and Thompson sampling, respectively. To achieve this, we develop an upper confidence bound of CVa R and prove the no-regret guarantee of CV-UCB by utilizing an interesting connection between CVa R and value-at-risk (Va R). For CV-TS, though it is straightforwardly performed with Thompson sampling, bounding its Bayesian regret is non-trivial because it requires a tail expectation bound for the distribution of CVa R of a black-box function, which has not been shown in the literature. The performances of both CV-UCB and CV-TS are empirically evaluated in optimizing CVa R of synthetic benchmark functions and simulated real-world optimization problems. 1 Introduction A wide range of applications from Auto-ML [15] to chemistry [6] and drug design [3] require optimizing a black-box objective function (i.e., its closed-form expression, gradient, and convexity are unknown) through observing noisy function evaluations. To resolve this problem, an efficient class of algorithms, called Bayesian optimization (BO) [2, 20], has seen rapid development lately. One of the recent developments of BO considers searching for the optimization variable x that maximizes the objective function ρ[f(x, W)] [4]. Different from the classical BO, there exists an environmental random variable W that cannot be controlled, which happens frequently in real-world problems. For example, to maximize the crop yield (i.e., f(x, W)) by controlling the amount of fertilizer (i.e., x), there are various uncontrollable weather conditions such as the temperature, lighting, and rainfall (i.e., W) [13]. Thus, even though an amount of fertilizer x increases the crop yield most of the time, there exists a risk/chance that under certain weather conditions (realizations of W), the same amount of fertilizer leads to a low crop yield (an undesirable realization of the random variable f(x, W)). This issue renders the optimization of f(x, W) futile, so the work of [4] proposes to optimize a risk measure, denoted as ρ, of f(x, W) such as value-at-risk (Va R) and conditional Va R (CVa R). However, its approach suffers from two main shortcomings: the computational cost of a nested optimization procedure and the lack of theoretical performance guarantee. Although these drawbacks are addressed in the work of [13], its proposed algorithm only works for optimizing Va R of a black-box function. Hence, an efficient algorithm with theoretical performance guarantee for optimizing CVa R of a black-box function remains an open question. Nonetheless, optimizing CVa R of a black-box function is desirable, especially when the effect of optimizing CVa R cannot be obtained by optimizing Va R. A prominent property of CVa R is its sensitivity to values at the extreme tails of the random variable f(x, W), which Va R does not exhibit [19]. Let us consider two portfolio allocation schemes: one scheme has an unacceptable 35th Conference on Neural Information Processing Systems (Neur IPS 2021). worst-case value of the return and the other does not. It is possible that the two schemes have the same Va R of the return because Va R is insensitive to the extreme tails of the return (e.g., the worst-case value of the return). On the other hand, CVa R is able to distinguish the risks of the two schemes. This advantage of CVa R is further clarified in Remark 1. In this paper, we adopt the settings of [4, 13]: the distribution of W is given and we can control/select the realization w of W in the optimization procedure. In practice, the distribution of W can be estimated from data (e.g., weather historical data in the crop yield optimization example) and we can perform BO in a laboratory or using computer simulation where W is controlled [4, 13]. After the optimization, the optimal x can be used in the real-world environment with the uncontrollable W. Our main contribution is to propose two computationally efficient algorithms (Sec. 3) with theoretical performance guarantee (Sec. 4) for optimizing CVa R of a black-box function. First, we propose CVa R optimization with upper confidence bound (CV-UCB) that is based on the well-established principle of optimism in the face of uncertainty. By exploiting the connection between CVa R and Va R, we are able to develop a confidence bound of CVa R (Sec. 3.1.1) and prove the no-regret guarantee of CV-UCB (Sec. 4.1). Second, we propose CVa R optimization with Thompson sampling (CV-TS) which can be naturally extended to handle batch queries (i.e., gathering observations at a batch of inputs in each BO iteration). The capability of gathering observations in a batch simultaneously is often available and preferable in laboratory settings and computer simulations. Though CV-TS can be simply performed with the popular Thompson sampling (or posterior sampling) [18] (Sec. 3.1.2), bounding its Bayesian regret is challenging as the distribution of CVa R in our problem has not been investigated before and its support is unbounded. Fortunately, we are able to relate the problem of bounding the tail expectation of CVa R to that of the function evaluation f(x, w) (Sec. 4.2). The latter follows a Gaussian distribution with known tail expectation, unlike that of CVa R. We empirically evaluate the performance of both CV-UCB and CV-TS in optimizing CVa R of synthetic benchmark functions, an optimization problem of the residuary resistance per unit weight of displacement of a yacht, a portfolio optimization problem, and a simulated robot task (Sec. 5). Related works. Existing solutions to optimizing a black-box function f(x, W) with an environmental random variable W are different in the assumptions about W and the objective function. Other than the aforementioned works of [4, 13] where we adopt the assumptions that the distribution of W is known and w can be selected during the optimization procedure, the work of [9] also proposes an efficient algorithm with theoretical performance guarantee under the same assumptions to maximize the probabilistic threshold robustness measure. This measure requires a threshold of the desirable performance, instead of a risk level (i.e., the probability of undesirable performance) in CVa R and Va R. Another related work is adversarially robust BO [1] which maximizes f(x, w) with the worst-case realization w of W. It can be cast as an optimization problem of Va R of a black-box function [13]. The work of [10] assumes that the distribution of W is given, but W is sampled from its distribution during the optimization. Its objective is to balance between the mean and the variance of f(x, W) from different perspectives: multi-task learning, multi-objective optimization, and constrained optimization. The work of [22] maximizes Va R and CVa R of a black-box function, yet, it assumes W is unobservable and its distribution is unknown. It suffers from an approximate posterior belief and the lack of theoretical performance guarantee. The works of [12, 14] (namely, distributional robust BO) assume that the unknown distribution of W belongs to an uncertainty set of distributions. They maximize the mean of f(x, W) under the worst-case realization of the unknown distribution of W in the uncertainty set. 2 Background 2.1 Conditional Value-at-Risk Let the black-box function of interest be a real-valued function f(x, w) where x X Rm and w W Rn. Let W denote the environmental random variable whose probability distribution is specified by p(W = w) for all w W. Let f(x, W) denote the random variable representing the function evaluation at x and a random realization of W. The conditional value-at-risk (CVa R) of f(x, W) at risk level α (0, 1) is defined as: cf(x; α) E [f(x, W)| f(x, W) vf(x; α)] (1) which is the expectation of f(x, W) over function values that are at most the value-at-risk (Va R) at the same risk level α defined as vf(x; α) inf{ω : P(f(x, W) ω) α}. cf(x0; α) = E[f(x0, W)| f(x0, W) vf(x0; α)] vf(x0; α) f(x0, W) p(f(x0, W)) cf(x1; α) = E[f(x1, W)| f(x1, W) vf(x1; α)] vf(x1; α) f(x1, W) p(f(x1, W)) Figure 1: Plot of the lower tails of f(x0, W) and f(x1, W) with their Va R and CVa R at risk level α. The shaded area is α. It is observed that vf(x0; α) = vf(x1; α) but cf(x0; α) > cf(x1; α). Remark 1 (CVa R vs. Va R). While Va R at α is interpreted as the α-percentile of f(x, W), CVa R is interpreted as the expectation of function values that are at most Va R. Therefore, CVa R is sensitive to the function values less than Va R, while Va R is not. Fig. 1 shows an example of function evaluations at x0 and x1 such that their Va R values (i.e., their α-percentiles) are the same. We observe that Va R is insensitive to the tails of f(x, W) which are different in this case (see the region around the black dot). In contrast, cf(x0; α) > cf(x1; α) which conveys the idea that f(x0, W) has a lower risk of low function values as the probability density of function values less than Va R concentrates in a region nearer to Va R than that of f(x1, W). Thus, CVa R is more risk-aversed than Va R, especially when the distribution of f(x, W) has a heavy lower tail, which makes it more suitable for critical applications. In the portfolio optimization problem, if the distribution of the portfolio s return has negative outliers (which are captured in CVa R), the average return in the long run can be negative even though the Va R of the return is positive. In this work, we assume that |W| is finite to simplify the theoretical analysis (described in Sec. 4) and the evaluation of CVa R. The latter requires the computation of the cumulative probability mass of f(x, W) by enumerating all possible realizations of f(x, W) (described in [4]). 2.2 Bayesian Optimization of CVa R and Gaussian Processes BO of CVa R. In this paper, we would like to maximize CVa R cf(x; α) (1) of f(x, W) given the distribution p(W) by iteratively gathering observations as noisy function evaluations y(x, w) f(x, w) + ϵ(x, w) where ϵ(x, w) N(0, σ2 n). The gist of BO of CVa R is a strategy to select the input query (xt, wt) at iteration t such that the maximizer x argmaxx X cf(x; α) is found as rapidly as possible. Note that while W is controllable in BO, after the optimization, the optimal x can be used in real-world settings where W is uncontrollable. At iteration t, we have all observations at input queries in the previous t 1 iterations, denote as Dt 1 = Dt 2 {(xt 1, wt 1)} (D0 denotes the set of initial observed inputs). These observations y Dt 1 (y(x, w))(x,w) Dt 1 are used to construct the posterior belief/distribution of the underlying black-box function f(x, w) which is utilized in building the query selection strategy. Let us revisit Fig. 1: cf(x0; α) > cf(x1; α) implies that x0 is preferred to x1. It is because the risk/chance of getting low function values at x0 is smaller (see the region around the black dot). In contrast, Va R cannot differentiate the risk of f(x0, W) and that of f(x1, W) since their Va R values are the same. Algorithm 1 BO Algorithms for optimizing CVa R of a black-box function 1: Input: algo, X, W, initial observation y D0, prior µ0 = 0, σn, κ 2: for t = 1, 2, . . . do 3: {Selecting xt} 4: if algo is CV-UCB then 5: Select xt argmax x cut 1(x; α) 6: else if algo is CV-TS then 7: Sample a function f1 from the GP posterior belief given y Dt 1 8: Select xt argmax x cf1(x; α) 9: end if 10: {Selecting wt} 11: Find αt arg max α (0,α] vut 1(xt; α ) vlt 1(xt; α ) 12: Given αt, select wt as an LV w.r.t. xt, ut 1, and lt 1 13: {Collecting data and updating GP} 14: Incorporate new observation at input query: y Dt = y Dt 1 {y(xt, wt)} 15: Update the GP posterior belief given y Dt 16: end for Gaussian process (GP). In order to obtain the posterior belief of f(x, w) given noisy observations y Dt 1, we model the black-box function f with a GP: every finite subset of {f(x, w)}(x,w) X W follows a multivariate Gaussian distribution [17]. Thus, GP is fully specified by its prior mean which is assumed to be zero and its covariance function/kernel κ(x, w; x , w ) cov[f(x, w), f(x , w )]. Given noisy observations y Dt 1 at iteration t, the posterior belief p(f(x, w)|y Dt 1) is a Gaussian specified by the following mean and variance: µt 1(x, w) Kx,w;Dt 1ΛDt 1y Dt 1 σ2 t 1(x, w) κ(x, w) Kx,w;Dt 1ΛDt 1KDt 1;x,w (2) where κ(x, w) κ(x, w; x, w), Kx,w;Dt 1 (κ(x, w; x , w ))(x ,w ) Dt 1, KDt 1;x,w K x,w;Dt 1, and ΛDt 1 KDt 1Dt 1 + σ2 n I 1 (I is the identity matrix and KDt 1Dt 1 (κ(x , w ; x , w ))(x ,w ,x ,w ) Dt 1 Dt 1). Although f(x, w) follows a Gaussian distribution, CVa R cf(x; α) (defined in (1)) does not. Furthermore, although |W| is finite, the distribution of CVa R is continuous because the black-box function evaluation is real-valued. Besides, as the support of f(x, W) is unbounded, so is CVa R. These points pose challenges in deriving the following 2 algorithms and their theoretical analyses. 3 BO Algorithms for Optimizing CVa R of a Black-Box Function In this section, we present two BO algorithms for optimizing CVa R of a black-box function f(x, W): CVa R optimization with upper confidence bound (CV-UCB) and CVa R optimization with Thompson sampling (CV-TS). These two algorithms are different in the way of handling the explorationexploitation trade-off in the selection of xt to find argmaxx X cf(x; α) (Sec. 3.1.1 and Sec. 3.1.2). On the other hand, given the selected xt, selecting wt is only about reducing the uncertainty of CVa R cf(xt; α). Therefore, we would like to design a single selection strategy of wt that works for both CV-UCB and CV-TS (Sec. 3.2). 3.1 Selection Strategy of xt 3.1.1 A UCB-based Approach to Select xt in CV-UCB In (1), CVa R cf(x; α) at risk level α of f(x, W) is defined as the expectation of function values that are at most Va R at risk level α [4]. While this definition (1) is interpretable, it is challenging to devise a no-regret BO algorithm based on (1). It is because (1) requires learning about not only (i) Va R vf(x; α) but also (ii) the probabilities of function values less than vf(x; α). Even though the former (i.e., Va R vf(x; α)) can be optimized using the V-UCB algorithm in [13], it remains a challenge to learn about the latter. Interestingly, in this section, we present an approach that does not resort to handling the above two problems (i) and (ii) separately. To achieve this, we employ an alternative definition of CVa R which is an expectation of Va R over the risk level: cf(x; α) = 1 0 vf(x; α ) dα . (3) Although (1) is more interpretable and is often used to evaluate CVa R, the above definition (3) paves the way to our selection strategy of xt in CV-UCB: by utilizing (3), we are able to construct an upper confidence bound of CVa R relying on that of Va R. Recall that the work of [13] proposes a confidence bound of Va R at iteration t, denoted as It 1[vf(x; α )]: It 1[vf(x; α )] [vlt 1(x; α ), vut 1(x; α )] x X α (0, 1) (4) where vlt 1(x; α ) and vut 1(x; α ) are Va R (due to the randomness in W) of lt 1(x, W) and ut 1(x, W), respectively. At iteration t, for all (x, w) X W, [lt 1(x, w), ut 1(x, w)] is a confidence bound of f(x, w) (due to the uncertainty in the unknown black-box function f) depending on an exploration parameter βt [13]: lt 1(x, w) µt 1(x, w) β1/2 t σt 1(x, w) ut 1(x, w) µt 1(x, w) + β1/2 t σt 1(x, w) (5) where µt 1(x, w) and σt 1(x, w) are defined in (2). The exploration parameter βt is to balance between exploitation (based on the GP posterior mean µt 1(x, w)) and exploration (based on the GP posterior standard deviation σt 1(x, w)), which is specified later in Lemma 1. From (3), a confidence bound It 1[cf(x; α)] of CVa R follows the confidence bound It 1[vf(x; α )] of Va R (4): It 1[cf(x; α)] [clt 1(x; α), cut 1(x; α)] 1 0 vlt 1(x; α ) dα , 1 0 vut 1(x; α ) dα . (6) Given the above upper confidence bound cut 1(x; α) of CVa R, we select xt as its maximizer (line 5 of Algorithm 1). The time complexity of the selection strategy of xt in CV-UCB is O(|Dt 1|3 + ntrain(|W||Dt 1|2 + |W| log |W|)) where ntrain is the number of gradient descent steps to find argmaxx cut 1(x; α), O(|Dt 1|3) is to compute ΛDt 1 in (2), O(|W||Dt 1|2) is the GP prediction time complexity, and O(|W| log |W|) is to evaluate CVa R. 3.1.2 A Thompson Sampling Approach to Select xt in CV-TS An alternative approach to select xt is based on Thompson sampling [18]: xt is selected as a sample of the maximizer x of CVa R cf(x; α). Unlike the selection strategy of xt in CV-UCB that requires an exploration parameter βt (Sec. 3.1.1), the exploration-exploitation trade-off in Thompson sampling is naturally handled by the randomness in the sampling of the maximizer of cf(x; α). This approach can also be extended to handle a batch query at each BO iteration by drawing a batch of samples of the maximizer of cf(x; α). While unexplored for BO of CVa R, batch queries are popular in BO literature [8, 11]. A sample of the maximizer of cf(x; α) is obtained by the following 2 steps (lines 7-8 of Algorithm 1). First, we draw a function sample f1 from the GP posterior belief by using random Fourier feature approximation [16] which is also employed in classical BO works [7, 23] (line 7 of Algorithm 1). Second, given f1, we maximize CVa R cf1(x; α) to obtain its maximizer. This is a sample of the maximizer of cf(x; α) which is assigned to xt in CV-TS (line 8 of Algorithm 1). The time complexity of the selection strategy of xt in CV-TS is O(|Dt 1|3 + ntrain(n2 feature + |W| log |W|)) where nfeature is the number of random Fourier samples, ntrain is the number of gradient descent steps to find argmaxx cf1(x; α), O(|Dt 1|3) is to compute ΛDt 1 in (2) (which dominates the time complexity to draw a function f1 from the GP posterior belief), O(n2 feature) is the time complexity to evaluate f1, and O(|W| log |W|) is to evaluate CVa R. 3.2 Selection Strategy of wt Shared by Both CV-UCB & CV-TS Given the selected xt at iteration t, we would like to propose a single strategy of selecting wt to reduce the uncertainty of CVa R cf(xt; α) for both CV-UCB and CV-TS that still guarantees no-regret for both algorithms as shown in Sec. 4. Our main idea is to prioritize learning about Va R at risk level αt which has the highest uncertainty among Va R vf(x; α ) at risk levels α (0, α]. In turn, it reduces the uncertainty of CVa R from (3). We quantify the uncertainty of Va R by the size of its confidence bound (4). Thus, αt is defined as (line 11 of Algorithm 1) αt arg max α (0,α] vut 1(xt; α ) vlt 1(xt; α ) . (7) To reduce the uncertainty of vf(xt; αt) quantified by the size of its confidence bound vut 1(xt; αt) vlt 1(xt; αt), let us view vf(xt; αt) as the αt-percentile of f(xt, W). There is no uncertainty in the αt-percentile of f(xt, W) if the set of w where f(xt, w) is at most the αt-quantile of f(xt, W) (i.e., the lower tail of f(xt, W) upper bounded by its αt-percentile) remains unchanged under different realizations of f, especially its lower confidence bound lt 1 and its upper confidence bound ut 1. On the contrary, if there exists w LV that violates the above statement due to the uncertainty of f(xt, w LV) (due to the unknown f), we would like to gather observation at (xt, w LV) to reduce the uncertainty of its function evaluation. Such w LV is formally defined as the lacing value (LV) which is shown to exist in [13].1 Definition 1 (Lacing values [13]). Lacing values (LV) with respect to α (0, 1), x X, lt 1, and ut 1 are w LV W that satisfies lt 1(x, w LV) vlt 1(x; α) vut 1(x; α) ut 1(x, w LV), equivalently, It 1[vf(x; α)] [lt 1(x, w LV), ut 1(x, w LV)]. Thus, we select wt as an LV w.r.t. αt, xt, lt 1, and ut 1. If there are multiple LVs, we select the LV with the maximum probability p(W). It is a heuristic to improve the performance suggested by [13]. For CV-TS, to avoid repeating inputs in the batch query, we can select wt as a sample of the random variable W by restricting its support to the LVs (if there are multiple LVs). 4 Theoretical Analysis This section presents our theoretical analyses of CV-UCB and CV-TS. We choose the exploration parameter βt in (5) following a lemma in [21]. We assume a discrete domain X for simplicity (and |W| is finite). Lemma 1 (Lemma 5.1 in [21]). Pick δ (0, 1) and set βt = 2 log(|X||W|πt/δ), where P t 1 π 1 t = 1, πt > 0. Then, |f(x, w) µt 1(x, w)| β1/2 t σt 1(x, w) (x, w) X W t 1 (8) holds with probability 1 δ. Thus, given the above βt, the function evaluation, Va R, and CVa R are in their confidence bounds (in (5), (4), and (6), respectively) with probability 1 δ. The performance of CV-UCB is measured by the cumulative regret adopted from existing BO works, e.g., in [21, 13]: RT PT t=1 rt which is the sum over instantaneous regrets, denoted as rt cf(x ; α) cf(xt; α) where x argmaxx X cf(x; α) is the optimal value of the optimization variable. In this section, we would like to show that the cumulative regret of CV-UCB is sublinear so that lim T RT /T = 0. Since CV-UCB selects xt as the maximizer of the upper confidence bound of CVa R (line 5 of Algorithm 1), it can be shown in Appendix A that the instantaneous regret is bounded by rt cut 1(xt; α) clt 1(xt; α) (9) with probability 1 δ. Then, by exploiting the relationship between CVa R and Va R in (3) and the fact that the average of a set of values is at most the maximum value of the set, the bound on rt can be simplified to the size of a confidence bound of Va R at risk level αt (7) in Appendix A: rt vut 1(xt; αt) vlt 1(xt; αt) (10) 1Shrewd readers may notice a difference between the intuition and definition of LV ( vs. <). It is to handle the case when there is only 1 LV. holds with probability 1 δ. Furthermore, utilizing the property of wt as an LV in Definition 1, it can be shown in Appendix A that rt 2β1/2 t σt 1(xt, wt) (11) with probability 1 δ. Then, we follow [21] to construct the following theorem which shows a sublinear regret bound for CV-UCB (detailed in Appendix B). Theorem 1. By selecting xt as the maximizer of the upper confidence bound cut 1(x; α) (line 5 of Algorithm 1) and selecting wt as an LV w.r.t. αt (7), xt, lt 1, and ut 1 (line 12 of Algorithm 1), C1TβT γT (12) holds with probability 1 δ where C1 8/ log(1 + σ 2 n ), γT is the maximum information gain about f that can be obtained by observing any set of T observations (which is bounded in [21] for several commonly used kernels), βT and δ are defined in Lemma 1. The performance of CV-TS is measured by the Bayesian cumulative regret [18]: RBayes T E h PT t=1 cf(x ; α) cf(xt; α) i where the expectation includes the randomness in the GP prior of f, the observation noise, and xt. Let us denote r Bayes t E[cf(x ; α) cf(xt; α)|y Dt 1] where the expectation includes the randomness in the GP posterior of f given y Dt 1 and xt, then RBayes T = E h PT t 1 r Bayes t i . In this section, we would like to show that the Bayesian cumulative regret of CV-TS is sublinear so that lim T RBayes T /T = 0. Following [18], we can decompose r Bayes t into (to ease the notation, we omit y Dt 1): r Bayes t E[ lower c (xt; α)] + E[ upper c (x ; α)] + E[cut 1(xt; α) clt 1(xt; α)] (13) where lower c (xt; α) max(0, clt 1(xt; α) cf(xt; α)) and upper c (x ; α) max(0, cf(x ; α) cut 1(x ; α)) (shown in Appendix C). From Appendix A, cut 1(xt; α) clt 1(xt; α) is bounded by 2β1/2 t σt 1(xt, wt) provided that wt is an LV w.r.t. αt (7), xt, lt 1, and ut 1 (line 12 of Algorithm 1). Thus, we are able to obtain a bound on the last term of the decomposed r Bayes t (13): E[cut 1(xt; α) clt 1(xt; α)] 2β1/2 t σt 1(xt, wt) . (14) However, the challenge remains in bounding the first two terms (i.e., E[ lower c (xt; α)] and E[ upper c (x ; α)]) of the decomposed r Bayes t (13). These terms are the expectations over the tails of CVa R. As suggested by the seminal work of [18], bounding these terms may require CVa R to have a bounded support or a strong tail decay. However, in our BO formulation where f is modelled with a GP, the support of CVa R is unbounded. Thus, to derive the bound of the tail expectation of the unfamiliar distribution of CVa R, our key idea is to relate it to the tail expectation of a Gaussian distribution through 3 steps: first, relating this tail expectation of CVa R to that of Va R in Lemma 2, second, relating the tail expectation of Va R to the tail probability of Va R in (16), and third, relating the tail probability of Va R to that of function evaluation f(x, w) in Lemma 3. As f(x, w) follows a Gaussian distribution, we can bound its tail expectation. Lemma 2 (From tail expectation of CVa R to that of Va R). For all x X, E lower c (x; α) 1 0 E lower v (x; α ) dα (15) where lower v (x; α ) max 0, vlt 1(x; α ) vf(x; α ) (shown in Appendix D). Furthermore, as lower v (x; α ) is a non-negative random variable, its expectation in (15) can be expressed as an integration over the tail probabilities: α (0, 1), x X, E lower v (x; α ) = Z 0 P( lower v (x; α ) > ω) dω = Z 0 P(vlt 1(x; α ) vf(x; α ) > ω) dω . Then, we are able to relate the tail probability P(vlt 1(x; α ) vf(x; α ) > ω) to the tail probability of the function evaluation f(x, w) using our key observation on the relationship between the 2 events: (a) Va R vf(x; α ) falls below its lower confidence bound vlt 1(x; α ) by ω 0 and (b) the function evaluation f(x, w) falls below its lower confidence bound lt 1(x, w) by ω in the following lemma (proved in Appendix E). Lemma 3. Consider a realization f1 of the black-box function f following the GP posterior belief given y Dt 1 that satisfies vlt 1(x; α ) vf1(x; α ) > ω for α (0, 1), x X, and ω 0. Let Wupper lt 1 {w W : lt 1(x, w) vlt 1(x; α )}. Then, w0 Wupper lt 1 , lt 1(x, w0) f1(x, w0) > ω . As the tail expectation of the Gaussian random variable f(x, w) can be evaluated, we can utilize the above results ((15), (16), and Lemma 3) to obtain a bound of E[ lower c (x; α)] (and similarly, a bound of E[ upper c (x; α)]) for all x X. These bounds and (14) lead to the following theorem on the Bayesian cumulative regret bound (proved in Appendix F). Theorem 2. Assuming a bounded GP prior standard deviation: κ(x, w) 1 (x, w) X W, by selecting xt as a sample of the maximizer of cf(x; α) (lines 7-8 of Algorithm 1) and selecting wt as an LV w.r.t. αt (7), xt, lt 1, and ut 1 (line 12 of Algorithm 1), then the Bayesian cumulative regret is bounded by 2 |X| π + p C1TβT γT (17) where C1 8/ log(1 + σ 2 n ), γT is the maximum information gain about f that can be obtained by observing any set of T observations, βT and δ are defined in Lemma 1. Remark 2. We can extend CV-TS to select a batch query of size k (instead of a single query) at each BO iteration by drawing k samples of the maximizer of CVa R and finding their corresponding αt and LVs (detailed in Appendix G). We can also show that the difference between the average (over the number of observations) of the Bayesian cumulative regret bound of CV-TS with batch queries and that of CV-TS with single queries is small for a large number of observations (Appendix G), which is similar to a result in [11] for the classical BO with Thompson sampling. However, with a larger batch size k, more observations can be obtained given the same number of BO iterations. Thus, the use of CV-TS with a large batch size k is advantageous if obtaining observations at the batch query is feasible, e.g., by running multiple simulations in parallel. Remark 3. Our CV-TS algorithm and its theoretical analysis can be utilized to construct a BO algorithm to optimize Va R of a black-box function. Like CV-TS, this algorithm is also capable of handling batch queries. Since it is beyond the scope of this work (which is about CVa R), we refer readers interested in optimizing Va R of a black-box function to Appendix H. 5 Experiments We empirically evaluate the performance of CV-UCB and CV-TS by comparing them with the work of [4]. To the best of our knowledge, it is the only existing work that optimizes CVa R of a black-box function by selecting both x and w. In particular, the ρKGapx algorithm is selected as a baseline thanks to its time efficiency and competitive empirical performance as demonstrated in [4]. There are 2 variants of CV-TS in the experiments: CV-TS k = 1 which selects 1 input query at each iteration and CV-TS k = 3 which selects 3 input queries at each iteration. Each experiment is repeated 10 times with different random seeds to account for the randomness in the observation noise, the set of initial observations, and the sampling from the GP posterior belief. Both the average and the confidence interval of the logarithm of the inference regret (taken from [7, 23]) are reported. The hyperparameters of GP and the noise variance are learned from the observations using maximum likelihood estimation [17]. Further details on the experiments are described in Appendix I.2 Figs. 2a-d show the results of optimizing CVa R of synthetic benchmark functions with low input dimensions (the dimension of x is m = 1 and of W is n = 1): Branin-Hoo and Goldstein-Price; and 2The code is available at https://github.com/qphong/Bayes Opt-LV. 0 10 20 30 Iteration log10 Regret 0 20 40 60 Iteration log10 Regret 0 25 50 75 Iteration log10 Regret 0 20 40 Iteration log10 Regret (a) Branin-Hoo (b) Goldstein-Price (c) Hartmann-6D m = 5 (d) Hartmann-6D m = 1 0 20 40 Iteration log10 Regret 0 20 40 60 Iteration log10 Regret 0 20 40 Iteration log10 Regret (e) Yacht hydrodynamics (f) Portfolio optimization (g) Robot pushing experiment Figure 2: Plots of the regret against the BO iteration in (a-d) experiments that optimize CVa R of synthetic benchmark functions, (e-g) simulated real-world optimization problems. with moderate input dimensions (m = 5, n = 1; and m = 1, n = 5): Hartmann-6D. The risk level α is set to 0.1. We observe that CV-UCB achieves comparable performance to the baseline ρKGapx in the Branin-Hoo experiment, while it outperforms ρKGapx in the Goldstein-Price and Hartmann-6D (m = 5) experiments. It is because the evaluation of ρKGapx is approximated with samples of f and the nested optimization procedure involved in ρKGapx is simplified. CV-TS k = 1 does not perform as well as other algorithms do. It is because with few observations, CV-TS k = 1 tends towards randomly exploring X so the maximizer is not accurately located. Furthermore, as the GP hyperparameters are assumed to be unknown in our experiments, they can be incorrectly estimated with few observations, which affects the performance of CV-TS k = 1. When we increase the size of the batch query to 3 in CV-TS k = 3, the number of observations increases. In turn, the estimation of the GP hyperparameters and the exploration-exploitation trade-off are improved. Thus, CV-TS k = 3 achieves a superior performance. It suggests that CV-TS with a large batch query should be preferred when multiple observations can be obtained, e.g., by running multiple simulations in parallel. In contrast, CV-UCB should be preferred if only one observation is obtained at each BO iteration. Figs. 2e-g show the results of optimizing CVa R in an optimization problem using the yatch hydrodynamics dataset [5], a portfolio optimization problem [4], and a simulated robot pushing experiment [23]. In the experiment with the yacht hydrodynamics dataset, we would like to minimize the residuary resistance per unit weight of displacement of a yacht by searching for the optimal hull geometry coefficients of the yacht in the face of the uncertainty in the Froude number (the Froude number depends on the real-world environment and we assume that it can be simulated with computers during the optimization). The ground truth function is constructed using the yacht hydrodynamics data set [5]. The dimension of the input variables x and W are m = 5 and n = 1 (the Froude number), respectively. The risk level of CVa R is set to α = 0.3. While CV-TS k = 3 outperforms the other algorithms significantly thanks to its batch queries, CV-UCB converges to a lower regret than ρKGapxdoes. The portfolio optimization problem is taken from the work of [4] where the evolution of a portfolio over 4 years is simulated and optimized using open-source market data. The optimization variables (i.e., x) include the risk, the trade aversion, and the holding cost multiplier, while the environmental random variables (i.e., W) include the bid-ask spread and the borrow cost [4]. The risk level of CVa R in this problem is set to α = 0.2. We observe that CV-UCB and CV-TS k = 3 achieve the best performance by converging to lower regret values than the other algorithms. The simulated robot pushing experiment is taken from [23] whose task is to minimize the distance of a pushed object to a fixed goal location by controlling the robot location and the pushing duration (i.e., x). We adopt the setting of [13] to introduce random perturbations to the robot location (i.e., W). The risk level of CVa R is set to α = 0.1. While all algorithms converge to roughly the same value of the regret in this experiment, we observe that CV-TS k = 3 shows its advantage in acquiring more observations by converging faster than the others. Besides, CV-UCB also converges faster than ρKGapxin this experiment. 6 Conclusion We propose two BO algorithms to optimize CVa R of a black-box function with theoretical performance guarantee: CV-UCB and CV-TS by taking advantage of a connection between CVa R and Va R and establishing a link between the tail expectation of (non-Gaussian and unbounded support) CVa R and that of the Gaussian-distributed function evaluation. The competitive empirical performance of CV-UCB and CV-TS with batch queries are shown in optimizing CVa R of synthetic benchmark functions and simulated real-world optimization problems. These results recommend the use of CV-UCB to optimize CVa R of a black-box function when only an observation is obtained at each BO iteration and the use of CV-TS with large batch queries when multiple observations can be obtained at each BO iteration. Acknowledgments and Disclosure of Funding This research is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Funds (Award A20H6b0151). [1] I. Bogunovic, J. Scarlett, S. Jegelka, and V. Cevher. Adversarially robust optimization with Gaussian processes. In Proc. Neur IPS, pages 5760 5770, 2018. [2] E. Brochu, T. Brochu, and N. de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Proc. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103 112, 2010. [3] B. Burger, P. M. Maffettone, V. V. Gusev, C. M. Aitchison, Y. Bai, X. Wang, X. Li, B. M. Alston, B. Li, R. Clowes, et al. A mobile robotic chemist. Nature, 583(7815):237 241, 2020. [4] S. Cakmak, R. Astudillo, P. I. Frazier, and E. Zhou. Bayesian optimization of risk measures. In Proc. Neur IPS, 2020. [5] D. Dua and C. Graff. UCI machine learning repository, 2017. [6] F. Häse, L. M. Roch, C. Kreisbeck, and A. Aspuru-Guzik. Phoenics: a Bayesian optimizer for chemistry. ACS central science, 4(9):1134 1145, 2018. [7] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Proc. NIPS, pages 918 926, 2014. [8] J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proc. ICML, pages 1470 1479, 2017. [9] S. Iwazaki, Y. Inatsu, and I. Takeuchi. Bayesian quadrature optimization for probability threshold robustness measure. ar Xiv preprint ar Xiv:2006.11986, 2020. [10] S. Iwazaki, Y. Inatsu, and I. Takeuchi. Mean-variance analysis in Bayesian optimization under uncertainty. ar Xiv preprint ar Xiv:2009.08166, 2020. [11] K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos. Parallelised Bayesian optimisation via Thompson sampling. In Proc. AISTATS, pages 133 142, 2018. [12] J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally robust Bayesian optimization. In Proc. AISTATS, 2020. [13] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Value-at-risk optimization with Gaussian processes. In Proc. ICML, pages 8063 8072, 2021. [14] T. T. Nguyen, S. Gupta, H. Ha, S. Rana, and S. Venkatesh. Distributionally robust Bayesian quadrature optimization. In Proc. AISTATS, 2020. [15] V. Perrone, H. Shen, A. Zolic, I. Shcherbatyi, A. Ahmed, T. Bansal, M. Donini, F. Winkelmolen, R. Jenatton, J. B. Faddoul, et al. Amazon Sage Maker automatic model tuning: Scalable black-box optimization. ar Xiv preprint ar Xiv:2012.08489, 2020. [16] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proc. NIPS, pages 1177 1184, 2008. [17] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006. [18] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221 1243, 2014. [19] S. Sarykalin, G. Serraino, and S. Uryasev. Value-at-risk vs. conditional value-at-risk in risk management and optimization. In State-of-the-art decision-making tools in the informationintensive age, pages 270 294. Informs, 2008. [20] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148 175, 2015. [21] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pages 1015 1022, 2010. [22] L. Torossian, V. Picheny, and N. Durrande. Bayesian quantile and expectile optimisation. ar Xiv preprint ar Xiv:2001.04833, 2020. [23] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proc. ICML, pages 3627 3635, 2017.