# Value-at-Risk Optimization with Gaussian Processes

Quoc Phong Nguyen 1, Zhongxiang Dai 1, Bryan Kian Hsiang Low 1, Patrick Jaillet 2

1 Department of Computer Science, National University of Singapore, Republic of Singapore. 2 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA. Correspondence to: Quoc Phong Nguyen.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Value-at-risk (VAR) is an established measure to assess risks in critical real-world applications with random environmental factors. This paper presents a novel VAR upper confidence bound (V-UCB) algorithm for maximizing the VAR of a black-box objective function with the first no-regret guarantee. To realize this, we first derive a confidence bound of VAR and then prove the existence of values of the environmental random variable (to be selected to achieve no regret) such that the confidence bound of VAR lies within that of the objective function evaluated at such values. Our V-UCB algorithm empirically demonstrates state-of-the-art performance in optimizing synthetic benchmark functions, a portfolio optimization problem, and a simulated robot task.

1. Introduction

Consider the problem of maximizing an expensive-to-compute black-box objective function f that depends on an optimization variable x and an environmental random variable Z. Due to the randomness in Z, the function evaluation f(x, Z) of f at x is a random variable. Though for such an objective function f, Bayesian optimization (BO) can be naturally applied to maximize its expectation $\mathbb{E}_Z[f(x, Z)]$ over Z (Toscano-Palmerin & Frazier, 2018), this maximization objective overlooks the risks of potentially undesirable function evaluations. These risks can arise from either (a) the realization of an unknown distribution of Z or (b) the realization of the random Z given that the distribution of f(x, Z) can be estimated well or that of Z is known. Issue (a) has been tackled by distributionally robust BO (Kirschner et al., 2020; Nguyen et al., 2020), which maximizes $\mathbb{E}_Z[f(x, Z)]$ under the worst-case realization of the distribution of Z. To resolve issue (b), the risk from the uncertainty in Z can be controlled via the mean-variance optimization framework (Iwazaki et al., 2020), value-at-risk (VAR), or conditional value-at-risk (CVAR) (Cakmak et al., 2020; Torossian et al., 2020). The work of Bogunovic et al. (2018) has considered adversarially robust BO, where z is controlled by an adversary deterministically.[1] In this case, the objective is to find x that maximizes the function under the worst-case realization of z, i.e., $\arg\max_x \min_z f(x, z)$.

[1] We use the upper-case letter Z to denote the environmental random variable and the lower-case letter z to denote its realization or a (non-random) variable.

In this paper, we focus on case (b) where the distribution of Z is known (or well-estimated). For example, in agriculture, although farmers cannot control the temperature of an outdoor farm, its distribution can be estimated from historical data and controlled in an indoor environment for optimizing the plant yield. Given the distribution of Z, the objective is to control the risk that the function evaluation f(x, z), for a z sampled from Z, is small. One popular framework is to control the trade-off between the mean (viewed as reward) and the variance (viewed as risk) of the function evaluation with respect to Z (Iwazaki et al., 2020).
However, quantifying the risk using variance implies indifference between positive and negative deviations from the mean, while people often have asymmetric risk attitudes (Goh et al., 2012). In our problem of maximizing the objective function, it is reasonable to assume that people are risk-averse towards only the negative deviations from the mean, i.e., the risk of getting low function evaluations. Thus, it is more appropriate to adopt risk measures with this asymmetric property, such as value-at-risk (VAR), which is a widely adopted risk measure in real-world applications (e.g., in banking (Basel Committee on Banking Supervision, 2006)). Intuitively, the risk that the random f(x, Z) is less than its VAR at level α ∈ (0, 1) does not exceed α. Hence, by specifying a small value such as α = 0.1, this risk is controlled to be at most 10%. Therefore, to maximize f while controlling the risk of undesirable function evaluations, we aim to maximize the VAR of the random function f(x, Z) over x.

The recent work of Cakmak et al. (2020) has used BO to maximize VAR and has achieved state-of-the-art empirical performances. They have assumed that we are able to select both x and z to query during BO, which is motivated by the fact that physical experiments can be studied by simulation (Williams et al., 2000). In the above agriculture example, we can control the temperature, lighting, and water (z) in an indoor environment to search for the optimal amount of fertilizer (x), which can then be used in an outdoor environment with random weather factors. The work of Cakmak et al. (2020) has exploited the ability to select z to model the function f(x, z) as a GP, which allows them to retain the appealing closed-form posterior belief of the objective function. To select the queries x and z, they have designed a one-step lookahead approach based on the well-known knowledge gradient (KG) acquisition function (Scott et al., 2011). However, the one-step lookahead incurs an expensive nested optimization procedure and hence requires approximations. Besides, the acquisition function can only be approximated using samples of the objective function f from the GP posterior and the environmental random variable Z. While they have analysed an asymptotically unbiased and consistent estimator of the gradients, it is challenging to obtain a guarantee for the convergence of their algorithm. Another recent work (Torossian et al., 2020) has also applied BO to maximize VAR using an asymmetric Laplace likelihood function and a variational approximation of the posterior belief. However, in contrast to Cakmak et al. (2020) and our work, they have focused on a different setting where the realizations of Z are not observed.

In this paper, we adopt the setting of Cakmak et al. (2020) which allows us to choose both x and z to query, and assume that the distribution of Z is known or well-estimated. Our contributions include:

Firstly, we propose a novel BO algorithm named Value-at-Risk Upper Confidence Bound (V-UCB) in Section 3. Unlike the work of Cakmak et al. (2020), V-UCB is equipped with a no-regret convergence guarantee and is more computationally efficient.
To guide its query selection and facilitate its proof of the no-regret guarantee, the classical GP-UCB algorithm (Srinivas et al., 2010) constructs a confidence bound of the objective function. Similarly, to maximize the VAR of a random function, we, for the first time to the best of our knowledge, construct a confidence bound of VAR (Lemma 2). The resulting confidence bound of VAR naturally gives rise to a strategy to select x. However, it remains a major challenge to select z so as to preserve the no-regret convergence of GP-UCB. To this end, we first prove that our algorithm is no-regret as long as, at the selected z, the confidence bound of VAR lies within the confidence bound of the objective function. Next, we also prove that this query selection strategy is feasible, i.e., such values of z, referred to as lacing values (LV), exist.

Secondly, although our theoretical no-regret property allows the selection of any LV, we design a heuristic to select an LV such that it improves our empirical performance over random selection of LV (Section 3.3). We also discuss the implications when z cannot be selected by BO and is instead randomly sampled by the environment during BO (Remark 1).

Thirdly, we show that adversarially robust BO (Bogunovic et al., 2018) can be cast as a special case of our V-UCB when the risk level α of VAR approaches 0 from the right and the domain of z is the support of Z. In this case, adversarially robust BO (Bogunovic et al., 2018) selects the same input queries as those selected by V-UCB since the set of LV collapses into the set of minimizers of the lower bound of the objective function (Section 3.4).

Lastly, we provide practical techniques for implementing V-UCB with a continuous random variable Z (Section 3.5): we (a) introduce local neural surrogate optimization with the pinball loss to optimize VAR, and (b) construct an objective function to search for an LV in the continuous support of Z. The performance of our proposed algorithm is empirically demonstrated in optimizing several synthetic benchmark functions, a portfolio optimization problem, and a simulated robot task in Section 4.

2. Problem Statement and Background

Let the objective function be defined as $f : D_x \times D_z \to \mathbb{R}$ where $D_x \subset \mathbb{R}^{d_x}$ and $D_z \subseteq \mathbb{R}^{d_z}$ are the bounded domain of the optimization variable x and the support of the environmental random variable Z, respectively ($d_x$ and $d_z$ are the dimensions of x and z, respectively). The support of Z is defined as the smallest closed subset $D_z$ of $\mathbb{R}^{d_z}$ such that $P(Z \in D_z) = 1$. Let $z \in D_z$ denote a realization of the random variable Z, and let f(x, Z) denote the random variable whose randomness comes from Z. The VAR of f(x, Z) at risk level α ∈ (0, 1) is defined as

$$V_\alpha(f(x, Z)) \triangleq \inf\{\omega : P(f(x, Z) \le \omega) \ge \alpha\} \quad (1)$$

which implies that the risk that f(x, Z) is less than its VAR at level α does not exceed α. Our objective is to search for $x^\star \in D_x$ that maximizes $V_\alpha(f(x, Z))$ at a user-specified risk level α ∈ (0, 1). Intuitively, the goal is to find x where the evaluations of the objective function are as large as possible under most realizations of the environmental random variable Z (characterized by the probability of 1 − α).
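Definition (1) is the (left-continuous) α-quantile of f(x, Z); for a finite batch of Monte Carlo samples of Z it can therefore be estimated by sorting. The following minimal numpy sketch (our illustration, not the authors' code) makes this concrete:

```python
import numpy as np

def var_from_samples(values: np.ndarray, alpha: float) -> float:
    """Empirical V_alpha of sampled evaluations f(x, z_i), z_i ~ Z.

    Implements definition (1): the smallest w such that
    P(f(x, Z) <= w) >= alpha, estimated from samples.
    """
    v = np.sort(values)
    # Smallest index i with empirical CDF (i + 1) / n >= alpha.
    idx = int(np.ceil(alpha * len(v))) - 1
    return float(v[max(idx, 0)])

# Example: for standard-normal samples and alpha = 0.1,
# the estimate approaches the 10%-quantile (about -1.28).
rng = np.random.default_rng(0)
print(var_from_samples(rng.standard_normal(100_000), alpha=0.1))
```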
The unknown objective function f(x, z) is modeled with a GP. That is, every finite subset of $\{f(x, z)\}_{(x,z) \in D_x \times D_z}$ follows a multivariate Gaussian distribution (Rasmussen & Williams, 2006). The GP is fully specified by its prior mean and covariance function $k_{(x,z),(x',z')} \triangleq \mathrm{cov}[f(x, z), f(x', z')]$ for all x, x′ in $D_x$ and z, z′ in $D_z$. For notational simplicity (and w.l.o.g.), the former is assumed to be zero, while we use the squared exponential (SE) kernel as the latter since its bounded maximum information gain can be used for later analysis (Srinivas et al., 2010).

To identify the optimal $x^\star \in \arg\max_{x \in D_x} V_\alpha(f(x, Z))$, a BO algorithm selects an input query $(x_t, z_t)$ in the t-th iteration to obtain a noisy function evaluation $y_{(x_t,z_t)} \triangleq f(x_t, z_t) + \epsilon_t$ where the $\epsilon_t \sim \mathcal{N}(0, \sigma_n^2)$ are i.i.d. Gaussian noise with variance $\sigma_n^2$. Given noisy observations $\mathbf{y}_{D_t} \triangleq (y_{(x,z)})_{(x,z) \in D_t}$ at observed inputs $D_t \triangleq D_{t-1} \cup \{(x_t, z_t)\}$ ($D_0$ is the initial set of observed inputs), the GP posterior belief of the function evaluation at any input (x, z) is a Gaussian $p(f(x, z) \mid \mathbf{y}_{D_t}) = \mathcal{N}(f(x, z) \mid \mu_t(x, z), \sigma_t^2(x, z))$ with

$$\mu_t(x, z) \triangleq K_{(x,z),D_t} \Lambda_{D_t D_t} \mathbf{y}_{D_t}\,, \qquad \sigma_t^2(x, z) \triangleq k_{(x,z),(x,z)} - K_{(x,z),D_t} \Lambda_{D_t D_t} K_{D_t,(x,z)} \quad (2)$$

where $\Lambda_{D_t D_t} \triangleq (K_{D_t D_t} + \sigma_n^2 I)^{-1}$, $I$ is the identity matrix of the same dimensions as $K_{D_t D_t}$, $K_{(x,z),D_t} \triangleq (k_{(x,z),(x',z')})_{(x',z') \in D_t}$, $K_{D_t,(x,z)} \triangleq K_{(x,z),D_t}^\top$, and $K_{D_t D_t} \triangleq (k_{(x',z'),(x'',z'')})_{(x',z'),(x'',z'') \in D_t}$.

3. BO of VAR

Following the seminal work of Srinivas et al. (2010), we use the cumulative regret as the performance metric to quantify the performance of our BO algorithm. It is defined as $R_T \triangleq \sum_{t=1}^{T} r(x_t)$ where $r(x_t) \triangleq V_\alpha(f(x^\star, Z)) - V_\alpha(f(x_t, Z))$ is the instantaneous regret and $x^\star \in \arg\max_{x \in D_x} V_\alpha(f(x, Z))$. We would like to design a query selection strategy that incurs no regret, i.e., $\lim_{T \to \infty} R_T / T = 0$. Furthermore, we have that $\min_{t \le T} r(x_t) \le R_T / T$, or equivalently, $\max_{t \le T} V_\alpha(f(x_t, Z)) \ge V_\alpha(f(x^\star, Z)) - R_T / T$. Thus, $\lim_{T \to \infty} \max_{t \le T} V_\alpha(f(x_t, Z)) = V_\alpha(f(x^\star, Z))$ for a no-regret algorithm.

The proof of the upper bound on the cumulative regret of GP-UCB is based on confidence bounds of the objective function (Srinivas et al., 2010). Similarly, in the next section, we start by constructing a confidence bound of $V_\alpha(f(x, Z))$, which naturally leads to a query selection strategy for $x_t$.

3.1. A Confidence Bound of $V_\alpha(f(x, Z))$ and the Query Selection Strategy for $x_t$

Firstly, we adopt a confidence bound of the function f(x, z) from Chowdhury & Gopalan (2017), which assumes that f belongs to a reproducing kernel Hilbert space $\mathcal{F}_k(B)$ such that its RKHS norm is bounded: $\|f\|_k \le B$.

Lemma 1 (Chowdhury & Gopalan (2017)). Pick $\delta \in (0, 1)$ and set $\beta_t = (B + \sigma_n \sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))})^2$. Then,

$$f(x, z) \in I_{t-1}[f(x, z)] \triangleq [l_{t-1}(x, z), u_{t-1}(x, z)] \quad \forall x \in D_x,\ z \in D_z,\ t \ge 1$$

holds with probability $\ge 1 - \delta$ where

$$l_{t-1}(x, z) \triangleq \mu_{t-1}(x, z) - \beta_t^{1/2} \sigma_{t-1}(x, z)\,, \qquad u_{t-1}(x, z) \triangleq \mu_{t-1}(x, z) + \beta_t^{1/2} \sigma_{t-1}(x, z)\,. \quad (3)$$

As the above lemma holds for both finite and continuous $D_x$ and $D_z$, it is used to analyse the regret in both cases. On the other hand, the confidence bound can be adapted to the Bayesian setting by changing only $\beta_t$ following the work of Srinivas et al. (2010), as noted by Bogunovic et al. (2018). Then, we exploit this confidence bound on the function evaluations (Lemma 1) to formulate a confidence bound of $V_\alpha(f(x, Z))$ as follows.

Lemma 2. Similar to the definition of f(x, Z), let $l_{t-1}(x, Z)$ and $u_{t-1}(x, Z)$ denote the random functions over x whose randomness comes from the random variable Z; $l_{t-1}$ and $u_{t-1}$ are defined in (3). Then, $\forall x \in D_x,\ t \ge 1$,

$$V_\alpha(f(x, Z)) \in I_{t-1}[V_\alpha(f(x, Z))] \triangleq [V_\alpha(l_{t-1}(x, Z)), V_\alpha(u_{t-1}(x, Z))]$$

holds with probability $\ge 1 - \delta$ for δ in Lemma 1, where $V_\alpha(l_{t-1}(x, Z))$ and $V_\alpha(u_{t-1}(x, Z))$ are defined following (1). The proof is in Appendix A.

Given the confidence bound $I_{t-1}[V_\alpha(f(x, Z))]$ in Lemma 2, we follow the well-known optimism in the face of uncertainty principle to select $x_t = \arg\max_{x \in D_x} V_\alpha(u_{t-1}(x, Z))$.
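For reference, the posterior in (2) and the bounds in (3) amount to a few lines of linear algebra. Below is a direct numpy transcription (ours, not the authors' code; a practical implementation would use a Cholesky solve rather than an explicit inverse):

```python
import numpy as np

def gp_confidence_bounds(K_query_obs, K_obs_obs, K_query_diag,
                         y_obs, noise_var, beta):
    """Posterior mean/variance as in (2) and the bounds l, u as in (3).

    K_query_obs:  (m, n) kernel matrix between m query points and n observations.
    K_obs_obs:    (n, n) kernel matrix among the observations.
    K_query_diag: (m,)   prior variances k((x,z),(x,z)) at the query points.
    """
    n = K_obs_obs.shape[0]
    Lam = np.linalg.inv(K_obs_obs + noise_var * np.eye(n))  # (K + sigma_n^2 I)^-1
    mu = K_query_obs @ Lam @ y_obs
    # Quadratic form K_{q,D} Lam K_{D,q} for each query point (row).
    var = K_query_diag - np.einsum('ij,jk,ik->i', K_query_obs, Lam, K_query_obs)
    half_width = np.sqrt(beta) * np.sqrt(np.maximum(var, 0.0))
    return mu - half_width, mu + half_width  # l_{t-1}, u_{t-1}
```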
This query selection strategy for $x_t$ leads to an upper bound of $r(x_t)$:

$$r(x_t) \le V_\alpha(u_{t-1}(x_t, Z)) - V_\alpha(l_{t-1}(x_t, Z)) \quad \forall t \ge 1 \quad (4)$$

which holds with probability $\ge 1 - \delta$ for δ in Lemma 1, and is proved in Appendix B. As our goal is $\lim_{T \to \infty} R_T / T = 0$, given the selected query $x_t$, a reasonable query selection strategy for $z_t$ should gather informative observations at $(x_t, z_t)$ that improve the confidence bound $I_{t-1}[V_\alpha(f(x_t, Z))]$ (i.e., $I_t[V_\alpha(f(x_t, Z))]$ is a proper subset of $I_{t-1}[V_\alpha(f(x_t, Z))]$ if $I_{t-1}[V_\alpha(f(x_t, Z))] \neq \emptyset$), which can be viewed as the uncertainty of $V_\alpha(f(x_t, Z))$.

Assume that there exist $z^l \in D_z$ such that $l_{t-1}(x_t, z^l) = V_\alpha(l_{t-1}(x_t, Z))$ and $z^u \in D_z$ such that $u_{t-1}(x_t, z^u) = V_\alpha(u_{t-1}(x_t, Z))$. Lemma 2 implies that $V_\alpha(f(x_t, Z)) \in I_{t-1}[V_\alpha(f(x_t, Z))] = [l_{t-1}(x_t, z^l), u_{t-1}(x_t, z^u)]$ with high probability. Hence, we may naïvely want to query for observations at $(x_t, z^l)$ and $(x_t, z^u)$ to reduce $I_{t-1}[V_\alpha(f(x_t, Z))]$. However, these observations may not always reduce $I_{t-1}[V_\alpha(f(x_t, Z))]$. The insight is that $I_{t-1}[V_\alpha(f(x_t, Z))]$ changes (i.e., shrinks) when either of its boundary values (i.e., $l_{t-1}(x_t, z^l)$ or $u_{t-1}(x_t, z^u)$) changes.

Consider $u_{t-1}(x_t, z^u)$ and finite $D_z$ as an example: since $u_{t-1}(x_t, z^u) = V_\alpha(u_{t-1}(x_t, Z))$, a natural cause for the change in $u_{t-1}(x_t, z^u)$ is when $z^u$ changes. This happens if there exists $z' \neq z^u$ such that the ordering of $u_{t-1}(x_t, z')$ relative to $u_{t-1}(x_t, z^u)$ changes given more observations. Thus, observations that are capable of reducing $I_{t-1}[V_\alpha(f(x_t, Z))]$ should be able to change the relative ordering in this case. We construct the following counterexample where observations at $z^u$ (and $z^l$) are not able to change the relative ordering, so they do not reduce $I_{t-1}[V_\alpha(f(x_t, Z))]$.

Example 1. This example is described by Fig. 1. We reduce notational clutter by removing $x_t$ and t since they are fixed in this example, i.e., we use f(z), f(Z), and l(z) to denote $f(x_t, z)$, $f(x_t, Z)$, and $l_{t-1}(x_t, z)$, respectively. We condition on the event $f(z) \in I[f(z)] \triangleq [l(z), u(z)]$ for all $z \in D_z$, which occurs with probability $\ge 1 - \delta$ in Lemma 1. In this example, $z^l = z_1$ and $l(z_1) = u(z_1)$, so there is no uncertainty in $f(z^l) = f(z_1)$. Similarly, there is no uncertainty in $f(z^u) = f(z_2)$. Thus, new observations at $z^l$ and $z^u$ change neither $l(z^l)$ nor $u(z^u)$, so these observations do not reduce the confidence bound $I[V_{\alpha=0.4}(f(Z))] = [l(z^l), u(z^u)]$ (plotted as the double-headed arrow in Fig. 1b). In fact, to reduce $I[V_{\alpha=0.4}(f(Z))]$, we should gather new observations at $z_0$, which potentially change the ordering of $u(z_0)$ relative to $u(z_2)$ (which is $u(z^u)$ without new observations). For example, after getting new observations at $z_0$, if $u(z_0)$ is improved to be in the white region between A and B ($u(z_0) > u(z_2)$ in Fig. 1b changes to $u(z_0) < u(z_2)$ in Fig. 1c), then $I[V_{\alpha=0.4}(f(Z))]$ is reduced to $[l(z_1), u(z_0)]$ because now $z^u = z_0$. Thus, as the confidence bound $I[f(z_0)]$ is shortened with more and more observations at $z_0$, the confidence bound $I[V_{\alpha=0.4}(f(Z))]$ reduces (the white region representing the uncertainty of $V_{\alpha=0.4}(f(Z))$ in Fig. 1 is "laced up").

In the next section, we define a property of $z_0$ in Example 1 and prove the existence of z's with this property. Then, we prove that along with the optimistic selection of $x_t$, the selection of $z_t$ satisfying this property leads to a no-regret algorithm.
3.2. Lacing Value (LV) and the Query Selection Strategy for $z_t$

We note that in Example 1, as long as the confidence bound of the function evaluation at $z_0$ contains the confidence bound of VAR, observations at $z_0$ can reduce the confidence bound of VAR. We name the values of z satisfying this property lacing values (LV) since observations at an LV "lace up" the confidence bound of VAR:

Definition 1 (Lacing values). Lacing values (LV) with respect to $x \in D_x$ and $t \ge 1$ are $z_{\mathrm{LV}} \in D_z$ that satisfy

$$l_{t-1}(x, z_{\mathrm{LV}}) \le V_\alpha(l_{t-1}(x, Z)) \le V_\alpha(u_{t-1}(x, Z)) \le u_{t-1}(x, z_{\mathrm{LV}})\,,$$

equivalently, $I_{t-1}[V_\alpha(f(x, Z))] \subseteq [l_{t-1}(x, z_{\mathrm{LV}}), u_{t-1}(x, z_{\mathrm{LV}})]$.

Recall that the support $D_z$ of Z is defined as the smallest closed subset $D_z$ of $\mathbb{R}^{d_z}$ such that $P(Z \in D_z) = 1$ (e.g., $D_z$ can be a finite subset of $\mathbb{R}^{d_z}$ or $D_z = \mathbb{R}^{d_z}$). The following theorem guarantees the existence of lacing values and is proved in Appendix C.

Theorem 1 (Existence of lacing values). $\forall \alpha \in (0, 1)$, $\forall x \in D_x$, $\forall t \ge 1$, there exists a lacing value in $D_z$ with respect to x and t.

Corollary 1.1. Lacing values with respect to $x \in D_x$ and $t \ge 1$ are in $\mathcal{Z}^l$ and $\mathcal{Z}^u$ (hence, in their intersection), where $\mathcal{Z}^l \triangleq \{z \in D_z : l_{t-1}(x, z) \le V_\alpha(l_{t-1}(x, Z))\}$ and $\mathcal{Z}^u \triangleq \{z \in D_z : u_{t-1}(x, z) \ge V_\alpha(u_{t-1}(x, Z))\}$. As a special case, when $z^l = z^u$, $I_{t-1}[V_\alpha(f(x, Z))] = I_{t-1}[f(x, z^l)]$, which means $z^l = z^u$ is an LV.

Based on Theorem 1, we can always select $z_t$ as an LV w.r.t. $x_t$ as defined in Definition 1. This strategy for the selection of $z_t$, together with the selection of $x_t = \arg\max_{x \in D_x} V_\alpha(u_{t-1}(x, Z))$ (Section 3.1), completes our algorithm: VAR Upper Confidence Bound (V-UCB) (Algorithm 1).

Upper Bound on Regret. As a result of the selection strategies for $x_t$ and $z_t$, our V-UCB algorithm enjoys the following upper bound on its instantaneous regret (proved in Appendix D):

Lemma 3. By selecting $x_t$ as a maximizer of $V_\alpha(u_{t-1}(x, Z))$ and selecting $z_t$ as an LV w.r.t. $x_t$, the instantaneous regret is upper-bounded by

$$r(x_t) \le 2 \beta_t^{1/2} \sigma_{t-1}(x_t, z_t) \quad \forall t \ge 1$$

with probability $\ge 1 - \delta$ for δ in Lemma 1.

We assume that $k_{(x,z),(x,z)} \le 1$ for all $(x, z) \in D_x \times D_z$. Then, Lemma 3 and Lemma 5.4 of Srinivas et al. (2010) imply that the cumulative regret of our algorithm is bounded (Appendix E): $R_T \le \sqrt{C_1 T \beta_T \gamma_T}$, where $C_1 \triangleq 8 / \log(1 + \sigma_n^{-2})$ and $\gamma_T$ is the maximum information gain about f that can be obtained from any set of T observations. The work of Srinivas et al. (2010) has analyzed $\gamma_T$ for several commonly used kernels such as the SE and Matérn kernels, and has shown that for these kernels, the upper bound on $R_T$ grows sub-linearly. This implies that our algorithm is no-regret because $\lim_{T \to \infty} R_T / T = 0$.

Following the work of Bogunovic et al. (2018), at the T-th iteration of V-UCB, we can recommend $x_{t^\star(T)}$ as an estimate of the maximizer $x^\star$ of $V_\alpha(f(x, Z))$, where $t^\star(T) \triangleq \arg\max_{t \in \{1,\dots,T\}} V_\alpha(l_{t-1}(x_t, Z))$. Then, the instantaneous regret $r(x_{t^\star(T)})$ is shown to be upper-bounded by $\sqrt{C_1 \beta_T \gamma_T / T}$ with high probability (see Appendix F). In our experiments in Section 4, we instead recommend $\arg\max_{x \in D_T} V_\alpha(\mu_{t-1}(x, Z))$ (where $\mu_{t-1}(x, Z)$ is a random function defined in the same manner as f(x, Z)) as an estimate of $x^\star$ due to its empirical convergence.
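For finite $D_z$, the conditions in Definition 1 can be checked directly for every z. The following numpy sketch (our illustration; the helper names are ours) returns all candidate LV at a given $x_t$:

```python
import numpy as np

def weighted_var(values, probs, alpha):
    """V_alpha of a discrete random variable: sort the values, accumulate
    mass, and return the smallest value whose CDF reaches alpha (def. (1))."""
    order = np.argsort(values)
    cdf = np.cumsum(probs[order])
    i = min(np.searchsorted(cdf, alpha), len(values) - 1)  # guard float error
    return values[order[i]]

def find_lacing_values(l_vals, u_vals, probs, alpha):
    """Indices of lacing values per Definition 1 for finite D_z:
    l(z) <= V_alpha(l(Z)) and u(z) >= V_alpha(u(Z))."""
    var_l = weighted_var(l_vals, probs, alpha)
    var_u = weighted_var(u_vals, probs, alpha)
    return np.flatnonzero((l_vals <= var_l) & (u_vals >= var_u))
```

By Theorem 1 the returned index set is non-empty, so line 4 of Algorithm 1 below is always feasible.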
Figure 1. A counterexample against selecting $z^u$ and $z^l$ as input queries. Here z follows a discrete uniform distribution over $D_z \triangleq \{z_0, z_1, z_2\}$. (a) shows the mappings of z to the upper bound u(z) and lower bound l(z). The VAR at α = 0.4 of u(Z) and l(Z) are $u(z_2)$ and $l(z_1)$, respectively, i.e., $z^u = z_2$ and $z^l = z_1$. (b) shows the values of l(z) and u(z) for all z on the same axis, as well as the confidence bounds of f(z) and $V_\alpha(f(Z))$. The gray areas A and B indicate the intervals of values $\omega \in \mathbb{R}$ where $\omega \le l(z^l) = l(z_1)$ and $\omega \ge u(z^u) = u(z_2)$, respectively. (c) shows a hypothetical scenario when $I[f(z_0)]$ is shortened/laced up with observations at $z_0$.

Algorithm 1 The V-UCB Algorithm
1: Input: $D_x$, $D_z$, prior $\mu_0 = 0$, $\sigma_0$, k
2: for t = 1, 2, . . . do
3:   Select $x_t = \arg\max_{x \in D_x} V_\alpha(u_{t-1}(x, Z))$
4:   Select $z_t$ as a lacing value w.r.t. $x_t$ (Definition 1)
5:   Obtain observation $y_t \triangleq f(x_t, z_t) + \epsilon_t$
6:   Update the GP posterior belief to obtain $\mu_t$ and $\sigma_t$
7: end for

Computational Complexity. To compare our computational complexity with that of the ρKG and ρKGapx algorithms in Cakmak et al. (2020), we exclude the training of the GP model (line 6 of Algorithm 1) and assume that $D_z$ is finite. Then, the time complexity of V-UCB is dominated by that of the selection of $x_t$ (line 3 of Algorithm 1), which includes the time complexity $O(|D_z| |D_{t-1}|^2)$ for the GP prediction at $\{x\} \times D_z$, and $O(|D_z| \log |D_z|)$ for the sorting of $u_{t-1}(x, D_z)$ and the search for VAR. Hence, our overall complexity is $O(n |D_z| (|D_{t-1}|^2 + \log |D_z|))$, where n is the number of iterations to maximize $V_\alpha(u_{t-1}(x, Z))$ (line 3 of Algorithm 1). This is more computationally efficient than ρKG and its approximated variant ρKGapx, whose time complexities are shown in the work of Cakmak et al. (2020) to be $O(n_{\text{out}} n_{\text{in}} K |D_z| (|D_{t-1}|^2 + |D_z||D_{t-1}| + |D_z|^2 + M|D_z|))$ and $O(n_{\text{out}} |D_{t-1}| K |D_z| (|D_{t-1}|^2 + |D_z||D_{t-1}| + |D_z|^2 + M|D_z|))$, respectively ($n_{\text{out}}$ and $n_{\text{in}}$ are the numbers of iterations for the outer and inner optimization procedures, respectively; K is the number of fantasy GP models required for ρKG's and ρKGapx's one-step lookahead; and M is the number of functions sampled from the GP posterior belief).

3.3. On the Selection of $z_t$

Although Algorithm 1 is guaranteed to be no-regret with any choice of LV as $z_t$, we would like to select the LV that can reduce a large amount of the uncertainty of $V_\alpha(f(x_t, Z))$. However, relying on the information gain measure or the knowledge gradient method often incurs the expensive one-step lookahead. Therefore, we use a simple heuristic: choose the LV $z_{\mathrm{LV}}$ with the maximum probability mass (or probability density if Z is continuous). We motivate this heuristic using an example with α = 0.2, i.e., $V_{\alpha=0.2}(f(x_t, Z)) = \inf\{\omega : P(f(x_t, Z) \le \omega) \ge 0.2\}$. Suppose $D_z$ is finite and there are two LV's $z_0$ and $z_1$ with $P(z_0) \ge 0.2$ and $P(z_1) = 0.01$. Then, because $P(f(x_t, Z) \le f(x_t, z_0)) \ge P(z_0) \ge 0.2$, it follows that $V_{\alpha=0.2}(f(x_t, Z)) \le f(x_t, z_0)$, i.e., querying $z_0$ at $x_t$ gives us information about an explicit upper bound on $V_{\alpha=0.2}(f(x_t, Z))$ to reduce its uncertainty. In contrast, this cannot be achieved by querying $z_1$ due to its low probability mass. This simple heuristic can also be implemented when Z is a continuous random variable, which we will introduce in Section 3.5.
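Putting lines 3-4 of Algorithm 1 together for finite $D_x$ and $D_z$, one V-UCB iteration could look as follows. This is a schematic only: the `gp.lower`/`gp.upper` interface is a hypothetical wrapper around the bounds in (3), and `weighted_var`/`find_lacing_values` are the helpers sketched above; it also applies the Section 3.3 heuristic.

```python
import numpy as np

def v_ucb_step(gp, D_x, D_z, probs, alpha, beta):
    """One V-UCB iteration (lines 3-4 of Algorithm 1) over finite D_x, D_z.

    gp.lower / gp.upper are assumed to return l_{t-1}(x, z) and u_{t-1}(x, z)
    of (3) as arrays over D_z for a given x.
    """
    # Line 3: optimistic selection of x_t by maximizing V_alpha(u_{t-1}(x, Z)).
    scores = [weighted_var(gp.upper(x, D_z, beta), probs, alpha) for x in D_x]
    x_t = D_x[int(np.argmax(scores))]

    # Line 4: select z_t among the lacing values w.r.t. x_t; following
    # Section 3.3, break ties by the largest probability mass.
    lv = find_lacing_values(gp.lower(x_t, D_z, beta),
                            gp.upper(x_t, D_z, beta), probs, alpha)
    z_t = D_z[lv[int(np.argmax(probs[lv]))]]
    return x_t, z_t  # then query y = f(x_t, z_t) + noise and update the GP
```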
Remark 1. Although we assume that we can select both $x_t$ and $z_t$ during our algorithm, Corollary 1.1 also gives us some insights about the scenario where we cannot select $z_t$. In this case, in each iteration t, we select $x_t$ while $z_t$ is randomly sampled by the environment following the distribution of Z. Next, we observe both $z_t$ and $y_{(x_t,z_t)}$ and then update the GP posterior belief of f. Of note, Corollary 1.1 has shown that all LV lie in the set $\mathcal{Z}^l$. However, the probability of this set is usually small, because $P(Z \in \mathcal{Z}^l) \approx \alpha$ and small values of α are often used by real-world applications to manage risks. Thus, there is a small probability that the sampled $z_t$ is an LV. As a result, we suggest sampling a large number of $z_t$'s from the environment to increase the chance that an LV is sampled. On the other hand, the small probability of sampling an LV motivates the need for us to be able to select $z_t$ ourselves.

3.4. V-UCB Approaches STABLEOPT as α → 0+

Recall that the objective of adversarially robust BO is to find $x \in D_x$ that maximizes $\min_{z \in D_z} f(x, z)$ (Bogunovic et al., 2018) by iteratively specifying input queries $(x_t, z_t)$ to collect noisy observations $y_{(x_t,z_t)}$. It is different from BO of VAR since its z is not random but selected by an adversary who aims to minimize the function evaluation. The work of Bogunovic et al. (2018) has proposed a no-regret algorithm for this setting named STABLEOPT, which selects

$$x_t = \arg\max_{x \in D_x} \min_{z \in D_z} u_{t-1}(x, z)\,, \qquad z_t = \arg\min_{z \in D_z} l_{t-1}(x_t, z) \quad (5)$$

where $u_{t-1}$ and $l_{t-1}$ are defined in (3). At first glance, BO of VAR and adversarially robust BO are seemingly different problems because Z is a random variable in the former but not in the latter. However, based on our key observation on the connection between the minimum value of a continuous function h(w) and the VAR of the random variable h(W) in the following theorem, these two problems and their solutions are connected as shown in Corollaries 2.1 and 2.2.

Theorem 2. Let W be a random variable with support $D_w \subseteq \mathbb{R}^{d_w}$ of dimension $d_w$. Let h be a continuous function mapping from $w \in D_w$ to $\mathbb{R}$, and let h(W) denote the random variable whose realization is the evaluation of h at a realization w of W. Suppose h(w) has a minimizer $w_{\min} \in D_w$; then $\lim_{\alpha \to 0^+} V_\alpha(h(W)) = h(w_{\min})$.

Corollary 2.1. Adversarially robust BO, which finds $\arg\max_x \min_z f(x, z)$, can be cast as BO of VAR by letting (a) α approach 0 from the right and (b) $D_z$ be the support of the environmental random variable Z, i.e., $\arg\max_x \lim_{\alpha \to 0^+} V_\alpha(f(x, Z))$.

Interestingly, from Theorem 2, we observe that $\mathcal{Z}^l$ in Corollary 1.1 approaches the set of minimizers $\arg\min_{z \in D_z} l_{t-1}(x_t, z)$ as $\alpha \to 0^+$. Corollary 2.2 below shows that an LV w.r.t. $x_t$ becomes a minimizer of $l_{t-1}(x_t, z)$, which is the same as the selected $z_t$ of STABLEOPT in (5).

Corollary 2.2. The STABLEOPT solution to adversarially robust BO selects the same input query as that selected by the V-UCB solution to the corresponding BO of VAR in Corollary 2.1.

The proofs of Theorem 2 and its two corollaries are shown in Appendix G. We note that V-UCB is also applicable to the optimization of $V_\alpha(f(x, Z))$ where the distribution of Z is a conditional distribution given x. For example, in robotics, if there exists noise/perturbation in the control, an optimization problem of interest is $V_\alpha(f(x + \xi(x)))$ where ξ(x) is a random perturbation that depends on x.

3.5. Implementation of V-UCB with Continuous Random Variable Z

The V-UCB algorithm involves two steps: selecting $x_t = \arg\max_{x \in D_x} V_\alpha(u_{t-1}(x, Z))$ and selecting $z_t$ as the LV $z_{\mathrm{LV}}$ with the largest probability mass (or probability density). When $D_z$ is finite, given x, $V_\alpha(u_{t-1}(x, Z))$ can be computed exactly. The gradient of $V_\alpha(u_{t-1}(x, Z))$ with respect to x can be obtained easily (e.g., using the automatic differentiation provided in the TensorFlow library (Abadi et al., 2015)) to aid the selection of $x_t$. In this case, the latter step can also be performed by constructing the set of all LV (checking the condition in Definition 1 for all $z \in D_z$) and choosing the LV $z_{\mathrm{LV}}$ with the largest probability mass.
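When $D_z$ is finite, the exact computation of $V_\alpha(u_{t-1}(x, Z))$ stays differentiable w.r.t. x under automatic differentiation because VAR selects a single entry of $u_{t-1}(x, D_z)$. A TensorFlow sketch of this point (ours, not the paper's code; `u_fn` is a hypothetical callable returning the upper bound over the grid):

```python
import tensorflow as tf

def var_of_upper_bound(x, z_grid, probs, alpha, u_fn):
    """Exact V_alpha(u(x, Z)) for finite D_z; gradients w.r.t. x flow
    through the single selected entry of u."""
    u = u_fn(x, z_grid)                   # u_{t-1}(x, z) over the grid of z
    order = tf.argsort(u)                 # sort upper bounds ascending
    cdf = tf.cumsum(tf.gather(probs, order))
    idx = tf.where(cdf >= alpha)[0, 0]    # first index with cdf >= alpha
    return tf.gather(u, tf.gather(order, idx))

# Usage (sketch): maximize over x with gradient ascent.
# with tf.GradientTape() as tape:
#     v = var_of_upper_bound(x, z_grid, probs, alpha, u_fn)
# x.assign_add(lr * tape.gradient(v, x))
```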
Estimation of VAR. When Z is a continuous random variable, estimating VAR by an ordered set of samples (e.g., as in Cakmak et al. (2020)) may require a prohibitively large number of samples, especially for small values of α. Thus, we employ the following popular pinball (or tilted absolute value) function from quantile regression (Koenker & Bassett, 1978) to estimate VAR:

$$\rho_\alpha(w) \triangleq \begin{cases} \alpha w & \text{if } w \ge 0\,, \\ (\alpha - 1) w & \text{if } w < 0 \end{cases}$$

where $w \in \mathbb{R}$. In particular, to estimate $V_\alpha(u_{t-1}(x, Z))$ as $\nu \in \mathbb{R}$, we find the ν that minimizes

$$\mathbb{E}_{z \sim p(Z)}[\rho_\alpha(u_{t-1}(x, z) - \nu)]\,. \quad (6)$$

A well-known example is α = 0.5: then $\rho_\alpha$ is (proportional to) the absolute value function and the optimal ν is the median. The loss in (6) can be optimized using stochastic gradient descent with a random batch of samples of Z at each optimization iteration.

Maximization of $V_\alpha(u_{t-1}(x, Z))$. Unfortunately, to maximize $V_\alpha(u_{t-1}(x, Z))$ over $x \in D_x$, there is no gradient of $V_\alpha(u_{t-1}(x, Z))$ with respect to x under the above approach. This situation resembles the BO problem where there is no gradient information, but only noisy observations at input queries. Unlike BO, however, the observations (samples of $u_{t-1}(x, Z)$ at x) are not costly. Therefore, we propose the local neural surrogate optimization (LNSO) algorithm to find $\arg\max_{x \in D_x} V_\alpha(u_{t-1}(x, Z))$, which is visualized in Fig. 2. Suppose the optimization is initialized at $x = x^{(0)}$. Instead of maximizing $V_\alpha(u_{t-1}(x, Z))$ (whose gradient w.r.t. x is unknown), we maximize a surrogate function $g(x; \theta^{(0)})$ (modeled by a neural network) that approximates $V_\alpha(u_{t-1}(x, Z))$ well in a local region of $x^{(0)}$, e.g., a ball $B(x^{(0)}, r)$ of radius r in Fig. 2. We obtain such parameters $\theta^{(0)}$ by minimizing the following loss function:

$$L_g(\theta, x^{(0)}) \triangleq \mathbb{E}_{x \sim B(x^{(0)}, r)}\, \mathbb{E}_{z \sim p(Z)} [\rho_\alpha(u_{t-1}(x, z) - g(x; \theta))] \quad (7)$$

where the expectation $\mathbb{E}_{x \sim B(x^{(0)}, r)}$ is taken over a uniformly distributed x in $B(x^{(0)}, r)$. Minimizing (7) can be performed with stochastic gradient descent. If maximizing $g(x; \theta^{(0)})$ leads to a value $x^{(i)} \notin B(x^{(0)}, r)$ (Fig. 2), we update the local region to be centered at $x^{(i)}$ (i.e., $B(x^{(i)}, r)$) and find $\theta^{(i)} = \arg\min_\theta L_g(\theta, x^{(i)})$ such that $g(x; \theta^{(i)})$ approximates $V_\alpha(u_{t-1}(x, Z))$ well $\forall x \in B(x^{(i)}, r)$. Then, $x^{(i)}$ is updated by maximizing $g(x; \theta^{(i)})$ for $x \in B(x^{(i)}, r)$. The complete algorithm is described in Appendix H.

Figure 2. Plot of a hypothetical optimization path (as arrows) of LNSO initialized at $x^{(0)}$. Input x is 2-dimensional. The boundary of a ball B of radius r is plotted as a dotted circle. When the updated x moves out of B, the center of B and θ are updated.

We prefer a small value of r so that the ball B is small. In such a case, $V_\alpha(u_{t-1}(x, Z))$ for $x \in B$ can be estimated well with a small neural network $g(x; \theta)$ whose training requires a small number of iterations.

Search of Lacing Values. Given a continuous random variable Z, to find an LV w.r.t. $x_t$ in line 4 of Algorithm 1, i.e., to find a z satisfying $d_u(z) \triangleq u_{t-1}(x_t, z) - V_\alpha(u_{t-1}(x_t, Z)) \ge 0$ and $d_l(z) \triangleq V_\alpha(l_{t-1}(x_t, Z)) - l_{t-1}(x_t, z) \ge 0$, we choose a z that minimizes

$$L_{\mathrm{LV}}(z) \triangleq \mathrm{ReLU}(-d_u(z)) + \mathrm{ReLU}(-d_l(z)) \quad (8)$$

where $\mathrm{ReLU}(\omega) = \max(\omega, 0)$ is the rectified linear unit function ($\omega \in \mathbb{R}$). To include the heuristic in Section 3.3, which selects the LV with the highest probability density, we instead find the z that minimizes

$$L_{\mathrm{LV\text{-}P}}(z) \triangleq L_{\mathrm{LV}}(z) - \mathbb{I}_{d_u(z) \ge 0}\, \mathbb{I}_{d_l(z) \ge 0}\, p(z)$$

where $L_{\mathrm{LV}}(z)$ is defined in (8), p(z) is the probability density, and $\mathbb{I}_{d_u(z) \ge 0}$ and $\mathbb{I}_{d_l(z) \ge 0}$ are indicator functions.
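As a concrete sketch of the objectives above (ours, with hypothetical helper names): the stochastic subgradient of (6) w.r.t. ν has the simple form $P(u < \nu) - \alpha$ estimated on a batch, and (8) is zero exactly on the set of LV.

```python
import numpy as np

def pinball(w, alpha):
    """rho_alpha from the quantile-regression (pinball) loss in (6)."""
    return np.where(w >= 0, alpha * w, (alpha - 1) * w)

def estimate_var_sgd(sample_u, alpha, lr=0.05, iters=2000, batch=64):
    """Estimate V_alpha(u(x, Z)) by minimizing (6) with SGD over nu.

    sample_u(n) -> n samples of u(x, z) with z ~ p(Z), for the x of interest.
    """
    nu = 0.0
    for _ in range(iters):
        w = sample_u(batch) - nu
        # Subgradient of E[rho_alpha(u - nu)] w.r.t. nu: P(u < nu) - alpha.
        nu -= lr * (np.mean(w < 0) - alpha)
    return nu

def lacing_loss(d_u, d_l):
    """L_LV from (8); it is zero exactly when z is a lacing value."""
    relu = lambda v: max(v, 0.0)
    return relu(-d_u) + relu(-d_l)
```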
4. Experiments

In this section, we empirically evaluate the performance of V-UCB. The work of Cakmak et al. (2020) has motivated the use of the approximated variant of their algorithm, ρKGapx, over its original version ρKG by showing that ρKGapx achieves comparable empirical performances to ρKG while incurring much less computational cost. Furthermore, ρKGapx has been shown to significantly outperform other competing algorithms (Cakmak et al., 2020). Therefore, we choose ρKGapx as the main baseline to empirically compare with V-UCB. The experiments using ρKGapx are performed by adding new objective functions to the existing implementation of Cakmak et al. (2020) at https://github.com/saitcakmak/BoRisk. Regarding V-UCB, when $D_z$ is finite and the distribution of Z is not uniform, we perform V-UCB both by selecting $z_t$ as an LV at random, labeled as V-UCB Unif, and by selecting $z_t$ as the LV with the maximum probability mass, labeled as V-UCB Prob. The performance metric is defined as $V_\alpha(f(x^\star, Z)) - V_\alpha(f(\tilde{x}, Z))$ where $\tilde{x}$ is the recommended input. The evaluation of VAR is described in Section 3.5. The recommended input is $\arg\max_{x \in D_T} V_\alpha(\mu_{t-1}(x, Z))$ for V-UCB, and $\arg\min_{x \in D_x} \mathbb{E}_{t-1}[V_\alpha(f(x, Z))]$ for ρKGapx (Cakmak et al., 2020), where $\mathbb{E}_{t-1}$ is the conditional expectation over the unknown f given the observations $\mathbf{y}_{D_{t-1}}$ (approximated by a finite set of functions sampled from the GP posterior belief).[2] We repeat each experiment 10 times with different random initial observations $\mathbf{y}_{D_0}$ and plot both the mean (as lines) and the 70% confidence interval (as shaded areas) of the log10 of the performance metric. The detailed descriptions of the experiments and the comparison with the strategy of randomly selecting input queries are deferred to Appendix I.

[2] While the work of Cakmak et al. (2020) considers a minimization problem of VAR, our work considers a maximization problem of VAR. Therefore, the objective functions for ρKGapx are the negation of those for V-UCB. For V-UCB at risk level α, the risk level for ρKGapx is 1 − α.

4.1. Synthetic Benchmark Functions

The experiments with discrete $D_z$ are conducted with Branin-Hoo-(1, 1), Goldstein-Price-(1, 1), Hartmann-3D-(1, 2), Hartmann-3D-(2, 1), and Hartmann-6D-(5, 1), where the tuple $(d_x, d_z)$ represents the dimensions of x and z. The noise variance $\sigma_n^2$ is set to 0.01. The risk level α is 0.1. We observe in Fig. 3 that V-UCB Unif is on par with ρKGapx in optimizing Branin-Hoo-(1, 1) and Goldstein-Price-(1, 1), and outperforms ρKGapx in optimizing Hartmann-3D-(1, 2), Hartmann-3D-(2, 1), and Hartmann-6D-(5, 1). On the other hand, V-UCB Prob is able to exploit the probability distribution of Z to outperform V-UCB Unif in optimizing Branin-Hoo-(1, 1), Goldstein-Price-(1, 1), Hartmann-3D-(1, 2), and Hartmann-3D-(2, 1). The unsatisfactory performance of ρKGapx in some experiments may be attributed to its approximation of the inner optimization problem in the acquisition function (Cakmak et al., 2020), and to the approximation of VAR using samples of Z and function samples of the GP posterior belief.

The experiments with continuous $D_z$ are conducted with Branin-Hoo-(1, 1), Goldstein-Price-(1, 1), Hartmann-3D-(1, 2), and Hartmann-3D-(2, 1). In the implementation of LNSO, the local region is defined with r = 0.1. The surrogate function is a neural network of 2 hidden layers with 30 hidden neurons each and sigmoid activation functions. The results are shown in Fig. 4. We observe that V-UCB Prob outperforms ρKGapx because of the approximation involved in ρKGapx, as explained in the above experiments with discrete $D_z$.
Figure 3. Synthetic benchmark functions with finite $D_z$: (a) Branin-Hoo-(1, 1), (b) Goldstein-Price-(1, 1), (c) Hartmann-3D-(1, 2), (d) Hartmann-3D-(2, 1), (e) Hartmann-6D-(5, 1). Each panel plots the log10 regret against the iteration for V-UCB Prob and V-UCB Unif.

Figure 4. Synthetic benchmark functions with continuous $D_z$: (a) Branin-Hoo-(1, 1), (b) Goldstein-Price-(1, 1), (c) Hartmann-3D-(1, 2), (d) Hartmann-3D-(2, 1). Each panel plots the log10 regret against the iteration.

Figure 5. Simulated experiments: (a) portfolio optimization problem (3, 2); (b) simulated robot pushing task (3, 2). Each panel plots the log10 regret against the iteration.

4.2. Simulated Optimization Problems

The first problem is portfolio optimization, adopted from Cakmak et al. (2020). There are $d_x = 3$ optimization variables (the risk and trade aversion parameters and the holding cost multiplier) and $d_z = 2$ environmental random variables (the bid-ask spread and the borrow cost). The variable Z follows a discrete uniform distribution with $|D_z| = 100$. Hence, there is no difference between V-UCB Unif and V-UCB Prob, so we only report the results of the latter. The objective function is the posterior mean of a GP trained on the dataset in Cakmak et al. (2020) of size 3000 generated from CVXPortfolio. The noise variance $\sigma_n^2$ is set to 0.01. The risk level α is set to 0.2.

The second problem is a simulated robot pushing task for which we use the implementation from the work of Wang & Jegelka (2017). The simulation is viewed as a 3-dimensional function $h(r_x, r_y, t_p)$ returning the 2-D location of the pushed object, where $r_x, r_y \in [-5, 5]$ specify the robot location and $t_p \in [1, 30]$ is the pushing duration. The objective is to minimize the distance to a fixed goal location $g = (g_x, g_y)^\top$, i.e., the objective function of the maximization problem is $f_0(r_x, r_y, t_p) = -\|h(r_x, r_y, t_p) - g\|$. We assume that there are perturbations $(W_x, W_y)$ in specifying the robot location, whose support $D_z$ consists of 64 equi-distant points in $[-1, 1]^2$ and whose probability mass is proportional to $\exp(-(W_x^2 + W_y^2)/0.4^2)$. This leads to a random objective function $f(r_x, r_y, t_p, W_x, W_y) \triangleq f_0(r_x + W_x, r_y + W_y, t_p)$. We aim to maximize the VAR of f, which is more difficult than maximizing that of $f_0$. Moreover, a random noise following $\mathcal{N}(0, 0.01)$ is added to the evaluation of f. The risk level α is set to 0.1.

The results are shown in Fig. 5. We observe that V-UCB outperforms ρKGapx in both problems. Furthermore, in comparison to our synthetic experiments, the difference between V-UCB Unif and V-UCB Prob is not significant in the robot pushing experiment. This is because the chance that a uniform sample of LV has a large probability mass is higher in the robot pushing experiment due to a larger region of $D_z$ having high probabilities.
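For concreteness, one plausible construction of the perturbation distribution in the robot task is sketched below (our reading; the exact grid convention, e.g., whether the endpoints are included, is not specified in the text):

```python
import numpy as np

def perturbation_grid(n_side=8, half_width=1.0, scale=0.4):
    """Discrete perturbation distribution assumed for the robot pushing task:
    an n_side x n_side grid of equi-distant points in [-1, 1]^2 (64 points)
    with probability mass proportional to exp(-(Wx^2 + Wy^2) / scale^2)."""
    w = np.linspace(-half_width, half_width, n_side)
    Wx, Wy = np.meshgrid(w, w)
    support = np.stack([Wx.ravel(), Wy.ravel()], axis=1)   # (64, 2)
    probs = np.exp(-(support ** 2).sum(axis=1) / scale ** 2)
    return support, probs / probs.sum()

# The random objective f(r_x, r_y, t_p, Wx, Wy) = f0(r_x + Wx, r_y + Wy, t_p)
# is then evaluated at the perturbed robot location.
```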
5. Conclusion

To tackle the BO of VAR problem, we construct a no-regret algorithm, namely the VAR upper confidence bound (V-UCB) algorithm, through the design of a confidence bound of VAR and a set of lacing values (LV) that is guaranteed to exist. Besides, we introduce a heuristic to select an LV that improves the empirical performance of V-UCB over random selection of LV. We also draw an elegant connection between BO of VAR and adversarially robust BO in terms of both problem formulation and solutions. Lastly, we provide practical techniques for implementing V-UCB with continuous Z. V-UCB is not only more computationally efficient than the state-of-the-art ρKGapx algorithm for BO of VAR, but also demonstrates competitive empirical performance in our experiments. For future work, it is of interest to generalize this work to the optimization of CVAR of a black-box function (Cakmak et al., 2020), multi-fidelity BO (Zhang et al., 2017; 2019), high-dimensional BO (Hoang et al., 2018), batch BO (Daxberger & Low, 2017), private outsourced BO (Kharkovskii et al., 2020), and ranking BO (Nguyen et al., 2021).

Acknowledgements

This research is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Funds (Award A20H6b0151).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Basel Committee on Banking Supervision. International convergence of capital measurement and capital standards, 2006. URL http://www.bis.org/publ/bcbs22.pdf.

Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Proc. NIPS, pp. 1507–1515, 2016.

Bogunovic, I., Scarlett, J., Jegelka, S., and Cevher, V. Adversarially robust optimization with Gaussian processes. In Proc. NeurIPS, pp. 5760–5770, 2018.

Cakmak, S., Astudillo, R., Frazier, P. I., and Zhou, E. Bayesian optimization of risk measures. In Proc. NeurIPS, 2020.

Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proc. ICML, pp. 844–853, 2017.

Daxberger, E. A. and Low, B. K. H. Distributed batch Gaussian process optimization. In Proc. ICML, pp. 951–960, 2017.

Goh, J. W., Lim, K. G., Sim, M., and Zhang, W. Portfolio value-at-risk optimization for asymmetrically distributed asset returns. European Journal of Operational Research, 221(2):397–406, 2012.

Hoang, T. N., Hoang, Q. M., Ouyang, R., and Low, B. K. H. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pp. 3231–3238, 2018.

Iwazaki, S., Inatsu, Y., and Takeuchi, I. Mean-variance analysis in Bayesian optimization under uncertainty. arXiv preprint arXiv:2009.08166, 2020.

Kharkovskii, D., Dai, Z., and Low, B. K. H. Private outsourced Bayesian optimization. In Proc. ICML, 2020.

Kirschner, J., Bogunovic, I., Jegelka, S., and Krause, A. Distributionally robust Bayesian optimization. In Proc. AISTATS, 2020.

Koenker, R. and Bassett, G. Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50, 1978.

Nguyen, Q. P., Tay, S., Low, B. K. H., and Jaillet, P. Top-k ranking Bayesian optimization. In Proc. AAAI, pp. 9135–9143, 2021.

Nguyen, T. T., Gupta, S., Ha, H., Rana, S., and Venkatesh, S. Distributionally robust Bayesian quadrature optimization. In Proc. AISTATS, 2020.

Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. MIT Press, 2006.
Scott, W., Frazier, P., and Powell, W. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pp. 1015–1022, 2010.

Torossian, L., Picheny, V., and Durrande, N. Bayesian quantile and expectile optimisation. arXiv preprint arXiv:2001.04833, 2020.

Toscano-Palmerin, S. and Frazier, P. I. Bayesian optimization with expensive integrands. arXiv preprint arXiv:1803.08661, 2018.

Wang, Z. and Jegelka, S. Max-value entropy search for efficient Bayesian optimization. In Proc. ICML, pp. 3627–3635, 2017.

Williams, B. J., Santner, T. J., and Notz, W. I. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, pp. 1133–1152, 2000.

Zhang, Y., Hoang, T. N., Low, B. K. H., and Kankanhalli, M. Information-based multi-fidelity Bayesian optimization. In Proc. NeurIPS Workshop on Bayesian Optimization, 2017.

Zhang, Y., Dai, Z., and Low, B. K. H. Bayesian optimization with binary auxiliary information. In Proc. UAI, pp. 1222–1232, 2019.