# Failure-Aware Gaussian Process Optimization with Regret Bounds

Shogo Iwazaki (MI-6 Ltd., Tokyo, Japan; iwazaki@mi-6.co.jp), Shion Takeno (RIKEN AIP, Tokyo, Japan; shion.takeno@riken.jp), Tomohiko Tanabe (MI-6 Ltd., Tokyo, Japan; tanabe@mi-6.co.jp), Mitsuru Irie (MI-6 Ltd., Tokyo, Japan; irie@mi-6.co.jp)

Abstract. Real-world optimization problems often require black-box optimization with observation failure: if an evaluation succeeds we obtain the objective function value, and otherwise we obtain only the fact that it failed. Moreover, the failure region can be made complex by several latent constraints, whose number is also unknown. For this problem, we propose the failure-aware Gaussian process upper confidence bound (F-GP-UCB) algorithm, which requires only a mild assumption on the observation failure: that an optimal solution lies in the interior of a feasible region. Furthermore, we show that the number of successful observations grows linearly, from which we derive the first regret upper bounds and the convergence of F-GP-UCB. We demonstrate the effectiveness of F-GP-UCB on several benchmark functions, including a simulation function motivated by material synthesis experiments.

1 Introduction

The effectiveness of optimization methods based on Gaussian process (GP) regression for expensive-to-evaluate black-box functions has been repeatedly shown in a wide range of real-world applications, including robotics [34, 35], experimental design [14], and hyperparameter optimization [4, 41]. On the other hand, failure of the observation itself must often be considered. For example, in the optimization of hyperparameters for a complicated physical model, the evaluation may crash for some hyperparameters. Another example is materials development, in which experimental testing of new materials can reveal that the synthesis procedure fails.
Therefore, this study considers the optimization of a black-box function $f$ with a black-box deterministic failure function $c : \mathcal{X} \to \{0, 1\}$, where $\mathcal{X}$ is an input space. In this study, failure of observation at the input $x$ is represented as $c(x) = 1$, while success is represented as $c(x) = 0$. Then, our optimization problem can be formulated as follows:

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad c(x) = 0, \qquad (1)$$

where the observation of $f$ can be obtained only if $c(x) = 0$, but the observation cost is incurred even if $c(x) = 1$. Hence, the goal is to efficiently identify the optimal point $x^*$ while accounting for observation failure. One of the notable technical difficulties of problem (1) is how to handle the failure function $c$. To model $c$, existing studies [32, 2] use the GP classification (GPC) model [38] or its variants, which assume that $c$ can be represented by a smooth latent function. However, observation failure can be caused implicitly by several latent constraints. In addition, since only the binary failure label can be observed, the number of latent constraints is also unknown. Therefore, modeling $c$ with the usual GPC is hard and often unsuitable, which can degrade the optimization performance. Hence, an optimization method under a mild assumption on $c$ is demanded. Furthermore, the existence of $c$ makes theoretical analysis, and even securing the number of successful observations, difficult.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Our contribution. In this paper, we propose a novel GP-based optimization algorithm, failure-aware GP upper confidence bound (F-GP-UCB), which chooses the next evaluation in the input domain excluding an adaptively adjusted neighborhood of the past observation failures. Furthermore, we show that the number of successful observations grows linearly under a very mild assumption on the failure function $c$: that the optimal solution lies in the interior of a feasible region.
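As a concrete illustration of the observation model in problem (1), the following minimal Python sketch uses a hypothetical one-dimensional objective, failure region, and noise level of our own choosing (none of these appear in the paper): on failure only the binary label is revealed, while on success a noisy value of $f$ is returned.

```python
import numpy as np

_rng = np.random.default_rng(0)

# Hypothetical 1-D instance of problem (1); f, the failure region, and the
# noise level are illustrative choices, not objects from the paper.
def f(x):
    return float(np.sin(3.0 * x) + x)

def c(x):
    return 1 if 0.3 < x < 0.5 else 0  # c(x) = 1 means the evaluation fails

def observe(x, noise_std=0.01):
    """One query: on failure only the binary label is revealed; on success
    a noisy evaluation of f is returned alongside the label."""
    if c(x) == 1:
        return 1, None
    return 0, f(x) + noise_std * _rng.standard_normal()
```

Note that the observation cost is the same in both branches, which is exactly what makes avoiding the failure region worthwhile.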
Then, we provide the first regret upper bound for the GP optimization problem (1), which shows that F-GP-UCB converges to the optimal solution with high probability. We demonstrate the effectiveness of the F-GP-UCB algorithm on several benchmark functions, including a heuristic simulation function motivated by materials research on quasicrystals.

Related work. In the past few decades, many GP-based optimization algorithms have been developed [37, 42, 18, 8, 53]. In addition to the standard optimization setting, various extensions have been studied, such as parallel [9], high-dimensional [54], multi-fidelity [25], and robust optimization [6]. GP-based constrained optimization [10, 11, 19, 47] is closely related to our study. It considers black-box optimization under an inequality constraint $g(x) \leq 0$, where $g$ is also a black-box function. If the failure function in this paper is recast as $c(x) = \mathbb{1}\{g(x) > 0\}$, the optimization problem matches problem (1). However, there are two crucial differences. First, in constrained optimization, observations at an input can be obtained even if the constraint is not satisfied, i.e., $g(x) > 0$. Under the setting of this paper, no information about the objective function can be obtained at input points that result in failure. Second, in constrained optimization problems, it is generally assumed that noisy observation of $g(x)$ is possible. In our setting, we cannot obtain a direct observation of the latent function $g$. There exist works on GP optimization that take failures into account. Lindberg and Lee [32] proposed an algorithm that combines the Expected Improvement (EI) criterion [37] with the posterior success probability of the classical GPC. Instead of GPC, Sacher et al. [39] used a support vector machine-based model. However, although $c$ is deterministic, these classifiers assume that observation failure is essentially stochastic, i.e., evaluations at the same input can both fail and succeed.
This model misspecification can degrade the optimization performance, as described in Bachoc et al. [2]. Bachoc et al. [2] then proposed a deterministic variant of GPC, which models $c$ as $c(x) = \mathbb{1}\{g(x) \geq 0\}$, where $g$ is a latent function modeled by a GP. They also provide an EI-based strategy and prove convergence to an optimal solution. From a practical point of view, surrogate models of $c$ navigate the optimization process efficiently when $c$ is well represented by a smooth latent function $g$. However, practical applications often have failure processes too complex to model with one latent function (e.g., the failure function $c$ may depend on several latent functions). Furthermore, it is difficult to know beforehand that such a complex failure structure exists. Finally, from a theoretical perspective, Bachoc et al. [2] give a convergence guarantee, but do not provide its rate or a regret-based analysis.

2 Preliminaries

Problem setup. Let $f : \mathcal{X} \to \mathbb{R}$ and $c : \mathcal{X} \to \{0, 1\}$ be an unknown fixed objective function and an unknown failure function, respectively, where $\mathcal{X} := [0, 1]^d$ is the input space. At each time step $t$, the user selects $x_t \in \mathcal{X}$ and obtains the failure label $c(x_t)$. If $c(x_t) = 0$, the user additionally obtains a noisy observation $y_t = f(x_t) + \epsilon_t$, where $\epsilon_t$ is a noise term that is conditionally $\sigma$-sub-Gaussian. Our noise model is a mild one; examples include an arbitrary distribution over $[-\sigma, \sigma]$ and Gaussian noise with variance at most $\sigma^2$. Note that, in the case of $c(x_t) = 1$ (failure), the user obtains no further information. The user's goal is to efficiently identify the optimal solution $x^*$ over the unknown feasible region $S_c := \{x \in \mathcal{X} \mid c(x) = 0\}$. For convenience, we rephrase problem (1) using $S_c$ as follows:

$$x^* = \operatorname*{arg\,max}_{x \in S_c} f(x). \qquad (2)$$

We assume that $S_c \neq \emptyset$ and that there exists a unique solution $x^*$. Furthermore, we define the failure region $F_c$ as $F_c = \mathcal{X} \setminus S_c$.

Regret. We employ the regret to evaluate the algorithm's performance.
The regret $r_t$ at step $t$ is defined as $r_t = f(x^*) - f(\hat{x}_t)$ if $\hat{x}_t \in S_c$, and otherwise $r_t = f(x^*) - \min_{x \in \mathcal{X}} f(x)$, where $\hat{x}_t \in \mathcal{X}$ is the algorithm's estimated solution¹. In our definition of $r_t$, the algorithm is assumed to incur the worst-case regret when $\hat{x}_t$ is not in $S_c$. Similar definitions are also used as performance metrics in GP-based constrained optimization [19].

Regularity assumptions for the objective function. As a regularity assumption, we assume that $f$ is an element of the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ corresponding to a known positive-definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, with bounded Hilbert norm $\|f\|_{\mathcal{H}_k} \leq B$. Furthermore, the kernel $k$ is assumed to be normalized, namely $\forall x \in \mathcal{X},\ k(x, x) \leq 1$. These are common assumptions in the existing GP optimization literature [42, 51, 8, 48].

Gaussian process modeling. Our algorithm uses the modeling information of $f$ from a GP [38], using only the successful observations. First, we place a zero-mean GP prior with covariance function $k$ on $f$. Next, we model the generating process of the value $y_t$ queried at $x_t$ as $y_t = f(x_t) + \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. We note that the GP prior on $f$ and the Gaussian noise, as stated above, are assumptions made only for constructing the GP model, and may differ from the true underlying function $f$ and noise. Let $I_t^{(S)} := \{i \in \{1, \dots, t\} \mid c(x_i) = 0\}$ be the index set of successful observations, $X_t^{(S)} := (x_i)_{i \in I_t^{(S)}}$ the corresponding inputs, $y_t^{(S)} := (y_i)_{i \in I_t^{(S)}}$ the corresponding outputs, and $n_t := |I_t^{(S)}|$ the number of successful observations. In addition, we define the failure index set $I_t^{(F)} := \{i \in \{1, \dots, t\} \mid c(x_i) = 1\}$. Given $X_t^{(S)}$ and $y_t^{(S)}$, the posterior distribution of $f(x)$ is a normal distribution, whose posterior mean $\mu_t(x)$ and posterior variance $\sigma_t^2(x)$ are given as follows:

$$\mu_t(x) = k(x, X_t^{(S)})^\top \left( K(X_t^{(S)}) + \sigma^2 I_{n_t} \right)^{-1} y_t^{(S)}, \qquad (3)$$

$$\sigma_t^2(x) = k(x, x) - k(x, X_t^{(S)})^\top \left( K(X_t^{(S)}) + \sigma^2 I_{n_t} \right)^{-1} k(x, X_t^{(S)}). \qquad (4)$$

Here, $I_{n_t}$ is the $n_t \times n_t$ identity matrix.
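Equations (3) and (4) can be implemented directly. The sketch below (helper names and the length-scale value are our own illustrative choices) computes the posterior mean and variance from the successful observations only, together with confidence bounds of the UCB/LCB form used by UCB-type methods.

```python
import numpy as np

def gaussian_kernel(A, B, ell=0.2):
    # Normalized Gaussian kernel: k(x, x') = exp(-||x - x'||^2 / (2 ell^2)),
    # so k(x, x) = 1, matching the normalization assumption.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def posterior(X_s, y_s, X_q, sigma2=1e-4):
    """Posterior mean and variance, equations (3)-(4), computed from the
    successful observations (X_s, y_s) only, evaluated at query points X_q."""
    if len(X_s) == 0:
        return np.zeros(len(X_q)), np.ones(len(X_q))  # prior: mean 0, k(x, x) = 1
    K = gaussian_kernel(X_s, X_s) + sigma2 * np.eye(len(X_s))
    k_q = gaussian_kernel(X_q, X_s)                   # (m, n_t) cross-covariances
    mu = k_q @ np.linalg.solve(K, y_s)
    # k(x, x) = 1 for this normalized kernel, so the first term of (4) is 1.
    var = 1.0 - np.sum(k_q * np.linalg.solve(K, k_q.T).T, axis=1)
    return mu, np.maximum(var, 0.0)

def conf_bounds(mu, var, beta_sqrt):
    # Confidence bounds mu -/+ beta^{1/2} * posterior standard deviation.
    sd = np.sqrt(var)
    return mu - beta_sqrt * sd, mu + beta_sqrt * sd
```

The variance is clipped at zero to guard against small negative values from floating-point round-off.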
Furthermore, denoting the $j$-th smallest element of $I_t^{(S)}$ by $i_j$, $k(x, X_t^{(S)}) \in \mathbb{R}^{n_t}$ is the vector whose $j$-th element is $k(x, x_{i_j})$. Similarly, $K(X_t^{(S)}) \in \mathbb{R}^{n_t \times n_t}$ is the kernel matrix whose $(j, k)$-th element is $k(x_{i_j}, x_{i_k})$. Lastly, we define the maximum information gain [42, 49] as a quantity representing the GP complexity. The maximum information gain $\gamma_{t;A}$ from observing $t$ data points over the set $A \subset \mathbb{R}^d$ is defined as $\gamma_{t;A} = \max_{X := (x_1, \dots, x_t) \in A^t} I(f_X; y_X)$, where $I(f_X; y_X) := 0.5 \ln \det(I_t + \sigma^{-2} K(X))$ is the mutual information between the latent function values $f_X \sim \mathcal{N}(0, K(X))$ and the corresponding outputs $y_X = f_X + \epsilon_t$, $\epsilon_t \sim \mathcal{N}(0, \sigma^2 I_t)$, of the GP. In GP-based optimization, $\gamma_{t;A}$ is often used as a quantity that characterizes the confidence bound and regret bound for $f$. Furthermore, for a compact convex set $A$, upper bounds for $\gamma_{t;A}$ have been derived for several commonly used kernels. For example, $\gamma_{t;A} = O((\ln t)^{d+1})$ for the Gaussian kernel and $\gamma_{t;A} = O(t^{d/(2\nu+d)} (\ln t)^{2\nu/(2\nu+d)})$ for the Matérn kernel with smoothness parameter $\nu > 1/2$ [42, 49]. The following Lemma 2.1 adapts to our problem setup the well-known result that gives the confidence bound of $f$; it is a direct consequence of Theorem 3.11 in [1].

Lemma 2.1. Fix $f \in \mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k} \leq B$. Suppose the observation $y_t = f(x_t) + \epsilon_t$ has a noise term $\epsilon_t$ that is conditionally $\sigma$-sub-Gaussian. Define $\{\beta_t\}_{t \in \mathbb{N}}$ as $\beta_t^{1/2} = B + \sigma \sqrt{2(\gamma_{t-1;\mathcal{X}} + 1 + \ln(1/\delta))}$ for $\delta \in (0, 1)$. Then the following holds with probability at least $1 - \delta$:

$$\forall t \geq 1,\ \forall x \in \mathcal{X},\quad \mathrm{lcb}_t(x) \leq f(x) \leq \mathrm{ucb}_t(x). \qquad (5)$$

Here, $\mathrm{lcb}_t(x)$ and $\mathrm{ucb}_t(x)$ are defined as $\mathrm{lcb}_t(x) = \mu_{t-1}(x) - \beta_t^{1/2} \sigma_{t-1}(x)$ and $\mathrm{ucb}_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$, respectively.

¹Some applications prefer to assess the performance via cumulative regret. In Appendix C, the analysis of the cumulative regret of our algorithm is provided.

Regularity assumption for the failure function. A regret upper bound cannot be obtained without any assumption on $c$.
As an extreme example, consider the case where $x^*$ is an isolated point surrounded by $F_c$. Under this scenario, there is a worst case in which no algorithm can ever observe $x^*$ in a finite number of trials. Therefore, in order to give a convergence guarantee, at least $x^*$ must be contained in a subset of the feasible region $S_c$ having non-zero volume. In this paper, we focus on the case where $x^*$ is an interior point of $S_c$, as in the following assumption.

Assumption 2.2. There exists $\eta > 0$ such that $N_{x^*;\eta} \subset S_c$, where $N_{x^*;\eta} := \{x \in \mathcal{X} \mid \|x - x^*\|_\infty < \eta\}$ is the open $\ell_\infty$-ball of radius $\eta$ centered at $x^*$².

The parameter $\eta$ above depends on the size of the subset of the feasible region to which $x^*$ belongs and is an important quantity for the theoretical analysis. Note that Assumption 2.2 is quite mild. For example, when $c$ is defined as $c(x) = \mathbb{1}\{g(x) \geq 0\}$ with a continuous latent function $g$, there exists $\eta > 0$ such that $N_{x^*;\eta} \subset S_c$ by the continuity of $g$. Since the analysis of [2] assumes the existence of a latent function $g$ that is a continuous sample path generated from a GP, our assumption also subsumes theirs. Finally, note that $\eta$ is not needed for running the algorithm and is only used as a quantity that characterizes the regret bound.

3 Proposed algorithm

Our proposed algorithm, F-GP-UCB, is shown as pseudo-code in Algorithm 1. Roughly speaking, F-GP-UCB searches the input domain excluding an adaptively adjusted neighborhood of past failure points, based on the existing GP-UCB [42] strategy. Below, we start by describing the background of the algorithm construction.

Philosophy of algorithm construction. Since $F_c$ is unknown, the algorithm needs to appropriately avoid failure observations in $F_c$ while balancing the trade-off between exploration and exploitation in a way that guarantees convergence.
The difficulty of this problem lies in the fact that there exist cases where a small feasible region containing $x^*$ is surrounded by past failure observations and becomes isolated, as shown in the left plot of Fig. 1. Since, from Assumption 2.2 alone, the algorithm cannot exclude the possibility that an arbitrarily small feasible region exists between failure points, the algorithm should be constructed so that it simultaneously satisfies the following two requirements: failure observations should be avoided; and the inputs evaluated by the algorithm should eventually cover regions arbitrarily close to past failure points. F-GP-UCB satisfies these requirements by adaptively controlling the GP-UCB search region, eliminating the neighborhood of past failure points and shrinking the eliminated region as the steps increase. Next, we describe the details of and ideas behind each step.

Selection of $x_t$. At each step $t$, the F-GP-UCB algorithm first computes the search region $\mathcal{X}_t$ as

$$\mathcal{X}_t = \{x \in \mathcal{X} \mid \forall i \in I_{t-1}^{(F)},\ \|x_i - x\|_\infty \geq \theta_t b(t)\}. \qquad (6)$$

Here, $\theta_t$ and $b : \mathbb{R}_+ \to \mathbb{R}_+$ are a scale parameter and a monotonically decreasing function, respectively. The parameter $\theta_t$ is controlled by the algorithm at every step so that it decreases monotonically in $t$ from its initial value $\theta_0$ given by the user. It plays the role of guaranteeing that the search space satisfies $\mathcal{X}_t \neq \emptyset$. The function $b$ is defined by the user before running the algorithm. Using $\mathcal{X}_t$, the F-GP-UCB algorithm chooses $x_t$ as

$$x_t = \operatorname*{arg\,max}_{x \in \mathcal{X}_t} \mathrm{ucb}_t(x). \qquad (7)$$

The middle and right plots in Fig. 1 show example behaviors of the F-GP-UCB algorithm. By using the monotonically decreasing $\theta_t b(t)$ to control the search space, the algorithm's behavior can be qualitatively described as follows. First, during the early phase where $t$ is small, $\theta_t b(t)$ is large, corresponding to choosing an unexplored point for observation while avoiding a large neighborhood of the past failure points, where there is a high possibility of failure.
Then, as $t$ increases and the remaining unexplored space shrinks, the algorithm's behavior is expected to become more aggressive, accounting for the possibility that the region near the observed failure points contains a feasible region that may be difficult to identify.

²Our algorithm and theoretical analysis can be generalized to arbitrary norms over $\mathbb{R}^d$, such as the $L_2$-norm, and to a compact convex subset $\mathcal{X} \subset \mathbb{R}^d$; however, the computational technique discussed in Sec. 5 assumes the infinity norm.

Figure 1: An example problem in one dimension. The left plot shows a situation in which there exists an isolated feasible region that includes the optimum. The green and grey shaded areas represent the feasible region $S_c$ and the failure region $F_c$, respectively. The green (black) points represent the observed successful (failure) points. In this situation, it is necessary to identify the feasible region hiding among the observed failure points. The middle and right plots show example behaviors of the F-GP-UCB algorithm at $t = 5$ and $t = 10$, respectively. The gray shaded regions represent the neighborhoods of the observed failure points that are excluded from the search space at the given step. The F-GP-UCB algorithm searches all the feasible regions while avoiding the observed failure points by iteratively narrowing the excluded neighborhoods of the failure points.

Algorithm 1 The F-GP-UCB algorithm
Input: $\theta_0 \in (0, 1)$, $b : \mathbb{R}_+ \to \mathbb{R}_+$, $\{\beta_t\}_{t \in \mathbb{N}_+}$.
1: Initialize the GP prior and set $I_0^{(S)} = I_0^{(F)} = \emptyset$.
2: for $t = 1$ to $T$ do
3:   $\tilde{\theta} \leftarrow \theta_{t-1}$.
4:   while $\mathcal{X} \subset \bigcup_{i \in I_{t-1}^{(F)}} N_{x_i;\, \tilde{\theta} b(t)}$ do
5:     $\tilde{\theta} \leftarrow \tilde{\theta} / 2$.
6:   end while
7:   $\theta_t \leftarrow \tilde{\theta}$ and define $\mathcal{X}_t$ as in (6).
8:   Choose $x_t = \operatorname{arg\,max}_{x \in \mathcal{X}_t} \mathrm{ucb}_t(x)$.
9:   if $c(x_t) = 0$ then
10:    Observe $y_t = f(x_t) + \epsilon_t$ and update the GP.
11:    $I_t^{(S)} \leftarrow I_{t-1}^{(S)} \cup \{t\}$, $I_t^{(F)} \leftarrow I_{t-1}^{(F)}$.
12:  else
13:    $I_t^{(S)} \leftarrow I_{t-1}^{(S)}$, $I_t^{(F)} \leftarrow I_{t-1}^{(F)} \cup \{t\}$.
14:  end if
15: end for
16: Calculate $\hat{x}_T$ as in (8) if $I_T^{(S)} \neq \emptyset$.
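One selection step of Algorithm 1 can be sketched on a finite candidate set, which replaces the continuous arg max in (7) with a grid search (function and variable names are ours, not the paper's):

```python
import numpy as np

def select_next(cand, ucb, fail_pts, theta, b_t):
    """One selection step of Algorithm 1 on a finite candidate set `cand`:
    halve theta until the exclusion rule (6) leaves some candidate, then
    return the index of the UCB maximizer over the surviving candidates (7)."""
    while True:
        if len(fail_pts) == 0:
            mask = np.ones(len(cand), dtype=bool)
        else:
            # infinity-norm distances from each candidate to each failure point
            dist = np.abs(cand[:, None, :] - fail_pts[None, :, :]).max(-1)
            mask = (dist >= theta * b_t).all(axis=1)
        if mask.any():
            break
        theta /= 2.0  # shrink the excluded neighborhoods so the region is non-empty
    idx = np.flatnonzero(mask)[np.argmax(ucb[mask])]
    return idx, theta
```

On a 1-D grid with a single failure point at 0.5, for example, an initially oversized theta is halved until the grid endpoints survive the exclusion, and the UCB maximizer among the survivors is chosen.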
The choice of $\theta_0$ and $b$ affects the performance of F-GP-UCB. In Sec. 4, we show convergence of the regret under an appropriate choice of $b$. In Sec. 5, practical choices of $\theta_0$ and $b$ are discussed.

Shrinking $\theta_t$. If $\mathcal{X}_t = \emptyset$, then $x_t$ is not defined. We ensure such cases do not occur by controlling the scale parameter $\theta_t$; its main role in our algorithm is to guarantee that $x_t$ is well-defined. Specifically, before step $t$ starts, $\theta_t$ is computed by halving the previous scale parameter $\theta_{t-1}$ until the union of all the neighborhoods no longer fully covers $\mathcal{X}$. This procedure requires solving the set covering problem of deciding whether the neighborhoods of the past failure points cover $\mathcal{X}$. Unfortunately, this problem is known to be NP-complete [20]. In Sec. 5, we provide a practical approach to this problem. Note, on the other hand, that our analysis in Sec. 4 assumes this procedure can be computed exactly.

Estimated solution. The estimated solution $\hat{x}_t$ of the F-GP-UCB algorithm at step $t$ is defined as follows:

$$\hat{x}_t = x_{\hat{t}}, \quad \text{where } \hat{t} = \operatorname*{arg\,max}_{i \in I_t^{(S)}} \mathrm{lcb}_i(x_i). \qquad (8)$$

Defining the estimated solution via the observed points and $\mathrm{lcb}_i$ is common in the existing literature [6, 26, 21]. While the estimated solution of the proposed method resembles those in the literature, note that the maximum of $\mathrm{lcb}_i$ is computed over $I_t^{(S)}$. Note that $\hat{x}_t$ is not defined if all the points up to $t$ are failures (i.e., $I_t^{(S)} = \emptyset$). In Sec. 4, it is shown that $\hat{x}_T$ always exists for a sufficiently large number of steps $T$. Finally, although the estimated solution (8) is useful for discussing the theoretical performance, in practice $\hat{x}_t = x_{\hat{t}}$ with $\hat{t} = \operatorname{arg\,max}_{i \in I_t^{(S)}} \mathrm{lcb}_t(x_i)$ is often used instead [6, 21]. We use this modified definition of $\hat{x}_t$ in the benchmark experiments in Sec. 6.

4 Theoretical analysis

The results of the theoretical analysis are discussed in this section.
The complete proofs of the theorems and lemmas are given in Appendix A.

Key of the regret analysis. To analyze the regret, we start by evaluating the growth rate of the number of successful observations $n_t$. Noting that $\theta_t b(t)$ is monotonically decreasing, the failure points of F-GP-UCB are separated from each other by a distance of at least $\theta_t b(t)$. Therefore, $n_t$ can be evaluated via the $\theta_t b(t)$-packing number of $F_c$. The following Lemma 4.1 is based on the standard argument for the packing number and gives a lower bound on the number of successful observations $n_t$.

Lemma 4.1. When running Algorithm 1, $n_t \geq t - P(\mathcal{X}, \|\cdot\|_\infty, \theta_t b(t)/4)$ holds for any $t \in \mathbb{N}_+$, where $P(\mathcal{X}, \|\cdot\|_\infty, \epsilon)$ denotes the $\epsilon$-packing number of the set $\mathcal{X}$ with respect to the norm $\|\cdot\|_\infty$.

Using Lemma 4.1, choosing $b$ so that $t - P(\mathcal{X}, \|\cdot\|_\infty, \theta_t b(t)/4) = \Theta(t)$ holds leads to $n_t = \Omega(t)$, which means that the number of successful observations can be secured at the same rate as in the standard GP optimization setting. Using the packing-number bound of Lemma A.4, any $b$ with $b(t)^{-d} = o(t)$ is sufficient for $n_t = \Omega(t)$³. While our result holds for any such $b$, for simplicity we state the results in this section for $b(t) = t^{-\alpha}$ with $\alpha \in (0, 1/d)$. In Appendix A, the result for general $\mathcal{X}$ and $b$ is also provided.

Regret bound. The following Theorem 4.2 gives the regret bound of the F-GP-UCB algorithm.

Theorem 4.2. Fix $\delta \in (0, 1)$, $\alpha \in (0, 1/d)$, $\mathcal{X} = [0, 1]^d$, and $f \in \mathcal{H}_k$ such that $\|f\|_{\mathcal{H}_k} \leq B$. Let $\beta_t^{1/2} = B + \sigma \sqrt{2(\gamma_{t-1;\mathcal{X}} + 1 + \ln(1/\delta))}$ and $b(t) = t^{-\alpha}$. In addition, suppose that there exists $\eta > 0$ such that $N_{x^*;\eta} \subset S_c$, where $N_{x^*;\eta} = \{x \in \mathcal{X} \mid \|x - x^*\|_\infty < \eta\}$. When Algorithm 1 is applied under the above conditions, the estimated solution $\hat{x}_T$ defined in (8) exists for all $T \geq T_s := \max\{(10/\bar{\theta}^d)^{1/(1 - d\alpha)}, 2\} + 1$, where $\bar{\theta} = \min\{\theta_0, \eta/2\}$.
Moreover, the following holds with probability at least $1 - \delta$:

$$\forall T \geq T_s, \quad r_T \leq \frac{2}{T} \left[ 2B(\bar{T} - 1) + \sqrt{C_1 \beta_T T \gamma_{T;S_c}} \right], \qquad (9)$$

where $\bar{T} = (\theta_0/\eta)^{1/\alpha} + 1$ and $C_1 = 8 / \ln(1 + \sigma^{-2})$.

In the analysis above, $\eta$ is interpreted as the complexity of the failure function $c$, and $\bar{\theta}$ corresponds to the lower bound on $\theta_t$ during the run of the algorithm. The constants $T_s$ and $\bar{T}$ are important quantities characterizing the regret and can be respectively interpreted as follows. The constant $T_s$ is an upper bound on the number of steps until the first successful observation. The constant $\bar{T}$ is an upper bound on the number of steps needed until the search space $\mathcal{X}_t$ of the algorithm always contains the optimal solution $x^*$. The regret bound in Theorem 4.2 can be intuitively seen as evaluating the algorithm's execution in two stages along the time axis. The first term of (9) represents the regret incurred while the search space $\mathcal{X}_t$ does not contain $x^*$. Specifically, the worst-case regret incurred during this phase is bounded by $2B$, using the fact that $\sup_{x \in \mathcal{X}} |f(x)| \leq B$ from the kernel normalization condition. The second term is the regret incurred in the process of identifying $x^*$ after $\mathcal{X}_t$ includes $x^*$, and is similar to the regret bound of standard GP-UCB. The only difference is that, in our setup, the GP observations lie only in $S_c$, so the maximum information gain is defined over $S_c$. Noting that $\gamma_{T;S_c} \leq \gamma_{T;\mathcal{X}}$ in the second term and using known upper bounds on $\gamma_{T;\mathcal{X}}$, a more explicit form of the regret upper bound can be obtained on a per-kernel basis. For example, in the case of the Gaussian kernel, we have $\gamma_{T;\mathcal{X}} = O((\ln T)^{d+1})$ and

$$r_T = O\left( T^{-1} B \bar{T} + T^{-1/2} (\ln T)^{(d+1)/2} \left( B + \sigma \sqrt{(\ln T)^{d+1} + \ln(1/\delta)} \right) \right).$$

Thus, we indeed obtain $r_T \to 0$ (as $T \to \infty$).

³Strictly speaking, we use the fact that $\theta_t$ becomes constant for sufficiently large $t$. See Lemma A.8 in Appendix A.
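The role of Lemma 4.1 can be checked numerically. The sketch below uses an illustrative packing-number bound for $\mathcal{X} = [0,1]^d$ under the infinity norm (the exact constant in the bound is our assumption, not a quantity from the paper) with $b(t) = t^{-\alpha}$; the lower bound on $n_t$ becomes informative only once $t$ is moderately large.

```python
import math

def packing_bound(eps, d):
    # Assumed illustrative bound: P([0,1]^d, ||.||_inf, eps) <= ceil(1/eps) ** d.
    return math.ceil(1.0 / eps) ** d

def success_lower_bound(t, theta, alpha, d):
    # Lemma 4.1 with b(t) = t ** (-alpha): n_t >= t - P(X, ||.||_inf, theta*b(t)/4).
    eps = theta * t ** (-alpha) / 4.0
    return t - packing_bound(eps, d)

# With alpha = 1/(2d) < 1/d the subtracted term is O(t^{d*alpha}) = O(sqrt(t)),
# so the lower bound on n_t eventually grows linearly in t.
d, theta, alpha = 2, 0.5, 1.0 / (2 * 2)
```

For small $t$ the bound can be negative (hence vacuous), consistent with the constant $T_s$ in Theorem 4.2, which bounds the time until the first successful observation.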
The F-GP-UCB algorithm is guaranteed to converge whenever the cumulative regret of ordinary GP-UCB (corresponding to the second term in (9)) is sub-linear under the chosen setup and kernel⁴.

The case where $x^*$ lies on the boundary of $S_c$. The case where $x^*$ lies on the boundary of $S_c$, which often arises in real-world applications, is not covered by Theorem 4.2. If undesirable Lipschitz-style dependencies are allowed to appear in the regret upper bound, our F-GP-UCB algorithm is also guaranteed to converge in such cases (Theorem B.2). The details of this result and a discussion of its limitations are given in Appendix B.

5 Practical considerations

In this section, we discuss several issues that arise in practical situations and provide solutions. Due to space limitations, we give only brief descriptions; the details are described in Appendix E.

Computation of $\theta_t$. The computation of $\theta_t$ requires knowing whether the union of all the neighborhoods of the failure points covers $\mathcal{X}$. A simple solution might be to partition $\mathcal{X}$ into a sufficiently fine grid by subdividing each axis and look for a feasible solution in each grid cell by brute force. However, as the number of steps $t$ and the dimension $d$ become large, the number of grid cells grows rapidly. Furthermore, even a very fine grid cannot detect the case where a feasible solution exists on a cell boundary. Another heuristic approach is to use a constrained optimization solver that does not require a feasible solution among its initial points (e.g., the penalty function method or the augmented Lagrangian method [17]) and to try to solve problem (7) using $\theta_{t-1}$. If the solver is unable to find a feasible solution, we decide that $\mathcal{X}_t = \emptyset$, multiply $\theta_{t-1}$ by $1/2$, and try to solve problem (7) again. This is our chosen approach. A drawback of this approach is that, depending on the scale of $\theta_t$, the constrained optimization problem may have to be solved many times.
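The brute-force grid check described above can be sketched as follows (a heuristic, not the solver-based approach the paper ultimately adopts; as noted in the text, feasible points lying between grid nodes can be missed):

```python
import numpy as np

def grid_has_feasible(fail_pts, radius, d, m=50):
    """Brute-force check on an m^d grid: does some grid point of X = [0,1]^d
    lie outside every infinity-norm neighborhood of the given radius around
    the failure points?"""
    axes = [np.linspace(0.0, 1.0, m)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, d)
    if len(fail_pts) == 0:
        return True
    dist = np.abs(grid[:, None, :] - fail_pts[None, :, :]).max(-1)
    return bool((dist >= radius).all(axis=1).any())

def shrink_theta(fail_pts, theta, b_t, d):
    # Halve theta until the grid certifies a point outside all excluded
    # neighborhoods, mimicking the halving loop of Algorithm 1.
    while not grid_has_feasible(fail_pts, theta * b_t, d):
        theta /= 2.0
    return theta
```

The $m^d$ cost of the grid is exactly the scalability issue the text points out, which motivates the solver-based alternative.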
We show that, by performing appropriate preprocessing, a feasible solution can be obtained within two rounds of constrained optimization in the worst case. Details are given in Appendix E.1.

Selection of $b$. As stated in Sec. 4, the F-GP-UCB algorithm is valid for an arbitrary monotonically decreasing function $b$ such that $t - P(\mathcal{X}, \|\cdot\|_\infty, \theta_t b(t)/4) = \Theta(t)$. However, if $b$ decreases more slowly than $1/t^\alpha$ ($\alpha \in (0, 1/d)$), for example $b(t) = 1/\ln(t+1)$, then the dominant term of our regret upper bound can become $O(1/\ln T)$ (see Theorem B.3 in Appendix B and Corollary F.1 in Appendix F). Intuitively, this means that an exponentially large sample size is needed to obtain an arbitrarily small regret, which is not desirable behavior. In this paper, we use $b$ of the form $b(t) = 1/t^\alpha$ with $\alpha = 1/(2d)$ as a practical choice. (Additional discussion of $\alpha$ is given in Appendix E.2.)

Adaptive tuning of $\theta_0$. Although the choice of $\theta_0$ does not affect the no-regret guarantee, it has a large impact on the practical behavior when the number of steps is small. For example, if $\theta_0$ is set to a large value when the failure region $F_c$ is relatively small, there is a risk of excessively avoiding feasible regions that should be searched. It is practically difficult to find a desirable choice of $\theta_0$, since it depends on conditions unknown to the user, such as the failure region $F_c$. To alleviate this issue while retaining the theoretical convergence guarantee, we adaptively choose $\theta_t$ based on the posterior standard deviation $\sigma_{t-1}(x_t)$ of the observed points. We note that a similar strategy using $\sigma_{t-1}(x_t)$ to adaptively select an unknown parameter was also developed in [54] to set unknown kernel hyperparameters in ordinary GP optimization.

⁴As with standard GP-UCB, $\beta_T = O(\gamma_{T;\mathcal{X}})$ in F-GP-UCB, which leads to a regret of $O(\gamma_{T;\mathcal{X}} / \sqrt{T})$. In Appendix D, we discuss the possibility of addressing this issue based on several existing works [48, 30].
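The adaptive rule sketched above can be written compactly as follows; the function name is ours, and the default parameter values mirror those reported in the text ($w = 0.75$, $h_\sigma = 0.02$, $q = 3$, $\theta_{\min} = 0.0001$):

```python
def adapt_theta(theta, sd_history, h_sigma=0.02, q=3, w=0.75, theta_min=0.0001):
    """One update of the adaptive scale rule: if the posterior standard
    deviation at the chosen points fell below h_sigma for the last q steps,
    shrink theta by the factor w, but never below theta_min."""
    if len(sd_history) >= q and all(s < h_sigma for s in sd_history[-q:]):
        theta = max(theta * w, theta_min)
    return theta
```

Starting from a large initial value and shrinking only after repeated low-variance observations lets the neighborhood scale settle toward a value appropriate for the unknown failure region.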
Algorithm 4 in Appendix E.3 shows the pseudo-code of our modified strategy. In Algorithm 4, the user specifies the possible range $[\theta_{\min}, \theta_{\max}]$ for $\theta_t$, a threshold $h_\sigma > 0$ for the posterior standard deviation, a threshold $q \in \mathbb{N}_+$ for the number of steps, and a scaling factor $w \in (0, 1)$. As the initial value, we set $\theta_0 = \theta_{\max}$. After that, if the posterior standard deviation at the observed point falls below $h_\sigma$ for $q$ consecutive steps, $\theta_t$ is multiplied by $w$ to lower its value, but never below the minimum value $\theta_{\min}$. Intuitively, if the posterior standard deviation is repeatedly excessively small, it is possible that $\theta_0$ was set excessively large, which may lead to the unwanted exclusion of feasible regions from the search space. In this procedure, by setting $\theta_0$ to the large value $\theta_{\max}$ at the start, we expect it to be adjusted towards an appropriate neighborhood scale. In our simulation experiments, we set $w = 0.75$, $h_\sigma = 0.02$, $q = 3$, $\theta_{\min} = 0.0001$, and $\theta_{\max} = 0.5$. We emphasize that this procedure does not affect our convergence guarantee (Theorem F.2 in Appendix E). Lastly, the pseudo-code including all the considerations provided in this section is described in Algorithm 5 of Appendix E.

6 Numerical experiments

In this section, we show the performance of F-GP-UCB through numerical experiments. The detailed settings of the experiments and additional results can be found in Appendices G and H, respectively. First, for comparison, we use EI [37] and GP-UCB [42] as baseline algorithms that do not consider failures. In each of these algorithms, the GP model is constructed using only the successful observations; the failure observations are not used. We also compare F-GP-UCB with the EFI-GPC-EP and EFI-GPC-Sign algorithms, which leverage GPC models [2]. EFI-GPC-EP is the algorithm that uses the classic GPC model with probit likelihood, with posterior approximation based on Expectation Propagation (EP).
EFI-GPC-Sign uses a variant of GPC proposed in [2]. Both EFI-GPC-EP and EFI-GPC-Sign choose the next input based on the posterior success probability of the GPC and the EI value of $f$, as in [2] (details are in Appendix H). In EFI-GPC-EP and EFI-GPC-Sign, as the prior for the latent function of the GPC, we use a zero-mean GP with Gaussian kernel $k_c(x, y) := \sigma_c^2 \exp(-\|x - y\|^2 / (2 l_c^2))$. Before the experiment, we fix the kernel parameters $\sigma_c$ and $l_c$ of the GPC by marginal likelihood maximization using observations at a randomly generated Sobol sequence. All the GPC kernel hyperparameters fixed by this procedure are given in Appendix H. Furthermore, it is well known that the theoretical choice of $\beta_t$ in GP-UCB is excessively conservative. We set $\beta_t = 2 \ln(2t)$ as in [24] for GP-UCB and F-GP-UCB. The other parameters of F-GP-UCB are fixed as described in Sec. 5. To evaluate the algorithms, we track the regret at each step, $r_t = f(x^*) - f(\hat{x}_t)$. We note that $\hat{x}_t$ is the estimated solution $\hat{x}_t = x_{\hat{t}}$ with $\hat{t} = \operatorname{arg\,max}_{i \in I_t^{(S)}} \mathrm{lcb}_t(x_i)$, slightly modified from the theoretical definition as discussed in Sec. 3. If $I_t^{(S)} = \emptyset$, we set $r_t = f(x^*) - \min_{x \in \mathcal{X}} f(x)$. This definition is the same as the utility gap metric for constrained GP optimization [19]. In each experiment, we report the average performance over 20 optimization trials with different seeds. In each trial, we generate one point uniformly at random over $\mathcal{X}$ and use it as the initial point. Lastly, we add artificial noise $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 0.0001$ to the observations of $f$, and we also fix the noise variance hyperparameter of the GP model for $f$ to $0.0001$.

Synthetic experiments with the Branin function. We perform synthetic benchmark experiments with the Branin function, whose input space is scaled to $[0, 1]^2$. The failure function used in these experiments is shown in Fig. 2. This failure function has a feasible region on the upper right side which is easily identifiable.
In addition, the failure function also contains multiple small isolated feasible regions, one of which contains the optimal solution. We use the Gaussian kernel $k(x, y) := \sigma_f^2 \exp(-\|x - y\|_2^2 / (2 l_f^2))$ to construct the GP model of $f$. The parameters $\sigma_f$ and $l_f$ are fixed beforehand by marginal likelihood maximization over a Sobol sequence of 1024 points. The subsequent experiments also use the Gaussian kernel, with parameters fixed as described above. Figure 2 shows the result. Under this benchmark setting, F-GP-UCB has better performance. We also give the detailed behavior of F-GP-UCB in Appendix G. We note that, for EI and GP-UCB, the regret stops decreasing at an early stage. This is because these algorithms get stuck on the same failure point.

Figure 2: The left plot shows the Branin function whose input space is scaled to $[0, 1]^2$; the shaded regions represent the failure regions and the optimal solution is marked. The right plot shows the regret in the synthetic problem using the Branin function for GP-EI, GP-UCB, EFI-GPC-EP, EFI-GPC-Sign, and F-GP-UCB, averaged over 20 experiments with different random seeds. The error bars correspond to two standard errors.

Figure 3: Experiment results for benchmark functions of constrained optimization problems (left and middle) and the quasicrystal simulation function (right). The left and middle plots show the results for the Gardner and Hartmann test functions, respectively. The error bars correspond to two standard errors.

Test functions for constrained optimization. We adopt two benchmark settings used in the existing GP-based constrained optimization literature.
Gardner [10] is a 2D test problem whose objective and constraint functions are defined as combinations of sine and cosine functions. Hartmann [29] is a 3D test problem⁵ whose feasible region is a unit hypersphere. To adapt the constrained problems to our setting, the feasible region in each benchmark is defined as the success region, while the rest is defined as the failure region. Fig. 3 shows the results. Among the compared algorithms, F-GP-UCB has the best performance on the Gardner function, while EFI-GPC-Sign has the best performance on the Hartmann function. This can be understood from the fact that the feasible region of the Hartmann function, a unit hypersphere, can be modeled easily by GPC. However, F-GP-UCB performs better than EFI-GPC-EP, which uses a classic GPC model. Moreover, the regret of F-GP-UCB continues to decrease as the number of iterations increases.

Numerical experiments with quasicrystals. We perform a numerical experiment involving quasicrystals in the Al-Cu-Mn system. We consider the optimization of the phonon thermal conductivity for quasicrystals. In this setting, the input space is the relative composition of the three elements, restricted to be around the feasible region of quasicrystal formation. The feasible region is the set of composition values that form quasicrystals based on the data of [16]. Outside of this region, quasicrystals do not form, so we consider it a failure region, as the material property could not be probed with methods designed for quasicrystals. For the objective function, we use an empirical formula for the phonon thermal conductivity extracted from [46]. The details of the setting are described in Appendix H.2.

⁵The original paper [29] considers a 6D problem, but this setting leads to an excessively large failure region. We modify the setting by using a 3D version of the Hartmann function with a unit hypersphere constraint.

In Fig.
3, the right plot shows the result of the numerical experiments for quasicrystals. F-GP-UCB performs the best among the compared algorithms.

7 Conclusions

In this paper, we propose a novel GP-based optimization algorithm for settings with unknown failure regions in the search space. Our algorithm only requires the very mild assumption on the failure function that the optimal solution lies in the interior of a feasible region. We show that our algorithm converges with high probability, and we provide the first regret upper bound by appropriately adjusting the search space. Its effectiveness is verified empirically through numerical experiments, including a simulation experiment motivated by materials research on quasicrystals.

Acknowledgements

This work was partially supported by the RIKEN Center for Advanced Intelligence Project.

References

[1] Yasin Abbasi-Yadkori. Online learning for linearly parametrized control problems. PhD thesis, University of Alberta, 2013.

[2] François Bachoc, Céline Helbert, and Victor Picheny. Gaussian process optimization with failures: classification and convergence proof. Journal of Global Optimization, 78(3):483–506, 2020.

[3] Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G Wilson, and Eytan Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Advances in Neural Information Processing Systems, 33:21524–21538, 2020.

[4] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, 2011.

[5] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. Machine Learning, pages 1–35, 2021.

[6] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes.
Advances in Neural Information Processing Systems, 31, 2018.

[7] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pages 1227–1237. PMLR, 2021.

[8] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 844–853, 2017.

[9] Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(119):4053–4103, 2014.

[10] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. Bayesian optimization with inequality constraints. Proceedings of the 31st International Conference on Machine Learning, 32(2):937–945, 22–24 Jun 2014.

[11] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI '14, pages 250–259, Arlington, Virginia, USA, 2014. AUAI Press. ISBN 9780974903910.

[12] Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2):141–149, 1992.

[13] Susana Gomez and Alejandro Velasco Levy. The tunnelling method for solving the constrained global optimization problem with several non-connected feasible regions. In Numerical Analysis, pages 34–47. Springer, 1982.

[14] Javier González, Joseph Longworth, David C James, and Neil D Lawrence. Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627, 2015.

[15] GPy. GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.

[16] Benjamin Grushko and SB Mi. Al-rich region of Al-Cu-Mn. Journal of Alloys and Compounds, 688:957–963, 2016.

[17] PC Haarhoff and JD Buys.
A new method for the optimization of a nonlinear function subject to nonlinear constraints. The Computer Journal, 13(2):178–184, 1970.

[18] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(57):1809–1837, 2012.

[19] José Miguel Hernández-Lobato, Michael A. Gelbart, Ryan P. Adams, Matthew W. Hoffman, and Zoubin Ghahramani. A general framework for constrained Bayesian optimization using information-based search. Journal of Machine Learning Research, 17(1):5549–5601, 2016.

[20] Jörg Hoffmann and Sebastian Kupferschmid. A covering problem for hypercubes. In IJCAI, pages 1523–1524, 2005.

[21] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Mean-variance analysis in Bayesian optimization under uncertainty. In International Conference on Artificial Intelligence and Statistics, pages 973–981. PMLR, 2021.

[22] Steven G. Johnson. The NLopt nonlinear-optimization package, 2011. URL http://ab-initio.mit.edu/nlopt.

[23] Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.

[24] Kirthevasan Kandasamy, Jeff Schneider, and Barnabas Poczos. High dimensional Bayesian optimisation and bandits via additive models. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 295–304, Lille, France, 07–09 Jul 2015.

[25] Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, and Barnabas Poczos. Multi-fidelity Gaussian process bandit optimisation. Journal of Artificial Intelligence Research, 66:151–196, 2019.

[26] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 2174–2184. PMLR, 2020.
[27] Shunya Kusakawa, Shion Takeno, Yu Inatsu, Kentaro Kutsukake, Shogo Iwazaki, Takashi Nakano, Toru Ujihara, Masayuki Karasuyama, and Ichiro Takeuchi. Bayesian optimization for cascade-type multistage processes. Neural Computation, 34(12):2408–2431, 11 2022. ISSN 0899-7667. doi: 10.1162/neco_a_01550.

[28] Madison Lee, Shubhanshu Shekhar, and Tara Javidi. Multi-scale zero-order optimization of smooth functions in an RKHS. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 288–293. IEEE, 2022.

[29] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 14(2):495–519, 2019.

[30] Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92–107. PMLR, 2022.

[31] Zihan Li and Jonathan Scarlett. Regret bounds for noise-free cascaded kernelized bandits. CoRR, abs/2211.05430, 2022.

[32] David V Lindberg and Herbert KH Lee. Optimization under constraints by applying an asymmetric entropy measure. Journal of Computational and Graphical Statistics, 24(2):379–393, 2015.

[33] Chang Liu, Erina Fujita, Yukari Katsura, Yuki Inada, Asuka Ishikawa, Ryuji Tamura, Kaoru Kimura, and Ryo Yoshida. Machine learning to predict quasicrystals from chemical compositions. Advanced Materials, 33(36):2102507, 2021.

[34] Daniel J Lizotte, Tao Wang, Michael H Bowling, Dale Schuurmans, et al. Automatic gait optimization with Gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.

[35] Ruben Martinez-Cantin, Nando de Freitas, Arnaud Doucet, and José A Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems, volume 3, pages 321–328, 2007.

[36] Thomas P Minka. Expectation propagation for approximate Bayesian inference.
In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369, 2001.

[37] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404, 1975.

[38] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 11 2005. ISBN 9780262256834. doi: 10.7551/mitpress/3206.001.0001.

[39] Matthieu Sacher, Régis Duvigneau, Olivier Le Maitre, Mathieu Durand, Elisa Berrini, Frédéric Hauville, and Jacques-André Astolfi. A classification approach to efficient global optimization in presence of non-computable domains. Structural and Multidisciplinary Optimization, 58(4):1537–1557, 2018.

[40] Sudeep Salgia, Sattar Vakili, and Qing Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. Advances in Neural Information Processing Systems, 34:28836–28847, 2021.

[41] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, 2012.

[42] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1015–1022, 2010.

[43] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 997–1005, Lille, France, 07–09 Jul 2015.

[44] Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. Stagewise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, pages 4781–4789. PMLR, 2018.

[45] Krister Svanberg.
A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM Journal on Optimization, 12(2):555–573, 2002.

[46] Yoshiki Takagiwa, Ryota Maeda, Satoshi Ohhashi, and An-Pang Tsai. Reduction of thermal conductivity for icosahedral Al-Cu-Fe quasicrystal through heavy element substitution. Materials, 14(18):5238, 2021.

[47] Shion Takeno, Tomoyuki Tamura, Kazuki Shitara, and Masayuki Karasuyama. Sequential and parallel constrained max-value entropy search via information lower bound. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20960–20986, 17–23 Jul 2022.

[48] Sattar Vakili, Nacime Bouziani, Sepehr Jalali, Alberto Bernacchia, and Da-shan Shiu. Optimal order simple regret for Gaussian process bandits. Advances in Neural Information Processing Systems, 34:21202–21215, 2021.

[49] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82–90. PMLR, 2021.

[50] Sattar Vakili, Jonathan Scarlett, and Tara Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Conference on Learning Theory, pages 4647–4652. PMLR, 2021.

[51] Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 654–663, 2013.

[52] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[53] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization.
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3627–3635, 06–11 Aug 2017.

[54] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.